Installing Grid Engine for batch jobs


Nowadays, one can have a desktop PC with 16 cores. A batch job manager is useful for loading up your machine with overnight jobs and for running parallel jobs. CPSeis was written to work with PBS, but SGE can be adapted, as it uses the same qsub command to submit jobs. This page describes a minimal setup of SGE, with everything (master, exec and submit) on one machine. If you want all the features, installation is quite complicated. Keeping it simple also means that it is more likely to keep working into the future.

The original Sun Grid Engine 6.2u5 will work in CentOS 6, but will not install in anything else currently supported. Open Grid Scheduler 2011 or Son of Grid Engine 8.1 can be used in CentOS 7. These will probably not install in any distro released since 2017. Instead, go to GitHub and find grisu48 gridengine, which has patches to work in newer Linux distributions. It will build in SUSE 15. However, for Debian 10 or Fedora 29, some hacking is needed.

SGE has many prerequisites. If you have installed CPSeis, you would already have the JDK and the X11 and Motif libraries. You should install the following items:
tcsh
munge
jemalloc-devel
ncurses-devel
pam-devel
hwloc-devel
freetype
texinfo
mailx
In Debian, some of the names are different. You need libpam-dev, libhwloc-dev and libjemalloc-dev.
In Fedora 28-29, you must also install libtirpc-devel.
Make sure the host name reported by the "hostname" command is listed in the /etc/hosts file. It cannot be localhost or map to an IP address starting with 127.
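The /etc/hosts requirement can be checked before starting the install. The following is only a minimal sketch and not part of SGE: check_sge_host is a hypothetical helper name, it reads the hosts file only (not DNS), and it judges the name by the first line that lists it.

```shell
# Hypothetical helper (not an SGE tool): succeed only if NAME is listed in
# HOSTS_FILE and the first line that lists it has a non-loopback address.
check_sge_host() {
    name=$1
    hosts_file=${2:-/etc/hosts}
    # SGE cannot use "localhost" as the host name.
    [ "$name" = "localhost" ] && return 1
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }               # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # ignore trailing comments
                if ($i == name) {
                    seen = 1
                    ok = ($1 !~ /^127\./)       # reject 127.x.x.x addresses
                    exit
                }
            }
        }
        END { if (seen && ok) exit 0; exit 1 }
    ' "$hosts_file"
}
```

A typical call would be: check_sge_host "$(hostname)" || echo "fix /etc/hosts first"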
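In your .bashrc define (and export, so that child processes such as the build and install scripts can see them):
export SGE_ROOT=/opt/sge
export SGE_CELL=default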

Open firewall ports 6444-6445/tcp
Unzip the Grid Engine file.
For Fedora 28-29, some extra steps are required to use tirpc instead of the old rpc. An extra variable is needed in .bashrc:
export SGE_INPUT_LDFLAGS='-ltirpc'
Then go to the /usr/include/tirpc directory and
sudo cp netconfig.h ..
Then you must edit the SGE source files: pack.c, wingrid.c and wingrid.h
Find the includes beginning with:
#include <rpc
and change those to
#include <tirpc/rpc
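The include change can also be scripted rather than done by hand. This is a sketch under assumptions: patch_rpc_includes is a hypothetical helper name, GNU sed is assumed for the -i option, and only includes of the form <rpc/...> are rewritten. A .bak copy of each file is kept, and running it twice does no harm.

```shell
# Hypothetical helper: rewrite "#include <rpc/..." to "#include <tirpc/rpc/..."
# in the given files, keeping a .bak copy of each (GNU sed assumed).
patch_rpc_includes() {
    sed -i.bak 's|^\([[:space:]]*#[[:space:]]*include[[:space:]]*\)<rpc/|\1<tirpc/rpc/|' "$@"
}

# Example use from the top of the SGE source tree, locating the three files
# with find wherever they sit:
#   patch_rpc_includes $(find . -name pack.c -o -name wingrid.c -o -name wingrid.h)
```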
Now to compile SGE, go to the source directory. Perform the following as an unprivileged user:
./aimk -no-java -no-jni -no-secure -spool-classic -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic depend
./aimk -no-java -no-jni -no-secure -spool-classic
For Debian 10 or Fedora 29, one must add -no-qmake to the aimk commands. qmake will not build on those distributions, but it is not needed for CPSeis.
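Since the same flags are repeated on every aimk line, the sequence can be wrapped in a small function. This is only a sketch: build_sge, run, DRY_RUN and EXTRA_FLAGS are hypothetical names introduced here, and where -no-qmake goes among the other options is an assumption. It must be run from the SGE source directory as an unprivileged user.

```shell
# Print a command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the aimk sequence, stopping at the first step that fails.
# EXTRA_FLAGS is a hypothetical variable; set it to -no-qmake on
# Debian 10 or Fedora 29.
build_sge() {
    flags="-no-java -no-jni -no-secure -spool-classic ${EXTRA_FLAGS:-}"
    # $flags is left unquoted on purpose so it splits into separate options.
    run ./aimk $flags -only-depend &&
    run ./scripts/zerodepend &&
    run ./aimk $flags depend &&
    run ./aimk $flags
}
```

Running DRY_RUN=1 build_sge prints the four commands without executing them, which is a cheap way to check the flags before a long build.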
Then the following must be done as root:
scripts/distinst -all -local -noexit
cd $SGE_ROOT
./inst_sge -m -nobincheck
./inst_sge -x -nobincheck
mkdir -p $SGE_ROOT/$SGE_CELL/common/accounting
For the inst_sge commands, one can hit Return for almost all the questions. Then add $SGE_ROOT/bin/lx_amd64 to your PATH in .bashrc and reboot. Check that sge_execd and sge_qmaster are running.
After it is installed, create a queue called "batch" with qconf
qconf -sq >qconf.sq
This creates a template file. Edit it and change the qname to batch. Change hostlist to your hostname. Then change slots to the number of CPU cores. Change shell to /bin/bash.
Then load that edited file with this command:
sudo qconf -Aq qconf.sq
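The four template edits can be made without opening an editor. This is a sketch, assuming the template uses one "key value" pair per line for these four attributes; set_sge_attr and make_batch_queue are hypothetical helper names, not qconf features.

```shell
# Hypothetical helper: replace the value on the "key value" line of an SGE
# configuration file, leaving all other lines untouched.
set_sge_attr() {
    file=$1
    key=$2
    value=$3
    awk -v k="$key" -v v="$value" '
        $1 == k { printf "%-21s %s\n", k, v; next }
        { print }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Apply the qname, hostlist, slots and shell edits to a queue template file.
# nproc counts logical CPUs, so pass a core count explicitly if that differs.
make_batch_queue() {
    qfile=$1
    qhost=${2:-$(hostname)}
    qslots=${3:-$(nproc)}
    set_sge_attr "$qfile" qname batch &&
    set_sge_attr "$qfile" hostlist "$qhost" &&
    set_sge_attr "$qfile" slots "$qslots" &&
    set_sge_attr "$qfile" shell /bin/bash
}
```

With these defined, the sequence would be qconf -sq >qconf.sq, then make_batch_queue qconf.sq, then sudo qconf -Aq qconf.sq. It is worth reading the edited file before loading it.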
You need to make your machine an execution host. Execute this command, followed by your hostname:
sudo qconf -ae
For multi-core jobs, you need to create a parallel environment called "mpich2" or "openmpi" or "mpich3", whichever you built CPSeis with.
sudo qconf -Ap qconf.pe
where qconf.pe looks like this (apart from number of slots):
pe_name            mpich2
slots              12
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
The template in /opt/sge/mpi is not suitable, as it lacks the qsort_args line.
Then add this parallel environment to the batch queue. Edit the qconf.sq file, find the line "pe_list" and add mpich2 (or whatever MPI you chose). Modify the batch queue with this command:
sudo qconf -Mq qconf.sq
Then run the qmon program as root. Click the queue control button. You should see a queue called batch. Click that, then modify. Select the Parallel Environment tab. On the available side, you will see mpich2 (or whatever you created). Select it and click the right arrow, so it goes to the Referenced PEs list. Finally click the OK button to save the settings.

One can also set the environment variable EDITOR to nedit, gedit, kate or whatever you prefer. SGE will invoke this editor to modify configuration files.

SGE will run jobs from CPSeis cfe. However, the jobs contain PBS directives, so they will not be scheduled properly. SGE will also create lots of stdout and stderr files in your home directory, which are not useful. I have modified buildjob.f90 to fix this by inserting SGE directives into the job instead. Download the modified buildjob.f90, copy it into the cpseis/src directory and touch it. Then make cfebld in your platform directory. To enable SGE, edit the cps_config.dat file. Find the line
cps_pbs_type = pbs
and change pbs to SGE.
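The same change can be made from the command line. This is a sketch, assuming GNU sed and that the line has exactly the form shown; enable_sge_in_cps_config is a hypothetical helper name. A .bak copy of the file is kept.

```shell
# Hypothetical helper: switch cps_pbs_type from pbs to SGE, keeping a .bak copy
# (GNU sed assumed). Lines with any other value are left alone.
enable_sge_in_cps_config() {
    sed -i.bak 's/^\([[:space:]]*cps_pbs_type[[:space:]]*=[[:space:]]*\)pbs[[:space:]]*$/\1SGE/' "$1"
}

# Example: enable_sge_in_cps_config cps_config.dat
```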
Other CPSeis configuration files needed to run batch jobs (on one machine):
compiler_nodes.dat should contain "localhost"
qserver_nodes.dat should contain "localhost batch"

You should also define the environment variable PBS_NODEFILE to point to a file containing a list of nodes. Although SGE uses PE_HOSTFILE for this, the job script built by CPSeis uses PBS_NODEFILE. The nodes file should contain your hostname, repeated once per core (one per line). So for 12 cores, it is 12 lines, each containing the hostname.
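The nodes file is easy to generate. This is a minimal sketch: make_nodes_file is a hypothetical helper name, and the file location used in the example is an arbitrary choice.

```shell
# Hypothetical helper: write HOST to OUT once per core, one name per line.
# Defaults: core count from nproc (logical CPUs), host from the hostname command.
make_nodes_file() {
    out=$1
    cores=${2:-$(nproc)}
    host=${3:-$(hostname)}
    i=0
    : > "$out"                      # create or truncate the file
    while [ "$i" -lt "$cores" ]; do
        echo "$host" >> "$out"
        i=$((i + 1))
    done
}

# Example for a 12-core machine (the path is an assumption):
#   make_nodes_file "$HOME/sge_nodes" 12
#   export PBS_NODEFILE=$HOME/sge_nodes     # put this line in .bashrc
```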

*** buildjob.f90 can be downloaded here: buildjob.tar