|
| 1 | +Launching with Grid Engine |
| 2 | +========================== |
| 3 | + |
| 4 | +PRRTE supports the family of run-time schedulers including the Sun |
| 5 | +Grid Engine (SGE), Oracle Grid Engine (OGE), Grid Engine (GE), Son of |
| 6 | +Grid Engine, Open Cluster Scheduler (OCS), Gridware Cluster Scheduler (GCS) |
| 7 | +and others. |
| 8 | + |
| 9 | +This documentation will collectively refer to all of them as "Grid |
| 10 | +Engine", unless referring to a specific flavor of the Grid Engine |
| 11 | +family. |
| 12 | + |
| 13 | +Verify Grid Engine support |
| 14 | +-------------------------- |
| 15 | + |
| 16 | +.. important:: To build Grid Engine support in PRRTE, you will need |
| 17 | + to explicitly request the SGE support with the ``--with-sge`` |
| 18 | + command line switch to PRRTE's ``configure`` script. |
| 19 | + |
| 20 | +To verify if support for Grid Engine is configured into your PRRTE |
| 21 | +installation, run ``prte_info`` as shown below and look for |
| 22 | +``gridengine``. |
| 23 | + |
| 24 | +.. code-block:: |
| 25 | +
|
| 26 | + shell$ prte_info | grep gridengine |
| 27 | + MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3) |
| 28 | +
|
| 29 | +
|
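For scripted checks, the same test can be wrapped in a small shell function. This is only a minimal sketch: the function name ``has_gridengine_support`` is hypothetical, and it merely looks for the ``MCA ras: gridengine`` line shown above in whatever text it reads on standard input.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin (for example the
# output of prte_info) lists the gridengine RAS component.
has_gridengine_support() {
    grep -q 'MCA ras:[[:space:]]*gridengine'
}

# Example using captured output instead of a live prte_info call.
sample='MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)'
if printf '%s\n' "$sample" | has_gridengine_support; then
    status="found"
else
    status="missing"
fi
echo "Grid Engine support: $status"
```

In a real installation, the function would be fed directly, as in ``prte_info | has_gridengine_support``.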
| 30 | +Launching |
| 31 | +--------- |
| 32 | + |
| 33 | +When Grid Engine support is included, PRRTE will automatically |
| 34 | +detect when it is running inside SGE and will just "do the Right |
| 35 | +Thing." |
| 36 | + |
| 37 | +Specifically, if you execute a ``prterun`` command in a Grid Engine |
| 38 | +job, it will automatically use the Grid Engine mechanisms to launch |
| 39 | +and kill processes. There is no need to specify what nodes to run on |
| 40 | +|mdash| PRRTE will obtain this information directly from Grid |
| 41 | +Engine and default to a number of processes equal to the slot count |
| 42 | +specified. For example, this will run 4 application processes on the nodes |
| 43 | +that were allocated by Grid Engine: |
| 44 | + |
| 45 | +.. code-block:: sh |
| 46 | +
|
| 47 | + # Get the environment variables for Grid Engine |
| 48 | +
|
| 49 | + # (Assuming Grid Engine is installed at /opt/sge and $SGE_CELL |
| 50 | + # is 'default' in your environment) |
| 51 | + shell$ . /opt/sge/default/common/settings.sh |
| 52 | +
|
| 53 | + # Allocate a Grid Engine interactive job with 4 slots from a |
| 54 | + # parallel environment (PE) named 'foo' and run a 4-process job |
| 55 | + shell$ qrsh -pe foo 4 -b y prterun -n 4 mpi-hello-world |
| 56 | +
|
| 57 | +There are also other ways to submit jobs under Grid Engine: |
| 58 | + |
| 59 | +.. code-block:: sh |
| 60 | +
|
| 61 | + # Submit a batch job with the 'prterun' command embedded in a script |
| 62 | + shell$ qsub -pe foo 4 my_prterun_job.csh |
| 63 | +
|
| 64 | + # Submit a Grid Engine job and run prterun in one line |
| 65 | + shell$ qrsh -V -pe foo 4 prterun hostname |
| 66 | +
|
| 67 | + # Use qstat(1) to show the status of Grid Engine jobs and queues |
| 68 | + shell$ qstat -f |
| 69 | +
|
| 70 | +Be sure that you have a Parallel Environment (PE) defined for |
| 71 | +submitting parallel jobs. You don't have to name your PE "foo"; it |
| 72 | +is only an example. A PE named "foo" might be configured like |
| 73 | +this: |
| 74 | + |
| 75 | +.. code-block:: |
| 76 | +
|
| 77 | + shell$ qconf -sp foo |
| 78 | + pe_name foo |
| 79 | + slots 99999 |
| 80 | + user_lists NONE |
| 81 | + xuser_lists NONE |
| 82 | + start_proc_args NONE |
| 83 | + stop_proc_args NONE |
| 84 | + allocation_rule $fill_up |
| 85 | + control_slaves TRUE |
| 86 | + job_is_first_task FALSE |
| 87 | + urgency_slots min |
| 88 | + accounting_summary FALSE |
| 89 | + qsort_args NONE |
| 90 | +
|
| 91 | +.. note:: ``qsort_args`` is necessary with the Son of Grid Engine |
| 92 | + distribution, version 8.1.1 and later, and probably only applicable |
| 93 | + to it. |
| 94 | + |
| 95 | +.. note:: For very old versions of Sun Grid Engine, omit |
| 96 | + ``accounting_summary`` too. |
| 97 | + |
| 98 | +.. note:: For Open Cluster Scheduler / Gridware Cluster Scheduler it is |
| 99 | + necessary to set ``ign_sreq_on_mhost`` (ignoring slave resource requests |
| 100 | + on the master node) to ``FALSE``. |
| 101 | + |
| 102 | +You may want to alter other parameters, but the important one is |
| 103 | +``control_slaves``, specifying that the environment has "tight |
| 104 | +integration". Note also the lack of a start or stop procedure. The |
| 105 | +tight integration means that ``prterun`` automatically picks up the slot |
| 106 | +count to use as a default in place of the ``-n`` argument, picks up a |
| 107 | +host file, spawns remote processes via ``qrsh`` so that Grid Engine |
| 108 | +can control and monitor them, and creates and destroys a per-job |
| 109 | +temporary directory (``$TMPDIR``), in which PRRTE's directory will |
| 110 | +be created (by default). |
| 111 | + |
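To illustrate what picking up the slot count and host file involves: Grid Engine exposes the allocation to a tightly integrated job through environment variables such as ``$NSLOTS`` and ``$PE_HOSTFILE``. The sketch below is not PRRTE's implementation. It is a shell illustration that sums the slots column of a PE host file, assuming the usual layout of one line per host with the host name, slot count, queue, and processor range. The function name ``count_pe_slots`` and the sample host names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: sum the slot counts (second column) of a
# Grid Engine PE host file. Assumed line layout:
#   <hostname> <slots> <queue> <processor range>
count_pe_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Build a sample host file; inside a real job, use "$PE_HOSTFILE" instead.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node01 2 all.q@node01 UNDEFINED
node02 2 all.q@node02 UNDEFINED
EOF

total_slots=$(count_pe_slots "$hostfile")
echo "slots allocated: $total_slots"
rm -f "$hostfile"
```

Inside a job script, comparing this sum with ``$NSLOTS`` is a quick sanity check that the PE granted the number of slots that was requested.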
| 112 | +Be sure the queue will make use of the PE that you specified: |
| 113 | + |
| 114 | +.. code-block:: |
| 115 | +
|
| 116 | + shell$ qconf -sq all.q |
| 117 | + [...snipped...] |
| 118 | + pe_list make cre foo |
| 119 | + [...snipped...] |
| 120 | +
|
| 121 | +To determine whether the Grid Engine parallel job is successfully |
| 122 | +launched to the remote nodes, you can pass in the MCA parameter |
| 123 | +``--prtemca plm_base_verbose 1`` to ``prterun``. |
| 124 | + |
| 125 | +This will add in a ``-verbose`` flag to the ``qrsh -inherit`` command |
| 126 | +that is used to send parallel tasks to the remote Grid Engine |
| 127 | +execution hosts. It will show whether the connections to the remote |
| 128 | +hosts are established successfully or not. |
| 129 | + |
| 130 | +Grid Engine documentation, with pointers to further material, used to be |
| 131 | +available at `the Son of GridEngine site <http://arc.liv.ac.uk/sge/>`_, and |
| 132 | +configuration instructions were found at `the Son of GridEngine |
| 133 | +configuration how-to site |
| 134 | +<http://arc.liv.ac.uk/SGE/howto/sge-configs.html>`_. These sites may no |
| 135 | +longer be available. |
| 136 | + |
| 137 | +An actively developed (2024, 2025) open source successor of Sun Grid Engine is |
| 138 | +`Open Cluster Scheduler <https://github.com/hpc-gridware/clusterscheduler>`_. |
| 139 | +It maintains backward compatibility with SGE and provides many new features. |
| 140 | +An MPI parallel environment setup for Open MPI is available in |
| 141 | +`the Open Cluster Scheduler GitHub repository |
| 142 | +<https://github.com/hpc-gridware/clusterscheduler/tree/master/source/dist/mpi/openmpi>`_. |
| 143 | + |
| 144 | +Grid Engine tight integration support of the ``qsub -notify`` flag |
| 145 | +------------------------------------------------------------------ |
| 146 | + |
| 147 | +If you are running SGE 6.2 Update 3 or later, then the ``-notify`` |
| 148 | +flag is supported. If you are running earlier versions, then the |
| 149 | +``-notify`` flag will not work and using it will cause the job to be |
| 150 | +killed. |
| 151 | + |
| 152 | +To use ``-notify``, one has to be careful. First, let us review what |
| 153 | +``-notify`` does. Here is an excerpt from the qsub man page for the |
| 154 | +``-notify`` flag. |
| 155 | + |
| 156 | + The ``-notify`` flag, when set causes Sun Grid Engine to send |
| 157 | + warning signals to a running job prior to sending the signals |
| 158 | + themselves. If a SIGSTOP is pending, the job will receive a SIGUSR1 |
| 159 | + several seconds before the SIGSTOP. If a SIGKILL is pending, the |
| 160 | + job will receive a SIGUSR2 several seconds before the SIGKILL. The |
| 161 | + amount of time delay is controlled by the notify parameter in each |
| 162 | + queue configuration. |
| 163 | + |
| 164 | +Let us assume the reason you want to use the ``-notify`` flag is to |
| 165 | +get the SIGUSR1 signal prior to getting the SIGTSTP signal. PRRTE forwards |
| 166 | +some signals by default, but others need to be specifically requested. |
| 167 | +The following MCA param controls this behavior: |
| 168 | + |
| 169 | +.. code-block:: |
| 170 | +
|
| 171 | + prte_ess_base_forward_signals: Comma-delimited list of additional signals (names or integers) to forward to |
| 172 | + application processes ["none" => forward nothing]. Signals provided by |
| 173 | + default include SIGTSTP, SIGUSR1, SIGUSR2, SIGABRT, SIGALRM, and SIGCONT |
| 174 | +
|
| 175 | +Within that constraint, something like this batch script can be used: |
| 176 | + |
| 177 | +.. code-block:: sh |
| 178 | +
|
| 179 | + #! /bin/bash |
| 180 | + #$ -S /bin/bash |
| 181 | + #$ -V |
| 182 | + #$ -cwd |
| 183 | + #$ -N Job1 |
| 184 | + #$ -pe foo 16 |
| 185 | + #$ -j y |
| 186 | + #$ -l h_rt=00:20:00 |
| 187 | + prterun -n 16 mpi-hello-world |
| 188 | +
|
| 189 | +However, one has to make one of two changes to this script for things |
| 190 | +to work properly. By default, a SIGUSR1 signal will kill a shell |
| 191 | +script. So we have to make sure that does not happen. Here is one way |
| 192 | +to handle it: |
| 193 | + |
| 194 | +.. code-block:: sh |
| 195 | +
|
| 196 | + #! /bin/bash |
| 197 | + #$ -S /bin/bash |
| 198 | + #$ -V |
| 199 | + #$ -cwd |
| 200 | + #$ -N Job1 |
| 201 | + #$ -pe foo 16 |
| 202 | + #$ -j y |
| 203 | + #$ -l h_rt=00:20:00 |
| 204 | + exec prterun -n 16 mpi-hello-world |
| 205 | +
|
| 206 | +Alternatively, one can catch the signals in the script instead of using |
| 207 | +``exec`` to launch ``prterun``: |
| 208 | + |
| 209 | +.. code-block:: sh |
| 210 | +
|
| 211 | + #! /bin/bash |
| 212 | + #$ -S /bin/bash |
| 213 | + #$ -V |
| 214 | + #$ -cwd |
| 215 | + #$ -N Job1 |
| 216 | + #$ -pe foo 16 |
| 217 | + #$ -j y |
| 218 | + #$ -l h_rt=00:20:00 |
| 219 | +
|
| 220 | + function sigusr1handler() |
| 221 | + { |
| 222 | + echo "SIGUSR1 caught by shell script" 1>&2 |
| 223 | + } |
| 224 | +
|
| 225 | + function sigusr2handler() |
| 226 | + { |
| 227 | + echo "SIGUSR2 caught by shell script" 1>&2 |
| 228 | + } |
| 229 | +
|
| 230 | + trap sigusr1handler SIGUSR1 |
| 231 | + trap sigusr2handler SIGUSR2 |
| 232 | +
|
| 233 | + prterun -n 16 mpi-hello-world |
| 234 | +
|
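The effect of ``trap`` can be checked without Grid Engine. The following stand-alone sketch (not part of the job script above) starts a child shell that installs a SIGUSR1 handler and then sends SIGUSR1 to itself. Because a handler is installed, the child keeps running instead of being terminated by the signal's default action.

```shell
#!/bin/sh
# Run a child shell that installs a SIGUSR1 handler and then signals itself.
# Without the trap, the default action of SIGUSR1 would terminate it
# before it could print "still running".
output=$(sh -c '
    trap "echo SIGUSR1 caught" USR1
    kill -USR1 $$
    echo "still running"
')
echo "$output"
```

Removing the ``trap`` line from the child shell should make the ``still running`` message disappear, which is the failure mode that the ``exec`` and ``trap`` variants of the batch script avoid.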
| 235 | +Grid Engine job suspend / resume support |
| 236 | +---------------------------------------- |
| 237 | + |
| 238 | +To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to |
| 239 | +``prterun``. ``prterun`` will catch this signal and forward it to the |
| 240 | +``mpi-hello-world`` processes as a SIGSTOP signal. To resume the job, you |
| 241 | +send a SIGCONT signal to ``prterun``, which will be caught and forwarded to |
| 242 | +the ``mpi-hello-world`` processes. |
| 243 | + |
| 244 | +Here is an example on Solaris: |
| 245 | + |
| 246 | +.. code-block:: sh |
| 247 | +
|
| 248 | + shell$ prterun -n 2 mpi-hello-world |
| 249 | +
|
| 250 | +In another window, we suspend and continue the job: |
| 251 | + |
| 252 | +.. code-block:: sh |
| 253 | +
|
| 254 | + shell$ prstat -p 15301,15303,15305 |
| 255 | + PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP |
| 256 | + 15305 rolfv 158M 22M cpu1 0 0 0:00:21 5.9% mpi-hello-world/1 |
| 257 | + 15303 rolfv 158M 22M cpu2 0 0 0:00:21 5.9% mpi-hello-world/1 |
| 258 | + 15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1 |
| 259 | +
|
| 260 | + shell$ kill -TSTP 15301 |
| 261 | + shell$ prstat -p 15301,15303,15305 |
| 262 | + PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP |
| 263 | + 15303 rolfv 158M 22M stop 30 0 0:01:44 21% mpi-hello-world/1 |
| 264 | + 15305 rolfv 158M 22M stop 20 0 0:01:44 21% mpi-hello-world/1 |
| 265 | + 15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1 |
| 266 | +
|
| 267 | + shell$ kill -CONT 15301 |
| 268 | + shell$ prstat -p 15301,15303,15305 |
| 269 | + PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP |
| 270 | + 15305 rolfv 158M 22M cpu1 0 0 0:02:06 17% mpi-hello-world/1 |
| 271 | + 15303 rolfv 158M 22M cpu3 0 0 0:02:06 17% mpi-hello-world/1 |
| 272 | + 15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1 |
| 273 | +
|
| 274 | +Note that all this does is stop the ``mpi-hello-world`` processes. It |
| 275 | +does not, for example, free any pinned memory when the job is in the |
| 276 | +suspended state. |
| 277 | + |
| 278 | +To get this to work under the Grid Engine environment, you have to |
| 279 | +change the ``suspend_method`` entry in the queue. It has to be set to |
| 280 | +SIGTSTP. Here is an example of what a queue should look like. |
| 281 | + |
| 282 | +.. code-block:: sh |
| 283 | +
|
| 284 | + shell$ qconf -sq all.q |
| 285 | + qname all.q |
| 286 | + [...snipped...] |
| 287 | + starter_method NONE |
| 288 | + suspend_method SIGTSTP |
| 289 | + resume_method NONE |
| 290 | +
|
| 291 | +Note that if you need to suspend other types of jobs with SIGSTOP |
| 292 | +(instead of SIGTSTP) in this queue, then you need to provide a script |
| 293 | +that sends the correct signal for each job type. |