waiting-for-jobs: add new guide

Add a new guide on how to wait for jobs to complete.
flux-framework · Mar 28, 2023 · 720ab1c
1 parent 93dab21
Showing 1 changed file with 216 additions and 0 deletions.
diff --git a/jobs/waiting-for-jobs.rst b/jobs/waiting-for-jobs.rst
@@ -0,0 +1,216 @@
+.. _waiting-for-jobs-to-finish:
+
+==========================
+Waiting For Jobs To Finish
+==========================
+
+There are several ways to wait for submitted jobs to complete, each with its own pros and cons.  This guide covers the following techniques:
+
+- ``--wait``
+- ``flux job status``
+- ``flux job wait``
+- ``flux queue drain``
+
+-----------------
+The --wait option
+-----------------
+
+The most basic way to wait for a submitted job to complete is the ``--wait`` option on ``flux submit``.  Simply put, if the ``--wait`` option is passed to ``flux submit``, the command will not return until the job has completed.
+
+.. code-block:: console
+
+    $ flux submit --wait -n1 bash -c "sleep 30; /bin/false"
+    ƒEMds3VemM
+    <wait for command to finish>
+    $ echo $?
+    1
+
+The above command submits a job with one task (``-n1``) that simply sleeps for 30 seconds and then runs ``/bin/false``.  The :ref:`jobid <fluid>` is printed immediately, but the command won't return until the 30 second job has completed.
+
+After the command has finished, we print the exit code from ``flux submit``.  You'll notice the exit code is ``1``.  This is the final exit code of the job, which in this case is ``1`` because we ran ``/bin/false``.
+
+---------------
+Flux Job Status
+---------------
+
+In most cases, you do not want to sit and wait for the current job submission to complete.  You would like to do other things, such as submit more jobs, and then wait for those specific jobs to complete.
+
+The ``flux job status`` command is the most basic way to wait for a specific job, identified by jobid, to complete.  Pass it one or more jobids to wait on, and ``flux job status`` will return once all of the jobs have completed.  It will exit with the largest exit code from any of the jobids specified.  If the job(s) have already completed, ``flux job status`` returns immediately.  It can be run as many times as the user would like against the same jobid.
+
+Here are several examples.  In this first one, we submit a simple job that sleeps for 30 seconds then runs ``/bin/true``.  Afterwards, we pass the jobid to ``flux job status`` and wait for it to return when the job has finished.  After it has completed we can see that the exit code from ``flux job status`` is ``0``, as we expect from ``/bin/true``.
+
+.. code-block:: console
+
+    $ flux submit -n1 bash -c "sleep 30; /bin/true"
+    ƒLUebmCK
+    $ flux job status ƒLUebmCK
+    <we wait a little bit waiting for the job to finish>
+    $ echo $?
+    0
+
+Next, we run the same job with ``/bin/false`` instead of ``/bin/true``.
+
+.. code-block:: console
+
+    $ flux submit -n1 bash -c "sleep 30; /bin/false"
+    ƒeGz9fYs
+    $ flux job status ƒeGz9fYs
+    <we wait a little bit waiting for the job to finish>
+    $ echo $?
+    1
+
+The result is identical to the first example except the exit code from ``flux job status`` is ``1``, which is what we expect from running ``/bin/false``.
+
+Finally, let's pass both jobids from above to ``flux job status``.
+
+.. code-block:: console
+
+    $ flux job status ƒLUebmCK ƒeGz9fYs
+    $ echo $?
+    1
+
+You'll notice two things about this example.  First, the command returns immediately.  This is because the two jobs have already completed.  Second, the exit code is ``1``, which is the largest exit code of the two jobs passed to ``flux job status`` (one ran ``/bin/true`` and the other ran ``/bin/false``).
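+
+In a script, the same pattern can be used by capturing the jobid that ``flux submit`` prints.  The following is a minimal sketch rather than a definitive recipe.  ``wait_for_pair`` is a hypothetical helper name, and the sketch assumes ``flux submit`` writes only the jobid to standard output, as in the examples above.
+
+.. code-block:: sh
+
+   # wait_for_pair is a hypothetical helper used only for illustration.
+   wait_for_pair() {
+       local id1 id2
+       # Capture the jobid printed by each submission.
+       id1=$(flux submit -n1 bash -c "sleep 30; /bin/true") || return
+       id2=$(flux submit -n1 bash -c "sleep 30; /bin/false") || return
+       # Other work, such as more submissions, could go here.
+       # Block until both jobs finish; the exit code is the largest of the two.
+       flux job status "$id1" "$id2"
+   }
+
+Since one of the two jobs runs ``/bin/false``, calling ``wait_for_pair`` should return a nonzero exit code once both jobs have finished.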
+
+-------------
+Flux Job Wait
+-------------
+
+``flux job wait`` behaves similarly to ``flux job status``, but it differs in a few ways that bring their own pros and cons.
+
+The most notable difference is that in order to use ``flux job wait``, jobs must be submitted with the ``waitable`` flag.  ``flux job wait`` will not work on any job submitted without it.  In addition, the ``waitable`` flag can only be used in user Flux instances (i.e. non-system instances).  User Flux instances are usually started via ``flux alloc`` or ``flux batch``.
+
+Here's a simple example of using ``flux job wait``.  It's very similar to the example from before, where we sleep for 30 seconds and then run ``/bin/true``.
+
+.. code-block:: console
+
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/true"
+    ƒ4btMovw
+    $ flux job wait ƒ4btMovw
+    <we wait a little bit waiting for the job to finish>
+    $ echo $?
+    0
+
+Note that when submitting the job, we submitted it with the ``waitable`` flag via ``--flags waitable``.
+
+This doesn't really show us anything special; it seems to be the same as ``flux job status``.  Let's now look at the major advantages of ``flux job wait``.
+
+Perhaps the biggest advantage of ``flux job wait`` is that a priori knowledge of jobids is not necessary.  If ``flux job wait`` is run without any jobids, it will wait for the first job to complete among all of the jobs you have submitted with the ``waitable`` flag.
+
+.. code-block:: console
+
+    $ flux submit --flags waitable -n1 bash -c "sleep 60; /bin/true"
+    ƒ2WxyXSUF
+    $ flux submit --flags waitable -n1 bash -c "sleep 45; /bin/true"
+    ƒ2XRcLY7Z
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/true"
+    ƒ2Zjt9VSw
+    $ flux job wait
+    ƒ2Zjt9VSw
+    $ flux job wait
+    ƒ2XRcLY7Z
+    $ flux job wait
+    ƒ2WxyXSUF
+    $ flux job wait
+    flux-job: there are no more waitable jobs
+
+In the above example, we submit three jobs, sleeping for 60, 45, and 30 seconds respectively before running ``/bin/true``.  We then run ``flux job wait`` without any arguments.  You'll notice the jobids for the ``sleep 30`` job, then the ``sleep 45`` job, then the ``sleep 60`` job are returned in that order.  Finally, with no waitable jobs left, ``flux job wait`` indicates there are no more waitable jobs.
+
+Alternatively, you can wait on all waitable jobs with the ``--all`` option to ``flux job wait``.  Let's try that in the example below.
+
+.. code-block:: console
+
+    $ flux submit --flags waitable -n1 bash -c "sleep 60; /bin/true"
+    ƒ4YNPpFmAf
+    $ flux submit --flags waitable -n1 bash -c "sleep 45; /bin/true"
+    ƒ4YPufmCjq
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/false"
+    ƒ4YSVQWfZq
+    $ flux job wait --all --verbose
+    ƒ4YSVQWfZq: task(s) exited with exit code 1
+    ƒ4YPufmCjq: job completed successfully
+    ƒ4YNPpFmAf: job completed successfully
+    $ echo $?
+    1
+
+This example is similar to the above, except one of the jobs runs ``/bin/false`` instead of ``/bin/true``.  When ``flux job wait --all --verbose`` is executed, a message is printed for each job as it completes, including one indicating that the job that ran ``/bin/false`` failed.  Similar to ``flux job status``, the exit code is ``1`` because that is the highest exit code among all the jobs.
+
+The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that a job can only be waited on once.
+
+.. code-block:: console
+
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/true"
+    ƒBbk3qrdro
+    $ flux job wait ƒBbk3qrdro
+    $ flux job wait ƒBbk3qrdro
+    flux-job: invalid job id, or job may be inactive and not waitable
+
+Here we've submitted yet another sleep job and tried to wait on it twice with ``flux job wait``.  As you can see, an error is returned on the second attempt to wait on the job.
+
+You might be wondering: if you want to wait for a set of known jobids, is it better to use ``flux job status`` or ``flux job wait``?  Generally speaking, ``flux job wait`` is faster and more efficient than ``flux job status``.  The difference is most noticeable when using the ``--all`` option instead of passing a large list of jobids to ``flux job status``.
+
+In summary, here is a list of the pros and cons of using ``flux job wait`` instead of ``flux job status``.
+
+Pros:
+
+- ``flux job wait`` is more efficient when waiting for a set of jobs
+- Jobids do not need to be specified to ``flux job wait``
+- Easy to wait for all of your jobs to finish with the ``--all`` option
+
+Cons:
+
+- Jobs must be submitted with the ``waitable`` flag, which can only be used in user instances
+- ``flux job wait`` can only be used once per job
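+
+As a rough sketch of how these pieces fit together in a user instance, the helper below submits each script with the ``waitable`` flag and then waits for all of them.  The name ``submit_and_wait_all`` and the script names passed to it are hypothetical.
+
+.. code-block:: sh
+
+   # submit_and_wait_all is a hypothetical helper used only for illustration.
+   submit_and_wait_all() {
+       local script
+       for script in "$@"; do
+           # Each job must carry the waitable flag to work with flux job wait.
+           flux submit --flags waitable -n1 "$script" || return
+       done
+       # Wait for every waitable job; the exit code is nonzero if any job failed.
+       flux job wait --all
+   }
+
+For example, ``submit_and_wait_all job1.sh job2.sh`` would return only after both jobs have completed.  Note that ``--all`` waits on every waitable job you have submitted, not just the ones submitted by this helper.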
+
+----------------
+Flux Queue Drain
+----------------
+
+The final technique for waiting for jobs is a bit of a special case.
+
+The command ``flux queue drain`` is commonly used by system administrators to wait for a system to become empty of jobs before performing system maintenance.  However, users may use it as well to wait until all their jobs have completed.  The nuance is that all jobs in the queue must be done, including other users' jobs.  Therefore, it is more commonly used in user instances of Flux than in system instances.
+
+Let's run a simple example on the command line.
+
+.. code-block:: console
+
+    $ flux jobs -A
+           JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
+    $ flux submit -n1 bash -c "sleep 30; /bin/true"
+    ƒCSeWdUNb1
+    $ flux submit -n1 bash -c "sleep 30; /bin/true"
+    ƒCSesPJKR1
+    $ flux queue drain
+
+First, this example runs ``flux jobs -A``, which shows the jobs of all users on the system.  There are none, so we don't see any output other than the output header.
+
+Next, we submit two sleep jobs and wait for them to complete by running ``flux queue drain``.  It's not so different from our use of ``flux job wait --all`` above, except we don't need the ``waitable`` flag to be set.  Also, the exit code from ``flux queue drain`` will not reflect the exit status of the jobs.
+
+Typically, a user instance has only a single job queue, since the instance belongs only to the user.  So it is common to create batch submission scripts like the following for ``flux batch``.
+
+.. code-block:: sh
+
+   flux submit -n1 job1.sh
+   flux submit -n1 job2.sh
+   flux submit -n1 job3.sh
+   ...
+   flux submit -n1 jobN.sh
+   flux queue drain
+
+In this example script, we are submitting a number of jobs, numbered ``job1.sh`` to ``jobN.sh``.  We would like the script to complete after all of the jobs have completed, so we simply add ``flux queue drain`` at the very end.
+
+One might wonder why you would use this technique instead of ``flux job wait --all``.  There are several potential reasons.
+
+- It is the most efficient way to wait for "all" your jobs to finish, since it does not involve any "processing" of any sort within Flux.  It simply waits for the queue to be empty and that's it.
+
+- ``flux job wait`` only works for a single user.  In special circumstances, you may wish to wait for multiple users' jobs to complete.  In those cases it would be beneficial to use ``flux queue drain``.
+
+In summary, here is a list of the pros and cons of using ``flux queue drain`` over ``flux job status`` or ``flux job wait``.
+
+Pros:
+
+- The most efficient way to wait for "all" your jobs to finish
+- Jobids do not need to be specified
+- No need for the ``waitable`` flag
+
+Cons:
+
+- Cannot tell which jobs have finished as they complete
+- Cannot get exit status of completed jobs