
How do the terms "job", "task", and "step" as used in the SLURM docs relate to each other?

AFAICT, a job may consist of multiple tasks, and it may also consist of multiple steps, but, assuming this is true, it's still not clear to me how tasks and steps relate.

It would be helpful to see an example showing the full complexity of jobs/tasks/steps.

kjo

1 Answer


A job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.

Jobs are typically created with the sbatch command, and steps are created with the srun command. Tasks are requested at the job level with --ntasks or --ntasks-per-node, or at the step level with --ntasks. CPUs are requested per task with --cpus-per-task. Note that jobs submitted with sbatch have one implicit step: the Bash script itself.

Assume the hypothetical job:

#SBATCH --nodes 8
#SBATCH --ntasks-per-node 8
# The job requests 64 tasks (and, with the default of one CPU
# per task, 64 CPUs) on 8 nodes.

# First step, with a sub-allocation of 8 tasks (one per node) to create a tmp dir.
# No need for more than one task per node, but it has to run on every node.
srun --nodes 8 --ntasks 8 mkdir -p /tmp/$USER/$SLURM_JOBID

# Second step with the full allocation (64 tasks) to run an MPI 
# program on some data to produce some output.
srun process.mpi <input.dat >output.txt

# Third step, with a sub-allocation of 48 tasks (because, for instance,
# that program does not scale as well) to post-process the output and
# extract meaningful information.
srun --ntasks 48 --nodes 6 --exclusive postprocess.mpi <output.txt >result.txt &

# Fourth step, with a sub-allocation on a single node (because maybe
# it is a multithreaded program that cannot use CPUs on distinct nodes)
# to compress the raw output. This step runs at the same time as
# the previous one thanks to the ampersand `&`.
OMP_NUM_THREADS=12 srun --ntasks 1 --cpus-per-task 12 --nodes 1 --exclusive compress output.txt &

wait
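The `&`/`wait` pattern used for the last two steps can be sketched in plain Bash, with `sleep` standing in for the `srun` steps (no Slurm needed, so this runs anywhere):

```shell
#!/usr/bin/env bash
# Two stand-in "steps" run concurrently in the background,
# like the postprocess and compress steps above.
sleep 1 & pid1=$!
sleep 1 & pid2=$!

# Block until both background steps have finished. Without this,
# the script would exit while they are still running; in a Slurm
# job, the job would then be considered done and any still-running
# steps would be killed.
wait "$pid1" "$pid2"
echo "both background steps finished"
```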

Four steps were created, so the accounting information for that job will have 5 lines: one per step, plus one for the Bash script itself.
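Those accounting lines can be inspected after the job with `sacct`; a minimal sketch, where the job ID 123456 is hypothetical and the exact columns depend on the `--format` option you pass:

```shell
# Show the per-step accounting for job 123456 (hypothetical ID):
# the batch script appears as 123456.batch and the srun steps as
# 123456.0, 123456.1, and so on.
sacct -j 123456 --format=JobID,JobName,NTasks,Elapsed,State
```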

damienfrancois
  • Thank you, that's illuminating. Q: Would it make sense to change `output.txt` to `/tmp/$USER/$SLURM_JOBID/output.txt` throughout your example? – kjo Oct 02 '17 at 21:07
  • One more Q: why the `wait` at the end? I could understand the `wait` if there were subsequent steps after it that needed all previous steps to finish before proceeding, but I don't understand its purpose if it's the last command in the script. To put it differently, what would happen if the `wait` command at the end were omitted? – kjo Oct 02 '17 at 21:10
  • Not necessarily; the temporary dir is more for intermediate data on each node, while the output is a single file, typically on a shared network filesystem. But it should be removed at the end of the script. – damienfrancois Oct 03 '17 at 08:23
  • The `wait` command is there to ensure both `srun` commands that are sent to the background with the `&` sign (steps 3 and 4) are finished before the job is considered done and terminated. If it were not there, the script would terminate before those steps did; the job would be considered done by Slurm and all still-running steps would be killed by Slurm. – damienfrancois Oct 03 '17 at 08:25
  • Wow! The bit with `wait` makes a lot of sense once you explain it, but I would have not guessed it in a million years. Thanks! – kjo Oct 03 '17 at 11:26
  • Regarding `>output.txt`: if all tasks write to the same file on a shared network, wouldn't their outputs clobber each other? Wouldn't one *at least* want to use `>>output.txt`? – kjo Oct 03 '17 at 11:29
  • All tasks do not necessarily do the exact same thing. In an MPI program, the tasks do different things based on their 'rank'. Typically, writing the final output is the duty of only the task with rank 0, the 'master'. – damienfrancois Oct 03 '17 at 13:58
  • Thanks for that clarification. My Unix intuition is still confused. Unfortunately, I don't know how to explain my confusion succinctly enough for a comment, so I've posted a new (and unavoidably long-winded) question devoted to it: https://stackoverflow.com/questions/46574606/on-the-semantics-of-srun-output-file-for-parallel-tasks – kjo Oct 04 '17 at 21:41
  • Is a `task` the same as a `process`? – nn0p Jul 26 '18 at 06:15
  • @nn0p Short answer: yes. Long answer: a task is a Slurm allocation unit and a process is a running Linux process; processes are created by the commands in the submission script and mapped to the CPUs associated with the tasks allocated to the job. – damienfrancois Jul 30 '18 at 06:59
  • Thanks for the explanation. Where could I find more details about how Slurm defines a `task` or `job`? – nn0p Jul 30 '18 at 12:13