Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C and using the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.


Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

1145 questions
10
votes
2 answers

How to activate a specific Python environment as part of my submission to Slurm?

I want to run a script on a cluster (SBATCH file). How can I activate my virtual environment (path/to/env_name/bin/activate)? Do I only need to add: module load python/2.7.14 source "/pathto/Python_directory/ENV2.7_new/bin/activate" in my_script.sh…
bib
  • 635
  • 1
  • 9
  • 26
10
votes
2 answers

SLURM sacct shows 'batch' and 'extern' job names

I have submitted a job to a SLURM queue, the job has run and completed. I then check the completed jobs using the sacct command. But looking at the results of the sacct command I notice additional results that I did not expect: JobID …
Parsa
  • 2,485
  • 2
  • 14
  • 30
10
votes
1 answer

Running a binary without a top level script in SLURM

In SGE/PBS, I can submit binary executables to the cluster just like I would locally. For example: qsub -b y -cwd echo hello would submit a job named echo, which writes the word "hello" to its output file. How can I submit a similar job to SLURM?…
highBandWidth
  • 14,815
  • 16
  • 74
  • 126
10
votes
1 answer

How to change how frequently SLURM updates the output file (stdout)?

I am using SLURM to dispatch jobs on a supercomputer. I have set the --output=log.out option to place the content from a job's stdout into a file (log.out). I'm finding that the file is updated every 30-60 minutes, making it difficult for me to…
Neal Kruis
  • 1,805
  • 2
  • 24
  • 45
9
votes
0 answers

Unable to setup slurmdbd plugin: Connection refused

Unable to setup slurmdbd plugin. The SLURM installation works fine Set AccountingStorageType=accounting_storage/slurmdbd in the /etc/slurm/slurm.conf When I do sacctmgr list cluster it gives: sacctmgr: error: slurm_persist_conn_open_without_init:…
Leander
  • 91
  • 1
  • 3
9
votes
1 answer

Installing/emulating SLURM on an Ubuntu 16.04 desktop: slurmd fails to start

Edit What I am really looking for is a way to emulate SLURM, something interactive and reasonably user-friendly that I can install. Original post I want to test drive some minimal examples with SLURM, and I am trying to install it all on a local…
landau
  • 4,594
  • 15
  • 36
9
votes
0 answers

Get stdout/stderr from a slurm job at runtime

I have a batch file to send a job with sbatch. The contents of the batch file is # Setting the proper SBATCH variables ... #SBATCH --error="test_slurm-%j.err" #SBATCH --output="test_slurm-%j.out" ... WORKDIR=. echo "Run…
9
votes
2 answers

Is it possible to run SLURM jobs in the background using SRUN instead of SBATCH?

I was trying to run slurm jobs with srun in the background. Unfortunately, right now, due to the fact that I have to run things through Docker, it's a bit annoying to use sbatch, so I am trying to find out if I can avoid it altogether. From my…
Charlie Parker
  • 13,522
  • 35
  • 118
  • 206
9
votes
3 answers

SLURM sbatch job array for the same script but with different input arguments run in parallel

I have a problem where I need to launch the same script but with different input arguments. Say I have a script myscript.py -p -i , where I need to consider N different par_values (between x0 and x1) and M trials for each value…
maurizio
  • 585
  • 6
  • 19
9
votes
1 answer

Sbatch: pass job name as input argument

I have the following script to submit job with slurm: #!/bin/sh #!/bin/bash #SBATCH -J $3 #job_name #SBATCH -n 1 #Number of processors #SBATCH -p CA nwchem $1 > $2 The first argument ($1) is my input, the second ($2) is my output and I would…
Laetis
  • 943
  • 2
  • 13
  • 22
9
votes
2 answers

How can I get detailed job run info from SLURM (e.g. like that produced for "standard output" by LSF)?

When using bsub with LSF, the -o option gave a lot of details such as when the job started and ended and how much memory and CPU time the job took. With SLURM, all I get is the same standard output that I'd get from running a script without LSF. For…
Christopher Bottoms
  • 10,220
  • 7
  • 44
  • 87
9
votes
1 answer

seq uses comma as decimal separator

I have noticed a strange seq behavior on one of my computers (Ubuntu LTS 14.04): instead of using points as decimal separator it is using commas: seq 0. 0.1 0.2 0,0 0,1 0,2 The same version of seq (8.21) on my other PC gives the normal points (also…
Miguel
  • 7,027
  • 1
  • 21
  • 40
8
votes
2 answers

How to configure the content of slurm notification emails?

Slurm can notify the user by email when certain types of events occur using options such as --mail-type and --mail-user. The emails I receive this way contain a void body and a title that looks like : SLURM Job_id=9228 Name=toto Ended, Run time…
Johann Bzh
  • 601
  • 2
  • 7
  • 20
8
votes
3 answers

How to get the ID of GPU allocated to a SLURM job on a multiple GPUs node?

When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU which is allocated for the job? Is there an environment variable for this purpose? The GPUs I'm using are all nvidia GPUs. Thanks.
Negelis
  • 327
  • 3
  • 14
8
votes
1 answer

Slurm server with a asterisk near the "idle"

I'm using Slurm. When I run sinfo -Nel it is common to see a server designated as idle, but sometimes there is also a little asterisk near it (Like this: idle*). What does that mean? I couldn't find any info about that. (The server is up and…
ZoRo
  • 351
  • 1
  • 4
  • 9