Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.

Name Spelling

As of v18.08, the spelling “SLURM” was changed to “Slurm” (commit 3d7ada78e).
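The three functions described in the tag wiki (allocation, execution, queueing) come together in a minimal batch script; the resource values and application name below are illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=example      # name shown in the queue
#SBATCH --nodes=2               # allocate two compute nodes
#SBATCH --ntasks=4              # run four tasks in total
#SBATCH --time=00:10:00         # wall-clock limit

# launch the (hypothetical) parallel application on the allocated nodes
srun ./myapp
```

Submitted with `sbatch job.sh`, the job waits in the queue until the requested nodes are free, then runs on them.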

1145 questions
19
votes
2 answers

Running TensorFlow on a Slurm Cluster?

I could get access to a computing cluster, specifically one node with two 12-core CPUs, which is running with Slurm Workload Manager. I would like to run TensorFlow on that system but unfortunately I was not able to find any information about how…
daniel451
  • 9,128
  • 15
  • 53
  • 115
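For a question like this, the usual starting point is a batch script that reserves the cores of one node; the core count, script name, and thread argument are placeholders:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24      # both 12-core CPUs of one node

# the job script can read the granted core count from Slurm's environment
# (how the training script consumes it is up to the script itself)
srun python train.py --num-threads="${SLURM_CPUS_PER_TASK}"
```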
17
votes
2 answers

Use Bash variable within SLURM sbatch script

I'm trying to obtain a value from another file and use this within a SLURM submission script. However, I get an error that the value is non-numerical, in other words, it is not being dereferenced. Here is the script: #!/bin/bash # This reads out the…
Madeleine P. Vincent
  • 2,627
  • 3
  • 19
  • 26
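A common cause here: `#SBATCH` lines are comments to bash, so shell variables are never expanded inside them. One usual fix, sketched with hypothetical `params.txt` and `job.sh` names, is to evaluate the value in the submitting shell and pass it in via the environment:

```shell
# read the value outside the job script, then hand it to sbatch;
# job.sh can then reference $NPROC like any environment variable
nproc=$(cat params.txt)
sbatch --export=ALL,NPROC="${nproc}" job.sh
```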
16
votes
2 answers

Error in SLURM cluster - Detected 1 oom-kill event(s): how to improve running jobs

I'm working in a SLURM cluster and I was running several processes at the same time (on several input files), and using the same bash script. At the end of the job, the process was killed and this is the error I obtained. slurmstepd: error: Detected…
CafféSospeso
  • 645
  • 2
  • 6
  • 20
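An oom-kill event usually means the job exceeded its memory request; the typical remedy is to request more memory explicitly (values illustrative):

```shell
#SBATCH --mem=8G            # memory per node
# or, alternatively, per allocated CPU:
#SBATCH --mem-per-cpu=2G
```

Afterwards, `sacct -j <jobid> --format=JobID,MaxRSS` shows how much memory the job actually used, which helps size the next request.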
15
votes
1 answer

Slurm: Why use srun inside sbatch?

In an sbatch script, you can directly launch programs or scripts (for example an executable file myapp) but in many tutorials people use srun myapp instead. Despite reading some documentation on the topic, I do not understand the difference and when…
RomualdM
  • 393
  • 4
  • 10
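The short version of the usual answer: a bare command runs once, on the first allocated node only, while srun launches it as a job step across the allocation and records it in accounting. A sketch:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2

hostname          # runs once, on the first node of the allocation
srun hostname     # runs as a job step, once per task, on both nodes
```

Each `srun` also shows up as a separate step (e.g. `.0`, `.1`) in `sacct`, which is why tutorials prefer it even for single commands.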
15
votes
1 answer

comment in bash script processed by slurm

I am using slurm on a cluster to run jobs and submit a script that looks like below with sbatch: #!/usr/bin/env bash #SBATCH -o slurm.sh.out #SBATCH -p defq #SBATCH --mail-type=ALL #SBATCH --mail-user=my.email@something.com echo "hello" Can I…
user1981275
  • 11,812
  • 5
  • 59
  • 90
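sbatch only treats lines that begin exactly with `#SBATCH` as directives, so adding a second `#` is the common way to comment one out; a sketch based on the script in the question:

```shell
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
##SBATCH -p defq            # extra '#' disables this directive
# ordinary comments are ignored as usual; note that #SBATCH lines are
# only parsed before the first non-comment, non-directive command
echo "hello"
```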
15
votes
2 answers

SLURM: How to run 30 jobs on particular nodes only?

You need to run, say, 30 srun jobs, but ensure each of the jobs is run on a node from the particular list of nodes (that have the same performance, to fairly compare timings). How would you do it? What I tried: srun --nodelist=machineN[0-3]…
Ayrat
  • 1,123
  • 1
  • 15
  • 30
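One wrinkle behind this question: `--nodelist` requires *every* listed node to be part of the allocation. To run on *some* node from a fixed set, the usual trick is to exclude the rest, or to use a node feature if the administrators defined one; node and feature names here are hypothetical:

```shell
# ask for one node, restricted to machine[0-3] by excluding the others
sbatch --nodes=1 --exclude=machine[4-9] job.sh

# cleaner, if the admins tagged the target nodes with a feature:
sbatch --nodes=1 --constraint=somefeature job.sh
```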
15
votes
1 answer

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production…
Daniel Goodman
  • 243
  • 2
  • 6
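The difference can be seen directly: salloc grants an allocation but leaves the shell on the submit host, and only srun moves execution onto the allocated node. A sketch:

```shell
salloc -N1           # get an allocation; spawns a shell locally
hostname             # prints the *login* node's name
srun hostname        # prints the allocated *compute* node's name
exit                 # release the allocation
```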
13
votes
4 answers

Limit the number of running jobs in SLURM

I am queuing multiple jobs in SLURM. Can I limit the number of parallel running jobs in slurm? Thanks in advance!
Philipp H.
  • 1,347
  • 2
  • 13
  • 29
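For jobs submitted as a job array, Slurm has a built-in throttle: a `%` suffix on the array range caps how many tasks run at once (counts illustrative):

```shell
#SBATCH --array=1-100%10    # 100 tasks, at most 10 running concurrently
```

For independent (non-array) jobs, a per-user running-job cap is normally something administrators enforce through QOS or association limits rather than something the submitting user can set.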
13
votes
3 answers

How to find from where a job is submitted in SLURM?

I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like [myUserName@rclogin06 ~]$ sacct -u myUserName JobID JobName …
Sibbs Gambling
  • 16,478
  • 33
  • 87
  • 161
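The submission directory is stored with each job, so it can be queried rather than guessed; a sketch, with `<jobid>` as a placeholder:

```shell
# for a pending or running job:
scontrol show job <jobid> | grep -E 'WorkDir|Command'

# or for all of a user's queued jobs at once (%Z = working directory):
squeue -u myUserName -o '%i %j %Z'
```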
12
votes
1 answer

Running slurm script with multiple nodes, launch job steps with 1 task

I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs, and each needs exactly one CPU. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when…
Nils_M
  • 972
  • 8
  • 24
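The usual pattern for packing many one-CPU steps into a single allocation is to background each srun and wait at the end; `./task` is a hypothetical program:

```shell
#!/bin/bash
#SBATCH --ntasks=8          # up to 8 steps run at any one time

for i in $(seq 1 100); do
  # each step claims exactly one task; --exclusive keeps steps from
  # sharing CPUs (on recent Slurm versions, --exact plays this role)
  srun --ntasks=1 --nodes=1 --exclusive ./task "$i" &
done
wait                        # don't let the batch script exit early
```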
11
votes
2 answers

Python - Log memory usage

Is there a way in python 3 to log the memory (ram) usage, while some program is running? Some background info. I run simulations on a hpc cluster using slurm, where I have to reserve some memory before submitting a job. I know that my job require a…
physicsGuy
  • 2,581
  • 3
  • 20
  • 31
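Slurm itself records per-step memory high-water marks, so this can often be answered without instrumenting the program; a sketch, with `<jobid>` as a placeholder:

```shell
# while the job is still running (batch step shown):
sstat -j <jobid>.batch --format=JobID,MaxRSS

# after it finishes, from the accounting database:
sacct -j <jobid> --format=JobID,MaxRSS,MaxVMSize
```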
11
votes
1 answer

Changing the bash script sent to sbatch in slurm during run a bad idea?

I wanted to run a python script main.py multiple times with different arguments through a sbatch_run.sh script as in: #!/bin/bash #SBATCH --job-name=sbatch_run #SBATCH --array=1-1000 #SBATCH --exclude=node047 arg1=10 #arg to be change during…
Charlie Parker
  • 13,522
  • 35
  • 118
  • 206
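Relevant background for this question: sbatch copies the batch script into Slurm's spool at submission time, so editing the file afterwards does not affect jobs that are already queued. A safer way to vary behaviour per run is to pass arguments instead of editing the script (values illustrative):

```shell
# submit the same script with different arguments; read "$1" inside it
sbatch sbatch_run.sh 10
sbatch sbatch_run.sh 20
# within an array job, $SLURM_ARRAY_TASK_ID can drive the variation instead
```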
11
votes
2 answers

SLURM display the stdout and stderr of an unfinished job

I used to use a server with LSF but now I just transitioned to one with SLURM. What is the equivalent command of bpeek (for LSF) in SLURM? bpeek bpeek Displays the stdout and stderr output of an unfinished job I couldn't find the documentation…
Dnaiel
  • 6,884
  • 21
  • 58
  • 113
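Slurm writes stdout/stderr to the job's output file while the job runs, so the closest equivalents to bpeek are tailing that file or attaching to a running step; `<jobid>` is a placeholder:

```shell
tail -f slurm-<jobid>.out   # default output file name pattern
sattach <jobid>.0           # attach to stdio of step 0 of a running job
```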
10
votes
3 answers

How to run code in a debugging session from VS code on a remote using an interactive session?

I am using a cluster (similar to slurm but using condor) and I wanted to run my code using VS code (its debugger specially) and it's remote sync extension. I tried running it using my debugger in VS code but it didn't quite work as expected. First…
Charlie Parker
  • 13,522
  • 35
  • 118
  • 206
10
votes
1 answer

Submit and monitor SLURM jobs using Apache Airflow

I am using the Slurm job scheduler to run my jobs on a cluster. What is the most efficient way to submit the Slurm jobs and check on their status using Apache Airflow? I was able to use a SSHOperator to submit my jobs remotely and check on their…
stardust
  • 141
  • 1
  • 9
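Whatever the orchestrator, the underlying shell commands it would wrap are the same: submit with a machine-readable job id, then poll; a sketch with a hypothetical `job.sh`:

```shell
jobid=$(sbatch --parsable job.sh)   # prints just the job id
squeue -j "$jobid" -h -o '%T'       # current state; empty once it leaves the queue
sacct -j "$jobid" -n -o State       # final state from accounting
```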