Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
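
As a rough illustration of those three functions, here is a minimal sketch that writes a small batch script to a file. The function name `write_job_script`, the job name, the node count, and the time limit are all made up for the example; only the `#SBATCH` options and `srun` are standard Slurm.

```bash
#!/bin/bash
# Sketch: write a minimal Slurm batch script to the given path.
# The job name, node count, and time limit are illustrative values.

write_job_script() {
    local path="$1"
    cat > "$path" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00

# srun starts the tasks on the nodes allocated to this job.
srun hostname
EOF
}
```

After `write_job_script hello.sh`, running `sbatch hello.sh` places the job in the queue, Slurm allocates the two nodes when they become free, and `srun` launches the work on them. `squeue -u "$USER"` shows the job while it is pending or running.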

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster counting over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: it is written in C and uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.

Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

1145 questions
-1 votes, 1 answer

Sbatch and srun SLURM sch

I've been pulling my hair out for about a week trying to get an sbatch job script to submit to multiple nodes. I have two compute nodes, each with 2 sockets, 12 cores per socket, and 2 threads per core. I have a simple C program which calculates the Fibonacci series (no…
Bhargav
-1 votes, 1 answer

connection between mpirun command in bash script and MPI code

I'm wondering how it is possible to set a variable in a bash script (Slurm) and use that variable in an MPI program in C, or vice versa. For example: In test-mpi.c define int i; ...... Then in bash script use it like this: if (i=o) mpirun --map-by…
Matrix
-1 votes, 1 answer

bash script to assign value to a variable in for loop

I'm trying to submit each batch job in a different directory, i.e., ./test1/, ./test2/, ./test3/. So I iterate over the ./test* directories and set the variable $SLURM_SUBMIT_DIR, which controls the directory where I submit the job. #!/bin/bash -l #…
James LT
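
One possible approach to the question above, as a sketch only: the sbatch documentation lists `SLURM_SUBMIT_DIR` as a variable that Slurm sets for the job (the directory from which sbatch was invoked), so assigning it in the loop is unlikely to help. The sketch below passes the directory with sbatch's `--chdir` option instead. The function name `submit_in_each_dir` and the script name `job.sh` are hypothetical, and older Slurm releases may spell the option differently.

```bash
#!/bin/bash
# Sketch: submit the same batch script once per directory,
# using each directory as the job's working directory.

submit_in_each_dir() {
    local script="$1"
    shift
    local dir
    for dir in "$@"; do
        # Skip anything that is not a directory (e.g. an unmatched glob).
        [ -d "$dir" ] || continue
        sbatch --chdir="$dir" "$script"
    done
}
```

A call might look like `submit_in_each_dir job.sh ./test*/`, which runs `sbatch` once for each matching directory.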
-1 votes, 2 answers

How can I cancel a job using if statement on slurm?

My bash script will read and compare two values from two different files. If they aren't equal, the script should cancel the job in Slurm. I think I should get the job ID, but I don't know how to get the job ID and cancel it in a bash script. How can I…
selami
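
For the question above, a minimal sketch, assuming the comparison runs inside the job itself: Slurm exports the job's ID as `SLURM_JOB_ID` in the job environment, and `scancel` takes a job ID as its argument. The function name `cancel_if_different` and the file names in the usage line are hypothetical.

```bash
#!/bin/bash
# Sketch: cancel the current Slurm job when two files hold different values.
# Assumes it is called from within a running job, where SLURM_JOB_ID is set.

cancel_if_different() {
    local a b
    a=$(cat "$1")
    b=$(cat "$2")
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "SLURM_JOB_ID is not set; not running inside a Slurm job?" >&2
        return 1
    fi
    if [ "$a" != "$b" ]; then
        echo "values differ ($a vs $b); cancelling job $SLURM_JOB_ID" >&2
        scancel "$SLURM_JOB_ID"
    fi
}
```

Inside a batch script this could be called as `cancel_if_different value1.txt value2.txt`. The values are compared as strings, so "3" and "3.0" would count as different.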
-3 votes, 1 answer

slurm example for a simple executable file

I am new to programming with Slurm. Is there any way to execute MATLAB code using sbatch? (I tried using MATLAB as an executable and got an error: /usr/local/MATLAB/R2012a/bin/matlab: 1:…