Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
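
As a concrete illustration of those three functions, a minimal batch script might look like the sketch below (the node count, time limit, and job name are placeholders): sbatch places it in the queue of pending work, Slurm grants the node allocation for the requested duration, and srun starts and monitors the tasks on the allocated nodes.

    #!/bin/bash
    #SBATCH --job-name=hello        # placeholder job name
    #SBATCH --nodes=2               # request an allocation of two nodes
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=00:05:00         # duration of the allocation

    # launch and monitor one task per allocated node
    srun hostname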

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
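
For a sense of what that simplest configuration can look like, here is a minimal slurm.conf sketch for a single-machine setup; the cluster name, hostname, and CPU count are placeholders, and a real installation will usually need a few more site-specific settings (for example SlurmUser and StateSaveLocation).

    # slurm.conf -- minimal single-node sketch; all values are placeholders
    ClusterName=mycluster
    SlurmctldHost=localhost
    ProctrackType=proctrack/linuxproc
    SelectType=select/cons_tres
    NodeName=localhost CPUs=4 State=UNKNOWN
    PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP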

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C and built with the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.
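
The last point refers to Slurm's power-saving hooks; a slurm.conf fragment along the following lines (the program paths, timings, and node name are placeholders) is roughly how idle nodes are suspended and resumed:

    # power-save fragment; paths, timings, and node names are placeholders
    SuspendTime=600                              # suspend nodes idle for 10 minutes
    SuspendProgram=/usr/local/sbin/node_suspend  # site-provided power-down script
    ResumeProgram=/usr/local/sbin/node_resume    # site-provided power-up script
    ResumeTimeout=300                            # seconds allowed for a node to come back
    SuspendExcNodes=head01                       # never power down this node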

Resources and Tutorials:

Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

1145 questions
-1
votes
0 answers

Slurm, Unable to run srun -N 2 nvidia-smi

When I try to run srun -N2 hostname, everything looks fine. However, when I try running srun -N2 nvidia-smi, it tells me no device is connected. I removed Slurm version 19 and installed Slurm version 20.02.7. Running srun -N2 nvidia-smi…
-1
votes
0 answers

mpirun detected that one or more processes exited with non-zero status

I'm trying to use mpirun for a LAMMPS simulation, using Slurm job scripts. When I submit the job, it submits but then returns this in the output. I figured that Exit Code: 127 refers to a command not being recognized, but I don't…
-1
votes
1 answer

slurm srun with singularity does not number MPI ranks correctly

I have a simple Fortran program that prints the number of ranks and the rank to the screen for each processor in an MPI program: program hello include 'mpif.h' integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) call…
-1
votes
2 answers

Correct usage of gpus-per-task for allocation of distinct GPUs via SLURM

I am using the cons_tres SLURM plugin, which introduces, among other things, the --gpus-per-task option. If my understanding is correct, the following script should allocate two distinct GPUs on the same node: #!/bin/bash #SBATCH --ntasks=2 #SBATCH…
redhotsnow
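
For context on this question, a cons_tres-style request for two tasks with one distinct GPU each would look roughly like the sketch below (the program name is a placeholder, and the exact GPU binding behaviour depends on the site's gres configuration):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=2
    #SBATCH --gpus-per-task=1     # one GPU per task; requires select/cons_tres

    # each task should normally see only its own GPU (e.g. via CUDA_VISIBLE_DEVICES)
    srun ./my_gpu_program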
-1
votes
1 answer

R fread() error when executing multiple scripts

Edit: the file path I am giving is not /tmp/... the file path is in the form of Erfurt/phased_edited_all_chr.gz. I think the tmp directory is created by Slurm, but I am not sure about it. End edit. I am working on a data analysis project, and have a…
-1
votes
1 answer

"wget" command question from Blobtool tutorial

I am following a tutorial (https://blobtoolkit.genomehubs.org/install/), specifically "2. Fetch the nt database". The first step, 1. mkdir -p nt, is done; the second step is 2. wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz" -P nt/ &&…
slin023
-1
votes
1 answer

How to resolve the hostname between the pods in the kubernetes cluster?

I am creating two pods with a custom docker image (ubuntu is the base image). I am trying to ping the pods from their terminals. I am able to reach them using the IP address but not the hostname. How do I achieve this without manually adding /etc/hosts in the…
Akhil
-1
votes
1 answer

How does the scratch space differ from the normal disk space in the home node disk space?

I am new to HPC and I am struggling with setting up scratch space. In the cluster I am working with, I need to set up scratch space using the Slurm workload manager, and I am struggling with the following questions: How does the scratch space differ…
-1
votes
1 answer

multiple dataframes in parallel function in R

In R I'm calling parLapply() on a list and filtering 2 dataframes within the function using the elements from the list e.g. myfunction <- function(id) { r1 <- r %>% filter(ID == id) b1<- b %>% filter(ID == id) doSomething(r1,b1) } result <-…
user3725599
-1
votes
1 answer

Slurm: sbatch: fatal: Unable to process configuration file

I'm trying to use university's grid computing following this (probably old) guide http://cmp.felk.cvut.cz/cmp/hardware/grid/ The problem is that I get this error There was an error running the Slurm sbatch command. The command was: '/usr/bin/sbatch…
-1
votes
1 answer

Bad substitution from SLURM array

The following batch script is meant to run a function against an array of files: #SBATCH --job-name="my_job" #SBATCH --partition=long #SBATCH --nodes=2 #SBATCH --ntasks=1 #SBATCH --mem=30G #SBATCH --ntasks-per-node=4 #SBATCH…
Marion
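
As background for this question, a job-array script that indexes into a list of files with SLURM_ARRAY_TASK_ID might look like the sketch below (the file names are placeholders); note that bash array syntax such as ${FILES[...]} produces a "bad substitution" error if the script is run by a shell that does not support it, which is one common cause of this message.

    #!/bin/bash
    #SBATCH --job-name=array_demo
    #SBATCH --array=0-3                      # four array tasks, indices 0..3

    # placeholder input files; each task picks one by its array index
    FILES=(a.txt b.txt c.txt d.txt)
    INPUT=${FILES[$SLURM_ARRAY_TASK_ID]}

    echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"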
-1
votes
2 answers

Select slurm jobs based on sacct data

On a cluster using Slurm I am trying to create a list of jobs that were submitted in a certain time interval so that I can cancel them. By hand I can do this using sacct --format="JobID,Submit", which will give me a list of JobIDs and the corresponding…
Kvothe
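
As a rough illustration of the approach asked about here, sacct can be restricted to a time window and the resulting job IDs piped to scancel; in the sketch below the window is a placeholder, and note that -S/-E select jobs by their time in the accounting window, so the Submit field may still need to be checked explicitly.

    # list job IDs seen in a time window (placeholder dates), then cancel them
    sacct -S 2023-01-01T00:00 -E 2023-01-02T00:00 -X -n -P --format=JobID,Submit \
      | cut -d'|' -f1 \
      | xargs -r scancel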
-1
votes
1 answer

No jobs run on Slurm excluded nodes

With our local cluster we are having the following problem with Slurm. User A submits a lot of high-priority jobs that fill the cluster, but wants to leave a few nodes free for user B to use, so that user B can continue to work even though with…
-1
votes
1 answer

How to setup slurm on personal laptop?

I want to set up Slurm on my local machine (my dual-core laptop). Following is the specification, but I am not sure about the node name and cluster name during configuration.
Manish
-1
votes
1 answer

How does one check why/reason my scripts are getting queued in slurm?

I am using Slurm and I am trying to figure out why my script is not running / why it is getting queued. As far as I can tell there should be enough resources to run, but Slurm doesn't agree. How do I check this? Command run: squeue -o…
Charlie Parker
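
On the general topic of this last question, the scheduler's reason for keeping a job pending is usually visible in squeue's reason field or in scontrol; a short sketch (the job ID is a placeholder):

    # show job ID, partition, name, state, and the reason a job is still pending
    squeue -u $USER -o "%.10i %.9P %.20j %.8T %r"

    # full detail for a single job (12345 is a placeholder job ID)
    scontrol show job 12345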