HPC Crash Course

Slurm Scheduler and Resource Manager

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Overview

Aims

  • Understand the role of Slurm as a scheduler and resource manager.
  • Learn how to use Slurm to run jobs on the cluster and …
  • … how to interrogate Slurm for cluster and job status.

Actions

  • Submit interactive and batch jobs.
  • Write Slurm job scripts.
  • Query slurm with squeue, sinfo and scontrol.

Documentation Resources

🤓 Google/Bing will help you find more

👍 User Forum at https://hpc-talk.cubi.bihealth.org!

Your Experience? 🤸

  • Do you already
    • have experience with HPC?
    • know Slurm?
    • know another job scheduler or resource manager?

Slurm

  • Introduction
  • Running Interactive and Batch Jobs
  • Querying job and cluster status

What is a Scheduler?

Resource Manager

  • Slurm keeps a ledger of our node and job resources
    • available CPUs, memory, GPUs on each node
    • jobs and their required resources
    • currently running jobs and their used resources

Job Scheduler

  • Slurm manages a schedule of submitted jobs
    • Freshly submitted jobs go through a quick scheduling pass
    • Periodically, Slurm runs full backfill scheduling

Our First Interactive Job: Submission 🎬

holtgrem_c@hpc-login-1$ srun --pty --time=2:00:00 --partition=short \
    --reservation=hpc-crash-course --mem=10G --cpus-per-task=1 bash -i
srun: job 14629328 queued and waiting for resources
srun: job 14629328 has been allocated resources
holtgrem_c@hpc-cpu-141$
  • start interactive job with srun
    • --pty connects the job’s standard output and error streams to your shell
    • --time=2:00:00 limits your job to a running time of up to two hours
    • --partition=short submits into the short partition
    • --reservation=hpc-crash-course submits into the hpc-crash-course reservation
    • --mem=10G allocates 10GB of RAM to your job
    • --cpus-per-task=1 allocates one thread for our task
    • bash -i is the command to run (interactive bash)
  • now: look at your job in another shell: 🤸
    • squeue -u $USER
    • scontrol show job 14629328

Looking at squeue 🤸

What is the output of squeue?

holtgrem_c@hpc-cpu-141$ squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
14629328  short        bash holtgrem  R       1:40      1 hpc-cpu-141

More info with --long:

holtgrem_c@hpc-cpu-141$ squeue -u $USER --long
Tue Jul 11 15:21:13 2023
   JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
14629328  short        bash holtgrem  RUNNING       3:20   2:00:00      1 hpc-cpu-141

👉 Slurm Documentation: squeue

Looking at scontrol show job 🤸

Let us look at scontrol show job 14629328

holtgrem_c@hpc-cpu-141$ scontrol show job 14629328
JobId=14629328 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:06:37 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:17:37 EligibleTime=2023-07-11T15:17:37
   AccrueTime=2023-07-11T15:17:37
   StartTime=2023-07-11T15:17:53 EndTime=2023-07-11T17:17:53 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:17:53 Scheduler=Backfill
   Partition=training AllocNode:Sid=hpc-login-1:3631083
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-141
   BatchHost=hpc-cpu-141
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

👉 Slurm Documentation: scontrol

Observing a job’s states 🤸 (1/4)

The job in PENDING state

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:33 Scheduler=Main
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (2/4)

The job while running on a node:

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:05:04 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=2023-07-11T15:26:33
   StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T17:26:53 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-144
   BatchHost=hpc-cpu-144
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (3/4)

The job “just” after being terminated:

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:07:52 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=2023-07-11T15:26:33
   StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T15:34:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-144
   BatchHost=hpc-cpu-144
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (4/4)

After some time, the job is not known to the controller any more…

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
slurm_load_jobs error: Invalid job id specified

… but we can still get some information from the accounting (for 4 weeks) …

holtgrem_c@hpc-cpu-141$ sacct -j 14629381
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
14629381           bash      short hpc-ag-cu+          1  COMPLETED      0:0
14629381.ex+     extern            hpc-ag-cu+          1  COMPLETED      0:0
14629381.0         bash            hpc-ag-cu+          1  COMPLETED      0:0

You can use sacct -j JOBID --long | less -SR to see all available accounting information.

Our First Batch Job 🎬 (1/5)

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

echo "Hello World"
sleep 1min
EOF
holtgrem_c@hpc-login-1$ sbatch first-job.sh
sbatch: You did not specify a running time. Defaulting to two days.
sbatch: routed your job to partition medium
sbatch:
Submitted batch job 14629473

Our First Batch Job 🎬 (2/5)

Sadly, it failed:

holtgrem_c@hpc-login-1$ scontrol show job 14629473
JobId=14629473 JobName=first-job.sh
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=761 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:44:31 EligibleTime=2023-07-11T15:44:31
   AccrueTime=2023-07-11T15:44:31
   StartTime=2023-07-11T15:44:54 EndTime=2023-07-11T15:44:54 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:44:54 Scheduler=Backfill
   Partition=medium AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-219
   BatchHost=hpc-cpu-219
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/cephfs-1/home/users/holtgrem_c/first-job.sh
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   StdErr=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
   StdIn=/dev/null
   StdOut=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
   Power=

Our First Batch Job 🎬 (3/5)

Troubleshooting our job failure:

holtgrem_c@hpc-login-1$ cat /data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
Hello World
sleep: invalid time interval ‘1min’
Try 'sleep --help' for more information.

Our First Batch Job 🎬 (4/5)

More troubleshooting hints:

  1. scontrol show job JOBID | grep Reason
    • NonZeroExitCode? Timeout? Out of memory?
  2. Does WorkDir exist and do you have access?
    • Try: cd $WorkDir
  3. Look at the StdOut/StdErr log files, if any.
  4. Run sacct -j 14629473 --format=JobID,State,ExitCode,Elapsed,MaxVMSize for hints regarding running time and memory (VM) usage (see the sketch below)
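
Putting these together, a minimal troubleshooting sketch (using the failed job ID from above; adapt as needed):

JOBID=14629473
# Show state, failure reason, and exit code while the job is still known to the controller.
scontrol show job $JOBID | grep -E 'JobState|Reason|ExitCode'
# Accounting keeps the data longer; MaxVMSize hints at memory usage.
sacct -j $JOBID --format=JobID,State,ExitCode,Elapsed,MaxVMSize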

Our First Batch Job 🎬 (5/5)

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

echo "Hello World"
sleep 1m
EOF
holtgrem_c@hpc-login-1$ sbatch first-job.sh
sbatch: You did not specify a running time. Defaulting to two days.
sbatch: routed your job to partition medium
sbatch:
Submitted batch job 14629474

👉 hpc-docs: sbatch
👉 hpc-docs: srun

Resource Allocation with srun/sbatch

We can explicitly allocate resources on the srun and sbatch command lines (example after the list below):

  • --job-name=MY-JOB-NAME: explicit naming
  • --time=D-HH:MM:SS: max running time
  • --partition=PARTITION: partition
  • --mem=MEMORY: allocate memory, use <num>G or <num>M
  • --cpus-per-task=CORES: number of cores to allocate
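
For example, a sketch of an interactive allocation combining these options (job name and values are illustrative):

hpc-login-1$ srun --job-name=my-analysis --time=0-02:00:00 --partition=short \
    --mem=8G --cpus-per-task=4 --pty bash -i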

👉 Slurm Documentation: sbatch

Resource Allocation in Job Scripts

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

#SBATCH --job-name=tired-but-extravagant
#SBATCH --time=0:05:00
#SBATCH --partition=short
#SBATCH --mem=2G
#SBATCH --cpus-per-task=4

echo "I will waste 2GB of RAM and 4 corse for 1 min..."
sleep 1m
EOF

Your Turn: Writing Job Scripts 🤸

Write a job script that …

  1. allocates minimal memory for sleep 1m (hint: how can you figure out the maximum memory used?)
  2. writes separate stdout and stderr files (where could this be useful?)
  3. is called job-1.sh and triggers job-2.sh on completion (is this useful? dangerous?)

Use online resources to figure out the right command line parameters.

Your Turn: 👀 Staring at the Scheduler 🤸

Use the following commands; consult the online help and man $command to figure out their output:

  • sdiag
  • squeue
  • sinfo
  • scontrol show node NODE
  • sprio -l -S -y

Your Turn: 👓 Staring at the Scheduler 🤸

Use the following commands; consult the online help and man $command to figure out their output:

  • sdiag
    • quick look to see scheduler status and load
  • squeue
    • investigate current queue status
  • sinfo
    • get an overview of nodes’ health and load
  • scontrol show node NODE
    • look at node health and load
  • sprio -l -S -y
    • look at current scheduler priorities for jobs

Your Turn: Make it Fail! 🤸

Provoke the following situations:

  1. Work directory does not exist.
  2. Work directory exists but you have no access.
  3. Stdout/stderr files cannot be written
  4. Too many cores allocated (try: 100)
  5. Job needs too much memory (allocate 500MB, then use more than that)
  6. Job runs into timeout

In each case, look at scontrol/sacct output and look at log files.

Slurm Partitions (1)

  • The nodes in the cluster are assigned to partitions.
  • Partitions are used to
    • separate nodes with special hardware (high memory, GPU)
    • provide access to different quality of service
  • By default, all jobs go to the standard partition
    • they will then be routed to the appropriate partition based on their resource requirements
    • you can override this with --partition=PARTITION on the command line
  • You must specify the following partitions explicitly: highmem, gpu, mpi

Slurm Partitions (2)

Available by default:

  • debug - very short jobs for debugging
  • short - allows many jobs up to 4h
  • medium - up to 512 cores per user, up to 7 days
  • long - up to 64 cores per user, up to 14 days

Request access by email to the helpdesk:

  • highmem - for high memory node access
  • gpu - for GPU node access
  • mpi - for running MPI jobs
  • critical - for running many clinical/mission-critical jobs

Tuning squeue Output

  • You can tune the output of many Slurm programs

  • For example squeue -o/--format.

  • See man squeue for details.

  • Example:

    $ squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" --user=holtgrem_c
        JOBID PARTITION NAME                                                         USER       ST       TIME  NODES CPUS NODELIST(REASON)     TRES_PER_NODE
      15382564 medium    bash                                                         holtgrem_c  R      30:25      1    1 hpc-cpu-63           N/A
  • Another goodie: pipe into less -S to get a horizontally scrollable output

Your turn 🤸: look at man squeue and find your “top 3 most useful” values.

Useful .bashrc Aliases and Functions

alias sbi='srun --pty --time 7-00 --mem=5G --cpus-per-task 2 bash -i'
alias slurm-loginx='srun --pty --time 7-00 --partition long --x11 bash -i'
# Shell functions instead of aliases, so that arguments end up before the pipe:
sq() { squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" "$@"; }
sql() { sq "$@" | less -S; }
sqme() { sq -u holtgrem_c "$@"; }
sqmel() { sqme "$@" | less -S; }

The Role of Logging (1)

  • Main use case: ⚡⚡ troubleshooting ⚡⚡
  • While running
    • is my job stuck?
    • is something weird going on?
  • Post mortem analysis
    • why did my job fail?
    • what did it do?

The Role of Logging (2)

In Bash scripts, set -x and set -v are heaven-sent; see the example below.

  • set -x prints commands before execution
  • (set -v prints commands as they are read)
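
A minimal job script sketch using set -x (the commands are just placeholders):

#!/usr/bin/bash

# Print every command, with variables expanded, before it is executed.
set -x

echo "Hello World"
sleep 1m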

Logging Pitfalls

  • Log output is buffered and not flushed to disk immediately
    • You may thus not see the last lines of your log file!
  • If this is a problem, try using stdbuf in your script and manually redirect to a log file (see the sketch below)
    • But you probably should not do this by default!
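
A minimal sketch, assuming a long-running program called my-long-running-tool (hypothetical name); stdbuf forces line-buffered output so log lines appear promptly:

# Line-buffer stdout and stderr, redirect both into a log file.
stdbuf -oL -eL ./my-long-running-tool > my-tool.log 2>&1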

Job Script Temporary File Handling

#!/usr/bin/bash

# ... prelude that does not need $TMPDIR ...

# Create new unique directory below current `$TMPDIR`.
export TMPDIR=$(mktemp -d)
# Setup auto-cleanup of the just-created `$TMPDIR`.
trap "rm -rf $TMPDIR" ERR EXIT

# ... your usual script ...

Submitting GPU Jobs

hpc-login-1$ export | grep CUDA_VISIBLE_DEVICES
hpc-login-1$ srun --partition=gpu --gres=gpu:tesla:1 --pty bash
hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0"
hpc-gpu-1:~$ exit
hpc-login-1:~$ srun --partition=gpu --gres=gpu:tesla:2 --pty bash
hpc-gpu-2:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0,1"
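
A batch job sketch for the GPU partition (job name, time, and memory are illustrative values):

#!/usr/bin/bash

#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --time=2:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4

# Show the GPU(s) that Slurm assigned to this job.
nvidia-smi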

👉 hpc-docs: How-To: Connect to GPU Nodes

Submitting High Memory Jobs (1)

  • The HPC has two nodes with 0.5TB and two nodes with 1TB of RAM.
  • To use these nodes, add --partition=highmem to sbatch/srun command line.
  • However:
    • Request access first by email to hpc-helpdesk@bih-charite.de
    • There are only four such nodes!

Submitting High Memory Jobs (2)

  • Can you reduce memory usage?
    • Split your problem into smaller ones?
    • Use more memory efficient data structures / algorithms?
    • E.g., Unix sort supports external-memory (on-disk) sorting and is thus very memory efficient
    • Similarly, samtools sort allows fine-grained control over memory usage (see the sketch below)
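
A sketch of memory-frugal sorting (file names are placeholders):

# GNU sort: cap the in-memory buffer at 1GB, spill intermediate data to $TMPDIR.
sort -S 1G -T $TMPDIR input.txt > sorted.txt
# samtools sort: cap memory per thread at 1GB, use 4 threads.
samtools sort -m 1G -@ 4 -o sorted.bam input.bam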

Canceling Jobs with scancel

  • Jobs that are in running or pending state can be cancelled
  • For this: scancel JOBID
  • Cancel all your jobs: scancel --user=$USER
  • See scancel --help for more options

Your turn 🤸: submit a job, cancel it, look at scontrol and sacct output.

QOS and sacctmgr (1)

  • QOS = quality of service
  • Each partition has a QOS (with the same name) attached to it
  • This allows restricting the maximum wall time (MaxWall)
  • … and the maximum tracked resources per user (MaxTRESPU)
  • To see these values:
$ sacctmgr show qos -p | cut -d '|' -f 1,19,20 | column -s '|' -t
Name             MaxWall      MaxTRESPU
normal                        cpu=512,mem=3.50T
debug            01:00:00     cpu=1000,mem=7000G
medium           7-00:00:00   cpu=512,mem=3.50T
critical                      cpu=12000,mem=84000G
long             14-00:00:00  cpu=64,mem=448G
...

Job Dependencies with sbatch

  • Each Slurm job corresponds to a task in task-based parallelism
  • We can model dependencies between jobs (sketch after the list):
    • sbatch -d afterok:JOBID:JOBID:JOBID:...
    • sbatch -d afterany:JOBID:...
    • sbatch --job-name NAME -d singleton
  • More options / information: man sbatch
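
A minimal sketch using job-1.sh and job-2.sh from the earlier exercise; sbatch --parsable prints only the job ID so that it can be captured:

JOB1=$(sbatch --parsable job-1.sh)
sbatch -d afterok:$JOB1 job-2.sh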

Your turn 🤸: write two jobs with -d afterok:JOBID

X11 Forwarding with Slurm

  • Sometimes you need to run a graphical application on the compute nodes

  • You must run a local X11 server (e.g., native on Linux, MobaXterm on Windows)

  • Then, you can do:

    $ ssh -X hpc-login-1
    hpc-login-1$ srun --x11 --pty bash -i
    hpc-cpu-141$ xterm
  • … and see an XTerm window locally

Your turn 🤸: start xterm if you have a local X11 server.

Reservations

  • Reservations allow administrators to reserve resources for specific users.

  • For example, we have a reservation for this course:

    $ scontrol show reservation hpc-course
    ReservationName=hpc-course StartTime=2023-08-27T00:00:00 EndTime=2023-09-02T00:00:00 Duration=6-00:00:00
      Nodes=hpc-cpu-[76,80,191-194,201-206,208-228],hpc-gpu-[6-7] NodeCnt=35 CoreCnt=592 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
      TRES=cpu=1184
      Users=holtgrem_c Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
      MaxStartDelay=(null)
  • To use this reservation, add --reservation=hpc-crash-course to your sbatch/srun command line.

Quiz Time!

  • https://PollEv.com/manuelholtgrewe153