HPC Crash Course

Slurm Scheduler and Resource Manager

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Overview

Aims

  • Understand the role of Slurm as a scheduler and resource manager.
  • Learn how to use Slurm to run jobs on the cluster and …
  • … how to interrogate Slurm for cluster and job status.

Actions

  • Submit interactive and batch jobs.
  • Write Slurm job scripts.
  • Query slurm with squeue, sinfo and scontrol.

Documentation Resources

🤓 Google/Bing will help you find more

👍 User Forum at https://hpc-talk.cubi.bihealth.org!

Your Experience? 🤸

  • Do you already
    • have experience with HPC?
    • know Slurm?
    • know another job scheduler or resource manager?

Slurm

  • Introduction
  • Running Interactive and Batch Jobs
  • Querying job and cluster status

What is a Scheduler?

Resource Manager

  • Slurm keeps a ledger of our node and job resources
    • available CPUs, memory, GPUs on each node
    • jobs and their required resources
    • currently running jobs and their used resources

Job Scheduler

  • Slurm manages a schedule of submitted jobs
    • Freshly submitted jobs go through a quick scheduling pass
    • Periodically, Slurm runs full backfill scheduling

Our First Interactive Job: Submission 🎬

holtgrem_c@hpc-login-1$ srun --pty --time=2:00:00 --partition=short \
    --reservation=hpc-crash-course --mem=10G --cpus-per-task=1 bash -i
srun: job 14629328 queued and waiting for resources
srun: job 14629328 has been allocated resources
holtgrem_c@hpc-cpu-141$
  • start interactive job with srun
    • --pty connects the job’s standard output and error streams to your shell
    • --time=2:00:00 limits your job to a running time of up to two hours
    • --partition=short submits into the short partition
    • --reservation=hpc-crash-course submits into the hpc-crash-course reservation
    • --mem=10G allocates 10GB of RAM to your job
    • --cpus-per-task=1 allocates one thread for our task
    • bash -i is the command to run (interactive bash)
  • now: look at your job in another shell: 🤸
    • squeue -u $USER
    • scontrol show job 14629328

Looking at squeue 🤸

What is the output of squeue?

holtgrem_c@hpc-cpu-141$ squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
14629328  short        bash holtgrem  R       1:40      1 hpc-cpu-141

More info with --long:

holtgrem_c@hpc-cpu-141$ squeue -u $USER --long
Tue Jul 11 15:21:13 2023
   JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
14629328  short        bash holtgrem  RUNNING       3:20   2:00:00      1 hpc-cpu-141

👉 Slurm Documentation: squeue

Looking at scontrol show job 🤸

Let us look at scontrol show job 14629328

holtgrem_c@hpc-cpu-141$ scontrol show job 14629328
JobId=14629328 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:06:37 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:17:37 EligibleTime=2023-07-11T15:17:37
   AccrueTime=2023-07-11T15:17:37
   StartTime=2023-07-11T15:17:53 EndTime=2023-07-11T17:17:53 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:17:53 Scheduler=Backfill
   Partition=training AllocNode:Sid=hpc-login-1:3631083
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-141
   BatchHost=hpc-cpu-141
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

👉 Slurm Documentation: scontrol

Observing a job’s states 🤸 (1/4)

The job in PENDING state

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:33 Scheduler=Main
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (2/4)

The job while running on a node:

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:05:04 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=2023-07-11T15:26:33
   StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T17:26:53 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-144
   BatchHost=hpc-cpu-144
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (3/4)

The job “just” after being terminated:

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:07:52 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
   AccrueTime=2023-07-11T15:26:33
   StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T15:34:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
   Partition=short AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-144
   BatchHost=hpc-cpu-144
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   Power=

Observing a job’s states 🤸 (4/4)

After some time, the job is not known to the controller any more…

holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
slurm_load_jobs error: Invalid job id specified

… but we can still get some information from the accounting (for 4 weeks) …

holtgrem_c@hpc-cpu-141$ sacct -j 14629381
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
14629381           bash      short hpc-ag-cu+          1  COMPLETED      0:0
14629381.ex+     extern            hpc-ag-cu+          1  COMPLETED      0:0
14629381.0         bash            hpc-ag-cu+          1  COMPLETED      0:0

You can use sacct -j JOBID --long | less -SR to see all available accounting information.

Our First Batch Job 🎬 (1/5)

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

echo "Hello World"
sleep 1min
EOF
holtgrem_c@hpc-login-1$ sbatch first-job.sh
sbatch: You did not specify a running time. Defaulting to two days.
sbatch: routed your job to partition medium
sbatch:
Submitted batch job 14629473

Our First Batch Job 🎬 (2/5)

Sadly, it failed:

holtgrem_c@hpc-login-1$ scontrol show job 14629473
JobId=14629473 JobName=first-job.sh
   UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
   Priority=761 Nice=0 Account=hpc-ag-cubi QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2023-07-11T15:44:31 EligibleTime=2023-07-11T15:44:31
   AccrueTime=2023-07-11T15:44:31
   StartTime=2023-07-11T15:44:54 EndTime=2023-07-11T15:44:54 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:44:54 Scheduler=Backfill
   Partition=medium AllocNode:Sid=hpc-login-1:3644832
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-cpu-219
   BatchHost=hpc-cpu-219
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/cephfs-1/home/users/holtgrem_c/first-job.sh
   WorkDir=/data/cephfs-1/home/users/holtgrem_c
   StdErr=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
   StdIn=/dev/null
   StdOut=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
   Power=

Our First Batch Job 🎬 (3/5)

Troubleshooting our job failure:

holtgrem_c@hpc-login-1$ cat /data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
Hello World
sleep: invalid time interval ‘1min’
Try 'sleep --help' for more information.

Our First Batch Job 🎬 (4/5)

More troubleshooting hints:

  1. scontrol show job JOBID | grep Reason
    • NonZeroExitCode? Timeout? Out of memory?
  2. Does WorkDir exist and do you have access?
    • Try: cd $WorkDir
  3. Look at the StdOut/StdErr log files, if any.
  4. Run sacct -j 14629473 --format=JobID,State,ExitCode,Elapsed,MaxVMSize for hints regarding running time and memory (VM) usage (see the sketch below)
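
Putting these together, a minimal troubleshooting sketch (using the failed job ID from above; adapt as needed):

JOBID=14629473
# Show state, failure reason, and exit code while the job is still known to the controller.
scontrol show job $JOBID | grep -E 'JobState|Reason|ExitCode'
# Accounting keeps the data longer; MaxVMSize hints at memory usage.
sacct -j $JOBID --format=JobID,State,ExitCode,Elapsed,MaxVMSize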

Our First Batch Job 🎬 (5/5)

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

echo "Hello World"
sleep 1m
EOF
holtgrem_c@hpc-login-1$ sbatch first-job.sh
sbatch: You did not specify a running time. Defaulting to two days.
sbatch: routed your job to partition medium
sbatch:
Submitted batch job 14629474

👉 hpc-docs: sbatch
👉 hpc-docs: srun

Resource Allocation with srun/sbatch

We can explicitly allocate resources on the srun and sbatch command lines (example after the list below):

  • --job-name=MY-JOB-NAME: explicit naming
  • --time=D-HH:MM:SS: max running time
  • --partition=PARTITION: partition
  • --mem=MEMORY: allocate memory, use <num>G or <num>M
  • --cpus-per-task=CORES: number of cores to allocate
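
For example, a sketch of an interactive allocation combining these options (job name and values are illustrative):

hpc-login-1$ srun --job-name=my-analysis --time=0-02:00:00 --partition=short \
    --mem=8G --cpus-per-task=4 --pty bash -i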

👉 Slurm Documentation: sbatch

Resource Allocation in Job Scripts

holtgrem_c@hpc-login-1$ cat >first-job.sh <<"EOF"
#!/usr/bin/bash

#SBATCH --job-name=tired-but-extravagant
#SBATCH --time=0:05:00
#SBATCH --partition=short
#SBATCH --mem=2G
#SBATCH --cpus-per-task=4

echo "I will waste 2GB of RAM and 4 corse for 1 min..."
sleep 1m
EOF

Your Turn: Writing Job Scripts 🤸

Write a job script that …

  1. allocates minimal memory for sleep 1m (hint: how can you figure out the maximum memory used?)
  2. writes separate stdout and stderr files (where could this be useful?)
  3. is called job-1.sh and triggers job-2.sh on completion (is this useful? dangerous?)

Use online resources to figure out the right command line parameters.

Your Turn: 👀 Staring at the Scheduler 🤸

Use the following commands; consult the online help and man $command to figure out their output:

  • sdiag
  • squeue
  • sinfo
  • scontrol show node NODE
  • sprio -l -S -y

Your Turn: 👓 Staring at the Scheduler 🤸

Use the following commands; consult the online help and man $command to figure out their output:

  • sdiag
    • quick look to see scheduler status and load
  • squeue
    • investigate current queue status
  • sinfo
    • get an overview of nodes’ health and load
  • scontrol show node NODE
    • look at node health and load
  • sprio -l -S -y
    • look at current scheduler priorities for jobs

Your Turn: Make it Fail! 🤸

Provoke the following situations:

  1. Work directory does not exist.
  2. Work directory exists but you have no access.
  3. Stdout/stderr files cannot be written
  4. Too many cores allocated (try: 100)
  5. Job needs too much memory (allocate 500MB, then use more than that)
  6. Job runs into timeout

In each case, look at scontrol/sacct output and look at log files.

Slurm Partitions (1)

  • The nodes in the cluster are assigned to partitions.
  • Partitions are used to
    • separate nodes with special hardware (high memory, GPU)
    • provide access to different quality of service
  • By default, all jobs go to the standard partition
    • they will then be routed to the appropriate partition based on their resource requirements
    • you can override this with --partition=PARTITION on the command line
  • You must specify the following partitions explicitly: highmem, gpu, mpi

Slurm Partitions (2)

Available by default:

  • debug - very short jobs for debugging
  • short - allows many jobs up to 4h
  • medium - up to 512 cores per user, up to 7 days
  • long - up to 64 cores per user, up to 14 days

Request access by email to the helpdesk:

  • highmem - for high memory node access
  • gpu - for GPU node access
  • mpi - for running MPI jobs
  • critical - for running many clinical/mission-critical jobs

Tuning squeue Output

  • You can tune the output of many Slurm programs

  • For example squeue -o/--format.

  • See man squeue for details.

  • Example:

    $ squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" --user=holtgrem_c
        JOBID PARTITION NAME                                                         USER       ST       TIME  NODES CPUS NODELIST(REASON)     TRES_PER_NODE
      15382564 medium    bash                                                         holtgrem_c  R      30:25      1    1 hpc-cpu-63           N/A
  • Another goodie: pipe into less -S to get a horizontally scrollable output

Your turn 🤸: look at man squeue and find your “top 3 most useful” values.

Useful .bashrc Aliases and Functions

alias sbi='srun --pty --time 7-00 --mem=5G --cpus-per-task 2 bash -i'
alias slurm-loginx='srun --pty --time 7-00 --partition long --x11 bash -i'
# Shell functions instead of aliases, so that arguments end up before the pipe:
sq() { squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" "$@"; }
sql() { sq "$@" | less -S; }
sqme() { sq -u holtgrem_c "$@"; }
sqmel() { sqme "$@" | less -S; }

The Role of Logging (1)

  • Main use case: ⚡⚡ troubleshooting ⚡⚡
  • While running
    • is my job stuck?
    • is something weird going on?
  • Post mortem analysis
    • why did my job fail?
    • what did it do?

The Role of Logging (2)

In Bash scripts, set -x and set -v are heaven-sent; see the example below.

  • set -x prints commands before execution
  • (set -v prints commands as they are read)
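
A minimal job script sketch using set -x (the commands are just placeholders):

#!/usr/bin/bash

# Print every command, with variables expanded, before it is executed.
set -x

echo "Hello World"
sleep 1m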

Logging Pitfalls

  • Log output is buffered and not flushed to disk immediately
    • You may thus not see the last lines of your log file!
  • If this is a problem, try using stdbuf in your script and manually redirect to a log file (see the sketch below)
    • But you probably should not do this by default!
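
A minimal sketch, assuming a long-running program called my-long-running-tool (hypothetical name); stdbuf forces line-buffered output so log lines appear promptly:

# Line-buffer stdout and stderr, redirect both into a log file.
stdbuf -oL -eL ./my-long-running-tool > my-tool.log 2>&1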

Job Script Temporary File Handling

#!/usr/bin/bash

# ... prelude that does not need $TMPDIR ...

# Create new unique directory below current `$TMPDIR`.
export TMPDIR=$(mktemp -d)
# Setup auto-cleanup of the just-created `$TMPDIR`.
trap "rm -rf $TMPDIR" ERR EXIT

# ... your usual script ...

Submitting GPU Jobs

hpc-login-1$ export | grep CUDA_VISIBLE_DEVICES
hpc-login-1$ srun --partition=gpu --gres=gpu:tesla:1 --pty bash
hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0"
hpc-gpu-1:~$ exit
hpc-login-1:~$ srun --partition=gpu --gres=gpu:tesla:2 --pty bash
hpc-gpu-2:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0,1"
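
A batch job sketch for the GPU partition (job name, time, and memory are illustrative values):

#!/usr/bin/bash

#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --time=2:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4

# Show the GPU(s) that Slurm assigned to this job.
nvidia-smi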

👉 hpc-docs: How-To: Connect to GPU Nodes

Submitting High Memory Jobs (1)

  • The HPC has two nodes with 0.5TB and two nodes with 1TB of RAM.
  • To use these nodes, add --partition=highmem to sbatch/srun command line.
  • However:
    • Request access first by email to hpc-helpdesk@bih-charite.de
    • There are only four such nodes!

Submitting High Memory Jobs (2)

  • Can you reduce memory usage?
    • Split your problem into smaller ones?
    • Use more memory efficient data structures / algorithms?
    • E.g., Unix sort supports external-memory (on-disk) sorting and is thus very memory efficient
    • Similarly, samtools sort allows fine-grained control over memory usage (see the sketch below)
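
A sketch of memory-frugal sorting (file names are placeholders):

# GNU sort: cap the in-memory buffer at 1GB, spill intermediate data to $TMPDIR.
sort -S 1G -T $TMPDIR input.txt > sorted.txt
# samtools sort: cap memory per thread at 1GB, use 4 threads.
samtools sort -m 1G -@ 4 -o sorted.bam input.bam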

Canceling Jobs with scancel

  • Jobs that are in running or pending state can be cancelled
  • For this: scancel JOBID
  • Cancel all your jobs: scancel --user=$USER
  • See scancel --help for more options

Your turn 🤸: submit a job, cancel it, look at scontrol and sacct output.

QOS and sacctmgr (1)

  • QOS = quality of service
  • Each partition has a QOS (with the same name) attached to it
  • This allows restricting the maximum wall time (MaxWall)
  • … and the maximum tracked resources per user (MaxTRESPU)
  • To see these values:
$ sacctmgr show qos -p | cut -d '|' -f 1,19,20 | column -s '|' -t
Name             MaxWall      MaxTRESPU
normal                        cpu=512,mem=3.50T
debug            01:00:00     cpu=1000,mem=7000G
medium           7-00:00:00   cpu=512,mem=3.50T
critical                      cpu=12000,mem=84000G
long             14-00:00:00  cpu=64,mem=448G
...

Job Dependencies with sbatch

  • Each Slurm job corresponds to a task in task-based parallelism
  • We can model dependencies between jobs (sketch after the list):
    • sbatch -d afterok:JOBID:JOBID:JOBID:...
    • sbatch -d afterany:JOBID:...
    • sbatch --job-name NAME -d singleton
  • More options / information: man sbatch
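
A minimal sketch using job-1.sh and job-2.sh from the earlier exercise; sbatch --parsable prints only the job ID so that it can be captured:

JOB1=$(sbatch --parsable job-1.sh)
sbatch -d afterok:$JOB1 job-2.sh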

Your turn 🤸: write two jobs with -d afterok:JOBID

X11 Forwarding with Slurm

  • Sometimes you need to run a graphical application on the compute nodes

  • You must run a local X11 server (e.g., native on Linux, MobaXterm on Windows)

  • Then, you can do:

    $ ssh -X hpc-login-1
    hpc-login-1$ srun --x11 --pty bash -i
    hpc-cpu-141$ xterm
  • … and see an XTerm window locally

Your turn 🤸: start xterm if you have a local X11 server.

Reservations

  • Reservations allow administrators to reserve resources for specific users.

  • For example, we have a reservation for this course:

    $ scontrol show reservation hpc-course
    ReservationName=hpc-course StartTime=2023-08-27T00:00:00 EndTime=2023-09-02T00:00:00 Duration=6-00:00:00
      Nodes=hpc-cpu-[76,80,191-194,201-206,208-228],hpc-gpu-[6-7] NodeCnt=35 CoreCnt=592 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
      TRES=cpu=1184
      Users=holtgrem_c Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
      MaxStartDelay=(null)
  • To use this reservation, add --reservation=hpc-crash-course to your sbatch/srun command line.

Quiz Time!

  • https://PollEv.com/manuelholtgrewe153