Slurm Scheduler and Resource Manager
Aims

Get to know the basic Slurm actions: squeue, sinfo, and scontrol.

🤓 Google/Bing will help to find more.
👍 User Forum at https://hpc-talk.cubi.bihealth.org!
Slurm is both a resource manager (it tracks nodes, CPUs, memory, and GPUs) and a job scheduler (it decides when and where queued jobs run).
holtgrem_c@hpc-login-1$ srun --pty --time=2:00:00 --partition=short \
--reservation=hpc-crash-course --mem=10G --cpus-per-task=1 bash -i
srun: job 14629328 queued and waiting for resources
srun: job 14629328 has been allocated resources
holtgrem_c@hpc-cpu-141$
srun blocks until the job has been allocated and started. From another session, you can inspect the job:

squeue -u $USER
scontrol show job 14629328
squeue

🤸 What is the output of squeue?
holtgrem_c@hpc-cpu-141$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14629328 short bash holtgrem R 1:40 1 hpc-cpu-141
More info with --long:
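Beyond interactive use, squeue also works well in scripts. A minimal sketch (wait_for_job is our own hypothetical helper, not part of Slurm): since -h suppresses the header line, empty output means the job is no longer pending or running.

```shell
# wait_for_job: hypothetical helper that polls squeue until the job is gone.
# squeue -h suppresses the header; -j restricts output to one job ID.
wait_for_job() {
    while [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]; do
        sleep 10   # be gentle with the scheduler
    done
}
```

Useful at the end of a submit script that must wait for a job to finish before post-processing.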
scontrol show job
🤸 Let us look at scontrol show job 14629328:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629328
JobId=14629328 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:06:37 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:17:37 EligibleTime=2023-07-11T15:17:37
AccrueTime=2023-07-11T15:17:37
StartTime=2023-07-11T15:17:53 EndTime=2023-07-11T17:17:53 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:17:53 Scheduler=Backfill
Partition=training AllocNode:Sid=hpc-login-1:3631083
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-141
BatchHost=hpc-cpu-141
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=
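The key=value output above is easy to post-process in the shell. A sketch (job_state is our own hypothetical helper): split the output into tokens, then split each token on "=".

```shell
# job_state: hypothetical helper that extracts one field from the
# key=value output of "scontrol show job".
job_state() {
    scontrol show job "$1" | tr ' ' '\n' | awk -F= '$1 == "JobState" { print $2 }'
}
```

For example, job_state 14629328 would print RUNNING for the job shown above.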
The job in PENDING state:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=Unknown
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:33 Scheduler=Main
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=
The job while running on a node:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:05:04 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=2023-07-11T15:26:33
StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T17:26:53 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-144
BatchHost=hpc-cpu-144
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=
The job “just” after being terminated:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:07:52 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=2023-07-11T15:26:33
StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T15:34:45 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-144
BatchHost=hpc-cpu-144
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=
After some time, the job is not known to the controller any more…
… but we can still get some information from the accounting (for 4 weeks) …
holtgrem_c@hpc-cpu-141$ sacct -j 14629381
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
14629381 bash short hpc-ag-cu+ 1 COMPLETED 0:0
14629381.ex+ extern hpc-ag-cu+ 1 COMPLETED 0:0
14629381.0 bash hpc-ag-cu+ 1 COMPLETED 0:0
You can use sacct -j JOBID --long | less -SR
to see all available accounting information.
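The default sacct columns omit memory usage; --format lets you pick your own. A sketch (jinfo is our own hypothetical wrapper; the field list is our choice, see sacct --helpformat for all fields):

```shell
# jinfo: hypothetical wrapper around sacct with a hand-picked field list.
jinfo() {
    sacct -j "$1" --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
}
```

MaxRSS in particular shows the peak resident memory of each job step.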
Sadly, it failed:
holtgrem_c@hpc-login-1$ scontrol show job 14629473
JobId=14629473 JobName=first-job.sh
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=761 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:44:31 EligibleTime=2023-07-11T15:44:31
AccrueTime=2023-07-11T15:44:31
StartTime=2023-07-11T15:44:54 EndTime=2023-07-11T15:44:54 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:44:54 Scheduler=Backfill
Partition=medium AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-219
BatchHost=hpc-cpu-219
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=1G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/data/cephfs-1/home/users/holtgrem_c/first-job.sh
WorkDir=/data/cephfs-1/home/users/holtgrem_c
StdErr=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
StdIn=/dev/null
StdOut=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
Power=
Troubleshooting our job failure:

More troubleshooting hints:

- scontrol show job JOBID | grep Reason shows why the scheduler ended the job.
- Does the WorkDir exist and do you have access? Try cd $WorkDir.
- Check the StdOut/StdErr log files, if any.
- Use sacct -j 14629473 --format=JobID,State,ExitCode,Elapsed,MaxVMSize to look for hints regarding running time/memory (VM) size.
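The hints above can be bundled into a small helper. A sketch (triage is our own hypothetical function):

```shell
# triage: hypothetical helper that bundles the troubleshooting hints.
triage() {
    local job="$1"
    # Scheduler's view (only while the job is still known to the controller):
    scontrol show job "$job" | grep -E 'JobState|Reason|ExitCode'
    # Accounting view (kept for about 4 weeks):
    sacct -j "$job" --format=JobID,State,ExitCode,Elapsed,MaxVMSize
}
```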
srun/sbatch

We can explicitly allocate resources with the srun and sbatch command lines:
- --job-name=MY-JOB-NAME: explicit naming
- --time=D-HH:MM:SS: max running time
- --partition=PARTITION: partition to run in
- --mem=MEMORY: allocate memory, use <num>G or <num>M
- --cpus-per-task=CORES: number of cores to allocate

Write a job script that …

- runs sleep 1m (hint: how can you figure out the maximal memory used?)
- is a job-1.sh that triggers job-2.sh on completion (is this useful? dangerous?)

Use online resources to figure out the right command line parameters.
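A sketch solution for the first exercise; all #SBATCH values below (name, time, memory) are our own illustrative choices, not prescribed values.

```shell
# Sketch solution for the "sleep 1m" exercise; values are illustrative.
cat > sleep-job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=sleep-test
#SBATCH --time=0-00:05:00
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

sleep 1m
EOF
# Submit with:  sbatch sleep-job.sh
# For the memory hint, check accounting after the job finished:
#   sacct -j JOBID --format=JobID,State,Elapsed,MaxRSS
```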
Run the following commands and use the online help and man $command to figure out the output:

- sdiag
- squeue
- sinfo
- scontrol show node NODE
- sprio -l -S -y
Provoke the following situations. In each case, look at the scontrol/sacct output and at the log files.
Partitions

Jobs run in the standard partition unless you pass --partition=PARTITION on the command line. The highmem, gpu, and mpi partitions require requesting access first.

Available by default:

- debug - very short jobs for debugging
- short - allows many jobs, up to 4h
- medium - up to 512 cores per user, up to 7 days
- long - up to 64 cores per user, up to 14 days

Request access by email to helpdesk:

- highmem - for high memory node access
- gpu - for GPU node access
- mpi - for running MPI jobs
- critical - for running many jobs for clinical/mission-critical work
squeue Output

You can tune the output of many Slurm programs, for example with squeue -o/--format.
See man squeue for details.
Example:
$ squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" --user=holtgrem_c
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON) TRES_PER_NODE
15382564 medium bash holtgrem_c R 30:25 1 1 hpc-cpu-63 N/A
Another goodie: pipe into less -S to get a horizontally scrollable output.
Your turn 🤸: look at man squeue
and find your “top 3 most useful” values.
.bashrc Aliases

alias sbi='srun --pty --time 7-00 --mem=5G --cpus-per-task 2 bash -i'
alias slurm-loginx='srun --pty --time 7-00 --partition long --x11 bash -i'
alias sq='squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b"'
alias sql='sq | less -S'
alias sqme='sq -u holtgrem_c'
alias sqmel='sqme | less -S'

(Note: "$@" has no effect inside an alias; Bash simply appends any trailing arguments after the expanded text.)
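When you need arguments somewhere other than the very end of the command, shell functions work better than aliases. A sketch mirroring the aliases above (function names are our own):

```shell
# Functions instead of aliases: "$@" is only meaningful inside a function.
sq() { squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" "$@"; }
sqme() { sq --user="$USER" "$@"; }
sqmel() { sqme "$@" | less -S; }
```

Now sqme --partition=short passes the extra option through to squeue as intended.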
In Bash scripts, set -x and set -v are heaven-sent:

- set -x prints commands before execution
- set -v prints commands as they are read

The GPUs allocated to your job are listed in the CUDA_VISIBLE_DEVICES environment variable:

hpc-login-1$ export | grep CUDA_VISIBLE_DEVICES
hpc-login-1$ srun --partition=gpu --gres=gpu:tesla:1 --pty bash
hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0"
hpc-gpu-1:~$ exit
res-login-1:~$ srun --partition=gpu --gres=gpu:tesla:2 --pty bash
hpc-gpu-2:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0,1"
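CUDA_VISIBLE_DEVICES is a comma-separated list of device indices, as the transcript above shows. A sketch helper (count_gpus is our own, hypothetical) that counts the allocated GPUs:

```shell
# count_gpus: hypothetical helper; splits CUDA_VISIBLE_DEVICES on commas
# and reports how many entries it holds.
count_gpus() {
    local IFS=','
    set -- $CUDA_VISIBLE_DEVICES   # unquoted on purpose: word splitting
    echo $#
}
```

With --gres=gpu:tesla:2 as above, count_gpus prints 2.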
Add --partition=highmem to the sbatch/srun command line.

Also note:

- sort allows for external memory (aka disk) sorting and makes for super memory-efficient sorting
- samtools sort allows fine-grained control over memory usage
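To illustrate the two bullets above (the size values are illustrative choices): GNU sort caps its in-memory buffer with -S and spills to temporary files under -T, and samtools sort caps per-thread memory with -m.

```shell
# GNU sort: cap the memory buffer (-S) and choose the spill directory (-T).
printf '3\n1\n2\n' | sort -n -S 10M -T "${TMPDIR:-/tmp}"

# samtools sort: cap per-thread memory with -m (not run here):
# samtools sort -m 1G -o sorted.bam input.bam
```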
scancel

- scancel JOBID: cancel a single job
- scancel --user=$USER: cancel all of your jobs
- see scancel --help for more options

Your turn 🤸: submit a job, cancel it, and look at the scontrol and sacct output.
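scancel can also filter by job state. A sketch (cancel_pending is our own hypothetical helper) that cancels only jobs still waiting in the queue, leaving running jobs alone:

```shell
# cancel_pending: hypothetical helper; --state filters which jobs to cancel.
cancel_pending() {
    scancel --user="$USER" --state=PENDING
}
```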
sacctmgr (1)

Query the QOS limits, e.g., MaxWall and MaxTRESPU:
$ sacctmgr show qos -p | cut -d '|' -f 1,19,20 | column -s '|' -t
Name MaxWall MaxTRESPU
normal cpu=512,mem=3.50T
debug 01:00:00 cpu=1000,mem=7000G
medium 7-00:00:00 cpu=512,mem=3.50T
critical cpu=12000,mem=84000G
long 14-00:00:00 cpu=64,mem=448G
...
sbatch Dependencies

- sbatch -d afterok:JOBID:JOBID:JOBID:... starts only after the listed jobs completed successfully
- sbatch -d afterany:JOBID:... starts after the listed jobs terminated, in any state
- sbatch --job-name NAME -d singleton runs only one job with this name at a time

See man sbatch for details.

Your turn 🤸: write two jobs with -d afterok:JOBID.
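A sketch for the dependency exercise (submit_chain, job-1.sh, and job-2.sh are our own placeholder names): sbatch --parsable prints only the job ID, which makes it easy to capture for the -d option.

```shell
# submit_chain: hypothetical helper; job-1.sh / job-2.sh are placeholders.
# sbatch --parsable prints just the job ID instead of the usual sentence.
submit_chain() {
    local first
    first=$(sbatch --parsable job-1.sh)
    sbatch -d "afterok:${first}" job-2.sh
}
```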
Sometimes you need to run a graphical application on the compute nodes
You must run a local X11 server (e.g., native on Linux, or MobaXterm on Windows)
Then, you can do:
$ ssh -X hpc-login-1
hpc-login-1$ srun --x11 --pty bash -i
hpc-cpu-141$ xterm
… and see an XTerm window locally
Your turn 🤸: start xterm if you have a local X11 server.
Reservations allow administrators to reserve resources for specific users.
For example, we have a reservation for this course:
$ scontrol show reservation hpc-course
ReservationName=hpc-course StartTime=2023-08-27T00:00:00 EndTime=2023-09-02T00:00:00 Duration=6-00:00:00
Nodes=hpc-cpu-[76,80,191-194,201-206,208-228],hpc-gpu-[6-7] NodeCnt=35 CoreCnt=592 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
TRES=cpu=1184
Users=holtgrem_c Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
To use this reservation, add --reservation=hpc-course to your sbatch/srun command line.