Slurm Scheduler and Resource Manager
Aims

- Learn how Slurm works as a resource manager and job scheduler.

Actions

- Use squeue, sinfo, and scontrol. 🤓 Google/Bing will help you find more.
👍 User Forum at https://hpc-talk.cubi.bihealth.org!
Resource Manager - keeps track of the cluster's resources (nodes, cores, memory, GPUs) and allocates them to jobs.

Job Scheduler - decides which queued job runs where and when, based on priorities and available resources.
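A quick way to see both sides from the login node (a minimal sketch; output omitted, since the columns depend on the cluster configuration):

$ sinfo     # resource manager view: partitions, nodes, and their states
$ squeue    # scheduler view: queued and running jobs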
holtgrem_c@hpc-login-1$ srun --pty --time=2:00:00 --partition=short \
--reservation=hpc-crash-course --mem=10G --cpus-per-task=1 bash -i
srun: job 14629328 queued and waiting for resources
srun: job 14629328 has been allocated resources
holtgrem_c@hpc-cpu-141$

Inspect the job with squeue -u $USER or scontrol show job 14629328.

squeue

🤸 What is the output of squeue?
holtgrem_c@hpc-cpu-141$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14629328 short bash holtgrem R 1:40 1 hpc-cpu-141

More info with --long:
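For example (a sketch; --long adds the time limit and a few more columns):

holtgrem_c@hpc-cpu-141$ squeue -u $USER --long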
scontrol show job

🤸 Let us look at scontrol show job 14629328:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629328
JobId=14629328 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:06:37 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:17:37 EligibleTime=2023-07-11T15:17:37
AccrueTime=2023-07-11T15:17:37
StartTime=2023-07-11T15:17:53 EndTime=2023-07-11T17:17:53 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:17:53 Scheduler=Backfill
Partition=training AllocNode:Sid=hpc-login-1:3631083
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-141
BatchHost=hpc-cpu-141
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=

A job in the PENDING state:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=Unknown
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:33 Scheduler=Main
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=

The same job while running on a node:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:05:04 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=2023-07-11T15:26:33
StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T17:26:53 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-144
BatchHost=hpc-cpu-144
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=

The job “just” after being terminated:
holtgrem_c@hpc-cpu-141$ scontrol show job 14629381
JobId=14629381 JobName=bash
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=661 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:07:52 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:26:33 EligibleTime=2023-07-11T15:26:33
AccrueTime=2023-07-11T15:26:33
StartTime=2023-07-11T15:26:53 EndTime=2023-07-11T15:34:45 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:26:53 Scheduler=Backfill
Partition=short AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-144
BatchHost=hpc-cpu-144
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/data/cephfs-1/home/users/holtgrem_c
Power=

After some time, the job is not known to the controller any more …
… but we can still get some information from the accounting (for 4 weeks) …
holtgrem_c@hpc-cpu-141$ sacct -j 14629381
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
14629381 bash short hpc-ag-cu+ 1 COMPLETED 0:0
14629381.ex+ extern hpc-ag-cu+ 1 COMPLETED 0:0
14629381.0 bash hpc-ag-cu+ 1 COMPLETED 0:0

You can use sacct -j JOBID --long | less -SR to see all available accounting information.
Sadly, our first batch job (first-job.sh) failed:
holtgrem_c@hpc-login-1$ scontrol show job 14629473
JobId=14629473 JobName=first-job.sh
UserId=holtgrem_c(100131) GroupId=hpc-ag-cubi(1005272) MCS_label=N/A
Priority=761 Nice=0 Account=hpc-ag-cubi QOS=normal
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2023-07-11T15:44:31 EligibleTime=2023-07-11T15:44:31
AccrueTime=2023-07-11T15:44:31
StartTime=2023-07-11T15:44:54 EndTime=2023-07-11T15:44:54 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-11T15:44:54 Scheduler=Backfill
Partition=medium AllocNode:Sid=hpc-login-1:3644832
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-cpu-219
BatchHost=hpc-cpu-219
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=1G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/data/cephfs-1/home/users/holtgrem_c/first-job.sh
WorkDir=/data/cephfs-1/home/users/holtgrem_c
StdErr=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
StdIn=/dev/null
StdOut=/data/cephfs-1/home/users/holtgrem_c/slurm-14629473.out
Power=

Troubleshooting our job failure:
More troubleshooting hints:
- scontrol show job JOBID | grep Reason
- Does the WorkDir exist and do you have access? Try cd $WorkDir.
- Check the StdOut/StdErr log files, if any.
- sacct -j 14629473 --format=JobID,State,ExitCode,Elapsed,MaxVMSize to look for hints regarding running time and memory (VM) size.

srun/sbatch

We can explicitly allocate resources with the srun and sbatch command lines:
- --job-name=MY-JOB-NAME: explicit naming
- --time=D-HH:MM:SS: max running time
- --partition=PARTITION: partition
- --mem=MEMORY: allocate memory, use <num>G or <num>M
- --cpus-per-task=CORES: number of cores to allocate

Write a job script that …

- … runs sleep 1m (hint: how can you figure out the maximal memory used?)
- … is a job-1.sh that triggers job-2.sh on completion (is this useful? dangerous?)

Use online resources to figure out the right command line parameters; a minimal sketch for the first exercise follows below.
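A sketch of such a script, assuming it is saved as sleep-job.sh (the name, partition, and resource values are placeholders to adjust to your setup):

#!/bin/bash
#SBATCH --job-name=sleep-job
#SBATCH --time=00:05:00
#SBATCH --partition=short
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

# The actual "work": sleep for one minute.
sleep 1m

Submit it with sbatch sleep-job.sh; afterwards, sacct -j JOBID --format=JobID,State,Elapsed,MaxRSS shows how much memory was actually used.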
Use the following commands and use the online help and man $command to figure out the output:
sdiag
squeue
sinfo
scontrol show node NODE
sprio -l -S -y
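For example (a sketch; replace hpc-cpu-141 with any node name that sinfo lists):

$ scontrol show node hpc-cpu-141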
Provoke the following situations:
In each case, look at scontrol/sacct output and look at log files.
Partitions

standard - used if you do not pass --partition=PARTITION on the command line; other partitions (e.g., highmem, gpu, mpi) must be selected explicitly.

Available by default:

- debug - very short jobs for debugging
- short - allows many jobs, up to 4 h
- medium - up to 512 cores per user, up to 7 days
- long - up to 64 cores per user, up to 14 days

Request access by email to the helpdesk:

- highmem - for high memory node access
- gpu - for GPU node access
- mpi - for running MPI jobs
- critical - for running many jobs for clinical/mission-critical work

squeue Output

You can tune the output of many Slurm programs.
For example squeue -o/--format.
See man squeue for details.
Example:
$ squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" --user=holtgrem_c
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON) TRES_PER_NODE
15382564 medium bash holtgrem_c R 30:25 1 1 hpc-cpu-63 N/A

Another goodie: pipe into less -S to get a horizontally scrollable output.
Your turn 🤸: look at man squeue and find your “top 3 most useful” values.
.bashrc Aliases

alias sbi='srun --pty --time 7-00 --mem=5G --cpus-per-task 2 bash -i'
alias slurm-loginx='srun --pty --time 7-00 --partition long --x11 bash -i'
alias sq='squeue -o "%.10i %9P %60j %10u %.2t %.10M %.6D %.4C %20R %b" "$@"'
alias sql='sq "$@" | less -S'
alias sqme='sq -u holtgrem_c "$@"'
alias sqmel='sqme "$@" | less -S'

In Bash scripts, set -x and set -v are heaven-sent:

- set -x prints commands before execution
- set -v prints commands as they are read

Allocating GPUs

hpc-login-1$ export | grep CUDA_VISIBLE_DEVICES
hpc-login-1$ srun --partition=gpu --gres=gpu:tesla:1 --pty bash
hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0"
hpc-gpu-1:~$ exit
hpc-login-1:~$ srun --partition=gpu --gres=gpu:tesla:2 --pty bash
hpc-gpu-2:~$ export | grep CUDA_VISIBLE_DEVICES
declare -x CUDA_VISIBLE_DEVICES="0,1"
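The same works from a batch script (a sketch, assuming you have access to the gpu partition; the tesla GRES name follows the interactive example above):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --time=01:00:00
#SBATCH --mem=8G

# Slurm exports the allocated GPU index(es) for the job.
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"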
High-Memory Jobs

Add --partition=highmem to the sbatch/srun command line. Also consider memory-efficient tools:

- sort allows for external memory (aka disk) sorting and makes for super memory-efficient sorting
- samtools sort allows fine-grained control over memory usage

scancel

- scancel JOBID
- scancel --user=$USER
- scancel --help for more options

Your turn 🤸: submit a job, cancel it, look at scontrol and sacct output.
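A sketch of that round trip (reusing the sleep-job.sh example from above; JOBID is whatever sbatch reports):

hpc-login-1$ sbatch sleep-job.sh       # note the job ID that sbatch prints
hpc-login-1$ scancel JOBID             # cancel it again
hpc-login-1$ scontrol show job JOBID   # JobState should now be CANCELLED
hpc-login-1$ sacct -j JOBID            # accounting also records the CANCELLED state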
sacctmgr (1)

- MaxWall …
- MaxTRESPU

$ sacctmgr show qos -p | cut -d '|' -f 1,19,20 | column -s '|' -t
Name MaxWall MaxTRESPU
normal cpu=512,mem=3.50T
debug 01:00:00 cpu=1000,mem=7000G
medium 7-00:00:00 cpu=512,mem=3.50T
critical cpu=12000,mem=84000G
long 14-00:00:00 cpu=64,mem=448G
...
sbatch

- sbatch -d afterok:JOBID:JOBID:JOBID:...
- sbatch -d afterany:JOBID:...
- sbatch --job-name NAME -d singleton
- man sbatch

Your turn 🤸: write two jobs with -d afterok:JOBID
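One way to wire this up (a minimal sketch; job-1.sh and job-2.sh are scripts you write yourself, and --parsable makes sbatch print only the job ID):

hpc-login-1$ JOB1=$(sbatch --parsable job-1.sh)
hpc-login-1$ sbatch -d afterok:$JOB1 job-2.sh

job-2.sh will only start once job-1.sh has finished with exit code 0.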
Sometimes you need to run a graphical application on the compute nodes
You must run a local X11 server (built into most Linux desktops; e.g., MobaXTerm on Windows).
Then, you can do:
$ ssh -X hpc-login-1
hpc-login-1$ srun --x11 --pty bash -i
hpc-cpu-141$ xterm

… and see an XTerm window locally.
Your turn 🤸: start xterm if you have a local X11 server.
Reservations allow administrators to reserve resources for specific users.
For example, we have a reservation for this course:
$ scontrol show reservation hpc-course
ReservationName=hpc-course StartTime=2023-08-27T00:00:00 EndTime=2023-09-02T00:00:00 Duration=6-00:00:00
Nodes=hpc-cpu-[76,80,191-194,201-206,208-228],hpc-gpu-[6-7] NodeCnt=35 CoreCnt=592 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
TRES=cpu=1184
Users=holtgrem_c Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)

To use this reservation, add --reservation=hpc-crash-course to your sbatch/srun command line.