HPC Crash Course

Fundamentals: HPC

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Outline

  • Introduction to HPC

What is High-Performance Computing?

  • Attempt at a definition
  • Trade-Offs

Attempt at a Definition: HPC …

  • refers to advanced computing techniques & technologies to solve complex computational problems efficiently
  • involves leveraging parallel processing, large-scale data analysis, and specialized hardware
    • to achieve high computational performance
  • systems consist of multiple computing nodes connected through a high-speed network, working together
  • enables researchers to tackle computationally intensive tasks that would be infeasible or too time-consuming otherwise
  • finds applications in a wide range of fields, including scientific research, engineering, data analytics, and machine learning

Trade-Offs of HPC

Advantages

  • fast execution of complex computational tasks
  • process and analyze large data sets
  • fast and large storage systems
  • MORE POWER 🦾

Drawbacks

  • learning curve / entry barrier
  • usually shared with other users
  • expensive to buy/operate
  • high power usage/CO2 footprint (reference)
  • “why is my job killed/crashing/not running?” 😶‍🌫️

There is no free lunch!

Programming HPC Systems

  • Common Paradigms for Parallel Programming

⚠️ “Warning”: just a quick and superficial overview ;-)

“Es gibt keinen Königsweg” (there is no royal road)

  • Master using Linux.
  • Master programming for
    • a single core
    • multiple cores on a single machine
    • multiple cores on multiple machines
    • GPUs
  • Realize that most problems
    • are embarrassingly parallel
    • have well-solved “cores”

Types of Parallelism

Single-core level (not in focus here)

  • Bit-level parallelism
    • e.g., bit-wise AND, OR process all bits of a word at once
  • Instruction-level parallelism
    • aka instruction pipelining

Programming level

  • Pipeline parallelism
  • Task parallelism
  • Data parallelism

Pipeline Parallelism

  • Consider an imaginary laundry
    • Laundry steps: washing, drying, ironing
    • One machine/operator per step
  • Naive / sequential execution is slow
  • We can speed up the process with pipeline parallelism (see the sketch below)
    • The bottleneck / dominating step is drying
    • Also: pipeline startup/shutdown
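
A minimal sketch of pipeline parallelism with Python's multiprocessing (the stage names, queue layout, and three laundry loads are made up for illustration):

import multiprocessing

def stage(name, inbox, outbox):
    # one pipeline stage in its own process: take an item from the
    # inbox, "process" it, pass it on; a None item signals shutdown
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(f"{item}->{name}")

if __name__ == "__main__":
    # one queue between each pair of neighboring stages
    queues = [multiprocessing.Queue() for _ in range(4)]
    stages = [
        multiprocessing.Process(target=stage, args=(name, queues[i], queues[i + 1]))
        for i, name in enumerate(["washed", "dried", "ironed"])
    ]
    for proc in stages:
        proc.start()
    for load in ["load-1", "load-2", "load-3"]:
        queues[0].put(load)  # feed the first stage
    queues[0].put(None)  # no more laundry
    # while one load is being dried, the next can already be washed
    while (item := queues[-1].get()) is not None:
        print(item)  # e.g., load-1->washed->dried->ironed
    for proc in stages:
        proc.join()

The speedup is limited by the slowest stage (here: drying), and the pipeline needs a few steps to fill up and to drain again.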

Task Parallelism (1)

  • Often, work can be split into different tasks
  • These tasks can be processed independently
  • Tasks may have different “sizes” (required RAM, …)
  • Tasks may have dependencies
  • If we can easily split (part of) the work into independent tasks:
    • embarrassingly parallel

Task Parallelism (2)

  • How to process these tasks?
  • Commonly, the manager-worker pattern is applied (see the sketch below).
  • The manager
    • has a (potentially dynamic) list of tasks
    • distributes the tasks to the workers
  • The workers
    • do the actual work
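
A minimal sketch of the manager-worker pattern, again with Python's multiprocessing (the squaring "work" is just a placeholder):

import multiprocessing

def worker(task_queue, result_queue):
    # worker: pull tasks until the manager sends the None sentinel
    while (task := task_queue.get()) is not None:
        result_queue.put((task, task * task))  # placeholder "work"

if __name__ == "__main__":
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    tasks = list(range(20))
    for task in tasks:  # the manager distributes the tasks
        task_queue.put(task)
    for _ in workers:  # one shutdown sentinel per worker
        task_queue.put(None)
    for _ in tasks:  # collect results (arrival order is not fixed)
        print(result_queue.get())
    for w in workers:
        w.join()

Here the task list is static; because the workers pull from a shared queue, the manager could just as well keep adding tasks while they run.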

Data Parallelism

  • specialized hardware examples: GPUs & TPUs
  • Regularly structured data (e.g., vectors, matrices, tensors) …
    • … have obvious decompositions
    • … and operations on them can be easily parallelized (see the sketch below)
  • Common applications:
    • Linear algebra / graphics
    • “Deep” learning etc.
    • Finite element methods for differential equations
      • (but with a twist!)
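
A small illustration, assuming NumPy is available: one vectorized operation applies the same instruction to many data elements at once.

import numpy as np

# two vectors with one million elements each
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# one data-parallel operation instead of an explicit Python loop;
# the same "+" is applied to every element, and NumPy can use the
# CPU's vector instructions under the hood
c = a + b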

HPC Systems and Architecture

  • Compute nodes
  • Cluster architecture
  • Distributed file systems
  • Job schedulers and resource management

⚠️ “Warning”: just a quick and superficial overview ;-)

Compute Nodes (1)

“Same-same (as your laptop), but different.”

  • 2+ sockets with
    • many-core CPUs
    • e.g., 2 sockets x 24 cores x 2 threads = 96 hardware threads
  • high memory (e.g., 200+ GB)
  • fast network interface card
    • “legacy”: 10GbE (x2)
    • modern: 25GbE (x2)
  • local disks
    • HDD or solid-state SSD/NVMe

Compute Nodes (2)

More differences from “consumer-grade” hardware:

  • error correcting memory (bit flips are real)
    • Google study from 2009: 8% of DIMMs see 1+ single-bit errors/year, 0.2% of DIMMs see 1+ double-bit errors/year
  • stronger fans
  • redundant power control
  • redundant disks

You are not the admin

no root/admin access, no sudo

Cluster Architecture

  • head nodes (login/transfer)
  • compute nodes
    • generic: cpu
    • specialized: high-mem/gpus
  • storage cluster with parallel file system
  • scheduler to orchestrate jobs
  • network/interconnect

Job Scheduler and Resource Management

sequenceDiagram
    autonumber
    User-)+Scheduler: sbatch $resource_args jobscript.sh
    Scheduler->>+Scheduler: add job to queue
    User-)+Scheduler: squeue / scontrol show job
    Scheduler-->>+User: job status
    Note right of Scheduler: scheduler loop
    Scheduler-)Compute Node: start job
    Compute Node->>Compute Node: execute job
    Compute Node-)+Scheduler: job complete

HPC Pitfalls

  • File System Quotas
  • Killed out of Memory

File System Quotas

  • Your home volume has tight quotas

  • Some programs will write a lot there

  • Solution: use work volume

    # assumes ~/work is (a symlink to) your larger work volume
    NAME=.cpan
    # create the matching directory on the work volume
    mkdir -p work/$NAME
    # move the contents over, then replace the original directory
    # with a relative symlink into the work volume
    mv $NAME/* work/$NAME
    rmdir $NAME
    ln -sr work/$NAME $NAME
  • E.g., run the above for NAME in ondemand miniconda3 R Downloads .apptainer .theano .singularity .npm .nextflow .local .debug .cpan .cache .aspera

Killed Out Of Memory Jobs

  • If you get random “Killed” messages, your job probably needs more memory than is available
  • In particular:
    • On login nodes, you only have 100 MB of RAM!
  • Solution: request more memory!
    • More on this in the Slurm session

Python Multiprocessing

  • Introduction
  • Reminder: map/apply
  • Example
  • Caveats

Introduction

  • Python has a threading library with low-level primitives
  • But why is there a multiprocessing module? Why processes?
    • Python has a global interpreter lock (GIL)
    • (it may be removed at some point, but it is there today)
    • Access to Python data structures is serialized
    • Multithreading only helps while threads are blocked (e.g., on I/O)
  • Multiprocessing to the rescue (see the timing sketch below):
    • Copy data to another process
    • This process can do the work
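
A small timing sketch of this effect: two threads running a CPU-bound loop are serialized by the GIL, while two processes run truly in parallel (the loop and iteration count are arbitrary):

import multiprocessing
import threading
import time

def busy(n=20_000_000):
    # pure-Python, CPU-bound loop; a thread holds the GIL while running it
    while n:
        n -= 1

def timed(label, make_worker):
    workers = [make_worker() for _ in range(2)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(f"{label}: {time.time() - start:.2f}s")

if __name__ == "__main__":
    # two threads: about as slow as running busy() twice in a row
    timed("threads", lambda: threading.Thread(target=busy))
    # two processes: each has its own interpreter and its own GIL
    timed("processes", lambda: multiprocessing.Process(target=busy))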

Reminder: apply/map

  • Remember functional programming? (see the example below)
  • map(func, list) -> list
    • Apply func to each element of the list to obtain a new list of the same size
    • Can be parallelized if there are no side effects
  • apply(func, list)
    • Apply func to each element of the list, ignoring results
    • Can be parallelized if there are no side effects
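
A tiny plain-Python illustration (no parallelism yet); note that in Python 3, map() returns a lazy iterator, hence the list(...):

def square(x):
    return x * x

# map: build a new list of the same size
squares = list(map(square, [1, 2, 3]))  # [1, 4, 9]

# "apply": call the function for its effect, ignoring the results
for x in [1, 2, 3]:
    print(square(x))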

Thread Pools

  • Thread pools are abstractions for parallelism
  • We create a pool with N threads
  • We let the thread pool process collections / lists of work
    • multiprocessing.Pool() (process pool ;-))

Example

import multiprocessing

def func(element):
    # placeholder for the real per-element work
    return element * element

if __name__ == "__main__":
    work = list(range(1_000_000))

    with multiprocessing.Pool(processes=4) as p:
        # force single-element chunks
        done_work = p.map(func, work, chunksize=1)

    print(done_work)

Caveats

  • Remember the “copy data to processes”?
  • The func must be serializable (top-level function!)
  • The arguments must be serializable (int, str, list, dict)
  • If you need to pass parameters etc., it can be better to make them part of each work item:

def func(element, a, b):
    # placeholder for the real work that uses extra parameters
    return element * a + b

def func2(element_params):
    # unpack one work item into the element and its parameters
    element, params = element_params
    return func(element, **params)

# ... (create the pool `p` as on the previous slide)

params = {
    "a": 1,
    "b": 2,
}

real_work = list(range(1_000_000))
work_with_params = [(x, params) for x in real_work]
done_work = p.map(func2, work_with_params, chunksize=1)

GNU Parallel

  • Introduction
  • Example

Introduction

  • GNU parallel is a command line tool that allows you to
    • run a potentially large list of tasks
    • with a fixed number of parallel tasks at the same time
  • It is similar to the earlier Python multiprocessing.Pool example
  • Excellent Tutorial

Example

THREADS=4

# create an MD5 checksum file next to every BAM file in the current
# directory, running at most $THREADS jobs at the same time;
# -t echoes each command to stderr before it runs
parallel -t -j $THREADS \
  'md5sum {} >{}.md5' ::: *.bam

# same, but change into each file's directory first so that the
# .md5 file contains just the file name, not the full path
parallel -t -j $THREADS \
  'cd {//}; md5sum {/} >{/}.md5' ::: *.bam

Placeholders

  • {} - whole argument
  • {/} - basename (filename) of argument
  • {//} - dirname of argument
  • see man parallel for more