HPC Crash Course

Snakemake Workflow Manager

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Overview

Aims

  • Understand the need for reproducibility when using computational methods.
  • Learn about the role of workflow managers.
  • Learn to use the Snakemake workflow manager.
  • Use Snakemake together with
    • Slurm
    • Conda

Actions

  • Installing and using Snakemake.
  • Writing modular Snakemake workflows.
  • Using Snakemake with Conda for software management and Slurm for execution.

Reproducibility in Computational Sciences

  • Reproducibility in General
  • Bioinformatics Reproducibility Issues

Reproducibility in General

  • Reproducing an experiment means that another researcher gets the same results with the same methodology.
    • An analysis of published P-values tells us that there are “too many P-values <0.05”.
  • We will not cover this topic itself, but rather:
    • How to ensure that a computational analysis is reproducible.
    • Should be simple, right! Right? Right?!

Bioinformatics Reproducibility Issues

A small selection

  • your software
  • other people’s software
  • parameters
  • data
  • actual program execution

Issue: Your Software (1)

  • Which exact code did you use to create the results?
  • Relatively easy to fix with source code version control (see the sketch below)
    • and some discipline!
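
For example, a small helper can record the exact commit of the analysis code next to the results. A minimal sketch (the results/ path is a made-up example; the script assumes the analysis lives in a Git checkout):

import subprocess
from pathlib import Path

def current_commit() -> str:
  # `git rev-parse HEAD` prints the hash of the currently checked-out commit
  return subprocess.run(
    ["git", "rev-parse", "HEAD"],
    check=True, capture_output=True, text=True,
  ).stdout.strip()

if __name__ == "__main__":
  Path("results").mkdir(exist_ok=True)
  # store the commit hash alongside the analysis results
  (Path("results") / "CODE_VERSION.txt").write_text(current_commit() + "\n")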

Issue: Other People’s Software (2)

  • Your code does not exist in a vacuum.
  • You use other people’s software.
    • Libraries (even standard library!)
    • Interpreter version (or compiler…)
    • Operating system kernel
    • Operating system user land
    • Other software that you run

Issue: Parameters

  • Most probably, your analysis software has parameters
    • You need to keep track of them
    • Ideally, you have the scripts for calling them in Git
    • You should probably also have configuration files in Git
  • When using random numbers, make sure to use seeds for reproducibility (see the sketch after this list)
  • “Hidden” parameters
    • environment variables (e.g., LANG)
    • system / hardware configuration
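
A minimal sketch of fixing seeds (the seed value 42 is an arbitrary example):

import random

random.seed(42)  # fix the seed of the standard library RNG
print([random.random() for _ in range(3)])  # identical on every run

# if the analysis uses NumPy (assumption), seed its generator explicitly as well:
# import numpy as np
# rng = np.random.default_rng(42)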

Issue: Data

  • Keep it safe
    • file permissions
    • backups
  • Keep integrity
    • checksums (see the sketch after this list)
  • (Of course: keep rights)
    • … of the researcher
    • … of the individual (if human data)
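
For checksums, a minimal Python sketch (the example file name is hypothetical):

import hashlib

def sha256sum(path: str) -> str:
  # read the file in chunks so large data files do not have to fit into memory
  digest = hashlib.sha256()
  with open(path, "rb") as handle:
    for chunk in iter(lambda: handle.read(1 << 20), b""):
      digest.update(chunk)
  return digest.hexdigest()

# e.g., compare against a previously recorded value (file name is hypothetical):
# assert sha256sum("input/case_1_R1.fastq.gz") == recorded_checksum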

Workflow Managers

  • Introduction
  • Snakemake
  • Nextflow
  • Galaxy

Introduction

  • Workflow Managers allow you to describe your analysis as a workflow and execute it
  • Key features / tasks are:
    • orchestration: select jobs to run and run them
    • logging: keep logs
    • continuability: allow continuing after failures, resuming failed jobs, etc.
  • There is some overlap with resource managers (e.g., Slurm), but workflow managers are usually more high level

Snakemake

  • Python-based workflow manager
    • you benefit from Python ecosystem and your knowledge
  • Similar to GNU/Unix Make “back to front”
  • Explicit file names / patterns
# run with `snakemake --cores=1`

rule all:
  input: "output.pdf"

rule write_md:
  output: "output.md"
  shell: "echo '# Hello World' > {output}"

rule md_to_pdf:
  input: "{name}.md"
  output: "{name}.pdf"
  shell: "pandoc {input} -o {output}"

Nextflow

  • runs on the JVM
  • DSL based on Groovy
    • Groovy is a “kind of Ruby”
  • Based on channels (file-based also possible)
  • Data is by default “hidden”
process write_md {
  output: path 'result.md'
  script: "echo 'Hello World' > result.md"
}

process md_to_pdf {
  input: path "result.md"
  output: path "result.pdf"
  script: "pandoc result.md -o result.pdf"
}

workflow {
  write_md | md_to_pdf
}

Won’t be covered in this course.

Galaxy

The Snakemake Workflow Manager

  • Introduction
  • Installation
  • Our First Workflow
  • A Real Workflow

Introduction

Installation

Quite straightforward ;-)

mamba create -y -n snake-pit -c conda-forge -c bioconda python=3.10 snakemake
conda activate snake-pit

Our First Workflow (1)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md_a:
  output: "A.md"
  shell: "echo '# A' > A.md"

rule write_md_b:
  output: "B.md"
  shell: "echo '# B' > B.md"

rule write_md_c:
  output: "C.md"
  shell: "echo '# C' > C.md"
rule convert_md_a:
  input: "A.md"
  output: "A.pdf"
  shell: "pandoc A.md -o A.pdf"

rule convert_md_b:
  input: "B.md"
  output: "B.pdf"
  shell: "pandoc B.md -o B.pdf"

rule convert_md_c:
  input: "C.md"
  output: "C.pdf"
  shell: "pandoc C.md -o C.pdf"

Our First Workflow (2)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md_a:
  output: "A.md"
  shell: "echo '# A' > {output}"

rule write_md_b:
  output: "B.md"
  shell: "echo '# B' > {output}"

rule write_md_c:
  output: "C.md"
  shell: "echo '# C' > {output}"
rule convert_md_a:
  input: "A.md"
  output: "A.pdf"
  shell: "pandoc {input} -o {output}"

rule convert_md_b:
  input: "B.md"
  output: "B.pdf"
  shell: "pandoc {input} -o {output}"

rule convert_md_c:
  input: "C.md"
  output: "C.pdf"
  shell: "pandoc {input} -o {output}"

… using {input} and {output}

Our First Workflow (3)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md:
  output: "{name}.md"
  shell: "echo '# {wildcards.name}' > {output}"
rule convert_md:
  input: "{name}.md"
  output: "{name}.pdf"
  shell: "pandoc {input} -o {output}"

… using wildcards

Our First Workflow (4)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md:
  output: md="{name}.md"
  shell: "echo '# {wildcards.name}' > {output.md}"
rule convert_md:
  input: md="{name}.md"
  output: pdf="{name}.pdf"
  shell: "pandoc {input.md} -o {output.pdf}"
  • using named input/output
  • NB: {input} and {output} are still available and will list all file names

Our First Workflow (5)

output_files = [f"{name}.pdf" for name in ["A", "B", "C"]]

rule all:
  input: output_files

rule write_md:
  output: md="{name}.md"
  shell: "echo '# {wildcards.name}' > {output.md}"
rule convert_md:
  input: md="{name}.md"
  output: pdf="{name}.pdf"
  shell: "pandoc {input.md} -o {output.pdf}"
  • demonstrates deriving the output file names from a Python list
  • output_files could also have been generated by reading a sample sheet, as sketched below
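
A minimal sketch of building output_files from a sample sheet instead (the file name samples.tsv and its single "sample" column are assumptions for illustration):

import csv

# at the top of the Snakefile: collect sample names from a tab-separated sample sheet
with open("samples.tsv", newline="") as f:
  samples = [row["sample"] for row in csv.DictReader(f, delimiter="\t")]

output_files = [f"{name}.pdf" for name in samples]

rule all:
  input: output_files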

A Real Workflow (1)

REF = "static/hs38.fasta"

rule output:
  input: "output/case_1.bam"

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t 4 -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {REF} - \
    | samblaster -M \
    | samtools sort -O BAM -o {output.bam}
    """

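  • In this pipeline, seqtk mergepe interleaves the paired FASTQ files, bwa mem maps them against the reference (with -R setting the read group), samblaster -M marks duplicates, and samtools sort writes a coordinate-sorted BAM file.
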
A Real Workflow (2)

REF = "static/hs38.fasta"

rule output:
  input: "output/case_1.bam"

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: 32
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {threads} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {REF} - \
    | samblaster -M \
    | samtools sort -O BAM -o {output.bam}
    """

A Real Workflow (3)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -@ {config[sort_threads]} -O BAM -o {output.bam}
    """

config.yaml

ref: "static/hs38.fasta"
map_threads: 32
sort_threads: 4
samples:
  - case_1
  - case_2

A Real Workflow (4)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -@ {config[sort_threads]} -O BAM -o {output.bam}
    """

(same config.yaml)

A Real Workflow (5)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  shell:
    r"""
    export TMPDIR=$(mktemp -d)
    trap "rm -rf $TMPDIR" EXIT

    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -T $TMPDIR/tmp -@ {config[sort_threads]} -O BAM -o {output.bam}
    """
  • with explicit temporary file handling / cleanup

A Real Workflow (7)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  wildcard_constraints:
    sample=r"[A-Za-z0-9_\-]+",
  shell:
    r"""
    export TMPDIR=$(mktemp -d)
    trap "rm -rf $TMPDIR" EXIT

    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -T $TMPDIR/tmp -@ {config[sort_threads]} -O BAM -o {output.bam}
    """
  • with wildcard constraint to avoid problems with file names

A Real Workflow (8)

In the Snakefile

rule map_reads:
  conda: "envs/map_reads.yaml"

envs/map_reads.yaml:

channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa ==0.7.17
  - samtools ==1.16
  - samblaster ==0.1.26
  - seqtk ==1.3
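
  • NB: Snakemake only creates and uses these per-rule Conda environments when conda integration is enabled on the command line, e.g. with --use-conda (newer Snakemake releases also accept --software-deployment-method conda).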

Running locally …

  • run locally with a given number of cores

    $ ls
    Snakefile
    config.yaml
    input/
    $ snakemake --cores=32 output/case_1.bam

    (this will not enforce threads or memory)

To run on the HPC …

  • We need to use the Slurm runner
  • We can set default resources, partition, etc. (a per-rule override is sketched below)
  • It is up to Slurm to enforce threads / memory using cgroups
$ snakemake --jobs=4 --slurm \
    --default-resources slurm_partition=long \
    output/case_1.bam
  • Logs will go into .snakemake/slurm_logs/rule_RULE_NAME/JOBID.log
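
Per-rule overrides use the same resource names; a minimal sketch (the partition name is a made-up example):

rule map_reads:
  resources:
    slurm_partition="medium",  # hypothetical partition, overrides --default-resources
    mem_mb=16 * 1024,
    runtime=24 * 60,  # minutes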

Input Functions

You can also use functions as inputs:

def foo_input(wildcards):
  sample_upper = wildcards.sample.upper()
  return f"input/{sample_upper}.fastq.gz"

rule foo:
  input: foo_input

If you want to return named input files, use unpack()

def foo_input(wildcards):
  sample_upper = wildcards.sample.upper()
  return {
    "left": f"input/{sample_upper}_1.fastq.gz",
    "right": f"input/{sample_upper}_2.fastq.gz",
  }

rule foo:
  input: unpack(foo_input)
  shell:
    r"""
    # ...
    seqtk mergepe {input.left} {input.right} \
    # ...
    """

Dynamic Resources

Similarly, the resources can be a callable as well:

import random

def foo_threads(wildcards, input):
  return random.choice([23, 42])

def foo_resources(wildcards, input, threads, attempt):
  # give 20% more resources per attempt
  multiplier = 1 + 0.2 * (attempt - 1)
  return {
    "mem_mb": int(1024 * threads * multiplier),
    "runtime": int(24 * 60 * multiplier),  # in minutes == 1 day
  }

rule foo:
  threads: foo_threads
  resources: foo_resources
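
Alternatively (a sketch, not from the slides above), a callable or lambda can be attached to a single resource keyword; the rule name and the numbers are made-up examples:

rule bar:
  threads: 8
  resources:
    # request 20% more memory on every automatic retry
    mem_mb=lambda wildcards, attempt: int(8 * 1024 * (1 + 0.2 * (attempt - 1))),
    runtime=24 * 60,  # minutes
  shell: "echo using {resources.mem_mb} MB"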

Quiz Time!

  • https://PollEv.com/manuelholtgrewe153