HPC Crash Course

Snakemake Workflow Manager

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Overview

Aims

  • Understand the need for reproducibility when using computational methods.
  • Learn about the role of workflow managers.
  • Learn to use the Snakemake workflow manager.
  • Use Snakemake together with
    • Slurm
    • Conda

Actions

  • Installing and using Snakemake.
  • Writing modular Snakemake workflows.
  • Using Snakemake with Conda for software management and Slurm for execution.

Reproducibility in Computational Sciences

  • Reproducibility in General
  • Bioinformatics Reproducibility Issues

Reproducibility in General

  • Reproducing an experiment means that another researcher gets the same results with the same methodology.
    • An analysis of published P-values tells us that there are “too many P-values <0.05”.
  • We will not cover this topic itself, but rather:
    • How to ensure that a computational analysis is reproducible.
    • Should be simple, right! Right? Right?!

Bioinformatics Reproducibility Issues

A small selection

  • your software
  • other people’s software
  • parameters
  • data
  • actual program execution

Issue: Your Software (1)

  • Which exact code did you use to create the results?
  • Relatively easy to fix with source code version control (see the sketch below)
    • and some discipline!
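
For example, a small helper can record the exact commit of the analysis code next to the results. A minimal sketch (the results/ path is a made-up example; the script assumes the analysis lives in a Git checkout):

import subprocess
from pathlib import Path

def current_commit() -> str:
  # `git rev-parse HEAD` prints the hash of the currently checked-out commit
  return subprocess.run(
    ["git", "rev-parse", "HEAD"],
    check=True, capture_output=True, text=True,
  ).stdout.strip()

if __name__ == "__main__":
  Path("results").mkdir(exist_ok=True)
  # store the commit hash alongside the analysis results
  (Path("results") / "CODE_VERSION.txt").write_text(current_commit() + "\n")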

Issue: Other People’s Software (2)

  • Your code does not exist in a vacuum.
  • You use other people’s software.
    • Libraries (even standard library!)
    • Interpreter version (or compiler…)
    • Operating system kernel
    • Operating system user land
    • Other software that you run

Issue: Parameters

  • Most probably, your analysis software has parameters
    • You need to keep track of them
    • Ideally, you have the scripts for calling them in Git
    • You should probably also have configuration files in Git
  • When using random numbers, make sure to use seeds for reproducibility (see the sketch after this list)
  • “Hidden” parameters
    • environment variables (e.g., LANG)
    • system / hardware configuration
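
A minimal sketch of fixing seeds (the seed value 42 is an arbitrary example):

import random

random.seed(42)  # fix the seed of the standard library RNG
print([random.random() for _ in range(3)])  # identical on every run

# if the analysis uses NumPy (assumption), seed its generator explicitly as well:
# import numpy as np
# rng = np.random.default_rng(42)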

Issue: Data

  • Keep it safe
    • file permissions
    • backups
  • Keep integrity
    • checksums (see the sketch after this list)
  • (Of course: keep rights)
    • … of the researcher
    • … of the individual (if human data)
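
For checksums, a minimal Python sketch (the example file name is hypothetical):

import hashlib

def sha256sum(path: str) -> str:
  # read the file in chunks so large data files do not have to fit into memory
  digest = hashlib.sha256()
  with open(path, "rb") as handle:
    for chunk in iter(lambda: handle.read(1 << 20), b""):
      digest.update(chunk)
  return digest.hexdigest()

# e.g., compare against a previously recorded value (file name is hypothetical):
# assert sha256sum("input/case_1_R1.fastq.gz") == recorded_checksum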

Workflow Managers

  • Introduction
  • Snakemake
  • Nextflow
  • Galaxy

Introduction

  • Workflow Managers allow you to describe your analysis as a workflow and execute it
  • Key features / tasks are:
    • orchestration: select jobs to run and run them
    • logging: keep logs
    • continuability: allow continuing after failures, resuming failed jobs, etc.
  • There is some overlap with resource managers (e.g., Slurm), but workflow managers are usually more high level

Snakemake

  • Python-based workflow manager
    • you benefit from Python ecosystem and your knowledge
  • Similar to GNU/Unix Make “back to front”
  • Explicit file names / patterns
# run with `snakemake --cores=1`

rule all:
  input: "output.pdf"

rule write_md:
  output: "output.md"
  shell: "echo '# Hello World' > {output}"

rule md_to_pdf:
  input: "{name}.md"
  output: "{name}.pdf"
  shell: "pandoc {input} -o {output}"

Nextflow

  • runs on the JVM
  • DSL based on Groovy
    • Groovy is a “kind of Ruby”
  • Based on channels (file-based also possible)
  • Data is by default “hidden”
process write_md {
  output: path 'result.md'
  script: "echo 'Hello World' > result.md"
}

process md_to_pdf {
  input: path "result.md"
  output: path "result.pdf"
  script: "pandoc result.md -o result.pdf"
}

workflow {
  write_md | md_to_pdf
}

Won’t be covered in this course.

Galaxy

The Snakemake Workflow Manager

  • Introduction
  • Installation
  • Our First Workflow
  • A Real Workflow

Introduction

Installation

Quite straightforward ;-)

mamba create -y -n snake-pit -c conda-forge -c bioconda python=3.10 snakemake
conda activate snake-pit

Our First Workflow (1)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md_a:
  output: "A.md"
  shell: "echo '# A' > A.md"

rule write_md_b:
  output: "B.md"
  shell: "echo '# B' > B.md"

rule write_md_c:
  output: "C.md"
  shell: "echo '# C' > C.md"
rule convert_md_a:
  input: "A.md"
  output: "A.pdf"
  shell: "pandoc A.md -o A.pdf"

rule convert_md_b:
  input: "B.md"
  output: "B.pdf"
  shell: "pandoc B.md -o B.pdf"

rule convert_md_c:
  input: "C.md"
  output: "C.pdf"
  shell: "pandoc C.md -o C.pdf"

Our First Workflow (2)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md_a:
  output: "A.md"
  shell: "echo '# A' > {output}"

rule write_md_b:
  output: "B.md"
  shell: "echo '# B' > {output}"

rule write_md_c:
  output: "C.md"
  shell: "echo '# C' > {output}"
rule convert_md_a:
  input: "A.md"
  output: "A.pdf"
  shell: "pandoc {input} -o {output}"

rule convert_md_b:
  input: "B.md"
  output: "B.pdf"
  shell: "pandoc {input} -o {output}"

rule convert_md_c:
  input: "C.md"
  output: "C.pdf"
  shell: "pandoc {input} -o {output}"

… using {input} and {output}

Our First Workflow (3)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md:
  output: "{name}.md"
  shell: "echo '# {wildcards.name}' > {output}"
rule convert_md:
  input: "{name}.md"
  output: "{name}.pdf"
  shell: "pandoc {input} -o {output}"

… using wildcards

Our First Workflow (4)

rule all:
  input: "A.pdf", "B.pdf", "C.pdf"

rule write_md:
  output: md="{name}.md"
  shell: "echo '# {wildcards.name}' > {output.md}"
rule convert_md:
  input: md="{name}.md"
  output: pdf="{name}.pdf"
  shell: "pandoc {input.md} -o {output.pdf}"
  • using named input/output
  • NB: {input} and {output} are still available and will list all file names

Our First Workflow (5)

output_files = [f"{name}.pdf" for name in ["A", "B", "C"]]

rule all:
  input: output_files

rule write_md:
  output: md="{name}.md"
  shell: "echo '# {wildcards.name}' > {output.md}"
rule convert_md:
  input: md="{name}.md"
  output: pdf="{name}.pdf"
  shell: "pandoc {input.md} -o {output.pdf}"
  • demonstrates deriving the output file names from a Python list
  • output_files could also have been generated by reading a sample sheet, as sketched below
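
A minimal sketch of building output_files from a sample sheet instead (the file name samples.tsv and its single "sample" column are assumptions for illustration):

import csv

# at the top of the Snakefile: collect sample names from a tab-separated sample sheet
with open("samples.tsv", newline="") as f:
  samples = [row["sample"] for row in csv.DictReader(f, delimiter="\t")]

output_files = [f"{name}.pdf" for name in samples]

rule all:
  input: output_files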

A Real Workflow (1)

REF = "static/hs38.fasta"

rule output:
  input: "output/case_1.bam"

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t 4 -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {REF} - \
    | samblaster -M \
    | samtools sort -O BAM -o {output.bam}
    """

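  • In this pipeline, seqtk mergepe interleaves the paired FASTQ files, bwa mem maps them against the reference (with -R setting the read group), samblaster -M marks duplicates, and samtools sort writes a coordinate-sorted BAM file.
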
A Real Workflow (2)

REF = "static/hs38.fasta"

rule output:
  input: "output/case_1.bam"

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: 32
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {threads} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {REF} - \
    | samblaster -M \
    | samtools sort -O BAM -o {output.bam}
    """

A Real Workflow (3)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -@ {config[sort_threads]} -O BAM -o {output.bam}
    """

config.yaml

ref: "static/hs38.fasta"
map_threads: 32
sort_threads: 4
samples:
  - case_1
  - case_2

A Real Workflow (4)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  shell:
    r"""
    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -@ {config[sort_threads]} -O BAM -o {output.bam}
    """

(same config.yaml)

A Real Workflow (5)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  shell:
    r"""
    export TMPDIR=$(mktemp -d)
    trap "rm -rf $TMPDIR" EXIT

    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -T $TMPDIR/tmp -@ {config[sort_threads]} -O BAM -o {output.bam}
    """
  • with explicit temporary file handling / cleanup

A Real Workflow (7)

configfile: "config.yaml"

rule output:
  input: [f"output/{sample}.bam" for sample in config["samples"]]

rule map_reads:
  input:
    left="input/{sample}_R1.fastq.gz",
    right="input/{sample}_R2.fastq.gz",
  output:
    bam="output/{sample}.bam",
  threads: config["map_threads"]
  resources:
    mem_mb=16*1024,
    runtime=24*60,  # in minutes == 1day
  wildcard_constraints:
    sample=r"[A-Za-z0-9_\-]+",
  shell:
    r"""
    export TMPDIR=$(mktemp -d)
    trap "rm -rf $TMPDIR" EXIT

    seqtk mergepe {input.left} {input.right} \
    | bwa mem -t {config[map_threads]} -R "@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}" \
      {config[ref]} - \
    | samblaster -M \
    | samtools sort -T $TMPDIR/tmp -@ {config[sort_threads]} -O BAM -o {output.bam}
    """
  • with wildcard constraint to avoid problems with file names

A Real Workflow (8)

In the Snakefile

rule map_reads:
  conda: "envs/map_reads.yaml"

envs/map_reads.yaml:

channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa ==0.7.17
  - samtools ==1.16
  - samblaster ==0.1.26
  - seqtk ==1.3
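
  • NB: Snakemake only creates and uses these per-rule Conda environments when conda integration is enabled on the command line, e.g. with --use-conda (newer Snakemake releases also accept --software-deployment-method conda).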

Running locally …

  • run locally with a given number of cores

    $ ls
    Snakefile
    config.yaml
    input/
    $ snakemake --cores=32 output/case_1.bam

    (this will not enforce threads or memory)

To run on the HPC …

  • We need to use the Slurm runner
  • We can set default resources, partition, etc. (a per-rule override is sketched below)
  • It is up to Slurm to enforce threads / memory using cgroups
$ snakemake --jobs=4 --slurm \
    --default-resources slurm_partition=long \
    output/case_1.bam
  • Logs will go into .snakemake/slurm_logs/rule_RULE_NAME/JOBID.log
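
Per-rule overrides use the same resource names; a minimal sketch (the partition name is a made-up example):

rule map_reads:
  resources:
    slurm_partition="medium",  # hypothetical partition, overrides --default-resources
    mem_mb=16 * 1024,
    runtime=24 * 60,  # minutes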

Input Functions

You can also use functions as inputs:

def foo_input(wildcards):
  sample_upper = wildcards.sample.upper()
  return f"input/{sample_upper}.fastq.gz"

rule foo:
  input: foo_input

If you want to return named input files, use unpack()

def foo_input(wildcards):
  sample_upper = wildcards.sample.upper()
  return {
    "left": f"input/{sample_upper}_1.fastq.gz",
    "right": f"input/{sample_upper}_2.fastq.gz",
  }

rule foo:
  input: unpack(foo_input)
  shell:
    r"""
    # ...
    seqtk mergepe {input.left} {input.right} \
    # ...
    """

Dynamic Resources

Similarly, the resources can be a callable as well:

import random

def foo_threads(wildcards, input):
  return random.choice([23, 42])

def foo_resources(wildcards, input, threads, attempt):
  # give 20% more resources per attempt
  multiplier = 1 + 0.2 * (attempt - 1)
  return {
    "mem_mb": int(1024 * threads * multiplier),
    "runtime": int(24 * 60 * multiplier),  # in minutes == 1 day
  }

rule foo:
  threads: foo_threads
  resources: foo_resources
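
Alternatively (a sketch, not from the slides above), a callable or lambda can be attached to a single resource keyword; the rule name and the numbers are made-up examples:

rule bar:
  threads: 8
  resources:
    # request 20% more memory on every automatic retry
    mem_mb=lambda wildcards, attempt: int(8 * 1024 * (1 + 0.2 * (attempt - 1))),
    runtime=24 * 60,  # minutes
  shell: "echo using {resources.mem_mb} MB"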

Quiz Time!

  • https://PollEv.com/manuelholtgrewe153