configuration

the SWIBRID pipeline is configured via entries in the config.yaml file

# CONFIG FILE FOR SWIBRID
#
# name of conda environment (will be passed to snakemake)
ENV: "swibrid_env"
# default options to pass to snakemake
# e.g., include ``--profile`` to reference a profile for SLURM
SNAKEOPTS: "-j 100 -k --rerun-incomplete --latency-wait 60 -p --retries 10" 
# study name (used in summary output csv file)
STUDY: "current_study"
# input fastq file (if demultiplexing is included)
INPUT: "path/to/reads.fastq.gz"
# sample sheet specifying barcodes and associated sample names
SAMPLE_SHEET: "path/to/sample_sheet.csv"
# fasta file with barcodes and primers used during demultiplexing 
BARCODES_PRIMERS: 'index/barcodes_primers.fa'
# list of samples for which  to run the pipeline
# IMPORTANT: these sample names must appear in the sample sheet if demultiplexing is used
SAMPLES: ["sample1","sample2"]
# minimum read length
MINLENGTH: 500
# reference genome (fasta file, index should be present)
REFERENCE: 'index/hg38.fa'
# LAST index of reference genome
LAST_INDEX: 'index/hg38db'
# minimap2 index of reference genome if ALIGNER==minimap2
MINIMAP_INDEX: 'index/hg38-ont.mmi'
# aligner to use (LAST is more precise and recommended, minimap2 is much faster)
ALIGNER: 'LAST'
# fasta file specifying the telomer repeat unit, e.g., CCTAACCCTAACCCTAACCCTAACCCTAAC
TELO_REPEAT: ''
# coordinates of switch region (altogether)
SWITCH: 'chr14:105583000-105872000:-'
# bed file with coordinates of individual switch regions
SWITCH_ANNOTATION: 'index/hg38_switch_regions.bed'
# bed file with gene annotation
ANNOTATION: 'index/gencode.v33.annotation.exon.gene_shorted.bed'
# whether or not to run variant detection (experimental feature)
DETECT_VARIANTS: False
# vcf file with variant annotation (e.g., from dbSNP restricted to switch region on chr14)
VARIANT_ANNOTATION: '' 
# max number of reads to cluster
NMAX: 50000
# clustering metric to use 
CLUSTERING_METRIC: 'cosine'
# clustering method to use
CLUSTERING_METHOD: 'average'
# fixed clustering cutoff
CLUSTERING_CUTOFF: 0.01
# cutoff for cluster filtering
CLUSTER_FILTERING_CUTOFF: 0.95
# max gap size to remove before clustering
MAX_GAP: 75
# bin size for breakpoint analysis
BINSIZE: 50
# blacklisted regions for insert detection
BLACKLIST_REGIONS: ''
# weights to use for averaging features over clusters (or reads)
WEIGHTS: 'cluster'
# number of reads to use for downsampling analysis
CLUSTER_DOWNSAMPLING_NREADS: 1000
# number of replicates used in downsampling
CLUSTER_DOWNSAMPLING_NREPS: 10

for a more in-depth control over parameters to individual functions, have a look at the snakemake file pipeline.snake. e.g., resource requirements are specified there, or which exact plots should be produced.

for extended testing, the config.yaml (or test_config.yaml when using swibrid test) draws on a section specifying input parameters for generation of synthetic reads:

SIMULATION_PARAMS:
  input_clones: 'input/input_clones.bed'    # bed file with coordinates of switch segments for input clones
  input_pars: 'input/input_pars.par'     # parameters for adding mutations and insertions/deletions
  model: "poisson"                       # distribution of clone sizes
  mix_n10_s1_10k:                        # parameters for specific sample (name should appear in SAMPLES)
    nclones: 10                          # number of input clones
    nreads: 1000                         # total number of input reads
    seed: 1                              # random seed
  mix_n1000_s2_10k:
    nclones: 1000
    nreads: 10
    seed: 2