Configuration
The SWIBRID pipeline is configured via entries in the config.yaml file:
# CONFIG FILE FOR SWIBRID
#
# name of conda environment (will be passed to snakemake)
ENV: "swibrid_env"
# default options to pass to snakemake
# e.g., include ``--profile`` to reference a profile for SLURM
SNAKEOPTS: "-j 100 -k --rerun-incomplete --latency-wait 60 -p --retries 10"
# study name (used in summary output csv file)
STUDY: "current_study"
# input fastq file (if demultiplexing is included)
INPUT: "path/to/reads.fastq.gz"
# sample sheet specifying barcodes and associated sample names
SAMPLE_SHEET: "path/to/sample_sheet.csv"
# fasta file with barcodes and primers used during demultiplexing
BARCODES_PRIMERS: 'index/barcodes_primers.fa'
# list of samples for which to run the pipeline
# IMPORTANT: these sample names must appear in the sample sheet if demultiplexing is used
SAMPLES: ["sample1","sample2"]
# minimum read length
MINLENGTH: 500
# reference genome (fasta file, index should be present)
REFERENCE: 'index/hg38.fa'
# LAST index of reference genome
LAST_INDEX: 'index/hg38db'
# minimap2 index of reference genome if ALIGNER==minimap2
MINIMAP_INDEX: 'index/hg38-ont.mmi'
# aligner to use (LAST is more precise and recommended, minimap2 is much faster)
ALIGNER: 'LAST'
# fasta file specifying the telomere repeat unit, e.g., CCTAACCCTAACCCTAACCCTAACCCTAAC
TELO_REPEAT: ''
# coordinates of switch region (altogether)
SWITCH: 'chr14:105583000-105872000:-'
# bed file with coordinates of individual switch regions
SWITCH_ANNOTATION: 'index/hg38_switch_regions.bed'
# bed file with gene annotation
ANNOTATION: 'index/gencode.v33.annotation.exon.gene_shorted.bed'
# whether or not to run variant detection (experimental feature)
DETECT_VARIANTS: False
# vcf file with variant annotation (e.g., from dbSNP restricted to switch region on chr14)
VARIANT_ANNOTATION: ''
# max number of reads to cluster
NMAX: 50000
# clustering metric to use
CLUSTERING_METRIC: 'cosine'
# clustering method to use
CLUSTERING_METHOD: 'average'
# fixed clustering cutoff
CLUSTERING_CUTOFF: 0.01
# cutoff for cluster filtering
CLUSTER_FILTERING_CUTOFF: 0.95
# max gap size to remove before clustering
MAX_GAP: 75
# bin size for breakpoint analysis
BINSIZE: 50
# blacklisted regions for insert detection
BLACKLIST_REGIONS: ''
# weights to use for averaging features over clusters (or reads)
WEIGHTS: 'cluster'
# number of reads to use for downsampling analysis
CLUSTER_DOWNSAMPLING_NREADS: 1000
# number of replicates used in downsampling
CLUSTER_DOWNSAMPLING_NREPS: 10
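Several of these entries depend on each other, and that can be checked before launching a run. The sketch below is not part of SWIBRID: `parse_switch` and `check_config` are hypothetical helpers that work on the dictionary obtained by loading config.yaml (for instance with `yaml.safe_load` from PyYAML). It assumes that SWITCH follows the `chrom:start-end:strand` pattern of the example above and that the index entry matching ALIGNER has to be non-empty.

```python
import re

# Hypothetical helpers for illustration; not part of the SWIBRID code base.
SWITCH_PATTERN = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+):(?P<strand>[+-])$"
)


def parse_switch(switch):
    """Split a SWITCH entry like 'chr14:105583000-105872000:-' into its parts."""
    match = SWITCH_PATTERN.match(switch)
    if match is None:
        raise ValueError(f"SWITCH does not look like chrom:start-end:strand: {switch!r}")
    start, end = int(match["start"]), int(match["end"])
    if start >= end:
        raise ValueError(f"SWITCH start must be smaller than end: {switch!r}")
    return match["chrom"], start, end, match["strand"]


def check_config(config):
    """Return a list of problems found in a loaded config dictionary."""
    problems = []
    if not config.get("SAMPLES"):
        problems.append("SAMPLES is empty")
    aligner = config.get("ALIGNER", "")
    # Assumption: each aligner needs its own index entry to be non-empty.
    index_key = {"LAST": "LAST_INDEX", "minimap2": "MINIMAP_INDEX"}.get(aligner)
    if index_key is None:
        problems.append(f"unknown ALIGNER: {aligner!r}")
    elif not config.get(index_key):
        problems.append(f"{index_key} must be set when ALIGNER is {aligner}")
    try:
        parse_switch(config.get("SWITCH", ""))
    except ValueError as error:
        problems.append(str(error))
    return problems


if __name__ == "__main__":
    example = {
        "SAMPLES": ["sample1", "sample2"],
        "ALIGNER": "LAST",
        "LAST_INDEX": "index/hg38db",
        "SWITCH": "chr14:105583000-105872000:-",
    }
    print(parse_switch(example["SWITCH"]))
    print(check_config(example))
```

An empty list from `check_config` means none of the checked entries look inconsistent; it does not verify that the referenced files exist.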
For more in-depth control over the parameters of individual functions, have a look at the Snakemake file pipeline.snake. For example, resource requirements and the exact plots to be produced are specified there.
For extended testing, the config.yaml (or test_config.yaml when using swibrid test) draws on a section specifying input parameters for the generation of synthetic reads:
SIMULATION_PARAMS:
    input_clones: 'input/input_clones.bed'  # bed file with coordinates of switch segments for input clones
    input_pars: 'input/input_pars.par'      # parameters for adding mutations and insertions/deletions
    model: "poisson"                        # distribution of clone sizes
    mix_n10_s1_10k:                         # parameters for specific sample (name should appear in SAMPLES)
        nclones: 10                         # number of input clones
        nreads: 1000                        # total number of input reads
        seed: 1                             # random seed
    mix_n1000_s2_10k:
        nclones: 1000
        nreads: 10
        seed: 2
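Because each per-sample entry has to match a name in SAMPLES, a small consistency check can catch mismatches early. The following sketch is an illustration rather than SWIBRID code: `check_simulation_params` is a hypothetical helper, and it assumes that every key in SIMULATION_PARAMS other than input_clones, input_pars and model is a nested mapping describing one sample with nclones, nreads and seed.

```python
# Hypothetical helper for illustration; not part of the SWIBRID code base.
GLOBAL_KEYS = {"input_clones", "input_pars", "model"}
SAMPLE_KEYS = {"nclones", "nreads", "seed"}


def check_simulation_params(config):
    """Return a list of problems in the SIMULATION_PARAMS section of a loaded config."""
    problems = []
    samples = set(config.get("SAMPLES", []))
    params = config.get("SIMULATION_PARAMS", {})
    for name, entry in params.items():
        if name in GLOBAL_KEYS:
            continue  # shared settings, not a sample
        if name not in samples:
            problems.append(f"{name} is not listed in SAMPLES")
        if not isinstance(entry, dict):
            problems.append(f"{name} should be a mapping of sample parameters")
            continue
        missing = SAMPLE_KEYS - set(entry)
        if missing:
            problems.append(f"{name} lacks {', '.join(sorted(missing))}")
    return problems


if __name__ == "__main__":
    config = {
        "SAMPLES": ["mix_n10_s1_10k"],
        "SIMULATION_PARAMS": {
            "model": "poisson",
            "mix_n10_s1_10k": {"nclones": 10, "nreads": 1000, "seed": 1},
            "mix_n1000_s2_10k": {"nclones": 1000, "nreads": 10},
        },
    }
    for problem in check_simulation_params(config):
        print(problem)
```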