command-line interface

swibrid

Command line interface for swibrid.

This is the main program entry point for the swibrid executable and its sub commands.

main commands:

swibrid setup copy config and snakefile to current directory
swibrid run run the entire pipeline
swibrid test perform a test run on synthetic reads

subcommands to run pipeline steps individually:

swibrid demultiplex demultiplex minION run
swibrid process_alignments process and filter alignment output
swibrid get_alignment_pars get alignment parameters (mismatch + gap rates) from align output
swibrid create_bed create bed files and summary tables for insert-containing reads
swibrid construct_msa construct a (pseudo) MSA
swibrid get_gaps find gaps / breakpoints in MSA
swibrid construct_linkage construct hierarchical clustering linkage
swibrid find_clusters get read clustering from linkage
swibrid find_variants variant calling
swibrid find_rearrangements detect rearrangements (structural variants)
swibrid plot_clustering plot the read clustering
swibrid get_breakpoint_stats get breakpoint histograms and stats
swibrid get_switch_homology get switch sequence homology
swibrid get_switch_motifs get switch sequence motifs
swibrid analyze_clustering analyze clustering results
swibrid downsample_clustering downsample and analyze clustering
swibrid get_summary get sample summary stats and plot
swibrid collect_results collect results for multiple samples

additional subcommands:

swibrid get_annotation prepare gene annotation
swibrid get_unique_clones_bed get bed file with unique clones
swibrid get_synthetic_reads create synthetic reads from bed file
swibrid plot_demux_report make a graphical summary of demultiplexing output
swibrid combine_replicates combine (aligned & processed) reads from replicates
swibrid downsample downsample (aligned & processed) reads from a sample

usage: swibrid [-h] [-v] [--version]

-h, --help: show this help message and exit

-v, --verbose: Increase verbosity.

--version: show program’s version number and exit

swibrid analyze_clustering

analyze clustering: given the clustering from find_clusters as input, various aggregate statistics are generated per cluster. output table contains the following columns:

cluster size (# of reads)
cluster isotype (most frequent)
average read fraction mapped per cluster
average read fraction mapped to same region
average read fraction ignored
consensus sequence
length of consensus
GC content of consensus
spread of breakpoints (number of bases with intermediate coverage)
size of inversions
size of duplications
frequency of inserts in cluster
fractional overlap of inserts in cluster
fractional overlap of insert-associated breakpoints
average insert length
average length of gaps around inserts
average isotype associated to inserts
number of homologous nucleotides around insert-associated breakpoints
number of homologous nucleotides around switch breakpoints
number of untemplated nucleotides around insert-associated breakpoints
number of untemplated nucleotides around switch breakpoints
length of different switch regions covered in cluster
cluster size adjusted for PCR length and GC bias

usage: swibrid analyze_clustering [-h] [--clustering CLUSTERING] [--msa MSA]
                                  [-o OUTPUT] [--gaps GAPS]
                                  [--max_gap MAX_GAP]
                                  [--max_realignment_gap MAX_REALIGNMENT_GAP]
                                  [--switch_coords SWITCH_COORDS]
                                  [--switch_annotation SWITCH_ANNOTATION]
                                  [--inserts INSERTS]
                                  [--realignments REALIGNMENTS]
                                  [--adjust_size]
                                  [--isotype_extra ISOTYPE_EXTRA]

-h, --help: show this help message and exit

--clustering <clustering>: required: find_clusters output file

--msa <msa>: required: file with (pseudo) multiple alignment of read sequences for clustering

-o <output>, --output <output>: required: output table

--gaps <gaps>: required: gap distribution

--max_gap <max_gap>: max gap size to ignore [75]

--max_realignment_gap <max_realignment_gap>: max gap size to count positions for realignment [75]

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

--inserts <inserts>: results.tsv file with inserts

--realignments <realignments>: file with breakpoint realignments

--adjust_size: calculate cluster size adjusted for fragment length and QC content

--isotype_extra <isotype_extra>: extra space added to define isotypes [500]

swibrid collect_results

collect stats from different samples: this will collect all summary files for the listed samples and produce one big table. optionally, this will also collect all inserts and produce a multi-sheet excel file. clustering analysis results can also be collected into an excel file.

usage: swibrid collect_results [-h] --samples SAMPLES --sample_stats
                               SAMPLE_STATS [--summary_path SUMMARY_PATH]
                               [--inserts_path INSERTS_PATH]
                               [--cluster_analysis_path CLUSTER_ANALYSIS_PATH]
                               [--cluster_stats CLUSTER_STATS]
                               [--inserts INSERTS] [--clean_column_names]

-h, --help: show this help message and exit

--samples <samples>: required: comma-separated list of samples

--sample_stats <sample_stats>: required: output file with sample_stats

--summary_path <summary_path>: path pattern for summary files [pipeline/{sample}/{sample}_summary.csv]

--inserts_path <inserts_path>: path pattern for insert tables [pipeline/{sample}/{sample}_inserts.tsv]

--cluster_analysis_path <cluster_analysis_path>: path pattern for cluster_analysis files [pipeline/{sample}/{sample}_cluster_analysis.csv]

--cluster_stats <cluster_stats>: output file with cluster stats

--inserts <inserts>: output file with inserts

--clean_column_names: re-name columns to make features a bit more explicit

swibrid combine_replicates

combine (aligned + processed) reads from replicates

usage: swibrid combine_replicates [-h] [-s SAMPLES] [-c COMBINED]
                                  [--nmax NMAX] [--aligner ALIGNER]
                                  [--msa_path MSA_PATH]
                                  [--msa_csv_path MSA_CSV_PATH]
                                  [--info_path INFO_PATH]
                                  [--process_path PROCESS_PATH]
                                  [--process_stats_path PROCESS_STATS_PATH]
                                  [--alignment_pars_path ALIGNMENT_PARS_PATH]
                                  [--aligned_seqs_path ALIGNED_SEQS_PATH]
                                  [--breakpoint_alignments_path BREAKPOINT_ALIGNMENTS_PATH]
                                  [--inserts_tsv_path INSERTS_TSV_PATH]
                                  [--bed_path BED_PATH]

-h, --help: show this help message and exit

-s <samples>, --samples <samples>: comma-separated list of samples to combine

-c <combined>, --combined <combined>: combined sample name

--nmax <nmax>: use only nmax reads [50000]

--aligner <aligner>: alignment algorithm used [last]

--msa_path <msa_path>: path pattern for msa files [“pipeline/{sample}/{sample}_msa.npz”]

--msa_csv_path <msa_csv_path>: path pattern for msa csv files [“pipeline/{sample}/{sample}_msa.csv”]

--info_path <info_path>: path pattern for info csv files [“input/{sample}_info.csv”]

--process_path <process_path>: path pattern for processing output [“pipeline/{sample}/{sample}_processed.out”]

--process_stats_path <process_stats_path>: path pattern for process stats csv [“pipeline/{sample}/{sample}_process_stats.csv”]

--alignment_pars_path <alignment_pars_path>: path pattern for alignment pars [“pipeline/{sample}/{sample}_{aligner}_pars.npz”]

--aligned_seqs_path <aligned_seqs_path>: path pattern for aligned sequences [“pipeline/{sample}/{sample}_aligned.fasta.gz

--breakpoint_alignments_path <breakpoint_alignments_path>: path pattern for breakpoint alignments [“pipeline/{sample}/{sample}_breakpoint_alignments.csv”]

--inserts_tsv_path <inserts_tsv_path>: path pattern for inserts tsv [“pipeline/{sample}/{sample}_inserts.tsv”]

--bed_path <bed_path>: path pattern for bed [“pipeline/{sample}/{sample}.bed”]

swibrid construct_linkage

construct hiearchical agglomerative clustering from MSA: from the input MSA, only coverage information will be used and gaps smaller than max_gap will be removed. by default, fastcluster is used with cosine metric and average linkage. output is a npz file containing the linkage matrix

EXPERIMENTAL: a sparse version building on pynndescent and gbbs agglomerative clustering, requires several libraries to be installed

usage: swibrid construct_linkage [-h] [--msa MSA] [--linkage LINKAGE]
                                 [--gaps GAPS] [--max_gap MAX_GAP]
                                 [--metric METRIC] [--method METHOD]
                                 [--nmax NMAX] [--ignore_unused_positions]
                                 [--use_sparse] [--distances DISTANCES]
                                 [--n_neighbors N_NEIGHBORS]
                                 [--n_threads N_THREADS] [--n_backup N_BACKUP]

-h, --help: show this help message and exit

--msa <msa>: required: file with (pseudo) multiple alignment of read sequences for clustering

--linkage <linkage>: required: output file contains linkage

--gaps <gaps>: output of get_gaps

--max_gap <max_gap>: max gap size to ignore [75]

--metric <metric>: clustering metric [cosine]

--method <method>: clustering method for hierarchical clustering [average]

--nmax <nmax>: use only nmax reads

--ignore_unused_positions: ignore positions that have no coverage

--use_sparse: EXPERIMENTAL: use sparse nearest-neighbor clustering

--distances <distances>: pre-computed sparse distance matrix

--n_neighbors <n_neighbors>: # of nearest neighbors for adjacency graph [100]

--n_threads <n_threads>: # of threads [1]

--n_backup <n_backup>: save full distances for these points in addition to the nearest neighbors [0]

swibrid construct_msa

construct (pseudo) MSA from processed alignments: this script will take aligned segments and sequences from process_alignments and construct a MSA. the MSA is stored as a sparse matrix of n_reads x n_positions, where the positions run over the concatenation of individual switch regions specified in a bed file. matrix values m code for coverage and nucleotide identity:

m % 10 gives nucleotide identity with A=1, C=2, G=3, T=4, gaps are zeros
m // 10 gives coverage of genomic regions by one read (1x, 2x, …) and indicates tandem duplications; if --use_orientation is set, inversions are also considered and associated coverage values are negative

usage: swibrid construct_msa [-h] [--coords COORDS] [--sequences SEQUENCES]
                             [--msa MSA] [--out OUT]
                             [--switch_coords SWITCH_COORDS]
                             [--switch_annotation SWITCH_ANNOTATION]
                             [--nmax NMAX] [--use_orientation]

-h, --help: show this help message and exit

--coords <coords>: required: process_alignments coordinate output

--sequences <sequences>: required: process_alignments sequence output

--msa <msa>: required: output file with (pseudo) multiple alignment of read sequences for clustering

--out <out>: required: output file with read info

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: required: bed file with coordinates of individual switch regions

--nmax <nmax>: use only nmax reads

--use_orientation: include alignment block orientation

swibrid create_bed

create bed files for all reads and summary table for insert-containing reads, including read sequence with insert part highlighted in uppercase

usage: swibrid create_bed [-h] [--processed_alignments PROCESSED_ALIGNMENTS]
                          [--bed BED] [--outfile OUTFILE]
                          [--raw_reads RAW_READS] [--paired_end_mode]
                          [--switch_coords SWITCH_COORDS]
                          [--annotation [ANNOTATION]]
                          [--switch_annotation SWITCH_ANNOTATION]

-h, --help: show this help message and exit

--processed_alignments <processed_alignments>: required: output of process_alignments

--bed <bed>: required: bed output with insert + switch coordinates for selected reads

--outfile <outfile>: required: output table with insert information

--raw_reads <raw_reads>: fasta file with minION reads (interleaved mates for paired-end data)

--paired_end_mode: EXPERIMENTAL: use paired-end mode (use interleaved mates as input)

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000]

--annotation <annotation>: bed file with gene annotation

--switch_annotation <switch_annotation>: bed file with switch annotation

swibrid demultiplex

demultiplex reads using output of BLAST against primers and barcodes

BLAST output is created with blastn -db {barcodes_primers} -query {query} -task blastn-short -max_target_seqs 50 -outfmt "6 saccver qaccver slen qlen pident length qstart qend evalue" -gapopen 5 -gapextend 2 -reward 3 -penalty -4 -evalue 1 -num_threads 1 -perc_identity 50 where {barcodes_primers} is a database containing barcode and primer sequences and {query} is the input fasta file

optional (but recommended) input is a sample sheet (tab delimited, no header or row names, 1st column barcode, 2nd column sample name)

usage: swibrid demultiplex [-h] -i INPUT -b BLAST -o OUTDIR [-c CUTOFF]
                           [-d DIST] [-m MIN_READS] [-f FIGURE] [-r REPORT]
                           [-s SAMPLE_SHEET] [--collapse] [--split_reads]
                           [--max_split_dist MAX_SPLIT_DIST] [--nmax NMAX]
                           [--add_umi] [--umi_regex UMI_REGEX]
                           [--write_info_chunks]

-h, --help: show this help message and exit

-i <input>, --input <input>: required: input fastq(.gz)

-b <blast>, --blast <blast>: required: BLAST output file (or stdin)

-o <outdir>, --outdir <outdir>: required: output directory

-c <cutoff>, --cutoff <cutoff>: % identify cutoff [70.0]

-d <dist>, --dist <dist>: max distance of match to end of sequence [100]

-m <min_reads>, --min_reads <min_reads>: min # of reads required to create separate file if no sample sheet is given [1000]

-f <figure>, --figure <figure>: summary figure

-r <report>, --report <report>: summary report

-s <sample_sheet>, --sample_sheet <sample_sheet>: sample sheet (barcode <tab> sample_name, no header)

--collapse: count reads regardless of whether one or both barcodes are present (for identical barcodes)? [False]

--split_reads: split reads with adjacent internal primer binding sites

--max_split_dist <max_split_dist>: maximum distance between internal primer matches to split read [200]

--nmax <nmax>: maximum number of reads

--add_umi: add UMI to read name

--umi_regex <umi_regex>: regex to find UMI in read description [[ACGTN]{12}]

--write_info_chunks: write info files in chunks

swibrid downsample

downsample (aligned + processed) reads for a sample

usage: swibrid downsample [-h] [-i INPUT] [-o OUTPUT] [-n N]
                          [--aligner ALIGNER] [--msa_path MSA_PATH]
                          [--msa_csv_path MSA_CSV_PATH]
                          [--info_path INFO_PATH]
                          [--process_path PROCESS_PATH]
                          [--process_stats_path PROCESS_STATS_PATH]
                          [--alignment_pars_path ALIGNMENT_PARS_PATH]
                          [--breakpoint_alignments_path BREAKPOINT_ALIGNMENTS_PATH]
                          [--inserts_tsv_path INSERTS_TSV_PATH]
                          [--bed_path BED_PATH]

-h, --help: show this help message and exit

-i <input>, --input <input>: input sample to downsample

-o <output>, --output <output>: downsampled sample name

-n <n>, --n <n>: use only n reads [5000]

--aligner <aligner>: alignment algorithm used [last]

--msa_path <msa_path>: path pattern for msa files [pipeline/{sample}/{sample}_msa.npz]

--msa_csv_path <msa_csv_path>: path pattern for msa csv files [pipeline/{sample}/{sample}_msa.csv]

--info_path <info_path>: path pattern for info csv files [input/{sample}_info.csv]

--process_path <process_path>: path pattern for processing output [pipeline/{sample}/{sample}_processed.out]

--process_stats_path <process_stats_path>: path pattern for process stats csv [pipeline/{sample}/{sample}_process_stats.csv]

--alignment_pars_path <alignment_pars_path>: path pattern for alignment pars [pipeline/{sample}/{sample}_{aligner}_pars.npz]

--breakpoint_alignments_path <breakpoint_alignments_path>: path pattern for breakpoint alignments [pipeline/{sample}/{sample}_breakpoint_alignments.csv]

--inserts_tsv_path <inserts_tsv_path>: path pattern for inserts tsv [pipeline/{sample}/{sample}_inserts.tsv]

--bed_path <bed_path>: path pattern for bed [pipeline/{sample}/{sample}.bed]

swibrid downsample_clustering

downsample the clustering to get more robust diversity measures: given the input MSA (and associated gaps), clustering is repeated nreps times on a subsample of nreads reads using fastcluster with (by defalt) cosine metric and average linkage and a cutoff derived from cluster_stats input. the following diversity measured are calculated:

mean_cluster_size (as fraction of reads)
std_cluster_size
nclusters_final (number of clusters after filtering)
nclusters_eff (from the entropy of the cluster size distribution pre-filtering)
cluster_gini: gini coefficient (post-filtering)
cluster_entropy: entropy (post-filtering)
cluster_inverse_simpson: inverse simpson coefficient (post-filtering)
top_clone_occupancy: relative fraction of reads in biggest cluster
big_clones_occupancy: fraction of reads in clusters > 1% occupancy

usage: swibrid downsample_clustering [-h] [--msa MSA]
                                     [--cluster_stats CLUSTER_STATS]
                                     [-s STATS] [--nreads NREADS]
                                     [--nreps NREPS] [--gaps GAPS]
                                     [--max_gap MAX_GAP] [--metric METRIC]
                                     [--method METHOD]
                                     [--ignore_unused_positions]
                                     [--filtering_cutoff FILTERING_CUTOFF]
                                     [--big_clone_cutoff BIG_CLONE_CUTOFF]

-h, --help: show this help message and exit

--msa <msa>: required: file with (pseudo) multiple alignment of read sequences for clustering

--cluster_stats <cluster_stats>: required: file with statistics from original clustering

-s <stats>, --stats <stats>: required: output file with clustering statistics after downsampling

--nreads <nreads>: number of reads used [1000]

--nreps <nreps>: number of replicates [10]

--gaps <gaps>: output of get_gaps

--max_gap <max_gap>: max gap size to ignore [75]

--metric <metric>: clustering metric [cosine]

--method <method>: clustering method for hierarchical clustering [average]

--ignore_unused_positions: ignore positions that have no coverage

--filtering_cutoff <filtering_cutoff>: filter out smallest clusters, keeping at least this fraction of reads [0.95]

--big_clone_cutoff <big_clone_cutoff>: cutoff to determine a “big” clone as fraction of clustered reads [0.01]

swibrid find_clusters

find clusters by cutting linkage dendrogram: inputs are the linkage matrix (from construct_linkage) and read info (from construct_msa). output is a csv file that adds cluster identity to the read info file

cutoff can be fixed or determined by scanning cutoffs between cmin and cmax and finding an inflection point (either by distance, or from the intersection of two exponential trends). small and isolated clusters are flagged, filtered clusters contain at least 95% of reads

usage: swibrid find_clusters [-h] -l LINKAGE -i INPUT -o OUTPUT [-s STATS]
                             [-f CUTOFF] [--scanning SCANNING]
                             [--fit_method FIT_METHOD] [-c CMIN] [-C CMAX]
                             [-n NC] [--filtering_cutoff FILTERING_CUTOFF]

-h, --help: show this help message and exit

-l <linkage>, --linkage <linkage>: required: output of construct_linkage

-i <input>, --input <input>: required: read info (output of construct_msa)

-o <output>, --output <output>: required: output file with clustering

-s <stats>, --stats <stats>: file with statistics

-f <cutoff>, --fix_cutoff <cutoff>: use fixed cutoff instead of data-derived

--scanning <scanning>: file with scanning data

--fit_method <fit_method>: cutoff determination method: “trend” or “distance” [distance]

-c <cmin>: minimum cutoff value [0.001]

-C <cmax>: maximum cutoff value [0.100]

-n <nc>: number of cutoff values between cmin and cmax [100]

--filtering_cutoff <filtering_cutoff>: filter out smallest clusters, keeping at least this fraction of reads [0.95]

swibrid find_rearrangements

find rearrangements (inversions/duplications) in MSA: regions with coverage larger or smaller than 1 and without gaps bigger than 25 nt. output files:

a .npz file with read indices, left/right positions and sizes for inversions and duplications per read
a bed file with consensus rearrangement coordinates (by merging individual events), type and allele frequency

usage: swibrid find_rearrangements [-h] --msa MSA -o OUT -m MAT
                                   [--switch_coords SWITCH_COORDS]
                                   [--switch_annotation SWITCH_ANNOTATION]
                                   [--max_rearrangement_gap MAX_REARRANGEMENT_GAP]

-h, --help: show this help message and exit

--msa <msa>: required: file with (pseudo) multiple alignment of read sequences for clustering

-o <out>, --out <out>: required: output file (bed)

-m <mat>, --mat <mat>: required: output file (matrix)

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

--max_rearrangement_gap <max_rearrangement_gap>: max gap within rearrangements to ignore [25]

swibrid find_variants

EXPERIMENTAL: find single-nucleotide variants in MSA and determine haplotypes. variable positions in the input MSA (excluding positions at or near gaps) where nucleotide distributions are different than what’s expected given mutation frequencies (estimated at the initial alignment step). potential variants are removed if

there’s less than min_cov non-gaps
fewer variants than expected
high strand bias
no cluster with at least min_cluster_cov reads and allele frequency > min_freq

variants are aggregated over clusters, and the distribution across clusters is tested for evenness. variants can be annotated by dbSNP id. motif occurrences around variants are also scored. haplotypes are determined by performing a weighted NMF on a matrix of allele frequencies per cluster and variant.

output is a vcf-style file with genomic (1-based) and relative (0-based) coordinates, and a matrix (sparse integer array, same shape as MSA) indicating which read contains a variant at which position

usage: swibrid find_variants [-h] [--msa MSA] [--clustering CLUSTERING]
                             [--stats STATS] [--reference REFERENCE]
                             [--pars PARS] [-o OUT] [-m MAT]
                             [--save_all SAVE_ALL]
                             [--switch_coords SWITCH_COORDS]
                             [--switch_annotation SWITCH_ANNOTATION]
                             [--fdr FDR] [--min_cov MIN_COV]
                             [--gap_window_size GAP_WINDOW_SIZE]
                             [--max_local_gap_freq MAX_LOCAL_GAP_FREQ]
                             [--min_cluster_cov MIN_CLUSTER_COV]
                             [--min_freq MIN_FREQ] [--nreads NREADS]
                             [--variant_annotation [VARIANT_ANNOTATION]]
                             [--haplotypes HAPLOTYPES] [--motifs MOTIFS]

-h, --help: show this help message and exit

--msa <msa>: required: file with (pseudo) multiple alignment of read sequences for clustering

--clustering <clustering>: required: clustering (find_clusters output)

--stats <stats>: required: clustering stats (find_clusters output)

--reference <reference>: required: reference sequence (switch chromosome)

--pars <pars>: required: alignment parameters to estimate mutation probability

-o <out>, --out <out>: required: output file (text)

-m <mat>, --mat <mat>: required: output file (matrix)

--save_all <save_all>: save all variants, not just those passing filters

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

--fdr <fdr>: FDR for variant calling [0.05]

--min_cov <min_cov>: minimum read coverage at potentially variable positions [50]

--gap_window_size <gap_window_size>: size of window to compute local gap frequency [10]

--max_local_gap_freq <max_local_gap_freq>: max. local gap frequency in window [0.7]

--min_cluster_cov <min_cluster_cov>: minimum read coverage at potentially variable positions per cluster [10]

--min_freq <min_freq>: minimum allele frequency at potentially variable positions (in total or per cluster) [0.40]

--nreads <nreads>: downsample to this number of reads to control sensitivity

--variant_annotation <variant_annotation>: variant annotation file (vcf.gz; e.g., from 1000Genomes project)

--haplotypes <haplotypes>: cluster haplotypes

--motifs <motifs>: comma-separated list of sequence motifs to look up at variant positions [Cg,wrCy,Tw]

swibrid get_alignment_pars

get LAST-like parameters from a SAM or MAF file. this will go through the alignments and tally mutation and indel frequencies.

usage: swibrid get_alignment_pars [-h] [-i INF] [-r REF] [-o OUT]

-h, --help: show this help message and exit

-i <inf>, --inf <inf>: input .bam or .maf.gz file, or .par from LAST

-r <ref>, --ref <ref>: reference genome

-o <out>, --out <out>: output file (.npz)

swibrid get_annotation

prepare shortened annotation file for insert annotation. input is a GTF file (e.g., from GENCODE), output is a bed file

usage: swibrid get_annotation [-h] [-i INPUT] [-o OUT]

-h, --help: show this help message and exit

-i <input>, --input <input>: input gtf

-o <out>, --out <out>: output bed

swibrid get_breakpoint_stats

analyze breakpoint statistics: this will produce aggregate statistics on breakpoints, either averaged over reads, or clusters. homology scores are calculated from binned breakpoint frequencies weighted by homology of bins; motif scores are calculated from binned breakpoint frequencies weighted by motif occurrences. output file contains following values:

breaks_normalized: number of breaks divided by number of reads / clusters
frac_breaks_duplications: fraction of breaks leading to duplication events
frac_breaks_inversions: fraction of breaks leading to inversion events
frac_breaks_single: fraction of breaks from SM to elsewhere
frac_breaks_sequential: fraction of breaks to different regions but not SM
frac_breaks_intra: fraction of breaks within a region
frac_breaks_inversions/duplications_intra: frac breaks with inversions/duplications within regions
frac_breaks_X_X: fraction of breaks connecting indicated regions
spread_XX: spread (standard deviation) of breakpoint positions in a region
homology_fw: homology in bins around breakpoint positions (same orientation)
homology_rv: homology in bins around breakpoint positions (opposite orientation)
homology_fw/rv_XX: homologies for breaks connecting indicated regions
donor/acceptor_score(_XX): motif scores for donor and acceptor breakpoints for different motifs (subdivided by region)
donor/acceptor_complexity(_XX): sequence complexity for donoer and acceptor breakpoints (subdivided by region)

usage: swibrid get_breakpoint_stats [-h] -g GAPS -r REARRANGEMENTS
                                    [-c CLUSTERING] [-a CLUSTERING_ANALYSIS]
                                    -o OUT [-p PLOT] [-b BINSIZE]
                                    [--scale_factor SCALE_FACTOR]
                                    [--max_gap MAX_GAP] [--homology HOMOLOGY]
                                    [--motifs MOTIFS] [--sample SAMPLE]
                                    [--switch_coords SWITCH_COORDS]
                                    [--switch_annotation SWITCH_ANNOTATION]
                                    [--use_clones USE_CLONES]
                                    [--weights WEIGHTS] [--range RANGE]
                                    [--top_donor TOP_DONOR]
                                    [--top_acceptor TOP_ACCEPTOR] [-n NTOP]
                                    [--reference REFERENCE]

-h, --help: show this help message and exit

-g <gaps>, --gaps <gaps>: required: file with gap positions (output of get_gaps)

-r <rearrangements>, --rearrangements <rearrangements>: required: file with inversion/duplication positions (output of find_rearrangements)

-c <clustering>, --clustering <clustering>: file with clustering results

-a <clustering_analysis>, --clustering_analysis <clustering_analysis>: file with clustering analysis

-o <out>, --out <out>: required: output file

-p <plot>, --plot <plot>: plot file

-b <binsize>, --binsize <binsize>: binsize [50]

--scale_factor <scale_factor>: factor to increase binsize for 2D histogram [10]

--max_gap <max_gap>: max gap size to ignore [75]

--homology <homology>: file with homology values (output of get_switch_homology)

--motifs <motifs>: file with motif counts (output of get_switch_motifs)

--sample <sample>: sample name (for figure title)

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

--use_clones <use_clones>: comma-separated list of clones to use or ‘all’ (default: filtered clusters)

--weights <weights>: use different weights (“cluster” | “reads” | “adjusted”) [cluster]

--range <range>: range of kmer sizes, e.g., 3,5-7 [5]

--top_donor <top_donor>: output additional file with top donor sequences

--top_acceptor <top_acceptor>: output additional file with top acceptor sequences

-n <ntop>, --ntop <ntop>: number of top bins to select

--reference <reference>: genome fasta file

swibrid get_gaps

find gaps in (pseudo)MSA: takes the MSA matrix from construct_MSA and finds all gap positions. output is a .npz file containing several integer arrays

read_idx: index of read
gap_left: left end (within the MSA; 0-based)
gap_right: right end
gap_size: size of gap

usage: swibrid get_gaps [-h] --msa MSA -o OUT [--paired_end_mode]
                        [--msa_csv MSA_CSV]

-h, --help: show this help message and exit

--msa <msa>: required: .npz file with (pseudo) multiple alignment of read sequences for clustering

-o <out>, --out <out>: required: output file (.npz)

--paired_end_mode: use paired-end mode

--msa_csv <msa_csv>: msa read info (for paired_end_mode)

swibrid get_summary

produce a summary of features derived from a sample and a plot. collects statistics produced by process_alignments, find_clusters, and downsample_clustering. produces cluster distribution statistics similar to downsample_clustering (the latter are averaged):

mean_cluster_size (as fraction of reads)
std_cluster_size
nclusters_final (number of clusters after filtering)
nclusters_eff (from the entropy of the cluster size distribution pre-filtering)
cluster_gini: gini coefficient (post-filtering)
cluster_entropy: entropy (post-filtering)
cluster_inverse_simpson: inverse simpson coefficient (post-filtering)
top_clone_occupancy: relative fraction of reads in biggest cluster
big_clones_occupancy: fraction of reads in clusters > 1% occupancy
size_length_bias: regression coefficient of log(cluster_size) ~ length
size_GC_bias: regression coefficient of log(cluster_size) ~ GC

averages cluster-specific features from analyze_clustering and get_breakpoint_stats over clusters. collects number of reads / clusters per isotype. gets statistics on variants (germline vs. somatic, transitions vs. transversions, etc.)

usage: swibrid get_summary [-h] [--sample SAMPLE] [--figure FIGURE]
                           [--stats STATS] [--process PROCESS] [--info INFO]
                           [--gaps GAPS] [--breakpoint_stats BREAKPOINT_STATS]
                           [--max_gap MAX_GAP] [--clustering CLUSTERING]
                           [--scanning SCANNING]
                           [--cluster_stats CLUSTER_STATS]
                           [--cluster_analysis CLUSTER_ANALYSIS]
                           [--cluster_downsampling CLUSTER_DOWNSAMPLING]
                           [--variants [VARIANTS]]
                           [--switch_coords SWITCH_COORDS]
                           [--switch_annotation SWITCH_ANNOTATION]
                           [--use_clones USE_CLONES] [--weights WEIGHTS]
                           [--max_n_blunt MAX_N_BLUNT]
                           [--big_clone_cutoff BIG_CLONE_CUTOFF]

-h, --help: show this help message and exit

--sample <sample>: sample name

--figure <figure>: output figure

--stats <stats>: output stats

--process <process>: stats file from process_alignments

--info <info>: read info

--gaps <gaps>: gap distribution

--breakpoint_stats <breakpoint_stats>: breakpoint statistics

--max_gap <max_gap>: max gap size to ignore [75]

--clustering <clustering>: find_clusters output file

--scanning <scanning>: find_clusters dendrogram scanning

--cluster_stats <cluster_stats>: find_clusters stats

--cluster_analysis <cluster_analysis>: analyze_clusters output

--cluster_downsampling <cluster_downsampling>: downsample_clusters output

--variants <variants>: file with variant table (from find_variants)

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

--use_clones <use_clones>: comma-separated list of clones to use

--weights <weights>: specify weights (“cluster” | “reads” | “adjusted”) [cluster]

--max_n_blunt <max_n_blunt>: max avg. number of untemplated / homologous nucleotides of reads in cluster to be considered a blunt end [0.5]

--big_clone_cutoff <big_clone_cutoff>: cutoff to determine a “big” clone as fraction of clustered reads [0.01]

swibrid get_switch_homology

analyze sequence homology within switch region bins by jaccard similarity of k-mers: from switch region coordinates and a genome file, this script assesses homology of pairs of bins by jaccard similarity of k-mer occurrences in forward or reverse orientation. output is a .npz file with pairwise forward and reverse homology arrays for each value of k

usage: swibrid get_switch_homology [-h] -g GENOME -o OUTPUT
                                   --switch_annotation SWITCH_ANNOTATION
                                   [--switch_coords SWITCH_COORDS]
                                   [-b BINSIZE] [--range RANGE]
                                   [--figure FIGURE]

-h, --help: show this help message and exit

-g <genome>, --genome <genome>: required: genome fasta file

-o <output>, --output <output>: required: output file (npz)

--switch_annotation <switch_annotation>: required: bed file with switch annotation

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

-b <binsize>, --binsize <binsize>: binsize [100]

--range <range>: range of kmer sizes, e.g., 3,5-7 [5]

--figure <figure>: output figure

swibrid get_switch_motifs

analyze sequence content within switch region bins by a number of motifs. given a list of motifs, genome file and switch region coordinates, this script counts the occurrences of these motifs in coordinate bins. output is a .npz file with count arrays for each motif

usage: swibrid get_switch_motifs [-h] -g GENOME -o OUTPUT
                                 [--switch_coords SWITCH_COORDS]
                                 [--switch_annotation SWITCH_ANNOTATION]
                                 [-b BINSIZE] [--motifs MOTIFS] [-k K]
                                 [--figure FIGURE]

-h, --help: show this help message and exit

-g <genome>, --genome <genome>: required: genome fasta file

-o <output>, --output <output>: required: output file (npz)

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000:-]

--switch_annotation <switch_annotation>: bed file with switch annotation

-b <binsize>, --binsize <binsize>: binsize [100]

--motifs <motifs>: comma-separated list of sequence motifs [S=G/C,W=A/T,WGCW,GAGCT,GGGST]

-k <k>: value k for estimate of k-mer complexity [5]

--figure <figure>: output figure

swibrid get_synthetic_reads

create synthetic reads from bed file. mutations, insertions and deletions will be added according to parameters estimated by LAST. clone sizes can be uniform or distributed as Poisson or Negative Binomial. additional homozygous / heterozygous / other variants can be added

usage: swibrid get_synthetic_reads [-h] -b BED -r REFERENCE -p PAR -o OUT
                                   [-i INFO] [-n N] [-k K] [-d DISTRIBUTION]
                                   [--nbinom_alpha NBINOM_ALPHA]
                                   [--variants [VARIANTS]]
                                   [--rearrangements [REARRANGEMENTS]]
                                   [-s SEED] [--no_mutations] [--no_deletions]
                                   [--no_insertions]

-h, --help: show this help message and exit

-b <bed>, --bed <bed>: required: input bed file

-r <reference>, --reference <reference>: required: reference sequence

-p <par>, --par <par>: required: file with LAST parameters

-o <out>, --out <out>: required: output fasta / fastq (.gz) file

-i <info>, --info <info>: output info file

-n <n>, --n <n>: use n random entries of input bed file

-k <k>, --k <k>: number of mutated copies per sequence [1.0]

-d <distribution>, --distribution <distribution>: distribution type for k (delta | poisson | nbinom | P | NB | D) [poisson]

--nbinom_alpha <nbinom_alpha>: dispersion alpha of negative binomial [4.0]

--variants <variants>: variant file (like output of find_variants) to add variants

--rearrangements <rearrangements>: rearrangement file (bed-like output of find_rearrangements) to add rearrangements

-s <seed>, --seed <seed>

--no_mutations

--no_deletions

--no_insertions

swibrid get_unique_clones_bed

produce a bed file with unique clones from the biggest clusters.

usage: swibrid get_unique_clones_bed [-h] -b BED -c CLUSTERING [-o OUTPUT]
                                     [--cut CUT]

-h, --help: show this help message and exit

-b <bed>, --bed <bed>: required: input bed file

-c <clustering>, --clustering <clustering>: required: clustering results

-o <output>, --output <output>: output file [stdout]

--cut <cut>: use only this fraction of the biggest clusters [0.8]

swibrid plot_clustering

Given a MSA and clustering results, this will display individual reads mapping over the switch regions. reads will be ordered as dictated by the linkage, or simply by isotype and cluster value. reads can be colored by different variables (isotype, cluster, haplotype, coverage, strand, orientation or other columns present in the info file). an additional sidebar can display additional variables. variant positions or breakpoint realignment statistics can also be indicated.

usage: swibrid plot_clustering [-h] --msa MSA --figure FIGURE
                               --clustering_results CLUSTERING_RESULTS
                               --switch_annotation SWITCH_ANNOTATION
                               [--linkage LINKAGE] [--sample SAMPLE]
                               [--info INFO]
                               [--clustering_stats CLUSTERING_STATS]
                               [--cutoff CUTOFF]
                               [--switch_coords SWITCH_COORDS]
                               [--annotation [ANNOTATION]]
                               [--color_by COLOR_BY]
                               [--sidebar_color_by SIDEBAR_COLOR_BY]
                               [--show_inserts] [--coords COORDS]
                               [--no_x_ticks] [--omit_scale_bar]
                               [--fig_width FIG_WIDTH]
                               [--fig_height FIG_HEIGHT]
                               [--linkage_border LINKAGE_BORDER]
                               [--cutoff_color CUTOFF_COLOR]
                               [--variants_table VARIANTS_TABLE]
                               [--variants_matrix VARIANTS_MATRIX]
                               [--haplotypes HAPLOTYPES]
                               [--realignments REALIGNMENTS] [--dpi DPI]
                               [--chunksize CHUNKSIZE] [--cmax CMAX]
                               [--paired_end_mode]

-h, --help: show this help message and exit

--msa <msa>: required: file(s) with (pseudo) multiple alignment of read sequences for clustering

--figure <figure>: required: output figure

--clustering_results <clustering_results>: required: file contains clustering results

--switch_annotation <switch_annotation>: BED file with switch annotation

--linkage <linkage>: file containing linkage

--sample <sample>: sample name (for figure title)

--info <info>: file with read info

--clustering_stats <clustering_stats>: file contains clustering stats with cutoff value

--cutoff <cutoff>: show cutoff line

--switch_coords <switch_coords>: Coordinates of switch region [chr14:106050000-106337000:-].

--annotation <annotation>: bed file with gene annotation

--color_by <color_by>: Color reads by isotype, cluster, haplotype, or other columns in the ‘info’ file [isotype].

--sidebar_color_by <sidebar_color_by>: Color sidebar reads by (comma-separated list of) isotype, cluster, haplotype, or other columns in the ‘info’ file [none].

--show_inserts: Show insert locations.

--coords <coords>: file with processed read coordinates (required if inserts are shown)

--no_x_ticks: Omit the x-axis ticks.

--omit_scale_bar: Omit scale bar.

--fig_width <fig_width>: Height of figure in inches [6].

--fig_height <fig_height>: width of figure in inches [8]

--linkage_border <linkage_border>: Fraction of figure used for dendrogram [0.10].

--cutoff_color <cutoff_color>: Color of dendrogram cutoff line [r].

--variants_table <variants_table>: Show variant positions from variant table (from find_variants).

--variants_matrix <variants_matrix>: Indicate variant occurrences from variant matrix (from find_variants).

--haplotypes <haplotypes>: File with haplotype clustering (from find_variants).

--realignments <realignments>: Indicate realignment scores with breakpoint realignments.

--dpi <dpi>: DPI of output figure [300].

--chunksize <chunksize>: Plot image in chunks of reads of this size [5000].

--cmax <cmax>: Maximum height in dendrogram plot.

--paired_end_mode: Use paired-end mode (requires coords file to indicate links).

swibrid plot_demux_report

plot demultiplexing report

usage: swibrid plot_demux_report [-h] [-o OUTDIR] [-f FIGURE]
                                 [-s SAMPLE_SHEET]

-h, --help: show this help message and exit

-o <outdir>, --outdir <outdir>: output directory

-f <figure>, --figure <figure>: summary figure

-s <sample_sheet>, --sample_sheet <sample_sheet>: sample sheet (barcode <tab> sample_name, no header)

swibrid process_alignments

process alignments: for an input file with aligned reads (MAF if LAST output, SAM if minimap2 output), get alignments to switch regions and elsewhere (potential inserts).

reads are removed if

they are too short
they don’t contain a forward and reverse primer
they contain internal primers
they contain no switch region
if mapped regions on the genome are much shorter than mapped parts of the read
there’s too much overlap between different alignments on the read, or on the genome
too little of the read maps
alignments to the switch region are in the wrong order
the isotype cannot be determined

output table contains

isotype
read orientation
read coverage
fraction of read sequence mapping to the same genomic regions
mapped switch region segments
inserts

aligned sequences can be saved separately (necessary to then construct a pseudo MSA). if –realign_breakpoints is set, 20nt on each side of a breakpoint are re-aligned, and statistics like number of homologous or untemplated bases are extracted

usage: swibrid process_alignments [-h] --alignments ALIGNMENTS --outfile
                                  OUTFILE --info INFO [--sequences SEQUENCES]
                                  [--stats STATS]
                                  [--switch_coords SWITCH_COORDS]
                                  [--switch_annotation SWITCH_ANNOTATION]
                                  [--min_length MIN_LENGTH] [--only_complete]
                                  [--keep_internal] [--remove_duplicates]
                                  [--ignore_order] [--telo TELO]
                                  [--min_switch_match MIN_SWITCH_MATCH]
                                  [--max_switch_overlap MAX_SWITCH_OVERLAP]
                                  [--max_insert_gap MAX_INSERT_GAP]
                                  [--max_insert_overlap MAX_INSERT_OVERLAP]
                                  [--min_insert_match MIN_INSERT_MATCH]
                                  [--min_num_switch_matches MIN_NUM_SWITCH_MATCHES]
                                  [--min_cov MIN_COV] [--max_gap MAX_GAP]
                                  [--isotype_extra ISOTYPE_EXTRA]
                                  [--telo_cutoff TELO_CUTOFF]
                                  [--min_telo_matchlen MIN_TELO_MATCHLEN]
                                  [--blacklist_regions [BLACKLIST_REGIONS]]
                                  [--realign_breakpoints REALIGN_BREAKPOINTS]
                                  [--realignment_penalties REALIGNMENT_PENALTIES]
                                  [--raw_reads RAW_READS] [--genome GENOME]
                                  [--keep_secondary_alignments]
                                  [--paired_end_mode]
                                  [--remove_isotypes REMOVE_ISOTYPES]
                                  [--interrupt_for_read INTERRUPT_FOR_READ]

-h, --help: show this help message and exit

--alignments <alignments>: required: MAF file (LAST output) or SAM file (minimap2 output) with aligned reads (LAST output)

--outfile <outfile>: required: output table with processed reads

--info <info>: required: csv file with read info

--sequences <sequences>: save sequence alignment to reference for pseudo-multiple alignment

--stats <stats>: output stats on removed reads

--switch_coords <switch_coords>: coordinates of switch region [chr14:106050000-106337000]

--switch_annotation <switch_annotation>: bed file with switch annotation

--min_length <min_length>: minimum read length [500]

--only_complete: use only complete reads with two primers

--keep_internal: keep reads with internal primers

--remove_duplicates: remove duplicate reads (based on UMIs in read IDs)

--ignore_order: don’t check order of matches along read

--telo <telo>: output of blasting reads against telomer repeats

--min_switch_match <min_switch_match>: min match length to switch region

--max_switch_overlap <max_switch_overlap>: max overlap between switch parts of different orientation [0]

--max_insert_gap <max_insert_gap>: max gap between insert and switch parts (on the read) [100]

--max_insert_overlap <max_insert_overlap>: max overlap between switch and insert parts (on the read) [50]

--min_insert_match <min_insert_match>: min length of insert [50]

--min_num_switch_matches <min_num_switch_matches>: min number of switch matches in read [1]

--min_cov <min_cov>: coverage cutoff on reads [0.90]

--max_gap <max_gap>: split alignments with gaps beyond that value [75]

--isotype_extra <isotype_extra>: extra space added to define isotypes [200]

--telo_cutoff <telo_cutoff>: % identify cutoff for telomer match [90]

--min_telo_matchlen <min_telo_matchlen>: minimum matchlen for telomeric repeat matches [25]

--blacklist_regions <blacklist_regions>: ignore inserts from blacklist regions defined in this bed file

--realign_breakpoints <realign_breakpoints>: re-align reads around breakpoints and save results to this file (requires --raw_reads and --genome)

--realignment_penalties <realignment_penalties>: re-alignment penalties (“ont”, “hifi” or “sr”) [ont]

--raw_reads <raw_reads>: fasta file with unaligned reads (interleaved mates for paired-end mode)

--genome <genome>: reference genome file

--keep_secondary_alignments: keep secondary alignments in SAM file (but conflicting alignments are not resolved!)

--paired_end_mode: EXPERIMENTAL: use paired-end mode (use interleaved mates as fastq input)

--remove_isotypes <remove_isotypes>: remove reads with these isotypes (comma-separated list)

--interrupt_for_read <interrupt_for_read>: (for debugging) interrupt for specified read(s)

swibrid run

main command: run the entire pipeline. run swibrid setup and edit config.yaml first!

this will call snakemake; additional options are passed to snakemake e.g., for a dry run use swibrid run -n, or swibrid run --unlock after a failed run

usage: swibrid run [-h] [-s SNAKEFILE] [-c CONFIG] [--slurm] [--sge]
                   [--queue QUEUE]
                   ...

snake_options: pass options to snakemake (…)

-h, --help: show this help message and exit

-s <snakefile>, --snakefile <snakefile>: snakefile [pipeline.snake]

-c <config>, --config <config>: config file [config.yaml]

--slurm: submit to slurm

--sge: submit to SGE

--queue <queue>: slurm queue [medium]

swibrid setup

main command: copy config and snakefile to current directory

usage: swibrid setup [-h] [-f]

-h, --help: show this help message and exit

-f, --overwrite: overwrite files if present [no]

swibrid test

test the pipeline. this will create synthetic reads in input and run the pipeline on this data, using a reduced hg38 genome in index with only the switch region (chr14:105000000-106000000). it will probably take about 30-60 minutes and will call snakemake, passing on additional options. e.g., for a dry run use swibrid test -n, and do swibrid test --unlock after a failed run

usage: swibrid test [-h] [-f] [-s SNAKEFILE] [-c CONFIG] [--slurm] [--sge]
                    [--queue QUEUE]
                    ...

snake_options: pass options to snakemake (…)

-h, --help: show this help message and exit

-f, --overwrite: overwrite files if present [no]

-s <snakefile>, --snakefile <snakefile>: snakefile [pipeline.snake]

-c <config>, --config <config>: config file [test_config.yaml]

--slurm: submit to slurm

--sge: submit to SGE

--queue <queue>: slurm queue [medium]