command-line interface
swibrid
Command line interface for swibrid.
This is the main program entry point for the swibrid
executable and its sub commands.
main commands:
swibrid setup
copy config and snakefile to current directoryswibrid run
run the entire pipelineswibrid test
perform a test run on synthetic reads
subcommands to run pipeline steps individually:
swibrid demultiplex
demultiplex minION runswibrid process_alignments
process and filter alignment outputswibrid get_alignment_pars
get alignment parameters (mismatch + gap rates) from align outputswibrid create_bed
create bed files and summary tables for insert-containing readsswibrid construct_msa
construct a (pseudo) MSAswibrid get_gaps
find gaps / breakpoints in MSAswibrid construct_linkage
construct hierarchical clustering linkageswibrid find_clusters
get read clustering from linkageswibrid find_variants
variant callingswibrid find_rearrangements
detect rearrangements (structural variants)swibrid plot_clustering
plot the read clusteringswibrid get_breakpoint_stats
get breakpoint histograms and statsswibrid get_switch_homology
get switch sequence homologyswibrid get_switch_motifs
get switch sequence motifsswibrid analyze_clustering
analyze clustering resultsswibrid downsample_clustering
downsample and analyze clusteringswibrid get_summary
get sample summary stats and plotswibrid collect_results
collect results for multiple samples
additional subcommands:
swibrid get_annotation
prepare gene annotationswibrid get_unique_clones_bed
get bed file with unique clonesswibrid get_synthetic_reads
create synthetic reads from bed fileswibrid plot_demux_report
make a graphical summary of demultiplexing outputswibrid combine_replicates
combine (aligned & processed) reads from replicatesswibrid downsample
downsample (aligned & processed) reads from a sample
usage: swibrid [-h] [-v] [--version]
- -h, --help
show this help message and exit
- -v, --verbose
Increase verbosity.
- --version
show program’s version number and exit
swibrid analyze_clustering
analyze clustering: given the clustering from find_clusters
as input,
various aggregate statistics are generated per cluster.
output table contains the following columns:
cluster size (# of reads)
cluster isotype (most frequent)
average read fraction mapped per cluster
average read fraction mapped to same region
average read fraction ignored
consensus sequence
length of consensus
GC content of consensus
spread of breakpoints (number of bases with intermediate coverage)
size of inversions
size of duplications
frequency of inserts in cluster
fractional overlap of inserts in cluster
fractional overlap of insert-associated breakpoints
average insert length
average length of gaps around inserts
average isotype associated to inserts
number of homologous nucleotides around insert-associated breakpoints
number of homologous nucleotides around switch breakpoints
number of untemplated nucleotides around insert-associated breakpoints
number of untemplated nucleotides around switch breakpoints
length of different switch regions covered in cluster
cluster size adjusted for PCR length and GC bias
usage: swibrid analyze_clustering [-h] [--clustering CLUSTERING] [--msa MSA]
[-o OUTPUT] [--gaps GAPS]
[--max_gap MAX_GAP]
[--max_realignment_gap MAX_REALIGNMENT_GAP]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--inserts INSERTS]
[--realignments REALIGNMENTS]
[--adjust_size]
[--isotype_extra ISOTYPE_EXTRA]
- -h, --help
show this help message and exit
- --clustering <clustering>
required: find_clusters output file
- --msa <msa>
required: file with (pseudo) multiple alignment of read sequences for clustering
- -o <output>, --output <output>
required: output table
- --gaps <gaps>
required: gap distribution
- --max_gap <max_gap>
max gap size to ignore [75]
- --max_realignment_gap <max_realignment_gap>
max gap size to count positions for realignment [75]
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --inserts <inserts>
results.tsv file with inserts
- --realignments <realignments>
file with breakpoint realignments
- --adjust_size
calculate cluster size adjusted for fragment length and QC content
- --isotype_extra <isotype_extra>
extra space added to define isotypes [500]
swibrid collect_results
collect stats from different samples: this will collect all summary files for the listed samples and produce one big table. optionally, this will also collect all inserts and produce a multi-sheet excel file. clustering analysis results can also be collected into an excel file.
usage: swibrid collect_results [-h] --samples SAMPLES --sample_stats
SAMPLE_STATS [--summary_path SUMMARY_PATH]
[--inserts_path INSERTS_PATH]
[--cluster_analysis_path CLUSTER_ANALYSIS_PATH]
[--cluster_stats CLUSTER_STATS]
[--inserts INSERTS] [--clean_column_names]
- -h, --help
show this help message and exit
- --samples <samples>
required: comma-separated list of samples
- --sample_stats <sample_stats>
required: output file with sample_stats
- --summary_path <summary_path>
path pattern for summary files [pipeline/{sample}/{sample}_summary.csv]
- --inserts_path <inserts_path>
path pattern for insert tables [pipeline/{sample}/{sample}_inserts.tsv]
- --cluster_analysis_path <cluster_analysis_path>
path pattern for cluster_analysis files [pipeline/{sample}/{sample}_cluster_analysis.csv]
- --cluster_stats <cluster_stats>
output file with cluster stats
- --inserts <inserts>
output file with inserts
- --clean_column_names
re-name columns to make features a bit more explicit
swibrid combine_replicates
combine (aligned + processed) reads from replicates
usage: swibrid combine_replicates [-h] [-s SAMPLES] [-c COMBINED]
[--nmax NMAX] [--aligner ALIGNER]
[--msa_path MSA_PATH]
[--msa_csv_path MSA_CSV_PATH]
[--info_path INFO_PATH]
[--process_path PROCESS_PATH]
[--process_stats_path PROCESS_STATS_PATH]
[--alignment_pars_path ALIGNMENT_PARS_PATH]
[--aligned_seqs_path ALIGNED_SEQS_PATH]
[--breakpoint_alignments_path BREAKPOINT_ALIGNMENTS_PATH]
[--inserts_tsv_path INSERTS_TSV_PATH]
[--bed_path BED_PATH]
- -h, --help
show this help message and exit
- -s <samples>, --samples <samples>
comma-separated list of samples to combine
- -c <combined>, --combined <combined>
combined sample name
- --nmax <nmax>
use only nmax reads [50000]
- --aligner <aligner>
alignment algorithm used [last]
- --msa_path <msa_path>
path pattern for msa files [“pipeline/{sample}/{sample}_msa.npz”]
- --msa_csv_path <msa_csv_path>
path pattern for msa csv files [“pipeline/{sample}/{sample}_msa.csv”]
- --info_path <info_path>
path pattern for info csv files [“input/{sample}_info.csv”]
- --process_path <process_path>
path pattern for processing output [“pipeline/{sample}/{sample}_processed.out”]
- --process_stats_path <process_stats_path>
path pattern for process stats csv [“pipeline/{sample}/{sample}_process_stats.csv”]
- --alignment_pars_path <alignment_pars_path>
path pattern for alignment pars [“pipeline/{sample}/{sample}_{aligner}_pars.npz”]
- --aligned_seqs_path <aligned_seqs_path>
path pattern for aligned sequences [“pipeline/{sample}/{sample}_aligned.fasta.gz
- --breakpoint_alignments_path <breakpoint_alignments_path>
path pattern for breakpoint alignments [“pipeline/{sample}/{sample}_breakpoint_alignments.csv”]
- --inserts_tsv_path <inserts_tsv_path>
path pattern for inserts tsv [“pipeline/{sample}/{sample}_inserts.tsv”]
- --bed_path <bed_path>
path pattern for bed [“pipeline/{sample}/{sample}.bed”]
swibrid construct_linkage
construct hiearchical agglomerative clustering from MSA: from the input MSA, only coverage information will be used and gaps smaller than max_gap will be removed. by default, fastcluster is used with cosine metric and average linkage. output is a npz file containing the linkage matrix
EXPERIMENTAL: a sparse version building on pynndescent and gbbs agglomerative clustering, requires several libraries to be installed
usage: swibrid construct_linkage [-h] [--msa MSA] [--linkage LINKAGE]
[--gaps GAPS] [--max_gap MAX_GAP]
[--metric METRIC] [--method METHOD]
[--nmax NMAX] [--ignore_unused_positions]
[--use_sparse] [--distances DISTANCES]
[--n_neighbors N_NEIGHBORS]
[--n_threads N_THREADS] [--n_backup N_BACKUP]
- -h, --help
show this help message and exit
- --msa <msa>
required: file with (pseudo) multiple alignment of read sequences for clustering
- --linkage <linkage>
required: output file contains linkage
- --gaps <gaps>
output of get_gaps
- --max_gap <max_gap>
max gap size to ignore [75]
- --metric <metric>
clustering metric [cosine]
- --method <method>
clustering method for hierarchical clustering [average]
- --nmax <nmax>
use only nmax reads
- --ignore_unused_positions
ignore positions that have no coverage
- --use_sparse
EXPERIMENTAL: use sparse nearest-neighbor clustering
- --distances <distances>
pre-computed sparse distance matrix
- --n_neighbors <n_neighbors>
# of nearest neighbors for adjacency graph [100]
- --n_threads <n_threads>
# of threads [1]
- --n_backup <n_backup>
save full distances for these points in addition to the nearest neighbors [0]
swibrid construct_msa
construct (pseudo) MSA from processed alignments: this script will take aligned segments and sequences from process_alignments and construct a MSA. the MSA is stored as a sparse matrix of n_reads x n_positions, where the positions run over the concatenation of individual switch regions specified in a bed file. matrix values m code for coverage and nucleotide identity:
m % 10 gives nucleotide identity with A=1, C=2, G=3, T=4, gaps are zeros
m // 10 gives coverage of genomic regions by one read (1x, 2x, …) and indicates tandem duplications; if
--use_orientation
is set, inversions are also considered and associated coverage values are negative
usage: swibrid construct_msa [-h] [--coords COORDS] [--sequences SEQUENCES]
[--msa MSA] [--out OUT]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--nmax NMAX] [--use_orientation]
- -h, --help
show this help message and exit
- --coords <coords>
required: process_alignments coordinate output
- --sequences <sequences>
required: process_alignments sequence output
- --msa <msa>
required: output file with (pseudo) multiple alignment of read sequences for clustering
- --out <out>
required: output file with read info
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
required: bed file with coordinates of individual switch regions
- --nmax <nmax>
use only nmax reads
- --use_orientation
include alignment block orientation
swibrid create_bed
create bed files for all reads and summary table for insert-containing reads, including read sequence with insert part highlighted in uppercase
usage: swibrid create_bed [-h] [--processed_alignments PROCESSED_ALIGNMENTS]
[--bed BED] [--outfile OUTFILE]
[--raw_reads RAW_READS] [--paired_end_mode]
[--switch_coords SWITCH_COORDS]
[--annotation [ANNOTATION]]
[--switch_annotation SWITCH_ANNOTATION]
- -h, --help
show this help message and exit
- --processed_alignments <processed_alignments>
required: output of process_alignments
- --bed <bed>
required: bed output with insert + switch coordinates for selected reads
- --outfile <outfile>
required: output table with insert information
- --raw_reads <raw_reads>
fasta file with minION reads (interleaved mates for paired-end data)
- --paired_end_mode
EXPERIMENTAL: use paired-end mode (use interleaved mates as input)
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000]
- --annotation <annotation>
bed file with gene annotation
- --switch_annotation <switch_annotation>
bed file with switch annotation
swibrid demultiplex
demultiplex reads using output of BLAST against primers and barcodes
BLAST output is created with blastn -db {barcodes_primers} -query {query} -task blastn-short -max_target_seqs 50 -outfmt "6 saccver qaccver slen qlen pident length qstart qend evalue" -gapopen 5 -gapextend 2 -reward 3 -penalty -4 -evalue 1 -num_threads 1 -perc_identity 50
where {barcodes_primers} is a database containing barcode and primer sequences and {query} is the input fasta file
optional (but recommended) input is a sample sheet (tab delimited, no header or row names, 1st column barcode, 2nd column sample name)
usage: swibrid demultiplex [-h] -i INPUT -b BLAST -o OUTDIR [-c CUTOFF]
[-d DIST] [-m MIN_READS] [-f FIGURE] [-r REPORT]
[-s SAMPLE_SHEET] [--collapse] [--split_reads]
[--max_split_dist MAX_SPLIT_DIST] [--nmax NMAX]
[--add_umi] [--umi_regex UMI_REGEX]
[--write_info_chunks]
- -h, --help
show this help message and exit
- -i <input>, --input <input>
required: input fastq(.gz)
- -b <blast>, --blast <blast>
required: BLAST output file (or stdin)
- -o <outdir>, --outdir <outdir>
required: output directory
- -c <cutoff>, --cutoff <cutoff>
% identify cutoff [70.0]
- -d <dist>, --dist <dist>
max distance of match to end of sequence [100]
- -m <min_reads>, --min_reads <min_reads>
min # of reads required to create separate file if no sample sheet is given [1000]
- -f <figure>, --figure <figure>
summary figure
- -r <report>, --report <report>
summary report
- -s <sample_sheet>, --sample_sheet <sample_sheet>
sample sheet (barcode <tab> sample_name, no header)
- --collapse
count reads regardless of whether one or both barcodes are present (for identical barcodes)? [False]
- --split_reads
split reads with adjacent internal primer binding sites
- --max_split_dist <max_split_dist>
maximum distance between internal primer matches to split read [200]
- --nmax <nmax>
maximum number of reads
- --add_umi
add UMI to read name
- --umi_regex <umi_regex>
regex to find UMI in read description [[ACGTN]{12}]
- --write_info_chunks
write info files in chunks
swibrid downsample
downsample (aligned + processed) reads for a sample
usage: swibrid downsample [-h] [-i INPUT] [-o OUTPUT] [-n N]
[--aligner ALIGNER] [--msa_path MSA_PATH]
[--msa_csv_path MSA_CSV_PATH]
[--info_path INFO_PATH]
[--process_path PROCESS_PATH]
[--process_stats_path PROCESS_STATS_PATH]
[--alignment_pars_path ALIGNMENT_PARS_PATH]
[--breakpoint_alignments_path BREAKPOINT_ALIGNMENTS_PATH]
[--inserts_tsv_path INSERTS_TSV_PATH]
[--bed_path BED_PATH]
- -h, --help
show this help message and exit
- -i <input>, --input <input>
input sample to downsample
- -o <output>, --output <output>
downsampled sample name
- -n <n>, --n <n>
use only n reads [5000]
- --aligner <aligner>
alignment algorithm used [last]
- --msa_path <msa_path>
path pattern for msa files [pipeline/{sample}/{sample}_msa.npz]
- --msa_csv_path <msa_csv_path>
path pattern for msa csv files [pipeline/{sample}/{sample}_msa.csv]
- --info_path <info_path>
path pattern for info csv files [input/{sample}_info.csv]
- --process_path <process_path>
path pattern for processing output [pipeline/{sample}/{sample}_processed.out]
- --process_stats_path <process_stats_path>
path pattern for process stats csv [pipeline/{sample}/{sample}_process_stats.csv]
- --alignment_pars_path <alignment_pars_path>
path pattern for alignment pars [pipeline/{sample}/{sample}_{aligner}_pars.npz]
- --breakpoint_alignments_path <breakpoint_alignments_path>
path pattern for breakpoint alignments [pipeline/{sample}/{sample}_breakpoint_alignments.csv]
- --inserts_tsv_path <inserts_tsv_path>
path pattern for inserts tsv [pipeline/{sample}/{sample}_inserts.tsv]
- --bed_path <bed_path>
path pattern for bed [pipeline/{sample}/{sample}.bed]
swibrid downsample_clustering
downsample the clustering to get more robust diversity measures: given the input MSA (and associated gaps), clustering is repeated nreps times on a subsample of nreads reads using fastcluster with (by defalt) cosine metric and average linkage and a cutoff derived from cluster_stats input. the following diversity measured are calculated:
mean_cluster_size (as fraction of reads)
std_cluster_size
nclusters_final (number of clusters after filtering)
nclusters_eff (from the entropy of the cluster size distribution pre-filtering)
cluster_gini: gini coefficient (post-filtering)
cluster_entropy: entropy (post-filtering)
cluster_inverse_simpson: inverse simpson coefficient (post-filtering)
top_clone_occupancy: relative fraction of reads in biggest cluster
big_clones_occupancy: fraction of reads in clusters > 1% occupancy
usage: swibrid downsample_clustering [-h] [--msa MSA]
[--cluster_stats CLUSTER_STATS]
[-s STATS] [--nreads NREADS]
[--nreps NREPS] [--gaps GAPS]
[--max_gap MAX_GAP] [--metric METRIC]
[--method METHOD]
[--ignore_unused_positions]
[--filtering_cutoff FILTERING_CUTOFF]
[--big_clone_cutoff BIG_CLONE_CUTOFF]
- -h, --help
show this help message and exit
- --msa <msa>
required: file with (pseudo) multiple alignment of read sequences for clustering
- --cluster_stats <cluster_stats>
required: file with statistics from original clustering
- -s <stats>, --stats <stats>
required: output file with clustering statistics after downsampling
- --nreads <nreads>
number of reads used [1000]
- --nreps <nreps>
number of replicates [10]
- --gaps <gaps>
output of get_gaps
- --max_gap <max_gap>
max gap size to ignore [75]
- --metric <metric>
clustering metric [cosine]
- --method <method>
clustering method for hierarchical clustering [average]
- --ignore_unused_positions
ignore positions that have no coverage
- --filtering_cutoff <filtering_cutoff>
filter out smallest clusters, keeping at least this fraction of reads [0.95]
- --big_clone_cutoff <big_clone_cutoff>
cutoff to determine a “big” clone as fraction of clustered reads [0.01]
swibrid find_clusters
find clusters by cutting linkage dendrogram: inputs are the linkage matrix (from construct_linkage
)
and read info (from construct_msa
). output is a csv file that adds cluster identity to the read info file
cutoff can be fixed or determined by scanning cutoffs between cmin and cmax and finding an inflection point (either by distance, or from the intersection of two exponential trends). small and isolated clusters are flagged, filtered clusters contain at least 95% of reads
usage: swibrid find_clusters [-h] -l LINKAGE -i INPUT -o OUTPUT [-s STATS]
[-f CUTOFF] [--scanning SCANNING]
[--fit_method FIT_METHOD] [-c CMIN] [-C CMAX]
[-n NC] [--filtering_cutoff FILTERING_CUTOFF]
- -h, --help
show this help message and exit
- -l <linkage>, --linkage <linkage>
required: output of construct_linkage
- -i <input>, --input <input>
required: read info (output of construct_msa)
- -o <output>, --output <output>
required: output file with clustering
- -s <stats>, --stats <stats>
file with statistics
- -f <cutoff>, --fix_cutoff <cutoff>
use fixed cutoff instead of data-derived
- --scanning <scanning>
file with scanning data
- --fit_method <fit_method>
cutoff determination method: “trend” or “distance” [distance]
- -c <cmin>
minimum cutoff value [0.001]
- -C <cmax>
maximum cutoff value [0.100]
- -n <nc>
number of cutoff values between cmin and cmax [100]
- --filtering_cutoff <filtering_cutoff>
filter out smallest clusters, keeping at least this fraction of reads [0.95]
swibrid find_rearrangements
find rearrangements (inversions/duplications) in MSA: regions with coverage larger or smaller than 1 and without gaps bigger than 25 nt. output files:
a .npz file with read indices, left/right positions and sizes for inversions and duplications per read
a bed file with consensus rearrangement coordinates (by merging individual events), type and allele frequency
usage: swibrid find_rearrangements [-h] --msa MSA -o OUT -m MAT
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--max_rearrangement_gap MAX_REARRANGEMENT_GAP]
- -h, --help
show this help message and exit
- --msa <msa>
required: file with (pseudo) multiple alignment of read sequences for clustering
- -o <out>, --out <out>
required: output file (bed)
- -m <mat>, --mat <mat>
required: output file (matrix)
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --max_rearrangement_gap <max_rearrangement_gap>
max gap within rearrangements to ignore [25]
swibrid find_variants
EXPERIMENTAL: find single-nucleotide variants in MSA and determine haplotypes. variable positions in the input MSA (excluding positions at or near gaps) where nucleotide distributions are different than what’s expected given mutation frequencies (estimated at the initial alignment step). potential variants are removed if
there’s less than min_cov non-gaps
fewer variants than expected
high strand bias
no cluster with at least min_cluster_cov reads and allele frequency > min_freq
variants are aggregated over clusters, and the distribution across clusters is tested for evenness. variants can be annotated by dbSNP id. motif occurrences around variants are also scored. haplotypes are determined by performing a weighted NMF on a matrix of allele frequencies per cluster and variant.
output is a vcf-style file with genomic (1-based) and relative (0-based) coordinates, and a matrix (sparse integer array, same shape as MSA) indicating which read contains a variant at which position
usage: swibrid find_variants [-h] [--msa MSA] [--clustering CLUSTERING]
[--stats STATS] [--reference REFERENCE]
[--pars PARS] [-o OUT] [-m MAT]
[--save_all SAVE_ALL]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--fdr FDR] [--min_cov MIN_COV]
[--gap_window_size GAP_WINDOW_SIZE]
[--max_local_gap_freq MAX_LOCAL_GAP_FREQ]
[--min_cluster_cov MIN_CLUSTER_COV]
[--min_freq MIN_FREQ] [--nreads NREADS]
[--variant_annotation [VARIANT_ANNOTATION]]
[--haplotypes HAPLOTYPES] [--motifs MOTIFS]
- -h, --help
show this help message and exit
- --msa <msa>
required: file with (pseudo) multiple alignment of read sequences for clustering
- --clustering <clustering>
required: clustering (find_clusters output)
- --stats <stats>
required: clustering stats (find_clusters output)
- --reference <reference>
required: reference sequence (switch chromosome)
- --pars <pars>
required: alignment parameters to estimate mutation probability
- -o <out>, --out <out>
required: output file (text)
- -m <mat>, --mat <mat>
required: output file (matrix)
- --save_all <save_all>
save all variants, not just those passing filters
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --fdr <fdr>
FDR for variant calling [0.05]
- --min_cov <min_cov>
minimum read coverage at potentially variable positions [50]
- --gap_window_size <gap_window_size>
size of window to compute local gap frequency [10]
- --max_local_gap_freq <max_local_gap_freq>
max. local gap frequency in window [0.7]
- --min_cluster_cov <min_cluster_cov>
minimum read coverage at potentially variable positions per cluster [10]
- --min_freq <min_freq>
minimum allele frequency at potentially variable positions (in total or per cluster) [0.40]
- --nreads <nreads>
downsample to this number of reads to control sensitivity
- --variant_annotation <variant_annotation>
variant annotation file (vcf.gz; e.g., from 1000Genomes project)
- --haplotypes <haplotypes>
cluster haplotypes
- --motifs <motifs>
comma-separated list of sequence motifs to look up at variant positions [Cg,wrCy,Tw]
swibrid get_alignment_pars
get LAST-like parameters from a SAM or MAF file. this will go through the alignments and tally mutation and indel frequencies.
usage: swibrid get_alignment_pars [-h] [-i INF] [-r REF] [-o OUT]
- -h, --help
show this help message and exit
- -i <inf>, --inf <inf>
input .bam or .maf.gz file, or .par from LAST
- -r <ref>, --ref <ref>
reference genome
- -o <out>, --out <out>
output file (.npz)
swibrid get_annotation
prepare shortened annotation file for insert annotation. input is a GTF file (e.g., from GENCODE), output is a bed file
usage: swibrid get_annotation [-h] [-i INPUT] [-o OUT]
- -h, --help
show this help message and exit
- -i <input>, --input <input>
input gtf
- -o <out>, --out <out>
output bed
swibrid get_breakpoint_stats
analyze breakpoint statistics: this will produce aggregate statistics on breakpoints, either averaged over reads, or clusters. homology scores are calculated from binned breakpoint frequencies weighted by homology of bins; motif scores are calculated from binned breakpoint frequencies weighted by motif occurrences. output file contains following values:
breaks_normalized: number of breaks divided by number of reads / clusters
frac_breaks_duplications: fraction of breaks leading to duplication events
frac_breaks_inversions: fraction of breaks leading to inversion events
frac_breaks_single: fraction of breaks from SM to elsewhere
frac_breaks_sequential: fraction of breaks to different regions but not SM
frac_breaks_intra: fraction of breaks within a region
frac_breaks_inversions/duplications_intra: frac breaks with inversions/duplications within regions
frac_breaks_X_X: fraction of breaks connecting indicated regions
spread_XX: spread (standard deviation) of breakpoint positions in a region
homology_fw: homology in bins around breakpoint positions (same orientation)
homology_rv: homology in bins around breakpoint positions (opposite orientation)
homology_fw/rv_XX: homologies for breaks connecting indicated regions
donor/acceptor_score(_XX): motif scores for donor and acceptor breakpoints for different motifs (subdivided by region)
donor/acceptor_complexity(_XX): sequence complexity for donoer and acceptor breakpoints (subdivided by region)
usage: swibrid get_breakpoint_stats [-h] -g GAPS -r REARRANGEMENTS
[-c CLUSTERING] [-a CLUSTERING_ANALYSIS]
-o OUT [-p PLOT] [-b BINSIZE]
[--scale_factor SCALE_FACTOR]
[--max_gap MAX_GAP] [--homology HOMOLOGY]
[--motifs MOTIFS] [--sample SAMPLE]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--use_clones USE_CLONES]
[--weights WEIGHTS] [--range RANGE]
[--top_donor TOP_DONOR]
[--top_acceptor TOP_ACCEPTOR] [-n NTOP]
[--reference REFERENCE]
- -h, --help
show this help message and exit
- -g <gaps>, --gaps <gaps>
required: file with gap positions (output of get_gaps)
- -r <rearrangements>, --rearrangements <rearrangements>
required: file with inversion/duplication positions (output of find_rearrangements)
- -c <clustering>, --clustering <clustering>
file with clustering results
- -a <clustering_analysis>, --clustering_analysis <clustering_analysis>
file with clustering analysis
- -o <out>, --out <out>
required: output file
- -p <plot>, --plot <plot>
plot file
- -b <binsize>, --binsize <binsize>
binsize [50]
- --scale_factor <scale_factor>
factor to increase binsize for 2D histogram [10]
- --max_gap <max_gap>
max gap size to ignore [75]
- --homology <homology>
file with homology values (output of get_switch_homology)
- --motifs <motifs>
file with motif counts (output of get_switch_motifs)
- --sample <sample>
sample name (for figure title)
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --use_clones <use_clones>
comma-separated list of clones to use or ‘all’ (default: filtered clusters)
- --weights <weights>
use different weights (“cluster” | “reads” | “adjusted”) [cluster]
- --range <range>
range of kmer sizes, e.g., 3,5-7 [5]
- --top_donor <top_donor>
output additional file with top donor sequences
- --top_acceptor <top_acceptor>
output additional file with top acceptor sequences
- -n <ntop>, --ntop <ntop>
number of top bins to select
- --reference <reference>
genome fasta file
swibrid get_gaps
find gaps in (pseudo)MSA: takes the MSA matrix from construct_MSA and finds all gap positions. output is a .npz file containing several integer arrays
read_idx: index of read
gap_left: left end (within the MSA; 0-based)
gap_right: right end
gap_size: size of gap
usage: swibrid get_gaps [-h] --msa MSA -o OUT [--paired_end_mode]
[--msa_csv MSA_CSV]
- -h, --help
show this help message and exit
- --msa <msa>
required: .npz file with (pseudo) multiple alignment of read sequences for clustering
- -o <out>, --out <out>
required: output file (.npz)
- --paired_end_mode
use paired-end mode
- --msa_csv <msa_csv>
msa read info (for paired_end_mode)
swibrid get_summary
produce a summary of features derived from a sample and a plot. collects statistics produced by process_alignments, find_clusters, and downsample_clustering. produces cluster distribution statistics similar to downsample_clustering (the latter are averaged):
mean_cluster_size (as fraction of reads)
std_cluster_size
nclusters_final (number of clusters after filtering)
nclusters_eff (from the entropy of the cluster size distribution pre-filtering)
cluster_gini: gini coefficient (post-filtering)
cluster_entropy: entropy (post-filtering)
cluster_inverse_simpson: inverse simpson coefficient (post-filtering)
top_clone_occupancy: relative fraction of reads in biggest cluster
big_clones_occupancy: fraction of reads in clusters > 1% occupancy
size_length_bias: regression coefficient of log(cluster_size) ~ length
size_GC_bias: regression coefficient of log(cluster_size) ~ GC
averages cluster-specific features from analyze_clustering and get_breakpoint_stats over clusters. collects number of reads / clusters per isotype. gets statistics on variants (germline vs. somatic, transitions vs. transversions, etc.)
usage: swibrid get_summary [-h] [--sample SAMPLE] [--figure FIGURE]
[--stats STATS] [--process PROCESS] [--info INFO]
[--gaps GAPS] [--breakpoint_stats BREAKPOINT_STATS]
[--max_gap MAX_GAP] [--clustering CLUSTERING]
[--scanning SCANNING]
[--cluster_stats CLUSTER_STATS]
[--cluster_analysis CLUSTER_ANALYSIS]
[--cluster_downsampling CLUSTER_DOWNSAMPLING]
[--variants [VARIANTS]]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--use_clones USE_CLONES] [--weights WEIGHTS]
[--max_n_blunt MAX_N_BLUNT]
[--big_clone_cutoff BIG_CLONE_CUTOFF]
- -h, --help
show this help message and exit
- --sample <sample>
sample name
- --figure <figure>
output figure
- --stats <stats>
output stats
- --process <process>
stats file from process_alignments
- --info <info>
read info
- --gaps <gaps>
gap distribution
- --breakpoint_stats <breakpoint_stats>
breakpoint statistics
- --max_gap <max_gap>
max gap size to ignore [75]
- --clustering <clustering>
find_clusters output file
- --scanning <scanning>
find_clusters dendrogram scanning
- --cluster_stats <cluster_stats>
find_clusters stats
- --cluster_analysis <cluster_analysis>
analyze_clusters output
- --cluster_downsampling <cluster_downsampling>
downsample_clusters output
- --variants <variants>
file with variant table (from find_variants)
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --use_clones <use_clones>
comma-separated list of clones to use
- --weights <weights>
specify weights (“cluster” | “reads” | “adjusted”) [cluster]
- --max_n_blunt <max_n_blunt>
max avg. number of untemplated / homologous nucleotides of reads in cluster to be considered a blunt end [0.5]
- --big_clone_cutoff <big_clone_cutoff>
cutoff to determine a “big” clone as fraction of clustered reads [0.01]
swibrid get_switch_homology
analyze sequence homology within switch region bins by jaccard similarity of k-mers: from switch region coordinates and a genome file, this script assesses homology of pairs of bins by jaccard similarity of k-mer occurrences in forward or reverse orientation. output is a .npz file with pairwise forward and reverse homology arrays for each value of k
usage: swibrid get_switch_homology [-h] -g GENOME -o OUTPUT
--switch_annotation SWITCH_ANNOTATION
[--switch_coords SWITCH_COORDS]
[-b BINSIZE] [--range RANGE]
[--figure FIGURE]
- -h, --help
show this help message and exit
- -g <genome>, --genome <genome>
required: genome fasta file
- -o <output>, --output <output>
required: output file (npz)
- --switch_annotation <switch_annotation>
required: bed file with switch annotation
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- -b <binsize>, --binsize <binsize>
binsize [100]
- --range <range>
range of kmer sizes, e.g., 3,5-7 [5]
- --figure <figure>
output figure
swibrid get_switch_motifs
analyze sequence content within switch region bins by a number of motifs. given a list of motifs, genome file and switch region coordinates, this script counts the occurrences of these motifs in coordinate bins. output is a .npz file with count arrays for each motif
usage: swibrid get_switch_motifs [-h] -g GENOME -o OUTPUT
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[-b BINSIZE] [--motifs MOTIFS] [-k K]
[--figure FIGURE]
- -h, --help
show this help message and exit
- -g <genome>, --genome <genome>
required: genome fasta file
- -o <output>, --output <output>
required: output file (npz)
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000:-]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- -b <binsize>, --binsize <binsize>
binsize [100]
- --motifs <motifs>
comma-separated list of sequence motifs [S=G/C,W=A/T,WGCW,GAGCT,GGGST]
- -k <k>
value k for estimate of k-mer complexity [5]
- --figure <figure>
output figure
swibrid get_synthetic_reads
create synthetic reads from bed file. mutations, insertions and deletions will be added according to parameters estimated by LAST. clone sizes can be uniform or distributed as Poisson or Negative Binomial. additional homozygous / heterozygous / other variants can be added
usage: swibrid get_synthetic_reads [-h] -b BED -r REFERENCE -p PAR -o OUT
[-i INFO] [-n N] [-k K] [-d DISTRIBUTION]
[--nbinom_alpha NBINOM_ALPHA]
[--variants [VARIANTS]]
[--rearrangements [REARRANGEMENTS]]
[-s SEED] [--no_mutations] [--no_deletions]
[--no_insertions]
- -h, --help
show this help message and exit
- -b <bed>, --bed <bed>
required: input bed file
- -r <reference>, --reference <reference>
required: reference sequence
- -p <par>, --par <par>
required: file with LAST parameters
- -o <out>, --out <out>
required: output fasta / fastq (.gz) file
- -i <info>, --info <info>
output info file
- -n <n>, --n <n>
use n random entries of input bed file
- -k <k>, --k <k>
number of mutated copies per sequence [1.0]
- -d <distribution>, --distribution <distribution>
distribution type for k (delta | poisson | nbinom | P | NB | D) [poisson]
- --nbinom_alpha <nbinom_alpha>
dispersion alpha of negative binomial [4.0]
- --variants <variants>
variant file (like output of find_variants) to add variants
- --rearrangements <rearrangements>
rearrangement file (bed-like output of find_rearrangements) to add rearrangements
- -s <seed>, --seed <seed>
- --no_mutations
- --no_deletions
- --no_insertions
swibrid get_unique_clones_bed
produce a bed file with unique clones from the biggest clusters.
usage: swibrid get_unique_clones_bed [-h] -b BED -c CLUSTERING [-o OUTPUT]
[--cut CUT]
- -h, --help
show this help message and exit
- -b <bed>, --bed <bed>
required: input bed file
- -c <clustering>, --clustering <clustering>
required: clustering results
- -o <output>, --output <output>
output file [stdout]
- --cut <cut>
use only this fraction of the biggest clusters [0.8]
swibrid plot_clustering
Given a MSA and clustering results, this will display individual reads mapping over the switch regions. reads will be ordered as dictated by the linkage, or simply by isotype and cluster value. reads can be colored by different variables (isotype, cluster, haplotype, coverage, strand, orientation or other columns present in the info file). an additional sidebar can display additional variables. variant positions or breakpoint realignment statistics can also be indicated.
usage: swibrid plot_clustering [-h] --msa MSA --figure FIGURE
--clustering_results CLUSTERING_RESULTS
--switch_annotation SWITCH_ANNOTATION
[--linkage LINKAGE] [--sample SAMPLE]
[--info INFO]
[--clustering_stats CLUSTERING_STATS]
[--cutoff CUTOFF]
[--switch_coords SWITCH_COORDS]
[--annotation [ANNOTATION]]
[--color_by COLOR_BY]
[--sidebar_color_by SIDEBAR_COLOR_BY]
[--show_inserts] [--coords COORDS]
[--no_x_ticks] [--omit_scale_bar]
[--fig_width FIG_WIDTH]
[--fig_height FIG_HEIGHT]
[--linkage_border LINKAGE_BORDER]
[--cutoff_color CUTOFF_COLOR]
[--variants_table VARIANTS_TABLE]
[--variants_matrix VARIANTS_MATRIX]
[--haplotypes HAPLOTYPES]
[--realignments REALIGNMENTS] [--dpi DPI]
[--chunksize CHUNKSIZE] [--cmax CMAX]
[--paired_end_mode]
- -h, --help
show this help message and exit
- --msa <msa>
required: file(s) with (pseudo) multiple alignment of read sequences for clustering
- --figure <figure>
required: output figure
- --clustering_results <clustering_results>
required: file contains clustering results
- --switch_annotation <switch_annotation>
BED file with switch annotation
- --linkage <linkage>
file containing linkage
- --sample <sample>
sample name (for figure title)
- --info <info>
file with read info
- --clustering_stats <clustering_stats>
file contains clustering stats with cutoff value
- --cutoff <cutoff>
show cutoff line
- --switch_coords <switch_coords>
Coordinates of switch region [chr14:106050000-106337000:-].
- --annotation <annotation>
bed file with gene annotation
- --color_by <color_by>
Color reads by isotype, cluster, haplotype, or other columns in the ‘info’ file [isotype].
- --sidebar_color_by <sidebar_color_by>
Color sidebar reads by (comma-separated list of) isotype, cluster, haplotype, or other columns in the ‘info’ file [none].
- --show_inserts
Show insert locations.
- --coords <coords>
file with processed read coordinates (required if inserts are shown)
- --no_x_ticks
Omit the x-axis ticks.
- --omit_scale_bar
Omit scale bar.
- --fig_width <fig_width>
Height of figure in inches [6].
- --fig_height <fig_height>
width of figure in inches [8]
- --linkage_border <linkage_border>
Fraction of figure used for dendrogram [0.10].
- --cutoff_color <cutoff_color>
Color of dendrogram cutoff line [r].
- --variants_table <variants_table>
Show variant positions from variant table (from find_variants).
- --variants_matrix <variants_matrix>
Indicate variant occurrences from variant matrix (from find_variants).
- --haplotypes <haplotypes>
File with haplotype clustering (from find_variants).
- --realignments <realignments>
Indicate realignment scores with breakpoint realignments.
- --dpi <dpi>
DPI of output figure [300].
- --chunksize <chunksize>
Plot image in chunks of reads of this size [5000].
- --cmax <cmax>
Maximum height in dendrogram plot.
- --paired_end_mode
Use paired-end mode (requires coords file to indicate links).
swibrid plot_demux_report
plot demultiplexing report
usage: swibrid plot_demux_report [-h] [-o OUTDIR] [-f FIGURE]
[-s SAMPLE_SHEET]
- -h, --help
show this help message and exit
- -o <outdir>, --outdir <outdir>
output directory
- -f <figure>, --figure <figure>
summary figure
- -s <sample_sheet>, --sample_sheet <sample_sheet>
sample sheet (barcode <tab> sample_name, no header)
swibrid process_alignments
process alignments: for an input file with aligned reads (MAF if LAST output, SAM if minimap2 output), get alignments to switch regions and elsewhere (potential inserts).
reads are removed if
they are too short
they don’t contain a forward and reverse primer
they contain internal primers
they contain no switch region
if mapped regions on the genome are much shorter than mapped parts of the read
there’s too much overlap between different alignments on the read, or on the genome
too little of the read maps
alignments to the switch region are in the wrong order
the isotype cannot be determined
output table contains
isotype
read orientation
read coverage
fraction of read sequence mapping to the same genomic regions
mapped switch region segments
inserts
aligned sequences can be saved separately (necessary to then construct a pseudo MSA). if –realign_breakpoints is set, 20nt on each side of a breakpoint are re-aligned, and statistics like number of homologous or untemplated bases are extracted
usage: swibrid process_alignments [-h] --alignments ALIGNMENTS --outfile
OUTFILE --info INFO [--sequences SEQUENCES]
[--stats STATS]
[--switch_coords SWITCH_COORDS]
[--switch_annotation SWITCH_ANNOTATION]
[--min_length MIN_LENGTH] [--only_complete]
[--keep_internal] [--remove_duplicates]
[--ignore_order] [--telo TELO]
[--min_switch_match MIN_SWITCH_MATCH]
[--max_switch_overlap MAX_SWITCH_OVERLAP]
[--max_insert_gap MAX_INSERT_GAP]
[--max_insert_overlap MAX_INSERT_OVERLAP]
[--min_insert_match MIN_INSERT_MATCH]
[--min_num_switch_matches MIN_NUM_SWITCH_MATCHES]
[--min_cov MIN_COV] [--max_gap MAX_GAP]
[--isotype_extra ISOTYPE_EXTRA]
[--telo_cutoff TELO_CUTOFF]
[--min_telo_matchlen MIN_TELO_MATCHLEN]
[--blacklist_regions [BLACKLIST_REGIONS]]
[--realign_breakpoints REALIGN_BREAKPOINTS]
[--realignment_penalties REALIGNMENT_PENALTIES]
[--raw_reads RAW_READS] [--genome GENOME]
[--keep_secondary_alignments]
[--paired_end_mode]
[--remove_isotypes REMOVE_ISOTYPES]
[--interrupt_for_read INTERRUPT_FOR_READ]
- -h, --help
show this help message and exit
- --alignments <alignments>
required: MAF file (LAST output) or SAM file (minimap2 output) with aligned reads (LAST output)
- --outfile <outfile>
required: output table with processed reads
- --info <info>
required: csv file with read info
- --sequences <sequences>
save sequence alignment to reference for pseudo-multiple alignment
- --stats <stats>
output stats on removed reads
- --switch_coords <switch_coords>
coordinates of switch region [chr14:106050000-106337000]
- --switch_annotation <switch_annotation>
bed file with switch annotation
- --min_length <min_length>
minimum read length [500]
- --only_complete
use only complete reads with two primers
- --keep_internal
keep reads with internal primers
- --remove_duplicates
remove duplicate reads (based on UMIs in read IDs)
- --ignore_order
don’t check order of matches along read
- --telo <telo>
output of blasting reads against telomer repeats
- --min_switch_match <min_switch_match>
min match length to switch region
- --max_switch_overlap <max_switch_overlap>
max overlap between switch parts of different orientation [0]
- --max_insert_gap <max_insert_gap>
max gap between insert and switch parts (on the read) [100]
- --max_insert_overlap <max_insert_overlap>
max overlap between switch and insert parts (on the read) [50]
- --min_insert_match <min_insert_match>
min length of insert [50]
- --min_num_switch_matches <min_num_switch_matches>
min number of switch matches in read [1]
- --min_cov <min_cov>
coverage cutoff on reads [0.90]
- --max_gap <max_gap>
split alignments with gaps beyond that value [75]
- --isotype_extra <isotype_extra>
extra space added to define isotypes [200]
- --telo_cutoff <telo_cutoff>
% identify cutoff for telomer match [90]
- --min_telo_matchlen <min_telo_matchlen>
minimum matchlen for telomeric repeat matches [25]
- --blacklist_regions <blacklist_regions>
ignore inserts from blacklist regions defined in this bed file
- --realign_breakpoints <realign_breakpoints>
re-align reads around breakpoints and save results to this file (requires
--raw_reads
and--genome
)
- --realignment_penalties <realignment_penalties>
re-alignment penalties (“ont”, “hifi” or “sr”) [ont]
- --raw_reads <raw_reads>
fasta file with unaligned reads (interleaved mates for paired-end mode)
- --genome <genome>
reference genome file
- --keep_secondary_alignments
keep secondary alignments in SAM file (but conflicting alignments are not resolved!)
- --paired_end_mode
EXPERIMENTAL: use paired-end mode (use interleaved mates as fastq input)
- --remove_isotypes <remove_isotypes>
remove reads with these isotypes (comma-separated list)
- --interrupt_for_read <interrupt_for_read>
(for debugging) interrupt for specified read(s)
swibrid run
main command: run the entire pipeline.
run swibrid setup
and edit config.yaml first!
this will call snakemake; additional options are passed to snakemake
e.g., for a dry run use swibrid run -n
, or swibrid run --unlock
after a failed run
usage: swibrid run [-h] [-s SNAKEFILE] [-c CONFIG] [--slurm] [--sge]
[--queue QUEUE]
...
- snake_options
pass options to snakemake (…)
- -h, --help
show this help message and exit
- -s <snakefile>, --snakefile <snakefile>
snakefile [pipeline.snake]
- -c <config>, --config <config>
config file [config.yaml]
- --slurm
submit to slurm
- --sge
submit to SGE
- --queue <queue>
slurm queue [medium]
swibrid setup
main command: copy config and snakefile to current directory
usage: swibrid setup [-h] [-f]
- -h, --help
show this help message and exit
- -f, --overwrite
overwrite files if present [no]
swibrid test
test the pipeline.
this will create synthetic reads in input
and run the pipeline on this data,
using a reduced hg38 genome in index
with only the switch region (chr14:105000000-106000000).
it will probably take about 30-60 minutes and will call snakemake, passing on additional options.
e.g., for a dry run use swibrid test -n
, and do swibrid test --unlock
after a failed run
usage: swibrid test [-h] [-f] [-s SNAKEFILE] [-c CONFIG] [--slurm] [--sge]
[--queue QUEUE]
...
- snake_options
pass options to snakemake (…)
- -h, --help
show this help message and exit
- -f, --overwrite
overwrite files if present [no]
- -s <snakefile>, --snakefile <snakefile>
snakefile [pipeline.snake]
- -c <config>, --config <config>
config file [test_config.yaml]
- --slurm
submit to slurm
- --sge
submit to SGE
- --queue <queue>
slurm queue [medium]