understanding output
read plot
SWIBRID produces read plots to visualize alignments, clustering and single-nucleotide/structural variants detected in input reads. By default, two plots are produced (in output/read_plots). The first one shows all read alignments to selected regions in the IGH locus, where reads are ordered according to the dendrogram produced by hierarchical clustering and colored by the cluster identity obtained from cutting the dendrogram at the specified cutoff (indicated by the red line). Primer locations are visible from the “justified” positions of the read ends, while breaks are visible from the “ragged” positions. Small gaps are ignored for clustering, while intermediate gaps indicate intra-switch breaks and large gaps indicate regular switch junctions.
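
To illustrate how the cutoff determines the number of clusters, here is a minimal sketch that is not taken from the SWIBRID source. It assumes a dendrogram whose merge heights are monotone (as for single, complete or average linkage); the function name and the toy heights are hypothetical.

```python
from bisect import bisect_right


def n_clusters_at_cutoff(merge_heights, cutoff):
    """Count flat clusters left after cutting a dendrogram at `cutoff`.

    merge_heights: heights of the n-1 merges of a dendrogram over n reads.
    Assumption: heights are monotone, so every merge at or below the
    cutoff joins exactly two clusters into one.
    """
    heights = sorted(merge_heights)
    n_leaves = len(heights) + 1
    merges_applied = bisect_right(heights, cutoff)
    return n_leaves - merges_applied


# toy dendrogram over 6 reads (hypothetical merge heights)
heights = [0.01, 0.02, 0.05, 0.30, 0.45]
for cutoff in (0.0, 0.03, 0.1, 0.5):
    print(cutoff, n_clusters_at_cutoff(heights, cutoff))
```

Raising the cutoff can only reduce the number of clusters, which is why the cluster count drawn against the cutoff is a decreasing curve.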

The second plot shows the same reads but now colored by coverage (of the genome, per read): blue indicates 1x coverage, violet 2x coverage (i.e., a duplication event), and orange negative coverage (an inversion event). Single-nucleotide variants are marked by black dots, and triangles on top indicate a tentative assignment of these variants into likely homo- or heterozygous germline, or likely somatic.

breakpoint histogram
The breakpoint histograms (in output/breakpoint_plots) show 1D and 2D histograms of breakpoint positions over selected regions in the IGH locus. Regular breaks are shown in black, while breaks connected to inversions and duplications are shown in red and blue, respectively. The 1D histogram shows donor and acceptor breakpoints together in 50nt bins; the 2D histogram shows junctions between a donor region (on the x-axis) and an acceptor region (on the y-axis) in 500nt bins. The regions shaded in green, orange and purple indicate breaks corresponding to “single” events, sequential switching or intra-switch breaks, respectively. Inversions and duplications (in red and blue) are indicated in the lower right diagonal of the plot.
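
The binning described above can be sketched as follows. This is an illustration only: the function name, the input format (pairs of donor and acceptor positions) and the toy coordinates are assumptions, and SWIBRID's own coordinate system over the selected regions is not reproduced here.

```python
from collections import Counter


def bin_breakpoints(junctions, bin_1d=50, bin_2d=500):
    """Bin (donor_pos, acceptor_pos) pairs into 1D and 2D histograms.

    Keys are the left edges of the bins. The 1D histogram pools donor and
    acceptor breakpoints; the 2D histogram keeps them as (donor, acceptor).
    """
    hist_1d = Counter()
    hist_2d = Counter()
    for donor, acceptor in junctions:
        hist_1d[donor // bin_1d * bin_1d] += 1
        hist_1d[acceptor // bin_1d * bin_1d] += 1
        hist_2d[(donor // bin_2d * bin_2d, acceptor // bin_2d * bin_2d)] += 1
    return hist_1d, hist_2d


# hypothetical junctions: (donor position, acceptor position) in nt
junctions = [(120, 5230), (140, 5260), (980, 7410)]
hist_1d, hist_2d = bin_breakpoints(junctions)
print(sorted(hist_1d.items()))
print(sorted(hist_2d.items()))
```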

QC plot
QC plots are created in output/QC_plots and contain the following panels:
histogram of input read lengths
histogram of positions of detected primers and barcodes (dashed lines indicate “internal” barcodes / primers, i.e., more than 100nt from the ends)
isotype composition, as fraction of clusters (top) or reads (bottom)
plot of # clusters (in red; left y-axis, log-scale) as function of dendrogram cutoff, or of the entropy of the cluster distribution (blue; right y-axis, linear scale)
histogram of cluster size, filtered clusters are indicated by orange shading
scatter plot of clone length (in nt) vs. cluster size (# of reads)
histogram of gap sizes in MSA; the 75bp cutoff is indicated by a red line
histogram of breakpoint positions, separated by donor (=left) and acceptor (=right)
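
The “internal” criterion for primers and barcodes (more than 100nt from the ends) can be expressed as a small check. The sketch below assumes 0-based, half-open hit coordinates on the read; the function name and this coordinate convention are assumptions, not SWIBRID's API.

```python
def is_internal(start, end, read_length, margin=100):
    """Return True if a primer/barcode hit is more than `margin` nt
    away from both ends of the read (0-based, half-open coordinates)."""
    return start > margin and (read_length - end) > margin


print(is_internal(10, 30, 2000))      # hit near the left end
print(is_internal(900, 920, 2000))    # hit in the middle of the read
print(is_internal(1950, 1975, 2000))  # hit near the right end
```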

output features
SWIBRID produces a table of features for each sample in output/summary. Below is a detailed explanation of the columns in that file. Note that some feature names are different internally and only changed to these values in the final aggregation step (swibrid collect_results). Check the source code for the mapping of old to new names.
QC features
- nreads_initial
initial number of reads
- nreads_mapped
number of reads mapped
- nreads_removed_short
number of reads that are too short for processing (<500nt)
- nreads_removed_incomplete
number of reads without forward and reverse primer
- nreads_removed_no_info
number of reads without info on primers
- nreads_removed_internal_primer
number of reads removed because of internal primers
- nreads_removed_no_switch
number of reads removed because of lacking alignment to switch region
- nreads_removed_length_mismatch
number of reads removed because the mapped part of the read is much longer than the mapped part of the genome
- nreads_removed_overlap_mismatch
number of reads removed because there’s overlap between alignments on the read or on the genome
- nreads_inversions
number of reads that contain inverted segments
- nreads_removed_low_cov
number of reads removed because too little of the read maps
- nreads_removed_switch_order
number of reads removed because switch alignments are in wrong order
- nreads_removed_no_isotype
number of reads removed because isotype could not be determined
- nreads_removed_duplicate
number of reads removed because of duplicate UMIs (in UMI mode, --remove-duplicates)
- nreads_unmapped
number of unmapped reads
- frac_reads_unused
fraction of unused reads
- nreads
final number of reads
- mean_frac_mapped
mean fraction of read that’s mapped
- mean_frac_mapped_multi
mean fraction of read that’s mapped multiple times
- mean_frac_ignored
mean fraction of read that doesn’t map to selected switch regions
- clustering_cutoff
dendrogram cutoff (specified or inferred from data)
- frac_singletons
fraction of singleton clusters
- PCR_bias_length
regression coefficient of cluster length vs. cluster size
- PCR_bias_GC
regression coefficient of cluster GC vs. cluster size
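
As an illustration of what a regression coefficient such as PCR_bias_length measures, here is a minimal ordinary least-squares slope. It assumes cluster length as predictor and cluster size as response, with no transformation of either variable; whether SWIBRID uses this orientation or rescales the variables is not stated here, so treat this as a sketch only.

```python
def ols_slope(x, y):
    """Slope of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# hypothetical clusters: length in nt and size in reads
cluster_length = [1000, 2000, 3000]
cluster_size = [30, 20, 10]
slope = ols_slope(cluster_length, cluster_size)
print(slope)
```

A negative slope in this toy example means that longer clusters tend to contain fewer reads, which is the kind of length-dependent amplification bias the feature is meant to flag.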
diversity features
these features are computed twice: once on the total number of reads (“_raw”), and then again as averages of 10 replicates from downsampling to 1000 reads
- clusters_initial
pre-filter cluster number
- clusters(_raw)
post-filter cluster number
- clusters_eff(_raw)
number of equally-sized clusters that would have the same entropy as observed
- cluster_size_(mean/std)(_raw)
mean and std.dev of cluster size (in fraction of reads)
- cluster_gini(_raw)
Gini coefficient of cluster size distribution
- cluster_entropy(_raw)
entropy of cluster size distribution
- cluster_inverse_simpson(_raw)
inverse Simpson coefficient of cluster size distribution
- occupancy_top_cluster(_raw)
fraction of reads in biggest cluster
- occupancy_big_clusters(_raw)
fraction of reads in all clusters that contain more than 1% of reads
- read_length_(mean/std)
mean and std.dev. of length of reads in clusters (in nt)
- GC_content_(mean/std)
mean and std.dev. GC content of clusters
- read_length_(mean/std)_SX
mean and std of cluster length for isotype SX
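
The diversity measures above follow standard definitions and can be sketched from a list of cluster sizes. The code below is an illustration under stated assumptions, not SWIBRID's implementation: it uses the natural logarithm for the entropy, the standard sorted-rank formula for the Gini coefficient, and omits the downsampling step.

```python
import math


def diversity(cluster_sizes):
    """Diversity measures from cluster sizes (number of reads per cluster)."""
    total = sum(cluster_sizes)
    p = sorted(size / total for size in cluster_sizes)
    k = len(p)
    # Shannon entropy (natural log) and the matching effective cluster number
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    # Gini coefficient from ascending ranks i = 1..k
    gini = sum((2 * i - k - 1) * pi for i, pi in enumerate(p, start=1)) / (k * sum(p))
    return {
        "clusters": k,
        "cluster_entropy": entropy,
        "clusters_eff": math.exp(entropy),
        "cluster_gini": gini,
        "cluster_inverse_simpson": 1.0 / sum(pi * pi for pi in p),
        "occupancy_top_cluster": p[-1],
        # clusters holding more than 1% of reads
        "occupancy_big_clusters": sum(pi for pi in p if pi > 0.01),
    }


even = diversity([25, 25, 25, 25])    # perfectly even sample
skewed = diversity([97, 1, 1, 1])     # one dominant cluster
print(even["clusters_eff"], skewed["occupancy_top_cluster"])
```

For a perfectly even sample the effective cluster number equals the actual cluster number and the Gini coefficient is zero; a dominant cluster pushes the Gini coefficient towards one.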
isotypes
- frac_reads_SX
fraction of reads in isotype SX
- pct_clusters_SX
percentage of clusters with isotype SX
- alpha_ratio_reads
fraction of reads with isotype SA*
- alpha_ratio_clusters
fraction of clusters with isotype SA*
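
The isotype features can be sketched from per-read cluster and isotype assignments. In the example below, the input format, the isotype labels (SG1, SA1, SA2, SM) and the rule that a cluster takes the most common isotype of its reads are all assumptions made for illustration.

```python
from collections import Counter


def isotype_features(reads):
    """reads: list of (cluster_id, isotype) pairs, one per read."""
    read_counts = Counter(isotype for _, isotype in reads)
    # assumption: a cluster's isotype is the most common one among its reads
    per_cluster = {}
    for cluster, isotype in reads:
        per_cluster.setdefault(cluster, Counter())[isotype] += 1
    cluster_counts = Counter(c.most_common(1)[0][0] for c in per_cluster.values())

    n_reads, n_clusters = len(reads), len(per_cluster)
    features = {}
    for isotype, count in read_counts.items():
        features[f"frac_reads_{isotype}"] = count / n_reads
        features[f"pct_clusters_{isotype}"] = 100 * cluster_counts[isotype] / n_clusters
    # SA* = any isotype whose label starts with "SA"
    features["alpha_ratio_reads"] = (
        sum(n for iso, n in read_counts.items() if iso.startswith("SA")) / n_reads
    )
    features["alpha_ratio_clusters"] = (
        sum(n for iso, n in cluster_counts.items() if iso.startswith("SA")) / n_clusters
    )
    return features


reads = [(0, "SG1"), (0, "SG1"), (1, "SA1"), (2, "SA2"), (2, "SA2"), (3, "SM")]
features = isotype_features(reads)
print(features)
```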
structural rearrangements
- pct_templated_inserts
insert frequency (over the dataset)
- inversion_size
median size of inversions
- duplication_size
median size of duplications
- pct_inversions
percentage of breaks creating inversions
- pct_duplications
percentage of breaks creating duplications
- frac_breaks_inversions_intra
fraction of breaks creating inversions for intra-switch breaks
- frac_breaks_duplications_intra
fraction of breaks creating duplications for intra-switch breaks
- n_inserts
number of inserts
- n_unique_inserts
number of unique inserts
- n_clusters_inserts
number of clusters with inserts
- mean_cluster_insert_frequency
mean insert frequency (per cluster)
- mean_insert_overlap
mean overlap of inserts for reads in the same cluster
- mean_insert_pos_overlap
mean overlap of left and right insert-generating breakpoints for reads in the same cluster
- mean_insert_length
mean insert length
- mean_insert_gap_length
mean length of gaps between left and right insert-generating breakpoints
- ninserts_SX_SY
number of inserts between SX and SY regions
context features
- untemplated_inserts
mean number of untemplated nucleotides for switch junctions
- untemplated_inserts_SX
mean number of untemplated nucleotides for switch junctions for isotype SX
- homology
mean number of homologous nucleotides for switch junctions
- homology_SX
mean number of homologous nucleotides for switch junctions of isotype SX
- pct_blunt
percentage of blunt ends (no untemplated, no homologous nucleotides)
- pct_blunt_SX
percentage of blunt ends (no untemplated, no homologous nucleotides) for isotype SX
- homology_score_fw
average amount of homology in 50nt bins between donor and acceptor region
- homology_score_rv
average amount of (reverse-complementary) homology in 50nt bins between donor and acceptor region
- homology_score_fw/rv_SX_SY
average amount of homology in 50nt bins between donor (SX) and acceptor (SY) region
- donor_score_M
score (=weighted frequency) for motif M (M=S,W,WGCW,GAGCT,GGGST with S=G/C, W=A/T) in 50nt bins of donor switch regions
- acceptor_score_M
score for motif M in 50nt bins of acceptor switch regions
- donor_complexity
complexity score (number of observed 5mers / number of possible 5mers) in 50nt bins of donor switch regions
- acceptor_complexity
complexity score for acceptor regions
- donor/acceptor_score_M_SX_SY
donor/acceptor score for motif M between donor region SX and acceptor region SY
- donor/acceptor_complexity_SX_SY
donor/acceptor complexity between donor region SX and acceptor region SY
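
To make the motif notation concrete (S=G/C, W=A/T), the sketch below counts motif occurrences per 50nt bin of a sequence. It yields plain, unweighted counts; the weighting behind the reported scores is not described here and is not reproduced. Assigning an occurrence to the bin of its start position is also an assumption.

```python
import re

# IUPAC codes used by the motifs listed above
IUPAC = {"S": "[GC]", "W": "[AT]"}


def motif_regex(motif):
    """Compile a motif into a regex; the lookahead allows overlapping hits."""
    pattern = "".join(IUPAC.get(base, base) for base in motif)
    return re.compile(f"(?=({pattern}))")


def motif_counts_per_bin(seq, motif, bin_size=50):
    """Count motif occurrences per bin, assigning each hit by its start."""
    n_bins = (len(seq) + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for match in motif_regex(motif).finditer(seq):
        counts[match.start() // bin_size] += 1
    return counts


# toy 100nt sequence: AGCT in the first bin, GAGCT at the start of the second
seq = "AGCT" + "C" * 46 + "GAGCT" + "C" * 45
print(motif_counts_per_bin(seq, "WGCW"))
print(motif_counts_per_bin(seq, "GAGCT"))
```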
breakpoint matrix
- breaks_normalized
number of breaks by number of clusters
- pct_breaks_SX_SY
fraction of breaks between SX and SY
- pct_direct_switch
percentage of “single” breaks (see green region in 2D breakpoint histogram)
- pct_sequential_switch
percentage of “sequential” breaks (see orange region in 2D breakpoint histogram)
- pct_intraswitch_deletion
percentage of breaks creating intraswitch deletions (see purple region in 2D breakpoint histogram)
- intraswitch_size_(mean/std)
mean and std.dev. size of intra-switch breaks
- break_dispersion_SX
dispersion (standard deviation) of breakpoint positions in region SX
variants
- num_variants
number of detected single-nucleotide variants
- somatic_variants
number of variants classified as likely somatic (not germline)
- frac_variants_transitions
fraction of transition variants
- frac_variants_X>Y
fraction of variants from ref X to alt Y
- num_variants_M
number of variants around specific motif (M=Tw, wrCy, Cg)
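
The substitution-type fractions can be sketched from a list of (ref, alt) pairs. Transitions are the purine-purine and pyrimidine-pyrimidine changes (A<->G, C<->T); the input format and function name are assumptions for illustration.

```python
from collections import Counter

# transitions: purine <-> purine or pyrimidine <-> pyrimidine
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def variant_fractions(variants):
    """variants: list of (ref, alt) single-nucleotide changes."""
    n = len(variants)
    if n == 0:
        return {}
    counts = Counter(variants)
    n_transitions = sum(c for change, c in counts.items() if change in TRANSITIONS)
    features = {
        "num_variants": n,
        "frac_variants_transitions": n_transitions / n,
    }
    for (ref, alt), count in counts.items():
        features[f"frac_variants_{ref}>{alt}"] = count / n
    return features


variants = [("A", "G"), ("C", "T"), ("C", "A"), ("G", "T")]
fractions = variant_fractions(variants)
print(fractions)
```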
demultiplexing
swibrid demultiplex creates fastq files for each sample in a sample sheet, together with {sample}_info.csv files that contain meta-data for each read (taken from the header of the original raw output, plus locations and identities of detected primers and barcodes). It also creates a summary csv file with statistics on how many reads were assigned to each sample barcode, and a summary figure:

The bar plot in the top left shows how many reads were assigned to each sample or remained “undetermined”. The pie chart on the right shows a breakdown of undetermined reads by barcode, in case the sample sheet is missing a sample barcode appearing in the reads. The histogram in the lower left shows relative locations of the detected barcodes, and the one in the lower right the distribution of read lengths in each sample (colors match the sample colors in the top plot).
other output
The pipeline writes intermediate files to folders pipeline/{sample} and log files to logs/{step} (or logs/slurm/{step} if --slurm is used). Intermediate files are:
{sample}_aligned.par
LAST parameters estimated for that sample
{sample}_last_pars.npz
LAST parameters as numpy array
{sample}_aligned.maf.gz
LAST output (as MAF) for that sample
{sample}_telo.out
output from BLASTing reads against telomere repeats
{sample}_processed.out
process_alignments output table
{sample}_process_stats.csv
process_alignments stats
{sample}_aligned.fasta.gz
aligned reads as input to construct_MSA
{sample}_breakpoint_alignments.csv
table with breakpoint re-alignment results
{sample}_msa.npz
MSA in numpy sparse array format
{sample}_msa.csv
table with reads used in MSA
{sample}_gaps.npz
get_gaps output (numpy arrays)
{sample}_rearrangements.npz
find_rearrangements output (numpy arrays)
{sample}_rearrangements.bed
find_rearrangements output as bed file
{sample}_linkage.npz
construct_linkage output (linkage matrix, python standard)
{sample}_inserts.tsv
table with insert coordinates, annotation and sequence
{sample}.bed
bed file with alignment coordinates for all reads and inserts
{sample}_cutoff_scanning.csv
number of clusters as function of cutoff
{sample}_clustering.csv
cluster assignment
{sample}_cluster_stats.csv
find_clusters statistics
{sample}_cluster_analysis.csv
statistics aggregated over clusters
{sample}_cluster_downsampling.csv
cluster statistics after downsampling 10 times
{sample}_breakpoint_stats.csv
table with breakpoint matrix statistics
{sample}_variants.txt
txt file (vcf-like) with variants
{sample}_variants.npz
numpy file with variant and read coordinates
{sample}_haplotypes.csv
table with haplotype assignments for clusters
{sample}_summary.csv
summary of features for that sample
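
When a run stops partway, it can help to see which of the files listed above exist for a sample. The helper below is a hypothetical convenience and not part of SWIBRID; it assumes the pipeline/{sample}/{sample}<suffix> layout described above and checks only a small selection of suffixes.

```python
from pathlib import Path

# a selection of the file suffixes listed above
SUFFIXES = [
    "_aligned.maf.gz",
    "_processed.out",
    "_msa.npz",
    "_clustering.csv",
    "_summary.csv",
]


def missing_intermediates(sample, root="pipeline", suffixes=SUFFIXES):
    """Return the suffixes whose file is absent from root/sample/."""
    folder = Path(root) / sample
    return [s for s in suffixes if not (folder / f"{sample}{s}").exists()]


# "sample1" is a placeholder sample name
print(missing_intermediates("sample1"))
```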