understanding output

read plot

SWIBRID produces read plots to visualize alignments, clustering and single-nucleotide/structural variants detected in input reads. By default, two plots are produced (in output/read_plots). The first one shows all read alignments to selected regions in the IGH locus, where reads are ordered according to the dendrogram produced by hierarchical clustering and colored by the cluster identity obtained from cutting the dendrogram at the specified cutoff (indicated by the red line). Primer locations are visible from the “justified” positions of the read ends, while breaks are visible from the “ragged” positions. Small gaps are ignored for clustering, intermediate gaps indicate intra-switch breaks, and large gaps indicate regular switch junctions.

read plot by cluster identity

The second plot shows the same reads but now colored by coverage (of the genome, per read): blue indicates 1x coverage, violet 2x coverage (i.e., a duplication event), and orange negative coverage (an inversion event). Single-nucleotide variants are marked by black dots, and triangles on top indicate a tentative assignment of these variants into likely homo- or heterozygous germline, or likely somatic.

read plot by coverage with variants

breakpoint histogram

The breakpoint histograms (in output/breakpoint_plots) show 1D and 2D histograms of breakpoint positions over selected regions in the IGH locus. Regular breaks are shown in black, while breaks connected to inversions and duplications are shown in red and blue, respectively. The 1D histogram shows donor and acceptor breakpoints together in 50nt bins; the 2D histogram shows junctions between a donor region (on the x-axis) and an acceptor region (on the y-axis) in 500nt bins. The regions shaded in green, orange and purple indicate breaks corresponding to “single” events, sequential switching and intra-switch breaks, respectively. Inversions and duplications (in red and blue) are indicated in the lower right diagonal of the plot.
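The binning described above can be sketched in a few lines. This is only an illustration, not SWIBRID's plotting code: the function names and the example coordinates are made up, and positions are assumed to be 0-based offsets along the selected switch regions.

```python
from collections import Counter


def bin_breakpoints_1d(junctions, binsize=50):
    """Count donor and acceptor breakpoints together in fixed-width bins."""
    counts = Counter()
    for donor, acceptor in junctions:
        counts[donor // binsize] += 1
        counts[acceptor // binsize] += 1
    return counts


def bin_junctions_2d(junctions, binsize=500):
    """Count junctions per (donor bin, acceptor bin) pair."""
    return Counter((d // binsize, a // binsize) for d, a in junctions)


# hypothetical (donor, acceptor) breakpoint positions in nt
junctions = [(120, 5230), (140, 5260), (980, 1450)]
hist_1d = bin_breakpoints_1d(junctions)  # keys are 50nt bin indices
hist_2d = bin_junctions_2d(junctions)    # keys are (x bin, y bin) in 500nt units
```

The two nearby junctions fall into different 50nt acceptor bins in the 1D histogram but into the same cell of the coarser 2D histogram, which is why the 2D view is better suited to spotting recurrent donor/acceptor combinations.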

breakpoint histograms

QC plot

QC plots are created in output/QC_plots and contain the following panels:

  1. histogram of input read lengths

  2. histogram of positions of detected primers and barcodes (dashed lines indicate “internal” barcodes / primers, i.e., more than 100nt from the ends)

  3. isotype composition, as fraction of clusters (top) or reads (bottom)

  4. number of clusters (red; left y-axis, log scale) and entropy of the cluster size distribution (blue; right y-axis, linear scale) as a function of dendrogram cutoff

  5. histogram of cluster size; filtered clusters are indicated by orange shading

  6. scatter plot of clone length (in nt) vs. cluster size (# of reads)

  7. histogram of gap sizes in the MSA; the 75bp cutoff is indicated by a red line

  8. histogram of breakpoint positions, separated by donor (=left) and acceptor (=right)

QC plots

output features

SWIBRID produces a table of features for each sample in output/summary. Below is a detailed explanation of the columns in that file. Note that some features have different names internally and are only renamed to the names listed here in the final aggregation step (swibrid collect_results). Check the source code for the mapping of old to new names.

QC features

nreads_initial

initial number of reads

nreads_mapped

number of reads mapped

nreads_removed_short

number of reads that are too short for processing (<500nt)

nreads_removed_incomplete

number of reads without forward and reverse primer

nreads_removed_no_info

number of reads without info on primers

nreads_removed_internal_primer

number of reads removed because of internal primers

nreads_removed_no_switch

number of reads removed because they lack an alignment to a switch region

nreads_removed_length_mismatch

number of reads removed because the mapped part of the read is much longer than the mapped part of the genome

nreads_removed_overlap_mismatch

number of reads removed because there’s overlap between alignments on the read or on the genome

nreads_inversions

number of reads that contain inverted segments

nreads_removed_low_cov

number of reads removed because too little of the read maps

nreads_removed_switch_order

number of reads removed because switch alignments are in wrong order

nreads_removed_no_isotype

number of reads removed because isotype could not be determined

nreads_removed_duplicate

number of reads removed because of duplicate UMIs (in UMI mode, --remove-duplicates)

nreads_unmapped

number of unmapped reads

frac_reads_unused

fraction of unused reads

nreads

final number of reads

mean_frac_mapped

mean fraction of read that’s mapped

mean_frac_mapped_multi

mean fraction of read that’s mapped multiple times

mean_frac_ignored

mean fraction of read that doesn’t map to selected switch regions

clustering_cutoff

dendrogram cutoff (specified or inferred from data)

frac_singletons

fraction of singleton clusters

PCR_bias_length

regression coefficient of cluster length vs. cluster size

PCR_bias_GC

regression coefficient of cluster GC vs. cluster size

diversity features

These features are computed twice: once on the total number of reads (“_raw”), and then again as averages of 10 replicates from downsampling to 1000 reads.

clusters_initial

pre-filter cluster number

clusters(_raw)

post-filter cluster number

clusters_eff(_raw)

number of equally-sized clusters that would have the same entropy as observed

cluster_size_(mean/std)(_raw)

mean and std.dev. of cluster size (in fraction of reads)

cluster_gini(_raw)

Gini coefficient of cluster size distribution

cluster_entropy(_raw)

entropy of cluster size distribution

cluster_inverse_simpson(_raw)

inverse Simpson coefficient of cluster size distribution

occupancy_top_cluster(_raw)

fraction of reads in biggest cluster

occupancy_big_clusters(_raw)

fraction of reads in all clusters that contain more than 1% of reads

read_length_(mean/std)

mean and std.dev. of length of reads in clusters (in nt)

GC_content_(mean/std)

mean and std.dev. of GC content of clusters

read_length_(mean/std)_SX

mean and std of cluster length for isotype SX
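As a rough guide to how the diversity features relate to each other, the sketch below computes them from a list of cluster sizes and then averages them over downsampled replicates. It is not taken from the SWIBRID source: the function names are hypothetical, the natural logarithm is assumed for the entropy (so that the effective cluster number is exp(entropy)), and the Gini coefficient uses the standard sorted-rank formula.

```python
import math
import random
from collections import Counter


def diversity(cluster_sizes):
    """Diversity statistics for a list of cluster sizes (reads per cluster)."""
    total = sum(cluster_sizes)
    p = sorted(s / total for s in cluster_sizes)  # ascending fractions of reads
    n = len(p)
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    # Gini coefficient from ranks of the sorted fractions (which sum to 1)
    gini = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(p)) / n
    return {
        "clusters": n,
        "cluster_entropy": entropy,
        "clusters_eff": math.exp(entropy),  # assumes natural-log entropy
        "cluster_inverse_simpson": 1 / sum(x * x for x in p),
        "cluster_gini": gini,
        "occupancy_top_cluster": p[-1],
        "occupancy_big_clusters": sum(x for x in p if x > 0.01),
    }


def downsampled_diversity(read_clusters, n_reads=1000, n_reps=10, seed=0):
    """Average the statistics over repeated random subsamples of the reads.

    read_clusters holds one cluster label per read.
    """
    rng = random.Random(seed)
    reps = []
    for _ in range(n_reps):
        sample = rng.sample(read_clusters, min(n_reads, len(read_clusters)))
        reps.append(diversity(list(Counter(sample).values())))
    return {key: sum(r[key] for r in reps) / n_reps for key in reps[0]}


# hypothetical data: cluster label for each of 3000 reads
read_clusters = [0] * 1500 + [1] * 900 + [2] * 600
raw = diversity(list(Counter(read_clusters).values()))  # "_raw" values
averaged = downsampled_diversity(read_clusters)         # downsampled averages
```

Downsampling to a fixed read number makes samples with different sequencing depth comparable, since cluster counts and entropy both grow with the number of reads.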

isotypes

frac_reads_SX

fraction of reads in isotype SX

pct_clusters_SX

percentage of clusters with isotype SX

alpha_ratio_reads

fraction of reads with isotype SA*

alpha_ratio_clusters

fraction of clusters with isotype SA*

structural rearrangements

pct_templated_inserts

insert frequency (over the dataset)

inversion_size

median size of inversions

duplication_size

median size of duplications

pct_inversions

percentage of breaks creating inversions

pct_duplications

percentage of breaks creating duplications

frac_breaks_inversions_intra

fraction of breaks creating inversions for intra-switch breaks

frac_breaks_duplications_intra

fraction of breaks creating duplications for intra-switch breaks

n_inserts

number of inserts

n_unique_inserts

number of unique inserts

n_clusters_inserts

number of clusters with inserts

mean_cluster_insert_frequency

mean insert frequency (per cluster)

mean_insert_overlap

mean overlap of inserts for reads in the same cluster

mean_insert_pos_overlap

mean overlap of left and right insert-generating breakpoints for reads in the same cluster

mean_insert_length

mean insert length

mean_insert_gap_length

mean length of gaps between left and right insert-generating breakpoints

ninserts_SX_SY

number of inserts between SX and SY regions

context features

untemplated_inserts

mean number of untemplated nucleotides for switch junctions

untemplated_inserts_SX

mean number of untemplated nucleotides for switch junctions for isotype SX

homology

mean number of homologous nucleotides for switch junctions

homology_SX

mean number of homologous nucleotides for switch junctions of isotype SX

pct_blunt

percentage of blunt ends (no untemplated, no homologous nucleotides)

pct_blunt_SX

percentage of blunt ends (no untemplated, no homologous nucleotides) for isotype SX

homology_score_fw

average amount of homology in 50nt bins between donor and acceptor region

homology_score_rv

average amount of (reverse-complementary) homology in 50nt bins between donor and acceptor region

homology_score_fw/rv_SX_SY

average amount of homology in 50nt bins between donor (SX) and acceptor (SY) region

donor_score_M

score (=weighted frequency) for motif M (M=S,W,WGCW,GAGCT,GGGST with S=G/C, W=A/T) in 50nt bins of donor switch regions

acceptor_score_M

score for motif M in 50nt bins of acceptor switch regions

donor_complexity

complexity score (number of observed 5mers / number of possible 5mers) in 50nt bins of donor switch regions

acceptor_complexity

complexity score for acceptor regions

donor/acceptor_score_M_SX_SY

donor/acceptor score for motif M between donor region SX and acceptor region SY

donor/acceptor_complexity_SX_SY

donor/acceptor complexity between donor region SX and acceptor region SY
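The motif and complexity scores can be illustrated with a small sketch. Several things here are assumptions rather than SWIBRID's actual implementation: the function names are hypothetical, the motif score is shown as a plain count per bin (the weighting mentioned above is not reproduced), and "number of possible 5mers" is taken to be the number of 5-mer positions in the window, capped at 4^5.

```python
import re

# IUPAC codes used in the motif definitions above
IUPAC = {"S": "[GC]", "W": "[AT]"}


def motif_regex(motif):
    """Translate a motif with S/W codes into a regex that finds overlapping hits."""
    pattern = "".join(IUPAC.get(base, base) for base in motif)
    return re.compile(f"(?=({pattern}))")


def motif_counts(seq, motif, binsize=50):
    """Number of motif occurrences starting in each bin of the sequence."""
    counts = [0] * ((len(seq) + binsize - 1) // binsize)
    for match in motif_regex(motif).finditer(seq.upper()):
        counts[match.start() // binsize] += 1
    return counts


def complexity(window, k=5):
    """Distinct k-mers observed, divided by the number that could occur.

    Assumption: the denominator is the number of k-mer positions in the
    window, capped at 4**k.
    """
    n_positions = len(window) - k + 1
    if n_positions <= 0:
        return 0.0
    observed = {window[i:i + k] for i in range(n_positions)}
    return len(observed) / min(4 ** k, n_positions)


# hypothetical sequences
wgcw_hits = motif_counts("AGCTAGCT", "WGCW")
low_complexity = complexity("A" * 50)
```

Repetitive switch-region sequence yields a low complexity value because the same few 5-mers recur, while motif counts such as WGCW highlight AID hotspot density in the same bins.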

breakpoint matrix

breaks_normalized

number of breaks normalized by the number of clusters

pct_breaks_SX_SY

fraction of breaks between SX and SY

pct_direct_switch

percentage of “single” breaks (see green region in 2D breakpoint histogram)

pct_sequential_switch

percentage of “sequential” breaks (see orange region in 2D breakpoint histogram)

pct_intraswitch_deletion

percentage of breaks creating intraswitch deletions (see purple region in 2D breakpoint histogram)

intraswitch_size_(mean/std)

mean and std.dev. of the size of intra-switch breaks

break_dispersion_SX

dispersion (standard deviation) of breakpoint positions in region SX

variants

num_variants

number of detected single-nucleotide variants

somatic_variants

number of variants classified as likely somatic (not germline)

frac_variants_transitions

fraction of transition variants

frac_variants_X>Y

fraction of variants from ref X to alt Y

num_variants_M

number of variants around specific motif (M=Tw, wrCy, Cg)
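The substitution-type features can be sketched as follows. This is an illustration under assumptions, not SWIBRID's code: variants are represented as hypothetical (ref, alt) pairs, and strand-complementary substitutions (e.g. C>T and G>A) are counted separately, which may differ from how the pipeline groups them.

```python
from collections import Counter

# purine<->purine and pyrimidine<->pyrimidine substitutions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def variant_fractions(variants):
    """Fraction of transitions and fraction of each ref>alt substitution type."""
    counts = Counter(variants)
    total = sum(counts.values())
    frac_transitions = sum(n for v, n in counts.items() if v in TRANSITIONS) / total
    by_type = {f"{ref}>{alt}": n / total for (ref, alt), n in counts.items()}
    return frac_transitions, by_type


# hypothetical list of single-nucleotide variants as (ref, alt)
variants = [("C", "T"), ("C", "T"), ("G", "A"), ("A", "C")]
frac_transitions, by_type = variant_fractions(variants)
```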

demultiplexing

swibrid demultiplex creates fastq files for each sample in a sample sheet, together with {sample}_info.csv files that contain meta-data for each read (taken from the header of the original raw output, plus locations and identities of detected primers and barcodes). It also creates a summary csv file with statistics on how many reads were assigned to each sample barcode, and a summary figure:

demultiplexing output

The bar plot in the top left shows how many reads were assigned to each sample or remained “undetermined”. The pie chart on the right shows a breakdown of undetermined reads by barcode, in case the sample sheet is missing a sample barcode appearing in the reads. The histogram in the lower left shows relative locations of the detected barcodes, and the one in the lower right shows the distribution of read lengths in each sample (colors match the sample colors in the top plot).

other output

The pipeline writes intermediate files to folders pipeline/{sample} and log files to logs/{step} (or logs/slurm/{step} if --slurm is used). Intermediate files are:

{sample}_aligned.par

LAST parameters estimated for that sample

{sample}_last_pars.npz

LAST parameters as numpy array

{sample}_aligned.maf.gz

LAST output (as MAF) for that sample

{sample}_telo.out

output from BLASTing reads against telomere repeats

{sample}_processed.out

process_alignments output table

{sample}_process_stats.csv

process_alignments stats

{sample}_aligned.fasta.gz

aligned reads as input to construct_MSA

{sample}_breakpoint_alignments.csv

table with breakpoint re-alignment results

{sample}_msa.npz

MSA in numpy sparse array format

{sample}_msa.csv

table with reads used in MSA

{sample}_gaps.npz

get_gaps output (numpy arrays)

{sample}_rearrangements.npz

find_rearrangements output (numpy arrays)

{sample}_rearrangements.bed

find_rearrangements output as bed file

{sample}_linkage.npz

construct_linkage output (linkage matrix, python standard)

{sample}_inserts.tsv

table with insert coordinates, annotation and sequence

{sample}.bed

bed file with alignment coordinates for all reads and inserts

{sample}_cutoff_scanning.csv

number of clusters as function of cutoff

{sample}_clustering.csv

cluster assignment

{sample}_cluster_stats.csv

find_clusters statistics

{sample}_cluster_analysis.csv

statistics aggregated over clusters

{sample}_cluster_downsampling.csv

cluster statistics after downsampling 10 times

{sample}_breakpoint_stats.csv

table with breakpoint matrix statistics

{sample}_variants.txt

txt file (vcf-like) with variants

{sample}_variants.npz

numpy file with variant and read coordinates

{sample}_haplotypes.csv

table with haplotype assignments for clusters

{sample}_summary.csv

summary of features for that sample
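For downstream scripting, the naming scheme above can be turned into paths with a small helper. This is a sketch that assumes the files sit directly in pipeline/{sample} under the names listed; the helper names and the chosen subset of suffixes are hypothetical.

```python
from pathlib import Path

# a few suffixes copied from the list above; extend as needed
SUFFIXES = ["_aligned.maf.gz", "_msa.npz", "_clustering.csv", "_summary.csv"]


def intermediate_file(sample, suffix, root="pipeline"):
    """Path following the assumed pipeline/{sample}/{sample}{suffix} layout."""
    return Path(root) / sample / f"{sample}{suffix}"


def existing_files(sample, root="pipeline"):
    """Return those of the listed intermediate files that are present on disk."""
    paths = (intermediate_file(sample, suffix, root) for suffix in SUFFIXES)
    return [p for p in paths if p.exists()]


# hypothetical sample name
msa_path = intermediate_file("sample1", "_msa.npz")
```

The .npz files can be opened with numpy.load. The names of the arrays stored in each file are not documented here, so inspecting the `.files` attribute of the loaded object is a reasonable first step.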