understanding output
====================

read plot
*********

SWIBRID produces read plots to visualize alignments, clustering and
single-nucleotide/structural variants detected in input reads.
By default, two plots are produced (in ``output/read_plots``).

The first one shows all read alignments to selected regions in the IGH locus,
where reads are ordered according to the dendrogram produced by hierarchical
clustering and colored by the cluster identity obtained from cutting the
dendrogram at the specified cutoff (indicated by the red line).
Primer locations are visible from the "justified" positions of the read ends,
while breaks are visible from the "ragged" positions.
Small gaps are ignored for clustering, intermediate gaps indicate intra-switch
breaks, and large gaps indicate regular switch junctions.

.. image:: _static/20211122_85_n500_1_downsampled_reads_annotated.png
   :width: 800
   :alt: read plot by cluster identity

The second plot shows the same reads, now colored by coverage (of the genome,
per read): blue indicates 1x coverage, violet 2x coverage (i.e., a duplication
event), and orange negative coverage (an inversion event).
Single-nucleotide variants are marked by black dots, and triangles on top
indicate a tentative assignment of these variants into likely homo- or
heterozygous germline, or likely somatic.

.. image:: _static/20211122_85_n500_1_downsampled_coverage_annotated.png
   :width: 800
   :alt: read plot by coverage with variants

breakpoint histogram
********************

The breakpoint histograms (in ``output/breakpoint_plots``) show 1D and 2D
histograms of breakpoint positions over selected regions in the IGH locus.
Regular breaks are shown in black, while breaks connected to inversions and
duplications are shown in red and blue, respectively.
The 1D histogram shows donor and acceptor breakpoints together in 50nt bins;
the 2D histogram shows junctions between a donor region (on the x-axis) and an
acceptor region (on the y-axis) in 500nt bins.
The regions shaded in green, orange and purple indicate breaks corresponding to
"single" events, sequential switching or intra-switch breaks, respectively.
Inversions and duplications (in red and blue) are indicated in the lower right
diagonal of the plot.

.. image:: _static/20211122_85_n500_1_downsampled_breakpoints_annotated.png
   :width: 300
   :alt: breakpoint histograms

QC plot
*******

QC plots are created in ``output/QC_plots`` and contain the following panels:

A. histogram of input read lengths
B. histogram of positions of detected primers and barcodes (dashed lines
   indicate "internal" barcodes / primers, i.e., more than 100nt from the ends)
C. isotype composition, as fraction of clusters (top) or reads (bottom)
D. plot of # clusters (in red; left y-axis, log-scale) as function of
   dendrogram cutoff, or of the entropy of the cluster distribution (blue;
   right y-axis, linear scale)
E. histogram of cluster size, filtered clusters are indicated by orange shading
F. scatter plot of clone length (in nt) vs. cluster size (# of reads).
G. histogram of gap sizes in MSA. 75bp cutoff is indicated by red line
H. histogram of breakpoint positions, separated by donor (=left) and acceptor
   (=right)

.. image:: _static/20211122_85_n500_1_downsampled_summary_annotated.png
   :width: 500
   :alt: QC plots

output features
***************

SWIBRID produces a table of features for each sample in ``output/summary``.
Below is a detailed explanation of the columns in that file.
Note that some feature names are different internally and only changed to these
values in the final aggregation step (``swibrid collect_results``).
Check the source code for the mapping of old to new names.
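Several of the diversity features documented in this section (entropy, effective
number of clusters, Gini coefficient, inverse Simpson coefficient, occupancy)
follow standard textbook definitions. The sketch below computes them from a plain
list of cluster sizes. It is an illustration only, not SWIBRID's implementation:
the function name ``diversity_sketch``, the use of the natural logarithm, and
the handling of edge cases are assumptions, and the values SWIBRID reports may
differ in normalization or downsampling.

```python
import math


def diversity_sketch(cluster_sizes):
    """Textbook diversity measures for a list of cluster sizes (read counts).

    Hypothetical helper for illustration; not part of the SWIBRID API.
    """
    sizes = [s for s in cluster_sizes if s > 0]
    total = sum(sizes)
    fracs = sorted(s / total for s in sizes)  # cluster sizes as fraction of reads
    n = len(fracs)

    # Shannon entropy in nats (assumption: natural logarithm)
    entropy = -sum(p * math.log(p) for p in fracs)
    # number of equally sized clusters that would give the same entropy
    clusters_eff = math.exp(entropy)
    # inverse Simpson coefficient: 1 / sum of squared fractions
    inverse_simpson = 1.0 / sum(p * p for p in fracs)
    # Gini coefficient from the sorted fractions (which sum to 1)
    gini = sum((2 * (i + 1) - n - 1) * p for i, p in enumerate(fracs)) / n

    return {
        "clusters": n,
        "cluster_entropy": entropy,
        "clusters_eff": clusters_eff,
        "cluster_inverse_simpson": inverse_simpson,
        "cluster_gini": gini,
        "occupancy_top_cluster": max(fracs),
        # fraction of reads in clusters holding more than 1% of reads
        "occupancy_big_clusters": sum(p for p in fracs if p > 0.01),
    }


if __name__ == "__main__":
    for name, value in diversity_sketch([50, 30, 20]).items():
        print(name, round(value, 3))
```

For a sample with equally sized clusters, the effective cluster number equals
the actual cluster number and the Gini coefficient is zero; a single dominant
clone pushes the effective number toward 1 and the Gini coefficient toward 1.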

QC features
-----------

nreads_initial
    initial number of reads
nreads_mapped
    number of reads mapped
nreads_removed_short
    number of reads that are too short for processing (<500nt)
nreads_removed_incomplete
    number of reads without forward and reverse primer
nreads_removed_no_info
    number of reads without info on primers
nreads_removed_internal_primer
    number of reads removed because of internal primers
nreads_removed_no_switch
    number of reads removed because they lack an alignment to a switch region
nreads_removed_length_mismatch
    number of reads removed because the mapped part of the read is much longer
    than the mapped part of the genome
nreads_removed_overlap_mismatch
    number of reads removed because there is overlap between alignments on the
    read or on the genome
nreads_inversions
    number of reads that contain inverted segments
nreads_removed_low_cov
    number of reads removed because too little of the read maps
nreads_removed_switch_order
    number of reads removed because switch alignments are in the wrong order
nreads_removed_no_isotype
    number of reads removed because the isotype could not be determined
nreads_removed_duplicate
    number of reads removed because of duplicate UMIs (in UMI mode,
    ``--remove-duplicates``)
nreads_unmapped
    number of unmapped reads
frac_reads_unused
    fraction of unused reads
nreads
    final number of reads
mean_frac_mapped
    mean fraction of the read that is mapped
mean_frac_mapped_multi
    mean fraction of the read that is mapped multiple times
mean_frac_ignored
    mean fraction of the read that does not map to selected switch regions
clustering_cutoff
    dendrogram cutoff (specified or inferred from data)
frac_singletons
    fraction of singleton clusters
PCR_bias_length
    regression coefficient of cluster length vs. cluster size
PCR_bias_GC
    regression coefficient of cluster GC vs. cluster size

diversity features
------------------

These features are computed twice: once on the total number of reads ("_raw"),
and then again as averages of 10 replicates from downsampling to 1000 reads.

clusters_initial
    pre-filter cluster number
clusters(_raw)
    post-filter cluster number
clusters_eff(_raw)
    number of equally sized clusters that would have the same entropy as
    observed
cluster_size_(mean/std)(_raw)
    mean and std.dev. of cluster size (in fraction of reads)
cluster_gini(_raw)
    Gini coefficient of cluster size distribution
cluster_entropy(_raw)
    entropy of cluster size distribution
cluster_inverse_simpson(_raw)
    inverse Simpson coefficient of cluster size distribution
occupancy_top_cluster(_raw)
    fraction of reads in biggest cluster
occupancy_big_clusters(_raw)
    fraction of reads in all clusters that contain more than 1% of reads
read_length_(mean/std)
    mean and std.dev. of length of reads in clusters (in nt)
GC_content_(mean/std)
    mean and std.dev. of GC content of clusters
read_length_(mean/std)_SX
    mean and std.dev. of cluster length for isotype SX

isotypes
--------

frac_reads_SX
    fraction of reads in isotype SX
pct_clusters_SX
    percentage of clusters with isotype SX
alpha_ratio_reads
    fraction of reads with isotype SA*
alpha_ratio_clusters
    fraction of clusters with isotype SA*

structural rearrangements
-------------------------

pct_templated_inserts
    insert frequency (over the dataset)
inversion_size
    median size of inversions
duplication_size
    median size of duplications
pct_inversions
    percentage of breaks creating inversions
pct_duplications
    percentage of breaks creating duplications
frac_breaks_inversions_intra
    fraction of breaks creating inversions for intra-switch breaks
frac_breaks_duplications_intra
    fraction of breaks creating duplications for intra-switch breaks
n_inserts
    number of inserts
n_unique_inserts
    number of unique inserts
n_clusters_inserts
    number of clusters with inserts
mean_cluster_insert_frequency
    mean insert frequency (per cluster)
mean_insert_overlap
    mean overlap of inserts for reads in the same cluster
mean_insert_pos_overlap
    mean overlap of left and right insert-generating breakpoints for reads in
    the same cluster
mean_insert_length
    mean insert length
mean_insert_gap_length
    mean length of gaps between left and right insert-generating breakpoints
ninserts_SX_SY
    number of inserts between SX and SY regions

context features
----------------

untemplated_inserts
    mean number of untemplated nucleotides for switch junctions
untemplated_inserts_SX
    mean number of untemplated nucleotides for switch junctions for isotype SX
homology
    mean number of homologous nucleotides for switch junctions
homology_SX
    mean number of homologous nucleotides for switch junctions of isotype SX
pct_blunt
    percentage of blunt ends (no untemplated, no homologous nucleotides)
pct_blunt_SX
    percentage of blunt ends (no untemplated, no homologous nucleotides) for
    isotype SX
homology_score_fw
    average amount of homology in 50nt bins between donor and acceptor region
homology_score_rv
    average amount of (reverse-complementary) homology in 50nt bins between
    donor and acceptor region
homology_score_fw/rv_SX_SY
    average amount of homology in 50nt bins between donor (SX) and acceptor
    (SY) region
donor_score_M
    score (=weighted frequency) for motif M (M=S,W,WGCW,GAGCT,GGGST with
    S=G/C, W=A/T) in 50nt bins of donor switch regions
acceptor_score_M
    score for motif M in 50nt bins of acceptor switch regions
donor_complexity
    complexity score (number of observed 5mers / number of possible 5mers) in
    50nt bins of donor switch regions
acceptor_complexity
    complexity score for acceptor regions
donor/acceptor_score_M_SX_SY
    donor/acceptor score for motif M between donor region SX and acceptor
    region SY
donor/acceptor_complexity_SX_SY
    donor/acceptor complexity between donor region SX and acceptor region SY

breakpoint matrix
-----------------

breaks_normalized
    number of breaks normalized by the number of clusters
pct_breaks_SX_SY
    fraction of breaks between SX and SY
pct_direct_switch
    percentage of "single" breaks (see green region in 2D breakpoint histogram)
pct_sequential_switch
    percentage of "sequential" breaks (see orange region in 2D breakpoint
    histogram)
pct_intraswitch_deletion
    percentage of breaks creating intra-switch deletions (see purple region in
    2D breakpoint histogram)
intraswitch_size_(mean/std)
    mean and std.dev. of the size of intra-switch breaks
break_dispersion_SX
    dispersion (standard deviation) of breakpoint positions in region SX

variants
--------

num_variants
    number of detected single-nucleotide variants
somatic_variants
    number of variants classified as likely somatic (not germline)
frac_variants_transitions
    fraction of transition variants
frac_variants_X>Y
    fraction of variants from ref X to alt Y
num_variants_M
    number of variants around specific motif (M=Tw, wrCy, Cg)

demultiplexing
**************

``swibrid demultiplex`` creates fastq files for each sample in a sample sheet,
together with ``{sample}_info.csv`` files that contain meta-data for each read
(taken from the header of the original raw output, plus locations and
identities of detected primers and barcodes).
It also creates a summary csv file with statistics on how many reads were
assigned to each sample barcode, and a summary figure:

.. image:: _static/20220411_demultiplexing.png
   :width: 400
   :alt: demultiplexing output

The bar plot in the top left shows how many reads were assigned to each sample
or remained "undetermined".
The pie chart on the right shows a breakdown of undetermined reads by barcode,
in case the sample sheet is missing a sample barcode appearing in the reads.
The histogram in the lower left shows relative locations of the detected
barcodes, and the one in the lower right shows the distribution of read lengths
in each sample (colors match the sample colors in the top plot).

other output
************

The pipeline writes intermediate files to folders ``pipeline/{sample}`` and log
files to ``logs/{step}`` (or ``logs/slurm/{step}`` if ``--slurm`` is used).

Intermediate files are:

``{sample}_aligned.par``
    LAST parameters estimated for that sample
``{sample}_last_pars.npz``
    LAST parameters as numpy array
``{sample}_aligned.maf.gz``
    LAST output (as MAF) for that sample
``{sample}_telo.out``
    output from BLASTing reads against telomere repeats
``{sample}_processed.out``
    ``process_alignments`` output table
``{sample}_process_stats.csv``
    ``process_alignments`` stats
``{sample}_aligned.fasta.gz``
    aligned reads as input to ``construct_MSA``
``{sample}_breakpoint_alignments.csv``
    table with breakpoint re-alignment results
``{sample}_msa.npz``
    MSA in numpy sparse array format
``{sample}_msa.csv``
    table with reads used in MSA
``{sample}_gaps.npz``
    ``get_gaps`` output (numpy arrays)
``{sample}_rearrangements.npz``
    ``find_rearrangements`` output (numpy arrays)
``{sample}_rearrangements.bed``
    ``find_rearrangements`` output as bed file
``{sample}_linkage.npz``
    ``construct_linkage`` output (linkage matrix, python standard)
``{sample}_inserts.tsv``
    table with insert coordinates, annotation and sequence
``{sample}.bed``
    bed file with alignment coordinates for all reads and inserts
``{sample}_cutoff_scanning.csv``
    number of clusters as function of cutoff
``{sample}_clustering.csv``
    cluster assignment
``{sample}_cluster_stats.csv``
    ``find_clusters`` statistics
``{sample}_cluster_analysis.csv``
    statistics aggregated over clusters
``{sample}_cluster_downsampling.csv``
    cluster statistics after downsampling 10 times
``{sample}_breakpoint_stats.csv``
    table with breakpoint matrix statistics
``{sample}_variants.txt``
    txt file (vcf-like) with variants
``{sample}_variants.npz``
    numpy file with variant and read coordinates
``{sample}_haplotypes.csv``
    table with haplotype assignments for clusters
``{sample}_summary.csv``
    summary of features for that sample
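
When a run stops partway, it can help to see which of the intermediate files
listed above exist for a sample. The following sketch builds the expected paths
from the ``pipeline/{sample}`` layout and the file suffixes described in this
section. The helper names are hypothetical and not part of SWIBRID, only a
subset of suffixes is included, and it assumes the files sit directly inside
``pipeline/{sample}``.

```python
from pathlib import Path

# Subset of the suffixes listed in this section; extend as needed.
INTERMEDIATE_SUFFIXES = [
    "_aligned.maf.gz",
    "_processed.out",
    "_msa.npz",
    "_linkage.npz",
    "_clustering.csv",
    "_summary.csv",
]


def expected_intermediate_files(sample, pipeline_dir="pipeline"):
    """Return the expected intermediate file paths for one sample.

    Assumes the layout pipeline/{sample}/{sample}{suffix}.
    """
    base = Path(pipeline_dir) / sample
    return [base / f"{sample}{suffix}" for suffix in INTERMEDIATE_SUFFIXES]


def missing_intermediate_files(sample, pipeline_dir="pipeline"):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for path in expected_intermediate_files(sample, pipeline_dir)
        if not path.exists()
    ]


if __name__ == "__main__":
    # "sample1" is a placeholder sample name
    for path in missing_intermediate_files("sample1"):
        print("missing:", path)
```

The order of ``INTERMEDIATE_SUFFIXES`` follows the order of the list above, so
the first missing path gives a rough hint at which step did not finish.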