quick start guide

installation

clone the github repo and change into the source folder:

git clone git@github.com:bihealth/swibrid.git
cd swibrid

create a conda environment:

conda env create -f swibrid_env.yaml
conda activate swibrid_env

install swibrid:
```
pip install .
```

alternatively, use the docker image:

docker run -v $(pwd):/home/swibriduser -u $(id -u):$(id -g) ghcr.io/bihealth/swibrid:latest -h

testing

for a simple and (relatively) quick end-to-end test, run:

swibrid test

this will create two samples with about 1000 synthetic reads in input and run the pipeline on this data, using a reduced hg38 genome in index with only the switch region (chr14:105000000-106000000). it will probably take about 5 minutes and produce plots in output/read_plots and table of summary statistics in output/summary

running your own data

this assumes you have a fastq.gz file with sequencing output from minION or PacBio. If samples were multiplexed (e.g., with ONT barcodes), you should set up a sample sheet like so:

BC01         sample1
BC02         sample2
...

and a file with barcode and primer sequences like so:

>BC01
AAGAAAGTTGTCGGTGTCTTTGTG
>BC02
TCGATTCCGTTTGTAGTCGTCTGT
...
>primer_mu_fw
CACCCTTGAAAGTAGCCCATGCCTTCC
>primer_alpha_rv
CTCAGTCCAACACCCACCACTCC
>primer_gamma_rv
CTGCCTCCCAGTGTCCTGCATTACTTCTG

if you don’t have multiplexed data, you should still run the demultiplexing for primer detection; simply set up a dummy sample sheet and the file with primers; all reads will end up in undetermined.fastq.gz

set up snakemake and config files in a new directory:
```
mkdir results
cd results
swibrid setup
```

provide genome (+ index), annotation files in index:

mkdir index
cd index
# get hg38 genome from UCSC (or elsewhere)
wget http://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# create LAST index
lastdb hg38db hg38.fa
# download gene annotation from ENCODE (or elsewhere)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
gunzip gencode.v33.annotation.gtf.gz
swibrid get_annotation -i gencode.v33.annotation.gtf -o gencode.v33.annotation.exon.gene_shorted.bed

create bed file with switch region definitions:

chr14   105588700       105591700       SA2
chr14   105603000       105603500       SE
chr14   105626500       105629000       SG4
chr14   105645400       105647900       SG2
chr14   105708900       105712900       SA1
chr14   105743700       105747700       SG1
chr14   105772100       105775600       SG3
chr14   105856100       105861100       SM

edit (at least) the following entries in the config.yaml file (make sure that sample names in SAMPLES all appear in the sample sheet):

INPUT: "path/to/input.fastq.gz"
SAMPLE_SHEET: "path/to/sample_sheet.csv"
BARCODES_PRIMERS: "path/to/barcodes_primers.fa"
SAMPLES: ["sample1","sample2", ...]
SWITCH_ANNOTATION: "path/to/switch_regions.bed"

run the pipeline:

swibrid run -np        # for a dry-run
swibrid run            # for an actual run
swibrid run --slurm    # submit to slurm
swibrid run --unlock   # unlock snakemake before restarting an interrupted/killed instance