Running a pipeline - Tutorial
Before beginning this tutorial make sure you have the TallyTrin installed correctly, please see here (see Installation) for installation instructions.
As a tutorial example of how to run a TallyTrin workflow we will run the count pipeline.
This workflow is for generating a count matrix for downstream differential expression analysis using nanopore reads. The pipeline takes an input fastq file, processes it, and outputs a count matrix with samples as columns and rows as either transcripts or genes. The pipeline makes use of multiple Python libraries and tools like Minimap2, Samtools, UMI-tools, and mclumi.
Tutorial start
1. First download the tutorial data:
mkdir count
cd count
wget https://datashare.molbiol.ox.ac.uk/public/project/cribbslab/acribbs/Trimer/test.fastq.gz
wget https://datashare.molbiol.ox.ac.uk/public/project/cribbslab/acribbs/Trimer/Homo_sapiens.GRCh38.cdna.all.fa
wget https://datashare.molbiol.ox.ac.uk/public/project/cribbslab/acribbs/Trimer/hg38.fa
wget https://datashare.molbiol.ox.ac.uk/public/project/cribbslab/acribbs/Trimer/hg38_geneset_all.bed
wget https://datashare.molbiol.ox.ac.uk/public/project/cribbslab/acribbs/Trimer/hg38_geneset_all.gtf
2. Next we will generate a configuration yml file so the pipeline output can be modified:
conda activate tallytrin
# To show all available pipelines:
tallytrin -h
# Generate config
tallytrin count config
3. Modify the config if required:
At this stage you would normally modify the config, but in this case the defaults should be fine in this case:
# Config file for pipeline_count.py
## general options
# Copyright statement
copyright: cribbslab (2021)
cdna_fasta: Homo_sapiens.GRCh38.cdna.all.fa
genome_fasta: hg38.fa
junc_bed: hg38_geneset_all.bed
gtf: hg38_geneset_all.gtf
# Specify if the pipeline should run umi correction or not
correct: 1
# Specify if the pipeline should run with a trimer UMI on the tso
tso_present: 1
# Specify if a split prefix of index is needed for running minimap2 if the index is large
minimap2_splitprefix: 0
# Threshold to remove UMI errors
error_removal: 1
# mclumi options
mclumi:
editdistance: 9
memory: 100G
job_options: -t 48:00:00
4. Next we will run the pipleine:
tallytrin count make full -v5 --no-cluster
This --no-cluster
will run the pipeline locally if you do not have access to a cluster. Alternatively if you have a
cluster remove the --no-cluster
option and the pipleine will distribute your jobs accross the cluster.