This repo contains the code for processing dual RNA-seq data as currently hosted in the PhageExpressionAtlas. First, a Nextflow pipeline is used to trim paired-end or single-end RNA-seq reads, check their quality, map the reads to a dual genome, filter the alignments, and count reads. A Jupyter Notebook is then used to annotate and pre-process the count data for storage and analysis with the PhageExpressionAtlas.
Nextflow stands out for its parallelization capability, performance, and documentation. This pipeline can therefore be applied to efficiently process, map, and count reads from dual RNA-seq of phage-host interactions.
This pipeline makes use of the following tools:
- quality control: fastQC, multiQC
- adapter & quality trimming: cutadapt
- read mapping (and dual reference genome indexing): hisat2
- alignment processing: samtools
- feature counting: featureCounts
Installation of conda environment from file:
conda env create -f /env/env_nextflow.yml
or, when using mamba:
mamba env create -f /env/env_nextflow.yml
Also, check that Nextflow is available (e.g. run `nextflow -version`); otherwise, install it manually.
Input can be specified on the command line or in /nf/conf/params.config.
- hostGenome, phageGenome: provide absolute paths to full genomic fasta files.
- hostGFF, phageGFF: provide absolute paths to full genomic feature annotation files.
- pairedEnd: set this flag when dealing with paired-end RNA-seq data
- dRNAseq: this flag will direct reads to dRNA-seq processing, which is not yet implemented
- reads: absolute path to reads in fastq format; files can also be compressed as *.fastq.gz
- outputDir: absolute path to output directory
- adapter1, adapter2: Illumina TruSeq adapter sequences by default; adjust if other adapters were used
- countFeature, featureIdentifier: specify the feature to count with featureCounts and the identifier attribute used in the GFF file
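The same parameters can also be set in /nf/conf/params.config instead of on the command line. A minimal sketch, assuming standard Nextflow `params` scope syntax (all paths are placeholders):

```groovy
params {
    // Placeholder paths; replace with absolute paths on your system
    reads             = "/path/to/data/*.fastq.gz"
    hostGenome        = "/path/to/hostGenome.fasta"
    phageGenome       = "/path/to/phageGenome.fasta"
    hostGFF           = "/path/to/hostGenome.gff"
    phageGFF          = "/path/to/phageGenome.gff"
    outputDir         = "/path/to/outputDir"
    pairedEnd         = false          // set to true for paired-end data
    countFeature      = "gene"         // feature type passed to featureCounts
    featureIdentifier = "ID"           // GFF attribute used as feature identifier
}
```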
Example usage for single-end data:
nextflow run rnaseq_workflow.nf --reads "/path/to/data/*.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"
Example usage for paired-end data:
nextflow run rnaseq_workflow.nf --reads "/path/to/data/*_{R1,R2}.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"
The pipeline performs the following steps:
- building the dual reference genome and indexing with hisat2
- fastQC on raw fastq files
- adapter (and quality) trimming with cutadapt
- fastQC on trimmed fastq files
- mapping of trimmed fastq files with hisat2
- processing of alignment with samtools
- feature counting using featureCounts
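Internally, these steps correspond roughly to commands like the following. This is a hedged sketch with placeholder file names and a single-end sample; the pipeline's actual flags and intermediate files may differ:

```shell
# Build the dual reference: concatenate host and phage genomes, then index
cat hostGenome.fasta phageGenome.fasta > dual.fasta
hisat2-build dual.fasta dual_index

# Adapter and quality trimming with cutadapt (adapter sequence is a placeholder)
cutadapt -a AGATCGGAAGAGC -q 20 -o sample_trimmed.fastq.gz sample.fastq.gz

# Map trimmed reads against the dual index, then sort and index the alignment
hisat2 -x dual_index -U sample_trimmed.fastq.gz -S sample.sam
samtools sort -o sample.sorted.bam sample.sam
samtools index sample.sorted.bam

# Count reads per feature using the dual GFF annotation
featureCounts -t gene -g ID -a dual.gff -o counts.txt sample.sorted.bam
```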
The dual RNA-seq pipeline produces the following output:
- Counts table generated by featureCounts.
- BAM files, which can be inspected in a genome viewer.
- The dual genome fasta and GFF files needed for exploration of alignment maps in a genome viewer.
- Quality control logs, including fastQC and multiQC HTML reports.
The counts table and GFF file produced by the pipeline serve as input for the postprocessing notebook /downstream_processing/Processing_notebook_template.ipynb. In addition, the metadata for the raw sequencing data is helpful for annotating sample names with the correct time point and replicate declaration.
The notebook performs the following steps:
- Data loading
- Annotation with time points, in silico rRNA depletion and data normalization
- PCA-based quality control and cleaning of data by outlier sample removal
- Grouping replicates per time point
- Phage gene classification with pre-defined thresholds
- Calculation of stabilized variance for dataset filtering
- Export data
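As an illustration of the replicate-grouping step, here is a minimal pandas sketch. The sample-naming scheme (`<timepoint>_<replicate>`) and averaging of counts are assumptions for the example, not necessarily what the notebook does:

```python
import pandas as pd

# Hypothetical counts table: genes as rows, samples as columns;
# sample names encode time point and replicate (assumed naming scheme)
counts = pd.DataFrame(
    {"t0_rep1": [10, 0], "t0_rep2": [12, 2], "t5_rep1": [3, 7], "t5_rep2": [5, 9]},
    index=["geneA", "geneB"],
)

# Parse the time point from each sample name
time_points = counts.columns.str.split("_").str[0]

# Group replicates per time point by averaging their counts
grouped = counts.T.groupby(time_points).mean().T
print(grouped)
```

The transpose/group/transpose pattern keeps genes as rows while grouping over the sample axis.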
File paths need to be adjusted only at the beginning and end of the notebook. The resulting tables are ready to be fed into the database of the PhageExpressionAtlas.
Nextflow is a modular system: you can include virtually any other tool, add new modules, and adapt existing ones. Please feel free to leave feedback if you think important steps are currently missing.
We highly welcome your contributions and feedback! Please contact Maik Wolfram-Schauerte.