This repo contains the code for processing dual RNA-seq data as currently hosted in the PhageExpressionAtlas. First, a Nextflow pipeline is used to trim paired-end or single-end RNA-seq reads, check their quality, map the reads to a dual genome, filter the alignments, and count reads. A Jupyter Notebook is then used to annotate and pre-process the count data for storage and analysis with the PhageExpressionAtlas.
Nextflow stands out for its parallelization capability, performance, and documentation. This pipeline can therefore be applied to efficiently process, map, and count reads from dual RNA-seq of phage-host interactions.
This pipeline makes use of the following tools:
- quality control: fastQC, multiQC
- adapter & quality trimming: cutadapt
- read mapping (and dual reference genome indexing): hisat2
- alignment processing: samtools
- feature counting: featureCounts
Installation of conda environment from file:
conda env create -f /env/env_nextflow.yml
or, when using mamba:
mamba env create -f /env/env_nextflow.yml
Also, check that Nextflow is available (e.g. run `nextflow -version`); otherwise, install it manually.
Input can be specified on the command line or in /nf/conf/params.config.
- hostGenome, phageGenome: provide absolute paths to full genomic fasta files.
- hostGFF, phageGFF: provide absolute paths to full genomic feature annotation files.
- pairedEnd: set this flag when dealing with paired-end RNA-seq data
- dRNAseq: this flag will direct reads to dRNA-seq processing, which is not yet implemented
- reads: absolute path to reads in fastq format; files can also be compressed as *.fastq.gz
- outputDir: absolute path to output directory
- adapter1, adapter2: Illumina TruSeq adapter sequences by default; adjust if other adapters were used
- countFeature, featureIdentifier: specify the feature to count with featureCounts and the identifier attribute used in the GFF file
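The same parameters can also be set in /nf/conf/params.config instead of on the command line. A minimal sketch, assuming standard Nextflow `params` scope syntax (all paths are placeholders):

```groovy
params {
    // Placeholder paths; replace with absolute paths on your system
    reads             = "/path/to/data/*.fastq.gz"
    hostGenome        = "/path/to/hostGenome.fasta"
    phageGenome       = "/path/to/phageGenome.fasta"
    hostGFF           = "/path/to/hostGenome.gff"
    phageGFF          = "/path/to/phageGenome.gff"
    outputDir         = "/path/to/outputDir"
    pairedEnd         = false          // set to true for paired-end data
    countFeature      = "gene"         // feature type passed to featureCounts
    featureIdentifier = "ID"           // GFF attribute used as feature identifier
}
```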
Example usage for single-end data:
nextflow run rnaseq_workflow.nf --reads "/path/to/data/*.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"
Example usage for paired-end data:
nextflow run rnaseq_workflow.nf --reads "/path/to/data/*_{R1,R2}.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"
The pipeline performs the following steps:
- building the dual reference genome and indexing with hisat2
- fastQC on raw fastq files
- adapter (and quality) trimming with cutadapt
- fastQC on trimmed fastq files
- mapping of trimmed fastq files with hisat2
- processing of alignment with samtools
- feature counting using featureCounts
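Internally, these steps correspond roughly to commands like the following. This is a hedged sketch with placeholder file names and a single-end sample; the pipeline's actual flags and intermediate files may differ:

```shell
# Build the dual reference: concatenate host and phage genomes, then index
cat hostGenome.fasta phageGenome.fasta > dual.fasta
hisat2-build dual.fasta dual_index

# Adapter and quality trimming with cutadapt (adapter sequence is a placeholder)
cutadapt -a AGATCGGAAGAGC -q 20 -o sample_trimmed.fastq.gz sample.fastq.gz

# Map trimmed reads against the dual index, then sort and index the alignment
hisat2 -x dual_index -U sample_trimmed.fastq.gz -S sample.sam
samtools sort -o sample.sorted.bam sample.sam
samtools index sample.sorted.bam

# Count reads per feature using the dual GFF annotation
featureCounts -t gene -g ID -a dual.gff -o counts.txt sample.sorted.bam
```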
The dual RNA-seq pipeline produces the following output:
- Counts table generated by featureCounts.
- BAM files, which can be inspected in a genome viewer.
- The dual genome fasta and GFF files needed for exploration of alignment maps in a genome viewer.
- Quality control logs, including fastQC and multiQC HTML reports.
The counts table and GFF file produced by the pipeline serve as input for the postprocessing notebook /downstream_processing/Processing_notebook_template.ipynb. In addition, the metadata for the raw sequencing data is helpful for annotating sample names with the correct time point and replicate declaration.
The notebook performs the following steps:
- Data loading
- Annotation with time points, in silico rRNA depletion and data normalization
- PCA-based quality control and cleaning of data by outlier sample removal
- Grouping replicates per time point
- Phage gene classification with pre-defined thresholds
- Calculation of stabilized variance for dataset filtering
- Export data
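As an illustration of the replicate-grouping step, here is a minimal pandas sketch. The sample-naming scheme (`<timepoint>_<replicate>`) and averaging of counts are assumptions for the example, not necessarily what the notebook does:

```python
import pandas as pd

# Hypothetical counts table: genes as rows, samples as columns;
# sample names encode time point and replicate (assumed naming scheme)
counts = pd.DataFrame(
    {"t0_rep1": [10, 0], "t0_rep2": [12, 2], "t5_rep1": [3, 7], "t5_rep2": [5, 9]},
    index=["geneA", "geneB"],
)

# Parse the time point from each sample name
time_points = counts.columns.str.split("_").str[0]

# Group replicates per time point by averaging their counts
grouped = counts.T.groupby(time_points).mean().T
print(grouped)
```

The transpose/group/transpose pattern keeps genes as rows while grouping over the sample axis.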
File paths need to be adjusted only at the beginning and end of the notebook. The resulting tables are ready to be fed into the database of the PhageExpressionAtlas.
Nextflow is a modular system: you can include virtually any other tool, add new modules, and adapt existing ones. Please feel free to leave feedback if you think important steps are currently missing.
We highly welcome your contributions and feedback! Please contact Maik Wolfram-Schauerte.