Integrative-Transcriptomics/phage_host_dual_transcriptomics

Processing of dual RNA-seq data for the PhageExpressionAtlas

This repo contains the code for processing dual RNA-seq data as currently hosted in the PhageExpressionAtlas. First, a Nextflow pipeline trims paired-end or single-end RNA-seq reads, checks their quality, maps the reads to a dual genome, filters the alignments and counts reads. A Jupyter Notebook is then used to annotate and pre-process the count data for storage and analysis with the PhageExpressionAtlas.

Nextflow pipeline

Nextflow offers strong parallelization, good performance and thorough documentation, so this pipeline can efficiently process, map and count reads from dual RNA-seq of phage-host interactions.

Tools in the pipeline

This pipeline makes use of the following tools:

  • quality control: fastQC, multiQC
  • adapter & quality trimming: cutadapt
  • read mapping (and dual reference genome indexing): hisat2
  • alignment processing: samtools
  • feature counting: featureCounts

Installation

Installation of conda environment from file:

conda env create -f /env/env_nextflow.yml

or, when using mamba:

mamba env create -f /env/env_nextflow.yml

Also check that Nextflow is available in the environment; if it is not, install it manually.

Input & usage

Input can be specified from command line or in /nf/conf/params.config.

  • hostGenome, phageGenome: provide absolute paths to full genomic fasta files.
  • hostGFF, phageGFF: provide absolute paths to full genomic feature annotation files.
  • pairedEnd: set this flag when processing paired-end RNA-seq data
  • dRNAseq: flag intended to switch to dRNA-seq processing; not yet implemented
  • reads: absolute path to reads in fastq format; files can also be compressed as *.fastq.gz
  • outputDir: absolute path to output directory
  • adapter1, adapter2: adapter sequences to trim; by default the Illumina TruSeq adapter sequences, which can be adjusted if other adapters were used
  • countFeature, featureIdentifier: the feature type counted by featureCounts and the GFF attribute used as its identifier

Example usage for single-end data:

nextflow run rnaseq_workflow.nf --reads "/path/to/data/*.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"

Example usage for paired-end data:

nextflow run rnaseq_workflow.nf --reads "/path/to/data/*_{R1,R2}.fastq.gz" --hostGenome "/path/to/hostGenome.fasta" --phageGenome "/path/to/phageGenome.fasta" \
--hostGFF "/path/to/hostGenome.gff" --phageGFF "/path/to/phageGenome.gff" --outputDir "/path/to/outputDir" --countFeature "gene" --featureIdentifier "ID"
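
Before launching a paired-end run, it can help to check that every sample really has both mates. The following is a minimal Python sketch of how a glob such as "*_{R1,R2}.fastq.gz" groups files by their shared prefix. The helper name group_read_pairs and the example file names are hypothetical, and the sketch assumes the _R1/_R2 naming shown above; it is not taken from the pipeline code.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_R1.fastq(.gz) and <sample>_R2.fastq(.gz)
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>R1|R2)\.fastq(\.gz)?$")


def group_read_pairs(file_names):
    """Return ({sample: (R1 file, R2 file)}, [samples missing one mate])."""
    mates = defaultdict(dict)
    for name in file_names:
        match = PAIR_PATTERN.match(name)
        if match:  # files that do not follow the naming scheme are ignored
            mates[match.group("sample")][match.group("mate")] = name
    pairs = {s: (m["R1"], m["R2"]) for s, m in mates.items() if len(m) == 2}
    incomplete = sorted(s for s, m in mates.items() if len(m) != 2)
    return pairs, incomplete


# Hypothetical file names for illustration only
pairs, incomplete = group_read_pairs([
    "T0_rep1_R1.fastq.gz",
    "T0_rep1_R2.fastq.gz",
    "T5_rep1_R1.fastq.gz",
])
```

Samples listed in incomplete have only one mate and would not form a pair under this naming scheme.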

Pipeline steps explained briefly

Overview of the workflow:

  1. Building reference genome and indexing with hisat2
  2. fastQC on raw fastq files
  3. adapter (and quality) trimming with cutadapt
  4. fastQC on trimmed fastq files
  5. mapping of trimmed fastq files with hisat2
  6. processing of alignment with samtools
  7. feature counting using featureCounts
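
To illustrate step 1, here is a minimal Python sketch of what building a dual reference can look like: the host and phage FASTA records are written into one file, with a check that no sequence ID occurs twice. This assumes the dual genome is a plain concatenation of the two inputs; the function name build_dual_reference is hypothetical and the pipeline's own implementation may differ.

```python
def build_dual_reference(host_fasta, phage_fasta, out_fasta):
    """Concatenate host and phage FASTA records; return the sequence IDs written."""
    seq_ids = []
    with open(out_fasta, "w") as out:
        for source in (host_fasta, phage_fasta):
            with open(source) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        # Duplicate IDs would make alignments ambiguous
                        if seq_id in seq_ids:
                            raise ValueError(f"duplicate sequence ID: {seq_id!r}")
                        seq_ids.append(seq_id)
                    # Guard against a missing newline at the end of a file
                    out.write(line if line.endswith("\n") else line + "\n")
    return seq_ids
```

A combined file of this kind could then be indexed with hisat2-build, for example hisat2-build dual_genome.fasta dual_genome, where both names are placeholders.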

Output

The dual RNA-seq pipeline produces the following output:

  1. Counts table generated by featureCounts.
  2. BAM files, which can be inspected in a genome viewer.
  3. The dual genome fasta and gff files needed for exploration of alignment maps in a genome viewer.
  4. Quality control logs including fastQC and multiQC html reports.
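
As a rough guide to the counts table, the sketch below parses featureCounts output using only the Python standard library. It relies on the default featureCounts layout: a leading comment line starting with "#", six annotation columns (Geneid, Chr, Start, End, Strand, Length), then one count column per BAM file. The function name read_feature_counts is hypothetical, and integer counts are assumed (the featureCounts default).

```python
import csv

ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_feature_counts(handle):
    """Parse a featureCounts table into (samples, lengths, counts)."""
    # Skip the "# Program:featureCounts ..." comment line
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    # Every column after the annotation columns is one sample (BAM file)
    samples = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
    lengths, counts = {}, {}
    for row in reader:
        gene = row["Geneid"]
        lengths[gene] = int(row["Length"])
        counts[gene] = {s: int(row[s]) for s in samples}
    return samples, lengths, counts
```

It could be called as `with open("/path/to/counts.tsv") as fh: samples, lengths, counts = read_feature_counts(fh)`, where the path is a placeholder for the table in the output directory.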

Pre-processing with the Jupyter notebook

The counts table and gff file produced by the pipeline serve as input for the processing notebook /downstream_processing/Processing_notebook_template.ipynb. The metadata of the raw sequencing data is also helpful for annotating sample names with the correct time point and replicate.

Steps & output

  1. Data loading
  2. Annotation with time points, in silico rRNA depletion and data normalization
  3. PCA-based quality control and cleaning of data by outlier sample removal
  4. Grouping replicates per time points
  5. Phage gene classification with pre-defined thresholds
  6. Calculation of stabilized variance for dataset filtering
  7. Export data
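
Steps 2 and 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: TPM is used here as one common normalization (the notebook's actual method is not specified above), rRNA genes are assumed to be identifiable through a gene-to-type mapping derived from the gff file, and all function names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def deplete_rrna(counts, gene_types):
    """In silico rRNA depletion: drop genes whose annotated type is rRNA."""
    return {g: c for g, c in counts.items() if gene_types.get(g) != "rRNA"}


def tpm(counts, lengths):
    """Transcripts per million for one sample (assumed normalization)."""
    rates = {g: counts[g] / lengths[g] for g in counts}  # reads per base
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    return {g: rate / total * 1e6 for g, rate in rates.items()}


def mean_by_time_point(values_per_sample, time_points):
    """Average per-gene values over replicates of the same time point."""
    grouped = defaultdict(list)
    for sample, values in values_per_sample.items():
        grouped[time_points[sample]].append(values)
    return {
        tp: {g: mean(rep[g] for rep in replicates) for g in replicates[0]}
        for tp, replicates in grouped.items()
    }


# Hypothetical toy data: two replicates of one time point
lengths = {"geneA": 1000, "geneB": 1000, "rrnA": 1500}
gene_types = {"rrnA": "rRNA"}
raw = {
    "T0_rep1": {"geneA": 100, "geneB": 300, "rrnA": 10000},
    "T0_rep2": {"geneA": 200, "geneB": 200, "rrnA": 8000},
}
normalized = {s: tpm(deplete_rrna(c, gene_types), lengths) for s, c in raw.items()}
averaged = mean_by_time_point(normalized, {"T0_rep1": "T0", "T0_rep2": "T0"})
```

Removing rRNA before normalization matters in this sketch because the very abundant rRNA reads would otherwise dominate the per-sample total and compress the values of all other genes.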

File paths only need to be adjusted at the beginning and end of the notebook. The tables are ready to be fed into the database for the PhageExpressionAtlas.

Customization of workflow

Nextflow is a modular system: you can include virtually any other tool, add new modules and adapt existing ones. Please feel free to leave feedback if you think important steps are currently missing.

Contributing to the PhageExpressionAtlas

We highly welcome your contributions and feedback! Please contact Maik Wolfram-Schauerte.

References & Acknowledgements
