Enhance references and update vignette for improved clarity and functionality

stemangiola · stemangiola · commit 5dc53d8ad3c8 · 2025-09-13T19:53:09.000+09:30
- Added multiple new references to the bibliography for comprehensive citation support.
- Updated the vignette to include a more structured approach to RNA-seq data analysis using the airway dataset.
- Improved code organization and styling for better readability and user experience.
- Enhanced explanations and added steps for data preparation, normalization, and analysis techniques.
- Ensured compatibility with tidyverse tools and improved overall clarity in the vignette content.
diff --git a/references.bib b/references.bib
@@ -192,4 +192,47 @@ @article{ritchie2015limma
   pages={e47--e47},
   year={2015},
   publisher={Oxford University Press}
+}
+
+@article{aran2017xcell,
+  title={xCell: digitally portraying the tissue cellular heterogeneity landscape},
+  author={Aran, Dvir and Hu, Zicheng and Butte, Atul J},
+  journal={Genome biology},
+  volume={18},
+  number={1},
+  pages={220},
+  year={2017},
+  publisher={BioMed Central}
+}
+
+@article{becht2016mcp,
+  title={Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression},
+  author={Becht, Etienne and Giraldo, Nicolas A and Lacroix, Laetitia and Bifulco, Carlo and Buttard, B{\'e}n{\'e}dicte and Elarouci, Nabila and Petitprez, Florent and Selves, Janick and Laurent-Puig, Pierre and Saut{\`e}s-Fridman, Catherine and others},
+  journal={Genome biology},
+  volume={17},
+  number={1},
+  pages={218},
+  year={2016},
+  publisher={BioMed Central}
+}
+
+@article{finotello2019quantiseq,
+  title={Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data},
+  author={Finotello, Francesca and Mayer, Clemens and Plattner, Christina and Laschober, Gerhard and Rieder, Dietmar and Hackl, Hubert and Krogsdam, Anne and Loncova, Zuzana and Posch, Wilfried and Sopper, Sieghart and others},
+  journal={Genome medicine},
+  volume={11},
+  number={1},
+  pages={34},
+  year={2019},
+  publisher={BioMed Central}
+}
+
+@article{racle2017epic,
+  title={Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data},
+  author={Racle, Julien and de Jonge, Kaat and Baumgaertner, Petra and Speiser, Daniel E and Gfeller, David},
+  journal={Elife},
+  volume={6},
+  pages={e26476},
+  year={2017},
+  publisher={eLife Sciences Publications Limited}
 } 
diff --git a/vignettes/comparison_coding.Rmd b/vignettes/comparison_coding.Rmd
@@ -6,19 +6,16 @@ package: tidybulk
 output:
   BiocStyle::html_document:
     toc_float: true
-abstract: >
-  Tidybulk is a comprehensive R package for modular transcriptomic data analysis that brings transcriptomics 
-  to the tidyverse. It provides a unified interface for data transformation, normalization, filtering, 
-  dimensionality reduction, clustering, differential analysis, cellularity analysis, and gene enrichment 
-  with seamless integration of SummarizedExperiment objects and tidyverse principles.
+bibliography: references.bib
+link-citations: true
+
 keywords: "transcriptomics, RNA-seq, differential expression, data analysis, tidyverse, SummarizedExperiment, 
   bioinformatics, genomics, gene expression, clustering, dimensionality reduction, cellularity analysis, 
   gene enrichment, R package"
 vignette: >
-  %\VignetteEngine{knitr::knitr}
+  %\VignetteEngine{knitr::rmarkdown}
   %\VignetteIndexEntry{Side-by-side comparison with standard interfaces}
-  %\usepackage[UTF-8]{inputenc}
-
+  %\VignetteEncoding{UTF-8}
 ---
 
 
@@ -33,13 +30,49 @@ vignette: >
 <style>
 .column-left{
   float: left;
-  width: 50%;
+  width: 48%;
   text-align: left;
+  margin-right: 2%;
 }
 .column-right{
   float: right;
-  width: 50%;
-  text-align: right;
+  width: 48%;
+  text-align: left;
+  margin-left: 2%;
+}
+
+/* Improve code block styling */
+.column-left pre,
+.column-right pre {
+  font-size: 0.85em;
+  line-height: 1.3;
+  overflow-x: auto;
+  white-space: pre;
+  word-wrap: normal;
+  max-width: 100%;
+}
+
+/* Ensure code blocks don't get too narrow */
+.column-left pre code,
+.column-right pre code {
+  font-size: 0.8em;
+  line-height: 1.2;
+}
+
+/* Better spacing for code chunks */
+.column-left .sourceCode,
+.column-right .sourceCode {
+  margin-bottom: 1em;
+}
+
+/* Responsive design for smaller screens */
+@media (max-width: 768px) {
+  .column-left,
+  .column-right {
+    width: 100%;
+    float: none;
+    margin: 0 0 1em 0;
+  }
 }
 </style>
 
@@ -69,64 +102,107 @@ Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T.
 [Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7)
 
 
-```{r, echo=FALSE, include=FALSE, }
+```{r setup-libraries-and-theme, echo=FALSE, include=FALSE}
 library(knitr)
-# knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
-#                       message = FALSE, cache.lazy = FALSE)
-
 library(dplyr)
 library(tidyr)
 library(tibble)
+library(purrr)
 library(magrittr)
+library(forcats)
 library(ggplot2)
 library(ggrepel)
+library(SummarizedExperiment)
 library(tidybulk)
 library(tidySummarizedExperiment)
 library(airway)
 
+# Define theme
 my_theme = 	
 	theme_bw() +
 	theme(
 		panel.border = element_blank(),
 		axis.line = element_line(),
-		panel.grid.major = element_line(size = 0.2),
-		panel.grid.minor = element_line(size = 0.1),
+		panel.grid.major = element_line(linewidth = 0.2),
+		panel.grid.minor = element_line(linewidth = 0.1),
 		text = element_text(size=12),
 		legend.position="bottom",
 		aspect.ratio=1,
 		strip.background = element_blank(),
 		axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
 		axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
 	)
+```
+
+In this vignette we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package.
 
-# Load airway dataset
+```{r load airway}
+library(airway)
 data(airway)
+```
+
+This workflow will use the [tidySummarizedExperiment](https://bioconductor.org/packages/tidySummarizedExperiment/) package to manipulate the data in a `tidyverse` fashion. This approach streamlines the data manipulation and analysis process, making it more efficient and easier to understand.
+
+```{r load tidySummarizedExperiment}
+library(tidySummarizedExperiment)
+```
+
+Here we will add an Entrez ID column (`entrezid`) to the `airway` object, mapped from the gene symbols in `gene_name`. This will be used to interpret the differential expression analysis and to deconvolve the cellularity.
 
-# Add gene symbol and entrez for better annotation
+```{r add-gene-symbol}
+library(org.Hs.eg.db)
+library(AnnotationDbi)
+
+# Add Entrez IDs mapped from the gene symbols
 airway <-
-  airway |> 
-  mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
-                                        keys = .feature,
-                                        keytype = "ENSEMBL",
-                                        column = "SYMBOL",
-                                        multiVals = "first"
-  )) |> 
+  airway |>
   
-  mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
-                                      keys = .feature,
-                                      keytype = "ENSEMBL",
+  mutate(entrezid = mapIds(org.Hs.eg.db,
+                                      keys = gene_name,
+                                      keytype = "SYMBOL",
                                       column = "ENTREZID",
                                       multiVals = "first"
 )) 
 
-# Convert dex to factor for proper differential expression analysis
-colData(airway)$dex = as.factor(colData(airway)$dex)
 
+
+detach("package:org.Hs.eg.db", unload = TRUE)
+detach("package:AnnotationDbi", unload = TRUE)
 ```
 
-## Aggregate duplicated `transcripts`
+# Side-by-side Comparison with Standard Interfaces
 
-tidybulk provide the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters.
+This vignette demonstrates how tidybulk compares to standard R/Bioconductor approaches for transcriptomic data analysis. We'll show the same analysis performed using both tidybulk (tidyverse-style) and traditional methods side by side.
+
+## Data Overview
+
+We will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells:
+
+```{r data-overview}
+airway
+```
+
+Loading `tidySummarizedExperiment` makes `SummarizedExperiment` objects compatible with tidyverse tools while maintaining their `SummarizedExperiment` nature. This allows us to manipulate the data with `tidyverse` tools.
+
+```{r check-se-class}
+class(airway)
+```
+
+### Prepare Data for Analysis
+
+Before analysis, we need to ensure our variables are in the correct format:
+
+```{r convert-condition-to-factor}
+# Convert dex to factor for proper differential expression analysis
+airway = airway |>
+  mutate(dex = as.factor(dex))
+```
+
+## Step 1: Aggregate Duplicated Transcripts
+
+tidybulk provides the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, Ensembl). For example, we often have to convert Ensembl identifiers to gene/transcript symbols, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble in which transcripts with the same name are aggregated. All remaining columns are appended, and factors and booleans are appended as characters.
+
+> Transcript aggregation is a standard bioinformatics approach for gene-level summarization.
 
 <div class="column-left">
 TidyTranscriptomics
@@ -153,10 +229,12 @@ colnames(dge_list.nr) <- colnames(dge_list)
 </div>
 <div style="clear:both;"></div>
 
-## Scale `counts`
+## Step 2: Scale Abundance
 
 We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). `scale_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method as arguments and returns a tibble with additional columns with scaled data as `<NAME OF COUNT COLUMN>_scaled`.
 
+> Normalization is crucial for comparing expression levels across samples with different library sizes.
+
 <div class="column-left">
 TidyTranscriptomics
 ```{r normalise}
@@ -180,7 +258,7 @@ norm_counts.table <- cpm(dgList)
 <div style="clear:both;"></div>
 
 ```{r, include=FALSE}
-airway.norm |> select(`counts`, counts_scaled, .abundant, everything())
+airway.norm |> dplyr::select(`counts`, counts_scaled, .abundant, everything())
 ```
 
 We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by treatment.
@@ -194,9 +272,11 @@ airway.norm |>
 	my_theme
 ```
 
-## Filter `variable transcripts`
+## Step 3: Filter Variable Transcripts
+
+We may want to identify and filter variable transcripts to focus on the most informative features.
 
-We may want to identify and filter variable transcripts.
+> Variable transcript filtering helps reduce noise and focuses analysis on the most informative features.
 
 <div class="column-left">
 TidyTranscriptomics
@@ -230,10 +310,12 @@ norm_counts.table$cell_type = tibble_counts[
 <div style="clear:both;"></div>
 
 
-## Reduce `dimensions`
+## Step 4: Reduce Dimensions
 
 We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.
 
+> Dimensionality reduction helps visualize high-dimensional data and identify patterns.
+
 **MDS** (Robinson et al., 10.1093/bioinformatics/btp616)
 
 <div class="column-left">
@@ -268,7 +350,7 @@ cmds$cell_type = tibble_counts[
 On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by treatment.
 
 ```{r plot_mds, fig.width = 10, fig.height = 10, eval=FALSE}
-airway.norm.MDS |> pivot_sample()  |> select(contains("Dim"), everything())
+airway.norm.MDS |> pivot_sample()  |> dplyr::select(contains("Dim"), everything())
 
 airway.norm.MDS |>
 	pivot_sample() |>
@@ -306,7 +388,7 @@ On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured
 
 ```{r plot_pca, fig.width = 10, fig.height = 10, eval=FALSE}
 
-airway.norm.PCA |> pivot_sample() |> select(contains("PC"), everything())
+airway.norm.PCA |> pivot_sample() |> dplyr::select(contains("PC"), everything())
 
 airway.norm.PCA |>
 	 pivot_sample() |>
@@ -315,9 +397,11 @@ airway.norm.PCA |>
 
 
 
-## Rotate `dimensions`
+## Step 5: Rotate Dimensions
 
 We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as `<NAME OF DIMENSION> rotated <ANGLE>` by default, or as specified in the input arguments.
+
+> Dimension rotation can help align data with biological axes of interest.
 <div class="column-left">
 TidyTranscriptomics
 ```{r rotate}
@@ -366,9 +450,11 @@ airway.norm.MDS.rotated |>
   my_theme
 ```
 
-## Test `differential abundance`
+## Step 8: Test Differential Abundance
 
 We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). `test_differential_expression` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g.,  log fold change, p-value and false discovery rate).
+
+> Differential expression analysis identifies genes that are significantly different between conditions.
 <div class="column-left">
 TidyTranscriptomics
 ```{r de, message=FALSE, warning=FALSE, results='hide'}
@@ -410,10 +496,12 @@ airway.de =
   pivot_transcript()
 ```
 
-## Adjust `counts`
+## Step 6: Adjust for Unwanted Variation
 
 We may want to adjust `counts` for (known) unwanted variation. `adjust_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as `<COUNT COLUMN>_adjusted`. At the moment just an unwanted covariates is allowed at a time.
 
+> Batch effect correction is important for removing technical variation that could confound biological signals.
+
 <div class="column-left">
 TidyTranscriptomics
 ```{r adjust, message=FALSE, warning=FALSE, results='hide'}
@@ -508,10 +596,12 @@ results$cell_type = tibble_counts[
 
 ```
 
-## Cluster samples
+## Step 7: Cluster Samples
 
 We may want to cluster our samples based on the transcriptomic profiles. `cluster_elements` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the cluster labels.
 
+> Clustering helps identify groups of samples with similar expression profiles.
+
 <div class="column-left">
 TidyTranscriptomics
 ```{r cluster}
@@ -535,10 +625,12 @@ cluster_labels = kmeans_result$cluster
 </div>
 <div style="clear:both;"></div>
 
-## Test differential abundance
+## Step 9: Test Differential Abundance (Alternative Method)
 
 We may want to test for differential abundance between conditions. `test_differential_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model, and returns a tibble with additional columns for the statistics from the hypothesis test (e.g.,  log fold change, p-value and false discovery rate).
 
+> This demonstrates an alternative approach to differential expression analysis.
+
 <div class="column-left">
 TidyTranscriptomics
 ```{r differential}