Commit 5dc53d8

Enhance references and update vignette for improved clarity and functionality
- Added multiple new references to the bibliography for comprehensive citation support.
- Updated the vignette to include a more structured approach to RNA-seq data analysis using the airway dataset.
- Improved code organization and styling for better readability and user experience.
- Enhanced explanations and added steps for data preparation, normalization, and analysis techniques.
- Ensured compatibility with tidyverse tools and improved overall clarity in the vignette content.
1 parent 8ebd575 commit 5dc53d8

File tree

2 files changed

+180
-45
lines changed

references.bib

Lines changed: 43 additions & 0 deletions
@@ -192,4 +192,47 @@ @article{ritchie2015limma
192192
pages={e47--e47},
193193
year={2015},
194194
publisher={Oxford University Press}
195+
}
196+
197+
@article{aran2017xcell,
198+
title={xCell: digitally portraying the tissue cellular heterogeneity landscape},
199+
author={Aran, Dvir and Hu, Zicheng and Butte, Atul J},
200+
journal={Genome biology},
201+
volume={18},
202+
number={1},
203+
pages={220},
204+
year={2017},
205+
publisher={BioMed Central}
206+
}
207+
208+
@article{becht2016mcp,
209+
title={Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression},
210+
author={Becht, Etienne and Giraldo, Nicolas A and Lacroix, Laetitia and Bifulco, Carlo and Buttard, B{\'e}n{\'e}dicte and Elarouci, Nabila and Petitprez, Florent and Selves, Janick and Laurent-Puig, Pierre and Saut{\`e}s-Fridman, Catherine and others},
211+
journal={Genome biology},
212+
volume={17},
213+
number={1},
214+
pages={218},
215+
year={2016},
216+
publisher={BioMed Central}
217+
}
218+
219+
@article{finotello2019quantiseq,
220+
title={Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data},
221+
author={Finotello, Francesca and Mayer, Clemens and Plattner, Christina and Laschober, Gerhard and Rieder, Dietmar and Hackl, Hubert and Krogsdam, Anne and Loncova, Zuzana and Posch, Wilfried and Sopper, Sieghart and others},
222+
journal={Genome medicine},
223+
volume={11},
224+
number={1},
225+
pages={34},
226+
year={2019},
227+
publisher={BioMed Central}
228+
}
229+
230+
@article{racle2017epic,
231+
title={Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data},
232+
author={Racle, Julien and de Jonge, Kaat and Baumgaertner, Petra and Speiser, Daniel E and Gfeller, David},
233+
journal={Elife},
234+
volume={6},
235+
pages={e26476},
236+
year={2017},
237+
publisher={eLife Sciences Publications Limited}
195238
}

vignettes/comparison_coding.Rmd

Lines changed: 137 additions & 45 deletions
@@ -6,19 +6,16 @@ package: tidybulk
66
output:
77
BiocStyle::html_document:
88
toc_float: true
9-
abstract: >
10-
Tidybulk is a comprehensive R package for modular transcriptomic data analysis that brings transcriptomics
11-
to the tidyverse. It provides a unified interface for data transformation, normalization, filtering,
12-
dimensionality reduction, clustering, differential analysis, cellularity analysis, and gene enrichment
13-
with seamless integration of SummarizedExperiment objects and tidyverse principles.
9+
bibliography: references.bib
10+
link-citations: true
11+
1412
keywords: "transcriptomics, RNA-seq, differential expression, data analysis, tidyverse, SummarizedExperiment,
1513
bioinformatics, genomics, gene expression, clustering, dimensionality reduction, cellularity analysis,
1614
gene enrichment, R package"
1715
vignette: >
18-
%\VignetteEngine{knitr::knitr}
16+
%\VignetteEngine{knitr::rmarkdown}
1917
%\VignetteIndexEntry{Side-by-side comparison with standard interfaces}
20-
%\usepackage[UTF-8]{inputenc}
21-
18+
%\VignetteEncoding{UTF-8}
2219
---
2320

2421

@@ -33,13 +30,49 @@ vignette: >
3330
<style>
3431
.column-left{
3532
float: left;
36-
width: 50%;
33+
width: 48%;
3734
text-align: left;
35+
margin-right: 2%;
3836
}
3937
.column-right{
4038
float: right;
41-
width: 50%;
42-
text-align: right;
39+
width: 48%;
40+
text-align: left;
41+
margin-left: 2%;
42+
}
43+
44+
/* Improve code block styling */
45+
.column-left pre,
46+
.column-right pre {
47+
font-size: 0.85em;
48+
line-height: 1.3;
49+
overflow-x: auto;
50+
white-space: pre;
51+
word-wrap: normal;
52+
max-width: 100%;
53+
}
54+
55+
/* Ensure code blocks don't get too narrow */
56+
.column-left pre code,
57+
.column-right pre code {
58+
font-size: 0.8em;
59+
line-height: 1.2;
60+
}
61+
62+
/* Better spacing for code chunks */
63+
.column-left .sourceCode,
64+
.column-right .sourceCode {
65+
margin-bottom: 1em;
66+
}
67+
68+
/* Responsive design for smaller screens */
69+
@media (max-width: 768px) {
70+
.column-left,
71+
.column-right {
72+
width: 100%;
73+
float: none;
74+
margin: 0 0 1em 0;
75+
}
4376
}
4477
</style>
4578

@@ -69,64 +102,107 @@ Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T.
69102
[Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7)
70103

71104

72-
```{r, echo=FALSE, include=FALSE, }
105+
```{r setup-libraries-and-theme, echo=FALSE, include=FALSE}
73106
library(knitr)
74-
# knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
75-
# message = FALSE, cache.lazy = FALSE)
76-
77107
library(dplyr)
78108
library(tidyr)
79109
library(tibble)
110+
library(purrr)
80111
library(magrittr)
112+
library(forcats)
81113
library(ggplot2)
82114
library(ggrepel)
115+
library(SummarizedExperiment)
83116
library(tidybulk)
84117
library(tidySummarizedExperiment)
85118
library(airway)
86119
120+
# Define theme
87121
my_theme =
88122
theme_bw() +
89123
theme(
90124
panel.border = element_blank(),
91125
axis.line = element_line(),
92-
panel.grid.major = element_line(size = 0.2),
93-
panel.grid.minor = element_line(size = 0.1),
126+
panel.grid.major = element_line(linewidth = 0.2),
127+
panel.grid.minor = element_line(linewidth = 0.1),
94128
text = element_text(size=12),
95129
legend.position="bottom",
96130
aspect.ratio=1,
97131
strip.background = element_blank(),
98132
axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
99133
axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
100134
)
135+
```
136+
137+
In this vignette we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package.
101138

102-
# Load airway dataset
139+
```{r load airway}
140+
library(airway)
103141
data(airway)
142+
```
143+
144+
This workflow will use the [tidySummarizedExperiment](https://bioconductor.org/packages/tidySummarizedExperiment/) package to manipulate the data in a `tidyverse` fashion. This approach streamlines data manipulation and analysis, making the code more efficient and easier to understand.
145+
146+
```{r load tidySummarizedExperiment}
147+
library(tidySummarizedExperiment)
148+
```
149+
150+
Here we will add a gene symbol column to the `airway` object. This will be used to interpret the differential expression analysis, and to deconvolve the cellularity.
104151

105-
# Add gene symbol and entrez for better annotation
152+
```{r add-gene-symbol}
153+
library(org.Hs.eg.db)
154+
library(AnnotationDbi)
155+
156+
# Add gene symbol and entrez
106157
airway <-
107-
airway |>
108-
mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
109-
keys = .feature,
110-
keytype = "ENSEMBL",
111-
column = "SYMBOL",
112-
multiVals = "first"
113-
)) |>
158+
airway |>
114159
115-
mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
116-
keys = .feature,
117-
keytype = "ENSEMBL",
160+
mutate(entrezid = mapIds(org.Hs.eg.db,
161+
keys = gene_name,
162+
keytype = "SYMBOL",
118163
column = "ENTREZID",
119164
multiVals = "first"
120165
))
121166
122-
# Convert dex to factor for proper differential expression analysis
123-
colData(airway)$dex = as.factor(colData(airway)$dex)
124167
168+
169+
detach("package:org.Hs.eg.db", unload = TRUE)
170+
detach("package:AnnotationDbi", unload = TRUE)
125171
```
126172

127-
## Aggregate duplicated `transcripts`
173+
# Side-by-side Comparison with Standard Interfaces
128174

129-
tidybulk provide the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters.
175+
This vignette demonstrates how tidybulk compares to standard R/Bioconductor approaches for transcriptomic data analysis. We'll show the same analysis performed using both tidybulk (tidyverse-style) and traditional methods side by side.
176+
177+
## Data Overview
178+
179+
We will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells:
180+
181+
```{r data-overview}
182+
airway
183+
```
184+
185+
Loading `tidySummarizedExperiment` makes `SummarizedExperiment` objects compatible with tidyverse tools while preserving their `SummarizedExperiment` class, so we can manipulate the data with `tidyverse` verbs.
186+
187+
```{r check-se-class}
188+
class(airway)
189+
```
190+
191+
### Prepare Data for Analysis
192+
193+
Before analysis, we need to ensure our variables are in the correct format:
194+
195+
```{r convert-condition-to-factor}
196+
# Convert dex to factor for proper differential expression analysis
197+
airway = airway |>
198+
mutate(dex = as.factor(dex))
199+
```
200+
201+
## Step 1: Aggregate Duplicated Transcripts
202+
203+
tidybulk provides the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, Ensembl identifiers). For example, we often have to convert Ensembl identifiers to gene/transcript symbols, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble in which transcripts with the same name are aggregated. All the remaining columns are appended, and factors and booleans are appended as characters.
204+
205+
> Transcript aggregation is a standard bioinformatics approach for gene-level summarization.
130206
131207
<div class="column-left">
132208
TidyTranscriptomics
@@ -153,10 +229,12 @@ colnames(dge_list.nr) <- colnames(dge_list)
153229
</div>
154230
<div style="clear:both;"></div>
155231

156-
## Scale `counts`
232+
## Step 2: Scale Abundance
157233

158234
We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). `scale_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method as arguments and returns a tibble with additional columns with scaled data as `<NAME OF COUNT COLUMN>_scaled`.
159235

236+
> Normalization is crucial for comparing expression levels across samples with different library sizes.
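
As a rough illustration of why library size matters, the base-R sketch below scales a toy count matrix to counts per million. This is a simpler stand-in, not the TMM method cited above and not what `scale_abundance` does internally; the matrix values and the names `counts`, `lib_size` and `cpm` are made up for the example.

```r
# Toy count matrix (hypothetical values): 4 genes x 3 samples.
# The samples differ only in sequencing depth (1x, 2x, 3x).
counts <- matrix(
  c(10,  20,  30,
    100, 200, 300,
    5,   10,  15,
    85,  170, 255),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Library size: total counts per sample
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, then rescale
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

cpm
```

After scaling, the three samples have identical profiles, which is what we would hope for when depth is the only difference between them.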
237+
160238
<div class="column-left">
161239
TidyTranscriptomics
162240
```{r normalise}
@@ -180,7 +258,7 @@ norm_counts.table <- cpm(dgList)
180258
<div style="clear:both;"></div>
181259

182260
```{r, include=FALSE}
183-
airway.norm |> select(`counts`, counts_scaled, .abundant, everything())
261+
airway.norm |> dplyr::select(`counts`, counts_scaled, .abundant, everything())
184262
```
185263

186264
We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts and on the y axis the density; data is grouped by sample and coloured by treatment.
@@ -194,9 +272,11 @@ airway.norm |>
194272
my_theme
195273
```
196274

197-
## Filter `variable transcripts`
275+
## Step 3: Filter Variable Transcripts
276+
277+
We may want to identify and filter variable transcripts to focus on the most informative features.
198278

199-
We may want to identify and filter variable transcripts.
279+
> Variable transcript filtering helps reduce noise and focuses analysis on the most informative features.
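
The idea can be sketched in base R by ranking features on their variance across samples and keeping the top few. This is only an illustration under assumed inputs: the matrix `log_expr`, the cutoff `top_n` and the choice of plain variance as the ranking statistic are hypothetical and may differ from what tidybulk uses.

```r
# Toy log-expression matrix (hypothetical values): 4 genes x 4 samples
log_expr <- rbind(
  gene1 = c(1, 1,   1, 1),    # constant across samples
  gene2 = c(0, 2,   0, 2),    # moderately variable
  gene3 = c(0, 10,  0, 10),   # highly variable
  gene4 = c(5, 5.1, 5, 5.1)   # nearly constant
)

# Per-gene variance across samples
gene_var <- apply(log_expr, 1, var)

# Keep the top_n most variable genes (top_n is an illustrative choice)
top_n <- 2
keep <- names(sort(gene_var, decreasing = TRUE))[seq_len(top_n)]
filtered <- log_expr[keep, , drop = FALSE]

filtered
```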
200280
201281
<div class="column-left">
202282
TidyTranscriptomics
@@ -230,10 +310,12 @@ norm_counts.table$cell_type = tibble_counts[
230310
<div style="clear:both;"></div>
231311

232312

233-
## Reduce `dimensions`
313+
## Step 4: Reduce Dimensions
234314

235315
We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.
236316

317+
> Dimensionality reduction helps visualize high-dimensional data and identify patterns.
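
For readers unfamiliar with the two methods named above, here is a minimal base-R sketch on simulated data. It uses `prcomp` for PCA and `cmdscale` on Euclidean distances for classical MDS; the distance measure used by the MDS method in the vignette may differ, and the object names (`expr`, `pcs`, `mds`) are assumptions for this example only.

```r
set.seed(42)

# Simulated log-expression matrix (hypothetical): 50 genes x 6 samples
expr <- matrix(rnorm(50 * 6), nrow = 50)
colnames(expr) <- paste0("sample", 1:6)

# PCA: samples must be in rows, so transpose first
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:2]

# Classical MDS on Euclidean distances between samples
sample_dist <- dist(t(expr))
mds <- cmdscale(sample_dist, k = 2)

head(pcs)
head(mds)
```

Both results have one row per sample and one column per reduced dimension, which is the shape that gets joined back onto the sample annotation in the tidy workflow.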
318+
237319
**MDS** (Robinson et al., 10.1093/bioinformatics/btp616)
238320

239321
<div class="column-left">
@@ -268,7 +350,7 @@ cmds$cell_type = tibble_counts[
268350
On the x and y axes we have the reduced dimensions 1 to 3; data is coloured by treatment.
269351

270352
```{r plot_mds, fig.width = 10, fig.height = 10, eval=FALSE}
271-
airway.norm.MDS |> pivot_sample() |> select(contains("Dim"), everything())
353+
airway.norm.MDS |> pivot_sample() |> dplyr::select(contains("Dim"), everything())
272354
273355
airway.norm.MDS |>
274356
pivot_sample() |>
@@ -306,7 +388,7 @@ On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured
306388

307389
```{r plot_pca, fig.width = 10, fig.height = 10, eval=FALSE}
308390
309-
airway.norm.PCA |> pivot_sample() |> select(contains("PC"), everything())
391+
airway.norm.PCA |> pivot_sample() |> dplyr::select(contains("PC"), everything())
310392
311393
airway.norm.PCA |>
312394
pivot_sample() |>
@@ -315,9 +397,11 @@ airway.norm.PCA |>
315397

316398

317399

318-
## Rotate `dimensions`
400+
## Step 5: Rotate Dimensions
319401

320402
We may want to rotate the reduced dimensions (or any two numeric columns really) of our data by a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as `<NAME OF DIMENSION> rotated <ANGLE>` by default, or as specified in the input arguments.
403+
404+
> Dimension rotation can help align data with biological axes of interest.
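
The underlying operation is a standard 2D rotation. The helper below, `rotate_2d`, is a hypothetical function written for this illustration and is not part of tidybulk; it rotates counterclockwise, which is an assumption and may not match the sign convention of `rotate_dimensions`.

```r
# Hypothetical helper: rotate paired coordinates counterclockwise by an angle in degrees
rotate_2d <- function(x, y, angle_degrees) {
  theta <- angle_degrees * pi / 180
  rot <- matrix(
    c(cos(theta), -sin(theta),
      sin(theta),  cos(theta)),
    nrow = 2, byrow = TRUE
  )
  # Each column of rbind(x, y) is one point; multiply all points at once
  rotated <- rot %*% rbind(x, y)
  data.frame(x_rotated = rotated[1, ], y_rotated = rotated[2, ])
}

# Rotate the points (1, 0) and (0, 1) by 90 degrees
res <- rotate_2d(x = c(1, 0), y = c(0, 1), angle_degrees = 90)
res
```

Up to floating-point error, (1, 0) maps to (0, 1) and (0, 1) maps to (-1, 0).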
321405
<div class="column-left">
322406
TidyTranscriptomics
323407
```{r rotate}
@@ -366,9 +450,11 @@ airway.norm.MDS.rotated |>
366450
my_theme
367451
```
368452

369-
## Test `differential abundance`
453+
## Step 8: Test Differential Abundance
370454

371455
We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). `test_differential_expression` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).
456+
457+
> Differential expression analysis identifies genes that are significantly different between conditions.
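
To make the output columns concrete (log fold change, p-value, false discovery rate), here is a deliberately simple base-R sketch on simulated log-expression values. It uses a per-gene Welch t-test with Benjamini-Hochberg adjustment as a stand-in; it is not the edgeR model used in the vignette, and all names and simulated values are assumptions.

```r
set.seed(7)

# Simulated log-expression (hypothetical): 100 genes, 4 untreated + 4 treated samples
n_genes <- 100
control <- matrix(rnorm(n_genes * 4, mean = 5), nrow = n_genes)
treated <- matrix(rnorm(n_genes * 4, mean = 5), nrow = n_genes)
treated[1:5, ] <- treated[1:5, ] + 6   # spike in a shift for the first 5 genes

log_expr <- cbind(control, treated)
group <- factor(rep(c("untrt", "trt"), each = 4), levels = c("untrt", "trt"))

# Per-gene Welch t-test between the two groups
p_value <- apply(log_expr, 1, function(x) t.test(x ~ group)$p.value)

# Difference in group means on the log scale
log_fc <- rowMeans(log_expr[, group == "trt"]) - rowMeans(log_expr[, group == "untrt"])

# Benjamini-Hochberg false discovery rate
fdr <- p.adjust(p_value, method = "BH")

results <- data.frame(
  gene = paste0("gene", seq_len(n_genes)),
  log_fc = log_fc,
  p_value = p_value,
  fdr = fdr
)
head(results[order(results$fdr), ])
```

Dedicated count-based methods such as edgeR model the data quite differently, so this sketch only shows the shape of the result table.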
372458
<div class="column-left">
373459
TidyTranscriptomics
374460
```{r de, message=FALSE, warning=FALSE, results='hide'}
@@ -410,10 +496,12 @@ airway.de =
410496
pivot_transcript()
411497
```
412498

413-
## Adjust `counts`
499+
## Step 6: Adjust for Unwanted Variation
414500

415501
We may want to adjust `counts` for (known) unwanted variation. `adjust_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as `<COUNT COLUMN>_adjusted`. At the moment only one unwanted covariate is allowed at a time.
416502

503+
> Batch effect correction is important for removing technical variation that could confound biological signals.
504+
417505
<div class="column-left">
418506
TidyTranscriptomics
419507
```{r adjust, message=FALSE, warning=FALSE, results='hide'}
@@ -508,10 +596,12 @@ results$cell_type = tibble_counts[
508596
509597
```
510598

511-
## Cluster samples
599+
## Step 7: Cluster Samples
512600

513601
We may want to cluster our samples based on the transcriptomic profiles. `cluster_elements` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the cluster labels.
514602

603+
> Clustering helps identify groups of samples with similar expression profiles.
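
As a small illustration of sample clustering, the sketch below runs base R's `kmeans` on simulated profiles with two well-separated groups. The data, the number of centers and the object names are assumptions for the example and do not come from the `airway` analysis.

```r
set.seed(123)

# Simulated profiles (hypothetical): 6 samples x 20 features, two separated groups
group_a <- matrix(rnorm(3 * 20, mean = 0),  nrow = 3)
group_b <- matrix(rnorm(3 * 20, mean = 10), nrow = 3)
samples <- rbind(group_a, group_b)
rownames(samples) <- paste0("sample", 1:6)

# k-means with 2 centers; nstart > 1 reduces sensitivity to the random start
km <- kmeans(samples, centers = 2, nstart = 10)

# One cluster label per sample
km$cluster
```

The cluster labels themselves (1 or 2) are arbitrary; what matters is which samples share a label.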
604+
515605
<div class="column-left">
516606
TidyTranscriptomics
517607
```{r cluster}
@@ -535,10 +625,12 @@ cluster_labels = kmeans_result$cluster
535625
</div>
536626
<div style="clear:both;"></div>
537627

538-
## Test differential abundance
628+
## Step 9: Test Differential Abundance (Alternative Method)
539629

540630
We may want to test for differential abundance between conditions. `test_differential_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model, and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).
541631

632+
> This demonstrates an alternative approach to differential expression analysis.
633+
542634
<div class="column-left">
543635
TidyTranscriptomics
544636
```{r differential}
