98 changes: 98 additions & 0 deletions vignettes/ExtendingGenomicRanges.Rmd
@@ -0,0 +1,98 @@
---
title: "Extending *GenomicRanges*"
author:
- name: "Michael Lawrence"
- name: "Bioconductor Team"
date: "Edited: Oct 2014; Compiled: `r format(Sys.time(), '%d %B, %Y')`"
package: GenomicRanges
vignette: >
  %\VignetteIndexEntry{Extending Genomic Ranges}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
---

# Introduction

The goal of `r Biocpkg("GenomicRanges")` is to provide general containers for
genomic data. The central class, at least from the user perspective, is
*GRanges*, which formalizes the notion of ranges, while allowing for arbitrary
"metadata columns" to be attached to it. These columns offer the same
flexibility as the venerable *data.frame* and permit users to adapt *GRanges* to
a wide variety of *ad hoc* use cases.

The more we encounter a particular problem, the better we understand it. We
eventually develop a systematic approach for solving the most frequently
encountered problems, and every systematic approach deserves a systematic
implementation. For example, we might want to formally store genetic variants,
with information on alleles and read depths. The metadata columns, which were so
useful during prototyping, are inappropriate for extending the formal semantics
of our data structure: for the sake of data integrity, we need to ensure that
the columns are always present and that they meet certain constraints.
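
As a minimal sketch of why metadata columns fall short here, the chunk below builds a small *GRanges* of variant calls and then silently drops and retypes columns. The column names (`ref`, `alt`, `totalDepth`) and the values are made up for illustration; they are not a schema defined by the package.

```{r mcols-not-enforced, message=FALSE}
library(GenomicRanges)

# Illustrative variant calls; the column names are hypothetical
gr <- GRanges(c("chr1", "chr1"),
              IRanges(start = c(100, 200), width = 1),
              ref = c("A", "C"), alt = c("T", "G"),
              totalDepth = c(30L, 12L))
names(mcols(gr))

# Nothing stops downstream code from dropping or retyping a column
gr$totalDepth <- NULL
gr$alt <- c(1, 2)
names(mcols(gr))
class(gr$alt)
```

The object remains a valid *GRanges* after both assignments, which is exactly the lack of guarantees described above.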

We might also find that our prototype does not scale well to the increased data
volume that often occurs when we advance past the prototype stage. *GRanges* is
meant mostly for prototyping and stores its data in memory as simple R data
structures. We may require something more specialized when the data are large;
for example, we might store the data as a Tabix-indexed file, or in a database.

The `r Biocpkg("GenomicRanges")` package does not directly solve either of these
problems, because there are no general solutions. However, it is adaptable to
specialized use cases.

# The *GenomicRanges* abstraction

Unbeknownst to many, most of the *GRanges* implementation is provided by methods
on the *GenomicRanges* class, the virtual parent class of *GRanges*.
*GenomicRanges* methods provide everything except for the actual data storage
and retrieval, which *GRanges* implements directly using slots. For example, the
ranges are retrieved like this:

```{r granges-ranges, message=FALSE}
library(GenomicRanges)
selectMethod(ranges, "GRanges")
```

An alternative implementation is *DelegatingGenomicRanges*, which stores all of its data in a delegate *GenomicRanges* object:

```{r delegating-granges-ranges}
selectMethod(ranges, "DelegatingGenomicRanges")
```
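
The relationship between these classes can be inspected with the standard *methods* tools. This is only a quick check, and it assumes that the delegate is held in a slot named `delegate`, as the accessor shown above suggests.

```{r class-relationships, message=FALSE}
library(GenomicRanges)

# GenomicRanges is virtual, so it cannot be instantiated directly
isVirtualClass("GenomicRanges")

# GRanges and DelegatingGenomicRanges both extend it
is(GRanges(), "GenomicRanges")
extends("DelegatingGenomicRanges", "GenomicRanges")

# The delegate is itself any GenomicRanges object (assumed slot name)
getSlots("DelegatingGenomicRanges")["delegate"]
```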

This abstraction enables us to pursue more efficient implementations for
particular tasks. One example is *GNCList*, which is indexed for fast range
queries and keeps the ranges it indexes in its `granges` slot, which we expose
here:

```{r gnclist-granges}
getSlots("GNCList")["granges"]
```
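
Because *GNCList* is a *GenomicRanges*, it can stand in for a *GRanges* as the subject of an overlap query. The sketch below uses made-up coordinates and simply checks that the indexed and plain representations report the same number of hits.

```{r gnclist-overlaps, message=FALSE}
library(GenomicRanges)

# Toy data; the coordinates are illustrative only
subject <- GRanges(c("chr1", "chr1", "chr2"),
                   IRanges(start = c(1, 50, 1), end = c(20, 80, 30)))
query <- GRanges(c("chr1", "chr2"),
                 IRanges(start = c(10, 25), end = c(60, 40)))

# Build the index once, then reuse it across queries
indexed <- GNCList(subject)

hits_indexed <- findOverlaps(query, indexed)
hits_plain <- findOverlaps(query, subject)
length(hits_indexed) == length(hits_plain)
```

Preprocessing of this kind is only likely to pay off when the same subject is queried repeatedly.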

The `r Biocpkg("MutableRanges")` package in svn provides other, untested
examples.

# Formalizing `mcols`: Extra column slots

A problem orthogonal to data storage is adding semantics by formalizing
metadata columns, and we solve it using the "extra column slot" mechanism.
Whenever *GenomicRanges* needs to operate on its metadata columns, it also
calls the internal `extraColumnSlotNames` generic, whose methods should return
a character vector naming the slots in the *GenomicRanges* subclass that
correspond to columns (i.e., they have one value per range). It then extracts
the slot values and manipulates them as it would a metadata column, except
that they are now formal slots with formal types.

An example is the *VRanges* class in `r Biocpkg("VariantAnnotation")`. It stores
information on the variants by adding these column slots:

```{r vranges, message=FALSE, warning=FALSE}
# extraColumnSlotNames is internal (hence `:::`); VRanges() is exported
GenomicRanges:::extraColumnSlotNames(VariantAnnotation::VRanges())
```

Mostly for historical reasons, *VRanges* extends *GRanges*. However, since the
data storage mechanism and the set of extra column slots are orthogonal, it is
probably best practice to take a composition approach by extending
*DelegatingGenomicRanges*.
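
A rough sketch of that composition approach follows. The class name `DepthRanges` and its `totalDepth` slot are hypothetical and exist only for this illustration. The sketch stops at the class and method definitions: a real subclass would also need a constructor and validity checks, which are omitted here.

```{r composition-sketch, message=FALSE}
library(GenomicRanges)

# Hypothetical subclass: range storage is left to the inherited delegate,
# while 'totalDepth' becomes a formal, typed column slot
setClass("DepthRanges",
         representation(totalDepth = "integer"),
         contains = "DelegatingGenomicRanges")

# Tell GenomicRanges to treat the slot as a column (one value per range).
# The generic is internal, so it is referenced with `:::`.
setMethod(GenomicRanges:::extraColumnSlotNames, "DepthRanges",
          function(x) "totalDepth")

extends("DepthRanges", "GenomicRanges")
getSlots("DepthRanges")["totalDepth"]
```

Since the storage backend is just the delegate, the same hypothetical class could in principle wrap a *GRanges*, a *GNCList*, or any other *GenomicRanges* implementation without changing its column semantics.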
121 changes: 0 additions & 121 deletions vignettes/ExtendingGenomicRanges.Rnw

This file was deleted.