Orchestrating high-throughput genomic analysis with Bioconductor

Journal name: Nature Methods
Volume: 12
Pages: 115–121
DOI: 10.1038/nmeth.3252

Abstract

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.

Figures

  1. Example uses of the Ranges algebra.

    A GRanges object, g (top), represents two transcript isoforms of a gene, each with two exons. The coordinates of unspliced transcripts are identified with the function range(g). Calculating the gene region involves flattening the gene model into its constituent exons and reducing these to nonoverlapping ranges, reduce(unlist(g)). Ranges defining disjoint bins, disjoin(unlist(g)), are useful in counting operations, e.g., in RNA-seq analysis. Putative promoter ranges are found using strand-aware range extension, flank(range(g), width = 100). Elementary operations can be composed to succinctly execute queries such as psetdiff(range(g), g) for computing the intron ranges.

  2. The integrative data container SummarizedExperiment.

    Its assays component is one or several rectangular arrays of equivalent row and column dimensions. Rows correspond to features, and columns to samples. The component rowData stores metadata about the features, including their genomic ranges. The colData component keeps track of sample-level covariate data. The exptData component carries experiment-level information, including MIAME (minimum information about a microarray experiment)-structured metadata (ref. 21). The R expressions exemplify how to access components. For instance, provided that these metadata were recorded, rowData(se)$entrezId returns the NCBI Entrez Gene identifiers of the features, and se$tissue returns the tissue descriptions for the samples. Range-based operations, such as %in%, act on the rowData to return a logical vector that selects the features lying within the regions specified by the data object CNVs. Together with the bracket operator, such expressions can be used to subset a SummarizedExperiment to a focused set of genes and tissues for downstream analysis.

  3. Visualization along genomic coordinates with ggbio.

    The plot shows the gene Apoe alongside RNA-seq data from mouse hematopoietic stem cells (HSC) and four fractions of multipotent progenitor (MPP) cells (ref. 22). The disjoint bins (center) were computed from the four transcript isoforms shown in the bottom panel. The y axis of the top panel shows the relative exon usage coefficients estimated with the DEXSeq method (ref. 23). Regions detected as differentially used between the cell fractions are colored dark red in the center panel.
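The range operations named in the Figure 1 caption, and the range-based selection mentioned for Figure 2, can be illustrated without Bioconductor. The sketch below is plain Python, not GenomicRanges code: the function names (span, reduce_ranges, disjoin_ranges, introns, flank_upstream, within) and the example coordinates are hypothetical, chosen only to mirror range, reduce, disjoin, psetdiff, flank and %in%. It assumes closed, 1-based integer intervals on a single chromosome, and it reads "lying within" as full containment, which is an assumption about the intended semantics.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (start, end), 1-based


def span(ranges: List[Interval]) -> Interval:
    """Smallest interval covering all inputs (the unspliced transcript)."""
    return (min(s for s, _ in ranges), max(e for _, e in ranges))


def reduce_ranges(ranges: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def disjoin_ranges(ranges: List[Interval]) -> List[Interval]:
    """Split the covered region into bins that break at every input boundary."""
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    bins: List[Interval] = []
    for left, right in zip(cuts, cuts[1:]):
        # Keep a bin only if some input interval covers it.
        if any(s <= left and right - 1 <= e for s, e in ranges):
            bins.append((left, right - 1))
    return bins


def introns(exons: List[Interval]) -> List[Interval]:
    """Gaps between the exons of one transcript (span minus exons)."""
    ex = reduce_ranges(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ex, ex[1:])]


def flank_upstream(rng: Interval, strand: str, width: int) -> Interval:
    """Strand-aware region of the given width immediately upstream of rng."""
    start, end = rng
    return (start - width, start - 1) if strand == "+" else (end + 1, end + width)


def within(features: List[Interval], regions: List[Interval]) -> List[bool]:
    """For each feature, whether it lies entirely inside any region."""
    return [any(rs <= fs and fe <= re for rs, re in regions)
            for fs, fe in features]


# Two hypothetical transcript isoforms of one gene, two exons each.
tx1 = [(1000, 1200), (1500, 1700)]
tx2 = [(1100, 1300), (1500, 1700)]
all_exons = tx1 + tx2

gene_region = reduce_ranges(all_exons)     # [(1000, 1300), (1500, 1700)]
counting_bins = disjoin_ranges(all_exons)  # four bins, split at 1100 and 1201
promoter = flank_upstream(span(tx1), "+", 100)  # (900, 999)
```

Composing the elementary functions gives the derived queries: introns(tx1) yields the single gap between the two exons, and within(gene_region, [(900, 1400)]) produces the kind of logical vector that can then be used to subset features.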

References

  1. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
  2. R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2014).
  3. Hahne, F., Huber, W., Gentleman, R. & Falcon, S. Bioconductor Case Studies (Springer, 2008).
  4. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
  5. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  6. Ohnishi, Y. et al. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages. Nat. Cell Biol. 16, 27–37 (2014).
  7. Finak, G. et al. OpenCyto: an open source infrastructure for scalable, robust, reproducible, and automated, end-to-end flow cytometry data analysis. PLoS Comput. Biol. 10, e1003806 (2014).
  8. Chelaru, F., Smith, L., Goldstein, N. & Corrada Bravo, H. Epiviz: interactive visual analytics for functional genomics data. Nat. Methods 11, 938–940 (2014).
  9. Gentleman, R. Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol. 4, Article 2 (2005).
  10. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
  11. Laufer, C., Fischer, B., Billmann, M., Huber, W. & Boutros, M. Mapping genetic interactions in human cancer cells with RNAi and multiparametric phenotyping. Nat. Methods 10, 427–431 (2013).
  12. Waldron, L. et al. Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer. J. Natl. Cancer Inst. 106, dju049 (2014).
  13. Riester, M. et al. Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples. J. Natl. Cancer Inst. 106, dju048 (2014).
  14. McMurdie, P.J. & Holmes, S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 10, e1003531 (2014).
  15. Goecks, J., Nekrutenko, A., Taylor, J. & The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
  16. Pérez, F. & Granger, B.E. IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).
  17. Anonymous. Credit for code. Nat. Genet. 46, 1 (2014).
  18. Altschul, S. et al. The anatomy of successful computational biology software. Nat. Biotechnol. 31, 894–897 (2013).
  19. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  20. Lawrence, M. & Morgan, M. Scalable genomics with R and Bioconductor. Stat. Sci. 29, 214–226 (2014).
  21. Brazma, A. et al. Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat. Genet. 29, 365–371 (2001).
  22. Cabezas-Wallscheid, N. et al. Identification of regulatory networks in HSCs and their immediate progeny via integrated proteome, transcriptome, and DNA methylome analysis. Cell Stem Cell 15, 507–522 (2014).
  23. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).
  24. Obenchain, V. et al. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076 (2014).

Author information

Affiliations

  1. European Molecular Biology Laboratory, Heidelberg, Germany.

    • Wolfgang Huber,
    • Simon Anders,
    • Andrzej K Oleś &
    • Alejandro Reyes
  2. Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA.

    • Vincent J Carey
  3. Harvard School of Public Health, Boston, Massachusetts, USA.

    • Vincent J Carey,
    • Rafael A Irizarry &
    • Michael I Love
  4. Genentech, South San Francisco, California, USA.

    • Robert Gentleman &
    • Michael Lawrence
  5. Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.

    • Marc Carlson,
    • Valerie Obenchain,
    • Hervé Pagès,
    • Paul Shannon,
    • Dan Tenenbaum &
    • Martin Morgan
  6. Department of Medical Genetics, School of Medical Sciences, State University of Campinas, Campinas, Brazil.

    • Benilton S Carvalho
  7. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.

    • Hector Corrada Bravo
  8. Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.

    • Sean Davis
  9. Department of Biochemistry, University of Cambridge, Cambridge, UK.

    • Laurent Gatto
  10. Institute for Integrative Genome Biology, University of California, Riverside, Riverside, California, USA.

    • Thomas Girke
  11. Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.

    • Raphael Gottardo
  12. Novartis Institutes for Biomedical Research, Basel, Switzerland.

    • Florian Hahne
  13. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA.

    • Kasper D Hansen
  14. Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA.

    • Kasper D Hansen
  15. Dana-Farber Cancer Institute, Boston, Massachusetts, USA.

    • Rafael A Irizarry &
    • Michael I Love
  16. Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, Washington, USA.

    • James MacDonald
  17. Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.

    • Gordon K Smyth
  18. Department of Mathematics and Statistics, University of Melbourne, Parkville, Victoria, Australia.

    • Gordon K Smyth
  19. School of Urban Public Health at Hunter College, City University of New York, New York, New York, USA.

    • Levi Waldron

Competing financial interests

The authors declare no competing financial interests.
