Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt

Abstract

Genomic experiments produce multiple views of biological systems, among them DNA sequence and copy number variation, and mRNA and protein abundance. Understanding these systems requires integrated bioinformatic analysis. Public databases such as Ensembl provide relationships and mappings between the relevant sets of probe and target molecules. However, the relationships can be biologically complex, and the content of the databases is dynamic. We demonstrate how to use the computational environment R to integrate and jointly analyze experimental datasets, employing BioMart web services to provide the molecule mappings. We also discuss typical problems encountered in making gene-to-transcript-to-protein mappings. The approach provides a flexible, programmable and reproducible basis for state-of-the-art bioinformatic data integration.
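
The mapping problems mentioned in the abstract can be illustrated with a small sketch. It is written in plain Python rather than the R/biomaRt code used in the protocol, and it only illustrates the join logic. The probe and gene identifiers, the measurement values and the helper name `join_via_mapping` are all hypothetical and are not taken from the article or from biomaRt. The sketch pairs mRNA and protein measurements through a probe-to-gene mapping table and reports probes that map ambiguously or not at all.

```python
from collections import defaultdict

# Hypothetical mapping table of (probe_id, gene_id) pairs, in the shape a
# mapping service might return. probe_2 maps to two genes (one-to-many).
# All identifiers are invented for illustration.
mapping = [
    ("probe_1", "GENE_A"),
    ("probe_2", "GENE_B"),
    ("probe_2", "GENE_C"),
    ("probe_3", "GENE_B"),
]

# Hypothetical measurements keyed by two different identifier types.
mrna = {"probe_1": 7.2, "probe_2": 5.1, "probe_3": 6.3, "probe_4": 4.0}
protein = {"GENE_A": 1.8, "GENE_B": 0.9}


def join_via_mapping(mrna, protein, mapping):
    """Pair mRNA and protein values through a probe-to-gene mapping.

    Returns (pairs, ambiguous, unmapped):
      pairs     - gene_id -> (mean mRNA of uniquely mapped probes, protein value)
      ambiguous - sorted probes mapping to more than one gene (left out of pairs)
      unmapped  - sorted probes with no entry in the mapping table
    """
    genes_by_probe = defaultdict(set)
    for probe, gene in mapping:
        genes_by_probe[probe].add(gene)

    ambiguous = sorted(p for p in mrna if len(genes_by_probe.get(p, ())) > 1)
    unmapped = sorted(p for p in mrna if p not in genes_by_probe)

    # Collect mRNA values per gene, using only probes with exactly one gene.
    values_by_gene = defaultdict(list)
    for probe, value in mrna.items():
        genes = genes_by_probe.get(probe, set())
        if len(genes) == 1:
            (gene,) = genes
            values_by_gene[gene].append(value)

    # Keep only genes that also have a protein measurement.
    pairs = {
        gene: (sum(vals) / len(vals), protein[gene])
        for gene, vals in values_by_gene.items()
        if gene in protein
    }
    return pairs, ambiguous, unmapped


pairs, ambiguous, unmapped = join_via_mapping(mrna, protein, mapping)
print(pairs)
print(ambiguous)
print(unmapped)
```

Dropping ambiguous probes is only one possible policy. Depending on the analysis, such probes could instead be kept and assigned to every gene they map to, or resolved by hand.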

Figure 1: Principal component analysis using the mRNA profiles of the 200 most variable probesets.
Figure 2: The CGH log-ratios of chromosome 1 for three cell lines (MCF10A, BT549 and BT483).
Figure 3: Expression data of probes mapping to chromosome 1 for the two cell lines BT483 and BT549.
Figure 4
Figure 5: Heatmap showing a hierarchical clustering of the proteins (down the right-hand side) and samples (along the bottom) based on the protein expression measurements.
Figure 6: Expression profiles of AURKA over the cell lines (along the x-axis) for mRNA (orange) and protein (green) levels.
Figure 7: Scatterplots of protein expression levels versus mRNA expression levels in four cell lines.

References

  1. R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2008) ISBN 3-900051-07-0.

  2. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

  3. Kasprzyk, A. et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 14, 160–169 (2004).

  4. Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res. 37, D690–D697 (2009).

  5. Rogers, A. et al. WormBase 2007. Nucleic Acids Res. 36, D612–D617 (2008).

  6. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37, D619–D622 (2009).

  7. Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).

  8. Durinck, S. Integrating biological data resources into R with biomaRt. The Newsletter of the R Project 6/5, 40–45 (2006).

  9. Boutros, M. et al. Analysis of cell-based RNAi screens. Genome Biol. 7, R66 (2006).

  10. Wei, J.S. et al. The MYCN oncogene is a direct target of miR-34a. Oncogene 27, 5204–5213 (2008).

  11. Hahne, F. et al. Bioconductor Case Studies (Springer Verlag, New York, USA, 2008).

  12. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65 (2007).

  13. Bruford, E.A. et al. The HGNC database in 2008: a resource for the human genome. Nucleic Acids Res. 36, D445–D448 (2008).

  14. Neve, R.M. et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 10, 515–527 (2006).

  15. Parkinson, H. et al. ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37, D868–D872 (2009).

  16. Irizarry, R.A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).

Acknowledgements

We thank Arek Kasprzyk and Rhoda Kinsella for insightful discussions.

This work was partially funded by the U24 CA126551 grant.

Author information

Corresponding author

Correspondence to Steffen Durinck.

Supplementary information

Supplementary Data 1

Zip archive containing the raw data of the Neve et al. study on a panel of 51 breast cell lines. It consists of Affymetrix CEL files of gene expression measurements deposited in ArrayExpress as experiment E-TABM-157, and Array CGH and protein quantification data which are available from http://cancer.lbl.gov/breastcancer. (ZIP 168067 kb)

About this article

Cite this article

Durinck, S., Spellman, P., Birney, E. et al. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4, 1184–1191 (2009). https://doi.org/10.1038/nprot.2009.97
