Abstract
Genomic experiments produce multiple views of biological systems, including DNA sequence and copy number variation, and mRNA and protein abundance. Understanding these systems requires integrated bioinformatic analysis. Public databases such as Ensembl provide relationships and mappings between the relevant sets of probe and target molecules. However, the relationships can be biologically complex, and the content of the databases is dynamic. We demonstrate how to use the computational environment R to integrate and jointly analyze experimental datasets, employing BioMart web services to provide the molecule mappings. We also discuss typical problems encountered in making gene-to-transcript-to-protein mappings. The approach provides a flexible, programmable and reproducible basis for state-of-the-art bioinformatic data integration.
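To make the mapping problem concrete, the following is a minimal sketch of why gene-to-transcript-to-protein mappings are not always one-to-one. It is not the protocol's code: the protocol uses R and biomaRt, whereas this is a self-contained Python illustration, and every identifier in it (GENE_A, TX_A1, PROT_A1 and so on) is invented for the example rather than taken from Ensembl.

```python
from collections import defaultdict

# Hypothetical mapping rows (gene, transcript, protein). None marks a
# transcript without a protein product. All identifiers are invented
# for illustration and do not come from any real database.
MAPPING_ROWS = [
    ("GENE_A", "TX_A1", "PROT_A1"),
    ("GENE_A", "TX_A2", "PROT_A2"),
    ("GENE_A", "TX_A3", None),
    ("GENE_B", "TX_B1", "PROT_B1"),
    ("GENE_C", "TX_C1", None),
]


def proteins_by_gene(rows):
    """Collect the set of protein IDs reachable from each gene."""
    result = defaultdict(set)
    for gene, _transcript, protein in rows:
        result[gene]  # register the gene even if it has no protein
        if protein is not None:
            result[gene].add(protein)
    return dict(result)


def classify(mapping):
    """Label each gene's mapping as 'none', 'one-to-one' or 'one-to-many'."""
    labels = {}
    for gene, proteins in mapping.items():
        if not proteins:
            labels[gene] = "none"
        elif len(proteins) == 1:
            labels[gene] = "one-to-one"
        else:
            labels[gene] = "one-to-many"
    return labels


gene_to_proteins = proteins_by_gene(MAPPING_ROWS)
mapping_classes = classify(gene_to_proteins)

for gene in sorted(mapping_classes):
    print(gene, sorted(gene_to_proteins[gene]), mapping_classes[gene])
```

Classifying the mapping before joining datasets forces an explicit decision about what to do with genes that map to several proteins or to none, instead of letting a join silently duplicate or drop measurements.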
References
R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2008) ISBN 3-900051-07-0.
Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5 (10): R80 (2004).
Kasprzyk, A. et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 14 (1): 160–169 (2004).
Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res. 37 (Database issue): D690–D697 (2009).
Rogers, A. et al. WormBase 2007. Nucleic Acids Res. 36 (Database issue): D612–D617 (2008).
Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37 (Database issue): D619–D622 (2009).
Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
Durinck, S. Integrating biological data resources into R with biomaRt. The Newsletter of the R Project 6/5, 40–45 (2006).
Boutros, M. et al. Analysis of cell-based RNAi screens. Genome Biol. 7, R66 (2006).
Wei, J.S. et al. The MYCN oncogene is a direct target of miR-34a. Oncogene 27 (39): 5204–5213 (2008).
Hahne, F. et al. Bioconductor Case Studies (Springer Verlag, New York, USA, 2008).
Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35 (Database issue): D61–D65 (2007).
Bruford, E.A. et al. The HGNC database in 2008: a resource for the human genome. Nucleic Acids Res. 36 (Database issue): D445–D448 (2008).
Neve, R.M. et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 10, 515–527 (2006).
Parkinson, H. et al. ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37, D868–D872 (2009).
Irizarry, R.A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
Acknowledgements
We thank Arek Kasprzyk and Rhoda Kinsella for insightful discussions.
This work was partially funded by the U24 CA126551 grant.
Supplementary information
Supplementary Data 1
Zip archive containing the raw data of the Neve et al. study on a panel of 51 breast cell lines. It consists of Affymetrix CEL files of gene expression measurements, deposited in ArrayExpress as experiment E-TABM-157, and array CGH and protein quantification data, which are available from http://cancer.lbl.gov/breastcancer. (ZIP 168067 kb)
Cite this article
Durinck, S., Spellman, P., Birney, E. et al. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4, 1184–1191 (2009). https://doi.org/10.1038/nprot.2009.97
DOI: https://doi.org/10.1038/nprot.2009.97