Key Points
- Online resources of biological data, such as the model-organism-system databases and the various genome databases, have become vital to the work of many scientists.
- The diverse user interfaces of online databases can be confusing to biologists, who must go back and forth between them.
- Differences in database form and content hinder both biologists and bioinformaticists who wish to integrate the data sets.
- This review article surveys the state of the art in the integration of biological databases and indicates a path forward.
Abstract
Recent years have seen an explosion in the amount of available biological data. More and more genomes are being sequenced and annotated, and protein and gene interaction data are accumulating. Biological databases have been invaluable for managing these data and for making them accessible. Depending on the data that they contain, the databases fulfil different functions. But, although they are architecturally similar, so far their integration has proved problematic.
Acknowledgements
The author thanks the anonymous reviewers for their helpful comments, and D. Gessler, M. Wilkinson, N. Goodman and S. Lewis for many illuminating discussions.
Related links
DATABASES
LocusLink
S. pombe GeneDB
Saccharomyces Genome Database
WormBase
FURTHER INFORMATION
Global Open Biological Ontologies
HUGO Gene Nomenclature Committee
Interoperable Informatics Infrastructure Consortium
The Institute for Genomic Research (TIGR) Database
Glossary
- ORTHOLOGUE
  A homologous gene that is derived from a speciation event or by vertical descent.
- FLAT FILES
  Data files that contain records with no structured relationships.
- KNOWLEDGE DOMAIN
  A body of knowledge that is often associated with a specialized scientific discipline.
- LINE
  Long interspersed-repeat transposable elements.
- SINE
  Short interspersed-repeat transposable elements.
- SYNTAX
  The grammar, structure and order of elements in a language statement. In computing, it refers to the rules that govern the structure of computer commands, for example, statements or other instructions that are used in code.
Cite this article
Stein, L. Integrating biological databases. Nat Rev Genet 4, 337–345 (2003). https://doi.org/10.1038/nrg1065