  • Review Article

Integrating biological databases

Key Points

  • Online resources of biological data, such as the model-organism-system databases and the various genome databases, have become vital to the work of many scientists.

  • The diverse user interfaces of online databases can be confusing to biologists, who must go back and forth between them.

  • Differences in both database form and content are a hindrance to both biologists and bioinformaticists who wish to integrate the data sets.

  • This review article surveys the state of the art in the integration of biological databases and indicates a path forward.

Abstract

Recent years have seen an explosion in the amount of available biological data. More and more genomes are being sequenced and annotated, and protein and gene interaction data are accumulating. Biological databases have been invaluable for managing these data and for making them accessible. Depending on the data that they contain, the databases fulfil different functions. But, although they are architecturally similar, so far their integration has proved problematic.

Figure 1: Biological database architecture.
Figure 2
Figure 3
Figure 4
Figure 5: Data warehousing.
Figure 6: Knuckles-and-nodes approach.

Acknowledgements

The author thanks the anonymous reviewers for their helpful comments, and D. Gessler, M. Wilkinson, N. Goodman and S. Lewis for many illuminating discussions.

Related links

DATABASES

LocusLink

mrt-2

rad

S. pombe GeneDB

rad17

rad24

Saccharomyces Genome Database

Rad17

Rad24

WormBase

rad-3

FURTHER INFORMATION

BioMOBY project

Ensembl

FlyBase

Gene Ontology Consortium

Gene Ontology (GO) Database

Global Open Biological Ontologies

HUGO Gene Nomenclature Committee

Interoperable Informatics Infrastructure Consortium

LocusLink

MyGrid project

Omniview genome viewer

PubMed

RefSeq

Semantic Web

Sequence Ontology Project

The Institute for Genomic Research (TIGR) Database

UCSC Genome Browser

University of Indiana euGenes database

WormBase

Glossary

ORTHOLOGUE

A homologous gene that is derived from a speciation event or by vertical descent.

FLAT FILES

Data files that contain records with no structured relationships.

KNOWLEDGE DOMAIN

A body of knowledge that is often associated with a specialized scientific discipline.

LINE

Long interspersed-repeat transposable elements.

SINE

Short interspersed-repeat transposable elements.

SYNTAX

The grammar, structure and order of elements in a language statement. In computing, it refers to the rules that govern the structure of computer commands — for example, statements or other instructions that are used in code.

About this article

Cite this article

Stein, L. Integrating biological databases. Nat Rev Genet 4, 337–345 (2003). https://doi.org/10.1038/nrg1065

  • DOI: https://doi.org/10.1038/nrg1065
