  • Opinion

Addressing the problems with life-science databases for traditional uses and systems biology

Abstract

A prerequisite to systems biology is the integration of heterogeneous experimental data, which are stored in numerous life-science databases. However, a wide range of obstacles that relate to access, handling and integration impede the efficient use of the contents of these databases. Addressing these issues will not only be essential for progress in systems biology, it will also be crucial for sustaining the more traditional uses of life-science databases.

Figure 1: Classical and systems biology roles of life-science databases.
Figure 2: The database integration process: a database warehouse as an example.
Figure 3: Alternative representations of metabolic pathways: alcohol dehydrogenase as an example.

Acknowledgements

The authors would like to thank C. Rawlings and P. Verrier for commenting on an earlier version of this article. Furthermore, we would like to thank the following individuals for exploring with us the pitfalls of life-science databases over the past years: J. Baumbach, J. Butz, E. Kirchem, F. Klingert, S. Knop, B. Kormeier, I. Kupp, A. Neu, A. Rüegg, A. Skusa, B. Steuernagel, J. Taubert, P. Verrier and R. Winnenburg. S.P. gratefully acknowledges funding by the European Science Foundation. Rothamsted Research receives grant-aided support from the UK Biotechnology and Biological Sciences Research Council.

Author information

Corresponding author

Correspondence to Stephan Philippi.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

BioJava

BioPAX — Biological Pathways Exchange

BioPerl

BioRuby

CellML

DiscoveryLink

DNA Data Bank of Japan

EC (enzyme class) numbers of the enzyme nomenclature

Ensembl Trace Server

European Bioinformatics Institute SRS server

European Bioinformatics Institute

Extensible Markup Language (XML)

Gene Ontology homepage

Kyoto Encyclopedia of Genes and Genomes

Microarray Gene Expression Data Society

mySQL

NCBI taxonomy

Nucleic Acids Research Database Categories List

ONDEX

Open Biomedical Ontologies

Open Source Initiative License Index

PostgreSQL

Proteomics Standards Initiative — molecular interaction

Systems biology markup language

Universal Protein Resource

Web Services Activity

Glossary

Controlled vocabulary

A standardized set of terms that can be used in a given application domain. A prominent example is the enzyme class nomenclature, which describes classes of biochemical reaction.

Database management system

A system that provides a means of storing, modifying and extracting data from a database.

Evidence code

A controlled vocabulary that is used to track the types of evidence that support a gene annotation.

Flat file

Human-readable, non-standardized files that can be used to exchange the contents of life-science databases.

Ontology

A commonly agreed definition of real-world concepts, such as 'protein' and 'enzyme', and their particular relationships, for example, an enzyme 'is a' protein.

Parser

Software that reads a given input, such as a flat file, for further processing.

Web service

A standardized way to allow for interoperable machine-to-machine interaction over a network.

XML

The extensible markup language (XML) is a standard for the creation of application-specific, self-descriptive markup languages, which, for example, can be used for the definition of data-exchange formats.
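
As a rough illustration of how the glossary terms 'flat file', 'parser' and 'XML' relate to one another, the following Python sketch reads a small flat file and re-emits its contents as XML. The file layout (two-letter line codes, with `//` closing a record), the entry names and the XML element names `entries`, `entry` and `field` are assumptions made up for this example; they are loosely modelled on line-code style flat files and do not reproduce the format of any particular database.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat-file content: a two-letter line code, then a value;
# a line starting with "//" closes the current record.
FLAT_FILE = """\
ID   ADH1_EXAMPLE
DE   Alcohol dehydrogenase (illustrative entry)
EC   1.1.1.1
//
ID   UNKNOWN_EXAMPLE
DE   Entry with a partial enzyme class
EC   1.1.-.-
//
"""


def parse_flat_file(text):
    """Return a list of records, each a dict: line code -> list of values."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            if current:
                records.append(current)
            current = {}
            continue
        # Split at the first space: line code on the left, value on the right.
        code, _, value = line.partition(" ")
        current.setdefault(code, []).append(value.strip())
    if current:  # tolerate a final record without a terminator
        records.append(current)
    return records


def to_xml(records):
    """Serialize parsed records as XML (element names are illustrative)."""
    root = ET.Element("entries")
    for record in records:
        entry = ET.SubElement(root, "entry", id=record.get("ID", [""])[0])
        for code, values in record.items():
            if code == "ID":
                continue  # already stored as the entry's id attribute
            for value in values:
                field = ET.SubElement(entry, "field", code=code)
                field.text = value
    return ET.tostring(root, encoding="unicode")


records = parse_flat_file(FLAT_FILE)
xml_text = to_xml(records)
print(xml_text)
```

The parser in this sketch has to hard-code knowledge of the line codes and the record terminator, whereas the XML output carries its own structure and can be read back by any generic XML library.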

About this article

Cite this article

Philippi, S., Köhler, J. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 7, 482–488 (2006). https://doi.org/10.1038/nrg1872

  • DOI: https://doi.org/10.1038/nrg1872
