  • Review Article
Genotype–phenotype databases: challenges and solutions for the post-genomic era

Key Points

  • Research data concerning the genetic basis of health and disease are accumulating rapidly, as modern, high-throughput experimental techniques deliver increasingly large data sets.

  • Data integration efforts in the field face numerous challenges, including growing data size and complexity, quality control, data sensitivity and personal privacy, data access, and publication bias.

  • Traditional approaches of gathering data into centralized repositories and publishing results in static paper journals, which have proved successful in the past, will not be sufficient to address the emerging and future needs of the field.

  • The alternative of a partially centralized and partially federated model has been proposed to solve this problem. This will entail a distributed, decentralized network of interconnected information sources and analysis services, the first incarnations of which are now starting to appear. A central requirement of this model is the far greater use of standardization for data models and exchange formats, and in the deployment of existing and emerging software components and network protocols.

  • Community adoption of new database technologies, and the development of robust data standards, will be vital to achieving the global integration of G2P data in the future. This might also help to address other challenges, such as accrediting and rewarding data submitters and database managers, as we move towards the emergence of a universal G2P 'knowledge environment'.
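
As a rough illustration of the partially federated model described in the key points above, the following Python sketch sends one query to several independent sources that all answer in a single shared record format, then merges the answers. Everything named here (`G2PRecord`, `make_source`, `federated_query`, the two in-memory sources and their marker and trait values) is hypothetical and invented for illustration; a real network would call remote services rather than local tables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class G2PRecord:
    """Hypothetical shared exchange format agreed on by every source."""
    marker: str
    phenotype: str
    source: str


# A source is any callable that answers a marker query with shared-format records.
Source = Callable[[str], List[G2PRecord]]


def make_source(name: str, rows: Dict[str, List[str]]) -> Source:
    """Wrap an in-memory table as a source (a stand-in for a remote service)."""
    def query(marker: str) -> List[G2PRecord]:
        return [G2PRecord(marker, phenotype, name) for phenotype in rows.get(marker, [])]
    return query


def federated_query(sources: List[Source], marker: str) -> List[G2PRecord]:
    """Ask every source in turn and merge the answers, skipping failed sources."""
    results: List[G2PRecord] = []
    for source in sources:
        try:
            results.extend(source(marker))
        except Exception:
            continue  # one unavailable source should not break the whole query
    return results


# Invented example data, not taken from any real database.
source_a = make_source("source_a", {"marker_1": ["trait_x"]})
source_b = make_source("source_b", {"marker_1": ["trait_y"], "marker_2": ["trait_z"]})

hits = federated_query([source_a, source_b], "marker_1")
```

The merge step is only this simple because every source returns the same record type, which is the role that standardized data models and exchange formats play in the proposal.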

Abstract

The flow of research data concerning the genetic basis of health and disease is rapidly increasing in speed and complexity. In response, many projects are seeking to ensure that there are appropriate informatics tools, systems and databases available to manage and exploit this flood of information. Previous solutions, such as central databases, journal-based publication and manually intensive data curation, are now being enhanced with new systems for federated databases, database publication, and more automated management of data flows and quality control. Along with emerging technologies that enhance connectivity and data retrieval, these advances should help to create a powerful knowledge environment for genotype–phenotype information.

Figure 1: Extreme models for database integration.
Figure 2: Databases and database networks.
Figure 3: Success depends upon recognition and reward.

Acknowledgements

The authors acknowledge the valuable ideas, advice and funding provided by the GEN2PHEN project as part of the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754, which enabled the preparation of this Review.

Author information

Corresponding author

Correspondence to Anthony J. Brookes.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

AlzGene

Biobanking and Biomolecular Resources Research Infrastructure (BBMRI)

Cancer Biomedical Informatics Grid (caBIG)

Coordination and Sustainability of International Mouse Informatics Resources (CASIMIR)

dbGaP

Enabling Grids for E-sciencE (EGEE)

ENCODEdb

EuroGenetest

European Advanced Translational Research Infrastructure in Medicine (EATRIS)

European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI)

European Clinical Research Infrastructures Network (ECRIN)

European Genotype Archive (EGA)

European Life Sciences Infrastructure for Biological Information (ELIXIR)

European Model for Bioinformatics Research and Community Education (EMBRACE)

European Network of Genomic and Genetic Epidemiology (ENGAGE)

European Strategy Forum on Research Infrastructures (ESFRI)

FINDbase

Framingham Heart Study

GEN2PHEN project

Generic Model Organism Database (GMOD)

Genes, Environment and Health Initiative

Genetic Association Database (GAD)

GenomEUtwin

GWAS Database, Japan

Health Level Seven (HL7)

HGVbaseG2P

Human Gene Mutation Database (HGMD)

Human Genome Epidemiology Network (HuGENet)

Human Genome Variation Society

Human Genomics and Proteomics journal

Human Variation Project (HVP)

International Nucleotide Sequence Database Collaboration (INSDC)

Minimum Information for Biological and Biomedical Investigations (MIBBI)

Minimum Information for QTLs and Association Studies specification (MIQAS)

Obiba

Online Mendelian Inheritance in Man (OMIM)

Open Biomedical Ontologies (OBO)

OpenEHR

PDGene

Persistent Uniform Resource Locator (PURL)

Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB)

Phenotype and Genotype Experiment Object Model (PaGE-OM)

PhenX

Public Population Project in Genomics (P3G)

Public Population Project in Genomics observatory

Resource Description Framework (RDF)

Service-oriented architecture (SOA)

SzGene

Type 1 Diabetes Genetics Consortium

UK Biobank

Glossary

Screen-scraping

The automated process of extracting data from web pages intended for human viewing.
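
A minimal sketch of what screen-scraping can look like in practice, using only Python's standard `html.parser` module. The page content, the class name `TableCellScraper` and the marker and trait values are invented for illustration; a real scraper would fetch a live page and would have to cope with layout changes, which is why this approach is fragile.

```python
from html.parser import HTMLParser
from typing import List


class TableCellScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            if not self.rows:             # tolerate a cell with no enclosing row
                self.rows.append([])
            self.rows[-1].append("")      # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page written for human readers, with the data buried in markup.
PAGE = """
<html><body>
<h1>Variant report</h1>
<table>
<tr><td>marker_1</td><td>trait_x</td></tr>
<tr><td>marker_2</td><td>trait_y</td></tr>
</table>
</body></html>
"""

scraper = TableCellScraper()
scraper.feed(PAGE)
rows = scraper.rows
```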

Genotype-to-phenotype

(G2P). The relationship between genetic variation in an organism and how this affects its observable characteristics.

Genome-wide association study

(GWA study). Examination of DNA variation (typically SNPs) across the whole genome in a large number of individuals who have been matched for population ancestry and assessed for a disease or trait of interest. Correlations between variants and the trait are used to locate genetic risk factors.
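
To make the idea of a correlation between a variant and a trait concrete, the sketch below runs a Pearson chi-square test on a 2×2 table of allele counts in cases and controls for a single marker. This is one common choice of test, used here as an assumption rather than as the method of any particular study; the function name and the counts are invented. A genome-wide study would repeat such a test across many markers and correct for multiple testing.

```python
import math
from typing import Tuple


def allelic_chi_square(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    case_a / case_b: counts of allele A / allele B in cases.
    ctrl_a / ctrl_b: counts of allele A / allele B in controls.
    Returns (chi-square statistic, p-value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case = case_a + case_b
    row_ctrl = ctrl_a + ctrl_b
    col_a = case_a + ctrl_a
    col_b = case_b + ctrl_b
    if min(row_case, row_ctrl, col_a, col_b) == 0:
        raise ValueError("every row and column of the table must be non-empty")

    # Closed form of the Pearson statistic for a 2x2 table.
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (row_case * row_ctrl * col_a * col_b)

    # With 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts for one hypothetical marker.
chi2, p_value = allelic_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)
```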

Knowledge representation

Structured presentation of information that facilitates the drawing of inferences or conclusions, often giving predictive abilities.

ENCODE

(Encyclopedia of DNA Elements). An international research project to identify all functional elements in the human genome.

Biobanking

Assembling large collections of biosamples and associated information, for the purpose of biomedical investigation.

Syntax

The syntax of information is concerned with how the data is organized, ordered and structured.

Semantics

The semantics of information is concerned with the meaning of the data elements, such as words.

Semantic web

An extension of the World Wide Web that embeds semantics, or meaning, in documents, in links between documents and in descriptions of web services, thereby enabling navigation and reasoning by automated agents.
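
The following sketch suggests how machine-readable meaning enables reasoning by an automated agent. Statements are stored as subject, predicate, object triples, loosely in the spirit of RDF, and a small function follows class links to infer associations that were never stated directly. It uses plain Python tuples rather than a real RDF library, and every identifier (the `ex:` terms, `match`, `ancestors`, `inferred_associations`) is hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Invented statements; the "ex:" prefix marks made-up example terms.
TRIPLES: Set[Triple] = {
    ("ex:variant_1", "ex:associated_with", "ex:trait_x"),
    ("ex:trait_x", "ex:subclass_of", "ex:disease_group_y"),
    ("ex:disease_group_y", "ex:subclass_of", "ex:disease"),
}


def match(triples: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def ancestors(triples: Set[Triple], term: str) -> Set[str]:
    """Follow ex:subclass_of links transitively from a term."""
    found: Set[str] = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, s=current, p="ex:subclass_of"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def inferred_associations(triples: Set[Triple], variant: str) -> Set[str]:
    """Traits stated for a variant, plus every broader class of those traits."""
    direct = {o for _, _, o in match(triples, s=variant, p="ex:associated_with")}
    inferred = set(direct)
    for trait in direct:
        inferred |= ancestors(triples, trait)
    return inferred


result = inferred_associations(TRIPLES, "ex:variant_1")
```

Only the first association is stated explicitly; the other two follow from the class links, which is the kind of inference that embedded semantics is meant to support.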

Genetic association database

A catalogue of reported genetic associations between genotype and phenotype.

About this article

Cite this article

Thorisson, G., Muilu, J. & Brookes, A. Genotype–phenotype databases: challenges and solutions for the post-genomic era. Nat Rev Genet 10, 9–18 (2009). https://doi.org/10.1038/nrg2483

  • DOI: https://doi.org/10.1038/nrg2483
