  • Review Article

Human genotype–phenotype databases: aims, challenges and opportunities

Key Points

  • Genotype–phenotype databases contain data on genetic variants and associated phenotypes. In medical contexts, such databases are focused on disease-causing mutations and resulting diseases or phenotypic abnormalities.

  • A major goal of genotype–phenotype databases is to provide assistance in assigning pathogenicity to genetic variants.

  • As the focus shifts from the investigation of single genes by Sanger sequencing towards the determination of variants in tens, hundreds or thousands of genes or even the entire genome by next-generation sequencing, databases are becoming ever more essential for the interpretation of variants in diagnostic and research contexts.

  • Numerous online databases of human variability exist, which differ with respect to the type of data stored, the amount of phenotypic information provided, the degree of accessibility of the data, and the number of diseases or genes covered.

  • Increasingly, the focus of genotype–phenotype databases has shifted to support data discovery as a critical underpinning for data provision.

  • Currently, the phenotype data held in genotype–phenotype databases are lower in both volume and quality than the genotype data, possibly owing to the practical, financial, ethical, legal and organizational challenges that must be overcome to produce good phenotypic data on large numbers of individuals.


Genotype–phenotype databases provide information about genetic variation, its consequences and its mechanisms of action for research and health care purposes. Existing databases vary greatly in type, areas of focus and modes of operation. Despite ever larger and more intricate datasets — made possible by advances in DNA sequencing, omics methods and phenotyping technologies — steady progress is being made towards integrating these databases rather than using them as separate entities. The consequential shift in focus from single-gene variants towards large gene panels, exomes, whole genomes and myriad observable characteristics creates new challenges and opportunities in database design, interpretation of variant pathogenicity and modes of data representation and use.


Figure 1: Emerging landscape of genotype–phenotype databases.
Figure 2: Modes of data provision.
Figure 3: Data sharing and data discovery.
Figure 4: Data handling in genotype–phenotype databases.





Acknowledgements

Preparation of this article was facilitated by funding from the European Union Seventh Framework Programme (FP7/2007-2013; 'BioShaRE' grant no. 261433, 'SYBIL' grant no. 602300, 'EMIF' IMI-JU grant no. 115372), the National Institutes of Health Office of the Director (grant no. 5R24OD011883), and the Bundesministerium für Bildung und Forschung (BMBF; project no. 0313911). The authors also acknowledge the many key insights provided by attendees of an IRDiRC workshop dedicated to this topic, and expert suggestions made by colleague R. Dalgleish (University of Leicester, UK).

Author information



Corresponding author

Correspondence to Anthony J. Brookes.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links


Amyotrophic Lateral Sclerosis Online Genetics Database

Cancer Genomics Hub

Catalogue of Somatic Mutations in Cancer

CFTR2 database


Database of Genotypes and Phenotypes



ETHNOS databases

European Genome-Phenome Archive

European Variation Archive

Fanconi Anemia Mutation Database



GWAS Catalog

GWAS Central


Human Gene Mutation Database

Human Genome Variation Database


Leiden Open Variation Database


Online Mendelian Inheritance in Man


Osteogenesis Imperfecta Variant Database

PharmacoGenomics Database



PheWAS Catalog

Universal Mutation Database


1000 genomes

Beacon project

Café Variome Central



Collaborative Cancer Cloud


Exome Aggregation Consortium

Exome Variant Server of the NHLBI Exome Sequencing Project

GA4GH BRCA Challenge



GEnomes Management Application

Global Alliance for Genomics and Health

Human Genome Variation Society Nomenclature

Human Phenotype Ontology

Human Variome Project

International Cancer Genome Consortium

International Rare Diseases Research Consortium

Kaiser Permanente Research Program on Genes, Environment and Health


Locus Reference Genomic


MatchMaker Exchange

Monarch Initiative


National Institutes of Health Data Sharing Policy


PEER platform

Personal Genome Project



The Cancer Genome Atlas







Glossary

Genotype

In biology, genotype refers to the genetic makeup of an organism with reference to either a single nucleotide, a larger genetic locus or the entire genome. In the current context, genotype refers to a genetic sequence variant being assessed for potential causality of a disease, as well as its status as heterozygous, homozygous or hemizygous.


Phenotype

In biology, phenotype refers to the observable characteristics of an organism, but in medicine, the word is usually used to describe clinically relevant abnormalities, including signs, symptoms and abnormal findings of laboratory analyses, imaging studies and physiological examinations, as well as behavioural anomalies.


Genetic variants

Genetic variants are any deviations from a normal or reference sequence, for example a substitution of one nucleotide for another at a certain chromosomal position, an insertion or deletion of one or more nucleotides, a chromosomal microdeletion encompassing several million nucleotides or a trisomy of an entire chromosome.


Pathogenicity

The tendency of a genetic variant in a person's genome to produce disease. The term is most often used in the context of cancer or inherited disease, when a genetic variant has a substantial deleterious effect on the function of the gene product that leads to, or substantially contributes to, the development of disease.

Effect sizes

The percentages of genetic variance explained by a specific locus, ranging from less than 1% for many common traits up to 100% for some Mendelian diseases.

Multiple testing

The process of using bioinformatics analysis to assess potential pathogenicity of a variant is often formulated as a statistical hypothesis test. As tens of thousands of such tests may be performed in the analysis of diagnostic next-generation sequencing data, adjustments of the P values resulting from assessments of individual variations are required to avoid numerous false positive results, a procedure known as multiple testing correction.
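As a minimal sketch of what such a correction can look like, the Python below implements two widely used adjustments: Bonferroni (which controls the family-wise error rate) and Benjamini-Hochberg (which controls the false discovery rate). The source does not say which procedure is intended, so both are shown for illustration only; the function names and the example P values are assumptions.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative P values only; these are not taken from any real analysis.
example = [0.01, 0.04, 0.03, 0.005]
bonferroni_adjusted = bonferroni(example)
bh_adjusted = benjamini_hochberg(example)
```

To give a sense of scale, if 20,000 variants were tested at a family-wise significance level of 0.05 (both numbers chosen here purely as an example), the Bonferroni per-test threshold would be 0.05 / 20,000 = 2.5 × 10^-6.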

Whole-exome sequencing

(WES). A sequencing technique that seeks to selectively enrich and assay only the sequences belonging to the ~1.5% of the human genome consisting of the exons of protein-coding genes (called the exome), because the majority of causative variations identified in Mendelian diseases to date have been located in or very close to these exons.

Big data

This term is used to describe collections of data that are characterized by features such as being large in size, complex and heterogeneous in type, rapidly produced or frequently changing, and of uncertain veracity, such that analysis requires high-performance computing resources and sophisticated algorithms. In biomedicine, high-throughput omics data such as whole-genome sequencing, together with the ever-increasing amounts of clinical data available in electronic health care records, are often regarded as big data.


In the present context, a formal set of specifications about the format and contents of data records of variants or diseases that are to be exchanged between databases.
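
A minimal sketch of what such a specification might look like in practice is given below. The record type, its field names and the validation rules are hypothetical and do not correspond to any published exchange standard; the variant string is a placeholder. Only the three zygosity values and the Human Phenotype Ontology identifier are taken from elsewhere in this glossary.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Zygosity states named in the glossary definition of genotype.
ALLOWED_ZYGOSITY = {"heterozygous", "homozygous", "hemizygous"}


@dataclass
class VariantRecord:
    """Hypothetical exchange record; field names are illustrative only."""
    variant: str                # free-text variant description (placeholder format)
    zygosity: str               # one of ALLOWED_ZYGOSITY
    phenotype_terms: List[str]  # ontology identifiers such as HPO term IDs

    def __post_init__(self) -> None:
        # Reject records that violate the agreed format before they are exchanged.
        if not self.variant:
            raise ValueError("variant must not be empty")
        if self.zygosity not in ALLOWED_ZYGOSITY:
            raise ValueError(f"unknown zygosity: {self.zygosity!r}")


def to_json(record: VariantRecord) -> str:
    """Serialize a record for transfer to another database."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> VariantRecord:
    """Parse and validate a record received from another database."""
    return VariantRecord(**json.loads(text))


if __name__ == "__main__":
    rec = VariantRecord("PLACEHOLDER_VARIANT", "heterozygous", ["HP:0000926"])
    print(to_json(rec))
```

Validating on both the sending and the receiving side means that two databases agreeing on the format can reject malformed records early rather than silently storing them.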


Metadata

Metadata, literally 'data about data', refers to information that accompanies other data and explains their context or provenance.


Array-CGH

Array-comparative genomic hybridization (CGH) enables the detection of gains or losses of genetic material ranging in size from as little as 40 kilobases up to entire chromosomes. Array-CGH has become a standard diagnostic tool for the identification of copy number variants.

Web services

Databases, data processing or analytical functions that can be accessed by another computer program over the worldwide web.


Penetrance

The proportion of persons who carry a pathogenic germline variation and also show signs of a disease irrespective of the clinical severity.
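
The definition above amounts to a simple ratio, sketched here under the assumption that carriers have already been counted and classified as affected or unaffected. The function name and the example counts are illustrative and not taken from the article.

```python
def penetrance(carriers_affected: int, carriers_total: int) -> float:
    """Fraction of variant carriers who show any sign of disease, regardless of severity."""
    if carriers_total <= 0:
        raise ValueError("carriers_total must be positive")
    if not 0 <= carriers_affected <= carriers_total:
        raise ValueError("carriers_affected must lie between 0 and carriers_total")
    return carriers_affected / carriers_total


if __name__ == "__main__":
    # Illustrative counts only: 30 of 40 carriers show signs of disease.
    print(penetrance(30, 40))
```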


Expressivity

The degree of clinical expression and severity of a disease in individuals who have inherited a given germline variation.

Stratified medicine

An approach to patient care that subdivides patients into groups that are defined on the basis of expected risk of developing disease or the expected response to a certain treatment.

Personalized medicine

This concept is synonymous with individualized medicine, and is used in varying ways to convey the idea of health or medical care being in some way tailored and optimized for a person. This typically means going beyond shaping care for groups of similar patients to the ultimate of uniquely customizing interventions for each separate individual.

Probabilistic modelling

A class of computational algorithms that describe data observed from a system in a way that takes uncertainty and noise associated with the model into account. It is one method for making predictions about disease onset or severity on the basis of genetic and other data.


Federation

A software strategy that allows data from disparate databases and other sources to be aggregated ad hoc as a virtual database that can be used for analysis. In the present context, federation involves connecting genotype–phenotype databases across networks to allow combined searches for information about variations or diseases.
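
The idea of a combined search across independent sources can be sketched as follows. This is a toy model under stated assumptions: each "database" is an in-memory list wrapped in a search function, and the names `make_source` and `federated_search` and the record fields are hypothetical. A real federation would issue network queries against each database's own interface.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def make_source(name: str, records: List[Record]) -> Source:
    """Wrap an in-memory record list as a searchable source (stand-in for a remote database)."""
    def search(variant: str) -> List[Record]:
        # Tag each hit with its origin so provenance survives aggregation.
        return [dict(r, source=name) for r in records if r["variant"] == variant]
    return search


def federated_search(variant: str, sources: List[Source]) -> List[Record]:
    """Query every source for the variant and merge the hits into one virtual result set."""
    results: List[Record] = []
    for search in sources:
        try:
            results.extend(search(variant))
        except Exception:
            # One unavailable source should not block results from the others.
            continue
    return results


if __name__ == "__main__":
    db_a = make_source("db_a", [{"variant": "VAR1", "phenotype": "P1"}])
    db_b = make_source("db_b", [{"variant": "VAR1", "phenotype": "P2"}])
    for hit in federated_search("VAR1", [db_a, db_b]):
        print(hit)
```

The data stay in their home databases; only the query and the matching records travel, which is the property that distinguishes federation from copying everything into a central repository.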


Cloud

Remote servers that are accessed via the Internet and provide data storage and analysis resources.

Informed consent

An agreement on the part of a patient to take part in a clinical study and allow the results of the study to be used in some way, such as for additional research or health care activities or for sharing with others in a publication or database. Consent can only reasonably be given after the subject is informed and given the opportunity to discuss the purpose of the research and any potential harms and benefits.


Biobanks

Collections of biological (often medically relevant) specimens such as blood, saliva or tissue, associated with data annotations that describe the subjects from whom the specimens were obtained, such as age, gender, environmental exposures, phenotypic features, molecular test results or clinical diagnosis. Biobanks are used by researchers to obtain sets of data and specimens from subjects with the same diagnosis or with similar characteristics to undertake research investigations.


Registry

A registry comprises a collection of information about individuals affected by a specific disease or who share other similarities. Many registries collect information about individuals over time or are used to track information regarding the response of patients to treatments. A registry may, but does not necessarily, include genetic information.

International Rare Diseases Research Consortium

(IRDiRC). This consortium comprises rare disease researchers and funding organizations and promotes the goal of developing 200 new therapies for rare diseases and a means to diagnose most rare diseases by the year 2020.

Global Alliance for Genomics and Health

(GA4GH). This alliance comprises more than 200 institutions working in health care, research, disease advocacy, life science and information technology with the goal of creating a common framework of harmonized approaches to enable the responsible, voluntary, and secure sharing of genomic and clinical data.

Human Variome Project

An umbrella organization that intends to help coordinate efforts to integrate the collection, curation, interpretation and sharing of information on variation in the human genome into routine clinical practice and research.


Stakeholder

In the present context, a person or organization with an interest or role in medical databases, including patients and families, physicians, researchers, public and private research institutions, and funding agencies.

Phenotype term cross-mapping

A computational link between equivalent or related terms in two or more different phenotype ontologies. For instance, the Medical Dictionary for Regulatory Activities (MedDRA) term Platyspondylia (10068629) is mapped to the Human Phenotype Ontology term Platyspondyly (HP:0000926).
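
A cross-mapping of this kind can be represented as a lookup table, as in the sketch below. The single entry is the MedDRA-to-HPO example given in the definition above; the table layout and the function name `map_meddra_to_hpo` are assumptions made for illustration, and real mapping resources are far larger and may link one term to several related terms.

```python
from typing import Dict, Optional, Tuple

# MedDRA code -> (MedDRA label, HPO identifier, HPO label).
# The one entry below is the example from the glossary definition.
MEDDRA_TO_HPO: Dict[str, Tuple[str, str, str]] = {
    "10068629": ("Platyspondylia", "HP:0000926", "Platyspondyly"),
}


def map_meddra_to_hpo(meddra_code: str) -> Optional[str]:
    """Return the mapped HPO identifier, or None if no mapping is recorded."""
    entry = MEDDRA_TO_HPO.get(meddra_code)
    return entry[1] if entry else None


if __name__ == "__main__":
    print(map_meddra_to_hpo("10068629"))
```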


ORCID

ORCID provides a persistent digital identifier for each researcher that can be used to streamline workflows such as manuscript and grant submission and to unambiguously identify researchers in databases.


APIs

(Application programming interfaces). A specification of a software component in terms of functionalities, formats and data types. In the current context, an API is a framework that allows exchange and processing of data and contents between different websites and databases.


Ontologies

Ontologies are computational resources that combine catalogues of the relevant entities of a domain (a conceptualization) with a description of the interrelationships among those entities (a specification).

Cite this article

Brookes, A., Robinson, P. Human genotype–phenotype databases: aims, challenges and opportunities. Nat Rev Genet 16, 702–715 (2015).
