Human genotype–phenotype databases: aims, challenges and opportunities

Brookes, Anthony J.; Robinson, Peter N.

doi:10.1038/nrg3932

Review Article
Published: 10 November 2015

Anthony J. Brookes^1,2 & Peter N. Robinson^3,4,5

Nature Reviews Genetics volume 16, pages 702–715 (2015)

Key Points

Genotype–phenotype databases contain data on genetic variants and associated phenotypes. In medical contexts, such databases are focused on disease-causing mutations and resulting diseases or phenotypic abnormalities.
A major goal of genotype–phenotype databases is to provide assistance in assigning pathogenicity to genetic variants.
As the focus shifts from the investigation of single genes by Sanger sequencing towards the determination of variants in tens, hundreds or thousands of genes or even the entire genome by next-generation sequencing, databases are becoming ever more essential for the interpretation of variants in diagnostic and research contexts.
Numerous online databases of human variability exist, which differ with respect to the type of data stored, the amount of phenotypic information provided, the degree of accessibility of the data, and the number of diseases or genes covered.
Increasingly, the focus of genotype–phenotype databases has shifted to support data discovery as a critical underpinning for data provision.
Currently, the volume and quality of phenotype data held in genotype–phenotype databases are lower than those of genotype data, possibly owing to the practical, financial, ethical, legal and organizational challenges that must be overcome to produce good phenotypic data on large numbers of individuals.

Abstract

Genotype–phenotype databases provide information about genetic variation, its consequences and its mechanisms of action for research and health care purposes. Existing databases vary greatly in type, areas of focus and modes of operation. Despite ever larger and more intricate datasets — made possible by advances in DNA sequencing, omics methods and phenotyping technologies — steady progress is being made towards integrating these databases rather than using them as separate entities. The consequential shift in focus from single-gene variants towards large gene panels, exomes, whole genomes and myriad observable characteristics creates new challenges and opportunities in database design, interpretation of variant pathogenicity and modes of data representation and use.

**Figure 1: Emerging landscape of genotype–phenotype databases.**

**Figure 3: Data sharing and data discovery.**

**Figure 4: Data handling in genotype–phenotype databases.**

Acknowledgements

Preparation of this article was facilitated by funding from the European Union Seventh Framework Programme (FP7/2007-2013; 'BioShaRE' grant no. 261433, 'SYBIL' grant no. 602300, 'EMIF' IMI-JU grant no. 115372), the National Institutes of Health Office of the Director (grant no. 5R24OD011883), and the Bundesministerium für Bildung und Forschung (BMBF; project no. 0313911). The authors also acknowledge the many key insights provided by attendees of an IRDiRC workshop dedicated to this topic, and expert suggestions made by colleague R. Dalgleish (University of Leicester, UK).

Author information

Authors and Affiliations

Department of Genetics, University of Leicester, Leicester, LE1 7RH, UK
Anthony J. Brookes
Data to Knowledge for Practice Facility, Cardiovascular Research Centre, Glenfield Hospital, Leicester, LE1 9HN, UK
Anthony J. Brookes
Institute for Medical Genetics and Human Genetics, and the Berlin Brandenburg Center for Regenerative Therapies, Charité Universitätsmedizin Berlin, Berlin, 13353, Germany
Peter N. Robinson
Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
Peter N. Robinson
Department of Mathematics and Computer Science, Institute for Bioinformatics, Free University of Berlin, Berlin, 14195, Germany
Peter N. Robinson

Corresponding author

Correspondence to Anthony J. Brookes.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Genotype: In biology, genotype refers to the genetic makeup of an organism with reference to either a single nucleotide, a larger genetic locus or the entire genome. In the current context, genotype refers to a genetic sequence variant being assessed for potential causality of a disease, as well as its status as heterozygous, homozygous or hemizygous.
Phenotype: In biology, phenotype refers to the observable characteristics of an organism, but in medicine, the word is usually used to describe clinically relevant abnormalities, including signs, symptoms and abnormal findings of laboratory analyses, imaging studies and physiological examinations, as well as behavioural anomalies.
Variants: Genetic variants are any deviations from a normal or reference sequence. Examples include a substitution of one nucleotide for another at a certain chromosomal position, an insertion or deletion of one or more nucleotides, a chromosomal microdeletion encompassing several million nucleotides and a trisomy of an entire chromosome.
Pathogenicity: The tendency of a genetic variant in a person's genome to produce disease. The term is most often used in the context of cancer or inherited disease, when a genetic variant has a substantial deleterious effect on the function of the gene product that leads to, or substantially contributes to, the development of disease.
Effect sizes: The percentages of genetic variance explained by a specific locus, ranging from less than 1% for many common traits up to 100% for some Mendelian diseases.
Multiple testing: The process of using bioinformatics analysis to assess potential pathogenicity of a variant is often formulated as a statistical hypothesis test. As tens of thousands of such tests may be performed in the analysis of diagnostic next-generation sequencing data, adjustments of the P values resulting from assessments of individual variations are required to avoid numerous false positive results, a procedure known as multiple testing correction.
Whole-exome sequencing (WES): A sequencing technique that seeks to selectively enrich and assay only the sequences belonging to the ~1.5% of the human genome consisting of the exons of protein-coding genes (called the exome) because the majority of causative variations identified in Mendelian diseases to date have been located in or very close to these exons.
Big data: This term is used to describe collections of data that are characterized by features such as being large in size, complex and heterogeneous in type, rapidly produced or frequently changing, and of uncertain veracity, such that analysis requires high-performance computing resources and sophisticated algorithms. In biomedicine, high-throughput omics data such as whole-genome sequencing data, together with the ever-increasing amounts of clinical data available in electronic health care records, are often regarded as big data.
Standards: In the present context, a formal set of specifications about the format and contents of data records of variants or diseases that are to be exchanged between databases.
Metadata: Metadata, literally 'data about data', refers to information that accompanies other data and explains their context or provenance.
Array-CGH: Array-comparative genomic hybridization (CGH) enables the gain or loss of genetic material to be detected in the range of as little as 40 kilobases up to entire chromosomes. Array-CGH has become a standard diagnostic tool for the identification of copy number variants.
Web services: Databases, data processing or analytical functions that can be accessed by another computer program over the World Wide Web.
Penetrance: The proportion of persons carrying a pathogenic germline variation who also show signs of the disease, irrespective of clinical severity.
Expressivity: The degree of clinical expression and severity of a disease in individuals who have inherited a given germline variation.
Stratified medicine: An approach to patient care that subdivides patients into groups that are defined on the basis of expected risk of developing disease or the expected response to a certain treatment.
Personalized medicine: This concept is synonymous with individualized medicine, and is used in varying ways to convey the idea of health or medical care being in some way tailored and optimized for a person. This typically means going beyond shaping care for groups of similar patients, with the ultimate aim of uniquely customizing interventions for each individual.
Probabilistic modelling: A class of computational algorithms that describe data observed from a system in a way that takes uncertainty and noise associated with the model into account. It is one method for making predictions about disease onset or severity on the basis of genetic and other data.
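One very small instance of this idea is Bayes' rule, which updates a prior probability of disease using an observation. The sketch below is an assumption-laden illustration, not a method described in the article: the function name and all numbers are hypothetical.

```python
def posterior_probability(prior, p_obs_given_disease, p_obs_given_healthy):
    """Bayes' rule: probability of disease after seeing one observation."""
    numerator = prior * p_obs_given_disease
    evidence = numerator + (1.0 - prior) * p_obs_given_healthy
    return numerator / evidence


# Hypothetical inputs: 1% prior risk; the observation (for example, carrying
# a given variant) is seen in 90% of affected and 5% of unaffected people.
print(posterior_probability(0.01, 0.90, 0.05))
```

Real models combine many genetic and non-genetic variables and estimate their parameters from data, but the same principle of weighing evidence against prior expectation applies.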
Federation: A software strategy that allows data from disparate databases and other sources to be aggregated ad hoc as a virtual database that can be used for analysis. In the present context, federation involves connecting genotype–phenotype databases across networks to allow combined searches for information about variations or diseases.
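The combined-search idea can be sketched in Python as follows. This is a toy, in-memory illustration under stated assumptions: the `VariantSource` class, the `federated_search` function, the database names and the variant identifiers are all hypothetical, and a real federation would query remote services over a network rather than local lists.

```python
from concurrent.futures import ThreadPoolExecutor


class VariantSource:
    """Hypothetical stand-in for one genotype-phenotype database."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def search(self, variant):
        # Tag each hit with its origin so provenance survives aggregation.
        return [dict(r, source=self.name)
                for r in self._records if r["variant"] == variant]


def federated_search(sources, variant):
    """Query every source concurrently and merge the hits into one list."""
    with ThreadPoolExecutor() as pool:
        per_source = pool.map(lambda s: s.search(variant), sources)
        return [hit for hits in per_source for hit in hits]


# Made-up variant identifiers and phenotype labels for illustration only.
sources = [
    VariantSource("db_a", [{"variant": "GENE1:c.100A>G", "phenotype": "short stature"}]),
    VariantSource("db_b", [{"variant": "GENE1:c.100A>G", "phenotype": "scoliosis"},
                           {"variant": "GENE2:c.5C>T", "phenotype": "deafness"}]),
]
for hit in federated_search(sources, "GENE1:c.100A>G"):
    print(hit["source"], hit["phenotype"])
```

The data stay in their home databases; only the query and the matching records travel, which is what distinguishes federation from copying everything into one central store.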
Cloud: Remote servers that are accessed via the Internet and provide data storage and analysis resources.
Informed consent: An agreement on the part of a patient to take part in a clinical study and allow the results of the study to be used in some way, such as for additional research or health care activities or for sharing with others in a publication or database. Consent can only reasonably be given after the subject is informed and given the opportunity to discuss the purpose of the research and any potential harms and benefits.
Biobanks: Collections of biological (often medically relevant) specimens such as blood, saliva or tissue, associated with data annotations that describe the subjects from whom the specimens were obtained, such as age, gender, environmental exposures, phenotypic features, molecular test results or clinical diagnosis. Biobanks are used by researchers to obtain sets of data and specimens from subjects with the same diagnosis or with similar characteristics to undertake research investigations.
Registry: A registry comprises a collection of information about individuals affected by a specific disease or who share other similarities. Many registries collect information about individuals over time or are used to track information regarding the response of patients to treatments. A registry may, but does not necessarily, include genetic information.
International Rare Diseases Research Consortium: (IRDiRC). This consortium comprises rare disease researchers and funding organizations and promotes the goal of developing 200 new therapies for rare diseases and a means to diagnose most rare diseases by the year 2020.
Global Alliance for Genomics and Health: (GA4GH). This alliance comprises more than 200 institutions working in health care, research, disease advocacy, life science and information technology with the goal of creating a common framework of harmonized approaches to enable the responsible, voluntary, and secure sharing of genomic and clinical data.
Human Variome Project: An umbrella organization that intends to help coordinate efforts to integrate the collection, curation, interpretation and sharing of information on variation in the human genome into routine clinical practice and research.
Stakeholder: In the present context, a person or organization with an interest or role in medical databases, including patients and families, physicians, researchers, public and private research institutions, and funding agencies.
Phenotype term cross-mapping: A computational link between equivalent or related terms in two or more different phenotype ontologies. For instance, the Medical Dictionary for Regulatory Activities (MedDRA) term Platyspondylia (10068629) is mapped to the Human Phenotype Ontology term Platyspondyly (HP:0000926).
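In its simplest form, such a cross-mapping is a lookup table keyed by ontology and term identifier. The Python sketch below encodes only the single mapping given in this entry; the table layout and function name are assumptions made for illustration.

```python
# The one mapping stated in the glossary entry; a real table holds many rows.
CROSS_MAP = {
    ("MedDRA", "10068629"): ("HPO", "HP:0000926"),
}

# Human-readable labels for the two terms, as given in the entry.
LABELS = {
    ("MedDRA", "10068629"): "Platyspondylia",
    ("HPO", "HP:0000926"): "Platyspondyly",
}

# Derived reverse table so that lookups work in both directions.
REVERSE_MAP = {target: source for source, target in CROSS_MAP.items()}


def cross_map(ontology, term_id):
    """Return the mapped (ontology, term_id) pair, or None if unmapped."""
    key = (ontology, term_id)
    return CROSS_MAP.get(key) or REVERSE_MAP.get(key)


target = cross_map("MedDRA", "10068629")
print(target, LABELS[target])
```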
ORCID: ORCID provides a persistent digital identifier (for example, orcid.org/0000-0002-0736-9199) for each researcher that can be used to streamline workflows such as manuscript and grant submission and to unambiguously identify researchers in databases.
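Part of what makes such identifiers dependable in databases is that the final character is a check character, computed with the ISO 7064 MOD 11-2 scheme. The sketch below shows how a database might reject mistyped identifiers before storing them; the function names are hypothetical, and the identifier tested is the example from this entry.

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, the digits and the check character of an ORCID iD."""
    # Accept forms such as 'orcid.org/0000-...' by keeping the last path part.
    compact = orcid.rsplit("/", 1)[-1].replace("-", "")
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


print(is_valid_orcid("orcid.org/0000-0002-0736-9199"))
```

A passing checksum only shows that the identifier is well formed; it does not confirm that the identifier is registered or belongs to a particular researcher.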
APIs: (Application programming interfaces). A specification of a software component in terms of functionalities, formats and data types. In the current context, an API is a framework that allows exchange and processing of data and contents between different websites and databases.
Ontologies: Ontologies are computational resources that combine catalogues of the relevant entities of a domain (a conceptualization) with a description of the interrelationships among those entities (a specification).
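The two ingredients named here, a catalogue of entities and the relationships among them, can be sketched as a small graph in Python. The terms and "is a" links below are a toy hierarchy chosen for illustration and are not claimed to reproduce the structure of any published ontology.

```python
# Toy catalogue: each term maps to its direct parents via 'is a' links.
IS_A = {
    "platyspondyly": ["abnormal vertebral morphology"],
    "abnormal vertebral morphology": ["abnormality of the skeletal system"],
    "abnormality of the skeletal system": ["phenotypic abnormality"],
    "phenotypic abnormality": [],
}


def ancestors(term, is_a=IS_A):
    """Collect every term reachable by following 'is a' links upwards."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("platyspondyly")))
```

Because relationships are explicit, a search for a general term can also retrieve records annotated with any of its more specific descendants, which is one reason ontologies are useful for phenotype data.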

About this article

Cite this article

Brookes, A., Robinson, P. Human genotype–phenotype databases: aims, challenges and opportunities. Nat Rev Genet 16, 702–715 (2015). https://doi.org/10.1038/nrg3932

Published: 10 November 2015
Issue Date: December 2015
DOI: https://doi.org/10.1038/nrg3932
