Key Points
- Genotype–phenotype databases contain data on genetic variants and associated phenotypes. In medical contexts, such databases focus on disease-causing mutations and the resulting diseases or phenotypic abnormalities.
- A major goal of genotype–phenotype databases is to assist in assigning pathogenicity to genetic variants.
- As the focus shifts from the investigation of single genes by Sanger sequencing towards the determination of variants in tens, hundreds or thousands of genes, or even the entire genome, by next-generation sequencing, databases are becoming ever more essential for the interpretation of variants in diagnostic and research contexts.
- Numerous online databases of human variability exist, which differ in the type of data stored, the amount of phenotypic information provided, the degree of accessibility of the data, and the number of diseases or genes covered.
- The focus of genotype–phenotype databases is increasingly shifting towards supporting data discovery as a critical underpinning for data provision.
- Currently, the volume and quality of phenotype data held in genotype–phenotype databases are lower than those of genotype data, possibly owing to the practical, financial, ethical, legal and organizational challenges that must be overcome to produce good phenotypic data on large numbers of individuals.
Abstract
Genotype–phenotype databases provide information about genetic variation, its consequences and its mechanisms of action for research and health care purposes. Existing databases vary greatly in type, areas of focus and modes of operation. Despite ever larger and more intricate datasets — made possible by advances in DNA sequencing, omics methods and phenotyping technologies — steady progress is being made towards integrating these databases rather than using them as separate entities. The consequential shift in focus from single-gene variants towards large gene panels, exomes, whole genomes and myriad observable characteristics creates new challenges and opportunities in database design, interpretation of variant pathogenicity and modes of data representation and use.
Acknowledgements
Preparation of this article was facilitated by funding from the European Union Seventh Framework Programme (FP7/2007-2013; 'BioShaRE' grant no. 261433, 'SYBIL' grant no. 602300, 'EMIF' IMI-JU grant no. 115372), the National Institutes of Health Office of the Director (grant no. 5R24OD011883), and the Bundesministerium für Bildung und Forschung (BMBF; project no. 0313911). The authors also acknowledge the many key insights provided by attendees of an IRDiRC workshop dedicated to this topic, and expert suggestions made by colleague R. Dalgleish (University of Leicester, UK).
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Glossary
- Genotype
  In biology, genotype refers to the genetic makeup of an organism with reference to either a single nucleotide, a larger genetic locus or the entire genome. In the current context, genotype refers to a genetic sequence variant being assessed for potential causality of a disease, as well as its status as heterozygous, homozygous or hemizygous.
- Phenotype
  In biology, phenotype refers to the observable characteristics of an organism, but in medicine, the word is usually used to describe clinically relevant abnormalities, including signs, symptoms and abnormal findings of laboratory analyses, imaging studies and physiological examinations, as well as behavioural anomalies.
- Variants
  Genetic variants are any deviations from a normal or reference sequence. Examples include a substitution of one nucleotide for another at a certain chromosomal position, an insertion or deletion of one or more nucleotides, a chromosomal microdeletion encompassing several million nucleotides, and a trisomy of an entire chromosome.
- Pathogenicity
  The tendency of a genetic variant in a person's genome to produce disease. The term is most often used in the context of cancer or inherited disease, when a genetic variant has a substantial deleterious effect on the function of the gene product that leads to, or substantially contributes to, the development of disease.
- Effect sizes
  The percentages of genetic variance explained by a specific locus, ranging from less than 1% for many common traits up to 100% for some Mendelian diseases.
- Multiple testing
  The process of using bioinformatics analysis to assess potential pathogenicity of a variant is often formulated as a statistical hypothesis test. As tens of thousands of such tests may be performed in the analysis of diagnostic next-generation sequencing data, adjustments of the P values resulting from assessments of individual variations are required to avoid numerous false positive results, a procedure known as multiple testing correction.
- Whole-exome sequencing (WES)
  A sequencing technique that seeks to selectively enrich and assay only the sequences belonging to the ~1.5% of the human genome consisting of the exons of protein-coding genes (called the exome), because the majority of causative variations identified in Mendelian diseases to date have been located in or very close to these exons.
- Big data
  This term is used to describe collections of data that are characterized by features such as being large in size, complex and heterogeneous in type, rapidly produced or frequently changing, and of uncertain veracity, such that analysis requires high-performance computing resources and sophisticated algorithms. In biomedicine, high-throughput omics data such as whole-genome sequencing, together with the ever-increasing amounts of clinical data available in electronic health care records, are often regarded as big data.
- Standards
  In the present context, a formal set of specifications about the format and contents of data records of variants or diseases that are to be exchanged between databases.
- Metadata
  Metadata, literally 'data about data', refers to information that accompanies other data and explains their context or provenance.
- Array-CGH
  Array-comparative genomic hybridization (CGH) enables the detection of gains or losses of genetic material ranging from as little as 40 kilobases up to entire chromosomes. Array-CGH has become a standard diagnostic tool for the identification of copy number variants.
- Web services
  Databases, data processing or analytical functions that can be accessed by another computer program over the World Wide Web.
- Penetrance
  The proportion of persons carrying a pathogenic germline variation who show signs of the disease, irrespective of clinical severity.
- Expressivity
  The degree of clinical expression and severity of a disease in individuals who have inherited a given germline variation.
- Stratified medicine
  An approach to patient care that subdivides patients into groups that are defined on the basis of expected risk of developing disease or the expected response to a certain treatment.
- Personalized medicine
  This concept is synonymous with individualized medicine, and is used in varying ways to convey the idea of health or medical care being in some way tailored and optimized for a person. This typically means going beyond shaping care for groups of similar patients to the ultimate goal of uniquely customizing interventions for each separate individual.
- Probabilistic modelling
  A class of computational algorithms that describe data observed from a system in a way that takes uncertainty and noise associated with the model into account. It is one method for making predictions about disease onset or severity on the basis of genetic and other data.
- Federation
  A software strategy that allows data from disparate databases and other sources to be aggregated ad hoc as a virtual database that can be used for analysis. In the present context, federation involves connecting genotype–phenotype databases across networks to allow combined searches for information about variations or diseases.
- Cloud
  Remote servers that are accessed via the Internet and provide data storage and analysis resources.
- Informed consent
  An agreement on the part of a patient to take part in a clinical study and allow the results of the study to be used in some way, such as for additional research or health care activities or for sharing with others in a publication or database. Consent can only reasonably be given after the subject is informed and given the opportunity to discuss the purpose of the research and any potential harms and benefits.
- Biobanks
  Collections of biological (often medically relevant) specimens such as blood, saliva or tissue, associated with data annotations that describe the subjects from whom the specimens were obtained, such as age, gender, environmental exposures, phenotypic features, molecular test results or clinical diagnosis. Biobanks are used by researchers to obtain sets of data and specimens from subjects with the same diagnosis or with similar characteristics to undertake research investigations.
- Registry
  A registry comprises a collection of information about individuals affected by a specific disease or who share other similarities. Many registries collect information about individuals over time or are used to track information regarding the response of patients to treatments. A registry may, but does not necessarily, include genetic information.
- International Rare Diseases Research Consortium (IRDiRC)
  This consortium comprises rare disease researchers and funding organizations and promotes the goal of developing 200 new therapies for rare diseases and a means to diagnose most rare diseases by the year 2020.
- Global Alliance for Genomics and Health (GA4GH)
  This alliance comprises more than 200 institutions working in health care, research, disease advocacy, life science and information technology with the goal of creating a common framework of harmonized approaches to enable the responsible, voluntary, and secure sharing of genomic and clinical data.
- Human Variome Project
  An umbrella organization that intends to help coordinate efforts to integrate the collection, curation, interpretation and sharing of information on variation in the human genome into routine clinical practice and research.
- Stakeholder
  In the present context, a person or organization with an interest or role in medical databases, including patients and families, physicians, researchers, public and private research institutions, and funding agencies.
- Phenotype term cross-mapping
  A computational link between equivalent or related terms in two or more different phenotype ontologies. For instance, the Medical Dictionary for Regulatory Activities (MedDRA) term Platyspondylia (10068629) is mapped to the Human Phenotype Ontology term Platyspondyly (HP:0000926).
- ORCID
  ORCID provides a persistent digital identifier (for example, orcid.org/0000-0002-0736-9199) for each researcher that can be used to streamline workflows such as manuscript and grant submission and to unambiguously identify researchers in databases.
- APIs (application programming interfaces)
  A specification of a software component in terms of functionalities, formats and data types. In the current context, an API is a framework that allows exchange and processing of data and contents between different websites and databases.
- Ontologies
  Ontologies are computational resources that combine catalogues of the relevant entities of a domain (a conceptualization) with a description of the interrelationships among those entities (a specification).
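The glossary entry on multiple testing can be made concrete with a small sketch. Assuming each variant assessment yields a P value, the Python below applies two widely used corrections, Bonferroni and Benjamini-Hochberg. Neither method is named in the article, and the function names and example P values are hypothetical, chosen only for illustration.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each P value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for five candidate variants (illustration only).
raw = [0.001, 0.008, 0.02, 0.03, 0.6]
print("Bonferroni:        ", bonferroni(raw))
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

With these made-up inputs and a threshold of 0.05, the Bonferroni adjustment keeps two variants and the Benjamini-Hochberg adjustment keeps four. This reflects the usual trade-off between controlling the family-wise error rate and controlling the false discovery rate.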
Cite this article
Brookes, A., Robinson, P. Human genotype–phenotype databases: aims, challenges and opportunities. Nat Rev Genet 16, 702–715 (2015). https://doi.org/10.1038/nrg3932
DOI: https://doi.org/10.1038/nrg3932