INTRODUCTION

Mendelian conditions are individually rare, but collectively contribute to disease in ~0.4% of children and young adults, and 8% of live births if all congenital anomalies are considered.1 These findings likely underestimate the true burden of Mendelian conditions; the estimates focus on the severe end of the phenotypic spectrum and often fail to capture disorders caused by de novo pathogenic variant alleles or characterized by adult onset. Prior to the broader availability of genome-wide assays, discovery of loci underlying Mendelian conditions relied heavily on traditional genetic mapping and positional cloning approaches that had little power to detect disorders characterized by de novo variation, incomplete penetrance, and locus heterogeneity. Chromosome microarray analysis (CMA) and next-generation sequencing (NGS), applied to well-phenotyped individuals,2 have provided substantial technological advances toward clinical genomics and identifying a more complete variant spectrum (single-nucleotide variants [SNVs], indels, and copy-number variants [CNVs]) and molecular basis for human Mendelian conditions.3,4,5,6,7,8,9,10 Despite substantial progress in variant detection genome-wide, the overwhelming majority of annotated genes have yet to be assigned function in the context of human disease traits. Thus, a comprehensive molecular understanding of disease biology and disease gene function remains to be achieved.

Several national and international programs have been developed to both stimulate and support the study of Mendelian conditions. In Canada, Finding of Rare Disease Genes (FORGE)11 and Care4Rare Canada Consortium12 have contributed to this global effort, leading to development of the Canadian Rare Diseases Models and Mechanisms Network (RDMM), which supports collaboration among clinical and human geneticists and model organism researchers in the study of rare variants and their functional impact. In the UK, the Deciphering Developmental Disorders (DDD)13,14,15,16,17 study has for over a decade made significant contributions to the understanding of the molecular etiologies of neurodevelopmental delay and the roles of different variant types, mutational mechanisms, and new pathogenic variants in disease traits. Additional international efforts in rare disease gene discovery include the Undiagnosed Diseases Network International (UDNI)18 and the International Rare Diseases Research Consortium (IRDiRC).19

In the United States, the Centers for Mendelian Genomics (CMGs)20,21 and Undiagnosed Diseases Network (UDN)22 use complementary approaches to investigate the molecular etiology of Mendelian conditions. The CMGs comprise four Centers: a joint Baylor College of Medicine–Johns Hopkins University Center (BHCMG), the Broad Institute/Harvard University (BIHCMG), University of Washington (UWCMG), and Yale University (YCMG) (www.mendelian.org). The Centers are supported by the National Human Genome Research Institute (NHGRI); the National Heart, Lung, and Blood Institute (NHLBI); and the National Eye Institute (NEI); are leveraged by local resources; and are focused on shared goals of novel disease gene discovery using exome and genome sequencing (ES/GS), and rare variant, family based genomics approaches. Knowledge dissemination is facilitated through publication (both in scientific journals and online at www.mendelian.org), resource and data sharing, education of the scientific and medical communities, and collaboration with clinicians, families, and researchers worldwide.

ES coupled with the power inherent in a rare variant, family based analysis enables the identification of rare, de novo, and cosegregating variants with large phenotypic effect, i.e., disease traits tied to a specific locus, yielding results of immediate clinical utility and driving novel disease gene discovery. Gene-first approaches, in which a cohort of individuals with rare variation at a particular locus undergo careful phenotyping, can elucidate the full phenotypic spectrum associated with an allelic series at a disease gene locus.23,24 For example, analysis of variation at the POGZ locus led to the delineation of White–Sutton syndrome (WHSUS; MIM 616364), after phenotype-focused cohort studies identified only a subset (developmental delay/intellectual disability [DD/ID], autism spectrum disorder, schizophrenia) of the cognitive phenotypes associated with rare variation in POGZ.14,25,26,27,28,29 Further examples of this CMG and collaborator-facilitated gene-first approach include studies of ASXL3 (refs.30,31), CDK13 (ref. 32), PNPLA1 (ref. 33), POLE,34 IFT81 (ref. 35), HDAC8 (ref. 36), AHDC1 (ref. 37), and CDC42 (ref. 38).

Disease gene discovery and genetic diagnosis inform clinical care

Rare disease research and gene discovery inform and enhance molecular diagnosis and can impact patient management. Molecular diagnostic assays can identify precise genetic contributors to clinical diagnoses, provide important prognostic information and guidance for clinical management and disease surveillance, and enable more accurate recurrence risk estimates for families. In turn, this individualized “precision” information provides an entry point for illuminating disease biology, enabling development and implementation of rational and targeted therapeutics. For example, the discovery of loss-of-function (LoF) PCSK9 variants causing hypocholesterolemia led to the rapid development of monoclonal antibodies targeting PCSK9, to treat cardiovascular disease and familial hypercholesterolemia.39,40,41

At the initiation of the CMG program, we and others predicted that opportunities provided by NGS technologies, novel computational analytic approaches, and the access to these technologies for clinicians and families from populations around the world would transform the field of Mendelian genomics, and our understanding of both human biology and perturbations to homeostasis resulting in disease.10,20,42,43 However, it was not anticipated that such studies might potentially enable building testable models for the genetics of disease from the bottom up. In the next section, we highlight CMG accomplishments that are driving this transformation.21

CMG-FACILITATED DISCOVERIES

Disease gene discovery and functional annotation of the human genome

A primary goal of the CMGs is to identify novel disease genes responsible for human Mendelian conditions.20,21 The CMGs have reported a total of 3617 disease gene–phenotype pairs (http://mendelian.org/phenotypes-genes), categorized as novel, phenotypic expansion (phenotypic features extending beyond those previously reported for a Mendelian condition),21 or known (Fig. 1, Supplemental Figure 1). The CMGs are well positioned to achieve the overall goal of connecting phenotypes to high penetrance variants in a substantial fraction of all ~20,000 annotated human genes, and the current pace of discovery within the CMGs does not show evidence of slowing (Fig. 2). This simple accounting or “tally” of gene discovery does not fully represent the genetic and genomic insights and new understanding generated by CMG international collaborative efforts regarding disease traits, human biology, and human developmental and homeostatic processes.

Fig. 1

Centers for Mendelian Genomics (CMG) disease gene discovery through 31 May 2018 (year 7, quarter 2) by the four centers. Discoveries are defined as “novel” if (1) the causal variant was identified in a gene not previously associated with a Mendelian phenotype at the time of case acceptance to the study (i.e., novel gene), or (2) the causal variant was identified in association with a Mendelian phenotype with a MIM number (a known phenotype) and for which no causal variants had previously been reported (i.e., novel gene, unexplained known phenotype), or (3) the causal variant was identified in association with a Mendelian phenotype with no MIM number and for which no variants in the identified gene had been previously reported as causal of a Mendelian phenotype (i.e., novel gene, new phenotype). Graph of discoveries (genotype–phenotype pairs) categorized as novel, phenotypic expansion, or known. Discoveries are classified as either tier 1 (blue bars) or tier 2 (orange bars, not meeting tier 1 definition). Tier 1 genes include high confidence genes reported by individual centers as tier 1, defined as having been identified in either (1) multiple kindreds with shared phenotypic features and likely pathogenic variants in the same gene, or (2) a single family plus a model organism with orthologous phenotypic features, or (3) a single family with supportive functional and mapping data. Pheno expan phenotypic expansion.

Fig. 2

Cumulative Centers for Mendelian Genomics (CMG) disease gene discovery. The number of novel gene–phenotype discoveries as reported by all four centers is graphed by progress reporting period (blue bars) and cumulatively (red bars). Biannual phenotypic expansion discoveries involving previously known disease genes (green line) and biannual known disease gene discoveries (yellow line) are also graphed. The yellow arrow indicates the pace of novel gene–phenotype discovery, and demonstrates a pace of 263 novel gene–phenotype discoveries per year, or 1 novel gene–phenotype discovery for every 28 exome sequences (ES) performed.

Dissemination of knowledge

As of 31 August 2018, the CMGs have contributed to a total of 522 manuscripts with collaborators worldwide (Table 1). These efforts have supported the establishment of tenure-track positions for 16 junior faculty, and the successful preparation of 9 K- or R-level NIH-funded grants (Appendix, Supplemental Tables 1 and 2), and aided in the training of numerous graduate students. The CMGs have also taken steps to disseminate prepublished data to the scientific community. The Genomic Sequencing Program Coordinating Center (GSPCC)-managed CMG website provides public access to a searchable phenotypes and genes database of CMG disease gene discoveries; depositions to ClinVar and dbGaP further support knowledge dissemination.

Table 1 Centers for Mendelian Genomics (CMG)-wide recruitment, production, and knowledge dissemination and collaboration achievements

Contribution to clinical diagnosis and patient management

The CMGs have engaged diagnostic laboratories as an extension of the research laboratory efforts, increasing gene discovery through analysis of nondiagnostic clinical exomes. This interaction facilitates rapid transition from novel disease gene discovery to patient report, with direct involvement of additional stakeholders in the discovery efforts.44 Review of 12,577 sequential noncancer cases referred to the Baylor Genetics diagnostic laboratory yielded 4075 cases for which ES was diagnostic. Of these, 333 molecular diagnoses explaining part or all of the reported clinical phenotype involved CMG discovery genes (Supplemental Figure 2A). A precise molecular diagnosis (PGM3 [ref. 45], TANGO2 [ref. 46] and ABL1 [ref. 47]) informed medical management of 21 individuals, with several additional clinically impactful CMG disease gene discoveries beyond this clinical cohort (Supplemental Table 3). Other collaborations between the CMGs and worldwide diagnostic and research laboratories make extensive use of the Matchmaker Exchange network,48,49 which includes CMG-developed nodes GeneMatcher,50,51 MyGene2 (ref. 52), and matchbox, facilitating novel disease gene discoveries worldwide (Supplemental Table 4). The impact of CMG disease gene discoveries on molecular diagnostics is further reflected in the number of pathogenic or likely pathogenic variant entries in discovery genes in ClinVar and the Genetic Testing Registry database (GTR, https://www.ncbi.nlm.nih.gov/gtr/; Supplemental Figure 2B–D). These findings illustrate the substantial impact of the CMGs on both clinical diagnostics and medical management, demonstrating unequivocally a successful “bedside-to-bench-to-bedside” approach.

Molecular mechanisms underlying Mendelian conditions

Over the past decade there has been tremendous progress in understanding the molecular basis of and mechanisms underlying Mendelian conditions. De novo pathogenic variants have been increasingly recognized as a major source of rare conditions, particularly those that reduce reproductive fitness.53,54,55 This has been borne out in clinical referral cohorts across all ages,56,57 as well as across numerous phenotypes, such as neurodevelopmental disorders,58,59 Meier–Gorlin syndrome (MIM 616835) (ref. 60), visceral myopathy (megacystis–microcolon–intestinal hypoperistalsis syndrome [MIM 155310]) (ref. 61), and nasopalpebral lipoma–coloboma syndrome.62 Somatic mosaic variation has been demonstrated to be an important contributor to rare disease.63,64,65 Parental mosaicism can impact recurrence risk counseling for families with apparently sporadic disease, and the likelihood of parental germline mosaicism is dependent on whether the new variant arises on the maternal or paternal allele.66,67,68 Proband mosaicism has been found to underlie many conditions, including Cornelia de Lange syndrome (CdLS), for which both genetic heterogeneity and mosaicism can impact clinical expressivity of disease.69,70,71,72 Mosaic reversion of pathogenic variants to wild type has also been described in ichthyosis with confetti lesions caused by variants in KRT10 or KRT1 (refs. 73,74), and in immunodeficiency syndromes for which the affected cell populations are under strong negative selection.75

Intragenic CNVs, most notably exon deletion or “dropout” alleles sometimes affecting only a single exon, have been identified as a difficult-to-detect cause of many Mendelian conditions (Supplemental Table 5). Additionally, exonic deletions from clinical diagnostic CMA have fostered gene discovery efforts.76 Mosaic and copy-number variants are underdiagnosed by current NGS technologies,77 and these examples illustrate the clinical relevance of such discovery and the need to develop and apply dedicated computational pipelines to their identification.

Variation in patterns of disease inheritance

For certain Mendelian conditions, empirical observations suggest more than one pattern of disease trait transmission associated with variation at a single locus (e.g., autosomal recessive, AR; autosomal dominant, AD; in some instances, X-linked, XL; and even common, complex), which may confound genomic mapping studies for a particular trait.58,78,79,80,81,82 These observations can be explained by allelic heterogeneity, with different consequences of the pathogenic variants (i.e., LoF, GoF, dominant negative) at a given locus or variable magnitude of the pathogenic variant effect.58,79 This is exemplified by SMCHD1, for which missense variants located within the ATPase domain are associated with Bosma arhinia microphthalmia syndrome (MIM 603457), whereas LoF is associated with facioscapulohumeral muscular dystrophy type 2 (FSHD2, MIM 158901) and digenic inheritance.83 Collectively, the CMGs have identified over 30 loci (http://mendelian.org/phenotypes-genes) with known or proposed human disease phenotypes for which elucidation of the responsible gene and causative variants explains the clinical observation of both dominant (monoallelic) and recessive (biallelic) inheritance of the corresponding disease traits with either similar or dissimilar phenotypic features (Supplemental Table 6).

There are increasing examples of variants that escape nonsense-mediated decay (NMD) and result in expression of a phenotype due to GoF.84,85,86 An NMD escape intolerance score metric based on the depletion of protein-truncating variants within gene regions predicted to escape NMD may facilitate the identification of variants that function through a GoF mechanism.87 Such variants may be present in genes with low probability of loss-of-function intolerance (pLI) scores predicting tolerance to LoF variants.87
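The idea of "gene regions predicted to escape NMD" can be made concrete with a minimal sketch. It assumes a widely cited heuristic that is not stated in the text and is not the method of any specific tool: a premature termination codon (PTC) in the last exon, or within roughly 50 nucleotides upstream of the last exon-exon junction, tends to escape NMD. The `Transcript` class, the function name, and the 50-nt default are hypothetical and for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    # Exon lengths in transcript order, in spliced-mRNA coordinates
    # (hypothetical, simplified representation).
    exon_lengths: List[int]


def predicted_to_escape_nmd(tx: Transcript, ptc_pos: int, boundary_nt: int = 50) -> bool:
    """Apply the assumed ~50-nt heuristic to a premature stop codon.

    ptc_pos is the 0-based position of the premature stop in the spliced
    transcript. Returns True if the heuristic predicts NMD escape.
    """
    total = sum(tx.exon_lengths)
    if not 0 <= ptc_pos < total:
        raise ValueError("ptc_pos lies outside the transcript")
    if len(tx.exon_lengths) == 1:
        # No exon-exon junction exists, so this simplified model predicts escape.
        return True
    # Position of the first base of the last exon.
    last_junction = total - tx.exon_lengths[-1]
    # Escape if the PTC is in the last exon or within boundary_nt upstream of it.
    return ptc_pos >= last_junction - boundary_nt


tx = Transcript(exon_lengths=[200, 300, 150])  # last junction at position 500
print(predicted_to_escape_nmd(tx, 100))  # early PTC: predicted NMD target
print(predicted_to_escape_nmd(tx, 460))  # within 50 nt of the last junction
print(predicted_to_escape_nmd(tx, 600))  # inside the last exon
```

A rule of this kind only flags candidate regions. Whether a truncated product acts through GoF still requires the functional and depletion-based evidence described above.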

The CMGs have unraveled the biology of genetic heterogeneity in analyses of several cohorts with apparently homogeneous phenotypes. The identification of novel disease genes (DVL1, DVL3, FZD2, and NXN) in Robinow syndrome (MIM 268310, 180700, 616331, 616894) has provided a molecular diagnosis in potentially 95% of the studied disease cohort and highlighted a shared role in the noncanonical Wnt pathway for this phenotype.88,89,90 Similarly, in Noonan syndrome, the CMGs and others have identified novel disease genes with a role in the RAS/MAPK pathway.91,92,93,94,95,96,97,98,99,100 Rare variation in genes encoding the cohesin complex has now been described to underlie Mendelian conditions termed “cohesinopathies,” which demonstrate clinical features that are similar to those observed in CdLS.72,101 The frequency of the cohesin complex subunit protein/gene contribution depends on how the phenotype is ascertained: specifically as CdLS-like phenotypes, or more broadly as DD/ID.72 Increasingly, Mendelian inheritance is appreciated as vastly more complicated and nuanced than the simple binary patterns Gregor Mendel described in 1865, and ES now enables delineation of this complexity through identification of allelic and locus heterogeneity in human Mendelian disorders.

CMG-facilitated studies contributed to elucidation of multilocus pathogenic variant effects on disease trait manifestations:

  • Digenic inheritance has been described in facioscapulohumeral dystrophy type 2 (FSHD2 [MIM 158901]), involving rare variation in SMCHD1 and a permissive DUX4 allele, both required for expression of disease.102,103 Digenic inheritance of a rare SMAD6 variant in association with a common variant downstream of BMP2 was described in association with midline craniosynostosis.104 In both examples, the observation of reduced penetrance drove discovery of the second locus required for disease expression.

  • Dual/multiple molecular diagnoses or multilocus pathogenic variation involving CNVs and/or SNVs result in blended phenotypes estimated to comprise at least 4.9% of all diagnostic clinical exome cases.56,57,105,106,107,108,109 Presenting phenotypes may be distinct or overlapping, and may obscure clinical ascertainment, and parental mosaicism can impact recurrence risk.56,110

  • Mutational burden and modifiers can modulate the phenotypic severity of the observed trait, and may explain intrafamilial phenotypic variability, as has been observed in peripheral neuropathy.111 Similarly, an aggregation of rare variants has been shown to influence susceptibility to Parkinson disease,112 and the age of onset of amyotrophic lateral sclerosis (ALS).113

  • Phenotypic expansion21 is often observed with recently discovered disease genes, for which the full phenotypic spectrum of disease has not yet been appreciated. Multilocus variation can explain some cases of apparent phenotypic expansion,114 resulting in the observation of additional phenotypic features (multiple molecular diagnoses) or modifying the severity or characteristics of the primary observed phenotype (as multiple molecular diagnoses, or as modifiers).

Bioinformatic tool development

CMG investigators have developed tools for gene matching, data sharing, phenotype analysis, and exome variant data analysis (Table 2). Gene-matching tools connecting clinicians and human and model organism genetics investigators include GeneMatcher,50,51 MyGene2 (which includes a patient-facing portal for data sharing),52 and matchbox (Fig. 3). These tools each communicate through the MME (www.matchmakerexchange.org/), enabling gene and phenotype matching both within and across matching tools in the United States and internationally.49 Members of the CMGs have essential roles in developing and maintaining the MME.48,49

Table 2 Bioinformatic analysis tools developed by the CMGs
Fig. 3

Matchmaking tools developed through the Centers for Mendelian Genomics (CMGs). (a) The Matchmaker Exchange (MME) facilitates communication among multiple databases of human genomic and phenotypic data, each unique in focus and design. Each database functions as a node within the MME. (b) Total number of entries in each MME node, as well as total number of entries per node shared within the MME, are indicated. Also listed is the total number of unique genes per node. Note that a given unique gene may be present in more than one node. (c) Cumulative GeneMatcher statistics demonstrate 26,614 submissions of 10,341 genes through 1 November 2018. This has resulted in 5195 matched genes. GeneMatcher has submitters in 77 countries today, demonstrating worldwide democratization of disease gene discovery. (d) MyGene2 is a database through which patients and families can directly share their genomic data. Matchbox is an open-source tool through which institutions or groups with genomic data can connect to the MME.

The CMGs have also developed software to record and compare phenotype data and analyze sequence data with the aim of identifying responsible genes and variants. These include PhenoDB50,115 and seqr (https://seqr.broadinstitute.org), which, in addition to recording phenotypic data in a standard structured ontology (e.g., Human Phenotype Ontology, HPO),116 also enable variant prioritization utilizing patterns of Mendelian inheritance, minor allele frequencies from reference population databases, and annotation of genes and variants by OMIM, ClinVar, and other resources (Table 2). ALoFT (annotation of loss-of-function transcripts) annotates and predicts putative disease-causing LoF pathogenic variants. It can further distinguish disease-causing LoF variants that act in the heterozygous state from those that cause disease only in the homozygous state.117 Quantification of missense variant-induced local perturbation on a protein structure can identify putative disease-causing missense pathogenic variants.118 The localized frustration metric can identify variants that disrupt protein function without severely affecting the global stability of proteins.118 Additional analysis software developed by the CMGs is listed in Table 2.
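Variant prioritization by inheritance pattern and minor allele frequency can be illustrated with a minimal sketch for a proband-parent trio. This is not the PhenoDB or seqr implementation; the `Variant` record, the `prioritize` function, the genotype encoding (alternate allele count), and the 0.001 frequency cut-off are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    gene: str
    variant_id: str
    maf: float    # minor allele frequency in a reference population database
    proband: int  # alternate allele count (0, 1, or 2)
    mother: int
    father: int


def prioritize(variants: List[Variant], max_maf: float = 0.001) -> Dict[str, List[str]]:
    """Group rare proband variants by the inheritance model they fit."""
    hits: Dict[str, List[str]] = defaultdict(list)
    maternal_het: Dict[str, List[str]] = defaultdict(list)
    paternal_het: Dict[str, List[str]] = defaultdict(list)
    for v in variants:
        if v.maf > max_maf or v.proband == 0:
            continue  # too common, or absent from the proband
        if v.proband == 1 and v.mother == 0 and v.father == 0:
            hits["de_novo"].append(v.variant_id)
        elif v.proband == 2 and v.mother == 1 and v.father == 1:
            hits["homozygous_recessive"].append(v.variant_id)
        elif v.proband == 1 and v.mother == 1 and v.father == 0:
            maternal_het[v.gene].append(v.variant_id)
        elif v.proband == 1 and v.mother == 0 and v.father == 1:
            paternal_het[v.gene].append(v.variant_id)
    # Compound heterozygous candidates: one maternal and one paternal allele in a gene.
    for gene in sorted(maternal_het.keys() & paternal_het.keys()):
        hits["compound_het"].extend(sorted(maternal_het[gene] + paternal_het[gene]))
    return dict(hits)


trio = [
    Variant("GENE_A", "v1", 0.0, 1, 0, 0),
    Variant("GENE_B", "v2", 0.0005, 2, 1, 1),
    Variant("GENE_C", "v3", 0.0001, 1, 1, 0),
    Variant("GENE_C", "v4", 0.0002, 1, 0, 1),
    Variant("GENE_D", "v5", 0.2, 1, 0, 0),  # common, filtered out
]
print(prioritize(trio))
```

A fixed frequency cut-off is the simplest design, but it would discard common alleles that contribute to disease only in combination with a rare allele, so the threshold is best treated as a tunable parameter.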

WHAT REMAINS TO BE DONE

Phenotypic annotation of variant effects in all ~20,000 human genes will provide the necessary evidence base to study the biologic relevance of each locus in the human genome. Some of the key challenges in meeting this long-term goal are elucidated below.

Disease gene discovery

Despite the progress of the CMGs, thousands of disease genes remain to be discovered. Currently, OMIM lists 3961 genes known to have high penetrance variants (~19% of the total annotated protein coding genes) underlying Mendelian conditions (www.OMIM.org; 4 October 2018). The early years of the CMGs saw rapid-paced gene discovery, including much of the “low-hanging fruit” available for study. Moving forward, the CMGs plan to explore new strategies for identification and engagement of families and clinicians worldwide and mainstreaming use of more complex, biologically based analysis strategies to identify novel genes for Mendelian conditions. Discoveries will stimulate new biological questions about the relationship between rare variation and human Mendelian conditions, including the impact of different mutational mechanisms, the consequences of pathogenic variants on RNA and protein function, and the extent and consequences of mosaicism. Elucidation of the phenotypes resulting from allelic heterogeneity (LoF, GoF) and the impact of variation on protein function (LoF [amorph], partial LoF [hypomorph], increase in function [hypermorph], novel function [neomorph], and dominant negative [antemorph])119 are incompletely explored for most loci, underscoring the important role of allelic series. We will expand allelic series for newly identified disease genes and further explore gene-first approaches to develop more sophisticated genotype–phenotype relationships. This rapidly expanding application of genomic sequencing to identify novel genotype–phenotype relationships across the world’s population will expand the utility of clinical genomics, enable precision medicine in all countries, and fuel the rapidly increasing trajectory of biological insights into perturbed homeostasis and disease.

Methods and approach

Continued evaluation and integration of appropriate technological advances will provide the greatest likelihood of success in reaching our goals. To date, ES has been the predominant platform contributing to discovery of new disease genes, owing to its markedly lower cost compared with GS and enrichment of rare variants with large phenotypic effect in coding regions. In support of ES reanalysis to increase molecular diagnostic rates, one pilot study of ES reanalysis in 74 nondiagnostic clinical cases, with expansion to available relatives for trio- and multiplex-ES, led to the identification of a likely or potential molecular diagnosis in 51% (38/74) of previously unsolved clinical cases.46

On contemporary capture platforms, sensitivity of detection of SNVs is similar between ES and GS; however, GS enables more sensitive structural variant calling, particularly for copy-neutral variation (e.g., inversions) and smaller sized (<10 kb) CNVs. GS also promotes integration of CMG data with those generated by ENCODE, GTEx, and GENCODE. Caveats to the use of GS include limited annotation of noncoding variants and lower-fold coverage, which reduces power to detect low allele fraction mosaic variants and increases false positive de novo variant calls. Reports of improved coverage of coding regions by GS compared with ES are complicated by lack of an appropriately powered head-to-head study comparing contemporaneous versions of both technologies. Despite several large-scale investments in GS, the paucity of new disease gene discoveries reported from GS that would not have been made at much lower cost by ES challenges GS as a cost-effective strategy. Some studies suggest, however, that the addition of RNAseq and/or GS identifies causative variants or adds functional support for the interpretation of variants discovered by ES. For example:

  • Muscle-related and mitochondrial phenotypes for which highly penetrant noncoding variants and/or tissue-specific transcript-level changes can cause Mendelian conditions120,121,122

  • Recessive conditions for which a single (i.e., monoallelic) rare coding variant has been identified in a candidate gene, increasing the likelihood of having a noncoding SNV or CNV impacting splicing on the second allele (in trans)123,124,125

Further development of analytic methods and bioinformatics tools

Several variant types remain poorly (or at least not routinely) recognized by current variant-calling methods. Indel-calling sensitivity is suboptimal with the standard analytic tools currently in use (Atlas2, GATK). Analytic methods need to incorporate information about imprinted genomic areas, X-linked pseudoautosomal regions, and uniparental disomy.126,127 Structural variant identification also remains a challenge, particularly single-exon dropout alleles, small CNVs 50–1000 bp in size, mobile element insertions (MEIs), and copy number neutral structural variants (e.g., inversions and translocations), as well as trinucleotide repeat expansions. Improved methods design needs to consider family structure, modes of inheritance, and the contribution of rare, or even private, variants to Mendelian conditions.128 Methods such as Combined Annotation Dependent Depletion (CADD),129,130 which predict missense variant impact on protein function, are needed to support resolution of clinically reported variants of uncertain clinical significance.

We need a better understanding of the contribution of synonymous and noncoding variants to altered function through transfer RNA (tRNA) abundance and splicing effects.131,132,133 There is also a clear need for development of programs, such as NMDEscPredictor, to predict a variant’s effect on NMD.87 Simultaneous computational integration of rare and common variant analyses should be evaluated for enabling identification of conditions resulting from a combination of variants of these types (e.g., compound inheritance underlying 10% of congenital scoliosis).134 Population-specific databases are increasingly helpful tools in identifying rare variants within a given population, and continued growth and diversification of these resources are needed. The CMGs have studied multiple non-European populations including large Turkish, Middle Eastern, and African cohorts of over 1000 individuals.135

The challenge of non-Mendelian inheritance

The insights and discoveries described in previous sections define a shift that moves beyond the boundaries of one-disease-one-gene models. Mechanistically, more work is needed to explore the molecular basis of penetrance. Two CMG-studied conditions illustrate compound inheritance models, combining rare and common variants at one locus or at more than one locus, that account for incomplete penetrance. The observation of incomplete penetrance after identifying variants in SMAD6 in a nonsyndromic midline craniosynostosis cohort prompted the additional discovery of a common variant (minor allele frequency [MAF] 0.41) downstream of BMP2, which explained the incomplete penetrance.104 In Han Chinese individuals with congenital scoliosis, incomplete penetrance observed in relatives sharing a 16p11.2 deletion or TBX6 LoF variant led to the discovery of a common TBX6 hypomorphic allele (MAF 0.44 in Chinese population, 0.33 in Caucasians, <0.01 in individuals of African descent) in trans with the rare TBX6 null allele (MAF of 16p11.2 deletion is 0.0003 worldwide).134 Individuals with biallelic LoF + hypomorphic TBX6 variants have a distinct TBX6-associated congenital scoliosis (TACS) phenotype characterized by hemivertebrae and/or butterfly vertebrae involving the lower spine.136 Mouse models of biallelic LoF + hypomorphic TBX6 alleles demonstrated reduced Tbx6 expression from hypomorphic alleles, leading to a vertebral malformation phenotype;137 homozygosity for the null allele leads to distortion of Mendelian ratios through embryonic lethality. These studies demonstrate that a rare null and a noncoding common hypomorphic allele can influence gene dosage and expression at a locus and thereby impact phenotypic expression of human disease traits.
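The allele frequencies quoted for TBX6 allow a rough back-of-the-envelope estimate of how often the rare null and the common hypomorphic allele are expected to occur together in trans. The sketch below assumes Hardy-Weinberg equilibrium with random mating, treats the two alleles as mutually exclusive alleles of one locus, and uses the 16p11.2 deletion frequency as a stand-in for all null alleles. These assumptions are ours and not the source's, and the result is an expected genotype frequency, not a disease prevalence. The function name is hypothetical.

```python
def compound_genotype_frequency(null_af: float, hypomorph_af: float) -> float:
    """Expected frequency of null/hypomorph compound heterozygotes (2pq).

    Assumes Hardy-Weinberg equilibrium and that the two alleles are
    mutually exclusive alleles at the same locus.
    """
    if not (0.0 <= null_af <= 1.0 and 0.0 <= hypomorph_af <= 1.0):
        raise ValueError("allele frequencies must lie in [0, 1]")
    if null_af + hypomorph_af > 1.0:
        raise ValueError("allele frequencies at one locus cannot sum to more than 1")
    return 2.0 * null_af * hypomorph_af


NULL_AF = 0.0003  # 16p11.2 deletion frequency quoted in the text
HYPOMORPH_AF = {"Chinese": 0.44, "Caucasian": 0.33}  # frequencies quoted in the text

for population, q in HYPOMORPH_AF.items():
    freq = compound_genotype_frequency(NULL_AF, q)
    print(f"{population}: expected genotype frequency {freq:.6f} (about 1 in {round(1 / freq):,})")
```

The same rare null allele is therefore expected to meet the hypomorphic allele more often in populations where the hypomorph is more common, which is consistent with penetrance of the rare allele varying with the population frequency of the common allele.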

We need to understand the contributions to penetrance of variation in environmental exposures,138,139 variation at various modifier loci, and epigenetic effects. From a computational perspective, we also need a more refined definition of LoF, with distinction between nonsense and frameshifting variants that are likely to escape—or be subject to—NMD. This distinction will be increasingly important for understanding the pathogenesis underlying variants that result in premature translation termination, a class of variants for which premature truncation readthrough-based therapeutics may become available.140

Bridging the gap between rare and common variation and disease

The genetic architecture of rare and common disease is often conceptualized as a continuum based primarily on the frequencies of the relevant variant alleles: rare variant alleles causative for Mendelian conditions and common alleles contributing risk for common disease phenotypes.141 Rare diseases, defined in the United States as any condition affecting fewer than 200,000 individuals, typically have etiologic or causal variants of large effect and a population frequency of far less than 0.1%. Common diseases of adult life often have a mixed genetic and environmental etiology, with susceptibility variants that are more common (>1%) and have markedly smaller effect sizes.142

Discoveries in Mendelian conditions have refined our understanding of the genetic architecture of common disease, with contributions of both rare and common variants to common disease.14,111,143,144,145,146,147,148 An analysis of genes associated with rare disease revealed that almost 20% were nearest to, or contain, a variant that had been associated with common disease.21 Moreover, rare de novo SNVs with large phenotypic effect contribute to common childhood traits including neurodevelopmental and congenital heart conditions.14,15,16,149,150,151,152,153,154,155,156 Abnormalities of gene dosage mediated by rare CNVs have further been recognized as an etiology of both Mendelian conditions and risk for common diseases such as neuropathy, dementia, depression, bipolar disease, schizophrenia, autism, and intellectual disability.53,151,157,158,159,160,161,162,163,164,165,166

The ongoing exploration of Mendelian conditions by the CMGs and others has increased appreciation for the extent of allelic and locus heterogeneity, variability of expression and penetrance, the role of new mutation, and mosaicism in disease and the phenotypic complexity that can arise from combinatorial effects of rare alleles at a locus (biallelic versus monoallelic) or at different loci (i.e., multilocus pathogenic variation)—characteristics shared by both rare and common disorders. We explore these concepts using four examples, and discuss the impact of genomics informed by pedigree structure and mode of inheritance on the human genetics field’s understanding of the architecture of common disease:

  1. Rare variation may present phenotypically as a common disease, obscuring recognition of a distinct monogenic disorder. Monogenic forms of steroid-resistant nephrotic syndrome due to rare variation in NUP93, NUP205, XPO5, and FAT1 illustrate this concept: these monogenic conditions implicated a role for BMP7-induced SMAD signaling and Rho-like small GTPase signaling pathways in defective podocyte migration, providing therapeutic targets for drug development.143,144 A recent analysis of electronic health records for correlations between phenotypes overlapping with a Mendelian condition using a phenotypic risk score (PheRS) and genotype data in individuals with presumed common disease revealed 18 previously unrecognized Mendelian diagnoses.146

  2. Rare variants causing dominant traits may present as a phenotypically milder common trait, such as the dominant carpal tunnel syndrome that may be observed in PMP22 deletion heterozygotes, who typically are expected to develop hereditary neuropathy with liability to pressure palsies (HNPP [MIM 162500]) (refs.167,168). Allelic series have elucidated loci harboring rare, highly penetrant variants leading to Mendelian conditions, and more common variants contributing risk for common disease; for example, rare and common variants in SNCA, including duplication and triplication CNV of the locus, have been described in association with familial and sporadic forms of Parkinson disease, respectively.169,170,171,172,173

  3. Several genes identified because of their association with recessive Mendelian conditions were later discovered to contribute to risk for common complex disease in heterozygous individuals, representing an expansion of the originally defined phenotypes (Supplemental Table 7). Notably, heterozygosity for alleles that cause severe recessive disease may be associated with reduced risk for common disease. For example, heterozygosity for pathogenic alleles in SLC12A1, KCNJ1, and SLC12A3, associated with Bartter and Gitelman syndromes, reduces blood pressure and protects against adult-onset hypertension.147 Population cohorts with a high rate of consanguinity and carrier frequency for recessive conditions represent an opportunity to analyze phenotypic effects of heterozygous LoF.148,174

  4. Multilocus mutational burden can impact expression of common disease. Genomic studies of neuropathy and Parkinson disease have suggested a model in which an aggregation of rare variants in disease-associated genes can influence clinical severity and can contribute to common complex traits.111,112

These discoveries at the intersection of rare and common disease will facilitate further development of precision medicine through elucidation of targetable pathways underlying disease.
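The phenotypic risk score (PheRS) mentioned in the first example above can be sketched as a weighted sum of the clinical features of a Mendelian condition that are observed in an individual's health record. The sketch below assumes that each feature is weighted by the log of its inverse prevalence in the record population, so that rarer features contribute more. The function names, feature labels, and prevalence values are hypothetical placeholders, not data from the cited study.

```python
import math
from typing import Dict, Set


def phenotype_weights(prevalence: Dict[str, float]) -> Dict[str, float]:
    """Weight each phenotype by log(1 / prevalence) in the cohort (assumed scheme)."""
    return {name: math.log(1.0 / p) for name, p in prevalence.items()}


def phers(observed: Set[str],
          disease_features: Set[str],
          weights: Dict[str, float]) -> float:
    """Sum the weights of disease features that are present in the individual."""
    return sum(weights[f] for f in observed & disease_features)


if __name__ == "__main__":
    # Hypothetical cohort prevalences for three features of one condition.
    prevalence = {"feature_a": 0.10, "feature_b": 0.01, "feature_c": 0.50}
    weights = phenotype_weights(prevalence)
    condition = {"feature_a", "feature_b", "feature_c"}
    individual = {"feature_a", "feature_b", "unrelated_feature"}
    print(round(phers(individual, condition, weights), 3))
```

Individuals with unusually high scores for a given condition would then be candidates for review of rare variants in the corresponding gene; how such a threshold is chosen is left open here.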

Data sharing

A worldwide effort to share individual-level exome variant and phenotype databases can be highly beneficial for rare disease research as well as other genetic studies. CMG data are deposited to dbGaP and ClinVar. Additionally, access to the Broad CMG data can be applied for through the Broad’s Data Use Oversight System (DUOS). DUOS (https://duos.broadinstitute.org/#/home) is a novel framework for automating the data use oversight process that is overseen by a Data Access Committee. DUOS provides de-identified genotype and phenotype data to authorized researchers in a substantially more usable fashion than the currently cumbersome dbGaP platform. De-identified rare variants tied to broad phenotype data for all cases sequenced by the UWCMG are shared publicly through Geno2MP (http://geno2mp.gs.washington.edu) and deposited in MyGene2. Variant data for candidate genes can be directly requested from both BHCMG and Baylor Genetics clinical diagnostic laboratories. Submission of candidate genes to the MME further fosters global involvement in discovery. Continued integration of data with patient-facing portals, for example MyGene2 (ref. 52), may facilitate further engagement of stakeholders to support patient involvement in research studies. The additional development of publicly accessible online tools for direct interrogation of exome data would be useful to further patient and physician engagement.

Integration with other genome sequencing programs

Partnership with the Centers for Common Disease Genomics (CCDGs) will continue to be an important strategy for the CMGs as rare disease discoveries are likely to increasingly impact common disease discoveries and both programs implement genome-wide approaches. Collaborations with the CCDGs have already been instrumental in development, improvement, and implementation of sequencing methods and variant data processing, annotation, and analysis pipelines (Farek et al., https://github.com/jfarek/xatlas/blob/master/README.md).175,176 The genomics community and CMGs in particular have benefitted tremendously from the development of the ExAC and gnomAD databases as well as the ARIC database.87,129,177 The ARIC database draws on a more general population rather than a disease cohort. Using these former resources, the study of constrained genes that show fewer than expected LoF or missense variants in general populations has refined prioritization of candidate disease genes. Likewise, GoF variant allele prioritization has been assisted by the ARIC database and NMDEscPredictor.87
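The notion of a constrained gene, one showing fewer LoF or missense variants than expected in a general population, can be made concrete with an observed/expected (o/e) ratio. The following is a minimal sketch under simple assumptions: the expected count is taken as given (in practice it is derived from a mutational model), and the gene labels and counts are hypothetical values chosen for illustration. It does not reproduce the constraint metrics published with any specific database.

```python
from typing import Dict, List, Tuple


def oe_ratio(observed: int, expected: float) -> float:
    """Observed/expected variant count; lower values suggest stronger constraint."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def rank_by_constraint(counts: Dict[str, Tuple[int, float]]) -> List[Tuple[str, float]]:
    """Sort genes from most to least constrained (ascending o/e ratio)."""
    ratios = [(gene, oe_ratio(obs, exp)) for gene, (obs, exp) in counts.items()]
    return sorted(ratios, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical (observed LoF, expected LoF) counts per gene.
    lof_counts = {
        "GENE_A": (2, 40.0),
        "GENE_B": (18, 20.0),
        "GENE_C": (0, 12.5),
    }
    for gene, ratio in rank_by_constraint(lof_counts):
        print(f"{gene}\t{ratio:.2f}")
```

A candidate gene near the top of such a ranking would be given higher priority, with the caveat that a point estimate from small expected counts is noisy and is usually reported with a confidence interval.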

A recent formalized collaboration between the CMGs and the Knock-Out Mouse Project (KOMP, https://www.komp.org) centers promotes rapid sharing of CMG discovery gene lists with the KOMP centers. Mutant mouse strains generated through these collaborations will be available to researchers worldwide. Similarly, an enhanced interface with the UDN clinical and model organism screening centers (MOSCs) should enable in-depth characterization of allelic series for disease genes.178 As many as 45% of Drosophila genes important for neurodevelopment have a human disease ortholog, and Drosophila genes with more than one human ortholog are enriched eightfold for human disease genes.178,179 One such gene, ANKLE2, had been identified as a CMG tier 2 gene in a family with severe microcephaly; recent studies implicate ANKLE2 as a target of the Zika virus.180 Collaborative efforts have included the UDN,32,78,79,179,180,181,182,183 the DDD,59,72,184 and the UK10K Project.80 Expansion of such collaborations to similar clinically oriented discovery programs, such as the Gabriella Miller Kids First (GMKF) program, and deeper integration with international programs like FORGE Canada Consortium,11 DECIPHER (https://decipher.sanger.ac.uk), Care4Rare Canada Consortium (http://care4rare.ca) and rare disease programs affiliated with IRDiRC (http://www.irdirc.org) and in Asia134,136 could further foster international collaboration and stakeholder impact for CMG discovery.

Integration with clinical testing programs

With the goal of rapidly translating novel CMG discoveries to maximally impact patient care, expanding partnerships with clinical diagnostic laboratories and engaging clinicians worldwide will be important. Clinicians play perhaps the most important role in discovery and are truly at the forefront of efforts to engage in detailed phenotyping. Collaboration with diagnostic laboratories provides several advantages: (1) availability of thousands of cases for which ES has been nondiagnostic, maximizing the likelihood of novel disease gene discovery; (2) potential for enrollment of individuals with pre-existing exome data into research; (3) contact with referring physicians, allowing access to phenotypic information and the opportunity for clinical reassessment; (4) collaborative research; and (5) clinical (College of American Pathologists [CAP], CLIA-accredited) reporting of novel discoveries to the referring physician, facilitating rapid dissemination of information from bench to bedside.

CONCLUSION

CMG-facilitated collaborative research efforts have provided clear deliverables including, most notably, >1000 new disease genes and >500 peer-reviewed publications. However, much work remains. Extending gene discoveries to the interrogation of LoF, GoF, and dominant negative variants on disease expression, and modeling allelic series in mice, underscores the need for analysis of multiple variants per gene. The relationship between rare and common disease is real but complex and can include the intersection of both rare variant and common variant alleles at one or more loci. The extent to which multilocus pathogenic variation contributes to blended phenotypes, phenotypic severity, and phenotypic expansion remains to be explored.

As we, the CMG and world collaborators, investigate personal genome variation in the context of an individual’s phenotype, computational methods for analysis of observed clinical phenotypes using structured phenotypic ontologies105 will enable the field to fully explore genotype–phenotype relationships and to potentially achieve individualized care. Expansion of recruitment efforts to understudied countries, ethnicities, and phenotypes will further expand disease gene discovery and improve clinical utility. Continued development of sequencing technology and bioinformatic tools for genomic data analysis will also increase the effectiveness and efficiency of the CMG collaborative efforts. Finally, and perhaps most importantly, increased integration with clinical genomics,2 extending the reach of the research laboratories and enabling novel discoveries that benefit patient care as expeditiously as possible, is essential for realizing the maximal benefit to world populations.