Introduction

Genomic sequencing in the form of large gene panels, whole-genome sequencing or whole-exome sequencing has entered genetics clinics, with thousands or even tens of thousands of such tests performed to date, significantly increasing the diagnostic yields in several clinical scenarios.1,2,3,4 The promise of genomic medicine has long been acknowledged to be at the center of precision medicine by the scientific community and has just recently gained national recognition.5 With such comprehensive testing, however, come several challenges, the key one of which is the interpretation of the large number of variants (estimated to be 3–5 million per human genome) generated in the process, thousands of which have uncertain clinical significance. Even more alarming is the fact that the majority of variants identified are very rare, having been seen in only one family.6 Characterizing the functional impact of all variants identified requires tremendous resources that are beyond the capacity of any single molecular diagnostic laboratory.

Recognizing this challenge, the National Institutes of Health has recently funded three different groups under the Clinical Genome Resource (ClinGen) Program, whose main goal is to build and maintain a publicly accessible genomic knowledge base to promote variant data sharing among clinical laboratories, researchers, and clinicians.6 Another major goal of this program is dissemination of standards and guidelines for variant interpretation and gene–disease association.7,8 In a similar approach for gene–disease associations, we have recently shown that a systematic evaluation of genes associated with hearing loss can largely eliminate unnecessary interpretation of variants in genes with weak disease associations.9 Still, a large number of novel variants of uncertain clinical significance are routinely identified in genes with strong disease associations, necessitating novel approaches to prioritize candidate disease-causing variants.

The use of population sequencing data sets such as the Exome Aggregation Consortium (ExAC) has been useful in genome-wide assessment of the tolerance of genes to variation and has shown a greater degree of intolerance for genes that are associated with Mendelian disorders.10,11,12,13 A similar strategy at a gene level can potentially identify tolerant or sensitive intragenic regions for a particular class of variants for any given disease gene.14 The tolerance predictions of these regions to different classes of variation can be further enhanced through the use of clinically interpreted variant data sets from diagnostic laboratories. Here, we used variant data sets from population databases and from a clinical laboratory to statistically define domain-level disease associations and to assess exon-level tolerance of loss-of-function variants across 132 genes included in gene panels offered at our laboratory. We evaluated the utility of this approach by examining the impact of the regions defined through our analysis on the classification of variants previously determined to be of uncertain clinical significance. In addition, we show a significant role for our approach in refining the variant interpretation process.

Materials and Methods

Clinically curated variants

A total of 186,009 variant observations representing 6,978 unique variants (Supplementary Table S1 online) in 132 genes (Supplementary Table S2 online) were identified in a cohort of 11,219 probands tested at the Laboratory for Molecular Medicine between 2005 and 2015. This cohort consisted of individuals affected with cardiomyopathies (n = 5,466), rasopathies (n = 3,022), hearing loss (n = 1,990), and connective tissue disorders (n = 741).

All 6,978 unique variants were manually assessed for clinical significance and each was classified into one of five categories based on its potential impact and role in causing the patient’s disease: pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), likely benign (LB), and benign (B). Details of the variant classification workflow at our laboratory have been described.15 The third category, VUS, is further refined into three subcategories (VUS–favor benign (FB), VUS, and VUS–favor pathogenic (FP)) based on whether evidence favors a benign or pathogenic classification but does not meet thresholds or criteria for a likely pathogenic or likely benign classification. VUS-FP leans toward LP and VUS-FB leans toward LB, whereas the VUS subcategory denotes a lack of evidence to favor a causative or neutral role or equally conflicting evidence.

Protein domain burden analysis

The numbers of probands positive or negative for each of the 6,978 clinically curated variants were determined using our internal database. For controls, we queried the Exome Aggregation Consortium (release 0.3)12 database (60,706 whole exome samples) for all high-quality (PASS) variants and their allele frequencies in these 132 genes. Then, the numbers of individuals positive or negative for each of the rare variants were determined. To minimize any potential annotation discrepancies, we reannotated all variants in cases and controls using SnpEff v4.0e16 with a gene model from RefSeq.17 Annotations include exon, transcript, gene, amino acid change, and the variant’s impact on the protein.

Based on their functional impact, reannotated case and control variants were then grouped into “functional” variants (mostly missense and in-frame insertions or deletions), “loss of function” variants, or both (lof_functional) (Supplementary Table S3 online, http://snpeff.sourceforge.net/SnpEff_manual.html). Domain burden was analyzed by first calculating, within each variant group, the total number of rare variants (defined as variants with allele frequency <1% in cases and controls) at each protein domain (N = 645) based on boundaries from the Pfam database18 (domain boundaries were predicted using hidden Markov models, downloaded from UCSC tables, last updated 1 July 2013). Coding regions that do not overlap with any Pfam domain were labeled as “outside domains” for each transcript. We then generated a 2 × 2 table at each protein domain/“outside domain” region for the number of positives and negatives in cases and controls and assessed significance using Fisher’s exact test. All P values were adjusted by Bonferroni correction based on the total number of domains (the significance level was <1.52 × 10⁻⁵). All significant, tolerant, and intolerant regions are listed in Table 1 and Supplementary Table S4 online.
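The per-region test described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the pure-Python implementation of the two-sided Fisher's exact test, and the example counts are all assumptions introduced here. Only the cohort sizes (11,219 probands, 60,706 ExAC samples) and the significance level (1.52 × 10⁻⁵) are taken from the text.

```python
from math import exp, lgamma

SIGNIFICANCE_LEVEL = 1.52e-5  # Bonferroni-adjusted level reported in the text


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    Rows are cases and controls; columns are positives and negatives.
    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = _log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric probability of x positives among the cases.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denominator

    log_observed = log_prob(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= log_observed + 1e-7:  # small tolerance for floating-point ties
            total += exp(lp)
    return min(1.0, total)


def classify_region(case_pos, case_neg, ctrl_pos, ctrl_neg,
                    alpha=SIGNIFICANCE_LEVEL):
    """Label a domain by the direction of any significant enrichment."""
    p_value = fisher_exact_two_sided(case_pos, case_neg, ctrl_pos, ctrl_neg)
    if p_value >= alpha:
        return "not significant"
    case_rate = case_pos / (case_pos + case_neg)
    ctrl_rate = ctrl_pos / (ctrl_pos + ctrl_neg)
    return "disease-constrained" if case_rate > ctrl_rate else "tolerant"


# Hypothetical counts for one domain: 200 of 11,219 probands and
# 300 of 60,706 controls carry a rare variant in the domain.
print(classify_region(200, 11019, 300, 60406))
```

Working in log space keeps the hypergeometric terms stable at cohort sizes in the tens of thousands; a statistics library routine would serve equally well where one is available.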

Table 1 Domains or intragenic regions with significant enrichment of variants in cases or controls

As a quality check, significantly “intolerant” regions were confirmed to have adequate coverage in ExAC (Supplementary Table S4 online), thus ruling out any potential false-positive results due to the lack of coverage in the general population. Furthermore, all identified regions were validated for enrichment of clinically significant variants in our internal database, HGMD, and ClinVar.

Loss-of-function variant analysis per RefSeq transcripts

After reannotating variants from the Exome Aggregation Consortium, we investigated all RefSeq transcripts in the 132 genes (361 transcripts and 2,825 unique exons) to identify exons with high-allele frequency (MAF ≥0.1%) loss-of-function (LoF) variant(s) and/or with multiple such variants. The loss-of-function effect of each variant was predicted by SnpEff v4.0e (see details under “Protein Domain Burden Analysis”) and included mainly stop-gains, splice sites (± 1 or 2), frameshifts, and start-losses.

Exons with high-allele frequency LoF variants were interpreted in the context of disease prevalence, mode of inheritance, potential annotation errors, and/or gene structure (see “Results”).
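The exon screen described in this section can be sketched as below. The record layout (dictionaries with gene, transcript, exon, effect, and af keys), the effect labels, and the example variants are hypothetical placeholders rather than the annotation output actually used; only the 0.1% frequency threshold comes from the text.

```python
from collections import defaultdict

# Hypothetical labels standing in for the annotated LoF classes
# (stop-gains, splice sites, frameshifts, start-losses).
LOF_EFFECTS = {"stop_gained", "splice_site", "frameshift", "start_lost"}
HIGH_FREQ_MAF = 0.001  # 0.1%, the screening threshold quoted in the text


def screen_exons(variants, maf_cutoff=HIGH_FREQ_MAF):
    """Return exons carrying at least one LoF variant at or above the cutoff.

    Each exon is keyed by (gene, transcript, exon) and mapped to the number
    of LoF variants it carries in total and at high frequency.
    """
    per_exon = defaultdict(lambda: {"lof_total": 0, "lof_high_freq": 0})
    for v in variants:
        if v["effect"] not in LOF_EFFECTS:
            continue  # only predicted loss-of-function variants are counted
        key = (v["gene"], v["transcript"], v["exon"])
        per_exon[key]["lof_total"] += 1
        if v["af"] >= maf_cutoff:
            per_exon[key]["lof_high_freq"] += 1
    return {k: c for k, c in per_exon.items() if c["lof_high_freq"] > 0}


# Illustrative records only; gene and transcript names are placeholders.
example_variants = [
    {"gene": "GENE_A", "transcript": "TX_1", "exon": 5, "effect": "stop_gained", "af": 0.01},
    {"gene": "GENE_A", "transcript": "TX_1", "exon": 5, "effect": "frameshift", "af": 0.00002},
    {"gene": "GENE_A", "transcript": "TX_1", "exon": 7, "effect": "missense", "af": 0.02},
    {"gene": "GENE_B", "transcript": "TX_2", "exon": 2, "effect": "splice_site", "af": 0.0001},
]
flagged = screen_exons(example_variants)
print(flagged)
```

Flagged exons are candidates for the manual review described above, not a final call; the sketch deliberately leaves out quality filtering and the interpretation against prevalence, inheritance, and gene structure.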

Results

Intragenic disease burden

A significant number of variants in known genes are continually identified that have uncertain clinical significance due to absence of supporting functional and/or statistical genetic evidence. For example, the clinical significance of 35% of the variants (2,413/6,978) identified in 132 genes analyzed in this study could not be determined (see below) despite the fact that most of those genes have well-established associations with cardiomyopathies, rasopathies, connective tissue disorders, or hearing loss. In such genes, quantifying intragenic intolerance to variation can potentially support variant prioritization and interpretation.

We made the assumption that intragenic regions depleted of functional variation in the general population but enriched for variants in individuals with a disease are most likely to be clinically relevant relative to other regions. To quantify such intragenic region–disease associations, variant counts (<1% MAF) in cases versus controls were calculated at each of the annotated Pfam domains in the 132 genes (“Materials and Methods”). The number of cases, genes, domains, and clinically curated variants per disease are shown in Supplementary Table S1 online. The ExAC database (N = 60,706 individuals) was used as a control data set. We statistically determined whether any domain(s) had a significant enrichment of variants in cases versus controls. Within a total of 34 genes, we identified 33 regions with a significant disease burden in addition to 25 regions that were relatively tolerant to variation in the general population (P < 1.52 × 10⁻⁵, Table 1 and Supplementary Table S4 online). The remaining 98 genes did not exhibit any significant enrichment, most likely owing to lack of sufficient, especially clinical, variant data (see “Discussion”).

Validation of the intragenic burden analysis

Because all manually curated variants with MAF <1% were included regardless of their clinical classification, we sought to determine whether constrained or intolerant regions were enriched for pathogenic or likely pathogenic variants and vice versa. A total of 6,978 unique variants with clinical significance ranging from benign to pathogenic were used in the analysis (Supplementary Figure S1A online; see also “Materials and Methods” for more information on variant classification). The distribution of all classified variants across the four diseases and the 132 genes is shown in Supplementary Figure S1B and Supplementary Table S2 online, respectively. As expected, there was a significant enrichment of P and LP variants relative to those classified as VUS-FB in all disease-constrained intragenic regions (P < 0.001, Supplementary Table S5 online). However, no such enrichment was observed in most (26/29) tolerant regions (Supplementary Table S6 online). Instead, the latter regions contained, on average, threefold more VUS-FB variants (P < 0.006) but 13-fold fewer P and LP variants relative to the disease-intolerant regions (P < 0.001, Figure 1a).

Figure 1

Intolerant intragenic regions show enrichment of clinically significant variants relative to tolerant regions that are depleted for such variants but enriched for benign or likely benign variants. (a) The distribution of classified variants in disease-constrained (black) versus disease-tolerant (grey) regions. Average numbers ± standard error are shown. ***P < 0.001; **P < 0.006 (Mann-Whitney nonparametric two-sample test). (b) The distribution of ClinVar P/LP, B/LB, and HGMD DM variants in disease-constrained (red) versus disease-tolerant (green) regions. Average numbers ± standard error are shown. ***P < 0.001; *P = 0.01 (Mann-Whitney nonparametric two-sample test).

To further validate our findings, we determined the distribution of disease-causing variants from ClinVar and HGMD across the regions identified (Supplementary Information and Supplementary Table S4 online). We detected a fivefold enrichment of ClinVar P/LP to ClinVar B/LB variants in the intolerant regions (P = 2 × 10⁻⁴, Figure 1b). Again, no such enrichment was observed in most tolerant regions. Rather, similar to our laboratory-curated variants, the latter regions contained, on average, 2.6-fold more ClinVar B/LB variants (P = 0.0145) and 3- and 2.3-fold fewer ClinVar P/LP (P = 0.011) and HGMD DM (P = 0.017) variants, respectively, relative to the disease-constrained regions (Figure 1b). These findings highlight the utility of our approach and show that the identified intragenic regions can predict the distribution of disease-causing and benign variants across the relevant genes.

Examples of constrained domains

As might be expected, most of the constrained regions were within the 65 genes that are primarily associated with autosomal dominant disorders due to gain-of-function pathogenic variants, including cardiomyopathies, rasopathies, and connective tissue disorders. A total of 25 of these 65 (~40%) genes had constrained regions, with some genes shown to be generally intolerant to variation (PTPN11, BRAF, SOS1, HRAS, FBN1, MYH7, MYBPC3, PKP2, and LMNA) and others having certain intragenic regions that were significantly more or less constrained; this observation could be used to support variant prioritization in the relevant gene. One example is the MYH7 gene, for which most pathogenic variation for cardiomyopathy is due to missense variants. Despite its overall intolerance to variation, the myosin head domain in this gene is the most disease-burdened relative to other domains, such as the myosin tail, which appears to be slightly tolerant to variation (Figure 2). As expected, the myosin head domain contained 186 P/LP variants but only 9 VUS-FB variants. By contrast, there were only 3 and 14 VUS-FB and P/LP variants, respectively, in the tail domain (Supplementary Tables S5 and S6 online). Another example is PTPN11, for which the pathogenic variation for rasopathies is due to missense variants. PTPN11 is generally intolerant to variation, but, as shown in Supplementary Figure S2 online, the SH2 and phosphatase domains in this gene are extremely constrained, such that novel variants identified in those domains are more likely to be clinically significant relative to other regions of the gene. In fact, 88% (376/428) of the P and LP variants identified in this gene were restricted to those two domains (Supplementary Table S5 online).

Figure 2

A heat map showing the distribution of variation in cases versus controls across the MYH7 gene. The table shows the raw data, i.e., the numbers of probands and controls that are positive or negative for variants in certain domains or regions within the gene. LMM refers to the Laboratory for Molecular Medicine. P(LMM) refers to the P value for statistically significant enrichment in cases relative to controls. P(ExAC) is the P value for statistically significant enrichment in controls.

Unlike autosomal dominant disorders, the 67 autosomal recessive hearing-loss genes rarely had regions that were constrained. Only two genes had such regions, although one of them can also cause disease in an autosomal dominant fashion (MYO7A). By contrast, there were nine genes with regions that were highly tolerant to variation in the general population (Table 1). One interesting example is GPR98, which has been shown to be an extremely tolerant gene.10,12 Our analysis, however, shows that this tolerance is significant only in the regions that are “outside domains” (Table 1), whereas other regions of the gene did not appear to be significantly tolerant (data not shown).

Impact of burden analysis

To determine the impact of this analysis on variant classification, we investigated the extent to which the identified statistically significant regions can support classification of variants that lack any other evidence, so-called VUS variants in our classification system (see “Materials and Methods”). Identification of such variants in the tolerant or disease-constrained regions is very likely to support reclassifying them at least to VUS-favor benign or VUS-favor pathogenic, respectively. Of the total 1,742 VUS variants identified in the 132 genes, 450 (or 26%) resided in these regions and can thus be considered for reclassification (Supplementary Figure S3 and Supplementary Tables S5 and S6 online). Based on this, we anticipate that our intragenic constraint bins will enable variant prioritization in the relevant genes.
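The reclassification logic described above can be illustrated with a short sketch. The function name and the region labels are assumptions made for illustration, and the output is only a suggestion that leaves the final call to manual curation; the counts (450 of 1,742 VUS variants) are those reported in the text.

```python
def suggest_vus_subcategory(classification, region_status):
    """Suggest a refined subcategory for a VUS based on the region it falls in.

    Variants that already carry another classification, or that lie outside
    any significant region, are returned unchanged.
    """
    if classification != "VUS":
        return classification
    if region_status == "tolerant":
        return "VUS-FB"  # VUS-favor benign
    if region_status == "disease-constrained":
        return "VUS-FP"  # VUS-favor pathogenic
    return classification


# Share of VUS variants that fall in a significant region (counts from the text).
vus_total = 1742
vus_in_regions = 450
share = vus_in_regions / vus_total
print(f"{share:.1%}")  # 25.8%, reported as 26% in the text
```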

Exons with high burden of loss-of-function variants

Although most clinical laboratories tend to include the longest transcripts in their assays because of limited available information on the critical exons required for gene function, detailed analysis of LoF variants and transcript structure can be helpful to further refine the clinical relevance of intragenic disease regions. We used allele frequencies from the general population12 to identify exons harboring high-allele frequency LoF variants in the 132 genes studied. Such exons could cause disease but with reduced penetrance, or they could be clinically benign, potentially due to alternative splicing leaving out exons not critical to gene function. For routine interpretation, the laboratory uses the relevant disease prevalence, estimated penetrance, and genetic heterogeneity to calculate the minor allele frequency above which variants in the associated genes can be considered likely benign, with higher frequencies used for a benign classification. For hearing loss, 0.1% and 0.3% likely benign cutoffs were used for autosomal dominant and autosomal recessive hearing loss, respectively. For the remaining diseases, 0.3% allele frequency was conservatively used.
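The frequency cutoffs quoted above amount to a simple lookup, sketched here. The dictionary layout, the function names, and the disease and inheritance labels are illustrative assumptions; the cutoff values themselves (0.1% for autosomal dominant hearing loss, 0.3% for autosomal recessive hearing loss, and 0.3% for the remaining diseases) are taken from the text. How the cutoffs are derived from prevalence, penetrance, and heterogeneity is not modeled here.

```python
# Likely benign allele frequency cutoffs as quoted in the text.
LIKELY_BENIGN_CUTOFFS = {
    ("hearing loss", "AD"): 0.001,  # 0.1%
    ("hearing loss", "AR"): 0.003,  # 0.3%
}
DEFAULT_CUTOFF = 0.003  # 0.3%, used conservatively for the remaining diseases


def likely_benign_cutoff(disease, inheritance):
    """Return the likely benign cutoff for a disease and inheritance mode."""
    return LIKELY_BENIGN_CUTOFFS.get((disease, inheritance), DEFAULT_CUTOFF)


def exceeds_cutoff(allele_frequency, disease, inheritance):
    """True when the allele frequency is above the likely benign cutoff."""
    return allele_frequency > likely_benign_cutoff(disease, inheritance)


# A 0.2% allele exceeds the autosomal dominant hearing-loss cutoff
# but not the autosomal recessive one.
print(exceeds_cutoff(0.002, "hearing loss", "AD"))
print(exceeds_cutoff(0.002, "hearing loss", "AR"))
```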

Our first-tier analysis identified a large number of LoF variants in the genes of interest. However, we performed manual curation to exclude very-low-quality variants, especially frameshifts, and “apparent” nonsense variants, whereby adjacent single-nucleotide variants, likely to be in cis, changed their interpretation; these are the so-called multinucleotide polymorphisms (MNPs).12 After this filtration, we identified 26 exons from 21 genes, each harboring at least one LoF variant exceeding the aforementioned allele frequency cutoffs (Table 2). Some of those exons also had several rare LoF variants in ExAC. Most exons (18 or 70%) appeared to be alternatively spliced and therefore not expressed by all transcripts, probably explaining why they are found at a frequency too high for the disorder. Three exons were small in-frame exons; basal exon skipping19 or nonsense-induced alternative splicing20 might rescue the expected protein loss of function due to such variants. Two exons, one of which was also alternatively spliced, had high-allele frequency start-loss variants with a nearby secondary methionine, suggesting either annotation errors or start reinitiation. Three “LoF” variants were in the last exons of CABP2, DSC2, and P2RX2 and thus are not expected to be true LoF variants due to escape from nonsense-mediated decay (NMD) and limited protein impact. The last exons in DSC2 and P2RX2 were also alternatively spliced (Table 2).

Table 2 Exons with high-allele frequency loss-of-function variants in the general population

Finally, three exons did not fit any of these categories. One was in DFNA5, wherein only variants leading to exon 8 skipping have been shown to cause autosomal dominant hearing loss through a potential gain of function mechanism.21,22 In fact, a frameshift variant in exon 5 of this gene failed to segregate with disease in an Iranian family with hearing loss,23 confirming that loss of function is not a disease mechanism. The remaining two LoF variants are splice acceptor variants in ABCC9 and SLC26A4 found in 0.7% South Asian alleles (with one homozygous individual) and 0.4% East Asian alleles, respectively (Table 2). Whether these two variants are true, potentially founder, LoF mutations in the Asian subpopulation has yet to be determined. Nevertheless, none of the 26 exons contained pathogenic variants in our patient cohorts, further suggesting they are not required for gene function.

Examples of exons with high-allele frequency LoF variants

One example is the PCDH15 gene, wherein pathogenic variants have been strongly associated with Usher syndrome type 1 (refs. 24,25). One LoF variant with 1% allele frequency is found in exon 33 of the NM_033056 transcript (Table 2). Additionally, 37 other LoF variants in ExAC are also found within this exon, which is alternatively spliced from other transcripts, suggesting that it is unlikely to harbor any disease-causing variants. In fact, this finding has also been confirmed in a recent study.26

Another example is the USH1C gene, which has also been associated with Usher syndrome type 1.27 This gene contains 28 exons and produces several transcripts that share exons 1–14 and 16–21 encoding for three PDZ domains and one coiled-coil domain. A longer isoform with an additional seven exons (NM_153676.3: Exons 15A-20A, 26A) encoding a second coiled-coil domain and a PST (proline, serine, threonine-rich) domain has also been described.28 Two LoF variants, a splice site (0.6%) in exon 19A and a nonsense variant (0.3%) in exon 20A, were identified by ExAC in the African subpopulation (Figure 3a). This allele frequency is relatively high given the Usher type I disease prevalence, suggesting that those exons are very unlikely to be involved in this disease. To examine this, we assessed the expression of USH1C exons 19A and 20A in different human tissues using the recently published Genotype-Tissue Expression (GTEx) Project in which multitissue RNA sequencing was performed for 173 individuals.29 Gene- and exon-level expression patterns were made publicly accessible through the GTEx portal (http://www.gtexportal.org/home/documentationPage). Interestingly, exons 19A and 20A and a few others showed very low expression, if at all, in the tissues where USH1C was found to be expressed. It is thus strongly predicted that these exons are spliced out of the primary isoform of the protein in most, if not all, tissues (Figure 3b). Because cochlear and retinal tissues are not represented in the GTEx database, we cannot exclude the possibility that this isoform is expressed only in these tissues involved in Usher syndrome; however, given the high-frequency LoF variants identified in the ExAC database but not found in our disease population, the most likely conclusion is that these exons are also not required in retina and cochlea.

Figure 3

High-allele frequency LoF variants in two alternatively spliced exons (19A and 20A) in the USH1C gene. (a) A schematic adapted from Alamut Visual Software showing two USH1C transcripts and highlighting exons 19A and 20A and the LoF variants in those two exons. (b) A schematic adapted from the GTEx Portal website showing the exon-level expression of USH1C in two different tissues (only the two tissues with the highest expression are shown). These data were generated using RNA sequencing and clearly show the lack of expression of several exons, including 19A and 20A, as signified by exon skipping. Note that this trend holds for all other tissues where USH1C is present at lower levels.

We also found a similar low expression pattern in GTEx for several exons in Table 2, including CCDC50 exon 6, COCH exon 2, EDN3 exon 4, LAMA4 exon 2, LOXHD1 exon 1, MYO15A exons 2 and 26, and PAX3 exon 4. In addition, 5 of the 26 exons (MYO15A exons 2 and 26, OTOF exon 32, PCDH15 exon 33, and TRIOBP exon 1) overlapped with the tolerant regions identified through the burden analysis described (see also Supplementary Table S6 online). We did not expect high concordance because the latter approach is extremely dependent on clinically curated variants that are lacking for several genes and/or regions (see “Discussion”) and uses domain boundaries that often consist of multiple exons whose tolerance is averaged into an overall domain tolerance.

In summary, although tolerance to 23 of the 26 high-allele frequency LoF variants is probably explained by gene structure (alternative splicing, in-frame exons, start reinitiation, NMD escape) and tolerance to one is explained by a distinct disease mechanism (DFNA5), we cannot exclude, for any of these variants, a genuine loss-of-function effect with reduced penetrance and/or variable expressivity. Nevertheless, such information is extremely important to use during clinical variant interpretation. Extra caution should be exercised in interpreting variants in these exons given that all variations may be benign, as supported by the fact that all 26 exons were devoid of pathogenic variants in our patient population.

Discussion

Because it is impossible to have sufficient evidence to support the classification of all potential variants in disease genes, new approaches are needed to better inform the variant interpretation process. In this study, we leveraged information about the frequency of variants in cases and controls to statistically bin intragenic regions of disease genes into those with higher or lower tolerance to variation. We showed that such regions exhibit the expected enrichment of pathogenic or benign variants based on regional overall tolerance. We also showed that this approach provides additional information to support the classification of variants with limited or no evidence and to prioritize variants in the relevant genes. Additionally, we highlighted exons that harbor LoF variants at frequencies in the general population that exceed what would be expected based on disease prevalence, suggesting that these exons might not be expressed and/or disease-relevant. In fact, most such exons seem to be alternatively spliced and were devoid of pathogenic variants, supporting the premise that LoF variants with high frequencies in the general population, if supported with strong analytical data, can guide transcript annotation to aid in interpretation.

It should be noted that additional expert-interpreted variants, along with deeper allele frequency information from variants in large general populations, are always needed to gain more resolution across intragenic regions. This is especially important for large domains where unequal tolerance to variation can average out significant signals. Furthermore, because most diseases can be subdivided into different forms based on phenotypic variation, more clinical data will be needed to stratify based on phenotype. In addition, our approach does not take into consideration information about missense change types (for example, conservative, nonconservative, or cysteine changes) that can affect this analysis. Although some domains might tolerate conservative or nonconservative changes, they might be extremely intolerant to disulfide bond disruptions (cysteine changes), for example. Finally, with increased variant allele frequency data in patients and in the general population, this analysis can be performed based on ethnicity to uncover regions that have subpopulation-specific tolerance or intolerance to variation. All these reasons, in addition to lack of variant data for many genes as described, might explain why only 34 genes had significantly tolerant or intolerant regions in our analysis while 98 genes did not.

An illustrative example of some of these complexities is the troponin domain, encoded by exons 10–14 of the TNNT2 gene, which is associated with both hypertrophic cardiomyopathy (HCM) and dilated cardiomyopathy (DCM). This domain was found to be tolerant to variation using our burden analysis (Supplementary Table S6 online), although missense P/LP variants have been reported in this domain. Interestingly, most of those variants are localized to exon 10, disrupt highly conserved basic arginine residues, and are carried by patients with DCM only and not HCM, consistent with previous findings.30,31 This example clearly highlights the issues of amino acid properties and conservation, phenotype stratification, and variable tolerance within any domain.

It is also important to note that sufficient variant data were not available for all genes. Well-curated variant databases would be needed to take this approach for other genes. Furthermore, our approach relies on sufficient understanding of disease prevalence, penetrance, and extent of genetic heterogeneity. Because such information is not always available, conservative approaches must be taken in assuming variation of high-allele frequency is likely benign. Systematic collection of these data for all diseases will be a useful resource that can inform data-analysis approaches such as those presented here.

Despite the limitations, our study clearly demonstrates the importance of using the relative distribution of variants in controls versus affected individuals to support variant interpretation and/or prioritization within the relevant genes. Incorporating our intragenic disease burden statistics into the existing phylogenetic-based in silico algorithms, such as SIFT32 and PolyPhen-2,33 is likely to enhance their prediction of variant clinical significance. Finally, our study highlights the need for appropriately capturing and sharing disease sequencing data to enable such approaches, which are likely to reduce the interpretation challenge facing clinical genomics as well as guide high-impact functional research in disease genes.

Disclosure

All authors work for fee-for-service laboratories to perform clinical genetic testing. The other authors declare no conflict of interest.