Introduction

Rapid advances in sequencing technologies have led to a growing number of genetic testing services, ranging from single-gene analysis to targeted panels and whole-exome and whole-genome sequencing. In clinical settings, the limiting factor has shifted from acquiring sequencing data to classifying, interpreting, and reporting novel and recurring sequence variants for which there is little or no conclusive evidence of causation.1

Classification of sequence variants considers the prevalence of the variant in presumably healthy unaffected individuals, cosegregation of the variant with disease in families, and computational and in vitro/in vivo analyses showing the predicted effect of the variant on function or aberrant splicing.2 In particular, the frequency of occurrence, or lack thereof, of a variant in the general population (controls) constitutes an important line of evidence impacting variant classification. Additionally, population frequency databases are used in next-generation sequencing pipelines to exclude common variants that are less likely to be pathogenic.3,4 If the frequency threshold is set too low or if the data set used to ascertain frequency contains affected individuals, then potentially disease-causing variants may be filtered out in the early stages of the pipeline. Consequently, the use of large frequency databases to support classification and analysis of variants is rapidly gaining momentum.

The Exome Aggregation Consortium (ExAC),5 a collection of whole-exome sequencing data from more than 60,000 ostensibly healthy individuals representing diverse human populations, was released in late 2014. The aim of this study was to evaluate this database as a representative control cohort for analysis and classification of sequence variants observed in a clinical laboratory. In particular, we wanted to explore whether the ExAC data set was enriched for pathogenic variation in specific disorders or genes. Because the genes and disorders tested in clinical settings are numerous and heterogeneous, we designed a pilot study that included a broad, but representative, sampling of dominant tumor suppressor genes, dominant cardiovascular-disorder genes, and recessive genes with well-established clinical utility and uptake in clinical diagnostic settings.

Materials and Methods

Data collection and analysis

The ExAC data set provides sequence variation in 60,706 unrelated individuals from various disease-specific and population genetic studies. The data set includes a distribution of diverse ethnicities including European (non-Finnish), European (Finnish), African, Latino, South Asian, East Asian, and “Other.” Sequencing data from 17 contributing projects were included in ExAC. Although phenotype data for the individuals included have not been provided, individuals affected by severe pediatric disease were excluded from the data set.

Variants for analysis comprised a collection of our internal classifications in 19 genes to include dominant tumor suppressor genes (BRCA1, BRCA2, MLH1, MSH2, MSH6, and PMS2), dominant cardiac-disorder genes (MYBPC3, MYH7, TNNT2, TNNI3, PKP2, DSG2, DSP, DSC2, and FBN1), and recessive genes (CFTR, GJB2, HBB, and MEFV). All variants were classified by a tiered in-house variant classification protocol (https://submit.ncbi.nlm.nih.gov/ft/byid/pttb9itm/labcorp_variant_classification_method_-_may_2015.pdf) following guidelines issued by the American College of Medical Genetics and Genomics (ACMG).2 The data presented encompass 2,984 classified variants across 19 genes spanning diverse disorders and modes of inheritance.

ExAC data for each gene were downloaded from http://exac.broadinstitute.org/.5 The corresponding frequency in ExAC of each variant in the data set described above was queried by the corresponding nucleotide level nomenclature scheme (c.name). Differences in nomenclature between ExAC and our internal variant database were reconciled with the HGVS-approved standard for each variant in the data set to ensure accuracy of ascertainment.6
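The reconciliation step is not described in detail here. As a minimal sketch of one part of it, assume the only discrepancy between two sources is whether deleted or duplicated bases are spelled out (for example, c.68_69delAG versus c.68_69del). The helper names and the record layout below are hypothetical and are not taken from ExAC or from our internal database.

```python
import re

# Matches explicit bases written after "del" or "dup", e.g. "delAG" or "dupC".
# "delins..." is left alone because "i" is not a nucleotide letter.
_EXPLICIT_BASES = re.compile(r"(del|dup)[ACGT]+")


def normalize_c_name(c_name: str) -> str:
    """Return a comparison key for a coding-level variant name (hypothetical helper)."""
    return _EXPLICIT_BASES.sub(r"\1", c_name.strip())


def lookup_counts(c_name, exac_records):
    """Find (allele_count, allele_number) for a variant, ignoring spelled-out bases.

    exac_records is assumed to be a dict mapping c.name -> (allele_count,
    allele_number); this layout is an illustration, not the ExAC file format.
    """
    index = {normalize_c_name(name): counts for name, counts in exac_records.items()}
    return index.get(normalize_c_name(c_name))


if __name__ == "__main__":
    records = {"c.68_69del": (29, 120972)}  # counts reported in Results for BRCA1
    print(lookup_counts("c.68_69delAG", records))
```

A key-based match of this kind only addresses spelling differences. Indels that are positioned or shifted differently between sources still require manual review.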

To derive traceable comparisons for each gene, the evidence supporting phenotype prevalence, locus/allelic heterogeneity, and penetrance was used to estimate the maximal pathogenic allele frequency (MPAF) for each gene (Supplementary Table S1 online). MPAF provides a conservative maximum expected frequency of pathogenic alleles in any gene under the assumption that the corresponding disease is entirely attributable to a single pathogenic variant.7 Variants present at frequencies above MPAF provide supportive evidence for nonpathogenicity.
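The gene-specific calculations are given in Supplementary Table S1 and are not reproduced here. As a rough illustration only, the following sketch shows one simple way such an upper bound could be derived from prevalence, the share of disease attributed to the gene, and penetrance, under the single-variant assumption stated above. The formulas, function names, and example inputs are assumptions made for illustration and may differ from the calculation actually used.

```python
import math


def mpaf_dominant(prevalence: float, penetrance: float, gene_contribution: float = 1.0) -> float:
    """Upper bound on a pathogenic allele frequency for a dominant disorder.

    Assumes all cases attributed to the gene are heterozygous for one variant:
    carrier frequency = prevalence * gene_contribution / penetrance, and the
    allele frequency is half the carrier frequency.
    """
    return prevalence * gene_contribution / (2.0 * penetrance)


def mpaf_recessive(prevalence: float, penetrance: float = 1.0, gene_contribution: float = 1.0) -> float:
    """Upper bound for a recessive disorder.

    Assumes all cases attributed to the gene are homozygous for one variant,
    so q**2 * penetrance = prevalence * gene_contribution.
    """
    return math.sqrt(prevalence * gene_contribution / penetrance)


if __name__ == "__main__":
    # Hypothetical inputs, not values from Supplementary Table S1.
    print(mpaf_dominant(prevalence=1 / 500, penetrance=0.5, gene_contribution=0.4))
    print(mpaf_recessive(prevalence=1 / 2500))
```

Lower penetrance or a larger gene contribution raises the bound, which is why conservative inputs give a higher MPAF.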

For each gene, the frequency in ExAC was determined for all variants classified as pathogenic or likely pathogenic. In addition, all classified variants with frequencies above the MPAF in each gene were ascertained. Carrier frequencies and ethnicity-specific variant distribution(s) in ExAC were compared with the published literature for variants in genes with available information.
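The comparison described above can be sketched as follows. The record type, the function names, the MPAF value, and the second example variant are hypothetical; only the BRCA2 allele count is taken from the Results.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    c_name: str
    classification: str   # e.g. "pathogenic", "VUS", "benign"
    allele_count: int     # ExAC allele count
    allele_number: int    # ExAC total chromosomes genotyped at the site


def allele_frequency(v: Variant) -> float:
    """Allele count divided by chromosomes; 0.0 if the site was not covered."""
    return v.allele_count / v.allele_number if v.allele_number else 0.0


def above_mpaf(variants, mpaf_by_gene):
    """Return (variant, fold) pairs for variants whose frequency exceeds the gene's MPAF."""
    flagged = []
    for v in variants:
        mpaf = mpaf_by_gene[v.gene]
        af = allele_frequency(v)
        if af > mpaf:
            flagged.append((v, af / mpaf))
    return flagged


if __name__ == "__main__":
    mpaf_by_gene = {"BRCA2": 0.001}  # hypothetical threshold for illustration
    variants = [
        Variant("BRCA2", "c.5946delT", "pathogenic", 32, 120698),
        Variant("BRCA2", "c.0000A>G", "VUS", 500, 120000),  # made-up variant
    ]
    for v, fold in above_mpaf(variants, mpaf_by_gene):
        print(v.gene, v.c_name, v.classification, f"{fold:.1f}-fold above MPAF")
```

Reporting the fold excess alongside the flag is useful because a frequency only slightly above the MPAF is weaker evidence than one many times higher.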

Results

The three pathogenic BRCA variants with the highest allele frequency in ExAC were the well-known Ashkenazi Jewish (AJ) founder mutations, namely, BRCA2 c.5946delT (32/120,698 chromosomes), BRCA1 c.68_69delAG (29/120,972 chromosomes), and BRCA1 c.5266dupC (19/121,412 chromosomes). The carrier frequency of 1/756 for the three BRCA1 and 2 AJ founder mutations in the ExAC database was consistent with the frequency of 1 in 400 to 800 individuals reported to carry pathogenic germ-line mutations in BRCA1 or BRCA2 in the general population.8,9,10
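The combined carrier frequency can be reproduced from the three allele counts. The sketch below assumes that each observed allele belongs to a distinct heterozygous individual and that the number of individuals genotyped at a site is half the chromosome count. The variable and function names are illustrative.

```python
# (allele count, total chromosomes) for the three AJ founder mutations reported above.
FOUNDER_COUNTS = {
    "BRCA2 c.5946delT": (32, 120698),
    "BRCA1 c.68_69delAG": (29, 120972),
    "BRCA1 c.5266dupC": (19, 121412),
}


def carrier_frequency(allele_count: int, chromosomes: int) -> float:
    """Carriers per individual, assuming one allele per carrier (no homozygotes)."""
    return allele_count / (chromosomes / 2)


def combined_carrier_frequency(counts) -> float:
    """Sum of per-variant carrier frequencies, assuming no individual carries two of them."""
    return sum(carrier_frequency(ac, an) for ac, an in counts.values())


if __name__ == "__main__":
    combined = combined_carrier_frequency(FOUNDER_COUNTS)
    print(f"1/{round(1 / combined)}")  # prints 1/756
```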

The carrier frequency in ExAC of the most frequent AJ CFTR mutation, c.3846G>A (p.W1282X), was 1/1312, which is lower than the reported carrier frequency of 1/863 for this CFTR variant in an ethnically diverse US population (P < 0.05).11 This indicates that the AJ ethnicity is not overrepresented in the ExAC data set. The three most frequent pathogenic CFTR variants observed in ExAC were c.1521_1523delCTT (p.F508del), c.350G>A (p.R117H), and c.3209G>A (p.R1070Q), with carrier frequencies of 1/74, 1/325, and 1/619, respectively. Of these, the carrier frequencies of p.F508del and p.R117H in ExAC were in the range of the reported frequencies for p.F508del (1/65) and p.R117H (1/422) in an ethnically diverse US population.11 Within the subpopulations represented in ExAC, the carrier frequencies of these three variants were highest in non-Finnish Europeans (1/47) for p.F508del, in non-Finnish Europeans (1/195) for p.R117H, and in South Asians (1/95) for p.R1070Q. The overall distribution pattern of these variants within different ethnicities is consistent with published data among African, Asian, Caucasian, Latino, and other populations.11,12 Furthermore, the distribution of pathogenic variants with homozygous occurrences in GJB2 (p.V37I and c.35delG in East Asians and Europeans, respectively), HBB (p.E7K and p.E7V in Africans), and MEFV (p.V726A in Europeans) followed the expected distribution based on the reported prevalence of autosomal recessive deafness (GJB2, OMIM 220290), hemoglobinopathies (HBB, OMIM 141900), and familial Mediterranean fever (MEFV, OMIM 249100) in these subpopulations.13,14,15 These observations demonstrate that ExAC is not enriched for pathogenic variants in the specific disorders tested, thereby supporting its utility as a control cohort in genetic analysis.

Only 3 of 871 variants (0.34%) that had been classified as pathogenic or likely pathogenic across 19 genes exceeded the estimated MPAF. The distribution in ExAC of the average minor allele frequency (MAF) of pathogenic and likely pathogenic variants in relation to the corresponding estimated MPAF in the genes analyzed is provided in Table 1.

Table 1 Average MAF of pathogenic variants in analyzed genes

Of 237 BRCA1 and 2 variants that have been classified as pathogenic or likely pathogenic, 44 were present in ExAC. The majority of these variants had an allele count of 1 or 2 of about 121,412 total chromosomes (n = 35). None had an allele frequency exceeding the MPAF for each gene.

Of the 266 cardiac-disorder gene variants that have been classified as pathogenic or likely pathogenic, 32 were present in ExAC. The majority of these variants had an allele count of 1 or 2 of about 121,412 total chromosomes (n = 20). Three variants, DSG2 c.1174G>A (p.Val392Ile), TNNT2 c.832C>T (p.Arg278Cys), and PKP2 c.419C>T (p.Ser140Phe), had a frequency that exceeded the MPAF for a pathogenic variant by 10-, 3-, and 4-fold, respectively. Each of these has been reevaluated by our laboratory: the DSG2 and PKP2 variants were reclassified as likely benign, and the TNNT2 variant was reclassified as VUS.

Of 87 Lynch syndrome (OMIM 120435) variants that have been classified as pathogenic or likely pathogenic, 14 were present in ExAC. None of these variants had an allele frequency exceeding the MPAF for each gene.

For genes associated with recessively inherited disorders, namely CFTR, GJB2, HBB, and MEFV, a total of 133 variants that have been classified as pathogenic or likely pathogenic were present in ExAC. As with breast cancer and Lynch syndrome genes, none of these variants had an allele frequency exceeding the MPAF for each gene.

Eighty-four percent of variants with frequencies above the MPAF in ExAC were classified as “benign/likely benign” (Table 2). Additionally, 20% of cardiac and 19% of Lynch syndrome gene variants originally classified as “VUS” (variant of uncertain clinical significance) occurred with ExAC frequencies above the estimated MPAF, making these worthy of reassessment.

Table 2 Classification of variants with MAF higher than MPAF in the analyzed gene sets

Discussion

The use of the estimated MPAF for each gene illustrated in this study represents a traceable paradigm for assessing the impact of variant occurrences in population databases as supportive evidence of nonpathogenicity. As demonstrated with BRCA and CFTR, the ExAC carrier frequency and ethnicity-specific distribution of classic, well-studied pathogenic variants in our data set matched the values reported for the general population, and ExAC did not overrepresent variation specific to ethnicities such as the AJ. Therefore, ExAC is not enriched for pathogenic variation in the specific disorders and genes evaluated, making it a useful data set to facilitate accurate classification outcomes.

Next, we used ExAC occurrences to identify variants in our database that could be reclassified in light of new evidence. Only 3 of 871 variants originally classified as pathogenic or likely pathogenic were present in ExAC at frequencies exceeding the estimated MPAF. Each of these three variants was in a gene associated with inherited cardiac disorders and had been originally classified conservatively prior to the availability of large population control databases such as ESP and ExAC. Therefore, ExAC served as useful supporting evidence to merit a reevaluation of the pathogenicity of these variants.

Lastly, a majority (84%) of variants that had frequencies above the estimated MPAF were appropriately classified as benign or likely benign. Specifically, 98% of variants in BRCA1 and BRCA2 genes and 95% of variants observed among the 4 genes associated with recessive disorders (CFTR, GJB2, HBB, and MEFV) that had frequencies above the estimated MPAF were classified as benign or likely benign. Variants in cardiac and Lynch syndrome genes were the two exceptions to this observation. Forty-three cardiac gene variants and 12 Lynch syndrome gene variants that were originally classified as VUS had an ExAC frequency exceeding the estimated MPAF. Ten of the 43 cardiac variants were found in ethnic groups that were not represented in ESP (Latino, East Asian, and South Asian), and they would not have been observed prior to the release of ExAC. Eighteen of the 43 cardiac gene variants had a frequency only one- to threefold above the estimated MPAF and could not be considered strong evidence for classification as benign. The remaining cardiac gene variants represent a subset associated with factors such as digenic inheritance, low penetrance, population specific variation, or potential role as disease modifiers, causing their classification to be conservative, even with significant occurrences of the variant in the control population.16,17

Nine of the 12 (75%) Lynch syndrome variants with an ExAC frequency exceeding the MPAF were in the PMS2 gene. Analysis of variants in PMS2 is challenging owing to the presence of numerous pseudogenes with high homology that preclude unequivocal differentiation between true variants and those originating in the pseudogenes.18,19,20 Because of a high rate of mismapping of next-generation sequencing alignments in pseudogene regions, reports that do not include long-range PCR or RNA analysis to specifically distinguish variant occurrence between the gene and pseudogene are not weighted in our classifications. ESP, 1000 Genomes, and ExAC do not specifically rule out pseudogene interference, which limits their usefulness for PMS2. Therefore, PMS2 variants present at high frequency with little supporting data are more likely to be classified conservatively as a VUS. As with cardiac genes, the remaining 3 Lynch syndrome variants had a frequency of one- to twofold above the MPAF, not reaching a threshold for unequivocal classification as benign.

A limitation of ExAC is the use of non-HGVS standard variant nomenclature. This increases the likelihood of false-negative observations. Although single-nucleotide variants are likely to be called accurately, heightened awareness in reviewing the annotation of other variants, such as deletions and insertions, is recommended. In conclusion, our observations support ExAC as a control cohort for classifying variants in clinical settings. We recommend that this database be evaluated across diverse sets of genes and disorders, mindful of underlying genetic complexities (such as pseudogenes) that pose challenges in deriving meaningful classifications using control data sets.

Disclosure

At the time this study was conducted, all authors were employed by Integrated Genetics, Laboratory Corporation of America® Holdings, and may hold stock of and/or stock options with LabCorp.