Using high-resolution variant frequencies to empower clinical genome interpretation

Purpose Whole-exome and whole-genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognized as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Methods We present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets. Results Using the example of cardiomyopathy, we show that our approach reduces by two-thirds the number of candidate variants under consideration in the average exome, without removing true pathogenic variants (false-positive rate<0.001). Conclusion We outline a statistically robust framework for assessing whether a variant is “too common” to be causative for a Mendelian disorder of interest. We present precomputed allele frequency cutoffs for all variants in the ExAC dataset.

The maximum tolerated allele count (AC) was computed as the AC occurring at the upper bound of the one-tailed 95% confidence interval (95%CI AC) for the established maximum credible allele frequency, given the observed allele number (AN). Because the sample is drawn without replacement, the count strictly follows a hypergeometric distribution, but it can be modeled as binomial because the sample is much smaller than the population from which it is drawn. For ease of computation, we approximate this with a Poisson distribution. In R, this is implemented as max_ac = qpois(quantile_limit, an*af), where max_ac is the 95%CI AC, quantile_limit is 0.95 (for a one-sided 95%CI), an is the observed allele number, and af is the maximum credible population allele frequency.
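For readers without R, the same calculation can be sketched in Python using only the standard library. This is an illustrative re-implementation rather than the code used for the analysis: the Poisson quantile is computed by summing the probability mass function directly instead of calling R's qpois, and the function names are chosen here to mirror the R expression.

```python
import math


def qpois(p, lam):
    """Smallest integer k with P(X <= k) >= p for X ~ Poisson(lam).

    Hand-rolled stand-in for R's qpois (an assumption of this sketch).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if lam <= 0.0:
        return 0
    # Generous upper limit on k so the loop cannot run forever.
    max_k = int(lam + 40.0 * math.sqrt(lam) + 100.0)
    cdf = 0.0
    for k in range(max_k + 1):
        # Poisson probability mass at k, evaluated in log space for stability.
        cdf += math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        if cdf >= p:
            return k
    raise RuntimeError("quantile not reached; p may be too close to 1")


def max_tolerated_ac(af, an, quantile_limit=0.95):
    """Upper bound of the one-sided 95%CI of allele count for a given AF and AN."""
    return qpois(quantile_limit, an * af)


# AN = 100,000 and AF = 8.17e-6 are the values used as an example later in the text.
print(max_tolerated_ac(af=8.17e-6, an=100_000))
```

A variant whose observed AC exceeds the returned value would be treated as too common for the assumed maximum credible allele frequency.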

Defining maximum credible AF for recessive diseases
For a biallelic condition caused by only a single gene, the prevalence can be determined from the combined frequency of all possible pairs of alleles in that gene:

$$\text{prevalence} = \sum_{i}\sum_{j} p_i \, p_j \, \pi_{i,j}$$

where $\pi_{i,j}$ is the penetrance of two alleles $i$ and $j$ with frequencies $p_i$ and $p_j$, and $i$ and $j$ may be the same allele. This is generalisable in that all variants may be included in the summation, as $\pi_{i,j}$ will be 0 for benign variants, and those variants will not contribute to disease.
For a condition caused by multiple genes, the above equation must then be summed across contributing genes $g$:

$$\text{prevalence} = \sum_{g}\sum_{i}\sum_{j} p_i \, p_j \, \pi_{i,j}$$

where, within each gene $g$, $\pi_{i,j}$ is the penetrance of alleles $i$ and $j$.
Since the penetrance cannot be separately estimated for each combination of alleles (though it is typically high for recessive conditions), we simplify by estimating penetrance as constant for the condition, and calculating only for pathogenic alleles, reducing the equation to:

$$\text{prevalence} = \pi \sum_{g}\sum_{i}\sum_{j} p_i \, p_j$$

where $\pi$ is the penetrance for any pair of pathogenic alleles and $i$ and $j$ are both pathogenic alleles.
Since we are now treating penetrance as constant for each gene this simplifies further: for a given gene containing $n$ pathogenic alleles, the frequency of individuals who are homozygous or compound heterozygous is given by the square of the combined allele frequencies of all contributing alleles:

$$\text{prevalence} = \pi \sum_{g}\left(\sum_{i=1}^{n} p_i\right)^{2}$$

If we know the AF of a single variant $v$, and the contribution that variant makes to disease, then the combined allele frequencies of all contributing alleles ($\sum_{i=1}^{n} p_i$) could in turn be represented as $p_v / c_v$, where $c_v$ is the proportion of disease that is attributable to variant $v$.
Substituting this into our equation for prevalence yields:

$$\text{prevalence} = \pi \sum_{g}\left(\frac{p_v}{c_v}\right)^{2}$$

For a specific variant $v$ in a specific gene, with known allelic and genetic contribution (i.e. the proportion of cases $c_G$ attributable to the gene that contains variant $v$), prevalence can be expressed as:

$$\text{prevalence} = \frac{\pi}{c_G}\left(\frac{p_v}{c_v}\right)^{2}$$

Finally, we can rearrange this to give an upper bound for the maximum credible AF of an individual causative variant:

$$\text{maximum credible AF} = \sqrt{\text{prevalence}} \times c_v^{\max} \times \sqrt{c_G^{\max}} \times \frac{1}{\sqrt{\pi}}$$

where $c_G^{\max}$ is the maximum proportion of cases attributable to any one gene and $c_v^{\max}$ is the maximum proportion of cases attributable to any one variant within that gene.
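The upper bound for a recessive condition, namely the maximum allelic contribution multiplied by the square root of (prevalence × maximum genetic contribution / penetrance), can be evaluated with a few lines of Python. The sketch below is illustrative only: the function and parameter names are invented for this example, and the input values are hypothetical rather than taken from any disease discussed here.

```python
import math


def max_credible_af_recessive(prevalence, max_allelic_contribution,
                              max_genetic_contribution, penetrance):
    """Upper bound on the AF of a single causative variant (recessive model).

    All arguments are proportions in (0, 1]; names are hypothetical.
    """
    for name, value in [("prevalence", prevalence),
                        ("max_allelic_contribution", max_allelic_contribution),
                        ("max_genetic_contribution", max_genetic_contribution),
                        ("penetrance", penetrance)]:
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    # sqrt(prevalence) * allelic contribution * sqrt(genetic contribution) / sqrt(penetrance)
    return max_allelic_contribution * math.sqrt(
        prevalence * max_genetic_contribution / penetrance)


# Hypothetical inputs: prevalence 1 in 10,000, one gene explaining at most 25%
# of cases, one variant explaining at most 10% of that gene's cases, full penetrance.
af = max_credible_af_recessive(1e-4, 0.10, 0.25, 1.0)
print(f"maximum credible AF: {af:.1e}")
```

Lowering the assumed penetrance raises the bound, since a less penetrant variant can be more common in the population for the same disease prevalence.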

Curation of a high frequency PCD variant
NME8 NM_016616.4:c.271-27C>T, which is reported as Pathogenic in ClinVar, is found on 2306/120984 ExAC chromosomes. This variant was initially reported as pathogenic on the basis of two compound heterozygous cases identified when specifically searching for NME8 variants in a set of patients, and was found in 2/196 control chromosomes (ref. 1). We further note that NME8 has not otherwise been associated with PCD, shows no evidence of missense or loss-of-function constraint, and that this splice variant affects a non-canonical transcript. During our curation exercise we found that this variant meets none of the ACMG criteria for assertions of pathogenicity, and we therefore reclassified it as a VUS.

Dealing with penetrance
It is often difficult to obtain accurate penetrance information for reported variants, and it is also difficult to know what degree of penetrance to expect or assume in newly discovered pathogenic variants. In this work we uniformly apply a value of 50% penetrance for inherited cardiac conditions (i.e. assuming variant penetrance is no lower than 50%), equivalent to that reported for our HCM example variant and the lowest we found reported for any of our examples across our studied disorders. We recognise several other approaches that can be used to deal with penetrance. These include: setting a penetrance level equivalent to the minimum that is 'clinically actionable' for a disorder; lowering the penetrance if reduced penetrance is expected in a family; or using a two-tiered approach, initially searching for a high-to-moderate penetrance variant but allowing for a lower-penetrance variant in a second pass. We believe that the ease of re-calculating our "maximum credible population allele frequency" lends itself to any of these approaches. We provide an online calculator to facilitate the exploration of these parameters (http://cardiodb.org/alleleFrequencyApp). If large case and control populations are available for a disease and the disease prevalence is known, these can be used to estimate penetrance (ref. 2).
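One way such a penetrance estimate could be formed is by Bayes' rule: P(disease | carrier) = P(disease) × P(carrier | disease) / P(carrier). The Python sketch below assumes this approach, which may differ in detail from the cited method; the function name, parameter names, and example numbers are all hypothetical, and the control frequency is used as a stand-in for the population carrier frequency, which is reasonable only when the disease is rare.

```python
import math


def estimate_penetrance(prevalence, carrier_freq_cases, carrier_freq_population):
    """Bayes' rule point estimate of penetrance, capped at 1.

    prevalence               P(disease) in the population
    carrier_freq_cases       P(carrier | disease), from the case series
    carrier_freq_population  P(carrier), approximated from controls
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    if not 0.0 <= carrier_freq_cases <= 1.0:
        raise ValueError("carrier_freq_cases must be in [0, 1]")
    if not 0.0 < carrier_freq_population <= 1.0:
        raise ValueError("carrier_freq_population must be in (0, 1]")
    estimate = prevalence * carrier_freq_cases / carrier_freq_population
    # Sampling noise can push the ratio above 1; a probability cannot exceed 1.
    return min(estimate, 1.0)


# Hypothetical inputs: prevalence 1 in 500, variant carried by 2% of cases
# and by 8 in 100,000 people in the general population.
print(estimate_penetrance(1 / 500, 0.02, 8e-5))
```

This is a point estimate only; with small carrier counts the uncertainty can be substantial and a confidence interval would be needed in practice.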

Treatment of singletons and other populations
It is worth considering whether a single observation in a reference sample should ever be treated as incompatible with disease. Using the approach outlined above, it can be inferred that an ExAC AC=1 would be considered incompatible with a true population allele frequency <2.9×10⁻⁶ (with 95% confidence). For a penetrant disease with a prevalence of 1:1,000,000, the probability of observing a specific causative allele in ExAC is <0.01, even if the disease is genetically homogeneous with just one causative variant. In practice, however, we feel that there are few, if any, diseases that are extremely rare yet have sufficiently well-characterized genetic architecture to discard singleton variants from a reference sample. Therefore, for singletons (variants observed exactly once in ExAC), we set the filtering allele frequency to zero.
We also note that occasionally a variant is seen in individuals falling under the Finnish or "Other" population categories in ExAC, and is a singleton or absent in all five continental populations. For these variants, the filtering allele frequency is set to zero. Because the Finnish are a bottlenecked population, disease-causing alleles may reach frequencies that would be impossible in large outbred populations. Similarly, because we have not assigned ancestry for the "Other" individuals, it is difficult to assess the population frequency of variants seen only in this set of individuals. Users are left to judge whether variants that would not be filtered on the basis of frequency in the five continental populations, but that are sufficiently frequent in Finnish or "Other" populations, should be removed from consideration according to the specific circumstances.

Description of the filtering allele frequency
We define the "filtering allele frequency" for a variant, or af_filter, as the highest true population allele frequency for which the upper bound of the 95% confidence interval of allele count under a Poisson distribution is still less than the variant's observed allele count in the reference sample. It functions as the equivalent of a lower-bound estimate for the true allele frequency of an observed variant: if the filtering allele frequency of a variant is at or above the maximum credible allele frequency for a disease, then the variant is considered too common to be causative of the disease. In the example, the highest allele frequency that gives a 95%CI AC of 2 when AN=100,000 is approximately 8.17e-6. Instead of solving exactly for such values, which would require inverting the cumulative distribution function of the Poisson distribution, we derive a numerical approximation in two steps:
1. For each variant in consideration, we use R's uniroot function to find an AF value (though not necessarily the highest AF value) for which the 95%CI AC is one less than the observed AC.
2. We then loop, incrementing by units of millionths, and return the highest AF value that still gives a 95%CI AC less than the observed AC.
In order to pre-compute af_filter values for all of ExAC (version 0.3.1), we apply this procedure to the AC and AN values for each of the five major continental populations in ExAC, and take the highest result from any population. Usually, this is from the population with the highest nominal allele frequency. However, because the tightness of a 95% confidence interval in the Poisson distribution depends upon sample size, the stringency of the filter depends upon the allele number (AN). The stringency of the filter therefore varies appropriately according to the size of the sub-population in which the variant is observed and the sequencing coverage at that site, and af_filter is occasionally derived from a population other than the one with the highest nominal allele frequency.
For this analysis, we used adjusted AC and AN, meaning variant calls with GQ≥20 and DP≥10.
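The filtering allele frequency can also be approximated without R. The Python sketch below is illustrative and swaps in a different numerical technique: instead of uniroot followed by an incrementing loop, it bisects directly on the Poisson cumulative distribution function, using the fact that the 95%CI AC is below the observed AC exactly when P(X ≤ AC − 1) ≥ 0.95. Function names and the population labels in the usage example are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with terms evaluated in log space."""
    if lam <= 0.0:
        return 1.0
    log_lam = math.log(lam)
    return math.fsum(math.exp(i * log_lam - lam - math.lgamma(i + 1))
                     for i in range(k + 1))


def af_filter(ac, an, ci=0.95, iterations=200):
    """Highest AF for which the one-sided 95%CI AC stays below the observed AC."""
    if ac <= 1 or an <= 0:
        return 0.0  # absent variants and singletons are set to zero, as in the text
    # Invariant: cdf(ac - 1; lo) >= ci and cdf(ac - 1; hi) < ci.
    lo, hi = 0.0, float(ac)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if poisson_cdf(ac - 1, mid) >= ci:
            lo = mid
        else:
            hi = mid
    return lo / an  # convert the expected count back to a frequency


def af_filter_across_populations(counts):
    """Return (population, af_filter) for the population with the highest value.

    counts maps a population label to an (AC, AN) tuple.
    """
    best_pop, best_af = None, 0.0
    for pop, (ac, an) in counts.items():
        value = af_filter(ac, an)
        if value > best_af:
            best_pop, best_af = pop, value
    return best_pop, best_af


# AC = 3 at AN = 100,000 gives a value close to the approximately 8.17e-6 quoted in the text.
print(af_filter(3, 100_000))
# Hypothetical populations: POP_A has the higher nominal AF, yet POP_B sets the filter.
print(af_filter_across_populations({"POP_A": (2, 1_000), "POP_B": (20, 20_000)}))
```

Because bisection returns the threshold to floating-point precision rather than to the nearest millionth, results can differ slightly from precomputed values produced by the two-step procedure.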

SUPPLEMENTARY TABLE S1
Variants previously reported as causative of HCM, either in ClinVar or in a clinical series of 6179 HCM cases, but that were observed in ExAC above the maximum tolerated allele count for HCM (AC>9 globally) were manually curated according to the ACMG guidelines for the interpretation and classification of sequence variants.