Introduction

The rapidly decreasing cost of whole-genome sequencing (WGS) and its ability to simultaneously analyze all human genes make it an attractive technique for genetic diagnosis. Early anecdotal reports describing the use of WGS or whole-exome sequencing (WES) have demonstrated the power of these new technologies to impact patient care.1,2,3 However, there exist significant barriers to the widespread application of WGS/WES in clinical medicine. Technical hurdles are being addressed in the marketplace, where competition will lead to faster, cheaper, and more accurate sequencing.4 Practical obstacles such as the time and effort required for analysis of clinically relevant variants, and return of complex results to patients, will require transition from traditional genetic testing approaches.

In a clinical environment, the most productive use of WGS/WES will likely be in the diagnosis of patients with distinctive features suggestive of a genetic disorder. In these individuals, there will also be genetic findings unrelated to the presenting symptoms, which are “incidental” or “secondary” findings, the aggregate of which has previously been termed the “incidentalome.”5 Arguably, the vast majority of an individual’s genetic variants will be unrelated to the presenting symptoms. Therefore, the problem of how to deal with incidental findings poses a formidable problem for clinicians and laboratorians.

In the pursuit of evidence-based genomic medicine, it will be vital to avoid overwhelming patients and physicians with genomic findings of dubious clinical value. Because the use of common single-nucleotide polymorphisms for prediction of common disease risk is still of limited value clinically,6 we have chosen to focus on monogenic disorders. Given that any individual has a very small a priori likelihood of being affected with an incidentally identified Mendelian disorder, few truly disease-causing genetic variants are expected per person. Therefore, any attempt to sift through genomic data for clinically relevant incidental findings will benefit from the recognition that the vast majority of variants bear negligible clinical significance. In other words, the identification of incidental findings should maximize specificity.

The challenge, therefore, is to synthesize collective knowledge about the genetic causation of disease and implement a practical, clinically oriented approach to the analysis of genome-scale variant data. We recently described a conceptual strategy for classifying genes into “bins” to facilitate informed consent, analysis, and return of incidental findings in a clinical setting.7 In our proposed system, the first step is to assign genes to bins according to features such as clinical utility/actionability (Bin 1) and clinical validity (Bin 2), and the potential to cause harm (Bin 2a, 2b, 2c; see Supplementary Materials and Methods online for details). The second step is to select the variants in a given individual that merit detailed review. The third step involves human review of the resulting subset of variants. Because a variant of uncertain significance (VUS), by definition, has no known clinical value, only known mutations or likely disease-causing novel mutations would be reported as incidental findings.

Variants identified by any sequencing method can be readily sorted based on their genomic location (whether they fall within a “binned” gene) and further annotated in terms of effect on the translated protein and predicted zygosity. For recessive disorders in which a single heterozygous mutation signifies the carrier state but is not considered disease causing, heterozygous variants would be moved into a separate category, “Bin R,” for reproductive implications. Our binning approach thus attempts to capture clinical differences between genes and organize them into a succinct number of categories to facilitate the pretest counseling and posttest reporting of suspected disease-causing variants when discovered incidentally during WGS/WES.

The goals of this endeavor were to evaluate the average number and type of potentially clinically relevant incidental findings and the impact of various “filters” on the output of the proposed analytic framework. We implemented a prototype of this strategy with an analysis of 80 whole genomes as a proof-of-concept, showing that multiple genomes can be efficiently analyzed to identify clinically relevant variants. This strategy can be refined with advances in our understanding of disease-causing and benign variants and offers an initial means of structured clinical assessment of WGS/WES data in a practical and efficient manner.

Subjects and Methods

Binning of OMIM genes

OMIM files (accessed June 2011) containing entries for 12,786 genes were scrutinized using OMIM, PubMed, Gene Reviews, and other resources. Genes were placed into Bin 3 (no clinical implications) if there was no indication of association with a Mendelian disorder, if only somatic mutations were reported, or if limited evidence of pathogenicity was available. Loci mapped by linkage, for which specific genes/mutations have not been documented, were also removed from consideration. A total of 2,016 genes associated with Mendelian disorders were identified, and their respective inheritance patterns were determined.

We made two judgments about genes involved in Mendelian disorders: (i) most genes do not have clinical utility/actionability in terms of definable preventive measures or treatment and (ii) for most Mendelian disorders, the potential for psychosocial harm caused by their incidental discovery is neither trivial nor highly concerning. Therefore, all 2,016 genes were initially placed in Bin 2b. We then manually reviewed each gene and applied a first-order approximation of the previously outlined criteria to provisionally place each gene into a bin. Genes for which there existed a reasonable suggestion of beneficial interventions were provisionally assigned to Bin 1. Genes having potentially significant risk of psychosocial harm were provisionally assigned to Bin 2c.

Genome sequences

WGS was performed by Complete Genomics (Mountain View, CA).8 Nineteen genomes were from patients enrolled in an institutional research board–approved research study for genetic evaluation of hereditary cancer susceptibility. Sixty-one genomes, representing presumably healthy individuals from diverse ethnic groups, were made publically available by Complete Genomics (http://www.completegenomics.com/sequence-data/download-data/). All genome coordinates are based on NCBI build 37.

Databases and computational analysis

Tables containing variant calls and annotations were stored in a PostgreSQL 8.4.3 database and joined with a table of allele frequencies generated from phase I consensus single-nucleotide polymorphism sites (2 May 2011) from the 1000 Genomes Project and small insertion/deletion calls from the 1000 Genomes pilot paper dataset (20 October 2010).9 To address differences in allele frequency (AF) between different populations, we used the highest minor AF reported for a given variant in any of the phase I population groups. A local instance of the Human Gene Mutation Database (HGMD)10 was created in another PostgreSQL database. Genomic coordinates for HGMD mutations were lifted over to NCBI Build 37 and converted to the Complete Genomics standard variant format. Variants matching with annotated disease mutations (“DM” variants) could then be readily identified in the 80 WGS samples.

A Python (2.6.5) script was written to iterate through variant files and select variants meeting the criteria outlined in the manuscript. Because Complete Genomics independently calls each allele, two separate lines in the variant file represent homozygous variants. The script collapses homozygous positions to a single line and indicates the zygosity of the variant in a separate field. For genes associated with autosomal recessive disorders, the script counts the number of variants meeting the predefined criteria and, if only one heterozygous variant exists, annotates that variant as signifying carrier status. The algorithm thus recognizes the potential for biallelic mutations (although it is important to note that further investigation is required to adjudicate whether the mutations are in cis or in trans).

Results

To demonstrate the applicability of the proposed analytic framework, we provisionally binned 2,016 genes implicated in Mendelian disorders, implemented a computational analytic pipeline, and explored the output from 80 whole-genome sequences. In this first attempt at binning the genome (Supplementary Table S1 online), 161 genes were assigned to Bin 1, 1,798 genes were assigned to Bin 2b, and 57 genes were assigned to Bin 2c. We emphasize that the binning of genes used in this study is provisional and used for illustrative purposes; the final population of bins will change over time and the choices made by our group and others may well differ.

We then explored parameters (AF cut-offs and effect of the mutation) used to select variants for further manual review ( Figure 1 ). The total number of variants selected ( Figure 1a ) is decreased 10–20 fold using AF filters of <5 or <1% ( Figure 1b ). Selecting for protein-altering variants (missense, nonsense, frameshift, and splice-site) further decreases this number ( Figure 1c ). However, the resulting numbers are still incompatible with the small chance of an individual having a Mendelian disorder; thus, the vast majority of variants with <5% AF must have minimal clinical consequences. When selecting only predicted truncating (nonsense, frameshift, and splice-site) variants, the number identified per patient is more consistent with realistic expectations ( Figure 1d ).

Figure 1
figure 1

Selection of variants based on allele frequency and predicted effect on the translated protein. (a) The initial informatics analysis resulted in an average of ~13,000 variants per person in Bin 1 genes, ~175,000 variants per person in Bin 2b genes, and ~9,000 variants per person in Bin 2c genes. (b) Limiting these variants to <5% allele frequency (AF) or <1% AF reduces these counts ~10–15 fold. (c) Restricting to protein-coding variants (missense, nonsense, frameshift, and splice-site) at <5% AF results in ~10 variants per person in Bin 1 genes and 100–200 variants per person in Bin 2b genes. At <1% AF there were ~5 variants per person in Bin 1 genes and 50–100 variants per person in Bin 2b genes. (d) Restricting only to truncating variants (nonsense, frameshift, and splice-site) results in only a small number of variants to be analyzed by the reviewer. Of note, the AF cut-off (<5 vs. <1%) does not dramatically affect the number of truncating variants that are selected.

Clearly, the sensitivity of the algorithm is decreased by the exclusion of rare missense mutations. To address this issue, we queried a local instance of HGMD for variants in these genes annotated as “DM” and identified 871 unique variants (771 missense) among the 80 whole genome sequences. On average there were 74 (range 61–106) “DM” variants per person ( Figure 2a ), which is strikingly similar to the report of the 1000 Genomes Project Consortium that individuals are heterozygous for 50–100 variants classified as disease causing in HGMD.9 Nevertheless, this large number of putatively disease-causing mutations is surprising, given the very low probability of a Mendelian disorder truly being present in any of the subjects.

Figure 2
figure 2

Analysis of mutations annotated as “DM” (disease mutation) in the Human Gene Mutation Database (HGMD). (a) All variants were queried against the HGMD to identify variants classified as “DM.” The numbers of rare (<5% allele frequency) missense variants in all bins and the numbers of HGMD “DM” variants per person are depicted as a box-whisker plot with whiskers indicating the 5th and 95th percentiles and outliers shown as filled circles. Homozygous variants are counted twice. (b) The overlap between the rare missense variants and “DM” variants is depicted as a Venn diagram. (c) The maximum 1000 Genomes Project allele frequencies were determined for each variant identified in the 80 whole genomes, and histograms of allele frequencies were generated for each person. These histograms were then combined to depict the average number of variants per person within each range of allele frequencies (depicted as a bar plot with standard deviations). (d) The maximum 1000 Genomes Project allele frequencies were determined for all “DM” variants in HGMD and graphed as a histogram. The inset shows the distribution for “DM” variants with ≥3% allele frequencies.

Because 88% of the unique “DM” variants were missense substitutions, we hypothesized that these variants could comprise a subset of the ~150 missense variants per person identified in Bins 1, 2b, and 2c with <5% AF ( Figure 2a ). Surprisingly, there was minimal overlap between the less common missense variants and “DM” variants detected in the 80 genomes ( Figure 2b ), and upon further review, 251 of the 871 unique “DM” variants (29%) had >5% AF. As a result, 78% of “DM” variants per person were >5% AF ( Figure 2c ). This finding is similar to that of a previous report that 74% of HGMD variants identified in 448 genes implicated in severe recessive diseases of childhood were variants with >5% AF.11 Although some of these variants could represent recessive alleles that are relatively frequent in certain populations, this explanation cannot account for the vast majority of these variants.

To further assess the pervasiveness of misleading database errors, we queried the 1000 Genomes Project allele frequencies and found allele frequencies for 1,811 of 74,694 “DM” variants (mostly substitution variants). Of these, 1,152 had <1% AF, 299 had 1–3% AF, 95 had 3–5% AF, and 265 (~0.35% of all “DM” variants) had >5% AF ( Figure 2d ). The small subset of variants with >5% AF comprised the majority of “DM” variants identified in a given genome sequence, simply because of the prevalence of these variants in the general population; in subsequent analyses we restricted HGMD variants to those with <5% AF.

The final algorithm selected variants according to the following criteria: (i) presence in a binned gene, (ii) <5% AF, and either (iii) annotation as a disease-causing mutation (“DM”) in HGMD or (iv) predicted to be truncating. Variants were further analyzed for zygosity to assign single heterozygous variants in recessive genes to a separate category for carrier status (Bin R). When we applied this algorithm to the 80 genomes, a total of 1,391 variants (906 unique variants) were selected. The per-person averages were 1.5 variants in Bin 1 genes, 6.4 variants in Bin 2b genes, 0.2 variants in Bin 2c genes, and 9.2 variants considered to imply carrier status for recessive disorders ( Table 1 and Supplementary Table S2 online).

Table 1 Numbers of variants selected by the informatics algorithm

The variants selected by the algorithm were then manually reviewed using a combination of OMIM, PubMed, Google Scholar, UCSC genome browser, and locus-specific databases to assess the evidence for pathogenicity or to reclassify the variants selected from the 80 genomes. Variants were reclassified if two variants identified in an individual likely comprised a single complex substitution allele or comprised a single common haplotype. In many cases, variants annotated as “DM” in HGMD were reclassified as VUS or likely polymorphisms. In other cases, the type of variant or its location within a specific transcript was inconsistent with a pathogenic effect. Zygosity was reassessed when it was determined that two variants were likely to be in cis or that only one of the selected variants in a gene was likely to be pathogenic; in these cases, the remaining heterozygous variant was reassigned to Bin R. Table 2 shows examples of binned variants, reclassified variants, and variants removed from consideration. Several detailed examples are described in the Supplementary Materials online. A list of binned variants from the 61 publically available genomes is available in Supplementary Table S3 online.

Table 2 Selected examples of selected variants, reclassified variants, and variants removed from consideration after human review

After review, 705 variants were removed from consideration and 71 were reassigned to carrier status. Differing percentages of variants were reclassified or removed from consideration in each bin category ( Figure 3a ) and lower proportions of novel variants were removed ( Figure 3b ) as compared with HGMD “DM” variants ( Figure 3c ). In all, 279 of the 358 unique variants removed from consideration were HGMD “DM” variants. After the final analysis, the revised per-person averages were 0.3 variants in Bin 1 genes, 2.6 variants in Bin 2b genes, 0.06 variants in Bin 2c genes, and 5.5 variants considered to imply carrier status ( Table 1 and Supplementary Table S2 online).

Figure 3
figure 3

Results of the manual review of variants selected by the informatics algorithm. After individual review of the 906 unique variants returned by the final informatics algorithm, 45% were reassigned or removed from consideration. The graphs depict the variants initially selected within a given “bin” and the stacked segments represent the proportions of those variants that were confirmed, reassigned, or removed after review. (a) All 906 unique variants, (b) the 392 rare truncating variants identified by the algorithm, and (c) the 514 rare “DM” (disease mutation) variants from the Human Gene Mutation Database (HGMD). A higher proportion of “DM” variants in each bin category were removed from consideration as compared with novel truncating variants.

Discussion

One barrier to the clinical use of WGS/WES is the legitimate concern that the burden of incidental findings will be overwhelming and lead to expensive and unnecessary follow-up despite little evidence that such variants have a strong role in causing disease.5,12 The approach we describe here demonstrates that analysis of WGS/WES data for clinically significant incidental variants is a tractable problem and that manageable numbers of variants can be selected for manual review.

Predictive value of variants identified in an incidental context

These results indicate that a small number of potentially disease-causing variants can be readily identified using a relatively straightforward process consisting of a priori gene classification, computational filtering, and database queries. As with any medical test, the analytic parameters used in this approach represent a trade-off between sensitivity and specificity. The choices outlined in our strategy reflect the impact of sensitivity and specificity on the calculation of the negative predictive value and positive predictive value. When the prior probability of disease is very low (e.g., the chance of having a Mendelian disorder that would be discovered incidentally), a test with reduced specificity will yield results with poor positive predictive value, whereas reduced sensitivity has negligible effect on the negative predictive value. We have therefore chosen to set a threshold that emphasizes specificity, in order to enrich for incidental findings that have a high likelihood of representing truly disease-causing mutations.

Because selection of rare missense variants in known disease genes results in a large number of VUSs, which provide no “actionable intelligence” for a clinician or patient, we excluded missense variants unless annotated as “DM” in HGMD. Various algorithms are used in research to predict the likely functional consequences of missense variants,13 but these programs are not clinically validated14 and in the absence of other supporting data they are generally insufficient to upgrade the status of a missense variant from VUS to likely pathogenic.15 The proposed framework also excludes synonymous variants as well as variants in the untranslated portions of the transcript and introns, which are most likely benign but might alter expression of the transcript or cause splicing abnormalities. Although the exclusion of novel missense, synonymous, and noncoding variants decreases the sensitivity of the approach, the lack of any clinically validated means of selecting the true-positive mutations from among the numerous variants of unknown (or no) clinical significance requires that we sacrifice some sensitivity to maintain high specificity. Inclusion of the HGMD substantially increased the sensitivity of the algorithm, but misannotated HGMD “DM” variants (which could represent errors in the medical literature or database curation errors) still constituted a major source of false-positive results.

Because there is no gold standard against which to compare our results, we cannot definitively estimate the clinical sensitivity or specificity of this analytic framework. However, even after manual inspection, the numbers of variants selected per person ( Table 1 ) indicate that a number of false positives remain. Some of the putative mutations identified in these 80 genomes could reflect sequencing artifacts, which would be revealed by follow-up Sanger sequencing. Many of the “DM” mutations remaining after manual curation may still represent VUSs or the milder end of the genotype–phenotype spectrum for a given disease. Perhaps more intriguingly, these findings could indicate a much greater degree of clinical variability and incomplete penetrance than has previously been appreciated in Mendelian disorders, which could dramatically impact the logistics of return of such information clinically. We anticipate that improvements in both clinical databases and predictive algorithms will allow us to further improve sensitivity and specificity over time.

Comparison to other reports

The average numbers of potentially clinically important variants identified in this article differ substantially from those of previous efforts to quantify the burden of clinically important incidental findings, and we feel that it represents a more realistic picture of what to expect from WGS in terms of clinical yield. These differences hinge largely on the assumptions made about disease causation and the framework we have chosen for identification of potentially clinically relevant variants. For example, whereas other groups have been inclined to report1 and/or interpret the possible clinical significance2 of variants that may modify risks for common diseases, we intentionally ignored common single-nucleotide polymorphisms that are weakly associated with multifactorial diseases. This decision is based on the lack of validated models for incorporating such information into medical care6 and the inconsistent interpretive results obtained in different labs,16 although the framework described here could be readily modified to include multifactorial risk calculations if warranted by advances in medical genetics and genomics. Pharmacogenomic variants can also be accommodated in the binning framework but were not considered here.

Cassa et al.17 estimated that individuals harbor ~2,100 substitution variants that might need to be returned to research subjects, which is four orders of magnitude higher than the 0–2 likely deleterious Bin 1 variants per person identified in this study. Possible explanations for this striking difference are the stringency with which genes are categorized as having clinical utility, and the thresholds for reporting variants. We argue that a relatively high evidentiary standard should be applied in order for a gene to be placed in Bin 1, such that the expected benefits gained by improved medical management would outweigh the possible harms that could arise from the revelation of such a finding in an incidental context. Using these criteria, most known disease genes are placed in Bin 2, in which patient choice is paramount in determining whether such incidental findings should be returned. In addition, we believe that only variants that are known to be pathogenic or highly likely to be pathogenic should be returned in an incidental context. The vast majority (~96%) of variants included in the Cassa et al.17 study originated from the HGMD, and our current data demonstrate that many of these variants are likely to represent false positives. It is difficult to discern how many of the ~2100 substitution variants per person reported by Cassa et al.17 are actually benign common polymorphisms, although approximately one-third of these variants were homozygous (suggesting a general population AF substantially >5%), indicating that the putative “reportable” variants identified by Cassa et al.17 include many variants that are not deleterious and should not be reported either in a research context or a clinical context.

MacArthur et al.18 reported a survey of loss-of-function variants in the 1000 Genomes Project data and identified many challenges of interpreting WGS/WES data with respect to generating annotations and predicting the effects of possibly truncating variants. A number of known and likely disease-causing loss-of-function mutations were identified among the subjects analyzed, most of which would represent carrier status for autosomal recessive disorders. Again, however, these results point out the difficulty of predicting pathogenicity of a given variant and the importance of review by a clinical molecular diagnostician. Similar to our results, one putative disease-causing mutation listed among the loss-of-function variants by MacArthur et al.18 was a nonsense mutation in LRRK2, which is of uncertain clinical significance because the reported mutations in LRRK2-related autosomal dominant Parkinson’s disease are missense substitutions.19

Challenges and future directions

The bin assignments described here should be viewed as a first step in the development of the binning process. The central concept of Bin 1 is that these findings have sufficient clinical actionability that no preference would be elicited regarding their return (in effect, the “duty to warn” would supersede the patient’s autonomy). This denial of the patient’s “right not to know” requires us to set a very high threshold regarding the types of findings that are appropriate for this category. On the other hand, our strategy places the majority of disease genes within Bin 2, where the potential risk for harm is the organizing principle, and the concept of individual preference is paramount. Thus, we feel that our strategy strikes a balance regarding patient choice and medical paternalism. A possible future addition might be to subcategorize Bin 2b into disease groups (such as cancer, cardiovascular/sudden death, neurodegenerative, and “other” Mendelian disorders) that would allow a more refined choice in a clinical context. Of course, the disadvantage of introducing more and more categories is that the clinical decision making could devolve into a gene-by-gene menu, which would impose prohibitive demands on clinicians and laboratories with respect to informed consent and analysis.

This provisional binning of genes is not meant to represent a final or definitive list, and we expect that there will be disagreement among experts about the criteria that define Bin 1 or Bin 2 genes, or the types of incidental findings that should routinely be returned to patients (and how they should be returned) during the course of a genome-scale diagnostic test.20 Furthermore, there may be differences of opinion regarding the classes of variants that should be reported to patients when discovered incidentally. Our evolving understanding of the genetic underpinnings of disease will necessitate a flexible approach to the structured clinical analysis of genome sequences, and an important future direction will be to establish more granular criteria for determining the novel variants that are selected for review based on the reported spectrum of disease-causing mutations. It is likely that the large numbers of genomes currently being sequenced worldwide will greatly facilitate the clinical interpretation of variants that are found in known disease genes. Better estimates of penetrance will inform the contexts in which certain variants are reported, and many variants previously reported as disease-causing may need to be carefully scrutinized to separate those that are truly deleterious from those that simply reflect normal population variation. Therefore, the value of a centralized and rigorously maintained clinical-grade database containing known variants and their significance cannot be overstated.

Conclusion

These results represent a proof-of-concept demonstration of a structured clinical analysis of incidental findings in genome-scale sequence data that can serve as a general model for assessment of WGS/WES incidental findings. This framework makes the identification clinically relevant incidental findings much more tractable, as it reduces the number of variants requiring hand curation to a manageable number (10–20), and it should prove robust to differing bin structures or gene assignments. We expect that consensus will be possible regarding the bin assignment of many genes,20 and we note that as of this publication there are ongoing discussions and debate among genetics professionals regarding these issues. Advances in medical genetics will also mandate a periodic re-evaluation of these bin assignments. Nevertheless, we anticipate that assignment of genes to bins based on clinical utility and stratified based on the risk of psychosocial harm will enable efficient analysis of data as well as facilitate pretest informed consent, posttest counseling, and return of results as we enter the era of clinical genomics. Further research on the implementation of this analytic framework and the responses of individuals to incidental findings is under way.

Disclosure

The authors declare no conflict of interest.