Main

Use of next-generation sequencing (NGS) technologies in clinical medicine and research is increasing, with tens of thousands of human genomes now completed.1 The decreasing cost of generating whole-genome or whole-exome sequence data using NGS platforms means that it may soon be less expensive to routinely sequence the entire genome than to use many more targeted assays.2 Thus, we are rapidly approaching a time when genome-scale sequencing tests may be analytically robust, clinically relevant, and affordable enough to replace the single-gene tests currently used in clinical practice.

These new technologies allow for a vast proportion of an individual’s DNA to be queried, raising the issue of genomic incidental findings (IFs—unanticipated information discovered in the course of testing or medical care).3 IFs will inevitably arise when NGS is not restricted to portions of the genome that are relevant to the indication for the test. Individual genomes that have been evaluated using whole-genome sequencing indicate that there are typically over 4 million variants observed per person compared with the reference sequence.4,5,6,7,8,9 Each person carries an average of 50–100 heterozygous variants classified by the Human Gene Mutation Database as causing inherited disorders, although emerging information suggests that most of these are incorrect assertions of pathogenicity and/or errors of curation.10,11 Without thoughtful interpretation, it would be a daunting task to return this information to patients in ways that would be helpful and accurate.12

Although the number of clinically relevant IFs that could result from each sequence generated remains unclear, questions about the capacity of laboratories, providers, and health systems to manage the interpretation and presentation of findings to patients remain pertinent.3,13 Because of the unprecedented ability of NGS to identify variants across the entire genome, most IFs will have unknown clinical significance. A process and criteria are needed to organize the return of results from IFs to enable end users to focus on the small number of clinically relevant variants.

Berg and colleagues14 proposed a framework for a priori categorization of genomic loci into three “bins” for management of IFs. Any given individual will have few or no clinically actionable IFs (bin 1).11 These results should have well understood and validated associations between genotype and phenotype, and be clinically actionable, meaning that interventions could delay or prevent manifestations, or health outcomes could be improved through early identification and treatment. IFs that are clinically valid but not actionable (bin 2) are likely to be slightly more numerous.11 These results have well understood and validated associations between genotype and phenotype but are not immediately clinically actionable. Examples include variants relevant to pharmacogenomics (which become actionable only when a relevant drug is prescribed), most single-gene disorders, and risk alleles such as APOE genotype for Alzheimer disease risk (for which there is not yet effective prevention or treatment). Although some patients may be interested in receiving information about these IFs, Berg and colleagues argued that their return should not be automatic because the knowledge would not definitely improve health outcomes, and some results could result in psychosocial harms. Most genomic loci will fall into the third category of uninformative IFs (bin 3), where there is an insufficient knowledge base on the association between genotype and phenotype. These IFs would typically not be considered for return of results in a clinical setting.

Several initiatives have begun to guide decisions about which IFs to return to patients or research participants from NGS platforms. At least three initiatives have been sponsored by the National Human Genome Research Institute, including the Electronic Medical Records and Genomics (eMERGE) Network,15 the Clinical Sequencing Exploratory Research consortium, and the Return of Results consortium. Professional organizations have established committees to assess NGS platforms (e.g., the College of American Pathologists16 and the American College of Medical Genetics and Genomics2).

For the past 8 years, the Centers for Disease Control and Prevention Office of Public Health Genomics has supported the independent Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Working Group to develop evidence-based recommendations on the use of genomic tests in clinical practice and disease prevention.17 The EGAPP Working Group sponsored and guided the Knowledge Synthesis Center to develop a process to determine which gene/condition pairs belong in the category of clinically actionable IFs in adults. The Knowledge Synthesis Center collaborated with several members of the Working Group to extend the framework proposed by Berg and colleagues by clarifying and operationalizing criteria for determining clinically actionable IFs. The method includes a structured process to identify evidence, consider the evidence against a priori exclusion and inclusion criteria, and present evidence-based considerations to decision makers.

Materials and Methods

Method development

We adapted the conceptual framework of population screening, which includes ethical, procedural, and statistical considerations. A higher standard is needed in the case of screening because the pretest probability is low, which affects the likelihood that a detected mutation is a true health-impairing variant. There must be at least moderate certainty that screening-based identification of the gene/condition and subsequent management have been shown to have a net benefit—that health benefits will outweigh potential harms.18
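
The effect of a low pretest probability can be illustrated with Bayes' rule. The following is a minimal sketch and not part of the method itself; the function name and the sensitivity, specificity, and pretest probability values are hypothetical, chosen only to show the direction of the effect.

```python
def positive_predictive_value(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability that a positive finding is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, for illustration only.
SENS, SPEC = 0.99, 0.99

# Same test, two different pretest probabilities (both values are assumptions).
diagnostic = positive_predictive_value(prior=0.30, sensitivity=SENS, specificity=SPEC)
screening = positive_predictive_value(prior=0.001, sensitivity=SENS, specificity=SPEC)

print(f"indication-driven testing (pretest probability 30%):  PPV = {diagnostic:.3f}")
print(f"screening context (pretest probability 0.1%):         PPV = {screening:.3f}")
```

With identical test characteristics, the positive predictive value falls sharply when the pretest probability is low, which is why a finding made in a screening context calls for a higher evidentiary standard.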

Typically, screening tests should be held to a very high standard, with clinical recommendations made only after a thorough, systematic review of the available evidence.19,20 A systematic review of each gene mutation would not be an effective or efficient approach, due to the sizeable, very active, but largely developmental clinical research base. Therefore, pragmatic choices were made to increase the feasibility and scalability of the method to the entire genome, such as limiting the sources that would be considered (i.e., a reproducible but not comprehensive search for evidence).21 Our approach is influenced by the principles of systematic reviews, including (i) the need for a priori objective criteria on which to make decisions; (ii) an emphasis on transparency and reproducibility of decision making; (iii) clear and transparent communication of methods and results; and (iv) clear procedures to allow results to be revised as new evidence emerges. To develop these criteria, we considered published articles on population screening in general, newborn screening, or screening for specific genetic conditions.22,23,24,25,26,27,28,29,30,31,32

Our method is composed of three distinct stages. In stage I, we define rule-out criteria to rapidly eliminate genes/conditions that do not meet a baseline threshold. In stage II, genes/conditions that pass this initial threshold are subjected to a more resource-intensive search of the literature, with the relevant evidence synthesized into a short summary. In stage III, decision makers consider the summary for final selection of clinically actionable IFs.

Method evaluation

Rater agreement in stage I. We evaluated the same 99 conditions selected by Green et al.12 to assess the method for stage I. The 99 conditions were primarily selected from the GeneTests website (http://www.ncbi.nlm.nih.gov/sites/GeneTests/), based on the frequency of laboratory testing, among other considerations. Each condition was evaluated independently by two raters (K.A.B.G., J.A.W., E.M.W., K.A.S., H.S.F., D.C.-O., or J.S.B.) and categorized as retained or ruled out. We calculated overall agreement and Cohen’s κ33 as measures of inter-rater agreement. We adjudicated disagreements by discussion among raters to clarify reasons for disagreement, and erred toward “retain” if there was ambiguity.

External validity. We compared the categorization of the 99 conditions for stage I (retain/rule out) to the findings of an expert-based approach, which we refer to as the Green method.12 In the Green method, 16 specialists (including J.S.B.), among them clinical geneticists and molecular laboratory directors, evaluated 99 conditions as to whether they would routinely disclose or not disclose the information when discovered as an IF. Specialists were asked to assume that the sequencing was perfectly accurate, family history was not available, the patient had no previously recognized clinical features consistent with the disease variant under consideration, the patient’s sex was known, and the patient was an adult of unknown exact age. Each specialist recorded his or her response independently (with no explicit guidelines regarding clinical actionability). The overall results were reported as the proportion of specialists who chose to “disclose” the finding, which ranged from 50% to 100% across the 99 conditions. We infer that conditions with a high level of concordance among these experts represent those with the highest degree of clinical actionability and thus agreement that such information would be important to return as IFs.

Clinical scenario

We assume an adult patient has received whole-genome sequencing or whole-exome sequencing as part of the clinical care for an unspecified indication, and the sequencing has acceptable analytic validity. Sex is known, but not the patient’s specific age, or any other personal or familial medical history. The patient is currently undiagnosed with the condition under consideration. This criterion does not imply that the patient is asymptomatic. Asymptomatic patients are either “disease-free” or the condition is present, but the physical signs are “undetected” by the patient or the clinician (e.g., tumors or precursor lesions may be present years before cancer symptoms appear). Symptomatic patients can also have symptoms that are nonspecific, so the genetic condition is “unrecognized.” For instance, a patient may be aware of and receive treatment for high cholesterol but may be unaware that he or she carries a mutation that causes familial hypercholesterolemia. All of these scenarios—disease-free, undetected disease, and unrecognized disease—are part of a spectrum of undiagnosed patients.

Results

Stage I

The purpose of stage I is to eliminate from further consideration genes/conditions that do not meet a minimal threshold of clinical actionability. We expect a large majority of genes/conditions will be ruled out in this stage. Stage I has three predefined criteria addressed in five questions (Table 1). The three criteria are (i) actionability, (ii) moderate or high penetrance, and (iii) association with a significant health condition. The criteria for stage I are documented on the Binning Dashboard (Supplementary Table S1 online). If any of the three criteria is not met, the gene/condition is excluded from further consideration. For each gene/condition, it is most efficient to begin with the criterion that the gene/condition is suspected not to meet.

Table 1 Criteria for stages I and II

References used to assess the criteria include existing guidelines, systematic reviews, or expert-derived guidance (Table 2). These resources are identified using a predefined method for searching existing databases for related materials. Only one reference is required to meet any given criterion. Therefore, the search procedure is focused and not necessarily comprehensive.

Table 2 Search method to identify references of evidence-based practice guidelines, systematic reviews, or expert consensus–based practice guidelines

This stage requires high sensitivity, so the gene/condition should be retained when it is uncertain if the minimum threshold is met. Two reviewers assess each gene/condition independently. Discrepant findings are adjudicated by consensus, with additional input from a third party, as needed.

Criterion 1: actionability. For disease-free patients, actionability implies an effective intervention to delay or prevent clinical manifestations or reduce disease impact. For patients with undetected disease, actionability includes screening for earlier diagnosis and to increase the likelihood of less burdensome disease. For symptomatic (but clinically unrecognized) patients, actionability includes alterations in patient management proven to provide benefit. Other actions include family management to improve outcomes for family members (referral to genetic services or reproductive decision making alone is not sufficient) and avoidance of circumstances for the patient (e.g., behavioral modifications, diet, exercise, smoking cessation).

Sufficient support of actionability is derived from a practice guideline, expert-derived guidance, or a systematic review. If no such guidance or review exists, our process defines the gene/condition pair as not actionable. If a guideline or systematic review is available but recommends care only once symptoms have manifested, then the actionability criterion has not been met for an undiagnosed adult. If clinical signs and symptoms would always be recognized in childhood or early adolescence, the condition is not considered actionable in adults.

Criterion 2: penetrance. At least one variant in the gene(s) under consideration should have high or moderate penetrance or risk in any population. We selected an arbitrary threshold for the penetrance of either 40% or a measure of relative risk of 2 or greater. Although the absolute penetrance for a condition may be low, the relative risk compared with the general population may still be significant. For example, the penetrance of C282Y homozygosity in the HFE gene is 13.5% for the development of hereditary hemochromatosis;34 however, the odds ratio for liver disease is 3.9 and up to 11 for hepatocellular carcinoma.35 Penetrance data from studies in affected individuals may be used if data from an unselected population are not available. If the data regarding penetrance are unavailable, and all other criteria are met, the gene/condition should be retained to stage II.

Criterion 3: significance/burden of condition. The condition should cause significant morbidity or mortality in adults. This does not include normal human variation such as eye color, hair color, skin color, or body size measurements such as height. We do not address infertility alone or reproductive decision making as part of this approach, recognizing that these could be important in some settings.
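
The three stage I criteria can be read as a simple rule-out procedure. The following is a minimal sketch under stated assumptions, not the Binning Dashboard itself: the class and field names are hypothetical, the 40% threshold is treated as inclusive, and the flags in the usage example (guidance exists, actionable in adults) are assumed for illustration. Only the two numeric thresholds and the HFE figures come from the text.

```python
from dataclasses import dataclass
from typing import Optional

PENETRANCE_MIN = 0.40     # threshold stated in the text (inclusivity is an assumption)
RELATIVE_RISK_MIN = 2.0   # "relative risk of 2 or greater"


@dataclass
class GeneCondition:
    """Hypothetical record for one gene/condition pair."""
    name: str
    has_supporting_guidance: bool          # guideline, expert guidance, or systematic review
    actionable_in_undiagnosed_adult: bool
    significant_adult_burden: bool
    penetrance: Optional[float] = None     # None means no data available
    relative_risk: Optional[float] = None  # any relative measure of risk; None means no data


def stage_one(gc: GeneCondition) -> str:
    """Return 'retain' or 'rule out' for a single gene/condition pair."""
    # Criterion 1: actionability, supported by at least one reference.
    if not (gc.has_supporting_guidance and gc.actionable_in_undiagnosed_adult):
        return "rule out"
    # Criterion 3: significant morbidity or mortality in adults.
    if not gc.significant_adult_burden:
        return "rule out"
    # Criterion 2: penetrance. Missing data is not a reason to rule out.
    if gc.penetrance is None and gc.relative_risk is None:
        return "retain"
    meets_penetrance = gc.penetrance is not None and gc.penetrance >= PENETRANCE_MIN
    meets_risk = gc.relative_risk is not None and gc.relative_risk >= RELATIVE_RISK_MIN
    return "retain" if (meets_penetrance or meets_risk) else "rule out"


# Usage: HFE C282Y homozygosity, using the penetrance and odds ratio quoted above.
hfe = GeneCondition(
    name="HFE / hereditary hemochromatosis",
    has_supporting_guidance=True,           # assumed for illustration
    actionable_in_undiagnosed_adult=True,   # assumed for illustration
    significant_adult_burden=True,          # assumed for illustration
    penetrance=0.135,
    relative_risk=3.9,                      # odds ratio for liver disease
)
print(hfe.name, "->", stage_one(hfe))
```

In this sketch the order of the checks does not change the outcome; in practice, starting with the criterion most likely to fail saves reviewer time. Ambiguous cases are resolved toward "retain" because stage I is meant to be highly sensitive.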

Agreement in stage I

The inter-rater agreement was substantial, with an overall agreement of 86% (85 of 99 conditions) and a κ of 0.70. Both reviewers retained 55 and both ruled out 30 of the 99 conditions. For the 14 conditions where the reviewers disagreed, the reasons were as follows: four disagreements about the presence or absence of a guideline, one disagreement about the overall actionability, three disagreements about the actionability in adults, five disagreements about whether the penetrance was high/moderate or low, and one disagreement about multiple areas. After resolving the disagreements, 62 genes/conditions were retained to stage II.
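
The reported counts (55 conditions retained by both raters, 30 ruled out by both, and 14 disagreements) are enough to sketch how overall agreement and Cohen's κ are computed from a 2×2 table. The split of the 14 disagreements between the two raters is not reported, so the even 7/7 split below is an assumption, as is the function name.

```python
def agreement_and_kappa(both_retain: int, both_rule_out: int,
                        only_rater1_retains: int, only_rater2_retains: int):
    """Return (observed agreement, Cohen's kappa) for two raters and two categories."""
    n = both_retain + both_rule_out + only_rater1_retains + only_rater2_retains
    observed = (both_retain + both_rule_out) / n
    # Marginal proportion of "retain" decisions for each rater.
    r1_retain = (both_retain + only_rater1_retains) / n
    r2_retain = (both_retain + only_rater2_retains) / n
    # Agreement expected by chance, given those marginals.
    expected = r1_retain * r2_retain + (1 - r1_retain) * (1 - r2_retain)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Counts from the text; the 7/7 split of the 14 disagreements is assumed.
observed, kappa = agreement_and_kappa(55, 30, 7, 7)
print(f"observed agreement = {observed:.3f}, kappa = {kappa:.2f}")
```

Under the even-split assumption, κ comes out at about 0.70. Because κ depends on each rater's marginal totals, a different split of the disagreements would shift it slightly.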

External validity of stage I

The results of the stage I evaluation compared favorably with the expert-based Green method (Table 3). In most (88%) conditions that we retained after stage I, a high percentage (≥88%) of experts also agreed to recommend disclosure. Likewise, most conditions (92%) that we ruled out after stage I similarly had lower agreement (<70%) among experts to disclose. When examining 37 conditions with only moderate agreement (75–81%) between experts in the Green method, we retained an intermediate proportion (46%), and 9 of 14 disagreements between reviewers in our method were among these 37 conditions.

Table 3 External validity as compared with expert opinion-based method by Green et al. (ref. 12)

We ruled out six conditions that a high percentage (≥88%) of experts selected to disclose in the Green method. Four of these conditions (retinoblastoma, neurofibromatosis 1, medium-chain acyl-CoA dehydrogenase deficiency, and isovaleric acidemia) have predominantly childhood onset and are unlikely to present in an undiagnosed adult. One condition is not actionable (Tay–Sachs disease), and one condition is not actionable in adults (Beckwith–Wiedemann syndrome). Therefore, none of these genes/conditions were deemed likely to ultimately fulfill the criteria of clinically actionable IFs had they been retained to stage II, and the apparent discrepancy between our results and the Green method does not indicate a serious problem with our method.

Stage II

The purpose of stage II is to document and summarize the readily available evidence related to key features of actionability to evaluate candidates for a clinically actionable IF. Although a relatively high bar for clinical actionability must be met for a gene/condition pair to qualify for routine reporting, we specify a relatively low threshold for the type of evidence that is permitted, including nonsystematic or expert-based references. This approach is both pragmatic and efficient because this represents the most common available evidence for highly penetrant rare genetic conditions. We make the level of evidence transparent by documenting its relative strength using a tiered system (Table 4) and by quality rating the references using existing methods.36,37

Table 4 Evidence tiers for stage II

In stage II, we conduct a reproducible search of existing synthesized literature using the predefined method (Table 2). Once a full search for potentially relevant references is completed, we examine these references to determine relevance. Any references deemed irrelevant are excluded.

To assess the relative quality of the identified references, we place each reference into one of four tiers (Table 4). The tiers are used to facilitate a hierarchical method for examining existing literature, starting with the most evidence based. Different tiers of evidence can be present in the same document. For example, background information in a systematic review is not subjected to the same methods as data from the actual review questions.

We then produce a narrative summary with standardized information that is abstracted and documented for each gene/condition (Table 1; Supplementary Table S2 online). For each question, we use the reference(s) with the highest available tier of evidence for data abstraction. All sources are referenced if they are in agreement. If not, the reviewer determines the best reference based on considerations including quality, methods (e.g., search strategy, inclusion criteria, analytic methods, funding source), and date of publication (e.g., more recent publications may be more relevant). If there is not a best reference, we abstract data from all relevant references. The tier of evidence is recorded for each abstracted data element.

References are quality rated only if they are used as a reference for at least one question in the summary report and there is a discrepancy between references. Thus, quality rating serves as a “tie breaker” when we cannot otherwise decide on a best reference. For systematic reviews (tier 1), we use the AMSTAR method.37 For practice guidelines (tier 2), we use the AGREE II method.36 Quality rating for tiers 3 and 4 is not performed, because there is no established method.
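
The tier-based selection and the "tie breaker" rule can be sketched as follows. This is an illustrative simplification with hypothetical names: it assigns a single tier to each reference, whereas the text notes that one document can contain several tiers of evidence, and the citation strings are placeholders rather than real references.

```python
from dataclasses import dataclass
from typing import List, Optional

# Quality-rating instrument by evidence tier, as described in the text.
# Tiers 3 and 4 have no established method, so they map to None.
QUALITY_METHOD = {1: "AMSTAR", 2: "AGREE II", 3: None, 4: None}


@dataclass
class Reference:
    citation: str
    tier: int  # 1 (strongest) to 4


def best_tier_references(relevant: List[Reference]) -> List[Reference]:
    """Keep only the references in the highest available tier (lowest number)."""
    if not relevant:
        return []
    top = min(ref.tier for ref in relevant)
    return [ref for ref in relevant if ref.tier == top]


def quality_method_for(ref: Reference, discrepancy: bool) -> Optional[str]:
    """Instrument to apply to a cited reference; None when no rating is performed."""
    if not discrepancy:
        return None  # references agree, so no tie breaker is needed
    return QUALITY_METHOD[ref.tier]


# Usage with placeholder citations.
refs = [
    Reference("placeholder systematic review", tier=1),
    Reference("placeholder practice guideline", tier=2),
    Reference("placeholder expert summary", tier=4),
]
cited = best_tier_references(refs)
print([r.citation for r in cited])
print(quality_method_for(cited[0], discrepancy=True))
```

Keeping the tier alongside each abstracted data element, rather than only alongside the document, would bring this sketch closer to the procedure described above.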

Stage II case studies

We conducted case studies for familial adenomatous polyposis (Supplementary Table S3 online), hereditary hemochromatosis (Supplementary Table S4 online), and α1-antitrypsin deficiency (Supplementary Table S5 online). These topics were selected to represent a spectrum in the expert opinion on clinical actionability. For each condition, we identified between 31 and 46 references, of which 13–30 were relevant to the topic (Table 5). About 40–50% of the relevant references were tier 1, and 5–10 references were cited in each summary document. We did not complete the quality rating for any of these three topics, because no “tie breakers” were needed.

Table 5 Identified references for three case studies in stage II

Resource requirements for stage I and stage II

We estimated the resource requirements to complete these evaluations on a genome-wide scale. We assume 2,000 topics for review at stage I; each topic is dual-reviewed and requires 1 h to complete per person, and 25% of topics require adjudication of disagreements (15 min each for two people). Under these assumptions, stage I would require slightly more than two full-time employees for 1 year. For stage II, we assume that at most 25% of the 2,000 topics in stage I are retained, and each topic takes between 12 and 20 h to complete. Stage II requires three to five full-time employees for a year.
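
The arithmetic behind these estimates can be reproduced directly from the stated assumptions. The only figure not taken from the text is the length of a full-time work year, assumed here to be 2,080 h (40 h per week for 52 weeks); the variable names are illustrative.

```python
HOURS_PER_FTE_YEAR = 2080  # assumption: 40 h/week x 52 weeks

TOPICS = 2000

# Stage I: dual review at 1 h per reviewer, plus adjudication of 25% of topics
# (15 min each for two people).
stage1_review_h = TOPICS * 2 * 1.0
stage1_adjudication_h = TOPICS * 0.25 * 2 * 0.25
stage1_hours = stage1_review_h + stage1_adjudication_h
stage1_fte = stage1_hours / HOURS_PER_FTE_YEAR

# Stage II: at most 25% of topics retained, 12 to 20 h per topic.
stage2_topics = TOPICS * 0.25
stage2_fte_low = stage2_topics * 12 / HOURS_PER_FTE_YEAR
stage2_fte_high = stage2_topics * 20 / HOURS_PER_FTE_YEAR

print(f"Stage I:  {stage1_hours:.0f} h, about {stage1_fte:.2f} FTE-years")
print(f"Stage II: {stage2_fte_low:.1f} to {stage2_fte_high:.1f} FTE-years")
```

Under this work-year assumption, stage I comes to 4,250 h (just over two full-time employees for a year) and stage II to roughly 2.9 to 4.8 full-time employees, consistent with the figures given above.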

Stage III

The purpose of stage III is for a decision-making panel of experts to review the evidence in the summary document and make decisions about clinically actionable IFs. Different decision-making bodies may come to different decisions on which IFs should be routinely reported. Panels are convened by various stakeholders, including professional organizations, payers, and governmental agencies. One such panel is the EGAPP Working Group, which has broad experience in methods development and making recommendations for genomic applications.17,38,39 We presented the three case studies to the EGAPP Working Group. In each example, the group had consensus that the methodology produced sufficient documentation to decide that all three conditions are clinically actionable and IFs should be routinely reported (Supplementary Tables S3–S5 online).

Discussion

We have presented a transparent, reproducible, efficient, and evidence-informed process for identifying clinically actionable IFs in adults. This study builds on an established framework11 by operationalizing the categorization of clinically actionable IFs.14 We implemented the method in stage I for 99 conditions, and demonstrated both high inter-rater agreement (86%) and external validity as compared with the Green et al.12 expert-based approach. Although the Green method is not a gold standard, it is the only published method available for comparison. We pilot-tested the method using three case studies for stages II and III, which showed that the evidence summaries provided sufficient information for the EGAPP Working Group to recommend clinically actionable IFs that should be routinely reported. The process is scalable to the whole genome, and can be completed within a year for 2,000 gene/condition pairs by about two people for stage I and three to five people for stage II.

As genome-scale sequencing tests are more routinely incorporated into clinical care, the ability to centralize the determination of clinically actionable IFs in a transparent and reproducible way is critical. Clinicians cannot be expected to individually assess the evidence across the entire genome and then make individual determinations about returning IFs. Methods such as this one could facilitate centralized recommendations that can be confidently adopted by clinical laboratories and physicians, resulting in more consistent care across providers. In addition, the evidence base is rapidly developing, so the need for transparency in decisions as to why genes/conditions are not clinically actionable supports updating recommendations as new information becomes available. Guidance on the appropriate reporting of IFs is relevant to many stakeholder groups.

Our method, although grounded in evidence-based principles, is also pragmatic in that it allows consideration of both expert-based and evidence-based guidelines. This approach recognizes the still-limited clinically relevant evidence on genetic mutations for most conditions, as well as the fact that very rare conditions may never accumulate the level of evidence required for traditional evidence-based practice guidelines. Our use of tiers to indicate the relative strength of the methodology used for reference documents allows flexibility for decision makers. For example, if the usual presentation is sudden death, decision makers might accept a low threshold of evidence in favor of prevention. The method provides balance through consideration of the potential risks and burdens, as well as benefits of interventions. Our approach uses consistent, but not comprehensive, search strategies to identify already summarized or synthesized evidence, allowing the method to be scalable to the whole genome. This systematic method would be made even more acceptable if combined with opportunities for public input so that any nonreportable gene/condition pair can be revised with the availability of new (or overlooked) information, a practice that has been used by PLoS Currents/Genomic Tests (http://currents.plos.org/genomictests/).

There are several limitations of the proposed approach. Further experience, including clarification for some of the criteria, would strengthen the method. Semiquantitative measures of the elements of actionability would avoid subjective interpretation of these criteria and facilitate thresholds to determine actionability. Evidence will change over time with advances in medical genetics, necessitating re-evaluation of genes/conditions. To account for these changes, the method needs an approach for updating and consideration of raised objections. The proposed method is conservative in assuming that no personal or family history information will be available to guide interpretation. Individual results will need to be contextualized by a clinician with expertise in genetics in order to personalize the interpretation.

This method is completely agnostic to the specific technology used to detect any given variant. Therefore, we consider the assessment of analytic validity of molecular testing platforms as a necessary, but separate, issue. Although we do not intend to implement the method at the level of specific variants, in clinical practice, variants within genes will need to be classified as deleterious or benign. This topic is being addressed by clinical laboratories and efforts such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) and the Human Variome Project (http://www.humanvariomeproject.org/).

This work is an important step toward identifying clinically actionable IFs on a genome-wide scale. Future work is needed to confirm our findings by applying all three stages to more cases and to address situations that were outside the scope of this work. Berg and colleagues explicitly recognize a category of variants that are clinically validated but not actionable (bin 2), which can be further subdivided into categories based on the risk of psychosocial harm. The proposed method does not address how to classify IFs within this category. NGS technology will probably be used in clinical settings that we did not address, including reproductive decision making, newborn screening, and pediatric cases. These scenarios will need significant revision of the method to address them. The method may achieve the widest adoption if the acceptability of the criteria and staged approach is assessed among diverse stakeholders and potentially modified (e.g., selecting different thresholds) to include different perspectives. Nevertheless, we expect that application of this method will result in a robust framework for the subsequent analysis and management of IFs from individual genome-scale sequencing assays.

Disclosure

The authors declare no conflict of interest.