Introduction

Genome-scale sequencing inevitably leads to the identification of many genomic variants with vastly differing clinical relevance, which requires development of innovative categorical approaches for informed consent, analysis, and return of results. Clinical genomic sequencing for suspected monogenic disorders may identify millions of genetic variants in a single patient, with only one or two “diagnostic” variants likely to explain the molecular etiology (“primary” results). Virtually all of the remaining variants are “incidental” to the original indication for analysis, although the term “secondary findings” is now the preferred term for such results when sought in a systematic fashion.1

We previously proposed a framework for organizing potential incidental/secondary findings into “bins” categorized by clinical validity and clinical utility2 and developed provisional lists of binned genes.3 Our goal is to categorize potential findings before their discovery in a patient to guide informed decision making and return of results. As part of a National Human Genome Research Institute–funded Clinical Sequencing Exploratory Research project called “North Carolina Clinical Evaluation by Next-gen Exome Sequencing (NCGENES),” we assembled a Locus-Variant Binning Committee (LVBC) to refine a category of genomic findings that we call “bin 1”—the list of clinically actionable genes to be analyzed for pathogenic variants and returned as part of the routine results.4 Similar efforts are underway at other institutions and organizations.5,6,7

Recognizing that an expert consensus-based approach without a clear definition and framework for adjudicating actionability could lead to inconsistent and arbitrary results, the LVBC developed a semiquantitative metric for determining the clinical actionability of gene–disease pairs. This metric explicitly recognizes that actionability is a continuum, not a binary state.8,9 That being said, we think that it is vital to define a core set of gene–disease pairs that reach a sufficient threshold of clinical actionability to be considered as part of the routine results of a genome-scale diagnostic test.

In parallel to the efforts of NCGENES, the Evidence-based Genomic Applications in Practice and Prevention Working Group established an evidence-based review procedure consisting of a rapid, sensitive screen for genes with possible actionability; structured data gathering organized around the elements of actionability articulated by the LVBC and detailed herein; and provided assessment by an expert deliberative committee.10 Such a framework will be most useful, not for definitively determining actionability, but rather for identifying the minority of genes in the human genome that should undergo further scrutiny as possibly actionable in a given specific context.

The 2013 recommendations for analysis and return of certain highly actionable incidental/secondary findings by the American College of Medical Genetics and Genomics (ACMG) used a deliberative consensus method to identify gene–disease pairs within which clearly pathogenic variants should be returned as part of clinical genome-scale sequencing.5 These recommendations were met with criticisms,11,12,13 among which were concerns about the process by which the recommended gene list was developed. Also noted were concerns that some genes on the recommended list may not reach an evidentiary threshold sufficient to justify being returned as incidental/secondary findings. The development of a clear framework for the assignment of clinical actionability is therefore a necessary step toward formalizing such judgments and making assessments reproducible and updatable.

Materials and Methods

Semiquantitative metric categories and scoring rules

The LVBC established five core characteristics of clinical actionability, with particular emphasis on the ramifications of finding a clearly pathogenic variant in a person without signs or symptoms of the disease ( Table 1 ). The five characteristics are reflected by the following questions:

Table 1 Semiquantitative metric framework, questions, scores, and examples
  • 1. Severity: “What is the nature of the potential adverse health outcome in an individual carrying a deleterious allele in this gene?” Severity is scored from minimal health impact to modest morbidity to sudden/inevitable death.

  • 2. Likelihood: “What is the chance that this adverse outcome will manifest?” Scoring for this category uses brackets of likelihood and is similar to penetrance.

  • 3. Efficacy of intervention: “How effective are the established interventions for preventing the harmful outcome?” Efficacy of the intervention is scored from lack of demonstrable efficacy to highly effective intervention.

  • 4. Burden of intervention: “How acceptable are the interventions in terms of the burdens or risks placed on the individual?” The burden or acceptability of the intervention is scored from highly consequential to minimally burdensome intervention.

  • 5. Knowledge base: “How much is known about the gene, condition, and intervention to allow scoring in each category?” Knowledge is scored from controversial or poor evidence to substantial evidence.

All five criteria are scored on a scale of 0–3. The “outcome” and “intervention” are defined in advance and the other components of the metric are scored with respect to these parameters. It is critical to consider outcomes together with corresponding interventions to balance the clinical effects expected by natural history against the benefits and harms of these interventions in individuals who have not manifested symptoms of disease.

Gene sets scored

To judge the ability of the metric to distinguish between conditions that vary widely in terms of clinical actionability, three lists of genes were scored: (i) a list of 161 provisionally actionable genes3 (hereafter referred to as “bin 1 genes”); (ii) a list of 57 genes originally recommended by the ACMG4 (hereafter referred to as “ACMG genes”); and (iii) a list of 1,000 genes randomly selected from the National Center for Biotechnology Information RefSeq database (hereafter referred to as “random genes”). The random genes were selected from a 7 October 2013 RefSeq download using an in-house python script utilizing the “random” module. They were cross-referenced against Online Mendelian Inheritance in Man (http://www.omim.org) and OrphaNet (http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php) and manually curated to identify those with disease associations as of that time. The majority of the random genes (889/1,000) had no documented disease association, were associated only with somatic mutations, or had a modest influence on disease risk based on association study data. These conditions scored 0 by default and were excluded from further analysis.

Although the ACMG’s recommended list has subsequently been reduced to 56, all 57 original genes were analyzed with the expectation that the removed gene (NTRK1) would prove to be an outlier with regard to clinical actionability.

In addition, a list of “other” gene–disease pairs were scored, including conditions with phenotypes overlapping those of genes considered to be potentially actionable, disorders that are allelic to others that were scored, or conditions that were selected to evaluate the range of scores obtained for conditions considered not to be actionable. Scores for these “other genes” are included in the overall analysis but were not subject to statistical comparisons between lists because of their heterogeneity.

Assessment and consensus scoring

The multidisciplinary LVBC included clinical geneticists, genetic counselors, physicians from other specialties such as cardiology and neurology, a primary care physician, clinical laboratorians, and ethicists. Information about gene-disease relationships was obtained from OMIM, GeneReviews,14 PubMed searches, and clinical guidelines, when available. Members of the LVBC prepared the evidence review, typically with a single member assigned primary responsibility for each gene–disease phenotype pair. The committee met regularly to review the evidence and to agree on a score for each element of the semiquantitative metric or direct additional review.

To mitigate the subjective nature of assessing certain categories and to enhance consistency between scores, the LVBC arrived at a series of scoring conventions (examples in Table 1 ). Scores for categories 1 (severity) and 2 (likelihood) are linked to the same specific outcome, either the most severe potential outcome or what is generally considered the primary outcome for the disease. However, scores for a given gene–disease pair can be calculated for more than one outcome of interest to account for disease pleiotropy. For example, different scores can be calculated for BRCA1 depending on whether the outcome of interest is breast cancer or ovarian cancer. In effect, categories 1 and 2 reflect the medical implications of disease faced by an individual with a pathogenic finding. Scores for categories 3 (efficacy) and 4 (burden) reflect specific presymptomatic interventions targeted to the outcome described in categories 1 and 2 (e.g., bilateral risk-reducing mastectomy or bilateral salpingo-oophorectomy per the example of BRCA1). The semiquantitative metric thus approximates the concept of clinical utility by balancing the potential benefits and harms of intervention when an incidental/secondary finding is discovered in a presymptomatic individual.

Results

A total of 1,213 unique genes were evaluated using the semiquantitative metric (Supplementary Table S1 online). After removing genes not implicated in a single-gene disorder, 324 unique genes representing 372 gene–disease pairs were scored. These genes included 161 bin 1 genes, 57 ACMG genes, and 111 random genes associated with defined monogenic disorders. There was some degree of overlap between these lists, as depicted in Figure 1a . In cases where the random genes were associated with more than one condition, the highest of the scores was chosen to represent the random gene–disease pair; scores for additional gene–disease pairs were tallied in the “other” category.

Figure 1
figure 1

Semiquantitative metric scores. (a) Summary of overlap between the gene lists analyzed. The American College of Medical Genetics and Genomics (ACMG) genes included 26 gene–disease pairs (25 not including NTRK) that were not among the bin 1 genes. Conversely, 130 bin 1 genes were not among the ACMG genes. Of the 111 random genes with a defined disease association, 4 overlapped with the bin 1 genes (ANK2, BRIP1, COL1A2, PROC), 1 overlapped with the ACMG genes (NTRK1), and 1 gene overlapped both lists (PTEN). The Locus-Variant Binning Committee also evaluated 80 other gene–disease pairs, including alternative phenotypes for some genes, or different genes with similar disease phenotypes. One of these genes was on the ACMG list (NF2) and two were on the random list (CASQ2 and MAX). (b) Distribution of semiquantitative metric scores. Box-whisker plots showing the median, 25th–75th percentiles (box), and 5th–95th percentiles (whiskers) of the scores for the bin 1 gene list, ACMG gene list, random gene list, and other gene list. Asterisks indicate statistically significant differences (P < 0.0001, two-tailed Mann-Whitney test).

The median score of the bin 1 genes was 12 (range 0–15); 84/161 gene–disease pairs scored ≥12, while 29/161 pairs scored <10. The median score for the ACMG genes was 11 (range 7–14); 25/57 gene–disease pairs scored ≥12, while 11/57 pairs scored <10. The NTRK1 gene, originally included on ACMG’s preliminary recommended list and subsequently dropped, scored 7. In comparison, the median score of the 111 random genes was 7 (range 1–13); only 14/111 gene–disease pairs scored ≥12, while 81/111 pairs scored <10. Figure 1b shows the distribution of scores for all of the pairs. The distributions of scores for the bin 1 genes and the ACMG genes are not significantly different from each other, but both lists are significantly different than the random genes (P < 0.0001, two-tailed Mann-Whitney test), indicating that the semiquantitative metric effectively distinguishes between gene–disease pairs that were qualitatively deemed to be actionable in earlier efforts from those that would not be enriched for actionability. Table 2 presents several scoring examples, and all scores are included in Supplementary Table S1 online.

Table 2 Examples of semiquantitative metric scores for selected genes

Among the gene–disease pairs scoring highest using this metric were MLH1 (associated with Lynch syndrome) at 13 and RYR1 (malignant hyperthermia) at 12. Despite its low penetrance, the HFE gene (implicated in hereditary hemochromatosis) scored 11 because of the availability of highly effective and noninvasive preventive measures. By contrast, some genes that were considered actionable by the ACMG, such as SDHB, SDHC, and SDHD (associated with hereditary pheochromocytoma/paraganglioma susceptibility) received scores below 11 because of limited evidence that biochemical screening in an otherwise asymptomatic individual would produce better long-term outcomes than treatment upon onset of symptoms. Other genes, such as MYH11 and MYLK (which have recently been implicated in familial thoracic aortic aneurysm and dissection), could be considered to have effective interventions by analogy to other well-known conditions, but they scored lower because of a limited knowledge base, which precludes accurate assessment of penetrance.

As demonstrated by the range of scores observed for the selected gene–disease pairs, actionability is a continuum rather than a binary state. Using the random genes as a benchmark, 21/111 (19%) scored ≥11, while 30/111 (27%) scored ≥10. The LVBC chose to consider genes with a score ≥11, essentially the top quintile, as meeting the threshold of actionability for inclusion in the revised bin 1 list. This yields a list of 168 genes representing 176 gene–disease pairs from the total 372 pairs scored. The fact that 19% of random genes associated with single-gene disorders scored ≥11 suggests that as many as 500 genes of the ≥3,000 single-gene disorders might rise to this threshold of actionability. Thus, we have not yet identified all of the “actionable” gene–disease pairs, and a systematic screen of single-gene disorders is needed. Furthermore, scores are subject to change depending on advances in medical genetics, which will likely increase some scores over time.

Discussion

Management of the vast range of heterogeneous information generated when genomic analysis is undertaken remains one of the most challenging aspects of applying genomics in the clinical realm. Patient preferences must be taken into account, especially with regard to genomic findings that have limited clinical actionability. Individuals may make greatly varying choices regarding whether they want to learn about different types of genomic findings; we are studying these preferences and the parameters that influence them as part of the NCGENES study. However, just as there are incidentally discovered laboratory values that are flagged as “critical” levels, or radiographic findings that require clinical action, it follows that when certain genomic findings exceed a threshold of actionability, the default procedure should be to provide those results as part of the routine protocol when performing clinical genome-scale sequencing tests.

It is thus critical to define a subset of clinically actionable genomic findings that are likely to be accepted by most individuals and allow a standardized and streamlined process for informed decision making in clinical genome-scale diagnostic testing. Otherwise, the decision about returning genomic findings could (at the reductio ad absurdum extremes) be relegated to an all-or-none choice irrespective of the actionability of the information, or a nearly infinite menu of potential findings organized at the level of certain genes or even specific variants. Neither of these options seems tenable in the current clinical setting.

It should be stressed that a policy of routine return of a small subset of genomic findings does not preclude patient choice by means of an informed “opt-out” at the initiation of testing, as now endorsed by the ACMG. In addition, this policy does not prohibit laboratories from offering additional categories of non–medically actionable genomic information (what we refer to as “bin 2”) as an “opt-in” to those who desire such information, with appropriate education and decision making. Nevertheless, any such menu of options needs to be articulated before consent and analysis, which calls for an a priori process to define which gene–disease pairs fall into any given category.

This article describes the delineation of a novel semiquantitative metric providing a transparent definition of clinical actionability and a framework for evaluating criteria in a streamlined fashion. We outline criteria for actionability generally similar to other expert deliberative processes.5,15 However, this framework is unique in that the dimensions of actionability can be assessed consistently across different types of disorders. The results indicate that, as expected, the ACMG list is enriched for genes that achieve high scores for actionability, both supporting their inclusion in a recommended list and generally reinforcing the parameters used in the current assessment. Future versions of the ACMG list could be informed by this scoring metric, or one similar to it, in order to remove genes that fall below a stringent threshold and to include additional genes with scores equivalent to those on the current recommended list. The percentage of individuals who will have such findings is predictable and depends on the list of gene–disease pairs being evaluated and the stringency with which variants are selected for return.3,15,16,17

Nuances in application of the semiquantitative metric

The subcategories of the metric reflect the clinical impact of a condition (severity and likelihood of a given outcome) while balancing the potential benefits (effectiveness of interventions) and harms (burden of intervention), thus approximating the clinical utility of revealing incidental/secondary findings in a presymptomatic individual. Each of these facets is necessary to include, despite the subjectivity inherent in scoring some of them. For example, both periodic phlebotomy (as in the case of hemochromatosis) and surgical removal of the stomach to prevent diffuse gastric cancer (in the case of CDH1 mutations) are highly effective measures to prevent morbidity and mortality, yet the burdens of these interventions are dramatically different and therefore greatly affect the concept of actionability when considering the return of genetic information in a setting in which the individual is not likely to have overt symptoms.

Scoring each category on a 0–3 scale allows for a limited degree of granularity that captures the qualitative nature of certain categories (e.g., effectiveness and burdens of intervention). It would be difficult, and potentially problematic, to spread the range of scores into finer subdivisions or to develop a more complex nonlinear scoring system. That said, different scoring systems could be explored should there be a compelling rationale to do so. In addition, customized gene lists could be generated for other contexts by applying weighting schemes or selecting a different threshold. For example, one might envision that selection of genes for primary screening of the general population should demand extremely high knowledge and efficacy scores, and those components of the metric could be weighted accordingly. In general, differential weighting of criteria will change the rank order of scores that are close to one another, and will primarily affect gene–disease pairs near the threshold used to define actionability. However, those with scores farther away from the chosen threshold would be much less likely to have their position above or below the threshold affected by changes in weighting. Finally, the evidence used in establishing the scores can be explicitly defined, and the scores can be updated to incorporate new evidence. Thus, the semiquantitative metric provides a structured framework and a more nuanced and transparent method for defining a list of clinically actionable genes than would be possible with expert consensus approaches.

In practice, the LVBC established a set of conventions for scoring different categories (see Table 1 for examples). It is challenging to directly compare the severity of conditions that lead to bodily harm, such as death or organ failure, with conditions that lead to physical or cognitive impairment. The severity score is thus intended to judge the relative severity between disparate conditions. The likelihood of a given outcome is the most quantitatively definable component of the metric, although data are lacking for many conditions, requiring either an estimate with some uncertainty (reflected in a lower score for knowledge base) or a score of 0 when the available data are simply too limited to make a reasonable assessment, as in the case of autosomal-dominant conditions with only a few patients reported in the medical literature.

In the absence of definitive end points, the effectiveness of interventions for different clinical outcomes often relies on expert opinion. The LVBC generally considered the screening or preventive measures that all individuals with a positive finding would be expected to undergo, rather than more definitive treatments that would be required only in those who manifest symptoms. By any measure, the burden of intervention is the most subjective and personally nuanced aspect of the metric. It is likely that different individuals hold different views on what is acceptable and what constitutes an unreasonable burden in the context of their own life experiences. Thus, while we fully recognize that it can be difficult to assess the burden of a particular intervention for an individual, we attempted to define a scoring rubric that could roughly define the relative burdens of interventions across the population. This score could be replaced in the future by a more quantitative measure derived from discrete choice experiments or other means of assessing relative values, such as the methods used to measure quality-adjusted life-years.

Finally, the knowledge base score was applied as a single measure reflecting the degree to which each component of the score could be confidently defined. Alternatively, a knowledge score could be assigned to each component separately based on the knowledge base for that element. While more complicated, such an approach would provide greater granularity. The strength of the gene–disease association itself is embedded in the knowledge base score, since less well-described conditions rarely have sufficient knowledge to accurately score certain elements of the metric. However, we do not intend for this metric to be used as a stand-alone measure of the clinical validity of a gene–disease association.

Disorders that predispose to thoracic aortic aneurysm and dissection illustrate certain nuances of scoring. These conditions convey an increased risk of sudden death due to dissection, and the typical intervention is to implement a vascular imaging screening program. This intervention is a highly effective and noninvasive means to detect and monitor the size of an aneurysm before it poses significant danger of acute dissection. More invasive intervention (i.e., vascular surgery) is required only if the individual develops a clinically significant aneurysm. Overall, this combination of interventions would be a highly effective and generally acceptable way to manage the risk of sudden death due to an aortic dissection, although in some conditions the risk of dissection is not directly related to aneurysm size, in which case screening would be less effective. In addition, the effectiveness of an intervention for more rare conditions, such as MYLK-associated thoracic aortic aneurysm and dissection, must be extrapolated based on analogy to related conditions because of the lack of information available regarding the effectiveness of interventions specific to that condition.

Potential limitations

The current metric does not account for certain contextual factors, such as the age of the individual, the typical age at onset of disease or the age at which clinical actions would be implemented in a presymptomatic individual, the sex of the individual, the general availability and cost of recommended preventive strategies, or the ability of relevant genetic lesions to be detected. Given these limitations and the necessarily subjective nature of any assessment of actionability, the scores and evidence base generated by the application of the semiquantitative metric are best considered an initial starting point for more nuanced discussions about individual conditions or particular clinical applications.

Certain features of the metric could lead to minor irregularities in scoring. For example, if the LVBC decided that a proposed intervention for a given condition was considered ineffective (score = 0), then no score could be assigned to reflect the “burden” of that intervention because the additional points would inflate the final score. However, If the proposed intervention was considered even “minimally” effective, the total score could swing by as much as four points (effectiveness = 1 and burden = 3 points for a minimally invasive intervention). This impact could be partially mitigated by weighting schemes that emphasize the contribution of the effectiveness of the intervention over the burden score to the overall total.

The final scores for each gene–disease pair were determined through consensus of the LVBC. This process was not amenable to evaluation of interrater variability in scoring, but in practice we found that the semiquantitative metric facilitated more efficient discussions and greater consistency than earlier attempts to arrive at consensus without a structured framework. It could be argued that this process simply replaces a single idiosyncratic expert consensus about “actionability” (the current state of other deliberative processes) with several different potentially idiosyncratic decisions. However, assessment of scores for each criterion permits more systematic evidence curation and updating, as well as a more flexible approach to weighting the importance of different criteria, than would be possible otherwise.

Finally, in some cases review of the scores by domain experts may prompt revisions based on deeper understanding of the clinical scenario or greater awareness of literature that was not captured in our review process. In other cases medical advances may increase the overall scores by improving the knowledge base and perhaps the efficacy of interventions for many conditions. It is expected that there will be a need for ongoing assessment of clinical actionability and updating of results, which again is streamlined by the existence of a structured framework.

Conclusions and future questions

We have presented a framework that defines five aspects of clinical actionability, evaluates them qualitatively, and effectively distinguishes between lists of genes deemed to be potentially actionable by expert opinion versus randomly selected genes. This framework is flexible and can be adapted to different contexts. It is too early to know whether it is more efficient or more reliable than other expert consensus approaches, or whether it ultimately leads to better clinical outcomes. In addition, it remains to be determined whether other groups would arrive at the same scores. However, the inherent transparency of the framework facilitates comparison between different efforts, critical evaluation, and updating of the scores as new knowledge accrues as a result of the constantly evolving medical literature. We anticipate using this or a similar metric to evaluate all human disease genes to guide the application of genomic medicine.

Disclosure

The authors declare no conflict of interest.