Introduction

The Center for Medical Technology Policy convenes stakeholders to develop effectiveness guidance documents (EGDs), which provide disease- or technology-specific methodological recommendations for studies targeting the information needs of payers, with input from clinicians and patients. EGDs are analogous and complementary to US Food and Drug Administration (FDA) regulatory guidance documents, focusing on study designs that address payers’ expectations for evidence.

Groups conducting technology assessments or systematic evidence reviews, or translating evidence into clinical practice guidelines, have frequently concluded that the evidence supporting the clinical use of a recently introduced molecular diagnostic (MDx) test is insufficient.1 While these assessments typically identify the critical gaps in knowledge that limit the translation of specific tests into practice, they often stop short of providing specific guidance for study design to overcome these deficiencies, nor do they provide test developers with a clear sense of the evidence that public and private health plans require for coverage.2 Moreover, relatively few commercially available MDx tests are reviewed for coverage because of a lack of clinical utility (CU) studies.3 This EGD4 provides test developers with specific recommendations for evaluating the clinical validity (CV) and CU of “actionable” MDx tests in a manner that is acceptable to payers, and it serves as a resource for payers to communicate standards of evidence to test developers.

We used “molecular diagnostic test for oncology” as an umbrella term for any test that, at the molecular level, helps to identify patients with an inherited risk for cancer or to diagnose, classify, or guide management of a patient’s cancer. This definition included tests for individual biomarkers, “omics”-based tests, and tests for circulating tumor cells; was independent of the assay method; and applied to tests that were not codeveloped as companion diagnostics. Codeveloped companion diagnostics were excluded from the scope because these test–drug combinations undergo FDA review, and this process typically results in adequate information regarding the utility of the test for the approved indication. Most other tests do not undergo FDA review, and many are marketed as laboratory-developed tests that are regulated under the Clinical Laboratory Improvement Amendments (CLIA) of 1988. We used the ACCE framework (analytic validity (AV), CV, CU, and ethical, legal, and social implications) to categorize the types of evidence needed to recommend the use of MDx tests.5

The recommendations apply to actionable tests, meaning tests that can lead to changes in the clinical management of patients. These include tests that predict survival or other clinical end points independent of any specific treatment (“prognostic test”), predict response to treatment (“therapy-guiding test” or “predictive test”), assess response to treatment (“monitoring test”), or identify the risk of organ-based toxicities or altered metabolism and/or response to cancer drugs (“pharmacogenomic test”). The target condition can involve either solid or hematologic malignancies in adult patients. Since these tests guide patient care decisions for a potentially life-threatening clinical condition, all are classified as “high risk” in terms of the potential benefits and harms to patients.

Materials and Methods

Figure 1 outlines the process for the development of EGDs from gap identification to final EGD recommendations.4

Figure 1. Overall method for developing recommendations.

We convened a 10-person technical working group (TWG), comprising clinical and methodological experts representing the Centers for Medicare and Medicaid Services, Blue Cross Blue Shield Technology Evaluation Center, the National Cancer Institute, Brigham and Women’s Hospital, the Effectiveness in Genomic Application in Practice and Prevention initiative, Duke University School of Medicine, the American Society of Clinical Oncology, Veridex, Epic Sciences, New Enterprise Associates, and the Research Advocacy Network (for details, see the EGD4). The group held an initial all-day, in-person meeting followed by a series of five teleconferences over 8 months to develop draft methodological recommendations. Following those steps, a 20-person advisory group composed of life-sciences industry experts was convened to review and comment on the draft recommendations. Two joint advisory group/TWG in-person workshops were held, with the additional participation of patients, payers, clinicians, regulators, professional societies, and researchers. Major health plans (WellPoint (now Anthem), Kaiser Permanente, UnitedHealth, Centers for Medicare and Medicaid Services, Palmetto GBA, and Blue Cross Blue Shield Technology Evaluation Center) supported the project through funding or direct participation. Between these two workshops, a series of six joint advisory group/TWG subgroups refined and finalized specific recommendations over a 5-month period. The resulting recommendations incorporate collective stakeholder input while representing standards that are acceptable to many payers for decision making regarding coverage. Effort was made to mediate conflicting opinions within the TWG, but full consensus was not achieved. The Center for Medical Technology Policy takes responsibility for the final content. Only TWG members listed as coauthors can be considered to endorse the recommendations.

Results

Ten specific EGD recommendations are discussed here. The recommendations are divided into three categories: reporting AV, CV, and CU. Several position statements are included to emphasize the broader need to promote evidence generation.

Reporting AV

Recommendation 1. Follow standard reporting guidelines to document that analytic validity has been established. Greater transparency will enable others to more easily assess these claims.6,7,8,9 Although specific methodological recommendations related to AV were excluded from the scope of this guidance document, ensuring AV before the final assessment of CV is critical to improving the evidence base for MDx tests in oncology.

Clinical validity

The strength of the association between the test result and the clinical condition of interest must be established to assess the CV of an MDx test. The most common flaws in MDx clinical validation studies include relying on intermediate outcomes that are not predictive of the definitive clinical end point of interest (e.g., progression-free survival is often not predictive of overall survival) and using populations that are not representative of the population in which the test is intended to be used (e.g., test validation with a largely Caucasian population when the underlying disease also affects large numbers of African Americans).10,11,12 Best practices can be achieved through attention to study design and quality (i.e., bias), sample size, patient population, choice of outcome measures, and appropriate statistical analysis and result interpretation.13,14,15

Recommendation 2. Specify the clinical context and patient population intended to benefit from the action or decision guided by the test result. One or more specific intended uses for the MDx test and outcomes of interest should also be determined as early as possible in the development process.16 While preliminary or exploratory studies early in test development (including the development of classifier models) might use convenience samples obtained from less representative patient subgroups, efforts should be made to identify a specific intended use for the MDx test as early as possible in the development process. As test development proceeds, an unbiased clinical validation should ensure that the test sets used for validation are drawn from the intended use population and are independent of any training data sets used to develop the test.

Recommendation 3. Report the strength of an association between the MDx test and a specific disease state using metrics that are most useful to clinicians. When the clinical disease state is binary (e.g., a continuous variable with an actionable threshold), preferred metrics are clinical sensitivity, clinical specificity, positive predictive value, and negative predictive value, provided with measures of uncertainty such as 95% confidence intervals. Disease prevalence among the tested population is required to compute the positive predictive value and negative predictive value. The acceptable balance of false-positive versus false-negative results depends on the clinical context. Although the area under the receiver operating characteristic curve should not be the only metric used to evaluate CV, the optimal cut point for clinical decision making can be selected using a receiver operating characteristic curve to plot sensitivity and (1 − specificity) pairs versus the associated levels of the MDx biomarker.17,18
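The metrics named in recommendation 3 can be illustrated with a minimal sketch. The function names, the choice of the Wilson score interval for the 95% confidence intervals, and the example counts are assumptions made for illustration; they are not taken from the EGD. The second function shows why prevalence in the tested population is needed: the same sensitivity and specificity give very different predictive values at different prevalences.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def validity_metrics(tp, fp, fn, tn):
    """Point estimate and interval for each metric from a 2x2 table of counts."""
    return {
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "ppv": (tp / (tp + fp), wilson_ci(tp, tp + fp)),
        "npv": (tn / (tn + fn), wilson_ci(tn, tn + fn)),
    }


def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV at a given prevalence, via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


if __name__ == "__main__":
    # Hypothetical validation counts, not data from any real study.
    for name, (estimate, (low, high)) in validity_metrics(90, 30, 10, 170).items():
        print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
    # Same test applied where only 5% of tested patients have the condition.
    print(predictive_values(0.90, 0.85, 0.05))
```

In this hypothetical example the positive predictive value is 0.75 in the validation sample (prevalence one in three) but falls to 0.24 at a prevalence of 5%, which is why predictive values estimated in an unrepresentative sample should not be carried over to the intended use population.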

Prognostic biomarkers are typically evaluated as part of a multivariate analysis for a model predicting a particular outcome13,19 and are best examined in a prospective cohort study20 or possibly in the control arm of a randomized controlled trial (RCT). The preferred study design for validating a predictive biomarker is an RCT comparing two treatments, where biomarker status is available for all patients at baseline (not an enrichment design, which in this case refers to the prospective use of a patient’s biomarker status for determining enrollment in a trial to increase the likelihood of observing a drug effect). When the predictive biomarker is a continuous measure, a useful approach for choosing a cutoff value is to use treatment predictiveness curves,15 plotting clinical outcome (e.g., 5-year disease-free survival rate; y axis) as a function of biomarker value (x axis) separately for each treatment arm. This allows one to assess which treatment yields greater benefit at each biomarker value and to estimate the proportion of patients who will benefit from each treatment.
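The idea behind a treatment predictiveness curve can be sketched crudely by binning patients on biomarker value and comparing outcome rates between arms within each bin. This is a simplified illustration only: the function name, arm labels, bin edges, and records are hypothetical, and a real analysis would normally fit a statistical model with confidence bands rather than rely on raw binned rates.

```python
from collections import defaultdict


def binned_predictiveness(records, bin_edges):
    """Outcome rate per treatment arm within each biomarker bin.

    records: iterable of (biomarker_value, arm, event_free), where arm is
    "A" or "B" and event_free is True if the patient reached the outcome
    of interest (e.g., disease-free at 5 years).
    Returns a list of (bin_low, bin_high, rate_A, rate_B); a rate is None
    when an arm has no patients in that bin.
    """
    counts = defaultdict(lambda: [0, 0])  # (bin, arm) -> [event_free, total]
    last_bin = len(bin_edges) - 2
    for value, arm, event_free in records:
        for i in range(len(bin_edges) - 1):
            in_bin = bin_edges[i] <= value < bin_edges[i + 1]
            at_top_edge = i == last_bin and value == bin_edges[i + 1]
            if in_bin or at_top_edge:
                counts[(i, arm)][0] += int(event_free)
                counts[(i, arm)][1] += 1
                break
    curve = []
    for i in range(len(bin_edges) - 1):
        rates = []
        for arm in ("A", "B"):
            good, total = counts[(i, arm)]
            rates.append(good / total if total else None)
        curve.append((bin_edges[i], bin_edges[i + 1], rates[0], rates[1]))
    return curve


if __name__ == "__main__":
    # Hypothetical toy records: (biomarker value, arm, event-free outcome).
    toy = [
        (0.1, "A", True), (0.2, "A", False), (0.1, "B", True), (0.3, "B", True),
        (0.7, "A", True), (0.8, "A", True), (0.6, "B", False), (0.9, "B", True),
    ]
    for low, high, rate_a, rate_b in binned_predictiveness(toy, [0.0, 0.5, 1.0]):
        print(f"biomarker {low} to {high}: arm A {rate_a}, arm B {rate_b}")
```

Where the two arms' curves cross suggests a candidate cutoff: in the toy data, arm B looks better at low biomarker values and arm A at high values. With so few records this is purely schematic.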

To encourage transparent and complete reporting of study design and statistical analyses, and to promote reproducibility, reporting of test validation studies should utilize appropriate standards, such as the QUADAS checklist (designed to assess the quality of primary diagnostic accuracy studies)21 and the REMARK (Reporting Recommendations for Tumor Marker Prognostic Studies) checklist.1,2

Clinical utility

Evidence of the CU of an MDx test establishes the net clinical benefit to the patient of adding the MDx test to the current/standard clinical decision-making matrix. The AV and CV of the test should be “fully specified and locked down” before initiating prospective evaluations of CU.22 Because these tests are used to inform oncology care decisions, they are considered high-risk medical decision tools; correspondingly high evidence standards apply. RCTs are therefore the preferred method to assess CU in this context (recommendation 6). Under specific circumstances, however, alternative study designs may be permissible (recommendations 7, 8, and 9), and in some situations, a chain of evidence might be constructed using existing evidence on therapeutics to correlate testing with patient outcomes (recommendation 10).

The earliest stages of MDx assay development should include a systematic plan for evidence-based translation into clinical practice. To determine the type(s) of studies that will be required, describe the proposed CU of the test in a flow diagram (Figure 2) that outlines at a conceptual level the intended clinical use and key elements, such as the intended use population, existing test strategies, treatment alternatives, and the associated primary patient outcomes; this is analogous to defining the primary study objectives for a clinical trial.23 The flow diagram serves two critical purposes: (i) helping the researcher to decide whether a prospective study is necessary by identifying existing data sources that estimate the strength of association between a test result and patient outcome(s) and (ii) helping to identify critical missing data elements, thereby supporting the design of efficient studies.

Figure 2. Example flow diagram that outlines at a conceptual level the intended clinical use in practice and the associated primary patient outcomes for a clinical trial. For illustration purposes, the diagram includes some hypothetical data and reference sources for each pathway. Dark grey boxes indicate decision steps for which information does not exist or is inadequate. An actual flow diagram would specify the information available and sources for each branch in the diagram to provide a more detailed map of the type of information that is still needed to fully develop the test. MDx, molecular diagnostic.

Recommendation 4. Specify in advance the potential therapeutic actions or decisions (i.e., clinical pathways) that should be followed based on test results, and include all relevant (for the given clinical context) treatment alternatives under consideration at the time of testing. Standardizing the potential clinical pathways associated with various test results reduces variation and enhances the ability of the study to assess the impact of test results on patient outcomes. The explicit description of how the test results will be used compared with non–biomarker-guided treatment strategies is also informative for patients who are considering enrollment in the study.

Recommendation 5. Include outcome measures that assess both the potential benefits and harms of testing from the patient perspective, recognizing that these outcomes may occur at different time points and are the result of clinical management decisions guided by test results.

The primary clinical application for actionable MDx tests in oncology is to enhance the stratification of patients to more precisely classify risk and target interventions. Examples of typical outcome measures include clinical assessments of disease remission and progression, response to therapy, functional status, as well as disease- and treatment-related adverse events. Measures of benefits and harms should also routinely include patient-reported outcome measures, with the assurance that the selected measures are appropriate and validated for the clinical context.3,24 CU studies may reasonably include end points such as survival and downstream health-care resource utilization. The decision to include these end points should be guided by the robustness of the existing evidence base regarding the specific clinical intervention prompted by the test result and its effects on relevant health outcomes. However, process measures, such as changes in physician behavior, are typically insufficient to qualify as persuasive study end points unless there exists a separate, robust body of credible evidence (as determined by widely accepted evidence review standards) linking specific clinical management decisions with relevant health outcomes. Studies designed to report intended care plans following an MDx test are insufficient for demonstrating CU.

Recommendation 6. The preferred method for assessing the CU of MDx tests is RCTs that adequately evaluate the impact of the clinical decision (treatment or other clinical pathway) relative to an appropriate control for both marker-positive and marker-negative patients.11

In general, designs that use a biomarker to guide the analysis are preferred over designs that use a biomarker to guide the treatment assignment.10 Accordingly, a preferred RCT design is the “all comers” marker-stratified design for evaluating the CU of MDx tests4 (Figure 3a,b).10,11,12 When there exists compelling evidence that a subgroup of patients with a particular marker cannot benefit from a treatment, or when a group of responders has been identified for further study within an otherwise highly heterogeneous population, enrichment designs are useful to focus on a specific group of interest25 (Figure 4a). In general, however, the approach is justified only in cases where the biologic rationale and preliminary evidence that only one group benefits is sufficiently compelling that equipoise does not truly exist between the current alternatives for all patients, making it unethical to randomize treatment options to all marker-based groups.

Figure 3. Generally preferred designs. “All comers,” prospective marker-stratified design for (a) a prognostic test and (b) a predictive test. Adapted from ref. 40.

Figure 4. Designs having disadvantages. (a) Marker enrichment design. This is not recommended except where there exists compelling evidence that marker-negative patients cannot benefit from a treatment, or when a group of responders has been identified for further study within an otherwise highly heterogeneous population; no information on the excluded group is obtained. (b) Biomarker strategy design. This is not preferred because the approach reduces statistical power, given that patients in both study arms receive the “standard of care” as their intervention. MDx, molecular diagnostic. Adapted from ref. 40.

The biomarker strategy design, in which the patients who are randomized to usual care are not tested, is often used to study genomics-guided treatment versus usual care11 (Figure 4b). With this strategy, however, some patients receiving MDx-guided therapy receive the same treatment (standard of care) as patients in the standard therapy arm, which dilutes the ability to observe a treatment effect11 (Figure 4b). The same objectives can typically be achieved with fewer patients using the marker-stratified design described above. Given the larger sample size required to demonstrate a difference between study arms, the biomarker strategy design is not preferred.
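The dilution effect can be made concrete with a rough sample-size sketch. It rests on several simplifying assumptions that are ours, not the EGD's: a perfectly accurate test, a benefit of the targeted therapy confined to marker-positive patients, the same response rate on standard care in both marker groups, and the usual normal-approximation formula for comparing two proportions. The function names and all numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(rate_control, rate_experimental, alpha=0.05, power=0.80):
    """Patients per arm to detect a difference between two proportions
    (two-sided test, normal approximation, equal allocation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = (rate_control * (1 - rate_control)
                + rate_experimental * (1 - rate_experimental))
    return math.ceil(z ** 2 * variance / (rate_experimental - rate_control) ** 2)


def compare_designs(marker_pos_fraction, rate_standard, rate_targeted,
                    alpha=0.05, power=0.80):
    """Total enrollment under each design, given the assumptions above."""
    # Marker-stratified design: the treatment comparison is made within the
    # marker-positive stratum, so enough patients must be enrolled overall
    # to yield the required number of marker-positive patients.
    n_pos = n_per_arm(rate_standard, rate_targeted, alpha, power)
    stratified_total = math.ceil(2 * n_pos / marker_pos_fraction)
    # Biomarker strategy design: only the marker-positive fraction of the
    # guided arm is treated differently, so the arm-level effect is diluted.
    guided_rate = rate_standard + marker_pos_fraction * (rate_targeted - rate_standard)
    strategy_total = 2 * n_per_arm(rate_standard, guided_rate, alpha, power)
    return {"marker_stratified": stratified_total, "biomarker_strategy": strategy_total}


if __name__ == "__main__":
    # Hypothetical inputs: 25% marker-positive, response 30% on standard
    # care versus 50% on targeted therapy in marker-positive patients.
    print(compare_designs(0.25, 0.30, 0.50))
```

Under these inputs the arm-level difference in the strategy design is only 5 percentage points (0.25 × 20 points), and because the required sample size grows with the inverse square of the difference, the strategy design needs several times as many patients as the stratified design.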

Recommendation 7. Conduct a well-designed, prospective-retrospective study when there exists an appropriately designed, powered, and conducted clinical trial with banked biospecimens (Figure 5). Replication of study results (second study) and pooling of biospecimen samples from comparable RCTs are two approaches to address limitations related to causal inference and insufficient sample sizes. For a “prospective-retrospective” study design to be of sufficient scientific rigor to convincingly demonstrate the CU of a new biomarker, several conditions must be met.26 For example, the analysis plan for the biomarker study must be completely prespecified, and the analytic validity of the test must be well established to ensure that results from archived tissues resemble the results from tissue collected in real time.

Figure 5. Prospective-retrospective randomized controlled trial (RCT) design. A drug is tested first in an RCT, and marker status is determined retrospectively from tissue samples. This is recommended for situations in which the marker was not known when the drug was first developed. It can also be used for independent validation. Adapted from ref. 40.

Replication of validation study results provides excellent verification of the evidence. We believe, however, that positive results from a single properly designed and adequately powered prospective-retrospective study should be considered adequate evidence of CU.

Recommendation 8. Single-arm studies can be used to establish the CU of an MDx test provided the following conditions are met: (i) the MDx test is being developed with an oncology drug that has already been approved by the FDA on the basis of pivotal trials of a study population that was not previously stratified on the basis of molecular marker status; (ii) adequate archived tissue samples are not available to conduct a prospective-retrospective trial to assess CU; (iii) it is feasible to use response, variably defined as complete or overall response, as an end point in the single-arm study; and (iv) there exists comparable response data from a noncontemporaneous comparative cohort.

This approach is applicable when an MDx test potentially identifies a subset of patients who benefit differentially from a drug treatment that has already received FDA approval on the basis of randomized trials in a broad patient population defined by disease characteristics but not biomarker status. In this setting, it would not be ethical or practicable to conduct subsequent RCTs in which a control group is denied the approved therapy. An alternative is to conduct a single-arm study. The study can be interpreted in the context of the response of a noncontemporaneous cohort or end points such as tumor shrinkage. Single-arm studies of this type are not as robust as RCTs because they provide only information on the test-positive patients, not the test-negative patients (who cannot be assumed not to benefit from the treatment). Nevertheless, marker-based differential tumor response can provide useful data to clinicians that can be used in the context of other relevant information to create an individual treatment plan.

Recommendation 9. Longitudinal observational study designs such as prospective cohort studies, patient registries that explicitly include comparators, and multiple group, pretest/posttest designs (quasi-experimental) may be used as evidence of CU provided that a compelling rationale for not doing an RCT is addressed, efforts to minimize confounding factors are documented, and good research practices for prospective observational studies are followed, including public registration of studies. Since the necessary parameters for evaluating the CU of MDx tests (e.g., clinical characteristics of patients, test findings and interpretation, subsequent care, and patient outcomes) are typically not found in secondary databases (including most electronic health records), the pursuit of retrospective observational studies is generally not adequate.

The decision to pursue an observational study rather than an RCT should be considered only when other approaches are not possible; this may be particularly problematic when evaluating predictive biomarkers that compare outcomes between treatments. Factors influencing the decision include the state of clinical equipoise for the MDx test of interest and whether the proposed study design and analysis plan will sufficiently address potential problems with time-varying and time-invariant confounding and bias.27 A prospective observational study should adopt best practices to minimize threats to validity. A full protocol with corresponding hypotheses and specified intervention groups, definitions of outcome measures as well as subgroups, power calculations, and an analysis plan that describes how to handle potential confounding, missing data, loss to follow-up, and heterogeneity of treatment effects is essential.28 Various user guides on best practices for designing observational studies have been prepared by the Agency for Healthcare Research and Quality and other expert task forces, and researchers are encouraged to consult these guides before planning an observational study.27,29,30,31

Recommendation 10. Use formal decision-analytic modeling techniques to elucidate the relationship between test results, corresponding clinical pathways, and downstream patient outcomes in cases where an MDx test has established evidence of CV and plausible evidence of CU based on modeling of the initial scenario (a simplified approach for outcomes: base case, best case, worst case).

In this context, decision-analytic modeling denotes a model that is used to depict a common clinical scenario in MDx testing; however, other model types, such as state-transition models or discrete event simulations, may be appropriate, depending on the clinical situation.32 These models are useful in the common situation where there is no direct evidence of CU. Developing a simple decision model, called a “scenario model,” that consists of a simplified decision tree and a series of “what if” scenarios can provide a quantitative assessment of the general likelihood that an MDx test will demonstrate CU. The key parameters and assumptions under three scenarios (base case, best case, and worst case) should be revisited with key stakeholders (e.g., patients, clinicians, and payers) and the outcomes estimated for each case.
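A scenario model of this kind can be sketched as a two-branch decision tree evaluated under base-case, best-case, and worst-case assumptions. The structure below is a deliberately minimal illustration: the class and function names, the single outcome measure (probability of a good outcome), and every parameter value are hypothetical and would in practice be set with the stakeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    p_marker_pos: float          # fraction of patients who are marker-positive
    sensitivity: float           # test sensitivity for marker-positive status
    specificity: float           # test specificity for marker-negative status
    good_pos_targeted: float     # P(good outcome): marker-positive, targeted therapy
    good_pos_standard: float     # P(good outcome): marker-positive, standard care
    good_neg_targeted: float     # P(good outcome): marker-negative, targeted therapy
    good_neg_standard: float     # P(good outcome): marker-negative, standard care


def outcome_without_test(s):
    """Everyone receives standard care."""
    return (s.p_marker_pos * s.good_pos_standard
            + (1 - s.p_marker_pos) * s.good_neg_standard)


def outcome_with_test(s):
    """Test-positive patients receive targeted therapy; others standard care."""
    pos = (s.sensitivity * s.good_pos_targeted
           + (1 - s.sensitivity) * s.good_pos_standard)
    neg = ((1 - s.specificity) * s.good_neg_targeted
           + s.specificity * s.good_neg_standard)
    return s.p_marker_pos * pos + (1 - s.p_marker_pos) * neg


def net_benefit(s):
    """Difference in expected outcome between testing and not testing."""
    return outcome_with_test(s) - outcome_without_test(s)


# Hypothetical parameter sets for the three "what if" scenarios.
SCENARIOS = {
    "base case": Scenario(0.20, 0.90, 0.90, 0.60, 0.30, 0.20, 0.30),
    "best case": Scenario(0.30, 0.95, 0.95, 0.70, 0.30, 0.25, 0.30),
    "worst case": Scenario(0.10, 0.80, 0.80, 0.45, 0.30, 0.10, 0.30),
}

if __name__ == "__main__":
    for name, scenario in SCENARIOS.items():
        print(f"{name}: net benefit {net_benefit(scenario):+.3f}")
```

With these made-up inputs the net benefit is positive in the base and best cases and negative in the worst case, where harm to misclassified marker-negative patients outweighs the gain in marker-positive patients. A pattern like this would indicate which parameters most need better evidence before a CU claim is plausible.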

For MDx tests that cross the plausibility threshold, modeling techniques are used to project the overall downstream health outcomes (all patient-relevant benefits and harms related to the duration and quality of remaining life, such as modeled estimates of clinical events, life expectancy, and quality-adjusted life-years)33 that in most instances may not be available, even within the context of RCTs, because of limited follow-up, highly selected patient populations, and/or small sample sizes. Alternatively, data from separate studies demonstrating the relationship between biomarker statuses, various steps in the care pathway, and patient outcomes may be quantitatively linked through modeling to provide estimates of the net benefit to patients.

Discussion

These recommendations aim to clarify what is adequate evidence for coverage of MDx tests. Greater clarity, consistency, and predictability of evidence requirements are essential for investors and diagnostics companies to make informed decisions regarding test development. The TWG specifically confined these recommendations to “actionable” MDx tests; they exclude tests that do not provide information leading to an alteration in clinical management. While there has been debate on the definition of “clinical utility,” our TWG rapidly came to consensus with the prevailing concept of the Effectiveness in Genomic Application in Practice and Prevention Working Group, Medicare, many evidence review groups,29,34 and others35 that CU refers to evidence that use of MDx test information leads to a change in patient management that can result in improved health outcomes.

This definition of actionable is consistent with many payers’ concept of a “medically necessary” test, which can entail consideration not only of the impact of the test on patient management but also of the current standard of care, including the adequacy of other tools available for the same purpose as the test (i.e., comparative effectiveness). The evaluation of the CU of an MDx test is, likewise, inherently a comparative effectiveness research question, requiring a comparison of the effects of the new test result versus a standard (or no) test result on patient outcomes. For this purpose, the focus is primarily on health outcomes. Health-resource utilization would also be a meaningful outcome to examine in comparative studies but was not the focus of this work, since the significance of any economic analyses is dependent on sound evidence of CV and CU.

Given the uneven quality of published studies to date, numerous groups, including the Institute of Medicine,22 the National Cancer Institute,36 and the National Comprehensive Cancer Network,37 among others,38 have published checklists, study design recommendations, and criteria for evaluating the CV and CU of MDx tests, although not always strictly limited to tests used in oncology. Our process is distinct from these in that it involved a sustained dialogue across the full range of experts and stakeholders, and emphasized the information needs and participation of major health plans. A limitation of the EGD is that it is not a consensus statement of all participants or payers generally. Nevertheless, thoughtful input of key health-plan decision makers lends confidence that tests evaluated successfully under these guidelines can achieve affirmative coverage decisions.

Notably, the recommendations expand consideration of evidence to include not only RCTs and prospective-retrospective analyses of samples from previously conducted clinical trials but also prospective observational studies and modeling when the circumstances justify using these options. The recommendations thus reflect a growing recognition of the limitations of RCTs to address all relevant comparative questions in oncology and the usefulness of appropriately designed nonrandomized comparative effectiveness research studies.38

These recommendations create an important foundation for clarifying the evidence of CV and CU needed for coverage of MDx tests. However, as new high-throughput genomic sequencing techniques increasingly gain prominence in clinical laboratories, gradually supplanting traditional single-gene (or few gene) analyses, novel challenges arise for evaluating and covering testing. Many biomarkers originally developed as drug targets in a particular cancer can be targets in other types of cancer as well, but the effectiveness of the targeting in the new context is often unknown. How should the CU of large gene panels, or whole-exome or whole-genome sequencing, be evaluated? When is coverage appropriate? The barriers to using RCTs for assessment are all the more acute as the number of new variants to be evaluated increases. Answering these questions through a multistakeholder dialogue that includes payers—work that is underway39—is a critical next step to building constructively on the principles established in this EGD and ensuring patient access to high-quality, efficacious genomic testing for oncology decision making.

Disclosure

After this paper was written, L.J. (ADVI, Washington, DC) consulted for Myrial, GenomeDx, and Exact Sciences. It should be understood that this consultancy did not overlap with the writing of the manuscript. L.J. received an honorarium for his consultancy with Exact Sciences. After the writing of the manuscript was complete, L.J. gave expert testimony to the House Energy & Commerce Committee. D.N. discloses that he holds the position of director at the Clearity Foundation. D.N. has been compensated by Life Science Group for consultancy/advisory work and owns stock in Epic Sciences. R.T.M. discloses his position as head of technology strategy and innovation in research and development at Janssen Oncology. R.T.M. also owns stock in Johnson and Johnson. The other authors declare no conflict of interest.