Doubt is not a pleasant condition, but certainty is an absurd one.

Voltaire, in: Letter to Frederick William, Prince of Prussia; 28 November 1770

Introduction

Testing for measurable (‘minimal’) residual disease (MRD) in people with acute leukaemias and other haematologic cancers has gained popularity [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Results of these tests are now often included as an endpoint in reports of clinical trial outcomes and are increasingly used in clinical practice, with haematopoietic cell transplantation no exception. In addition to stratification for the risk of cancer recurrence, MRD testing is used to inform transplant-related medical decisions. For example, many experts, consensus statements, and management guidelines suggest considering results of MRD testing in the decision whether persons with acute lymphoblastic leukaemia (ALL) or acute myeloid leukaemia (AML) should receive a transplant in first remission, and in selecting the type of haematopoietic cell graft, the intensity of pretransplant conditioning, and the type, intensity or duration of post-transplant interventions such as immune suppression and/or pre-emptive post-transplant anti-leukaemia therapy [17,18,19,20,21]. However, as with any other prognostic or predictive test, the interpretation of MRD-test results is subject to limitations in statistical properties that need to be considered when translating these data to the clinic. Adding to this complexity, there are many techniques to quantify MRD and each has different operating characteristics. Treatment strategies are often decided based on results of one MRD test reported as a binary (negative or positive). This approach ignores basic characteristics of these tests, and the tests’ accuracy and precision (Fig. 1) in predicting clinical outcomes are not well described and often misunderstood. MRD tests using different techniques, such as multi-parameter flow cytometry and quantitative polymerase chain reaction (PCR), done on the same sample may give different results, especially when the readout is a binary: positive or negative [22, 23].
As such, data from MRD tests using different techniques should be considered complementary rather than duplicative. Discordances further complicate interpreting MRD-test results. Using binary readouts from MRD tests has several statistical issues besides decreased sensitivity and specificity including decreased power, underestimation of variation in outcome between groups (persons with very low-level positive MRD-test levels may be outcome-wise closer to MRD-test-negative persons than those who test high-level MRD positive), and inability to identify any linear relationships with outcomes [24]. Flexible models of quantitative MRD, such as spline models ([25]; elaborated on in another part of this series), can help to elucidate non-linear relationships in the data.

Here we discuss characteristics and properties common to all MRD tests including sensitivity, specificity, accuracy, precision, and positive- and negative-predictive values (see Table 1 for glossary and definitions of statistical terms). We define and compare these quantities, discuss their role in informing medical decisions, and describe the role of randomized trials in evaluating MRD-test results. For a broader discussion of outcome prediction in people with haematologic cancers, see [26].

The perfect MRD test and why it does not yet exist (and may never exist)

Critical appraisal of the performance properties of any MRD test requires defining what the ideal MRD test should do and what is the clinical comparator. For example, is the test designed to detect some or all residual cancer cell(s) or only cancer cells biologically able to cause relapse within a specified interval (perhaps the subject’s remaining lifetime) or which cause relapse within a specified observation interval? These are distinct, sometimes overlapping, goals. Equally relevant is the outcome we want to predict with the MRD-test result. Do we plan to use it to predict relapse, best analysed as cumulative incidence of relapse (CIR) because of competing events, relapse-free or event-free survival, overall survival, or some other endpoint of clinical interest? With the current interest in MRD testing for risk-stratification and treatment decision-making, a perfect MRD test might be defined as one which accurately identifies and quantifies the smallest population(s) of cancer cells in someone in histological complete remission which, if left untreated, cause relapse whilst being indifferent towards residual cancer cells which do not cause relapse during a specified observation interval.

The follow-up period in retrospective MRD-test analyses is important but often ignored. Most clinical trials have a finite follow-up interval, say, 2 or 5 years. We can evaluate whether results of a positive MRD test predict relapse over a lifetime only if all relapses are observed during this interval (i.e. there will be no further relapses after the finite follow-up time). Are the positive MRD tests in subjects without relapse in the follow-up interval false-positives or has the observation interval been insufficient for all relapses to occur? Moreover, it is likely some subjects who might have relapsed during the observation interval died of other causes, say graft-versus-host disease (GvHD) or a heart attack, before they could relapse. We can correct for this only imperfectly by accounting for competing causes of failure as is done by CIR analyses. Other persons may die after the observation interval of related or unrelated events, say cancer recurrence or a stroke. We will not, of course, know of these events. As such, there must be an unavoidable rate of false-positive and -negative MRD-test results, real or not, when events occur beyond the observation interval if the goal is to use MRD to evaluate lifetime risk. Further complicating this analysis is that some patients may have such a high risk of a competing event (e.g. non-relapse mortality) that their CIR will not be relevant, as such patients would be predicted to die before relapse could ever occur.
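To make the competing-risks point concrete, the following is a minimal sketch of a nonparametric cumulative incidence estimate. The function name, the event coding (0 = censored, 1 = relapse, 2 = non-relapse death) and the five follow-up times are hypothetical and chosen only for illustration; they are not data from any study discussed here.

```python
def cumulative_incidence(times, events, cause):
    """Nonparametric cumulative incidence of one cause with competing risks.

    events: 0 = censored, any other integer = an event type.
    Returns (time, cumulative incidence) at each distinct follow-up time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    event_free = 1.0  # probability of no event of any type so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (u, e) in data if u == t]
        d_cause = sum(1 for e in tied if e == cause)
        d_any = sum(1 for e in tied if e != 0)
        # Only subjects still event-free can have the event of interest.
        cif += event_free * d_cause / at_risk
        event_free *= 1 - d_any / at_risk
        curve.append((t, cif))
        at_risk -= len(tied)
        i += len(tied)
    return curve


RELAPSE, NRM = 1, 2                # hypothetical event codes
times = [1, 2, 3, 4, 5]            # hypothetical follow-up times (years)
events = [RELAPSE, NRM, RELAPSE, 0, RELAPSE]

cir = cumulative_incidence(times, events, RELAPSE)

# Naive approach: treat non-relapse deaths as censored. With a single event
# type the same function returns 1 minus the Kaplan-Meier estimate.
naive_events = [e if e == RELAPSE else 0 for e in events]
naive = cumulative_incidence(times, naive_events, RELAPSE)

for (t, a), (_, b) in zip(cir, naive):
    print(f"t={t}: CIR={a:.3f}  1-KM={b:.3f}")
```

In this toy data set the naive estimate is never lower than the CIR, because it implicitly assumes subjects who died of other causes could still relapse.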

Although several technologies focused on immune phenotype or cytogenetic and/or molecular abnormalities have been developed to detect neoplastic haematopoietic cells, each with advantages and disadvantages, our understanding of cancer stem cells and how they differ in the context of immune phenotype and/or molecular features from other cancer cell populations is incomplete and unavoidably imperfect [3,4,5,6, 9,10,11, 27,28,29,30,31,32,33]. In other words, one reason the perfect MRD test with 100% sensitivity and specificity to predict cancer recurrence at the cohort- or subject-levels does not yet exist is related to incomplete knowledge of the neoplastic cells able to cause relapse. There are however additional reasons accounting for the substantial rates of false-positive and -negative tests.

For the clinical performance of any MRD test the theoretical maximum sensitivity and specificity of an assay to detect operationally relevant residual cancer cells (i.e. those causing relapse), together with other characteristics such as the reproducibility and repeatability or test–retest reliability (the components of a test’s precision) or replicability are important. Of course, these characteristics are not unique to MRD tests but apply to other medical assessments such as the histological assessment of a bone marrow specimen [34]. The precision of the test may be impacted by small volume sample (discussed later) as well as measuring technology (e.g. calibration of flow cytometers or PCR machines). In addition to sampling site and volume, other sampling details (timing, frequency, etc.) and result interpretation, for which many uncertainties remain, are of practical consideration. For example, even using histological criteria for complete remission, synchronous biopsies at several sites may be discordant. This is true not only for solid cancers such as prostate cancer but also for haematopoietic cancers. Discordance rates are even higher for metachronous biopsies. In other words, the perfect MRD test to predict cancer recurrence at the cohort- or subject-levels may never exist.

Measures of accuracy for binary definitions of MRD

When results of a test with quantitative, numerical data are reduced to a binary outcome, an unfortunate and often inaccurate strategy, there are four possible, mutually exclusive, outcomes: (1) true-positive; (2) true-negative; (3) false-positive; or (4) false-negative. The 2 × 2 diagram in Table 2 summarizes the distribution of the four test outcomes. In medical testing we are often concerned with false-positives and -negatives. Assume in this example we wanted to develop an MRD test that would identify and quantify cells which without further treatment are biologically able to cause relapse within a specified interval. For such a test, a false-positive would be an MRD test result indicating there are remaining cancer cells destined to cause relapse when, in fact, no intervention is needed to prevent cancer recurrence within the specified observation interval. This could be because the test is not sufficiently specific or because it identifies cancer cells that cannot or do not cause relapse within the observation interval. There are several potential reasons for these errors including the cells detected by the MRD test lack the biological ability to cause cancer recurrence during the observation interval or because of stochastic considerations (the cells have the biological ability to cause relapse but this does not occur for unpredictable reasons such as the cell(s) never divide(s)). A false-negative MRD test would indicate that there were no remaining cells which would result in relapse unless there is an effective intervention. The true-positive rate is equal to 1 − the false-negative rate and is often referred to as sensitivity. The true-negative rate is equal to 1 − the false-positive rate and is often referred to as specificity.

In addition to the four measures described above, positive- and negative-predictive values (PPV and NPV) are important in understanding the performance of any test. PPV and NPV are the proportions of positive tests which are true-positives and negative tests which are true-negatives. PPV and NPV depend on sensitivity and specificity of the test and, importantly, on the true prevalence of positive subjects compared with the positive and negative test results. These values can be estimated for a binary test and binary outcome with straightforward 2 × 2 table calculations. Table 3 provides sample data showing how these values can be calculated. In this example we use MRD measured by multi-parameter flow cytometry from an AML trial in subjects age 18–60 years [35, 36]. We note similar calculations can be done with more complicated definitions of MRD including combining MRD results across multiple time-points or summaries of MRD kinetics over time.

In this cohort MRD data were available on 170 subjects achieving histological complete remission [36]. Relapse-free survival at 1 year was measured from the date of histological complete remission and relapse and death were considered events. In this example:

• sensitivity (true-positive rate = probability MRD test is positive amongst subjects who will relapse and/or die in the following year) = $$\frac{{18}}{{18 + 20}}$$ = 47%

• specificity (true-negative rate = probability MRD test is negative amongst subjects who will neither relapse nor die in the following year) = $$\frac{{99}}{{99 + 33}}$$ = 75%

• PPV (probability of experiencing relapse and/or death within 1 year amongst subjects who are MRD positive) = $$\frac{{18}}{{33 + 18}}$$ = 35%

• NPV (probability of experiencing neither relapse nor death within 1 year amongst subjects who are MRD negative) = $$\frac{{99}}{{99 + 20}}$$ = 83%

Even acknowledging additional anti-leukaemia therapy was given before the 1-year mark to most of these subjects, these calculations highlight that, with relapse or death within 1 year as the outcome, results of this MRD test have substantial mis-classification rates. However, even with this high level of mis-classification, the MRD-test result is strongly associated with relapse-free survival with an odds ratio of 2.7 for 1-year relapse-free survival consistent with the strong prognostic association of MRD observed across many cohorts of people with AML [5].
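The 2 × 2 calculations above can be reproduced with a few lines of code. This is a minimal sketch: the cell counts are read off the fractions in the bullet list (18 true-positives, 20 false-negatives, 99 true-negatives, 33 false-positives), whereas the function names and the alternative prevalence of 50% are assumptions introduced only for illustration.

```python
def two_by_two_metrics(tp, fn, tn, fp):
    """Standard accuracy measures from the four cells of a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "odds_ratio": (tp * tn) / (fp * fn),
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV from Bayes' rule, for a given prevalence of the outcome."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Cell counts taken from the worked example in the text (170 subjects).
metrics = two_by_two_metrics(tp=18, fn=20, tn=99, fp=33)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")

# Same sensitivity and specificity, hypothetical 50% event prevalence.
print(ppv_at_prevalence(metrics["sensitivity"], metrics["specificity"], 0.5))
```

With sensitivity and specificity held fixed, raising the event prevalence from 38/170 to a hypothetical 50% raises the PPV from about 35% to about 65%, which illustrates why PPV and NPV cannot be carried over between populations without checking prevalence.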

Unfortunately, sensitivity, specificity, PPV, and NPV of MRD tests are not routinely described in biomedical publications. Most focus on the prognostic strength of the MRD test showing, on average, outcomes of persons with MRD-negative tests are significantly better than outcomes of persons with MRD-positive tests. Typically, the outcome interrogated is survival although an MRD-test result is biologically more likely to correlate with CIR because survival is influenced by other outcomes, including some, such as treatment-related mortality (TRM), not expected to correlate with MRD-test result, whereas others like GvHD are confounded with CIR (persons with GvHD are less likely to relapse than those without GvHD). Although understandable from a clinical perspective (and perhaps driven by requirements from Health Authorities), the focus of many if not most reports on survival rather than CIR makes little biological sense. As an additional limitation, when estimating sensitivity, specificity, or other statistical quantities in settings with censored data or competing risks, a 2 × 2 table cannot be accurately constructed and specific statistical methodologies are needed to account for these data features [37]. These analyses are complex and debatable; for example, the definition of specificity can vary depending on how persons with a competing event are analysed [38].

Because of the strong correlation between MRD-test results and cancer recurrence (and related outcomes) and the diverse treatment options for many persons with haematologic cancers, many physicians wish to use MRD-test results to determine best-possible therapy options. But biomarkers with strong prognostic associations can still have very poor predictive properties with respect to identifying the best therapy for someone [39]. For example, an odds ratio (OR) of 3 can be associated with greater than 50% false-positive or false-negative rates. A test with 90% specificity and 80% sensitivity corresponds to an OR of 36, much higher than odds ratios or hazard ratios typically found in reports of MRD testing in the biomedical literature. But even an OR of 36 may be insufficient to have a very accurate biomarker. For example, a sensitivity of about 97% and a specificity of 50% will also give an OR of 36, highlighting the importance of reporting values including sensitivity and specificity, not just an OR or a hazard ratio.
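The link between an OR and a test's sensitivity and specificity follows from the identity OR = [sensitivity / (1 − sensitivity)] × [specificity / (1 − specificity)] for a binary test and a binary outcome. The sketch below applies it to the figures quoted above; the function names and the 80% specificity used in the OR = 3 case are assumptions made for illustration.

```python
def odds_ratio(sensitivity, specificity):
    """Odds ratio implied by a binary test's sensitivity and specificity."""
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))


def sensitivity_for(target_or, specificity):
    """Sensitivity needed to reach a target odds ratio at a given specificity."""
    odds = target_or * (1 - specificity) / specificity
    return odds / (1 + odds)


print(odds_ratio(0.80, 0.90))      # the 80% / 90% example in the text
print(sensitivity_for(36, 0.50))   # same OR with only 50% specificity
print(sensitivity_for(3, 0.80))    # OR of 3 at a hypothetical 80% specificity
```

In the last case the implied sensitivity is below 50%, so more than half of the subjects who go on to have the event would test negative despite an OR of 3.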

Accuracy of quantifying MRD-test data

Although MRD tests are typically quantitative, results are often reported as positive or negative. This quantitative measurement can be converted to a binary measurement (positive or negative) by identifying people as positive who have any residual cancer cells detected by the test or by setting a minimum threshold of residual disease to be detected, for example, >0.1% of cells with an abnormal immune phenotype or residual cells with a mutation variant allele frequency >0.001. Many statistical methods are proposed to identify the ‘best’ threshold for creating a binary but using thresholds instead of the quantitative measurement is often associated with reduced (at times, substantially reduced) predictive performance [25, 40].

A common statistic reported as a generalization of sensitivity or specificity is the area under the receiver operating characteristic (ROC) curve (AUC) often translated to a concordance or C-statistic, discussed below. The ROC curve is plotted by tabulating the sensitivity and 1 − specificity of binary markers defined by every possible cut-point in the quantitative biomarker. As such it is invariant to the scale or units of the biomarker and so ROCs can be compared for different MRD measurements. The AUC is a single-summary value of the ROC and is constrained to be between 0 and 1. The C-statistic is the proportion of pairs of persons correctly ranked by the biomarker (i.e. the person with worse outcome in the pair also has a worse biomarker score). A C-statistic <0.5 is consistent with the test being worse than the flip of a fair coin in predicting outcome, a C-statistic of 0.5 is consistent with a flip of a fair coin and a C-statistic of 1 is perfect prediction (no false-positives or -negatives). With a binary outcome, the AUC is equal to the C-statistic. C-statistics are defined more generally than AUC and so can be reported for time-to-event endpoints. The C-statistic for the data in Table 3, 0.59, shows weak predictive accuracy. We note that the C-statistic is a function of the prevalence of a biomarker, and so the C-statistic of a test with constant sensitivity and specificity can vary between populations with varying MRD-positive prevalence.
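The equivalence of the AUC and the C-statistic for a binary outcome can be checked numerically. The sketch below computes the C-statistic by comparing every pair of one subject with the event and one without (ties counted as one half), and separately builds the ROC curve from every cut-point and integrates it with the trapezoid rule. The MRD values are hypothetical and the function names are illustrative.

```python
def c_statistic(cases, controls):
    """Proportion of case-control pairs ranked correctly (ties count 0.5)."""
    concordant = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                concordant += 1.0
            elif x == y:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def roc_points(cases, controls):
    """(1 - specificity, sensitivity) for every cut-point, 'positive' if >= cut."""
    points = [(0.0, 0.0)]
    for cut in sorted(set(cases) | set(controls), reverse=True):
        tpr = sum(x >= cut for x in cases) / len(cases)
        fpr = sum(y >= cut for y in controls) / len(controls)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical MRD levels (% abnormal cells); 0.0 means nothing detected.
relapsed = [0.5, 0.05, 0.0]
not_relapsed = [0.1, 0.0, 0.0, 0.01]

print(c_statistic(relapsed, not_relapsed))
print(auc(roc_points(relapsed, not_relapsed)))
```

Both calculations give the same value for these toy data. Multiplying every MRD value by a constant would leave the result unchanged, which is the scale invariance noted above.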

What even the perfect MRD test cannot tell you

It is unlikely any MRD test will have perfect sensitivity to detect a designated target or targets and/or biomarker(s). Even with perfect sensitivity one might get a false-negative MRD-test result because of inconsistent presence of the assay target(s) in cancer cells.

As we summarized previously [5], it is a common misunderstanding that improvements in the MRD-test technology will eventually eliminate false-negative MRD tests by providing a complete accounting of the remaining residual cancer cells. Rather, the ability to detect low levels of residual cancer cells is limited primarily by the character and size of the sample tested, not MRD-test sensitivity. This is an important limitation considering MRD tests are typically based on small samples such as 1 mL of bone marrow from an estimated 750 mL volume in a 70 kg male or a 10 mL blood sample from a 5.5 L estimated blood volume. There is also the issue of topographic heterogeneity. For example, leukaemia cells are thought to occupy specific bone marrow niches rather than being uniformly distributed. Taking larger bone marrow samples will not necessarily resolve this bias. For example, bone marrow samples larger than 5 mL simply contain more blood cells, not more bone marrow cells [41, 42].
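The sample-size limit described above can be illustrated with a simple sampling model. This is a sketch under strong assumptions that are not from the source: residual cancer cells are taken to be randomly mixed in the sampled compartment (which the topographic heterogeneity just discussed argues against), the assay is assumed to identify every cancer cell it examines, and the number of cancer cells in the sample is modelled as Poisson. The function name, the cell counts, and the minimum of 10 cells needed to call a result positive are hypothetical.

```python
import math


def detection_probability(cancer_cell_fraction, cells_assayed, min_cells=1):
    """P(at least `min_cells` cancer cells are present in the sample).

    Poisson approximation; assumes random mixing and a perfect assay.
    """
    expected = cancer_cell_fraction * cells_assayed
    p_fewer = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                  for k in range(min_cells))
    return 1 - p_fewer


# Hypothetical: one cancer cell per million cells, one million cells assayed.
print(detection_probability(1e-6, 1_000_000))
# Same sample if 10 cancer cells are needed to call the result positive.
print(detection_probability(1e-6, 1_000_000, min_cells=10))
# Ten times as many cells assayed.
print(detection_probability(1e-6, 10_000_000))
```

Under these assumptions, even an analytically perfect assay would miss residual disease at one cell per million in roughly a third of one-million-cell samples, simply because the sample contains no cancer cells.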

The MRD-test performance may also be impacted by the frequency of testing. Single time-point measurements more likely result in false-positive and -negative MRD-test results than when result trends are considered. Re-testing can decrease the likelihood of incorrectly interpreting an MRD-test result. Even without intervention, however, repeat MRD-testing results are occasionally discordant in both possible directions: a negative to positive MRD test or the converse. Discordances have many explanations but precision of the test and small volume sampling of a topographically heterogeneous population of cancer cells [43, 44] are important considerations. Requiring concordant results to declare a person MRD-positive or -negative increases specificity but decreases sensitivity. Test result validation using alternative methodologies may be useful in such circumstances but may be impractical. Sequential MRD testing can be particularly helpful as a strategy to increase sensitivity if changes in MRD-levels (e.g. increasing transcript levels or increasing percentage of immune-phenotypically abnormal cells) are the readout. A single discordant datapoint would be insufficient to make an estimate of clinically relevant changes in residual cancer cells. The optimal interval and duration of sequential MRD testing is unknown and may depend on several variables such as the type and mutation profile of the cancer, type of therapy, or interval since achieving remission.
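The trade-off from requiring concordant results can be sketched numerically. The example assumes two tests with the same sensitivity and specificity whose errors are independent given the true status; this is a simplifying assumption that is unlikely to hold for repeat samples from the same person, so the numbers indicate direction rather than magnitude. The function names and the input values of 47% and 75% are illustrative.

```python
def both_positive_rule(sensitivity, specificity):
    """Call MRD-positive only if two tests are both positive.

    Assumes the two tests err independently given the true status.
    """
    combined_sensitivity = sensitivity ** 2
    combined_specificity = 1 - (1 - specificity) ** 2
    return combined_sensitivity, combined_specificity


def either_positive_rule(sensitivity, specificity):
    """Call MRD-positive if at least one of two tests is positive."""
    combined_sensitivity = 1 - (1 - sensitivity) ** 2
    combined_specificity = specificity ** 2
    return combined_sensitivity, combined_specificity


single = (0.47, 0.75)  # illustrative single-test sensitivity and specificity
print("single test     :", single)
print("both positive   :", both_positive_rule(*single))
print("either positive :", either_positive_rule(*single))
```

Requiring two positive results raises specificity and lowers sensitivity; accepting either positive result does the opposite.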

Evidence for basing treatment decisions on MRD-test results

Biomarkers measured prior to treatment to provide information on the likelihood of response to a specific therapy are often called predictive biomarkers. Common examples of predictive biomarkers in oncology are genomic biomarkers which indicate who should receive a targeted therapy, or ‘fitness’ biomarkers such as a performance score indicating a person’s ability to survive intensive therapy. Because MRD tests are imperfect, because people are misclassified and because retrospective and observational studies are subject to many biases [45], randomized trials are needed to prove the benefit of using MRD-test results for therapy decisions such as whether or not to do a transplant in someone with ALL or AML in first remission. Partly, but not only, because MRD-test technologies are quickly evolving, there has been insufficient commitment to the large, long-term randomized clinical trials needed to prove the value in making therapy decisions based on results of MRD testing. Nevertheless, such trials are needed to accurately characterize the trade-offs present in such a treatment strategy. Retrospective non-randomized data cannot be used to evaluate this because the outcomes are often confounded by physician actions based on the results of MRD tests, such as giving additional therapy or performing a transplant, or by treatment decisions made on the basis of other criteria including clinical judgement.

False-negatives and -positives are important when considering MRD-test results for decisions on interventions associated with serious adverse medical consequences such as allogeneic haematopoietic cell transplant. Consider an example in which a cohort of AML patients underwent MRD testing at the completion of post-remission therapy (Fig. 2; hypothetical example like data reported by Terwijn et al. [46]):

• If all MRD-test-positive persons were deemed at high risk of relapse and received transplant, many persons would have received it with no possibility of benefit and substantial possibility of harm.

• If all MRD-test-negative persons were deemed low risk of relapse and did not receive a transplant, many would have relapsed. Whether or not they could be rescued with a transplant at this time is controversial. However, there are no convincing data a transplant done earlier would have improved their subsequent relapse outcome.

Our bottom line

Interpreting results of MRD testing is complex. Whether making therapy decisions based on MRD-test results improves clinical outcomes can only be tested in randomized clinical trials [47,48,49,50]. Such data are lacking. Is MRD testing useful? Clearly. However, it is important physicians understand that when they act on an MRD-test result, either by giving or withholding a therapy, they will often be wrong. Risks associated with these MRD-test result-based decisions are asymmetrical. When the intervention is safe, an incorrect prediction may have little medical consequence although it may have other adverse effects, such as psychological and financial ones. In contrast, when the intervention is associated with serious adverse medical consequences including death, this uncertainty needs to be acknowledged by the physician and conveyed to the patient. Harm may be of a lesser magnitude when the decision to withhold an intervention is based on results of MRD testing as there are no convincing data yet that most earlier interventions for an event such as cancer recurrence improve outcomes [51,52,53]. As Stephen Hawking said: The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.