Fate laughs at probability

Edward George Earle Lytton Bulwer-Lytton, 1st Baron Lytton PC

In Greek mythology the athánatoi, the immortals (the Moirai or Fates), fixed a person’s destiny at birth and made certain the Fate assigned to every being by eternal laws might take its course without obstruction.1, 2 There was nothing man or even Zeus could do to alter one’s Fate. The Greek Stoic philosopher Epictetus in the Enchiridion was a bit more optimistic: “Some things are in our control and others not”.3

We, however, live in a different philosophical age where mortals (which, no doubt surprisingly to some physicians, we are) dare to tempt the Fates and in the process predict the impact of our attempts to alter our Fate or the Fate of others by our interventions. In his Book of Prognostics, one of the earliest written works about medicine (about 400 BC) Hippocrates noted: “it appears to me a most excellent thing for the physician to cultivate Prognosis; for by foreseeing and foretelling, in the presence of the sick, the present, the past, and the future and explaining the omissions which patients have been guilty of, he will be the more readily believed to be acquainted with the circumstances of the sick; so that men will have confidence to entrust themselves to such a physician”.4

Prediction is fundamental to therapy decisions in persons with AML. After the diagnosis is made physicians must recommend one of various options, including conventional chemotherapy, either intensive (typically cytarabine and daunorubicin) or less intensive (typically azacitidine or decitabine), no active intervention or participation in a trial of an investigational intervention. The poorer the predicted outcome with the conventional interventions, the greater the impetus to suggest palliative therapy only or participation in a clinical trial. Subsequently physicians must make other decisions such as whether to recommend a transplant to a person in remission or wait to see if he/she relapses. In this editorial, we discuss several issues relevant to predicting outcomes of persons with AML. We conclude physicians over-estimate their ability to accurately predict the Fate of subjects with AML and also question physicians’ (and patients’) enthusiasm for the supposed guidance provided by prognostic and predictive metrics.

How good are we at predicting?

Many covariates correlate with response to therapy. Increasingly these covariates are biomarkers.5 These covariates are often combined to produce a prognostic model dividing populations into cohorts with better, intermediate or worse prognostic scores and therefore predicted outcomes.6, 7 It is, of course, important to validate the prognostic value of any model in a separate, albeit similar population.8, 9 Because prognostic models are likely to be intervention-dependent, development and validation populations should receive the same therapy. One problem complicating the validation process is selection biases. Typically subjects in the model development and validation populations were treated in clinical trials leaving one to assume such subjects are representative of all people with AML most of whom are treated outside trials. This assumption is likely incorrect.10

Although P-values <0.05 derived from multivariate regression analyses are routinely used as evidence a model identifies distinct prognostic cohorts8 we believe the concordance statistic (hereafter the C-statistic) is a better, or at least a complementary, way to evaluate a model’s performance. In cases with a binary outcome the C-statistic is often referred to as the area under a receiver operating characteristic curve (AUC). Practically speaking, the C-statistic is the probability a person in whom an event such as leukemia relapse and/or death occurs will have a worse score than a person in whom the event does not occur or occurs at a later time.11, 12 A C-statistic of 1.0 indicates perfect concordance between the prediction provided by the model and actual outcome whereas a C-statistic of 0.5 indicates random concordance similar to flipping a fair coin. C-statistic values of 0.6–0.7, 0.7–0.8 and 0.8–0.9 are commonly considered reflective of poor, fair and good concordance with predictions, respectively. In newly diagnosed persons with AML C-statistic values for several prognostic models including those incorporating mutation data combined with other covariates are often 0.65–0.84,13, 14, 15, 16 approximately midway between prognostic certainty and a fair coin flip. This level of prognostic accuracy is similar to the correlation between prostate specific antigen level and biopsy-proved prostate cancer,17 a magnitude of correlation which, together with the medical consequences (if any) of biopsy-proved prostate cancer, has prompted heated debate about appropriateness of routine prostate specific antigen screening in older males. Clearly C-statistic values and other measures of explained variation indicate estimates of events such as relapse or death from prognostic and predictive models in AML are much less accurate than physicians believe based on P-values, which are not measures of accuracy.18
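The pairwise definition above can be made concrete with a short sketch. The function below is an illustrative Harrell-style calculation, not the method used in any of the cited studies; the function name, the convention that a higher score means a worse prognosis, and the decision to skip pairs with identical follow-up times are all assumptions made for this example.

```python
from itertools import combinations


def c_statistic(scores, times, events):
    """Concordance between risk scores and observed outcomes.

    scores: model output per person (assumed: higher = worse prognosis)
    times:  follow-up time per person
    events: 1 if relapse/death was observed, 0 if follow-up was censored
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(scores)), 2):
        # Order the pair so that `a` is the person with the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is informative only if the earlier time is an observed event.
        # Pairs with equal times are skipped (a simplification).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0   # earlier event had the worse score
        elif scores[a] == scores[b]:
            concordant += 0.5   # tied scores count as a coin flip
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical scores and follow-up (months) for five people.
print(c_statistic([0.9, 0.8, 0.4, 0.3, 0.2], [3, 12, 8, 30, 24], [1, 1, 1, 0, 0]))
```

A model that assigns everyone the same score returns 0.5 under this definition, which is the coin-flip benchmark described in the text.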

The C-statistic has limitations. For example, its value depends on the prevalence and/or distribution of important covariates.11 If ECOG performance score 3–4 vs. 0–2 is correlated with an event the same prognostic or predictive model might have a higher C-statistic if tested in a population with ECOG performance scores of 0–4 vs. solely in a population with ECOG performance scores of 0–2. Most importantly, the C-statistic does not directly address a question of interest to many people, namely the probability of an event when the model predicts they will have one (positive predictive value (PPV)) or the probability of no event when the model predicts they will not have one (negative predictive value (NPV)). Answers to these questions depend on true event rates in a population.11 Thus although the C-statistic quantifies the ability of various prognostic models to discriminate persons at high- vs. low-risk, it does not measure calibration, a measure of how well a model’s predictions correspond to observed event rates. These considerations aside, we believe information about C-statistics, PPVs and NPVs should accompany prognostic and predictive models.
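The dependence of PPV and NPV on the true event rate follows directly from Bayes' rule. The sketch below assumes a model whose "event predicted" call has a fixed sensitivity and specificity; the function name and the numbers used are illustrative assumptions, not values from any AML study.

```python
def predictive_values(sensitivity, specificity, event_rate):
    """Return (PPV, NPV) for a binary prediction at a given true event rate."""
    if not 0.0 < event_rate < 1.0:
        raise ValueError("event_rate must be strictly between 0 and 1")
    true_pos = sensitivity * event_rate
    false_neg = (1.0 - sensitivity) * event_rate
    false_pos = (1.0 - specificity) * (1.0 - event_rate)
    true_neg = specificity * (1.0 - event_rate)
    if true_pos + false_pos == 0.0 or true_neg + false_neg == 0.0:
        raise ValueError("the model never makes one of the two predictions")
    ppv = true_pos / (true_pos + false_pos)   # P(event | event predicted)
    npv = true_neg / (true_neg + false_neg)   # P(no event | no event predicted)
    return ppv, npv


# Same hypothetical model, two hypothetical populations.
for rate in (0.5, 0.2):
    ppv, npv = predictive_values(sensitivity=0.8, specificity=0.8, event_rate=rate)
    print(f"event rate {rate:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With these assumed inputs the PPV falls from 0.80 to 0.50 when the event rate falls from 50% to 20%, although the model's discrimination is unchanged.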

Sources of uncertainty

One explanation for the imperfect performance of various prognostic models is unidentified or latent predictive covariates. Prominent among these are selection biases which can be difficult to quantify10 and presence/absence in AML blasts of large numbers of mutations, abnormalities in gene expression and so on, knowledge of which is increasing rapidly. For example, Gerstung et al.19, 20 recently used data from 1540 adults <60 years to compare survivals predicted by the European LeukemiaNet (ELN) system with those predicted using a knowledge bank of data from studies of mutations in 111 cancer genes in these subjects. Subjects were divided into those predicted to have more or less than a 10% survival benefit from a transplant in first complete remission vs. delaying a transplant until relapse or second complete remission. Knowledge bank- and ELN-based predictions were discordant in about 15% of subjects. Subjects the knowledge bank, but not the ELN system, predicted to benefit from a transplant because of a less favorable prognosis and who received a transplant had a benefit whereas subjects predicted not to benefit from a transplant in first remission because of a more favorable prognosis did not benefit. However, there are many confounders in these analyses such as selection biases, namely the possibility persons in these studies, particularly those receiving a transplant, are the chosen people and thus un-representative of most persons with AML.10 Importantly, it is unclear this re-classification resulted in a clinically important improvement in the C-statistic necessary to determine if this mutation-based prediction of survival is better than current systems at the subject-level.

We believe there is inappropriate dependence on pre-treatment covariates such as age, performance score, cytogenetics, mutations and co-morbidities. Intuitively, prognosis becomes clearer after therapy begins. For example, because older subjects and those with a poor performance score and/or with co-morbidities are disproportionately likely to die soon after starting therapy, the prognostic power of these important covariates diminishes with time. And performance score can improve or worsen after starting therapy. Most AML prognostic models do not consider these time-dependent covariates nor revise estimates of prognosis based on prior events. Such adjustments are readily done using frequentist or Bayesian techniques.21 After all, it is much easier to predict who will finish a marathon at the 41 km mark than at the starting line.
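One simple frequentist way to revise a prognosis as time passes is conditional survival: the probability of being alive at a later time given survival to a landmark, S(t)/S(s). The sketch below is only an illustration of that idea. The survival curve is invented for the example and the function name is hypothetical; a real analysis would estimate the curve from data and would also update time-dependent covariates such as performance score.

```python
def conditional_survival(curve, landmark, horizon):
    """P(alive at `horizon` | alive at `landmark`) from a survival curve.

    curve: dict mapping time in months to cumulative survival probability
    """
    if horizon < landmark:
        raise ValueError("horizon must not precede the landmark")
    if curve[landmark] <= 0.0:
        raise ValueError("no one is alive at the landmark")
    return curve[horizon] / curve[landmark]


# Invented curve for illustration only (months -> survival probability).
hypothetical_curve = {0: 1.00, 1: 0.85, 6: 0.60, 12: 0.45, 24: 0.30}

for months_survived in (0, 1, 12):
    p = conditional_survival(hypothetical_curve, months_survived, 24)
    print(f"alive at {months_survived:>2} months: P(alive at 24 months) = {p:.2f}")
```

Under these invented numbers the estimated 24-month survival rises from 0.30 at diagnosis to about 0.67 for someone alive at 12 months, the marathon point made above.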

Assessment of post-therapy covariates is receiving increasing attention. For example, Chen et al.22 reported a platelet count <30 × 10^9/L 21 days after starting induction therapy was independently associated with failure to achieve a complete remission despite continued observation and bone marrow with <5 percent myeloblasts assessed by histology and multi-parameter flow cytometry.

Tests to detect measurable residual disease (MRD), typically quantified using multi-parameter flow cytometry, cytogenetics, fluorescence in situ hybridization, PCR of DNA or RNA molecules or next generation sequencing, are receiving increasing attention. Some data suggest this type of information can, to a considerable extent, supplant pre- and post-therapy data to inform therapy decisions in persons with AML achieving a cytomorphological complete remission.23 However, the more important question is the extent to which incorporating results of MRD-testing improves predictive accuracy, that is, increases the C-statistic. Othus et al. assessed the value added by MRD-testing in 170 subjects achieving complete remission on Southwest Oncology Group (SWOG) trial S0106 which compared cytarabine and daunorubicin with or without gemtuzumab ozogamicin.24 About 20% of subjects had a positive MRD-test. Adding results of MRD-testing to data on age, cytogenetic risk and NPM1 and FLT3 mutations increased C-statistic values from 0.64 to 0.66 for relapse-free survival and from 0.66 to 0.70 for survival.25 The net re-classification index (NRI), which quantifies how often addition of a new biomarker such as an MRD-test result changes classification, can be used to evaluate the effect of changes of this magnitude in the C-statistic/AUC.26 However, without evaluation of the clinical impact of the change in classification, the NRI can be misleading, over-emphasizing the importance of small increases in the C-statistic.27 Increases of the magnitude observed by adding MRD-test data to current prognostic models are probably not clinically important in subsequent therapy decision-making and suggest a more conservative assessment of the value of MRD-testing than widely perceived. Much higher C-statistic values (for example, >0.90) are needed to accurately evaluate the Fate of persons with AML in cytomorphological complete remission. C-statistic values in this range coupled with the often unsatisfactory outcomes with conventional therapies might substitute for randomized trials comparing conventional and new therapies. Many clinical investigators accept this possibility and many potential trial subjects might welcome it, particularly those accurately predicted to fare poorly with conventional therapies.
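For readers unfamiliar with the NRI, the categorical form can be sketched as follows: among persons with an event, the net proportion moved to a higher risk category, plus, among persons without an event, the net proportion moved to a lower one. The function name and the eight-person data set are assumptions made for illustration, and this simple form ignores censoring and time-to-event information.

```python
def net_reclassification_index(old_category, new_category, events):
    """Categorical NRI; categories are integers where higher = higher risk."""
    up_event = down_event = up_nonevent = down_nonevent = 0
    n_event = n_nonevent = 0
    for old, new, event in zip(old_category, new_category, events):
        if event:
            n_event += 1
            up_event += new > old        # correct direction for an event
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old   # correct direction for a non-event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")
    return ((up_event - down_event) / n_event
            + (down_nonevent - up_nonevent) / n_nonevent)


# Hypothetical risk categories before and after adding a new biomarker.
old = [0, 0, 1, 1, 0, 1, 0, 1]
new = [1, 0, 1, 1, 0, 0, 0, 1]
had_event = [1, 1, 1, 1, 0, 0, 0, 0]
print(net_reclassification_index(old, new, had_event))
```

Note the index counts every category change equally, whether or not the change would alter a therapy decision, which is one reason it can over-state the importance of a small gain in the C-statistic.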

We also over-estimate our ability to reproducibly assess covariates we know. An example is the seemingly trivial task of evaluating percent bone marrow myeloblasts. In the World Health Organization (WHO) classification of myeloid neoplasms a person is considered (with few exceptions) to have AML rather than myelodysplastic syndrome (MDS) if a bone marrow aspirate is reported to have ≥20% myeloblasts.28 Moreover, a diagnosis of AML based on this WHO-criterion is required to qualify for most AML clinical trials. Consequently, a person with a bone marrow aspirate report of 19% myeloblasts among 200 nucleated cells enumerated is considered to have MDS and excluded from entering AML studies. This occurs even though the upper boundary of the 95% confidence interval for 38 myeloblasts in 200 cells enumerated is 25%. These data indicate about one-half of repeated bone marrow aspirates from the same site would be interpreted as AML rather than MDS using the WHO-criterion. Furthermore, given the heterogeneity of the bone marrow a bone marrow aspirate in another site in the same person at the same time or the next day might give a very different result.
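The sampling arithmetic behind this example can be checked with a short sketch. It assumes myeloblast counts behave as a simple binomial sample and uses the Wilson score interval; the text does not state which interval method was used, so the choice of method, the function names and the assumed true blast fractions are illustrative only.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


low, high = wilson_interval(38, 200)
print(f"38/200 myeloblasts: 95% interval {low:.1%} to {high:.1%}")

# Chance a repeat 200-cell count reaches the 20% threshold (>= 40 blasts),
# under two assumed true blast fractions.
for true_fraction in (0.19, 0.20):
    print(true_fraction, round(prob_at_least(40, 200, true_fraction), 2))
```

Under these assumptions the interval comfortably spans the 20% threshold, and a repeat count has a substantial chance of landing on the other side of it, before any allowance for marrow heterogeneity or observer variation.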

This issue of reproducibility is particularly important with respect to quantifying MRD-test results. In addition to sampling error there is currently no standardization or harmonization of diverse assays.29 Recent data from commercial laboratories indicate a >10 percent discordance in test results. Importantly, although a positive MRD-test has a high PPV for relapse it is far less certain when relapse might occur. This uncertainty is of critical clinical import. For example, a recommendation to receive a transplant to potentially avoid a relapse within 3 months is entirely different from trying to avoid a relapse likely to occur only after 2 or 3 years.

Another important source of uncertainty is confusion between prognostic and predictive covariates and biomarkers.30 A covariate or biomarker is deemed prognostic if associated with outcome regardless of therapy. An example is performance score; we are unaware of a situation where people with scores of 3 or 4 fare better than those with scores of 0, 1 or 2. A predictive biomarker is therapy-specific as determined by a statistical test for interaction showing the effect of the therapy differs depending on the biomarker values. For example, persons <60 years with FLT3 mutations and with wild-type NPM1 have better event-free survival and survival if randomized to intensive chemotherapy and midostaurin compared with intensive chemotherapy only.31 However, designating a FLT3 mutation predictive requires showing a differential benefit of midostaurin in persons with vs. without a FLT3 mutation. This is unproved. A biomarker such as a FLT3 mutation can be prognostic if having it is always associated with a worse outcome, predictive as noted above, or prognostic and predictive if the aberration always confers a worse prognosis but adding a treatment such as midostaurin mitigates this adverse impact.30 Because therapies targeted at biomarkers such as FLT3 mutation are often studied only in persons with the mutation we may incorrectly assume a biomarker is predictive when it is prognostic. Therapy consequences of this distinction are considerable. For example, FLT3 internal tandem duplication (ITD) was initially considered to be a biomarker predictive of response to sorafenib. However, further study suggested sorafenib combined with intensive chemotherapy conferred benefit in persons with and without FLT3 ITD.32 There are few convincing data of predictive biomarkers for the outcome of transplants compared with non-transplant therapies.
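The interaction requirement can be illustrated with a deliberately simple sketch: compare the treatment effect (here a difference in event proportions) in biomarker-positive and biomarker-negative persons and test whether the two effects differ. All counts are invented, the function names are hypothetical, and the normal-approximation test on risk differences is a simplification; a trial analysis would more typically fit a time-to-event model with a treatment-by-biomarker interaction term.

```python
from math import erfc, sqrt


def risk_difference(events_treated, n_treated, events_control, n_control):
    """Difference in event proportions (treated minus control) and its variance."""
    p_t = events_treated / n_treated
    p_c = events_control / n_control
    variance = p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control
    return p_t - p_c, variance


def interaction_test(biomarker_pos, biomarker_neg):
    """Compare treatment effects across biomarker strata.

    Each argument is (events_treated, n_treated, events_control, n_control).
    Returns (difference of risk differences, z, two-sided P-value).
    Assumes the variances are not both zero.
    """
    rd_pos, var_pos = risk_difference(*biomarker_pos)
    rd_neg, var_neg = risk_difference(*biomarker_neg)
    contrast = rd_pos - rd_neg
    z = contrast / sqrt(var_pos + var_neg)
    p_value = erfc(abs(z) / sqrt(2))   # two-sided normal tail probability
    return contrast, z, p_value


# Invented counts: treatment lowers events in both strata by the same amount,
# so the biomarker is not predictive here, whatever its prognostic value.
print(interaction_test((40, 100, 60, 100), (20, 100, 40, 100)))
```

A trial enrolling only biomarker-positive persons supplies the first tuple but not the second, so the contrast cannot be estimated and the biomarker's predictive status remains unknown.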

The reality is there is unlikely to be any combination of covariates which allows us to perfectly predict the Fate of someone with AML. This is because some events are inherently and unavoidably unpredictable. An example is radioactive decay.33 Regardless of how accurately we measure atomic particles the time at which a radionuclide will release a particle or electro-magnetic wave cannot be precisely predicted but only be expressed as a probability. There is no reason the same uncertainty will not apply in medicine. A recent typescript cautioned against the tendency of physicians to assume certainty exists in clinical medicine and pointed out the deleterious consequences associated with this view.34

Physician attitudes

Above we emphasized limitations in our ability to predict the Fate of persons with AML. However, the C-statistic values discussed above indicate prognostic models, despite limitations, are preferable to fair coin tosses. Although unproved, we suspect they are also better than physicians’ intuition. Clearly physician (and patient) tolerance for an incorrect prediction may depend on personal factors and vary with circumstances. For example, a false-positive forecast of relapse might have greater consequence if it led to a transplant rather than a potentially safer therapy such as a tyrosine kinase-inhibitor. Nonetheless we question the willingness of physicians and patients to be informed by prognostic models whilst ignoring their uncertainty. Only infrequently does information from AML prognostic models such as the hematopoietic cell transplant co-morbidity index or comprehensive geriatric assessment appear in typescripts. We suspect these models find even less use outside academia.

More generally we believe physicians and patients, like all humans, may be reluctant to suspend their prior beliefs. For example, although data from randomized trials suggest no benefit from a neutropenic diet, this diet remains in common use.35 Call it the triumph of hope over reason. Even when data suggest benefit/risk ratios are similar in arms of a randomized trial there is often an unwillingness to randomize subjects to different therapies. For example, the United Kingdom National Cancer Research Institute’s AML14 trial proposed random assignment of subjects between intensive and non-intensive induction therapies because it was uncertain which approach was better. Despite this seeming equipoise only 8 of the 1400 subjects recruited into the trial were randomized suggesting patients and/or physicians had little uncertainty which therapy was better.36 Why they believed so is unknown but hinges on processes known as heuristics in psychology and discussed elsewhere.37

The role physicians’ backgrounds may play in decision-making also deserves attention. Bories et al.38 reported male physicians are more likely to recommend intensive AML induction therapy than female physicians even after adjusting for prognostic variables. A similar correlation was seen between a physician recommending intensive therapy and his/her fiscal risk-taking tolerance. It seems plausible such factors determine physicians’ selection of AML therapy to an equal or greater extent than prognostic scores.

It is important to realize people given a survival estimate by their physician interpret this information entirely differently than the physician giving the estimate. Gramling et al.39 recently reported significant discordances in 2-year survival expectations in two-thirds of patient-physician pairs following a physician-led discussion. In almost all instances the patient’s interpretation of the survival estimate was substantially more optimistic than the physician’s estimate. This discordance led to inappropriate therapy decisions whereby patients agreed to receive therapies with substantial adverse effects and reduced quality-of-life but with only a slight prolongation in survival. So, besides the difficulty in accurately predicting the Fate of someone with AML, we need to consider the likelihood patients only poorly understand our predictions and even less so our limitations.

Conclusions

Predictions of outcomes in AML are often inaccurate. Physicians place too much reliance on these predictions when they move from predicting the performance of a cohort of subjects to predicting how someone in the cohort will fare. This can have potentially untoward consequences. Even at the cohort level over-estimation of the precision of prognostic or predictive models can lead to false-positive or -negative conclusions in trials where observed results are compared with those predicted by a prognostic model. For this reason we caution against using prediction-based historical controls or matched-pair analyses for determining whether an intervention is effective. More information should be provided to aid assessment of the models’ accuracy. Thus P-values derived from multivariate regression analyses8 should be supplemented by metrics such as the C-statistic, PPVs and NPVs. Biomarkers which are prognostic should be distinguished from those which are predictive. And it should be emphasized prognostic indices do not necessarily provide guidance regarding timing of an intervention. All this notwithstanding, we criticize the tendency of physicians to disregard prognostic models in favor of more informal attempts to assess prognosis. Comparison of the accuracy of these attempts with those informed by prognostic models would be of great interest.

Back to the Greeks who had a second important concept we can learn from: hubris, a personality quality of extreme or foolish pride or dangerous over-confidence. Voltaire noted: “doubt is not a pleasant condition, but certainty is an absurd one”.40 Turning to our times we can also learn from two modern philosophers who, although from diverse backgrounds, are each reputed to have commented on prediction: Niels Bohr, the nuclear physicist, noting “It is difficult to predict, especially the future”41 and Samuel Goldwyn, the film producer from MGM studios, stating “Never make forecasts, especially about the future”.42