Introduction

Since the introduction of tyrosine kinase inhibitors (TKI), survival in chronic myeloid leukemia (CML) has improved profoundly. Within clinical trials, survival of CML patients is comparable to that of the general population [1], and comorbidities have more influence on survival than CML itself [2, 3]. Achieving complete cytogenetic remission (CCyR) is a well-accepted milestone in the management of CML [4]. In the management recommendations of the European LeukemiaNet (ELN), achievement of CCyR is included in the criteria for optimal response, warning and failure over the course of the disease [5]. However, the optimal time to wait for a major molecular remission (MMR) is unknown. One study showed a significant impact of MMR at 12 months on overall survival (OS) [6, 7]. However, when patients were divided into response levels of >1%, 0.1–1% and <0.1%, the difference between the latter two groups was no longer significant. Other studies found no effect of MMR on OS at all [8, 9]. Therefore, failure to achieve an MMR has not been defined as a failure criterion in the ELN recommendations.

Under TKI therapy, the majority of CML patients reach major or even deep molecular remissions [10, 11], which indicate a good prognosis [12]. However, a small percentage of patients do not achieve MMR. The question is: when should a lack of MMR be regarded as “failure” and therapy be switched? We therefore searched for a critical turning point after which waiting for an MMR is no longer promising. The same analysis was also conducted for BCR-ABL levels <1% (MR2). In addition, potential milestones for the achievement of MR4.5 were investigated.

We used data from the CML study IV to estimate, with a mathematical model, the time at which treatment failure should be declared in a CML patient who has not achieved MMR under imatinib therapy.

Patients and methods

Study design and goals

From July 2002 through March 2012, 1551 patients were recruited to the CML study IV, a five-arm randomized trial in chronic-phase CML comparing first-line imatinib at different dosages and with or without additional non-TKI therapy.

The study design and patient characteristics have been described in detail [6, 7]. A total of 1228 patients were available for the current analyses after exclusion of 128 patients from the “Imatinib after IFN” arm, 164 whose molecular analyses were performed in non-standardized laboratories or were not sensitive enough to detect an MR4, 16 without a EUTOS score, and 15 for various other reasons (see Fig. 1, Consort diagram). The median age of the analyzed patients was 52 years (range: 16–85); 742 (60.4%) were male and 1080 (88%) were low-risk according to the EUTOS score. The median observation time was 7.0 years.

Fig. 1

Consort diagram

Definitions

Progression-free survival (PFS) was defined as survival free of accelerated phase, blast crisis or death. PFS times and times to MR4.5 were calculated from the date of diagnosis. For MR4.5, progression and death were considered competing events. Patients without an event were censored at the date of the last observation, and patients who underwent stem cell transplantation were censored at the date of transplantation.
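Because progression and death preclude a later MR4.5, time to MR4.5 is a competing-risks endpoint. As a minimal illustration only (the study's computations were done in R, and this is not the authors' code), the sketch below computes a nonparametric cumulative incidence curve of the Aalen–Johansen type. The function name and the status coding (0 = censored, including censoring at stem cell transplantation; 1 = MR4.5; 2 = progression or death) are assumptions made for this example.

```python
import numpy as np


def cumulative_incidence(time, status, cause=1):
    """Nonparametric cumulative incidence of `cause` under competing risks.

    time   : follow-up times from diagnosis (e.g. in years)
    status : 0 = censored, 1 = event of interest (e.g. MR4.5),
             2 = competing event (e.g. progression or death)
    Returns a list of (event_time, cumulative_incidence) pairs.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    event_times = np.unique(time[status > 0])  # sorted distinct event times

    surv = 1.0  # probability of being free of any event just before t
    cif = 0.0
    curve = []
    for t in event_times:
        at_risk = np.sum(time >= t)  # patients censored at t still count as at risk
        d_all = np.sum((time == t) & (status > 0))
        d_cause = np.sum((time == t) & (status == cause))
        cif += surv * d_cause / at_risk  # cause-specific hazard x event-free probability
        surv *= 1.0 - d_all / at_risk  # Kaplan-Meier step for "any event"
        curve.append((float(t), float(cif)))
    return curve


# Hypothetical toy data: four patients
print(cumulative_incidence([1, 2, 3, 4], [1, 2, 0, 1]))
```

Treating progression or death as censoring and taking one minus a Kaplan–Meier estimate would instead tend to overestimate the incidence of MR4.5.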

Patients were considered to be in MR2 when they had achieved a BCR-ABL/ABL ratio of less than 1% on the international scale (IS) [13, 14], which can be regarded as equivalent to complete cytogenetic remission [15]. MMR was defined as a BCR-ABL/ABL (IS) ratio below 0.1%, and MR4 and MR4.5 as BCR-ABL/ABL (IS) ratios <0.01% and <0.0032%, respectively.
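The response levels above are nested thresholds on a single IS ratio. A small helper such as the hypothetical `response_category` below (our name, not from the study) makes the classification explicit. It uses strict inequalities as in the definitions and, as a simplification, ignores any requirement on assay sensitivity for scoring MR4 or MR4.5.

```python
def response_category(bcr_abl_is_percent):
    """Map a BCR-ABL/ABL ratio on the IS (in percent) to the deepest response level."""
    # Cutoffs from the definitions in the text, checked from deepest to shallowest;
    # each level requires the ratio to be strictly below its cutoff.
    thresholds = [
        (0.0032, "MR4.5"),
        (0.01, "MR4"),
        (0.1, "MMR"),
        (1.0, "MR2"),
    ]
    for cutoff, label in thresholds:
        if bcr_abl_is_percent < cutoff:
            return label
    return "no remission"


print([response_category(x) for x in (5.0, 0.5, 0.05, 0.005, 0.001)])
# ['no remission', 'MR2', 'MMR', 'MR4', 'MR4.5']
```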

Cytogenetic and molecular analyses

Molecular diagnostics for residual BCR-ABL transcripts followed the procedures and definitions of Hughes et al. [16] and Cross et al. [13] and were performed in standardized and accredited laboratories (Mannheim, Basel, Bern and MLL Munich) with defined conversion factors ensuring the equivalence of the tests [12, 13]. The median number of molecular examinations was 2.9 per patient per year.
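Conversion to the IS multiplies each laboratory's local ratio by that laboratory's conversion factor. The factors used by the study laboratories are not given in this section, so the values and laboratory names in this sketch are hypothetical.

```python
# Hypothetical laboratory-specific conversion factors (illustrative values only).
CONVERSION_FACTORS = {
    "lab_A": 0.8,
    "lab_B": 1.3,
}


def to_international_scale(local_ratio_percent, lab):
    """Convert a local BCR-ABL/ABL ratio (%) to the international scale (IS)."""
    return local_ratio_percent * CONVERSION_FACTORS[lab]


for lab in ("lab_A", "lab_B"):
    is_value = to_international_scale(0.09, lab)
    print(lab, round(is_value, 3), "MMR" if is_value < 0.1 else "no MMR")
```

With these made-up factors, a local ratio of 0.09% corresponds to about 0.072% (IS) in `lab_A`, an MMR, but to about 0.117% (IS) in `lab_B`, which is not an MMR; this is why the conversion factors matter for threshold-based definitions.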

Statistics

Patients were randomly divided into a learning and a validation set at a ratio of 2:1.
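A 2:1 random split can be sketched as follows. The seed and function name are arbitrary assumptions; the study's randomization procedure is not described beyond the ratio. Rounding two thirds of 1228 gives the 819/409 split reported in the Results.

```python
import numpy as np


def split_learning_validation(patient_ids, learning_fraction=2 / 3, seed=2024):
    """Randomly split patients into learning and validation sets (default ratio 2:1)."""
    rng = np.random.default_rng(seed)  # arbitrary seed, for reproducibility only
    ids = np.asarray(patient_ids)
    shuffled = rng.permutation(ids)
    n_learning = int(round(len(ids) * learning_fraction))
    return shuffled[:n_learning], shuffled[n_learning:]


learning, validation = split_learning_validation(np.arange(1228))
print(len(learning), len(validation))  # 819 409
```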

To obtain the hazard ratio (HR) as a function of the prediction time, we used the landmarking approach of van Houwelingen [17, 18]. At each month from 6 to 60 months after diagnosis, patients of the learning sample still at risk were categorized into three groups (“no remission”, “MR2” and “MMR or deeper”) according to the deepest remission achieved by that time. The state “no remission” was defined as not having reached even an MR2. Patients who were switched to any second-generation TKI were counted as “no remission” or “MR2” after the switch if they had not reached an MR2 or an MMR, respectively, before the switch. In the learning sample, 106 patients were switched to a second-generation TKI before achieving an MMR. A Cox model was fitted to the combined data set of all landmarks, in which both the baseline hazard and the regression coefficients of the time-varying remission states depended on the landmark time. This dependence was modeled as a cubic function of the landmark time. In addition, EUTOS score [4, 19] and age were included as covariates for adjustment. Details can be found in the online supplement.
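The construction of the stacked landmark data set can be sketched in Python. All names (`Patient`, `counted`, `remission_state`, `stack_landmarks`) are hypothetical, times are in months from diagnosis, and two details are assumptions because the text does not specify them: follow-up is not truncated at a fixed prediction window, and each row enters the risk set at its landmark (delayed entry). The cubic dependence on landmark time is encoded as interaction columns of the remission indicators with 1, s, s² and s³ (s in years). Fitting the model would then require a Cox routine that supports delayed entry and, ideally, standard errors clustered by patient; that step is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    pid: int
    event_time: float             # months from diagnosis to progression/death or censoring
    event: int                    # 1 = progression or death, 0 = censored
    mr2_time: Optional[float]     # months to first MR2 (None if never reached)
    mmr_time: Optional[float]     # months to first MMR (None if never reached)
    switch_time: Optional[float]  # months to switch to a 2nd-generation TKI (None if never)
    age: float
    eutos_high: int               # 1 = high risk, 0 = low risk


def counted(remission_time, switch_time):
    """Keep a remission only if it was reached before any switch to a 2nd-generation TKI."""
    if remission_time is None:
        return None
    if switch_time is not None and remission_time >= switch_time:
        return None
    return remission_time


def remission_state(p, s):
    """Deepest remission reached by landmark s (months), applying the switch rule."""
    mmr = counted(p.mmr_time, p.switch_time)
    mr2 = counted(p.mr2_time, p.switch_time)
    if mmr is not None and mmr <= s:
        return "MMR or deeper"
    if mr2 is not None and mr2 <= s:
        return "MR2"
    return "no remission"


def stack_landmarks(patients, landmarks=range(6, 61)):
    """One row per (patient, landmark) for patients still at risk at that landmark."""
    rows = []
    for s in landmarks:
        t = s / 12.0  # landmark time in years
        for p in patients:
            if p.event_time <= s:
                continue  # progressed, died or censored before the landmark
            row = {"pid": p.pid, "landmark": s, "entry": s, "stop": p.event_time,
                   "event": p.event, "age": p.age, "eutos_high": p.eutos_high,
                   "state": remission_state(p, s)}
            # Landmark-dependent baseline hazard: terms t, t^2, t^3
            for k in (1, 2, 3):
                row[f"lm_t{k}"] = t ** k
            # Remission effects beta(s) = b0 + b1*t + b2*t^2 + b3*t^3 via interactions
            for label, key in (("MR2", "mr2"), ("MMR or deeper", "mmr")):
                indicator = 1.0 if row["state"] == label else 0.0
                for k in range(4):
                    row[f"{key}_t{k}"] = indicator * t ** k
            rows.append(row)
    return rows
```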

To account for heterogeneity in the data set and to obtain stable confidence intervals, bootstrap resampling was used, and the procedure was repeated 10,000 times. From these 10,000 estimates of the regression coefficient functions, a median curve was calculated (together with 95% confidence intervals). The minimum of this median curve marks the landmark with the largest difference between patients in remission and those without; it was therefore taken as the cutoff.
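Assuming each bootstrap fit yields four coefficients describing the log HR of a remission state as a cubic polynomial in landmark time s (years), the median curve and cutoff can be obtained as in this sketch. The function names and the synthetic coefficient draws are hypothetical; the draws are constructed so that the true minimum lies at 2.5 years.

```python
import numpy as np


def hr_curves(coef_draws, landmarks_years):
    """HR(s) = exp(b0 + b1*s + b2*s^2 + b3*s^3): one row per draw, one column per landmark."""
    coefs = np.atleast_2d(np.asarray(coef_draws, dtype=float))  # shape (B, 4)
    s = np.asarray(landmarks_years, dtype=float)
    design = np.vander(s, N=4, increasing=True)                  # columns 1, s, s^2, s^3
    return np.exp(coefs @ design.T)                              # shape (B, len(s))


def summarize_bootstrap(coef_draws, landmarks_years):
    """Pointwise median HR curve, 95% percentile band, and the landmark minimizing the median."""
    curves = hr_curves(coef_draws, landmarks_years)
    median = np.median(curves, axis=0)
    lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
    i_min = int(np.argmin(median))
    return {"median": median, "lower": lower, "upper": upper,
            "cutoff_years": float(np.asarray(landmarks_years, dtype=float)[i_min]),
            "hr_at_cutoff": float(median[i_min])}


# Hypothetical draws: log HR(s) = log(0.3) + 0.2 * (s - 2.5)^2, plus bootstrap-like noise
rng = np.random.default_rng(0)
true_coef = np.array([np.log(0.3) + 0.2 * 2.5 ** 2, -0.2 * 5.0, 0.2, 0.0])
draws = true_coef + rng.normal(scale=[0.05, 0.02, 0.005, 0.0], size=(2000, 4))
landmarks = np.arange(6, 61) / 12.0  # monthly landmarks from 6 months to 5 years
summary = summarize_bootstrap(draws, landmarks)
print(round(summary["cutoff_years"], 2), round(summary["hr_at_cutoff"], 2))
```

Taking percentiles at each landmark separately gives a pointwise 95% band, not a simultaneous band for the whole curve.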

For the validation, bootstrapping was performed again: in each of 10,000 random subsets drawn from the validation set, patients were categorized into “no remission”, “MR2” and “MMR or deeper” at the cutoff derived from the learning sample.
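The validation step can be illustrated with a self-contained sketch: a Newton–Raphson fit of a Cox model with one binary covariate (Breslow handling of ties) and a two-sided Wald p-value, repeated over bootstrap resamples. This is a simplification and not the study's model. It compares only “MMR or deeper” with “no remission”, omits age and EUTOS score, measures PFS from the cutoff landmark, and does not treat separation (a group without events) specially. All names are hypothetical.

```python
import math

import numpy as np


def cox_single_covariate(time, event, x, max_iter=50, tol=1e-10):
    """Cox model with one covariate (Breslow ties) fitted by Newton-Raphson; returns (beta, se)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    x = np.asarray(x, dtype=float)
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        w = np.exp(beta * x)
        score, info = 0.0, 0.0
        for i in np.flatnonzero(event == 1):
            at_risk = time >= time[i]  # risk set at this event time
            s0 = w[at_risk].sum()
            s1 = (w[at_risk] * x[at_risk]).sum()
            s2 = (w[at_risk] * x[at_risk] ** 2).sum()
            score += x[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
        if info <= 0:
            break  # no information, e.g. covariate constant in this sample
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info) if info > 0 else math.inf
    return beta, se


def bootstrap_validation(time, event, in_mmr, n_boot=10_000, seed=1):
    """Median HR and median Wald p-value across bootstrap resamples of the validation set.

    time, event: PFS measured from the cutoff landmark; in_mmr: 1 = MMR or deeper, 0 = no remission.
    """
    rng = np.random.default_rng(seed)
    time, event, in_mmr = (np.asarray(a) for a in (time, event, in_mmr))
    n = len(time)
    hrs, pvals = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        beta, se = cox_single_covariate(time[idx], event[idx], in_mmr[idx])
        hrs.append(math.exp(beta))
        pvals.append(math.erfc(abs(beta) / se / math.sqrt(2)))  # two-sided Wald p-value
    return float(np.median(hrs)), float(np.median(pvals))


beta, se = cox_single_covariate([1, 2, 3], [1, 1, 1], [0, 1, 0])
print(math.exp(beta))  # approximately 1.4142; the exact MLE for this toy data is sqrt(2)
```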

All computations were performed using the statistics software R 3.0.2.

Ethics

The protocol followed the Declaration of Helsinki and was approved by the local ethics committees. Written informed consent was obtained from all patients before they entered the study.

Results

Learning set

Two thirds of the total data set (n = 819 patients) were randomly allocated to the learning sample. At 6 months, 805 patients were still alive and without progression. Of those, 534 (66%) had no remission, 182 (23%) had achieved an MR2 and 89 (11%) an MMR. As expected, the percentage of patients without molecular remission decreases over time, while the percentage of patients with MMR increases. The total number of patients also decreases, since patients who progressed, as well as those with shorter observation times, no longer contribute at later landmarks. At the last landmark at 5 years, 528 patients were still at risk. This is depicted in Fig. 2.

Fig. 2

Distribution of patients according to remission status achieved at different landmarks from 6 months to 5 years (a) and respective percentages of these states (b)

As described in the methods section, we drew bootstrap samples (with replacement) from the learning set. For each of these 10,000 samples, a Cox model was estimated, yielding an HR for progression as a function of the landmark time. From these 10,000 HR functions, the median and 95% bootstrap confidence intervals were taken (see Supplementary figure 1). Figure 3 shows the HR for progression (y axis) as a function of the landmark time (x axis). For example, with the landmark set at 6 months, patients who had already achieved an MMR by this time had only about 0.5 times the risk of those who had not. For the statistical details, we refer to the online supplement. The minimum of this HR function was found between the landmarks 2.33 and 2.75 years, with an HR of 0.28 (95% CI: 0.16–0.51). The model was adjusted for two additional covariates, age and EUTOS score. For age, we observed an HR of 1.05 per year (95% CI: 1.02–1.08). For high- vs. low-risk patients (EUTOS score), the HR was 1.93 (95% CI: 0.89–3.77).
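Assuming age entered the model as a linear term, as the per-year HR suggests, the effect over a larger age difference follows by exponentiation under the Cox model. The short calculation below is an illustration, not a result reported in the study; because 1.05 is itself rounded, the per-decade values are approximate.

```python
hr_per_year = 1.05            # reported HR for age, per year
ci_per_year = (1.02, 1.08)    # reported 95% CI

# With a linear age term, HR for a 10-year difference = (HR per year) ** 10
hr_per_decade = hr_per_year ** 10
ci_per_decade = tuple(round(c ** 10, 2) for c in ci_per_year)
print(round(hr_per_decade, 2), ci_per_decade)  # 1.63 (1.22, 2.16)
```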

Fig. 3

Median hazard ratio functions for the comparison of patients who had achieved an MMR or MR2, respectively, with those who did not have any remission at different landmarks from 6 months to 5 years with respect to PFS, together with the 95% confidence intervals. On the y axis, the hazard ratio for PFS is plotted on a logarithmic scale. Note that on the x axis the landmark time is plotted instead of the event time. A hazard ratio of, e.g., 0.5 at landmark 6 months indicates that patients with MMR have only half the risk of patients with no MMR before or at 6 months

Analogously to the HR curve for MMR, the HR curve for MR2 was estimated; it is also depicted in Fig. 3. The minimum of the MR2 curve was at 1.25 years, with very similar values between 1.17 and 1.33 years. The HR compared with patients without any remission was 0.34 (95% CI: 0.20–0.56).

Additionally, the model was re-estimated with four groups: “no remission”, “MR2”, “MMR” and “MR4 or deeper”. In this analysis, the shape of the curve was slightly different, with a minimum for patients in MMR between landmarks 3.00 and 3.92 years and an HR of 0.28 (95% CI: 0.11–0.61).

When stratifying by imatinib (IM) dose (i.e., patients from the imatinib 800 mg arm vs. patients from the three imatinib 400 mg arms with or without other treatments), the curve for the imatinib 400 mg arms was quite similar to the one shown in Fig. 3. For the imatinib 800 mg patients, we did not find a minimum; the curve remained almost constant after about 2.75 years, but with much wider confidence intervals. Therefore, no separate cutoff could be determined for this arm.

In a sensitivity analysis, we attempted to find an analogous landmark for the outcome MR4.5 instead of PFS (see Supplementary figure 2). The results are depicted in Fig. 4. In contrast to the PFS analysis, we were not able to find a cutoff; the shape of the function suggested that the earlier a patient had achieved an MMR, the higher the probability of achieving MR4.5.

Fig. 4

Median hazard ratio function for the comparison of patients who had achieved an MMR to those who did not have any remission at different landmarks with respect to MR4.5 from 6 months to 5 years together with the 95% confidence intervals. On the y axis, the hazard ratio for MR4.5 is plotted on a logarithmic scale. Note that on the x axis the landmark time is plotted instead of the event time

In our analysis, patients who were switched to a second-generation TKI before achieving MR2 or MMR were always considered not to be in MR2 or MMR, respectively. To assess the impact of this rule, we performed an additional analysis in which we ignored the switch. For MMR, we found the minimum at 2.75 years with an HR of 0.24 (95% CI: 0.14–0.44) and similar values between 2.58 and 2.75 years. For MR2, the minimum was again at 15 months, with an HR of 0.32 (95% CI: 0.18–0.53) and similar values between 1.00 and 1.58 years.

Validation set

Based on the results described above, we used the landmark of 2.5 years for further analysis, because visits at 3-month intervals are common practice in clinical trials. Again, 10,000 subsets were drawn from the 409 patients of the validation set. The median HR for PFS for patients in MMR (compared with those without remission) was 0.20 (95% CI: 0.13–0.69). The corresponding median p-value was 0.007, thereby confirming the MMR landmark of 2.5 years for predicting PFS.

We further validated the landmark of 1.25 years for MR2. The median HR for PFS for patients in MR2 (compared with those without remission) was 0.45 (95% CI: 0.25–0.85). The corresponding median p-value was 0.023. For comparison, we also evaluated the established landmark of 12 months and found a median p-value of 0.047.

Discussion

The aim of this analysis was to define treatment failure with respect to MMR. We sought the landmark showing the largest difference in PFS between patients in MMR and those without any remission. This landmark was found at 2.5 years and was successfully validated in an independent patient sample. This analysis can therefore close a gap in the ELN recommendations for the definition of failure, a gap that existed between the non-achievement of CCyR at 12 months and the loss of MMR at any time.

We demonstrated for the first time that MMR has a significant impact on PFS. In addition, we were able to replicate the 12-month landmark for the achievement of MR2, which corresponds to CCyR. Until about 1.5 years, the MR2 and MMR curves are almost identical. This matches the clinical experience that, during this period, an MR2 is sufficient with respect to PFS.

During the further course of the disease, achieving MMR becomes increasingly important. It will be of interest in the future whether, at some landmark, even MMR is no longer sufficient to prevent progression.

For predicting the achievement of MR4.5, we were not able to find a particular landmark. However, there was a clear correlation: the earlier MMR was reached, the higher the probability of achieving MR4.5 during the disease course. This has to be taken into account in decision making in view of the new possibility of treatment cessation in CML [20, 21]. So far, it has not been proven that treatment cessation is more successful the earlier MMR and MR4.5 are reached. However, the time spent in MMR or deeper seems to be relevant [22].

From our point of view, these results are of high clinical relevance because, with the success of molecular standardization in Europe, molecular assessments have increasingly replaced cytogenetics. Conversely, even with the increasing importance of deep molecular responses (MR4 and MR4.5), MMR remains an important benchmark: (i) MMR is an indicator of deep response, and (ii) some laboratories in daily routine care are not able to detect deep responses reliably.

From a methodological point of view, it would have been possible to censor patients receiving a second-line TKI. However, as the loss of a remission and perhaps a subsequent progression are among the most frequent reasons for switching TKI, this would have introduced a bias. Furthermore, not censoring at the switch to a second-generation TKI better reflects the current situation in routine care. We found only a minor difference in the cutoff when patients were censored after switching TKI.

Like every other cutoff value, this one is a compromise. It has to balance failing to identify patients who need to be switched, on the one hand, against identifying (and switching) too many patients who would still have achieved an MMR later, on the other. Therefore, this cutoff will have to be reconsidered when, e.g., new agents enter the market and the risk-benefit ratio changes.

Our work has strengths but also some limitations. It is possible that the identified landmark is dose-dependent. Due to the low number of events in the imatinib 800 mg arm, the dose-stratified analysis lacked the power to identify a significant influence of dosage. Therefore, despite the internal validation, an external validation of the results with a completely different data set is encouraged, in particular to check the robustness of the landmark time under different treatments. In addition, this retrospective analysis was not pre-specified in the protocol, which reinforces the need for an external validation.

On the other hand, this is probably the largest data set of CML patients in the imatinib era with sufficiently long follow-up under the conditions of a controlled clinical trial. Despite its lower age, the patient population of the CML study IV is very close to registry cohorts, especially with regard to comorbidities. Therefore, we think that the results are applicable outside of trials.

It should be noted that in first-line studies with nilotinib, dasatinib or bosutinib [23,24,25,26,27,28], molecular remissions were achieved faster than with imatinib at the standard dosage of 400 mg. One can assume that in a corresponding analysis with second-generation TKI, the recommended time to judge therapy failure could be shorter than the 2.5 years identified in this analysis.

These data show that an optimal time frame exists for predicting PFS based on the achievement of MMR. This should be important for optimizing CML treatment with respect to prolonging survival. Regarding treatment cessation, we have demonstrated that the earlier a patient achieved MMR, the higher the chance of reaching a deep molecular response (DMR) later on.