Introduction

Outcomes among children with acute lymphoblastic leukemia (ALL) have steadily improved, and event-free and overall survival (OS) now exceed 85% and 90% [1]. Therapy for ALL is determined using established risk factors, balancing treatment intensity with prognosis to minimize overtreating patients with favorable risk, and undertreating patients with higher risk.

Informed risk group (RG) stratification is crucial for optimal therapy [2]. Contemporary risk stratification algorithms typically use threshold-defined dichotomous categories of clinically relevant risk factors including presenting white blood cell count (e.g., < vs. ≥50 × 109/L; WBC) and minimal residual disease (e.g., < vs. ≥0.01%; MRD). The Children’s Oncology Group (COG) B-ALL algorithm includes National Cancer Institute (NCI) RG, clinical variables (extramedullary disease status and steroid pretreatment), sentinel favorable and unfavorable risk genetics (FRG and URG, respectively), flow cytometric MRD of peripheral blood on induction day 8 (D8 MRD), and marrow on induction day 29 (D29 MRD) and at end of consolidation (EOC) [3].

Assigning differing weights to individual risk factors or using continuous numerical rather than categorical values may more accurately predict relapse risk. The UKALL group used MRD as a continuous variable to develop a prognostic model that generated a continuous score (prognostic indexUKALL, PIUKALL) predicting patient-level relapse risk [4, 5]. This model incorporated favorable and unfavorable genetics, and both presenting WBC and D29 MRD as continuous variables. An increase in the PIUKALL score was strongly associated with relapse risk in a validation cohort of three European pediatric ALL trials (combined n = 2313) [5].

We conducted an external validation of the PIUKALL in >20000 COG trial participants and subsequently assessed the value of D8 MRD added to this model given our prior work showing the prognostic value of D8 MRD in certain patient subsets [6]. In contrast to the UK group, COG conducts different B- and T-ALL trials [7]. Thus, we focused on B-ALL and developed a novel risk score, PICOG, and compared patient outcomes between current and PICOG-derived RGs.

Methods

Study population

The cohort included 13,875 NCI standard-risk (SR) and 7324 NCI high-risk (HR) non-infant B-ALL patients enrolled on four COG trials from 2004-2019; two for SR and two for HR patients: AALL0331 (SR; n = 5099) [8], AALL0232 (HR; n = 2900) [9], AALL0932 (SR; n = 8776) [10], and AALL1131 (HR; n = 4424) [3, 11]. Patients and/or their caregiver(s) provided informed consent for these trials in accordance with the NIH central IRB and the Declaration of Helsinki. Randomizations differed for each trial. In all trials except AALL0232, primary analyses indicated no statistical differences in disease-free survival (DFS) rates between experimental treatment and standard of care arms [3, 8,9,10,11]. Down syndrome and Philadelphia chromosome-positive (Ph+) patients were excluded. Patients with T-ALL will be considered separately in future work. The CONSORT diagram shows the breakdown of study participants in each group and the final analysis population (Fig. 1).

Fig. 1: CONSORT diagram for testing (AALL0331/0232) and training (AALL0932/1131) therapeutic trials.
figure 1

B-ALL, B-Cell Acute Lymphoblastic Leukemia; BCR/ABL1, Philadelphia chromosome-positive ALL by BCR-ABL1 oncoprotein; MRD, minimal residual disease; Day 29, end-of-induction minimal residual disease; Day 8, induction day 8 minimal residual disease; CNS, central nervous system involvement; Bone marrow M1, <5% lymphoblasts (remission) Bone marrow M2, 5–25% lymphoblasts; Bone marrow M3, >25% lymphoblasts. Of note, day 8 PB MRD testing was not routinely measured for earlier trials (AALL0331/0232) until partway through accrual.

Variable selection methods

Choice of predictor variables is a crucial step when building a clinical prediction model. Investigators typically must reduce a larger set of candidates to a final set of predictor variables used for final model estimation. The methods of predictor selection can be classified into two categories: (1) reduction before modeling and (2) reduction while modeling [12]. Method (1) implies that the predictors are selected based on domain expertise prior to studying the relationship between the outcome and candidate predictors in the data to be used for model building. This method of predictor selection is generally preferred, as it best preserves the statistical properties of later model estimation and hypothesis testing [13]. Method (2) implies that knowledge of the relationship between the outcome and candidate variables in the data is used to select predictors. Examples of method (2) include univariable screening and stepwise selection (forward, backward, and combined). Though occasionally justifiable, the disadvantages of stepwise selection are well documented and include unstable selection; misleading bias in regression coefficients, standard errors, and p-values; and poorer predictions relative to a full model [12, 13]. Univariable screening inherits the same disadvantages as forward stepwise selection but tends to have poorer performance due to neglect of marginally “insignificant” variables [12]. Therefore, in this work, predictor variables (described in detail below) were selected for inclusion in the model a priori based on clinical expertise.

Potential predictive variables

Known prognostically important genetic variables including ETV6::RUNX1 fusions, double trisomies of chromosome 4, 10 (DT), intrachromosomal amplification of chromosome 21 (iAMP21), and KMT2A-rearrangements were determined by fluorescence in situ hybridization. Hypodiploid ALL was defined as modal chromosome number <44 or DNA index <0.81. These genetic variables determined the FRG and URG groups (detailed in Supplementary Methods). D8 and D29 MRD were measured by flow cytometry, as described [6]. CNS status was treated as a categorical variable: CNS1 (no blasts), CNS2 (CSF WBC < 5/µL with blasts), or CNS3 (CSF WBC ≥ 5/µL with blasts) (Supplementary Methods).

Transformations for WBC (log(WBC); WBClog), D8 MRD, and D29 MRD were consistent with Enshaei et al. [5] due to reasonable performance in the PIUKALL model and clinical knowledge regarding their distributions. Transformed MRD is displayed as τ(MRD), corresponding roughly to the negative log transformation [5]. The maximum τ(MRD) was 13.82, corresponding to MRD < 1.0 × 10−5. Candidate predictor variables are shown in Supplementary Table 1.

External validation of PIUKALL

Steps for external validation followed published guidelines [14]. These steps and the level of information required for each step’s execution are defined in Supplementary Table 2 and are referred to as Step(1)-Step(6). Note Step(5) and Step(6) are not included due to unavailable information. If in Step(1) the overall calibration slope is found to be less than one, the model is technically considered not to be optimal for the external validation data, though further steps should still be examined as the model may still have practical utility.

We applied the published PIUKALL equation to the COG data [5]:

$${{{{{{\rm{PI}}}}}}}_{{{{{{\rm{UKALL}}}}}}}= \, [-0.218 \, {^{\ast}} \, {{{{{\rm{\tau }}}}}}\left({{{{{\rm{D}}}}}}29{{{{{\rm{MRD}}}}}}\right)-0.440 \, {^{\ast}} \, {{{{{\rm{CYTO}}}}}}\_{{{{{\rm{GR}}}}}}\\ +1.066 \, {^{\ast}} \, {{{{{\rm{CYTO}}}}}}\_{{{{{\rm{HR}}}}}} +0.138 \, {^{\ast}} \, {{{{{\rm{WBC}}}}}}_{\log }]$$

According to Step(1), the overall calibration slope for the PIUKALL was calculated. The calibration slope is the estimated log-hazard ratio from a univariable Cox model with the PIUKALL as the predictor. A calibration slope less than 1 in external validation data is indicative of poorer discrimination in the validation data than in the development data, a common occurrence among predictive models reflecting decreased generalizability of the original model and heterogeneity of patient prognosis in derivation vs. validation populations [14]. A formal test for the null hypothesis that the overall calibration slope equaled one was conducted.

For Step(2), the primary metric used to compare model discrimination was the concordance index (C-index), defined as the proportion of randomly selected pairs of patients that the model orders concordantly (for a pair to be concordant, the patient with the higher model-predicted probability of relapse has the shorter observed time to relapse) [15]. A C-index >0.7 indicates acceptable discriminative capability of a model, while a value of 0.5 indicates that prediction is equivalent to random chance [13]. In Step(3), ideally, the original published coefficients would be equal to those obtained if the model were refit in the external validation data. We examined the coefficients for the PIUKALL model re-derived in the full COG analysis population compared to the published coefficients to examine possible true difference in predictor effects between UKALL and COG data. We conducted a hypothesis test for equality of published vs. externally derived model coefficients as detailed in Supplementary Table 2. Kaplan-Meier curves within PIUKALL-defined RGs were reported to satisfy Step(4).

Added value of D8 MRD

D8 MRD is of particular interest to COG, as its collection is unique and standard within the collective. Therefore, to assess incremental added predictive value of D8 MRD to the UKALL model, we fit a multivariable Cox proportional hazards model including τ(D29 MRD), FRG, URG, WBClog and compared this to the model with τ(D8 MRD) included.

Development of PICOG

The development of a new prognostic index for relapse risk, the PICOG, utilized pre-specified covariates based on domain expertise and existing literature from UKALL and COG data [6]. The AALL0932/AALL1131 cohort comprised training data, while AALL0331/AALL0232 patients were used as testing data for temporal (external) validation (Fig. 1). For model development on the training data, τ(D29 MRD), FRG, URG, WBClog, τ(D8 MRD), age at diagnosis (Age), and CNS status were included from an initial class of potential covariates (Supplementary Table 1) due to existing evidence of prognostic relevance and current risk stratification algorithms.

Graphical methods assessed the assumptions of the functional relationships between relapse risk and covariates [13]. The proportional hazards assumption was examined using scaled Schoenfeld residual plots by covariate. Plots of the delta-beta residuals helped to visually identify participants with strong influence on hazard ratio estimation. We pre-specified a comprehensive set of potential interactions among the continuous variables (Supplementary Table 1) and assessed them for possible model inclusion as a group. PICOG was defined as the linear predictor from the model. Calibration slopes and C-indices were obtained for PICOG overall and within sex and race/ethnicity groups to diagnose potential lack of model fit.

Validation and calibration were assessed using the rms package in R [16]. The final model was internally validated using bootstrapping with B = 1000 resamples with optimism-corrected estimates calculated [15]. Calibration was examined using smoothed calibration plots [13]. Cox model performance was compared to machine learning (ML) alternatives to assess whether relaxed assumptions improved predictive ability. Random forest [17], support vector machine [18], and boosted Cox models were fit to the same predictor variables included in the Cox model (Supplementary Table 3) [19]. The benchmarking study included a 5×5-fold nested cross-validation routine adapted from Fouodo et al. [18].

To compare possible risk stratification approaches, patients were classified according to the current risk classification algorithms used in COG AALL1731 (SR; NCT03914625) and AALL1732 (HR; NCT03959085) trials (Supplementary Table 4, 5). Using the training dataset, cutpoints were calculated dividing the continuous PICOG into four risk-based categories optimizing the model’s discriminative ability [20]. The censored nature of the data was accounted for by maximizing the Concordance Probability Estimate (CPE), a variation of the C-index [20]. Further details of how cutpoints were calculated are included in the Supplementary Methods. Point estimates of 5-year relapse-free survival (RFS) within risk subgroups were obtained using Kaplan-Meier estimation. RFS was defined as time from end of induction (EOI) to relapse or death in remission, or censored at second malignant neoplasm (SMN) or date of last contact for those who remained event-free. Estimates for DFS and OS were also obtained. DFS was defined as time from EOI to relapse, death in remission, or SMN, or censored at last contact. OS was defined as the time from EOI to death or censored at last contact. All analyses were conducted using R Statistical Software® version 4.2.1 (code available from corresponding author upon request) [21].

Results

Study population

Overall, the distributions of clinical characteristics were similar between the training and testing data in both the generating analysis population (Table 1) and between the training and testing data in the post-induction relapse-free survival cohort used for model development and numeric validation (Supplementary Table 6). Among genetic groups, 9629 participants (45.4%) were FRG (52.1% ETV6::RUNX1 fusions and 48.2% DT) and 1256 participants (5.9%) were URG (29.0% KTM2A-rearranged, 28.0% hypodiploid, and 43.4% iAMP21). Ph-like ALL (Supplementary Methods) was present in 996 of 4836 patients tested. D8 MRD and D29 MRD data were available for 76.4% and 84.4% of patients, respectively. Figure 2 shows the distribution of continuous prognostic factors for the combined population.

Table 1 Patient characteristics of the analysis population (n = 21199)a.
Fig. 2: Density plots of the distribution of continuous prognostic variables for the full analysis population (n = 21199).
figure 2

White blood cell count (WBC) was log-transformed. End-of-induction (D29) minimal residual disease (MRD) and induction day 8 (D8) MRD were transformed according to UKALL [4, 5].

External validation of PIUKALL

The calibration slope for the original PIUKALL applied to COG data using the published coefficients was 0.79, which was significantly different from one (p < 0.001). The original PIUKALL retained discrimination ability, with a C-index of C = 0.725. When the coefficients for the PIUKALL model were recalculated using external COG data, FRG and WBC have larger hazard ratio estimates and D29 MRD and URG have smaller estimates, indicating possible differing predictor variable effect weighting between the two populations [derived PIUKALL = −0.136*τ(D29 MRD)-0.913*FRG + 0.692*URG + 0.166*WBClog], and the risk directions for all factors were consistent between the original and derived PIUKALL. For example, FRG is associated with lower relapse risk in both cohorts. These coefficients (log-hazard ratios) associated with the derived PIUKALL yield the following hazard ratios: 0.87 for τ(D29 MRD), 0.40 for FRG, 1.99 for URG, and 1.18 for WBClog. The test for equality of published vs. externally derived model coefficients showed evidence of difference in the coefficients (p < 0.001), indicating that model fit could be improved. Kaplan-Meier curves within PIUKALL-defined RGs are shown in Supplementary Fig. 1 and exhibit good separation between curves (log-rank p < 0.001).

Added value of D8 MRD

τ(D8 MRD) was a statistically significant addition to the model, with a modest hazard ratio estimate in the testing data of 0.96 (1 DF Wald p < 0.001) (Supplementary Table 7). The effect size corresponds to an estimated 4% relapse risk reduction for a one-unit increase in τ(D8 MRD) (decrease in D8 MRD), holding D29 MRD, WBClog, FRG, and URG constant.

Development of PICOG

We next developed a model using COG predictors best known for relapse risk using the training dataset (n = 11,102). Tested as a group, the set of potential statistical interactions did not significantly improve model fit (Supplementary Table 1) and were excluded. Table 2 reports the estimated coefficients and hazard ratios from the model containing transformed D8 and D29 MRD, FRG, URG, WBClog, CNS status, and Age. Except for CNS3 (n = 90 in training data, Supplementary Table 6) vs. CNS1, each predictor was strongly associated with relapse risk. Increases in transformed D8 and D29 MRD (i.e., decreases in MRD) were each associated with a decreased relapse risk. Table 2 can also be visualized as an equation as follows:

$${{{{{{\rm{PI}}}}}}}_{{{{{{\rm{COG}}}}}}}= \, [\!-0.102 \, {^{\ast}} \, {{{{{\rm{\tau }}}}}}\left({{{{{\rm{D}}}}}}29{{{{{\rm{MRD}}}}}}\right)-0.040 \, {^{\ast}} \, {{{{{\rm{\tau }}}}}}\left({{{{{\rm{D}}}}}}8{{{{{\rm{MRD}}}}}}\right)\!-0.741 \, {^{\ast}} \, {{{{{\rm{FRG}}}}}}\\ +0.644 \, {^{\ast}} \, {{{{{\rm{URG}}}}}}+0.156 \, {^{\ast}} \, {{{{{{\rm{WBC}}}}}}}_{\log }+0.386 \, {^{\ast}} \, {{{{{\rm{I}}}}}}({{{{{\rm{CNS}}}}}}2)\\ +0.364 \, {^{\ast}} \, {{{{{\rm{I}}}}}}({{{{{\rm{CNS}}}}}}3)+0.061 \, {^{\ast}} \, {{{{{\rm{Age}}}}}}]$$

where indicator I(CNS Status) is one if the patient falls into that CNS category, and zero otherwise. This equation can be used to calculate an individual patient’s PICOG risk score. Supplementary Fig. 2 provides a visual comparison of the shapes of the distributions of PIUKALL and PICOG. Figure 3 portrays the prognostic index by genetic RG, with higher genetic risk associated with higher PICOG.

Table 2 Summaries of the PICOG model derived on the training study population.
Fig. 3: Boxplots of the distribution of the COG ALL Prognostic Index (PICOG) risk score by genetic risk group.
figure 3

The central “box” is made up of the 25th percentile, median (50th percentile), and 75th percentile. Lines on either side extend to the minimum and maximum (excluding outliers). Outliers are marked on the plot by points that are higher than the maximum denoted by the upper line.

Diagnostic plots indicate no concerning evidence of non-proportional hazards (Supplementary Fig. 3A) or influential points (Supplementary Fig. 3B). Internal validation indicated very little data-driven overfitting in the modeling process (Supplementary Table 8). Temporal external validation of the new model in the AALL0232/AALL0331 testing data (n = 4100) yielded an overall calibration slope of 0.94, not significantly different from 1 (p = 0.13), indicating overall good calibration in the testing dataset. The model held discrimination as well, with a C-index in the testing data of 0.738. Calibration curves are displayed in Supplementary Fig. 4. In testing data stratified by protocol, we observed a slight underestimation of risk among the few NCI HR patients (AALL0232) with very poor observed risk, likely due to the lack of sufficient data to obtain reliable predictions. Among NCI SR patients (AALL0331), there was an overestimation of risk across the range of the data, with the poorest model estimates again in ranges with fewer observations. The final Cox model was compared to ML alternatives using the same prognostic variables (Supplementary Table 3). Despite enhanced flexibility in the ML models, the discriminative ability of the Cox model was comparable to all ML alternatives.

Comparison of risk stratification for PICOG vs. COG current clinical

The cutpoints maximizing the CPE for PICOG were −1.377, −0.589, and 0.093, resulting in classification of patients’ relapse risk into: 38.6% of patients as “low” (RFS 96.8%); 33.1% “standard” (92.6%); 16.8% “intermediate” (84.9%); and 11.5% “high” (66.9%). Figure 4A shows excellent separation and sensible RFS estimates among Kaplan-Meier curves within PICOG-defined RGs. Supplementary Fig. 5 displays the Kaplan-Meier curves within PICOG-defined RGs stratified by testing and training datasets, showing well-separated curves within each dataset. These stratified Kaplan-Meier curves are overlaid for comparison in Supplementary Fig. 6. Figure 4B demonstrates the practical implications of splitting patients’ prognostic values by RG, with each patient’s PICOG value falling into one of the four risk categories depending on prognostic features. The distribution of the PICOG is similar when stratified by testing and training datasets (Supplementary Fig. 7).

Fig. 4: Summaries of the concordance probability estimator (CPE)-defined risk groups of the PICOG.
figure 4

A Kaplan-Meier Curves for Relapse-Free Survival probability within each PICOG-defined risk group for the combined RFS cohorts (n = 15202) and corresponding risk table. B Density plots of the distribution of the PICOG with CPE-defined risk groups indicated by text (Low, Standard, Intermediate, and High) and color for the combined relapse-free survival (RFS) cohort (n = 15202). Risk group defining cutpoints of the PICOG that maximize the CPE are marked by dashed vertical lines.

Ninety-seven percent of patients had sufficient data to be retrospectively classified according to current COG AALL1731/AALL1732 definitions. Shown in Supplementary Table 9, the resulting classification gives 24.5% SR-Favorable (5-year RFS 96.7%), 20.5% SR-Average (93.3%), 12.5% SR-High (82.7%), 3.0% HR-Favorable (96.3%), 29.6% HR (81.8%), and 1.1% Very HR (VHR; 53.6%). Table 3 compares the classification of patients according to both the PICOG and the COG current clinical standard. As seen in the SR-Fav and VHR rows, the two risk classification strategies generally agree when risk is very high or very low. However, for other current COG risk classifications (SR-Avg, SR-High, HR) that collectively include 63% of patients, there is a broader spectrum of PICOG RG assignment.

Table 3 Sample sizes (%) for subgroups by COG risk and COG Prognostic Index classification in the combined training/testing data.

Table 4 displays 5-year RFS estimates within each of the subgroups discussed above. Within the COG SR-Avg group, PICOG identified a “low risk” subgroup with an outstanding 96.0% RFS estimate, similar to the outcomes for patients traditionally classified as SR-Fav. In the COG HR group, we observed a broad range of RFS estimates, from a group with an RFS of 95.5% to a group with an RFS similar to that expected with VHR (66.0% RFS). Similar trends are seen for DFS and OS, as well as when the results are stratified by testing and training datasets (Supplementary Tables 1019).

Table 4 5-year relapse-free survival probability estimates (SE) for subgroups by COG retrospective and COG Prognostic Index risk classifications in the combined training/testing data.

Discussion

Prognostic models are used in oncology to aid clinical decision making by adjusting treatment intensity to individual patient relapse risk [22]. A prognostic model must satisfy many quality control guidelines to be useful in clinical practice, including appropriate model validation [12, 13, 15]. Ideally, this includes both strong resampling-based internal validation (“training”) and external validation in independent populations (“testing”) [12, 15].

We have developed and rigorously validated a new model to determine a prognostic index (PICOG) using COG B-ALL trials. PICOG is easily calculated on a large scale and can be hosted online on a web-application for use by patients and practitioners, lending itself well to the described clinical applications (see https://natalie-delrocco.shinyapps.io/COG_PI_Calculator/). This work extends that of Enshaei et al., whose prognostic index, the PIUKALL, was prognostic in the COG data and emphasized the strength of D29 MRD, WBC, and favorable and unfavorable cytogenetics as predictors of outcome in pediatric ALL [5].

These independent analyses were both conducted with large, uniformly annotated clinical trial datasets, giving strong evidence of reliable estimation of the effect of these prognostic factors on relapse risk. This work provided an independent external validation of the PIUKALL, and also demonstrated the contributions of Age, CNS status, and D8 MRD in prognostic modeling for relapse risk. Despite correlation with D29 MRD and a modest effect, D8 MRD still contributes independently to the model, likely due to ability to indicate excellent expected outcomes when D8 MRD is negative. We additionally note that model estimation showing similar hazard ratios for patients with CNS2 and CNS3 is not unique to this study, and refer the interested reader to Winick et al. for discussion [23]. However, present interpretations regarding CNS2 vs. CNS3 must be made with caution as the confidence interval associated with the estimated hazard ratio for CNS3 patients (vs. CNS1) is wide given the relatively small number of these patients.

Difference in performance of PIUKALL in COG patient populations may be attributed to several factors including different geographic case-mix [14], different MRD detection methods, and differing definitions of genetic factors [24]. Differences in cytogenetic classification between the COG and UKALL groups include the definition of hyperdiploidy. While the UKALL group defines this favorable cytogenetic subgroup as those with high hyperdiploid (i.e., between 51 and 67 chromosomes), the COG defines this group as those with trisomy of chromosomes 4 and 10. Of note, subsequent UKALL analyses are likely to further refine their definition of this group as an indicator of good risk genetics [24]. Additionally, the use of hypodiploidy as an inclusion criterion for HR cytogenetic classification differs between the COG and UKALL groups. The COG considers all individuals with hypodiploidy (<43 chromosomes) as HR. UKALL considers two subsets of hypodiploidy as HR: “near haploidy” (<30 chromosomes) and “low hypodiploidy” (between 30 and 39 chromosomes). Thus, the difference reduces to the small subset of individuals between 40 and 42 chromosomes. TCF3-HLF positivity also contributes to UKALL’s HR definition. TCF3-HLF is indeed a very high-risk factor but is exceedingly rare and not routinely assessed in genetic testing algorithms.

We show that PICOG can identify heterogeneity in outcome among the categorically defined RGs used in the current COG risk classification, suggesting that using continuous information may enhance traditional RG designation. This refinement of current RGs could offer further options for therapeutic interventions for certain subsets of patients. As outcomes continue to improve, the burden of treatment-related toxicity becomes an increasingly important consideration [25, 26]. For those with outstanding prognosis, a less intense chemotherapy regimen may help prevent the life-long complications of therapy, including cardiac disease, secondary cancers, decreased employment, and infertility [27]. For example, SR-Avg individuals who are PICOG low risk could be considered for treatment de-intensification. In contrast, for those within the COG HR group with predicted outcome similar to the COG VHR group (e.g., PICOG “high-risk”), innovative therapies could be considered to improve RFS.

Ideally, a fully independent external validation of PICOG should be conducted with close attention to validation in minority demographic populations. Though Supplementary Fig. 8 shows good calibration and discrimination within each race/ethnicity subgroup and both sexes, a true external validation in minority populations is optimal for determining predictive performance. Prospective clinical trials could evaluate the PICOG’s efficacy as a clinical decision aid [28].

In addition to assessing the clinical performance of the PICOG, future research could assess further refinement with critical new prognostic factors. Modern clinical prediction models must be prepared to dynamically incorporate new discoveries and updated information [23]. For example, high-throughput sequencing (HTS) for MRD is more sensitive and easily standardized than standard flow cytometry and is a focus of current investigation in childhood ALL [29]. Updated models of ALL will also need to adapt to the growing importance of new genetic markers [30], or to improve the use of information from traditional ones [24]. A model-derived risk score, such as the PICOG, more readily allows the timely incorporation of such new information (such as HTS MRD and novel genetic subtypes) than do traditional risk stratification algorithms. Traditional risk stratification algorithms combining specific categories of many variables to construct RGs require extensive clinical knowledge regarding relationships between a new marker and other risk stratification variables to determine the appropriate algorithmic use for the new marker. Often, when a new prognostic marker is introduced, the first studies show only an association with outcome, with additional clinical knowledge following over the course of time. In contrast, when data becomes available on the new marker, established statistical methods parallel to those described in this paper can be applied to incorporate the new information into the model. Although model updating is nontrivial, the technology is available and could further strengthen the ability of the PICOG to discriminate outcomes in groups of patients previously categorized together, presenting additional future opportunities to ask targeted questions.

This study has several strengths in addition to the size and data consistency of the study cohorts. The availability of D8 MRD, not routinely assessed by other groups, allowed incorporation of early disease response. Extensive prior studies of clinical and genomic variables as outcome predictors enabled this study to have predictor pre-specification instead of model-based selection, enhancing the applicability in external populations. Data-driven selection of PICOG cutpoints to define RGs objectively optimizes outcome-based RG assignment. Several limitations also merit note. Certain patient subgroups (T-ALL, Down syndrome, Ph+) were not used to derive PICOG. The performance of PICOG (or any PI) is unclear in small patient groups with limited data (e.g., Non-Hispanic/Other race or Ph-like). Future studies should assess calibration in such patient subgroups. Finally, because PICOG relies on D29 MRD, only available at end-induction, it cannot be used to modify the first weeks of induction therapy.

In conclusion, contemporary ALL therapy relies on risk stratification but does not use all relevant rich and readily available data. The PICOG showed a wide range of relapse risk within currently used RGs and thus may be useful as a clinical decision aid for future trials. Analogous efforts may have significant clinical value in other cancers.