Introduction

Diffuse large B-cell lymphoma (DLBCL) represents the most common type of non-Hodgkin lymphoma with significant heterogeneity in survival [1]. Numerous clinical prognostic models have been developed to stratify patients according to risk [2]. The International Prognostic Index (IPI) was developed to predict survival in patients with aggressive lymphoma treated with doxorubicin [3]. Although the model was published in 1993, it remains the most widely adopted prognostic tool in newly diagnosed DLBCL and the primary tool for selection and risk stratification in today’s clinical trials [4, 5].

In the original study developing the IPI, patients were stratified into four risk groups based on five variables (age, Ann Arbor stage, Eastern Cooperative Oncology Group performance status [ECOG PS], number of extranodal sites, and lactate dehydrogenase [LDH]). The 5-year overall survival (OS) ranged from 26% to 73% in high- and low-risk groups, respectively [3]. However, due to changes in treatment regimens over time, including the introduction of rituximab, the prognostic value of the IPI has declined [2, 6,7,8]. Moreover, recent treatment improvements, primarily in relapsed/refractory patients with the introduction of chimeric antigen receptor (CAR) T-cell therapy and other therapies under investigation, have increased the need for optimized models to predict poor-risk patients and thus further challenge the usefulness of the IPI [4, 9]. Although many biological factors correlate with disease outcomes, integration into accurate, validated prognostic models is lacking [4]. A model is needed that integrates patient characteristics through an empirical approach, estimates outcome probabilities accurately from accessible predictors, and makes better use of available data and current devices [10, 11].
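To make the scoring concrete, the following minimal Python sketch shows how the five IPI variables could be combined into a score and a risk group. The cut-offs (age > 60 years, stage III/IV, ECOG PS ≥ 2, more than one extranodal site, LDH above the upper limit of normal) and the group boundaries follow the commonly cited definition of the IPI and are not restated in this article, so they should be checked against reference [3]. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are illustrative, not taken from the LYFO registry.
    age_years: int
    ann_arbor_stage: int      # 1-4
    ecog_ps: int              # 0-4
    extranodal_sites: int     # number of involved extranodal sites
    ldh_ratio: float          # LDH divided by the upper limit of normal


def ipi_score(p: Patient) -> int:
    """Count adverse factors; each of the five variables contributes 0 or 1."""
    return sum([
        p.age_years > 60,
        p.ann_arbor_stage >= 3,
        p.ecog_ps >= 2,
        p.extranodal_sites > 1,
        p.ldh_ratio > 1.0,
    ])


def ipi_risk_group(score: int) -> str:
    """Map a 0-5 score to the four IPI risk groups (assumed boundaries)."""
    if score <= 1:
        return "low"
    if score == 2:
        return "low-intermediate"
    if score == 3:
        return "high-intermediate"
    return "high"


example = Patient(age_years=68, ann_arbor_stage=4, ecog_ps=1,
                  extranodal_sites=0, ldh_ratio=1.4)
print(ipi_score(example), ipi_risk_group(ipi_score(example)))
```

Because every factor is dichotomized and weighted equally, a 61-year-old and an 85-year-old contribute the same single point, which is one reason later models graded age and LDH.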

Several studies developed IPI-like variants with superior discriminatory ability compared to the IPI in the rituximab era [6,7,8, 12]. The most commonly reported IPI-like models are the Revised IPI (R-IPI) and the National Comprehensive Cancer Network IPI (NCCN-IPI), although attempts to incorporate genetic and molecular variables have shown promising results [1, 2, 13]. However, these models are difficult to reproduce due to the time-consuming data analysis processes and potential costs, which further limit the validation of such models [14]. Available prognostic models for DLBCL are mainly developed from retrospective studies with a limited number of patients and frequently lack external validation [15]. Most studies aiming to establish DLBCL models focused on OS as the primary survival endpoint and on the model’s discrimination ability, which measures how well the model separates patients who experienced the event (e.g., disease recurrence or death) from those who did not [15, 16]. The optimal survival endpoint in DLBCL patients has been a matter of debate due to improved outcomes and hence the low number of events, and event-free survival (EFS) at 24 months has been proposed as a valid surrogate marker for OS and relative survival [17]. Additionally, all-cause mortality data further challenge current models in DLBCL as these include deaths from all causes, but in older patients, non-lymphoma deaths should be considered [11]. Moreover, calibration, the agreement between observed outcomes and predictions, has gained more attention recently and is recommended when externally validating prediction models [18]. Several validation studies in DLBCL patients have been conducted [2, 8, 11, 19]. However, most of them were based on a limited number of patients, compared few prognostic models, lacked external validation and calibration analysis, or, in the case of large populations, recruited patients from clinical trials that do not reflect the real-world population [2, 8, 11, 19].

This study was conducted to identify the most accurate current prognostic model by validation in a real-world DLBCL population. We report the discrimination ability and calibration of 13 prognostic models and identify the current model that could serve as a reference when developing DLBCL models and potentially be used in patient selection in clinical trials.

Methods

Patients were identified through the nationwide Danish lymphoma registry (LYFO), which contains information on baseline clinicopathologic features, treatment, and outcomes of DLBCL patients. LYFO coverage is 98%, and its database is periodically merged with the national civil registry, which contains the dates of death of all deceased inhabitants, providing accurate survival calculations [20]. The study was approved by RKKP (Regionernes Kliniske Kvalitetsudviklingsprogram) with study number 21/27006. Access to data was available to all authors.

Patients were included in the final analysis if they fulfilled the following inclusion criteria: (1) age ≥18 years; (2) newly diagnosed with DLBCL between January 2000 and June 2021; (3) treated with at least one cycle of R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) or similar; (4) all clinical/laboratory variables retrievable in order to calculate prognostic indices in all models. Patients with primary central nervous system (CNS) lymphoma were excluded, while patients with systemic disease and concomitant CNS involvement were included in the study.

Prognostic models

Clinical prognostic models developed for DLBCL patients were evaluated if the variables for the models were available in LYFO. We compared 13 models incorporating at least one clinical (e.g., age, ECOG PS, Ann Arbor stage) and/or laboratory variable (e.g., hemoglobin, LDH, albumin) identified through a previously published systematic review [6]. Those models were considered clinical models. Additionally, as test comparisons, two exclusively laboratory-based models incorporating only variables such as thrombocytes, hemoglobin, and albumin were calculated to assess whether the addition of clinical variables improves the performance of a model. Models including beta-2 microglobulin (β2M) were not included in the current analysis, as this laboratory marker was frequently lacking, and multiple imputation would not reasonably approximate the true distributional relation between unobserved data and available information [21].

Statistical analysis

OS was defined as the time from diagnosis until death from any cause or censoring at the last follow-up. Progression-free survival (PFS) was defined as the time from diagnosis to relapse/disease progression or censoring at the last follow-up. OS was calculated using the Kaplan–Meier method, and the log-rank test was used to compare risk groups. Hazard ratios with 95% confidence intervals (CI) were obtained from Cox proportional hazards models.
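As an illustration of the Kaplan–Meier method named above, the sketch below computes the product-limit estimate from follow-up times and event indicators in plain Python. It is a teaching example, not the code used in this study (the analysis was run in SPSS and R), and the function name and toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  -- follow-up in months from diagnosis
    events -- 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Patients censored at t still count as at risk at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data: five patients, two of them censored.
for t, s in kaplan_meier([6, 12, 12, 24, 30], [1, 0, 1, 1, 0]):
    print(f"month {t}: S(t) = {s:.2f}")
```

Censored patients never lower the curve directly; they only shrink the number at risk for later event times, which is how the estimator uses incomplete follow-up.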

As a measure of fit/model quality, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used. Lower AIC or BIC indicates a better fit [22]. The area under the receiver operating characteristic curve (AUC) was used as a standard method to assess the accuracy of the predictive distribution model [23]. The concordance index (c-index), calculated according to Uno et al., was used to measure discrimination, representing a model’s ability to distinguish individuals with and without the outcome of interest [24, 25]. A value of 0.5 indicates no discrimination, while 1 indicates perfect discrimination [24]. The concordance probability estimate (CPE) was used as another discrimination measure, with higher values indicating better discrimination [26]. Calibration represents the agreement between the probabilities predicted by a model and the observed probabilities [16]. The calibration of models in the current study was presented with calibration curves. Curves close to the 45-degree line indicate good calibration [16]. Interrater weighted κ statistics with 95% CI were used to compare agreement between the IPI, NCCN-IPI, and other models with four risk groups, and between the R-IPI and other models with three risk groups [27].
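The idea behind a concordance index can be sketched in a few lines. Note that the example below implements the simpler Harrell-type c-index, not the estimator of Uno et al. used in this study; Uno's version additionally weights each pair by the inverse probability of censoring. The function name and toy data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by the model.

    A pair is comparable when the patient with the shorter follow-up
    had an event. Higher risk_scores should mean shorter survival.
    Ties in risk score count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: risk_scores could be the points assigned by a prognostic index.
c = harrell_c_index(times=[5, 10, 15, 20],
                    events=[1, 1, 0, 1],
                    risk_scores=[3, 2, 2, 0])
print(round(c, 3))
```

Indices with only three or four risk groups produce many tied scores, and each tie contributes 0.5, which pulls the c-index toward 0.5.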

All p-values were two-sided, and p < 0.05 was considered statistically significant. Calculations were performed in IBM SPSS Statistics (version 28.0.0.0) and R version 3.4.1 using the following packages for survival and performance calculations: CPE, ggplot2, ggsurvfit, dynpred, maxstat, rms, survC1, survival.

Results

Of 8644 patients registered with DLBCL in LYFO in the inclusion period, 6075 patients with DLBCL were treated with rituximab-based therapy and were considered potential candidates for the current study. Data to calculate prognostic indices of interest were available for 5126 patients who fulfilled inclusion criteria and were selected for the final analysis (Fig. 1). Data on missing variables among 6075 potential candidates is provided in Supplementary Fig. 1.

Fig. 1: Consort diagram of the selection process for identifying patients eligible for the current study.

DLBCL diffuse large B-cell lymphoma, IPI International Prognostic Index, NCCN-IPI National Comprehensive Cancer Network IPI, LYFO Danish Lymphoma Register, R rituximab, R-CHOP rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone.

Table 1 summarizes baseline patient characteristics. The median age was 68 years (range 18–95), with 71.0% of patients older than 60. There was a slight male predominance (57.3%), and most patients had advanced (stage III or IV) disease (68.1%) (Table 1).

Table 1 Clinical characteristics of patients with diffuse large B-cell lymphoma.

Prognostic models

We identified 13 clinical and two laboratory models from 11 studies [3, 7, 8, 12, 28,29,30,31,32,33,34]. All models except IPI and age-adjusted IPI (aaIPI) were developed for patients treated with rituximab-based regimens. Table 2 summarizes the variables included in each model. Tables 3 and 4 provide calculations and distributions of patients within each risk group and among our cohort according to models with four (n = 8) and three risk groups (n = 5) [3, 7, 8, 12, 28,29,30,31,32,33,34]. All studies except two included DLBCL patients from retrospective cohorts [3, 28]. Slight differences in inclusion criteria with variable follow-up under 60 months were reported across the studies. The number of patients from the original populations used to develop current models ranged from 88 to 2031 [3, 28]. The largest study population was used to develop IPI, followed by DLBCL Prognostic Index (DLBCL-PI) and NCCN-IPI, with 2031, 1803, and 1650 patients, respectively [3, 8, 12]. Five models were developed in populations larger than 500 patients [3, 8, 12]. All studies proposing models with three risk groups analyzed fewer than 400 patients (range 88–365) [7, 28, 32,33,34]. The median age in analyzed studies ranged from 57 to 70 years [8, 29, 31, 32]. Three studies reported a median age of the analyzed population over 68 years [29, 31, 32].

Table 2 Clinical and laboratory variables included in each model.
Table 3 Summary of variables, distributions of patients according to four risk group models, and 3/5-year overall survival in original models and models from the current study.
Table 4 Summary of variables, distributions of patients according to three risk group models, and 3/5-year overall survival in original models and models from the current diffuse large B-cell lymphoma cohort.

Variables included in models

The most commonly used variables in analyzed models were the IPI variables, including Ann Arbor stage (9/13), ECOG PS (9/13), LDH (9/13), age (6/13), and extranodal sites (5/13) (Table 2). In contrast to the IPI, three models stratified extranodal involvement by high-risk localizations and not the absolute number of involved sites [8, 30, 31]. Additionally, four laboratory variables were used: hemoglobin, platelet (PLT) count, and absolute lymphocyte count (ALC) in individual models, and albumin in four models. Different cut-offs for age, LDH, hemoglobin, PLT, ALC, and albumin were used across different studies (Table 2).

Variables were commonly dichotomized, while age and LDH were divided into several groups in NCCN-IPI, Modified NCCN-IPI, and Kyoto Prognostic Index (KPI) [8, 30, 31].
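The difference between dichotomized and graded variables can be illustrated with a sketch of NCCN-IPI scoring. The point values and risk-group boundaries below follow the commonly cited NCCN-IPI definition and are not restated in this article, so they should be verified against reference [8] and Table 2. Function and parameter names are hypothetical.

```python
def nccn_ipi_score(age_years, ldh_ratio, ann_arbor_stage, ecog_ps,
                   high_risk_extranodal):
    """Sum NCCN-IPI points (0-8). Cut-offs are assumed, see lead-in.

    ldh_ratio            -- LDH divided by the upper limit of normal
    high_risk_extranodal -- True if a high-risk extranodal site is involved
    """
    score = 0
    # Age is graded into four bands instead of a single cut-off at 60.
    if age_years > 75:
        score += 3
    elif age_years > 60:
        score += 2
    elif age_years > 40:
        score += 1
    # LDH is graded by how far it exceeds the upper limit of normal.
    if ldh_ratio > 3:
        score += 2
    elif ldh_ratio > 1:
        score += 1
    # The remaining three factors stay dichotomized.
    score += int(ann_arbor_stage >= 3)
    score += int(ecog_ps >= 2)
    score += int(bool(high_risk_extranodal))
    return score


def nccn_ipi_risk_group(score):
    """Map a 0-8 score to four risk groups (assumed boundaries)."""
    if score <= 1:
        return "low"
    if score <= 3:
        return "low-intermediate"
    if score <= 5:
        return "high-intermediate"
    return "high"


s = nccn_ipi_score(age_years=80, ldh_ratio=3.5, ann_arbor_stage=4,
                   ecog_ps=2, high_risk_extranodal=True)
print(s, nccn_ipi_risk_group(s))
```

Under this scheme age alone can contribute up to three of the eight points, so the oldest patients are separated from those just over 60, which a single cut-off cannot do.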

Model agreement

All patients were categorized into risk groups according to the prognostic models used in this analysis. Distributions of patients according to risk categories in original models and current study are provided in Tables 3 and 4.

IPI classified 26.2% of patients into the low-risk group and 21.3% into the high-risk group. aaIPI, R-IPI, and NCCN-IPI classified 19.8%, 6.4%, and 8.1% of patients, respectively, into the low-risk group and 19.4%, 49.1%, and 13.9% into the high-risk group.

As presented in Suppl. Table 1, when IPI was used as the reference model and compared to other models with four risk groups, it showed substantial agreement (weighted κ between 0.61 and 0.80) with aaIPI (weighted κ = 0.76), NCCN-IPI, Modified NCCN-IPI, DLBCL-PI, and age-adjusted DLBCL-PI (aaDLBCL-PI). When NCCN-IPI was used as the reference model, it showed substantial agreement with the IPI and Modified NCCN-IPI. The highest number of differently grouped patients, with only fair agreement (weighted κ between 0.21 and 0.40), was observed between the NCCN-IPI and KPI (weighted κ = 0.35) and between the NCCN-IPI and the Modified-3-factor Model (weighted κ = 0.37). When models with three risk categories were compared with the R-IPI as the reference model, they showed poor (weighted κ < 0.00) to slight agreement (weighted κ = 0–0.20), with only ALC/R-IPI showing fair agreement with R-IPI (weighted κ = 0.24) (Suppl. Table 2).
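For readers unfamiliar with the weighted κ statistic, the sketch below computes it for two models that assign the same patients to ordered risk groups. Linear disagreement weights are an assumption here, since the article does not state which weighting scheme was used. The label bands for poor, slight, fair, and substantial agreement are those quoted in the text; the bands for moderate and almost perfect agreement complete the same conventional scale. All names and data are hypothetical.

```python
def weighted_kappa(groups_a, groups_b, n_categories):
    """Linearly weighted kappa for two ordinal classifications.

    groups_a, groups_b -- risk group per patient, coded 0..n_categories-1
    """
    n = len(groups_a)
    k = n_categories
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(groups_a, groups_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # disagreement weight (assumed linear)
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    if weighted_expected == 0:
        raise ValueError("kappa undefined: no expected disagreement")
    return 1.0 - weighted_observed / weighted_expected


def agreement_label(kappa):
    """Conventional interpretation bands for kappa."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data: two four-group models applied to the same eight patients.
model_a = [0, 0, 1, 1, 2, 2, 3, 3]
model_b = [0, 1, 1, 2, 2, 3, 3, 3]
kappa = weighted_kappa(model_a, model_b, n_categories=4)
print(round(kappa, 2), agreement_label(kappa))
```

With linear weights, a patient placed in adjacent risk groups by the two models is penalized less than one placed in opposite extremes, which suits ordered categories such as low to high risk.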

Survival

Overall survival (OS)

The median follow-up of the study population was 58.2 months, and the maximum follow-up was 244.7 months. There were 2190 deaths (42.7%). The median survival of the whole study population was 135.1 months (95% CI 127.2–143.0).

Univariate analysis of parameters included in each model showed the prognostic significance of all included variables in the evaluated prognostic models (Suppl. Table 3). As several models used the same laboratory variables (e.g., hemoglobin, PLT, ALC, albumin) with different cut-offs, we compared the laboratory biomarkers’ hazard ratios (HR) at different cut-offs. We then applied the cut-offs producing the highest HR in multivariate analysis. Five IPI/NCCN-IPI variables were further combined with four laboratory variables (hemoglobin < 120 g/L, PLT < 100 × 10⁹/L, ALC < 0.84 × 10⁹/L, and albumin < 35 g/L) in multivariate analysis. No significant correlations (collinearity) between variables included in multivariate analysis were observed. In multivariate analysis with IPI parameters, only the number of extranodal sites was not significant, whereas when the five NCCN-IPI parameters were combined with the four laboratory variables, all parameters retained prognostic significance, with extranodal sites only marginally significant (Suppl. Table 3).

Figure 2 presents Kaplan–Meier curves for all 13 models. Moreover, we calculated 3- and 5-year OS rates for all models, as shown in Tables 3 and 4. Five-year OS estimates in the respective high-risk groups ranged from 31.7%, 33.4%, and 38.5% for Modified NCCN-IPI, NCCN-IPI, and DLBCL-PI, respectively, to 43.6% and 53.9% for IPI and R-IPI. In the respective low-risk groups, 5-year OS was 89.5%, 92.8%, and 97.1% for DLBCL-PI, Modified NCCN-IPI, and NCCN-IPI, while a lower estimate of 85.8% was registered for IPI, but not for R-IPI (97.5%).

Fig. 2: Overall survival of 13 prognostic models in diffuse large B-cell lymphoma patients (Kaplan–Meier curves).

The shaded color areas around curves represent confidence intervals.

Progression-free survival (PFS)

The median PFS was 129.5 months (95% CI, 120.8–138.2 months), and the maximum PFS was 244.7 months. Suppl. Table 4 provides HRs for PFS for risk groups within each prognostic model, together with measures of model fit and discrimination. Kaplan–Meier and calibration curves were similar to those of OS (data not provided).

Model fit, discrimination, and calibration

The lowest AIC values were registered for NCCN-IPI (34002), Modified NCCN-IPI (34039), and DLBCL-PI (34100). The highest AIC was registered for aaIPI (34748), along with the laboratory models (Table 5). IPI and R-IPI had AIC values in the middle of the group (34340 and 34380, respectively). BIC results were similar to those for AIC (Table 5).
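The two criteria can be written out explicitly. The sketch below uses the standard definitions, AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters and ln L the maximized log-likelihood (for a Cox model, the log partial likelihood). Which n enters the BIC for censored data, all patients or only events, varies between implementations, so that choice is an assumption here, as are the function names and the log-likelihood values in the example.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: penalizes each parameter by 2."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: penalizes each parameter by ln(n)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical log partial likelihoods, NOT values from this study.
# A four-group index enters a Cox model as three dummy variables (k = 3),
# a three-group index as two (k = 2).
four_group = {"loglik": -17000.0, "k": 3}
three_group = {"loglik": -17150.0, "k": 2}
n_events = 2190  # number of deaths reported in this cohort

for name, m in (("four-group", four_group), ("three-group", three_group)):
    print(name,
          round(aic(m["loglik"], m["k"]), 1),
          round(bic(m["loglik"], m["k"], n_events), 1))
```

Both criteria reward a higher likelihood and penalize extra parameters, so a model with more risk groups ranks better only if the improvement in fit outweighs the added complexity.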

Table 5 Summary of hazard ratios, model fit/quality measures, and discrimination measures concerning overall survival.

The highest CPE values were found for NCCN-IPI (0.670), Modified NCCN-IPI (0.664), and DLBCL-PI (0.660). The lowest CPE was registered for the Matsumoto model (0.580), aaIPI (0.585), and two laboratory models (0.577, 0.584) (Table 5).

The models with the highest c-index were DLBCL-PI, NCCN-IPI, and Modified NCCN-IPI, with values of 0.700, 0.693, and 0.684, respectively. When these models were compared with NCCN-IPI as the reference model, there was no statistically significant difference between NCCN-IPI and DLBCL-PI. However, NCCN-IPI had statistically better discriminative ability than Modified NCCN-IPI. Additionally, NCCN-IPI performed significantly better than IPI and R-IPI, for which the c-indexes were 0.673 and 0.643, respectively. Other models with three risk groups had significantly inferior discriminative ability to NCCN-IPI, with a c-index lower than that of R-IPI (Fig. 3).

Fig. 3: Calibration curves of 13 prognostic models in diffuse large B-cell lymphoma patients concerning overall survival.

Additionally, when AUC was calculated, DLBCL-PI, NCCN-IPI, and Modified NCCN-IPI showed the highest values (0.661, 0.657, and 0.651, respectively). In contrast, the lowest AUC values were found in three-risk models and aaIPI.

Calibration curves for 5-year survival are provided in Fig. 3. Models with the highest c-index had calibration curves close to a 45-degree line, indicating good calibration.

Discussion

We used a population-based lymphoma database to compare and validate 13 clinical prognostic models for DLBCL, along with two laboratory-based ones. We confirmed the prognostic value of all included models in the rituximab era but found variable discriminatory power among different models. Although previous studies have shown that the IPI has diminished value in the rituximab era, it is still widely used for risk stratification in general practice and clinical trials. So far, only a few models have been validated in different populations (e.g., R-IPI, NCCN-IPI) [2, 15, 35]. Our study shows that although an ideal clinical model is still needed, NCCN-IPI and DLBCL-PI provided better discriminative ability than other analyzed models, in particular models with three risk groups and aaIPI [8, 12].

The predictive capacity of the IPI has been universally accepted due to the use of robust and easily accessible markers. Age was frequently included in analyzed models, as its relevance with different cut-offs has been confirmed in DLBCL in numerous studies [8, 12, 30, 36]. In our analysis, age was incorporated in all three models providing the highest discrimination, with two incorporating fractionalized age groups (NCCN-IPI, Modified NCCN-IPI) [8, 30]. Accordingly, Biccler et al. reported a significant loss in predictive performance when dichotomizing age, although this was not the case with the dichotomization of other IPI variables [11]. Age is probably one of the most influential survival predictors as it captures some patient characteristics not directly included in models. Although in elderly patients, lymphoma is not substantially less responsive to treatment than in younger patients, poor outcome in elderly patients results mainly from their decreased ability to tolerate treatment [37]. Moreover, polypharmacy, changes in drug metabolism with age, comorbidities, and impaired bone marrow function increase the risk of treatment-related toxicity [37]. Therefore, deaths from other causes, specifically in the elderly, represent a significant competing risk [11].

In our study, all five IPI/NCCN-IPI variables were significant in univariate analysis, but not all retained significance in multivariate analysis. When IPI/NCCN-IPI variables were combined with the four laboratory variables included in other models, only extranodal sites lost prognostic significance. As the prognostic impact of the extranodal scoring system included in the IPI has been questioned in the rituximab era, numerous studies have proposed that the particular (number of) high-risk sites are more impactful than the number of involved sites according to the IPI [8, 31, 38, 39]. This distinction has been considered in three models we evaluated [8, 30, 31]. Incorporating positron emission tomography measures (i.e., baseline metabolic tumor volume [MTV]) instead of extranodal sites could further improve the value of future models for DLBCL [40,41,42,43,44]. However, standardization across studies evaluating these measures is currently lacking [40,41,42,43].

Several studies have previously demonstrated the prognostic value of albumin [12, 45, 46]. Low albumin levels are associated with poor general health, inflammation, and low body mass index (BMI) and are generally related to poor survival in patients independent of lymphoma [45]. Of the four models incorporating albumin levels, only DLBCL-PI and Modified NCCN-IPI showed performance measures superior to those of the other models [12, 30]. As DLBCL-PI was developed from a Danish population, overlap with the current study population is likely and could potentially lead to an overestimation of the discriminatory ability of DLBCL-PI [12]. Models incorporating albumin are challenged primarily by missing values in retrospective cohorts, variable cut-off levels, and correlation with other conditions (e.g., previous/concomitant cancers, renal failure, low BMI, inflammatory diseases).

Laboratory markers, such as blood counts, have been investigated as potential surrogate markers of the host’s adaptive immunity and immune microenvironment [28, 47]. ALC was incorporated in two models, but the discriminative value of these models was below that of the IPI in our analysis [28, 29]. Moreover, pretreatment hemoglobin concentration was previously associated with outcomes in DLBCL patients treated with anthracycline-containing chemotherapy [48]. Although hemoglobin, PLT levels, and ALC showed prognostic significance in multivariate analysis, models incorporating these parameters had poor discrimination and calibration in the current analysis [32,33,34].

When we examined the prognostic models’ ability to discriminate between risk groups, all models were successful but with variable performance. Most models failed to identify populations with very poor outcomes and less than 50% long-term survival. A 5-year survival of 35% or less in high-risk groups was identified by Modified NCCN-IPI (31.7%) and NCCN-IPI (33.4%) [8, 30]. Only NCCN-IPI could identify low-risk patients with excellent prognoses with 5-year OS over 95% (97.1%) [8]. In contrast, IPI, aaIPI, and models with three risk groups could not identify a 5-year OS of less than 40% in high-risk patients, and they also failed to identify a population with a favorable prognosis in the low-risk group with 5-year OS over 90% [3, 7, 28, 32,33,34]. The exception was R-IPI, which identified low-risk patients with a 5-year OS of 97.5%, comparable to NCCN-IPI [7, 8]. Ruppert et al., using data from 7 clinical trials (2124 patients), reported 5-year OS for IPI, R-IPI, and NCCN-IPI ranging from 54% to 88%, 61% to 93%, and 49% to 92%, in the high- and low-risk groups, respectively [2]. The inferior discriminatory ability of NCCN-IPI in the study of Ruppert et al. compared to the current study is likely due to the inclusion of younger patients in the former study (median 63 vs. 68 years) [2]. Moreover, the authors included a population from clinical trials, while NCCN-IPI was developed from an unselected population of newly diagnosed DLBCL patients, which was also the case in our study [8].

We found superior discriminatory ability of NCCN-IPI, DLBCL-PI, and Modified NCCN-IPI compared to all other models [8, 12, 30]. Regarding discriminative ability, only DLBCL-PI showed a c-index comparable, but not superior, to that of the NCCN-IPI [8, 12]. The lowest discriminatory ability was observed in models with three risk groups and in models without age as a component, including aaIPI and aaDLBCL-PI. Moreover, when comparing the two laboratory models to the clinical models, including NCCN-IPI (c-index 0.693), they showed significantly inferior discriminatory ability (c-index ranging from 0.606 to 0.612), indicating that the established laboratory variables cannot be used alone to prognosticate DLBCL. One reason for the poor discrimination could be that these models stratified patients into only three risk groups. Moreover, clinical variables seem to add significantly to the performance of each model, probably due to the robustness of clinical variables such as age and ECOG PS. Similarly, the best fit evaluated by AIC and BIC was also registered for the three previously mentioned models [8, 12, 30]. Additionally, calibration curves were in accordance with the discriminative power of the different models, showing poor calibration among models with three risk groups.

Numerous biomarkers have already been incorporated in some predictive models for DLBCL (MTV, cell of origin, genetic markers) [13, 40, 42]. Moreover, the prognostic potential of whole-genome sequencing and circulating tumor DNA (ctDNA), along with sociodemographic, clinical, laboratory, molecular, and genetic markers, is under investigation to facilitate prognostication and treatment decision-making [1]. The need for a comprehensive prognostic model for DLBCL patients has been recognized, particularly given the evolving treatment possibilities and improvements in diagnostics [1, 4]. Identifying high-risk patients who could benefit from early treatment with CAR-T, bispecifics, and other novel therapies is necessary. Studies aiming to develop more accurate models for outcome predictions should focus on using different model formulations with more predictors (including MTV, ctDNA, genetic markers), different flexible modeling approaches (e.g., spline models, random survival forests), the optimal endpoint (cause-specific survival, disease-free survival), and the timing of model calculation (e.g., at diagnosis, interim evaluation) [11, 17, 49]. However, until new biomarkers are integrated into validated prognostic models, NCCN-IPI and the IPI should be reported for patients treated in clinical trials to allow optimal comparison of outcomes with previous studies [2]. Despite not affecting treatment decisions, NCCN-IPI can provide insight into patients’ long-term survival.

Although this is one of the most extensive validation studies of different clinical prognostic models in DLBCL based on easily obtained clinical features, several limitations should be addressed. The retrospective nature and use of register-based data come with an inevitable bias. Incorrect disease classification is possible, as some cases (e.g., primary mediastinal B-cell lymphoma, primary effusion lymphoma, leg-type DLBCL) could have been registered as DLBCL. As these subtypes are rare compared to DLBCL, their potential influence is likely insignificant. Moreover, we did not analyze models incorporating cell of origin, MYC/BCL2/BCL6 rearrangements, comorbidities, or results of interim response analysis. Models incorporating variables not reported regularly in the database and those with significant missing values were not analyzed. Additionally, overall survival, with death irrespective of cause as the event, was the primary endpoint of this study. However, an issue in survival analysis is competing risks of non-fatal outcomes and mortality due to non-lymphoma causes, which can lead to an overestimation of absolute risk [50]. Nevertheless, the study’s main strengths are the large number of patients and the direct comparison of risk indices calculated on the same individual patients from a real-world dataset.

Conclusion

This large retrospective register-based study analyzing 13 prognostic models for DLBCL showed superior model quality and discriminatory ability of NCCN-IPI and DLBCL-PI compared to other models. Moreover, aaIPI performed poorly and should be used with caution. Laboratory-based models with currently available markers should be critically analyzed due to the significant additive effects of clinical characteristics on prognosis. The current analysis favors NCCN-IPI, and this model is suggested as the reference model along with the IPI when developing models for DLBCL. Future developments and a better understanding of disease pathology will hopefully allow the use of extensive data to create models based on prognostically important clinical, molecular, and genetic factors to facilitate treatment decision-making.