Introduction

Progress in breast cancer biology, research, and treatment has made the incorporation of clinical and pathological characteristics into the staging system necessary. The American Joint Committee on Cancer (AJCC) TNM staging classification is globally accepted to describe extent of cancer at diagnosis and has been available since 1959. The AJCC has incorporated biological factors (e.g. Gleason’s score, mitotic index, and grade) into the staging classification of cancers such as prostate, melanoma, and sarcoma1,2,3. In contrast, until recently, the breast cancer TNM classification has had only anatomical-based staging criteria without the addition of biological, prognostic, and predictive factors used to guide treatment in the clinical practice4,5,6,7,8. It is evident that a staging system based solely on anatomy, does not always reflect the variable clinical course and long-term outcomes seen in clinical experience9. Age, tumor grade, hormone receptor (HR) status, and human epidermal growth factor receptor-2 (HER2) status are well-recognized prognostic factors and the latter two also serve as predictors of response to endocrine and anti-HER2 therapy, respectively10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29. Prior work with robust statistical approaches has indicated the need to include such factors in a refined staging system30,31,32,33.

The recently published AJCC 8th edition recognizes new prognostic categories that significantly improve prognostic categorization compared to the anatomic stage groupings alone34. To date, there are no robust breast cancer outcome prediction tools available to practicing clinicians to provide patients with early stage breast cancer survival estimates based on the presenting features of their tumor and other clinical factors. Prior tools have been developed but have used different statistical approaches, i.e. mathematical modeling, Cox proportion hazards regression, or actuarial analysis; lacking in follow-up time; more focused on guiding adjuvant therapy discussions; or do not represent a contemporary patient population35,36,37,38. To accompany the recent update to the AJCC staging system, we sought in this analysis, to develop a validated model and demonstrate that age, tumor grade, and biomarker subtypes, conditional on stage, were important determinants of BCSS and that their incorporation could further refine the survival estimates based upon stage utilizing a Fine–Gray Model. Further, we have developed an online tool to estimate individual prognosis based on this model combining clinical and biologic variables for use in daily clinical practice.

Results

Prediction model on the training data

In the MDACC database, 20,928 patients with stage I–III breast cancer, who received treatment with surgery as the first intervention were identified (see Supplementary Table 1 for the characteristics of the overall cohort). From the overall cohort (n = 20,928), a complete data cohort (N = 14,781) was formed for model comparison purposes, after excluding 29% of the patients (n = 6147) missing one or more of the key variables (age, HR, HER2, grade). All models with stage and additional factors showed higher C-index than the model with stage alone (Table 1). The model with age, grade, combined HR and HER2 status, and stage, which reflects the information conventionally available in current clinical practice, showed the best predictive accuracy.

Table 1 Comparisons of multivariable Fine–Gray models, using complete data cohort (N = 14,781)a.

Table 2 shows the characteristics of the patients included in the multivariate analysis from the training dataset (N = 14,781) and validation dataset (N = 29,727). The median follow-up was 6.6 years (95% CI, 6.5–6.7) (range: 0–24.9 years) and the median age was 55 (range 20–99) years. The 5-year BCSS estimate was 89% for the whole multivariate cohort. Outcomes for the multivariate cohort were as follows: 15% BC-related mortality, 20% mortality from any cause, 10% loco-regional recurrence, and 20% distant recurrence (first loco-regional recurrence and distant recurrence were 8% and 16%, respectively).

Table 2 Patient characteristics for the multivariate Fine–Gray model: complete data cohort in training and validation datasets.

In multivariate analysis, age, stage, biologic subtype, and grade were associated with BCSS (Table 3). There was some indication of violations of the PH assumption, especially for HR/Her2 subtype. However, we did not explore the Fine–Gray model with time-varying covariates because the analysis would be computationally complex and the resulting model could still yield serious biases39,40. Despite the violation of the PH assumption, our estimated hazard ratios are still interpretable as weighted average hazard ratios over the follow-up41. Figure 1 shows examples for the largest stage groupings (I, IIA, IIB) demonstrating that the curves can be further refined with the combination of factors including age group, receptor-subtype, and nuclear grade within each anatomic stage group. In each stage grouping, patients aged 40–69 with a grade 1 HR+/HER2− cancers have the best survival; whereas patients with age < 40, grade 3, TNBC cancers have the worst survival. The impact of age is also clear; younger patients (<40) with a grade 3 TNBC consistently have worse survival compared to an older age group 40–69 keeping other factors constant (TNBC and grade 3).

Table 3 Multivariate Fine–Gray model (Model 7) (N = 14,781)a parameter estimates based on Fine–Gray model.
Fig. 1: BCSS estimation.
figure 1

Estimated BCSS by combining age group, receptor subtype, grade within each stage group (1, 2A, 2B) based on Fine–Gray model 7 using average baseline survival (N = 14,781).

Table 4 shows refined 5-year and 10-year BCSS estimates by age group (<40: A, 40–69: B, and ≥70: C), HR status, HER2 status, and pathologic stage for each grade. For TNBC, Grade 3, age < 40, the 5-year BCSS is noted to decrease by stage: 88% (1); 76% (IIA), 65% (IIB), 60% (IIIAB), and 45% (IIIC). In contrast, for TNBC, grade 3, age 40–69, the 5-year BCSS is as follows: 92% (I), 84% (IIA), 76% (IIB), 72% (IIIAB), and 60% (IIIC). Within the age group < 40 and HR+/HER2− subtype, the 5-year BCSS within each stage group is different for grade 1 vs. grade 3 tumors: 98% vs. 94% (I), 96% vs. 86% (IIA), 93% vs. 80% (IIB), 92% vs. 76% (IIIAB), 88% vs. 66% (IIIC). Similarly, within the age group 40–69 similar contrasts are noted for HR+/HER2− by grade (1 vs. 3): 99% vs. 96% (I), 97% vs. 91% (IIA), 96% vs. 87% (IIB), 95% vs. 84% (IIIAB), and 92% vs. 76% (IIIC). The results of Shoenfeld residuals, nonlinear covariate effects and two-way covariate–covariate interaction yielded no remarkable findings. Table 4 shows the 5-year and 10-year BCSS for each combination of factors (age, anatomic stage, HR, HER2, and tumor grade).

Table 4 Five-year and ten-year BCSS estimates by age, HR status, HER2 status, and stage for each grade based on Fine–Gray model 7 with average baseline survival (n = 14,781).

Model validation

Table 2 shows the patient characteristics for the 29,727 patients in the validation data set. The median follow-up was 2.4 years and the median age was 58 years. The 5-year BCSS estimate and 5-year OS was 95% and 91%, respectively, for the validation cohort. Outcomes were as follows: 3% BC-related mortality, 6% mortality from any cause, 1% locoregional recurrence, and 4% distant recurrence. BCSS at 5 years was 89% in the training set and 95% in the validation set while at 10 years BCSS was 78% in the training set and 89% in the validation set. Figure 2a, b show BCSS predictions in the training and validation sets stratified by risk set (based on the same centiles used for the training set). The c-index from the validation set is 0.82. Taken together, these indicate that the selected model discriminated well when applied to the validation set. However, Fig. 3a and b shows that our model is poorly calibrated when applied to the validation set. The predicted probabilities were much lower than the observed probabilities. Figure 4 shows that after recalibration, the predictions showed good calibration in the validation data.

Fig. 2: BCSS probability by risk group.
figure 2

a Training data and b validation data.

Fig. 3: Calibration plots.
figure 3

a Training data, 5 years, b training data, 10 years, c validation data, 5 years and d validation data, 10 years.

Fig. 4: New calibration plots after recalibration.
figure 4

a Training data, 5 years, b training data, 10 years, c validation data, 5 years, and d validation data, 10 years.

Discussion

In this analysis, we demonstrate statistical significance and improved predictive accuracy of incorporating age, grade, and receptor-based subtypes conditional on anatomic stage to improve prediction of breast cancer-specific survival (BCSS) using a Fine–Gray Model. In doing so, we have shown through a different statistical methodology and with a longer follow-up that more individualized prediction of BCSS is possible by considering clinical and biologic factors in addition to anatomic stage in a dataset with long followup that has been validated in an external dataset with demonstration of good calibration in both datasets. Calibration of the BCSS prediction model is evaluated by visual inspection of the 5-year and 10-year calibration plots, plotting the average predicted BCSS against the observed BCSS for each of 5 risk groups at 5 years and 10 years. The 5-year and 10-year BCSS of patients with particular values for the clinical and biological variables can be found using an online tool at the following link: http://mdanderson.org/BCSS.

We chose the Fine–Gray model to develop a BCSS prediction model over other potential and different statistical methods used in the existing online breast cancer survival prediction tools35,36,37,38. PREDICT, an online breast cancer prognostic and treatment benefit model, used a Cox proportional hazards model, which is frequently used in risk prediction models for breast cancer, to predict BCSS. When an individual in the risk set is exposed to more than one cause of failure (e.g., non-breast cancer death, which cannot be neglected for early stage breast cancers), resulting in competing risks, the Kaplan–Meier method of estimating CIF and the Cox regression model of estimating survival lead to biases42. The Fine–Gray model is a widely used method to build prediction models when competing risks exist (as is the case for early stage breast cancer). The Fine–Gray model allows estimation of the effect of the covariates on the CIF, but it has a limitation in its difficulty in interpreting the model coefficients, as compared to the Cox proportional hazards model42,43,44. That is, the model covariates can be interpreted as having an effect on the CIF, but they do not directly link to an underlying event rate in the real world45,46. This limitation, however, does not pose an issue in our analysis as the CIF is of equal clinical relevance in our study.

The results of our study should be interpreted with several considerations. First, the analysis was performed using data collected at a high volume single cancer center with specific referral and practice patterns leading to a potential selection bias. Our institution tends to treat larger and higher grade tumors with neoadjuvant systemic therapy, and the patient cohort analyzed in this study excluded those patients, possibly biasing the remaining group. One example of our referral bias is that the median age of patients in this database is lower than the national median age. However, since the incorporation of the other biomarkers had a similar effect in all age groups in this analysis, we consider that what we demonstrated in our database is likely generalizable to all age groups. Second, all patients did not have their surgery at our institution so there is the possibility of interobserver variability among pathologists. However, the majority of patients seen at MDACC have their pathology material reviewed by dedicated breast cancer pathologists, so this variability was likely largely reduced. Further, while inter-pathologist variability in determining grade has been amply documented in the literature, grade has always been a strong prognostic marker in all published analyses. Additionally, the majority of patients in the training dataset had only NG available. In a preliminary analysis not shown, concordance was evaluated among the 6150 patients in the training dataset with both NG and HG and 17,536 patients with both NG and HG in the validation dataset with a substantial or moderate agreement noted (Kappa 0.7 (training) and 0.56 (validation), respectively)47. Therefore, a decision was made to use NG interchangeably when HG was unavailable. Third, our results reflect heterogeneity of treatments selected by individual treating physicians, although the overwhelming majority would have received an anthracycline-based regimen, and most also received a taxane. Detailed adjuvant therapy was available in the training dataset but not available in the validation dataset. In the training dataset, we made a decision to exclude HER2+ patients who did not receive adjuvant therapy with trastuzumab-based regimens [HER2+(no T)] patients to provide survival probability estimates that better reflect outcomes of a contemporary population; however it has introduced a bias within this subset with respect to adjuvant treatment as well as time. Adjuvant treatment was not accounted for in the other biomarker subtypes and trastuzumab was not incorporated into standard adjuvant practice for HER2+ BC until 2005. Fourth, the data spanned over a large time period during which significant diagnostic and therapeutic advances have been made, resulting in a temporal cohort effect. Fifth, with respect to violations of proportional hazard, the hazard ratio changes over time making it challenging to represent this situation accurately with a single value (i.e., which is valid when the hazard ratio is constant over time). Thus, the estimate can be viewed as a weighted average of the hazard ratios which are changing over time. Future research will consider more complicated models, using time-varying covariate effects, in pursuit of more accurate prediction. Another consideration is that there is the potential for more complex interaction effects between these additional clinical and biological factors with other variables, such as adjuvant treatment and pathological stage, not accounted for in the analyses presented here. The median follow-up time in the training dataset is much longer than that of the validation dataset (8 years vs. 2.4 years) and this resulted in the need to recalibrate the predictions due to poor calibration of the final model when applied to validation data set. It is unclear why the final model did not fit the validation data well. However, a possible answer is the dissimilarity in the distributions of covariates and risk profiles between training and validation data sets. Finally, all deaths after breast cancer recurrence were considered as breast cancer-related deaths as the detailed death attributions are not available in our database.

Previously published work from MDACC has shown that a staging system that incorporates grade and HR status improves the disease-specific survival estimates as compared to anatomic stage alone30. In two recent presentations from MDACC, a novel risk score that includes grade, and ER, and HER2 status was evaluated in a contemporary patient population who received surgical intervention at MDACC and was then validated in a larger cancer registry confirming the importance of biologic factors in determining prognosis for breast cancer patients48. The MDACC database has also been supplemented with laboratory, patient lifestyle, and quality of life survey data showing further refinement in prognostic ability49. The recently updated AJCC 8th edition has recognized prognostic categories which are largely based on the NCDB analysis which contains over 300,000 women with invasive breast cancer diagnosed in 2010–2011 with a complete set of variables but short median follow-up (37.6 months). The inclusion of grade, and HER2, and HR status using the NCDB model resulted in stage reassignment for 41% of the patients to a stage group that was higher or lower than would otherwise have been assigned using the AJCC 7th edition anatomic stage categories34. Since the analyses confirming that prognosis varied within TNM groupings based on tumor biology and the identification of the new prognostic categories in the updated AJCC 8th edition, there have been several publications and validation papers further recognizing the importance of incorporating biological factors and confirming the effectiveness of the revised prognostic categories50,51,52,53.

While the breast cancer community has known for several decades the prognostic impact of grade, age, HR and HER2, no publication had shown that singly or in combination, these factors affected the outcomes of patients included in specific TNM stages. This analysis has clearly demonstrated the added prognostic value of patient and tumor characteristics when combined with anatomical stage. External validation confirmed discriminative ability of selected model and with recalibration the predictions were well calibrated to the validation data. In summary, we present the first user-friendly clinical tool developed to estimate BCSS based on an extensive analysis using Fine–Gray Model in a robust single institution database and validated in a nationally recognized-external database. The goal of developing this tool is to provide a resource for clinicians to help guide discussions with patients and provide an estimation of prognosis based on clinical and biological factors.

Methods

Training data patient population

A prospectively maintained electronic database of patients with breast cancer treated at The University of Texas MD Anderson Cancer Center (MDACC) was used to identify over 20,000 patients with stage I–III invasive unilateral breast cancer who received surgery as initial treatment from 1990 to 2016. Clinico-pathologic data was collected from the database, including age; stage; grade, estrogen receptor (ER), progesterone receptor (PR), and HER2 status; adjuvant treatment history; and disease status at the time of death. Pathological stage, tumor grade, HR status, and HER2 status were extracted from the surgical pathology report and determined according to AJCC guidelines8,27,54,55,56,57,58,59. For tumor grade, composite histologic grade was used primarily when available and if not available, then nuclear grade was used (Supplementary Text A). Disease status at the time of death and cause of death were ascertained for each patient. Patients were defined to have a breast cancer (BC)-related death if they died following a recurrence of breast cancer irrespective of the time elapsed between recurrence and death and the cause of death. Study data were collected and analyzed with approval from the Institutional Review Board (IRB) at the University of Texas MDACC. A waiver of consent was obtained due to the retrospective nature of the study. The data were transferred for analysis on April 5, 2016. Among the patients who were alive, 66% had a date of last follow-up within 2 years of April 2016. 58% of alive patients had follow-up longer than 5 years and 28% of alive patients had follow-up longer than 10 years.

Validation data patient population

In the NCCN breast cancer outcomes database, a cohort of over 44,000 patients with pathological stage I–III unilateral primary invasive breast cancer who received surgery as an initial treatment from 1997 to 2012 were identified. Patients received all or some of their treatment at one of 16 NCCN participating centers between July 1st 1997 and December 31, 2012 (Supplementary Text B). Patients registered at the University of Texas MDACC (n = 7432) were excluded from the NCCN database to create an external dataset of 37,559 patients (non-MDACC NCCN cohort). Patients without complete data on age, anatomic stage, HR/HER2 status, and grade were excluded resulting in a validation cohort comprised of 29,727 patients who had complete data. All of the clinically relevant variables were obtained directly from the NCCN database. The last follow-up in validation data was 2/15/2013. Among patients who were alive, 79% had a date of last follow-up within 2 years of February 2013, 20% of patients alive had follow-up longer than 5 years, and 4% of patients alive had follow-up longer than 10 years.

Statistical analysis

The primary endpoint was BCSS that was assessed while treating non-BC death as a competing risk. BCSS curves were estimated using the Aalen–Johansen method. Univariate and multivariate Fine–Gray proportional hazards models were fit to assess the statistical significance of the effects of the clinically relevant variables on BCSS. We coded HR/HER2 as a four-level categorical variable in all models. Harrell’s C-index was calculated to evaluate the discrimination capacity of each model56. The kmi package was used to impute censoring times for competing risk data and rcorr.cens in Hmisc package was used to estimate the C-index and its confidence interval. We checked the proportional hazards (PH) assumption by visually inspecting the smoothed, scaled Schoenfeld residuals and hazard ratios by time intervals, assessed nonlinear covariates effects using spline functions and checked for two-way covariate–covariate interactions by introducing product terms in the models. A p-value of <0.05 indicated statistical significance. Using a selected Fine–Gray model to estimate the BCSS probabilities (as 1−cumulative incidence function (CIF)) including patients with complete data, an online tool to estimate individual prognosis was developed. Prognostic index was defined as the weighted sum of the variables in the Fine–Gray model, where the weights were the regression coefficients. Model calibration was evaluated by comparing observed and predicted BCSS probabilities for five risk groups (defined by partitioning the prognostic index based on its 16th, 39th, 62nd, and 84th percentiles).

To assess the performance of our selected model on the validation data, we compute predictions for each patient in the validation set using the model fit to the training data and compare these predictions to the observed validation outcomes. Statistical analyses were performed using SAS 9.4 (SAS Institute Inc, Cary, NC). A more complete description of the statistical methods is given in the supplementary materials (Supplementary Text C).