Incorporation of clinical and biological factors improves prognostication and reflects contemporary clinical practice

We developed prognostic models for breast cancer-specific survival (BCSS) that consider anatomic stage and other important determinants of prognosis and survival in breast cancer, such as age, grade, and receptor-based subtypes with the intention to demonstrate that these factors, conditional on stage, improve prediction of BCSS. A total of 20,928 patients with stage I–III invasive primary breast cancer treated at The University of Texas MD Anderson Cancer Center between 1990 and 2016, who received surgery as an initial treatment were identified to generate prognostic models by Fine-Gray competing risk regression model. Model predictive accuracy was assessed using Harrell’s C-index. The Aalen–Johansen estimator and a selected Fine–Gray model were used to estimate the 5-year and 10-year BCSS probabilities. The performance of the selected model was evaluated by assessing discrimination and prediction calibration in an external validation dataset of 29,727 patients from the National Comprehensive Cancer Network (NCCN). The inclusion of age, grade, and receptor-based subtype in addition to stage significantly improved the model predictive accuracy (C-index: 0.774 (95% CI 0.755–0.794) vs. 0.692 for stage alone, p < 0.0001). Young age (<40), higher grade, and TNBC subtype were significantly associated with worse BCSS. The selected model showed good discriminative ability but poor calibration when applied to the validation data. After recalibration, the predictions showed good calibration in the training and validation data. More refined BCSS prediction is possible through a model that has been externally validated and includes clinical and biological factors.


INTRODUCTION
Progress in breast cancer biology, research, and treatment has made the incorporation of clinical and pathological characteristics into the staging system necessary. The American Joint Committee on Cancer (AJCC) TNM staging classification is globally accepted to describe extent of cancer at diagnosis and has been available since 1959. The AJCC has incorporated biological factors (e.g. Gleason's score, mitotic index, and grade) into the staging classification of cancers such as prostate, melanoma, and sarcoma [1][2][3] . In contrast, until recently, the breast cancer TNM classification has had only anatomical-based staging criteria without the addition of biological, prognostic, and predictive factors used to guide treatment in the clinical practice [4][5][6][7][8] . It is evident that a staging system based solely on anatomy, does not always reflect the variable clinical course and long-term outcomes seen in clinical experience 9 . Age, tumor grade, hormone receptor (HR) status, and human epidermal growth factor receptor-2 (HER2) status are well-recognized prognostic factors and the latter two also serve as predictors of response to endocrine and anti-HER2 therapy, respectively [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29] . Prior work with robust statistical approaches has indicated the need to include such factors in a refined staging system [30][31][32][33] .
The recently published AJCC 8th edition recognizes new prognostic categories that significantly improve prognostic categorization compared to the anatomic stage groupings alone 34 . To date, there are no robust breast cancer outcome prediction tools available to practicing clinicians to provide patients with early stage breast cancer survival estimates based on the presenting features of their tumor and other clinical factors.
Prior tools have been developed but have used different statistical approaches, i.e. mathematical modeling, Cox proportion hazards regression, or actuarial analysis; lacking in follow-up time; more focused on guiding adjuvant therapy discussions; or do not represent a contemporary patient population [35][36][37][38] . To accompany the recent update to the AJCC staging system, we sought in this analysis, to develop a validated model and demonstrate that age, tumor grade, and biomarker subtypes, conditional on stage, were important determinants of BCSS and that their incorporation could further refine the survival estimates based upon stage utilizing a Fine-Gray Model. Further, we have developed an online tool to estimate individual prognosis based on this model combining clinical and biologic variables for use in daily clinical practice.

Prediction model on the training data
In the MDACC database, 20,928 patients with stage I-III breast cancer, who received treatment with surgery as the first intervention were identified (see Supplementary Table 1 for the characteristics of the overall cohort). From the overall cohort (n = 20,928), a complete data cohort (N = 14,781) was formed for model comparison purposes, after excluding 29% of the patients (n = 6147) missing one or more of the key variables (age, HR, HER2, grade). All models with stage and additional factors showed higher C-index than the model with stage alone ( Table 1). The model with age, grade, combined HR and HER2 status, and stage, which reflects the information conventionally available in current clinical practice, showed the best predictive accuracy. Table 2 shows the characteristics of the patients included in the multivariate analysis from the training dataset (N = 14,781) and validation dataset (N = 29,727). The median follow-up was 6.6 years (95% CI, 6.5-6.7) (range: 0-24.9 years) and the median age was 55 (range 20-99) years. The 5-year BCSS estimate was 89% for the whole multivariate cohort. Outcomes for the multivariate cohort were as follows: 15% BC-related mortality, 20% mortality from any cause, 10% loco-regional recurrence, and 20% distant recurrence (first loco-regional recurrence and distant recurrence were 8% and 16%, respectively).
In multivariate analysis, age, stage, biologic subtype, and grade were associated with BCSS (Table 3). There was some indication of violations of the PH assumption, especially for HR/Her2 subtype. However, we did not explore the Fine-Gray model with timevarying covariates because the analysis would be computationally complex and the resulting model could still yield serious biases 39,40 . Despite the violation of the PH assumption, our estimated hazard ratios are still interpretable as weighted average hazard ratios over the follow-up 41 . Figure 1 shows examples for the largest stage groupings (I, IIA, IIB) demonstrating that the curves can be further refined with the combination of factors including age group, receptor-subtype, and nuclear grade within each anatomic stage group. In each stage grouping, patients aged 40-69 with a grade 1 HR+/HER2− cancers have the best survival; whereas patients with age < 40, grade 3, TNBC cancers have the worst survival. The impact of age is also clear; younger patients (<40) with a grade 3 TNBC consistently have worse survival compared to an older age group 40-69 keeping other factors constant (TNBC and grade 3). Table 4 shows refined 5-year and 10-year BCSS estimates by age group (<40: A, 40-69: B, and ≥70: C), HR status, HER2 status, and pathologic stage for each grade. For TNBC, Grade 3, age < 40, the 5-year BCSS is noted to decrease by stage: 88% (1); 76% (IIA), 65% (IIB), 60% (IIIAB), and 45% (IIIC). In contrast, for TNBC, grade 3, age 40-69, the 5-year BCSS is as follows: 92% (I), 84% (IIA), 76% (IIB), 72% (IIIAB), and 60% (IIIC  Harrell's C-Index: The kmi package in R was used to impute censoring times for competing risk data and the rcorr.cens function in the Hmisc package in R was used to estimate the C-index and its confidence interval.    Histologic grade or nuclear grade (if histologic grade is not available); HG (n = 4724); NG (n = 10,057). The results of Shoenfeld residuals, nonlinear covariate effects and two-way covariate-covariate interaction yielded no remarkable findings. Table 4 shows the 5-year and 10-year BCSS for each combination of factors (age, anatomic stage, HR, HER2, and tumor grade). Table 2 shows the patient characteristics for the 29,727 patients in the validation data set. The median follow-up was 2.4 years and the median age was 58 years. The 5-year BCSS estimate and 5-year OS was 95% and 91%, respectively, for the validation cohort. Outcomes were as follows: 3% BC-related mortality, 6% mortality from any cause, 1% locoregional recurrence, and 4% distant recurrence. BCSS at 5 years was 89% in the training set and 95% in the validation set while at 10 years BCSS was 78% in the training set and 89% in the validation set. Figure 2a, b show BCSS predictions in the training and validation sets stratified by risk set (based on the same centiles used for the training set). The c-index from the validation set is 0.82. Taken together, these indicate that the selected model discriminated well when applied to the validation set. However, Fig. 3a and b shows that our model is poorly calibrated when applied to the validation set. The predicted probabilities were much lower than the observed probabilities. Figure 4 shows that after recalibration, the predictions showed good calibration in the validation data.  I  IIA  IIB  IIIAB  IIIC  I  IIA  IIB  IIIAB  IIIC   1

DISCUSSION
In this analysis, we demonstrate statistical significance and improved predictive accuracy of incorporating age, grade, and receptor-based subtypes conditional on anatomic stage to improve prediction of breast cancer-specific survival (BCSS) using a Fine-Gray Model. In doing so, we have shown through a different statistical methodology and with a longer follow-up that more individualized prediction of BCSS is possible by considering clinical and biologic factors in addition to anatomic stage in a dataset with long followup that has been validated in an external dataset with demonstration of good calibration in both datasets. Calibration of the BCSS prediction model is evaluated by visual inspection of the 5-year and 10-year calibration plots, plotting the average predicted BCSS against the observed BCSS for each of 5 risk groups at 5 years and 10 years. The 5-year and 10-year BCSS of patients with particular values for the clinical and biological variables can be found using an online tool at the following link: http://mdanderson.org/BCSS. We chose the Fine-Gray model to develop a BCSS prediction model over other potential and different statistical methods used in the existing online breast cancer survival prediction tools [35][36][37][38] . PREDICT, an online breast cancer prognostic and treatment benefit model, used a Cox proportional hazards model, which is frequently used in risk prediction models for breast cancer, to predict BCSS. When an individual in the risk set is exposed to more than one cause of failure (e.g., non-breast cancer death, which cannot be neglected for early stage breast cancers), resulting in competing risks, the Kaplan-Meier method of estimating CIF and the Cox regression model of estimating survival lead to biases 42 . The Fine-Gray model is a widely used method to build prediction models when competing risks exist (as is the case for early stage breast cancer). The Fine-Gray model allows estimation of the effect of the covariates on the CIF, but it has a limitation in its difficulty in interpreting the model coefficients, as compared to the Cox proportional hazards model [42][43][44] . That is, the model covariates can be interpreted as having an effect on the CIF, but they do not directly link to an underlying event rate in the real world 45,46 . This limitation, however, does not pose an issue in our analysis as the CIF is of equal clinical relevance in our study.
The results of our study should be interpreted with several considerations. First, the analysis was performed using data collected at a high volume single cancer center with specific referral and practice patterns leading to a potential selection bias. Our institution tends to treat larger and higher grade tumors with neoadjuvant systemic therapy, and the patient cohort analyzed in this study excluded those patients, possibly biasing the remaining group. One example of our referral bias is that the median age of patients in this database is lower than the national median age. However, since the incorporation of the other biomarkers had a similar effect in all age groups in this analysis, we consider that what we demonstrated in our database is likely generalizable to all age groups. Second, all patients did not have their surgery at our institution so there is the possibility of interobserver variability among pathologists. However, the majority of patients seen at MDACC have their pathology material reviewed by dedicated breast cancer pathologists, so this variability was likely largely reduced. Further, while inter-pathologist variability in determining grade has been amply documented in the literature, grade has always been a strong prognostic marker in all published analyses. Additionally, the majority of patients in the training dataset had only NG available. In a preliminary analysis not shown,   47 . Therefore, a decision was made to use NG interchangeably when HG was unavailable. Third, our results reflect heterogeneity of treatments selected by individual treating physicians, although the overwhelming majority would have received an anthracycline-based regimen, and most also received a taxane. Detailed adjuvant therapy was available in the training dataset but not available in the validation dataset. In the training dataset, we made a decision to exclude HER2+ patients who did not receive adjuvant therapy with trastuzumab-based regimens [HER2+(no T)] patients to provide survival probability estimates that better reflect outcomes of a contemporary population; however it has introduced a bias within this subset with respect to adjuvant treatment as well as time. Adjuvant treatment was not accounted for in the other biomarker subtypes and trastuzumab was not incorporated into standard adjuvant practice for HER2+ BC until 2005. Fourth, the data spanned over a large time period during which significant diagnostic and therapeutic advances have been made, resulting in a temporal cohort effect. Fifth, with respect to violations of proportional hazard, the hazard ratio changes over time making it challenging to represent this situation accurately with a single value (i.e., which is valid when the hazard ratio is constant over time). Thus, the estimate can be viewed as a weighted average of the hazard ratios which are changing over time. Future research will consider more complicated models, using time-varying covariate effects, in pursuit of more accurate prediction. Another consideration is that there is the potential for more complex interaction effects between these additional clinical and biological factors with other variables, such as adjuvant treatment and pathological stage, not accounted for in the analyses presented here. The median followup time in the training dataset is much longer than that of the validation dataset (8 years vs. 2.4 years) and this resulted in the need to recalibrate the predictions due to poor calibration of the final model when applied to validation data set. It is unclear why the final model did not fit the validation data well. However, a possible answer is the dissimilarity in the distributions of covariates and risk profiles between training and validation data sets. Finally, all deaths after breast cancer recurrence were considered as breast cancer-related deaths as the detailed death attributions are not available in our database. Previously published work from MDACC has shown that a staging system that incorporates grade and HR status improves the disease-specific survival estimates as compared to anatomic stage alone 30 . In two recent presentations from MDACC, a novel risk score that includes grade, and ER, and HER2 status was evaluated in a contemporary patient population who received surgical intervention at MDACC and was then validated in a larger cancer registry confirming the importance of biologic factors in determining prognosis for breast cancer patients 48 . The MDACC database has also been supplemented with laboratory, patient lifestyle, and quality of life survey data showing further refinement in prognostic ability 49 . The recently updated AJCC 8th edition has recognized prognostic categories which are largely based on the NCDB analysis which contains over 300,000 women with invasive breast cancer diagnosed in 2010-2011 with a complete set of variables but short median follow-up (37.6 months). The inclusion of grade, and HER2, and HR status using the NCDB model resulted in stage reassignment for 41% of the patients to a stage group that was higher or lower than would otherwise have been assigned using the AJCC 7th edition anatomic stage categories 34 .
Since the analyses confirming that prognosis varied within TNM groupings based on tumor biology and the identification of the new prognostic categories in the updated AJCC 8th edition, there have been several publications and validation papers further recognizing the importance of incorporating biological factors and confirming the effectiveness of the revised prognostic categories [50][51][52][53] . While the breast cancer community has known for several decades the prognostic impact of grade, age, HR and HER2, no publication had shown that singly or in combination, these factors affected the outcomes of patients included in specific TNM stages. This analysis has clearly demonstrated the added prognostic value of patient and tumor characteristics when combined with anatomical stage. External validation confirmed discriminative ability of selected model and with recalibration the predictions were well calibrated to the validation data. In summary, we present the first user-friendly clinical tool developed to estimate BCSS based on an extensive analysis using Fine-Gray Model in a robust single institution database and validated in a nationally recognized-external database. The goal of developing this tool is to provide a resource for clinicians to help guide discussions with patients and provide an estimation of prognosis based on clinical and biological factors.

METHODS
Training data patient population A prospectively maintained electronic database of patients with breast cancer treated at The University of Texas MD Anderson Cancer Center (MDACC) was used to identify over 20,000 patients with stage I-III invasive unilateral breast cancer who received surgery as initial treatment from 1990 to 2016. Clinico-pathologic data was collected from the database, including age; stage; grade, estrogen receptor (ER), progesterone receptor (PR), and HER2 status; adjuvant treatment history; and disease status at the time of death. Pathological stage, tumor grade, HR status, and HER2 status were extracted from the surgical pathology report and determined according to AJCC guidelines 8,27,[54][55][56][57][58][59] . For tumor grade, composite histologic grade was used primarily when available and if not available, then nuclear grade was used (Supplementary Text A). Disease status at the time of death and cause of death were ascertained for each patient. Patients were defined to have a breast cancer (BC)-related death if they died following a recurrence of breast cancer irrespective of the time elapsed between recurrence and death and the cause of death. Study data were collected and analyzed with approval from the Institutional Review Board (IRB) at the University of Texas MDACC. A waiver of consent was obtained due to the retrospective nature of the study. The data were transferred for analysis on April 5, 2016. Among the patients who were alive, 66% had a date of last follow-up within 2 years of April 2016. 58% of alive patients had follow-up longer than 5 years and 28% of alive patients had follow-up longer than 10 years.

Statistical analysis
The primary endpoint was BCSS that was assessed while treating non-BC death as a competing risk. BCSS curves were estimated using the Aalen-Johansen method. Univariate and multivariate Fine-Gray proportional hazards models were fit to assess the statistical significance of the effects of the clinically relevant variables on BCSS. We coded HR/HER2 as a four-level categorical variable in all models. Harrell's C-index was calculated to evaluate the discrimination capacity of each model 56 . The kmi package was used to impute censoring times for competing risk data and rcorr. cens in Hmisc package was used to estimate the C-index and its confidence interval. We checked the proportional hazards (PH) assumption by visually inspecting the smoothed, scaled Schoenfeld residuals and hazard ratios by time intervals, assessed nonlinear covariates effects using spline functions and checked for two-way covariate-covariate interactions by introducing product terms in the models. A p-value of <0.05 indicated statistical significance. Using a selected Fine-Gray model to estimate the BCSS probabilities (as 1−cumulative incidence function (CIF)) including patients with complete data, an online tool to estimate individual prognosis was developed. Prognostic index was defined as the weighted sum of the variables in the Fine-Gray model, where the weights were the regression coefficients. Model calibration was evaluated by comparing observed and predicted BCSS probabilities for five risk groups (defined by partitioning the prognostic index based on its 16th, 39th, 62nd, and 84th percentiles).
To assess the performance of our selected model on the validation data, we compute predictions for each patient in the validation set using the model fit to the training data and compare these predictions to the observed validation outcomes. Statistical analyses were performed using SAS 9.4 (SAS Institute Inc, Cary, NC). A more complete description of the statistical methods is given in the supplementary materials (Supplementary Text C).

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author, upon reasonable request. Please note that the NCCN data can only be made available with permission from NCCN.