Introduction

The PREDICT breast cancer prognostication and treatment benefit prediction model (v1) was developed in 2010 using data from the UK East Anglia Cancer Registration and Information Centre (ECRIC) for model fitting and data from the West Midlands Cancer Intelligence Unit for model validation1,2,3. The model fitting data set comprised data on 5232 cases diagnosed from 1999 to 2003. PREDICT v1 was implemented as a web-based tool for clinicians in January 2011 (www.breast.predict.nhs.uk), and since then the use of the tool has increased steadily around the world. The model was refitted in 2017 using the original cohort of cases from East Anglia with updated survival time in order to take into account age at diagnosis and to smooth out the hazard ratio functions for tumour size and node status (v2)4. PREDICT has been independently validated in cohorts from Canada5, Malaysia6, the Netherlands7,8,9, and the UK10,11 and has generally been shown to have good discrimination and calibration.

The data on which PREDICT breast v1 and v2 were based were breast cancer cases diagnosed in the Eastern Region of England over 20 years ago. Since then, the prognosis of early breast cancer has improved substantially12 and it is likely that the current model is not well calibrated for contemporary patients13. Moreover, the number of cases with ER-negative disease in the cohort was comparatively small (<1000) and it is possible that the estimates of the prognostic effects of the variables in the ER-negative disease model were sub-optimal. Furthermore, radiotherapy and chemotherapy have been shown to be associated with an increase in mortality from causes other than breast cancer14,15 and this was not taken into account in previous versions of PREDICT Breast.

We have therefore refitted the PREDICT breast model using a national data set of patients diagnosed from 2000 to 2017, with the aims of refining the hazard ratio estimates for the variables in the current model and estimating the effect of the year of diagnosis on prognosis, so that the model can be recalibrated for contemporary patients. In addition, we included the beneficial effect of radiotherapy on breast cancer mortality and the harmful effect of both chemotherapy and radiotherapy on other causes of mortality. Model development, validation and reporting were carried out according to the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) criteria16.

Results

Table 1 shows the patient characteristics by the cancer registry. The model fitting was carried out using Eastern Cancer Registry data for 4644 women with an ER-negative tumour and 34,265 women with an ER-positive tumour.

Table 1 Patient characteristics for the Eastern Cancer Registry, the West Midlands Cancer Registry and the other cancer registries (mean (sd), unless stated otherwise).

On fitting the multivariable fractional polynomial model to the ER-positive cases, the hazard ratio function for tumour size was found to be 2.39 × size^0.5 − 0.439 × size. Under this function, the hazard ratio would increase to a maximum for a tumour of 7.4 cm and then decrease for larger tumours (Fig. 1 dashed line). It seems unlikely that the true effect would get smaller with increasing tumour size, so we refitted the model using 1 − exp(−size/2) so that the hazard ratio increases up to 7.5 cm and then flattens off (Fig. 1 solid line). The breast cancer-specific mortality hazard ratio (HR) functions for age at diagnosis, tumour size and number of positive nodes for the ER-negative and ER-positive cases are shown in Fig. 2 and the associated logarithmic hazard ratios in Table 2.
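The turning point of the fitted tumour size function can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes size is measured in cm and the function names are hypothetical. Whether the fitted expression is read as the hazard ratio or as its logarithm, the location of the maximum is the same.

```python
import math


def fitted_size_term(size_cm: float) -> float:
    """Fitted fractional polynomial term: 2.39 * size^0.5 - 0.439 * size."""
    return 2.39 * math.sqrt(size_cm) - 0.439 * size_cm


def turning_point_cm() -> float:
    """Size at which the derivative 2.39 / (2 * sqrt(size)) - 0.439 equals zero."""
    return (2.39 / (2.0 * 0.439)) ** 2


def monotone_size_transform(size_cm: float) -> float:
    """Replacement transformation 1 - exp(-size / 2).

    Only the shape is shown; the coefficient that multiplies this
    transformation in the final model is not reproduced here.
    """
    return 1.0 - math.exp(-size_cm / 2.0)


if __name__ == "__main__":
    print(f"turning point: {turning_point_cm():.1f} cm")
    for size in (1.0, 2.0, 5.0, 7.5, 10.0, 15.0):
        print(size, round(fitted_size_term(size), 3), round(monotone_size_transform(size), 3))
```

The transformation 1 − exp(−size/2) is strictly increasing but approaches its limit of 1 quickly; at 7.5 cm it has reached roughly 98% of that limit, which is why the solid curve appears flat beyond that size.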

Fig. 1: Polynomial hazard ratio functions for tumour size in ER-positive disease.

Dashed line—best fit from the multivariable fractional polynomial model. Solid line—monotonic function selected for inclusion in the final model.

Fig. 2: Breast cancer-specific mortality hazard ratio functions.

a Age, b tumour size, and c the number of positive nodes. ER-negative is indicated by red lines and ER-positive is indicated by blue lines. Note that the hazard ratios for ER-negative and ER-positive disease should not be directly compared as an indicator of prognosis in ER-negative disease compared to ER-positive disease because the risk is a function of both the hazard ratio and the ER-status specific baseline hazards.

Table 2 Fractional polynomial functions and associated logarithmic hazard ratios for age at diagnosis, tumour size, number of positive nodes, tumour grade and mode of detection by oestrogen receptor (ER) status.

The derived polynomial baseline hazard functions for breast cancer-specific mortality in the ER-negative cases, ER-positive cases, and non-breast cancer mortality are given by the following equations:

$$\text{ER-negative: baseline hazard}=\exp\left(-3.015-0.576\times\left(\frac{t}{10}\right)^{-1}-0.103\times\left(\frac{t}{10}\right)^{-1}\times\log\left(\frac{t}{10}\right)\right)\tag{1}$$

$$\text{ER-positive: baseline hazard}=\exp\left(-2.319-3.623\times\left(\frac{t}{10}\right)^{-0.5}-0.542\times\left(\frac{t}{10}\right)^{-0.5}\times\log\left(\frac{t}{10}\right)\right)\tag{2}$$

$$\text{Non-breast mortality: baseline hazard}=\exp\left(-4.846+1.341\times\log\left(\frac{t}{10}\right)+0.495\times\left(\frac{t}{10}\right)\right)\tag{3}$$

These functions provided a very good fit to the observed baseline hazard (Supplementary Fig. 1).
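For readers who want to evaluate the three baseline hazard functions numerically, a minimal sketch follows. It assumes that t is the time since diagnosis in years (t > 0) and that log denotes the natural logarithm; neither is stated explicitly alongside the equations. The function names are hypothetical and this is not the code used for the published tool.

```python
import math


def baseline_hazard_er_negative(t: float) -> float:
    """Breast cancer-specific baseline hazard for ER-negative disease (Eq. 1)."""
    x = t / 10.0
    return math.exp(-3.015 - 0.576 * x ** -1 - 0.103 * x ** -1 * math.log(x))


def baseline_hazard_er_positive(t: float) -> float:
    """Breast cancer-specific baseline hazard for ER-positive disease (Eq. 2)."""
    x = t / 10.0
    return math.exp(-2.319 - 3.623 * x ** -0.5 - 0.542 * x ** -0.5 * math.log(x))


def baseline_hazard_other(t: float) -> float:
    """Baseline hazard for non-breast cancer mortality (Eq. 3)."""
    x = t / 10.0
    return math.exp(-4.846 + 1.341 * math.log(x) + 0.495 * x)


if __name__ == "__main__":
    for years in (1, 5, 10, 15):
        print(
            years,
            baseline_hazard_er_negative(years),
            baseline_hazard_er_positive(years),
            baseline_hazard_other(years),
        )
```

At t = 10 the logarithmic terms vanish, so each function reduces to the exponential of the sum of its remaining coefficients, which gives a convenient sanity check. These are baseline values only; they apply to the reference individual and must be combined with the covariate effects in Table 2 to obtain a patient-specific hazard.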

Model calibration

Table 3 shows the cumulative number of breast cancer deaths predicted at 5, 10, and 15 years by the new version of the model (v3.0) and the current version of the model (v2.2) by cancer registry and ER status. As expected, for breast cancer-specific mortality, v3.0 is well calibrated in the model development data. It also performs well in the two validation data sets; in all strata of the data, the predicted number of deaths was within 10% of that observed. In contrast, v2.2 consistently over-predicted the number of deaths, as might be expected given the general improvement in prognosis since the period covered by the data on which v2.2 was based. Calibration of v3.0 for non-breast cancer mortality (Table 4) was also excellent in the model development data, but the model under-predicted deaths by about 10% in the validation data sets. Again, v2.2 substantially over-predicted other mortality in all the data sets.
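The comparison of observed and predicted deaths can be illustrated with a short sketch. It assumes that the predicted number of deaths in a stratum is the sum of the individual predicted probabilities of death, which is one common way of computing it and not necessarily the exact procedure used here. The function names and the 10% default tolerance are illustrative.

```python
from typing import Sequence


def expected_deaths(predicted_risks: Sequence[float]) -> float:
    """Sum individual predicted probabilities to give the expected number of deaths."""
    return float(sum(predicted_risks))


def relative_difference(observed: int, predicted: float) -> float:
    """Signed difference of predicted from observed deaths, as a fraction of observed."""
    return (predicted - observed) / observed


def within_tolerance(observed: int, predicted: float, tolerance: float = 0.10) -> bool:
    """True when the predicted count lies within the given fraction of the observed count."""
    return abs(relative_difference(observed, predicted)) <= tolerance


if __name__ == "__main__":
    # Made-up risks for five hypothetical patients, for illustration only.
    risks = [0.05, 0.12, 0.30, 0.08, 0.45]
    print(expected_deaths(risks))
    print(within_tolerance(observed=100, predicted=108.0))
```

A positive relative difference indicates over-prediction and a negative one under-prediction.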

Table 3 Cumulative observed versus predicted breast cancer deaths estimated by the updated version of PREDICT Breast (v3.0) and the current version (v2.2) by cancer registry and ER status at up to 5, 10, and 15 years follow up.
Table 4 Cumulative observed versus predicted deaths from other causes estimated by the updated version of PREDICT Breast (v3.0) and the current version (v2.2) by cancer registry at up to 5, 10 and 15 years follow up.

The observed and predicted breast cancer deaths in the West Midlands cancer registry by quintile of predicted risk for the updated version of PREDICT Breast are shown in Fig. 3; calibration is excellent at all levels of risk.

Fig. 3: Observed and predicted breast cancer deaths at 15 years in West Midlands data set by quintile of predicted risk.

a All cases, b ER-negative cases, and c ER-positive cases.

Model discrimination

Model discrimination (area under the receiver operating characteristic curve) was good in all strata of the data (Table 5). In general, the model for ER-positive disease performed better than that for ER-negative disease, and the performance of the model in the model development data from the Eastern Cancer Registry was slightly better than the performance in the two validation data sets. PREDICT v3.0 consistently performed slightly better than v2.2.

Table 5 The discrimination for up to 5-year, 10-year and 15-year breast cancer-specific mortality by cancer registry and ER status.

Model reclassification

The Cambridge Breast Unit classifies women with breast cancer into three groups based on the predicted benefit of adjuvant chemotherapy at 10 years, as given by the absolute reduction in risk of breast cancer-specific mortality: low-risk women are those with a predicted 10-year benefit of 0–3%, who would usually be advised not to have adjuvant chemotherapy, and high-risk women are those with a predicted benefit of over 5%, who would usually be advised to have adjuvant chemotherapy17. The advice to intermediate-risk women (3–5%) would depend more on other factors, including patient preferences. While the benefit of therapy depends on patient age and adjuvant chemotherapy regimen, it is possible to classify women into similar categories based on the predicted breast cancer mortality at 10 years: low risk being 0–15%, medium risk being 15–20% and high risk being >20% risk of breast cancer death at 10 years. Based on these risk categories, it is possible to evaluate reclassification comparing PREDICT v3.0 with v2.2. Of 32,408 breast cancer cases in the West Midlands data set, 4203 (13%) women would be classified in different risk groups by PREDICT v2.2 and v3.0 (Table 6).
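The mortality-based categories and the resulting cross-tabulation can be sketched as follows. The treatment of a risk of exactly 15% (assigned here to the medium group) is an assumption, since the text gives overlapping boundaries; the function names are hypothetical and the example risks are made up.

```python
from collections import Counter
from typing import Sequence, Tuple


def risk_group(ten_year_bc_mortality: float) -> str:
    """Map a predicted 10-year breast cancer mortality (0 to 1) to a risk group."""
    if ten_year_bc_mortality < 0.15:
        return "low"
    if ten_year_bc_mortality <= 0.20:  # exactly 15% is placed here by assumption
        return "medium"
    return "high"


def reclassification_table(
    risks_old: Sequence[float], risks_new: Sequence[float]
) -> Counter:
    """Count patients in each (old group, new group) cell."""
    return Counter(
        (risk_group(old), risk_group(new)) for old, new in zip(risks_old, risks_new)
    )


def proportion_reclassified(table: Counter) -> float:
    """Fraction of patients whose group differs between the two model versions."""
    total = sum(table.values())
    moved = sum(n for (old, new), n in table.items() if old != new)
    return moved / total


if __name__ == "__main__":
    # Made-up predictions from an older and a newer model for four patients.
    old = [0.10, 0.18, 0.25, 0.16]
    new = [0.08, 0.12, 0.22, 0.17]
    table = reclassification_table(old, new)
    print(dict(table), proportion_reclassified(table))
```

Applied to the figures quoted above, 4203 of 32,408 cases changing group corresponds to a proportion of about 0.13.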

Table 6 Re-classification of 32,408 West Midlands Cancer Registry breast cancer cases by PREDICT v3.0 into low-, medium- and high-risk compared to PREDICT v2.2 classification.

Discussion

We have used data from the National Cancer Registration and Analysis Service for England for breast cancer cases diagnosed from 2000 to 2017 to develop and validate a new PREDICT Breast prognostic model (v3.0). We used a similar analytic approach to that used to develop PREDICT Breast v2.0, using multivariable fractional polynomials within a Cox regression framework to create different models for breast cancer-specific mortality for ER-positive disease and ER-negative disease and for non-breast cancer mortality. The major difference between v2.2 and v3.0 is that v3.0 includes a term for year of diagnosis, as the data show a clear trend towards improved survival over time.

It has previously been observed that the log hazard ratio function for age at diagnosis in ER-positive breast cancer is U-shaped, with breast cancer in young women and older women being associated with a poorer prognosis. A similar relationship in ER-negative disease has not been previously described, and age at diagnosis in v2.2 was modelled as a linear term. However, in this much larger data set, we also observed a U-shaped function for age at diagnosis in ER-negative disease. We also observed an unexpected hazard ratio function for tumour size in ER-positive cases with an inverted U-shape. There may be a biological reason for this: it is conceivable that for tumours to become very large they would need to be growing for a long time without metastasizing, and so may be inherently less aggressive. However, despite our very large data set, the number of ER-positive cases with tumours above 7.5 cm was only 414, with 80 deaths from breast cancer, and the precision of the hazard ratio estimates in larger tumours will be low. We therefore chose to constrain the polynomial function such that the hazard ratio flattened off but did not get smaller with increasing tumour size.

Overall, the model performed well in terms of discrimination and calibration in both the model development data and the model validation data. We assumed that all patients receiving chemotherapy received a standard-dose anthracycline-based regimen and all patients receiving hormonal therapy received 5 years of treatment. However, some patients will have received taxane-based or high-dose anthracycline-based chemotherapy, and some patients will have had 10 years of hormonal therapy. Similarly, we assumed the same benefit for all patients treated with radiotherapy, whereas whole breast irradiation alone after breast-conserving surgery is likely to have different effects than post-mastectomy radiotherapy to the chest wall and regional nodes. Furthermore, mortality from causes other than breast cancer was modelled as a function of age and therapy, and we assumed similar relative harm for chemotherapy and radiotherapy, although this may vary by the type of chemotherapy and radiotherapy received. Non-breast cancer mortality is also affected by co-morbidities and lifestyle factors such as smoking. Such factors could not be included in the models because information on co-morbidities is not available in the NCRAS data. These misclassifications would not be expected to affect model calibration but would be likely to reduce discrimination.

The improvement in prognosis over time is reflected in the reclassification of breast cancer cases within the three categories of risk used by the Cambridge Breast Unit to guide the use of adjuvant chemotherapy. In the West Midlands data set, 10,053 cases would be classified as medium or high risk by PREDICT Breast v2.2 and would be considered candidates for adjuvant chemotherapy. Of these, 3821 (38%) would be reclassified as low risk by PREDICT Breast v3.0 and spared the harms of chemotherapy. The reason for the improvement in prognosis over time is not clear: the effect of the year of diagnosis is seen after adjusting for the known major prognostic factors and after adjusting for treatment. General improvements in the organization and standardization of cancer services, with better targeting of systemic therapies and improvements in the delivery of radiotherapy, are likely to play a role. Some improvement will be due to the increased use of therapies such as trastuzumab and bisphosphonates and improved management of disease relapse with second-line therapies.

Tumour gene expression profile tests (also known as genomic risk scores) are being increasingly used to guide treatment decisions in breast cancer18. The results of genomic risk scores are not available in the cancer registration data set used for these analyses and it was not possible to assess any added value of such scores to PREDICT v3.0. However, it has been shown19 that genomic risk scores do not significantly improve the discrimination of PREDICT v2.2. Further research is warranted to evaluate the performance of genomic risk scores, by themselves and in combination with other biomarkers such as KI67, in breast cancer patients shown to be at intermediate risk by PREDICT v3.0.

Another limitation of this model is that it does not include either local or distant recurrence as an endpoint, as these data are not available in the NCRAS data set. While mortality as an endpoint is important in decision-making, recurrence may also be an important end point for some patients. In the future, the integration of electronic health records with cancer registration data may enable the accurate encoding of recurrence and the inclusion of other endpoints in the model.

In an era of precision oncology, accurate, well-validated models that predict patient outcomes are invaluable clinical tools. We have derived an improved version of the PREDICT prognostication and treatment benefit model to address some of the limitations of the current version. In particular, we have updated the model to reflect outcomes in contemporary patients and added the benefits of radiotherapy as well as the harms of both chemotherapy and radiotherapy. The new model has been validated in two independent population-based data sets from the United Kingdom and performs well. It will be implemented in the online tool available at www.breast.predict.nhs.uk and will continue to aid decision-making in clinical practice.

Methods

Patient data

The study was approved by the Public Health England Office for Data Release. Public Health England provided anonymized data from the National Cancer Registration and Analysis Service (PHE NCRAS) for all women diagnosed in England with non-metastatic invasive breast cancer from 2000 to 2017 inclusive. Ethical approval by the National Research Ethics Service was not required because all analyses were carried out on an anonymized data set provided by Public Health England. Information obtained from PHE NCRAS included age at diagnosis, year of diagnosis, tumour size, histological grade, tumour stage at diagnosis, number of lymph nodes sampled, number of lymph nodes positive, ER status, HER2 status, mode of detection (clinically detected vs. screen-detected), and whether the patient had undergone chemotherapy, hormone therapy and/or radiotherapy for two time periods, the first being within 6 months following their diagnosis. Data on other biomarkers such as tumour KI67 expression status were not available in the NCRAS data set. Patients younger than 25 or older than 85 at diagnosis, patients with a tumour larger than 20 cm, or with more than 20 positive lymph nodes were excluded from the analysis. Of 372,110 cases, complete data were available for 163,224 (44%). Initial analyses showed that the Eastern Cancer Registry and the West Midlands Cancer Registry had fewer missing data (62% and 71% complete cases) compared to the other registries (35% complete cases), particularly in the years 2000–2009 (Supplementary Table 1). The variable with the most missing data was ER status (42% missing); 31% of cases were missing the number of positive nodes, 16% were missing tumour size, 3% were missing tumour grade and 6% were missing mode of detection.
The complete case data set for the Eastern Cancer Registry (n = 35,474; 4644 ER-negative and 30,830 ER-positive) was used for the development of the new version of PREDICT Breast, the West Midlands Cancer Registry data set (n = 31,801; 4668 ER-negative; 27,133 ER-positive) was used as the primary validation data, and the data set for the other cancer registries (n = 95,949; 12,814 ER-negative; 83,135 ER-positive) was used as an additional validation data set.

Details of the specific regimen used for radiotherapy, chemotherapy, duration of hormonal therapy, or use of trastuzumab or bisphosphonates were not available. We assumed that all patients who underwent chemotherapy were treated with an anthracycline-based regimen and that all women who received hormonal therapy did so for 5 years. The benefits of radiotherapy were applied to all patients who received it, including those who had lumpectomy and those who had mastectomy as the primary surgical treatment. Death certificate flagging through the Office for National Statistics provides the registries with notification of deaths. The lag times for these are a few weeks for cancer deaths and 2 months to 1 year for non-cancer deaths. Vital status was ascertained at the end of December 2019, and so all analyses were censored on 31 December 2018 to allow for a delay in reporting vital status. Breast cancer-specific mortality was defined as deaths where breast cancer was listed as the cause of death on parts 1a, 1b or 1c of the death certificate.

Statistical methods

Multivariable Cox proportional hazards models were used to estimate the prognostic effect of each variable. In all models, follow-up time was defined as the time from breast cancer diagnosis to the last follow-up, death or 15 years after diagnosis, whichever came first. The outcome of interest was either breast cancer-specific mortality or mortality from other causes.
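The follow-up definition can be made concrete with a small sketch. The administrative censoring date of 31 December 2018 and the 15-year cap come from the text; the conversion of 365.25 days per year, the status labels and the function name are assumptions made for illustration.

```python
from datetime import date
from typing import Optional, Tuple

CENSOR_DATE = date(2018, 12, 31)   # administrative censoring date
MAX_FOLLOW_UP_YEARS = 15.0         # follow-up is truncated at 15 years
DAYS_PER_YEAR = 365.25             # assumed conversion factor


def follow_up(
    diagnosis: date,
    death: Optional[date],
    died_of_breast_cancer: bool = False,
) -> Tuple[float, str]:
    """Return (years of follow-up, status).

    Status is 'breast', 'other' or 'censored'. Deaths after the censoring
    date or more than 15 years after diagnosis are treated as censored.
    """
    end = CENSOR_DATE
    status = "censored"
    if death is not None and death <= CENSOR_DATE:
        end = death
        status = "breast" if died_of_breast_cancer else "other"
    years = (end - diagnosis).days / DAYS_PER_YEAR
    if years > MAX_FOLLOW_UP_YEARS:
        return MAX_FOLLOW_UP_YEARS, "censored"
    return years, status


if __name__ == "__main__":
    print(follow_up(date(2010, 1, 1), date(2012, 1, 1), died_of_breast_cancer=True))
    print(follow_up(date(2001, 1, 1), None))
```

In the breast cancer-specific models a status of 'other' would be treated as censored, and in the other-cause model a status of 'breast' would be treated as censored.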

Separate models were derived for breast cancer-specific mortality in ER-negative and ER-positive cases. Multivariable fractional polynomials were used to model non-linear effects between the continuous risk factors (age at diagnosis, tumour size and number of positive nodes) and breast cancer-specific mortality, as adding higher order polynomials to the model will improve the fit to the data in the presence of non-linearity. Sequential backward elimination with a maximum of 4 degrees of freedom for a single continuous predictor was used to estimate the continuous variable transformations. In addition to the variables already present in the current version of PREDICT, the year of breast cancer diagnosis and the effect of radiotherapy were also incorporated into the analyses. Age at diagnosis was transformed to age at diagnosis minus 24 and year of diagnosis was transformed to year minus 2000 in order that the baseline hazard would be more realistic. The baseline hazard is the hazard that corresponds to a hypothetical individual with all variables taking a value of zero. Transforming age at diagnosis and year of diagnosis in this way means that the baseline hazard corresponds to a woman diagnosed at age 24 in the year 2000 rather than a woman diagnosed at age 0 in the year 0. The relative treatment benefits for chemotherapy, hormone therapy, and radiotherapy were constrained to the estimates of benefit from the randomized controlled trial meta-analyses of the Early Breast Cancer Trialists' Collaborative Group (adjuvant hormone therapy log hazard ratio −0.386 (ref. 20), adjuvant chemotherapy log hazard ratio −0.248 (ref. 21), radiotherapy log hazard ratio −0.180 (ref. 22)) by adding them as an offset in the analyses.
After fitting the Cox proportional hazards models to ER-negative and ER-positive cases, a multivariable fractional polynomial model with a Gaussian distribution was fitted to the baseline hazards according to the method of Sauerbrei and colleagues23 to derive smoothed baseline hazard functions for breast cancer-specific mortality.
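The covariate centring and the fixed treatment offsets described above can be sketched as follows. The three log hazard ratios are the constrained values quoted in the text; the function names are hypothetical, and the sketch applies each offset purely on the basis of the treatment flags passed in, without encoding any eligibility rules.

```python
import math
from typing import Tuple

# Constrained log hazard ratios for breast cancer-specific mortality.
LOG_HR_HORMONE = -0.386
LOG_HR_CHEMO = -0.248
LOG_HR_RADIO = -0.180


def centre_covariates(age_at_diagnosis: float, year_of_diagnosis: int) -> Tuple[float, int]:
    """Shift age and year so that zero corresponds to age 24 diagnosed in 2000."""
    return age_at_diagnosis - 24, year_of_diagnosis - 2000


def treatment_offset(hormone: bool, chemo: bool, radio: bool) -> float:
    """Sum of the fixed log hazard ratios for the treatments received."""
    offset = 0.0
    if hormone:
        offset += LOG_HR_HORMONE
    if chemo:
        offset += LOG_HR_CHEMO
    if radio:
        offset += LOG_HR_RADIO
    return offset


if __name__ == "__main__":
    print(centre_covariates(55, 2012))
    offset = treatment_offset(hormone=True, chemo=True, radio=False)
    print(offset, math.exp(offset))
```

Because the offsets enter the linear predictor with a coefficient fixed at 1, they are not re-estimated from the registry data. On the hazard ratio scale the three values correspond to roughly 0.68, 0.78 and 0.84, respectively.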

A single multivariable Cox regression model for mortality from other causes (non-breast cancer-specific) was built for ER-negative and ER-positive cases combined, with year of diagnosis and age at diagnosis modelled using multivariable fractional polynomials. The relative harms of chemotherapy and radiotherapy were constrained to the estimates reported by Kerr and colleagues (adjuvant chemotherapy log hazard ratio 0.183)14 and Taylor and colleagues (radiotherapy log hazard ratio 0.078 per Gray whole-heart dose)15 by adding them as an offset in the analyses. We assumed that all patients receiving radiotherapy received a whole-heart dose of 2 Gy, as the radiotherapy dose was not available in our data. The smoothed baseline hazard function for non-breast cancer-specific mortality was also computed using a multivariable fractional polynomial model.
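The effect of the assumed 2 Gy whole-heart dose on the radiotherapy offset can be shown with a short calculation. The coefficients are those quoted above; the function name and the option to pass a different dose are illustrative additions.

```python
import math

LOG_HR_CHEMO_OTHER = 0.183     # chemotherapy, other-cause mortality
LOG_HR_RADIO_PER_GY = 0.078    # radiotherapy, per Gy of whole-heart dose
ASSUMED_HEART_DOSE_GY = 2.0    # dose assumed for every irradiated patient


def other_cause_offset(
    chemo: bool, radio: bool, heart_dose_gy: float = ASSUMED_HEART_DOSE_GY
) -> float:
    """Fixed log hazard ratio offset for mortality from other causes."""
    offset = 0.0
    if chemo:
        offset += LOG_HR_CHEMO_OTHER
    if radio:
        offset += LOG_HR_RADIO_PER_GY * heart_dose_gy
    return offset


if __name__ == "__main__":
    for chemo, radio in ((True, False), (False, True), (True, True)):
        offset = other_cause_offset(chemo, radio)
        print(chemo, radio, round(offset, 3), round(math.exp(offset), 3))
```

With the 2 Gy assumption the radiotherapy offset is 2 × 0.078 = 0.156 on the log scale, so a patient receiving both treatments has a combined offset of 0.339.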

Model validation

The models derived from the Eastern Cancer Registry were used to predict the probabilities of death from breast cancer or death from other causes in the cases in both validation data sets. Because the web version of PREDICT Breast v2.2 allows for missing data on the mode of detection, we also included 9848 cases for whom only the mode of detection was missing. Model calibration was assessed by comparing the observed number of deaths with those predicted by v3.0 and v2.2 up to 5 years, 10 years and 15 years after diagnosis. Calibration plots were used to visualize calibration at different levels of risk. Model discrimination was evaluated by calculating the area under the receiver operating characteristic curve (AUC) for up to 5-year, 10-year and 15-year breast cancer mortality. The AUC is the probability that the predicted mortality for a randomly selected patient who died will be higher than the predicted mortality for a randomly selected survivor.
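The probabilistic definition of the AUC given above translates directly into a pairwise comparison. The sketch below is a plain illustration of that definition and is not the routine used in the analyses; counting tied predictions as one half is a common convention and is an assumption here.

```python
from typing import Sequence


def auc(predicted_risk: Sequence[float], died: Sequence[bool]) -> float:
    """Probability that a randomly chosen death has a higher predicted risk
    than a randomly chosen survivor, with ties counted as one half."""
    deaths = [p for p, d in zip(predicted_risk, died) if d]
    survivors = [p for p, d in zip(predicted_risk, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for risk_death in deaths:
        for risk_survivor in survivors:
            if risk_death > risk_survivor:
                concordant += 1.0
            elif risk_death == risk_survivor:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


if __name__ == "__main__":
    # Made-up predictions for four hypothetical patients.
    print(auc([0.8, 0.3, 0.6, 0.2], [True, True, False, False]))
```

The double loop is quadratic in the number of patients, so for data sets of the size analysed here a rank-based calculation would be preferable in practice; the pairwise form is shown only because it mirrors the definition.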

The study has been reported in accordance with the TRIPOD guidelines for reporting a multivariable prediction model for individual prognosis16. All analyses were carried out using the mfp24, patchwork25, pROC26, survival27, tableone28 and tidyverse29 packages for the R software30 implemented in RStudio31.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.