The rates and routes of lethal systemic spread in breast cancer are poorly understood owing to a lack of molecularly characterized patient cohorts with long-term, detailed follow-up data. Long-term follow-up is especially important for those with oestrogen-receptor (ER)-positive breast cancers, which can recur up to two decades after initial diagnosis [1–6]. It is therefore essential to identify patients who have a high risk of late relapse [7–9]. Here we present a statistical framework that models distinct disease stages (locoregional recurrence, distant recurrence, breast-cancer-related death and death from other causes) and competing risks of mortality from breast cancer, while yielding individual risk-of-recurrence predictions. We apply this model to 3,240 patients with breast cancer, including 1,980 for whom molecular data are available, and delineate spatiotemporal patterns of relapse across different categories of molecular information (namely immunohistochemical subtypes; PAM50 subtypes, which are based on gene-expression patterns [10,11]; and integrative or IntClust subtypes, which are based on patterns of genomic copy-number alterations and gene expression [12,13]). We identify four late-recurring integrative subtypes, comprising about one quarter (26%) of tumours that are both positive for ER and negative for human epidermal growth factor receptor 2, each with characteristic tumour-driving alterations in genomic copy number and a high risk of recurrence (mean 47–62%) up to 20 years after diagnosis. We also define a subgroup of triple-negative breast cancers in which cancer rarely recurs after five years, and a separate subgroup in which patients remain at risk. Use of the integrative subtypes improves the prediction of late, distant relapse beyond what is possible with clinical covariates (nodal status, tumour size, tumour grade and immunohistochemical subtype).
These findings highlight opportunities for improved patient stratification and biomarker-driven clinical trials.
All code and scripts are available for academic use at https://github.com/cclab-brca/brcarepred.
The genomic copy number, gene-expression and molecular-subtype information has been described previously [12] and is available at the European Genome-Phenome Archive at https://www.ebi.ac.uk/ega/studies/EGAS00000000083. Clinical data are available in Supplementary Tables 5–8. The breast-cancer-recurrence predictor is available as a web application for academic use at https://caldaslab.cruk.cam.ac.uk/brcarepred.
Blows, F. M. et al. Subtyping of breast cancer by immunohistochemistry to investigate a relationship between subtype and short and long term survival: a collaborative analysis of data for 10,159 cases from 12 studies. PLoS Med. 7, e1000279 (2010).
Davies, C. et al. Long-term effects of continuing adjuvant tamoxifen to 10 years versus stopping at 5 years after diagnosis of oestrogen receptor-positive breast cancer: ATLAS, a randomised trial. Lancet 381, 805–816 (2013).
Sestak, I. et al. Factors predicting late recurrence for estrogen receptor-positive breast cancer. J. Natl Cancer Inst. 105, 1504–1511 (2013).
Sgroi, D. C. et al. Prediction of late distant recurrence in patients with oestrogen-receptor-positive breast cancer: a prospective comparison of the breast-cancer index (BCI) assay, 21-gene recurrence score, and IHC4 in the TransATAC study population. Lancet Oncol. 14, 1067–1076 (2013).
Pan, H. et al. 20-year risks of breast-cancer recurrence after stopping endocrine therapy at 5 years. N. Engl. J. Med. 377, 1836–1846 (2017).
Dowsett, M. et al. Integration of clinical variables for the prediction of late distant recurrence in patients with estrogen receptor-positive breast cancer treated with 5 years of endocrine therapy: CTS5. J. Clin. Oncol. 36, 1941–1948 (2018).
Harris, L. N. et al. Use of biomarkers to guide decisions on adjuvant systemic therapy for women with early-stage invasive breast cancer: American Society of Clinical Oncology clinical practice guideline. J. Clin. Oncol. 34, 1134–1150 (2016).
Sledge, G. W. et al. Past, present, and future challenges in breast cancer treatment. J. Clin. Oncol. 32, 1979–1986 (2014).
Richman, J. & Dowsett, M. Beyond 5 years: enduring risk of recurrence in oestrogen receptor-positive breast cancer. Nat. Rev. Clin. Oncol. 1, https://doi.org/10.1038/s41571-018-0145-5 (2018).
Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Ali, H. R. et al. Genome-driven integrated classification of breast cancer validated in over 7,500 samples. Genome Biol. 15, 431 (2014).
Putter, H., van der Hage, J., de Bock, G. H., Elgalta, R. & van de Velde, C. J. H. Estimation and prediction in a multi-state model for breast cancer. Biom. J. 48, 366–380 (2006).
Fisher, B. et al. Significance of ipsilateral breast tumour recurrence after lumpectomy. Lancet 338, 327–331 (1991).
Insa, A. et al. Prognostic factors predicting survival from first recurrence in patients with metastatic breast cancer: analysis of 439 patients. Breast Cancer Res. Treat. 56, 67–78 (1999).
Putter, H., Fiocco, M. & Geskus, R. B. Tutorial in biostatistics: competing risks and multi-state models. Stat. Med. 26, 2389–2430 (2007).
Wishart, G. C. et al. PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer. Breast Cancer Res. 12, R1 (2010); erratum 12, 401 (2010).
Michaelson, J. S. et al. Improved web-based calculators for predicting breast carcinoma outcomes. Breast Cancer Res. Treat. 128, 827–835 (2011).
Ormandy, C. J., Musgrove, E. A., Hui, R., Daly, R. J. & Sutherland, R. L. Cyclin D1, EMS1 and 11q13 amplification in breast cancer. Breast Cancer Res. Treat. 78, 323–335 (2003).
Sanchez-Garcia, F. et al. Integration of genomic data enables selective discovery of breast cancer drivers. Cell 159, 1461–1475 (2014).
Shrestha, Y. et al. PAK1 is a breast cancer oncogene that coordinately activates MAPK and MET signaling. Oncogene 31, 3397–3408 (2012).
Holland, D. G. et al. ZNF703 is a common luminal B breast cancer oncogene that differentially regulates luminal and basal progenitors in human mammary epithelium. EMBO Mol. Med. 3, 167–180 (2011).
Reis-Filho, J. S. et al. FGFR1 emerges as a potential therapeutic target for lobular breast carcinomas. Clin. Cancer Res. 12, 6652–6662 (2006).
Liu, H. et al. Pharmacologic targeting of S6K1 in PTEN-deficient neoplasia. Cell Reports 18, 2088–2095 (2017).
Delmore, J. E. et al. BET bromodomain inhibition as a therapeutic strategy to target c-Myc. Cell 146, 904–917 (2011).
Pearson, A. et al. High-level clonal FGFR amplification and response to FGFR inhibition in a translational clinical trial. Cancer Discov. 6, 838–851 (2016).
Wapnir, I. L. et al. A randomized clinical trial of adjuvant chemotherapy for radically resected locoregional relapse of breast cancer: IBCSG 27-02, BIG 1-02, and NSABP B-37. Clin. Breast Cancer 8, 287–292 (2008).
Clark, G. M., Sledge, G. W. Jr, Osborne, C. K. & McGuire, W. L. Survival from first recurrence: relative importance of prognostic factors in 1,015 breast cancer patients. J. Clin. Oncol. 5, 55–61 (1987).
Kennecke, H. et al. Metastatic behavior of breast cancer subtypes. J. Clin. Oncol. 28, 3271–3277 (2010).
Fix, E. & Neyman, J. A simple stochastic model of recovery, relapse, death and loss of patients. Hum. Biol. 23, 205–241 (1951).
Broët, P. et al. Analyzing prognostic factors in breast cancer using a multistate model. Breast Cancer Res. Treat. 54, 83–89 (1999).
Meier-Hirmer, C. & Schumacher, M. Multi-state model for studying an intermediate event using time-dependent covariates: application to breast cancer. BMC Med. Res. Methodol. 13, 80 (2013).
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, New York, 2000).
de Wreede, L. C., Fiocco, M. & Putter, H. mstate: an R package for the analysis of competing risks and multi-state models. J. Stat. Software 38, 1–30 (2011).
Klein, J. P., Keiding, N. & Copelan, E. A. Plotting summary predictions in multistate survival models: probabilities of relapse and death in remission for bone marrow transplantation patients. Stat. Med. 12, 2315–2332 (1993).
Aalen, O., Borgan, O. & Gjessing, H. Survival and Event History Analysis—A Process Point of View (Springer, New York, 2008).
Fiocco, M., Putter, H. & van Houwelingen, H. C. Reduced-rank proportional hazards regression and simulation-based prediction for multi-state models. Stat. Med. 27, 4340–4358 (2008).
Hothorn, T., Bretz, F. & Westfall, P. Simultaneous inference in general parametric models. Biom. J. 50, 346–363 (2008).
Dunnett, C. W. A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50, 1096–1121 (1955).
Prentice, R. L., Williams, B. J. & Peterson, A. V. On the regression analysis of multivariate failure time data. Biometrika 68, 373–379 (1981).
Harrell, F. E. J. Regression Modeling Strategies (Springer, 2001).
Li, Y. et al. Amplification of LAPTM4B and YWHAZ contributes to chemotherapy resistance and recurrence of breast cancer. Nat. Med. 16, 214–218 (2010).
Clarke, C. et al. Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis 34, 2300–2308 (2013).
Loi, S. et al. Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 9, 239 (2008).
Nagalla, S. et al. Interactions between immunity, proliferation and molecular subtype in breast cancer prognosis. Genome Biol. 14, R34 (2013).
Schmidt, M. et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res. 68, 5405–5413 (2008).
Desmedt, C. et al. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin. Cancer Res. 13, 3207–3214 (2007).
Miller, L. D. et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl Acad. Sci. USA 102, 13550–13555 (2005); correction 102, 17882 (2005).
Gautier, L., Cope, L., Bolstad, B. M. & Irizarry, R. A. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315 (2004).
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
Gendoo, D. M. A. et al. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics 32, 1097–1099 (2016).
Schröder, M. S., Culhane, A. C., Quackenbush, J. & Haibe-Kains, B. survcomp: an R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics 27, 3206–3208 (2011).
R Core Team. R: A Language and Environment for Statistical Computing. http://www.r-project.org/ (2015).
We thank the women who participated in this study and the UK Cancer Registry. O.M.R. was supported by a Cancer Research UK (CRUK) travel grant (SWAH/047) to visit C. Curtis’ laboratory. C.R. is supported by award MTM2015-71217-R. C. Caldas is supported by ECMC, NIHR, the Mark Foundation for Cancer Research and Cancer Research UK Cambridge Centre (C9685/A25177). C. Curtis is supported by the National Institutes of Health through the NIH Director’s Pioneer Award (DP1-CA238296), the American Association for Cancer Research and the Breast Cancer Research Foundation. This study is dedicated to J.M.W. and J.N.W.
Nature thanks Jeff Gerold, Martin A. Nowak, Peter Van Loo and the other anonymous reviewer(s) for their contribution to the peer review of this work.
S.A. is founder and shareholder of Contextual Genomic and a scientific advisor to Sangamo Biosciences and Takeda Pharmaceuticals. C. Caldas is a scientific advisor to AstraZeneca-iMed and has received research funding from AstraZeneca, Servier and Genentech/Roche. C. Curtis is a scientific advisory board member and shareholder of GRAIL and consultant for GRAIL and Genentech. A patent application has been filed on aspects of the described work, entitled ‘Methods of treatment based upon molecular characterization of breast cancer’ (C. Curtis, C. Caldas, J.A.S. and O.M.R.).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, Description of the METABRIC discovery cohort, clinical characteristics and flow chart of sample inclusion for analysis. b, Description of the validation cohort, clinical characteristics and flow chart of sample inclusion for analysis. DRFS, distant-relapse-free survival; DSS, disease-specific survival; OS, overall survival; RFS, relapse-free survival. The cohorts are as follows: GSE19615 (DFHCC cohort [43]), GSE42568 (Dublin cohort [44]), GSE9195 (Guyt2 cohort [45]), GSE45255 (IRB/JNR/NUH cohort [46]), GSE11121 (Maintz cohort [47]), GSE6532 (TAM cohort [45]), GSE7390 (Transbig cohort [48]) and GSE3494 (Upp cohort [49]). NA, not available.
Extended Data Fig. 2 Effect of censoring nonmalignant deaths on the estimation of disease-specific survival, and prognostic value of clinical covariates at different disease states.
a, Cumulative incidence computed as 1 − Kaplan–Meier (KM) estimator, using only disease-specific death as an end point and censoring other types of death. b, Cumulative incidence computed using a competing-risk model that takes into account different causes of death. The bias of the 1 − Kaplan–Meier estimator is visible. c, Distribution of age at the time of diagnosis for ER-negative and ER-positive patients. The number of patients in each group is indicated in all panels. This analysis was done with the full dataset. Box plots were computed using the median of the observations (centre line). The first and third quartiles are shown as boxes, and the whiskers extend to the ±1.58 interquartile range divided by the square root of the sample size. Outliers are shown as dots. d, log hazard ratios calculated using the multistate model stratified by ER status (n = 3,147) for different covariates, namely grade, lymph-node (LN) status, tumour size (size), time from surgery and time from local relapse (LR). log hazard ratios are shown for different states, including post-surgery (PS; hazard ratio of progressing to relapse or DSD), locoregional recurrence (LR; hazard ratio of progressing to distant relapse or DSD) and distant recurrence (DR; hazard ratio of cancer-specific death). 95% confidence intervals are shown. This analysis was done with the full dataset.
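The bias described for panels a and b can be illustrated with a minimal sketch. It is not the authors' code: it assumes the Aalen–Johansen estimator as the competing-risk estimate (the caption only says "a competing-risk model"), and the function names, the cause codes (0 = censored, 1 = disease-specific death, 2 = death from other causes) and the eight follow-up records are hypothetical toy values.

```python
def one_minus_km(times, causes, cause=1):
    """1 - Kaplan-Meier for `cause`, treating all other deaths as censoring."""
    surv, curve = 1.0, {}
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        d_cause = sum(1 for u, c in zip(times, causes) if u == t and c == cause)
        surv *= 1.0 - d_cause / at_risk
        curve[t] = 1.0 - surv
    return curve


def aalen_johansen_cif(times, causes, cause=1):
    """Cumulative incidence of `cause` when other causes of death compete."""
    cif, overall_surv, curve = 0.0, 1.0, {}
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        d_cause = sum(1 for u, c in zip(times, causes) if u == t and c == cause)
        d_any = sum(1 for u, c in zip(times, causes) if u == t and c != 0)
        # Only patients still alive from every cause can die of `cause` at t.
        cif += overall_surv * d_cause / at_risk
        overall_surv *= 1.0 - d_any / at_risk
        curve[t] = cif
    return curve


# Hypothetical follow-up times (years) and causes for eight patients.
times = [1, 2, 3, 4, 5, 6, 7, 8]
causes = [2, 1, 2, 1, 0, 2, 1, 0]

km_curve = one_minus_km(times, causes)
cif_curve = aalen_johansen_cif(times, causes)
for t in times:
    print(t, round(km_curve[t], 3), round(cif_curve[t], 3))
```

In this toy example the 1 − Kaplan–Meier curve is never below the competing-risk curve, because censoring deaths from other causes treats those patients as if they could still die of breast cancer later.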
a, Internal validation of the global predictions of the models on all transitions using bootstrap (n = 200). Discriminant measures of predictive ability are shown on the x axis, as described in the Methods section ‘Model validation and calibration’. The y axis shows the optimism, that is, the difference between the training predictive ability and the test predictive ability of the discriminant measures (see Methods). b, Internal calibration of the global predictions of the models on all transitions using bootstrap (n = 200). The distribution of the mean absolute error between observed and predicted is plotted. c, External calibration of DSD risk and nonmalignant death risk using PREDICT 2.1 (n = 1,841). The distribution of the mean absolute error between the predictions of PREDICT and our model based on ER status only is plotted. a–c, Box plots were computed using the median of the observations (centre line). The first and third quartiles are shown as boxes, and the whiskers extend to the ±1.58 interquartile range divided by the square root of the sample size (see Methods). d, Scatter plot of the predictions of DSD risk computed by PREDICT and our model based on the IntClust subtypes only at ten years (n = 1,841; see Methods). The Pearson correlation is shown. e, Concordance index (C-index) of prediction of risk of distant relapse (DRFS), disease-specific death (disease-specific survival, DSS), death (overall survival, OS) and relapse (RFS) in the 178 withheld METABRIC samples and in a metacohort composed of eight published studies among ER+/HER2− patients in the high-risk IntClust subtypes, where results are shown for individual cohorts and the combined metacohort (see Methods and Supplementary Information). Error bars correspond to 95% confidence intervals for the C-index. The number of patients in each group is indicated on the right.
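The two quantities named in this caption, the concordance index and the optimism, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses Harrell's definition of the C-index (the exact discriminant measures are described in the Methods, which are not reproduced here), and the function names and all numbers are hypothetical.

```python
def harrell_c_index(times, events, risk):
    """Harrell's C: share of comparable pairs in which the patient who
    fails first was assigned the higher predicted risk (ties count 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def optimism(train_values, test_values):
    """Per-replicate optimism: training performance minus test performance."""
    return [a - b for a, b in zip(train_values, test_values)]


# Hypothetical follow-up times, event indicators (1 = event) and risk scores.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.3, 0.1]
c_example = harrell_c_index(times, events, risk)

# Hypothetical C-index values from two bootstrap replicates.
opt_example = optimism([0.80, 0.82], [0.78, 0.79])
print(c_example, opt_example)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; optimism values close to zero indicate that performance on the training data is not much better than on data not used for fitting.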
a, Average probability of experiencing a distant relapse (defined as the probability of having a distant relapse at any point followed by any other transition) or cancer-related death for the high-risk ER+ IntClust (IC) subtypes (IC1 n = 134, IC6 n = 81, IC9 n = 134, IC2 n = 69) relative to IC3 (n = 269), the ER+ subgroup with the best prognosis. This analysis was restricted to ER+/HER2− cases, which represent the vast majority for each of these subtypes. Error bars represent 95% confidence intervals around the mean. b, As for a, but showing the average probability of experiencing distant recurrence or cancer-related death after a local recurrence (IC1 n = 21, IC6 n = 10, IC9 n = 21, IC2 n = 13, IC3 n = 30). c, Average probability of recurrence (distant relapse or cancer-specific death) after locoregional relapse for all patients in each of the 11 IntClust subtypes. d, Median time until an additional relapse (distant recurrence or cancer-specific death) after local recurrence for all patients in each of the 11 IntClust subtypes (n = 270). This has been computed using a Kaplan–Meier approach with competing risks of progression and nonmalignant death. Error bars represent 95% confidence intervals around the median time. Asterisks denote situations in which the median time cannot be computed because fewer than 50% of the patients relapsed. This analysis was done with the molecular dataset. e, Average probability of cancer-related death after distant recurrence for all patients by subtype. f, As for d, except that the median time until cancer-specific death after distant recurrence is shown (n = 596). g, Mean probabilities of relapse after surgery and after five and ten disease-free years (see Methods and Supplementary Table 4) for the patients in each of the four IHC subtypes. Error bars represent 95% confidence intervals. The number of patients in each group is indicated. h–k, As for c–f, but for the IHC subtypes (same sample sizes). 
l, As for g, but for the PAM50 subtypes. The number of patients in each group is indicated. m–p, As for h–k, but for the PAM50 subtypes (with the same sample sizes, except for p where n = 593).
The probabilities of distant relapse or cancer-related death among ER−/HER2− patients who were disease-free at five years after diagnosis reveal marked differences in the risk of relapse for TNBC IntClust subtype IC4ER− versus the IC10 (basal-like enriched) subtype. Here the base clinical model with IHC subtypes is compared with the base clinical model plus IntClust subtype information. Error bars represent 95% confidence intervals. The number of patients in each group is indicated.
Transition probabilities from locoregional recurrence to other states for individual average patients, stratified on the basis of ER, IHC, PAM50 or IntClust subtype. 95% confidence bands were computed using bootstrap. This analysis was done with the full dataset for the comparisons between ER+ and ER−, and the molecular dataset for the remainder.
Extended Data Fig. 7 Associations between probabilities of distant relapse ten years after locoregional relapse with clinico-pathological and molecular features of the primary tumour.
For each patient that had a locoregional recurrence, the ten-year probability of having a distant relapse or cancer-related death is plotted against different variables. A loess fit is overlaid to highlight the relationship between the probability and tumour size or time of relapse. Box plots were computed using the median of the observations (centre line). The first and third quartiles are shown as boxes, and the whiskers extend to the ±1.58 interquartile range divided by the square root of the sample size. Outliers are shown as dots. This analysis was done with the molecular dataset and the model was stratified by IntClust subtype (n = 257).
Transition probabilities from distant relapse to other states for individual average patients stratified on the basis of ER, IHC, PAM50 or IntClust subtype. 95% confidence bands were computed using bootstrap. This analysis was done with the full dataset for the comparisons between ER+ and ER−, and the molecular dataset for the remainder.
a, Times of distant recurrence for ER− and ER+ patients (n = 605). Each dot represents a distant recurrence, coded by colour for different sites. b, Distribution of the number of distant relapses for different subtypes (n = 609), based on ER status (ER+ n = 422, ER− n = 187), IHC ER/HER2 status (ER+/HER2− n = 263, ER−/HER2− n = 82, ER+/HER2+ n = 36, ER−/HER2+ n = 41), PAM50 subtype (normal n = 33, luminal A n = 101, luminal B n = 138, basal n = 79, HER2 n = 69) and IntClust subtype (IC1 n = 40, IC2 n = 25, IC3 n = 32, IC4ER+ n = 46, IC4ER− n = 16, IC5 n = 72, IC6 n = 23, IC7 n = 24, IC8 n = 54, IC9 n = 38, IC10 n = 52). ER status was imputed on the basis of expression in four samples. These analyses were done with the recurrent-events cohort.
a, Left, percentages of patients with metastases at a given site in the IHC subtypes (bar plots, total numbers also indicated). Upright triangles indicate significant positive differences in that group with respect to the overall mean and inverted triangles indicate significant negative differences in that group with respect to the overall mean using simultaneous testing of all sites (see Methods). Location of metastatic sites is not anatomically accurate. Right, cumulative incidence functions (as 1 − Kaplan–Meier estimates) for each site of metastasis in the IHC subtypes. The same patient can have multiple sites of metastasis. b, As for a, but for the PAM50 subtypes. c, As for a, but for the IntClust subtypes. These analyses were done with the recurrent-events cohort. Female silhouettes are from the public-domain human body diagrams at https://commons.wikimedia.org/wiki/Human_body_diagrams.
Summary of clinico-pathological features of the cohort according to ER status (based on the full dataset) and for the IHC, PAM50 and IntClust subtypes (based on the molecular dataset).
Number of transitions between each state in the multistate model according to ER status (based on the full dataset) and for the IHC, PAM50 and IntClust subtypes (based on the molecular dataset).
Proportion of cases classified into each IntClust subtype mapping onto the IHC and PAM50 subtypes within the molecular dataset.
Transition probabilities and standard errors for each of the breast cancer subgroups. a, Predictions for each subgroup were computed taking the average and the standard deviation of the probabilities of all patients in each group. Standard deviations represent variability within each subtype. The probabilities of any transition ending up in a relapse group and all transitions visiting that state of the multistate model are included for patients stratified by ER status (based on the full dataset) and for the IHC, PAM50, and IntClust subtypes (based on the molecular dataset). b, Predictions for an average individual from each subgroup. These probabilities are computed by selecting an average individual and predicting the trajectory between each state of the multistate model in the corresponding dataset for the distinct subtypes. The probabilities for staying in relapse are omitted for clarity and can be computed as one minus the sum of the probabilities of moving to the other states. Standard errors represent uncertainty in the individual predictions.
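The arithmetic behind this table can be sketched in a few lines. The example below is illustrative only: the function names and all probabilities are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, because the caption does not say which form was used.

```python
from statistics import mean, stdev


def staying_probability(transition_probs):
    """Probability of remaining in the current state: one minus the sum of
    the probabilities of moving to each of the other states."""
    return 1.0 - sum(transition_probs.values())


def subgroup_summary(patient_probs):
    """Mean and sample standard deviation of per-patient probabilities
    within one subgroup (panel a of the table)."""
    return mean(patient_probs), stdev(patient_probs)


# Hypothetical transition probabilities out of the relapse state.
moves = {"distant relapse": 0.30, "cancer death": 0.10, "other death": 0.05}
stay = staying_probability(moves)

# Hypothetical per-patient relapse probabilities in one subgroup.
group_mean, group_sd = subgroup_summary([0.2, 0.4, 0.6])
print(round(stay, 2), round(group_mean, 2), round(group_sd, 2))
```

The standard deviation in panel a describes spread between patients within a subtype, whereas the standard errors in panel b describe uncertainty in the prediction for a single average individual, so the two are not interchangeable.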
Clinical information for the full dataset.
Clinical information for the molecular dataset.
Clinical information for the recurrent-events dataset.
Description of clinical variables provided in Supplementary Tables 5–7 for the full, molecular and recurrent-events datasets.
About this article
Cite this article
Rueda, O.M., Sammut, S., Seoane, J.A. et al. Dynamics of breast-cancer relapse reveal late-recurring ER-positive genomic subgroups. Nature 567, 399–404 (2019). https://doi.org/10.1038/s41586-019-1007-8