Introduction

Breast cancer (BC) is the second most common cancer worldwide and the main cause of cancer deaths in women.1, 2 Advances in adjuvant therapy have led to a lower proportion of recurrence or metastasis in BC patients. However, adjuvant therapy can have serious negative side effects, including heart toxicity, infertility, cognitive impairment, and secondary cancers, which may increase the probability of death due to non-cancer causes.3, 4, 5, 6, 7 Indeed, a 5-year follow-up study of BC patients reported a larger number of non-cancer deaths, many attributable to the side effects of adjuvant therapy, than deaths attributed to BC itself.8 Therefore, deaths could be prevented and suffering reduced if we were able to predict, at the time of diagnosis, BC outcomes such as the likelihood of recurrence, the probability of developing distant metastasis, and the expected years of life after the diagnosis of BC.

The prognostic factors commonly used to assess the outcome of the disease and to guide BC treatment include axillary lymph-node involvement, tumor size, patient age and ethnicity, lymphatic/vascular invasion, histological type and grade of the tumor, and estrogen/progesterone and Her2/neu receptor status.9 More recently, there has been an increased use of omic data to assess BC patients. For instance, Perou et al10 showed that gene expression (GE) data could be used to identify risk groups that are both confirmatory of immunohistochemistry BC subtypes (eg, luminals) and predictive of prognosis.11 Other omics such as copy number variants (CNVs), methylation, and miRNA have also been considered for the assessment of prognosis.12, 13, 14, 15, 16

While clustering algorithms applied to GE data have succeeded in identifying groups with different prognosis, the proportion of interindividual differences in survival explained by these groups remains limited. A higher predictive power could be achieved using whole-omic profiles (WOPs),14, 16 integrating clinical and omic data in a unified risk assessment method.16, 17 The integration of high-dimensional inputs, such as WOPs, presents important statistical and computational challenges. Recent advances in the fields of regularized and Bayesian regressions allow high-dimensional inputs to be integrated for prediction of disease risk. These methods have been successfully applied for the prediction of complex traits and disease using DNA information, the so-called whole-genome-regression (WGR) approach, in humans,18, 19, 20 plants,21 and animals.22 Methods similar to those used in WGR could be used to integrate GE and other omics for prediction of BC outcomes. For instance, VanRaden23 suggested the use of Omic Kriging, a method equivalent to the so-called genomic BLUP,24 to integrate omic data for prediction of complex traits. Additionally, Vazquez and co-workers25 described a Bayesian framework that allows multiple omics platforms to be integrated for prediction of cancer outcomes.

Although advances in adjuvant therapy have led to a significant improvement in the survival of BC patients, it is also clear that individuals are quite diverse in their responses to treatment.26, 27, 28, 29 For example, luminal A patients usually exhibit a poor response to chemotherapy (CT), while luminal B patients are considered viable candidates for the use of both anthracyclines and taxanes.30 Similarly, differential responses to treatment have been observed in other types of cancer, such as colon cancer, where the risk of recurrence strongly depends on the individual expression profiles and on whether surgery is applied alone or in combination with adjuvant CT.31 From a statistical point of view, these differential responses to treatments can be modeled as interactions between omics (such as whole-genome GE) and treatments.

None of the studies that have integrated whole omics for prediction of disease outcomes (eg, Vazquez et al16 and VanRaden23) considered treatments or interactions between omics and treatments. Therefore, in this study, we extended the framework described in Vazquez et al16 to accommodate interactions between omics and treatments, while also adapting this framework to a survival model. We used the resulting Bayesian model and data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) to integrate clinical covariates (COVs) (including age at the moment of diagnosis, cancer subtype (CS), histological class, Nottingham Prognostic Index (NPI), and treatment), whole-genome GE profiles, CNVs, and interactions between these omics and treatments (CT, hormonal treatment (HT), and radiotherapy (RT)). Using these data, we evaluated the contribution of COVs, omics, and interactions to interindividual differences in years of life after a diagnosis of BC, using both variance components and a measure of prediction accuracy in cross-validation.

Materials and methods

Data

The METABRIC data set12 comprised information from 1977 white Caucasian women who were diagnosed with BC. Survival data consisted of patient state (dead or alive) and time to either death or last follow-up. Feature data consisted of COVs, along with GE and DNA CNV data. METABRIC CNV data (the gene-by-patient matrix of log2 values from Synapse) are a measure of somatic copy number alteration in the tumor. They identify tumor CNV in reference to normal tissue,12 meaning that any CNV present in both the tumor and the normal tissue was not considered a tumor CNV. In addition to the original editing criteria, four observations corresponding to genomic data outliers for at least one omic were removed.

Our response variable was the time from diagnosis to death due to cancer. Non-cancer deaths, as well as loss to follow-up (cases in which the patients were alive at the last contact time), were treated as censored observations. There were a total of 622 deaths due to cancer, 50% occurring at ~17 years. The same analysis was performed for overall survival (Supplementary Material). The total number of deaths was 887 cases, with 50% of survival at ~12 years. For both sets, the 50% survival occurred at ~7 years (Supplementary Figure S5). The average times to censoring were 9.2 (SD 4.8) and 9.5 (SD 4.9) years for each set, respectively. The resulting Bayesian model and data from METABRIC were used to integrate COVs (including age at the moment of diagnosis (AGE), CS, type of carcinoma (TC), the NPI,32 and treatment), whole-genome GE profiles, copy number variants (CNVs), and interactions between these omics and treatments. The NPI is a well-validated prognostic score that combines information on tumor size, grade, and nodal involvement, specifically NPI = (0.417 × size) + (0.76 × lymph-node stage) + (0.82 × tumor grade), typically ranging between 2 and 7;33 the higher the score, the shorter the lifespan prognosis for the patient. Histological type was defined as TC and was subdivided into two levels: one including in situ medullary, invasive medullary, mucinous or tubular ductal tumors; and the other including non-ductal carcinomas, such as lobular, phyllodes, and ‘grab bag’ classified tumors. CS11 was binary coded according to whether the patient had a triple-negative BC (Her2−, ER−, and PR−) or not (Her2− with ER+ or PR+, and Her2+ subtypes). Treatment indicated whether or not the patient received CT, RT, and HT.

Normalization, quality control, and summarization for the GE and CNV intensity (at a gene level in the CNV) data are described elsewhere.34 Briefly, DNA genotyping and GE profiling were performed on the Affymetrix SNP 6.0 (Affymetrix Inc., Santa Clara, CA, USA) and Illumina HT 12v.3 (Illumina Inc., San Diego, CA, USA) platforms, respectively. In the case of GE, the data were not summarized because the majority of the probes were designed to interrogate distinct mRNA transcripts. The edited omic data included the log2 of the intensity of the CNV for 18 538 regions and 49 473 bead-level GE array probes.

Statistical models

We modeled the logarithm of survival time in a Bayesian setting, accounting for COVs, omics main effects, and their interactions. Inference was obtained from the posterior distribution of the unknowns given the data and the hyperparameters. The likelihood and prior distribution assumed to obtain the posterior distribution are described below.

The model for log time to death can be represented as t*_i = η_i + ε_i, where t*_i is the logarithm of the time to either the last follow-up or the time to death for the ith subject (i = 1, …, n), η_i is a linear function of the COVs, omics, and their interactions, and ε_i is the residual error that follows a normal distribution centered on 0, with residual variance σ²_ε. Alive subjects at the last follow-up time were considered right-censored (ie, observations where the beginning of the treatment was observed but not the occurrence of the event); therefore, the joint conditional distribution of the log-transformed survival time (t* = [t*_1, …, t*_n]^T) is then given by

where η = [η_1, …, η_n]^T and c_i indicates whether the observation is censored (c_i = 1 if the subject was alive at the last observation; c_i = 0 if the subject was dead at the time of the last follow-up observation, that is, the observation is not censored).
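The display equation for this likelihood is not reproduced in the text. Under the assumptions stated there (normal residuals with variance σ²_ε and right-censoring flagged by c_i), a standard way to write a censored-normal likelihood is sketched below. Here φ and Φ denote the standard normal density and distribution function; this notation is introduced for illustration, and the expression is a reconstruction from the surrounding definitions rather than the authors' typeset formula.

```latex
p\left(t^{*} \mid \eta, \sigma^{2}_{\varepsilon}\right)
  = \prod_{i=1}^{n}
    \left[ \frac{1}{\sigma_{\varepsilon}}\,
           \phi\!\left( \frac{t^{*}_{i} - \eta_{i}}{\sigma_{\varepsilon}} \right)
    \right]^{1 - c_{i}}
    \left[ 1 - \Phi\!\left( \frac{t^{*}_{i} - \eta_{i}}{\sigma_{\varepsilon}} \right)
    \right]^{c_{i}}
```

An uncensored subject (c_i = 0) contributes the normal density at its observed log time, whereas a censored subject (c_i = 1) contributes the probability that its log survival time exceeds the last observed value.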

To include prior information about the linear predictor, we assumed a multivariate normal distribution for the vector η, of the form N(Xβ, Σ), where X was the model matrix with the standardized COVs as columns and individuals as rows, β was the vector of COV effects (assumed to have a flat prior distribution), and Σ was the variance–covariance matrix of the genomic effects, including the main effects of both omics and the interactions between GE and a given treatment. For ease of notation, hereafter we use only CT to denote the interactions between GE and a given treatment: the models including the terms corresponding to HT or RT are equivalent, replacing GE|CT with GE|HT or GE|RT. The linear predictor of the full model can thus be written as:

η = Xβ + u_CNV + u_GE + u_GE|yCT + u_GE|nCT

where u_CNV was the vector of genomic main effects due to the CNV, u_GE was the vector of genomic effects due to GE, and u_GE|yCT and u_GE|nCT were the interaction effects between GE and both levels of CT, either when CT was applied (yCT) or not (nCT). Σ was obtained as described elsewhere,24 and corresponds to a weighted sum of kernels. For instance, the kernels for GE, GE|yCT, and GE|nCT are computed, respectively, as WW^T/p, D(WW^T/p)D, and (I − D)(WW^T/p)(I − D), where W is the matrix of standardized GE features, p is the number of features, D is a diagonal matrix with ones for those patients with CT and zeros otherwise, and I is the identity matrix.
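As an illustration of how such kernels could be built, the following minimal sketch assumes the GE kernel is WW^T/p and that the treatment-specific kernels are obtained by zeroing rows and columns with the 0/1 diagonal matrix D (and its complement). These forms are inferred from the definitions of W, p, and D, not taken from the authors' code. The helper names (`standardize_columns`, `linear_kernel`, `mask_kernel`) and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Center each column to mean 0 and scale it to (population) SD 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def linear_kernel(W):
    """Assumed GE kernel: K = W W^T / p for an n x p standardized matrix W."""
    p = len(W[0])
    return [[sum(a * b for a, b in zip(wi, wj)) / p for wj in W] for wi in W]


def mask_kernel(K, flags):
    """D K D for a 0/1 diagonal D: keep K[i][j] only if i and j are both flagged."""
    n = len(K)
    return [[K[i][j] * flags[i] * flags[j] for j in range(n)] for i in range(n)]


# Toy data: 4 patients x 3 expression features (made-up values)
ge = [[1.0, 2.0, 0.5],
      [2.0, 1.0, 1.5],
      [3.0, 4.0, 2.5],
      [4.0, 3.0, 3.5]]
ct = [1, 0, 1, 0]  # 1 = received CT, 0 = did not (hypothetical)

W = standardize_columns(ge)
K_ge = linear_kernel(W)                            # main-effect kernel for GE
K_ge_yct = mask_kernel(K_ge, ct)                   # GE | yCT
K_ge_nct = mask_kernel(K_ge, [1 - d for d in ct])  # GE | nCT
```

With this construction, two patients share covariance through the interaction term only if they belong to the same treatment group, which is what allows the GE effects to differ between treated and untreated patients.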

To obtain the relative contribution of each omic main effect and of the GExCT interactions, together with the model's prediction ability, we used the Gibbs sampler implemented in the BGLR R package.35 In brief, BGLR can handle an arbitrary number of linear-predictor terms, each with different distributional assumptions, and can fit numerical, categorical, or truncated response variables.

A sequence of nested models was fitted to evaluate predictive accuracy and to compare the impact of including COVs, omics, and the interaction between omics and treatments. The baseline model (COV) included only COVs and their effects. This model was further extended to include either copy number variation (COV+CNV), GE (COV+GE), or both (COV+GE+CNV). In addition, COV+GE was extended to include the interaction between GE and each treatment: CT (COV+GE+GExCT), RT (COV+GE+GExRT), and HT (COV+GE+GExHT).

Model prediction accuracy

We evaluated the models in terms of their ability to predict survival. To do so, we performed a 10-fold cross-validation (CV), repeating the random assignment of subjects to folds 10 times. With the paired information of survival probability and patient’s status at different time points (from 1 to 7 years), we calculated the area under the receiver-operating characteristic curve (AUC) in each CV. In addition, we performed a survival analysis, using the Kaplan–Meier estimator with the time of follow-up as response and patient status as ‘event’, to compare the models in their ability to discriminate between low- and high-risk groups. These groups were defined based on the average predicted values across CVs: subjects with predicted survival below the first quartile were assigned to the high-risk group, whereas those with values above the third quartile were assigned to the low-risk group.
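The repeated fold assignment described here can be sketched as follows. This is only an illustration of repeated 10-fold CV in general: the function names, the seed, and the toy sample size are hypothetical, and the authors' actual randomization scheme is not described in the text.

```python
import random


def repeated_kfold(n_subjects, n_folds=10, n_repeats=10, seed=0):
    """Return one list of fold labels (0..n_folds-1) per repeat."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_repeats):
        # Balanced labels first, then shuffle: fold sizes differ by at most one
        labels = [i % n_folds for i in range(n_subjects)]
        rng.shuffle(labels)
        assignments.append(labels)
    return assignments


def split_indices(labels, fold):
    """Indices of training and testing subjects when `fold` is held out."""
    test = [i for i, f in enumerate(labels) if f == fold]
    train = [i for i, f in enumerate(labels) if f != fold]
    return train, test


# Toy example: 25 subjects, 10 folds, 10 random repeats
folds = repeated_kfold(n_subjects=25, n_folds=10, n_repeats=10, seed=42)
train_idx, test_idx = split_indices(folds[0], fold=0)
```

In each repeat, every subject is held out exactly once, so each subject receives one out-of-sample prediction per repeat; these predictions can then be averaged across repeats.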

Results

The outcome analyzed was survival (years of life) after diagnosis of BC, with individuals who were alive at the last follow-up treated as censored (results for the analysis based on years of life for all-cause deaths are given in the Supplementary Data). Years of life after diagnosis of BC was regressed on COVs (including age at the moment of diagnosis, NPI, hormone receptor status, histological type, and treatment), on WOP (including GE and CNV), and on interactions between WOP and treatments (RT, HT, and CT, all defined as Yes/No). COVs were treated as fixed effects, whereas WOP and the interactions between treatment and GE were treated as random (see Supplementary Table S1 for the fixed-effects estimates for the full model COV+CNV+GE). Models were fitted using the BGLR R-package.35 Further details about the models used are given in the Materials and methods section.

Proportion of variance explained by COVs, omics, and omic-by-treatment interactions

Figure 1 shows the (estimated) proportion of variance explained by inputs for each of the models fitted. The baseline model used only COV and explained ~19% of the interindividual differences in survival time. When CNV data were added to the model, the total proportion of variance explained increased by a small margin (about 6%). However, the addition of GE led to a substantial increase in the proportion of variance explained by the model, from 19% (COV model) to 65.3% (COV+GE model). Combining CNV and GE (COV+GE+CNV model) did not lead to a substantial increase over the proportion of variance explained already reached by the model COV+GE. Similarly, adding interactions between GE and either CT or RT did not substantially increase the overall proportion of variance explained by the model relative to COV+GE. However, adding interactions between GE and HT led to a substantial increase in the total proportion of variance explained by the model. In the model including COV+GE+GExHT, the proportion of variance explained by the interaction term was substantial (25%). The results obtained using survival defined based on all-cause mortality (see Supplementary Figures S1 and S2) were similar to those reported in Figure 1.

Figure 1

The proportion of interindividual differences (variance scale) in survival explained by each of the input sets considered by the model.

Assessment of prediction accuracy in CV

We evaluated the ability of each model to predict future outcomes using 10 replicates of a 10-fold CV. In each replicate, individuals were randomly assigned to 10 folds, and we evaluated the ability of each model to predict survival time (further details are given in the Materials and methods section). Prediction accuracy was measured using the CV AUC36 computed for dummy variables that indicate whether an individual lives longer than x years. Figure 2 shows the average CV AUC for models accounting for covariates, CNV, GE and the interaction between GE and HT. Supplementary Table S2 shows the AUC of models considering GE by CT and GE by HT interactions. Prediction accuracy improved from year 1 after diagnosis to the fourth year and declined in subsequent years in all models. Median survival time occurred at 7.4 years. Our results suggest that reasonably high prediction accuracy (AUC of ~0.8 in a testing subset of the data) can be achieved for prediction of whether a BC patient will live longer than 4 years after diagnosis.
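To make the AUC on a dichotomized outcome concrete, the sketch below computes it for the dummy variable 'lives longer than x years'. Two points are assumptions rather than details given in the text: subjects censored before x have unknown status and are excluded, and the score is any prediction for which larger values mean longer survival (for example, predicted log survival time). The function names and toy values are hypothetical.

```python
def dichotomize(times, events, x):
    """1 if known to live past x, 0 if death observed by x, None if censored before x."""
    labels = []
    for t, e in zip(times, events):
        if t > x:
            labels.append(1)     # still under observation after x years
        elif e == 1:
            labels.append(0)     # death observed at or before x years
        else:
            labels.append(None)  # censored before x: status at x unknown
    return labels


def auc(scores, labels):
    """Probability that a random survivor is scored above a random non-survivor."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one subject in each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): follow-up in years, event = 1 if death observed
times = [1.5, 3.0, 6.0, 8.0, 2.0, 9.5]
events = [1, 1, 0, 1, 0, 0]
predicted = [0.2, 1.4, 1.6, 2.0, 1.0, 1.2]  # higher = longer predicted survival

labels_4y = dichotomize(times, events, x=4)
auc_4y = auc(predicted, labels_4y)  # 5 of 6 survivor/non-survivor pairs ordered correctly
```

Repeating this for x = 1, …, 7 within each testing fold and averaging over folds and repeats gives a curve of AUC against the time threshold.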

Figure 2

Prediction ability by model and time point in terms of AUC across CVs: the lines represent the average AUC across 10 repetitions of 10-fold CVs (the vertical segments represent standard error across CV). The number of dead and alive subjects at any time point is represented by the stacked bars. This figure includes the most relevant models: COV model, COV plus CNV (COV+CNV), COV plus GE (COV+GE), and covariates plus GE and interaction between GE and HT (COV+GE+GExHT).

The model with only COV had AUC values between 0.70 and 0.78, depending on how many years after treatment were being predicted. Combining COV and CNV yielded gains in CV AUC of the order of two points relative to the use of COV only for prediction of long-term survival. However, adding CNV to the model did not result in a substantial change in CV AUC for prediction of early mortality (eg, whether a patient lived longer than 1 or 2 years after diagnosis of BC). Combining GE with COV gave substantial gains (≥3.5 points) in CV AUC relative to the COV model. These gains in CV AUC were observed for prediction of early, intermediate, and late mortality alike. The results for overall survival (see Supplementary Figure S3) were similar to those presented in Figure 2, which are based on deaths due to BC.

Using CV predictions of years of life, we classified individuals into high- and low-risk groups (corresponding to the individuals ranking in the lowest and highest quartiles for predicted years of life) and subsequently computed Kaplan–Meier survival curves for each of these groups. Figure 3 displays these curves for the groups defined based on predicted years of life using COV and COV+GE. Both methods produce highly accurate classifications. For instance, at year 4, >95% of the individuals classified as being in the low-risk groups were still alive; on the other hand, <60% of the individuals classified as being at high risk were alive 4 years after diagnosis. The model using COV+GE had greater discriminatory power than COV only; indeed, the survival curves for the low (high)-risk groups identified with this model always ran above (below) the ones corresponding to the classification based on COV.
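The quartile-based grouping and the Kaplan–Meier curves can be illustrated with the short sketch below. It is a generic implementation under stated assumptions: quartiles are taken with Python's default `statistics.quantiles` method, which may differ from the quantile definition used by the authors, and the function names and toy values are hypothetical.

```python
from statistics import quantiles


def risk_groups(predicted):
    """Label subjects below Q1 as high risk and above Q3 as low risk."""
    q1, _, q3 = quantiles(predicted, n=4)
    return ["high" if p < q1 else "low" if p > q3 else "mid" for p in predicted]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather all subjects whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # deaths and censored subjects both leave the risk set
    return curve


# Toy example (made-up values)
predicted_years = [1, 2, 3, 4, 5, 6, 7, 8]
groups = risk_groups(predicted_years)

km = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
# km == [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Applying `kaplan_meier` separately to the subjects labeled "high" and "low" gives one curve per risk group, which is the kind of comparison shown in Figure 3.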

Figure 3

Average Kaplan–Meier estimates by risk group for COV and COV+GE models across CVs: the curves show the average across CV and separating individuals as high or low risk. COV, model with COVs; COV+GE, model with COVs plus whole-genome GE.

None of the models that included interactions produced a clear improvement in prediction accuracy relative to the model using COV+GE (Supplementary Table S2).

The results shown in Figures 2 and 3 are based on prediction accuracy assessed using all patients. Using those same predictions, we evaluated prediction accuracy for groups defined by the treatments received. Figure 4 shows the prediction accuracy obtained by the COV and COV+GE models using sets of patients who did or did not receive HT or CT (results for RT are shown in Supplementary Figure S4). Again, prediction accuracy is expressed as AUC by thresholds of years of life after a diagnosis of BC for each subgroup of women. This analysis revealed that prediction accuracy increases for patients receiving treatment, and this increase does not seem to depend on the specific treatment. Additionally, women receiving CT or HT showed better predictions for longer periods (up to the seventh year for BC-specific death). Lower predictive accuracy was obtained for overall survival. Nevertheless, the same pattern of AUC variation across time was obtained (Supplementary Figure S3).

Figure 4

Prediction ability obtained with COV and COV+GE by sets of patients with and without treatment: the treatments are CT and HT. Prediction accuracy, obtained as the average AUC for each treatment, is shown for patients who received treatment in the top panels and for those who did not in the bottom panels. The models compared contained COVs and COVs plus GE (COV+GE).

Discussion

In this study, we first determined the importance of genomic effects of CNVs and GE on survival. Additionally, we determined the prediction accuracy achieved when covariates, omics, and interactions between omics and treatments are accounted for. Accordingly, survival models were implemented for both BC-specific and overall deaths. In a preliminary analysis, we studied the survival rates for each COV separately, preselecting the covariates associated with survival. Using the significant covariates in the integrative model (ie, COV+CNV+GE), younger (under 50 years) and older (above 70 years) women had the worst prognosis. In older patients, factors such as chronic diseases and less frequent use of CT can be associated with poor prognosis.37 Poor prognosis for younger patients, on the other hand, is attributed to more aggressive tumors.38 Additionally, the NPI showed an inverse relationship with prognosis: prognosis worsens as NPI values increase.39, 40

CNVs can modify GE by changing gene dosages or by disrupting regulatory sites.41 In BC, CNV has been reported to affect genes associated with survival and tumor development (eg, PIK3CA, EGFR, FOXA1, and HER2).42 Adding CNV to COVs explained an additional 6.5% of the survival variance. Although this proportion of variance explained is smaller than that of GE, we note that adding CNV to a model based on COVs increased prediction accuracy. However, these gains were moderate compared with those obtained with GE. Accordingly, Curtis et al12 also found a less relevant effect on survival of germline copy numbers (the CNV used here) than of somatic copy numbers (CNA, the copy number originated by the tumoral process). Our results are also consistent with those from Vazquez et al,42 although they had a considerably smaller sample size and only explored survival at the third year. Including GE in the covariates model increased prediction accuracy even further and explained an even larger portion of the survival variance than the model COV+CNV. A possible reason for the moderate variance explained by CNV is that CNV was summarized at the gene level, leaving all non-coding regions unrepresented in the summary (see Supplementary Materials from Curtis et al12). Possibly, an underestimated effect of CNV could be related to missing CNVs in non-coding regions that affect transcription distally.43

GE is an important disease risk indicator, which can be used to classify individuals into cancer subtypes.11, 12, 44, 45, 46, 47 Subgroups can also be derived by combining several platforms to define consensus groups by meta-analysis.48 We confirmed not only the primary role of GE in explaining cancer subtype but also that GE explains a larger portion of the total survival variability than cancer subtype clusters. Models including GE explained the largest amount of survival variability and increased the prediction accuracy by several AUC points. Interestingly, the model containing both CNV and GE explained a lower proportion of variance for overall survival, further suggesting a more relevant role of both GE and CNV in the cancer process than in unrelated deaths. The overall deaths included cancer-related deaths and non-cancer-related ones. Although other causes of death are also related to more aggressive cancer (because patients are exposed to more aggressive treatments), our results indicate that prediction is more accurate when only cancer-related deaths are considered (ie, omics are related to cancer deaths rather than to overall survival). However, it is likely that deaths not due to BC could actually be induced by or related to the cancer treatment. Other studies have indicated a higher rate of cancer-unrelated deaths in BC patients than the expected mortality rate.8

BC patients can have a heterogeneous response to a given treatment, due to the evolving subclonal architecture49 and stroma microenvironment conditions50 of the tumor. For instance, abnormal vasculature creates poorly oxygenated zones and can limit the supply of nutrients and drugs, which affects the success of both RT and CT.50 This variability was echoed in this work by the variance of BC survival explained by the GE × treatment interactions (ie, variance magnitude dependent on whether a treatment was given or not), with magnitudes that differed by treatment, perhaps because of how the treatments were administered in this data set. Most likely, all patients in our data received RT, while the application of CT was more restricted: almost all ER-negative patients (triple negative, Her2+, and some luminal patients) received CT, while ER-positive patients (most of the luminal patients) did not.12 On the other hand, HT is administered to patients with positive hormone receptor status, markedly reducing the possibility of observing a sizable GExHT interaction.

The lack of improvement in AUC when interactions are included in the model may reflect poor sensitivity of AUC. To evaluate this, we also assessed prediction accuracy as the correlation between CV predictions of time to death and observed survival among patients with known time to death. This analysis showed a benefit of adding GE (the prediction correlation for COV was 0.22 and increased to 0.31 when GE was added to the model). However, the model including COV+GE+TRTxGE did not yield a higher prediction correlation than the one using COV+GE. This was also true when all the interactions were included in the model (Supplementary Table S3).
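The prediction correlation used in this check can be illustrated with a short sketch. It assumes a plain Pearson correlation restricted to subjects whose death was observed; the text does not state which correlation coefficient or time scale was used, and the function names and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prediction_correlation(predicted, observed, events):
    """Correlate predictions with observed times for uncensored subjects only."""
    keep = [i for i, e in enumerate(events) if e == 1]
    return pearson([predicted[i] for i in keep], [observed[i] for i in keep])


# Toy example (made-up values): the last subject is censored and therefore ignored
predicted = [0.5, 1.0, 1.5, 3.0]
observed = [1.0, 2.0, 3.0, 0.1]
events = [1, 1, 1, 0]
r = prediction_correlation(predicted, observed, events)
```

Unlike AUC, this measure uses the continuous time to death, so it can respond to changes in the ranking of patients within the groups that a single time threshold treats as identical.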

To gain insight into which genes contributed the most to survival variance, we also performed an ad hoc analysis using a spike-and-slab model to declare genes as up- or downregulated (ie, associated with either increasing or decreasing days of life, respectively) (Supplementary Figure S6). We found three genes with a probability of inclusion in the model >0.5: two upregulated in the sample of all patients (FGD3 and DNAJB9), and one downregulated in the subset of patients with hormone therapy (SERPINE3). The FGD3 product is involved in signaling pathways regulating apoptosis.51, 52 DNAJB9, on the other hand, belongs to a group of genes related to the GIPC family, which has an essential role in carcinogenesis and development.53 Finally, SERPINE3 is a member of the serpin family, a very diverse group of proteins involved in many different biological processes, such as inflammation, immune function and tumorigenesis.54 Its product belongs to clade E of human serpins (nexin/plasminogen activator inhibitor 1), although its function is not well understood.54 Additionally, our method allows a biological interpretation of the results: the amount of interindividual differences in survival that can be explained by (1) well-known and widely used COVs (such as the state of the cancer, the cancer subtype, or a clinical treatment), (2) all the gene products (GE) present in the tumors, (3) CNV from the tumor, and (4) any possible interaction between treatments and GE.

This article focuses on the comparison of models based on COVs commonly used in clinical practice with others that incorporate WOPs as well as interactions of omics with treatments. Perou and co-workers55 demonstrated that clusters derived from GE profiles are confirmatory of the BC subtypes. Our COV model already incorporates the BC subtypes and therefore fully incorporates clustering. For these reasons, the COV model is a high-quality benchmark for the model comparison. Nevertheless, the statistical learning literature offers a vast array of methods for incorporating high-dimensional inputs, including shrinkage and variable selection methods,56 support vector machines,57 and random forests.58 We considered the use of Bayesian regressions with Gaussian priors, which induce shrinkage of estimates. We also considered Bayesian models that combine variable selection and shrinkage simultaneously and did not find noticeable differences in either the proportion of variance explained or prediction accuracy. These results are in agreement with previous studies that have reported limited differences between various types of regularized regressions.59 In Breiman’s words: ‘when it comes to prediction there are usually many equally good models’.58 However, our study is clearly not exhaustive, and Bayesian models are not guaranteed to be universally superior methods. Further research comparing these approaches with others, such as support vector machines or random forests, is warranted.