Survival analysis in breast cancer using proteomic data from four independent datasets

Breast cancer clinical treatment selection is based on the immunohistochemical determination of four protein biomarkers: ESR1, PGR, HER2, and MKI67. Our aim was to correlate immunohistochemical results with proteome-level technologies in measuring the expression of these markers. We also aimed to integrate available proteome-level breast cancer datasets to identify and validate new prognostic biomarker candidates. We searched for studies involving breast cancer patient cohorts with published survival and proteomic information. Immunohistochemistry and proteomic technologies were compared using the Mann–Whitney test. Receiver operating characteristic (ROC) curves were generated to validate discriminative power. Cox regression and Kaplan–Meier survival analyses were performed to assess prognostic power. The False Discovery Rate was computed to correct for multiple hypothesis testing. We established a database integrating protein expression data and survival information from four independent cohorts for 1229 breast cancer patients. In all four studies combined, a total of 7342 unique proteins were identified, and 1417 of these were identified in at least three datasets. ESR1, PGR, and HER2 protein expression levels determined by RPPA or LC–MS/MS methods showed a significant correlation with the levels determined by immunohistochemistry (p < 0.0001). PGR and ESR1 levels showed a moderate correlation (correlation coefficient = 0.17, p = 0.0399). An additional panel of candidate proteins, including apoptosis-related proteins (BCL2), adhesion markers (CDH1, CLDN3, CLDN7) and basal markers (cytokeratins), were validated as prognostic biomarkers. Finally, we expanded our previously established web tool designed to validate survival-associated biomarkers by including the proteomic datasets analyzed in this study (https://kmplot.com/). In summary, large proteomic studies now provide sufficient data enabling the validation and ranking of potential protein biomarkers.


Material and methods
Construction of the integrated protein database. We searched for publications and datasets containing proteome and survival data for breast cancer patients in PubMed, The Cancer Proteome Atlas (TCPA) 12 and the ProteomeXchange Consortium 21 portals. The search terms "human", "breast", and "cancer" were used to identify eligible datasets. Only studies with available protein expression data generated by either mass spectrometry or RPPA, clinical survival information, and at least 50 cancer patients with at least 30 events (either death or relapse) met our inclusion criteria. Four protein datasets met these conditions 12,22–24. Due to the use of different platforms and analysis methods, it was not possible to merge the datasets into a single unified dataset. Therefore, each dataset was processed separately. In the analyses, the author-reported normalized expression data were used. Figure 1 summarizes the pipeline of data filtering and Supplemental Table 1 summarizes the methods used in the original studies.
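The inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `Study` record, its field names, and the example studies are hypothetical and are not taken from the original screening.

```python
from dataclasses import dataclass

@dataclass
class Study:
    # Hypothetical record layout, for illustration only
    name: str
    platform: str       # measurement technology reported by the study
    n_patients: int     # number of cancer patients
    n_events: int       # deaths or relapses
    has_survival: bool  # clinical survival information available

ALLOWED_PLATFORMS = {"RPPA", "LC-MS/MS"}

def is_eligible(study: Study) -> bool:
    """Apply the stated criteria: MS or RPPA data, survival data,
    at least 50 patients and at least 30 events."""
    return (
        study.platform in ALLOWED_PLATFORMS
        and study.has_survival
        and study.n_patients >= 50
        and study.n_events >= 30
    )

# Invented example studies (not real datasets)
studies = [
    Study("hypothetical_A", "RPPA", 800, 120, True),
    Study("hypothetical_B", "LC-MS/MS", 40, 35, True),    # too few patients
    Study("hypothetical_C", "LC-MS/MS", 120, 12, True),   # too few events
    Study("hypothetical_D", "LC-MS/MS", 60, 30, True),    # exactly at both limits
    Study("hypothetical_E", "IHC array", 300, 90, True),  # platform not eligible
]
eligible = [s.name for s in studies if is_eligible(s)]
```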
Protein annotation. In each dataset, the protein annotation generated by the authors was the starting point and duplicated and non-annotated proteins were removed. In addition, UniProt IDs were used to identify gene symbols corresponding to the same genes. The final integrated table of all annotated proteins in the database, including the gene symbol, UniProt ID and TCPA antibody list, is provided as Supplemental Table 2.
Validation of proteome-based protein level determination. We compared proteome-based results to the classification (positive/negative) acquired by conventional immunohistochemistry methods. The patient-level data necessary for this analysis were available in multiple datasets for genes with therapeutic importance, including ESR1, PGR, HER2, and MKI67. All validation analyses were performed in each of the four cohorts separately. In the case of MKI67, we also compared the expression between normal and tumor tissue, as this was available in one dataset.
Correlation between protein biomarker candidates and survival. We performed a PubMed search of publications up to 2019 to identify biomarker candidates related to survival, using the search terms "breast cancer", "protein", "cohort", "marker", and "survival". Publications describing cell lines, other tumor types, those not investigating a tumor tissue, and studies with fewer than 100 patients were excluded. After these restrictions, 53 publications remained. In addition, we examined ten additional publications describing breast cancer guidelines. In all 63 publications, a total of 91 proteins were linked to breast cancer outcome, 57 of which were present in our database. The identification of the proteins was based on their UniProt IDs. The list includes FDA-approved biomarkers, growth factor receptors, immune receptor ligands, basal and adhesion markers (cytokeratins, cadherins, and claudins), stem cell markers, and apoptotic markers (Supplemental Table 3). Altogether, we analyzed 63 protein biomarkers used in breast cancer diagnostics for their prognostic power. The validation of the markers was performed separately in each dataset using overall survival and relapse-free survival time.
Statistical analyses. The immunohistochemistry classification was available as positive/negative and we used this classification to divide the samples into two groups. The differential expression between these groups was evaluated using the Mann-Whitney test by comparing the variables in each study separately. In a second analysis, receiver operating characteristic (ROC) curves were computed to measure sensitivity and specificity and to validate discriminative power. ROC analysis was also utilized to determine the optimal cutoff values to define cohorts based on the expression of the investigated proteins. Spearman rank correlation coefficients were calculated to assess the correlation of continuous variables.
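To illustrate how these three statistics relate to each other, the sketch below implements them in plain Python on invented expression values. It is not the software used in the study. The p value uses a normal approximation without tie or continuity correction, and the cutoff is chosen by Youden's J (sensitivity + specificity - 1), which is one common definition of an "optimal" ROC cutoff and is an assumption here.

```python
from math import erfc, sqrt

def rank_with_ties(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney(pos, neg):
    """Return (U for the positive group, ROC AUC = U / (n1 * n2), two-sided p)."""
    n1, n2 = len(pos), len(neg)
    ranks = rank_with_ties(list(pos) + list(neg))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    auc = u / (n1 * n2)
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, auc, erfc(abs(z) / sqrt(2))

def youden_cutoff(pos, neg):
    """Return (threshold, J) maximizing J when 'value >= threshold' means positive."""
    best_j, best_t = -1.0, None
    for t in sorted(set(pos) | set(neg)):
        sens = sum(v >= t for v in pos) / len(pos)
        spec = sum(v < t for v in neg) / len(neg)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t, best_j

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank_with_ties(list(x)), rank_with_ties(list(y))
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented expression values for IHC-positive and IHC-negative samples
ihc_pos = [2.1, 1.8, 2.5, 0.9, 1.7, 2.2]
ihc_neg = [0.3, 0.8, 1.0, 0.5, 1.2, 0.4]
u, auc, p = mann_whitney(ihc_pos, ihc_neg)
cutoff, j = youden_cutoff(ihc_pos, ihc_neg)
```

The AUC equals the Mann-Whitney U statistic divided by the product of the group sizes, which is why the two analyses tend to agree on which markers separate the IHC groups.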
To measure the association between protein expression and survival length, the patients were grouped into high and low expression groups based on the expression of the selected protein. Then, the two groups were compared by Cox proportional hazards regression, and hazard ratios (HRs), 95% confidence intervals (CIs) and log-rank p values were calculated. Finally, for a selected set of markers, Kaplan-Meier plots were generated to display the different survival characteristics of the two cohorts 25 . For cutoff values, each potential threshold was analyzed between the lower and upper quartiles, and the false discovery rate (FDR) was computed to correct for multiple hypothesis testing. The results were accepted as significant when p < 0.05 and FDR < 0.2.
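The cutoff scan with multiple-testing correction can be sketched as follows. This is a simplified illustration on invented data, not the implementation behind the published results. It uses a two-group log-rank test only (no Cox model, so no hazard ratio), crude index-based quartiles, and the Benjamini-Hochberg procedure as the FDR method, which is an assumption because the text does not name the procedure.

```python
from math import erfc, sqrt

def logrank_p(times, events, high):
    """Two-group log-rank test; p from a chi-square distribution with 1 df."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [h for tt, h in zip(times, high) if tt >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for tt, e in zip(times, events) if e and tt == t)
        d1 = sum(1 for tt, e, h in zip(times, events, high) if e and tt == t and h)
        o_minus_e += d1 - d * n1 / n          # observed minus expected events (high group)
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    return erfc(sqrt(o_minus_e ** 2 / var / 2))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

def scan_cutoffs(expr, times, events):
    """Test every observed value between the lower and upper quartile as a cutoff.
    Returns (best cutoff, its log-rank p, its FDR-adjusted p)."""
    vals = sorted(expr)
    n = len(vals)
    lower, upper = vals[n // 4], vals[(3 * n) // 4]  # crude quartile estimates
    cutoffs = sorted({v for v in vals if lower <= v <= upper})
    pvals = [logrank_p(times, events, [x > c for x in expr]) for c in cutoffs]
    fdr = benjamini_hochberg(pvals)
    best = min(range(len(cutoffs)), key=lambda i: pvals[i])
    return cutoffs[best], pvals[best], fdr[best]

# Invented cohort: low expressers relapse early, high expressers mostly do not
expr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
times = [5, 6, 7, 8, 9, 10, 30, 32, 34, 36, 38, 40]
events = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
best_cutoff, best_p, best_fdr = scan_cutoffs(expr, times, events)
significant = best_p < 0.05 and best_fdr < 0.2  # thresholds stated in the text
```

Because the best of many cutoffs is selected, the unadjusted p value is optimistic. The FDR-adjusted value accounts for the number of thresholds tested, which is why both conditions are required for significance.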
Survival analysis web tool. We previously created an online analysis platform utilizing transcriptome-level mRNA expression 26 and miRNA expression 27 data together with clinical, follow-up, and pathological data to assess the correlation between gene expression and survival in breast cancer. Here, we have established a new subsystem of this analysis platform. The complete proteomic database is now integrated into this system, and new biomarker candidates, as well as each biomarker assessed here, can be rapidly evaluated using the registration-free analysis site. In the tool, selection of the proteins can be performed using the gene symbol, the UniProt ID or the RPPA antibody name (https://kmplot.com/analysis/).

Results
Integrated breast cancer protein database. Altogether, 140 datasets were identified, of which 30 studies had at least some clinical information for the included patients. We listed all these datasets in Table 1. After exclusion of those without survival data and other ineligible studies, four independent projects remained. These four datasets comprise 1229 specimens and 7342 unique proteins. The entire set of patients included 1064 overall survival (OS) and 998 relapse-free survival (RFS) records. Two datasets had only overall survival 24 or only relapse-free survival 23 data. Median OS and RFS times ranged from 27.6 to 96.5 months and from 9.6 to 85.5 months, respectively. The mean age of the patients was 57.7 ± 13.6 years. In line with previous expectations 28, estrogen receptor-positive (ESR1+) patients represented approximately 67% of all samples, and almost half of the patients had nodal involvement (46%). Of note, the Liu 2014 dataset 22 included only triple-negative breast cancer (TNBC), lymph node-negative and treatment-naive patients. In the other studies, hormone therapy, primarily tamoxifen, was applied (59%). Table 2 contains detailed clinical parameters for each included dataset, and Fig. 2 shows selected clinical characteristics for these datasets. The dataset generated using RPPA contains the most patients (n = 873) but the fewest proteins (n = 224). The other three datasets together contain > 7000 protein records measured by LC-MS/MS technology. Figure 3A shows the proportions of detected proteins in each dataset combination. Only 39 proteins were measured in all datasets, while 1356 overlapping proteins were evaluated in the three LC-MS/MS studies. A total of 4731 proteins were detected in only one study, and most of them came from the Tang 2018 cohort (n = 4225) 24. When mapping the measured proteins to cellular locations, the majority of proteins originated from the cytoplasm (36.3%), nucleus (32.2%) and cytosol (27.6%) (Fig. 3B,C).
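Overlap counts of the kind reported above (proteins found in all datasets, in at least three, or in only one) follow from counting dataset membership per protein. The snippet below is a minimal sketch. The dataset names and protein identifiers are placeholders and do not reflect the real contents of the four cohorts.

```python
from collections import Counter

def detection_counts(datasets):
    """Map each protein ID to the number of datasets in which it was measured."""
    return Counter(p for proteins in datasets.values() for p in set(proteins))

# Placeholder datasets with made-up protein IDs
datasets = {
    "rppa_cohort": {"A", "B", "C"},
    "ms_cohort_1": {"A", "B", "D", "E"},
    "ms_cohort_2": {"A", "B", "D", "F"},
    "ms_cohort_3": {"A", "D", "G", "H"},
}

counts = detection_counts(datasets)
in_all = sorted(p for p, c in counts.items() if c == len(datasets))
in_three_or_more = sorted(p for p, c in counts.items() if c >= 3)
in_one_only = sorted(p for p, c in counts.items() if c == 1)
```

In practice the identifiers would first be harmonized (for example, mapped to UniProt IDs as described in the Methods) so that the same protein is not counted under two names.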
Supplemental Table 2 includes all proteins.
Evaluation of routine diagnostic biomarkers. Immunohistochemistry results were available as positive or negative, and we compared the expression of the selected protein (e.g. ESR1) between the positive and the negative groups. When ESR1, PGR, and HER2 protein expression levels determined by RPPA were compared to IHC-based receptor status, results revealed that protein expression and receptor status were highly significantly correlated with one another (p < 0.0001) (see means-plots and ROC-plots in Fig 24 . When comparing the expression of the proliferation marker MKI67 between the normal and cancer samples, the tumor samples had significantly higher expression (fold change = 2.22, p = 0.0001) (Fig. 5B).
Finally, we also assessed the correlation between ESR1 and the ESR1-regulated gene PGR. In this analysis, we uncovered a moderate correlation between ESR1 and PGR protein expression levels, as determined by LC-MS/MS (correlation coefficient = 0.17, p = 0.0399, Fig. 5C). Unfortunately, due to the limited availability of simultaneously collected data, it was not possible to analyze all possible clinical scenarios and to model molecular subtype determination based on proteomic datasets.
Proteins with significant prognostic power. We assessed the link between survival and the expression of 63 proteins and their phosphorylated forms to validate their prognostic relevance in breast cancer (Supplemental Table 3). The expression of 33 of 63 proteins had a significant correlation with patient outcome. Twelve proteins associated with OS only, nine proteins associated with RFS only, and twelve proteins (PGR, CDH1, BCL2, NDRG1, CTNNB1, APOD, PARP1, RBM3 and four cytokeratins: KRT18, KRT5, KRT6B, KRT17) were prognostic for both RFS and OS. Of these, three proteins (KRT18, APOD and CDH1) and four proteins (PGR, CDH1, CTNNB1, and BCL2) were confirmed to be related to OS and RFS, respectively, in at least two independent datasets. The results of the survival analysis for each of these proteins in terms of OS and RFS are displayed in Table 3A and 3B, respectively. A better overall survival outcome was associated with higher expression of E-cadherin (HR = 0.21, 95%CI = 0.08 − 0.6, p = 0.0013) and the apoptosis regulator protein BCL2 (HR = 0.6, 95%CI = 0.39 − 0.81, p = 0.0017). Higher BCL2 was also strongly related to longer relapse-free survival (HR = 0.4, 95%CI = 0.27 − 0.61, p = 9.5e − 06). While we also validated the prognostic value of the expression level of tyrosine 1248-phosphorylated HER-2 (HER2_pY1248) (HR = 1.63, 95%CI = 1.13 − 2.36, p = 0.0079) using RPPA data, the expression level of nonphosphorylated HER-2 did not have a significant correlation with survival in any of the included datasets.

Discussion
A major advance of proteomic technologies lies in their ability to simultaneously measure multiple biomarkers from a single clinical specimen. Here, we collected four independent breast cancer proteomic cohorts and validated established and new biomarker candidates.
Despite the quantitative and multiplexing limitations of immunohistochemical analysis, in clinical practice, it is still the gold standard. We compared the efficiency of various proteomic techniques to determine routinely measured breast cancer biomarkers, including ESR1, PGR, HER2, and MKI67. In this analysis, both the RPPA and LC-MS/MS method results were highly correlated with IHC results and thus can be utilized to determine receptor status in breast cancer patients. Unfortunately, we did not have all markers for the same patients, and the results achieved for individual genes can only suggest that proteomic technologies will also be capable of performing molecular stratification in the future, enabling the discrimination of breast cancer subtypes.
Estrogen receptor is a pioneer cancer biomarker, and classifying breast tumors based on hormone receptor status has been utilized in routine clinical practice for over four decades 29 . ESR1 positivity and PGR positivity are associated with better survival outcomes than negative ESR1/PGR status. In addition to clinicopathological prognostication, the main medical application of these receptors is selecting patients for endocrine therapy 30 .
MKI67 is a protein not expressed in G0 phase, and thus, it is a perfect marker for determining the proportion of dividing cells 31 . MKI67 expression is correlated with outcome, and high MKI67 expression is associated with poor prognosis, which has been validated in a meta-analysis involving over 64 thousand breast cancer patients 32 . Immunohistochemical staining of MKI67 alone can also pinpoint low-risk breast cancers with the same reliability as genomic markers 33 .
Evaluation of HER2 (ERBB2, neu) status has also been routinely used in breast cancer molecular diagnostics since the end of the 1990s. Analysis of large cohorts of patients found that HER2 overexpression is associated with unfavorable prognosis and poor response to chemotherapy 34 . The clinical introduction of anti-HER2 therapies (i.e., trastuzumab, pertuzumab) in combination with chemotherapy in patients who have HER2-positive cancer results in exceptional survival advantages. As a result, HER2-positive patients have a better outlook than HER2-negative patients 35 . Today, tumors with even 1% positivity are eligible for anti-HER2 therapy 36 .
We assessed the prognostic power of a selected set of proteins, including ESR1, PGR, HER2, cytokeratins, claudins, E-cadherin 39 and EGFR, in the datasets included in the present study. Overall, we uncovered that 33 proteins had a significant correlation with prognosis. In the case of FDA-approved protein biomarkers, the expression of estrogen and progesterone receptors is correlated with favorable relapse-free survival. High expression levels of phosphorylated HER2 protein measured by RPPA were linked with worse overall survival than low expression levels; these findings are in line with the previous study by Hayashi et al. on the same protein 40 .
High expression of the antiapoptotic Bcl-2 and the adhesion marker E-cadherin was related to longer relapse-free survival than low expression in at least two independent datasets. Bcl-2 overexpression was revealed in other cancers and was linked to cancer initiation and progression, and higher expression positively correlated with favorable patient outcomes in hormone receptor-positive breast tumors 41,42. Loss of E-cadherin expression is frequently represented in invasive lobular breast carcinoma, which is three times more likely to metastasize 43.
Interestingly, some of the genes, including PGR and E-cadherin, display inverse correlations with survival when assessing the link to survival in different patient cohorts. Here, we have to mention some limitations of our analysis that might lie behind these discrepancies. A major constraint is that only 20% of the proteins were determined in at least three platforms. This means that the evaluation of further databases will be needed to perform a comprehensive validation of all potential biomarker candidates. Another shortcoming of the investigated datasets is the rather low proportion of events (in the case of the TCGA dataset) 12 and the short follow-up time (DeMarchi dataset 23). A future large-scale proteomic database with long follow-up and uniform protein level determination using a single method could provide more reliable data for a similar analysis.
In summary, we successfully integrated four distinct breast cancer proteomic datasets containing tumor and normal samples. A significant correlation was observed between marker levels detected by proteomic technologies and those detected by immunohistochemistry. We validated prognostic and predictive breast cancer biomarkers and compared the efficiency of different proteome analysis techniques. The entire database is integrated into our online tool, providing an opportunity to validate our findings and to identify and rank new survival-associated biomarker candidates using multiple independent cohorts of breast cancer.