Introduction

Large-scale omics data characterizing human tumors can be leveraged to develop a deeper understanding of biological processes and predict clinical outcomes. For instance, one can develop prognostic molecular signatures to stratify patients into risk groups for disease progression or metastasis1,2,3. Multiple studies have demonstrated that molecular characterization of tumors may provide a more accurate and granular picture of a patient’s prognosis than the traditional pathological staging system, thus informing therapeutic and disease surveillance strategies4,5,6.

The Cancer Genome Atlas (TCGA) program has generated molecular profiles for thousands of human tumors spanning over thirty different tissue types7. Detailed genomic analyses using these data have identified novel cancer driver genes and biomarkers of disease8,9,10,11. To complement the genomic, epigenetic and transcript level data of TCGA, a more recent project by Akbani et al. has generated proteomic data from reverse-phase protein arrays (RPPA)12. RPPA is a high-throughput and cost-effective antibody-based method that provides a more direct assessment of cellular activity compared to DNA and RNA sequencing, which generate data that do not always correlate with protein expression13. Protein levels and post-translational modifications, such as phosphorylation and acetylation, are thought to better represent active pathway signaling.

Multiple studies have demonstrated the prognostic value of RPPA data12,14,15,16,17. Some of these studies have used pathway-driven approaches, relying on prior knowledge from the literature to group proteins into biological pathways, to develop prognostic signatures or predictors of treatment response. For instance, in the paper by Akbani et al. that introduced The Cancer Proteome Atlas (TCPA), proteins analyzed by RPPA were assigned to ten cancer-related pathways on the basis of a literature search of review articles on these pathways12. For a given pathway, positive regulatory elements of the pathway were assigned a coefficient of +1. Correspondingly, the coefficients of negative regulatory elements of the pathway were set to −1. Effectively, the pathway activity score was defined as the sum of positive regulators minus the sum of negative regulators of the pathway. This approach did yield pathway activity scores with prognostic value in some cancer types12. However, this approach may not be generally applicable because, for many cancer types, the pathways involved and their regulator genes are not well defined18. We therefore hypothesized that a statistical approach specifically geared toward outcome prediction may yield scores with improved prognostic ability.

Using normalized RPPA data for up to 258 total, cleaved, acetylated, or phosphorylated proteins from TCPA19,20, we demonstrate the capability of a statistical approach, LASSO regression21, to derive weighted risk scores that achieve strong prognostic stratification without requiring a priori biological knowledge. Unbiased statistical resampling methods were applied to proteomic data from four TCGA cancer studies to demonstrate that performance of our LASSO-based prognostic scores is equivalent or superior to that of predefined pathway-driven RPPA signatures.

Results

Three-fold cross-validation model assessment

The number of samples in the KIRC dataset was comparable between the version of TCPA that we downloaded for our analysis and the version used in the original study by Akbani et al. (Table 1)12. We first repeated the Kaplan–Meier analysis of the KIRC dataset with the modifications noted in the Methods and illustrated in Fig. 1A: for ten iterations, we split the dataset into three folds and assigned tumors to a training set (2/3) and a testing set (1/3). The training set median and s.d. were used to adjust RPPA values in all 445 tumors. Subsequently, the unweighted RTK signature score was computed for all tumors, and testing set tumors were assigned to high and low risk groups based on the median RTK score in the training set. The resulting thirty pairs of high risk and low risk Kaplan–Meier curves are displayed in Supplementary Fig. S1. Then, Cox regression weighted RTK pathway scores and LASSO-regression derived protein signature scores were evaluated following the same procedure. The resulting Kaplan–Meier curves are shown in Supplementary Fig. S2 and Supplementary Fig. S3, respectively.

Table 1 Data summary.
Figure 1

Schematic representations of the unbiased model evaluation approaches. (A) Ten iterations of threefold cross-validation. (B) Permutation test with 1000 permutations.

Re-assigning weights to the 7 RTK proteins based on Cox regression did not improve model performance compared to the original, unweighted RTK score; however, deriving a new, pathway-independent LASSO-driven score improved the stratification of patients into high and low risk groups. Median difference in overall survival probability at 5 years based on the LASSO-derived risk score was 32.8%, compared to 25.2% when using the 7-protein unweighted RTK score (Fig. 2A). Median hazard ratio (HR) between high and low risk groups across the held-out folds in the CV based on the 7-protein RTK score was 2.4, compared to 3.3 when using the risk score derived by LASSO applied to the training data folds (Fig. 2B). Time-dependent ROC curves for overall survival probability at 5 years for all three prognostic models are shown in Supplementary Fig. S4A–C. Boxplots of risk scores stratified by pathologic stage for all three types of risk scores in KIRC revealed a weak linear trend in association between risk score and stage (Supplementary Fig. S5A–C).

Figure 2

Probability density distribution of (A) difference in overall survival probability at 5 years, and (B) hazard ratio for high vs. low risk TCGA-KIRC groups stratified according to the original RTK score, LASSO-modified RTK score, Cox regression-modified RTK score, or pathway independent LASSO-derived signature score.

Permutation test for the evaluation of the cross-validated log-rank statistic

As described in the Methods and schematized in Fig. 1B, for each of the three prognostic models (unweighted RTK pathway score, Cox regression weighted RTK score, and LASSO-derived protein signature score), the statistical significance of the cross-validated log-rank statistic was evaluated with a 1000-permutation test22. Tumor stratification based on the original RTK score or on the pathway-independent LASSO-derived score obtained the best possible permutation test p value after 1000 permutations (i.e. permutation test p = 5e−04); however, the split between high and low risk groups was more pronounced with the LASSO-derived pathway-independent score (Fig. 3A,C). The Cox regression weighted RTK pathway score method resulted in a somewhat larger, but still statistically significant, permutation test p value of 1.5e−03 (Fig. 3B).

Figure 3

Kaplan–Meier overall survival analysis of KIRC samples stratified according to different signature scores. Kaplan–Meier curves demonstrating the stratification of TCGA-KIRC tumors according to (A) the original 7-protein RTK score, (B) the Cox regression weighted 7-protein RTK score, and (C) the pathway-independent LASSO-derived prognostic signature score. The high and low risk group curves are in purple and green, respectively.

Stage-separated and sex-separated Kaplan–Meier curves for the three types of risk scores (original RTK score, Cox-modified RTK score, and LASSO-derived score) in KIRC were also generated (Supplementary Fig. S6). A visual examination reveals that the performance of the risk scores is independent of sex, and even at lower pathologic stages, the LASSO-derived risk score effectively stratified patients into better and worse prognoses (Supplementary Fig. S6C). In contrast, the 7-protein RTK score from Akbani et al., whether in its original form or modified with Cox regression coefficients, performed worse for the stratification of low stage tumors (Supplementary Fig. S6A–B).

The top 20 proteins most frequently selected by the LASSO are listed in Table 2. Multiple proteins have previously been implicated in kidney cancer14,23,24,25,26, and interestingly, 13 of these 20 proteins were not assigned to any of the ten cancer-related pathways in the original paper by Akbani et al.12. The remaining 7 proteins were annotated as belonging to different pathways (TSC_mTOR, Hormone_b, Cell_cycle, Ras_MAPK, and DNA_damage_response), none of which was the RTK pathway (Supplementary Table S1). Furthermore, except for MAPK_pT202_Y204, the expression of the top 20 proteins did not strongly correlate with that of the 7 RTK proteins from the original prognostic signature (Supplementary Table S2). These results provide support for the use of a pathway-independent method to optimize the selection of prognostic protein markers from the RPPA data matrix.

Table 2 Top 20 proteins most frequently selected by the LASSO and pathway assignment from reference12 (Supplementary Table S1).

LASSO-derived RPPA scores have prognostic value in other tumor types represented in TCGA

To assess whether our proposed LASSO-derived approach yields scores with prognostic value in other human tumor datasets, we compared the performance of ten literature-driven pathway scores to that of the purely statistical LASSO-derived protein signature score in 3 additional datasets from TCGA: 353 skin cutaneous melanomas (SKCM), 221 sarcomas (SARC), and 411 ovarian serous cystadenocarcinomas (OVCA). Clinical characteristics of the datasets are detailed in Table 3.

Table 3 Clinical and protein characteristics of the TCGA datasets evaluated in the permutation test.

Representative plots of the cross-validated optimization of the regularization parameter λ on the three datasets and non-zero coefficients assigned by the LASSO are shown in Supplementary Fig. S7. Boxplots of LASSO-derived risk scores stratified by pathologic stage presented in Supplementary Fig. S5D–E demonstrate that in the OVCA and SKCM datasets, there is little to no association between risk score and tumor stage.

The performance of the different scoring methods was evaluated with a 1000 permutation test, as for KIRC. The resulting cross-validated Kaplan–Meier curves for high and low LASSO-derived risk scores for these three datasets demonstrate the statistically significant stratification of the tumors into high and low risk groups (Fig. 4A–C).

Figure 4

Kaplan–Meier curves demonstrating the stratification of tumors from TCGA according to the pathway-independent LASSO-derived prognostic signature score for multiple tumor types: (A) skin melanoma, (B) sarcoma, and (C) ovarian carcinoma. Permutation test p values are shown. The high and low risk group curves are in purple and green, respectively. Published pathway-specific unweighted signatures introduced by Akbani et al.12 were also evaluated for comparison (see Table 4).

Stage-separated Kaplan–Meier curves were plotted for OVCA and SKCM, and sex-separated curves were plotted for SKCM and SARC (Supplementary Fig. S8). The SARC dataset in TCGA did not have any pathologic stage or tumor grade information, and the OVCA dataset only contains female patients. In OVCA, the vast majority of tumors are stage III (78%, see Table 3), hence the visible difference in survival probability between high and low score stage III tumors (Supplementary Fig. S8A). The very low sample size and low number of events in the lower stages (stage I and II tumors together account for ~8% of the dataset) make the corresponding Kaplan–Meier curves less compelling. In SKCM, high and low scores effectively align with patient survival (Supplementary Fig. S8B). In these datasets as well, performance of risk scores was independent of sex (Supplementary Fig. S8B–C).

Furthermore, permutation test p values for pathway-driven12 or LASSO-driven protein signatures in the three TCGA studies are listed in Table 4. In SKCM and SARC, our LASSO-based approach performed consistently well and yielded a smaller p value (p = 5e−04) than all ten literature-curated unweighted pathway scores. In OVCA, the p value for the LASSO-derived protein signature score was only matched by that of the Ras-MAPK pathway score (p = 2.5e−03). For SKCM and SARC, the LASSO-derived signatures mostly contained proteins that did not have a pre-defined pathway assignment in the original study12 (Supplementary Fig. S7A–B). Moreover, for OVCA, the LASSO-derived signature was composed of 13 proteins that did not belong to any of the ten pre-defined pathways from Akbani et al. and nine proteins belonging to eight of the pre-defined pathways (Supplementary Fig. S7C). Taken together, these results suggest that more than one pathway may inform prognosis, thus placing pathway-specific approaches at a disadvantage for prognostic modeling.

Table 4 Permutation test p values for pathway-driven12 or LASSO-driven protein signatures in TCGA studies.

Discussion

Assessing the functional proteome via the analysis of RPPA data may yield important insights into patient prognosis and therapy options. We used two unbiased statistical approaches to compare the performance of our pathway-independent LASSO-derived method to that of a predefined pathway-driven risk score (Fig. 1A,B). We found our LASSO-derived method for the selection of a data-driven prognostic signature to be effective for the stratification of patient samples into high and low survival risk groups (Supplementary Fig. S3 and Supplementary Fig. S4C). Our LASSO-based approach to derive a prognostic signature performed as well as or better than a biology-driven prognostic signature for the TCGA kidney clear cell carcinoma dataset according to both unbiased evaluation approaches (Figs. 2A,B, 3A–C, and Supplementary Fig. S4A–C). Our method was successfully applied to three other TCGA cancer studies in which it performed as well as or better than predefined pathway-driven RPPA signatures (Fig. 4A–C).

Pathway-based approaches have limitations and are susceptible to biases depending on which molecules are included from a given pathway. They require prior knowledge of pathways and regulators of the cancer type under study. Mubeen et al. justly noted that different pathway databases contain different representations of the same biological pathway37. Correspondingly, they found that the choice of pathway database for statistical enrichment analysis or predictive modeling had a profound impact on results. Another recent study by Chen et al. came to the same conclusion38. Moreover, cancer is an extremely complex disease often involving the concerted dysregulation of multiple pathways39. Therefore, using a single literature-defined pathway for prognostic prediction runs the risk of overlooking informative molecules assigned to a different pathway. Indeed, in the TCGA datasets examined for the present study, the majority of proteins most frequently selected by the LASSO were not assigned to any of the 10 cancer-associated pathways curated by Akbani et al. (Table 2 and Supplementary Fig. S7)12. For KIRC, only 7 out of the top 20 most frequently selected proteins overlapped with one or more of the 10 predefined pathways from Akbani et al. The analysis of SKCM, SARC, and OVCA also revealed that the majority of LASSO-selected predictors were not in the pathways defined by Akbani et al. despite being assigned strong weights by the LASSO, and belong to a wide variety of cancer-associated pathways such as the Hippo pathway (e.g. YAP, TAZ) and inflammatory immune response (e.g. PDL1, NFKBP65_pS536) (Supplementary Fig. S7), consistent with the widespread dysregulation that is typical of cancer40.

In our study, LASSO regression on the KIRC RPPA dataset consistently yielded signatures including proteins which have previously been linked to survival in kidney cancer specimens (Table 2). For instance, AMPK is a sensor of cellular energy and negative regulator of the mTOR signaling pathway26. Foersch et al. demonstrated the significant association between androgen receptor (AR) and prognosis in patients with renal clear cell carcinoma (RCC)27. Cytoplasmic CAV1 protein expression measured by immunohistochemistry (IHC) was found to correlate with clinical prognosis in RCC28. CDK1 and CDK2 activity was linked to poor prognosis and RCC recurrence29. Bellut et al. showed that c-MYC protein expression had prognostic value in a subtype of RCC30. Ribosomal protein S6 kinase beta-1 (p70S6K) is a downstream target of mTOR, and its phosphorylation is a confirmed prognostic marker in RCC31. SF2, a novel oncoprotein in RCC, was significantly associated with poor survival in a large cohort of patients with RCC35. High SCD1 expression was prognostic of overall survival in patients with RCC34. Nuclear expression of p-STAT3 was significantly associated with RCC subtypes with greater malignant potential36. 4E-BP1, a regulator of mRNA translation initiation, is activated by mTORC1 signaling in response to extracellular stimuli and metabolic stress conditions41. A recent study by Naito et al. revealed an association between 4E-BP1 phosphorylation and poor prognosis in a non-metastatic RCC cohort23. Correspondingly, Campbell et al. demonstrated that the combined expression of p4E-BP1 and eIF4E was associated with significantly worse disease-free survival in patients with RCC24. Furthermore, acetyl-CoA carboxylase (ACC1) was also systematically selected by the LASSO (Table 2). A defining feature of KIRC is the presence of lipid- and glycogen-rich cytoplasmic deposits25. Du et al. identified hypoxia-inducible factor (HIF) control of fatty acid metabolism as being essential for KIRC tumorigenesis. ACC1 carries out a major step of fatty acid synthesis for membrane synthesis, production of energy stores and signaling molecules42. Interestingly, the expression of lipogenic enzymes including FASN, ACC1, and ACLY is also downstream of mTORC1 signaling43. Han et al. also reported the prognostic utility of ACC1 protein expression in KIRC, as well as FASN, Cyclin B1 and Rad51, which was also frequently selected by the LASSO in our study (Table 2)14.

The 258 proteins included in the RPPA for TCPA were selected on the basis of their functional role in cancer-related pathways such as proliferation, DNA damage, EMT, and apoptosis12. This focused approach confers an advantage for LASSO feature selection over the use of whole-genome RNA-seq datasets, which contain tens of thousands of genes and thus make the feature selection process highly susceptible to noise. Kim and Bredel reported similar findings in their 2013 publication44. The authors used gene expression profiles from 300 cancer pathway genes obtained from the Molecular Signatures Database (MSigDB) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database as an input for LASSO optimization. They demonstrated that the gene pre-selection increased the average correlation coefficient between observed survival days and relative risks compared to the same analysis conducted on whole-genome gene expression profiles44.

The data-driven nature of our LASSO-based approach makes it versatile and particularly well-suited for the discovery of unexplored protein/disease associations that could aid in therapeutic discovery.

Methods

Data acquisition

Level 4, batch-corrected proteomic data generated by reverse phase protein array (RPPA) for up to 258 total, cleaved, acetylated, or phosphorylated proteins across 7694 patient tumors were obtained from The Cancer Proteome Atlas (TCPA) data portal (https://tcpaportal.org/tcpa/) version 4.2 (release date: 07/18/2018)19,20. The tumors included 445 kidney clear cell carcinomas (KIRC), 353 skin cutaneous melanomas (SKCM), 221 sarcomas (SARC), and 411 ovarian serous cystadenocarcinomas (OVCA). Survival data, sex, and pathologic stage information for the patient tumors were downloaded from the Broad Institute’s cBioPortal for Cancer Genomics45,46, and were matched to the proteomic data by specimen ID. Table 1 summarizes the different tissue datasets downloaded from TCPA and compares the number of samples in our study to the number of samples used in the paper by Akbani et al.12.

For cross-validation steps described below, level 4 RPPA values downloaded from TCPA were median-centered and standard deviation (s.d.) normalized across tumors using the median protein expression and s.d. from each training set to yield relative protein expression levels in the testing set as described previously by Akbani et al.12.
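The training-set normalization described above can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study; the function names, the use of the sample s.d. (n − 1 denominator), and the example values are assumptions made for illustration.

```python
from statistics import median, stdev

def fit_scaler(train_values):
    """Return the median and sample s.d. of one protein in the training set."""
    return median(train_values), stdev(train_values)

def apply_scaler(values, center, scale):
    """Median-center and s.d.-scale values with training-set parameters."""
    return [(v - center) / scale for v in values]

# Hypothetical level 4 RPPA values for a single protein.
train_values = [0.2, -0.1, 0.4, 0.0, 0.3]
test_values = [0.5, -0.2]

# Parameters come from the training set only, then are applied to both sets.
center, scale = fit_scaler(train_values)
scaled_train = apply_scaler(train_values, center, scale)
scaled_test = apply_scaler(test_values, center, scale)
```

Estimating the center and scale on the training set alone keeps information from the testing set out of the model-building step.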

Unweighted RTK pathway score

The starting point of our study was a published RPPA-based seven-protein signature of receptor tyrosine kinase (RTK) pathway activity in the form of an unweighted sum of seven protein measurements: EGFR-pY1068, EGFR-pY1173, HER2-pY1248, HER3-pY1289, SHC-pY317, SRC-pY416, and SRC-pY527 (reference 12). The prognostic value of this signature had been demonstrated by Akbani et al. in a 445-patient renal clear cell carcinoma cohort (TCGA-KIRC)12. When computing the literature-driven, unweighted pathway score from Akbani et al., the protein weights w were assigned the value of +1 or −1. The pre-defined pathway members and weights are listed in Supplementary Table S1.

Weighted RTK pathway score with Cox regression weights

Subsequently, we modified the original RTK score using Cox regression to derive new protein weights w for the seven proteins of the original RTK signature using R package survival (version 3.3-1)47. Cox regression was run on each training set within the cross-validation procedure, as described below, to optimize protein weights w for the seven protein members of the RTK pathway according to the literature search conducted by Akbani et al.12. Subsequently, the protein signature score for each tumor was computed using the following equation:

$$\text{Protein signature score} = \sum_{i=1}^{n} w_{i} Y_{i}, \qquad (1)$$

where n is the number of proteins with measurements, w is the vector of protein weights, and Y is the median-centered, SD-scaled protein expression matrix.
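As a minimal sketch of Eq. (1), the score for one tumor is a weighted sum over proteins. The Python function name and the expression values below are hypothetical; the seven protein names and the +1 weights of the unweighted RTK signature are taken from the text.

```python
RTK_PROTEINS = [
    "EGFR-pY1068", "EGFR-pY1173", "HER2-pY1248", "HER3-pY1289",
    "SHC-pY317", "SRC-pY416", "SRC-pY527",
]

def signature_score(weights, expression):
    """Eq. (1): sum of w_i * Y_i over the proteins that carry a weight."""
    return sum(w * expression[protein] for protein, w in weights.items())

# Unweighted RTK signature: every protein weight is +1.
unweighted = {protein: 1.0 for protein in RTK_PROTEINS}

# Hypothetical normalized expression values for one tumor.
tumor = dict(zip(RTK_PROTEINS, [0.5, -0.25, 1.0, 0.0, 0.25, -0.5, 0.75]))

score = signature_score(unweighted, tumor)  # plain sum of the seven values
```

The same function covers the Cox-weighted and LASSO-derived scores by passing a different weight dictionary; proteins given a zero weight by the LASSO can simply be left out.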

LASSO-derived protein signature score

Finally, we derived a pathway independent protein signature score using LASSO regression with L1-penalty to select an unrestricted number of elements from the 233 proteins with RPPA measurements in this dataset, and optimally combine their RPPA measurements into a weighted risk score for the 445 KIRC tumors. LASSO regression was performed on each training set within the cross-validation procedure, as described below, to determine protein weights w corresponding to the optimal value of the tuning parameter λ using R package glmnet (version 4.1-4)48. Protein signature score was computed for all tumors using Eq. (1) as described above.
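For orientation, the L1-penalized Cox objective that a LASSO regression of this kind minimizes can be written in the notation of Eq. (1). This is the standard textbook form, given as a sketch: the tumor index j, the event indicator, the risk-set notation, and the omission of tied event times are simplifying assumptions, and software such as glmnet may scale the likelihood term differently.

$$\hat{w} = \underset{w}{\arg\min} \left\{ -\sum_{j:\,\delta_{j}=1} \left[ \sum_{i=1}^{n} w_{i} Y_{ij} - \log \sum_{k \in R_{j}} \exp\left( \sum_{i=1}^{n} w_{i} Y_{ik} \right) \right] + \lambda \sum_{i=1}^{n} \left| w_{i} \right| \right\}$$

Here Y_{ij} is the normalized expression of protein i in tumor j, δ_j = 1 marks tumors with an observed event, and R_j is the set of tumors still at risk at the event time of tumor j. Larger values of λ shrink more weights to exactly zero, which is how the LASSO selects a subset of proteins.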

Method performance evaluation

Because model building from a large number of candidate variables is prone to overfitting, we utilized two unbiased approaches for evaluation of method performance: (1) ten iterations of threefold cross-validation for unbiased estimation of hazard ratio and difference in 5-year survival (by Kaplan–Meier method) between high and low risk groups defined based on application of a median cut to the risk score; and (2) a permutation test to evaluate the statistical significance of the cross-validated log-rank statistic.
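The difference in 5-year survival between risk groups relies on the Kaplan–Meier estimator, which can be sketched in a few lines. This is an illustrative Python implementation, not the R survival code used in the study; the function name and the follow-up data are hypothetical.

```python
def km_survival(times, events, horizon):
    """Kaplan-Meier estimate of the survival probability at `horizon`.

    times: follow-up in months; events: 1 for death, 0 for censoring.
    """
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
    return survival

# Hypothetical follow-up data (months) for two risk groups.
low_times, low_events = [70, 80, 30, 90], [0, 0, 1, 0]
high_times, high_events = [10, 20, 65, 40], [1, 1, 0, 1]

low_5y = km_survival(low_times, low_events, 60)
high_5y = km_survival(high_times, high_events, 60)
difference = low_5y - high_5y  # difference in survival probability at 5 years
```

Censored tumors stay in the at-risk count until their last follow-up time, which is what separates this estimate from a simple proportion of survivors.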

Cross-validation

The prognostic scores developed using the Cox regression and LASSO approaches, and corresponding low and high risk groups defined by median cut, were first evaluated with ten iterations of three-fold cross-validation. R package caret (version 6.0-93) was used to split the dataset into folds for the cross-validation49. In order to test model stability, we used a different random seed for each of the ten iterations. The evaluation approach is illustrated in Fig. 1A. For each of the ten iterations, the dataset of 233 RPPA measurements for 445 KIRC tumors was randomly split into a training set (2/3 of the tumors) and a testing set (remaining 1/3 of the tumors) for three rounds of cross-validation (CV). At each CV round, the pathway score was computed on the training set and applied to all tumors as described above. Then, the median pathway score for the tumors of the training set was used as a stratification cutoff for high and low risk groups in the testing set. We then performed a log-rank test comparing testing set high and low risk groups using R package survival47 and recorded the log-rank test statistic. Hazard ratios and difference in overall survival probabilities at five years between high and low risk groups in the cross-validation testing set by Kaplan–Meier method were also documented. Time-dependent receiver operating characteristic (ROC) analysis was conducted using R package survivalROC (version 1.0.3) which implements the cumulative case/dynamic control ROC50. ROC for overall survival at 5 years (i.e. 60 months) was evaluated because in this dataset, > 70% of events had occurred by that time point.
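The splitting and median-cut steps above can be sketched as follows. This Python illustration stands in for the caret-based splitting used in the study; the function names, the strict "greater than" rule at the cutoff, and the simulated scores are assumptions. For brevity the scores are held fixed, whereas in the study the normalization and protein weights are re-derived from each training set.

```python
import random
from statistics import median

def threefold_splits(n_tumors, seed):
    """Yield (train, test) index lists for one iteration of threefold CV."""
    indices = list(range(n_tumors))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::3] for k in range(3)]
    for k in range(3):
        train = [i for j in range(3) if j != k for i in folds[j]]
        yield train, folds[k]

def assign_risk(scores, train, test):
    """Label test tumors high or low risk using the training-set median."""
    cutoff = median(scores[i] for i in train)
    return {i: "high" if scores[i] > cutoff else "low" for i in test}

# Hypothetical risk scores for 12 tumors.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(12)]

labels_per_fold = []
for seed in range(10):  # ten iterations, each with its own seed
    for train, test in threefold_splits(len(scores), seed):
        labels_per_fold.append(assign_risk(scores, train, test))
```

Ten iterations of three folds give thirty testing sets, matching the thirty pairs of Kaplan–Meier curves reported in the Results.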

Assessment of model performance with the permutation test

As schematized in Fig. 1B, the dataset of 233 RPPA measurements for 445 KIRC tumors was randomly split into ten evenly-sized folds using R package caret49. For ten rounds, nine tenths of the data served as the training set, while the remaining tenth was assigned to the testing set. The resulting ten partitions were found to have similar pathologic stage and sex proportions to the complete dataset. For the unweighted RTK signature all seven protein weights were assigned the value of + 1. For the Cox regression weighted RTK signature and the LASSO-derived protein signature score, protein weights w were derived from the training set as described above. Protein signature scores were computed for all 445 tumors using Eq. (1). The median pathway or protein signature score in the training set was used as the threshold to assign the testing set tumors to high and low risk score groups. After the tenth round, with all 445 tumors having been assigned a high or low risk label, we drew the overall cross-validated Kaplan–Meier curves and recorded the log-rank test statistic for the original data. Then, for 1000 permutations, we randomly permuted the correspondence of phenotype (i.e. survival time and status) and protein expression, repeated the tenfold cross-validation, and computed the log-rank statistic. The permutation test p value was computed using the following equation described by Royston and Parmar51:

$$\text{Permutation test } p = \frac{N + 0.5}{M + 1}, \qquad (2)$$

where N is the number of permutations for which the log-rank test statistic was greater than or equal to the log-rank test statistic of the real dataset, M is the number of permutations (i.e. 1000), and 0.5 corresponds to the continuity correction constant. With 1000 permutations, the best possible permutation test p value is 5e−04.
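Eq. (2) can be checked numerically with a short sketch. The Python function name and the permuted statistics below are hypothetical; only the formula itself comes from the text.

```python
def permutation_p_value(observed, permuted):
    """Eq. (2): (N + 0.5) / (M + 1) with N permuted statistics >= observed."""
    n_extreme = sum(1 for s in permuted if s >= observed)
    return (n_extreme + 0.5) / (len(permuted) + 1)

# Hypothetical log-rank statistics from 1000 permutations.
none_extreme = [1.0] * 1000
one_extreme = [1.0] * 999 + [30.0]

p_best = permutation_p_value(25.0, none_extreme)  # 0.5 / 1001
p_next = permutation_p_value(25.0, one_extreme)   # 1.5 / 1001
```

Under this formula with M = 1000, N = 0, 1, and 2 give approximately 5e−04, 1.5e−03, and 2.5e−03, respectively.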

Application to other TCGA cohorts

To test the broader applicability of our LASSO-based signature development approach, we selected three other TCGA studies: skin cutaneous melanomas (SKCM), sarcomas (SARC), and ovarian serous cystadenocarcinomas (OVCA). We compared the resulting log-rank statistic for the LASSO-based patient stratification to that based on published unweighted pathway-driven protein signatures12. For each of the three datasets, we computed unweighted pathway scores for the 10 literature-curated pathways listed in Supplementary Table S1 and evaluated the model performances using the permutation test with 1000 permutations as was done for KIRC. LASSO-derived protein signature scores were derived as described for KIRC and were evaluated using the 1000-permutation test.