Risk subtyping and prognostic assessment of prostate cancer based on consensus genes

Prostate cancer (PCa) is the most frequent malignancy in male urogenital system around worldwide. We performed molecular subtyping and prognostic assessment based on consensus genes in patients with PCa. Five cohorts containing 1,046 PCa patients with RNA expression profiles and recorded clinical follow-up information were included. Univariate, multivariate Cox regression analysis and least absolute shrinkage and selection operator (LASSO) Cox regression were used to select prognostic genes and establish the signature. Immunohistochemistry staining, cell proliferation, migration and invasion assays were used to assess the biological functions of key genes. Thirty-nine intersecting consensus prognostic genes from five independent cohorts were identified. Subsequently, an eleven-consensus-gene classifier was established. In addition, multivariate Cox regression analyses showed that the classifier served as an independent indicator of recurrence-free survival in three of the five cohorts. Combined receiver operating characteristic (ROC) analysis achieved synthesized effects by combining the classifier with clinicopathological features in four of five cohorts. SRD5A2 inhibits cell proliferation, while ITGA11 promotes cell migration and invasion, possibly through the PI3K/AKT signaling pathway. To conclude, we established and validated an eleven-consensus-gene classifier, which may add prognostic value to the currently available staging system.

Prognostic assessment of the eleven-consensus-gene classifier in five cohorts. Referring to the subgroups, in the training MSKCC cohort, 70 patients were divided into the high-risk group, and the other 70 patients belonged to the low-risk group (Fig. 2a). Patients in the high-risk subgroup showed an unfavorable prognosis (log-rank P < 0.001, Fig. 2f), with AUC values of 0.908 at 1 year, 0.898 at 3 years, and 0.857 at 5 years (Fig. 2k). The predicting classifier also applied well in four external datasets. The K-M curves showed similar RFS outcomes in the GSE116918 (log-rank, P < 0.001), GSE70768 (log-rank, P = 0.049), GSE70769 (log-rank, P < 0.001), and TCGA-PRAD cohorts (log-rank, P < 0.001) ( Fig. 2b-e, g-j). The classifier showed moderate predictive accuracy in all four validation cohorts (AUC values of 0.936, 0.735, and 0.705 at 1, 3, and 5 years in the GSE116918 cohort; AUC values of 0.816, 0.706, and 0.554 at 1, 3, and 5 years in the GSE70768 cohort; AUC values of 0.858, 0.806, and 0.745 at 1, 3, and 5 years in the GSE70769 cohort; AUC values of 0.717, 0.711, and 0.641 at 1, 3, and 5 years in the TCGA-PRAD cohort, Fig. 2l-o). We also validated the prognostic value of the eleven-consensus-gene classifier in the external GSE46602 cohort, and we revealed that patients with higher-risk score had a poor prognosis (P = 0.033, HR = 3.55, 95% CI = 1.108-11.385), with a prognostic AUC value of 0.760 ( Supplementary Fig. 1).
To further investigate the clinical application value of the eleven-consensus-gene classifier, we performed the K-M analyses in different clinicopathological subgroups. The signature precisely subclassified the high-and low-risk groups of PCa patients into different subgroups with an adequate number of samples but failed in some conditions, potentially attributing to the small sample size ( Supplementary Fig. 2). For example, in the MSKCC cohort, which comprised 140 PCa cases, the results suggested that the classifier significantly discriminated the high-and low-risk subgroups in separate age (≥60 vs. <60 years old), PSA level (>10 vs. ≤10 ng/dl), pathological tumor stage (T3 + T4 vs. T1 + T2), and Gleason score (>7 vs. ≤7) subgroups (all, log-rank P < 0.05). For the GSE70769 cohort, which comprised 85 PCa cases, we also revealed that the classifier significantly discriminated the highand low-risk subgroups of the different PSA (>10 vs. ≤10), tumor stage (T3 + T4 vs. T1 + T2) subgroups (log-rank P < 0.05), Gleason score (>7 vs. ≤7) subgroups, and surgical margin (negative vs. positive) (all, log-rank P < 0.05). Further clinical trial is warranted to verify our findings.
We also compared the prognostic value of eleven-consensusgene classifier, ISUP, EAU risk group by the ROC curve, it is gratifying that the eleven-consensus-gene classifier showed a comparable prognostic value to ISUP, EAU risk group in MSKCC cohort (AUC value, classifier: 0.  Supplementary Fig. 4c), and a preferable prognostic value in GSE116918 cohort (AUC value, classifier: 0.703vs. ISUP: 0.594 vs. EAU: 0.596; Comparison P-value, classifier vs. ISUP: P = 0.046, classifier vs. EAU: P = 0.017, Fig. 3e). Taken together, the elevenconsensus-gene classifier showed a comparable prognostic value with ISUP and EAU risk group.
IHC validation the protein of SRD5A2 and ITGA11. We chose SRD5A2 and ITGA11 for further validation, due to the high weight of these two genes in the risk score formula, 1.20428 for ITGA11 and 0.71137 for SRD5A2. What's more, several studies reported the association between SRD5A2 polymorphism and PCa risk, while rarely study reported the function of ITGA11 in PCa, therefore, we final chose these two genes for experimental validation. To address and confirm the associations of SRD5A2 and ITGA11 protein levels with clinicopathological features, we used IHC assay on a prostate cancer tissue array, which contains tumor tissues from 42 patients. The standard definition of the SRD5A2 protein level is described in the Methods section. Then, we calculated the staining density of each tissue. Tissues with score equal to or higher than 3 were regarded as positive, while those with score less than 3 were negative.
We observed decreased protein expression of SRD5A2 in advanced tumor stages (Gleason ≤ 7 vs. Gleason >7, P = 0.031, Stage I + II vs. Stage III + IV, P = 0.148, Gleason 3 + 4 vs.   Fig. 4a). Moreover, with the help of Fisher's extract test, we investigated the different distributions of SRD5A2 expression (strong positive, weak positive, or negative) in different clinicopathological subgroups. We revealed a negative association of SRD5A2 and tumor progression, reflected by the Gleason score (P = 0.013) and pathological tumor stage (P = 0.047) ( Table 2). Regarding ITGA11, we observed elevated protein expression in the advanced stage compared with the early stage (Gleason ≤ 7 vs. Gleason > 7, P = 0.067, Stage I + II vs. Stage III + IV, P = 0.014, Gleason 3 + 4 vs. Gleason 4 + 3, P = 0.028, Fig. 4b). Fisher's extract test of the categorical variables illustrated similar results. These patients whose Gleason score was higher than 7 or who were in tumor stages III and IV showed strong positive staining of ITGA11 (Gleason score, P = 0.049, pathological tumor stage, P = 0.022, Table 3). All these results indicated that the SRD5A2 protein is negatively associated with the progression of PCa, while the ITGA11 protein is positively associated with the advanced stage.
Knockdown of SRD5A2 and ITGA11 impacts prostate cancer cell behaviors. After knocking down the expression of SRD5A2 (P < 0.05, Supplementary Fig. 5), we found that cell proliferation was significantly increased, as determined by MTT assay and colony formation assays, in C4-2 and PC-3 cells (all P < 0.05, Fig. 5a, b). Since the functional role of SRD5A2 in regulating PCa cell migration and invasion has been investigated by Suruchi Aggarwal et al. 13 , we only focused on the proliferation effects here. In contrast, we found that silencing ITGA11 expression decreased the migration and invasion of C4-2 and PC-3 cells but not cell proliferation (all P < 0.05, Fig. 5c, d). These results confirm the tumor-suppressive role and oncogenic role of SRD5A2 and ITGA11, respectively.
Exploring the underlying mechanisms of how ITGA11 regulates PCa progression. Many studies have already demonstrated the mechanisms of how the selected 11 genes influence the progression of PCa, while few studies have focused on the role of ITGA11. We conducted Pearson correlation analyses to identify the highly coexpressed genes in each cohort and overlapped these genes derived from all five cohorts. Then, KEGG analysis was used to enrich the significant signaling pathways. We found that ITGA11 might be involved in the regulation of calcium signaling, Rap1 signaling, Ras signaling, and PI3K/Akt signaling pathways (Fig. 6a). Moreover, we collected two gene sets that could reflect the activation status of PI3K/AKT signaling and calculated the NES score of each patient by ssGSEA. We observed that the elevated expression of ITGA11 was linked with the increasing NES score of the HALLMARK PI3K/AKT/mTOR signaling gene set (P < 0.05, r = 0.43, Fig. 6b) and the REACTOME PI3K/AKT activation gene set (P < 0.05, r = 0.34, Fig. 6c). In addition, we validated the function of ITGA11 in the activation of PI3K/AKT signaling in vitro. After knocking down the expression of ITGA11, we found that Ser473 phosphorylated-AKT1/2/3 was significantly decreased instead of total AKT1/2/3 ( Fig. 6d,   Comparison between the eleven-consensus-gene classifier and proposed signatures. We calculated the risk score of the current classifier, Zhang et al.'s score, Liu et al.'s score and CCP score in the MSKCC, GSE70768, GSE70769, GSE116918, GSE46602, and TCGA-PRAD cohorts, respectively. It is gratifying that we observed that the eleven-consensus-gene classifier showed better prognostic value than the other three signatures in the MSKCC, GSE70768, GSE70769, and GSE46602 cohorts (Fig. 7a). The eleven-consensus-gene classifier showed comparable prediction efficiency with other signatures in the GSE116918 cohort and TCGA-PRAD cohort (Fig. 7a).

Discussion
The interpatient heterogeneity in PCa is well recognized [14][15][16] . However, the molecular stratification of PCa based on predictive biomarkers to guide treatment selection has not yet been applied in the clinic. In our study, we analyzed five datasets derived from the GEO and TCGA databases to generate an eleven-consensusgene classifier (Fig. 7b). We first employed univariate Cox regression analyses and identified 39 candidate genes that are closely related to the RFS of PCa patients in all five datasets. The RFS predicting classifier was established by the LASSO Cox regression analysis based MSKCC dataset. The classifier showed satisfying molecular subtyping accuracy determined by the logrank, K-M, and ROC analyses in both the training and four external validation cohorts. Furthermore, the multivariate analyses suggested that the classifier served as an independent indicator of RFS in a set of cohorts. Notably, the combined ROC curve, which synthesized the classifier with clinicopathological features, added prognostic value to the currently available staging system. We conducted Pearson correlation analyses to determine the highly coexpressed genes in each cohort and overlapped these genes derived from all five cohorts. Then, we employed KEGG pathway analyses to reveal the underlying mechanisms of these critical candidates influencing tumor progression. For the eleven candidates, Yu et al. 17 reported that CDKN3 downregulated the expression levels of cell-cycle-and DNA-replication-related proteins. It has also been reported that abundant miR-92a-1-5p from PCa exosomes can downregulate COL1A1 and thus promote osteoclast differentiation and inhibit osteoblast genesis 18 . Pan et al. 19 revealed the higher level of DPP4 in malignant prostate tissue than that in benign prostate tissue, its expression correlated with PSA and tumor stage. Kamata et al. 20 reported that zinc finger mutation of PARP7 can result in the loss of PARP7, and further impact the enhancement of AR-dependent transcription of the MYBPC1 gene. Wu et al. 21 found that silencing of NOX4 can contribute to the decreasing of lactate production, glucose uptake, ATP production, and cell proliferation but increasing the apoptosis. In aggressive prostate cancers, the oncoprotein STMN1 is often overexpressed, Chakravarthi et al. 22 reported that CtBP1-regulated miR-34a modulates STMN1 expression and involved in the progression of prostate cancer via the regulation of GDF15. Thus, we predicted the biological function of ITGA11 through the bioinformatic method indicated above, and the results indicated that ITGA11 might be involved in the regulation of calcium signaling, Rap1 signaling, Ras signaling, and PI3K-Akt signaling pathway activity.  pathway 13 . Ntais et al. 23 also reported that the A49T and TA repeat polymorphisms of SRD5A2 can increase the PCa susceptibility to human beings, elevating the important function of SRD5A2 in PCa. We further investigated the functional role of these two candidates in regulating cancer cell fates, as well as the protein expression in clinical samples. Our results confirm the tumor-suppressive role of SRD5A2 and the oncogenic role of ITGA11 in PCa. Next, we validated the pathway enrichment results based on coexpressed genes by western blot assay. Our results suggested that silencing ITGA11 suppresses the activity of AKT signaling, indicating that ITGA11 promotes PCa cell progression potentially through activating AKT signaling. Overall, we successfully established a solid prognosis prediction system. In recent days, several prognostic classifiers have been developed to predict the outcome of PCa patients based on clinical features, gene genetics, or epigenetics. Zhao et al. 24 reported that PD-L2 is a prognostic biomarker for PCa based on patients, and they also reported that the infiltration of T cells and macrophages is increased in the poor outcome group, which is also consistent with our work that M2 macrophages are linked with unfavorable   26 illustrated an African-American specifically automated stromal classifier, which has the potential to substantially improve the accuracy of prognosis and risk stratification. Yang et al. 27 established a 28-hypoxia-related-gene prognostic classifier for localized PCa, which could predict biochemical recurrence and metastasis events. A Gleason score of 4 + 3 is resulted in almost 3-fold metastasis risk at diagnosis compared with a Gleason score of 3 + 4, although the overall incidence is low 28,29 . In our study, we investigated the value of this classifier in distinguishing patients with Gleason score 3 + 4 and 4 + 3 subgroups in the MSKCC, TCGA-PRAD, GSE70768, and GSE70769 datasets. We observed that patients with Gleason score 4 + 3 had a higher-risk score than patients with Gleason score 3 + 4. In addition, the risk score distinguished the 3 + 4/ 4 + 3 subgroups with good accuracy, a result consistent with Fisher's extract test. Another limitation of these studies was that their findings were not validated in two more independent cohorts, and the potential mechanisms of how these markers influence tumor progression were not predicted or investigated. Herein, we established an eleven-consensus-gene classifier and validated its usage in five independent cohorts. We also predicted the potential mechanisms through bioinformatic methods. Thus, our findings are stable and convincing.
The advantages of the current study are summarized and presented as follows. First, we identified 39-consensus prognostic genes from five independent cohorts, and with the help of LASSO Cox regression analysis, we chose the eleven most suitable candidates to establish the RFS prediction classifier. The classifier showed satisfying molecular prognostic subtyping accuracy determined by the log-rank, K-M, and ROC analyses in all five cohorts. Second, we further confirmed that the eleven-consensusgene classifier serves as an independent RFS predictor through multiple platforms and provides a novel method for the prognosis predition. Third, the eleven-consensus-gene classifier is not dependent on the Gleason score as compared with ISUP grading group and EAU risk group, therefore the prediction accuracy will not be impacted by the work experience of pathologist. Forth, we predicted the potential mechanisms of how these critical candidates influence the progression of PCa, which would benefit the development of targeted drugs. However, the lack of survival analysis in our samples and the lack of systematic functional studies to show the function and mechanisms of these consensus genes are the major limitations of the current study.
Our study successfully classified PCa patients with different prognostic outcomes under a consensus ensemble framework using a large clinical cohort of 1046 cases, ending up with an eleven-consensus-gene classifier. The classifier shows comparable prognostic value with ISUP Gleason group and EAU risk group, and also presented a preferable prognostic value after combined with other major clinical features. Comparing with wholetranscriptome profile, target gene profiling panel of small number of genes is wildly applied in the clinical with the high costperformance ratio. Therefore, the eleven-consensus-gene classifier is promising to be applicable in clinical setting to propel the prognosis prediction for PCa patients.

Methods
Data preparation and processing. We searched the Gene Expression Omnibus (GEO) to enroll eligible datasets that met the following criteria: (1) PCa cases with available expression data, and (2) available clinicopathological features, particularly RFS status and time. Then, the gene-expression profiles were generated from four eligible GEO datasets [GSE116918 30 , GSE70769 31 , GSE70768 31 , and GSE21032/ Memorial Sloan Kettering Cancer Center (MSKCC) 32 ], as well as the geneexpression profile from The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD, https://www.cancer.gov/tcga). For gene-expression profile of TCGA-PRAD, the number of fragments per kilobase million (FPKM) was computed and converted into transcripts per kilobase million (TPM) and further log 2 transformed, which showed more similarity to the numbers obtained from microarray analysis and improved comparability between samples, the ensemble IDs were mapped to gene symbols along with the GENCODE 27 file (https:// www.gencodegenes.org/human/release_27.html). For the gene symbol with more than one probe IDs, the mean value was calculated as its expression value. We also removed the potential cross-dataset batch effect via the "sva" package along with the empirical Bayes framework 33 . The matched clinicopathological data were also downloaded along with the expression profiles. Patients who lacked pathological T stage data were excluded. We also identified the ISUP groups and EAU risk groups by Gleason score and PSA value 34,35 .
Univariate Cox regression analysis and consensus-gene selection. We conducted the univariate Cox regression analysis to identify the consensus RFS-related genes from five datasets. Subsequently, we intersected these RFS-related candidates that appeared in all five datasets, with a cutoff P-value of less than 0.05. The following analyses were performed based on these selected genes.
Classifier establishment and validation. According to the results provided by univariate Cox regression analysis, we employed least absolute shrinkage and selection operator (LASSO) Cox regression to select stable prognostic candidates. LASSO is a regression method that uses both regularization and variable selection to elevate the prediction accuracy and interpretability of the results. We calculated The median value of the risk score was set as the cutoff value in each cohort, and patients with risk scores lower than the median value were assigned to the lowrisk subgroup, while others belonged to the high-risk subgroup. The risk score of patients in the other four external validation cohorts, TCGA-PRAD, GSE70768, GSE70769, and GSE116918, was also calculated by this risk formula, and then these patients were dichotomized into two different risk subgroups by the median risk score in each cohort.
Survival and receiver operating characteristic (ROC) analyses. Survival analyses were executed using the "survminer" package (https://github.com/ kassambara/survminer), with BCR as the endpoint. Furthermore, the area under the ROC curve (AUC) was employed to assess the predictive value of the formula.
The comparison between two ROC curves was also conducted by the "pROC" package. Furthermore, subgroup analyses were executed to test the accuracy of the classifier in different clinicopathological subgroups, such as different Gleason score (≤7 vs. >7), pathological tumor stage (T1 + T2 vs. T3 + T4), and age (≤60 vs. >60).
Assay of cell proliferation, migration, and invasion. To evaluate the impact of SRD5A2 and ITGA11 on prostate cancer cells, we employed MTT and colony formation assays to assess the alteration of cell proliferation, while Transwell-based invasion and migration assays were used to evaluate cell migration and invasion. For the MTT assay, 5000 cells were seeded per well of 24-well plates, and the results were collected by adding 50 μL of prepped 5 mg/mL MTT reagent to 450 μL of refreshed medium (with a concentration of 0.5 mg/mL) and incubating at 37°C for 1.5 h. We collected the plates on the 0, 2nd, 4th, and 6th days to block cell viability and stored them at −20°C. On the 6th day, we added DMSO solution to all plates to dissolve the formazan crystals and then read the optical density value at 570 nm to display the cell viability. For colony formation, 800 cells were seeded per well and grown for 12 days. The cells in plates were fixed with 4% paraformaldehyde for 20 min, and 0.05% crystal violet subsequently used to stain these fixed cells for another 20 min.
For the migration assay, Transwell Permeable Supports (Corning Inc., Maine, USA) were used. A total of 1 × 10 5 cells with FBS-free medium were put into the upper chamber of transwell plate, and 500 μL fresh medium with 10% FBS was filled into the lower chamber. The cells that migrated to the bottom of the membranes were fixed with methanol and further stained with 0.01% crystal violet. The steps of the invasion assay were similar to those of the migration assay, which also used permeable supports but with extra Matrigel (Biocoat, Corning, New York, USA) diluted and coated in the upper chambers and incubated for 36 h. The cell numbers were calculated by counting three random fields.
Functional prediction. Increasing evidence indicates that highly coexpressed genes potentially have similar biological functions [40][41][42] , and we identified the coexpressed genes of ITGA11 in the five cohorts (correlation >0.7) by Pearson correlation analysis. After overlapping these coexpressed genes, we performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis to subclassify their functions based on the "clusterProfiler" R package 43 . In addition, we also used Cytoscape (v3.5.1, San Diego, La Jolla, California, USA) to visualize the functional network. Two external gene sets, HALLMARK PI3K/AKT SIGNALING and REACTOME PI3K/AKT ACTIVATION, were employed to assess the association between ITGA11 expression and the activation of the PI3K signaling pathway. Single-sample gene set enrichment analysis (ssGSEA) 44,45 , implemented in the GSVA R package, was applied to calculate the normalized enrichment score (NES) of the above 2 gene sets. For a gene matrix, the enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes, and NES means the corrects for differences in ES between gene sets due to differences in gene set sizes, NES was calculated with below formula: NES ¼ Actual ES Mean ðESs against all permutations of the datasetÞ