Introduction

An estimation of overall survival with or without treatment at the time of prostate cancer (PCa) diagnosis is of the utmost importance for selecting the most appropriate treatment.1, 2, 3, 4 The currently available clinical prognostic tools demonstrate an accuracy of 70–80% for the prediction of biochemical or PSA recurrence, but these tools are less effective at predicting cancer-specific survival and even less accurate at predicting overall survival.1, 2, 3, 4, 5, 6 It is known that PSA recurrence-free survival cannot be used as a reliable surrogate for PCa-specific or overall survival, as the clinical outcomes of recurrence can be highly variable. This is largely due to the variability in the survival benefit conferred by hormone or castration therapy.2

Currently, a primary treatment decision after the diagnosis is based on an overall evaluation of both tumor and patient risk factors. Tumor factors include serum PSA level, biopsy Gleason score and clinical stage, and patient factors include age, performance status and other diseases, that is, comorbidity. The challenge is to identify the most effective treatment that the patient can tolerate. Even though currently available clinical prognostic tools for the prediction of biochemical recurrence are valuable in clinical practice, a tool for estimating overall survival would improve the treatment decision.

Whole-genome expression analyses of tumor samples may identify new biomarkers that could improve the accuracy of survival prediction. However, most previous studies have identified biomarkers that only predict PSA recurrence-free survival, mainly due to limited clinical follow-up. Only a few recent studies have identified genomic markers associated with lethal forms of PCa.7, 8, 9, 10, 11 This inability to predict overall survival is also due to the fact that the primary focus has been on tumor biological aggressiveness or tumor risk factors. However, it has been shown that about 50% of the patients can die of causes other than PCa. The observation underlines that the patient risk factors have the same importance as the tumor risk factors.

Several studies have demonstrated the use of embryonic stem cell (ESC) gene expression signatures for determining subtype classification and prognosis of various cancers, including PCa, as discussed in the review by Glinsky.8 We have further developed this concept into a hypothesis of ESC gene predictors (ESCGPs) with the following reasoning: (1) ESCs are the origin of differentiated tissue cells, tissue stem cells and cancer stem cells. (2) Genes that are important in maintaining ESC status and regulating differentiation are also important in maintaining cancer stem cell status and abnormal differentiation (dedifferentiation). (3) Genes with significant expression variations among different ESC lines are not important in this respect. (4) Genes that show consistently high or consistently low expression levels across various ESC lines are equally important in maintaining ESC status. Different expression patterns of these genes determine the development of different normal or cancer tissues. These genes are here named ESCGPs. (5) These ESCGPs may be expressed not only in cancer stem cells but also in cancer cells, and their expression can be measured by microarray, reverse transcription-polymerase chain reaction (RT-PCR) or quantitative PCR (qPCR). (6) Different expression patterns of these ESCGPs measured in cancer tissues can reflect the biological aggressiveness of the cancer and predict the efficacy of treatment and patient survival.12, 13, 14

To evaluate this hypothesis experimentally, we analyzed fine-needle aspiration (FNA) biopsy samples from 189 PCa patients with nearly complete follow-up data. The cohort is unique in that it contains high-quality fresh–frozen tumor samples and nearly complete survival data. The majority of the patients did not receive radical treatment, that is, radical surgery or radiation. The clinical outcomes of these patients may therefore closely resemble the natural history of PCa. We report that an ESCGP expression signature at diagnosis could indeed estimate overall, PCa-specific and non-PCa-specific survival in this cohort. If our findings can be validated in contemporary cohorts dominated by radical surgery or radiation therapy, the signature may become an important complement when selecting the therapeutic modality for each individual patient.

Materials and methods

Written approval from the local ethics committee was obtained for the molecular analysis of biological samples from PCa patients. This study was conducted in a stepwise manner. The procedure for selecting and verifying genes is outlined in Figure 1 and described in detail as follows.

Figure 1

Outline of a stepwise gene selection process. (a) Identification of 641 embryonic stem cell gene predictors (ESCGPs) by bioinformatic analysis. Previously published data sets of whole-genome complementary DNA microarrays derived from five human ESC lines and 115 human normal tissues from various organs were retrieved from the Stanford Microarray Database (SMD). After a data-centering process, a sub-data set with expression profile of 24 361 genes in the ESC lines was isolated from the combined whole data set. A single-class significance analysis of microarray (SAM) was performed and a SAM plot was generated. The 328 genes with the highest expression levels and 313 genes with the lowest expression levels were identified, in total 641 ESCGPs. (b) Identification of 258 ESCGPs in prostate cancer (PCa). PCa ESCGPs were identified by matching the list of the 641 ESCGPs and the list of 5513 genes published by Lapointe et al.9 When clustering the 112 PCa tissue samples and comparing the cluster results when using all 5513 genes and when using only the 258 ESCGPs present in the data set, nearly identical results were obtained. Sample labeling: PL, lymph node metastasis; PN, normal prostate tissue; PT, prostate tumor. Three cases (marked green) were placed in different classification positions and two cases (purple) were consistently misclassified. (c) Selection of important candidate ESCGPs for clinical survival correlation. Of 258 PCa ESCGPs, 34 genes were selected by their high-ranking order in the SAM analysis identifying significant genes for the subtype classification or for the discriminating between tumor and normal samples. Of these 34 ESCGPs, 19 were selected based on their markedly different expression patterns and robust performances in RT-PCR reactions (Supplementary Figure S1). The 19 ESCGPs and the 5 reported genes were included in the optimization of the 4-plex qPCR method using RNAs from PCa cell lines. (d) Identification of the ESCGP signature in Subset 1. 
After the 4-plex qPCR optimization, the method was used to analyze 36 fresh–frozen fine-needle aspiration (FNA) biopsies taken from PCa patients (Subset 1). RNAs could be extracted in 28 biopsies. A series of cluster analyses using different gene combinations revealed that the ESCGP signature VGLL3, IGFBP3 and F3 classified Subset 1 samples into three subtypes with strong survival correlations. The level of gene expression increases from blue to red, whereas the delta Ct value decreases from blue to red. Gray areas represent missing qPCR data.

Identification of candidate ESCGPs

Previously published data sets of whole-genome cDNA microarrays derived from five human ESC lines15 and 115 human normal tissues from various organs16 were retrieved from the Stanford Microarray Database (SMD) (Figure 1a). We first combined the two retrieved data sets and used the data set of normal tissues to normalize the subset of ESC lines by a data-centering process.15, 16 A sub-data set containing the whole-genome expression profile of the 24 361 genes in the ESC lines was then isolated from the combined data set. A single-class significance analysis of microarrays (SAM)17 was performed using the subset of the ESC lines only, whereby all genes were ranked according to the consistency (without significant variations) of their expression levels across the ESC lines, on the assumption that genes with significant expression variation between ESC lines would not be critical for maintaining stem cell-like status. Significant genes were selected at delta 0.23 and q-value 0.05.
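For orientation, the ranking statistic of a single-class SAM analysis can be written out as follows. This is the form in which the one-class SAM statistic is commonly described, not a formula taken from this study; the symbols x_ij (centered expression of gene i in ESC line j), n (number of ESC lines, here five) and s_0 (a small stabilizing constant) are notation assumed for illustration.

```latex
d_i = \frac{\bar{x}_i}{s_i + s_0}, \qquad
\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \qquad
s_i = \sqrt{\frac{\sum_{j=1}^{n}\left(x_{ij}-\bar{x}_i\right)^{2}}{n\,(n-1)}}
```

Under this form, a gene obtains a large positive or negative d_i only when its mean expression across the ESC lines is far from zero and its variation between lines is small, which matches the selection of consistently high and consistently low genes described above.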

Selection of candidate ESCGPs in PCa

An independent data set9 with 112 prostate tissue samples was used to verify the ESCGP findings and to select ESCGPs for PCa (Figure 1b). The list of genes in the published data set was matched to the list of the candidate ESCGPs identified in Step 1. This resulted in a shorter list of genes and the expression data of these genes were used to repeat the cluster analysis. The result was compared with the original cluster.

Refining ESCGPs selection using RT-PCR and multiplex qPCR analyses of three PCa cell lines

A SAM analysis was performed using the list of PCa ESCGPs identified in Step 2 (Figure 1c). The high-ranking ESCGPs and an additional five reported genes were selected for further analysis. A 4-plex qPCR method was optimized for the quantification of these genes by using RNAs from three PCa cell lines (LNCaP, DU145, PC-3). The procedures used for the isolation of total RNA from cell cultures and FNA cytology smears, cDNA synthesis, RT-PCR and multiplex qPCR analysis are described in the Supplementary Information file, Supplementary Tables S1–S3 and Supplementary Figures S1 and S2. For the qPCR analysis, the expression level of each gene in a sample was normalized to that of GAPDH (glyceraldehyde 3-phosphate dehydrogenase) and was presented as the delta Ct value, which is inversely correlated to the gene expression level.18, 19 This delta Ct value was centered by the median value across all samples, and the centered delta Ct value was then used for the statistical and k-nearest neighbor (kNN) analyses.
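The normalization described above (delta Ct relative to GAPDH, followed by median centering across samples) can be illustrated with a minimal sketch. The function names and the Ct values are hypothetical and are not study data; missing measurements are represented by None, which is an assumption about data handling.

```python
from statistics import median

def delta_ct(ct_target, ct_reference):
    """Delta Ct = Ct(target) - Ct(reference); a higher value means lower expression."""
    return ct_target - ct_reference

def median_center(values):
    """Subtract the median of the non-missing values; keep None for missing data."""
    present = [v for v in values if v is not None]
    m = median(present)
    return [None if v is None else v - m for v in values]

# Hypothetical Ct values for one gene and for GAPDH in five samples.
ct_gene = [27.0, 29.5, 26.0, 31.0, 28.0]
ct_gapdh = [18.0, 18.5, 18.0, 19.0, 18.0]

raw = [delta_ct(g, r) for g, r in zip(ct_gene, ct_gapdh)]
centered = median_center(raw)
print(raw)       # delta Ct per sample
print(centered)  # median-centered delta Ct per sample
```

Because delta Ct is inversely related to expression, a negative centered value indicates expression above the cohort median for that gene.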

Establishing the clinical importance

FNA samples

The FNA samples were collected according to the routine procedure used at the Clinical Pathology/Cytology unit at Karolinska University Hospital in Stockholm, Sweden20 (Figure 1d). Multiple cytology smears were obtained by prostate FNA procedure in each patient at the time of diagnosis. The representative smear was identified by examination of Giemsa-stained slides and was used for the clinical cytological diagnosis. The remaining fresh smears on glass slides that were duplicates of the Giemsa-stained slide used for diagnosis were freshly frozen and kept at –70 °C. We found that at least 80% of all cells collected from most of the PCa FNA samples were cancer cells. Of the 241 FNA samples that were collected from the patients, we obtained good-quality total RNAs from 193 samples; 189 of these samples were from patients with a diagnosis of PCa. The researcher who performed the 4-plex qPCR analyses of the FNA samples was not informed of the relevant clinical data until the complete data set was constructed from both the qPCR results and the clinical data.

Clinical characteristics of the cohort

The 189 PCa patients were diagnosed between 1986 and 2001. During this time period in Sweden, PCa diagnoses were mainly confirmed by prostate FNA cytology rather than by performing a multiple-core biopsy of the prostate.20 Elderly men without lower urinary tract symptoms were seldom tested for their serum PSA levels. The average PSA level at the time of PCa diagnosis was therefore higher during this study than the level currently observed. In this cohort, very few patients had an indolent cancer, over 50% of patients had high-grade, advanced cancer, and hormone therapy was the primary treatment for 77.9% of the patients (Table 1). With regard to comorbidity, 40% of the patients had cardiovascular disease and 9% had diabetes. An intern who was not informed of the results of the molecular analyses collected the relevant clinical data under the supervision of an oncologist. Information on the date of diagnosis, the date of death and the cause of death for all patients was first obtained from regional or national registries and was then verified by examining the medical records. Diagnosis and cause of death were coded according to the International Classification of Diseases (ICD9 and ICD10) recommended by the World Health Organization.21 PCa-specific mortality was assigned to cases where PCa or metastases were the primary or secondary cause of death. The causes of death are listed in Supplementary Table S8. By 31 December 2008, 22 of the original 189 patients were still alive, 163 were deceased and 4 could not be found in the registries (Table 1).

Table 1 Characteristics of the patients

Statistical analysis

Sample size and design of the subsets. The details of these procedures are provided in the Supplementary Information file. The patient cohort was divided into three subsets according to the diagnoses and the experimental time order (Table 1). For Subset 1, we evaluated the strongest candidate genes from the ESCGP list. For Subset 2, we evaluated the most significant genes from Subset 1 and selected genes (reported genes) from the literature. For Subset 3, we tested the genes that demonstrated significance in Subset 2 and a limited number of genes that showed significance in Subset 1 but were not tested in Subset 2. A summary of the genes tested in the different subsets and in the complete set is shown in Table 2. Two major factors meant that not all candidate genes could be tested in every sample: the insufficient amount of total RNA isolated from most FNA samples and the fixed gene combinations of the 4-plex qPCR.

Table 2 Number of patients’ samples for gene expression profiling

Definition of important parameters. The details and references for the parameters are provided in the Supplementary Information file.

Survival analysis. The univariate and multivariate hazard ratios were calculated according to the Cox proportional hazard model using Stata statistics software (version 10.1; StataCorp LP, College Station, TX, USA). The Kaplan–Meier plots and statistical box plots were made using JMP statistics software (version 8.0.1; SAS Institute, Cary, NC, USA).
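As a reminder of what a Kaplan–Meier plot displays, the product-limit estimate can be sketched in a few lines. This is a generic illustration and not the JMP implementation used in the study; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times  : follow-up time per patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up times in years; 0 marks a censored patient.
times = [1, 2, 2, 3, 5]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk up to their last follow-up but never trigger a step in the curve, which is why the estimate remains usable with incomplete follow-up.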

Cluster analysis. Gene expression data were evaluated using the unsupervised hierarchical clustering method and the gene median-centered delta Ct values; the results were visualized using Treeview software (Eisen Lab, University of California at Berkeley, CA, USA).22 Unsupervised hierarchical clustering is based on similarity measures and identifies clusters as groups of similar patterns.

Parametric model design. A first-order polynomial model using the selected genes was designed based on the assumption of the Weibull distribution in Stata statistics software (StataCorp LP). Of the 95 patients for whom data on the expression pattern of the ESCGP signature were available, 87 had expression data and clinical information that could be used for Weibull regression survival prediction. Two models were made, one using clinical parameters alone and one combining clinical parameters and the ESCGP signature. The clinical parameters included PSA level (>50 vs ≤50 ng ml−1), clinical disease stage (advanced vs localized), tumor grade (poorly vs well/moderately differentiated) and age at the time of diagnosis.
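For readers unfamiliar with Weibull regression, one common way of writing such a model with a first-order (linear) predictor is shown here. The proportional-hazards parameterization is an assumption made for illustration, since the text does not state which parameterization was used; x denotes the covariate vector (clinical parameters, with or without the tumor subtype), β_0 and β the regression coefficients and p the Weibull shape parameter.

```latex
h(t \mid x) = p\,t^{\,p-1}\exp\!\left(\beta_0 + \beta^{\top}x\right), \qquad
S(t \mid x) = \exp\!\left(-\exp\!\left(\beta_0 + \beta^{\top}x\right)t^{\,p}\right)
```

Evaluating S(t | x) at t = 5 years gives a predicted 5-year survival probability for each patient, which is the kind of quantity that can be thresholded to obtain sensitivity and specificity.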

kNN modeling. The data set was randomly divided into a training set (70% of the data set, n=139) for model development and a test set (30% of the data set, n=50) for model verification. Four different kNN models estimating overall survival were designed and optimized on the training set data (Table 5). One of the models had only clinical parameters, one had only the ESCGP signature and two had combinations of clinical parameters and the ESCGP signature. In all cases, models were applied only to cases without missing data, Euclidean distance measures were used and the average survival time of the three nearest neighbors was calculated as output. The scaling of the parameters of each model was determined through an exhaustive search of all combinations of the scaling factors 1, 3 and 9. A random number generator with a distribution similar to that of overall survival in the data set was used as the reference for a random guess of overall survival. For all models and the random number generator, prediction performance was evaluated by comparing the average and variance of the absolute prediction error. kNN is a pattern-based classification tool that assigns an unknown case to the same group as the most similar reference cases, meaning that kNN can classify data sets in which there is no simple univariate relationship between gene expression and patient outcome.
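The kNN procedure described above (scaled Euclidean distance, mean survival of the three nearest neighbors, exhaustive search over the scaling factors 1, 3 and 9) can be sketched as follows. The text does not state how each scaling combination was scored on the training set, so the leave-one-out mean absolute error used here is an assumption; all function names and data are hypothetical.

```python
from itertools import product
from math import sqrt

def knn_predict(train_x, train_y, query, scales, k=3):
    """Average survival time of the k nearest training cases (scaled Euclidean distance)."""
    def dist(a, b):
        return sqrt(sum((s * (ai - bi)) ** 2 for s, ai, bi in zip(scales, a, b)))
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_abs_error(train_x, train_y, scales, k=3):
    """Leave-one-out mean absolute prediction error (assumed scoring rule)."""
    errors = []
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        predicted = knn_predict(rest_x, rest_y, train_x[i], scales, k)
        errors.append(abs(predicted - train_y[i]))
    return sum(errors) / len(errors)

def best_scales(train_x, train_y, factors=(1, 3, 9), k=3):
    """Exhaustive search over every combination of per-parameter scaling factors."""
    n_features = len(train_x[0])
    return min(product(factors, repeat=n_features),
               key=lambda s: mean_abs_error(train_x, train_y, s, k))

# Hypothetical cases: two centered parameters per patient, survival in years.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [2.0, 4.0, 6.0, 10.0]
prediction = knn_predict(train_x, train_y, (0.1, 0.1), scales=(1, 1))
scales = best_scales(train_x, train_y)
print(prediction, scales)
```

Scaling a parameter by 9 rather than 1 makes differences in that parameter dominate the distance, so the search effectively decides how much weight each clinical or gene variable receives when neighbors are chosen.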

Results

Identification of the ESCGP signature by a stepwise process

Identification of candidate ESCGPs

The SAM analysis of public data identified 328 genes with consistently high levels of expression and 313 genes with consistently low levels of expression in ESCs (Figure 1 and Supplementary Table S1), that is, 641 genes in total (Figure 1a).

Selection of candidate ESCGPs in PCa

The ability of the 641 genes to classify tumor subtype was verified on an independent data set of PCa samples,9 wherein the clustering result was almost identical when comparing the complete original data set of 5513 genes and the 258 PCa-related ESCGPs isolated from the same original data set. Therefore, the 641 genes were defined as ESCGPs (Figure 1b).

Refining ESCGPs selection using RT-PCR and multiplex qPCR analyses of three PCa cell lines

Within the 258 verified PCa ESCGPs, the 34 genes of highest ranking order in SAM analyses for the discrimination between tumor and normal samples and between different tumor subtypes were selected for follow-up analysis. In addition, five reported genes based on previously published studies9, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 were included in the same set, both to serve as a positive reference and to investigate whether these known genes could improve the predictive power of the final gene signature. Of the 34 ESCGPs, 19 had both robust performances in RT-PCR reactions and clearly different expression patterns in PCa cell lines (Figure 1c and Supplementary Figure S1). The 19 ESCGPs and 5 reported genes (Table 2) were included in an optimization of 4-plex qPCR using RNAs from PCa cell lines. One gene (MAF) has two different mRNA transcripts (c-MAF-a and c-MAF-b), and both were included. After the optimization, the 4-plex qPCR method was ready to use for the analysis of FNA samples taken from PCa patients.

Establishing the clinical importance

Finally, the potential additive or synergistic effects of different combinations of the 10 significant genes identified in the univariate analysis were explored (Table 3). Using the data of 36 patient samples in the pilot subset (Subset 1), a series of cluster analyses were performed using 120 different gene combinations by the selection of k (2≤k≤10) different genes each time. Of these 120 different combinations, the ESCGP signature VGLL3, IGFBP3 and F3 was the best for tumor subtype classification in correlation to survival differences (Figure 1d). The risk of false discovery due to multiple testing is reduced because the survival difference remained correlated with the ESCGP signature in the cluster, univariate and multivariate analyses after inclusion of two additional subsets of patients (Subset 2 with 65 patients and Subset 3 with 88 patients, Table 2).

Table 3 Cox proportional hazards analysis of ESCGPs and various clinical parameters (univariate analysis)

The resulting gene expression data for the complete cohort was subjected to analysis with respect to overall and PCa-specific survival. In the univariate analysis, all of the clinical parameters were significantly correlated with both overall and PCa-specific survival (Table 3 and Supplementary Figures S3 and S4). Of the 25 gene expression markers, 10 (F3, WNT5B, VGLL3, CTGF, IGFBP3, c-MAF-a, c-MAF-b, AMACR, MUC1 and EZH2) were significantly correlated with either overall or PCa-specific survival. Two of these markers (F3 and WNT5B) presented a more significant P-value than did PSA when they were used as continuous variables, and this level of significance remained after a stringent Bonferroni correction was performed for the multiple testing of 30 variables (P<β=0.0016667; Table 3 and Supplementary Table S5). A multivariate analysis was performed to evaluate the influence of clinical parameters on the significance of each gene variable. The number of patients included in the multivariate analysis was smaller than that included in the univariate analysis because several parameters had missing data. In summary, four markers, F3, IGFBP3, CTGF and AMACR, showed correlations with both overall and PCa-specific survival, which were independent of any of the clinical parameters evaluated (Supplementary Tables S6 and S7). Three of these genes (F3, IGFBP3, CTGF) are ESCGPs.
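The Bonferroni threshold quoted above follows directly from dividing a family-wise significance level by the number of variables tested. The level α = 0.05 is an assumption here, inferred from the reported threshold and the 30 variables; β is the symbol used in the text for the corrected threshold.

```latex
\beta = \frac{\alpha}{m} = \frac{0.05}{30} \approx 0.0016667
```

A marker is therefore considered significant after correction only if its individual P-value falls below roughly 0.0017.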

Of the 189 patients evaluated, 87 had data available for all clinical parameters (mainly patients in Subsets 1 and 3) and could be classified into subtypes according to the expression signatures of VGLL3, IGFBP3 and F3. The multivariate analysis for overall and PCa-specific survival revealed that the tumor subtype classification defined by the ESCGP signature was the most powerful survival indicator and was independent of age, PSA level, tumor grade and clinical stage (Table 4). The median overall survival time was 3.23 years for patients with the high-risk subtype, 4.00 years for the intermediate-risk subtype and 9.85 years for the low-risk subtype (Figure 2), and these values corresponded to hazard ratios of 5.86 (95% confidence interval (CI): 2.91–11.78, P<0.001) for the high-risk subtype and 3.45 (95% CI: 1.79–6.66, P<0.001) for the intermediate-risk subtype compared with the low-risk subtype (Table 4 and Figure 2). Kaplan–Meier plots further indicated a clear survival difference between the three subtypes classified using the ESCGP signature (Figure 2 and Supplementary Figures S4 and S5). The difference in overall survival was attributable to differences in both PCa-specific and non-PCa-specific survival (Figure 2). Interestingly, the survival difference between the three tumor subtypes was maintained when only patients treated with hormone therapy were analyzed, and these differences were independent of all other clinical parameters (Figure 3 and Supplementary Figure S5). Results from a separate analysis of the subgroup of patients with cardiovascular disease were in agreement with the results from the complete cohort.

Table 4 Cox proportional hazards analysis of the ESCGP signature and various clinical parameters (univariate and multivariate analyses)
Figure 2

Clear survival difference according to tumor subtype classification based on the embryonic stem cell gene predictor (ESCGP) signature (VGLL3, IGFBP3 and F3). Data were available for evaluation of the ESCGP signature for 95 of the 189 patients. (a) Fine-needle aspiration (FNA) samples from the 95 patients were used to create three tumor subtypes (group 1, red tree; group 2, yellow tree; group 3, blue tree) according to the ESCGP signature. The expression data were evaluated using the unsupervised hierarchical clustering method and the gene median-centered delta Ct values; the results were visualized using Treeview software. The gene expression level increases from blue to red, whereas the delta Ct value decreases from blue to red. Missing data are represented by the gray color. The clinical parameters of each patient are marked by various squares. Empty squares represent a longer survival period, lower PSA level, localized PCa clinical disease stage and a well or moderately differentiated tumor grade. Squares with various fill colors represent a shorter survival period, higher PSA level, advanced clinical disease stage and poorly differentiated tumor grade. (b–d) The overall, PCa-specific and non-PCa-specific survival analyses of the three subtypes are presented as Kaplan–Meier curves. The x and y axes show the time since diagnosis and the survival rate, respectively. The P-values for differences between each of the three tumor subtypes were calculated using a log-rank test, and the P-values marked with stars represent statistical significance (P-value<0.05). Besides the most significant difference between subtypes 1 and 3 shown in the figure, the other P-values between each two subtypes were P1–2=0.063, P2–3<0.001 (b); P1–2=0.063, P2–3<0.001 (c); P1–2=0.523, P2–3=0.070 (d).

Figure 3

Survival difference between the three tumor subtypes classified according to the embryonic stem cell gene predictor (ESCGP) signature in patients primarily treated with castration therapy. Of the 95 patients shown in Figure 2, 65 received castration therapy as their primary treatment. Within this group, clear survival differences could still be observed according to the three tumor subtypes classified based on the ESCGP signature. The overall (upper panel), PCa-specific (middle panel) and non-PCa-specific (lower panel) survival analyses of the three subtypes are shown by the Kaplan–Meier curves. The P-values for differences between each of the three tumor subtypes were calculated using a log-rank test. Besides the most significant difference between subtypes 1 and 3 shown in the figure, the other P-values between each two subtypes were P1–2=0.037*, P2–3=0.001* (a); P1–2=0.009*, P2–3=0.006* (b); P1–2=0.955, P2–3=0.076 (c). The overall survival rates at 5 years of follow-up were 13.6%, 36.0% and 77.8% for groups 1, 2 and 3, respectively.

Survival predictions with the combined use of the ESCGP signature and various clinical parameters

To assess the predictive performance of the selected ESCGP genes, different kNN classification algorithms were developed using the training set to estimate the overall survival.33 When evaluated on the test set (Table 5), the performance of the kNN model using only clinical parameters was similar to the random model, whereas all kNN models including the selected ESCGP genes were significantly (P<0.04) better than the random model. Another illustration of predictive performance was obtained using a parametric model. This model was used to estimate whether the use of tumor subtype classification,34 according to the expression signature of VGLL3, IGFBP3 and F3, could improve the prediction of survival beyond that estimated using the available clinical parameters (Figure 4). Compared with the prediction model that used only the clinical parameters, the addition of the tumor subtype classification improved sensitivity and specificity of the overall survival prediction from 0.775 to 0.800, and from 0.660 to 0.766, respectively (at 5 years; Figure 4). Receiver operating characteristic curves at 5-year survival were estimated to show the sensitivity and the specificity of survival prediction. The area under the receiver operating characteristic curve value was increased from 0.755 to 0.815 in overall survival prediction, from 0.726 to 0.793 in PCa-specific survival prediction and from 0.730 to 0.793 in non-PCa-specific survival prediction, respectively (Figure 4).
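The sensitivity, specificity and area under the ROC curve reported above are standard quantities; a minimal sketch of how they can be computed from predicted risk scores and observed 5-year outcomes is given here. The function names, scores, labels and the 0.5 threshold are hypothetical and do not reproduce the study's calculation.

```python
def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = event (for example death within 5 years), 0 = no event.
    A case is predicted positive when its score is at or above the threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(scores, labels):
    """Probability that a random positive case scores higher than a random
    negative case, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks and observed outcomes for six patients.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, 0.5)
area = auc(scores, labels)
print(round(sens, 3), round(spec, 3), round(area, 3))
```

Sensitivity and specificity depend on the chosen threshold, whereas the area under the curve summarizes discrimination over all thresholds, which is why both kinds of figures are reported side by side.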

Table 5 Analysis of classification error of kNN model performance.
Figure 4

Receiver operating characteristic (ROC) curves for 5-year survival prediction. Prediction of survival time was modeled using a parametric model based on the assumption of the Weibull distribution. ROC curves at 5-year survival prediction show the sensitivity and the specificity of survival prediction. Overall (upper panel), PCa-specific (middle panel) and non-PCa-specific survival (bottom panel) predictions at 5 years were determined by the clinical parameters alone (black lines), and by both clinical parameters and the tumor subtypes classified by embryonic stem cell gene predictor (ESCGP) signature (red lines). The area under the curve (AUC) values of overall, PCa-specific and non-PCa-specific survival predictions were all increased by adding ESCGP signature. Positive predictive value (PPV) and negative predictive value (NPV) both increased.

Discussion

This report discusses the ability to estimate overall and cancer-specific survival using gene expression levels in PCa samples. If such a measure became available, it would provide an important and orthogonal complement to the data currently used in the decision process for selecting treatment for individual patients.

Numerous attempts to produce prognostic methods for PCa use surrogate end points like biochemical relapse or even cancer-specific mortality.35 This is probably due to the fact that data sets for surrogate end points are more easily obtained. However, it has been shown that nearly 50% of the patients die of diseases other than PCa.36 The identification of biomarkers that correlate with overall survival of PCa is rare. Our results demonstrate that PCa tumor subtypes classified by the gene expression signature of VGLL3, IGFBP3 and F3 at the time of diagnosis are clearly correlated with overall and cancer-specific survival in the evaluated cohort.

The gene expression signature was independent of age, PSA level, World Health Organization tumor grade and clinical stage. Furthermore, as shown through the kNN model and the parametric prediction model, this signature demonstrated clear prognostic value and the potential to further improve the prognostic accuracy of conventional clinical parameters. Following validation in additional cohorts, this ESCGP signature could be particularly beneficial in the clinical management of early-stage PCa. In such cases, the accuracy of conventional clinical parameters for the prediction of cancer-specific and overall survival is limited by a relatively low PSA level, localized disease stage and insufficient tumor material for Gleason scoring.1, 2, 3, 4, 37 When evaluating such small tumor samples, the ESCGP signature has the potential to improve the assessment.

Overall survival is the actual lifetime as determined by the aggressiveness of PCa and the patient’s other conditions or comorbidities.2 The ability to estimate overall survival by the ESCGP signature may reflect the biological functions of the three genes. Both F3 and IGFBP3 have been shown to be associated with metastasis development in prostate and other cancers. They have also been shown to be important in the development of many non-cancer diseases of the coagulatory, cardiovascular and metabolic systems, diseases that are common causes of death in PCa patients.28, 38 The positive correlation between prolonged survival and increased expression of F3 was unexpected and may suggest that PCa cells with higher levels of F3 are strongly androgen-dependent and sensitive to castration treatment.9, 24, 39, 40 The functions of VGLL3 have yet to be studied. VGLL3 shows a clear correlation with age at diagnosis (Supplementary Tables S6 and S7), which is an important patient risk factor that strongly influences the patient’s overall survival and the treatment decision. We suggest that the expression of VGLL3 may reflect the patient’s biological age, which currently can be estimated only by the physician’s subjective observation. Therefore, the combination of these three genes could provide a molecular classification sufficient to estimate overall survival.

Several reported gene markers (AMACR, EZH2, c-MAF-a, c-MAF-b and MUC1) selected from previous studies were also validated in our FNA cohort (Supplementary Table S5); however, they were not as strong as the ESCGP signature when estimating overall survival. Owing to the limited RNA quantity present in the FNA samples, the previously reported ‘stemness’ gene signature8 could not be compared with our ESCGP signature, although this comparison would be warranted in future studies.

The present study was driven by the stem cell hypothesis, whereby ESC gene expression signatures are thought to be associated with the prognosis of various cancers. Our results demonstrate that the 258 ESCGPs could classify an independent PCa data set in a nearly identical manner as compared with using the complete 5513 genes identified in a previous study (Figure 1b). Furthermore, PCa tumor subtypes classified by the ESCGP signature of VGLL3, IGFBP3 and F3 at the time of diagnosis clearly correlated with overall and PCa-specific survival. These two results support the stem cell hypothesis.

The stepwise procedure implemented in the current study has both advantages and disadvantages. We find it advantageous to use an initially wide concept and incrementally narrow the scope through use of independent historic data sets and new measurements, as illustrated in Figure 1. The drawback is that Subset 1 of the patient database was part of selecting the ESCGP signature, leaving a smaller set of patient material for validation. In our case, the limited availability of FNA samples prevented us from dividing them into one discovery set and one validation set. On the other hand, the availability of a series of high-quality, fresh–frozen FNA samples with nearly complete survival data made it possible to complete this study. Currently, evaluation of Gleason score using transrectal ultrasound-guided prostate biopsy samples has become the major diagnostic procedure for PCa. A direct comparison and correlation between this signature and Gleason score needs to be established using biopsy samples. All in all, before implementing the ESCGP signature in clinical practice, it has to be validated in an independent data set using sample material readily available in pathology laboratories. Such a validation study has been initiated in our laboratory.

In conclusion, the ESCGP signature is a promising biomarker combination suitable for estimating the survival of PCa patients. After validation in a large independent cohort study, it would provide an important and orthogonal complement to the clinical parameters currently used in treatment decisions for individual patients, in particular for those with early-stage PCa.