Introduction

Prostate cancer is the most common cancer type amongst men with an estimated incidence of 268,490 new cases per year in the United States, with an estimated 34,500 deaths per year1. Molecular profiling of prostate cancer has led to insights into the relationship of genomic alterations and disease initiation, progression, and treatment response. However, no significant differences in disease free survival were found for patients that were stratified according to the 8-group prostate cancer (PCa) taxonomy defined by The Cancer Genome Atlas (TCGA) using single gene molecular alterations2. Additionally, when primary tumors were compared to metastatic tumor samples, few changes in the frequency of these genomic alterations were observed2,3,4.

A reliable molecular biomarker that stratifies aggressive vs. indolent disease is increased frequency of Copy Number Alterations (CNAs)4,5,6,7; however, this finding provides little mechanistic or therapeutically actionable insight. Recent studies have shown that combinations of alterations, namely TP53 & RB18 and CHD1 & MAP3K79, drive aggressive disease, suggesting that molecular subtyping in PCa is complex. Many efforts have been put forward to develop predictive gene expression signatures with the goal of identifying which patients will progress to lethal disease10,11,12,13,14,15,16. Some of these signatures have been clinically successful11,17,18; however, an overwhelming amount of gene expression profiling results lack replicability between studies resulting in inconsistent lists of candidate genes associated with PCa prognosis19. Additional challenges in reproducible PCa research remain. For example, the use of high-dimensional molecular data is dependent on a thorough validation of the statistical models in diverse datasets. Similar concerns apply to molecular subtyping. Many of these challenges can at least partially be addressed by harmonization of the ‘omic’ data preprocessing and annotations, matched with manual curation of the clinicopathologic features and outcomes for easy application of multi-study statistical learning20, and cross-study validation21.

Data wrangling and data harmonization are critical for the consistent, reproducible, and benchmarked analysis of multi-omic cancer datasets. Efforts have been completed for ovarian cancer in the curatedOvarianData R package22, breast cancer in the curatedBreastData R package23, and across cancer types in the curatedTCGAData R package24. These packages have advanced the field in many ways. To this end, the R user community has put great effort into developing R class objects that help end-users to utilize data across different types - such as transcriptomics, copy number alterations, and somatic mutations - and between studies that vary in their specific study characteristics. The MultiAssayExperiment-class25 (MAE) aggregates data of various types utilizing such R classes as matrix, RaggedExperiment, SummarizedExperiment across these data levels. This data class supports linking and simultaneous storage of sample- or patient-level clinical metadata fields that can be easily processed and stored together with their corresponding ‘omics’ data.

In addition to the primary ‘omic’ data types themselves, such as gene expression measurements by RNA sequencing or microarrays, there are now an array of innovative approaches to develop molecular signatures and deconvolution methods to estimate cell types present in bulk tissue. The immunedeconv-package26 has proven to be a popular choice as a wrapper R package providing harmonized access to multiple popular cell type deconvolution methods such as EPIC27, ESTIMATE28, MCP-counter29, quanTIseq30, and xCell31. Estimating the prevalences of different cell types in the tumor specimen has allowed for investigating the relationship between immune cells and other cell frequencies in a tumor sample with clinical outcomes26,27,28,29,30,31,32,33,34.

Given the value to the PCa research field in having a unified resource of molecular features across independent studies, we developed a curated, comprehensive, and harmonized PCa resource that contains multi-omic and clinical data from 19 PCa studies. The ‘omic’ data types were preprocessed and annotated, and clinical variables were mapped to a common data dictionary to ensure consistent annotation of the samples. Furthermore, we precomputed several prostate-specific genomic scores using the uniformly preprocessed and annotated gene expression data sets. Namely, we conveniently provide Decipher35, Oncotype DX36, and Prolaris37 risk scores as well as Androgen Receptor (AR) scores2. These pre-computed variables can be easily included in the downstream analyses as correlative or phenotypic variables. Leveraging the MAE class, we supply the data in the curatedPCaData R package (https://github.com/Syksy/curatedPCaData). The package provides open and accessible data and analysis pipelines with maximum flexibility for data analysts and prostate cancer researchers. We discuss the integrated datasets within the package and insights that have been gained by bringing together >3500 prostate tissue, primary PCa, and metastatic PCa tumor samples.

Results

A summary of the key study characteristics of the 19 datasets contained in the curatedPCaData package are in Table 1. The curatedPCaData package was developed using standardized workflows for raw data processing where available, mapping all clinical information for each dataset to a common data dictionary38, and ensuring gene symbols are consistent and up-to-date using HUGO Gene Nomenclature Committee (HGNC) symbols across all datasets and data types (Figure S1). To harmonize, organize, and manage all datasets and data types, the curatedPCaData package was built using the data structures for multi-omic data integration as implemented in the MultiAssayExperiment R package25.

Table 1 Summary of studies in curatedPCaData and their corresponding MultiAssayExperiment (MAE) object contents when queried from ExperimentHub using the function getPCa.

For reproducibility and to provide users with example code, analyses and results presented in the following sections are made available as vignettes through the curatedPCaData package. Furthermore, the individual data components used to create the MultiAssayExperiment objects are made available via the ExperimentHub package’s storage service following current guidelines for data packages intended for the Bioconductor repository.

Molecular measurements are consistent across independent datasets

There is an expectation that multiple, independent datasets that report molecular features across cancer patient cohorts with similar clinical profiles will reveal similar biological findings. If results are inconsistent between patient cohorts, differences in data processing and annotations, major batch effects or potentially biological effects could be the explanation. To test the consistency of our processed molecular measurements across patient cohorts, we evaluated patterns of transcriptome, copy number alterations, and mutations.

Gene expression, as measured by microarrays or RNA sequencing, is the most common molecular measurement in the curatedPCaData package (Table 1). To evaluate the consistency of expression patterns, we first performed a pairwise correlation analysis of gene expression differences in Gleason grade ≥8 vs. Gleason grade ≤6 tumor samples using the genes that were in common between the datasets (Fig. 1a). Overall, we found that pairwise Pearson correlation between datasets was generally moderate to low and statistically significant. Compared to the TCGA dataset2, the reported correlations were between 0.34 and 0.48 for Taylor et al.4,39, Weiner et al.40,41, Barwick et al.42,43, and IGC44,45. However, not all datasets were as correlated to TCGA. For example, the Friedrich et al.46,47 dataset only showed a correlation of 0.18, which could be attributed to the difference in the underlying platform as gene expression in TCGA was measured by RNA sequencing, and Friedrich et al. was measured using a custom Agilent microarray.

Fig. 1
figure 1

Gene expression patterns across datasets. (a) Pearson correlation between datasets comparing differential expression of Gleason grade ≥8 vs. Gleason grade ≤6 samples for genes common between the datasets. (b) Volcano plots for differential gene expression comparing Gleason grade ≥8 vs. Gleason grade ≤6 samples. The highlighted genes are the top five up- and down-regulated genes identified across the four datasets using Fisher’s method to combine p-values. (c) Spearman’s rank correlations for all genes within the dataset were calculated compared to AR and the ETS transcription factor, ERG. The Spearman correlation was calculated for the correlation patterns between datasets and displayed for AR (left side) and ERG (right side) in both primary and metastatic tumors.

Next, we identified the most commonly up- and down-regulated genes when comparing Gleason grade ≥8 vs. Gleason grade ≤6 tumor samples across multiple datasets (TCGA2, IGC44,45, Taylor et al.4,39, Weiner et al.40,41). We used the moderated t-test calculated through the limma R package to determine log fold changes and p-values for individual datasets. We then integrated the four datasets using Fisher’s method to combine p-values to identify genes that were consistently up- (n = 263) or down- (n = 501) regulated and significant (q-value < 0.01) across these datasets48. Consistent with the biological processes associated with tumor growth and aggressiveness, the up-regulated genes are enriched for cell cycle-related processes, cell division, DNA replication, and DNA repair, while the down-regulated genes are enriched for positive regulation of apoptosis, negative regulation of ERK1 and ERK2 cascade, and cell-matrix adhesion. Using volcano plots for visualization and illustrative purposes, we highlighted the top 5 consistently up- (PRR16, RRM2, COMP, ASPN, PPFIA2) and top 5 consistently down-regulated genes (ANPEP, ACTG2, MYCBPC1, CD38, SLC2A3) (Fig. 1b).

Finally, for gene expression, we evaluated the consistency of correlation patterns in relation to prostate cancer-associated genes. For each dataset, we calculated the Pearson correlation of all genes within the dataset to Androgen Receptor (AR) and the ETS transcription factor, ERG. We then calculated the Spearman correlation of the correlation patterns to AR and ERG across datasets (Fig. 1c). For the majority of datasets measuring gene expression in primary prostate tumors, the correlation patterns for AR across datasets were consistent with some datasets being highly correlated, such as Kim et al.49,50 and Weiner et al.40,41, or Taylor et al.4,39 and Sun et al.51,52. Patterns for ERG expression were moderately to highly correlated, but there were some datasets with inverse correlation, such as Ren et al.53 and Sun et al.51,52, and Ren et al. and Barwick et al.42,43 While datasets with gene expression from metastatic tumors are few, the pattern of correlation between Chandran et al.54,55, Abida et al.56, and Taylor et al.4,39 were lower, likely due to the intrinsic heterogeneity of measuring gene expression from samples in the metastatic setting.

Prostate cancer is known to be heavily driven by copy number alterations, which will impact the molecular measurements of gene expression. For datasets with copy number alteration information, curatedPCaData provides discretized copy number calls according to GISTIC2 (−2 = deep loss, −1 = shallow loss, 0 = diploid, 1 = gain, 2 = amplification)57. We evaluated the overall copy number landscape and found that independent datasets showed highly similar patterns of copy number gain and loss in primary tumors (Taylor et al.4,39, TCGA2, Baca et al.58) (Fig. 2a), with samples from metastatic tumors (Abida et al.56) showing an overall increase in copy number alterations as has been previously reported2,56. We additionally evaluated the frequency of copy number alteration across several genes that have been shown to be associated with prostate cancer (PTEN, TP53, CHD1, MAP3K7, FOXA1, NXK3.1, USP10, SPOP2,4,9,58,59,60,61,62,63,64), along with the TMPRSS2:ERG fusion2,65. For these genes, we found the copy number alteration and mutation patterns to be consistent across datasets (Fig. 2b, note that not all datasets have all genes measured for mutations or copy number). We also tested for patterns of co-occurrence and mutual exclusivity between these genes. While general patterns of co-alteration were consistent between datasets, the statistical significance, as measured in the primary tumor setting (Taylor et al.4,39, TCGA2, Baca et al.58), not surprisingly is highly dependent on the size of the dataset. In the metastatic setting (Abida et al.56), the frequency of alteration is consistently much higher and many genes are statistically significantly co-altered (Fig. 2b).

Fig. 2
figure 2

Copy number and mutational landscapes across datasets. (a) Multiple known prostate cancer-associated genes (MAP3K7, MYC, NKX3-1, PTEN, TP53) displayed consistent copy number loss/deletion or gain/amplification across datasets. (b) Oncoprints (left side) for select prostate cancer-associated genes are displayed across datasets. Mutual exclusivity (right side) was calculated using Fisher’s exact test (*p < 0.05). Note that due to a lack of overlap in ‘omics’, some alteration percentages combining CNA and mutations are under-estimated; for example Taylor et al.4,39 used a targeted sequencing panel, and thus not all genes were measured for somatic mutations.

Overall, these benchmarking analyses show that the molecular features in primary prostate cancer are generally reliably and consistently measured across datasets. Gene expression patterns are correlated across datasets. Copy number results were more robust across datasets, with mutational information limited to a few datasets. The consistent data processing and harmonization of gene names across datasets provide a ready to use resource for meta-analysis.

Derived features add value to published datasets

A value added in the curatedPCaData package, beyond data harmonization, is that features were systematically and consistently derived across datasets. Leveraging gene expression data, we inferred and evaluated estimates of risk (Oncotype DX66, Decipher11, and Prolaris10), AR scores, and microenvironment cell content leveraging the Immunedeconv R package32.

Prognostic risk scores are calculated from a select set of genes; thus, missing genes and assay platform differences can impact the reliability of the computed scores67. To assess the impact of missing genes on risk score calculations, we benchmarked the risk scores included in curatedPCaData (Oncotype DX66, Decipher11, and Prolaris10) by removing different genes for calculating the risk scores, calculating the risk score with simulated missingness, followed by correlating the risk score derived from the incomplete gene set to the risk score calculated from the full gene list. Oncotype DX, a 12-gene signature, performed well overall when genes were missing from the gene list. As an example, with 5 genes missing over 100 random sampling iterations, the average correlation coefficient was 0.891(median = 0.903) compared to the “ground truth” score using all genes (Figure S2a). Prolaris, a 34-gene signature, also proved to be highly robust whereby removing 10 random genes from the Prolaris gene list in the Kunderfranco et al. dataset had an average correlation with the original score of 0.973 (median = 0.974; Figure S2b). Decipher, a 17-gene signature, showed similar results to Oncotype DX where removing 5 genes resulted in an average correlation of 0.921 (median = 0.937; Figure S2c). Lastly, the AR score was calculated by taking the mean across scaled gene expression values and found to be robust to the removal of genes. There are 20 genes that are used to calculate the AR score and we found that by removing 10 at random still provides an average AR score with a correlation of 0.930 (median = 0.935; Figure S2d).

In addition to prognostic risk and AR score calculations, we performed cell type deconvolution, which infers immune cells and other stromal cells from bulk tissue gene expression profiling. For datasets with gene expression, we calculated immune and other cell estimates using EPIC27, ESTIMATE28, MCP-counter29, quanTIseq30, and xCell31 as implemented in the immunedeconv R package32, and CIBERSORTx34. While deconvolution methods vary in the types of cells that they estimate, the overall methodology has been shown to produce robust predictions and comparison between methods have been shown to be mostly consistent and robust, which is covered in depth by Sturm et al.32 and was a major motivation to develop the immunedeconv R package. The following section highlights how the inferred cell content can be used to infer associations with clinical outcomes using curatedPCaData.

Endothelial cell content predicts patient outcomes

Leveraging the results from the immune and cell deconvolution methods from bulk transcriptome data, we evaluated the relationship between inferred cell types, patient outcomes, and disease progression. We found that the estimates of endothelial cell content as estimated by xCell31, MCP Counter29, and EPIC27 were predictive of biochemical recurrence. It was encouraging to also find that the results from the three independent methods were highly correlated (Fig. 3a), which provides support that the signal is reproducible and not an artifact of one deconvolution method. For illustrative purposes, we stratified patients in the TCGA2 and Taylor et al.4,39 cohorts into the top 1/3 and bottom 2/3 by endothelial cell estimates, and estimated HRs using univariate Cox models for each method (EPIC, MCP-counter, and xCell). The univariate Cox models agreed on the Hazard Ratio (HR) estimates and statistical significance across the methods and datasets, with HR estimates ranging between 2.02 to 2.45 in TCGA and 1.96 to 3.54 in Taylor et al. (Fig. 3b). When Gleason grade group (≤6, 7, ≥8) was modeled as a univariate Cox model predictor, its unit increase estimate for HR was of similar effect size as having the top tertile for endothelial cells with 2.15 and 3.52 for TCGA and Taylor et al., respectively. Patient samples with a high endothelial score show significantly shorter times to biochemical relapse (Fig. 3c). Furthermore, we evaluated primary tumor datasets for the association between endothelial cell estimates and Gleason grade. Across the datasets that reported at least 10 patients per Gleason grade group and where we could infer endothelial cell content from gene expression data (TCGA2, Taylor et al.4,39, Friedrich et al.46,47), we consistently found an increased estimated presence of endothelial cells in Gleason grade ≥8 compared to Gleason grade 7 or ≤6 (Fig. 3d).

Fig. 3
figure 3

Estimates of endothelial cell content are associated with clinical outcomes. (a) The endothelial cell scores calculated from gene expression across TCGA2 and Taylor et al.4,39 were highly correlated (Spearman correlation) across the three estimation methods, EPIC27, MCP-counter29, and xCell31. (b) Forest plots for univariate Cox proportional hazard models illustrate that all three methods and Gleason grade were predictive of biochemical recurrence. (c) Endothelial cell score top tertiles, as illustrated using MCP-counter’s estimates, showed a statistically significant stratification for worse outcome in TCGA and Taylor et al. datasets. (d) In addition to being associated with biochemical recurrence, the estimates from MCP-counter are associated with tumor Gleason grade groups.

It has been established that the cellular content of the tumor microenvironment can be predictive of tumor progression and response to treatment, mostly in the context of immune cells33. Similarly, angiogenesis and the vascularization of the tumor microenvironment have been associated with tumor progression and outcomes68,69,70,71, with specific studies linking endothelial cell content to prostate cancer aggressiveness72,73. Our findings are consistent with previous results and demonstrate the strength of leveraging the inferred features across multiple, independent datasets through curatedPCaData.

Discussion

The curatedPCaData R package provides a harmonized and centralized resource for prostate cancer studies with multi-omic and clinical data that can be leveraged easily for cancer research. The cross-study analyses presented herein demonstrate the strength of leveraging multiple studies in PCa; however, it is important to understand and incorporate relative differences between studies, their aims, design, and the underlying composition in such data analysis. For example, Abida et al.56 focused on the progressed metastatic form of the disease and reported a significant number of disease-related deaths suitable for overall survival modeling. On the other hand, Friedrich et al.46,47, Hieronymus et al.6,74, ICGC-CA75, and TCGA2 also reported overall survival, but they present a more indolent form of the disease with a lower count of deaths, making survival modeling more challenging. Furthermore, biochemical recurrence is often used as a surrogate for progression-free survival and is reported in Barwick et al.42,43, Sun et al.51,52, Taylor et al.4,39 and TCGA2; of these four datasets, we focused our Cox models for recurrence on Taylor et al. and TCGA, as Barwick et al. used a very targeted custom DASL gene panel (<1,000 genes) making cell composition estimation unreliable for most methods. Sun et al. only report recurrence as a binary outcome without follow-up times, rendering it unsuitable for Cox proportional hazards models or survival estimation using the Kaplan-Meier method. Despite the differences in reported variables, a considerable amount of clinical information is made available across independent datasets to draw associations with molecular features.

Researchers should also consider the original study aims, as these will be reflected in which metadata fields and ‘omics’ that will be available. For example, Weiner et al.40,41 studied ethnicity-related PCa-trends, thus the patients had accurate demographics-related metadata commonly available, while samples were just described as being primary tumors. In contrast, Wang et al.76,77 studied how sample composition (tumor cells, stroma, atrophic grand, or benign prostate hyperplasia) could be differentiated based on gene expression, thus providing metadata suitable for tumor purity estimation, but provided no clinical end-points or patient characteristics. While we have gone through great effort to minimize technical and reporting variability, some fundamental study characteristics will inevitably not be comparable. Thus, combining studies should be planned with care to avoid introducing confounding effects. To this end, curatedPCaData offers assistance in bringing together studies suitable for efficiently tackling specific prostate cancer related research questions.

Additional consideration should be given to how studies reported the common end-point of Gleason grade. In curatedPCaData, we provided summarized results across studies as Gleason grade groups (≤6, 7, ≥8), though studies might have additional information to report. For example, Weiner et al.40,41 reported an International Society of Urologic Pathologists (ISUP) disease stage ranging from 1–5, for which the suggested mapping to the traditional Gleason grade was done78. Multiple studies reported Gleason as the sum of major + minor Gleason grades or a grade group (≤6, 7, ≥8), thus groupings were offered as an endpoint with an equal level of granularity, while a finer level of detail was offered in alternate clinical metadata columns when available. In ambiguous cases, the primary publications and the supplementary material were mined, along with contacting the primary authors in many cases, in an effort to offer accurate and up-to-date information on both the clinical metadata and the primary data. For this purpose, a great deal of manual labor was required to curate the curatedPCaData datasets. The resulting datasets were thus standardized to be as comparable as possible, while retaining details essential to the studies. To this end, we offer a great variety of R package vignettes alongside curatedPCaData with numerous examples and extra data characteristics, which assist the end-user in planning their analyses.

One benefit of curatedPCaData is that it greatly lowers the barrier for accessing data to rapidly test hypotheses and generate novel hypotheses supported by multiple, independent datasets. The code used to generate the MAE objects is offered within the R package and GitHub repository. The processed MAE objects exported from the package are the main focus of the package; however, from a developer point of view, they also offer natural potential for future extensions such as: a) adding new studies and exporting them as new MAE objects using the pipelines developed in curatedPCaData; b) supplementing the existing MAE slots with newly derived variables or even adding other primary ‘omics’ data; or c) extending the existing clinical metadata fields to include new fields.

Currently, curatedPCaData offers a base R Shiny79 interface to the package as well, with plans to extend the visual browser-based access to the data. While ongoing efforts such as the NCI Genomic Data Commons80, cBioPortal81, or the International Cancer Genome Consortium82 already aim to provide a standardized approach to tackling complex ‘omics’ traits in cancer, curatedPCaData is the first harmonized, multi-study, hands-on data resource intended for analysts with a strong focus on PCa and allowing for maximum flexibility of the analyses, using the R statistical software83. As such, the presented proof-of-concept analyses provide merely a staging platform for more efficient exploration of multi-omics signatures coupled with clinical metadata for the wider research community for prostate cancer.

Methods

Data acquisition

Gene expression, copy number alterations, and mutation data were downloaded from Gene Expression Omnibus (GEO)84 using GEOquery (R package version 2.64.2) and from cBioPortal81 using cBioPortalData (R package version v2.8.2) and cgdsr (R package version v1.3.0) (Figure S1a). In addition to downloading raw data from GEO, GEOquery was used for downloading the latest array-specific annotations and all three R packages were further utilized to download clinical metadata accompanying the raw data. Raw CEL-file files for Affymetrix-arrays were RMA-normalized in oligo (R package version v1.62.1) with functions read.celfiles, rma, getNetAffx, and exprs. Agilent arrays were processed using limma (R package version v3.52.2) with the functions read.maimages, backgroundCorrect, normalizeBetweenArrays, and avereps. For custom arrays such as the DASL array in Barwick et al.42,43, quantile normalization was used together with log-transformation. No additional normalization was done on the gene expression data from cBioPortal, since cBioPortal offers pre-normalized data. For data with raw copy number alteration available, these were processed using rCGH (R package version v1.26.0) with functions readAgilent, adjustSignal, segmentCGH, and EMnormalize. This yielded log-ratios, which were input to GISTIC257 when available. Copy number alteration matrices from cBioPortal with pre-existing GISTIC2 calls were stored with the discretized calls consistently across all the datasets. A summary of the acquired datasets and their sources is presented in Table 1.

The TCGA Prostate Cancer (PRAD) dataset was downloaded from Xena Browser85, due to better data quality and providing tumor samples and normal samples separately, instead of providing relative tumor to normal gene expression found in cBioPortal processed data. We also removed low-quality samples which were excluded from the TCGA publication due to RNA degradation from the gene expression matrix to provide users with the most reliable information. We followed uniform naming conventions for all the metadata fields and leveraged data in the original publications to obtain maximum information in case information wasn’t readily available in these public repositories38.

All layers of data, namely the gene expression, copy number alterations, and mutations, underwent a harmonization process to ensure uniform gene naming conventions. Note that some datasets have matched normal samples to call somatic mutations and some datasets do not have matched normal samples and are thus tumor-only variants. The mutation calling status is noted in the “Mutation_status” field. The latest hg38 gene symbols, aliases, and locations were downloaded using biomaRt (R package version v2.52.0). We then mapped all the gene names to the up-to-date dictionary to ensure consistency in HGNC symbols across all datasets. A liftover from hg19 to hg38 was done as part of the harmonization using the liftOver function from rtracklayer (R package version v1.56.1), for mutations called with an older genome assembly to ensure uniformity.

Clinicopathological features were processed using R scripts customized to each dataset. Features were collected from supplementary annotation files and processed to map features to the data dictionary. The data dictionary ensured common terminology and some additional features, such as Gleason grade group (where not supplied by the primary publication), were inferred using a predefined set of rules. The scripts for each dataset are made available in curatedPCaData.

Derived features

A number of derived features were computed for the final MAE-objects (Figure S1b). Using gene expression data, we calculated cell proportions, genomic risk scores, and AR scores. The immunedeconv32 (R package version v2.1.0) wrapper package was used to estimate cell proportions from EPIC27, ESTIMATE28, MCP-counter29, quanTIseq30, and xCell31. As the implementation of CIBERSORTx34 required external access using the free academic license, it was run with default parameters on their web interface and quantile normalization disabled with the normalized gene expression data as input and LM22 signature matrix used to infer cell types. The output CIBERSORTx matrices were then downloaded and integrated into the MAEs.

Due to the different platforms (sequencing, different brands, and versions of microarrays) used to assess gene expression, not all datasets have the same set of genes. To determine the impact that gene missingness on the precomputed scores would have on those studies without all genes, we benchmarked the Oncotype DX66, Decipher11, and Prolaris10 risk scores and the AR score. This was performed by identifying the study in curatedPCaData that contained the most genes belonging to the scoring method. By using this study, we were able to get as close to what the true score would be. Assessing the impact of missing genes was performed by randomly removing genes to simulate missing between 1 and 10 genes for Prolaris10 risk score (34 genes in the complete signature) and AR score (20 genes), and removing between 1 and 5 for Oncotype DX66 and Decipher11 risk scores (12 and 20 genes, respectively). Since the number of gene combinations that can be made by simulating 10 missing genes for a risk score such as Prolaris10 is large, the combinations were sampled to cut down on vignette and package build time. The number of combinations used for assessing impact of missingness in Decipher11, Oncotype DX66, and AR scores was 100 while Prolaris risk score used 50 combinations.

We implemented the Oncotype DX66, Decipher11, and Prolaris10 risk scores based on the instructions in their original publications supported by the implementation outlined in Creed et al.67 The gene list (n = 12 matching genes) for Oncotype DX matched perfectly with several studies: Abida et al.56, Kim et al.49,50, Ren et al.53, Sun et al.51,52, Taylor et al.4,39, TCGA2, Wallace et al.86,87, and Weiner et al.40,41 We considered TCGA to be the most complete dataset as well as most widely used, thus we used the gene expression from TCGA for testing the variability of the Oncotype DX score due to missing genes (Table 2). The gene list (n = 17 matching genes in TCGA) for Decipher did not have a 1-to-1 match with any study in curatedPCaData, but did have the highest number of matching genes in Ren et al.53 (18 genes were a 1-to-1 match with two genes from Decipher missing) while Abida et al.56, Friedrich et al.46,47, and TCGA2 had slightly fewer number of matching genes (17 genes were a 1-to-1 with 3 genes missing). We used TCGA gene expression for benchmarking inferred risk scores from Decipher. Prolaris required the largest number of genes (n = 34 matching genes) to calculate risk. Kunderfranco et al.88,89 had the highest number of matching genes with 32 1-to-1 matches and only 2 genes missing. The next highest 1-to-1 match was ICGC-CA75 where 29 genes were 1-to-1 matches. Because of the high number of matching genes, we selected Kunderfranco et al. as the benchmarking study for Prolaris (Table 2).

Table 2 The intersection between Prolaris, Oncotype DX, Decipher, and Androgen Receptor (AR) score’ genes and genes that are found in studies within curatedPCaData R Package.

AR-scores were calculated for the 20 genes identified originally in Hieronymus et al.90 and then calculated as the sum of z-scores of AR signaling genes as described by TCGA2. There were 8 studies that matched all 20 genes used to calculate the AR score; we leveraged TCGA gene expression for benchmarking.

Statistical analysis

While the primary focus is on providing readily processed MAE-objects with MultiAssayExperiment (R package version v1.21.6), curatedPCaData delivers several application examples as R vignettes and documentation, with relevant statistical methodology applied therein. Cox proportional hazard models and Kaplan-Meier (KM) curves were fitted with survival (R package version v3.3-1) and plotted using survminer (R package version v0.4.9), and the corresponding p-values were calculated using log-rank tests.

Differential gene expression was calculated as the average log-transformed expression of Gleason grade ≥8 samples minus the average log-transformed expression of Gleason grade ≤6 samples. Statistical significance was determined by comparing the log-transformed gene expression of Gleason grade ≥8 compared to Gleason grade ≤6 samples using the moderated t-test as implemented in limma (R package version v3.52.2). The final p-values were adjusted for multiple testing using Benjamini-Hochberg correction. Pearson correlation was used to compare differential expression in Fig. 1a. The genes reported in Fig. 1b were identified using Fisher’s method to combine p-values for statistical significance. The log fold change was then tested to ensure consistent up- and down-regulation of the associated gene, meaning a gene needed to have logFC >0 or logFC <0 across all four datasets tested. The top up- and down-regulated gene sets were tested for pathway and biological process enrichment using the DAVID web server91. The correlations reported in Fig. 1c were calculated using Spearman’s rank correlation.

Genes were defined to be co-occurring or mutually exclusive based on the odds ratio (OR) which is calculated as: OR = (Both* Neither)/(B Not A * A not B) where A and B stand for alterations in genes A and B respectively. We define any alteration in copy number or mutations that are not silent as an alteration. The significance of mutual exclusivity/co-occurrence was computed using the Fisher’s Exact Test and the Benjamini-Hochberg correction was applied to determine the adjusted p-values. Mutual exclusivity plots for different data sets shown in Fig. 2b (right side) provide information on whether or not a set of important genes in PCa are significantly altered together.

Statistical modeling used to identify interesting derived features predictive of biochemical recurrence were based on 10-fold cross-validation (CV) of Cox models regularized using LASSO from glmnet (R package version v4.1-4)92. There were three methods that calculated endothelial cell abundance scores (EPIC27, MCP-counter29, and xCell31). Among these methods, endothelial cell abundance scores were predictive in at least one of these datasets, when predictive features were chosen according to the optimal regularization coefficient λ in the CV-curve.

Spearman’s rank correlation was used to assess the non-linear association between endothelial cell scores in Fig. 3a. Cox proportional hazards models were fit as univariate models with biochemical recurrence as an endpoint, by introducing one of the endothelial scores at a time to a separate model compared with using Gleason score sum as a univariate predictor; these were then plotted together as a forest plot in Fig. 3b.