Comprehensive genomic characterization of prostate cancer has identified recurrent alterations in genes involved in androgen signaling, DNA repair, and PI3K signaling, among others. However, larger and uniform genomic analysis may identify additional recurrently mutated genes at lower frequencies. Here we aggregate and uniformly analyze exome sequencing data from 1,013 prostate cancers. We identify and validate a new class of E26 transformation-specific (ETS)-fusion-negative tumors defined by mutations in epigenetic regulators, as well as alterations in pathways not previously implicated in prostate cancer, such as the spliceosome pathway. We find that the incidence of significantly mutated genes (SMGs) follows a long-tail distribution, with many genes mutated in less than 3% of cases. We identify a total of 97 SMGs, including 70 not previously implicated in prostate cancer, such as the ubiquitin ligase CUL3 and the transcription factor SPEN. Finally, comparing primary and metastatic prostate cancer identifies a set of genomic markers that may inform risk stratification.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank the patients for participating in this study. We also thank the Broad Cancer Genome Analysis and Data Sciences groups for analysis methodology and computational support. This work was supported by the SU2C-PCF Prostate Cancer International Dream Team, Prostate Cancer Foundation Young Investigator Awards (B.S.T., C.P., N.S., and E.M.V.A.), Prostate Cancer Foundation–V Foundation Challenge Award (E.M.V.A., P.S.N., and J.S.d.B.), NIH K08CA188615 (E.M.V.A.), NCI P50-CA097186 and NCI P50-CA92629 SPOREs in Prostate Cancer, the Marie-Josée and Henry R. Kravis Center for Molecular Oncology, a National Cancer Institute Cancer Center Core Grant (P30-CA008748), and the Robertson Foundation (B.S.T. and N.S.).
Integrated supplementary information
Metastatic tumors have increased mutational and copy number burden compared with primary tumors, adjusted for differences in purity and coverage. In primary tumors, increased age at diagnosis and higher Gleason score are associated with higher mutational and copy number burden, adjusting for purity and coverage. Reported P values for each predictor (metastatic versus primary disease; age; Gleason score) are from the multivariate regression adjusting for purity and coverage. The center values represent the median of each group, and the error bars below and above the median line mark the first and third quartiles, respectively.
Supplementary Figure 3 Schematic workflow of the combination of statistical, quality and biological filters used to identify SMGs.
Flowchart describing how we applied a combination of statistical, quality, and biological filters to the results to identify significantly mutated genes (SMGs).
(a) Allele frequencies of significantly mutated genes. Boxplot showing the distribution of allele frequencies across SMGs sorted by decreasing median allele frequency. (b) mRNA expression of the SMGs across the TCGA cohort. (c) mRNA expression of the SMGs across the PCF-SU2C cohort. The red center line in the boxplots indicates the median value.
Samples (rows) are ordered according to six copy number clusters derived from hierarchical clustering on arm-level deletions (2q, 5q, 6q, 8p, 13q, and 16q), with SPOP and CUL3 mutations indicated in green. Chromosomes are shown from left to right, samples from top to bottom. Regions of loss are indicated by shades of blue, and gains are indicated by shades of red.
(a) CUL3 mutations observed in our cohort. (b) Mutations observed in PIK3R2. Hotspot mutations observed in PIK3R2 are paralogous to the oncogenic D560 mutation in PIK3R1. (c) Mutation distribution of CDK12 variants, showing that the majority of mutations are truncating and that missense variants cluster in the kinase domain.
Supplementary Figure 7 The overlap of the three bait sets used in this analysis (Agilent, Illumina, NimbleSeq).
(a) Overlap of all bases. (b) Overlap of coding regions.
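The overlap of capture regions can be illustrated with a small interval-intersection routine. The following is a minimal sketch only, not the pipeline used in this study: the function names (`merge`, `intersect`, `total_bases`) and the toy coordinates are hypothetical, intervals are assumed to be half-open as in BED files, and all intervals are assumed to lie on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED (assumption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, merged interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def total_bases(intervals: List[Interval]) -> int:
    """Number of bases covered by a merged interval list."""
    return sum(end - start for start, end in intervals)


# Hypothetical toy bait sets on one chromosome (not real coordinates).
bait_a = merge([(100, 200), (300, 400)])
bait_b = merge([(150, 350)])
bait_c = merge([(120, 180), (320, 500)])

# Regions covered by all three bait sets.
shared = intersect(intersect(bait_a, bait_b), bait_c)
print(shared, total_bases(shared))
```

Restricting the same calculation to coding regions would amount to one further `intersect` call against a list of coding intervals.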
(a) Whole exome sequencing (top) vs. Affymetrix SNP6 data (bottom). 303 tumors are shown for each data set, sorted in identical order. Regions of gain are shown in shades of red, losses in shades of blue. The center summary plots show the fraction of samples with a log2(cn/2) value >0.1 (red) or <-0.1 (blue) at a given position. (b) Scatter plot comparing the segment means of matched segments >200 kb from the SNP6 and the ReCapSeg data, resulting in a Pearson correlation of 0.92 (C.I. 0.916-0.918). The graph shows the correlation between the inferred copy number of 29,290 microarray segments of size >200 kb and the corresponding inferred copy number from the WES. The segments of size >200 kb represent 99.94% of the covered genome from the SNP array.
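The quantities named in this caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, the input values are toy numbers, and the confidence interval uses the Fisher z-transformation, which is one common choice and may differ from the method used to obtain the reported interval.

```python
import math
from typing import List, Tuple


def log2_ratio(copy_number: float) -> float:
    """log2(cn/2): 0 for a diploid segment, positive for gain, negative for loss."""
    return math.log2(copy_number / 2)


def gain_loss_fractions(ratios: List[float], threshold: float = 0.1) -> Tuple[float, float]:
    """Fraction of samples above +threshold (gain) and below -threshold (loss)
    at one genomic position."""
    n = len(ratios)
    gains = sum(r > threshold for r in ratios) / n
    losses = sum(r < -threshold for r in ratios) / n
    return gains, losses


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for r via the Fisher z-transformation
    (an assumption; the source does not state how its interval was computed)."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Toy log2(cn/2) values for five samples at one position (hypothetical).
gains, losses = gain_loss_fractions([0.3, -0.5, 0.0, 0.05, -0.2])

# Toy matched segment means from two platforms (hypothetical).
r = pearson([0.1, -0.4, 0.6, 0.0], [0.12, -0.35, 0.5, 0.05])
low, high = fisher_ci(r, n=4)
print(gains, losses, r, (low, high))
```

With many matched segments the Fisher interval becomes narrow, since its half-width on the z scale shrinks as 1/sqrt(n - 3).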
Supplementary Figures 1–8, Supplementary Tables 5 and 7, and Supplementary Note
Source of cohorts used for this analysis
Complete list of all somatic mutations in this cohort
Mutational significance analysis
Genes in cancer pathways
Intersected BED file
Unfiltered MAF file
Matrix of copy number calls