Prostate cancer is a common heterogeneous disease that is responsible annually for more than 1,400,000 new diagnoses and 375,000 male-associated deaths worldwide1. Characterized by a highly variable natural history and diverse clinical behaviours4, it is not surprising that genome profiling has revealed extensive intra- and intertumour heterogeneity and complexity5,6. The identification of oncogenic subtypes7 and actionable drug targets8 are moving prostate cancer management a step closer to the promise of precision medicine7,9,10,11,12. Although high-income European ancestral countries are well along the road to incorporating cancer genomics in all aspects of cancer care13, the rest of the world lags behind, with a notable absence in sub-Saharan Africa14. Prostate cancer is no different, with a single large-scale study out of China11; in 2018, we provided a snapshot for sub-Saharan Africa, reporting an elevated mutational density in a mere six cases15. With mortality rates of greater than double compared with high-income countries and quadrupled for greater Asia, in sub-Saharan Africa, prostate cancer is the top-ranked male-associated cancer both by diagnosis and deaths, including southern Africa with age-standardized rates of 65.9 and 22 per 100,000, respectively1. Through the Southern African Prostate Cancer Study (SAPCS), we report a 2.1-fold increase in aggressive disease (grades 4–5) and 4.8-fold increase in prostate-specific antigen levels at diagnosis compared with African Americans16.

Here we describe, to our knowledge, the largest cancer and prostate cancer genomics data for sub-Saharan Africa, including 123 South African men. Controlling for study artefacts, an additional 53 Australian and 7 Brazilian individuals were passed simultaneously through the same high-depth whole-genome sequencing (WGS), mutation-calling and analytical framework. Focusing on treatment-naive cases (100% South Africans, 98% Australians and two confirmed Brazilians) and aggressive tumours (grades 4–5 for 72.2% South Africans, 86.8% Australians and 85.7% Brazilians; Extended Data Fig. 1a) at biopsy (100% South Africans) or surgery (100% Australians, 62.5% Brazilians) and patient-matched blood achieving coverages of 88.69 ± 14.78 and 44.34 ± 8.11, respectively (median ± s.d.; Supplementary Table 1), we uniformly generated, called and assessed about 2 million somatic variants. Through ancestral classification (genetic ancestry over self-identified ethnicity), we show a greater number of acquired genetic alterations within African individuals while identifying both globally relevant and African-specific genomic subtypes. Combining our somatic variant dataset with that published for ethnically defined European7,8,17,18 and Chinese11 prostate cancer genomes, we reveal a prostate cancer taxonomy with different clinical outcomes. The inclusion of 2,658 cancer genomes from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG)13 expanded our global mutational subtyping between cancer types. Using known clock-like mutational processes in each subtype, we inferred mutation timing of oncogenic drivers in broad periods of tumour evolution and calculated the mutation rates for each subtype that had a distinctive tumour evolution pattern. Combined, these analyses enable us to demonstrate how global inclusion in cancer genomics can unravel unseen heterogeneity in prostate cancer in terms of its genomic and clinical behaviours.

Genetic ancestry

Genetic ancestries were estimated for the 183 patient donors using a joint dataset in a unified analysis aggregated from a collection of geographically matched African (n = 64) and European (n = 4) deep-coverage published and unpublished reference genomes19. Ancestries were assigned using 7,472,833 markers as African (n = 113, all South Africans), with greater than 98% contribution; European (n = 61; 53 Australians, 5 South Africans and 3 Brazilians), allowing for up to 10% Asian contribution (with a single outlier of 26%); and African–European admixed (n = 9; 5 South Africans and 4 Brazilians), with as little as 4% African or European contribution (Extended Data Fig. 1b).

Total somatic mutations

In 183 prostate tumours, we identified 1,067,885 single-nucleotide variants (SNVs), 11,259 dinucleotides, 307,263 small insertions and deletions (indels, <50 bp), 419,920 copy-number alterations (CNAs) and 22,919 structural variants (SVs), with each mutational type elevated in tumours from African individuals (Fig. 1a). A median of 37.54% ± 5.51 of SNVs were C-to-T mutations, and the transition and transversion ratio was 1.282 cohort-wise. Tumours from African individuals had a higher rate of small mutations (SNVs and indels), with a median of 1.197 mutations per Mb (range 0.031–170.445) compared with those of Europeans (1.061 mutations per Mb; P = 0.013, two-sample t-test; exclusion of hypermutated tumours at >30 mutations per Mb, P = 0.028). The percentage of genome alteration (PGA) was similarly greater in Africans (7.26% versus 2.82%, P = 0.021). Correlation tests of ancestry and total somatic mutations also supported the findings (false-discovery rate (FDR) = 0.009 and FDR = 0.032 for SNVs and PGA, respectively; Extended Data Fig. 1d). The top six highest estimates of SV breakpoints per sample were observed among African patients (928–2,284 breakpoints). No overall differences between the ancestries were observed for chromothripsis (range, 52–55%) and chromoplexy (range, 33–38%), whereas tumours from African individuals demonstrated a trend towards a higher number of interchromosomal chromoplexic chains (1–6 versus 1–2). Moreover, the magnitude of all types of mutations was strongly correlated with one another (Fig. 1b). Thus, the more mutations a prostate tumour has of any given type, the more mutations it is likely to have of all types.

Fig. 1: Mutational density in prostate tumours of individuals with different ancestries.
figure 1

a, The distribution of somatic aberrations (event number or number of base pairs) for 7 mutational types across 183 tumour–blood WGS pairs representing n = 61 European, n = 113 African and n = 9 admixed individuals. The box plots show the median (centre line), the 25th and 75th percentiles (box limits), and ±1.5× the interquartile range (whiskers). b, The different types of mutational burden observed in this cohort. The samples were percentile-ranked and then ordered on the basis of the sum of percentiles across the mutational types observed in each ancestral group (left). Right, Spearman correlation is shown between mutational types, with the dot size representing the magnitude of correlation and the background colour giving the statistical significance of FDR values.

Candidate oncogenic drivers

Prostate cancer is known to have a long tail of oncogenic drivers18 across the spectrum of different mutational types8 (Extended Data Fig. 2). Protein-coding mutations, including those that are probably and possibly damaging, were significantly greater in each African individual (PolyPhen-2, 14 versus 11 mutations in a European individual; P = 0.022, two-sample t-test; exclusion of hypermutated tumours, P = 0.039). We identified 482 coding and 167 non-coding drivers defined by the PCAWG consortium20 (Extended Data Fig. 3a). A median of two (first quartile to third quartile, 2–4) coding drivers was observed in this study (Supplementary Table 2), with one (0–2) appearing to be specific to prostate cancer7,8,17,18. The coding driver genes significantly mutated among 183 patients were FOXA1, PTEN, SPOP and TP53 (10–25 patients, FDR = 1.34 × 10−21–9.44 × 10−5), whereas non-coding driver elements included the FOXA1 3′ UTR, SNORD3B-2 small RNA and a regulatory micro RNA promoter at chromosome 22: 38381983 (FDR = 9.12 × 10−13, FDR = 6.16 × 10−9 and FDR = 0.070, respectively). Recurrent CNAs of all the patients included 137 gains and 129 losses (GISTIC2, FDR < 0.10; Supplementary Table 3) with some spanning driver genes (Extended Data Fig. 3b), such as DNAH2 (FDR = 2.18 × 10−7), FAM66C (FDR = 1.30 × 10−9), FOXP1 (FDR = 0.005), FXR2 (FDR = 2.18 × 10−7), PTEN (FDR = 9.61 × 10−13), SHBG (FDR = 2.18 × 10−7) and TP53 (FDR = 2.18 × 10−7).

Moreover, a fraction of somatic SVs (2 breakpoints each; 1,328 breakpoints in total) overlapped with 156 driver genes reported as altered by significantly recurrent breakpoints in the PCAWG study20, while, using a generalized linear model with adjustable background covariates, we identified an additional 100 genes to be significantly affected by SV breakpoints (FDR = 1.3 × 10−43–0.097; Extended Data Fig. 3c and Supplementary Table 4). For more than 20% of tumours, SV breakpoints coexisted with other mutational types within DNAH2, ERG, FAM66C, FXR2, PTEN, SHBG and TP53. Using optical genome mapping—an alternative non-sequencing method to examine for chromosomal abnormalities21—we validated recurrent breakpoints in HLA regions (DQA1 and DQB1 genes), identifying translocations between the 3 Mb HLA complex at chromosome 6 and its corresponding HLA alternative contigs (Extended Data Fig. 3d).

Differences in oncogenic driver alterations between ancestries were observed (Fig. 2a,b). Specifically, tumours from African individuals were more likely to have CNAs and mutations in SETBP1 (frequency = 0.33, odds ratio (OR) = 0.357, P = 0.012), DDX11L1 (frequency = 0.48, OR = 0.24, P = 0.0001), STK19 (frequency = 0.25, OR = 0.215, P = 0.004) and NCOA2 (frequency = 0.51, OR = 0.172, P = 3.14 × 10−6), along with SVs in PCAT1 (frequency = 0.13, OR = 0.11, P = 0.012). By contrast, SVs for TMPRSS2 (frequency = 0.38, OR = 3.639, P = 0.0006) and ERG (frequency = 0.34, OR = 3.159, P = 0.003) were more notable among Europeans. Although several DNA-damage repair genes and other genes previously associated with African ancestry were not significantly altered between Africans and Europeans in this study, 10 were solely altered in Africans with most in the coding sequence (frequency = 0.009–0.035). All of these data support the inclusion of a larger number of under-represented populations in clinical enrolment for the benefit of precision oncology studies22.

Fig. 2: Taxonomy and differences in driver mutations in prostate cancer by ancestry.
figure 2

a, The selected 35 driver genes classified as (1) the most altered in this study (>10 patients), irrespective of ancestry (green); (2) DNA-damage repair (DDR) genes that are known to be associated with African ancestry (orange); (3) other ancestry-associated genes studied in prostate cancer (assoc., purple). The OR, 95% confidence interval and two-sided P value (<0.05) were calculated using Fisher exact tests for count data and including 10 African-specific (OR = 0) and 3 European-specific (OR = infinity) genes. Significance was observed for TMPRSS2 (P = 0.0006), ERG (P = 0.003), SETBP1 (P = 0.012), DDX11L1 (P = 0.0001), STK19 (P = 0.004), NCOA2 (P = 3.14 × 10−6), PCAT1 (P = 0.012), PAPSS2 (P = 0.042) and MTCH2 (P = 0.014). b, The mutational frequency of the altered driver genes between Africans and Europeans by mutational type (CDS, non-coding, SV and CNA). c, An integrative clustering analysis reveals four distinct molecular subtypes of prostate cancer. The molecular subtypes are illustrated by small somatic mutations (coding regions and non-coding elements), somatic CNAs and somatic SVs. The proportion and association between the iCluster membership and patient ancestry are illustrated in  d. Additional unsupervised consensus clustering on each data type was performed and mostly recapitulated the subtypes by integrative analysis. d, Total somatic mutations across four molecular subtypes in this study. The dashed lines indicate the median values of mutational densities across the four subtypes. For each subtype, patients are ordered on the basis of their ancestry.

Integrative clustering analysis

Molecular subtyping of tumours is a standard approach in cancer genomics to stratify patients into different degrees of somatic alterations in a homogeneous population, with an implication for clinical use9,10,11. Identifying five out of the seven TCGA oncogenic driver-defined subtypes in our study7, European patients were 25% more likely than African patients to be classified (Supplementary Table 5 and Extended Data Fig. 4a–d). Whereas TMPRSS2-ERG fusions (predominantly 3 Mb deletions) were significantly elevated in our tumours from European individuals compared with from African individuals (37.7% versus 13.3%; OR = 3.919, P = 0.0004), albeit not significantly, African patients were 1.3-fold more likely to present with SPOP-coding mutations (MATH and BTB domains).

For further molecular classification, we performed iCluster analysis on all mutational types (small mutations, CNAs and SVs) identifying four subtypes—A to D (Fig. 2c,d and Supplementary Table 6). We found that subtype A is mutationally quiet (1.01 mutations per Mb, 0.50 breakpoints per 10 Mb, 2% PGA); by contrast, subtype D showed the greatest mutational density (1.91 mutations per Mb, 1.08 breakpoints per 10 Mb, 31% PGA) with a mixture of CN gains and losses, whereas subtypes B and C were marked by substantial CN gains or losses, respectively (Fig. 3a). The quiet subtype seems to be common in prostate cancer studies7,9,23, while the number of pan-cancer consensus drivers20 increased from subtype A (median, 2 drivers) to B (median, 3 drivers), C (median, 3 drivers) and D (median, 4 drivers).

Fig. 3: Significance of somatic aberrations across four diverse subtypes.
figure 3

a, Analysis of the long tail of driver genes using different combinations of mutational types (CDS, coding driver data; NC, non-coding driver data; SV, significantly recurrent breakpoint data; and CN, gene-level CN data), resulting in the identification of 124 preferentially mutated genes among the subtypes. Ordered by mutational frequency, 100 (80.6%) have been reported as significantly recurrent mutations/SV breakpoints in the PCAWG Consortium20, and 24 (19.4%) are significantly mutated in this study (marked by asterisks). Using iClusterplus, unsupervised hierarchical clustering of all mutational types identified four prostate cancer subtypes (A–D; Fig. 2c), presented for 183 patients (rows) and 124 mutated genes (columns), with each subgroup ordered by ancestry. Ancestrally diverse subtypes A and C are mutationally quiet and are marked by CN loss, respectively. African-specific/predominant subtypes B and D are marked by CN gains and are mutationally noisy, respectively. Three genes on chromosome X, KDM6A, ATRX and ZMYM3, are considered to be significant due to the abundance of homozygous (homo.) loss present in subtype C. Chr., chromosome; hemi., hemizygous; ISUP, International Society of Urologic Pathologists; NA, not applicable. b, Kaplan–Meier plot of biochemical relapse (BCR)-free survival proportion of European patients for subtype A (n = 161) versus C (n = 19). c, Kaplan–Meier plot of the cancer survival probability of European patients for subtype A (n = 82) versus C (n = 17). For b and c, the probability estimates, 95% confidence intervals and two-sided P values (log-rank test) are indicated.

Using all of the mutational types in the analysis, 124 genes were significantly mutated across the four subtypes (FDR = 3.742 × 10−13–0.067; Fig. 3a), occurring in 31 to 183 patients (frequency, 0.17–1). Among them, 100 genes were reported as oncogenic drivers in the PCAWG20, and FOXA1 and SPOP genes acting as the TCGA subtypes were also replicated in this analysis, while the 24 new mutated genes among the subtypes were predominantly affected by SV breakpoints and CNAs. The median number of mutated genes ranged from 28 (range 3–105) for subtype A to 82, 98 and 93 for subtypes B, C and D, respectively (42–109, 72–112, 49–107). Although different mutational types tended to co-occur within genes and/or patients (Supplementary Table 7), small mutations (coding and non-coding) were noticeably observed in the quiet subtype A, supporting acquisition early in tumorigenesis24. Our preferentially mutated genes within tumour subtypes resemble the long tail of prostate cancer drivers18, with some highly impacting many tumours, but most only affecting a few tumours.

The 124 preferentially mutated genes within our tumour subtypes corresponded to 8 TCGA/ICGC cancer pathways (Supplementary Information and Extended Data Fig. 5). Whereas six showed slightly elevated mutational frequencies in tumours from African individuals, genes affecting epigenetic mechanisms were significantly biased towards European individuals (OR = 5.586, P = 2.9 × 10−7; Extended Data Fig. 6b). Pathway enrichment analysis supported five functional networks of the cancer pathways, with two of them involved in signal transduction and DNA checkpoint processes that five out of the eight pathways interacted with (Extended Data Fig. 6a and Supplementary Table 8).

Global molecular subtypes

By combining molecular profiling and patient demographics, genetic ancestry and geography, we identified a new prostate cancer taxonomy that we define as GMS (Fig. 2d). Whereas all European patients from Australia (n = 53) and Brazil (n = 3) were limited to GMS-A and GMS-C, tumours from African individuals were dispersed across all four subtypes. We found that GMS-B and GMS-D predominate in African individuals, with GMS-B including a single patient of admixed ancestry (92% African) and GMS-D including a single admixed (63% African) and a single European ancestral patient. The latter individual was one of only five Europeans in our study who was born and raised in Africa. Compared with the other patients of European ancestry, this patient showed the highest mutational density across all types. Alternative consensus clustering of individual mutational types mostly recapitulated the subtypes by integrative analysis (Supplementary Table 6). By further including Chinese Asian high-risk prostate cancer data11 (n = 93; Extended Data Fig. 7a), we found that GMS-A is ancestrally and geographically universal, whereas GMS-D remained African specific, with a new African–Asian GMS-E emerging. GMS-B remained African specific and GMS-C remained European–African specific. Although all of the patients were treatment naive at the time of sampling, our European cohort was recruited with extensive follow-up data (median ± s.d., 122.5 ± 44.4 months). Interestingly, biochemical relapse (Fig. 3b) and death-free survival probability (Fig. 3c) explains better clinical outcomes for patients presenting with the universal over the European-African GMS (GMS-A versus GMS-C, log-rank test, P = 0.008 and P = 0.041, respectively).

Our GMS taxonomy could leverage pan-cancer studies in the following ways. First, a sampling strategy of patients from the PCAWG project was rather homogeneous in each cancer, therefore inhibiting the discovery of globally restricted subtypes3,13 (Extended Data Fig. 7b). Second, genetic ancestral25 and geographical data of patients should be included in molecular profiling of cancers. Finally, the inclusion of ethnic disparity in cancer studies would need to properly address genetic admixture in a sampling cohort, with a too low ancestral cut-off appearing to create highly admixed, but similar, ancestry among individuals, therefore discouraging ethnically diverse samples.

New and known mutational signatures

Approximating the contribution of mutational signatures to individual cancer genomes facilitates the association of the signatures with exogenous or endogenous mutagen exposures that contribute to the development of human cancer3. Here we generated a list of CN and SV signatures and their contributions to prostate cancer using non-negative matrix factorization26 (Extended Data Fig. 8a,b). Combined with a known catalogue of small mutational signatures, including single-base substitutions (SBSs), doublet base substitutions (DBSs) and indels (IDs), we observed not only a substantial variation in the number of mutational features but also over-representation in tumours from African individuals (Extended Data Fig. 8c). Overall, 96 SBS, 78 DBS and 83 ID features examined had significantly higher totals in African individuals (SBSs, 3,399 versus 2,840 in Europeans, P = 0.014; DBSs, 42 versus 32, P = 0.006; IDs, 374 versus 360, P = 0.016, two-sample t-tests). We generated six de novo signatures for each small signature type (median cosine similarity, 0.986, 0.856 and 0.976, respectively), corresponding to 12, 7 and 8 global signatures, respectively (median cosine similarity, 0.966, 0.850 and 0.946, respectively; Extended Data Fig. 9), with 26 likely to be of biological origin (SBS47, possible sequencing artefacts). DBSs accounted for about 1% of the prevalence of SBSs. The CN features were also greater in Africans (CN, 3,971 versus 2,721, P = 1.92 × 10−8; SV, 94 versus 88, P = 0.100). The SV features defined in a recent pan-cancer study26 were each mutually exclusive and included simple SVs (split according to size, replication timing and occurrence at fragile sites), templated insertions (split by size), local n-jumps and local–distant clusters. The factorization of a sample-by-mutation spectrum matrix identified six CN signatures (CN1–6) and eight SV signatures (SV1–8), as well as their contributions to each tumour.

We found that the full spectrum of mutational signatures (SBSs, DBSs, IDs, CNs and SVs) supports our newly described GMS. Enrichment records of the top signatures in each tumour were significantly associated type by type with the taxonomic subtypes, except for DBSs (P = 5.1 × 10−7–0.017, one-way analysis of variance (ANOVA) or Fisher exact test; Extended Data Fig. 8d). Regardless of the signature type, 13 out of 40 mutational signatures showed either inverse or proportionate correlations with our GMS (FDR = 4.97 × 10−13–0.095, Spearman correlation; Fig. 4a). Duplication signatures, including CN1 (tandem duplication), CN4 (whole-genome duplication), SV2 (insertion) and SV5 (large duplication), were biased to the most mutationally noisy subtype (Extended Data Fig. 8a, b), with CN4 and SV5 frequent in Africans (correlation coefficient = −0.24, FDR = 0.005–0.006). Figure 4b shows that the duplication signatures have at least a 1.5× greater proportion of genomic aberrations in GMS-B, GMS-C and GMS-D compared with the universal GMS-A. Furthermore, the African-specific subtype GMS-B consisted of several CN4 and SV5 genomic aberrations composed predominantly of CN amplification (>5 copies and mainly >100 kb in length) and tandem duplication (<5 Mb in size occurred during early to late timing of DNA replication), respectively. Moreover, the mutational density of 30 out of 32 genes that are highly mutated in our GMS and reported in prostate cancer was significantly correlated with different somatic signatures, with most observed in CN2, CN6 and SV6 signatures that were mainly caused by deleted genomes (FDR = 1.61 × 10−7–0.082).

Fig. 4: Estimates of genomic aberrations contributed by each mutational signature.
figure 4

a, Correlation plots of total mutational signatures along with clinical and genomic characteristics. The size of each dot represents the FDR values of Spearman correlation P values (two-sided) using Benjamini–Hochberg correction. The colours of each dot represent the correlation coefficient. GMS subtypes are assigned as 1–4 for subtypes A–D, respectively; African, admixed and European are recorded as 1–3, respectively. The correlation of 32 recurrent genes in prostate cancer is shown on the x axis. Many small- or large-sized mutational signatures agree with the GMS. HR, homologous recombination; PSA, prostate-specific antigen. b, Sankey diagram depicting a proportion of duplication signatures observed across cancer subtypes. Duplication features, including amplification (Amp), translocation (trans) plus, local n-jump, templated insertion (ins), amplification loss of heterozygosity (LOH), gain, tandem duplication and gain LOH (Extended Data Fig. 8a,b) are summed per subtype and equally weighted to 20. Links connecting between nodes (GMS, signatures and features) have widths proportional to the total number of CN or SV features across all patients within each GMS subtype to which they belong. Note that we believe that GMS-B is the identity of the African-specific genomic subtype.

Evolution of GMS

Timeline estimates of individual somatic events reflect evolutionary periods that differ from one patient to another; for example, a cluster of identical alterations derived from clones in one patient presented as subclonal events in another patient (Extended Data Fig. 10a,b). However, they provide in part the order of driver mutations and CNAs present in each sample24. The reconstruction of aggregating single-sample ordering of all drivers and CNAs reveals different evolutionary patterns that are unique to each GMS subtype (Fig. 5a,b and Extended Data Fig. 10c). We drew approximate cancer timelines for each GMS subtype portraying the ordering of driver genes, recurrent CNAs and signature activities chronologically interleaved with whole-genome duplication and the emergence of the most recent common ancestor leading up to diagnosis. Basically, significantly co-occurring interactions of the drivers and CNAs are shown (OR = 2.6–97.8, P = 2.04 × 10−30–0.01), supporting their clonal and subclonal ordering states within the reconstructed timelines. SBS and indel signatures that are abundant in each GMS subtype display changes in their mutational spectrum between the clonal and subclonal state, suggesting a difference in mutation rates. The plot of clock-like CpG-to-TpG mutations and patient-age adjustment shows a median mutation rate of as low as 0.968 per year for the universal GMS, but a highest rate of 1.315 per year observed in the African-individual-specific GMS-D. GMS-B and GMS-C have rates of 1.144 and 1.092 per year, respectively. Assessing the relative timing of somatic driver events, TP53 mutations and accompanying 17p loss are of particular interest, occurring early in GMS-C progression and at a later stage in GMS-A. League model relative timing of driver events (Supplementary Information) is consistent with a fraction of probability distribution of the TP53 alterations at the early stage, but most are at an intermediate state of evolution (Extended Data Fig. 10d). This basic knowledge of in vivo tumour development suggests that some tumours could have a shorter latency period before reaching their malignant potential, so known genomic heterogeneity of their primary clones is paramount to pave the way for early detection.

Fig. 5: Evolutionary history of globally mutated subtypes.
figure 5

a, The cancer timeline of the universal subtype (A) begins from the fertilized egg to the age of the patients in a cohort. b, The cancer timeline of GMS-C. Estimates for major events, such as whole-genome duplication (WGD) and the emergence of the most-recent common ancestor (MRCA) are used to define the early, variable, late and subclonal stages of tumour evolution approximately in chronological time. When the early and late clonal stages are uncertain, the variable stage is assigned. Driver genes and CNAs are shown in each stage if present in previous studies8,20 and defined by the MutationTime.R program. Mutational signatures (Sigs) that, on average, change over the course of tumour evolution, or are substantially active, are shown as described in the Supplementary Information. The dagger symbols denote alterations that are found to have different timing. Significant pairwise interaction events between the mutations and CNAs were computed to support cancer timelines. The OR and two-sided P value were calculated using Fisher exact tests. Co-occurrence or mutually exclusive event is considered when OR > 2 or OR < 0.5, respectively. The interaction significance between pairs in GMS-A and GMS-C has P values ranging from 2.04 × 10−30 to 0.047 and from 1.64 × 10−27 to 0.045, respectively. Median mutation rates of CpG-to-TpG burden per Gb are calculated using the age-adjusted branch length of cancer clones and maximally branching subclones. The mutation rate plots in a and b show the median ± 2 s.e. of fitted data as dashed lines and error bands, respectively. c, Schematic of a world map with the distribution of GMS-A–D among ancestrally/globally diverse populations. The gene–environment interaction of GMS is shown on the right. The contingency table of the number of patients with different ancestries (germline variants) stratified by subtypes and associated with certain geography or environmental exposure (two-sided P = 0.0005, Fisher exact test with 2,000 bootstraps).


Our study represents one of the largest whole-genome prostate cancer genome resources for sub-Saharan Africa (a summary is provided in Supplementary Table 12). Acknowledging the lack of information on clinical staging for the South African patients (recruited at diagnosis), we describe a prostate cancer molecular taxonomy, identifying ancestrally distinctive GMS. Compared to previous taxonomy using significantly mutated genes in prostate cancer7,18, we found that GMS subtypes compliment known subtypes such as SPOP and FOXA1 mutations, in contrast to under-represented subtypes in this study, including gene fusions (Extended Data Fig. 4a). We also found that GMS subtypes correlate with mutational signatures reported in the known catalogue of somatic mutations in cancer, in which each tumour is represented by different degrees of exogenous and endogenous mutagen exposures3. Our study used the analysis of evolution across 38 cancer types by the PCAWG consortium24, recognizing that each GMS subtype represents a unique evolutionary history with drivers and mutational signatures varied between cancer stages and linking somatic evolution to a patient’s demographics. Thus, some represent rare or geographically restricted signatures that have not been observed in pan-cancer studies3,13.

We considered two extreme cases, universal GMS-A versus African-specific GMS-B and GMS-D, that would have been influenced by two different mutational processes for conceptual simplicity (Fig. 5c). One factor is predisposing genetics27,28,29 contributing to endogenous mutational processes, especially those with significant germline–somatic interactions, such as the TMPRSS2-ERG fusion that is less frequently observed in men of African and Asian ancestry11,30, germline BRCA2 mutations and the somatic SPOP driver co-occurred with their respective counterparts31,32. Another factor is modifiable environmental attributes that are specific to certain circumstances or geographical regions that, to date, have not been observed in prostate cancer. They act as mutagenic forces leading to the positive selection of point mutations throughout life in healthy tissues33,34 and cancers35, forming fluid boundaries between normal ageing and cancer tissues. According to Ottman36, the above-mentioned model of gene–environment interaction is observed when there is a different effect of a genotype on disease in individuals with different environmental exposures or, alternatively, a different effect of an environmental exposure on disease in individuals with different genotypes. Other GMS subtypes would be a combination of the two processes, warranting a need for larger populations capturing ancestral versus ethnic and geographical diversity. As such, the study directly accounts for the large spatiogenomic heterogeneity of prostate cancer and its associated evolutionary history in understanding the disease aetiology.

Our study suggests that larger genomic datasets of geo-ethnically diverse and ancestrally defined populations in a unified analysis will continue to identify rare and geographically restricted subtypes in prostate cancer and potentially other cancers. We demonstrate that ancestral and geographical attributes of patients could facilitate those studies on cancer population genomics, an alternative to cancer personalized genomics, for a better scientific understanding of nature versus nurture.


Patient cohorts and WGS

Our study included 183 treatment-naive patients with prostate cancer who were recruited under informed consent and appropriate ethics approval (Supplementary Information 2) from Australia (n = 53), Brazil (n = 7) and South Africa (n = 123). While matched for pathological grading, as previously reported, prostate-specific antigen levels are notably elevated within our African patients16 and we cannot exclude on the basis of potential metastasis (as data on metastases in this cohort are unavailable). DNA extracted from fresh tissue and matched blood underwent 2 × 150 bp sequencing on the Illumina NovaSeq instrument (Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research).

WGS processing and variant calling

Each lane of raw sequencing reads was aligned against human reference hg38 + alternative contigs using bwa (v.0.7.15)37. Lane-level BAM files from the same library were merged, and duplicate reads were marked. The Genome Analysis Toolkit (GATK, v. was used for base quality recalibration38. Contaminated and duplicate samples (n = 8) were removed. We implemented three main pipelines for the discovery of germline and somatic variants, with the latter including small (SNV and indel) to large genomic variation (CNAs and SVs). The complete pipelines and tools used are available from the Sydney Informatics Hub (SIH), Core Research Facilities, University of Sydney (see the ‘Code availability’ section). Scalable bioinformatic workflows are described in Supplementary Information 4.

Genetic ancestry was estimated using fastSTRUCTURE (v.1.0)39, Bayesian inference for the best approximation of marginal likelihood of a very large variant dataset. Reference panels for African and European ancestry compared in this study were retrieved from previous whole-genome databases19.

Analysis of chromothripsis and chromoplexy

Clustered genomic rearrangements of prostate tumours were identified using ShatterSeek (v.0.4)40 and ChainFinder (v.1.0.1)41. Our somatic SV and somatic CNA call sets were prepared and co-analysed using custom scripts (see the ‘Code availability’ section; Supplementary Information 6).

Analysis of mutational recurrence

We used three approaches to detect recurrently mutated genes or regions based on three mutational types, including small mutations, SVs and CNAs (Supplementary Information 7). In brief, small mutations were tested within a given genomic element as being significantly more mutated than the adjacent background sequences. The genomic elements retrieved from syn5259886, the PCAWG Consortium20, were a group of coding sequences and ten groups of non-coding regions. SV breakpoints were tested in a given gene for their statistical enrichment using gamma–Poisson regression and corrected by genomic covariates12. Focal and arm-level recurrent CNAs were examined using GISTIC (v.2.0.23)42. Known driver mutations in coding and non-coding regions published in PCAWG20,43,44 were also recorded in our 183 tumours, and those specific to prostate cancer genes were also included7,8,12,17,18.

Integrative analysis of prostate cancer subtypes

Integrative clustering of three genomic data types for 183 patients was performed using iClusterplus11,45 in R, with the following inputs: (1) driver genes and elements; (2) somatic CN segments; and (3) significantly recurrent SV breakpoints. We ran iClusterPlus.tune with clusters ranging from 1 to 9. We also performed unsupervised consensus clustering on each of the three data types individually. Association analysis of genomic alteration with different iCluster subtypes was performed in detail (Supplementary Information 8). Differences in driver mutations, recurrent breakpoints and somatic CNAs across different iCluster subtypes were reported.

Comparison of iCluster with Asian and pan-cancer data

To compare molecular subtypes between extant human populations, the Chinese Prostate Cancer Genome and Epigenome Atlas (CPGEA, PRJCA001124)11 was merged and processed with our integrative clustering analysis across the three data types described above, with some modifications. Moreover, we leveraged the PCAWG consortium data13 to define molecular subtypes across different ethnic groups in other cancer types using published data of somatic mutations, SV and GISTIC results by gene. Four cancer types consisting of breast, liver, ovarian and pancreatic cancers were considered due to existing primary ancestries of African, Asian and European with at least 70% contribution. Full details are provided in Supplementary Information 8.4.

PCAWG13 participants with prostate cancer were retrieved to compare with Australian data with clinical follow-up. Only those of European ancestry greater than 90% (n = 139) were analysed for the three genomic data types of iCluster subtyping, as well as individual consensus clustering. Clustering results identical to the larger cohort size mentioned above were chosen for association analyses. Differences in the biochemical relapse and lethal prostate cancer of the participants across the subtypes were assessed using the Kaplan–Meier plot followed by a log-rank test for significance.

Analysis of mutational signatures

Mutational signatures (SBSs, DBSs and indels), as defined by the PCAWG Mutational Signatures Working Group3, were fit to individual tumours with observed signature activities using SigProfiler46. Non-negative matrix factorization was implemented to detect de novo and global signature profiles among 183 patients and their contributions. New mutational genome rearrangement signatures (CN and SV) were also performed using non-negative matrix factorization, with 45 CN and 44 SV features examined across 183 tumours. We followed the PCAWG working classification and annotation scheme for genomic rearrangement26. Two SV callers were used to obtain exact breakpoint coordinates. Replication timing scores influencing on SV detection were set at >75, 20–75 and <20 for early, mid, and late timing, respectively47. Full details of analysis steps, parameters and relevant statistical tests are provided in Supplementary Information 9.

Reconstruction of cancer timelines

Timing of CN gains and driver mutations (SNVs and indels) into four epochs of cancer evolution (early clonal, unspecified clonal, late clonal and subclonal) was conducted using MutationTimeR24. CN gains including 2 + 0, 2 + 1 and 2 + 2 (1 + 1 for a diploid genome) were considered for a clearer boundary between epochs instead of solely information of variant allele frequency. Confidence intervals (tlo – tup) for timing estimates were calculated with 200 bootstraps. Mutation rates for each subtype were calculated according to ref. 24 such that CpG-to-TpG mutations were counted for the analysis because they were attributed to spontaneous deamination of 5-methyl-cytosine to thymine at CpG dinucleotides, therefore acting as a molecular clock.

League model relative ordering was performed to aggregate across all study samples to calculate the overall ranking of driver mutations and recurrent CNAs. The information for the ranking was derived from the timing of each driver mutation and that of clonal and subclonal CN segments, as described above. A full description is provided in Supplementary Information 10.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.