African-specific molecular taxonomy of prostate cancer

Prostate cancer is characterized by considerable geo-ethnic disparity. African ancestry is a significant risk factor, with mortality rates across sub-Saharan Africa of 2.7-fold higher than global averages1. The contributing genetic and non-genetic factors, and associated mutational processes, are unknown2,3. Here, through whole-genome sequencing of treatment-naive prostate cancer samples from 183 ancestrally (African versus European) and globally distinct patients, we generate a large cancer genomics resource for sub-Saharan Africa, identifying around 2 million somatic variants. Significant African-ancestry-specific findings include an elevated tumour mutational burden, increased percentage of genome alteration, a greater number of predicted damaging mutations and a higher total of mutational signatures, and the driver genes NCOA2, STK19, DDX11L1, PCAT1 and SETBP1. Examining all somatic mutational types, we describe a molecular taxonomy for prostate cancer differentiated by ancestry and defined as global mutational subtypes (GMS). By further including Chinese Asian data, we confirm that GMS-B (copy-number gain) and GMS-D (mutationally noisy) are specific to African populations, GMS-A (mutationally quiet) is universal (all ethnicities) and the African–European-restricted subtype GMS-C (copy-number losses) predicts poor clinical outcomes. In addition to the clinical benefit of including individuals of African ancestry, our GMS subtypes reveal different evolutionary trajectories and mutational processes suggesting that both common genetic and environmental factors contribute to the disparity between ethnicities. Analogous to gene–environment interaction—defined here as a different effect of an environmental surrounding in people with different ancestries or vice versa—we anticipate that GMS subtypes act as a proxy for intrinsic and extrinsic mutational processes in cancers, promoting global inclusion in landmark studies.

Nature | Vol 609 | 15 September 2022 | 553

[…] per 100,000, respectively 1. Through the Southern African Prostate Cancer Study (SAPCS), we report a 2.1-fold increase in aggressive disease (grades 4-5) and 4.8-fold increase in prostate-specific antigen levels at diagnosis compared with African Americans 16.
Here we describe, to our knowledge, the largest cancer and prostate cancer genomics data for sub-Saharan Africa, including 123 South African men. Controlling for study artefacts, an additional 53 Australian and 7 Brazilian individuals were passed simultaneously through the same high-depth whole-genome sequencing (WGS), mutation-calling and analytical framework. Focusing on treatment-naive cases (100% South Africans, 98% Australians and two confirmed Brazilians) and aggressive tumours (grades 4-5 for 72.2% South Africans, 86.8% Australians and 85.7% Brazilians; Extended Data Fig. 1a) at biopsy (100% South Africans) or surgery (100% Australians, 62.5% Brazilians) and patient-matched blood achieving coverages of 88.69 ± 14.78 and 44.34 ± 8.11, respectively (median ± s.d.; Supplementary Table 1), we uniformly generated, called and assessed about 2 million somatic variants. Through ancestral classification (genetic ancestry over self-identified ethnicity), we show a greater number of acquired genetic alterations within African individuals while identifying both globally relevant and African-specific genomic subtypes. Combining our somatic variant dataset with that published for ethnically defined European 7,8,17,18 and Chinese 11 prostate cancer genomes, we reveal a prostate cancer taxonomy with different clinical outcomes. The inclusion of 2,658 cancer genomes from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) 13 expanded our global mutational subtyping between cancer types. Using known clock-like mutational processes in each subtype, we inferred mutation timing of oncogenic drivers in broad periods of tumour evolution and calculated the mutation rates for each subtype that had a distinctive tumour evolution pattern. Combined, these analyses enable us to demonstrate how global inclusion in cancer genomics can unravel unseen heterogeneity in prostate cancer in terms of its genomic and clinical behaviours.

Genetic ancestry
Genetic ancestries were estimated for the 183 patient donors using a joint dataset in a unified analysis aggregated from a collection of geographically matched African (n = 64) and European (n = 4) deep-coverage published and unpublished reference genomes 19 . Ancestries were assigned using 7,472,833 markers as African (n = 113, all South Africans), with greater than 98% contribution; European (n = 61; 53 Australians, 5 South Africans and 3 Brazilians), allowing for up to 10% Asian contribution (with a single outlier of 26%); and African-European admixed (n = 9; 5 South Africans and 4 Brazilians), with as little as 4% African or European contribution (Extended Data Fig. 1b).

Total somatic mutations
In 183 prostate tumours, we identified 1,067,885 single-nucleotide variants (SNVs), 11,259 dinucleotide variants, 307,263 small insertions and deletions (indels, <50 bp), 419,920 copy-number alterations (CNAs) and 22,919 structural variants (SVs), with each mutational type elevated in tumours from African individuals (Fig. 1a). A median of 37.54% ± 5.51 of SNVs were C-to-T mutations, and the cohort-wide transition-to-transversion ratio was 1.282. Tumours from African individuals had […]. The percentage of genome alteration (PGA) was similarly greater in Africans (7.26% versus 2.82%, P = 0.021). Correlation tests of ancestry and total somatic mutations also supported the findings (false-discovery rate (FDR) = 0.009 and FDR = 0.032 for SNVs and PGA, respectively; Extended Data Fig. 1d). The six highest estimates of SV breakpoints per sample were observed among African patients (928-2,284 breakpoints). No overall differences between the ancestries were observed for chromothripsis (range, 52-55%) and chromoplexy (range, 33-38%), whereas tumours from African individuals demonstrated a trend towards a higher number of interchromosomal chromoplexic chains (1-6 versus 1-2). Moreover, the burdens of all mutational types were strongly correlated with one another (Fig. 1b). Thus, the more mutations a prostate tumour has of any given type, the more mutations it is likely to have of all types.
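The two SNV summary statistics quoted above (the transition-to-transversion ratio and the fraction of C-to-T mutations) can be illustrated with a minimal sketch. It assumes SNVs are available as simple (reference base, alternate base) pairs; the function names are hypothetical and are not taken from the study's pipelines.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition stays within one base class (purine or pyrimidine)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def summarize_snvs(snvs):
    """Return (Ti/Tv ratio, fraction of C-to-T SNVs) for (ref, alt) pairs.

    Substitutions are folded onto a pyrimidine reference base, so G>A on the
    reference strand is counted as C>T (an assumption of this sketch).
    """
    counts = Counter()
    for ref, alt in snvs:
        if ref == alt:
            raise ValueError("ref and alt must differ")
        if ref in PURINES:  # fold to the pyrimidine strand
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        counts["ti" if is_transition(ref, alt) else "tv"] += 1
        if (ref, alt) == ("C", "T"):
            counts["c_to_t"] += 1
    total = counts["ti"] + counts["tv"]
    ratio = counts["ti"] / counts["tv"] if counts["tv"] else float("inf")
    return ratio, counts["c_to_t"] / total


# Toy input: three transitions (two of them C>T after folding), two transversions.
example = [("C", "T"), ("G", "A"), ("A", "G"), ("C", "A"), ("T", "G")]
ti_tv, c_to_t_fraction = summarize_snvs(example)
```

In practice the pairs would be read from a somatic VCF; that step is omitted here to keep the sketch self-contained.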

Candidate oncogenic drivers
Prostate cancer is known to have a long tail of oncogenic drivers 18 across the spectrum of different mutational types 8 (Extended Data Fig. 2). Protein-coding mutations, including those that are probably and possibly damaging, were significantly more numerous in African individuals (PolyPhen-2, 14 versus 11 mutations in a European individual; P = 0.022, two-sample t-test; after exclusion of hypermutated tumours, P = 0.039). We identified 482 coding and 167 non-coding drivers defined by the PCAWG consortium 20 (Extended Data Fig. 3a). A median of two (first quartile to third quartile, 2-4) coding drivers was observed in this study (Supplementary Table 2), with one (0-2) appearing to be specific to prostate cancer 7,8,17,18. The coding driver genes significantly mutated among 183 patients were FOXA1, PTEN, SPOP and TP53 (10-25 patients, FDR = 1.34 × 10⁻²¹ to 9.44 × 10⁻⁵), whereas non-coding driver elements included the FOXA1 3′ UTR, SNORD3B-2 small RNA and a regulatory microRNA promoter at chromosome 22:38381983 (FDR = 9.12 × 10⁻¹³, FDR = 6.16 × 10⁻⁹ and FDR = 0.070, respectively). Recurrent CNAs across all patients included 137 gains and 129 losses (GISTIC2, FDR < 0.10; Supplementary Table 3), with some spanning driver genes (Extended Data […]). Moreover, a fraction of somatic SVs (2 breakpoints each; 1,328 breakpoints in total) overlapped with 156 driver genes reported as altered by significantly recurrent breakpoints in the PCAWG study 20. Using a generalized linear model with adjustable background covariates, we identified an additional 100 genes as significantly affected by SV breakpoints (FDR = 1.3 × 10⁻⁴³ to 0.097; Extended Data […] Table 4). For more than 20% of tumours, SV breakpoints coexisted with other mutational types within DNAH2, ERG, FAM66C, FXR2, PTEN, SHBG and TP53.
Using optical genome mapping, an alternative non-sequencing method for examining chromosomal abnormalities 21, we validated recurrent breakpoints in HLA regions (DQA1 and DQB1 genes), identifying translocations between the 3 Mb HLA complex at chromosome 6 and its corresponding HLA alternative contigs (Extended Data Fig. 3d).

Integrative clustering analysis
Molecular subtyping of tumours is a standard approach in cancer genomics to stratify patients into different degrees of somatic alterations in a homogeneous population, with an implication for clinical use 9-11. We identified five of the seven TCGA oncogenic driver-defined subtypes in our study 7; European patients were 25% more likely than African patients to be classified (Supplementary Table 5). For further molecular classification, we performed iCluster analysis on all mutational types (small mutations, CNAs and SVs), identifying four subtypes, A to D (Fig. 2c,d and Supplementary Table 6). We found that subtype A is mutationally quiet (1.01 mutations per Mb, 0.50 breakpoints per 10 Mb, 2% PGA); by contrast, subtype D showed the greatest mutational density (1.91 mutations per Mb, 1.08 breakpoints per 10 Mb, 31% PGA) with a mixture of CN gains and losses, whereas subtypes B and C were marked by substantial CN gains or losses, respectively (Fig. 3a). The quiet subtype seems to be common in prostate cancer studies 7,9,23, while the number of pan-cancer consensus drivers 20 increased from subtype A (median, 2 drivers) to B (median, 3 drivers), C (median, 3 drivers) and D (median, 4 drivers).

Figure caption (fragment): a, Genes shown: TMPRSS2, ERG, RB1, PTEN, FAT1, THADA, LSAMP, FOXA1, TP53, KMT2C, SPOP, MSH2, BRAF, MYC, ZMYM3, SETBP1, ZFHX3, CDK12, DDX11L1, STK19, NCOA2, PCAT1, MSH6, PAPSS2, MTCH2, ATR, BRCA1, POLE, TP53BP1, ATM, BRCA2, ERCC5, ERF, PIK3CA and APC. […] (3) other ancestry-associated genes studied in prostate cancer (assoc., purple). The OR, 95% confidence interval and two-sided P value (<0.05) were calculated using Fisher exact tests for count data and including 10 African-specific (OR = 0) and 3 European-specific (OR = infinity) genes. Significance was observed for TMPRSS2 (P = 0.0006), ERG (P = 0.003), SETBP1 (P = 0.012), DDX11L1 (P = 0.0001), STK19 (P = 0.004), NCOA2 (P = 3.14 × 10⁻⁶), PCAT1 (P = 0.012), PAPSS2 (P = 0.042) and MTCH2 (P = 0.014). b, The mutational frequency of the altered driver genes between Africans and Europeans by mutational type (CDS, non-coding, SV and CNA). c, An integrative clustering analysis reveals four distinct molecular subtypes of prostate cancer. The molecular subtypes are illustrated by small somatic mutations (coding regions and non-coding elements), somatic CNAs and somatic SVs. The proportion and association between the iCluster membership and patient ancestry are illustrated in d. Additional unsupervised consensus clustering on each data type was performed and mostly recapitulated the subtypes by integrative analysis. d, Total somatic mutations across four molecular subtypes in this study. The dashed lines indicate the median values of mutational densities across the four subtypes. For each subtype, patients are ordered on the basis of their ancestry.

Article
The 124 preferentially mutated genes within our tumour subtypes corresponded to 8 TCGA/ICGC cancer pathways (Supplementary Information and Extended Data Fig. 5). Whereas six pathways showed slightly elevated mutational frequencies in tumours from African individuals, genes affecting epigenetic mechanisms were significantly biased towards European individuals (OR = 5.586, P = 2.9 × 10⁻⁷; Extended Data Fig. 6b). Pathway enrichment analysis supported five functional networks of the cancer pathways; two of these networks, involved in signal transduction and DNA checkpoint processes, interacted with five of the eight pathways (Extended Data Fig. 6a and Supplementary Table 8).

Global molecular subtypes
By combining molecular profiling with patient demographics, genetic ancestry and geography, we identified a new prostate cancer taxonomy that we define as GMS (Fig. 2d). Whereas all European patients from Australia (n = 53) and Brazil (n = 3) were limited to GMS-A and GMS-C, tumours from African individuals were dispersed across all four subtypes. We found that GMS-B and GMS-D predominate in African individuals, with GMS-B including a single patient of admixed ancestry (92% African) and GMS-D including a single admixed patient (63% African) and a single patient of European ancestry. The latter individual was one of only five Europeans in our study who were born and raised in Africa. Compared with the other patients of European ancestry, this patient showed the highest mutational density across all types. Alternative consensus clustering of individual mutational types mostly recapitulated the subtypes by integrative analysis (Supplementary Table 6). By further including Chinese Asian high-risk prostate cancer data 11 (n = 93; Extended Data Fig. 7a), we found that GMS-A is ancestrally and geographically universal, whereas GMS-D remained African specific, with a new African-Asian GMS-E emerging. GMS-B remained African specific and GMS-C remained European-African specific. Although all of the patients were treatment naive at the time of sampling, our European cohort was recruited with extensive follow-up […]

Figure caption (fragment): a, b, Genes shown: APC, ATM, ATRX, BRAF, CDK12, CDKN1B, CHD3, CHD4, CUL2, ERF, FIP1L1, FOXA1, FOXP1, IL6ST, KDM6A, KMT2C, KRAS, LRP1B, MGA, NDST4, NOTCH2, PDE4D, PIK3R1, PTEN, CNOT1, RNF43, ROBO2, SCAI, SPOP, TP53 and ZMYM3 (legend: co-occurrence). […] The dagger symbols denote alterations that are found to have different timing. Significant pairwise interaction events between the mutations and CNAs were computed to support cancer timelines. The OR and two-sided P value were calculated using Fisher exact tests. A co-occurrence or mutually exclusive event is considered when OR > 2 or OR < 0.5, respectively. The interaction significance between pairs in GMS-A and GMS-C has P values ranging from 2.04 × 10⁻³⁰ to 0.047 and from 1.64 × 10⁻²⁷ to 0.045, respectively. Median mutation rates of CpG-to-TpG burden per Gb are calculated using the age-adjusted branch length of cancer clones and maximally branching subclones. The mutation rate plots in a and b show the median ± 2 s.e. of fitted data as dashed lines and error bands, respectively. c, Schematic of a world map with the distribution of GMS-A-D among ancestrally/globally diverse populations. The gene-environment interaction of GMS is shown on the right. The contingency table shows the number of patients with different ancestries (germline variants) stratified by subtypes and associated with certain geography or environmental exposure (two-sided P = 0.0005, Fisher exact test with 2,000 bootstraps).
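The pairwise interaction rule stated in the caption above (Fisher exact test on a 2 × 2 table, with OR > 2 read as co-occurrence and OR < 0.5 as mutual exclusivity) can be sketched as follows. This is a minimal standard-library illustration, not the study's code: the function names are hypothetical, the odds ratio is the simple sample odds ratio, and the two-sided P value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value under the hypergeometric null."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of a table with x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def classify_interaction(or_value: float) -> str:
    """Apply the OR > 2 / OR < 0.5 thresholds quoted in the caption."""
    if or_value > 2:
        return "co-occurrence"
    if or_value < 0.5:
        return "mutually exclusive"
    return "neither"


# Toy table: rows = gene X mutated / wild type, columns = CNA Y present / absent.
table = (8, 2, 1, 5)
or_value = odds_ratio(*table)
p_value = fisher_exact_two_sided(*table)
label = classify_interaction(or_value)
```

The counts in the toy table are invented for illustration; a pair would normally be reported only if its P value also passes the chosen significance threshold.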
Our GMS taxonomy could inform pan-cancer studies in the following ways. First, the sampling of patients in the PCAWG project was rather homogeneous within each cancer type, inhibiting the discovery of globally restricted subtypes 3,13 (Extended Data Fig. 7b). Second, genetic ancestry 25 and geographical data of patients should be included in the molecular profiling of cancers. Finally, studies of ethnic disparity in cancer need to properly address genetic admixture in the sampling cohort: an ancestral cut-off that is too low appears to create highly admixed, but similar, ancestry among individuals, thereby discouraging the inclusion of ethnically diverse samples.

New and known mutational signatures
Approximating the contribution of mutational signatures to individual cancer genomes facilitates the association of the signatures with exogenous or endogenous mutagen exposures that contribute to the development of human cancer 3. Here we generated a list of CN and SV signatures and their contributions to prostate cancer using non-negative matrix factorization 26 (Extended Data Fig. 8a,b). Combined with a known catalogue of small mutational signatures, including single-base substitutions (SBSs), doublet base substitutions (DBSs) and indels (IDs), we observed not only a substantial variation in the number of mutational features but also their over-representation in tumours from African individuals (Extended Data Fig. 8c). […] 26 were each mutually exclusive and included simple SVs (split according to size, replication timing and occurrence at fragile sites), templated insertions (split by size), local n-jumps and local-distant clusters. The factorization of a sample-by-mutation spectrum matrix identified six CN signatures (CN1-6) and eight SV signatures (SV1-8), as well as their contributions to each tumour.
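To make the factorization step concrete, the sketch below decomposes a small sample-by-feature count matrix into per-sample exposures and per-signature feature profiles using Lee-Seung multiplicative updates, one common way to fit non-negative matrix factorization. The text does not state which solver or initialization was used, so the algorithm choice, the function names and the toy matrix are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, exposures, signatures):
    """Squared Frobenius norm of V - exposures @ signatures."""
    approx = matmul(exposures, signatures)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=300, seed=0, eps=1e-9):
    """Factorize V (samples x features) as exposures (samples x rank)
    times signatures (rank x features), all entries non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    e = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    s = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update signatures: S <- S * (E^T V) / (E^T E S)
        et = transpose(e)
        num, den = matmul(et, v), matmul(matmul(et, e), s)
        s = [[s[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # Update exposures: E <- E * (V S^T) / (E S S^T)
        st = transpose(s)
        num, den = matmul(v, st), matmul(e, matmul(s, st))
        e = [[e[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return e, s


# Toy matrix: 4 samples x 3 features, built from two underlying profiles.
v = [[1, 0, 2], [0, 3, 1], [2, 3, 5], [1, 3, 3]]
exposures, signatures = nmf(v, rank=2)
```

Choosing the number of signatures (six CN and eight SV in the text) requires a separate model-selection step that this sketch does not attempt.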
We found that the full spectrum of mutational signatures (SBSs, DBSs, IDs, CNs and SVs) supports our newly described GMS. Enrichment records of the top signatures in each tumour were significantly associated type by type with the taxonomic subtypes, except for DBSs (P = 5.1 × 10⁻⁷ to 0.017, one-way analysis of variance (ANOVA) or Fisher exact test; Extended Data Fig. 8d). Regardless of the signature type, 13 out of 40 mutational signatures showed either inverse or proportionate correlations with our GMS (FDR = 4.97 × 10⁻¹³ to 0.095, Spearman correlation; Fig. 4a). Duplication signatures, including CN1 (tandem duplication), CN4 (whole-genome duplication), SV2 (insertion) and SV5 (large duplication), were biased to the most mutationally noisy subtype (Extended Data Fig. 8a,b), with CN4 and SV5 frequent in Africans (correlation coefficient = −0.24, FDR = 0.005-0.006). Figure 4b shows that the duplication signatures have at least a 1.5-fold greater proportion of genomic aberrations in GMS-B, GMS-C and GMS-D compared with the universal GMS-A. Furthermore, the African-specific subtype GMS-B consisted of several CN4 and SV5 genomic aberrations composed predominantly of CN amplifications (>5 copies and mainly >100 kb in length) and tandem duplications (<5 Mb in size, occurring during early to late DNA replication timing), respectively. Moreover, the mutational density of 30 out of 32 genes that are highly mutated in our GMS and reported in prostate cancer was significantly correlated with different somatic signatures, with most observed in the CN2, CN6 and SV6 signatures, which mainly reflect genomic deletions (FDR = 1.61 × 10⁻⁷ to 0.082).

Evolution of GMS
Timeline estimates of individual somatic events reflect evolutionary periods that differ from one patient to another; for example, a cluster of identical alterations derived from clones in one patient presented as subclonal events in another patient (Extended Data Fig. 10a,b). However, they provide in part the order of driver mutations and CNAs present in each sample 24. Aggregating the single-sample orderings of all drivers and CNAs reveals different evolutionary patterns that are unique to each GMS subtype (Fig. 5a,b and Extended Data Fig. 10c). We drew approximate cancer timelines for each GMS subtype portraying the ordering of driver genes, recurrent CNAs and signature activities chronologically interleaved with whole-genome duplication and the emergence of the most recent common ancestor leading up to diagnosis. Significantly co-occurring interactions of the drivers and CNAs are shown (OR = 2.6-97.8, P = 2.04 × 10⁻³⁰ to 0.01), supporting their clonal and subclonal ordering states within the reconstructed timelines. SBS and indel signatures that are abundant in each GMS subtype display changes in their mutational spectrum between the clonal and subclonal state, suggesting a difference in mutation rates. The plot of clock-like CpG-to-TpG mutations with patient-age adjustment shows a median mutation rate as low as 0.968 per year for the universal GMS-A, with the highest rate, 1.315 per year, observed in the African-specific GMS-D. GMS-B and GMS-C have rates of 1.144 and 1.092 per year, respectively. Assessing the relative timing of somatic driver events, TP53 mutations and accompanying 17p loss are of particular interest, occurring early in GMS-C progression and at a later stage in GMS-A.
The league model of relative driver-event timing (Supplementary Information) is consistent with a fraction of the probability distribution of the TP53 alterations lying at the early stage, although most of it lies at an intermediate stage of evolution (Extended Data Fig. 10d). This basic knowledge of in vivo tumour development suggests that some tumours could have a shorter latency period before reaching their malignant potential, so knowing the genomic heterogeneity of their primary clones is paramount for paving the way to early detection.

Discussion
Our study represents one of the largest whole-genome prostate cancer genome resources for sub-Saharan Africa (a summary is provided in Supplementary Table 12). Acknowledging the lack of information on clinical staging for the South African patients (recruited at diagnosis), we describe a prostate cancer molecular taxonomy, identifying ancestrally distinctive GMS. Compared with previous taxonomies based on significantly mutated genes in prostate cancer 7,18, we found that GMS subtypes complement known subtypes such as SPOP and FOXA1 mutations, in contrast to subtypes that are under-represented in this study, including gene fusions (Extended Data Fig. 4a). We also found that GMS subtypes correlate with mutational signatures reported in the known catalogue of somatic mutations in cancer, in which each tumour is represented by different degrees of exogenous and endogenous mutagen exposures 3. Our study drew on the PCAWG consortium's analysis of evolution across 38 cancer types 24, recognizing that each GMS subtype represents a unique evolutionary history, with drivers and mutational signatures that vary between cancer stages, and linking somatic evolution to a patient's demographics. Thus, some represent rare or geographically restricted signatures that have not been observed in pan-cancer studies 3,13.
For conceptual simplicity, we considered two extreme cases, universal GMS-A versus African-specific GMS-B and GMS-D, that would have been influenced by two different mutational processes (Fig. 5c). One factor is predisposing genetics 27-29 contributing to endogenous mutational processes, especially those with significant germline-somatic interactions, such as the TMPRSS2-ERG fusion, which is less frequently observed in men of African and Asian ancestry 11,30, and germline BRCA2 mutations and the somatic SPOP driver, which co-occur with their respective counterparts 31,32. Another factor is modifiable environmental attributes that are specific to certain circumstances or geographical regions and that, to date, have not been observed in prostate cancer. They act as mutagenic forces leading to the positive selection of point mutations throughout life in healthy tissues 33,34 and cancers 35, forming fluid boundaries between normal ageing and cancer tissues. According to Ottman 36, the above-mentioned model of gene-environment interaction is observed when there is a different effect of a genotype on disease in individuals with different environmental exposures or, alternatively, a different effect of an environmental exposure on disease in individuals with different genotypes. Other GMS subtypes would be a combination of the two processes, warranting larger populations that capture ancestral versus ethnic and geographical diversity. As such, the study directly accounts for the large spatiogenomic heterogeneity of prostate cancer and its associated evolutionary history in understanding the disease aetiology.
Our study suggests that larger genomic datasets of geo-ethnically diverse and ancestrally defined populations in a unified analysis will continue to identify rare and geographically restricted subtypes in prostate cancer and potentially other cancers. We demonstrate that ancestral and geographical attributes of patients could facilitate those studies on cancer population genomics, an alternative to cancer personalized genomics, for a better scientific understanding of nature versus nurture.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-022-05154-6.

Patient cohorts and WGS
Our study included 183 treatment-naive patients with prostate cancer who were recruited under informed consent and appropriate ethics approval (Supplementary Information 2) from Australia (n = 53), Brazil (n = 7) and South Africa (n = 123). While patients were matched for pathological grading, as previously reported, prostate-specific antigen levels are notably elevated within our African patients 16, and we cannot exclude patients on the basis of potential metastasis (as data on metastases in this cohort are unavailable). DNA extracted from fresh tissue and matched blood underwent 2 × 150 bp sequencing on the Illumina NovaSeq instrument (Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research).

WGS processing and variant calling
Each lane of raw sequencing reads was aligned against human reference hg38 + alternative contigs using bwa (v.0.7.15) 37. Lane-level BAM files from the same library were merged, and duplicate reads were marked. The Genome Analysis Toolkit (GATK, v.4.1.2.0) was used for base quality recalibration 38. Contaminated and duplicate samples (n = 8) were removed. We implemented three main pipelines for the discovery of germline and somatic variants, with the latter ranging from small (SNV and indel) to large genomic variation (CNAs and SVs). The complete pipelines and tools used are available from the Sydney Informatics Hub (SIH), Core Research Facilities, University of Sydney (see the 'Code availability' section). Scalable bioinformatic workflows are described in Supplementary Information 4. Genetic ancestry was estimated using fastSTRUCTURE (v.1.0) 39, which uses Bayesian inference to best approximate the marginal likelihood of a very large variant dataset. Reference panels for African and European ancestry compared in this study were retrieved from previous whole-genome databases 19.

Analysis of mutational recurrence
We used three approaches to detect recurrently mutated genes or regions based on three mutational types, including small mutations, SVs and CNAs ( Supplementary Information 7). In brief, small mutations were tested within a given genomic element as being significantly more mutated than the adjacent background sequences. The genomic elements retrieved from syn5259886, the PCAWG Consortium 20 , were a group of coding sequences and ten groups of non-coding regions. SV breakpoints were tested in a given gene for their statistical enrichment using gamma-Poisson regression and corrected by genomic covariates 12 . Focal and arm-level recurrent CNAs were examined using GISTIC (v.2.0.23) 42 . Known driver mutations in coding and non-coding regions published in PCAWG 20,43,44 were also recorded in our 183 tumours, and those specific to prostate cancer genes were also included 7,8,12,17,18 .
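The recurrence tests above are reported as FDR values, but the text does not name the multiple-testing procedure. Purely as an illustration, and assuming the widely used Benjamini-Hochberg step-up adjustment, converting a list of raw P values into FDR-adjusted values could look like the sketch below; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Assumption of this sketch: the FDR is controlled with the BH step-up
    procedure; the source does not specify the method.
    """
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: four genes tested for recurrent mutation.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [q < 0.10 for q in fdr]  # FDR < 0.10, the GISTIC2 cut-off quoted in the Results
```

The toy P values are invented; in the study each value would come from the element-level, breakpoint-level or CNA-level recurrence test described in this section.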

Integrative analysis of prostate cancer subtypes
Integrative clustering of three genomic data types for 183 patients was performed using iClusterPlus 11,45 in R, with the following inputs: (1) driver genes and elements; (2) somatic CN segments; and (3) significantly recurrent SV breakpoints. We ran tune.iClusterPlus with clusters ranging from 1 to 9. We also performed unsupervised consensus clustering on each of the three data types individually. Association analysis of genomic alteration with different iCluster subtypes was performed in detail (Supplementary Information 8). Differences in driver mutations, recurrent breakpoints and somatic CNAs across different iCluster subtypes were reported.

Comparison of iCluster with Asian and pan-cancer data
To compare molecular subtypes between extant human populations, the Chinese Prostate Cancer Genome and Epigenome Atlas (CPGEA, PRJCA001124) 11 was merged and processed with our integrative clustering analysis across the three data types described above, with some modifications. Moreover, we leveraged the PCAWG consortium data 13 to define molecular subtypes across different ethnic groups in other cancer types using published data of somatic mutations, SV and GISTIC results by gene. Four cancer types consisting of breast, liver, ovarian and pancreatic cancers were considered due to existing primary ancestries of African, Asian and European with at least 70% contribution. Full details are provided in Supplementary  Information 8.4. PCAWG 13 participants with prostate cancer were retrieved to compare with Australian data with clinical follow-up. Only those of European ancestry greater than 90% (n = 139) were analysed for the three genomic data types of iCluster subtyping, as well as individual consensus clustering. Clustering results identical to the larger cohort size mentioned above were chosen for association analyses. Differences in the biochemical relapse and lethal prostate cancer of the participants across the subtypes were assessed using the Kaplan-Meier plot followed by a log-rank test for significance.
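As a minimal illustration of the survival comparison described above, the sketch below computes a Kaplan-Meier curve for one subtype from follow-up times and event indicators. It covers only the estimator, not the log-rank test, and the function name and toy data are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times  : follow-up time for each patient (for example, months)
    events : True if the event (for example, biochemical relapse) was
             observed at that time, False if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_t = [e for tt, e in data if tt == t]
        n_events = sum(1 for e in at_t if e)
        if n_events:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Everyone with time t (event or censored) leaves the risk set.
        n_at_risk -= len(at_t)
        i += len(at_t)
    return curve


# Toy subtype with five patients; two are censored.
curve = kaplan_meier([5, 8, 8, 12, 15], [True, False, True, True, False])
```

Curves for different subtypes would each be computed this way and then compared with a log-rank test, which is left out of this sketch.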

Analysis of mutational signatures
Mutational signatures (SBSs, DBSs and indels), as defined by the PCAWG Mutational Signatures Working Group 3, were fit to individual tumours with observed signature activities using SigProfiler 46. Non-negative matrix factorization was implemented to detect de novo and global signature profiles among 183 patients and their contributions. New genome rearrangement signatures (CN and SV) were also extracted using non-negative matrix factorization, with 45 CN and 44 SV features examined across 183 tumours. We followed the PCAWG working classification and annotation scheme for genomic rearrangement 26. Two SV callers were used to obtain exact breakpoint coordinates. Replication timing scores influencing SV detection were set at >75, 20-75 and <20 for early, mid and late timing, respectively 47. Full details of analysis steps, parameters and relevant statistical tests are provided in Supplementary Information 9.
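The replication-timing cut-offs quoted above translate directly into a three-way binning rule. The sketch below encodes those thresholds; the function name is hypothetical, and treating the boundary values 20 and 75 as mid timing follows from the stated ranges (>75, 20-75, <20).

```python
def replication_timing_class(score: float) -> str:
    """Bin a replication timing score into early, mid or late.

    Thresholds follow the text: >75 early, 20-75 mid, <20 late.
    """
    if score > 75:
        return "early"
    if score >= 20:
        return "mid"
    return "late"


# Toy scores for three SV breakpoints.
classes = [replication_timing_class(s) for s in (82.0, 40.5, 12.3)]
```

Each SV breakpoint would be annotated with such a class before being counted into the SV feature matrix used for signature extraction.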

Reconstruction of cancer timelines
Timing of CN gains and driver mutations (SNVs and indels) into four epochs of cancer evolution (early clonal, unspecified clonal, late clonal and subclonal) was conducted using MutationTimeR 24. CN gains including 2 + 0, 2 + 1 and 2 + 2 (1 + 1 for a diploid genome) were considered to obtain a clearer boundary between epochs, rather than relying solely on variant allele frequency information. Confidence intervals (t_lo to t_up) for timing estimates were calculated with 200 bootstraps. Mutation rates for each subtype were calculated according to ref. 24, with CpG-to-TpG mutations counted for the analysis because they are attributed to spontaneous deamination of 5-methyl-cytosine to thymine at CpG dinucleotides and therefore act as a molecular clock.
League model relative ordering was performed to aggregate across all study samples to calculate the overall ranking of driver mutations and recurrent CNAs. The information for the ranking was derived from the timing of each driver mutation and that of clonal and subclonal CN segments, as described above. A full description is provided in Supplementary Information 10.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
DNA-sequencing data have been deposited at the European Genome-Phenome Archive (EGA) under overarching accession EGAS00001006425, including the Southern African Prostate Cancer Study (SAPCS) Dataset (EGAD00001009067) and the Garvan/St Vincent's Prostate Cancer Database (EGAD00001009066). Academic researchers meeting the data-access policy criteria may apply for data access through the respective data access committees. CPGEA data are available through http://www.cpgea.com. PCAWG data are available at ICGC Data Portal (https://dcc.icgc.org/releases/PCAWG).

Code availability
The core computational pipelines used in this study for read alignment, quality control and variant calling are available at GitHub (https://github.com/Sydney-Informatics-Hub/Bioinformatics). Analysis code for chromothripsis and chromoplexy is also available at GitHub (https://github.com/tgong1/Code_HRPCa).

European, blue. d, Stacked barplots of multiple signature exposures for each mutational type enriched per patient and ranked by ancestral group. In many cases, certain mutational signatures occur more frequently in a tumour than others. The top enrichment of small- to large-size mutational signatures is shown for each patient in Supplementary Table 9 (see Enrichment). Copy number and structural variation signatures (CN1-6 and SV1-8, respectively) are identified here for the first time in prostate cancer, and their top enrichment of signature mixture/exposure per patient appears to be significantly associated with our GMS (one-way ANOVA or Fisher's exact test, two-sided P values = 5.1e-07 to 0.017), considering either de novo signatures or global mutational signatures catalogued in the Catalogue of Somatic Mutations in Cancer (COSMIC). This supports a role of GMS in explaining intrinsic and extrinsic mutational processes in cancer.

Extended Data Fig. 10 | Stages of prostate tumour development. a, Clonal architecture and its frequency in prostate cancer in Africans and Europeans. Tumours are divided into three groups: monoclonal, linear and branching polyclonal. The number of small somatic mutations (SSM) and the CNAs as percentage of genome alteration (PGA) are given as median, with the range in brackets. The cancer cell fraction (CCF) of each clone and/or subclone is shown in a circular node. Tumours with characteristics consistent with polytumours or multiple independent primary tumours were excluded to remain conservative. b, Unbiased hierarchical clustering of CNAs between clonal (trunk) and subclonal (branch) mutations. Trunk mutations are those that occur between the root node (normal) and its only child node; all others are classified as having occurred in a branch. Red indicates gain; blue indicates loss; rows indicate patients. Unidentified regions in trunk and branch are assumed to have neutral copy number. ConsensusClusterPlus indicated that seven CNA clusters were optimal among our patients. The figure shows that a trunk alteration from one patient can be mutationally more similar to a branch alteration from another patient than to trunk alterations from other patients in the cohort. c, Cancer timelines of GMS-B and -D identified in this study. A detailed explanation is provided in Fig. 5.
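The grouping in panel a and the trunk and branch definition in panel b can be sketched as follows. This is a simplified illustration under stated assumptions: the clone tree is supplied as a hypothetical mapping from each clone to its parent, node 0 stands for the normal root, and trees in which the root does not have exactly one child are flagged as excluded, mirroring the conservative handling of possible polytumours. It is not the PhyloWGS output format.

```python
from collections import defaultdict

ROOT = 0  # assumed id of the normal (non-cancer) root node


def classify_architecture(parent):
    """Classify a clone tree given as {clone_id: parent_id}; ROOT has no entry."""
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)
    if len(children[ROOT]) != 1:
        return "excluded"  # no founding clone, or several independent ones
    if len(parent) == 1:
        return "monoclonal"  # a single cancer clone below the root
    if any(len(kids) > 1 for kids in children.values()):
        return "branching"  # some clone has two or more subclones
    return "linear"  # every clone has at most one subclone


def trunk_and_branch(parent):
    """Split clones into trunk (the root's only child) and branch (all others)."""
    trunk = [node for node, par in parent.items() if par == ROOT]
    branch = [node for node, par in parent.items() if par != ROOT]
    return trunk, branch


# Hypothetical trees, for illustration only.
print(classify_architecture({1: 0}))
print(classify_architecture({1: 0, 2: 1}))
print(classify_architecture({1: 0, 2: 1, 3: 1}))
print(trunk_and_branch({1: 0, 2: 1, 3: 1}))
```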

Reporting Summary

Field-specific reporting
Life sciences

Life sciences study design

Sample size
Sample sizes were determined in order to obtain nearly 200 tumour-blood pairs, based on biospecimen availability, with a focus on samples from underrepresented populations. All primary tumours and matched blood from 190 specimens were used to generate sequencing data in this study. We considered this sample size to be sufficient because our significant comparison of tumour genome profiling between Europeans (n=9) and Africans (n=6) has previously been published in a peer-reviewed journal. For comparison, 93 CPGEA donors with high-risk prostate cancer, most of them treatment-naive, were included, and 628 PCAWG donors were chosen on the basis of different primary ancestries. An additional 256 prostate cancer patients from PCAWG, most of them treatment-naive, were also compared.

Data exclusions
After quality assurance, data from 8 tumour-blood pairs were excluded as unusable. Reasons for data exclusion included evidence of cross-contamination and duplication. Hypermutated tumours (30 mutations/Mb) were removed from the mutational recurrence analysis of small mutations and from the cancer evolution analysis, following the ActiveDriverWGS and PhyloWGS software user manuals.

Replication
The accuracy of SV breakpoint inference was assessed by applying two different algorithms and selecting only calls detected by both. Integrative clustering analysis was re-assessed using independent clustering of each dataset, with subsequent results mostly recapitulating the subtypes found by the integrative analysis.

Randomization
N/A. This exploratory study of tumour genome profiling in underrepresented populations did not include a randomization step owing to biospecimen scarcity.

Blinding
N/A. This exploratory study within underrepresented populations did not include blinded data collection, owing to its focus on those populations. Sequencing and early steps in data analysis were partially blinded using a pool of all collected samples of different ancestries.

Reporting for specific materials, systems and methods
... Table 1. Demographically, the cohort included 53 Australians, 7 Brazilians and 123 Africans, with ages ranging from 45 to 99 years (median 65.5 years). On genetic ancestry analysis, the cohort consisted of 113 Africans, 61 Europeans and 9 admixed individuals, mostly of African and European ancestry. Preoperative PSA levels ranged from 3.5 to 4,84 ng/ml (median 22.9 ng/ml). ISUP Grade Groups were distributed as follows: 0-2: 29 (16.6%); 3: 11 (6.3%); 4: 52 (29.7%); and 5: 83 (47.4%). All patients are male.

After patient consent was obtained, tumour and blood samples were collected from 183 patients from Australia (n=53), Brazil (n=7) and South Africa (n=123), most of whom presented with clinicopathologically confirmed prostate cancer. All patients were treatment-naive at the time of sampling except one Australian patient (PID 15178), who had received one month of Ozurdex therapy. In three patients, cancer was not confirmed and benign prostate hyperplasia (BPH) was diagnosed instead. All men from the Southern African Prostate Cancer Study (SAPCS) were recruited at the time of diagnosis; their tumour tissue was therefore derived from biopsy cores, and age and PSA levels were recorded at the time of diagnosis. Australian and Brazilian subjects were recruited at the time of radical prostatectomy. Additional selection criteria included availability of fresh-frozen tissue and matched blood, self-reported ethnicity and country of origin, and availability of clinical and pathological data.

The results of this study do not represent all underrepresented populations in Africa and South America, as only South Africa and Brazil were studied.
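
The two-algorithm consensus rule for SV breakpoints described above can be illustrated with a minimal sketch. The record layout, the function names and the matching tolerance `tol` (in base pairs) are hypothetical choices made for this example. The study does not state how calls from the two algorithms were matched.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """A structural variant call described by its two breakpoints."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    svtype: str


def same_event(a, b, tol):
    """Treat two calls as the same event if type and chromosomes agree
    and both breakpoints lie within tol base pairs of each other."""
    return (
        a.chrom1 == b.chrom1
        and a.chrom2 == b.chrom2
        and a.svtype == b.svtype
        and abs(a.pos1 - b.pos1) <= tol
        and abs(a.pos2 - b.pos2) <= tol
    )


def consensus_calls(calls_a, calls_b, tol=5):
    """Keep calls from the first caller that are also found by the second.

    The default tolerance of 5 bp is an assumption, not a study parameter.
    """
    return [a for a in calls_a if any(same_event(a, b, tol) for b in calls_b)]


# Hypothetical calls, for illustration only.
caller_a = [SV("chr8", 1000, "chr8", 5000, "DEL"), SV("chr1", 200, "chr3", 900, "TRA")]
caller_b = [SV("chr8", 1002, "chr8", 4999, "DEL"), SV("chr1", 200, "chr3", 2000, "TRA")]
print(consensus_calls(caller_a, caller_b))
```

Setting `tol=0` would require identical coordinates from both algorithms, which is the strictest reading of the rule.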