The molecular landscape of Asian breast cancers reveals clinically relevant population-specific differences

Molecular profiling of breast cancer has enabled the development of more robust molecular prognostic signatures and therapeutic options for breast cancer patients. However, non-Caucasian populations remain understudied. Here, we present the mutational, transcriptional, and copy number profiles of 560 Malaysian breast tumours and a comparative analysis of breast cancers arising in Asian and Caucasian women. Compared to breast tumours in Caucasian women, we show an increased prevalence of HER2-enriched molecular subtypes and higher prevalence of TP53 somatic mutations in ER+ Asian breast tumours. We also observe elevated immune scores in Asian breast tumours, suggesting potential clinical response to immune checkpoint inhibitors. Whilst HER2-subtype and enriched immune score are associated with improved survival, presence of TP53 somatic mutations is associated with poorer survival in ER+ tumours. Taken together, these population differences unveil opportunities to improve the understanding of this disease and lay the foundation for precision medicine in different populations.

B reast cancer is a heterogenous disease, where histopathological and clinico-demographic features guide treatment choices-for example, use of endocrine treatment in oestrogen receptor-positive (ER+) tumours, use of HER2-targeted treatment in Her2-positive tumours and use of chemotherapy in women with poor prognostic features, such as high grade. The development of molecular classifiers based on gene expression 1,2 has led to a better understanding of the molecular subtypes of breast cancer, and the development of molecular-based prognostic tools such as OncoTypeDx has led to improved clinical decision-making-for example, in determining the benefit of chemotherapy in node-negative, ER+ disease 3 . Recognising that breast cancer is a copy number driven disease 4 , we have improved the stratification of breast cancers by integrating copy number alterations (CNAs) and gene expression in the classification of 2000 primary tumours from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) 5 . The subtypes identified, designated as Integrative Clusters (IntClust), have very distinct molecular features, drivers and clinical courses 6,7 .
The molecular characterisation of breast tumours has also enabled the development of new therapies, and the selection of patients for the appropriate therapies. Gene expression 8,9 , targeted sequencing 7 , whole-exome sequencing (WES) [10][11][12] and whole-genome sequencing (WGS) have revealed the genomics drivers of breast cancer, uncovering hitherto unrecognised therapeutic possibilities. Furthermore, mutational analysis has identified tumour mutational burden 13 and homologous recombination (HR) deficiency 14 as potential biomarkers for response to immunotherapy and poly (ADP-ribose) polymerase (PARP) inhibitors 15,16 , respectively.
Notably, the extent to which findings from these studies, conducted predominantly in women of European descent, can be applied to other populations where there are differences in genetic, lifestyle and environmental factors remains largely understudied. Triple-negative breast cancer is more common in African-American women and a recent genomic analysis of 194 tumours show an increased HR deficiency signature, pervasive TP53 mutations and greater structural variations, indicating a more-aggressive biology 17 . Breast cancer in Asians tend to occur at a younger age, with a higher proportion of pre-menopausal, oestrogen receptor (ER) negative and human epidermal growth factor receptor 2 (HER2) receptor-positive disease (reviewed in Yap et al. 18 ). A recent genomic analysis of 187 early-onset Asian breast cancers show a higher prevalence of TP53 mutations and enrichment in immune signatures 19 . In addition, analysis of 465 triple-negative breast cancers in Chinese women demonstrated broad similarities in tumours of the same subtype arising in Asian and Caucasian women, although Chinese patients had a significantly higher proportion of the luminal androgen receptor (LAR) subtype of TNBCs relative to Caucasian or African-American patients 9 .
Given the potential impact of tumour subtypes, candidate drivers, mutational signatures and immune profiles on treatment options for breast cancer patients, and the hitherto lack of detailed information in Asian breast cancer patients, we have performed WES, shallow WGS (sWGS) and RNA-sequencing (RNA-seq) of 560 breast tumours from a cohort of Asian patients in Malaysia and report the impact of population differences on the molecular profiles of breast cancer. Relative to Caucasian women, we found a higher prevalence of HER2-enriched molecular subtypes, as well as a higher prevalence of TP53 somatic mutations in ER+ Asian breast tumours. Importantly, we also found that Asian breast tumours have elevated immune scores. HER2 subtype and elevated immune scores were found to be associated with improved survival, whereas presence of TP53 somatic mutations was associated with poorer survival in ER+ tumours. Taken together, our findings set the stage for improving precision medicine efforts for breast cancer in the Asian populations.

Results
Study population and clinicopathological characteristics. Primary tumour tissue and blood samples (including three matched bilateral and one matched primary-recurrence samples) were obtained from 560 female patients with breast cancer treated at the Subang Jaya Medical Centre (SJMC), Malaysia between 2012 and 2017 who were recruited to the Malaysian Breast Cancer (MyBrCa) study 20 . The patients were recruited sequentially as seen in the clinic from a year to year period. After excluding samples that failed quality checks, data were generated for RNAseq (n = 527), WES (n = 546) and sWGS (n = 533). Compared with patients in The Cancer Genome Atlas (TCGA) 10 , our patients were younger, presented at later stages and had higher proportions of ductal carcinoma and Her2 positivity (28.9% MyBrCa versus 19.3% TCGA when NAs are excluded; Supplementary Table 1). The MyBrCa tumour cohort includes Malaysians from a number of different ethnicities, primarily Chinese (89%), Malay (4.6%) and Indian (3.4%) (Supplementary Table 1), and is as a whole genetically similar to other East/Southeast Asian populations according to genotyping analysis 21 . A principal component analysis of our RNA-seq data did not reveal any significant differences across different ethnicities ( Supplementary  Fig.1).
Given that the median age of diagnosis of breast cancer was 49 years in Asians and 61 years in Caucasians, we examined the distribution of subtypes in women above and below the age of 50. In women below 50 years of age (n = 206), there were no statistically significant differences between Asians and Caucasians in IntClust subtypes (Fig. 1b), but there was an increased prevalence in the PAM50 Luminal B subtype ( Supplementary  Fig. 2b). In women above 50 years of age (n = 321; Fig. 1c, Supplementary Fig. 2c), we see the significant differences noted above in the whole population analysis, i.e., there was a markedly higher prevalence of Her2-positive disease: (a) IntClust 4-(11.8% in MyBrCa, 10% in other Asians, 4.1% in Caucasians), (b) IntClust 5 (14.6% in MyBrCa, 15% in other Asians, 7.4% in Caucasians) and (c) PAM50 Her2-enriched subtype (27.4% in MyBrCa, 30% in other Asians, 10.6% in Caucasians), suggesting that it is in this age group where population-specific differences are mainly observed.
In addition, we also conducted a naive combined cluster analysis using gene expression data from the MyBrCa and TCGA Caucasian cohorts in order to determine whether there were any molecular subtypes that may be unique to Asians or Caucasians. However, unsupervised k-means consensus clustering did not reveal any exclusive clusters, but did support a higher prevalence of Her2-enriched subtypes and a lower prevalence of luminal subtypes in Asians ( Supplementary Fig. 3).
Finally, we classified the TNBC samples of the MyBrCa cohort into LAR, mesenchymal-like (MES), immunomodulatory (IM) and basal-like immune-suppresed (BLIS) TNBC subtypes using hierarchical clustering with the 80-gene signature from Burstein et al. 22 (Supplementary Fig. 4a). We found that the prevalence of each subtype was comparable with previous studies, with a relatively high prevalence of the LAR subtype (35%) that is consistent with reports from other Asian cohorts 9 (Supplementary Fig. 4b).
We performed driver gene analyses 24 ( Supplementary Fig. 11, Supplementary Tables 2 and 3) and identified well-established breast cancer oncogenes and tumour suppressors. For example, regardless of ER status or ethnicity, PIK3CA and AKT1 have high oncogene (ONC) but low tumour suppressor gene (TSG) scores, consistent with their established roles as known breast cancer oncogenes, while known tumour suppressors TP53, RB1 and PTEN have elevated TSG scores. We also observed high TSG scores for MAP3K1 and high ONC for CDKN1B in ER+ as previously noted 7 . We also observed preferentially higher SF3B1 mutations in ER+ samples, with the recurrent K700E mutation enriched in Caucasians ER+ tumours 25 (Caucasian ER+ ONC = 58.6 vs Asian ER+ ONC = 12.5). We additionally examined pairwise association of somatic mutations ( Supplementary Fig. 12) in Asian or Caucasian breast tumours. We observed mutual exclusivity between TP53 and each of CDH1 and GATA3, between PIK3CA and each of CDH1 and MAP3K1, and between GATA3 and CBFB as previously noted 7 in Asians and Caucasians tumours ( Supplementary Fig. 12). Finally, an analysis of tumours with germline pathogenic variants in six high to moderate breast cancer susceptibility genes did not reveal any significant differences between our cohort and TCGA 26 , except for a marginally significant higher prevalence in germline PALB2 mutations (Supplementary Table 4).
Analysis of CNA using whole-genome sequencing at 0.1× coverage revealed frequently altered regions in chromosomes 1, 8, 16 and 17 ( Supplementary Fig. 13), similar to what has been found in previous large genomic studies (TCGA, METABRIC). High-level amplifications were observed in known oncogenes such as ERBB2 and MYC, while deletions were observed in loci encompassing TP53 and MAP2K4. We observed IntClust-specific CNAs ( Fig. 2b and Supplementary Fig. 14): for example, amplification of ERBB2 at 17q12 in IntClust5; amplification of 17q23 encompassing RPS6KB1, PTRH2, APPBP2 and PPM1D in IntClust1, amplification of two distinct amplicons around 11q13 (CCND1 and EMSY, RSF1, and INTS4) in IntClust2, and amplification of ZNF703 at 8p12 in IntClust6.
We also used copy number data to estimate the fraction of cancer cells that harbour mutations for nine common driver genes, and found that most genes had median mutational cancer cell fractions (CCF) of 1, suggesting that mutations in these genes are clonal and tend to occur early during tumour development ( Supplementary Fig. 15).
Together, WES and sWGS analyses suggest that the subtypespecific somatic SNVs, indels, CNA and CCF patterns in our cohort are similar to that previously described in other Asian or European populations, with the notable exception that TP53 mutations are more common in ER+ Asian breast cancers.
We found only minor differences in the mutational signatures between Asian and Caucasian breast tumours (Fig. 2d, Supplementary Fig. 16). Whilst we observed a higher percentage of Korean samples (Korean and WGS Asian) with the HR deficiency Signature 3 as previously reported 19 , this was not observed in other Asian datasets (MyBrCa and TCGA Asian; Fig. 2d). Significantly, we found a higher percentage of Asian breast tumours with Signature 13, consistent with the higher prevalence of the APOBEC3B germline deletion polymorphism among the population (p = 0.0486). No differences for other mutational signatures were observed (Fig. 2d).
We extended the analysis to molecular subtypes and found no significant difference in the distribution of mutational signatures based on ethnicity. IntClusts 3 and 8 samples primarily have high Signature 1, and moderate number of samples with Signatures 2, 13 and 3, and consistent with this, ER+ luminal A tumours by PAM50 classification show an elevated percentage of samples with Signature 1 as well ( Supplementary Fig. 17, 18). IntClust 5 samples have high Signature 2 and 13, and consistent with this, the Her2-enriched tumours by PAM50 classification show an elevated percentage of samples with Signatures 2 and 13, suggesting increased APOBEC enzymatic activity 29 within the Her2-enriched cohort. By contrast, IntClust 10 samples have a higher percentage of samples with the HR deficiency Signature 3, and consistent with this, the basal-like samples by PAM50 classification have a high percentage of samples with Signature 3. Together, these results suggest that there is no significant difference in the carcinogenic processes driving the development of each subtype of breast tumours in both Asian and Caucasian women.
Asian breast cancers exhibit enriched immune scores. Given that population differences in the genetic, environmental and lifestyle exposures that drive carcinogenesis could shape the tumour microenvironment and lead to different outcomes, we performed pathway gene set enrichment analyses (GSEA) to determine pathways that are differentially activated between our cohort and TCGA. Interestingly, we found that 5 out of the top 15 most differentially expressed pathways were related to the immune system (Supplementary Table 5). We determined the immune scores for MyBrCa, Korean, METABRIC and TCGA Caucasian cohorts according to five different scoring methods: ESTIMATE 30 , gene set variation analysis (GSVA) using gene sets for immune cells 31 , GSVA using the expanded interferon-gamma gene set 32 , the IMPRES method 33 and CD8+ T cells scores from CIBERSORT 34 . Remarkably, all five methods showed significantly elevated immune scores in both Asian cohorts relative to the Caucasian cohorts (Fig. 3a). Furthermore, a multiple regression analysis of IMPRES scores for the MyBrCa and TCGA cohorts demonstrated that the difference in IMPRES scores between the two cohorts remained significant even after controlling for age, stage, histological subtype, menopausal status, tumour content, and HR/HER2 positivity (Supplementary Table 6).
As immune profiles are subtype-dependent, we compared IMPRES scores between MyBrCa and the TCGA Caucasian cohort across breast cancer subtypes. We observed the highest immune enrichment was in IntClust 10, which comprises mainly of TNBCs, in concordance with published studies 35,36 . More interestingly, we found that the IMPRES scores were significantly higher for MyBrCa cohorts compared to TCGA Caucasian samples independent of subtype classification (Fig. 3b). Similar patterns were found using the other immune scores (Supplementary Fig. 19) and classification methods ( Supplementary  Fig. 20), suggesting a systemic enrichment of immune scores in Asian breast cancers that occurs across the majority of molecular subtypes.
To identify immune cell types that drive the enrichment of immune scores in Asians, we used CIBERSORT on RNA-seq gene expression profiles to quantify the relative abundance of 22 different immune cell types in the tumour immune microenvironment. We found CD8+ T cells and macrophages are enriched in Asian breast cancers (Fig. 3a, Supplementary Fig. 21), and this was consistent with TIMER and GSVA with Bindea gene sets (Supplementary Figs. 22,23), suggesting an overall enrichment of cytotoxic NK and T cells in Asian tumours.
In a subset of 124 patients our cohort, we compared each patient's IMPRES score with anti-CD8 staining of tumourinfiltrating lymphocytes (TILs) in archival formalin-fixed paraffinembedded (FFPE) blocks and found a high correlation (r s = 0.45; Fig. 3c). Similarly, IMPRES scores were highly correlated with CD3 IHC scores (r s = 0.52; Supplementary Fig. 24), and ESTIMATE immune scores were highly correlated with CD3 and CD8 IHC scores ( Supplementary Fig. 24). Together, these data suggest that differences in immune scores derived from RNA-seq data are at least in part due to differences in the TILs within each tumour.
Factors contributing to enriched immune scores. To explore the factors contributing to enriched immune scores in Asians, we first explored whether the generation of neoantigens through different mutational processes is associated with immune scores. Indeed, we found that Signatures 2 and 13 (APOBEC) and Signature 3 (HR deficiency) are positively correlated with immune scores, whereas Signature 1 (aging) is negatively correlated (Fig. 4a, b). We also asked if the immune signatures were associated with the underlying tumour heterogeneity of each sample. To quantify tumour heterogeneity, we used PyClone 37 to estimate the number of subclonal clusters in each sample. A comparison of ESTIMATE immune scores with the number of PyClone clusters revealed a strong negative association (r s = −0.69, p = 0.0008; Fig. 4c), suggesting that tumours in the MyBrCa cohort with high immune scores have correspondingly low tumour heterogeneity. Furthermore, we also found that the MyBrCa samples have significantly lower tumour heterogeneity relative to TCGA samples across subtypes (Supplementary Fig. 25). These data are consistent with a model in which the stronger immune response in Asian breast cancer patients leads to higher selective pressure on tumour cells and, ultimately, more homogenous tumours in those patients 38,39 .
Next, in order to identify clinical or demographic factors that could contribute to the elevated immune scores observed in Asian breast cancer samples, we conducted linear regression analysis of immune scores against all available clinical and demographic data, including age, tumour size, tumour stage, tumour grade, and tumour content. These analyses showed that elevated IMPRES scores were associated with molecular subtype and age, but not with other clinical or demographic factors (Supplementary Table 7). Following up on this result, multiple linear regression analysis of immune scores against all available molecular data, adjusting for age and molecular subtype, showed that mutational signature 3, tumour heterogeneity (PyClone clusters) and predicted neoantigen binding to be independently associated with IMPRES scores (Supplementary Table 8), although the overall predictive ability of the model was weak (adjusted R 2 = 0.13).
Finally, to determine the biological pathways that are associated with higher immune scores, we conducted GSEA comparing tumours in the top quartile of IMPRES scores to tumours in the bottom quartile. We found that the systemic lupus erythematosus (SLE) pathway was the most enriched pathway for tumours in the top IMPRES score quartile (Supplementary Table 9). In addition, we also found enrichment in the cGAS-STING cytosolic DNA-sensing pathway, and decreased expression of the TGF-β signalling pathway in the top quartile, which is concordant with previous studies on the tumour immune microenvironment (Supplementary Table 9) [40][41][42] . We confirmed these results by comparing IMPRES scores for our cohort to GSVA scores using the KEGG gene sets, and found a strong Clinical impact of genomic profiles of Asian breast cancer. To determine the potential clinical impact of the differences in molecular profiles of breast cancers in Asian women, we determined the impact of the observations on overall survival. As expected, we found that ER− patients had poorer survival compared with ER+ patients, patients with IntClust 10 had poorer survival relative to other IntClusts, and patients with basal tumours by PAM50 had poorer survival relative to other molecular subtypes (Fig. 5a). In ER+ patients alone, TP53 mutations were associated with poorer survival (Fig. 5b). Patients with high IMPRES scores had significantly better survival in both unadjusted (Fig. 5b) and adjusted (Supplementary Fig. 27) models, and the effect was stronger in ER+ patients (Fig. 5b). Interestingly, ER + patients with both a low IMPRES score and TP53 mutation appear to have markedly poorer survival than other patients (Fig. 5b), although the sample size was small (n = 8). Overall, these data suggest that the differences in molecular profiles of Asian breast cancer patients could have important clinical implications, particularly in patients with ER+ tumours.

Discussion
To our knowledge, this is the largest and most extensive molecular profiling study on breast tumours arising in Asian women. Our study of Malaysian breast tumours complements previous genomics studies of Asian breast tumours from China and South Korea 9,19,23 , and more importantly, enables more-comprehensive comparisons of Asian breast tumours with breast tumours arising in women of European descent 5,10,23 . Comparisons conducted in this study revealed a higher prevalence of the Her2 molecular subtypes, higher prevalence of TP53 mutations in ER+ disease and higher overall immune signatures in Asian breast cancer, all of which could impact clinical outcomes. Despite the lower mean age of breast cancer in our cohort compared with TCGA, and the association of younger age with ER− breast cancer, we did not find an increased proportion of triple-negative breast cancer. In contrast to other Asian studies, which selected for individuals with young onset breast cancer 19 , we also did not find a large difference in the frequency of germline carriers for BRCA1, BRCA2, PALB2, ATM or CHEK2 relative to TCGA. The prevalence of germline BRCA mutations that we report is slightly lower than that reported by studies that have looked at germline BRCA mutations in the Asian population [43][44][45] , which is likely owing to a higher mean age of diagnosis and our cohort being unselected for age and family history of breast cancer.
The higher prevalence of the Her2 subtype in our cohort, the South Korean and the TCGA Asian cohorts is consistent with immunohistochemistry-based epidemiological studies among Asians and Asian Americans [46][47][48][49] . It is becoming increasingly apparent that the lifestyle and genetic risk factors may have overlapping but sometimes distinct effects in different subtypes of breast cancer. In fact, a large study of 28,095 breast cancer patients has suggested that parity may be associated with earlyonset Her2-positive breast cancers 50 . Incidentally, the MyBrCa cohort 20 has a higher median parity relative to the Caucasian cohort sequenced by Nik Zainal et al. 23 . Further work on the basis for population differences in risk is warranted to understand implications for prevention.
The higher prevalence of TP53 mutations in ER+ breast cancer in our cohort, the South Korean and the TCGA Asian cohorts is consistent with previous findings 19 . However, unlike liver cancer where TP53 mutations may be linked to population-specific risk factors 51 , we found no difference in the type or location of mutations nor in mutational signatures in Asian and Caucasian breast tumours, suggesting that the source of mutations may be similar in both populations. Regardless of the cause of the mutations, TP53 mutations have been suggested to be prognostic, but with opposite effects in different subtypes of breast cancer 52,53 , and in ER+ /HER2− disease, TP53 mutations appear to be associated with higher risk to recurrence and poorer prognosis [54][55][56] . Notably, expression of TP53 mutations may have different effects in different populations 57 , highlighting the need for further clinical studies in diverse populations.
Finally, we observed an elevated immune score, driven by higher cytotoxic TILs, in both Malaysian and Korean breast tumours relative to the Caucasian tumours. In our cohort, elevated immune score is also associated with better overall survival. Our results are consistent with recent microarray-based analyses in Asian breast cancer patients 58 . Intriguingly, ethnic-specific differences in the tumour microenvironment, such as hypoxia have been reported and appeared to be associated with higher TP53 and lower PIK3CA mutations in Asians 59 . We found that the elevated immune scores are associated with the SLE, cGAS-STING cytosolic DNA-sensing and TGF-β signalling pathways, HR deficiency, neoantigen prediction and tumour heterogeneity. However, these associations only account for a small proportion of the population variance in immune scores, highlighting the need for further studies in this area.
In summary, this study may have important implications on our understanding of breast cancers arising in different populations. It has been reported that Asian-American breast cancer patients have better survival than other racial/ethnic groups 60,61 , and the molecular basis of breast cancers in Asian women reported in this study highlights some of the possible explanations for these population-specific differences in outcome.

Methods
Biospecimen collection and pathological assessment. In all, 661 tissue samples were collected from breast cancer patients recruited as part of the MyBrCa study at the Subang Jaya Medical Centre, Subang Jaya, Malaysia. The patients were recruited sequentially as seen in the clinic from a year to year period. The project was reviewed and approved by the Independent Ethics Committee, Ramsay Sime Darby Health Care (reference no: 201109.4 and 201208.1), and written informed consent was given by each individual patient. Matching blood samples were obtained prior to surgery, whereas tumour samples were sectioned from the primary tumour during surgery and were immediately frozen and stored in liquid nitrogen.
Patients were excluded from this study for the following criteria: no corresponding tumour samples (n = 5), no corresponding germline samples (n = 5) and those who withdrew consent (n = 12). Tumour samples were further excluded after clinicopathological review if they were found to be from rare histological subtypes and other breast diseases (n = 5). Tumour samples were then sectioned for DNA and RNA extraction, and the top and bottom sections were stained with haematoxylin and eosin and reviewed for tumour content. Tumour samples with an average tumour content of <30% (n = 50) and those with insufficient DNA (n = 8) were excluded from the study.
A small minority of patients that were included in this study (n = 26) received neoadjuvant chemotherapy prior to tissue collection. Tissue samples also included four cases of bilateral breast cancer and 1 case of recurrence. RNA-sequencing alignment and quality assessment. RNA-seq reads were mapped to the hs37d5 human genome and the ENSEMBL GRCh37 release 87 human transcriptome using the STAR aligner (v. 2.5.3a) 62 . Samples with <15 million mapped fragments were excluded from further RNA-seq analyses. Genelevel aligned fragment counts were generated using featureCounts (v. 1.5.3), while gene-level expression in transcripts-per-million (TPM) was calculated using RSEM (v1.2.31) 63 . Genes with an average count of less than 10 or an average TPM score of < 0.1 were filtered out and excluded from further analyses. Variant calling from RNA-seq data for sample ID reconfirmation and downstream analyses was also conducted using the GATK Best Practices workflow for RNA-seq-in brief, STARmapped reads were processed to mark duplicates with Picard, followed by exon splitting & trimming and base recalibration using GATK, and lastly variants were called using HaplotypeCaller. Sample identities were verified as described below in the WES section. Molecular subtyping. Gene-level count matrices for the SD and TCGA cohort were transformed into log 2 counts-per-million (logCPM) using the voom function from the limma (v. 3.34.9) R package. The transformed matrices were then was subtyped according to PAM50 and SCMgene designations using the Genefu package in R (v. 2.14.0). In addition, each matrix was quantile-normalised and subtyped according to integrative clusters using the iC10 R package (v. 1.5). For the METABRIC cohort, the normalised microarray expression matrix from the discovery set was used for Genefu and ic10 subtyping without modification. For IntClust 4, we designated each sample as being ER+ or ER− by fitting a twocomponent mixture model to the distribution of ESR1 expression in TPM using an expectation-maximisation algorithm as implemented in the mixtools (v. 1.1) package in R. Unsupervised cluster analysis. K-means consensus clustering was conducted using gene-median centred TPM gene expression scores for the MyBrCa and TCGA Caucasian cohorts, using the ConsensusClusterPlus (v.1.46) package in R.
Classification of TNBC subtypes. TPM gene expression scores for MyBrCa TNBC samples were first normalised using gene-median centreing. Then, the 80 genes from the TNBC classifier in Burnstein et al. 22 was used to conduct hierarchical clustering, as implemented in the heatmap.2 function in R using default options. Each cluster was assigned to the four subtypes according to pathway expression: high expression of hormone-related pathways for the LAR subtype, high expression of immune pathways and developmental growth genes as the IM, low expression of immune pathways but high expression of developmental growth genes as the BLIS subtype, and finally the remaining cluster with moderate expression of epithelial-mesenchymal transition-related pathways as the MES subtype.
Profiling the tumour immune microenvironment. Overall immune cell infiltration in the bulk tumour samples was assessed from RNA-seq TPM gene expression scores using ESTIMATE (v. 1.0.13) 30 , as well as with GSVA (v. 1.26) using the combined immune cell gene sets from Bindea et al. 31 . For each sample, we also scored the immune features predictive of checkpoint inhibitor immunotherapy using IMPRES scores 33 (only 14 out of 15 of the predictive features were available in our datasets) as well as with GSVA using the Expanded IFN-gamma gene set from Ayers et al. 32 . The relative abundance of specific immune cell populations was estimated from RNA-seq TPM scores with the CIBERSORT 34 and TIMER 64 web tools, as well as through GSVA with individual immune gene sets from Bindea et al. 31 . All box plots shown are constructed in the same way-the boxes indicate 25th percentile, median, and 75th percentile, whereas whiskers show the maximum and minimum values within 1.5 times the inter-quartile range from the edge of the box.

Regression analyses of clinical and molecular features.
To assess the associations between various clinical and molecular features, we performed multiple regression analyses using the base stats package in R (v. 3.5.1). For continuous variables such as IMPRES score, regular linear regression was applied using the "lm" function to estimate regression coefficients and statistical significance.
Shallow WGS alignment and CNA assessment. The sequenced reads were mapped to the hg19 reference genome using bwa-mem, sorted using samtools and dedupped using picard (http://broadinstitute.github.io/picard). Mapped reads were analysed using QDNAseq 65 to obtain 100 kb segmented copy number profiles using standard protocol and default parameters. Copy number aberrations were called using CGHcall (v 2.40) as implemented in the QDNASeq R package, which calls each segment as normal, copy number gain, copy number loss, amplification or deletion using a mixture model. ENSEMBL hg19 genes with HUGO names were mapped to the segmented copy number calls by their start positions to determine the copy number status for each gene in each sample.
WES analysis alignment and quality assessment. The sequenced reads were trimmed for Illumina adapters using trim_galore! (https://github.com/ FelixKrueger/TrimGalore) and mapped to the human genome b37 plus decoy genome using bwa-mem version 0.7.12. The mapped reads were merged and sorted using samtools, and de-duplicated using picard (http://broadinstitute.github.io/ picard). Realignment and base quality score recalibration were performed using GATK3 66 . Realignment around indels was done concurrently on the tumour and normal pairs. For bilateral cases, the two tumours and normal were realigned simultaneously. Sample identities were verified by determining the concordance of the genotypes using GATK3 HaplotypeCaller 66 . In brief, the haplotypes at 2000-5000 dbSNP positions were called for all samples (RNA-seq and WES experiments) in an all-versus-all manner. True match will typically have upwards of 95% concordance between samples from the sample individuals. Samples with ambiguous identities were excluded from further analyses.
Short-nucleotide variants calling. We first constructed panel-of-normals by artefact detection mode of GATK3 Mutect2 66 , retaining only sites that are present in two or more normal samples. In all steps, we only concentrated on the Nextera Rapid Capture regions with additional flanking 100 bp. For SNVs, we used positions that are called by Mutect2, and applied following filters: minimum 10 reads in tumour and 5 reads in normal samples, OxoG metric <0.8, variant allele frequency (VAF) 0.075 or more, p value for Fisher's exact test on the strandedness of the reads 0.05 or more, and S AF > 0.75. For positions that are present in five samples or more, we removed two positions that were not in COSMIC and in single tandem repeats. We also removed variants that have VAF at least 0.01 in gnomAD, and considered only variants that are supported by at least four alternate reads, with at least two reads per strand. For indels, we also required the positions to be called by Strelka2 68 . We manually salvaged an indel in PALB2 gene in sample SD0028 which we have identified using PCR. We annotated the variants using Oncotator version 1.9.9.0. We plotted mutation positions as lollipop plots using MutationMapper 67 .
Cancer cell fractions. CCFs were calculated for nine common breast cancer driver genes using VAFs, copy number and tumour purity estimates obtained from WES, following the methodology described in Pereira et al. 7 . Copy number and tumour purity were generated for this analysis using the ASCAT R package (v. 2.5.2) on allele counts generated from WES bam files using alleleCounter (v. 4.0.1).
Driver genes analysis. We considered samples from WGS and WES experiments (MyBrCa, TCGA Asian, Korean, WGS Asian, TCGA Caucasian, WGS Caucasian) for driver gene analyses 24 . We considered genes that are mutated in at least 1% of samples in Asian ER−, Asian ER+, Caucasian ER+ or Caucasian ER−. We calculated ONC score as the percentage of missense mutation counts at the most recurrent amino-acid positions over all mutations, and considered only genes with at least five recurrent mutations. We calculated TS (score) as the percentage of inactivating mutation counts over all mutations, and considered only genes with at least five inactivating mutations. We selected genes with either ONC or TSG scores of >20, and queried the Integrative Onco Genomics database (www.intogen.org) for known cancer driver genes ( Supplementary Tables 2 and 3). We sorted 40 genes according to the enrichment in Asian ER+ over Caucasian ER+ and plotted the prevalence and the ONC/TSG scores ( Supplementary Fig. 11).
Association patterns of driver genes. We considered the top 15 mutated genes from driver genes analysis. We calculated odds ratio and determined the likelihood of co-occurrence or mutual exclusivity of missense or inactivating mutations using Fisher's exact test. We displayed associations with p value of <0.05 after Bonferroni correction and plotted the log odds ratio.
Mutational signatures. We used deconstructSigs 69  Germline mutation analysis. Carriers of deleterious germline mutations in BRCA1, BRCA2, TP53, PALB2, ATM and CHEK2 were identified from targeted sequencing of the MyBrCa cohort (BRIDGES study, in review), which were confirmed with Sanger sequencing. Comparable data from TCGA was taken from Huang et al. 26 .
Correlation between immune scores and mutational signatures. We calculated Spearman's correlation coefficient between the mutational signatures and immune scores and plotted the correlations as heatmaps.
Differential expression and pathway analyses. Gene-level count matrices were normalised using the "Trimmed Mean of M-values" method implemented in the edgeR (v. 3.20.9) R package. The count matrices were then transformed into logCPM using the voom function from the limma package in R. Differentially expressed genes were determined by empirical Bayes moderation of the standard errors towards a common value from a linear model fit of the transformed count matrices as implemented in the limma package, with the threshold for differential expression set as false discovery rate (FDR) < 0.001 and absolute log fold-change >0.2. For pathway analysis, DE genes were further filtered for FDR < 1e-50, and ranked according to absolute log fold-change. The top 1000 ranked genes were queried in Reactome (www.reactome.org) to determine the top enriched pathways. Further pathway analyses were also conducted using GSEA (v. 3.0) on quantilenormalised gene-level count data. GSEA was run for 1000 permutations with the v. 6.2 Hallmark and KEGG gene sets from MSigDB.
Validation of immune scores with immunohistochemistry. FFPE blocks for 207 patients with sequencing data were sectioned and stained for anti-CD3 (clone 2GV6, predilute; Ventana Medical Systems), anti-CD4 (clone SP35, predilute; Ventana Medical Systems), anti-CD8 (clone SP57, predilute; Ventana Medical Systems) and anti-PD-1 (clone SP263, predilute; Ventana Medical Systems) using an automated immunostainer (Ventana BenchMark ULTRA; Ventana Medical Systems, Tucson, AZ) without any dilution. Stained slides were digitised using an Aperio AT2 whole slide scanner. CD3, CD4 and CD8 staining was quantified using the Aperio Positive Pixel digital pathology tool and PDL-1 expression was determined using the Combined Positive Score system. The data were further verified by a pathologist (PR).
Quantification of tumour heterogeneity. Tumour heterogeneity was determined using PyClone (v 0.13.1) 37 with default options to estimate the number of subclonal clusters within each tumour sample. Allele counts used for the PyClone input were extracted from the GATK output MAF files, whereas copy number input data were generated by ASCAT (v. 2.5.2) from WES allele counts generated by alleleCounter (v 4.0.1). Tumour heterogeneity was also separately quantified for each sample in the MyBrCa and TCGA cohorts using the MATH method described in Pereira et al. 7 from MAF files from WES.
Neoantigen analysis. Sample HLAs were determined using Polysolver from tumour and normal DNA WES data. Only HLA alleles that were concordant in tumour and normal WES data were considered. Somatic mutations were annotated using VEP. All possible neoantigen peptides (9-to 11-mers) encompassing the nonsynonymous mutations were predicted using a combination of NetMHCpan and NetMHC on the pVAC-seq platform 70 . Only neoantigens with predicted binding of less than 500 nM were considered.
Survival analysis. Overall survival data were obtained for each patient by querying their names and identity card numbers against the Malaysian National Registry records of deaths. Patients that did not return any matches against the database were assumed to still be alive, and vice versa. Length of survival was defined as the period of time from the date when patients were recruited into the study until the date of death as recorded by the Malaysian National Registry for patients who have passed away, or until the date when the Malaysian National Registry was last queried for patients assumed to still be alive. For all survival analyses in this study, only patients with at least two years of survival data were included (n = 367). Unadjusted Kaplan-Meier analyses and log-rank tests were conducted using the survival package in R (v. 2.44) and plotted using the "ggsurvplot" function from the survminer R package (v. 0.4.4). Cox proportional hazard models were built using the "coxph" function from the survival package and plotted using the "ggforest" function from survminer.
Box and whiskers plots. All box and whiskers plots in the main and supplemental figures are constructed with boxes indicating 25th percentile, median and 75th percentile, and whiskers showing the maximum and minimum values within 1.5 times the inter-quartile range from the edge of the box, with outliers not shown.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Sequencing data for this study (WES, RNA-seq and sWGS bam files) are available on the European Genome-phenome Archive under the study accession number EGAS00001004518. Access to controlled patient data will require the approval of the MyBrCa Tumour Genomics Data Access Committee upon request to Soo-Hwang Teo at genetics@cancerresearch.my. Publicly available data sets that were included in this study are accessible as follows: TCGA 10 and METABRIC 5 data are available via the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/) and cBioportal (https://www. cbioportal.org/) 67 , whereas data from Nik-Zainal et al. 23 are available from ftp://ftp. sanger.ac.uk/pub/cancer/Nik-ZainalEtAl-560BreastGenomes and data from Zhengyan Kan et al. 19 are available from https://www.nature.com/articles/s41467-018-04129-4. Other public databases that were accessed include Reactome (www.reactome.org), the Integrative Onco Genomics database (www.intogen.org) and gnomAD (gnomad. broadinstitute.org). The remaining data are available within the Article, Supplementary Information or from the authors upon request.

Code availability
Code that was used to create the main figures in this study is available at https://github. com/panjw0/MyBrCa_Genomics. Received: 8 April 2020; Accepted: 17 November 2020;