Characterization of Nigerian breast cancer reveals prevalent homologous recombination deficiency and aggressive molecular features

Racial/ethnic disparities in breast cancer mortality continue to widen, but genomic studies rarely interrogate breast cancer in diverse populations. Through genome, exome, and RNA sequencing, we examined the molecular features of breast cancers using 194 patients from Nigeria and 1037 patients from The Cancer Genome Atlas (TCGA). Relative to Black and White cohorts in TCGA, Nigerian HR+/HER2− tumors are characterized by increased homologous recombination deficiency signature, pervasive TP53 mutations, and greater structural variation, indicating aggressive biology. GATA3 mutations are also more frequent in Nigerians regardless of subtype. Higher proportions of APOBEC-mediated substitutions strongly associate with PIK3CA and CDH1 mutations, which are underrepresented in Nigerians and Blacks. PLK2, KDM6A, and B2M are also identified as previously unreported significantly mutated genes in breast cancer. This dataset provides novel insights into potential molecular mechanisms underlying outcome disparities and lays a foundation for deployment of precision therapeutics in underserved populations.

Breast cancer is a heterogeneous disease comprising distinct subtypes. Both global burden and severity of the disease vary widely across populations, with women of African ancestry being diagnosed at a younger age, having more clinically aggressive disease and advanced stage at diagnosis, as well as having higher mortality rates than age-matched women of European or Asian ancestry [1][2][3][4]. Molecular and genetic characteristics strongly influence breast cancer prognosis and treatment, with HER2 amplification (human epidermal growth factor receptor 2 [ERBB2]) and hormone receptor (HR; estrogen receptor [ER] and progesterone receptor [PR]) expression being the best examples.
Recent large sequencing studies, for instance the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), have refined our knowledge of the genomic landscape and pathogenesis of breast cancer, have provided insight into tumor evolution and mechanisms of drug resistance, and have laid a pathway to deployment of precision therapeutics [5][6][7][8][9][10][11][12][13][14][15]. Moreover, these large public datasets have also enhanced our understanding of the divergent mutation accretion processes; most notably in breast cancer, studies have shown high APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like)-related mutagenesis, especially in HER2+ tumors 16, whereas BRCA1/2 mutations are strongly associated with signatures depicting DNA repair deficiency 17.
The cases used to elucidate the genetic basis of breast cancer have been overwhelmingly from women of European ancestry, which reiterates the need for data from underrepresented ethnicities [18][19][20] . Moreover, paucity of data from African countries potentially widens the knowledge gap that contributes to global disparities in breast cancer outcomes. To get a comprehensive understanding of the genetic architecture of breast cancer in West Africans, the founder population of a large proportion of Black women in the United States, we conducted whole-genome sequencing (WGS), whole-exome sequencing (WES), and transcriptome sequencing (RNA sequencing (RNA-seq)) on 194 tumors from Nigerian patients and performed a comparative analysis with Black women of African ancestry and White women of European ancestry from the United States in TCGA. In comparison with the TCGA cohorts, we observe that HR + /HER2 − Nigerians are enriched for molecular characteristics associated with aggressive biology. To the best of our knowledge, combined with African American patients in TCGA, this is the largest breast cancer genomics study on tumors from women of African ancestry to date.

Results
Study populations. The Nigerian cohort comprised 194 breast cancer patients: 40 with WGS data, 129 with WES data, and 103 with RNA-seq data ( Supplementary Fig. 1). Of the 1097 TCGA breast cancer patients with either WES (n = 1035) or WGS (n = 84), 1030 were assigned without ambiguity to 3 ancestral race groups (Black, ≥ 50% African; White, ≥ 90% European; Asian, ≥ 90% Asian ancestry) and the other 67 had mixed racial background (Supplementary Data 1a-e). DNA sequencing data from all samples was uniformly processed using the SwiftSeq workflow (manuscript in preparation). Patient clinical and pathologic characteristics are shown in Supplementary Tables 1-5. Nigerians were much younger and had more advanced stage at diagnosis than patients in the TCGA cohort, reflecting population structure and lack of screening in the country.
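The ancestry thresholds above can be expressed as a small rule. The following is a minimal sketch, assuming ancestry is available as estimated fractions between 0 and 1; the function name and argument names are hypothetical and are not taken from the study's pipeline.

```python
def assign_ancestry_group(african: float, european: float, asian: float) -> str:
    """Map estimated ancestry fractions (0..1) to the cohort labels used in the text.

    Thresholds follow the text: Black >= 50% African, White >= 90% European,
    Asian >= 90% Asian; everything else is treated as mixed background.
    """
    if african >= 0.50:
        return "Black"
    if european >= 0.90:
        return "White"
    if asian >= 0.90:
        return "Asian"
    return "Mixed"


# Hypothetical ancestry estimates, for illustration only.
for fractions in [(0.80, 0.20, 0.00), (0.05, 0.93, 0.02), (0.30, 0.60, 0.10)]:
    print(fractions, assign_ancestry_group(*fractions))
```

Because the three thresholds cannot be met simultaneously when the fractions sum to one, the order of the checks does not affect the result.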
Across all 1164 individuals with WES data (both TCGA and Nigerians), we identified 25 genes that were significantly mutated above background (MutSigCV, Q < 0.05). Three of these genes (PLK2, KDM6A, and B2M; Supplementary Methods; Supplementary Fig. 2) had little or no previous evidence of harboring mutations that drive breast carcinogenesis. A fourth gene, GPS2, was also identified by Bailey et al. 22 while this manuscript was under review. Notably, mutations in PLK2 (Fisher's exact, P = 0.05) and KDM6A (P = 0.06) were enriched within HER2+ patients. Combined with previously reported significantly mutated genes in breast cancer 13,23, this resulted in 44 driver genes. These genes, along with those recurrently affected by copy number changes 6 (Supplementary Table 6), were used for gene-centric comparisons by race/ethnicity. Consistent with the aggressive subtype composition in Nigerians, we found an enrichment of TP53 alterations (62% in Nigerians vs. 46% and 29% in TCGA Blacks and Whites, respectively; Fisher's exact, Benjamini-Hochberg [BH] P < 0.0001) as well as a lower prevalence of PIK3CA mutations (17 vs. 20 and 36%; BH P < 0.0001) (Fig. 1c). Combined BRCA1 germline and somatic variants were also enriched in the Nigerian cohort (11.6 vs. 7.0 and 4.0%; BH P = 0.03). CDH1 mutation was rare in Nigerians (0.8 vs. 6.4 and 16.2%; BH P < 0.0001), whereas GATA3 alterations were more common in this population (17.1 vs. 10.0 and 9.5%; BH P = 0.24).
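The gene-level comparisons above pair Fisher's exact test with Benjamini-Hochberg (BH) adjustment. The sketch below illustrates both steps using only the Python standard library. It is restricted to 2 × 2 tables for brevity, whereas the comparisons in the text span three cohorts, and the counts in the example are hypothetical rather than the study's data.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x mutated cases in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return BH-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical mutated / wild-type counts in two cohorts for three genes.
tables = {"GENE_A": (30, 20, 15, 35), "GENE_B": (5, 45, 12, 38), "GENE_C": (10, 40, 9, 41)}
raw = [fisher_exact_two_sided(*t) for t in tables.values()]
for gene, p, q in zip(tables, raw, benjamini_hochberg(raw)):
    print(f"{gene}: P = {p:.4g}, BH P = {q:.4g}")
```

Adjusting across the full set of tested genes, rather than per gene, is what keeps the false discovery rate controlled when many driver genes are compared at once.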
When comparing recurrently gained or lost regions as identified by GISTIC2 ( Supplementary Fig. 3; Supplementary Methods), we found that all high confidence peaks identified in the Nigerian cohort had corresponding peaks within 10 Mb in the combined TCGA cohort. In line with immunohistochemistry (IHC) and PAM50, the ERBB2 locus (17q12) was enriched in Nigerians (amplified in 24 vs. 12 and 10%; BH P = 0.002), as was its wide neighboring peak at 17q23.1 (TBX2 locus, BH P = 0.1) (Fig. 1d).
Within IHC subtypes, significantly mutated genes and copy number peaks generally displayed similar proportions across ethnicities (Fig. 1e, f), suggesting that most mutation frequency differences reflect subtype differences across ethnicities. Within the HR+/HER2− subtype, however, there were more TP53 and GATA3 mutations, and fewer PIK3CA and CDH1 mutations in Nigerians, compared with TCGA Blacks and Whites (all P < 0.05). These results are not strongly influenced by age (Supplementary Methods) and suggest that HR+/HER2− breast cancers in Nigerian women have genomic lesions consistent with more aggressive disease.
Mutation signatures across subtypes and driver mutations. We next extracted breast cancer mutational signatures in the 122 WGS and 500 WES samples from Nigerian and TCGA cohorts harboring 100 or more mutations (Supplementary Methods). Of the nine independently identified signatures, signatures A (APOBEC C > T), B (APOBEC C > G), C (Aging), H (Signature 8), and I (homologous recombination deficiency [HRD]) closely matched previously identified breast cancer signatures (Supplementary Figs. 4 and 5a). Given that these five signatures had high correlation between exomes and genomes (Supplementary Fig. 5b), we examined those in subsequent analyses. Combined, they explain the vast majority of mutations regardless of race/ethnicity (Fig. 2a) or subtype (Fig. 2b).
In the HR+/HER2− subtype, the APOBEC C > T signature displayed differences by race/ethnicity, with Nigerian and Black cohorts having lower APOBEC C > T contributions compared with Whites (Mann-Whitney U test [MWU], P < 0.05). In the HR−/HER2− subtype, Nigerians had increased APOBEC C > G signature relative to the Black and White cohorts (P < 0.05) (Supplementary Fig. 7a, b). Strikingly, HR+/HER2− Nigerian tumors had higher HRD signature contributions compared with both Black (P = 1.8 × 10^−4) and White (P = 1.6 × 10^−4) cohorts (Fig. 4b). This finding was confirmed using data from WGS (Supplementary Fig. 8c). Structural variants (SVs) are more prevalent in tumor types with HR defects such as ovarian and basal-like breast cancers 13,26. In this same set of genomes, Nigerians had more SVs than both Black (MWU, P = 0.03) and White cohorts (P = 2.8 × 10^−4). As with the HRD signature, SV counts in HR+/HER2− Nigerians (~551 SVs per genome) were reminiscent of HR−/HER2− (~626 SVs per genome) (Fig. 4c). Differences between Nigerians and Whites in HRD signature and SVs (both P < 2.0 × 10^−3) extended to HER2+ cases as well (Fig. 4b, c). Taken together, multiple lines of evidence suggest that HR+/HER2− Nigerians have increased HRD and genomic complexity compared with the Black and White cohorts. Furthermore, genome data suggest a potentially more granular stratification by African ancestry.
We postulated that increased HRD in HR+/HER2− Nigerians may be explained by increased prevalence of TP53 mutations as well as fewer PIK3CA and CDH1 mutations, although not necessarily causatively. Using multivariate modeling (Supplementary Methods), we investigated the effect of race/ethnicity on HRD adjusting for age and missense burden, as well as mutation status in TP53, BRCA1/2, PIK3CA, and CDH1. Although many of these factors have significant, independent effects, they cannot entirely account for the racial/ethnic HRD disparities seen across HR+/HER2− tumors.
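The cohort comparisons of signature contributions and SV counts rely on the Mann-Whitney U test. As a minimal sketch of that test, the function below computes U and a two-sided P-value from the normal approximation, without tie or continuity correction. This simplification is an assumption made for brevity; statistical packages typically apply exact or corrected variants, and the SV counts in the example are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U for x, two-sided P) using the uncorrected normal approximation."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value; ties count one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # erfc(|z| / sqrt(2)) equals twice the upper-tail normal probability.
    return u, erfc(abs(z) / sqrt(2))


# Hypothetical SV counts per genome in two cohorts, for illustration only.
cohort_a = [610, 540, 720, 480, 590, 655]
cohort_b = [310, 420, 280, 505, 390, 350]
u_stat, p_value = mann_whitney_u(cohort_a, cohort_b)
print(f"U = {u_stat}, P = {p_value:.4g}")
```

A rank-based test of this kind is a natural fit here because signature contributions and SV counts are bounded or heavily skewed and rarely approximate a normal distribution.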
HRD-APOBEC signature balance. Several threads of evidence suggest a possible interplay between the HRD and APOBEC signature contributions, particularly in HR +/HER2 − breast cancers: (1) we identified racial/ethnic differences in mutation prevalence for TP53, CDH1, and PIK3CA; (2) we found associations between these mutations and mutation signatures (Fig. 3a); and (3) consistent with differential mutation status, HRD activity was increased in Nigerians, whereas APOBEC C > T displayed reduced activity in Nigerians and Blacks compared with Whites ( Supplementary Fig. 7a). Furthermore, within this subtype, HRD had a notable negative correlation with both the APOBEC C > T (ρ = − 0.56, permutation test P < 0.0001; Supplementary Methods) and APOBEC C > G (ρ = − 0.30, P < 0.0001) signatures.
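The negative correlations between HRD and APOBEC contributions are reported with a permutation test. The sketch below shows one common way to compute Spearman's ρ with a permutation P-value: rank both variables, correlate the ranks, and compare against correlations obtained after shuffling one variable. The exact permutation scheme used in the study is described in its Supplementary Methods and may differ; the function names, the number of permutations, and the example values are assumptions.

```python
import random


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a: list[float], b: list[float]) -> float:
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_permutation(x, y, n_perm: int = 10000, seed: int = 0):
    """Return (rho, two-sided permutation P) for paired observations x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    rho = pearson(rx, ry)
    rng = random.Random(seed)
    shuffled = ry[:]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated P-value strictly above zero.
    return rho, (extreme + 1) / (n_perm + 1)


# Hypothetical per-tumor signature contributions, for illustration only.
hrd = [0.55, 0.10, 0.40, 0.05, 0.30, 0.62, 0.18, 0.25]
apobec = [0.05, 0.45, 0.12, 0.50, 0.20, 0.02, 0.33, 0.28]
print(spearman_permutation(hrd, apobec, n_perm=2000))
```

Note that signature contributions are proportions of the same mutation total, so some negative dependence between any two signatures is expected by construction; a permutation null does not remove that compositional effect.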
The signature patterns for the Neither group most closely resembled those of TP53/BRCA1/BRCA2 (Fig. 5a-c), suggesting that there may be other mechanisms, such as inactivation of other homologous recombination genes 27 or BRCA1/2 methylation 28, which promote increased HRD activity. When looking at the proportion of these mutational groups across HR+/HER2− samples (including those without signature estimates), the groups with the highest HRD and lowest APOBEC (TP53/BRCA1/BRCA2 and Neither) encompassed 70.3% of Nigerians and 66.3% of Blacks but only 47.7% of Whites (χ2-test, P = 1.2 × 10^−3) (Fig. 5d). This suggests that individuals with African ancestry are more likely to fall within mutational groups associated with increased HRD and lower APOBEC contributions. Consistent with this assertion, the HR+/HER2− Black cohort had greater copy number segmentation (MWU, P = 0.022), more structural variation (Dunn's test, P = 0.028), and increased HRD in WGS (Dunn's test, P = 0.015) compared with Whites (Fig. 4b; Supplementary Fig. 8c). Throughout African ancestry tumors, prevalent aggressive and limited favorable molecular features could in part explain known racial/ethnic mortality disparities within the HR+/HER2− subtype 29. This has significant clinical implications, because HRD tumors are more likely to be sensitive to platinum-based chemotherapy, PARP (poly (ADP-ribose) polymerase) inhibition, and immunotherapy 28.
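The comparison of mutational-group membership across the three cohorts is a χ2-test on a 2 × 3 contingency table (in or out of the high-HRD groups, by cohort). The sketch below computes Pearson's χ2 statistic for such a table. For two degrees of freedom the upper-tail probability has the closed form exp(−χ2/2), which avoids any external library. The per-cohort counts are not given in the text, so the counts in the example are hypothetical.

```python
from math import exp


def chi_square_df2(table: list[list[int]]) -> tuple[float, float]:
    """Pearson chi-square statistic and P-value for a table with 2 degrees of freedom."""
    rows, cols = len(table), len(table[0])
    if (rows - 1) * (cols - 1) != 2:
        raise ValueError("closed-form P-value only valid for 2 degrees of freedom")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(cols)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    return stat, exp(-stat / 2)


# Hypothetical counts: rows are (high-HRD groups, other groups), columns are three cohorts.
counts = [[45, 60, 150], [19, 30, 165]]
stat, p_value = chi_square_df2(counts)
print(f"chi2 = {stat:.3f}, P = {p_value:.4g}")
```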
Infiltrating immune cell inference by RNA signatures. Given the high HRD signature activity and the fact that DNA repair gene alterations have been linked to checkpoint inhibitor efficacy, we next investigated gene expression signatures related to immune cell infiltration, or immune signatures, with RNA-seq (Fig. 6a and [...]). [...] IFN, and Proliferation displayed statistically significant differences across PAM50 subtypes (analysis of variance, all P < 0.0001; Supplementary Methods). Racial differences adjusted for PAM50 subtype, however, were modest (Fig. 6b and Supplementary Fig. 10). The Cytotoxic cell signature (P = 0.004) was lower in Nigerians in all subtypes but Basal, whereas the Fibroblast signature (P = 0.01) was consistently highest in Nigerians. Type I IFN signature scores (P = 0.01) were enriched in Luminal subtypes for both Nigerians and Blacks, which potentially indicates that tumors from these racial groups would respond better to immunotherapy 30. Lastly, macrophage infiltration in Nigerians was highest in the Basal subtype, similar to what has been reported in other studies, including one in a small subset of Nigerian patients 31,32.
Fig. 3 Associations between genome-wide oncogenic features and the mutation status of common driver genes. Dot plot depicting the relationships between mutation status in TP53, PIK3CA, CDH1, and GATA3, and mutation signatures (APOBEC C > T, APOBEC C > G, aging, HRD, and signature 8), missense mutation burden, and copy number (CN) segments (a) across all IHC subtypes (n = 500) and (b) within HR+/HER2− (n = 222). Only TCGA data, including samples lacking mutation signature estimates, were used for CN associations (all subtypes n = 1,023; HR+/HER2− n = 635). No samples were excluded based on race/ethnicity. Comparisons between mutation status and genomic features were performed with the Mann-Whitney U test, and P-values were corrected for multiple testing (Benjamini-Hochberg method). Circle size is proportional to the magnitude of the −log10 BH P-value (i.e., lower BH P-values have larger circles). If mutation status is associated with a significant increase or decrease of a genomic feature, the corresponding circle is colored red or blue, respectively. Non-significant (NS) comparisons are colored black.
We next tested these immune signatures for association with potential predictors of response to immunotherapy. We considered the combined APOBEC C > T and C > G, and the HRD mutation signatures as the two independent mutational processes generating putative neoantigens, as well as mutation burden and chromosomal instability (CIN) [33][34][35]. APOBEC mutation signature contribution was positively correlated with mutation burden (ρ = 0.35, Spearman's rank correlation, BH P < 0.0001). Consistent with recent reports, we found that APOBEC contribution was further associated with increased T-cell infiltration (ρ = 0.25, BH P < 0.0001) and that CIN was positively correlated with mutation burden (ρ = 0.28, BH P < 0.0001) but negatively correlated with T-cell infiltration (ρ = −0.08, BH P < 0.01) 34,35. The same trends were observed in the Nigerian and TCGA cohorts separately with similar effect sizes (Fig. 6c, d), although, in the former, most were not significant after multiple testing correction, potentially due to the smaller sample size (Fig. 6a).

Discussion
To date, this study is the largest genomic analysis of breast cancer among women of African ancestry. Aggressive molecular subtypes were found to be more prevalent in Nigerian patients, which has been consistently documented in breast tumors across West Africa 2 . The extent to which this disparity represents disparate biology, environmental influences, or a combination thereof remains unknown. Recently, ER expression was demonstrated to be a heritable trait in breast cancer 36 , suggesting that genetically influenced basal expression levels may contribute to subtype differentiation. Given that genetic background associates with phenotypes relevant to breast cancer, it is reasonable to postulate that patterns of somatic mutations may differ across genetically distinct populations. Here we have shown that regardless of subtype, aggressive molecular features are prevalent in breast tumors from Nigerian women.
Including Nigerian samples along with TCGA allowed us to identify PLK2, KDM6A, and B2M as novel significantly mutated genes in breast cancer, with the former two enriched in the HER2+ subtype. PLK2 is a cell cycle regulator and presumed tumor suppressor, whereas KDM6A is a chromatin modifier frequently mutated in other cancer types (e.g., pancreatic, esophageal, and bladder) [37][38][39][40]. B2M inactivation was recently reported to be a [...] 41. Further studies to characterize the role of these genes in HER2+ tumors specifically and breast cancer in general are warranted. The mutational landscape and signature patterns differed across racial/ethnic populations. In particular, the relatively younger Nigerian patients had more TP53 and GATA3 mutations than Blacks in TCGA, whereas both African ancestry groups had higher prevalence of these mutations than Whites. The frequencies of prognostically favorable PIK3CA and CDH1 mutations were lower in women of African ancestry than in Whites, which may reflect differences in breast cancer risk factors across populations. Even when restricting to ER+/HER2− breast cancer, tumors from Nigerian women were characterized by canonically aggressive molecular features, such as higher contributions from the HRD mutational signature, TP53 mutations, and increased structural variation. Along with more pervasive HR negativity and HER2 positivity, the aggressive features of HR+ tumors provide biological insight into why breast cancers in the unscreened and relatively younger female populations of West Africa are often fatal 42. This study lays the foundation for a more concerted effort to reduce global disparities in cancer outcomes by first closing the knowledge gaps. Given the genomic landscape, Nigerian women would benefit from increased access to genomically tailored clinical trials and more effective treatments such as HER2-targeted therapy and PARP inhibition for HER2+ and HR-deficient tumors, respectively 28.
There are certain limitations to this study, including the relatively small sample size of Nigerian tumors and the fact that both TCGA and this study used convenience samples ascertained in hospitals, which may not reflect population rates. Nonetheless, this study underscores the need to include diverse populations when identifying and pursuing novel therapeutic targets 18. It is possible that genetic and environmental factors not only drive subtype differentiation but also dictate evolutionary dynamics of a tumor. This latter assertion could help explain the observed mutational differences between racial/ethnic groups, a pattern that has also been noted when comparing Blacks and Whites with colorectal cancer in the United States 43. Similarly, strong associations between driver mutations and mutation signature contributions (e.g., PIK3CA and APOBEC signatures) pose a causality dilemma suited for further biological and epidemiological investigations. Overall, our results justify the need for future studies integrating germline and somatic genetics, as well as environmental factors, in order to better understand the root causes of disparities in breast cancer outcomes and develop more effective interventions to achieve health equity.

Methods
Biospecimen collection and pathological assessment. This study was embedded within the Nigerian Breast Cancer Study (NBCS) and approved by the Institutional Review Boards of all participating institutions. Patient ascertainment and details of the study have been previously published 2,44,45. In collaboration with Novartis, NBCS was extended to Lagos State University Teaching Hospital (LASUTH). A total of 493 subjects were recruited from University College Hospital, Ibadan (UCH; n = 284) and LASUTH (n = 209) between February 2013 and September 2015. Each patient gave written informed consent before participation in the study. Six biopsy cores and peripheral blood were collected from each patient. Two biopsy cores were used for routine formalin fixation for clinical diagnosis and the remaining four cores were preserved in PAXgene Tissue containers (Qiagen, CA) for subsequent genomic material extraction. In addition, 27 mastectomy tissues were preserved in RNAlater. Complete pathology assessment was done centrally by study pathologists. Tumor burden was assessed based on cellularity, histology type, and morphological quality of tissue using TCGA best practices, and only tissues containing 60% or more tumor cellularity were used for WGS. For WES, tissues containing 30% or more tumor cellularity were used. IHC for ER, PR, and HER2 was performed centrally in Nigeria and further reviewed in the United States. Cases with discordant results were again reviewed and resolved by the study pathologists. IHC scoring variables for the Allred scoring algorithm were captured according to the 2013 ASCO/CAP standard reporting guidelines. Briefly, for ER and PR testing, cases with < 1% immunoreactive tumor cells were recorded as negative and those with ≥ 1% were reported as positive. All positive ER and PR cases were graded by percentage of stained cells and further scored in line with the Allred scoring system.
Percentage of tumor staining for HER2 was also reported, with scores of 0 and 1+ recorded as negative, 2+ as equivocal, and 3+ as positive. In addition, genomic copy number calls of HER2 and chromosome 17 ploidy were used as an alternative to HER2 fluorescent in situ hybridization testing. Overall, IHC calls were corroborated by ESR1, PGR, and ERBB2 expression using RNA-seq (Supplementary Fig. 11a-c).
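The receptor-calling rules described above can be summarized in a few lines. This is a minimal sketch under two assumptions that the text does not state explicitly: HR status is taken as positive when either ER or PR is positive, and equivocal HER2 cases are reported as such rather than resolved with copy number data. The function names are hypothetical.

```python
def hormone_receptor_status(er_percent: float, pr_percent: float) -> str:
    """ER and PR are each positive at >= 1% immunoreactive tumor cells.

    Assumption: HR is called positive if either receptor is positive.
    """
    return "positive" if er_percent >= 1 or pr_percent >= 1 else "negative"


def her2_ihc_status(score: int) -> str:
    """Map a HER2 IHC score (0, 1, 2, or 3) to a call: 0/1+ negative, 2+ equivocal, 3+ positive."""
    calls = {0: "negative", 1: "negative", 2: "equivocal", 3: "positive"}
    if score not in calls:
        raise ValueError(f"unexpected HER2 IHC score: {score}")
    return calls[score]


def ihc_subtype(er_percent: float, pr_percent: float, her2_score: int) -> str:
    """Combine HR and HER2 calls into a subtype label such as 'HR+/HER2-'."""
    hr = "HR+" if hormone_receptor_status(er_percent, pr_percent) == "positive" else "HR-"
    her2 = her2_ihc_status(her2_score)
    if her2 == "equivocal":
        # In the study, copy number calls stood in for FISH; that step is not modeled here.
        return f"{hr}/HER2 equivocal"
    return f"{hr}/HER2{'+' if her2 == 'positive' else '-'}"


print(ihc_subtype(90, 40, 1))  # a hormone receptor-positive, HER2-negative example
```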
Sample selection and genomic material extraction. Breast tumors were selected for sequencing following the TCGA guidelines 6. Tumor samples containing > 60% tumor cellularity were selected for DNA extraction using the PAXgene Tissue DNA kit (Qiagen). The Gentra Puregene Blood Kit (Qiagen) was used to extract genomic DNA from blood. Extracted DNA was quality controlled for purity, quantity, and integrity. The identity of the extracted DNA was tested using the AmpFlSTR Identifiler PCR Amplification Kit (Thermo Fisher Scientific). Samples in which tumor and germline DNA matched at > 80% of the short tandem repeat profiles were considered authentic. RNA was extracted from PAXgene fixed tissues using the PAXgene Tissue RNA kit (Qiagen). RNA integrity was determined for all samples by the RNA integrity number (RIN) score given by the TapeStation (Agilent) readout. RNA samples that had RIN scores of 4 and above were included in downstream sequencing analysis.
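The inclusion thresholds stated in the Methods can be collected into simple eligibility checks. The sketch below assumes the cellularity cutoffs given in the pathological assessment (60% or more for WGS, 30% or more for WES), an STR match above 80%, and a RIN of 4 or higher; the function names and the representation of each measure as a plain number are hypothetical.

```python
MIN_CELLULARITY = {"WGS": 0.60, "WES": 0.30}  # minimum tumor cellularity per assay
MIN_STR_MATCH = 0.80                          # tumor/germline STR match must exceed this
MIN_RIN = 4                                   # minimum RNA integrity number


def passes_dna_qc(tumor_cellularity: float, str_match: float, assay: str) -> bool:
    """True if a tumor sample meets the cellularity and identity criteria for the assay."""
    return tumor_cellularity >= MIN_CELLULARITY[assay] and str_match > MIN_STR_MATCH


def passes_rna_qc(rin: float) -> bool:
    """True if RNA integrity is sufficient for RNA-seq."""
    return rin >= MIN_RIN


# A hypothetical sample with 45% cellularity qualifies for WES but not WGS.
print(passes_dna_qc(0.45, 0.95, "WES"), passes_dna_qc(0.45, 0.95, "WGS"))
```

Keeping the thresholds in named constants makes it easy to see how the differing cellularity requirements lead to different sample counts for WGS and WES.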
Next-generation sequencing data generation. WES and RNA-seq were carried out at the Novartis Next Generation Diagnostics facility. [...] 49 and those that met the required criteria ("COSMIC_n_overlapping_mutation > 1" AND "1000gp3_AF ≤ 0.005" AND "ExAC_AF ≤ 0.005") were considered likely to be somatic and were retained. This panel-of-normals process was also repeated for genomes (normal sample n = 124 [...] 65 was used to estimate somatic mutational signatures. The ability to reliably call mutation signatures depends on sufficient numbers of mutations. For this reason, we used all high-quality exome SNVs, regardless of whether they are coding or non-coding. Any sample containing at least 100 SNVs was included for downstream assessment. In addition, to obtain more accurate signature estimates, 122 WGS tumor-normal pairs were included alongside the 500 WES pairs (Supplementary Data 1c). To account for variable mutation counts across samples, we used SomaticSignatures to normalize the mutation matrix before performing non-negative matrix factorization. We elected to estimate nine signatures (Supplementary Fig. 4), as (1) this was consistent with the number of signatures identified previously in breast cancer ([http://cancer.sanger.ac.uk/cosmic/signatures]) and (2) nine signatures explained ~99% of variance when using the 122 genomes alone. Using matrix algebra on the resulting exposure and mutation matrices, we calculated the relative contribution of the nine signatures in each sample. Contributions represent the proportion of mutations assigned to a given mutation signature within each tumor (Supplementary Methods). Exomes were used for all mutation signature analyses unless explicitly stated.
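The final step, converting per-signature mutation assignments into relative contributions, can be illustrated in isolation. The sketch below assumes that the factorization has already produced, for each tumor, a non-negative number of mutations attributed to each signature; it does not reproduce the SomaticSignatures workflow itself. The function names and the example values are hypothetical.

```python
MIN_SNVS = 100  # samples with fewer SNVs are excluded from signature analysis


def include_sample(n_snvs: int) -> bool:
    """True if a sample carries enough SNVs for signature estimation."""
    return n_snvs >= MIN_SNVS


def relative_contributions(attributed: dict[str, float]) -> dict[str, float]:
    """Convert mutations attributed to each signature into proportions that sum to 1."""
    total = sum(attributed.values())
    if total <= 0:
        raise ValueError("no mutations attributed to any signature")
    return {signature: count / total for signature, count in attributed.items()}


# Hypothetical attribution for one tumor with 120 SNVs.
tumor = {"APOBEC C>T": 30, "APOBEC C>G": 12, "Aging": 18, "Signature 8": 6, "HRD": 54}
if include_sample(int(sum(tumor.values()))):
    for signature, share in relative_contributions(tumor).items():
        print(f"{signature}: {share:.2f}")
```

Expressing each signature as a proportion makes tumors with very different mutation counts comparable, which matters when exomes and genomes are analyzed together.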
RNA-seq analysis and immune signatures. [...] Fig. 12). To characterize the immune and stromal microenvironment of these tumors, we assessed the expression of several pre-specified sets of immune and stromal cell gene expression markers (Supplementary Table 7). Gene signature scores were calculated using the GSVA R/Bioconductor package ([https://www.bioconductor.org/packages/GSVA/]) 70.
Statistical methods. All statistical calculations were completed in R. Names of the performed tests are provided in the text and all P-values are two-sided. Nonparametric tests were used for data types that often lacked normality (e.g., mutation signature contributions). All boxplots throughout the manuscript are Tukey's style.