Introduction

As we enter an era of personalized medicine in oncology, large-scale studies have been instrumental in deciphering the pathogenesis and evolution of tumors. Public data repositories such as The Cancer Genome Atlas (TCGA) have enabled researchers to define the genomic landscape of different types of cancers, including breast cancer. The public availability of large-scale datasets has led to a surge in candidate drug targets and novel prognostic and/or predictive gene signatures. However, it is important to note that the majority of patients in public datasets are of European ancestry (EA), and, hence, the knowledge gained from such studies might not be applicable to patients of a different ancestry1. Given the global disparities in clinical behavior of breast cancer, it has become imperative to investigate ancestry-associated differences in tumor biology.

Breast cancer in women of African ancestry (AA) presents at a younger age, and is associated with more advanced disease and higher mortality rates as compared to breast cancer in age-matched patients of EA or Asian ancestry (AsA)2,3,4,5,6,7,8,9,10. Several reports have demonstrated an increased frequency of the more aggressive triple-negative breast cancer (TNBC) subtype and of the PAM50-molecular basal subtype in AA women7,8,9,10,11,12,13,14,15,16. Moreover, African-American women with early-stage TNBCs have been shown to exhibit a lower pathological complete response to neoadjuvant chemotherapy17. Interestingly, this discrepancy in clinical outcome remains after correcting for socioeconomic factors, suggesting the presence of molecular differences by ancestry18,19. The African-American breast cancer epidemiology and risk consortium identified few rare germline single-nucleotide polymorphisms (SNPs) that are associated with an increased risk of hormone receptor-negative breast cancer and/or TNBC in African-American women20,21. Analysis of genotypic traits revealed that most somatic mutations and copy number variations are subtype-specific rather than ancestrally determined22,23. Very few mutations showed dissimilar frequencies across African, African-American, or European-American patient subgroups when considering a specific breast cancer subtype. Likewise, numerous differentially expressed genes have been identified between breast tumors of patients of AA and EA24,25,26,27,28; however, there is little to no evidence linking these findings to differences in breast cancer survival or subtype-specific survival in relation to ancestry. Therefore, differential expression of genes involved in biological processes such as differentiation, cell cycle, DNA repair, invasion, metastasis, and angiogenesis could be related to the higher proportion of triple-negative breast tumors in the African-American population. To address this, several studies investigated molecular differences within TNBC tumors of African-American and European-American patients. TNBC tumors of African-American women were shown to display enrichment of gene sets related to a high proliferative rate, high genomic grade index, BRCA1 deficiency, increased activation of insulin-like growth factor 1 receptor, and increased angiogenesis, closely resembling the basal like-1 TNBC subtype gene signature as described by Lehmann et al.23,28,29,30,31,32,33. In addition, it has been suggested that an abundance of cancer stem cells might, in part, contribute to the worse survival of African-American women with TNBC tumors34,35,36,37,38.

Given the importance of immune cell infiltration in determining the prognosis and treatment response of breast cancer, and, especially in TNBC, it is important to investigate whether differences in antitumor immunity may contribute to the divergent clinical behavior of breast cancer across populations39,40,41,42. To date, this phenotypic aspect of breast cancer is largely unexplored in the context of ancestry. Interestingly, systemic levels of pro-inflammatory cytokines such as interferon-γ and interleukin-6 have been found to be elevated in both healthy African-American women and those affected with breast cancer as compared to European-American women, suggesting ancestry-inferred differences in the immune response that might affect antitumor immunity and ultimately breast cancer clinical outcome43,44. In contrast, only subtle differences in immune gene signatures related to immune cell infiltration were found in TNBC tumors of women of AA22,45.

In this study, we applied a unique approach to explore ancestry-associated heterogeneity of breast cancer outcomes. First, we used improved and curated survival information from the TCGA Pan-Cancer clinical data resource (TCGA-CDR)46. Second, we applied SNP-based inference of ancestry47,48 to improve ancestry assignment, enabling us to include a substantial number of additional patients from the TCGA dataset in our analysis, thereby increasing the power of our study. Third, we performed a comprehensive transcriptomic analysis of both immunological and cancer cell-intrinsic parameters within breast cancer subtypes as defined by a novel PAM50 classification. This refined classifier utilizes a combination of Topological Data Analysis (TDA) signatures of normal mammary cell types (basal epithelial cells, luminal epithelial cells, myoepithelial cells, and Her2-related expression) to subgroup breast tumors into seven distinct molecular subtypes with prognostic value49. Using this combined novel approach, we interrogated the TCGA breast cancer dataset, comprising of patients of AA (n = 184), EA (n = 811), and AsA (n = 56), and a local Arab/Asian breast cancer dataset from Qatar (n = 24) for ancestry-specific molecular differences in breast cancer.

Results

Ancestry of patient populations

To date, studies investigating molecular differences between ancestries have been solely based on self-identified ancestry. In our study, we applied a novel approach combining self-reported ancestry and SNP-based inference of ancestry47,48. Ancestries were assigned using principal component (PC) analysis of SNP array genotyping calls following the method as described by Carrot-Zhang et al.48 (Supplementary Fig. 1). As such, we included 1051 patients from the TCGA breast cancer dataset in our analysis, of which 811 EA, 184 AA, and 56 AsA patients (Table 1). Ancestry of patients in the local Retrospective Arab cohort from Qatar (RA-QA) was solely based on self-reported ancestry, subgrouping 16 patients as Arab ancestry (ArA), five as AsA, two as EA, and one as Persian (Table 2).

Table 1 Cohort demographics of the TCGA breast cancer cohort.
Table 2 Cohort demographics of RA-QA breast cancer cohort.

Distribution of molecular breast cancer subtypes

Numerous studies have demonstrated a higher prevalence of TNBC and of tumors of the molecular basal subtype among AA women and have linked the increased frequency of these aggressive breast tumors to ancestry-associated disparity in breast cancer clinical outcome. Using our novel combined approach, we interrogated the TCGA and RA-QA datasets to subgroup patients according to TDA-defined molecular subtype and ancestry49. Heatmaps of TCGA and RA-QA samples based on TDA gene signatures (basal, myo1, myo2, luminal, and Her2) show a clear segregation of samples in seven molecular subtypes, each defined by a unique combination of expression of five distinct gene groups, demonstrating the accuracy and robustness of the novel classifier (Fig. 1a). As can be seen in the circos plots in Fig. 1b, and in accordance with the METABRIC analysis by Mathews et al.49, we found that luminal A tumors are mainly reclassified into Lum and MyoLum subgroups, while luminal B tumors are mainly subgrouped into LumBasal and Lum tumors. In addition, tumors of the normal-like PAM50 subtype are mainly reclassified into the Myo classes. Her2-enriched tumors are predominantly subdivided into BasalHer2, BasalLumHer2, and LumBasal tumors. Further, the vast majority of basal tumors are reclassified as BasalMyo (88%). Figure 1c clearly demonstrates differences in molecular subtype frequency across ancestries, with a strong enrichment in AA patients of BasalMyo (38.0 vs 16.5% in EA, χ2 = 41.3, p = 1.30E − 10) and a reduced proportion of MyoLumA (2.7 vs 11% in EA, χ2 = 11.7, p = 0.0006) and Lum (17 vs 29% in EA, χ2 = 10.9, p = 0.001) tumors, and in AsA patients an enrichment of BasalHer2 tumors (21.7 vs 6.4% in EA, χ2 = 19.0, p = 1.33E − 05). While several studies reported an increase in basal tumors with worse outcome in AA patients7,9,11,12,29,50, we were able to fine-tune this observation to a strong increase of BasalMyo tumors, accounting for the majority of basal tumors. Furthermore, we observed an increase in the proportion of BasalMyo tumors in ArA patients (25.0 vs 16.5% in EA, χ2 = 1.0E − 4, n.s.), although this did not reach statistical significance as a likely result of the small cohort size.

Fig. 1: Distribution of breast cancer molecular subtypes defined by topological data analysis (TDA) signatures across ancestries.
figure 1

a Heatmap of expression of PAM50 genes organized by TDA signature classes in TCGA breast cancer and RA-QA cohort. Samples are annotated by TDA signature class (upper annotation bar) and classical PAM50 intrinsic molecular subtype (lower annotation bar). The combination patterns of upregulated expression of five distinct gene groups defining each TDA class are summarized in a table on the right (Summary TDA). b Reclassification of breast cancer samples from classical PAM50 intrinsic molecular subtypes (upper part of circos) to TDA signature classes (lower part of circos) in TCGA and RA-QA breast cancer cohorts. c Stacked bar chart of distribution of TDA classes by ancestry. d Kaplan–Meier plots showing overall survival (upper panels) and disease-specific survival (lower panels) by ancestry. Difference between the survival of patients with European and African ancestry is shown for the complete TCGA breast cancer cohort (left), patients with TNBC according to hormone receptor status (middle left), patients with PAM50-defined basal breast cancer (middle right), and patients with tumors classified as BasalMyo by TDA classification (right). Censor points are indicated by vertical lines.

Next, we explored ancestry-related differences in clinical outcome using curated survival data from the TCGA-CDR46. The clinical outcome of breast cancer patients, irrespective of molecular subtype, was not different between EA and AA patients (Fig. 1d). Among all seven TDA subtypes, BasalMyo tumors were the only tumors that were associated with significantly different 10-year overall survival (OS, p = 0.020) and disease-specific survival (DSS, p = 0.033) rates for AA vs EA patients (Fig. 1d and Supplementary Fig. 2). The 5-year OS rates for BasalMyo tumors were 85.5% for EA and 70.1% for AA patients (p = 0.07), and the 5-year DSS rates were 90.1% for EA and 73.6% for AA patients (p = 0.05). Interestingly, compared to TNBC and basal tumors, we observed a larger disparity in 10-year OS (hazard ratio (HR) = 2.39, p = 0.020) and 10-year DSS (HR = 2.57, p = 0.033) by ancestry in BasalMyo tumors (Fig. 1d). To exclude that this survival difference results from a higher frequency of more advanced stage BasalMyo tumors in AA patients, we compared the AJCC pathological stage between EA and AA patients and found no significant difference in stage distribution by ancestry (χ2 = 2.83, p = 0.092) (Supplementary Fig. 3). In addition, we performed survival analysis stratified by early (stages I and II) and advanced (stages III and IV) stage and found rather large HRs, although not significant, indicating worse OS of AA patients within strata (Supplementary Fig. 3). Adjustment for tumor stage and/or age in multivariate analysis showed similar results with AA being associated with worse survival (Supplementary Fig. 3), albeit with borderline significance, implying that additional factors beyond pathological stage contribute to the divergent clinical outcome of AA patients with BasalMyo tumors compared to EA patients.

Ancestry-associated differences in immunological parameters

In an effort to elucidate potential ancestry-inferred differences in tumor biology, we compared the immune microenvironment of tumors from patients with different ancestry. More specifically, we assessed tumor immune disposition using the prognostic Immunologic Constant of Rejection (ICR) immune gene signature51,52 and deconvoluted immune cell abundance using leukocyte subgroup enrichment scores (LES)53. The ICR 20-gene signature consists of genes encoding CXCR3/CCR5 chemokine ligands (CXCL9, CXCL10, and CCL5), genes encoding molecules involved in T-helper type 1 (Th1) signaling (IFNG, TXB21, CD8B, CD8A, IL12B, STAT1, and IRF1), and effector immune functions (GNLY, PRF1, GZMA, GZMB, and GZMH), as well as counter-regulatory molecules (IDO1, PDCD1/PD-1, CD274/PD-L1, CTLA4, and FOXP3). Using the ICR gene signature, we previously classified breast cancer samples into four classes with the highest activation of the antitumor immune response in the ICR4 class51. In a follow-up study of >8000 nonmetastatic breast cancer cases, we demonstrated that the ICR signature was the strongest independent prognostic predictor for metastatic relapse, in particular for patients with Her2+-enriched and triple-negative breast tumors54. Since we did not consider ancestry in our previous findings, the present study aimed to investigate whether the prognostic value of ICR holds true across ancestries or whether there could be immune-related dysregulations that, in part, explain the disparity in the clinical outcome of AA breast cancer patients. First, we used the ESTIMATEscore, ImmuneScore, and StromalScore to compare tumor cellularity, proportion of the stromal component, and level of infiltration of immune cells of all TDA subtypes in EA vs AA patients55. We did not observe significant differences within subtypes by ancestry, indicating that any potential changes in immune-related gene expression in AA vs EA patients are not caused by differences in stromal and immune cell composition (Supplementary Fig. 4).

The ICR gene signature clearly clusters breast tumors of the TCGA dataset into three immune phenotypes with varying degrees of immune activation (ICR low, ICR medium, and ICR high), while tumors of the RA-QA cohort were subdivided into two immune phenotypes (ICR low and ICR high) (Fig. 2a). In accordance with our previous work, tumors with an ICR Low immune phenotype were associated with a worse survival in EA patients (p = 0.028) (Fig. 2b). Likewise, we observed a large, although not significant, difference in survival between ICR low and ICR high patients within the AA and ArA groups. In line with these findings, the prognostic value of gene signatures that reflect the abundance of individual immune cell populations was overall similar across ancestries with leukocyte subpopulations classically associated with better prognosis such as CD8+ T cells and cytotoxic cells having the same trends in EA and AA patients (Supplementary Fig. 5). Next, we investigated whether the immune disposition, inferred from the ICR enrichment score, varies within TDA subtypes by ancestry (Fig. 2c). Comparison of the continuous ICR enrichment score demonstrated modest variation between TDA subtypes with overall higher scores in non-luminal tumors (BasalHer2 and BasalMyo), which was not affected by ancestry. For instance, no significant difference in ICR enrichment score was found in BasalMyo tumors by ancestry, suggesting a similar overall immune disposition across ancestries. In accordance, we did not find any significant differences in the expression of individual ICR genes based on ancestry (data not shown). Further analysis of BasalMyo tumors, however, revealed differences within ICR clusters whereby ICR low and ICR medium patients were grouped into one subgroup due to the limited sample size of each cluster within BasalMyo tumors. Although BasalMyo tumors of AA patients were overall associated with worse OS, this was more pronounced in ICR medium + low tumors (10-year OS, p = 0.03; 5-year OS, p = 0.07) (Fig. 2d). In multivariate analysis, AA remained significantly associated with worse survival when adjusted for tumor stage, and reached borderline significance when adjusted for tumor stage and age (Supplementary Fig. 3).

Fig. 2: Tumor immune phenotypes and clinical outcome by ancestry.
figure 2

a Heatmap of ICR gene expression in TCGA and RA-QA breast cancer cohorts. Classification of samples by ICR consensus clustering segregates TCGA samples in ICR low, ICR medium, and ICR high groups. Samples of RA-QA cohort were classified as ICR low or ICR high. b Kaplan–Meier plots showing overall survival across ICR groups in breast cancer TCGA patients of EA (left), TCGA patients of AA (middle), and RA-QA patients of ArA (right). c ICR enrichment scores across ancestries within TDA signature classes. Box plots indicate medians and interquartile range, and whiskers represent 10th and 90th percentile. All data points are plotted individually. d Overall survival of EA and AA patients in TCGA BasalMyo samples classified as ICR medium + low (left), and ICR high (right). Censor points are indicated by vertical lines.

This finding raised the question whether the worse outcome of AA patients with BasalMyo tumors is linked to molecular differences in ICR medium + low tumors also known as cold tumors. For this purpose, we determined the LES of 24 distinct immune cell types (Fig. 3a). Focused analysis of BasalMyo cold (ICR medium + low) tumors revealed a significant decrease in T-regulatory cell (Tregs) and Th2 enrichment scores (p = 0.036; p = 3.36E − 4, respectively), and a small increase in B cell enrichment score (p = 0.039) in AA vs EA patients, whereas dendritic cell (DC) enrichment scores were reduced in ICR hot (ICR high) tumors (p = 0.009).

Fig. 3: Enrichment of immune cell subpopulations in AA and EA patients with BasalMyo breast tumors.
figure 3

a. Enrichment scores of signatures reflecting the abundance of dendritic cells (DCs), T-regulatory cells (Tregs), T-helper 2 (Th2), and B cells in BasalMyo tumor samples of EA and AA patients. Box plots are facetted by ICR groups, ICR high (upper panels), ICR medium + low (middle panels), and across all samples (lower panels). Box plots indicate medians and interquartile range, and whiskers represent 10th and 90th percentile. All data points are plotted individually. T test (two-sided): *p < 0.05, **p < 0.01, ***p < 0.001, and n.s. not significant. Adjusted p value (FDR) by Benjamini and Hochberg method. b Kaplan–Meier plots of overall survival in EA and AA patients with BasalMyo breast cancer dichotomized by enrichment scores of TReg (left panels) and Th2 cell signatures (right panels). Cutoff for dichotomization in “High” and “Low” categories is based on optimal enrichment cutoff determined by XGBoost model used for survival analysis. Censor points are indicated by vertical lines.

In order to identify which LES may harbor prognostic value, we focused on BasalMyo tumors irrespective of ICR class due to sample size limitations and adopted a machine-learning strategy, which has empirically been shown to work efficiently on small size datasets56,57,58, despite a slight tendency for overfitting (EA, n = 134; AA, n = 70). First, we performed a sensitivity model analysis that enabled us to identify the XGboost models that have an optimal set of hyper-parameters (Harrell’s C index EA = 0. 58, AA = 0.63) with relatively small variance (data not shown). Next, we used XGBoost modeling for nonlinear multivariate Cox regression survival analysis followed by the SHapley Additive exPlanation (SHAP) method for the AA and EA subgroups separately (Supplementary Fig. 6). This approach provided information on which features or gene signatures are the most important and their range of effects over the dataset, including the breadth (SHAP value) and the direction of the effect (positive or negative). Both the Treg and Th2 signature were classified as features with more importance for predicting outcome in AA patients as compared to outcome in EA patients, with reduced enrichment scores being associated with increased risk of death. In accordance, we found that AA, but not EA, patients could be stratified into different risk groups based on the expression of the Treg and Th2 cell signatures with borderline statistically different clinical outcomes (Fig. 3b). More specifically, stratification by Treg LES subgrouped AA patients with BasalMyo tumors in a low-risk group with higher expression and 5-year OS rate of 77%, and a high-risk group with low expression and 5-year OS rate of 59% (10-year HR = 2.99, 95% confidence interval (CI) = 1.02–8.77). Th2 LES-based stratification grouped AA patients with BasalMyo tumors into a low-risk/high expression group with 5-year survival rate of 84% and a high-risk/low expression group with 5-year survival rate of 55% (10-year HR = 3.13, 95% CI = 0.98–10.00). No differences in survival were noted for DC and B cell LES (data not shown), which supports their lower rank of importance in the SHAP plot of AA patients (Supplementary Fig. 6).

Ancestry-associated differences in cancer cell-intrinsic features

Next, we investigated whether specific cancer cell-intrinsic features might contribute to the worse survival of AA patients with BasalMyo tumors. First, we examined potential changes in common cancer-associated genomic aberrations, including mutational load, neoantigen load, and tumor aneuploidy. Remarkably, non-silent mutation rate was significantly lower in AA patients compared to EA (p = 0.025), while the number of predicted single-nucleotide variant neoantigens was similar between both patient populations (Supplementary Fig. 7). Therefore, we speculated that AA BasalMyo tumors undergo less immunoediting and immune-mediated elimination of neoantigens compared to EA BasalMyo tumors. To address this hypothesis, we used an “immunoediting score,” defined as the observed ratio (number of point mutations predicted to generate neo-epitopes divided by the total count of non-silent point mutations) compared to the expected ratio (expected numbers based on silent mutation rate)59. Indeed, the ratio of the observed/expected neoantigens was increased in AA patients (p = 0.033), suggesting reduced immunoediting in AA samples (Supplementary Fig. 7). However, we did not observe any survival difference between tumors with a high observed/expected neoantigen ratio compared to tumors with a low ratio (HR = 1.1, 95% CI = 0.43–2.79, p = 0.842), suggesting that this tumor attribute does not explain the observed survival differences between AA and EA BasalMyo tumors. Similarly, while we observed a significantly increased tumor aneuploidy score in samples of AA patients (p = 0.008, Supplementary Fig. 7), this tumor characteristic was not associated with a difference in survival (HR = 0.691, 95% CI = 0.32–1.48, p = 0.34).

To further explore tumor intrinsic features that could contribute to the divergent survival outcomes, we explored the differential enrichment of 54 cancer-associated pathways (Fig. 4a). A total of 16 pathways were found to be differentially enriched between BasalMyo tumors of AA vs EA patients. Of note, only 2 out of 16 pathways, DNA repair and oxidative phosphorylation, were associated with an increased enrichment in AA patients. A number of enriched pathways were identified multiple times as they were included in more than one database, including estrogen response and estrogen-dependent breast cancer signaling, ErbB signaling and ErbB2/ErbB3 signaling, PI3K-Akt mTOR signaling and PI3K-AKT signaling or mTOR signaling, and ERK MAPK signaling, ultraviolet B (UVB)-induced MAPK signaling, and MAPK up genes. Furthermore, the pathways defined as angiogenesis, AMPK signaling, EGF signaling, and PTEN signaling were significantly less enriched in BasalMyo tumors of AA vs EA patients. Using the same approach that we applied to explore the prognostic value of immune gene signatures, we used XGBoost modeling and the SHAP method to identify which cancer-associated pathways are the most powerful indicators of poor survival in AA vs EA patients with BasalMyo tumors (Fig. 4b, c). Based on the summary SHAP plots, we observed that among the top 10 pathways affecting survival in EA patients, the majority displayed an inverse correlation of enrichment with survival, including barrier genes, reactive oxygen species pathway, EGF signaling, hedgehog signaling, UVC-induced MAPK signaling, AMPK signaling, estrogen-dependent breast cancer signaling, and UV response up genes (Fig. 4b). In contrast, increased enrichment of DNA repair and VEGF signaling pathways were associated with better survival in EA patients. In AA patients, the majority of the top 10 pathways determining survival exhibited better survival with increased enrichment including PI3K-Akt mTOR signaling, proliferation, G2M checkpoint, PI3K-AKT signaling, AMPK signaling, ERK5 signaling, and ErbB signaling (Fig. 4b). On the other hand, we found that pathway enrichment for telomere extension by telomerase, barrier genes, and UV response down corresponded to worse survival.

Fig. 4: Differentially enriched oncogenic pathways with prognostic connotation in EA and AA patients with BasalMyo breast tumors.
figure 4

a Enrichment scores of signatures of tumor-associated pathways that are differentially regulated between EA and AA patients with BasalMyo tumors. Box plots indicate medians and interquartile range, and whiskers represent 10th and 90th percentile. All data points are plotted individually. T test (two-sided): *p < 0.05, **p < 0.01, ***p < 0.001, and ****p < 0.0001. Adjusted p value (FDR) by Benjamini and Hochberg method. b SHAP plots of tumor-associated pathways that are associated with overall survival in EA (left) and AA (right) patients with BasalMyo breast tumors. Pathways are ranked by p value to reflect the importance of each feature in the survival model. Each dot represents a single sample and is colored by relative enrichment score. Corresponding impact on model output (SHAP value) ranges from −1 (indicating the absence of an event) to +1 (indicating the occurrence of an event, in this case, death). c Intersection of differentially enriched tumor-associated pathways with ten most important pathways in AA and EA patients with BasalMyo breast tumors. AMPK signaling is differentially regulated in AA vs EA and is of importance in survival models of both AA and EA patients. d Kaplan–Meier curves visualizing the prognostic value of AMPK signaling in EA (upper) and AA (lower) BasalMyo patients. Dichotomization of samples by AMPK signaling is based on optimal enrichment score cutoff as determined by XGBoost model. Censor points are indicated by vertical lines.

In analogy with our analysis of the prognostic value of enriched immune gene signatures, we performed a combined analysis of differentially enriched pathways and the top ten pathways with importance for the prediction of survival (Fig. 4c). Using this approach, we identified three differentially enriched pathways with prognostic value in EA patients with higher enrichment of EGF signaling (p = 0.02, optimal enrichment cutoff = 0.334) and estrogen-dependent breast cancer signaling (p = 0.076, optimal enrichment cutoff = 0.268) being associated with worse prognosis, while a better survival was observed for enrichment of DNA repair (p = 0.03, optimal enrichment cutoff = 0.304). Focusing on AA patients, we found three differentially enriched pathways with prognostic connotation whereby enrichment of PI3K-Akt-mTOR signaling (p = 9.00E − 04, optimal enrichment cutoff = 0.307), PI3K-Akt signaling (p = 0.006, optimal enrichment cutoff = 0.328), and ErbB signaling (p = 0.053, optimal enrichment cutoff = 0.232) was associated with better outcome (Fig. 4b, c). Interestingly, we found AMPK signaling to be the sole pathway to be differentially enriched between BasalMyo tumors of AA and EA patients with prognostic value in patients of both ancestries. Further analyses revealed an inverse correlation of AMPK enrichment with OS in AA vs EA patients. While in EA patients, pathway enrichment was associated with worse survival, it bestowed a survival advantage for AA patients (Fig. 4d). The 5-year OS rate of EA patients with BasalMyo tumors enriched for AMPK signaling was reduced by 12% from 91 to 79% (10-year HR = 0.343, 95% CI = 0.11–1.10), while the opposite was observed in AA patients where the 5-year OS rate was increased by 21% from 57 to 78% (10-year HR = 3.598, 95% CI = 1.18–10.94).

Molecular alterations in Arab breast cancer patients

Given the similarity in TDA subtype distribution of ArA and AA patients (Fig. 1c), we investigated whether the increased frequency of BasalMyo tumors in ArA patients was associated with differential enrichment of LES and cancer-associated pathways. Specifically, we focused our analyses on Treg, Th2, and AMPK signaling signatures that showed differentially enrichment with prognostic value in AA patients. Due to limited cohort size, we assessed enrichment patterns in all Arab patients without subgrouping by TDA subtype. Compared to AsA patients, ArA patients showed a trend towards lower enrichment scores of the Treg and AMPK signature (Fig. 5a). In order to compare patterns of enrichment between ancestries of both cohorts, we performed a similar analysis across TCGA ancestries (EA, AA, and AsA) without TDA subgrouping (Fig. 5b). Out of the three signatures, only the differential enrichment of AMPK signaling holds true when comparing the overall AA vs EA patient population. Since BasalMyo tumors constitute a large proportion of breast tumors in the AA patients (38%) and are associated with a strong reduction in AMPK signaling (p = 1.78E − 04), we cautiously speculate that the overall reduced enrichment of AMPK signaling in AA patients might be related to our findings in BasalMyo tumors. Similarly, it could be plausible that our findings in Arab patients might be related to differential enrichment signatures in BasalMyo tumors, supporting the need for larger Arab patient cohorts to enable statistically powered subanalysis of TDA subgroups.

Fig. 5: Enrichment of selected immune cell subpopulations and oncogenic pathways in Arab breast cancer patients.
figure 5

Enrichment scores for signatures for T-regulatory cells (Tregs, left), T-helper 2 cells (Th2, middle), and AMPK signaling (right) in panel (a). RA-QA cohort comparing ArA to AsA breast cancer patients, independent of molecular subtype. b TCGA breast cancer cohort comparing EA, AA and AsA breast cancer patients, independent of intrinsic molecular subtype. Box plots indicate medians and interquartile range, and whiskers represent 10th and 90th percentile. All data points are plotted individually. T test (two-sided): *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, and ns, not significant.

Discussion

An increasing effort is expended to decipher the molecular differences that are associated with global disparities in breast cancer outcomes. Several studies have investigated the presentation of breast tumors in patients of African origin in comparison to women of European origin. A consensus across studies is that women of AA display a higher prevalence of the unfavorable TNBC subtype and of the molecular PAM50-defined basal subtype7,8,9,10,11,12,13,14,15,60. We interrogated the TCGA breast cancer cohort using curated survival data, improved ancestry assignment, and a refined classifier that reclassifies breast tumors into seven subgroups using the PAM50 signature in combination with TDA. Comparison of the classical PAM50 and the TDA classifier revealed that the large majority of basal tumors belong to the BasalMyo TDA subgroup, and that the reported enrichment of basal tumors in patients of AA is largely dominated by the BasalMyo subtype. Moreover, we were able to demonstrate that BasalMyo tumors are the only TDA subgroup that is associated with an ancestry-associated disparity in clinical outcome, underlining the clinical relevance of BasalMyo tumors in African patients.

In order to elucidate the underlying biological processes contributing to the worse survival of AA patients with BasalMyo tumors as compared to EA patients, we assessed transcriptomic differences in immunological parameters and cancer cell-intrinsic features. To date, only a few population-based studies have considered ancestry-related changes in the immune response of breast cancer patients22,43,44,45. Overall, very few immunological differences in tumor tissues have been reported between patients of AA and EA22,45. Pitt et al.22 reported subtle differences in tumor immune signatures when adjusting for PAM50-defined subtype. They found an enrichment of the type I IFN signature in luminal A and luminal B tumors of patients of AA, including African-American and Nigerian women, as compared to patients of EA. A study by O’Meara et al.45 reported no significant differences in the expression of 14 immune metagenes in TNBC tumors of AA and EA patients, whereas the proportion of resting CD4+ memory cells, as determined by CIBERSORT, was significantly higher in TNBC tumors of EA patients. Based on the notion that the CIBERSORT algorithm determines the relative abundance of immune cell subpopulations within a tumor rather than between tumors, we did not include CIBERSORT in our analyses. We explored ancestry-related differences in immune disposition using the ICR classifier of tumor immune phenotypes and LES. As such, we found that the prognostic value of the ICR immune gene signature holds true across ancestries and that the lower enrichment of Tregs and Th2 immune cells in patients of AA negatively correlated with outcome. Although this seems a counterintuitive finding, it is important to note that the presence of immunosuppressive cells could be a result of prior immune activation. In line with this, we previously found that FoxP3 expression heavily correlates with T cell infiltration as a counter-regulatory signal and hence is an important marker of the ICR signature52. In addition, a number of studies have reported that increased expression of immunosuppressive gene signatures supports chemotherapy sensitivity and hence better clinical outcome in (triple-negative) breast cancer61,62,63,64.

Subsequently, we explored whether we could identify ancestry-specific enriched oncogenic pathways with prognostic relevance in BasalMyo tumors. In support of this concept, a recent transcriptome-wide association study of the Caroline Breast Cancer Study transcriptomic dataset, comprising of self-identified African-American and European-American women, demonstrated that ancestry-stratified predictive risk models did not perform across ancestries and/or subtype65. Through integrative analysis of differential enrichment and prognostic connotation, we identified seven differentially enriched signaling pathways with prognostic connotation in patients of EA and/or AA. Enrichment of EGF and estrogen-dependent signaling was associated with worse clinical outcomes in patients of EA, while enrichment of DNA repair genes correlated with a better outcome. Conversely, enrichment of PI3K-Akt/PI3K-AKT-mTOR and ErbB signaling was associated with better prognosis in patients of AA. Although this survival-favorable correlation appears contradictory in relation to mTOR and ErbB-mediated oncogenic signaling, recent studies have demonstrated enrichment of PI3K-AKT signaling in immunogenic TNBC tumors, suggesting that hyperactivation of this signaling pathway might promote immunogenic activity and result in better prognosis61,66,67. This raises the question whether BasalMyo tumors enriched in PI3K and ErbB signaling could similarly infer an immune favorable tumor phenotype in a subset of AA patients. Furthermore, analysis of the individual molecules constituting the ErbB signaling pathway revealed a reduced enrichment of ErbB2, ErbB3, and ErbB4 and downstream signaling, irrespective of ancestry, in hormone receptor-negative tumors and in particular BasalMyo tumors compared to hormone receptor-positive tumors (data not shown). On the other hand, hormone receptor-negative tumors and BasalMyo tumors feature a higher enrichment of ErbB1/EGFR and its downstream molecules, which may be driving the overall increased enrichment of ErbB signaling in those tumors (data not shown). These findings highlight the importance of obtaining a more granular view of the changes in the ErbB pathway in BasalMyo tumors such as the relative effect of individual EGFR ligands on ErbB signaling enrichment. Notably, AMPK signaling was associated with opposing prognostic significance in EA and AA patients, with a positive connotation in the latter group. AMP-activated protein kinase or AMPK is a key regulator of cancer metabolism and oncogenic signaling, is frequently upregulated in TNBC vs non-TNBC tumors, and is generally associated with poor clinicopathological factors and shorter survival68,69. Several lines of evidence, however, point towards a more complex role for AMPK in cancer whereby AMPK activation has been associated with both pro-tumorigenic and anti-tumorigenic effects depending on specific metabolic cues70. For example, activation of AMPK signaling has been shown to inhibit the PI3K-AKT-mTOR pathway, the expression of EGFR and cyclins, and the phosphorylation of Src, STAT3, and MAPK, culminating in reduced tumorigenic potential and better clinical outcome71,72,73. It remains to be determined if metabolic-mediated dysregulation of AMPK signaling could be regulated by ancestry-specific traits. Indeed, few studies have reported ancestral disparity in cancer metabolomics74,75,76. Our finding illustrates that metabolic pathways might be governed by different regulators depending on ancestry, and hence reiterates the need to account for ancestry in biomarker and cancer target research.

To conclude, the rapidly evolving technological landscape and refinement of cancer treatment towards precision cancer medicine has led to the recognition that breast cancer is not a single disease, but should be studied and clinically managed as multiple distinct disease entities. It is now well appreciated that the complexity and heterogeneity of breast cancer arise from differences in cancer cell-intrinsic mechanisms as well as from dysregulation of the interplay with the stromal and immune microenvironment. Our findings support the notion of an additional level of complexity introduced by ancestry-associated traits and urge for more studies on underrepresented populations such as patients of ArA. Therefore, we advocate accounting for ancestry-specific molecular features in breast cancer research and in clinical decision making in order to guide precision cancer medicine.

Methods

Patient cohorts

Two different breast cancer cohorts were included in this study: the publicly available TCGA breast cancer dataset and a local cohort from Qatar.

RNA-sequencing data from the TCGA breast cancer cohort (n = 1082 patients) was downloaded using R (v3.5.1) and TCGA Assembler (v2.0.3, ref. 77). Sample data were extracted ensuring a single primary tumor sample per patient using the TCGA Assembler “ExtractTissueSpecificSamples” function. Clinical data for all patients were obtained from the TCGA-CDR46. Patient ancestry was obtained using SNP-based inferred ancestry data, focusing on the European, Asian, and African clusters47,48. To visualize major ancestry clusters within the TCGA-BRCA cohort, PC analysis results of Carrot-Zhang et al.48 were used to plot PC1 vs. PC2 using ggplot. Using these data, we were able to include 108 patients who previously had no reported ancestry. As SNP-based ancestry had a very high concordance with reported ancestry (99.1%), we decided to also include 63 patients for whom only self-reported ancestry was available. We excluded 31 patients from our ancestry-based analyses. First, 16 patients with American inferred ancestry as the number of samples in this cluster is limited as well as one patient who self-identified as not Hispanic or Latino. Second, six patients without self-reported or inferred ancestry and third exceptional cases of discordance between self-reported and SNP-based ancestry (n = 8; 0.9%) were excluded. The final TCGA breast cancer cohort used for analysis comprises 1051 patients (811 of EA, 184 of AA, and 56 of AsA). The tumor non-silent mutation rate, predicted neoantigen load, and aneuploidy score were obtained from Thorsson et al.78, and predicted vs expected neoantigen values were extracted from Rooney et al.59.

The RA-QA patient cohort constitutes a breast cancer cohort from Qatar (n = 24 of which 16 of ArA) with patients who were newly diagnosed with breast cancer between 2004 and 2010 at the National Centre for Cancer Care and Research (NCCCR) in Doha. Clinical information and self-reported ancestry were extracted from the medical records. The study was approved by the local ethical committees of the Hamad Medical Corporation (study approval number #14027/14), the Qatar Biomedical Research Institute (study approval number #2016-002), and Sidra Medicine (study approval number #1711015664), and was performed in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

The study protocol was granted a waiver of informed consent under the condition of anonymization and no additional intervention for the participants.

Total RNA-sequencing

RNA was isolated from four 20 μm sections of formalin-fixed paraffin-embedded (FFPE) tumor samples of the RA-QA cohort using the AllPrep DNA/RNA FFPE kit (Qiagen, Germany), followed by a quality control for purity and integrity by the Agilent Bioanalyzer system. Total RNA was depleted from ribosomal RNA and random primed for complementary DNA synthesis using the TruSeq-stranded total RNA kit (Illumina, USA). RNA-sequencing was performed on the Illumina HiSeq2500 platform (Illumina) with paired-end 25× coverage (PE100–125). The FASTQ files were trimmed to remove adaptor sequences using flexbar (v3.0.3, ref. 79) and aligned to GRCh37/hg19 reference genome using hisat2 (v2.0.5, ref. 80), resulting in an average 10–15 M aligned reads. Reads were counted to genomic features using subreads (v1.5.5, ref. 81). For both the TCGA and RA-QA cohort, RNA-seq data were corrected for GC content and normalized within and between lanes using the R package EDASeq (v2.12.0, ref. 82), and quantile normalized using the preprocessCore (v1.36.0, ref. 83).

Intrinsic molecular subtype classification

The intrinsic molecular subtype of each tumor sample was defined by the differential expression of a set of 50 genes (PAM50) using two distinct algorithms. First, the R package bioclassifier_R was used to predict sample subtype according to the Parker et al.84 subtype predictor. Second, a more recent classification model was applied using a robust classifier that integrates the PAM50 gene signature with Topological Data Analysis, resulting in seven subgroups with well-defined gene expression patterns49. The TDA classifier is based on the observed expression of five gene groups, basal (a), myo1 (b), myo2 (c), luminal (d), and Her2 (e) (Fig. 1a). The nomenclature of the identified TDA classes directly reflects the observed gene groups, for example, BasalHer2 samples are characterized by increased expression of the basal (a) and the Her2 (e) gene groups, and LumBasal samples by basal (a) and luminal (d) gene expression, and so on. An explanatory summary of the characteristics of the different TDA classes is included in Fig. 1a. Sample clustering according to both classification methods was visualized in a PAM50-based heatmap using the R package ComplexHeatmap (v1.20.0, ref. 85). Circos plots using the R package circlize (v0.4.6, ref. 86) depicted TDA reclassification of samples in comparison to PAM50 subtyping. The distribution of TDA subtypes within ancestries was assessed using stacked bar plots and χ2 tests.

ICR consensus clustering

Consensus clustering of samples according to the expression values of 20 ICR genes was performed using the ConsensusClusterPlus (v1.42.0, ref. 87) R package with the following parameters: 5000 repeats, and agglomerative hierarchical clustering with ward criterion (Ward.D2) inner and complete outer linkage as previously described51,88. The optimal number of clusters for best segregation of samples was determined using the Calinski-Harabasz criterion with samples in intermediate clusters defined as “ICR Medium.” Samples of the TCGA dataset were clustered into three groups: ICR low (cluster 1), ICR medium (clusters 2 and 3), and ICR high (cluster 4). Due to the small number of samples, the RA-QA cohort was divided into 2 groups: ICR low (clusters 1, 2, and 3) and ICR high (cluster 4).

Single-sample gene set enrichment analysis

Enrichment of specific gene sets, reflecting either abundance of immune cell populations or expression of tumor-related pathways, was defined by single-sample gene set enrichment analysis using R package GSVA (v.1.30.0, ref. 89)90. Gene set signatures of 24 distinct immune cell types or LES were used to deconvolute immune cell abundance53. Gene sets comprising numerous tumor-related pathways were obtained from multiple sources, including the Molecular Signatures Hallmark91 and Ingenuity Pathway Analysis (IPA) gene set collections and several signatures that have been associated with tumor immune escape92,93,94,95. Gene signature enrichment scores were compared based on ancestry using the two-tailed unpaired t test.

XGBoost model

We utilized an optimized version of the white-box, nonlinear, ensemble gradient boosting machine called XGBoost to build our Cox regression model for survival analysis96,97. Gradient Boosting is a machine-learning technique based on a constructive strategy by which the learning procedure will additively fit new models, typically decision trees98 and repetitively leverage the patterns in residuals to provide a more accurate estimate of the response variable or time to event, that is, death in case of survival analysis. The patients who are alive are considered as right-censored, and since the XGBoost model takes only one label for the response variable as input, the censored survival information is converted to negative labels while performing the Cox proportional hazards modeling99. XGBoost is a scalable machine-learning technique for tree boosting, a learning technique to improve the regression performance of weak regressors by repeatedly adding new decision trees to the ensembles, which enhances performance in comparison to other boosting algorithms96. The main components of XGBoost algorithm are the objective function and its iterative solution. The objective function is initialized to describe the model’s performance. Given the training dataset, \(D = \{ x^i,y^i\} _{i = 1}^N\), where xiRd, d = 54, yiR, N denotes the total number of training samples, R depicts the set of real numbers, and D represents the training set. The predicted output \({{\hat y}^l}\)obtained from the ensemble model can be represented as: \(\hat y^l = \mathop {\sum}\nolimits_{t = 1}^T {H_t} \left( {x^i} \right)\), where Ht(xi) represents the prediction score of the tth decision tree for the ith patient in the training dataset. If the decision trees are allowed to grow unregulated, then the resulting model is bound to overfit96. Hence, the following objective has to be minimized:

$$J\left( H \right) = \mathop {\sum}\limits_{i = 1}^N {L\left( {y^i,\hat y^l} \right)} + \mathop {\sum}\limits_{t = 1}^T {\Omega \left( {H_t} \right)}$$
(1)

where L is the loss function and Ω() is the penalty that is used to prevent overfitting and is defined as \(\Omega \left( {H_t} \right) = \gamma A + \frac{1}{2}{\rlap{\kern.10em \hbox{$\bar{\,}$}}\raise-.32em\hbox{${{\lambda}}$}} \mathop {\sum}\nolimits_{j = 1}^A {w_j^2}\), where γ and ƛ are the parameters that control the penalty for number of leaf nodes (A) and leaf weights (w), respectively, in the decision tree Ht.

The objective function can be rewritten as \(J\left( H \right) = {\mathop {\sum}\nolimits_{i = 1}^N} {L\left( {y^i, {{\hat y^l}}_{t - 1} + H_t\left( {x^i} \right)} \right)} + \mathop {\sum}\nolimits_{t = 1}^T {\Omega \left( {H_t} \right)}\). After applying a Taylor expansion100 and expanding Ω(Ht), we obtain:

$${\it{J}}\left( {{\it{H}}_{\it{t}}} \right) = \mathop {\sum}\limits_{{\it{i}} = 1}^{\it{N}} {\left[ {{\it{g}}_{\it{i}}{\it{H}}_{\it{t}}\left( {{\it{x}}^{\it{i}}} \right) + \frac{1}{2}{\it{h}}_{\it{i}}^2{\it{H}}_{\it{t}}\left( {{\it{x}}^{\it{i}}} \right)} \right]} + \gamma {\it{A}} + \frac{1}{2}{\rlap{\kern.10em \hbox{$\bar{\,}$}}\raise-.32em\hbox{${{\lambda}}$}} \mathop {\sum}\limits_{j = 1}^{\it{A}} {{\it{w}}_j^2}$$
(2)

where \(g_i = \partial _{y_{t - 1}}\left( {L\left( {y^i,\hat y_{t - 1}^l} \right)} \right)\) and \(h_i = \partial _{y_{t - 1}}^2\left( {L\left( {y^i, {{\hat y^l}}_{t - 1} } \right)} \right)\) are the first- and second-order gradient statistics for the loss function L. For a fixed tree structure H(x), where Ij = {i},H(xi) = j is an instance of leaf node j, the optimal weight \(w_j^o\) for leaf node j is given by:

$${\it{w}}_{\it{j}}^{\it{o}} = \frac{{ - \mathop {\sum}\nolimits_{{\it{i}} \in {\it{I}}_{\it{j}}} {{\it{g}}_{\it{i}}} }}{{\mathop {\sum}\nolimits_{{\it{i}} \in {\it{I}}_{\it{j}}} {{\it{h}}_{\it{i}}} + {\rlap{\kern.10em \hbox{$\bar{\,}$}}\raise-.32em\hbox{${{\lambda}}$}} }}$$

The corresponding optimal objective function becomes:

$${\it{J}}\left( {{\it{H}}_{\it{t}}} \right) = \frac{{ - 1}}{2}\mathop {\sum}\limits_{{\it{j}} = 1}^{\it{A}} {\frac{{\left( {\mathop {\sum}\nolimits_{{\it{i}} \in {\it{I}}_{\it{j}}} {{\it{g}}_{\it{i}}} } \right)^2}}{{\left( {\mathop {\sum}\nolimits_{{\it{i}} \in {\it{I}}_{\it{j}}} {{\it{h}}_{\it{i}}} + {\rlap{\kern.10em \hbox{$\bar{\,}$}}\raise-.32em\hbox{${{\lambda}}$}} } \right)}}} + \gamma {\it{A}}$$
(3)

Equation 3 can be used as a scoring function to measure the quality of a tree structure Ht during iteration t. This score is equivalent to the impurity score used for evaluating decision trees in random forests101. We build our XGBoost model using the fast, greedy, and iterative algorithm by Chen et al.96 to identify the optimal tree structures.

SHAP model

One of the disadvantages of the feature importance scores obtained from the XGBoost model is that the directionality is not apparent. For instance, when a particular pathway attains a high enrichment score, it is not clear whether this corresponds to a higher or lower risk of death. Moreover, at the test phase, it is a challenge for traditional white-box, tree-based, machine-learning techniques to provide information about the top five features driving the prediction to better or poorer survival prognosis. Recently, several techniques have been proposed to overcome aforementioned limitations, including LIME (Local Interpretable Model-agnostic Explanations)102 and SHAP103. These methods have the ability to interpret feature importance scores from complex training models and provide interpretable predictions for a test sample based on the top k features for that particular test instance. In our work, we used the SHAP method as it has been shown to outperform the LIME method and to be better aligned with human intuition103. The SHAP method is an additive feature attribution method where a test instance prediction is defined as a linear function of features that satisfies three critical properties: local accuracy, missingness, and consistency.

The explicit SHAP regression values are derived from a game-theory framework104,105 and can be computed as:

$$\Phi _i = \mathop {\sum}\limits_{S \subseteq Q - \{ i\} } {\frac{{\left| S \right|!\left( {\left| Q \right| - \left| S \right| - 1} \right)!}}{{\left| Q \right|!}}} \left[ {H_{S \cup \{ i\} }\left( {x_{S \cup \{ i\} }} \right) - H_S\left( {x_S} \right)} \right]$$

where Q represents the set of all d features, S represents the subsets obtained from Q after removing the ith feature, and ɸi is an estimate of the importance of feature i in the model. In order to refrain from undergoing 2|Q| differences to estimate ɸi, the SHAP method approximates the Shapley value by either performing Shapley sampling106 or Quantitative Input Influence107. A detailed description of model interpretation using the SHAP method has been outlined by Samek et al.103. In our work, SHAP values associated with a particular pathway in the XGBoost model provide information on the change in log (risk of death) for each feature of the Cox proportional hazards model.

Survival analysis

Kaplan–Meier curves were generated using the ggsurvplot function from R package “survminer” (v0.4.8) to compare OS and DSS between ancestries, ICR clusters, and AMPK subgroups. Univariate Cox proportional hazards regression analysis was performed with the R package “survival.” AJCC pathologic tumor stage as described in the TCGA-CDR was used for stratified analysis within the BasalMyo class. Forest plots were generated using the R package forestplot (v1.7.2).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.