Kataegis in clinical and molecular subgroups of primary breast cancer

Kataegis is a hypermutation phenomenon characterized by localized clusters of single base pair substitution (SBS) reported in multiple cancer types. Despite a high frequency in breast cancer, large-scale analyses of kataegis patterns and associations with clinicopathological and molecular variables in established breast cancer subgroups are lacking. Therefore, WGS profiled primary breast cancers (n = 791) with associated clinical and molecular data layers, like RNA-sequencing data, were analyzed for kataegis frequency, recurrence, and associations with genomic contexts and functional elements, transcriptional patterns, driver alterations, homologous recombination deficiency (HRD), and prognosis in tumor subgroups defined by ER, PR, and HER2/ERBB2 status. Kataegis frequency was highest in the HER2-positive(p) subgroups, including both ER-negative(n)/positive(p) tumors (ERnHER2p/ERpHER2p). In TNBC, kataegis was neither associated with PAM50 nor TNBC mRNA subtypes nor with distant relapse in chemotherapy-treated patients. In ERpHER2n tumors, kataegis was associated with aggressive characteristics, including PR-negativity, molecular Luminal B subtype, higher mutational burden, higher grade, and expression of proliferation-associated genes. Recurrent kataegis loci frequently targeted regions commonly amplified in ER-positive tumors, while few recurrent loci were observed in TNBC. SBSs in kataegis loci appeared enriched in regions of open chromatin. Kataegis status was not associated with HRD in any subgroup or with distinct transcriptional patterns in unsupervised or supervised analysis. In summary, kataegis is a common hypermutation phenomenon in established breast cancer subgroups, particularly in HER2p subgroups, coinciding with an aggressive tumor phenotype in ERpHER2n disease. In TNBC, the molecular implications and associations of kataegis are less clear, including its prognostic value.


INTRODUCTION
Breast cancer genomes are shaped by somatic changes including epigenetic and DNA alterations like single base pair substitutions (SBSs), indels, structural rearrangements, and copy number alterations (CNAs), which together infer high molecular heterogeneity even across patients with similar clinical features.The activity of several mutational processes has been demonstrated in breast cancer, including endogenous processes like DNA repair deficiency and APOBEC mutagenesis [1].The genetic readout of many mutational processes can be approximated through the concept of mutational signatures, as illustrated by Alexandrov et al. in 2013 for mutational SBS signatures [2].Currently, a variety of different SBS signatures, indel signatures, structural rearrangement signatures, and CNA signatures have been reported [1][2][3][4][5].The most studied type of mutational signatures are the SBS signatures which are based on the trinucleotide context of SBSs, i.e., the triplet of bases comprising the single base alteration and adjacent bases immediately 5' and 3' [3].Currently, 49 different SBS signatures have been reported based on pan-cancer analysis [5].
Two of the most distinct SBS signatures with respect to their trinucleotide contexts are signatures SBS2 and SBS13.These signatures were identified already in the breast cancer study by Nik-Zainal et al. in 2012 [3] and have been associated with the activity of APOBEC cytidine deaminases (APOBEC mutagenesis) [2].APOBEC mutagenesis has in turn been linked to a mutational phenomenon referred to as kataegis, which is characterized by clusters of SBS hypermutation biased towards a single DNA strand shown to co-localize with specific rearrangement signatures at the vicinity of structural rearrangements [2][3][4]6].Kataegis hypermutation typically comprises C>N mutations in a TpC context [1], although a T>N conversion in a TpT or CpT process attributed to error-prone polymerases has also been reported [7].Kataegis is proposed to be due to the dominant acting apolipoprotein B editing catalytic subunit 3b (APOBEC3B) enzyme that deaminates genomic DNA cytosines and promotes mutation rates higher than normal [8].Kataegis is typically defined by an intermutation distance between adjacent SBSs, e.g., as six or more consecutive mutations with an average intermutation distance of less than or equal to 1,000 bp [2].In a recent pan cancer whole genome sequence (WGS) based study kataegis events were found in 60.5% of all cancers, with particularly high abundance in lung squamous cell carcinoma, bladder cancer, acral melanoma, and sarcomas [6].The APOBEC signature accounted for 81.7% of these kataegis events.
In breast cancer, kataegis has been reported in up to 50% of tumors [3,4,9].Despite this high frequency there is a shortage of studies that have analyzed the frequency and associations of kataegis with clinicopathological and molecular variables in large patient cohorts stratified by established clinical and molecular subgroups.In 2016, Antonio et al. [9] reported the analysis of a small cohort of 97 breast cancers using WGS.This study reported that kataegis was associated with a distinct transcriptional signature, late onset disease, better patient prognosis, and higher HER2 levels.Additionally, it revealed an enrichment of kataegis events on chromosomes 8, 17, and 22, and depletion of these events on chromosomes 2, 9, and 16.However, the majority of associations reported in that study were based on the usage of a transcriptional classifier of kataegis in a gene expression cohort.Thus, kataegis, as defined by WGS (the gold standard), still appears remarkedly understudied in established clinical and molecular subgroups of breast cancer.
To address the limited understanding of kataegis in early-stage primary breast cancer, our study aimed to comprehensively describe, characterize, and analyse the association of kataegis with clinicopathological and molecular factors, transcriptional patterns, and patient outcome with a focus on established clinical and molecular subgroups defined by ER, PR, HER2, PAM50, and homologous recombination deficiency (HRD) status.Our analysis is conducted on a set of 791 primary breast cancer profiled by WGS and coupled with additional clinical and molecular data.This undertaking represents a scale of studying kataegis not previously reported.

SCAN-B unselected population-based TNBC cohort
Based on the Sweden Cancerome Analysis Network -Breast (SCAN-B) study [10,11], 235 TNBC patients diagnosed with primary invasive breast tumors and enrolled between 2010 to 2015 were included, originally reported in [12].Specific patient inclusion and exclusion criteria for the SCAN-B cohort are reported in the original publication [12], and patients in this cohort have previously been shown to be representative of the underlying breast cancer population of the healthcare region in which they were enrolled [12].All tumors had curated WGS and RNA sequencing data (FPKM) available, as well as complete clinicopathological data, PAM50 subtypes, and TNBCtype [13] subtypes (BL1, BL2, M, LAR) [12,14].Clinicopathological and molecular characteristics for the 235 SCAN-B patients' tumors are summarized in Table 1.
Based on FPKM data, gene expression-based rank scores for eight biological metagenes in breast cancer originally defined by Fredlund et al. [15] were calculated as described by Nacer et al. [16].Pathology estimated proportions of tumor infiltrating lymphocytes (TILs) on whole slide H&E sections were obtained from [17].Proportions (exposure on tumor level) of SBS signatures (COSMIC v2) were taken as SigFit computed values from the study by Aine et al. [17].

BASIS selected breast cancer cohort
The BASIS cohort comprises 560 patients of all clinical subtypes of breast cancer with curated WGS data reported by Nik-Zainal et al. [4].BASIS is a selected cohort of breast cancers based on tissue samples from several European institutions collected over a large time span.
Clinicopathological and molecular characteristics of BASIS patients' tumors are summarized in Table 1.The BASIS cohort lacks complete treatment and patient follow-up data, limiting meaningful survival analyses.Of the 560 cases, 265 had available RNA-sequencing data (log2 transformed FPKM) and PAM50 subtypes as outlined in the original publication (using the AIMS PAM50 algorithm [18]).Of the 265 cases, 183 were ER-positive (ERp) (ERpHER2p or ERpHER2n).Based on FPKM data, gene expression-based rank scores for eight biological metagenes in breast cancer originally defined by Fredlund et al. [15] were calculated as described by Nacer et al. [16].Rank scores were computed individually for each tumor without any normalization or data centering (i.e., they represent single sample scores).BASIS TNBC cases with FPKM data were classified into the TNBCtype subtypes (BL1, BL2, M, LAR) using the online classification tool with default parameters [13].

Kataegis analysis
SBSs were mapped to the hg19 genome build in the original studies.Analysis of kataegis was performed using the R KataegisPortal package (v1.0.3) [19] with default settings, including a requirement of at least six consecutive SBSs with a maximum intermutation distance of 1000bp.To map detected kataegis events to genes and functional elements in KataegisPortal the suggested packages from the vignette were used, including BSgenome (v1.66.[21].

Statistical methods
All p-values reported are two-sided and were compared to a level of significance of 0.05 unless otherwise specified.Boxplot elements correspond to: (i) center line = median, (ii) box limits = upper and lower quartiles, (iii) whiskers = 1.5x interquartile range.Differential gene expression analysis was performed using the Significance Analysis of Microarray (SAM) method [22].In the SCAN-B cohort, tumors with FPKM=0 for a gene had their log2 FPKM value set to 0. In BASIS subgroups, only genes without any missing log2 FPKM data were used.Functional annotation enrichment analysis was performed using the clusterProfiler (v4.8.3) R package [23].Input values were t-test p-values and log2 fold change values of all genes (processed as for the SAM analysis).A multiple testing adjusted p-value<0.05using Benjamini-Hochberg (BH) correction was considered statistically significant.A gene list of 628 genes reported to be differentially expressed between breast cancers with and without kataegis [9], were analyzed in subgroups using Student's t-test on log2 transformed FPKM values similar to the SAM analysis.

Survival analysis
Survival analyses were performed in R (v4.2.2) using the survival (v3.4.0) and survminer (v0.4.9) packages.Survival analyses were performed only in the SCAN-B TNBC cohort, due to incomplete outcome data in BASIS.Distant recurrence-free interval (DRFI) defined according to the STEEP criteria [24] was used as the primary endpoint for TNBC patients treated with standard of care adjuvant chemotherapy (FEC-based [combination of 5 fluorouracil, epirubicin, and cyclophosphamide] ± a taxane in 96% of cases) according to national guidelines.Details on patient inclusion and exclusion criteria, treatment details, endpoint definition, and CONSORT diagram relevant for the survival analysis of the SCAN-B TNBC cohort are available in [12].Survival curves were estimated using the Kaplan-Meier method and compared using the log-rank test.

Ethical approval
This study is based on open access data.All SCAN-B enrolled patients provided written informed consent prior to study inclusion as described in [12].

Clinical and molecular characteristics of SCAN-B and BASIS cohorts.
Characteristics of the investigated SCAN-B and BASIS cohorts are outlined in Table 1.The SCAN-B and BASIS TNBC cohorts were analyzed separately due to a significant difference in binary kataegis frequency, but also to provide validation of results in independent cohorts.
It should be acknowledged that the BASIS cohort is not a population-representative cohort and is underpowered with respect to the analyzed number of HER2-positive tumors.

Frequency of kataegis in clinical and molecular subgroups of breast cancer
Number of kataegis events per tumor was investigated in four clinical subgroups of primary breast cancer: TNBC, ERnHER2p, ERpHER2p, and ERpHER2n, demonstrating a markedly higher number of kataegis loci in HER2p groups irrespective of ER-status (Figure 1A).
Consistently, in a binary context the frequency of kataegis (³1 event on chromosomes 1-  1B-F).These analyses demonstrated that kataegis events were more frequent in ERpHER2n PAM50 Luminal B tumors compared to Luminal A tumors, more frequent in PR-negative versus PR-positive tumors, and more frequent in high-grade tumors compared to lower grades, while no differences were observed for lymph node status or HRDetect status.
While lower sample numbers precluded subgroup analyses within the ERnHER2p and ERpHER2p groups, we could analyze kataegis in both SCAN-B (n=235) and BASIS TNBC (n=163) tumors similar to ERpHER2n tumors.These analyses were performed separately for each TNBC cohort as we did observe a significant difference in binary kataegis frequency between the two cohorts (Chi-square test p=2e-6).In neither SCAN-B nor BASIS TNBC tumors were the number of kataegis events associated with PAM50 subtypes, TNBCtype subtypes, HRDetect status, or lymph node status (Figure 1G-L and Supplementary Figure S1).
Grade 3 SCAN-B TNBC tumors showed more kataegis events compared to lower stages (Kruskal-Wallis p=0.01), but this observation was not supported in BASIS TNBC tumors (p=0.24)(Supplementary Figure S1).

Association of kataegis with patient age and molecular variables in breast cancer subgroups
Kataegis has been proposed to be associated with a higher age at diagnosis of breast cancer [9].
In SCAN-B TNBC, BASIS TNBC, ERpHER2p, and ERpHER2n tumors binary kataegis was not significantly associated with patient age (Wilcoxon's test p>0.05)(Figure 2A).Only in the smallest clinical group, ERnHER2p (n=27 tumors), was a notably higher age at diagnosis in kataegis positive (³1 kataegis event) patients observed, albeit not statistically significant (Wilcoxon's test p=0.07) and based on small patient numbers (Figure 2A).
To explore associations of kataegis with genomic variables we analyzed if binary kataegis status was associated with tumor ploidy, mutational burden, the fraction of the genome altered by copy number alterations, and the fraction of the genome affected by LOH in TNBC, ERnHER2p, ERpHER2p, and ERpHER2n tumors.A significant association or trend of higher tumor mutational burden (TMB) with positive kataegis status was observed across all subgroups (Figure 2B).For tumor ploidy, higher ploidy was observed in kataegis positive BASIS TNBC (Wilcoxon's test p=0.04),but not in SCAN-B TNBC or any other tested subgroup (Supplementary Figure S1).For both the fraction of the genome affected by copy number alterations or by LOH, significantly higher fractions (i.e., a more altered tumor genome) were only observed in kataegis positive ERpHER2n tumors (Figure 2C and Supplementary Figure S1).To investigate if kataegis was associated with broader transcriptional programs we analyzed expression rank scores of eight biological metagenes (from [15]) in the clinical subgroups stratified by binary kataegis status.Due to a lack of matched gene expression data for BASIS HER2-positive tumors this analysis was restricted to the TNBC and ERpHER2n groups.In both TNBC cohorts, this analysis identified only a weaker increase in metagene rank scores for the mitotic progression (proliferation associated) metagene in kataegis positive tumors (Wilcoxon's test p<0.05 in both SCAN-B and BASIS) (Figure 2D).In BASIS TNBCs, kataegis positive tumors showed lower rank scores of the immune response metagene (Wilcoxon's test p<0.05), with a similar, but non-significant, trend of lower immune response rank scores observed also in kataegis positive SCAN-B tumors (Wilcoxon's test p=0.15).
Moreover, a non-significant trend of lower TIL scores was also observed for kataegis positive TNBC tumors in the SCAN-B cohort (Wilcoxon's test p=0.16).In ERpHER2n tumors, several metagenes showed significant rank score differences between kataegis positive and negative cases, including higher rank scores of the mitotic progression and mitotic checkpoint metagenes (both associated with expression of proliferation related genes), as well as higher immune response scores in the kataegis positive group (Figure 2E).In contrast, rank scores for the steroid response metagene were lower in kataegis positive ERpHER2n tumors (Figure 2E).
To investigate if specific tumor driver alterations were associated with binary kataegis status in the clinical subgroups we analyzed reported driver gene alterations (SBS mutations, indels, structural rearrangements, and copy number alterations) from the original studies [4,12].In ERnHER2p tumors no significant associations were identified, potentially because of the small group size (Chi-square test p>0.05).In the small ERpHER2p subgroup a difference in CDH1 alterations was found (Chi-square test p=0.02) with a higher frequency in kataegis negative tumors.PIK3CA and PIK3R1 alterations were more frequent in kataegis negative BASIS TNBC tumors (p=0.01 and p=0.02, respectively), but these findings were not significant in the SCAN-B cohort.In contrast, in the SCAN-B TNBC cohort only alterations involving CCND1, KRAS, ARID1B, and TP53 were found to be more frequent in kataegis positive tumors (p<0.05).In ERpHER2n tumors alterations involving CCND1 (11q13.3), ZNF703/FGFR1 at 8p12, MDM2 (12q15), MYC (8q24.21),TP53, and ZNF217 (20q13.2) were all more frequent in kataegis positive tumors, while PIK3CA and MAP3K1 alterations were more frequent in kataegis negative tumors (Chi-square test p<0.05).

Association of kataegis with patient outcome after adjuvant chemotherapy in TNBC
Association of binary kataegis status with distant metastasis free interval after adjuvant chemotherapy (mainly FEC-based therapy, see [12]) was investigated in SCAN-B TNBC patients, however, without finding any significant association (Figure 2F).To assure that the binary stratification based on detecting at least one event did not bias the survival analysis, we also conducted survival analysis using two and three events as cutoffs.No significant association was found in these analyses (log-rank p>0.05).Due to incomplete survival and treatment data similar survival analyses could not be performed in the TNBC, ERpHER2n, ERnHER2p, or ERpHER2p subgroups within the BASIS cohort.

Genomic contexts of kataegis loci in clinical breast cancer subgroups
To investigate the contexts of kataegis loci, we aggregated the KataegisPortal annotations for each kataegis loci for all kataegis events in a clinical subgroup and compared proportions between subgroups.The subgroup comparison was first restricted to the BASIS cohort and showed similar distributions of kataegis loci (Figure 3A), with a majority of loci annotated to be in distal intergenic (39-52%) or intronic regions (31-36%), while approximately 7-10% of events were in promoter regions.In comparison, in the SCAN-B TNBC cohort 48% of kataegis events were in distal intergenic regions, 40% in intronic regions, while 7% in promoter regions based on KataegisPortal annotations.
We further examined the distribution of SBSs associated with kataegis events across the genome with respect to different genomic contexts and functional elements for each clinical subgroup (excluding HER2p groups due to small sample sizes).To this end, each SBS involved in a kataegis event was annotated by mapping to regions of open chromatin (based on ATA Cseq data), DHS, intronic, exonic, intergenic, and different repeat regions, as well as functional elements like CTCF binding regions and different chromatin states.For each clinical subgroup, we computed the sum of such SBSs annotated to a specific genomic context, normalized the sum by the total number and compared the proportions across subgroups.Accordingly, in this comparison only kataegis positive tumors were included.Notably, while proportions differed between investigated contexts, each such pattern was relatively consistent across subgroups (Figure 3B showing the open chromatin comparison and Supplementary Figure S2 for all comparisons).
To investigate the enrichment for kataegis involvement among SBSs in various genomic contexts we extended our analysis to include both kataegis positive and kataegis negative tumors.We computed the proportions of SBSs using an approach analogue to that illustrated in Figure 3B, now applying it to all SBSs.This analysis was restricted to TNBC and ERpHER2n due to sample sizes.Comparisons across subgroups identified higher proportions of kataegis associated SBSs mapped to open chromatin regions versus the full SBS context in both kataegis positive and negative tumors (Figure 3C and Supplementary Figure S3 for all comparisons).Similar, but considerably weaker positive trends were also observed for DHS regions and intronic regions.Weak opposite trends (i.e., lower proportions for kataegis associated SBSs versus an all SBS context) were observed for SINE and low complexity repeat elements.Using this analysis approach, we found no evidence of enrichment or depletion of kataegis associated SBSs for chromatin states related to transcription start site (Tss) elements, different enhancer elements, or quiescent chromatin regions (loci not transcribed) compared to the full SBS context in both kataegis positive and negative tumors (Supplementary Figure S3).The displayed ratio corresponds to the kataegis SBS bar divided by the "all SBS" bar for kataegis positive tumors.

Patterns of recurrent kataegis in clinical breast cancer subgroups
To coarsely illustrate prevalence of kataegis loci across the genome we compared the number of events detected per chromosome and clinical tumor subgroup (Figure 4A-E).This analysis revealed subgroup differences regarding chromosomes frequently harboring kataegis events.
For instance, chromosome 17 and 1 frequently harbored kataegis events in all subgroups except TNBC.Events appeared more often on chromosome 8 in ER-positive subgroups compared to ER-negative tumors, whereas chromosome 6, 11, and 12 harbored events more often in ERpHER2n tumors compared to the other subgroups.It should be noted that sample numbers are small for the HER2p groups so there is uncertainty in conclusions for these subgroups.
Moreover, the chromosome analysis also demonstrated a difference between the two TNBC cohorts, supporting that they should be examined separately.
To expand our analysis of recurrence of kataegis events across the genome from Figure 4A-E we next increased the analysis resolution beyond whole chromosomes by counting events for each tumor mapped to consecutive 2 MBp sized bins throughout the genome (i.e., bins can include multiple closely spaced kataegis events for a single sample).Using these counts we then generated frequency plots for respective tumor subgroup (Figure 4F and Supplementary

Transcriptional patterns associated with kataegis in breast cancer subgroups
To investigate if binary kataegis status could explain general transcriptional variation in breast cancer we first performed an unsupervised UMAP clustering of FPKM data in the SCAN-B TNBC, BASIS TNBC, and ERpHER2n subgroups based on available matched expression data (ERnHER2p and ERpHER2p excluded due to lack of matched RNA-sequencing data in BASIS).As shown in Figure 5A, UMAP analysis provided no apparent evidence of association with global gene expression patterns.
Next, we performed supervised SAM analysis within the clinical subgroups to identify differentially expressed genes (DEGs) between kataegis positive and negative tumors.This analysis identified 2148 DEGs in SCAN-B TNBC, 882 in BASIS TNBC, and 1976 in ERpHER2n tumors (unadjusted SAM p<0.05).After multiple correction adjustment no genes remained significant in the TNBC cohorts, and only three genes (SIAE, LIMCH1, and NTN4) remained significant in ERpHER2n tumors (adjusted q<0.05).The near absence of significant DEGs after multiple testing correction contrasts with the previously reported number of 628 genes identified as differentially expressed [9].This discrepancy prompted us to investigate the expression patterns of these genes in the TNBC cohorts and ERpHER2n tumors analyzed in our study.In the SCAN-B TNBC cohort, none of the 556 matched genes had a significant FDRadjusted p-value based on an unpaired t-test analysis (p<0.05).For BASIS TNBC the corresponding numbers were zero significant genes out of 183 genes with complete log2 transformed FPKM data for all tumors (i.e., no missing values).For the ERpHER2n group, 17 out of 202 matched genes were significant (FDR-adjusted p<0.05).These 17 genes were FGF10, QSOX1, PLEK2, RGS5, AURKAPS1, GLRB, AIM1, WHSC1L1, CYP4Z1, PRR11, ENTPD5, TUBA3D, BRF2, CX3CR1, RAB3D, TMEM132A, and EAF2.
Given the predominantly negative results in our supervised analysis of DEGs, we proceeded with a GSEA analysis to explore gene sets distinguishing between kataegis positive and negative tumors.This analysis identified both activated and suppressed pathways, much in agreement with the biological metagene expression analyses previously shown in Figure 2.
Specifically, in both SCAN-B and BASIS TNBC cases suppressed GSEA pathways in kataegis positive tumors included different immune pathways, while for ERpHER2n tumors cell cycle / proliferation associated pathways were activated and pathways associated with extracellular matrix terms were found to be repressed (Figure 5B-D).

DISCUSSION
In this study, we aimed to investigate the genome-wide patterns, frequency, and associations of kataegis with clinicopathological characteristics, molecular variables, and transcriptional patterns across established clinical and molecular breast cancer subgroups.Although high frequency of kataegis in breast cancer have been observed repeatedly for over a decade, our study represents a large-scale analysis of the phenomenon that has not been reported to date.
Under a binary categorization, we observed that kataegis frequency was markedly higher in the HER2p subgroups, even though these subgroups had relatively small patient numbers.
This observation is consistent with the report by Antonio et al. [9], as well as a reported higher APOBEC exposure (associated with kataegis) in HER2p breast cancer groups [25,26].For the HER2p groups, the high kataegis frequency appears to be largely driven by a notably high kataegis frequency on specific chromosomes such as 1, 8, 17 (often including the ERBB2 locus), and 20 (Figure 4).A notable observation is the difference (for unknown reasons) in kataegis frequency between the SCAN-B and BASIS TNBC cohorts.This discrepancy underscores the importance of basing conclusions on data from several independent cohorts.
In this context, the SCAN-B cohort has been demonstrated to be representative of the underlying TNBC patient population in the catchment region for years of inclusion [12].This contrasts with the BASIS cohort, which is multi-national and multi-institutional.In ERpHER2n tumors, our combined results show a strong associated between kataegis and a typically aggressive ERpHER2n phenotype, including then Luminal B PAM50 subtype, higher tumor grade, PR negativity, higher tumor mutational burden, generally more copy number alterations and LOH, consistent with [9].
To explore associations of kataegis in clinical breast cancer subgroups we used available clinical and molecular data from both WGS and RNA-sequencing.In contrast to Antonio et al.
[9], we did not find support for kataegis being associated with a higher age at diagnosis in any of the tested clinical subgroups.However, two consistent molecular findings across all subgroups emerged: a significant pattern or trend indicating higher tumor mutational burden in kataegis positive tumors, and a non-significant association of kataegis with HRD status (which is also associated with higher tumor mutational burden).Although higher exposure to APOBEC mutagenesis (associated with kataegis) have been previously reported in HER2p breast cancer groups based on mutational signature analysis [25,26], many HRD positive tumors, e.g.among TNBCs, also show exposure to APOBEC related mutational signatures like SBS2 and SBS13 (see Supplementary Figure S4).Given this context, the non-overlapping agreement between two common genetic phenotypes (HRD and kataegis), both of which are driven by endogenous mutational processes, deserves further investigation.
In TNBC, where our insights are more validated based on the usage of two independent cohorts, we found that kataegis was not associated with PAM50 subtypes or proposed TNBC specific subtypes.Moreover, for adjuvant chemotherapy treated SCAN-B TNBC patients, kataegis showed no prognostic association with DRFI.This finding aligns logically with the observed insignificant association with HRD status, which has been shown to be prognostic in the SCAN-B cohort [12].Also, there was no significant association between kataegis and an immune response mRNA metagene or pathology estimated TILs scores, both recognized as prognostic variables in TNBC (see for example [27]) in the SCAN-B cohort.
The driver gene association analysis clearly demonstrated the impact of different cohorts on the results.In ERpHER2n tumors, driver gene alterations associated with a positive kataegis status predominantly included TP53 and genes in regions commonly amplified in breast cancer.
In contrast, kataegis negative tumors showed more enriched for PIK3CA and MAP3K1 alterations.While PIK3CA alterations were also more frequent in kataegis negative BASIS TNBC tumors this trend was not observed in SCAN-B kataegis negative TNBC tumors.The SCAN-B cohort supported higher frequency of TP53 alterations in kataegis positive tumors, a finding not mirrored in the BASIS TNBC cohort.Taken together, the driver gene analysis was most consistent for ERpHER2n tumors, indicating that observed kataegis is associated with recurrent amplifications of key oncogenes in breast cancer, often found in PAM50 Luminal B and HER2p tumors [28,29].
Kataegis loci have been proposed to stabilize the expression of neighboring genes, and kataegis as a binary event has been reported to be associated with a specific 628 gene mRNA signature based on analysis of 97 tumors of mixed clinical subgroups [9].In our study we explored transcriptional patterns associated with kataegis within each clinical subgroup, employing both unsupervised and supervised approaches.Notably, our unsupervised global UMAP analysis demonstrated that binary kataegis status does not appear to significantly explain transcriptional variation in neither TNBC nor ERpHER2n tumors, suggesting the absence of a distinct transcriptional phenotype attributed to kataegis in these subgroups.Our unsupervised biological metagene analyses indicated that kataegis positive tumors generally tended to exhibit higher expression of genes associated with proliferation.In ERpHER2n tumors there was also a trend towards lower expression of steroid response associated genes.This latter finding is in agreement with a generally lower ER signaling, PR negativity, and a Luminal B subtype, as opposed to the typical patterns in Luminal A tumors (see e.g., data from [15,30]).A notable distinction between TNBC and ERpHER2n tumors was an inversed pattern of the Fredlund et al. [15] immune response metagene.In kataegis positive ERpHER2n tumors, there were higher rank scores, whereas in TNBC tumors, there was a trend of lower scores (as well as lower TIL estimates), though this trend was not significant in either BASIS or SCAN-B.Supervised differential gene expression failed to identify significant numbers of DEGs after multiple testing correction in both TNBC and ERpHER2n tumors.Similarly, the previously reported list of 628 kataegis associated DEGs in breast cancer [9] validated poorly in both TNBC cohorts as well as in the ERpHER2n subgroup.Together, these findings support our UMAP results, suggesting that kataegis does not constitute a distinct transcriptional entity.
In contrast with the SAM analyses, GSEA analysis identified enriched pathways consistent with results from the biological metagene analysis.Specifically, for TNBC, we observed suppressed activity of immune response pathways in both the BASIS and SCAN-B cohorts, while results for activated pathways appeared more mixed.This variability underscores the critical importance of careful curation and control of study cohort composition to draw accurate biological conclusions.For the ERpHER2n subgroup, GSEA analysis confirmed the activation of cell cycle related pathways in kataegis positive tumors and reinforced the association of kataegis with an aggressive, typically poor patient outcome phenotype by demonstrating suppression of different ECM related pathways typically associated with higher metastatic potential.Due to lack of matched tissue for in situ analyses in our study, it was not possible to address the observed trends regarding the interplay between immune response and kataegis status, a topic that warrants further investigation.Specifically, it remains unclear whether kataegis hypermutation events induce immunogenicity in some molecular background (like ERpHER2n), but not in others (e.g., TNBC), or whether observed associations are more related to other correlative characteristics like a high tumor mutational burden and frequent structural rearrangements.
Our genome wide analysis of recurrent kataegis demonstrated that recurrent alterations are often in close proximity to recurrently amplified regions and established oncogenes in breast cancer.As such, this finding is consistent with the reported location of kataegis events in the vicinity of structural rearrangements [2][3][4]6], although only 2% of rearrangements in the original BASIS study were reported to be associated with a kataegis loci [4].Our observed localization of kataegis loci close to amplifications were particularly clear for the ER-positive tumor groups, but also for ERnHER2p tumors with respect to the 17q12 locus (including ERBB2), although it should be noted that our studied ERnHER2p group is very small.In comparison, recurrent kataegis events were less frequent in TNBC and targeted different genomic loci (Figure 4F).Based on the used kataegis analysis tool, kataegis loci were most often annotated as distal intergenic or intronic, with only a small portion of loci annotated in proximity of, or in, promoter regions (Figure 3A).We found no support for kataegis loci targeting different genomic contexts or functional elements across clinical breast cancer subgroups.In agreement with Antonio et al. [9], we observed an enrichment of kataegis associated SBSs (the SBS comprising the kataegis event) mapping to open chromatin regions, and to some extent DHS regions, when compared to a complete SBS background in kataegis positive and negative TNBC and ERpHER2n tumors.However, in contrast to the report by Antonio et al. [9] we did not observe any consistent enrichment of kataegis associated SBSs in transcription start sites or depletion in quiescent chromatin regions.As such, the actual functional impact of kataegis SBSs on topographical genome features, or mRNA expression remains unclear.
Several limitations should be considered regarding the results presented in this study.The sample size for some clinical groups, particularly for the ERnHER2p and ERpHER2p groups, are small, which may impact findings.Moreover, as the BASIS cohort is not population representative, we acknowledge that frequency patterns may change if a similar analysis is performed in a representative ER and HER2 cohort.For analyses involving topographical features like chromatin states or regions of open chromatin, our study, along with other reported pan-cancer studies, is limited in that these features were mapped in samples unrelated to the studied cancers [31].Another limitation is the lack of both complete transcriptional data and patient treatment and outcome data in the BASIS cohort, as well as an independent ER-positive validation cohort.These shortages preclude conclusions about the prognostic relevance of kataegis in ER-positive treatment groups, despite an apparent association with an aggressive tumor phenotype, stressing the need for further prognostically oriented investigations.
In this study we have focused on delineating kataegis in established breast cancer subgroups to provide a more nuanced understanding of its frequency and clinicopathological associations in primary breast cancer.Our findings show that in ERpHER2n disease, kataegis positive tumors are associated with more aggressive disease characteristics, while in TNBC the molecular implications and associations of kataegis, including its prognostic significance, remain less clear.

DECLARATIONS ETHICS APPROVAL AND CONSENT TO PARTICIPATE
This study is based on open access data.All SCAN-B enrolled patients provided written informed consent prior to study inclusion as described in [12]

Figure 1 . 5 B
Figure 1.Number of kataegis events in breast cancer subgroups.(A) Clinical subgroups defined by ER, PR, and HER2-status.SCAN-B and BASIS TNBCs are separately displayed.(B) PAM50 Luminal A and Luminal B in ERpHER2n tumors.(C) Progesterone receptor (PR) status in ERpHER2n tumors.(D) Tumor grade in ERpHER2n tumors.(E) Lymph node status in ERpHER2n tumors.(F) HRD status in ERpHER2n tumors.(G) PAM50 subtypes in SCAN-

Figure 3 .
Figure 3. Genomic contexts of kataegis in clinical breast cancer subgroups.(A) Proportion of all kataegis events (loci) in a clinical subgroup in the BASIS cohort annotated to different genomic contexts by the KataegisPortal tool.Detailed contexts from the tool have been aggregated to the specified larger annotations for each loci.(B) Proportion of kataegis associated SBSs (i.e., SBSs in a kataegis loci) mapped to regions of open chromatin (ATAC) per clinical subgroup.For each subgroup, all SBSs associated with kataegis that were mapped or not mapped to ATAC regions were summarized across all tumors, i.e., proportions represent the proportion of all kataegis SBS mapped or not in kataegis affected tumors only.(C) Proportions of ATAC mapped kataegis SBS from panel B compared to similarly computed proportion based on all SBSs in kataegis positive and negative tumors for the SCAN-B TNBC, BASIS TNBC, and ERpHER2n subgroups.Top axis for each bar reports the number of SBSs mapped to ATAC across all tumors and the total number of SBS across all tumors in the group.

Figure 4 .
Figure 4. Genome patterns of recurrent kataegis in clinical breast cancer subgroups.Panels A-E show the number of kataegis events per chromosome (chromosome 1-22) across all tumors in a subgroup.(A) SCAN-B TNBC, (B) BASIS TNBC, (C) ERnHER2p (D) ERpHER2p, and (E) ERpHER2n.(F) Genome-wide frequency plots of the number of mapped kataegis events per 2MBp bin (y-axis) per tumor subgroup, ordered according to genomic position (x-axis).Technically, multiple kataegis loci could be mapped to the same bin if occurring close to each other.Chromosome boundaries are indicated by vertical red lines.Bins with recurrent kataegis events are labelled by cytoband.

3 externalFigure 5 .
Figure 5. Unsupervised and supervised gene expression patterns associated with kataegis in clinical breast cancer subgroups.(A) UMAP analysis of FPKM data from SCAN-B TNBC (left: 235 tumors, 19675 genes), BASIS TNBC (center: 73 tumors, 16129 genes), and ERpHER2n (right: 186 tumors, 17632 genes).Tumors are colored by binary kataegis event status (positive: ³1 event, negative: 0 events).The first two UMAP components are shown.(B) GSEA analysis for significantly activated and suppressed pathways in SCAN-B TNBC tumors stratified into kataegis positive or negative as in A. (C) GSEA analysis for significantly activated and suppressed pathways in BASIS TNBC tumors stratified into kataegis positive or negative as in A. (D) GSEA analysis for significantly activated and suppressed pathways inERpHER2p tumors stratified into kataegis positive or negative as in A. In B-D, the Gene ratio axis represents the number of genes from the query gene list that overlap with the gene set of a specific pathway or Gene Ontology (GO) term.This ratio is calculated by dividing the number of overlapping genes by the total number of genes in the set.The Count circle size represents the number of query gene list that overlap with the gene set and the adjusted p-value represents the statistical significance of the enrichment adjusted for multiple testing using Benjamini-Hochberg (BH) correction.

Associations of kataegis with patient age, molecular variables and patient outcome in breast cancer subgroups. (A)
status in BASIS TNBC tumors.P-values were calculated using Wilcoxons' test (2-group) or Kruskal-Wallis test (>2 groups).Distribution of patient age versus binary kataegis classification in breast cancer subgroups defined by ER, PR, and HER2.(B) Distribution of tumor mutational burden (sum of SBS and indels per million base pair sequence) versus binary [15]us.(E) Rank scores for four biological metagenes from Fredlund et al.[15]showing a difference in scores for binary kataegis status in ERpHER2n tumors.(F) Kaplan-Meier plot of the association of binary kataegis status with distant relapse-free interval (DRFI) as clinical endpoint in SCAN-B TNBC patients treated with adjuvant chemotherapy.The P-value was calculated using the log-rank test.All BASIS tumors do not have matched RNA-sequencing data.P-values were calculated using Wilcoxons' test (2-group) or Kruskal-Wallis test (>2 groups).
[4]thical approval was given for the SCAN-B study (approval numbers 2009/658, 2010/383, 2012/58, 2013/459, 2015/277) by the Regional Ethical Review Board in Lund, Sweden, governed by the Swedish Ethical Review Authority, Box 2110, 750 02 Uppsala, Sweden.Patients in the BASIS cohort provided consentto research as outlined in the original publication[4].All analyses were performed following local and international regulations for research ethics in human subject research.