Chromatin marks govern mutation landscape of cancer at early stage of progression

Accumulation of somatic mutations over time leads to tissue abnormalities, such as cancer. Somatic mutation rates vary across the genome in a cell-type specific manner, depending on the types of mutation processes1–7. Although recent studies have identified several determinants relevant to the establishment of the cancer mutation landscape8–13, these studies have yet to propose the major time point at which these factors come into play during cancer progression. Here, we analyzed whole genome sequencing data from two different types of precancerous tissues, monoclonal B-cell lymphocytosis and Barrett’s esophagus, and their matching cancer types along with 423 epigenetic features from normal tissues to determine the critical time point when chromatin features contribute to the formation of the somatic mutation landscape. Our analyses revealed that a subset of cell-of-origin associated chromatin features can explain more than 80% of the regional mutation variance for both types of precancerous tissues, comparable to the variance explained level for the genomes of matching cancer types. In particular, major significant chromatin features explaining the mutation landscape of Barrett’s esophagus and esophageal adenocarcinoma were derived from stomach tissues, indicating that mutation landscape establishment occurs mostly after environment-mediated epigenetic changes during gastric metaplasia. Analyses of the genome of esophageal squamous cell carcinoma tissues demonstrated that the proposed time point for mutation landscape establishment of Barrett’s esophagus and esophageal adenocarcinoma were specific to the occurrence of cell-type shift. Thus, our data suggest that the major time point for the mutation landscape establishment dictated by chromatin features is early in the process of cancer progression, and epigenetic changes due to environmental conditions at early stages can dramatically impact the somatic mutation landscape of cancer.

Somatic mutation rates vary across the genome in a cell-type specific manner, depending on 12 the types of mutation processes [1][2][3][4][5][6][7] . Although recent studies have identified several 13 determinants relevant to the establishment of the cancer mutation landscape 8-13 , these studies 14 have yet to propose the major time point at which these factors come into play during cancer 15 progression. Here, we analyzed whole genome sequencing data from two different types of 16 precancerous tissues, monoclonal B-cell lymphocytosis and Barrett's esophagus, and their 17 matching cancer types along with 423 epigenetic features from normal tissues to determine 18 the critical time point when chromatin features contribute to the formation of the somatic 19 mutation landscape. Our analyses revealed that a subset of cell-of-origin associated 20 chromatin features can explain more than 80% of the regional mutation variance for both 21 types of precancerous tissues, comparable to the variance explained level for the genomes of 22 matching cancer types. In particular, major significant chromatin features explaining the 23 mutation landscape of Barrett's esophagus and esophageal adenocarcinoma were derived 24 from stomach tissues, indicating that mutation landscape establishment occurs mostly after 25 environment-mediated epigenetic changes during gastric metaplasia. Analyses of the genome 26 of esophageal squamous cell carcinoma tissues demonstrated that the proposed time point for 27 mutation landscape establishment of Barrett's esophagus and esophageal adenocarcinoma 28 were specific to the occurrence of cell-type shift. Thus, our data suggest that the major time 29 point for the mutation landscape establishment dictated by chromatin features is early in the 30 process of cancer progression, and epigenetic changes due to environmental conditions at 31 early stages can dramatically impact the somatic mutation landscape of cancer. 32 Recent advances in cancer genomics have so far revealed numerous somatic mutation 33 landscapes for various cancer types, leading to a number of key findings. Identification of 34 new driver gene mutations, deciphering clonal evolution structure and profiling tumor 35 heterogeneity within and among different patients through examination of mutations, mainly 36 at the gene level [1][2][3][4][5][6][7] , have successfully addressed the genes contributing to cancer progression 37 and identified novel therapeutic targets. Beyond these gene-focused approaches, systematic 38 analyses of mechanisms that could explain genomic regional variations in mutation rates 39 across various cancer types could significantly extend our understanding about common 40 contributors to the establishment of mutation landscapes before and during cancer 41 progression. To this end, a number of studies have examined relationships between regional 42 mutation frequencies across the genome and several types of features, including gene 43 expression level, DNA sequence context, mutation profiles of nucleotide excision and 44 mismatch repair genes, histone post-translational modifications, and open chromatin marks 45 such as DNase-seq profiles [8][9][10][11][12][13][14][15] . Although these factors display high correlation with regional 46 mutation rates, somatic mutation profiles used for the studies were limited to fully progressed 47 tumors. Thus, it remains unknown whether the correlations between regional mutation 48 frequencies and cell-of-origin chromatin marks are established either gradually during cancer 49 progression or during a specific critical time period, either pre-or post-malignancy. Analyzing 50 the mutation landscapes of precancerous, non-neoplastic tissues alongside matching cancer 51 tissues could help to determine the major time points where chromatin marks shape the 52 mutation landscape. 53 Here, we analyzed a total of 38 precancerous lesions including monoclonal B cell 54 lymphocytosis (MBL) and Barrett's esophagus (BE) (methods). Representative matching 55 cancer types were also analyzed, corresponding to a total of 144 tumor samples from chronic 56 lymphocytic leukemia (CLL) and esophageal adenocarcinoma (EAC). In addition, a total of 57 14 esophageal squamous cell carcinoma (ESCC) samples were analyzed to represent cancer 58 without any defined precancerous stages during progression with a matching cell-of-origin. 59 We performed principal coordinate analysis (PCOA) to test whether the average mutation 60 rate differences reported previously 16,17 were reflected in the level of 1 megabase window 61 regional mutation frequencies. Consistent with the differences in average mutation frequency, 62 both MBL samples and CLL samples were indistinguishably located and formed separate 63 clusters based on immunoglobulin heavy chain variable region (IGHV) mutation status, a key 64 marker for distinguishing either naive-B cells or memory B cell origin for both MBL and 65 CLL 16,18 . These results indicate that cell-of-origin differences might dictate the differences in 66 regional mutation frequencies, rather than cancer progression-based status alone (Fig. 1a). In 67 contrast, individual BE tissues formed clusters with the EAC tissues separate from the ESCC 68 tissues, suggesting that the matching of cancer progression history might serve as a stronger 69 factor than the cell-of-origin context itself (Fig. 1b). Collectively, these results show 70 similarity in regional variation in mutation frequencies of precancerous tissues and matching 71 cancer types, indicating that the effect of cell-of-origin context might be cancer-type 72 dependent. 73 Whole-genome analyses of distinct cancer types depict cell-of-origin chromatin marks as the 74 strongest feature explaining the cancer mutation landscape, with a number of proposed 75 mechanisms 10 . Based on the IGHV mutation status-based clustering of MBL and CLL tissues 76 in PCOA, we hypothesized that differential IGHV mutation status would correlate with 77 distinct chromatin features explaining the regional mutation variation, and similar chromatin 78 features would come up as significant when comparing IGHV mutation type-matching MBL 79 and CLL genomes. To confirm the former part of the hypothesis, we first employed a random 80 forest regression-based chromatin feature selection algorithm to identify significant 81 chromatin features explaining the variance in regional mutation rates for different sample 82 groups. Indeed, significant chromatin features explaining regional mutation variations were 83 different between IGHV unmutant and groups (Extended Data Fig. 1a). Top-ranked 84 chromatin features for both groups were derived from CD19-positive cells, which is expected 85 since the CD19 marker cannot distinguish between naive and memory B cells. To further 86 examine whether the differences in chromatin features were cell-type dependent, we 87 performed chromatin feature selection after removing the 1Mbp regions containing IGHV 88 mutation status-associated differential DNA methylation SNPs, which also highly overlaps 89 with differential DNA methylation SNPs between naive and memory B cells 19-21 . This 90 approach resulted in 3 out of 4 top significant chromatin features between the IGHV-mutant 91 and unmutant groups (Extended Data Fig. 1b), implying that the differential chromatin 92 features explaining mutation frequency landscapes of distinct IGHV mutation status might 93 actually correlate with differences in cell-of-origin context. Next, we compared chromatin 94 features that might explain regional mutation variations across the genomes of IGHV-95 mutation-status-matched MBL and CLL tissues. Due to the limits of sample size and average 96 mutation rate of the samples, only IGHV-mutant MBL and CLL genomes were subjected to 97 further analyses. Notably, the top two ranked chromatin features were identical between 98 IGHV-mutant MBL and CLL samples (Fig. 2a), implicating that the subset of chromatin 99 marks might commonly dictate the formation of regional mutation landscape for both pre-100 cancerous tissues and matching cancer type. Additional examination of simple correlation 101 between regional mutation frequency and histone modification levels derived from CD19-102 positive cells at the 1 megabase-level revealed marginal differences between MBL and CLL 103 tissues ( Fig. 2b and Extended Data Fig. 2a). The correlation between the CD19 DNase1-seq 104 profile and regional mutation frequency was higher for CLL than MBL for chromosome 2 105 (Fig. 2c) and other chromosomes (Extended Data Fig. 3a), but this finding might be due to 106 the different number of samples between MBL and CLL, as the correlation score for MBL at 107 chromosome 2 was highly similar to the correlation scores for CLL (0.76 vs. 0.75) after 108 sample-number matching. These results demonstrate that the cell-of-origin chromatin context, 109 defined by the IGHV mutation status, serves a major role in shaping the mutation landscape 110 of both MBL and CLL tissues, suggesting that the cell-of-origin chromatin landscape could 111 govern the establishment of the somatic mutation landscape for CLL early in cancer 112 progression, even before the precancerous cell type, MBL, is apparent. 113 Cell type shift, represented as gastric metaplasia, is one of the main hallmarks in the 114 development of BE 22 . Thus, one could assume that the critical time point for the 115 establishment of the mutation landscape for BE could be either before or during the course of 116 cell type shift, or after its completion. Chromatin feature selection analysis of the mutation 117 landscape of BE and EAC tissues confirmed that high-ranked chromatin features were 118 derived from the stomach tissue type (Extended Data Fig. 4) for both tissues, without any 119 significant esophageal chromatin features. Simple correlation between regional mutation 120 frequency and histone modification marks from stomach and esophagus tissues revealed 121 marginal differences between BE and EAC tissues (Extended Data Fig. 2b, c), and this 122 pattern was also consistent with the correlation to stomach tissue DNase I hypersensitivity 123 profile (Extended Data Fig. 3b). Moreover, six features covering all stomach chromatin 124 features subjected to the feature selection analysis solely explained over 80% of the regional 125 mutation variance for both BE and EAC tissues, which is unlikely to be random (p value < 126 2.2e-16) (Extended Data Fig. 5). These results imply that the major time point of mutation 127 landscape establishment for BE is most likely to be after the cell type shift into stomach 128 mucosa-like cells. Chromatin feature selections on subgroups of somatic mutations for BE 129 and EAC based on overlap and uniqueness of the mutations shared common top-ranked 130 stomach chromatin features (Fig. 3a). In addition, chromatin feature selection on sample 131 subgroups with respect to dysplasia grade revealed that the top features all originated from 132 stomach tissue (Extended Data Fig. 6) and the variance explained level for all of the 133 dysplasia-based subgroups using six stomach tissue chromatin features were similar to the 134 variance explained level using all 423 chromatin features (Fig. 3a). This finding was 135 consistent with the high correlation to stomach tissue DNase I hypersensitivity profile 136 (Extended Data Fig. 3c). From all of these results, we infer an early time point for 137 establishment of the mutation landscape for EAC, even prior to the occurrence of dysplasia 138 for BE, but most likely after epigenetic changes due to gastric metaplasia. 139 To ensure that the chromatin features shaping the mutation landscape of BE and EAC were 140 not common to any esophageal cancer type, we analyzed the genome of ESCC, another 141 cancer type derived from the esophageal squamous epithelium without any precancerous 142 stages with cell type shift. Although the regional mutation frequency of ESCC correlated 143 with histone modification marks from stomach and esophagus tissues in a similar manner 144 (Extended Data Fig. 2d), chromatin feature selection revealed a subset of squamous cell type 145 and esophagus chromatin features that were significant and distinct from BE and EAC 146 (Extended Data Fig. 7). Moreover, measuring the level of variance explained values per 147 tissue or cell type categories showed stomach chromatin features to be the strongest ones for 148 BE and EAC, reaching higher than 90% of the variance level explained by the 423 total 149 chromatin features, whereas esophageal chromatin features were dominant for ESCC (Fig. 4). 150 Notably, the variance explained values for each category displayed non-significant 151 relationship with simple correlations between the chromatin marks from different tissue or 152 cell types (BE r s = 0.36, EAC r s = 0.36, ESCC r s = 0.18). These results imply a distinct 153 process of mutation landscape establishment for these cancer types that varies depending on 154 the presence of precancerous tissues with cell-type shifts. 155 In conclusion, our data suggest that the major time point for the establishment of the mutation 156 landscape governed by chromatin marks could be early, even prior to the phenotypic 157 emergence of precancerous tissues. Results from BE and EAC also raise the possibility that 158 epigenetic changes due to environmental insults, represented as a cell type shift, could serve 159 as a primary role by affecting the course of the establishment of the mutation landscape 160 (Extended Data Fig. 8). Further comprehensive studies to decipher the mutation landscape of 161 other precancerous tissues with metaplasia and discover the exact mechanisms controlling the 162 timing of mutation landscape establishment would lead to a better understanding of the effect 163 of epigenetic marks on shaping the precancerous tissues and matching cancer genome and 164 help identify possible biomarkers for early-stage detection of cancer. 165

167
Data 168 For the purposes of our project, we used somatic mutation data from CLL, MBL, BE, EAC, 169 and ESCC tissues. In the case of CLL and MBL genome data, total mutations were acquired 170 from Supplementary To calculate the regional mutation density and mean signal of chromatin features, all 184 autosomes were split in 1-Mbp regions followed by filtering out regions containing 185 centromeres, telomeres and low quality unique mappable base pairs. To determine regional 186 mutation density and histone modification profiles, we counted the total number of somatic Feature selection based on random forest regression.

200
A random forest regression-based feature selection algorithm was performed as described 10 201 with modifications. Briefly, the training set for each tree was constructed, followed by using 202 out-of-bag data to estimate the mean squared error. Thus, there was no need to perform 203 additional tests for error evaluation. Out-of-bag data were also used to estimate the 204 importance of each variable. In each out-of-bag case, the values corresponding to each 205 variable were randomly permuted, then tested to each tree. Subtracting the score of the mean 206 squared error between the untouched out-of-bag data cases and the variable-m-permuted 207 cases, the raw importance score of variable m was measured. By calculating the average 208 score of variable m in the entire tree, the rank of importance for each variable was determined.

209
A total of 1,000 random forest trees were employed to predict mutation density using a total 210 of 423 chromatin features. Every random forest model was repeated 1,000 times.

211
After the random forest algorithm step, greedy backward elimination was performed to select 212 the top 20 chromatin variables. Subsequent removal of the lowest rank variable was done to 213 calculate the variance explained value measurements for each variable. To conduct feature 214 selection on all of the samples corresponding to the particular pre-cancerous tissues or cancer 215 types, mutation density was calculated by adding samples in each case. However, a number 216 of particular analyses employed the subgrouping of samples. In the case of chromatin feature 217 selection assessing the effects of differential DNA methylation between IGHV-mutants and 218 unmutants (Extended Data Fig. 1b), a total of 935 regions containing differentially 219 methylated CpGs 20 were removed prior to the analysis. To perform feature selection 220 classified by differential dysplasia states (Extended Data Fig. 6), samples were divided into 3 221 groups: 17 samples of no dysplasia, 3 samples of low-grade dysplasia and 2 samples of high-222 grade dysplasia. In the case of feature selection after subgrouping for distinct and common 223 mutations (Fig. 3a), all mutations in paired-samples of BE and EAC were divided into 3 224 different groups: Barrett's only, EAC only, and common mutations.

226
Analysis of mutation frequency variance explained by chromatin features. 227 To examine the effect of a particular cell-type specific chromatin context on explaining 228 regional variability of mutation density across the genome, chromatin features were 229 subgrouped based on the feature selection algorithm. To study the differences in variance 230 explained values among distinct cell types, 8 groups were categorized (Fig. 4). Each group 231 included 5 chromatin markers common among the 8 cell-type based groups: H3K27me3, 232 H3K36me3, H3K4me1, H3K4me3 and H3K9me3.                           (a) Random forest regression-based chromatin feature selection in relations to the regional mutation frequency of IGHV-mutant MBL and CLL samples. Each chromatin feature is ranked by importance value, and variance explained scores are represented by bar length. Error bars demonstrate minimum and maximum values derived from 1,000 repeated simulations. Red lines display variance explained scores determined by 423 features -1 SEM, and CD19 chromatin features are green-colored. (b) Univariate correlation between CD19 chromatin features that displayed significance in the feature selection models and the regional mutation density of IGHV-mutant MBL or CLL. Spearman's rank correlations (r) are shown on each plot. (c) The density plot for regional mutation density of IGHV-mutant MBL or CLL and CD19 DNase1 accessibility index (reverse scale) across the full chromosome 2. Spearman's rank correlations (r) are shown on each plot.  Average variance explained scores for pre-cancerous or matching cancer genomes were separately calculated using the tissue or cell type-based subgroup-classified chromatin features. The pink panel represents subgroups with the highest variance explained score for each cell type. The red line indicates the variance explained score when using all 423 epigenomic features. Dots represent the Spearman's rank correlations (r) of chromatin features between the highest variance explained-scored subgroup and the remaining subgroups. Error bars demonstrate minimum and maximum values derived from 1,000 repeated simulations.