Connecting high-resolution 3D chromatin organization with epigenomics

The resolution of chromatin conformation capture technologies keeps increasing, and the recent nucleosome resolution chromatin contact maps allow us to explore how fine-scale 3D chromatin organization is related to epigenomic states in human cells. Using publicly available Micro-C datasets, we develop a deep learning model, CAESAR, to learn a mapping function from epigenomic features to 3D chromatin organization. The model accurately predicts fine-scale structures, such as short-range chromatin loops and stripes, that Hi-C fails to detect. With existing epigenomic datasets from ENCODE and Roadmap Epigenomics Project, we successfully impute high-resolution 3D chromatin contact maps for 91 human tissues and cell lines. In the imputed high-resolution contact maps, we identify the spatial interactions between genes and their experimentally validated regulatory elements, demonstrating CAESAR’s potential in coupling transcriptional regulation with 3D chromatin organization at high resolution.


Supplementary
: Model structure details. The model includes two parts -one for predicting chromatin loops and the other for predicting the contact profile, and each part includes input layers, convolutional layers, and output layers. At last, the outputs of the two parts are summed up to generate the final output.
Supplementary Figure 2: The illustration of the loop caller (a-f) and stripe caller(g-i) in our study. a, b, c, Three neighboring regions are used to calculate the expectation of a center pixel. d, The comparison of HICCUPS and our loop caller's results in two example regions. e, The Venn diagram compares HICCUPS and our loop caller's results in the HFF Micro-C contact map. f, The overlap ratio between reported loops on two replicates of HFF Micro-C is similar between HICCUPS and our loop caller. g, Step 1: A narrow and long sliding window moves along the diagonal to identify candidate vertical stripes. h, Step 2: For each pixel on the candidate stripe, five windows are selected and a "stripe score" is calculated for evaluating whether it is on a stripe. i, An example of original contacts versus stripe scores illustrates that positive scores indicate potential stripes. a, The distribution of eQTL-TSS distances in the 12 human tissues and cell lines demonstrates that about 50% of eQTL-TSS pairs are less than 100 kb apart, which are hard to identify on low-resolution Hi-C contact maps. b, The eQTL-TSS pile-up results for different donors from the heart left ventricle and sigmoid colon are consistent. c, The example region from the imputed contact maps of two heart left ventricle donors illustrates that, besides eQTL pile-up results, the imputed tissue contact maps are mostly consistent between individuals. d, A counter-example -a loop is observed on the imputed contact map of sigmoid colon from donor GTEX-1JKYN but not donor GTEX-1K2DA, which is related to a more clear H3K27me3 peak in donor GTEX-1JKYN's epigenomic features. e, The loops between gene NIFK's TSS and its three eQTLs specific in heart left ventricle (HLV), which cannot be observed on the low-resolution Hi-C contact map, appear on the CAESARimputed contact map of HLV. Although all three eQTLs are HLV-specific, only the loop between NIFK TSS and eQTL i is HLV-exclusive; while the other two loops can also be observed on the CAESAR-imputed contact map of lung and the Micro-C contact map of HFF, respectively. f, A series of gene TTC7A's eQTLs are shared by stomach and pancreas, and both loops and stripes are observed on the CAESAR-imputed contact maps of the two tissues. As a reference, the contacts are not observed on the low-resolution Hi-C contact map of pancreas and less enriched on the CAESAR-imputed contact maps of lung.
Supplementary Figure 8: Attributing CAESAR's outputs towards input epigenomic features. a, By attributing the entire contact map to the 7 epigenomic features, we obtained the overall attribution for each epigenomic features. Since H3K4me2 is less commonly profiled and also contributes less, we can leave it out from the 6-epi model. b, The attribution is calculated at a stripe region. In the genome-wide attribution analysis of stripes, we collected attribution from the 40.2 kb region centered at the anchor of each stripe. c, The attribution is calculated at a loop region. In the genomewide attribution analysis of loops, we collected attribution from the 12.2 kb regions centered at the anchors of each loop. d, The average attribution of stripes spanning to downstream and upstream directions repsectively. e, The clustering and embedding of all stripes' attribution illustrate that there are two clusters of stripes, which means the model has learned two major patterns indicating stripes. The average epigenomic features and attribution for each cluster are visualized. f, The clustering and embedding of all loops' attribution illustrate that there are two clusters of loop, which means the model has learned two major patterns indicating stripes. The average epigenomic features and attribution for each cluster are visualized.
Supplementary Figure 9: The histograms of H3K4me3 and H3K27ac signal distribution in the genome v.s at stripe anchors. It is observed that most stripe anchors are highly enriched for H3K4me3. For the 1,000 loci with the highest H3K4me3 signal, 374 of them are stripe anchors; for the 1,000 loci with the highest H3K27ac signal, only 50 of them are stripe anchors. Therefore, although H3K4me3 and H3K27ac are both enriched in active regions, H3K4me3 shows a much higher enrichment at stripe anchors, and therefore CAESAR connects stripes to positive H3K4me3 attribution. Instead, CAESAR is likely to regard H3K27ac as a feature related to "active regions but not stripes" and gives negative attribution.  12 * * For tissues or cell lines without available ATAC-seq data, we collected DNase-seq instead.
14 Supplementary is not a cancer cell line.  we will only impute one contact map for the specific human tissue (referred to as "mixed-donor tissue"). In the end, we The model includes two major parts -one for predicting chromatin loops, and the other for predicting contact profile. Each part includes consecutive input layers, convolutional layers, and output layers (Supplementary Figure 1). CAESAR captures the interpolated Hi-C contact map as a graph G with nodes representing genomic regions of 200 bp long, and weighted edges representing chromatin contacts. A is the adjacency matrix of G. For both parts, the inputs include the graph adjacency matrix A and the epigenomic features X. As one 250 kb region is fed into the model each time, the dimension of the input adjacency matrix is 1250×1250. In a 6-epigenomic model, the size of the epigenomic feature matrix is 6×1250. In addition, eight positional encoding dimensions are concatenated to the epigenomic features. The positional encoding is calculated with the following method, in which pos is from 0 to 1249 and i is from 0 to 7 [8]. In deep learning models, convolutional kernels are small filters sliding through the input to extract certain patterns.

55
When the filter is applied to an input element, it calculates the weighted sum of the element with its local neighbors. In a 56 convolutional layer, multiple kernels work in parallel to learn different sets of weights and extract different patterns. There are 57 two types of convolutional layers, 1-D convolutional (Conv1D) and graph convolutional (GC) layers in CAESAR. Conv1D 58 layers operate along the genome fiber, aggregating the epigenomic features from nearby bins. GC layers extract spatial 59 epigenomic patterns over the spatial neighborhood specified by G. Here, we use the GC layer in which X and Y are the input and output,Ã is the normalized graph adjacency matrix, W is the trainable parameters, and    In existing literature, there are two major categories of machine learning approaches for imputing Hi-C contact maps.

93
The first category takes low-resolution contact maps as input and treats Hi-C contact maps as 2-D images, exemplified by 94 HiCPlus [10]. The second category predicts the contacts between every two bins with genomic or epigenomic features from 95 the two bins, exemplified by HiC-Reg [11]. Therefore, we use HiCPlus and HiC-Reg as two baselines in our experiments.

96
HiCPlus is a deep-learning model with three sequential layers, in which the first and third layers are Conv2D layers, and 97 the second layer is a fully-connected layer. Since the matrices in our study are much bigger, we accordingly increased both 98 the number and the size of Conv2D kernels. We set the number of Conv2D kernels to be 96, and the size of Conv2D kernels 99 to be 15×15. The model was re-trained with hESC train set, in which the inputs were the Hi-C contact maps and the targets All predictive models generate false positives. Since CAESAR predicts more loops and stripes than Micro-C contact 159 maps should be, we carefully investigated the false positives produced by CAESAR. We observed that CAESAR's false-  Figures 6b and 6d). This demonstrates that some fine-scale structures predicted by 197 CAESAR are cell type-specific. The integrated gradient can be applied to arbitrary regions of the imputed contact map. Here we show, by calculating the 201 attribution of all stripe regions, we can identify sub-types of stripes. 202 We selected the stripes which were called on both Micro-C and CAESAR-imputed contact maps. Since the stripes were 203 called at 1 kb resolution, we identified the accurate stripe anchors at 200 bp resolution by selecting the row/column with the 204 largest summation on the Micro-C contact map. The stripe regions were defined as 11×500 long rectangles starting from 205 the stripe anchor on the diagonal and stretching in the same direction as the stripes. We calculated the attribution of all 206 stripe regions with integrated gradient. Only the attribution near the anchors (i.e., 100 bins both upstream and downstream) 207 were preserved (Supplementary Figure 8b), resulting in 6×201 attribution matrices. We observed a significant difference of attributions between stripes spanning to upstream and downstream directions (Supplementary Figure 8d). Therefore, we 209 flipped the attribution results for all stripes which span to the upstream direction. Afterwards, we used PCA to reduce the 210 dimension from 1,206 (i.e., 6×201) to 50, and then k-means to cluster the 50-dim vectors. The 50-dim vectors are further 211 transformed into 2-dim with tSNE for visualization.

212
The attribution near all stripe anchors can be clustered into two groups, in which each group has its characteristic patterns 213 in average attribution and epigenomic features. Cluster 1 stripes have higher CTCF attribution, while cluster 2 stripes have 214 higher H3K4me1 attribution. We inferred that cluster 1 stripes are related to chromatin structure maintenance and cluster 1 215 stripes are related to regulatory activities, which are referred to as "structural stripes" and "regulatory stripes" respectively 216 (Supplementary Figure 8e).

217
Similar analysis was performed on loops, in which the attribution on loop anchors was clustered (Supplementary Figure   218 8c). Two groups of loops are observed. Cluster 1 loops have higher CTCF attribution, while cluster 2 loops have higher 219 H3K4me1/me3 attribution. The two clusters are also referred to as "structural loops" and "regulatory loops" (Supplementary 220 Figure 8f).

221
Although the sub-types still need to be further explored and experimentally validated, this approach provides inter-222 pretable insights into our "black box". CAESAR does not attribute stripes positively to H3K27ac. Since CAESAR is data-driven, the supporting evidence can be 227 found from the original epigenomic data. We observed that, among the 1,000 loci with the highest H3K4me3 signal on test 228 set chromosomes, 374 are stripe anchors. By contrast, among the 1,000 loci with the highest H3K27ac signal on the same 229 chromosomes, only 50 are stripe anchors (Supplementary Figure 9). Therefore, CAESAR correlates stripes with H3K4me3 230 but regards H3K27ac as a feature of "active regions but not stripes" and attributes it negatively. The imputed high-resolution contact maps are shared on a web server (https://nucleome.dcmb.med.umich.edu/), which 234 allows users to easily navigate these fine-scale chromatin structures, and the corresponding explanatory epigenomic fea-235 tures. The back-end of the server uses python Flask with sqlite. The front-end of the server uses bootstrap framework.

236
The web server utilizes multi-threading to allow multiple users to access it at the same time. Our web server processes 237 host data at multiple ports at localhost. We use Nginx to perform the reverse proxy that passes internet requests to them.