Stratification of TAD boundaries identified in reproducible Hi-C contact matrices reveals preferential insulation of super-enhancers by strong boundaries

The metazoan genome is compartmentalized in megabase-scale areas of highly interacting chromatin known as topologically associating domains (TADs), typically identified by computational analyses of Hi-C sequencing data. TADs are demarcated by boundaries that are largely conserved across cell types and even across species, although, increasing evidence suggests that the seemingly invariant TAD boundaries may exhibit plasticity and their insulating strength can vary. However, a genome-wide characterization of TAD boundary strength in mammals is still lacking. A systematic classification and characterization of TAD boundaries may generate new insights into their function. In this study, we first use fused two-dimensional lasso as a machine learning method to improve Hi-C contact matrix reproducibility, and, subsequently, we categorize TAD boundaries based on their insulation score. We demonstrate that higher TAD boundary insulation scores are associated with elevated CTCF levels and that they may differ across cell types. Intriguingly, we observe that super-enhancer elements are preferentially insulated by strong boundaries, i.e. boundaries of higher insulation score. Furthermore, we perform a pan-cancer analysis to demonstrate that strong TAD boundaries and super-enhancer elements are frequently co-duplicated in cancer patients. Taken together, our findings suggest that super-enhancers insulated by strong TAD boundaries may be exploited, as a functional unit, by cancer cells to promote oncogenesis.


INTRODUCTION 42
The advent of proximity-based ligation assays has allowed scientists to probe the three-43 dimensional chromatin organization at an unprecedented resolution [1,2]. Hi-C, a high-

intramax = max(mean(L), mean(R)) and inter = mean(X) 184
For more details, see [18]. 185 186 TAD calling using the "ratio" insulation score 187 For TAD calling, we first calculated the "ratio" insulation score for each bin at 40kb resolution. 188 Then, TAD boundaries (of size equal to the bin size, i.e. 40kb) were identified as local maxima 189 of the insulation scores across each chromosome. Only insulation scores above a certain 190 cutoff were considered as potential TAD boundaries. The cutoff was determined such that the 191 false discovery rate (FDR) of the identified local maxima was not greater than 10%. The FDR 192 was estimated by applying the same procedure (calculate "ratio" insulation scores and seeking 193 local maxima) on randomized Hi-C matrices. The randomized Hi-C matrices were generated 194 by permuting the original matrix values separately for each "diagonal" of the matrix (i.e.  interaction values at a given distance between interacting loci), so that the distribution of Hi-C 196 signal as a function of distance between interacting loci was preserved in the randomized 197 matrix. The code is publicly available as part of the HiC-bench distribution. 198 199

Categorization of TAD boundaries based on fused two-dimensional lasso 200
We applied two-dimensional fused lasso to categorize TAD boundaries based on their strength. 201 The rationale behind this categorization is that topological domains separated by more 202 "permissive" (i.e. weaker) boundaries [32] will tend to fuse into larger domains when lasso is 203 applied, compared to TADs separated by well-defined, stronger boundaries. We indeed 204 applied this strategy and categorized boundaries into multiple groups ranging from the most 205 permissive to the strongest boundaries. The boundaries that were lost when λ value was 206 increased from 0 to 0.25, fall in the first category (λ=0), the ones lost when λ was increased to 207 0.5, in the second (λ=0.25) etc. 208 9 209

Categorization of TAD boundaries based on insulation scores 210
We stratified TAD boundaries into five categories (I through V) of equal size according to their 211 insulation score, independently in each Hi-C dataset used in this study. Category I contained 212 TAD boundaries with the lowest insulation scores and category V contained those with the 213 highest. Before calculating insulation scores, we first processed the Hi-C matrices using ICE, 214 calCB and scaling and then applied fused 2D lasso (with optimal λ). Then, TAD calling and 215 TAD boundary insulation score calculations were performed using our "ratio" or the "crane" 216 method and the boundaries were classified into five equal-size categories, as described above. 217 218

Analysis of CTCF and H3K27ac ChIP-seq data 219
All ChIP-seq data was uniformly processed using the HiC-bench platform [18]. Raw 220 sequencing files were aligned using bowtie2 version 2.3.1 with standard parameters. Only 221 uniquely mapped reads were selected for downstream analysis. PCR duplicates were 222 removed using Picad-tools version 1.88. Macs2 version 2.0.10.20131216 were used to call 223 narrow peaks for CTCF and broad peaks for H3K27ac with default parameters. 224 225

Association of CTCF levels with boundary strength categories 226
We obtained CTCF ChIP-sequencing data for the cell lines utilized in this study (with the 227 exception of KBM7 for which no publicly available dataset was available, see Supplementary 228 Table 1 for details). Total CTCF levels (i.e. aggregated peak intensities from potentially 229 multiple CTCF peaks) at each TAD boundary were calculated and their normalized 230 distributions for each boundary category (weak to strong) were plotted in boxplots in order to 231 demonstrate the association of increased boundary strength with increased levels of CTCF 232 binding. We performed this analysis separately for TSS-only and non-TSS CTCF binding sites. 233 The rationale behind these separate analyses was based on the observation that several TAD 234

Analysis workflow 271
The overall workflow, including our benchmark strategy and downstream analysis, is 272 summarized in Figure 1. Initial alignment and filtering of the collected Hi-C sequencing 273 datasets was performed with HiC-bench [18] (see Methods for details). Quality assessment 274 analysis revealed that the samples varied considerably in terms of total numbers of reads, 275 ranging from ~150 million reads to more than 1.3 billion (Supplementary Figure 1a). 276 Mappable reads were over 96% in all samples. The percentages of total accepted reads 277 corresponding to cis (ds-accepted-intra, dark green) and trans (ds-accepted-inter, light green) 278 Figure 1b) also varied widely, ranging from ~17% to ~56%. The 279 characteristic drop of average Hi-C signal as a function of distance between interacting loci 280 was observed (Supplementary Figure 1c). The main part of analysis starts with unprocessed 281

(Supplementary
Hi-C contact matrices ("filtered" matrices). We then generate processed Hi-C matrices using 282 ICE "correction" [31], our "scaling" approach (see Methods) and calCB [26]. Finally, fused two-283 dimensional lasso is applied on the processed Hi-C matrices. Matrix reproducibility between 284 biological replicates is assessed across samples for a variety of parameters, for example, 285 resolution, distance between interacting loci, sequencing depth, etc, using stratum-adjusted 286 correlation coefficients [30]. Finally, downstream analysis, involves the characterization of 287 TAD boundaries based on their insulating strength, the enrichment in CTCF binding, proximity 288 to repeat elements and super-enhancers, and, finally, their genetic alterations in cancer.

Assessment of same-enzyme and cross-enzyme reproducibility of Hi-C contact 291 matrices 292
Hi-C is prone to biases and multiple algorithms have been developed for Hi-C bias correction, 293 Specifically, we focused on multiple factors that may play an important role on reproducibility: 301 first, we separately considered biological replicates of Hi-C libraries generated with the same 302 or different restriction enzymes; second, we studied the impact of Hi-C matrix resolution (i.e. 303 bin size); third, we assessed reproducibility as a function of the distance of interacting loci 304 pairs; fourth, we studied the impact of sequencing depth. Stratum-adjusted correlation 305 coefficients (SCC) were calculated for each pair of replicates (same-or cross-enzyme) on Hi-306 C contact matrices estimated by four methods: (i) naïve filtering (i.e. matrix generation by 307 simply using double-sided accepted intra-chromosomal read pairs from Supplementary 308 summarizes the correlations between replicates generated by the same restriction enzyme, 314 whereas the right panel the correlations between replicates generated by a different restriction 315 enzyme. In both scenarios, as expected, reproducibility drops quickly as finer resolutions (from 316 100kb to 20kb) are considered. The same conclusion applies for increasing distance (from 317 2.5Mb to 10Mb) between interacting loci, demonstrating that long-range interactions require 318 ultra-deep sequencing (beyond what is currently available in most of the datasets in this study) 319 in order to be detected reliably. To elaborate on this point, we repeated the analysis after 320 resampling at higher sequencing depth (Supplementary Figure 4). Both conclusions hold 321 true with the new sequencing depth and are independent of the Hi-C contact matrix estimation 322 method. From this benchmarking study, we conclude that reproducibility of Hi-C contact 323 matrices across biological replicates is not ideal and that there is a need for computational 324 methods to improve it. In the next sections, we focus on improving the reproducibility of the available Hi-C datasets lack biological replicates of ultra-deep sequenced samples, we 361 evaluated our method by testing whether it could recover the 5kb loops identified in [20] in a 362 single biological sample of GM12878, the most deeply sequenced sample in this study (~3 363 billion read pairs of which ~900 million intra-chromosomal read pairs passed our filtering 364 criteria). As a recovery metric, we used the fraction of the reported loops within the top 365 interactions as ranked by our fused lasso approach. We observed that by tuning the λ 366 parameter we improved this metric by an 8% relative improvement (Supplementary Figure  367   6a). For the optimal λ, our method ranked most of the known loops (~90%) in the top 10% of 368 measured interactions (~79% in the top 5% of all measured interactions). We also evaluated 369 the sensitivity of our approach to subsampling. In particular, we re-computed the interaction 15 matrices using 200, 400, 600, 800 million intra-chromosomal read pairs, rerun the analysis 371 and obtained relative improvements of 9%, 14%, 22% and 32% respectively (Supplementary 372 Although fused 2D lasso improves reproducibility of Hi-C matrices between biological 378 replicates, there is a possibility that this is achieved at the expense of losing cell-type 379 specificity. To test this, we compared the effect of λ on the reproducibility between biological 380 replicates (intra-cell-type) to its effect on the stratum-adjusted correlation coefficients between 381 unrelated samples in our Hi-C dataset collection (inter-cell-type). For this test, we chose to 382 focus on the collection of H1 stem cell Hi-C replicates and their derivatives generated by the 383 Ren lab [19], so that we could assess the effect of smoothing on subtle cell-type specific 384 differences in experiments performed in a single lab. Hi-C matrices were distance-normalized 385 (similar to [41], see Methods for details) to account for the dependence of the Hi-C signal on 386 the distance between interacting loci. The results of this analysis are presented in Figure 3: 387 although both intra-cell-type and inter-cell-type stratum-adjusted correlation coefficients 388 increase by λ (Figure 3a), the difference between intra-cell-type and inter-cell-type correlation 389 coefficients also increases (Figure 3b), suggesting that fused 2D lasso actually preserves cell 390 type specificity of Hi-C contact matrices, a behavior that is consistent independent of the matrix 391 "correction" method. Nevertheless, some "correction" methods appear to work better than 392 others in combination with lasso. In addition, we also evaluated an alternative "smoothing" 393 method, two-dimensional mean filter smoothing, recently made available as part of the 394 HiCRep package [30]. In Figure 3c, we show the results of the comparison of the three 395 correction methods in combination with the smoothing techniques using two metrics: 396 preservation of cell type specificity (x axis) and intra-cell-type reproducibility (y axis). The main 397 conclusions from this comparison are: (a) smoothing (lasso or mean filter) improves both 398 metrics independent of the correction method, and (b) fused lasso performs slightly better than 399 mean filter smoothing in preserving cell type specificity, while it behaves slightly worse in 400 improving intra-cell-type specificity. In Figure 3d, we further demonstrate the trade-off 401 between intra-cell-type and inter-cell-type metrics when using 2D lasso or 2D mean filtering. 402 403

Fused lasso reveals a nested TAD hierarchy linked to TAD boundary insulation scores 404
After demonstrating that parameter λ improves reproducibility of Hi-C contact matrices 405 independent of the bias-correction method, we hypothesized that increased values of λ may 406 also define distinct classes of TADs with different properties. For this reason, we now allowed 407 λ to range from 0 to 5. We then identified TADs at multiple λ values using HiC-bench, and we 408 observed that the number of TADs is monotonically decreasing with the value of λ 409 (Supplementary Figure 7a), suggesting that by increasing λ, we are effectively identifying 410 larger TADs encompassing smaller TADs detected at lower λ values. Indeed, when comparing 411 TAD boundaries detected at successive λ values, we found that higher λ values produced 412 TAD boundaries that are almost a strict subset of TAD boundaries produced at lower λ values 413 (~94% overlap when considering only the exact bin as a true common TAD boundary, and 414 ~98% when TAD boundaries are allowed to differ by at most one bin between TADs generated 415 for successive λ values). Equivalently, certain TAD boundaries "disappear" as λ is increased. 416 Therefore, we hypothesized that TAD boundaries that disappear at lower values of λ are 417 weaker (i.e. lower insulation score), whereas boundaries that disappear at higher values of λ 418 are stronger (i.e. higher insulation score). To test this hypothesis, we identified the TAD 419 boundaries that are "lost" at each value of λ, and generated the distributions of the insulation 420 scores for each λ across samples. As insulation score, we used the Hi-C "ratio" score (see 421 Methods), which was shown to outperform other TAD calling methods [18]. Indeed, as 422 hypothesized, TAD boundaries lost at higher values of parameter λ are associated with higher 423 TAD insulation scores (Supplementary Figure 7b). 424

Stratification of TAD boundaries by insulating score reveals an association with 426 enriched CTCF levels 427
Motivated by the observation that with increasing λ, weaker TAD boundaries are not detected, 428 we decided to explore in depth the properties of TAD boundaries with respect to their insulation 429 score. To this end, we stratified TAD boundaries into five categories (I through V) of equal size 430 according to their insulation score, independently in each Hi-C dataset used in this study. As 431 shown in Figure 4a, we first processed the Hi-C matrices using ICE, calCB and scaling and 432 applied fused 2D lasso with "optimal" λ, defined as the λ value beyond which no statistically 433 significant improvement on the reproducibility is observed. The statistical significance was 434 assessed using a Wilcoxon test between the distributions of stratum-adjusted correlation 435 coefficients across chromosomes in given sample for successive λ values. The procedure is 436 demonstrated using an IMR90 replicate as an example (Supplementary Figure 7c). Then, 437 TAD calling and TAD boundary insulation score calculations were performed using our "ratio" strength was found to be positively associated with CTCF levels, suggesting that stronger 449 CTCF binding confers stronger insulation. Since we noticed that several TAD boundaries 450 contain TSSs, this analysis was done separately for TSS-only CTCF peaks (Figure 4c) and 451 for all CTCF peaks (see below). Both approaches revealed the same trend, with the exception 452 of the class of strongest boundaries (category V), where CTCF levels in TSS regions were 453 significantly higher compared to non-TSS regions, suggesting that the strongest boundaries 454 are formed by CTCF-mediated loops at gene promoters. Alternatively to our "ratio" insulation 455 score, we repeated our analysis using the insulation score generated by the "crane" TAD 456 calling algorithm [12]. A comparative analysis with between "ratio" and "crane" is shown in 457 Figure 4d, where it appears that ratio-generated insulation scores better associate with CTCF 458 levels. In the interest of robustness, we performed the same analyses for all preprocessing 459 methods, at both low and high sequencing depth, for both "ratio" and "crane" insulation scores 460 analysis determined that they are significantly more frequently localized within TADs insulated 476 by at least one strong TAD boundary (Figure 4e). Further analysis revealed that, super-477 enhancers are 2.94 times more likely to be insulated by strong boundaries (categories IV or 478 V) in both the upstream and downstream directions, compared to being insulated by weak 479 boundaries (categories I or II) in both directions. A comparison with TAD boundary 480 classification using "crane" insulation scores demonstrated that "ratio" insulation scores are 481 more significantly associated with proximity to super-enhancers (Figure 4f). A similar 482 robustness analysis as the one presented above for CTCF was also performed for super-483 enhancers (Supplementary Figure 8c and Supplementary Figure 9c for "ratio" and "crane" 484 insulation scores respectively). Taken together, our findings suggest that, because of their 485 significance in gene regulation, super-enhancers should only target genes confined in the 486 "correct" TAD or neighborhood, while remaining strongly insulated from genes in adjacent 487 TADs. This is conceivably achieved by the strong TAD boundaries we have identified in this 488 study. 489 490

Strong TAD boundaries are co-duplicated with super-enhancers in cancer patients 491
To further investigate the importance of variable boundary strength, we asked whether TAD increasing boundary strength (Figure 5a). This suggests that strong TAD boundaries are less 499 frequently lost in cancer, as they may "safeguard" functional elements that are necessary for 500 proliferation. By contrast, the frequency of tandem duplications (up to 1Mb) increased with 501 increasing boundary strength (Figure 5b). Both results were robust to various cutoffs on the 502 sizes of the structural variants, within the usual range of TAD sizes (from 250kb to 2.5Mb). 503 Then, to further clarify the connection between super-enhancers, strong TAD boundaries and 504 cancer, we studied tandem duplication events where super-enhancers (obtained from a 505 20 publicly available collection of super-enhancers [36]) are co-duplicated with adjacent strong 506 boundaries. As demonstrated in Figure 5c, super-enhancers are indeed co-duplicated with 507 strong TAD boundaries. Co-duplication of strong boundaries and super-enhancers was 508 statistically significantly more frequent than that of strong boundaries and regular enhancers. 509 This suggests that, in cancer, not only are strong boundaries protected from deletions, but 510 they are also co-duplicated with super-enhancer elements. A robustness analysis similar to 511 the one performed for CTCF and super-enhancers is presented in Supplementary Figure 12  512 demonstrating that our findings are consistent for low and high sequencing depth. Finally, we 513 present an example of a co-duplication of a super-enhancer with a strong boundary in Figure  514 5d: MYC, a well-known oncogene that is typically overexpressed in cancer, is localized next 515 to a strong TAD boundary and is co-duplicated with the boundary as well as with several 516 proximal super-enhancers. 517

DISCUSSION 519
Multiple recent studies have revealed that the metazoan genome is compartmentalized in 520 boundary-demarcated functional units known as topologically associating domains (TADs). 521 TADs are highly conserved across species and cell types. A few studies, however, provide 522 compelling evidence that specific TADs, despite the fact that they are largely invariant, exhibit 523 some plasticity. Given that TAD boundary disruption has been recently linked to aberrant gene 524 activation and multiple disorders including developmental defects and cancer, categorization 525 of boundaries based on their strength and identification of their unique features becomes of 526 particular importance. In this study, we first developed a method based on fused two-527 dimensional lasso in order to improve Hi-C contact matrix reproducibility between biological 528 replicates. Then, we categorized TAD boundaries based on their insulating score. Our analysis 529 demonstrated that: (a) using fused 2D lasso, we can improve the reproducibility of Hi-C contact 530 matrices irrespective of the Hi-C bias correction method used, and (b) using our "ratio" 531 insulation score, we can successfully identify boundaries of variable strength and that strong 532 boundaries exhibit certain expected features, such as elevated CTCF levels. By performing 533 an integrative analysis of boundary strength with super-enhancers in matched samples, we 534 observed that super-enhancers are preferentially insulated by strong boundaries, suggesting 535 that super-enhancers and strong boundaries may represent a biologically relevant entity. 536 Motivated by this observation, we examined the frequency of structural alterations involving 537 strong boundaries and super-enhancers. We found that not only strong boundaries are 538 "protected" from deletions, but, more importantly, they are co-duplicated together with super- unprocessed Hi-C contact matrices. We then generate processed Hi-C matrices using ICE 574 "correction", our "scaling" approach and calCB. Fused two-dimensional lasso is applied on the 575 processed Hi-C matrices. Matrix reproducibility between biological replicates is assessed 576 across samples for a variety of parameters using stratum-adjusted correlation coefficients [30].

depth). Comparison of Hi-C contact matrices between biological replicates generated from 643
Hi-C library using the same or different restriction enzymes; Hi-C matrices were estimated 644 using four methods (naïve filtering, iterative correction, calCB and simple scaling); assessment 645 was performed using stratum-adjusted correlation coefficient at resolutions ranging from 646 100kb to 20kb and maximum distances of 2.5Mb, 5Mb, 7.5Mb and 10Mb between interacting Hi-C library using the same or different restriction enzymes; Hi-C matrices were estimated 652 using four methods (naïve filtering, iterative correction, calCB and simple scaling); assessment 653 was performed using stratum-adjusted correlation coefficient at resolutions ranging from 654 100kb to 20kb and maximum distances of 2.5Mb, 5Mb, 7.5Mb and 10Mb between interacting