Abstract
Here we present HiCDC, a principled method to estimate the statistical significance (P values) of chromatin interactions from HiC experiments. HiCDC uses hurdle negative binomial regression to account for systematic sources of variation in HiC read counts—for example, distance-dependent random polymer ligation and GC content and mappability bias—and to model zero inflation and overdispersion. Applied to high-resolution HiC data in a lymphoblastoid cell line, HiCDC detects significant interactions at the sub-topologically associating domain level, identifying potential structural and regulatory interactions supported by CTCF binding sites, DNase accessibility, and/or active histone marks. CTCF-associated interactions are most strongly enriched in the middle genomic distance range (∼700 kb–1.5 Mb), while interactions involving actively marked DNase accessible elements are enriched both at short (<500 kb) and longer (>1.5 Mb) genomic distances. There is a striking enrichment of longer-range interactions connecting replication-dependent histone genes on chromosome 6, potentially representing the chromatin architecture at the histone locus body.
Introduction
HiC is a genomewide chromosome conformation capture (3C) technology that uses restriction enzyme digestion of DNA followed by proximity ligation and pairedend sequencing^{1}. A large number of paired end reads connecting two genomic regions is interpreted as evidence of an interaction over the population of cells. HiC data is typically summarized as a contact matrix relative to a fixed partition of the genome into intervals—at the finest resolution defined by the restriction fragments themselves but more often as longer genomic regions. Each entry or ‘interaction bin’ in the contact matrix contains the count of unique paired end reads mapping to the corresponding pair of genomic intervals. Not all nonzero counts are significant, and the challenge is to identify interactions supported by more reads than expected by chance. Many factors confound the statistical analysis of HiC data. First, random polymer ligation between restriction fragments, which decays as a function of linear genomic distance, produces a background distribution of paired end read counts that likely accounts for a large fraction of HiC reads^{2,3,4}. Other systematic sources of bias include GC content and mappability of short reads^{5,6}. If uniform genomic bins are used to define the contact map, the number of restriction enzyme sites within each bin is another source of bias.
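As a minimal sketch of how a contact matrix can be tallied from mapped read pairs, the following Python fragment assigns each read end to a genomic interval and counts pairs per interaction bin. The function name `build_contact_counts`, the single-chromosome input and the assumption that reads are already deduplicated are illustrative choices, not part of any published pipeline.

```python
from bisect import bisect_right
from collections import Counter

def build_contact_counts(read_pairs, bin_starts):
    """Count read pairs per interaction bin (i, j), stored with i <= j.

    read_pairs: iterable of (pos1, pos2) coordinates on one chromosome,
                assumed already filtered to unique, non-clonal pairs.
    bin_starts: sorted start coordinates of the genomic intervals.
    """
    counts = Counter()
    for pos1, pos2 in read_pairs:
        # Index of the interval whose start is the last one <= position.
        i = bisect_right(bin_starts, pos1) - 1
        j = bisect_right(bin_starts, pos2) - 1
        if i < 0 or j < 0:
            continue  # read end lies before the first interval
        counts[(min(i, j), max(i, j))] += 1
    return counts

# Toy example with four uniform 5 kb intervals.
bin_starts = [0, 5000, 10000, 15000]
pairs = [(100, 5200), (5300, 400), (12000, 12500), (16000, 200)]
contacts = build_contact_counts(pairs, bin_starts)
```

Bins absent from the counter have an implicit count of zero, which is the sparse analogue of the zero entries in the full matrix.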
The original study introducing HiC generated a coarse resolution contact matrix (1 Mb bins) to characterize biophysical rules of polymer folding^{1}, while later studies with improved resolution (10–50 kb bins) reported features of chromatin organization such as topologically associating domains (TADs)—regions that favour internal (withinTAD) contacts over external contacts^{3,7}. However, the statistical significance of individual interactions, either within TADs or between more distal loci, was not addressed in these studies. Several normalization schemes have been proposed to correct for GC content and other sources of HiC read count bias, including a nonparametric probabilistic approach due to Yaffe and Tanay^{6} and iterative correction and eigenvalue decomposition (ICE), which approximates this method^{5}. More recently, HiCNorm^{8} was introduced to learn these biases statistically with Poisson regression, using GC content and other features as covariates in a generalized linear model (GLM) for interaction bin counts. These normalization approaches are typically used to correct the contact matrix—for example, by rescaling the observed counts in ICE or by replacing the count with the residual from the HiCNorm regression model—to increase reproducibility between experiments for downstream analyses.
Recently several groups have proposed statistical models to assess the significance of interactions—that is, to assign P values to the observed counts in individual interaction bins—for HiC and other 3Cbased technologies^{9,10}. FitHiC^{9} uses a binomial null model Bin(P(d)), where P(d) is the probability that a randomly chosen paired end read occurs between a given locus pair at distance d. The probability P(d) is estimated from the data using a spline fitting process. To account for other sources of bias, FitHiC uses ICE to adjust the contact probabilities P(d). Other recent approaches use more elaborate strategies to call ‘peaks’ in HiC data. HiCCUPS, developed to detect subTAD chromatin interactions in a recent highresolution HiC study by Rao et al.^{11}, compares each ICEnormalized interaction bin count in the contact matrix to the normalized counts of local neighborhoods; a peak is called only if the bin count is significant relative to all local comparisons and satisfies further filtering criteria. In another approach, Xu et al.^{12} used a hidden Markov random field to explicitly model the spatial dependence of bin counts in the contact matrix, where the binlevel statistical model is a mixture of negative binomial distributions representing ‘peak’ and background states, and the expected counts for the background state are estimated by ICE or FitHiC.
Here we present an integrated model for detecting significant HiC interactions that systematically accounts for the dependence of interaction bin read counts on sources of bias like GC content and mappability, as in HiCNorm, as well as the dependence of random polymer ligation on genomic distance, as in FitHiC. Additionally, we explicitly model the zeroinflation and overdispersion of counts in the contact matrix by using a GLM approach based on hurdle regression^{13}. By learning a null model that incorporates all these statistical properties of HiC contact matrix counts, our estimates of significance (P values) reduce inflation in order to better identify direct interactions between regulatory or structural elements rather than nearby noninteracting loci. Our model can be estimated from a sampling of the data rather than working with the entire contact matrix. We focus our analysis on the Rao et al.^{11} highresolution in situ HiC data set in the GM12878 lymphoblastoid cell line and show that our method can identify significant interactions at the subTAD level, including DNA loops associated with CTCF and/or cohesin binding sites and enhancerpromoter interactions, as well as longer range (1.5–2 Mb) promoter–promoter loops. In particular, we identify a network of longerrange gene–gene interactions connecting the histone genes on human chromosome 6, potentially representing a specialized chromatin architecture at histone locus bodies. An implementation of our method, called HiCDC (for ‘HiC direct caller’), is available as an open source R package at https://bitbucket.org/leslielab/hicdc (see also Supplementary Software 1).
Results
Hurdle negative binomial regression models HiC read biases
We estimate a null or background model for HiC contact matrix counts using zero-truncated negative binomial (ZTNB) regression, also called hurdle regression^{13}. By correctly modelling the zero inflation and overdispersion in the background model, we avoid inflating the significance of interaction bin read counts and reduce false positives. We assume that many of the contact matrix counts are well explained by the null model, which we use to estimate significance (P values) of interaction bins with unusually high counts. For each interaction bin (i,j), we take y_{ij} to be the count of unique paired end reads joining intervals I_{i} and I_{j} and mapping within 500 bp of a restriction enzyme site on each side. We define a set of covariates X_{ij} of the interaction bin, including the genomic distance d_{ij} between intervals I_{i} and I_{j}, and bias features like GC content and mappability of the effective sequence space for the pair of intervals^{8}, that is, the region within 500 bp of a restriction enzyme site (see Methods section).
The background model generates interaction bin read counts according to a two-step generative process: a Bernoulli distribution governs whether the count will be 0 (with a probability that depends on the covariates) or positive; if positive, then the read counts follow a negative binomial distribution with mean μ_{ij} and dispersion α, where log μ_{ij} is fit as a linear combination of B-spline functions (for the dependence on genomic distance) and bias-related covariates X_{ij} (see Methods section). One rationale for the two-step generative process is that HiC libraries may not be complex enough to truly sample all random ligation events (and real interactions) that occur in the population of cells; rather, with some probability that depends on the covariates, ligated restriction fragments representing an interaction bin are not captured in the library, giving a zero count, similar to the ‘dropout’ of lower expressed genes in single-cell RNA-seq^{14}.
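The two-step process can be made concrete with a short Python sketch of a hurdle negative binomial probability mass function and an upper-tail probability under the zero-truncated count component. The parameterization (variance μ + αμ², so size r = 1/α), the symbol `pi0` for the zero probability and the choice of P(Y ≥ y | Y > 0) as the tail statistic are assumptions made for illustration; the Methods section defines the actual test.

```python
import math

def nb_log_pmf(k, mu, alpha):
    """Log pmf of a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # size parameter; variance is mu + alpha * mu**2
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def hurdle_pmf(k, pi0, mu, alpha):
    """P(Y = k): zero with probability pi0, otherwise zero-truncated NB."""
    if k == 0:
        return pi0
    p_zero_nb = math.exp(nb_log_pmf(0, mu, alpha))
    return (1.0 - pi0) * math.exp(nb_log_pmf(k, mu, alpha)) / (1.0 - p_zero_nb)

def ztnb_upper_tail(y, mu, alpha):
    """P(Y >= y | Y > 0) for an observed positive count y >= 1."""
    if y < 1:
        raise ValueError("the truncated distribution is defined for y >= 1")
    p_zero = math.exp(nb_log_pmf(0, mu, alpha))
    below = sum(math.exp(nb_log_pmf(k, mu, alpha)) for k in range(1, y))
    # Direct subtraction is adequate for a sketch; far tails would need
    # a numerically stable survival function.
    return max(0.0, (1.0 - p_zero - below) / (1.0 - p_zero))

tail_small = ztnb_upper_tail(2, mu=3.0, alpha=0.5)
tail_large = ztnb_upper_tail(15, mu=3.0, alpha=0.5)
```

In a regression setting μ (and the zero probability) would vary per interaction bin with the covariates, while α is shared.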
For most analyses, we partitioned the genome into restriction fragments, concatenated 10 adjacent fragments, and used these genomic intervals to produce the HiC contact matrix. For highresolution data from Rao et al.^{11}, this produced bins with median length ∼4 kb. For the medium resolution HiC data set for the human fibroblast cell line IMR90 from Dixon et al.^{7}, using a different restriction enzyme, the same procedure produced bins of median length ∼32 kb. An alternative procedure is to use uniform genomic intervals for the contact matrix and include the effective sequence space for pairs of intervals as an additional covariate^{8}. We ran all models on interactions up to a genomic distance of 2 Mb to focus on subTAD structure.
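The fragment-based binning described above can be sketched as follows. The function name `bins_from_cut_sites` and its inputs are hypothetical, and cut sites are assumed to lie strictly inside the chromosome; the toy call groups two fragments per bin only to keep the example small.

```python
def bins_from_cut_sites(cut_sites, chrom_length, fragments_per_bin=10):
    """Concatenate adjacent restriction fragments into genomic intervals.

    cut_sites: restriction enzyme cut positions, strictly between 0 and
               chrom_length (an assumption of this sketch).
    Returns a list of (start, end) intervals covering the chromosome.
    """
    boundaries = [0] + sorted(cut_sites) + [chrom_length]
    fragments = list(zip(boundaries[:-1], boundaries[1:]))
    bins = []
    for first in range(0, len(fragments), fragments_per_bin):
        group = fragments[first:first + fragments_per_bin]
        # The interval spans from the first fragment's start to the
        # last fragment's end; the final group may hold fewer fragments.
        bins.append((group[0][0], group[-1][1]))
    return bins

toy_bins = bins_from_cut_sites([100, 250, 400, 900, 1300], 2000,
                               fragments_per_bin=2)
```

Because fragment lengths vary, the resulting intervals are non-uniform, which is why median bin length depends on the restriction enzyme.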
We first analysed the high-resolution Rao GM12878 data set. Following FitHiC, we used two iterations of training to fit our null model. First, for each chromosome, we trained a ZTNB model using hurdle regression on a sample of 1% of all interaction bins I_{i} × I_{j} and confirmed that the dependence of the expected bin counts on d_{ij} as estimated by the hurdle model indeed fit the empirical relationship (Fig. 1a for Chr 1; see Supplementary Fig. 1 for other chromosomes). We then identified interaction bins with empirical counts y_{ij} in the P<0.025 tail according to the model, removed these bins from the training set as possible true interactions, and retrained the hurdle regression to obtain final parameters for the null model (Fig. 1a). We then used a quantile–quantile (Q–Q) plot to compare the P values produced by the ZTNB model to those drawn from the uniform distribution, which would be expected if all interactions came from the null model. Similarly, we plotted P values produced by applying (i) negative binomial regression, (ii) Poisson regression and (iii) Poisson hurdle regression to the same training data, again using two iterations of training. The Q–Q plots suggest that all alternative models inflated the significance of estimated P values (Fig. 1b for Chr 1; see Supplementary Fig. 2 for other chromosomes) and hence that modelling both the zero inflation and overdispersion is important for reducing false positive interactions.
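The two-iteration training scheme (fit, drop bins in the P<0.025 tail, refit) can be illustrated with a deliberately simplified stand-in. The sketch below swaps the hurdle regression for a single Poisson mean with no covariates, purely to show the control flow; the function names are hypothetical and the toy model is not the one used in the paper.

```python
import math

def poisson_upper_tail(y, lam):
    """P(Y >= y) for Y ~ Poisson(lam)."""
    below = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                for k in range(y))
    return max(0.0, 1.0 - below)

def fit_mean(counts):
    """Stand-in 'model': the maximum-likelihood Poisson mean."""
    return sum(counts) / len(counts)

def two_pass_fit(counts, tail=0.025):
    """Fit, remove counts in the upper tail as possible true interactions,
    then refit on the remaining counts."""
    first_pass = fit_mean(counts)
    kept = [y for y in counts if poisson_upper_tail(y, first_pass) >= tail]
    return fit_mean(kept), kept

counts = [2, 3, 1, 2, 4, 3, 2, 40]
final_mean, kept = two_pass_fit(counts)
```

Here the outlying count of 40 pulls the first-pass mean upward; removing it before the second pass gives a background estimate closer to the bulk of the data, which is the motivation for the second iteration.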
We next ran both our model and the FitHiC algorithm on the same GM12878 data, using ICE normalization to adjust FitHiC P values, as previously described; for both methods, we used uniform 5 kb bins and did not filter mappable reads for sequencing quality in order to get deeper coverage (corresponding to the ‘MAPQG0’ contact matrix from Rao et al.^{11}). We removed diagonal (d=0) interaction bins from HiCDC since FitHiC filters out these bins. Comparison by Q–Q plots suggests that FitHiC produces P values with inflated significance compared to HiCDC (Fig. 1c, Chr 1; see Supplementary Fig. 3 for other chromosomes). This apparent P value inflation of FitHiC relative to HiCDC was more pronounced for the medium resolution IMR90 Dixon et al.^{7} data set (Supplementary Fig. 4). Genome-wide on the Rao^{11} GM12878 data set, FitHiC used with ICE reported ∼793 K interactions, while our approach reported ∼321 K interactions, at a 1% false discovery rate (FDR) based on the Benjamini–Hochberg procedure. While there was some correlation between –log_{10} P values generated by the two methods within the same distance ranges, 74% of the interactions reported as significant based on the FitHiC binomial model would not be rejected by our null model (Fig. 1d, Supplementary Fig. 5, Supplementary Fig. 6). Using FitHiC without ICE normalization leads to a dramatic increase in the reported number of significant interactions, while filtering reads for quality (using the ‘MAPQGE30’ contact map) somewhat reduces FitHiC’s interaction calls pre- or post-ICE; however, all versions of FitHiC call more non-diagonal interactions than HiCDC at the same significance level (Supplementary Data 1, Supplementary Fig. 7).
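For readers unfamiliar with the FDR step, the Benjamini–Hochberg adjustment can be written in a few lines of Python. This is the standard step-up procedure, not code taken from the HiCDC package, and the example P values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
toy_adjusted = benjamini_hochberg(toy_pvalues)
significant = [q < 0.01 for q in toy_adjusted]  # 1% FDR threshold
```

Interaction bins whose adjusted value falls below 0.01 would be reported at the 1% FDR level used in the comparison above.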
We caution that there is no gold standard of true interactions against which false positive rates can be estimated, and Q–Q plot analysis presumes that most entries in the contact matrix are generated by the null distribution. However, since the binomial distribution used by FitHiC does not model overdispersion in read count data, similar to the Poisson regression variant of HiCDC (Fig. 1b, Supplementary Fig. 2), we hypothesize that this feature of FitHiC may lead to P value inflation.
To ask whether fine resolution interactions would be detectable at lower sequencing depth, we performed a downsampling analysis using 75, 50 and 25% of the reads (see Methods section). Using P values based on 100% of the data (∼129 million paired-end reads) to define true interactions (FDR<5%) and non-interactions (FDR>10%), we asked if P values estimated from downsampled reads could distinguish interactions from non-interactions. Precision-recall curves for this task (Fig. 1e) suggest that true interaction detection strongly degrades at the 25% sampling rate; at least 50% of the reads were required for reasonable (auPR of 84%) recovery of interactions at ∼4 kb resolution.
We also confirmed the stability of the model relative to different 1% samples of training data. We trained HiCDC on a fixed 1% sample of interaction bins (using uniform 5 kb bins) and defined the significant interactions at 1% FDR as our ‘true interactions’. Then we retrained on 100 different 1% samples and computed, for each ‘true interaction’, the fraction of the models that detect the interaction (Supplementary Fig. 8). We found that 95% of ‘true interactions’ were detected by at least 90% of the models, and 90% of the interactions were detected by 100% of the models, showing stability of the training procedure.
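The stability summary reported above can be computed as sketched below, given the reference set of significant bins and the sets called by each retrained model. The function names and the toy call sets are hypothetical.

```python
def detection_fractions(reference_calls, resampled_calls):
    """For each reference interaction, the fraction of retrained models
    that also call it significant."""
    n_models = len(resampled_calls)
    return {
        interaction: sum(interaction in calls for calls in resampled_calls)
        / n_models
        for interaction in reference_calls
    }

def share_detected_by(fractions, min_fraction):
    """Share of reference interactions detected by at least
    min_fraction of the models."""
    hits = sum(f >= min_fraction for f in fractions.values())
    return hits / len(fractions)

reference = {(1, 5), (2, 9), (3, 4)}
resampled = [
    {(1, 5), (2, 9)},
    {(1, 5), (2, 9), (3, 4)},
    {(1, 5)},
    {(1, 5), (2, 9)},
]
fractions = detection_fractions(reference, resampled)
share = share_detected_by(fractions, 0.75)
```

With 100 retrained models, `share_detected_by(fractions, 0.9)` and `share_detected_by(fractions, 1.0)` would correspond to the two percentages quoted in the text.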
HiCDC event calls are reproducible across experiments
We also ran HiCDC on additional data sets to examine the reproducibility of its event calls across biological replicates and different versions of the HiC protocol, for example, the use of different restriction enzymes, as well as different versions of the model, for example, uniform bins versus nonuniform bins. First, we considered the primary replicate and the largest secondary replicate HiC experiment in GM12878 from Rao et al.^{11}. Since the secondary replicate has only slightly less coverage (∼1.3 × 10^{9} aligned and filtered reads versus ∼1.5 × 10^{9} for the primary replicate), we used 5 kb uniform bins for each replicate and found good genome-wide concordance between P values (Supplementary Fig. 9, Ρ=0.52). We also ran HiCDC on two lower resolution HiC data sets^{7} generated with different restriction enzymes, HindIII and NcoI, in mouse ES cells, using fixed bins at 50 kb resolution, and again found fair correlation between P values (Supplementary Fig. 10, Ρ=0.66).
Although HiCDC was not designed primarily as a HiC normalization procedure like Yaffe and Tanay’s method^{6} or ICE^{5}, it can be used to normalize the contact matrix by dividing the observed bin count by the expected bin count estimated from the model (O_{ij}/E_{ij}). Following Yaffe and Tanay, we computed a heatmap of normalized counts (O_{ij}/E_{ij}) as a function of GC content in bin i and bin j for mouse ES HiC data generated with the HindIII and NcoI enzymes. Similar to previous findings^{6}, we observed preferential HiC contact patterns associated with low and high regional GC content, but these patterns were largely consistent between the two restriction enzymes (Supplementary Fig. 11).
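The observed-over-expected normalization amounts to an element-wise ratio, sketched here for sparse dictionaries keyed by interaction bin. The function name and the toy expected values are illustrative; in practice E_{ij} would come from the fitted model.

```python
def observed_over_expected(observed, expected):
    """Normalize counts as O_ij / E_ij for bins with a positive
    expected count; other bins are skipped."""
    return {
        bin_id: observed[bin_id] / expected[bin_id]
        for bin_id in observed
        if expected.get(bin_id, 0.0) > 0.0
    }

observed = {(0, 1): 12, (0, 2): 3, (1, 2): 8}
expected = {(0, 1): 6.0, (0, 2): 3.0, (1, 2): 0.0}
normalized = observed_over_expected(observed, expected)
```

Values above 1 indicate more contacts than the background model predicts for that bin's distance and bias covariates.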
Yaffe and Tanay also found a restriction fragment length bias in early HiC data sets^{6}, presumably due to differences in ligation efficiency. Analysing data from chromosome 1 of the Rao et al.^{11} GM12878 data set with nonuniform binning, we found that observed counts for the 10% shortest bins are marginally higher than those of the 10% longest bins as a function of genomic distance, but this variation is small compared to the difference between the top and bottom 10% percentiles as estimated by the model (Supplementary Fig. 12). Therefore, the variation due to bin size is small compared to that of the modelled covariates. We further confirmed good concordance between P values for the uniform (5 kb) and nonuniform (10RE) bin models (Supplementary Fig. 13, Ρ=0.71).
Finally, to provide a measure of experimental validation, we compared HiCDC predictions (5 kb fixed bin model) for the Rao et al.^{11} GM12878 data set against previously performed 3DFISH validation experiments that confirmed four interactions between 30 kb intervals (L1, L2) along with negative controls (L2, L3). We took all contacts predicted by HiCDC with endpoints overlapping (L1, L2) and (L2, L3) and examined both the maximal adjusted P value and top 10 P values among the contacts anchored in (L1, L2) and (L2, L3). HiCDC correctly identified all of the significant events and did not assign significance to any of the contacts overlapping the control interaction bin (L2, L3) (Supplementary Table 1).
HiCDC enables detection of subTAD interactions
To determine whether HiCDC could reveal interactions involving individual regulatory or structural elements, we analysed the significant chromatin interactions called on Rao et al.^{11} HiC data together with other epigenomic data sets for the B lymphoblastoid cell line GM12878 and chronic myelogenous leukemia cell line K562.
We first visualized the reported subTAD structure and our HiCDC interactions at important B cell genes, together with DNase-sequencing (DNase-seq), RNA-sequencing (RNA-seq) and H3K27ac chromatin immunoprecipitation-sequencing (ChIP-seq) data, along with ChIP-seq data for CTCF and cohesin subunits SMC3 and Rad21, all generated by ENCODE in GM12878. For example, Fig. 2a shows the raw HiC interaction count matrix for a ∼700 kb region encompassing the BCL2 locus, with previously reported subTADs^{11} drawn as blue boxes; Fig. 2b shows the interaction matrix with –log_{10} P values from HiCDC in place of the raw counts. Consistent with the ‘corner detection’ strategy used by HiCCUPS^{11}, we see that three nested subTADs (∼60.675 Mb–61 Mb) are indeed supported by significant ‘corner’ interactions—that is, interactions linking subTAD boundaries—while non-corner bins inside the subTAD have low significance. However, while a faint corner is detectable for the smaller rightmost subTAD (∼61 Mb–61.130 Mb), it is not supported by a significant HiCDC interaction. Genome-wide, ‘corner’ interactions of subTADs reported by HiCCUPS were strongly enriched for significant interactions (P<10^{−16}, KS test of –log_{10} P distribution of corners of true vs. randomized subTADs; Supplementary Fig. 14), and HiCDC P values accurately discriminated HiCCUPS interactions from a genomic-distance matched negative set of random interactions (Supplementary Fig. 15). By contrast, FitHiC tended to assign significant interactions to many loci within the subTAD, suggesting that subTAD ‘corner’ interactions—for example, interactions of pairs of CTCF sites—are not well distinguished from nearby pairs of loci that may not represent direct interactions (Supplementary Fig. 16).
HiCCUPS itself reports only a small number of interactions genomewide, concentrated at genomic distances under 500 kb (∼8K), so that while the majority of HiCCUPS predictions overlap with significant HiCDC interactions computed at the same resolution (10 kb), HiCDC also successfully operates at a longer genomic distance range complementary to HiCCUPS (Supplementary Data 1, Supplementary Fig. 17, Supplementary Fig. 18).
To mechanistically interpret significant HiCDC events, we examined other epigenomic signal tracks and the RefSeq genome annotation track alongside a 1D ‘hotspot’ summarization of HiCDC analysis (Fig. 2c). The hotspot track shows, for each genomic interval, the maximum –log_{10} P value over interactions involving the interval; we restrict to interactions within the submatrix in Fig. 2b at FDR < 1% and display only interactions with distance >50 kb for improved clarity. The arc plot representation in Fig. 2d shows these significant interactions as arcs joining genomic loci, where arc height represents statistical significance. The parallel epigenetic signal tracks show at least four distinct HiC hotspot regions: (i) a promoter proximal region that encompasses an intronic enhancer in the BCL2 locus, the gene’s DNaseaccessible promoter that coincides with two CTCF peaks, and another CTCF peak upstream of the promoter (∼61 Mb); (ii) a broad 3′end hotspot region that includes an H3K27acmarked intronic enhancer and multiple CTCF peaks (∼60.8 Mb); (iii) a downstream hotspot that is cooccupied by CTCF and cohesin subunits SMC3 and Rad21 (∼60.7 Mb); and (iv) an upstream hotspot with a strong CTCF peak as well as signal for SMC3 and Rad21. The arc plot (Fig. 2d) shows that the BCL2 promoter hotspot has numerous significant contacts with both the 3′end hotspot and the downstream hotspot, while the upstream hotspot also has a few significant contacts with the 3′end hotspot and downstream hotspot. We do not see strong contacts between the 3′end and downstream hotspots, even though both interact with the promoter. Consistent with previous observations^{11}, the interacting hotspots contained CTCF motifs with divergent orientation, while the noninteracting hotspots had motifs with the same orientation. We also examined the same locus in K562, using the Rao et al.^{11} data set for this cell line (Supplementary Fig. 19). 
As expected, the B cell gene BCL2 is not highly expressed in K562, and HiC interactions between the previously described hotspots are assigned low significance by our model. In particular, the interaction between the intronic enhancer and promoter of BCL2 is lost in K562, and the enhancer there displays minimal DNase accessibility and H3K27ac.
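The 1D ‘hotspot’ summarization used for the BCL2 locus (maximum –log_{10} P per interval over significant interactions with distance >50 kb) can be sketched as follows. The tuple layout, the uniform `bin_size` and the function name are assumptions for illustration only.

```python
import math

def hotspot_track(interactions, bin_size, fdr_cutoff=0.01,
                  min_distance=50_000):
    """Maximum -log10 P per interval over retained interactions.

    interactions: iterable of (i, j, pvalue, qvalue), where i and j are
                  interval indices and qvalue is the FDR-adjusted P value.
    bin_size:     interval width in bp (uniform bins assumed here).
    """
    track = {}
    for i, j, pvalue, qvalue in interactions:
        if qvalue >= fdr_cutoff:
            continue  # not significant at the chosen FDR
        if abs(j - i) * bin_size <= min_distance:
            continue  # too short-range to display
        score = -math.log10(pvalue)
        for interval in (i, j):
            track[interval] = max(track.get(interval, 0.0), score)
    return track

toy_interactions = [
    (0, 20, 1e-8, 1e-5),
    (0, 30, 1e-12, 1e-9),
    (20, 30, 1e-3, 0.2),    # fails the FDR cutoff
    (5, 8, 1e-10, 1e-7),    # only 15 kb apart
]
track = hotspot_track(toy_interactions, bin_size=5000)
```

The same retained interactions, drawn as arcs with height proportional to the score, would give the arc plot representation.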
We observed similar CTCF and cohesinmediated DNA looping insulating other important B cell genes. For example, the IKZF1 locus is flanked by several upstream hotspots and several 3′ end/downstream hotspots, all cooccupied by CTCF and cohesin to various degrees (Supplementary Fig. 20). Upstream and 3′end/downstream elements interact with each other in pairwise fashion, insulating IKZF1 from nearby genes. We also looked at the human βglobin locus to see if we observed interactions between the classic locus control region (LCR) and βglobin gene cluster^{15} in the erythroleukemic cell line K562 (Supplementary Fig. 21). We indeed observed significant shortrange interactions between the DNase accessible enhancers that make up the LCR, as well as shortrange interactions involving the HBB and HBD loci. Furthermore, there is an interaction between the two CTCF sites flanking the region encompassing both the LCR and the gene cluster, albeit one of modest significance. Notably, these significant interactions are not observed at the locus in GM12878 (Supplementary Fig. 22), where the βglobin locus genes are not expressed.
Distinct interaction types occur at different length scales
HiCDC analysis of the GM12878 data set provided sufficient resolution to annotate significant interactions by the epigenetic and genomic features of the interacting loci. We first asked whether significant interactions with specific epigenetic signals in GM12878—CTCF binding, DNase accessibility, and active histone mark H3K27ac—were enriched at different length scales. We considered significant interactions (FDR < 1%) and determined, for each 10 kb distance band, enrichment of interactions with a specific annotation relative to background prevalence of the annotation (Methods section). We then compared the enrichment P values for this annotation over 100 kb distance subranges to all other annotations (Methods section; see Supplementary Fig. 23 for absolute enrichment P values). Relative to other annotations, CTCFmediated interactions without other epigenetic signals (that is, CTCFCTCF and CTCFunannotated interactions) were most strongly enriched in the middle distance range of 700 kb–1.5 Mb (Fig. 3a), suggesting a preferred distance range for CTCFassociated structural loops (see also Supplementary Fig. 23). This length scale is consistent with the reported length distribution of TADs (median length 880 kb), whose boundaries are enriched for CTCF binding sites. By contrast, interactions involving active regulatory elements, defined as DNase accessible and H3K27ac marked loci, were most enriched both at shorter distance ranges of <500 kb and, more surprisingly, at longer ranges of >1.5 Mb.
We then repeated the enrichment analysis by considering the genomic annotation—promoter, gene body, or distal intergenic—this time considering 50 kb distance subranges (Methods section; see Supplementary Fig. 24 for absolute enrichment P values). Here we found that, relative to other annotations, significant promotergene body interactions were most strongly enriched at the shorter distance range of <500 kb, consistent with intronic regulatory enhancers interacting with promoters (Fig. 3b). This analysis also revealed a strong enrichment of significant promoter–promoter interactions at longer distances of 1.5–2 Mb (Fig. 3b).
Genome annotation tools such as Segway^{16} and ChromHMM^{17} provide an automated way of assigning chromatin states to genomic intervals based on epigenomic data. We downloaded the sevenstate combined Segway+ChromHMM segmentation for GM12878 from the UCSC Genome Browser and performed a similar enrichment analysis relative to these states, reporting absolute enrichments as a function of genomic distance rather than relative enrichments (Supplementary Figs 25 and 26). This analysis recapitulated some of our major findings, such as the enrichment of distal promoter–promoter interactions. We note that only 4% of genomic intervals previously annotated as CTCF binding sites based on ChIPseq data were assigned a ‘CTCF’ state in this analysis—possibly because the CTCF states correspond to short segments and states with longer segments (for example, ‘Repressed’) dominate the annotations. Enrichments for CTCFassociated interactions were therefore less prominent in this analysis.
Previous megabase resolution HiC analyses described two chromatin compartments, called A/B compartments, defined by examining the sign of the first principal component of the contact matrix (or the correlation matrix generated from rows of the contact matrix). The A compartment is associated with open chromatin—that is, enriched for DNase hypersensitive sites and active histone marks—and the B compartment with closed chromatin. We binned the GM12878 HiC data at 100 kb resolution, determined A/B compartments as previously described (Methods section), and split significant interactions into three groups: within the A compartment, within the B compartment, and between A and B compartments. These groups displayed different patterns of enrichments for epigenomic signals and genomic annotations (Supplementary Fig. 27). Significant interactions within compartment A displayed greater enrichment for DNase-DNase and H3K27ac-H3K27ac marks at longer range genomic distances (>1.5 Mb) compared to those within compartment B (Supplementary Fig. 28). Longer-range promoter–promoter interactions were also uniquely enriched within compartment A. Significant interactions between compartment A and compartment B displayed little enrichment for epigenomic signals or genome annotations at any distance range (Supplementary Fig. 27).
Similarly, we segregated HiCDC predicted interactions into interTAD and intraTAD contacts based on previous TAD annotations^{7} and found different patterns of enrichments for epigenetic signals and genomic annotations. For example, significant interTAD interactions showed greater enrichment for DNaseDNase and K27acK27ac marks at longer range genomic distances (>1.5 Mb) compared to significant intraTAD interactions (Supplementary Fig. 29). Longerrange promoter–promoter interactions were uniquely enriched within interTAD regions (Supplementary Fig. 30).
HiCDC identifies longrange histone gene interactions
To investigate the longrange promoter–promoter interactions from our enrichment analysis, we examined the distribution of these interactions by chromosome. Strikingly, we found that significant 1.5–2 Mb promoter–promoter interactions were highly concentrated on chromosome 6, both in absolute number (Supplementary Fig. 31) and when normalized by chromosome length (Fig. 4a). Of the 90 genes on chromosome 6 involved in these interactions, over 40% (37/90) of them were replicationdependent histone genes, all but one of them joined by a dense connected interaction network; other genes participated in small clusters of interactions of 8 or fewer genes (Fig. 4b, Supplementary Fig. 32). Interactions between histone promoters were also recently reported in an analysis of capture HiC data^{18}. A visualization of the significant promoter–promoter interactions in the 26–28 Mb region of chromosome 6 shows three subclusters of histone genes that appear to interact with each other in longrange chromosomal looping (Fig. 4c). Histone genes are short (median length ∼410 bp), and multiple histone genes can fall into a single genomic interval in our analysis. Globally, longrange ‘promoterpromoter’ interactions tend to arise from genedense regions and involve genes with short length (Supplementary Fig. 33) and therefore are more accurately described as ‘genegene’ interactions.
Expression of replicationdependent histone genes is restricted to S phase, when massive production of histone proteins is required to package newly replicated DNA^{15}. Histone genes also require specialized mRNA processing, since they are not polyadenylated but rather contain a 3′ end hairpin structure. Factors required for hairpin recognition and 3′ cleavage of histone genes, including hairpin binding protein (HBP, also called SLBP) and the U7snRNP, are found in high concentrations in nuclear bodies called histone locus bodies^{19}, which have been most extensively studied in Drosophila. It is possible that the network of histone interactions we observe is the result of chromosome looping in S phase to bring all histone genes on chromosome 6 into close proximity to facilitate mRNA processing at the histone locus body. A recent single cell study in GM12878 in fact estimated that 40% of cells are in S phase in cell culture conditions^{20}.
Discussion
We have presented a principled statistical approach for detecting significant chromatin interactions from HiC read count data. We account for systematic sources of bias, including distancedependent random polymer ligation as well as GC content and mappability, in a single model rather than using an ad hoc normalization scheme prior to analysis. HiCDC allows us to detect significant subTAD interactions, interpret multiple interacting HiC hotspots at developmentally important genes, and identify regions with specialized chromatin organization, such as the network of longrange interactions between histone genes on chromosome 6.
Our model assumes that most interaction bin counts are well explained by the null distribution. Q–Q plot analysis suggests that methods that fail to model overdispersion of read count data—including Poisson regression variants of HiCDC as well as FitHiC—may be prone to inflation of P values. However, in the absence of either a ‘gold standard’ of true interactions or an experimental procedure to empirically produce a HiC null distribution, we cannot make definitive claims about accuracy. There are several directions for improvement of the HiCDC null distribution, which currently uses a fixed dispersion estimate per chromosome and does not account for hierarchical structure in the data (non-independence between bin counts). Training HiCDC on 25 kb slices of the genomic distance distribution suggests that the dispersion parameter does vary with distance (Supplementary Figs 34 and 35) and is close to 0 for interaction bins of >1 Mb. A future extension could develop an appropriate parametric form for α(d) within the GLM framework. Moreover, two loci separated by genomic distance d that lie inside a TAD likely are closer in 3D and generate higher read counts than loci at distance d in adjacent TADs; however, all interaction bins at distance d are treated equivalently by the model. While this assumption promotes detection of the most significant looping interactions that support the subTAD structure, rather than other pairs of loci within the loops that are brought closer by these interactions, HiCDC sensitivity might be improved by modelling the hierarchical structure among interaction bins. A further extension would be a simplified version of HiCDC for interchromosomal interactions: for each pair of chromosomes, the matrix of bin counts between pairs of intervals would be output values in the regression model, with the bias covariates as before but no genomic distance dependence.
HiC-DC incorporates covariates such as GC content and mappability directly in the regression model rather than trying to first rebalance the count matrix using an approach like ICE. Our method is both more direct and more scalable: when applied to high-resolution Hi-C experiments with large interaction matrices, matrix rebalancing algorithms can become numerically unstable and lead to unpredictable results. Indeed, to apply ICE to the large interaction matrices in their study, Rao et al.^{11} masked a large number of interaction bins (more than 20% on chromosomes 9, 13, 14, 15, 21 and 22), including centromeric, telomeric, and other low-count regions, with ‘NaNs’ in order to ensure ICE convergence (Supplementary Fig. 36). Our approach requires no such masking while appropriately dealing with zero counts and low mappability bins.
Analysis of Hi-C data may also be confounded by phases of the cell cycle. We observed a dense network of long-range (1.5–2 Mb) histone gene interactions on chromosome 6, consistent with their colocalization in the histone locus body to enable efficient mRNA processing in S phase. However, it is also possible that this chromatin organization is maintained outside of S phase to poise the histone genes for rapid mRNA expression once the appropriate transcription and RNA processing factors are available. Future Hi-C studies in synchronized cells will resolve these questions.
Methods
Data preprocessing
Hi-C paired end (PE) reads were mapped to the hg19 reference genome using the Burrows-Wheeler Aligner (BWA). The paired ends were mapped independently and matched later in the analysis, with sequence alignments performed in parallel on a cluster. We used the BWA single-end aligner bwasw but found that bwa mem worked equally well. The short-read version of the aligner, bwa aln, did not perform well when aligning the longer paired end reads from Hi-C data.
As a filtering step, we only used PE reads where each end uniquely mapped to the reference genome, and clonal reads from PCR amplification were collapsed. In addition, PE reads that did not map within 500 bp of a restriction enzyme (RE) site were eliminated.
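The filtering step can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the record layout (`pos1`, `pos2`, `unique1`, `unique2`), the single-chromosome simplification, and the definition of a PCR duplicate as a pair with identical coordinates at both ends are all assumptions made for the example.

```python
from bisect import bisect_left

def distance_to_nearest_site(pos, sites):
    """Distance in bp from pos to the closest entry of the sorted list sites."""
    i = bisect_left(sites, pos)
    candidates = []
    if i < len(sites):
        candidates.append(sites[i] - pos)
    if i > 0:
        candidates.append(pos - sites[i - 1])
    return min(candidates) if candidates else float("inf")

def filter_pairs(pairs, re_sites, max_dist=500):
    """Keep uniquely mapped, de-duplicated pairs with both ends near an RE site.

    pairs: iterable of dicts with keys pos1, pos2, unique1, unique2
    (a hypothetical layout). re_sites: sorted RE cut positions on one chromosome.
    """
    seen = set()
    kept = []
    for p in pairs:
        if not (p["unique1"] and p["unique2"]):
            continue  # at least one end is not uniquely mapped
        key = (p["pos1"], p["pos2"])
        if key in seen:
            continue  # collapse clonal (PCR duplicate) pairs
        seen.add(key)
        if (distance_to_nearest_site(p["pos1"], re_sites) > max_dist
                or distance_to_nearest_site(p["pos2"], re_sites) > max_dist):
            continue  # an end lies too far from any RE site
        kept.append(p)
    return kept
```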
Calculating local genomic features
We divided the genome into consecutive disjoint intervals, with RE sites as breakpoints, and merged every ten contiguous RE fragments to produce larger non-overlapping intervals. We used the GM12878 in situ Hi-C dataset from Rao et al.^{11}. In this Hi-C experiment, a 4-base cutter (MboI) was used for chromatin fragmentation and yielded intervals (after merging every 10 RE fragments) with a median width of ∼4 kb. Each pair of intervals defines an interaction bin for counting PE reads, and we developed a model for assessing the significance of these counts based on local genomic features and the linear distance between the interval midpoints.
For each interval, we computed the average GC content and mappability over regions within 500 bp of an RE site. For each interaction bin, we standardized the log-transformed GC content and mappability features to obtain GC content and mappability covariates for the GLM described below.
The HiC-DC model can also be used with a uniform binning of the genome. In this case, an additional covariate called the effective sequence space is included in the model^{8}. For each interaction bin, we compute the fraction of each of the corresponding genomic intervals that is within 500 bp of an RE site within the interval; the effective sequence space is the product of these fractions.
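The interval construction described above can be illustrated with a short sketch. The function names and the half-open coordinate convention are assumptions made for this example; the last interval on a chromosome may contain fewer than ten fragments.

```python
def merge_fragments(cut_sites, chrom_length, n_merge=10):
    """Return half-open (start, end) intervals, each spanning n_merge RE fragments."""
    # Fragment boundaries: chromosome start, every internal cut site, chromosome end.
    breakpoints = [0] + sorted({s for s in cut_sites if 0 < s < chrom_length}) + [chrom_length]
    # Keep every n_merge-th boundary as an interval start.
    starts = breakpoints[:-1:n_merge]
    ends = starts[1:] + [chrom_length]
    return list(zip(starts, ends))

def midpoint_distance(interval_a, interval_b):
    """Linear genomic distance between the midpoints of two intervals."""
    mid_a = (interval_a[0] + interval_a[1]) / 2
    mid_b = (interval_b[0] + interval_b[1]) / 2
    return abs(mid_a - mid_b)
```

Each unordered pair of returned intervals then corresponds to one interaction bin, with `midpoint_distance` supplying its distance covariate.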
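As a rough sketch of the covariate construction, the snippet below log-transforms and standardizes a per-bin feature. The text does not state how the two interval-level values are combined into one bin-level feature; taking their product (as in HiCNorm-style models) is an assumption made here for illustration only.

```python
import math
from statistics import mean, pstdev

def standardized_log_feature(values_i, values_j):
    """Z-scores of log(value_i * value_j) across interaction bins.

    values_i, values_j: per-bin feature values (for example GC content) of the
    two intervals forming each bin. The product is an assumed combination rule.
    """
    logs = [math.log(a * b) for a, b in zip(values_i, values_j)]
    centre, spread = mean(logs), pstdev(logs)
    return [(x - centre) / spread for x in logs]
```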
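The effective sequence space for a uniformly binned genome can be sketched as below. This is an illustrative reading of the definition above, assuming half-open bin coordinates and that windows around neighbouring RE sites are merged so that no base is counted twice.

```python
def covered_fraction(bin_start, bin_end, sites, flank=500):
    """Fraction of [bin_start, bin_end) lying within flank bp of an RE site in the bin."""
    windows = sorted(
        (max(bin_start, s - flank), min(bin_end, s + flank))
        for s in sites
        if bin_start <= s < bin_end
    )
    covered, current_end = 0, bin_start
    for lo, hi in windows:
        lo = max(lo, current_end)  # skip the part already counted
        if hi > lo:
            covered += hi - lo
            current_end = hi
    return covered / (bin_end - bin_start)

def effective_sequence_space(bin_a, bin_b, sites, flank=500):
    """Product of the covered fractions of the two intervals of an interaction bin."""
    return (covered_fraction(bin_a[0], bin_a[1], sites, flank)
            * covered_fraction(bin_b[0], bin_b[1], sites, flank))
```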
Modelling interaction bin count data
The read counts for each pair of intervals (‘interaction bin’) from high-throughput conformation capture are overdispersed relative to the Poisson distribution (Fig. 1b). The negative binomial (NB) distribution is typically used for modelling overdispersed read count data. However, the NB distribution cannot handle an inflation of zero observations (zero count bins), which is typical in Hi-C data, perhaps due to a ‘dropout’ phenomenon where less frequent interactions are never captured in the sequencing library, as in single-cell RNA-seq^{14}. One option is to use a hurdle model whose count component is a zero-truncated negative binomial (ZTNB) regression model. Under this model, a Bernoulli model governs the binary outcome of whether a count variable has a zero or positive value. If the outcome is positive, the ‘hurdle’ is crossed, and the conditional distribution of the nonzero counts is governed by a zero-truncated negative binomial distribution. This model is sometimes called a hurdle negative binomial model. We used a GLM based on zero-truncated negative binomial regression to model the Hi-C read counts, and we used the fitted model to assign a statistical significance (P value) to each Hi-C interaction bin count.
Let Y={y_{ij}} represent the Hi-C contact map of intrachromosomal interactions, where i and j are a pair of genomic intervals (as described above) and the tuple (i,j) defines an interaction bin. Each bin has an associated vector of covariates, which we denote as x_{ij}. y_{ij} is a random variable that follows a zero-truncated negative binomial distribution. The regression model is defined as
$$\Pr(Y_{ij} = y_{ij} \mid x_{ij}) = \frac{f(y_{ij}; \mu_{ij}, \alpha)}{1 - f(0; \mu_{ij}, \alpha)}, \qquad y_{ij} > 0$$
where the distribution f (y_{ij}; μ_{ij}, α) is a negative binomial distribution with dispersion parameter α. The negative binomial mean parameter μ_{ij} is described with a log-linear model
$$\log \mu_{ij} = \beta_{0} + \beta_{gc}\, gc_{ij} + \beta_{map}\, map_{ij} + \sum_{k} \beta_{k} B_{k,l}(d_{ij})$$
where β_{0} is the intercept term, β_{gc} and β_{map} are coefficients for the GC content and mappability features gc_{ij} and map_{ij}, respectively, and d_{ij} is the genomic distance between the midpoints of intervals i and j. We modelled the relationship between genomic distance and contact significance (Supplementary Fig. 1) with a third-order B-spline (l=3), which takes the distance covariate as input, and the β_{k} correspond to the coefficients of the B-spline basis functions:
$$B_{i,l}(d) = \frac{d - t_{i}}{t_{i+l} - t_{i}} B_{i,l-1}(d) + \frac{t_{i+l+1} - d}{t_{i+l+1} - t_{i+1}} B_{i+1,l-1}(d)$$
where
$$B_{i,0}(d) = \begin{cases} 1 & \text{if } t_{i} \le d < t_{i+1} \\ 0 & \text{otherwise} \end{cases}$$
and t_{i} are ordered knots. The number of inner knots equals the degrees of freedom minus the order of the B-spline. We use six degrees of freedom for fitting the spline and hence three inner knots. The inner knots are the 25, 50 and 75% quantiles of the linear genomic distance, and the boundary knots are 0 and 2 Mb.
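To make the spline covariates concrete, the sketch below evaluates a cubic B-spline basis by the standard Cox-de Boor recursion. It is an illustration rather than the code used in the study: the inner knot positions are hypothetical stand-ins for the distance quantiles, and the boundary knots are repeated four times (the usual clamped construction). With three inner knots this yields seven basis functions; R's `splines::bs` without an intercept drops the first of these, leaving the six columns that match six degrees of freedom.

```python
def bspline_basis(x, knots, degree):
    """Values at x of all B-spline basis functions on the padded knot vector."""
    # Degree 0: indicator of the knot span containing x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for l in range(1, degree + 1):
        raised = []
        for i in range(len(knots) - l - 1):
            left_den = knots[i + l] - knots[i]
            right_den = knots[i + l + 1] - knots[i + 1]
            # Terms with a zero-width knot span are defined to be zero.
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + l + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis

DEGREE = 3
INNER_KNOTS = [5e5, 1e6, 1.5e6]  # hypothetical 25/50/75% distance quantiles
KNOTS = [0.0] * (DEGREE + 1) + INNER_KNOTS + [2e6] * (DEGREE + 1)

def distance_covariates(distance_bp):
    """Six spline covariates for one interaction bin (first basis function dropped)."""
    return bspline_basis(distance_bp, KNOTS, DEGREE)[1:]
```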
To increase the statistical power of the model, we removed bin counts that exceed the 97.5th percentile of the null distribution, which we deem positive outliers corresponding to potentially non-random contacts, and then refit the model to the remainder of the data. We use the function hurdle in the R library pscl to fit the zero-truncated negative binomial regression model.
We assume that the majority of interaction bins are generated by the null distribution, that is, do not represent significant interactions. HiC-DC models the null read count distribution as described above, and it assesses the significance of unexpectedly large interaction bin counts based on the corresponding estimated ZTNB distributions. To do this, after model fitting, each bin within a suitable distance range is assigned a P value by subtracting from 1 the sum of probabilities for all values less than the observed read count for the bin. Significant bins can then be selected with adjusted P values controlling for FDR based on the Benjamini-Hochberg procedure.
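The P value calculation can be sketched as follows, given a fitted mean and dispersion for a bin. This is a minimal illustration, not the HiC-DC implementation: it assumes the common NB2 parameterization, in which the variance is μ + αμ², and the function names are chosen for this example.

```python
import math

def nb_pmf(y, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha (NB2, assumed)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def ztnb_pmf(y, mu, alpha):
    """Zero-truncated NB pmf: NB pmf renormalized over y >= 1."""
    if y < 1:
        return 0.0
    return nb_pmf(y, mu, alpha) / (1.0 - nb_pmf(0, mu, alpha))

def ztnb_pvalue(y_obs, mu, alpha):
    """P(Y >= y_obs): one minus the total probability of counts below y_obs."""
    below = sum(ztnb_pmf(y, mu, alpha) for y in range(1, y_obs))
    return max(0.0, 1.0 - below)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a bin with observed count `y`, fitted mean `mu` and chromosome-level dispersion `alpha`, `ztnb_pvalue(y, mu, alpha)` gives the raw P value, and `benjamini_hochberg` is then applied across all tested bins.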
Annotating bins with genomic and epigenomic labels
To add epigenomic annotations to our bins, we first performed irreproducible discovery rate (IDR) analysis on two replicates of DNaseI hypersensitivity peaks (ENCFF001WFR and ENCFF001WFS by ENCODE), as well as two replicates of CTCF ChIP-seq peaks (ENCFF001XPJ and ENCFF001XPK by ENCODE), to obtain a set of reproducible peaks for each data set. Only one replicate for H3K27ac was available (ENCFF001SUG), which precluded IDR, and so we took those peaks as they were. We merged all three types of peaks together into one catalogue of peaks by successively merging the GRanges objects. Where peaks of different types overlapped, we concatenated the peak labels.
We then annotated each of the genomic bins using the catalogue of peaks, by transforming our bins into a GRanges object and assigning peak annotations to bins where they shared overlap as determined by the findOverlaps function in the GenomicRanges package. Each contact inherited the annotations of both its genomic bins.
To assign genomic annotations to the bins, we downloaded an annotation of all known genes from RefSeq (built against GRCh38), and split these into known genes and known transcripts. We used the transcripts along with the following rules to annotate each bin. If the bin was within 2 kb of a transcription start site, it was labelled as containing a promoter. If the bin overlapped with a known exon, it was labelled as exonic. If the bin overlapped with an intron, it was labelled intronic. Finally, if none of the previous were satisfied, it was labelled as distal intergenic. We later merged intronic and exonic labels into gene body (or genic), since exonic bins were so sparsely distributed.
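The overlap-based annotation can be illustrated without Bioconductor by a simple interval test. This is only a sketch of the idea: the study used GRanges objects and findOverlaps, whereas the half-open coordinates, the `+` separator for concatenated labels, and the function names below are assumptions made for the example.

```python
def annotate_bins(bins, peaks):
    """Concatenate the labels of all peaks overlapping each bin.

    bins: list of (start, end); peaks: list of (start, end, label);
    coordinates are half-open on a single chromosome.
    """
    annotations = []
    for bin_start, bin_end in bins:
        hits = sorted({label for peak_start, peak_end, label in peaks
                       if peak_start < bin_end and bin_start < peak_end})
        annotations.append("+".join(hits))
    return annotations

def contact_annotation(i, j, annotations):
    """A contact inherits the annotations of both of its bins."""
    return annotations[i], annotations[j]
```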
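The annotation rules form a priority cascade, which the following sketch makes explicit. The input layout (lists of transcription start sites, exon intervals and intron intervals on one chromosome, in half-open coordinates) and the function names are assumptions for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def genomic_label(bin_start, bin_end, tss_positions, exons, introns,
                  promoter_window=2000):
    """Apply the rules in order: promoter, exonic, intronic, distal intergenic."""
    if any(bin_start - promoter_window <= tss < bin_end + promoter_window
           for tss in tss_positions):
        return "promoter"
    if any(overlaps(bin_start, bin_end, s, e) for s, e in exons):
        return "exonic"
    if any(overlaps(bin_start, bin_end, s, e) for s, e in introns):
        return "intronic"
    return "distal intergenic"

def merge_gene_body(label):
    """Collapse exonic and intronic labels into a single gene body label."""
    return "gene body" if label in ("exonic", "intronic") else label
```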
Calculating annotation enrichment among significant contacts
To ask whether significant interactions with specific epigenetic signals in GM12878 were enriched over different distances, we partitioned all contacts by distance (excluding self-ligating contacts) in increments of 10 kb, covering distances up to 2 Mb. For each 10 kb element of the distance partition, which we term a distance band, and each epigenomic label, we divided all contacts that fell within the distance band into four groups: significant (FDR<1%) contacts bearing the label, significant contacts not bearing the label, not-significant contacts bearing the label, and not-significant contacts not bearing the label. We used Fisher’s exact test on the contingency table of those four groups to determine the enrichment of each label among significant contacts in that distance band (results depicted in Supplementary Fig. 23).
To examine the enrichment signal across longer distances, we pooled the results from the Fisher’s exact test described above into contiguous regions of 100 kb. This gave us 10 observed levels of significance for each annotation over each 100 kb region. We applied a Wilcoxon rank-sum test on these P values to get a relative enrichment value for each annotation over the longer intervals, which we display in Fig. 3a.
For the genomic annotations, we repeated the same methodology of testing for enrichment in 10 kb distance bands for bins annotated as either a promoter, gene body, or distal intergenic. We pooled the results over 50 kb contiguous regions, and applied a Wilcoxon rank-sum test on the P values to get a relative enrichment value, which we display in Fig. 3b.
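The per-band test can be sketched with the hypergeometric form of Fisher’s exact test. The text does not state whether a one- or two-sided test was used; the one-sided (enrichment) version shown here is an assumption, as are the input layout and function names.

```python
from math import comb

def band_table(contacts, band_index, band_width=10_000):
    """Count contacts of one distance band in the four groups (a, b, c, d).

    contacts: iterable of (distance_bp, is_significant, has_label).
    a: significant with label, b: significant without label,
    c: not significant with label, d: not significant without label.
    """
    a = b = c = d = 0
    for distance, significant, labelled in contacts:
        if distance // band_width != band_index:
            continue
        if significant and labelled:
            a += 1
        elif significant:
            b += 1
        elif labelled:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_exact_greater(a, b, c, d):
    """One-sided P value: probability of at least a labelled significant contacts."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col1)
```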
Genomic compartment analysis
To identify A and B compartments in the Rao et al.^{11} GM12878 data set, we merged adjacent interaction bins, combining their counts to form metabins with 100 kb resolution. We computed a correlation matrix from this coarse-grained contact matrix, and computed its first principal component. Following the procedure of Lieberman-Aiden et al.^{1}, the sign of the first principal component for each metabin was used to define the compartment label (A or B). We found that the A compartment metabins were enriched for DNase hypersensitive sites, and so labelled the A compartment metabins as ‘open’ and the B compartment metabins as ‘closed’.
For the compartment enrichment analysis, we applied the label of each metabin to its constituent bins. We then partitioned all contacts into three groups: contacts with both endpoints in open regions, contacts with both endpoints in closed regions, and contacts with one open and one closed endpoint. For each of the three sets, we repeated the enrichment analysis as in Supplementary Fig. 14.
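A dependency-free sketch of the compartment call is given below. It simplifies the procedure: the leading eigenvector of the row correlation matrix, found by power iteration, stands in for the first principal component, and no distance normalization is applied. The sign of an eigenvector is arbitrary, so which sign corresponds to the A compartment must be decided from external data such as DNase hypersensitivity, as described above.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom if denom > 0 else 0.0

def leading_eigenvector(matrix, n_iter=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    n = len(matrix)
    # A non-constant start avoids being stuck on the all-ones direction.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def compartment_signs(contacts):
    """Sign (+1 or -1) of the leading eigenvector entry for each metabin."""
    corr = [[pearson(r1, r2) for r2 in contacts] for r1 in contacts]
    return [1 if x >= 0 else -1 for x in leading_eigenvector(corr)]
```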
Downsampling analysis
For the downsampling analysis, we performed training on chromosome 1 in Rao et al.^{11} GM12878. We downsampled the number of reads from the Hi-C interaction matrix by taking the counts for each element (that is, each pair of genomic loci) and transforming them into a list of paired end reads of size equal to the counts. From this list of all reads, we subsampled randomly without replacement and reassigned each of the sampled reads to their corresponding elements in the new interaction matrix. We trained HiC-DC on samples in proportions of 75, 50 and 25% from the original contact matrix. We used HiC-DC to calculate FDR-adjusted –log_{10} P values for each interaction in each of the three subsampled contact matrices. These P values were used to predict the significant interactions from the full interaction matrix. The FDR-adjusted P values for the full Hi-C contact matrix were used to define our ground truth labels. Interactions with adjusted P values <0.05 were labelled as positive, and interactions with adjusted P values >0.1 were labelled as negative. We excluded all bins with adjusted P values between 0.05 and 0.1 from the analysis. For each of the downsampled contact maps, we plotted precision-recall curves as shown in Fig. 1e.
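The read-level downsampling and the ground truth labelling can be sketched as follows. The sparse dictionary representation of the contact matrix, the fixed random seed and the function names are assumptions made for this illustration.

```python
import random

def downsample_counts(counts, fraction, seed=0):
    """Subsample reads without replacement and rebuild the count matrix.

    counts: dict mapping an interaction bin (i, j) to its read count.
    """
    # Expand each bin into one list entry per read.
    reads = [key for key, count in counts.items() for _ in range(count)]
    rng = random.Random(seed)
    sampled = rng.sample(reads, round(fraction * len(reads)))
    new_counts = {}
    for key in sampled:
        new_counts[key] = new_counts.get(key, 0) + 1
    return new_counts

def ground_truth_label(adjusted_p):
    """True for positives (<0.05), False for negatives (>0.1), None if excluded."""
    if adjusted_p < 0.05:
        return True
    if adjusted_p > 0.1:
        return False
    return None
```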
Data availability
Hi-C data sets analysed in this study were obtained through Gene Expression Omnibus accession codes GSE63525 (GSM1551550–GSM1551578, GSM1551618–GSM1551623) and GSE35156 (GSM862720–GSM862723, GSM892307). Source code and documentation for HiC-DC are available as an R package through a git repository located at https://bitbucket.org/leslielab/hic.dc.
Additional information
How to cite this article: Carty, M. et al. An integrated model for detecting significant chromatin interactions from high-resolution Hi-C data. Nat. Commun. 8, 15454 doi: 10.1038/ncomms15454 (2017).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
2. Tanizawa, H. et al. Mapping of long-range associations throughout the fission yeast genome reveals global genome organization linked to transcriptional regulation. Nucleic Acids Res. 38, 8164–8177 (2010).
3. Sexton, T. et al. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 148, 458–472 (2012).
4. Duan, Z. et al. A three-dimensional model of the yeast genome. Nature 465, 363–367 (2010).
5. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).
6. Yaffe, E. & Tanay, A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, 1059–1065 (2011).
7. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
8. Hu, M. et al. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics 28, 3131–3133 (2012).
9. Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 24, 999–1011 (2014).
10. Klein, F. A. et al. FourCSeq: analysis of 4C sequencing data. Bioinformatics 31, 3085–3091 (2015).
11. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
12. Xu, Z. et al. A hidden Markov random field-based Bayesian method for the detection of long-range chromosomal interactions in Hi-C data. Bioinformatics 32, 650–656 (2016).
13. Zeileis, A., Kleiber, C. & Jackman, S. Regression models for count data in R. J. Stat. Softw. 27, 1–25 (2008).
14. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
15. Rattray, A. M. & Muller, B. The control of histone gene expression. Biochem. Soc. Trans. 40, 880–885 (2012).
16. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
17. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
18. Cairns, J. et al. CHiCAGO: robust detection of DNA looping interactions in capture Hi-C data. BioRxiv 17, 127 (2015).
19. Nizami, Z., Deryusheva, S. & Gall, J. G. The Cajal body and histone locus body. Cold Spring Harb. Perspect. Biol. 2, a000653 (2010).
20. Marinov, G. K. et al. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 24, 496–510 (2014).
Acknowledgements
This work was supported in part by NIH/NHGRI grants HG006798 and HG007893 to C.S.L. We would like to thank Iestyn Whitehouse for helpful discussions and Ferhat Ay for assistance with the Fit-Hi-C software.
Author information
Contributions
M.C. developed and implemented the statistical model, performed statistical analyses including method comparisons and detailed views of DNA looping at specific loci, and contributed to writing the manuscript. L.Z. performed statistical analyses including genome-wide enrichment analyses and mapping of histone locus interactions and contributed to writing the manuscript. M.S. contributed to optimizations of the HiC-DC code and statistical analyses. A.G. processed the Hi-C data to produce count matrices of interaction bins, parsed epigenomic datasets to annotate the bins, and advised on the software implementation. R.P. advised on algorithm development. O.E. helped to supervise the research. C.S.L. helped to develop the statistical model, supervised the research, and wrote the manuscript.
Corresponding author
Correspondence to Christina S. Leslie.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Information
Supplementary Figures, Supplementary Table, and Supplementary References (PDF 8882 kb)
Supplementary Data 1
Counts of all significant contacts for HiC-DC, Fit-Hi-C and HiCCUPS across all chromosomes (XLSX 22 kb)
Supplementary Software 1
HiC-DC R package (ZIP 4898 kb)
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/