Abstract
Proximity-ligation methods such as Hi-C allow us to map physical DNA–DNA interactions along the genome, and reveal its organization into topologically associating domains (TADs). As Hi-C data accumulate, computational methods have been developed for identifying domain borders in multiple cell types and organisms. Here, we present PSYCHIC, a computational approach for analyzing Hi-C data and identifying promoter–enhancer interactions. We use a unified probabilistic model to segment the genome into domains, which we then merge hierarchically and fit using a local background model, allowing us to identify overrepresented DNA–DNA interactions across the genome. By analyzing published Hi-C data sets in human and mouse, we identify hundreds of thousands of putative enhancers and their target genes, and compile an extensive genome-wide catalog of gene regulation in human and mouse. As we show, our predictions are highly enriched for ChIP-seq and DNA accessibility data, evolutionary conservation, eQTLs and other DNA–DNA interaction data.
Introduction
One of the key mechanisms of gene regulation in eukaryotes involves promoter–enhancer interactions, where distal regulatory regions along the DNA (enhancers) come in close physical proximity to their target promoters to further activate transcription. The human genome is estimated to contain hundreds of thousands of enhancers, often with multiple enhancers regulating a single gene. These act in a tissue-specific manner and can be found up to 1 Mb away from their target genes^{1,2,3,4,5,6}. The importance of enhancers for gene regulation is further emphasized by a growing body of work linking genetic variation in enhancer sequences to human diseases^{7,8,9,10,11}. Nonetheless, we still lack a deep understanding of the following: (a) how enhancers work molecularly, (b) how their tissue specificity is encoded in their sequence, and above all, (c) how they recognize and physically interact with their target genes.
In recent years, high-throughput molecular methods have been developed to study the three-dimensional organization of the genome, and its relation to various functions. For example, proximity-ligation methods such as 4C, ChIA-PET and Hi-C quantify the frequency of DNA–DNA interactions in living cells and map the 3D organization of the genome in high resolution^{12,13,14,15,16,17,18,19,20,21,22,23}. To date, Hi-C experiments have been performed in a variety of organisms and cellular conditions, including many cell types and tissues.
The genomic resolution of these data is often low, varying from a few Kb to 40 Kb blocks, and they were mainly used to identify and delineate topologically associating domains (TADs). These are continuous regions (hundreds of Kb to a few Mb) that were shown to be folded upon themselves into local compartments and to facilitate a high number of DNA–DNA interactions^{19,24,25,26}.
In recent years, topological domains were studied extensively, and were shown to (a) be related to replication domains^{27,28}, (b) be largely conserved across evolution, and (c) play a crucial role in chromosome function^{25,29,30,31,32,33}.
TADs also play a key role in gene regulation, as they define the regulatory scope of enhancers. The domains' boundaries were shown to act as regulatory “insulators” that prevent targeting genes outside of the enhancer domain^{34,35}. Disruptions of the chromosomal structure, either in human genetic disorders or by artificially deleting boundary elements (e.g., using CRISPR-Cas9), were shown to be associated with enhancer misregulation and aberrant gene expression^{9,10,11,36,37,38}. While we still lack a deep understanding of the exact mechanisms by which topological domains are defined and maintained, TAD borders were shown to be enriched for highly transcribed genes^{25}, as well as CTCF and cohesin binding sites^{22,31,39,40,41,42,43,44,45}.
As more and more 3D data accumulate, in a multitude of tissues and cellular conditions, algorithms have been developed to analyze Hi-C data and partition the genome into a set of topological domains^{17,20,25,46,47,48,49,50}. Most notable are the Directionality Index method^{25}, which scans the genome by analyzing the set of DNA–DNA interactions for every locus and identifies transitions from loci with mostly backward interactions to adjacent loci with mostly forward interactions; and the Insulation Square method^{23}, which identifies TAD boundaries as genomic loci with very few overhead interactions. Additional methods aim to construct a more hierarchical structure of topological domains, a visible feature of Hi-C maps, either by merging cross-connected subdomains into larger domains^{20} or by iteratively altering the algorithm parameters to obtain an ensemble of multiple chromosomal segmentations that could be interpreted as hierarchical domains^{50}. While these methods are generally fast and robust, they are inherently biased towards the short-range interactions that form the vast majority of DNA–DNA interactions, thus overshadowing the less abundant long-range interactions (250 Kb and above) that are more informative for calling hierarchical TADs.
Here, we present PSYCHIC (Fig. 1), a three-step modular algorithm to identify promoter–enhancer interactions. Briefly, we use a unified probabilistic model and a Dynamic Programming algorithm to find an optimal segmentation of each chromosome into topological domains; we next iteratively merge neighboring domains into hierarchical structures; and finally we fit each domain using a local background model. This allows us to identify overrepresented DNA–DNA pairs, including enhancers and their target genes. We have analyzed the Hi-C data from 15 conditions and cell types in mouse and human^{19,20,25}, and identified hundreds of thousands of overrepresented interactions. This comprehensive genome-wide tissue-specific database of putative interactions between enhancers and their target genes would be of great interest to the scientific community.
Results
A unified probabilistic mixture model for Hi-C data
Hi-C interaction maps often show a clear distinction between two different patterns: rectangular regions along the diagonal of the Hi-C map, which correspond to topological domains and present a high intensity of (intra-domain) DNA–DNA interactions, are often surrounded by regions with fewer (inter-domain) DNA–DNA interactions. Due to symmetry, Hi-C maps are often rotated by 45 degrees, with topological domains shown as isosceles right triangles along the (now horizontal) diagonal of the Hi-C map (Fig. 1a).
We begin by developing a simple two-component probabilistic model, corresponding to the probability of intra- and inter-TAD interactions. In brief, our algorithm analyzes the Hi-C interaction matrix and infers for every cell (DNA–DNA pair) the log-probability ratio (LPR) of these loci occurring within the same topological domain or not. In the following stages, we will combine these ratios into a unified score, and use Dynamic Programming to optimally segment each chromosome into domains.
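Formally, let P _{ d }(N) denote the probability of observing N Hi-C interactions between two DNA loci d bases apart. This equals the weighted sum of the intra-domain and inter-domain submodels:
$$P_d(N) = P_d(N \mid \text{intra})\,P_d(\text{intra}) + P_d(N \mid \text{inter})\,P_d(\text{inter}) \qquad (1)$$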
where P _{ d }(N | intra) and P _{ d }(N | inter) correspond to the likelihood of observing N interactions d bp apart in the intra-TAD and inter-TAD submodels, respectively. P _{ d }(intra) and P _{ d }(inter) correspond to the a priori probability of two loci d bp apart being within or outside of the same TAD. For robustness, we model N using a log-Normal distribution (Supplementary Fig. 1a, b; Methods section). Additional probabilistic families (log-Poisson and Negative Binomial) were considered and found to be less accurate (Supplementary Fig. 1c, d). This parameterization greatly reduces the number of free parameters, resulting in a compact model θ _{ d } with only six parameters for every distance d, including μ _{ d } ^{intra}, σ _{ d } ^{intra}, μ _{ d } ^{inter}, and σ _{ d } ^{inter} (mean and standard deviation parameters for intra-TAD and inter-TAD models); and two prior parameters P _{ d }(intra) and P _{ d }(inter), while offering an accurate approximation of the Hi-C data (Supplementary Fig. 1a, b). For every distance d, we directly estimate the model parameters from annotated Hi-C data: To estimate θ _{ d }, we rely on an initial (possibly noisy) segmentation of the Hi-C map into domains. These could be obtained using various methods, including the directionality index (DI) HMM-based method of Dixon et al.^{25}, Insulation Square^{23}, or approximated iteratively using the Expectation-Maximization (EM) algorithm^{51}. Given such annotations, we consider all intra- and inter-TAD pairs and use a maximum likelihood estimation of the mean and the standard deviation parameters. As shown by comparing different chromosomes of mouse ES cells, these estimations are very robust (Supplementary Fig. 1e). The same approach is used to estimate the prior probabilities, namely what percentage of the DNA–DNA interactions of distance d occur within, or across, topological domains.
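The estimation step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PSYCHIC implementation: the function name `fit_two_component`, the dense list-of-lists matrix, the per-bin TAD labels, and the choice to skip empty cells are all assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_two_component(counts, tad_of_bin, max_dist):
    """Estimate, for every distance d (in bins), log-Normal parameters and
    priors for intra-TAD and inter-TAD cells from an annotated matrix.

    counts[i][j] : contact count between bins i and j (hypothetical input)
    tad_of_bin[i]: TAD label of bin i, taken from an initial segmentation
    Returns {d: {"intra": (mu, sigma, prior), "inter": (mu, sigma, prior)}}.
    """
    logs = defaultdict(lambda: {"intra": [], "inter": []})
    n = len(counts)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_dist + 1)):
            if counts[i][j] <= 0:
                continue  # assumption: empty cells are skipped (log undefined)
            label = "intra" if tad_of_bin[i] == tad_of_bin[j] else "inter"
            logs[j - i][label].append(math.log(counts[i][j]))

    theta = {}
    for d, groups in logs.items():
        total = len(groups["intra"]) + len(groups["inter"])
        params = {}
        for label, vals in groups.items():
            if not vals:
                params[label] = None  # no observations for this submodel
                continue
            mu = sum(vals) / len(vals)                            # ML mean of log-counts
            var = sum((v - mu) ** 2 for v in vals) / len(vals)    # ML variance
            params[label] = (mu, math.sqrt(var), len(vals) / total)
        theta[d] = params
    return theta
```

Each distance yields the six quantities listed in the text: two means, two standard deviations, and the two priors, which sum to one.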
Identification of TAD boundaries using log-posterior ratios
Using the above probabilistic model, we now wish to re-segment the genome into domains. For this, we propose a score that will integrate information from various distances of DNA–DNA interactions across the entire Hi-C matrix, without being skewed by the significantly higher number of interactions among nearby DNA–DNA pairs.
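For this, we define a local score that calculates for every cell in the Hi-C matrix the log-posterior ratio (LPR) of the intra- and inter-TAD submodels. Assuming N interactions for two DNA loci d bases apart, we can use Bayes’ law to derive the posterior probability of being within TADs, P _{ d }(intra | N), or between TADs, P _{ d }(inter | N) (Methods section). This allows us to compute the log-posterior ratio of the two submodels:
$$\mathrm{LPR}_d(N) = \log \frac{P_d(\text{intra} \mid N)}{P_d(\text{inter} \mid N)} = \log \frac{P_d(N \mid \text{intra})\,P_d(\text{intra})}{P_d(N \mid \text{inter})\,P_d(\text{inter})} \qquad (2)$$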
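As a small illustration, the log-posterior ratio of a single cell can be computed directly from the per-distance parameters, because the evidence term P _{ d }(N) cancels between the two posteriors. The sketch below assumes each submodel is given as a hypothetical `(mu, sigma, prior)` tuple; the function names are not taken from the PSYCHIC code.

```python
import math

def lognormal_logpdf(n, mu, sigma):
    """Log-density of a log-Normal distribution evaluated at n > 0."""
    z = (math.log(n) - mu) / sigma
    return -math.log(n * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def log_posterior_ratio(n, intra, inter):
    """LPR of one Hi-C cell with n interactions at a fixed distance d.

    intra, inter: (mu, sigma, prior) tuples for the two submodels
    (an assumed representation of the per-distance parameters).
    """
    mu_a, sd_a, prior_a = intra
    mu_b, sd_b, prior_b = inter
    return (lognormal_logpdf(n, mu_a, sd_a) + math.log(prior_a)
            - lognormal_logpdf(n, mu_b, sd_b) - math.log(prior_b))
```

A positive value favors the intra-TAD submodel, and a negative value favors the inter-TAD submodel.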
We are now ready to score a segmentation of the genome into domains.
First, let us define the probabilistic score for a single topological domain t, starting at position s and ending at position e. For this, we sum the log-posterior ratios for all intra-TAD cells (pairs <i,j> such that s ≤ i ≤ j ≤ e), and subtract the log-posterior ratios for all inter-TAD cells outside of TAD t. These are defined by the remaining (non-intra-TAD) pairs <k,l> whose centers lie within the TAD t, such that s ≤ (k + l)/2 ≤ e:
$$S(t) = \sum_{s \le i \le j \le e} \mathrm{LPR}(i,j) \; - \sum_{\substack{\langle k,l \rangle \notin t \\ s \le (k+l)/2 \le e}} \mathrm{LPR}(k,l) \qquad (3)$$
These are shown as blue (intra-TAD) and yellow (inter-TAD) regions in Fig. 1c. For efficiency reasons, we only consider intra-TAD pairs (<i,j>) or inter-TAD pairs (<k,l>) up to a maximal distance h of 5 Mb. Probabilistically speaking, we allow every Hi-C cell to independently compare its likelihood given each of the two submodels. We then define a global score for a segmentation C of the genome into a set of TADs, by summing over their respective scores:
$$S(C) = \sum_{t \in C} S(t)$$
As shown in Fig. 1c, the score of each TAD t is based on pairs within t (blue) or directly above t (yellow), such that all Hi-C cells are counted exactly once. Moreover, since the score is strictly additive, breaking a single TAD into two TADs only requires changing the sign of the LPR scores for cells between those TADs (Fig. 1c, striped region), as they are shifted from being considered intra-TAD (thus positive, left-hand side of Eq. 3) to inter-TAD (negative, right-hand side of Eq. 3).
Finally, we use a Dynamic Programming algorithm to find the optimal segmentation of each chromosome into topological domains, with respect to our two-component model. The algorithm computes the optimal score of each genomic interval C _{i,j} by comparing its score as a single TAD from position i to position j, S(t _{ i,j }) as in Eq. 3, with the best score obtained by recursively breaking it at each possible position k into two distinct regions, one ranging from position i to k, and another from position k + 1 to position j:
$$C_{i,j} = \max\Big( S(t_{i,j}),\; \max_{i \le k < j} \big( C_{i,k} + C_{k+1,j} \big) \Big)$$
Our algorithm then extends the computed range <i,j> until the entire chromosome is covered. This allows us to efficiently enumerate all possible configurations {C} for each chromosome and identify the optimal segmentation C, with respect to the above probabilistic score.
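To make the segmentation step concrete, here is a deliberately simple Python sketch. It makes several assumptions that are not taken from the paper: the LPR values are given as a dense matrix `lpr`, the maximal distance `h` is expressed in bins, each cell is assigned to the TAD containing the integer midpoint `(k + l) // 2`, and the TAD score is recomputed by brute force. It also uses the prefix form of the dynamic program (best segmentation of bins 0..j), which reaches the same optimum as the interval recursion because the global score is a sum of per-TAD scores.

```python
def tad_score(lpr, s, e, h):
    """Score of a single TAD covering bins s..e (inclusive)."""
    n = len(lpr)
    score = 0.0
    for k in range(max(0, s - h), e + 1):
        for l in range(k, min(n, k + h + 1)):
            center = (k + l) // 2          # assumption: integer midpoint
            if center < s or center > e:
                continue                   # cell belongs to another TAD
            if s <= k and l <= e:
                score += lpr[k][l]         # intra-TAD cell
            else:
                score -= lpr[k][l]         # inter-TAD cell above this TAD
    return score

def segment(lpr, h):
    """Return (best score, list of (start, end) TADs) for one chromosome."""
    n = len(lpr)
    best = [0.0] + [float("-inf")] * n     # best[j]: optimum for bins 0..j-1
    back = [0] * (n + 1)                   # start bin of the last TAD
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + tad_score(lpr, i, j - 1, h)
            if cand > best[j]:
                best[j], back[j] = cand, i
    tads, j = [], n
    while j > 0:                           # trace back the chosen boundaries
        tads.append((back[j], j - 1))
        j = back[j]
    return best[n], tads[::-1]
```

A practical implementation would cache cumulative sums of the LPR matrix so that each TAD score is obtained in constant time rather than recomputed.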
Hierarchical model of topological domains
So far, we developed a probabilistic framework for modeling the Hi-C data within and across topological domains, and presented an efficient algorithm for identifying the optimal segmentation. For this, our model assumed that all intra-TAD DNA–DNA pairs, located d bases apart, distribute according to one set of log-Normal parameters, and all inter-TAD pairs use another set.
We now wish to relax this assumption, and allow each TAD to have a unique set of parameters fitting its intra-TAD Hi-C interaction counts. In addition, we wish to fit additional sets of parameters to selected inter-TAD regions (shown as tilted rectangles in the Hi-C map, Fig. 1c).
Specifically, we wish to iteratively agglomerate neighboring TADs into hierarchical structures of topological domains, where each TAD or merged region is assumed to have a different tendency for Hi-C interactions (Fig. 1d). For this, we developed a “merge score” that allows us to examine adjacent domains. A naive scoring system for neighboring TADs would simply quantify their connectivity, by directly counting the number of inter-TAD interactions^{20}. This score, however, might be biased by the size of the two domains, as well as the overall interaction intensity in each of the two domains.
Instead, our “merge score” preferentially chooses neighboring TADs whose inter-TAD region is more similar to each of the intra-TAD regions than to the overall inter-TAD Hi-C count distributions. Specifically, we calculate for each domain the average number of DNA–DNA interactions at any distance d (Supplementary Fig. 1f), and compare these plots to the region between the two TADs, and to the remaining inter-TAD regions (“Sky” in Supplementary Fig. 1f). We then linearly regress these plots, and find the optimal α satisfying:
where I _{ Merge }, I _{ TADs }, and I _{ Sky } denote the average intensities for each d at the inter-TAD (“Merge”) area, the intra-TAD interactions within the two TADs, and the inter-TAD background model (“Sky”). We do so iteratively, greedily merging TAD pairs with the highest α value. Specifically, we merge the two adjacent TADs whose inter-TAD region is most similar (in terms of Hi-C interactions) to their intra-TAD regions and most dissimilar to the remaining inter-TAD regions. As before, this is done iteratively up to a maximal merge size of 5 Mb, to create a set (forest) of tree-like TAD merges, visually corresponding to triangles (TADs) and rectangles (inter-TAD merge regions). Supplementary Fig. 1g compares the number of TADs and hierarchical merges (1st, 2nd order, etc.) for various Hi-C data sets.
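One plausible reading of the merge score, given here only as an assumption, is that the merge-region profile is regressed as a mixture of the intra-TAD and "Sky" profiles, I_Merge ≈ α·I_TADs + (1 − α)·I_Sky, so that α near 1 means the region between two TADs looks like the TADs themselves. Under that assumed form, the least-squares α has a closed-form solution, sketched below with hypothetical names.

```python
def merge_alpha(i_merge, i_tads, i_sky):
    """Least-squares alpha for i_merge ~ alpha * i_tads + (1 - alpha) * i_sky.

    Each argument is a list of average intensities, one value per distance d.
    The mixture form is an assumption of this sketch, not a quoted equation.
    """
    x = [t - s for t, s in zip(i_tads, i_sky)]   # TAD profile relative to Sky
    y = [m - s for m, s in zip(i_merge, i_sky)]  # merge profile relative to Sky
    denom = sum(v * v for v in x)
    if denom == 0:
        raise ValueError("intra-TAD and Sky profiles are identical")
    return sum(a * b for a, b in zip(x, y)) / denom
```

In a greedy scheme, this value would be computed for every pair of adjacent domains, and the pair with the highest α merged first.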
TAD-specific background model using bilinear power-law fit
Once we have segmented the Hi-C map into topological domains and TAD merges, we wish to specifically model the intensity of Hi-C data in each region, thus fitting the Hi-C data with a series of local background models. This will allow us to estimate the expected number of interactions in each Hi-C cell, thus identifying overrepresented Hi-C cells enriched compared to their specific TAD environment. Previous works used a power-law scaling model^{15,52,53} to regress the expected number of DNA–DNA interactions as a function of their distance d:
$$N(d) = e^{b}\, d^{a}$$
This is often plotted in log–log scale, where the (log) number of interactions scales linearly with the (log) distance:
$$\log N(d) = a \log d + b$$
with a being the power-law coefficient (slope of the log–log plot) and b the intercept parameter.
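For illustration, the two power-law parameters can be obtained by ordinary least squares on log-transformed data. The sketch below assumes the input is the average contact count at each distance within one region; the function name and the use of natural logarithms are choices made for the example.

```python
import math

def fit_power_law(distances, mean_counts):
    """Fit log N = a * log d + b by least squares; returns (a, b, rmse)."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in mean_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx  # slope
    b = my - a * mx                                             # intercept
    rmse = math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n)
    return a, b, rmse
```

Running the same fit separately on each TAD or merged region gives the per-region parameters a _{ i } and b _{ i } discussed in the text.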
Nonetheless, while we found the power-law model to be generally accurate, it is clear that some domains show more Hi-C interactions than others (Fig. 1a), suggesting they would be best described by different power-law parameters (Supplementary Fig. 1f, e.g. TADs A vs. B). We therefore wish to fit a different background model for each TAD and each merged region (Fig. 1d). This allows us to estimate the expected number of interactions at any distance within every topological domain/merge and quantify the statistical significance of overrepresented interactions.
Next, we quantified the goodness-of-fit of each model to Hi-C data (Supplementary Fig. 2). First, we tested the overall fit with a single model for each chromosome, yielding an average RMSE of 1.45. We then tested the original segmentation of the genome into domains, using the Directionality Index method by Dixon et al.^{25} in mouse cortex Hi-C data (mean RMSE of 1.27). For each TAD, we estimated the optimal power-law coefficient a _{ i } and intercept b _{ i }, resulting in an RMSE of 1.20, an improvement of 7% compared to a random segmentation of the genome (using TAD shuffling, RMSE = 1.29). The hierarchical agglomeration of neighboring domains did not further improve the fit noticeably (RMSE = 1.19).
Finally, we considered a more sophisticated parametric family for modeling Hi-C interaction data in each TAD or merge area. As we noticed, many TADs do not follow a power-law distribution (straight line in log–log plots), but instead show a “broken” behavior, which could reflect one power-law fit for the closer distances, and another at more distant ones (Supplementary Fig. 3). For this, we developed a piecewise power-law regression model for modeling the average number of interactions (in log scale) for any distance (in log scale) (Methods section). This richer model offers a much more accurate fit of the Hi-C data (RMSE = 1.06), a 12% reduction in fit error compared to the original power-law fit.
For comparison, RMSE for simulated data sampled (using Poisson distribution with matching “read depth”) from the background model itself, was only 3% lower at RMSE = 1.03. Put together, hierarchical TAD models with bilinear power-laws allow us to model Hi-C interaction data with high accuracy, thus forming a detailed background model against which we can compare the data and identify overrepresented DNA–DNA interactions.
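The exact piecewise procedure is described in the Methods section, which is not reproduced here. As a rough sketch of the idea only, a two-segment fit can be obtained by trying every admissible breakpoint, fitting a separate line on each side in log–log space, and keeping the split with the smallest total squared error. The function names, the `min_points` parameter, and the fact that the two segments are not forced to meet at the breakpoint are all assumptions of this example.

```python
def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def fit_two_segment(log_d, log_n, min_points=2):
    """Two-segment fit of log_n against log_d (sorted, distinct log_d)."""
    best = None
    for k in range(min_points, len(log_d) - min_points + 1):
        left = linear_fit(log_d[:k], log_n[:k])    # short-range regime
        right = linear_fit(log_d[k:], log_n[k:])   # long-range regime
        sse = left[2] + right[2]
        if best is None or sse < best["sse"]:
            best = {"break_index": k, "left": left[:2],
                    "right": right[:2], "sse": sse}
    return best
```

The returned slopes play the role of two power-law coefficients, one for each side of the break.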
Identification of enriched interactions in the mouse cortex
We now wish to use the hierarchical TAD-specific bilinear model as a background model for Hi-C, and identify overrepresented DNA–DNA interactions that could correspond to promoter–enhancer and other functional interactions in vivo.
For this, we aim to compute the “virtual 4C” plot for each promoter, and compare it to the expected number of interactions according to the background model. We consider a large genomic region surrounding each promoter (±1 Mb) and search for regions showing enriched Hi-C interactions with the promoter. By subtracting the background model from the Hi-C data, we obtain the “residual” overrepresentation map. Statistical significance scores (p values) are assigned using a log-Normal distribution fitted to the residuals in a 2 Mb window surrounding each promoter, then corrected for multiple hypotheses (FDR)^{54} (Methods section).
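The scoring step can be illustrated with a short sketch. It assumes that the residuals of one promoter window are already expressed in log scale (so that a Normal fit on them corresponds to a log-Normal fit on the ratios), that significance is one-sided for overrepresentation, and that the multiple-testing correction is the Benjamini–Hochberg procedure. These are assumptions of the example, and the actual definitions are in the Methods section.

```python
import math

def upper_tail_pvalues(residuals):
    """One-sided p values from a Normal fitted to the window's residuals."""
    n = len(residuals)
    mu = sum(residuals) / n
    sd = math.sqrt(sum((r - mu) ** 2 for r in residuals) / (n - 1))
    # Survival function of the Normal distribution via the complementary erf.
    return [0.5 * math.erfc((r - mu) / (sd * math.sqrt(2.0))) for r in residuals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        qvals[i] = running
    return qvals
```

Bins whose q value falls below the chosen FDR threshold would then be reported as candidate interaction partners of the promoter.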
We begin by focusing on the Foxg1 locus (chr12, 50.3–51.2 Mb) using Hi-C data from mouse cortex^{25}. Figure 2a shows the “residual” map for this locus. Prominent overrepresented cells match two Foxg1 enhancers (hs566 and hs1539) located 550 Kb and 750 Kb downstream of the gene, with FDR values of 7e-12 and 1e-20, respectively. These two enhancers were discovered in human by us and others, using ChIP-seq and conservation data^{55,56,57}. Comparison to published ChIP-seq data of H3K27ac, CTCF, PolII, and DNaseI hypersensitivity data from the mouse ENCODE project^{58}, and evolutionary conservation data^{59} further identifies the exact location of these Foxg1 enhancers (Fig. 2b).
Genome-wide validation of putative enhancers
To further test our results on a genome-wide scale, we systematically characterized the chromatin landscape surrounding all predicted enhancers in mouse cortex^{25}. For this, we aligned a 4 Mb region around each of the 17,788 putative enhancer regions (in Hi-C bin resolution) using an FDR threshold of 1e-2, and tested various enhancer-related chromatin marks. These include active enhancer and promoter marks (H3K27ac, H3K4me1, PolII), CTCF, evolutionary conservation, DNA accessibility, and chromHMM predictions^{58,59,60,61} (Fig. 3, blue lines and heatmaps). For control, we also computed the average signal at a random set of genomic regions up to 1 Mb away from promoters (Fig. 3, dotted black lines). For all data types, the predicted enhancers were significantly enriched compared to their surrounding flanking regions (see Supplementary Fig. 4 for heatmaps of control regions).
Similar analysis for predicted boundaries identifies enrichment for CTCF and high DNA accessibility, as well as enrichment for promoter-like marks of PolII and H3K27ac, without H3K4me1 enrichment (Supplementary Fig. 5).
Next, we wished to study the effect different initialization methods have on the predicted promoter–enhancer interactions. For this, we initialized the two-component intra-/inter-TAD model using three methods: the Directionality Index^{25}, the Insulation Square method^{23}, and a random initialization of TADs. These changes had a limited effect on the predicted enhancer Hi-C bins (Supplementary Fig. 6).
We then turned to analyze the statistics of the predicted promoter–enhancer interactions. Overall, 49% of the predicted enhancers are located within 120 Kb of their target promoters, with only about 15% regulating the nearest gene (56% regulate one of the 5 nearest genes). About 87% of the predicted interactions fall within a topological domain (compared to 60% at random), and 92% are contained within the first hierarchical merge of TADs. Similar statistics were obtained for additional Hi-C data sets analyzed (see below) in human and mouse: overall, 88% of predicted enhancers are within the same TAD, compared to 45% in random shuffles (Fig. 4).
Next, we calculated the distribution over the number of putative enhancers regulating each gene, and compared it to the distribution of randomly selected regions (equivalent to a “random set” of near promoter loci). As shown in Supplementary Fig. 7, we observed a much greater number of genes predicted to be regulated by multiple enhancer regions, compared to the random set. Our results show some genes to be regulated by ten or more enhancers. For example, 443 genes are predicted to have five brain enhancer regions (FDR < 1e-2), compared to only two in the randomized set, or three expected according to a binomial distribution.
A comprehensive catalog of human and mouse enhancers
To obtain a comprehensive list of putative enhancer regions, we gathered Hi-C data in 15 conditions and cell types in human and mouse, including mouse cortex and embryonic stem cells^{25}, mouse embryonic stem cells, neural progenitor cells (NPC), and neurons^{20}, and mouse B-lymphoblast (CH12LX) cells^{19}, as well as human embryonic stem cells and lung fibroblast IMR90 cells^{25}, GM12878 B-lymphoblastoid cells, and HMEC, HUVEC, IMR90, K562, KBM7, and NHEK cell lines^{19}. We then used PSYCHIC (with hierarchical TAD merging and bilinear power-law fit) to identify overrepresented interactions (up to 1 Mb) from promoter regions.
Globally, using an FDR threshold of 0.01, we predicted 267,938 putative enhancers (88,193 in mouse and 179,745 in human) that regulate a total of 25,783 genes (20,471 in mouse and 20,264 in human). A more stringent FDR threshold of 1e-4 yields 136,448 putative enhancer regions (38,405 and 98,043) regulating 21,435 genes (14,698 and 17,298 for mouse and human, respectively). These are summarized in Supplementary Table 1 (full lists in Supplementary Data 1, 2) or in our supplementary webpage www.cs.huji.ac.il/~tommy/PSYCHIC.
Comparison to other algorithms for enriched interactions
To test these predictions, we collected external ChIP-seq data in matching conditions, with which we can compare our predictions with their surrounding loci. In addition, we used previous sets of predicted DNA–DNA interactions for the same Hi-C data, by Fit-Hi-C^{62}, which uses a chromosome-wide statistical model (with no TAD resolution) to identify enriched Hi-C cells, and by HiCCUPS^{19}, where the enrichment of each Hi-C cell is computed based on its neighboring cells. For an unbiased and systematic comparison, we identified all DNA–DNA interactions that involve promoter loci, predicted by HiCCUPS (for human IMR90, GM12878, K562, HMEC, HUVEC and NHEK cell lines)^{19}, or Fit-Hi-C (human IMR90 cells, and mouse cortex and ES cells)^{62} and compared their ChIP-seq signal.
As shown in Fig. 5 and Supplementary Fig. 8, the predictions by PSYCHIC are generally more enriched (both in terms of absolute signal strength, and its genomic localization, or “sharpness”) for H3K27ac, DNaseI, and chromHMM’s “Strong Enhancer” class in matching cell types. We do observe, however, stronger enrichments for HiCCUPS’ and Fit-Hi-C’s predictions for both CTCF and chromHMM’s “Insulator” loci, suggesting that these methods, which are not TAD-specific, are possibly skewed by boundary elements, leading to overestimation of near-boundary interactions (Supplementary Fig. 8).
Enrichment of eQTLs and nuclei cryosectioning
To further test the quality of our predicted promoter–enhancer interactions, we computed their agreement with additional data sets. First, we analyzed the data from the Genotype-Tissue Expression (GTEx) Project (https://gtexportal.org), in which expression quantitative trait loci (eQTLs) were collected in multiple different human tissues by comparing the genotypes and expression level profiles in hundreds of donors^{63}. As we show in Fig. 6a, the majority of our promoter–enhancer predictions are supported by GTEx eQTL data. These include, for example, 55% of our GM12878 predictions (at FDR < 1e-2) compared to only 20% of the random interactions, or 29–35% of HiCCUPS promoter–enhancer interactions. More stringent PSYCHIC thresholds further improve this data set agreement: 58% of the predictions at FDR < 1e-4, or 63% of the predictions at FDR < 1e-10. Similar numbers are obtained for all other human data sets analyzed. These numbers also outperform Fit-Hi-C predictions: for example, GTEx data support 25% of the human ESC Fit-Hi-C promoter–enhancer predictions (at q value < 1e-10) compared with 46% for our 2075 predictions (at FDR < 1e-2), or 29% for their 866 predictions (at q < 1e-20) compared with 48% for our 833 predicted interactions (at FDR < 1e-4).
In addition, we compared our predictions with DNA–DNA interactions in mouse ESC, predicted using ultra-thin cryosectioning slices through a single nucleus, followed by sequencing^{64}. Here, we compared the average number of slices in which both the promoter and its predicted enhancer region are captured in the same slice. As shown in Fig. 6b, the 9771 promoter–enhancer interactions predicted by PSYCHIC for mouse ESC data (at FDR < 1e-2) are co-sequenced in an average of 41 slices (p < 5e-92 using random shuffles), or 42 slices on average for the 3908 predictions at a threshold of 1e-4, compared to an average of 30 slices for random interactions, or 35 slices on average among the 7164 promoter–enhancer predictions of Fit-Hi-C (at a threshold of 1e-10). These results further support our methodology and the biological significance of our predicted enhancer regions and their associated target genes.
Validation by capture Hi-C and ChIA-PET data
Finally, we compared our promoter–enhancer interactions with other proximity-ligation data sets, including Capture Hi-C (CHi-C) data from mouse ES cells^{21} and ChIA-PET data from GM12878 cells^{22}. The Capture Hi-C interactions show high support for the predicted interactions by PSYCHIC, with coverage ranging from 69% of PSYCHIC predicted interactions (in mESC, called using an FDR threshold of 1e-2) to 74% (threshold of 1e-4), compared to 52–66% of Fit-Hi-C predictions for mESC Hi-C data (Supplementary Fig. 9a). Next, we compared our predictions to ChIA-PET data in GM12878 cells^{22}. ChIA-PET interactions obtained using PolII antibodies showed high support for our promoter–enhancer predictions, covering 37% (PSYCHIC GM12878 predictions with threshold of 1e-2) to 55% (threshold of 1e-10); compared to 33–36% for HiCCUPS GM12878 calls (Supplementary Fig. 9b). Intriguingly, a higher portion of HiCCUPS calls (73%) was supported by the ChIA-PET data using CTCF antibodies, compared to ~34% for PSYCHIC. This is in line with the relative enrichment of CTCF ChIP-seq signal among HiCCUPS predictions (Fig. 5).
Interaction with inactive enhancers
Notably, most, but not all, putative enhancer regions show strong enrichment for active chromatin marks. For example, ~70% of the enhancers predicted with FDR < 1e-2 show increased accessibility compared to their flanking DNA regions (Fig. 3, “DNaseI”). Almost half (46%) of predicted enhancer regions show enrichment that is greater than one standard deviation compared to their flanking regions (32% > 2 SD). For comparison, only 43% of the randomly selected regions show increased accessibility, with only 24% exceeding one standard deviation (15% > 2 SD). Similar numbers are obtained for H3K27ac or CTCF.
This suggests that overrepresented DNA–DNA interactions (in Hi-C) are not limited to active and accessible regions, and raises the hypothesis that a non-trivial fraction of putative enhancer regions are “silent” and inaccessible. A closer examination identified several known enhancers even within those. For example, PSYCHIC identified the ZRS locus as interacting with the Shh gene, even in adult mouse cortex (Fig. 7). In the mouse, early developmental Shh expression is essential for autopod formation, regulated in developing limbs by the distal ZRS enhancer, located ~1 Mb away^{8,65}. Our results suggest that ZRS is in close physical proximity to Shh even in adult brain. Analysis of Hi-C data in mouse and human identifies similar interactions between Shh and ZRS in most mouse conditions (Supplementary Fig. 10a). This was recently validated by DNA FISH showing ZRS in the proximity of Shh throughout a variety of tissues and developmental stages, while not being in active transcription^{66}. Similarly, a cross-condition analysis of the promoter–enhancer interactions (predicted using PSYCHIC, GM12878, with a stringent threshold of FDR < 1e-10) shows that >25% of these putative interactions are predicted (by PSYCHIC) in at least three additional human Hi-C data sets (compared to only 3% in random; Supplementary Fig. 10b).
Discussion
In this work we presented PSYCHIC, a computational model for analyzing the HiC data to identify enriched DNA–DNA interactions. Using a probabilistic model and efficient algorithms, PSYCHIC identifies the optimal segmentation of chromosomes into topological domains, assembles them into hierarchical structures, and fits a TADspecific background model for the HiC data. By considering a “virtual 4C” plot for every gene, and using the background model for statistical assessments, our algorithm identified 267,938 significant overrepresented enhancer–promoter interactions in 15 HiC experiments in human and mouse.
To segment the genome into TADs, our algorithm uses a probabilistic twocomponent model that independently computes for every cell in the HiC matrix, the likelihood ratio between intraTAD and interTAD models. This score assigns similar importance to near and far DNA–DNA interactions, and is less affected by shortrange interactions that dominate HiC data, but are mostly invariant of topological domains. This additive score is easily computed from nested TADs, allowing for fast and scalable Dynamic Programming algorithm.
Our algorithm then computes for each TAD the average number of contacts at any distance. This spectrum was previously modeled using powerlaws, which we replaced by twosegment models, greatly improving the model accuracy. These results suggest a transition between two packaging mechanisms, typically at 100–300 Kb.
Currently, most HiC data are of 10–40 Kb resolution, hindering our ability to pinpoint promoter–enhancer interactions. Various methods (e.g., ChIPseq, accessibility, evolutionary conservation) could be applied to further identify enhancers in higher resolution. As more detailed HiC data are accumulated, PSYCHIC will offer more accurate predictions. While the running time of PSYCHIC is quadratic, it is scalable. Various heuristic assumptions (e.g., maximal size for subTADs) will dramatically speed it up, allowing for higher resolution analysis using future HiC data sets.
Ground-truth data for promoter–enhancer interactions are still limited, and we have taken multiple approaches to establish our predictions. First, we showed that the predicted enhancer regions are enriched for active marks (H3K27ac, H3K4me1, PolII), DNA accessibility, or CTCF. This was shown initially for a single locus (Foxg1) in the mouse cortex, and later supported in a genome-wide manner over multiple tissues. Comparison to previous methods, including HiCCUPS and Fit-Hi-C, generally showed stronger and sharper enrichment for PSYCHIC, as well as a general bias of other algorithms toward near-boundary interactions. Secondly, we used high-throughput eQTL data, linking genotypes and gene expression profiles in hundreds of donors, and intersected them with our predictions. As we show, about half of PSYCHIC’s predictions are supported, in a variety of cell types. Finally, we used recently published cryosections of nuclei, showing that predicted promoter–enhancer pairs are co-sliced more often than expected.
Intriguingly, a closer examination reveals that ~1/3 of predicted regions are inaccessible and bear no active chromatin marks. These include the ZRS locus, which acts as a limb-specific distal enhancer for Shh and is located nearly 1 Mb away. While the ZRS locus shows no accessibility or ChIP peaks in the mouse cortex, and is therefore predicted to be inactive, it presents a significant number of interactions with Shh. Indeed, Williamson et al.^{66} recently used FISH and 5C to show that ZRS and Shh are located in spatial proximity regardless of their activity.
These results suggest that the 3D structure of the genome may be organized to support regulatory DNA–DNA interactions, rather than merely reflect the set of accessible or active regions in the genome. As more Hi-C data are collected and analyzed, we hope to shed light on the causality of gene regulation and genome packaging, as well as the plasticity of genome packaging in general.
Put together, we demonstrated how Hi-C data, typically used to identify TAD boundaries, can be used to identify enriched DNA–DNA interactions, including thousands of putative enhancer regions, and to associate them with their target genes.
Methods
Modeling Hi-C data
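Intra-TAD Hi-C data are represented using a log-Normal distribution with two parameters (mean and standard deviation) for each distance d:
$$P(N \mid d, \mathrm{intra}) = \ln\mathcal{N}\left(N;\ \mu_d^{\mathrm{intra}}, \sigma_d^{\mathrm{intra}}\right)$$
where the log-Normal distribution with mean μ and standard deviation σ can be written as:
$$\ln\mathcal{N}(x;\ \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$
Inter-TAD Hi-C data are represented similarly:
$$P(N \mid d, \mathrm{inter}) = \ln\mathcal{N}\left(N;\ \mu_d^{\mathrm{inter}}, \sigma_d^{\mathrm{inter}}\right)$$
Bayes’ law could be used to derive the posterior probabilities of the intra-TAD:
$$P(\mathrm{intra} \mid N, d) = \frac{P(N \mid d, \mathrm{intra})\, P_d(\mathrm{intra})}{P(N \mid d, \mathrm{intra})\, P_d(\mathrm{intra}) + P(N \mid d, \mathrm{inter})\, P_d(\mathrm{inter})}$$
and inter-TAD models:
$$P(\mathrm{inter} \mid N, d) = \frac{P(N \mid d, \mathrm{inter})\, P_d(\mathrm{inter})}{P(N \mid d, \mathrm{intra})\, P_d(\mathrm{intra}) + P(N \mid d, \mathrm{inter})\, P_d(\mathrm{inter})}$$
given the number of interactions N at a given distance d, and the prior probabilities P_d(intra) and P_d(inter).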
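As a minimal sketch of the two-component model described above, the following Python code evaluates a log-Normal density and the posterior probability of the intra-TAD model for a single Hi-C cell. The function names and all parameter values are hypothetical, chosen only for illustration; they are not taken from the PSYCHIC code.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Density of a log-Normal distribution with log-scale mean mu and std sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def posterior_intra(n, intra_params, inter_params, prior_intra):
    """Posterior P(intra | N=n, d) via Bayes' law, for one distance d.

    intra_params and inter_params are (mu, sigma) pairs for that distance;
    the inter-TAD prior is taken as 1 - prior_intra.
    """
    joint_intra = lognormal_pdf(n, *intra_params) * prior_intra
    joint_inter = lognormal_pdf(n, *inter_params) * (1.0 - prior_intra)
    total = joint_intra + joint_inter
    if total == 0.0:
        return prior_intra  # no evidence either way: fall back to the prior
    return joint_intra / total


# Made-up parameters for a single distance d.
p = posterior_intra(n=20.0, intra_params=(3.0, 0.5),
                    inter_params=(2.0, 0.5), prior_intra=0.5)
print(round(p, 3))
```

In practice one would likely work with log-likelihood ratios rather than raw densities to avoid numerical underflow; the direct form is kept here for readability.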
Bilinear regression of log-intensity and log-distance
We model the Hi-C interaction intensity between two loci as a segmented power-law function of their distance. In log–log scale this is modeled by a two-piece segmented linear regression model. For this, we developed a computational algorithm (implemented in MATLAB) that iterates over candidate breakpoints and, for each, estimates the two parameters (intercept and slope) of each segment, keeping the breakpoint that minimizes the squared deviation from the data (in log–log scale). Similarly, a piecewise linear model was learned for the remaining inter-TAD regions (“Sky”).
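The two-piece fit described above can be sketched in a few lines of Python (the original implementation is in MATLAB and is not reproduced here). This sketch assumes an exhaustive search over breakpoints with an independent least-squares line on each side; the function names, the `min_points` parameter and the synthetic data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_two_segments(distances, intensities, min_points=2):
    """Search all breakpoints of a two-piece linear fit in log-log scale."""
    pairs = sorted(zip(distances, intensities))
    xs = [math.log(d) for d, _ in pairs]
    ys = [math.log(v) for _, v in pairs]
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        sse = left[2] + right[2]
        if best is None or sse < best["sse"]:
            best = {
                "break_distance": math.exp(xs[k]),  # first point of 2nd segment
                "left": left[:2],                   # (intercept, slope)
                "right": right[:2],
                "sse": sse,
            }
    return best


# Synthetic data: slope -0.5 up to distance 8 (in bins), slope -1.5 beyond.
dist = list(range(1, 21))
vals = [100 * d ** -0.5 if d <= 8 else 100 * 8 ** -0.5 * (d / 8) ** -1.5
        for d in dist]
fit = fit_two_segments(dist, vals)
print(fit["left"][1], fit["right"][1], fit["break_distance"])
```

The two segments are fitted independently here, so the fitted lines are not forced to meet at the breakpoint; a continuity constraint would be a different (and equally plausible) design.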
TAD merges
Neighboring TADs are merged into a hierarchical structure, according to a “merge score” that compares the mean Hi-C intensity per distance within the two underlying TADs, their inter-TAD area, and the null inter-TAD model (represented by α in Eq. 10). We then iteratively merge the two neighboring TADs whose merge area is the most similar, up to a maximal domain size of 5 Mb.
Random set of enhancers
To generate a random set of genomic loci along the genome while maintaining a similar distribution around gene promoters, we considered for each gene all genomic loci up to 1 Mb away (in either direction), and selected each with a probability of 1e-2.
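The sampling of random control loci can be sketched as follows. This is an illustration under stated assumptions rather than the original code: loci are taken to be fixed-size bins (the 40 Kb `bin_size` is an assumption), the selection probability is assumed to be 0.01, and the gene names and coordinates are invented.

```python
import random


def sample_random_loci(promoter_positions, bin_size=40_000,
                       max_dist=1_000_000, prob=0.01, seed=0):
    """Pick random (gene, locus) pairs within max_dist of each promoter.

    promoter_positions maps a gene name to its promoter coordinate (bp).
    Every bin within max_dist, on either side, is kept with probability prob.
    """
    rng = random.Random(seed)
    n_bins = max_dist // bin_size
    pairs = []
    for gene, tss in promoter_positions.items():
        tss_bin = tss // bin_size
        for offset in range(-n_bins, n_bins + 1):
            locus_bin = tss_bin + offset
            if offset == 0 or locus_bin < 0:
                continue  # skip the promoter bin and positions before the chromosome start
            if rng.random() < prob:
                pairs.append((gene, locus_bin * bin_size))
    return pairs


# Hypothetical promoter coordinates.
promoters = {"GeneA": 5_000_000, "GeneB": 12_340_000}
pairs = sample_random_loci(promoters)
print(pairs)
```

Because each locus is drawn relative to a promoter, the random set inherits the same distance distribution around genes as the candidate regions it is compared against.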
Statistical significance of ChIP-seq for putative enhancers
To estimate the statistical significance for the average ChIP-seq signal (or others) at putative enhancer regions (Fig. 3), we fitted a Normal distribution to the average ChIP-seq signals at distances >500 Kb from the predicted enhancers, then approximated the p value as the cumulative distribution function (CDF) given by the Normal distribution at the average ChIP-seq signal for predicted enhancer regions.
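A minimal sketch of this Normal approximation is given below. It assumes that enrichment is scored by the upper tail of the fitted Normal (one minus the CDF), which is an interpretation made for this example; the function name and the signal values are invented.

```python
import math
from statistics import mean, stdev


def normal_upper_tail_p(background, observed):
    """P(X >= observed) under a Normal fitted to the background values."""
    mu, sigma = mean(background), stdev(background)
    z = (observed - mu) / sigma
    # erfc keeps precision for very small tail probabilities,
    # where computing 1 - CDF directly would round to zero.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Made-up average signals far (>500 Kb) from predicted enhancers.
background = [1.0, 1.2, 0.8, 1.1, 0.9]
p_null = normal_upper_tail_p(background, 1.0)   # signal equal to the background mean
p_high = normal_upper_tail_p(background, 2.0)   # strongly elevated signal
print(p_null, p_high)
```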
Simulated Hi-C data
Hi-C matrices were simulated from the hierarchical TAD-specific fit model (from PSYCHIC), by resampling each Hi-C cell from a Poisson distribution with a parameter λ matching the expected mean number of DNA–DNA interactions.
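The resampling step can be sketched in plain Python as follows. The sampler is an assumption of this example: it uses Knuth's multiplication method for small λ and a rounded Normal approximation for large λ, and the expected-value matrix is a toy input rather than a PSYCHIC fit.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample."""
    if lam <= 0:
        return 0
    if lam > 50:
        # Normal approximation for large means (a shortcut of this sketch).
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:       # Knuth's method: multiply uniforms until
        k += 1                    # the product drops below exp(-lam)
        prod *= rng.random()
    return k


def simulate_hic(expected, seed=0):
    """Resample a symmetric contact matrix cell by cell from Poisson(expected)."""
    rng = random.Random(seed)
    n = len(expected)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = sample_poisson(expected[i][j], rng)
    return sim


# Toy expected matrix: contacts decay with distance between bins.
expected = [[20.0 / (1 + abs(i - j)) for j in range(5)] for i in range(5)]
sim = simulate_hic(expected)
for row in sim:
    print(row)
```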
Statistical enrichment score
To assign a statistical significance score (p value) for each putative enhancer (namely, an over-represented interaction between a promoter region and some other locus), we assumed a Normal distribution of the local residual map (i.e., Hi-C minus the PSYCHIC background model) in a 2 Mb window surrounding the promoter of each gene. We then fitted maximum-likelihood estimates for the mean value μ_i and its standard deviation σ_i, and used these statistics to translate the deviation of each Hi-C cell from its background model into z-scores. Finally, we assigned a p value for each z-score using a standard Normal cumulative distribution function, and applied an FDR correction for multiple hypotheses^{54}.
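The scoring steps above (z-scores from local residuals, Normal p values, FDR correction) can be sketched as follows. The helper names and the residual values are hypothetical, the p value is assumed to be the upper tail of the standard Normal, and the FDR step is written as the Benjamini–Hochberg step-up adjustment.

```python
import math
from statistics import mean, pstdev


def z_scores(residuals):
    """Standardize residuals using maximum-likelihood mean and std (divides by n)."""
    mu, sigma = mean(residuals), pstdev(residuals)
    return [(r - mu) / sigma for r in residuals]


def upper_tail_p(z):
    """P(Z >= z) for a standard Normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up residuals (Hi-C minus background) around one promoter.
residuals = [0.1, -0.2, 0.0, 0.3, 2.5, -0.1]
z = z_scores(residuals)
p = [upper_tail_p(v) for v in z]
q = benjamini_hochberg(p)
print([round(v, 4) for v in q])
```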
Hi-C data sources and preprocessing
Normalized Hi-C maps were analyzed. For Dixon et al.^{25}, normalized Hi-C data at 40 Kb resolution were obtained from the Ren lab website (http://chromosome.sdsc.edu/mouse/hic). For Rao et al.^{19}, processed data (intra-chromosomal, MAPQGE30, KR normalized) were downloaded from GEO (GSE63525), and downsampled from 5 Kb to 25 Kb resolution for higher coverage and more robust analysis. For Fraser et al.^{20}, processed and normalized Hi-C data were downloaded from GEO (GSE59027) at 50 Kb or 100 Kb resolution.
Statistical significance of SLICE data
To quantify the statistical significance of the average number of promoter–enhancer co-occurrences in the cryosectioning slices, we randomized our predictions 1000 times by shuffling the gene names (stratified by chromosome). We then computed the average slice co-occurrence in each shuffle. PSYCHIC predictions outperformed all 1000 shuffles, and obtained a Normal-distribution p value of 5e-92.
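A chromosome-stratified shuffle of this kind can be sketched as follows. The data layout (tuples of chromosome, gene and enhancer locus), the function names and the toy statistic are assumptions made for this example; the real statistic would be the average slice co-occurrence, which is not reproduced here.

```python
import math
import random
from statistics import mean, stdev


def shuffle_genes_within_chromosomes(predictions, rng):
    """Permute gene names among predictions on the same chromosome.

    predictions is a list of (chrom, gene, enhancer_locus) tuples.
    """
    by_chrom = {}
    for idx, (chrom, _, _) in enumerate(predictions):
        by_chrom.setdefault(chrom, []).append(idx)
    shuffled = list(predictions)
    for idxs in by_chrom.values():
        genes = [predictions[i][1] for i in idxs]
        rng.shuffle(genes)
        for i, gene in zip(idxs, genes):
            shuffled[i] = (predictions[i][0], gene, predictions[i][2])
    return shuffled


def permutation_test(predictions, statistic, n_shuffles=1000, seed=0):
    """Compare the observed statistic with its distribution over shuffles."""
    rng = random.Random(seed)
    observed = statistic(predictions)
    null = [statistic(shuffle_genes_within_chromosomes(predictions, rng))
            for _ in range(n_shuffles)]
    sd = stdev(null)
    z = (observed - mean(null)) / sd if sd > 0 else 0.0
    return {
        "observed": observed,
        "n_exceeding": sum(s >= observed for s in null),
        "normal_p": 0.5 * math.erfc(z / math.sqrt(2.0)),  # upper-tail Normal p value
    }


# Toy example: the statistic is the fraction of pairs found in a "supported" set.
preds = [("chr1", "A", 100), ("chr1", "B", 200), ("chr1", "C", 300),
         ("chr2", "D", 400), ("chr2", "E", 500)]
supported = {(gene, locus) for _, gene, locus in preds}


def frac_supported(ps):
    return sum((gene, locus) in supported for _, gene, locus in ps) / len(ps)


result = permutation_test(preds, frac_supported, n_shuffles=200, seed=1)
print(result)
```

The Normal p value complements the empirical count: with 1000 shuffles the empirical p value cannot go below about 1e-3, whereas fitting a Normal distribution to the shuffled statistics allows much smaller values to be reported.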
Code availability
PSYCHIC is publicly available via GitHub (https://github.com/dhkron/PSYCHIC).
Data availability
A full list of putative enhancer regions, as well as the genes they regulate, is available in Supplementary Table 1 and Supplementary Data 1, 2, and on our supplemental website at www.cs.huji.ac.il/~tommy/PSYCHIC. Also available on our website are saved UCSC Genome Browser sessions for mouse (mm9) and human (hg19).
References
1. Visel, A., Rubin, E. M. & Pennacchio, L. A. Genomic views of distant-acting enhancers. Nature 461, 199–205 (2009).
2. Bickmore, W. A. & van Steensel, B. Genome architecture: domain organization of interphase chromosomes. Cell 152, 1270–1284 (2013).
3. Rowley, M. J. & Corces, V. G. The three-dimensional genome: principles and roles of long-distance interactions. Curr. Opin. Cell. Biol. 40, 8–14 (2016).
4. Van Steensel, B. & Dekker, J. Genomics tools for unraveling chromosome architecture. Nat. Biotechnol. 28, 1089–1095 (2010).
5. Dekker, J. & Mirny, L. The 3D genome as moderator of chromosomal communication. Cell 164, 1110–1121 (2016).
6. Fraser, P. & Bickmore, W. Nuclear organization of the genome and the potential for gene regulation. Nature 447, 413–417 (2007).
7. Claussnitzer, M. et al. FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med. 373, 895–907 (2015).
8. Lettice, L. A. et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum. Mol. Genet. 12, 1725–1735 (2003).
9. Lupiáñez, D. G. et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 161, 1012–1025 (2015).
10. Franke, M. et al. Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538, 265–269 (2016).
11. Achinger-Kawecka, J. & Clark, S. J. Disruption of the 3D cancer genome blueprint. Epigenomics 9, 47–55 (2016).
12. Kieffer-Kwon, K.-R. et al. Interactome maps of mouse gene regulatory domains reveal basic principles of transcriptional regulation. Cell 155, 1507–1520 (2013).
13. Handoko, L. et al. CTCF-mediated functional chromatin interactome in pluripotent cells. Nat. Genet. 43, 630–638 (2011).
14. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet. 38, 1348–1354 (2006).
15. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
16. Jin, F. et al. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature 503, 290–294 (2013).
17. Lajoie, B. R., Dekker, J. & Kaplan, N. The Hitchhiker’s guide to Hi-C analysis: Practical guidelines. Methods 72, 65–75 (2015).
18. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat. Genet. 47, 598–606 (2015).
19. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
20. Fraser, J. et al. Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Mol. Syst. Biol. 11, 852 (2015).
21. Schoenfelder, S. et al. The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements. Genome Res. 25, 582–597 (2015).
22. Tang, Z. et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627 (2015).
23. Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015).
24. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
25. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
26. de Laat, W. & Duboule, D. Topology of mammalian developmental enhancers and their regulatory landscapes. Nature 502, 499–506 (2013).
27. Pope, B. D. et al. Topologically associating domains are stable units of replication-timing regulation. Nature 515, 402–405 (2014).
28. Dileep, V. et al. Topologically associating domains and their long-range contacts are established during early G1 coincident with the establishment of the replication-timing program. Genome Res. 25, 1104–1113 (2015).
29. Taberlay, P. C. et al. Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome Res. 26, 719–731 (2016).
30. Jager, R. et al. Capture Hi-C identifies the chromatin interactome of colorectal cancer risk loci. Nat. Commun. 6, 6178 (2015).
31. Vietri Rudan, M. et al. Comparative Hi-C reveals that CTCF underlies evolution of chromosomal domain architecture. Cell Rep. 10, 1297–1309 (2015).
32. Gómez-Marín, C. et al. Evolutionary comparison reveals that diverging CTCF sites are signatures of ancestral topological associating domains borders. Proc. Natl Acad. Sci. 112, 7542–7547 (2015).
33. Ryba, T. et al. Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res. 20, 761–770 (2010).
34. Symmons, O. et al. Functional and topological characteristics of mammalian regulatory domains. Genome Res. 24, 390–400 (2014).
35. Doyle, B., Fudenberg, G., Imakaev, M. & Mirny, L. A. Chromatin loops as allosteric modulators of enhancer-promoter interactions. PLoS Comput. Biol. 10, e1003867 (2014).
36. Zhang, Y. et al. Chromatin connectivity maps reveal dynamic promoter-enhancer long-range associations. Nature 504, 306–310 (2013).
37. Blinka, S., Reimer, M. H., Pulakanti, K. & Rao, S. Super-Enhancers at the nanog locus differentially regulate neighboring Pluripotency-Associated genes. Cell Rep. 17, 19–28 (2016).
38. Fulco, C. P. et al. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science 354, 769–773 (2016).
39. Ing-Simmons, E. et al. Spatial enhancer clustering and regulation of enhancer-proximal genes by cohesin. Genome Res. 25, 504–513 (2015).
40. Zuin, J. et al. Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc. Natl Acad. Sci. 111, 996–1001 (2014).
41. Demare, L. E. et al. The genomic landscape of cohesin-associated chromatin interactions. Genome Res. 23, 1224–1234 (2013).
42. Nichols, M. H. & Corces, V. G. A CTCF code for 3D genome architecture. Cell 162, 703–705 (2015).
43. Ong, C.-T. & Corces, V. G. CTCF: an architectural protein bridging genome topology and function. Nat. Rev. Genet. 15, 234–246 (2014).
44. Seitan, V. C. et al. Cohesin-based chromatin interactions enable regulated gene expression within pre-existing architectural compartments. Genome Res. 23, 2066–2077 (2013).
45. Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).
46. Lévy-Leduc, C., Delattre, M., Mary-Huard, T. & Robin, S. Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics 30, i386–i392 (2014).
47. Xu, Z., Zhang, G., Wu, C., Li, Y. & Hu, M. FastHiC: a fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data. Bioinformatics 32, 2692–2695 (2016).
48. Adhikari, B., Trieu, T. & Cheng, J. Chromosome3D: reconstructing three-dimensional chromosomal structures from Hi-C interaction frequency data using distance geometry simulated annealing. BMC Genomics 17, 886 (2016).
49. Chen, J., Hero, A. O. 3rd & Rajapakse, I. Spectral identification of topological domains. Bioinformatics 32, 2151–2158 (2016).
50. Filippova, D., Patro, R., Duggal, G. & Kingsford, C. Identification of alternative topological domains in chromatin. Algorithms Mol. Biol. 9, 14 (2014).
51. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B Stat. Methodol. 39, 1–38 (1977).
52. Naumova, N. et al. Organization of the mitotic chromosome. Science 342, 948–953 (2013).
53. Mirny, L. A. The fractal globule as a model of chromatin architecture in the cell. Chromosome Res. 19, 37–51 (2011).
54. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Methodol. 57, 289–300 (1995).
55. Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).
56. Visel, A. et al. A high-resolution enhancer atlas of the developing telencephalon. Cell 152, 895–908 (2013).
57. Visel, A. et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40, 158–160 (2008).
58. Mouse ENCODE Consortium. et al. An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 13, 418 (2012).
59. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
60. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
61. Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
62. Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 24, 999–1011 (2014).
63. GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
64. Beagrie, R. A. et al. Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524 (2017).
65. Sagai, T., Hosoya, M., Mizushina, Y., Tamura, M. & Shiroishi, T. Elimination of a long-range cis-regulatory module causes complete loss of limb-specific Shh expression and truncation of the mouse limb. Development 132, 797–803 (2005).
66. Williamson, I., Lettice, L. A., Hill, R. E. & Bickmore, W. A. Shh and ZRS enhancer co-localisation is specific to the zone of polarizing activity. Development 143, 2994–3001 (2016).
67. Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
68. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA elements) project. Science 306, 636–640 (2004).
69. Bernstein, B. E. et al. The NIH Roadmap epigenomics mapping consortium. Nat. Biotechnol. 28, 1045–1048 (2010).
70. Roadmap Epigenomics Consortium. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
71. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
Acknowledgements
We would like to thank Nir Friedman, Eran Rosenthal, Shira Strauss, and members of the Kaplan lab for helpful discussions and comments. T.K. is a member of the Israeli Center of Excellence (I-CORE) for Gene Regulation in Complex Human Disease (no. 41/11) and the Israeli Center of Excellence (I-CORE) for Chromatin and RNA in Gene Regulation (no. 1796/12). This research was also supported by a Marie Curie Career Integration Grant (PCIG13-GA-2013-618327), and an Israel Science Foundation grant (no. 913/15) to T.K. Y.G. is supported by a Leibniz Fellowship.
Author information
Contributions
Conceived and designed the method: G.R. and T.K. Implementation: G.R. Analyzed the data: G.R., Y.G., D.M., and T.K. Wrote the paper: G.R. and T.K.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ron, G., Globerson, Y., Moran, D. et al. Promoter-enhancer interactions identified from Hi-C data using probabilistic models and hierarchical topological domains. Nat Commun 8, 2237 (2017). https://doi.org/10.1038/s41467-017-02386-3