Abstract
Although Hi-C technology is one of the most popular tools for studying 3D genome organization, due to sequencing cost, the resolution of most Hi-C datasets is coarse and cannot be used to link distal regulatory elements to their target genes. Here we develop HiCPlus, a computational approach based on deep convolutional neural networks, to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data. We demonstrate that HiCPlus can impute interaction matrices highly similar to the original ones while using only 1/16 of the original sequencing reads. We show that models learned from one cell type can be applied to make predictions in other cell or tissue types. Our work not only provides a computational framework to enhance Hi-C data resolution but also reveals features underlying the formation of 3D chromatin interactions.
Introduction
The high-throughput chromosome conformation capture (Hi-C) technique^{1} has emerged as a powerful tool for studying the spatial organization of chromosomes, as it measures all pairwise interaction frequencies across the entire genome. In the past several years, the Hi-C technique has facilitated several exciting discoveries, such as A/B compartments^{1}, topologically associating domains (TADs)^{2,3}, chromatin loops^{4}, and frequently interacting regions (FIREs)^{5}, and has therefore significantly expanded our understanding of three-dimensional (3D) genome organization^{1,2,4} and the gene regulation machinery^{6}. Hi-C data are usually presented as an n × n contact matrix, where the genome is divided into n equally sized bins and the value in each cell of the matrix indicates the number of paired-end reads spanning a pair of bins. Depending on sequencing depth, commonly used bin sizes range from 1 kb to 1 Mb. The bin size of a Hi-C interaction matrix is also referred to as its 'resolution', which is one of the most important parameters in Hi-C data analysis, as it directly affects the results of downstream analyses, such as predicting enhancer–promoter interactions or identifying TAD boundaries. Sequencing depth is the most crucial factor in determining the resolution of Hi-C data: the higher the depth, the higher the resolution (smaller bin size).
Owing to high sequencing cost, most available Hi-C datasets have relatively low resolution, such as 25 or 40 kb, as a linear increase in resolution requires a quadratic increase in the total number of sequencing reads^{6}. These low-resolution Hi-C datasets can be used to define large-scale genomic patterns such as A/B compartments or TADs, but cannot be used to identify more refined structures such as sub-domains or enhancer–promoter interactions. Therefore, it is urgent to develop a computational approach that takes full advantage of currently available Hi-C datasets to generate higher-resolution Hi-C interaction matrices.
Recently, deep learning has achieved great success in several disciplines^{7,8,9}, including computational epigenomics^{10,11,12,13}. In particular, the deep convolutional neural network (ConvNet)^{7,14}, which is inspired by the organization of the animal visual cortex^{14,15,16}, has driven major advances in computer vision and natural language processing^{7}. In computational biology and genomics, ConvNets have been successfully applied to predict the potential function of DNA sequences^{17,18,19,20,21,22} and DNA methylation or gene expression patterns^{23,24,25,26}.
In this work, we propose HiCPlus, the first approach to infer high-resolution Hi-C interaction matrices from low-resolution or insufficiently sequenced Hi-C samples. Our approach is inspired by recent advances^{27,28,29,30} in single-image super-resolution and can generate Hi-C interaction matrices of similar quality to the original ones while using as few as 1/16 of the sequencing reads. We observe that Hi-C matrices are composed of a series of low-level local patterns, which are shared across all cell types. We systematically applied HiCPlus to generate high-resolution matrices for 20 tissue/cell types (Supplementary Table 1) where only low-resolution Hi-C datasets are available, covering a large variety of human tissues. In summary, this work provides a rich resource for the study of chromatin interactions, establishes a framework to predict high-resolution Hi-C matrices at a fraction of the sequencing cost, and identifies potential features underlying the formation of 3D chromatin interactions.
Results
Overview of HiCPlus framework
Figure 1 illustrates the overall framework of HiCPlus. To train the ConvNet model, we first generate a high-resolution matrix (10 kb) from deeply sequenced Hi-C data, such as those from GM12878 or IMR90 cells. Next, we downsample the sequencing reads to 1/16 and construct another interaction matrix at the same resolution, which consequently contains more noise and more blurred patterns. We then fit the ConvNet model using the value at each position in the high-resolution matrix as the response variable and its neighbouring points in the downsampled matrix as the predictors (Fig. 1a). Our goal is to investigate whether the ConvNet framework can accurately predict values in the high-resolution matrix from values in the low-resolution matrix. Notably, although both matrices are technically at the same resolution, we consider the downsampled interaction matrix 'low resolution', as in practice it would usually be processed at a lower resolution owing to its shallower sequencing depth. In this paper, we use 'low-resolution' and 'insufficiently sequenced' interchangeably.
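The 1/16 downsampling step can be sketched with binomial thinning, which keeps each read of a contact-matrix cell independently with probability 1/16. This is a minimal illustration, not the pipeline's code: the actual procedure downsamples at the read level before binning, which cell-level thinning approximates.

```python
import numpy as np

def downsample_matrix(counts, fraction=1 / 16, seed=0):
    """Simulate a low-coverage Hi-C contact matrix by keeping each read
    independently with probability `fraction` (binomial thinning)."""
    rng = np.random.default_rng(seed)
    return rng.binomial(counts.astype(np.int64), fraction)

# toy symmetric 10 kb contact matrix
high = np.array([[320, 64], [64, 160]])
low = downsample_matrix(high)  # noisier matrix at the same bin size
```

The thinned matrix keeps the same bin size but, with only ~1/16 of the counts, shows the increased noise described above.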
We present a conceptual view of the ConvNet in Fig. 1b; it learns the mapping between the high-resolution and low-resolution Hi-C matrices at the feature level. Once the model is trained, we can apply it to enhance any Hi-C interaction matrix with low sequencing depth. HiCPlus divides the entire Hi-C matrix into small square samples and enhances them separately. After each block of interactions is predicted, the blocks are merged into a chromosome-wide interaction matrix (Fig. 1c). The detailed structure of the ConvNet is described in Supplementary Fig. 1, and a more detailed description of the algorithm is provided in the Methods section.
Chromatin interactions are predictable from neighbouring regions
Our hypothesis is that the Hi-C matrix contains repeating local patterns, and that the interaction intensity at each point is not independent of its local neighbouring regions. Therefore, we should be able to predict the interaction frequency of any cell in the Hi-C matrix from the interaction frequencies of its neighbouring regions. To test this hypothesis, we trained a ConvNet model on chromosomes 1–17 and systematically predicted interaction matrices on chromosomes 18–22, using the 10 kb resolution Hi-C data in GM12878 cells^{4}. To evaluate the performance of our ConvNet model, we computed both the Pearson and Spearman correlation coefficients between the predicted and real values at each genomic distance.
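The per-distance evaluation can be sketched as follows: for each genomic distance d, collect the d-th off-diagonals of the predicted and real matrices and correlate them. This is an illustrative sketch (Spearman correlations could equally be taken from `scipy.stats`):

```python
import numpy as np

def per_distance_pearson(pred, real, max_dist_bins):
    """Pearson correlation between predicted and real contact values,
    computed separately at each genomic distance (matrix off-diagonal)."""
    corrs = {}
    for d in range(1, max_dist_bins + 1):
        x = np.diagonal(pred, offset=d)
        y = np.diagonal(real, offset=d)
        if x.std() > 0 and y.std() > 0:  # correlation undefined otherwise
            corrs[d] = float(np.corrcoef(x, y)[0, 1])
    return corrs

# a perfect prediction yields correlation ~1 at every distance
mat = np.arange(400, dtype=float).reshape(20, 20)
corrs = per_distance_pearson(mat, mat, 5)
```

Correlating each distance separately avoids the distance-decay effect inflating a single genome-wide correlation.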
An important parameter in our model is the size of the neighbouring region: intuitively, using a larger surrounding matrix to predict the value of one point should yield better results. We therefore tested a range of neighbouring matrix sizes, from 3 × 3 to 15 × 15. Indeed, we observed that using a larger neighbouring matrix generally increases prediction accuracy. With a 13 × 13 surrounding matrix, the Pearson correlations between the predicted and real interaction frequencies are consistently higher than those obtained with smaller surrounding matrices at every genomic distance. For example, the Pearson correlations at 40 kb genomic distance for the 13 × 13, 7 × 7 and 3 × 3 matrices are 0.93, 0.92 and 0.89, respectively (Fig. 2). However, prediction accuracy reaches a plateau at the 13 × 13 matrix, and further increasing the size of the surrounding matrix yields little if any improvement (Supplementary Fig. 2).
For comparison, we first tried a simple approach: predicting each interaction frequency using the average value of its surrounding matrix. After testing a series of block sizes, we found that 3 × 3 gives the best performance for this averaging-based method (Supplementary Fig. 3). Next, we compared the ConvNet with a few other commonly used methods, such as two-dimensional (2D) Gaussian smoothing and Random Forest. We observed that the ConvNet performs best among all these methods (Fig. 2).
Finally, we compared training one model for the whole matrix with training a separate model for each genomic distance. It is known that Hi-C interaction matrices exhibit distance decay: the further a bin is from the diagonal, the smaller its value tends to be. We therefore trained a set of models, each used to predict chromatin interactions at one given distance (e.g., 10 kb, 20 kb, …). However, this approach did not improve prediction accuracy (Supplementary Fig. 2), indicating that our current model already incorporates the distance effect and that it is not necessary to train different models for different genomic distances.
Enhancing chromatin interaction matrices with low sequencing depth
Having established that values in a Hi-C matrix can be predicted from their surrounding regions, we next investigated whether we can predict these values from insufficiently sequenced samples. We first trained and tested our HiCPlus model in the same cell type, using the high-resolution Hi-C data in GM12878 cells (accession code GSE63525)^{4}. We constructed the 10 kb resolution matrix using all the reads (Fig. 3a, right panel). We then downsampled the reads to 1/16 of the original sequencing depth and constructed an interaction matrix at the same resolution (Fig. 3a, left panel). The newly generated matrix contains considerable noise, and TAD structures are less clear. Next, we fit a ConvNet model using the value at each bin in the high-quality matrix as the response variable and its neighbouring 13 × 13 points in the downsampled matrix as predictors. We used chromosomes 1–7 as the training set and chromosome 13 as the validation set to obtain the optimal hyperparameters. We then applied the model to enhance the downsampled interaction matrix of chromosome 18. An example of a ConvNet-enhanced matrix is shown in Fig. 3a (middle panel). The HiCPlus-enhanced matrix is highly similar to the real high-resolution Hi-C matrix. Compared with the matrix generated from downsampled reads, it contains much less noise, and both individual chromatin loops and TAD structures are more visible.
To quantitatively evaluate the performance of HiCPlus, we computed the Pearson and Spearman rank correlations between the experimental high-resolution matrix and the downsampled, 2D-Gaussian-smoothing-enhanced and HiCPlus-enhanced matrices at each genomic distance. As shown in Fig. 3b and Supplementary Fig. 4, the HiCPlus-enhanced matrix shows much higher correlation with the real high-resolution Hi-C matrix than the downsampled matrix does at all genomic distances. Surprisingly, the correlations between the HiCPlus-enhanced matrix and the real high-resolution Hi-C matrix are nearly as high as those between two real high-resolution matrices from two biological replicates of the same cell type (Fig. 3b), suggesting that the ConvNet framework can reconstruct a high-resolution interaction matrix using only a fraction of the total sequencing reads. We thoroughly tested the model and found that its performance is consistent across different chromosomes (Supplementary Fig. 5).
To compare the deep convolutional neural network with other approaches, we first implemented several image-denoising methods, including 2D Gaussian smoothing, 2D average smoothing and anisotropic diffusion (Supplementary Fig. 6). We tested and selected the best parameters for 2D Gaussian smoothing (Supplementary Fig. 7) and used the parameters suggested for 2D average smoothing in previous work^{31}. To compare with non-deep-learning frameworks, we also implemented a Random Forest regressor, using the default parameters from scikit-learn^{32}. Among all the methods, HiCPlus has the best performance, followed by Gaussian smoothing and Random Forest (Fig. 3, Supplementary Fig. 6).
It has been shown that there are systematic biases in Hi-C data^{33,34}, such as GC content, restriction fragment size and mappability. Applying systematic normalization can remove these biases and generate more accurate contact maps. We therefore also investigated whether HiCPlus can be used to enhance normalized Hi-C matrices. As shown in Supplementary Fig. 8, HiCPlus can also be applied to increase the resolution of a normalized Hi-C matrix.
Enhancing Hi-C interaction matrices across different cell types
A key application of HiCPlus is to enhance the resolution of existing low-resolution Hi-C matrices from previous studies^{2,35,36,37,38,39,40,41,42,43,44} using models trained on cell types for which high-resolution Hi-C data are available^{4,38}. The results can also be used to address whether the low-level local patterns of the Hi-C matrix are shared across different cell types. First, we trained the ConvNet model on three different cell types (GM12878, K562, IMR90)^{4} with similar sequencing depths and tested their prediction performance in K562 cells. Following the procedure described in the previous section, we downsampled the Hi-C reads in K562 to 1/16 and then applied the ConvNet to enhance its interaction matrix. As shown in Fig. 4a, the enhanced Hi-C matrices from the three different training datasets are highly similar to each other. More importantly, all of them are also similar to the original high-resolution interaction matrix (Fig. 4a, c), suggesting that the local patterns/features captured by the ConvNet framework from different Hi-C matrices are highly similar and can be used to enhance Hi-C matrices in other cell types.
To further validate this observation, we trained the ConvNet model on GM12878 cells and applied it to enhance Hi-C matrices in three different cell types (GM12878, K562, IMR90). Again, we found that the ConvNet-enhanced Hi-C matrices are highly similar to the real high-resolution Hi-C matrices. An example is shown in Fig. 4b, where the chromatin interaction patterns are highly dynamic across the different cell types. The ConvNet-enhanced matrices accurately depict these differences and help demonstrate these cell-type-specific TADs and chromatin interactions. Finally, we applied HiCPlus to systematically enhance the low-resolution Hi-C interaction matrices previously generated in 20 tissues/cell types^{2,35,36,37,38,39,40,41,42,43,44}.
To predict such datasets, we first trained models for different sequencing depths (from ×4 to ×16). We then generate the 10 kb Hi-C interaction matrix from the BAM file of each Hi-C library. To determine the enhancement scale, we calculate the ratio of effective sequencing depth between the candidate Hi-C matrix and the training Hi-C matrix within genomic distances of 25 kb to 1 Mb. If the sequencing depth of the candidate Hi-C matrix is <1/16 of that of the training Hi-C matrix, we use the ×16 model.
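The scale-selection rule above can be sketched as follows. This is a sketch under stated assumptions: `effective_depth` sums contacts in the 25 kb–1 Mb band as a depth proxy, and the available model scales are taken to be ×4, ×8 and ×16 for illustration.

```python
import numpy as np

def effective_depth(matrix, bin_size=10_000, min_dist=25_000, max_dist=1_000_000):
    """Proxy for effective sequencing depth: total contact count within
    the 25 kb - 1 Mb genomic-distance band."""
    total = 0.0
    for d in range(min_dist // bin_size, max_dist // bin_size + 1):
        total += np.diagonal(matrix, offset=d).sum()
    return total

def pick_scale(candidate_depth, training_depth, scales=(4, 8, 16)):
    """Pick the smallest trained enhancement scale that covers the depth
    ratio; anything sparser than 1/16 falls back to the x16 model."""
    ratio = training_depth / max(candidate_depth, 1e-9)
    for s in sorted(scales):
        if ratio <= s:
            return s
    return max(scales)
```

For example, a candidate matrix with 1/10 the training depth would be enhanced with the ×16 model under this rule, since the ×4 and ×8 models do not cover that ratio.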
Identifying chromatin interactions in HiCPlus-enhanced matrices
It has been shown that strong chromatin interactions (loops) are enriched for important regulatory elements, such as enhancers and promoters^{4}. After demonstrating that HiCPlus can transform low-resolution Hi-C data into a high-resolution interaction matrix, we investigated whether these enhanced matrices can facilitate the identification of meaningful chromatin interactions. For this purpose, we used the Fit-Hi-C^{45} software, which adjusts for the random polymer looping effect and estimates the statistical confidence of intra-chromosomal interactions. We applied Fit-Hi-C to the real high-resolution, 1/16 downsampled and HiCPlus-enhanced interaction matrices at 10 kb resolution in the K562 cell line. We kept the predicted significant interactions (q-value < 1e−06) at genomic distances from 30 kb to 500 kb for further comparative analysis. We first observed that 72.37% (6340/8760) of the peaks identified in the HiCPlus-enhanced matrix were also identified in the true high-resolution matrix (Supplementary Fig. 9, Supplementary Table 2). We then investigated whether the predicted chromatin interactions from the three matrices are enriched for potential functional elements annotated by ChromHMM^{46}. As shown in Fig. 5a, significant interactions from the real high-resolution and HiCPlus-enhanced matrices show similar patterns: they are enriched for active states, such as enhancer-associated states ('Weak Enhancer', 'Active Enhancer 1&2', 'Bivalent Enhancer' and 'Genic Enhancer 1&2') and promoter-associated states ('Flanking TSS Upstream', 'Flanking TSS Downstream' and 'Active TSS'), while depleted of inactive states, such as quiescent and heterochromatin-associated states ('Quiescent/Low' and 'Heterochromatin'). In contrast, the interactions identified in the downsampled Hi-C matrix show patterns discrepant with those identified in the real high-resolution Hi-C matrix. For example, they are enriched for heterochromatin and show minimal, if any, enrichment for active transcription start sites (TSSs), suggesting that interactions identified from the downsampled matrix are of poor quality and might give false information if analysed at this resolution^{47}.
Next, we compared the predicted chromatin interactions from the real high-resolution, downsampled and HiCPlus-enhanced matrices with chromatin loops identified by CTCF chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) in the same cell type. We used the CTCF-mediated chromatin loops identified by ChIA-PET as true positives. As negatives, we randomly selected the same number of pairs of CTCF-binding sites that are not predicted as interacting pairs by ChIA-PET (Methods). We then plotted the AUC (area under the curve) to evaluate the performance of our model. As shown in Fig. 5b, CTCF interacting and non-interacting pairs are well separated in the predictions from the HiCPlus-enhanced matrix (average AUC = 0.85). We also observed that the AUC for the HiCPlus-enhanced matrix is significantly higher than that for the downsampled matrix (p-value < 0.05). Finally, we compared the overlap between significant interactions identified in the three interaction matrices and the ChIA-PET-identified interactions (Fig. 5c). 79% and 67% of ChIA-PET-predicted loops can be recovered by interactions identified in the real and HiCPlus-enhanced matrices, respectively, while only 9% of the ChIA-PET interactions can be recovered from the downsampled Hi-C matrix, demonstrating again that analysis of the downsampled matrix at this resolution is not reliable. To further show the power of the HiCPlus framework, we applied it to enhance the Hi-C dataset from aorta tissue, for which only low-resolution (40 kb) matrices are available (Fig. 5d). By comparing with chromatin interactions from Capture Hi-C, we observe that the HiCPlus-enhanced matrix captures significant interactions between the MYC promoter and cis-regulatory elements that are missed or unresolved by the low-resolution Hi-C matrix.
For example, multiple Capture Hi-C interactions map to the same 40 kb bin and are thus unresolvable in the low-resolution Hi-C matrix (yellow dots on the second 4C track). However, these interactions are captured by the enhanced matrix, suggesting that HiCPlus can improve the resolution of the Hi-C interaction matrix and reveal meaningful interactions that are missed in the original low-resolution Hi-C data.
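The AUC evaluation against ChIA-PET loops described above can be sketched without any ML library: AUC equals the Mann–Whitney probability that a positive (ChIA-PET loop) receives a higher score than a negative pair. This is an illustrative sketch; in practice the scores would come from the Fit-Hi-C output.

```python
def auc_score(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney U / (n_pos * n_neg)), counting ties as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect separation of loop and non-loop scores gives 1.0; random scoring gives about 0.5.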
In summary, the ConvNet framework can significantly improve the quality of the interaction matrix for insufficiently sequenced Hi-C samples and further facilitate the identification of biologically meaningful interactions that are enriched for potential functional elements and validated by other techniques.
Discussion
Here we present HiCPlus, the first computational approach to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data. Our framework can construct an interaction matrix of similar quality using only 1/16 or even fewer of the original sequencing reads. We systematically applied HiCPlus to generate high-resolution matrices for 20 tissue/cell types where only low-resolution Hi-C data are available, covering a large variety of human tissues.
We observe that Hi-C interaction matrices are composed of a series of low-level repeating local patterns, which are shared across all cell types and tissues. These features can be effectively captured by our ConvNet framework and used to enhance Hi-C matrices in different cell types. However, most of these local patterns remain black boxes in the intermediate convolutional layers and are therefore not human-interpretable. We hypothesize that these features are related to important functions in 3D genome organization, such as chromatin loops and TADs. More work on visualizing and interpreting these features is imperative and will be of great value in deepening our understanding of higher-order genome organization and gene regulation.
Another caveat is the ground truth used for training and evaluation in the ConvNet framework. Throughout the analyses in this work, we used the real high-resolution Hi-C matrix as the ground truth/gold standard. However, there is natural variation even between high-resolution interaction matrices from different biological replicates of the same cell type. In the functional enrichment analysis (Fig. 5a), the significant interactions in the ConvNet-enhanced matrix are more enriched for some epigenetic markers than those from the real high-resolution Hi-C matrix. In addition, previous work in other disciplines^{8,47,48,49} has reported that introducing noise in the training process can increase the prediction accuracy of a deep learning model. It is possible that the deep ConvNet model can distinguish noise from real signal in Hi-C matrices, which contributes to the interaction matrix enhancement. Indeed, we observe that HiCPlus performs better than other methods at identifying significant chromatin interactions (Supplementary Fig. 10). Further investigation is needed to validate and interpret these interesting observations, and the results might shed light on how to improve the computational model and deepen our understanding of chromatin interactions.
It is interesting that the Gaussian kernel and deep learning achieve comparable results, especially when evaluated by genome-wide correlation. In this project, our primary goal is to enhance a low-resolution Hi-C matrix to a higher-resolution one, which is essentially an image-enhancement problem. It is therefore not surprising that Gaussian kernel and diffusion-based methods perform solidly, as they have been widely used for such problems. On the other hand, as shown in Supplementary Fig. 10, the deep-learning framework performs better at significant chromatin interaction regions, and HiCPlus-predicted values are closer to the values in the experimental high-resolution Hi-C matrix. Finally, although Gaussian smoothing can provide solid performance for enhancing Hi-C matrices, it does not provide any additional biologically meaningful information. HiCPlus, in contrast, is trained by learning patterns from the training datasets, and these patterns are used to enhance the Hi-C matrix in the prediction process. In future studies, we will further study these patterns and hope to recover more biologically meaningful interpretations of the results.
In summary, HiCPlus presents the first deep learning framework for enhancing the resolution of Hi-C interaction matrices. By leveraging interaction frequencies from neighbouring regions and learning regional patterns from available high-resolution Hi-C data, HiCPlus can generate high-resolution Hi-C interaction matrices with a fraction of the original sequencing reads. With the fast accumulation of Hi-C data in different cell lines and tissue types, we provide a rich resource and a powerful tool for the study of 3D genome organization and gene regulation.
Methods
HiCPlus workflow
Step 1, Pre-processing the Hi-C matrix: many currently available Hi-C data, especially in human tissues^{2,5,35}, are only available as 40 kb resolution matrices. For these datasets, we start from the BAM file and generate 10 kb resolution interaction matrices. Consequently, we observe an increased noise-to-signal ratio compared with deeply sequenced Hi-C libraries. In the training stage, we start from high-resolution Hi-C data (such as GM12878 from GSE63525) and use random downsampling to simulate the low-resolution Hi-C matrix. After this step, all input matrices are at 10 kb resolution. As mentioned previously, we consider the matrices generated from downsampled sequencing reads 'low resolution', since in practice they would have been processed at a lower resolution at those sequencing depths.
Step 2: Divide the Hi-C matrix into multiple square subregions of fixed size, each of which is treated as one sample. Unless otherwise noted, each subregion is 0.4 Mb × 0.4 Mb, which contains 40 × 40 = 1600 pixels at 10 kb resolution. We only investigate and predict chromatin interactions where the genomic distance between the two loci is <2 Mb, as the average size of TADs is <1 Mb and there are few meaningful interactions outside TADs.
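Step 2 can be sketched as follows, cutting overlapping 40 × 40 samples from the diagonal band. This is a minimal sketch: the stride of 28 and the 200-bin band width (<2 Mb at 10 kb) are illustrative choices, not the released implementation.

```python
import numpy as np

def divide_matrix(mat, size=40, stride=28, max_dist_bins=200):
    """Cut a chromosome-wide matrix into overlapping size x size samples
    along the diagonal band (genomic distance < max_dist_bins bins,
    i.e. < 2 Mb at 10 kb resolution)."""
    samples, coords = [], []
    n = mat.shape[0]
    for i in range(0, n - size + 1, stride):
        for j in range(i, min(i + max_dist_bins, n - size + 1), stride):
            samples.append(mat[i:i + size, j:j + size])
            coords.append((i, j))
    return np.array(samples), coords

mat = np.arange(100 * 100, dtype=float).reshape(100, 100)
samples, coords = divide_matrix(mat)
```

Choosing a stride smaller than the sample size makes neighbouring samples overlap, which compensates for the border trimmed off each sample during prediction (Step 4).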
Step 3: The deep ConvNet is trained to learn the relationship between the low-resolution samples (i.e., same-size but insufficiently sequenced samples) and the high-resolution samples in the training stage, and predicts high-resolution samples from low-resolution samples in the production stage.
Step 4: The predicted high-resolution sub-matrices are merged into a chromosome-wide Hi-C interaction matrix. As each sample has a surrounding padding region that is removed during prediction by the ConvNet, proper overlap is necessary when dividing the Hi-C interaction matrix into samples in Step 1.
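Step 4 can be sketched as the inverse operation: each predicted 28 × 28 block is written back at the centre of its 40 × 40 source sample, with a 6-pixel border on each side (12 pixels in total) supplied by the overlap from Step 1. A minimal sketch; the function and variable names are illustrative.

```python
import numpy as np

def merge_predictions(blocks, coords, n, size=40, pad=6):
    """Assemble a chromosome-wide matrix from predicted sub-matrices.
    Each block is (size - 2*pad) square and is placed at the centre of
    its original sample position (i, j)."""
    out = np.zeros((n, n))
    inner = size - 2 * pad
    for block, (i, j) in zip(blocks, coords):
        out[i + pad:i + pad + inner, j + pad:j + pad + inner] = block
    return out

# one 28 x 28 prediction placed back into a 40 x 40 chromosome matrix
out = merge_predictions([np.ones((28, 28))], [(0, 0)], n=40)
```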
ConvNet structure
For the ConvNet, the input is a list of low-resolution samples of size N × N each. To avoid border effects, similar to Dong's work^{30}, we did not add zero padding to any convolutional layer, so the output for each sample has a smaller size. Therefore, the output is a list of predicted high-resolution samples of size (N − padding) × (N − padding), where N = 40 and padding = 12 in the typical setting discussed here, and the input 40 × 40 matrix and output 28 × 28 matrix are registered at the same central location. The shrunken size is offset by the overlap introduced during the dividing process.
We denote the ConvNet model as F, the low-resolution input as X, the predicted high-resolution output as F(X) and the real high-resolution Hi-C matrix as Y (Y is also regarded as the ground truth in this section). Mean square error (MSE) is used as the loss function in the training process. Therefore, the goal of training is to find the F that minimizes the MSE:
$$\mathop{\mathrm{argmin}}\limits_{\Theta} \frac{1}{m}\sum\nolimits_{i = 1}^{m} \left\| F\left( X_i \right) - Y_i \right\|^2$$

As shown in Fig. 1b, the ConvNet in HiCPlus has three layers, which serve to extract and represent patterns in the low-resolution matrix, to non-linearly map patterns in the low-resolution matrix to the high-resolution matrix, and to combine the high-resolution patterns to generate the predicted matrix, respectively. Below we describe each layer in detail.
Pattern extraction and representation
In this stage, the input is the low-resolution matrix X, and the output is generated by the following formula:
$$F_1\left( X \right) = \max\left( 0,\, w_1 \ast X + b_1 \right)$$

where ∗ denotes the convolution operation, X is the input matrix, b_{1} is the bias and w_{1} is a set of n_{1} filters, each of size f_{1} × f_{1}. Here n_{1} and f_{1} are the filter number and filter size, respectively. Both n_{1} and f_{1} are hyperparameters of the ConvNet, and we set n_{1} to 16 and f_{1} to 5. As shown in Supplementary Fig. 1c, HiCPlus is not sensitive to these hyperparameters. The rectified linear unit (ReLU)^{50} is used as the non-linear activation function.
Non-linear mapping between patterns on high- and low-resolution maps
This stage is shown in the middle part of Fig. 1b, where the patterns in the low-resolution matrix are mapped non-linearly to the patterns in the high-resolution matrix using the formula:
$$F_2(X) = \max\left( 0,\, w_2 \ast F_1(X) + b_2 \right)$$

where F_1(X) is the output of the previous layer, b_{2} is the bias and w_{2} is a set of n_{2} filters, each of size f_{2} × f_{2}. In this layer, we set n_{2} to 16 and f_{2} to 1, as this is a non-linear mapping step.
Combining patterns to predict high-resolution maps
We employ the following formula to generate the predicted high-resolution Hi-C matrix from the output of the second layer:
$$F_3(X) = w_3 \ast F_2(X) + b_3$$

where F_2(X) is the output of the previous layer, b_{3} is the bias and w_{3} is a set of n_{3} filters of size f_{3} × f_{3}. In this step, no non-linear activation function is required, and the filter number n_{3} is set to 1 to generate the predicted result.
Overall, the function F has parameters Θ = {w_{1}, w_{2}, w_{3}, b_{1}, b_{2}, b_{3}}. The goal of training is to obtain the optimal Θ that minimizes the MSE over the samples in the training set. We employ standard backpropagation^{14} with gradient descent to train the network and use stochastic gradient descent^{51} as the update strategy. The initial parameters are drawn from a uniform distribution with Glorot's strategy^{52} unless otherwise noted. The training converged, and no overfitting was observed (Supplementary Fig. 11).
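The three layers above can be sketched as a plain NumPy forward pass. This is an illustrative sketch of inference only, not the released training code: 'valid' convolution means no padding, so each layer with filter size f shrinks the map by f − 1, and the weights below are random stand-ins (the f_3 = 5 choice is an assumption for illustration).

```python
import numpy as np

def conv_valid(x, filters, bias):
    """'Valid' (no-padding) 2D convolution.
    x: (n_in, H, W); filters: (n_out, n_in, f, f); bias: (n_out,)."""
    n_out, n_in, f, _ = filters.shape
    h, w = x.shape[1] - f + 1, x.shape[2] - f + 1
    out = np.zeros((n_out, h, w))
    for o in range(n_out):
        for c in range(n_in):
            for di in range(f):
                for dj in range(f):
                    out[o] += filters[o, c, di, dj] * x[c, di:di + h, dj:dj + w]
        out[o] += bias[o]
    return out

def hicplus_forward(x, params):
    """Three-layer forward pass: ReLU after F1 and F2, linear F3."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(0.0, conv_valid(x, w1, b1))  # F1: pattern extraction
    h = np.maximum(0.0, conv_valid(h, w2, b2))  # F2: non-linear mapping
    return conv_valid(h, w3, b3)                # F3: reconstruction

rng = np.random.default_rng(0)
params = [
    (rng.normal(scale=0.1, size=(16, 1, 5, 5)), np.zeros(16)),   # n1=16, f1=5
    (rng.normal(scale=0.1, size=(16, 16, 1, 1)), np.zeros(16)),  # n2=16, f2=1
    (rng.normal(scale=0.1, size=(1, 16, 5, 5)), np.zeros(1)),    # n3=1, f3=5 (illustrative)
]
pred = hicplus_forward(rng.normal(size=(1, 40, 40)), params)
```

With these illustrative filter sizes, a 40 × 40 input shrinks to 32 × 32 (N minus the summed f − 1 terms); the 28 × 28 output quoted above corresponds to a total trimmed border of 12 pixels.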
We note that sequencing depth has a great impact on the performance of HiCPlus. In this work, to generate enhanced matrices for the 20 human tissue/cell types, we trained three models on the IMR90 cell line, depending on the sequencing depth of the Hi-C data in each tissue/cell type: >80 million, 50–80 million and <50 million reads (a more detailed breakdown is given in Supplementary Table 1). All three models were trained on chromosomes 1–8 and tested on chromosome 18. To achieve the best results, individual users are recommended to retrain the model according to their sequencing depth: simply downsample the Hi-C reads in GM12878 or IMR90 to match the read number and run our pipeline to train the model.
Testing hyperparameter settings for HiCPlus
To find the optimal setting for HiCPlus, we implemented a convolutional neural network with two layers and another model with three layers but without ReLU activation. We observed that the performance of these three models (two layers vs. three layers without ReLU vs. three layers with ReLU) is almost indistinguishable when evaluated by per-distance correlations between the enhanced and original matrices (Supplementary Fig. 12, red, blue and yellow dashed lines). However, when we zoomed in and inspected individual chromatin loops, we noticed that, compared with the three-layer network with ReLU activation, the other two models show reduced intensities at chromatin loop regions (white circles in Supplementary Fig. 13a, b). For example, the colours of the high-resolution and HiCPlus matrices are both red (high intensity), while the colours in the other two models are blue (low intensity). We show another similar example in Supplementary Fig. 13b.
We also evaluated whether pooling layers could improve the performance of HiCPlus. We tried both max pooling and mean pooling and show the results in Supplementary Fig. 14. We did not notice any difference in performance: the correlation from the models with pooling layers is no better than that of the model without pooling. Therefore, there is no obvious advantage to using a pooling layer.
In the current implementation of the software, we do not use an interpolation step; instead, we directly process the low-resolution Hi-C data and generate an interaction matrix with the same number of bins as the high-resolution Hi-C data. We also tried using a low-resolution interpolated matrix as input, but observed that interpolation did not perform as well as using the low-resolution matrix directly. As shown in Supplementary Figs. 15 and 16, simple bicubic interpolation (yellow line) and bicubic interpolation followed by a ConvNet (black line) both perform well, but not as well as HiCPlus.
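For reference, the interpolation baseline can be reproduced with SciPy's cubic spline zoom, used here as a stand-in for bicubic upsampling (the 10x10 matrix and 4x factor are illustrative):

```python
import numpy as np
from scipy.ndimage import zoom

# Upscale a coarse 10x10 contact matrix by 4x with cubic spline
# interpolation, a stand-in for the bicubic baseline discussed above.
low_res = np.random.default_rng(0).random((10, 10))
upscaled = zoom(low_res, 4, order=3)  # order=3 gives cubic interpolation
```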
Data availability
Source code is publicly available in the GitHub repository at https://github.com/zhangyan32/HiCPlus. Enhanced Hi-C datasets can be downloaded from http://promoter.bx.psu.edu/public/HiCPlus/matrix/ and can be visualized in the 3D Genome Browser (http://3dgenome.org).
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
2. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
3. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
4. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
5. Schmitt, A. D. et al. A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Rep. 17, 2042–2059 (2016).
6. Schmitt, A. D., Hu, M. & Ren, B. Genome-wide mapping and analysis of chromosome architecture. Nat. Rev. Mol. Cell Biol. 17, 743–755 (2016).
7. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
8. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
9. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
10. Angermueller, C., Pärnamaa, T., Parts, L. & Oliver, S. Deep learning for computational biology. Mol. Syst. Biol. 12, 878 (2016).
11. Koh, P. W., Pierson, E. & Kundaje, A. Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics 33, i225–i233 (2017).
12. Schreiber, J., Libbrecht, M., Bilmes, J. & Noble, W. Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. Preprint at https://doi.org/10.1101/103614 (2017).
13. Liu, F., Li, H., Ren, C., Bo, X. & Shu, W. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci. Rep. 6, 28517 (2016).
14. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
15. Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
16. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M. & Poggio, T. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 29, 411–426 (2007).
17. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
18. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
19. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
20. Zeng, H., Edwards, M. D., Liu, G. & Gifford, D. K. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics 32, i121–i127 (2016).
21. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
22. Zhou, J., Lu, Q., Xu, R., Gui, L. & Wang, H. CNNsite: prediction of DNA-binding residues in proteins using Convolutional Neural Network with sequence features. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 78–85 (IEEE Computer Society, 2016).
23. Singh, R., Lanchantin, J., Robins, G. & Qi, Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32, i639–i648 (2016).
24. Angermueller, C., Lee, H., Reik, W. & Stegle, O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 18, 67 (2017).
25. Min, X., Chen, N., Chen, T. & Jiang, R. DeepEnhancer: predicting enhancers by convolutional neural networks. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 637–644 (IEEE, 2016).
26. Zhang, Y., Yamaguchi, R., Imoto, S. & Miyano, S. Sequence-specific bias correction for RNA-seq data using recurrent neural networks. BMC Genomics 18, 1044 (2017).
27. Glasner, D., Bagon, S. & Irani, M. Super-resolution from a single image. In Proceedings of the IEEE International Conference on Computer Vision 349–356 (IEEE, 2009).
28. Yang, J., Wright, J., Huang, T. & Ma, Y. Image super-resolution as sparse representation of raw image patches. In 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 1–8 (IEEE, 2008).
29. Dong, C., Loy, C. C., He, K. & Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 295–307 (2016).
30. Dong, C., Loy, C. C., He, K. & Tang, X. Learning a deep convolutional network for image super-resolution. In Computer Vision – ECCV 2014 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 184–199 (Springer, Cham, 2014).
31. Yang, T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Preprint at bioRxiv 101386 (2017).
32. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
33. Yaffe, E. & Tanay, A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, 1059–1065 (2011).
34. Hu, M. et al. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics 28, 3131–3133 (2012).
35. Dixon, J. R. et al. Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331–336 (2015).
36. Fraser, J. et al. Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Mol. Syst. Biol. 11, 1–14 (2015).
37. Nagano, T. et al. Comparison of Hi-C results using in-solution versus in-nucleus ligation. Genome Biol. 16, 175 (2015).
38. Jin, F. et al. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature 503, 290–294 (2013).
39. Selvaraj, S., Dixon, J. R., Bansal, V. & Ren, B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat. Biotechnol. 31, 1111–1118 (2013).
40. Leung, D. et al. Integrative analysis of haplotype-resolved epigenomes across human tissues. Nature 518, 350–354 (2015).
41. Seitan, V. C. et al. Cohesin-based chromatin interactions enable regulated gene expression within preexisting architectural compartments. Genome Res. 23, 2066–2077 (2013).
42. Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
43. Tang, Z. et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627 (2015).
44. Sofueva, S. et al. Cohesin-mediated interactions organize chromosomal domain architecture. EMBO J. 32, 3119–3129 (2013).
45. Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 24, 999–1011 (2014).
46. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
47. Schram, R. D., Barkema, G. T. & Schiessel, H. On the stability of fractal globules. J. Chem. Phys. 138, 224901 (2013).
48. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
49. Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L. & Fergus, R. Learning from noisy labels with deep neural networks. Preprint at https://arxiv.org/abs/1406.2080 (2015).
50. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. of the 27th International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).
51. Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proc. of the Twenty-first International Conference on Machine Learning 116 (2004).
52. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics (eds Teh, Y. W. & Titterington, M.) 249–256 (PMLR, 2010).
Acknowledgements
This work was supported by the National Science Foundation [1161586 to Y.Z. and J.T., 1339470 to J.Z.]; National Institutes of Health (NIH) [1R35GM124820, R01HG009906, U01CA200060 and R24DK106766 to F.Y.; U54DK107977 to M.H.] and National Key R&D Program of China [2017YFC0908400 to J.T.]. J.T. is also supported by the National Science Foundation of China (NSFC61772362). We are grateful to the NVIDIA Corporation for the donation of a TITAN X GPU card through an NVIDIA Hardware Grant.
Author information
Author notes
Yan Zhang and Lin An contributed equally to this work.
Affiliations
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29208, USA
 Yan Zhang
 & Jijun Tang
Bioinformatics and Genomics Program, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
 Lin An
 , Bo Zhang
 & Feng Yue
Department of Biochemistry and Molecular Biology, College of Medicine, The Pennsylvania State University, Hershey, PA, 17033, USA
 Jie Xu
 & Feng Yue
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
 W. Jim Zheng
Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
 Ming Hu
School of Computer Science and Technology, Tianjin University, 300072, Tianjin, China
 Jijun Tang
Tianjin University Institute of Computational Biology, Tianjin University, 300072, Tianjin, China
 Jijun Tang
Contributions
Y.Z. and L.A. implemented the HiCPlus software. Y.Z., J.T. and F.Y. designed the study. J.T. and F.Y. supervised the project. Y.Z., L.A., J.T. and F.Y. wrote the manuscript with input from all the authors.
Competing interests
The authors declare no competing financial interests.
Corresponding authors
Correspondence to Jijun Tang or Feng Yue.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.