Introduction

Hi-C, an extension of the chromosome conformation capture (3C) assay, is a biological assay used to inspect the three-dimensional (3D) architecture of a genome1. Hi-C data can be used for downstream analysis of structural features of chromosomes such as A/B compartments, Topologically Associated Domains (TADs), loops, and 3D chromosome and genome models. Changes in chromosomal conformation have been empirically demonstrated to impact a variety of genomic processes including gene methylation and gene expression2.

When analyzing Hi-C data, reads are usually converted into contact matrices, where each cell entry corresponds to the quantity of contacts between the two regions indexed by row and column3,4. The size of an individual region in this contact matrix is referred to as the resolution or bin size4; the smaller the bin size, the higher the resolution. The resolution of a contact matrix is usually selected based on the quantity of read pairs in an individual Hi-C experiment, with a higher quantity of read pairs permitting a higher resolution. Certain genomic features, such as TADs, can only be meaningfully identified using high resolution contact matrices; however, if a resolution is selected without sufficient read coverage, the resulting matrices can be overly sparse. One way to address this issue is to run additional Hi-C experiments, but because of experimental costs this is not always feasible.

To solve this problem previous groups have utilized methods from the field of image super-resolution to improve Hi-C contact matrix resolution. The first of these networks was HiCPlus5, a simple neural network optimized using mean squared error (MSE). HiCPlus was improved upon by HiCNN6 through adjustments to the network architecture. Next, hicGAN7 was proposed, introducing the use of Generative Adversarial Networks (GANs) to generate high resolution contact maps conditioned on low resolution input. The network DeepHiC8 maintained the GAN loss function while extending it to include a perceptual loss derived from VGG-16 trained on image data. The model HiCSR9 continued this advancement by introducing a deep autoencoder as a feature extraction mechanism.

Our network, the Variationally Encoded Hi-C Loss Enhancer (VEHiCLE), extends the approach of conditional generative adversarial networks using an integrated training approach inspired by literature in the domains of deep learning and genomics. First, VEHiCLE incorporates a variational autoencoder which extracts biologically meaningful features from Hi-C data. Second, VEHiCLE’s decoder network is engineered to provide an easy-to-use generative model for Hi-C data generation which smoothly maps user-tunable, low-dimensional vectors to Hi-C contact maps independent of any low sampled input. Third, VEHiCLE incorporates a biologically explicit loss function based on Topologically Associated Domain identification to ensure accurate downstream genomic analysis.

VEHiCLE obtains state-of-the-art results in the task of Hi-C super-resolution across a variety of metrics drawn from the domains of image analysis and Hi-C quality/reproducibility. VEHiCLE enhanced data show successful retrieval of important downstream structures such as TADs and 3D models, while also providing novel human interpretability of the enhancement process.

Approach

Description of VEHiCLE network training

VEHiCLE is trained as an adversarial network conditioned on low resolution input. The network is trained using a composite loss function made up of four sub-loss functions: adversarial loss, variational loss, mean squared error (MSE) loss, and insulation loss. An overview of the training mechanism is displayed in Fig. 1a. The motivation for each of these loss functions is outlined below.

Figure 1

VEHiCLE architecture: (a) overview of training strategy, (b) generator architecture, (c) discriminator architecture.

Adversarial loss function

Generative adversarial networks (GANs) are a popular deep learning framework for generative modeling which has gained traction in a wide variety of tasks, including image super-resolution. GANs were first introduced to the field of Hi-C super-resolution through hicGAN, and later improved upon in DeepHiC and HiCSR. A GAN uses two key networks: a generator G (Fig. 1b) and a discriminator D (Fig. 1c). The generator takes samples from an input distribution and generates enhanced matrices. The discriminator is trained on a collection of inputs including real high resolution Hi-C samples as well as enhanced resolution Hi-C samples, and attempts to determine whether individual samples are real or enhanced. The two networks are trained in a game where the generator is rewarded for successfully tricking the discriminator and the discriminator tries to minimize classification mistakes.

$${\min}_{G}{\max}_{D}V(D,G) = {\mathbb{E}}_{x\sim {p}_{data}(x)}[\log(D(x))] +{\mathbb{E}}_{z\sim {p}_{z}(z)}[\log(1-D(G(z)))]$$
(1)

The generator loss function is defined as:

$${L}_{adv}=\sum_{n=1}^{N}-\log(D(G({X}_{low})))$$
(2)
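As a minimal PyTorch sketch (assuming a generator `G` and a discriminator `D` whose final activation is a sigmoid, so outputs lie in (0, 1); the function names and interface are illustrative, not VEHiCLE's published API), the generator's adversarial term in Eq. (2) can be written as:

```python
import torch

def adversarial_loss(D, G, x_low, eps=1e-8):
    """Generator adversarial term: -log(D(G(x_low))), summed over the batch.

    D is assumed to end in a sigmoid so its outputs lie in (0, 1);
    eps guards against log(0).
    """
    enhanced = G(x_low)                        # enhanced contact matrices
    return -torch.log(D(enhanced) + eps).sum()
```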

Variational loss

Autoencoders are deep learning systems which map inputs from sample space to a condensed latent space via an encoder and then reconstruct images in sample space from the latent space using a decoder. The use of autoencoders for Hi-C super-resolution was originally proposed in our preprint10 for the task of denoising Hi-C data. Dimmick et al.9 then suggested using the features extracted by passing Hi-C data through a trained autoencoder as a loss function for training super-resolution networks. In this manuscript we expand upon this strategy, but replace their network with a different flavor of network called the variational autoencoder11.

Similar to vanilla autoencoders, variational autoencoders (VAEs) aim to condense data into a lower dimensional space; however, they have the advantage of providing smooth feature representations which can permit the construction of powerful generative models. To obtain these advantages VAEs rely upon a statistical method called variational inference11. This method frames the tasks of encoding and decoding as an ancestral sampling problem with two steps: first, a latent variable, z, is sampled from a prior distribution \({P}_{\theta }(z)\); second, the observed variable, x, is drawn from a likelihood distribution \({P}_{\theta }(x|z)\).

Encoding the observed variable, x, requires computation of the posterior distribution \({P}_{\theta }(z|x)\). However, because this is computationally intractable, one instead approximates the posterior by choosing a parametric family of recognition models \({q}_{\phi }(z|x)\) and selecting parameters that minimize the divergence between the recognition model and the true underlying distribution via a probabilistic dissimilarity metric called the KL-divergence,

$${D}_{kl}({q}_{\phi }(z|x)||{p}_{\theta }(z))= \sum_{z}{q}_{\phi }(z|x)\log \left(\frac{{q}_{\phi }(z|x)}{{p}_{\theta }(z)}\right)$$
(3)

By performing some algebra outlined in Kingma and Welling11, variational autoencoders are trained using the following loss function:

$$L(\theta ,\phi ,x) = -{D}_{kl}({q}_{\phi }(z|x)||{p}_{\theta }(z))+{\int }_{z}{q}_{\phi }(z|x)\log({p}_{\theta }(x|z))dz$$
(4)

The integral term on the far right of the loss function ensures that the reconstruction outputs of our networks are highly similar to their original inputs, while the KL divergence term causes the latent space distribution of values to closely resemble a vector of Gaussian random variables. This imposition of Gaussian similarity on the latent space improves the quality of extracted features and yields a generative model.
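For concreteness, a common closed-form instantiation of Eq. (4) for a Gaussian posterior and standard-normal prior is sketched below (the binary cross-entropy reconstruction term is an assumption that fits sigmoid-bounded [0,1] outputs, not necessarily VEHiCLE's exact choice):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO for q(z|x) = N(mu, diag(exp(logvar))) against a
    standard-normal prior p(z) = N(0, I)."""
    # KL(q(z|x) || N(0, I)) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Reconstruction term; BCE suits decoder outputs bounded to [0, 1]
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    return recon + kl
```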

To create the variational loss function, we first train our variational autoencoder using high resolution contact matrices as both inputs and labels. In each experiment our VAE network is trained using the same chromosomes as the overall VEHiCLE network. The variational autoencoder maps vectors from data space into a condensed latent space, which we interpret as a lower dimensional feature vector (Fig. 2a,b). Because the variational autoencoder training strategy imposes a Gaussian distribution on the latent space variables, and because our decoder maps latent vectors back into data space in a relatively smooth manner, we expect highly similar Hi-C contact matrices to have similar latent space profiles.

Figure 2

Variational autoencoder. (a) Overview of variational autoencoder approach. (b) VEHiCLE architecture. TAD loss is evaluated using a feedforward implementation of insulation loss, computing (c) the insulation vector, (d) the delta vector, and (e) identification of TAD boundaries.

We compute variational loss by passing both the enhanced Hi-C contact matrix and the target high resolution Hi-C contact matrix through the backpropagatable encoder component of our variational autoencoder network, extracting latent dimensional representations. We then compute the mean difference between their latent feature vectors,

$${L}_{vae} = \frac{1}{n}\sum_{i=1}^{n} \left |{f}_{encode}{({X}_{enhanced})}_{i} - {f}_{encode}{({X}_{target})}_{i} \right |$$
(5)

where \({f}_{encode}\) is the encoding function, defined by our trained encoder network.
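A minimal sketch of Eq. (5), assuming `encoder` is the trained, frozen VAE encoder and returns one latent feature vector per sample (the interface is illustrative):

```python
import torch

def variational_loss(encoder, x_enhanced, x_target):
    """Mean absolute difference between latent feature vectors (Eq. 5)."""
    z_enh = encoder(x_enhanced)     # gradients flow through the enhanced branch
    with torch.no_grad():
        z_tgt = encoder(x_target)   # target features need no gradient
    return torch.mean(torch.abs(z_enh - z_tgt))
```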

Insulation score loss

Most of the previously proposed loss functions for developing Hi-C enhancement networks draw upon loss functions prolific in the fields of computer vision5,6,7,8,9. While there are certainly advantages to these strategies, they derive from assumed similarities between the tasks of image super-resolution and Hi-C super-resolution. However, the tasks are not synonymous. Hi-C contact matrices contain important information used for downstream feature analysis such as loop calling, TAD identification and 3D model construction. Consequently, enhanced matrices which are highly visually similar to their targets but which are blurry, shift the positions of structural features, or contain noise might result in significant differences in downstream analysis. With this fact in consideration, we used domain knowledge of computational genomics to devise an insulation loss function, which directly trains networks to correctly identify downstream features, specifically TAD placement.

One well-established strategy for the identification of TADs is the use of insulation scores12. Insulation scores of a matrix are calculated by sliding a 20 bin (200 kb × 200 kb) window down the diagonal of the matrix and summing the signal within each window, resulting in an insulation vector (Fig. 2c). This insulation vector is normalized by taking the log2 ratio of each bin’s insulation score and the mean of all insulation scores on the chromosome. From the insulation vector a delta vector is computed by observing the change in signal strength 100 kb downstream and upstream of each bin on the insulation vector (Fig. 2d). This delta vector is treated as a pseudo-derivative, and insulation valleys are identified in the regions where the delta vector crosses the x-axis from negative to positive values, indicating a relative minimum in insulation. A TAD boundary is assigned to each insulation valley whose difference in strength between the nearest left local maximum and right local minimum is > 0.1 (Fig. 2e).

The insulation TAD calling procedure can be encoded into a single, backpropagatable network up until extraction of the delta vector (Fig. 2c,d). We define insulation loss,

$${L}_{ins}=\frac{1}{n}\sum_{i=1}^{n} \left |{D}_{vec}{({X}_{enhanced})}_{i} -{D}_{vec}{({X}_{target})}_{i} \right |$$
(6)

where \({D}_{vec}\) is a backpropagatable network which maps a contact matrix to a delta insulation vector.
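A minimal differentiable sketch of \({D}_{vec}\) is given below; boundary handling and the delta computation are simplified relative to Crane et al.12 (window sizes are in bins, assuming 10 kb resolution):

```python
import torch

def delta_insulation_vector(mat, win=20, delta=10, eps=1e-8):
    """Map an (n, n) contact matrix to a delta insulation vector (Fig. 2c,d).

    Every operation (slicing, mean, log2, subtraction) is differentiable,
    so the result can be used directly inside a training loss.
    """
    n = mat.shape[0]
    # Insulation: mean signal in a win x win square sliding along the diagonal
    ins = torch.stack([mat[i - win:i, i:i + win].mean()
                       for i in range(win, n - win)])
    # Normalize by the mean insulation across the region (log2 ratio)
    ins = torch.log2(ins / (ins.mean() + eps) + eps)
    # Delta: signal `delta` bins downstream minus `delta` bins upstream
    m = ins.shape[0]
    return ins[2 * delta:] - ins[:m - 2 * delta]
```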

Bin-wise mean squared error loss

Bin-wise mean squared error loss is a thoroughly tested loss function used in previous Hi-C enhancement literature6,8,9. It contributes to maintaining visual similarity between enhanced and target Hi-C contact matrices.

$${L}_{mse}=\frac{1}{{n}^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} {\left({X}_{enhanced}^{ij} - {X}_{target}^{ij}\right)}^{2}$$
(7)

Composite training function

To capitalize on the advantages of all four loss functions we incorporate them into a comprehensive training process. First, the variational network is trained on the train and validation datasets. Then the trained encoder is used for \({L}_{vae}\), which is combined with the three other training losses to train the generator network, yielding our overall loss function

$${L}_{tot} = {\lambda }_{adv}{L}_{adv}+{\lambda }_{mse}{L}_{mse}+{\lambda }_{vae}{L}_{vae}+{\lambda }_{ins}{L}_{ins}$$
(8)

where \({\lambda }_{x}\) are hyperparameters used to determine loss contribution. We use \({\lambda _{adv}} = 0.0025,\;\;{\lambda _{mse}} = 1,\;\;{\lambda _{vae}} = 0.01,\;\;{\lambda _{ins}} = 1.\)
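Assembled in code, the composite objective of Eq. (8) is simply a weighted sum of the four terms (a sketch using the loss helpers above; the weights are the published values):

```python
# Published loss weights
LAMBDA_ADV, LAMBDA_MSE, LAMBDA_VAE, LAMBDA_INS = 0.0025, 1.0, 0.01, 1.0

def total_loss(l_adv, l_mse, l_vae, l_ins):
    """Composite generator objective (Eq. 8)."""
    return (LAMBDA_ADV * l_adv + LAMBDA_MSE * l_mse
            + LAMBDA_VAE * l_vae + LAMBDA_INS * l_ins)
```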

Results

Latent space representations permit generation of synthetic Hi-C data

The KL divergence term in the loss function of our variational autoencoder imposes constraints on the latent dimension, pushing our approximate posterior \(q(z|x)\) towards a vector distribution of Gaussian random variables. Because all latent vector variables fall within Gaussians centered around 0, most vectors near the center of these Gaussians can be successfully decoded into Hi-C space, resulting in a generative model for Hi-C data. We first perform principal component analysis (PCA) on our training set’s learned latent dimensional features. We then create a function mapping PCA values to the latent dimensional space, and use our trained decoder network to transform values in latent dimensional space into Hi-C space (Fig. 3a). The result is a function mapping a profile of PCA values to a 2.57 Mb × 2.57 Mb block of Hi-C data. We hook this function into an interactive matplotlib widget, permitting manual visualization of changes to generated Hi-C data as input variables are adjusted. In our widget we set a NUM_SLIDERS = 15 parameter to permit manual tuning of the leading PCA vector components. The widget passes a vector to our mapping function with user selected values in all manually adjusted components and dataset averages for all components that are not manually adjusted or are above the NUM_SLIDERS index threshold. The selection of 15 is arbitrary and can be increased by users interested in viewing the impact of higher PC values on the generated matrix structure.
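A sketch of the mapping behind the widget follows; `z_train` (the training set's latent vectors as a NumPy array) and `decoder` (the trained VAE decoder) are assumed to exist, and the slider wiring itself is omitted:

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

pca = PCA().fit(z_train)   # PCA over the learned latent vectors (z_train assumed)

NUM_SLIDERS = 15           # number of manually tunable leading components

def pcs_to_hic(slider_values, decoder):
    """Map user-set leading PC values to a synthetic Hi-C block.

    Components beyond NUM_SLIDERS stay at the dataset average,
    which is 0 in centered PC space.
    """
    pcs = np.zeros(pca.n_components_)
    pcs[:NUM_SLIDERS] = slider_values
    z = pca.inverse_transform(pcs.reshape(1, -1))   # PC space -> latent space
    z = torch.tensor(z, dtype=torch.float32)
    with torch.no_grad():
        return decoder(z).squeeze().numpy()         # latent -> Hi-C space
```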

Figure 3

(a) Diagram of synthetic Hi-C generation tool: a user-tunable zero-centered feature vector is transformed via PCA reverse transform to latent space and then passed through our tuned decoder network. (b) The 0 vector corresponds to a purely linear contact map. (c) Increasing the value of PC5 results in generation of TADs. (d) Adjusting the value of PC2 shifts the position of TADs. (e) Adjusting PC11 creates stripes within TADs. (f) Adjusting PC14 develops loops within TADs.

The zero vector results in a vanilla Hi-C map, with interaction frequency between two regions following the inverse of genomic distance (Fig. 3b). The biological interpretation of some adjustable features remains elusive, with changes to vector component values resulting merely in changes of diagonal signal strength or sporadic repositioning of contact regions. However, we observe that many of the tunable feature vector components correspond directly to biologically meaningful features in Hi-C space, such as formation of TADs, increasing TAD size (Fig. 3c), increasing TAD frequency, shifting TAD position (Fig. 3c,d), formation of genomic stripes (Fig. 3e)13 and formation of chromatin loops14 (Fig. 3f).

Low resolution Hi-C contact matrices enhanced by VEHiCLE appear visually competitive with other enhancement algorithms

We generate visual heatmaps of Hi-C contact maps of the GM12878 dataset using VEHiCLE as well as three other previously developed algorithms: HiCSR, DeepHiC and HiCPlus. We observe high visual similarity between reconstructions by VEHiCLE and the other enhancement algorithms (Fig. 4a). We also subtracted high resolution contact maps from each tool’s reconstructions to obtain visual difference matrices (Fig. 4b). Visually, VEHiCLE appears competitive with existing algorithms.

Figure 4

Visual comparison of deep learning based methods for contact matrix enhancement. (a) Visual comparison of enhanced matrices. (b) Absolute difference matrices between target high resolution data and enhancements. All displayed matrices are derived from the GM12878 cell line. Architectures of previous models utilize their original window sizes.

Notes on evaluation metrics

One of the major differences between the VEHiCLE algorithm and previous Hi-C enhancement tools is that our architecture is trained to enhance 2.69 Mb × 2.69 Mb regions along the diagonals of contact maps, rather than splitting contact maps into 0.4 Mb × 0.4 Mb pieces, enhancing in a piecemeal fashion, and then reassembling (see “Methods”). This permits the inclusion of more comprehensive information, like TAD structure, in training samples. However, it is possible to expand older architectures to full 2.69 Mb × 2.69 Mb inputs rather than the condensed 0.4 Mb × 0.4 Mb window used in previous papers. In some cases this expansion of window size degrades the older architecture’s performance, while in others it improves it. Thus, when comparing VEHiCLE to previous tools we include both the original architectures with unadjusted window sizes and alternative architectures trained using expanded window sizes.

Low resolution Hi-C contact matrices enhanced by VEHiCLE achieve strong similarity to high resolution contact matrices using multiple metrics

Using models trained and tested on the GM12878 cell line dataset, we evaluated the effectiveness of VEHiCLE in predicting high resolution contacts using 5 common metrics: Pearson Correlation Coefficient (PCC), Spearman Correlation Coefficient (SPC), Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR) and Structural Similarity Index (SSI) (see “Methods”). We compared VEHiCLE reconstructions to the lower resolution data as well as to other super-resolution methods (HiCPlus, DeepHiC and HiCSR). VEHiCLE enhanced contact matrices consistently showed improvement relative to low resolution data along all 5 metrics (Table 1). VEHiCLE frequently out-performed other Hi-C super-resolution methods, beating all older models with 0.4 Mb window size along every test chromosome in every vision metric (Table 1). VEHiCLE outperforms both the original and expanded window HiCPlus models in every vision metric across every chromosome (Table 1). VEHiCLE remained competitive with the 2.69 Mb window sized DeepHiC and HiCSR models, scoring highest in PCC on 3 of the 4 test chromosomes and scoring in the top 2 for 80% of the metric-chromosome combinations, a higher consistency of top-2 performance than any of the previous models (Table 1).

Table 1 Comparison of multiple standard vision metrics across different super-resolution algorithms.

Downsampled Hi-C contact matrices enhanced by VEHiCLE display significant improvement using Hi-C specific metrics

Using models trained on the GM12878 cell line dataset, we next evaluated VEHiCLE reconstructions using 3 Hi-C specific metrics: GenomeDISCO, HiCRep and QuASAR-Rep (see “Methods”). VEHiCLE enhanced matrices remain competitive with other methods (Table 2). Furthermore, even in instances where VEHiCLE is outperformed by another algorithm, we consistently observe increased performance relative to the original low resolution matrices. These results indicate the biological consistency of VEHiCLE enhanced matrices.

Table 2 Comparison of Hi-C super-resolution algorithms using Hi-C reproducibility metrics.

VEHiCLE enhanced contact matrices effectively retrieve downstream features such as TADs

We identified TADs using the widely used insulation score method12. This method assigns an insulation score vector by sliding a window across the diagonal of the contact matrix, constructing an insulation difference vector, and using the zeros of the insulation difference vector to discover TAD boundaries. We used models trained on the GM12878 cell line and evaluated insulation on test chromosomes from the HMEC, K562 and IMR90 cell lines as well as the GM12878 cell line. We expanded the test set to evaluate the effectiveness of our network at predicting downstream biological features like TADs in cell lines other than the one it was trained on, which may have different TAD profiles.

We compare the insulation difference vector of each matrix-enhancement algorithm to the insulation difference vector of our high resolution contact matrix using the L2 norm as a dissimilarity metric. In many cases VEHiCLE enhanced insulation difference vectors have higher similarity to those of the target matrices than other matrix enhancing algorithms (Table 3). Furthermore, even in instances where VEHiCLE is outperformed by another algorithm, we consistently observe higher similarity between the target high resolution matrices and VEHiCLE enhanced matrices than between the targets and low resolution matrices (Table 3).

Table 3 L2 norm of TAD insulation difference vectors against target insulation vectors.

3D chromatin model construction

We tested the effectiveness of reconstructed data in building 3D structural models using the modeling tool 3DMax. We extracted constraints from the low resolution, high resolution and VEHiCLE-enhanced 2.57 Mb × 2.57 Mb regions of the test chromosomes of the GM12878 dataset. From each constraint grouping we generated 3 models. We observed significantly higher visual similarity between VEHiCLE-enhanced and high-resolution models relative to low-resolution models (Fig. 5a). We then used the TM-score metric to quantify structural similarity of models15. We observed higher TM-scores between high resolution and VEHiCLE-enhanced models than between high resolution and low resolution models (Fig. 5b). We also observed higher TM-score similarities between models generated from the same VEHiCLE-enhanced matrix relative to models generated from the same low resolution matrix, indicating that VEHiCLE enhanced models are more consistent (Fig. 5c).

Figure 5

(a) 3D reconstruction of chromosome 20, 0.6–3.1 Mb. (b) TM-score comparison of high resolution structures to (red) low resolution structures and (green) VEHiCLE enhanced structures. VEHiCLE enhanced scores are significantly higher (Wilcoxon rank-sum p value < 1e−20). (c) Average TM-score comparison of in-group structures generated from the same contact matrix: (red) low resolution, (yellow) high resolution, (green) VEHiCLE enhanced. VEHiCLE enhanced scores are significantly better than low-resolution scores (Wilcoxon rank-sum p value < 1e−20). Structures are all generated from the GM12878 cell line using the test chromosome set: 4, 14, 16, 20.

Discussion

One of the most common challenges in deep learning projects is the opaque nature of a neural network’s inner functioning. Consequently, our ability to extract latent features and map them to biologically relevant structures provides a significant advance in increasing the interpretability of Hi-C matrices. Our GUI tool can be used to generate Hi-C data through user-tunable parameters tied to biologically relevant downstream structures such as TAD strength, TAD positioning, stripes and loops. Further inspection of these features has the potential to enhance analysis of key characteristics of chromatin organization.

Our introduction of the insulation loss sets a new precedent for utilizing biological knowledge in the training of Hi-C networks. This may open the door for future improvements to Hi-C data enhancement that utilize other forms of domain knowledge to increase the usability of deep learning enhanced matrices. Future loss functions could incorporate algorithms for identification of other important downstream features such as loops or stripes.

In addition to the increased interpretability and inclusion of domain knowledge, VEHiCLE obtains resolution enhancement results competitive with the state-of-the-art, often beating top algorithms on a variety of metrics, all while preserving meaningful structures such as TADs and 3D structure in downstream analysis.

VEHiCLE’s capacity to increase the accuracy of insulation scores shows promise for experimental biologists interested in chromosome architecture at specific genomic locations. By enhancing experimentally obtained Hi-C data, a biologist could observe the frequency with which a list of genes or cis regulatory elements are found near TAD boundaries. Such analysis could provide further insight into the role of structural organization in a genomic process. Additionally, VEHiCLE enhanced matrices could be used to generate more accurate 3D models when building visualizations of genomic structure. These visualizations may provide insight into the underlying machinery of a genomic process of interest.

Methods

Dataset assembly

Like many previous Hi-C super-resolution networks, we train VEHiCLE on high and low resolution Hi-C data for the GM12878 cell line16. While previous work often split chromosomes into training, validation and testing sets in a sequential manner8,9, we were concerned that differences in the 3D conformation of large versus small chromosomes17 may introduce implicit bias in contact map features that could confound training. Consequently, we assembled training, validation and test sets in a non-sequential manner, using chromosomes 1, 3, 5, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21 as our training set, chromosomes 2, 8, 10, 22 as our validation set and chromosomes 4, 14, 16, 20 as our test set.

Previous work on Hi-C super-resolution consistently used network input window sizes of 0.4 Mb × 0.4 Mb at 10 kb resolution, requiring networks to split chromosome contact maps into 40 × 40 bin matrices5,6,7,8,9. While this strategy has seen relative success, a major disadvantage is that certain important features of Hi-C, such as TADs, can span ranges larger than 0.4 Mb, meaning that it is impossible for previous networks to explicitly encode important information about TAD organization. Furthermore, this informational bottleneck of constraining window sizes to 40 × 40 bins is not inherent to the employed super-resolution networks, as work in the field of computer vision has demonstrated the effectiveness of GAN and VAE networks on significantly larger images. With these considerations in mind, we instead built our network to accept 2.69 Mb × 2.69 Mb images, a range large enough to fully encompass the average TAD of length 1 Mb18. Observing 2.69 Mb × 2.69 Mb regions of Hi-C contact maps at 10 kb resolution results in submatrix images of 269 × 269 bins. Because of the expanded window size we trained our network exclusively on diagonally centered submatrices, split by sliding a 269 × 269 window down the diagonal of each chromosome's Hi-C contact map. We move the window with a stride of 50 bins, ensuring sufficient overlap between samples for our dataset to include all contacts between regions within 2 Mb of each other. This results in a total of 3309 training, 1051 validation, and 798 testing matrices.
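The diagonal window extraction can be sketched as follows (`contact_map` is assumed to be a chromosome's full contact matrix as a 2D NumPy array at 10 kb resolution):

```python
import numpy as np

def diagonal_windows(contact_map, window=269, stride=50):
    """Slide a window x window block down the matrix diagonal."""
    n = contact_map.shape[0]
    return np.stack([contact_map[s:s + window, s:s + window]
                     for s in range(0, n - window + 1, stride)])
```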

Because the convolutional arithmetic of our GAN architecture reduces output matrices by 12 bins, our output matrices are of dimension 257 × 257. Our variational loss is based on reconstruction of matrices output by our GAN; thus, when training our variational autoencoder, we use the inner 257 × 257 bins of each 269 × 269 sample in our dataset.

All models were trained using the GM12878 cell line. When evaluating vision metrics, Hi-C reproducibility metrics and 3D model comparison, we use the test chromosomes from the GM12878 cell line. For our insulation score analysis we extend our test set to include the K562, IMR90 and HMEC cell lines, so as to verify the effectiveness of our network at retrieving information when trained on a different cell line. Both low resolution and high resolution contact maps are normalized using the Knight–Ruiz algorithm, a standard normalization method in the Hi-C literature19.

Variational autoencoder architecture

The VAE component of VEHiCLE utilizes two neural networks for the encoding and decoding components, where the encoder is trained to learn the parameters of \({q}_{\phi }\) and the decoder is trained to optimize the parameters of \({p}_{\theta }\). The VEHiCLE encoder network contains 7 convolutional layers with kernel counts: 32, 64, 128, 256, 256, 512, 512. Each convolutional layer is separated by leaky ReLU and batch normalization. The decoder network has 7 convolution transpose layers with kernel counts 512, 512, 256, 256, 128, 64, 32, also separated by leaky ReLU and batch normalization. The decoder network ends with a sigmoid activation function, placing outputs in the range [0,1].
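A sketch of the encoder's convolutional trunk is shown below; the text specifies only the kernel counts and the use of leaky ReLU and batch normalization, so the kernel size, stride, and padding here are assumptions, and the heads producing the posterior mean and log-variance are omitted:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Convolution followed by batch normalization and leaky ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU())

channels = [32, 64, 128, 256, 256, 512, 512]
encoder = nn.Sequential(*[conv_block(c_in, c_out)
                          for c_in, c_out in zip([1] + channels[:-1], channels)])
```

Under these assumed parameters a 257 × 257 input passes through the seven layers as 257 → 129 → 65 → 33 → 17 → 9 → 5 → 3, and a mirrored stack of stride-2 transpose convolutions recovers 257 × 257, consistent with the dimensional constraint discussed in “Motivation for 269 × 269 window size”.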

Generative adversarial network architecture

We use the discriminator and generator architecture defined in HiCSR, with the exception of our generator’s output function, which is changed from tanh to a sigmoid so that outputs are mapped to [0,1]. The generator architecture contains 15 residual blocks separated by skip connections, each containing 64 convolutional filters. The discriminator is a fully convolutional network with ReLU activations. Both the generator and discriminator are trained with batch normalization.

Other networks

We used the PyTorch versions of HiCPlus, DeepHiC and HiCSR provided at https://github.com/wangjuan001/hicplus, https://github.com/omegahh/DeepHiC and https://github.com/PSI-Lab/HiCSR. We first tested the networks using their literature provided weights; however, we obtained very poor performance because these networks were trained on alternative training sets with key characteristic differences from ours. First, their training sets had bin value ranges of [− 1,1], whereas our training data's range was [0,1], because negative values confound the probabilistically motivated VAE component. Second, the input size of contact maps for previous networks was 40 × 40, while our network aims to incorporate surrounding genomic information and utilizes a larger window input size of 269 × 269. To provide a more accurate comparison we trained the networks on our own GM12878 dataset. Because our network accepts large scale 269 × 269 input matrices, but the other networks were built to accept 40 × 40 pieces, we trained the other networks by splitting each 269 × 269 matrix into 36 non-overlapping pieces. Evaluation of Hi-C metrics was performed by feeding split pieces through networks as necessary, then reassembling the pieces and comparing full chromosome contact maps.
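A sketch of the tiling used to feed the 40 × 40 models (the exact cropping of the 269-bin sample to a 240-bin region divisible by 40 is our assumption):

```python
def split_into_pieces(mat, piece=40, grid=6):
    """Split one 269 x 269 sample into 6 x 6 = 36 non-overlapping 40 x 40
    tiles covering the first 240 bins; the 29-bin remainder is dropped."""
    return [mat[i * piece:(i + 1) * piece, j * piece:(j + 1) * piece]
            for i in range(grid) for j in range(grid)]
```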

Standard evaluation metrics

We utilize 5 evaluation metrics pulled from the image super-resolution literature: Pearson Correlation Coefficient (PCC), Spearman Correlation Coefficient (SPC), Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR) and Structural Similarity Index (SSI).

Mean squared error

$${L}_{mse}(x,y) = \frac{1}{n}\sum_{i=1}^{n}{({x}_{i}-{y}_{i})}^{2}$$

Pearson correlation coefficient

$${L}_{pcc}(x,y) = \frac{\sum_{i=1}^{n}({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}{({x}_{i}-\bar{x})}^{2}}\sqrt{\sum_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}}$$

Spearman correlation coefficient

Spearman correlation is similar to Pearson correlation, differing in that it utilizes rank variables so as to evaluate the monotonic relationship between the matrices without imposing a linearity condition that may not exist in nature.

$${L}_{spc}(x,y) = \frac{\sum_{i=1}^{n}({r}_{{x}_{i}}-{\bar{r}}_{x})({r}_{{y}_{i}}-{\bar{r}}_{y})}{\sqrt{\sum_{i=1}^{n}{({r}_{{x}_{i}}-{\bar{r}}_{x})}^{2}}\sqrt{\sum_{i=1}^{n}{({r}_{{y}_{i}}-{\bar{r}}_{y})}^{2}}}$$

Signal-to-noise ratio

Signal-to-noise ratio uses the ratio of the clean signal to the difference between clean and noisy signals to represent how much signal is actually getting through. The higher the SNR, the better the quality of the data.

$${L}_{snr}(x,y) = \frac{\sum_{i,j}{y}_{i,j}}{\sqrt{\sum_{i,j}{({x}_{i,j}-{y}_{i,j})}^{2}}}$$

Structural similarity index

SSI is calculated by sliding windows between images and averaging values. The constants C1 and C2 are used to stabilize the metric, while the means, variances and covariances are computed via a Gaussian filter. We use the implementation of SSI developed by Hong et al.8 (DeepHiC), keeping their default values for the size of sub-windows and the variance of the Gaussian filter at 11 and 3, respectively.

$${L}_{ssi}(x,y) = \frac{(2{\mu }_{x}{\mu }_{y}+{C}_{1})(2{\sigma }_{xy}+{C}_{2})}{({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1})({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2})}$$
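The first four metrics can be computed directly from the formulas above, for example (a sketch using NumPy/SciPy; SSI would come from an SSIM implementation such as DeepHiC's):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def vision_metrics(x, y):
    """PCC, SPC, MSE and SNR between predicted (x) and target (y) matrices."""
    xf, yf = x.ravel(), y.ravel()
    pcc = pearsonr(xf, yf)[0]
    spc = spearmanr(xf, yf)[0]
    mse = np.mean((xf - yf) ** 2)
    snr = y.sum() / np.sqrt(((x - y) ** 2).sum())   # per the SNR formula above
    return pcc, spc, mse, snr
```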

Hi-C reproducibility metrics

We consider 3 Hi-C specific reproducibility metrics: GenomeDISCO20, HiCRep21, and QuASAR-Rep22. We use the 3DChromatin_ReplicateQC23 implementations of these metrics. The 3DChromatin_ReplicateQC repository also includes the metric HiC-Spector22; however, we consistently obtained faulty values, even when using the repository's sample data, and so we excluded HiC-Spector results from this analysis. When expanding previous models to a 269 × 269 window size, the HiCSR model repeatedly failed to converge under these metrics; thus we only include the original 40 × 40 window version of HiCSR in our evaluation of Hi-C reproducibility metrics. GenomeDISCO utilizes a random walk on a graph generated by contact maps to obtain a concordance score23. HiCRep develops a stratum adjusted correlation coefficient for matrix comparison by measuring weighted similarity of contacts in identified strata23. QuASAR-Rep calculates a correlation matrix of interactions using weights based on enrichment22.

Topologically associated domain identification

Topologically associated domains were identified using the insulation score method of Crane et al.12. We followed their procedure exactly, with the exception that our initial insulation score window size was condensed to 20 bins instead of 50, because this demonstrated greater visual accuracy in TAD positioning12.

Three-dimensional model reconstruction

To generate models, we utilize 3DMax24 with out-of-the-box parameters of 0.6 conversion factor, 1 learning rate, and 10,000 max iterations. We create 3 models per input contact matrix. We generate models for every 5th 2.69 Mb × 2.69 Mb input matrix from our test dataset, because this skipping distance ensures coverage of each chromosome while minimizing model generation time. Similarity between structures was measured using TM-score15.

Motivation for 269 × 269 window size

The decision to expand our window size to 2.69 Mb × 2.69 Mb is multifaceted. Philosophically, the decision to expand beyond the previous standard of 0.4 Mb × 0.4 Mb was to permit the inclusion of a wider range of genomic information in our deep learning methods.

From a technical standpoint, our insulation score loss is based on the previously defined method for insulation determination12. Because calculation of insulation necessitates incorporation of boundary bins, the length of an insulation vector is always smaller than the dimension of a Hi-C contact map's axis, with formula:

$$\left(\text{len of insulation vector}\right) = \left(\text{len of Hi-C axis}\right) - \left(\text{insulation window}\right) - \left(2\times \text{delta window} - 1\right).$$

Thus, using a 20-bin insulation window and 10-bin delta window with the previously applied 40 × 40 window would result in an insulation vector of length (40 − 20 − 19) = 1, which is only a scalar and would contain insufficient information for meaningful feature extraction.
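A quick check of this arithmetic for the window sizes used here:

```python
def insulation_vector_length(axis_len, ins_window=20, delta_window=10):
    # len(insulation vector) = len(axis) - insulation window - (2*delta window - 1)
    return axis_len - ins_window - (2 * delta_window - 1)

insulation_vector_length(40)    # -> 1: a scalar, too little information
insulation_vector_length(257)   # -> 218 usable bins per VEHiCLE output
```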

The decision to use 269, as opposed to a different large number, is due to our variational autoencoder. While passing through the variational autoencoder, the dimension of an input matrix is compressed with each incremental layer. It was essential that at each step the output dimension remained a whole number, and that when the latent representation is decoded back into contact matrix space, the reconstructed matrix be of the same dimension as its input. 257 was the smallest number which both spanned 2 Mb (a range that encompasses nearly all TADs) and resulted in the same dimensional input and output at each layer of our variational autoencoder. Accounting for the 12 bin decrease in size that occurs when passing through our GAN results in a 269 × 269 input matrix.