Abstract
Chromatin conformation plays an important role in a variety of genomic processes. Hi-C is one of the most popular assays for inspecting chromatin conformation. However, the utility of Hi-C contact maps is bottlenecked by resolution. Here we present VEHiCLE, a deep learning algorithm for resolution enhancement of Hi-C contact data. VEHiCLE utilizes a variational autoencoder and adversarial training strategy equipped with four loss functions (adversarial loss, variational loss, chromosome topology-inspired insulation loss, and mean square error loss) to enhance contact maps, making them more viable for downstream analysis. VEHiCLE expands previous efforts at Hi-C super-resolution by providing novel insight into biologically meaningful and human-interpretable feature extraction. Using a deep variational autoencoder, VEHiCLE provides a user-tunable, fully generative model for generating synthetic Hi-C data while also providing state-of-the-art results in enhancement of Hi-C data across multiple metrics.
Introduction
Hi-C, an extension of the chromosome conformation capture (3C) assay, is a biological assay that can be used to inspect the three-dimensional (3D) architecture of a genome^{1}. Hi-C data can be used for downstream analysis of structural features of chromosomes such as A/B compartments, Topologically Associated Domains (TADs), loops, and 3D chromosome and genome models. Changes in chromosomal conformation have been empirically demonstrated to impact a variety of genomic processes including gene methylation and gene expression^{2}.
When analyzing Hi-C data, reads are usually converted into contact matrices, where each cell entry corresponds to the quantity of contacts between the two regions indexed by its row and column^{3,4}. The size of an individual region in this contact matrix is referred to as the resolution or bin size^{4}: the smaller the bin size, the higher the resolution. The resolution of a contact matrix is usually selected based on the quantity of read pairs in an individual Hi-C experiment, with a higher quantity of read pairs permitting a higher resolution. Certain genomic features, such as TADs, can only be meaningfully identified using high resolution contact matrices; however, if a resolution is selected with insufficient read coverage, the matrices can be overly sparse. One way to address this issue is to run additional Hi-C experiments, but because of experimental costs this is not always feasible.
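To make the bin-size/resolution relationship concrete, here is a minimal sketch of how paired read positions are binned into a contact matrix (function and variable names are hypothetical; real pipelines use dedicated Hi-C processing tools):

```python
import numpy as np

def reads_to_contact_matrix(read_pairs, chrom_length, bin_size):
    """Bin paired read positions (in bp) into a symmetric contact matrix.

    read_pairs: iterable of (pos1, pos2) base-pair coordinates.
    A smaller bin_size yields a higher-resolution (and sparser) matrix.
    """
    n_bins = int(np.ceil(chrom_length / bin_size))
    mat = np.zeros((n_bins, n_bins))
    for pos1, pos2 in read_pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        mat[i, j] += 1
        if i != j:
            mat[j, i] += 1  # contacts are symmetric
    return mat

# Example: three read pairs on a 100 kb chromosome at 10 kb resolution
pairs = [(5_000, 15_000), (5_000, 14_000), (95_000, 95_500)]
m = reads_to_contact_matrix(pairs, chrom_length=100_000, bin_size=10_000)
```

With the same reads, halving `bin_size` doubles the matrix dimension while spreading the same counts over four times as many cells, which is why insufficient read coverage makes high-resolution matrices sparse.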
To solve this problem, previous groups have utilized methods from the field of image super-resolution to improve Hi-C contact matrix resolution. The first of these networks was HiCPlus^{5}, a simple neural network optimized using mean squared error (MSE). HiCPlus was then improved upon by HiCNN^{6} through adjustments to the network architecture. Next, hicGAN^{7} was proposed, introducing the use of Generative Adversarial Networks (GANs) to generate high resolution contact maps conditioned on low resolution input. The network DeepHiC^{8} maintained the GAN loss function while extending it to include a perceptual loss derived from VGG16 trained on image data. The model HiCSR^{9} continued the advancement by introducing a deep autoencoder as a feature extraction mechanism.
Our network, the Variationally Encoded Hi-C Loss Enhancer (VEHiCLE), extends the approach of conditional generative adversarial networks using an integrated training approach inspired by literature in the domains of deep learning and genomics. First, VEHiCLE incorporates a variational autoencoder which extracts biologically meaningful features from Hi-C data. Second, VEHiCLE's decoder network is engineered to provide an easy-to-use generative model which smoothly maps user-tunable, low dimensional vectors to Hi-C contact maps independent of any low-sampled input. Third, VEHiCLE incorporates a biologically explicit loss function based on Topologically Associated Domain identification to encourage accurate downstream genomic analysis.
VEHiCLE obtains state-of-the-art results in the task of Hi-C super-resolution across a variety of metrics drawn from the domains of image analysis and Hi-C quality/reproducibility. VEHiCLE-enhanced data show successful retrieval of important downstream structures in TAD identification and 3D model generation, while also providing novel human interpretability of the enhancement process.
Approach
Description of VEHiCLE network training
VEHiCLE is trained as an adversarial network conditioned on low resolution input. The network is trained using a composite loss function made up of four sub-loss functions: adversarial loss, variational loss, mean square error (MSE) loss, and insulation loss. An overview of the training mechanism is displayed in Fig. 1a. The motivation for each of these loss functions is outlined below.
Adversarial loss function
Generative adversarial networks (GANs) are a popular deep learning framework for generative modeling which has gained traction in a wide variety of tasks, including image super-resolution. GANs were first introduced to Hi-C super-resolution through hicGAN, and later improved upon in DeepHiC and HiCSR. A GAN uses two key networks: a generator G (Fig. 1b) and a discriminator D (Fig. 1c). The generator takes samples from an input distribution and generates enhanced matrices. The discriminator is trained on a collection of inputs, including real high resolution Hi-C samples as well as enhanced Hi-C samples, and attempts to determine whether individual samples are real or enhanced. The two networks are trained in a game where the generator is rewarded for successfully tricking the discriminator and the discriminator tries to minimize classification mistakes.
The generator loss function is defined as:
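In code, a standard non-saturating generator objective of the kind used in GAN-based super-resolution can be sketched as follows (the exact formulation used here is an assumption; it rewards the generator when the discriminator scores enhanced maps as real):

```python
import numpy as np

def generator_adversarial_loss(d_scores):
    """Non-saturating GAN generator loss (an assumed standard form).

    d_scores: discriminator probabilities, in (0, 1], that each
    enhanced contact map is real.  The loss approaches 0 as the
    discriminator is fully fooled.
    """
    d_scores = np.asarray(d_scores, dtype=float)
    return float(-np.mean(np.log(d_scores + 1e-12)))
```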
Variational loss
Autoencoders are deep learning systems which map inputs from sample space to a condensed latent space via an encoder and then reconstruct samples from the latent space using a decoder. The use of autoencoders for Hi-C super-resolution was originally proposed in our preprint^{10} for the task of denoising Hi-C data. They were then suggested by Dimmick et al.^{9} as tools for training super-resolution networks, using the features extracted by passing Hi-C data through a trained autoencoder as a loss function. In this manuscript we expand upon this strategy, but replace their network with a different flavor of network: the variational autoencoder^{11}.
Similar to vanilla autoencoders, variational autoencoders (VAEs) aim to condense data into a lower dimensional space; however, they have the advantage of providing a smooth feature representation which can permit the construction of powerful generative models. To obtain these advantages, VAEs rely upon a statistical method called variational inference^{11}. This method frames the tasks of encoding and decoding as an ancestral sampling problem with two steps: First, a latent variable, z, is sampled from a prior distribution \({P}_{\theta }(z)\). Second, the observed variable, x, is drawn from a likelihood distribution \({P}_{\theta }(x|z)\).
Encoding the observed variable, x, requires computation of the posterior distribution \({P}_{\theta }(z|x)\). However, because this is computationally intractable, one instead approximates the posterior by choosing a parametric family of recognition models \({q}_{\phi }(z|x)\) and selecting parameters that minimize the divergence between the recognition model and the true underlying distribution via a probabilistic dissimilarity metric called the KL divergence,
By performing some algebra outlined in Kingma and Welling^{11}, variational autoencoders are trained using the following loss function

\(\mathcal{L}(\theta ,\phi ;x)={D}_{KL}\left({q}_{\phi }(z|x)\,\Vert \,{P}_{\theta }(z)\right)-\int {q}_{\phi }(z|x)\,\mathrm{log}\,{P}_{\theta }(x|z)\,dz\)
The integral term on the far right of the loss function ensures that the reconstruction outputs of our network are highly similar to their original inputs, while the KL divergence term causes the latent space distribution of values to closely resemble a vector of Gaussian random variables. This imposition of Gaussian structure on the latent space yields advantages in the quality of extracted features and provides a generative model.
To create the variational loss function, we first train our variational autoencoder using high resolution contact matrices as both inputs and labels. In each experiment, our VAE network is trained using the same chromosomes as the overall VEHiCLE network. The variational autoencoder maps vectors from data space into a condensed latent space, which we interpret as a lower dimensional feature vector (Fig. 2a,b). Because the variational autoencoder training strategy imposes a Gaussian distribution on the latent space variables, and because our decoder maps latent vectors back into data space in a relatively smooth manner, we expect highly similar Hi-C contact matrices to have similar latent space profiles.
We compute the variational loss by passing both the enhanced Hi-C contact matrix and the target high resolution Hi-C contact matrix through the backpropagatable encoder component of our variational autoencoder network, extracting latent dimensional representations. We then compute the mean difference between their latent feature vectors,

\({L}_{vae}=\frac{1}{n}\sum_{i=1}^{n}\left|{f}_{encode}{(\hat{y})}_{i}-{f}_{encode}{(y)}_{i}\right|\)
where \({f}_{encode}\) is the encoding function, defined by our trained encoder network.
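Given latent vectors produced by the trained encoder, the variational loss can be sketched as follows (the use of a mean absolute difference is an assumption about the exact distance):

```python
import numpy as np

def variational_loss(z_enhanced, z_target):
    """Mean absolute difference between latent feature vectors.

    z_enhanced / z_target: latent vectors produced by the trained
    encoder (f_encode) for the enhanced and target contact matrices.
    """
    z_enhanced = np.asarray(z_enhanced, dtype=float)
    z_target = np.asarray(z_target, dtype=float)
    return float(np.mean(np.abs(z_enhanced - z_target)))
```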
Insulation score loss
Most of the previously proposed loss functions for developing Hi-C enhancement networks draw upon loss functions prevalent in computer vision^{5,6,7,8,9}. While there are certainly advantages to these strategies, they derive from assumed similarities between the tasks of image super-resolution and Hi-C super-resolution. However, the tasks are not synonymous. Hi-C contact matrices contain important information used for downstream feature analysis such as loop calling, TAD identification and 3D model construction. Consequently, images which are highly visually similar but which are blurry, shift the positions of structural features, or contain noise might result in significant differences in downstream analysis. With this in consideration, we used domain knowledge from computational genomics to devise an insulation loss function, which directly trains networks to correctly reproduce downstream features, specifically TAD placement.
One well-established strategy for the identification of TADs is the use of insulation scores^{12}. Insulation scores of a matrix are calculated by sliding a 20-bin (200 kb × 200 kb) window down the diagonal of the matrix and summing the signal across each bin, resulting in an insulation vector (Fig. 2c). This insulation vector is normalized by taking the log2 ratio of each bin's insulation score and the mean of all insulation scores on the chromosome. From the insulation vector, a delta vector is computed by observing the change in signal strength 100 kb downstream and upstream of each bin on the insulation vector (Fig. 2d). This delta vector is treated as a pseudo-derivative, and insulation valleys are identified in the regions where the delta vector crosses the x-axis from negative to positive values, indicating a relative minimum in insulation. TAD boundaries are assigned to each insulation valley whose difference in strength between the nearest left local maximum and right local minimum is > 0.1 (Fig. 2e).
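The insulation vector computation can be sketched in numpy as follows (windowing and boundary handling are simplified assumptions; the full Crane et al. procedure includes additional normalization details):

```python
import numpy as np

def insulation_vector(mat, window=20):
    """Sum contacts in a (window x window) square sliding along the
    diagonal, then log2-normalize by the chromosome-wide mean."""
    n = mat.shape[0]
    raw = np.array([mat[i - window:i, i:i + window].sum()
                    for i in range(window, n - window + 1)])
    raw = raw + 1e-12  # guard against log of zero
    return np.log2(raw / raw.mean())
```

On a toy matrix with two contact blocks, the vector dips at the block boundary, which is exactly the "insulation valley" the TAD caller looks for.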
The insulation TAD calling procedure can be encoded into a single, backpropagatable network up until the extraction of the delta vector (Fig. 2c,d). We define the insulation loss,

\({L}_{ins}=\frac{1}{n}\sum_{i=1}^{n}{\left({D}_{vec}{(\hat{y})}_{i}-{D}_{vec}{(y)}_{i}\right)}^{2}\)
where \({D}_{vec}\) is a backpropagatable network which maps a contact matrix to a delta insulation vector.
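The delta vector and a comparison between delta vectors can be sketched as follows (shown in numpy for clarity; the actual \({D}_{vec}\) is a backpropagatable network, and the exact distance used is an assumption):

```python
import numpy as np

def delta_vector(ins_vec, delta_bins=10):
    """Pseudo-derivative of the insulation vector: mean signal
    delta_bins downstream minus mean signal delta_bins upstream."""
    ins_vec = np.asarray(ins_vec, dtype=float)
    n = len(ins_vec)
    return np.array([ins_vec[i + 1:i + 1 + delta_bins].mean()
                     - ins_vec[i - delta_bins:i].mean()
                     for i in range(delta_bins, n - delta_bins)])

def insulation_loss(delta_enhanced, delta_target):
    """Mean squared difference between delta vectors."""
    diff = np.asarray(delta_enhanced) - np.asarray(delta_target)
    return float(np.mean(diff ** 2))
```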
Bin-wise mean squared error loss
Bin-wise mean square error loss is a thoroughly tested loss function used in previous Hi-C enhancement literature^{6,8,9}. It contributes to maintaining visual similarity between enhanced and target Hi-C contact matrices.
Composite training function
To capitalize on the advantages of all four loss functions, we incorporate them into our comprehensive training process. First, the variational network is trained on the train and validation datasets. Then the trained encoder is used for \({L}_{vae}\) along with the three other training losses to train the generator network, yielding our overall loss function

\({L}_{G}={\lambda }_{adv}{L}_{adv}+{\lambda }_{mse}{L}_{mse}+{\lambda }_{vae}{L}_{vae}+{\lambda }_{ins}{L}_{ins}\)
where \({\lambda }_{x}\) are hyperparameters used to determine loss contribution. We use \({\lambda _{adv}} = 0.0025,\;\;{\lambda _{mse}} = 1,\;\;{\lambda _{vae}} = 0.01,\;\;{\lambda _{ins}} = 1.\)
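The composite weighting can be sketched directly, using the stated hyperparameters as defaults (the function name is illustrative):

```python
def composite_generator_loss(l_adv, l_mse, l_vae, l_ins,
                             lambda_adv=0.0025, lambda_mse=1.0,
                             lambda_vae=0.01, lambda_ins=1.0):
    """Weighted sum of the four sub-losses with the paper's
    hyperparameter settings as defaults."""
    return (lambda_adv * l_adv + lambda_mse * l_mse
            + lambda_vae * l_vae + lambda_ins * l_ins)
```

The weighting reflects the relative scales of the terms: the adversarial and variational losses contribute small corrective signals on top of the dominant MSE and insulation terms.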
Results
Latent space representations permit generation of synthetic Hi-C data
The KL divergence term in the loss function of our variational autoencoder imposes constraints on the latent dimension, pushing our approximate posterior \(q(z|x)\) towards a vector distribution of Gaussian random variables. Because all latent vector variables fall within Gaussians centered around 0, most vectors near the center of these Gaussians can be successfully decoded into Hi-C space, resulting in a generative model for Hi-C data. We first perform principal component analysis (PCA) on our training set's learned latent dimensional features. We then create a function mapping PCA values to the latent dimensional space, and use our trained decoder network to transform values in latent dimensional space into Hi-C space (Fig. 3a). The result is a function mapping a profile of PCA values to a 2.57 Mb × 2.57 Mb block of Hi-C data. We hook this function into an interactive matplotlib widget, permitting manual visualization of changes to generated Hi-C data as input variables are adjusted. In our widget we set a NUM_SLIDERS = 15 parameter to permit the manual tuning of PCA vector components. The widget passes a vector to our mapping function with user-selected values in all manually adjusted components and dataset averages for all PCA components that are not manually selected or are above the NUM_SLIDERS index threshold. The choice of 15 is arbitrary and can be increased by users interested in viewing the impact of adjusting higher PC values on the generated matrix structure.
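The slider-to-matrix mapping can be sketched as follows; all function and parameter names are illustrative (in the real GUI this mapping is wired into a matplotlib widget and the result is passed through the trained decoder):

```python
import numpy as np

def sliders_to_latent(slider_values, pca_components, pca_mean,
                      mean_scores, num_sliders=15):
    """Map user-adjusted PCA scores back into latent space.

    slider_values: user-chosen scores for the first num_sliders PCs.
    Remaining components fall back to dataset-average scores
    (mean_scores).  pca_components rows are principal directions.
    """
    scores = np.array(mean_scores, dtype=float)
    scores[:num_sliders] = slider_values[:num_sliders]
    # Inverse PCA transform: component scores -> latent vector
    return pca_mean + scores @ pca_components

# The latent vector would then be decoded:
# contact_map = decoder(sliders_to_latent(...))
```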
The zero vector results in a vanilla Hi-C map with interaction frequency between two regions following the inverse of genomic distance (Fig. 3b). The biological interpretation of some adjustable features remains elusive, with changes to vector component values resulting merely in changes of diagonal signal strength or sporadic repositioning of contact regions. However, we observe that many of the tunable feature vector components correspond directly with biologically meaningful features in Hi-C space such as: formation of TADs, increasing TAD size (Fig. 3c), increasing TAD frequency, shifting TAD position (Fig. 3c,d), formation of genomic stripes (Fig. 3e)^{13} and formation of chromatin loops^{14} (Fig. 3f).
Low resolution Hi-C contact matrices enhanced by VEHiCLE appear visually competitive with other enhancement algorithms
We generate visual heatmaps of Hi-C contact maps of the GM12878 dataset using VEHiCLE as well as three other previously developed algorithms: HiCSR, DeepHiC and HiCPlus. We observe high visual similarity between reconstructions by VEHiCLE and the other enhancement algorithms (Fig. 4a). We also subtracted high resolution contact maps from each tool's reconstructions to obtain a visual difference matrix (Fig. 4b). Visually, VEHiCLE appears competitive with existing algorithms.
Notes on evaluation metrics
One of the major differences between the VEHiCLE algorithm and previous Hi-C enhancement tools is that our architecture is trained to enhance 2.69 Mb × 2.69 Mb regions along the diagonals of contact maps, rather than splitting contact maps into 0.4 Mb × 0.4 Mb pieces, enhancing in a piecemeal fashion, and then reassembling (see "Methods"). This permits the inclusion of more comprehensive information, like TAD structure, in training samples. However, it is possible to expand older architectures to full 2.69 Mb × 2.69 Mb sizes rather than the condensed 0.4 Mb × 0.4 Mb window that appears in previous papers. In some cases this expansion of window size degrades an older architecture's performance, while in others it leads to improvement. Thus, when comparing VEHiCLE to previous tools, we include both original architectures without adjusted window size and alternative architectures trained using expanded window sizes.
Low resolution Hi-C contact matrices enhanced by VEHiCLE achieve strong similarity to high resolution contact matrices using multiple metrics
Using models trained and tested on the GM12878 cell line dataset, we evaluated the effectiveness of VEHiCLE in predicting high resolution contacts using five common metrics: Pearson Correlation Coefficient (PCC), Spearman Correlation Coefficient (SPC), Mean Squared Error (MSE), Signal-to-noise Ratio (SNR) and Structural Similarity Index (SSI) (see "Methods"). We compared VEHiCLE reconstructions to the lower resolution data as well as to other super-resolution methods (HiCPlus, DeepHiC and HiCSR). VEHiCLE-enhanced contact matrices consistently showed improvement relative to low resolution data along all five metrics (Table 1). VEHiCLE frequently outperformed other Hi-C super-resolution methods, beating all older models with 0.4 Mb window size on every test chromosome in every vision metric, and outperforming both the original and expanded window HiCPlus models in every vision metric across every chromosome (Table 1). VEHiCLE remained competitive with the 2.69 Mb window sized DeepHiC and HiCSR models, scoring highest in PCC on 3 of the 4 test chromosomes and scoring in the top 2 for 80% of the metric-chromosome combinations, a higher consistency of top-2 performance than any of the previous models (Table 1).
Downsampled Hi-C contact matrices enhanced by VEHiCLE display significant improvement using Hi-C specific metrics
Using models trained on the GM12878 cell line dataset, we next evaluated VEHiCLE reconstructions using 3 Hi-C specific metrics: GenomeDISCO, HiCRep and QuASAR-Rep (see "Methods"). VEHiCLE-enhanced matrices remain competitive with other methods (Table 2). Furthermore, even in instances where VEHiCLE is outperformed by another algorithm, we consistently observe increased performance relative to the original low resolution matrices. These results indicate biological consistency of VEHiCLE-enhanced matrices.
VEHiCLE enhanced contact matrices effectively retrieve downstream features such as TADs
We identified TADs using the widely used insulation score method^{12}. This method assigns an insulation score vector by sliding a window across the diagonal of the contact matrix, constructing an insulation difference vector, and using the zero crossings of the insulation difference vector to discover TAD boundaries. We used models trained on the GM12878 cell line and evaluated insulation on test chromosomes for the HMEC, K562 and IMR90 cell lines as well as the GM12878 cell line. We expand the test set to evaluate the effectiveness of our network at predicting downstream biological features like TADs when it is applied to cell lines different from the one it was trained on, which may have different TAD profiles.
We compare the insulation difference vector of each matrix enhancement algorithm to the insulation difference vector of our high resolution contact matrix using the L2 norm as a dissimilarity metric. In many cases VEHiCLE-enhanced insulation difference vectors have higher similarity to target matrices relative to other matrix enhancement algorithms (Table 3). Furthermore, even in instances where VEHiCLE is outperformed by another algorithm, we consistently observe higher similarity between the target high resolution matrices and VEHiCLE-enhanced matrices relative to low resolution matrices (Table 3).
3D chromatin model construction
We tested the effectiveness of reconstructed data in building 3D structural models using the modeling tool 3DMax. We extracted constraints from the low resolution, high resolution and VEHiCLE-enhanced 2.57 Mb × 2.57 Mb regions of our test dataset chromosomes of the GM12878 dataset. From each constraint grouping we generated three models. We observed significantly higher visual similarity between VEHiCLE-enhanced and high-resolution matrices relative to low-resolution matrices (Fig. 5a). We then used the TM-score metric to quantify structural similarity of models^{15}. We observed higher TM-scores between high resolution and VEHiCLE-enhanced matrices than between high resolution and low resolution models (Fig. 5b). We also observed higher TM-score similarities between models generated by the same VEHiCLE-enhanced matrices relative to models generated by the same low resolution matrices, indicating VEHiCLE-enhanced models are more consistent (Fig. 5c).
Discussion
One of the most common challenges in deep learning projects is the opaque nature of a neural network's inner functioning. Consequently, our ability to extract latent features and map them to biologically relevant structures provides a significant advance in increasing the interpretability of Hi-C matrices. Our GUI tool can be used to generate Hi-C data through user-tunable parameters tied to biologically relevant downstream structures such as TAD strength, TAD positioning, stripes and loops. Further inspection of these features has the potential to enhance analysis of key characteristics of chromatin organization.
Our introduction of the insulation loss sets a new precedent of utilizing biological knowledge in the training of Hi-C networks. This may open the door for future improvement of Hi-C data enhancement by utilizing other forms of domain knowledge to increase the usability of deep-learning-enhanced matrices. Future loss functions could incorporate algorithms for identification of other important downstream features such as loops or stripes.
In addition to increased interpretability and the inclusion of domain knowledge, VEHiCLE obtains resolution enhancement results competitive with the state of the art, often beating top algorithms on a variety of metrics, all while preserving the ability to convey meaningful structures such as TADs and 3D structure in downstream analysis.
VEHiCLE's capacity to increase the accuracy of insulation scores shows promise of utility for experimental biologists interested in chromosome architecture at specific genomic locations. By enhancing experimentally obtained Hi-C data, a biologist could observe the frequency with which a list of genes or cis-regulatory elements are found near TAD boundaries. Such analysis could provide further insight into the role of structural organization in a genomic process. Additionally, VEHiCLE-enhanced matrices could be used to generate more accurate 3D models when building visualizations of genomic structure. These visualizations may provide insight into the underlying machinery of a genomic process of interest.
Methods
Dataset assembly
Like many of the previous Hi-C super-resolution networks, we train VEHiCLE on high and low resolution Hi-C data for the GM12878 cell line^{16}. While previous work often split chromosomes into training, validation and testing sets in a sequential manner^{8,9}, we were concerned that differences in the 3D conformation of large vs. small chromosomes^{17} may introduce implicit bias in contact map features that could confound training. Consequently, we assembled training, validation and test sets in a non-sequential manner, using chromosomes 1, 3, 5, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19 and 21 as our training set, chromosomes 2, 8, 10 and 22 as our validation set, and chromosomes 4, 14, 16 and 20 as our test set.
Previous work on Hi-C super-resolution consistently used network input window sizes of 0.4 Mb × 0.4 Mb at 10 kb resolution, requiring networks to split chromosome contact maps into 40 × 40 bin matrices^{5,6,7,8,9}. While this strategy has seen relative success, a major disadvantage is that certain important Hi-C features, such as TADs, can span ranges larger than 0.4 Mb, meaning that it is impossible for previous networks to explicitly encode important information about TAD organization. Furthermore, this informational bottleneck of constraining window sizes to 40 × 40 bins is not incumbent upon the employed super-resolution networks, as work in the field of computer vision has demonstrated the effectiveness of GAN and VAE networks on significantly larger images. With these considerations in mind, we instead built our network to accept 2.69 Mb × 2.69 Mb inputs, a range large enough to fully encompass the average TAD of length 1 Mb^{18}. Observing 2.69 Mb × 2.69 Mb regions of Hi-C contact maps at 10 kb resolution results in submatrices of 269 × 269 bins. Because of the expanded window size, we trained our network exclusively on diagonally centered submatrices, split by sliding a 269 × 269 window down the diagonal of each chromosome's Hi-C contact map. We move the window with a stride of 50 bins, ensuring sufficient overlap between samples for our dataset to include all contacts between regions within 2 Mb of each other. This results in a total of 3309 training, 1051 validation, and 798 testing matrices.
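The diagonal window extraction described above can be sketched as:

```python
import numpy as np

def diagonal_windows(contact_map, window=269, stride=50):
    """Slice window x window submatrices centered on the diagonal,
    stepping stride bins at a time down the chromosome."""
    n = contact_map.shape[0]
    return [contact_map[s:s + window, s:s + window]
            for s in range(0, n - window + 1, stride)]
```

With a 50-bin (0.5 Mb) stride and a 269-bin window, consecutive windows overlap by 219 bins, so every contact between loci within 2 Mb of each other appears in at least one sample.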
Because the convolutional arithmetic of our GAN architecture decreases output matrices by 12 bins, our output matrices are of dimension 257 × 257. Our variational loss is based on reconstruction of matrices output by our GAN; thus, when training our variational autoencoder we use the inner 257 × 257 bins of each 269 × 269 sample in our dataset.
All models were trained using the GM12878 cell line. When evaluating vision metrics, Hi-C reproducibility metrics and 3D model comparison, we use the test chromosomes from the GM12878 cell line. For our insulation score analysis we extend our test set to include the K562, IMR90 and HMEC cell lines, so as to verify the effectiveness of our network at retrieving information when trained on a different cell line. Both low resolution and high resolution contact maps are normalized using the Knight–Ruiz algorithm, a standard normalization method in the Hi-C literature^{19}.
Variational autoencoder architecture
The VAE component of VEHiCLE utilizes two neural networks for the encoding and decoding components, where the encoder is trained to estimate the parameters of \({q}_{\phi }\) and the decoder is trained to optimize the parameters of \({P}_{\theta }\). The VEHiCLE encoder network contains 7 convolutional layers with kernel counts 32, 64, 128, 256, 256, 512 and 512. Each convolutional layer is separated by a leaky ReLU and batch normalization. The decoder network has 7 convolution transpose layers with kernel counts 512, 512, 256, 256, 128, 64 and 32, also separated by leaky ReLU and batch norm functions. The decoder network is appended with a sigmoid activation function, placing outputs in the range [0,1].
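The architecture above can be sketched in PyTorch as follows. The channel counts follow the text; kernel size 3, stride 2, padding 1 and the latent dimension of 128 are assumptions chosen so that a 257 × 257 input halves cleanly down to 3 × 3 and reconstructs to exactly 257 × 257:

```python
import torch
import torch.nn as nn

ENC_CHANNELS = [32, 64, 128, 256, 256, 512, 512]  # from the text

class VAESketch(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        enc, in_ch = [], 1
        for out_ch in ENC_CHANNELS:
            enc += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.encoder = nn.Sequential(*enc)           # 257x257 -> 3x3
        self.fc_mu = nn.Linear(512 * 3 * 3, latent_dim)
        self.fc_logvar = nn.Linear(512 * 3 * 3, latent_dim)
        self.fc_up = nn.Linear(latent_dim, 512 * 3 * 3)
        dec, in_ch = [], 512
        for out_ch in reversed(ENC_CHANNELS):        # 512,512,256,...
            dec += [nn.ConvTranspose2d(in_ch, out_ch, 3,
                                       stride=2, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.LeakyReLU(0.2)]
            in_ch = out_ch
        dec += [nn.Conv2d(in_ch, 1, 1), nn.Sigmoid()]  # map to [0,1]
        self.decoder = nn.Sequential(*dec)           # 3x3 -> 257x257

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        out = self.decoder(self.fc_up(z).view(-1, 512, 3, 3))
        return out, mu, logvar
```

Each stride-2 convolution maps size n to floor((n − 1) / 2) + 1, giving 257 → 129 → 65 → 33 → 17 → 9 → 5 → 3, and each transpose inverts it (n → 2n − 1), which is the whole-number round trip discussed under "Motivation for 269 × 269 window size".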
Generative adversarial network architecture
We use the discriminator and generator architecture defined in HiCSR, with the exception of our generator's output function, which is changed from tanh to a sigmoid so that outputs are mapped to [0,1]. The generator architecture contains 15 residual blocks separated by skip connections, each containing 64 convolutional filters. The discriminator is a fully convolutional network with ReLU activations. Both the generator and discriminator are trained with batch normalization.
Other networks
We used the PyTorch versions of HiCPlus, DeepHiC and HiCSR provided at https://github.com/wangjuan001/hicplus, https://github.com/omegahh/DeepHiC and https://github.com/PSILab/HiCSR. We first tested the networks using their literature-provided weights; however, we obtained very poor performance because these networks were trained on alternative training sets with key characteristics that differ from ours. First, their training sets had bin value ranges of [− 1,1], whereas our training data's range was [0,1] because negative values confound the probabilistically motivated VAE component. Second, the input size of contact maps for previous networks was 40 × 40, while our network aims to incorporate surrounding genomic information and utilizes a larger window input size of 269 × 269. To provide a more accurate comparison, we trained the networks on our own GM12878 dataset. Because our network accepts a large scale input matrix of 269 × 269, but other networks were built to accept 40 × 40 pieces, we trained the other networks by splitting each 269 × 269 matrix into 36 non-overlapping pieces. Evaluation of Hi-C metrics was performed by feeding split pieces through networks as necessary, then reassembling the pieces and comparing full chromosome contact maps.
Standard evaluation metrics
We utilize five reproducibility metrics pulled from the image super-resolution literature: Pearson Correlation Coefficient (PCC), Spearman Correlation Coefficient (SPC), Mean Squared Error (MSE), Signal-to-noise Ratio (SNR) and Structural Similarity Index (SSI).
Mean squared error
Pearson correlation coefficient
Spearman correlation coefficient
Spearman correlation is similar to Pearson correlation, differing in that it utilizes rank variables so as to evaluate the monotonic relationship between matrices without imposing a linearity condition that may not exist in nature.
Signaltonoise ratio
The signal-to-noise ratio uses the ratio of the clean signal to the difference between the clean and noisy signals to represent how much signal is actually getting through. The higher the SNR, the better the quality of the data.
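Three of these metrics can be sketched directly in numpy (the exact SNR formulation here, signal power over residual power, is an assumption):

```python
import numpy as np

def mse(target, enhanced):
    """Mean squared error between contact matrices."""
    return float(np.mean((target - enhanced) ** 2))

def pcc(target, enhanced):
    """Pearson correlation over flattened matrix entries."""
    return float(np.corrcoef(target.ravel(), enhanced.ravel())[0, 1])

def snr(target, enhanced):
    """Signal-to-noise ratio: clean signal power over the power of
    the residual between clean and enhanced signals."""
    residual = target - enhanced
    return float(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-12))
```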
Structural similarity index
SSI is calculated by sliding windows between images and averaging values. The constants C_{1} and C_{2} are used to stabilize the metric, while the means, variances and covariances are computed via a Gaussian filter. We use the implementation of SSI developed by Hong et al.^{8} (DeepHiC), keeping their default values for the sub-window size and Gaussian filter variance at 11 and 3, respectively.
Hi-C reproducibility metrics
We consider 3 Hi-C specific reproducibility metrics: GenomeDISCO^{20}, HiCRep^{21}, and QuASAR-Rep^{22}. We use the 3DChromatin_ReplicateQC^{23} implementations of these metrics. The 3DChromatin_ReplicateQC repository also includes the metric HiC-Spector^{22}; however, we consistently obtained faulty values, even when using the repository's sample data, and so we excluded HiC-Spector results from this analysis. When expanding previous models to a 269 × 269 window size, the HiCSR model repeatedly failed to converge under these metrics; thus we only include the original 40 × 40 window version of HiCSR in our evaluation of Hi-C reproducibility metrics. GenomeDISCO utilizes a random walk on a graph generated by contact maps to obtain a concordance score^{23}. HiCRep develops a stratum adjusted correlation coefficient for matrix comparison by measuring weighted similarity of contacts in identified strata^{23}. QuASAR-Rep calculates a correlation matrix of interactions using weights based on enrichment^{22}.
Topologically associated domain identification
Topologically associated domains were identified using the insulation score method of Crane et al.^{12}. We mimicked their procedure entirely, with the exception that our initial insulation score window size was condensed to 20 bins instead of 50, because this demonstrated greater visual accuracy in TAD positioning.
Three-dimensional model reconstruction
To generate models, we utilize 3DMax^{24} with out-of-the-box parameters: a conversion factor of 0.6, a learning rate of 1, and 10,000 max iterations. We create three models per input contact matrix. We generate models for every 5th 2.69 Mb × 2.69 Mb input matrix from our test dataset, because this skipping distance ensures coverage of each chromosome while minimizing model generation time. Similarity between structures was measured using TM-score^{15}.
Motivation for 269 × 269 window size
The decision to expand our window size to 2.69 Mb × 2.69 Mb is multifaceted. Philosophically, expanding beyond the previous standard of 0.4 Mb × 0.4 Mb permits the inclusion of a wider range of genomic information in our deep learning methods.
From a technical standpoint, our insulation score loss is based on the previously defined method for insulation determination^{12}. Because calculation of insulation necessitates the incorporation of boundary bins, the length of the delta insulation vector is always smaller than the dimension of a Hi-C contact map's axis, with formula

\(len(\Delta )=N-{w}_{ins}-(2{w}_{\Delta }-1)\)

where N is the number of bins along the contact map axis, \({w}_{ins}\) is the insulation window size and \({w}_{\Delta }\) is the delta window size, in bins. Thus, using a 20-bin insulation window and a 10-bin delta window with the previously applied 40 × 40 window would result in a delta vector of length (40 − 20 − 19) = 1, which is only a scalar and would contain insufficient information for meaningful feature extraction.
The decision to use 269, as opposed to a different large number, is driven by our variational autoencoder. While passing through the variational autoencoder, the dimension of an input matrix is compressed at each successive layer. It was essential that at each step the output dimension remain a whole number, and that when the latent representation is decoded back into contact-matrix space the reconstructed matrix have the same dimension as its input. 257 was the smallest number that both spanned 2 Mb (a range that encompasses nearly all TADs) and resulted in matching input and output dimensions at each layer of our variational autoencoder. We then account for the 12-bin decrease in size incurred by passing through our GAN, resulting in a 269 × 269 input matrix.
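To illustrate why a dimension like 257 can round-trip through a symmetric encoder-decoder without fractional sizes, consider the standard convolution size formulas. The kernel size, stride, padding, and depth below are illustrative assumptions, not the published VEHiCLE architecture:

```python
def conv_out(n, k=3, s=2, p=1):
    """Output length per axis of a strided convolution."""
    return (n + 2 * p - k) // s + 1

def convT_out(n, k=3, s=2, p=1):
    """Output length per axis of the matching transposed convolution."""
    return (n - 1) * s - 2 * p + k

# Hypothetical 3-layer stride-2 encoder followed by a mirrored decoder.
n = 257
dims = [n]
for _ in range(3):
    dims.append(conv_out(dims[-1]))   # 257 -> 129 -> 65 -> 33
for _ in range(3):
    dims.append(convT_out(dims[-1]))  # 33 -> 65 -> 129 -> 257
```

Under these assumptions every intermediate dimension is a whole number and the decoder output exactly matches the 257-bin input; adding back the 12 bins consumed by the GAN gives the 269 × 269 window.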
Data availability
All Hi-C data were downloaded from the Gene Expression Omnibus (GEO), accession GSE63525. For the high-resolution matrices of GM12878, IMR90, K562 and HMEC we used GSE63525_GM12878_insitu_primary+replicate_combined_30.hic, GSE63525_IMR90_combined_30.hic, GSE63525_K562_combined_30.hic and GSE63525_HMEC_combined_30.hic, respectively. For the low-resolution matrices we used GSM1551550_HIC001_30.hic, GSM1551602_HIC053_30.hic, GSE63525_K562_combined_30.hic, and GSM1551610_HIC061_30.hic, respectively.
Code availability
VEHiCLE was built using Python. All experimental code, as well as the VEHiCLE enhancement tool and contact-matrix-generating GUI, is available at https://github.com/MaxHighsmith/VEHiCLE, with a Zenodo archive at https://zenodo.org/badge/latestdoi/339535370.
References
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
Miele, A. & Dekker, J. Long-range chromosomal interactions and gene regulation. Mol. BioSyst. 4, 1046 (2008).
Oluwadare, O., Highsmith, M. & Cheng, J. An overview of methods for reconstructing 3D chromosome and genome structures from Hi-C data. Biol. Proced. Online 21, 7 (2019).
Lajoie, B. R., Dekker, J. & Kaplan, N. The Hitchhiker's guide to Hi-C analysis: Practical guidelines. Methods 72, 65–75 (2015).
Zhang, Y. et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat. Commun. 9, 750 (2018).
Liu, T. & Wang, Z. HiCNN: A very deep convolutional neural network to better enhance the resolution of Hi-C data. Bioinformatics 35, 4222–4228 (2019).
Liu, Q., Lv, H. & Jiang, R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics 35, i99–i107 (2019).
Hong, H. et al. DeepHiC: A generative adversarial network for enhancing Hi-C data resolution. PLoS Comput. Biol. 16, e1007287 (2020).
Dimmick, M. C., Lee, L. J. & Frey, B. J. HiCSR: A Hi-C super-resolution framework for producing highly realistic contact maps. Preprint at https://doi.org/10.1101/2020.02.24.961714.
Highsmith, M., Oluwadare, O. & Cheng, J. Deep learning for denoising Hi-C chromosomal contact data. Preprint at https://doi.org/10.1101/692558.
Kingma, D. P. & Welling, M. An Introduction to Variational Autoencoders. https://doi.org/10.1561/9781680836233 (2019).
Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015).
Kraft, K. et al. Serial genomic inversions induce tissue-specific architectural stripes, gene misexpression and congenital malformations. Nat. Cell Biol. 21, 305–310 (2019).
Zhao, L. et al. Chromatin loops associated with active genes and heterochromatin shape rice genome architecture for transcriptional regulation. Nat. Commun. 10, 3640 (2019).
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinform. 68, 1020–1020 (2007).
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Trieu, T. & Cheng, J. 3D genome structure modeling by Lorentzian objective function. Nucleic Acids Res. 45, 1049–1058 (2017).
Dali, R. & Blanchette, M. A critical assessment of topologically associating domain prediction tools. Nucleic Acids Res. 45, 2994–3005 (2017).
Knight, P. A. & Ruiz, D. A fast algorithm for matrix balancing. IMA J. Numer. Anal. 33, 1029–1047 (2013).
Ursu, O. et al. GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 34, 2701–2707 (2018).
Yang, T. et al. HiCRep: Assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).
Sauria, M. E. G. & Taylor, J. QuASAR: Quality assessment of spatial arrangement reproducibility in Hi-C data. Preprint at https://doi.org/10.1101/204438.
Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).
Oluwadare, O., Zhang, Y. & Cheng, J. A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC Genomics 19, 161 (2018).
Acknowledgements
This project is partially supported by two NSF grants (no. IOS-1545780 and no. DBI-1149224). It is also supported by the NIH through the T32 training grant GM008396.
Author information
Authors and Affiliations
Contributions
M.H. and J.C. conceived the project. M.H. performed all experiments and drafted the manuscript. J.C. revised the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Highsmith, M., Cheng, J. VEHiCLE: a Variationally Encoded Hi-C Loss Enhancement algorithm for improving and generating Hi-C data. Sci Rep 11, 8880 (2021). https://doi.org/10.1038/s41598-021-88115-9
DOI: https://doi.org/10.1038/s41598-021-88115-9
This article is cited by
Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering (2023)
Normalization and denoising of single-cell Hi-C data with BandNorm and scVI-3D. Genome Biology (2022)