Abstract
Chromatin interaction studies can reveal how the genome is organized into spatially confined subcompartments in the nucleus. However, accurately identifying subcompartments from chromatin interaction data remains a challenge in computational biology. Here, we present SubCompartment Identifier (SCI), an algorithm that uses graph embedding followed by unsupervised learning to predict subcompartments using HiC chromatin interaction data. We find that the network topological centrality and clustering performance of SCI subcompartment predictions are superior to those of hidden Markov model (HMM) subcompartment predictions. Moreover, using orthogonal Chromatin Interaction Analysis by insitu PairedEnd Tag Sequencing (ChIAPET) data, we confirmed that SCI subcompartment prediction outperforms HMM. We show that SCIpredicted subcompartments have distinct epigenetic marks, transcriptional activities, and transcription factor enrichment. Moreover, we present a deep neural network to predict subcompartments using epigenome, replication timing, and sequence data. Our neural network predicts more accurate subcompartment predictions when SCIdetermined subcompartments are used as labels for training.
Introduction
The genome is hierarchically organized in threedimensional (3D) space^{1}, and chromosomal sequences can be mapped to one of two major compartments: compartment A is associated with open chromatin, and compartment B is associated with closed chromatin. The A/B compartments are predicted from chromatin interaction data, in particular, data generated by the genomewide HiC chromatin interaction profiling technique, which combines proximitybased ligation with massively parallel sequencing^{2}. Recent studies of additional data types, including methylation, DNase I hypersensitive sites, and singlecell ATACseq, have corroborated the existence of A/B compartments as functional units of genome organization^{3}. The genome can be further subdivided into subcompartments, where genomic regions within a given subcompartment are more likely to interact with each other than with regions in different subcompartments. Subsequent to the A/B compartment model, a threesubcompartment model was introduced based on higherresolution HiC interchromosomal interaction data^{4}: one subcompartment is a component of the open chromatin compartment; and the other two subcompartments are components of closed chromatin compartment, one near centromeres and the other far from centromeres. The latest model proposes five subcompartments based on the use of hidden Markov modeling (HMM) to cluster interchromosomal interactions^{5}. This method identifies two active subcompartments (A1 and A2) and three inactive subcompartments (B1, B2, and B3) using deeply sequenced GM12878 cell line data. The model also shows that the five subcompartments exhibit distinct epigenomic signatures. Since subcompartments are essential for understanding higherorder 3D genome organization, and an increasing volume of genomewide chromatin interaction data will soon be available through the ENCODE and 4D Nucleome consortia, there is a practical need for an easytouse, automated, and accurate subcompartment predictor.
In addition to HiC, other types of genomewide sequencing data have been used to predict the higherorder genome organization. For example, chromatin immunoprecipitation sequencing (ChIPseq) data of 11 histone modifications and 73 transcription factors (TFs) have been used to predict subcompartments with 63% test accuracy using a deep learning multiclass classifier^{6}. Recently, TSASeq, a new genomewide mapping method that estimates mean distances of chromosomal loci from nuclear structures, was used to predict several Mbp chromosome trajectories between nuclear structures^{7}. These findings use subcompartment assignments either for predictive model output labels or for evaluation of the distances between nuclear speckles. These studies require accurate subcompartment determination. Also, a DNA methylation correlation matrix based on multiple replicates corroborates the A/B compartment model^{3}. However, the potential of DNA methylation levels to predict nuclear subcompartmentalization is still not clear.
Here, we introduce SCI, an algorithmic framework based on graph embedding^{8} and kmeans clustering that predicts genomic subcompartments from HiC data. We show that subcompartment prediction by SCI is more accurate than other unsupervised algorithms. Furthermore, we use orthogonal data for external validation, including ChIAPET chromatin interaction, transcriptome, and epigenome data. Lastly, using SCI output, we developed an epigenomebased (DNA methylation and histone modification) deep neural network model for subcompartment classification.
Results
SCI overview
To predict subcompartments, SCI starts with a normalized, genomewide HiC interchromosomal contact matrix and constructs a HiC interaction graph, where each node on the graph represents a genomic bin. If two bins are interacting, an edge connecting them is added to the graph, and the weight of the edge corresponds to the normalized HiC sequencing read count between the two bins. Then, SCI uses graph embedding^{8} to project the interaction graph into a lowerdimensional vector space for kmeans clustering to predict subcompartments (Fig. 1a).
Specifically, SCI adapts a graphembedding method termed LINE^{8} to project an otherwise highdimensional graph structure into a lowerdimensional space that describes the chromatin interactions. LINE utilizes firstorder and secondorder proximities of graph vertices to reduce dimensionality and promote efficient clustering.
The number of subcompartments in the nucleus is jointly determined by (a) the spatial location of chromatin and (b) chromatin interactions. Importantly, local epigenetic status can impact chromatin interactions^{9}. Using gap statistics to determine the optimal number of subcompartments, we obtained nine as the optimal number of subcompartments (named C1–C9). However, some of these subcompartments are enriched for similar epigenetic modifications (C3 and C4; C5 and C6), suggesting that subcompartments can be spatially separated and functionally comparable—they still show comparable levels of transcriptional potential (Supplementary Fig. 1). To allow for direct comparison of SCI with other computational methods for predicting subcompartments, we chose to use 5 clusters in subsequent SCI performance evaluations. Importantly, SCI offers users the flexibility to select the number of subcompartments based on either their preference or the optimal number calculated by gap statistics.
We then assessed the performance of SCI based on structural and functional evaluations (Fig. 1b). We also developed a deep neural network classifier for predicting subcompartments using DNA sequencing data and epigenomic profiles (Fig. 1b).
SCI outperforms alternative algorithms
We compared subcompartments predicted by SCI to those predicted by HMM (reported in Rao et al. 2014) and by kmeans clustering applied directly to an interchromosomal interaction matrix (Kmeans_Yaffe, reported in Yaffe, E. and Tanay, A., 2011). We tested each algorithm using data from Rao et al., who generated 4.9 billion HiC reads from GM12878, a human lymphoblastoid ENCODE cell line.
When given the same parameter (k = 5), SCI predicted comparable number of subcompartments as that predicted by HMM (C1C5; Fig. 1c, Supplementary Fig. 3); however, Kmeans_Yaffe failed to predict five subcompartments, as it placed genomic bins in four subcompartments instead of 5 (Supplementary Fig. 2). Following the convention of Rao et al., we reordered subcompartments predicted by SCI such that subcompartments C1 and C2 correspond to open chromatin subcompartments reported by Rao et al. (A1 and A2), while subcompartments C3, C4, and C5 correspond to closed chromatin subcompartments reported by Rao et al. (B1, B2, and B3). Subcompartments C1 and C2 have higher gene density compared to the other three subcompartments (Fig. 1d), which is consistent with the previous reports^{10}.
Our UMAP visualization of SCI embedding (Fig. 1e) shows: (i) a clear distinction between subcompartments C1 (in red) and C2 (in yellow), which was indirectly reported in the literature by measuring the distance from laminaassociated domains (LADs)^{11}; and (ii), clear separation of the inactive subcompartment C3 (in gray) from the inactive subcompartments C4 (in purple) and C5 (in blue). We further showed that the open chromatinassociated subcompartments (C1 and C2) have higher enrichment of ATACseq compared to the inactive compartments (Fig. 1f, openchromatin subcompartment C1 compared to closedchromatin subcompartment C3, and openchromatin subcompartment C2 compared to closedchromatin subcompartment C5). Moreover, we show that C4 is more highly enriched for nucleolarassociating chromosomal domains (NADs)^{12} compared to C5 (Fig. 1g), indicating that C4 is located in the nucleolus while C5 is likely not. Thus, SCIpredicted subcompartments exhibit distinct chromatin characteristics.
Genomic regions that map to the same subcompartment based on the graph embedding of SCI are expected to be in proximity to each other in 3D space. To test this, we examined intrasubcompartment, interchromosomal chromatin interaction network topology features, including network centrality (closeness and betweenness) and the clustering coefficient. Compared to the HMM approach, SCI showed significantly higher closeness centrality (pvalues < 2.2e−16, Mann–Whitney test) and betweenness centrality, and a higher graph clustering coefficient within subcompartments (Fig. 1h–j), with improvements between 8–10%. In addition, we found that open subcompartments (C1 and C2) show higher centrality values for chromatin interactions compared to the closed subcompartments (C3, C4, C5) (Supplementary Fig. 3), indicating that heterochromatin is more spatially constrained than euchromatin. These topological network features indicate higher chromatin interactions between genomic regions within the same SCIpredicted subcompartments. To further evaluate the clustering performance of SCI and HMM, we calculated their Silhouette indices, which measure the consistency within clusters of data^{13}. The Silhouette index ranges from −1 to 1, where a higher value indicates better clustering performance. SCI improved the Silhouette index over the previous HMM subcompartment annotation by 8% (Fig. 1k). Moreover, we calculated the DaviesBouldin index, where lower values are better, and SCI improved the DaviesBouldin index by 7% (Supplementary Fig. 4). We also compared LINE to the stateoftheart graph embedding algorithms HOPE^{14} and DeepWalk^{15}. LINEbased cluster results had a better Silhouette index and centrality measures compared to the alternative graph embedding methods (Supplementary Fig. 5).
SCI performance validated with ChIAPET data
To further validate that genomic regions from any given HiCbased SCIpredicted subcompartment have higher chromatin interaction frequencies than those from different subcompartments, we applied SCI to datasets from an orthogonal chromatin interaction assay, Chromatin Interaction Analysis by PairedEnd Tag Sequencing (ChIAPET). ChIAPET measures chromatin interactions associated with specific protein factors^{16}. We used two different ChIAPET datasets: the first captures interactions associated with CCCTCbinding factor (CTCF), a chromatinbinding protein associated with more than 70% of the total chromatin interactions^{17}; and the second captures interactions associated with RNA polymerase II (RNAPII), which coordinates communication between promoters and their distal regulatory elements. We then compared the chromatin interactions in GM12878 cells measured by CTCF and RNAPII ChIAPET to quantify two types of chromatin interactions: (i) intraloops, which are defined as ChIAPET chromatin loops that connect two genomic bins that have the same subcompartment predictions; and (ii) interloops, which are defined as ChIAPET chromatin loops that connect two genomic bins that have different subcompartment predictions (Fig. 2). We showed that there are more CTCF ChIAPET intraloops (the diagonal of the heatmap in Fig. 2a and the top panel of Fig. 2b and Supplementary Fig. 6) than CTCF ChIAPET interloops. SCIdetected subcompartments achieved a higher intraloop vs. interloop ratio of normalized CTCF ChIAPET loop counts compared to HMMdetected subcompartments (Fig. 2c, pvalue < 2.2 × 10^{−16}, Fisher’s exact test). RNAPII ChIAPET data corroborated a higher ratio of intraloops vs. interloops using SCI compared to HMM (Fig. 2d–f). This equated to a 60% and 74% more robust performance for SCI compared to HMM for CTCF and RNAPII loop ratios, respectively. We then visualized the ChIAPET loops in an example region with more intraloops than interloops (Fig. 2g). Together, these data demonstrate that, compared to HMMprediction of subcompartments, SCIprediction of subcompartments achieves better clustering performance and tighter network topology structure.
Functional assessment of subcompartments
To compare the functional properties of subcompartments predicted by SCI with those of subcompartments predicted by other methods, we assessed enrichment for epigenomic marks and replication timing within subcompartments. Data comprised DNase hypersensitive sites (DHS) (open chromatin), ten histone modifications, enhancer annotations, superenhancer annotations, DNA methylation, and six replication timing datasets (Fig. 3a, Supplementary Fig. 7). In general, transcriptionally active subcompartments are expected to have higher enrichment for DHS and the histone modifications (H3K27ac and H3K36me), compared to transcriptionally inactive subcompartments. While inactive subcompartments are expected to have higher enrichment for H3K27me3 and DNA methylation. We found that SCIpredicted subcompartments satisfied these expectations (Fig. 3a). In contrast, HMMpredicted subcompartments exhibited higher enrichment for DHS in C3 (inactive) than in C2 (active) (Supplementary Fig. 7). We observed higher enrichment for enhancers and superenhancers in the C1 and C2 subcompartments predicted by both SCI and HMM (Fig. 3a, Supplementary Fig. 7), consistent with transcriptionally active genomic regions.
Genes in close 3D spatial proximity to one another are more likely to have synchronized transcriptional control compared to those that are in different subcompartments^{18}. Therefore, we examined transcriptional activity in SCIpredicted subcompartments. We assessed geneexpression levels using ENCODE RNAseq data of GM12878 cells. Genes within C1 and C2 had significantly higher transcript per million (TPM) values (pvalue < 0.01 based on ANOVA) compared to genes within C35, consistent with their epigenetic regulatory status (Fig. 3b). Furthermore, we evaluated the generation of nascent RNA using Global runon sequencing (GROseq)^{19}. The GROseq data confirmed that transcriptional activity in C1 and C2 is significantly higher (fold change > 3, pvalue < 0.01 based on ANOVA) than that in subcompartments C35 (Fig. 3c).
We hypothesize that the expression of genes within the same subcompartment is more highly coordinated than that of genes across different subcompartments. Therefore, we calculated pairwise Spearman’s correlation of gene expression for gene pairs within the same subcompartment and across subcompartments. We observed higher pairwise correlation values of gene expression for genes falling into bins within the same subcompartment (intracompartment correlation) compared to genes in different subcompartments (intercompartment correlation) (Fig. 3d). We also showed that SCI achieved a significantly higher (Pvalue < 2.2e−16, paired Wilcoxon test) intracompartment correlation to the intercompartment correlation ratio compared to HMM. As shown in Fig. 3e, SCI performs 40% better than HMM with respect to pairwise geneexpression correlation for functional evaluation of subcompartments.
Folding the genome into subcompartments may optimize the efficient usage of transcriptional regulatory elements. If true, we would expect to see subcompartmentspecific TF enrichment. To test this, we assessed TF enrichment using elastic net regression, which infers the regulatory elements that control transcription for each subcompartment^{20}. Based on feature dependency analysis^{20} (see Methods section for details), we identified subcompartmentspecific TFs that can be used to predict geneexpression levels in each subcompartment (Supplementary Table 1). We also identified subcompartmentspecific TFs (red text in Fig. 4a, b). To further validate C1 and C2specific TFs, we evaluated ENCODE ChIPseq data that was available for MYC, EP300, and BRCA1 TFs. We found that ChIPseqbased TF enrichment patterns agree with the predicted subcompartmentspecific TFs, where MYC and BRCA1 are enriched in C1, and BATF and EP300 are enriched in C2 (Fig. 4c). In addition, we observed that YY1, a TF common to both C1 and C2, is not enriched in either subcompartment. These results indicate that SCI can identify chromatin hierarchical subcompartments that parse into transcriptional regulatory units.
Deep neural network for predicting subcompartments
Recently, models have been proposed for predicting subcompartmentlevel genome structure from the DNA sequence, ChIPseq data^{6}, and other 3D genome assays^{7}. One approach uses a 450 Kbased DNA methylation correlation matrix from multiple replicates (n ≥ 62) to identify genomewide A/B compartments. However, there is no available model for subcompartment structure prediction using a single DNA methylation library.
Using SCI predictions, the five subcompartments showed distinct epigenome signatures (Fig. 3a), DNA methylation percentage distributions, and percentage of disordered reads, the latter of which is a measurement of epigenome stochasticity^{21} (Fig. 5a). Building on these distinctions, we developed an epigenomebased classifier to predict subcompartments based on ChIPseq epigenomic marks and wholegenome bisulfite sequencing data, using a deep neural network (DNN) (Fig. 5b). We combined dynamic epigenomic features and static DNA sequence features to design a deep neural network (DNN). We sliced the data set into a training set (80%) and test set (20%). The model accuracy reached higher test accuracy (0.73) using SCIdetected class labels than using HMMdetected class labels (0.68) or using randomized class labels (0.23) with the same class sizes (Fig. 5c). In addition, we developed a more compact model using only DNA methylation and DNA based features, which was able to predict genomic subcompartments with 0.66 of accuracy. Because the model interpretability of DNN is limited, we used a gradientboosted trees model (XGBoost) classifier and random forest machine learning methods to derive a feature importance score for our features. The accuracy of the XGBoost classifier is 0.69, and the accuracy of the random forest method is 0.67, each of which is slightly lower than the accuracy of our DNN model. The XGBoost and random forest methods agreed that histone modification features followed by DNA methylation distribution are the most important feature in the dataset (Supplementary Fig. 8).
Our subcompartment DNN predictors outperformed recently developed model^{6} using 84 ChIPseq libraries for histone modifications and TFs using the same training/test data split (training data is from oddnumbered chromosomes and test data is from evennumbered chromosomes) (Supplementary Table 2). Moreover, our reduced model using DNA methylation and DNA features outperformed their reduced model using 11 histone marks.
Compartment prediction using HiC or ChIAPET data
We also used SCI to predict A/B compartments from HiC and ChIAPET data on two cell lines: GM128787 (human Blymphocyte) and K562 (human lymphoblast). We predicted A/B compartments using a 100kb resolution. We observed high agreement for A/B compartment prediction from both technologies (Supplementary Fig. 9) (agreement for GM12878 is 91% and agreement for K562 is 78%). We observed a high correlation (0.88 for GM12878 and 0.76 for K562) for the first eigenvector values between HiC and ChIAPET. Moreover, we noted that enrichment for several open chromatin marks, histone marks, and replication timing data was similar using compartments called by HiC and ChIAPET for both cell lines (Supplementary Fig. 10). We implemented the compartment predictor using static features (DNA sequence information) and dynamic features (DNA methylation information) as input for the deep neural network model and achieved 87% test accuracy (Supplementary Fig. 11).
Discussion
HiC enables the elucidation of higherorder chromatin topology that suggests structural regulation of transcriptional control. Software exists for data preprocessing, chromatin loop calling, and topologically associating domain (TAD) predictions. However, there is no available software to examine compartment and subcompartment structures. Furthermore, it has been shown that the TAD structure is rather stable among different cell types, while the structure of subcompartments is highly variable among different cell types; this validates the need to study the cell typespecific functions of subcompartments^{22}. SCI identifies complex nuclear subcompartments in a fully datadriven, unsupervised fashion.
In contrast to previous methods that infer subcompartments using HMM, SCI more comprehensively utilizes network properties and thus may preserve the global structure of the chromatin interaction network. SCI formulates subcompartment prediction as a graphbased problem and enables efficient utilization of HiC information via graph embedding. SCI considers both the firstorder and secondorder proximities, which complement the sparsity of firstorder proximity in HiC data. We demonstrated that SCI outperforms other algorithms, such as HMM and Kmeans_Yaffe. We believe that the superior performance of SCI compared to other methods is due to its use of all the interchromosomal interactions, and efficient dimensionality reduction via graph embedding. Specifically, Rao_HMM uses only a subset of interchromosome interactions, and Kmeans_Yaffe does not perform a dimensionality reduction on the data. Importantly, we studied the functional relevance of subcompartments as they relate to cellular processes. Moreover, we show TF enrichment in SCIpredicted subcompartments.
Recently, various computational models have been proposed to predict genome organization from sequence and epigenomic data^{6} and other 3D genome assays^{7}. We believe that SCI will further propel the development of such models by providing robust, accurate labels for genomic subcompartments. As an example, we developed compartment and subcompartment DNN predictors using static features from the DNA sequence of the reference genome and dynamic features from histone modifications and DNA methylation data and outperformed the published model^{6} using 84 ENCODE ChIPseq libraries for histone modifications and TFs from the same cell line.
Importantly, the ENCODE and 4D Nucleome consortia are generating robust HiC datasets, and we anticipate that SCI will contribute to more efficient and accurate subcompartment determination, and will improve our understanding of the complex interactions between DNA methylation, gene expression, and chromatin organization. Furthermore, tens of thousands of DNA methylation assays have been generated in largescale studies such as The Cancer Genome Atlas, the Epigenome Roadmap, and Cancer Cell Line Encyclopedia. SCI will provide methods to maximize the value of these datasets and thereby enrich our understanding of the layers of geneexpression coordination that are important for development and disease states.
Methods
Background
All current subcompartment prediction methods utilize an interchromosomal interaction matrix to predict subcompartments. Two methods for prediction of genomic subcompartments have been published. First model is the Yaffee and Tanay subcompartment model (Yaffee_Kmeans). This model applies Kmeans clustering directly on a HiC interaction matrix. It modifies the distance calculation method such that it ignores all distance computation that involves intrachromosomes entries. This model identifies three subcompartments: one active and two inactive subcompartments. The second model is Rao subcompartments (HMM). The Rao subcompartment prediction method focuses on one subset of interchromosomal interactions: the interactions between oddnumbered and evennumbered chromosomes. This method uses HMM to cluster the interactions between oddnumbered and evennumbered chromosomes in different runs. After performing cluster annotation for oddnumbered and evennumbered chromosomes, the Rao method combines those annotations into a single final cluster annotation based on enriched HiC interactions between the clusters of oddnumbered and evennumbered chromosomes.
Graphembedding methods
Given an undirected graph G = [V, E], where V represents genomic bins, and E represents HiC interactions, each edge (e) between v_{i} and v_{j} has weight w_{ij}, which represents the normalized HiC read count. Graphembedding approaches project graph G into a low dimension space Rd by calculating embedding matrix U (please refer to the following reviews for more detailed information about graph embedding^{23,24,25}). We compared SCI graph embedding method to DeepWalk and HOPE graph embedding methods.
The Deepwalk method relies on performing random walks across the graph that have a specific length. These walks resemble node sequences in the graph, and this sequence is fed to the Word2Vec approach to derive the embedding of the graph vertices based on their context. The original DeepWalk implementation handles only unweighted graphs. We used a modified version of the algorithm to handle weighted graphs from (https://github.com/dongguosheng/deepwalk).
The HOPE method derives the graph embedding matrix U, which minimizes the following objective function:
where S is a similarity matrix, and U is an embedding matrix where U = [Us, Ut]. The similarity matrix can be defined using different similarity measures, including the Katz index and rooted page rank. For HOPE, we used the implementation of these algorithms from the GEM graph embedding package (https://github.com/palash1992/GEM). We used Katz index similarity measures in the original publication^{14}.
LINE and joint optimization
In order to project graph G into lower dimension R^{d}, LINE defines two properties to preserve in the graph: firstorder proximity and secondorder proximity.
Firstorder proximity is defined as the pairwise proximity through edges between the vertices in graph G. LINE models firstorder proximity between two vertices in graph v_{i} and v_{j}:
where u_{i} and u_{j} are vertices embedding into lowdimensional space R^{d}. We define empirical edge distribution over the space V × V, \(\hat p_1 = \frac{{w_{ij}}}{W}\) where \(W = \mathop {\sum }\nolimits_{(i,j) \in E} w_{ij}\). To obtain the firstorder embedding, LINE optimizes the KL divergence (omitting constants) between p_{1} and \(\hat p_1\) as
For secondorder proximity, the main assumption is that vertices connected to other vertices are similar. To calculate embedding based on the secondorder proximity, LINE introduces the context concept. Each vertex in the graph is considered as a simple vertex or as a context to other vertices. LINE introduces two vectors u_{i} and \({\mathrm{u}}_{\mathrm{i}}^\prime\), where u_{i} is the representation for v_{i} as a vertex while \({\mathrm{u}}_{\mathrm{i}}^\prime\) is the representation for v_{i} as a context.
For each directed edge in the graph (i, j) (undirected edges are treated as two directed edges with opposite directions), LINE defines the probability of context v_{j} generated by vertex v_{i} as:
Empirical distribution \(\hat p_2\) is defined as \(\hat p_2 = \frac{{w_{ij}}}{{d_i}}\), where di is the outdegree of node i.
LINE optimizes scaled (scaled by node degree) versions of the KL divergence between p_{2} and \(\hat p_2\). It defines
SCI implements two approaches to combine defined objective functions O_{1} and O_{2}. In the first approach, SCI optimizes O_{1} and O_{2} independently and then combines representation from both orders into a final representation. In the second approach (joint optimization), SCI combines firstorder and secondorder proximity objective functions into one function to define O_{3} as:
where α is a mixing parameter to determine the weight of firstorder and secondorder optimization functions.
It then optimizes both orders at the same time. An asynchronous gradient descent algorithm coupled with negative sampling is used to optimize all objective functions. The joint optimization showed a 30% increase in time efficiency (from 26 min to 18 min using GM12878 HiC data) and a slightly lower clustering performance of the joint optimization compared to separate optimization (measured by Silhouette index, DaviesBouldin index, and centrality measures, Supplementary Fig. 12). Both separate and joint optimization are available as options in the SCI code.
Evaluation for clustering methods
For each genomic bin we define Silhouette index (s_{i}) as:
where b_{i} is the lowest average distance for genomic bin i for all other clusters, and a_{i} is the average distance from genomic bin i from all points in the same cluster.
We used the silhouette_score function from scikitlearn package version 0.19.2. We used Euclidean distance as our distance measure.
The Davies–Bouldin index (DBI) is defined as:
where
Following the definition in ref. ^{26}, we define d_{i} and d_{j} as the average distance between all pairs in clusters i and j. d_{ij} is the distance between all pairs between clusters i and j.
The clustering coefficient is defined as the ratio of the number of closed triplets (triangles) the node has to the number of all triplets that the node has. To adjust to the weighted graph, the weighted average for the closed triples is calculated^{27}.
Closeness network centrality of a node is defined as the average length of the shortest path between the node and all other nodes in the graph. Closeness centrality for a node u in a graph with n nodes can be calculated as:
where d(v, u) is the shortest path between nodes u and v.
Betweenness network centrality is defined as the number of times a node acts as a bridge along the shortest path between two other nodes. Betweenness centrality for a node u in a graph with n nodes can be calculated as:
where V is the total number of nodes in the graph. \(\sigma \left( {s,t} \right)\) is the number of shortest paths between nodes s and t. \(\sigma (s,tu)\) is the number of shortest paths between s and t passing through u.
Closeness and betweenness centralities were calculated using NetworkX version 2.1.
GM12878 CTCF and RNAPII ChIAPET intrachromosome loops supported by at least five pairedend reads were used for loop ratio calculation. A loop anchor is associated with a specific subcompartment if it has an overlapping ratio greater than 50%. Next, we create subcompartment loop interaction matrix M (m_{ij}), where m_{ij} represents the number interaction between subcompartments i and j detected using ChIAPET data, such that the ChIAPET loop left anchor falls in subcompartment i and the right anchor falls in subcompartment j. As there is no directionality difference between the left and right anchors, we create a symmetric compartment matrix M_{sym} as the sum of M and M^{T}. Finally, a normalized matrix (M_{norm}) is obtained by dividing M_{sym} by the sum of its all elements.
An intrasubcompartment loop was defined as a ChIAPET loop with both anchors falling in genomic bins belonging to the same subcompartment. Intersubcompartment loops were defined if two anchors of the loops fall in genomic bins from different subcompartments. The ratio of the number of intrasubcompartment and the number of intersubcompartment were then used to assess the connectivity of the predicted subcompartments.
Subcompartment and compartment prediction using epigenome data
We developed machine learning and deep learning models for subcompartment and compartment predictions using DNA sequence static features and epigenomic features. We utilized the XGBoost^{28} and random forest^{29} machine learning approaches to assess feature importance.
To predict subcompartments from methylation and sequencebased data, we defined two types of features: static sequencebased features, and epigenomebased features. The static features include counting DNAbased 2mers and 3mers in each subcompartment bin. The DNA methylation features include DNA methylation percentage distribution in each genomic bin, and epigenome instability or disordered DNA methylation patterns measured by the percentage of disordered reads (PDR) and epipolymorphism. Also, we include the median signal for histone modification and replication timing data per genomic bin as features. Histone modifications data include H3K27ac, H3K27me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, H3K36me3, and H2az marks. Replication timing data include: RepG1, RepG2, RepS1, RepS2, RepS3, and RepS4^{30}. The model was constructed using the Keras framework (v2.2.4) and TensorFlow backend (v1.11.0).
LINE hyperparameter for subcompartment identification
We experimented with several hyperparameter settings for LINE. We performed hyperparameter selection over embedding size (size), number of negative samples (negative), and number of edges to be sampled to construct embedding (samples). We used the grid search to obtain the best parameters; below are the values used for each hyperparameter, with the selected values indicated in bold:
 i.
Size: 100, 128, 200, 256, and 512
 ii.
Negative: 1, 2, 3, 4, 5, 6, and 7
 iii.
Samples: 15, 20, 25, 30, 40, 50
Neural network hyperparameter selection for subcompartment and compartment prediction
We finetuned network hyperparameters using a random search. Below is the list of hyperparameters with their experimental values, with the selected values indicated in bold:
 i.
First layer number of nodes: [128, 256, 512, 1024]
 ii.
Second layer number of nodes [64, 265, 512]
 iii.
Learning rate [0.01, 0.02, 0.03, 0.04, 0.05]
 iv.
Second hidden layer dropout rate [0.1, 0.2, 0.3, 0.4, 0.5, 0.60, 0.7, 0.8, 0.9]
 v.
Optimizer: [SGD, RMSProp, Adam]
 vi.
Epochs: [100, 200, 300]
 vii.
Batch size: [32, 64, 128, 265]
Random forest and XGBoost hyperparameter selection
We used the XGBoost Python library to build our XGBoost classifier. We used the grid search to finetune the number of boosting trees, and we experimented with the number of trees in the range of [20, 200] with a step of 10. We selected 100 trees based on crossvalidation results using training data.
We used scikitlearn (scikitlearn.org) implementation for the random forest. We used the grid search to finetune the number of forest trees, and we experimented with the number of trees in the range of [20, 200] with a step of 10. We selected 130 trees based on crossvalidation results using training data.
Training and testing data design
Genomic bins (the data points) were split: 80% for training and 20% for testing. We applied random forest, XGBoost, and DNN models for subcompartment and compartment classifications. We report accuracy as the performance for those models. For comparison with^{6} in Supplementary Table 2, we trained on oddnumbered chromosomes and tested on evennumbered chromosomes.
Subcompartment regulator inference
We obtained the DNA binding motifs of transcription factors (TFs) from the CISBP database^{31}. We mapped TFs to the hg19 sequence using FIMO^{32}, and we retained TF hits with FDR < 0.1. Then, we associated TF binding sites (TFBS) with genes if the predicted TFBS were within 2 kb of a gene TSS. We further filtered TFBS by selecting only those in accessible chromatin regions, as determined by ATACseq data. Finally, we constructed a features matrix where the genes represent samples, and all mapped TFs represent features, and where the count of TF hits associated with a given gene represents the feature value. We modified the RegulatorInference tool (https://bitbucket.org/leslielab/regulatorinference) to implement a samplebysample elastic net regression model to predict gene expression for each replicate using regulatory elements in gene promoters. We optimized the elastic net mixing parameter (alpha) using ten folds crossvalidation. Based on the minimum squared error (mse), we picked an alpha value of 0.3 (Supplementary Fig. 13).
As a result of elastic net regression, we obtained a coefficient vector for TFs that represents the importance of the corresponding TF for the prediction of gene expression, while the sign of the coefficient can be interpreted as the predicted direction of regulation. We performed elastic net regression tasks separately for each subcompartment.
Finally, we performed feature dependency analysis across samples to determine regulators that significantly account for compartmentspecific gene expression in the regression models. In this process, the aggregate error change of one TF is calculated^{20}. In summary, the aggregate error corresponds to the difference between the total regression error and the error contributed by a specific TF across the six different RNAseq datasets used. To calculate the aggregate error for each elastic net model, initially, the entire error is measured using all learned TF coefficients. Then the TF coefficients are set to zero one at a time, then the error contributed by a specific TF is calculated. Lastly, the difference between both errors is calculated for each model and aggregated over the different models to define the final aggregate error.
The higher the aggregate error value, the most important the TF. If the aggregate error change of a TF is higher than the cutoff, it is considered an important TF. We used the default RegulatorInference tool cutoff, which corresponds to the mean of aggregate error changes across all TFs plus 1.5 times the standard deviation of aggregate error changes across all TFs.
Epigenomic signal enrichment
The enrichment of the epigenome signal was calculated based on the description of (Rao et al., 2014). Briefly, to calculate enrichment for epigenomic data, we used bwtool^{33} to calculate the mean of the normalized signal for each histone mark in each bin downloaded from the ENCODE consortium data portal (see data availability). Then we computed enrichment as the ratio of the median of ChIPseq signal for all of the bins in each subcompartment divided by median of ChIPseq signal of all of the bins genomewide.
Transcription factors enrichment
We obtained all ENCODE uniformly processed peaks from (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/). We assigned peaks to different subcompartments using bedtools^{34}. We calculated peak enrichment in specific subcompartments as the average peak abundance in a specific subcompartment divided by genomewide peak abundance. We evaluated the significance of the presence of a specific TF in a subcompartment using the MannWhitney test compared to the genomewide background. We considered only those TFs with corrected pvalue <0.05 in at least one compartment. To evaluate the significance between C1 and C2 subcompartments, we used a twotailed Mann–Whitney test.
Pairwise correlation among genomic bins using RNAseq
Correlation of geneexpression levels within the same or different compartments was computed only for those genes expressed in at least one of the studied replicates. Each gene was represented as one TSS, corresponding with the 5’ end of the “gene” feature from Gencode V19 reference annotation^{35}. All genes within a given maximum distance (e.g., 100 kb were joined into pairs. Spearman’s correlation coefficient was computed for each pair. Next, gene pairs were categorized as either intracompartment, if both genes were located in the same type of compartment (e.g., C1), or as intercompartment, if representative gene coordinates of a pair were located in different compartments. Finally, the average value of Spearman’s correlation coefficients was computed for all intracompartment and intercompartment pairs, considering only those for which results were statistically significant (i.e., pvalue < 0.05). All of the abovedescribed steps were conducted for the maximum range between pairs up to 10 Mb with a 100 Kb step.
Geneexpression levels comparison using RNAseq
We used transcriptpermillion RNAseq transcript quantification at the gene level in the box of Fig. 3b. We assigned gene TSSs to different subcompartments, and then we plotted a boxplot using seaborn package version 0.9.0.
Transcription efficiency measured by GROseq
We used bwtools to extract an aggregation plot^{36} for GROseq data around each TSS for each subcompartment separately.
DNA methylation data preprocessing
Briefly, whole genome bisulfite sequencing (WGBS) data were first trimmed of adapters using Trimgalore (version 0.4.0) [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/]. The base quality of the trimmed reads was checked with FastQC (version 0.11.5) [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]. The preprocessed reads were then aligned to human genome reference hg19 using Bismark (version 0.16.1)^{37}. The aligned reads were then deduplicated, and the methylation call at each CpG site was determined by running the appropriate Bismark scripts. DNA methylation patterns were calculated using methclone^{38}; epipolymorphism^{39} and PDR^{21} were calculated using epihet (https://www.bioconductor.org/packages/devel/bioc/html/epihet.html).
ATACseq data processing
We used HMCan^{40} to obtain signal tracks from the alignment data.
Enhancer and superenhancers enrichment
We calculated enhancer and superenhancer enrichment as the ratio between the expected value of observing a superenhancer (or enhancer) per compartment and the expected value of observing a super enhancer (or enhancer) genomewide.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
We obtained processed (.hic) formatted HiC data for both GM12878 and K562 celllines from^{5} with GEO entry GSE63525. GM12878 ChIAPET data accession code of 4D Nucleome consortium (https://commonfund.nih.gov/4dnucleome) is 4DNES7IB5LY9 (CTCF) and 4DNESZ25MOZV (RNAPII). We obtained raw ChIAPET for CTCF and RNAPII factor for K562 cellline from ENCODE. We processed raw files to produce loops follow the Links: https://www.encodeproject.org/experiments/ENCSR000BZY/ and https://www.encodeproject.org/experiments/ENCSR000CAC/. We used ATACseq data generated by^{41} GEO entry GSE47753. We downloaded signal bigwig tracks for DHS and histone modifications from the ENCODE consortium using the following link: http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/signal/jan2011/bigwig/. We downloaded superenhancer annotations for GM12878 dbSuper database^{42} (https://asntech.org/dbsuper/). We obtained enhancer data for GM12878 from DENdb enhancers database^{43} (https://www.cbrc.kaust.edu.sa/dendb/), and used only high quality enhancers with a minimum score of 3 as assessed by DHS signal. We downloaded RNAseq quantified gene expression files from ENCODE with the following GEO accessions: GSE88583, GSE88627, GSM958730, GSE90222, GSE78553, GSE78555. We downloaded GROseq signal tracks described in ref. ^{19} from GEO accession number GSE60456. For replication time data, we downloaded signal tracks from the ENCODE consortium using the following link: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/. We obtained NAD enriched regions published in^{12} for hg18 reference. We used liftOver tool to map regions to hg19 reference. We only considered mapped regions with 0.95 confidence.
Code availability
SCI code is available on https://github.com/TheJacksonLaboratory/sci and https://github.com/TheJacksonLaboratory/sciDNN.
References
Schmitt, A. D., Hu, M. & Ren, B. Genomewide mapping and analysis of chromosome architecture. Nat. Rev. Mol. Cell Biol. 17, 743–755 (2016).
LiebermanAiden, E. et al. Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
Fortin, J. P. & Hansen, K. D. Reconstructing A/B compartments as revealed by HiC using longrange correlations in epigenetic data. Genome Biol. 16, 180 (2015).
Yaffe, E. & Tanay, A. Probabilistic modeling of HiC contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, 1059–1065 (2011).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Di Pierro, M., Cheng, R. R., Lieberman Aiden, E., Wolynes, P. G. & Onuchic, J. N. De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. Proc. Natl Acad. Sci. USA 114, 12126–12131 (2017).
Chen, Y. et al. Mapping 3D genome organization relative to nuclear compartments using TSASeq as a cytological ruler. J. Cell Biol. 217, 4025–4048 (2018).
Tang, J. et al. LINE: Largescale Information Network Embedding. In: Proceedings of the 24th International Conference on World Wide Web. (International World Wide Web Conferences Steering Committee, 2015).
Flavahan, W. A. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 529, 110–114 (2016).
Ke, Y. et al. 3D chromatin structures of mature gametes and structural reprogramming during mammalian embryogenesis. Cell 170, 367–381 e320 (2017).
Robson, M. I. et al. Constrained release of laminaassociated enhancers and genes from the nuclear envelope during Tcell activation facilitates their association in chromosome compartments. Genome Res. 27, 1126–1138 (2017).
Nemeth, A. et al. Initial genomics of the human nucleolus. PLoS Genet. 6, e1000889 (2010).
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Ou, M., Cui, P., Pei, J., Zhang, Z. & Zhu, W. Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ACM, 2016).
Perozzi, B., AlRfou, R. & Skiena, S. Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. (ACM, 2014).
Fullwood, M. J. et al. An oestrogenreceptoralphabound human chromatin interactome. Nature 462, 58–64 (2009).
Tang, Z. et al. CTCFmediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627 (2015).
Hnisz, D., Shrinivas, K., Young, R. A., Chakraborty, A. K. & Sharp, P. A. A phase separation model for transcriptional control. Cell 169, 13–23 (2017).
Core, L. J. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).
Setty, M. et al. Inferring transcriptional and microRNAmediated regulatory programs in glioblastoma. Mol. Syst. Biol. 8, 605 (2012).
Landau, D. A. et al. Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell 26, 813–825 (2014).
Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Gurukar, S. et al. Network Representation Learning: Consolidation and Renewed Bearing. Preprint at arXiv:190500987 (2019).
Zhang, D., Yin, J., Zhu, X., Zhang, C. Network representation learning: a survey. IEEE Transactions on Big Data (2018).
Goyal, P. & Ferrara, E. Graph embedding techniques, applications, and performance: a survey. Knowl.Based Syst. 151, 78–94 (2018).
Fotuhi Siahpirani, A., Ay, F. & Roy, S. A multitask graphclustering approach for chromosome conformation capture data sets identifies conserved modules of chromosomal interactions. Genome Biol. 17, 114 (2016).
Saramäki, J., Kivelä, M., Onnela, J.P., Kaski, K. & Kertesz, J. Generalizations of the clustering coefficient to weighted complex networks. Phys. Rev. E 75, 027105 (2007).
Chen, T., Guestrin, C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2016).
Ho T. K. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. (IEEE, 1995).
Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Pohl, A. & Beato, M. bwtool: a tool for bigWig files. Bioinformatics 30, 1618–1619 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res. 22, 1735–1747 (2012).
Krueger, F. & Andrews, S. R. Bismark: a flexible aligner and methylation caller for BisulfiteSeq applications. Bioinformatics 27, 1571–1572 (2011).
Li, S. et al. Dynamic evolution of clonal epialleles revealed by methclone. Genome Biol. 15, 472 (2014).
Landan, G. et al. Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat. Genet. 44, 1207–1214 (2012).
Ashoor, H. et al. HMCan: a method for detecting chromatin modifications in cancer samples using ChIPseq data. Bioinformatics 29, 2979–2986 (2013).
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNAbinding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
Khan, A. & Zhang, X. dbSUPER: a database of superenhancers in mouse and human genome. Nucleic Acids Res. 44, D164–D171 (2016).
Ashoor, H., Kleftogiannis, D., Radovanovic, A. & Bajic, V. B. DENdb: database of integrated human enhancers. Database 2015, bav085 (2015).
Acknowledgements
We thank Drs. Sara Cassidy, Carmen Robinett, and Stephen Sampson from The Jackson Laboratory Research Program Development for editing this paper. We thank The Jackson Laboratory Computational Sciences and Research IT team for technical support and discussion. S.L. was supported by the Leukemia Research Foundation New Investigator Grant, The Jackson Laboratory Director’s Innovation Fund (JAXDIF 190001713), The Jackson Laboratory Cancer Center New Investigator Award, and the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM133562. Research reported in this publication was partially supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA034196. Y.R. was supported by NIH ENCODE (UM1 HG009409), 4D Nucleome (U54 DK107967), and JAX Director’s Innovation Fund (JAXDIF 190001802).
Author information
Authors and Affiliations
Contributions
S.L. conceived the project, H.A. and S.L. designed the algorithm and deep neural network architecture, and H.A. implemented the algorithms, developed software, and performed computational analysis. X.C., W.R., J.W., P.W., Y.R., and S.L. contributed to analysis of SCI evaluation. H.A., A.C., and S.L. contribute to the revision. H.A. and S.L. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ashoor, H., Chen, X., Rosikiewicz, W. et al. Graph embedding and unsupervised learning predict genomic subcompartments from HiC chromatin interaction data. Nat Commun 11, 1173 (2020). https://doi.org/10.1038/s4146702014974x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s4146702014974x
This article is cited by

ProbC: joint modeling of epigenome and transcriptome effects in 3D genome
BMC Genomics (2022)

Polygenic risk modeling of tumor stage and survival in bladder cancer
BioData Mining (2022)

Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of conformation
Nature Communications (2022)

Regulation associated modules reflect 3D genome modularity associated with chromatin activity
Nature Communications (2022)

Systematic inference and comparison of multiscale chromatin subcompartments connects spatial organization to cell phenotypes
Nature Communications (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.