scHi-C data and other genomic data processing

In this work, we used several publicly available single-cell Hi-C datasets. We refer to them as Ramani et al.14 (Gene Expression Omnibus (GEO): GSE84920), Nagano et al.15 (GEO: GSE94489) and 4DN sci-Hi-C20 (4DN Data Portal: 4DNES4D5MWEZ, 4DNESUE2NSGS, 4DNESIKGI39T, 4DNES1BK1RMQ and 4DNESTVIP977). We also used a new scHi-C dataset generated from the WTC-11 iPSC line (4DN Data Portal: 4DNESF829JOW and 4DNESJQ4RXY5).

For all scHi-C datasets, we kept only the cells with more than 2,000 read pairs that have genomic span greater than 500 Kb. At a given resolution, we define the number of contacts per cell as the number of interaction pairs (read count) assigned to the non-diagonal entries of the intra-chromosomal contact maps. The Ramani et al. dataset and the 4DN sci-Hi-C dataset used single-cell combinatorial indexed Hi-C (sci-Hi-C).
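The filtering rule and the contact count defined above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the tuple layout `(chrom1, pos1, chrom2, pos2)` and the function names are hypothetical, and we assume that the span criterion is evaluated on intra-chromosomal read pairs only.

```python
def count_long_range_pairs(pairs, min_span=500_000):
    """Count intra-chromosomal read pairs whose genomic span exceeds min_span (bp)."""
    return sum(
        1
        for chrom1, pos1, chrom2, pos2 in pairs
        if chrom1 == chrom2 and abs(pos2 - pos1) > min_span
    )


def filter_cells(cells, min_pairs=2000, min_span=500_000):
    """Keep cells with more than min_pairs read pairs that span more than min_span."""
    return {
        cell_id: pairs
        for cell_id, pairs in cells.items()
        if count_long_range_pairs(pairs, min_span) > min_pairs
    }


def contacts_per_cell(pairs, resolution):
    """Count intra-chromosomal pairs that land off the diagonal at a given bin size."""
    return sum(
        1
        for chrom1, pos1, chrom2, pos2 in pairs
        if chrom1 == chrom2 and pos1 // resolution != pos2 // resolution
    )
```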

After filtering, the Ramani et al. dataset contains 620 cells of four human cell types (GM12878, HAP1, HeLa and K562) with 7,800 median contacts per cell, whereas the 4DN sci-Hi-C dataset contains 6,388 cells of five human cell types (GM12878, H1ESC, HAP1, HFFc6 and IMR90) with 3,800 median contacts per cell. The Nagano et al. dataset used a different protocol with 1,171 cells and 56,800 median contacts per cell. The WTC-11 scHi-C dataset (188 cells in total) was generated using single-nucleus Hi-C with 144,800 median contacts per cell. The interaction pairs from the Nagano et al. and Ramani et al. datasets were downloaded from the corresponding GEO repository. The interaction pairs for WTC-11 were obtained through personal communication with Bing Ren. For 4DN sci-Hi-C, we downloaded the FASTQ files and processed them with the recommended processing pipeline (https://github.com/VRam142/combinatorialHiC). The interaction pairs can be directly used as input for Higashi.

The co-assayed single-cell methylation and Hi-C dataset (sn-m3C-seq) was from ref. 17. We followed the same processing pipeline as sn-m3C-seq for processing the methylation signals. We obtained the 10-kb processed contact maps from ref. 17 and used them as input for Higashi. The corresponding cell type information was obtained from ref. 17 as well. The refined cell type information for the sn-m3C-seq dataset was from ref. 28, where the methylation profiles of the sn-m3C-seq dataset are jointly embedded with single-cell methylation profiles from snmC-seq, snmCT-seq and snmC2T-seq on human prefrontal cortex to annotate cell types. We then merged the small clusters with fewer than 30 cells in the sn-m3C-seq dataset for better visualization and more robust analysis. For all datasets, only intra-chromosomal contacts were used to make fair comparisons with other methods. In principle, Higashi can include inter-chromosomal interactions as well by adding the corresponding hyperedges to the model. However, the number of inter-chromosomal contacts in scHi-C data is generally not sufficient for reliable imputation and analysis.

We also used other publicly accessible genomic datasets in this work. The bulk Hi-C of WTC-11 was obtained from the 4DN Data Portal (4DNESPDEZNWX and 4DNESJ7S5NDJ; two clones were merged before calculating bulk compartment scores). The scRNA-seq of WTC-11 was from ref. 26. The details on calculating transcriptional variability based on scRNA-seq can be found in Supplementary Note A.6. We also analyzed the CTCF binding near the identified single-cell TAD-like domain boundaries in WTC-11 cells. We used the WTC-11 CTCF ChIA-PET data (4DN Data Portal: 4DNES8MZ76GP) and called peaks based on the singleton reads from the dataset following the ENCODE ChIP-seq peak calling pipeline36. Specifically, peaks were generated for individual replicates and merged by keeping only the reproducible peaks. The scRNA-seq of multiple cortical areas of the human brain was obtained from the Allen Brain map33,37.

Hypergraph NN architecture in Higashi

A hypergraph G is a generalization of a graph and can be formally defined as a pair of sets \(G=(V,E)\), where \(V=\{{v}_{i}\}\) represents the set of nodes in the graph, and \(E=\{{e}_{i}=({v}_{1}^{(i)},...,{v}_{k}^{(i)})\}\) represents the set of hyperedges. Each hyperedge \(e\in E\) connects two or more nodes (\(|e|\ge 2\)). Both nodes and hyperedges can have attributes reflecting the associated properties, such as node type or the strength of a hyperedge. The hyperedge prediction problem aims to learn a function f that can predict the probability of a group of nodes \(({v}_{1},{v}_{2},...,{v}_{k})\) forming a hyperedge or the attributes associated with the hyperedge. For simplicity, we refer to both cases as predicting the probabilities of forming a hyperedge.

The core part of Higashi is a hypergraph representation learning framework, extending our recently developed Hyper-SAGNN22, that models higher-order interaction patterns from the hypergraph constructed from the scHi-C data. The model aims to predict the value of an entry (that is, contact frequency) in an scHi-C contact map using the rest of the contact map as input. The model also has the option to use the contact maps from cells that share similar 3D genome structures (that is, close to each other in the embedding space) as auxiliary information for the prediction. This setting is similar to self-supervised learning on graphs38, where a proportion of the graph is masked randomly, and the NN is trained to recover the masked part from the rest of the graph. The overall structure of the hypergraph NN is illustrated in Supplementary Fig. 1. We use \({x}_{i}\) to represent the attributes of node \({v}_{i}\). The input to the model is a triplet, that is, \(({x}_{1},{x}_{2},{x}_{3})\), consisting of attributes of one cell node and two genomic bin nodes. For simplicity, we do not differentiate between these two types of nodes in this section. Each node within a triplet passes through an NN to produce \(({s}_{1},{s}_{2},{s}_{3})\), where \({s}_{i}={{\mathrm{NN}}}_{1}({x}_{i})\). The structure of \({{\mathrm{NN}}}_{1}\) used in this work is a position-wise feed-forward NN with one fully connected layer. By definition, each \({s}_{i}\) remains the same for node \({v}_{i}\) independent of the given triplet and is, thus, called the ‘static embedding’, reflecting the general topological properties of a node in the given hypergraph. In addition, the triplet as a whole also passes through another transformation, leading to a new set of vectors \(({d}_{1},{d}_{2},{d}_{3})\), where \({d}_{i}={{\mathrm{NN}}}_{2}({x}_{i}| ({x}_{1},{x}_{2},{x}_{3}))\). The structure of \({{\mathrm{NN}}}_{2}\) will be discussed later. Each \({d}_{i}\) depends on all the node features within the triplet, reflecting the specific properties of node \({v}_{i}\) in a particular hyperedge, and is, thus, called the ‘dynamic embedding’.

Next, the model uses the difference between the static and dynamic embeddings to produce \({\hat{y}}_{i}\) by passing the Hadamard power (element-wise square) of \({d}_{i}-{s}_{i}\) to a fully connected layer. Additional features, including the genomic distance between the bin pair, the one-hot-encoded chromosome ID, the batch ID when applicable and the total read number per cell, are concatenated and sent to a multi-layer perceptron with output \({\hat{y}}_{{{\mathrm{ext}}}}\). All the outputs \({\hat{y}}_{i}\) and \({\hat{y}}_{{{\mathrm{ext}}}}\) are further aggregated to produce the final result \(\hat{y}\), that is, the predicted probability for this triplet to be a hyperedge:

$$\hat{y}={\hat{y}}_{{{\mathrm{ext}}}}+\mathop{\sum }\limits_{i=1}^{3}{\hat{y}}_{i}={\hat{y}}_{{{\mathrm{ext}}}}+\mathop{\sum }\limits_{i=1}^{3}{{\mbox{FC}}}\,\left[{({d}_{i}-{s}_{i})}^{\circ 2}\right]$$ (1)

where FC is the fully connected layer.
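As a concrete reading of Eq. (1), the aggregation can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and we assume a single-output FC layer whose weights are shared across the three nodes. The value of \({\hat{y}}_{{{\mathrm{ext}}}}\) is taken as given, and any final squashing of the score into a probability is not shown.

```python
def fully_connected(vector, weights, bias):
    """A single-output fully connected layer: w . v + b."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


def hyperedge_score(static, dynamic, fc_weights, fc_bias, y_ext):
    """Eq. (1): y_hat = y_ext + sum over nodes of FC[(d_i - s_i) squared element-wise]."""
    score = y_ext
    for s_i, d_i in zip(static, dynamic):
        # Hadamard power of order 2: square each coordinate of d_i - s_i
        squared_diff = [(d - s) ** 2 for d, s in zip(d_i, s_i)]
        score += fully_connected(squared_diff, fc_weights, fc_bias)
    return score
```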

In the following sections, we describe how the node attributes are generated, the structure of \({{\mathrm{NN}}}_{2}\), the model training and how we incorporate co-assayed signals into Higashi.

Node attribute generation in Higashi

As mentioned, the input to the hypergraph NN model is a triplet consisting of attributes of one cell node and two genomic bin nodes. For the bin nodes, we use the corresponding rows of the merged scHi-C contact maps as the attributes. For the cell nodes, we calculate a feature vector based on its scHi-C contact maps as its attributes. This process is as follows:

1. Each contact map is normalized based on the total read count.
2. Contact maps are flattened into one-dimensional vectors and concatenated across the cell population.
3. (Optional) Singular value decomposition is used to reduce dimensions for computational efficiency.
4. The corresponding row in the feature matrix is used as the attributes for the corresponding cell.
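Steps 1, 2 and 4 above can be sketched as follows. The function name is hypothetical, the contact maps are assumed to be dense square lists of lists, and the optional dimension reduction of step 3 is omitted to keep the example dependency-free.

```python
def cell_feature_matrix(contact_maps):
    """Return one feature row per cell from its contact map."""
    features = []
    for matrix in contact_maps:
        total = sum(sum(row) for row in matrix)
        # Step 1: normalize by the total read count (all-zero maps stay zero)
        scale = 1.0 / total if total > 0 else 0.0
        # Step 2: flatten the normalized map into a one-dimensional vector
        features.append([value * scale for row in matrix for value in row])
    # Step 4: features[i] is the attribute vector of cell i
    return features
```

In practice, the rows would typically be reduced to a smaller number of dimensions (step 3) before being passed to the model.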

For computational efficiency, we calculate the feature vectors for cell nodes in low-resolution scHi-C contact maps (such as 1 Mb or 500 Kb) when training Higashi for high-resolution imputation.

Cell-dependent graph NN for dynamic embeddings

Here, we introduce \({{\mathrm{NN}}}_{2}\) (mentioned above), which transforms the attributes of a node given a node triplet to the corresponding dynamic embeddings. In the original Hyper-SAGNN, this was accomplished by a modified multi-head self-attention layer39. This self-attention layer functions as follows. Given a group of nodes \(({x}_{1},{x}_{2},{x}_{3})\) and weight matrices \({W}_{Q},{W}_{K},{W}_{V}\) to be trained, the model first computes the attention coefficients that reflect the pairwise importance of nodes:

$${e}_{ij}={\left({W}_{Q}^{T}{x}_{i}\right)}^{T}\left({W}_{K}^{T}{x}_{j}\right),\forall 1\le i,j\le 3,i\ne j$$ (2)

These coefficients \({e}_{ij}\) are then normalized over all possible j within the tuple through the softmax function. Finally, a weighted sum of the transformed features with an activation function is calculated:

$${\alpha }_{ij}=\frac{\exp ({e}_{ij})}{{\sum }_{1\le l\le k,l\ne i}\exp ({e}_{il})}$$ (3)

$${d}_{i}=\tanh \left(\mathop{\sum}\limits_{1\le j\le k,j\ne i}{\alpha }_{ij}{W}_{V}^{T}{x}_{j}\right)$$ (4)
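Eqs. (2)-(4) can be traced with the following single-head sketch. It is not the multi-head layer of Hyper-SAGNN; the function names are hypothetical, and the weight matrices are plain lists of rows with shape (input dimension × output dimension).

```python
import math


def matvec_t(W, x):
    """Compute W^T x for W given as a list of rows (len(x) rows, out_dim columns)."""
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(len(W[0]))]


def dynamic_embeddings(xs, W_Q, W_K, W_V):
    """Single-head version of Eqs. (2)-(4) for a tuple of node attribute vectors."""
    queries = [matvec_t(W_Q, x) for x in xs]
    keys = [matvec_t(W_K, x) for x in xs]
    values = [matvec_t(W_V, x) for x in xs]
    outputs = []
    for i in range(len(xs)):
        others = [j for j in range(len(xs)) if j != i]
        # Eq. (2): e_ij = (W_Q^T x_i)^T (W_K^T x_j) for j != i
        e = [sum(q * k for q, k in zip(queries[i], keys[j])) for j in others]
        # Eq. (3): softmax over j != i (shifted by the max for numerical stability)
        shift = max(e)
        exp_e = [math.exp(v - shift) for v in e]
        alpha = [v / sum(exp_e) for v in exp_e]
        # Eq. (4): d_i = tanh(sum_j alpha_ij W_V^T x_j)
        d_i = [
            math.tanh(sum(a * values[j][c] for a, j in zip(alpha, others)))
            for c in range(len(values[0]))
        ]
        outputs.append(d_i)
    return outputs
```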

However, the representation capacity of using self-attention layers to calculate dynamic embeddings is constrained by the embedding dimensions and the depth of the self-attention layers, and increasing either would lead to high computational cost and increased training difficulty.

To increase the expressiveness of this NN for generating dynamic embeddings while maintaining small embedding dimensions and fewer layers, we developed a cell-dependent graph neural network (GNN)40 that transforms the attributes of bin nodes before passing them to the self-attention layer. For a node triplet \(({c}_{i},{b}_{j},{b}_{k})\), where \({c}_{i}\) corresponds to a cell node and \({b}_{j},{b}_{k}\) are bin nodes, a graph \(G({c}_{i})\) (in which both \({b}_{j}\) and \({b}_{k}\) are nodes) is constructed by taking \({c}_{i}\) as input. Details on the construction of \(G({c}_{i})\), which is shared by all triplets that contain the cell node \({c}_{i}\), are discussed in the next section. For each layer in the GNN, to generate the output vector for bin node \({b}_{j}\), the information of its neighbors in the graph \({{{{\mathcal{N}}}}}_{G({c}_{i})}({b}_{j})\) is aggregated:

$${H}_{{{{{\mathcal{N}}}}}_{G({c}_{i})}({b}_{j})}^{(n)}=\,{{\mbox{Average}}}\,\left(\{{H}_{u}^{(n-1)}e(u,{b}_{j}| {c}_{i}),u \sim {{{{\mathcal{N}}}}}_{G({c}_{i})}({b}_{j}),u\ne {b}_{k}\}\right)$$ (5)

$${H}_{{b}_{j}}^{(n)}=\sigma \left\{{W}_{\,{{\mbox{GNN}}}}^{(n)}\cdot {{\mbox{Concat}}}\,\left[{H}_{{b}_{j}}^{(n-1)},{H}_{{{{{\mathcal{N}}}}}_{G({c}_{i})}({b}_{j})}^{(n)}\right]\right\}$$ (6)

where \({H}_{{b}_{j}}^{(n)}\) is the output vector of the node \({b}_{j}\) at the nth layer of the GNN, and \({H}_{{b}_{j}}^{(0)}\) represents the attributes of the node \({b}_{j}\) before passing to the GNN. \(e(u,{b}_{j}| {c}_{i})\) is the edge weight between node u and \({b}_{j}\) in \(G({c}_{i})\). \({W}_{\,{{\mathrm{GNN}}}\,}^{(n)}\) represents the weight matrix to be optimized at the nth layer, and σ is the non-linear activation function. Optionally, to take the similarity of adjacent bins in the genome into account, \({b}_{j}\) can also aggregate the information from the neighbors of its adjacent bins \({b}_{j\pm 1}\). We call this GNN cell-dependent because the structure of the graph depends on the cell, although the weight matrix \({W}_{\,{{\mathrm{GNN}}}\,}^{(n)}\) is shared across all cells. This cell-dependent GNN can improve the expressiveness of the NN by incorporating a large amount of single-cell information (contact maps) into the structure of the model instead of relying entirely on the embeddings of the cell nodes. The GNN is trained to reconstruct the interaction between a pair of bin nodes by using only information about themselves and their neighborhoods (but not including each other). The attributes of both \({b}_{j}\) and \({b}_{k}\) are transformed by this cell-dependent GNN into \({\hat{b}}_{j}\) and \({\hat{b}}_{k}\), respectively, and the triplet \(({c}_{i},{\hat{b}}_{j},{\hat{b}}_{k})\) passes through the aforementioned self-attention layer to generate the final dynamic embeddings.
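One layer of Eqs. (5) and (6) for a single bin node can be sketched as below. The data layout is an assumption made for illustration: the graph \(G({c}_{i})\) is stored as a dictionary of neighbor-to-weight dictionaries, the weight matrix is a list of rows, and the function name is hypothetical. The optional aggregation over adjacent bins is not included.

```python
import math


def gnn_layer(H_prev, graph, b_j, b_k, W, activation=math.tanh):
    """Eqs. (5)-(6) for one bin node b_j in the cell-dependent graph.

    H_prev: dict mapping each bin to its feature vector from layer n-1
    graph:  dict mapping each bin to a dict {neighbor: edge weight}
    W:      weight matrix as a list of rows, shape (out_dim, 2 * in_dim)
    """
    dim = len(H_prev[b_j])
    # Eq. (5): average the edge-weighted neighbor features, skipping the partner bin b_k
    neighbors = [(u, w) for u, w in graph.get(b_j, {}).items() if u != b_k]
    aggregated = [0.0] * dim
    for u, w in neighbors:
        for c in range(dim):
            aggregated[c] += H_prev[u][c] * w / len(neighbors)
    # Eq. (6): concatenate self and neighborhood vectors, apply weights and activation
    concat = list(H_prev[b_j]) + aggregated
    return [activation(sum(w_rc * v for w_rc, v in zip(row, concat))) for row in W]
```

Excluding \({b}_{k}\) from the neighborhood of \({b}_{j}\) is what prevents the layer from seeing the very interaction it is trained to reconstruct.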

Information-sharing among cells

Higashi has a unique capability for cells to share information with each other in the embedding space to enhance imputation by taking advantage of the latent correlations among cells. Specifically, we first train Higashi until convergence without the cell-dependent GNN to allow the self-attention layer to capture cell-specific information and reflect it in the embeddings through back-propagation. We then calculate the pairwise distances of cell embeddings that indicate the similarities among cells. Given a hyperparameter k, we construct a graph \(G({c}_{i})\) based on the contact maps of \({c}_{i}\) and its k-nearest neighbors in the embedding space. Note that the neighbors of a cell, \({{{\mathcal{N}}}}({c}_{i})\), are other cells whose embedding vectors have small pairwise distances to that of \({c}_{i}\), not other nodes that are connected to the cell in the hypergraph. We denote the contact maps of \({c}_{i}\) as \(M({c}_{i})\). The new \(G({c}_{i})\) is constructed as the weighted sum of \(M(u),u\in \{{c}_{i}\}\cup {{{\mathcal{N}}}}({c}_{i})\), where the weight is calculated based on the pairwise distance \(d(u,{c}_{i})\) in the embedding space, that is,

$$G({c}_{i}) \sim \mathop{\sum}\limits_{u}w(u,{c}_{i})M(u),\,\,u\in \{{c}_{i}\}\cup {{{\mathcal{N}}}}({c}_{i})$$ (7)

$$w(u,{c}_{i})\propto \,{{\mbox{exp}}}\,\left[-d(u,{c}_{i})\right]$$ (8)

Each embedding is normalized by the maximum ℓ2 norm. Note that, although contact maps of different cells are mixed in this step, we do not mix the prediction results from different cells or directly use the mixed contact maps as output. This differentiates our method from the k-NN-based smoothing methods fundamentally. The Higashi model is trained with only the observed interactions in each single cell, together with the interactions in cells that share overall similar structures serving as auxiliary information for synergistic prediction in a cell population.
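The construction in Eqs. (7) and (8) can be sketched as follows. Several details are assumptions made for illustration: the distance is taken to be Euclidean, the weights are normalized to sum to one, contact maps are dense lists of lists, and the function names are hypothetical.

```python
import math


def normalize_by_max_norm(embeddings):
    """Divide every embedding by the largest l2 norm in the population."""
    max_norm = max(math.sqrt(sum(v * v for v in e)) for e in embeddings)
    return [[v / max_norm for v in e] for e in embeddings]


def cell_dependent_graph(i, embeddings, contact_maps, k):
    """Eqs. (7)-(8): weighted sum of M(u) over cell i and its k nearest cells."""
    emb = normalize_by_max_norm(embeddings)
    dist = [math.dist(emb[i], e) for e in emb]
    # k nearest other cells in the embedding space, plus cell i itself
    others = sorted((u for u in range(len(emb)) if u != i), key=lambda u: dist[u])[:k]
    members = [i] + others
    raw = [math.exp(-dist[u]) for u in members]   # Eq. (8): w proportional to exp(-d)
    weights = [w / sum(raw) for w in raw]          # assumed normalization
    n = len(contact_maps[i])
    graph = [[0.0] * n for _ in range(n)]
    for u, w in zip(members, weights):             # Eq. (7): weighted sum of maps
        for r in range(n):
            for c in range(n):
                graph[r][c] += w * contact_maps[u][r][c]
    return graph
```

The mixed map is used only as the graph structure for the GNN; consistent with the text above, it is not itself returned as an imputation result.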

Loss function and training details of Higashi

The hypergraph NN in Higashi produces a score \(\hat{y}\) for any triplet \(({c}_{i},{b}_{j},{b}_{k})\). The NN is trained to minimize the difference between the predicted score \(\hat{y}\) and the target score y (that is, the observations in the dataset), indicating the probability of the pairwise interaction between bin nodes \({b}_{j}\) and \({b}_{k}\) in cell \({c}_{i}\). In Higashi, we offer several choices of loss function for scHi-C datasets with different coverage. For scHi-C datasets with relatively low sequencing depths, or when the analysis resolution is high (hence, fewer reads in each genomic bin), the model is trained with a binary classification loss (cross-entropy) where the triplets corresponding to all non-zero entries in the single-cell contact maps are treated as positive samples, and the rest are considered negative samples (that is, \(y({c}_{i},{b}_{j},{b}_{k})\in \{0,1\}\)). The classification loss is:

$$\begin{array}{l}{{{\mbox{Loss}}}}_{{{\mathrm{class}}}}=-\mathop{\sum}\limits_{i,j,k}\left\{y({c}_{i},{b}_{j},{b}_{k}){{\mathrm{log}}}\,\hat{y}({c}_{i},{b}_{j},{b}_{k})\right.\\\qquad\quad\qquad\left.+\left[1-y({c}_{i},{b}_{j},{b}_{k})\right]{{\mathrm{log}}}\,\left[1-\hat{y}({c}_{i},{b}_{j},{b}_{k})\right]\right\}\end{array}$$ (9)

For datasets with relatively high sequencing depths or when the analysis resolution is low (hence, more reads in each genomic bin), we further differentiate among the non-zero values by training the model with a ranking loss, which maintains consistent ranking of predicted scores versus the continuous target scores (that is, \(y({c}_{i},{b}_{j},{b}_{k})\in {\mathbb{R}}\)). The ranking loss can be described as a binary classification problem aiming to identify the triplet with the larger target score in a pair of selected triplets. For simplicity, we denote two triplets as \({t}_{i},{t}_{j}\) and the corresponding target scores as \(y({t}_{i}),y({t}_{j})\). The ranking loss is:

$${l}_{ij}={\mathbb{I}}\left[y({t}_{i}) > y({t}_{j})\right]$$ (10)

$${p}_{ij}=\,{{\mbox{Sigmoid}}}\,\left[\hat{y}({t}_{i})-\hat{y}({t}_{j})\right]$$ (11)

$${{{\mbox{Loss}}}}_{{{\mathrm{rank}}}}=-\mathop{\sum}\limits_{| y({t}_{i})-y({t}_{j})| \ge \alpha }\left[{l}_{ij}{{\mathrm{log}}}\,{p}_{ij}+(1-{l}_{ij}){{\mathrm{log}}}\,\left(1-{p}_{ij}\right)\right]$$ (12)

where α is the minimum difference at which the order of \(y({t}_{i})\) and \(y({t}_{j})\) can be reliably called and is set to 2 in this work. Note that \({l}_{ij}\) and \({p}_{ij}\) are intermediate variables used only in this definition.
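Eqs. (10)-(12) can be illustrated with the following sketch. For simplicity it enumerates all ordered pairs in a small batch, whereas the text above refers to selected pairs of triplets; the function name and the clipping constant are assumptions.

```python
import math


def ranking_loss(targets, scores, alpha=2.0):
    """Eqs. (10)-(12) summed over ordered pairs with |y(t_i) - y(t_j)| >= alpha."""
    loss = 0.0
    for i in range(len(targets)):
        for j in range(len(targets)):
            if i == j or abs(targets[i] - targets[j]) < alpha:
                continue  # the order of this pair cannot be reliably called
            l_ij = 1.0 if targets[i] > targets[j] else 0.0           # Eq. (10)
            p_ij = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))  # Eq. (11)
            p_ij = min(max(p_ij, 1e-12), 1.0 - 1e-12)                # avoid log(0)
            loss -= l_ij * math.log(p_ij) + (1.0 - l_ij) * math.log(1.0 - p_ij)
    return loss
```

Only the sign and size of the score difference enter the loss, so the model is rewarded for ordering triplets correctly rather than for matching read counts exactly.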

Moreover, the structure of Higashi can be easily adapted to estimate a distribution for \(y({t}_{i})\). The zero-inflated negative binomial (ZINB) distribution and its variants have been widely used in the modeling of single-cell sequencing datasets41. Specifically, the distribution of the read count for an entry in an scHi-C contact map can be characterized by three parameters: the mean parameter \(\mu ({t}_{i})\), the dispersion parameter \(\theta ({t}_{i})\) and the dropout rate \(\pi ({t}_{i})\). To incorporate this loss function into the Higashi framework, we change the output size of the last layer of the NN from 1 to 2. We also constrain the dropout rate \(\pi ({t}_{i})\) to be approximated by batch effects, total read counts in a cell and genomic distance, which are the additional features \(a({t}_{i})\) in Higashi. The loss function for the ZINB regression can, thus, be described as:

$$\hat{y}({t}_{i})={[\mu ({t}_{i}),\theta ({t}_{i})]}^{T}$$ (13)

$$\pi ({t}_{i})=\,{{\mbox{FC}}}\,\left[a({t}_{i})\right]$$ (14)

$${{{\mbox{Loss}}}}_{{{\mathrm{ZINB}}}}=-\mathop{\sum}\limits_{{t}_{i}}{{\mathrm{log}}}\,{P}_{{{\mathrm{ZINB}}}}\left[y({t}_{i})| \mu ({t}_{i}),\theta ({t}_{i}),\pi ({t}_{i})\right]$$ (15)

If the model is trained with the ZINB loss, \(\mu ({t}_{i})\) is used as the imputed read count for the specific entry in the contact map. In this work, the Higashi model for the sn-m3C-seq data is trained with the ZINB loss, whereas the Higashi models for the other datasets are trained with the ranking loss.
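The negative log-likelihood in Eq. (15) can be written out as below. The text does not spell out the parameterization, so this sketch assumes the common mean and inverse-dispersion form of the negative binomial, with \(\mu >0\), \(\theta >0\) and \(0\le \pi <1\); the function names are hypothetical.

```python
import math


def nb_log_pmf(y, mu, theta):
    """Log negative binomial probability with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def zinb_log_likelihood(y, mu, theta, pi):
    """Log P_ZINB[y | mu, theta, pi] with dropout rate pi."""
    nb = nb_log_pmf(y, mu, theta)
    if y == 0:
        # a zero is either a dropout or a genuine zero count from the NB component
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_loss(ys, mus, thetas, pis):
    """Eq. (15): negative log-likelihood summed over triplets."""
    return -sum(
        zinb_log_likelihood(y, m, t, p) for y, m, t, p in zip(ys, mus, thetas, pis)
    )
```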

Using any of the above loss functions requires negative samples (samples with zero read count in the original datasets) in the training data. We designed an effective negative sampling approach. Specifically, at each epoch, we randomly sample a batch of triplets and make sure that these triplets do not overlap with the positive samples. To reflect the similarity of 3D genome structures of flanking genomic bins, we also exclude triplet \(({c}_{i},{b}_{j},{b}_{k})\) from the negative samples if triplets such as \(({c}_{i},{b}_{j+1},{b}_{k})\) belong to the positive samples. The number of negative samples generated for each batch is guided by the sparsity of the input data. When studying an scHi-C dataset where s% of the contact map entries are zeros, for a batch of n positive triplets, \(\min \left[s/(100-s),5\right]n\) negative samples will be generated. The cap keeps the number of negative samples at no more than five times the number of positive samples for computational efficiency. The model is optimized by the Adam algorithm42 with a learning rate of \(1\times {10}^{-3}\). The batch size is set as 192. For a dataset with multiple chromosomes, only one Higashi model is trained for all chromosomes. For different resolutions on the same dataset, separate Higashi models are trained.
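The negative sampling rule can be sketched as follows. The exact flank window is not specified above beyond the example of \({b}_{j+1}\), so this sketch assumes a symmetric window of one bin on both bin nodes; the function names, the `max_tries` guard and the convention that triplets are stored with j < k are also assumptions.

```python
import random


def num_negative_samples(sparsity_percent, n_positive, cap=5.0):
    """min[s / (100 - s), 5] * n, where s is the percentage of zero entries."""
    if sparsity_percent >= 100:
        return int(cap * n_positive)
    ratio = sparsity_percent / (100.0 - sparsity_percent)
    return int(min(ratio, cap) * n_positive)


def sample_negatives(positives, n_cells, n_bins, n_samples, flank=1,
                     rng=random, max_tries=100_000):
    """Draw triplets (cell, bin_j, bin_k), j < k, that avoid positives and their flanks."""
    positives = set(positives)
    negatives = set()
    tries = 0
    while len(negatives) < n_samples and tries < max_tries:
        tries += 1
        c = rng.randrange(n_cells)
        j, k = sorted(rng.sample(range(n_bins), 2))
        # reject the candidate if it, or a flanking bin pair, is a positive sample
        near_positive = any(
            (c, j + dj, k + dk) in positives
            for dj in range(-flank, flank + 1)
            for dk in range(-flank, flank + 1)
        )
        if not near_positive:
            negatives.add((c, j, k))
    return sorted(negatives)
```

For example, with 50% zero entries the rule yields one negative per positive, whereas any sparsity above roughly 83% hits the cap of five negatives per positive.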

Incorporating co-assayed signals in Higashi

The unique design of Higashi allows joint modeling of co-assayed scHi-C and the corresponding one-dimensional signals (for example, from sn-m3C-seq17). We add an auxiliary task for Higashi by using the learned embeddings for cell nodes \({c}_{i}\) to accurately reconstruct the co-assayed signals \({m}_{i}\) through a multi-layer perceptron. The auxiliary loss term is added to the main loss function and optimized jointly. The model, thus, builds an integrated connection between chromatin conformation and the co-assayed signals, guiding the embedding of the scHi-C data, that is,

$${{{\mbox{Loss}}}}_{{{\mbox{aux}}}}={{\mbox{MSE}}}\,\left[{m}_{i},\,{{\mbox{MLP}}}\,({c}_{i})\right]$$ (16)

where MSE refers to the mean squared error between the co-assayed signals and the estimate.

Batch effects removal during imputation

The core structure of Higashi can already implicitly remove batch effects to a certain extent during imputation. As described in Eq. (1), the final predicted probability of a triplet includes the values \({\hat{y}}_{{{\mbox{ext}}}}\) produced by feeding extra features that include features related to batch effects, such as the batch ID and the total read counts per cell. During imputation, these factors are set as constant for all cells in order to remove batch effects. The motivation for this design is to use the batch ID and total read counts to regress out the batch effects.

However, one problem that might arise is the use of contact maps with potential batch effects to construct the cell-dependent graph \(G({c}_{i})\). This is because, when imputing cell \({c}_{i}\), the k-nearest neighboring cells in the embedding space that contribute to its imputation are more likely to come from the same batch as \({c}_{i}\). As a result, the batch effects in the constructed cell-dependent graph \(G({c}_{i})\) are expected to lead to unreliable batch-correlated imputation results. To address this, we developed the following framework to explicitly remove batch effects during imputation. As described in the above section, the k-nearest neighboring cells in the embedding space contribute to the imputation through the weighted average of the corresponding contact maps used to construct the cell-dependent graph \(G({c}_{i})\). Motivated by the mutual nearest neighbor method that is widely adopted in scRNA-seq analysis for batch effect removal43, we add constraints on the selection of neighboring cells that will be involved in the imputation. When imputing a cell i from an scHi-C dataset with N batches, we require that the k-nearest neighbors contributing to the imputation process be evenly distributed across the N batches. In cases where k is not exactly divisible by N, ⌈k/N⌉ cells will be sampled from each batch based on their distance to cell i in the embedding space. Next, k cells will be randomly selected from them and serve as the final set of neighboring cells to contribute to imputation. Note that this neighborhood construction is carried out dynamically after every epoch of the training process of Higashi to improve the robustness of the imputation and the random sampling process. By incorporating this mechanism into Higashi, \(G({c}_{i})\) will have a similar distribution across different batches. The Higashi model is then able to regress out the batch effects with the batch ID and read count information. During imputation, the batch-effect-related features are set as constant in the input to recover batch-effect-corrected contact maps.
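The batch-balanced neighbor selection can be sketched as below. The function name and the input layout (a list of distances from cell i and a list of batch labels) are hypothetical, and we assume that cell i itself is excluded from its own neighbor set.

```python
import math
import random


def batch_balanced_neighbors(i, distances, batch_ids, k, rng=random):
    """Pick k neighbors of cell i that are spread evenly over the batches.

    distances: distances[u] is the embedding-space distance from cell i to cell u
    batch_ids: batch_ids[u] is the batch label of cell u
    """
    batches = sorted(set(batch_ids))
    per_batch = math.ceil(k / len(batches))
    candidates = []
    for batch in batches:
        members = [u for u in range(len(batch_ids)) if batch_ids[u] == batch and u != i]
        members.sort(key=lambda u: distances[u])
        candidates.extend(members[:per_batch])  # nearest ceil(k / N) cells of this batch
    if len(candidates) <= k:
        return candidates
    # ceil(k / N) * N can exceed k, so draw a random subset of exactly k cells
    return rng.sample(candidates, k)
```

Re-running this selection after every epoch, as described above, means that the random subset changes during training while the per-batch balance is preserved.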

Variability of compartmentalization and TAD-like boundaries

In Higashi, we developed strategies for reliable analysis of 3D genome features in different scales across the cell population. We developed a method to calculate continuous compartment scores for the imputed single-cell contact maps such that these scores are directly comparable across different cells in the population to assess variability (Supplementary Note A.5). For single-cell TAD-like domain boundary analysis, we developed a calibration method using an optimization scheme based on insulation scores to achieve comparative analysis of domain boundary variability from single cells (Supplementary Notes A.7 and A.8). These algorithms greatly enhance the analysis of variable multiscale 3D genome structures at single-cell resolution.

Visualization tool for integrative scHi-C analysis

In Higashi, we developed a visualization tool that allows interactive navigation of the scHi-C analysis results. Our tool enables the navigation of the embedding vectors and the imputed contact maps from Higashi in a user-friendly interface. Users can select individual cells or a group of cells of interest in the embedding space and explore the corresponding single-cell or pooled contact maps. Supplementary Fig. 28 shows a screenshot of the visualization tool. See the GitHub repository of Higashi for detailed documentation of this visualization tool: https://github.com/ma-compbio/Higashi.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.