DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data

Zeng, Wanwen; Chen, Xi; Duren, Zhana; Wang, Yong; Jiang, Rui; Wong, Wing Hung

doi:10.1038/s41467-019-12547-1

Download PDF

Article
Open access
Published: 10 October 2019

DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data

Wanwen Zeng^1,2^na1,
Xi Chen¹^na1,
Zhana Duren¹^na1,
Yong Wang^3,4,
Rui Jiang² &
…
Wing Hung Wong¹

Nature Communications volume 10, Article number: 4613 (2019) Cite this article

10k Accesses
44 Citations
14 Altmetric
Metrics details

Subjects

Abstract

Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements. Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements to target genes in single cells, bulk HiChIP can measure such contacts in a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. The subpopulation-specific profiles of gene expression, chromatin accessibility and enhancer-promoter contact obtained by DC3 provide a comprehensive characterization of the gene regulatory system in each subpopulation.

Jointly defining cell types from multiple single-cell datasets using LIGER

Article 12 October 2020

Benchmarking of cell type deconvolution pipelines for transcriptomics data

Article Open access 06 November 2020

ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis

Article Open access 25 February 2021

Introduction

With the rapid development of single-cell (sc) genomics technology, researchers are now able to study heterogeneous mixtures of cell populations at the single cell level. Each type of sc-genomics experiments offers one particular aspect to delineate the heterogeneity; for example, scRNA-seq¹ enables single cell gene expression profiling, scATAC-seq² identifies accessible chromatin regions in single cells and scHi-C³ measures chromatin contacts in the single-cell level. In many situations, a first step in the analysis of single-cell data is clustering, that is, to classify cells into the constituent subpopulations. While clustering methods for scRNA-seq or scATAC-seq alone have been widely studied^{4, 5}, when different types of sc-genomics experiments are performed on different samples from the same heterogeneous cell population, then all samples are informative on the underlying subpopulations, and analysis of one sample should be informed by the analysis on another sample. Recently, Duren et al.⁶ proposed a coupled NMF (coupled non-negative matrix factorization) method to cluster cells in scRNA-seq and scATAC-seq samples and to infer both the expression profile and accessibility profile of each subpopulation. These two profiles reveal a great deal about the subpopulation of cells: the accessible regions identify the active regulatory elements (RE) while the expression profiles identify actively transcribed genes⁷. However, even with these two profiles, our understanding of the subpopulation-specific regulatory networks remains incomplete if we cannot link the active REs to their target genes. In principle, such linkages can be obtained by measuring 3D contacts between REs and gene promoters. In bulk sample, it is easy to measure 3D contacts between active enhancers and gene promoters by H3K27ac HiChIP experiments⁸. On the other hand, combinatorial indexing can be used for 3D contact measurement in single cells³.

In order to take these 3D contacts into account in the study of subpopulation-specific regulatory networks, here we introduce DC3 as a method for the joint analysis of bulk and single cell data under various settings of input data combinations, including: (1) scRNA-seq, scATAC-seq and scHi-C; (2) scRNA-seq, scATAC-seq and bulk HiChIP; (3) scRNA-seq, bulk ATAC-seq, bulk HiChIP; (4) bulk RNA-seq, scATAC-seq, bulk HiChIP. Based on comprehensive simulation experiments, we show that this method can deconvolve bulk profiles into subpopulation-specific profiles. At the same time, the subpopulation-specific profiles in turn leads to improved coupled clustering results of single-cell data. To assess its performance in a heterogeneous cell population in vivo, we apply DC3 on a population obtained after four days of retinoic-acid (RA) induced differentiation of mouse embryonic stem cells. We validated the HiChIP profile for one of the inferred subpopulations by showing its consistency to HiChIP data on cells obtained by fluorescence-activated cell sorting (FACS) based markers specific to that subpopulation. Finally, we illustrate the value of results from DC3 by using them to derive the core regulatory network and their downstream effectors in each of the subpopulation in the induced differentiated mouse embryonic stem cells.

Results

The DC3 algorithm

We formulate the joint analysis of bulk and single cell RNA-seq, ATAC-seq and Hi-C data as an optimization problem (Methods). For each type of single cell data, the cost function contains a NMF term that drives clustering of the single cells through non-negative matrix factorization (NMF). For each bulk data type, the cost function contains a coupling term that couples the three data types within each subpopulation by enforcing certain relationship among them. For example, suppose we have an input data setting with scRNA-seq, scATAC-seq and bulk HiChIP (Fig. 1a), then the cost function is given in Fig. 1b where the first term gives the coupling and the other two are NMF terms. As previously described⁶, each NMF term drives the decomposition of a single-cell data matrix into two factors W and H, with columns of W representing cluster-specific profiles, and each column of H giving the relative weights (for cluster-assignment) of a particular single cell. To derive the coupling, we examined data from various cell lines and found that HiChIP loop counts are generally positively correlated with both gene expression values from RNA-seq (Supplementary Fig. 1) and enhancer openness from ATAC-seq (Supplementary Fig. 2). This observation motivated us to use a linear relation between the loop count and the product of gene expression and enhancer openness to couple the three data types, which gives rise to first term of the cost function. This approach can be extended to handle any combination of bulk and single cell data, as long as at least one of the data types contains single cell data. The general cost function and further discussions are given in the Methods section. Note that, instead of using a pre-defining enhancer set, DC3 defines the candidate enhancers directly based on ATAC-seq and HiChIP data.

The main purpose of the coupling term is to improve clustering of single cells by exploiting the statistical correlation between different data types within each subpopulation. Although the optimization can provide estimates of subpopulation-specific profiles (subpopulation profiles) using the W matrix in the NMF term, in simulation experiments we observed that when single cell data is available, we can better estimate a subpopulation profile by averaging the data from the single cells assigned to that subpopulation (Methods and Supplementary Table 1). For a data type with only bulk data, we can obtain its subpopulation profiles by a simple method using an expression based on the already estimated profiles of the single cell data types (Methods). Alternatively, we may infer its subpopulation profiles based on a Poisson model with the profiles of the single cell data types treated as known (Methods). Compared to the simple method, the Poisson model-based method has a better interpretation and also performs slightly better in simulations (Supplementary Fig. 3 and Supplementary Table 2), but at the cost of a much higher time complexity (24 h vs 3 min). For computational efficiency, DC3 uses the simple method as default.

Evaluation of deconvolution on in silico mixture of cells

We used deconvolution to denote the task of estimating subpopulation profiles regardless of whether it was based on single cell or bulk data. We constructed an in silico mixture of deep single cell data (UMI~I million per cell) from two cell lines, GM12878 and K562 (Methods) and used it to evaluate the performance of our method under four settings of input data combinations: (1) scRNA-seq, scATAC-seq and scHi-C; (2) scRNA-seq, scATAC-seq and bulk HiChIP; (3) scRNA-seq, bulk ATAC-seq, bulk HiChIP; (4) bulk RNA-seq, scATAC-seq, bulk HiChIP. Deconvolution performance was assessed by the mean Pearson correlation coefficient (mean PCC scores in 50 runs of DC3) between the observed versus predicted subpopulation-specific profiles in the two cell lines (Methods). DC3 was seen to perform well in HiChIP, RNA-seq and ATAC-seq deconvolution, with mean PCC score of 0.78– 0.95 in HiChIP deconvolution, 0.85–0.99 in RNA-seq deconvolution, 0.88–0.99 in ATAC-seq deconvolution (Table 1). As a comparison, we generated a null distribution by randomly assigning the reads to two artificial cell lines and repeated the whole computation (Table 1 and Supplementary Figs. 4–9). The deconvolution accuracy of DC3 was seen to be significantly higher than random deconvolution.

Table 1 The deconvolution performance of DC3 with different input combinations

Full size table

To assess the impact of sequencing depths, we further conducted a series of dropout experiments, where site-level dropout in both scRNA-seq, scATAC-seq, and scHi-C data were simulated using different dropout rates (Methods). The results were presented in Supplementary Tables 3–5. As expected, for all four input settings, deconvolution accuracy deteriorated with increasing dropout rate. When only one data type is available in single cells (input settings 3 and 4) and when the dropout rate is high, we cannot obtain significantly better performance over random deconvolution. On the other hand, when both RNA-seq and ATAC-seq were available in single cells (input settings 1 and 2), DC3 deconvolution performance was still acceptable (PCC 0.82–0.93) at 80% dropout and remained significantly better than random deconvolution even at 90% dropout. It is noteworthy that at all dropout levels (Fig. 2a), deconvolution accuracy in setting 2 (scRNA, scATAC, bulk HiChIP) was comparable to that in setting 1 (scRNA, scATAC, scHiChIP). In the remainder of this paper, we will focus on the further evaluation and application of DC3 under setting 2.

Evaluation of clustering on in silico mixture of cells

In this and subsequent sections, we assume that the input data setting is scRNA-seq, scATAC-seq and bulk HiChIP. First, we investigated whether DC3 would lead to improved clustering performance as compared to clustering without the incorporation of loop data from HiChIP. We compared DC3 to coupled NMF (which performed coupled clustering without using the loop data) and NMF (which performed clustering separately for each single-cell data type) in in silico mixture of GM12878 and K562 cells. Fifty independent runs were performed for each method. At each dropout rate, we compared clustering results based on the average (over the scRNA-seq and scATAC-seq samples) error rate in cluster assignment. Figure 2b shows the results for DC3 and coupled NMF (details including NMF results in Supplementary Table 6). Since the two cell types were rather distinct, with the initial deeply sequenced mixtures (similar in depth as data from Fluidigm), both methods performed well with no cells misclassified in any run. As the dropout rate increased, the two types of cells became less distinct (Supplementary Figs. 10–13) and the incorporation of loop data became more important. In particular, the incorporation of loop data reduced the classification error by more than four folds when the dropout rate is at 80% or higher, which corresponds to a sequencing depth typical of data from droplet-based system such as 10× (median UMI < 10,000). As a comparison, we performed DC3 in three negative control experiments, by keeping two of the three data sets the same and randomly permutating the third data set (Supplementary Tables 7–9). The performance of DC3 dropped in these negative controls, indicating that each data type was important to clustering results. These results demonstrated the potential of DC3 to improve clustering of single cells.

Evaluation of deconvolution on experimental mixtures

When mouse embryonic stem cells (mESC) are induced to differentiate, several different lineages of cells may emerge, resulting in a real experimental mixture suitable for analysis with our approach. Specifically, embryoid bodies (EBs) are obtained from mESCs using the hanging drop method and then differentiated further under retinoic acid (RA) treatment (Methods). We performed scRNA-seq, scATAC-seq and bulk HiChIP on the mixture after 4 days of RA treatment (RA-day 4). The scRNA-seq and scATAC-seq samples have already been analyzed in our previous study⁶. We wanted to assess the performance of DC3 in the joint analysis of the three data types together. DC3 identified 3 subpopulations (Supplementary Figs. 14, 15) together with their subpopulation-specific loop profiles. Previous study has shown that subpopulation 1 and subpopulation 3 were two related subpopulations⁶. To isolate pure cell population, we focused on the more distinct subpopulation (subpopulation 2) and performed further experiments to validate its inferred loop profile. To this end, we searched for subpopulation 2-specific surface markers and identified EpCAM and CD38 as being highly expressed in cluster 2 but not in subpopulations 1 and 3 (Methods, Supplementary Figs. 16, 17). We performed FACS experiments by using these two markers to isolate subpopulation 2 cells. Figure 2c shows that we successfully isolated 15.7 ± 3.2% EpCAM/CD38 double positive cells at RA-day 4 (Supplementary Fig. 18).

Next, we performed a HiChIP experiment to obtain the loop profile for these double positive cells. We noted that 5 independent FACS runs were necessary in order to collect enough cells for HiChIP. From the PCA plot of the loop profiles (Fig. 2d), we can see that the double positive sample was indeed far closer to subpopulation 2 than subpopulation 1, 3, or the bulk sample. The loop profile of double positive cells had a PCC of 0.7633 with that of subpopulation 2 cells (Supplementary Fig. 19), which was far higher than its PCCs with the profiles of the other two subpopulations (0.34 and 0.45; Supplementary Figs. 20, 21). Together, these results validated the performance of DC3 in the deconvolution of loop data in a real biological mixture.

DC3 improves interpretation of subpopulations

To assess whether adding loop data will help to interpret the top genes in each subpopulation, we carry out GO terms enrichment analysis. First, we combined the scRNA-seq data and the loop profile for each subpopulation to select the top 1000 subpopulation-specific genes, and performed gene ontology (GO) terms enrichment analysis (Methods). We compared the enrichment results to those obtained when the selection of subpopulation-specific genes was based on scRNA-seq data alone. Table 2 (under original) gives the most enriched GO terms and p-values in each subpopulation. Although both subpopulations 1 and 3 are strongly enriched in nervous system-associated terms, subpopulation 1 is specific to neuron development and contains terms like axonogenesis and neuron projection guidance. Meanwhile, subpopulation 3 is enriched in terms concerning general brain and central nervous system development. Finally, subpopulation 2 has weaker but still highly significant enrichments in mesodermal development terms such as muscle structure development and cardiovascular development.

Table 2 Subpopulation-specific GO terms enrichment results

Full size table

We further investigate how the enrichment results depend on sequencing depths by a down-sampling experiment (Methods), and the results are shown in Table 2 (under Down-sampling). Consistent with previous benchmark studies⁹, deep scRNA-seq allows much better characterization (in the sense of high enrichment scores) of the subpopulations than the low-depth scRNA-seq. On the other hand, as shown by the −log₁₀(p-value), adding loop profile information offers an improved characterization of the subpopulations over scRNA-seq alone at any levels of sequencing depth. The improvement is especially large when the sequencing depth is low. In conclusion, DC3-inferred loop information can be used to improve the interpretation of subpopulations of cells.

Subpopulation accessibility, expression, and loop profiles

The chromatin accessibility, gene expression and 3D contact profiles provided by DC3 for each subpopulation can be used to construct subpopulation-specific gene regulatory networks. However, typically these networks are large and complex, which makes it difficult to discern the key regulatory relationships. Therefore, we have developed a method to extract and visualize important subnetworks. Below, we demonstrated this method in the RA-day 4 example.

(Step 1) Identification of key regulators for each subpopulation: We merged the scRNA-seq reads from cells in the same subpopulation and calculated subpopulation-specific gene expression (in FPKM). Similarly, we merged the scATAC-seq cells in each subpopulation to call open peaks and computed motif enrichment scores (Methods). We defined the key regulators of a subpopulation as those TFs with high expression level (FPKM > 10), high motif enrichment score (>2), and differential expression compared to at least one of the other subpopulations (t-test, adjusted p-value <0.01). There are 58, 42, and 71 key regulators for subpopulations 1, 2, and 3 respectively (Fig. 3a–c, Supplementary Table 10–12). We ranked the key TFs by its importance score, defined as the product of its expression (i.e., log2 of FPKM), expression specificity (i.e., maximum expression fold change compared to the other two subpopulations) and motif enrichment score. The Top 30 key regulators are shown in Fig. 3d. For example, {Lhx1, Neurod1}, {Gata4, Sox17, Foxa2} and {Rfx4, Sox3} are high ranking specifically in subpopulations 1, 2 and 3 respectively, while Pou3f2 and several Hox genes are high ranking in both subpopulations 1 and 3.

(Step 2) Construction of gene regulatory networks: On each subpopulation, we identified enhancer-target gene pairs with loop counts greater than or equal to 2. Given an enhancer-target gene pair, we connect it to key TFs which have both significant motif match on the enhancer region and significant correlation with target gene in the single cell gene expression data. This gives 14,979, 4,909 and 15,459 TF-Enhancer-Gene triplets in subpopulations 1, 2, and 3 respectively. Finally, for any pair of TF and target gene, say T_i and G_j, we compute a TF-Target score W_ij as the sum, over TF-RE-Gene triples with TF = T_i and Gene = G_j, of the product of the motif score of T_i on the RE and the loop count between RE and G_j. In this way, we obtained a regulatory networks for each subpopulation, defined as the directed graph with key TFs are nodes and TF-Target scores¹⁰ as edge weights. The networks for the three subpopulations contain (58 nodes, 1043 edges), (42 nodes, 685 edges) and (71 nodes, 1037 edges) respectively.

(Step 3) Analysis of dense subnetwork: For each subpopulation-specific network, we extracted its dense subnetwork by quadratic programming (Methods). The extracted subnetwork is seen to be significantly denser than those obtained from random networks with same in-degree and out-degree for each node as our network (p-value equals 0.0230, 0.0180 and 0.0320 in subpopulations 1, 2, and 3, see Methods). The dense subnetwork was further partitioned into (i) the core subnetwork consisting of TFs that densely cross-regulate each other to achieve robust maintenance of the cellular state, (ii) the upstream subnetwork consisting of TFs that may regulate the core, and (iii) the downstream subnetwork consisting of key TFs regulated by the core. Different downstream TFs may be involved in different pathways or functions characteristic of the cells in the subpopulation. Figure 3e–g present the dense subnetworks of the three subpopulations. Downstream TFs in subpopulation 1 included Ascl1, Neurog1, Lhx3, Onecut2 and Bhlhe23. The BHLH transcription factor Ascl1 is one of the most important factors in neural commitment and differentiation¹¹, and it is also necessary for reprograming from fibroblasts to functional neurons¹². Lhx3 in known to contributes to the specification of motor neuron¹³. In subpopulation 2, Foxa2, Gata4, and Gata6 are in the core subnetwork. Foxa2 is a pioneer factor important in mesendoderm development and is known to regulate Gata4¹⁴, Gata4 and Gata6 are master TFs important to heart and gut formation. Our analysis suggests that these core TFs, together with their downstream effectors such as Sox17, may drive differentiation towards mesodermal and endoderm lineages. In subpopulation 3, Rfx4 and Pou3f2 are in the core subnetwork. A novel splice variant of Rfx4 is reported to be crucial for normal brain development¹⁵ and Pou3f2 is involved in cognitive function as well as adult hippocampal neurogenesis¹⁶. Downstream TFs in subpopulation 3 included Pax6. Pax6 is important for the maintenance of brain integrity¹⁷. We note that many Hox genes are found in the core subnetworks of subpopulations 1 and 3, suggesting that they are important in the maintenance of these neural related populations. On the other hand, Lhx1¹⁸ and Neurod1¹⁹ are specific to subpopulation 1 while Rfx4²⁰ and Pax6²¹ are specific to subpopulation 3. These regulators may play a role in defining the differences of these two related but distinct subpopulations.

Discussion

In summary, we developed DC3 for simultaneous deconvolution and coupled clustering based on the joint analysis of different combinations of bulk and single-cell level RNA-seq, ATAC-seq, and HiChIP data. We showed that DC3 can decompose bulk profiles into subpopulation-specific profiles and at the same time enhance clustering performance of the single-cell data. The subpopulation-specific HiChIP interactions are seen to lead to improved interpretation of the subpopulations. Furthermore, we showed that the accessibility, expression and loop profiles inferred by DC3 can serve as a foundation for further analyses of the regulatory systems, such as the extraction of core subnetworks, in a population-specific manner.

Since DC3 is an unsupervised method and the hyper-parameters can be tuned automatically, it can be applied to many different scenarios. For example, existing single cell atlases^22,23,24,25 usually adopt barcode-based Drop-seq experiments (median UMI ~5000). If scRNA-seq and scATAC-seq with low sequencing depths have already been performed in the same cell population, then with additional simple HiChIP experiments, DC3 can greatly improve the characterization of the different subpopulations and their regulatory networks. As another example, if scRNA-seq, scATAC-seq and bulk HiChIP are performed in heterogeneous tumor cell population, DC3 can help to distinguish the subpopulations in the mixture and identify the TFs, enhancers and genes that are important in the subpopulations.

Finally, our optimization-framework is flexible and can be extended easily. For example, recently Cao et al. proposed a combinatorial indexing–based assay sci-CAR that jointly profiles chromatin accessibility and mRNA in each of thousands of single cells²⁶; Lin et al.²⁷ proposed a model-based method to infer the subpopulations. In the future we will modify the cost function of DC3 to incorporate data types from such emerging single cell experiments, and incorporate the model-based method into the inference.

Methods

Data preprocessing

We aligned scATAC-seq reads to reference genome mm9 and removed duplicates. MACS2²⁸ was employed to call peaks by merging reads from all single cells and removed peaks present less than 10 cells. The final read counts for each peak on each cell were calculated by bedtools²⁹ intersect command.

We mapped scRNA-seq reads to mm9 by STAR³⁰ following ENCODE³¹ pipeline and calculated Transcripts Per Million (TPM) by RSEM³² using GENCODE³³ vM16 annotation.

We used HiC-pro³⁴ to process HiChIP data from raw fastq files to normalized contact maps using reference genome mm9. Then hichipper³⁵ was employed to perform bias-corrected peak calling, library quality control and DNA loop calling. We filter out the replicates that have less than 500 strong loops (greater than 5 reads). We further utilized ATAC-seq peaks to annotate loops and to select candidate enhancer-promoter interactions.

Simulation data construction

To simulate single-cell level RNA-seq/ATAC-seq/HiChIP data from a mixed population with two different cell types, we downloaded the public scRNA-seq/scATAC-seq data from GM12878 and K562 and mixed them together as a single scRNA-seq/scATAC-seq dataset; we downloaded public bulk HiChIP data from GM12878 and K562 and down-sampled them as scHi-C data. In detail, we downloaded scRNA-seq and scATAC-seq, and simulated scHi-C for GM12878 and K562 separately. Then for scRNA-seq data, we computed the data matrix $E_{m \times n_1}$ where E_gh denotes the expression level of the g-th gene in the h-th cell and n₁ = 146 is the total number of cells from GM12878 (73) and K562 (73). For scATAC-seq data, we computed a data matrix $O_{n \times n_2}$, where O_ij denotes the degree of openness (i.e., read count) of the i-th peak in the j-th cell and n₂ = 746 is the total number of cells in GM12878 (373) and K562 (373). For scHi-C data, we computed a data matrix $C_{s \times n_3}^s$, where $C_{ij}^s$ denotes the loop counts of the i-th interaction in the j-th cell and n₃ = 200 is the total number of cells in GM12878 (100) and K562 (100). The numbers of cell in simulation data are shown in Table 3. We note that the scATAC-seq, the scRNA-seq data and the scHi-C data are not measured in the same cell in our setting.

Table 3 The numbers of cells in simulation data

Full size table

DC3 algorithm

We first introduce some notations for our data matrices (Fig. 1b): (1) scRNA-seq matrix: E_m×n1, where E_gh denotes the expression level of the g-th gene in the h-th cell; (2) scATAC-seq matrix: O_n×n2, where O_ij denotes the degree of openness (i.e., read count) of the i-th enhancer in the j-th cell; (3) HiChIP matrix: C_m×n, where C_pq denotes the enhancer–promoter interactions strength (i.e., loop read counts) for the p-th gene’s promoter and the q-th enhancer at the bulk level. To infer the pattern of gene expression, chromatin accessibility, and chromatin contact in each subpopulation, we formulate the following optimization problem:

$$\begin{array}{*{20}{c}} {\begin{array}{*{20}{c}} {} \\ {\min } \\ {W_1,H_1,W_2,H_2,{\mathrm{\Lambda }} \ge 0} \end{array}\frac{1}{2}\left\| {C - {\mathrm{\alpha }}D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \right\|_F^2 + \frac{{\mu _1}}{2}\left\| {E - W_1H_1} \right\|_F^2 + \frac{{\mu _2}}{2}\left\| {O - W_2H_2} \right\|_F^2} \\ {s.t.\quad \mathop {\sum}\limits_{k = 1}^K {h_{kj}^1} = 1;j = 1,2, \ldots ,n_1} \\ {\mathop {\sum}\limits_{k = 1}^K {h_{kj}^2} = 1;j = 1,2, \ldots ,n_2} \\ {\mathop {\sum}\limits_{k = 1}^K {\lambda _k} = 1;} \end{array}$$

(1)

To explain this formulation, we briefly discuss each term in the objective function. (1) $\left\| {E - W_1H_1} \right\|_F^2$: A soft clustering of the scRNA-seq cells can be obtained from a nonnegative matrix factorization E = W₁H₁ as follows: W₁ has K columns and H₁ has K rows. The i-th column of W₁ gives the mean gene expression for the i-th cluster of cells, while the j-th column of H₁ gives the assignment weights of the j-th cell to the different clusters. (2) $\left\| {O - W_2H_2} \right\|_F^2$: Similarly, clustering of cells in scATAC-seq data can be obtained from the factorization O = W₂H₂. W₂ also has K columns and H₂ has K rows. The i-th column of W₂ gives the mean chromatin accessibility for the i-th cluster of cells, while the j-th column of H₂ gives the assignment weights of the j-th cell to different clusters. Note that, the k-th column of W₂ corresponds to the k-th column of W₁, indicating they are from the same cluster, namely the k-th cluster. (3) $\left\| {C - {\mathrm{\alpha }}D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \right\|_F^2$: We decompose each enhancer-promoter loop strength c_ij in the bulk sample into subpopulation-specific loop strengths, i.e., $c_{ij} = \mathop {\sum}\nolimits_k {\lambda _k} c_{ijk}$, where c_ijk is the loop strength in the k-th subpopulation; λ_k is proportional to the size of the subpopulation; Λ is a K by K diagonal matrix [λ₁, λ₂, …, λ_K]. Furthermore, based on the expectation that an enhancer-promoter loop strength is positively correlated with both the accessibilities of the enhancer and the expression values of the gene, we model c_ij as

$$\begin{array}{*{20}{c}} {c_{ij} = \alpha d_{ij}\mathop {\sum }\limits_k \lambda _kw_{ik}^1w_{jk}^2} \end{array}$$

(2)

Here α is a scaling factor; the elements (d_ij) of the matrix D are indicators selecting the enhancer-promoter pair to be modeled. Only enhancer-promoter pairs with loop count larger than or equal to 1 are included into the optimization:

$$\begin{array}{*{20}{c}} {d_{ij} = \left\{ {\begin{array}{*{20}{c}} {1,} & {c_{ij} \ge 1} \\ {0,} & {c_{ij} < 1} \end{array}} \right.} \end{array}$$

(3)

This leads directly to the first term in the objective function.

Finally, the objective function can be extended to handle any combination of single cell and bulk data. The general cost function is as follows:

$$\begin{array}{*{20}{c}} {{\mathrm{min}}\frac{1}{2}\alpha _1\left\| {C_s - W_3H_3} \right\|_F^2 + \frac{1}{2}\alpha _2\mathop {\sum }\limits_i \mathop {\sum }\limits_{j \in RE_i} \left( {C_b^{(i,j)} - \alpha _1D_b^{(i,j)}\mathop {\sum }\limits_{k = 1}^K \lambda _kW_1^{(i,k)}W_2^{(j,k)}} \right)^2} \\ {\frac{{\mu _1}}{2}\beta _1\left\| {E_s - W_1H_1} \right\|_F^2 + \frac{{\mu _1}}{2}\beta _2\mathop {\sum }\limits_i \left( {E_b^{(i)} - \alpha _2\mathop {\sum }\limits_{k = 1}^K \lambda _k\mathop {\sum }\limits_{j \in RE_i} W_3^{(i,j,k)}W_2^{(j,k)}} \right)^2} \\ {\frac{{\mu _2}}{2}\gamma _1\left\| {O_s - W_2H_2} \right\|_F^2 + \frac{{\mu _2}}{2}\gamma _2\mathop {\sum }\limits_j \left( {O_b^{(j)} - \alpha _3\mathop {\sum }\limits_{k = 1}^K \lambda _k\mathop {\sum }\limits_{i \in TG_i} W_3^{(i,j,k)}W_1^{(i,k)}} \right)^2} \\ { + \frac{{\mu _3}}{2}\alpha _1\beta _1\gamma _1\mathop {\sum }\limits_{k = 1}^K \lambda _k\mathop {\sum }\limits_i \mathop {\sum }\limits_{j \in RE_i} W_3^{\left( {i,j,k} \right)}W_1^{\left( {i,k} \right)}W_2^{\left( {j,k} \right)}} \end{array}$$

(4)

$$S = \{ (i,j){\mathrm{|}}C_s^{\left( {i,j,:} \right)} \, > \, 0\,{\mathrm{or}}\,C_b^{\left( {i,j} \right)} \, > \, 0\}$$

$$C_b:\left| S \right| \times 1{\mathrm{matrix}}$$

$$C_s:\left| S \right| \times n_3{\mathrm{matrix}}$$

$$W_3:\left| S \right| \times K{\mathrm{matrix}}$$

$$W_3^{(i,j,k)}:\left( {i,j} \right){\mathrm{represent}}\,{\mathrm{one}}\,{\mathrm{row}}\,{\mathrm{in}}\,W_3$$

$$\alpha _1 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{HiChIP}}\,{\mathrm{bulk}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{HiChIP}}\,{\mathrm{bulk}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$\alpha _2 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{HiChIP}}\,{\mathrm{single - cell}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{HiChIP}}\,{\mathrm{single - cell}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$\beta _1 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{RNA - seq}}\,{\mathrm{bulk}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{RNA - seq}}\,{\mathrm{bulk}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$\beta _2 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{RNA - seq}}\,{\mathrm{single - cell}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{RNA - seq}}\,{\mathrm{single - cell}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$\gamma _1 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{ATAC - seq}}\,{\mathrm{bulk}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{ATAC - seq}}\,{\mathrm{bulk}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$\gamma _2 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{indicate}}\,{\mathrm{including}}\,{\mathrm{ATAC - seq}}\,{\mathrm{single - cell}}\,{\mathrm{input}}} \hfill \\ 1 \hfill & {{\mathrm{indicate}}\,{\mathrm{not}}\,{\mathrm{including}}\,{\mathrm{ATAC - seq}}\,{\mathrm{single - cell}}\,{\mathrm{intput}}} \hfill \end{array}} \right.$$

$$s.t.\mathop {\sum }\limits_{k = 1}^K h_1^{(k,j)} = 1;j = 1,2, \ldots ,n_1$$

$$\mathop {\sum }\limits_{k = 1}^K h_2^{(k,j)} = 1;j = 1,2, \ldots ,n_2$$

$$\mathop {\sum }\limits_{k = 1}^K h_3^{(k,j)} = 1;j = 1,2, \ldots ,n_3$$

$$\mathop {\sum }\limits_{k = 1}^K \lambda _k = 1;$$

We used different α, β, γ to deal with different input combinations: (1) α₁ = 1, α₂ = 0, β₁ = 1, β₂ = 0, γ₁ = 1, γ₂ = 0, indicates scRNA-seq, scATAC-seq and scHi-C input; (2) α₁ = 0, α₂ = 1, β₁ = 1, β₂ = 0, γ₁ = 1, γ₂ = 0, indicates scRNA-seq, scATAC-seq and bulk HiChIP input; (3)α₁ = 0, α₂ = 1, β₁ = 0, β₂ = 1, γ₁ = 1, γ₂ = 0, indicates bulk RNA-seq, scATAC-seq and bulk HiChIP input; (4)α₁ = 0, α₂ = 1, β₁ = 1, β₂ = 0, γ₁ = 0, γ₂ = 1, indicates scRNA-seq, bulk ATAC-seq and bulk HiChIP input.

Optimization algorithm

We proposed a multiplicative update algorithm to solve the following non-convex optimization problem. Taking DC3 with scRNA-seq, scATAC-seq and bulk HiChIP input as example:

$$\begin{array}{*{20}{c}} {\begin{array}{*{20}{c}} {} \\ {\min } \\ {W_1,H_1,W_2,H_2,{\mathrm{\Lambda }} \ge 0} \end{array}\frac{1}{2}\left\| {C - {\mathrm{\alpha }}D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \right\|_F^2 + \frac{{\mu _1}}{2}\left\| {E - W_1H_1} \right\|_F^2 + \frac{{\mu _2}}{2}\left\| {O - W_2H_2} \right\|_F^2} \\ {s.t.\mathop {\sum }\limits_{k = 1}^K h_{kj}^1 = 1;j = 1,2, \ldots ,n_1} \\ {\mathop {\sum }\limits_{k = 1}^K h_{kj}^2 = 1;j = 1,2, \ldots ,n_2} \\ {\mathop {\sum }\limits_{k = 1}^K \lambda _k = 1;} \end{array}$$

(5)

Let $w_{ij}^1$ represent the element of the i-th row and the j-th column in matrix W₁ and $w_{ij}^2$, $h_{ij}^1$, $h_{ij}^2$ be the corresponding elements in W₂, H₁ and H₂. We adopted the following update scheme and stopped the iteration when the relative error was less than 10⁻⁴.

$$\begin{array}{*{20}{c}} {w_{ij}^1 \leftarrow w_{ij}^1\frac{{\left( {\mu _1EH_1^T + \alpha CW_2{\mathrm{\Lambda }}^T} \right)_{ij}}}{{\left( {\mu _1W_1H_1H_1^T + \alpha ^2D \odot (W_1{\mathrm{\Lambda }}W_2^T)W_2{\mathrm{\Lambda }}^T} \right)_{ij}}}} \end{array}$$

(6)

$$\begin{array}{*{20}{c}} {h_{ij}^1 \leftarrow h_{ij}^1\frac{{\left( {W_1^TE} \right)_{ij}}}{{\left( {W_1^TW_1H_1} \right)_{ij}}}} \end{array}$$

(7)

$$\begin{array}{*{20}{c}} {w_{ij}^2 \leftarrow w_{ij}^2\frac{{\left( {\mu _2OH_2^T + C^TW_1{\mathrm{\Lambda }}} \right)_{ij}}}{{\left( {\mu _2W_2H_2H_2^T + D^T \odot (W_2{\mathrm{\Lambda }}^TW_1^T)W_1{\mathrm{\Lambda }}} \right)_{ij}}}} \end{array}$$

(8)

$$\begin{array}{*{20}{c}} {h_{ij}^2 \leftarrow h_{ij}^2\frac{{\left( {W_2^TO} \right)_{ij}}}{{\left( {W_2^TW_2H_2} \right)_{ij}}}} \end{array}$$

(9)

$$\begin{array}{*{20}{c}} {h_{ij}^1 \leftarrow \frac{{h_{ij}^1}}{{\mathop {\sum }\nolimits_k h_{kj}^1}}} \end{array}$$

(10)

$$\begin{array}{*{20}{c}} {h_{ij}^2 \leftarrow \frac{{h_{ij}^2}}{{\mathop {\sum }\nolimits_k h_{kj}^2}}} \end{array}$$

(11)

$$\begin{array}{*{20}{c}} {\alpha \leftarrow \frac{{tr\left( {\hat C^TC} \right)}}{{tr\left( {\hat C^T\hat C} \right)}}\hat C = D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \end{array}$$

(12)

$$\begin{array}{*{20}{c}} {\lambda _k \leftarrow \frac{{\mathop {\sum }\nolimits_i^{n_1} h_{ik}^1 + \mathop {\sum }\nolimits_i^{n_2} h_{ik}^2}}{{n_1 + n_2}}} \end{array}$$

(13)

Subpopulation-specific subnetwork connectivity (for hyper-parameter selection, see below)

We first applied t-test to select top 5% subpopulation-specific genes and subpopulation-specific enhancers. Then we regarded these subpopulation-specific genes and enhancers in each subpopulation as nodes and formed K subpopulation-specific subnetworks. Here we defined connectivity as the edges that fall within the given subpopulation-specific subnetwork. Suppose the subnetwork contains n nodes and the strength of edge between node i and j is A_ij. Then the subpopulation-specific connectivity is given by the sum of A_ij over all pairs of nodes i,j. Subpopulation-specific subnetwork connectivity measures the specificity of each subpopulation-specific subnetwork, including the specificity of genes, enhancers, and enhancer-gene interactions.

Initialization and hyper-parameters selection

We selected hyper-parameters μ₁ and μ₂ according to the connectivity of the subpopulation-specific subnetworks (Supplementary Table 13). We first solved the optimization problem $\min _{W_1,H_1 \ge 0}\left\| {E - W_1H_1} \right\|_F^2$ and $\min _{W_2,H_2 \ge 0}\left\| {O - W_2H_2} \right\|_F^2$ by the alternating least-squares (ALS) algorithm with 50 different initializations using a Monte Carlo-type approach and got the solutions for W₁₀, H₁₀, W₂₀, H₂₀, which would be used as initializations in our optimization problem. Then we calculated

$$\begin{array}{*{20}{c}} {\mu _{10} = \left\| {C - {\mathrm{\alpha }}D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \right\|_F^2/\left\| {E - W_1H_1} \right\|_F^2} \end{array}$$

(14)

$$\begin{array}{*{20}{c}} {\mu _{20} = \left\| {C - {\mathrm{\alpha }}D \odot \left( {W_1{\mathrm{\Lambda }}W_2^T} \right)} \right\|_F^2/\left\| {O - W_2H_2} \right\|_F^2} \end{array}$$

(15)

The hyper-parameter μ₁ was chosen from μ₁₀ × [10⁰, 10¹, 10², 10³, 10⁴], and μ₂ was chosen from μ₂₀ × [10⁰, 10¹, 10², 10³, 10⁴]. We used the sum the connectivity of K subpopulation-specific subnetworks to select the best hyper-parameters and chose the ones which had the highest connectivity. The number of clusters K can be determined by a method similar to that in Brunet et al.³⁶ (Supplementary Fig. 22).

Subpopulation profiles

For single-cell input, we calculated the mean profiles for those cells with the same cluster labels. For example, the subpopulation gene expression profiles are given by the columns of the matrix P₁, The k-th column of P₁ is computed by averaging the single cell expression profile of the cells in cluster K. For example, $P_1^{( \cdot k)} = \mathop {\sum }\nolimits_{j \in S_k} E_{ \cdot j}/\left| {S_k} \right|$, S_k:{j|j − th cell belongs to cluster K}. The cluster mean based subpopulation profile P₁ is similar to the subpopulation profile W₁ from matrix factorization. However, the results on simulation data show that the cluster mean based subpopulation profile have better performance in HiChIP deconvolution than the matrix factorization-based subpopulation profile. The computation of subpopulation profile P₂ from scATAC-seq and subpopulation profile P₃ from scHi-C are similar.

For bulk data type, we can use a simple plug-in expression to obtain its subpopulation profiles from the subpopulation profiles already obtained from single-cell data. For example, the subpopulation HiChIP profiles P₃ can be obtained from single-cell-averaged- profiles P₁ and P₂by using the expression $p_{ijk}^3 = {\mathrm{\alpha }}d_{ij}\lambda _kp_{ik}^1p_{jk}^2$.

Poisson model-based estimate for subpopulation HiChIP profiles

We also developed a statistical model for the deconvolution of bulk HiChIP profile into subpopulation HiChIP profiles.

In this model the observed loop count C_ij, is a sum of latent loop counts, i.e.,

$$C_{ij} = \mathop {\sum}\limits_k {C_{ijk}}$$

(16)

where $C_{ijk}$ indicates the loop counts between the i-th promoter and the j-th enhancer in the k-th cluster.

We assume that $C_{ijk}$ is generated from a Poisson model

$$\begin{array}{*{20}{c}} {C_{ijk}\sim Poisson\left( {\lambda \sqrt {O_{ik}O_{jk}} } \right)} \end{array}$$

(17)

O_ik indicates the openness of the i-th promoter in the k-th cluster, O_jk indicates the openness of the j-th enhancer in the k-th cluster, and these openness values are assumed to be known constants. Let

$$\begin{array}{*{20}{c}} {{\boldsymbol{M}}_{{\boldsymbol{ij}}} = \left( {C_{ij1}, \ldots ,C_{ijK}} \right)} \end{array}$$

(18)

$$\begin{array}{*{20}{c}} {{\boldsymbol{\rho }}_{{\boldsymbol{ij}}} = \left( {\rho _{ij1}, \ldots ,\rho _{ijK}} \right)} \end{array}$$

(19)

$$\begin{array}{*{20}{c}} \hskip -21pt {\rho _{ijk} = \frac{{\sqrt {O_{ik}O_{jk}} }}{{\mathop {\sum }\nolimits_k \sqrt {O_{ik}O_{jk}} }}} \end{array}$$

(20)

Then

$$\begin{array}{*{20}{c}} {{\boldsymbol{M}}_{{\boldsymbol{ij}}}|C_{ij}\sim Multinomial\left( {C_{ij},{\boldsymbol{\rho }}_{{\boldsymbol{ij}}}} \right)} \end{array}$$

(21)

Let E_ik be the gene expression of the i-th gene in the k-th cluster. We assume this gene expression follows a Poisson distribution with a rate proportional to the sum of cluster-specific HiChIP interactions involving the i-th promoter.

$$\begin{array}{*{20}{c}} {E_{ik}\sim Poisson\left( {\beta \mathop {\sum }\limits_q C_{iqk}} \right)} \end{array}$$

(22)

Given the two sets of observations {C_ij} and {E_jk}, our task is to infer the set of latent variables {C_ijk}. We do this by iteratively computing the MAP (maximum a posteriori) estimate of the latent multinomial variable M_ij conditional on the set of all other latent variables {M_rs:(r, s) ≠ (i,j)}. Specifically, given the current values of {C_rsk:(r, s) ≠ (i, j)} we compute the value of {C_ij1, …, C_ijK} that maximizes the following conditional posterior probability:

$$\begin{array}{*{20}{c}} {\left( {\begin{array}{*{20}{c}} {C_{ij}} \\ {{\boldsymbol{M}}_{{\boldsymbol{ij}}}} \end{array}} \right)\mathop {\prod }\limits_{k = 1}^K \rho _{ijk}^{C_{ijk}}e^{ - \beta \left( {\mathop {\sum }\limits_{q \ne j} C_{iqk} + C_{ijk}} \right)}\left[ {\beta \left( {\mathop {\sum }\limits_{q \ne j} C_{iqk} + C_{ijk}} \right)} \right]^{E_{ik}}} \end{array}$$

(23)

Dropout simulation

Dropout usually refers to the phenomenon that an expressed RA molecule might not be captured in a single cell. To test whether our algorithm could still function well in the presence of dropout events, we used a Bernoulli distribution to decide which “sites” (genes or enhancers) should be dropped in scRNA-seq and scATAC-seq data. Zero values were introduced into the simulated data for each gene/enhancer based on a Bernoulli distribution defined by the dropout rate. In our experiments, we chose the dropout rate from [0, 0.5, 0.8, 0.9].

Down sampling scRNA-seq and scATAC-seq in RA day 4

To simulate dropout based on scRNA-seq dataset from Drop-seq platform, we first down sampled each gene’s read count or each enhancer’s read count P_ij as $\hat P_{ij} = P_{ij}/100$, where $\hat P_{ij}\sim Poi(P_{ij}/100)$, and the dropout effect was modeled as $D_i\sim Ber\left( {\frac{1}{{1 + P_{ij}/100^{ - 0.1}}}} \right)$ (P indicates the openness matrix O or expression matrix E).

Surface markers selection

To sort the subpopulation 2 cells, we selected subpopulation 2 specific surface markers from gene expression data. We required that the selected surface markers satisfy the following conditions: (i) Differentially expressed between subpopulation 2 and the other subpopulations. (ii) Highly expressed in subpopulation 2, and (iii) expression level in subpopulations 1 and 3 are less than 2. In practice, we compared the distribution of surface markers’ expression in subpopulation 2 versus the other subpopulations by t-test. We selected the top 20 markers and further require that TPM expression in subpopulation 2 be greater than 10 and higher than in subpopulations 1 and 3.

Performance evaluation

There are two tasks for our algorithm: (1) deconvolution of subpopulation-specific HiChIP; (2) coupled clustering of scRNA-seq and scATAC-seq. For deconvolution, we ran our algorithm 50 times and evaluated the results in terms of mean Pearson Correlation Coefficient (PCC) of true subpopulation-specific HiChIP values and the predicted values. In detail, if there are n interactions in the bulk HiChIP data, both the true subpopulation-specific HiChIP and the predicted subpopulation-specific HiChIP are represented using n-dimensional vectors. Then the deconvolution performance is evaluated by calculating the PCC score between the true vector and predicted vector. For coupled clustering, we evaluated the performance in terms of error rate of true subpopulation labels and the predicted cluster assignments. We ran our algorithm and NMF 50 times from different initial values to calculate the mean error rate and compared our algorithm with NMF.

Subpopulation-specific genes

We defined subpopulation-specific genes according to its p-values from scRNA-seq and HiChIP data. For scRNA-seq, we applied a one-tailed t-test to define the subpopulation-specific genes and obtained the scRNA-seq p-values; For HiChIP, we first applied a one-tailed binomial test to define subpopulation-specific interactions. To eliminate the bias of various loop counts of interactions, we normalized the total loop counts to N (e.g., N = 10) for each interaction and get the modified loop counts in the k-th subpopulation as n_k. The expected proportion of the interactions in each subpopulation p_k is regarded as 1/K based on the assumption that each interaction is uniformly distributed in each subpopulation. Then, we calculated p-value for the interaction in the k-th subpopulation using the binomial test in R binom.test (n_k, N, p_k, alternative = “greater”). For genes with more than one interaction, we chose the most significant p-value as the HiChIP p-value of the gene. We further combined these two p-values for each gene using Fisher’s methods and select the top 1000 subpopulation-specific genes with the smallest combined p-values in each subpopulation.

Motif enrichment scores

We merged the scATAC-seq cells in each subpopulation to get a merged sample. On this sample, we used MACS2 to call the open peaks. We performed motif enrichment analysis on those open peaks by Homer. The motif enrichment score was defined by geometric mean of –log10(p-value) and fold change.

$$\begin{array}{*{20}{c}} {{\mathrm{Motif}}\,{\mathrm{enrichment}}\,{\mathrm{score}} = \sqrt { - \log _{10}\left( {{\mathrm{pvalue}}} \right) \times {\mathrm{FoldChange}}} } \end{array}$$

(24)

GO terms selection

For each subpopulation, we ranked GO terms using motif enrichment scores and kept the significant GO terms with scores larger than 2. Then we removed the GO terms which were significant in all three subpopulations. For example, GO terms cell projection morphogenesis, regulation of cellular component movement, regulation of localization, regulation of biological quality and etc. were significant in all three subpopulations, and we removed these GO terms for subsequent subpopulation-specific analysis.

Dense subnetwork detection

Given a directed weighted graph (G, W), we obtained the dense subnetwork by solving the following optimization problem:

$$\begin{array}{*{20}{c}} {\max \mathop {\sum }\limits_i \mathop {\sum }\limits_j W_{ij}x_iy_j} \\ {{\mathrm{s}}.{\mathrm{t}}.x_1^\beta + x_2^\beta + \ldots + x_n^\beta = 1} \\ {y_1^\beta + y_2^\beta + \ldots + y_n^\beta = 1} \\ {x_i \ge 0;i = 1,2, \ldots n} \\ {y_i \ge 0;i = 1,2, \ldots n} \end{array}$$

(25)

where nonzero value of x_i indicates that the i-th node in graph G is included in the subnetwork as a regulator, nonzero value of y_i indicates that the node is included in the module as a target. We chose β = 1, which leads to L1-type constraint that promotes sparse solutions. Then, the dense subnetwork was given by the set of nodes with nonzero x_i or y_i. We can further partition the dense subnetwork into the core node C, the upstream node U and the downstream node D, defined respectively as

$$\begin{array}{*{20}{c}} {U = \left\{ {i|x_i \, > \, 0,y_i = 0} \right\}} \end{array}$$

(26)

$$\begin{array}{*{20}{c}} {C = \left\{ {i|x_i \, > \, 0,y_i \, > \, 0} \right\}} \end{array}$$

(27)

$$\begin{array}{*{20}{c}} {D = \left\{ {i|x_i = 0,y_i \, > \, 0} \right\}} \end{array}$$

(28)

To test whether the extracted dense subnetwork is statistically significant or not, we generated a null distribution by permuting the network. In the permutation, to maintain the same in-degree and out-degree of each nodes of the network, we used the switching permutation operation (selected two edge every time and switch source and target), and switched 1000 times to generate the random network. We generated 1000 random networks and extracted the dense subnetwork on random networks. We calculated a p-value by comparing the optimal value from our network with that from random networks.

Cell culture

Mouse ES cell lines R1 were obtained from American Type Culture Collection (ATCC, Cat. no. SCRC-1036). The mESCs were first expanded on an MEF feeder layer previously irradiated. Then, subculturing was carried out on 0.1% bovine gelatin-coated tissue culture plates. Cells were propagated in mESC medium consisting of Knockout DMEM supplemented with 15% knockout serum replacement, 100 μM nonessential amino acids, 0.5 mM beta-mercaptoethanol, 2 mM GlutaMax, and 100 U/mL Penicillin-Streptomycin with the addition of 1000 U/mL of LIF (ESGRO, Millipore).

Cell differentiation

mESCs were differentiated using the hanging drop method³⁷. Trypsinized cells were suspended in differentiation medium (mESC medium without LIF) to a concentration of 50,000 cells/ml. 20 μl drops (~1000 cells) were then placed on the lid of a bacterial plate and the lid was upside down. After 48 h incubation, Embryoid bodies (EBs) formed at the bottom of the drops were collected and placed in the well of a 6-well ultra-low attachment plate with fresh differentiation medium containing 0.5 μM retinoic acid (RA) for up to 4 days, with the medium being changed daily.

HiChIP

We followed the HiChIP protocol published by Mumbach et al.³⁷, using antibody to H3K27ac (Abcam, ab4729) with the following modifications. The EBs were first treated with StemPro Accutase Cell Dissociation Reagent (Thermo Fisher) at 37 °C for 10–15 min with pipetting. Approximately one million cells were crosslinked with freshly prepared 1% formaldehyde. The pellet was then resuspended in 500 μl of ice-cold Hi-C Lysis buffer. After digestion with 25 U (5 μl of 5U/μl) MboI restriction enzyme and ligation, the nuclear pellet was brought up to 880 μl of Nuclear Lysis Buffer. Samples were sheared using a Covaris E220 using the following parameters: fill level = 10, duty cycle = 5, PIP = 140, cycles/burst = 200, time = 2 min and then clarified by centrifugation for 15 min at 16,100 × g at 4 °C. The samples were precleared with 6 μl Dynabeads Protein A (Thermo Fisher) at 4 °C for 1 h. We then added 2.5 μg of antibody to H3K27ac, and captured the chromatin-antibody complex with 6 μl of Dynabeads Protein A. Approximately 2–4 ng of ChIP DNA was obtained following Qubit quantification. The amount of Tn5 used and number of PCR cycles performed were based on the post-ChIP Qubit amounts, as described in the HiChIP protocol³⁷. The library was sequenced on Illumina NextSeq 500 with 75 bp paired-end reads. Total 13 million cells were used in HiChIP experiment.

Fluorescence-activated cell sorting (FACS)

The EBs treated with RA for 4 days were trypsinized with 1 ml StemPro Accutase Cell Dissociation Reagent (Thermo Fisher) at 37 °C for 10–15 min with pipetting. Once EBs got dissociated, 4 ml of Flow Cytometry Staining Buffer (Invitrogen, Cat. no. 00-4222) was add to the cell sample. The single cells were obtained by filtering twice with 40 μm cell strainer. After centrifuge at 500 × g for 4 min, the supernatant was removed and cells were resuspended in 500–700 μl of Flow Cytometry Staining Buffer to obtain the final concentration of 4 × 10⁷ cells/ml. 100 μl cells were used as unstained negative control cells for FACS analysis. The remaining cells were distributed at 100 μl per tube (~4 × 10⁶ cells) into Falcon® 12 × 75 mm round-bottom polystyrene test tube (Thermo Scientific), 100 μl per tube. To block non-specific Fc-mediated interactions, all tubes were first pre-incubated with 0.5 μg of Anti-Mouse CD16/32 antibody (1:40 dilution, Invitrogen, Cat. no. 14-0161) for 15 min at 4 °C. Then 0.125 μg PE-Cy7-labeled EpCAM (1:160 dilution, Invitrogen, Cat. no. 25-5791-80) and 0.1 μg PE-labeled CD38 (1:200 dilution, Invitrogen, Cat. no. 12-0381-82) were added to the tubes. After incubation for at least 30 min on ice, the cells were washed with 2 ml Flow Cytometry Staining Buffer + 1 mM EDTA to prevent cell adhesion. The cells were spin down at 500 × g for 5 min at room temperature and the wash step was repeated twice. The final cells were resuspended in 200 μl of Flow Cytometry Staining Buffer + 1 mM EDTA for FACS analysis. As compensation controls, 1 drop of UltraComp eBeads (Invitrogen, Cat. no. 01-2222) was added to three empty 12 × 75 mm round bottom test tubes, followed by adding 0.125 μg PE-Cy7-labeled EpCAM (labeled as PE-Cy7 only compensation beads), 0.1 μg PE-labeled CD38 (labeled as PE only compensation beads), or no antibody (labeled as no stain compensation beads). After mixing well by flicking, the tubes were incubated on ice for 20 min, followed by washing with 2 ml of Flow Cytometry Staining Buffer twice. After removing supernatant, 200 μl from each tube was used as compensation controls for FACS analysis. Five independent experiments were performed for FACS analysis, each time we obtained approximately 700,000 EpCAM and CD38 double positive cells. Those cells were collected in 15 ml conical tube, then the cells were spin down and crosslinked with freshly prepared 1% formaldehyde based on the HiChIP protocol³⁷. After crosslinking, the cells were ready for the following HiChIP analysis.

Software availability

DC3 is implemented in Python 2.7 and freely available at https://github.com/SUwonglab/DC3.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The HiChIP data that support the findings of this study have been deposited in Gene Expression Omnibus (GEO) with the accession code GSE127807. The single-cell data that support the finding of this study are available in GEO with the accession code GSE115968 and GSE107651. All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding authors upon reasonable request. The source data underlying Fig. 2a, b and 3 are provided as a Source Data file. A reporting summary for this Article is available as a Supplementary Information file.

References

Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
Article CAS Google Scholar
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Article CAS ADS Google Scholar
Ramani, V. et al. Massively multiplex single-cell Hi-C. Nat. Methods 14, 263–266 (2017).
Article CAS Google Scholar
Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
Article CAS Google Scholar
Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).
Article ADS Google Scholar
Duren, Z. et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl Acad. Sci. USA 115, 7723–7728 (2018).
Article CAS Google Scholar
Duren, Z., Chen, X., Jiang, R., Wang, Y. & Wong, W. H. Modeling gene regulation from paired expression and chromatin accessibility data. Proc. Natl Acad. Sci. USA 114, E4914–E4923 (2017).
Article CAS Google Scholar
Mumbach, M. R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat. Methods 13, 919–922 (2016).
Article CAS Google Scholar
Rizzetto, S. et al. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells. Sci. Rep. 7, 12781 (2017).
Article ADS Google Scholar
Kulkens, T., Vanheerikhuizen, H., Klootwijk, J., Oliemans, J. & Planta, R. J. A yeast ribosomal DNA-binding protein that binds to the rdna enhancer and also close to the site of Pol-I transcription initiation is not important for enhancer functioning. Curr. Genet. 16, 351–359 (1989).
Article CAS Google Scholar
Pattyn, A. et al. Ascl1/Mash1 is required for the development of central serotonergic neurons. Nat. Neurosci. 7, 589 (2004).
Article CAS Google Scholar
Vierbuchen, T. et al. Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035 (2010).
Article CAS ADS Google Scholar
Thaler, J. P., Lee, S.-K., Jurata, L. W., Gill, G. N. & Pfaff, S. L. LIM factor Lhx3 contributes to the specification of motor neuron and interneuron identity through cell-type-specific protein-protein interactions. Cell 110, 237–249 (2002).
Article CAS Google Scholar
He, A. et al. Dynamic GATA4 enhancers shape the chromatin landscape central to heart development and disease. Nat. Commun. 5, 4907 (2014).
Article CAS ADS Google Scholar
Blackshear, P. J. et al. Graded phenotypic response to partial and complete deficiency of a brain-specific transcript variant of the winged helix transcription factor RFX4. Development 130, 4539–4552 (2003).
Article CAS Google Scholar
Hashizume, K., Yamanaka, M. & Ueda, S. POU3F2 participates in cognitive function and adult hippocampal neurogenesis via mammalian-characteristic amino acid repeats. Genes. Brain. Behav. 17, 118–125 (2018).
Article CAS Google Scholar
Yogarajah, M. et al. PAX6, brain structure and function in human adults: advanced MRI in aniridia. Ann. Clin. Transl. Neurol. 3, 314–330 (2016).
Article CAS Google Scholar
Avraham, O. et al. Transcriptional control of axonal guidance and sorting in dorsal interneurons by the Lim-HD proteins Lhx9 and Lhx1. Neural Dev. 4, 21 (2009).
Article Google Scholar
Cho, J.-H. & Tsai, M.-J. The role of BETA2/NeuroD1 in the development of the nervous system. Mol. Neurobiol. 30, 35–47 (2004).
Article CAS Google Scholar
Ashique, A. M. et al. The Rfx4 transcription factor modulates Shh signaling by regional control of ciliogenesis. Sci. Signal. 2, ra70–ra70 (2009).
Article Google Scholar
Kikkawa, T. et al. The role of Pax6 in brain development and its impact on pathogenesis of autism spectrum disorder. Brain Res. 1705, 95–103 (2018).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 e18 (2018).
Article CAS Google Scholar
Carter, R. A. et al. A single-cell transcriptional atlas of the developing murine cerebellum. Curr. Biol. 28, 2910–2920 e2 (2018).
Article CAS Google Scholar
Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).
Article CAS ADS Google Scholar
Cusanovich, D. A. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018).
Article CAS ADS Google Scholar
Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
Article CAS ADS Google Scholar
Lin, Z. et al. Model-based approach to the joint analysis of single-cell data on chromatin accessibility and gene expression. Stat. Sci. 2019 (Accepted).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome. Biol. 9, R137 (2008).
Article Google Scholar
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Article CAS Google Scholar
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Article CAS Google Scholar
Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Article ADS Google Scholar
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Article CAS Google Scholar
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Article CAS Google Scholar
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome. Biol. 16, 259 (2015).
Article Google Scholar
Lareau, C. A. & Aryee, M. J. hichipper: a preprocessing pipeline for calling DNA loops from HiChIP data. Nat. Methods 15, 155–156 (2018).
Article CAS Google Scholar
Brunet, J.-P., Tamayo, P., Golub, T.R. & Mesirov, J.P. Metagenes and molecular pattern discovery using matrix factorization. 101, 4164–4169 (2004).
Mumbach, M. R. et al. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nat. Genet. 49, 1602–1612 (2017).
Article CAS Google Scholar

Download references

Acknowledgements

W.H.W. and Z.D. were supported by NIH grants P50HG007735, R01HG010359, and R01GM109836. X.C. was partially supported by VA Palo Alto Health Care System. Y.W. was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13000000). W.W.Z. and R.J. were supported by National Key Research and Development Program of China No. 2018YFC0910404, the National Natural Science Foundation of China Nos. 61873141, 61721003, 61573207, and the Tsinghua-Fuzhou Institute for Data Technology. R.J. is also supported by a RONG professorship at the Institute for Data Science of Tsinghua University. Cell sorting/flow cytometry analysis for this project was done on instruments in the Stanford Shared FACS Facility.

Author information

These authors contributed equally: Wanwen Zeng, Xi Chen, Zhana Duren

Authors and Affiliations

Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, CA, 94305, USA
Wanwen Zeng, Xi Chen, Zhana Duren & Wing Hung Wong
MOE Key Laboratory of Bioinformatics, Bioinformatics Division, Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, 100084, China
Wanwen Zeng & Rui Jiang
CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China
Yong Wang
Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China
Yong Wang

Authors

Wanwen Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Xi Chen
View author publications
You can also search for this author in PubMed Google Scholar
Zhana Duren
View author publications
You can also search for this author in PubMed Google Scholar
Yong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Rui Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Wing Hung Wong
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

W.H.W. and R.J. conceived the project. W.W.Z. and Z.D. designed the analytical approach and performed numerical experiments and data analysis. X.C. performed all biological experiments. W.W.Z. wrote the software. Y.W., R.J., and W.H.W. supervised the research. All authors wrote, revised, and contributed to the final paper.

Corresponding authors

Correspondence to Rui Jiang or Wing Hung Wong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zeng, W., Chen, X., Duren, Z. et al. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nat Commun 10, 4613 (2019). https://doi.org/10.1038/s41467-019-12547-1

Download citation

Received: 28 February 2019
Accepted: 10 September 2019
Published: 10 October 2019
DOI: https://doi.org/10.1038/s41467-019-12547-1

This article is cited by

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data
- Qiuyue Yuan
- Zhana Duren
Nature Biotechnology (2024)
Challenges and perspectives in computational deconvolution of genomics data
- Lana X. Garmire
- Yijun Li
- Andrew E. Teschendorff
Nature Methods (2024)
Current and future perspectives of single-cell multi-omics technologies in cardiovascular research
- Wilson Lek Wen Tan
- Wei Qiang Seow
- Joseph C. Wu
Nature Cardiovascular Research (2023)
Gene regulatory network inference in the era of single-cell multi-omics
- Pau Badia-i-Mompel
- Lorna Wessels
- Julio Saez-Rodriguez
Nature Reviews Genetics (2023)
Gene regulatory network reconstruction: harnessing the power of single-cell multi-omic data
- Daniel Kim
- Andy Tran
- Pengyi Yang
npj Systems Biology and Applications (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

The DC3 algorithm

Evaluation of deconvolution on in silico mixture of cells

Evaluation of clustering on in silico mixture of cells

Evaluation of deconvolution on experimental mixtures

DC3 improves interpretation of subpopulations

Subpopulation accessibility, expression, and loop profiles

Discussion

Methods

Data preprocessing

Simulation data construction

DC3 algorithm

Optimization algorithm

Subpopulation-specific subnetwork connectivity (for hyper-parameter selection, see below)

Initialization and hyper-parameters selection

Subpopulation profiles

Poisson model-based estimate for subpopulation HiChIP profiles

Dropout simulation

Down sampling scRNA-seq and scATAC-seq in RA day 4

Surface markers selection

Performance evaluation

Subpopulation-specific genes

Motif enrichment scores

GO terms selection

Dense subnetwork detection

Cell culture

Cell differentiation

HiChIP

Fluorescence-activated cell sorting (FACS)

Software availability

Reporting summary

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links