Abstract
Single-cell multimodal sequencing technologies have been developed to simultaneously profile different modalities of data in the same cell. They provide a unique opportunity to jointly analyze multimodal data at the single-cell level for the identification of distinct cell types. A correct clustering result is essential for downstream complex biological functional studies. However, combining different data sources for clustering analysis of single-cell multimodal data remains a statistical and computational challenge. Here, we develop a novel multimodal deep learning method, scMDC, for single-cell multi-omics data clustering analysis. scMDC is an end-to-end deep model that explicitly characterizes different data sources and jointly learns latent features of deep embedding for clustering analysis. Extensive simulation and real-data experiments reveal that scMDC outperforms existing single-cell single-modal and multimodal clustering methods on different single-cell multimodal datasets. The linear scalability of running time makes scMDC a promising method for analyzing large multimodal datasets.
Introduction
Single-cell RNA sequencing (scRNA-seq) provides a high-resolution picture of the inside of an individual cell. Building on scRNA-seq technology, many multimodal sequencing technologies have recently been developed to jointly profile multiple modalities of data in a single cell. For example, Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) and RNA expression and protein sequencing assay (REAP-seq) have been developed to profile mRNA expression and quantify surface proteins simultaneously at the cellular level^{1,2}. Specifically, CITE-seq employs existing single-cell sequencing technologies, such as the 10X Genomics Chromium platform^{3}, and allows the counting of Antibody-Derived Tags (ADT) to quantify cell-surface protein abundance. Each cell with ADT labels and DNA-barcoded microbeads is encapsulated in a droplet for single-cell sequencing^{4}. REAP-seq also combines DNA-barcoded antibodies with existing scRNA-seq approaches to measure the expression levels of genes and cell-surface proteins^{2}. In addition to studying single-cell transcriptomes and surface proteins, the recent development of the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) provides a way to measure chromatin accessibility in a single cell^{5}. Specifically, these technologies are designed to identify open chromatin regions in the genome by using the hyperactive Tn5 transposase, which simultaneously tags and fragments DNA sequences in open chromatin regions^{6}. scATAC-seq enables us to explore cell type-specific biological activities by investigating chromatin-accessibility signatures, such as the transcription factors that control the gene expression of cells. More recently, some multi-omics single-cell technologies have been developed to jointly profile chromatin accessibility and gene expression within a single cell^{7}, such as SNARE-seq and 10X Single-Cell Multiome ATAC + Gene Expression (which we denote as SMAGE-seq)^{8,9}.
Overall, these multimodal sequencing technologies provide us with a more comprehensive and complicated profile of a single cell. Therefore, the computational tools for jointly integrating different data views for downstream analyses, such as clustering analysis, are desired for these new powerful experimental technologies.
In multimodal data, the biological information provided by different modalities is complementary^{2,4}, and each modality generally has its own strengths and weaknesses. Using CITE-seq as an example, its ADT modality focuses on surface proteins. ADT data have demonstrated a low dropout rate^{4} and thus can reliably quantify cell activities. For the five CITE-seq datasets analyzed in this study, we observed dropout rates of up to 12% in ADT data. In contrast, there were more than 80% or even 90% zero entries in the corresponding mRNA data. For most genes, protein is the final product that fulfills their functions, and messenger RNA is an intermediate product. Thus, ADT data seem ideal for characterizing cell functions and types. However, due to current technical limits, ADT can profile only up to a couple of hundred proteins. Because of this limit, investigators generally include well-known cell type markers in the ADT modality first. Therefore, ADT data are good at identifying common cell types^{4,10}, such as CD4+ and CD8+ T cells, when their marker genes are profiled. However, because of their limited dimensions, ADT data may not detect rare or minor cell types well. In contrast, the full transcriptome of mRNA data can capture comprehensive cell types. Nevertheless, clustering cells based on scRNA-seq may be challenged by its large dropout rate and sparse signal with high dimensionality. Furthermore, the quantities of ADT and mRNA produced by the same gene may not be the same when considering post-transcriptional and post-translational regulation^{4,11}. In this case, ADT and mRNA data provide complementary information in cell type identification^{10}. For SNARE-seq and SMAGE-seq, scATAC-seq data provide chromatin accessibility information, which is also complementary to mRNA data^{8}. Thus, by integrating the information from multiple modalities, we should be able to arrive at a higher resolution of cell typing.
Clustering analysis is an essential step in most single-cell studies and has been studied extensively. Based on the clustering results, researchers can explore biological activities at the cell type or subtype level, which cannot be reached by studying bulk data^{12,13,14}. Numerous clustering methods have been designed for the analysis of scRNA-seq data. For example, Tscan applies principal component analysis (PCA) to the scRNA-seq data and then performs Gaussian mixture model (GMM) clustering on the low-dimensional representation^{15}. Seurat constructs a k-nearest neighbors (KNN) graph based on the Euclidean distance in PCA space. With the graph, it then employs the Louvain^{16}/Leiden algorithm to iteratively group cells together by optimizing modularity^{17}. The Louvain/Leiden algorithm has already become one of the most popular methods for scRNA-seq clustering. SC3 employs spectral clustering to obtain individual clustering results based on the distance matrices derived from the Euclidean, Pearson, and Spearman metrics, respectively. It then computes a consensus matrix by summarizing the three individual clustering results. Finally, the consensus matrix is clustered using hierarchical clustering to produce the final clustering results^{18}. However, these traditional single-cell clustering methods are not designed to take advantage of multi-omics data to improve clustering performance and are thus not applicable to multimodal data.
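The consensus step described for SC3 can be illustrated with a small sketch. It assumes the consensus matrix records, for every pair of cells, the fraction of individual clusterings that place the two cells in the same cluster; the function name `consensus_matrix` and the toy labels are hypothetical and are not taken from the SC3 package.

```python
from itertools import combinations


def consensus_matrix(labelings):
    """Fraction of clusterings in which each pair of cells shares a cluster."""
    n_cells = len(labelings[0])
    if any(len(labels) != n_cells for labels in labelings):
        raise ValueError("all labelings must cover the same cells")
    matrix = [[0.0] * n_cells for _ in range(n_cells)]
    for i in range(n_cells):
        matrix[i][i] = 1.0  # a cell always co-clusters with itself
    for i, j in combinations(range(n_cells), 2):
        agree = sum(labels[i] == labels[j] for labels in labelings)
        matrix[i][j] = matrix[j][i] = agree / len(labelings)
    return matrix


# Hypothetical cluster labels for four cells, one labeling per distance metric.
euclidean = [0, 0, 1, 1]
pearson = [1, 1, 0, 0]
spearman = [0, 0, 0, 1]
consensus = consensus_matrix([euclidean, pearson, spearman])
```

Because only co-membership is counted, the arbitrary numbering of clusters in each individual result does not matter. The resulting matrix can then be passed to a hierarchical clustering routine.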
Several methods have emerged for the clustering analysis of CITE-seq data in recent years. We recently proposed a single-cell deep constrained clustering framework, scDCC, that can integrate ADT information into the clustering analysis of scRNA-seq data through manually defined constraints^{19}. BREMSC^{10}, a hierarchical Bayesian mixture model, applies two multinomial models to jointly characterize scRNA-seq and ADT data. It assumes that the proportions (relative expression levels of genes or proteins) in the multinomial models follow Dirichlet distributions, and cell-specific random effects are introduced to model the correlation between the two data sources. Although BREMSC is one of the first models proposed for clustering analysis of CITE-seq data, it has several limitations. Firstly, it assumes that the data follow a specific distribution. Such parametric assumptions may not hold in all real applications. Secondly, BREMSC does not characterize dropout events, which are the major problem in the clustering of scRNA-seq data. Finally, BREMSC has a scalability issue: its running time becomes prohibitive when analyzing thousands of cells.
Meanwhile, CiteFuse, Seurat V4, and Specter can cluster CITE-seq data by using distance-based graphs. CiteFuse^{20} calculates the cell-to-cell similarity matrices of ADT and mRNA separately and then merges them with a similarity network fusion algorithm^{21}. Clustering is performed on the merged similarity matrix by using graph-based clustering algorithms such as the spectral^{22} and Louvain algorithms^{16}. However, similarity matrix-based clustering cannot explicitly consider the dropout events in scRNA-seq data. Hao et al. developed a weighted nearest-neighbor (WNN) procedure in Seurat V4 for multi-omics data clustering^{23}. Briefly, the WNN procedure learns the weights of multimodal data and generates a similarity graph of cells by a weighted combination of mRNA and protein views. Van et al.^{24} proposed a landmark-based spectral clustering (LSC) method, Specter, for clustering single-cell data with linear-time scalability. LSC picks a small set of cells as landmarks and calculates a Gaussian kernel-based similarity matrix between the rest of the cells and the landmarks, from which the whole Laplacian matrix is built. Different omics require different choices of the number of landmarks and the kernel bandwidth, and consensus clustering is used to build ensembles across modalities. Compared to BREMSC and CiteFuse, the WNN algorithm and Specter run much faster and require less memory. However, these two methods also fail to account for dropout events in the count data.
Another relevant line of research focuses on learning a joint embedding of different modalities. Such a joint embedding is expected to improve various downstream analyses, including clustering. TotalVI is a deep variational autoencoder that can capture the same latent space of different data types^{25}. With this design, TotalVI can learn a joint probabilistic representation of the paired ADT and mRNA measurements from CITE-seq data that accounts for the distinct information of each modality. Similarly, for SNARE-seq or SMAGE-seq data, Cobolt^{26} and scMM^{27} employ a Multimodal Variational Autoencoder to jointly model the multiple modalities and learn a joint embedding of the single-cell mRNA-seq and ATAC-seq data. However, these methods focusing on joint embedding are not designed and optimized for clustering, although we can, as a naïve solution, learn joint embeddings first and then apply simple clustering using, for example, k-means. Such a divided strategy is suboptimal for clustering, as shown in our experiments later.
As mentioned above, many existing methods fail to consider the dropout events in single-cell data during the learning of embedding and/or clustering. However, the pervasive dropout events make single-cell count data zero-inflated and over-dispersed. To better characterize single-cell mRNA count data, a zero-inflated negative binomial (ZINB) model has been widely used to account for the large dispersion and the dropout events^{28,29}. Many ZINB model-based methods, including deep learning approaches, have been developed to analyze scRNA-seq count data, such as ZINB-WaVE^{29}, DCA^{30}, scVI^{31}, and scDeepCluster^{28}, to name a few. These studies show that the ZINB model can effectively characterize scRNA-seq data and improve representation learning and clustering results.
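To make the ZINB model concrete, the following sketch evaluates its log-probability for a single count. It assumes the common parameterization with mean `mu`, dispersion `theta`, and zero-inflation (dropout) probability `pi`, in which a point mass at zero is mixed with a negative binomial distribution. The function names are illustrative and do not correspond to the code of any of the tools cited above.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under ZINB; requires mu > 0, theta > 0, 0 <= pi < 1."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero is either a dropout (probability pi) or a genuine NB zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_nll(counts, mu, theta, pi):
    """Mean negative log-likelihood of observed counts under shared ZINB parameters."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts) / len(counts)
```

In a deep model the three parameters would typically be predicted per cell and per gene by decoder outputs, and the mean negative log-likelihood would serve as the reconstruction loss. Here they are shared scalars to keep the example short.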
In this article, we propose a multimodal deep learning model, Single Cell Multimodal Deep Clustering (scMDC), for the clustering analysis of multimodal single-cell data. The network architecture of scMDC is shown in Fig. 1. scMDC employs a multimodal autoencoder^{32}, which applies one encoder to the concatenated data from different modalities and two decoders to separately decode the data of each modality. Following scDeepCluster^{28}, we apply the ZINB loss as the reconstruction loss. The bottleneck layer is used for deep K-means clustering^{33}. To further improve latent feature learning, we introduce a Kullback-Leibler divergence-based loss (KL loss), which attracts similar cells and separates dissimilar cells^{34}. The whole model, including the autoencoder, the KL loss, and the deep K-means clustering, is optimized simultaneously. scMDC is an end-to-end multimodal deep learning clustering method for modeling different multi-omics data. Taking advantage of graphics processing units (GPU), scMDC is very efficient in the analysis of large datasets. In addition, by employing a conditional autoencoder framework, scMDC can correct batch effects when analyzing multi-batch data. To our knowledge, scMDC is the first end-to-end deep clustering method that can both integrate multimodal data and remove the batch effect for different types of multimodal data. The superior performance of scMDC is observed in extensive experiments on both CITE-seq and SMAGE-seq data. After clustering, for a given cluster, we also detect the markers (genes or proteins) by transplanting an ACE model^{35} to scMDC and conduct a gene set enrichment analysis based on the gene ranks learned from ACE. The meaningful results of these downstream analyses further support the superior clustering performance of scMDC. We conclude that scMDC is a promising tool for clustering multimodal single-cell data.
Results
Real CITE-seq data evaluation
We first evaluate the clustering performance of scMDC on CITE-seq datasets in comparison with ten competing methods. The competing methods include the models designed for multimodal data clustering (BREMSC, CiteFuse, Specter, and SeuratV4), the models developed for learning an embedding for single or multimodal data (SCVIS and TotalVI), two clustering tools for single-cell data (SC3 and Tscan), and two general clustering methods (IDEC and K-means). We test these tools on seven single-batch CITE-seq datasets and two multi-batch CITE-seq datasets. Among all the methods under comparison, only scMDC, Seurat, and TotalVI can correct batch effects before clustering. We hypothesize that scMDC can boost the clustering performance in all the real CITE-seq datasets. Figure 2 shows the performance (AMI, NMI, and ARI) of all the methods for different datasets. Overall, the multimodal methods have shown clear advantages over the single-modal methods. As shown in Fig. 2a, scMDC has demonstrated superior performance over competing methods across all three metrics for most single-batch datasets except the BMNC dataset, in which Seurat has comparable performance. For the two multi-batch datasets, scMDC outperforms all the competing methods (Fig. 2b); TotalVI and Seurat are inferior to scMDC but outperform the other competing methods, thanks to their capability of correcting batch effects. The differences between the performance of scMDC and the competing methods are summarized in Fig. 2c. A positive difference means higher performance of scMDC than the competing method. We find that scMDC has a steady advantage over all the competing methods in multiple datasets. We then rank all competing methods for each dataset based on their performance metrics. Figure 2d shows the averaged rank of each method for the nine datasets. We can see that scMDC consistently ranks number 1 in all datasets for all three metrics. 
In contrast, the second-best methods, Seurat for AMI and NMI and Specter for ARI, have an averaged rank of 3. Using one-sided paired t-tests on the clustering metrics (AMI, NMI, and ARI), we confirm that the improvements of scMDC over competing methods are all significant (Supplementary Table 1). In summary, our results on multiple real datasets reveal that scMDC has stable and robust clustering performance on the CITE-seq datasets.
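Of the three metrics reported above, the adjusted Rand index (ARI) has a particularly compact definition based on counting pairs of cells. The sketch below follows the standard pair-counting formula and is meant only as an illustration of what the metric measures. It is not the evaluation code used in this study, and the function name is our own.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of cells that share a cluster in both partitions.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_both = sum(comb(c, 2) for c in contingency.values())
    # Pairs that share a cluster within each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_both - expected) / (maximum - expected)
```

The index equals 1 for identical partitions regardless of how clusters are numbered, is close to 0 for random assignments, and can be negative when agreement is worse than expected by chance.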
Real SMAGE-seq data evaluation
We then test the clustering performance of scMDC on SMAGE-seq data. Here we compare scMDC with four competing methods: Cobolt, scMM, SeuratV4, and K-means + PCA. Cobolt and scMM are designed for multi-omics data embedding learning. SeuratV4 was developed for CITE-seq data, but here we apply the WNN algorithm to the SMAGE-seq data. We test these methods on three real SMAGE-seq datasets from 10X Genomics, including two PBMC datasets and one embryonic mouse brain dataset. We also conduct a multi-batch experiment by combining the two PBMC datasets (denoted as PBMC13K). For scATAC-seq data, we use a cell-to-gene matrix as input for scMDC, scMM, Seurat, and K-means. This matrix is built by mapping ATAC reads onto the gene regions (see Method for details). Cobolt uses the peak count matrix as the input. Figure 3 shows the clustering performance of scMDC and the competing methods in single-batch datasets (a) and multi-batch datasets (b). We find that scMDC has superior performance in both single-batch and multi-batch datasets on all the metrics (AMI, NMI, and ARI). Cobolt is the second-best method in the tests and has performance comparable to scMDC on the E18 dataset in NMI, but its performance is inferior to that of scMDC in the other datasets. Figure 3c summarizes the differences in clustering performance between scMDC and the competing methods. We find that the median differences are around 0.1 in AMI and NMI, and around 0.3 in ARI for all the competing methods, which illustrates the superiority of scMDC. We then rank all competing methods for each dataset based on their performance metrics. Figure 3d shows the averaged rank of each method for the four datasets. We can see that scMDC ranks best in all three metrics, while Cobolt is the second-best for AMI and NMI, and Seurat is the second-best for ARI. Using one-sided paired t-tests on the raw performance metrics, we confirm that the improvements of scMDC over competing methods are all significant (Supplementary Table 2).
Taking the results from the CITE-seq and SMAGE-seq experiments together, we conclude that scMDC is a general and promising clustering model for various single-cell multimodal data.
Simulation experiments
To test the robustness of scMDC under different scenarios, we conduct two simulation experiments with various clustering signals and dropout rates. We generate all the simulation datasets using the SymSim package (v0.0.0.9) in R. Figure 4a–c show the performance of scMDC and the competing methods on the simulated CITE-seq data with low, medium, and high clustering signals, respectively. scMDC has demonstrated superior performance across all levels of clustering signals, especially in terms of AMI and NMI. TotalVI has performance comparable to scMDC in ARI, but it is outperformed by scMDC in the other metrics. Moreover, when the clustering signal is low, scMDC shows a greater advantage over other methods, revealing its capability to handle datasets with low signal-to-noise ratios. Figure 4d–f show the clustering results of all the methods with low, medium, and high dropout rates, respectively. We can see that scMDC yields the best performance under various dropout rates, followed by TotalVI. We also observe that the higher the dropout rate, the larger the improvement scMDC brings in comparison with its competing methods. Such a result is compelling because most real single-cell datasets exhibit high dropout rates. The robust performance under high dropout rates makes scMDC a superior clustering method. This result also supports our statement that scMDC is a better tool than the competing methods for clustering datasets with low signal-to-noise ratios. For multi-batch data, we compare scMDC with TotalVI and Seurat, the only two competing methods that can correct batch effects. A medium dropout rate and clustering signal are used for simulating the multi-batch dataset. scMDC outperforms the two competing methods in all three metrics (Fig. 4g). The differences between the performance of scMDC and each competing method are summarized in Fig. 4h. 
Although the distribution of differences varies across methods, all the medians of differences are greater than zero, indicating a consistent superiority of scMDC over all the competing methods. Similarly, we rank all methods in the analyses of these simulated datasets. scMDC and TotalVI consistently rank No. 1 and No. 2, respectively (Fig. 4i). As in the real datasets, multi-omics methods have better overall performance than single-source methods. Using one-sided paired t-tests on the three raw performance metrics, we confirm that the improvements of scMDC over competing methods are all significant (Supplementary Table 3). These simulation results demonstrate that scMDC has robust clustering performance under various scenarios.
Latent representations of real data
Figure 5 shows the t-SNE plots of the embedding of scMDC (a) and four competing methods, IDEC (b), SCVIS (c), TotalVI (d), and Seurat (e), on the BMNC dataset. We also show the expression patterns of three marker genes in the t-SNE plots. They are LYZ (the first column) for CD14 monocyte cells, CD8A (the second column) for CD8 cells, and NKG7 (the third column) for NK cells. True labels (cell types) are shown in the fourth column. We find that scMDC can divide most cell types in the latent space. In contrast, SCVIS, TotalVI, and Seurat fail to separate many cell types, including some large cell types, such as CD14 monocytes and CD4 memory cells, which are connected or mixed with other cell types in the latent spaces. IDEC divides large cell types into many small clusters, many of which are mixed with other cell types. We note that scMDC fails to divide some sub-cell types, such as CD8 effector 1, CD8 effector 2, CD8 memory 1, and CD8 memory 2, in the latent space. This problem is also observed in the t-SNE plots of the other methods. In the latent space of scMDC, the marker genes are only expressed in some isolated clusters. However, in the latent spaces of the other methods, the marker genes are expressed either in multiple clusters or in a part of a large cluster. These are all unsatisfactory expression patterns. Similar results are observed in the expression patterns of ADT markers (Supplementary Fig. 1). We then build t-SNE plots of the embeddings of a multi-batch dataset, SLN111, with two batches of data (Fig. 6). This dataset contains 28 cell types including some large ones (>1000 cells, such as CD4 and CD8 T cells) and tiny ones (<100 cells, such as erythrocytes and plasmacytoid dendritic cells). An ideal model should be capable of 1) dividing different cell types in the latent space, and 2) removing the batch effect and mixing the cells from different batches in the latent space. 
In other words, biological variations should be captured while technical variations are omitted during the embedding learning. Figure 6 shows the latent representations of scMDC (a) and four competing methods including IDEC (b), SCVIS (c), TotalVI (d), and Seurat (e). We find that scMDC can separate most cell types in the latent space. In addition, it mixes the cells from the two batches in most clusters. IDEC can separate the large cell types but fails to divide many small cell types. SCVIS, TotalVI, and Seurat show inferior performance in dividing different cell types in the latent space. Like scMDC, TotalVI and Seurat also have satisfactory performance on batch effect correction. SCVIS and IDEC cannot address the batch effects, so the cells from the two batches are separated in the latent space. In summary, scMDC is the only method that has superior performance on both cell type partition and batch effect removal. Similar results can be found in the t-SNE plots of a multi-batch SMAGE-seq dataset (PBMC13K, Supplementary Fig. 2).
The advantages of using multimodal data
As described in the introduction, different omics data provide different and complementary information for cell clustering and cell typing. Therefore, using multi-omics data in clustering should achieve better performance than using single-source data. In this experiment, we conduct two tests. In the first test, we compare the performance of scMDC with three variant models: a submodel of scMDC with only mRNA input and reconstruction loss (named scMDC-RNA), a submodel of scMDC with only ADT/ATAC input and reconstruction loss (named scMDC-ADT/scMDC-ATAC), and a variant model with concatenated mRNA and ADT data as input but with only one reconstruction loss (named scMDC-Concat). Figure 7a, b shows the performance of scMDC and the three variant models on CITE-seq and SMAGE-seq data, respectively. We find that scMDC outperforms the variant models in all the datasets. For CITE-seq data, scMDC-ADT has the second-best performance in all datasets. This is consistent with our expectation because most ADTs are strong markers for identifying some cell types. On the other hand, scMDC-ATAC has inferior performance in two SMAGE-seq datasets. The differences between the performance of scMDC and each variant model are summarized in Fig. 7c. We find a stable advantage of scMDC over all the variant models. Using a one-sided paired t-test, we find that scMDC significantly outperforms most variant models for both CITE-seq and SMAGE-seq data (Supplementary Table 4). The only exception is the scMDC-ATAC model (P-value = 0.07), because of the low sample size of SMAGE-seq data (n = 4). Considering that the submodels of scMDC are not optimized for clustering scRNA-seq data, we then compare scMDC with scDeepCluster, a state-of-the-art tool for clustering scRNA-seq data. Note that scMDC uses multi-omics data as input (either mRNA + ADT or mRNA + ATAC), while scDeepCluster only uses mRNA-seq data as input. We find that scMDC outperforms scDeepCluster in all datasets (Fig. 7d, e), indicating that scMDC can integrate the information from multimodal data to boost clustering performance. We also build the t-SNE plots of the embeddings from scMDC and the three variant models (Supplementary Fig. 3). Consistent with our expectations in the introduction, scMDC-RNA correctly separates some tiny cell types but falsely combines some large cell types. In contrast, scMDC-ADT separates most large cell types but fails to detect some small cell types. scMDC-Concat exhibits performance similar to that of scMDC-RNA, which suggests a predominant role of mRNA data in the concatenated input. The t-SNE plots of SMAGE-seq data (PBMC13K) from scMDC and the three variant models are shown in Supplementary Fig. 4. scMDC also outperforms the variant models in cell type partition in the latent space. In addition, we compare the single-modal scMDC models (scMDC-RNA and scMDC-ADT/scMDC-ATAC) with other single-modal methods (Supplementary Figs. 5–12). We find that in most datasets, the single-modal scMDC models also have the best or close-to-best performance. Building on these single-modal models, the multimodal scMDC further boosts the clustering performance by integrating the information from two omics.
Downstream analysis
Based on the clustering results, we perform two popular downstream analyses, differential expression (DE) analysis and gene set enrichment analysis (GSEA). We employ the algorithm from ACE^{36}, which ranks genes based on the confidence with which they are assigned to a cluster. The DE analysis can be performed between two clusters or between one cluster and the rest of the clusters. Then, we calculate the log-fold change of each gene to get the direction of differential expression (namely up-regulation or down-regulation) based on the normalized mRNA counts. With gene ranks and directions, we perform GSEA to find the enriched pathways in a target cluster. Here, we show the results for the BMNC dataset (Fig. 8). We conduct DE analysis and GSEA for the four largest clusters in the BMNC data. All comparisons are performed between the target cluster and the rest of the clusters. Figure 8a shows the DE genes for CD14 monocytes, CD4 memory T cells, CD4 naive T cells, and CD8 naive T cells. We find many proven marker genes for each cell type. For example, LYZ, CST3, HLA-DRA, CD74, and CD14 have been shown to be highly expressed in monocytes^{37}. CD27 and CCR7 are marker genes for naive cells^{38}. They are in the top ranks in both the CD4 naive and CD8 naive clusters. IL7R and S100A4 have been demonstrated to be highly expressed in memory T cells^{39}. Figure 8b shows the GSEA results of the Hallmark pathways based on the DE analysis. Hierarchical clustering is performed on both pathways and cell clusters. We find that the two naive cell types are clustered together and have many common enriched pathways. The MYC targets are enriched in the CD4 naive, CD4 memory, and CD8 naive clusters. Their important functions in CD4 and CD8 T cells have been demonstrated by Marchingo et al.^{40}. The complement system has the highest enrichment score in CD14 monocytes. It is an essential pathway for the phagocytosis of mesenchymal stromal cells by monocytes^{41}. 
The hypoxia pathway is enriched in CD4 memory T cells. It has been widely shown that hypoxia has a significant influence on the metabolism and differentiation of memory CD4 T cells^{42,43,44}. IL2 signaling is also enriched in CD4 memory T cells. Its dynamic roles in CD4 T cells have been demonstrated in many previous studies^{45,46}. The enrichment plots of the significant Hallmark pathways are shown in Supplementary Figs. 13–16. These downstream analyses further support the correctness of the clustering results of scMDC.
Hyperparameter tuning and time complexity
scMDC has two key hyperparameters, φ (phi) and γ (gamma), that control the KL loss and the clustering loss, respectively. Figure 9a, b shows the clustering performance of scMDC on both CITE-seq and SMAGE-seq datasets with various φ and γ, respectively. We find that when φ is lower than 0.01 and γ is lower than 10, scMDC is insensitive to these parameters. When φ goes beyond 0.01 and γ goes beyond 10, the performance of scMDC drops dramatically. We note that the clustering loss has a clear contribution to the performance on most datasets (P < 0.05 from a one-sided paired t-test between γ = 0.1 and γ = 0.001). On the other hand, the KL loss contributes slightly to the performance on some CITE-seq datasets but boosts the performance on SMAGE-seq datasets, especially in ARI. The statistical tests of the hyperparameter tuning results are listed in Supplementary Table 5.
To test the running time of scMDC, we simulate datasets with cell numbers ranging from 1000 to 100,000. Figure 9c shows the running time of scMDC with ascending cell numbers. We find a linear relationship between the cell number and the running time of scMDC. When the cell number is ten thousand, scMDC needs only about 7 min to finish the clustering analysis. Even when the cell number is as large as one hundred thousand, scMDC takes just about 1 h to finish the clustering analysis. All results are obtained on an Nvidia Tesla P100 with 16 GB of memory.
Discussion
We have introduced scMDC, a multimodal deep learning method for clustering analysis of different single-cell multi-omics data. scMDC jointly models both mRNA and ADT/ATAC data by employing a multimodal autoencoder. Deep K-means clustering is conducted on the bottleneck of the autoencoder, and a KL loss is employed to facilitate the separation of distinct cell groups. scMDC is an end-to-end deep model, and all components are optimized simultaneously. Existing clustering methods for CITE-seq data either apply a shallow Bayesian model, such as BREMSC, or combine two distance-based graphs of mRNA and ADT, such as CiteFuse and Seurat, to leverage information from different data sources. These methods do not explicitly model dropout events and over-dispersion in mRNA and/or ADT count data. Our real-data results demonstrate that the multimodal deep learning approach can characterize the different sources of count data in CITE-seq and SMAGE-seq more effectively and efficiently.
The clustering results are essential for downstream analyses, such as differential expression and gene set enrichment analysis. We employ a deep learning-based differential expression algorithm^{36} to rank genes in a target cluster based on their confidence of being assigned to that cluster. Given the ranked list of genes, GSEA can be performed to profile cell types at a functional level. The advantages of this deep differential expression method over traditional methods, such as the Wilcoxon test and DESeq2^{47}, have been demonstrated by Lu et al.^{36}. With GPU acceleration, scMDC is very efficient for analyzing large multi-omics datasets. Taking all results together, we conclude that scMDC is a promising method for the analysis of single-cell multi-omics data.
Method
Count data preprocessing
The raw CITE-seq data are preprocessed and normalized with the Python package SCANPY^{48}. mRNA and ADT data are normalized separately but with the same method. Specifically, the genes and ADTs with no counts are filtered out. The counts of a cell are normalized by a size factor s_{i} (specifically, \({s}_{i}^{p}\) for ADT data and \({s}_{i}^{r}\) for mRNA data), which is calculated by dividing the library size of that cell by the median library size of all cells. In this way, all cells have the same library size and become comparable. Finally, the counts are log-transformed and scaled to have unit variance and zero mean. The processed count data of mRNA and ADT are used as the input of our denoising multimodal autoencoder model, while the raw count matrix is used to calculate the ZINB loss^{30,31}. Before processing the Single-cell Multiome ATAC + Gene Expression (SMAGE-seq) data, we map all the reads from scATAC-seq to the gene regions (see details below). Then we use the same methods to preprocess and normalize SMAGE-seq data as for CITE-seq data. The size factor \({s}_{i}^{a}\) for ATAC data is calculated in the same way.
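The normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the SCANPY calls used in the study: the function name `normalize_counts` and the toy matrix are hypothetical, and the log transform is assumed to be log(1 + x).

```python
import numpy as np

def normalize_counts(raw):
    """Size-factor normalize, log-transform, and z-score a cells x features count matrix."""
    raw = np.asarray(raw, dtype=float)
    # Drop features (genes or ADTs) that have no counts in any cell.
    raw = raw[:, raw.sum(axis=0) > 0]
    # Size factor s_i: library size of cell i divided by the median library size.
    library_size = raw.sum(axis=1)
    size_factors = library_size / np.median(library_size)
    scaled = raw / size_factors[:, None]
    # Assumed log(1 + x) transform, then zero mean and unit variance per feature.
    logged = np.log1p(scaled)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    z = (logged - logged.mean(axis=0)) / std
    return z, size_factors, raw

counts = np.array([[1, 0, 3, 0], [2, 0, 6, 0], [0, 0, 5, 4]])  # toy data
z, size_factors, filtered = normalize_counts(counts)
```

The function returns the filtered raw counts alongside the scaled matrix because the ZINB loss is computed on raw counts, while the autoencoder input uses the scaled values.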
Denoising hierarchical multimodal autoencoder
The autoencoder is a neural network that can learn nonlinear representations efficiently^{49}. There are various types of autoencoder models. The denoising autoencoder receives data corrupted with artificial noise and reconstructs the original data^{50}. It is widely used on noisy datasets to learn a robust latent representation. We use the denoising autoencoder for the mRNA, ADT, and ATAC data since they are very noisy. Let us denote the preprocessed counts of mRNA, ADT, and ATAC as X^{r}, X^{p}, and X^{a} and the corrupted mRNA, ADT, and ATAC data as \(\mathbf{X}_{c}^{r}\), \(\mathbf{X}_{c}^{p}\), and \(\mathbf{X}_{c}^{a}\), formally:
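A form of the corruption step that is consistent with the definitions that follow, assuming the noise is added element-wise to the preprocessed data, is:

\[
\mathbf{X}_{c}^{r}=\mathbf{X}^{r}+\sigma_{r}\,n_{r},\qquad
\mathbf{X}_{c}^{p}=\mathbf{X}^{p}+\sigma_{p}\,n_{p},\qquad
\mathbf{X}_{c}^{a}=\mathbf{X}^{a}+\sigma_{a}\,n_{a}
\]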
here n_{r}, n_{p}, and n_{a} are the artificial Gaussian noise (with mean = 0 and variance = 1) for mRNA, ADT, and ATAC data, respectively, and σ_{r}, σ_{p}, and σ_{a} control the weights of n_{r}, n_{p}, and n_{a}. We set σ_{r} and σ_{a} to 2.5 and σ_{p} to 1.5.
Next, ADT/ATAC and mRNA data are reduced to latent spaces by an autoencoder model. Our autoencoder model contains one encoder (E) for the concatenated data and two decoders (D), one for each omics layer. Both the encoder and decoders are multilayered fully connected neural networks. We denote encoder \(\mathbf{Z}=E_{\mathbf{w}}(\mathbf{X}_{c}^{r}\odot \mathbf{X}_{c}^{p})\) for the concatenated mRNA and ADT data, encoder \(\mathbf{Z}=E_{\mathbf{w}}(\mathbf{X}_{c}^{r}\odot \mathbf{X}_{c}^{a})\) for the concatenated mRNA and ATAC data, and decoder \(\mathbf{X}^{a\prime}=D_{\mathbf{w}_{a}^{\prime}}^{a}(\mathbf{Z}_{a})\) for ATAC data, decoder \(\mathbf{X}^{p\prime}=D_{\mathbf{w}_{p}^{\prime}}^{p}(\mathbf{Z}_{p})\) for ADT data, and decoder \(\mathbf{X}^{r\prime}=D_{\mathbf{w}_{r}^{\prime}}^{r}(\mathbf{Z}_{r})\) for mRNA data. X^{r′}, X^{p′}, and X^{a′} stand for the reconstructed data of mRNA, ADT, and ATAC. w and \(\mathbf{w}^{\prime}\) stand for the learnable weights of the encoder and the decoders, respectively. \(\odot\) indicates the concatenation of two matrices. The ELU activation function^{51} is used for all the hidden layers in the encoder and the decoders. Batch normalization is performed on the output of all the hidden layers. The reconstruction loss functions of our autoencoder model are:
\(\mathbf{X}_{c}^{con}\) stands for the concatenated data from either mRNA + ADT or mRNA + ATAC. For every omics layer, we employ the zero-inflated negative binomial (ZINB) model as the reconstruction loss function^{28}. Note that the raw count data are used in the ZINB models^{28,30,31}. Let \({X}_{{ij}}^{p}\) be the count for cell i and protein j in the raw count matrix of ADT, \({X}_{{ij}}^{a}\) be the count for cell i and gene j in the raw count matrix of ATAC, and \({X}_{{ij}}^{r}\) be the count for cell i and gene j in the raw count matrix of mRNA. The NB distributions are parameterized by mean values \({\mu }_{{ij}}^{p}\), \({\mu }_{{ij}}^{a}\), and \({\mu }_{{ij}}^{r}\), and dispersions \({\theta }_{{ij}}^{p}\), \({\theta }_{{ij}}^{a}\), and \({\theta }_{{ij}}^{r}\), for ADT, ATAC, and mRNA, respectively. Formally:
The ZINB distribution is parameterized by the negative binomial distribution of the count data and an additional coefficient (\({\pi }_{{ij}}^{p}\), \({\pi }_{{ij}}^{a}\), and \({\pi }_{{ij}}^{r}\)) for the probability of dropout events:
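For reference, the textbook NB and ZINB probability mass functions that match this parameterization are written out below for the mRNA modality; the ADT and ATAC cases follow by replacing the superscript r with p or a. These are the standard forms, stated here on the assumption that they are the ones intended:

\[
\mathrm{NB}\left(X_{ij}^{r}\mid \mu_{ij}^{r},\theta_{ij}^{r}\right)=
\frac{\Gamma\left(X_{ij}^{r}+\theta_{ij}^{r}\right)}{X_{ij}^{r}!\,\Gamma\left(\theta_{ij}^{r}\right)}
\left(\frac{\theta_{ij}^{r}}{\theta_{ij}^{r}+\mu_{ij}^{r}}\right)^{\theta_{ij}^{r}}
\left(\frac{\mu_{ij}^{r}}{\theta_{ij}^{r}+\mu_{ij}^{r}}\right)^{X_{ij}^{r}}
\]

\[
\mathrm{ZINB}\left(X_{ij}^{r}\mid \pi_{ij}^{r},\mu_{ij}^{r},\theta_{ij}^{r}\right)=
\pi_{ij}^{r}\,\delta_{0}\left(X_{ij}^{r}\right)+\left(1-\pi_{ij}^{r}\right)\mathrm{NB}\left(X_{ij}^{r}\mid \mu_{ij}^{r},\theta_{ij}^{r}\right)
\]

Here \(\delta_{0}\) is the point mass at zero, so \(\pi_{ij}^{r}\) only adds probability to zero counts.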
To estimate these parameters in the ZINB loss functions, we add three independent fully connected layers M, θ, and Π to the last hidden layer of each decoder. The layers are defined as
here M_{ADT}, θ_{ADT}, and Π_{ADT} are the matrices of estimated mean, dispersion, and dropout probability for the ZINB loss of ADT data, M_{ATAC}, θ_{ATAC}, and Π_{ATAC} are the matrices of estimated mean, dispersion, and dropout probability for the ZINB loss of ATAC data, and M_{RNA}, θ_{RNA}, and Π_{RNA} are the matrices of estimated mean, dispersion, and dropout probability for the ZINB loss of mRNA data. \(\mathbf{w}_{p(\mu)}\), \(\mathbf{w}_{p(\theta)}\), \(\mathbf{w}_{p(\pi)}\), \(\mathbf{w}_{a(\mu)}\), \(\mathbf{w}_{a(\theta)}\), \(\mathbf{w}_{a(\pi)}\), \(\mathbf{w}_{r(\mu)}\), \(\mathbf{w}_{r(\theta)}\), and \(\mathbf{w}_{r(\pi)}\) are the learnable weights. The size factors \({s}_{i}^{p}\), \({s}_{i}^{a}\), and \({s}_{i}^{r}\) for ADT, ATAC, and mRNA are calculated in the preprocessing step. The loss function of the ZINB-based autoencoder is defined as
for ADT, ATAC and mRNA data, respectively.
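The parameter layers and the ZINB loss can be illustrated with a small NumPy sketch. It is not the PyTorch implementation used in the study. The activation choices (exponential for the mean and dispersion, sigmoid for the dropout probability, mean scaled by the size factor) follow the usual convention of ZINB autoencoders and are assumptions here, as are the function names and the averaging over all entries.

```python
import math
import numpy as np

def zinb_heads(h, w_mu, w_theta, w_pi, size_factors):
    """Map last-hidden-layer activations h (cells x hidden) to ZINB parameters."""
    mean = size_factors[:, None] * np.exp(h @ w_mu)  # mean, scaled by the size factor
    theta = np.exp(h @ w_theta)                      # dispersion, kept positive
    pi = 1.0 / (1.0 + np.exp(-(h @ w_pi)))           # dropout probability in (0, 1)
    return mean, theta, pi

_lgamma = np.vectorize(math.lgamma)  # element-wise log-gamma from the standard library

def zinb_nll(x, mean, theta, pi, eps=1e-10):
    """Average negative log-likelihood of raw counts x under ZINB(pi, mean, theta)."""
    log_nb = (_lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
              + theta * np.log(theta / (theta + mean) + eps)
              + x * np.log(mean / (theta + mean) + eps))
    nb_zero = (theta / (theta + mean)) ** theta       # NB probability of a zero count
    log_zero = np.log(pi + (1.0 - pi) * nb_zero + eps)
    log_nonzero = np.log(1.0 - pi + eps) + log_nb
    return float(-np.mean(np.where(x == 0, log_zero, log_nonzero)))

# Toy check: theta = 1 and mean = 1 give a geometric NB with P(0) = 0.5, P(2) = 0.125.
x = np.array([[0.0, 2.0]])
loss = zinb_nll(x, np.ones((1, 2)), np.ones((1, 2)), np.full((1, 2), 0.5))
```

In this toy case the zero entry has probability 0.5 + 0.5 × 0.5 = 0.75 and the entry with count 2 has probability 0.5 × 0.125 = 0.0625, so the loss is the average of their negative logarithms.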
Conditional autoencoder
The conditional autoencoder (CAE) has been designed to integrate data from different batches^{25}. Starting from the traditional autoencoder model, we append a matrix B to the input of the encoder and of the decoders. B is the one-hot encoding of a batch vector b of the cells. If there are M batches in b and N cells, the dimension of B is N × M. The encoder thus becomes \(\mathbf{Z}=E_{\mathbf{w}}(\mathbf{X}_{c}^{con}\odot \mathbf{B})\) and the decoders become \(\mathbf{X}^{p\prime}=D_{\mathbf{w}_{p}^{\prime}}^{p}(\mathbf{Z}\odot \mathbf{B})\) for ADT, \(\mathbf{X}^{a\prime}=D_{\mathbf{w}_{a}^{\prime}}^{a}(\mathbf{Z}\odot \mathbf{B})\) for ATAC, and \(\mathbf{X}^{r\prime}=D_{\mathbf{w}_{r}^{\prime}}^{r}(\mathbf{Z}\odot \mathbf{B})\) for mRNA data.
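The construction of B and its concatenation with the encoder input can be sketched as follows. The helper name `one_hot_batches` and the toy data are hypothetical; only the shapes matter for the illustration.

```python
import numpy as np

def one_hot_batches(batch_labels):
    """Return the N x M one-hot matrix B for a length-N batch vector b."""
    levels, codes = np.unique(batch_labels, return_inverse=True)
    codes = np.ravel(codes)
    B = np.zeros((len(codes), len(levels)))
    B[np.arange(len(codes)), codes] = 1.0
    return B

b = np.array(["batch1", "batch2", "batch1", "batch2", "batch2"])  # toy batch vector
B = one_hot_batches(b)
X_con = np.random.default_rng(0).normal(size=(5, 8))  # toy concatenated (corrupted) input
encoder_input = np.concatenate([X_con, B], axis=1)    # what the encoder would receive
```

The same matrix B would be concatenated with the latent matrix Z before it enters each decoder.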
Model Architecture
Our model can be used for clustering CITE-seq data and SMAGE-seq data. For CITE-seq data, the layer sizes of the encoder are set to {256, 64, 32, 16}, those of the mRNA decoder to {16, 64, 256}, and those of the ADT decoder to {16, 20}. For SMAGE-seq data, the encoder is set to {256, 128, 64} and the decoders for both mRNA and ATAC data are set to {64, 128, 256}. The latent spaces for CITE-seq and SMAGE-seq data therefore have 16 and 64 dimensions, respectively. The overall architecture of the scMDC model is shown in Fig. 1.
KL divergence on the latent layer
In the clustering analysis, similar points should be grouped into the same cluster. Following the method described by Chen et al.^{34}, we employ a KL divergence loss function to enhance the association between similar cells and to prevent the cluster centroids from being squeezed together in the latent space. Following t-SNE^{52}, the t-distribution kernel function is used to describe the pairwise similarity between two cells i and i′ in the latent space of our autoencoder:
here \({q}_{{ii}}=0\). P is the target distribution in training, which strengthens the affinities between cells with high similarity and weakens those between cells with low similarity. P is defined as the square of Q, followed by normalization:
With the two similarity distributions, we construct the KL loss function as the Kullback-Leibler (KL) divergence between Q and the derived target distribution P:
which measures the distance between the two probability distributions. During the training process, P and Q are calculated per batch.
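The three steps above (t-distribution kernel, squared and renormalized target, KL divergence) can be sketched in NumPy for one batch of latent points. This is a minimal sketch under stated assumptions: the kernel uses one degree of freedom as in t-SNE, Q and P are each normalized over all pairs in the batch, and the divergence is taken as KL(P || Q). The exact normalization used in scMDC may differ, and the function name is hypothetical.

```python
import numpy as np

def kl_latent_loss(z, eps=1e-12):
    """Return (KL(P || Q), P, Q) for pairwise similarities of latent points z."""
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    kernel = 1.0 / (1.0 + sq_dist)   # Student t kernel with one degree of freedom
    np.fill_diagonal(kernel, 0.0)    # q_ii = 0
    q = kernel / kernel.sum()
    p = q ** 2
    p = p / p.sum()                  # assumed normalization of the squared Q
    mask = p > 0
    loss = float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
    return loss, p, q

z = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # toy latent batch
loss, p, q = kl_latent_loss(z)
```

Squaring Q before renormalizing gives relatively more mass to pairs that are already similar, which is why minimizing the divergence pulls similar cells closer together.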
Deep Kmeans clustering
We perform unsupervised clustering on the latent space of the autoencoder^{34}. Our multimodal autoencoder learns a nonlinear mapping for each cell i, which maps the two input matrices to a low-dimensional space Z. The clustering loss function is defined as
here V stands for the K clustering centroids and f calculates the Euclidean distance between a cell (in the latent space) and a centroid. τ is a hyperparameter. We set τ to 1 for CITE-seq data and 0.1 for SMAGE-seq data. A Gaussian kernel function is used to compute the weights, which smooths the gradient descent optimization process:
Then, to speed up the convergence, an inflation operation is applied on the weights:
here the hyperparameter α is set to 2.
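The weighting and inflation steps can be illustrated with the NumPy sketch below. It assumes the weights are a softmax of the negative distances divided by τ, that inflation raises each weight to the power α and renormalizes per cell, and that the loss is the weighted distance averaged over cells. These forms are assumptions made for illustration, and the function name `soft_kmeans` is hypothetical.

```python
import numpy as np

def soft_kmeans(z, centroids, tau=1.0, alpha=2.0):
    """Return (loss, weights) for latent points z (cells x dims) and K centroids."""
    # f: Euclidean distance between every cell and every centroid.
    f = np.sqrt(((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1))
    # Kernel weights, shifted per row for numerical stability before exponentiation.
    logits = -f / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    # Inflation: raise to the power alpha and renormalize, which sharpens the weights.
    w = w ** alpha
    w = w / w.sum(axis=1, keepdims=True)
    loss = float((w * f).sum(axis=1).mean())
    return loss, w

z = np.array([[0.0, 0.0], [10.0, 10.0], [1.0, 0.0]])  # toy latent points
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])      # toy centroids
loss, weights = soft_kmeans(z, centroids, tau=1.0, alpha=2.0)
```

A smaller τ makes the assignment closer to hard K-means, while a larger τ spreads each cell's weight over several centroids.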
The total loss of scMDC is defined as
for CITE-seq data, and
for SMAGE-seq data. w is the weight matrix of the encoder. \(\mathbf{w}_{a}^{\prime}\), \(\mathbf{w}_{p}^{\prime}\), and \(\mathbf{w}_{r}^{\prime}\) are the weights of the ATAC decoder, ADT decoder, and mRNA decoder, respectively. U is the set of centroids initialized by K-means. Here, γ and φ are the hyperparameters that control the weights of the clustering loss and the KL loss, respectively. The value of γ is set to 0.1 for all experiments. φ is set to 0.001 for CITE-seq data and 0.005 for SMAGE-seq data.
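Read together with the definitions above, the two objectives are plausibly of the following form, where \(L_{ZINB}^{r}\), \(L_{ZINB}^{p}\), and \(L_{ZINB}^{a}\) denote the ZINB reconstruction losses of mRNA, ADT, and ATAC data, \(L_{c}\) the clustering loss, and \(L_{kl}\) the KL loss. The symbols for the reconstruction losses are an assumption made here for illustration:

\[
L=L_{ZINB}^{r}+L_{ZINB}^{p}+\gamma L_{c}+\varphi L_{kl}\quad\text{(CITE-seq)},\qquad
L=L_{ZINB}^{r}+L_{ZINB}^{a}+\gamma L_{c}+\varphi L_{kl}\quad\text{(SMAGE-seq)}
\]

With γ = 0.1 and φ of at most 0.005, the reconstruction terms carry most of the weight, and the clustering and KL terms act as comparatively light regularizers on the latent space.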
Marker gene detection
We employ the approach proposed by Lu et al.^{36} (ACE) to find marker genes in each cluster against another cluster or against the rest of the clusters. Briefly, this algorithm finds the minimal perturbation of the genes that alters the group assignment from a source group (s) to the target group (t). The objective function for one-to-one comparison is:
here the tradeoff coefficient \(\lambda\) and the margin \(\alpha\) are set to 100 and 1, respectively. \(\mathbf{x}\in \mathbf{X}\) is the normalized data of a cell. \(\delta \in \mathbb{R}^{P}\) is the perturbation for altering the cluster assignment of cells. The L1 norm of \(\delta\) is used to encourage sparsity and nonredundancy. The objective function for one-to-rest comparison is:
This is equivalent to comparing a source cluster to the target cluster for which cell x has the highest confidence. The confidence from a cell x to a cluster c is defined as
here μ_{c} is the centroid of cluster c and β is set to 1. Besides the mRNA matrix, this algorithm can also be applied to the ADT and ATAC matrices.
The gene rank learned by ACE is then multiplied by a direction vector of genes to get the directed gene rank. The direction vector of genes is calculated from the log fold change between clusters by changing positive values to 1 and negative values to −1. Based on the directed gene rank, gene set enrichment analysis (GSEA) is performed with the packages fgsea (v1.19.4) and msigdbr (v7.4.1) in R.
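The directed gene rank described above amounts to attaching the sign of the log fold change to each gene's score. A minimal sketch follows, with hypothetical scores and fold changes; the function name is also an assumption.

```python
import numpy as np

def directed_gene_rank(scores, log_fold_change):
    """Multiply each gene's score by the sign (+1 or -1) of its log fold change."""
    direction = np.sign(np.asarray(log_fold_change, dtype=float))
    return np.asarray(scores, dtype=float) * direction

scores = [3.0, 1.0, 2.0]       # toy importance scores for three genes
lfc = [0.5, -2.0, -0.1]        # toy log fold changes between clusters
directed = directed_gene_rank(scores, lfc)
order = np.argsort(-directed)  # genes ordered from most positive to most negative
```

A signed ranking of this kind is what a preranked enrichment tool such as fgsea takes as input, so genes with large scores but opposite directions end up at opposite ends of the list.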
Model implementation
The model is implemented in Python 3 using PyTorch^{53}. Adam with the AMSGrad variant^{54,55} and an initial learning rate of 0.001 is used for the pretraining stage. The Adadelta optimizer^{56} with a learning rate of 1 and rho = 0.95 is used in the clustering stage. The batch size is set to 256. We pretrain the autoencoders for 400 epochs before entering the clustering stage. In the pretraining stage, we optimize only the reconstruction losses in the first 200 epochs. The KL loss (L_{kl}) on the bottleneck layer is then added to the training in the remaining 200 epochs. After pretraining, the users need to specify the number of clusters (K). At the beginning of the clustering stage, we initialize K centroids by running the K-means algorithm on the pretrained latent space. During the clustering stage, all loss functions, including the clustering loss (L_{c}), are optimized simultaneously, and the centroids are continuously updated by the learning process. The clustering stage is considered converged when fewer than 0.1% of the predicted labels change between consecutive epochs. All experiments of scMDC in this study are conducted on an Nvidia Tesla P100 (16 GB) GPU.
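The convergence criterion of the clustering stage can be written as a one-line check on the labels of two consecutive epochs. The function name and the toy labels below are hypothetical.

```python
import numpy as np

def has_converged(previous_labels, current_labels, tol=0.001):
    """True when the fraction of cells whose label changed is below tol (0.1%)."""
    changed = np.mean(np.asarray(previous_labels) != np.asarray(current_labels))
    return bool(changed < tol)

previous = np.zeros(2000, dtype=int)  # toy labels from the previous epoch
current = previous.copy()
current[0] = 1                        # one of 2000 labels changed (0.05%)
stop = has_converged(previous, current)
```

Comparing labels rather than loss values makes the stopping rule independent of the scale of the individual loss terms.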
Competing methods
BREMSC (v0.2.0, https://github.com/tarot0410/BREMSC)^{10}, CiteFuse (v1.0.0, https://github.com/SydneyBioX/CiteFuse)^{20}, Seurat (v4.0.4, https://github.com/satijalab/seurat)^{17}, IDEC (https://github.com/XifengGuo/IDEC)^{33}, K-means (sklearn v0.22.2, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), SC3 (v1.21.0, https://github.com/hemberg-lab/SC3)^{18}, SCVIS (v0.1.0, https://github.com/shahcompbio/scvis)^{57}, Tscan (v1.31.0, https://github.com/zji90/TSCAN)^{15}, TotalVI (scvi-tools v0.15.0, https://scvi-tools.org/), Cobolt (v1.0.0, https://github.com/epurdom/cobolt)^{26}, scMM (v1.0.0, https://github.com/kodaim1115/scMM)^{27}, and Specter (https://github.com/canzarlab/Specter)^{24} are used as competing methods. For the multimodal methods, ADT/ATAC and mRNA data are used as input, and the standard normalization is applied if the authors described one. For single data source methods, ADT/ATAC and mRNA matrices are preprocessed and normalized separately and then concatenated into a single input. For consistency, all the methods use the same highly variable genes in RNA and ATAC data and all ADTs in the CITE-seq data. If a method requires normalized data as input without defining a specific normalization, we apply the same normalization method as for scMDC (described above). Before K-means clustering, PCA is performed on the normalized mRNA data and the top 20 PCs are used for clustering. BREMSC uses the raw count matrix as input directly. The data normalization for CiteFuse follows the vignette (https://sydneybiox.github.io/CiteFuse/articles/CiteFuse.html). Specifically, mRNA counts are normalized by the function “logNormCounts” in the Scater package^{58} with default settings. ADT counts are normalized and log-transformed by the function “normaliseExprs” from the CiteFuse package. Seurat uses the raw count matrices as input.
Following the CITE-seq tutorial of Seurat, we use “LogNormalize” for mRNA and the “centered log-ratio transformation” for ADT data normalization. Then the function “ElbowPlot” is used to find the best PCs (principal components) for clustering. The resolution in the “FindClusters” function of Seurat is adjusted for each dataset so that the estimated number of clusters is close to the real K. For the single-omics and multi-omics clustering, the functions ‘FindNeighbors’ and ‘FindMultiModalNeighbors’^{23} are used to find the neighbors of cells by the SNN (shared nearest-neighbor) and WNN (weighted nearest-neighbor) algorithms, respectively. For IDEC and Tscan, normalized data are provided as the inputs. SC3 needs both the raw data and the normalized data as input. When the cell number is higher than 5000, SC3 runs an SVM to estimate the cell types of the extra cells in a supervised manner. SCVIS is a variational autoencoder-based model that aims to reduce the dimension of scRNA-seq data. According to the authors’ protocol^{57}, the count data are first processed as log_{2}(CPM/10 + 1), where ‘CPM’ means ‘counts per million’. Next, we concatenate the CPMs of mRNA and ADT. Then 100 PCs are extracted from the CPM matrix by PCA and used as the input for SCVIS analysis. K-means clustering is performed on the latent output of SCVIS. For TotalVI, we keep the default settings for all the datasets according to the official pipeline (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/totalVI.html). We then perform K-means clustering on the latent space from TotalVI since the number of clusters is assumed to be known. Specter^{24} uses the normalized RNA and ADT expression data as the input. We use the default settings for Specter’s multimodal analysis. For SMAGE-seq datasets, we compare our model to four competing methods: K-means + PCA, Seurat, scMM, and Cobolt. All the methods use the top 2000 highly variable genes of the mRNA and ATAC data from the SMAGE-seq data.
If a method needs normalized data as input, we apply the same normalization method as for scMDC. Before K-means, PCA is performed on both mRNA and ATAC data and the top 20 PCs of each are used for clustering. For Seurat, the ATAC data, which are mapped to the gene regions, are processed in the same way as the mRNA data. Then the WNN algorithm is used for integrating the multimodal data as described before. For Cobolt, we follow the official pipeline (https://github.com/epurdom/cobolt/blob/master/docs/tutorial.ipynb) to produce the data embeddings. We then perform K-means clustering on the latent space of the datasets since the number of clusters is assumed to be known. For scMM (v1.0.0)^{27}, we follow the tutorial provided by the authors and use the default parameters. The embeddings of scMM are obtained and used for K-means clustering.
Evaluation metrics
Adjusted Rand Index (ARI)^{59}, Adjusted Mutual Information (AMI)^{60}, and Normalized Mutual Information (NMI)^{61} are used as metrics to evaluate the clustering performance.
The Adjusted Rand Index measures the agreement between two partitions C and G. Let a be the number of pairs of objects that are in the same group in both C and G; b the number of pairs that are in different groups in both C and G; c the number of pairs that are in the same group in C but in different groups in G; and d the number of pairs that are in different groups in C but in the same group in G. The ARI is defined as
Let C = {C1, C2, …, C_{tc}} and G = {G1, G2, …, G_{tg}} be the predicted and ground truth labels on a dataset with n cells. NMI is defined as
here I(C,G) represents the mutual information between C and G and is defined as
And H(C) and H(G) are the entropies:
Similarly, AMI is defined as
The extra component \(E\{I\left({{{{{\bf{C}}}}}},{{{{{\bf{G}}}}}}\right)\}\) is the expected mutual information between two random clusters^{60}.
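For small examples, the pair counts a, b, c, and d and the information-theoretic quantities can be computed directly with the Python standard library. The sketch below uses the standard pair-counting form of the ARI, 2(ab − cd) / ((a + d)(d + b) + (a + c)(c + b)), and normalizes the mutual information by max{H(C), H(G)}. The choice of normalizer is an assumption here, since several conventions exist, and the function names are hypothetical. AMI is omitted because the expected mutual information is more involved.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(c, g):
    """Count pairs: same/same (a), different/different (b), same in C only (cc), same in G only (d)."""
    a = b = cc = d = 0
    for i, j in combinations(range(len(c)), 2):
        same_c, same_g = c[i] == c[j], g[i] == g[j]
        if same_c and same_g:
            a += 1
        elif not same_c and not same_g:
            b += 1
        elif same_c:
            cc += 1
        else:
            d += 1
    return a, b, cc, d

def adjusted_rand_index(c, g):
    a, b, cc, d = pair_counts(c, g)
    denom = (a + d) * (d + b) + (a + cc) * (cc + b)
    return 1.0 if denom == 0 else 2.0 * (a * b - cc * d) / denom

def entropy(labels):
    n = len(labels)
    return -sum(k / n * math.log(k / n) for k in Counter(labels).values())

def mutual_information(c, g):
    n = len(c)
    pc, pg, joint = Counter(c), Counter(g), Counter(zip(c, g))
    return sum(k / n * math.log(n * k / (pc[u] * pg[v])) for (u, v), k in joint.items())

def normalized_mutual_information(c, g):
    h = max(entropy(c), entropy(g))  # assumed normalizer
    return 1.0 if h == 0 else mutual_information(c, g) / h

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same partition, relabeled
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The quadratic loop over pairs is only suitable for toy inputs; for real datasets a contingency-table implementation such as the one in scikit-learn is the practical choice.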
To illustrate the superiority of scMDC over the competing methods across multiple datasets, we rank the methods based on their clustering performance (AMI, NMI, and ARI) on each dataset. The lower the rank, the better the performance. In addition, a one-sided paired t-test, implemented by the “t.test()” function in R, is conducted to test whether the clustering metrics (NMI, AMI, and ARI) of scMDC are significantly higher than those of the competing methods. A nominal p-value < 0.05 is considered to indicate a significant difference.
Public real datasets
The real CITE-seq datasets used in this study are summarized in Supplementary Table 6. The GSE100866 dataset is downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100866). The cells in this dataset are cord blood mononuclear (CBMN) cells and were annotated by Wang et al. from marker genes and ADTs^{10}. Cells with ‘Unknown’ cell types are filtered out. The bone marrow mononuclear cells (BMNC, GSE128639) and the cell type labels are downloaded from the “bmcite” dataset in the “SeuratData” package (v0.2.1). The mouse spleen lymph node datasets (SLN208 and SLN111, GSE150599) and the cell type labels are provided by TotalVI^{25} on GitHub (https://github.com/YosefLab/totalVI_reproducibility). The cells in these datasets were already filtered by the authors. The PBMC dataset is available on the 10X website (https://support.10xgenomics.com/single-cell-gene-expression/datasets). We download the preprocessed data and the cell type labels from the GitHub repository of Specter (https://github.com/canzarlab/Specter)^{24}.
The real Single-cell Multiome ATAC + Gene Expression (SMAGE-seq) datasets used in this study are summarized in Supplementary Table 7. All the SMAGE-seq datasets are downloaded from the 10X Genomics website (https://www.10xgenomics.com/resources/datasets). The first and second datasets are from human peripheral blood mononuclear cells (PBMCs) with about 3k and 10k cells. We denote them as PBMC3K and PBMC10K, respectively. The third dataset is from the E18 mouse brain. We denote it as E18. For each dataset, mRNA counts are downloaded directly while the ATAC gene counts are generated by us. Specifically, after filtering the reads by ATAC peak region fragments, nucleosome signal, and TSS enrichment, we map each read to a gene region with the function ‘GeneActivity’ in Signac (v1.4.0)^{62}. All steps follow the official pipeline from the Satija lab. Then, the PBMC cells are annotated by the label transfer method in Seurat V3^{62} with the reference dataset “pbmc_10k_v3.rds” (https://www.dropbox.com/s/zn6khirjafoyyxl/pbmc_10k_v3.rds?dl=0) provided by the Satija lab. For the E18 dataset, we transfer the labels from another mouse brain dataset (GSE126074, P0 mouse brain cortex), whose cell type labels are provided by the authors of the SNARE-seq paper^{8}.
Simulation
The simulated data are generated by the R package SymSim (0.0.0.9000)^{63}. The overall simulation setting follows the online vignette of SymSim (https://github.com/YosefLab/SymSim/blob/master/vignettes/SymSimTutorial.Rmd). This setting was estimated from the Zeisel 2015 dataset^{64}. We lower the parameter “n_de_evf” to 5 to keep about 50% differentially expressed genes/ADTs in the dataset. We perform three experiments to test the clustering performance of scMDC and generate 10 datasets in each experiment. In the first experiment, we adjust the parameter “Sigma” in the function SimulateTrueCounts() to 0.6, 0.7, and 0.8 in mRNA and 0.3, 0.4, and 0.5 in ADT to simulate high, medium, and low clustering signal among clusters (cell types). We give lower sigma values (higher signal) to ADT data than to mRNA data since ADT data have higher signal-to-noise ratios in the real datasets^{10}. In the second experiment, we adjust the parameter “alpha_mean” in the function True2ObservedCounts() to 0.001, 0.00075, and 0.0005 in mRNA and 0.05, 0.045, and 0.04 in ADT data to simulate low, medium, and high dropout rates. These settings are also consistent with the observations in the real datasets since mRNA data have higher dropout rates than ADT data, as described in the introduction. In the third experiment, we add a batch effect to the data to test the model’s performance in batch effect correction. Medium signal and dropout rate are used for these data, and the parameter “batch_effect_size” in the function DivideBatches() is set to 1. All the simulated datasets have 8 groups, 1000 cells, 2000 genes, and 30 ADTs.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The GSE100866 data used in this study are available in the GEO database under accession code GSE100866. Cell type labels are downloaded from the GitHub repository of BREMSC (https://github.com/tarot0410/BREMSC). The BMNC dataset and the cell type labels are downloaded from the “bmcite” dataset in the “SeuratData” package (https://github.com/satijalab/seurat-data). The mouse spleen lymph node datasets (SLN208 and SLN111) and the cell type labels are provided by TotalVI^{25} on GitHub (https://github.com/YosefLab/totalVI_reproducibility). These datasets are sequenced in two batches. The PBMC dataset is available on the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets) and the cell type labels are downloaded from the GitHub repository of Specter (https://github.com/canzarlab/Specter). All SMAGE-seq datasets (PBMC3K, PBMC10K, and mouse brain E18) are downloaded from the 10X Genomics website (https://www.10xgenomics.com/resources/datasets). Labels are transferred by Signac (v1.4.0) from the annotated datasets. Source data are provided with this paper.
Code availability
The code supporting this study is available on GitHub: https://github.com/xianglin226/scMDC/releases/tag/v1.0.0.
References
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Buenrostro, J. D. et al. Singlecell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex singlecell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Ma, A., McDermaid, A., Xu, J., Chang, Y. & Ma, Q. Integrative methods and practical challenges for singlecell multiomics. Trends Biotechnol. 38, 1007–1022 (2020).
Chen, S., Lake, B. B. & Zhang, K. Highthroughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Ma, S. et al. Chromatin potential identified by shared singlecell profiling of RNA and chromatin. Cell 183, 1103–1116. e1120 (2020).
Wang, X. et al. BREMSC: a bayesian random effects mixture model for joint clustering single cell multiomics data. Nucleic Acids Res. 48, 5814–5824 (2020).
Haider, S. & Pal, R. Integrated analysis of transcriptomic and proteomic data. Curr. Genomics 14, 91–110 (2013).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of singlecell RNAseq data. Nat. Rev. Genet. 20, 273–282 (2019).
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C. & Teichmann, S. A. The technology and biology of singlecell RNA sequencing. Mol. Cell 58, 610–620 (2015).
Shapiro, E., Biezuner, T. & Linnarsson, S. Singlecell sequencingbased technologies will revolutionize wholeorganism science. Nat. Rev. Genet. 14, 618–630 (2013).
Ji, Z. & Ji, H. TSCAN: Pseudotime reconstruction and evaluation in singlecell RNAseq analysis. Nucleic Acids Res. 44, e117–e117 (2016).
Blondel, V. D., Guillaume, J.L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Kiselev, V. Y. et al. SC3: consensus clustering of singlecell RNAseq data. Nat. Methods 14, 483–486 (2017).
Tian, T., Zhang, J., Lin, X., Wei, Z. & Hakonarson, H. Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data. Nat. Commun. 12, https://doi.org/10.1038/s41467-021-22008-3 (2021).
Kim, H. J., Lin, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. CiteFuse enables multimodal analysis of CITEseq data. Bioinformatics 36, 4137–4143 (2020).
Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333 (2014).
Ng, A., Jordan, M. & Weiss, Y. On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 14, 849–856 (2001).
Hao, Y. et al. Integrated analysis of multimodal singlecell data. bioRxiv https://doi.org/10.1101/2020.10.12.335331 (2020).
Ringeling, F. R. & Canzar, S. Lineartime cluster ensembles of largescale singlecell RNAseq and multimodal data. Genome Res. 31, 677–688 (2021).
Gayoso, A. et al. Joint probabilistic modeling of singlecell multiomic data with totalVI. Nat. Methods 18, 272–282 (2021).
Gong, B., Zhou, Y. & Purdom, E. Cobolt: Joint analysis of multimodal singlecell sequencing data. bioRxiv https://doi.org/10.1101/2021.04.03.438329 (2021).
Minoura, K., Abe, K., Nam, H., Nishikawa, H. & Shimamura, T. A mixtureofexperts deep generative model for integrated analysis of singlecell multiomics data. Cell Rep. Methods 1, 100071 (2021).
Tian, T., Wan, J., Song, Q. & Wei, Z. Clustering singlecell RNAseq data with a modelbased deep learning approach. Nat. Mach. Intell. 1, 191–198 (2019).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.P. A general and flexible method for signal extraction from singlecell RNAseq data. Nat. Commun. 9, 1–17 (2018).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Singlecell RNAseq denoising using a deep count autoencoder. Nat. Commun. 10, 1–14 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Simidjievski, N. et al. Variational autoencoders for cancer data integration: design principles and computational practice. Front. Genet. 10, 1205 (2019).
Xie, J., Girshick, R. & Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In: (eds Balcan, M. F. & Weinberger, K. Q.) Proceedings of Machine Learning Research Vol. 48, 478–487 (PMLR, 2016).
Chen, L., Wang, W., Zhai, Y. & Deng, M. Deep soft Kmeans clustering with selftraining for singlecell RNA sequence data. NAR Genomics Bioinform. 2, lqaa039 (2020).
Lu, Y. Y., Timothy, C. Y., Bonora, G. & Noble, W. S. ACE: Explaining cluster from an adversarial perspective. In: (eds Meila, M. & Zhang, T.) International Conference on Machine Learning. 7156–7167 (PMLR).
Lu, Y. Y., Yu, T., Bonora, G. & Noble, W. S. ACE: explaining cluster from an adversarial perspective. bioRxiv https://doi.org/10.1101/2021.02.08.428881 (2021).
Schlachetzki, J. et al. A monocyte gene expression signature in the early clinical course of Parkinson’s disease. Sci. Rep. 8, 1–13 (2018).
Caccamo, N., Joosten, S. A., Ottenhoff, T. H. & Dieli, F. Atypical human effector/memory CD4+ T cells with a naivelike phenotype. Front. Immunol. 9, 2832 (2018).
Harding, S. D. et al. The IUPHAR/BPS Guide to PHARMACOLOGY in 2018: updates and expansion to encompass the new guide to IMMUNOPHARMACOLOGY. Nucleic Acids Res. 46, D1091–D1106 (2018).
Marchingo, J. M., Sinclair, L. V., Howden, A. J. & Cantrell, D. A. Quantitative analysis of how Myc controls T cell proteomes and metabolic pathways during T cell activation. Elife 9, e53725 (2020).
Gavin, C. et al. The complement system is essential for the phagocytosis of mesenchymal stromal cells by monocytes. Front. Immunol. 10, 2249 (2019).
Cho, S. H. et al. Hypoxia-inducible factors in CD4+ T cells promote metabolism, switch cytokine secretion, and T cell help in humoral immunity. Proc. Natl Acad. Sci. 116, 8975–8984 (2019).
Dimeloe, S. et al. The immune-metabolic basis of effector memory CD4+ T cell function under hypoxic conditions. J. Immunol. 196, 106–114 (2016).
Hasan, F., Chiu, Y., Shaw, R. M., Wang, J. & Yee, C. Hypoxia acts as an environmental cue for the human tissue-resident memory T cell differentiation program. JCI Insight 6, e138970 (2021).
Jones, D. M., Read, K. A. & Oestreich, K. J. Dynamic roles for IL-2–STAT5 signaling in effector and regulatory CD4+ T cell populations. J. Immunol. 205, 1721–1730 (2020).
Ross, S. H. & Cantrell, D. A. Signaling and function of interleukin-2 in T lymphocytes. Annu. Rev. Immunol. 36, 411 (2018).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 1–21 (2014).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. In: Proc. 25th International Conference on Machine Learning 1096–1103 (Association for Computing Machinery, 2008).
Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). Preprint at https://arxiv.org/abs/1511.07289 (2015).
Maaten, L. V. D. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Paszke, A. et al. Automatic differentiation in PyTorch. In: NIPS 2017 Workshop on Autodiff. https://openreview.net/forum?id=BJJsrmfCZ (2017).
Reddi, S. J., Kale, S. & Kumar, S. On the convergence of Adam and beyond. Preprint at https://arxiv.org/abs/1904.09237 (2019).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Zeiler, M. D. ADADELTA: an adaptive learning rate method. Preprint at https://arxiv.org/abs/1212.5701 (2012).
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 1–13 (2018).
McCarthy, D. J., Campbell, K. R., Lun, A. T. & Wills, Q. F. Scater: preprocessing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Hubert, L. & Arabie, P. Comparing partitions. J. Classification 2, 193–218 (1985).
Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
Strehl, A. & Ghosh, J. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2003).
Stuart, T., Srivastava, A., Lareau, C. & Satija, R. Multimodal single-cell chromatin analysis with Signac. bioRxiv https://doi.org/10.1101/2020.11.09.373613 (2020).
Zhang, X., Xu, C. & Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 10, 1–16 (2019).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
Acknowledgements
This work was supported by grant R15HG012087 (Z.W.) from the National Institutes of Health (NIH) and partially supported by the National Center for Advancing Translational Sciences (NCATS), a component of NIH, under award number UL1TR003017 (Z.W.). Computing resources were partially provided by the Extreme Science and Engineering Discovery Environment (XSEDE) through allocations CIE160021 and CIE170034 (supported by National Science Foundation Grant No. ACI-1548562).
Author information
Contributions
Z.W. and H.H. conceived and supervised the project. X.L. and T.T. designed the method, conducted the experiments, and wrote the manuscript. Z.W. and H.H. revised the manuscript. All authors contributed to and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lin, X., Tian, T., Wei, Z. et al. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat Commun 13, 7705 (2022). https://doi.org/10.1038/s41467-022-35031-9