Abstract
Singlecell ATACseq (scATACseq) profiles the chromatin accessibility landscape at single cell level, thus revealing celltocell variability in gene regulation. However, the high dimensionality and sparsity of scATACseq data often complicate the analysis. Here, we introduce a method for analyzing scATACseq data, called SingleCell ATACseq analysis via Latent feature Extraction (SCALE). SCALE combines a deep generative framework and a probabilistic Gaussian Mixture Model to learn latent features that accurately characterize scATACseq data. We validate SCALE on datasets generated on different platforms with different protocols, and having different overall data qualities. SCALE substantially outperforms the other tools in all aspects of scATACseq data analysis, including visualization, clustering, and denoising and imputation. Importantly, SCALE also generates interpretable features that directly link to cell populations, and can potentially reveal batch effects in scATACseq experiments.
Introduction
Accessible regions within chromatin often contain important genomic elements for transcription factor binding and gene regulation^{1}. Assay for TransposaseAccessible Chromatin using sequencing (ATACseq) is an efficient method to probe genomewide open chromatin sites, using the Tn5 transposase to tag them with sequencing adapters^{2}. In particular, singlecell ATACseq (scATACseq) reveals chromatinaccessibility variations at the singlecell level, and can be used to uncover the mechanisms regulating celltocell heterogeneity^{3,4}. However, in an scATACseq experiment, each open chromatin site of a diploidgenome single cell only has one or two opportunities to be captured. Normally, only a few thousand distinct reads (versus many thousands of possible open positions) are obtained per cell, thus resulting in many bona fide open chromatin sites of the cell that lack sequencing data signals (i.e., peaks). The analysis of scATACseq data hence suffers from the curse of “missingness” in addition to high dimensionality^{3}.
Many computational approaches have been designed to tackle highdimensional and sparse genomic sequencing data, especially singlecell RNAseq (scRNASeq) data. Dimensionality reduction techniques such as principal component analysis (PCA)^{5} and tdistributed stochastic neighbor embedding (tSNE)^{6} are frequently employed to map raw data into a lower dimensional space, which is particularly useful for visual inspecting the distribution of input data. Clustering based on the full expression spectrum or extracted features can be performed to identify specific cell types and states, as well as gene sets that share common biological functions^{7,8,9,10}. The imputation of missing expression values is also often carried out to mitigate the loss of information caused by dropouts in scRNAseq data^{11,12}.
Direct applications of these scRNAseq analysis methods to scATACseq data, however, may not be suitable due to the closetobinary nature and increased sparsity of the data (Supplementary Fig. 1). A recent method specifically developed for scATACseq data analysis, chromVAR^{13}, evaluates groups of peaks that share the same motifs or functional annotations together. Another method, scABC, weighs cells by sequencing depth and applies weighted Kmedoid clustering to reduce the impact of missing values^{14}. To refine the clustering, it then calculates a landmark for each cluster and assigns cells to the closest landmarks based on the Spearman correlation. However, each method suffers notable caveats: chromVAR only analyzes peaks in groups and lacks the resolution of individual peaks, whereas scABC heavily depends on landmark samples with high sequencing depths, and the Spearman rank can be illdefined for data with many missing values (in particular for scATACseq data). Recently a newly developed method called cisTopic applied latent Dirichlet allocation to model on scATACseq data to identify cisregulatory topics and simultaneously cluster cells and accessible regions based on the celltopic and regiontopic distributions^{15}.
Expressive deep generative models have emerged as a powerful framework to model the distribution of highdimensional data. One of the most popular of such methods, the variational autoencoder (VAE), estimates the data distribution and learns a latent distribution from the observed data through a recognition model (encoder) and a generative model (decoder)^{16}. It does this by maximizing the similarity of the calibrated data (output by the decoder) with the input data and minimizing the KullbackLeibler divergence of the approximate from the true posteriors^{16}. VAE could be applied to data embedding and clustering based on the lowdimensional latent representation of the input highdimensional data^{17}. Recently, a method called scVI adapted VAE for scRNAseq data analysis^{18}. However, the standard VAE employed by scVI uses a single isotropic multivariable Gaussian distribution over the latent variables and often underfits sparse data^{19}. A tighter estimation of the posterior distribution could greatly improve the power of VAE in fitting and analyzing sparse data^{19}. Applying Gaussian Mixture Model (GMM) as the prior over the latent variables has been used in unsupervised clustering and to generate highly realistic samples by learning more disentangled and interpretable latent representations^{20,21,22}.
Here, we introduce SCALE (SingleCell ATACseq analysis via Latent feature Extraction), a method that combines the VAE framework with the Gaussian Mixture Model (GMM, a probabilistic model to estimate observed data with a mixture of Gaussian distributions). We validated the effectiveness of SCALE in extracting latent features that characterize the distributions of input scATACseq data on multiple different datasets generated on different platforms with different protocols, and of different overall data qualities. We then used the latent features to cluster cell mixtures into subpopulations, and to denoise and impute missing values in scATACseq data. We compared the performance of SCALE with other widelyused dimensionality reduction techniques, as well as with the stateofart scRNAseq and scATACseq analysis tools. We found that SCALE substantially outperforms the other tools in all aspects of scATACseq data analysis. It is even comparable to sophisticated experimental technologies with additional steps (e.g., PiATAC^{23}, which uses protein labeling as an aid in defining cell identifies) in correctly revealing cell types and their specific regulatory motifs in a tumor sample.
Results
The SCALE model and validation datasets
SCALE combines the variational autoencoder (VAE) and the Gaussian Mixture Model (GMM) to model the distribution of highdimensional sparse scATACseq data (Fig. 1). SCALE models the input scATACseq data x as a joint distribution \(p\left( {{\mathbf{x}},{\mathbf{z}},c} \right)\) where c is one of predefined K clusters corresponding to a component of GMM, z is the latent variable obtained by \({\mathbf{z}} = {\boldsymbol{\mu }}_z + {\boldsymbol{\sigma }}_z \ast {\boldsymbol{\epsilon}}\), where μ_{z} and σ_{z} are learned by the encoder network from x, and ε is sampled from \({\Bbb N}(0,{\mathbf{I}})\)^{16}. Since z is conditioned on c, \(p\left( {{\mathbf{x}},{\mathbf{z}},c} \right)\) can be written as \(p\left( {{\mathbf{x}}{\mathrm{}}{\mathbf{z}}} \right)p\left( {{\mathbf{z}}{\mathrm{}}c} \right)p(c)\) where p(c) is a discrete distribution of K predefined clusters, p(zc) follows a mixture of Gaussians distribution with a mean μ_{c} and a variance σ_{c} for each component corresponding to a cluster c, and p(xz) is a multivariable Bernoulli distribution modeled by the decoder network (Fig. 1). In the SCALE framework, each cell x^{i} is first transformed into a ddimensional vector of latent features z^{i} on the GMM manifold by an encoder network and then reconstructed back through a decoder network with the original dimensionality to represent the chromatin openness at each peak in each cell. The latent features that capture the characteristics of scATACseq data are then visualized in the lowdimensional space with tSNE, and used to cluster single cells with various clustering methods, e.g., Kmeans.
We constrained the hyperparameters of SCALE on the Leukemia scATACseq dataset and found SCALE is insensitive to the encoder structure and the dimension of latent features (Supplementary Table 1). The SCALE model with default parameters can be accessed in the Online Method. We then tested the model by using the GM12878/HEK293T, the GM12878/HL60, and the InSilico datasets^{3,4,24}, and two other recently published Splenocyte and Forebrain datasets^{25,26}. The Leukemia dataset is derived from a mixture of monocytes (Mono) and lymphoidprimed multipotent progenitors (LMPP) isolated from a healthy human donor, and leukemia stem cells (SU070_LSC, SU353_LSC) and blast cells (SU070_Leuk, SU353_Blast) isolated from two patients with acute myeloid leukemia^{24}. The GM12878/HEK293T dataset and the GM12878/HL60 dataset are respective mixtures of two commonlyused cell lines^{22}. The InSilico dataset is an in silico mixture constructed by computationally combining six individual scATACseq experiments that were separately performed on a different cell line^{3,11}. Note that these four datasets were the same ones used to validate scABC^{14}. The more recent Splenocyte dataset^{25} is derived from a mixture of mouse splenocytes (after red blood cell removal) and the Forebrain dataset^{26} is derived from P56 mouse forebrain cells. The six datasets cover scATACseq data generated from both microfluidicsbased and cellular indexing platforms, and the distributions of the number of peaks in each single cell vary substantially in different datasets (Supplementary Fig. 1). However, they always have a high level of data sparsity compared to the aggregation of peaks from all single cells in each dataset (Supplementary Table. 2).
SCALE identifies cell types by clustering on latent features
We examined SCALE’s ability to uncover features that characterize scATACseq data distributions. By default, SCALE extracts 10 features from the input data. For comparison, we also applied PCA, scVI and cisTopic to reduce the input data to 10 dimensions. In the comparison, we also included Cicero^{27}, a scATACseq data analysis tool for predicting cisregulatory interactions and building singlecell trajectories from scATACseq data, and TFIDF a transformation for performing dimension reduction and clustering^{28}. We then visualized the extracted features from these tools as well as the raw data with tSNE. In general, the feature embeddings of SCALE and cisTopic were better separated between cell types, whereas the embeddings of PCA, scVI, Cicero, TFIDF and the raw data overlapped between some cell types (Fig. 2a, Supplementary Fig. 2).
SCALE can also reveal the distance between different cell subpopulations and sometimes suggested their developmental trajectory in UMAP visualization^{29} (Supplementary Fig. 3). For example, in the Forebrain dataset the three clusters of excitatory neuron cells (EX1, EX2, and EX3) are close to each other in the latent space. For the Splenocyte dataset, the different Tcell subpopulations are in the neighborhood, the B cells form another bigger group, and the two types of natural killer cells also cluster closely. For the Leukemia dataset, Mono and LMPP cells are the most dissimilar in leukemia evolution and they were indeed the farthest separated. LSCs exhibit strong similarity to LMPPs^{30}, consistent with that the LSC cells (LSC_SU070, LSC_SU353) were close to the LMPP cells in the embedding. Finally, the blast cells (Blast_SU070 and Blast_SU353) showed a bimodal distribution, with some more differentiated blasts closer to monocytes^{31,32}.
We then applied Kmeans clustering on the SCALEextracted latent features and assessed the clustering accuracy by comparing the results with scABC, scVI, cisTopic, and SC3^{33}, another widelyused clustering method for scRNAseq data. SCALE displayed the overall best performance on all five real experimental mixture datasets, and was nearly as accurate as scABC and cisTopic on the InSilico dataset (Fig. 2b, Supplementary Fig. 4). The newly developed cisTopic generally performed pretty well on all datasets, with the overall clustering performance only slightly lower than SCALE, but it misclassified a few clusters on the Splenocyte dataset. We also compared with TFIDF and Cicero on clustering. TFIDF performed well on most datasets (although not as good as SCALE) except on the Forebrain dataset. However, Cicero did not perform well on most datasets; indeed, data visualization and clustering are not major goals of Cicero. On the Forebrain dataset, cluster assignments of SCALE were the closest to the reference cell types. Due to the sparsity of data, the Pearson and Spearman correlations were both illdefined (Supplementary Fig. 5a), which directly led to poor clustering for SC3 where most cells collapsed into one group. Although the VAEbased method scVI did not suffer from the problem of illdefined cell distance, it misclassified three subgroups of cells (s1, s2, s3 labeled on the confusion matrix. Supplementary Fig. 5b).
To identify the cause of the misclustering by scVI, we searched for the most similar cell types for the three subgroups (s1, s2, s3). We aggregated the peak profiles of each cell type or subgroup to form a representing metacell and calculated the similarities among the metacells. As expected, s1 is the most similar to EX2, s2 the most similar to EX3, and s3 to AC (astrocyte) in the original data (Supplementary Fig. 5c). Both scVI and SCALE model the distribution of scATAC peak profiles to remove noise and to impute missing values (discussed in detail in the next section). We found that, consistent with the clustering results, this data calibration by scVI actually made s1, s2, and s3 cells less similar to the original cell types of EX2, EX3, and AC, respectively. On the contrary, SCALE retained the similarities of the three subgroups to their original cell types. Strikingly, when removing the GMM restriction from the overall framework but keeping the other part of the network the same, the degenerated SCALE yield performance was similar to that of a regular VAE, like scVI (Supplementary Fig 5d). Thus, introducing GMM as the prior to restrict the data structure provides SCALE with greater power for fitting sparse data than regular VAE using single Gaussian as the prior.
Finally, we tested whether SCALE is robust with respect to data sparsity by randomly dropping scATACseq values in the raw datasets down to zero. We compared the clustering accuracy of SCALE and other tools at different dropping rates (10–90%), measured by the adjusted Rand Index (ARI), Normalized Mutual Information (NMI) and micro F1 score (Methods). We found that SCALE displayed only a moderate decrease in clustering accuracy with increased data corruption until at about the corruption level of 0.6, and was robust for all datasets (Supplementary Fig. 6). In general, scABC, SC3, and scVI also showed robustness to data corruption; however, the overall clustering accuracies were much lower on some datasets (e.g., SC3 failed on the Forebrain dataset and scVI failed on the GM12878/HEK293T and the GM12878/HL60 datasets). On the Forebrain dataset, the ARI scores of SCALE dropped from 0.668 using the raw data to 0.448 on using the data with 30% corruption, and scABC and scVI dropped from 0.315 to 0.222 and from 0.448 to 0.388, respectively.
Finally we also provide a method to help users choose the optimal number of clusters based on the TracyWidom distribution^{34} (Methods), which could often produce an estimate of the number of clusters close to that of the references (Supplementary Fig. 7) and generate clustering results similar to the reference sets (Supplementary Fig. 7).
SCALE reduces noise and recovers missing peaks
An important feature of SCALE is the ability to accurately estimate the real distribution of scATACseq data, which usually contains both noise and a large number of missing values. The estimate could be used to remove noise and restore missing data (Fig. 1). We evaluated the calibration efficiency of SCALE on both real and simulated datasets. Since no such tool is currently available for scATACseq data, we compared SCALE with scImpute, SAVER, MAGIC, and scVI, four stateoftheart scRNAseq imputation methods (Fig. 3a).
We first evaluated the ability of SCALE to remove noise and to recover missing values on real scATACseq datasets. A challenge of analyzing real data is that the ground truth data without any corruption is unknown. However, if we average all single cells of the same biological cell type, the resulted metacell will be a good approximate to those single cells. SCALE performed better than all scRNAseq imputation methods in all scATACseq datasets, in that it achieved the highest correlation of the single cells with the corresponding metacell for each cell type (Fig. 3a, Supplementary Fig. 8), indicating that it obtained a better estimate of the real scATACseq data distribution. For most cases, scImpute was very stable and among the best comparing with other scRNAseq imputation methods, and SAVER performed well on denser datasets (InSilico, Splenocyte) but deteriorated on sparser datasets. MAGIC and scVI might have underfit the sparse input data and the imputed data substantially deviated from it (Supplementary Fig. 9), which may reflect that the two powerful tools that are optimized to scRNAseq data imputation may not fit for scATACseq data analysis.
It is important to note that the data calibration of SCALE was obtained at the same time of data modeling and clustering, i.e., without knowing the original type of each cell. So it could not simply average all single cells of the same cell type to reconstruct the peak so that they resemble the reference metacell. Also importantly, SCALE achieved a high correlation with the metacells while maintaining a similar level of variation within each cell population (see the variation of correlation coefficients in Fig. 3a and Supplementary Fig. 8). Indeed, SCALE retained the original data structure (intracorrelation within the imputed data) and recovered the original peak profiles (intercorrelation with the raw data) in the process of data regularization by GMM (Supplementary Fig. 9).
The imputation of SCALE could strengthen the distinct patterns of clusterspecific peaks by filling missing values and removing potential noise (Supplementary Fig. 10), which improves downstream analysis, for example the identification of celltypespecific motifs and transcription factors by chromVAR. We demonstrated this feature with the Forebrain dataset. We first followed the method used by Cusanovich et. al. to identify differentially accessible sites with the “binomialff” test of Monocle 2 package^{28}. At 1% FDR threshold, we identified 4100 differential accessible sites across the eight reference clusters of the Forebrain dataset. We then used chromVAR to search for motifs enriched in the differential sites in the raw and the imputed data, respectively. Overall, the patterns of different cell types are more distinct for these differentially accessible sites in the imputed than in the raw data (Supplementary Fig. 11a). And embedding on the imputed data shows betterdefined clusters (each well corresponds to a subtype with biological definition) than on the raw data (Fig. 3b, Supplementary Fig. 11b).
We found that the imputed data can greatly improve the results of chromVAR analysis by identifying more motifs (increased from 52 motifs to 105). For example, chromVAR analysis on the imputed data, but not on the raw data, identified the motifs Mafb and Hoxd9 enriched in the MG (macroglia) cluster (Supplementary Fig. 11c–d). It was recently reported that Mafb contributes to the activation of microglia^{35}. It also identified Hoxd9 enriched in IN (inhibitory neuron) from the imputed but not the raw data. Similarly, we found that Dlx2, Lhx8, Arx, and Neurog1 are much more significantly enriched in the, respectively, clusters in the imputed data (Supplementary Fig. 11cd). Dlx2, Lhx8, and Arx are important components in the MGE (medial ganglionic eminence) pathway of forebrain development^{36}, and Neurog1 is required for excitatory neurons in the cerebral cortex^{37}.
We then introduced further corruption to the real data by randomly dropping out peaks at different rates (Methods). At all corruption rates, SCALE performed the best, in that the calibrated data most closely correlated with the original metacells (Supplementary Fig. 12). We observed similar trends for the other scRNAseq imputation tools as above, confirming the effectiveness of SCALE in enhancing scATACseq data. We further tested the impact of missingness on generative model of imputation by calculating the confusion score (Methods) to evaluate the ability to preserve the original structure (inter and intracorrelation of metacells) (Supplementary Fig. 13). We found that the effect was minimal when the corruption level was lower than about 0.5, and after that threshold, the generative model was less capable of preserving the original structure (Supplementary Fig. 13b).
We subsequently tested the calibration accuracy on a simulated dataset. We constructed the dataset by first generating reference scATACseq data consisting of three clusters, each containing 100 peaks with no missing values, then randomly dropping out peaks and introducing noise (Methods, Supplementary Fig. 14a). As we knew the ground truth data of each single cell, we could quantify the efficiency of all tools by calculating peakwise and cellwise correlations of each calibrated single cell with its original ground truth. At all corruption rates, SCALE recovered the original data most accurately (Supplementary Fig. 14b–c). On the other hand, although scImpute could also recover the missing values in most cases, it messed up two clusters at the 0.2 corruption rate and was unable to remove the noise. SAVER and scVI smoothed both the signal and noise simultaneously and only recovered missing values to some degree. MAGIC performed very well at low corruption rates, but apparently oversmoothed the data and removed true signals along with noise at high levels of data corruption.
SCALE reveals cell types and their specific motifs
Next, we used SCALE to analyze a dataset generated by a recently developed technology, proteinindexed singlecell assay of transposaseaccessible chromatinseq (PiATAC), which uses protein labeling to help define cell identities^{23}. Dissecting complex cell mixtures of in vivo biological samples may be challenging. By simultaneously characterizing protein markers and epigenetic landscapes in the same individual cells, PiATAC provides an effective approach to tackle the problem. The Breast Tumor dataset is derived from a mouse breast tumor sample, including two plates of tumor cells (Epcam+) and another two plates of tumorinfiltrating immune cells (CD45+), isolated by protein labeling and FACS sorting. In the original study, a set of motifs was used to project the Epcam+ and CD45+ specific chromatin features with tSNE, and it was difficult to separate these two cell types computationally (Supplementary Fig. 15a). However, we found that SCALE was able to separate the two cell types well, better than PCA and scVI in latent embedding (Fig. 4a). On clustering, SCALE also yielded results the closest to the proteinindex labels, better than scVI and scABC, whereas SC3 poorly distinguished the two cell types (Fig. 4b). Although cisTopic grouped the cells well in the embedding, it misclassified parts of CD45+ cells into Epcam+ cells. SCALE thus can reveal cell types within complex tissues based only on scATACseq data, with performance comparable to sophisticated experimental technologies like PiATAC.
We validated the biological significance of the cell clusters based on PiATAC peaks. For each cluster, we calculated the top 1000 peaks with the highest specificity score as typespecific peaks (Methods, Supplementary Fig. 15b). We then used Homer^{38} to identify transcription factor binding motifs that were enriched in the typespecific peaks. We removed the common motifs enriched in both CD45+ cells and Epcam+ cells, and kept those that were enriched in only one cell type. We found that CD45+ cells were enriched for immunespecific motifs Maz, Pu.1Irf, Irf8, Runx1, Elk4, Nfy, Elf3, and SpiB binding motifs. These findings are consistent with the role of Runx1 in maintenance of haematopoietic stem cells (HSC) and that knockout of Runx1 results in defective T and Blymphocyte development^{39}. Nfy promotes the expression of the crucial immune responsive gene Major Histocompatibility Complex (MHC)^{40}. Epcam+ cells were enriched for tumorrelated motifs Klf14, Mitf, Ets1, Nrf2, and Nrf1 binding motifs. Ets1 is frequently overexpressed in breast cancer and associated with invasiveness^{41}, whereas Nrf2 is a key signature for breast cancer cell proliferation and metastasis^{42} (Fig. 4c). Thus, SCALE analysis of the Breast Tumor data revealed biologically relevant ciselements for gene regulation.
SCALE disentangles biological cell types and batch effects
In addition to tighter estimates of the multimodal input data, by pushing each dimension to learn a separate Gaussian distribution, GMM has another advantage in that it leads to latent representations that are more structured and disentangled, and thus more interpretable^{21}. In SCALE, as each feature is directly connected with output peaks, it can be assessed by the most weighted connections (Methods, Supplementary Fig. 16a). For example, in the Leukemia dataset, dimensions 9 of the extracted features captured peaks specific to the Mono cell type and enriched regulatory elements related to immunerelated “biological process” (BP, Methods) (Supplementary Fig. 16b). In the Forebrain dataset, feature 3 characterized the AC (astrocyte) and the OC (oligodendrocyte) cell types, enriched elements related to “glial cell differentiation” (Supplementary Fig. 16c). In the Splenocyte dataset, features 4 and 7 portrayed two complementary sets of cell types (Supplementary Fig. 16d), with feature 4 enriched with B cellrelated processes like “regulation of cell morphogenesis” and “myeloid leukocyte activation and differentiation”, and feature 7 enriched with T cellrelated processes such as “immune response” and “regulation of cell killing” (Supplementary Fig. 16d). These data suggest that the features learned by the model of SCALE are disentangled and can shed light on the biological significance.
Most interestingly, we found that SCALE could possibly reveal features corresponding to potential batch effects in the input data. For example, the Breast Tumor dataset is derived from experiments performed separately on two plates of Epcam+ tumor cell samples and two plates of CD45+ tumorinfiltrating immune cells. Although SCALE successfully clustered the two cell types, the data structure in the lowdimensional space also revealed bias towards different plates (Fig. 5a). We carefully analyzed the SCALEextracted features (Fig. 5b) and noticed that while some, e.g., features 1 and 6, were wellcorrelated to biological cell types, the others, e.g., features 2, 4, 8, and 10, more or less corresponded to independent plates, or, e.g., features 3 and 5, displayed biased distribution not related to cell types. Using the platerelated features (i.e., features 2, 3, 4, 5, 8, and 10) for data embedding, we found that the cells were separated by plates, but not by types. On the other hand, if we used the other plateindependent features (i.e., features 1, 6, 7, and 9), we found that indeed the cells of different plates of the same types more evenly distributed in the cluster (Fig. 5a). We further checked the represented peaks of these features and its biological significance (Supplementary Fig. 16e). Most of platerelated features have no biological relevance, except for peaks of feature 8, which appeared in one plate of CD45+ cells and are enriched with biological processes such as “response to cytokine stimulus”. This finding, however, suggests another possibility in interpreting the “plate bias” as a real biological difference in the two separate plates of CD45+ cells that might arise from sorting and cell culture.
We noticed that GM12878 cells in the InSilico dataset contain four replicates with many peak values much greater than 2. PCA analysis showed that replicates 1 and 3 were separated in the lowdimensional space (Supplementary Fig. 17a), suggesting a possible batch effect. However, the differences in the two replicates disappeared after we binarized the data, by masking values greater than 1 to 1 (Supplementary Fig. 17b). On the other hand, we observed no particular features corresponding to any batch among the SCALEextracted features (Supplementary Fig. 17c). Consequently, in the embedding and clustering results based on the SCALEextracted features, the cells of each replicate were distributed evenly in the lowdimensional space (Supplementary Fig. 17c). We confirmed this result by checking the top 200 specific peaks for each replicate based on raw data and found no significantly different pattern across replicates (Supplementary Fig. 17d). The distinction may reflect the different characters of the two approaches: while PCA is a linear method and sensitive to quantitative variations, SCALE is nonlinear and more stable. Lastly, we repeated the analysis on the Splenocyte and the Forebrain datasets—the other two datasets that contain different experimental batches, and found no batchrelated features, and the cells of different batches were distributed indistinguishably in the lowdimensional space (Supplementary Figs. 18, 19).
SCALE is scalable to large datasets
We further examined a mouse singlecell atlas of profiled chromatin accessibility in ~80,000 single cells from 13 adult mouse tissues by sciATACseq^{28} to investigate whether SCALE works for large datasets. The atlas study used a computational pipeline to infer 30 cell types from the dataset by graphic clustering, which were used as “reference” cell types when benchmarking SCALE. SCALE worked well on this big dataset and showed a good agreement with the reference: the overall F1 score was 0.419, and most of the major reference clusters have a corresponding one identified by SCALE. Nevertheless, some large reference clusters were split into two or three small groups (Supplementary Fig. 20).
Finally, we benchmarked the running time and memory usage of SCALE on different scales of datasets by downsampling a subset of cells and peaks from the mouse atlas datasets (10,000 peaks and different cell number). We found that SCALE required a little over 1.5 h and 2 GB of memory to run a dataset, and importantly the used computational resource only increased slightly with the size of datasets (Supplementary Fig. 21).
Discussion
Our work shows that SCALE accurately characterizes the distribution of highdimensional and sparse scATACseq data by using a deep generative framework to extract latent features. SCALE is thus a powerful tool for scATACseq data analysis, including data visualization, clustering, and denoising and imputation. In all comparisons, SCALE performs much more favorably than scABC and scRNAseq tools. Based on the better clustering assignments and imputation data, we can improve the discovery of clusterspecific peaks, and regulatory motifs as well, when combined with tools like Homer^{38} or chromVAR^{13}.
The success of SCALE can be attributed to the powerful deep generative framework and the GMM to accurately model the highdimensional, sparse, multimodal scATACseq data. Similar to SCALE, a recent scRNAseq analysis tool scVI also learns latent representation of scRNAseq data by aggregating information across similar cells using a hierarchical Bayesian model^{18}. However, SCALE also applies a GMM to overcome the increased sparsity of scATACseq data and more tightly estimate data distribution, thus achieving higher accuracy than scVI on scATACseq data analysis. It highlights the necessity and advantage to develop new methods that are optimized for scATACseq data, but not to use scRNAseq data analysis tools.
An attractive additional observation about SCALE is the interpretability of the GMM model. We showed that SCALE could possibly capture biological celltyperelated and potential batcheffectrelated latent features in the lowdimensional space. By excluding batchrelated features in embedding and clustering, we are able to reduce batch effects. Nevertheless, SCALE is not specifically designed to identify and remove these artifacts from the input data. In the future, we could improve the model to explicitly incorporate variables that are designated for the discovery and removal of batch effects and other possible technical variations.
Methods
Data and preprocessing
Data: The Leukemia dataset is derived from a mixture of monocytes (Mono) and lymphoidprimed multipotent progenitors (LMPP) isolated from a healthy human donor, and leukemia stem cells (SU070_LSC, SU353_LSC) and blast cells (SU070_Leuk, SU353_Blast) isolated from two patients with acute myeloid leukemia^{24}. The GM12878/HEK293T dataset and the GM12878/HL60 dataset are respective mixtures of two commonlyused cell lines^{3}. The InSilico dataset is an in silico mixture constructed by computationally putting together six individual scATACseq experiments separately performed on a different cell line^{3,11}. The Splenocyte dataset^{25} is derived from a mixture of mouse splenocytes (after red blood cells removal) and the Forebrain dataset^{26} is derived from P56 mouse forebrain cells. The Breast Tumor dataset^{23} is obtained from a mouse breast tumor sample, including two plates of tumor cells (Epcam+) and another two plates of tumorinfiltrating immune cells (CD45+) from protein labeling and FACS sorting.
Preprocessing: Similar to scABC^{14}, we filtered the scATACseq count matrix to only keep peaks in10 cells with ≥2 reads for the InSilico dataset, the GM12878/HEK293T dataset, and the GM12878/HL60 dataset, ≥5 cells with ≥2 reads for the Leukemia dataset, ≥50 cells with ≥2 reads for the Forebrain dataset, and ≥5 cells with ≥1 reads for the Breast Tumor dataset. We kept all the peaks for the Splenocyte dataset. We also only kept cells with read counts ≥(number of filtered peaks/50). For the InSilico dataset, there were still almost 90,000 peaks after filtering. For the efficiency of the SCALE model, similar to SC3^{33}, we further removed rare peaks (reads >2 in less than X% of cells) and ubiquitous peaks (reads ≥1 in at least (100–X)% of cells).
The probabilistic model of SCALE
SCALE combines a variational autoencoder (VAE) and the Gaussian Mixture Model (GMM) to model the input scATACseq data x through a generative process. Given K clusters, corresponding latent variable z can be obtained through the encoder via the reparameterization then to generate sample x through the decoder. It can be modeled with a joint distribution \(p\left( {{\mathbf{x}},{\mathbf{z}},c} \right)\), where z is the latent variable and c is a categorical variable whose probability is Discrete (cπ) where \({\mathrm{P}}\left( {{\mathrm{C}} = {\mathrm{j}}} \right) = \pi _j,\pi \in {\Bbb R}^K\). p(zc) is mixture of Gaussians distribution parameterized by μ_{c} and σ_{c} conditioned on c. Given that x and c are independently conditioned on z, then joint probability p(x, z, c) can be factorized as:
We define each factorized probability as:
The training SCALE is to maximize the loglikelihood of the observed scATACseq data:
which can be transformed to maximize the evidence lower bound (ELBO). The ELBO can be written with a reconstruction term and a regularization term:
The reconstruction term encourages the imputed data to be similar to the input data. The regularization term is a KullbackLeibeler divergence, which regularizes the latent variable z to a GMM manifold. And q(z, cx) and p(xz) are an encoder and a decoder, respectively, which can be modeled by two neural networks.
The overall network architecture of SCALE
SCALE consists of an encoder and a decoder. The encoder is a fourlayer neural network (3200–1600–800–400) with the ReLU activation function. The decoder has no hidden layer but directly connects the ten latent variables (features) to the output layer (peaks) with the Sigmoid activation function. A GMM model is used to initialize the parameters μ_{c} and σ_{c}. The Adam optimizer^{43} with a 5e4 weight decay is used to maximize the ELBO. Minibatch size is 32. SCALE also provides a quick mode for large datasets with the encoder structure of two layers (1024–128), and model training with maximum iterations of 30,000 and early stopping when no improvements in 10 epochs. The GMM models are constructed with the Python “scikitlearn” package, and the neural network is implemented with the “pytorch” package.
Visualization
We used tSNE from the Python “scikitlearn” package to project the raw data or latent features to 2dimension with random state as 124. We used Python package “umap” to visualize the trajectory cell relationships.
Clustering
We used the Kmeans clustering method from the Python “scikitlearn” package to cluster the input single cells based on the extracted features. To repeat the result, we set the random seed to 18.
Evaluation of clustering results
Adjusted Rand Index: The Rand Index (RI) computes similarity score between two clustering assignments by considering matched and unmatched assignments pairs independently of the number of clusters. The Adjusted Rand Index (ARI) score is calculated by “adjust for chance” with RI by:
If given the contingency table, the ARI can also be represented by:
The ARI score is 0 for random labeling and 1 for perfectly matching.
Normalized mutual information:
where P, T are empirical categorical distributions for the predicted and real clustering, I is the mutual entropy, and H is the Shannon entropy.
F1 score:
Generation and corruption of the simulation dataset
A simulation dataset consisting of 300 cells and 1000 peaks was generated. The peaks formed three clusters, with each cluster containing 100 specific peaks. These specific peaks had a value of 1 or 2 (ratio 1:4) in the cells of the corresponding clusters, and 0 in other cells. Corrupted datasets were generated by randomly dropping out values at different rates from 0.1 to 0.8, followed by introducing random noise by setting values as 1 or 2 (ratio 1:4) with the probability of 0.1.
Identifying differentially accessible sites
We followed Cusanovich et al. ^{28} and used “binomiallf” test implemented in Monocle 2 package^{44} to identify differentially accessible peaks. We set a 1% FDR threshold (BenjaminiHochberg method) to decide the peaks were significant for each cluster.
Calculation of the cluster specificity score of a peak
We applied an entropybased measure to calculate a cluster specificity score for the association of each peak with each cluster. In detail, it is defined by comparing the distribution of the peak pattern with the predefined ideal clusterspecific pattern in which a peak only appears in one cluster:
while p is the distribution of observed peaks overall samples, and q is the distribution of predefined pattern for the cluster c,
where \({\mathrm{Div}}_{{\mathrm{jensen}}}(p,q)\) is the Jensen divergence distance:
where \(H\left( p \right)\) is the entropy of peak’s distribution:
This provides the peakcluster matrix, and the final cluster specificity score is the maximal score overall clusters. By default, we defined the top 200 peaks as the clusterspecific peaks, which were used in the downstream analysis.
Binarization
We transformed the float imputed values to binary ones as below:
where imputed is the imputation matrix, raw is the raw data matrix, i means the ith peak, j means the jth cell.
Confusion score
We first calculated the inter/intracorrelation matrix, then transformed the diagonal values of the correlation matrix to:
Then calculated the mean of the upper triangle of the correlation matrix as the confusion matrix:
A confusion score of “0” means a perfect preservation of the original population.
Features associated peaks
In SCALE, as each feature is directly connected with output peaks, the featurepeak association can be assessed by the weights of links. We approximate the distribution of the weights as a Gaussian distribution, and defined those peaks with weights most deviated from the mean as featureassociated peaks. By default, we set 2.5 standard deviations from the mean as the cutoff.
Discovery of enriched TFs
We applied findMotifsGenomes.pl from the software Homer with default parameters on the top 1000 specific peaks of the CD45+ and the Epcam+ corresponding singlecell clusters, respectively, to search for transcription factor binding motifs. We only considered the motif occurrences with binomial test Pvalue ≤ 0.001.
Annotation of genomic elements
We used the GREAT^{45} algorithm (version 3.0.0) to perform the gene enrichment analysis by including genomic regions of a basal plus an extension (1 kb upstream and 0.1 kb downstream with up to 500kb max extension) in the search for elements enriched with the GO ‘biological process’ terms.
Prediction of a suitable number of cluster k
We used the number of the eigenvalues of X^{T}X that are significantly different as the predicted k, where X is the count matrix. We followed SC3 and calculated the mean and the s.d. of the TracyWidom distribution to determine the threshold:
Where n is the number of peaks and p is the number of cells.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The scATACseq in silico mixture data are available in Gene Expression Omnibus (GEO) under accession number GSE65360. Singlecell data for leukemia mixture is available at GSE74310, GM12878/HEK293T and GM12878/HL60 mixtures can be found at GSE68103, PiATAC Breast Tumor data can be obtained at GSE112091. Splenocyte mixture can be accessed at ArrayExpress with accession number EMTAB6714 and Forebrain mixture can be accessed at GSE100033. The mouse atlas dataset is available at http://atlas.gs.washington.edu/mouseatac. All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding author upon reasonable request. A reporting summary for this Article is available as a Supplementary Information file.
Code availability
The SCALE software including documents and tutorial is available on Github (https://github.com/jsxlei/SCALE).
References
Tsompana, M. & Buck, M. J. Chromatin accessibility: a window into the genome. Epigenetics Chromatin 7, 33 (2014).
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNAbinding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
Buenrostro, J. D. et al. Singlecell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Abdi, H. & Williams, L. J. Principal component analysis. WIREs Comput. Stat. 2, 433–459 (2010).
van der Maaten, L. & Hinton, G. Visualizing Data using tSNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Menon, V. Clustering single cells: a review of approaches on highand lowdepth singlecell RNAseq data. Brief Funct. Genomics, https://doi.org/10.1093/bfgp/elx044 (2017).
Macosko, E. Z. et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by singlecell transcriptomics. Cell 166, 1308–1323 e1330 (2016).
Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for singlecell RNAseq data. Nat. Commun. 9, 997 (2018).
Huang, M. et al. SAVER: gene expression recovery for singlecell RNA sequencing. Nat. Methods 15, 539–542 (2018).
Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcriptionfactorassociated accessibility from singlecell epigenomic data. Nat. Methods 14, 975–978 (2017).
Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).
Bravo GonzalezBlas, C. et al. cisTopic: cisregulatory topic modeling on singlecell ATACseq data. Nat. Methods 16, 397–400 (2019).
Kingma, D. P. & Welling, M. Autoencoding variational bayes. arXiv:1312.6114 (2013).
Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. arXiv:1511.06335 (2015).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Krishnan, R. G., Liang, D. & Hoffman, M. On the challenges of learning with inference networks on sparse, highdimensional data. arXiv:1710.06085 (2017).
Jiang, Z., Zheng, Y., Tan, H., Tang, B. & Zhou, H. Variational deep embedding: an unsupervised and generative approach to clustering. arXiv:1611.05148 (2016).
Dilokthanakul, N. et al. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv:1611.02648v2 (2016).
Grønbech, C. H. et al. scVAE: Variational autoencoders for singlecell gene expression data. bioRxiv (2018).
Chen, X. et al. Joint singlecell DNA accessibility and protein epitope profiling reveals environmental regulation of epigenomic heterogeneity. Nat. Commun. 9, 4590 (2018).
Corces, M. R. et al. Lineagespecific and singlecell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
Chen, X., Miragaia, R. J., Natarajan, K. N. & Teichmann, S. A. A rapid and robust method for single cell chromatin accessibility profiling. Nat. Commun. 9, 5345 (2018).
Preissl, S. et al. Singlenucleus analysis of accessible chromatin in developing mouse forebrain reveals celltypespecific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).
Pliner, H. A. et al. Cicero predicts cisregulatory DNA interactions from singlecell chromatin accessibility data. Mol. Cell, https://doi.org/10.1016/j.molcel.2018.06.044 (2018).
Cusanovich, D. A. et al. A singlecell atlas of in vivo mammalian chromatin accessibility. Cell, https://doi.org/10.1016/j.cell.2018.06.052 (2018).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018).
Goardon, N. et al. Coexistence of LMPPlike and GMPlike leukemia stem cells in acute myeloid leukemia. Cancer Cell. 19, 138–152 (2011).
Bennett, J. M. et al. Proposals for the classification of the acute leukaemias. FrenchAmericanBritish (FAB) cooperative group. Br. J. Haematol. 33, 451–458 (1976).
van’t Veer, M. B. The diagnosis of acute leukemia with undifferentiated or minimally differentiated blasts. Ann. Hematol. 64, 161–165 (1992).
Kiselev, V. Y. et al. SC3: consensus clustering of singlecell RNAseq data. Nat. Methods 14, 483–486 (2017).
Patterson, N., Price, A. L. & Reich, D. Population Structure and Eigenanalysis. PLoS. Genet. 2, e190 (2006).
TozakiSaitoh, H. et al. Transcription factor MafB contributes to the activation of spinal microglia underlying neuropathic pain development. Glia 67, 729–740 (2019).
Nord, A. S., Pattabiraman, K., Visel, A. & Rubenstein, J. L. R. Genomic perspectives of transcriptional regulation in forebrain development. Neuron 85, 27–47 (2015).
Kim, E. J. et al. Spatiotemporal fate map of neurogenin1 (Neurog1) lineages in the mouse central nervous system. J. Comp. Neurol. 519, 1355–1370 (2011).
Heinz, S. et al. Simple combinations of lineagedetermining transcription factors prime cisregulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Voon, D. C., Hor, Y. T. & Ito, Y. The RUNX complex: reaching beyond haematopoiesis into immunity. Immunology 146, 523–536 (2015).
Sachini, N. & Papamatheakis, J. NFY and the immune response: dissecting the complex regulation of MHC genes. Biochim. Biophys. Acta Gene Regul. Mech. 1860, 537–542 (2017).
Furlan, A. et al. Ets1 controls breast cancer cell balance between invasion and growth. Int. J. Cancer 135, 2317–2328 (2014).
Zhang, C. et al. NRF2 promotes breast cancer cell proliferation and metastasis by increasing RhoA/ROCK pathway signal transduction. Oncotarget 7, 73593–73606 (2016).
Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
Qiu, X. et al. Reversed graph embedding resolves complex singlecell trajectories. Nat. Methods 14, 979–982 (2017).
McLean, C. Y. et al. GREAT improves functional interpretation of cisregulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
Acknowledgements
We thank Xinqi Chen for insightful comments on the manuscript and the help with the investigation of the Breast Tumor dataset. We thank Jianbin Wang for helpful suggestions. We thank Mahdi Zamanighomi and Timothy Daley for kindly providing the InSilico and the Leukemia datasets used in the scABC paper. We thank Rongxin Fang for the celltype labels of the Forebrain dataset in their original paper. We thank Life Science Editors for editing assistance. This project is supported by the Chinese Ministry of Science and Technology (Grant No. 2018YFA0107603 to Q.C.Z.) and the National Natural Science Foundation of China (Grants No. 91740204, 31761163007, and 31621063 to Q.C.Z.), the Beijing Advanced Innovation Center for Structural Biology, the TsinghuaPeking Joint Center for Life Sciences and the National Thousand Young Talents Program of China to Q.C.Z.
Author information
Authors and Affiliations
Contributions
Q.C.Z. conceived and supervised the project. L.X. designed, implemented, and validated SCALE with the help from K.X., K.T. and Y.S., L.T., G.G., M.Z. and T.J. helped analyzing the data, L.X. and Q.C.Z. wrote the manuscript with inputs from all the authors.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Syed Murtuza Baker, Ole Winther and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xiong, L., Xu, K., Tian, K. et al. SCALE method for singlecell ATACseq analysis via latent feature extraction. Nat Commun 10, 4576 (2019). https://doi.org/10.1038/s41467019126307
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467019126307
This article is cited by

A benchmark of computational pipelines for singlecell histone modification data
Genome Biology (2023)

Cellspecific clockcontrolled gene expression program regulates rhythmic fiber cell growth in cotton
Genome Biology (2023)

siVAE: interpretable deep generative models for singlecell transcriptomes
Genome Biology (2023)

Characterizing cisregulatory elements using singlecell epigenomics
Nature Reviews Genetics (2023)

Modeling fragment counts improves singlecell ATACseq analysis
Nature Methods (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.