Abstract
Single-cell epigenomic data has been growing continuously at an unprecedented pace, but their characteristics such as high dimensionality and sparsity pose substantial challenges to downstream analysis. Although deep learning models—especially variational autoencoders—have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE’s capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 digital issues and online access to articles
$99.00 per year
only $8.25 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
The splenocyte dataset, with peaks in the GRCm38/mm10 genome, was downloaded from ArrayExpress via accession no. E-MTAB-6714 (ref. 66). The InSilico dataset, with peaks in the GRCh37/hg19 genome, was constructed by computationally putting together six scCAS experiments that were performed on different cell lines individually and was downloaded in Gene Expression Omnibus (GEO) under accession no. GSE65360 (ref. 1). The droplet dataset, with peaks in the GRCh37/hg19 genome, measures chromatin accessibility across 136,463 resting and stimulated human bone marrow-derived cells and was downloaded in GEO under accession no. GSE123580 (ref. 25). The mouse chromatin accessibility atlas datasets with peaks in the genome of GRCm37/mm9 were downloaded from http://atlas.gs.washington.edu/mouse-atac (ref. 37). The human cell atlas of fetal chromatin accessibility, with peaks in the GRCh37/hg19 genome, was downloaded under accession no. GSE149683 (ref. 26). The brain dataset, with peaks in the GRCm38/mm10 genome, consists of two batches assayed by single nucleus assay for transposase-accessible chromatin using sequencing, snATAC and 10X (refs. 12,32) and was downloaded in GEO under accession no. GSE126724 and https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k. The immune dataset with peaks in the genome of GRCh37/hg19, derived from human peripheral blood and bone marrow dataset, was downloaded in GEO under accession no. GSE129785 (ref. 56). Source Data are provided with this paper.
Code availability
The CASTLE software, including detailed documents and tutorial, is freely available on GitHub (https://github.com/cuixj19/CASTLE). The source code is also available via Zenodo at https://doi.org/10.5281/zenodo.10906304 (ref. 67).
References
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Klemm, S. L., Shipony, Z. & Greenleaf, W. J. Chromatin accessibility and the regulatory epigenome. Nat. Rev. Genet. 20, 207–220 (2019).
Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 1–25 (2019).
Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).
Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Gao, Z. et al. scEpiTools: a database to comprehensively interrogate analytic tools for single-cell epigenomic data. J. Genet. Genom. 51, 462–465 (2024).
Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).
Kingma, D. P. & Welling, M. Auto-encoding variational bayes. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) http://arxiv.org/abs/1312.6114 (ICLR, 2014).
Xiong, L. et al. Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space. Nat. Commun. 13, 6118 (2022).
Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).
Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep. Methods 2, 100182 (2022).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat. Methods 20, 1222–1231 (2023).
van den Oord, A., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation learning. In Proc. 31st Conference on Neural Information Processing Systems 6309–6318 (Curran Associates Inc., 2017).
Razavi, A., van den Oord, A. & Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Proc. 33rd Conference on Neural Information Processing Systems 1331 (Curran Associates Inc., 2019).
Kobayashi, H., Cheveralls, K. C., Leonetti, M. D. & Royer, L. A. Self-supervised deep learning encodes high-resolution features of protein subcellular localization. Nat. Methods 19, 995–1003 (2022).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Romano, S., Vinh, N. X., Bailey, J. & Verspoor, K. Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 17, 1–32 (2016).
Strehl, A. & Ghosh, J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).
Fowlkes, E. B. & Mallows, C. L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).
Lareau, C. A. et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat. Biotechnol. 37, 916–924 (2019).
Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science 370, eaba7612 (2020).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Argelaguet, R., Cuomo, A. S., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
Kopp, W., Akalin, A. & Ohler, U. Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning. Nat. Mach. Intell. 4, 162–168 (2022).
Chen, S., Wang, R., Long, W. & Jiang, R. ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data. Bioinformatics 39, btac842 (2023).
Dudoit, S. & Fridlyand, J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 3, 1–21 (2002).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 310 (2019).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
Chen, S. et al. RA3 is a reference-guided approach for epigenetic characterization of single cells. Nat. Commun. 12, 2177 (2021).
Lin, Y. et al. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol. 40, 703–710 (2022).
Li, Z., Chen, X., Zhang, X., Jiang, R. & Chen, S. Latent feature extraction with a prior-based self-attention framework for spatial transcriptomics. Genome Res. 33, 1757–1773 (2023).
Slowikowski, K., Hu, X. & Raychaudhuri, S. SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci. Bioinformatics 30, 2496–2497 (2014).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Danese, A. et al. EpiScanpy: integrated single-cell epigenomic analysis. Nat. Commun. 12, 5228 (2021).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 1–16 (2018).
Chen, X. et al. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat. Mach. Intell. 4, 116–126 (2022).
Liu, Q., Chen, S., Jiang, R. & Wong, W. H. Simultaneous deep generative modelling and clustering of single-cell genomic data. Nat. Mach. Intell. 3, 536–544 (2021).
Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Kaiser, L. et al. Fast decoding in sequence models using discrete latent variables. In Proc. 35th International Conference on Machine Learning 2390–2399 (PMLR, 2018).
Peng, J., Liu, D., Xu, S. & Li, H. Generating diverse structure for image inpainting with hierarchical VQ-VAE. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10775–10784 (IEEE, 2021).
Williams, W. et al. Hierarchical quantized autoencoders. In Proc. 34th International Conference on Neural Information Processing Systems 4524–4535 (ACM, 2020).
Takida, Y. et al. SQ-VAE: variational bayes on discrete representation with self-annealed stochastic quantization. In Proc. 39th International Conference on Machine Learning 20987–21012 (PMLR, 2022).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In 33rd Conference on Neural Information Processing Systems 721 (Curran Associates Inc., 2019).
Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (Bengio, Y. & LeCun, Y.) http://arxiv.org/abs/1412.6980 (ICLR, 2015).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Tang, S. et al. scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data. Nat. Commun. 15, 1629 (2024).
Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA 101, 6062–6067 (2004).
Gazal, S. S-LDSC reference files. Zenodo https://doi.org/10.5281/zenodo.7768714 (2017).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Chen, X., Miragaia, R. J., Natarajan, K. N. & Teichmann, S. A. A rapid and robust method for single cell chromatin accessibility profiling. Nat. Commun. 9, 5345 (2018).
Cui, X. et al. Discrete latent embedding of single-cell chromatin accessibility sequencing data for cell heterogeneity uncovering. Zenodo https://zenodo.org/doi/10.5281/zenodo.10906304 (2024).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (grant nos. 2021YFF1200902 and 2023YFF1204802 to R.J.), the National Natural Science Foundation of China (grant nos. 62273194 to R.J. and 62203236 to S.C.), the Fundamental Research Funds for the Central Universities (grant no. Nankai University 63231137 to S.C.), and the Young Elite Scientists Sponsorship Program by CAST (grant no. 2023QNRC001 to S.C.).
Author information
Authors and Affiliations
Contributions
R.J. and S.C. conceived the study and supervised the project. X.C. and S.C. designed, implemented and validated CASTLE. Z.L. and Z.G. helped analyze the results. X.C., S.C. and X.C. wrote the paper, with input from all of the authors.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Jianrong Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Editor recognition statement: Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–12, Figs. 1–65 and Table 1.
Source data
Source Data Fig. 2
Original data for this figure.
Source Data Fig. 3
Original data for this figure.
Source Data Fig. 4
Original data for this figure.
Source Data Fig. 5
Original data for this figure.
Source Data Fig. 6
Original data for this figure.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cui, X., Chen, X., Li, Z. et al. Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity. Nat Comput Sci 4, 346–359 (2024). https://doi.org/10.1038/s43588-024-00625-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588-024-00625-4