Deciphering cell types by integrating scATAC-seq data with genome sequences

Abstract

The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq data remains challenging owing to the high dimensionality and extreme sparsity of the data. Existing cell annotation methods mostly focus on the cell-by-peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC-seq data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
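
As a rough illustration of the reconstruction step described above, the following sketch fits one weight vector per cell so that a fully connected (logistic) layer applied to fixed peak embeddings reproduces a binary accessibility matrix. It is not the SANGO implementation: the sequence encoder is replaced by a given embedding matrix, the graph transformer is omitted, and all names (`fit_cell_weights`, `peak_emb`, `access`) as well as the learning rate and epoch count are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(peak_emb, access, W, b):
    """Mean binary cross-entropy between predicted and observed accessibility."""
    p = sigmoid(peak_emb @ W.T + b)
    eps = 1e-9  # avoids log(0)
    return float(-np.mean(access * np.log(p + eps)
                          + (1 - access) * np.log(1 - p + eps)))

def fit_cell_weights(peak_emb, access, lr=0.5, epochs=300):
    """Fit the final dense layer by plain gradient descent.

    peak_emb: (n_peaks, d) embeddings, assumed to come from a sequence encoder.
    access:   (n_peaks, n_cells) binary accessibility matrix.
    Returns W of shape (n_cells, d) and b of shape (n_cells,);
    row c of W is the learned representation of cell c.
    """
    n_peaks, d = peak_emb.shape
    n_cells = access.shape[1]
    W = np.zeros((n_cells, d))
    b = np.zeros(n_cells)
    for _ in range(epochs):
        prob = sigmoid(peak_emb @ W.T + b)      # (n_peaks, n_cells)
        grad = (prob - access) / n_peaks        # d(per-cell mean BCE)/d(logits)
        W -= lr * (grad.T @ peak_emb)           # (n_cells, d)
        b -= lr * grad.sum(axis=0)              # (n_cells,)
    return W, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    peak_emb = rng.normal(size=(200, 8))        # stand-in for encoded sequences
    true_W = rng.normal(size=(5, 8))            # synthetic ground truth
    access = (peak_emb @ true_W.T > 0).astype(float)
    W, b = fit_cell_weights(peak_emb, access)
    print("cell representation shape:", W.shape)
    print("loss before/after:",
          bce(peak_emb, access, np.zeros_like(W), np.zeros_like(b)),
          bce(peak_emb, access, W, b))
```

In this toy setting the rows of `W` play the role that the abstract assigns to the learned weights: a per-cell vector that could then be passed to a downstream classifier.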

Fig. 1: Schematic overview of SANGO framework for annotating cells within scATAC-seq data by integrating the genome sequence.
Fig. 2: Performance of cell-type annotation for intra-datasets.
Fig. 3: Performance across platform or tissue datasets.
Fig. 4: Performance utilizing multisource data or atlas data as the reference.
Fig. 5: Revealing biological implications for normal tissues.
Fig. 6: Identifying multilevel cell types in basal cell carcinoma data.

Data availability

We downloaded the raw scATAC-seq matrix data directly from the sources listed below and binarized each matrix following previous works (refs. 5,47,54). (1) The datasets BoneMarrowA, BoneMarrowB, LungA, LungB, Kidney, Liver, Heart, LargeIntestineA, LargeIntestineB, SmallIntestine, WholeBrainA, WholeBrainB, Cerebellum and PreFrontalCortex are derived from the adult mouse atlas data (ref. 55) and can be downloaded from either GEO accession number GSE111586 or the website http://atlas.gs.washington.edu/mouse-atac/data/. These datasets were sequenced using the sciATAC-seq technology (ref. 56) and annotated using the mm9 reference genome. (2) The anterior datasets (MosA1, MosA2), middle datasets (MosM1, MosM2) and posterior datasets (MosP1, MosP2) are from different sections of the secondary motor cortex in mouse brain (ref. 57) and can be accessed through GEO accession number GSE126724. These datasets were sequenced using the snATAC-seq technology (ref. 58) and annotated using the GRCm38 reference genome. (3) The Mouse Brain (10x) dataset and the normal cortex dataset were sequenced using the 10x sequencing technology and annotated using the mm10 reference genome. These two datasets can be downloaded from https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k and https://www.10xgenomics.com/resources/datasets/fresh-cortex-from-adult-mouse-brain-p-50-1-standard-1-2-0, respectively. (4) The forebrain dataset, which was sequenced using snATAC-seq and annotated using the mm9 reference genome, can be downloaded through GEO accession number GSE100033. (5) The PBMC atlas data, the BCC–TIL data and the basal cell carcinoma sample data were obtained from refs. 3,15. These datasets are annotated using the ENCODE hg19 reference genome and can be accessed through GEO accession number GSE129785 or the download website https://www.synapse.org/#!Synapse:syn52559388/files/. (6) The PBMC (10x) data, which are annotated using the GRCh38 reference genome, were obtained from the official 10x website: https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k. (7) The raw HHLA data can be obtained from GEO accession number GSE184462 and the processed data can be downloaded from the website https://www.synapse.org/#!Synapse:syn52559388/files/. All of these datasets were preprocessed as described in Dataset preprocessing. Source data are provided with this paper.
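
The binarization mentioned at the start of this section can be sketched as follows. This is a minimal illustration that assumes the common convention of marking any nonzero count as accessible; the statement itself only says that previous works were followed, and the function name `binarize` is hypothetical.

```python
import numpy as np

def binarize(counts):
    """Return a 0/1 matrix: 1 wherever the count is positive, else 0.

    Assumption: a peak counts as accessible in a cell if at least one
    fragment overlaps it; the threshold used by the authors is not stated here.
    """
    return (np.asarray(counts) > 0).astype(np.uint8)

if __name__ == "__main__":
    # Toy matrix; the orientation (rows vs. columns as cells) is illustrative only.
    toy = np.array([[0, 3, 1],
                    [2, 0, 0]])
    print(binarize(toy))
```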

Code availability

All source codes used in our experiments have been deposited at https://github.com/cquzys/SANGO. A Zenodo version is also available at https://doi.org/10.5281/zenodo.10826453 (ref. 59).

References

  1. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).

  2. Chen, H. et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. 10, 1903 (2019).

  3. Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).

  4. Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).

  5. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

  6. Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).

  7. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).

  8. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).

  9. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).

  10. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).

  11. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 9, 207–213 (2019).

  12. Kimmel, J. C. & Kelley, D. R. Semisupervised adversarial neural networks for single-cell classification. Genome Res. 31, 1781–1793 (2021).

  13. Ma, W., Lu, J. & Wu, H. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nat. Commun. 14, 1864 (2023).

  14. Chen, X. et al. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat. Mach. Intell. 4, 116–126 (2022).

  15. Jiang, Y. et al. scATAnno: automated cell type annotation for single-cell ATAC sequencing data. Preprint at bioRxiv https://doi.org/10.1101/2023.06.01.543296 (2024).

  16. Srivastava, D. & Mahony, S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochim. Biophys. Acta 1863, 194443 (2020).

  17. Schwessinger, R., Deasy, J., Woodruff, R. T., Young, S. & Branson, K. M. Single-cell gene expression prediction from DNA sequence at large contexts. Preprint at bioRxiv https://doi.org/10.1101/2023.07.26.550634 (2023).

  18. Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).

  19. Tayyebi, Z., Pine, A. R. & Leslie, C. S. Scalable sequence-informed embedding of single-cell ATAC-seq data with CellSpace. Preprint at bioRxiv https://doi.org/10.1101/2022.05.02.490310 (2023).

  20. Chen, K., Zhao, H. & Yang, Y. Capturing large genomic contexts for accurately predicting enhancer–promoter interactions. Brief. Bioinform. 23, bbab577 (2022).

  21. O’Shea, K. & Nash, R. An introduction to convolutional neural networks. Preprint at arXiv https://doi.org/10.48550/arXiv.1511.08458 (2015).

  22. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

  23. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  24. Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022).

  25. Mackay, M. et al. Selective dysregulation of the FcγIIB receptor on memory B cells in SLE. J. Exp. Med. 203, 2157–2164 (2006).

  26. Sundell, T. et al. Single-cell RNA sequencing analyses: interference by the genes that encode the B-cell and T-cell receptors. Brief. Funct. Genom. 22, 263–273 (2023).

  27. Loo, L. et al. Single-cell transcriptomic analysis of mouse neocortical development. Nat. Commun. 10, 134 (2019).

  28. Ruan, C. & Elyaman, W. A new understanding of TMEM119 as a marker of microglia. Front. Cell. Neurosci. 16, 902372 (2022).

  29. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).

  30. Hu, J. et al. SpaGCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18, 1342–1351 (2021).

  31. Xu, C. et al. Automatic cell type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891 (2023).

  32. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).

  33. Hao, Z.-Z. et al. Single-cell transcriptomics of adult macaque hippocampus reveals neural precursor cell populations. Nat. Neurosci. 25, 805–817 (2022).

  34. Zappia, L. & Theis, F. J. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 22, 301 (2021).

  35. Chen, S., Zhang, B., Chen, X., Zhang, X. & Jiang, R. stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics. Bioinformatics 37, i299–i307 (2021).

  36. Song, Q., Su, J. & Zhang, W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat. Commun. 12, 3826 (2021).

  37. Wang, Q. et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11534–11542 (IEEE, 2020).

  38. Wu, Q., Zhao, W., Li, Z., Wipf, D. P. & Yan, J. Nodeformer: a scalable graph structure learning transformer for node classification. Adv. Neural Inf. Process. Syst. 35, 27387–27401 (2022).

  39. Rahimi, A. & Recht, B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 20, 1177–1184 (2007).

  40. Jang, E., Gu, S. & Poole, B. Categorical reparameterization with Gumbel-Softmax. Preprint at arXiv https://doi.org/10.48550/arXiv.1611.01144 (2016).

  41. Kingma, D. P., Salimans, T. & Welling, M. Variational dropout and the local reparameterization trick. Adv. Neural Inf. Process. Syst. 28, 2575–2583 (2015).

  42. Maddison, C. J., Mnih, A. & Teh, Y. W. The concrete distribution: a continuous relaxation of discrete random variables. Preprint at arXiv https://doi.org/10.48550/arXiv.1611.00712 (2016).

  43. Zeng, Y., Zhou, X., Rao, J., Lu, Y. & Yang, Y. Accurately clustering single-cell RNA-seq data by capturing structural relations between cells through graph convolutional network. In Proc. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 519–522 (IEEE, 2020).

  44. Zeng, Y., Wei, Z., Pan, Z., Lu, Y. & Yang, Y. A robust and scalable graph neural network for accurate single-cell classification. Brief. Bioinform. 23, bbab570 (2022).

  45. Slowikowski, K., Hu, X. & Raychaudhuri, S. SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci. Bioinformatics 30, 2496–2497 (2014).

  46. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA 101, 6062–6067 (2004).

  47. Ma, A. et al. Single-cell biological network inference using a heterogeneous graph transformer. Nat. Commun. 14, 964 (2023).

  48. Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).

  49. Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).

  50. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

  51. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).

  52. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  53. Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).

  54. Wu, K. E., Yost, K. E., Chang, H. Y. & Zou, J. BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proc. Natl Acad. Sci. USA 118, e2023070118 (2021).

  55. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).

  56. Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).

  57. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).

  58. Preissl, S. et al. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).

  59. Zeng, Y. et al. Deciphering cell types by integrating scATAC-seq data with genome sequences. Zenodo https://doi.org/10.5281/zenodo.10826453 (2024).

Acknowledgements

This study has been supported by the National Key R&D Program of China (2022YFF1203100), the National Natural Science Foundation of China (T2394502), the Research and Development Project of Pazhou Lab (Huangpu) (2023K0606) and the Postdoctoral Fellowship Program of CPSF (GZC20233321).

Author information

Contributions

Y.Y. conceived and supervised the project. Y.Z., M.L. and N.S. developed and implemented the SANGO algorithm. Y.Y., W.Y. and Y.Z. validated the methods and wrote the paper. P.S., J.F. and J.X. conducted the biological analysis. Y.L. and K.C. discussed and performed the rebuttal experiments. All authors read and approved the final paper.

Corresponding author

Correspondence to Yuedong Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Andrea Tangherloni, Guangyu Wang and Hao Wu for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–25, Note 1 and References.

Reporting Summary

Source data

Source Data Fig. 2

A single file containing all source data for Fig. 2.

Source Data Fig. 3

A single file containing all source data for Fig. 3.

Source Data Fig. 4

A single file containing all source data for Fig. 4.

Source Data Fig. 5

A single file containing all source data for Fig. 5.

Source Data Fig. 6

A single file containing all source data for Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeng, Y., Luo, M., Shangguan, N. et al. Deciphering cell types by integrating scATAC-seq data with genome sequences. Nat Comput Sci 4, 285–298 (2024). https://doi.org/10.1038/s43588-024-00622-7

  • DOI: https://doi.org/10.1038/s43588-024-00622-7
