scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks

Abstract

Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference.

Fig. 1: scBasset architecture.
Fig. 2: scBasset cell representation performance.
Fig. 3: scBasset batch correction.
Fig. 4: scBasset denoising performance evaluation.
Fig. 5: scBasset infers single cell TF activity.

Data availability

We used only public datasets in this study. We downloaded the processed peak set for Buenrostro2018, generated by Chen et al., from https://github.com/pinellolab/scATAC-benchmarking/blob/master/Real_Data/Buenrostro_2018/input/combined.sorted.merged.bed, and the aligned BAM files from https://github.com/pinellolab/scATAC-benchmarking/tree/master/Real_Data/Buenrostro_2018/input/sc-bams_nodup. The original datasets are available from the Gene Expression Omnibus (GEO) under accession code GSE96769. We downloaded the 10x multiome datasets from 10x Genomics at https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/2.0.0/pbmc_granulocyte_sorted_3k for the PBMC dataset and https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/2.0.0/e18_mouse_brain_fresh_5k for the mouse brain dataset. We downloaded the processed peak-by-cell matrix of the sci-ATAC human atlas (GEO accession code GSE184461) from http://renlab.sdsc.edu/kai/Key_Processed_Data/Cell_by_cCRE/. The spear-ATAC preprocessed count matrix ‘K562-Pilot-scATAC-Peak-Matrix-SE.rds’ was downloaded from GEO (accession code GSE168851). Source data are provided with this paper.
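
As an illustration of the peak-set format mentioned above, the following minimal sketch parses BED-style records (chromosome, start, end) and computes peak widths. The helper name `parse_bed` and the inline example records are hypothetical; they are not taken from the Buenrostro2018 file.

```python
from typing import List, Tuple

Peak = Tuple[str, int, int]


def parse_bed(text: str) -> List[Peak]:
    """Parse BED-style text into (chrom, start, end) tuples.

    BED coordinates are 0-based and half-open, so width = end - start.
    """
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and common header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        peaks.append((chrom, int(start), int(end)))
    return peaks


# Hypothetical records for illustration only.
example = "chr1\t10000\t10500\nchr1\t20000\t20300\nchr2\t500\t1200\n"
peaks = parse_bed(example)
widths = [end - start for _, start, end in peaks]
print(peaks)
print(widths)
```

Only the first three columns are read, so files with extra BED columns (name, score, strand) would parse the same way under this sketch.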

Code availability

Code for training and using the scBasset model can be found at https://github.com/calico/scBasset. Instructions and tutorials are provided at the GitHub repository for how to train scBasset models from anndata and to compute cell embeddings, denoise accessibility profiles, perform TF activity inference and ISM from a trained scBasset model. A trained scBasset model for the Buenrostro2018 dataset is available in the kipoi model zoo (https://github.com/kipoi/models/tree/master/scbasset).
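
Sequence-based models of this kind take the DNA sequence under each peak as input. A common preprocessing step is one-hot encoding of each base, sketched below in generic form. This is not code from the scBasset repository: the function name `one_hot_encode`, the A, C, G, T column order and the all-zero encoding of ambiguous bases such as N are assumptions made for illustration.

```python
from typing import List

BASES = "ACGT"  # assumed column order


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Bases outside ACGT (for example N) become all-zero rows (an assumption).
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


print(one_hot_encode("ACGTN"))
```

The result has one row per base, which is the shape a one-dimensional convolution over sequence positions would typically consume.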

References

  1. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).

  2. Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).

  3. Miao, Z. et al. Single cell regulatory landscape of the mouse kidney highlights cellular differentiation programs and renal disease targets. Nat. Commun. 12, 2277 (2021).

  4. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).

  5. Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).

  6. Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).

  7. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).

  8. Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).

  9. Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).

  10. de Boer, C. G. & Regev, A. BROCKMAN: deciphering variance in epigenomic regulators by k-mer factorization. BMC Bioinf. 19, 253 (2018).

  11. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  12. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep-learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  13. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  14. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  15. Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548.e16 (2018).

  16. Qin, Q. et al. Lisa: inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data. Genome Biol. 21, 32 (2020).

  17. Wu, K. E., Yost, K. E., Chang, H. Y. & Zou, J. BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proc. Natl Acad. Sci. USA 118, e2023070118 (2021).

  18. Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).

  19. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  20. Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

  21. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

  22. Li, Z. et al. Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen. Nat. Commun. 12, 6386 (2021).

  23. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).

  24. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  25. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).

  26. Kiss, M. et al. Retinoid X receptor suppresses a metastasis-promoting transcriptional program in myeloid cells via a ligand-insensitive mechanism. Proc. Natl Acad. Sci. USA 114, 10725–10730 (2017).

  27. Pierce, S. E., Granja, J. M. & Greenleaf, W. J. High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer. Nat. Commun. 12, 2969 (2021).

  28. Tuan, D., Solomon, W., Li, Q. & London, I. M. The ‘β-like-globin’ gene domain in human erythroid cells. Proc. Natl Acad. Sci. USA 82, 6384–6388 (1985).

  29. Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. Locus control regions. Blood 100, 3077–3086 (2002).

  30. Tallack, M. R. et al. A global role for KLF1 in erythropoiesis revealed by ChIP-seq in primary erythroid cells. Genome Res. 20, 1052–1063 (2010).

  31. Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).

  32. Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).

  33. Janssens, J. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022).

  34. Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. arXiv. https://arxiv.org/abs/1811.00416 (2018).

  35. Lal, A. et al. Deep learning-based enhancement of epigenomics data with AtacWorks. Nat. Commun. 12, 1507 (2021).

  36. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  37. Hou, W., Ji, Z., Ji, H. & Hicks, S. C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 21, 218 (2020).

  38. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

  39. Jiang, M., Anderson, J., Gillespie, J. & Mayne, M. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinf. 9, 192 (2008).

  40. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).

  41. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

Acknowledgements

We thank V. Agarwal, J. Kimmel and M. Mohamed for feedback on the manuscript. We thank S. Spock for feedback on the code. We also thank N. Bernstein and A. Odak for helpful discussions.

Author information

Contributions

D.R.K. conceived the project. H.Y. and D.R.K. developed the model. H.Y. performed the analysis. H.Y. and D.R.K. prepared the manuscript.

Corresponding authors

Correspondence to Han Yuan or David R. Kelley.

Ethics declarations

Competing interests

H.Y. and D.R.K. are paid employees of Calico Life Sciences.

Peer review

Peer review information

Nature Methods thanks Luca Pinello, Qiangfeng Cliff Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Buenrostro2018 cell embeddings.

t-SNE visualization of different cell embedding methods on Buenrostro2018, including: chromVAR motif, chromVAR kmer (k = 6), PCA, cicero (LSI), ArchR, snapATAC, cisTopic, scDEC, SCALE, peakVI and scBasset.

Extended Data Fig. 2 10x multiome PBMC cell embeddings.

UMAP visualization of different cell embedding methods on the 10x multiome PBMC dataset, including: chromVAR_motif, chromVAR_kmer (k = 6), PCA, cicero (LSI), ArchR, snapATAC, cisTopic, scDEC, SCALE, peakVI and scBasset.

Extended Data Fig. 3 scBasset batch correction on chemistry-mixed PBMC.

a) Model architecture of scBasset-BC. b) Performance of scBasset batch correction models trained on the chemistry-mixed PBMC data. We trained scBasset-BC models with increasing L2 penalty (from 0 to 1e-2). Batch-mixing is measured with kBET and iLISI, and conservation of biological variation is measured with label score. c) Performance comparison of different batch correction methods on the chemistry-mixed PBMC data. Harmony is applied on either PCA, Harmony(PCA), or scBasset embeddings, Harmony(scBasset). Performance is evaluated by kBET, iLISI and label score. d) UMAPs of scBasset batch correction with different L2 penalties on the chemistry-mixed PBMC data. We selected L2 = 1e-6 as the final scBasset-BC model. e) UMAPs of different batch correction methods on the chemistry-mixed PBMC data.

Extended Data Fig. 4 scBasset batch correction on Buenrostro2018.

a) scBasset batch correction performance as a function of L2 penalty on Buenrostro2018 dataset. Performance is evaluated by kBET, iLISI and label score. b) Performance comparison of different batch correction methods on Buenrostro2018 data. Harmony is applied on either PCA, Harmony(PCA), or scBasset embeddings, Harmony(scBasset). Performance is evaluated by kBET, iLISI and label score. c) UMAPs of scBasset batch correction with different L2 penalties on Buenrostro2018 data. We selected L2 = 1e-8 as the final scBasset-BC model. d) UMAPs of different batch correction methods on Buenrostro2018 data.

Extended Data Fig. 5 Additional scBasset denoising and integration results.

a) Comparison of different denoising methods in Buenrostro2018 dataset as evaluated by label score and cell type average Silhouette width (cell type ASW). b) Correlation between gene accessibility score and gene expression across genes for each cell before (x-axis) and after scBasset denoising (y-axis) for the multiome mouse brain dataset. Cells are colored by sequencing depth. c) Left, Comparison of different denoising methods in multiome mouse brain dataset as evaluated by label score and cell type ASW. Right, comparison of different denoising methods in multiome mouse brain dataset as evaluated by correlation between scVI-denoised RNA and denoised ATAC profiles across genes per cell (correlation per cell (RNA,ATAC)), and correlation between scVI-denoised RNA and denoised ATAC profiles across cells per gene (correlation per gene (RNA,ATAC)). d) Integration performance comparison in multiome 10x PBMC dataset when (i) both RNA and ATAC profiles are raw; (ii) only RNA profile is denoised with scVI; (iii) only ATAC profile is denoised with scBasset; and (iv) both RNA and ATAC profiles are denoised. n = 2714 cells for each boxplot. The boxplot shows min and max as whiskers (excluding outliers), 1st and 3rd quartiles as boxes and median in the center. Outliers (> 1.5x interquartile range away from the box) are not shown. e) UMAPs of RNA and ATAC co-embedding after integration for 10x multiome mouse brain dataset. Left, integration performed on RNA (blue) and raw ATAC (red) profile embeddings. Right, integration performed on RNA (blue) and scBasset-denoised ATAC (red).
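
The two correlation summaries in panel c differ only in the axis over which the correlation is taken. The sketch below shows this distinction on a small cells-by-genes matrix. The values are invented for illustration, and the helper names `pearson`, `correlation_per_cell` and `correlation_per_gene` are hypothetical rather than functions from the paper's code.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_per_cell(rna: List[List[float]], atac: List[List[float]]) -> List[float]:
    # Each row is one cell, so the correlation runs across genes.
    return [pearson(r, a) for r, a in zip(rna, atac)]


def correlation_per_gene(rna: List[List[float]], atac: List[List[float]]) -> List[float]:
    # Transpose so each row is one gene; the correlation runs across cells.
    return [pearson(r, a) for r, a in zip(zip(*rna), zip(*atac))]


# Invented matrices: 3 cells (rows) by 3 genes (columns).
rna = [[1.0, 2.0, 3.0], [2.0, 4.0, 7.0], [3.0, 5.0, 8.0]]
atac = [[2.0, 4.0, 6.0], [1.0, 2.0, 2.5], [3.0, 2.0, 1.0]]

per_cell = correlation_per_cell(rna, atac)
per_gene = correlation_per_gene(rna, atac)
print(per_cell)
print(per_gene)
```

Per-cell correlation yields one value per cell and per-gene correlation one value per gene, so the two summaries generally have different lengths on real data.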

Extended Data Fig. 6 Motif activity inference using scBasset and chromVAR on Buenrostro2018 dataset.

a) UMAPs showing scBasset-predicted TF activity. b) Boxplots showing scBasset-predicted TF activity by cell type. c) UMAPs showing chromVAR-predicted TF activity. d) Boxplots showing chromVAR-predicted TF activity by cell type. For boxplots in b) and d), n = 502, 402, 344, 160, 142, 141, 138, 78, 64, 60 cells for each of CMP, GMP, HSC, LMPP, MPP, pDC, MEP, CLP, mono, UNK cell types. The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box by 1.5x the interquartile range (IQR). Outliers are not shown.
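
The boxplot convention described in this legend can be sketched as follows. The data are invented, the helper name `boxplot_stats` is hypothetical, and two details are assumptions: quartiles use linear interpolation (the `inclusive` method of Python's `statistics.quantiles`), and whiskers are drawn to the most extreme data points that lie within 1.5x IQR of the box.

```python
from statistics import quantiles
from typing import Dict, List


def boxplot_stats(values: List[float]) -> Dict[str, object]:
    """Return quartiles, whisker ends and outliers under the 1.5x IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme points inside the fences (assumed).
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Invented values with one clear outlier.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this example the value 50 lies beyond the upper fence, so it would be omitted from a plot in which outliers are not shown.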

Extended Data Fig. 7 TF expression and activity correlation for the 10x multiome datasets.

Scatterplots of correlations between chromVAR-inferred activity and expression (x-axis) versus correlations of scBasset-inferred TF activity and expression (y-axis) for activating TFs (left) and repressive TFs (right) in the 10x multiome PBMC (top) and 10x multiome mouse brain (bottom). Activating TFs are TFs for which both scBasset and chromVAR agree on a positive correlation between TF expression and activity. Repressive TFs are TFs for which both scBasset and chromVAR agree on a negative correlation between TF expression and activity. A one-sided Wilcoxon signed rank test was performed.

Extended Data Fig. 8 Motif activity inference using scBasset and chromVAR on 10x multiome PBMC dataset.

UMAP visualization of TF expression (left), scBasset TF activity (middle), and chromVAR TF activity (right) for additional known PBMC regulators. Pearson correlations between inferred TF activity and expression are shown in the titles.

Extended Data Fig. 9 Comparison of scBasset and chromVAR in TF perturbation experiments.

a) Embeddings of K562 cells transfected by a pool of 9 CRISPRi sgRNAs targeting GATA1 (sgGATA1) and GATA2 (sgGATA2) and 9 inert sgRNA controls (sgNT). b) Performance comparison of scBasset and chromVAR in distinguishing sgGATA1 cells from sgNT cells in ROC curves. c) Performance comparison of scBasset and chromVAR in distinguishing sgGATA2 cells from sgNT cells in ROC curves.
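
An ROC comparison of this kind is often summarized by the area under the curve (AUC), which equals the probability that a randomly chosen positive cell receives a higher score than a randomly chosen negative cell. The sketch below computes it by direct pair counting. The scores are invented, the helper name `roc_auc` is hypothetical, and the assumption that higher scores indicate perturbed cells is made only for illustration.

```python
from typing import Sequence


def roc_auc(positive: Sequence[float], negative: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in positive:
        for q in negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive) * len(negative))


# Invented per-cell scores: perturbed (for example sgGATA1) versus control (sgNT).
perturbed_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2]
auc = roc_auc(perturbed_scores, control_scores)
print(auc)
```

Pair counting is quadratic in the number of cells; a rank-based formulation gives the same value more efficiently on large datasets.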

Extended Data Fig. 10 scBasset result on sci-ATAC human cell atlas.

a) UMAP of sci-ATAC human cell atlas. Cells colored by tissue of origin. b-d) Runtime, peak CPU memory and GPU memory usage of scBasset as a function of the number of cells in the dataset.

Supplementary information

Supplementary Information

Supplementary Methods, Supplementary Notes and Supplementary Figs. 1–9

Reporting Summary

Source data

Source Data Fig. 1

Statistical Source Data.

Source Data Fig. 2

Statistical Source Data.

Source Data Fig. 3

Statistical Source Data.

Source Data Fig. 4

Statistical Source Data.

Source Data Fig. 5

Statistical Source Data.

Source Data Extended Data Fig. 3

Statistical Source Data.

Source Data Extended Data Fig. 4

Statistical Source Data.

Source Data Extended Data Fig. 5

Statistical Source Data.

Source Data Extended Data Fig. 6

Statistical Source Data.

Source Data Extended Data Fig. 7

Statistical Source Data.

Source Data Extended Data Fig. 10

Statistical Source Data.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yuan, H., Kelley, D.R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat Methods 19, 1088–1096 (2022). https://doi.org/10.1038/s41592-022-01562-8

  • DOI: https://doi.org/10.1038/s41592-022-01562-8
