scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks

Yuan, Han; Kelley, David R.

doi:10.1038/s41592-022-01562-8

Article
Published: 08 August 2022

Nature Methods volume 19, pages 1088–1096 (2022)

An Author Correction to this article was published on 23 December 2022

This article has been updated

Abstract

Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference.

**Fig. 2: scBasset cell representation performance.**

**Fig. 4: scBasset denoising performance evaluation.**

**Fig. 5: scBasset infers single cell TF activity.**

Data availability

We used only public datasets in this study. For Buenrostro2018, we downloaded the processed peak set generated by Chen et al. from https://github.com/pinellolab/scATAC-benchmarking/blob/master/Real_Data/Buenrostro_2018/input/combined.sorted.merged.bed and the aligned BAM files from https://github.com/pinellolab/scATAC-benchmarking/tree/master/Real_Data/Buenrostro_2018/input/sc-bams_nodup. The original datasets are available from the Gene Expression Omnibus (GEO) under accession code GSE96769. We downloaded the 10x multiome datasets from 10x Genomics at https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/2.0.0/pbmc_granulocyte_sorted_3k for the PBMC dataset and https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/2.0.0/e18_mouse_brain_fresh_5k for the mouse brain dataset. We downloaded the processed peak-by-cell matrix of the sci-ATAC human atlas (GEO accession code GSE184461) from http://renlab.sdsc.edu/kai/Key_Processed_Data/Cell_by_cCRE/. The spear-ATAC preprocessed count matrix ‘K562-Pilot-scATAC-Peak-Matrix-SE.rds’ was downloaded from GEO (accession code GSE168851). Source data are provided with this paper.

Code availability

Code for training and using the scBasset model can be found at https://github.com/calico/scBasset. Instructions and tutorials are provided at the GitHub repository for how to train scBasset models from anndata and to compute cell embeddings, denoise accessibility profiles, perform TF activity inference and ISM from a trained scBasset model. A trained scBasset model for the Buenrostro2018 dataset is available in the kipoi model zoo (https://github.com/kipoi/models/tree/master/scbasset).

Change history

23 December 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41592-022-01754-2

References

Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Miao, Z. et al. Single cell regulatory landscape of the mouse kidney highlights cellular differentiation programs and renal disease targets. Nat. Commun. 12, 2277 (2021).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).
Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).
Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).
Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
de Boer, C. G. & Regev, A. BROCKMAN: deciphering variance in epigenomic regulators by k-mer factorization. BMC Bioinf. 19, 253 (2018).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep-learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548.e16 (2018).
Qin, Q. et al. Lisa: inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data. Genome Biol. 21, 32 (2020).
Wu, K. E., Yost, K. E., Chang, H. Y. & Zou, J. BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proc. Natl Acad. Sci. USA 118, e2023070118 (2021).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Li, Z. et al. Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen. Nat. Commun. 12, 6386 (2021).
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
Kiss, M. et al. Retinoid X receptor suppresses a metastasis-promoting transcriptional program in myeloid cells via a ligand-insensitive mechanism. Proc. Natl Acad. Sci. USA 114, 10725–10730 (2017).
Pierce, S. E., Granja, J. M. & Greenleaf, W. J. High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer. Nat. Commun. 12, 2969 (2021).
Tuan, D., Solomon, W., Li, Q. & London, I. M. The ‘β-like-globin’ gene domain in human erythroid cells. Proc. Natl Acad. Sci. USA 82, 6384–6388 (1985).
Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. Locus control regions. Blood 100, 3077–3086 (2002).
Tallack, M. R. et al. A global role for KLF1 in erythropoiesis revealed by ChIP-seq in primary erythroid cells. Genome Res. 20, 1052–1063 (2010).
Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).
Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
Janssens, J. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022).
Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. arXiv. https://arxiv.org/abs/1811.00416 (2018).
Lal, A. et al. Deep learning-based enhancement of epigenomics data with AtacWorks. Nat. Commun. 12, 1507 (2021).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Hou, W., Ji, Z., Ji, H. & Hicks, S. C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 21, 218 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. https://doi.org/10.1038/s41467-021-21583-9 (2021).
Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: A deep generative model for single-cell chromatin accessibility analysis. Cell Rep. Methods 2, 100182 (2022).
Liu, Q., Chen, S., Jiang, R. & Wong, W. H. Simultaneous deep generative modelling and clustering of single-cell genomic data. Nat. Mach. Intell. https://doi.org/10.1038/s42256-021-00333-y (2021).
van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell https://doi.org/10.1016/j.cell.2018.05.061 (2018).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Jiang, M., Anderson, J., Gillespie, J. & Mayne, M. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinf. 9, 192 (2008).
Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

Acknowledgements

We thank V. Agarwal, J. Kimmel and M. Mohamed for feedback on the manuscript. We thank S. Spock for feedback on the code. We also thank N. Bernstein and A. Odak for helpful discussions.

Author information

Authors and Affiliations

Calico Life Sciences, South San Francisco, CA, USA
Han Yuan & David R. Kelley

Contributions

D.R.K. conceived the project. H.Y. and D.R.K. developed the model. H.Y. performed the analysis. H.Y. and D.R.K. prepared the manuscript.

Corresponding authors

Correspondence to Han Yuan or David R. Kelley.

Ethics declarations

Competing interests

H.Y. and D.R.K. are paid employees of Calico Life Sciences.

Peer review

Peer review information

Nature Methods thanks Luca Pinello, Qiangfeng Cliff Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Buenrostro2018 cell embeddings.

t-SNE visualization of different cell embedding methods on Buenrostro2018, including: chromVAR motif, chromVAR kmer (k = 6), PCA, cicero (LSI), ArchR, snapATAC, cisTopic, scDEC, SCALE, peakVI and scBasset.

Extended Data Fig. 2 10x multiome PBMC cell embeddings.

UMAP visualization of different cell embedding methods on the 10x multiome PBMC dataset, including: chromVAR_motif, chromVAR_kmer (k = 6), PCA, cicero (LSI), ArchR, snapATAC, cisTopic, scDEC, SCALE, peakVI and scBasset.

Extended Data Fig. 3 scBasset batch correction on chemistry-mixed PBMC.

a) Model architecture of scBasset-BC. b) Performance of scBasset batch correction models trained on the chemistry-mixed PBMC data. We trained scBasset-BC models with increasing L2 penalty (from 0 to 1e-2). Batch-mixing is measured with kBET and iLISI, and conservation of biological variation is measured with label score. c) Performance comparison of different batch correction methods on the chemistry-mixed PBMC data. Harmony is applied to either PCA embeddings, denoted Harmony(PCA), or scBasset embeddings, denoted Harmony(scBasset). Performance is evaluated by kBET, iLISI and label score. d) UMAPs of scBasset batch correction with different L2 penalties on the chemistry-mixed PBMC data. We selected L2 = 1e-6 as the final scBasset-BC model. e) UMAPs of different batch correction methods on the chemistry-mixed PBMC data.

Extended Data Fig. 4 scBasset batch correction on Buenrostro2018.

a) scBasset batch correction performance as a function of L2 penalty on the Buenrostro2018 dataset. Performance is evaluated by kBET, iLISI and label score. b) Performance comparison of different batch correction methods on Buenrostro2018 data. Harmony is applied to either PCA embeddings, denoted Harmony(PCA), or scBasset embeddings, denoted Harmony(scBasset). Performance is evaluated by kBET, iLISI and label score. c) UMAPs of scBasset batch correction with different L2 penalties on Buenrostro2018 data. We selected L2 = 1e-8 as the final scBasset-BC model. d) UMAPs of different batch correction methods on Buenrostro2018 data.

Extended Data Fig. 5 Additional scBasset denoising and integration results.

a) Comparison of different denoising methods in the Buenrostro2018 dataset as evaluated by label score and cell type average silhouette width (cell type ASW). b) Correlation between gene accessibility score and gene expression across genes for each cell before (x-axis) and after scBasset denoising (y-axis) for the multiome mouse brain dataset. Cells are colored by sequencing depth. c) Left, comparison of different denoising methods in the multiome mouse brain dataset as evaluated by label score and cell type ASW. Right, comparison of different denoising methods in the multiome mouse brain dataset as evaluated by correlation between scVI-denoised RNA and denoised ATAC profiles across genes per cell (correlation per cell (RNA,ATAC)), and correlation between scVI-denoised RNA and denoised ATAC profiles across cells per gene (correlation per gene (RNA,ATAC)). d) Integration performance comparison in the multiome 10x PBMC dataset when (i) both RNA and ATAC profiles are raw; (ii) only the RNA profile is denoised with scVI; (iii) only the ATAC profile is denoised with scBasset; and (iv) both RNA and ATAC profiles are denoised. n = 2714 cells for each boxplot. The boxplot shows min and max as whiskers (excluding outliers), 1st and 3rd quartiles as boxes and median in the center. Outliers (> 1.5x interquartile range away from the box) are not shown. e) UMAPs of RNA and ATAC co-embedding after integration for the 10x multiome mouse brain dataset. Left, integration performed on RNA (blue) and raw ATAC (red) profile embeddings. Right, integration performed on RNA (blue) and scBasset-denoised ATAC (red).

Extended Data Fig. 6 Motif activity inference using scBasset and chromVAR on Buenrostro2018 dataset.

a) UMAPs showing scBasset-predicted TF activity. b) Boxplots showing scBasset-predicted TF activity per cell type. c) UMAPs showing chromVAR-predicted TF activity. d) Boxplots showing chromVAR-predicted TF activity per cell type. For the boxplots in b) and d), n = 502, 402, 344, 160, 142, 141, 138, 78, 64, 60 cells for each of the CMP, GMP, HSC, LMPP, MPP, pDC, MEP, CLP, mono and UNK cell types. The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box by 1.5x the interquartile range (IQR). Outliers are not shown.
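
The boxplot convention described in this legend can be illustrated with a minimal Python sketch. It is not taken from the scBasset code: the function names `percentile` and `boxplot_stats` are hypothetical, quartiles are assumed to use linear interpolation, and whiskers are assumed to end at the most extreme data point within 1.5x IQR of the box (the common Tukey convention).

```python
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 1]) of pre-sorted data."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def boxplot_stats(values: Sequence[float], whisker: float = 1.5) -> Dict[str, object]:
    """Box (Q1, median, Q3), whisker ends and outliers for one group of cells."""
    data = sorted(values)
    q1 = percentile(data, 0.25)
    median = percentile(data, 0.50)
    q3 = percentile(data, 0.75)
    iqr = q3 - q1
    # Fences sit 1.5x IQR beyond the box; points outside them are outliers.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Assumption: whiskers stop at the last data point inside each fence.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Toy activity values for one hypothetical cell type (not data from the paper).
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

In this toy input the value 100 lies beyond the upper fence, so it would be treated as an outlier and omitted from a plot drawn under the convention above.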

Extended Data Fig. 7 TF expression and activity correlation for the 10x multiome datasets.

Scatterplots of correlations between chromVAR-inferred activity and expression (x-axis) versus correlations of scBasset-inferred TF activity and expression (y-axis) for activating TFs (left) and repressive TFs (right) in the 10x multiome PBMC (top) and 10x multiome mouse brain (bottom). Activating TFs are TFs for which both scBasset and chromVAR agree on a positive correlation between TF expression and activity. Repressive TFs are TFs for which both scBasset and chromVAR agree on a negative correlation between TF expression and activity. A one-sided Wilcoxon signed rank test was performed.

Extended Data Fig. 8 Motif activity inference using scBasset and chromVAR on 10x multiome PBMC dataset.

UMAP visualization of TF expression (left), scBasset TF activity (middle), and chromVAR TF activity (right) for additional known PBMC regulators. Pearson correlations between inferred TF activity and expression are shown in the titles.

Extended Data Fig. 9 Comparison of scBasset and chromVAR in TF perturbation experiments.

a) Embeddings of K562 cells transfected by a pool of 9 CRISPRi sgRNAs targeting GATA1 (sgGATA1) and GATA2 (sgGATA2) and 9 inert sgRNA controls (sgNT). b) Performance comparison of scBasset and chromVAR in distinguishing sgGATA1 cells from sgNT cells in ROC curves. c) Performance comparison of scBasset and chromVAR in distinguishing sgGATA2 cells from sgNT cells in ROC curves.
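
As a rough illustration of how an ROC comparison of this kind can be computed, the sketch below builds ROC points and the area under the curve from per-cell scores and binary labels. It is a generic implementation, not the authors' evaluation code: the function names, the toy scores and the labeling (1 for perturbed cells, 0 for sgNT controls) are assumptions, and in practice the score might be a negated TF activity so that knocked-down cells rank highest.

```python
from typing import List, Sequence, Tuple


def roc_curve_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, highest score first."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all cells tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 1 = perturbed cell, 0 = control cell (hypothetical values).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_curve_points(scores, labels)))
```

Running the same function on the scores from two methods over the same cells would give two curves and two AUC values that can be compared directly.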

Extended Data Fig. 10 scBasset result on sci-ATAC human cell atlas.

a) UMAP of sci-ATAC human cell atlas. Cells colored by tissue of origin. b-d) Runtime, peak CPU memory and GPU memory usage of scBasset as a function of the number of cells in the dataset.

Supplementary information

Supplementary Information

Supplementary Methods, Supplementary Notes and Supplementary Figs. 1–9

Reporting Summary

Source data

Source Data Fig. 1

Statistical Source Data.

Source Data Fig. 2

Statistical Source Data.

Source Data Fig. 3

Statistical Source Data.

Source Data Fig. 4

Statistical Source Data.

Source Data Fig. 5

Statistical Source Data.

Source Data Extended Data Fig. 3

Statistical Source Data.

Source Data Extended Data Fig. 4

Statistical Source Data.

Source Data Extended Data Fig. 5

Statistical Source Data.

Source Data Extended Data Fig. 6

Statistical Source Data.

Source Data Extended Data Fig. 7

Statistical Source Data.

Source Data Extended Data Fig. 10

Statistical Source Data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yuan, H., Kelley, D.R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat Methods 19, 1088–1096 (2022). https://doi.org/10.1038/s41592-022-01562-8

Received: 08 September 2021
Accepted: 27 June 2022
Published: 08 August 2022
Issue Date: September 2022
DOI: https://doi.org/10.1038/s41592-022-01562-8
