Abstract
Droplet-based single-cell assays, including single-cell RNA sequencing (scRNA-seq), single-nucleus RNA sequencing (snRNA-seq) and cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), generate considerable background noise counts, the hallmark of which is nonzero counts in cell-free droplets and off-target gene expression in unexpected cell types. Such systematic background noise can lead to batch effects and spurious differential gene expression results. Here we develop a deep generative model based on the phenomenology of noise generation in droplet-based assays. The proposed model accurately distinguishes cell-containing droplets from cell-free droplets, learns the background noise profile and provides noise-free quantification in an end-to-end fashion. We implement this approach in the scalable and robust open-source software package CellBender. Analysis of simulated data demonstrates that CellBender operates near the theoretically optimal denoising limit. Extensive evaluations using real datasets and experimental benchmarks highlight enhanced concordance between droplet-based single-cell data and established gene expression patterns, while the learned background noise profile provides evidence of degraded or uncaptured cell types.
Data availability
The datasets used in this study are the following: pbmc8k (the publicly available pbmc8k dataset from 10x Genomics called ‘8k PBMCs from a healthy donor’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/8-k-pbm-cs-from-a-healthy-donor-2-standard-2-1-0); heart600k (the published dataset from the Broad–Bayer PCL called ‘Single-nuclei profiling of human dilated and hypertrophic cardiomyopathy’ (ref. 23), run with 10x Genomics 3′ capture version 3 chemistry and analyzed with CellRanger version 4.0.0, available at https://singlecell.broadinstitute.org/single_cell/study/SCP1303); hgmm12k (the publicly available hgmm12k dataset from 10x Genomics called ‘12k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/12-k-1-1-mixture-of-fresh-frozen-human-hek-293-t-and-mouse-nih-3-t-3-cells-2-standard-2-1-0); pbmc5k (the publicly available pbmc5k dataset with antibodies from 10x Genomics called ‘5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor with a Panel of TotalSeq™-B Antibodies (Next GEM)’, run with version 3 Next GEM chemistry and analyzed with CellRanger version 3.1.0, available at https://www.10xgenomics.com/resources/datasets/5-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-with-cell-surface-proteins-next-gem-3-1-standard-3-1-0); and rat6k (an snRNA-seq dataset from a healthy Wistar rat left atrium, comprising approximately 6,000 nuclei, processed on the 10x Genomics platform using version 2 chemistry and analyzed with CellRanger version 3.1.0. The dataset was provided by P.T.E.’s group at the Broad Institute as part of the Broad–Bayer PCL. The experiment was performed by A.A. and A.-D.A. 
The dataset is publicly available on Broad’s Single Cell Portal at https://singlecell.broadinstitute.org/single_cell/study/SCP2148). Datasets analyzed only in the Supplementary Information are as follows: smartseq3xpress_pbmc (a Smart-seq3xpress (well-based) scRNA-seq dataset from healthy human PBMCs called ‘Scalable full-transcript coverage single-cell RNA sequencing of PBMCs using Smart-seq3xpress’ and published by Hagemann-Jensen et al. (ref. 59). This dataset was kindly provided to the authors in count matrix format by C. Ziegenhain, an author of the referenced paper. We subset the data to the sixteen 384-well plates that came from ‘donor8’); and fluentbio_pbmc (the publicly available scRNA-seq dataset of healthy human PBMCs from Fluent BioSciences called ‘Profiling 20k Immune Cells in Healthy PBMCs from a Single T20 Reaction’, generated with T20 PIPseq and analyzed with PIPseeker version 1.1.3 by Fluent BioSciences (ref. 60), available at https://fbs-public.s3.us-east-2.amazonaws.com/public-datasets/pbmc/raw_matrix.tar.gz).
Code availability
CellBender can be obtained from https://github.com/broadinstitute/CellBender. Additional documentation is available at https://cellbender.readthedocs.io. CellBender modules are also available as workflows on Terra (https://app.terra.bio), a secure open platform for collaborative omic analysis, and can be run on the cloud with zero set-up. We have implemented the model and the inference method using the Pyro probabilistic programming language (ref. 16) and PyTorch (ref. 61) and packaged them as a user-friendly, production-grade, stand-alone command-line tool. We refer to the background noise-removal algorithm implemented in CellBender as remove-background.
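As an illustration of how the command-line tool might be driven from a script, the following Python sketch assembles a remove-background invocation without running it. The file names are hypothetical placeholders, the helper function is not part of CellBender, and the flags shown (--input, --output, --fpr, --cuda) should be checked against `cellbender remove-background --help` for the installed version.

```python
import shlex


def remove_background_command(input_h5, output_h5, fpr=0.01, use_cuda=False):
    """Build (but do not run) a `cellbender remove-background` command line.

    Hypothetical helper for illustration only; verify the flags against the
    documentation of the installed CellBender version.
    """
    cmd = [
        "cellbender", "remove-background",
        "--input", str(input_h5),    # raw (unfiltered) count matrix
        "--output", str(output_h5),  # denoised output file
        "--fpr", str(fpr),           # nominal false positive rate
    ]
    if use_cuda:
        cmd.append("--cuda")         # request GPU execution
    return cmd


# Hypothetical file names; pass the list to subprocess.run(cmd, check=True) to execute.
cmd = remove_background_command("raw_feature_bc_matrix.h5", "denoised.h5")
print(shlex.join(cmd))
```

Building the argument list separately from executing it makes the call easy to log and to inspect before a long run is launched.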
References
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Liu, L. et al. Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity. Nat. Commun. 10, 470 (2019).
Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet based single cell RNA sequencing data. GigaScience 9, giaa151 (2020).
Haas, B. J. et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 21, 494–504 (2011).
Dixit, A. Correcting chimeric crosstalk in single cell RNA-seq experiments. Preprint at bioRxiv https://doi.org/10.1101/093237 (2016).
Thompson, J. R., Marcelino, L. A. & Polz, M. F. Heteroduplexes in mixed-template amplifications: formation, consequence and elimination by ‘reconditioning PCR’. Nucleic Acids Res. 30, 2083–2088 (2002).
Perkel, J. M. Single-cell analysis enters the multiomics age. Nature 595, 614–616 (2021).
Bingham, E. et al. Pyro: deep universal probabilistic programming. J. Mach. Learn. Res. 20, 1–6 (2019).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Dani, N. et al. A cellular and spatial map of the choroid plexus across brain ventricles and ages. Cell 184, 3056–3074 (2021).
Popova, G. et al. Human microglia states are conserved across experimental models and regulate neural stem cell responses in chimeric organoids. Cell Stem Cell 28, 2153–2166 (2021).
Holloway, E. M. et al. Mapping development of the human intestinal niche at single-cell resolution. Cell Stem Cell 28, 568–580 (2021).
Tucker, N. R. et al. Transcriptional and cellular diversity of the human heart. Circulation 142, 466–482 (2020).
Tucker, N. R. et al. Myocyte specific upregulation of ACE2 in cardiovascular disease: implications for SARS-CoV-2 mediated myocarditis. Circulation 142, 708–710 (2020).
Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
Sun, W. et al. snRNA-seq reveals a subpopulation of adipocytes that regulates thermogenesis. Nature 587, 98–102 (2020).
Dong, H. et al. Identification of a regulatory pathway inhibiting adipogenesis via RSPO2. Nat. Metab. 4, 90–105 (2022).
Delorey, T. M. et al. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595, 107–113 (2021).
Xu, G. et al. The differential immune responses to COVID-19 in peripheral and lung revealed by single-cell RNA sequencing. Cell Discov. 6, 73 (2020).
Ziegler, C. G. K. et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell 184, 4713–4733 (2021).
Melms, J. C. et al. A molecular single-cell lung atlas of lethal COVID-19. Nature 595, 114–119 (2021).
Wang, S. et al. A single-cell transcriptomic landscape of the lungs of patients with COVID-19. Nat. Cell Biol. 23, 1314–1328 (2021).
Zazhytska, M. et al. Non-cell-autonomous disruption of nuclear architecture as a potential cause of COVID-19-induced anosmia. Cell 185, 1052–1064 (2022).
Eraslan, G. et al. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Science 376, eabl4290 (2022).
Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://doi.org/10.48550/arXiv.1312.6114 (2014).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 36, 4415–4422 (2020).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640 (2019).
Uhlen, M. et al. A genome-wide transcriptomic analysis of protein-coding genes in human blood cells. Science 366, eaax9198 (2019).
Neutrophil Analysis in 10x Genomics Single Cell Gene Expression Assays Report No. CG000444 (10x Genomics, 2021).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
Lun, A. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
Heiser, C. N., Wang, V. M., Chen, B., Hughey, J. J. & Lau, K. S. Automated quality control and cell identification of droplet-based single-cell data using dropkick. Genome Res. 31, 1742–1752 (2021).
Petukhov, V. et al. dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biol. 19, 78 (2018).
Oberdoerffer, S. et al. Regulation of CD45 alternative splicing by heterogeneous ribonucleoprotein, hnRNPLL. Science 321, 686–691 (2008).
Luecken, M. D. & Theis, F. J. Current best practices in single cell RNA seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Clarke, Z. A. et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat. Protoc. 16, 2749–2764 (2021).
Caglayan, E., Liu, Y. & Konopka, G. Neuronal ambient RNA contamination causes misinterpreted and masked cell types in brain single-nuclei datasets. Neuron 110, 4043–4056 (2022).
Di Bella, D. J. et al. Molecular logic of cellular diversification in the mouse cerebral cortex. Nature 595, 554–559 (2021).
van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
Jiang, R., Sun, T., Song, D. & Li, J. J. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 23, 31 (2022).
Hoffman, M., Blei, D. M., Wang, C. & Paisley, J. Stochastic variational inference. Preprint at https://doi.org/10.48550/arXiv.1206.7051 (2012).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Ganchev, K., Graça, J., Gillenwater, J. & Taskar, B. Posterior regularization for structured latent variable models. J. Mach. Learn. Res. 11, 2001–2049 (2010).
Hagemann-Jensen, M., Ziegenhain, C. & Sandberg, R. Scalable single-cell RNA sequencing from full transcripts with Smart-seq3xpress. Nat. Biotechnol. 40, 1452–1457 (2022).
Clark, I. C. et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01685-z (2023).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In 33rd Conference on Neural Information Processing Systems 12 (NeurIPS, 2019).
Acknowledgements
We thank L. D’Alessio, C. Roselli, C. Porter, E. Bingham, F. Obermeyer, J. Nemesh, B. Wang, B. Babadi, V. Popic, A. Wysoker, A. Subramanian, N. Tucker, Y. Farjoun, T. Tickle and A. Carr for insightful discussions at various stages of this project. S.J.F., M.D.C. and M.B. acknowledge financial support from the Broad–Bayer PCL. M.B. acknowledges additional support from the SPARC grant ‘Development of Production-Grade Computational Methods for Single-Cell Genomics’ from the Broad Institute. The publicly available rat6k snRNA-seq dataset was generated by the PCL, and the experiment was performed by A.A. and A.-D.A. We additionally thank C. Ziegenhain for providing a count matrix for the published Smart-seq3xpress PBMC dataset analyzed in Supplementary Section 2.4.
Author information
Contributions
S.J.F. and M.B. jointly developed the probabilistic model, software and study design and jointly wrote the paper. S.J.F. additionally performed statistical analyses on real and simulated data. A.A. and A.-D.A. collected the rat6k dataset under the supervision of P.T.E. M.D.C. provided critical feedback at various stages of the project and analyzed the heart600k dataset. M.B. and P.T.E. jointly supervised the project, with additional input from A.A.P., J.C.M. and E.B.
Ethics declarations
Competing interests
A.-D.A. is an employee of Bayer US LLC (a subsidiary of Bayer AG) and may own stock in Bayer AG. A.A.P. is employed as a venture partner at Google Ventures, and he is also supported by a grant from Bayer AG to the Broad Institute focused on machine learning for clinical trial design. P.T.E. is supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of cardiovascular diseases. P.T.E. has also served on advisory boards or consulted for Bayer AG, Quest Diagnostics, MyoKardia and Novartis. The other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Eran Mukamel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 The CellBender model.
(a) The CellBender generative model for noisy single-cell count data. (b) The variational posterior used by CellBender. The neural network NN_enc takes the observed data as input and yields the parameters of various variational distributions assumed for the local latent variables. The global latent variables are treated in the usual mean-field approximation.
Extended Data Fig. 2 Violin plots showing the count distributions of lysozyme, LYZ, per cluster before and after CellBender denoising.
(nFPR was 0.01.) The off-target counts are effectively removed, with counts remaining in clusters 0 (CD14+ monocytes C), 10 (FCGR3A+ monocytes NC), and 12 (plasmacytoid dendritic cells).
Extended Data Fig. 3 UMAPs created from the CellBender-analyzed pbmc8k data, showing increased expression specificity of marker genes for different cell types after CellBender denoising as compared to the raw data.
a–d, UMAP plots of the expression of NKG7, CST3, AIF1 and LST1 in each cell before and after CellBender.
Extended Data Fig. 4 UMI curves from the raw data together with various CellBender outputs for the pbmc8k and rat6k datasets.
(a-d) pbmc8k, and (e-h) rat6k. (a,e) The raw UMI curves, annotated with areas of cells and empty droplets. Notably, the distinction is much more difficult in (e), the nuclei dataset extracted from heart tissue. (b,f) Cell probabilities inferred by CellBender, shown on the same UMI curves as in (a,e), respectively. The region of transition from “surely-cell” to “surely-empty” is much broader in the snRNA-seq dataset. (c,g) First two principal components of the latent gene expression embedding inferred by CellBender, colored by Leiden clustering from a separate scanpy analysis. The structure very closely reflects the labels attributed by that separate analysis. (d,h) Scatter plots showing removal of each gene by CellBender (each dot is a gene; MALAT1 is off-scale). Several top denoised genes are indicated.
Extended Data Fig. 5 Presence of doublets does not impact the denoising performance of CellBender.
(a,c,e) Simulated dataset without doublets. (b,d,f) Simulated dataset where 20% of the cell-containing droplets are doublets. (a) UMAP of the gene expression profile of the three simulated cell types. (b) Same as (a), but including doublets, which are highlighted in bold color. Doublets with cells of two different types form their own clusters in UMAP space, due to their unique transcriptional profile. (c) The learned CellBender prior on gene expression, visualized via PCA, shows three clusters for the three cell types. (d) With doublets present, the prior on gene expression now additionally contains clusters for each type of doublet. From the standpoint of CellBender, a doublet is like a unique cell type. (e,f) Denoising performance has been quantified using a ROC curve, and shows that denoising metrics are nearly identical (TPR 0.750, FPR 0.041) whether doublets are present or not. The error bars shown in panels e-f correspond to the interquartile range of TPR (vertical) and FPR (horizontal) over N=2400 simulated cells.
Extended Data Fig. 6 Published human scRNA-seq PBMC dataset from the well-based Smart-seq3xpress protocol (ref. 59).
This dataset is extremely clean to begin with. The UMAP shows the expected cell types, nicely clustered. The two dotplots show expression of immune cell marker genes before and after CellBender. Some genes show improvement, but many look quite similar, as expected for a clean dataset. UMAP plots on the right show cleanup of a few genes after CellBender.
Extended Data Fig. 7 Publicly available human scRNA-seq PBMC dataset from the Fluent BioSciences PIPseq platform (ref. 60).
Droplets are generated by vigorous vortexing, and thus we expect more ambient RNA than in a microfluidics experiment. The UMAP shows the expected cell types, in addition to some probable doublets. The two dotplots show expression of immune cell marker genes before and after CellBender. Many genes show significant cleanup. UMAP plots on the right show rather marked cleanup of a few genes after CellBender.
Extended Data Fig. 8 Systematic background noise as a source of batch variation and spurious differential expression across batches.
(a) Setup of the cohort of simulated datasets, where there are two cell types whose expression profiles are taken from real data (rat6k) for cardiomyocytes and fibroblasts. The only difference between simulations from batch A and batch B is the number of cardiomyocytes. Noise ends up being different in the two batches due to these cell number differences. The “truth” in this simulated cohort is that a cell type’s expression profile does not differ between batches. (b-d) Raw data. (e-g) CellBender denoised data. (b) Dotplot showing top cardiomyocyte and fibroblast marker genes. Background noise causes marker genes to show up in the off-target cell type at a low level. (e) Marked cleanup of the dataset at an aggregate level. (c,f) The cardiomyocytes show no differentially expressed genes between batch A and B, before or after CellBender. (d) In the raw data, many genes show up as being significantly differentially expressed due to background noise. (g) After CellBender, these spurious results have disappeared (a few of which are labeled). Benjamini-Hochberg-corrected FDR value for significance (red dotted line) is 0.01 in all volcano plots.
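For readers unfamiliar with the Benjamini-Hochberg correction used for the significance threshold in these volcano plots, the following is a minimal NumPy sketch of the standard procedure. It is a generic illustration, not the code used to produce the figure; the function name and the example p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (generic sketch)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    scaled = p[order] * n / np.arange(1, n + 1)  # p_(i) * n / i
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # restore the original order
    return adjusted


# Hypothetical p-values for five genes.
adjusted = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
significant = adjusted < 0.01  # the threshold quoted in the legend
print(adjusted, significant)
```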
Extended Data Fig. 9 Comparison of output summarization methods for constructing an integer count matrix.
Methods are discussed in Supplementary Sections 4.5 (legend label MCKP), 4.6 (legend label Posterior CDF), and 4.7 (legend labels PR-μ and PR-q). The four panels show four different ways to compute TPR and FPR to display a ROC curve. “Micro-averaged per cell” computes the TPR of cell $n$ as $\sum_g \mathrm{TP}_{ng} / \sum_g (\mathrm{TP}_{ng} + \mathrm{FN}_{ng})$, while “macro-averaged per cell” computes it as $\sum_g [\mathrm{TP}_{ng} / (\mathrm{TP}_{ng} + \mathrm{FN}_{ng})]$. For the “per gene” cases, the sum over genes is replaced by a sum over cells. We exclude genes whose raw data counts are less than 10 summed over all cells. The dots shown represent the mean over all cells or genes as appropriate.
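The two per-cell schemes in this legend differ only in whether the sum over genes is taken before or after forming the ratio TP/(TP + FN). A minimal NumPy sketch of both follows, assuming TP and FN are available as cells-by-genes arrays. The function name is hypothetical, and two choices are assumptions made here rather than details taken from the paper: the per-gene ratios are divided by the number of contributing genes, and genes with TP + FN = 0 in a cell are skipped.

```python
import numpy as np


def per_cell_tpr(tp, fn):
    """Two per-cell TPR summaries from (cells x genes) TP and FN count arrays.

    Returns (pooled, mean_of_ratios):
      pooled         = sum_g TP / sum_g (TP + FN)
      mean_of_ratios = average over genes of TP / (TP + FN), skipping genes
                       with TP + FN == 0 (an assumption of this sketch)
    """
    tp = np.asarray(tp, dtype=float)
    fn = np.asarray(fn, dtype=float)
    denom = tp + fn
    # Sum over genes first, then take the ratio.
    pooled = tp.sum(axis=1) / denom.sum(axis=1)
    # Take the ratio per gene first, then average over genes.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.where(denom > 0, tp / denom, np.nan)
    mean_of_ratios = np.nanmean(ratios, axis=1)
    return pooled, mean_of_ratios


# Toy example: two cells, two genes.
tp = [[8, 1], [0, 5]]
fn = [[2, 1], [0, 5]]
pooled, mean_of_ratios = per_cell_tpr(tp, fn)
print(pooled, mean_of_ratios)
```

In the first toy cell the pooled value (0.75) is dominated by the highly expressed gene, whereas the mean of per-gene ratios (0.65) weights both genes equally, which is why the two summaries can give different ROC curves.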
Extended Data Fig. 10 Comparison of per-gene performance of different noise estimation methods.
Methods are discussed in Supplementary Sections 4.5 (MCKP), 4.6 (Posterior CDF), and 4.7 (PR-μ and PR-q). Each plot shows the over-removal of each gene (fraction removed minus fraction that should have been removed according to truth) for the given method with the hyperparameter setting specified in the title. Each dot is a gene. Positive values indicate that too many counts of the gene were removed at the level of the entire experiment. Row 1 column 1 shows the posterior mode, row 2 column 1 shows the posterior mean, and row 3 column 1 shows a single sample from the unregularized posterior (α = 0).
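The over-removal metric can be sketched for simulated data where the true noise-free counts are known. The NumPy example below assumes that both fractions are taken relative to each gene's total raw counts summed over all cells, which is an assumption about the normalization; the function name and the toy matrices are hypothetical.

```python
import numpy as np


def over_removal_per_gene(raw, denoised, truth):
    """Per-gene over-removal from (cells x genes) count matrices.

    over-removal = (counts removed - counts that should have been removed)
                   / total raw counts of the gene  (assumed normalization)
    Genes with zero raw counts are returned as NaN.
    """
    raw = np.asarray(raw, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    truth = np.asarray(truth, dtype=float)
    raw_total = raw.sum(axis=0)
    removed = (raw - denoised).sum(axis=0)      # counts the method removed
    should_remove = (raw - truth).sum(axis=0)   # true noise counts
    result = np.full(raw_total.shape, np.nan)
    expressed = raw_total > 0
    result[expressed] = (removed[expressed] - should_remove[expressed]) / raw_total[expressed]
    return result


# Toy example: two cells, two genes.
raw = [[10, 4], [10, 6]]
denoised = [[8, 4], [8, 5]]
truth = [[9, 4], [9, 4]]
over = over_removal_per_gene(raw, denoised, truth)
print(over)
```

Here the first gene is over-removed (positive value) and the second is under-removed (negative value), matching the sign convention stated in the legend.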
Supplementary information
Supplementary Information
Supplementary Figs. 1–10, Discussion and Tables 1–6.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fleming, S.J., Chaffin, M.D., Arduini, A. et al. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat Methods 20, 1323–1335 (2023). https://doi.org/10.1038/s41592-023-01943-7
DOI: https://doi.org/10.1038/s41592-023-01943-7