Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender

Fleming, Stephen J.; Chaffin, Mark D.; Arduini, Alessandro; Akkad, Amer-Denis; Banks, Eric; Marioni, John C.; Philippakis, Anthony A.; Ellinor, Patrick T.; Babadi, Mehrtash

doi:10.1038/s41592-023-01943-7

Article
Published: 07 August 2023

Nature Methods volume 20, pages 1323–1335 (2023)

Abstract

Droplet-based single-cell assays, including single-cell RNA sequencing (scRNA-seq), single-nucleus RNA sequencing (snRNA-seq) and cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), generate considerable background noise counts, the hallmark of which is nonzero counts in cell-free droplets and off-target gene expression in unexpected cell types. Such systematic background noise can lead to batch effects and spurious differential gene expression results. Here we develop a deep generative model based on the phenomenology of noise generation in droplet-based assays. The proposed model accurately distinguishes cell-containing droplets from cell-free droplets, learns the background noise profile and provides noise-free quantification in an end-to-end fashion. We implement this approach in the scalable and robust open-source software package CellBender. Analysis of simulated data demonstrates that CellBender operates near the theoretically optimal denoising limit. Extensive evaluations using real datasets and experimental benchmarks highlight enhanced concordance between droplet-based single-cell data and established gene expression patterns, while the learned background noise profile provides evidence of degraded or uncaptured cell types.

**Fig. 1: The phenomenology of ambient RNA and its deep generative modeling using CellBender remove-background.**

**Fig. 2: Evaluation of CellBender on a PBMC dataset, showing a standard SCANPY analysis of the publicly available 10x Genomics dataset pbmc8k with and without CellBender.**

**Fig. 3: Removal of background RNA from a published human heart snRNA-seq atlas, heart600k, using CellBender.**

**Fig. 4: Comparing four cell-calling algorithms (CellRanger version 3, dropkick, EmptyDrops and CellBender) on the rat6k snRNA-seq dataset.**

**Fig. 5: Benchmarking CellBender on denoising the hgmm12k human–mouse mixture dataset and a simulated dataset with differently sized cells.**

**Fig. 6: Performance of CellBender on denoising a CITE-seq PBMC dataset from 10x Genomics (pbmc5k).**

Data availability

The datasets used in this study are the following: pbmc8k (the publicly available pbmc8k dataset from 10x Genomics called ‘8k PBMCs from a healthy donor’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/8-k-pbm-cs-from-a-healthy-donor-2-standard-2-1-0); heart600k (the published dataset from the Broad–Bayer PCL called ‘Single-nuclei profiling of human dilated and hypertrophic cardiomyopathy’ (ref. ²³), run with 10x Genomics 3′ capture version 3 chemistry and analyzed with CellRanger version 4.0.0, available at https://singlecell.broadinstitute.org/single_cell/study/SCP1303); hgmm12k (the publicly available hgmm12k dataset from 10x Genomics called ‘12k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/12-k-1-1-mixture-of-fresh-frozen-human-hek-293-t-and-mouse-nih-3-t-3-cells-2-standard-2-1-0); pbmc5k (the publicly available pbmc5k dataset with antibodies from 10x Genomics called ‘5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor with a Panel of TotalSeq™-B Antibodies (Next GEM)’, run with version 3 Next GEM chemistry and analyzed with CellRanger version 3.1.0, available at https://www.10xgenomics.com/resources/datasets/5-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-with-cell-surface-proteins-next-gem-3-1-standard-3-1-0); and rat6k (an snRNA-seq dataset from a healthy Wistar rat left atrium, comprising approximately 6,000 nuclei, processed on the 10x Genomics platform using version 2 chemistry and analyzed with CellRanger version 3.1.0. The dataset was provided by P.T.E.’s group at the Broad Institute as part of the Broad–Bayer PCL. The experiment was performed by A.A. and A.-D.A. 
The dataset is publicly available on Broad’s Single Cell Portal at https://singlecell.broadinstitute.org/single_cell/study/SCP2148). Datasets analyzed only in the Supplementary Information are as follows: smartseq3xpress_pbmc (a Smart-seq3xpress (well-based) scRNA-seq dataset from healthy human PBMCs called ‘Scalable full-transcript coverage single-cell RNA sequencing of PBMCs using Smart-seq3xpress’ and published by Hagemann-Jensen et al.⁵⁹. This dataset was kindly provided to the authors in count matrix format by C. Ziegenhain, an author of the referenced paper. We subsetted the data to the sixteen 384-well plates that came from ‘donor8’); and fluentbio_pbmc (the publicly available scRNA-seq dataset of healthy human PBMCs from Fluent BioSciences called ‘Profiling 20k Immune Cells in Healthy PBMCs from a Single T20 Reaction’, generated with T20 PIPseq and analyzed with PIPseeker version 1.1.3 by Fluent Biosciences⁶⁰, available at https://fbs-public.s3.us-east-2.amazonaws.com/public-datasets/pbmc/raw_matrix.tar.gz).

Code availability

CellBender can be obtained from https://github.com/broadinstitute/CellBender. Additional documentation is available at https://cellbender.readthedocs.io. CellBender modules are also available as workflows on Terra (https://app.terra.bio), a secure open platform for collaborative omic analysis, and can be run on the cloud with zero set-up. We have implemented the model and the inference method using the Pyro probabilistic programming language¹⁶ and PyTorch⁶¹ and present it as a user-friendly, production-grade, stand-alone command-line tool. We refer to the background noise-removal algorithm implemented in CellBender as remove-background.
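As an illustration of the command-line usage described above, the following sketch assembles a remove-background invocation from Python. It assumes the `cellbender` executable is installed and on the PATH. The file names are hypothetical placeholders, and the flag names (`--input`, `--output`, `--fpr`, `--cuda`) should be checked against `cellbender remove-background --help` for the installed version.

```python
import shlex


def build_remove_background_cmd(input_path, output_path, fpr=0.01, use_cuda=False):
    """Return the argv list for a `cellbender remove-background` run.

    The paths are supplied by the caller; nothing is executed here.
    """
    cmd = [
        "cellbender", "remove-background",
        "--input", str(input_path),    # raw (unfiltered) count matrix
        "--output", str(output_path),  # where the denoised output is written
        "--fpr", str(fpr),             # target nominal false positive rate
    ]
    if use_cuda:
        cmd.append("--cuda")           # request GPU execution
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_remove_background_cmd(
        "raw_feature_bc_matrix.h5", "denoised.h5", use_cuda=True
    )
    print(shlex.join(cmd))
    # To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Building the argument list separately from running it keeps the invocation easy to log and inspect before any compute is spent.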

References

1. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
2. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
3. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
4. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
5. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
6. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
7. Liu, L. et al. Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity. Nat. Commun. 10, 470 (2019).
8. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
9. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
10. Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
11. Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet based single cell RNA sequencing data. GigaScience 9, giaa151 (2020).
12. Haas, B. J. et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 21, 494–504 (2011).
13. Dixit, A. Correcting chimeric crosstalk in single cell RNA-seq experiments. Preprint at bioRxiv https://doi.org/10.1101/093237 (2016).
14. Thompson, J. R., Marcelino, L. A. & Polz, M. F. Heteroduplexes in mixed-template amplifications: formation, consequence and elimination by ‘reconditioning PCR’. Nucleic Acids Res. 30, 2083–2088 (2002).
15. Perkel, J. M. et al. Single-cell analysis enters the multiomics age. Nature 595, 614–616 (2021).
16. Bingham, E. et al. Pyro: deep universal probabilistic programming. J. Mach. Learn. Res. 20, 1–6 (2019).
17. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
18. Dani, N. et al. A cellular and spatial map of the choroid plexus across brain ventricles and ages. Cell 184, 3056–3074 (2021).
19. Popova, G. Human microglia states are conserved across experimental models and regulate neural stem cell responses in chimeric organoids. Cell Stem Cell 28, 2153–2166 (2021).
20. Holloway, E. M. et al. Mapping development of the human intestinal niche at single-cell resolution. Cell Stem Cell 28, 568–580 (2021).
21. Tucker, N. R. et al. Transcriptional and cellular diversity of the human heart. Circulation 142, 466–482 (2020).
22. Tucker, N. R. et al. Myocyte specific upregulation of ACE2 in cardiovascular disease: implications for SARS-CoV-2 mediated myocarditis. Circulation 142, 708–710 (2020).
23. Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
24. Sun, W. et al. snRNA-seq reveals a subpopulation of adipocytes that regulates thermogenesis. Nature 587, 98–102 (2020).
25. Dong, H. et al. Identification of a regulatory pathway inhibiting adipogenesis via RSPO2. Nat. Metab. 4, 90–105 (2022).
26. Delorey, T. M. et al. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595, 107–113 (2021).
27. Xu, G. et al. The differential immune responses to COVID-19 in peripheral and lung revealed by single-cell RNA sequencing. Cell Discov. 6, 73 (2020).
28. Ziegler, C. G. K. et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell 184, 4713–4733 (2021).
29. Melms, J. C. et al. A molecular single-cell lung atlas of lethal COVID-19. Nature 595, 114–119 (2021).
30. Wang, S. et al. A single-cell transcriptomic landscape of the lungs of patients with COVID-19. Nat. Cell Biol. 23, 1314–1328 (2021).
31. Zazhytska, M. et al. Non-cell-autonomous disruption of nuclear architecture as a potential cause of COVID-19-induced anosmia. Cell 185, 1052–1064 (2022).
32. Eraslan, G. et al. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Science 376, eabl4290 (2022).
33. Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).
34. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://doi.org/10.48550/arXiv.1312.6114 (2014).
35. Lopez, R., Regier, J., Cole, M. B., Jordan, M. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
36. Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 36, 4415–4422 (2020).
37. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
38. Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640 (2019).
39. Uhlen, M. et al. A genome-wide transcriptomic analysis of protein-coding genes in human blood cells. Science 366, eaax9198 (2019).
40. Neutrophil Analysis in 10x Genomics Single Cell Gene Expression Assays Report No. CG000444 (10x Genomics, 2021).
41. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
42. Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
43. Lun, A. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
44. Heiser, C. N., Wang, V. M., Chen, B., Hughey, J. J. & Lau, K. S. Automated quality control and cell identification of droplet-based single-cell data using dropkick. Genome Res. 31, 1742–1752 (2021).
45. Petukhov, V. et al. dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biol. 19, 78 (2018).
46. Oberdoerffer, S. et al. Regulation of CD45 alternative splicing by heterogeneous ribonucleoprotein, hnRNPLL. Science 321, 686–691 (2008).
47. Luecken, M. D. & Theis, F. J. Current best practices in single cell RNA seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
48. Clarke, Z. A. et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat. Protoc. 16, 2749–2764 (2021).
49. Caglayan, E., Liu, Y. & Konopka, G. Neuronal ambient RNA contamination causes misinterpreted and masked cell types in brain single-nuclei datasets. Neuron 110, 4043–4056 (2022).
50. Di Bella, D. J. et al. Molecular logic of cellular diversification in the mouse cerebral cortex. Nature 595, 554–559 (2021).
51. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
52. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
53. Jiang, R., Sun, T., Song, D. & Li, J. J. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 23, 31 (2022).
54. Hoffman, M., Blei, D. M., Wang, C. & Paisley, J. Stochastic variational inference. Preprint at https://doi.org/10.48550/arXiv.1206.7051 (2012).
55. Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
56. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
57. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
58. Ganchev, K., Graça, J., Gillenwater, J. & Taskar, B. Posterior regularization for structured latent variable models. J. Mach. Learn. Res. 11, 2001–2049 (2010).
59. Hagemann-Jensen, M., Ziegenhain, C. & Sandberg, R. Scalable single-cell RNA sequencing from full transcripts with Smart-seq3xpress. Nat. Biotechnol. 40, 1452–1457 (2022).
60. Clark, I. C. et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01685-z (2023).
61. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In 33rd Conference on Neural Information Processing Systems 12 (NeurIPS, 2019).

Acknowledgements

We thank L. D’Alessio, C. Roselli, C. Porter, E. Bingham, F. Obermeyer, J. Nemesh, B. Wang, B. Babadi, V. Popic, A. Wysoker, A. Subramanian, N. Tucker, Y. Farjoun, T. Tickle and A. Carr for insightful discussions at various stages of this project. S.J.F., M.D.C. and M.B. acknowledge financial support from the Broad–Bayer PCL. M.B. acknowledges additional support from the SPARC grant ‘Development of Production-Grade Computational Methods for Single-Cell Genomics’ from the Broad Institute. The publicly available rat6k snRNA-seq dataset was generated by the PCL, and the experiment was performed by A.A. and A.-D.A. We additionally thank C. Ziegenhain for providing a count matrix for the published Smart-seq3xpress PBMC dataset analyzed in Supplementary Section 2.4.

Author information

Alessandro Arduini
Present address: Bayer US, LLC, Cambridge, MA, USA

Authors and Affiliations

Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Stephen J. Fleming, Eric Banks, Anthony A. Philippakis & Mehrtash Babadi
Precision Cardiology Laboratory (PCL), Broad Institute of MIT and Harvard, Cambridge, MA, USA
Stephen J. Fleming, Mark D. Chaffin, Alessandro Arduini, Patrick T. Ellinor & Mehrtash Babadi
Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Mark D. Chaffin & Patrick T. Ellinor
Precision Cardiology Laboratory (PCL), Bayer US, LLC, Cambridge, MA, USA
Amer-Denis Akkad
Wellcome Sanger Institute, Hinxton, Cambridge, UK
John C. Marioni
European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
John C. Marioni
Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
Patrick T. Ellinor

Contributions

S.J.F. and M.B. jointly developed the probabilistic model, software and study design and jointly wrote the paper. S.J.F. additionally performed statistical analyses on real and simulated data. A.A. and A.-D.A. collected the rat6k dataset under the supervision of P.T.E. M.D.C. provided critical feedback at various stages of the project and analyzed the heart600k dataset. M.B. and P.T.E. jointly supervised the project, with additional input from A.A.P., J.C.M. and E.B.

Corresponding authors

Correspondence to Stephen J. Fleming or Mehrtash Babadi.

Ethics declarations

Competing interests

A.-D.A. is an employee of Bayer US LLC (a subsidiary of Bayer AG) and may own stock in Bayer AG. A.A.P. is employed as a venture partner at Google Ventures, and he is also supported by a grant from Bayer AG to the Broad Institute focused on machine learning for clinical trial design. P.T.E. is supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of cardiovascular diseases. P.T.E. has also served on advisory boards or consulted for Bayer AG, Quest Diagnostics, MyoKardia and Novartis. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Eran Mukamel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The CellBender model.

(a) The CellBender generative model for noisy single-cell count data. (b) The variational posterior used by CellBender. The neural network NN_enc takes the observed data as input and yields the parameters of various variational distributions assumed for the local latent variables. The global latent variables are treated in the usual mean-field approximation.

Extended Data Fig. 2 Violin plots showing the count distributions of lysozyme, LYZ, per cluster before and after CellBender denoising.

(nFPR was 0.01.) The off-target counts are effectively removed, with counts remaining in clusters 0 (CD14⁺ monocytes C), 10 (FCGR3A⁺ monocytes NC), and 12 (plasmacytoid dendritic cells).

Extended Data Fig. 3 UMAPs created from the CellBender-analyzed pbmc8k data, showing increased expression specificity of marker genes for different cell types after CellBender denoising as compared to the raw data.

a–d, UMAP plots of the expression of NKG7, CST3, AIF1 and LST1 in each cell before and after CellBender.

Extended Data Fig. 4 UMI curves from the raw data together with various CellBender outputs for the pbmc8k and rat6k datasets.

(a–d) pbmc8k and (e–h) rat6k. (a,e) The raw UMI curves, annotated with the regions of cells and empty droplets. Notably, the distinction is much more difficult in (e), the nuclei dataset extracted from heart tissue. (b,f) Cell probabilities inferred by CellBender on the same UMI curves from (a,e), respectively. The region of transition from “surely-cell” to “surely-empty” is much broader in the snRNA-seq dataset. (c,g) First two principal components of the latent gene expression embedding inferred by CellBender, colored by Leiden clustering from a separate scanpy analysis. The structure very closely reflects the labels attributed by that separate analysis. (d,h) Scatter plots showing removal of each gene by CellBender (each dot is a gene, MALAT1 is off-scale). Several top denoised genes are indicated.
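The UMI curves in panels (a) and (e) rank droplets by their total UMI count. A minimal sketch of that ranking is given here, assuming a hypothetical mapping from droplet barcode to total UMI count; the barcodes and counts are made up for illustration.

```python
def umi_curve(total_umis_by_barcode):
    """Return (rank, total UMI count) pairs, highest-count droplet first.

    Plotting these pairs on log-log axes gives the usual UMI curve.
    """
    totals = sorted(total_umis_by_barcode.values(), reverse=True)
    return list(enumerate(totals, start=1))


if __name__ == "__main__":
    # Hypothetical droplets: two cell-like barcodes and two near-empty ones.
    example = {"AAAC": 5200, "AAAG": 35, "AACT": 4100, "AAGA": 12}
    for rank, total in umi_curve(example):
        print(rank, total)
```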

Extended Data Fig. 5 Presence of doublets does not impact the denoising performance of CellBender.

(a,c,e) Simulated dataset without doublets. (b,d,f) Simulated dataset where 20% of the cell-containing droplets are doublets. (a) UMAP of the gene expression profile of the three simulated cell types. (b) Same as (a), but including doublets, which are highlighted in bold color. Doublets with cells of two different types form their own clusters in UMAP space, due to their unique transcriptional profile. (c) The learned CellBender prior on gene expression, visualized via PCA, shows three clusters for the three cell types. (d) With doublets present, the prior on gene expression now additionally contains clusters for each type of doublet. From the standpoint of CellBender, a doublet is like a unique cell type. (e,f) Denoising performance has been quantified using a ROC curve, and shows that denoising metrics are nearly identical (TPR 0.750, FPR 0.041) whether doublets are present or not. The error bars shown in panels e-f correspond to the interquartile range of TPR (vertical) and FPR (horizontal) over N=2400 simulated cells.

Extended Data Fig. 6 Published human scRNA-seq PBMC dataset from the well-based Smart-seq3xpress protocol⁵⁹.

This dataset is extremely clean to begin with. The UMAP shows the expected cell types, nicely clustered. The two dotplots show expression of immune cell marker genes before and after CellBender. Some genes show improvement, but many look quite similar, as expected for a clean dataset. UMAP plots on the right show cleanup of a few genes after CellBender.

Extended Data Fig. 7 Publicly available human scRNA-seq PBMC dataset from the Fluent Biosciences PIPseq platform⁶⁰.

Droplets are generated by vigorous vortexing, and thus we expect more ambient RNA than in a microfluidics experiment. The UMAP shows the expected cell types, in addition to some probable doublets. The two dotplots show expression of immune cell marker genes before and after CellBender. Many genes show significant cleanup. UMAP plots on the right show rather marked cleanup of a few genes after CellBender.

Extended Data Fig. 8 Systematic background noise as a source of batch variation and spurious differential expression across batches.

(a) Setup of the cohort of simulated datasets, where there are two cell types whose expression profiles are taken from real data (rat6k) for cardiomyocytes and fibroblasts. The only difference between simulations from batch A and batch B is the number of cardiomyocytes. Noise ends up being different in the two batches due to these cell number differences. The “truth” in this simulated cohort is that a cell type’s expression profile does not differ between batches. (b–d) Raw data. (e–g) CellBender denoised data. (b) Dotplot showing top cardiomyocyte and fibroblast marker genes. Background noise causes marker genes to show up in the off-target cell type at a low level. (e) Marked cleanup of the dataset at an aggregate level. (c,f) The cardiomyocytes show no differentially expressed genes between batch A and B, before or after CellBender. (d) In the raw data, many genes show up as being significantly differentially expressed due to background noise. (g) After CellBender, these spurious results (a few of which are labeled) have disappeared. The Benjamini-Hochberg-corrected FDR threshold for significance (red dotted line) is 0.01 in all volcano plots.
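For readers unfamiliar with the Benjamini-Hochberg correction mentioned in this legend, a minimal sketch of the standard step-up adjustment follows. The P values are made-up illustrative numbers, not results from the paper, and the sketch says nothing about which differential expression test produced the P values in the figure.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.9]  # illustrative only
    qvals = benjamini_hochberg(pvals)
    threshold = 0.01  # the significance level quoted in the legend
    print([q < threshold for q in qvals])
```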

Extended Data Fig. 9 Comparison of output summarization methods for constructing an integer count matrix.

Methods are discussed in Supplementary Sections 5.5 (legend label MCKP), 5.6 (legend label Posterior CDF), and 5.7 (legend labels PR-μ and PR-q). The four panels show four different ways to compute TPR and FPR to display a ROC curve. “Macro-averaged per cell” computes TPR as (∑_g TP_ng)/(∑_g [TP_ng + FN_ng]), while “micro-averaged per cell” computes TPR as ∑_g [TP_ng/(TP_ng + FN_ng)]. For the “per gene” cases, the sum over genes is replaced by a sum over cells. We exclude genes whose raw data counts are less than 10 summed over all cells. The dots shown represent the mean over all cells or genes as appropriate.
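The two per-cell TPR definitions in this legend can be contrasted with a small sketch. The function names follow the legend's labels; the inputs are hypothetical per-gene TP and FN counts for one cell. One assumption is made that the legend does not state: for the second definition, the per-gene ratios are averaged over genes with a nonzero denominator rather than summed, so that the result stays between 0 and 1.

```python
def tpr_macro_per_cell(tp, fn):
    """Pool counts over genes first, then take one ratio for the cell."""
    total_tp = sum(tp)
    denominator = total_tp + sum(fn)
    return total_tp / denominator if denominator > 0 else None


def tpr_micro_per_cell(tp, fn):
    """Take a ratio per gene, then average the ratios (assumed normalization)."""
    ratios = [t / (t + f) for t, f in zip(tp, fn) if t + f > 0]
    return sum(ratios) / len(ratios) if ratios else None


if __name__ == "__main__":
    # Hypothetical counts for one cell and two genes.
    tp = [3, 1]
    fn = [1, 1]
    print(tpr_macro_per_cell(tp, fn))
    print(tpr_micro_per_cell(tp, fn))
```

The pooled version is dominated by highly expressed genes, whereas the ratio-averaged version weights every gene equally, which is why the two can rank methods differently.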

Extended Data Fig. 10 Comparison of per-gene performance of different noise estimation methods.

Methods are discussed in Supplementary Sections 4.5 (MCKP), 4.6 (Posterior CDF), and 4.7 (PR-μ and PR-q). Each plot shows the over-removal of each gene (fraction removed - fraction that should have been removed according to truth) for the given method with the hyperparameter setting specified in the title. Each dot is a gene. Positive values indicate that too many counts of the gene were removed at the level of the entire experiment. Row 1 column 1 shows the posterior mode, row 2 column 1 shows the posterior mean, and row 3 column 1 shows a single sample from the unregularized posterior (α = 0).
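The over-removal quantity defined in this legend can be sketched as follows. The sketch assumes three count matrices of the same shape, indexed as [cell][gene]: the raw counts, the denoised counts and the simulated noise-free truth. It also assumes that both fractions are taken relative to each gene's raw total over all cells; the variable names and the example numbers are hypothetical.

```python
def over_removal_per_gene(raw, denoised, truth):
    """Return, per gene, fraction removed minus fraction that should be removed.

    Positive values mean too many counts of that gene were removed overall.
    Genes with no raw counts yield None.
    """
    n_genes = len(raw[0])
    result = []
    for g in range(n_genes):
        raw_total = sum(cell[g] for cell in raw)
        if raw_total == 0:
            result.append(None)
            continue
        removed = (raw_total - sum(cell[g] for cell in denoised)) / raw_total
        should_remove = (raw_total - sum(cell[g] for cell in truth)) / raw_total
        result.append(removed - should_remove)
    return result


if __name__ == "__main__":
    # Two cells, two genes; illustrative numbers only.
    raw = [[10, 4], [10, 6]]
    denoised = [[8, 4], [7, 5]]
    truth = [[9, 4], [9, 4]]
    print(over_removal_per_gene(raw, denoised, truth))
```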

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Discussion and Tables 1–6.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fleming, S.J., Chaffin, M.D., Arduini, A. et al. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat Methods 20, 1323–1335 (2023). https://doi.org/10.1038/s41592-023-01943-7

Received: 10 July 2022
Accepted: 13 June 2023
Published: 07 August 2023
Issue Date: September 2023
DOI: https://doi.org/10.1038/s41592-023-01943-7
