Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender

Abstract

Droplet-based single-cell assays, including single-cell RNA sequencing (scRNA-seq), single-nucleus RNA sequencing (snRNA-seq) and cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), generate considerable background noise counts, the hallmark of which is nonzero counts in cell-free droplets and off-target gene expression in unexpected cell types. Such systematic background noise can lead to batch effects and spurious differential gene expression results. Here we develop a deep generative model based on the phenomenology of noise generation in droplet-based assays. The proposed model accurately distinguishes cell-containing droplets from cell-free droplets, learns the background noise profile and provides noise-free quantification in an end-to-end fashion. We implement this approach in the scalable and robust open-source software package CellBender. Analysis of simulated data demonstrates that CellBender operates near the theoretically optimal denoising limit. Extensive evaluations using real datasets and experimental benchmarks highlight enhanced concordance between droplet-based single-cell data and established gene expression patterns, while the learned background noise profile provides evidence of degraded or uncaptured cell types.
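As a rough illustration of the "nonzero counts in cell-free droplets" hallmark, the sketch below pools counts from low-UMI droplets into a per-gene background profile. This is a simple heuristic and not the CellBender model, which learns cell probabilities and the noise profile jointly. The function name `ambient_profile` and the hard UMI threshold are assumptions made for illustration only.

```python
def ambient_profile(counts, umi_threshold):
    """Estimate a per-gene background profile from presumed cell-free droplets.

    counts: list of droplets, each a list of per-gene UMI counts.
    umi_threshold: droplets with fewer total UMIs are treated as cell-free
                   (an illustrative assumption, not CellBender's cell calling).
    """
    n_genes = len(counts[0])
    totals = [0] * n_genes
    for droplet in counts:
        if sum(droplet) < umi_threshold:
            for g, c in enumerate(droplet):
                totals[g] += c
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no counts found in droplets below the threshold")
    # Normalize so the profile sums to 1 across genes.
    return [t / grand_total for t in totals]


if __name__ == "__main__":
    toy = [[90, 10, 0], [1, 1, 2], [0, 2, 2]]  # one cell, two empty droplets
    print(ambient_profile(toy, umi_threshold=10))
```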

Fig. 1: The phenomenology of ambient RNA and its deep generative modeling using CellBender remove-background.
Fig. 2: Evaluation of CellBender on a PBMC dataset, showing a standard SCANPY analysis of the publicly available 10x Genomics dataset pbmc8k with and without CellBender.
Fig. 3: Removal of background RNA from a published human heart snRNA-seq atlas, heart600k, using CellBender.
Fig. 4: Comparing four cell-calling algorithms (CellRanger version 3, dropkick, EmptyDrops and CellBender) on the rat6k snRNA-seq dataset.
Fig. 5: Benchmarking CellBender on denoising the hgmm12k human–mouse mixture dataset and a simulated dataset with differently sized cells.
Fig. 6: Performance of CellBender on denoising a CITE-seq PBMC dataset from 10x Genomics (pbmc5k).

Data availability

The datasets used in this study are the following: pbmc8k (the publicly available pbmc8k dataset from 10x Genomics called ‘8k PBMCs from a healthy donor’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/8-k-pbm-cs-from-a-healthy-donor-2-standard-2-1-0); heart600k (the published dataset from the Broad–Bayer PCL called ‘Single-nuclei profiling of human dilated and hypertrophic cardiomyopathy’ (ref. 23), run with 10x Genomics 3′ capture version 3 chemistry and analyzed with CellRanger version 4.0.0, available at https://singlecell.broadinstitute.org/single_cell/study/SCP1303); hgmm12k (the publicly available hgmm12k dataset from 10x Genomics called ‘12k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells’, run with version 2 chemistry and analyzed with CellRanger version 2.1.0, available at https://www.10xgenomics.com/resources/datasets/12-k-1-1-mixture-of-fresh-frozen-human-hek-293-t-and-mouse-nih-3-t-3-cells-2-standard-2-1-0); pbmc5k (the publicly available pbmc5k dataset with antibodies from 10x Genomics called ‘5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor with a Panel of TotalSeq™-B Antibodies (Next GEM)’, run with version 3 Next GEM chemistry and analyzed with CellRanger version 3.1.0, available at https://www.10xgenomics.com/resources/datasets/5-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-with-cell-surface-proteins-next-gem-3-1-standard-3-1-0); and rat6k (an snRNA-seq dataset from a healthy Wistar rat left atrium, comprising approximately 6,000 nuclei, processed on the 10x Genomics platform using version 2 chemistry and analyzed with CellRanger version 3.1.0; the dataset was provided by P.T.E.’s group at the Broad Institute as part of the Broad–Bayer PCL, the experiment was performed by A.A. and A.-D.A., and the dataset is publicly available on Broad’s Single Cell Portal at https://singlecell.broadinstitute.org/single_cell/study/SCP2148).

Datasets analyzed only in the Supplementary Information are as follows: smartseq3xpress_pbmc (a Smart-seq3xpress (well-based) scRNA-seq dataset from healthy human PBMCs called ‘Scalable full-transcript coverage single-cell RNA sequencing of PBMCs using Smart-seq3xpress’, published by Hagemann-Jensen et al. (ref. 59); this dataset was kindly provided to the authors in count matrix format by C. Ziegenhain, an author of the referenced paper, and we subsetted the data to the sixteen 384-well plates that came from ‘donor8’); and fluentbio_pbmc (the publicly available scRNA-seq dataset of healthy human PBMCs from Fluent BioSciences called ‘Profiling 20k Immune Cells in Healthy PBMCs from a Single T20 Reaction’, generated with T20 PIPseq and analyzed with PIPseeker version 1.1.3 by Fluent BioSciences (ref. 60), available at https://fbs-public.s3.us-east-2.amazonaws.com/public-datasets/pbmc/raw_matrix.tar.gz).

Code availability

CellBender can be obtained from https://github.com/broadinstitute/CellBender. Additional documentation is available at https://cellbender.readthedocs.io. CellBender modules are also available as workflows on Terra (https://app.terra.bio), a secure open platform for collaborative omic analysis, and can be run on the cloud with zero set-up. We have implemented the model and the inference method using the Pyro probabilistic programming language (ref. 16) and PyTorch (ref. 61), and present it as a user-friendly, production-grade, stand-alone command-line tool. We refer to the background noise-removal algorithm implemented in CellBender as remove-background.
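For readers scripting the command-line tool, the following is a minimal sketch of how a remove-background invocation might be assembled from Python. The file names are hypothetical, and the flags shown (`--input`, `--output`, `--fpr`, `--cuda`) follow the CellBender documentation as we understand it; they should be checked against `cellbender remove-background --help` for the installed version.

```python
import shutil
import subprocess


def build_remove_background_cmd(input_h5, output_h5, fpr=0.01, use_cuda=False):
    """Assemble the argument list for a remove-background run.

    Flag names are taken from the CellBender documentation and should be
    verified against the installed version; paths are placeholders.
    """
    cmd = [
        "cellbender", "remove-background",
        "--input", str(input_h5),
        "--output", str(output_h5),
        "--fpr", str(fpr),
    ]
    if use_cuda:
        cmd.append("--cuda")
    return cmd


def run_remove_background(cmd):
    """Run the command only if the CLI is on PATH; return exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_remove_background_cmd("raw_feature_bc_matrix.h5", "denoised.h5")
    print(" ".join(cmd))
```

Building the argument list separately from running it keeps the command easy to log and inspect before any long inference job is launched.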

References

  1. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).

  2. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

  3. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  4. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).

  5. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

  6. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).

  7. Liu, L. et al. Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity. Nat. Commun. 10, 470 (2019).

  8. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).

  9. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).

  10. Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).

  11. Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet based single cell RNA sequencing data. GigaScience 9, giaa151 (2020).

  12. Haas, B. J. et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 21, 494–504 (2011).

  13. Dixit, A. Correcting chimeric crosstalk in single cell RNA-seq experiments. Preprint at bioRxiv https://doi.org/10.1101/093237 (2016).

  14. Thompson, J. R., Marcelino, L. A. & Polz, M. F. Heteroduplexes in mixed-template amplifications: formation, consequence and elimination by ‘reconditioning PCR’. Nucleic Acids Res. 30, 2083–2088 (2002).

  15. Perkel, J. M. Single-cell analysis enters the multiomics age. Nature 595, 614–616 (2021).

  16. Bingham, E. et al. Pyro: deep universal probabilistic programming. J. Mach. Learn. Res. 20, 1–6 (2019).

  17. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  18. Dani, N. et al. A cellular and spatial map of the choroid plexus across brain ventricles and ages. Cell 184, 3056–3074 (2021).

  19. Popova, G. et al. Human microglia states are conserved across experimental models and regulate neural stem cell responses in chimeric organoids. Cell Stem Cell 28, 2153–2166 (2021).

  20. Holloway, E. M. et al. Mapping development of the human intestinal niche at single-cell resolution. Cell Stem Cell 28, 568–580 (2021).

  21. Tucker, N. R. et al. Transcriptional and cellular diversity of the human heart. Circulation 142, 466–482 (2020).

  22. Tucker, N. R. et al. Myocyte specific upregulation of ACE2 in cardiovascular disease: implications for SARS-CoV-2 mediated myocarditis. Circulation 142, 708–710 (2020).

  23. Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).

  24. Sun, W. et al. snRNA-seq reveals a subpopulation of adipocytes that regulates thermogenesis. Nature 587, 98–102 (2020).

  25. Dong, H. et al. Identification of a regulatory pathway inhibiting adipogenesis via RSPO2. Nat. Metab. 4, 90–105 (2022).

  26. Delorey, T. M. et al. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595, 107–113 (2021).

  27. Xu, G. et al. The differential immune responses to COVID-19 in peripheral and lung revealed by single-cell RNA sequencing. Cell Discov. 6, 73 (2020).

  28. Ziegler, C. G. K. et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell 184, 4713–4733 (2021).

  29. Melms, J. C. et al. A molecular single-cell lung atlas of lethal COVID-19. Nature 595, 114–119 (2021).

  30. Wang, S. et al. A single-cell transcriptomic landscape of the lungs of patients with COVID-19. Nat. Cell Biol. 23, 1314–1328 (2021).

  31. Zazhytska, M. et al. Non-cell-autonomous disruption of nuclear architecture as a potential cause of COVID-19-induced anosmia. Cell 185, 1052–1064 (2022).

  32. Eraslan, G. et al. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Science 376, eabl4290 (2022).

  33. Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).

  34. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://doi.org/10.48550/arXiv.1312.6114 (2014).

  35. Lopez, R., Regier, J., Cole, M. B., Jordan, M. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  36. Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 36, 4415–4422 (2020).

  37. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).

  38. Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640 (2019).

  39. Uhlen, M. et al. A genome-wide transcriptomic analysis of protein-coding genes in human blood cells. Science 366, eaax9198 (2019).

  40. Neutrophil Analysis in 10x Genomics Single Cell Gene Expression Assays Report No. CG000444 (10x Genomics, 2021).

  41. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  42. Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).

  43. Lun, A. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).

  44. Heiser, C. N., Wang, V. M., Chen, B., Hughey, J. J. & Lau, K. S. Automated quality control and cell identification of droplet-based single-cell data using dropkick. Genome Res. 31, 1742–1752 (2021).

  45. Petukhov, V. et al. dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biol. 19, 78 (2018).

  46. Oberdoerffer, S. et al. Regulation of CD45 alternative splicing by heterogeneous ribonucleoprotein, hnRNPLL. Science 321, 686–691 (2008).

  47. Luecken, M. D. & Theis, F. J. Current best practices in single cell RNA seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).

  48. Clarke, Z. A. et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat. Protoc. 16, 2749–2764 (2021).

  49. Caglayan, E., Liu, Y. & Konopka, G. Neuronal ambient RNA contamination causes misinterpreted and masked cell types in brain single-nuclei datasets. Neuron 110, 4043–4056 (2022).

  50. Di Bella, D. J. et al. Molecular logic of cellular diversification in the mouse cerebral cortex. Nature 595, 554–559 (2021).

  51. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).

  52. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).

  53. Jiang, R., Sun, T., Song, D. & Li, J. J. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol. 23, 31 (2022).

  54. Hoffman, M., Blei, D. M., Wang, C. & Paisley, J. Stochastic variational inference. Preprint at https://doi.org/10.48550/arXiv.1206.7051 (2012).

  55. Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).

  56. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

  57. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  58. Ganchev, K., Graça, J., Gillenwater, J. & Taskar, B. Posterior regularization for structured latent variable models. J. Mach. Learn. Res. 11, 2001–2049 (2010).

  59. Hagemann-Jensen, M., Ziegenhain, C. & Sandberg, R. Scalable single-cell RNA sequencing from full transcripts with Smart-seq3xpress. Nat. Biotechnol. 40, 1452–1457 (2022).

  60. Clark, I. C. et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01685-z (2023).

  61. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In 33rd Conference on Neural Information Processing Systems 12 (NeurIPS, 2019).

Acknowledgements

We thank L. D’Alessio, C. Roselli, C. Porter, E. Bingham, F. Obermeyer, J. Nemesh, B. Wang, B. Babadi, V. Popic, A. Wysoker, A. Subramanian, N. Tucker, Y. Farjoun, T. Tickle and A. Carr for insightful discussions at various stages of this project. S.J.F., M.D.C. and M.B. acknowledge financial support from the Broad–Bayer PCL. M.B. acknowledges additional support from the SPARC grant ‘Development of Production-Grade Computational Methods for Single-Cell Genomics’ from the Broad Institute. The publicly available rat6k snRNA-seq dataset was generated by the PCL, and the experiment was performed by A.A. and A.-D.A. We additionally thank C. Ziegenhain for providing a count matrix for the published Smart-seq3xpress PBMC dataset analyzed in Supplementary Section 2.4.

Author information

Contributions

S.J.F. and M.B. jointly developed the probabilistic model, software and study design and jointly wrote the paper. S.J.F. additionally performed statistical analyses on real and simulated data. A.A. and A.-D.A. collected the rat6k dataset under the supervision of P.T.E. M.D.C. provided critical feedback at various stages of the project and analyzed the heart600k dataset. M.B. and P.T.E. jointly supervised the project, with additional input from A.A.P., J.C.M. and E.B.

Corresponding authors

Correspondence to Stephen J. Fleming or Mehrtash Babadi.

Ethics declarations

Competing interests

A.-D.A. is an employee of Bayer US LLC (a subsidiary of Bayer AG) and may own stock in Bayer AG. A.A.P. is employed as a venture partner at Google Ventures, and he is also supported by a grant from Bayer AG to the Broad Institute focused on machine learning for clinical trial design. P.T.E. is supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of cardiovascular diseases. P.T.E. has also served on advisory boards or consulted for Bayer AG, Quest Diagnostics, MyoKardia and Novartis. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Eran Mukamel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The CellBender model.

(a) The CellBender generative model for noisy single-cell count data. (b) The variational posterior used by CellBender. The neural network NNenc takes the observed data as input and yields the parameters of various variational distributions assumed for the local latent variables. The global latent variables are treated in the usual mean-field approximation.

Extended Data Fig. 2 Violin plots showing the count distributions of lysozyme, LYZ, per cluster before and after CellBender denoising.

(nFPR was 0.01.) The off-target counts are effectively removed, with counts remaining in clusters 0 (CD14+ monocytes C), 10 (FCGR3A+ monocytes NC), and 12 (plasmacytoid dendritic cells).

Extended Data Fig. 3 UMAPs created from the CellBender-analyzed pbmc8k data, showing increased expression specificity of marker genes for different cell types after CellBender denoising as compared to the raw data.

a–d, UMAP plots of the expression of NKG7, CST3, AIF1 and LST1 in each cell before and after CellBender.

Extended Data Fig. 4 UMI curves from the raw data together with various CellBender outputs for the pbmc8k and rat6k datasets.

(a-d) pbmc8k, and (e-h) rat6k. (a,e) The raw UMI curves, annotated with areas of cells and empty droplets. Notably, the distinction is much more difficult in (e), the nuclei dataset extracted from heart tissue. (b,f) Cell probabilities inferred by CellBender on the same UMI curves from (a,e), respectively. The region of transition from “surely-cell” to “surely-empty” is much broader in the snRNA-seq dataset. (c,g) First two principal components of the latent gene expression embedding inferred by CellBender, colored by Leiden clustering from a separate scanpy analysis. The structure very closely reflects the labels attributed by that separate analysis. (d,h) Scatter plots showing removal of each gene by CellBender (each dot is a gene, MALAT1 is off-scale). Several top denoised genes are indicated.
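A UMI curve of the kind shown in panels (a,e) can be sketched as total UMI counts per droplet, sorted in decreasing order and paired with their rank. The function name `umi_curve` and the list-of-lists input are illustrative assumptions, not part of CellBender.

```python
def umi_curve(counts):
    """Return (rank, total UMI count) pairs, largest droplet first.

    counts: list of droplets, each a list of per-gene UMI counts.
    """
    totals = sorted((sum(droplet) for droplet in counts), reverse=True)
    return list(enumerate(totals, start=1))


if __name__ == "__main__":
    toy = [[1, 2], [10, 5], [0, 0]]
    for rank, total in umi_curve(toy):
        print(rank, total)
```

Such curves are usually plotted on log-log axes, where cell-containing droplets form a plateau at high counts and empty droplets form a second plateau at low counts.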

Extended Data Fig. 5 Presence of doublets does not impact the denoising performance of CellBender.

(a,c,e) Simulated dataset without doublets. (b,d,f) Simulated dataset where 20% of the cell-containing droplets are doublets. (a) UMAP of the gene expression profile of the three simulated cell types. (b) Same as (a), but including doublets, which are highlighted in bold color. Doublets with cells of two different types form their own clusters in UMAP space, due to their unique transcriptional profile. (c) The learned CellBender prior on gene expression, visualized via PCA, shows three clusters for the three cell types. (d) With doublets present, the prior on gene expression now additionally contains clusters for each type of doublet. From the standpoint of CellBender, a doublet is like a unique cell type. (e,f) Denoising performance has been quantified using a ROC curve, and shows that denoising metrics are nearly identical (TPR 0.750, FPR 0.041) whether doublets are present or not. The error bars shown in panels e-f correspond to the interquartile range of TPR (vertical) and FPR (horizontal) over N=2400 simulated cells.

Extended Data Fig. 6 Published human scRNA-seq PBMC dataset from the well-based Smart-seq3xpress protocol (ref. 59).

This dataset is extremely clean to begin with. The UMAP shows the expected cell types, nicely clustered. The two dotplots show expression of immune cell marker genes before and after CellBender. Some genes show improvement, but many look quite similar, as expected for a clean dataset. UMAP plots on the right show cleanup of a few genes after CellBender.

Extended Data Fig. 7 Publicly available human scRNA-seq PBMC dataset from the Fluent BioSciences PIPseq platform (ref. 60).

Droplets are generated by vigorous vortexing, and thus we expect more ambient RNA than a microfluidics experiment. The UMAP shows the expected cell types, in addition to some probable doublets. The two dotplots show expression of immune cell marker genes before and after CellBender. Many genes show significant cleanup. UMAP plots on the right show rather marked cleanup of a few genes after CellBender.

Extended Data Fig. 8 Systematic background noise as a source of batch variation and spurious differential expression across batches.

(a) Setup of the cohort of simulated datasets, where there are two cell types whose expression profiles are taken from real data (rat6k) for cardiomyocytes and fibroblasts. The only difference between simulations from batch A and batch B is the number of cardiomyocytes. Noise ends up being different in the two batches due to these cell number differences. The “truth” in this simulated cohort is that there are no differences between a cell type’s expression profile between batches. (b-d) Raw data. (e-g) CellBender denoised data. (b) Dotplot showing top cardiomyocyte and fibroblast marker genes. Background noise causes marker genes to show up in the off-target cell type at a low level. (e) Marked cleanup of the dataset at an aggregate level. (c,f) The cardiomyocytes show no differentially expressed genes between batch A and B, before or after CellBender. (d) In the raw data, many genes show up as being significantly differentially-expressed due to background noise. (g) After CellBender, these spurious results have disappeared (a few of which are labeled). Benjamini-Hochberg-corrected FDR value for significance (red dotted line) is 0.01 in all volcano plots.
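The significance threshold in these volcano plots relies on Benjamini-Hochberg correction. A minimal, generic sketch of that adjustment is given below; it is a textbook implementation for illustration, not the code used to produce the figure, and the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.001, 0.04, 0.03, 0.5]
    adj = benjamini_hochberg(pvals)
    # Genes called significant at the FDR threshold of 0.01 used in the caption.
    print([a <= 0.01 for a in adj])
```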

Extended Data Fig. 9 Comparison of output summarization methods for constructing an integer count matrix.

Methods are discussed in Supplementary Sections 5.5 (legend label MCKP), 5.6 (legend label Posterior CDF), and 5.7 (legend labels PR-μ and PR-q). The four panels show four different ways to compute TPR and FPR to display a ROC curve. “Macro-averaged per cell” computes TPR as $\left(\sum_g \mathrm{TP}_{ng}\right) / \sum_g \left(\mathrm{TP}_{ng} + \mathrm{FN}_{ng}\right)$, while “micro-averaged per cell” computes TPR as $\sum_g \left[\mathrm{TP}_{ng} / \left(\mathrm{TP}_{ng} + \mathrm{FN}_{ng}\right)\right]$, where n indexes cells and g indexes genes. For the “per gene” cases, the sum over genes is replaced by a sum over cells. We exclude genes whose raw data counts are less than 10 summed over all cells. The dots shown represent the mean over all cells or genes as appropriate.
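The two per-cell TPR definitions in this caption can be sketched for a single cell as follows. The code follows the expressions literally as written in the caption: the second form is a plain sum over genes, and dividing it by the number of contributing genes to obtain a mean would be an additional assumption not stated here. Function names and the handling of genes with no true noise counts are illustrative choices.

```python
def tpr_macro_per_cell(tp_row, fn_row):
    """Pooled ratio: sum of TP over genes divided by sum of (TP + FN)."""
    numerator = sum(tp_row)
    denominator = sum(tp + fn for tp, fn in zip(tp_row, fn_row))
    return numerator / denominator if denominator else float("nan")


def tpr_micro_per_cell(tp_row, fn_row):
    """Sum over genes of the per-gene ratio TP / (TP + FN).

    Genes with TP + FN == 0 are skipped (an assumption made to avoid
    division by zero).
    """
    return sum(tp / (tp + fn) for tp, fn in zip(tp_row, fn_row) if tp + fn > 0)


if __name__ == "__main__":
    tp = [8, 1, 0]  # true-positive noise counts per gene for one cell
    fn = [2, 1, 0]  # false-negative noise counts per gene for one cell
    print(tpr_macro_per_cell(tp, fn), tpr_micro_per_cell(tp, fn))
```

The pooled form weights highly expressed genes more heavily, whereas the per-gene ratios give each gene equal weight.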

Extended Data Fig. 10 Comparison of per-gene performance of different noise estimation methods.

Methods are discussed in Supplementary Sections 4.5 (MCKP), 4.6 (Posterior CDF), and 4.7 (PR-μ and PR-q). Each plot shows the over-removal of each gene (fraction removed - fraction that should have been removed according to truth) for the given method with the hyperparameter setting specified in the title. Each dot is a gene. Positive values indicate that too many counts of the gene were removed at the level of the entire experiment. Row 1 column 1 shows the posterior mode, row 2 column 1 shows the posterior mean, and row 3 column 1 shows a single sample from the unregularized posterior (α = 0).
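The over-removal quantity defined in this caption can be written out for one gene as below, using counts summed over the whole experiment. The function and argument names are hypothetical; in particular, `true_noise_total` stands for the number of counts that are noise according to the simulation truth.

```python
def over_removal(raw_total, denoised_total, true_noise_total):
    """Fraction removed minus fraction that should have been removed.

    raw_total: raw counts of the gene summed over all droplets.
    denoised_total: counts remaining after denoising.
    true_noise_total: counts that are noise according to ground truth.
    """
    if raw_total <= 0:
        raise ValueError("raw_total must be positive")
    fraction_removed = (raw_total - denoised_total) / raw_total
    fraction_should_remove = true_noise_total / raw_total
    return fraction_removed - fraction_should_remove


if __name__ == "__main__":
    # 30 of 100 counts removed, but only 20 were noise: positive over-removal.
    print(over_removal(raw_total=100, denoised_total=70, true_noise_total=20))
```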

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Discussion and Tables 1–6.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fleming, S.J., Chaffin, M.D., Arduini, A. et al. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat Methods 20, 1323–1335 (2023). https://doi.org/10.1038/s41592-023-01943-7
