Abstract
The high proportion of zeros in typical single-cell RNA sequencing datasets has led to widespread but inconsistent use of terminology such as dropout and missing data. Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help to reduce confusion. These include: (1) observed single-cell RNA sequencing counts reflect both true gene expression levels and measurement error, and carefully distinguishing between these contributions helps to clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps to address questions of biological interest, such as whether messenger RNA expression levels are multimodal among cells.
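Idea (2) can be made concrete with a minimal sketch; it is not the authors' code. It assumes the Poisson measurement model has the form x | λ ~ Poisson(sλ), where s (the total number of molecules observed in a cell) and λ (the gene's relative expression level) are illustrative symbols that the abstract does not define. It further assumes two example expression models: a constant expression level, and a Gamma-distributed expression level with mean μ and dispersion φ. Under these assumptions the probability of observing a zero has a closed form, which illustrates how many zeros can arise from low expression and limited sampling depth alone.

```python
import math


def prob_zero_constant(size_factor, mean_expr):
    """P(x = 0) when every cell has the same expression level mu.

    Assumed measurement model: x ~ Poisson(s * mu).
    """
    return math.exp(-size_factor * mean_expr)


def prob_zero_gamma(size_factor, mean_expr, dispersion):
    """P(x = 0) when expression varies among cells as a Gamma distribution.

    Assumed expression model: lambda ~ Gamma(shape=1/phi, scale=mu*phi),
    so E[lambda] = mu. Combined with x | lambda ~ Poisson(s * lambda), the
    marginal distribution of x is negative binomial, and
    P(x = 0) = (1 + s * mu * phi) ** (-1 / phi).
    """
    return math.exp(-math.log1p(size_factor * mean_expr * dispersion) / dispersion)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    s = 1000      # molecules observed per cell
    mu = 1e-4     # mean relative expression of the gene
    phi = 1.0     # dispersion of the Gamma expression model

    print("constant expression:", prob_zero_constant(s, mu))
    print("Gamma expression:   ", prob_zero_gamma(s, mu, phi))
```

With these hypothetical values, both expression models predict that most cells show a zero count for the gene, even though neither includes a separate zero-generating mechanism. The Gamma model predicts slightly more zeros because expression varies among cells, and it approaches the constant-expression result as the dispersion goes to zero.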
Data availability
Sorted immune cell and PBMC data were downloaded from https://10xgenomics.com/data. iPSC data were downloaded from the Gene Expression Omnibus (accession number GSE118723). Brain data were downloaded from the Genotype-Tissue Expression portal (https://www.gtexportal.org/home/datasets). Kidney and retina data were downloaded from the Human Cell Atlas Data Portal (https://data.humancellatlas.org/). Control data were downloaded from https://figshare.com/projects/Zero_inflation_in_negative_control_data/61292. All of the results generated in this study are available at https://zenodo.org/record/4543923 and all analysis notebooks have been published at https://aksarkar.github.io/singlecell-modes/.
Code availability
All of the code used to perform the analysis is available at https://zenodo.org/record/4543921 and https://zenodo.org/record/4543923.
Acknowledgements
We thank members of the M.S. and Y. Gilad laboratories for helpful comments. This work was supported by NIH grant HG002585 and a Gut Cell Atlas grant from The Leona M. and Harry B. Helmsley Charitable Trust (both to M.S.).
Contributions
A.S. and M.S. developed the theory. A.S. performed the analysis. A.S. and M.S. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Supplementary information
Supplementary Information
Supplementary Notes 1–5, Figs. 1–4 and Methods
About this article
Cite this article
Sarkar, A., Stephens, M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 53, 770–777 (2021). https://doi.org/10.1038/s41588-021-00873-4
DOI: https://doi.org/10.1038/s41588-021-00873-4