Abstract
The rapid progress of protocols for sequencing single-cell transcriptomes over the past decade has been accompanied by equally impressive advances in the computational methods for analysis of such data. As capacity and accuracy of the experimental techniques grew, the emerging algorithm developments revealed increasingly complex facets of the underlying biology, from cell type composition to gene regulation to developmental dynamics. At the same time, rapid growth has forced continuous reevaluation of the underlying statistical models, experimental aims, and sheer volumes of data processing that are handled by these computational tools. Here, I review key computational steps of single-cell RNA sequencing (scRNA-seq) analysis, examine assumptions made by different approaches, and highlight successes, remaining ambiguities, and limitations that are important to keep in mind as scRNA-seq becomes a mainstream technique for studying biology.
Data availability
The following scRNA-seq datasets were used in creating example figures:
• 10x Genomics PBMC 10k (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3).
• 10x Genomics PBMC 66k (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.1.0/5k_pbmc_NGSC3_aggr).
• Fetal pancreas: E12.5 timepoint data from Byrnes et al.84 were downloaded from GEO (GSM3140915).
• Mouse developing retina: 10x Chromium replicate from Lo Giudice et al.85 was downloaded from GEO (GSM3466902).
• Cell lines: Benchmarking data measuring different cell lines on different platforms, taken from Tian et al.89, were downloaded from GEO (GSE118767).
• Metadata on the single-cell RNA-seq experiments were taken from http://www.nxn.se/single-cell-studies/.
Code availability
The notebooks and scripts for the figures presented in the paper can be found on the author’s website: http://pklab.med.harvard.edu/peterk/review2020/.
Change history
30 June 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41592-021-01223-2
References
Ding, J. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
Hagemann-Jensen, M. et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38, 708–714 (2020).
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
Vu, T. N. et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics 32, 2128–2135 (2016).
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Wang, T., Li, B., Nelson, C. E. & Nabavi, S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinf. 20, 40 (2019).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
Vallejos, C. A., Richardson, S. & Marioni, J. C. Beyond comparisons of means: understanding changes in gene expression at the single-cell level. Genome Biol. 17, 70 (2016).
Nabavi, S., Schmolze, D., Maitituoheti, M., Malladi, S. & Beck, A. H. EMDomics: a robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics 32, 533–541 (2016).
Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).
Martinez-Jimenez, C. P. et al. Aging increases cell-to-cell transcriptional variability upon immune stimulation. Science 355, 1433–1436 (2017).
Crowell, H. L. et al. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
Ntranos, V., Kamath, G. M., Zhang, J. M., Pachter, L. & Tse, D. N. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol. 17, 112 (2016).
Aggarwal, C. C., Hinneburg, A. & Keim, D. A. in Database Theory — ICDT 2001. (eds Van den Bussche, J. & Vianu, V.) 420–434 (Springer Berlin Heidelberg, 2001).
Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551 (2011).
Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).
Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294 (2018).
Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 e1821 (2019).
Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308 (2018).
Shao, C. & Hofer, T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 33, 235–242 (2017).
Zhu, X., Ching, T., Pan, X., Weissman, S. M. & Garmire, L. Detecting heterogeneity in single-cell RNA-seq data by non-negative matrix factorization. PeerJ 5, e2888 (2017).
Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
Zhou, M. Nonparametric Bayesian negative binomial factor analysis. Bayesian Analysis 13, 1065–1093 (2018).
Zhang, L. & Mallick, B. K. Inferring gene networks from discrete expression data. Biostatistics 14, 708–722 (2013).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. Publisher Correction: A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 10, 646 (2019).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Aggarwal, C. C. Neural Networks and Deep Learning: A Textbook. (Springer International Publishing, 2018).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020).
Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Haghverdi, L., Buettner, F. & Theis, F. J. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).
Jarvis, R. A. & Patrick, E. A. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. C-22, 1025–1034 (1973).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.05.22.111161 (2020).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Van Mieghem, P. Graph Spectra for Complex Networks. (Cambridge University Press, 2010).
Haghverdi, L., Buttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Maaten, L. V. D. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
Amir el, A. D. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
Pons, P. & Latapy, M. Computing communities in large networks using random walks. J. Graph Algorithms Appl. 10, 191–218 (2006).
Gorban, A. N. & Zinovyev, A. Y. in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 28–59 (IGI Global, 2010).
Hastie, T. & Stuetzle, W. Principal curves. J. Am. Stat. Assoc. 84, 502–516 (1989).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Soldatov, R. et al. Spatiotemporal structure of cell fate decisions in murine neural crest. Science 364, eaas9536 (2019).
Ji, Z. & Ji, H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e11 (2016).
Shin, J. et al. Single-cell RNA-seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell 17, 360–372 (2015).
Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 1–9 (2019).
Hrvatin, S. et al. Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex. Nat. Neurosci. 21, 120–129 (2018).
Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
Tran, T. N. & Bader, G. D. Tempora: cell trajectory inference using time-series single-cell RNA sequencing data. PLoS Comput. Biol. 16, e1008205 (2020).
Weinreb, C., Wolock, S., Tusi, B. K., Socolovsky, M. & Klein, A. M. Fundamental limits on dynamic inference from single-cell snapshots. Proc. Natl Acad. Sci. USA 115, E2467 (2018).
Grun, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Bergen, V., Lange, M., Peidli, S., Wolf, F. A. & Theis, F. J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020).
Cao, J., Zhou, W., Steemers, F., Trapnell, C. & Shendure, J. Sci-fate characterizes the dynamics of gene expression in single cells. Nat. Biotechnol. 38, 980–988 (2020).
Erhard, F. et al. scSLAM-seq reveals core features of transcription dynamics in single cells. Nature 571, 419–423 (2019).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Lein, E., Borm, L. E. & Linnarsson, S. The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358, 64–69 (2017).
Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature 568, 235–239 (2019).
Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Lake, B. B. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018).
Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2016).
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database (Oxford) 2020, baaa073 (2020).
Byrnes, L. E. et al. Lineage dynamics of murine pancreatic development at single-cell resolution. Nat. Commun. 9, 3922 (2018).
Lo Giudice, Q., Leleu, M., La Manno, G. & Fabre, P. J. Single-cell transcriptional logic of cell-fate specification and axon guidance in early-born retinal neurons. Development 146, dev178103 (2019).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Mao, Q., Yang, L., Wang, L., Goodison, S. & Sun, Y. in Proceedings of the 2015 SIAM International Conference on Data Mining 792–800 (SIAM, 2015).
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Acknowledgements
P.V.K. was supported by the NHLBI R01HL131768 award from the NIH and a CAREER (NSF-14-532) award from the NSF.
Ethics declarations
Competing interests
P.V.K. serves on the scientific advisory boards of Celsius Therapeutics and Biomage Inc.
Additional information
Peer review information Nature Methods thanks Martin Hemberg, Michael Morgan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Properties of scRNA-seq measurements.
a. Dependency between cost per cell (x axis) and the expected depth (UMIs per cell, y axis) is shown for a number of popular methods, largely based on the assessment by Ding et al.1. b–d. Systematic transcript-specific bias of different scRNA-seq protocols. b. The scatter plot shows average log10(CPM+1) values for different genes (each dot represents a gene), as assessed using the 10x Chromium (x axis) or Drop-seq (y axis) platforms. Genes showing higher (red) or lower (green) expression (above a 10-fold threshold) are highlighted. c, d. Similar scatter plots are shown for the other two cell lines: H2228 (c) and HCC827 (d) cells. The set of differential genes determined from analysis of the H1975 cell line (b) is shown. Most of the genes that showed a large discrepancy in the detection rate in the H1975 results also show the same discrepancy in the other two cell lines, illustrating a stable detection bias between the two platforms. e. The ability to distinguish nearest neighbors decreases as the dimensionality of the space increases. The difference between the distances of the closest (min_d) and furthest (max_d) points from the origin, normalized by min_d (y axis), is shown for different distance measures as a function of an increasing number of dimensions (x axis). For each dimensionality n, a set of 100 random points is drawn from the n-dimensional uniform distribution, and the median over 1000 draws is shown. The distinction between the closest and furthest points approaches 0 at high dimensions. In other words, relative to the origin, in high-dimensional space the points appear to be distributed on the surface of a high-dimensional sphere. f. Principal tree fit to the PBMC10k dataset. The tree shows a computationally optimal spanning of the PBMC populations, yet interpreting it as a dynamic process is incorrect.
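The simulation described for panel e can be sketched in a few lines of Python. This is an illustrative reconstruction, not the author's notebook code: the function name `relative_contrast`, the choice of the unit hypercube [0, 1]^n as the uniform distribution, and the use of Minkowski (L_p) norms as the "distance measures" are all assumptions.

```python
import numpy as np

def relative_contrast(n_dims, n_points=100, n_draws=1000, p=2.0, seed=0):
    """Median of (max_d - min_d) / min_d over repeated random draws.

    Assumption: points are uniform on [0, 1]^n_dims and distances to the
    origin are measured with the L_p norm (p=2 is Euclidean, p=1 Manhattan).
    """
    rng = np.random.default_rng(seed)
    ratios = np.empty(n_draws)
    for i in range(n_draws):
        points = rng.uniform(0.0, 1.0, size=(n_points, n_dims))
        # Distance of every point from the origin under the chosen norm.
        d = np.linalg.norm(points, ord=p, axis=1)
        ratios[i] = (d.max() - d.min()) / d.min()
    return float(np.median(ratios))

if __name__ == "__main__":
    # Fewer draws than the caption's 1000, only to keep the demo fast.
    for n in (2, 10, 100, 1000):
        print(n, relative_contrast(n, n_draws=200))
```

Under these assumptions the returned ratio shrinks as `n_dims` grows, which is the loss of contrast between nearest and furthest points that the panel illustrates.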
Extended Data Fig. 2 Dimensionality reduction and neural networks.
a. A t-SNE embedding of the PBMC10k dataset (left); projection of cells onto the first two principal components (middle); projection of cells onto the first two components of the non-negative matrix factorization (right). b. Projection of cells onto the first two principal components, based on re-analysis of a subset of the PBMC10k dataset that contains only T lymphocytes. Given this restricted cellular context, the first two components are much better at capturing the separation between different subsets of T cells, compared to the PCA on the full dataset shown in the previous panel. c. Visualization of the PBMC10k dataset in the 2D latent space determined by the autoencoder structure shown in (d). d. The architecture of the autoencoder used to reduce the dimensions of the PBMC10k dataset in the previous panel. The autoencoder starts with a vector of the top 3000 most variable genes, and then for each cell transforms this expression profile through a series of non-linear transformations, first into increasingly narrow dimensions, culminating in a two-dimensional middle layer, and then back into a full 3000-dimensional vector. The values of the two-dimensional middle layer are shown in (c). The parameters of the transformations connecting each layer are optimized so that they minimize the discrepancy between the original expression vector (leftmost layer) and the reconstructed vector (rightmost layer). e, f. Using neural networks to learn a non-linear mapping from the high-dimensional expression state to the coordinates of a t-SNE embedding. As t-SNE embeddings are based on empirical optimization of the relative positions of neighboring cells, there is no obvious analytical function connecting the expression state with the resulting t-SNE coordinates. Neural networks, however, can be used to approximate highly nonlinear and noisy functions. Here, a neural network with the architecture shown in (f) was used to approximate such a function. The parameters of the transformations connecting the layers were optimized based on a training set of 3000 cells, and then an additional set of 3000 test cells was used to illustrate the resulting fit. The left panel in (e) shows the actual positions of the 3000 test cells, and the right panel shows the positions predicted by the trained network.
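The bottleneck structure described for panel d can be illustrated with a minimal NumPy sketch of the forward pass and reconstruction loss. Only the 3000-gene input, the two-dimensional middle layer, and the 3000-dimensional output come from the caption; the hidden widths (256 and 32), the tanh activation, the mean squared error as the "discrepancy", and all function names are assumptions, and no training loop is shown.

```python
import numpy as np

def init_layers(widths, rng):
    """Create (weight, bias) pairs connecting consecutive layer widths."""
    return [
        (rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]

def forward(x, layers):
    """Apply the layers in order: tanh on hidden layers, linear on the last."""
    h = x
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = np.tanh(h)
    return h

def autoencode(x, encoder, decoder):
    """Return the 2D latent coordinates and the reconstructed profile."""
    z = forward(x, encoder)
    x_hat = forward(z, decoder)
    return z, x_hat

def reconstruction_loss(x, x_hat):
    """Mean squared error between input and reconstruction (assumed loss)."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
n_genes = 3000                                     # from the caption
encoder = init_layers([n_genes, 256, 32, 2], rng)  # hidden widths are hypothetical
decoder = init_layers([2, 32, 256, n_genes], rng)

# Random stand-in for five cells' normalized expression profiles.
x = rng.normal(size=(5, n_genes))
z, x_hat = autoencode(x, encoder, decoder)
loss = reconstruction_loss(x, x_hat)
```

In a trained model the weights would be optimized to reduce `loss`, and the rows of `z` would be the coordinates plotted in panel c.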
Cite this article
Kharchenko, P.V. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods 18, 723–732 (2021). https://doi.org/10.1038/s41592-021-01171-x
DOI: https://doi.org/10.1038/s41592-021-01171-x