Abstract
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks, including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
Data availability
All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.
Semrau, S. et al. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8, 1096 (2017).
 2.
Gaublomme, J. T. et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).
 3.
Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
 4.
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
 5.
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
 6.
Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
 7.
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
 8.
Prabhakaran, S., Azizi, E., Carr, A. & Pe’er, D. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. PMLR 48, 1070–1079 (2016).
 9.
Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
 10.
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
 11.
Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
 12.
van Dijk, D. et al. MAGIC: a diffusion-based imputation method reveals gene–gene interactions in single-cell RNA-sequencing data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
 13.
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
 14.
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
 15.
Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, New York, 2007).
 16.
Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
 17.
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
 18.
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
 19.
Wang, D. & Gu, J. VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/10/06/199315 (2017).
 20.
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).
 21.
Grønbech, C. H. et al. scVAE: variational autoencoders for single-cell gene expression data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/16/318295 (2018).
 22.
Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
 23.
Cole, M. B. et al. Performance assessment and selection of normalization procedures for single-cell RNA-seq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/18/235382 (2017).
 24.
Louizos, C., Swersky, K., Li, Y., Welling, M. & Zemel, R. The variational fair autoencoder. Oral presentation at the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
 25.
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Oral presentation at the International Conference on Learning Representations, Banff, Alberta, Canada, 14–16 April 2014.
 26.
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
 27.
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems (eds Lee, D. D. et al.) 3738–3746 (NIPS Foundation, La Jolla, CA, 2016).
 28.
10x Genomics. Support: single cell gene expression datasets. 10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/datasets (2017).
 29.
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
 30.
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
 31.
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
 32.
Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
 33.
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
 34.
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
 35.
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
 36.
Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
 37.
Held, L. & Ott, M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419 (2018).
 38.
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
 39.
Nakaya, H. I. et al. Systems biology of vaccination for seasonal influenza in humans. Nat. Immunol. 12, 786–795 (2011).
 40.
Görgün, G., Holderried, T. A. W., Zahrieh, D., Neuberg, D. & Gribben, J. G. Chronic lymphocytic leukemia cells induce changes in gene expression of CD4 and CD8 T cells. J. Clin. Invest. 115, 1797–1805 (2005).
 41.
Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).
 42.
Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. Oral presentation at the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
 43.
Bergstra, J. S., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems 24 (eds Shawe-Taylor, J. et al.) 2546–2554 (NIPS Foundation, La Jolla, CA, 2011).
 44.
Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).
 45.
DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-Seq data. BMC Bioinformatics 17, 315 (2016).
 46.
Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).
 47.
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Acknowledgements
N.Y. and R.L. were supported by NIH–NIAID (grant U19 AI090023). We thank A. Klein, S. Dudoit, and J. Listgarten for helpful discussions.
Author information
Affiliations
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
 Romain Lopez
 , Jeffrey Regier
 , Michael I. Jordan
 & Nir Yosef
Department of Physics, University of California, Berkeley, Berkeley, CA, USA
 Michael B. Cole
Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
 Michael I. Jordan
Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA
 Nir Yosef
Chan Zuckerberg BioHub, San Francisco, CA, USA
 Nir Yosef
Authors
Romain Lopez, Jeffrey Regier, Michael B. Cole, Michael I. Jordan & Nir Yosef
Contributions
R.L., J.R., and N.Y. conceived the statistical model. R.L. developed the software. R.L. and M.B.C. applied the software to real data analysis. R.L., J.R., N.Y., and M.I.J. wrote the manuscript. N.Y. and M.I.J. supervised the work.
Competing interests
The authors declare no competing interests.
Corresponding author
Correspondence to Nir Yosef.
Integrated supplementary information
Supplementary Figure 1 Robustness analysis for scVI.
(a) Imputation score on the BRAIN-LARGE dataset across multiple random initializations (n = 5) for different dimensions of the latent space (x axis). The center of the error bars denotes the median, and the extrema denote the s.d. (b) Visualization of the scVI numerical objective function during training on the BRAIN-LARGE dataset. This shows that our model does not overfit and has a stable training procedure. (c) Imputation score as a function of the number of epochs on the BRAIN-LARGE dataset. This figure also shows stability across posterior sampling, as there is not much change in the parameters between two subsequent epochs. (d) Clustering metrics on the CORTEX dataset across multiple initializations (n = 5) and dimensions of the latent space. The center of the error bars denotes the median, and the extrema denote the s.d.
Supplementary Figure 2 Stability of the early-stopping criterion on the 1-million-cell sample of the BRAIN-LARGE dataset.
(a) Evolution of the loss function value (y axis) on a validation set (n = 10,000 cells) with the number of epochs (x axis). (b) Contrast of the expected frequency ρ values between the model trained with early stopping (x axis) and the model trained without early stopping (y axis) on a random subset of n = 100 cells and all 720 genes. We also report the Pearson correlation, r.
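The early-stopping criterion assessed above can be sketched as follows. This is a minimal illustration of stopping on a stalled validation loss, not the exact rule in the scVI implementation; the `patience` and `min_delta` parameter names are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the epoch at which training stops: the first epoch at which
    the validation loss has failed to improve by at least `min_delta`
    for `patience` consecutive epochs (or the last epoch otherwise)."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1
```

For example, a loss trace that plateaus at epoch 2 with `patience=2` stops at epoch 4, while a strictly decreasing trace runs to the final epoch.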
Supplementary Figure 3 Imputation results for scVI.
(a–c) We investigate how the scVI latent space can be used to impute the data (with the uniform perturbation scheme) and report benchmarking across datasets against state-of-the-art methods.
Supplementary Figure 4 Imputation of scVI on the CORTEX dataset.
(a–d) The heat maps represent density plots of imputed values (by scVI, ZIFA, ZINB-WaVE, and MAGIC, respectively) on a downsampled version versus the original (non-zero) values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.
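The median imputation error d reported in these legends can be computed along the following lines. This is a minimal sketch under assumed conventions: the corruption simply zeroes a random subset of non-zero entries, and a per-gene-mean "imputation" stands in for a real model.

```python
import numpy as np

def median_imputation_error(original, imputed, mask):
    """Median absolute error over the corrupted (hidden) entries only.

    original: count matrix before corruption (cells x genes)
    imputed:  model-imputed matrix of the same shape
    mask:     boolean matrix marking which entries were perturbed
    """
    return float(np.median(np.abs(original[mask] - imputed[mask])))

rng = np.random.default_rng(0)
original = rng.poisson(5.0, size=(100, 50)).astype(float)

# toy corruption: hide ~10% of the non-zero entries by setting them to zero
mask = (original > 0) & (rng.random(original.shape) < 0.1)
corrupted = original.copy()
corrupted[mask] = 0.0

# trivial stand-in for a model: impute each gene by its post-corruption mean
imputed = np.broadcast_to(corrupted.mean(axis=0), original.shape)

d = median_imputation_error(original, imputed, mask)
```

The score is computed only on the hidden entries, so a method is never rewarded for trivially reproducing values it was shown.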
Supplementary Figure 5 Imputation of scVI on the CORTEX dataset.
(a–d) Models were trained on a binomially corrupted dataset (see Methods). The heat maps denote density plots of values imputed by scVI, ZIFA, ZINB-WaVE, and MAGIC, respectively, on a downsampled version versus the original values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.
Supplementary Figure 6 Imputation results for scVI.
(a–c) We investigate how the scVI latent space can be used to impute the data (with the binomial perturbation scheme) and report benchmarking across datasets against state-of-the-art methods.
Supplementary Figure 7 Preliminary results for scVI.
Log-likelihood results on a held-out subset of the CORTEX dataset (n = 751 cells).
Supplementary Figure 8 Posterior analysis of generative models on the CORTEX dataset.
(a–c) The observed counts of n = 55,932 randomly selected entries of the data matrix (x axis) and their posterior uncertainty (y axis) obtained by sampling n = 500 times from the variational posterior (scVI) or the exact posterior (FA, ZIFA). The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles. (d–f) The observed counts of a representative gene, Thy1, in the CORTEX dataset. Data are presented across all cells (n = 3,005) (x axis) against the (n = 500) independent posterior expected counts for each cell produced by scVI, ZIFA, and FA, respectively (y axis). The values on each axis have been divided into k = 20 bins, and the color scale reflects the proportion of cells in each pair of bins. By definition, the uncertainty of FA is independent of the input value and tight around the observed count. ZIFA can generate zeros and realistically puts more weight in this area. scVI’s posterior is more complex: it can generate zeros for low unique molecular identifier (UMI) values, but it can also generate high UMI values when the observed count was only of intermediate intensity.
Supplementary Figure 9 How the scVI latent space can be used to cluster the data, and benchmarking across datasets for state-of-the-art methods.
(a–d) The results for the (a) PBMC, (b) CORTEX, (c) CBMC, and (d) RETINA datasets. ASW: average silhouette width of pre-annotated subpopulations (higher is better); ARI: adjusted Rand index (higher is better); NMI: normalized mutual information (higher is better); BE: batch mixing entropy (higher is better); BASW: average silhouette width of batches (lower is better). (e–g) The performance of clustering metrics for different depths of the hierarchical clustering in the CORTEX data, computed in the original publication [29]. The numbers in the legend indicate the number of clusters at the given depth.
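The ASW, ARI, and NMI metrics above have standard implementations; a minimal sketch of how they could be computed on a latent space, using scikit-learn on a synthetic two-cluster toy example (the blob parameters are purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)

# toy "latent space": two well-separated Gaussian blobs of 50 cells each
latent = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
                    rng.normal(6.0, 1.0, (50, 10))])
labels_true = np.repeat([0, 1], 50)  # stand-in for annotated subpopulations

# cluster the latent space and compare against the annotation
labels_pred = KMeans(n_clusters=2, n_init=10,
                     random_state=0).fit_predict(latent)

asw = silhouette_score(latent, labels_true)          # higher is better
ari = adjusted_rand_score(labels_true, labels_pred)  # higher is better
nmi = normalized_mutual_info_score(labels_true, labels_pred)  # higher is better
```

ASW is computed against the annotated labels (how well the geometry separates known subpopulations), while ARI and NMI compare a de novo clustering to the annotation.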
Supplementary Figure 10 Complement to the analysis on the HEMATO dataset.
(n = 4,016 cells) (a–c) All scatter plots illustrate the embedding of a 5-nearest-neighbor graph of a latent space. Cell positions are computed using a force-directed layout. (a) A reduction to 60 principal components, as in the original paper. (b) The output of scVI with a latent space of dimension 60. As this dimension is markedly different from that of the other experiments, the warm-up schedule (which governs how the prior on z is enforced) was adjusted. (c) The figure from the main paper. To recover all the differentiation paths, the authors performed several operations on the k-nearest-neighbors graph that we did not reproduce in this analysis. We instead visualize the graph before the smoothing procedure. (d) SIMLR t-SNE on the HEMATO dataset. We prefer to visualize the SIMLR embedding on a kNN graph, as even t-SNE would lose the continuum structure of the dataset.
Supplementary Figure 11 Analysis on the ZINB random dataset.
(n = 2,000 cells) For three algorithms (PCA, SIMLR, and scVI), we compute a latent space (we let SIMLR choose k; here SIMLR found 11 clusters). Then we compute a cell–cell similarity matrix in which cells are ordered by the SIMLR clusters (first column). We either apply t-SNE to the latent space (scVI, PCA) or use the two-dimensional embedding from SIMLR, and color by number of UMIs (second column) or by SIMLR cluster labels (third column).
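A ZINB random dataset like the one analyzed here can be generated as a gamma-Poisson (negative binomial) draw with independent zero inflation; a minimal sketch with illustrative parameter values, not those used in the paper:

```python
import numpy as np

def sample_zinb(mean, inv_dispersion, dropout, size, rng):
    """Draw zero-inflated negative binomial counts.

    The NB is parameterized by mean `mean` and inverse dispersion
    `inv_dispersion` (variance = mean + mean**2 / inv_dispersion),
    simulated as a gamma-Poisson mixture; each entry is then zeroed
    independently with probability `dropout`.
    """
    theta = inv_dispersion
    lam = rng.gamma(shape=theta, scale=mean / theta, size=size)
    counts = rng.poisson(lam)
    counts[rng.random(size) < dropout] = 0
    return counts

rng = np.random.default_rng(0)
# 2,000 "cells" x 100 "genes" of pure ZINB noise, with no cluster structure
x = sample_zinb(mean=5.0, inv_dispersion=2.0, dropout=0.3,
                size=(2000, 100), rng=rng)
```

Because the matrix is pure noise, any clusters a method reports on such data reflect the method rather than the data.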
Supplementary Figure 12 Batch effect removal on the RETINA dataset.
n = 27,499 cells for scVI with no batch annotation, DCA, PCA, SIMLR, and ComBat. Embedding plots for all methods but SIMLR were generated by applying t-SNE to the respective latent space. For SIMLR, we used the t-SNE coordinates provided by the program, and the number of clusters was set to the number of pre-annotated subpopulations (n = 15).
Supplementary Figure 13 Capturing technical variability with scVI.
(a,b) Data are based on the CD14^{+} cell subpopulation of the PBMC dataset. (a) Scatter plot, for each cell (n = 2,237), of the inferred scaling factor against library size. (b) The frequency of observed zero values versus the expected expression level, as produced by scVI. Each of the n = 3,346 points represents a gene g, where the x axis is ρ^{g}, the average expected frequency per cell (for gene g, averaged over ρ^{g} for all cells c in the subpopulation), and the y axis is the observed percentage of cells that detect the gene (UMI > 0). The green curve depicts the probability of selecting zero transcripts of a given gene as a function of its frequency, assuming a simple model of sampling U molecules at random without replacement from a cell with N molecules. U = 1,398 is the average number of UMIs in the subpopulation, and N is the average number of transcripts per cell (for this curve, N = 10,000). Notably, for values of N larger than 20,000 the curve converges to the red curve, a binomial selection procedure (consistent with the limit of the sampling process as N goes to infinity). (c,d) Signed log P values for a Pearson correlation test between the zero probabilities from the two distributions (negative binomial, Bernoulli) and quality-control metrics across n = 5 random initializations of scVI and all subpopulations of the PBMC (n = 12,039 cells) and BRAIN-SMALL (n = 9,128 cells) datasets. The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles.
Supplementary Figure 14 The generative distributions of scVI.
This study focuses on a particular subpopulation of the BRAIN-SMALL dataset (n = 2,592 cells). (a) To assess whether most of the zeros in the data come from the negative binomial, for each entry of the count matrix (percentage on the y axis), we plot the probability that a given zero comes from the NB, conditioned on having a zero (x axis). (b) Number of genes detected versus negative binomial zero probability, averaged across all genes. (c) Genome_not_gene versus Bernoulli zero probability, averaged across all genes. (d) Mapped_reads versus Bernoulli zero probability, averaged across all genes.
Supplementary Figure 15 Differential expression with scVI on the PBMC dataset.
(a–e) BH-corrected P values from bulk analysis against BH-corrected P values or Bayes factors for the CD4^{+}/CD8^{+} comparison (microarrays, n = 12 in each group; scRNA-seq, n = 100 cells). Each point is a gene (g = 3,346). Panels show, in order: scVI, edgeR, MAST, DESeq2, and DESeq2 on DCA-imputed counts. (f) Histogram of the Bayes factors of scVI when applying differential expression to two sets of n = 100 random cells from the same cluster.
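The Bayes factor used in these panels is a log odds ratio between the two differential-expression hypotheses, estimated from posterior samples. A toy sketch of that computation; the Gaussian draws below are purely synthetic stand-ins for real posterior samples of a gene's expected frequency ρ in two cell groups:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for n = 500 posterior draws of a gene's expected
# frequency rho in each of two groups (not real scVI output)
rho_a = rng.normal(0.012, 0.002, size=500)
rho_b = rng.normal(0.008, 0.002, size=500)

# estimate P(rho_a > rho_b) by comparing all pairs of posterior draws
p = float((rho_a[:, None] > rho_b[None, :]).mean())
p = min(max(p, 1e-8), 1 - 1e-8)  # clip to keep the log odds finite

bayes_factor = float(np.log(p / (1 - p)))  # natural-log odds ratio
```

A factor near 0 means the posterior gives both directions similar weight, which is the behavior expected in panel f, where the two cell sets come from the same cluster; large absolute values indicate confident differential expression.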
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15, Supplementary Tables 1–5 and Supplementary Notes 1–6
Reporting Summary