Deep generative modeling for single-cell transcriptomics

Abstract

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Overview of scVI.
Fig. 2: Biological signal is retained in the scVI latent space.
Fig. 3: Benchmarking of differential expression analysis.

Data availability

All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.

References

  1. 1.

    Semrau, S. et al. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8, 1096 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  2. 2.

    Gaublomme, J. T. et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  3. 3.

    Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  6. 6.

    Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  8. 8.

    Prabhakaran, S., Azizi, E., Carr, A. & Pe’er, D. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. PMLR 48, 1070–1079 (2016).

    Google Scholar 

  9. 9.

    Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome. Biol. 16, 241 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  10. 10.

    Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  11. 11.

    Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).

    CAS  Article  PubMed  Google Scholar 

  12. 12.

    van Dijk, D. et al. MAGIC: a diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).

  13. 13.

    Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome. Biol. 16, 278 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  14. 14.

    Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, New York, 2007).

  16. 16.

    Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).

    Article  PubMed  Google Scholar 

  17. 17.

    Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome. Biol. 15, 550 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  18. 18.

    Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  19. 19.

    Wang, D. & Gu, J. VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/10/06/199315 (2017).

  20. 20.

    Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).

  21. 21.

    Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/16/318295 (2018).

  22. 22.

    Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  23. 23.

    Cole, M. B. et al. Performance assessment and selection of normalization procedures for single-cell RNA-seq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/18/235382 (2017).

  24. 24.

    Louizos, C., Swersky, K., Li, Y., Welling, M. & Zemel, R. The variational fair autoencoder. Oral presentation at the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.

  25. 25.

    Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Oral presentation at the International Conference on Learning Representations, Banff, Alberta, Canada, 14–16 April 2014.

  26. 26.

    Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).

    CAS  Article  Google Scholar 

  27. 27.

    Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems (eds Lee, D. D. et al.) 3738–3746 (NIPS Foundation, La Jolla, CA, 2016).

  28. 28.

    10x Genomics. Support: single cell gene expression datasets. 10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/datasets (2017).

  29. 29.

    Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).

    CAS  Article  PubMed  Google Scholar 

  30. 30.

    Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  31. 31.

    Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  34. 34.

    Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

    Article  PubMed  Google Scholar 

  35. 35.

    Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  36. 36.

    Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).

    Article  Google Scholar 

  37. 37.

    Held, L. & Ott, M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419 (2018).

    Article  Google Scholar 

  38. 38.

    Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).

    CAS  Article  Google Scholar 

  39. 39.

    Nakaya, H. I. et al. Systems biology of vaccination for seasonal influenza in humans. Nat. Immunol. 12, 786–795 (2011).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  40. 40.

    Görgün, G., Holderried, T. A. W., Zahrieh, D., Neuberg, D. & Gribben, J. G. Chronic lymphocytic leukemia cells induce changes in gene expression of CD4 and CD8 T cells. J. Clin. Invest. 115, 1797–1805 (2005).

    Article  PubMed  PubMed Central  Google Scholar 

  41. 41.

    Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).

    Article  Google Scholar 

  42. 42.

    Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. Oral presentation at the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.

  43. 43.

    Bergstra, J. S., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24 (eds Shawe-Taylor, J. et al.) 2546–2554 (NIPS Foundation, La Jolla, CA, 2011).

  44. 44.

    Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  45. 45.

    DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-Seq data. BMC Bioinformatics 17, 315 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  46. 46.

    Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  47. 47.

    Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome. Biol. 19, 15 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

N.Y. and R.L. were supported by NIH–NIAID (grant U19 AI090023). We thank A. Klein, S. Dudoit, and J. Listgarten for helpful discussions.

Author information

Affiliations

Authors

Contributions

R.L., J.R., and N.Y. conceived the statistical model. R.L. developed the software. R.L. and M.B.C. applied the software to real data analysis. R.L., J.R., N.Y., and M.I.J. wrote the manuscript. N.Y. and M.I.J. supervised the work.

Corresponding author

Correspondence to Nir Yosef.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Robustness analysis for scVI.

(a) Imputation score on the BRAIN-LARGE dataset across multiple random initializations (n = 5) for different dimensions of the latent space (x-axis). The center of the error bars denotes the median, and the extrema denote the s.d. (b) Visualization of scVI numerical objective function during training on the BRAIN-LARGE dataset. This shows that our model does not overfit and has a stable training procedure. (c) Imputation score as a function of the number of epochs on the BRAIN-LARGE dataset. This figure also shows stability across posterior sampling, as there is not much change in the parameters between two subsequent epochs. (d) Clustering metrics on the CORTEX dataset across multiple initializations (n = 5) and dimensions for the latent space. The center of the error bars denotes the median, and the extrema denote the s.d.

Supplementary Figure 2 Stability of the early-stopping criterion on the 1-million-cell sample of the BRAIN-LARGE dataset.

(a) Evolution of the loss function (y-axis) value on a validation set (n = 10,000 cells) with the number of epochs (x-axis). (b) Contrast of the expected frequency ρ values between the model trained with early stopping (x-axis) and the model trained without early stopping (y-axis) on a random subset of n = 100 cells and all 720 genes. We also report the Pearson correlation, r.

Supplementary Figure 3 Imputation results for scVI.

(ac) We investigate how scVI latent space can be used to impute the data (with the uniform perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

Supplementary Figure 4 Imputation of scVI on the CORTEX dataset.

(ad) The heat maps represent density plots of imputed values (by scVI, ZIFA, ZINB-WaVE, and MAGIC respectively) on a downsampled version versus the original (nonzero) values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

Supplementary Figure 5 Imputation of scVI on the CORTEX dataset.

(ad) Models were trained on a binomially corrupted dataset (see Methods). The heat maps denote density plots of imputed values scVI, ZIFA, ZINB-WaVE, and MAGIC on a downsampled version versus original values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

Supplementary Figure 6 Imputation results for scVI.

(ac) We investigate how scVI latent space can be used to impute the data (with the binomial perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

Supplementary Figure 7 Preliminary results for scVI.

Log-likelihood results on a held-out subset of the CORTEX dataset (n = 751 cells).

Supplementary Figure 8 Posterior analysis of generative models on the CORTEX dataset.

(a-c) The observed counts of n = 55,932 randomly selected entries of the data matrix (x-axis) and their posterior uncertainty (y-axis) obtained by sampling n = 500 times from the variational posterior (scVI) or the exact posterior (FA, ZIFA). The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles. (d-f) The observed counts of a representative gene, Thy1, in the CORTEX dataset. Data are presented across all cells (n = 3,005) (x-axis) against the (n = 500) independent posterior expected counts for each cell produced by scVI, ZIFA and FA respectively (y-axis). The values in each axis have been divided into k = 20 bins and the color scale reflects the proportion of cells in each pair of bins. By definition, the uncertainty of FA is independent of the input value and tight around the observed count. ZIFA can generate zero and puts realistically more weight in this area. scVI’s posterior is more complex, able to generate zero for low unique molecular identifier (UMI) values but also able to generate high UMI values when the original count observed was only of intermediate intensity.

Supplementary Figure 9 How scVI latent space can be used to cluster the data, and benchmarking across datasets for state-of-the-art methods.

(ad) The results for the (a) PBMC, (b) CORTEX, (c) CBMC, and (d) RETINA datasets. ASW: average silhouette width of pre- annotated subpopulations (higher is better), ARI: adjusted rand index (higher is better), NMI: normalized mutual information (higher is better), BE: batch mixing entropy (higher is better), BASW: average silhouette width of batches (lower is better). (e-g) The performance of clustering metrics for different depths of the hierarchical clustering in the CORTEX data, computed in the original publication [29]. The numbers in the legend indicate the number of clusters at the given depth.

Supplementary Figure 10 Complement to the analysis on the HEMATO dataset.

(n = 4,016 cells) (ac) All scatter plots illustrate the embedding of a 5-nearest-neighbors graph of a latent space. Cell positions are computed using a force-directed layout. (a) A reduction to 60 pcs as in the original paper is shown. (b) The output of an scVI in dimension 60. As the dimension is sensibly different from other experiments, the warm-up schedule (which governs how the prior on z is enforced) was adjusted. (c) The figure from the main paper. To recover all the differentiation paths, the authors performed several operations on the k-nearest-neighbors graph that we did not reproduce in this analysis. We instead visualize the graph before the smoothing procedure. (d) SIMLR t-SNE on the HEMATO dataset. We prefer to visualize the SIMLR embedding on a k-NNs graph, as even t-SNE would lose the continuum structure of the dataset.

Supplementary Figure 11 Analysis on the ZINB random dataset.

(n = 2,000 cells). For three algorithms (PCA, SIMLR, and scVI), we compute a latent space (we let SIMLR choose k; here SIMLR found 11 clusters). Then, we compute a cell–cell similarity matrix where cells are ordered by the SIMLR clusters (first column). We either apply t-SNE to the latent space (scVI, PCA) or use the two-dimensional embedding from SIMLR and color by number of UMIs (second column), or by SIMLR cluster labels (third column).

Supplementary Figure 12 Batch effect removal on the RETINA dataset.

n = 27,499 cells for scVI with no batch, DCA, PCA, SIMLR, and ComBat. Embedding plots for all methods but SIMLR were generated by application of t-SNE on the respective latent space. For SIMLR, we used the t-SNE coordinates provided by the program and the number of clusters was set to the number of pre-annotated subpopulations (n = 15).

Supplementary Figure 13 Capturing technical variability with scVI.

(a,b) Data are based on the CD14+ cell subpopulation of the PBMC dataset. (a) Scatter plot for each cell (n = 2,237) of inferred scaling factor using scVI against library size. (b) The frequency of observed zero values versus the expected expression level, as produced by scVI. Each of the n = 3,346 points represents a gene g, where the x-axis is ρg, the average expected frequency per cell (for gene g, average over ρg for all cells c in the subpopulation), and the y-axis is the observed percentage of cells that detects the fee (UMI > 0). The green curve depicts the probability for selecting zero transcripts from every gene as a function of its frequency, assuming a simple model of sampling U molecules from a cell with N molecules at random without replacement. U = 1,398 is the average number of UMIs in the subpopulation and N is the average number of transcripts per cell (for this curve N = 10,000). Notably, the curve converges for values larger than 20,000 to the red curve, a binomial selection procedure (conform to the probabilistic limit of the sampling process when N goes to infinity). (c,d) Signed log P values for a Pearson correlation test between the zero probabilities from the two distributions (negative binomial, Bernoulli) and quality control metrics across n = 5 random initializations of scVI and all subpopulations of the PBMC (n = 12,039 cells) and BRAIN-SMALL (n = 9,128 cells) datasets. The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles.

Supplementary Figure 14 The generative distributions of scVI.

This study focuses on a particular subpopulation of the BRAIN-SMALL dataset (n = 2,592). (a) To assess whether most of the zeros in the data come from the negative binomial, for each entry of the count matrix (percentage in y-axis), we plot the probability that a given zero comes from the NB conditioned on having a zero (x-axis). (b) Number of genes detected versus negative binomial zero probability averaged across all genes. (c) Genome_not_gene versus Bernoulli zero probability averaged across all genes. (d) Mapped_reads versus Bernoulli zero probability averaged across all genes.

Supplementary Figure 15 Differential expression with scVI on the PBMC dataset.

(ae) BH-corrected P values from bulk analysis against BH-corrected P values or Bayes factor for CD4+/CD8+ comparison (microarrays, n = 12 in each group, scRNA-seq n = 100 cells). Each point is a gene (g = 3,346). In the order indicated, scVI, edgeR, MAST, DESeq2, DESeq2 on DCA imputed counts. (f) Histogram for the Bayes factor of scVI when applying differential expression to two sets of n = 100 random cells from the same cluster.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–5 and Supplementary Notes 1–6

Reporting Summary

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Lopez, R., Regier, J., Cole, M.B. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018). https://doi.org/10.1038/s41592-018-0229-2

Download citation

Further reading