Article | Published:

Deep generative modeling for single-cell transcriptomics

Nature Methodsvolume 15pages10531058 (2018) | Download Citation

Abstract

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

Access optionsAccess options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Data availability

All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. 1.

    Semrau, S. et al. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8, 1096 (2017).

  2. 2.

    Gaublomme, J. T. et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).

  3. 3.

    Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).

  4. 4.

    Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).

  5. 5.

    Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).

  6. 6.

    Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).

  7. 7.

    Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).

  8. 8.

    Prabhakaran, S., Azizi, E., Carr, A. & Pe’er, D. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. PMLR 48, 1070–1079 (2016).

  9. 9.

    Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome. Biol. 16, 241 (2015).

  10. 10.

    Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).

  11. 11.

    Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).

  12. 12.

    van Dijk, D. et al. MAGIC: a diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).

  13. 13.

    Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome. Biol. 16, 278 (2015).

  14. 14.

    Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  15. 15.

    Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, New York, 2007).

  16. 16.

    Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).

  17. 17.

    Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome. Biol. 15, 550 (2014).

  18. 18.

    Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).

  19. 19.

    Wang, D. & Gu, J. VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/10/06/199315 (2017).

  20. 20.

    Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).

  21. 21.

    Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/16/318295 (2018).

  22. 22.

    Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).

  23. 23.

    Cole, M. B. et al. Performance assessment and selection of normalization procedures for single-cell RNA-seq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/18/235382 (2017).

  24. 24.

    Louizos, C., Swersky, K., Li, Y., Welling, M. & Zemel, R. The variational fair autoencoder. Oral presentation at the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.

  25. 25.

    Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Oral presentation at the International Conference on Learning Representations, Banff, Alberta, Canada, 14–16 April 2014.

  26. 26.

    Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).

  27. 27.

    Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems (eds Lee, D. D. et al.) 3738–3746 (NIPS Foundation, La Jolla, CA, 2016).

  28. 28.

    10x Genomics. Support: single cell gene expression datasets. 10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/datasets (2017).

  29. 29.

    Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).

  30. 30.

    Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

  31. 31.

    Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).

  32. 32.

    Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

  33. 33.

    Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

  34. 34.

    Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

  35. 35.

    Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  36. 36.

    Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).

  37. 37.

    Held, L. & Ott, M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419 (2018).

  38. 38.

    Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).

  39. 39.

    Nakaya, H. I. et al. Systems biology of vaccination for seasonal influenza in humans. Nat. Immunol. 12, 786–795 (2011).

  40. 40.

    Görgün, G., Holderried, T. A. W., Zahrieh, D., Neuberg, D. & Gribben, J. G. Chronic lymphocytic leukemia cells induce changes in gene expression of CD4 and CD8 T cells. J. Clin. Invest. 115, 1797–1805 (2005).

  41. 41.

    Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).

  42. 42.

    Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. Oral presentation at the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.

  43. 43.

    Bergstra, J. S., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24 (eds Shawe-Taylor, J. et al.) 2546–2554 (NIPS Foundation, La Jolla, CA, 2011).

  44. 44.

    Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).

  45. 45.

    DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-Seq data. BMC Bioinformatics 17, 315 (2016).

  46. 46.

    Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).

  47. 47.

    Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome. Biol. 19, 15 (2018).

Download references

Acknowledgements

N.Y. and R.L. were supported by NIH–NIAID (grant U19 AI090023). We thank A. Klein, S. Dudoit, and J. Listgarten for helpful discussions.

Author information

Affiliations

  1. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA

    • Romain Lopez
    • , Jeffrey Regier
    • , Michael I. Jordan
    •  & Nir Yosef
  2. Department of Physics, University of California, Berkeley, Berkeley, CA, USA

    • Michael B. Cole
  3. Department of Statistics, University of California, Berkeley, Berkeley, CA, USA

    • Michael I. Jordan
  4. Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA

    • Nir Yosef
  5. Chan Zuckerberg BioHub, San Francisco, CA, USA

    • Nir Yosef

Authors

  1. Search for Romain Lopez in:

  2. Search for Jeffrey Regier in:

  3. Search for Michael B. Cole in:

  4. Search for Michael I. Jordan in:

  5. Search for Nir Yosef in:

Contributions

R.L., J.R., and N.Y. conceived the statistical model. R.L. developed the software. R.L. and M.B.C. applied the software to real data analysis. R.L., J.R., N.Y., and M.I.J. wrote the manuscript. N.Y. and M.I.J. supervised the work.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Nir Yosef.

Integrated supplementary information

  1. Supplementary Figure 1 Robustness analysis for scVI.

    (a) Imputation score on the BRAIN-LARGE dataset across multiple random initializations (n = 5) for different dimensions of the latent space (x-axis). The center of the error bars denotes the median, and the extrema denote the s.d. (b) Visualization of scVI numerical objective function during training on the BRAIN-LARGE dataset. This shows that our model does not overfit and has a stable training procedure. (c) Imputation score as a function of the number of epochs on the BRAIN-LARGE dataset. This figure also shows stability across posterior sampling, as there is not much change in the parameters between two subsequent epochs. (d) Clustering metrics on the CORTEX dataset across multiple initializations (n = 5) and dimensions for the latent space. The center of the error bars denotes the median, and the extrema denote the s.d.

  2. Supplementary Figure 2 Stability of the early-stopping criterion on the 1-million-cell sample of the BRAIN-LARGE dataset.

    (a) Evolution of the loss function (y-axis) value on a validation set (n = 10,000 cells) with the number of epochs (x-axis). (b) Contrast of the expected frequency ρ values between the model trained with early stopping (x-axis) and the model trained without early stopping (y-axis) on a random subset of n = 100 cells and all 720 genes. We also report the Pearson correlation, r.

  3. Supplementary Figure 3 Imputation results for scVI.

    (ac) We investigate how scVI latent space can be used to impute the data (with the uniform perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

  4. Supplementary Figure 4 Imputation of scVI on the CORTEX dataset.

    (ad) The heat maps represent density plots of imputed values (by scVI, ZIFA, ZINB-WaVE, and MAGIC respectively) on a downsampled version versus the original (nonzero) values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

  5. Supplementary Figure 5 Imputation of scVI on the CORTEX dataset.

    (ad) Models were trained on a binomially corrupted dataset (see Methods). The heat maps denote density plots of imputed values scVI, ZIFA, ZINB-WaVE, and MAGIC on a downsampled version versus original values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

  6. Supplementary Figure 6 Imputation results for scVI.

    (ac) We investigate how scVI latent space can be used to impute the data (with the binomial perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

  7. Supplementary Figure 7 Preliminary results for scVI.

    Log-likelihood results on a held-out subset of the CORTEX dataset (n = 751 cells).

  8. Supplementary Figure 8 Posterior analysis of generative models on the CORTEX dataset.

    (a-c) The observed counts of n = 55,932 randomly selected entries of the data matrix (x-axis) and their posterior uncertainty (y-axis) obtained by sampling n = 500 times from the variational posterior (scVI) or the exact posterior (FA, ZIFA). The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles. (d-f) The observed counts of a representative gene, Thy1, in the CORTEX dataset. Data are presented across all cells (n = 3,005) (x-axis) against the (n = 500) independent posterior expected counts for each cell produced by scVI, ZIFA and FA respectively (y-axis). The values in each axis have been divided into k = 20 bins and the color scale reflects the proportion of cells in each pair of bins. By definition, the uncertainty of FA is independent of the input value and tight around the observed count. ZIFA can generate zero and puts realistically more weight in this area. scVI’s posterior is more complex, able to generate zero for low unique molecular identifier (UMI) values but also able to generate high UMI values when the original count observed was only of intermediate intensity.

  9. Supplementary Figure 9 How scVI latent space can be used to cluster the data, and benchmarking across datasets for state-of-the-art methods.

    (ad) The results for the (a) PBMC, (b) CORTEX, (c) CBMC, and (d) RETINA datasets. ASW: average silhouette width of pre- annotated subpopulations (higher is better), ARI: adjusted rand index (higher is better), NMI: normalized mutual information (higher is better), BE: batch mixing entropy (higher is better), BASW: average silhouette width of batches (lower is better). (e-g) The performance of clustering metrics for different depths of the hierarchical clustering in the CORTEX data, computed in the original publication [29]. The numbers in the legend indicate the number of clusters at the given depth.

  10. Supplementary Figure 10 Complement to the analysis on the HEMATO dataset.

    (n = 4,016 cells) (ac) All scatter plots illustrate the embedding of a 5-nearest-neighbors graph of a latent space. Cell positions are computed using a force-directed layout. (a) A reduction to 60 pcs as in the original paper is shown. (b) The output of an scVI in dimension 60. As the dimension is sensibly different from other experiments, the warm-up schedule (which governs how the prior on z is enforced) was adjusted. (c) The figure from the main paper. To recover all the differentiation paths, the authors performed several operations on the k-nearest-neighbors graph that we did not reproduce in this analysis. We instead visualize the graph before the smoothing procedure. (d) SIMLR t-SNE on the HEMATO dataset. We prefer to visualize the SIMLR embedding on a k-NNs graph, as even t-SNE would lose the continuum structure of the dataset.

  11. Supplementary Figure 11 Analysis on the ZINB random dataset.

    (n = 2,000 cells). For three algorithms (PCA, SIMLR, and scVI), we compute a latent space (we let SIMLR choose k; here SIMLR found 11 clusters). Then, we compute a cell–cell similarity matrix where cells are ordered by the SIMLR clusters (first column). We either apply t-SNE to the latent space (scVI, PCA) or use the two-dimensional embedding from SIMLR and color by number of UMIs (second column), or by SIMLR cluster labels (third column).

  12. Supplementary Figure 12 Batch effect removal on the RETINA dataset.

    n = 27,499 cells for scVI with no batch, DCA, PCA, SIMLR, and ComBat. Embedding plots for all methods but SIMLR were generated by application of t-SNE on the respective latent space. For SIMLR, we used the t-SNE coordinates provided by the program and the number of clusters was set to the number of pre-annotated subpopulations (n = 15).

  13. Supplementary Figure 13 Capturing technical variability with scVI.

    (a,b) Data are based on the CD14+ cell subpopulation of the PBMC dataset. (a) Scatter plot for each cell (n = 2,237) of inferred scaling factor using scVI against library size. (b) The frequency of observed zero values versus the expected expression level, as produced by scVI. Each of the n = 3,346 points represents a gene g, where the x-axis is ρg, the average expected frequency per cell (for gene g, average over ρg for all cells c in the subpopulation), and the y-axis is the observed percentage of cells that detects the fee (UMI > 0). The green curve depicts the probability for selecting zero transcripts from every gene as a function of its frequency, assuming a simple model of sampling U molecules from a cell with N molecules at random without replacement. U = 1,398 is the average number of UMIs in the subpopulation and N is the average number of transcripts per cell (for this curve N = 10,000). Notably, the curve converges for values larger than 20,000 to the red curve, a binomial selection procedure (conform to the probabilistic limit of the sampling process when N goes to infinity). (c,d) Signed log P values for a Pearson correlation test between the zero probabilities from the two distributions (negative binomial, Bernoulli) and quality control metrics across n = 5 random initializations of scVI and all subpopulations of the PBMC (n = 12,039 cells) and BRAIN-SMALL (n = 9,128 cells) datasets. The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles.

  14. Supplementary Figure 14 The generative distributions of scVI.

    This study focuses on a particular subpopulation of the BRAIN-SMALL dataset (n = 2,592). (a) To assess whether most of the zeros in the data come from the negative binomial, for each entry of the count matrix (percentage in y-axis), we plot the probability that a given zero comes from the NB conditioned on having a zero (x-axis). (b) Number of genes detected versus negative binomial zero probability averaged across all genes. (c) Genome_not_gene versus Bernoulli zero probability averaged across all genes. (d) Mapped_reads versus Bernoulli zero probability averaged across all genes.

  15. Supplementary Figure 15 Differential expression with scVI on the PBMC dataset.

    (ae) BH-corrected P values from bulk analysis against BH-corrected P values or Bayes factor for CD4+/CD8+ comparison (microarrays, n = 12 in each group, scRNA-seq n = 100 cells). Each point is a gene (g = 3,346). In the order indicated, scVI, edgeR, MAST, DESeq2, DESeq2 on DCA imputed counts. (f) Histogram for the Bayes factor of scVI when applying differential expression to two sets of n = 100 random cells from the same cluster.

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–15, Supplementary Tables 1–5 and Supplementary Notes 1–6

  2. Reporting Summary

About this article

Publication history

Received

Accepted

Published

DOI

https://doi.org/10.1038/s41592-018-0229-2

Further reading