Deep generative modeling for single-cell transcriptomics

Lopez, Romain; Regier, Jeffrey; Cole, Michael B.; Jordan, Michael I.; Yosef, Nir

doi:10.1038/s41592-018-0229-2

Article
Published: 30 November 2018

Deep generative modeling for single-cell transcriptomics

Nature Methods volume 15, pages 1053–1058 (2018)Cite this article

83k Accesses
700 Citations
182 Altmetric
Metrics details

Subjects

Abstract

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 2: Biological signal is retained in the scVI latent space.**

**Fig. 3: Benchmarking of differential expression analysis.**

Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations

Article Open access 09 April 2024

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

Article Open access 12 April 2024

Simultaneous single-cell three-dimensional genome and gene expression profiling uncovers dynamic enhancer connectivity underlying olfactory receptor choice

Article Open access 15 April 2024

Data availability

All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.

References

Semrau, S. et al. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8, 1096 (2017).
Article PubMed PubMed Central Google Scholar
Gaublomme, J. T. et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).
Article CAS PubMed PubMed Central Google Scholar
Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
Article CAS PubMed PubMed Central Google Scholar
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
Article CAS PubMed PubMed Central Google Scholar
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
Article CAS PubMed PubMed Central Google Scholar
Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
Article CAS PubMed PubMed Central Google Scholar
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
Article CAS PubMed PubMed Central Google Scholar
Prabhakaran, S., Azizi, E., Carr, A. & Pe’er, D. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. PMLR 48, 1070–1079 (2016).
Google Scholar
Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome. Biol. 16, 241 (2015).
Article PubMed PubMed Central Google Scholar
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Article PubMed PubMed Central Google Scholar
Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
Article CAS PubMed Google Scholar
van Dijk, D. et al. MAGIC: a diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome. Biol. 16, 278 (2015).
Article PubMed PubMed Central Google Scholar
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Article PubMed PubMed Central Google Scholar
Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, New York, 2007).
Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
Article PubMed Google Scholar
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome. Biol. 15, 550 (2014).
Article PubMed PubMed Central Google Scholar
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
Article PubMed PubMed Central Google Scholar
Wang, D. & Gu, J. VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/10/06/199315 (2017).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).
Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/16/318295 (2018).
Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
Article PubMed PubMed Central Google Scholar
Cole, M. B. et al. Performance assessment and selection of normalization procedures for single-cell RNA-seq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/18/235382 (2017).
Louizos, C., Swersky, K., Li, Y., Welling, M. & Zemel, R. The variational fair autoencoder. Oral presentation at the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Oral presentation at the International Conference on Learning Representations, Banff, Alberta, Canada, 14–16 April 2014.
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Article CAS Google Scholar
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems (eds Lee, D. D. et al.) 3738–3746 (NIPS Foundation, La Jolla, CA, 2016).
10x Genomics. Support: single cell gene expression datasets. 10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/datasets (2017).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
Article CAS PubMed Google Scholar
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Article CAS PubMed PubMed Central Google Scholar
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
Article CAS PubMed PubMed Central Google Scholar
Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
Article CAS PubMed PubMed Central Google Scholar
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Article CAS PubMed PubMed Central Google Scholar
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Article PubMed Google Scholar
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Article CAS PubMed PubMed Central Google Scholar
Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
Article Google Scholar
Held, L. & Ott, M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419 (2018).
Article Google Scholar
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Article CAS PubMed Google Scholar
Nakaya, H. I. et al. Systems biology of vaccination for seasonal influenza in humans. Nat. Immunol. 12, 786–795 (2011).
Article CAS PubMed PubMed Central Google Scholar
Görgün, G., Holderried, T. A. W., Zahrieh, D., Neuberg, D. & Gribben, J. G. Chronic lymphocytic leukemia cells induce changes in gene expression of CD4 and CD8 T cells. J. Clin. Invest. 115, 1797–1805 (2005).
Article PubMed PubMed Central Google Scholar
Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).
Article Google Scholar
Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. Oral presentation at the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
Bergstra, J. S., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24 (eds Shawe-Taylor, J. et al.) 2546–2554 (NIPS Foundation, La Jolla, CA, 2011).
Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).
Article CAS PubMed PubMed Central Google Scholar
DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-Seq data. BMC Bioinformatics 17, 315 (2016).
Article PubMed PubMed Central Google Scholar
Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).
Article CAS PubMed PubMed Central Google Scholar
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome. Biol. 19, 15 (2018).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

N.Y. and R.L. were supported by NIH–NIAID (grant U19 AI090023). We thank A. Klein, S. Dudoit, and J. Listgarten for helpful discussions.

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
Romain Lopez, Jeffrey Regier, Michael I. Jordan & Nir Yosef
Department of Physics, University of California, Berkeley, Berkeley, CA, USA
Michael B. Cole
Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
Michael I. Jordan
Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA
Nir Yosef
Chan Zuckerberg BioHub, San Francisco, CA, USA
Nir Yosef

Authors

Romain Lopez
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey Regier
View author publications
You can also search for this author in PubMed Google Scholar
Michael B. Cole
View author publications
You can also search for this author in PubMed Google Scholar
Michael I. Jordan
View author publications
You can also search for this author in PubMed Google Scholar
Nir Yosef
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

R.L., J.R., and N.Y. conceived the statistical model. R.L. developed the software. R.L. and M.B.C. applied the software to real data analysis. R.L., J.R., N.Y., and M.I.J. wrote the manuscript. N.Y. and M.I.J. supervised the work.

Corresponding author

Correspondence to Nir Yosef.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Robustness analysis for scVI.

(a) Imputation score on the BRAIN-LARGE dataset across multiple random initializations (n = 5) for different dimensions of the latent space (x-axis). The center of the error bars denotes the median, and the extrema denote the s.d. (b) Visualization of scVI numerical objective function during training on the BRAIN-LARGE dataset. This shows that our model does not overfit and has a stable training procedure. (c) Imputation score as a function of the number of epochs on the BRAIN-LARGE dataset. This figure also shows stability across posterior sampling, as there is not much change in the parameters between two subsequent epochs. (d) Clustering metrics on the CORTEX dataset across multiple initializations (n = 5) and dimensions for the latent space. The center of the error bars denotes the median, and the extrema denote the s.d.

Supplementary Figure 2 Stability of the early-stopping criterion on the 1-million-cell sample of the BRAIN-LARGE dataset.

(a) Evolution of the loss function (y-axis) value on a validation set (n = 10,000 cells) with the number of epochs (x-axis). (b) Contrast of the expected frequency ρ values between the model trained with early stopping (x-axis) and the model trained without early stopping (y-axis) on a random subset of n = 100 cells and all 720 genes. We also report the Pearson correlation, r.

Supplementary Figure 3 Imputation results for scVI.

(a–c) We investigate how scVI latent space can be used to impute the data (with the uniform perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

Supplementary Figure 4 Imputation of scVI on the CORTEX dataset.

(a–d) The heat maps represent density plots of imputed values (by scVI, ZIFA, ZINB-WaVE, and MAGIC respectively) on a downsampled version versus the original (nonzero) values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

Supplementary Figure 5 Imputation of scVI on the CORTEX dataset.

(a–d) Models were trained on a binomially corrupted dataset (see Methods). The heat maps denote density plots of imputed values scVI, ZIFA, ZINB-WaVE, and MAGIC on a downsampled version versus original values prior to downsampling. The reported score d is the median imputation error across all the hidden entries (lower is better; see Methods). Each density plot was computed using n = 55,932 independently perturbed entries from the original matrix.

Supplementary Figure 6 Imputation results for scVI.

(a–c) We investigate how scVI latent space can be used to impute the data (with the binomial perturbation scheme) and report benchmarking across datasets for state-of-the-art methods.

Supplementary Figure 7 Preliminary results for scVI.

Log-likelihood results on a held-out subset of the CORTEX dataset (n = 751 cells).

Supplementary Figure 8 Posterior analysis of generative models on the CORTEX dataset.

(a-c) The observed counts of n = 55,932 randomly selected entries of the data matrix (x-axis) and their posterior uncertainty (y-axis) obtained by sampling n = 500 times from the variational posterior (scVI) or the exact posterior (FA, ZIFA). The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles. (d-f) The observed counts of a representative gene, Thy1, in the CORTEX dataset. Data are presented across all cells (n = 3,005) (x-axis) against the (n = 500) independent posterior expected counts for each cell produced by scVI, ZIFA and FA respectively (y-axis). The values in each axis have been divided into k = 20 bins and the color scale reflects the proportion of cells in each pair of bins. By definition, the uncertainty of FA is independent of the input value and tight around the observed count. ZIFA can generate zero and puts realistically more weight in this area. scVI’s posterior is more complex, able to generate zero for low unique molecular identifier (UMI) values but also able to generate high UMI values when the original count observed was only of intermediate intensity.

Supplementary Figure 9 How scVI latent space can be used to cluster the data, and benchmarking across datasets for state-of-the-art methods.

(a–d) The results for the (a) PBMC, (b) CORTEX, (c) CBMC, and (d) RETINA datasets. ASW: average silhouette width of pre- annotated subpopulations (higher is better), ARI: adjusted rand index (higher is better), NMI: normalized mutual information (higher is better), BE: batch mixing entropy (higher is better), BASW: average silhouette width of batches (lower is better). (e-g) The performance of clustering metrics for different depths of the hierarchical clustering in the CORTEX data, computed in the original publication [29]. The numbers in the legend indicate the number of clusters at the given depth.

Supplementary Figure 10 Complement to the analysis on the HEMATO dataset.

(n = 4,016 cells) (a–c) All scatter plots illustrate the embedding of a 5-nearest-neighbors graph of a latent space. Cell positions are computed using a force-directed layout. (a) A reduction to 60 pcs as in the original paper is shown. (b) The output of an scVI in dimension 60. As the dimension is sensibly different from other experiments, the warm-up schedule (which governs how the prior on z is enforced) was adjusted. (c) The figure from the main paper. To recover all the differentiation paths, the authors performed several operations on the k-nearest-neighbors graph that we did not reproduce in this analysis. We instead visualize the graph before the smoothing procedure. (d) SIMLR t-SNE on the HEMATO dataset. We prefer to visualize the SIMLR embedding on a k-NNs graph, as even t-SNE would lose the continuum structure of the dataset.

Supplementary Figure 11 Analysis on the ZINB random dataset.

(n = 2,000 cells). For three algorithms (PCA, SIMLR, and scVI), we compute a latent space (we let SIMLR choose k; here SIMLR found 11 clusters). Then, we compute a cell–cell similarity matrix where cells are ordered by the SIMLR clusters (first column). We either apply t-SNE to the latent space (scVI, PCA) or use the two-dimensional embedding from SIMLR and color by number of UMIs (second column), or by SIMLR cluster labels (third column).

Supplementary Figure 12 Batch effect removal on the RETINA dataset.

n = 27,499 cells for scVI with no batch, DCA, PCA, SIMLR, and ComBat. Embedding plots for all methods but SIMLR were generated by application of t-SNE on the respective latent space. For SIMLR, we used the t-SNE coordinates provided by the program and the number of clusters was set to the number of pre-annotated subpopulations (n = 15).

Supplementary Figure 13 Capturing technical variability with scVI.

(a,b) Data are based on the CD14⁺ cell subpopulation of the PBMC dataset. (a) Scatter plot for each cell (n = 2,237) of inferred scaling factor using scVI against library size. (b) The frequency of observed zero values versus the expected expression level, as produced by scVI. Each of the n = 3,346 points represents a gene g, where the x-axis is ρ^g, the average expected frequency per cell (for gene g, average over ρ^g for all cells c in the subpopulation), and the y-axis is the observed percentage of cells that detects the fee (UMI > 0). The green curve depicts the probability for selecting zero transcripts from every gene as a function of its frequency, assuming a simple model of sampling U molecules from a cell with N molecules at random without replacement. U = 1,398 is the average number of UMIs in the subpopulation and N is the average number of transcripts per cell (for this curve N = 10,000). Notably, the curve converges for values larger than 20,000 to the red curve, a binomial selection procedure (conform to the probabilistic limit of the sampling process when N goes to infinity). (c,d) Signed log P values for a Pearson correlation test between the zero probabilities from the two distributions (negative binomial, Bernoulli) and quality control metrics across n = 5 random initializations of scVI and all subpopulations of the PBMC (n = 12,039 cells) and BRAIN-SMALL (n = 9,128 cells) datasets. The center of the box plot is the median and the hinges correspond to the interquartile range, the distance between the first and third quartiles; the whiskers represent the 5th to 95th percentiles.

Supplementary Figure 14 The generative distributions of scVI.

This study focuses on a particular subpopulation of the BRAIN-SMALL dataset (n = 2,592). (a) To assess whether most of the zeros in the data come from the negative binomial, for each entry of the count matrix (percentage in y-axis), we plot the probability that a given zero comes from the NB conditioned on having a zero (x-axis). (b) Number of genes detected versus negative binomial zero probability averaged across all genes. (c) Genome_not_gene versus Bernoulli zero probability averaged across all genes. (d) Mapped_reads versus Bernoulli zero probability averaged across all genes.

Supplementary Figure 15 Differential expression with scVI on the PBMC dataset.

(a–e) BH-corrected P values from bulk analysis against BH-corrected P values or Bayes factor for CD4⁺/CD8⁺ comparison (microarrays, n = 12 in each group, scRNA-seq n = 100 cells). Each point is a gene (g = 3,346). In the order indicated, scVI, edgeR, MAST, DESeq2, DESeq2 on DCA imputed counts. (f) Histogram for the Bayes factor of scVI when applying differential expression to two sets of n = 100 random cells from the same cluster.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–5 and Supplementary Notes 1–6

Reporting Summary

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lopez, R., Regier, J., Cole, M.B. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018). https://doi.org/10.1038/s41592-018-0229-2

Download citation

Received: 30 March 2018
Accepted: 26 October 2018
Published: 30 November 2018
Issue Date: December 2018
DOI: https://doi.org/10.1038/s41592-018-0229-2

This article is cited by

Integration of scRNA-seq data by disentangled representation learning with condition domain adaptation
- Renjing Liu
- Kun Qian
- Hongwei Li
BMC Bioinformatics (2024)
InClust+: the deep generative framework with mask modules for multimodal data integration, imputation, and cross-modal generation
- Lifei Wang
- Rui Nie
- Jun Cai
BMC Bioinformatics (2024)
Single-cell characterisation of tissue homing CD4 + and CD8 + T cell clones in immune-mediated refractory arthritis
- Dipabarna Bhattacharya
- Jason Theodoropoulos
- Tapio Lönnberg
Molecular Medicine (2024)
An atlas of cell-type-specific interactome networks across 44 human tumor types
- Zekun Li
- Gerui Liu
- Yang Yang
Genome Medicine (2024)
Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks
- Jamshaid A. Shahir
- Natalie Stanley
- Jeremy E. Purvis
BMC Bioinformatics (2024)