Joint probabilistic modeling of single-cell multi-omic data with totalVI

Gayoso, Adam; Steier, Zoë; Lopez, Romain; Regier, Jeffrey; Nazor, Kristopher L.; Streets, Aaron; Yosef, Nir

doi:10.1038/s41592-020-01050-x

Article
Published: 15 February 2021

Nature Methods volume 18, pages 272–282 (2021)

Abstract

The paired measurement of RNA and surface proteins in single cells with cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a promising approach to connect transcriptional variation with cell phenotypes and functions. However, combining these paired views into a unified representation of cell state is made challenging by the unique technical characteristics of each measurement. Here we present Total Variational Inference (totalVI; https://scvi-tools.org), a framework for end-to-end joint analysis of CITE-seq data that probabilistically represents the data as a composite of biological and technical factors, including protein background and batch effects. To evaluate totalVI’s performance, we profiled immune cells from murine spleen and lymph nodes with CITE-seq, measuring over 100 surface proteins. We demonstrate that totalVI provides a cohesive solution for common analysis tasks such as dimensionality reduction, the integration of datasets with different measured proteins, estimation of correlations between molecules and differential expression testing.

**Fig. 1: Schematic of a CITE-seq data analysis pipeline with totalVI.**

**Fig. 2: totalVI identifies and corrects for protein background.**

**Fig. 3: Benchmarking of integration methods for CITE-seq data.**

**Fig. 4: totalVI identifies differentially expressed genes and proteins.**

**Fig. 5: Characterization of B-cell heterogeneity in the spleen and lymph nodes with RNA and proteins.**

Data availability

The data discussed in this manuscript (SLN-all) have been deposited in the National Center for Biotechnology Information’s Gene Expression Omnibus⁹³ and are accessible through accession number GSE150599. Processed data are also available in the reproducibility GitHub repository (https://github.com/YosefLab/totalVI_reproducibility). The SLN-all dataset processed with totalVI can be explored interactively with Vision at http://s133.cs.berkeley.edu:9000/Results.html. Public datasets were downloaded from 10X Genomics (PBMC5k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3; PBMC10k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3; MALT: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3). Mouse mm10 reference was downloaded from 10X Genomics.

Code availability

The code to reproduce the results in this manuscript is available at https://github.com/YosefLab/totalVI_reproducibility and has been deposited at https://doi.org/10.5281/zenodo.4330368 (ref. ⁹⁴). The reference implementation of totalVI is available via the scvi-tools package at https://github.com/YosefLab/scvi-tools.

References

Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63 (2017).
Papalexi, E. & Satija, R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. https://doi.org/10.1038/nri.2017.76 (2017).
Labib, M. & Kelley, S. O. Single-cell analysis targeting the proteome. Nat. Rev. Chem. 4, 143–158 (2020).
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. https://doi.org/10.1038/nbt.3711 (2016).
Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods https://doi.org/10.1038/nmeth.4380 (2017).
Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. https://doi.org/10.1038/nbt.3973 (2017).
Regev, A. et al. The Human Cell Atlas. eLife https://doi.org/10.7554/eLife.27041 (2017).
Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature https://doi.org/10.1038/nature21350 (2017).
Todorovic, V. Single-cell RNA-seq—now with protein. Nat. Methods 14, 1028–1029 (2017).
Haque, A., Engel, J., Teichmann, S. A. & Lönnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 1–12 (2017).
Granja, J. M. et al. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia. Nat. Biotechnol. 37, 1458–1465 (2019).
Praktiknjo, S. D. et al. Tracing tumorigenesis in a solid tumor model at single-cell resolution. Nat. Commun. 11, 991 (2020).
Kotliarov, Y. et al. Broad immune activation underlies shared set point signatures for vaccine responsiveness in healthy individuals and disease activity in patients with lupus. Nat. Med. 26, 618–629 (2020).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Levitin, H. M. et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol. Sys. Biol. 15, e8557 (2019).
Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genomics Comput. Biol. https://doi.org/10.18547/gcb.2017.vol3.iss1.e46 (2017).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. https://doi.org/10.1038/s41467-017-02554-5 (2018).
Blei, D. M. Build, compute, critique, repeat: Data analysis with latent variable models. Annu. Rev. Stat. Appl. https://doi.org/10.1146/annurev-statistics-022513-115657 (2014).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. in 2nd International Conference on Learning Representations https://arxiv.org/abs/1312.6114v10 (2014).
Cutler, A. & Breiman, L. Archetypal analysis. Technometrics https://doi.org/10.1080/00401706.1994.10485840 (1994).
Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. https://doi.org/10.1186/s13059-018-1603-1 (2018).
10X Genomics. 10k PBMCs from a healthy donor—gene expression and cell surface protein (2018).
10X Genomics. 10k Cells from a MALT tumor—gene expression and cell surface protein (2018).
Gelman, A., Meng, X. L. & Stern, H. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–760 (1996).
Kuleshov, V., Fenner, N. & Ermon, S. Accurate uncertainties for deep learning using calibrated regression. in 35th International Conference on Machine Learning 80, 2796–2804 (2018).
Hulspas, R., O’Gorman, M. R. G., Wood, B. L., Gratama, J. W. & Sutherland, D. R. Considerations for the control of background fluorescence in clinical flow cytometry. Cytometry B Clin. Cytom. https://doi.org/10.1002/cyto.b.20485 (2009).
Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).
Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience https://doi.org/10.1093/gigascience/giaa151 (2020).
Fleming, S. J., Marioni, J. C. & Babadi, M. CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets. Preprint at bioRxiv https://doi.org/10.1101/791699 (2019).
Ngo Trong, T. et al. Semisupervised generative autoencoder for single-cell data. J. Comput. Biol. https://doi.org/10.1089/cmb.2019.0337 (2019).
Li, B. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat. Methods 17, 793–798 (2020).
Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Research https://doi.org/10.12688/f1000research.16613.2 (2019).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. https://doi.org/10.1038/s41587-019-0113-3 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
10X Genomics. 5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor with cell surface proteins (v3 chemistry). (2019).
Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 1–10 (2020).
Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
Boyeau, P. et al. Deep generative models for detecting differential expression in single cells. Preprint at bioRxiv https://doi.org/10.1101/794289 (2019).
Bezman, N. A. et al. Molecular definition of the identity and activation of natural killer cells. Nat. Immunol. 13, 1000–1008 (2012).
Walzer, T. et al. Identification, activation, and selective in vivo ablation of mouse NK cells via NKp46. PNAS 104, 3384–3389 (2007).
Gordon, S. M. et al. The transcription factors T-bet and Eomes control key checkpoints of natural killer cell maturation. Immunity 36, 55–67 (2012).
Korem, Y. et al. Geometry of the gene expression space of individual cells. PLoS Comput. Biol. 11, 1–27 (2015).
Dijk, D. van et al. Finding archetypal spaces for data using neural networks. Preprint at arXiv https://arxiv.org/abs/1901.09078 (2019).
Thomas, M. D., Srivastava, B. & Allman, D. Regulation of peripheral B cell maturation. Cell. Immunol. 239, 92–102 (2006).
Loder, F. et al. B cell development in the spleen takes place in discrete steps and is determined by the quality of B cell receptor-derived signals. J. Exp. Med. 190, 75–89 (1999).
Kreslavsky, T. et al. Essential role for the transcription factor Bhlhe41 in regulating the development, self-renewal and BCR repertoire of B-1a cells. Nat. Immunol. 18, 442–455 (2017).
DeTomaso, D. et al. Functional interpretation of single cell similarity maps. Nat. Commun. 10, 4376 (2019).
Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542 (2013).
Argelaguet, R. et al. Multi‐omics factor analysis—a framework for unsupervised integration of multi‐omics data sets. Mol. Sys. Biol. 14, 1–13 (2018).
Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).
Gorin, G., Svensson, V. & Pachter, L. Protein velocity and acceleration from single-cell multiomics experiments. Genome Biol. 21, 1–6 (2020).
Svensson, V., Beltrame, E. da V. & Pachter, L. Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq. Preprint at bioRxiv https://doi.org/10.1101/762773 (2019).
Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Sys. 2, 239–250 (2016).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. https://doi.org/10.1186/s13059-017-1382-0 (2018).
Clark, S. J. et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 1–9 (2018).
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics https://doi.org/10.1101/737601 (2020).
Wang, C. & Blei, D. M. A general method for robust Bayesian modeling. Bayesian Anal. https://doi.org/10.1214/17-BA1090 (2018).
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. in Advances in Neural Information Processing Systems 29, 3738–3746 (2016).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. in 3rd International Conference on Learning Representations http://arxiv.org/abs/1412.6980 (2014).
Lopez, R. et al. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. in ICML Workshop in Computational Biology (2019).
Mattei, P. A. & Frellsen, J. MIWAE: deep generative modelling and imputation of incomplete data sets. in 36th International Conference on Machine Learning 97, 4413–4423 (2019).
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F. & Wortman, J. Learning bounds for domain adaptation. in Advances in Neural Information Processing Systems 20, 129–136 (2008).
Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 2096–2030 (2016).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36 (Suppl. 2), i610–i617 (2020).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. https://doi.org/10.1038/ncomms14049 (2017).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics https://doi.org/10.1093/bioinformatics/bts635 (2013).
Gayoso, A. et al. DoubletDetection (v.2.5.2). Zenodo. https://doi.org/10.5281/zenodo.2678041 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
Kucukelbir, A., Wang, Y. & Blei, D. M. Evaluating Bayesian models with posterior dispersion indices. Proc. 34th Intl. Conf. Machine Learning 70, 1925–1934 (2017).
Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
Lai, L., Alaverdi, N., Maltais, L. & Morse, H. C. Immunophenotyping mouse cell surface antigens: nomenclature and immunophenotyping. J. Immunol. 160, 3861–3868 (1998).
Watts, C. Capture and processing of exogenous antigens for presentation on MHC molecules. Ann. Rev. Immunol. 15, 821–850 (1997).
Uchida, J. et al. Mouse CD20 expression and function. Int. Immunol. https://doi.org/10.1093/intimm/dxh009 (2004).
Hünig, T., Beyersdorf, N. & Kerkau, T. CD28 co-stimulation in T-cell homeostasis: a recent perspective. ImmunoTargets Ther. 4, 111 (2015).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426 (2018).
Filion, L. G., Izaguirre, C. A., Garber, G. E., Huebsh, L. & Aye, M. T. Detection of surface and cytoplasmic CD4 on blood monocytes from normal and HIV-1 infected individuals. J. Immunol. Methods 135, 59–69 (1990).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
DeTomaso, D. & Yosef, N. Identifying informative gene modules across modalities of single cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.02.06.937805 (2020).
Traag, V., Waltman, L. & Eck, N. J. van. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Zhao, H., Liao, X. & Kang, Y. Tregs: where we are and what comes next? Front. Immunol. https://doi.org/10.3389/fimmu.2017.01578 (2017).
Roncarolo, M.-G. & Gregori, S. Is FOXP3 a bona fide marker for human regulatory T cells? Eur. J. Immunol. 38, 925–927 (2008).
Fontenot, J. D., Rasmussen, J. P., Gavin, M. A. & Rudensky, A. Y. A function for interleukin 2 in Foxp3-expressing regulatory T cells. Nat. Immunol. 6, 1142–1151 (2005).
Sprouse, M. L. et al. High self-reactivity drives T-bet and potentiates Treg function in tissue-specific autoimmunity. JCI Insight 3, 1–14 (2018).
Burda, Y., Grosse, R. & Salakhutdinov, R. Importance Weighted Autoencoders. in International Conference on Learning Representations http://arxiv.org/abs/1509.00519 (2016).
Liberzon, A. et al. Databases and ontologies Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
Gayoso, A. and Steier, Z. YosefLab/totalVI_reproducibility: totalVI reproducibility (v.0.3). Zenodo. https://doi.org/10.5281/zenodo.4330368 (2020).

Acknowledgements

We thank E. Robey, L. Lutes and D. Bangs for help designing experiments. We thank BioLegend and their proteogenomics team, especially B. Yeung, A. Fernandes, Q. Gao, H. Zhang and T. S. Huang for providing reagents and expertise and for help with sample preparation, library generation and sequencing of CITE-seq libraries. We thank D. DeTomaso for general data analysis advice and P. Boyeau, A. Nazaret and G. Xing for help with integrating totalVI in the scvi-tools package. We thank members of the Streets and Yosef laboratories for helpful feedback. Research reported in this manuscript was supported by the NIGMS of the National Institutes of Health under award number R35GM124916 (A.S.), the Chan Zuckerberg Foundation Network under grant number 2019-02452 (N.Y.) and the National Institute of Mental Health under grant number U19MH114821 (N.Y.). A.G. was supported by National Institutes of Health Training Grant 5T32HG000047-19. Z.S. was supported by the National Science Foundation Graduate Research Fellowship. N.Y. was supported by the Koret-Berkeley-Tel Aviv Initiative in Computational Biology. A.S. and N.Y. are Chan Zuckerberg Biohub investigators.

Author information

These authors contributed equally: Adam Gayoso, Zoë Steier.

Authors and Affiliations

Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
Adam Gayoso, Aaron Streets & Nir Yosef
Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
Zoë Steier & Aaron Streets
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
Romain Lopez & Nir Yosef
Department of Statistics, University of Michigan, Ann Arbor, Ann Arbor, MI, USA
Jeffrey Regier
BioLegend, Inc., San Diego, CA, USA
Kristopher L. Nazor
Chan Zuckerberg Biohub, San Francisco, CA, USA
Aaron Streets & Nir Yosef
Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA
Nir Yosef

Contributions

A.G. and Z.S. contributed equally. A.G., Z.S., A.S. and N.Y. designed the study. A.G., Z.S., R.L., J.R. and N.Y. conceived the statistical model. A.G. implemented the totalVI software with input from R.L. K.L.N. designed and produced antibody panels and provided input on the study. Z.S. designed and led experiments with input from A.S. and N.Y. A.G. and Z.S. designed and implemented analysis methods and applied the software to analyze the data with input from A.S. and N.Y. A.S. and N.Y. supervised the work. A.G., Z.S., R.L., J.R., A.S. and N.Y. participated in writing the manuscript.

Corresponding authors

Correspondence to Aaron Streets or Nir Yosef.

Ethics declarations

Competing interests

K.L.N. is an employee of BioLegend Inc. The other authors declare no competing interests.

Additional information

Peer review information Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluation of totalVI model.

a, Posterior predictive check of coefficient of variation (CV) of genes and proteins. For each of the PBMC10k, MALT, and SLN111-D1 datasets and for each model (totalVI, scVI, factor analysis with normalized input, scHPF) the average coefficient of variation from posterior predictive samples was computed for each feature. Violin plots summarize the distribution of CVs for genes and proteins. Mean absolute error (MAE) between raw data CVs and average posterior predictive CV are reported. b, For each gene and protein, the Mann-Whitney U statistic between posterior predictive samples and observed data averaged over samples. Shown are boxplots of this statistic for each set of features (genes and proteins), model, and dataset (n=4000 genes across datasets and n=14 proteins for PBMC10k and MALT, n=110 proteins for SLN111-D1). Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. Higher is better.
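
The posterior predictive check in panel a can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `coefficient_of_variation` and `ppc_cv_mae` are hypothetical, the population standard deviation is an arbitrary choice, and the posterior predictive replicates would in practice be sampled from a fitted model rather than written by hand.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV of one feature across cells: standard deviation divided by mean.

    Assumption: population standard deviation; a feature with zero mean gets CV 0.
    """
    m = mean(values)
    return pstdev(values) / m if m > 0 else 0.0


def ppc_cv_mae(observed, replicates):
    """Mean absolute error between observed and posterior predictive CVs.

    observed   -- list of cells, each a list of per-feature counts
    replicates -- list of posterior predictive datasets shaped like `observed`
    """
    n_features = len(observed[0])
    errors = []
    for j in range(n_features):
        raw_cv = coefficient_of_variation([cell[j] for cell in observed])
        # Average the CV of feature j over all posterior predictive replicates.
        avg_cv = mean(
            coefficient_of_variation([cell[j] for cell in rep])
            for rep in replicates
        )
        errors.append(abs(raw_cv - avg_cv))
    return mean(errors)
```

A lower value means the replicated data reproduce the per-feature variability of the observed counts more closely.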

Extended Data Fig. 2 Evaluation of totalVI model (continued).

a, Mean absolute error (MAE) between held-out data and posterior predictive mean separated by genes and proteins for each model and dataset. b, Calibration error of held-out data reported separately for genes and proteins. c, Held-out reconstruction loss of RNA for scVI and totalVI. d, e, Stability of held-out results (n=5 initializations) for totalVI on SLN111-D1. Metrics displayed are the (d) held-out MAE and (e) held-out calibration error. Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. f, Inference time for totalVI and scVI across cells randomly subsampled to different levels from SLN-all. scVI was run with only genes. totalVI was applied with 20 latent dimensions and 100 latent dimensions.

Extended Data Fig. 3 Protein background in cells and empty droplets.

a-c, Histogram of log(protein counts + 1) in the SLN111-D1 dataset for B cells, T cells, and empty droplets (Methods) for CD19 (a), CD20 (b), and CD28 (c). d-f, Fraction of empty droplets, B cells, or T cells with > 0 UMIs detected for a given RNA (left, hatched) or protein (right, solid). RNA/proteins displayed are Cd19/CD19 (d), Ms4a1/CD20 (e), and Cd28/CD28 (f). g, Barcode rank plot for all barcodes detected in the SLN111-D1 dataset. Red lines at 20 and 100 RNA UMI counts indicate the lower and upper bounds, respectively, used to define empty droplets in (a-f). h, Performance of totalVI and a Gaussian mixture model (GMM) fit on all cells for each protein of the SLN111-D1 dataset to classify cell types by marker proteins (Methods). Receiver operating characteristic (ROC) curves shown for CD19 (B cells), CD20 (B cells), or CD28 (T cells). Area under the receiver operating characteristic curve (ROC AUC score) was calculated using as input either the totalVI foreground probability or GMM foreground probability where the indicated cell type was the positive population out of all B and T cells.
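
The ROC AUC comparison in panel h takes a per-cell foreground probability as the score and the annotated cell type as the label. A minimal sketch of that calculation follows, assuming the rank-based (pairwise) formulation of AUC; the function name `roc_auc` and the toy inputs are illustrative and not taken from the paper's code.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive cell scores higher
    than a random negative cell, with ties counted as one half.

    scores -- per-cell foreground probabilities (any real-valued score works)
    labels -- truthy for the positive cell type (e.g. B cells for CD19)
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of cells, which is fine for illustration; a sort-based version would be preferable for full datasets.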

Extended Data Fig. 4 totalVI decouples foreground and background for trimodal protein distributions and denoises protein data.

a, b, CD4 protein expression in the PBMC10k dataset. (a) Trimodal distribution of log(protein counts + 1). (b) UMAP plot of the totalVI latent space colored by totalVI foreground probability. c-e, UMAP plots of the totalVI latent space for the SLN111-D1 dataset. Plots are colored by log(protein counts+1) (top) and log(totalVI denoised protein+1) (bottom) for CD19 (c), CD20 (d), and CD28 (e). f, g, Distributions of log(protein counts + 1) (f) and log(totalVI denoised protein + 1) (g) for CD19 protein in B and T cells. y-axis is truncated at 3.

Extended Data Fig. 5 RNA-protein correlations.

a, b, 2d density plots of Pearson correlations between all RNA and protein features in the SLN111-D1 dataset as well as 100 additional genes whose expression was randomly permuted. Correlations between all proteins and the randomly permuted genes are colored in red. Raw correlations were calculated between log library-size normalized RNA and log(protein counts + 1). (a) Naive totalVI correlations were calculated between totalVI denoised RNA and totalVI denoised proteins. (b) totalVI correlations were calculated between denoised RNA and proteins sampled from the posterior (Methods). c, Pearson correlations between each protein and its encoding RNA for all proteins with a unique encoding RNA, colored by the mean probability foreground of the protein across all cells. totalVI correlations were calculated as in (b) and raw correlations were calculated as in (a, b). d-f, Same as (a-c), but for Spearman correlations.
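
The raw correlation described in this caption can be sketched as below. The caption does not give the exact normalization, so the form log(1 + count / library size × scale) and the default `scale=1e4` are assumptions, as are the function names `pearson` and `raw_rna_protein_correlation`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(vx * vy)


def raw_rna_protein_correlation(rna_counts, library_sizes, protein_counts, scale=1e4):
    """Correlate one gene with one protein across cells.

    RNA is library-size normalized and log-transformed (assumed form:
    log(1 + count / library_size * scale)); protein is log(count + 1).
    """
    rna = [math.log1p(c / lib * scale) for c, lib in zip(rna_counts, library_sizes)]
    protein = [math.log1p(p) for p in protein_counts]
    return pearson(rna, protein)
```

Swapping the raw inputs for denoised or posterior-sampled values would give the model-based variants named in the caption, but those values have to come from a fitted model.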

Extended Data Fig. 6 Integration of SLN-all with totalVI-intersect.

a, b, UMAP plot of SLN-all colored by (a) dataset, and (b) tissue. c, Heatmap of proteins used for annotation. Proteins (columns) are log(protein counts + 1) scaled by column for visualization. d, Dotplot of RNA markers used for annotation. RNA is log library size normalized.

Extended Data Fig. 7 Differential expression analysis.

a, 2d density plot of totalVI and scVI log Bayes factors for genes. Bayes factors were computed for each gene in one-vs-all tests on the SLN111-D1 dataset. b, Number of isotype controls called differentially expressed in one-vs-all tests (n=27) for totalVI, totalVI-wBG (totalVI test without background removal), Wilcoxon rank-sum, and t-test. Tests were applied to SLN208-D1, for which isotype controls were retained. Box plots indicate the median (center lines), interquartile range (hinges), and whiskers at 1.5x interquartile range. Red dashed line indicates the maximum number of isotype controls. c-e, Significance level (Bayes factors for totalVI, adjusted p-value for frequentist tests) for proteins in one-vs-all tests computed on SLN111-D1 and SLN111-D2 for each of (c) totalVI, (d) t-test, (e) Wilcoxon. f, Bayes factors for proteins in one-vs-all tests computed on the SLN111 datasets integrated with and without the SLN111-D2 proteins held-out. Differential expression tests for both model fits were conditioned on SLN111-D1. Bayes factors are colored by the average protein expression from SLN111-D1.

Extended Data Fig. 8 Interpreting totalVI latent dimensions with archetypal analysis.

a, b, Heatmap of top (a) gene and (b) protein features for each archetype. The archetype score corresponds to the standard scaled archetypal expression profiles (Methods). Heatmaps are individually column normalized for visualization. c, Fraction of proteins in top archetypal features for each archetype. Features in each archetype were selected in the “top” if they had an archetype score of greater than 2. For these features, we performed a one-sided hypergeometric test to determine if proteins were over-represented in this feature set relative to the global distribution of feature types. Archetypes with over-representation of proteins (one-sided hypergeometric test, BH-adjusted P<0.05) are denoted.
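
The over-representation test in panel c can be sketched with the standard library alone. The parameterization is an assumption based on the caption: the population is all features, the "successes" are proteins, and the draw is the set of top features for one archetype. The function names `hypergeom_upper_tail` and `benjamini_hochberg` are illustrative.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """One-sided hypergeometric P(X >= k).

    population -- total number of features (genes + proteins)
    successes  -- number of proteins among all features
    draws      -- number of top features selected for the archetype
    k          -- number of proteins observed among the top features
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one p-value per archetype would be computed with `hypergeom_upper_tail`, the list adjusted with `benjamini_hochberg`, and archetypes with an adjusted value below 0.05 flagged.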

Extended Data Fig. 9 Visualization of archetypes in totalVI-intersect model of SLN-all.

a, UMAP plots of SLN-all cells colored by latent dimension value. b, totalVI protein expression for CD24 and CD93 proteins as a function of distance to archetype 16. c, totalVI denoised expression for Isg20 and Ifit3 genes as a function of distance to archetype 7. Archetype is colored in blue, all other cells in grey.

Extended Data Fig. 10 totalVI identifies correlated modules of RNA and proteins that are associated with the maturation of transitional B cells.

a, UMAP of the totalVI latent space colored by totalVI RNA expression of Rag1. b, totalVI RNA expression of Rag1 as a function of 1 - Z₁₆ (the totalVI latent dimension associated with transitional B cells). c, totalVI Spearman correlations in mature B cells between the same RNA and proteins as in Fig. 5h. Features were hierarchically clustered within mature B cells. d, Histogram of Spearman correlations between each feature in (a) and 1 - Z₁₆ (n = 2,735 cells).

Cite this article

Gayoso, A., Steier, Z., Lopez, R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 18, 272–282 (2021). https://doi.org/10.1038/s41592-020-01050-x

Received: 18 May 2020
Revised: 07 December 2020
Accepted: 18 December 2020
Published: 15 February 2021
Issue Date: March 2021
DOI: https://doi.org/10.1038/s41592-020-01050-x
