Joint probabilistic modeling of single-cell multi-omic data with totalVI

Abstract

The paired measurement of RNA and surface proteins in single cells with cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a promising approach to connect transcriptional variation with cell phenotypes and functions. However, combining these paired views into a unified representation of cell state is made challenging by the unique technical characteristics of each measurement. Here we present Total Variational Inference (totalVI; https://scvi-tools.org), a framework for end-to-end joint analysis of CITE-seq data that probabilistically represents the data as a composite of biological and technical factors, including protein background and batch effects. To evaluate totalVI’s performance, we profiled immune cells from murine spleen and lymph nodes with CITE-seq, measuring over 100 surface proteins. We demonstrate that totalVI provides a cohesive solution for common analysis tasks such as dimensionality reduction, the integration of datasets with different measured proteins, estimation of correlations between molecules and differential expression testing.

Fig. 1: Schematic of a CITE-seq data analysis pipeline with totalVI.
Fig. 2: totalVI identifies and corrects for protein background.
Fig. 3: Benchmarking of integration methods for CITE-seq data.
Fig. 4: totalVI identifies differentially expressed genes and proteins.
Fig. 5: Characterization of B-cell heterogeneity in the spleen and lymph nodes with RNA and proteins.

Data availability

The data discussed in this manuscript (SLN-all) have been deposited in the National Center for Biotechnology Information’s Gene Expression Omnibus (ref. 93) and are accessible through accession number GSE150599. Processed data are also available in the reproducibility GitHub repository (https://github.com/YosefLab/totalVI_reproducibility). The SLN-all dataset processed with totalVI can be explored interactively with Vision at http://s133.cs.berkeley.edu:9000/Results.html. Public datasets were downloaded from 10X Genomics (PBMC5k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3; PBMC10k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3; MALT: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3). Mouse mm10 reference was downloaded from 10X Genomics.

Code availability

The code to reproduce the results in this manuscript is available at https://github.com/YosefLab/totalVI_reproducibility and has been deposited at https://doi.org/10.5281/zenodo.4330368 (ref. 94). The reference implementation of totalVI is available via the scvi-tools package at https://github.com/YosefLab/scvi-tools.

References

  1. Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63 (2017).

  2. Papalexi, E. & Satija, R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. https://doi.org/10.1038/nri.2017.76 (2017).

  3. Labib, M. & Kelley, S. O. Single-cell analysis targeting the proteome. Nat. Rev. Chem. 4, 143–158 (2020).

  4. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. https://doi.org/10.1038/nbt.3711 (2016).

  5. Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).

  6. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods https://doi.org/10.1038/nmeth.4380 (2017).

  7. Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. https://doi.org/10.1038/nbt.3973 (2017).

  8. Regev, A. et al. The Human Cell Atlas. eLife https://doi.org/10.7554/eLife.27041 (2017).

  9. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature https://doi.org/10.1038/nature21350 (2017).

  10. Todorovic, V. Single-cell RNA-seq—now with protein. Nat. Methods 14, 1028–1029 (2017).

  11. Haque, A., Engel, J., Teichmann, S. A. & Lönnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 1–12 (2017).

  12. Granja, J. M. et al. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia. Nat. Biotechnol. 37, 1458–1465 (2019).

  13. Praktiknjo, S. D. et al. Tracing tumorigenesis in a solid tumor model at single-cell resolution. Nat. Commun. 11, 991 (2020).

  14. Kotliarov, Y. et al. Broad immune activation underlies shared set point signatures for vaccine responsiveness in healthy individuals and disease activity in patients with lupus. Nat. Med. 26, 618–629 (2020).

  15. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  16. Levitin, H. M. et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol. Sys. Biol. 15, e8557 (2019).

  17. Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genomics Comput. Biol. https://doi.org/10.18547/gcb.2017.vol3.iss1.e46 (2017).

  18. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. https://doi.org/10.1038/s41467-017-02554-5 (2018).

  19. Blei, D. M. Build, compute, critique, repeat: Data analysis with latent variable models. Annu. Rev. Stat. Appl. https://doi.org/10.1146/annurev-statistics-022513-115657 (2014).

  20. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. in 2nd International Conference on Learning Representations https://arxiv.org/abs/1312.6114v10 (2014).

  21. Cutler, A. & Breiman, L. Archetypal analysis. Technometrics https://doi.org/10.1080/00401706.1994.10485840 (1994).

  22. Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. https://doi.org/10.1186/s13059-018-1603-1 (2018).

  23. 10X Genomics. 10k PBMCs from a healthy donor—gene expression and cell surface protein (2018).

  24. 10X Genomics. 10k Cells from a MALT tumor—gene expression and cell surface protein (2018).

  25. Gelman, A., Meng, X. L. & Stern, H. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–760 (1996).

  26. Kuleshov, V., Fenner, N. & Ermon, S. Accurate uncertainties for deep learning using calibrated regression. in 35th International Conference on Machine Learning 80, 2796–2804 (2018).

  27. Hulspas, R., O’Gorman, M. R. G., Wood, B. L., Gratama, J. W. & Sutherland, D. R. Considerations for the control of background fluorescence in clinical flow cytometry. Cytometry B Clin. Cytom. https://doi.org/10.1002/cyto.b.20485 (2009).

  28. Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).

  29. Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience https://doi.org/10.1093/gigascience/giaa151 (2020).

  30. Fleming, S. J., Marioni, J. C. & Babadi, M. CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets. Preprint at bioRxiv https://doi.org/10.1101/791699 (2019).

  31. Ngo Trong, T. et al. Semisupervised generative autoencoder for single-cell data. J. Comput. Biol. https://doi.org/10.1089/cmb.2019.0337 (2019).

  32. Li, B. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat. Methods 17, 793–798 (2020).

  33. Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Research https://doi.org/10.12688/f1000research.16613.2 (2019).

  34. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  35. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. https://doi.org/10.1038/s41587-019-0113-3 (2019).

  36. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  37. 10X Genomics. 5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor with cell surface proteins (v3 chemistry). (2019).

  38. Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 1–10 (2020).

  39. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).

  40. Boyeau, P. et al. Deep generative models for detecting differential expression in single cells. Preprint at bioRxiv https://doi.org/10.1101/794289 (2019).

  41. Bezman, N. A. et al. Molecular definition of the identity and activation of natural killer cells. Nat. Immunol. 13, 1000–1008 (2012).

  42. Walzer, T. et al. Identification, activation, and selective in vivo ablation of mouse NK cells via NKp46. PNAS 104, 3384–3389 (2007).

  43. Gordon, S. M. et al. The transcription factors T-bet and Eomes control key checkpoints of natural killer cell maturation. Immunity 36, 55–67 (2012).

  44. Korem, Y. et al. Geometry of the gene expression space of individual cells. PLoS Comput. Biol. 11, 1–27 (2015).

  45. Dijk, D. van et al. Finding archetypal spaces for data using neural networks. Preprint at arXiv https://arxiv.org/abs/1901.09078 (2019).

  46. Thomas, M. D., Srivastava, B. & Allman, D. Regulation of peripheral B cell maturation. Cell. Immunol. 239, 92–102 (2006).

  47. Loder, F. et al. B cell development in the spleen takes place in discrete steps and is determined by the quality of B cell receptor-derived signals. J. Exp. Med. 190, 75–89 (1999).

  48. Kreslavsky, T. et al. Essential role for the transcription factor Bhlhe41 in regulating the development, self-renewal and BCR repertoire of B-1a cells. Nat. Immunol. 18, 442–455 (2017).

  49. DeTomaso, D. et al. Functional interpretation of single cell similarity maps. Nat. Commun. 10, 4376 (2019).

  50. Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542 (2013).

  51. Argelaguet, R. et al. Multi‐omics factor analysis—a framework for unsupervised integration of multi‐omics data sets. Mol. Sys. Biol. 14, 1–13 (2018).

  52. Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).

  53. Gorin, G., Svensson, V. & Pachter, L. Protein velocity and acceleration from single-cell multiomics experiments. Genome Biol. 21, 1–6 (2020).

  54. Svensson, V., Beltrame, E. da V. & Pachter, L. Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq. Preprint at bioRxiv https://doi.org/10.1101/762773 (2019).

  55. Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Sys. 2, 239–250 (2016).

  56. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. https://doi.org/10.1186/s13059-017-1382-0 (2018).

  57. Clark, S. J. et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 1–9 (2018).

  58. Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).

  59. Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics https://doi.org/10.1101/737601 (2020).

  60. Wang, C. & Blei, D. M. A general method for robust Bayesian modeling. Bayesian Anal. https://doi.org/10.1214/17-BA1090 (2018).

  61. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).

  62. Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).

  63. Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. in Advances in Neural Information Processing Systems 29, 3738–3746 (2016).

  64. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. in 3rd International Conference on Learning Representations http://arxiv.org/abs/1412.6980 (2014).

  65. Lopez, R. et al. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. in ICML Workshop in Computational Biology (2019).

  66. Mattei, P. A. & Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. in 36th International Conference on Machine Learning 97, 4413–4423 (2019).

  67. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F. & Wortman, J. Learning bounds for domain adaptation. in Advances in Neural Information Processing Systems 20, 129–136 (2008).

  68. Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 2096–2030 (2016).

  69. Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36 (Suppl. 2), i610–i617 (2020).

  70. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. https://doi.org/10.1038/ncomms14049 (2017).

  71. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics https://doi.org/10.1093/bioinformatics/bts635 (2013).

  72. Gayoso, A. et al. DoubletDetection (v.2.5.2). Zenodo. https://doi.org/10.5281/zenodo.2678041 (2019).

  73. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  74. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).

  75. Kucukelbir, A., Wang, Y. & Blei, D. M. Evaluating Bayesian models with posterior dispersion indices. Proc. 34th Intl. Conf. Machine Learning 70, 1925–1934 (2017).

  76. Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).

  77. Lai, L., Alaverdi, N., Maltais, L. & Morse, H. C. Immunophenotyping mouse cell surface antigens: nomenclature and immunophenotyping. J. Immunol. 160, 3861–3868 (1998).

  78. Watts, C. Capture and processing of exogenous antigens for presentation on MHC molecules. Ann. Rev. Immunol. 15, 821–850 (1997).

  79. Uchida, J. et al. Mouse CD20 expression and function. Int. Immunol. https://doi.org/10.1093/intimm/dxh009 (2004).

  80. Hünig, T., Beyersdorf, N. & Kerkau, T. CD28 co-stimulation in T-cell homeostasis: a recent perspective. ImmunoTargets Ther. 4, 111 (2015).

  81. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426 (2018).

  82. Filion, L. G., Izaguirre, C. A., Garber, G. E., Huebsh, L. & Aye, M. T. Detection of surface and cytoplasmic CD4 on blood monocytes from normal and HIV-1 infected individuals. J. Immunol. Methods 135, 59–69 (1990).

  83. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  84. DeTomaso, D. & Yosef, N. Identifying informative gene modules across modalities of single cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.02.06.937805 (2020).

  85. Traag, V., Waltman, L. & Eck, N. J. van. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

  86. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).

  87. Zhao, H., Liao, X. & Kang, Y. Tregs: where we are and what comes next? Front. Immunol. https://doi.org/10.3389/fimmu.2017.01578 (2017).

  88. Roncarolo, M.-G. & Gregori, S. Is FOXP3 a bona fide marker for human regulatory T cells? Eur. J. Immunol. 38, 925–927 (2008).

  89. Fontenot, J. D., Rasmussen, J. P., Gavin, M. A. & Rudensky, A. Y. A function for interleukin 2 in Foxp3-expressing regulatory T cells. Nat. Immunol. 6, 1142–1151 (2005).

  90. Sprouse, M. L. et al. High self-reactivity drives T-bet and potentiates Treg function in tissue-specific autoimmunity. JCI Insight 3, 1–14 (2018).

  91. Burda, Y., Grosse, R. & Salakhutdinov, R. Importance Weighted Autoencoders. in International Conference on Learning Representations http://arxiv.org/abs/1509.00519 (2016).

  92. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).

  93. Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).

  94. Gayoso, A. & Steier, Z. YosefLab/totalVI_reproducibility: totalVI reproducibility (v.0.3). Zenodo. https://doi.org/10.5281/zenodo.4330368 (2020).

Acknowledgements

We thank E. Robey, L. Lutes and D. Bangs for help designing experiments. We thank BioLegend and their proteogenomics team, especially B. Yeung, A. Fernandes, Q. Gao, H. Zhang and T. S. Huang for providing reagents and expertise and for help with sample preparation, library generation and sequencing of CITE-seq libraries. We thank D. DeTomaso for general data analysis advice and P. Boyeau, A. Nazaret and G. Xing for help with integrating totalVI in the scvi-tools package. We thank members of the Streets and Yosef laboratories for helpful feedback. Research reported in this manuscript was supported by the NIGMS of the National Institutes of Health under award number R35GM124916 (A.S.), the Chan Zuckerberg Foundation Network under grant number 2019-02452 (N.Y.) and the National Institute of Mental Health under grant number U19MH114821 (N.Y.). A.G. was supported by National Institutes of Health Training Grant 5T32HG000047-19. Z.S. was supported by the National Science Foundation Graduate Research Fellowship. N.Y. was supported by the Koret-Berkeley-Tel Aviv Initiative in Computational Biology. A.S. and N.Y. are Chan Zuckerberg Biohub investigators.

Author information

Contributions

A.G. and Z.S. contributed equally. A.G., Z.S., A.S. and N.Y. designed the study. A.G., Z.S., R.L., J.R. and N.Y. conceived the statistical model. A.G. implemented the totalVI software with input from R.L. K.L.N. designed and produced antibody panels and provided input on the study. Z.S. designed and led experiments with input from A.S. and N.Y. A.G. and Z.S. designed and implemented analysis methods and applied the software to analyze the data with input from A.S. and N.Y. A.S. and N.Y. supervised the work. A.G., Z.S., R.L., J.R., A.S. and N.Y. participated in writing the manuscript.

Corresponding authors

Correspondence to Aaron Streets or Nir Yosef.

Ethics declarations

Competing interests

K.L.N. is an employee of BioLegend Inc. The other authors declare no competing interests.

Additional information

Peer review information Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluation of totalVI model.

a, Posterior predictive check of coefficient of variation (CV) of genes and proteins. For each of the PBMC10k, MALT, and SLN111-D1 datasets and for each model (totalVI, scVI, factor analysis with normalized input, scHPF) the average coefficient of variation from posterior predictive samples was computed for each feature. Violin plots summarize the distribution of CVs for genes and proteins. Mean absolute error (MAE) between raw data CVs and average posterior predictive CV are reported. b, For each gene and protein, the Mann-Whitney U statistic between posterior predictive samples and observed data averaged over samples. Shown are boxplots of this statistic for each set of features (genes and proteins), model, and dataset (n=4000 genes across datasets and n=14 proteins for PBMC10k and MALT, n=110 proteins for SLN111-D1). Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. Higher is better.
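
The posterior predictive check in panel a can be illustrated with a minimal sketch. It assumes the CV is the population standard deviation divided by the mean, computed per feature across cells, and that posterior predictive samples are available as plain cells-by-features tables; the function names and the toy numbers are hypothetical and are not taken from the totalVI code.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return std / mean for one feature across cells (0.0 if the mean is 0)."""
    mu = mean(values)
    return pstdev(values) / mu if mu > 0 else 0.0


def feature_cvs(matrix):
    """CV of every feature (column) of a cells x features matrix."""
    return [coefficient_of_variation(column) for column in zip(*matrix)]


def cv_mae(observed, posterior_samples):
    """MAE between raw-data CVs and CVs averaged over posterior predictive samples."""
    raw = feature_cvs(observed)
    per_sample = [feature_cvs(sample) for sample in posterior_samples]
    averaged = [mean(cvs) for cvs in zip(*per_sample)]
    return mean(abs(r - a) for r, a in zip(raw, averaged))


# Toy data (hypothetical): 3 cells x 2 features and two posterior predictive draws.
observed = [[1, 4], [2, 4], [3, 4]]
samples = [
    [[1, 4], [2, 4], [3, 4]],  # draw that reproduces the observed counts
    [[2, 4], [2, 4], [2, 4]],  # draw with no cell-to-cell variation
]
print(cv_mae(observed, samples))
```

A lower MAE means the model's simulated data reproduce the feature-level variability of the raw counts more closely.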

Extended Data Fig. 2 Evaluation of totalVI model (continued).

a, Mean absolute error (MAE) between held-out data and posterior predictive mean separated by genes and proteins for each model and dataset. b, Calibration error of held-out data reported separately for genes and proteins. c, Held-out reconstruction loss of RNA for scVI and totalVI. d, e, Stability of held-out results (n=5 initializations) for totalVI on SLN111-D1. Metrics displayed are the (d) held-out MAE and (e) held-out calibration error. Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. f, Inference time for totalVI and scVI across cells randomly subsampled to different levels from SLN-all. scVI was run with only genes. totalVI was applied with 20 latent dimensions and 100 latent dimensions.

Extended Data Fig. 3 Protein background in cells and empty droplets.

a-c, Histogram of log(protein counts + 1) in the SLN111-D1 dataset for B cells, T cells, and empty droplets (Methods) for CD19 (a), CD20 (b), and CD28 (c). d-f, Fraction of empty droplets, B cells, or T cells with > 0 UMIs detected for a given RNA (left, hatched) or protein (right, solid). RNA/proteins displayed are Cd19/CD19 (d), Ms4a1/CD20 (e), and Cd28/CD28 (f). g, Barcode rank plot for all barcodes detected in the SLN111-D1 dataset. Red lines at 20 and 100 RNA UMI counts indicate the lower and upper bounds, respectively, used to define empty droplets in (a-f). h, Performance of totalVI and a Gaussian mixture model (GMM) fit on all cells for each protein of the SLN111-D1 dataset to classify cell types by marker proteins (Methods). Receiver operating characteristic (ROC) curves shown for CD19 (B cells), CD20 (B cells), or CD28 (T cells). Area under the receiver operating characteristic curve (ROC AUC score) was calculated using as input either the totalVI foreground probability or GMM foreground probability where the indicated cell type was the positive population out of all B and T cells.
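
Two steps from this caption can be sketched in a few lines: selecting empty droplets by RNA UMI count, and scoring a foreground probability with the ROC AUC. This is an illustration only. Treating the bounds of 20 and 100 as inclusive is an assumption, and the function names and toy scores are hypothetical; the AUC is computed with the standard pairwise-ranking definition rather than any particular library.

```python
def is_empty_droplet(rna_umi_count, lower=20, upper=100):
    """Flag a barcode as an empty droplet if its RNA UMI total is within the bounds.

    Inclusive bounds are an assumption made for this sketch.
    """
    return lower <= rna_umi_count <= upper


def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative cell")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical foreground probabilities for CD19; label 1 marks B cells, 0 marks T cells.
foreground_probability = [0.9, 0.5, 0.4, 0.3, 0.6]
is_b_cell = [1, 1, 0, 0, 0]
print(roc_auc(foreground_probability, is_b_cell))
```

The same `roc_auc` call applies to either score named in the caption (totalVI or GMM foreground probability), which is what makes the two classifiers directly comparable.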

Extended Data Fig. 4 totalVI decouples foreground and background for trimodal protein distributions and denoises protein data.

a, b, CD4 protein expression in the PBMC10k dataset. (a) Trimodal distribution of log(protein counts + 1). (b) UMAP plot of the totalVI latent space colored by totalVI foreground probability. c-e, UMAP plots of the totalVI latent space for the SLN111-D1 dataset. Plots are colored by log(protein counts+1) (top) and log(totalVI denoised protein+1) (bottom) for CD19 (c), CD20 (d), and CD28 (e). f, g, Distributions of log(protein counts + 1) (f) and log(totalVI denoised protein + 1) (g) for CD19 protein in B and T cells. y-axis is truncated at 3.

Extended Data Fig. 5 RNA-protein correlations.

a, b, 2d density plots of Pearson correlations between all RNA and protein features in the SLN111-D1 dataset as well as 100 additional genes whose expression was randomly permuted. Correlations between all proteins and the randomly permuted genes are colored in red. Raw correlations were calculated between log library-size normalized RNA and log(protein counts + 1). (a) Naive totalVI correlations were calculated between totalVI denoised RNA and totalVI denoised proteins. (b) totalVI correlations were calculated between denoised RNA and proteins sampled from the posterior (Methods). c, Pearson correlations between each protein and its encoding RNA for all proteins with a unique encoding RNA, colored by the mean foreground probability of the protein across all cells. totalVI correlations were calculated as in (b) and raw correlations were calculated as in (a, b). d-f, Same as (a-c), but for Spearman correlations.
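
As a rough illustration of the raw correlation baseline, the sketch below correlates log library-size normalized RNA with log(protein counts + 1) for a single gene-protein pair. The caption does not give the exact normalization, so the scale factor of 10,000 and the use of log(1 + x) on the RNA side are assumptions, as are the function names and the toy counts.

```python
from math import log1p, sqrt


def log_library_normalize(counts, library_sizes, scale=1e4):
    """log(1 + scale * count / library size); `scale` is an assumed constant."""
    return [log1p(scale * c / s) for c, s in zip(counts, library_sizes)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(var_x * var_y)


# Hypothetical counts for one gene and its protein across four cells.
rna_counts = [0, 5, 10, 20]
library_sizes = [1000, 1000, 2000, 2000]
protein_counts = [3, 40, 35, 90]

rna = log_library_normalize(rna_counts, library_sizes)
protein = [log1p(p) for p in protein_counts]
raw_r = pearson(rna, protein)
print(raw_r)
```

The totalVI correlations in the caption replace these raw inputs with denoised values sampled from the posterior, which this sketch does not attempt to reproduce.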

Extended Data Fig. 6 Integration of SLN-all with totalVI-intersect.

a, b, UMAP plot of SLN-all colored by (a) dataset, and (b) tissue. c, Heatmap of proteins used for annotation. Proteins (columns) are log(protein counts + 1) scaled by column for visualization. d, Dotplot of RNA markers used for annotation. RNA is log library size normalized.

Extended Data Fig. 7 Differential expression analysis.

a, 2d density plot of totalVI and scVI log Bayes factors for genes. Bayes factors were computed for each gene in one-vs-all tests on the SLN111-D1 dataset. b, Number of isotype controls called differentially expressed in one-vs-all tests (n=27) for totalVI, totalVI-wBG (totalVI test without background removal), Wilcoxon rank-sum, and t-test. Tests were applied to SLN208-D1, for which isotype controls were retained. Box plots indicate the median (center lines), interquartile range (hinges), whiskers at 1.5x interquartile range. Red dashed line indicates the maximum number of isotype controls. c-e, Significance level (Bayes factors for totalVI, adjusted p-value for frequentist tests) for proteins in one-vs-all tests computed on SLN111-D1 and SLN111-D2 for each of (c) totalVI, (d) t-test, (e) Wilcoxon. f, Bayes factors for proteins in one-vs-all tests computed on the SLN111 datasets integrated with and without the SLN111-D2 proteins held-out. Differential expression tests for both model fits were conditioned on SLN111-D1. Bayes factors are colored by the average protein expression from SLN111-D1.

Extended Data Fig. 8 Interpreting totalVI latent dimensions with archetypal analysis.

a, b, Heatmap of top (a) gene and (b) protein features for each archetype. The archetype score corresponds to the standard scaled archetypal expression profiles (Methods). Heatmaps are individually column normalized for visualization. c, Fraction of proteins in top archetypal features for each archetype. Features in each archetype were selected in the “top” if they had an archetype score of greater than 2. For these features, we performed a one-sided hypergeometric test to determine if proteins were over-represented in this feature set relative to the global distribution of feature types. Archetypes with over-representation of proteins (one-sided hypergeometric test, BH-adjusted P<0.05) are denoted.
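
The over-representation test in panel c can be sketched as follows, assuming the one-sided hypergeometric p-value is the upper tail P(X >= k) and that Benjamini-Hochberg adjustment is applied across archetypes. The function names and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` features without replacement.

    population: total number of features (genes + proteins)
    successes:  number of proteins among them
    draws:      number of top features for the archetype
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: 10 features of which 4 are proteins; an archetype's 3 top features are all proteins.
p = hypergeom_sf(3, population=10, successes=4, draws=3)
print(p, benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

An archetype would be denoted as protein-enriched when its adjusted value falls below 0.05, matching the threshold stated in the caption.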

Extended Data Fig. 9 Visualization of archetypes in totalVI-intersect model of SLN-all.

a, UMAP plots of SLN-all cells colored by latent dimension value. b, totalVI protein expression for CD24 and CD93 proteins as a function of distance to archetype 16. c, totalVI denoised expression for Isg20 and Ifit3 genes as a function of distance to archetype 7. Archetype is colored in blue, all other cells in grey.

Extended Data Fig. 10 totalVI identifies correlated modules of RNA and proteins that are associated with the maturation of transitional B cells.

a, UMAP of the totalVI latent space colored by totalVI RNA expression of Rag1. b, totalVI RNA expression of Rag1 as a function of 1 - Z16 (the totalVI latent dimension associated with transitional B cells). c, totalVI Spearman correlations in mature B cells between the same RNA and proteins as in Fig. 5h. Features were hierarchically clustered within mature B cells. d, Histogram of Spearman correlations between each feature in (a) and 1 - Z16 (n = 2,735 cells).

Supplementary information

Supplementary Information

Supplementary Tables 1–6, Supplementary Figs. 1–14 and Supplementary Notes 1–7.

Reporting Summary

Supplementary Data 1

Antibodies used in the murine spleen and lymph node CITE-seq experiments.

Supplementary Data 2

totalVI one-versus-all DE test results for the SLN-all dataset.

About this article

Cite this article

Gayoso, A., Steier, Z., Lopez, R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 18, 272–282 (2021). https://doi.org/10.1038/s41592-020-01050-x

  • DOI: https://doi.org/10.1038/s41592-020-01050-x
