Abstract
The paired measurement of RNA and surface proteins in single cells with cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a promising approach to connect transcriptional variation with cell phenotypes and functions. However, combining these paired views into a unified representation of cell state is made challenging by the unique technical characteristics of each measurement. Here we present Total Variational Inference (totalVI; https://scvi-tools.org), a framework for end-to-end joint analysis of CITE-seq data that probabilistically represents the data as a composite of biological and technical factors, including protein background and batch effects. To evaluate totalVI’s performance, we profiled immune cells from murine spleen and lymph nodes with CITE-seq, measuring over 100 surface proteins. We demonstrate that totalVI provides a cohesive solution for common analysis tasks such as dimensionality reduction, the integration of datasets with different measured proteins, estimation of correlations between molecules and differential expression testing.
Data availability
The data discussed in this manuscript (SLN-all) have been deposited in the National Center for Biotechnology Information’s Gene Expression Omnibus (ref. 93) and are accessible through accession number GSE150599. Processed data are also available in the reproducibility GitHub repository (https://github.com/YosefLab/totalVI_reproducibility). The SLN-all dataset processed with totalVI can be explored interactively with Vision at http://s133.cs.berkeley.edu:9000/Results.html. Public datasets were downloaded from 10X Genomics (PBMC5k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3; PBMC10k: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3; MALT: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3). The mouse mm10 reference was downloaded from 10X Genomics.
Code availability
The code to reproduce the results in this manuscript is available at https://github.com/YosefLab/totalVI_reproducibility and has been deposited at https://doi.org/10.5281/zenodo.4330368 (ref. 94). The reference implementation of totalVI is available via the scvi-tools package at https://github.com/YosefLab/scvi-tools.
References
Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63 (2017).
Papalexi, E. & Satija, R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. https://doi.org/10.1038/nri.2017.76 (2017).
Labib, M. & Kelley, S. O. Single-cell analysis targeting the proteome. Nat. Rev. Chem. 4, 143–158 (2020).
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. https://doi.org/10.1038/nbt.3711 (2016).
Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods https://doi.org/10.1038/nmeth.4380 (2017).
Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. https://doi.org/10.1038/nbt.3973 (2017).
Regev, A. et al. The Human Cell Atlas. eLife https://doi.org/10.7554/eLife.27041 (2017).
Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature https://doi.org/10.1038/nature21350 (2017).
Todorovic, V. Single-cell RNA-seq—now with protein. Nat. Methods 14, 1028–1029 (2017).
Haque, A., Engel, J., Teichmann, S. A. & Lönnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 1–12 (2017).
Granja, J. M. et al. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia. Nat. Biotechnol. 37, 1458–1465 (2019).
Praktiknjo, S. D. et al. Tracing tumorigenesis in a solid tumor model at single-cell resolution. Nat. Commun. 11, 991 (2020).
Kotliarov, Y. et al. Broad immune activation underlies shared set point signatures for vaccine responsiveness in healthy individuals and disease activity in patients with lupus. Nat. Med. 26, 618–629 (2020).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Levitin, H. M. et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol. Sys. Biol. 15, e8557 (2019).
Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genomics Comput. Biol. https://doi.org/10.18547/gcb.2017.vol3.iss1.e46 (2017).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. https://doi.org/10.1038/s41467-017-02554-5 (2018).
Blei, D. M. Build, compute, critique, repeat: Data analysis with latent variable models. Annu. Rev. Stat. Appl. https://doi.org/10.1146/annurev-statistics-022513-115657 (2014).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. in 2nd International Conference on Learning Representations https://arxiv.org/abs/1312.6114v10 (2014).
Cutler, A. & Breiman, L. Archetypal analysis. Technometrics https://doi.org/10.1080/00401706.1994.10485840 (1994).
Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. https://doi.org/10.1186/s13059-018-1603-1 (2018).
10X Genomics. 10k PBMCs from a healthy donor—gene expression and cell surface protein (2018).
10X Genomics. 10k Cells from a MALT tumor—gene expression and cell surface protein (2018).
Gelman, A., Meng, X. L. & Stern, H. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–760 (1996).
Kuleshov, V., Fenner, N. & Ermon, S. Accurate uncertainties for deep learning using calibrated regression. in 35th International Conference on Machine Learning 80, 2796–2804 (2018).
Hulspas, R., O’Gorman, M. R. G., Wood, B. L., Gratama, J. W. & Sutherland, D. R. Considerations for the control of background fluorescence in clinical flow cytometry. Cytometry B Clin. Cytom. https://doi.org/10.1002/cyto.b.20485 (2009).
Yang, S. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 21, 57 (2020).
Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience https://doi.org/10.1093/gigascience/giaa151 (2020).
Fleming, S. J., Marioni, J. C. & Babadi, M. CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets. Preprint at bioRxiv https://doi.org/10.1101/791699 (2019).
Ngo Trong, T. et al. Semisupervised generative autoencoder for single-cell data. J. Comput. Biol. https://doi.org/10.1089/cmb.2019.0337 (2019).
Li, B. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat. Methods 17, 793–798 (2020).
Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Research https://doi.org/10.12688/f1000research.16613.2 (2019).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. https://doi.org/10.1038/s41587-019-0113-3 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
10X Genomics. 5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor with cell surface proteins (v3 chemistry). (2019).
Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 1–10 (2020).
Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
Boyeau, P. et al. Deep generative models for detecting differential expression in single cells. Preprint at bioRxiv https://doi.org/10.1101/794289 (2019).
Bezman, N. A. et al. Molecular definition of the identity and activation of natural killer cells. Nat. Immunol. 13, 1000–1008 (2012).
Walzer, T. et al. Identification, activation, and selective in vivo ablation of mouse NK cells via NKp46. PNAS 104, 3384–3389 (2007).
Gordon, S. M. et al. The transcription factors T-bet and Eomes control key checkpoints of natural killer cell maturation. Immunity 36, 55–67 (2012).
Korem, Y. et al. Geometry of the gene expression space of individual cells. PLoS Comput. Biol. 11, 1–27 (2015).
Dijk, D. van et al. Finding archetypal spaces for data using neural networks. Preprint at arXiv https://arxiv.org/abs/1901.09078 (2019).
Thomas, M. D., Srivastava, B. & Allman, D. Regulation of peripheral B cell maturation. Cell. Immunol. 239, 92–102 (2006).
Loder, F. et al. B cell development in the spleen takes place in discrete steps and is determined by the quality of B cell receptor-derived signals. J. Exp. Med. 190, 75–89 (1999).
Kreslavsky, T. et al. Essential role for the transcription factor Bhlhe41 in regulating the development, self-renewal and BCR repertoire of B-1a cells. Nat. Immunol. 18, 442–455 (2017).
DeTomaso, D. et al. Functional interpretation of single cell similarity maps. Nat. Commun. 10, 4376 (2019).
Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542 (2013).
Argelaguet, R. et al. Multi‐omics factor analysis—a framework for unsupervised integration of multi‐omics data sets. Mol. Sys. Biol. 14, 1–13 (2018).
Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).
Gorin, G., Svensson, V. & Pachter, L. Protein velocity and acceleration from single-cell multiomics experiments. Genome Biol. 21, 1–6 (2020).
Svensson, V., Beltrame, E. da V. & Pachter, L. Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq. Preprint at bioRxiv https://doi.org/10.1101/762773 (2019).
Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Sys. 2, 239–250 (2016).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. https://doi.org/10.1186/s13059-017-1382-0 (2018).
Clark, S. J. et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 1–9 (2018).
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics https://doi.org/10.1101/737601 (2020).
Wang, C. & Blei, D. M. A general method for robust Bayesian modeling. Bayesian Anal. https://doi.org/10.1214/17-BA1090 (2018).
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder variational autoencoders. in Advances in Neural Information Processing Systems 29, 3738–3746 (2016).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. in 3rd International Conference on Learning Representations http://arxiv.org/abs/1412.6980 (2014).
Lopez, R. et al. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. in ICML Workshop in Computational Biology (2019).
Mattei, P. A. & Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. in 36th International Conference on Machine Learning 97, 4413–4423 (2019).
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F. & Wortman, J. Learning bounds for domain adaptation. in Advances in Neural Information Processing Systems 20, 129–136 (2008).
Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 1–35 (2016).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36 (Suppl. 2), i610–i617 (2020).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. https://doi.org/10.1038/ncomms14049 (2017).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics https://doi.org/10.1093/bioinformatics/bts635 (2013).
Gayoso, A. et al. DoubletDetection (v.2.5.2). Zenodo. https://doi.org/10.5281/zenodo.2678041 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
Kucukelbir, A., Wang, Y. & Blei, D. M. Evaluating Bayesian models with posterior dispersion indices. Proc. 34th Intl. Conf. Machine Learning 70, 1925–1934 (2017).
Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
Lai, L., Alaverdi, N., Maltais, L. & Morse, H. C. Immunophenotyping mouse cell surface antigens: nomenclature and immunophenotyping. J. Immunol. 160, 3861–3868 (1998).
Watts, C. Capture and processing of exogenous antigens for presentation on MHC molecules. Ann. Rev. Immunol. 15, 821–850 (1997).
Uchida, J. et al. Mouse CD20 expression and function. Int. Immunol. https://doi.org/10.1093/intimm/dxh009 (2004).
Hünig, T., Beyersdorf, N. & Kerkau, T. CD28 co-stimulation in T-cell homeostasis: a recent perspective. ImmunoTargets Ther. 4, 111 (2015).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426 (2018).
Filion, L. G., Izaguirre, C. A., Garber, G. E., Huebsh, L. & Aye, M. T. Detection of surface and cytoplasmic CD4 on blood monocytes from normal and HIV-1 infected individuals. J. Immunol. Methods 135, 59–69 (1990).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
DeTomaso, D. & Yosef, N. Identifying informative gene modules across modalities of single cell genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.02.06.937805 (2020).
Traag, V., Waltman, L. & Eck, N. J. van. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Zhao, H., Liao, X. & Kang, Y. Tregs: where we are and what comes next? Front. Immunol. https://doi.org/10.3389/fimmu.2017.01578 (2017).
Roncarolo, M.-G. & Gregori, S. Is FOXP3 a bona fide marker for human regulatory T cells? Eur. J. Immunol. 38, 925–927 (2008).
Fontenot, J. D., Rasmussen, J. P., Gavin, M. A. & Rudensky, A. Y. A function for interleukin 2 in Foxp3-expressing regulatory T cells. Nat. Immunol. 6, 1142–1151 (2005).
Sprouse, M. L. et al. High self-reactivity drives T-bet and potentiates Treg function in tissue-specific autoimmunity. JCI Insight 3, 1–14 (2018).
Burda, Y., Grosse, R. & Salakhutdinov, R. Importance Weighted Autoencoders. in International Conference on Learning Representations http://arxiv.org/abs/1509.00519 (2016).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
Gayoso, A. & Steier, Z. YosefLab/totalVI_reproducibility: totalVI reproducibility (v.0.3). Zenodo. https://doi.org/10.5281/zenodo.4330368 (2020).
Acknowledgements
We thank E. Robey, L. Lutes and D. Bangs for help designing experiments. We thank BioLegend and their proteogenomics team, especially B. Yeung, A. Fernandes, Q. Gao, H. Zhang and T. S. Huang, for providing reagents and expertise and for help with sample preparation, library generation and sequencing of CITE-seq libraries. We thank D. DeTomaso for general data analysis advice and P. Boyeau, A. Nazaret and G. Xing for help with integrating totalVI in the scvi-tools package. We thank members of the Streets and Yosef laboratories for helpful feedback. Research reported in this manuscript was supported by the NIGMS of the National Institutes of Health under award number R35GM124916 (A.S.), the Chan Zuckerberg Foundation Network under grant number 2019-02452 (N.Y.) and the National Institute of Mental Health under grant number U19MH114821 (N.Y.). A.G. was supported by National Institutes of Health Training Grant 5T32HG000047-19. Z.S. was supported by the National Science Foundation Graduate Research Fellowship. N.Y. was supported by the Koret-Berkeley-Tel Aviv Initiative in Computational Biology. A.S. and N.Y. are Chan Zuckerberg Biohub investigators.
Author information
Contributions
A.G. and Z.S. contributed equally. A.G., Z.S., A.S. and N.Y. designed the study. A.G., Z.S., R.L., J.R. and N.Y. conceived the statistical model. A.G. implemented the totalVI software with input from R.L. K.L.N. designed and produced antibody panels and provided input on the study. Z.S. designed and led experiments with input from A.S. and N.Y. A.G. and Z.S. designed and implemented analysis methods and applied the software to analyze the data with input from A.S. and N.Y. A.S. and N.Y. supervised the work. A.G., Z.S., R.L., J.R., A.S. and N.Y. participated in writing the manuscript.
Ethics declarations
Competing interests
K.L.N. is an employee of BioLegend Inc. The other authors declare no competing interests.
Additional information
Peer review information Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evaluation of totalVI model.
a, Posterior predictive check of coefficient of variation (CV) of genes and proteins. For each of the PBMC10k, MALT, and SLN111-D1 datasets and for each model (totalVI, scVI, factor analysis with normalized input, scHPF) the average coefficient of variation from posterior predictive samples was computed for each feature. Violin plots summarize the distribution of CVs for genes and proteins. Mean absolute error (MAE) between raw data CVs and average posterior predictive CVs is reported. b, For each gene and protein, the Mann-Whitney U statistic between posterior predictive samples and observed data was computed and averaged over samples. Shown are boxplots of this statistic for each set of features (genes and proteins), model, and dataset (n=4000 genes across datasets and n=14 proteins for PBMC10k and MALT, n=110 proteins for SLN111-D1). Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. Higher is better.
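The CV comparison in panel a can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout (feature name mapped to per-cell counts) and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD is an assumption)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0  # convention chosen here for all-zero features
    return statistics.pstdev(values) / mean


def posterior_predictive_cv_mae(observed, posterior_samples):
    """MAE between raw CVs and CVs averaged over posterior predictive draws.

    observed: dict mapping feature name -> list of counts (one per cell).
    posterior_samples: list of dicts with the same layout, one per draw
    (a hypothetical format, not the scvi-tools output format).
    """
    abs_errors = []
    for feature, obs_counts in observed.items():
        raw_cv = coefficient_of_variation(obs_counts)
        draw_cvs = [coefficient_of_variation(draw[feature]) for draw in posterior_samples]
        abs_errors.append(abs(raw_cv - statistics.fmean(draw_cvs)))
    return statistics.fmean(abs_errors)


if __name__ == "__main__":
    observed = {"gene_a": [0, 2, 4], "protein_b": [5, 5, 5]}
    draws = [
        {"gene_a": [1, 2, 3], "protein_b": [4, 5, 6]},
        {"gene_a": [0, 1, 5], "protein_b": [5, 5, 5]},
    ]
    print(posterior_predictive_cv_mae(observed, draws))
```

A smaller value means that data generated by the model reproduce the per-feature dispersion of the raw counts more closely.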
Extended Data Fig. 2 Evaluation of totalVI model (continued).
a, Mean absolute error (MAE) between held-out data and posterior predictive mean separated by genes and proteins for each model and dataset. b, Calibration error of held-out data reported separately for genes and proteins. c, Held-out reconstruction loss of RNA for scVI and totalVI. d, e, Stability of held-out results (n=5 initializations) for totalVI on SLN111-D1. Metrics displayed are (d) held-out MAE and (e) held-out calibration error. Box plots indicate the median (center line), interquartile range (hinges), and whiskers at 1.5x interquartile range. f, Inference time for totalVI and scVI across cells randomly subsampled to different levels from SLN-all. scVI was run with only genes. totalVI was applied with 20 latent dimensions and 100 latent dimensions.
Extended Data Fig. 3 Protein background in cells and empty droplets.
a-c, Histogram of log(protein counts + 1) in the SLN111-D1 dataset for B cells, T cells, and empty droplets (Methods) for CD19 (a), CD20 (b), and CD28 (c). d-f, Fraction of empty droplets, B cells, or T cells with > 0 UMIs detected for a given RNA (left, hatched) or protein (right, solid). RNA/proteins displayed are Cd19/CD19 (d), Ms4a1/CD20 (e), and Cd28/CD28 (f). g, Barcode rank plot for all barcodes detected in the SLN111-D1 dataset. Red lines at 20 and 100 RNA UMI counts indicate the lower and upper bounds, respectively, used to define empty droplets in (a-f). h, Performance of totalVI and a Gaussian mixture model (GMM) fit on all cells for each protein of the SLN111-D1 dataset to classify cell types by marker proteins (Methods). Receiver operating characteristic (ROC) curves shown for CD19 (B cells), CD20 (B cells), or CD28 (T cells). Area under the receiver operating characteristic curve (ROC AUC score) was calculated using as input either the totalVI foreground probability or GMM foreground probability where the indicated cell type was the positive population out of all B and T cells.
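The ROC AUC score in panel h can be computed from any per-cell foreground probability. The sketch below uses the rank-based definition of AUC (the probability that a randomly chosen positive cell scores above a randomly chosen negative cell, with ties counted as one half). The function name and the toy inputs are hypothetical; the paper does not state which implementation was used.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: foreground probabilities (for example from totalVI or a GMM).
    labels: truthy for the positive population (for example B cells for CD19).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative cell")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: two B cells (positive) and two T cells (negative).
    foreground_prob = [0.9, 0.2, 0.6, 0.1]
    is_b_cell = [1, 1, 0, 0]
    print(roc_auc(foreground_prob, is_b_cell))
```

The pairwise loop is quadratic in the number of cells, which is acceptable for a sketch; a sort-based implementation would be preferable for full datasets.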
Extended Data Fig. 4 totalVI decouples foreground and background for trimodal protein distributions and denoises protein data.
a, b, CD4 protein expression in the PBMC10k dataset. (a) Trimodal distribution of log(protein counts + 1). (b) UMAP plot of the totalVI latent space colored by totalVI foreground probability. c-e, UMAP plots of the totalVI latent space for the SLN111-D1 dataset. Plots are colored by log(protein counts+1) (top) and log(totalVI denoised protein+1) (bottom) for CD19 (c), CD20 (d), and CD28 (e). f, g, Distributions of log(protein counts + 1) (f) and log(totalVI denoised protein + 1) (g) for CD19 protein in B and T cells. y-axis is truncated at 3.
Extended Data Fig. 5 RNA-protein correlations.
a, b, 2d density plots of Pearson correlations between all RNA and protein features in the SLN111-D1 dataset as well as 100 additional genes whose expression was randomly permuted. Correlations between all proteins and the randomly permuted genes are colored in red. Raw correlations were calculated between log library-size normalized RNA and log(protein counts + 1). (a) Naive totalVI correlations were calculated between totalVI denoised RNA and totalVI denoised proteins. (b) totalVI correlations were calculated between denoised RNA and proteins sampled from the posterior (Methods). c, Pearson correlations between each protein and its encoding RNA for all proteins with a unique encoding RNA, colored by the mean foreground probability of the protein across all cells. totalVI correlations were calculated as in (b) and raw correlations were calculated as in (a, b). d-f, Same as (a-c), but for Spearman correlations.
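The raw correlations described above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the scale factor of 10,000 used for library-size normalization is an assumption, since the caption does not specify one.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def log_library_normalized(gene_counts, library_sizes, scale=1e4):
    """log(1 + scale * count / library size); the scale value is an assumption."""
    return [math.log1p(scale * c / size) for c, size in zip(gene_counts, library_sizes)]


if __name__ == "__main__":
    rna = log_library_normalized([0, 3, 10, 25], [1000, 1500, 2000, 2500])
    protein = [math.log1p(c) for c in [4, 20, 90, 300]]  # log(protein counts + 1)
    print(pearson(rna, protein), spearman(rna, protein))
```

Because Spearman correlation depends only on ranks, it is unchanged by the monotone log transforms, whereas Pearson correlation is not.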
Extended Data Fig. 6 Integration of SLN-all with totalVI-intersect.
a, b, UMAP plot of SLN-all colored by (a) dataset, and (b) tissue. c, Heatmap of proteins used for annotation. Proteins (columns) are log(protein counts + 1) scaled by column for visualization. d, Dotplot of RNA markers used for annotation. RNA is log library size normalized.
Extended Data Fig. 7 Differential expression analysis.
a, 2d density plot of totalVI and scVI log Bayes factors for genes. Bayes factors were computed for each gene in one-vs-all tests on the SLN111-D1 dataset. b, Number of isotype controls called differentially expressed in one-vs-all tests (n=27) for totalVI, totalVI-wBG (totalVI test without background removal), Wilcoxon rank-sum, and t-test. Tests were applied to SLN208-D1, for which isotype controls were retained. Box plots indicate the median (center lines), interquartile range (hinges), and whiskers at 1.5x interquartile range. Red dashed line indicates the maximum number of isotype controls. c-e, Significance level (Bayes factors for totalVI, adjusted p-value for frequentist tests) for proteins in one-vs-all tests computed on SLN111-D1 and SLN111-D2 for each of (c) totalVI, (d) t-test, (e) Wilcoxon. f, Bayes factors for proteins in one-vs-all tests computed on the SLN111 datasets integrated with and without the SLN111-D2 proteins held out. Differential expression tests for both model fits were conditioned on SLN111-D1. Bayes factors are colored by the average protein expression from SLN111-D1.
Extended Data Fig. 8 Interpreting totalVI latent dimensions with archetypal analysis.
a, b, Heatmap of top (a) gene and (b) protein features for each archetype. The archetype score corresponds to the standard scaled archetypal expression profiles (Methods). Heatmaps are individually column normalized for visualization. c, Fraction of proteins in top archetypal features for each archetype. A feature was counted among the “top” features of an archetype if it had an archetype score greater than 2. For these features, we performed a one-sided hypergeometric test to determine if proteins were over-represented in this feature set relative to the global distribution of feature types. Archetypes with over-representation of proteins (one-sided hypergeometric test, BH-adjusted P<0.05) are denoted.
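The over-representation test in panel c can be sketched as an upper-tail hypergeometric probability followed by Benjamini-Hochberg (BH) adjustment across archetypes. The function names and all counts in the example are hypothetical and are not taken from the paper.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` features without replacement from a
    population of `population` features, `successes` of which are proteins."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical setting: 4,110 features in total, 110 of them proteins.
    # Each tuple is (proteins among top features, number of top features).
    archetypes = [(12, 60), (2, 80), (9, 40)]
    pvals = [hypergeom_upper_tail(k, 4110, 110, n) for k, n in archetypes]
    for p_adj in benjamini_hochberg(pvals):
        print(p_adj, p_adj < 0.05)
```

Exact integer arithmetic with `math.comb` avoids overflow and rounding issues that can affect floating-point binomial coefficients at this population size.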
Extended Data Fig. 9 Visualization of archetypes in totalVI-intersect model of SLN-all.
a, UMAP plots of SLN-all cells colored by latent dimension value. b, totalVI protein expression for CD24 and CD93 proteins as a function of distance to archetype 16. c, totalVI denoised expression for Isg20 and Ifit3 genes as a function of distance to archetype 7. Archetype is colored in blue, all other cells in grey.
Extended Data Fig. 10 totalVI identifies correlated modules of RNA and proteins that are associated with the maturation of transitional B cells.
a, UMAP of the totalVI latent space colored by totalVI RNA expression of Rag1. b, totalVI RNA expression of Rag1 as a function of 1 - Z16 (the totalVI latent dimension associated with transitional B cells). c, totalVI Spearman correlations in mature B cells between the same RNA and proteins as in Fig. 5h. Features were hierarchically clustered within mature B cells. d, Histogram of Spearman correlations between each feature in (a) and 1 - Z16 (n = 2,735 cells).
Supplementary information
Supplementary Information
Supplementary Tables 1–6, Supplementary Figs. 1–14 and Supplementary Notes 1–7.
Supplementary Data 1
Antibodies used in the murine spleen and lymph node CITE-seq experiments.
Supplementary Data 2
totalVI one-versus-all DE test results for the SLN-all dataset.
Cite this article
Gayoso, A., Steier, Z., Lopez, R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 18, 272–282 (2021). https://doi.org/10.1038/s41592-020-01050-x
DOI: https://doi.org/10.1038/s41592-020-01050-x