Abstract
Recent advances in single-cell RNA sequencing have driven the simultaneous measurement of the expression of thousands of genes in thousands of single cells. These growing datasets allow us to model gene sets in biological networks at an unprecedented level of detail, in spite of heterogeneous cell populations. Here, we propose a deep neural network model that is a hybrid of matrix factorization and variational autoencoders, which we call restricted latent variational autoencoder (resVAE). The model uses weights as factorized matrices to obtain gene sets, while class-specific inputs to the latent variable space facilitate a plausible identification of cell types. This artificial neural network model seamlessly integrates functional gene set inference, experimental covariate effect isolation, and static gene identification, which we conceptually demonstrate here for four single-cell RNA sequencing datasets.
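As a rough illustration of how gene sets might be read off factorized weights, the Python sketch below multiplies two small non-negative weight matrices into a cell-type-to-gene mapping and keeps the top-weighted genes per cell type. The toy matrices, the gene and cell type names, and the helpers `matmul` and `top_genes` are hypothetical and are not taken from the resVAE code; the actual architecture and training procedure are not reproduced here.

```python
# Hypothetical toy weights for illustration only; not taken from resVAE.
# class_to_hidden: one row per cell type (class-specific latent input),
# one column per hidden decoder neuron.
class_to_hidden = [
    [0.9, 0.1],  # cell type "A"
    [0.0, 0.8],  # cell type "B"
]
# hidden_to_gene: one row per hidden neuron, one column per gene.
hidden_to_gene = [
    [0.7, 0.0, 0.2, 0.0],
    [0.0, 0.6, 0.1, 0.9],
]
genes = ["g1", "g2", "g3", "g4"]
cell_types = ["A", "B"]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def top_genes(weights, gene_names, k):
    """Return the k gene names with the largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [gene_names[j] for j in order[:k]]


# Multiplying the factor matrices gives one weight per (cell type, gene) pair.
class_to_gene = matmul(class_to_hidden, hidden_to_gene)
gene_sets = {ct: top_genes(row, genes, 2) for ct, row in zip(cell_types, class_to_gene)}
```

In this toy case the non-negative weights keep the contributions additive, so a gene ranks highly for a cell type only when some hidden neuron connects the two with large weights.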
Data availability
All example datasets were used in previously published studies. The C. elegans dataset was downloaded according to the instructions provided at http://atlas.gs.washington.edu/worm-rna/docs/#use-case-1-expression-pattern-of-a-gene-of-interest (ref. 23). The testis single-cell RNA sequencing data with the accession number GSE104556 was downloaded from GEO (ref. 29). The set of pancreas single-cell RNA sequencing datasets including annotations was downloaded according to the instructions on https://satijalab.org/seurat/v3.0/integration.html. The peripheral blood mononuclear cell (PBMC) dataset was obtained as outlined on https://satijalab.org/seurat/v3.1/immune_alignment.html. The individual datasets can be accessed on GEO (GSE81076: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81076, GSE85241: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241, GSE86469: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86469, GSE96583: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583) and ArrayExpress (E-MTAB-5061: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/) (refs. 33–36, 39).
Code availability
The code for our implementation of the resVAE algorithm is available on GitHub (https://github.com/lab-conrad/resVAE) and Zenodo (https://doi.org/10.5281/zenodo.4088371) (ref. 44). A CodeOcean capsule together with a minimal example dataset is available at https://codeocean.com/capsule/2076269/tree/v1 (ref. 45).
References
Barolo, S. & Posakony, J. W. Three habits of highly effective signaling pathways: principles of transcriptional control by developmental cell signaling. Genes Dev. 16, 1167–1181 (2002).
Jambusaria, A. et al. A computational approach to identify cellular heterogeneity and tissue-specific gene regulatory networks. BMC Bioinf. 19, 217 (2018).
Bleazard, T., Lamb, J. A. & Griffiths-Jones, S. Bias in microRNA functional enrichment analysis. Bioinformatics 31, 1592–1598 (2015).
Chen, X., Wang, L., Smith, J. D. & Zhang, B. Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics 24, 2474–2481 (2008).
Tomfohr, J., Lu, J. & Kepler, T. B. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinf. 6, 225 (2005).
Frost, H. R., Li, Z. & Moore, J. H. Principal component gene set enrichment (PCGSE). BioData Min. 8, 25 (2015).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Hore, V. et al. Tensor decomposition for multiple-tissue gene expression experiments. Nat. Genet. 48, 1094–1100 (2016).
Jung, M. et al. Unified single-cell analysis of testis gene regulation and pathology in five mouse strains. eLife 8, e43966 (2019).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666 (2018).
Duren, Z. et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl Acad. Sci. 115, 7723–7728 (2018).
Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803 (2019).
Yu, J., Zhou, G., Cichocki, A. & Xie, S. Learning the hierarchical parts of objects by deep non-smooth nonnegative matrix factorization. IEEE Access 6, 58096–58105 (2018).
Ye, F., Chen, C. & Zheng, Z. Deep autoencoder-like nonnegative matrix factorization for community detection. In Proc. 27th ACM Int. Conf. on Information and Knowledge Management (CIKM ’18) 1393–1402 (ACM Press, 2018); https://doi.org/10.1145/3269206.3271697.
Squires, S., Bennett, A. P. & Niranjan, M. A variational autoencoder for probabilistic non-negative matrix factorisation. Preprint at https://arxiv.org/abs/1906.05912 (2019).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations In The Microstructure Of Cognition Vol. 1, 318–362 (MIT Press, 1986).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR, 2014).
Kingma, D. P. & Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 12, 307–392 (2019).
Wang, D. & Gu, J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genom. Proteom. Bioinform. 16, 320–331 (2018).
Rashid, S., Shah, S., Bar-Joseph, Z. & Pandya, R. Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics https://doi.org/10.1093/bioinformatics/btz095 (2019).
Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
Calinski, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theor. Meth. 3, 1–27 (1974).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Carbon, S. et al. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).
Zhou, Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 10, 1523 (2019).
Yu, H., Luscombe, N. M., Qian, J. & Gerstein, M. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19, 422–427 (2003).
Lukassen, S., Bosch, E., Ekici, A. B. & Winterpacht, A. Characterization of germ cell differentiation in the male mouse through single-cell RNA sequencing. Sci. Rep. 8, 6521 (2018).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Bolcun-Filas, E. et al. A-MYB (MYBL1) transcription factor is a master regulator of male meiosis. Development 138, 3319–3330 (2011).
Daems, C., Martin, L. J., Brousseau, C. & Tremblay, J. J. MEF2 is restricted to the male gonad and regulates expression of the orphan nuclear receptor NR4A1. Mol. Endocrinol. 28, 886–898 (2014).
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Danielsson, A. et al. The human pancreas proteome defined by transcriptomics and antibody-based profiling. PLoS One 9, e115421 (2014).
Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17, 29 (2016).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa169 (2020).
Wu, Y., Burda, Y., Salakhutdinov, R. & Grosse, R. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations (ICLR, 2017).
Grosse, R. B., Ghahramani, Z. & Adams, R. P. Sandwiching the marginal likelihood using bidirectional Monte Carlo. Preprint at https://arxiv.org/abs/1511.02543 (2015).
Lukassen, S., Ten, F. W., Adam, L., Eils, R. & Conrad, C. Initial release of resVAE v1.0. resVAE: Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. zenodo https://doi.org/10.5281/zenodo.4088371 (2020).
Lukassen, S., Ten, F. W., Adam, L., Eils, R. & Conrad, C. resVAE: code for Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. CodeOcean https://doi.org/10.24433/CO.5190570.v1 (2020).
Law, C. W. et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000 Res. 5, 1408 (2018).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).
Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).
Acknowledgements
This publication is part of the Human Cell Atlas at http://www.humancellatlas.org/publications.
Author information
Contributions
S.L. and C.C. conceived this study. S.L. implemented the algorithm. S.L. and F.W.T. analysed the data. L.A. assisted with the implementation and data analysis. C.C. and R.E. supervised the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 VRC dependence on gene set size.
VRC values for gene sets composed of randomly sampled genes, tested on the C. elegans dataset. The number of sampled genes is displayed on the x axis.
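To make the random-gene baseline concrete, the following Python sketch computes a variance ratio criterion in the Calinski–Harabasz sense (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) on a randomly sampled subset of genes. It assumes that VRC here refers to that index; the function names, the toy expression matrix, and the labels are hypothetical, and the preprocessing used in the study is not reproduced.

```python
import random


def variance_ratio_criterion(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    n, dims = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dims))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    if within == 0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))


def vrc_for_random_gene_set(expression, labels, n_genes, seed=0):
    """Score cell labels using only n_genes randomly sampled gene columns."""
    rng = random.Random(seed)
    gene_idx = rng.sample(range(len(expression[0])), n_genes)
    subset = [[cell[g] for g in gene_idx] for cell in expression]
    return variance_ratio_criterion(subset, labels)


# Toy matrix (cells x genes); hypothetical values, not real data.
expression = [[0.0, 0.0, 5.0], [0.0, 2.0, 5.0], [10.0, 0.0, 5.0], [10.0, 2.0, 5.0]]
labels = ["a", "a", "b", "b"]
full_vrc = variance_ratio_criterion(expression, labels)
```

Repeating `vrc_for_random_gene_set` over several seeds and gene set sizes would give a baseline curve of the kind the caption describes.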
Extended Data Fig. 2 Additional performance metrics.
Performance metrics for the mouse testis (a), human pancreas (b) and PBMC (c) datasets. The VRC, mean squared reconstruction error, ARI of cluster label inference, and Spearman correlation coefficient between genes in the input and in the weight mappings are shown. Calculations were performed as in Fig. 2.
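For readers unfamiliar with the ARI, a minimal Python sketch of the adjusted Rand index is given below, using the standard pair-counting formula. The function name and the example labels are hypothetical, and this is not the implementation used in the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("labelings must have equal length of at least 2")
    # Count pairs of items that share a cluster in both, or in each, labeling.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping of items.
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the inferred clusters reproduce the reference labels exactly, and values near 0 indicate agreement no better than chance.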
Extended Data Fig. 3 Hyperparameter tuning.
VRC values for different hyperparameters in the latent and decoder neuron layers, tested on the C. elegans dataset. Abbreviations: dec_reg, decoder activity regularizer; lat_scale, latent scale; dec_bias, decoder bias. Boxes indicate the interquartile range (IQR); whiskers extend to 1.5 times the IQR. All outliers are shown. Horizontal lines indicate the median.
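The box plot convention described here can be sketched in Python as follows. This assumes the common reading in which whiskers end at the most extreme data points within 1.5 times the IQR of the box, and it uses the "inclusive" quantile method from the standard library; plotting libraries may compute quartiles slightly differently. The function name and the sample values are hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Median, quartiles, whisker ends and outliers for one box."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest point within the fences
        "whisker_high": inside[-1],  # largest point within the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical VRC-like values with one extreme point.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```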
Extended Data Fig. 4 Additional gene set enrichments for C. elegans cell types I.
Pathway and gene set enrichment P values for cell type-specific gene sets identified from the cell type to gene mapping obtained by resVAE for the C. elegans dataset.
Extended Data Fig. 5 Additional gene set enrichments for C. elegans cell types II.
Pathway and gene set enrichment P values for cell type-specific gene sets identified from the cell type to gene mapping obtained by resVAE for the C. elegans dataset.
Extended Data Fig. 6 Comparison of GO term enrichments across cell types.
Pathway and GO term fold enrichment for the top three enriched pathways per cell type (rows) across all cell types (columns) in the C. elegans dataset.
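As a generic illustration of the two quantities reported in these enrichment panels, the Python sketch below computes a fold enrichment and a one-sided hypergeometric over-representation P value for a single term. It is a textbook calculation under stated assumptions, not necessarily the statistic used by the enrichment tool in the study, and it applies no multiple-testing correction. The function names and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, set_size, overlap):
    """P(X >= overlap) when drawing set_size genes from the population,
    of which `annotated` genes carry the term."""
    total = comb(population, set_size)
    upper = min(annotated, set_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, set_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def fold_enrichment(population, annotated, set_size, overlap):
    """Observed term fraction in the gene set over the background fraction."""
    return (overlap / set_size) / (annotated / population)


# Hypothetical counts: 20 background genes, 5 annotated with the term,
# a gene set of 4 genes of which 2 carry the term.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
fold = fold_enrichment(20, 5, 4, 2)
```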
Extended Data Fig. 7 Comparison of motif enrichment across cell types and gene sets.
Motif enrichment P values for all motifs identified as the top enriched in any gene set/neuron (rows) across all gene sets/neurons (columns) in the mouse testis dataset.
Extended Data Fig. 8 GO term enrichment comparison after batch effect isolation.
Heatmap of Spearman’s rank correlation coefficient of the similarity of GO term enrichments in the cell-type-specific gene sets for the human pancreas dataset.
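A heatmap of this kind could be built from pairwise Spearman correlations between per-cell-type enrichment vectors. The Python sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties. The function names and example values are hypothetical, and the choice of enrichment score is an assumption rather than something stated in the caption.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical enrichment scores for four GO terms in two gene sets.
rho_same_order = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
rho_with_ties = spearman([1.0, 2.0, 2.0, 3.0], [1.0, 3.0, 2.0, 4.0])
```

Applying `spearman` to every pair of cell types yields the symmetric matrix that a heatmap would display.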
Extended Data Fig. 9 Gene set enrichment and covariate isolation benchmarking on PBMCs.
a–f, Metascape enrichment results for interferon-stimulated and unstimulated PBMCs. a–d, Cell-type-specific gene set enrichment. e,f, Treatment-specific gene set enrichment. An asterisk indicates that the term name has been shortened for display purposes. g, ARI values for benchmarking of resVAE and Seurat batch effect isolation: ARIs for cell type labels and treatment after batch correction using Seurat (blue bars), ARIs for cell type and treatment using resVAE covariate isolation on an uncorrected top 2,000 gene matrix (purple bars), ARI for the Seurat-corrected matrix without providing treatment labels (grey bar), and ARI for cell type using resVAE on the uncorrected matrix without treatment labels (black bar). h, resVAE yielded erythrocyte-related terms for the erythrocyte cluster, whereas Seurat did not yield obvious terms and its results remained cluttered by stimulation-related terms. resVAE also gave results comparable to those of Seurat for the stimulated gene set enrichment terms. Gene set enrichments for other cell types that showed no obvious or significant differences between resVAE and Seurat are omitted.
Cite this article
Lukassen, S., Ten, F.W., Adam, L. et al. Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. Nat Mach Intell 2, 800–809 (2020). https://doi.org/10.1038/s42256-020-00269-9
DOI: https://doi.org/10.1038/s42256-020-00269-9