  • Article
Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders

A preprint version of the article is available at bioRxiv.

Abstract

Recent advances in single-cell RNA sequencing have driven the simultaneous measurement of the expression of thousands of genes in thousands of single cells. These growing datasets allow us to model gene sets in biological networks at an unprecedented level of detail, in spite of heterogeneous cell populations. Here, we propose a deep neural network model that is a hybrid of matrix factorization and variational autoencoders, which we call restricted latent variational autoencoder (resVAE). The model uses weights as factorized matrices to obtain gene sets, while class-specific inputs to the latent variable space facilitate a plausible identification of cell types. This artificial neural network model seamlessly integrates functional gene set inference, experimental covariate effect isolation, and static gene identification, which we conceptually demonstrate here for four single-cell RNA sequencing datasets.
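The two ideas in the abstract, class-specific restriction of the latent space and non-negative decoder weights read out as gene sets, can be illustrated with a toy NumPy sketch. This is not the authors' implementation (see the repository under Code availability for that). The block layout of the latent units, the `units_per_class` parameter, the random stand-in for encoder output and the helper `class_gene_weights` are all assumptions made for illustration, and no training takes place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration only).
n_cells, n_genes, n_classes, units_per_class = 6, 8, 3, 2
latent_dim = n_classes * units_per_class

# Hypothetical cell-type labels, one-hot encoded.
labels = np.array([0, 0, 1, 1, 2, 2])
one_hot = np.eye(n_classes)[labels]                     # (n_cells, n_classes)

# Each class owns a block of latent units; the mask silences all other blocks.
mask = np.repeat(one_hot, units_per_class, axis=1)      # (n_cells, latent_dim)

# Random stand-in for what a trained encoder would output,
# followed by the usual VAE reparameterized draw.
mu = rng.normal(size=(n_cells, latent_dim))
log_var = rng.normal(scale=0.1, size=(n_cells, latent_dim))
z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
z_restricted = z * mask                                 # class-specific latent input

# Non-negative decoder weights, in the spirit of non-negative matrix factorization.
W = np.abs(rng.normal(size=(latent_dim, n_genes)))
reconstruction = z_restricted @ W                       # (n_cells, n_genes)


def class_gene_weights(W, n_classes, units_per_class):
    """Sum decoder weights over each class's latent block -> (n_classes, n_genes)."""
    return W.reshape(n_classes, units_per_class, -1).sum(axis=1)


gene_weights = class_gene_weights(W, n_classes, units_per_class)
top_genes = np.argsort(-gene_weights, axis=1)[:, :3]    # 3 highest-weight genes per class
print(top_genes)
```

Because every latent unit is tied to one class and the weights cannot cancel each other, the rows of `gene_weights` can be read directly as a class-to-gene mapping. In a trained model, the high-weight genes per row would be the candidate gene set for that cell type.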

Fig. 1: resVAE network architecture.
Fig. 2: Performance metrics.
Fig. 3: Gene set and housekeeping gene inference from C. elegans single-cell RNA sequencing.
Fig. 4: Upstream regulator identification in mouse testis data.
Fig. 5: Batch effect isolation in human pancreatic islet data from different sources.

Data availability

All example datasets were used in previously published studies. The C. elegans dataset (ref. 23) was downloaded according to the instructions provided at http://atlas.gs.washington.edu/worm-rna/docs/#use-case-1-expression-pattern-of-a-gene-of-interest. The testis single-cell RNA sequencing data with the accession number GSE104556 was downloaded from GEO (ref. 29). The set of pancreas single-cell RNA sequencing datasets including annotations was downloaded according to the instructions on https://satijalab.org/seurat/v3.0/integration.html. The peripheral blood mononuclear cell (PBMC) dataset was obtained as outlined on https://satijalab.org/seurat/v3.1/immune_alignment.html. The individual datasets (refs. 33–36, 39) can be accessed on GEO (GSE81076: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81076, GSE85241: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241, GSE86469: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86469, GSE96583: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583) and SRA (E-MTAB-5061: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/).
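The GEO links above all follow one URL pattern, so the accession pages can be generated rather than typed out. The sketch below only builds the URLs and downloads nothing. The helper name `geo_url` is hypothetical, and applying the same pattern to GSE104556 is an assumption based on the other four links.

```python
# Base of the GEO accession viewer, taken from the links listed above.
GEO_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="

# GEO series accessions named in the data availability statement.
ACCESSIONS = ["GSE104556", "GSE81076", "GSE85241", "GSE86469", "GSE96583"]


def geo_url(accession: str) -> str:
    """Return the GEO landing page URL for a series accession such as 'GSE81076'."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return GEO_BASE + accession


for accession in ACCESSIONS:
    print(accession, geo_url(accession))
```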

Code availability

The code for our implementation of the resVAE algorithm is available on GitHub (https://github.com/lab-conrad/resVAE) and Zenodo (https://doi.org/10.5281/zenodo.4088371) (ref. 44). A CodeOcean capsule together with a minimal example dataset is available at https://codeocean.com/capsule/2076269/tree/v1 (ref. 45).

References

  1. Barolo, S. & Posakony, J. W. Three habits of highly effective signaling pathways: principles of transcriptional control by developmental cell signaling. Genes Dev 16, 1167–1181 (2002).

  2. Jambusaria, A. et al. A computational approach to identify cellular heterogeneity and tissue-specific gene regulatory networks. BMC Bioinf. 19, 217 (2018).

  3. Bleazard, T., Lamb, J. A. & Griffiths-Jones, S. Bias in microRNA functional enrichment analysis. Bioinformatics 31, 1592–1598 (2015).

  4. Chen, X., Wang, L., Smith, J. D. & Zhang, B. Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics 24, 2474–2481 (2008).

  5. Tomfohr, J., Lu, J. & Kepler, T. B. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinf. 6, 225 (2005).

  6. Frost, H. R., Li, Z. & Moore, J. H. Principal component gene set enrichment (PCGSE). BioData Min. 8, 25 (2015).

  7. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  8. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  9. Hore, V. et al. Tensor decomposition for multiple-tissue gene expression experiments. Nat. Genet. 48, 1094–1100 (2016).

  10. Jung, M. et al. Unified single-cell analysis of testis gene regulation and pathology in five mouse strains. eLife 8, e43966 (2019).

  11. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

  12. Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666 (2018).

  13. Duren, Z. et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl Acad. Sci. 115, 7723–7728 (2018).

  14. Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803 (2019).

  15. Yu, J., Zhou, G., Cichocki, A. & Xie, S. Learning the hierarchical parts of objects by deep non-smooth nonnegative matrix factorization. IEEE Access 6, 58096–58105 (2018).

  16. Ye, F., Chen, C. & Zheng, Z. Deep autoencoder-like nonnegative matrix factorization for community detection. In Proc. 27th ACM Int. Conf. on Information and Knowledge Management (CIKM ’18) 1393–1402 (ACM Press, 2018); https://doi.org/10.1145/3269206.3271697.

  17. Squires, S., Bennett, A. P. & Niranjan, M. A variational autoencoder for probabilistic non-negative matrix factorisation. Preprint at https://arxiv.org/abs/1906.05912 (2019).

  18. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations In The Microstructure Of Cognition Vol. 1, 318–362 (MIT Press, 1986).

  19. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR, 2014).

  20. Kingma, D. P. & Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 12, 307–392 (2019).

  21. Wang, D. & Gu, J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genom. Proteom. Bioinform. 16, 320–331 (2018).

  22. Rashid, S., Shah, S., Bar-Joseph, Z. & Pandya, R. Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics https://doi.org/10.1093/bioinformatics/btz095 (2019).

  23. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).

  24. Calinski, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theor. Meth. 3, 1–27 (1974).

  25. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  26. Carbon, S. et al. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).

  27. Zhou, Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 10, 1523 (2019).

  28. Yu, H., Luscombe, N. M., Qian, J. & Gerstein, M. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19, 422–427 (2003).

  29. Lukassen, S., Bosch, E., Ekici, A. B. & Winterpacht, A. Characterization of germ cell differentiation in the male mouse through single-cell RNA sequencing. Sci. Rep. 8, 6521 (2018).

  30. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

  31. Bolcun-Filas, E. et al. A-MYB (MYBL1) transcription factor is a master regulator of male meiosis. Development 138, 3319–3330 (2011).

  32. Daems, C., Martin, L. J., Brousseau, C. & Tremblay, J. J. MEF2 is restricted to the male gonad and regulates expression of the orphan nuclear receptor NR4A1. Mol. Endocrinol. 28, 886–898 (2014).

  33. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

  34. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).

  35. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).

  36. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

  37. Danielsson, A. et al. The human pancreas proteome defined by transcriptomics and antibody-based profiling. PLoS One 9, e115421 (2014).

  38. Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17, 29 (2016).

  39. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  40. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

  41. Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa169 (2020).

  42. Wu, Y., Burda, Y., Salakhutdinov, R. & Grosse, R. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations (ICLR, 2017).

  43. Grosse, R. B., Ghahramani, Z. & Adams, R. P. Sandwiching the marginal likelihood using bidirectional Monte Carlo. Preprint at https://arxiv.org/abs/1511.02543 (2015).

  44. Lukassen, S., Ten, F. W., Adam, L., Eils, R. & Conrad, C. Initial release of resVAE v1.0. resVAE: Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. Zenodo https://doi.org/10.5281/zenodo.4088371 (2020).

  45. Lukassen, S., Ten, F. W., Adam, L., Eils, R. & Conrad, C. resVAE: code for Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. CodeOcean https://doi.org/10.24433/CO.5190570.v1 (2020).

  46. Law, C. W. et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 5, 1408 (2018).

  47. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).

  48. Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).

  49. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).

Acknowledgements

This publication is part of the Human Cell Atlas at http://www.humancellatlas.org/publications.

Author information

Contributions

S.L. and C.C. conceived this study. S.L. implemented the algorithm. S.L. and F.W.T. analysed the data. L.A. assisted with the implementation and data analysis. C.C. and R.E. supervised the study.

Corresponding authors

Correspondence to Roland Eils or Christian Conrad.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 VRC dependence on gene set size.

Variance ratio criterion (VRC) values for gene sets composed of randomly sampled genes, tested on the C. elegans dataset. The number of sampled genes is displayed on the x axis.
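As a rough illustration of the quantity plotted here, the sketch below computes the Calinski–Harabasz variance ratio criterion (ref. 24) on random subsets of gene columns. It uses a synthetic expression matrix rather than the C. elegans data, and the function names `variance_ratio_criterion` and `random_gene_set_vrc` are hypothetical. How the authors sampled genes and grouped cells is not stated in this caption, so treat both as assumptions.

```python
import numpy as np


def variance_ratio_criterion(X, labels):
    """Calinski-Harabasz index: between-cluster over within-cluster dispersion,
    each divided by its degrees of freedom (k - 1 and n - k)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = X.shape[0], classes.size
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def random_gene_set_vrc(expression, labels, set_size, rng):
    """VRC of the cell labels using only a random subset of gene columns."""
    genes = rng.choice(expression.shape[1], size=set_size, replace=False)
    return variance_ratio_criterion(expression[:, genes], labels)


# Synthetic stand-in: 150 cells x 200 genes, three labelled cell types
# whose means are shifted by a gene-specific amount.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
expression = rng.normal(size=(150, 200)) + labels[:, None] * rng.normal(size=(1, 200))

for size in (10, 50, 100):
    print(size, round(random_gene_set_vrc(expression, labels, size, rng), 2))
```

Running the same loop over many random draws per size would give a null distribution against which the VRC of an inferred gene set of equal size can be compared.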

Extended Data Fig. 2 Additional performance metrics.

Performance metrics for the mouse testis (a), human pancreas (b) and PBMC (c) datasets. The VRC, mean squared reconstruction error, adjusted Rand index (ARI) of cluster label inference, and Spearman correlation coefficient between genes in the input and in the weight mappings are shown. Calculations were performed as in Fig. 2.
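Two of the metrics listed in this caption, the ARI and the Spearman correlation, can be written out in a few lines. The sketch below gives plain NumPy versions for illustration only. The function names are hypothetical, and the Spearman helper assumes there are no tied values. The inputs used in the paper are those described for Fig. 2 and are not reproduced here.

```python
from math import comb

import numpy as np


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells (needs >= 2 cells)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    _, true_idx = np.unique(labels_true, return_inverse=True)
    _, pred_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table: cells per (true cluster, predicted cluster) pair.
    contingency = np.zeros((true_idx.max() + 1, pred_idx.max() + 1), dtype=int)
    np.add.at(contingency, (true_idx, pred_idx), 1)
    pairs_both = sum(comb(int(n), 2) for n in contingency.ravel())
    pairs_true = sum(comb(int(n), 2) for n in contingency.sum(axis=1))
    pairs_pred = sum(comb(int(n), 2) for n in contingency.sum(axis=0))
    expected = pairs_true * pairs_pred / comb(labels_true.size, 2)
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


def spearman_no_ties(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# The ARI ignores how clusters are named, so a relabeled copy scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Spearman depends only on rank order, so any monotonic transform scores 1.
print(spearman_no_ties([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]))
```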

Extended Data Fig. 3 Hyperparameter tuning.

VRC values for different hyperparameters in latent and decoder neuron layers, tested on the C. elegans dataset. Abbreviations: dec_reg, decoder activity regularizer; lat_scale, latent scale; dec_bias, decoder bias. Boxes indicate the interquartile range (IQR); whiskers extend to 1.5 times the IQR. All outliers are shown. Horizontal lines indicate the median.
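The box-plot convention described in this caption can be made concrete with a short calculation. The sketch below follows the common Tukey rule, which is also matplotlib's default (`whis=1.5`): each whisker ends at the most extreme data point within 1.5 times the IQR of the box, and anything beyond is an outlier. The helper `box_stats` and the example values are hypothetical, and the quantile interpolation used for the actual figure is not stated.

```python
import numpy as np


def box_stats(values, whisker=1.5):
    """Median, quartiles, whisker ends and outliers under the 1.5 x IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the last data point inside the fences.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": values[(values < lower_fence) | (values > upper_fence)],
    }


# Made-up VRC-like values for one hyperparameter setting.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in stats.items():
    print(name, value)
```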

Extended Data Fig. 4 Additional gene set enrichments for C. elegans cell types I.

Pathway and gene set enrichment P values for cell type-specific gene sets identified from the cell type to gene mapping obtained by resVAE for the C. elegans dataset.

Extended Data Fig. 5 Additional gene set enrichments for C. elegans cell types II.

Pathway and gene set enrichment P values for cell type-specific gene sets identified from the cell type to gene mapping obtained by resVAE for the C. elegans dataset.

Extended Data Fig. 6 Comparison of GO term enrichments across cell types.

Pathway and GO term fold enrichment for the top three enriched pathways per cell type (rows) across all cell types (columns) in the C. elegans dataset.

Extended Data Fig. 7 Comparison of motif enrichment across cell types and gene sets.

Motif enrichment P values for all motifs identified as the top enriched in any gene set/neuron (rows) across all gene sets/neurons (columns) in the mouse testis dataset.

Extended Data Fig. 8 GO term enrichment comparison after batch effect isolation.

Heatmap of Spearman’s rank correlation coefficient of the similarity of GO term enrichments in the cell-type-specific gene sets for the human pancreas dataset.

Extended Data Fig. 9 Gene set enrichment and covariate isolation benchmarking on PBMCs.

a–f, Metascape enrichment results for interferon-stimulated and unstimulated PBMCs. a–d, Cell-type-specific gene set enrichment. e,f, Treatment-specific gene set enrichment. Asterisks indicate that the term name has been shortened for display purposes. g, ARI values for benchmarking of resVAE and Seurat batch effect isolation: ARIs for cell type labels and treatment after batch correction using Seurat (blue bars), ARIs for cell type and treatment using resVAE covariate isolation on an uncorrected top-2,000-gene matrix (purple bars), ARI for the Seurat-corrected matrix without providing treatment labels (grey bar), and ARI for cell type using resVAE on the uncorrected matrix without treatment labels (black bar). h, resVAE yielded erythrocyte-related terms for the erythrocyte cluster, whereas Seurat yielded no obvious terms and remained cluttered by stimulation-related terms. resVAE also performed comparably to Seurat in the enrichment terms for the stimulated gene set. Gene set enrichments for other cell types that showed no obvious or significant differences between resVAE and Seurat were omitted, as they add no further information.

Supplementary information

Supplementary Information

Supplementary Table 1

About this article

Cite this article

Lukassen, S., Ten, F.W., Adam, L. et al. Gene set inference from single-cell sequencing data using a hybrid of matrix factorization and variational autoencoders. Nat Mach Intell 2, 800–809 (2020). https://doi.org/10.1038/s42256-020-00269-9

  • DOI: https://doi.org/10.1038/s42256-020-00269-9
