  • Protocol

Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS

Abstract

Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation to interpret the learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A general discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure demonstrates NMF analysis to quantify cell state transitions in a public domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that enhances runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without basic proficiency in the R or Python programming languages. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. Setting up the packages and conducting a test run is expected to take around 15 min, with an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user’s desired dataset can vary from hours to days depending on factors such as dataset size or input parameters.
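
To make the factorization described above concrete, the following minimal sketch factors a small non-negative genes × cells matrix into two non-negative matrices. It is an illustration only and is not the CoGAPS algorithm: CoGAPS is Bayesian, whereas this sketch uses the simpler multiplicative-update scheme of Lee and Seung (ref. 13). The function name `nmf_multiplicative`, the matrix names `A` (gene weights) and `P` (pattern weights per cell) and the toy data are assumptions introduced for this example.

```python
import numpy as np


def nmf_multiplicative(V, n_patterns, n_iter=500, eps=1e-9, seed=0):
    """Approximate V (genes x cells) as A @ P with non-negative factors.

    Hypothetical helper for illustration: Lee-Seung multiplicative updates
    minimizing the squared reconstruction error. Not the CoGAPS sampler.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = V.shape
    # Random strictly positive starting values
    A = rng.random((n_genes, n_patterns)) + eps
    P = rng.random((n_patterns, n_cells)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so entries
        # of A and P can never become negative.
        P *= (A.T @ V) / (A.T @ A @ P + eps)
        A *= (V @ P.T) / (A @ P @ P.T + eps)
    return A, P


# Toy input (assumed, not real data): 30 genes x 20 cells built from 2 patterns
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 20))

A, P = nmf_multiplicative(V, n_patterns=2)
rel_error = np.linalg.norm(V - A @ P) / np.linalg.norm(V)
print(f"A: {A.shape}, P: {P.shape}, relative error: {rel_error:.4f}")
```

In this sketch, each column of `A` lists the genes contributing to one learned pattern and each row of `P` gives the activity of that pattern across cells. These are the learned features that still require post hoc statistics and annotation before they can be read as biological processes.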

Key points

  • This protocol describes procedures for learning cellular and molecular processes from single-cell RNA-sequencing data using the non-negative matrix factorization algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). The algorithm is implemented and demonstrated in Python and R, with additional vignettes covering how to run CoGAPS via Docker deployment and GenePattern Notebook.

  • This protocol presents an end-to-end workflow that is usable, flexible and optimized for contemporary single-cell data formats, and that is accessible and intuitive for computational biologists.

Fig. 1: NMF learns signal in input data.
Fig. 2: A generalized workflow for performing NMF on single-cell data.
Fig. 3: Distributed CoGAPS finds robust patterns across randomized gene or sample subsets.
Fig. 4: Decision tree for selecting the most appropriate PyCoGAPS or CoGAPS procedure to follow.
Fig. 5: Graphical comparison of the procedures.
Fig. 6: Comparison of runtimes of R CoGAPS versus PyCoGAPS.
Fig. 7: UMAP of patterns learned by PyCoGAPS.
Fig. 8: Comparing statistical overlap between a biologist’s single-cell annotations and the learned CoGAPS patterns to associate patterns with biological processes.
Fig. 9: Python hallmark GSEA.
Fig. 10: UMAP of patterns learned by CoGAPS.
Fig. 11: Pattern amplitude by cell group.
Fig. 12: R CoGAPS hallmark GSEA.

Data availability

The data analyzed in these examples are freely available from the Genome Sequence Archive under accession code GSA: CRA001160 and ID PRJCA001063.

Code availability

All code and example data objects are accessible via our lab’s GitHub repositories, and/or available for download from Zenodo (ref. 50). The CoGAPS core library and R interface are available at https://github.com/FertigLab/CoGAPS/ and PyCoGAPS (the Python interface) can be obtained from https://github.com/FertigLab/pycogaps.

References

  1. Brunet, J.-P., Tamayo, P., Golub, T. R. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).

  2. Stein-O’Brien, G. L. et al. Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst. 8, 395–411.e8 (2019).

  3. Cleary, B., Cong, L., Cheung, A., Lander, E. S. & Regev, A. Efficient generation of transcriptomic profiles by random composite measurements. Cell 171, 1424–1436.e18 (2017).

  4. Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinform. 11, 367 (2010).

  5. Ochs, M. F. & Fertig, E. J. Matrix factorization for transcriptional regulatory network inference. IEEE Symp. Comput. Intell. Bioinform. Comput. Biol. Proc. 2012, 387–396 (2012).

  6. Stein-O’Brien, G. L. et al. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet. 34, 790–805 (2018).

  7. Fertig, E. J., Ding, J., Favorov, A. V., Parmigiani, G. & Ochs, M. F. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26, 2792–2793 (2010).

  8. Clark, B. S. et al. Single-cell RNA-seq analysis of retinal development identifies NFI factors as regulating mitotic exit and late-born cell specification. Neuron 102, 1111–1126.e5 (2019).

  9. Sherman, T. D., Gao, T. & Fertig, E. J. CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures. BMC Bioinform. 21, 453 (2020).

  10. Peng, J. et al. Author correction: single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 777 (2019).

  11. Kinny-Köster, B. et al. Inflammatory signaling in pancreatic cancer transfers between a single-cell RNA sequencing atlas and co-culture. Preprint at bioRxiv https://doi.org/10.1101/2022.07.14.500096 (2022).

  12. Reich, M. et al. The GenePattern Notebook environment. Cell Syst. 5, 149–151.e1 (2017).

  13. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

  14. Ochs, M. F., Stoyanova, R. S., Arias-Mendoza, F. & Brown, T. R. A new method for spectral decomposition using a bilinear Bayesian approach. J. Magn. Reson. 137, 161–176 (1999).

  15. Wang, G., Kossenkov, A. V. & Ochs, M. F. LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinform. 7, 175 (2006).

  16. Sibisi, S. & Skilling, J. Prior distributions on measure space. J. R. Stat. Soc. B 59, 217–235 (1997).

  17. Woo, J., Aliferis, C. & Wang, J. ccfindR: single-cell RNA-seq analysis using Bayesian non-negative matrix factorization. https://www.bioconductor.org/packages/devel/bioc/vignettes/ccfindR/inst/doc/ccfindR.html (2022).

  18. Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803 (2019).

  19. Cemgil, A. T. Bayesian inference for nonnegative matrix factorisation models. Comput. Intell. Neurosci. 2009, 785152 (2009).

  20. Palla, G. & Ferrero, E. Latent factor modeling of scRNA-seq data uncovers dysregulated pathways in autoimmune disease patients. iScience 23, 101451 (2020).

  21. Shao, C. & Höfer, T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 33, 235–242 (2017).

  22. Xie, F., Zhou, M. & Xu, Y. BayCount: a Bayesian decomposition method for inferring tumor heterogeneity using RNA-seq counts. Preprint at bioRxiv https://doi.org/10.1101/218511

  23. Hou, W., Ji, Z., Ji, H. & Hicks, S. C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 21, 218 (2020).

  24. Elyanow, R., Dumitrascu, B., Engelhardt, B. E. & Raphael, B. J. netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res. 30, 195–204 (2020).

  25. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).

  26. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  27. Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).

  28. Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666.e4 (2018).

  29. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).

  30. Stein-O’Brien, G. L. et al. PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF. Bioinformatics 33, 1892–1894 (2017).

  31. Taylor-Weiner, A. et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 20, 228 (2019).

  32. Stein-O’Brien, G. L. et al. Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst. 8, 395–411 (2019).

  33. Fertig, E. J. et al. Preferential activation of the hedgehog pathway by epigenetic modulations in HPV negative HNSCC identified with meta-pathway analysis. PLoS ONE 8, e78127 (2013).

  34. Way, G. P., Zietz, M., Rubinetti, V., Himmelstein, D. S. & Greene, C. S. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol. 21, 109 (2020).

  35. Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 23, 80–91 (2018).

  36. Bidaut, G. & Ochs, M. F. ClutrFree: cluster tree visualization and interpretation. Bioinformatics 20, 2869–2871 (2004).

  37. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).

  38. Davis-Marcisak, E. F. et al. From bench to bedside: single-cell analysis for cancer immunotherapy. Cancer Cell 39, 1062–1080 (2021).

  39. Gojo, J. et al. Single-cell RNA-seq reveals cellular hierarchies and impaired developmental trajectories in pediatric ependymoma. Cancer Cell 38, 44–59.e9 (2020).

  40. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

  41. Moloshok, T. D. et al. Application of Bayesian decomposition for analysing microarray data. Bioinformatics 18, 566–575 (2002).

  42. Zhu, X., Ching, T., Pan, X., Weissman, S. M. & Garmire, L. Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization. PeerJ 5, e2888 (2017).

  43. Stein-O’Brien, G. et al. Integrated time course omics analysis distinguishes immediate therapeutic response from acquired resistance. Genome Med. 10, 37 (2018).

  44. Liu, J. et al. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat. Protoc. 15, 3632–3662 (2020).

  45. Lê Cao, K.-A. et al. Community-wide hackathons to identify central themes in single-cell multi-omics. Genome Biol. 22, 220 (2021).

  46. Sharma, G., Colantuoni, C., Goff, L. A., Fertig, E. J. & Stein-O’Brien, G. projectR: an R/Bioconductor package for transfer learning via PCA, NMF, correlation and clustering. Bioinformatics 36, 3592–3593 (2020).

  47. Davis-Marcisak, E. F. et al. Transfer learning between preclinical models and human tumors identifies a conserved NK cell activation signature in anti-CTLA-4 responsive tumors. Genome Med. 13, 129 (2021).

  48. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).

  49. Deshpande, A. et al. Uncovering the spatial landscape of molecular interactions within the tumor microenvironment through latent spaces. Cell Syst. 4, 285–301 (2022).

  50. zenodo: Research. Shared. (CERN and GitHub, 2023).

  51. Anaconda v22.9.0 (Anaconda Software Distribution, 2021).

  52. Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Alexander Wolf, F. anndata: Annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).

  53. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  54. Seabold, S. & Perktold, J. Statsmodels: Econometric and Statistical Modeling with Python. In Proc. 9th Python in Science Conference (SciPy) https://doi.org/10.25080/majora-92bf1922-011 (2010).

  55. Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023).

  56. Korotkevich, G. et al. Fast gene set enrichment analysis. Preprint at bioRxiv https://doi.org/10.1101/060012 (2016).

  57. Liberzon, A. et al. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).

Acknowledgements

This work was supported by U24CA248457/US Department of Health & Human Services, National Institutes of Health and National Institutes of Health U24 CA220341 (J.P.M.); the Chan-Zuckerberg Initiative DAF (2018-183445 to L.A.G. and 2018-183444 to E.J.F.); the Johns Hopkins University Catalyst (E.J.F. and L.A.G.); an Allegheny Health Network grant (to E.J.F.), U01CA212007 (to E.J.F.), U01CA253403 (to E.J.F.), P01CA247886 (to E.J.F. and E.M.J.); a Pilot Award from P50CA062924 (to E.J.F.) from the National Cancer Institute; the JHU School of Medicine Synergy Award (to E.J.F. and L.A.G.); 640183 from the Emerson Collective (to E.J.F. and E.M.J.); a Kavli Neurodiscovery Institute Distinguished Postdoctoral fellowship (G.L.S.-O.); a Johns Hopkins Provost Award (G.L.S.-O.); and K99NS122085 from the BRAIN Initiative in partnership with the National Institute of Neurological Disorders (G.L.S.-O.).

Author information

Contributions

E.J.F., G.L.S.-O. and T.S. originally conceived of the project. E.D.-M. and M.L. prepared a preliminary draft of the manuscript. A.P.T. and J.A.I.J. wrote PyCoGAPS with guidance from G.L.S.-O. A.P.T. implemented the PyCoGAPS GenePattern Notebook and introduced Docker support. M.R. and J.T.M. provided critical GenePattern Notebook support and collaboration. J.A.I.J. and A.P.T. wrote user guides, and J.T.M. performed the PDAC Atlas single-cell analysis included in them. J.B. created the CoGAPS website. All authors read, edited and approved the final manuscript.

Corresponding authors

Correspondence to Elana J. Fertig or Genevieve L. Stein-O’Brien.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Protocols thanks Martin Hemberg, Qing Nie and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Stein-O’Brien, G. L. et al. Cell Syst. 8, 395–411.e8 (2019): https://doi.org/10.1016/j.cels.2019.04.004

Clark, B. S. et al. Neuron 102, 1111–1126.e5 (2019): https://doi.org/10.1016/j.neuron.2019.04.010

Supplementary information

Supplementary notes 1–7, materials 1–8

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnson, J.A.I., Tsang, A.P., Mitchell, J.T. et al. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat Protoc 18, 3690–3731 (2023). https://doi.org/10.1038/s41596-023-00892-x

  • DOI: https://doi.org/10.1038/s41596-023-00892-x
