Abstract
Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation to interpret the learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure demonstrates NMF analysis to quantify cell state transitions in a public-domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that improves runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without basic proficiency in the R or Python programming languages. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. Setting up the packages and conducting a test run is expected to take around 15 min, with an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user’s desired dataset can vary from hours to days depending on factors such as dataset size and input parameters.
Key points
- This protocol describes procedures for learning cellular and molecular processes from single-cell RNA-sequencing data using the non-negative matrix factorization algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). The algorithm is implemented and demonstrated in Python and R, with additional vignettes covering how to run CoGAPS via Docker deployment and GenePattern Notebook.
- This protocol presents an end-to-end workflow, optimized for contemporary single-cell data formats, that is flexible, accessible and intuitive for computational biologists.
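To make the underlying idea concrete, the sketch below factors a small non-negative matrix into a gene-weight matrix and a pattern-activity matrix. It is an illustration only: it uses the classical multiplicative-update scheme for NMF rather than the Bayesian sampling that CoGAPS performs, it does not call PyCoGAPS or the R CoGAPS package, and the function name `nmf_multiplicative`, the matrix names `A` and `P`, and the toy data are all assumptions introduced here.

```python
import numpy as np


def nmf_multiplicative(X, n_patterns, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix X (genes x cells) as A @ P.

    A (genes x patterns) holds gene weights and P (patterns x cells)
    holds pattern activity per cell. The multiplicative updates reduce
    the Frobenius reconstruction error. This is a generic NMF solver,
    NOT the CoGAPS algorithm.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    # Strictly positive random starting point (hypothetical choice).
    A = rng.random((n_genes, n_patterns)) + eps
    P = rng.random((n_patterns, n_cells)) + eps
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero.
        P *= (A.T @ X) / (A.T @ A @ P + eps)
        A *= (X @ P.T) / (A @ P @ P.T + eps)
    return A, P


# Toy, made-up data: 50 "genes" x 40 "cells" built from 3 hidden patterns.
rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 40))

A, P = nmf_multiplicative(X, n_patterns=3)
rel_error = np.linalg.norm(X - A @ P) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Each row of `P` can be read as the activity of one learned pattern across cells, and each column of `A` as the genes that contribute to it; interpreting those features biologically is the step that requires the additional statistics and annotation the protocol describes.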
Data availability
The data analyzed in these examples are freely available from the Genome Sequence Archive under accession code GSA: CRA001160 (ID: PRJCA001063).
Code availability
All code and example data objects are accessible via our lab’s GitHub repositories and are available for download from Zenodo (ref. 50). The CoGAPS core library and R interface are available at https://github.com/FertigLab/CoGAPS/, and PyCoGAPS (the Python interface) can be obtained from https://github.com/FertigLab/pycogaps.
References
Brunet, J.-P., Tamayo, P., Golub, T. R. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
Stein-O’Brien, G. L. et al. Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst. 8, 395–411.e8 (2019).
Cleary, B., Cong, L., Cheung, A., Lander, E. S. & Regev, A. Efficient generation of transcriptomic profiles by random composite measurements. Cell 171, 1424–1436.e18 (2017).
Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinform. 11, 367 (2010).
Ochs, M. F. & Fertig, E. J. Matrix factorization for transcriptional regulatory network inference. IEEE Symp. Comput. Intell. Bioinform. Comput. Biol. Proc. 2012, 387–396 (2012).
Stein-O’Brien, G. L. et al. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet. 34, 790–805 (2018).
Fertig, E. J., Ding, J., Favorov, A. V., Parmigiani, G. & Ochs, M. F. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26, 2792–2793 (2010).
Clark, B. S. et al. Single-cell RNA-seq analysis of retinal development identifies NFI factors as regulating mitotic exit and late-born cell specification. Neuron 102, 1111–1126.e5 (2019).
Sherman, T. D., Gao, T. & Fertig, E. J. CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures. BMC Bioinform. 21, 453 (2020).
Peng, J. et al. Author correction: single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 777 (2019).
Kinny-Köster, B. et al. Inflammatory signaling in pancreatic cancer transfers between a single-cell RNA sequencing atlas and co-culture. Preprint at bioRxiv https://doi.org/10.1101/2022.07.14.500096 (2022).
Reich, M. et al. The GenePattern Notebook environment. Cell Syst. 5, 149–151.e1 (2017).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Ochs, M. F., Stoyanova, R. S., Arias-Mendoza, F. & Brown, T. R. A new method for spectral decomposition using a bilinear Bayesian approach. J. Magn. Reson. 137, 161–176 (1999).
Wang, G., Kossenkov, A. V. & Ochs, M. F. LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinform. 7, 175 (2006).
Sibisi, S. & Skilling, J. Prior distributions on measure space. J. R. Stat. Soc. B 59, 217–235 (1997).
Woo, J., Aliferis, C. & Wang, J. ccfindR: single-cell RNA-seq analysis using Bayesian non-negative matrix factorization. https://www.bioconductor.org/packages/devel/bioc/vignettes/ccfindR/inst/doc/ccfindR.html (2022).
Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803 (2019).
Cemgil, A. T. Bayesian inference for nonnegative matrix factorisation models. Comput. Intell. Neurosci. 2009, 785152 (2009).
Palla, G. & Ferrero, E. Latent factor modeling of scRNA-seq data uncovers dysregulated pathways in autoimmune disease patients. iScience 23, 101451 (2020).
Shao, C. & Höfer, T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 33, 235–242 (2017).
Xie, F., Zhou, M. & Xu, Y. BayCount: a Bayesian decomposition method for inferring tumor heterogeneity using RNA-seq counts. Preprint at bioRxiv https://doi.org/10.1101/218511
Hou, W., Ji, Z., Ji, H. & Hicks, S. C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 21, 218 (2020).
Elyanow, R., Dumitrascu, B., Engelhardt, B. E. & Raphael, B. J. netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res. 30, 195–204 (2020).
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666.e4 (2018).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Stein-O’Brien, G. L. et al. PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF. Bioinformatics 33, 1892–1894 (2017).
Taylor-Weiner, A. et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 20, 228 (2019).
Stein-O’Brien, G. L. et al. Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species. Cell Syst. 8, 395–411.e8 (2019).
Fertig, E. J. et al. Preferential activation of the hedgehog pathway by epigenetic modulations in HPV negative HNSCC identified with meta-pathway analysis. PLoS ONE 8, e78127 (2013).
Way, G. P., Zietz, M., Rubinetti, V., Himmelstein, D. S. & Greene, C. S. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol. 21, 109 (2020).
Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 23, 80–91 (2018).
Bidaut, G. & Ochs, M. F. ClutrFree: cluster tree visualization and interpretation. Bioinformatics 20, 2869–2871 (2004).
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
Davis-Marcisak, E. F. et al. From bench to bedside: single-cell analysis for cancer immunotherapy. Cancer Cell 39, 1062–1080 (2021).
Gojo, J. et al. Single-cell RNA-seq reveals cellular hierarchies and impaired developmental trajectories in pediatric ependymoma. Cancer Cell 38, 44–59.e9 (2020).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Moloshok, T. D. et al. Application of Bayesian decomposition for analysing microarray data. Bioinformatics 18, 566–575 (2002).
Zhu, X., Ching, T., Pan, X., Weissman, S. M. & Garmire, L. Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization. PeerJ 5, e2888 (2017).
Stein-O’Brien, G. et al. Integrated time course omics analysis distinguishes immediate therapeutic response from acquired resistance. Genome Med. 10, 37 (2018).
Liu, J. et al. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat. Protoc. 15, 3632–3662 (2020).
Lê Cao, K.-A. et al. Community-wide hackathons to identify central themes in single-cell multi-omics. Genome Biol. 22, 220 (2021).
Sharma, G., Colantuoni, C., Goff, L. A., Fertig, E. J. & Stein-O’Brien, G. projectR: an R/Bioconductor package for transfer learning via PCA, NMF, correlation and clustering. Bioinformatics 36, 3592–3593 (2020).
Davis-Marcisak, E. F. et al. Transfer learning between preclinical models and human tumors identifies a conserved NK cell activation signature in anti-CTLA-4 responsive tumors. Genome Med. 13, 129 (2021).
Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Deshpande, A. et al. Uncovering the spatial landscape of molecular interactions within the tumor microenvironment through latent spaces. Cell Syst. 4, 285–301 (2022).
Zenodo: Research. Shared. (CERN and GitHub, 2023).
Anaconda v22.9.0 (Anaconda Software Distribution, 2021).
Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Alexander Wolf, F. anndata: Annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Seabold, S. & Perktold, J. Statsmodels: Econometric and Statistical Modeling with Python. In Proc. 9th Python in Science Conference (SciPy) https://doi.org/10.25080/majora-92bf1922-011 (2010).
Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023).
Korotkevich, G. et al. Fast gene set enrichment analysis. Preprint at bioRxiv https://doi.org/10.1101/060012 (2016).
Liberzon, A. et al. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Acknowledgements
This work was supported by U24CA248457/US Department of Health & Human Services, National Institutes of Health and National Institutes of Health U24 CA220341 (J.P.M.); the Chan-Zuckerberg Initiative DAF (2018-183445 to L.A.G. and 2018-183444 to E.J.F.); the Johns Hopkins University Catalyst (E.J.F. and L.A.G.); an Allegheny Health Network grant (to E.J.F.), U01CA212007 (to E.J.F.), U01CA253403 (to E.J.F.), P01CA247886 (to E.J.F. and E.M.J.); a Pilot Award from P50CA062924 (to E.J.F.) from the National Cancer Institute; the JHU School of Medicine Synergy Award (to E.J.F. and L.A.G.); 640183 from the Emerson Collective (to E.J.F. and E.M.J.); a Kavli Neurodiscovery Institute Distinguished Postdoctoral fellowship (G.L.S.-O.); a Johns Hopkins Provost Award (G.L.S.-O.); and K99NS122085 from the BRAIN Initiative in partnership with the National Institute of Neurological Disorders (G.L.S.-O.).
Author information
Contributions
E.J.F., G.L.S.-O. and T.S. originally conceived of the project. E.D.-M. and M.L. prepared a preliminary draft of the manuscript. A.P.T. and J.A.I.J. wrote PyCoGAPS with guidance from G.L.S.-O. A.P.T. implemented the PyCoGAPS GenePattern Notebook and introduced Docker support. M.R. and J.T.M. provided critical GenePattern Notebook support and collaboration. J.A.I.J. and A.P.T. wrote user guides, and J.T.M. performed the PDAC Atlas single-cell analysis included in them. J.B. created the CoGAPS website. All authors read, edited and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Protocols thanks Martin Hemberg, Qing Nie and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Key references using this protocol
Stein-O’Brien, G. L. et al. Cell Syst. 8, 395–411.e8 (2019): https://doi.org/10.1016/j.cels.2019.04.004
Clark, B. S. et al. Neuron 102, 1111–1126.e5 (2019): https://doi.org/10.1016/j.neuron.2019.04.010
Supplementary information
Supplementary Information
Supplementary notes 1–7, materials 1–8
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Johnson, J.A.I., Tsang, A.P., Mitchell, J.T. et al. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat Protoc 18, 3690–3731 (2023). https://doi.org/10.1038/s41596-023-00892-x
DOI: https://doi.org/10.1038/s41596-023-00892-x