Abstract
Gene sets, including protein complexes and signalling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein–protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumours, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a clinical prognostic and predictive subnetwork around neurofilament medium in sarcoma, which we validate in independent cohorts.
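To make the idea of a distribution-valued embedding concrete, the following minimal sketch fits a diagonal Gaussian to the embedding vectors of a gene set's members and scores every gene by its log density under that Gaussian. It is an illustration only, not the Set2Gaussian algorithm: Set2Gaussian learns its Gaussians from proximity in a protein–protein interaction network, whereas this sketch assumes gene vectors already exist and uses simple moment matching. The gene names, coordinates and helper functions (`fit_diagonal_gaussian`, `log_density`) are hypothetical.

```python
import math


def fit_diagonal_gaussian(points, min_var=1e-6):
    """Return (mean, variance) per dimension for a list of equal-length vectors.

    Moment matching is an assumption of this sketch, not the paper's training
    objective. min_var keeps variances strictly positive.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [
        max(sum((p[d] - mean[d]) ** 2 for p in points) / n, min_var)
        for d in range(dim)
    ]
    return mean, var


def log_density(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Toy 2-D gene embeddings (made-up numbers, not taken from the paper).
gene_vectors = {
    "GENE_A": [0.0, 0.0],
    "GENE_B": [0.2, 0.1],
    "GENE_C": [-0.2, -0.1],
    "GENE_D": [3.0, 3.0],
}

# A hypothetical gene set with three members; GENE_D is not a member.
gene_set = ["GENE_A", "GENE_B", "GENE_C"]

mean, var = fit_diagonal_gaussian([gene_vectors[g] for g in gene_set])

# Higher scores indicate genes that sit closer to the set's distribution.
scores = {g: log_density(v, mean, var) for g, v in gene_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike a single point, the variance term lets a tight set (such as a protein complex) and a diffuse set (such as a broad signalling pathway) be distinguished even when their means coincide.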
Data availability
We provide pretrained gene set representations of all gene sets in NCI, Reactome and MSigDB at https://doi.org/10.6084/m9.figshare.11341181.v1. All results in this paper are based on these representations.
Code availability
A software implementation of Set2Gaussian is available at https://doi.org/10.5281/zenodo.3827929.
Acknowledgements
This work is supported by NIH TR002515, GM102365, LM005652 and the Chan-Zuckerberg Biohub.
Author information
Contributions
All authors conceived the problem. S.W. conceived the algorithm and performed the computational experiments. R.B.A. led the research. All authors wrote the manuscript.
Ethics declarations
Competing interests
R.B.A. declares the following competing interests: stock or other ownership (Personalis, 23andme, Youscript) and consulting or advisory roles (United Health, Second Genome, Karius, UK Biobank, Swiss Personalized Health Network).
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, S., Flynn, E.R. & Altman, R.B. Gaussian embedding for large-scale gene set analysis. Nat Mach Intell 2, 387–395 (2020). https://doi.org/10.1038/s42256-020-0193-2
DOI: https://doi.org/10.1038/s42256-020-0193-2