Context-specific interaction networks from vector representation of words

Manica, Matteo; Mathis, Roland; Cadow, Joris; Rodríguez Martínez, María

doi:10.1038/s42256-019-0036-1

Article
Published: 09 April 2019

Nature Machine Intelligence volume 1, pages 181–190 (2019)

A preprint version of the article is available at arXiv.

Abstract

The number of biomedical publications has grown steadily in recent years. However, most biomedical facts are not readily available, but buried in the form of unstructured text. Here we present INtERAcT, an unsupervised method to extract interactions from a corpus of biomedical articles. INtERAcT exploits a vector representation of words, computed on a corpus of domain-specific knowledge, and implements a new metric that estimates an interaction score between two molecules in the space where the corresponding words are embedded. We use INtERAcT to reconstruct the molecular pathways of 10 different cancer types using corpora of disease-specific articles, considering the STRING database as a benchmark. Our metric outperforms currently adopted approaches and it is highly robust to parameter choices, leading to the identification of known molecular interactions in all studied cancer types. Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and can therefore be efficiently applied to different scientific domains.
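As a rough illustration of the general idea of scoring molecule pairs in the space where their words are embedded, the sketch below ranks entity pairs by cosine similarity, a common similarity measure for word vectors. It is not the INtERAcT metric, whose definition is not reproduced in this preview. The toy vectors, the entity names and the helpers `cosine_similarity` and `score_pairs` are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def score_pairs(vectors, entities):
    """Return (entity_a, entity_b, score) for every unordered pair, best first.

    Entities that have no vector are skipped.
    """
    known = [e for e in entities if e in vectors]
    scored = [
        (a, b, cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(known, 2)
    ]
    return sorted(scored, key=lambda edge: edge[2], reverse=True)


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [1.0, 1.0, 0.0],
    "protein_c": [0.0, 0.0, 1.0],
}
# "protein_d" has no vector and is therefore ignored.
edges = score_pairs(toy_vectors, ["protein_a", "protein_b", "protein_c", "protein_d"])
for a, b, score in edges:
    print(f"{a}\t{b}\t{score:.3f}")
```

The resulting list of scored pairs can be read as a weighted edge list, which is the form an interaction network takes before any thresholding.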

**Fig. 1: Description of the skip-gram model.**

**Fig. 2: Schematic representation of INtERAcT.**

**Fig. 3: Performance of INtERAcT on prostate cancer embedding.**

**Fig. 4: INtERAcT performance compared to other distance measures using STRING as a ground truth.**

**Fig. 5: Exploration of the influence of word embedding parameters on AUC for different methods and ground truths.**

**Fig. 6: Parametric dependency of INtERAcT using STRING combined score as a ground truth.**
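The figure captions report performance as AUC against a ground truth such as STRING. For readers unfamiliar with the measure, here is a minimal sketch of how an area under the ROC curve can be computed from predicted scores and binary ground-truth labels, using the pairwise (Mann-Whitney) formulation. How the authors derived labels from STRING scores is not stated in this preview, so the scores and labels below are invented and `roc_auc` is a hypothetical helper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive item receives a
    higher score than a randomly chosen negative item (ties count as 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: predicted interaction scores for five candidate pairs and
# whether each pair is present (1) or absent (0) in a reference database.
predicted = [0.9, 0.8, 0.4, 0.3, 0.1]
in_reference = [1, 0, 1, 0, 0]
auc = roc_auc(predicted, in_reference)
print(f"AUC = {auc:.3f}")
```

The double loop is quadratic in the number of pairs, which is acceptable for a sketch; a sort-based rank computation would be preferable for large networks.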

Data availability

The article abstracts and full texts used to generate INtERAcT PPI scores can be freely downloaded from PMC. For this project, STRING interaction scores were downloaded on 19 October 2017. STRING historical data can be downloaded from https://string-db.org/cgi/access.pl?footer_active_subpage=archive. The article collection as well as the historical STRING data are also available from the corresponding author upon request. The networks generated for KEGG cancer pathways and the corresponding word vectors and entity lists can be downloaded from https://ibm.biz/interact-data.

Code availability

The INtERAcT implementation used to produce the results in this Article can be accessed via https://doi.org/10.5281/zenodo.2576762 (ref. ⁴⁴) and the open source code is available on GitHub (https://github.com/drugilsberg/interact). The INtERAcT Python package can be installed via pip and provides a set of utilities to build interaction networks from word vectors, using either INtERAcT or other metrics.

INtERAcT is also available as a service hosted on IBM Cloud at https://ibm.biz/interact-aas. The web service builds a molecular interaction network given word vectors in Word2Vec binary format and a list of molecular entities (example data are made available through a download link in the app).
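As a sketch of what such input looks like, the following reads the binary layout commonly written by the original word2vec tool (a text header with vocabulary size and dimension, then each word followed by a space and its little-endian 32-bit floats) and keeps only a given set of entities. The layout is described from general knowledge of the format rather than from this article, and `write_word2vec_binary`, `read_word2vec_binary` and the toy entities are assumptions made for the example. In practice an established library reader would normally be used instead.

```python
import io
import struct


def write_word2vec_binary(stream, vectors):
    """Write {word: [floats]} in word2vec binary layout (used here to build toy data)."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *vec))
        stream.write(b"\n")


def read_word2vec_binary(stream, keep=None):
    """Read vectors from a binary stream; if `keep` is given, retain only those words."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that trails the previous vector
                chars.extend(ch)
        word = chars.decode("utf-8")
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        values = struct.unpack(f"<{dim}f", raw)
        if keep is None or word in keep:
            vectors[word] = list(values)
    return vectors


# Round trip on invented data held in memory instead of a file.
buffer = io.BytesIO()
write_word2vec_binary(
    buffer,
    {"protein_a": [0.5, -1.0], "kinase_b": [0.25, 2.0], "the": [1.0, 1.0]},
)
buffer.seek(0)
entity_vectors = read_word2vec_binary(buffer, keep={"protein_a", "kinase_b"})
print(entity_vectors)
```

Filtering by the entity list while reading keeps memory use proportional to the number of molecular entities rather than to the full vocabulary.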

References

1. Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).
2. Zlotta, A. R. et al. Prevalence of prostate cancer on autopsy: cross-sectional study on unscreened Caucasian and Asian men. J. Natl Cancer Inst. 105, 1050–1058 (2013).
3. Cooperberg, M. R., Broering, J. M. & Carroll, P. R. Risk assessment for prostate cancer metastasis and mortality at the time of diagnosis. J. Natl Cancer Inst. 101, 878–887 (2009).
4. Chou, R. et al. Screening for prostate cancer: a review of the evidence for the U.S. Preventive Services Task Force. Ann. Intern. Med. 155, 762–771 (2011).
5. Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. & Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. 6, e1000837 (2010).
6. Tjioe, E., Berry, M. W. & Homayouni, R. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization). BMC Bioinformatics 11, S14 (2010).
7. Mandloi, S. & Chakrabarti, S. PALM-IST: pathway assembly from literature mining—an information search tool. Sci. Rep. 5, 10021 (2015).
8. Barbosa-Silva, A. et al. PESCADOR, a web-based tool to assist text mining of biointeractions extracted from PubMed queries. BMC Bioinformatics 12, 435 (2011).
9. Fleuren, W. W. et al. Identification of new biomarker candidates for glucocorticoid induced insulin resistance using literature mining. BioData Min. 6, 2 (2013).
10. Raja, K., Subramani, S. & Natarajan, J. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature. Database (Oxford) 2013, bas052 (2013).
11. Usie, A., Karathia, H., Teixidó, I., Alves, R. & Solsona, F. Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents. PeerJ 2, e276 (2014).
12. Torii, M. et al. RLIMS-P 2.0: a generalizable rule-based information extraction system for literature mining of protein phosphorylation information. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 17–29 (2015).
13. Collobert, R. & Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. 25th International Conference on Machine Learning 160–167 (ACM, 2008).
14. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems, NIPS’13 3111–3119 (Curran Associates, 2013).
16. Nie, Y., Rong, W., Zhang, Y., Ouyang, Y. & Xiong, Z. Embedding assisted prediction architecture for event trigger identification. J. Bioinform. Comput. Biol. 13, 1541001 (2015).
17. Zhou, D., Zhong, D. & He, Y. Event trigger identification for biomedical events extraction using domain knowledge. Bioinformatics 30, 1587–1594 (2014).
18. Li, C. et al. Using word embedding for bio-event extraction. In Proc. 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015) 121–126 (Association for Computational Linguistics, 2015).
19. Wang, Y., Liu, Z. & Sun, M. Incorporating linguistic knowledge for learning distributed word representations. PLoS ONE 10, e0118437 (2015).
20. Jiang, Z., Li, L. & Huang, D. An unsupervised graph based continuous word representation method for biomedical text mining. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 634–642 (2016).
21. Zhao, Z., Yang, Z., Lin, H., Wang, J. & Gao, S. A protein–protein interaction extraction approach based on deep neural network. Int. J. Data Min. Bioinform. 15, 145–164 (2016).
22. Szklarczyk, D. et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
23. Šarić, J., Jensen, L. J., Ouzounova, R., Rojas, I. & Bork, P. Extraction of regulatory gene/protein networks from Medline. Bioinformatics 22, 645–650 (2006).
24. Aggarwal, C. C., Hinneburg, A. & Keim, D. A. On the surprising behavior of distance metrics in high dimensional spaces. In Proc. 8th International Conference on Database Theory, ICDT ’01 420–434 (Springer, 2001).
25. Fisher, R. A. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–521 (1915).
26. Ramos, Y. F. M. et al. Genome-wide assessment of differential roles for p300 and CBP in transcription regulation. Nucleic Acids Res. 38, 5396–5408 (2010).
27. Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O’Reilly Media, 2009).
28. Baroni, M., Dinu, G. & Kruszewski, G. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. 52nd Annual Meeting of the Association for Computational Linguistics Vol. 1, 238–247 (Association for Computational Linguistics, 2014).
29. Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975).
30. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37, 145–151 (1991).
31. Endres, D. M. & Schindelin, J. E. A new metric for probability distributions. IEEE Trans. Inf. Theory 49, 1858–1860 (2003).
32. Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 32, D449–D451 (2004).
33. Chatr-aryamontri, A. et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45, D369–D379 (2017).
34. Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455 (2004).
35. Licata, L. et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40, D857–D861 (2011).
36. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 36, D623–D631 (2007).
37. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25 (2000).
38. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
39. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
40. Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2014).
41. Franceschini, A. et al. STRING v9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).
42. Florkowski, C. M. Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin. Biochem. Rev. 29, S83–S87 (2008).
43. Von Mering, C. et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–D437 (2005).
44. Manica, M. drugilsberg/interact: First release of interact. Zenodo https://doi.org/10.5281/zenodo.2576762 (2019).

Acknowledgements

The authors thank C. Bekas and Y. Ineichen for useful discussions. The project leading to this application received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 668858.

Author information

Roland Mathis
Present address: Telepathy Labs, Zurich, Switzerland
These authors contributed equally: Matteo Manica, Roland Mathis, Joris Cadow.

Authors and Affiliations

IBM Research, Zürich, Switzerland
Matteo Manica, Roland Mathis, Joris Cadow & María Rodríguez Martínez
ETH, Zürich, Switzerland
Matteo Manica

Contributions

M.M., R.M., J.C. and M.R.M. conceived the study and analyses. M.M., R.M. and J.C. implemented INtERAcT and performed data analysis. M.R.M. provided biological interpretation. M.M., R.M., J.C. and M.R.M. wrote the manuscript, with input from all authors.

Corresponding author

Correspondence to María Rodríguez Martínez.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary notes and figures

About this article

Cite this article

Manica, M., Mathis, R., Cadow, J. et al. Context-specific interaction networks from vector representation of words. Nat Mach Intell 1, 181–190 (2019). https://doi.org/10.1038/s42256-019-0036-1

Received: 19 October 2018
Accepted: 07 March 2019
Published: 09 April 2019
Issue Date: April 2019
DOI: https://doi.org/10.1038/s42256-019-0036-1

This article is cited by

Fast searches of large collections of single-cell data using scfind
- Jimmy Tsz Hang Lee
- Nikolaos Patikas
- Martin Hemberg
Nature Methods (2021)