Context-specific interaction networks from vector representation of words

A preprint version of the article is available at arXiv.

Abstract

The number of biomedical publications has grown steadily in recent years. However, most biomedical facts are not readily available, but buried in the form of unstructured text. Here we present INtERAcT, an unsupervised method to extract interactions from a corpus of biomedical articles. INtERAcT exploits a vector representation of words, computed on a corpus of domain-specific knowledge, and implements a new metric that estimates an interaction score between two molecules in the space where the corresponding words are embedded. We use INtERAcT to reconstruct the molecular pathways of 10 different cancer types using corpora of disease-specific articles, considering the STRING database as a benchmark. Our metric outperforms currently adopted approaches and it is highly robust to parameter choices, leading to the identification of known molecular interactions in all studied cancer types. Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and can therefore be efficiently applied to different scientific domains.

Fig. 1: Description of the skip-gram model.
Fig. 2: Schematic representation of INtERAcT.
Fig. 3: Performance of INtERAcT on prostate cancer embedding.
Fig. 4: INtERAcT performance compared to other distance measures using STRING as a ground truth.
Fig. 5: Exploration of the influence of word embedding parameters on AUC for different methods and ground truths.
Fig. 6: Parametric dependency of INtERAcT using STRING combined score as a ground truth.
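
Figures 4 to 6 report AUC values against a ground truth. As a reminder of what that quantity measures, the following minimal sketch computes the ROC AUC from predicted scores and binary labels by pairwise comparison, which is equivalent to the normalized Mann-Whitney U statistic. The function name `roc_auc` and the toy numbers are hypothetical, and this section does not describe how the ground truth was binarized in the Article, so the labels here are purely illustrative.

```python
def roc_auc(scores, labels):
    """ROC AUC: probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted interaction scores and binary ground-truth labels.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
```

The quadratic loop keeps the definition visible; for large networks a rank-based implementation would be the more practical choice.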

Data availability

The article abstracts and full texts used to generate INtERAcT PPI scores can be freely downloaded from PMC. For this project, STRING interaction scores were downloaded on 19 October 2017. STRING historical data can be downloaded from https://string-db.org/cgi/access.pl?footer_active_subpage=archive. The article collection as well as the historical STRING data are also available from the corresponding author upon request. The networks generated for KEGG cancer pathways and the corresponding word vectors and entity lists can be downloaded from https://ibm.biz/interact-data.

Code availability

The INtERAcT implementation used to produce the results in this Article can be accessed via https://doi.org/10.5281/zenodo.2576762 (ref. 44), and the open source code is available on GitHub (https://github.com/drugilsberg/interact). The INtERAcT Python package can be installed via pip and provides a set of utilities to build interaction networks from word vectors, using either INtERAcT or other metrics.

INtERAcT is also available as a service hosted on IBM Cloud at https://ibm.biz/interact-aas. The web service builds a molecular interaction network given word vectors in Word2Vec binary format and a list of molecular entities (example data are made available through a download link in the app).
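
To illustrate the input and output shape described above (word vectors plus a list of molecular entities in, a scored interaction network out), the following minimal sketch scores entity pairs with plain cosine similarity. Cosine similarity is a generic baseline, not the INtERAcT metric, and nothing here uses the interact package: the toy vectors, the entity names, the threshold and the function names `cosine_similarity` and `build_network` are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_network(word_vectors, entities, threshold=0.5):
    """Return {(a, b): score} for entity pairs scoring at least `threshold`.

    Entities that are missing from the vocabulary are skipped.
    """
    known = sorted(e for e in set(entities) if e in word_vectors)
    network = {}
    for a, b in combinations(known, 2):
        score = cosine_similarity(word_vectors[a], word_vectors[b])
        if score >= threshold:
            network[(a, b)] = score
    return network


# Toy 3-dimensional embedding with made-up values (assumption, not real data).
toy_vectors = {
    "prot_a": [1.0, 0.0, 0.0],
    "prot_b": [0.8, 0.6, 0.0],
    "prot_c": [0.0, 0.0, 1.0],
    "the": [0.5, 0.5, 0.5],  # non-entity word, ignored because it is not listed
}
network = build_network(toy_vectors, ["prot_a", "prot_b", "prot_c", "prot_d"])
```

In this toy case only the pair `("prot_a", "prot_b")` clears the threshold; `prot_d` is dropped because it has no vector, and the orthogonal `prot_c` gets no edges.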

References

  1. Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).

  2. Zlotta, A. R. et al. Prevalence of prostate cancer on autopsy: cross-sectional study on unscreened caucasian and asian men. J. Natl Cancer Inst. 105, 1050–1058 (2013).

  3. Cooperberg, M. R., Broering, J. M. & Carroll, P. R. Risk assessment for prostate cancer metastasis and mortality at the time of diagnosis. J. Natl Cancer Inst. 101, 878–887 (2009).

  4. Chou, R. et al. Screening for prostate cancer: a review of the evidence for the U.S. Preventive Services Task Force. Ann. Intern. Med. 155, 762–771 (2011).

  5. Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. & Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. 6, e1000837 (2010).

  6. Tjioe, E., Berry, M. W. & Homayouni, R. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization). BMC Bioinformatics 11, S14 (2010).

  7. Mandloi, S. & Chakrabarti, S. PALM-IST: pathway assembly from literature mining—an information search tool. Sci. Rep. 5, 10021 (2015).

  8. Barbosa-Silva, A. et al. PESCADOR, a web-based tool to assist text mining of biointeractions extracted from PubMed queries. BMC Bioinformatics 12, 435 (2011).

  9. Fleuren, W. W. et al. Identification of new biomarker candidates for glucocorticoid induced insulin resistance using literature mining. BioData Min. 6, 2 (2013).

  10. Raja, K., Subramani, S. & Natarajan, J. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature. Database (Oxford) 2013, bas052 (2013).

  11. Usie, A., Karathia, H., Teixidó, I., Alves, R. & Solsona, F. Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents. PeerJ 2, e276 (2014).

  12. Torii, M. et al. RLIMS-P 2.0: a generalizable rule-based information extraction system for literature mining of protein phosphorylation information. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 17–29 (2015).

  13. Collobert, R. & Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. 25th International Conference on Machine Learning 160–167 (ACM, 2008).

  14. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).

  15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems, NIPS’13 3111–3119 (Curran Associates, 2013).

  16. Nie, Y., Rong, W., Zhang, Y., Ouyang, Y. & Xiong, Z. Embedding assisted prediction architecture for event trigger identification. J. Bioinform. Comput. Biol. 13, 1541001 (2015).

  17. Zhou, D., Zhong, D. & He, Y. Event trigger identification for biomedical events extraction using domain knowledge. Bioinformatics 30, 1587–1594 (2014).

  18. Li, C. et al. Using word embedding for bio-event extraction. In Proc. 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015) 121–126 (Association for Computational Linguistics, 2015).

  19. Wang, Y., Liu, Z. & Sun, M. Incorporating linguistic knowledge for learning distributed word representations. PLoS ONE 10, e0118437 (2015).

  20. Jiang, Z., Li, L. & Huang, D. An unsupervised graph based continuous word representation method for biomedical text mining. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 634–642 (2016).

  21. Zhao, Z., Yang, Z., Lin, H., Wang, J. & Gao, S. A protein–protein interaction extraction approach based on deep neural network. Int. J. Data Min. Bioinform. 15, 145–164 (2016).

  22. Szklarczyk, D. et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).

  23. Šarić, J., Jensen, L. J., Ouzounova, R., Rojas, I. & Bork, P. Extraction of regulatory gene/protein networks from medline. Bioinformatics 22, 645–650 (2006).

  24. Aggarwal, C. C., Hinneburg, A. & Keim, D. A. On the surprising behavior of distance metrics in high dimensional spaces. In Proc. 8th International Conference on Database Theory, ICDT ’01 420–434 (Springer, 2001).

  25. Fisher, R. A. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–521 (1915).

  26. Ramos, Y. F. M. et al. Genome-wide assessment of differential roles for p300 and CBP in transcription regulation. Nucleic Acids Res. 38, 5396–5408 (2010).

  27. Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O’Reilly Media, 2009).

  28. Baroni, M., Dinu, G. & Kruszewski, G. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. 52nd Annual Meeting of the Association for Computational Linguistics Vol. 1, 238–247 (Association for Computational Linguistics, 2014).

  29. Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975).

  30. Lin, J. Divergence measures based on the shannon entropy. IEEE Trans. Inf. Theory 37, 145–151 (1991).

  31. Endres, D. M. & Schindelin, J. E. A new metric for probability distributions. IEEE Trans. Inf. Theory 49, 1858–1860 (2003).

  32. Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 32, D449–D451 (2004).

  33. Chatr-aryamontri, A. et al. The biogrid interaction database: 2017 update. Nucleic Acids Res. 45, D369–D379 (2017).

  34. Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455 (2004).

  35. Licata, L. et al. Mint, the molecular interaction database: 2012 update. Nucleic Acids Res. 40, D857–D861 (2011).

  36. Caspi, R. et al. The metacyc database of metabolic pathways and enzymes and the biocyc collection of pathway/genome databases. Nucleic Acids Res. 36, D623–D631 (2007).

  37. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25 (2000).

  38. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).

  39. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).

  40. Croft, D. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2014).

  41. Franceschini, A. et al. STRING v9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).

  42. Florkowski, C. M. Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin. Biochem. Rev. 29, S83–S87 (2008).

  43. Von Mering, C. et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–D437 (2005).

  44. Manica, M. drugilsberg/interact: First release of interact. Zenodo https://doi.org/10.5281/zenodo.2576762 (2019).

Acknowledgements

The authors thank C. Bekas and Y. Ineichen for useful discussions. The project leading to this application received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 668858.

Author information

Contributions

M.M., R.M., J.C. and M.R.M. conceived the study and analyses. M.M., R.M. and J.C. implemented INtERAcT and performed data analysis. M.R.M. provided biological interpretation. M.M., R.M., J.C. and M.R.M. wrote the manuscript, with input from all authors.

Corresponding author

Correspondence to María Rodríguez Martínez.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary notes and figures

Cite this article

Manica, M., Mathis, R., Cadow, J. et al. Context-specific interaction networks from vector representation of words. Nat Mach Intell 1, 181–190 (2019). https://doi.org/10.1038/s42256-019-0036-1
