Context-specific interaction networks from vector representation of words

Abstract

The number of biomedical publications has grown steadily in recent years. However, most biomedical facts are not readily available, but buried in the form of unstructured text. Here we present INtERAcT, an unsupervised method to extract interactions from a corpus of biomedical articles. INtERAcT exploits a vector representation of words, computed on a corpus of domain-specific knowledge, and implements a new metric that estimates an interaction score between two molecules in the space where the corresponding words are embedded. We use INtERAcT to reconstruct the molecular pathways of 10 different cancer types using corpora of disease-specific articles, considering the STRING database as a benchmark. Our metric outperforms currently adopted approaches and it is highly robust to parameter choices, leading to the identification of known molecular interactions in all studied cancer types. Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and can therefore be efficiently applied to different scientific domains.

A preprint version of the article is available on arXiv.

Data availability

The article abstracts and full texts used to generate INtERAcT PPI scores can be freely downloaded from PMC. For this project, STRING interaction scores were downloaded on 19 October 2017. STRING historical data can be downloaded from https://string-db.org/cgi/access.pl?footer_active_subpage=archive. The article collection and the historical STRING data are also available from the corresponding author upon request. The networks generated for KEGG cancer pathways, together with the corresponding word vectors and entity lists, can be downloaded from https://ibm.biz/interact-data.

Code availability

The INtERAcT implementation used to produce the results in this Article can be accessed via https://doi.org/10.5281/zenodo.2576762 (ref. 44), and the open-source code is available on GitHub (https://github.com/drugilsberg/interact). The INtERAcT Python package can be installed via pip and provides a set of utilities to build interaction networks from word vectors, using either INtERAcT or other metrics.
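
As a rough illustration of what building an interaction network from word vectors with one of the "other metrics" might look like, the sketch below scores every pair of entities by cosine similarity and keeps the pairs above a threshold. It does not call the INtERAcT package and does not reproduce the INtERAcT score; the function names `cosine` and `cosine_network`, the threshold value and the toy vectors are all assumptions invented for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_network(vectors, entities, threshold=0.5):
    """Return {(a, b): score} for entity pairs with cosine similarity >= threshold.

    `vectors` maps a word to a list of floats. Entities without a vector are
    skipped. This is a plain cosine baseline, not the INtERAcT metric.
    """
    present = [e for e in entities if e in vectors]
    edges = {}
    for a, b in combinations(present, 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "TP53": [1.0, 0.0, 0.0],
    "MDM2": [0.9, 0.1, 0.0],
    "AR": [0.0, 1.0, 0.0],
}
# "KLK3" has no vector in the toy embedding and is therefore ignored.
network = cosine_network(toy_vectors, ["TP53", "MDM2", "AR", "KLK3"])
for (a, b), score in network.items():
    print(f"{a} - {b}: {score:.3f}")
```

With real embeddings, the resulting scores could then be compared against a reference such as the STRING combined score; the threshold here is arbitrary and would need to be chosen for the corpus at hand.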

INtERAcT is also available as a service hosted on IBM Cloud at https://ibm.biz/interact-aas. The web service builds a molecular interaction network given word vectors in Word2Vec binary format and a list of molecular entities (example data are made available through a download link in the app).
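
For readers preparing input for the service, the following sketch shows the layout commonly used for Word2Vec binary files: a text header with the vocabulary size and the vector dimension, followed by each word, a space, and the vector as 32-bit floats. It is a minimal reader and writer under stated assumptions (little-endian floats, UTF-8 words, a newline after each vector); the helper names `write_word2vec_binary` and `read_word2vec_binary` and the toy vectors are hypothetical and are not part of INtERAcT.

```python
import io
import struct


def write_word2vec_binary(stream, vectors):
    """Write {word: [floats]} to a binary stream in the Word2Vec binary layout."""
    dim = len(next(iter(vectors.values())))
    # Header line: "<vocabulary size> <dimension>\n"
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, values in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *values))  # little-endian float32
        stream.write(b"\n")


def read_word2vec_binary(stream):
    """Read a Word2Vec binary stream into {word: tuple_of_floats}."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            byte = stream.read(1)
            if byte in (b" ", b""):
                break
            if byte != b"\n":  # skip the newline left after the previous vector
                chars.extend(byte)
        # Read exactly dim float32 values, so vector bytes are never parsed as text.
        vectors[chars.decode("utf-8")] = struct.unpack(f"<{dim}f", stream.read(4 * dim))
    return vectors


# Round trip on toy vectors (values chosen to be exactly representable as float32).
toy = {"TP53": [0.5, -1.25, 2.0], "MDM2": [1.0, 0.0, -0.75]}
buffer = io.BytesIO()
write_word2vec_binary(buffer, toy)
buffer.seek(0)
loaded = read_word2vec_binary(buffer)
print(loaded["MDM2"])
```

In practice an established reader is usually preferable, for example gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`; the hand-written version above is only meant to make the file layout explicit.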

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).
  2. Zlotta, A. R. et al. Prevalence of prostate cancer on autopsy: cross-sectional study on unscreened Caucasian and Asian men. J. Natl Cancer Inst. 105, 1050–1058 (2013).
  3. Cooperberg, M. R., Broering, J. M. & Carroll, P. R. Risk assessment for prostate cancer metastasis and mortality at the time of diagnosis. J. Natl Cancer Inst. 101, 878–887 (2009).
  4. Chou, R. et al. Screening for prostate cancer: a review of the evidence for the U.S. Preventive Services Task Force. Ann. Intern. Med. 155, 762–771 (2011).
  5. Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. & Leser, U. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. 6, e1000837 (2010).
  6. Tjioe, E., Berry, M. W. & Homayouni, R. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization). BMC Bioinformatics 11, S14 (2010).
  7. Mandloi, S. & Chakrabarti, S. PALM-IST: pathway assembly from literature mining—an information search tool. Sci. Rep. 5, 10021 (2015).
  8. Barbosa-Silva, A. et al. PESCADOR, a web-based tool to assist text mining of biointeractions extracted from PubMed queries. BMC Bioinformatics 12, 435 (2011).
  9. Fleuren, W. W. et al. Identification of new biomarker candidates for glucocorticoid induced insulin resistance using literature mining. BioData Min. 6, 2 (2013).
  10. Raja, K., Subramani, S. & Natarajan, J. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature. Database (Oxford) 2013, bas052 (2013).
  11. Usie, A., Karathia, H., Teixidó, I., Alves, R. & Solsona, F. Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents. PeerJ 2, e276 (2014).
  12. Torii, M. et al. RLIMS-P 2.0: a generalizable rule-based information extraction system for literature mining of protein phosphorylation information. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 17–29 (2015).
  13. Collobert, R. & Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. 25th International Conference on Machine Learning 160–167 (ACM, 2008).
  14. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).
  15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. 26th International Conference on Neural Information Processing Systems, NIPS’13 3111–3119 (Curran Associates, 2013).
  16. Nie, Y., Rong, W., Zhang, Y., Ouyang, Y. & Xiong, Z. Embedding assisted prediction architecture for event trigger identification. J. Bioinform. Comput. Biol. 13, 1541001 (2015).
  17. Zhou, D., Zhong, D. & He, Y. Event trigger identification for biomedical events extraction using domain knowledge. Bioinformatics 30, 1587–1594 (2014).
  18. Li, C. et al. Using word embedding for bio-event extraction. In Proc. 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015) 121–126 (Association for Computational Linguistics, 2015).
  19. Wang, Y., Liu, Z. & Sun, M. Incorporating linguistic knowledge for learning distributed word representations. PLoS ONE 10, e0118437 (2015).
  20. Jiang, Z., Li, L. & Huang, D. An unsupervised graph based continuous word representation method for biomedical text mining. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 634–642 (2016).
  21. Zhao, Z., Yang, Z., Lin, H., Wang, J. & Gao, S. A protein–protein interaction extraction approach based on deep neural network. Int. J. Data Min. Bioinform. 15, 145–164 (2016).
  22. Szklarczyk, D. et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
  23. Šarić, J., Jensen, L. J., Ouzounova, R., Rojas, I. & Bork, P. Extraction of regulatory gene/protein networks from Medline. Bioinformatics 22, 645–650 (2006).
  24. Aggarwal, C. C., Hinneburg, A. & Keim, D. A. On the surprising behavior of distance metrics in high dimensional spaces. In Proc. 8th International Conference on Database Theory, ICDT ’01 420–434 (Springer, 2001).
  25. Fisher, R. A. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–521 (1915).
  26. Ramos, Y. F. M. et al. Genome-wide assessment of differential roles for p300 and CBP in transcription regulation. Nucleic Acids Res. 38, 5396–5408 (2010).
  27. Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O’Reilly Media, 2009).
  28. Baroni, M., Dinu, G. & Kruszewski, G. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. 52nd Annual Meeting of the Association for Computational Linguistics Vol. 1, 238–247 (Association for Computational Linguistics, 2014).
  29. Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975).
  30. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37, 145–151 (1991).
  31. Endres, D. M. & Schindelin, J. E. A new metric for probability distributions. IEEE Trans. Inf. Theory 49, 1858–1860 (2003).
  32. Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 32, D449–D451 (2004).
  33. Chatr-aryamontri, A. et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45, D369–D379 (2017).
  34. Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455 (2004).
  35. Licata, L. et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40, D857–D861 (2011).
  36. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 36, D623–D631 (2007).
  37. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25 (2000).
  38. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
  39. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
  40. Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2014).
  41. Franceschini, A. et al. STRING v9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).
  42. Florkowski, C. M. Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin. Biochem. Rev. 29, S83–S87 (2008).
  43. Von Mering, C. et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–D437 (2005).
  44. Manica, M. drugilsberg/interact: First release of interact. Zenodo https://doi.org/10.5281/zenodo.2576762 (2019).

Acknowledgements

The authors thank C. Bekas and Y. Ineichen for useful discussions. The project leading to this application received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 668858.

Author information

M.M., R.M., J.C. and M.R.M. conceived the study and analyses. M.M., R.M. and J.C. implemented INtERAcT and performed data analysis. M.R.M. provided biological interpretation. M.M., R.M., J.C. and M.R.M. wrote the manuscript, with input from all authors.

Competing interests

The authors declare no competing interests.

Correspondence to María Rodríguez Martínez.

Supplementary information

  1. Supplementary Information

    Supplementary notes and figures

Fig. 1: Description of the skip-gram model.
Fig. 2: Schematic representation of INtERAcT.
Fig. 3: Performance of INtERAcT on prostate cancer embedding.
Fig. 4: INtERAcT performance compared to other distance measures using STRING as a ground truth.
Fig. 5: Exploration of the influence of word embedding parameters on AUC for different methods and ground truths.
Fig. 6: Parametric dependency of INtERAcT using STRING combined score as a ground truth.