A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories

Hong, Lixiang; Lin, Jinjian; Li, Shuya; Wan, Fangping; Yang, Hui; Jiang, Tao; Zhao, Dan; Zeng, Jianyang

doi:10.1038/s42256-020-0189-y

Article
Published: 08 June 2020

Nature Machine Intelligence volume 2, pages 347–355 (2020)

Abstract

Knowledge about the relations between biomedical entities (such as drugs and targets) is widely distributed in more than 30 million research articles and consistently plays an important role in the development of biomedical science. In this work, we propose a novel machine learning framework, named BERE, for automatically extracting biomedical relations from large-scale literature repositories. BERE uses a hybrid encoding network to better represent each sentence from both semantic and syntactic aspects, and employs a feature aggregation network to make predictions after considering all relevant statements. More importantly, BERE can also be trained without any human annotation via a distant supervision technique. Through extensive tests, BERE has demonstrated promising performance in extracting biomedical relations, and can also find meaningful relations that were not reported in existing databases, thus providing useful hints to guide wet-lab experiments and advance the biological knowledge discovery process.

**Fig. 2: Test results on the distantly supervised DTI dataset.**

**Fig. 3: The in vitro inhibitory activity of nintedanib against JAK2 and EGFR.**

Data availability

The DDI and DTI datasets used in this work can be found at https://github.com/haiya1994/BERE. The full dataset for discovering potential DTIs is available from the corresponding authors upon request.

Code availability

The source code of BERE can be downloaded from the GitHub repository at https://github.com/haiya1994/BERE or the Zenodo repository at https://doi.org/10.5281/zenodo.3757058. All other code may be obtained from the corresponding authors upon request.

Acknowledgements

We thank Z. Liu, T. Yang and H. Hu for their helpful discussions about this work. This work was supported in part by the National Natural Science Foundation of China (grants 61872216, 81630103 and 31900862), the Turing AI Institute of Nanjing and the Zhongguancun Haihua Institute for Frontier Information Technology.

Author information

Authors and Affiliations

Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Lixiang Hong, Jinjian Lin, Shuya Li, Fangping Wan, Dan Zhao & Jianyang Zeng
Silexon Co. Ltd, Nanjing, China
Hui Yang
Bioinformatics Division, TNLIST, MOE Key Laboratory of Bioinformatics and Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China
Tao Jiang
MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, China
Tao Jiang & Jianyang Zeng
Department of Computer Science and Engineering, University of California, Riverside, CA, USA
Tao Jiang

Contributions

L.H., D.Z. and J.Z. conceived the concept. L.H. designed the methodology and performed experiments. L.H., J.L., S.L., T.J. and D.Z. analysed the results. H.Y. contributed to wet-lab experiments. L.H. and J.Z. wrote the paper. S.L., F.W., T.J. and D.Z. contributed to revision of the manuscript.

Corresponding authors

Correspondence to Dan Zhao or Jianyang Zeng.

Ethics declarations

Competing interests

J.Z. is founder and CTO of Silexon AI Technology Co. Ltd and has an equity interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of the precision-recall curve between BERE and its alternatives with other sentence aggregation strategies on the distantly supervised DTI dataset.

BERE+POOL and BERE+AVE adopt a max-pooling strategy and an average strategy to aggregate sentence representations, respectively. The legend on the top right contains area under precision-recall curve (AUPRC) and F₁-score for each method.
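
The two alternative aggregation strategies named above can be illustrated with a minimal sketch. It assumes each sentence in a bag has already been encoded as a fixed-length vector; the function names and the toy vectors are hypothetical and are not taken from the BERE source code.

```python
from typing import List

Vector = List[float]


def max_pool(bag: List[Vector]) -> Vector:
    """Element-wise maximum over the sentence vectors of one bag (the POOL idea)."""
    return [max(dim) for dim in zip(*bag)]


def average(bag: List[Vector]) -> Vector:
    """Element-wise mean over the sentence vectors of one bag (the AVE idea)."""
    return [sum(dim) / len(bag) for dim in zip(*bag)]


# Toy bag: three sentences about one entity pair, each a 3-dimensional vector.
# The values are illustrative only.
toy_bag = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.5],
    [0.1, 0.4, 0.1],
]

pooled = max_pool(toy_bag)
averaged = average(toy_bag)
```

Either function reduces a bag of any size to a single vector that a classifier can consume. Max-pooling keeps the strongest signal per dimension, whereas averaging weights every sentence equally.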

Extended Data Fig. 2 The hyperparameter settings of BERE on different test datasets.

The learning rates were determined using a grid search among {0.0001, 0.0002, …, 0.001}. Other hyperparameters were set empirically.
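
A grid search over this set could be sketched as follows. The step of 0.0001 is an assumption inferred from the first two listed values, and `toy_validation_score` is a hypothetical stand-in for training a model and scoring it on held-out data; neither comes from the paper.

```python
from typing import Callable, List, Tuple


def learning_rate_grid() -> List[float]:
    """Candidate rates 0.0001, 0.0002, ..., 0.001 (step of 0.0001 assumed)."""
    return [round(i * 1e-4, 4) for i in range(1, 11)]


def grid_search(evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Return (best_rate, best_score), where evaluate(rate) is higher-is-better."""
    scored = [(evaluate(lr), lr) for lr in learning_rate_grid()]
    best_score, best_lr = max(scored)
    return best_lr, best_score


def toy_validation_score(lr: float) -> float:
    """Hypothetical score function that peaks at 0.0003, for illustration only."""
    return 1.0 - abs(lr - 0.0003) * 1000


best_lr, best_score = grid_search(toy_validation_score)
```

In practice `evaluate` would train the model at the given learning rate and return a validation metric such as AUPRC.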

Extended Data Fig. 3 The basic statistics of the datasets used in our tests.

(a) The numbers of sentences annotated with five different types of DDI relations in the DDI’13 dataset. NA means no interaction. ADVICE means the recommended concomitant medication usage. EFFECT means that there exists a certain pharmacodynamic effect between two drugs. MECHANISM means that there exists a certain pharmacokinetic mechanism between two drugs. INT means that a DDI occurs without any additional information. (b) The numbers of bags of sentences annotated with six different types of DTI relations in the distantly supervised DTI dataset. NA means no interaction. Substrate means that the drug is what the target (that is, enzyme) acts upon. Inhibitor means that the drug binds to the target (that is, enzyme) and impedes the functioning of the target. Agonist/Antagonist means that the drug binds to the target (that is, receptor) and activates/blocks it to produce a biological response. Unknown means that there exists a certain relation between a drug–target pair, but the action mechanism is unknown in DrugBank. Other is a unified name for all the other types of interactions with fewer occurrences. The unlabelled set, which was mainly used for prediction, was collected from the PMC articles after excluding abstracts.
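
The notion of a "bag of sentences" in panel (b) can be made concrete with a small sketch of distant supervision labelling. It assumes the common convention that all sentences mentioning the same drug–target pair form one bag, which inherits the pair's label from a knowledge base, with NA for pairs the knowledge base does not list. The function name, the placeholder entity names and the toy sentences are hypothetical; the authors' actual preprocessing may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]               # (drug, target)
Mention = Tuple[str, str, str]       # (drug, target, sentence text)


def build_bags(
    mentions: List[Mention],
    known_relations: Dict[Pair, str],
) -> Dict[Pair, Tuple[str, List[str]]]:
    """Group sentences by entity pair and attach a knowledge-base label.

    Pairs missing from known_relations receive the label "NA".
    """
    grouped: Dict[Pair, List[str]] = defaultdict(list)
    for drug, target, text in mentions:
        grouped[(drug, target)].append(text)
    return {
        pair: (known_relations.get(pair, "NA"), texts)
        for pair, texts in grouped.items()
    }


# Placeholder data, not real biomedical facts.
toy_mentions = [
    ("drug_A", "target_X", "drug_A reduced the activity of target_X."),
    ("drug_A", "target_X", "target_X was blocked after drug_A treatment."),
    ("drug_B", "target_Y", "drug_B and target_Y were measured together."),
]
toy_knowledge_base = {("drug_A", "target_X"): "Inhibitor"}

bags = build_bags(toy_mentions, toy_knowledge_base)
```

Labelling at the bag level rather than per sentence means that no individual sentence needs a human annotation; the cost is that some sentences in a labelled bag may not actually express the relation.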

About this article

Cite this article

Hong, L., Lin, J., Li, S. et al. A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories. Nat Mach Intell 2, 347–355 (2020). https://doi.org/10.1038/s42256-020-0189-y

Received: 15 December 2019
Accepted: 07 May 2020
Published: 08 June 2020
Issue Date: June 2020
DOI: https://doi.org/10.1038/s42256-020-0189-y
