Constructing a disease database and using natural language processing to capture and standardize free text clinical information

Raza, Shaina; Schwartz, Brian

doi:10.1038/s41598-023-35482-0

Article
Open access
Published: 26 May 2023

Shaina Raza^1,2 &
Brian Schwartz^1,2

Scientific Reports volume 13, Article number: 8591 (2023)

Abstract

The ability to extract key information about an infectious disease in a timely manner is critical for population health research. A major impediment is the lack of procedures for mining large amounts of health data. The goal of this research is to use natural language processing (NLP) to extract key information (clinical factors, social determinants of health) from free text. The proposed framework describes database construction, NLP modules for locating clinical and non-clinical (social determinants) information, and a detailed evaluation protocol for assessing results and demonstrating the effectiveness of the framework. We demonstrate the framework on COVID-19 case reports for database construction and pandemic surveillance. The proposed approach outperforms benchmark methods in F1-score by about 1–3%. A thorough examination reveals the disease’s presence as well as the frequency of symptoms in patients. The findings suggest that prior knowledge gained through transfer learning can be useful when researching infectious diseases with similar presentations in order to accurately predict patient outcomes.

Introduction

As of November 25, 2022, COVID-19 had infected more than 640 million people, with over 6.63 million deaths¹. There are serious concerns about the impact of infectious disease on society, global health, and the economy^2,3,4. It is necessary to develop an efficient surveillance system that can automatically track the spread of infectious diseases by collecting, analyzing, and reporting data to those responsible for disease prevention and control.

Natural language processing (NLP) has the potential to significantly improve public health by aiding in the analysis of vast amounts of textual data from various sources⁵, including social media, electronic health records (EHRs) and published literature. By using NLP techniques, it is possible to extract valuable insights and patterns that can aid in the early detection and monitoring of infectious diseases⁶. However, challenges still exist in applying NLP to public health data, including data quality and accuracy, and variability in language and terminology used in health-related texts⁷.

To address the challenges in using free texts from EHRs and clinical notes for epidemiological and research purposes, we propose an effective NLP framework. This framework is based on deep neural network models that extract key information (entities) from the texts to study clinical and non-clinical factors associated with infectious diseases, including COVID-19. The objective of our study is to bridge the gap between NLP methods and their applications in public health to assist policymakers in decision-making and accelerate research. Our main research question (RQ) is: How can free text be transformed into a readable format to create a disease database, and query it for the factors associated with an infectious disease?

Contributions Our proposed framework consists of a comprehensive pipeline that includes the creation of a high-quality database from published case reports, the design and implementation of NLP models to detect and examine clinical and non-clinical concepts in the data, and a thorough evaluation process. A named entity recognition (NER) algorithm⁸ is included in the NLP models, and it is capable of accurately identifying essential clinical concepts such as diseases, conditions, symptoms, and drugs, as well as non-clinical concepts such as social determinants of health (SDOH)⁹. Furthermore, we developed a relation extraction (RE) model to identify relationships between these concepts, including disease-complication, treatment-improvement, and drug-adverse-effect associations. A two-phase evaluation approach is proposed, in which the proposed methodology is first compared to existing benchmarks, and the second phase includes a detailed analysis and human evaluation to demonstrate the framework’s usefulness for pandemic surveillance.

Novelty of the study The proposed NLP framework contributes to the public health domain by introducing a data construction module, NLP modules based on the Transformer¹⁰ architecture, and a detailed evaluation phase. Through the use of few-shot learning^11,12 techniques, our framework significantly reduces the need for manual annotations and enables more efficient and accurate identification and analysis of clinical concepts within the data. A major contribution of our framework is its ability to extract both SDOH and clinical factors, which distinguishes it from previous works^13,14,15,16,17,18,19 that have primarily focused on clinical factors. By enabling the identification of important patterns in disease diagnosis, our methodology facilitates more informed decision-making.

Materials and methods

Data collection

We constructed a comprehensive COVID-19 patient database using electronic case reports sourced from published literature. Specifically, we curated the case reports using a search query (Supplementary Table S1) through the National Library of Medicine (NLM)²⁰ API. We applied specific inclusion and exclusion criteria to ensure that the collected data were relevant and of high quality. The participants in the study were not human subjects, but rather clinical case reports related to COVID-19 that were obtained from published literature.

The study systematically categorized case reports to analyze a diverse range of clinical experiences and interventions for COVID-19 across different demographics. The case reports were classified into five age groups: Child (6–12 years), Adolescent (13–18 years), Adult (19–44 years), Middle Aged (45–64 years), and Aged (65+ years). The collected data encompassed various approaches related to clinical classification and interventions, including screening, diagnosis, treatments, and therapies for COVID-19. The data collection period spanned from 1 March to 30 June 2022.

To ensure consistency and accuracy in the analysis, only studies published in English were included. We excluded non-English publications, grey literature, preprints, and clinical trial registers. After applying these filtration criteria, we obtained about 5000 case reports. Each case report generally corresponds to one patient report²¹, although there may be exceptions.

Proposed framework

The proposed NLP framework is shown in Fig. 1 and explained next.

Database preparation

We collected case reports from NLM sources in PDF format. These PDFs were processed using Spark OCR²² and transformed into a data frame format, which was then indexed with Elasticsearch²³ to create a COVID-19 disease database.
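
As an illustration of the indexing step, the following minimal sketch serializes extracted case-report records into the newline-delimited JSON body accepted by the Elasticsearch bulk API. The index name `covid_case_reports`, the field names, and the sample record are hypothetical; the paper does not describe its index schema, and the OCR step is assumed to have already produced plain text.

```python
import json


def to_bulk_ndjson(records, index_name="covid_case_reports"):
    """Serialize records into an Elasticsearch bulk-API request body.

    Each document becomes two lines: an action line naming the target
    index and document id, followed by the document source itself.
    The index name and the 'report_id' field are hypothetical.
    """
    lines = []
    for rec in records:
        action = {"index": {"_index": index_name, "_id": rec["report_id"]}}
        source = {k: v for k, v in rec.items() if k != "report_id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical record, standing in for one OCR-processed case report.
records = [
    {
        "report_id": "case-0001",
        "age_group": "Adult",
        "text": "A 34-year-old patient presented with fever and cough.",
    }
]

body = to_bulk_ndjson(records)
print(body)
```

The resulting string could then be sent to the bulk endpoint with any HTTP client or an official Elasticsearch client.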

A gold-standard dataset was created by randomly selecting 150 case reports and having four biomedical domain experts annotate them with clinical and non-clinical named entities. Approximately 550 sentences and 3,000 gold labels were produced as a result of this annotation process. A few-shot learning¹¹ technique was used in conjunction with the BERT model to refine the dataset further and train deep neural network-based models. Few-shot learning refers to a machine learning approach that aims to enable models to learn from a limited amount of labeled data¹¹.

The initial gold-standard dataset was used for training BERT²⁴ for token classification during the few-shot learning process. Predictions were then generated for a subset of unlabeled data. New predictions were selectively sampled, verified by a human, and then added to the existing training set. After that, the classifier was retrained on the new dataset. This iterative process was repeated until convergence was achieved. This learning loop began with 1,100 sentences from the gold-standard dataset and continued until approximately 5,000 samples were collected. This procedure yielded a maximum accuracy of around 93.5%. The final dataset included 40,000 sentences and 320,000 gold labels (named entities). Supplementary Table S2 contains key data statistics.
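
The train, predict, verify, and retrain cycle described above can be sketched as follows. This is only an illustration of the control flow under simplifying assumptions: a dictionary lookup stands in for the BERT token classifier, a small gold lexicon stands in for the human verifier, and batches are taken in order rather than selectively sampled. All names and example sentences are hypothetical.

```python
def iterative_labeling(seed, unlabeled, train, predict, verify,
                       batch_size=2, target_size=None):
    """Grow a labeled set by predicting, verifying, and retraining."""
    labeled = list(seed)
    pool = list(unlabeled)
    model = train(labeled)
    while pool and (target_size is None or len(labeled) < target_size):
        batch, pool = pool[:batch_size], pool[batch_size:]
        for tokens in batch:
            predicted = predict(model, tokens)
            # Human-in-the-loop step: predictions are checked before reuse.
            labeled.append((tokens, verify(tokens, predicted)))
        model = train(labeled)  # retrain on the enlarged set
    return model, labeled


# Toy stand-ins (assumptions, not the paper's models).
def train(labeled):
    lexicon = {}
    for tokens, tags in labeled:
        for token, tag in zip(tokens, tags):
            lexicon[token.lower()] = tag
    return lexicon


def predict(model, tokens):
    return [model.get(token.lower(), "O") for token in tokens]


GOLD = {"fever": "B-SYMPTOM", "cough": "B-SYMPTOM", "remdesivir": "B-DRUG"}


def verify(tokens, predicted):
    # A real verifier would be a domain expert correcting the predictions.
    return [GOLD.get(token.lower(), "O") for token in tokens]


seed = [(["Patient", "had", "fever"], ["O", "O", "B-SYMPTOM"])]
unlabeled = [["Persistent", "cough"], ["Started", "remdesivir"], ["No", "fever"]]

model, labeled = iterative_labeling(seed, unlabeled, train, predict, verify)
print(len(labeled), predict(model, ["Cough", "and", "fever"]))
```

In practice the stopping rule would be a convergence check on validation accuracy or a target sample count, which `target_size` approximates here.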

NLP models

The NLP models developed in this work are: (1) a fine-tuned Transformer model; (2) a NER module to produce named entities; (3) a RE module to define relationships between the named entities.

Fine-tuned Transformer model We fine-tuned BioBERT¹³ (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) on our annotated dataset to prepare a task-specific Transformer model. Fine-tuning is a lightweight way to reuse the weights of an existing large language model²⁵, so we prefer it over pre-training for this work.

Named entity recognition model The proposed NER model, shown in Supplementary Fig. S1, is an advanced adaptation of the bi-directional long short-term memory (BiLSTM)²⁶ model with a conditional random field (CRF)²⁷ layer added. We used a Transformer layer as the first layer to improve the model performance. This layer combines attention matrices to obtain contextualized information, which is then used to generate a word vector with varying semantics depending on the context. In this case, we make use of our task-specific Transformer model.

The BiLSTM layer follows the Transformer layer. It takes the Transformer output vector as input and incorporates contextual features to derive comprehensive semantic information from the text. The output of the BiLSTM layer is the predicted label for each word in the sequence. The final layer, the CRF layer, takes the BiLSTM sequence as input and determines the dependencies between named tags. The CRF layer constrains the final predicted labels using the Inside-Outside-Beginning (IOB)²⁸ format, a tagging scheme designed for NER chunking tasks.
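
To make the role of the IOB constraint concrete, the sketch below runs Viterbi decoding over per-token scores while forbidding invalid tag transitions (an I- tag may only follow a B- or I- tag of the same type). It is a simplification: a trained CRF learns real-valued transition scores, whereas this example uses hard constraints only, and the tag set and scores are made up for illustration.

```python
import math


def allowed(prev, curr):
    """IOB constraint: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def viterbi_iob(emissions, tags):
    """Return the best-scoring tag sequence that respects the IOB constraint."""
    # A sequence may not start with an I- tag.
    scores = [{t: (-math.inf if t.startswith("I-") else emissions[0][t])
               for t in tags}]
    back = []
    for emit in emissions[1:]:
        step, pointers = {}, {}
        for curr in tags:
            best_prev, best = None, -math.inf
            for prev in tags:
                if allowed(prev, curr) and scores[-1][prev] > best:
                    best_prev, best = prev, scores[-1][prev]
            step[curr] = best + emit[curr] if best_prev is not None else -math.inf
            pointers[curr] = best_prev
        scores.append(step)
        back.append(pointers)
    # Backtrack from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-DIS", "I-DIS"]  # hypothetical tag set
# Hypothetical per-token scores for "acute respiratory failure".
emissions = [
    {"O": 0.1, "B-DIS": 0.2, "I-DIS": 0.9},
    {"O": 0.1, "B-DIS": 0.2, "I-DIS": 0.8},
    {"O": 0.3, "B-DIS": 0.1, "I-DIS": 0.7},
]

print(viterbi_iob(emissions, TAGS))
```

A per-token argmax would label the first token I-DIS, which is not a valid start of a chunk; the constrained decoder instead begins the entity with B-DIS.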

The model then converts the IOB representation into a user-friendly format by associating chunks with their labels and removing NER chunks with no associated entities. The named entities are given in Supplementary Table S3, and a visual representation of the named entities in a piece of text is shown in Supplementary Fig. S2.
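
A minimal sketch of this conversion step is given below. It groups IOB-tagged tokens into (text, label) chunks and drops tokens tagged O. The label names and the example sentence are hypothetical; the actual entity types are those listed in Supplementary Table S3.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open chunk
        else:  # "O": close any open chunk and skip the token
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


tokens = ["Patient", "developed", "acute", "respiratory", "distress",
          "syndrome", "after", "remdesivir"]
tags = ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE",
        "I-DISEASE", "O", "B-DRUG"]

print(iob_to_chunks(tokens, tags))
```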

Relation extraction model The RE task can identify a specific relation between two co-occurring entities²⁹, such as symptom-disease, disease-disease, and drug-effect associations. Inspired by recent advancements in NLP related to RE^11,30,31,32, we again utilize few-shot learning¹¹ as a means of inferring unobserved relationships within the text. In this context, few-shot learning enables the model to generalize and recognize novel relationships by leveraging a limited quantity of training instances from previously unseen classes³³.

The underlying mechanism of our proposed RE model is depicted in Fig. 2. We incorporate our fine-tuned model weights for the Transformer layer during the fine-tuning process of RE. This few-shot learning strategy embeds sentences and relationship descriptors within a unified embedding space, minimizing distances between them iteratively. As a result, the model effectively classifies unobserved relationships by leveraging the limited labelled data³⁴.
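
The idea of a shared embedding space can be illustrated with a small sketch: a sentence is assigned the relation whose descriptor vector lies closest to the sentence vector. Here the vectors are hand-made three-dimensional toys and cosine similarity is assumed as the closeness measure; in the framework itself the embeddings would come from the fine-tuned Transformer, and the relation names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_relation(sentence_vec, relation_vecs):
    """Pick the relation descriptor nearest to the sentence embedding."""
    return max(relation_vecs,
               key=lambda name: cosine(sentence_vec, relation_vecs[name]))


# Hypothetical descriptor embeddings for three relation types.
relation_vecs = {
    "drug_causes_effect": [0.9, 0.1, 0.0],
    "disease_has_symptom": [0.1, 0.9, 0.1],
    "treatment_improves_condition": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a sentence describing an adverse drug event.
sentence_vec = [0.8, 0.2, 0.1]

print(classify_relation(sentence_vec, relation_vecs))
```

Because classification only needs a descriptor vector per relation, a new relation type can be added at inference time without labeled examples, which is what allows unobserved relationships to be recognized.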

Evaluation

Our study employs a dual evaluation strategy: Phase 1 (quantitative assessment) and Phase 2 (qualitative assessment). In Phase 1, we compare the accuracy of our proposed methods with baseline approaches across benchmark datasets; in Phase 2, we demonstrate the usefulness of the proposed method for pandemic surveillance using unlabeled data. Datasets are randomly allocated into 70% training, 15% validation, and 15% testing. For our own test set, we reserve 30% of the annotated data for evaluation.
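
A random 70/15/15 allocation of this kind might look like the sketch below. The seed value and function name are illustrative assumptions; the paper does not state how the shuffling was seeded or whether the split was stratified.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=13):
    """Shuffle items and split them into train, validation, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])  # the remainder forms the test set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```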

The experimental configuration utilizes an Intel(R) Core(TM) i7-8565U CPU, Google Colab Pro with cloud-based GPUs, and Google Drive for storage. Following the tradition in related works¹³, we evaluate NER and RE tasks using precision, recall, and F1-measure, reporting top results for each optimized method. BERT encoder layers are implemented using PyTorch BERT from Huggingface³⁵. Human evaluation is also performed to validate the efficacy of the 2-phase evaluation strategies. Supplementary Table S4 contains benchmark dataset and baseline approach details. General hyperparameters are listed in Supplementary Table S5.
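
For reference, precision, recall, and F1 over extracted entities can be computed as in the sketch below. It assumes exact-match scoring, where a predicted entity counts as correct only if its sentence index, text, and label all match a gold entity; the paper does not state which matching criterion was used, and the example entities are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Entities as (sentence_index, text, label); all values are hypothetical.
gold = {
    (0, "fever", "SYMPTOM"),
    (0, "cough", "SYMPTOM"),
    (1, "pneumonia", "DISEASE"),
    (1, "remdesivir", "DRUG"),
}
predicted = {
    (0, "fever", "SYMPTOM"),
    (1, "pneumonia", "DISEASE"),
    (1, "headache", "SYMPTOM"),
}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are found (recall 1/2), giving an F1 of 4/7.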

Results

Phase 1: quantitative assessment

In the Phase 1 evaluation, the NER and RE tasks are compared on F1-score; the results are given in Table 1.

Table 1 Results of the evaluation of the named entity recognition (NER) and relation extraction (RE) tasks using k-fold (k = 5) cross-validation on various datasets and baselines. The evaluation for NER is done on both our test set and benchmark test sets, whereas for the RE task, we do not have a labeled test set, so only benchmark test sets are used. Arrow (↑) indicates a statistically significant improvement of our proposed approach compared to other models, with a p-value < 0.05 based on a two-sample t-test.

Analysis for the NER Task: We assessed the performance of various models for the NER task on a variety of benchmark datasets as well as on our own test set. As shown in Table 1, the proposed approach achieved the highest F1-scores across all datasets and significantly outperformed the baseline methods. For disease entities, our model achieved an F1-score of 91.73% on the NCBI-disease dataset. Our method, along with BERT-based methods and Att-BiLSTM-CRF, obtained F1-scores above 90% on the BC4CHEMD dataset with chemical entities. The proposed approach, BERT-based methods, and BioGPT also performed well on the named entities of proteins and genes in the BC2GM dataset. Our model, BERT-based methods, and BioGPT also showed good performance gains on the clinical entities provided by the i2b2 datasets. The performance gain of our approach can be attributed to the clinical embeddings provided by BioBERT, which significantly improved the performance on clinical and disease entities.

The proposed NER approach achieved the highest median F1-score compared to other models in fivefold cross-validation on our test set (Supplementary Fig. S3). The F1-scores for other models ranged from 87.2 to 92.8, while our approach achieved a significantly higher median F1-score. A two-sample t-test revealed that our approach significantly outperformed most of the other baseline models for NER. Although BioBERT had higher F1-scores on some datasets, our approach still showed significant differences (p < 0.001) on our test set.
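
As a worked illustration of a two-sample comparison over cross-validation folds, the sketch below computes Welch's t statistic and its approximate degrees of freedom. Welch's unequal-variance variant is an assumption, since the paper does not say which form of the two-sample t-test was used, and the fold scores are invented for the example rather than taken from Table 1.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard errors
    var_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical F1-scores from five folds for two models.
proposed = [0.93, 0.94, 0.92, 0.95, 0.93]
baseline = [0.90, 0.91, 0.89, 0.92, 0.90]

t_stat, dof = welch_t(proposed, baseline)
print(f"t={t_stat:.2f}, df={dof:.1f}")
```

The p-value is then obtained by comparing the statistic against a t distribution with the computed degrees of freedom, for example from a statistical table or a statistics library.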

Analysis for RE Task As shown in Table 1, our RE method outperformed all competing methods on all benchmark datasets, demonstrating the effectiveness of the transfer learning mechanism through the Transformer model. A two-sample t-test showed that our proposed approach had a significantly higher mean F1-score (90%) than all other methods tested, including BioBERT and BioGPT. We also conducted a two-sample t-test on the ADE dataset specifically: our approach achieved a significantly higher mean F1-score (91.73%) than all other methods tested, including BioBERT and BioGPT (p-value < 0.05), providing additional evidence for the efficacy of our approach for the RE task. Overall, these findings indicate that our approach has potential for real-world NLP applications.

Phase 2: qualitative assessment

Effectiveness of named entity recognition approach on clinical entities We begin by showing the percentage distribution of COVID-19 symptoms among hospitalized patients in Fig. 3a; fever, cough, and shortness of breath are the most frequent. We also show the percentage distribution of the most frequent medical complications in Fig. 3b; pneumonia, acute respiratory distress syndrome (ARDS), thrombosis, and myocardial and kidney injury are among the most common medical complications in hospitalized COVID-19 patients.

We further categorized symptoms by disease syndrome and present their prevalence in COVID-19 patients in Table 2. The results in Table 2 show that patients with pulmonary disease are more likely to experience cough, fever, and shortness of breath, while patients with psychological conditions are more likely to experience anxiety and depression.

Table 2 Prevalence of symptoms categorized according to major disease syndromes in COVID-19 hospitalized patients.

Effectiveness of named entity recognition approach on social determinants of health (SDOH) The NLP framework applied to COVID-19 data also yielded SDOH-related findings depicted in Fig. 4.

Race and ethnicity were found to be significant factors associated with COVID-19 cases and deaths, with Black and indigenous communities being disproportionately affected, as shown in Fig. 4a. Socioeconomic status, health literacy, and access to healthcare were also associated with disease syndromes in COVID-19 patients, shown in Fig. 4b. Older age groups had a higher risk of hospitalization, ICU admission, and mortality, as shown in Fig. 4c, emphasizing the need for targeted interventions. The recovery rates for COVID-19 cases are shown in Fig. 4d. The recovery rate is a measure of how successful the treatment and care provided in each department have been in helping patients recover from COVID-19. These findings in Fig. 4, overall, underscore the importance of considering SDOH factors in public health surveillance and intervention efforts through NLP.

Effectiveness of relation extraction approach We demonstrate the effectiveness of the RE approach by specifying relationships on the fly. Table 3 displays relations between a disease disorder and the conditions or symptoms that appear afterwards. We observe in Table 3 that fever and cough are among the most common symptoms following COVID-19. We also observe that shortness of breath, heart failure, and so on are common symptoms following hypertension.

Table 3 ‘Symptoms followed by disease’. Disease disorders are chosen based on the frequency of prevalence (occurring > 70%).

Next, we demonstrate the relationship "DRUG causes [EFFECT]" in Table 4. The results presented in Table 4 provide insights into the adverse effects of commonly used drugs among COVID-19 patients. For instance, persistent fever was found to be a side effect of oral amoxicillin, while trilineage hematopoiesis was associated with pirfenidone and acute headache fever was a common side effect of the BNT162B2 vaccine.

Table 4 Relation: adverse drug events associated with common COVID-19 medications.

We also specify the relation between disease syndrome and psychological condition in Fig. 5 and find that depression and anxiety are the conditions associated with mental disorders.

Overall, these results show that our proposed NLP framework also has the potential for RE, which can aid in identifying and tracking the spread of infectious diseases and their associated risk factors.

Human evaluation

To further assess the performance of our proposed approach for the NER and RE tasks, we conducted a human evaluation. We chose 100 documents at random from the NCBI-Disease dataset and 50 documents from our test set for the NER task. Three domain experts annotated the documents, and the inter-annotator agreement⁴⁶ was calculated using Fleiss’ kappa⁴⁷, which revealed significant agreement (kappa score of 0.75). Our proposed method outperformed all other baseline methods, with an average precision of 89%, recall of 91%, and F1-score of 90%.
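
For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa from a table of rating counts, where each row is an item and each column a category. The example counts are invented (three raters, four items, two categories) and do not reproduce the study's annotations; the function assumes every item was rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumed constant across items
    n_categories = len(counts[0])

    # Observed agreement: mean proportion of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical ratings: columns are (entity, not-entity).
counts = [
    [3, 0],
    [0, 3],
    [2, 1],
    [3, 0],
]

kappa = fleiss_kappa(counts)
print(f"kappa={kappa:.3f}")
```

In this toy table the raters disagree on one item out of four, which yields a kappa of 0.625.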

Next, we chose 50 documents at random from the ADE dataset and 50 documents from the BioInfer dataset for the RE task. Three domain experts annotated the documents, and the inter-annotator agreement was calculated using Fleiss’ kappa, which revealed significant agreement (kappa score of 0.73). Our proposed method outperformed all other baseline methods, with an average precision of 86%, recall of 88%, and F1-score of 87%.

These findings indicate that our proposed method is highly effective for both NER and RE tasks, as evidenced by quantitative and qualitative evaluations.

Discussion

Principal findings In this study, we successfully constructed a dataset and inferred valuable information to address our research question. Our approach enables the creation of a dataset from unstructured text, preparing it for the study of infectious diseases such as COVID-19. Although we focus on COVID-19 data, the methodology can be applied to various diseases. The disease database we developed serves as a critical resource for pandemic surveillance, covering common COVID-19 conditions such as pneumonia, respiratory infections, and ARDS. Furthermore, we identified relationships between drugs and diseases. This framework benefits clinicians, medical professionals, nurses, epidemiologists, and researchers by streamlining data acquisition and decision-making.

Our experiments highlighted the impact of transfer learning in detecting COVID-19-related entities and relations. Both our NER and RE methods with pre-trained embeddings from Transformer architectures showed improvements over baseline methods. Additionally, few-shot learning proved useful in reducing annotation costs for building models, though further exploration of techniques for large-scale re-annotation is recommended. We also attempted to predict unseen relationships in texts using NLP. However, this approach differs from extracting the causal relationships typically used in epidemiological studies. We suggest incorporating the Bradford Hill⁴⁸ criteria and aligning public health initiatives with RE tasks. We also suggest using BioGPT⁴⁰ or GPT-2⁴⁹ in RE and NER experiments to see whether they improve performance.

Error analysis Our error analysis revealed that our model struggled to recognise certain abbreviations in the NER task. In particular, in the BC2GM dataset, our model had low recall for abbreviations such as "RNA" and "DNA." This is most likely because these abbreviations have multiple meanings and can be used in a variety of contexts. Furthermore, we discovered that our model struggled to distinguish between similar entities in the NCBI-Disease dataset. For example, the model frequently mixed up the terms "glioma" and "lymphoma," which both refer to cancer types. This implies that our model could benefit from more training data that highlights the subtle differences between these types of entities.

Limitations Limitations of our study include the reliance on published case reports, which may bias the sample towards sicker, hospitalized patients with Long-COVID and those seen by academic physicians. This excludes milder cases and patients who may be underserved or live in remote areas. Furthermore, our NLP approach to extracting relationships among entities may identify coincidental associations rather than causal links. Despite these limitations, the paper provides useful insights for clinicians, medical professionals, nurses, epidemiologists, and researchers, while further research on causality criteria in public health is necessary.

Conclusions

This study demonstrates that NLP-based methods can be used to identify the presence of disease, symptoms, and risk characteristics from free-text data. Transfer learning is promising for developing predictive disease models with limited data. The proposed methodology provides a robust way to infer named entities and relations in texts. The proposed methods achieve better F1-scores than state-of-the-art methods on both the NER and RE tasks. The current study also shows the effectiveness of the proposed approach for pandemic surveillance. Further studies are needed to validate the effectiveness of our approach in different clinical contexts and with larger and more diverse datasets. In addition, it would be interesting to explore the potential of our method in other applications, such as real-time monitoring of disease outbreaks or tracking the progression of pandemics across different geographic locations.

Data availability

The data underlying this article will be shared on reasonable request to the corresponding author.

References

Ourworldindata.org. COVID-19 Data Explorer. Our world in data at https://ourworldindata.org/explorers/coronavirus-data-explorer (2022).
Flor, L. S. et al. Quantifying the effects of the COVID-19 pandemic on gender equality on health, social, and economic indicators: a comprehensive review of data from March, 2020, to September, 2021. Lancet (2022).
Baena-Diéz, J. M., Barroso, M., Cordeiro-Coelho, S. I., Diáz, J. L. & Grau, M. Impact of COVID-19 outbreak by income: Hitting hardest the most deprived. J. Public Heal. (UK) 42, 698–703 (2020).
Kaye, A. D. et al. Economic impact of COVID-19 pandemic on healthcare facilities and systems: International perspectives. Best Pract. Res. Clin. Anaesthesiol. 35, 293–306 (2021).
Raza, S. & Schwartz, B. Detecting Biomedical Named Entities in COVID-19 Texts. in Workshop on Healthcare AI and COVID-19, ICML 2022 (2022).
Raza, S., Schwartz, B. & Rosella, L. C. CoQUAD: a COVID-19 question answering dataset system, facilitating research, benchmarking, and practice. BMC Bioinf. 23, 210 (2022).
Williamson, E. J. et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature 584, 430–436 (2020).
Raza, S., Reji, D. J., Shajan, F. & Bashir, S. R. Large-scale application of named entity recognition to biomedicine and epidemiology. PLOS Digit. Heal. 1, e0000152 (2022).
Oldroyd, J. Social determinants of health. Public Health: Local and Global Perspectives: 2nd edn 105–123. https://doi.org/10.4159/9780674989207-006 (2019).
Pearce, K., Zhan, T., Komanduri, A. & Zhan, J. A Comparative study of transformer-based language models on extractive question answering (2021).
Sun, Q., Liu, Y., Chua, T. S. & Schiele, B. Meta-transfer learning for few-shot learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition vols 2019-June https://github.com/y2l/meta-transfer-learning-tensorflow (2019).
Wang, Y., Yao, Q., Kwok, J. T. & Ni, L. M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 53, 1 (2020).
Lee, J. et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
Luo, L. et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34, 1381–1388 (2018).
Campillos-Llanos, L., Valverde-Mateos, A., Capllonch-Carrión, A. & Moreno-Sandoval, A. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med. Inform. Decis. Mak. 21, 1–19 (2021).
Luo, X., Gandhi, P., Storey, S. & Huang, K. A deep language model for symptom extraction from clinical text and its application to extract covid-19 symptoms from social media. IEEE J. Biomed. Heal. Informatics 26, 1737–1748 (2021).
Harnoune, A. et al. BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Comput. Methods Programs Biomed. Updat. 1, 100042 (2021).
Perera, N., Dehmer, M. & Emmert-Streib, F. Named entity recognition and relation detection for biomedical information extraction. Front. Cell Dev. Biol. 8, 673 (2020).
Mahendran, D., Ranjan, S., Tang, J., Nguyen, M. H. & Mcinnes, B. T. BioCreative VII-Track 1 : A BERT-based System for Relation Extraction in Biomedical Text.
National Center for Biotechnology Information. Definitions https://www.ncbi.nlm.nih.gov (2020). https://doi.org/10.32388/uq8dyz.
Norikawa, N. et al. Pemphigoid nodularis induced by long-term use of dipeptidyl peptidase-4 inhibitors. Hear. Views 18(3), 104–105. https://doi.org/10.4103/ijd.ijd_632_22 (2017).
Spark OCR- John Snow Labs. https://nlp.johnsnowlabs.com/docs/en/ocr (2022).
Elasticsearch. https://www.elastic.co (2014).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Prepr. arXiv1810.04805 (2018).
Chaybouti, S., Saghe, A. & Shabou, A. EfficientQA : A RoBERTa based phrase-indexed question-answering system. 1–9 (2021).
Chiu, J. P. C. & Nichols, E. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016).
Lafferty, J., McCallum, A. & Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001, 282–289 (1999).
Sexton, T. IOB Format Intro - Nestor. https://pages.nist.gov/nestor/examples/named-entities/01-BIO-format (2022).
Zhou, D., Zhong, D. & He, Y. Biomedical relation extraction: from binary to complex. Comput. Math. Methods Med. 2014, 1 (2014).
Levy, O., Seo, M., Choi, E. & Zettlemoyer, L. Zero-shot relation extraction via reading comprehension. arXiv Prepr. arXiv1706.04115 (2017).
Tang, R. et al. Rapidly Bootstrapping a Question Answering Dataset for COVID-19. (2020).
Chen, C.-Y. & Li, C.-T. ZS-BERT: Towards Zero-Shot Relation Extraction with Attribute Representation Learning. in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6–11, 2021 (eds. Toutanova, K. et al.) 3470–3479 (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.naacl-main.272.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K. & Wierstra, D. Matching networks for one shot learning. Advances in Neural Information Processing Systems (2016).
Pushp, P. K. & Srivastava, M. M. Train once, test anywhere: Zero-shot learning for text classification. arXiv Prepr. arXiv1712.05972 (2017).
huggingface. transformers. GitHub. https://github.com/huggingface/transformers (2022).
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K. & Dyer, C. Neural architectures for named entity recognition. arXiv Prepr. arXiv1603.01360 (2016).
Zhao, Z. et al. Disease named entity recognition from biomedical literature using a novel convolutional neural network. BMC Med. Genom. 10, 75–83 (2017).
Yoon, W., So, C. H., Lee, J. & Kang, J. Collabonet: Collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinf. 20, 55–65 (2019).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv Prepr. arXiv1906.05474 (2019).
Luo, R. et al. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinf. 23, 1 (2022).
Girju, R. Automatic detection of causal relations for Question Answering. 76–83 (2003). https://doi.org/10.3115/1119312.1119322.
Hsieh, Y.-L., Chang, Y.-C., Chang, N.-W. & Hsu, W.-L. Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. in Proceedings of the eighth international joint conference on natural language processing (volume 2: short papers) 240–245 (2017).
Quan, C., Luo, Z. & Wang, S. A hybrid deep learning model for protein–protein interactions extraction from biomedical literature. Appl. Sci. 10, 2690 (2020).
Zhao, S., Hu, M., Cai, Z. & Liu, F. Modeling dense cross-modal interactions for joint entity-relation extraction. in Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence 4032–4038 (2021).
Bekoulis, G., Deleu, J., Demeester, T. & Develder, C. Adversarial training for multi-context joint entity and relation extraction. arXiv Prepr. arXiv1808.06876 (2018).
Artstein, R. Inter-annotator agreement. in Handbook of linguistic annotation 297–313 (Springer, 2017).
Statistics, L. Fleiss’ kappa in SPSS Statistics | Laerd Statistics. https://statistics.laerd.com/spss-tutorials/fleiss-kappa-in-spss-statistics.php (2019).
Rothman, K. J. & Greenland, S. Hill’s criteria for causality. Encycl. Biostat. https://doi.org/10.1002/0470011815.b2a03072 (2005).
Papanikolaou, Y. & Pierleoni, A. DARE: Data Augmented Relation Extraction with GPT-2. (2020).

Acknowledgements

This research was co-funded by the Canadian Institutes of Health Research’s Institute of Health Services and Policy Research (CIHR-IHSPR) as part of the Equitable AI and Public Health cohort, and Public Health Ontario.

Author information

Authors and Affiliations

Public Health Ontario (PHO), Toronto, ON, Canada
Shaina Raza & Brian Schwartz
Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
Shaina Raza & Brian Schwartz

Authors

Shaina Raza
Brian Schwartz

Contributions

S.R. and B.S. conceived the study design and participated in the literature search. B.S. prepared the search query for the data collection. S.R. performed the data curation and preparation. S.R. built the framework and the models, and B.S. validated the framework. S.R. created the tables, plotted the graphics, interpreted the study findings, and drafted the initial manuscript. B.S. validated the results, evaluated the findings, and revised the draft. All authors critically reviewed and substantively revised the manuscript. All authors have approved the final version of the manuscript for publication.

Corresponding author

Correspondence to Shaina Raza.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Raza, S., Schwartz, B. Constructing a disease database and using natural language processing to capture and standardize free text clinical information. Sci Rep 13, 8591 (2023). https://doi.org/10.1038/s41598-023-35482-0

Received: 27 November 2022
Accepted: 18 May 2023
Published: 26 May 2023
DOI: https://doi.org/10.1038/s41598-023-35482-0
