Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques

Houssein, Essam H.; Mohamed, Rehab E.; Ali, Abdelmgeid A.

doi:10.1038/s41598-023-34294-6

Article
Open access
Published: 03 May 2023

Scientific Reports volume 13, Article number: 7173 (2023)

Abstract

Heart disease remains the leading cause of death, despite recent improvements in prediction and prevention. Risk factor identification is the main step in diagnosing and preventing heart disease. Automatically detecting risk factors for heart disease in clinical notes can help with disease progression modeling and clinical decision-making. Many studies have attempted to detect risk factors for heart disease, but none have identified all risk factors. These studies have proposed hybrid systems that combine knowledge-driven and data-driven techniques, based on dictionaries, rules, and machine learning methods that require significant human effort. The Informatics for Integrating Biology and the Bedside (i2b2) center proposed a clinical natural language processing (NLP) challenge in 2014, with a track (track 2) focused on detecting risk factors for heart disease in clinical notes over time. Clinical narratives provide a wealth of information that can be extracted using NLP and deep learning techniques. The objective of this paper is to improve on previous work from the 2014 i2b2 challenge by identifying tags and attributes relevant to disease diagnosis, risk factors, and medications using advanced stacked word embedding techniques. Results on the i2b2 heart disease risk factors challenge dataset improved significantly with the stacked-embeddings approach, which combines various embeddings. Our model achieved an F1 score of 93.66% by stacking BERT and character embeddings (CHARACTER-BERT Embedding). The proposed model achieves significantly better results than all other models and systems that we developed for the 2014 i2b2 challenge.

Introduction

Heart disease is the leading cause of death in the United States, the UK, and worldwide. It causes more than 73,000 and 600,000 deaths per year in the UK and the US, respectively^1,2. Heart disease caused the death of about 1 in 6 men and 1 in 10 women. Heart disease has a number of common forms, such as Coronary Artery Disease (CAD). According to the World Health Organization, the risk factors of a specific disease are any attributes that raise the probability that a person may get that disease³. There are several risk factors for CAD and heart disease, such as Diabetes, CAD, Hyperlipidemia, Hypertension, Smoking, Family history of CAD, Obesity, and Medications associated with the mentioned chronic diseases^4,5,6. Each heart disease risk factor should be specified with indicator and time attributes, except for family history of CAD and smoking status. Each indicator attribute reflects the implications of the risk factor in the clinical text. Detecting risk factors mentioned in narrative clinical notes is essential for heart disease prediction and prevention, and it remains an important challenge.

Manually detecting heart disease risk factors from several forms of clinical notes is excessively expensive, time-consuming, and error-prone. Therefore, efficient identification of heart disease risk factors requires a model that is fine-tuned to the text structure, the clinical note contents, and the project requirements^{7, 8}.

Electronic health records (EHRs) have proven to be a promising path for advancing clinical research in recent years^9,10,11. Although EHRs hold structured data such as diagnosis codes, prescriptions, and laboratory test results, a large portion of the clinical record is still in narrative text format, primarily in notes from primary care patients. This narrative form is considered a major challenge for clinical research applications¹².

NLP techniques have been applied to convert narrative clinical notes into a structured format that can be used effectively in clinical research^13,14,15. Furthermore, several studies have demonstrated the significant impact of NLP, machine learning, and deep learning techniques on disease identification from clinical notes; these are discussed as related work in this paper. Thus, our goal is to develop a model that can detect and predict the progression of heart disease and CAD from clinical notes. The prediction of heart disease risk factors using clinical and statistical approaches has attracted a lot of attention over the past ten years^{16,17,18,19,20} because this process is very complex. Several techniques have been applied to clinical concept extraction, such as simple pattern matching, statistical systems, and machine learning. Although these techniques have achieved good results, it is difficult to apply such statistical models to EHR data because processing large amounts of data is time-consuming and because the models rely on several statistical and structural assumptions and on custom features/markers^{21, 22}.

Deep learning (DL), a branch of machine learning that has developed significantly in recent years, is used to create substantially improved NLP models²³. DL approaches have lately made substantial progress in a variety of domains by effectively capturing long-range dependencies in data and building deep hierarchical feature representations²⁴. Because DL methods provide improved results and require less time-consuming preprocessing and feature extraction than conventional methods, and because the number of available patient records keeps growing, an increasing number of research studies apply DL techniques to EHR data for clinical tasks^{25, 26}.

Annotated clinical text datasets are rare and small, which makes it difficult to apply modern supervised DL techniques. To overcome this issue, clinical information extraction techniques based on transfer learning using pre-trained language models have recently become increasingly popular^{27,28,29,30,31,32,33}.

Several studies have pre-trained these models on English biomedical and clinical notes^{28, 29, 34, 35} and fine-tuned them on several clinical downstream tasks^{27, 30}. These models widely adopt the bidirectional encoder representations from transformers (BERT) architecture.

This motivated us to evaluate pretraining and fine-tuning BERT on the i2b2 heart disease risk factors challenge dataset, in order to highlight the efficiency of deep-learning-based NLP techniques for clinical information extraction tasks in the heart disease domain.

This paper proposes an advanced technique that uses stacked embeddings to improve on previous research on the i2b2 2014 challenge. Results on the i2b2 heart disease risk factors challenge dataset improved significantly with stacked embeddings, which are conceptually a means of integrating several embeddings. We achieved an F1-score of 93.66% on the test set by stacking BERT and character embeddings (CHARACTER-BERT Embedding). The main objective is to identify the risk factor indicators included in each document, as well as the temporal features related to the document creation time (DCT), using the dataset from the i2b2/UTHealth shared task¹⁰.
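As a rough illustration of what integrating several embeddings can mean, the sketch below assumes that stacking concatenates the vectors each embedder produces for a token, which is a common way to implement it. The two embedders and their values are hypothetical toy stand-ins, not BERT or CharacterBERT, and this is not the authors' implementation.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def toy_word_embedding(token: str) -> Vector:
    """Hypothetical stand-in for a word-level embedding such as BERT."""
    table: Dict[str, Vector] = {
        "hypertension": [0.9, 0.1, 0.3],
        "metformin": [0.2, 0.8, 0.5],
    }
    # Unknown tokens map to a zero vector of the same size.
    return table.get(token.lower(), [0.0, 0.0, 0.0])


def toy_char_embedding(token: str) -> Vector:
    """Hypothetical stand-in for a character-level embedding."""
    lowered = token.lower()
    vowels = sum(ch in "aeiou" for ch in lowered)
    return [len(lowered) / 10.0, vowels / 10.0]


def stack_embeddings(token: str, embedders: List[Embedder]) -> Vector:
    """Concatenate the vectors produced by each embedder for one token."""
    stacked: Vector = []
    for embed in embedders:
        stacked.extend(embed(token))
    return stacked


# The stacked vector has 3 + 2 = 5 dimensions.
stacked_vector = stack_embeddings(
    "hypertension", [toy_word_embedding, toy_char_embedding]
)
```

Under this assumption, a downstream tagger sees one longer vector per token, so word-level context and character-level spelling cues are available to it together.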

Among all the models we built while developing the proposed approach, this one produced the best results. This is a promising result for our model’s potential to advance research beyond the current benchmark for DL models developed for this shared task⁷, which reported an F1 score of 90.81% using BLSTM, and beyond the most successful system³⁶ of the i2b2/UTHealth 2014 challenge, which reported an F1 score of 92.76%. Additionally, our method focuses on how contextual embeddings help to further improve the effectiveness of NLP and DL. This research is a step toward a system that can outperform human annotators and surpass the current state-of-the-art results with minimal feature engineering.
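The F1 scores compared above are the harmonic mean of precision and recall. The following minimal sketch shows the computation from true-positive, false-positive, and false-negative counts. The counts are hypothetical and are not taken from the paper, and the averaging scheme used by the shared task (for example micro versus macro) is not stated in this section.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a common single-number summary for extraction tasks.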

In summary, the main objectives of this study are as follows:

Develop a model that detects heart disease risk factors using stacked embeddings, combining BERT and CHARACTER-BERT embeddings, and use a DL approach (RNN) to extract risk factor indicators from the shared task dataset.
Improve on work that has already been done in this space as part of the i2b2 2014 challenge.
Achieve results superior to state-of-the-art models from the 2014 i2b2/UTHealth shared task.
Report various metrics to assess the performance of the proposed model.

The remainder of the paper is organized as follows. The “Related work” section provides a detailed overview of recent related work. The dataset, the task, and clinical word embeddings are introduced in the “Material and methods” section. “The proposed heart disease risk factors detection model” section presents the steps of the proposed model, explaining the preprocessing steps and describing the pre-trained word embeddings and the stacked word embeddings. The “Discussion” section presents the evaluation and the results of the proposed model. Finally, the “Conclusion and future work” section concludes the paper and outlines future work.

Related work

Clinical information extraction using deep learning

Medical research depends heavily on text-based patient medical records. Recent studies have concentrated on applying DL to extract relevant clinical information from EHRs. One of the most significant NLP tasks is the extraction of clinical information from unstructured clinical records to support decision-making or to provide a structured representation of clinical notes. This concept extraction challenge can be framed as a sequence labeling problem: assigning a clinically relevant tag to each word in an EHR³⁷. Different deep learning architectures based on recurrent networks, such as GRUs, LSTMs, and BLSTMs, were examined in^{37, 38}. All the RNN variants outperformed the conditional random field (CRF) baselines, which were previously thought to be the most advanced technique for information extraction in general. Clinical event sequencing can be used to analyze disease progress and predict oncoming disease states as patient EHRs change over time³⁹. Because of this temporality, it is necessary to give each extracted medical concept a sense of time. The authors of⁴⁰ proposed a solution for much more complex issues by using a typical RNN initialized with word2vec⁴¹ vectors and DeepDive⁴² for developing associations and predictions. While⁴³ and⁴⁴ also used word embedding vectors, they extracted the temporal attributes using CNNs. Although these methods are not recent, they produced the best results in extracting temporal events. Additionally, each subtask requires a different model and some manual engineering, such as when extracting concepts and temporal attributes^45,46,47. An important gap remains: no current system has attempted to use a single, universal model that extracts medical factors and their temporal attributes simultaneously by automatically identifying the temporal attributes of those factors from their contexts and incorporating them into the feature learning process.
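To make the sequence labeling framing concrete, the sketch below assumes a BIO tagging scheme (B- begins a concept, I- continues it, O is outside any concept) and shows how per-word tags can be turned back into labeled spans. The label names and the example sentence are hypothetical, and the BIO scheme is an illustrative assumption rather than something stated in the text above.

```python
from typing import List, Optional, Tuple


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the span being built, if any.
        if current_tokens and current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tags, and I- tags that do not continue the open span, end it.
            flush()
            current_tokens = []
            current_label = None
    flush()
    return spans


# Hypothetical sentence and labels for illustration only.
tokens = ["Patient", "has", "type", "2", "diabetes", "and", "takes", "metformin", "."]
tags = ["O", "O", "B-DIABETES", "I-DIABETES", "I-DIABETES", "O", "O", "B-MEDICATION", "O"]
spans = extract_spans(tokens, tags)
```

In this framing, a recurrent tagger such as a BLSTM predicts one tag per token, and a post-processing step like the one above recovers the clinical concepts.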

The i2b2/UTHealth shared task

i2b2 has released several shared NLP tasks that focused on identifying risk factors for heart disease in clinical notes, as listed in Table 1. For example, the 2009 i2b2 shared task focused on detecting all medications mentioned in a dataset of 251 clinical notes, along with all relevant information such as reasons, frequencies, dosages, durations, modes, and whether the information was written in a narrative note or not⁴⁸. The 2006 i2b2 shared task focused on classifying the smoking status of the patient into five classes: Past Smoker, Current Smoker, Smoker, Non-Smoker, and Unknown⁴⁹. Similarly, the 2008 i2b2 shared task focused on classifying the obesity and comorbidity status of the patient into four categories⁵⁰.

The 2010 i2b2/VA shared task⁵¹ comprised three tracks:

1. Clinical concept extraction task, in which systems needed to extract clinical diseases, medications, and lab tests;
2. Assertion classification task, in which the concepts identified in the previous track are classified as a diagnosis or condition being present, absent, possible, etc.;
3. Concept relation classification task, in which relationships between concepts are classified into types. For example, clinical diseases may relate to tests in different ways, such as “test reveals clinical condition”, “test performed to explore clinical condition”, or an other/unknown relationship even when both concepts appear in the same sentence. For the 2010 shared task, 871 medical records were annotated.

The 2012 temporal relations shared task⁵² focused on temporal relationships in clinical notes. This shared task comprised two tracks: 1) identification of clinical events and their occurrence times, and 2) identification of time and the temporal order of events. For the 2012 shared task, 310 clinical records were annotated. The 2013 ShARe/CLEF eHealth Evaluation Lab⁵³ had three shared tasks: information retrieval for medical queries, identification and normalization of diseases, and identification and normalization of abbreviations. The ShARe corpus of clinical records was used for the first two tasks, and it was augmented with additional clinical data for the third task.

Table 1 Some of the previous i2b2 challenge tasks involving identifying risk factors for heart disease in clinical notes.
