Myocardial scar and left ventricular ejection fraction classification for electrocardiography image using multi-task deep learning

Boribalburephan, Atirut; Treewaree, Sukrit; Tantisiriwat, Noppawat; Yindeengam, Ahthit; Achakulvisut, Titipat; Krittayaphong, Rungroj

doi:10.1038/s41598-024-58131-6

Published: 29 March 2024

Scientific Reports volume 14, Article number: 7523 (2024)

Abstract

Myocardial scar (MS) and left ventricular ejection fraction (LVEF) are vital cardiovascular parameters, conventionally determined using cardiac magnetic resonance (CMR). However, given the high cost and limited availability of CMR in resource-constrained settings, electrocardiograms (ECGs) are a cost-effective alternative. We developed computer vision-based multi-task deep learning models to analyze 2D images of 12-lead ECGs, predicting MS and LVEF < 50%. Our dataset comprises 14,052 ECGs with clinical features, with ground truth labels derived from CMR. Our top-performing model achieved AUC values of 0.838 (95% CI 0.812–0.862) for MS and 0.939 (95% CI 0.921–0.954) for LVEF < 50% classification, outperforming cardiologists. Moreover, MS predictions in a prevalence-specific test dataset recorded an AUC of 0.812 (95% CI 0.810–0.814). 1D signals extracted from the ECG images yielded inferior performance compared to the 2D approach. In conclusion, our results demonstrate the potential of computer-based MS and LVEF < 50% classification from ECG scan images in clinical screening, offering a cost-effective alternative to CMR.

Introduction

Coronary artery disease (CAD) is the leading cause of death and disability worldwide, with incidence decreasing in developed countries but increasing in developing countries¹. CAD remains asymptomatic until the coronary stenosis becomes moderate or severe, resulting in chest pain, dyspnea, and syncope². However, approximately 30% of myocardial infarctions (MI) do not manifest with clear symptoms³. A missed MI diagnosis can lead to serious complications, such as left ventricular systolic dysfunction, heart failure (HF), and death.

HF was reported to affect 1–2% of the population^4,5, and that rate is expected to increase over the next decade, posing a significant healthcare burden⁶. Left ventricular ejection fraction (LVEF) is used to classify HF patients for appropriate treatment^7,8. Therefore, detection of MI and LVEF at an earlier stage can help clinicians make better decisions about further investigation and appropriate treatment⁹.

Cardiac magnetic resonance (CMR) imaging can accurately and non-invasively assess the heart's functional and anatomical abnormalities, including characterizing myocardial scars (MS), which are a common sequela of MI^9,10. Importantly, MS was reported to be one of the critical determinants predicting the future development of heart failure¹¹. However, CMR is expensive and requires highly trained personnel to perform imaging and interpretation, limiting its availability in remote or underdeveloped settings.

The electrocardiogram (ECG) is frequently employed as an initial investigation for diagnosing cardiovascular diseases due to its accessibility, whereas CMR is typically reserved for more in-depth investigations¹². ECG scans contain patterns that can suggest cardiovascular diseases like CAD, MS, and left ventricular systolic dysfunction^13,14. Trained cardiologists can identify these patterns and diagnose the corresponding diseases. However, the availability of well-trained cardiologists is limited in remote or developing areas. Moreover, the interpretation of ECG scans is susceptible to both human error and interrater variability. Computer-based ECG interpretation could help mitigate these limitations. Previous studies reported that machine learning systems can detect cardiovascular diseases, such as MI, arrhythmias, and left ventricular systolic dysfunction, using a 12-lead ECG^14,15,16,17.

However, obtaining ECG data in machine-readable format (i.e., ECG tracings) can be challenging in resource-limited settings, including Thailand, since most ECG records are stored in paper or scanned format. Hence, computer image-based ECG classification for CAD detection may be advantageous in these circumstances.

In this paper, rather than using a separate classification model for each task, we introduce multi-task convolutional neural network (CNN) models that read 12-lead ECG scan images to identify both CAD scars and abnormal LVEF of less than 50% for clinical screening purposes. We provide our models as open-source software to improve CAD screening in resource-limited settings.

Results

Study population

A total of 13,707 patients and 14,826 ECGs were retrospectively enrolled in this study. To prevent cross-dataset contamination, 774 ECGs were excluded, resulting in a total of 14,052. The population comprises two ECG formats, specifically the non-grid (old) format and the grid (new) format ECGs, collected using different machines. The following baseline characteristics are reported per ECG rather than per patient. The mean patient age across all ECGs was 72.29 ± 13.81 years, and 50.53% of ECGs were acquired from male patients. The prevalence of MS and LVEF < 50% in the total ECGs was 27.11% and 18.72%, respectively, while 12.35% belonged to the LVEF < 40% group. A total of 10.04% of all ECGs had both subendocardial and transmural scarring. ECGs with only subendocardial scarring or only transmural scarring accounted for 9.45% and 7.61% of the population, respectively. Table 1 shows the baseline data of the overall ECG population and each dataset used in the study. Out of the total ECGs, 5,407 (38.48%) had no clinical feature data except for age and sex, while all other ECGs had complete clinical data. The missing data were imputed as described in the methods section. A summary workflow of the MS and LVEF classification system is shown in Fig. 1a.
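The counts and percentages in this paragraph can be cross-checked with simple arithmetic. The short sketch below uses only figures quoted above; the variable names are ours. The three scar subgroups sum to about 27.10%, which agrees with the reported MS prevalence of 27.11% up to rounding.

```python
# Figures quoted in the Study population paragraph.
enrolled_ecgs = 14_826
excluded_ecgs = 774
ecgs_without_clinical = 5_407

# ECGs remaining after exclusion.
analysed_ecgs = enrolled_ecgs - excluded_ecgs

# Share of ECGs with no clinical features other than age and sex.
share_without_clinical = 100 * ecgs_without_clinical / analysed_ecgs

# Both scar types + subendocardial only + transmural only (percent of ECGs).
scar_total = 10.04 + 9.45 + 7.61

print(analysed_ecgs)                     # 14052
print(round(share_without_clinical, 2))  # 38.48
```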

Table 1 Baseline data of the overall study population, and of those included in each dataset.

Model performance

We trained eight deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), (5) Single-task for LVEF classification (both formats), (6) Multi-task with clinical features, (7) Single-task (MS) with clinical features, and (8) Single-task (LVEF) with clinical features.

Overall, our multi-task both formats model outperformed the transferred multi-task model and the multi-task old-format only model in both MS classification and LVEF classification, except for MS classification on new-format test dataset. The AUCs for the multi-task both formats model for MS classification were 0.838 (95% CI 0.812–0.862) and 0.811 (95% CI 0.788–0.832) for the old-format and new-format test datasets, respectively. For detecting LVEF < 50%, the AUCs of the multi-task both formats model were 0.939 (95% CI 0.921–0.954) and 0.931 (95% CI 0.915–0.944) for the old-format and new-format test datasets, respectively (Fig. 2). Model performance results compared to baseline prediction are shown in the Supplementary Data 1.
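This section does not state how the 95% confidence intervals were obtained. As an illustration only, the sketch below assumes a percentile bootstrap over test cases, which is one common way to attach an interval to an AUC. The function names, the synthetic labels and scores, and the number of resamples are all hypothetical and are not taken from the study.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap interval (assumed method, not the authors')."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lo, hi

# Synthetic example: roughly 27% positives, scores shifted upward for positives.
rng = random.Random(1)
labels = [1 if rng.random() < 0.27 else 0 for _ in range(200)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
point, lo, hi = bootstrap_auc_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a sketch; a rank-based implementation would be preferable for a test set of several thousand ECGs.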

Regarding the ECG interpretation performance of cardiologists on new-format test data, the AUCs for MS classification by an experienced cardiologist and an in-training cardiologist were 0.683 (95% CI 0.659–0.707) and 0.657 (95% CI 0.632–0.681), respectively. Both cardiologists had similar sensitivity (44.10% vs. 44.50%, respectively); however, the experienced cardiologist had a higher specificity (92.60%) than the in-training cardiologist (86.40%) (Supplementary Fig. 1).
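A reader who gives a yes/no answer contributes a single operating point to the ROC plane, and the area under the two-segment curve through that point equals the mean of sensitivity and specificity. The sketch below applies this identity to the figures above. How the cardiologist AUCs were actually derived is not stated in this section, so this is an assumption; the function name is ours. The result is 0.6835 for the experienced cardiologist, matching the reported 0.683, and 0.6545 for the in-training cardiologist, close to but not identical with the reported 0.657.

```python
def single_point_auc(sensitivity, specificity):
    """Area under the ROC curve (0,0) -> (1 - specificity, sensitivity) -> (1,1)."""
    fpr = 1.0 - specificity
    # Triangle under the first segment plus trapezoid under the second segment.
    return 0.5 * fpr * sensitivity + 0.5 * (1.0 - fpr) * (sensitivity + 1.0)

experienced = single_point_auc(0.441, 0.926)
in_training = single_point_auc(0.445, 0.864)

print(round(experienced, 4))  # 0.6835
print(round(in_training, 4))  # 0.6545
```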

All of our developed models that were designed to evaluate ECG images as input features outperformed the XGBoost model, which was designed to evaluate only standard clinical features, by up to 50.40% (Fig. 2). The multi-task with clinical features model was able to classify MS in the old-format test dataset with specificities of 66.92%, 44.61%, and 30.50% at 80.00%, 90.00%, and 95.00% sensitivity, respectively. For LVEF < 50% classification when using the old-format test dataset, the model achieved specificities of 90.03%, 84.76%, and 66.90% at 80.00%, 90.00%, and 95.00% sensitivity, respectively (Fig. 3).
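Reporting specificity at a fixed sensitivity amounts to choosing the highest score threshold that still flags the required fraction of positive cases, then counting the negatives that fall below it. The following is a minimal sketch of that procedure under our own assumptions (a "score >= threshold" decision rule and a toy dataset); it is not taken from the study's code.

```python
import math

def specificity_at_sensitivity(labels, scores, target_sensitivity):
    """Return (specificity, threshold) for the highest threshold reaching the target sensitivity."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Number of positives that must be flagged; the small epsilon guards against float error.
    k = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[k - 1]  # predict positive when score >= threshold
    true_negatives = sum(s < threshold for s in neg)
    return true_negatives / len(neg), threshold

# Toy data: five positives and five negatives.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.3, 0.5, 0.65, 0.05]

spec_80, thr_80 = specificity_at_sensitivity(labels, scores, 0.80)
spec_100, thr_100 = specificity_at_sensitivity(labels, scores, 1.00)
print(spec_80, thr_80)    # 0.8 0.6
print(spec_100, thr_100)  # 0.4 0.2
```

The toy example shows the trade-off visible in the reported figures: raising the required sensitivity lowers the threshold and therefore the specificity.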

Incorporating clinical features

Incorporating clinical features into our multi-task models resulted in improved model performance in some cases (Fig. 2). For the single-task model, adding clinical features provided a performance boost only in MS classification in old-format test datasets. For the multi-task model, the performance boost was observed only in MS classification in new-format test datasets. The multi-task with clinical features model greatly outperformed the XGBoost model (which used only clinical features), with an AUC of 0.841 (95% CI 0.819–0.860) compared to 0.682 (95% CI 0.655–0.707) in MS classification using the new-format test dataset (Fig. 2).

Prevalence-specific analysis

When using the prevalence-specific test dataset, our multi-task both formats model achieved an AUC for MS classification of 0.812 (95% CI 0.810–0.814) and an F1-score of 0.931 (95% CI 0.931–0.932).
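How the prevalence-specific test dataset was constructed is described outside this section. One plausible construction, shown here purely as an assumption, keeps all negative cases and randomly subsamples positive cases until positives make up a chosen share of the set. The target prevalence of 10% and the toy label counts below are hypothetical.

```python
import random

def resample_to_prevalence(labels, target_prevalence, rng):
    """Indices of all negatives plus a random subset of positives at the target prevalence."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_pos / (n_pos + n_neg) = target_prevalence for n_pos.
    n_pos = round(target_prevalence * len(neg_idx) / (1 - target_prevalence))
    if n_pos > len(pos_idx):
        raise ValueError("not enough positive cases for the requested prevalence")
    return sorted(rng.sample(pos_idx, n_pos) + neg_idx)

# Toy test set with 27% positives, resampled to roughly 10% positives.
labels = [1] * 270 + [0] * 730
idx = resample_to_prevalence(labels, 0.10, random.Random(0))
prevalence = sum(labels[i] for i in idx) / len(idx)
print(len(idx), round(prevalence, 3))  # 811 0.1
```

Repeating the draw with different seeds and averaging the metric over the resulting sets would give a distribution of values; whether the study did this is not stated here.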

Performance in detecting LVEF < 40%

Our sensitivity analysis showed similar performance among models when comparing LVEF < 50% detection and LVEF < 40% detection. The multi-task model with clinical features achieved the highest AUC of 0.942 (95% CI 0.925–0.956) when using the old-format test dataset, while the multi-task model achieved the highest AUC of 0.939 (95% CI 0.924–0.951) when using the new-format test dataset (Supplementary Fig. 2).

Localization of models’ decision

We applied Grad-CAM++ to visualize the areas of ECG images that influenced the model decision. Figure 4 shows examples of heatmaps generated on top of the ECGs for the multi-task model and the multi-task with clinical features model. The heatmaps highlighted the areas containing ECG tracings, with greater emphasis on the areas associated with the models' decisions. In cases with MS and LVEF < 50%, we observed that the model focused on abnormal Q waves, QRS complexes, and T wave inversions. For cases with no MS and LVEF ≥ 50%, the multi-task model generally highlighted QRS complexes in leads I, II, V2, and V6, while the multi-task with clinical features model focused on lead I. Interestingly, the multi-task with clinical features model appears to concentrate on fewer leads than the multi-task model.

1D signal extraction and 1D model performance

We extracted one-dimensional (1D) signals from the ECG images and trained five 1D deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), and (5) Single-task for LVEF classification (both formats).

The results consistently demonstrate the superior performance of our 2D CNN models over their 1D counterparts in both MS and LVEF range classification tasks (Figs. 2, 5, and Supplementary Data 2). Among the 1D models, our multi-task model maintained the highest overall performance for both tasks.

Discussion

In this study, we demonstrated that a multi-task deep learning model could detect MS and classify the LVEF range using a 12-lead ECG scan image. Our top-performing models, the multi-task both formats model and multi-task model with clinical features, demonstrated high performance in detecting MS and LVEF < 50% in both old and new ECG formats. They consistently exhibited comparable or superior performance when compared to their single-task counterparts in the majority of scenarios. Additionally, our model also achieved a high AUC and F1-score in MS prediction from the prevalence-matched population. The F1 score achieved in this prevalence-matched population was notably higher than that in our test sets, where the prevalence exceeded 26%. This finding might suggest that the model performs better in populations with lower prevalence, which more closely resemble real-world populations.

Given the variation in ECG format among different machines, it becomes crucial for the ECG image classification model to acquire visual features that can be universally applied across these formats. Our findings reveal that amalgamating ECG scans from various formats into a unified training dataset resulted in improved performance compared to segregating datasets based on ECG format. Additionally, the multi-task model designed for the old ECG formats also exhibited commendable performance when tested with a new-format dataset, underscoring its efficacy in predicting ECGs with diverse formats. However, a preprocessing protocol is needed to automatically prepare the ECG image prior to interpretation by the model.

The multi-task model also has a computational advantage since it shares the same backbone for predicting MS and LVEF. Several studies demonstrated performance gains from using a multi-task model in medical image analysis¹⁸. In addition to the computational edge, the multi-task model may also have a clinical advantage. When determining whether a patient has MS, the ability to simultaneously predict LVEF range might indicate the impact of the scar on cardiac function. Likewise, when predicting LVEF < 50%, the model could suggest the etiology of impaired cardiac pumping by detecting the ischemic scar. Taken together, the advantages of our developed multi-task deep learning model suggest the strong potential of its use as a screening tool for MS and LVEF < 50% from ECG scan images in limited-resource settings.
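The computational advantage of a shared backbone can be illustrated with a toy forward pass: the shared features are computed once and then reused by two small task heads, one for MS and one for LVEF < 50%. The sketch below is a deliberately simplified stand-in, using a single dense layer in place of the CNN backbone, random weights, and equal task weights in the loss. All of these choices, and all names, are our assumptions and do not reflect the architecture used in the study.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights):
    """Linear layer without bias; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def multitask_forward(x, params):
    # Shared backbone: computed once and reused by both heads.
    shared = [max(0.0, h) for h in dense(x, params["backbone"])]  # ReLU
    p_ms = sigmoid(dense(shared, [params["ms_head"]])[0])
    p_lvef = sigmoid(dense(shared, [params["lvef_head"]])[0])
    return p_ms, p_lvef

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one case."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def multitask_loss(x, y_ms, y_lvef, params):
    p_ms, p_lvef = multitask_forward(x, params)
    # Equal weighting of the two tasks is an assumption of this sketch.
    return bce(p_ms, y_ms) + bce(p_lvef, y_lvef)

rng = random.Random(0)
n_inputs, n_hidden = 8, 4
params = {
    "backbone": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)],
    "ms_head": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
    "lvef_head": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
}
x = [rng.random() for _ in range(n_inputs)]  # stand-in for image features
p_ms, p_lvef = multitask_forward(x, params)
loss = multitask_loss(x, y_ms=1, y_lvef=0, params=params)
print(round(p_ms, 3), round(p_lvef, 3), round(loss, 3))
```

Two single-task models would each need their own backbone pass per ECG, whereas here the backbone cost is paid once and only the lightweight heads are duplicated.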

Previous studies have demonstrated the potential of deep learning for MS detection, mainly focusing on raw ECG traces^19,20. A similar study achieved a comparable AUC to our model, with CMR as the gold standard²⁰. Another study, utilizing data from CMR-confirmed ECGs and a publicly available ECG dataset without CMR measurements, reported a model with superior performance to our model. However, their model combines vectorcardiography with ECG for prediction¹⁹. We hypothesize that vectorcardiography may provide more detailed heart conduction information than ECG alone.

The PhysioNet/Computing in Cardiology Challenge 2020 presented a large multi-institutional database of ECG signals with remarkable results, utilizing deep learning and CNN²¹. As we have no access to the raw ECG traces, we extracted the signals from the ECG images and trained 1D CNN models. The results show that the models trained on ECG images perform better than the ones trained on the extracted signals; these results align with a previous study²².

In addition to performance differences, an intriguing distinction emerged between the 1D and 2D multi-task models during our transfer learning experiment. Notably, when assessing the 1D multi-task (old-format only) model on the new-format test set, we observed more pronounced performance degradation compared to its 2D counterpart. Moreover, the 1D multi-task (transferred) model exhibited poorer performance on the old-format test set. Overall, the findings suggest that 2D CNN models may exhibit better generalization to unseen formats (i.e., trained on old-format data and tested on new-format data) and retain previous knowledge more effectively (i.e., avoiding forgetting old-format data).

Nevertheless, the 1D model remains viable due to the potential to utilize a pretrained model from extensive 1D ECG databases in a range of cardiovascular diseases, improving generalizability in real-world applications²³. However, the precision mismatch between raw traces in public 1D databases and image-extracted signals presents a potential hurdle for knowledge transfer to specific use cases. This aspect requires further investigation in future studies.

Numerous methodologies have been developed for the automated identification of MI from ECGs, ranging from simple thresholding techniques to sophisticated deep learning models^24,25. Another study used ECG images annotated by cardiologists as input for MI detection using deep learning²⁶. In the present study, we utilized DE-CMR as the true label, resulting in higher sensitivity for detecting MS cases, capturing more subtle signs and less severe MS cases. Notably, the reported model performance may vary among studies due to differences in the gold standard used and the prevalence of MS in the test dataset.

Observational studies showed that characteristics of ECGs are associated with LVEF^27,28. Many recent studies also applied deep learning models to LVEF range classification using 2D ECG images^17,29,30,31. Two of those previous studies assembled a larger ECG image dataset (compared to the size of our training dataset) for model training, and they measured LVEF using transthoracic echocardiography^17,30. Despite using less training data, we still achieved comparable performance of LVEF prediction, which may suggest the effectiveness of our multi-task backbone for evaluating ECG images.

We found that adding clinical features as an input to our multi-task model could enhance its performance in MS classification in some test datasets. However, adding clinical features slightly reduced the LVEF < 50% classification performance. Additional study is needed to better understand the relationship between clinical features and LVEF. Interestingly, our multi-task both formats model outperformed cardiologist performance of ECG interpretation without the clinical features. This suggests that our model may have learned subtle patterns from ECGs that are unnoticeable by cardiologists.

We introduce our model as open-source software to provide a tool that could improve CAD screening in resource-limited settings. As our models are image classification models, they could be deployed as a web application that accepts smartphone photos of ECGs. Such a deployment would not require investment in acquiring and storing raw ECG tracings and could be easily integrated into the hospital workflow.

Strengths and limitations

To our knowledge, this is the first study to demonstrate the potential of multi-task deep learning models for classifying ECG images with corresponding same-date CMR as the gold standard for identifying MS and the LVEF range. Same-date CMR provides a concurrent and precise comparison between the ECG image and the CMR assessment of MS and LVEF. Moreover, evaluation of the old-format model on new-format data gives us some positive indications of its performance on an unseen ECG format. However, further studies are needed to assess the performance of our multi-task model when using other ECG formats. Our study also has some limitations. Our prediction model requires semi-manual preprocessing of the ECG images before they are fed into the model, namely identifying the ECG lead positions in the image, specifying the ECG lead arrangement, and adjusting the number of QRS pulses. Once the ECG lead position and arrangement data are entered, unseen ECG formats can be easily integrated into the pipeline. Furthermore, our study only enrolled cases with corresponding same-date CMR imaging to ensure that MS cases were correctly classified; therefore, the number of enrolled cases is limited.