Deep learning of ECG waveforms for diagnosis of heart failure with a reduced left ventricular ejection fraction

We evaluated the performance and clinical implications of a deep learning-based algorithm that uses the electrocardiogram (ECG) to detect heart failure (HF) with reduced ejection fraction (DeepECG-HFrEF) in patients with acute HF. The DeepECG-HFrEF algorithm was trained to identify left ventricular systolic dysfunction (LVSD), defined by an ejection fraction (EF) < 40%. Symptomatic HF patients admitted to Seoul National University Hospital between 2011 and 2014 were included. The performance of DeepECG-HFrEF was assessed using the area under the receiver operating characteristic curve (AUC). The 5-year mortality according to the DeepECG-HFrEF result was analyzed using the Kaplan–Meier method. A total of 690 patients contributing 18,449 ECGs were included, with a final 1291 ECGs eligible for the study (mean age 67.8 ± 14.4 years; men, 56%). A result of HFrEF (+) indicated a predicted EF < 40%, and HFrEF (−) indicated a predicted EF ≥ 40%. The AUC was 0.844 for identifying HFrEF among patients with acute symptomatic HF. Patients classified as HFrEF (+) showed lower survival rates than those classified as HFrEF (−) (log-rank p < 0.001). The DeepECG-HFrEF algorithm can discriminate HFrEF in a real-world HF cohort with acceptable performance. HFrEF (+) was associated with higher mortality rates. The DeepECG-HFrEF algorithm may help identify LVSD and patients at risk of worse survival in resource-limited settings.

Study flow chart. Among the patients hospitalized with acute heart failure, subjects with no matching echocardiographic result within 1 month, and electrocardiograms other than the closest match to the echocardiographic result, were excluded. ECG, electrocardiogram.
For an EF < 40% cut-off, the sensitivity was 0.779, the specificity 0.763, the positive predictive value (PPV) 0.708, the negative predictive value (NPV) 0.824, and the accuracy 0.770. The AUC, sensitivity, PPV, and accuracy increased, while the NPV decreased, with an increase in EF.
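The five metrics reported above all derive from a 2 × 2 table of algorithm result against echocardiographic EF. The following is a minimal sketch of that arithmetic, not the authors' analysis code; the class name, function name, and counts are hypothetical and were chosen only to make the calculation easy to follow.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # algorithm (+), echocardiographic EF < 40%
    fp: int  # algorithm (+), echocardiographic EF >= 40%
    fn: int  # algorithm (-), echocardiographic EF < 40%
    tn: int  # algorithm (-), echocardiographic EF >= 40%


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Compute the standard diagnostic performance metrics from a 2x2 table."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


# Hypothetical counts (NOT study data), used only to illustrate the arithmetic.
example = ConfusionCounts(tp=80, fp=30, fn=20, tn=70)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the true EF status, whereas PPV and NPV condition on the algorithm result, which is why the latter two depend on how common an EF < 40% is in the cohort.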
Performance of the DeepECG-HFrEF algorithm according to actual EF. The proportion of patients classified as DeepECG-HFrEF (+) increased as the actual EF decreased (Fig. 2A). The DeepECG-HFrEF algorithm was more likely to yield false-positive and false-negative results when the actual EF was near 40% (Fig. 2B). The scatter plot also shows a higher proportion of correct classifications (true-positives) when the actual EF was lower (Fig. 3).

Performance of the DeepECG-HFrEF algorithm in different subpopulations. Figure 4 is a forest plot of the AUC and associated 95% confidence interval (CI) for the DeepECG-HFrEF algorithm according to various clinical patient parameters. The performance of the DeepECG-HFrEF algorithm was slightly better in the following subgroups: age ≤ 70 years, no hypertension, non-ischemic HF, sinus rhythm, PR interval ≤ 200 ms, QRS duration ≤ 140 ms, corrected QT interval ≤ 450 ms for men and ≤ 470 ms for women, and normal axis or left axis deviation (LAD).
The 5-year all-cause mortality. Overall, 5-year survival was worse in the DeepECG-HFrEF (+) group than in the (−) group (p < 0.001; Fig. 5A). The Kaplan–Meier curve also showed a lower survival rate among patients with an actual EF < 40% (Fig. 5B). The crude and adjusted hazard ratios (HRs) for 5-year all-cause mortality for the three models are reported in Table 3. All components of model 1 showed significantly increased crude and multivariable-adjusted HRs. In model 2, in which an echocardiographic EF < 40% was added to model 1, DeepECG-HFrEF (+) remained associated with a significantly higher HR even after multivariable adjustment. In model 3, in which B-type natriuretic peptide (BNP) > 500 pg/mL was added to model 1, the association of DeepECG-HFrEF (+) with mortality was offset by BNP.
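For readers less familiar with the survival curves referred to above, the Kaplan–Meier (product-limit) estimate can be sketched in a few lines. This is an illustrative implementation under simple assumptions (patients censored at a given time are still counted as at risk at that time); the function name and the follow-up data are hypothetical and unrelated to the study cohort, and the log-rank comparison between groups is not shown.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time (product-limit estimate).

    times  -- follow-up time per patient, in any consistent unit
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up data (years), NOT taken from the study.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

Computing one such curve per group, for example DeepECG-HFrEF (+) and (−), gives the kind of plot summarized in Fig. 5.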

Discussion
In this study, we validated the DeepECG-HFrEF algorithm for identifying LVSD in patients with symptomatic HF regardless of EF and evaluated the predictive power of the algorithm for 5-year all-cause mortality. The DeepECG-HFrEF algorithm showed acceptable performance in discriminating LVSD among patients with HF (AUC 0.844). DeepECG-HFrEF (+) was associated with a worse 5-year survival, even after adjustment for the actual echocardiographic EF. To our knowledge, this is the first study to validate the performance of a deep learning-based AI algorithm for LVSD detection and to show risk predictability in symptomatic patients with HF. LVSD is identified in 40-50% of patients with HF 16 . Although survival rates of patients with HF have recently improved in developed countries, patients with HF still show an eight-fold higher mortality than an age-matched population 17,18 . Not only does HF increase the risk of mortality, but the associated economic burden cannot be overlooked. The economic burden of HF was estimated to be $108 billion per annum globally in 2012, with 60% attributed to direct costs to the healthcare system and 40% to indirect costs to society through morbidity and other factors 19 . This burden is even higher in Asian countries than in the United States, with a large proportion of the HF-related healthcare costs directly associated with hospitalization 20 . The impact of this burden is accentuated among elderly patients, with almost three-quarters of the total resources assigned to HF being devoted to the older population 21 . The increase in the proportion of elderly individuals in the general population, the so-called social ageing phenomenon, is consistent throughout the world, with the elderly population projected to double to almost 1.6 billion globally between 2025 and 2050 22 . Considering the economic burden of HF in the elderly population, there is a need to improve early diagnosis and treatment of LVSD to slow or even prevent its progression to HF.
A summary of currently developed AI algorithms for the detection of LVSD and the validation of these algorithms is provided in Supplementary Table S5. The definition of LVSD and the primary endpoint differed among studies, with EF cut-offs ranging from 35% to 40%. The study population used for validation also differed between the studies, ranging from patients at a community general hospital to patients in a cardiac intensive care unit and patients with COVID-19 9,12,13 . As a result of these differences in the clinical population used, the proportion of patients with LVSD within the validation population varied between 2 and 20% 7,11 . Our study is the first to validate an algorithm for detecting LVSD solely in patients with HF. Our results showed the strength of the DeepECG-HFrEF algorithm in discriminating LVSD even when the prevalence of HF is high.
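The dependence of predictive values on the composition of the validation population can be made concrete with Bayes' rule. The sketch below takes the sensitivity (0.779) and specificity (0.763) reported in this study at the EF < 40% cut-off and recomputes PPV and NPV at several prevalences. It rests on two loudly stated assumptions: that sensitivity and specificity stay constant across populations, which need not hold in practice, and that the 2% and 20% figures quoted above can be read as LVSD prevalences. The 40% value is a hypothetical high-prevalence setting, and the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' rule for a given disease prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 0.779  # reported at the EF < 40% cut-off
SPECIFICITY = 0.763  # reported at the EF < 40% cut-off

results = {}
# 0.02 and 0.20 echo the range quoted in the text; 0.40 is hypothetical.
for prevalence in (0.02, 0.20, 0.40):
    results[prevalence] = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    ppv, npv = results[prevalence]
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

Under these assumptions the PPV rises and the NPV falls as prevalence increases, which is one reason why results from low-prevalence screening populations are not directly comparable with those from an HF-only cohort.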
Despite recent advances in HF pharmacotherapy, the mortality and rehospitalization rates of patients with HF are still high. Therefore, the identification of high-risk patients who would benefit the most from comprehensive HF treatment is urgently required 23 . A few studies have suggested a promising role for AI support in the early diagnosis of low EF 15 . Regarding AI for the detection of LVSD, only one study, by Attia et al., reported on the power of an AI algorithm to predict future LVSD development 7 . Our study is the first to show an association between long-term survival and LVSD of patients with HF based on an AI algorithm. Our results suggest that the AI algorithm may identify abnormalities in the ECG before overt LVSD is observed on echocardiography.
AI algorithms are often described as a "black box" whose exact mechanism cannot be explained. However, some ECG characteristics of the DeepECG-HFrEF (+) group might have contributed to the prognostic performance of the algorithm. The DeepECG-HFrEF (+) group had significantly longer corrected QT intervals and higher proportions of LBBB and IVCD. A study by Lee et al. showed that LBBB and IVCD were associated with an increased risk of all-cause mortality and rehospitalization due to HF aggravation 24 . Regarding the QTc interval, a study by Park et al. showed a J-curve association between the corrected QT interval and mortality among patients with acute HF, with a nadir of 440-450 ms in men and 470-480 ms in women 25 . Thus, such an association might be one of the factors used by the DeepECG-HFrEF algorithm to differentiate between the two groups.
There is no clear explanation for the increased false-positive and false-negative rates among patients with an EF near 40%. One plausible explanation is that patients clustered near an EF of 40% form a heterogeneous group. A previous study by Rastogi et al. showed heterogeneity in the underlying demographics of HFmrEF to be associated with changes in EF over time 26 . Among the HFmrEF groups, improvement in EF tends to be associated with coronary artery disease, while a worsening of EF is more likely to coexist with hypertension and diastolic dysfunction 26 . Patients with acute coronary syndrome are more likely to have dynamic changes in their ECGs and EF over a short period of time 27,28 . As ischemia was the leading cause of acute HF among patients in the KorAHF Registry, such dynamic changes might have contributed to heterogeneity, resulting in a discrepancy between the actual EF and the DeepECG-HFrEF algorithm results 29 .
Limitations. The limitations of our study need to be acknowledged in the interpretation of results.
First, owing to the retrospective design, causal relationships between the identified factors and LVSD among patients with HF could not be inferred. Further validation of the algorithm using a prospective study design is needed. Second, generalization of our results is limited, and should be cautiously interpreted, as the study population was drawn 2), respectively, which was not statistically significant (p = 0.192). Also, the performance of the algorithm although the 30-day maximum has generally been accepted in previous studies 10,12 . It is important to note that ECGs matched to echocardiography within 24 h comprised 82.1% of the data used in this study. Fourth, HF medication compliance was not considered. As angiotensin-converting enzyme inhibitors and beta-blockers are known to improve prognosis in LVSD, adherence to such medication could have affected survival. Fifth, our study focused on the association between ECG and echocardiography and included multiple ECG and echocardiographic data

Conclusions
The DeepECG-HFrEF algorithm showed acceptable performance in distinguishing HFrEF in a real-world HF cohort. Patients with a DeepECG-HFrEF (+) classification had a significantly worse 5-year survival. Application of the DeepECG-HFrEF algorithm may be of specific benefit in resource-limited clinical settings where echocardiography is not readily available to identify high-risk patients who may benefit from active therapeutic intervention.

Methods
AI Algorithm. The original convolutional neural network (CNN)-based algorithm has been described, developed, and externally validated previously 8 . In this study, the DeepECG-HFrEF algorithm was validated for detecting an LVEF < 40% from 12-lead, 10 s ECG data of patients with HF. The algorithm was originally implemented in the TensorFlow (Google, Mountain View, CA) framework and written in Python (version 3.6; Python Software Foundation, Beaverton, OR). For this study, the algorithm was re-implemented in PyTorch (Facebook, Menlo Park, CA), with no additional training or optimization of the original algorithm. The output of the algorithm is a continuous value between 0 and 1, representing a confidence score for an EF < 40%. Using a predefined cut-off value, every test was classified as either positive (+) or negative (−); no test was considered intermediate.
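The final classification step described above amounts to thresholding the confidence score. The snippet below is a minimal sketch of that step only, not the authors' implementation: the function name and the example scores are hypothetical, and the default cut-off of 0.370 is the value stated in the statistical analysis, with scores greater than or equal to the cut-off counted as positive.

```python
def classify(score: float, cutoff: float = 0.370) -> str:
    """Map a confidence score in [0, 1] to a binary DeepECG-HFrEF result."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must lie between 0 and 1")
    # Scores at or above the cut-off are positive; there is no intermediate class.
    return "HFrEF (+)" if score >= cutoff else "HFrEF (-)"


# Hypothetical confidence scores, NOT model outputs from the study.
example_scores = [0.05, 0.369, 0.370, 0.92]
labels = [classify(s) for s in example_scores]
for s, label in zip(example_scores, labels):
    print(f"score {s:.3f} -> {label}")
```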

Statistical analysis.
A comprehensive panel of diagnostic performance metrics was summarized to evaluate the performance of the DeepECG-HFrEF algorithm. In particular, the sensitivity, specificity, PPV, NPV, and accuracy in this validation study were determined by classifying an ECG as positive (+) when its confidence score was greater than or equal to the original algorithm's cut-off of 0.370 for detecting an LVEF < 40% 8 . The AUC and its confidence interval were evaluated using a 2000-sample bootstrapping method. We also examined the optimal threshold, defined as the threshold that maximizes the sum of sensitivity and specificity (i.e., Youden's index). Continuous variables are presented as the mean ± standard deviation and were compared using the unpaired Student's t-test. Categorical variables are expressed as frequencies or percentages and were compared using the chi-squared test. For the secondary objective of exploring the long-term prognostic impact of DeepECG-HFrEF (+), the Kaplan–Meier method was used, with between-group differences assessed using the log-rank test. The Cox proportional-hazards regression model was used to identify the predictors of 5-year all-cause mortality. Three models were evaluated: model 1, comprising DeepECG-HFrEF (+) together with age > 70 years, diabetes, ischemic heart disease, and chronic kidney disease (CKD) stage 4-5; model 2, which added an echocardiographic EF < 40% to model 1; and model 3, which added BNP > 500 pg/mL to model 1. All reported p-values were two-sided, with a p-value < 0.05 considered significant. Statistical analyses were performed using IBM SPSS Statistics version 23 (IBM Co., Armonk, NY, USA).
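Two of the steps above, the bootstrapped confidence interval for the AUC and the Youden-optimal threshold, can be sketched with the Python standard library alone. This is an illustration rather than the analysis actually run (which used SPSS): the percentile method for the interval is an assumption, since the text specifies only the 2000 resamples, and the function names, labels, and scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method): resample cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # the AUC needs both classes in the resample
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def youden_threshold(labels, scores):
    """Threshold t (score >= t is positive) maximizing sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical data, NOT from the study: 1 = EF < 40% on echocardiography.
y_true = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.12, 0.20, 0.31, 0.35, 0.42, 0.55, 0.61, 0.78, 0.90]

point_estimate = auc(y_true, y_score)
ci_low, ci_high = bootstrap_auc_ci(y_true, y_score)
best_t, best_j = youden_threshold(y_true, y_score)
print(f"AUC {point_estimate:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
print(f"Youden-optimal threshold {best_t} (J = {best_j:.2f})")
```

Resampling whole cases keeps each ECG paired with its echocardiographic label, so the interval reflects sampling variability in the cohort rather than in the scores alone.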