Introduction

Congenital heart disease (CHD) is one of the most common types of birth defects and a major cause of childhood morbidity and mortality1,2. Early and accurate identification of affected pediatric patients is crucial for timely intervention and effective surgical outcomes3,4,5,6. However, commonly used examination methods, such as transthoracic echocardiography (TTE), X-ray, cardiac magnetic resonance imaging (MRI), and dual-source CT, are operationally complex, time-consuming, costly, and heavily dependent on evaluation by experienced cardiologists7. Unfortunately, delayed diagnosis is prevalent8 (even for critical cases9), which results in sub-optimal clinical intervention10,11,12,13,14, especially in low- and middle-income regions15,16,17,18,19,20. A study in a low-income country reported a delayed-diagnosis rate of up to 85.1%21.

In general, CHDs are caused by structural abnormalities such as holes and leaky valves, which change the electrocardiovectors and can present abnormal manifestations in electrocardiogram (ECG) signals theoretically22,23,24. In this context, the surface ECG can offer insights into cardioelectric activity that could be helpful for the detection of CHD patients due to its affordable price and high effectiveness. It has been partly observed that CHD is associated with some particular manifestations on adult ECG signals25,26,27,28,29,30,31,32, suggesting that evaluating the abnormal ECG waveforms could provide clues for the detection of underlying heart defects. However, previous research mostly focused on the correlations between CHD and adult ECG, which did not offer an immediate benefit for pediatric CHD interventions33. So far, there have been only a few studies on CHD diagnosis using pediatric ECG signals, which brings challenges as well as opportunities for innovative discovery in the field of ECG analysis.

Recent advances in deep learning (DL) have demonstrated cardiologist-level and reliable performances on ECG analysis34,35,36, including identifying features not typically recognized by human experts. Earlier investigations have demonstrated that DL models can derive benefits not only from automatically extracted features of ECG waveform data35,36,37 but also from conceptual features utilized by human experts38,39,40,41,42 and features obtained through wavelet transformation43,44. However, few approaches integrated all these different feature types in an end-to-end DL architecture to conduct automatic, efficient, and optimal fusion. Moreover, scant attention has been given to the development of DL models for the detection of CHD, particularly on a large-scale pediatric ECG dataset45,46,47,48,49.

In this work, we present an end-to-end deep neural network-based approach for pediatric ECG cases, called Congenital Heart Disease diagnosis via Electrocardiogram (CHDdECG). CHDdECG optimally integrates multiple input feature types, including raw ECG-waveform data, human-concept features, and wavelet features, to make direct probabilistic predictions for CHD. The workflow of our study is illustrated in Fig. 1. First, potential pediatric patients underwent several examinations, primarily transthoracic echocardiography and electrocardiography, in accordance with the European Society of Cardiology Guidelines for CHD50. In certain cases, additional tests may have been used at the discretion of the attending doctors. The doctors carefully analyzed all the examination results and subsequently determined the final diagnostic outcomes. Next, CHDdECG used only pediatric ECG data to identify CHD cases, by integrating features automatically extracted from ECG-waveform data and wavelet features with human-concept features. It was developed using 65,869 pediatric ECG cases from young children aged 2.12 ± 1.50 years, and evaluated on an internal test set of 12,000 cases and two external test sets of 7137 and 8121 cases, respectively. We found that ECG-based CHD identification by CHDdECG was promising and more accurate than that of ECG cardiologists. Finally, we analyzed the prediction mechanisms of the trained CHDdECG and evaluated its robustness and reliability. By exploring the potential of pediatric ECG for CHD diagnosis, CHDdECG makes advances on three fronts: (1) predicting structural heart defects of pediatric patients using ECG data; (2) drawing potential knowledge from pediatric ECG data beyond the current knowledge of experts through a deep learning approach; and (3) providing clues for further studies of pediatric ECG and CHDs.

Fig. 1: The workflow of AI-enabled CHD detection with ECG data.

Hand-crafted human-concept features (a) were computed with a set of rules (the corresponding formulas are in the Supplementary Materials) on pediatric ECG-waveform data (b), while the wavelet coefficient energy characteristics (used as wavelet features (c)) were obtained by performing wavelet transformation on the pediatric ECG-waveform data. Features of these three types were fed into the proposed AI model (d) for automatic fusion and CHD detection. e and f illustrate the receiver operating characteristic curves and precision-recall curves of the AI model’s CHD detection performance on the internal test set and two external test sets. g illustrates the CHD detection effects (measured by the net reclassification index (NRI)) of the AI model and of cardiologists assisted by the AI model, compared with cardiologists without any assistance as the baseline, across 10 randomly sampled test data groups (from the Center-A test set). The analysis of NRI(+) and NRI(−) is included in the Supplementary Materials. Source data are provided as a Source Data file.

Results

The overall CHD detection effects of CHDdECG

We trained the CHDdECG model to process pediatric ECG data for predicting the presence or absence of CHDs, without distinguishing between CHD subtypes. After training CHDdECG from scratch in a supervised manner, we evaluated the trained model on the internal test set and two external test sets. As shown in Table 1 (CHD prediction without distinguishing subtypes), both the specificity and sensitivity of CHD detection exceeded 0.8 on the internal test set. On the two external test sets, characterized by different subtype proportion distributions, the specificity values were 0.937 and 0.907, respectively, and the sensitivity scores approached 0.8. Additionally, the median values of the probabilistic predictions (see the last column of Table 1) were close to 1.0 on the internal test set and ~0.8 on both external test sets. This highlights the high confidence and robust generalizability of CHDdECG’s predictions. As shown in Fig. 1e, the area under the ROC curve (ROC-AUC), a comprehensive metric, was high: 0.915 on the internal test set, and 0.917 and 0.907 on the two external test sets. The Brier scores were close to 0.0 on all the test sets. Figure 1f shows the precision-recall (PR) curves. Because PR curves are influenced by class imbalance, the Center-C external test set showed a better PR-AUC score than the internal test set. Together, these metrics indicate that CHDdECG achieved good performance and robust generalizability in CHD detection.
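For readers less familiar with the metrics named above, the following is a minimal Python sketch of how sensitivity, specificity, the Brier score, and ROC-AUC can be computed from binary labels and predicted probabilities. It is an illustration only: the toy labels and probabilities, the 0.5 decision threshold, and all function names are assumptions and are not taken from the study.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Return (sensitivity, specificity) at a given probability threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg) and is
    meant for illustration, not for large test sets.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = CHD, 0 = non-CHD.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(y_true, y_prob)
brier = brier_score(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"brier={brier:.3f} roc_auc={auc:.3f}")
```

A Brier score near 0.0 means the predicted probabilities are close to the true labels on average, which is why it is reported alongside the threshold-free ROC-AUC.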

Table 1 Model performances on the internal test set from Center-A, an external test set from Center-B, and another external test set from Center-C

Pediatric ECG-based CHD prediction outcomes compared to ECG cardiologists

We compared the CHD diagnosis outcomes of the trained CHDdECG model with those of 10 senior ECG cardiologists across 10 groups, denoted as G1–G10 in Fig. 1g. For each group, we randomly selected 200 CHD ECG recordings and 100 non-CHD ECG recordings from the internal test set, with no overlap between groups. For each group, the CHDdECG model made probabilistic predictions of CHD, and the ECG cardiologists were asked to identify the CHD cases. We compared the performance of each method by computing the net reclassification index (NRI). As shown by the light blue bars in Fig. 1g, where the diagnosis outcomes of cardiologists are regarded as the baseline, the NRI scores for CHDdECG were well above 0, indicating its superiority over cardiologists in pediatric ECG-based CHD detection. Furthermore, we picked out the cases that were misidentified by cardiologists but correctly identified by CHDdECG, and asked the cardiologists to reevaluate them. To aid their reevaluation, we also included the corresponding highlighted key ECG segments, as demonstrated in Fig. 2. Prompted by CHDdECG, cardiologists identified some wavelets associated with CHD and changed part of their original diagnoses. The results are illustrated as the dark blue bars in Fig. 1g, which indicate that the reevaluation outcomes are consistently better than the initial diagnosis results. However, the NRI of the cardiologists’ reevaluation remained inferior to that of CHDdECG, suggesting that some cases are still indistinguishable for cardiologists.
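The exact NRI formulas used in the study are given in its Supplementary Materials and are not reproduced here. As a hedged illustration, the sketch below implements the standard two-category NRI for binary decisions, where NRI = NRI(+) + NRI(−). The function name and the toy decisions are hypothetical.

```python
def net_reclassification_index(y_true, baseline_pred, new_pred):
    """Two-category NRI of `new_pred` relative to `baseline_pred`.

    All inputs are sequences of 0/1 values (1 = CHD).
    Returns (nri, nri_pos, nri_neg).
    """
    events = [(b, n) for y, b, n in zip(y_true, baseline_pred, new_pred) if y == 1]
    nonevents = [(b, n) for y, b, n in zip(y_true, baseline_pred, new_pred) if y == 0]

    # Among true CHD cases, moving from 0 to 1 is an improvement.
    up_events = sum(1 for b, n in events if n > b)
    down_events = sum(1 for b, n in events if n < b)
    # Among non-CHD cases, moving from 1 to 0 is an improvement.
    up_nonevents = sum(1 for b, n in nonevents if n > b)
    down_nonevents = sum(1 for b, n in nonevents if n < b)

    nri_pos = (up_events - down_events) / len(events)
    nri_neg = (down_nonevents - up_nonevents) / len(nonevents)
    return nri_pos + nri_neg, nri_pos, nri_neg


# Hypothetical toy example: four CHD cases followed by two non-CHD cases.
y_true = [1, 1, 1, 1, 0, 0]
cardiologist = [1, 0, 0, 1, 0, 1]  # baseline decisions
model = [1, 1, 0, 1, 0, 0]         # decisions being compared to the baseline

nri, nri_pos, nri_neg = net_reclassification_index(y_true, cardiologist, model)
print(nri, nri_pos, nri_neg)
```

For binary decisions, NRI(+) equals the gain in sensitivity and NRI(−) equals the gain in specificity, so a positive NRI means the compared method reclassifies more cases correctly than incorrectly relative to the baseline.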

Fig. 2: Visualization of CHDdECG-activated segments for CHD subtypes using the Grad-CAM approach.

a–d illustrate the classical ECG manifestations of the subtypes ASD, VSD, d-TGA, and TOF, respectively. e illustrates the ECG manifestations of a VSD case that was misidentified as non-CHD by CHDdECG. f illustrates a segment of a non-CHD ECG that was misidentified as CHD by CHDdECG. The ECG leads are marked on the top right. The salient ECG segments are marked in blue (darker blue indicates more important segments). We further circled in orange the typical manifestations that cardiologists considered to be associated with the subtypes on adult ECG, following previous research results25. The horizontal axis represents time (4 × 10⁻² s), and the vertical axis shows the amplitude of the electrocardiogram (μV). Source data are provided as a Source Data file.

CHD-related manifestation detection performances for major subtypes

We tested whether CHDdECG could effectively detect abnormal manifestations of major CHD subtypes. The definitions of subtypes follow the 2020 ESC guideline50. We fine-tuned the trained CHDdECG model for specific CHD subtypes. The ROC-AUC (area under the receiver operating characteristic curve) scores obtained on these test cases are reported in Table 1, ranging from 0.835 to 0.992 on the internal test set, and from 0.889 to 0.926 and from 0.859 to 0.939 on the two external test sets, respectively. For the three most common subtypes (i.e., ventricular septal defect, atrial septal defect, and patent ductus arteriosus), the ROC-AUC scores were 0.920, 0.835, and 0.856 on the internal test set; 0.918, 0.926, and 0.889 on the external test set from Center-B; and 0.913, 0.916, and 0.904 on the external test set from Center-C. On the internal test set, 9 of 12 subtypes achieved ROC-AUC scores over 0.9; on the external test set from Center-C, 7 of 9 subtypes did. CHDdECG thus performs effectively on most subtypes, apart from relatively lower sensitivity scores on some subtypes (e.g., atrial septal defect (ASD) and patent ductus arteriosus (PDA) on the internal test set). The sensitivity on PDA is also relatively lower on the two external test sets. Notably, the Brier scores are close to 0.0 on all test cohorts.

Feature importance observation

To evaluate the prediction mechanisms of CHDdECG, we assessed the interpretability of the trained models (for CHDs and their various subtypes). We computed the importance scores of the feature types used for CHD detection, including raw ECG-waveform data, wavelet features, and hand-crafted human-concept features. The feature importance scores of 300 randomly selected CHD test cases (from the internal test set) are illustrated as a heat map in Fig. 3b, and are also presented in an instance-specific view in Fig. 3a and a feature-wise view in Fig. 3c. As shown in Fig. 3a, features automatically extracted from ECG-waveform data supplied more information for predicting CHD status in most cases. The global feature-wise importance scores confirmed this: the human-concept features and wavelet features were much less important, yet still beneficial. A concrete case is given in Fig. 3d for ease of understanding. In this case, the global importance score of clinical features was 0.102 and that of wavelet features was 0.014, while the features automatically extracted from waveform data attained a score of 0.884. Fig. 3e shows that, across subtypes, the automatically extracted features also yielded higher importance scores than the other feature types. The comparative experiments on the impact of different feature types on performance are provided in the Supplementary Materials.

Fig. 3: An illustration of feature importance of automatically extracted features from raw ECG signals (called sig), clinically useful human concept features (called clin), and wavelet features (called wave).

We randomly selected 300 CHD cases from the test set and illustrated the feature importance scores with a heatmap (b) on these cases. The instance-wise feature importance scores (a) and the global importance scores (c) of features were also computed. d The feature importance scores of one case were especially shown. e An illustration of the feature importance of each feature type for various CHD subtypes (after the model fine-tuning). Note that the features with an importance score of zero were not included. Source data are provided as a Source Data file.

Key pediatric ECG segments of particular interest

Having observed that CHDdECG automatically extracted some critical features, we went on to explore which ECG manifestations CHDdECG relied on using the Grad-CAM approach51, and visualized the salient segments contributing to the CHD predictions in Fig. 2. We obtained the salient segments with Guided-Backpropagation52 following the procedure of the Grad-CAM algorithm51, applied to the output features of the Temporal Attention layer (refer to Fig. 5). Interestingly, we found that CHDdECG-activated segments were partially consistent with previous observations on adult ECG data. In Fig. 2, we marked CHDdECG-activated segments in blue, with darker blue indicating more important segments. Figure 2a shows a notch on the R wave in lead II, a typical abnormal manifestation of atrial septal defect27; Fig. 2b illustrates the Katz–Wachtel phenomenon, i.e., diphasic RS complexes in lead V3, which has been found to be related to ventricular septal defect25; Fig. 2c illustrates a QRS complex with a small R wave and a deep S wave in lead II, a typical manifestation in the left precordial leads of cases with dextro-transposition of the great arteries53; Fig. 2d shows an ST-segment elevation in lead II, which often occurs in cases with tetralogy of Fallot53. We circled the representative CHD-related malformations in Fig. 2 (in orange) following previous observations on adult ECG25,27,53. The blue portions overlap substantially with the orange circles, implying that the CHD predictions of CHDdECG were based on relevant ECG segments. We also present heatmaps for two cases misidentified by CHDdECG in Fig. 2e and f. Figure 2e displays a segment of an ECG signal that CHDdECG failed to identify as the waveform of a ventricular septal defect. 
Although it resembles the Katz–Wachtel phenomenon shown in Fig. 2b, its amplitude is considerably smaller (maximum values around 1000 μV), signifying an atypical form25 that the model did not prioritize (as indicated by the lighter blue color compared with the waveform in Fig. 2b). In Fig. 2f, CHDdECG may have mistaken the double-humped waveform for a notch, whereas its ground-truth diagnostic label is non-CHD. Fig. 2e and f show that the waveforms associated with CHD (or a specific subtype) are diverse and therefore challenging to detect. In summary, CHDdECG demonstrates the ability to identify certain congenital heart disease-related waveforms, and the visualization results are partially interpretable to humans, despite occasional misjudgments of confusing waveforms (which can also serve as a learning opportunity for humans).

Discussion

Previous research indicated that the delayed detection of congenital heart disease (CHD) is a widespread issue across areas of varying income levels8,9,21,54, which leads to missed opportunities for timely interventions. It is also recognized that the distribution of CHD subtypes can vary by location and over time54. In this study, we developed a pediatric-ECG-based differential CHD diagnosis approach, CHDdECG. CHDdECG was trained on large-scale real-world pediatric ECG data, and its effectiveness was validated on internal and external test sets, as presented in Table 1. The performance across comprehensive metrics, including specificity, sensitivity, ROC-AUC, and Brier scores, demonstrated the model’s ability to accurately distinguish CHD-related ECG manifestations. Furthermore, the effectiveness of CHDdECG on two external test sets, characterized by different CHD subtype proportions and variations in ECG recording devices, suggested that subtype proportions and device differences have limited practical impact on the application of CHDdECG. Additional results (see Table 1 and Fig. 2) indicated that CHDdECG also performed well in detecting specific manifestations of most CHD subtypes, suggesting that CHDdECG generalizes well across subtypes and can be reliably used in practice.

Though the performances were generally good for most of the major CHD subtypes (see Table 1; especially for tetralogy of Fallot, atrioventricular septal defect, and double-outlet right ventricle), we also noticed that the detection sensitivity scores for some CHD subtypes (e.g., ASD and PDA on the internal test set) were comparatively lower. In particular, the sensitivity scores for PDA were lower than those of other subtypes across all three test sets. We attribute these lower sensitivity scores to inconspicuous CHD-related manifestations, since it has been observed on adult ECG data25 that some CHD cases are clinically silent. To further confirm this, we checked the sensitivity scores of cardiologists’ analyses for ASD and PDA, which were only 0.306 and 0.434, respectively (with an overall sensitivity of around 0.6). This indicated that senior ECG cardiologists also could not find abnormal manifestations on pediatric ECG in most of those cases. On the positive side, CHDdECG still outperformed cardiologists on ASD and PDA detection, and the relatively lower sensitivity for ASD and PDA did not undermine the benefits of CHDdECG in overall CHD detection. On the external test set from Center-C, we noticed that the performances on the anomalous origin of a coronary artery (AOCA) and coarctation of the aorta (COA) were relatively lower. We further checked the sensitivity of cardiologists’ analyses for these two subtypes, which were 0.400 and 0.292, respectively, and were lower than those of CHDdECG. While CHDdECG’s performances on Center-B appear notable in terms of ROC-AUC, specificity, and sensitivity, Fig. 1f reveals a relatively low PR-AUC value of around 0.5. Conversely, the PR-AUC on Center-C is robust, surpassing 0.8. This difference can be attributed to the significant label imbalance present in the Center-B test set. 
In summary, the CHD detection performances of CHDdECG imply that it is feasible to use 9-lead pediatric ECG data to obtain a differential diagnosis of CHD; however, the detection performances on some subtypes were sub-optimal due to the limited information in ECG signals.
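The effect of label imbalance on the PR curves can be illustrated with a short calculation of precision (positive predictive value) at a single operating point. The sketch below uses the specificities (0.937 and 0.907) and CHD proportions (4.2% and 26.12%) reported for the two external test sets; the sensitivity of 0.8 is an assumed round value, since the text only states that sensitivity "approached 0.8". The function name is hypothetical, and one operating point is not the same as the full PR-AUC.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity, and class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed sensitivity of 0.8 for both centers (an approximation, see lead-in).
ppv_center_b = positive_predictive_value(0.8, 0.937, 0.042)
ppv_center_c = positive_predictive_value(0.8, 0.907, 0.2612)

print(f"Center-B precision at this operating point: {ppv_center_b:.3f}")
print(f"Center-C precision at this operating point: {ppv_center_c:.3f}")
```

Under these assumptions the precision is roughly 0.36 for Center-B and 0.75 for Center-C, even though Center-B has the higher specificity. This shows how a low CHD proportion alone can depress precision, and hence PR-AUC, without any change in the model's discriminative ability.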

Based on the robust performances of CHDdECG, we sought to shed light on its prediction mechanisms. Adopting a deep learning approach to detect structural heart defects rests on the assumption that structural heart abnormalities can change the electrocardiovectors and thus lead to abnormal manifestations in ECG signals. However, some congenital cardiac malformations are subtle and do not show observable changes in the morphology of ECG signals. Hence, we had to examine whether CHDdECG’s predictions were based on reasonable features. From the NRI comparison with senior ECG cardiologists in Fig. 1g, we drew three findings: (1) CHDdECG is more effective in CHD detection than ECG cardiologists; (2) ECG cardiologists can achieve better CHD detection performance when prompted by CHDdECG, which implies that the predictions made by CHDdECG are reasonable and acceptable to experts; (3) ECG cardiologists cannot reach CHDdECG-level performance even when prompted by CHDdECG, suggesting that CHDdECG can extract some information beyond human cognition. These results encourage the use of the CHDdECG model for automatic CHD diagnosis, since its predictions were shown to be superior and trustworthy. They also encourage further studies to identify more hard-to-observe knowledge guided by CHDdECG.

Our further explorations attempted to enhance the clinical acceptance of the CHDdECG approach and facilitate interactions between cardiologists and CHDdECG by comparing the feature importance among the three feature types used by CHDdECG (i.e., automatically extracted ECG features, wavelet features, and concept features used by human experts). Analyses conducted from various perspectives (at the overall dataset level, subtype level, and instance level; see Fig. 3) all suggested that the features automatically extracted from ECG signals were the most important feature type and contributed more than the other two types, whether for detecting the presence of CHD or for any specific subtype. Furthermore, we observed that the key segments detected by CHDdECG (with Grad-CAM) presented many CHD-related malformations that could be observed in both pediatric ECG and adult ECG, even though pediatric ECG is more complicated55,56. These findings implied that (1) the performance gains of CHDdECG (compared to ECG cardiologists; see Fig. 1g) might mainly come from the automatically extracted features, which represent some hard-to-observe information beyond current human knowledge; (2) combining CHDdECG with a visualization approach (e.g., Grad-CAM) made such hard-to-observe knowledge much more accessible. The detailed analyses of pediatric ECG waveforms encourage further investigations of the association between pediatric ECG and CHDs from theoretical and clinical perspectives.

One key strength of our study is that CHDdECG was devised to identify young CHD patients by using only routinely acquired pediatric ECG, thus enabling efficient CHD detection and timely interventions. In this context, it is more clinically meaningful than the previous studies on adult CHD cases. Note that we did not intend to replace the standard CHD diagnosis guideline for pediatric ECG. However, since there are economically underprivileged populations that have much less access to modern technologies and suffer from delayed interventions, we argue that, in these situations, it is highly desired to detect CHD in young children using our CHDdECG with pediatric ECG data, because it is reliable, low-cost, highly efficient, and has been verified on large-scale real-world datasets. Another key strength of our study is that the superior performances (e.g., outperforming ECG cardiologists) of CHDdECG can provide some potential knowledge on pediatric ECG beyond the current human knowledge. Thus, CHDdECG can offer clues for further exploring the potential of pediatric ECG data, which is generally beneficial.

There are still a few limitations in this study. First, although the test data we used aimed to follow real-world scenarios and we also collected external test data to examine the generalization capability of CHDdECG, the data distribution of CHDs can vary across areas and over time54, which may lead to somewhat different performance of CHDdECG, especially when the subtype proportions differ from those of our internal and external test sets. The geographic specificity and fixed period of training and validation limit the assessment of generalizability. Second, 9-lead ECG data provide less information than the standard 12-lead ECG. Since placing more leads on the chests of young children was generally difficult, we made a trade-off decision to train CHDdECG on 9-lead pediatric ECG for wider application scenarios. Nevertheless, CHDdECG allows the processing of ECG data with any number of leads, and we believe that the performance of CHDdECG would be better if it were trained and evaluated on standard 12-lead ECG data, and that CHDdECG could then serve adults and older children well. Third, our CHDdECG architecture is highly compatible, allowing the automated processing and fusion of multiple feature types. However, we used only three feature types for CHD detection, and some other feature types (e.g., signal features extracted by Bayesian approaches) might provide further benefit if included. Fourth, while our retrospective study has demonstrated the efficiency of CHDdECG on a real-world clinical dataset, the performance of CHDdECG for CHD screening in the general population remains uncertain, as it is challenging to prospectively obtain ECG data from children who do not necessarily require such examinations. 
Fifth, although the CHD labels were acquired following standardized diagnostic guidelines, we cannot rule out the possibility of label misclassification as a limitation, particularly when CHD cases present abnormalities below the level of human detection. This limitation also highlights the need for prospective protocol research dictating a comprehensive and standard diagnostic workup for all individuals.

Methods

Data access and ethical statement

This study was approved by the Medical Ethics Committee of Guangdong Provincial People’s Hospital (KY-Q-2022-144-01). In accordance with ethical guidelines, this study secured a waiver for informed consent based on its retrospective analysis of anonymized data, ensuring privacy and security without explicit consent from subjects.

Data sources

In this study, three distinct datasets were collected and used. The first dataset, utilized comprehensively for model training, validation, and internal testing, originates from the ECG Division in the Cardiovascular Outpatient Department at Guangdong Provincial People’s Hospital (referred to as Center-A). The second dataset consists of an external test set sourced from the ECG Division in the Cardiovascular Inpatient Department at the same hospital (referred to as Center-B). Another external test set was obtained from the ECG Division at Shengjing Hospital of China Medical University (referred to as Center-C). These datasets collectively facilitated the development and comprehensive evaluation of the model. The ECG data in Center-A and Center-B were collected using identical ECG devices (GE MAC800) from August 2014 to October 2020, while the data in Center-C were collected utilizing a distinct brand of ECG device (NIHON KOHDEN ECG-2550) from January 2020 to June 2023. The data selection and train-validation-test split are illustrated in Fig. 4. For the sake of reaching reliable conclusions, some cases were omitted due to: (i) diagnosis results (labels) were missing; (ii) ECG signals were not correctly recorded, with excessive noises (over 20% signal amplitude values exceeding 5 mV), or with corrupted signals (20% recorded values are 0’s); (iii) ECG cases were obtained after the intervention; (iv) ECG cases were obtained from the individuals whose other ECG-waveform data have been included in this study. The demographic and clinical characteristics of cohorts were reported in the Supplementary Materials. To ensure that patients from the test data sets were not included in the training data set, we excluded patients from the Center-B external test set if they were already present in the Center-A training set by using the enterprise master patient index (EMPI) which includes factors such as age, date of birth, sex, and name. 
For the Center-C external test set, a distance of almost 2300 km combined with comparison rules based on patients’ name, sex, age, and CHD subtype was used to ensure the absence of patient overlap with other centers. After that, 77,869 pediatric ECG cases from Center-A, 7137 cases from Center-B, and 8121 cases from Center-C finally remained for our study. Specifically, 65,869 cases (a stratified sample of around 85% of the ECG cases with various CHD subtypes or non-CHD ones; comprising 23,873 females and 41,996 males) in Center-A were randomly selected for model training, and the remaining 12,000 cases (consisting of 4242 females and 7758 males) were used for model testing. Notably, in this work, the sex of each participant was determined based on their biological sex, as recorded on their Chinese identity card. In addition, the 7137 cases (consisting of 3458 females and 3679 males) in Center-B and the 8121 cases (3723 females and 4398 males) in Center-C comprised two independent external test sets for evaluating the generalization of CHDdECG. The CHDdECG model was trained to conduct CHD detection as a classification task in a supervised manner, and the classification labels used in the training phase, indicating the CHD subtypes or the non-CHD status, were real-world diagnostic results following standard diagnostic guidelines (e.g., using echocardiography), organized according to the International Statistical Classification of Diseases 10 codes (ICD-10).
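Exclusion criterion (ii) above can be sketched as a simple signal-quality check. This is an illustrative reading only: the text does not say whether the 20% thresholds were applied per lead or to the whole recording, nor whether exactly 20% zeros counts as corrupted. The sketch assumes pooled samples across all leads, amplitudes in microvolts, and exclusion at 20% zeros or more; the function and parameter names are hypothetical.

```python
def is_usable_recording(leads_uv, max_abs_uv=5000.0, max_bad_fraction=0.20):
    """Return True if a recording passes the assumed quality criteria.

    leads_uv: list of leads, each a list of amplitudes in microvolts.
    A recording is rejected if more than `max_bad_fraction` of samples exceed
    5 mV in absolute value (excessive noise), or if at least that fraction of
    samples is exactly zero (corrupted signal).
    """
    samples = [value for lead in leads_uv for value in lead]
    if not samples:
        return False  # nothing recorded
    n = len(samples)
    noisy_fraction = sum(1 for v in samples if abs(v) > max_abs_uv) / n
    zero_fraction = sum(1 for v in samples if v == 0) / n
    return noisy_fraction <= max_bad_fraction and zero_fraction < max_bad_fraction


# Hypothetical toy recordings (far shorter than a real 10 s trace).
clean = [[100.0, -200.0, 300.0, 50.0, -50.0] for _ in range(9)]
noisy = [[6000.0, 7000.0, 100.0, 100.0, 100.0]]
corrupted = [[0.0, 0.0, 100.0, 200.0, 300.0]]

print(is_usable_recording(clean), is_usable_recording(noisy),
      is_usable_recording(corrupted))
```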

Fig. 4: An overview of the case selection procedures for Center-A, Center-B, and Center-C.

Center-A: the ECG Division in the Cardiovascular Outpatient Department at Guangdong Provincial People’s Hospital; Center-B: the ECG Division in the Cardiovascular Inpatient Department at Guangdong Provincial People’s Hospital; Center-C: the ECG Division at Shengjing Hospital of China Medical University. The descriptions within the blue boxes provided the reasons for omitting some cases (with ECG data and diagnostic results).

All of the ECG cases used for model training were collected from individuals aged 2.12 ± 1.50 years, among which the cases with CHDs were aged 1.58 ± 1.28 years. This age distribution satisfied the need to explore CHD detection methods for early intervention. Notably, over 90% of the cases had 9-lead ECG data with three missing chest leads (V2, V4, and V6), because it was usually impractical to place all 6 chest leads on such a young child’s chest. Thus, we built our framework on 9-lead pediatric ECG data consisting of leads I, II, III, aVR, aVL, aVF, V1, V3, and V5, a setting that is easier to generalize to the young population. All pediatric ECG data were acquired at a sampling frequency of 500 Hz over 10 s, giving 5000 sampling points per lead. As shown in Fig. 4, the CHD cases made up approximately 16.6% of the Center-A dataset (8741 of 52,695 training cases, 2186 of 13,174 validation cases, and 2038 of 12,000 test cases), and approximately 4.2% and 26.12% of the Center-B and Center-C external test sets, respectively. The majority of the CHD cases belonged to the subtypes ventricular septal defect (VSD), atrial septal defect (ASD), patent ductus arteriosus (PDA), and tetralogy of Fallot (TOF), which is aligned with real-world scenarios. The quantitative proportions of the CHD subtypes are shown in the second column (Prop (%)) in Table 1. Given that the pediatric ECG data were sourced from diverse departments and hospitals without excessive selection, our collected datasets closely mirrored real-world medical scenarios, which supports the credibility of our study.
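The case counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# (CHD cases, total cases) per Center-A split, as stated in the text.
center_a = {
    "train": (8741, 52695),
    "validation": (2186, 13174),
    "test": (2038, 12000),
}

total_chd = sum(chd for chd, _ in center_a.values())
total_cases = sum(total for _, total in center_a.values())
prevalence = total_chd / total_cases

# Training and validation cases together form the model-development set.
development_cases = center_a["train"][1] + center_a["validation"][1]
development_fraction = development_cases / total_cases

# 500 Hz sampling over 10 s gives the number of samples per lead.
samples_per_lead = 500 * 10

print(f"Center-A cases: {total_cases}, CHD cases: {total_chd}")
print(f"CHD proportion: {prevalence:.1%}")
print(f"Development share: {development_fraction:.1%}")
print(f"Samples per lead: {samples_per_lead}")
```

The totals agree with the 77,869 Center-A cases, the 65,869 development cases (about 85%), the roughly 16.6% CHD proportion, and the 5000 sampling points per lead reported in the text.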

Data pre-processing

Previous research suggested that some proper pre-processing on ECG-waveform data could lead to considerable performance gains57. Inspired by the successes of multi-modal data fusion approaches, we developed a CHDdECG model with three input branches, which took three types of features as input, including the ECG-waveform data Xe, the hand-crafted human-concept features Xc, and the features Xw obtained by wavelet transformation. The last two types of features, Xw and Xc, were organized in tabular data format. The inputs of the three branches were individually prepared as follows.

  • First, we suppressed the myoelectric noise (typically at 30–300 Hz) in the raw ECG-waveform data using low-pass Butterworth filters58,59. Then, the power-line interference (typically at 50 Hz) was removed by a finite impulse response notch filter59 designed with the Kaiser window function60. Finally, baseline wander was removed using an infinite impulse response zero-phase digital filter59. After these de-noising procedures, the key information of the ECG-waveform data was well preserved while the noise was partially eliminated. We then organized the ECG-waveform data as \({X}_{{\rm {e}}}\in {{\mathbb{R}}}^{9\times 5000}\) (9 leads with 5000 sampling points in each lead).

  • The wavelet features organized in a tabular data format, \({X}_{{\rm {w}}}\in {{\mathbb{R}}}^{54}\), were obtained by performing the wavelet decomposition on the de-noised ECG signal Xe. We performed 9 levels of the wavelet decomposition with the db5 wavelet function, and the resulting coefficient energy characteristics of the 4th–8th levels were selected and concatenated into a feature vector (i.e., Xw). Note that in Xw, the elements were considered independent scalar features.

  • The input human-concept features were also organized in a feature vector (i.e., \({X}_{{\rm {c}}}\in {{\mathbb{R}}}^{114}\)), whose elements were independent scalar features obtained from Xe. The scalar features in Xc represent human concepts widely used in clinical ECG analysis. To imitate the clinical procedure for analyzing ECG data, we first detected five keypoints (the P, Q, R, S, and T waves) on the time axis using the findpeaks function in MATLAB. Specifically, to detect inverted P and T waves, we took the absolute values of the ECG sampling points before applying findpeaks. Then, the onset and end points of a peak (e.g., the R wave) were obtained by computing the slopes, following the approach in the literature61. After obtaining the keypoints on the time axis, 114 tabular features were computed following the method62 to provide clinically useful concepts, including the heart rate, the mean durations of the QRS/P/PR segments, and the mean amplitudes of the P waves, among others. The formulas for computing all 114 scalar features are provided in the Supplementary Materials.
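The three de-noising steps for the waveform branch can be sketched with SciPy as follows. This is a minimal illustration and not the authors' implementation: the filter orders, the 100 Hz low-pass cut-off, the 45–55 Hz stop band, the number of taps, the Kaiser beta, and the 0.5 Hz baseline cut-off are all placeholder assumptions, since the text does not state them. The helper names `denoise_ecg` and `make_demo_ecg` are hypothetical, and the notch filter is applied forward and backward here for convenience.

```python
import numpy as np
from scipy import signal

FS = 500.0  # sampling frequency in Hz, as stated in the text


def denoise_ecg(x, fs=FS, lowpass_hz=100.0, notch_band=(45.0, 55.0),
                notch_taps=501, kaiser_beta=8.0, baseline_hz=0.5):
    """Denoise an ECG array of shape (leads, samples).

    All cut-offs and orders are placeholder assumptions.
    """
    # 1) Low-pass Butterworth filter against high-frequency myoelectric noise.
    sos_lp = signal.butter(4, lowpass_hz, btype="low", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos_lp, x, axis=-1)

    # 2) FIR band-stop (notch) around 50 Hz, designed with a Kaiser window.
    taps = signal.firwin(notch_taps, notch_band,
                         window=("kaiser", kaiser_beta),
                         pass_zero="bandstop", fs=fs)
    x = signal.filtfilt(taps, [1.0], x, axis=-1)

    # 3) Zero-phase IIR high-pass filter against baseline wander.
    sos_hp = signal.butter(2, baseline_hz, btype="high", fs=fs, output="sos")
    return signal.sosfiltfilt(sos_hp, x, axis=-1)


def make_demo_ecg(n_leads=9, n_samples=5000, fs=FS):
    """Synthetic stand-in: a 5 Hz 'signal', 50 Hz hum, and 0.1 Hz drift."""
    t = np.arange(n_samples) / fs
    one_lead = (np.sin(2 * np.pi * 5 * t)
                + 0.5 * np.sin(2 * np.pi * 50 * t)
                + 0.8 * np.sin(2 * np.pi * 0.1 * t))
    return np.tile(one_lead, (n_leads, 1))


raw = make_demo_ecg()
x_e = denoise_ecg(raw)
print(x_e.shape)  # (9, 5000)
```

On the synthetic input, the 50 Hz and 0.1 Hz components are strongly attenuated while the 5 Hz component passes almost unchanged, which mirrors the intent of the three steps.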

Data normalization

After the pre-processing, Xe, Xw, and Xc were each normalized by z-score before being fed separately to the three input branches of the CHDdECG model (see Fig. 5):

$${X}_{i}^{{\prime} }=\frac{{X}_{i}-{\mu }_{i}}{{\sigma }_{i}},$$
(1)

where Xi ∈ {Xe, Xc, Xw}, \({X}_{i}^{{\prime} }\) is the normalized outcome with the identical feature size, and μi and σi are the mean and standard deviation of the ith component computed over the training set. For \({X}_{{\rm {e}}}\in {{\mathbb{R}}}^{9\times 5000}\), the normalization is performed along the lead dimension (i.e., the first dimension).
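Eq. (1) can be illustrated with a short NumPy sketch. It assumes that "along the lead dimension" means one mean and one standard deviation per lead, pooled over all training cases and time points, and one pair per scalar feature for the tabular inputs. The helper names `fit_zscore` and `apply_zscore` and the toy data are hypothetical.

```python
import numpy as np


def fit_zscore(train, reduce_axes):
    """Return mean and std over `reduce_axes`, kept broadcastable."""
    mu = train.mean(axis=reduce_axes, keepdims=True)
    sigma = train.std(axis=reduce_axes, keepdims=True)
    return mu, np.maximum(sigma, 1e-8)  # guard against constant features


def apply_zscore(x, mu, sigma):
    """Eq. (1): subtract the training mean, divide by the training std."""
    return (x - mu) / sigma


rng = np.random.default_rng(0)
# Toy stand-ins with the shapes given in the text (20 training cases).
xe_train = rng.normal(2.0, 3.0, size=(20, 9, 5000))
xw_train = rng.normal(-1.0, 0.5, size=(20, 54))

mu_e, sd_e = fit_zscore(xe_train, reduce_axes=(0, 2))  # one pair per lead
mu_w, sd_w = fit_zscore(xw_train, reduce_axes=(0,))    # one pair per feature

xe_norm = apply_zscore(xe_train, mu_e, sd_e)
xw_norm = apply_zscore(xw_train, mu_w, sd_w)
print(mu_e.shape, mu_w.shape)  # (1, 9, 1) (1, 54)
```

Validation and test cases would be passed through `apply_zscore` with the statistics fitted on the training set, so that no test information leaks into the normalization.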

Fig. 5: An illustration of our proposed deep learning-based model, CHDdECG.

The left part showcases the overall architecture of the CHDdECG model, characterized by a fusion procedure involving multiple feature types. The right part presents the module details within CHDdECG. Please refer to the original TabNet paper66 for the structure of the Attentive Transformer module and Feature Transformer module.

CHDdECG architecture and data processing procedure

We proposed a deep learning-based model, CHDdECG, to use 9-lead pediatric ECG data for CHD detection. The model was implemented using the Keras framework63 with TensorFlow 2.0 as the backend. CHDdECG mainly consisted of three input branches for the three feature types and one output branch that predicted the probability of CHD presence. The input Xe was sequentially processed by 1D convolution blocks, a three-path module consisting of 1D residual blocks64 with various kernel sizes, a Transformer Encoder65, and a temporal attention layer, to extract features at local and global scopes. The tabular features, denoted as Xw and Xc, were processed individually by TabBlocks66. We considered all the extracted features as independent scalar features and used one TabBlock66 to select and fuse them. Finally, the fused features were used to predict the presence probability of CHD. The overall architecture of CHDdECG is shown in Fig. 5, and the detailed operations are described as follows.

  1.

    For the ECG-waveform data Xe, we employed one-dimensional (1D) convolutional layers as adaptive signal filters to extract the local signal features, regarding the 1D ECG signal as a special case of a 2D image. We first used two 1D convolutional layers (each followed by batch normalization and a ReLU activation) to extract features along the temporal dimension. Then, three model paths were used, each of which contained three 1D residual blocks, with filter kernel sizes of 3, 5, and 7 for the three paths, respectively. In this design, each 1D residual block down-sampled the features by a factor of 4 along the temporal dimension. Notably, we used average pooling in the shortcut path of the residual blocks, following ResNet-D67. The output features of the three model paths had an identical size, so they were concatenated along the channel dimension and then fed to the Transformer Encoder module.

  2.

    In addition to the convolutions used to extract local features, we also employed a Transformer Encoder block65 to extract global features throughout the duration of ECG recording. The key component of the Transformer Encoder was a multi-head self-attention operation, which was defined by

    $${h}_{i}=\,{{\mbox{softmax}}}\,\left(\frac{({W}_{{\rm {Q}},i}x){({W}_{{\rm {K}},i}x)}^{{\rm {T}}}}{\sqrt{{d}_{{\rm {h}}}}}\right)({W}_{{\rm {V}},i}x),\quad \quad {x}_{{\rm {o}}}=[{h}_{1},{h}_{2},\ldots,{h}_{i},\ldots,{h}_{n}]{W}_{{\rm {o}}},$$
    (2)

    where \({W}_{{\rm {Q}},i},{W}_{{\rm {K}},i},{W}_{{\rm {V}},i}\in {{\mathbb{R}}}^{{d}_{{\rm {h}}}\times {d}_{x}}\) and \({W}_{{\rm {o}}}\in {{\mathbb{R}}}^{n{d}_{{\rm {h}}}\times {d}_{{\rm {o}}}}\) are learnable parametric matrices, [·, ·] denotes the concatenation operation, dx is the length of the input feature vector, dh is the hidden state dimension of the Transformer Encoder, i is an index of the attention heads, n is the number of the attention heads (in this study, we set n = 8), x denotes the input feature, and xo denotes the output feature. After being processed by the self-attention module, the features were further processed by a feed-forward module composed of two linear layers with a ReLU activation in between (see Fig. 5).

  3.

    By applying the convolutional operations and the Transformer Encoder, the local and global features were hierarchically extracted from the raw ECG-waveform data. After that, we employed a temporal attention layer to highlight the key segments, using a 1D convolution layer (followed by batch normalization and a sigmoid activation) to compute the attention weights (see the right part of Fig. 5), as:

    $${z}_{{\rm {e}}}=\,{{\mbox{Sigmoid}}}({{\mbox{BatchNorm}}}({{\mbox{Conv}}}\,({x}_{{\rm {o}}})))\odot {x}_{{\rm {o}}},$$
    (3)

    where \(\odot\) denotes point-wise multiplication, and ze denotes the output features of the temporal attention module.

  4.

    Finally, we treated the elements of the features ze, Xw, and Xc as independent scalar tabular features, and employed several TabBlocks to process them. For ze, we flattened it into a feature vector before using the first TabBlock. A TabBlock contained an Attentive Transformer module for feature selection and a Feature Transformer for feature processing. Please refer to the original TabNet paper66 for the detailed structure of the Attentive Transformer module and Feature Transformer module. The Attentive Transformer module computes a mask m for feature selection, which filters out parts of input features by a point-wise multiplication. The selected scalar features were then processed by the Feature Transformer module within the TabBlock.

  5.

    After processing by the top-most TabBlock, we hypothesized that the higher-level semantic features from Xe, Xw, and Xc associated with CHDs had been effectively extracted and fused. These features were processed by two fully connected layers with a BatchNorm layer and a ReLU activation in between, which were used to predict the presence and absence probabilities of CHD.
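The multi-head self-attention of Eq. (2) can be written out in a few lines of NumPy. This is a minimal sketch under stated assumptions: the input is laid out with one row per time step, so the projection W x of Eq. (2) becomes `x @ W.T`; the toy sizes T, d_x, d_h, and d_o are arbitrary; and the function name `multi_head_self_attention` is hypothetical. Only n = 8 heads is taken from the text.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (T, d_x); w_q, w_k, w_v: (n, d_h, d_x); w_o: (n * d_h, d_o)."""
    n_heads, d_h, _ = w_q.shape
    heads, weights = [], []
    for i in range(n_heads):
        q = x @ w_q[i].T  # queries, shape (T, d_h)
        k = x @ w_k[i].T  # keys
        v = x @ w_v[i].T  # values
        a = softmax(q @ k.T / np.sqrt(d_h), axis=-1)  # (T, T) attention
        heads.append(a @ v)  # h_i in Eq. (2)
        weights.append(a)
    # Concatenate the heads and project with W_o to obtain x_o.
    x_o = np.concatenate(heads, axis=-1) @ w_o
    return x_o, np.stack(weights)


rng = np.random.default_rng(0)
T, d_x, d_h, d_o, n = 16, 32, 4, 32, 8  # toy sizes, except n = 8
x = rng.normal(size=(T, d_x))
w_q, w_k, w_v = (rng.normal(size=(n, d_h, d_x)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(n * d_h, d_o)) * 0.1

x_o, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
print(x_o.shape, attn.shape)  # (16, 32) (8, 16, 16)
```

Each row of every attention matrix sums to 1, so each output time step is a weighted average over all time steps of the recording, which is what gives the encoder its global scope.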

CHDdECG was trained in an end-to-end manner to jointly process the three types of input Xe, Xw, and Xc. Since there were relatively few cases with CHDs (compared to the non-CHD cases), we applied label smoothing to the target y in the training phase to avoid over-fitting, defined by:

$$\tilde{y}=(1-\alpha )y+\alpha (1-y),$$
(4)

where α ∈ [0, 0.5) is a hyperparameter coefficient, and α = 0.15 was used in our study. In Eq. (4), the raw label y (y ∈ {0, 1}) was obtained following the CHD diagnostic result (y = 1 if and only if the case was with CHD), and \(\tilde{y}\) is the smoothed label used as the training target (obtained by Eq. (4)). The CHDdECG model was trained with the weighted cross-entropy loss function \({\mathcal{L}}\), defined by

$${\mathcal{L}}(p,q)=-\tilde{y}\log (p)-w\cdot (1-\tilde{y})\log (q),$$
(5)

where p denotes the predicted probability that a case has CHD, q denotes the predicted probability that it does not, and p + q = 1 owing to the final softmax layer. To deal with the class imbalance, the class weight w in Eq. (5) was set to 0.2, which down-weights the non-CHD term so that the model pays more attention to the CHD cases. In the training phase, CHDdECG was first initialized by He’s parameter initialization68 and then trained from scratch for 20 epochs using the Adam optimizer69 with the default parameters. The mini-batch size was 256. The learning rate was initialized to 1.0 × 10⁻² and was decayed by a factor of 10 every 8 epochs. In the validation and testing phases, CHDdECG inferred the CHD probabilities for the input ECG cases using the parameters obtained in the training phase.
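The training target, the loss of Eq. (5), and the step-wise learning-rate decay can be sketched in NumPy as follows. The sketch assumes the smoothing convention in which α is the probability mass moved to the opposite class, so that y = 1 becomes 0.85 and y = 0 becomes 0.15 for α = 0.15. It also assumes that epochs are counted from 0. The function names are hypothetical and the clipping constant `eps` is an added numerical safeguard.

```python
import numpy as np


def smooth_label(y, alpha=0.15):
    # Assumed convention: alpha is the mass moved to the opposite class.
    return (1.0 - alpha) * y + alpha * (1.0 - y)


def weighted_ce(p, y_smooth, w=0.2, eps=1e-7):
    """Eq. (5): p is P(CHD), q = 1 - p is P(non-CHD)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    q = 1.0 - p
    return -y_smooth * np.log(p) - w * (1.0 - y_smooth) * np.log(q)


def learning_rate(epoch, base_lr=1e-2, drop=0.1, every=8):
    # Step decay by a factor of 10 every 8 epochs, epochs counted from 0.
    return base_lr * drop ** (epoch // every)


y_chd = smooth_label(1.0)          # smoothed target for a CHD case
loss_good = weighted_ce(0.9, y_chd)  # confident, correct prediction
loss_bad = weighted_ce(0.1, y_chd)   # confident, wrong prediction
print(loss_good < loss_bad)  # True
```

With w = 0.2, an error on a non-CHD case contributes one fifth of the loss that the same error on a CHD case would contribute, which is how the weighting shifts attention toward the minority class.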

Model fine-tuning for CHD subtype detection

To evaluate the capability of CHDdECG to detect the major CHD subtypes (each with a proportion over 0.5%), we fine-tuned the trained CHDdECG model to predict whether a case has the characteristics of a given CHD subtype. Before the fine-tuning phase, we initialized the CHDdECG model with the parameters trained for overall CHD detection. During fine-tuning, we froze the parameters of the 1D ConvBlocks and the first 1D ResBlock in each sequential path and trained the other parameters on the target subtype for two further epochs. In these fine-tuning phases, we only used the target subtype cases and the non-CHD cases. CHDdECG was fine-tuned with the loss in Eq. (5) (with w = 1). In contrast to the class-weighting strategy adopted when training CHDdECG for CHD detection, here we only employed an oversampling strategy to balance the sampling probabilities of the target subtype cases and the non-CHD cases, since the number of samples varied across CHD subtypes.
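One simple way to realize such oversampling is to draw training indices with replacement so that each class is equally likely at every draw. The sketch below illustrates this idea only; the text does not specify how the oversampling was implemented, and the function name `balanced_sample_indices` and the toy label vector are hypothetical.

```python
import numpy as np


def balanced_sample_indices(y, n_draws, rng):
    """Draw indices so that both classes are equally likely per draw."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    # Each sample's probability is 0.5 divided by the size of its class.
    prob = np.where(y == 1, 0.5 / n_pos, 0.5 / n_neg)
    return rng.choice(len(y), size=n_draws, replace=True, p=prob)


rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)  # toy labels: 5% target subtype
idx = balanced_sample_indices(y, n_draws=10_000, rng=rng)
print(y[idx].mean())  # close to 0.5
```

Because the balance is obtained through sampling, the same procedure applies unchanged to subtypes of very different sizes, which is consistent with using w = 1 in Eq. (5) during fine-tuning.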

Importance score computing

Using TabBlocks also facilitated the computation of feature importance scores, following TabNet66. The Attentive Transformer module in the top-most TabBlock generates a data-specific sparse attention mask m, whose elements were in [0, 1), so as to find useful features and exclude useless ones. The elements of m could be interpreted as the importance of the features. We denote mn,i,j as the importance score of the jth value in the heatmap for the ith feature type obtained using the nth ECG case. For better viewing, we computed the average importance scores of the scalar features (\({\bar{m}}_{i,j}\)), the overall importance score of the ith feature type, ηi, i ∈ {w, c, e} (shown in Fig. 3e), and the feature type importance scores on each individual case, by

$${\bar{m}}_{i,j}=\frac{\mathop{\sum }\nolimits_{n=1}^{N}{m}_{n,i,j}}{N},\,{\eta }_{i}=\mathop{\sum }\limits_{j=1}^{{n}_{i}}{\bar{m}}_{i,j},\,{\bar{m}}_{n,i}=\mathop{\sum }\limits_{j=1}^{{n}_{i}}{m}_{n,i,j},$$
(6)

where N is the number of ECG cases, and ni denotes the number of scalar features belonging to the ith feature type.
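The three aggregations in Eq. (6) reduce to a mean over cases and sums over the features of each type. The NumPy sketch below assumes that the masks of all cases are stacked into one array of shape (N, F) and that each feature type occupies a contiguous slice of the F columns. The function name `aggregate_importance` and the toy values are hypothetical.

```python
import numpy as np


def aggregate_importance(masks, type_slices):
    """masks: (N, F) attention masks; type_slices: type name -> slice."""
    # Average importance of each scalar feature over the N cases.
    m_bar = masks.mean(axis=0)
    # Overall importance of each feature type (eta_i in Eq. (6)).
    eta = {k: float(m_bar[s].sum()) for k, s in type_slices.items()}
    # Importance of each feature type on each individual case.
    per_case = {k: masks[:, s].sum(axis=1) for k, s in type_slices.items()}
    return m_bar, eta, per_case


# Toy example: 2 cases, 2 wavelet features followed by 3 concept features.
masks = np.array([[0.1, 0.3, 0.0, 0.2, 0.4],
                  [0.3, 0.1, 0.2, 0.0, 0.4]])
slices = {"w": slice(0, 2), "c": slice(2, 5)}
m_bar, eta, per_case = aggregate_importance(masks, slices)
print(m_bar)  # per-feature averages over the two cases
```

In this toy example the wavelet type receives an overall score of 0.4 and the concept type 0.6, and both cases happen to have the same per-type scores.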

Evaluation metrics

We comprehensively evaluated the prediction performance of CHD detection using several evaluation metrics. We employed specificity, sensitivity, the area under the receiver operating characteristic curve (ROC-AUC), and the Brier score, which are suitable for imbalanced classification tasks. We also reported the probabilistic predictions by box plot. The definitions of these metrics are specified as follows:

  • The sensitivity is a measure to evaluate how the model can predict the true positive cases, which is defined as

    $$\,{{\mbox{sensitivity}}}\,=\frac{{T}_{p}}{{T}_{p}+{F}_{n}},$$
    (7)

    where Tp and Fn denote the case amounts of true positives and false negatives, respectively.

  • The specificity is a measure to evaluate how the model can predict the true negative cases, which is defined as

    $$\,{{\mbox{specificity}}}\,=\frac{{T}_{n}}{{T}_{n}+{F}_{p}},$$
    (8)

    where Tn and Fp denote the case amounts of true negatives and false positives, respectively.

  • The Brier score is a strict measure to evaluate how good the probabilistic predictions are, which is defined by

    $$\,{{\mbox{Brier score}}}\,=\frac{1}{N}\mathop{\sum }\limits_{t=1}^{N}{({f}_{t}-{o}_{t})}^{2},$$
    (9)

    where N is the size of the test set, ft is a probabilistic prediction, and ot is the corresponding ground truth label.

  • Since a higher sensitivity typically comes with a lower specificity and vice versa, we also report the ROC-AUC in Table 1. The ROC curve plots the true positive rate against the false positive rate at various decision thresholds, and the ROC-AUC is the area under this curve, which summarizes the trade-off in a single threshold-independent value. It provides a comprehensive evaluation of the model performance.

  • The probabilistic predictions were a statistic of the outcome probabilities yielded by the CHDdECG models for all the cases belonging to the target classes (CHD or some CHD subtypes). We displayed the probabilistic prediction outcomes by box plots in Table 1.
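The metrics above can be computed directly from the labels and the predicted probabilities, as in the following NumPy sketch. The decision threshold of 0.5 is an assumption for illustration, and the ROC-AUC is computed here through its equivalent pairwise-ranking form (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half). The function names are hypothetical.

```python
import numpy as np


def sensitivity_specificity(y_true, prob, threshold=0.5):
    """Eqs. (7) and (8) at a given decision threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(prob) >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fp = np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)


def brier_score(y_true, prob):
    """Eq. (9): mean squared error of the probabilistic predictions."""
    return float(np.mean((np.asarray(prob) - np.asarray(y_true)) ** 2))


def roc_auc(y_true, prob):
    """Probability that a random positive outranks a random negative."""
    y_true, prob = np.asarray(y_true), np.asarray(prob)
    pos, neg = prob[y_true == 1], prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative pairs
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


y_true = [1, 1, 0, 0]
prob = [0.9, 0.4, 0.3, 0.6]
sens, spec = sensitivity_specificity(y_true, prob)
print(sens, spec)                 # 0.5 0.5
print(brier_score(y_true, prob))  # approximately 0.205
print(roc_auc(y_true, prob))      # 0.75
```

The pairwise form compares every positive with every negative, which is adequate for small examples; for datasets of the size used in this study, a rank-based implementation would be preferable.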

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.