Introduction

Deep learning has been widely used in recent medical research, such as automatic diagnosis from clinical images1,2,3,4, recognition of human genes5, and cognitive neuroscience6,7. This technique also aids in electrocardiogram (ECG) pattern recognition, such as predicting demographic features8 and automatically identifying cardiovascular comorbidity9,10,11. Notably, several studies have employed deep learning to estimate obesity, defined as body mass index (BMI) > 25 kg/m2, from 12-lead ECGs, achieving moderate predictive accuracy12,13.

However, models predicting extremely low body weight using a cohort of low-weight individuals, such as patients with anorexia nervosa (AN), remain unexplored. Patients with AN frequently show ECG abnormalities such as QT prolongation and bradycardia14,15, for which guidelines and reviews recommend ECG monitoring16,17,18. These suggest the plausibility of inversely predicting extremely low body weight using ECG. Such a prediction could be useful for patients with AN who deny their disease status19 and falsify their weight, e.g., by hiding heavy objects in their clothes20, and for those in intensive care21, for whom standard weight measurements may be difficult to perform. Furthermore, if such a prediction is feasible, examining the rationale of the prediction could potentially lead to a deeper physiological understanding of ECG changes in patients with AN.

Thus, the present study had two objectives. The first was to assess the feasibility of using deep learning to predict extremely low body weight from 12-lead ECGs, which may be useful in specific cases, such as patients with AN attempting to falsify their weight. The second was to highlight the parts of ECGs associated with extremely low body weight by investigating the rationale of the model prediction.

Methods

Ethics approval

The present study was approved by the Institutional Review Board of the University of Tokyo (approval number: 3375-(10)). Due to the retrospective nature of the study using anonymized data, informed consent was obtained using an opt-out approach. This study was conducted in accordance with the ethical standards of the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Study participants and measurement

The present study included patients admitted to the Department of Psychosomatic Medicine at the University of Tokyo Hospital between November 2006 and June 2023. Most patients were underweight, primarily due to AN, and were hospitalized for nutritional rehabilitation. This cohort also included patients with other types of eating disorders, such as bulimia nervosa, and a small number of patients with other psychiatric and psychosomatic disorders.

During hospitalization, patients’ weights were measured weekly. Psychosomatic physicians measured the weight of patients with lightweight clothing after body checks. Patient height was recorded upon admission. In addition, 12-lead ECGs were typically recorded upon admission or during outpatient visits. The present study used ECGs measured on the nearest day within a month (31 days) of each weight measurement.

The records were randomly split into training (80%) and test (20%) datasets. To avoid data leakage during training, data splitting was performed on an individual basis; when multiple pairs of weight and ECG belonged to a single patient, resulting from repeated admissions, all pairs were allocated to the same set.

This dataset is not publicly accessible, and no relevant studies have used the same dataset.

Preprocessing of 12-lead electrocardiogram

The model input was a 12-lead ECG waveform formatted as a one-dimensional (1D) array of 12 channels. At the University of Tokyo Hospital, the waveform was recorded at 500 Hz for 10 s, resulting in an array of 5000 values (Fukuda Denshi, Tokyo). We explored the following preprocessing methods: (1) normalizing each waveform to have a mean voltage of 0.0 mV and a variance of 1.0 mV2; (2) reducing ECGs from a 10-s format to shorter segments, with randomly chosen starting points, following studies using durations shorter than 10 s9,22,23; (3) downsampling ECGs from 500 Hz to a lower frequency (e.g., to 100 Hz by selecting one out of every five data points), in line with relevant studies using frequencies lower than 500 Hz9,22,23; (4) denoising using discrete wavelet transform (DWT), which involved decomposition into eight sub-bands using Daubechies-4 wavelet, setting high-frequency noise (level 1 detail coefficient) and baseline wander (level 8 approximation coefficient) to zero, and applying inverse DWT to reconstruct ECGs (details described in previous studies22,23); and (5) setting any one of the 12 leads to zero. We evaluated the impact of each method and the degree of downsampling on prediction accuracy in cross-validation to determine their inclusion in the final model, as described in the following sections.

CNN model structure

A previous study revealed that the optimal BMI cutoff point for predicting the onset of refeeding syndrome, a severe complication of AN, was 12.6 kg/m224. Thus, we developed a convolutional neural network (CNN) that inputs a 12-lead ECG waveform to output a binary prediction of whether BMI is ≤ 12.6 kg/m2.

In the present study, the CNN model was designed for processing 1D data, similar to previous research using CNNs for processing ECG data8,9,10,11,12. To the best of our knowledge, no widely used models have been designed for 1D ECG datasets, unlike models such as ResNet, which are designed for processing 2D images. Therefore, referencing a model that successfully predicts obesity from 12-lead ECGs in a previous study12, we prepared a network with the same structure, consisting of three blocks of 1D convolutional layers and three blocks of fully connected layers (Fig. 1).

Figure 1
figure 1

Structure of deep learning model. The network consists of three blocks of one-dimensional convolutional layers and three blocks of fully connected layers. Kernel size, pooling size, number of channels in the convolution layers, and number of nodes in fully connected layers are indicated. ReLU rectified linear unit.

Each convolution layer had a kernel size of 5, followed by batch normalization, rectified linear unit, and max pooling with a size of 2. The numbers of output channels in the convolution layers were 48, 96, and 192 for each block. The features extracted from the convolution blocks underwent a global average pooling layer and were flattened into an array of 192 values and then processed using fully connected blocks containing 192, 64, and 32 nodes. Finally, a softmax layer outputs the binary probability from an array of 32 values. To prevent overfitting, we added dropout layers between the fully connected blocks, as our dataset was smaller than that of the referenced study12.

Model development process

AdamW with cross-entropy loss was used as the optimizer. In addition to determining whether to perform the aforementioned preprocessing methods, the batch size, number of epochs, learning rate, and weight decay were fine-tuned. These preprocessing methods and hyperparameters were selected to maximize the area under the curve (AUC) of the receiver operating characteristic (ROC) curve in fivefold cross-validation within the training dataset. Thereafter, the prediction performance was evaluated using the test dataset.

Deep learning model development was implemented using Python 3.9.18 and PyTorch 2.1.0. Statistical analyses were performed using R version 4.3.1.

Prediction rationale interpretation

The following analyses were performed using the test dataset to evaluate the ECG features associated with model predictions. First, the final (i.e., third) 1D convolutional layer output was visualized using the gradient-weighted class activation map (Grad-CAM). This 1D heatmap was resized through interpolation and overlaid on a two-dimensional plot of the 12-lead ECGs12. Additionally, the correlation coefficient between the prediction score (probability of BMI ≤ 12.6 kg/m2) and ECG features, such as heart rate, was quantified. These ECG features were calculated from the ECG waveform independently from the model development and were not directly used as the model input.

Results

Dataset characteristics

A total of 888 pairs of ECG and BMI data from 391 inpatients were available. The median age (range) was 23 (12–84) years, with 857 (96.5%) of the cases being female. The median BMI (range) was 13.6 (9.2–45.0), and 316 cases (35.6%) had a BMI below the cutoff (≤ 12.6 kg/m2). Figure 2 shows typical examples of ECGs for an individual with a BMI ≤ 12.6 kg/m2 and another with a BMI > 12.6 kg/m2. Of these, 701 sets were allocated to the training dataset and 187 to the test dataset.

Figure 2
figure 2

Typical examples of 12-lead ECG. Lead II wave of an individual with (a) BMI ≤ 12.6 kg/m2 and (b) BMI > 12.6 kg/m2.

Prediction accuracy

Among the preprocessing methods examined, normalization, shortening the duration from 10 to 3 s, and downsampling the frequency from 500 to 250 Hz contributed to improvements in AUC during fivefold cross-validation and thus were incorporated into the final model. Conversely, models using DWT and setting several leads to zero exhibited lower AUC than those without this preprocessing; therefore, these methods were not employed. After fine-tuning the hyperparameters, the average AUC in the fivefold cross-validation reached 0.787 using a batch size of 256, 25 epochs, a learning rate of 0.005, and a weight decay of 0.005. These parameters were employed for test prediction.

The test AUC of the final CNN model was 0.807 (95% confidence interval [CI], 0.745–0.869), achieving a sensitivity of 0.702 and a specificity of 0.796 for the Youden index maximum cutoff (see Fig. 3 for the ROC curve). Table 1 shows the confusion matrix at this optimal cutoff point. The overall accuracy was 0.749; the positive predictive value was 0.776; and the negative predictive value was 0.725.

Figure 3
figure 3

Receiver operating characteristic curve of the prediction for the test dataset. The dot indicates the cutoff point that maximizes the Youden index.

Table 1 Confusion matrix of the model.

Visualization of heatmaps from Grad-CAM

Figure 4 shows examples of heatmaps generated by Grad-CAM overlaid on the ECG plots. Areas in red-to-yellow hues greatly influence the prediction, whereas areas closer to blue have minimal impact. In both true positive (high prediction score and BMI ≤ 12.6 kg/m2) and true negative (low prediction score and BMI > 12.6 kg/m2) examples, the model appeared to primarily focus on the QRS waves. By contrast, in false positive (high prediction score but BMI > 12.6 kg/m2) and false negative (low prediction score but BMI ≤ 12.6 kg/m2) examples, no specific areas of focus were observed.

Figure 4
figure 4

Examples of the heatmap generated by Grad-CAM. (a) True positive (high prediction score and BMI ≤ 12.6 kg/m2), (b) true negative (low prediction score and BMI > 12.6 kg/m2), (c) false positive (high prediction score but BMI > 12.6 kg/m2), and (d) false negative (low prediction score but BMI ≤ 12.6 kg/m2).

Correlation between ECG features and model’s prediction score from waveform

Table 2 lists the descriptive statistics of the ECG features and the correlation coefficients with the prediction scores of the test dataset. The prediction scores range from 0.0 to 1.0 and approximate the likelihood of BMI ≤ 12.6 kg/m2. A relatively strong negative correlation was observed between the prediction scores and QRS voltage.

Table 2 ECG features and their correlation coefficients with prediction scores.

Discussion

In the present study, we demonstrated that a deep learning model, trained on a dataset consisting of low-weight individuals, achieved moderate accuracy (AUC, 0.807; 95% CI, 0.745–0.869) in predicting extremely low body weight, defined as BMI ≤ 12.6 kg/m2, using 12-lead ECG as input. Additional analyses highlighted the specific ECG features associated with the prediction.

Weight measurement in patients with AN is crucial owing to the risk of severe complications such as refeeding syndrome24, severe liver dysfunction25, and thrombocytopenia25. However, these patients tend to deny their disease status19 and may even falsify their weight by concealing heavy objects in their clothes during weigh-ins20. Furthermore, these patients may require intensive care21, where standard height and weight measurements may be difficult. Detecting extremely low body weight using a 12-lead ECG may be helpful in these cases. Such tools may become even more important given the reported increase in the prevalence of eating disorders since the onset of the COVID-19 pandemic26. However, deep learning models for ECGs are highly susceptible to even slight perturbations27, and defenses against such noise in deep learning have been widely investigated28. Future studies are required to assess the vulnerabilities of the models developed in the present study.

The predictive accuracy of the CNN model in the present study exceeded that of previous research aimed at predicting obesity (BMI > 25 kg/m2)12,13. These studies used datasets comprising > 10,000 individuals, possibly leading to considerable heterogeneity with wide ranges of comorbidities and ages. This heterogeneity may have complicated the prediction task. In contrast, our study used data predominantly from relatively young patients with AN showing monotonous physical comorbidity at a single institution, which may simplify the prediction task. To further evaluate the predictive ability of the deep learning model, validation using external datasets is required.

The Grad-CAM heatmap focused on the QRS waves, consistent with a strong correlation observed between a reduced QRS voltage and an elevated probability of extremely low body weight. These findings could be related to the frequent occurrence of pericardial effusions in patients with AN29, which can cause a low QRS voltage30. Although previous literature has indicated that QRS voltages are typically reduced in patients with AN15, the results of the present study suggest a quantitative relationship between greater weight loss and lower voltage, which constitutes a new finding.

Larger QT dispersion was positively, albeit insignificantly, associated with a higher likelihood of extremely low body weight. This might be related to previous findings that patients with AN exhibit greater QT dispersion than healthy controls31. In addition, although QTc prolongation is common in patients with AN14,15, its relationship with the CNN model prediction scores was insignificant. QTc prolongation is reportedly attributed to electrolyte disturbance rather than low body weight32,33, which may partially explain the lack of a significant relation.

The present study has several limitations. First, the sample size for deep learning development was relatively small. Enlarging the sample size or pre-training the model using an external dataset is desirable in future research. This limitation regarding the small sample size may have affected the model development process, such as the preprocessing methods. Second, the present study relied on data from usual clinical practice, in which ECG and weight measurements were not necessarily performed on the same day. This time lag may have influenced the results. Thus, future studies in which ECG and weight are measured simultaneously are warranted. Third, the limited availability of electronic medical records precluded us from obtaining detailed diagnostic categories such as restricting or binge-purging types. Fourth, in the present study, because the model inputs the waveform itself, available feature selection methods were limited. Future studies that can use more diverse feature selection methods are desirable. Finally, the present study used data collected at a single institution, potentially introducing some bias, such as data containing mostly patients with extremely severe AN under similar treatment. Further verification is required to determine whether the model developed in the present study can be applied to external data collected from multiple facilities.

In conclusion, using deep learning is feasible for predicting extremely low body weight based on 12-lead ECGs, which may be useful for specific cases, such as patients with AN attempting to falsify their weight. In addition, the model showed that several ECG features, such as lower QRS voltage, may be associated with extremely low body weight.