Introduction

Left ventricular systolic dysfunction (LVSD) is associated with a more than 8-fold increase in the risk of heart failure and a nearly 2-fold increase in the risk of premature death1. Early diagnosis can effectively mitigate this risk2,3,4, but LVSD is frequently diagnosed only after patients develop symptomatic disease, owing to the lack of effective screening strategies5,6,7. Artificial intelligence (AI) can detect LVSD from electrocardiograms (ECGs), a diagnosis that has traditionally relied on comprehensive echocardiography or other cardiac imaging, which is resource-intensive and difficult to deploy in broad screening strategies8,9. Although AI-ECG is a promising screening tool for detecting LVSD, these algorithms have been developed on clinically obtained 12-lead ECGs. Advances in wearable and handheld technologies now enable point-of-care acquisition of single-lead ECG signals, paving the way for more efficient and scalable screening with AI-ECG technologies10,11. This improved accessibility could enable broader AI-based screening for LVSD, but the reliability of such tools is limited by the noise present in data collected from wearable and handheld devices12,13. Consequently, models for detecting LVSD from portable device ECGs may degrade in real-world settings, performing worse than they did on the single-lead signals derived from clinical ECGs during model development14,15.

In the absence of large, labelled datasets of wearable ECGs, the development of algorithms that detect underlying structural heart disease on wearable devices relies on single-lead signals adapted from 12-lead ECGs in clinical ECG libraries. However, this process does not specifically account for the unique data acquisition challenges encountered with wearable ECGs, possibly contributing to their inconsistent diagnostic performance. Indeed, several sources of noise exist in wearable data, arising from factors such as poor electrode contact with the skin, movement and muscle contraction during the recording, and external electrical interference16,17,18,19. This noise has practical implications, as models perform worse when tested on all available wearable ECG data than on selected high-quality subsets15. This marked, noise-dependent difference in performance has limited wearable device-based screening programs; one wearable device-based atrial fibrillation screening study disqualified 22% of patients because of insufficient signal quality12. Accounting for this noise is a prerequisite for developing broadly accessible models that can form the basis of effective screening programs for LVSD in the community.

In the present study, we hypothesize that a novel, noise-enhanced training approach can boost the performance of wearable-adapted, single-lead ECG models for accurate and noise-agnostic identification of LVSD. Our method, which relies on training with single-lead ECG data derived from clinical ECGs and augmented with custom noise patterns constructed in frequency ranges relevant to specific ECG noise signatures, explicitly accounts for, and generalizes to, multiple real-world noise patterns, including noise isolated from a portable ECG device.

Results

Study population

There were 2,135,846 consecutive 12-lead ECGs performed at the Yale-New Haven Hospital (YNHH) between 2015 and 2021, 440,072 of which had accompanying transthoracic echocardiograms (TTEs) acquired within 15 days of the ECG. We developed the models on 385,601 of the ECG-TTE pairs, representing 116,210 unique patients, who had a complete 12-lead ECG recording (Fig. S1). The signal from Lead I, the standard lead obtained from wearable devices11, was then isolated from each 12-lead ECG. All selected single-lead ECG recordings contained 10 s of Lead I signal at 500 Hz. The single-lead ECGs were then split at the patient level into training, validation, and test datasets (85%-5%-10%).
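As a minimal illustration, a patient-level split of this kind can be performed with scikit-learn's GroupShuffleSplit; the paper does not state which tooling was used, and the DataFrame and column names below (`ecg_df`, `patient_id`) are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(ecg_df: pd.DataFrame, seed: int = 42):
    """Split ECG-TTE pairs into 85/5/10 train/validation/test sets by patient."""
    # First carve out the 10% held-out test set at the patient level.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=seed)
    dev_idx, test_idx = next(outer.split(ecg_df, groups=ecg_df["patient_id"]))
    dev, test = ecg_df.iloc[dev_idx], ecg_df.iloc[test_idx]

    # Then split the remaining 90% so that 5% of the original data becomes validation.
    inner = GroupShuffleSplit(n_splits=1, test_size=5 / 90, random_state=seed)
    train_idx, val_idx = next(inner.split(dev, groups=dev["patient_id"]))
    return dev.iloc[train_idx], dev.iloc[val_idx], test
```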

Of these ECGs, 56,894 (14.8%) were from patients with LVSD, defined as an LV ejection fraction (LVEF) below 40% on the paired TTE. Additionally, 40,240 (10.4%) had an LVEF between 40% and 50%, and the remaining 288,467 (74.8%) had an LVEF of 50% or higher. Patients had a median age of 68 years (IQR 56–78) at the time of ECG recording, and 50,776 (43.7%) were women. A total of 75,928 (65.3%) patients were non-Hispanic White individuals, 14,000 (12.0%) were non-Hispanic Black, 9,349 (8.0%) were Hispanic, and 16,843 (14.5%) were from other racial and ethnic backgrounds (Table S1).

Detection of LV systolic dysfunction

The noise-adapted model was trained on a noise-augmented development set. High- and low-pass filtering was used to isolate five-minute samples of random Gaussian noise within four frequency ranges that encompass real-world ECG noise: 3–12 Hz, 12–50 Hz, 50–100 Hz, and 100–150 Hz. The 3–12 Hz range was selected to emulate motion artifact due to tremors20,21, while the 12–50 Hz and 100–150 Hz ranges encompass the more frequently occurring lower- and higher-frequency muscle activation artifacts, respectively17,21. The 50–100 Hz range was selected to represent electrode motion noise21. This range and the 100–150 Hz range, which together contain multiples of 50 and 60 Hz, the two mains power frequencies encountered during ECG acquisition17, also serve to emulate powerline interference17,21. These noise samples were then used to generate the noise-augmented development set, in which each ECG was selectively noised with a random 10-second sequence of one of the four frequency-banded Gaussian noises at one of four signal-to-noise ratios (SNRs) each time it was loaded. The standard model was trained on the original training set (described in Methods, Isolation of Frequency-Banded Gaussian Noise and Methods, Noise Augmentation).

Both models were trained to detect an LVEF below 40%, a threshold present in most heart failure diagnosis guidelines4 and consistent with prior work8,22. On a held-out test set without added noise, the noise-adapted and standard models performed similarly, with AUROCs of 0.90 (95% CI 0.89–0.91) and 0.90 (95% CI 0.88–0.91), respectively (p = 0.60 by DeLong's test comparing the areas under the receiver operating characteristic curves [AUROCs]23,24). The corresponding areas under the precision-recall curve (AUPRCs) were 0.46 and 0.48, respectively. With thresholds selected separately for each model to achieve a sensitivity of 0.90 in the validation set without added noise, the noise-adapted model had a specificity and sensitivity of 0.68 and 0.92, respectively, and a positive predictive value (PPV) and negative predictive value (NPV) of 0.20 and 0.99, respectively, in the held-out test set. The standard model had a specificity and sensitivity of 0.69 and 0.91, respectively, and a PPV and NPV of 0.21 and 0.99, respectively. The noise-adapted model's performance was comparable to the standard model across subgroups of age, sex, and race (Table 1).

Table 1 Performance of noise-adapted model on test ECGs without noise across demographic subgroups.

Standard and noise-adapted model performance on noised ECGs

The performance of each model was also tested on four distinct real-world noise recordings, including three half-hour recordings of electrode motion, muscle artifact, and baseline wander noise sourced from the publicly available MIT-BIH Noise Stress Test Database17. Both models were tested on separate versions of the held-out test set augmented with each of these three real-world noises at seven different signal-to-noise ratios (described in Methods, Acquisition of Real-World, Public ECG Noise Recordings, and Methods, Noise Augmentation).

The noise-adapted model's performance was comparable across all SNRs for each MIT-BIH noise, with AUROCs of 0.86–0.89, 0.87–0.89, and 0.88–0.89 for electrode motion, muscle artifact, and baseline wander noise, respectively, across SNRs from 0.5 to 2.0. The standard model performed worse at every SNR for every noise, with AUROCs of 0.79–0.86, 0.81–0.86, and 0.80–0.86 for electrode motion, muscle artifact, and baseline wander noise, respectively (Table 2 and Fig. 1).

Table 2 Performance of noise-adapted and standard model on noise-augmented test set ECGs across different types of noise.
Fig. 1: Standard and noise-adapted model performance at increasing levels of noise.
figure 1

AUROC area under receiver operating characteristic curve, ECG electrocardiogram.

Standard and noise-adapted performance with portable ECG device noise

Real-world portable ECG device noise, isolated using a Fast Fourier Transform-based technique on a 30-second, single-lead ECG recording from a KardiaMobile 6 L portable ECG device, was also used to evaluate both models (Methods, Noise Extraction from a Portable Device ECG). Each model was used to generate predictions for noise-augmented ECGs across all SNRs, with the noise-adapted model’s AUROC ranging from 0.87 to 0.89. The standard model’s performance was significantly lower at each SNR, ranging from 0.72 to 0.83. This difference was most pronounced at an SNR of 0.5, at which the noise-adapted model retained an AUROC of 0.87 (95% CI 0.86–0.88) and the standard model had an AUROC of 0.72 (95% CI 0.71–0.74, p-value < 0.001) (Table 2 and Fig. 1).

Standard and noise-adapted performance with multiple simultaneous noise types

Three unique multiple-noise recordings were generated by adding the MIT-BIH electrode motion, baseline wander, and muscle artifact noise recordings to the portable ECG device noise at an SNR of 1.0. These recordings were then used to evaluate each model's performance on ECGs augmented with multiple noise signatures simultaneously. The noise-adapted model's performance was comparable across all SNRs for the combinations of portable ECG device noise with electrode motion, muscle artifact, and baseline wander noise, with AUROCs of 0.86–0.89, 0.87–0.89, and 0.88–0.89, respectively, across SNRs from 0.5 to 2.0. The standard model performed worse at every SNR for every noise combination, with AUROCs of 0.77–0.85, 0.77–0.85, and 0.78–0.86 for portable ECG device noise combined with electrode motion, muscle artifact, and baseline wander noise, respectively (Table S3).

Explaining performance differences in standard and noise-adapted models

To gain mechanistic insight into how the performance of the standard and noise-adapted models varies between standard and noise-augmented data, we visually and quantitatively assessed differences in the last-layer outputs of both models for noised and standard ECGs. For this, we focused on the output of the 320-dimensional last fully connected layer of each model before the final prediction. These prediction vectors were collected for five versions of the same 1000 ECGs—once for the original ECG, and once for each of the four real-world noises. The variation in these vectors due to the addition of noise was visualized using uniform manifold approximation and projection (UMAP)25, which constructs a two-dimensional representation of the 320-dimensional prediction vectors for each noise and model combination. The magnitude of the shift in the UMAP projection due to noise augmentation, which approximates the shift in the underlying prediction vectors, reflects how susceptible each model's predicted probabilities are to noise and artifacts in the signal. This shift was also quantified using scaled Euclidean distances between prediction vectors for the same ECG with and without each type of noise for both models.

For the standard model, the prediction vectors for the noised ECGs were visually distinct from those of the standard ECGs, despite representing the same set of 1000 ECGs and differing only in the added noise. For the noise-adapted model, however, there was no visual separation between standard and noised ECGs, indicating that the predictions of the noise-adapted model are more resilient to noise than those of the standard model (Fig. 2).

Fig. 2: UMAP projections of last-layer predictions of standard and noise-adapted model for standard and noise-augmented ECGs.
figure 2

UMAP uniform manifold approximation and projection, ECG electrocardiogram.

Quantitatively, the scaled Euclidean distances between prediction vectors for noised and standard versions of the ECGs were lower for the noise-adapted model across all four noises. For noise derived from the portable ECG device, for instance, the average scaled Euclidean distance was 0.50 (95% CI 0.49–0.51) for the standard model and 0.41 (95% CI 0.40–0.42) for the noise-adapted model. Similarly, for baseline wander noise, the distances were 0.52 (95% CI 0.51–0.53) and 0.36 (95% CI 0.36–0.37), respectively (Table 3).

Table 3 Scaled Euclidean distance between last-layer predictions of noise-adapted and standard models for noise-augmented and standard ECGs across different noise types at an SNR of 0.5.

Discussion

We developed a novel strategy that automates the detection of hidden signatures of cardiovascular diseases, such as LVSD, adapted for noisy single-lead ECGs obtained on wearable and portable devices. Using this approach, we developed a noise-adapted deep-learning algorithm that accurately identifies LV systolic dysfunction from single-lead ECG data and is resilient to substantial noise artifacts, despite not having encountered these specific noise patterns during model development. Specifically, the algorithm retains excellent discriminatory performance even on ECGs containing twice as much noise as signal, a feature that makes it well suited to wearable device-based screening strategies. Notably, the algorithm was developed and validated in a diverse population and demonstrates consistent performance across age, sex, and race subgroups. The noise-adapted approach defines a novel paradigm for building robust, wearable-ready, single-lead ECG cardiovascular screening models from clinical ECG repositories, with significant potential to expand screening for LV structural cardiac disorders to low-resource settings with limited access to hospital-grade equipment.

Noise-adapted training of deep-learning algorithms represents a relatively novel field of AI research, focused on expanding the use of AI tools to everyday life by accounting for noise and artifacts that may otherwise preclude their reliable use. Models trained on clinical ECGs have been applied to wearable device ECGs, but have traditionally shown significantly lower performance on wearable device ECGs than on held-out clinical ECG test sets14,15. Additionally, models that have successfully generalized to wearable device ECGs have relied on automated algorithms to select high-quality subsets of the available data, with reduced performance when evaluated on all collected ECGs15,26. While such algorithms can improve performance, this approach has limitations. Studies have shown high rates of poor signal quality among data recorded by wearable devices, and significant variation in quality across manufacturers13. Excluding large portions of collected data may limit the scope of community-wide screening programs, which may use several different devices to collect ECGs and may not be able to collect repeated ECGs per person until a quality threshold is met. A model that generalizes to ECGs with varying types and levels of noise may therefore be necessary for implementation in such settings.

Due to the lack of publicly available wearable device ECG datasets, training models directly on wearable ECG data is challenging. Current 12-lead ECG-based models depend on investments by health systems to incorporate these tools into digital ECG repositories, and are thereby limited to individuals who already seek care in those systems. Beyond the restriction of screening to clinically indicated ECGs, this technology may not be available or cost-effective for smaller hospitals and clinics with limited access to digital ECGs.

Wearable devices make ECG acquisition more accessible and allow for community-wide screening, an important next step in the early detection of common and rare cardiomyopathies. In this regard, our approach represents a major advance from both a methodological and a clinical standpoint. First, it augments clinical ECG datasets in such a way that it enables reliable modelling of noisy, wearable-derived, single-lead ECG signals. Second, it demonstrates that, through noise augmentation, single-lead ECG models can retain the diagnostic performance of 12-lead ECG models, as shown here for the detection of LV systolic dysfunction. Additionally, this approach avoids the unnecessary exclusion of collected data, which increases its generalizability across device platforms and is particularly important for community-wide screening programs, which may not be able to collect multiple ECGs per person or consistently meet high signal quality thresholds for each participant.

Our approach also offers a strategy for examining how the models achieve better performance. Compared with the current approach to model development, our noise-adapted approach resulted in the selective suppression of noise relative to signal, even for noise types the model had not encountered before, while preserving the model's ability to discern complex hidden labels. This property is particularly important for a model intended for use on wearable devices, which capture ECGs in unpredictable settings with varying types and magnitudes of noise.

This study has several limitations. First, the model was developed using ECGs from patients who had both an ECG and a clinically indicated echocardiogram. Although this population differs from the intended broader real-world use of the algorithm as a screening tool for LV systolic dysfunction among individuals without clinical disease, the consistent performance across demographic subgroups suggests the model is robust and generalizable. Nevertheless, prospective assessments in the intended screening setting are warranted. Second, model performance may vary with the severity of LV systolic dysfunction. Although the LVEF threshold of 40% was selected for its therapeutic implications4, performance among patients with an LVEF near this cutoff may differ from that among individuals with an LVEF substantially higher or lower than 40%. This may also be attributable to imprecision in LVEF measurement by echocardiography, which has been shown to be less precise than other modalities, such as magnetic resonance imaging27,28. Finally, four distinct types of randomly generated noise were used during training, and randomly selected sequences of four real-world noises were used at multiple signal-to-noise ratios to evaluate performance on the test set. Although this suggests that model performance generalizes well to unseen noise, we cannot ascertain whether it is maintained for every type and magnitude of noise possible on wearable devices, including device-specific noise signatures.

In conclusion, we developed a novel strategy that automates the detection of hidden signatures of cardiovascular diseases, such as LVSD, adapted for noisy single-lead ECGs obtained on wearable and portable devices. Using this approach, we developed a single-lead ECG algorithm that accurately identifies LV systolic dysfunction despite substantial noise artifacts, suggesting a novel approach for developing wearable-adapted tools from clinical ECG repositories.

Methods

Study design

The study was reviewed by the Yale Institutional Review Board, which approved the study protocol and waived the need for informed consent as the study represents secondary analysis of existing data. The data cannot be shared publicly.

The study was designed as a retrospective analysis of a cohort of 116,210 patients, 18 years of age or older, who underwent clinically indicated ECGs with paired echocardiograms within 15 days of the index ECG at Yale-New Haven Hospital. To ensure the generalizability of our models, we applied no additional exclusion criteria and included patients of all sexes, races, and ethnicities (Table S1).

Data source and population

Raw voltage data for lead I were isolated from 12-lead ECGs collected at the Yale-New Haven Hospital (YNHH) between 2015 and 2021. Lead I was chosen because it is the standard lead obtained from wearable devices11. Each clinical ECG was recorded as a standard 10-second, 12-lead recording at a sampling frequency of 500 Hz. These ECGs were predominantly recorded using Philips PageWriter and GE MAC machines. Patient identifiers were used to link ECGs with an accompanying transthoracic echocardiogram within 15 days of the ECG. These echocardiograms had been evaluated by expert cardiologists, and the LVEF reported in their interpretation was used. If multiple echocardiograms were performed within the 15-day window, the one nearest in time to the ECG was used to define the LVEF for model development and evaluation.

Data preprocessing

A standard preprocessing strategy was used to isolate the lead I signal from the 12-lead ECGs, comprising median filtering and scaling to millivolts. The lead I signal was first isolated from each ECG, and a one-second median filter was calculated for and subtracted from each single-lead ECG to remove baseline drift. The amplitude of each sample in each recording was then divided by 1000 to scale the voltage recordings to millivolts.
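A minimal sketch of this preprocessing, assuming the raw lead I signal is a 5,000-sample NumPy array at 500 Hz and that SciPy is used for the median filter (the tooling and the variable name `raw_lead_i` are illustrative assumptions):

```python
import numpy as np
from scipy.signal import medfilt

FS = 500  # sampling frequency of the clinical ECGs (Hz)

def preprocess_lead_i(raw_lead_i: np.ndarray) -> np.ndarray:
    """Remove baseline drift with a one-second median filter and scale to millivolts."""
    # A one-second median filter approximates the slowly varying baseline;
    # the kernel must be odd, so 501 samples is roughly 1 s at 500 Hz.
    baseline = medfilt(raw_lead_i, kernel_size=FS + 1)
    detrended = raw_lead_i - baseline
    # Divide by 1,000 to convert the stored amplitudes to millivolts.
    return detrended / 1000.0
```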

Isolation of frequency-banded Gaussian noise

Random Gaussian noise within four distinct frequency ranges was isolated to train the noise-adapted model. High-pass and low-pass filters were applied to five minutes of random Gaussian noise to isolate noise within each of the frequency ranges: 3–12 Hz, 12–50 Hz, 50–100 Hz, and 100–150 Hz. Each frequency range was selected to model an element of real-world ECG noise. The 3–12 Hz range models motion artifact attributable to tremors, which occur within this frequency range20,21. The 50–100 Hz range reflects consistent electrode contact noise21, while the 12–50 Hz and 100–150 Hz ranges contain the lower- and higher-frequency muscle noises, respectively17,21. Additionally, the 50–100 Hz and 100–150 Hz ranges each contain multiples of 50 and 60 Hz, the two mains power frequencies encountered during ECG acquisition17, which make up powerline interference17,21 (Fig. 3).
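As a rough illustration, band-limited Gaussian noise of this kind can be generated by filtering white Gaussian noise with a Butterworth band-pass filter; the filter type and order below are assumptions (the text specifies only high- and low-pass filtering), and the variable names are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500            # sampling frequency (Hz)
DURATION_S = 300    # five minutes of noise
BANDS = [(3, 12), (12, 50), (50, 100), (100, 150)]  # Hz

def banded_gaussian_noise(low_hz, high_hz, fs=FS, seconds=DURATION_S,
                          order=4, seed=None):
    """Filter white Gaussian noise to the requested frequency band."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(fs * seconds)
    # Zero-phase Butterworth band-pass (equivalent to cascaded high- and low-pass filters).
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, white)

# One five-minute noise sample per frequency band.
noise_bank = {band: banded_gaussian_noise(*band, seed=i)
              for i, band in enumerate(BANDS)}
```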

Fig. 3: Noise extraction and ECG augmentation for noise-adapted training.
figure 3

ECG electrocardiogram, Hz hertz, s seconds.

Acquisition of real-world, public ECG noise recordings

Four real-world noise records, none of which were included in the training of either model, were used to test the models. These included three half-hour noise recordings from the MIT-BIH Noise Stress Test Database. The MIT-BIH noises, each obtained at a sampling frequency of 360 Hz, represent three types of noise frequently encountered in ECGs: baseline wander, a low-frequency noise produced by lead or subject movement17; muscle artifact, caused by muscle contractions21; and electrode motion artifact, caused by irregular movement of the electrodes during ECG recording17,21. Each of these recordings was obtained using a standard 12-lead ECG recorder with the electrodes positioned on the limbs such that the patients' ECG signals were not visible in the recordings (Fig. 4)17.
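These records can be retrieved from PhysioNet with the wfdb Python package; the record names ('bw', 'ma', 'em') and database directory ('nstdb') below reflect the public PhysioNet naming and are assumptions about tooling rather than a description of the authors' pipeline.

```python
import wfdb  # PhysioNet waveform-database tools

# Baseline wander, muscle artifact, and electrode motion records from the
# MIT-BIH Noise Stress Test Database (sampled at 360 Hz).
mitbih_noise = {}
for name in ("bw", "ma", "em"):
    rec = wfdb.rdrecord(name, pn_dir="nstdb")
    # Each record contains two noise channels; the first is used here for augmentation.
    mitbih_noise[name] = rec.p_signal[:, 0]
```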

Fig. 4: Noise augmentation of test set with MIT-BIH real-world noises.
figure 4

ECG electrocardiogram, s seconds.

Noise extraction from a portable device ECG

Real-world noise was also isolated from a 30-s, 300 Hz, single-lead ECG recording obtained using a KardiaMobile 6 L portable ECG device. The noise was extracted from the recording by applying a modified version of the Fourier transform-based approach previously used to denoise ECGs29. First, a fast Fourier transform (FFT) was applied to a 30-second ECG recording and the result was plotted in the frequency domain. Then, a threshold was manually selected to separate the high- and low-amplitude frequencies, which contained signal and noise, respectively. Finally, instead of computing the inverse FFT on the frequencies with amplitudes greater than this threshold, an inverse FFT was applied to all frequencies with amplitudes below the selected threshold, yielding the noise (Fig. 5).
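A minimal sketch of this extraction using NumPy's FFT, assuming the 30-second recording is available as a one-dimensional array and that the amplitude threshold has already been chosen by inspecting the spectrum (the variable names and the elided threshold value are illustrative):

```python
import numpy as np

def extract_noise_fft(ecg: np.ndarray, amplitude_threshold: float) -> np.ndarray:
    """Keep only low-amplitude frequency components, i.e., the noise."""
    spectrum = np.fft.rfft(ecg)
    # Zero out the high-amplitude components, which carry the ECG signal,
    # and invert only the remaining low-amplitude components to recover the noise.
    noise_spectrum = np.where(np.abs(spectrum) < amplitude_threshold, spectrum, 0.0)
    return np.fft.irfft(noise_spectrum, n=len(ecg))

# Example usage (threshold chosen manually after plotting np.abs(np.fft.rfft(kardia_ecg))):
# portable_noise = extract_noise_fft(kardia_ecg, amplitude_threshold=...)
```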

Fig. 5: Portable ECG device noise isolation and ECG noise augmentation.
figure 5

ECG electrocardiogram, s seconds, FFT Fast Fourier Transform, Hz hertz.

Noise augmentation

While training the noise-adapted model, each ECG was included in the training dataset twice, with one of the four frequency ranges for the generated noise randomly selected each time an ECG was loaded. A 10-s sequence was then randomly selected from the full five-minute length of the selected noise recording. This noise sequence was then added to the standard ECG at a signal-to-noise ratio (SNR) randomly selected from a set of SNRs, including 0.50, 0.75, 1.00, and 1.2530.
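A minimal sketch of this on-the-fly augmentation step, assuming an RMS-based definition of SNR (the text does not state whether SNR is defined on amplitude or power) and the `noise_bank` of banded Gaussian noise from the sketch above:

```python
import numpy as np

TRAIN_SNRS = (0.50, 0.75, 1.00, 1.25)
ECG_SAMPLES = 5000  # 10 s at 500 Hz

def augment_with_noise(ecg: np.ndarray, noise_bank: dict,
                       rng: np.random.Generator) -> np.ndarray:
    """Add a random 10-s noise segment from a random band at a random SNR."""
    band = list(noise_bank)[rng.integers(len(noise_bank))]
    noise = noise_bank[band]
    start = rng.integers(0, len(noise) - ECG_SAMPLES)
    segment = noise[start:start + ECG_SAMPLES]

    snr = rng.choice(TRAIN_SNRS)
    # Scale the noise so that rms(signal) / rms(scaled noise) equals the target SNR
    # (an RMS-based SNR definition is assumed here).
    scale = (np.sqrt(np.mean(ecg ** 2)) / np.sqrt(np.mean(segment ** 2))) / snr
    return ecg + scale * segment
```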

During model evaluation, noised versions of the test set were generated following the same procedure as for the noised training set, with several key modifications. First, the noise added to ECGs in the test set was sourced from either the baseline wander, electrode motion, or muscle artifact noise recordings from the MIT-BIH dataset, or from the 30-s noise sample from the KardiaMobile 6 L portable ECG device. Second, the MIT-BIH noises and the portable ECG noise were up-sampled from their original sampling frequencies, 360 Hz and 300 Hz, respectively, to 500 Hz to match the clinical device ECGs. Third, the specific 10-second sequence of noise added to each ECG in the test set was randomly selected once and fixed for that ECG, ensuring that every time any model was tested at any SNR for any specific noise, each individual ECG was always loaded with the same noise sequence. Model performance was evaluated separately for ECGs augmented with each noise type over a larger set of SNRs, including all the SNRs used in training and SNRs of 1.50, 1.75, and 2.00. Three additional multiple-noise recordings were generated by separately combining a randomly selected 10-second segment of portable ECG noise with randomly selected 10-second segments of each of the three MIT-BIH noise recordings at a 1:1 ratio. These multiple-noise recordings were then used to evaluate both models' performance on ECGs augmented with multiple simultaneous noise signatures across the same larger set of SNRs.
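The resampling and the fixed per-ECG noise segment might look roughly as follows; SciPy's polyphase resampler is an assumed tool, and seeding the generator with an ECG identifier is an assumed way of keeping the noise segment fixed across SNRs.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_500hz(noise: np.ndarray, original_fs: int) -> np.ndarray:
    """Up-sample a noise recording to the 500 Hz clinical ECG sampling rate."""
    # 360 Hz -> 500 Hz is a 25/18 ratio; 300 Hz -> 500 Hz is 5/3.
    return resample_poly(noise, up=500, down=original_fs)

def test_noise_segment(noise_500hz: np.ndarray, ecg_id: int,
                       n_samples: int = 5000) -> np.ndarray:
    """Deterministically select the same 10-s noise segment for a given ECG."""
    rng = np.random.default_rng(ecg_id)  # per-ECG seed (an assumed scheme)
    start = rng.integers(0, len(noise_500hz) - n_samples)
    return noise_500hz[start:start + n_samples]
```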

Outcome label

Each ECG included in the dataset had a corresponding LVEF value from a paired echocardiogram within 15 days of the ECG. The cutoff for low LVEF was set as LVEF < 40%, a threshold present in most heart failure diagnosis guidelines4, and consistent with prior work in this space8,22.

Model training

All unique patients represented in the set of ECGs were randomly split at the patient level into training, validation, and held-out test sets (85%, 5%, 10%). We built and tested multiple convolutional neural network (CNN) models with varying numbers and sizes of convolutional layers and total model parameters, and selected as the standard model the architecture that yielded the highest area under the receiver operating characteristic curve (AUROC) on the validation set with the fewest parameters. This architecture consisted of a (5000, 1, 1) input layer, corresponding to a 10-s, 500 Hz, lead I ECG, followed by seven two-dimensional convolutional layers, each followed by a batch normalization layer31, a ReLU activation layer, and a two-dimensional max-pooling layer. The output of the seventh convolutional layer was passed to a fully connected network consisting of two dense layers, each followed by a batch normalization layer, a ReLU activation layer, and a dropout layer with a dropout rate of 0.532. The output layer was a dense layer with one class and a sigmoid activation function. Class weights for the loss function were calculated using the effective number of samples re-weighting scheme33 so that learning was not affected by the lower frequency of LVEF < 40% relative to LVEF ≥ 40% (Fig. S2).
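A condensed Keras sketch of this architecture follows; the filter counts, kernel sizes, pool sizes, width of the first dense layer, and the β used in the effective-number-of-samples weighting are not reported in the text and are therefore illustrative placeholders (only the input shape, layer sequence, dropout rate, and 320-dimensional final fully connected layer come from the description above).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_conv: int = 7, base_filters: int = 16) -> tf.keras.Model:
    """CNN over a (5000, 1, 1) single-lead ECG; layer widths are illustrative."""
    inputs = layers.Input(shape=(5000, 1, 1))
    x = inputs
    for i in range(n_conv):
        x = layers.Conv2D(base_filters * 2 ** min(i, 4), kernel_size=(7, 1), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(pool_size=(2, 1))(x)
    x = layers.Flatten()(x)
    for units in (320, 320):  # two dense layers; the final one is 320-dimensional
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

def effective_number_class_weights(n_pos: int, n_neg: int, beta: float = 0.999) -> dict:
    """Class weights from the effective number of samples scheme; beta is illustrative."""
    weights = (1.0 - beta) / (1.0 - np.array([beta ** n_neg, beta ** n_pos]))
    weights = weights / weights.sum() * 2  # normalize so the weights average to 1
    return {0: float(weights[0]), 1: float(weights[1])}
```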

Both models were trained on the Keras framework in TensorFlow 2.9.1 and Python 3.9 using the Adam optimizer. First, the models were trained at a learning rate of 0.001 for one epoch. The learning rate was then lowered to 0.0001 and training was continued until performance on the validation set did not improve for three consecutive epochs. The epoch with the highest performance on the validation set was selected for each model.
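The two-stage training schedule might look roughly like this in Keras, continuing the architecture sketch above; the batch size and the placeholder arrays (`x_train`, `y_train`, `x_val`, `y_val`) are assumptions.

```python
import tensorflow as tf

model = build_model()  # from the sketch above
class_weights = effective_number_class_weights(
    n_pos=int(y_train.sum()), n_neg=int(len(y_train) - y_train.sum()))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])

# Stage 1: one epoch at a learning rate of 0.001.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=1, batch_size=64, class_weight=class_weights)

# Stage 2: lower the learning rate to 0.0001 and train until validation
# performance stops improving for three consecutive epochs, keeping the best epoch.
tf.keras.backend.set_value(model.optimizer.learning_rate, 1e-4)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_auroc", mode="max",
                                              patience=3, restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=64, class_weight=class_weights,
          callbacks=[early_stop])
```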

Learning representation assessments for noise-adapted model

To visualize the variation between predictions for standard and noise-augmented data for each model, we first modified both the standard and noise-adapted models by removing their final output layers so that each model instead produced the 320-dimensional output of its final fully connected layer. We then randomly selected a subset of 1000 ECGs from the held-out test set. For each of the two models, we generated the 320-dimensional prediction vectors five times for each of the 1000 ECGs—once without noise, and once augmented at an SNR of 0.5 with each of the four noises used for testing. We then visualized the variation in the predictions separately for the standard and noise-adapted models using uniform manifold approximation and projection (UMAP), which constructs a two-dimensional representation of the 320-dimensional prediction vectors25. The variation in predictions was assessed numerically by calculating pairwise Euclidean distances between the 320-dimensional prediction vectors for the standard and noised data for each of the four noises. These Euclidean distances were then scaled on a per-model and per-noise basis by dividing by the total range of pairwise Euclidean distances for each model and noise combination. The average scaled Euclidean distance and a 95% confidence interval were then calculated for each model and noise combination.
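A sketch of the embedding extraction, distance scaling, and UMAP projection, assuming the trained Keras model from the architecture sketch above and the umap-learn package; the layer indexing and the array names (`emb_clean`, `emb_noised`) are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
import umap  # umap-learn package

def embedding_model(trained: tf.keras.Model) -> tf.keras.Model:
    """Drop the sigmoid output layer so the model emits the 320-dimensional embedding."""
    return tf.keras.Model(trained.inputs, trained.layers[-2].output)

def scaled_euclidean(emb_clean: np.ndarray, emb_noised: np.ndarray) -> np.ndarray:
    """Per-ECG Euclidean distances, scaled by the range of distances for this model/noise."""
    dists = np.linalg.norm(emb_clean - emb_noised, axis=1)
    return dists / (dists.max() - dists.min())

def umap_projection(emb_clean: np.ndarray, emb_noised: np.ndarray) -> np.ndarray:
    """Two-dimensional UMAP projection of clean and noised embeddings together."""
    reducer = umap.UMAP(n_components=2, random_state=0)
    return reducer.fit_transform(np.vstack([emb_clean, emb_noised]))
```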

Statistical analysis

Summary statistics are presented as counts (percentages) and medians (interquartile range, IQR) for categorical and continuous variables, respectively. Model performance was evaluated in the held-out test set both with and without added real-world noise. We used the area under the receiver operating characteristic curve (AUROC) to measure model discrimination. We also assessed the area under the precision-recall curve (AUPRC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and diagnostic odds ratio, choosing threshold values as the cutoffs that achieved a sensitivity of 0.90 in the validation data. We used DeLong's test to compare the AUROCs of the noise-adapted and standard models for each noise at each signal-to-noise ratio23. The 95% confidence intervals for AUROC were calculated using DeLong's algorithm23,24. A paired t-test was used to compare the scaled Euclidean distances between last-layer prediction vectors of the noise-adapted and standard models. All analyses were performed using Python 3.9, and the level of significance was set at an alpha of 0.05.
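Selecting a probability cutoff that achieves 0.90 sensitivity in the validation data can be done from the ROC coordinates; the sketch below uses scikit-learn, and the specific selection rule (the highest threshold whose sensitivity is at least 0.90, which preserves as much specificity as possible) is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_sensitivity(y_true: np.ndarray, y_prob: np.ndarray,
                              target_sensitivity: float = 0.90) -> float:
    """Return the highest threshold whose sensitivity (TPR) meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    eligible = thresholds[tpr >= target_sensitivity]
    return float(eligible.max())
```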