Background & Summary

The 12 lead ECG is a standard non-invasive clinical tool for the diagnosis and long-term monitoring of cardiovascular disease. To support cardiac disease classification and interpretation of 12 lead ECGs in clinical practice, algorithms based on machine learning are increasingly utilized. Training these algorithms requires large databases of 12 lead ECGs that are labeled with high accuracy according to the desired disease classifications and that represent the target population. The most extensive publicly available database for this purpose to date is PTB-XL1.

Clinical 12 lead ECG databases like PTB-XL, however, have several limitations that reduce the efficacy of machine learning algorithms2. As the databases are typically obtained from multiple medical centers, different filtering levels may be applied to reduce noise. Labeling uncertainties may arise due to differences in expertise or judgment between clinicians. Patient enrollment can also lead to both gender bias3 and uneven representation of certain cardiac diseases4. Furthermore, such databases provide limited insight into the underlying mechanisms of cardiovascular disease. Databases of synthetic ECGs have the potential to complement and enrich5,6, or in the long run even to replace7, clinical datasets to overcome such limitations. Currently, no sizeable and open synthetic ECG databases are available due to the high computational cost and limitations in modeling complete four-chamber cardiac electrophysiology in silico at scale.

We thus aimed to assemble the first public database of labeled synthetic 12 lead ECGs by joining two independent multi-scale models of atrial and ventricular electrophysiology used to compute P waves and QRS complexes, respectively. This approach provides a complete chain of traceability from the anatomical and electrophysiological input parameters of the model to the final 12 lead ECGs. Common diseases were modeled mechanistically in addition to normal healthy control within the synthetic database. Within the ventricular-torso model, the pathologies of myocardial infarction (MI) and complete bundle branch block of both the left ventricle (LBBB) and the right ventricle (RBBB) were modeled. The MI class comprised 6 sub-classes pertaining to the three predominant arteries of right-anterior descending (RAD), left anterior descending (LAD), and left circumflex (LCX)8, each with two different transmural extents. The diseases fibrotic atrial cardiomyopathy (FAM), complete interatrial conduction block (IAB) and left atrial enlargement (LAE) were modeled within the atria. In addition, 1st degree AV block (AVB) was modeled as an atrio-ventricular (AV) conduction-based disease. In this way, the chosen pathologies cover a wide range of both atrial and ventricular diseases representing conduction disturbances as well as structural remodeling, for which established modeling approaches from previous work were available. A total of 16,900 synthetic ECGs equally distributed into the 8 groups (healthy control and 7 cardiac pathologies) make up the MedalCare-XL database, which is publicly available under the Creative Commons Attribution 4.0 International license9. Thus, we provide a large and balanced ECG dataset with precisely known ground truth labels of the underlying pathology as derived from the mechanistic multi-scale simulations.

Validation of the synthetic ECG database was performed using two approaches to analyze to what extent the synthetic ECG database could represent clinical ECG databases. First, we tested the MedalCare-XL dataset of simulated ECGs by comparing the statistical distribution of crucial ECG features extracted from MedalCare-XL with the same features taken from the clinical PTB-XL1 database for normal healthy ECGs and for different pathology classes. The comparison showed excellent qualitative agreement, while still exhibiting quantitative differences that provide a starting point for future improvement of the underlying models as well as of the quality of future simulation databases. Second, two clinical Turing tests were conducted to evaluate the ability of the generated synthetic ECG signals to represent clinical signals undergoing ECG diagnostics by cardiologists. The first test required trained cardiologists to determine the origin of both measured and simulated 12 lead ECGs under normal healthy control. The second test additionally involved pathology classification. Both tests were performed on a subset of 50 synthetic ECG signals extracted from the database and mixed with 50 clinical signals taken from PTB-XL1. Altogether, the MedalCare-XL database provides the first example of a large-scale dataset of physiologically realistic simulated ECGs.

Methods

We separate the genesis of the 12 lead ECG into P waves and the QRST complex, modeled by two separate atrial and ventricular-torso models. Generation of the anatomical model cohorts and the simulation of electrophysiology to mimic a large patient population is described for both the atrial and ventricular models. Having run single beat simulations for P waves and QRST complexes separately in the two independent models, both signal parts had to be merged in a post-processing step to obtain an ECG of a full heart cycle comprising one P wave, one QRS complex and one T wave. Subsequently, the single heartbeat was repeated with varying RR intervals to account for heart rate variability (HRV) to obtain a time series signal of 10 s length. An overview of the pipeline for generating the synthetic 12 lead ECG database is shown in Fig. 1. The entire ECG dataset described in the manuscript is available online under the Creative Commons Licence CC-BY 4.0 (ref. 9). The anatomical model cohort of the atria is publicly available under the Creative Commons Licence CC-BY 4.0 (ref. 10). Subject data acquired at the Medical University of Graz which were used to construct the cohort of ventricular-torso models can only be shared with additional IRB approval and subject consent. Requests should be directed to the IRB of the Medical University of Graz with reference to their vote EKNr 24–126 ex 11/12. The data from the participants were used to generate this work but are not part of the published dataset.

Fig. 1

Pipeline for the generation and validation of the synthetic 12 lead ECG database using individual multi-scale models of the atria and the ventricles.

Anatomical model populations

Ventricles

A cohort of anatomically-specific ventricular-torso models was generated for 13 healthy subjects (8 M, 5 F) ranging from 30 to 65 years of age. All subjects were part of a clinical study approved by the ethical review board at the Medical University of Graz (EKNr: 24–126 ex 11/12). Written and informed consent for each subject was obtained at the time of the study. Two separate MRI scans of the full torso and whole heart were sequentially acquired using standardized protocols at 3 T (Magnetom Skyra, Siemens Healthcare, Erlangen, Germany). The torso MRI (1.3 × 1.3 × 3.0 mm³) was acquired in four overlapping stacks using a non-ECG gated 3D T1-weighted gradient-echo sequence. The whole heart MRI (0.7 × 0.7 × 0.7 mm³) was acquired using an ECG-gated, fat-saturated, T2-prepared, isotropic 3D gradient-echo sequence. Respiratory navigators were employed to gate the MR acquisition under free breathing to end-expiration. MRI-compatible electrodes for recording the 12 lead ECG of each subject were left in place during image acquisition. Intensity thresholding techniques implemented in Seg3D11 were used to segment each torso MRI into heart, lungs, and general torso tissue. Segmentation of the cardiac MRI was performed automatically using a two-kernel convolutional neural network. The network was adapted for MRI from an original network implemented for computed tomography images12. Segmented structures included blood pools, ventricles, and general atrial tissue. To automatically register the four-chamber heart segmentation into the torso, an iterative closest point algorithm was utilized in Seg3D11,13. Anatomical meshes were generated automatically from the joint segmentations using the Tarantula software meshing package14. Target resolutions of 1.2 mm and 4.0 mm were prescribed within the cardiac and torso surfaces, respectively.
All models within the cohort were equipped with universal ventricular coordinates (UVCs) to allow for automated manipulation of all geometric-based entities15,16. The entire framework for the generation of the ventricular-torso model cohort is described in detail in Gillette et al.15. The ventricular-torso model cohort comprising geometries \({\Gamma }_{V,i},i\in [1,13]\) is visualized in Fig. 2.

Fig. 2

Cohort of ventricular-torso models derived from clinical MRIs. Tissues include lungs, blood pools, atrial tissue, ventricles, and general torso. Parameters dictating ventricular electrophysiology for normal healthy control were varied through physiological ranges. Disease conditions of BBB and MI were then modeled by making adaptations to the model.

Atria

An overview of the anatomical model cohort generated for the atrial simulations is shown in Fig. 3. A total of 125 anatomical models \({\Gamma }_{A,h,i},i\in [1,80]\) and \({\Gamma }_{A,LAE,i},i\in [1,45]\) of the atrial endocardium were derived from a bi-atrial statistical shape model10,17. The endocardial surfaces were augmented with a homogeneous wall thickness of 3 mm, rule-based myocardial fiber orientation, tags for anatomical structures and interatrial connections as described by Azzolin et al.18,19. Out of these 125 geometries, 80 models exhibited left and right atrial volumes in physiological ranges reported for healthy subjects20. In these geometries, nine different fractions from 5% to 45% of the atrial myocardial tissue volume were additionally replaced by fibrotic patches as described previously21 to model atrial cardiomyopathy. The remaining 45 anatomical models were generated by constraining the coefficients of the statistical shape model such that left atrial volumes were increased to value ranges typically observed in left atrial enlargement patients20. Additionally, 25 torso geometries \({\Gamma }_{T,i},i\in [1,25]\) were obtained by modifying the coefficients of the two leading eigenmodes in the human body statistical shape model constructed by Pishchulin et al.22. In this way, height, weight and gender differences were represented in the anatomical torso model cohort. By applying random rotation angles \({\alpha }_{x},{\alpha }_{y},{\alpha }_{z}\) and translation parameters \({t}_{x},{t}_{y},{t}_{z}\) in ranges summarized in Table 4 to the atrial geometry, heart location and orientation variability were additionally accounted for in the virtual population.

Fig. 3

Anatomical model cohort for atrial simulations. 80 atrial geometries with physiological left and right atrial volumes were derived from a bi-atrial statistical shape model17 and served as a basis for normal healthy control simulations. 9 different volume fractions of these models were additionally replaced by fibrosis for simulations of fibrotic atrial cardiomyopathy. Interatrial conduction block signals were generated by blocking conduction in Bachmann’s Bundle in the same 80 geometries. Furthermore, 45 geometries with enlarged left atrial volumes were generated. As for the torso anatomy, 25 geometries were derived from a human body statistical shape model to account for height, weight and gender differences in the virtual patient cohort. Moreover, the rotation angle as well as the spatial position of the atria inside the torso were varied in physiological ranges.

Simulation protocol and parameters

Ventricles

Under normal healthy control, activation of the ventricles was assumed to be Durrer-based23, where the His-Purkinje System was modeled assuming 5 fascicular sites of earliest breakthrough on a fast-conducting endocardium. Three fascicular sites were placed in the left ventricle (LV) on the anterior endocardium \({\overrightarrow{x}}_{lv,ant}\), posterior endocardium \({\overrightarrow{x}}_{lv,post}\), and the septum \({\overrightarrow{x}}_{lv,sept}\). Activation of the right ventricle (RV) was controlled using a site corresponding to the moderator band \({\overrightarrow{x}}_{rv,mod}\). An additional site \({\overrightarrow{x}}_{rv,sept}\) was also placed on the right-ventricular septum. All fascicular sites were defined in UVCs. The RV moderator band was placed in the middle of the RV free wall. The transmural depth of the remaining fascicular sites was assumed to be constant at 20% of the ventricular free wall. The fascicles were assumed to be disc-shaped with a transmural thickness of 0.5% of the ventricular wall and a radius controlled through an additional parameter \(\overrightarrow{r}\) that relates to the endocardial extent. Activation was assumed to be simultaneous, apart from a prescribed delay \({\overrightarrow{t}}_{mod}\) in the activation of the RV moderator band site.

To model the fast spread of conduction on the endocardial surface of the ventricles mediated by the His-Purkinje System, a fast-conducting endocardium was also included that spanned from 10% to 90% of the ventricular mesh along the apico-basal direction. Details of the His-Purkinje representation are available in Gillette et al.15. An isotropic conduction velocity of 2.0 m/s was prescribed within the fast-conducting endocardium24.

Myocardial fiber directions were applied using a rule-based method25 that assumed principal fiber directions rotate radially from 60.0° on the endocardium to −60.0° on the epicardium (ref. 26). Corresponding sheet fiber directions of −65.0° and 25.0° were applied, respectively26. A conduction velocity of 0.6 m/s was applied along the principal direction of the myocardial fibers, with an off-axis conduction velocity ratio of 4:2:1 (ref. 27). Conductivity within the myocardium was set according to Roberts et al.28. All remaining conductivities within the volume conductor containing lungs, blood pools, atria, and general torso tissue were set according to Keller et al.29.

Ventricular myocyte electrophysiology was modeled using the Mitchell-Schaeffer ionic model \({\overrightarrow{i}}_{sinus}\)30. A resting membrane voltage of −86.2 mV and a peak action potential voltage of 40 mV were assumed. Gradients in action potential duration (APD) within the myocardium, needed to establish physiological T waves, were generated by utilizing a known relationship between the \({\tau }_{close}\) parameter and APDs. A linear combination of the UVCs weighted with given weights \({\overrightarrow{q}}_{w}\) was first computed at each node of the mesh. The weighted UVC gradients were mapped into a range between \(AP{D}_{min}\) and \(AP{D}_{max}\) to generate an APD map within the entirety of the ventricles. Values for the gradients and the APD were derived from the literature31,32,33. In total, variation in electrophysiology during normal healthy control was controlled through 20 variable parameters summarized in the parameter vector \({\overrightarrow{\omega }}_{qrs}\) for the QRS complex:

$${\overrightarrow{\omega }}_{qrs}=\{{\overrightarrow{x}}_{lv,ant},{\overrightarrow{x}}_{lv,post},{\overrightarrow{x}}_{lv,sept},{\overrightarrow{x}}_{rv,mod},{\overrightarrow{x}}_{rv,sept},{\overrightarrow{t}}_{mod}\}$$
(1)

and \({\overrightarrow{\omega }}_{t}\) for the T wave:

$${\overrightarrow{\omega }}_{t}=\{{\overrightarrow{i}}_{sinus},AP{D}_{min},AP{D}_{max},{\overrightarrow{q}}_{w}\}.$$
(2)
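The APD mapping described above can be illustrated with a minimal Python sketch. It assumes a simple min-max rescaling of the weighted UVC combination into the interval between \(AP{D}_{min}\) and \(AP{D}_{max}\); the function name, the node coordinates, the weights and the APD bounds are hypothetical and chosen for illustration only, not taken from Table 2.

```python
def apd_map(uvc_nodes, weights, apd_min, apd_max):
    """Map a weighted sum of UVCs at each node linearly into [apd_min, apd_max]."""
    # Weighted linear combination of the coordinates at every node.
    combined = [sum(w * c for w, c in zip(weights, node)) for node in uvc_nodes]
    lo, hi = min(combined), max(combined)
    if hi == lo:
        # Degenerate case without any gradient: assign the midpoint everywhere.
        return [0.5 * (apd_min + apd_max)] * len(combined)
    # Min-max rescaling (an assumed normalization) into the APD range.
    return [apd_min + (v - lo) / (hi - lo) * (apd_max - apd_min) for v in combined]


# Hypothetical example: three nodes with two UVC components each.
nodes = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
apds = apd_map(nodes, weights=(1.0, 1.0), apd_min=200.0, apd_max=300.0)
```

In this toy case the three nodes receive APDs of 200, 250 and 300 ms. The per-node APD would then be translated into a \({\tau }_{close}\) value through the relationship mentioned in the text, which is not reproduced here.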

All geometric-based parameters could be mapped into the mesh using kD-trees implemented in meshtool34. Parameters relating to both the QRS complex and T wave under normal healthy control were varied in physiological ranges to generate variation in the QRST complex as reported in Tables 1 and 2, respectively. Sampling through the ranges for each of the parameters was done using Latin hypercube sampling.
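As an illustration of the sampling strategy, the following Python sketch draws a Latin hypercube design using only the standard library. Each parameter range is divided into as many equal strata as there are samples, and every stratum is used exactly once per parameter. The function name and the two parameter ranges are hypothetical; the actual ranges are those reported in Tables 1 and 2, and the original sampling implementation may differ.

```python
import random


def latin_hypercube(ranges, n_samples, seed=0):
    """Return n_samples parameter tuples with one point per stratum and parameter."""
    rng = random.Random(seed)
    columns = []
    for low, high in ranges:
        width = (high - low) / n_samples
        # One uniformly distributed point inside each of the n_samples strata.
        points = [low + (k + rng.random()) * width for k in range(n_samples)]
        # Shuffle so that strata are paired randomly across parameters.
        rng.shuffle(points)
        columns.append(points)
    # Transpose: one tuple of parameter values per sample.
    return list(zip(*columns))


# Hypothetical ranges for two parameters (placeholders, not values from the tables).
ranges = [(0.4, 0.8), (0.0, 10.0)]
samples = latin_hypercube(ranges, n_samples=5)
```

Compared with independent uniform sampling, this design guarantees that the full range of every parameter is covered even for a small number of simulation runs.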

Table 1 Model parameters for the electrophysiology within the ventricular simulations generating QRS simulations.
Table 2 Model parameters for the electrophysiology within the ventricular simulations generating T waves simulations.

The two pathologies of BBB and MI were then modeled in the ventricles alongside normal healthy control. Pathologies of LBBB and RBBB were included in the ventricular-torso model. To cause a complete branch block, all fascicular root sites within either the LV or the RV were omitted, inhibiting fascicular activation of that ventricle. All other relevant electrophysiology parameters were allowed to vary in the same ranges as reported for normal healthy control above.

An MI stemming from occlusion of one of the three primary arteries of RAD, LAD, and LCX was inserted into the ventricles. For each of the arteries \(\nu \in \{RAD,LAD,LCX\}\), a core center \({\overrightarrow{x}}_{\nu ,mi}\) was defined using the apico-basal and rotational UVC coordinate values that were bounded according to recommendations of affected regions on the clinical 17-segment model determined by the American Heart Association (AHA)8. Namely, the LAD was restricted to the anterior-anteroseptal region spanning the entire apico-basal extent. Both the RAD and LCX extended less apically, and were confined to the lateral wall and the inferior-inferoseptal regions, respectively. For each artery, the infarct was assumed either to span the entirety of the ventricular wall or to have a transmural extent of 30% from the endocardium, giving rise to a transmural extent value \({\rho }_{n,mi}\) such that \(n\in \{0.3,1.0\}\). The outer 5% of the infarct area was allocated to be border zone (BZ), and the remaining area was defined as the infarct core. All scars were assumed to be left-sided, thus presenting only in the LV.

From each infarct center, an Eikonal activation map was computed within the ventricular geometry assuming the same conduction velocity and off-axis ratios as assigned in the general myocardium during normal healthy control. An infarct geometry was obtained by thresholding the activation map at the computed time that generated a radius of distance \({d}_{co}\). The infarct core was assumed to be electrically inert, while the conduction velocity in the BZ was set to 0.15 m/s with an off-axis ratio of 1.0 (ref. 35). The conductivity within the BZ was set to the same values reported for the healthy myocardium. Parameters for the Mitchell-Schaeffer ionic model within the BZ \({\overrightarrow{i}}_{BZ}\) were manually adjusted using bench to reproduce the characteristic action potential changes during MI36. In total, the MI class comprised 6 sub-classes. The parameters varied to induce various degrees and positions of MI \({\overrightarrow{\omega }}_{\nu ,mi}\) included:

$${\overrightarrow{\omega }}_{\nu ,mi}=\{{\overrightarrow{x}}_{\nu ,mi},{\rho }_{n,mi},{d}_{co}\}\,:\nu \in \{RAD,LAD,LCX\},n\in \{0.3,1.0\}$$
(3)

Parameters were similarly varied using Latin hypercube sampling through ranges based on clinical observation for characteristic occlusion sites and action potential changes (Table 3).

Table 3 Additional parameters were included to define infarct zones within the ventricular-torso model.

Transmembrane voltages were simulated using the efficient reaction-Eikonal method in the monodomain formulation without diffusion37. Electrical potentials of each electrode on the torso surface were recovered from transmembrane voltages using lead fields precomputed once for every model38. A ventricular 12 lead ECG (QRST complex) was generated by simulating a ventricular beat for 450 ms. All simulations were run using the CARPentry cardiac solver39 and the openCARP simulation framework40,41 on a desktop machine with 24 cores, parallelized into 3 threads.
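Once the electrode potentials have been recovered, the 12 standard leads follow from the conventional Einthoven, Goldberger and Wilson definitions. The Python sketch below shows this final step for a single time sample. The function name, the electrode keys and the example potentials are illustrative assumptions; the lead field computation itself is not reproduced.

```python
def twelve_lead(phi):
    """Derive the 12 standard leads from electrode potentials at one time sample.

    phi: dict with keys 'RA', 'LA', 'LL', 'V1'..'V6' holding potentials in mV.
    """
    ra, la, ll = phi["RA"], phi["LA"], phi["LL"]
    wct = (ra + la + ll) / 3.0  # Wilson central terminal
    leads = {
        "I": la - ra,                    # Einthoven limb leads
        "II": ll - ra,
        "III": ll - la,
        "aVR": ra - (la + ll) / 2.0,     # Goldberger augmented leads
        "aVL": la - (ra + ll) / 2.0,
        "aVF": ll - (ra + la) / 2.0,
    }
    for k in range(1, 7):
        # Precordial leads are referenced to the Wilson central terminal.
        leads[f"V{k}"] = phi[f"V{k}"] - wct
    return leads


# Hypothetical electrode potentials in mV.
phi = {"RA": -1.0, "LA": 1.0, "LL": 3.0,
       "V1": 0.5, "V2": 1.0, "V3": 1.5, "V4": 2.0, "V5": 2.5, "V6": 3.0}
leads = twelve_lead(phi)
```

By construction, the result satisfies Einthoven's law (I + III = II), and the three augmented leads sum to zero, which offers a quick consistency check on any derived 12 lead signal.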

Atria

Local activation times in the atria were obtained by solving the Eikonal equation with the Fast Iterative Method42 and the Fast Marching Method43. Excitation was initiated at the sinoatrial node with an exit site located at the junction of crista terminalis and the superior vena cava. Locally heterogeneous conduction velocity \({{\rm{CV}}}_{{\rm{[Region]}}}\) and anisotropy ratios \({{\rm{AR}}}_{{\rm{[Region]}}}\) for [Region] ∈ {bulk tissue, interatrial connections, crista terminalis, pectinate muscles, inferior isthmus} were modeled as summarized in Table 4. The spatio-temporal distributions of transmembrane voltages \({\rm{TMV}}(t,x)\) were subsequently derived from the computed activation times by shifting pre-computed Courtemanche et al. action potential templates \({\rm{TMV}}(t)\) in time. Remodeling of cellular electrophysiology was applied in fibrotic regions as described below. For all simulations except for those of fibrotic atrial cardiomyopathy, the baseline parameters of the Courtemanche et al. model remained unchanged in all atrial regions. The atria were placed inside a torso geometry and were rotated (\({\alpha }_{x},{\alpha }_{y},{\alpha }_{z}\)) and translated (\({t}_{x},{t}_{y},{t}_{z}\)) around and along all three coordinate axes to account for additional anatomical variability in the cohort. The forward problem of electrocardiography was solved with the infinite volume conductor method (for the normal healthy control cases and fibrotic atrial cardiomyopathy) or the boundary element method (for interatrial conduction block and left atrial enlargement). Single beat 12 lead ECGs of the P wave lasting 150–200 ms were subsequently extracted at standard electrode positions. In total, variation during healthy sinus rhythm simulations was controlled through the parameters summarized in the following vector

$${\omega }_{P}=\{{\overline{{\rm{CV}}}}_{[Region]},{\alpha }_{x},{\alpha }_{y},{\alpha }_{z},{t}_{x},{t}_{y},{t}_{z},{\overrightarrow{\lambda }}_{T,i},{\overrightarrow{\lambda }}_{A,i}\}.$$
(4)
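The derivation of transmembrane voltages from activation times by shifting an action potential template can be sketched in a few lines of Python. The template values below are a toy example and not the Courtemanche et al. action potential; the function name, the sampling interval and the assumption that the template starts and ends at rest are likewise illustrative.

```python
def tmv_at(template, lat_ms, t_ms, dt_ms=1.0):
    """Transmembrane voltage of one node at time t_ms.

    template: action potential samples in mV, first sample at activation onset,
              spaced dt_ms apart and assumed to start and end at rest.
    lat_ms:   local activation time of the node from the Eikonal solution.
    """
    idx = int(round((t_ms - lat_ms) / dt_ms))
    if idx < 0:
        return template[0]     # node not yet activated: resting value
    if idx >= len(template):
        return template[-1]    # template exhausted: back at rest
    return template[idx]


# Toy template and two nodes activated at 0 ms and 2 ms.
template = [-80.0, 20.0, 10.0, -40.0, -80.0]
lats = [0.0, 2.0]
snapshot = [tmv_at(template, lat, t_ms=3.0) for lat in lats]
```

At 3 ms, the first node is already repolarizing while the second node has just been depolarized, which mirrors how the spatial distribution \({\rm{TMV}}(t,x)\) arises from a single template and the activation map.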

For simulations of fibrotic atrial cardiomyopathy, nine different fractions from 5% to 45% of the healthy atrial myocardial volume were replaced by fibrotic tissue as described in detail by Nagel et al.21 in the same 80 atrial anatomical models that were employed for the healthy control simulations. In fibrotic patches, 50% of the cells were modeled as passive conduction barriers by removing the affected elements from the volumetric meshes. In the remaining 50% of the fibrotic cells, conduction velocity was reduced by a factor of 0.2 and 0.5 compared to the healthy baseline values in Table 4 in transversal and longitudinal fiber direction, respectively. In this way, anisotropy ratios were increased by a factor of 2.5, which typically facilitates functional reentry in patients with atrial fibrillation. To account for paracrine cytokine remodeling effects in fibrotic regions, maximum ionic conductances of the Courtemanche et al. cell model were rescaled (0.6×gNa, 0.5×gK1, 0.5×gCaL).
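The stated increase in anisotropy follows directly from the two scaling factors. Assuming that the anisotropy ratio is defined as the ratio of longitudinal to transversal conduction velocity (the symbols \(C{V}_{L}\) and \(C{V}_{T}\) are introduced here for illustration only), the ratio in the conducting fibrotic elements relative to healthy tissue is

$$\frac{{{\rm{AR}}}_{fib}}{{{\rm{AR}}}_{healthy}}=\frac{0.5\,C{V}_{L}/(0.2\,C{V}_{T})}{C{V}_{L}/C{V}_{T}}=\frac{0.5}{0.2}=2.5$$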

For left atrial enlargement simulations, 45 additional atrial geometries were derived from the bi-atrial statistical shape model. Constraints were applied to the coefficients of the leading eigenmodes to generate anatomical atrial models with systematically increasing left atrial volumes6. Different rotation angle combinations and conduction velocity variations were applied for the simulations as reported in Table 4.

Complete interatrial conduction block was modeled by inhibiting conduction propagation through the elements in Bachmann’s bundle at the junction between the left and the right atrium in the same 80 bi-atrial geometries that were used for the control simulations. Different combinations of rotation angles and spatial translations of the atria within the torso were applied for the ECG calculations.

Synthesization of complete ECGs

Signal components were synthesized to a full ECG using a heart rate variability (HRV) model to obtain 10 s recordings in accordance with the standard clinical 12 lead ECG. As atrial and ventricular ECGs were computed using different forward calculation methods, the amplitudes of QRST complexes were scaled according to the P waves prior to concatenation to ensure that signal amplitudes of single waveforms were consistent within one heartbeat. To this end, maximum P wave and R peak amplitudes were extracted in lead II of all clinical recordings from healthy subjects in PTB-XL1 using ECGdeli44. Based on these values, a multi-variate normal distribution was set up representing the relation between P wave and R peak amplitudes in clinical ECGs. In this way, the simulated QRST complex could be scaled with a factor sampled from this multi-variate probability distribution to match the corresponding amplitude of the simulated P wave. A PQ interval complying with the simulated P wave duration was selected likewise by drawing from a multi-variate normal distribution generated from clinical P wave duration and PQ interval values. Finally, the P waves and the scaled QRST complexes were concatenated using a sigmoid-shaped segment of a length determined by the difference of PQ interval and P wave duration. When synthesizing ECG segments for the 1st degree AV block class, the PQ interval was sampled from the range > 200 ms.
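One possible reading of the amplitude matching step is a draw from the conditional distribution of the R peak amplitude given the simulated P wave amplitude. The Python sketch below implements this for a bivariate normal distribution. It is an assumption about how such a sampling could be realized, not the procedure confirmed by the text, and all distribution parameters in the example are hypothetical placeholders rather than values fitted to PTB-XL.

```python
import math
import random


def sample_r_given_p(p_amp, mu_p, mu_r, sd_p, sd_r, rho, rng):
    """Draw an R peak amplitude from a bivariate normal conditioned on p_amp."""
    # Conditional mean and standard deviation of R given P.
    cond_mean = mu_r + rho * sd_r / sd_p * (p_amp - mu_p)
    cond_sd = sd_r * math.sqrt(1.0 - rho ** 2)
    return rng.gauss(cond_mean, cond_sd)


def qrst_scale_factor(sim_p_amp, sim_r_amp, rng, **dist):
    """Factor that scales the simulated QRST complex to the sampled R amplitude."""
    target_r = sample_r_given_p(sim_p_amp, rng=rng, **dist)
    return target_r / sim_r_amp


# Hypothetical distribution parameters (mV) and simulated amplitudes.
rng = random.Random(42)
factor = qrst_scale_factor(
    sim_p_amp=0.12, sim_r_amp=2.4, rng=rng,
    mu_p=0.10, mu_r=1.0, sd_p=0.03, sd_r=0.4, rho=0.3,
)
```

The same construction would apply to the PQ interval, with P wave duration taking the role of the conditioning variable.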

To account for heart rate variability in the simulated 10 s ECGs, we refrained from simply repeating the concatenated single heart beat multiple times. Instead, the heart rate variability model developed by Kantelhardt et al.45 was used to generate a series of RR intervals for an average heart rate within physiological ranges (50–90 bpm) determined from the QT interval of the respective simulation run using the multi-variate normal distribution. For each heart beat with a different RR interval, the signal was shrunk or stretched in the [QRSoff, Toff] interval, again by sampling values from a multi-variate normal distribution derived from clinical QRS duration, QT interval and RR interval values. After adding a sigmoid-shaped TP segment to connect subsequent heart beats in the defined RR interval, we obtained the final 10 s 12 lead ECG. The raw ECG signal was superimposed with realistic ECG noise that mimics the effects of electrode movement, baseline wander, and motion artefacts, as reported by Petrėnas et al.46. The amplitudes of the noise vectors were scaled based on a chosen signal-to-noise ratio between 15 and 20 dB.
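The noise scaling can be sketched as follows, assuming that the signal-to-noise ratio is defined as the ratio of mean signal power to mean noise power. Other definitions, for instance based on peak-to-peak amplitudes, would lead to a different gain. The function name and the toy signals are illustrative.

```python
import math


def scale_noise_to_snr(signal, noise, snr_db):
    """Scale a noise vector so that signal and noise reach the target SNR in dB."""
    p_signal = sum(s * s for s in signal) / len(signal)   # mean signal power
    p_noise = sum(n * n for n in noise) / len(noise)      # mean noise power
    # Solve 10*log10(p_signal / (gain^2 * p_noise)) = snr_db for the gain.
    gain = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [gain * n for n in noise]


# Toy example with a target SNR of 20 dB.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, 0.5, -0.5]
scaled = scale_noise_to_snr(signal, noise, snr_db=20.0)
noisy = [s + n for s, n in zip(signal, scaled)]
```

The scaled noise vector is then added sample by sample to the noise-free signal of the respective lead.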

Data Records

The MedalCare-XL dataset is publicly available on Zenodo9 under the Creative Commons Attribution 4.0 International license. Approximately 1,300 ECGs of 10 s length for each disease class are stored in csv format. Rows 1–12 contain the 12 leads of each ECG following the order I, II, III, aVR, aVL, aVF, V1-V6. All signals are sampled at 500 Hz, and amplitudes are in mV. Each signal is available in three different versions: ‘*_raw.csv’ contains the noise-free synthesized ECG, ‘*_noise.csv’ contains the synthesized ECG with superimposed realistic ECG noise46, and ‘*_filtered.csv’ contains the bandpass filtered version (Butterworth filters of order 3, cutoff frequencies of 0.5 Hz (highpass) and 150 Hz (lowpass)) of the synthesized ECGs with superimposed noise. For meaningful machine learning approaches, the signals are split into suggested subsets for training, validation and testing depending on the atrial and ventricular anatomical models the single simulation runs were based on, to make sure each anatomical model is only contained in one of the subsets. A detailed description of the structure of the MedalCare-XL dataset is shown in Table 5. Example ECGs of lead II for each disease are shown in Fig. 4(A). In Fig. 4(B), exemplary ECGs for each MI pathology class are shown corresponding to different occlusion sites and degrees of transmurality.
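A record in this layout could be read with the Python standard library as sketched below. The sketch assumes comma-separated values without a header row, which should be checked against the actual files; the function names are hypothetical, and the demo uses a tiny in-memory record instead of a real file from the dataset.

```python
import csv
import io

# Lead order and sampling rate as stated in the data description.
LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]
FS_HZ = 500


def read_ecg(handle):
    """Read one record (12 rows, one lead per row) into a dict of sample lists."""
    rows = [[float(v) for v in row] for row in csv.reader(handle) if row]
    if len(rows) != len(LEADS):
        raise ValueError(f"expected {len(LEADS)} rows, found {len(rows)}")
    return dict(zip(LEADS, rows))


def duration_s(ecg):
    """Signal length in seconds at the stated sampling rate."""
    return len(ecg["I"]) / FS_HZ


# In-memory toy record with two samples per lead (a real record holds 10 s).
demo = io.StringIO("\n".join(f"{i}.0,{i}.5" for i in range(12)))
ecg = read_ecg(demo)
```

For a file on disk, the same function can be called on a handle opened with `newline=""`, as recommended for the `csv` module. A full 10 s record at 500 Hz should then contain 5,000 samples per lead.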

Table 4 Model parameters for atrial simulations.
Fig. 4

(A) Exemplary 10 s ECGs (lead II) of each pathology class and a normal healthy control in the virtual cohort. (B) Exemplary 10 s ECGs (lead II) of each MI pathology class for different occlusion sites and degrees of transmurality.

Technical Validation

We employed two different approaches for the technical validation of the MedalCare-XL dataset of simulated, synthetic 12 lead ECGs, as described in the following. For a validation of the complete dataset, the statistical distributions of ECG features extracted separately for each class (healthy control and specific pathologies) from the records in the MedalCare-XL database9 were compared to the distributions of the corresponding features extracted from the clinical PTB-XL database that were recently summarized in the PTB-XL+ dataset47. In addition, we performed clinical Turing tests, in which the ability of expert cardiologists to distinguish the simulated ECGs from clinical ECGs was evaluated with representative samples from the MedalCare-XL and PTB-XL databases as described in detail below.

Feature distribution

To validate the simulated data against the statistical properties of clinically recorded ECGs, interval and amplitude features were extracted from the synthetic dataset and from PTB-XL using ECGdeli44 and compared to one another. Figure 5 shows the probability density functions for 6 timing and 5 amplitude features extracted from lead II of all ECGs in the healthy clinical and virtual cohort. Except for the T wave amplitudes, the feature values for the synthetic signals lie within the clinical and physiological ranges. However, the feature distributions from the clinical and the virtual data coincide only for the QRS duration. All other simulated timing and amplitude features cover only a subset of the clinically observed ranges. In Fig. 6, a comparison of feature distributions for healthy and pathological ECGs in the virtual cohort (top panel) and the clinical cohort (bottom panel) is visualized for timing or amplitude features that are clinically considered for a diagnosis of the respective disease.

Fig. 5

Comparison of features in the healthy clinical and virtual cohort. Probability density functions are shown for timing features (left column, from top to bottom: P wave duration, QRS duration, T wave duration, PQ interval, QT interval, RR interval) and amplitude features (right column, from top to bottom: P wave amplitude, Q/R/S peak amplitude, T wave amplitude). Blue and red curves represent the distributions calculated based on the clinical and the simulated data, respectively. The centered vertical lines highlight the mean value μ and the filled areas indicate the interval [μ − σ, μ + σ] with standard deviation σ.

Fig. 6

Comparison of features extracted from healthy (solid lines) and pathological (dotted lines) ECGs in the clinical (blue curves, bottom panel) and virtual (red curves, top panel) cohorts. Probability density functions are shown for selected timing or amplitude features that are clinically evaluated for a diagnosis of the displayed disease (from left to right: RBBB, LBBB, MI, 1AVB, LAO, IAB and FAM).

Clinical Turing tests

We aimed to ensure that the synthetic ECG signals correspond to the clinically measured signals with respect to ECG features which are characteristic for healthy cases. If cardiologists are not able to distinguish between measured and simulated ECG signals, this will increase confidence in the in-silico model as a surrogate for real clinical data. Therefore such a test can be considered as a clinical Turing test. For this, cardiologists were asked to perform an online Turing test to evaluate and to provide feedback on both healthy and pathological ECGs. A first clinical Turing test was conducted to determine the ability of the synthetic 12 lead ECGs within the database to pass as real clinical signals. In a second test, cardiologists were asked to determine the pathology of the signals as conducted routinely in ECG diagnostics. Under all clinical Turing tests, the PTB-XL1 database served as the basis for the measured signals and the simulated database described above was used for the synthetically generated signals.

Development of online platform for clinical Turing tests

In order to conduct clinical Turing tests, an online solution provided by the Know-Center (https://www.know-center.at), a research center for data science and artificial intelligence located in Graz, was used. The Know-Center extended its TimeFuse (https://ecgviewer.timefuse.io/public/login/turing) online signal data platform to include a survey feature and a plotter to visualize 12 lead ECG signals. The ECG plotter was designed specifically to present 12 lead ECGs in a typical visualization as seen by cardiologists in the clinic on chart paper. Namely, horizontal lines on the pink background correspond to 0.4 and vertical lines correspond to 0.1. The platform was also designed for hosting of multiple clinical Turing tests. Clinical Turing tests of either healthy signals or pathological signals could then be organized and conducted separately.

Conducting tests

In a first iteration, Turing tests were performed with normal healthy control ECGs to better understand the ability of the synthetic signals to pass as clinical signals under normal healthy conditions. For this purpose, five groups with 20 signals each were created, resulting in a total of 100 signals. For the measured ECGs, 50 signals were randomly selected from a subset of the PTB-XL database that contained only signals annotated as 100% healthy. For the generated ECGs, 50 signals under healthy sinus rhythm were randomly taken from the synthetic database described above. After pre-processing and filtering the 100 signals, the five groups were uploaded to the online platform and assigned to the survey participants. Within the test, expert cardiologists were required to evaluate whether each of the 100 ECG test cases was measured or generated. Clinicians were also allowed to refrain from answering, but a missing answer was counted as a false classification. All clinicians were also asked to provide the reasoning behind each classification. A total of 6 clinicians performed the test.

A similar test was also performed with pathological conditions to demonstrate that the synthetic ECGs of the various modeled pathological cases would be classified by expert clinicians at the same accuracy as real clinical signals and could not be distinguished from clinically measured ECGs taken from the PTB-XL database. The cases included myocardial infarction (MI), left bundle branch block (LBBB), right bundle branch block (RBBB), first degree AV block (1AVB), and left atrial overload/enlargement (LAO/LAE). Fibrotic atrial cardiomyopathy (FAM) and complete interatrial conduction block (IAB) were omitted as these diseases were not present within PTB-XL. Examples of the diseases are provided in Fig. 4(A,B).

Similar to the healthy Turing test, 50 generated ECG signals were taken from the synthetic database such that each of the five pathological classes was represented by 10 ECGs. The 50 measured ECGs were randomly selected from five subsets of the PTB-XL database, 10 cases per subset, where each subset only contained signals labeled as 100% pathological according to the 5 classes. Clinicians were asked to make at least one annotation for each of the 100 pathological 12 lead ECG signals, choosing from the 11 labels listed below:

  • 1AVB

  • atrial fibrillation (AFIB)

  • FAM

  • IAB

  • LAO

  • LBBB

  • MI

  • normal healthy control (NORM)

  • right atrial overload/enlargement (RAO/RAE)

  • RBBB

  • Wolff-Parkinson-White syndrome (WPW)

A total of two cardiologists responded.

Within the normal healthy control clinical Turing test, the six clinicians correctly classified 464 of the 600 cases, which corresponds to an accuracy of 77.33%. Conversely, 136 signals (22.67%) could not be correctly classified, including 62 (10.33%) synthetic and 74 (12.33%) measured ECGs, see Fig. 7(B). A detailed summary is given in Fig. 7(A,C). Primary ECG features leading to classification as simulated included fractionation or improper R wave progression in the QRS complex, a spiking or biphasic T wave, and a lack of physiological noise in the signals.
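The reported percentages follow directly from the assessment counts. The following minimal Python sketch recomputes them; the function name and dictionary keys are hypothetical and only serve as an illustration, and abstentions are assumed to be already contained in the counts of wrong answers, in line with the rule that a missing answer counts as a false classification.

```python
def type_classification_summary(n_assessments, wrong_synthetic, wrong_measured):
    """Summarise a measured-vs-synthetic Turing test from raw counts.

    Assumption: abstentions are already included in the 'wrong' counts,
    because a missing answer is treated as a false classification.
    """
    def pct(k):
        # Percentage of all assessments, rounded to two decimals.
        return round(100.0 * k / n_assessments, 2)

    wrong = wrong_synthetic + wrong_measured
    correct = n_assessments - wrong
    return {
        "correct": correct,
        "accuracy_pct": pct(correct),
        "wrong_pct": pct(wrong),
        "wrong_synthetic_pct": pct(wrong_synthetic),
        "wrong_measured_pct": pct(wrong_measured),
    }


# Healthy-control test: 6 clinicians x 100 ECGs = 600 assessments.
healthy = type_classification_summary(6 * 100, wrong_synthetic=62, wrong_measured=74)
print(healthy)
```

The same helper can be applied to any other test of this kind by passing the corresponding number of assessments and misclassification counts.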

Fig. 7

(Type classification) Healthy cases: (A) Classification results for each of the six expert clinicians for the five Turing tests and percentage of correct assessments. In summary, 62 of 300 assessments of the synthetic ECGs and 74 of 300 assessments of the measured ECGs could not be correctly classified by the experts. (B) Type classification matrix across all 600 assessments. (C) Results of the clinical Turing tests performed by 6 clinicians. Each row corresponds to a clinical Turing test and each square belongs to one of the 20 ECGs per test. Shown is the relative number of clinicians who correctly classified the corresponding signal. Pathological cases: (D) Type classification results for each of the two expert clinicians for the five Turing tests and percentage of correct assessments. In summary, 10 of 100 assessments of the synthetic ECGs and 24 of 100 assessments of the measured ECGs could not be correctly classified by the experts. (E) Type classification matrix across all 200 assessments. (F) Results of the clinical Turing tests performed by 2 clinicians. Each row corresponds to a clinical Turing test and each square belongs to one of the 20 ECGs per test. Shown is the relative number of clinicians who correctly classified the type of the corresponding signal.

Within the clinical Turing test on pathological ECGs, the two clinicians correctly classified the signals as either measured or simulated in 166 of the 200 cases, which corresponds to an overall accuracy of 83%. Conversely, the type of 34 signals (17%) could not be correctly classified, including 10 (5%) synthetic and 24 (12%) measured ECGs, see Fig. 7(E). A detailed summary is given in Fig. 7(D,F). Regarding the correct classification of the pathology, only 101 of the 200 assessments (50.5%) made by the two clinicians, covering both simulated and clinical signals, were correct. Namely, 38 of the 100 assessments of measured ECGs assigned the wrong pathology, resulting in an accuracy of 62%. By contrast, simulated pathologies were correctly classified in only 39% of assessments, with 61 being incorrect. A detailed summary is given in Fig. 8(A,B). The actual pathology and the diagnoses given by each clinician within the pathological clinical Turing test are provided in Fig. 8(C).

Fig. 8

(Pathology classification) (A) Pathology classification results for each of the two expert clinicians for the five Turing tests and percentage of correct assessments. In summary, 61 of 100 assessments of the synthetic ECGs and 38 of 100 assessments of the measured ECGs could not be correctly classified by the experts. (B) Pathology classification matrix across all 200 assessments. (C) Clinician-based results. Shown are the classifications by both clinicians for all ECG signals. For each ECG signal designated by a square, the top entries are the correct pathology and the bottom entries are the pathology actually selected by the user. Each row corresponds to a clinical Turing test and each square belongs to one of the 20 ECGs per test. (D) Confusion matrices.

Usage Notes

Separate models of atrial and ventricular electrophysiology that are individually more detailed and steerable were joined together to capture the P wave and the QRST complex within the 12 lead ECG, respectively. Cohorts of four-chamber models of cardiac electrophysiology48 could also be used for such a purpose and offer distinct advantages for modeling certain pathologies with atrio-ventricular dependencies. Such four-chamber cohorts, however, are not yet well suited for the generation of large ECG databases due to limited anatomical variation. While statistical shape models of the four-chamber heart have been generated to encode such anatomical variation, these models still lack the controllable electrophysiology needed to generate realistic signals. For example, repolarization in the ventricles has volumetric gradients (both transmural and apico-basal) that are needed for realistic T waves. Using a cohort of volumetric ventricular models for the QRST complex in combination with a statistical shape model in the atria overcomes such limitations. Furthermore, using both atrial and ventricular shape models does not necessarily mean that the two systems may be linked, which may lead to unphysiological configurations.

The feature analysis showed that the synthetic signals exhibit interval and amplitude features that are mostly in line with feature ranges reported in PTB-XL for the healthy and the pathological cohorts. From Fig. 6, it is apparent that the change in feature values extracted from healthy and diseased ECGs is consistent between the simulated and the clinical data even though absolute feature ranges sometimes deviate. However, the synthetic features neither cover the full range of feature values that occur in clinical practice nor follow distributions that coincide exactly with the clinical ones. This could be attributed to the fact that the atrial model population was parameterized using ECG biomarker ranges for P wave amplitudes and durations reported for extensive clinical cohorts partially comprising > 200,000 subjects49,50, which might lead to slightly different feature distributions compared to those extractable from PTB-XL. The QRST complexes were also parameterized according to experimental or clinical data obtained from smaller cohorts that may not be representative of the entire population, especially in terms of age (covered range: 30–65 years) and comorbidities (healthy subjects). Some parameters were also estimated as no direct clinical or experimental data are available for these entities. One such example is the heightened T wave amplitudes, which stem from repolarization gradients in the ventricles that generate large cardiac sources. While the occurrence of repolarization gradients is known31,32, the exact nature of such gradients is not well understood and thus hard to parameterize for a patient population. Therefore, the synthetic signals are not fully representative of an entire population, such as the one in PTB-XL.

The feature distributions in the synthetic cohort are, however, consistent in themselves, i.e., unrealistic combinations of different features are unlikely to occur. For example, the upper limit of RR intervals in the simulated healthy cohort does not exceed 1000 ms, while simultaneously, the QT interval also only covers lower ranges of the clinical QT interval values (compare Fig. 5 and Table 6). This is due to the fact that multivariate normal distributions were used during the synthesis procedure, ensuring that clinically reported correlations between ECG biomarkers (such as P wave duration and PQ interval or QT duration and RR intervals) are taken into account. This is also advantageous as it is possible to account for physiological responses that include alterations in the QT duration or PQ interval. In the case of exercise, for example, an increase in heart rate outside of the reported physiological range of 67–100 bpm can be accounted for by shortening the QT interval. Furthermore, detailed mechanistic electrophysiological models of the heart were employed and simulation parameters in reasonable ranges reported in literature were chosen, leading to realistic single beat P waves and QRST complexes in most cases. It must be noted that PTB-XL lacks clinical data for fibrotic atrial cardiomyopathy and for interatrial conduction block. Thus, fidelity assessment of ECG features within these two classes by means of a comparison to clinical data was not possible using the same clinical ECG resources. However, we already showed in previous work that the simulated P waves reproduce characteristic changes in key diagnostic ECG markers21,51. These include a prolongation of the P wave duration compared to the control simulations due to delayed depolarization in fibrotic patches as well as a retrograde activation of the left atrium through interatrial conduction pathways on the posterior wall. Moreover, as shown in Fig. 6, in interatrial conduction block patients, the morphology and therefore the P wave amplitude is markedly changed in lead aVL compared to the healthy cohort. In patients with fibrotic atrial cardiomyopathy, the most pronounced decrease in P wave amplitude due to scar tissue not contributing to the overall source distribution in the atria occurs in the lateral leads (compare Fig. 6).
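The role of multivariate normal distributions in keeping biomarker combinations consistent can be illustrated with a small Python sketch. It draws correlated RR and QT values from a bivariate normal distribution using only the standard library. All means, standard deviations and the correlation coefficient are hypothetical values chosen for illustration, not the parameters used to build the database, and the function names are likewise assumptions.

```python
import math
import random


def sample_correlated_pair(mean_a, sd_a, mean_b, sd_b, rho, rng):
    """Draw one (a, b) pair from a bivariate normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    a = mean_a + sd_a * z1
    # The Cholesky factor of the 2x2 correlation matrix mixes z1 into b.
    b = mean_b + sd_b * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return a, b


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical RR/QT statistics in ms, chosen only for illustration.
rng = random.Random(42)
pairs = [sample_correlated_pair(800.0, 80.0, 380.0, 25.0, 0.7, rng)
         for _ in range(5000)]
rr, qt = zip(*pairs)
print(round(pearson(rr, qt), 2))
```

Because both values share the same underlying random draw, short RR intervals tend to be paired with short QT intervals, which is the kind of consistency described above. Sampling each biomarker independently would not preserve this relation.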

Table 5 In the MedalCare-XL dataset two classes are available: (i) the WP2_largeDataset_Noise class, which contains the simulated ECG signals, and (ii) the WP2_largeDataset_ParameterFiles class, which contains all the parameter files used to run the simulations.
Table 6 Mean values μ and standard deviation σ for all features and all 12 leads for healthy simulated (“sim”) and healthy clinical (“clin”) ECG signals.

The clinical Turing tests aimed to investigate the ability of the 12 lead ECG signals to exhibit morphological features in accordance with clinical diagnostic criteria as routinely assessed by clinicians under both normal healthy control and pathological conditions. Within the clinical Turing test performed for normal healthy control, clinicians identified whether a signal was simulated or clinical with an accuracy of 77%. Primary ECG features leading to identification as a synthetic signal included fractionation and R wave progression of the QRS complex under certain disease conditions. Before scaling of the QRS complex according to the P wave, identification of synthetic signals was common based on improper matching of amplitudes under normal heart rhythm. Spiked T waves with high amplitudes or biphasic T waves could also be observed. Real ECG signals also tended to exhibit certain noise types not accounted for, including electrical disturbances and large baseline wander, that must either be modulated within simulated data or removed during the clinical Turing test. Within the clinical Turing test to diagnose pathological ECGs, the accuracy of type classification increased to 83%, indicating type classification was easier with synthetic pathological data. Misdiagnosis was common across both signal types as pathologies were only diagnosed correctly by the two expert cardiologists in 51% of cases. More clinicians should perform the clinical Turing test on pathology classification to give a better indication of the true accuracy of ECG diagnosis on both simulated and clinical signals. Furthermore, the clinical Turing test must be conducted on a larger number of signals beyond the 100 analyzed, ideally for the entire synthetic ECG database.

Regardless, it can be observed that clinicians performed differently on clinical 12 lead ECG signals than on those taken from the synthetic ECG database. This is highlighted by the confusion matrices constructed for all pathological cases from the results for both measured and simulated signals (Fig. 8(D)). Clinical signals were classified with the correct pathology at an accuracy of 62%. Within clinical signals, the pathological cases of LAO, 1AVB, and MI were commonly mistaken for a 12 lead ECG in normal sinus rhythm by both clinicians. Simulated signals, on the other hand, were classified correctly for the underlying disease pathology at only 39%. None of the modeled pathological cases could be diagnosed with 100% accuracy by either clinician using standard guidelines for ECG diagnostics across both simulated and clinical signals.
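Confusion matrices of this kind can be assembled from pairs of true and assigned labels. The Python sketch below shows one possible way to do so. The helper names and the toy annotations are hypothetical and do not reproduce the study data, and each assessment is assumed to carry exactly one assigned label, whereas clinicians in the test were allowed to give more than one annotation.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs over all assessments."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    return Counter(zip(true_labels, predicted_labels))


def per_class_accuracy(counts):
    """Fraction of assessments per true class whose prediction matches."""
    totals, hits = Counter(), Counter()
    for (true, pred), n in counts.items():
        totals[true] += n
        if true == pred:
            hits[true] += n
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy annotations, NOT values from the study.
true = ["LBBB", "LBBB", "RBBB", "RBBB", "MI", "MI", "LAO", "1AVB"]
pred = ["LBBB", "MI", "RBBB", "MI", "MI", "NORM", "NORM", "1AVB"]

counts = confusion_counts(true, pred)
accuracy = per_class_accuracy(counts)
print(counts[("LBBB", "MI")], accuracy["LBBB"])
```

Off-diagonal entries such as `counts[("LBBB", "MI")]` then indicate which pathologies are mistaken for one another, separately for measured and simulated signals if the function is called once per signal type.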

The largest differences in diagnostic outcomes between simulated and clinical data sets are observed for LBBB and RBBB. Within simulated ECGs, LBBB and RBBB were commonly mistaken for MI. This stems from the fact that some morphological features in these signals are characteristic of infarction or aneurysm within the heart. In some RBBB signals, for example, V1 is predominantly negative, which in combination with large Q waves in I and aVL could indicate an anterior infarction. LAO experienced the highest level of misdiagnosis within both clinical and simulated data, with similar performance in the two sets. This could be attributed to the fact that LAO manifests only within the P wave, where morphological deviations are harder to detect due to a substantially lower amplitude than the QRS complex. Misdiagnosis was also high for LBBB and RBBB within the simulated data set. Differences in outcome between the clinical and synthetic signals may stem from the inability of the synthetic ECG database to manifest the full complexity of the underlying diseases. For example, remodeling within the ventricles under such conditions may lead to slower conduction properties and alternative wave morphology. Furthermore, only complete LBBB or RBBB was modeled. In clinical practice, however, there are varying degrees of conduction block. A lower reported diagnostic accuracy for MI and 1AVB is seen for the clinical signals in comparison to the simulated ECGs, which could also stem from a lack of complexity within the simulated setup easing diagnosis.

Some results from the Turing test of pathological cases indicate that standard protocols for ECG classification by clinicians are not sufficient. Machine learning algorithms may offer a means to aid ECG diagnosis and improve the reliability of clinical decisions. It is therefore important to provide reference data to test such algorithms. An earlier benchmark study demonstrated this with the large data set of clinical ECGs in PTB-XL52. In that work, deep learning algorithms were found, for example, to exhibit diagnosis success rates in the range of 80–95 percent depending on the metric used. The clinical PTB-XL data set was also instrumental in demonstrating the clear improvement of algorithms based on self-supervised learning53. Nevertheless, clinical databases strongly depend on the quality and the terminology used to label the ECG data. In addition, large publicly available clinical data sets are still rare. This is where validated simulated data sets can become an important tool in the development and benchmarking of new algorithms for ECG classification. Machine learning algorithms could then also be trained and tested on real and synthetic data in different combinations. Databases of simulated ECGs like the MedalCare-XL set presented in this paper also provide an important link between the growing knowledge developed in the cardiac modelling community and the practical development of algorithms for data analysis.

To lower the mismatch in performance between clinical and synthetic signals, further parameter tuning is needed. Iterative clinical Turing tests would be beneficial to update parameter ranges and mitigate the prevalence of undesirable ECG features within the entire database. Refinement could also be guided by sensitivity analysis that provides more information on the relationship between model parameters and the morphological traits of simulated signals as determined by clinicians. However, this requires a large investment due to the variety in clinical pathological classes and the lack of known electrophysiology in such conditions. Certain important ECG features may also be detected by machine learning analysis52 to provide insight into the refined sub-classification of pathological cases beyond current routine diagnoses.

When using the synthetic ECGs as an input data source for machine learning applications, samples that were generated based on the same anatomical model should explicitly belong to only one of the training, testing or validation sets. As the main variation in morphology of the P waves and QRST complexes stems predominantly from anatomical differences in the model cohort54, splitting the data in the described fashion helps to prevent overfitting to similar or almost identical samples that were already seen during training55.
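Such an anatomy-aware split could be sketched in Python as follows. The record layout, the identifiers `ecg_id` and `anatomy_id`, and the split fractions are hypothetical; the way the anatomical model of each signal is encoded in the dataset files must be taken from the dataset documentation.

```python
import random
from collections import defaultdict


def split_by_anatomy(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split records into train/val/test so that no anatomy spans two sets.

    `records` is a list of (ecg_id, anatomy_id) pairs; both identifiers are
    hypothetical names, not fields defined by the dataset.
    """
    # Group all ECGs that were simulated on the same anatomical model.
    groups = defaultdict(list)
    for ecg_id, anatomy_id in records:
        groups[anatomy_id].append(ecg_id)

    # Shuffle the anatomies (not the ECGs) reproducibly.
    anatomies = sorted(groups)
    random.Random(seed).shuffle(anatomies)

    n_train = round(fractions[0] * len(anatomies))
    n_val = round(fractions[1] * len(anatomies))
    parts = {
        "train": anatomies[:n_train],
        "val": anatomies[n_train:n_train + n_val],
        "test": anatomies[n_train + n_val:],
    }
    return {name: [e for a in ids for e in groups[a]]
            for name, ids in parts.items()}


# Toy example: 10 anatomies with 4 simulated ECGs each.
records = [(f"ecg_{a}_{i}", f"anat_{a}") for a in range(10) for i in range(4)]
splits = split_by_anatomy(records)
print({name: len(ids) for name, ids in splits.items()})
```

The split is drawn over anatomies rather than over individual ECGs, so all signals derived from one anatomical model end up in the same subset. The resulting subset sizes therefore only approximate the requested fractions.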

When applying the simulated data for extending or replacing small or imbalanced clinical datasets, the user is advised to use the signals with superimposed realistic ECG noise instead of the raw signal traces. In this way, the simulated signals exhibit characteristics due to noise interference that are also observable in clinical ECGs. Possible domain gaps can thus be reduced, eventually leading to an improved classification outcome on actual clinical data.