Background & Summary

An ECG is a graph of voltage with respect to time that reflects the electrical activities of cardiac muscle depolarization followed by repolarization during each heartbeat. The ECG graph of a normal beat (shown in Fig. 1) consists of a sequence of waves, a P-wave presenting the atrial depolarization process, a QRS complex denoting the ventricular depolarization process, and a T-wave representing the ventricular repolarization. Other portions of the signal include the PR, ST, and QT intervals. Arrhythmias represent a family of cardiac conditions characterized by irregularities in the rate or rhythm of heartbeats. There are several dozen such classes with various distinct manifestations such as excessively slow or fast heartbeats (sinus bradycardia (SB) and atrial tachycardia (AT)) and irregular rhythm with missing or distorted wave segments and intervals (premature ventricular contraction (PVC)). The most common and pernicious arrhythmia type is atrial fibrillation (AFIB). It is associated with a significant increase in the risk of severe cardiac dysfunction and stroke. Recent reports from the American Heart Association1 outlined that, in 2015, AFIB was the underlying cause of death for 23,862 people and was listed on 148,672 US death certificates. In 2010, the estimates of the prevalence of AFIB in the United States ranged from 2.7 million to 6.1 million. According to the same report, AFIB prevalence is expected to rise to 12.1 million in 2030. This alarming situation is not unique to the US. In fact, in Europe, the prevalence of AFIB in adults older than 55 years was estimated to be 8.8 million (95% CI, 6.5–12.3 million) and was projected to rise to 17.9 million by 2060 (95% CI, 13.6–23.7 million). The prevalence of AFIB in the Chinese population aged 35 years or older was 0.71%2. A significant contribution of this database is that it contains 3,889 subjects with AFIB rhythm.

Fig. 1
figure 1

The ECG waveform and segments in lead II that presents a normal cardiac cycle.

According to the current screening and diagnostic practices, cardiologists or physicians review ECG data, establish the correct diagnosis, and begin implementing subsequent treatment plans such as medication regime and radiofrequency catheter ablation. However, the demand for high accuracy automatic heart condition diagnoses has recently increased sharply in parallel with the public health policy of implementing wider screening procedures and the adoption of ECG enabled wearable devices. Such classification methods require large size data that contain all prevalent types of conditions for algorithm training purposes.

There are several labeled, publicly available ECG databases such as the MIT-BIH arrhythmia database3, European ST-T database4, Creighton University ventricular tachycardia arrhythmia database, and St. Petersburg Institute of Cardiological Technics 12-lead arrhythmia database5. The American Heart Association (AHA) developed a database of arrhythmias and normal ECGs that contains 154 beat-by-beat annotated recordings, but it is not available for public use. These databases are either single lead or 12-lead ECG with sampling frequency less than 500 Hz and sample size smaller than 200. The sampling frequency is important in capturing certain vital cardiac conditions. For example, pacemaker stimulus outputs are generally shorter in duration by 0.5 ms, and therefore, they cannot be reliably detected by ordinary signal collection technique with sampling rates between 500 and 1000 Hz6. We compared the characteristics of the above-mentioned datasets and the one proposed in this paper (shown in Table 1). Our database contains the largest number of subjects, the highest sampling rate and the largest number of leads. Further, it also includes 11 heart rhythms and 56 types of cardiovascular conditions labeled by professional physicians. Additionally, the database includes basic ECG measurements such as QRS counts, atrial beat rate, ventricle beat rate, Q offset, and T offset.

Table 1 7 ECG databases comparison.


Participants and digitization parameters

Our data consists of 10,646 patient ECGs including 5,956 males and 4,690 females. Among those patients, 17% had normal sinus rhythm and 83% had at least one abnormality. The age groups with the highest prevalence were 51–60, 61–70 and 71–80 years representing 19.82%, 24.38%, and 16.9%, respectively. A detailed description of the enrolled participants’ baseline characteristics and rhythm frequency distribution is presented in Table 2. The number of volts per A/D bit is 4.88, and A/D converter had 32-bit resolution. The amplitude unit was microvolt. The upper limit was 32,767, and the lower limit was −32,768. The institutional review board of Shaoxing People’s Hospital approved this study, granted the waiver application to obtain informed consent, and allowed the data to be shared publicly after de-identification.

Table 2 Rhythm information and baseline characteristics of participants.

Data acquisition

The data were acquired in four stages. First, each subject underwent a 12-lead resting ECG test that was taken over a period of 10 seconds. The data were stored into the GE MUSE ECG system. Second, a licensed physician labeled the rhythm and other cardiac conditions. Another licensed physician performed a secondary validation. If there was a disagreement, a senior physician intervened and made a final decision. There are labels of each subject’s rhythm and other conditions such as PVC, right bundle branch block (RBBB), left bundle branch block (LBBB), and atrial premature beat (APB). These additional conditions were applied to the entire sample rather than to specified beats in the 10-second reading. The final diagnoses were stored in the MUSE ECG system as well. Third, ECG data and diagnostic information were exported from the GE MUSE system to XML files that were encoded with specific naming conversion defined by General Electric (GE). Finally, we developed a converting tool to extract ECG data and diagnostic information from the XML file and transfer them to CSV format. In doing so, we referred to the work of Maarten J.B. van Ettinger (

Data denoising method

In this study, the noise contamination sources in the ECG data were due to power line interference, electrode contact noise, motion artifacts, muscle contraction, baseline wandering, and random noise. As well known, the presence of noise can be a remarkable obstacle to any statistical analysis. Thus, we proposed and implemented a sequential noise reduction approach to process raw ECG data. Since the frequency range of normal ECG is from 0.5 Hz to 50 Hz, the Butterworth low pass filter was used to remove the signal with a frequency above 50 Hz. Then, LOESS smoother was utilized to clear the effects of baseline wandering. Lastly, the Non Local Means (NLM) technique was used to handle the remaining noise. One ECG sample containing both low and high frequency noise was presented in Fig. 2, whereas the noise reduction performance was displayed in Fig. 3. Another ECG sample contaminated by baseline wandering is shown in Fig. 4, and the effectiveness of LOESS smoother was demonstrated in Fig. 5. To get a full understanding of the techniques and the scheme that was adopted, please refer to the source code in the Code Availability section.

Fig. 2
figure 2

An ECG containing both low and high frequency noise.

Fig. 3
figure 3

An ECG after noise reduction.

Fig. 4
figure 4

An ECG containing baseline wandering.

Fig. 5
figure 5

An ECG after removing baseline wandering.

Butterworth low pass filter

Butterworth is a filter that was first introduced in 1930 by the British engineer and physicist Stephen Butterworth7. Its merit comes from the fact that its frequency response is as flat as possible in the passband. We set up parameters of the filter as follows: passband to 50 Hz, stopband to 60 Hz, no more than 1.0 dB of passband ripple and at least 2.5 dB of attenuation in the stopband. The filtering would not only change the amplitude but also shift the phase that is disadvantageous for subsequent analyses. Thus, we performed filtering in both forward and reverse directions to compensate for this phase-shifting.

LOESS curve fitting

The local polynomial regression smoother (LOESS)8,9 was used to remove baseline wandering. The smoother was fitted using weighted least squares where the weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that farthest away. We used a robust version of LOESS that assigns zero weight to data outside six mean absolute deviations. We subtracted the LOESS estimated trend to clear the effect of baseline wandering.

Non local means(NLM)

The NLM was also used for residual noise reduction. This algorithm was first introduced to smooth the repeated structures in digital images10. Later, this idea was applied to ECG data denoising11, and further developed and combined with Empirical Mode Decomposition12. For a certain length of univariate time series data, NLM reconstructs every data point S(i) through weighted averaging of all data points D(i) in the original sequence, where i and j are indices of location. The weights w(i, j) are determined by a similarity measure between D(i + δ) and D(i + δ), δ Δ.

$$S(i)=\frac{1}{Z(i)}\sum _{j\in N(i)}\,w(i,j)D(j)$$


$$Z(i)=\sum _{j}\,w(i,j)$$


$$w(i,j)=exp\left(-\frac{\sum _{\delta \in \Delta }\,{[D(i+\delta )-D(j+\delta )]}^{2}}{2{L}_{\Delta }{\lambda }^{2}}\right)$$

where λ is a smoothness control parameter, and Δ represents a local patch of samples containing LΔ samples. Thus, at each point, the NLM smoothing borrows information from all points that have similar patterns within the search range N(i). The similarity measure determines how many periods will be included and averaged. We used a Gaussian kernel as a weight function in the smoothing step of our analysis.

Data Records

Data presented in this work consist of four parts: raw ECG data, denoised ECG data, diagnoses file, and attributes dictionary file. These files are available online at figshare13. For each subject, the raw ECG data were saved as a single CSV file, and denoised ECG data were saved under the same name CSV file, but in a different file folder. Also, each CSV file mentioned above contains 5000 rows and 12 columns with header names presenting the ECG lead. These CSV files are named by unique IDs. These IDs were also saved in the diagnostics file with attributes name FileName. The diagnoses file contains all the diagnoses information for each subject including filename, rhythm, other conditions, patient age, gender, and other ECG summary attributes (acquired from GE MUSE system). Table 3 displays detailed information for each attribute. The attribute dictionary file explains the acronym names of other cardiac conditions (shown in Online-only Table 1).

Table 3 Attributes in diagnosis file.

Technical Validation

In this study, various technical approaches were employed to validate the reliability and quality of the ECG data. A detailed description of these validation methods was presented blow.

ECG measurement validation

According to the standard ECG measurement mechanism, two constraints must be satisfied: first, the voltage value of lead II should always be equal to the sum of voltage values of lead I and lead III; second, the sum of voltage values of lead aVR, aVL, and aVF should be equal to zero. It is well known that the right hand electrode and left hand electrode could have their positions switched by operators without a change on corresponding ECG data. Moreover, some of the electrodes could slip off during the test resulting in ECGs displaying a straight line. We created an automatic error-checking algorithm that detects the presence of these undesirable cases and excluded such ECG records from the database.

classification for validation

We implemented several arrhythmia classification algorithms on our data. The extreme gradient boosting tree14 attained the highest overall F1 score of 0.97. Detailed results were presented in Table 4. The high classification accuracy validates both the quality of the ECG data and the reliability of the arrhythmia condition labels. The pipeline of the proposed classification scheme was presented in Fig. 6.

Table 4 Performance report of gradient boosting tree model.
Fig. 6
figure 6

The common process of ECG analysis.

Since some rare rhythms have less than 10 samples as shown in Table 2, following a suggestion from cardiologists, we have hierarchically merged several rare cases to upper-level arrhythmia types. Thus, 11 rhythms were merged into 4 groups (SB, AFIB, GSVT, SR) shown in Table 5, SB only included sinus bradycardia, AFIB consisted of atrial fibrillation and atrial flutter (AF), GSVT contained supraventricular tachycardia, atrial tachycardia, atrioventricular node reentrant tachycardia, atrioventricular reentrant tachycardia and sinus atrium to atrial wandering rhythm, and SR included sinus rhythm and sinus irregularity. Referring to the guidelines15,16,17 that recommend AFIB and AF often coexist, any ECG with a rhythm of AFIB or AF was classified into AFIB group. Merging sinus rhythm and sinus irregularity to SR group helps to distinguish such a combination from the GSVT group, and sinus irregularity can be easily separated from sinus rhythm later by one single criterion, RR interval variation. Supraventricular tachycardia actually is a general term used in the daily ECG screening. For example, if the cardiologists cannot confirm atrial tachycardia or atrioventricular node reentrant tachycardia purely by ECG, they will give the general name supraventricular tachycardia. Therefore, the practice of merging all tachycardia originating from supraventricular locations to GSVT group was adopted in this work. After re-grouping labels of the dataset, these new aggregated classes can significantly contribute to the training of optimal classification approaches.

Table 5 The quantity of data after merged classes.

We designed a novel and interpretable feature extraction method. We added age and gender as features due to their importance in almost all medical data analyses. Features extracted from lead II include ventricular rate in beats per minute (BPM), atrial rate in BPM, QRS duration in millisecond, QT interval in millisecond, R axis, T axis, QRS count, Q onset, Q offset, mean of RR interval, Variance of RR interval, RR interval count. Features extracted from 12 leads contain mean and variance of height, width, prominence for QRS complex, non-QRS complex, and valleys. Peaks and valleys here represent the local maxima and minima. The prominence of a peak or a valley measures how much the peak or valley stands out due to its intrinsic height and its location relative to neighbor peaks or valleys. Thus, the prominence was defined as the vertical distance between the peak point and its lowest contour line. The peaks and valleys were assigned to three subsets, QRS complex, non-QRS peaks, and Valleys. In total, we created 230 features that were used in the extreme gradient boosting tree classification model described above. The F1 score of 0.97 is the average score from 10-fold cross-validation with 20% testing data. For each group, the sample sizes of training and testing datasets are presented in Table 5.

Evaluation protocol for classification

For heartbeat classification evaluation, the ANSI/AAMI EC57 (R2012) gives a protocol and a database, the MIT-BIH arrhythmia database. Referring to the above industrial standard and the guidance from AHA, ACC, and HRS6, we proposed a five-step workflow for future study of rhythm classification.

  1. 1.

    Label selection:

    The available arrhythmia classification studies listed in18 classified heartbeats across all patients. In contrast, in this database, we used a clinically important rhythm classification that aggregates information from all beats into a single label. All rhythm labels are shown in Table 2. These rhythms can be combined according to different measures of similarity, as we demonstrated in the Classification for Validation section to increase sample size and address specific research questions.

  2. 2.


    We recommended a low-frequency filter to cut off 0.67 Hz or below with zero phase distortion, and a high-frequency filter with 50 Hz cutoff frequency. Using the raw ECG signal is also an option for classification scheme.

  3. 3.

    Feature extraction and selection:

    An interpretable feature extraction method is recommended. Using such a feature selection method, one can analyze feature importance and connection with physiological processes. Therefore, uninterpretable feature selection methods such as principal components analysis and neural networks are less desirable.

  4. 4.


    We encourage implementation and comparison of several competing classification schemes that include super-parameter tuning. The classification results need to report average performance accuracy using 10-fold validation.

  5. 5.


    F1-score, Overall Accuracy, Confusion Matrix, Precision (Positive Predictivity), and Recall (Sensitivity) are recommended to report classification performance.

    $${F}_{1}=2\ast \frac{Precision\;\ast \;Recall}{Precision+Recall}$$

Usage Notes

To get a better understanding of our approach, refer to a diagram shown in Fig. 6. In the data collection stage, we recommend the C# ECG Toolkit that is an open-source software to convert, view and print electrocardiograms ( We suggest the use of Matlab or Python to carry out the denoising step of the analysis (see the Code Availability section). In the feature extraction step, BioSPPy ( is recommended to extract general ECG summary features such as QRS count, R wave location, etc. As for machine learning packages, we suggest scikit-learn19, and TensorFlow ( for deep learning model building.