Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network

A Publisher Correction to this article was published on 24 January 2019

This article has been updated

Abstract

Computerized electrocardiogram (ECG) interpretation plays a critical role in the clinical ECG workflow1. Widely available digital ECG data and the algorithmic paradigm of deep learning2 present an opportunity to substantially improve the accuracy and scalability of automated ECG analysis. However, a comprehensive evaluation of an end-to-end deep learning approach for ECG analysis across a wide variety of diagnostic classes has not been previously reported. Here, we develop a deep neural network (DNN) to classify 12 rhythm classes using 91,232 single-lead ECGs from 53,549 patients who used a single-lead ambulatory ECG monitoring device. When validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN achieved an average area under the receiver operating characteristic curve (ROC) of 0.97. The average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, for the DNN (0.837) exceeded that of average cardiologists (0.780). With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes. These findings demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach could reduce the rate of misdiagnosed computerized ECG interpretations and improve the efficiency of expert human ECG interpretation by accurately triaging or prioritizing the most urgent conditions.

Main

The electrocardiogram is a fundamental tool in the everyday practice of clinical medicine, with more than 300 million ECGs obtained annually worldwide3. The ECG is pivotal for diagnosing a wide spectrum of abnormalities from arrhythmias to acute coronary syndrome4. Computer-aided interpretation has become increasingly important in the clinical ECG workflow since its introduction over 50 years ago, serving as a crucial adjunct to physician interpretation in many clinical settings1. However, existing commercial ECG interpretation algorithms still show substantial rates of misdiagnosis1,5,6,7. The combination of widespread digitization of ECG data and the development of algorithmic paradigms that can benefit from large-scale processing of raw data presents an opportunity to reexamine the standard approach to algorithmic ECG analysis and may provide substantial improvements to automated ECG interpretation.

Substantial algorithmic advances in the past five years have been driven largely by a specific class of models known as deep neural networks2. DNNs are computational models consisting of multiple processing layers, with each layer being able to learn increasingly abstract, higher-level representations of the input data relevant to perform specific tasks. They have dramatically improved the state of the art in speech recognition8, image recognition9, strategy games such as Go10, and in medical applications11,12. The ability of DNNs to recognize patterns and learn useful features from raw input data without requiring extensive data preprocessing, feature engineering or handcrafted rules2 makes them particularly well suited to interpret ECG data. Furthermore, since DNN performance tends to increase as the amount of training data increases2, this approach is well positioned to take advantage of the widespread digitization of ECG data.

A comprehensive evaluation of whether an end-to-end deep learning approach can be used to analyze raw ECG data to classify a broad range of diagnoses remains lacking. Much of the previous work to employ DNNs toward ECG interpretation has focused on single aspects of the ECG processing pipeline, such as noise reduction13,14 or feature extraction15,16, or has approached limited diagnostic tasks, detecting only a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on)17,18,19,20 or rhythm diagnoses (most commonly atrial fibrillation or ventricular tachycardia)21,22,23,24,25. Lack of appropriate data has limited many efforts beyond these applications. Most prior efforts used data from the MIT-BIH Arrhythmia database (PhysioNet)26, which is limited by the small number of patients and rhythm episodes present in the dataset.

In this study, we constructed a large, novel ECG dataset that underwent expert annotation for a broad range of ECG rhythm classes. We developed a DNN to detect 12 rhythm classes from raw single-lead ECG inputs using a training dataset consisting of 91,232 ECG records from 53,549 patients. The DNN was designed to classify 10 arrhythmias as well as sinus rhythm and noise for a total of 12 output rhythm classes (Extended Data Fig. 1). ECG data were recorded by the Zio monitor, which is a Food and Drug Administration (FDA)-cleared, single-lead, patch-based ambulatory ECG monitor27 that continuously records data from a single vector (modified Lead II) at 200 Hz. The mean and median wear time of the Zio monitor in our dataset was 10.6 and 13.0 days, respectively. Mean age was 69 ± 16 years and 43% were women. We validated the DNN on a test dataset that consisted of 328 ECG records collected from 328 unique patients, which was annotated by a consensus committee of expert cardiologists (see Methods). Mean age on the test dataset was 70 ± 17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%. Supplementary Table 1 shows the number of unique patients exhibiting each rhythm class.

We first compared the performance of the DNN against the gold standard cardiologist consensus committee diagnoses by calculating the area under the ROC curve (AUC) (Table 1a). Since the DNN algorithm was designed to make a rhythm class prediction approximately once per second (see Methods), we report performance both as assessed at every prediction interval, which we call “sequence-level” and which consists of one rhythm class per interval, and once per record, which we call “set-level” and which consists of the group of unique diagnoses present in the record. Sequence-level metrics help capture the duration of an arrhythmia, such as its onset and offset within a record, whereas set-level metrics focus only on the existence of a rhythm class within a record. The DNN achieved an AUC of greater than 0.91 for all rhythm classes; at the sequence level, all but one AUC was above 0.97. The class-weighted average AUC was 0.978 at the sequence level and 0.977 at the set level. The model demonstrated high AUCs for arrhythmias of greater clinical significance such as atrial fibrillation, atrioventricular block (AVB), and ventricular tachycardia. The sequence-level and set-level results were similar, though sequence-level AUC was higher in the majority of cases. In sensitivity analyses, we calculated multi-class AUC using the method described by Hand and Till28 and results were materially unchanged. Supplementary Table 2 shows the maximum sensitivity achieved by the DNN with specificity >90%, and vice versa. With one exception, all sensitivity and specificity pairs were >90%.
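As an illustration of the one-versus-other AUC used here, the following minimal sketch computes the AUC for a single rhythm class as the probability that a randomly chosen positive interval receives a higher score than a randomly chosen negative one (ties count one half). The function names and the input layout (one dictionary of class probabilities per interval) are assumptions made for this example and are not taken from the study code.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, true_classes, target):
    """AUC for `target` against all other classes.

    probs: one dict per interval mapping class name -> predicted probability
           (hypothetical layout chosen for this sketch).
    true_classes: the reference label for each interval.
    """
    scores = [p[target] for p in probs]
    labels = [c == target for c in true_classes]
    return binary_auc(scores, labels)
```

The pairwise loop is quadratic in the number of intervals, which is acceptable for a sketch; a rank-based implementation would be preferable for large datasets.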

Table 1 Diagnostic performance of the DNN and averaged individual cardiologists compared to the cardiologist committee consensus (n = 328)

In addition to a cardiologist consensus committee annotation, each ECG record in the test dataset received annotations from six separate individual cardiologists who were not part of the committee (see Methods). Using the committee labels as the gold standard, we compared the DNN algorithm F1 score to the average individual cardiologist F1 score, which is the harmonic mean of the positive predictive value (PPV; precision) and sensitivity (recall) (Table 1). Cardiologist F1 scores were averaged over six individual cardiologists. The trend of DNN F1 scores tended to follow that of the averaged cardiologist F1 scores: both had lower F1 on similar classes, such as ventricular tachycardia and ectopic atrial rhythm (EAR). The set-level average F1 scores weighted by the frequency of each class for the DNN (0.837) exceeded those for the averaged cardiologist (0.780). We performed multiple sensitivity analyses, all of which were consistent with our main results: both AUC and F1 scores on the 10% development dataset (n = 8,761) were materially unchanged from the test dataset results, although they were slightly higher (Supplementary Tables 3 and 4). In addition, we retrained the DNN holding out an additional 10% of the training dataset as a second held-out test dataset (n = 8,768); the AUC and F1 scores for all rhythms were materially unchanged (Supplementary Tables 5 and 6). We note that unlike the primary test dataset, which has gold-standard annotations from a committee of cardiologists, both sensitivity analysis datasets are annotated by certified ECG technicians.
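The F1 score and the class-frequency-weighted average described above can be sketched as follows. This is a minimal illustration under stated assumptions: the counts of true positives, false positives, and false negatives are taken as given, and the helper names are hypothetical rather than those of the study code.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def frequency_weighted_mean(per_class_value, class_frequency):
    """Arithmetic mean of per-class values weighted by class frequency.

    Both arguments are dicts keyed by class name (an assumed layout).
    """
    total = sum(class_frequency[c] for c in per_class_value)
    return sum(per_class_value[c] * class_frequency[c] for c in per_class_value) / total
```

Because the weights are class frequencies, common rhythms dominate the aggregate, so a low F1 on a rare class moves the weighted average only slightly.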

We plotted receiver operating characteristic curves (ROCs) and precision-recall curves for the sequence-level analyses of three example classes: atrial fibrillation; trigeminy; and AVB (Fig. 1a,b). Individual cardiologist performance and averaged cardiologist performance are plotted on the same figure. Extended Data Fig. 2 presents ROCs for all classes, showing that the model met or exceeded the averaged cardiologist performance for all rhythm classes. Fixing the specificity at the average specificity level achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes (Table 2). We used confusion matrices to illustrate the discordance between the DNN’s predictions (Fig. 2a) or averaged cardiologist predictions (Fig. 2b) and the committee consensus. The two confusion matrices exhibit a similar pattern, highlighting those rhythm classes that were generally more problematic to classify (that is, supraventricular tachycardia (SVT) versus atrial fibrillation, junctional versus sinus rhythm, and EAR versus sinus rhythm).
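The comparison at fixed specificity can be illustrated with a short sketch: for one rhythm class, sweep the decision threshold over the DNN scores and keep the highest sensitivity among thresholds whose specificity is at least the target (here, the average cardiologist specificity). The exact thresholding procedure used in the study is not described in this section, so the procedure below is an assumption made for illustration.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Highest sensitivity over thresholds with specificity >= target.

    An interval is called positive when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    # Every observed score is a candidate threshold; +inf predicts no positives.
    for threshold in sorted(set(scores)) + [float("inf")]:
        specificity = sum(1 for n in neg if n < threshold) / len(neg)
        sensitivity = sum(1 for p in pos if p >= threshold) / len(pos)
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best
```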

Fig. 1: ROC and precision-recall curves.

a, Examples of ROC curves calculated at the sequence level for atrial fibrillation (AF), trigeminy, and AVB. b, Examples of precision-recall curves calculated at the sequence level for atrial fibrillation, trigeminy, and AVB. Individual cardiologist performance is indicated by the red crosses and averaged cardiologist performance is indicated by the green dot. The line represents the ROC (a) or precision-recall curve (b) achieved by the DNN model. n = 7,544 where each of the 328 30-s ECGs received 23 sequence-level predictions.

Table 2 DNN algorithm and cardiologist sensitivity compared to the cardiologist committee consensus, with specificity fixed at the average specificity level achieved by cardiologists

Fig. 2: Confusion matrices.

a, Confusion matrix for the predictions of the DNN versus the cardiology committee consensus. b, Confusion matrix for predictions of individual cardiologists versus the cardiology committee consensus. The percentage of all possible records in each category is displayed on a color gradient scale.

Finally, to demonstrate the generalizability of our DNN architecture to external data, we applied our DNN to the 2017 PhysioNet Challenge data (https://physionet.org/challenge/2017/), which contained four rhythm classes: sinus rhythm; atrial fibrillation; noise; and other. Keeping our DNN architecture fixed and without any other hyper-parameter tuning, we trained our DNN on the publicly available training dataset (n = 8,528), holding out a 10% development dataset for early stopping. DNN performance on the hidden test dataset (n = 3,658) demonstrated overall F1 scores that were among those of the best performers from the competition (Supplementary Table 7)24, with a class average F1 of 0.83. This demonstrates the ability of our end-to-end DNN-based approach to generalize to a new set of rhythm labels on a different dataset.

Our study is the first comprehensive demonstration of a deep learning approach to perform classification across a broad range of the most common and important ECG rhythm diagnoses. Our DNN had an average class-weighted AUC of 0.97, with higher average F1 scores and sensitivities than cardiologists. These findings demonstrate that an end-to-end DNN approach has the potential to be used to improve the accuracy of algorithmic ECG interpretation. Recent algorithmic and computational advances compel us to revisit the standard approaches to automated ECG interpretation. Furthermore, algorithmic approaches whose performance improves as more data become available, such as deep learning2, can leverage the widespread digitization of ECG data and provide clear opportunities to bring us closer to the ideal of a learning health care system29. We emphasize our use in this study of a dataset large enough to evaluate an end-to-end deep learning approach to predict multiple diagnostic ECG classes, and our validation against the high standard of a cardiologist consensus committee. (Most cardiologists were subspecialized in rhythm abnormalities.) We believe this is the most clinically relevant gold standard, since cardiologists perform the final ECG diagnosis in nearly all clinical settings.

Our study demonstrates that the paradigm shift represented by end-to-end deep learning may enable a new approach to automated ECG analysis. The standard approach to automated ECG interpretation employs various techniques across a series of steps that include signal preprocessing, feature extraction, feature selection/reduction, and classification30. At each step, hand-engineered heuristics and derivations of the raw ECG data are developed with the ultimate aim to improve classification for a given rhythm, such as atrial fibrillation31,32. In contrast, DNNs enable an approach that is fundamentally different since a single algorithm can accomplish all of these steps ‘end-to-end’ without requiring class-specific feature extraction; in other words, the DNN can accept the raw ECG data as input and output diagnostic probabilities. With sufficient training data, using a DNN in this manner has the potential to learn all of the important previously manually derived features, along with as-yet-unrecognized features, in a data-driven way2, and may learn shared features useful in predicting multiple classes. These properties of DNNs can serve to improve prediction performance, particularly since there is ample evidence to suggest that the currently recognized, manually derived ECG features represent only a fraction of the informative features for any diagnosis33,34.

While artificial neural networks were first applied toward the interpretation of ECGs as early as two decades ago3,35, until recently they only contained several layers and were constrained by algorithmic and computational limitations. More recent studies have employed deeper networks, although some only use DNNs to perform certain steps in the ECG processing pipeline, such as feature extraction33 or classification25. End-to-end DNN approaches have been used more recently showing good performance for a limited set of ECG rhythms, such as atrial fibrillation22,23,36, ventricular arrhythmias21, or individual heartbeat classes20,21,37,38. While these prior efforts demonstrated promising performance for specific rhythms, they do not provide a comprehensive evaluation of whether an end-to-end approach can perform well across a wide range of rhythm classes, in a manner similar to that encountered clinically. Our approach is unique in using a 34-layer network in an end-to-end manner to simultaneously output probabilities for a wide range of distinct rhythm diagnoses, all of which is enabled by our dataset, which is orders of magnitude larger than most other datasets of its kind26. Distinct from some other recent DNN approaches39, no substantial preprocessing of ECG data, such as Fourier or wavelet transforms40, is needed to achieve strong classification performance.

Since arrhythmia detection is one of the most problematic tasks for existing ECG algorithms1,5,6, if validated in clinical settings through clinical trials, our approach has the potential for substantial clinical impact. Paired with properly annotated digital ECG data, our approach has the potential to increase the overall accuracy of preliminary computerized ECG interpretations and can also be used to customize predictions to institution- or population-specific applications by additional training on institution-specific data. While expert provider confirmation will probably be appropriate in many clinical settings, the DNN could expand the capability of an expert over-reader in the clinical workflow, for example, by triaging urgent conditions or those for which the DNN has the least ‘confidence’. Since ECG data collected from different clinical applications range in duration from 10 s (standard 12-lead ECGs) to multiple days (single-lead ambulatory ECGs), the application of any algorithm, including ours, must ultimately be tailored to the target clinical application. For example, even at the performance characteristics we report, applying our algorithm sequentially across an ECG record of long duration would result in nontrivial false-positive diagnoses. Faced with a similar problem, cardiologists probably incorporate additional mechanisms to improve their diagnostic performance, such as taking advantage of the increased context or knowledge about arrhythmia epidemiology. Similarly, additional algorithmic steps or post-processing heuristics may be important before clinical application.

An important finding from our study is that the DNN appears to recapitulate the misclassifications made by individual cardiologists, as demonstrated by the similarity in the confusion matrices for the model and cardiologists. Manual review of the discordances revealed that the DNN misclassifications overall appear very reasonable. In many cases, the lack of context, limited signal duration, or having a single lead limited the conclusions that could reasonably be drawn from the data, making it difficult to definitively ascertain whether the committee and/or the algorithm was correct. Similar factors, as well as human error, may explain the inter-annotator agreement of 72.8%.

Of the rhythm classes we examined, ventricular tachycardia is a clinically important rhythm for which the model had a lower F1 score than cardiologists, but interestingly had higher sensitivity (94.1%) than the averaged cardiologist (78.4%). Manual review of the 16 records misclassified by the DNN as ventricular tachycardia showed that ‘mistakes’ made by the algorithm were very reasonable. For example, ventricular tachycardia and idioventricular rhythm (IVR) differ only in the heart rate being above or below 100 beats per minute (b.p.m.), respectively. In 7 of the committee-labeled IVR cases, the record contained periods of heart rate ≥ 100 b.p.m., making ventricular tachycardia a reasonable classification by the DNN; the remaining 3 committee-labeled IVR records had rates close to 100 b.p.m. Of the 5 cases where the committee label was atrial fibrillation (4) or SVT (1), all but one displayed aberrant conduction, resulting in wide QRS complexes (the ECG waveform corresponding to ventricular activation) with a similar appearance to ventricular tachycardia. If we recategorize the 7 IVR records with a rate ≥ 100 b.p.m. as ventricular tachycardia, overall DNN performance on ventricular tachycardia exceeds that of cardiologists by F1 score, with a set-level F1 score of 0.82 (versus 0.77).

This study has several important limitations. Our input dataset is limited to single-lead ECG records obtained from an ambulatory monitor, which provides limited signal compared to a standard 12-lead ECG; it remains to be determined if our algorithm performance would be similar in 12-lead ECGs. However, it may be in applications such as this, which have lower signal-to-noise ratio and where the current standard of care leaves more room for improvement, that approaches such as deep learning may provide the greatest impact. As discussed earlier, a limitation facing this, or any algorithm, before clinical application would be tailoring it to the target application, which may require additional training or post-processing steps. Additionally, systematic differences in the way technicians versus cardiologists labeled records in our dataset could have decreased DNN performance, although we took precautions to limit this by establishing standard operating protocols for annotation. In addition, as revealed in our manual review of discordant predictions, in some cases there remains uncertainty in the correct label. Given the resource-intensive nature of cardiologist committee ECG annotation, our test dataset was limited to records from 328 patients; confidence intervals (CIs) with our test dataset size were acceptably narrow, as we report in Table 1, although our ability to perform subgroup analysis (such as by age/sex) is limited. Finally, we also note that to obtain a sufficient quantity of rare rhythms in our training and test datasets, we targeted patients exhibiting these rhythms during data extraction. This implies that prevalence-dependent metrics such as the F1 score would not be expected to generalize to the broader population.

In summary, we demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach has the potential to improve the accuracy, efficiency, and scalability of ECG interpretation.

Methods

Study participants and sampling procedures

Our dataset contained retrospective, de-identified data from adult patients >18 years old who used the Zio monitor (iRhythm Technologies, Inc) from January 2013 to March 2017. All extracted data were de-identified according to the Health Insurance Portability and Accountability Act Safe Harbor. According to the iRhythm Technologies privacy policy, fully de-identified patient data may be shared externally for research purposes; patients may opt out of this sharing. Accordingly, written informed consent was not necessary for this study given that the 30-s ECG samples of both the training and test datasets were appropriately de-identified before use. The study was reviewed and exempted from full review by the Stanford University Institutional Review Board.

We extracted a median of one 30-s record per patient to construct the training dataset. ECG records were extracted based on the report summaries produced by the iRhythm Technologies clinical workflow, which includes a full review by a certified ECG technician of initial annotations from an algorithm that is FDA 510(k) approved for clinical use. We randomly sampled patients exhibiting each rhythm; from these patients, we selected 30-s records where the rhythm class was present. Although the targeted rhythm class was typically present within the record, most records contained a mix of multiple rhythms. To further improve the balance of classes in the training dataset, rare rhythms, such as AVB, were intentionally oversampled, with a median of two 30-s records per patient. For the test dataset, 30-s records of each rhythm were sampled in a similar manner to achieve a greater representation of rare rhythms; however, the test dataset included only a single record per patient. The training, development, and test datasets had completely disjoint sets of patients.

Annotation procedures

All ECG records in the training and test datasets underwent additional annotation procedures. We used separate procedures to annotate the training and test datasets, reserving the resource-intensive cardiologist annotation for use as the gold standard in the test dataset. To annotate the training dataset, a group of senior certified ECG technicians reviewed all records and noted the onset and offset of all rhythms on the record. Every record was randomly assigned to be reviewed by a single technician specifically for this task, not for any other purpose. All annotators received specific instructions and training regarding how to annotate transitions between rhythms to improve labeling consistency. We held out records from a random 10% of the training dataset patients for use as a development dataset to perform DNN hyper-parameter tuning.

Eight board-certified practicing cardiac electrophysiologists and one board-certified practicing cardiologist (all referred to as cardiologists) annotated records in the test dataset. All iRhythm Technologies clinical annotations were removed from the test dataset. Cardiologists were divided into three committees of three members each; each committee annotated a separate one-third of the test dataset (112 records). Cardiologist committees discussed records as a group and annotated by consensus, providing the gold standard for model evaluation. Each of the remaining six cardiologists that were not part of the committee for that record also provided individual annotations for that record. These annotations were used to compare the model’s performance to that of the individual cardiologists. In summary, every record in the test dataset received one committee consensus annotation from a group of three cardiologists and six individual cardiologist annotations.

Many ECG records contained multiple rhythm class diagnoses since the onset and offset of all unique classes were labeled within each 30-s record. The atrial fibrillation class combined atrial fibrillation and atrial flutter. The AVB class combined both type 2 second-degree AVB (Mobitz II/Hay) and third-degree AVB. We combined these classes because they have similar clinical consequences. The noise label was selected whenever artifact in the signal precluded accurate interpretation of the underlying rhythm.

Algorithm development

We developed a convolutional DNN to detect arrhythmias (Extended Data Fig. 1), which takes as input the raw ECG data (sampled at 200 Hz, or 200 samples per second) and outputs one prediction every 256 samples (or every 1.28 s), which we call the output interval. The network takes as input only the raw ECG samples and no other patient- or ECG-related features. The network architecture has 34 layers; to make the optimization of such a network tractable, we employed shortcut connections in a manner similar to the residual network architecture41. The network consists of 16 residual blocks with two convolutional layers per block. The convolutional layers have a filter width of 16 and 32 × 2^k filters, where k is a hyper-parameter which starts at 0 and is incremented by 1 every fourth residual block. Every alternate residual block subsamples its inputs by a factor of 2. Before each convolutional layer, we applied batch normalization42 and a rectified linear activation, adopting the pre-activation block design43. The first and last layers of the network are special-cased due to this pre-activation block structure. We also applied Dropout44 between the convolutional layers and after the nonlinearity with a probability of 0.2. The final fully connected softmax layer produces a distribution over the 12 output classes.
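The arithmetic behind this architecture can be checked with a small, framework-free sketch that lists, for each of the 16 residual blocks, the number of filters (32 × 2^k with k increasing every fourth block) and whether the block subsamples. Two points are assumptions rather than statements from the text: that "every alternate block" means the odd-indexed blocks, and that the 34 layers are counted as 32 block convolutions plus one initial convolution and the final fully connected layer.

```python
SAMPLE_RATE_HZ = 200      # from the text: ECG sampled at 200 Hz
NUM_BLOCKS = 16           # from the text: 16 residual blocks
CONVS_PER_BLOCK = 2       # from the text: two convolutional layers per block
BASE_FILTERS = 32         # from the text: 32 * 2^k filters


def block_schedule(num_blocks=NUM_BLOCKS):
    """Per-block filter count and subsampling factor."""
    schedule = []
    for i in range(num_blocks):
        k = i // 4  # k starts at 0 and increases by 1 every fourth block
        schedule.append({
            "block": i,
            "filters": BASE_FILTERS * 2 ** k,
            # Assumption: the odd-indexed blocks are the ones that subsample.
            "subsample": 2 if i % 2 == 1 else 1,
        })
    return schedule


def total_subsampling(schedule):
    """Product of all subsampling factors, i.e. input samples per output."""
    factor = 1
    for block in schedule:
        factor *= block["subsample"]
    return factor


schedule = block_schedule()
samples_per_output = total_subsampling(schedule)           # 2**8 = 256
output_interval_s = samples_per_output / SAMPLE_RATE_HZ    # 1.28 s
# Assumed layer accounting: block convolutions + first convolution + final dense layer.
layer_count = NUM_BLOCKS * CONVS_PER_BLOCK + 2
```

Eight subsampling blocks, each halving the sequence, give one output per 256 input samples, which matches the stated output interval of 1.28 s at 200 Hz.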

The network was trained de novo with random initialization of the weights as described by He et al.9. We used the Adam optimizer45 with the default parameters β1 = 0.9 and β2 = 0.999, and a mini-batch size of 128. We initialized the learning rate to 1 × 10^−3 and reduced it by a factor of 10 when the development set loss stopped improving for two consecutive epochs. We chose the model that achieved the lowest error on the development dataset.
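The learning-rate rule and model selection described above can be sketched without any deep learning framework. The sketch below takes a list of per-epoch development set losses and returns the learning rate in effect after each epoch; the function names are hypothetical, and resetting the counter after each reduction is an assumption, since the text does not specify that detail.

```python
def learning_rate_schedule(dev_losses, initial_lr=1e-3, factor=10.0, patience=2):
    """Learning rate after each epoch under a reduce-on-plateau rule.

    The rate is divided by `factor` once the development loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    stale_epochs = 0
    rates = []
    for loss in dev_losses:
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr /= factor
                stale_epochs = 0  # assumption: restart counting after a reduction
        rates.append(lr)
    return rates


def best_epoch(dev_errors):
    """Index of the epoch with the lowest development set error."""
    return min(range(len(dev_errors)), key=lambda i: dev_errors[i])
```

For example, with development losses of 1.0, 0.8, 0.85, 0.9, 0.7, 0.75, 0.72, the rate would drop after the fourth epoch and again after the seventh, and the fifth epoch (index 4) would be selected.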

In general, the hyper-parameters of the network architecture and optimization algorithm were chosen via a combination of grid search and manual tuning. For the architecture, we searched primarily over the number of convolutional layers, the size and number of the convolutional filters, as well as the use of residual connections. We found the residual connections useful once the depth of the model exceeded eight layers. We also experimented with recurrent layers including long short-term memory cells46 and bidirectional recurrence, but found no improvement in accuracy and a substantial increase in runtime; thus, we abandoned this class of models. We manually tuned the learning rate to achieve fastest convergence.

Algorithm evaluation

Since the DNN outputs one class prediction every output interval, it makes a series of 23 rhythm predictions for every 30-s record. The cardiologists annotated the start and end point of each rhythm class in the record. We used this to construct a cardiologist label at every output interval by rounding the annotation to the nearest interval boundary. Therefore, model accuracy can be assessed at the level of every output interval, which we call ‘sequence-level’, or at the record level, which we call ‘set-level’. To compare model predictions at the sequence level, the model predictions at each output interval were compared with the corresponding committee consensus labels for that same output interval. At the set level, the set of unique rhythm classes across a given ECG record that was predicted by the DNN was compared with the set of rhythm classes annotated across the record by the committee consensus. The set-level evaluation, unlike the sequence-level, does not penalize for time misalignment of a rhythm classification within a record.
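To make the two levels of evaluation concrete, the sketch below converts onset and offset annotations (in samples) into one label per output interval by rounding to the nearest interval boundary, and then derives the set-level label as the set of unique rhythms. The annotation layout, the function names, and the use of Python's built-in `round` (which rounds exact halves to the nearest even integer) are assumptions for illustration.

```python
SAMPLE_RATE_HZ = 200
OUTPUT_INTERVAL = 256  # samples per prediction, from the text


def intervals_per_record(num_samples):
    """Number of complete output intervals in a record."""
    return num_samples // OUTPUT_INTERVAL


def sequence_labels(annotations, num_samples):
    """One label per output interval.

    annotations: list of (onset_sample, offset_sample, rhythm) tuples
                 (hypothetical layout for this sketch).
    """
    n = intervals_per_record(num_samples)
    labels = [None] * n
    for onset, offset, rhythm in annotations:
        # Round each annotated boundary to the nearest interval boundary.
        start = min(n, round(onset / OUTPUT_INTERVAL))
        end = min(n, round(offset / OUTPUT_INTERVAL))
        for i in range(start, end):
            labels[i] = rhythm
    return labels


def set_labels(sequence):
    """Set-level label: the unique rhythms present in the record."""
    return {rhythm for rhythm in sequence if rhythm is not None}


record_samples = 30 * SAMPLE_RATE_HZ  # a 30-s record has 6,000 samples
example = sequence_labels([(0, 2000, "SINUS"), (2000, 6000, "AF")], record_samples)
```

A 30-s record at 200 Hz holds 6,000 samples and therefore 23 complete intervals of 256 samples, so 328 records yield 328 × 23 = 7,544 sequence-level predictions.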

Algorithm evaluation at the sequence level allows comparison against the gold standard at every output interval, providing the most comprehensive metric of algorithm performance, which we therefore employ for most metrics. The sequence-level evaluation is also similar to clinical applications for telemetry or Holter monitor analysis, whereby it is critical to identify the onset and the offset of rhythms. Evaluation at the set level is a useful abstraction, approximating how the DNN algorithm might be applied to a single ECG record to identify which diagnoses are present in a given record.

To train and evaluate our model on the PhysioNet Challenge data, which contains variable-length recordings, we made minor modifications to the DNN. Without any change, the DNN can accept as input any record with a length that is a multiple of 256 samples. To handle examples whose length is not a multiple of 256, records were truncated to the nearest multiple. We used the given record label as the label for every output prediction (approximately every 1.3 s). To produce a single prediction for the variable-length record, we used a majority vote of the sequence-level predictions.
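The two adaptations for variable-length records can be sketched in a few lines. This is a minimal illustration, not the study code: truncation is taken to mean dropping trailing samples, and ties in the vote are broken in favor of the label seen first, which is an assumption since the text does not say how ties were handled.

```python
from collections import Counter

OUTPUT_INTERVAL = 256  # samples per prediction, from the text


def truncate_to_multiple(samples, multiple=OUTPUT_INTERVAL):
    """Drop trailing samples so the length is a multiple of `multiple`."""
    usable = (len(samples) // multiple) * multiple
    return samples[:usable]


def majority_vote(predictions):
    """Most frequent sequence-level prediction for a record.

    Ties go to the label encountered first (Counter preserves insertion order).
    """
    if not predictions:
        raise ValueError("no predictions to vote on")
    return Counter(predictions).most_common(1)[0][0]
```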

Statistical analysis

We performed ROC analysis and calculated the AUC to assess model discrimination for each rhythm class with a one-versus-other strategy28,47. AUCs for sequence-level and set-level analyses are presented separately. We give a two-sided CI for the AUC scores48. Sensitivity and specificity were calculated at binary decision thresholds for every rhythm class. We computed the precision-recall curve, which shows the relationship between PPV (precision) and sensitivity (recall)49. It provides complementary information to the ROC curve, especially with class-imbalanced datasets. To compare the relative performance of the DNN to the cardiologist committee labels, we calculated the F1 score, which is the harmonic mean of the PPV and sensitivity. It ranges from 0 to 1 and rewards algorithms that maximize both PPV and sensitivity simultaneously, rather than favoring one over the other. The F1 score is complementary to the AUC: it is particularly helpful in the setting of multi-class prediction and is less sensitive than the AUC in settings of class imbalance49. For an aggregate measure of model performance, we computed the class frequency-weighted arithmetic mean for both the F1 score and the AUC. To obtain estimates of how the DNN compares to an average cardiologist, the characteristics of cardiologist performance were averaged across the six cardiologists who individually annotated each record. We used confusion matrices to illustrate the specific examples of rhythm classes where the DNN prediction or the individual cardiologist’s prediction was discordant with the committee consensus at the sequence level. Among the individual cardiologist annotations in the test dataset, we calculated inter-annotator agreement as the ratio of the number of times two annotators agreed that a rhythm was present at each output interval and the total number of pairwise comparisons.
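The inter-annotator agreement defined above can be sketched as a pairwise comparison over output intervals. The sketch assumes that each annotator's labels have already been converted to one rhythm label per output interval and that all sequences have equal length; the function name and input layout are hypothetical.

```python
from itertools import combinations


def inter_annotator_agreement(annotations):
    """Fraction of pairwise interval comparisons on which two annotators agree.

    annotations: one label sequence per annotator, all of equal length
                 (assumed layout for this sketch).
    """
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    length = len(annotations[0])
    if any(len(a) != length for a in annotations):
        raise ValueError("all annotators must label the same intervals")
    agreements = 0
    comparisons = 0
    for first, second in combinations(annotations, 2):
        for label_a, label_b in zip(first, second):
            agreements += int(label_a == label_b)
            comparisons += 1
    return agreements / comparisons
```

With six individual annotators per record, each output interval contributes 15 pairwise comparisons to the denominator.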

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability

Code for the algorithm development, evaluation, and statistical analysis is open source with no restrictions and is available from https://github.com/awni/ecg.

Data availability

The test dataset used to support the findings of this study is publicly available at https://irhythm.github.io/cardiol_test_set without restriction. Restrictions apply to the availability of the training dataset, which was used under license from iRhythm Technologies, Inc. for the current study. iRhythm Technologies, Inc. will consider requests to access the training data on an individual basis. Any data use will be restricted to noncommercial research purposes, and the data will only be made available on execution of appropriate data use agreements.

Change history

  • 24 January 2019

    In the version of this article originally published, the x axis labels in Fig. 1a were incorrect. The labels originally were ‘Specificity,’ but should have been ‘1 – Specificity.’ Also, the x axis label in Fig. 2b was incorrect. It was originally ‘DNN predicted label,’ but should have been ‘Average cardiologist label.’ The errors have been corrected in the PDF and HTML versions of this article.

References

  1. Schläpfer, J. & Wellens, H. J. Computer-interpreted electrocardiograms: benefits and limitations. J. Am. Coll. Cardiol. 70, 1183–1192 (2017).
  2. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  3. Holst, H., Ohlsson, M., Peterson, C. & Edenbrandt, L. A confident decision support system for interpreting electrocardiograms. Clin. Physiol. 19, 410–418 (1999).
  4. Schlant, R. C. et al. Guidelines for electrocardiography. A report of the American College of Cardiology/American Heart Association Task Force on assessment of diagnostic and therapeutic cardiovascular procedures (Committee on Electrocardiography). J. Am. Coll. Cardiol. 19, 473–481 (1992).
  5. Shah, A. P. & Rubin, S. A. Errors in the computerized electrocardiogram interpretation of cardiac rhythm. J. Electrocardiol. 40, 385–390 (2007).
  6. Guglin, M. E. & Thatai, D. Common errors in computer electrocardiogram interpretation. Int. J. Cardiol. 106, 232–237 (2006).
  7. Poon, K., Okin, P. M. & Kligfield, P. Diagnostic performance of a computer-based ECG rhythm algorithm. J. Electrocardiol. 38, 235–238 (2005).
  8. Amodei, D. et al. Deep Speech 2: end-to-end speech recognition in English and Mandarin. In Proc. 33rd International Conference on Machine Learning, 173–182 (2016).
  9. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proc. International Conference on Computer Vision, 1026–1034 (IEEE, 2015).
  10. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  11. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
  12. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
  13. Poungponsri, S. & Yu, X. An adaptive filtering approach for electrocardiogram (ECG) signal noise reduction using neural networks. Neurocomputing 117, 206–213 (2013).
  14. Ochoa, A., Mena, L. J. & Felix, V. G. Noise-tolerant neural network approach for electrocardiogram signal classification. In Proc. 3rd International Conference on Compute and Data Analysis, 277–282 (Association for Computing Machinery, 2017).
  15. Mateo, J. & Rieta, J. J. Application of artificial neural networks for versatile preprocessing of electrocardiogram recordings. J. Med. Eng. Technol. 36, 90–101 (2012).
  16. Pourbabaee, B., Roshtkhari, M. J. & Khorasani, K. Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Trans. Syst. Man Cybern. Syst. 99, 1–10 (2017).
  17. Javadi, M., Arani, S. A., Sajedin, A. & Ebrahimpour, R. Classification of ECG arrhythmia by a modular neural network based on mixture of experts and negatively correlated learning. Biomed. Signal Process. Control 8, 289–296 (2013).
  18. Acharya, U. R. et al. A deep convolutional neural network model to classify heartbeats. Comput. Biol. Med. 89, 389–396 (2017).
  19. Banupriya, C. V. & Karpagavalli, S. Electrocardiogram beat classification using probabilistic neural network. In Proc. Machine Learning: Challenges and Opportunities Ahead 31–37 (2014).
  20. Al Rahhal, M. M. et al. Deep learning approach for active classification of electrocardiogram signals. Inf. Sci. (NY) 345, 340–354 (2016).
  21. Acharya, U. R. et al. Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Inf. Sci. (NY) 405, 81–90 (2017).
  22. Zihlmann, M., Perekrestenko, D. & Tschannen, M. Convolutional recurrent neural networks for electrocardiogram classification. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.070-060 (2017).
  23. Xiong, Z., Zhao, J. & Stiles, M. K. Robust ECG signal classification for detection of atrial fibrillation using a novel neural network. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.066-138 (2017).
  24. Clifford, G. et al. AF classification from a short single lead ECG recording: the PhysioNet/Computing in Cardiology Challenge 2017. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.065-469 (2017).
  25. Teijeiro, T., Garcia, C. A., Castro, D. & Felix, P. Arrhythmia classification from the abductive interpretation of short single-lead ECG records. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.166-054 (2017).
  26. Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, E215–E220 (2000).
  27. Turakhia, M. P. et al. Diagnostic utility of a novel leadless arrhythmia monitoring device. Am. J. Cardiol. 112, 520–524 (2013).
  28. Hand, D. J. & Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186 (2001).
  29. Smith, M. D. et al. in Best Care at Lower Cost: the Path to Continuously Learning Health Care in America (National Academies Press, Washington, 2012).
  30. Lyon, A., Mincholé, A., Martínez, J. P., Laguna, P. & Rodriguez, B. Computational techniques for ECG analysis and interpretation in light of their contribution to medical advances. J. R. Soc. Interface 15, 20170821 (2018).
  31. Carrara, M. et al. Heart rate dynamics distinguish among atrial fibrillation, normal sinus rhythm and sinus rhythm with frequent ectopy. Physiol. Meas. 36, 1873–1888 (2015).
  32. Zhou, X., Ding, H., Ung, B., Pickwell-MacPherson, E. & Zhang, Y. Automatic online detection of atrial fibrillation based on symbolic dynamics and Shannon entropy. Biomed. Eng. Online 13, 18 (2014).
  33. Hong, S. et al. ENCASE: an ENsemble ClASsifiEr for ECG Classification using expert features and deep neural networks. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.178-245 (2017).
  34. Nahar, J., Imam, T., Tickle, K. S. & Chen, Y. P. Computational intelligence for heart disease diagnosis: a medical knowledge driven approach. Expert Syst. Appl. 40, 96–104 (2013).
  35. Cubanski, D., Cyganski, D., Antman, E. M. & Feldman, C. L. A neural network system for detection of atrial fibrillation in ambulatory electrocardiograms. J. Cardiovasc. Electrophysiol. 5, 602–608 (1994).
  36. Andreotti, F., Carr, O., Pimentel, M. A. F., Mahdi, A. & De Vos, M. Comparing feature-based classifiers and convolutional neural networks to detect arrhythmia from short segments of ECG. Comput. Cardiol. https://doi.org/10.22489/CinC.2017.360-239 (2017).
  37. Xu, S. S., Mak, M. & Cheung, C. Towards end-to-end ECG classification with raw signal extraction and deep neural networks. IEEE J. Biomed. Health Informatics 14, 1 (2018).
  38. Ong, S. L., Ng, E. Y. K., Tan, R. S. & Acharya, U. R. Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats. Comput. Biol. Med. 102, 278–287 (2018).
  39. Shashikumar, S. P., Shah, A. J., Clifford, G. D. & Nemati, S. Detection of paroxysmal atrial fibrillation using attention-based bidirectional recurrent neural networks. In Proc. 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 715–723 (Association for Computing Machinery, 2018).
  40. Xia, Y., Wulan, N., Wang, K. & Zhang, H. Detecting atrial fibrillation by deep convolutional neural networks. Comput. Biol. Med. 93, 84–92 (2018).
  41. He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In Proc. European Conference on Computer Vision, 630–645 (Springer, 2016).
  42. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning, 448–456 (2015).
  43. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (IEEE, 2016).
  44. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
  45. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations 1–15 (2015).
  46. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
  47. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
  48. Hanley, J. A. & McNeil, B. J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148, 839–843 (1983).
  49. Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).

Acknowledgements

iRhythm Technologies, Inc. provided financial support for the data annotation in this work. M.H. and C.B. are employees of iRhythm Technologies, Inc. A.Y.H. was funded by an NVIDIA fellowship. G.H.T. received support from the National Institutes of Health (K23 HL135274). The only financial support provided by iRhythm Technologies, Inc. for this study was for the data annotation. Data analysis and interpretation were performed independently of the sponsor. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Author information

Contributions

M.H., A.Y.N., A.Y.H., and G.H.T. contributed to the study design. M.H. and C.B. were responsible for data collection. P.R. and A.Y.H. ran the experiments and created the figures. G.H.T., P.R., and A.Y.H. contributed to the analysis. G.H.T., A.Y.H., and M.P.T. contributed to the data interpretation and to the writing. G.H.T., M.P.T., and A.Y.N. advised and A.Y.N. was the senior supervisor of the project. All authors read and approved the submitted manuscript.

Corresponding author

Correspondence to Awni Y. Hannun.

Ethics declarations

Competing interests

M.H. and C.B. are employees of iRhythm Technologies, Inc. G.H.T. is an advisor to Cardiogram, Inc. M.P.T. is a consultant to iRhythm Technologies, Inc. None of the other authors have potential conflicts of interest.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Deep Neural Network architecture.

Our deep neural network consisted of 33 convolutional layers followed by a linear output layer into a softmax. The network accepts raw ECG data as input (sampled at 200 Hz, or 200 samples per second), and outputs a prediction of one out of 12 possible rhythm classes every 256 input samples.

Extended Data Fig. 2 Receiver operating characteristic curves for deep neural network predictions on 12 rhythm classes.

Individual cardiologist performance is indicated by the red crosses and averaged cardiologist performance is indicated by the green dot. The line represents the ROC curve of model performance. AF, atrial fibrillation/atrial flutter; AVB, atrioventricular block; EAR, ectopic atrial rhythm; IVR, idioventricular rhythm; SVT, supraventricular tachycardia; VT, ventricular tachycardia. n = 7,544, where each of the 328 30-second ECGs received 23 sequence-level predictions.
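The sample count in this caption follows from the sampling rate and output stride given for the network architecture. The arithmetic can be checked with the short sketch below, which assumes that trailing samples too few to fill a complete 256-sample window yield no prediction; the variable names are illustrative only.

```python
SAMPLE_RATE_HZ = 200       # input sampling rate stated for the network
SAMPLES_PER_OUTPUT = 256   # one prediction per 256 input samples
RECORD_SECONDS = 30        # length of each test ECG
N_TEST_RECORDS = 328       # number of test ECGs

# Time covered by one output prediction: 256 / 200 = 1.28 s.
seconds_per_output = SAMPLES_PER_OUTPUT / SAMPLE_RATE_HZ

# A 30-s record has 6,000 samples; only complete windows are counted (assumption).
samples_per_record = RECORD_SECONDS * SAMPLE_RATE_HZ
outputs_per_record = samples_per_record // SAMPLES_PER_OUTPUT

# Total sequence-level predictions across the test set.
total_predictions = N_TEST_RECORDS * outputs_per_record

print(seconds_per_output, outputs_per_record, total_predictions)
```

With these values, each record yields 23 predictions and the test set yields 7,544 in total, matching the n reported above.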

Source data

Source data Fig. 1

Sensitivity, specificity and PPV values at different operating points as well as the individual and average cardiologist metrics for the arrhythmias in the figure.

Source data Fig. 2

Absolute confusion counts between arrhythmias for both the model and the cardiologists.

Source data Extended Data Fig. 2

Sensitivity and specificity values at different operating points as well as the individual and average cardiologist metrics for all of the arrhythmias.

About this article

Cite this article

Hannun, A.Y., Rajpurkar, P., Haghpanahi, M. et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med 25, 65–69 (2019). https://doi.org/10.1038/s41591-018-0268-3
