Electrocardiographic changes predate Parkinson’s disease onset

Autonomic nervous system involvement precedes the motor features of Parkinson’s disease (PD). Our goal was to develop a proof-of-concept model for identifying subjects at high risk of developing PD by analysis of cardiac electrical activity. We used standard 10-s electrocardiogram (ECG) recordings of 60 subjects from the Honolulu Asia Aging Study including 10 with prevalent PD, 25 with prodromal PD, and 25 controls who never developed PD. Various methods were implemented to extract features from ECGs including simple heart rate variability (HRV) metrics, commonly used signal processing methods, and a Probabilistic Symbolic Pattern Recognition (PSPR) method. Extracted features were analyzed via stepwise logistic regression to distinguish between prodromal cases and controls. Stepwise logistic regression selected four features from PSPR as predictors of PD. The final regression model built on the entire dataset provided an area under receiver operating characteristics curve (AUC) with 95% confidence interval of 0.90 [0.80, 0.99]. The five-fold cross-validation process produced an average AUC of 0.835 [0.831, 0.839]. We conclude that cardiac electrical activity provides important information about the likelihood of future PD not captured by classical HRV metrics. Machine learning applied to ECGs may help identify subjects at high risk of having prodromal PD.

www.nature.com/scientificreports/

In this manuscript, we hypothesized that early autonomic features of PD are detectable using machine learning, and tested this hypothesis using standard 10-s ECGs collected from participants in the prospective Honolulu-Asia Aging Study (HAAS).

Results
Cohort characteristics. All participants were Japanese American males with characteristics described in Table 1. The age at time of ECG followed a normal distribution for all three subject groups: controls (Kolmogorov-Smirnov test (KS) p = 0.44), prodromal PD (KS p = 0.14) and prevalent PD (KS p = 0.69). There were no significant differences in mean age at the time of ECG between those with prevalent PD, prodromal PD or controls (ANOVA, p = 0.35). Among those with prodromal PD, the mean duration from ECG until PD diagnosis was 4.3 years (standard deviation (SD) 2.4). Among prevalent cases, ECGs were recorded on average 5.4 years (SD 2.5) after first diagnosis of PD. In our cohort, 6 of 25 controls, 5 of 25 prodromal PD cases, and 1 of 10 prevalent PD cases had diabetes.
Heart rate variability metrics. For each ECG, we calculated nine HR characteristics: mean, median, standard deviation, kurtosis, skewness, minimum, maximum, range, and coefficient of variation. Table 2 summarizes these HR characteristics for prodromal PD cases, controls, and prevalent PD cases.
Signal processing features. The feature selection step revealed 25 features that differed significantly between prodromal cases and controls (Mann-Whitney U test, p < 0.05). Of those features, 19 were related to the Fast Fourier Transform and 2 to signal complexity; the selected set also included features derived from the continuous wavelet transform with various parameters. Some signal energy and quantile mass of time series features were also significantly different between the two groups (Mann-Whitney U test, p < 0.05). These features were then analyzed using logistic regression. However, the binary classification results were not favorable, and we therefore did not pursue these features any further. Using 25 signal processing features and PSPR, the model yielded an average five-fold cross-validation sensitivity and specificity of 0.62 and 0.61, respectively.
PSPR features. Figure 1 summarizes the values of the 10 PSPR features calculated for 25 prodromal PD subjects and 25 controls. None of the 10 PSPR features followed a normal distribution (KS p < 0.01). Among the ten PSPR features, three differed significantly between controls and prodromal PD cases (Mann-Whitney U test, p < 0.05). The logistic regression model obtained using all 50 ECGs provided a sensitivity of 84.00% and a specificity of 80.00% when a cutoff value of 0.5 was used to convert predicted probabilities into binary class predictions. Note that we did not include age or other comorbid conditions in the model, since our goal was to investigate the predictive value of ECG features and because there was no significant difference between the age of cases and controls (p > 0.05; both ANOVA and Mann-Whitney U test).
We also implemented cross-validated logistic regression models to show whether the extracted ECG features provide generalizable results. Figure 2 summarizes the k-fold cross-validation results in terms of average AUC with 95% CI obtained at different values of k.
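As an illustration of how a 0.5 cutoff turns predicted probabilities into sensitivity and specificity, the following minimal Python sketch may help. The probabilities are invented for illustration only (they are not study data); they are chosen so that 21 of 25 cases and 20 of 25 controls are classified correctly, which corresponds to the reported 84% and 80%.

```python
def sensitivity_specificity(probabilities, labels, cutoff=0.5):
    """Return (sensitivity, specificity); labels: 1 = prodromal PD, 0 = control."""
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= cutoff else 0  # probability -> binary class
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities (NOT study data).
case_probs = [0.8] * 21 + [0.3] * 4      # 25 prodromal PD subjects
control_probs = [0.2] * 20 + [0.7] * 5   # 25 controls
probs = case_probs + control_probs
labels = [1] * 25 + [0] * 25

sens, spec = sensitivity_specificity(probs, labels)
```

With 25 subjects per group, each misclassified subject changes sensitivity or specificity by 4 percentage points, which is one reason the cross-validated estimates matter at this sample size.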

Discussion
Early identification of prodromal PD is an essential step as we progress toward implementing disease-modifying therapeutic interventions. The current work took advantage of prospectively collected ECGs to develop predictive models to distinguish between control and prodromal PD subjects. Traditional heart rate variability metrics showed no significant difference between controls and prodromal PD subjects. Of the 794 signal processing features, 25 were selected using a univariate statistical approach, but their individual classification performance was poor, possibly due to the small sample size. Three of ten PSPR features measuring dissimilarity to prevalent PD subjects were statistically significantly smaller for prodromal PD compared to controls, suggesting that there are lower dissimilarities (or higher similarities) between the prodromal and prevalent PD groups in terms of how the electrical activity of the heart evolves from the beginning to the end of a given 10-s ECG. Specifically, these three PSPR features correspond to patterns two, eight and nine symbols long, where each symbol represents a 125 ms long section of an ECG downsampled to 8 Hz. In other words, 250 ms, 1,000 ms and 1,125 ms long subsections of ECGs showed significantly different patterns between controls and prodromal PD subjects.
… (1) and control ECGs (0) based on PSPR features. Note that PSPR features represent how a given ECG (from a prodromal PD subject or a control) differs (in dissimilarity, or distance) from the ECGs of subjects with prevalent PD. This implies that the dissimilarity between the ECGs of prodromal PD and prevalent PD subjects is smaller (more similar) than the dissimilarity between control and prevalent PD ECGs (less similar).
Finally, the stepwise logistic regression model using these 10 PSPR features provided a high classification performance. Furthermore, a cross-validation study indicated that the results may be generalizable to a cohort with similar characteristics. We note that claiming broader generalizability requires further external validation in a more diverse cohort. Moreover, other classification models, such as convolutional neural networks (CNNs), are suitable for the analysis of raw ECG signals. However, as a deep learning methodology, a CNN requires a large sample size and was therefore not implemented in this study.
Lewy pathology is found throughout the autonomic nervous system in PD 6. The dorsal motor nucleus of the vagus nerve is thought to be among the earliest affected structures in disease evolution 7, and pathology in sympathetic and parasympathetic ganglia and cardiac nerves and associated cardiac de-afferentation are consistently seen in early PD [8][9][10][11]. For this reason, cardiac sympathetic de-afferentation as measured by metaiodobenzylguanidine 6,7,12 (I-MIBG) scintigraphy serves as a supportive criterion for the clinical diagnosis of PD in the MDS-PD diagnostic criteria 13. Cardiac autonomic pathology and de-afferentation are also seen in association with incidental nigral Lewy bodies at post-mortem (ILB) 10, and as early as 2007 it was proposed that neurocardiologic testing might provide a biomarker for prodromal disease 14. However, MIBG scintigraphy is invasive and expensive, and is not a viable tool for population-level screening. Thus, the present work investigated whether the ubiquitous, standard 10-s 12-lead ECG might serve as a useful biomarker for prodromal PD.
Berg et al. 13 proposed a classification model that combines predictors of prodromal PD (REM sleep behavior disorder, olfactory impairment, hyperechogenicity of the substantia nigra) with epidemiologic risk factors for PD (sex, occupational exposure to pesticides or solvents, caffeine use, smoking, family history of PD). Our results suggest that early pathologic involvement of cardiac autonomic innervation might be detectable using standard 10-s ECGs in concert with machine learning tools. However, despite the supportive cross-validation implemented here, this work requires external validation in other cohorts.
Our study has some major limitations. Although our cross-validated results are promising, the sample size of 60 is very small and could be confounded by a variety of factors. Furthermore, our cohort only included men of Japanese-American descent. Future work will focus on validation of our results in larger and more diverse cohorts. Additionally, subjects with major cardiovascular diseases or those taking medications potentially affecting ECGs were excluded. The impact of these and other common comorbidities and medications on model performance requires further investigation in a larger cohort.
We conclude that the electrical activity of the heart carries important information about the onset of PD that can be detected with a standard 10-s ECG, but that classical heart rate variability metrics are relatively insensitive to early PD pathology. It is possible to capture additional informative data by sophisticated analysis of ECG recordings, and thereby identify subjects at high risk of developing PD. This work suggests that a standard 10-s ECG may serve as a universally accessible, non-invasive, and inexpensive biomarker of prodromal PD. Fast growing technological improvements around wearable devices with ECG tracing functionality may facilitate a broad implementation of such screening algorithms among high risk patients.

Methods
Study subjects: Honolulu-Asia aging study (HAAS). The Honolulu Heart Program prospective cohort study of cardiovascular disease started in 1965 with enrollment of 8,006 Japanese American men born between 1900 and 1919 and living on the island of Oahu 15. In 1991, HAAS was launched, shifting the focus towards neurodegenerative diseases of aging including PD. Environmental, lifestyle, and physical characteristics, including features associated with prodromal PD, were ascertained at baseline and at regular follow-up examinations over 50 years 3. The institutional review boards of Kuakini Medical Center and the Honolulu Veterans Affairs clinic reviewed and approved the study, and written informed consent was obtained from all participants. In addition, a sizeable proportion of participants have undergone post-mortem evaluations for PD-related neuropathology. For the current study, we included 60 individuals with technically good quality ECGs able to be accurately digitized, without arrhythmia or frank conduction abnormality (e.g., bundle branch block), with no history or evidence of myocardial infarction, and not taking beta-blockers or digoxin. The cohort comprised 10 subjects who had PD diagnosed prior to ECG recording ('prevalent cases'), 25 subjects without PD at time of ECG recording, but who developed PD within 1-5 years ('prodromal cases'), and 25 subjects without PD either at baseline or throughout follow-up ('controls'). Control subjects were free of CNS Lewy pathology, if neuropathology was available. This research was approved by the Loyola University Chicago Institutional Review Board (LU IRB number 212399) with exempt status. Although our manuscript is a secondary analysis of an existing database, HAAS, the original data collection was carried out by Kuakini Health Systems and was approved by the Kuakini Medical Center Institutional Review Board. All methods were carried out in accordance with relevant guidelines and regulations.
ECG data. Standard 12-lead 10-s resting ECGs were obtained during evaluations conducted from 1991 to 1993. Paper ECGs were scanned as TIFF files at 300 dpi. All ECGs were visually inspected for print quality, arrhythmia, or other significant aberrancies (e.g., recording noise, marked bundle branch block). One well-defined lead was selected for digitization using AMPS ECGscan 3.0 16.
Scientific Reports | (2020) 10:11319 | https://doi.org/10.1038/s41598-020-68241-6
Feature extraction. R peaks on the digital ECG recordings were identified and used to calculate heart rate (HR) characteristics (mean, median, standard deviation (SDNN), kurtosis, skewness, min, max, range, and coefficient of variation). Signal processing approaches including Fast Fourier Transform (FFT), signal complexity, and approximate entropy methods with different parameter settings were used. We also extracted features representing changes in ECG recordings using a novel method called Probabilistic Symbolic Pattern Recognition (PSPR) [17][18][19][20][21], as described below.
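The nine HR characteristics can be illustrated with a short Python sketch. It assumes that R-peak times are already available in seconds and that the statistics are taken over instantaneous heart rate in beats per minute; the text does not specify whether the statistics were computed on HR or on RR intervals, nor which kurtosis convention was used, so both choices here are assumptions. The R-peak times are hypothetical.

```python
import statistics


def hr_characteristics(r_peak_times_s):
    """Return nine summary statistics of instantaneous heart rate (bpm).

    r_peak_times_s: R-peak times in seconds, in ascending order.
    """
    # RR intervals between consecutive R peaks, converted to beats per minute.
    rr = [b - a for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]
    hr = [60.0 / x for x in rr]

    n = len(hr)
    mean = statistics.fmean(hr)
    sd = statistics.stdev(hr)  # sample standard deviation
    # Population central moments for skewness and (non-excess) kurtosis.
    m2 = sum((x - mean) ** 2 for x in hr) / n
    m3 = sum((x - mean) ** 3 for x in hr) / n
    m4 = sum((x - mean) ** 4 for x in hr) / n
    return {
        "mean": mean,
        "median": statistics.median(hr),
        "sd": sd,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else float("nan"),
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else float("nan"),
        "min": min(hr),
        "max": max(hr),
        "range": max(hr) - min(hr),
        "cv": sd / mean,  # coefficient of variation
    }


# Hypothetical R-peak times for one 10-s recording (NOT study data).
peaks = [0.40, 1.22, 2.05, 2.85, 3.70, 4.52, 5.31, 6.15, 6.98, 7.80, 8.63, 9.44]
features = hr_characteristics(peaks)
```

A 10-s recording yields only around a dozen beats, so higher-order statistics such as skewness and kurtosis are estimated from very few values.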
Signal processing features. We utilized the TSFresh Python library 22 , which included unique signal processing methods and their parameters, to extract 794 features from each of the ECG digital signals (control and prodromal group). Each of these features was used to further compare control and prodromal PD subjects using the Mann-Whitney U test, with significance defined at p < 0.05. To minimize potential errors from the converted digital signals, the same digital signals were validated from the ECG image data separately by two authors (AM and RK).
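The univariate screening step can be sketched in plain Python as follows. This is an illustrative stand-in, not the study's code: it assumes features have already been extracted into per-group dictionaries, and it computes the two-sided Mann-Whitney p-value with the normal approximation (no tie or continuity correction) rather than calling a statistics library. The function names and toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation to the U statistic."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))


def screen_features(cases, controls, alpha=0.05):
    """Return the names of features with p < alpha.

    cases / controls: dict mapping feature name -> list of per-subject values.
    """
    return [name for name in cases
            if mann_whitney_p(cases[name], controls[name]) < alpha]


# Toy data (NOT study data): one shifted feature, one overlapping feature.
cases = {"shifted": [10 + i for i in range(25)],
         "overlapping": [float(i) for i in range(25)]}
controls = {"shifted": [i for i in range(25)],
            "overlapping": [i + 0.5 for i in range(25)]}
selected = screen_features(cases, controls)
```

Because each of the 794 features is tested separately at p < 0.05 without correction, some selected features are expected to be false positives, which is consistent with treating this step as a screen rather than as evidence on its own.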
Probabilistic symbolic pattern recognition (PSPR). PSPR is a method to process sequential symbolic data in order to understand how a given single sequential data series evolves, and to compare multiple sequential data series regarding their behavior in time. To do that, PSPR derives a probabilistic model, or pattern transition behavior, of each sequential data series and then implements pairwise comparisons to calculate the Euclidean distance between these probabilistic models. When three series are compared to each other, the two series with the lower distance have more common behavior than two series with a higher distance 17. When PSPR is applied to real-valued numeric data, such as raw ECG data, each number is first represented with a symbol from a given alphabet with preset length. This discretization can be done either by using arbitrary thresholds or by utilizing domain knowledge. In order to use PSPR for feature extraction from a given data series, data from each series are compared against a set of reference data series. The determination of the reference series is problem specific. In this study, we used 10 prevalent PD subjects as reference data to compare data from 25 controls and 25 prodromal PD subjects.
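The general idea (discretize, model pattern transitions, measure Euclidean distance to reference series) can be illustrated with the simplified Python sketch below. It is not the PSPR implementation from the cited references: the equal-width binning, the conditional next-symbol model, the averaging over reference series, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def symbolize(series, n_symbols, lo, hi):
    """Map each real value to a symbol in 0..n_symbols-1 (equal-width bins)."""
    width = (hi - lo) / n_symbols
    return [min(n_symbols - 1, max(0, int((v - lo) / width))) for v in series]


def transition_model(symbols, pattern_len):
    """Estimate P(next symbol | preceding pattern) from one symbol sequence."""
    counts = defaultdict(Counter)
    for i in range(len(symbols) - pattern_len):
        pattern = tuple(symbols[i:i + pattern_len])
        counts[pattern][symbols[i + pattern_len]] += 1
    model = {}
    for pattern, nxt in counts.items():
        total = sum(nxt.values())
        for sym, c in nxt.items():
            model[(pattern, sym)] = c / total
    return model


def model_distance(a, b):
    """Euclidean distance between two models over the union of their keys."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def dissimilarity_features(series, references, n_symbols, max_pattern_len, lo, hi):
    """One feature per pattern length: mean distance to the reference series."""
    sym = symbolize(series, n_symbols, lo, hi)
    ref_syms = [symbolize(r, n_symbols, lo, hi) for r in references]
    feats = []
    for length in range(1, max_pattern_len + 1):
        m = transition_model(sym, length)
        dists = [model_distance(m, transition_model(r, length)) for r in ref_syms]
        feats.append(sum(dists) / len(dists))
    return feats


# Toy series of 80 samples (10 s at 8 Hz); NOT ECG data.
reference = [math.sin(0.5 * t) for t in range(80)]
similar = list(reference)
different = [math.sin(1.7 * t) for t in range(80)]

feats_similar = dissimilarity_features(similar, [reference], 9, 10, -1.0, 1.0)
feats_different = dissimilarity_features(different, [reference], 9, 10, -1.0, 1.0)
```

In this toy setup a series identical to the reference has zero dissimilarity at every pattern length, while a series with different dynamics has positive dissimilarity, which mirrors the interpretation that smaller feature values mean behavior closer to the reference group.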
Our previous analysis showed that PSPR performs best at 8 Hz ECG sampling frequency in problems such as detecting congestive heart failure 18, cardiac rhythm classification 20,21,23, atrial fibrillation prediction 24, and physiologic data analysis 25. Considering the proven PSPR performance at low sampling frequencies, we downsampled the original ECG signals from 500 to 8 Hz and ran PSPR for all parameter scenarios described in the Methods section. At each run, the PSPR method provided n_p features, where n_p is the maximum pattern length to model. We used these features to build a logistic regression model and calculated the area under the receiver operating characteristics curve (AUC). The AUC was maximized for the parameter combination n_s = 9 (number of symbols, or the alphabet length) and n_p = 10. We conducted the rest of the analysis using the 10 PSPR features extracted for this parameter setup at 8 Hz.
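To make the sampling arithmetic concrete: a 10-s recording at 500 Hz has 5,000 samples, and at 8 Hz it has 80 samples, each covering 125 ms. The text does not state which resampling method was used; the sketch below assumes simple block averaging, which is only one possible choice, and the function name is hypothetical.

```python
def downsample_mean(signal, fs_in, fs_out):
    """Downsample by averaging the input samples that fall in each output period.

    fs_in and fs_out are integer sampling rates in Hz, with fs_out <= fs_in.
    """
    n_out = int(len(signal) * fs_out / fs_in)
    sums = [0.0] * n_out
    counts = [0] * n_out
    for i, v in enumerate(signal):
        j = i * fs_out // fs_in  # output sample that covers input sample i
        if j < n_out:
            sums[j] += v
            counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy ramp signal standing in for a 10-s, 500 Hz recording (NOT ECG data).
ecg_500hz = [float(i) for i in range(5000)]
ecg_8hz = downsample_mean(ecg_500hz, 500, 8)
```

Because 500 / 8 = 62.5 is not an integer, the output periods alternate between 63 and 62 input samples. Block averaging smooths within each 125 ms window and therefore discards narrow waveform detail such as the QRS complex shape.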
Statistical analysis. We tested whether continuous variables were normally distributed using the Kolmogorov-Smirnov test. For normally distributed variables, we used analysis of variance (ANOVA) to test for differences between two or more categories. For non-normally distributed variables, we used the Mann-Whitney U test for two categories, or the Kruskal-Wallis test for more than two categories. Two-tailed p-values < 0.05 were considered significant.
PSPR-generated features were compared between groups using nonparametric tests and then analyzed within logistic regression. We extracted ECG features for controls and prodromal PD cases as described above and used them in a stepwise logistic regression model with backward elimination to distinguish prodromal PD from controls. To account for the small sample size and avoid overfitting, we implemented multiple k-fold cross-validation runs. The number of folds (k) was systematically increased from 2 to 24; for each k, we randomly split the data into k folds, built a stepwise logistic regression model using k - 1 folds of the data, and tested the model on the remaining fold. Repeating this process k times resulted in all predictions being obtained from out-of-sample data. Using these predictions, we calculated the AUC. This process was repeated 100 times for each k, with the final results summarized for each k as the mean AUC with a 95% confidence interval.
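The repeated k-fold procedure can be sketched in plain Python as below. Several parts are assumptions made for illustration: a plain gradient-descent logistic regression stands in for the stepwise model with backward elimination, the 95% confidence interval is taken as a normal-approximation interval of the mean AUC over repeats, and the data are synthetic. The function names are hypothetical.

```python
import math
import random


def predict(w, x):
    """Logistic model probability; the last weight is the intercept."""
    z = w[-1] + sum(wj * v for wj, v in zip(w, x))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Plain logistic regression by batch gradient descent (no stepwise selection)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold_auc(X, y, k, repeats, seed=0):
    """Mean out-of-sample AUC over repeated random k-fold splits, with a 95% CI."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        oos = [0.0] * len(y)  # every subject is predicted exactly once
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            w = fit_logistic([X[i] for i in train], [y[i] for i in train])
            for i in test:
                oos[i] = predict(w, X[i])
        aucs.append(auc(oos, y))
    mean = sum(aucs) / len(aucs)
    sd = math.sqrt(sum((a - mean) ** 2 for a in aucs) / (len(aucs) - 1))
    half = 1.96 * sd / math.sqrt(len(aucs))
    return mean, (mean - half, mean + half)


# Synthetic data (NOT study data): feature 0 is lower in cases, feature 1 is noise.
data_rng = random.Random(1)
X = ([[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(25)]
     + [[data_rng.gauss(2.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(25)])
y = [1] * 25 + [0] * 25
mean_auc, ci = repeated_kfold_auc(X, y, k=5, repeats=10)
```

The example uses k = 5 and 10 repeats to keep the run short; following the description above would mean looping k from 2 to 24 with 100 repeats each. Note that an interval built this way reflects variability across random splits of the same 50 subjects, not sampling uncertainty in the cohort itself.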