Deep learning predicts prevalent and incident Parkinson’s disease from UK Biobank fundus imaging

Parkinson’s disease is the world’s fastest-growing neurological disorder. Research to elucidate the mechanisms of Parkinson’s disease and automate diagnostics would greatly improve the treatment of patients with Parkinson’s disease. Current diagnostic methods are expensive and have limited availability. Considering the insidious and preclinical onset and progression of the disease, a desirable screening should be diagnostically accurate even before the onset of symptoms to allow medical interventions. We highlight retinal fundus imaging, often termed a window to the brain, as a diagnostic screening modality for Parkinson’s disease. We conducted a systematic evaluation of conventional machine learning and deep learning techniques to classify Parkinson’s disease from UK Biobank fundus imaging. Our results suggest Parkinson’s disease individuals can be differentiated from age and gender-matched healthy subjects with 68% accuracy. This accuracy is maintained when predicting either prevalent or incident Parkinson’s disease. Explainability and trustworthiness are enhanced by visual attribution maps of localized biomarkers and quantified metrics of model robustness to data perturbations.


Introduction
The manifestation of these symptoms is pathologically characterized by the significant loss of dopaminergic neurons in the substantia nigra. 4 An estimated one million individuals in the United States have PD, leading to nearly $50 billion a year in economic burden. 1 Notably, this financial burden consists not only of direct medical costs but also indirect influences such as necessitated family care and social welfare. The World Health Organization estimates that the prevalence of PD has doubled in the last 25 years, while the number of deaths caused by PD has increased by over 100% since 2000, largely due to the lack of effective intervention amid the growth of the elderly population. 2 Innovations in our understanding of the pathology of PD and the development of early diagnostic systems are needed to address the global concerns arising from PD. Systematic diagnostic evaluation of PD currently struggles due to the lack of early biomarkers and of balance between specificity and sensitivity. 3 Indeed, cardinal motor symptoms fall within the umbrella of Parkinsonism, while non-motor indicators are symptomatic of numerous neurodegenerative diseases. 5 Risk factors such as age, gender, and environmental toxin exposure are not specific to PD. Differential diagnosis frameworks established by the UK's Parkinson's Disease Society Brain Bank and the International Parkinson and Movement Disorder Society 6 are the current standards for evaluation. However, these checklists require increasing amounts of exclusion and evidence, including response to dopaminergic therapy (levodopa), 7 occurrence of dyskinesia, 8 and even DaTscan imaging 9 to conclude a definitive PD diagnosis. A biological definition necessitating additional testing is also being entertained. Moreover, early or atypical PD complicates the observability of cardinal signs, leading to significantly reduced diagnostic accuracy. 10,11 Thus, current diagnostic indicators lack solid predictive power, inflating diagnostic expense, time, and subjectivity while limiting availability and enrollment in therapeutic or disease-modifying trials.
Dopamine plays a complex role in visual pathway processing, justifying findings of visual dysfunction in PD individuals. 16 Moreover, substantial evidence has revealed large temporal gaps between the onset of PD and observable symptoms, motivating the retina as a prodromal PD biomarker. 5 Clinical studies have suggested retinal layer thinning and reductions in microvasculature density in PD patients, primarily through optical coherence tomography (OCT) and optical coherence tomography angiography (OCT-A). 17,18 Nonetheless, clinical findings concerning both retinal degeneration and disease-specific information (disease duration, disease severity, etc.) are not always consistent, 19,20 demanding further studies to bolster retinal diagnostic power.
Artificial intelligence (AI) algorithms are efficient diagnostic tools through their ability to identify, localize, and quantify pathological features, as evident from their success in diverse retinal disease tasks. 21,22 In particular, supervised (labeled) algorithms employed in these tasks can generally be divided into two categories: conventional machine learning algorithms and deep learning models. Conventional machine learning models create (non)linear decision boundaries by rules of inference. Deep learning models, on the other hand, have risen as powerful tools due to their ability to learn meaningful representations and extract subtle features during training. Learning retinal biomarkers of PD demands an intricate understanding of the structural degeneration of the retinal vasculature, a task unfeasible for even experienced ophthalmologists and neurologists. To address this challenge, we propose using AI algorithms to extract the complex relationships existing within the global and local spatial levels of the retina.
We provide one of the first comprehensive artificial intelligence studies of PD classification from fundus imaging. Our key objective is realized by systematically profiling classification performance across different stages of Parkinson's disease progression, namely incident (consistent with pre-symptomatic and/or prodromal) PD and prevalent PD. Improving upon related works, we maximize the diagnostic capacity of AI algorithms by forgoing any external quantitative measures or feature selection methods. Finally, we assess diagnostic consistency and robustness through extensive experimentation with both conventional machine learning and deep learning approaches, together with a post-hoc spatial feature attribution analysis. Overall, this work enables future research in the development of efficient diagnostic technology and early disease intervention.

Study Design and Clinical Characteristics
This study draws from the UK Biobank, a biomedical database of over 500,000 participants aged 40-69 recruited from 2006 to 2010. 23 As of October 2019, 175,824 fundus images from 85,848 participants were available along with broad clinical health measures. Most subjects have two retinal photographs (left and right eye), with a minority having an additional follow-up imaging session. From this population, we identified 585 fundus images from 296 unique subjects with PD. Following manual image quality selection guidelines, we retained 123 fundus images from 84 unique participants. To carry out an unbiased binary classification framework, we matched each PD image according to the subjects' age and gender, yielding a healthy control cohort of 123 fundus images from 84 subjects. This constitutes our binary-labeled overall dataset of 246 fundus images (123 PD, 123 HC) from 168 subjects (84 PD, 84 HC). Lastly, we form two subsets of the data: a prevalent dataset of 154 fundus images (77 PD, 77 HC) from 110 subjects (55 PD, 55 HC) and an incident dataset of 92 fundus images (46 PD, 46 HC) from 58 subjects (29 PD, 29 HC). Notably, we define the diagnostic gap as the date of diagnosis minus the date of image acquisition, where a negative value indicates a PD diagnosis before fundus image acquisition (prevalent PD) and a positive value indicates a PD diagnosis after fundus image acquisition (incident PD). The data collection pipeline is summarized in Figure 1. Risk factors of Parkinson's disease have been extensively studied, [24][25][26] including age, gender, ethnicity, Townsend deprivation indices, alcohol consumption, history of obesity-diabetes, history of stroke, and psychotropic medication usage. Moreover, the effects of Parkinson's disease have been associated with visual symptoms, from which we acquire a history of diagnosed eye problems and visual acuity measures. We detail the statistical analyses of subject demographics, visual measures, and covariates of our study population in Table 1.

Model Design
Conventional machine learning and deep learning models were systematically studied for their performance in PD diagnosis, a binary classification task. We evaluate the performance of each AI model by five randomized repetitions of stratified five-fold cross-validation (at the subject level) using several classification metrics (AUC, accuracy, PPV, NPV, sensitivity, specificity, and F1-score). Given the insufficient amount of longitudinal data, we treat each image as a separate sample without data leakage, rather than performing a longitudinal analysis. The conventional machine learning models provide a classification performance baseline, for which we utilize Logistic Regression, Elastic-Net, and SVMs with linear and radial basis function kernels. In parallel, we evaluate the performance of popular deep learning architectures including AlexNet, VGG-16, GoogLeNet, Inception-V3, and ResNet-50. We follow traditional guidelines such as image normalization to the training data for our machine learning models, and ImageNet normalization, spatial augmentations, and early stopping for our deep learning models. The detailed guidelines are outlined in Methods ("Data Pre-processing and Model Training").

Visualization and Explainability
Qualitative explainability is visualized through guided backpropagation on our deep learning models, revealing that the models distinguish subtle features of the retinal anatomy (Figure 3). To verify that the extracted features are consistent with those recognized as retinal biomarkers of neurodegeneration in PD, we use the AutoMorph 27 deep learning segmentation module to generate a map of important retinal structures, namely the arteries, veins, optic cup, and optic disc, along with a manually marked fovea. [28][29][30] As such, the overlay of the guided backpropagation map and segmentation map yields a visual comparison of the consistency between features used for classification and potential retinal biomarkers of PD. Moreover, we reinforce the qualitative explainability by quantifying the robustness of our deep learning models to data perturbations through an infidelity and sensitivity analysis. The definitions of the infidelity and sensitivity measures are outlined in Methods ("Explainability Evaluation"). In this step, we run the guided backpropagation attribution across N = 50 perturbations at test time with noise drawn from a normal distribution N(0, 0.01²). The results of the infidelity and sensitivity analysis are averaged across the repeated cross-validation protocol and summarized in Figure 4. Our empirical results suggest that AlexNet is the most robust to data perturbations while simultaneously the most accurate, according to its lower infidelity and sensitivity measures. Notably, VGG-16 has robustness measures similar to AlexNet with lower classification performance, possibly owing to its large parameter complexity. Predictions made by the other deep learning architectures, including GoogLeNet, Inception-V3, and ResNet-50, were strongly influenced by data perturbations.
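The infidelity measure can be illustrated with a small self-contained numpy sketch (a Monte-Carlo estimate with Gaussian perturbations, as in the setup above; this is an illustration, not our actual pipeline). For a linear model whose attribution is its exact gradient, the perturbation term is explained perfectly, so the infidelity is zero by construction.

```python
import numpy as np

def infidelity(f, grad_f, x, n_perturb=50, sigma=0.01, seed=0):
    """Monte-Carlo estimate of explanation infidelity:
    E_I[(I . attr - (f(x) - f(x - I)))^2] with I ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    attr = grad_f(x)                                  # attribution map (here: gradient)
    errs = []
    for _ in range(n_perturb):
        I = rng.normal(0.0, sigma, size=x.shape)      # input perturbation
        errs.append((I @ attr - (f(x) - f(x - I))) ** 2)
    return float(np.mean(errs))

# Linear model: the gradient attribution explains it perfectly -> infidelity ~ 0.
w = np.array([0.5, -1.2, 2.0])
f = lambda x: w @ x
grad_f = lambda x: w
x = np.array([1.0, 2.0, 3.0])
print(infidelity(f, grad_f, x))
```

For a deep network the attribution (e.g., guided backpropagation) only approximates the local behavior of `f`, and the residual squared error is exactly what the infidelity score quantifies.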

Feature Engineering Analysis
Feature engineering approaches are generally devised to enhance AI outcomes, particularly for conventional machine learning models, given the large complexity of the feature space. Deep learning models, by design, can achieve strong performance with minimal data pre-processing. We explore the effect of feature engineering on conventional machine learning models with two changes to the input fundus images: (1) gray-scale color conversion (reducing the dependence on color), and (2) vessel segmentation via AutoMorph (emphasizing the retinal vasculature). The latter approach also provides a useful comparison with that of Tian et al., who utilized the technique for Alzheimer's disease detection. The results are presented in Supplementary Table S4. Conventional machine learning models improve with simple color conversion, while declining when restricted to the retinal vasculature. The reduction in performance using the vessel segmentation algorithm hints that essential diagnostic features exist across different regions of the eye (e.g., the optic cup and fovea).
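The gray-scale conversion in (1) is a standard luminance weighting; a minimal numpy sketch follows (the ITU-R BT.601 luma coefficients used here are an assumption for illustration, not necessarily the exact weights of our pipeline).

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB fundus image to an (H, W) luminance map,
    reducing the model's dependence on color information."""
    weights = np.array([0.299, 0.587, 0.114])   # ITU-R BT.601 luma weights
    return rgb @ weights

img = np.ones((256, 256, 3))          # dummy white image with values in [0, 1]
gray = to_grayscale(img)
print(gray.shape)                      # prints (256, 256)
```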

Model Covariate Analysis -Diagnostic Gap and Gender
We investigate the influence of Parkinson's disease progression on model performance. Parkinson's disease progression can be expressed by the diagnostic gap, thereby serving as a proxy measure of disease severity. Treating each image independently, we divided our Parkinson's subjects into four quartiles according to the diagnostic gap (years) and examined the performance of our best model, AlexNet, compiled over our repeated five-fold cross-validation. Sensitivity measures in the Parkinson's group did not exhibit monotonic relationships with the diagnostic gap or consistency across dataset subtypes (see Supplementary Table S1). One explanation for this result is discrepancies in the diagnostic gap: the dates of diagnosis do not coincide with the true dates of disease onset, as Parkinson's disease is progressive. Alternatively, one could argue that model performance is contingent not strictly on the diagnostic gap but rather on the presence of imaging biomarkers in individual subjects.
Gender is a known risk factor of Parkinson's disease. 31,32 We investigated a potential bias in our deep learning models due to gender differences in the retina. We compiled the performance of our AlexNet model over all subjects, Parkinson's-specific data, and healthy-control-specific data, and conducted a series of Chi-Square tests of independence (see Supplementary Table S2). In cases where the frequency of observations was less than or equal to five, we conducted a Fisher's exact test instead. We found no statistically significant association at the p < 0.05 level for any experiment or data subtype.
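The test-selection rule above (Fisher's exact test when expected counts fall at or below five) can be sketched with `scipy.stats` on a hypothetical 2×2 gender-by-prediction contingency table; the counts below are illustrative, not our study data.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def independence_test(table):
    """Chi-square test of independence on a 2x2 table, falling back to
    Fisher's exact test when any expected cell count is <= 5."""
    table = np.asarray(table)
    _, p, _, expected = chi2_contingency(table, correction=False)
    if (expected <= 5).any():
        _, p = fisher_exact(table)
        return "fisher", p
    return "chi2", p

# Hypothetical counts: rows = male/female, cols = correct/incorrect prediction.
print(independence_test([[40, 20], [35, 25]]))   # large expected counts -> chi-square
print(independence_test([[3, 2], [4, 1]]))       # small expected counts -> Fisher
```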

Discussion
This work demonstrates that deep neural networks can be trained to detect Parkinson's disease in retinal fundus images with promising performance. Our model can predict incident Parkinson's disease ahead of formal diagnosis at demonstrated sensitivity levels of 80.0% from 0 to 3.93 years, 80.0% from 3.93 to 5.07 years, 93.33% from 5.07 to 5.57 years, and 81.67% from 5.57 to 7.38 years. These results indicate a potential pathway for early disease intervention. Automated deep neural networks show strong promise to assist and complement ophthalmologists in biomarker identification and high-throughput evaluation.
Artificial intelligence evaluation of Parkinson's disease through the retina has rarely been applied. Hu et al. 33 trained a deep learning model to evaluate the retinal age gap as a predictive marker for incident Parkinson's disease using fundus images from the UK Biobank, showing statistical significance and a predictive AUC of 0.71. Nunes et al. 34 used optical coherence tomography data to compute retinal texture markers and trained a deep learning model with median sensitivities of 88.7%, 79.5%, and 77.8% for healthy controls, Parkinson's disease, and Alzheimer's disease, respectively. However, these works neither provided a comprehensive comparison of conventional machine learning and deep learning methods on this problem nor offered insights into the explainability of their models. We extend these works by treating the entire fundus image as a diagnostic modality, comprehensively evaluating a broad spectrum of conventional machine learning and deep learning methods, and shedding light on explainability both in image space and at the algorithm level. Our work lays a foundation for future exploration in this direction and serves as a reference for algorithm selection in terms of both performance and explainability. Related works have focused specifically on Alzheimer's disease, e.g., Tian et al. 35 and Wisely et al., 36 but not Parkinson's disease. Clinical studies in the field have yielded statistical differences in the retinal layers between PD and HC subjects but lack evidence of diagnostic power. Further deep learning work is needed to build stronger classification performance and understanding of retinal biomarkers. In the future, a multi-modal model utilizing optical coherence tomography, fundus autofluorescence, and/or electronic health records is a promising direction for Parkinson's disease analysis.
This study has some limitations. First, the size of our dataset could be enlarged to further capture the wide presentations of Parkinson's disease. Moreover, the data are derived from the UK population, and future studies are needed to evaluate whether these models generalize to other populations. So far, public datasets containing both Parkinson's disease subjects and fundus images (as well as patient health records) apart from the UK Biobank are not available, so new datasets including both retinal images and PD diagnoses in a larger population would be helpful. A deeper study would investigate the severity of Parkinson's disease, e.g., by the MDS-Unified Parkinson's Disease Rating Scale (MDS-UPDRS); in this work, severity was (weakly) proxied by the diagnostic gap. Furthermore, this research has been restricted to Parkinson's disease, and it remains unclear whether different eye diseases or neurodegenerative diseases (e.g., Alzheimer's disease) share identical biomarkers or degeneration patterns. The next major research question is whether such model explanations are consistent with and/or able to guide the grading of ophthalmologists, a major goal of clinical translational research. This matter is further complicated because the visual biomarkers of Parkinson's disease are less well understood than those of common eye diseases such as glaucoma. Prospective assessments of retinal imaging coupled with biological details and clinical phenotyping are needed to provide insight into the use and implementation of these techniques. These limitations necessitate future work to ensure the trustworthiness of artificial intelligence models in a clinical setting.
Deep learning models outperformed conventional machine learning models in predicting Parkinson's disease from retinal fundus images. We demonstrate that deep learning models can diagnose prevalent and incident PD subjects with nearly equal accuracy and with robustness to image perturbations, paving the way for early treatment and intervention. Further studies are warranted to verify the consistency of Parkinson's disease evaluation, to enhance our understanding of retinal biomarkers, and to incorporate automated models into clinical settings.

UK Biobank Participants
The UK Biobank (UKB) is one of the largest biomedical databases, recruiting over 500,000 individuals aged 40-69 years at baseline through assessment centers in the United Kingdom in 2006-2010. The methods by which these data were acquired have been described elsewhere. 23 Diverse patient health records include demographic, genetic, lifestyle, and health information. Comprehensive physical examinations as well as ophthalmic examinations were conducted for further analysis. Health-related events were determined using data linkage to the Hospital Episode Statistics (HES), Scottish Morbidity Record (SMR01), Patient Episode Database for Wales (PEDW), and death registers.

Parkinson's Disease and Definitions
Parkinson's disease was determined from hospital admission data in the United Kingdom, national death register data, and self-reported data. We consider as prevalent PD subjects those diagnosed prior to baseline assessment and as incident PD subjects those diagnosed following baseline assessment. Prevalent PD subjects were labeled according to hospital admission electronic health records based on International Classification of Diseases (ICD-9, ICD-10) codes or self-reports. Incident PD subjects were labeled according to either the ICD codes or the death registry. The earliest recorded diagnostic date takes priority in the case of multiple records. If PD was recorded in the death register only (diagnosed post-mortem), the date of death is used as the date of diagnosis. We define the diagnostic gap as the date of diagnosis minus the date of image acquisition, where a negative value indicates a PD diagnosis prior to fundus image acquisition (prevalent PD) and a positive value indicates a PD diagnosis after fundus image acquisition (incident PD). We acquire our PD labels from UKB Field 42032. Further details are available in the UKB documentation. 37
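The diagnostic-gap labeling described above can be written down directly; a minimal sketch with illustrative dates (not actual study records):

```python
from datetime import date

def diagnostic_gap_years(acquisition: date, diagnosis: date) -> float:
    """Diagnostic gap in years: date of diagnosis minus date of image acquisition.
    Negative -> diagnosed before imaging (prevalent PD);
    positive -> diagnosed after imaging (incident PD)."""
    return (diagnosis - acquisition).days / 365.25

def pd_label(gap_years: float) -> str:
    return "prevalent" if gap_years < 0 else "incident"

# Illustrative example: imaged mid-2010, diagnosed mid-2014 -> incident PD.
gap = diagnostic_gap_years(date(2010, 6, 1), date(2014, 6, 1))
print(round(gap, 2), pd_label(gap))   # 4.0 incident
```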

Ophthalmic Measures
In the UKB eye and vision consortium, the ophthalmic assessment included (1) questionnaires of past ophthalmic and family history, (2) quantitative measures of visual acuity, refractive error, and keratometry, and (3) image acquisition including spectral-domain optical coherence tomography (SD-OCT) of the macula and a disc-macula fundus photograph. In our study, we acquire diagnostic fields of eye problems (glaucoma, cataracts, diabetes-related, injury/trauma, macular degeneration, etc.), visual acuity measured as the logarithm of the minimum angle of resolution (LogMAR), and fundus photographs. Fundus photographs were acquired using a Topcon 3D OCT-1000 Mark II system. The system has a 45° field angle, a scanning range of 6 mm × 6 mm centered on the fovea, an acquisition speed of 18,000 A-scans per second, and 6 µm axial resolution. The details of the eye and vision consortium have been described in other studies. 38

Study Population and Summary Statistics
A total of 175,824 fundus images from 85,848 subjects were available in the UKB (as of October 2019). Among this population, we found 585 fundus images from 296 subjects with PD. Image quality selection was conducted in multiple phases: (1) the deep learning image quality selection module of AutoMorph 27 pretrained on Eye-PACS-Q, 39 (2) subjective assessment of the quality of the AutoMorph vessel, optic cup, and optic disc segmentations, and (3) manual grading of borderline images according to external guidelines, 21 covering artifacts, clarity, and field definition defects. The phases were conducted in sequence, with additional rounds used to adjudicate borderline candidates for inclusion or exclusion. In these guidelines, an image quality score is computed as the total of artifacts (-10 to 0), clarity (0 to +10), and field definition (0 to +10), where the optimal score is +20 and a score less than or equal to 12 is rejectable. The artifacts component evaluates the broad proportion of visibility impaired by artifacts in the image, clarity evaluates the relative visibility of veins and lesions, and field definition evaluates the broad field of view of key retinal structures (e.g., the optic cup and disc, and fovea). Of note, three images categorized as ungradable by AutoMorph were moved into our dataset on the basis of sufficient manual grading and visibility of the retinal vasculature. Following manual selection, a total of 123 usable Parkinson's disease images from 84 subjects met our criteria for inclusion. For each PD image, a healthy control (HC) with no history of PD was matched according to age and gender to prevent covariate bias, using the aforementioned image quality selection guidelines. All other fundus images and corresponding subjects were excluded. This constitutes our binary-labeled overall dataset of 246 fundus images (123 PD, 123 HC) from 168 subjects (84 PD, 84 HC). Lastly, we form two subsets of the data: a prevalent dataset of 154 fundus images (77 PD, 77 HC) from 110 subjects (55 PD, 55 HC) and an incident dataset of 92 fundus images (46 PD, 46 HC) from 58 subjects (29 PD, 29 HC).

Table S3. Computation time estimates of five-fold cross-validation for training and testing of models on color fundus images, recorded as mean and standard deviation. The number of parameters with respect to a 256 × 256 × 3 RGB image input with 2 class outputs is provided for reference relative to model complexity. Estimates were obtained on the University of Florida HiPerGator with 2 CPU cores and 1 NVIDIA A100 GPU.
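The quality-scoring rule can be expressed directly; a sketch of the published grading scale (the function names are ours, not part of the AutoMorph module):

```python
def image_quality_score(artifacts: int, clarity: int, field_definition: int) -> int:
    """Total quality score per the grading guidelines:
    artifacts in [-10, 0], clarity in [0, 10], field definition in [0, 10].
    The optimal score is +20."""
    assert -10 <= artifacts <= 0 and 0 <= clarity <= 10 and 0 <= field_definition <= 10
    return artifacts + clarity + field_definition

def is_rejectable(score: int) -> bool:
    """A total score of 12 or lower is rejectable."""
    return score <= 12

print(image_quality_score(0, 10, 10), is_rejectable(20))    # 20 False
print(image_quality_score(-5, 8, 7), is_rejectable(10))     # 10 True
```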

Figure 1 .
Figure 1. Data collection pipeline from the UK Biobank. Instances in parentheses represent an equal balance of Parkinson's disease and healthy control subjects. Multiple quality selection phases, arising from AutoMorph and manual image grading, were used as additional inclusion criteria for our dataset. In total, we have the overall dataset of PD subjects with age- and gender-matched healthy controls, and two subsets corresponding to prevalent and incident subjects.

Figure 2 .
Figure 2. Box plots and ROC curves of the Parkinson's disease classification models. The models are evaluated over five randomized repetitions of the five-fold stratified cross-validation protocol. The AUC scores are listed in the legend.

Figure 3 .
Figure 3. Attribution correspondence of retinal features. In the first column, an artery-vein map (red and blue, respectively) is combined with the optic cup (teal) and optic disc (yellow) generated by the AutoMorph deep learning segmentation module. A white dashed line marks an estimate of the foveal region. In the third column, a predicted attribution map is generated using the guided backpropagation algorithm on top of the AlexNet model. The intersection of the salient features with the segmentation is shown in the last column. The images represent the left (top) and right (bottom) eyes of the same subject, demonstrating distinct feature distributions for prediction.

Figure 4 .
Figure 4. Infidelity and sensitivity comparison among the different models. The logarithm of the infidelity score is shown due to the large range of scores.

Table 2 .
Classification results on the Parkinson's disease datasets. For each model, the mean of each performance metric with its 95% confidence interval is provided over five randomized repetitions of five-fold stratified cross-validation. The best model is highlighted in red and the best-performing conventional machine learning model in blue.