## Introduction

As digital health technology becomes more prevalent and larger amounts of complex health data are collected, machine learning (ML) methods provide epidemiologists with a robust way to analyze and interpret relevant patterns1,2. The increased availability of electronic health data and the use of machine learning models present major opportunities in healthcare for discovery and for improving patient safety and the quality of care for individuals with neurodegenerative diseases3,4,5,6. Moreover, digital health technology has become increasingly widespread in neurocognitive assessment7,8,9. Mobile device capabilities allow for the collection of more objective information (e.g., digital biomarkers) than is currently achievable with pen-and-paper tests10,11. In addition, digital health technology allows standardized health screenings and patient reported outcomes (PROs) to be implemented for individual evaluation12. Previous work has found that device-based sensors and/or user-device interactions used in digital assessments (e.g., accelerometry-based gait assessments, speech recognition systems, and PROs for healthcare) enhance the utility and quality of the collected data2. Further, combining objective metrics with patient reported outcomes allows relevant health information to be collected and all functional areas of neurocognition (e.g., motor, memory, speech, language, executive function, autonomic function, sensory, behavior, and sleep) to be monitored13. This paper focuses on individuals with Parkinson’s Disease (PD) because they may show impaired functionality in each of these functional areas14.

Mobile devices allow digital versions of standardized assessments (e.g., Montreal Cognitive Assessment (MoCA)15; Mini Mental State Examination (MMSE)16) and questionnaires (e.g., PDQ-3917) to be administered for the collection of objective digital biomarkers and relevant PROs18. Machine learning can then be applied to these feature sets to characterize an individual’s neurocognitive functionality and quality of life and to quantify disease progression13,19,20. As the volume of relevant health data increases, novel ways to interact with and extract meaning from the data emerge1,21. Machine learning is a key technique that has demonstrated the ability to translate large health data sets into actionable knowledge4,22. Specifically, supervised machine learning of health data has shown potential for disease prediction and classification23,24. In supervised machine learning problems, clinically relevant and objective features are necessary because the performance of these algorithms depends heavily on the quality of the input features25. Therefore, digital health systems should aim to increase the reliability and accuracy of patient reported data by combining it with objective data from mobile devices, while maintaining commonly used methods for monitoring patients’ short-term (e.g., day-to-day) changes in their condition and minimizing individual variability and/or bias from subjective reporting methods18,26,27,28,29.

The objective of this preliminary work was to use supervised machine learning classification to assess novel features (i.e., digital biomarkers) gathered from specifically designed tablet-based digital neurocognitive assessments as they relate to Parkinson’s Disease and its stages (Hoehn and Yahr Stages 1–5)30. Commonly used self-reported metrics (e.g., from specifically designed questionnaires) and clinically relevant functional movement assessments were also included in the supervised machine learning process. Decision tree classification was used to identify significant objective features (e.g., novel digital biomarkers) for both disease classification (i.e., whether an individual has PD) and stage classification (i.e., which stage of PD they are in). Finally, this work visualized individuals’ perceived neurocognitive capabilities (i.e., responses to subjective questionnaires) alongside their sensor-based neurocognitive functionality scores (i.e., from completion of functional assessments) for groups with and without PD, to further demonstrate the need for digital health systems and the digitally collected health features they provide.

## Methods

### Disease staging scales

Parkinson’s Disease rating scales assess the symptoms of the condition, providing information on its course and/or on an individual’s quality of life. Disease severity was collected in accordance with the Hoehn and Yahr Scale (H&Y)31 (an internationally used PD progression rating method for clinical practice) and the MDS-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS)32 (a scale developed to incorporate elements of existing scales into an efficient, flexible, and comprehensive means of monitoring both motor and non-motor PD symptoms)33.

Because of its strong clinimetric performance for motor assessment, its high correlation with MDS-UPDRS scores, its low intra-subject variance, and its concise summary of patient status, the H&Y staging scale was used to classify individuals in this preliminary work, maintaining heterogeneity and efficiency30,33,34. The stages of the H&Y scale are listed below:

• Stage 1: Symptoms are present on one side only (unilateral).

• Stage 2: Symptoms are present on both sides but no impairment of balance.

• Stage 3: Balance impairment and mild to moderate disease progression.

• Stage 4: Severe disability, but still able to walk or stand unassisted.

• Stage 5: Needing a wheelchair or bedridden unless assisted.

### Cohort

Seventy-five adults between the ages of 50 and 85 participated in this study, divided into two groups: individuals with a confirmed diagnosis of Parkinson’s Disease and age-matched healthy controls. The PD population included 50 individuals: 22 in confirmed early stages of PD (H&Y Stages 1 and 2), 9 in confirmed advanced stages of PD (H&Y Stages 3, 4, and 5), and 11 who did not know which stage of the condition they were in (i.e., their stage had not been communicated to them by a licensed clinician). A breakdown of the population is shown in Table 1. Slightly more than half of the individuals diagnosed with PD were female (n = 26 or 52%). The age-matched control population included 25 individuals, slightly less than half of whom (n = 12 or 48%) were female. Participants were recruited through advertisements, designed rehabilitation programs, physician and clinician referrals, spouses or caretakers of the diagnosed population, and prior studies from our laboratory. As the mean onset age for PD in the Western world is the early-to-mid 60s35, recruitment for this study was limited to individuals in the aforementioned groups aged 50 years or older. Participants were excluded if they were unable to provide informed consent or unable to speak and/or understand English (as all instructions and tests were in English). All methods in this study were performed in accordance with the relevant guidelines and regulations of the Institutional Review Board (IRB) for the protection of human subjects.
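The early versus advanced grouping used throughout this work (H&Y Stages 1 and 2 versus Stages 3, 4, and 5) can be written as a small labeling function. This is a sketch only: the function name `stage_group` is hypothetical, and returning `None` for participants whose stage is unknown assumes they simply receive no stage label, which this section does not state explicitly.

```python
from typing import Optional


def stage_group(hy_stage: Optional[int]) -> Optional[str]:
    """Map a Hoehn and Yahr stage (1-5) to the early/advanced grouping.

    Hypothetical helper: None means the participant's stage is unknown,
    and such participants are assumed here to get no stage label.
    """
    if hy_stage is None:
        return None
    if hy_stage in (1, 2):
        return "early"
    if hy_stage in (3, 4, 5):
        return "advanced"
    raise ValueError(f"Hoehn and Yahr stage must be 1-5, got {hy_stage!r}")


# Example with hypothetical stages for four participants.
print([stage_group(s) for s in [1, 2, 4, None]])
```

Keeping the mapping in one function makes the cut point between "early" and "advanced" explicit and easy to change if a different grouping is examined.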

### Qualitative assessment questionnaires

All participants were given a set of questions commonly administered in clinical settings for aging populations, in addition to questions asked specifically when neurodegenerative disease is suspected. The commonly administered questions covered how the individual felt in general, their energy level, their pain level, and their sleep quality, and asked them to rate their cognitive functions of memory, speech, motor, and executive function. During data collection, patients were also asked when, relative to taking their medication, their data was collected. The rationale for this parameter is that assessing a patient near trough medication levels captures the most extreme effects of PD with the least influence from medication. Finally, a PD-specific quality of life questionnaire, the PDQ-39, was used, in which the individual assesses their mobility, activities of daily living, emotional well-being, stigma, social support, cognition, communication, and bodily discomfort17. Each question of the standardized PDQ-39 begins with “Due to having Parkinson’s disease, how often during the last month have you...”. The control group was given a modified version of the PDQ-39, with ‘Due to having Parkinson’s disease’ removed so that each question reads ‘How often during the last month have you...’, to capture quality of life over the same functional areas. In this modified version, the single question specific to the PD population (‘How often during the last month have you felt you had to conceal your Parkinson’s from people?’) was removed.
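The control-group adaptation of the PDQ-39 described above (drop the PD-specific prefix, remove the PD-only item) can be sketched as a text transformation. The function name `control_version`, the exact prefix string, and the two example items are illustrative assumptions. The real questionnaire wording and item list are not reproduced here.

```python
PD_PREFIX = "Due to having Parkinson’s disease, "


def control_version(items, pd_only_items):
    """Adapt PDQ-39-style item stems for a control group (hypothetical helper).

    Items listed in pd_only_items are dropped; the PD-specific prefix is
    stripped from the rest and the first letter is re-capitalized.
    """
    adapted = []
    for item in items:
        if item in pd_only_items:
            continue  # this item only makes sense for people with PD
        if item.startswith(PD_PREFIX):
            stem = item[len(PD_PREFIX):]
            item = stem[0].upper() + stem[1:]
        adapted.append(item)
    return adapted


# Two example items in the format described in the text.
writing_item = PD_PREFIX + "how often during the last month have you had difficulty writing clearly?"
conceal_item = PD_PREFIX + "how often during the last month have you felt you had to conceal your Parkinson’s from people?"
print(control_version([writing_item, conceal_item], {conceal_item}))
```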

### Functional movement assessments

Participants in the PD group were also administered multiple functional movement assessments, both subjective and objective in nature, as part of regularly scheduled clinical measures. The Berg Balance Scale, Timed Up and Go (TUG), Sit to Stand (STS), and Six Minute Walk Test (6MWT) are commonly used functional movement assessments administered by clinicians (e.g., physical trainers or therapists) to assess functional performance36,37,38,39.

### Mobile application testing

Clinicians (e.g., physical trainers or therapists) administered to all participants a tablet-based neurocognitive assessment specifically designed for individuals with Parkinson’s Disease, which focused on user-device interactions to collect novel, objective metrics40. Clinician administration was used to maintain a controlled setting and to ensure that participants correctly understood and complied with the instructions3. Each participant completed mobile versions of 14 neurocognitive functional tests across the areas of motor, memory, speech, and executive function. These included single functional tests (i.e., focusing on only one area of neurocognition, such as motor or memory) and multifunctional tests (i.e., combining two or more single functional tests into one). The 14 tests collected 208 objective tablet-based digital biomarkers for every participant. The tests are described below.

• For a fine-motor tracing task the individual is instructed to use their index finger to trace a depicted shape (e.g., a circle).

• In a gross-motor task the user is to manipulate the tablet to “air”-trace a prompted shape (e.g., a square).

• For reaction tasks, the user is instructed to tap on the screen to interact with a set of targets.

• For a set of card matching tasks the user is to tap on depicted cards until all cards have been matched in pairs.

• For a set of speech-based tasks, the user is instructed to read a sentence and passage out loud, and name prompted objects.

• In a set of trail making tasks, the user is instructed to draw a line with their index finger connecting the shapes in increasing numerical order.

• A set of multifunctional tasks pairs a motor task (e.g., tracing or emulating an object) with a non-automatic speech task (e.g., listing the months of the year aloud in reverse order, from December to January).

• For an executive function/multifunctional task a digital version of the Stroop Word Color Test (SWCT)41 was utilized where the user was required to discern the difference between prompted colors and words and then speak the correct response.

• In an expanded multifunctional task (Narration Writer), the user was instructed to narrate a sentence (speech) while writing it word by word (motor), writing the same word being said aloud, in the space provided (executive function).

### Machine learning

Decision trees are widely used in classification problems, and many decision tree algorithms exist (e.g., CART, ID3, C4.5, CHAID, and J48)42. For this preliminary work, the Classification and Regression Tree (CART) algorithm was chosen for its high interpretability, its minimization of misclassification, and its diagnostic performance (CART is increasingly used in diagnosis and staging classification problems in medicine, especially when the underlying population is partitioned into a relatively small number of subgroups with distinct means)43,44,45,46,47. CART was also chosen because classification and regression trees are common supervised methods that are traditionally used to select optimal training samples for subsequent machine learning models43,48,49,50. This preliminary work used the optimized version of the CART algorithm implemented in the ‘Scikit-Learn’ Python module. The CART algorithm constructs binary trees using the feature and threshold that yield the largest information gain at each node51, and it uses the Gini Index as the criterion for choosing binary splits. The Gini Index is defined in Equation (1), where $$P_i$$ is the proportion of samples at a node that belong to class $$i$$ and $$C$$ is the number of classes.

$$\text{Gini Index} = 1 - \sum_{i=1}^{C} P_{i}^{2} \tag{1}$$

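For $$C$$ classes, the Gini Index lies between 0 and $$1 - 1/C$$. In the two-class problems considered here, it ranges from 0 (a pure node, in which all samples belong to one class) to 0.5 (samples split equally between the two classes), so a value closer to 0 indicates a better binary split.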
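A minimal plain-Python sketch of Equation (1), computing each $$P_i$$ as the fraction of labels in class $$i$$. The helper `weighted_split_gini` illustrates how a candidate binary split is scored by the size-weighted impurity of its two children. Both function names are illustrative, and this is not the scikit-learn implementation.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity 1 - sum_i P_i^2, with P_i the fraction of labels in class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_gini(left, right):
    """Size-weighted impurity of the two children produced by a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)


print(gini_index(["PD", "PD", "control", "control"]))  # equal two-class mix
print(gini_index(["PD", "PD", "PD", "control"]))       # 3:1 mix
print(weighted_split_gini(["PD", "PD"], ["control", "control"]))  # perfect split
```

A node with an equal two-class mix gives 0.5, a 3:1 mix gives 0.375, and a split that separates the classes perfectly gives 0.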

### Feature normalization

Standardizing feature values is necessary to understand the gap between self-reported information and objective features as it relates to current clinical applications13. Because graphic visualizations have enormous potential to promote patient-centered care52, feature normalization was completed as a tandem processing step to depict differences between perceived PROs and device-collected, sensor-based functionality scores; normalization is required because the features are of different types and have varying units. Z-scores were used for this normalization. The Z-score is calculated as in Eq. (2), where $$x$$ is an individual’s score, $$\mu$$ is the population mean, and $$\sigma$$ is the population standard deviation.

$$Z = \dfrac{x - \mu}{\sigma} \tag{2}$$

The Z-score expresses a score as the number of standard deviations it lies from the mean. A Z-score of 0 indicates that the score equals the mean, and a Z-score of 1.0 indicates a score one standard deviation above the mean. Z-scores may be positive or negative: a positive value indicates a score above the mean, and a negative value indicates a score below it.
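A minimal standard-library sketch of Eq. (2). It assumes $$\sigma$$ is the population standard deviation (hence `statistics.pstdev` rather than the sample version `stdev`) and that $$\mu$$ and $$\sigma$$ are computed over the same list of scores being standardized. The reference group used for the study's figures and the weighting behind the weighted Z-scores are not specified here and are not reproduced.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize scores with Eq. (2): Z = (x - mu) / sigma.

    mu and sigma are the mean and population standard deviation of `values`.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical scores with mean 5 and population standard deviation 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

Because each feature is divided by its own standard deviation, features measured in different units (e.g., seconds versus counts) end up on a common, unitless scale.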

### Ethical approval

The work presented in this manuscript is part of an approved study by The University of Notre Dame and Florida International University Institutional Review Boards (IRBs) for the Protection of Human Subjects. All methods were performed in accordance with the relevant guidelines and regulations from the IRBs. Written informed consent was collected from all participants included in this study. The collected data was authorized for disclosure as part of published works.

## Results

### Disease classification

Decision tree classification between groups (i.e., individuals with PD and controls) was completed for both perceived assessments (i.e., self-reported outcomes) and new objective assessments (i.e., digital biomarkers from specifically designed tablet-based functional assessments40). This classification was done to gain further insight into which reported and objective features are significant in the classification of Parkinson’s Disease, and to inform the subsequent comparison of individuals’ perceived versus sensor-based neurocognitive functionality. Table 2 reports accuracy, precision, and recall for the classification between individuals with PD and control populations.
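As an illustration of how accuracy, precision, and recall can be obtained for a Gini-based CART classifier in scikit-learn, the sketch below fits `DecisionTreeClassifier(criterion="gini")` on synthetic data and scores it on a held-out split. Several details are assumptions: the validation protocol behind Table 2 is not described in this section, so the stratified 70/30 split, the random features, and the choice of PD (label 1) as the positive class are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_cart(X, y, test_size=0.3, seed=0):
    """Fit a Gini-based CART tree and score it on a stratified hold-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    clf = DecisionTreeClassifier(criterion="gini", random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        # PD (label 1) is treated as the positive class.
        "precision": precision_score(y_test, y_pred, pos_label=1, zero_division=0),
        "recall": recall_score(y_test, y_pred, pos_label=1, zero_division=0),
    }
    return clf, metrics


# Synthetic stand-in data: 50 "PD" (1) and 25 "control" (0) rows, 5 features.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 25)
X = rng.normal(size=(75, 5)) + 0.8 * y[:, None]
clf, metrics = evaluate_cart(X, y)
print(metrics)
```

With only 75 participants, a single split gives noisy estimates; repeated or cross-validated evaluation would typically be preferred, but that choice is outside what this section reports.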

#### Patient reported outcomes

Patient reported outcomes (i.e., from general health questionnaires and the PDQ-39) for disease classification are reported in Table 2 and Appendix Table 4. Decision tree classification for both the general health questionnaire and the PDQ-39 identified the following as the most significant questions:

• “Do you have any handwriting problems?”

• “What is your energy level?”

• “(Due to having Parkinson’s Disease) How often in the last month have you had difficulty writing clearly?”

• “(Due to having Parkinson’s Disease) How often in the last month have you had a fear of falling?”

#### Objective digital biomarkers

Objective digital biomarker classification outcomes are presented in Table 2 and in Appendix Tables 5 and 6. Objective features from digital versions of the 14 functional assessments were used in the classification between groups. Decision trees were generated to determine which tests are the most relevant and which objective digital features are the most significant within each test. Accuracy, precision, and recall are presented in Table 2 by functional area (motor, memory, speech, executive function, and multifunctional assessments). Among the sensor-based functional tasks, multifunctional, executive function, and speech-based tasks provided the highest accuracy, precision, and recall in separating the PD and control populations.

At the task level, decision tree classification identified the finger tapping test as the most significant of the administered tests for separating the control and PD groups (Gini Index = 0.375 at the root). Other task-level assessments of importance in separating the PD and control populations were the Grandfather Passage (Gini Index = 0.423 at the root) and the multifunctional assessments of fine motor tracing with speech (Gini Index = 0.429 at the root). The root features for finger tapping and for fine motor tracing with speech were device acceleration features (i.e., the magnitude of the device’s acceleration, reflecting how the user moves the device during the test), whereas the root feature for the Grandfather Passage was an accuracy feature (the number of missed words in a speech test). When first-order features are included, 17 motor, 9 accuracy, and 8 timing features are significant for classifying individuals diagnosed with PD versus control populations, as shown in Fig. 1.
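For readers reproducing root-level results like those above, a fitted scikit-learn tree exposes its nodes through the `tree_` attribute, where index 0 is the root: `feature[0]`, `threshold[0]`, and `impurity[0]` give the root's split feature, split threshold, and Gini impurity. The sketch below reads these values from a toy tree. The helper `root_split`, the feature names, and the eight-sample data set are hypothetical, chosen so the result can be checked by hand: a 2:6 class mix has root impurity 1 − (0.25² + 0.75²) = 0.375. This is not the study's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def root_split(clf, feature_names):
    """Return (feature name, threshold, Gini impurity) of a fitted tree's root node."""
    tree = clf.tree_
    if tree.node_count == 1:
        # The tree never split, so the root is a leaf with no feature or threshold.
        return None, None, float(tree.impurity[0])
    return feature_names[tree.feature[0]], float(tree.threshold[0]), float(tree.impurity[0])


# Toy data (hypothetical): 2 controls (label 0) and 6 PD participants (label 1).
# Only the first column separates the classes perfectly.
feature_names = ["device_acceleration", "missed_words"]
X = np.array([[0, 5], [0, 3], [1, 4], [1, 1], [1, 2], [1, 6], [1, 7], [1, 8]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1, 1, 1])

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(root_split(clf, feature_names))
print(clf.feature_importances_)
```

`feature_importances_` summarizes each feature's total Gini reduction across the whole tree, which complements the root-only view when ranking features within a test.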

### Stage classification

The classification of disease stage (i.e., early-stage (H&Y Stages 1 and 2) versus advanced-stage (H&Y Stages 3, 4, and 5) Parkinson’s Disease) was completed across perceived assessments (i.e., PROs), objective assessments (i.e., sensor-based digital biomarkers), and clinically administered functional assessments (i.e., Berg Balance Scale, STS, TUG, and 6MWT) for all individuals with PD. Table 3 reports accuracy, precision, and recall for the classification between individuals in early and advanced stages of PD.

#### Patient reported outcomes and functional assessments

Patient reported outcomes and clinically administered functional assessment results for stage classification are shown in Table 3 and in Appendix Table 7. PROs come from the general health questionnaires and the PDQ-39, whereas functional assessment results come from the Berg Balance Scale, TUG, STS, and 6MWT. Decision tree classification of the PROs identified the following as the most significant questions for classifying PD stage:

• “Do you have any handwriting problems?”

• “What is your energy level?”

• “Due to having Parkinson’s Disease how often in the last month have you had difficulty with leisure activities?”

• “Due to having Parkinson’s Disease how often in the last month have you felt unable to communicate with others?”

• “Due to having Parkinson’s Disease how often in the last month have you felt unpleasantly hot or cold?”

For the functional assessments, the most significant Berg Balance Scale features were standing with one foot in front and turning to look behind, whereas the most significant objective functional movement features included the distance traveled during the 6 Minute Walk Test and the individual’s speed during the Timed Up and Go.

#### Objective digital biomarkers

Objective digital biomarker outcomes for stage classification are shown in Table 3 and in Appendix Tables 8 and 9. As in disease classification, the separation between stages (i.e., early and advanced-stage Parkinson’s Disease) used objective features collected from digital versions of the functional assessments. Accuracy, precision, and recall are presented in Table 3 by functional area (motor, memory, speech, executive function, and multifunctional assessments). Among the sensor-based functional tasks, speech, motor, and multifunctional tasks provided the highest accuracy, precision, and recall in separating early and advanced stage PD populations.

At the task level, decision tree classification found that 11 of the 14 administered tests were significant in separating early (H&Y Stages 1 and 2) from advanced (H&Y Stages 3, 4, and 5) stages of Parkinson’s Disease (Gini Index = 0.363 at the root). Tests with a Gini Index higher than 0.363 include the Trail Making Tests (executive function) and the Grandfather Passage, with Gini Index values of 0.375 and 0.423, respectively. The root features of the 11 significant tests comprised 1 device acceleration feature (i.e., how the user moves the device during the test), 4 timing features (e.g., the total elapsed speaking time or the average time between non-matching pairs), and 6 accuracy features (e.g., the number of targets tapped or the total number of objects named correctly). When first-order features are included, 7 motor, 12 accuracy, and 14 timing features best separate the groups by stage, roughly the inverse of the pattern for diagnosis (17 motor, 9 accuracy, and 8 timing features). This is also shown in Fig. 1.

### Feature normalization

Because graphic visualizations have enormous potential to promote patient-centered care52, feature normalization was completed as a tandem processing step to depict differences between perceived PROs and device-collected, sensor-based functionality scores. This standardization is necessary to understand the gap between self-reported information and objective features as it relates to current clinical applications13, and it supports the premise that digital health systems should aim to increase the reliability and accuracy of patient reported data by combining it with objective data from mobile devices18,26,27,28,29.

#### Patient reported outcomes

To depict perceived neurocognitive functionality across groups, normalized scores from the general health questionnaires and the PDQ-39 were calculated for all functional areas of neurocognition. Z-scores were used to standardize these features, which is necessary because the features are of different types and have varying units. The weighted Z-scores of perceived capabilities for the control, early-stage PD (H&Y Stages 1 and 2), and advanced-stage PD (H&Y Stages 3, 4, and 5) populations are shown in Fig. 2.

#### Objective digital biomarkers

For sensor-based neurocognitive functionality across groups, normalized scores from the objective assessments were calculated for the functional areas of motor, memory, speech, executive function, and multifunctional tests. The weighted Z-scores of sensor-based neurocognitive capabilities for the control, early-stage PD (H&Y Stages 1 and 2), and advanced-stage PD (H&Y Stages 3, 4, and 5) populations are shown in Fig. 3. Note that all executive function tests are inherently multifunctional (e.g., an individual needs to move or speak to carry out the executive function) and are therefore a subset of the multifunctional digital biomarker set (denoted by * in Fig. 3).

## Discussion

Because Parkinson’s Disease is often described as a “designer disease”, meaning that individuals with PD manifest different symptoms across the spectrum of disease characteristics, personalized medicine should be the goal and is required to optimize care53,54. However, reaching personalized medicine with machine learning requires identifying relevant features, as the performance of a given algorithm depends heavily on the quantity and quality of the extracted features4. Nearly 275 features were collected in this work from PROs (e.g., from the PDQ-39), functional movement assessments (e.g., from the TUG, STS, and 6MWT), and novel objective digital biomarkers (e.g., from tablet-based assessments) across multiple neurocognitive tasks. This work sought to identify new significant features for classifying individuals with Parkinson’s Disease versus controls (i.e., which features are most important in discerning whether an individual has PD), as well as for classifying stages of PD (i.e., which features best indicate how far the disease has progressed).

PROs are commonly used to monitor an individual’s thoughts or opinions on changes in their condition, which can improve disease management through the recognition and understanding of their symptoms and triggers18,26. However, this perceived information may be subject to individual variability and/or bias27,28,29. This work shows that, for some groups, perceived functionality scores (e.g., motor, memory, speech, and executive function) differ from sensor-based scores. For individuals in confirmed early-stage PD (H&Y Stages 1 and 2), perceived functionality in memory, speech, and executive function differs by about 22% from their sensor-based functionality scores, as shown in Figs. 2 and 3. Further, these figures show a relatively large perceived increase in executive function and behavioral abilities for individuals in advanced stages of PD (H&Y Stages 3, 4, and 5) compared with their early-stage counterparts. Therefore, digital health systems that can administer, collect, and analyze objective features should be used to give individuals greater insight into their true capabilities.