Diagnostic accuracy of keystroke dynamics as digital biomarkers for fine motor decline in neuropsychiatric disorders: a systematic review and meta-analysis

Timely diagnosis of neuropsychiatric disorders remains an unmet need: clinical diagnosis often occurs years after substantial neural loss and neuroperturbations, affirming the dire need for biomarkers with proven efficacy. In Parkinson’s disease (PD), mild cognitive impairment (MCI), Alzheimer’s disease (AD), and psychiatric disorders, early symptoms are difficult to detect given their mild nature. We hypothesize that fine motor patterns derived from natural interactions with keyboards, also known as keystroke dynamics, could translate classic finger dexterity tests from the clinic to populations in-the-wild for timely diagnosis; further evidence, however, is required to establish this efficacy. We searched PubMed, Medline, IEEE Xplore, EBSCO, and Web of Science for eligible diagnostic accuracy studies employing keystroke dynamics as an index test for the detection of neuropsychiatric disorders as the main target condition. We evaluated the diagnostic performance of keystroke dynamics across 41 studies published between 2014 and March 2022, comprising 3791 PD patients, 254 MCI patients, and 374 psychiatric disease patients. Of these, 25 studies were included in univariate random-effects meta-analysis models for diagnostic performance assessment. Pooled sensitivity and specificity are 0.86 (95% confidence interval (CI) 0.82–0.90; I² = 79.49%) and 0.83 (95% CI 0.79–0.87; I² = 83.45%) for PD, 0.83 (95% CI 0.65–1.00; I² = 79.10%) and 0.87 (95% CI 0.80–0.93; I² = 0%) for psychomotor impairment, and 0.85 (95% CI 0.74–0.96; I² = 50.39%) and 0.82 (95% CI 0.70–0.94; I² = 87.73%) for MCI and early AD, respectively. Our subgroup analyses conveyed the diagnostic efficiency of keystroke dynamics for naturalistic self-reported data, and the promising performance of multimodal analysis of naturalistic behavioral data and deep learning methods in detecting disease-induced phenotypes.
The meta-regression models showed that diagnostic accuracy and the fine motor impairment severity index increase with age and disease duration for PD and MCI. The risk of bias, based on the QUADAS-2 tool, is deemed low to moderate, and overall we rated the quality of evidence as moderate. We demonstrated the feasibility of keystroke dynamics as digital biomarkers for fine motor decline in naturalistic environments. Future work evaluating their performance for longitudinal disease monitoring and therapeutic implications remains to be performed. We further propose a partnership strategy based on a “co-creation” approach that stems from mechanistic explanations of patients’ characteristics derived from data obtained in the clinic and under ecologically valid settings. The protocol of this systematic review and meta-analysis is registered in PROSPERO; identifier CRD42021278707. The presented work is supported by the KU-KAIST joint research center.
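The pooled estimates above come from univariate random-effects models. A minimal sketch of one standard approach — DerSimonian–Laird pooling of logit-transformed sensitivities with an I² heterogeneity index — is given below; the (TP, FN) counts are hypothetical and for illustration only, not the review's data.

```python
import math

def pool_random_effects(tp_fn_pairs):
    """DerSimonian-Laird random-effects pooling of logit-transformed
    sensitivities (illustrative sketch; counts are hypothetical)."""
    # logit(sensitivity) and its within-study variance,
    # with a 0.5 continuity correction
    y, v = [], []
    for tp, fn in tp_fn_pairs:
        tp, fn = tp + 0.5, fn + 0.5
        y.append(math.log(tp / fn))
        v.append(1 / tp + 1 / fn)
    w = [1 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q, then the between-study variance tau^2
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # random-effects weights, pooled estimate, 95% CI, and I²
    w_re = [1 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    inv_logit = lambda x: 1 / (1 + math.exp(-x))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled_sens": inv_logit(y_re),
        "ci": (inv_logit(y_re - 1.96 * se), inv_logit(y_re + 1.96 * se)),
        "I2": i2,
    }

# hypothetical (TP, FN) counts from four studies
print(pool_random_effects([(45, 5), (30, 10), (60, 8), (25, 6)]))
```

The logit transform keeps the pooled proportion and its confidence bounds inside (0, 1), which is why results are back-transformed only at the end.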


Systematic Search Strategy
Domain 1: Patient Selection
Q1B: Was a case-control design avoided? Studies that included patients with an uncertain diagnosis score as "yes". Studies that included healthy controls and patients with confirmed PD, MCI, or mood disorders score as "no". If no relevant information is provided, score as "unclear".
Q1C: Did the study avoid inappropriate exclusions?
If the study excluded difficult-to-diagnose patients, score as "no"; if no information is provided, score as "unclear".
Q1D: Was the sample size appropriate? If sample size calculations were performed, score as "yes", and mention whether the sample was adequate.

Q1E: Could the selection of patients have introduced bias?
If the patients were considered eligible for the study at an early stage of PD or MCI, score as "no". Likewise, if the patients had a clinically validated mood disorder, score as "no". On the other hand, if the patients self-reported their symptoms without clinical diagnosis, or if the study included patients with advanced disease stages, score as "yes".
Q1F: Is there concern that the included patients do not match the review question?

Domain 2: Index Test
Criteria description Q2A: Were the index test results interpreted without knowledge of the results of the reference standard? If the data collection and analysis were carried out without knowing the labels, score as "yes".
Q2B: If a threshold was used, was it pre-specified? If the diagnosis based on the typing features was conducted with a threshold predefined by the authors, score as "yes". If no relevant information is provided, score as "unclear".
Q2C: Could the interpretation of the index test have introduced bias?
If the quantitative analysis may have introduced bias, due to the experimental setting, analysis models, study duration, or medication impact, score as "yes" and mention the reason.
Q2D: Is there concern that the interpretation or the conduct of the index test differs from the review question?

Domain 3: Reference Standard
Criteria description Q3A: Is the reference standard likely to correctly classify the target conditions?
If the diagnosis is confirmed via gold-standard clinical scales for PD, MCI, and mood disorders, score as "yes". If the patients self-reported their symptoms without clinical validation, score as "no".
Q3B: Was the reference standard interpreted without knowledge of the index test?
If the results of clinical standards were interpreted without knowledge of the typing behavior, score as "yes". If no relevant information reported, score as "unclear".
Q3C: Could the reference standard, its conduct or interpretation, have introduced bias?
This is a "no" for all studies, except those using self-reports as a ground truth.
Q3D: Is there concern that the target conditions, as defined by the reference standard, do not match the research question?

Domain 4: Flow and Timing
Criteria description Q4A: Was there an appropriate interval between the index tests and the reference standard?
If the interval between the conduct of the clinical reference standard and the acquisition of the index tests is no more than six months, score as "yes". If no relevant information on the timing is reported, score as "unclear".
Q4B: Did all the patients receive a reference standard?
If all patients whose typing data were collected received a standard clinical test, score as "yes". This item should be scored as "unclear" if no relevant information is provided.
Q4C: Did patients receive the same reference standard?
If all patients were evaluated by the same clinical scale, this item is scored as "yes".
Q4D: Were all patients included in the analysis? If some patients who were enrolled to begin with were excluded from the analysis, this item should be scored as "no". Discuss the exclusion reasons, and whether the explanation is sufficient to preclude bias. If the quantitative features extracted from the biomarkers were available for interpretation, score as "yes".

Domain 5: Additional Items
3: Was the method, whether for statistical analysis or classification, consistent throughout the study?
If the method used in the study is consistent for all participants, score as "yes". If the information is insufficient to make a judgement, score as "unclear".

Indirectness of outcome
Lower score if the study collected data in-the-wild and/or labeled data by self-reports.
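The signalling questions above are deterministic rules, so they can be encoded directly. A sketch of two of them follows; the input labels and thresholds mirror the criteria descriptions, but the field names are hypothetical.

```python
def score_q1b(design: str) -> str:
    """Q1B: Was a case-control design avoided?
    'uncertain_diagnosis' cohorts score "yes"; healthy controls vs.
    confirmed patients ('case_control') score "no"; else "unclear"."""
    return {"uncertain_diagnosis": "yes",
            "case_control": "no"}.get(design, "unclear")

def score_q4a(interval_months) -> str:
    """Q4A: Was there an appropriate interval between index test and
    reference standard? Appropriate means no more than six months."""
    if interval_months is None:
        return "unclear"  # no timing information reported
    return "yes" if interval_months <= 6 else "no"

print(score_q1b("case_control"), score_q4a(3), score_q4a(None))
```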

Inconsistency/ imprecision of results
Lower score if the reported accuracy measures are inconsistent.
Publication Bias
This is deemed low risk for all included studies, given that we maintained funnel plot symmetry upon study inclusion.
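Funnel-plot symmetry can be quantified with Egger's regression intercept: regress each study's standardized effect on its precision, and an intercept near zero suggests symmetry. The sketch below is illustrative; the effect sizes and standard errors are hypothetical, not the review's data.

```python
def egger_intercept(effects, ses):
    """Egger's regression intercept as a funnel-plot asymmetry index:
    regress effect/SE on 1/SE via ordinary least squares and return the
    intercept. (Illustrative sketch; inputs are hypothetical.)"""
    x = [1 / s for s in ses]                     # precision
    z = [e / s for e, s in zip(effects, ses)]    # standardized effect
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    slope = sxz / sxx
    return mz - slope * mx                       # regression intercept

# hypothetical log-odds effects and standard errors from four studies
print(egger_intercept([1.8, 1.6, 2.0, 1.7], [0.2, 0.3, 0.25, 0.4]))
```

In practice the intercept is reported with a significance test; a formal implementation would also return its standard error.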

Regression Results

Key Points
What Was Known: • Given the mildness of early-stage symptoms, timely detection of neurodegenerative disorders and their associated psychiatric signs remains a challenge.
• Ubiquitous technology and connected devices facilitate the acquisition of high-frequency behavioral data that may aid in early diagnosis and monitoring. What This Paper Adds: • Employing keyboard interactions to detect motor abnormalities offers unbiased diagnostic performance consistently across an array of neurodegenerative and psychiatric disorders, particularly PD, MCI, and depression.
• The diagnostic accuracy is significantly higher for data collected and validated in-the-clinic than for data collected in-the-wild, though the latter still yields acceptable accuracy.
• The adoption of multimodal data and deep learning models outperforms unimodal analysis, reflecting the importance of simultaneous learning of behavioral trends from aligned data streams. • Preliminary results on longitudinal behavioral analysis and treatment response are promising, yet this domain is still in its infancy.
• Despite the disproportionate utilization of keystroke dynamics analysis for PD, its reproducibility has motivated its use for analyzing motor symptoms in other disorders, such as AD, depression, multiple sclerosis, and Huntington's disease.
• Neuropsychiatric disorders, regardless of the underlying diagnosis, are characterized by highly unstable typing latencies and more dispersed features, such as hold time and flight time.
• Considerable similarities between symptoms exist, warranting further research to better understand disease nosology and decipher disease-specific behavioral traits.
• The still-immature adoption of digital biomarkers in the clinical workflow calls for urgent multidisciplinary collaboration, realized through a co-creation approach whereby relevant stakeholders contribute toward the final objective of disease diagnosis and early intervention.
• The ideal future path involves undertaking cross-sectional observational studies with full representation of the population, and longitudinal studies, to inform a better understanding of disease nosology and of behavioral trajectories aligned with disease prognosis, respectively.
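The hold-time and flight-time features mentioned above are simple timestamp differences over a raw key-event stream: hold time runs from a key's press to its release, flight time from one key's release to the next key's press. A minimal sketch, assuming a hypothetical (timestamp_ms, "down"/"up", key) event format:

```python
def keystroke_features(events):
    """Extract hold times (key-down to key-up of the same key) and
    flight times (key-up to next key-down) from a timestamped event
    stream. Event format is hypothetical: (timestamp_ms, kind, key)."""
    holds, flights = [], []
    down_at = {}      # key -> timestamp of its last press
    last_up = None    # timestamp of the most recent release
    for t, kind, key in sorted(events):
        if kind == "down":
            if last_up is not None:
                flights.append(t - last_up)
            down_at[key] = t
        elif kind == "up" and key in down_at:
            holds.append(t - down_at.pop(key))
            last_up = t
    return holds, flights

# hypothetical event stream for typing "hi"
events = [(0, "down", "h"), (90, "up", "h"),
          (150, "down", "i"), (230, "up", "i")]
print(keystroke_features(events))  # -> ([90, 80], [60])
```

Dispersion statistics (e.g., the variance of these lists) then yield the kind of instability indices the key points describe.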