Attentive pairwise interaction network for AI-assisted clock drawing test assessment of early visuospatial deficits

Dementia is a debilitating neurological condition that impairs cognitive function and the ability to care for oneself. The Clock Drawing Test (CDT) is widely used to detect dementia, but differentiating normal from borderline cases requires years of clinical experience. Misclassifying a mildly abnormal case as normal delays the opportunity to investigate potentially reversible causes or to slow disease progression. To help address this issue, we propose an automatic CDT scoring system based on the Attentive Pairwise Interaction Network (API-Net), a fine-grained deep learning model designed to distinguish visually similar images. Inspired by how humans often learn to recognize different objects by comparing two images side by side, API-Net is optimized on image pairs in a contrastive manner, as opposed to standard supervised learning, which optimizes a model on individual images. In this study, we extend API-Net to infer Shulman CDT scores from a dataset of 3108 subjects. We compare the performance of API-Net to that of three convolutional neural networks: VGG16, ResNet-152, and DenseNet-121. The best API-Net achieves an F1-score of 0.79, a 3% absolute improvement over ResNet-152's F1-score of 0.76. The code for API-Net and the dataset used have been made available at https://github.com/cccnlab/CDT-API-Network.


Supplementary Information
Investigation of Hyperparameters in API-Net Training
We investigated the impact of the hyperparameters λ in Equation 4 and ε in Equation 6 on the model's performance. For this experiment, we selected the CDT-finetuned ResNet-152 as the backbone. The API-Net models were trained using the gradual unfreezing training approach. We tested λ values of 0.1, 1, and 10 and ε values of 0.0005, 0.005, and 0.05. The means and standard deviations of the classification accuracy are summarized in Supplementary Table S.1.
For each ε value, we performed an analysis of variance (ANOVA) test at a significance level of 0.05 to compare the mean accuracy of the models trained with different λ values. The p-values were 0.6485 for ε = 0.0005, 0.9465 for ε = 0.005, and 0.5768 for ε = 0.05. Consequently, we failed to reject the null hypothesis in each case, which stated that there was no difference in the mean accuracy of the models trained with different λ values.
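As a concrete illustration, this comparison can be carried out with SciPy's one-way ANOVA; the per-split accuracies below are placeholders for illustration only, not the values behind Supplementary Table S.1:

    from scipy import stats

    # Hypothetical per-split accuracies for three lambda values at a fixed
    # epsilon (placeholder numbers, not the paper's results).
    acc_lam_01 = [0.79, 0.78, 0.80, 0.77, 0.79]
    acc_lam_1 = [0.79, 0.79, 0.78, 0.80, 0.78]
    acc_lam_10 = [0.78, 0.79, 0.79, 0.78, 0.80]

    f_stat, p_value = stats.f_oneway(acc_lam_01, acc_lam_1, acc_lam_10)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p > 0.05: fail to reject H0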
With a fixed λ value, changing ε substantially changed the model's performance. Taking λ = 1 as an example, when ε was increased from 0.005 to 0.05, the mean accuracy dropped from 0.7892 to 0.7578. We performed a paired two-sample t-test with the null hypothesis that the mean accuracy of the ε = 0.005 case was less than or equal to that of the ε = 0.05 case. With the obtained p-value of 0.0429, we rejected the null hypothesis, implying that ε = 0.005 yielded significantly higher mean accuracy than ε = 0.05. Similarly, when ε was decreased from 0.005 to 0.0005, the mean accuracy dropped from 0.7892 to 0.7568. Using a paired two-sample t-test, we also rejected the null hypothesis in this case (p-value = 0.044), implying that ε = 0.005 yielded significantly higher mean accuracy than ε = 0.0005.
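The one-sided paired t-test can be performed the same way; again, the per-split accuracies are illustrative placeholders:

    from scipy import stats

    # Hypothetical per-split accuracies, paired by data split
    # (placeholder numbers, not the paper's results).
    acc_eps_mid = [0.790, 0.786, 0.792, 0.788, 0.790]   # epsilon = 0.005
    acc_eps_high = [0.760, 0.755, 0.758, 0.752, 0.764]  # epsilon = 0.05

    # One-sided paired t-test; H0: mean(acc_eps_mid) <= mean(acc_eps_high)
    t_stat, p_value = stats.ttest_rel(acc_eps_mid, acc_eps_high,
                                      alternative="greater")
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05: reject H0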
According to Equation 6, ε defines a margin between the probabilities p_self and p_other. In particular, it encourages the model to output a p_self for the correct class that is larger than the corresponding p_other by at least ε. The value of ε should be carefully chosen. If ε is too low, the model will not benefit directly from the contrastive component of the training process. If ε is too high, it will enforce an unrealistically large margin that could adversely affect training, especially when the clock drawing images being considered are visually similar. We observed that ε = 0.005 worked well in this case. With the desired margin specified by ε, the λ value then determines how much weight to place on the score-ranking loss, which is responsible for enforcing the margin, relative to the cross-entropy loss. While we observed that the performance of our model was not sensitive to the choice of λ in this study, we suggest that both ε and λ be carefully selected, possibly through hyperparameter tuning, when the proposed method is adopted or extended to other tasks and datasets.
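For reference, the margin term described above can be sketched in a few lines of PyTorch, assuming the standard API-Net score-ranking formulation; the function and variable names here are illustrative and are not taken from the released code:

    import torch

    def score_ranking_loss(p_self, p_other, labels, eps=0.005):
        """Hinge-style margin term, a sketch of the API-Net score-ranking
        regularizer. p_self, p_other: softmax outputs of shape
        (batch, n_classes); labels: class indices of shape (batch,)."""
        idx = labels.unsqueeze(1)
        p_self_c = p_self.gather(1, idx).squeeze(1)    # p_self of correct class
        p_other_c = p_other.gather(1, idx).squeeze(1)  # p_other of correct class
        # penalize pairs where p_self does not exceed p_other by at least eps
        return torch.clamp(p_other_c - p_self_c + eps, min=0).mean()

    # The total objective would then weight this term by lambda:
    # loss = ce_loss + lam * score_ranking_loss(p_self, p_other, labels, eps)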

Figure S.1. Different training processes. (a) Standard supervised training for the baseline models. (b) Backbone freezing: the backbone's parameters are not updated during the training process. (c) Gradual unfreezing: only the API component and the classifier are updated during the first N epochs, after which all the components are updated for the remaining epochs.
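To make the schedule in panel (c) concrete, the following is a minimal PyTorch sketch of gradual unfreezing; the module structure, warm-up length N, and epoch count are illustrative assumptions, not the exact configuration used in the paper:

    import torch.nn as nn
    from torchvision import models

    # A backbone and classifier stand in for the full API-Net; the module
    # names and dimensions here are illustrative.
    backbone = models.resnet152(weights=None)
    backbone.fc = nn.Identity()      # expose 2048-d features
    classifier = nn.Linear(2048, 6)  # e.g., one output per Shulman score (0-5)

    def set_requires_grad(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    N = 5  # warm-up epochs with the backbone frozen
    for epoch in range(20):
        set_requires_grad(backbone, epoch >= N)  # unfreeze from epoch N onward
        # ... one training epoch over the clock drawing image pairs goes here ...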

Table S.1. The means and standard deviations of clock drawing image classification accuracy for different combinations of λ and ε values, calculated over 5 different stratified random training-validation-test data splits.