Differential diagnosis of neurodegenerative dementias with the explainable MRI based machine learning algorithm MUQUBIA

Biomarker-based differential diagnosis of the most common forms of dementia is becoming increasingly important. Machine learning (ML) may be able to address this challenge. The aim of this study was to develop and interpret an ML algorithm capable of differentiating Alzheimer’s dementia, frontotemporal dementia, dementia with Lewy bodies and cognitively normal control subjects based on sociodemographic, clinical, and magnetic resonance imaging (MRI) variables. 506 subjects from 5 databases were included. MRI images were processed with FreeSurfer, LPA, and TRACULA to obtain brain volumes and thicknesses, white matter lesions and diffusion metrics. MRI metrics were used in conjunction with clinical and demographic data to perform differential diagnosis based on a Support Vector Machine model called MUQUBIA (Multimodal Quantification of Brain whIte matter biomArkers). Age, gender, Clinical Dementia Rating (CDR) Dementia Staging Instrument, and 19 imaging features formed the best set of discriminative features. The predictive model performed with an overall Area Under the Curve of 98%, high overall precision (88%), recall (88%), and F1 scores (88%) in the test group, and good Label Ranking Average Precision score (0.95) in a subset of neuropathologically assessed patients. The results of MUQUBIA were explained by the SHapley Additive exPlanations (SHAP) method. The MUQUBIA algorithm successfully classified various dementias with good performance using cost-effective clinical and MRI information, and with independent validation, has the potential to assist physicians in their clinical diagnosis.


Data
Subjects with a clinical diagnosis of AD, FTD, DLB, or CN were selected from 5 data sets.
The FTLDNI database contained sufficient FTD data for our purposes. All three FTD subtypes (i.e., behavioural variant, semantic variant, and progressive non-fluent aphasia) were considered. AD and CN were selected from a larger sample to avoid size imbalance. For these three classes, only subjects with all three available sequences at the same time-point and DTI directions greater than 12 were included. Because there were no available open access databases of DLB patients with all three sequences needed for this study, we also included subjects with at least one sequence for the DLB group (Supplementary Table S2), thus improving the sample size and allowing more accurate data imputation. A sample of no less than 100 subjects was assembled for each diagnostic class. Sociodemographic, clinical, and imaging variables were collected for all subjects. Neuropsychological test scores were collected in our study but not included in the analysis because the assessment protocol for CN does not always include neuropsychological characterization. The clinical assessment used was the global score of the Clinical Dementia Rating (CDR) Dementia Staging Instrument.
Supplementary Table S1 lists the diagnostic and selection criteria for each study considered. For a complete list of subjects, diagnoses, and data sets used in this study see Supplementary Table S2.

MR imaging
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).
The Newcastle data were provided directly by the Translational and Clinical Research Institute, Newcastle University.
Table 1 reports the imaging characteristics for each sequence and data set. Combining data from multiple hospitals is useful to build ML models that are invariant to systematic inter-scanner effects and to overcome differences in field strengths and acquisition protocols 36.

Pipelines for image processing
N4 correction, from Advanced Normalization Tools (ANTs) 37, was performed for all images to correct smooth intensity variations in MRI. The pipelines used for image processing in this study were FS version 6.0, LPA, and TRACULA.
Table 1. Image characteristics for each data set. Information about scanner manufacturer, sequence type, field strength, dimensionality and directions are reported for each data set. GE general electric, FLAIR fluid attenuated inversion recovery, DTI diffusion tensor imaging, * ADNI1, ADNI2 and ADNI3 data were included.
FS is a pipeline for segmenting the cortical and subcortical brain structures using volumetric T13D images, where each voxel is labeled based on a probabilistic atlas 13,14. The T13D MRIs were processed using the cross-sectional stream over the recon-all script using the Desikan-Killiany atlas and, when available in high quality, FLAIRs were used to improve the segmentation of the pial surfaces 38. Volumes of subcortical regions in native space were normalized to FS estimated total intracranial volume (eTIV). Normalization was performed by dividing the volume of the region by the eTIV of the subject and multiplying the ratio by a reference value of 1409 ml 39 to remove the effect of head size 40. Cortical thickness values were not normalized 19.
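The head-size normalization described above amounts to a single rescaling per region. The following is a minimal sketch of that arithmetic only; the function name and the assumption that all volumes are expressed in ml are illustrative and not taken from the MUQUBIA code.

```python
# Reference intracranial volume quoted in the text (1409 ml).
REFERENCE_ETIV_ML = 1409.0


def normalize_volume(region_volume_ml, etiv_ml, reference_ml=REFERENCE_ETIV_ML):
    """Rescale a regional volume to the reference head size.

    normalized = region_volume / eTIV * reference
    All three quantities are assumed to share the same unit (ml here).
    """
    if etiv_ml <= 0:
        raise ValueError("eTIV must be positive")
    return region_volume_ml / etiv_ml * reference_ml


# Hypothetical subject: a 3.5 ml structure in a 1600 ml intracranial volume.
print(round(normalize_volume(3.5, 1600.0), 4))
```

A subject whose eTIV equals the reference value keeps the regional volume unchanged, while larger heads are scaled down and smaller heads scaled up.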
LPA is an algorithm for the quantification of the WM lesions that is part of the Lesion Segmentation Toolbox (LST) 41. First, FLAIR images were linearly registered to T13D and each voxel was classified as cerebrospinal fluid (CSF), gray matter, or WM using the Statistical Parametric Mapping Tool v12.0 (SPM-12) tissue probability maps. Intensity distributions were calculated for each of them and weighted based on the spatial probability of belonging to WM. Finally, the map was converted to a binary lesion mask and its volume in native space (normalized to eTIV) was calculated.
TRACULA is a tool for automatic reconstruction of a set of 18 major WM pathways 16. It uses prior information about the anatomy and relative positions of the WM tracts in relation to surrounding anatomical structures, obtained from a set of cognitively normal training subjects in which the tracts were manually labeled to produce tractography streamlines 42. After mitigating image distortions due to eddy currents and B0 field inhomogeneities, TRACULA fits the shape of the tracts to both the subject's diffusion data and the anatomical neighborhood priors derived from the subject's T1 data. Fractional anisotropy (FA) and mean diffusivity (MD) were extracted from the diffusion data in MNI template space for each of the 18 reconstructed pathways. Then, the mean FA and mean MD of 48 ROIs were obtained by applying the WM Johns Hopkins University (JHU-ICBM-labels-1 mm) atlas 43 to the TRACULA maps.
Quality control of the processed outputs was performed by experienced neuroscientists (SD, AR) who inspected the images and the results of each pipeline slice by slice, and discarded those with poor quality or incorrect segmentation (Fig. 1). The influence of WM-hyperintensity load on FA and MD values in MUQUBIA selected tracts was assessed with two multivariate linear regression models 44 (Supplementary Table S3). To investigate possible bias due to different image acquisition protocols in the datasets, we compared the distributions of MRI features of subjects with the same diagnosis from different datasets (inter-cohort variability), and the distributions of MRI features of subjects with different diagnoses from the same dataset (intra-cohort variability) (Supplementary Fig. S8).

MUQUBIA classification steps
Figure 2 shows the workflow for the creation of the Support Vector Machine (SVM) model.
The imaging biomarkers, CDR scores, and sociodemographic information served as input to the SVM algorithm, which was run in Python 3.7.11. The framework we used was based on the scikit-learn library version 0.22.2 45.
The data set was randomly shuffled, with 70% of subjects forming the training set and 30% forming the test set. All 5 data sets were included in both the training and test data sets. None of the features had more than 50% missing data. Missing values were imputed with the median 46. The statistical comparison between the original biomarker values of the training and test sets is presented in Supplementary Table S6 to demonstrate homogeneity between the two groups. All values were standardized (z-scores) by removing the mean and scaling to unit variance, using the feature distributions of the subjects from the whole training sample.
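The split, imputation and standardization steps can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' script: the function name and the random seed are hypothetical, the split is not stratified because the text only mentions random shuffling, and fitting the median imputer on the training set alone is an assumption (the text states this explicitly only for the standardization).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def split_and_preprocess(X, y, seed=0):
    """70/30 shuffled split, median imputation, then z-scoring.

    Imputer and scaler are fitted on the training subjects only and then
    applied unchanged to the test subjects.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, shuffle=True, random_state=seed
    )
    prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    X_train_z = prep.fit_transform(X_train)  # learn medians, means, variances
    X_test_z = prep.transform(X_test)        # reuse training statistics
    return X_train_z, X_test_z, y_train, y_test, prep
```

With 506 subjects, a 30% test fraction rounds up to 152 test and 354 training subjects, which matches the group sizes reported in the Results.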
To test the adequacy of the training sample size we modeled the relationship between training sample size and accuracy using the post-hoc "learning curve fitting" method 47. The results are shown in Supplementary Fig. S1.
Machine learning models tend to overfit and become less generalizable when dealing with high-dimensional features, a well-known phenomenon called the "curse of dimensionality" 48. A large set of features generally implies the presence of irrelevant, redundant, or correlated variables. To overcome this, our algorithm performed feature selection, considering only those features that maximized the accuracy of the classification task in the training set evaluated with a five-fold cross validation (CV) approach. This procedure allowed us to determine which variables were most informative for the diagnostic categories selected in this study. To determine the best set, a forward and backward sequential feature selection approach was followed, with each feature added to the model individually 49. If accuracy increased, the feature was considered important; otherwise, it was discarded. After the selection process was completed, the surviving features were further reduced to obtain a Variance Inflation Factor (VIF) below the threshold value of 5 for each of them (see Table 4), indicating that there was no collinearity 50.
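The collinearity filter can be illustrated with a short NumPy sketch. The VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The text does not say in which order features were removed, so dropping the feature with the largest VIF first is an assumption, and both function names are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of every column of X (samples x features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs


def drop_collinear(X, names, threshold=5.0):
    """Drop the highest-VIF feature until every VIF is below the threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = variance_inflation_factors(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```

Removing one feature at a time and recomputing is a common choice because dropping a single member of a correlated pair usually brings the VIF of its partner back down.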
To increase computational efficiency, the one-versus-rest (OVR) method was used to transform the multi-class problem into multiple binary classifications. The classification results were obtained using a non-linear SVM 51. We optimized the search for the best hyperparameters using a five-fold CV splitting strategy over a grid search to find the best combination of SVM kernel, C and γ values. We also used L2 regularization.
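A hedged sketch of such a search with scikit-learn is given below. The grid values are the ones listed in the "MUQUBIA algorithm" results; wrapping `SVC` in `OneVsRestClassifier`, scoring by accuracy, and the function name `tune_ovr_svm` are assumptions made for illustration, since the text does not state how the OVR scheme was configured.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hyperparameter grid reported in the Results. The "estimator__" prefix
# routes each parameter to the SVC wrapped by OneVsRestClassifier.
PARAM_GRID = {
    "estimator__kernel": ["linear", "poly", "sigmoid", "rbf"],
    "estimator__C": [1, 10, 100, 1000, 10000],
    "estimator__gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001],
}


def tune_ovr_svm(X_train, y_train, param_grid=None, n_folds=5):
    """Grid search over kernel, C and gamma with five-fold CV."""
    grid = PARAM_GRID if param_grid is None else param_grid
    ovr_svm = OneVsRestClassifier(SVC())  # one binary SVM per diagnosis
    search = GridSearchCV(ovr_svm, grid, cv=n_folds, scoring="accuracy")
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` the model refitted on the full training set. The full grid contains 100 combinations, so each search trains 500 cross-validated models before the final refit.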
Finally, SVM performance was evaluated using the following metrics: accuracy, precision, recall, F1 score, Area Under the Curve (AUC), and the Receiver Operating Characteristic (ROC) curve. The global metrics, except for the accuracy, are macro-averaged, that is, they are the arithmetic mean of the individual class performance.
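To make the macro-averaging explicit, the sketch below computes per-class precision, recall and F1 from true and predicted labels and then takes their unweighted mean. It is a plain-Python illustration with a hypothetical function name and a toy set of labels, not the study's evaluation code. Macro-F1 is taken here as the mean of the per-class F1 scores, which is the scikit-learn convention and is assumed to apply.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)
    n = len(labels)
    # Macro average: every class counts equally, whatever its size.
    macro = [sum(m[i] for m in per_class.values()) / n for i in range(3)]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {
        "accuracy": accuracy,
        "macro_precision": macro[0],
        "macro_recall": macro[1],
        "macro_f1": macro[2],
        "per_class": per_class,
    }


# Toy example: one AD subject misclassified as DLB.
truth = ["AD", "AD", "FTD", "FTD", "DLB", "DLB", "CN", "CN"]
pred = ["AD", "DLB", "FTD", "FTD", "DLB", "DLB", "CN", "CN"]
print(macro_metrics(truth, pred)["macro_recall"])  # 0.875
```

In the toy example the single error lowers recall for AD and precision for DLB, and both effects enter the macro averages with equal weight.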
In the context of ML, interpretability is necessary to explain the outcome of a model. In this study, Shapley values were calculated using the library SHAP, version 0.40.0 25, to better understand the contribution of each feature expression.
The clinical challenge for the MUQUBIA algorithm was to distinguish between the different types of dementia. Because CDR is a clinical score collected by clinicians during the assessment process to differentiate the healthy from the dementia state, we evaluated the performance of our model even without including this scale in the feature set (Supplementary Fig. S2) to avoid circularity and minimize potential bias in favor of CN classification.

Statistics
Differences in the variances of the feature distribution of each diagnostic class between the original data set and the data set with imputed medians were assessed using the Brown-Forsythe test. Differences in sociodemographic, clinical, neuropsychological and morphological feature distributions among diagnostic groups, and inter-intra-cohort differences were assessed using the Kruskal-Wallis test for continuous variables and the Chi-squared test for dichotomous variables. Post-hoc analyses were performed to test differences between the four diagnostic groups by pairwise comparisons of the Wilcoxon rank sum test for continuous variables and a pairwise comparison between pairs of proportions for dichotomous variables. The p-values of the post-hoc analyses were adjusted with the Benjamini-Hochberg correction. To compare the neuropathological multilabel evidence with the MUQUBIA results, the metric LRAP (Label Ranking Average Precision) was calculated. Similarity between test and train ROC curves was assessed using DeLong's test. All statistical analyses were performed using R version 3.6.3, and the significance level was set at 0.05 for all tests.
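For readers unfamiliar with the Benjamini-Hochberg adjustment applied to the post-hoc p-values, the procedure can be sketched in a few lines. The statistical analyses in this study were run in R; the Python function below is only an illustration of the arithmetic, with a hypothetical name and made-up p-values.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Four hypothetical pairwise comparisons.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

An adjusted value below 0.05 then corresponds to a comparison that remains significant while controlling the false discovery rate at that level.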

Pipeline availability
The single subject classification tool based on the MUQUBIA models was also made publicly available through the neuGRID platform (https://neugrid2.eu) 21,52,53, an on-line high-performance computing (HPC) infrastructure that provides source code, tools, and data for image processing and ML analysis (see Supplementary Fig. S5).

Subjects
The final data set included 506 subjects: 110 AD, 135 FTD, 153 DLB and 108 CN. Demographic, clinical, neuropsychological, and ApoE information are shown in Table 2. Only neuropsychological tests that followed the same protocol in all 5 data sets were considered.

Feature set and sanity check
Image processing yielded a total of 336 features: 202 from FS (including 132 volumes and 70 cortical thickness values); 2 from LPA (WM lesion volume and WM lesion number); 36 from TRACULA (18 FA, 18 MD values for WM pathways); 96 features from the application of the JHU atlas ROIs to the FA and MD maps. The full list of features is reported in Supplementary Table S4. Table 3 reports the number of outputs deemed acceptable after visual inspection for each pipeline and diagnostic group, as well as the consistency of the success rate for each pipeline in the 4 diagnostic groups.

MUQUBIA algorithm
The training sample of MUQUBIA included 354 subjects, while 152 subjects formed the test group. The best hyper-parameters among those tested with the GridSearchCV function (i.e., kernel: linear, polynomial, sigmoid, radial basis function (RBF); C: 1, 10, 100, 1000, 10,000; γ: 0.1, 0.01, 0.001, 0.0001, 0.00001) were RBF kernel, C equal to 1000, and γ equal to 0.0001. For the entire analysis, consisting of image processing and classification of …
Table 2. Group characteristics. Values are expressed as mean ± standard deviation or percentage (%). P values were determined using the Kruskal-Wallis test for continuous variables and the Chi-squared test for dichotomous variables (α = 0.05). Values in brackets indicate the number of subjects for whom the characteristic is available. (§, post-hoc significant difference between AD and CN; ^, post-hoc significant difference between AD and DLB; °, post-hoc significant difference between AD and FTD; *, post-hoc significant difference between CN and DLB; £, post-hoc significant difference between CN and FTD; ç, post-hoc significant difference between DLB and FTD). n sample size, CDR® clinical dementia rating dementia staging instrument, NPI-Q neuropsychiatric inventory questionnaire, GDS Geriatric Depression Scale.
The algorithm selected 24 features, but two of them were discarded because of a VIF above 5, namely: fractional anisotropy of the left retrolenticular part of the internal capsule and left postcentral thickness. Figure 3 shows the imaging features selected by the bidirectional selection process implemented in MUQUBIA. The 22 features composing the best set are listed in Table 4. The features were ranked from highest to lowest importance in distinguishing the four diagnostic classes. The set of best features was composed of CDR, 19 MRI features, age and gender. The influence of age and gender on the MRI features was assessed and the results are reported in Supplementary Table S5. Across all diagnoses, CDR was the most important feature. The results of the Kruskal-Wallis test showed that the diagnostic groups differed significantly with respect to the selected variables. Post-hoc analyses revealed p-values below 0.05 in at least one comparison for all the features.
The Brown-Forsythe test always yielded a p-value greater than 0.05 (Supplementary Table S7), indicating that the original variance of the data set was not altered by median imputation.
… of FA for the corticospinal tract and high scores for CDR have a major impact on classification, followed by damage and shrinkage of some ROIs of the left hemisphere, such as: left superior fronto-occipital fasciculus, inferior-parietal thickness, entorhinal thickness. In general, age represents one of the most important factors for classification in all dementias. Additional information can be derived from the partial dependence plot of the main features (Fig. 6). This plot shows the marginal effect that two features have on the predicted outcome of MUQUBIA. Once the first feature was selected, the second was automatically chosen, picking out the feature with the strongest interaction with the first one. Most of the plots show complex correlations between the two features and the Shapley values (Supplementary Fig. S7), which are discussed in more detail in the "Discussion" section.

SHAP analysis
Finally, to increase the interpretability and to understand potential problems of MUQUBIA we analyzed some correctly and incorrectly predicted subjects in Supplementary Fig. S3 and in Supplementary Fig. S4.

MUQUBIA performance on test set
The SVM classification task for the subjects in the test set (Fig. 7) resulted in the following global metrics: accuracy 87.50%, macro-precision 88.00%, macro-recall 88.36%, macro-F1 score 87.88%, AUC 97.79%. The DeLong test revealed no significant differences (p > 0.05) between the ROC curves of the training and test sets for each class. A summary of the performance metrics is provided in Table 5.
Classification metrics obtained with MUQUBIA, trained with the same selected features but without CDR, are shown in Supplementary Fig. S2. Performance decreased slightly, especially in the case of CN. However, the classification task yielded the following global metrics: accuracy 84%, macro-precision 84%, macro-recall 84%, macro-F1 score 83%, AUC 96%.
Table 4. Best set of features selected by MUQUBIA. Values denote the mean ± standard deviation or percentage of variables that best classified subjects into the 4 diagnostic groups, ordered by Shapley values. P values were determined using the Kruskal-Wallis test or the Chi-squared test (α = 0.05) (§, Post-hoc significant analysis difference between AD and CN; ^, Post-hoc significant analysis difference between AD and DLB; °, Post-hoc significant analysis difference between AD and FTD; *, Post-hoc significant analysis difference between CN and DLB; £, Post-hoc significant analysis difference between CN and FTD; ç, Post-hoc significant analysis difference between DLB and FTD). AD Alzheimer's dementia, CDR® clinical dementia rating dementia staging instrument, CN cognitively normal controls, DLB dementia with Lewy bodies, FTD frontotemporal dementia, FS FreeSurfer version 6.0, LH left hemisphere, RH right hemisphere, FA fractional anisotropy, MD mean diffusivity, VIF variance inflation factor.

MUQUBIA performance on neuropathologically assessed subsample of the test set
Table 6 reports the LRAP value used to assess the agreement between the MUQUBIA probability estimates and the National Institute on Aging and Alzheimer's Association protocol 54 for neuropathological assessment of 9 patients in our test group. The LRAP metric is classically used in multilabel ranking problems 55. It determines the percentage of higher-ranked labels that resemble the true labels for each of the given samples. The score obtained is always greater than 0, and the best score is 1.
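The LRAP definition can be made concrete with a small plain-Python sketch. For each true label of a subject, it takes the fraction of labels ranked at or above it that are themselves true, averages over the subject's true labels, and then averages over subjects. The function name and the example probabilities are hypothetical; scoring a subject with no true label as 1 follows the scikit-learn convention and is an assumption here.

```python
def label_ranking_average_precision(y_true, y_score):
    """LRAP for multilabel ground truth and per-class scores.

    y_true  -- rows of 0/1 indicators, one column per class
    y_score -- rows of predicted scores or probabilities, same shape
    """
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, flag in enumerate(truth) if flag]
        if not relevant:
            per_sample.append(1.0)  # convention for samples without labels
            continue
        total = 0.0
        for j in relevant:
            # Rank of label j: number of labels scored at least as high.
            rank = sum(s >= scores[j] for s in scores)
            # How many of those labels are true labels.
            hits = sum(scores[k] >= scores[j] for k in relevant)
            total += hits / rank
        per_sample.append(total / len(relevant))
    return sum(per_sample) / len(per_sample)


# Hypothetical subject with mixed pathology (classes: AD, FTD, DLB, CN).
truth = [[1, 0, 1, 0]]
probs = [[0.60, 0.30, 0.25, 0.05]]
print(round(label_ranking_average_precision(truth, probs), 4))
```

In this example AD is ranked first, but FTD is ranked above the second true label DLB, so the score falls below 1. A ranking that places both true labels on top would score exactly 1.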

MUQUBIA report
An example of the MUQUBIA report generated with the on-line tool on the neuGRID platform is available as supplementary material (Supplementary Fig. S6).

Discussion
In this work, we developed an automated ML algorithm based on multimodal MRI capable of discriminating the most common forms of dementia. The performance of this classifier was validated using quality metrics that resulted in high scores for accuracy, macro-precision, macro-recall, macro-F1 and AUC. The classifier was successful in discriminating between the 4 groups (AD, FTD, DLB and CN) characterized by different neuropsychological scores and ApoE expression (Table 2). The algorithm selected CDR, age, gender information, MRI-based diffusion metrics, volumetric and cortical thickness values as the best differentiating features. SVM performance did not differ significantly between the test and training sets using 22 informative features; although performance on the training set was higher than on the test set, the absence of a significant difference argues against severe overfitting 56.
In the test set group, MUQUBIA scored highest in discriminating CN from the others, with excellent discrimination performance for each diagnostic class. The lowest performance was in detecting the AD group. This could be due to the overlap with other types of dementia, especially DLB 57. Neuropathological brains assessed by Montine's criteria were also correctly classified by MUQUBIA with very good performance (LRAP = 95%).
The MRI features studied were appropriate to selectively distinguish AD, FTD, DLB and to differentiate them from cognitively normal aging. The neuroimaging features were extracted from FS and TRACULA pipelines, making mandatory only the T13D and DTI to run the MUQUBIA algorithm. Optionally, the FLAIR can be used to improve the pial segmentation and to reduce segmentation errors caused by WM hyperintensities. The WM hyperintensity information extracted from the LPA does not seem to affect MUQUBIA, as this aspect is likely already present in the DTIs as increased MD and decreased FA. It is known that WM hyperintensity may have an impact on the DTI metrics, although in the present study and in relation to the features selected by MUQUBIA, only the tract of the superior fronto-occipital fasciculus was weakly affected.
In addition to cortical/subcortical gray matter information, which has long been considered an informative biomarker, WM diffusion metrics have also been shown to be important for ML classification. These metrics appear to be useful in distinguishing AD from FTD 17, and, albeit to a lesser extent, in distinguishing AD from DLB 35. The implemented data-driven MUQUBIA approach identified the best set of features, many of which were consistent with those described in the literature, while others were unexpected. For the benefit of the reader, the discussion of the results was organized according to the following 3 main macro-groups:
1. Clinical and socio-demographic features: Among the most important features in our model is the CDR, a well-known test for detecting and assessing the severity of dementia 58; therefore, it is not surprising that it turned out to be the most informative feature. Interestingly, the SHAP partial dependence plot (Fig. 6) shows that the probability of being classified as cognitively normal by MUQUBIA is greater when the CDR score is zero and the MD value of the medial lemniscus tract is low, indicating no degeneration. Higher MD values may instead progressively reduce the weight of the (non-pathological) CDR score in classifying a person as cognitively normal. This could be very promising information, especially for secondary prevention, which, by combining multimodal ad hoc biomarkers, would allow more accurate, sensitive, and earlier stratification of individuals at the pre-dementia stage than using CDR alone 59. As expected, the MUQUBIA model without CDR performed worse in the classification of CN, but also in AD, DLB, and FTD, confirming the importance of CDR also in the classification of dementia groups, as explained by the Shapley values (Fig. 4).
In addition, although neurological diseases are naturally assumed to affect only the elderly, this is not always the case. From the Shapley analysis, younger individuals belonging to the CN class are more likely to drop out (Fig. 5). The younger age of the FTD group must also be taken into account to explain possible brain imaging deviation and possible errors of our model. Interestingly, according to the literature, DLB is associated with male preponderance 60, and this was also observed in our DLB group. Finally, MUQUBIA seems to be strongly influenced by the degeneration of the left corticospinal tract, which is more pronounced in women than in men, when classifying AD subjects.
2. Cortical and sub-cortical features: DLB is associated with less global atrophy than AD, whereas posterior cingulate atrophy was similar in AD and DLB. AD patients showed more atrophy of the medial temporal lobe structures compared to DLB 61. Hippocampal atrophy was not limited to the AD and DLB groups, but has also been noted in FTD, although to a lesser extent than in AD 62. Conversely, FTD patients showed greater atrophy of the temporal pole and orbitofrontal areas than AD patients, while AD patients showed greater atrophy of the posterior cingulate and inferior parietal regions 63. In our study, no significant differences were found between DLB and CN with respect to the temporal pole, inferior parietal and orbitofrontal areas.
Consistent with the literature, we found that putamen volume in AD was intermediate between CN and FTD, with more atrophy in the latter 64. DLB showed volumetric atrophy in the putamen 65, with a moderate influence in the MUQUBIA model, or a slight influence in other basal ganglia such as the left pallidum 66. Even in FTD, where there is limited and conflicting evidence in the literature regarding the volumetry of deep gray matter structures, our results tend to confirm the findings of Möller et al. with respect to the basal ganglia, and show that FTD patients are characterized by the most severe atrophy compared with other diagnostic groups and that atrophy of the pallidum contributes to the classification of FTD patients in the MUQUBIA model. Further specific efforts will be needed to clarify this point in future studies.
Surprisingly, the volume of the left frontal pole was highest in FTD and differed significantly from all other patients examined in this study. This can be partly explained by the younger age of FTD compared with the other groups by approximately a decade. Consistent with the literature, patients with AD had smaller volumes of the frontal pole, isthmus of cingulate and left pars opercularis 67 compared with CN subjects.
Cortical thickness was a sensitive and comprehensive marker to distinguish AD from other dementias. Cortical shrinkage of the left entorhinal cortex has been reported to be greater in AD than in DLB 68, but similar in AD and FTD 69. Left inferior parietal thickness, also greater in FTD, proved to be a robust marker to disentangle AD from FTD for MUQUBIA 70. Moreover, the SHAP partial dependence plot (Fig. 6) showed that MUQUBIA classifies patients as AD when a concomitant reduction in left inferior parietal thickness is associated with a reduction in total left cortical volume, which has been linked in previous studies to a decrease in semantic fluency 71. Likewise, the SHAP partial dependence analysis (Fig. 6) revealed that MUQUBIA tends to classify patients in the DLB class when they exhibit lower total left cortical volume and a reduction in left pars opercularis thickness. This observation aligns with the existing literature, which links speech fluency impairment to these important regions in DLB 72,73.
3. DTI features: FA of the left corticospinal tract was lower in AD than in CN 74. Degeneration of the corticospinal tract has also been described in FTD 75. Instead, there is no clear evidence in the literature of damage of this tract in the DLB group 76, although this tract had a major effect on MUQUBIA. Possible explanations may be found in the larger group size used in our study than in other efforts and the quality of the DTI pipeline and scans we used to quantify the DTI metrics.
FA of the splenium of the corpus callosum and the superior fronto-occipital fasciculus was lower in AD than in CN 72, although the lowest FA values of these pathways occurred in DLB. DLB also showed lower values for FA than all other groups in many other pathways and ROIs 77. According to the literature, DLB showed higher MD in brainstem areas 78, such as in the pontine crossing tract, compared to CN. Other imaging biomarkers, such as the preservation of the retrolenticular part of the internal capsule, influenced MUQUBIA toward DLB classification. This is plausible given that motor and sensory fibres run through this ROI 79 and must remain intact to prevent dysphagia and swallowing dysfunction. FTD and AD were the most affected groups in the right retrolenticular part of the internal capsule 80. The medial lemniscus MD proved to be the third most important feature for classifying FTD patients in MUQUBIA. As previously mentioned, FTD was characterized by the degeneration of the corticospinal tract 81, similar to AD. The SHAP partial dependence plot (Fig. 6) for the FTD class also revealed that MUQUBIA finds a direct relationship between left corticospinal tract FA and right medial lemniscus MD values, indicating a specific form of frontal neurodegeneration. Last but not least, the correlation between these two tracts could confirm interesting findings on the detection of subtypes of frontotemporal lobar degeneration 82.

Benefits from MUQUBIA
Recently, the number of studies using ML has steadily increased because ML enables a fully data-driven and automated approach. ML is indeed flexible in discovering patterns, complex relationships, and predicting unobserved outcomes in data, starting from a sufficient number of observations 83, especially with increasing complexity, where classical statistical methods may be rather ineffective 84.
Research studies often address the binary classification between two clinical conditions (e.g., AD vs. CN; FTD vs. CN; FTD vs. AD), but this does not reflect the reality of the clinician who needs to make a diagnosis considering multiple neurodegenerative diseases at the same time. Although the field of neurodegenerative diseases has been extensively researched 85, to our knowledge, few studies have implemented an MRI-based ML algorithm for the classification of AD, FTD, DLB and CN 56,86,87 … biomarkers that required an invasive procedure such as lumbar puncture, which is difficult to obtain in a large population. This could also affect the applicability in daily routine and clinical practice in hospitals compared to the data needed as input to MUQUBIA. Many advanced research frameworks recommend the analysis of amyloid, tau, or 18F-fluorodeoxyglucose positron emission tomography (PET) scans of the brain and CSF to better classify patients 88. However, these expensive procedures may limit their actual utility and are not available in the normal clinical setting. MUQUBIA requires routinely available MRIs, a clinical test, and some demographic information, so it can be considered widely applicable without incurring excessive costs and burdening patients unnecessarily.
The online MUQUBIA tool does not require manual or "a priori" preprocessing, and the end-user does not need to have prior knowledge of the algorithm, although a quality check of the ROI segmentation is always advisable.
In addition, experienced neuroradiologists are often not available in routine clinical practice outside of a specialized memory clinic, so an automated method capable of extracting and interpreting the information with high precision would be of great clinical value.
A strength of this study is that the DTIs followed heterogeneous acquisition protocols, e.g., gradient directions vary from a minimum of 19 (low) to a maximum of 114 (high). The FLAIR and T13D parameters also differed, bringing this study closer to a real-world clinical scenario.

Limitations and future developments
We have considered various types of neurodegenerative diseases, which account for a large proportion of dementia cases, but this approach to differential diagnosis is far from complete. We did not attempt to define subtypes, such as posterior cortical atrophy in AD, the language or semantic variant in FTD, or psychiatric and delirium onset in DLB. This study has limitations related to a partial influence of age and gender on certain MRI features, particularly in FTD and DLB. In fact, the FTD group is the youngest, with an average age of onset of 56 years, while AD and DLB occur later 9. The DLB group instead showed a preponderance of males. These confounders could make it easier for the classifier to identify these groups, and additional experiments should be performed to rule this out. The fact that inter-cohort variability was lower than intra-cohort variability suggests that the effect of dementia etiology on MRI features is more important than the potential bias induced by heterogeneous acquisition protocols; still, the classifier might be further improved by minimizing the "center effect" and reducing the few differences observed 89.
Future efforts will aim to speed up processing with new tools, such as FastSurferCNN, that exploit deep neural networks and graphics processing units to reduce image preprocessing time to minutes.
Finally, because of difficulties in finding datasets that contained multimodal and multiclass data, this study lacked a complete independent validation data set. In the future, MUQUBIA should be validated with independent data sets, given the upcoming Big-Data era.

Conclusion
The fully automated classifier developed in this study can discriminate between AD, FTD, DLB and CN with good to excellent performance. Our ML classifier can help clinicians as a second-opinion tool to better diagnose the different forms of dementia based on routine and cost-effective biomarkers such as age, gender, CDR and automatically extracted MRI features. It is important to point out that the interpretability and explainability of ML methods provide important clues, allow us to go beyond the slogan "ML is a black box", and can lead to the discovery of new informative data-driven candidate biomarkers.

Figure 1. Acceptable and non-acceptable outputs of each image analysis pipeline. All images and outputs were inspected slice by slice. Images of low quality, presenting artifacts, or resulting in wrong segmentation or unrealistic reconstruction were discarded.

Figure 2. Steps to create and test MUQUBIA. (a) Images of 506 subjects were processed to obtain the full set of features. (b) Missing values were replaced with median values. (c) The data were split into a training set (70% of the subjects) and a test set (30%) to avoid any bias in the selection of features and in the classification performance. (d) Values were standardized. (e) The full set of features was pruned to avoid overfitting using a bidirectional sequential feature selection approach. (f) The non-linear SVM model was built and fine-tuned on the training and validation sets, and tested on the held-out test set. Acronyms: ft, features; MD, Mean Diffusivity; SVM, Support Vector Machine; WM, White Matter.
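The steps in this caption can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' code. The synthetic feature table, the stratified split, the RBF kernel, the hyperparameters, and the number of selected features are all assumptions. Note also that scikit-learn's SequentialFeatureSelector searches forward or backward only, so forward selection stands in here for the bidirectional search described in step (e).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# (a) Synthetic stand-in for the 506-subject, 4-class feature table (hypothetical data).
X, y = make_classification(n_samples=506, n_features=12, n_informative=6,
                           n_redundant=2, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing measurements

# (b) Replace missing values with the per-feature median.
X = SimpleImputer(strategy="median").fit_transform(X)

# (c) 70/30 split; stratification by diagnosis is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# (d) standardize, (e) prune features, (f) fit a non-linear SVM.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SequentialFeatureSelector(SVC(kernel="rbf"),
                                         n_features_to_select=5,
                                         direction="forward", cv=3)),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale", random_state=0)),
])
model.fit(X_train, y_train)

selected = model.named_steps["select"].get_support()
test_accuracy = model.score(X_test, y_test)
print(f"kept {selected.sum()} of {X.shape[1]} features, test accuracy {test_accuracy:.2f}")
```

Wrapping scaling and selection in the same pipeline as the classifier means both are fitted on the training subjects only, which is the point of splitting before steps (d) and (e).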

Figure 3.

Figure 4 shows the average influence of the features on the prediction of each diagnosis: CDR values have the greatest influence, especially for the classification of CN and AD, whereas the FA of the left corticospinal tract, among others, most strongly influences the classification of the DLB and AD groups. The global interpretability plot (Fig. 5) shows whether a feature shifts the MUQUBIA prediction toward other diagnostic classes, as well as the relative contribution of each feature. All points in the plot are standardized. Focusing on the CN class, low CDR values have a very high impact on the determination of this diagnosis. High values of temporal ROIs (left hippocampal volume and left entorhinal thickness) also have a high influence, as does a low MD value of the right medial lemniscus. Other MRI measures do not provide simple or practical information on how they influence the MUQUBIA outcome. Atrophy of the left frontal pole, associated with an increase of MD in the right medial lemniscus and a decrease of FA in the fronto-occipital fasciculus, influences the prediction of the FTD class, in addition to the degeneration of the corticospinal tract. For the DLB class, the corticospinal tract represented an imaging biomarker of great importance, especially with a reduced FA value, although this tract is not a classic biomarker for DLB. Other imaging biomarkers, such as preservation of MD in the retrolenticular part of the internal capsule and preservation of left cortical thickness (entorhinal and inferior parietal), have an impact on the classification of DLB patients. For AD, lack or moderate impairment

Figure 4. Contribution of each feature to the classification, represented by the mean Shapley magnitude values. The graph shows the importance of each variable for each diagnostic group. Acronyms: AD, Alzheimer's Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Bodies; CN, Cognitively Normal; FA, Fractional Anisotropy; MD, Mean Diffusivity; LH, left hemisphere; RH, right hemisphere.
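To make "mean Shapley magnitude" concrete, the sketch below computes exact Shapley values by enumerating feature coalitions and then averages their absolute values over subjects. It is an illustration only: the toy scoring function, the three feature names, the sample values, and the all-zero baseline are hypothetical, and exact enumeration is used here in place of the SHAP library's approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; absent features take baseline values."""
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shapley(predict, samples, baseline):
    """Mean absolute Shapley value of each feature across samples."""
    per_sample = [shapley_values(predict, x, baseline) for x in samples]
    return [sum(abs(p[i]) for p in per_sample) / len(per_sample)
            for i in range(len(baseline))]


def toy_score(z):
    """Hypothetical class score with one interaction term (not the MUQUBIA model)."""
    cdr, hippocampus, fa = z
    return 2.0 * cdr - 1.0 * hippocampus + 0.5 * cdr * fa


# Standardized feature values for three hypothetical subjects.
samples = [[1.0, -0.5, 0.2], [0.0, 1.0, -1.0], [0.5, 0.0, 1.5]]
baseline = [0.0, 0.0, 0.0]
importance = mean_abs_shapley(toy_score, samples, baseline)
print(dict(zip(["CDR", "hippocampus", "FA"], importance)))
```

Enumeration costs grow exponentially with the number of features, so this form is only practical for a handful of inputs; it is shown because it makes the definition explicit.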

Figure 5. Global interpretability plots for each diagnostic class. Each dot corresponds to a subject in the training set. The position of the dot on the x-axis shows the effect of that feature on the prediction of the model for that subject. If multiple dots land at the same x position, they are piled up to show density. The features are ordered by the sum of the Shapley values. Colors are used to display the standardized value of each feature (colder colors represent lower values, warmer colors represent higher values). Acronyms: AD, Alzheimer's Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Bodies; CN, Cognitively Normal; LH, left hemisphere; RH, right hemisphere; FA, Fractional Anisotropy; MD, Mean Diffusivity.

Figure 6.

Figure 7. Confusion matrix and ROC curves of the test set. The AUC of each ROC curve for each diagnostic class against all others is reported in the legend. Acronyms: AD, Alzheimer's Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Bodies; CN, Cognitively Normal; AUC, Area Under the Curve.
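A one-class-against-all-others AUC of the kind reported in this figure can be computed directly from class probabilities. The sketch below uses the rank-based (Mann-Whitney) form of the AUC; the six subjects and their probabilities are invented for illustration and are not MUQUBIA outputs.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class against all others; ties count as half a win."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


classes = ["AD", "FTD", "DLB", "CN"]
y_true = ["AD", "AD", "FTD", "DLB", "CN", "CN"]
# Hypothetical predicted probabilities, one row per subject, columns as in `classes`.
proba = [
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],  # an FTD subject scored highest for AD
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.1, 0.1, 0.6],
]

auc = {
    name: auc_one_vs_rest([row[k] for row in proba], y_true, name)
    for k, name in enumerate(classes)
}
print(auc)
```

The AUC here is the fraction of (positive, negative) subject pairs in which the positive subject receives the higher score for that class, which is equivalent to the area under the corresponding ROC curve.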

Table 3. Number of correctly processed images and success rate of image processing after visual inspection. Numeric values denote the number of outputs that were deemed acceptable after visual inspection for each pipeline in each diagnostic group. Percentages indicate the success rate of each pipeline after visual quality inspection by two raters. P values were obtained with the Chi-squared test (α = 0.05). FS FreeSurfer version 6.0, LPA Lesion Prediction Algorithm, AD Alzheimer's dementia, FTD frontotemporal dementia, DLB dementia with Lewy bodies, CN cognitively normal controls.

[...] subjects, the algorithm requires 10 h on a machine running Ubuntu Server 18.04 LTS with a Sun Grid Engine scheduler, equipped with 1300 GB RAM and 214 cores. Most of the required time is spent on image analysis.

Table 5. MUQUBIA quantitative metrics for differential diagnosis in each diagnostic group of the test set. Metrics used to determine the goodness of MUQUBIA in discriminating each diagnostic class. AD Alzheimer's dementia, FTD frontotemporal dementia, DLB dementia with Lewy bodies, CN cognitively normal controls, PPV positive predictive value, NPV negative predictive value.
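Per-class metrics such as PPV and NPV follow from a multiclass confusion matrix by treating each class in turn as "positive" and all others as "negative". The sketch below shows the arithmetic; the confusion matrix is hypothetical and does not reproduce the values in this table.

```python
def per_class_metrics(cm, classes):
    """One-vs-rest metrics from a confusion matrix (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k, name in enumerate(classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                   # true class k, predicted otherwise
        fp = sum(row[k] for row in cm) - tp    # predicted k, true class differs
        tn = total - tp - fn - fp
        ppv = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        npv = tn / (tn + fn) if tn + fn else 0.0
        f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
        metrics[name] = {"PPV": ppv, "recall": recall, "NPV": npv, "F1": f1}
    return metrics


classes = ["AD", "FTD", "DLB", "CN"]
# Hypothetical test-set confusion matrix, 10 subjects per class.
cm = [
    [8, 1, 1, 0],
    [1, 9, 0, 0],
    [2, 0, 8, 0],
    [0, 0, 0, 10],
]
metrics = per_class_metrics(cm, classes)
for name in classes:
    print(name, {key: round(val, 3) for key, val in metrics[name].items()})
```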

Table 6. MUQUBIA agreement with neuropathologic assessments. The table reports the LRAP score derived considering the multilabel neuropathological ground truth (Montine's criteria) of 9 subjects of our test set and the MUQUBIA classification probabilities. All 9 subjects had cognitive impairment. An 'Intermediate' or 'High' level of ADNC should be considered an adequate explanation of AD dementia. A 'Limbic', 'Neocortical' or 'Amygdala-predominant' level should be considered an adequate explanation of Lewy Body Diseases, and this does not preclude the contribution of other diseases (e.g., 'Amygdala-predominant LBD' typically occurs in the context of advanced AD neuropathologic change). Presence of frontotemporal lobar degeneration with tau or other tauopathy and subtypes was labeled as 'Yes'. LRAP label ranking average precision, ADNC NIA-AA Alzheimer's disease neuropathologic change, ABC Aβ/amyloid plaques (A), NFT stage (B), and neuritic plaque score (C), FTLD frontotemporal lobar degeneration, AD Alzheimer's dementia, DLB dementia with Lewy bodies, FTD frontotemporal dementia, CN cognitively normal.
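The LRAP score asks, for each subject, how highly the true labels are ranked among the predicted class probabilities, which suits a ground truth in which one subject can carry several neuropathological labels. A minimal sketch of the standard definition follows; the three subjects, their labels, and their probabilities are invented for illustration and are not the 9 cases in this table.

```python
def label_ranking_average_precision(y_true, y_score):
    """LRAP for multilabel truth; assumes every sample has at least one true label."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        sample_sum = 0.0
        for j in relevant:
            # Rank of label j: how many labels score at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those are true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            sample_sum += hits / rank
        total += sample_sum / len(relevant)
    return total / len(y_true)


# Label order: AD, DLB, FTD, CN (hypothetical multilabel ground truth).
y_true = [
    [1, 0, 0, 0],  # AD pathology only
    [1, 1, 0, 0],  # mixed AD and Lewy body pathology
    [0, 0, 1, 0],  # FTLD only
]
y_score = [
    [0.80, 0.10, 0.05, 0.05],
    [0.60, 0.10, 0.25, 0.05],
    [0.50, 0.10, 0.35, 0.05],
]
lrap = label_ranking_average_precision(y_true, y_score)
print(round(lrap, 3))
```

A score of 1 means every true label is ranked above every false label for every subject; in this toy example the second and third subjects lower the score because a false label outranks one of their true labels.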
Scientific Reports | (2023) 13:17355 | https://doi.org/10.1038/s41598-023-43706-6

To date, no study has used DTI and multimodal analyses simultaneously. MUQUBIA is the first ML algorithm for differential diagnosis to use DTI together with 3D T1 and FLAIR on a very robust sample size. In fact, Klöppel et al. recruited a small group of FTD and DLB patients, whereas Koikkalainen et al. and Tong et al. included a broader range of dementias (such as vascular dementia and subjective memory complaints), but still with fewer subjects per group and with worse performance than MUQUBIA (Klöppel et al.: accuracy of 65%; Koikkalainen et al.: accuracy of 70.6%; Tong et al.: accuracy of 75.2%). Moreover, Tong et al. used CSF biomarkers.