Introduction

Neurodegenerative dementias are a common and increasing cause of mortality and disability worldwide, particularly in older age1. The most common form of neurodegenerative dementia worldwide is Alzheimer’s dementia (AD), but recent epidemiological studies and refinement of new clinical criteria have shown that frontotemporal dementia (FTD) and dementia with Lewy bodies (DLB) are also common forms2. Specifically, DLB accounts for 5–7% of all dementias in the elderly3, FTD about 7%4, with one in four cases occurring late in life5, while AD may contribute to 60–70% of cases overall6. These neurodegenerative dementias are heterogeneous in their clinical presentation and underlying pathophysiology, although they share overlapping features7.

Biomarkers provide a powerful approach to understand the spectrum of neurological diseases by identifying them from the earliest manifestations to the final stages8. Increased diagnostic accuracy allows more precise prognostic approaches and often leads to specific treatments and optimal patient care9. In this context, it is important to determine which diagnostic markers can most reliably identify the different pathologies that lead to dementia. The main challenge for researchers and clinicians is to determine biomarkers that not only identify AD but can simultaneously distinguish between patients with FTD, DLB and cognitively normal controls (CN). Currently, imaging biomarkers assessed by magnetic resonance imaging (MRI) in conjunction with clinical examinations and neurocognitive assessments are the most commonly used tests to diagnose neurodegenerative dementias10. In recent years, several MRI-based imaging sequences or modalities have been introduced into clinical practice. The most commonly used MRI sequences are: structural T1-weighted 3D (T13D) and T2 Fluid Attenuated Inversion Recovery (FLAIR) images, which provide morphological measurements of the brain. In addition, Diffusion Tensor Imaging (DTI) is a well-established technique that is particularly useful for studying white matter (WM) integrity11.

The development of accurate image analysis pipelines combined with advanced classification methods could improve differential diagnosis12. Indeed, automated MRI segmentation tools can systematically generate brain morphometric features with minimal inter-operator variability, although some of these tools require considerable processing time and computational power.

The best-known segmentation algorithms are FreeSurfer (FS), which can extract volume, area and thickness of many brain regions of interest (ROI) and the Lesion Prediction Algorithm (LPA), which can quantify WM hyperintensities. Both algorithms have been validated against manual raters and performed well13,14,15. As for DTI analysis, TRActs Constrained by UnderLying Anatomy (TRACULA) is one of the best validated tools for reconstructing WM pathways16.

The results of automated MRI pipelines can be used to develop machine learning (ML) tools with good classification performance. Support Vector Machines (SVM) are among the widely used supervised ML algorithms because they are easy to implement while being effective in diagnostic classification tasks17,18. In some cases, imaging variables can be used in conjunction with clinical and neuropsychological variables as input to multivariate data analyses and ML algorithms19. These models have been shown to be an effective strategy for identifying features capable of discriminating between different classes and subtypes of disease20,21, with results comparable to or better than neuropsychological tests alone22,23.

Indeed, ML in neuroscience is an ever-growing area of research based on learning relationships from large and complex data sets with the ability to apply the learned rules to other similar unseen data. Often, these tools appear to be able to detect brain patterns that are beyond human perception and can help clinicians to highlight and interpret medical findings24. To this end, tools for global and local interpretability of ML models have recently been developed25.

The present study was conducted within the framework of the Italian Network for Neuroscience and Neurorehabilitation (RIN) (https://www.reteneuroscienze.it/en/), established in 2017 by the Italian Ministry of Health. The RIN (1) promotes collaboration among the National Research Hospitals (IRCCS), (2) facilitates the dissemination of information to the clinical and scientific community, and (3) promotes the use of harmonized protocols and advanced ML tools to enhance clinical practice26,27,28.

With this background, we developed an ML algorithm and explained how it classified subjects into the four diagnostic classes (i.e., AD, FTD, DLB, CN) based on sociodemographic, clinical, and imaging data. Our objectives were to: (1) discover the most informative combination of biomarkers to distinguish the different forms of dementia; (2) investigate the pathophysiological role of WM alterations multimodally; (3) provide an interpretation of how MUltimodal QUantification of Brain whIte matter biomArkers in dementia (MUQUBIA) works.

Methods

Study design

This study included the following steps: data preprocessing, selection of discriminative features, classification of subjects, and SHapley Additive exPlanations (SHAP) analyses.

MRI images were processed with automated tools to extract the volume and thickness of cortical and subcortical brain regions, WM lesions, and WM diffusion metrics. All these values were used to train and test the MUQUBIA model for classification into diagnostic groups with a hold-out strategy.

Data

Subjects with a clinical diagnosis of AD, FTD, DLB, or CN were selected from 5 data sets.

The databases used for data collection were:

  • Alzheimer’s Disease Neuroimaging Initiative (ADNI)29: 84 AD, 15 DLB (from Neuropathology Data, http://adni.loni.usc.edu/methods/neuropath-methods/), 80 CN;

  • Frontotemporal Lobar Degeneration Neuroimaging Initiative (FTLDNI): 135 FTD, 10 CN;

  • National Alzheimer's Coordinating Center (NACC)30: 26 AD, 27 DLB, 18 CN;

  • NIH Parkinson's Disease Biomarkers Program (PDBP)31: 60 DLB;

  • Newcastle University, Newcastle upon Tyne32,33,34,35: 51 DLB.

The FTLDNI database contained sufficient FTD data for our purposes. All three FTD subtypes (i.e., behavioural variant, semantic variant, and progressive non-fluent aphasia) were considered. AD and CN were selected from a larger sample to avoid size imbalance. For these three classes, only subjects with all three sequences available at the same time-point and more than 12 DTI directions were included. Because there were no open access databases of DLB patients with all three sequences needed for this study, we also included DLB subjects with at least one sequence (Supplementary Table S2), thus increasing the sample size and allowing more accurate data imputation. A sample of no fewer than 100 subjects was assembled for each diagnostic class. Sociodemographic, clinical, and imaging variables were collected for all subjects. Neuropsychological test scores were collected but not included in the analysis because the assessment protocol for CN does not always include neuropsychological characterization. The clinical assessment used was the global score of the Clinical Dementia Rating (CDR) Dementia Staging Instrument.

Supplementary Table S1 lists the diagnostic and selection criteria for each study considered. For a complete list of subjects, diagnoses, and data sets used in this study see Supplementary Table S2.

MR imaging

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).

ADNI and FTLDNI data were collected from the Imaging Data Archive (IDA) web-portal of the Laboratory of NeuroImaging (LONI) (http://adni.loni.usc.edu).

NACC and PDBP data were downloaded from their respective web portals: https://naccdata.org/ and https://pdbp.ninds.nih.gov/.

The Newcastle data were provided directly by the Translational and Clinical Research Institute, Newcastle University.

Table 1 reports the imaging characteristics for each sequence and data set. Combining data from multiple hospitals is useful to build ML models that are invariant to systematic inter-scanner effects and to overcome differences in field strengths and acquisition protocols36.

Table 1 Image characteristics for each data set.

Pipelines for image processing

N4 correction, from Advanced Normalization Tools (ANTs)37, was performed for all images to correct smooth intensity variations in MRI. The pipelines used for image processing in this study were FS version 6.0, LPA, and TRACULA.

FS is a pipeline for segmenting the cortical and subcortical brain structures using volumetric T13D images, where each voxel is labeled based on a probabilistic atlas13,14. The T13D MRIs were processed using the cross-sectional stream of the recon-all script with the Desikan-Killiany atlas and, when available in high quality, FLAIRs were used to improve the segmentation of the pial surfaces38. Volumes of subcortical regions in native space were normalized to the FS estimated total intracranial volume (eTIV). Normalization was performed by dividing the volume of the region by the eTIV of the subject and multiplying the ratio by a reference value of 1409 ml39 to remove the effect of head size40. Cortical thickness values were not normalized19.
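The head-size normalization described above reduces to a single ratio. The snippet below is a minimal sketch of that arithmetic; the function name and the example subject are hypothetical and are not taken from the study's code.

```python
REFERENCE_ETIV_ML = 1409.0  # reference intracranial volume quoted in the text


def normalize_volume(region_volume_ml, etiv_ml, reference_ml=REFERENCE_ETIV_ML):
    """Scale a regional volume to a common head size: volume / eTIV * reference."""
    if etiv_ml <= 0:
        raise ValueError("eTIV must be positive")
    return region_volume_ml / etiv_ml * reference_ml


# Hypothetical subject: 3.5 ml hippocampus and 1600 ml eTIV.
normalized = normalize_volume(3.5, 1600.0)
print(f"normalized volume: {normalized:.3f} ml")
```

A subject whose eTIV equals the reference value keeps the original volume, while larger heads are scaled down and smaller heads scaled up.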

LPA is an algorithm for the quantification of the WM lesions that is part of the Lesion Segmentation Toolbox (LST)41. First, FLAIR images were linearly registered to T13D and each voxel was classified as cerebrospinal fluid (CSF), gray matter, or WM using the Statistical Parametric Mapping Tool v12.0 (SPM-12) tissue probability maps. Intensity distributions were calculated for each of them and weighted based on the spatial probability of belonging to WM. Finally, the map was converted to a binary lesion mask and its volume in native space (normalized to eTIV) was calculated.

TRACULA is a tool for automatic reconstruction of a set of 18 major WM pathways16. It uses prior information about the anatomy and relative positions of the WM tracts in relation to surrounding anatomical structures, obtained from a set of cognitively normal training subjects in which the tracts were manually labeled to produce tractography streamlines42. After mitigating image distortions due to eddy currents and B0 field inhomogeneities, TRACULA fits the shape of the tracts to both the subject's diffusion data and the anatomical neighborhood priors derived from the subject's T1 data. Fractional anisotropy (FA) and mean diffusivity (MD) were extracted from the diffusion data in MNI template space for each of the 18 reconstructed pathways. Then, the mean FA and mean MD of 48 ROIs were obtained by applying the WM Johns Hopkins University (JHU-ICBM-labels-1 mm) atlas43 to the TRACULA maps.

Quality control of the processed outputs was performed by experienced neuroscientists (SD, AR) who inspected the images and the results of each pipeline slice by slice, and discarded those with poor quality or incorrect segmentation (Fig. 1). The influence of WM-hyperintensity load on FA and MD values in MUQUBIA selected tracts was assessed with two multivariate linear regression models44 (Supplementary Table S3). To investigate possible bias due to different image acquisition protocols in the datasets, we compared the distributions of MRI features of subjects with the same diagnosis from different datasets (inter-cohort variability), and the distributions of MRI features of subjects with different diagnoses from the same dataset (intra-cohort variability) (Supplementary Fig. S8).

Figure 1

Acceptable and non-acceptable outputs of each image analysis pipeline. All images and outputs have been inspected slice by slice. Images of low quality, presenting artifacts or resulting in wrong segmentation or unrealistic reconstruction were discarded.

MUQUBIA classification steps

Figure 2 shows the workflow for the creation of the Support Vector Machine (SVM) model.

Figure 2

Steps to create and test MUQUBIA. (a) Images of 506 subjects were processed to obtain the full set of features. (b) Missing values were replaced with median values. (c) The data were split into training set (70% of the subjects) and test set (30%) to avoid any bias in the selection of features and in the classification performance. (d) Values were standardized. (e) The full set of features was pruned to avoid overfitting using a bidirectional sequential feature selection approach. (f) The non-linear SVM model was built and fine-tuned on the training and validation sets, while being tested on the test set left aside. Acronyms: ft, features; MD, Mean Diffusivity; SVM, Support Vector Machine; WM, White Matter.

The imaging biomarkers, CDR scores, and sociodemographic information served as input to the SVM algorithm, which was run in Python 3.7.11. The framework we used was based on the scikit-learn library45, version 0.22.2.

The data set was randomly shuffled, with 70% of subjects forming the training set and 30% forming the test set. All 5 data sets were represented in both the training and test sets. None of the features had more than 50% missing data. Missing values were imputed with the median46. The statistical comparison between the original biomarker values of the training and test sets is presented in Supplementary Table S6 to demonstrate homogeneity between the two groups. All values were standardized by removing the mean and scaling to unit variance, using the mean and standard deviation of the feature distributions of the subjects in the whole training sample (z-scores).
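The preprocessing just described can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, the function name is hypothetical, and fitting the imputer on the training split only is a choice of this sketch (Fig. 2 lists imputation before the split).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_impute_standardize(X, y, test_size=0.30, seed=0):
    """70/30 hold-out split, median imputation, z-scoring with training statistics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=seed
    )
    # Assumption of this sketch: medians are learned on the training split only.
    imputer = SimpleImputer(strategy="median").fit(X_train)
    scaler = StandardScaler().fit(imputer.transform(X_train))
    X_train_z = scaler.transform(imputer.transform(X_train))
    X_test_z = scaler.transform(imputer.transform(X_test))
    return X_train_z, X_test_z, y_train, y_test


# Synthetic stand-in for the feature table: 100 subjects, 5 features, ~10% missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = np.arange(100) % 4  # four diagnostic classes

X_train_z, X_test_z, y_train, y_test = split_impute_standardize(X_demo, y_demo)
print(X_train_z.shape, X_test_z.shape)
```

Because the scaler is fitted on the training split, the test features are expressed in training-set z-scores and their mean need not be exactly zero.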

To test the adequacy of the training sample size we modeled the relationship between training sample size and accuracy using the post-hoc “learning curve fitting” method47. The results are shown in Supplementary Fig. S1.

Machine learning models tend to overfit and become less generalizable when dealing with high-dimensional features, a well-known phenomenon called the “curse of dimensionality”48. A large set of features generally implies the presence of irrelevant, redundant, or correlated variables. To overcome this, our algorithm performed feature selection, considering only those features that maximized the accuracy of the classification task in the training set evaluated with a five-fold cross validation (CV) approach. This procedure allowed us to determine which variables were most informative for the diagnostic categories selected in this study. To determine the best set, a forward and backward sequential feature selection approach was followed, with each feature added to the model individually49. If accuracy increased, the feature was considered important; otherwise, it was discarded. After the selection process was completed, the surviving features were further reduced to obtain a Variance Inflation Factor (VIF) below the threshold value of 5 for each of them (see Table 4), indicating that there was no collinearity50.
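The two pruning steps above can be illustrated with a short sketch. Two simplifications are assumed: the selection loop is forward-only rather than bidirectional, and the VIF is computed directly from its definition, 1 / (1 - R^2), with NumPy. All names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def forward_select(X, y, cv=5):
    """Greedy forward pass: add the feature that most improves CV accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {
            j: cross_val_score(SVC(), X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no candidate improves accuracy, so stop
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score


def vif(X):
    """VIF of each column, from regressing it on all the other columns."""
    X = np.asarray(X, dtype=float)
    values = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        values.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(values)


# Synthetic data: feature 1 nearly duplicates feature 0, feature 2 is pure noise.
rng = np.random.default_rng(0)
x0 = rng.normal(size=120)
X_demo = np.column_stack(
    [x0, x0 + 0.05 * rng.normal(size=120), rng.normal(size=120)]
)
y_demo = (x0 > 0).astype(int)

selected, score = forward_select(X_demo, y_demo)
vifs = vif(X_demo)
print(selected, round(score, 3), np.round(vifs, 1))
```

In this toy case the two near-duplicate columns exceed the VIF threshold of 5, so one of them would be dropped, while the noise column stays close to 1.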

To increase computational efficiency, the one-versus-rest (OVR) method was used to transform the multi-class problem into multiple binary classifications. The classification results were obtained using a non-linear SVM51. We optimized the search for the best hyperparameters using a five-fold CV splitting strategy over a grid search to find the best combination of SVM kernel, C and γ values. We also used L2 regularization.
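A one-versus-rest SVM tuned by a five-fold grid search can be sketched with scikit-learn as below. The full grid mirrors the values reported in the Results; wrapping `SVC` in `OneVsRestClassifier`, the function name, and the synthetic data are assumptions of this sketch. The demonstration uses a reduced grid to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hyperparameter grid as reported in the Results section.
FULL_GRID = {
    "estimator__kernel": ["linear", "poly", "sigmoid", "rbf"],
    "estimator__C": [1, 10, 100, 1000, 10000],
    "estimator__gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001],
}


def tune_ovr_svm(X, y, param_grid=None, cv=5):
    """Grid search over an OVR-wrapped SVC, scored by cross-validated accuracy."""
    grid = FULL_GRID if param_grid is None else param_grid
    model = OneVsRestClassifier(SVC())  # one binary SVM per diagnostic class
    search = GridSearchCV(model, grid, scoring="accuracy", cv=cv)
    return search.fit(X, y)


# Synthetic four-class problem standing in for the standardized feature table.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=6, n_classes=4, random_state=0
)
small_grid = {
    "estimator__kernel": ["rbf"],
    "estimator__C": [1, 10],
    "estimator__gamma": [0.1, 0.01],
}
search = tune_ovr_svm(X_demo, y_demo, small_grid)
print(search.best_params_, round(search.best_score_, 3))
```

In `SVC`, the parameter C sets the strength of the L2 penalty on the margin: smaller values regularize more strongly.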

Finally, SVM performance was evaluated using the following metrics: accuracy, precision, recall, F1 score, and the Receiver Operating Characteristic (ROC) curve with its Area Under the Curve (AUC). The global metrics, except for accuracy, are macro-averaged, that is, computed as the arithmetic mean of the per-class values.
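As a worked illustration of macro-averaging, the sketch below computes per-class precision, recall, and F1 in plain Python and then takes their arithmetic mean. The eight labelled subjects are invented for the example and do not come from the study.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} with each class scored against the rest."""
    rows = {}
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[c] = (precision, recall, f1)
    return rows


def macro_average(rows):
    """Arithmetic mean of each metric over the classes."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows.values()) / n for i in range(3))


labels = ["AD", "FTD", "DLB", "CN"]
y_true = ["AD", "AD", "FTD", "FTD", "DLB", "DLB", "CN", "CN"]
y_pred = ["AD", "DLB", "FTD", "FTD", "DLB", "DLB", "CN", "CN"]  # one AD called DLB

rows = per_class_metrics(y_true, y_pred, labels)
macro_precision, macro_recall, macro_f1 = macro_average(rows)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro_precision, macro_recall, macro_f1, accuracy)
```

A single misclassified AD subject lowers the recall of AD and the precision of DLB, and both effects carry equal weight in the macro averages regardless of class size.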

In the context of ML, interpretability is necessary to explain the outcome of a model. In this study, Shapley values were calculated using the SHAP library25, version 0.40.0, to better understand the contribution of each feature to the model predictions.
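To make the notion of a Shapley value concrete, the sketch below computes exact values by enumerating all feature subsets for a toy linear model. It illustrates the definition only: it does not use the SHAP library, whose estimators rely on more efficient approximations, and the model, background data, and names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x; absent features are averaged over background rows."""
    n = len(x)

    def value(subset):
        # Expected prediction when the features in `subset` are fixed to their values in x.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Toy linear model with three features.
weights = [2.0, -1.0, 0.5]


def predict(row):
    return sum(w * v for w, v in zip(weights, row))


background = [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]  # feature means are 1, 2, 3
x = [3.0, 1.0, 2.0]
phi = shapley_values(predict, x, background)
print(phi)
```

For a linear model each Shapley value equals the weight times the deviation of the feature from its background mean, and the values sum to the difference between the prediction for x and the average background prediction.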

The clinical challenge for the MUQUBIA algorithm was to distinguish between the different types of dementia. Because CDR is a clinical score collected by clinicians during the assessment process to differentiate the healthy from the dementia state, we also evaluated the performance of our model without this scale in the feature set (Supplementary Fig. S2) to avoid circularity and minimize potential bias in favor of CN classification.

Statistics

Differences in the variances of the feature distribution of each diagnostic class between the original data set and the data set with imputed medians were assessed using the Brown-Forsythe test. Differences in sociodemographic, clinical, neuropsychological and morphological feature distributions among diagnostic groups, as well as inter- and intra-cohort differences, were assessed using the Kruskal–Wallis test for continuous variables and the Chi-squared test for dichotomous variables. Post-hoc analyses were performed to test differences between the four diagnostic groups by pairwise Wilcoxon rank sum tests for continuous variables and pairwise comparisons of proportions for dichotomous variables. The p-values of the post-hoc analyses were adjusted with the Benjamini–Hochberg correction. To compare the neuropathological multilabel evidence with the MUQUBIA results, the Label Ranking Average Precision (LRAP) metric was calculated. Similarity between test and train ROC curves was assessed using DeLong’s test. All statistical analyses were performed using R version 3.6.3, and the significance level was set at 0.05 for all tests.
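The Benjamini–Hochberg adjustment applied to the post-hoc p-values can be written in a few lines. The study ran its statistics in R; the Python sketch below only illustrates the procedure, and the four raw p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical pairwise-comparison p-values
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone in the rank order.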

Pipeline availability

The single subject classification tool based on the MUQUBIA models was also made publicly available through the neuGRID platform (https://neugrid2.eu)21,52,53, an on-line high-performance computing (HPC) infrastructure that provides source code, tools, and data for image processing and ML analysis (see Supplementary Fig. S5).

Results

Subjects

The final data set included 506 subjects: 110 AD, 135 FTD, 153 DLB and 108 CN. Demographic, clinical, neuropsychological, and ApoE information are shown in Table 2. Only neuropsychological tests that followed the same protocol in all 5 data sets were considered.

Table 2 Group characteristics.

Feature set and sanity check

Image processing yielded a total of 336 features: 202 from FS (including 132 volumes and 70 cortical thickness values); 2 from LPA (WM lesion volume and WM lesion number); 36 from TRACULA (18 FA, 18 MD values for WM pathways); 96 features from the application of the JHU atlas ROIs to the FA and MD maps. The full list of features is reported in Supplementary Table S4.

Table 3 reports the number of outputs deemed acceptable after visual inspection for each pipeline and diagnostic group, as well as the consistency of the success rate for each pipeline in the 4 diagnostic groups.

Table 3 Number of correctly processed images and success rate of image processing after visual inspection.

MUQUBIA algorithm

The training sample of MUQUBIA included 354 subjects, while 152 subjects formed the test group. The best hyper-parameters among those tested with the GridSearchCV function (i.e. kernel: linear, polynomial, sigmoid, radial basis function (RBF); C: 1, 10, 100, 1000, 10,000; γ: 0.1, 0.01, 0.001, 0.0001, 0.00001) were the RBF kernel, C equal to 1000, and γ equal to 0.0001. The entire analysis, consisting of image processing and classification of the subjects, requires 10 h on a machine running Ubuntu Server 18.04 LTS with a Sun Grid Engine scheduler, equipped with 1300 GB RAM and 214 cores. Most of this time is spent on image processing.

The algorithm selected 24 features, but two of them were discarded because of a VIF above 5, namely the fractional anisotropy of the left retrolenticular part of the internal capsule and the left postcentral thickness. Figure 3 shows the imaging features selected by the bidirectional selection process implemented in MUQUBIA. The 22 features composing the best set are listed in Table 4. The features were ranked from highest to lowest importance in distinguishing the four diagnostic classes. The best set was composed of CDR, 19 MRI features, age and gender. The influence of age and gender on the MRI features was assessed and the results are reported in Supplementary Table S5. Across all diagnoses, CDR was the most important feature. The results of the Kruskal–Wallis test showed that the diagnostic groups differed significantly with respect to the selected variables. Post-hoc analyses revealed p-values below 0.05 in at least one comparison for all the features.

Figure 3

Representation of brain regions corresponding to imaging features selected by MUQUBIA to distinguish the different diagnostic classes (AD, DLB, FTD, CN). The color of each brain region reflects the ability of the corresponding feature to discriminate among the different classes (averaged mean Shapley value). Acronyms: L, left; R, right.

Table 4 Best set of features selected by MUQUBIA.

The Brown-Forsythe test always yielded a p-value greater than 0.05 (Supplementary Table S7), indicating that the original variance of the data set was not altered by median imputation.

SHAP analysis

Figure 4 shows the average influence of the features on the prediction of each diagnosis. The CDR values have the greatest influence, especially for the classification of CN and AD, whereas, among the other features, the FA of the left corticospinal tract influences the classification of the DLB and AD groups the most.

Figure 4

Contribution of each feature to the classification, represented by the mean Shapley magnitude values. The graph shows the importance of each variable for each diagnostic group. Acronyms: AD, Alzheimer’s Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Body; CN, Cognitive Normal; FA, Fractional Anisotropy; MD, Mean Diffusivity; LH, left hemisphere; RH, right hemisphere.

The global interpretability plot (Fig. 5) shows whether a feature shifts the MUQUBIA prediction toward other diagnostic classes and the relative contribution of each feature. All feature values in the plot are standardized. Focusing on the CN class, low values of CDR have a very high impact on the determination of this diagnosis. High values of temporal ROIs (left hippocampal volume and left entorhinal thickness) also have a high influence, as does a low MD value of the right medial lemniscus. Other MRI measures do not provide simple or practical information on how they influence the MUQUBIA outcome. Atrophy of the left frontal pole, associated with an increase of MD in the right medial lemniscus and a decrease of FA in the fronto-occipital fasciculus, influences the prediction of the FTD class, in addition to the degeneration of the corticospinal tract. For the DLB class, the corticospinal tract represented an imaging biomarker of great importance, especially with a reduced value of FA, although this tract is not a classic biomarker for DLB. Other imaging biomarkers, such as preservation of MD in the retrolenticular part of the internal capsule and preservation of left cortical thickness (entorhinal and inferior parietal), have an impact on the classification of DLB patients. For AD, absent or moderate impairment of FA in the corticospinal tract and high CDR scores have a major impact on classification, followed by damage and shrinkage of some ROIs of the left hemisphere, such as the left superior fronto-occipital fasciculus, inferior parietal thickness, and entorhinal thickness. In general, age represents one of the most important factors for classification in all dementias.

Figure 5

Global interpretability plots for each diagnostic class. Each dot corresponds to a subject in the training set. The position of the dot on the x-axis shows the effect of that feature on the prediction of the model for that subject. If multiple dots land at the same x position, they pile up to show density. The features are ordered by the sum of the Shapley values. Colors are used to display the standardized value of each feature (colder colors represent lower values, warmer colors represent higher values). Acronyms: AD, Alzheimer’s Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Body; CN, Cognitive Normal; LH, left hemisphere; RH, right hemisphere; FA, Fractional Anisotropy; MD, Mean Diffusivity.

Additional information can be derived from the partial dependence plot of the main features (Fig. 6). This plot shows the marginal effect that two features have on the predicted outcome of MUQUBIA. Once the first feature was selected, the second was chosen automatically as the feature with the strongest interaction with the first one. Most of the plots show complex relationships between the two features and the Shapley values (Supplementary Fig. S7), which are discussed in more detail in the “Discussion” section.

Figure 6

SHAP partial dependence plots for each diagnostic class (AD, DLB, FTD, CN). Each subplot shows the marginal effect that two features have on the predicted diagnosis. Once the first feature is chosen, the second is selected as the feature with which the first interacts most strongly. The color of a dot indicates the value of the second feature. The color of each plot changes progressively from blue to red (or vice-versa) as you move along the axes. Colder colors represent lower values, warmer colors represent higher values of the second feature. Acronyms: AD, Alzheimer’s Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Body; CN, Cognitive Normal; LH, left hemisphere; RH, right hemisphere; FA, Fractional Anisotropy; MD, Mean Diffusivity.

Finally, to increase the interpretability and to understand potential problems of MUQUBIA we analyzed some correctly and incorrectly predicted subjects in Supplementary Fig. S3 and in Supplementary Fig. S4.

MUQUBIA performance on training set

The classification resulted in the following global metrics: accuracy 91.53%, macro-precision 91.62%, macro-recall 90.82%, macro-F1 score 90.92%, AUC 98.44%.

MUQUBIA performance on test set

The SVM classification task for the subjects in the test set (Fig. 7) resulted in the following global metrics: accuracy 87.50%, macro-precision 88.00%, macro-recall 88.36%, macro-F1 score 87.88%, AUC 97.79%. The DeLong test revealed no significant differences (p > 0.05) between the ROC curves of the training and test sets for each class. A summary of the performance metrics is provided in Table 5.

Figure 7

Confusion matrix and ROC curves of the test set. The AUC of each ROC curve for each diagnostic class against all others is reported in the legend. Acronyms: AD, Alzheimer’s Dementia; FTD, Frontotemporal Dementia; DLB, Dementia with Lewy Body; CN, Cognitive Normal; AUC, Area Under the Curve.

Table 5 MUQUBIA quantitative metrics for differential diagnosis in each diagnostic group of the test set.

Classification metrics obtained with MUQUBIA, trained with the same selected features but without CDR, are shown in Supplementary Fig. S2. Performance decreased slightly, especially in the case of CN. However, the classification task yielded the following global metrics: accuracy 84%, macro-precision 84%, macro-recall 84%, macro-F1 score 83%, AUC 96%.

MUQUBIA performance on neuropathological assessed subsample of the test set

Table 6 reports the LRAP value used to quantify the agreement between the MUQUBIA probability estimates and the National Institute on Aging and Alzheimer’s Association protocol54 for neuropathological assessment of 9 patients in our test group. The LRAP metric is classically used in multilabel ranking problems55. For each true label of a sample, it measures the fraction of labels ranked at or above it that are also true labels, and averages this fraction over labels and samples. The score obtained is always greater than 0, and the best score is 1.
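The LRAP computation can be sketched in plain Python as follows. The two patients, their pathology labels, and the class probabilities are invented for illustration. Skipping samples without any true label is a simplification of this sketch.

```python
def lrap(y_true, y_score):
    """Label Ranking Average Precision for binary label rows and score rows."""
    total, counted = 0.0, 0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            continue  # simplification: samples without true labels are skipped
        sample = 0.0
        for j in relevant:
            rank = sum(s >= scores[j] for s in scores)  # labels ranked at or above j
            hits = sum(scores[k] >= scores[j] for k in relevant)  # true ones among them
            sample += hits / rank
        total += sample / len(relevant)
        counted += 1
    return total / counted


# Columns: AD, FTD, DLB, CN (hypothetical patients).
pathology = [
    [1, 0, 1, 0],  # mixed AD and Lewy body pathology
    [0, 0, 1, 0],  # Lewy body pathology only
]
probabilities = [
    [0.60, 0.05, 0.30, 0.05],  # both true labels ranked on top: score 1
    [0.50, 0.10, 0.40, 0.00],  # true label ranked second: score 0.5
]
score = lrap(pathology, probabilities)
print(score)
```

Because the metric works on rankings, a patient with mixed pathology is scored as fully concordant when all confirmed pathologies receive the highest probabilities, even if none of them dominates.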

Table 6 MUQUBIA agreement with neuropathologic assessments.

MUQUBIA report

An example of the MUQUBIA report generated with the on-line tool on the neuGRID platform is available as supplementary material (Supplementary Fig. S6).

Discussion

In this work, we developed an automated ML algorithm based on multimodal MRI capable of discriminating the most common forms of dementia. The performance of this classifier was validated using quality metrics that resulted in high scores for accuracy, macro-precision, macro-recall, macro-F1 and AUC. The classifier was successful in discriminating between the 4 groups (AD, FTD, DLB and CN) characterized by different neuropsychological scores and ApoE expression (Table 2). The algorithm selected CDR, age, gender information, MRI-based diffusion metrics, volumetric and cortical thickness values as the best differentiating features.

SVM performance did not differ significantly between the test and training sets using the 22 informative features; performance on the training set was only modestly higher than on the test set, arguing against severe overfitting56.

In the test set group, MUQUBIA scored highest in discriminating CN from the others, with excellent discrimination performance for each diagnostic class. The lowest performance was in detecting the AD group. This could be due to the overlap with other types of dementia, especially DLB57. Neuropathological brains assessed by Montine’s criteria were also correctly classified by MUQUBIA with very good performance (LRAP = 95%).

The MRI features studied were appropriate to selectively distinguish AD, FTD, DLB and to differentiate them from cognitively normal aging. The neuroimaging features were extracted from the FS and TRACULA pipelines, so only the T13D and DTI sequences are mandatory to run the MUQUBIA algorithm. Optionally, the FLAIR can be used to improve the pial segmentation and to reduce segmentation errors caused by WM hyperintensities. The WM hyperintensity information extracted from the LPA does not seem to affect MUQUBIA, as this aspect is likely already captured by the DTI metrics as increased MD and decreased FA. It is known that WM hyperintensity may have an impact on the DTI metrics, although in the present study and in relation to the features selected by MUQUBIA, only the superior fronto-occipital fasciculus was weakly affected.

In addition to cortical/subcortical gray matter information, which has long been considered informative biomarkers, WM diffusion metrics have also been shown to be important for ML classification. These metrics appear to be useful in distinguishing AD from FTD17, and, albeit to a lesser extent, in distinguishing AD from DLB35.

The implemented data-driven MUQUBIA approach identified the best set of features, many of which were consistent with those described in the literature, while others were unexpected. For the benefit of the reader, the discussion of the results was organized according to the following 3 main macro-groups:

1. Clinical and socio-demographic features:

The CDR is a well-known test for detecting and assessing the severity of dementia58; it is therefore not surprising that it turned out to be the most informative feature in our model. Interestingly, the SHAP partial dependence plot (Fig. 6) shows that the probability of being classified as cognitively normal by MUQUBIA is greater when the CDR score is zero and the MD value of the medial lemniscus tract is low, indicating no degeneration. Higher values of MD may instead progressively reduce the weight of the (non-pathological) CDR score in classifying a person as cognitively normal. This could be very promising information, especially for secondary prevention, which, by combining multimodal ad hoc biomarkers, would allow more accurate, sensitive, and earlier stratification of individuals at the pre-dementia stage than using CDR alone59. As expected, the MUQUBIA model without CDR performed worse in the classification of CN, but also in AD, DLB, and FTD, confirming the importance of CDR also in the classification of dementia groups, as explained by the Shapley values (Fig. 4).

In addition, although neurological diseases are often assumed to affect only the elderly, this is not always the case. According to the Shapley analysis, younger individuals belonging to the CN class are more likely to drop out (Fig. 5). The younger age of the FTD group must also be taken into account to explain possible brain imaging deviations and possible errors of our model.

Interestingly, the literature reports that DLB is associated with a male preponderance60, and this was also observed in our DLB group. Finally, when classifying AD subjects, MUQUBIA seems to be strongly influenced by the degeneration of the left corticospinal tract, which is more pronounced in women than in men.

2. Cortical and sub-cortical features:

DLB is associated with less global atrophy than AD, whereas posterior cingulate atrophy was similar in AD and DLB. AD patients showed more atrophy of the medial temporal lobe structures compared to DLB61. Hippocampal atrophy was not limited to the AD and DLB groups, but has also been noted in FTD, although to a lesser extent than in AD62. Conversely, FTD patients showed greater atrophy of the temporal pole and orbitofrontal areas than AD patients, while AD patients showed greater atrophy of the posterior cingulate and inferior parietal regions63. In our study, no significant differences were found between DLB and CN with respect to the temporal pole, inferior parietal and orbitofrontal areas.

Consistent with the literature, we found that the putamen volume in AD was intermediate between CN and FTD, with more atrophy in the latter64. DLB showed volumetric atrophy in the putamen65, with a moderate influence in the MUQUBIA model, and a slight influence in other basal ganglia such as the left pallidum66. Even in FTD, where there is limited and conflicting evidence in the literature regarding the volumetry of deep gray matter structures, our results tend to confirm the findings of Möller et al. with respect to the basal ganglia. They show that FTD patients are characterized by the most severe atrophy compared with the other diagnostic groups, and that atrophy of the pallidum contributes to the classification of FTD patients in the MUQUBIA model. Further specific efforts will be needed to clarify this point in future studies.

Surprisingly, the volume of the left frontal pole was highest in FTD and differed significantly from all other patients examined in this study. This can be partly explained by the fact that the FTD group was approximately a decade younger than the other groups. Consistent with the literature, patients with AD had smaller volumes of the frontal pole, isthmus of cingulate and left pars opercularis67 compared with CN subjects.

Cortical thickness was a sensitive and comprehensive marker to distinguish AD from other dementias. Cortical shrinkage of the left entorhinal cortex has been reported to be greater in AD than in DLB68, but similar in AD and FTD69. Left inferior parietal thickness, also greater in FTD, proved to be a robust marker to disentangle AD from FTD for MUQUBIA70.

Moreover, the SHAP partial dependence plot (Fig. 6) showed that MUQUBIA classifies patients as AD when a concomitant reduction in left inferior parietal thickness is associated with a reduction in total left cortical volume, which has been linked in previous studies to a decrease in semantic fluency71. Likewise, the SHAP partial dependence analysis (Fig. 6) revealed that MUQUBIA tends to classify patients in the DLB class when they exhibit lower total left cortical volume and a reduction in left pars opercularis thickness. This observation aligns with the existing literature, which links speech fluency impairment to these important regions in DLB72,73.

3. DTI features:

FA of the left corticospinal tract was lower in AD than in CN74. Degeneration of the corticospinal tract has also been described in FTD75. In contrast, there is no clear evidence in the literature of damage to this tract in the DLB group76, although this tract had a major effect on MUQUBIA. Possible explanations may be found in the larger group size used in our study than in other studies and in the quality of the DTI pipeline and scans we used to quantify the DTI metrics.

FA of the splenium of the corpus callosum and the superior fronto-occipital fasciculus was lower in AD than in CN72, although the lowest FA values of these pathways occurred in DLB. DLB also showed lower FA values than all other groups in many other pathways and ROIs77. Consistent with the literature, DLB showed higher MD in brainstem areas78, such as the pontine crossing tract, compared to CN. Other imaging biomarkers, such as the preservation of the retrolenticular part of the internal capsule, influenced MUQUBIA toward DLB classification. This is consistent with the fact that motor and sensory fibres run through this ROI79 and must remain intact to prevent dysphagia and swallowing dysfunction. FTD and AD were the most affected groups in the right retrolenticular part of the internal capsule80. The medial lemniscus MD proved to be the third most important feature for classifying FTD patients in MUQUBIA. As previously mentioned, FTD was characterized by degeneration of the corticospinal tract81, similar to AD. The SHAP partial dependence plot (Fig. 6) for the FTD class also revealed that MUQUBIA finds a direct relationship between left corticospinal tract FA and right medial lemniscus MD values, indicating a specific form of frontal neurodegeneration. Finally, the correlation between these two tracts could confirm interesting findings on the detection of subtypes of frontotemporal lobar degeneration82.
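To make the Shapley-based reasoning used across the three macro-groups more concrete, the sketch below computes exact Shapley values for a deliberately tiny, hypothetical two-feature model. The scoring function `toy_cn_score`, its coefficients, the normalized MD scale, and the baseline are all assumptions made for illustration; this is not the MUQUBIA model, nor the approximation strategy used by the SHAP library. It only shows how the contribution attributed to a normal CDR can shrink as MD increases.

```python
from itertools import permutations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating feature orderings.

    Features not yet added in an ordering keep their baseline value.
    """
    names = list(x)
    phi = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            phi[name] += updated - previous  # marginal contribution
            previous = updated
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in phi.items()}

def toy_cn_score(f):
    """Hypothetical 'cognitively normal' score (NOT the MUQUBIA model).

    A CDR of 0 raises the score, but less so as normalized MD (0..1) grows.
    """
    cdr_is_normal = 1.0 if f["cdr"] == 0 else 0.0
    return 0.2 + 0.6 * cdr_is_normal * (1.0 - f["md"])

baseline = {"cdr": 1, "md": 0.5}   # hypothetical reference subject
low_md = {"cdr": 0, "md": 0.1}     # CDR normal, tract preserved
high_md = {"cdr": 0, "md": 0.9}    # CDR normal, tract degenerated

print(shapley_values(toy_cn_score, low_md, baseline))
print(shapley_values(toy_cn_score, high_md, baseline))
```

Exhaustive enumeration is only feasible for a handful of features; it is used here because it keeps the definition visible, not because it reflects how the analysis in this study was run.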

Benefits from MUQUBIA

Recently, the number of studies using ML has steadily increased because ML enables a fully data-driven and automated approach. ML is flexible in discovering patterns and complex relationships and in predicting unobserved outcomes in data, given a sufficient number of observations83, especially as complexity increases, where classical statistical methods may be rather ineffective84.

Research studies often address the binary classification between two clinical conditions (e.g., AD vs. CN; FTD vs. CN; FTD vs. AD), but this does not reflect the reality of the clinician, who needs to make a diagnosis considering multiple neurodegenerative diseases at the same time. Although the field of neurodegenerative diseases has been extensively researched85, to our knowledge, few studies have implemented an MRI-based ML algorithm for the classification of AD, FTD, DLB and CN56,86,87, and to date, no study has used DTIs and multimodal analyses simultaneously. MUQUBIA is the first ML algorithm for differential diagnosis to use DTI together with T13D and FLAIR on a very robust sample size. In fact, Klöppel et al. recruited a small group of FTD and DLB patients, whereas Koikkalainen et al. and Tong et al. included a broader range of dementias (such as vascular dementia and subjective memory complaints), but still with fewer subjects per group and with worse performance compared with MUQUBIA (Klöppel et al.: accuracy of 65%; Koikkalainen et al.: accuracy of 70.6%; Tong et al.: accuracy of 75.2%). Moreover, Tong et al. used CSF biomarkers, which require an invasive procedure such as lumbar puncture and are difficult to obtain in a large population. This could also limit applicability in daily clinical practice in hospitals compared with the data needed as input to MUQUBIA. Many advanced research frameworks recommend the analysis of amyloid, tau, or 18F-fluorodeoxyglucose positron emission tomography (PET) scans of the brain and CSF to better classify patients88. However, the cost of these procedures may limit their actual utility, and they are not available in the normal clinical setting. MUQUBIA requires routinely available MRIs, a clinical test, and a few demographic variables, so it can be considered widely applicable without incurring excessive costs and burdening patients unnecessarily.

The online MUQUBIA tool does not require manual or “a priori” preprocessing, and the end-user does not need to have prior knowledge of the algorithm, although a quality check of the ROI segmentation is always advisable.

In addition, experienced neuroradiologists are often not available in routine clinical practice outside of a specialized memory clinic, so an automated method capable of extracting and interpreting the information with high precision would be of great clinical value.

A strength of this study is that the DTIs followed heterogeneous acquisition protocols, e.g., the number of gradient directions varied from a minimum of 19 (low) to a maximum of 114 (high). The FLAIR and T13D parameters also differed, bringing this study closer to a real-world clinical scenario.

Limitations and future developments

We have considered various types of neurodegenerative diseases, which account for a large proportion of dementia cases, but this approach to differential diagnosis is far from complete. We did not attempt to define subtypes, such as posterior cortical atrophy in AD, the language or semantic variant in FTD, or psychiatric and delirium onset in DLB. This study has limitations related to a partial influence of age and gender on certain MRI features, particularly in the FTD and DLB groups. In fact, the FTD group is the youngest, with an average age of onset of 56 years, while AD and DLB occur later9. The DLB group instead showed a preponderance of males. These confounders could help the classifier identify these groups more easily, and additional experiments should be performed to rule this out. The fact that inter-cohort variability was lower than intra-cohort variability suggests that the effect of the etiology of dementia on MRI features is more important than the potential bias induced by heterogeneous acquisition protocols. Still, the classifier might be further improved by minimizing the “center-effect” and reducing the few differences observed89.
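One simple way to compare inter-cohort and intra-cohort variability for a single feature is a variance decomposition, sketched below. This is a minimal illustration under our own assumptions and is not necessarily the procedure used in this study; the cohort names and thickness values are invented for the example.

```python
from statistics import mean, pvariance

def variability_split(cohorts):
    """Split the total variance of one feature into inter- and intra-cohort parts.

    `cohorts` maps a cohort name to the list of feature values measured there.
    Returns (inter, intra); their sum equals the variance of the pooled values.
    """
    pooled = [v for values in cohorts.values() for v in values]
    grand_mean = mean(pooled)
    n = len(pooled)
    # Variance of cohort means around the grand mean, weighted by cohort size.
    inter = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in cohorts.values()) / n
    # Size-weighted average of the variance inside each cohort.
    intra = sum(len(v) * pvariance(v) for v in cohorts.values()) / n
    return inter, intra

# Hypothetical cortical thickness values (mm) from three acquisition centers.
example_cohorts = {
    "center_a": [2.1, 2.5, 2.9, 2.3],
    "center_b": [2.2, 2.6, 3.0, 2.4],
    "center_c": [2.0, 2.4, 2.8, 2.6],
}

inter, intra = variability_split(example_cohorts)
print("inter-cohort:", inter, "intra-cohort:", intra)
```

When the inter-cohort term is small relative to the intra-cohort term, as in this invented example, differences between centers explain little of the spread in the feature compared with differences between subjects.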

Future efforts will aim to speed up processing with new tools, such as FastSurferCNN, that exploit deep neural networks and graphics processing units to reduce image preprocessing time to minutes.

Finally, due to difficulties finding datasets that contained multimodal and multiclass data, this study lacked a complete independent validation data set, but in the future, MUQUBIA should be validated with independent data sets given the upcoming Big-Data era.

Conclusion

The fully automated classifier developed in this study can discriminate between AD, FTD, DLB and CN with good to excellent performance. Our ML classifier can help clinicians as a second opinion tool to better diagnose the different forms of dementia based on routine and cost-effective biomarkers such as age, gender, CDR and automatically extracted MRI features. It is important to point out that the interpretability and explainability of ML methods provide important clues, make it possible to go beyond the slogan “ML is a black-box”, and lead to the discovery of new informative data-driven candidate biomarkers.