Introduction

While depression and related symptoms are a common mental health problem in older people, late-life depression (LLD) is underdiagnosed and undertreated1 and has been associated with cognitive deterioration and dementia2,3. Brain structural changes in LLD have been observed with magnetic resonance imaging (MRI)4 and histology5,6. Regarding white matter (WM) changes, diffusion weighted imaging (DWI) and its main application, diffusion tensor imaging (DTI) has revealed significant alterations in patients with LLD, compared to non-depressed healthy controls7,8 and these WM changes may precede the onset of depression9. However, variability in the results exists10,11,12 while distinct neuroanatomical dimensions based on MRI have been identified in LLD when large scale data are analyzed8. Therefore, it is important to characterize and better understand the white matter alterations in LLD, in order to assist with correct diagnosis and development of targeted and more personalized treatments.

Machine learning (ML) is receiving a growing interest in neuroimaging literature and is continuously used for classification purposes in a variety of conditions including developmental, neurocognitive and psychiatric disorders13. However, studies on ML methods applied to neuroimaging in LLD are sparse and have utilized T114, functional MRI (fMRI)15 and multimodal MRI8,16,17. While image segmentation is a key step in brain imaging analysis, segmentation of the WM based on multiple DTI contrasts and atlases has never been reported in LLD, to the best of our knowledge.

In this study, we aimed to assess WM changes in LLD using a framework for DTI segmentation not previously used in this population. We then aimed to develop a ML model based on the segmentation output, to automatically diagnose LLD and never depressed individuals. The discrimination performance of the model was evaluated with a variety of measures and the statistical power of the classifier was tested.

Results

The demographic and clinical characteristics of patients with LLD and HC are shown in Table 1.

Table 1 Demographic and clinical variables.

Group differences in DTI

Differences in all DTI measures are presented in Supplementary Table S1. The differences in Fractional Anisotropy (FA) volume were widespread, particularly the fornix, fornix-stria terminalis, internal capsules, left cerebral peduncle, corticospinal tracts, cerebellar regions, superior temporal gyrus, cuneus and the cingulum, while for mean FA, trace, axial diffusivity (AD), and radial diffusivity (RD) the differences were mainly observed in medulla, cerebellum, and midbrain. However, the significant P-values from the Mann–Whitney test did not survive after correction for multiple comparisons. Regarding the gender differences between the groups, a correlation analysis was performed to test for an association between predicted class and gender. For each of the 30 classification iterations and DTI metrics, the mean Pearson correlation coefficient (obtained after averaging transformed r to z-values and then transformed back) was r = 0.2, suggesting a weak correlation. Further, to assess for gender bias in the model, we ran the classification selecting gender as the prediction class. We found that DTI features failed to predict the gender. (Supplemental Fig. S4).

Classification performance and classifier significance

Classification performance with each WM measure is shown in Table 2 and plotted along with 95% confidence intervals in Supplemental Fig. S2. The classifier successfully discriminated between LLD and NC using trace (balanced accuracy = 71%, ROC-AUC = 0.81) and RD (balanced accuracy = 70%, ROC-AUC = 0.80). The most important discriminative WM regions are shown in Fig. 1. The following regions were important with both indices: the left fornix, right fornix stria terminalis, left thalamus, left substantia nigra, left external capsule, left medulla, left anterior limb of internal capsule, left midbrain, right cuneus, right insular, right caudate nucleus, right and left hypothalamus and cerebellar regions. The corpus callosum, the internal capsule, globus pallidus, and cerebral peduncles were important features only for the classification with trace, while the cuneus and the superior longitudinal fasciculus with RD. Interestingly, fornix was the most important structure for classification with both trace and RD (Fig. 1). Classification using all features as the input revealed a statistically significant model (ROC-AUC = 0.78, p = 0.045 and balanced accuracy = 67%, p = 0.044) (Suppl. Fig. S5) predicting the classes with performances close to Adaptive Boosting (AdaBoost) and Gradient Boosting (GBoost) (Table 2). Similar performances as the AdaBoost were obtained with the Gradient boost classifier (Table 2). Interestingly, the two algorithms shared 12 out of 20 most important features for the classification with both trace (Fig. 1a and Suppl. Fig. S7) and RD (Fig. 1b and Suppl. Fig. S8). The performance of Support Vector Machine (SVM) was low (Table 2).

Table 2 Classification performance by the five DTI measures separately and all features together with three algorithms.
Figure 1
figure 1

Shown in descending order are the twenty highest trace (a) and RD (b) relative feature importance determined as the normalized weighted sum of the number of times a given feature is used to split the data in the ensemble. The higher the frequency of a feature being used for splitting, the higher its importance. For example if the fornix has a feature importance of 0.2 it means fornix has a relative importance of 0.2 or a 20% in the ensemble (averaged over 30 repetitions of classification) compared to the other features. Gm gray matter.

In the analysis of classifier’s statistical significance, the ROC-AUC scores obtained with permuted labels were significantly lower than those made with the actual data, using both trace and RD indices (permutation-based P = 0.022 and P = 0.024, respectively) (Fig. 2). Similar results are obtained with balanced accuracy (Fig. S3). This demonstrates that the value of the error in the actual data is small, the prediction accuracy is significantly higher than chance, and the classifier is statistically significant.

Figure 2
figure 2

Distribution of ROC-AUC scores (shown in green) obtained with the AdaBoost classifier using cross validation after 1000 label permutations, compared with the score obtained with the actual data (dashed red line) for trace (a) and RD (b), respectively. Differences are considered significant at the 95% confidence interval level.

Discussion

In this study, we applied for the first time a multi-contrast, multi-atlas method for automatic DTI segmentation combined with the AdaBoost classifier to classify LLD and HC subjects.

The main findings of our work are: (1) using the trace index, the classifier reached a classification balanced accuracy of 71% and a ROC-AUC of 0.81; (2) using the RD index the classifier reached a balanced accuracy of 70% and a ROC-AUC of 0.80; (3) using permutation label testing with cross validation it was found that the classifier reached the above diagnostic performances not by chance (permutation-based p ≤ 0.05, for both indices). Interestingly, fornix was the most important structure for classification by both indices.

A set of WM structures was found to be important in the classification by trace and RD in our study, suggesting that LLD may be characterized by a widespread axonal injury (i.e., trace, RD) and/or demyelination (i.e., RD) in limbic (fornix, uncinate fasciculus, hypothalamus), frontopontine (internal capsule, cerebral peduncle), thalamo-cortical projection fibers (thalamus), fronto-striatal (caudate, external capsule), commissural fibers (corpus callosum), subcortical nuclei (substantia nigra, midbrain), brainstem and the cerebellum. In our study AdaBoost and Gboost outperformed SVM. This can be attributed to the data, the algorithms’ properties and modeling. Classification using all features together led also to a significant model with similar results to AdaBoost and GBoost, although feature importance were more scattered (Suppl. Fig. S6). This is not surprising as DTI indices are complementary in nature and the number of features is now dramatically increased (curse of dimensionality). Significant differences were also found between the groups in non-parametric statistical testing, but did not survive after multiple comparison correction, which can be attributed to factors such as the high number of tests performed and magnitude of the effects.

The literature on ML and DTI in LLD is limited16,17. Patel et al.16 used multimodal MRI data and the Alternating Decision Tree algorithm (an ensemble classifier, similar to AdaBoost) to classify 33 LLD and 35 non-depressed individuals and reported an accuracy of 87.3%. The authors suggest that global imaging measures (atrophy and global WM hyperintensity load) and non-imaging features (age and Mini-Mental Examination) are best predictors of diagnosis. In the study of Stolicyn et al.17 with 40 LLD cases and 40 controls using average FA and MD measures extracted for 19 bilateral and 5 unilateral tracts derived by TBSS and three classification models, the best classification accuracy achieved was 61.25% with MD features and the SVM classifier with optimized hyper parameters. Our study has focused on machine learning classification from an advanced DTI segmentation and the accuracy reached was 76% using both trace and RD indices (Table 2). Our results compare to accuracies reported in the recent review on ML classification in major depression using DWI measures, where they vary from 57 to 91.7%18.

Most of the studies in LLD with DTI in 3 Tesla have used voxel-based analyses (e.g., tract based spatial statistics-TBSS), Tractography, and ROI methods, and have mainly focused on differences between groups and in specific indices (i.e., FA and MD)7,19. Each of these methods carries drawbacks, such as operational burden, variability and error in manual ROI placement, fiber crossings in deterministic and complexity in probabilistic Tractography, as well as challenging investigation of the peripheral WM in voxel-based analysis. Furthermore, many predictions based on MRI variables have been made by univariate measures which reveal a moderate effect20. The segmentation framework used in our study allows high registration accuracy and accurate segmentations of the superficial WM, an area that is difficult to appreciate if population-averaged atlases are used21 as in voxel wise DTI analyses. In our analysis, we moved from a voxel-by-voxel type of analysis, where each of the hundreds of thousands of voxels is tested individually (lowering the statistical power) to a structure-by-structure one, with only 146 anatomically relevant imaging structures covering the whole brain WM and trained an ensemble classifier for diagnostic classification.

We found widespread diffusivity alterations within various anatomical structures as important for LLD diagnosis, and fornix was the most important structure. Based on MRI studies, many underlying circuits have been proposed to be pivotal in LLD, yet direct mechanistic links are missing. Our findings follow earlier studies. Specifically, limbic and frontal-subcortical circuitry disruption have been hypothesized in LLD22,23. Furthermore, brainstem nuclei have been involved in LLD24 and this is supported by pathological findings of neuronal loss in brainstem nuclei (e.g., raphe nucleus) and presence of Lewy bodies in subcortical nuclei (e.g., substantia nigra)6,25. Reduced FA and increased RD in the fronto-subcortical and limbic tracts (i.e., fornix and uncinate fasciculus) superior longitudinal fasciculus, and corpus callosum have been previously reported in LLD26. Another study found that MD was found to be increased in the fornix of patients with LLD compared to controls27. In a large sample from the UK Biobank Imaging Study, MD in anterior thalamic radiation, inferior fronto-occipital fasciculus, uncinate fasciculus, superior thalamic radiation, cingulate gyrus part of cingulum, and middle cerebellar peduncle has been associated with depressive symptoms in older individuals28. In an analysis on Alzheimer’s disease Neuroimaging Initiative data, the presence of subclinical depressive symptoms was associated with lower WM integrity mainly in the fornix, posterior cingulum, corpus callosum and inferior longitudinal fasciculus29. Another study showed that increased anatomical connectivity predominantly in a fronto-limbic network, defined by DTI probabilistic tractography predicted depression with 91.7% accuracy using SVM30. WM structures associated with subcortical gray matter nuclei (i.e., thalamus, caudate) insula and precuneus were found to be important in our study, which is in line with other studies. In particular, thalamic volume reductions were found to be significant in the meta-analysis of MRI studies in LLD31. Similarly, caudate nucleus32,33 and insula volume34 were found to be significantly lower in LLD. From a functional connectivity (FC) perspective, in the study of Lin et al.15 a diagnostic accuracy over 85% was achieved with the superior frontal gyrus, left insula, and right middle occipital gyrus using resting state (rs) fMRI and convolutional neural networks analysis. Increased right anterior insula-right dorsolateral prefrontal cortex rs-FC35, as well as altered fronto-cerebellar connectivity36 have been reported in older depressed adults with apathy. Another study found an increased FC of the left precuneus in patients with LLD compared to controls37.

Our study has the limitations of small sample and many independent variables and a main concern in this context is the risk of overfitting. We have taken actions to deal with this issue that are feasible for the data characteristics and first was the selection of the algorithm. AdaBoost combines a series of weak classifiers in order to build a more robust final classifier/prediction. It acts preventively to overfitting as it inherently performs a soft feature selection and iteratively adjusts the class prediction weights diversifying the data presented to the next cross validation iteration. By using stage wise additive modeling, AdaBoost slows down overfitting by optimizing certain parameters for the next iteration, while the rest from the previous iteration is held fixed (similar to a regularization procedure). The construction of simple base learners and the restricted use of 50 estimators, mitigates the influence of each individual learner, promotes efficient learning from imaging patterns in the data and prevents excessive learning from the training data (overfitting) resulting in a less biased model. This is further ensured by the use of stratified sampling to permit equal distribution of the classes in each cross validation fold. The use of k-fold cross validation creates models that have been tested on data unseen during the training. Even after all the above actions, a relative degree of overfitting cannot be excluded and future studies with larger samples will allow further investigation and accounting for this issue. It should be noted that the classifier has shown substantial improvement in the classification performance in atlas-based analyses38. Another limitation is that the model was not tested in an independent sample. To control this, we used cross validation testing the classifier on a subsample not used during the training; we also performed a permutation test to assess the statistical significance of the developed model. Evaluating our model given the sample characteristics is challenging. In this regard first we tried to control biases in the model (data normalization, stratified sampling Adaboost learning). We evaluated our model using k-fold cross validation and suitable performance measures along with their 95% Confidence Intervals. Importantly, we evaluate statistical significance using permutation testing. Additional classifiers and type of analysis were utilized to further investigate the feasibility of our study. We were able to create a valid model that performs consistently well across evaluation measures and within family of algorithms, and not by chance. The unbalanced data and differences in gender are limitations in our study. In this context we used robust methods for unbalanced data that permit a balanced representation of the two classes (stratified sampling) and combined with the classifier’s ability to focus on the misclassified cases allows effective capturing of the patterns and subtleties of the minority class improve the classifier’s ability to discriminate between unbalanced data. Regarding the gender differences, our model showed a small relationship between gender and DTI features and that it is not biased by gender (Suppl. Fig. S4). Another limitation is that the patients were medicated.

In conclusion, employing a multi-contrast, multi-atlas framework for DTI segmentation for the first time in LLD, to train and test the AdaBoost classifier, we suggest that trace and RD indices within structural networks involving the limbic, cortico-basal ganglia-thalamo-cortical loop, the brainstem, the external and internal capsules, corpus callosum and the cerebellum, are promising features in the diagnostic classification of LLD and HC subjects. The results need further validation and encourage the anatomical characterization of LLD using larger samples, as well as the combination of the adopted methods with other imaging, clinical, historic and environmental variables to develop stronger diagnostic models, evaluate interventions, and inform targeted treatments for a complex and heterogeneous mental disorder.

Methods

Participants

We recruited 26 consecutive patients from the Eginition hospital’s psychogeriatric unit. Inclusion criteria were age > 55 years, a DSM-IV-TR diagnosis of major depressive episode (single episode or recurrent) and no cognitive impairment, based on clinical criteria and a MMSE39 score ≥ 28. Depression was measured with the 15-item geriatric depression scale40. Exclusion criteria were presence of psychosis, suicidal ideation, a history of neurological/psychiatric condition (except depression), delirium, sensory deficits, alcohol/drug abuse, malignancy, and patients with MR incompatible implants and claustrophobia. All imaging data were reviewed by a neuroradiologist (GV) to identify unexpected lesions and by a medical physicist (EK) to identify participant or MRI-related artifacts. We also recruited using word of mouth 12 healthy controls (HC) matched for age, education and MMSE scores based on the same exclusion criteria.

DTI and white matter segmentation

All participants underwent brain MRI in a 3 Tesla whole-body MRI scanner (Philips Achieva TX, Best, The Netherlands) equipped with an 8-channel head coil using the same imaging protocol. Imaging protocol included: (1) a high-resolution 3D axial T1-weigthed turbo field echo SENSE imaging (TE = 3.83 ms, TR = 8.31 ms. Flip angle = 8°. Field of view: 230 × 140 × 182 mm. In plane matrix size = 336 × 336 mm. A total of 200 slices with 0.7 mm thickness and no gaps covered the whole brain); ii) a T2 weighted dual turbo spin echo SENSE axial imaging (TE = 10.11 ms and 96 ms. TR = 3000 ms. Flip angle = 90°. Field of view: 240 × 144 × 210 mm. In plane matrix size = 256 × 256 mm. A total of 96 slices (2 × 48) with 3 mm thickness and no gaps covered the whole brain);and iii) for DTI imaging, a single-shot EPI sequence with SENSE parallel imaging (reduction factor 2.5). Imaging parameters were repetitiontime ≈ 7200 ms, echo time ≈ 74.5 ms, flip angle = 90°. The imaging volume for each subject included 60–70 axial slices of one bmin = 0 s/mm2 (b0) image, and 32 diffusion direction coding images with bm = 700 s/mm2, acquired parallel to the anterior commissure/posterior commissure line, with 2.2 mm isotropic voxel size and image matrix 96 × 96, zero-filled to 256 × 256 and field of view 212 × 212 mm. DTI was repeated twice to improve the signal-to-noise ratio.

All DTI datasets were automatically post-processed and segmented using MRIcloud (www.mricloud.org)41 a valid21,42 and reproducible43 framework running on Windows. Briefly, the images are corrected for head motion and eddy-current-induced distortions44; image corruptions are automatically detected and rejected pixel-wise45. The two DTI sequences are then combined to estimate the tensor and derived maps using multivariate linear fitting. For the mapping, whole brain WM parcellation is performed employing a fully automated multi-contrast, multi-atlas segmentation and label fusion framework46,47. In the current implementation, a library of 8 atlases (“Adult_168labels_8atlases_V1”) of healthy individuals (mean age: 29 years) is used, along with a paired parcellation label map of 168 anatomical structures segmenting the whole brain (see Appendix 1 in the Supplemental Material). The segmentation workflow is graphically described in more detail in Supplemental Fig. S1.

Image quantification and feature extraction

For the final image quantification, a threshold of FA > 0.2 was applied to remove the cortex while still preserving subject-specific anatomical features in these peripheral WM parts21. Of the 168 parcels originally segmenting the brain, 146 structures of interest were finally analyzed, in terms of FA volume (number of voxels with FA > 0.2), mean FA, diffusion trace (analogous to MD, as MD = trace/3), AD and RD. The ROI-Editor software was used for quantification48.

Statistical analysis

Between group differences in DTI parcellation for each WM measure were examined using a non-parametric Mann–Whitney test in SPSS Statistics for Windows, version 28.0 (IBM Corp. Armonk, N.Y. USA).

Machine learning analysis

In a typical ΜL analysis, an algorithm is empirically learning through an iterative training-and-test procedure using the available data to accurately classify unseen data. For our data AdaBoost49,50 was used. Specifically, the SAMME.R (Stagewise Additive Modeling) algorithm50 was employed with default parameters (number of estimators = 50, learning rate = 1.0, max depth = 1) as implemented in Sci-kit learn. All classification analyses were performed in Python 3.6.13 (https://www.python.org), Scikit-learn 0.17.0 (http://scikit-learn.org/stable/)51. The classification procedures are illustrated in Fig. 3. Before training, the data were standardized by zeroing the mean of each attribute and scaling to unit variance using StandardScaler. Based on our sample characteristics, a stratified 12-fold cross validation was used, so that all data were used for training and validation (test), while maximizing the inclusion of HC in the training set (Fig. 3). In cross validation, the data are divided into k non-overlapping subsets (folds) of roughly equal size that serve as training and hold out/test sets. Then, boosting is applied on k-1 subsets while the left-out fold is used for validation and test. The process is repeated for each of the k subsets and a mean performance is obtained after repeating the entire process 30 times to account for bias in the initiation of the classifier and cross validation splitting (Fig. 3). Apart from Adaboost, we tested GBoost52 also from the ensemble boosting family, as well as support vector machines53. Gradient boosting or gradient boosted decision trees algorithm builds an additive model (i.e., the residuals of the previous fit round becomes the input for the next consecutive classifier, on which the trees are built) by combining multiple models moving in a step-by-step manner against the negative gradient to reduce the loss, in order to capture the maximum variance within the data and ultimately to create a strong predictive model based on regression trees. The pipeline for GBoost classification remained similar as that for AdaBoost. An implementation of libsvm53, was used for the classification with Support Vector classifier (SVC), as a supervised learning algorithm implemented in with Scikit-learn. After the data are projected in a high dimensional feature space, the classifier finds the plane (“hyperplane”) corresponding to a radial basis function kernel that best separates the two groups based on measurements (support vectors) closest to that plane. For SVM classification, feature selection was applied using the k = 60 best features with the highest F-scores between two random variables in univariate ANOVA. More details on the machine learning analysis can be found in the Supplemental Material.

Figure 3
figure 3

Illustration of the machine learning procedure followed in this study.

Classification performance was evaluated in terms of mean accuracy and balanced accuracy, precision, recall (sensitivity), F1 score and ROC-AUC. Balanced accuracy is the arithmetic mean of sensitivity and specificity, as using accuracy only for model evaluation can bias towards overoptimistic results, especially with imbalanced datasets54. True positive rate (recall) and false positive rate are performance metrics useful for imbalanced class problems; ROC-AUC summarizes the trade-off between those two for every possible cut off, as the correlation between the class predicted by the classifier and the true class into which the case falls. ROC-AUC represents the power of the classifier measured in a scale that ranges from 0 (below chance performance) to 1 (perfectly accurate model) and 0.5 is random chance55. A combination of precision and recall is the F1-score.

$$ \begin{gathered} {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} \hfill \\ {\text{Balanced}}\;{\text{accuracy}} = \frac{{{\text{Sensitivity}} + {\text{Specificity}}}}{2} = \frac{{\frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} + \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}}}{2} \hfill \\ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} \hfill \\ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} \hfill \\ {\text{F}}1{\text{-score}} = \frac{{2{\text{TP}}}}{{2{\text{TP}} + {\text{FP}} + {\text{FN}}}} \hfill \\ \end{gathered} $$

TP is the number of positive samples predicted as positive. FP is the number of negative samples predicted as positive. TN is the number of negative samples predicted as negative. FN is the number of positive samples predicted as negative.

Statistical significance of the classifier

The classifier’s performance against chance was tested with a standard permutation procedure56,57 and ROC-AUC scores. This is a non-parametric approach in which the frequency distribution of a given performance metric (i.e., ROC-AUC) under the null hypothesis of independence is obtained, by randomly exchanging the labels (LLD or NC) associated with an instance. The entire training and test procedure is repeated multiple times using cross validation and an empirical P value is calculated by dividing the number of permutations resulted in a higher performance than that estimated with the actual sample by the number of permutations (i.e., 1000). If a significant association between the labels and WM features truly exists, then the average classification probability obtained after permutation is expected to be close to chance (i.e., around 50%). Permutation analysis was performed in Python.

Ethics approval

The study was conducted according to the latest version of the Declaration of Helsinki and approved by the National and Kapodistrian University of Athens ethics committee (file number: 275/2016.05.31. ΑΔΑ: 6ΘΣ346Ψ8Ν2-ΒΣΡ). According to the permission for the MRI experiment: “Subjects have been informed by the doctor, with any detail about the diagnosis and the nature of his/her conditions, the kind and purpose of the medical intervention, and they gave their written consent for their participation in the brain imaging analysis with MRI. They give permission to the doctor and his assistants to make all the medical interventions they judge are necessary for their good health”. No other ethical permission is applied.

Consent to participate

Written informed consent was obtained from all participants.