Introduction

Normative modelling is an emerging method for quantifying and describing how individuals deviate from the expected pattern learned from a population or large sample1. Recently, this approach has been applied to neuroimaging data to investigate a number of brain disorders, such as attention deficit hyperactivity disorder2, 3, autism spectrum disorder4, 5, schizophrenia3, 5, 6 and dementia7, 8. The procedure of normative modelling used in these studies has two steps: (i) first, statistical models are estimated to characterise the typical brain data from a reference cohort; (ii) then, the estimated model is applied to a target clinical cohort in order to quantify the variation (e.g. due to the effect of brain disorders).

Many statistical models have been proposed for normative modelling, including regression, support vector machines and Gaussian process modelling (for an extensive list, see Marquand et al., 2019). In Pinaya et al.5, we proposed a normative modelling approach based on the use of deep autoencoders to evaluate psychiatric patients. The use of a deep learning approach10, 11 enables models to learn multiple levels of representation of the intricate structure of the data and to identify the most important morphological characteristics of the healthy brain. In addition, in Pinaya et al.5, the models were able to detect deviations at the level of the individual, with patients with schizophrenia and patients with autism spectrum disorder presenting deviation values significantly higher than those of the healthy controls (HC).

Similar to psychiatric disorders, the clinical interpretation of magnetic resonance imaging scans can be challenging in the context of neurodegenerative disorders, as brain alterations may be difficult to distinguish from those related to healthy ageing. The identification of disease-related alterations can be particularly difficult in the early stages of a disorder. For this reason, there is growing interest in the development of methods for quantifying deviations of regional brain volumes that can discriminate between healthy and pathological ageing, with the ultimate aim of improving diagnostic and prognostic assessment of neurodegenerative disorders12. Here, we used the autoencoder normative method5 to evaluate the most common type of dementia in the elderly worldwide, Alzheimer’s disease (AD).

First, we trained the normative models using a large number of HC subjects (> 11,000 participants). Then, we assessed the performance of these models using data from patients with a diagnosis of mild cognitive impairment (MCI), the prodromal stage to AD, and patients with a diagnosis of AD. This assessment involved calculating the deviation, i.e. the extent to which subjects deviate from the norm, in five additional datasets composed of patients with MCI, patients with AD, and HC subjects. We had two main hypotheses. First, we hypothesised that the normative models would be robust and sensitive enough to create deviation values that reflect the severity of the brain anatomical alterations due to the disease, i.e. individuals with AD would deviate from normality more than those with MCI. Second, we hypothesised that the main brain regions driving the observed deviation would include the medial temporal cortex and the ventricular system, consistent with the results of previous neuroimaging studies of MCI and AD13, 14. Finally, we compared the performance of the normative approach against traditional classifiers to discriminate the patient groups from the HC group.

Methods

Datasets

In our analysis, we used six datasets: the UK Biobank15, the Alzheimer’s Disease Neuroimaging Initiative (ADNI)16, the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL)17, the Alzheimer’s Disease Repository Without Borders (ARWiBo)18, 19, the Open Access Series of Imaging Studies: Cross-Sectional (OASIS-1)20, and the Minimal Interval Resonance Imaging in Alzheimer's Disease (MIRIAD)21.

The UK Biobank is a study that aims to follow the health and well-being of 500,000 volunteer participants across the United Kingdom. From these participants, a subsample was chosen to collect multimodal imaging, including structural neuroimaging. Here, we used an early release of the project’s data comprising 11,034 HC participants. The inclusion criteria for the present study were: (a) subjects who had the data collected in the same MRI scanner (from the Cheadle centre), (b) age between 47 and 73 years. The only exclusion criterion was previous hospitalization associated with the diagnosis of mental and behavioural disorders, disease of the nervous system, cerebrovascular diseases, benign neoplasm of meninges, brain and other parts of the central nervous system, or injuries to the head. This study (UK Biobank project #40323) was covered by the general ethical approval for UK Biobank studies from the NHS National Research Ethics Service on 17th June 2011 (Ref 11/NW/0382). All methods were carried out in accordance with the approved guidelines and regulations. All UK Biobank participants provided written informed consent. More details about the dataset can be found elsewhere15, 22, 23, 24.

The ADNI consortium started in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner. Its goal was to verify whether different neuroimaging biomarkers and neuropsychological assessments can be combined to measure the progression of MCI and to study the development of AD. All ADNI participants provided written informed consent, and study protocols were approved by each local site’s institutional review board. All methods were carried out in accordance with the approved guidelines. Further information about ADNI, including full study protocols, complete inclusion and exclusion criteria, and data collection and availability can be found at http://www.adni-info.org/. All methods as stated on the website were performed in accordance with the relevant guidelines and regulations. In this study, we included the structural MRI collected during the ADNI GO, ADNI 2 and ADNI 3 phases. Similar to UK Biobank, we included only subjects aged between 47 and 73 years. The final dataset comprised 517 subjects, of whom 212 were HC, 159 were patients with early MCI (EMCI), 82 were patients with late MCI (LMCI), and 64 were patients with AD. In the ADNI datasets, participants were assigned to these MCI stages based on different levels of impairment on a single episodic memory measure, with the EMCI group showing milder episodic memory impairment than the LMCI group25, 26.

The AIBL dataset was developed to enhance the understanding of the pathogenesis of AD, concentrating on its early diagnosis (more details can be found in Ellis et al., 2009). Ethics approval for the AIBL study and all experimental protocols was provided by the ethics committees of Austin Health, St Vincent’s Health, Hollywood Private Hospital and Edith Cowan University. All experiments and methods were carried out in accordance with the approved guidelines and regulations and all volunteers gave written informed consent before participating in the study. Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The final group was composed of 346 subjects, where 262 were HC, 46 were patients with MCI (stage not known), and 38 were patients with AD.

The ARWiBo is a cross-sectional dataset including data from patients and controls enrolled at the Scientific Institute for the Research and Care of Alzheimer’s Disease [Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy]. A multidisciplinary team of neurologists, neuroscientists, image analysts, neurophysiologists, and geneticists is involved in the assessment of patients. As part of their assessment, participants undergo blood drawing (for APOE genotyping), clinical and cognitive evaluations as well as high-resolution MRI scanning (more details can be found in Frisoni et al., 2009 and Galluzzi et al., 2010). Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 319 subjects, including 215 HC, 67 patients with MCI (stage not known), and 37 patients with AD. Ethics approval for the ARWiBo study and all experimental protocols was provided by the local ethics committee and all participants signed an informed participation consent. All experiments and methods were carried out in accordance with the approved guidelines and regulations.

The OASIS-1 dataset is the result of a collaborative effort of investigators from a single acquisition site supported by the National Institute on Aging (NIA), the Howard Hughes Medical Institute, the Biomedical Informatics Research Network (BIRN) and the Washington University Alzheimer’s Disease Research Center (ADRC). This collaborative effort aimed to create a freely available MRI dataset for the wider scientific community. The original dataset consisted of a cross-sectional collection of subjects aged 18 to 96. It included participants over the age of 60 who had received a clinical diagnosis of very mild to moderate AD (for more information, please see http://www.oasis-brains.org). In our analysis, we selected data collected from individuals who were between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 78 subjects, including 41 HC and 37 patients with AD. Ethics approval for the OASIS-1 study and all experimental protocols was provided by the local ethics committee and all participants signed an informed participation consent. All subjects participated in accordance with guidelines of the Washington University Human Studies Committee. All experiments and methods were carried out in accordance with the approved guidelines and regulations.

The MIRIAD dataset was designed to establish the minimal interval over which it would be feasible to undertake clinical trials in AD using atrophy measured from longitudinal MRI as an outcome measure21. Ethical approval for the MIRIAD study (and subsequently its release) was received from the local research ethics committee, and written consent obtained from all participants. All experiments and methods were carried out in accordance with the approved guidelines and regulations. Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 48 subjects, including 18 HC and 30 patients with AD.

In the present study, we used the UK Biobank set to train the autoencoders and the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets to assess the normative model performance on data from patients with MCI and AD. To perform comparisons between HC and patient groups, we ensured that there were no significant statistical differences regarding age and sex in all five clinical datasets. We assessed each dataset independently using the ANOVA test to verify any differences in age and the Chi-square test of homogeneity to investigate differences in the sex ratios between groups (Tables 1, 2).
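
The two group-matching tests described above can be illustrated with a minimal sketch. It assumes SciPy is available and uses synthetic ages and hypothetical sex counts (not the study data); the group sizes and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ages for three hypothetical groups (NOT the study data).
age_hc = rng.uniform(47, 73, size=60)
age_mci = rng.uniform(47, 73, size=40)
age_ad = rng.uniform(47, 73, size=30)

# One-way ANOVA: do the mean ages differ between groups?
f_stat, p_age = stats.f_oneway(age_hc, age_mci, age_ad)

# Hypothetical contingency table of sex counts
# (rows: HC, MCI, AD; columns: female, male).
sex_counts = np.array([[32, 28],
                       [21, 19],
                       [16, 14]])

# Chi-square test of homogeneity: are the sex ratios equal across groups?
chi2, p_sex, dof, expected = stats.chi2_contingency(sex_counts)

print(f"age: F = {f_stat:.2f}, p = {p_age:.3f}")
print(f"sex: chi2 = {chi2:.2f}, p = {p_sex:.3f}, dof = {dof}")
```

A dataset would be considered matched on a variable when the corresponding p-value does not indicate a significant difference between groups.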

Table 1 Demographic information for the subjects from the UK Biobank dataset, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, and the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) dataset. We used the ANOVA test and the chi-square test of homogeneity to test for significant differences in age and sex between healthy controls and patients. Abbreviations: HC = healthy control; EMCI = early mild cognitive impairment; LMCI = late mild cognitive impairment; AD = Alzheimer’s disease; MCI = mild cognitive impairment; SD = standard deviation.
Table 2 Demographic information for the subjects from the Alzheimer’s Disease Repository Without Borders (ARWiBo) dataset, the Open Access Series of Imaging Studies: Cross-Sectional (OASIS-1) dataset, and the Minimal Interval Resonance Imaging in Alzheimer's Disease (MIRIAD) dataset. We used the ANOVA test and the chi-square test of homogeneity to test for significant differences in age and sex between healthy controls and patients. Abbreviations: HC = healthy control; AD = Alzheimer’s disease; MCI = mild cognitive impairment; SD = standard deviation.

MRI processing

We used the FreeSurfer software (version 6.0) to estimate the brain regions’ volumes from the T1 weighted images. This estimation was performed using the “recon-all” command (see Fischl, 2012; Fischl et al., 2002, for more information). During this processing, the cortical surface of each hemisphere was parcellated according to the Desikan–Killiany atlas29 and anatomical volumetric measures were obtained via a whole-brain segmentation procedure (Aseg atlas)28. The final data included the cortical volume for each of the 68 cortical subregions (34 per hemisphere) and the volume of 33 neuroanatomical structures, totalling 101 subregions/structures (the complete list is presented in the supplementary materials).

Normative model

In this paper, we developed the normative model using the adversarial autoencoder (AAE; Fig. 1)30, 31. As an autoencoder, this neural network has an encoder and a decoder. The function of the encoder is to take in an input x and map it into a latent encoding space, creating a latent code h. Then, the goal of the decoder is to reconstruct the input data based on the latent code. The AAE is a blend of this autoencoder framework with adversarial training, which is used in generative adversarial networks modelling32. This autoencoder uses the adversarial training to shape the distribution of the latent code to look similar to a predefined prior distribution. The AAE achieves this desired distribution by incorporating a discriminator network into its structure. In this scheme, the discriminator receives two types of inputs: random numbers sampled from the desired prior distribution, and the latent code. During the training process, the discriminator will make predictions regarding whether its input data was sampled from the prior distribution or the latent code. The adversarial training forces the encoder to produce a latent code space that can fool the discriminator into predicting that the encoded samples are just another sample from the prior distribution.

Figure 1

Structure of the normative model based on adversarial autoencoders. In this configuration, the subject data is inputted into the encoder and then mapped to the latent code. This latent code is fed to the decoder with the demographic data, and then the decoder generates a reconstruction of the original data. During the training of the model, the discriminator predicts if its input data came from the latent code or if it was randomly sampled from the chosen prior distribution (e.g. Gaussian distribution). Based on these predictions, the adversarial autoencoder forces the encoder to produce a latent code similar to the prior distribution selected. Since the model is trained on healthy controls data, it is expected that it can reconstruct similar data relatively well, yielding a small reconstruction error. However, the model is expected to generate a high error when processing data affected by unseen underlying mechanisms, e.g. pathological mechanisms.

In this study, we trained the AAE to encode and reconstruct the data of HC subjects. The main idea of this normative approach is that, since the AAE only learns how to reconstruct brain data from HC individuals, it will be less precise at reconstructing data from patients, which differ due to the pathological mechanisms of the disorder. As a result, the difference between the reconstructed data and the original data will be larger in patients than in HC individuals.

Regarding our model architecture, we used an encoder with two hidden layers with 100 neurons, and a latent code with a size of 20 neurons. The decoder and the discriminator had a similar structure (two hidden layers with 100 neurons). All hidden layers had a leaky ReLU non-linearity33. The latent code and the decoder’s output layer had a linear activation function.
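
To make the layer sizes concrete, the following is a minimal NumPy sketch of an untrained forward pass through three networks with the dimensions described above. It is not the TensorFlow implementation used in the study: the weight initialisation, the leaky ReLU slope (0.2), and the single-logit discriminator output are assumptions, and the decoder input size assumes the latent code is concatenated with the one-hot age (27) and sex (2) vectors.

```python
import numpy as np

N_REGIONS = 101   # brain regions/structures (from the text)
N_HIDDEN = 100    # neurons per hidden layer (from the text)
N_LATENT = 20     # latent code size (from the text)
N_COND = 27 + 2   # one-hot age + one-hot sex fed to the decoder

def leaky_relu(x, alpha=0.2):
    # alpha is an assumed slope; the text does not state it.
    return np.where(x > 0, x, alpha * x)

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.05, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Leaky ReLU on hidden layers, linear activation on the last layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = leaky_relu(x)
    return x

rng = np.random.default_rng(0)
encoder = init_mlp([N_REGIONS, N_HIDDEN, N_HIDDEN, N_LATENT], rng)
decoder = init_mlp([N_LATENT + N_COND, N_HIDDEN, N_HIDDEN, N_REGIONS], rng)
# Assumption: the discriminator outputs one logit per latent code.
discriminator = init_mlp([N_LATENT, N_HIDDEN, N_HIDDEN, 1], rng)

x = rng.normal(size=(4, N_REGIONS))   # 4 synthetic subjects
cond = np.zeros((4, N_COND))
cond[:, 0] = 1.0                      # age 47 (first age slot)
cond[:, 27] = 1.0                     # first sex slot

h = forward(x, encoder)                          # latent code
x_hat = forward(np.hstack([h, cond]), decoder)   # reconstruction
d_logit = forward(h, discriminator)              # discriminator output
print(h.shape, x_hat.shape, d_logit.shape)
```

The sketch only checks that the shapes fit together; the reconstruction and adversarial training steps are omitted.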

Normative model training

To train the autoencoder, we first pre-processed the brain features. This involved estimating the relative brain region volumes for each subject by dividing the original brain region volumes by the total intracranial volume. Then, we normalised the relative brain region volumes across all the participants in the training set. In this step, we performed a normalisation robust to outliers by subtracting the median value of the relative brain region volume and then scaling the data according to its interquartile range. Centering and scaling were done independently for each brain region. The same statistics (median and interquartile range) were later used to normalise the data from the clinical datasets before feeding them to the model.
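
The pre-processing steps above can be sketched as follows. This is a minimal NumPy illustration on synthetic volumes (not the study data); the function names are hypothetical. The key point is that the median and interquartile range are estimated on the training set only and then reused for the clinical data.

```python
import numpy as np

def fit_robust_scaler(train_rel):
    """Per-region median and interquartile range from the training set."""
    median = np.median(train_rel, axis=0)
    q75, q25 = np.percentile(train_rel, [75, 25], axis=0)
    return median, q75 - q25

def apply_robust_scaler(rel, median, iqr):
    """Centre by the training median and scale by the training IQR."""
    return (rel - median) / iqr

rng = np.random.default_rng(0)

# Synthetic regional volumes and total intracranial volumes (TIV).
vol_train = rng.uniform(1e3, 2e4, size=(200, 101))
tiv_train = rng.uniform(1.2e6, 1.8e6, size=(200, 1))
vol_clin = rng.uniform(1e3, 2e4, size=(5, 101))
tiv_clin = rng.uniform(1.2e6, 1.8e6, size=(5, 1))

# Relative volumes: regional volume divided by TIV.
rel_train = vol_train / tiv_train
rel_clin = vol_clin / tiv_clin

median, iqr = fit_robust_scaler(rel_train)
z_train = apply_robust_scaler(rel_train, median, iqr)
# Clinical data are normalised with the TRAINING statistics.
z_clin = apply_robust_scaler(rel_clin, median, iqr)
```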

In our analyses, we used a conditioned AAE30. This type of autoencoder allows us to influence the model’s reconstruction using the demographic variables, i.e. age and sex. To input these variables into the model, we transformed age and sex into one-hot encoding vectors. After this transformation, each subject has an age vector with 27 positions, where each position corresponds to a year within the range of 47–73 years. In this vector, all positions have value zero except the one that indicates the subject’s age which has a value equal to 1. The subject’s sex was represented in a one-hot encoded vector with two positions, one for male and one for female. The AAE’s decoder used these vectors together with the latent code to reconstruct the brain data. This architecture forces the network to disentangle the label information from the latent code30.
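
The one-hot encoding of the conditioning variables can be sketched as below. This is a minimal NumPy illustration; the function names and the coding of sex (0 = female, 1 = male) are assumptions made for the example.

```python
import numpy as np

AGE_MIN, AGE_MAX = 47, 73   # age range from the text (27 one-year bins)

def one_hot_age(ages):
    """One row per subject, with a 1 in the slot of the subject's age."""
    ages = np.asarray(ages, dtype=int)
    if ages.min() < AGE_MIN or ages.max() > AGE_MAX:
        raise ValueError("age outside the 47-73 range")
    out = np.zeros((ages.size, AGE_MAX - AGE_MIN + 1))
    out[np.arange(ages.size), ages - AGE_MIN] = 1.0
    return out

def one_hot_sex(sex):
    """Two slots per subject; coding 0 = female, 1 = male is assumed."""
    sex = np.asarray(sex, dtype=int)
    out = np.zeros((sex.size, 2))
    out[np.arange(sex.size), sex] = 1.0
    return out

age_vec = one_hot_age([47, 60, 73])
sex_vec = one_hot_sex([0, 1, 1])
# Conditioning vector that would be concatenated with the latent code.
cond = np.hstack([age_vec, sex_vec])
print(cond.shape)
```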

With the features pre-processed and the conditioning data prepared, we trained the autoencoder to minimise the mean squared value of its reconstruction error using the Adam optimizer34 for 200 epochs. A minibatch approach was used in this gradient descent-based optimizer, with a batch size of 256. The model was trained with a cyclical learning rate35, which allows the training to converge in fewer epochs. We used a base learning rate of 0.0001 and a maximum learning rate of 0.005, chosen using the “LR Range Test”36. The learning rate cycle had a basic triangular shape with a decaying amplitude (gamma = 0.98).
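
As an illustration of the schedule, the sketch below implements a triangular cyclical learning rate with the base and maximum values given above. Two details are assumptions, since the text does not state them: the half-cycle length (`step_size`, here 2000 iterations) and the point at which the decay is applied (here, the amplitude is multiplied by gamma once per completed cycle).

```python
import numpy as np

def cyclical_lr(iteration, base_lr=1e-4, max_lr=5e-3,
                step_size=2000, gamma=0.98):
    """Triangular cyclical learning rate with a decaying amplitude.

    step_size (iterations per half cycle) and the per-cycle decay
    are assumptions made for this sketch.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    # Position within the cycle: 1 at the ends, 0 at the peak.
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    amplitude = (max_lr - base_lr) * gamma ** (cycle - 1)
    return base_lr + amplitude * np.maximum(0.0, 1 - x)

# Learning rate at the start, first peak, end of cycle 1, second peak.
for it in (0, 2000, 4000, 6000):
    print(it, float(cyclical_lr(it)))
```

The rate rises linearly from the base value to the peak over `step_size` iterations and falls back over the next `step_size`; each later peak is slightly lower than the previous one.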

In this study, we assessed the robustness of the autoencoder approach by training it with different simulated sets using bootstrapping as the resampling method. We created 1,000 bootstrapped sets (each one with n = 11,032) by sampling with replacement from the UK Biobank. These bootstrapped sets were used to train the AAE. With this resampling method, we calculated: the value of the mean deviation (“Analysis of the observed deviations” section) for each group from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets, the discriminative performance of the normative approach (“Analysis of the observed deviations” section), and the deviation from normality of each brain region (“Brain regions deviations” section).

Analysis of the observed deviations

Similar to Pinaya et al.5, we processed the data of each subject using the AAE, and we calculated the mean squared error between the reconstruction and the inputted data as the metric of brain deviation (Eq. 1).

$$observed\,deviation=\frac{1}{number\,of\,regions}\sum_{i=1}^{number\,of\,regions}{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}$$
(1)

where \({x}_{i}\) is the normalised value of the brain region \(i\), \({\widehat{x}}_{i}\) is the autoencoder reconstructed value of the brain region \(i\), and \(number\,of\,regions\) is the number of cortical regions and neuroanatomical structures used (i.e. \(number\,of\,regions\)= 101).
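
Eq. 1 amounts to a per-subject mean squared error over the regions. A minimal NumPy sketch with a hypothetical function name and toy values:

```python
import numpy as np

def observed_deviation(x, x_hat):
    """Mean squared reconstruction error per subject (Eq. 1).

    x and x_hat have shape (n_subjects, n_regions).
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=-1)

# Toy example with 3 regions instead of 101.
x = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
x_hat = np.array([[1.0, 2.0, 5.0],
                  [0.0, 0.0, 0.0]])
dev = observed_deviation(x, x_hat)
print(dev)
```

In the toy example the first subject has a squared error of 4 in one of three regions, giving a deviation of 4/3, and the second subject is reconstructed perfectly.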

In each iteration of the bootstrap method, we used the trained autoencoder to obtain the deviation metric of the subjects from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets. Then, we calculated the difference between the mean deviation scores of each pair of groups. We identified a significant difference between groups if the 95% confidence interval of this difference did not include zero. In addition, we used the subjects’ deviations to obtain the discriminative performance of the autoencoder approach, measured by the area under the receiver operating characteristic curve (AUC).
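
The two quantities described above can be sketched in NumPy as follows. The example uses synthetic values (not the study results) and hypothetical function names; the AUC is computed through its rank interpretation, i.e. the probability that a patient's deviation exceeds a control's, with ties counted as one half.

```python
import numpy as np

def percentile_ci(values, level=0.95):
    """Percentile confidence interval over bootstrap iterations."""
    tail = (1.0 - level) / 2.0 * 100.0
    lo, hi = np.percentile(values, [tail, 100.0 - tail])
    return lo, hi

def auc_from_scores(scores_pos, scores_neg):
    """AUC as P(score_pos > score_neg), counting ties as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return np.mean((pos > neg) + 0.5 * (pos == neg))

rng = np.random.default_rng(0)

# Synthetic differences in mean deviation between two groups,
# one value per bootstrap iteration (NOT the study results).
diffs = rng.normal(0.10, 0.02, size=1000)
lo, hi = percentile_ci(diffs)
significant = not (lo <= 0.0 <= hi)
print(f"95% CI of the difference: [{lo:.3f}, {hi:.3f}]")
print("significant:", significant)

# AUC for one iteration, using deviations as the decision score.
auc = auc_from_scores([0.40, 0.55, 0.35], [0.30, 0.28, 0.36])
print("AUC:", auc)
```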

Brain regions deviations

The autoencoder approach can quantify how much each brain region deviated from normality and contributed to the observed deviation. These values were obtained by measuring the difference between the inputted value and its reconstruction. In our study, we quantified the deviation for each subject from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets. Then, in each iteration of the bootstrap method, we calculated the effect size of each brain region deviation, using Cliff’s delta37, between the HC group and each patient group. Here we used Cliff’s delta, a non-parametric effect size measure, because the observed deviation follows a gamma distribution.
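
Cliff's delta for one brain region can be computed directly from its definition: the proportion of (patient, control) pairs in which the patient value is larger, minus the proportion in which it is smaller. The sketch below is a minimal NumPy illustration on toy values; the function name and the direction of the comparison (patients first) are assumptions.

```python
import numpy as np

def cliffs_delta(a, b):
    """Cliff's delta between samples a and b, in the range [-1, 1].

    +1 means every value in a exceeds every value in b;
    0 means the two samples overlap completely.
    """
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return (np.sum(a > b) - np.sum(a < b)) / (a.size * b.size)

# Toy per-region deviations for one region (NOT the study data).
dev_patients = [0.9, 1.2, 0.7, 1.5]
dev_controls = [0.3, 0.8, 0.5]
delta = cliffs_delta(dev_patients, dev_controls)
print(delta)
```

When there are no ties, Cliff's delta is related to the AUC by delta = 2 × AUC − 1, so both describe the same ordering of the two groups.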

Comparison against traditional machine learning classification

A further aim of the present study was to compare the performance of our normative model against a traditional classification approach. To measure the performance of the classifiers, we calculated the AUC using the 0.632+ bootstrap method38 with 1,000 iterations. Each clinical dataset (ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD) was analysed independently using the HC and patient groups to train the classifiers. In addition, the analysis was performed as multiple binary classifications between HC and each clinical group (e.g. HC versus LMCI).

In each iteration, first, we created the bootstrapped set by sampling the original data (from ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets) with replacement. This bootstrapped set had the same size as the original dataset (for example, when analysing the ADNI dataset to classify healthy controls and patients with Alzheimer’s disease, the bootstrapped set had 212 + 64 = 276 subjects), and it contained repeated subjects (due to replacement). For each iteration, the subjects not included in the bootstrapped set were used as the out-of-bag set (i.e. test set).
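
The split into a bootstrapped set and an out-of-bag set can be sketched as below, using the ADNI HC versus AD example from the text (212 + 64 = 276 subjects). This is a minimal NumPy illustration; the function name and seed are arbitrary.

```python
import numpy as np

def bootstrap_split(n_subjects, rng):
    """Indices of the bootstrapped set and of the out-of-bag set."""
    # Sample with replacement, same size as the original dataset.
    in_bag = rng.integers(0, n_subjects, size=n_subjects)
    # Subjects never drawn form the out-of-bag (test) set.
    out_of_bag = np.setdiff1d(np.arange(n_subjects), in_bag)
    return in_bag, out_of_bag

rng = np.random.default_rng(0)
n_subjects = 212 + 64   # HC + AD in the ADNI example
in_bag, out_of_bag = bootstrap_split(n_subjects, rng)

print("bootstrapped set size:", in_bag.size)
print("unique subjects in bag:", np.unique(in_bag).size)
print("out-of-bag subjects:", out_of_bag.size)
```

On average, about 36.8% of the subjects are left out of each bootstrapped set, which is the origin of the 0.632 weight used by the bootstrap estimator.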

Next, we obtained the relative brain region volumes of each subject by dividing the original volume by the total intracranial volume. Then, we normalised the values of the relative brain volumes across the subjects. In this normalisation step, we subtracted the median value of each brain region and scaled the data according to the interquartile range. Centering and scaling were done independently for each brain region. The same statistics (median and interquartile range) were later used to normalise the out-of-bag set.

To perform the classification analysis, we used a relevance vector machine (RVM)39 with a linear kernel. The RVM is a Bayesian treatment of identical functional form to the Support Vector Machine (SVM)40. One advantage of the RVM over the SVM is that it is not necessary to estimate the error/margin trade-off parameter ‘C’. After we trained the RVM on the bootstrapped set, we used the model to obtain the predicted probability of a subject belonging to the patient class. Using these probabilities, we calculated two AUC values, one for the bootstrapped set (called “resubstitution” metric) and one for the test set (called “out-of-bag” metric). By using the 0.632+ bootstrap method, we minimised the optimistic and pessimistic bias of the estimate and obtained the AUC value (Eq. 2).

$$AU{C}_{bootstrap}=\frac{1}{b}\sum_{i=1}^{b}\left(\omega *AU{C}_{out-of-bag, i}+\left(1-\omega \right)*AU{C}_{resubstitution,i}\right)$$
(2)

where b was the number of iterations and the weight ω was defined considering the relative overfitting rate (full description in Efron and Tibshirani, 1997). To obtain the confidence interval (CI; 95% of confidence), we used the percentile method41. Next, we compared these confidence intervals with the AUC obtained during the normative approach.
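
For illustration, the weighting in Eq. 2 can be sketched as follows. This is not the code used in the study: it translates the error-rate definitions of Efron and Tibshirani (1997) to the AUC under the assumption that the no-information AUC is 0.5, with relative overfitting rate R = (AUC_resubstitution − AUC_out-of-bag) / (AUC_resubstitution − 0.5) and ω = 0.632 / (1 − 0.368 R). The clipping of the out-of-bag AUC at 0.5 and the function name are also assumptions.

```python
import numpy as np

def auc_632_plus(auc_oob, auc_resub, no_info_auc=0.5):
    """0.632+ style AUC estimate averaged over bootstrap iterations.

    Assumption: the no-information AUC is 0.5 and the out-of-bag
    AUC is not allowed to fall below it.
    """
    auc_oob = np.asarray(auc_oob, dtype=float)
    auc_resub = np.asarray(auc_resub, dtype=float)
    oob = np.maximum(auc_oob, no_info_auc)

    # Relative overfitting rate R in [0, 1]; 0 when there is no overfit.
    valid = auc_resub > oob
    denom = np.where(valid, auc_resub - no_info_auc, 1.0)
    rate = np.where(valid, (auc_resub - oob) / denom, 0.0)

    # Weight omega grows from 0.632 (R = 0) to 1 (R = 1).
    omega = 0.632 / (1.0 - 0.368 * rate)
    return np.mean(omega * oob + (1.0 - omega) * auc_resub)

# Toy values for three bootstrap iterations (NOT the study results).
estimate = auc_632_plus([0.70, 0.72, 0.68], [0.90, 0.88, 0.91])
print(round(float(estimate), 3))
```

With no overfitting (R = 0) the estimate reduces to the plain 0.632 bootstrap; with maximal overfitting (R = 1) it relies entirely on the out-of-bag AUC.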

Finally, we compared the generalization of the classifiers with the results of the autoencoders. In this analysis, we used each trained classifier to predict the group of the subjects from the other clinical datasets. In order to verify if the performance in the independent datasets was significantly different from the normative approach, we calculated the difference between the AUCs of this generalization analysis and the AUCs of the autoencoders. With the 1,000 measures of the difference, we calculated its confidence interval (95% confidence) to verify if this difference is different from zero.

Experiments

We conducted our experiments in Python 3 using the TensorFlow 2.0 library (https://www.tensorflow.org/) and the sklearn_rvm library (https://github.com/Mind-the-Pineapple/sklearn-rvm) developed by Baecker et al.42. We have made the code and trained models used in this study publicly available at https://github.com/Warvito/Normative-modelling-using-deep-autoencoders. A Google Colaboratory notebook that calculates the deviation scores of new data is available at https://colab.research.google.com/github/Warvito/Normative-modelling-using-deep-autoencoders/blob/master/notebooks/predict.ipynb.

Results

Comparison of deviation values for healthy controls and patients

Figure 2 shows the mean value of the observed deviation for each group. For the ADNI dataset, we found a mean value of 0.28 ([0.27, 0.32]; 95% CI) for HC; 0.29 ([0.28, 0.35]; 95% CI) for EMCI; 0.32 ([0.30, 0.38]; 95% CI) for LMCI; 0.37 ([0.34, 0.47]; 95% CI) for AD. For the AIBL dataset, we found a mean value of 0.30 ([0.28, 0.33]; 95% CI) for HC; 0.36 ([0.34, 0.42]; 95% CI) for MCI; and 0.40 ([0.36, 0.50]; 95% CI) for AD. For the ARWiBo dataset, we found a mean value of 0.32 ([0.30, 0.38]; 95% CI) for HC; 0.37 ([0.34, 0.47]; 95% CI) for MCI; and 0.46 ([0.40, 0.62]; 95% CI) for AD. For the OASIS-1 dataset, we found a mean value of 0.41 ([0.39, 0.46]; 95% CI) for HC and 0.65 ([0.58, 0.79]; 95% CI) for AD. For the MIRIAD dataset, we found a mean value of 0.26 ([0.24, 0.30]; 95% CI) for HC and 0.48 ([0.41, 0.71]; 95% CI) for AD.

Figure 2

Mean value of the observed deviation calculated by the autoencoder for each group. The square marker indicates the mean value and the horizontal bars indicate the 95% confidence interval calculated using the percentile method on the bootstrap analysis. Abbreviations: AD = Alzheimer’s disease; EMCI = early mild cognitive impairment; LMCI = late mild cognitive impairment; MCI = mild cognitive impairment; HC = healthy controls; ADNI = Alzheimer’s Disease Neuroimaging Initiative; AIBL = Australian Imaging Biomarkers and Lifestyle Study of Ageing; ARWiBo = Alzheimer's Disease Repository Without Borders; OASIS-1 = Open Access Series of Imaging Studies: Cross-Sectional; MIRIAD = Minimal Interval Resonance Imaging in Alzheimer's Disease.

When we examined the confidence intervals of the observed deviations, we found that the five independent datasets presented mean deviation scores significantly different between groups, with the exception of the comparison between HC and EMCI in the ADNI dataset (difference range [-0.03, 0.00]) and the comparison between MCI and AD in the AIBL dataset (difference range [-0.09, 0.00]) (more details can be found in the supplementary materials).

Normative model performance in discriminative tasks

We examined if the observed deviations could be used to predict if a person belonged to the patient or HC group (Fig. 3) using ROC curves. This revealed that the generated deviation values reflected the severity of the disease. Specifically, based on the AUC, it was possible to discriminate patients with AD vs HC better than patients with MCI vs HC, and to discriminate patients with LMCI vs HC better than patients with EMCI vs HC.

Figure 3

Discriminative performance of the normative approach. The solid line indicates the mean receiver operating characteristic curve across the bootstrap iterations with the shaded area indicating the 95% confidence interval calculated using the percentile method on the bootstrap analysis. The dashed line indicates the chance level. Abbreviations: AD = Alzheimer’s disease; AUC-ROC = area under the receiver operating characteristic curve; EMCI = early mild cognitive impairment; LMCI = late mild cognitive impairment; MCI = mild cognitive impairment; HC = healthy controls; ADNI = Alzheimer’s Disease Neuroimaging Initiative; AIBL = Australian Imaging Biomarkers and Lifestyle Study of Ageing; ARWiBo = Alzheimer's Disease Repository Without Borders; OASIS-1 = Open Access Series of Imaging Studies: Cross-Sectional; MIRIAD = Minimal Interval Resonance Imaging in Alzheimer's Disease.

Brain regions deviations

Figure 4 presents the Cliff’s delta of each brain region when comparing its deviation in the HC group against the deviation in the patient groups. Only the regions with effect sizes significantly different from zero are shown (complete list presented in the supplementary materials). Among the regions showing significant deviation in patients with AD, we found the lateral ventricles, temporal horns, hippocampus, entorhinal cortex, parahippocampal cortex, and amygdala. A number of these regions also showed a high deviation in patients with MCI, including the lateral ventricles and hippocampus. Finally, we also noted that effect sizes were smaller for the regions identified in patients with MCI relative to those identified in patients with AD.

Figure 4

Brain regions deviations. The square marker indicates the mean effect size (Cliff’s delta) between the healthy control group and the respective patient groups. The horizontal bars indicate the 95% confidence interval calculated using the percentile method on the bootstrap analysis. Only the regions with a mean effect size significantly different from zero are presented. Abbreviations: AD = Alzheimer’s disease; AUC-ROC = area under the receiver operating characteristic curve; EMCI = early mild cognitive impairment; LMCI = late mild cognitive impairment; MCI = mild cognitive impairment; HC = healthy controls; ADNI = Alzheimer’s Disease Neuroimaging Initiative; AIBL = Australian Imaging Biomarkers and Lifestyle Study of Ageing; ARWiBo = Alzheimer's Disease Repository Without Borders; OASIS-1 = Open Access Series of Imaging Studies: Cross-Sectional; MIRIAD = Minimal Interval Resonance Imaging in Alzheimer's Disease.

Traditional machine learning classification

Using the RVM, we evaluated the performance of a traditional classifier when performing binary classification between HC and patients. For the ADNI dataset, we obtained an AUC = 0.69 ([0.58, 0.77]; 95% CI) when analysing patients with EMCI, an AUC = 0.76 ([0.64, 0.84]; 95% CI) when analysing patients with LMCI, and an AUC = 0.93 ([0.87, 0.97]; 95% CI) when analysing patients with AD. For the AIBL dataset, we obtained an AUC = 0.37 ([0.00, 0.78]; 95% CI) when analysing subjects with MCI, and an AUC = 0.93 ([0.86, 0.93]; 95% CI) when analysing patients with AD. Note that the AUC for the AIBL dataset when analysing MCI had a wide confidence interval. This interval was widened by the presence of overfitting and by the compensatory effect of the 0.632+ bootstrap method, which reduces the bias caused by this overfitting. For the ARWiBo dataset, we obtained an AUC = 0.68 ([0.52, 0.78]; 95% CI) when analysing subjects with MCI, and an AUC = 0.94 ([0.87, 0.98]; 95% CI) when analysing patients with AD. For the OASIS-1 dataset, we obtained an AUC = 0.86 ([0.69, 0.96]; 95% CI) when analysing patients with AD. For the MIRIAD dataset, we obtained an AUC = 0.86 ([0.70, 0.96]; 95% CI) when analysing patients with AD.

To identify significant differences between the performance of the normative models and traditional classifiers, we calculated the confidence interval (95% of confidence) of the difference in AUC between the two methods. The traditional classifiers were superior to the normative models when predicting the difference between the groups in the ADNI dataset and the difference between HC and AD in the AIBL dataset; in contrast the performance of the two approaches was comparable for all other comparisons (more details can be found in the supplementary materials).

Finally, we examined how a classifier trained on a certain dataset would perform when applied to other datasets (i.e. cross-cohort generalizability). The results of this examination are presented in Tables 3 and 4. When predicting AD, the classifiers had a higher mean performance than the normative approach in most cases (except when the model was trained on the MIRIAD dataset and evaluated on the ARWiBo dataset). However, the difference was not statistically significant in almost half of the cases. When predicting MCI, the classifiers presented a lower mean performance in all cases, but the difference was not statistically significant.

Table 3 Generalization performance of the classifiers for the classification between HC and patients with Alzheimer’s disease. In this table, the rows indicate the dataset where the classifier was trained and the columns indicate the dataset where the performance was tested. The area under the receiver operating characteristic curve is shown with the lower and upper bounds of its 95% confidence interval. Performance significantly different from the normative approach, as determined from the confidence interval of the difference between the approaches across the bootstrap scheme, is indicated by “*”.
Table 4 Generalization performance of the classifiers for the classification between HC and patients with mild cognitive impairment. In this table, the rows indicate the dataset where the classifier was trained and the columns indicate the dataset where the performance was tested. The area under the receiver operating characteristic curve is shown with the lower and upper bounds of its 95% confidence interval. No case had a performance significantly different from the normative approach, as determined from the confidence interval of the difference between the approaches across the bootstrap scheme.

Discussion

In this study, we evaluated the performance of the normative modelling approach based on deep autoencoders on data from patients with MCI and AD. Consistent with our first hypothesis, we found that the approach was effective in generating deviation values that reflect the severity of the disease, with patients with AD showing higher deviations than patients with MCI, and patients with LMCI showing larger deviations than patients with EMCI. We also measured how much each brain region deviated from normality and contributed to the observed deviation. Here, we found that regions from the ventricular system and medial temporal lobe were among those making the greatest significant contribution to deviation, consistent with our second hypothesis. Finally, we compared the performance of the normative approach versus a traditional classification approach. Although a higher performance was found for traditional classifiers in most cases, the difference was not statistically significant in the majority of cases.

We have replicated previous findings that the autoencoder is capable of detecting neuroanatomical deviation in individuals with brain disorders5. In particular, in each of our five independent datasets, the normative model assigned higher values to patients with AD than to healthy controls. This pattern was expected since the disorder is associated with profound alterations in brain morphometry that were not present in the training set13, 14. In addition, we have expanded these findings by demonstrating for the first time that autoencoders are capable of discriminating between different stages of disease progression (i.e. EMCI versus LMCI versus AD). In particular, we observed that the MCI group presented intermediate deviation values in three independent datasets (ADNI, AIBL and ARWiBo). These values were also expected since MCI is considered a transitional stage between HC and AD43 and usually presents less brain atrophy than AD44. In addition, within the ADNI dataset, the MCI subjects were divided into two categories, EMCI and LMCI. Although individuals in both stages meet the conventional criteria for MCI, EMCI is associated with less pronounced symptoms thought to reflect an earlier point in the clinical spectrum than LMCI. In our analyses, we found that patients with LMCI had a significantly larger deviation than patients with EMCI (i.e. the confidence interval of the difference between the groups did not include zero), providing further confirmation that deep autoencoders are capable of discriminating between different stages of the disease course.

With the autoencoder-based approach, it was possible to identify the brain regions with the highest deviations from the expected normative values. Consistent with our second hypothesis, the AD group showed high levels of deviation in structures that are part of the ventricular system (such as the lateral ventricles, temporal horns, and 3rd ventricle) and in the medial temporal cortex, including the hippocampus, entorhinal cortex, parahippocampal cortex, and amygdala. Progressive ventricular expansion is one of the most reliable morphological changes in dementia patients, reflecting the increasing atrophy of the brain45. Likewise, medial temporal cortex atrophy is among the most consistent findings in neuroimaging studies of AD13, 46 and an established marker of AD47. While deviations in the MCI group had smaller effect sizes than those in the AD group, there was a high degree of overlap in the hippocampus, parahippocampal cortex and several temporoparietal regions, consistent with previous neuroimaging studies of MCI48,49,50,51. The smaller effect size in MCI might be explained by two (not mutually exclusive) factors: (i) an earlier stage in the AD course, hence milder atrophy, and (ii) heterogeneity of the MCI construct. Since MCI patients were not selected based on AD biomarkers (i.e., presence of beta-amyloid and tau protein in the cerebrospinal fluid)52, this group will likely include a mixture of AD and non-AD cases, hence the milder/diluted effect.

Finally, we compared the performance of our normative approach with traditional classifiers. The performance of the classifiers was measured in two schemes: on data from the same dataset where the model was trained, and on data from independent clinical datasets (generalization performance). Although the traditional classifiers had a better mean performance in most cases, the differences between the two approaches were not statistically significant in most of the cases, especially when predicting the subjects from the ARWiBo, OASIS-1 and MIRIAD datasets. This similarity was more evident when predicting patients with MCI (with the exception of the ADNI dataset).

Although we evaluated our method using a range of different datasets, we did not assess the impact of MRI scanners and acquisition parameters. Recent studies have shown that these variables can have a measurable impact on the performance of machine learning models, highlighting the importance of inter-scanner harmonisation53, 54. In particular, MRI scanners and acquisition parameters have been shown to influence the results not only in traditional machine learning classification but also in normative modelling55. For this reason, further studies are needed to analyse the influence of inter-scanner harmonisation, which can be implemented using tools such as Neuroharmony54 or ComBat56, on the performance of autoencoder-based methods.

Unlike in a case–control context, the normative approach does not need to be trained on a dataset with a reasonable balance between HC and patient groups. It is trained using only healthy controls, which enables the use of large cohorts of HC participants1, 9, such as the UK Biobank and the Human Connectome Project57. Our approach does not rely on any labels during training; this enables its application to an array of clinical tasks (including diagnosis, prognosis, treatment selection and mechanistic inference) for any brain disorder without the necessity of re-training or fine-tuning. Finally, since our approach involves anomaly detection, it can also work cooperatively with conventional discriminative models to identify and mitigate circumstances where supervised methods could fail catastrophically due to a test example very distinct from the training set (“out-of-distribution” examples). In order to promote open science, we have made all scripts and the trained models available to the wider research community (https://github.com/Warvito/Normative-modelling-using-deep-autoencoders).
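
The cooperative use of a normative model with a discriminative classifier mentioned above could take a form such as the following Python sketch. Everything here is an assumption made for illustration: the deviation score is taken to be the mean squared difference between the observed and reconstructed features, the threshold is a hypothetical quantile of the deviations of healthy reference subjects, and the function names do not correspond to the released scripts.

```python
def deviation_score(observed, reconstructed):
    """Mean squared difference between observed and reconstructed features."""
    total = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    return total / len(observed)


def reference_threshold(reference_scores, quantile=0.95):
    """Deviation value at the given quantile of a healthy reference sample."""
    ordered = sorted(reference_scores)
    return ordered[int(quantile * (len(ordered) - 1))]


def gated_prediction(classifier_output, score, threshold):
    """Return the classifier output, or None when the example looks
    out-of-distribution and should be deferred for manual review."""
    if score > threshold:
        return None
    return classifier_output
```

In such a scheme, the supervised model is trusted only for examples whose deviation lies within the range seen in the reference cohort; the choice of quantile trades off how many examples are deferred against how many atypical examples reach the classifier.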