Using normative modelling to detect disease progression in mild cognitive impairment and Alzheimer’s disease in a cross-sectional multi-cohort study

Normative modelling is an emerging method for quantifying how individuals deviate from the healthy population pattern. Several machine learning models have been implemented to develop normative models to investigate brain disorders, including regression, support vector machines and Gaussian process models. With the advance of deep learning technology, the use of deep neural networks has also been proposed. In this study, we assessed normative models based on deep autoencoders using structural neuroimaging data from patients with Alzheimer’s disease (n = 206) and mild cognitive impairment (n = 354). We first trained the autoencoder on an independent dataset (UK Biobank dataset) with 11,034 healthy controls. Then, we estimated how each patient deviated from this norm and established which brain regions were associated with this deviation. Finally, we compared the performance of our normative model against traditional classifiers. As expected, we found that patients exhibited deviations according to the severity of their clinical condition. The model identified medial temporal regions, including the hippocampus, and the ventricular system as critical regions for the calculation of the deviation score. Overall, the normative model had comparable cross-cohort generalizability to traditional classifiers. To promote open science, we are making all scripts and the trained models available to the wider research community.


Methods
Datasets. In our analysis, we used six datasets: the UK Biobank 15, the Alzheimer's Disease Neuroimaging Initiative (ADNI) 16, the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) 17, the Alzheimer's Disease Repository Without Borders (ARWiBo) 18,19, the Open Access Series of Imaging Studies: Cross-Sectional (OASIS-1) 20, and the Minimal Interval Resonance Imaging in Alzheimer's Disease (MIRIAD) 21.
The UK Biobank is a study that aims to follow the health and well-being of 500,000 volunteer participants across the United Kingdom. From these participants, a subsample was chosen to collect multimodal imaging, including structural neuroimaging. Here, we used an early release of the project's data comprising 11,034 HC participants. The inclusion criteria for the present study were: (a) data collected in the same MRI scanner (from the Cheadle centre), and (b) age between 47 and 73 years. The only exclusion criterion was previous hospitalization associated with the diagnosis of mental and behavioural disorders, disease of the nervous system, cerebrovascular diseases, benign neoplasm of meninges, brain and other parts of the central nervous system, or injuries to the head. This study (UK Biobank project #40323) was covered by the general ethical approval for UK Biobank studies from the NHS National Research Ethics Service on 17th June 2011 (Ref 11/NW/0382). All methods were carried out in accordance with the approved guidelines and regulations. All UK Biobank participants provided written informed consent. More details about the dataset can be found elsewhere 15,22-24.
The ADNI consortium started in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner. Its goal was to verify whether different neuroimaging biomarkers and neuropsychological assessments can be combined to measure the progression of MCI and to study the development of AD. All ADNI participants provided written informed consent, and study protocols were approved by each local site's institutional review board. All methods were carried out in accordance with the approved guidelines. Further information about ADNI, including full study protocols, complete inclusion and exclusion criteria, and data collection and availability can be found at http://www.adni-info.org/. All methods as stated on the website were performed in accordance with the relevant guidelines and regulations. In this study, we included the structural MRI collected during the ADNI GO, ADNI 2 and ADNI 3 phases. As with the UK Biobank, we included only subjects aged between 47 and 73 years. The final dataset comprised 517 subjects, of whom 212 were HC, 159 were patients with early MCI (EMCI), 82 were patients with late MCI (LMCI), and 64 were patients with AD. In the ADNI datasets, participants were assigned to these MCI stages based on different levels of impairment on a single episodic memory measure, with the EMCI group showing milder episodic memory impairment than the LMCI group 25,26.
The AIBL dataset was developed to enhance the understanding of the pathogenesis of AD, concentrating on its early diagnosis (more details can be found in Ellis et al., 2009). Ethics approval for the AIBL study and all experimental protocols was provided by the ethics committees of Austin Health, St Vincent's Health, Hollywood Private Hospital and Edith Cowan University. All experiments and methods were carried out in accordance with the approved guidelines and regulations and all volunteers gave written informed consent before participating in the study. Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The final group was composed of 346 subjects, where 262 were HC, 46 were patients with MCI (stage not known), and 38 were patients with AD.
The ARWiBo is a cross-sectional dataset including data from patients and controls enrolled at the Scientific Institute for the Research and Care of Alzheimer's Disease [Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy]. A multidisciplinary team of neurologists, neuroscientists, image analysts, neurophysiologists, and geneticists is involved in the assessment of patients. As part of their assessment, participants undergo blood drawing (for APOE genotyping), clinical and cognitive evaluations as well as high-resolution MRI scanning (more details can be found in Frisoni et al., 2009 and Galluzzi et al., 2010). Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 319 subjects, including 215 HC, 67 patients with MCI (stage not known), and 37 patients with AD. Ethics approval for the ARWiBo study and all experimental protocols was provided by the local ethics committee and all participants signed an informed participation consent. All experiments and methods were carried out in accordance with the approved guidelines and regulations.
The OASIS-1 dataset is the result of a collaborative effort of investigators from a single acquisition site supported by the National Institute on Aging (NIA), the Howard Hughes Medical Institute, the Biomedical Informatics Research Network (BIRN) and the Washington University Alzheimer's Disease Research Center (ADRC). This collaborative effort aimed to create a freely available MRI dataset for the wider scientific community. The original dataset consisted of a cross-sectional collection of subjects aged 18 to 96. It included participants over the age of 60 who had received a clinical diagnosis of very mild to moderate AD (for more information, please see http://www.oasis-brains.org). In our analysis, we selected data collected from individuals who were between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 78 subjects, including 41 HC and 37 patients with AD. Ethics approval for the OASIS-1 study and all experimental protocols was provided by the local ethics committee and all participants signed an informed participation consent. All subjects participated in accordance with guidelines of the Washington University Human Studies Committee. All experiments and methods were carried out in accordance with the approved guidelines and regulations.
The MIRIAD dataset was designed to establish the minimal interval over which it would be feasible to undertake clinical trials in AD using atrophy measured from longitudinal MRI as an outcome measure 21 . Ethical approval for the MIRIAD study (and subsequently its release) was received from the local research ethics committee, and written consent obtained from all participants. All experiments and methods were carried out in accordance with the approved guidelines and regulations. Here, we included the structural MRI of subjects between 47 and 73 years old, to match the age range of the UK Biobank dataset. The resulting group was composed of 48 subjects, including 18 HC and 30 patients with AD.
In the present study, we used the UK Biobank dataset to train the autoencoders and the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets to assess the normative model performance on data from patients with MCI and AD. To perform comparisons between HC and patient groups, we ensured that there were no statistically significant differences in age or sex between groups in any of the five clinical datasets. We assessed each dataset independently using the ANOVA test to verify any differences in age and the Chi-square test of homogeneity to investigate differences in the sex ratios between groups (Tables 1, 2).
MRI processing. We used the FreeSurfer software (version 6.0) to estimate the volumes of the brain regions from the T1-weighted images. This estimation was performed using the "recon-all" command (see Fischl, 2012; Fischl et al., 2002, for more information). During this processing, the cortical surface of each hemisphere was parcellated according to the Desikan-Killiany atlas 29 and anatomical volumetric measures were obtained via a whole-brain segmentation procedure (Aseg atlas) 28. The final data included the cortical volume for each of the 68 cortical subregions (34 per hemisphere) and the volume of 33 neuroanatomical structures, totalling 101 subregions/structures (the complete list is presented in the supplementary materials).
The normative model was based on an adversarial autoencoder (AAE; Fig. 1) 30,31. As an autoencoder, this neural network has an encoder and a decoder. The function of the encoder is to take in an input x and map it into a latent encoding space, creating a latent code h. Then, the goal of the decoder is to reconstruct the input data based on the latent code. The AAE is a blend of this autoencoder framework with adversarial training, which is used in generative adversarial network modelling 32. This autoencoder uses the adversarial training to shape the distribution of the latent code so that it resembles a predefined prior distribution.
The AAE achieves this desired distribution by incorporating a discriminator network into its structure. In this scheme, the discriminator receives two types of inputs: random numbers sampled from the desired prior distribution, and the latent code. During the training process, the discriminator will make predictions regarding whether its input data was sampled from the prior distribution or the latent code. The adversarial training forces the encoder to produce a latent code space that can fool the discriminator into predicting that the encoded samples are just another sample from the prior distribution.
Figure 1. Structure of the normative model based on adversarial autoencoders. In this configuration, the subject data is inputted into the encoder and then mapped to the latent code. This latent code is fed to the decoder with the demographic data, and then the decoder generates a reconstruction of the original data. During the training of the model, the discriminator predicts if its input data came from the latent code or if it was randomly sampled from the chosen prior distribution (e.g. Gaussian distribution). Based on these predictions, the adversarial autoencoder forces the encoder to produce a latent code similar to the prior distribution selected. Since the model is trained on healthy controls data, it is expected that it can reconstruct similar data relatively well, yielding a small reconstruction error. However, the model is expected to generate a high error when processing data affected by unseen underlying mechanisms, e.g. pathological mechanisms.
www.nature.com/scientificreports/
In this study, we trained the AAE to encode and reconstruct the data of HC subjects. The main idea of this normative approach is that, since the AAE only learns how to reconstruct data from HC individuals, it will be less precise at reconstructing data from patients, which differ due to the pathological mechanisms of the disorder. As a result, the difference between the reconstructed data and the original data will be larger in patients than in HC individuals.
Regarding our model architecture, we used an encoder with two hidden layers of 100 neurons each, and a latent code with a size of 20 neurons. The decoder and the discriminator had a similar structure (two hidden layers of 100 neurons each). All hidden layers had a leaky ReLU non-linearity 33. The latent code and the decoder's output layer had a linear activation function.

Normative model training.
To train the autoencoder, we first pre-processed the brain features. This involved estimating the relative brain region volumes for each subject by dividing the original brain region volumes by the total intracranial volume. Then, we normalised the relative brain region volumes across all the participants in the training set. In this step, we performed a normalisation robust to outliers by subtracting the median value of the relative brain region volume and then scaling the data according to its interquartile range. Centering and scaling were done independently for each brain region. The same statistics (median and interquartile range) were later used to normalise the data from the clinical datasets before feeding them to the model.
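The pre-processing described above can be illustrated with a minimal sketch using only the Python standard library. The function names and the toy values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the text does not state how the interquartile range was computed.

```python
from statistics import median, quantiles


def relative_volumes(volumes, total_intracranial_volume):
    """Divide each regional volume by the subject's total intracranial volume."""
    return [v / total_intracranial_volume for v in volumes]


def fit_robust_scaler(train_rows):
    """Compute (median, interquartile range) per brain region on the training set."""
    stats = []
    for region_values in zip(*train_rows):  # one column per brain region
        # Assumption: inclusive quartiles; other conventions give slightly different IQRs.
        q1, _, q3 = quantiles(region_values, n=4, method="inclusive")
        stats.append((median(region_values), q3 - q1))
    return stats


def apply_robust_scaler(rows, stats):
    """Centre and scale rows with statistics estimated on the training set only."""
    return [
        [(value - med) / iqr for value, (med, iqr) in zip(row, stats)]
        for row in rows
    ]


# Toy data (hypothetical): 5 training subjects x 2 regions, already relative volumes.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
stats = fit_robust_scaler(train)
# A clinical subject is scaled with the training statistics, never with its own.
clinical_scaled = apply_robust_scaler([[5.0, 10.0]], stats)
```

Reusing the training statistics for the clinical data keeps the patient values on the same scale as the healthy reference, so that deviations are measured relative to the norm.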
In our analyses, we used a conditioned AAE 30 . This type of autoencoder allows us to influence the model's reconstruction using the demographic variables, i.e. age and sex. To input these variables into the model, we transformed age and sex into one-hot encoding vectors. After this transformation, each subject has an age vector with 27 positions, where each position corresponds to a year within the range of 47-73 years. In this vector, all positions have value zero except the one that indicates the subject's age which has a value equal to 1. The subject's sex was represented in a one-hot encoded vector with two positions, one for male and one for female. The AAE's decoder used these vectors together with the latent code to reconstruct the brain data. This architecture forces the network to disentangle the label information from the latent code 30 .
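As an illustration of the conditioning vectors described above, the following sketch builds the 27-position age vector and the 2-position sex vector and concatenates them with a 20-dimensional latent code. The function names, the (male, female) ordering and the concatenation order are assumptions made for this example and are not specified in the text.

```python
MIN_AGE, MAX_AGE = 47, 73  # age range used in the study


def one_hot_age(age):
    """Return a 27-position vector with a 1 at the subject's age in years."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(f"age must be within {MIN_AGE}-{MAX_AGE}, got {age}")
    vector = [0.0] * (MAX_AGE - MIN_AGE + 1)
    vector[age - MIN_AGE] = 1.0
    return vector


def one_hot_sex(sex):
    """Return a 2-position vector; the (male, female) ordering is an assumption."""
    categories = ("male", "female")
    if sex not in categories:
        raise ValueError(f"sex must be one of {categories}, got {sex!r}")
    return [1.0 if sex == category else 0.0 for category in categories]


def decoder_input(latent_code, age, sex):
    """Concatenate latent code and conditioning vectors (order is an assumption)."""
    return list(latent_code) + one_hot_age(age) + one_hot_sex(sex)


# A 20-dimensional latent code plus 27 + 2 conditioning values gives 49 inputs.
example = decoder_input([0.0] * 20, age=60, sex="female")
```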
With the features pre-processed and the conditioning data prepared, we trained the autoencoder to minimise the mean squared value of its reconstruction error using the Adam optimizer 34 for 200 epochs. A minibatch approach was used in this gradient descent-based optimizer, with a batch size of 256. The model was trained with a cyclical learning rate 35, which allows the training to converge in fewer epochs. We used a base learning rate of 0.0001 and a maximum learning rate of 0.005, chosen using the "LR Range Test" 36. The learning rate cycle had a basic triangular shape with a decaying amplitude (gamma = 0.98).
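A triangular cyclical learning rate with a decaying amplitude can be sketched as follows. The base and maximum learning rates and gamma are taken from the text; the half-cycle length (`step_size`) and the choice to apply the decay once per completed cycle are assumptions, because the text does not state them.

```python
import math


def cyclical_lr(iteration, step_size, base_lr=1e-4, max_lr=5e-3, gamma=0.98):
    """Triangular cyclical learning rate whose amplitude shrinks every cycle.

    step_size is the number of iterations in half a cycle (hypothetical value
    chosen by the caller). Applying gamma once per completed cycle is an
    assumption; other schedules decay the amplitude per iteration instead.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Position within the cycle: 1 at the start and end, 0 at the peak.
    x = abs(iteration / step_size - 2 * cycle + 1)
    amplitude = (max_lr - base_lr) * max(0.0, 1.0 - x)
    return base_lr + amplitude * gamma ** (cycle - 1)


# Learning rates over the first two cycles with a hypothetical step_size of 100.
schedule = [cyclical_lr(i, step_size=100) for i in range(0, 401, 100)]
```

In this sketch the rate starts at the base value, rises linearly to the maximum, returns to the base value, and then repeats with a slightly lower peak.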
In this study, we assessed the robustness of the autoencoder approach by training it with different simulated sets, using bootstrapping as the resampling method. We created 1,000 bootstrapped sets (each one with n = 11,032) by sampling with replacement from the UK Biobank. These bootstrapped sets were used to train the AAE. With this resampling method, we calculated: the value of the mean deviation ("Analysis of the observed deviations" section) for each group from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets, the discriminative performance of the normative approach ("Analysis of the observed deviations" section), and the deviation from normality of each brain region ("Brain regions deviations" section).
Analysis of the observed deviations. Similar to Pinaya et al. 5, we processed the data of each subject using the AAE, and we calculated the mean squared error between the reconstruction and the inputted data as the metric of brain deviation (Eq. 1).
$$\text{deviation} = \frac{1}{\text{number of regions}} \sum_{i=1}^{\text{number of regions}} \left(x_i - \hat{x}_i\right)^2 \quad (1)$$
where $x_i$ is the normalised value of the brain region $i$, $\hat{x}_i$ is the autoencoder reconstructed value of the brain region $i$, and number of regions is the number of cortical regions and neuroanatomical structures used (i.e. number of regions = 101).
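The deviation metric of Eq. 1 is a mean squared error over regions, which can be written as a short function. The function name and the toy vectors are hypothetical; in the study each vector would hold 101 normalised regional values.

```python
def deviation_score(x, x_hat):
    """Mean squared error between normalised input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Toy example with 3 regions (hypothetical values): only the last region
# is reconstructed poorly, so it dominates the score.
score = deviation_score([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```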
In each iteration of the bootstrap method, we used the trained autoencoder to obtain the deviation metric of the subjects from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets. Then, we calculated the difference between the mean deviation scores of each pair of groups. We considered the difference between groups significant if its 95% confidence interval did not include zero. In addition, we used the subjects' deviations to obtain the discriminative performance of the autoencoder approach, measured by the area under the receiver operating characteristic curve (AUC).
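The two quantities described above can be sketched with the standard library. The AUC is computed here through its rank interpretation (the probability that a randomly chosen patient has a higher deviation than a randomly chosen HC), and the confidence interval uses a simple percentile rule; both function names and the percentile convention are assumptions for illustration, not necessarily the implementation used in the study.

```python
from statistics import quantiles


def auc_from_scores(hc_scores, patient_scores):
    """AUC as the fraction of (patient, HC) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for patient in patient_scores:
        for control in hc_scores:
            if patient > control:
                wins += 1.0
            elif patient == control:
                wins += 0.5
    return wins / (len(hc_scores) * len(patient_scores))


def percentile_ci_95(values):
    """2.5th and 97.5th percentiles of the bootstrap values (inclusive method)."""
    cut_points = quantiles(values, n=40, method="inclusive")  # steps of 2.5%
    return cut_points[0], cut_points[-1]


def excludes_zero(interval):
    """A difference is treated as significant when its interval excludes zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Hypothetical deviation scores for one bootstrap iteration.
auc = auc_from_scores([0.1, 0.3], [0.2, 0.4])
```

With 1,000 bootstrap iterations, `percentile_ci_95` would be applied to the 1,000 between-group differences in mean deviation, and `excludes_zero` to the resulting interval.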
Brain regions deviations. The autoencoder approach can quantify how much each brain region deviated from normality and contributed to the observed deviation. These values were obtained by measuring the difference between the inputted value and its reconstruction. In our study, we quantified the deviation for each subject from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets. Then, in each iteration of the bootstrap method, we calculated the effect size of each brain region deviation between the HC group and each patient group using Cliff's delta 37. We used Cliff's delta, a non-parametric effect size measure, because the observed deviation follows a gamma distribution.
Traditional machine learning classification. Our final aim was to compare the performance of our normative model against a traditional classification approach. To measure the performance of the classifiers, we calculated the AUC using the 0.632+ bootstrap method 38 with 1,000 iterations. Each clinical dataset (ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD) was analysed independently using the HC and patient groups to train the classifiers. In addition, the analysis was performed as multiple binary classifications between HC and each clinical group (e.g. HC versus LMCI).
In each iteration, we first created the bootstrapped set by sampling the original data (from the ADNI, AIBL, ARWiBo, OASIS-1, and MIRIAD datasets) with replacement. This bootstrapped set had the same size as the original dataset (for example, when analysing the ADNI dataset to classify healthy controls and patients with Alzheimer's disease, the bootstrapped set had 212 + 64 = 276 subjects), and it contained repeated subjects (due to replacement). For each iteration, the subjects not included in the bootstrapped set were used as the out-of-bag set (i.e. test set).
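The construction of the bootstrapped and out-of-bag sets can be sketched as follows. The function name is hypothetical, and the sketch samples subjects without stratifying by group, which is an assumption since the text does not say whether sampling was stratified.

```python
import random


def bootstrap_split(n_subjects, rng):
    """Sample subject indices with replacement; unsampled indices form the test set."""
    in_bag = [rng.randrange(n_subjects) for _ in range(n_subjects)]
    chosen = set(in_bag)
    out_of_bag = [index for index in range(n_subjects) if index not in chosen]
    return in_bag, out_of_bag


# ADNI example from the text: 212 HC + 64 AD = 276 subjects.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(276, rng)
```

Because sampling is done with replacement, some subjects appear several times in the bootstrapped set and the remaining subjects, typically about a third of the dataset, are left for testing.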
Next, we obtained the relative brain region volumes of each subject by dividing the original volume by the total intracranial volume. Then, we normalised the values of the relative brain volumes across the subjects of the bootstrapped set. In this normalisation step, we subtracted the median value of each brain region and scaled the data according to the interquartile range. Centering and scaling were done independently for each brain region. The same statistics (median and interquartile range) were later used to normalise the out-of-bag set.
To perform the classification analysis, we used a relevance vector machine (RVM) 39 with a linear kernel. The RVM is a Bayesian treatment of identical functional form to the Support Vector Machine (SVM) 40. One advantage of the RVM over the SVM is that it is not necessary to estimate the error/margin trade-off parameter 'C'. After we trained the RVM on the bootstrapped set, we used the model to obtain the predicted probability of a subject belonging to the patient class. Using these probabilities, we calculated two AUC values, one for the bootstrapped set (called the "resubstitution" metric) and one for the test set (called the "out-of-bag" metric). By using the 0.632+ bootstrap method, we minimised the optimistic and pessimistic bias of the estimate and obtained the AUC value (Eq. 2).
$$\mathrm{AUC}_{0.632+} = \frac{1}{b} \sum_{i=1}^{b} \left[(1 - \omega)\,\mathrm{AUC}^{\text{resubstitution}}_{i} + \omega\,\mathrm{AUC}^{\text{out-of-bag}}_{i}\right] \quad (2)$$
where $b$ was the number of iterations and the weight $\omega$ was defined considering the relative overfitting rate (full description in Efron and Tibshirani, 1997). To obtain the confidence interval (CI; 95% confidence), we used the percentile method 41. Next, we compared these confidence intervals with the AUC obtained during the normative approach.
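The weighting between the resubstitution and out-of-bag AUCs can be illustrated with a rough sketch. It adapts the error-rate formulation of Efron and Tibshirani to the AUC under several assumptions that are not stated in the text: the no-information AUC is taken as 0.5, the relative overfitting rate is computed per iteration, and the out-of-bag AUC is floored at the no-information value. The function names are hypothetical.

```python
def auc_632_plus(auc_resub, auc_oob, no_info_auc=0.5):
    """Combine resubstitution and out-of-bag AUCs for one bootstrap iteration.

    Assumption: this mirrors the 0.632+ error-rate estimator, with 0.5 as the
    no-information AUC; it is a sketch, not the study's implementation.
    """
    oob = max(auc_oob, no_info_auc)
    if auc_resub > oob and auc_resub > no_info_auc:
        # Relative overfitting rate: 0 means no overfitting, 1 means the
        # out-of-bag performance is no better than chance.
        overfit = (auc_resub - oob) / (auc_resub - no_info_auc)
    else:
        overfit = 0.0
    weight = 0.632 / (1.0 - 0.368 * overfit)  # ranges from 0.632 to 1
    return (1.0 - weight) * auc_resub + weight * oob


def mean_auc_632_plus(resub_aucs, oob_aucs):
    """Average the per-iteration estimates over all bootstrap iterations."""
    pairs = list(zip(resub_aucs, oob_aucs))
    return sum(auc_632_plus(resub, oob) for resub, oob in pairs) / len(pairs)


# Hypothetical AUCs from two bootstrap iterations.
estimate = mean_auc_632_plus([1.0, 0.9], [0.8, 0.9])
```

When the classifier does not overfit, the weight stays at 0.632; as overfitting increases, the weight moves towards 1 and the estimate relies more heavily on the out-of-bag AUC.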
Finally, we compared the generalization of the classifiers with the results of the autoencoders. In this analysis, we used each trained classifier to predict the group of the subjects from the other clinical datasets. In order to verify if the performance in the independent datasets was significantly different from the normative approach, we calculated the difference between the AUCs of this generalization analysis and the AUCs of the autoencoders. With the 1,000 measures of the difference, we calculated its confidence interval (95% confidence) to verify if this difference is different from zero.

Experiments.
We conducted our experiments in Python 3 using the TensorFlow 2.0 library (https://www.tensorflow.org/) and the sklearn_rvm library (https://github.com/Mind-the-Pineapple/sklearn-rvm) developed by Baecker et al. 42. We have made the code and trained models used in this study publicly available at https://github.com/Warvito/Normative-modelling-using-deep-autoencoders. A Google Colaboratory notebook that calculates the deviation scores of new data is available at https://colab.research.google.com/github/Warvito/Normative-modelling-using-deep-autoencoders/blob/master/notebooks/predict.ipynb.

Results
Comparison of deviation values for healthy controls and patients.
Figure 2 shows the mean value of the observed deviation for each group. For the ADNI dataset, we found a mean value of 0.28 ( [...] When we examined the confidence intervals of the observed deviations, we found that the five independent datasets presented mean deviation scores that differed significantly between groups, with the exception of the comparison between HC and EMCI in the ADNI dataset (difference range [-0.03, 0.00]) and the comparison between MCI and AD in the AIBL dataset (difference range [-0.09, 0.00]) (more details can be found in the supplementary materials).
Normative model performance in discriminative tasks. We examined whether the observed deviations could be used to predict if a person belonged to the patient or HC group (Fig. 3) using ROC curves. This revealed that the generated deviation values reflected the severity of the disease. Specifically, based on the AUC, it was possible to discriminate patients with AD vs HC better than patients with MCI vs HC, and to discriminate patients with LMCI vs HC better than patients with EMCI vs HC.

Brain regions deviations.
Figure 4 presents the Cliff's delta of each brain region when comparing its deviation in the HC group against the deviation in the patient groups. Only the regions with effect sizes significantly different from zero are shown (complete list presented in the supplementary materials). Among the regions showing significant deviation in patients with AD, we found the lateral ventricles, temporal horns, hippocampus, entorhinal cortex, parahippocampal cortex, and amygdala. A number of these regions also showed a high deviation in patients with MCI, including the lateral ventricles and hippocampus. Finally, we also noted that effect sizes were smaller for the regions identified in patients with MCI relative to those identified in patients with AD.
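The Cliff's delta effect size reported for each brain region can be computed from two groups of regional deviations as sketched below. The function name, the toy values and the direction of the comparison (patients relative to HC) are assumptions for illustration.

```python
def cliffs_delta(group_a, group_b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs, ranging from -1 to 1."""
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


# Hypothetical regional deviations: every patient value exceeds every HC value.
delta = cliffs_delta(group_a=[0.8, 0.9, 1.1], group_b=[0.1, 0.2, 0.3])
```

Because the measure depends only on the ordering of the values, it does not assume that the deviations are normally distributed.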
Traditional machine learning classification. Using   To identify significant differences between the performance of the normative models and traditional classifiers, we calculated the confidence interval (95% of confidence) of the difference in AUC between the two methods. The traditional classifiers were superior to the normative models when predicting the difference between the groups in the ADNI dataset and the difference between HC and AD in the AIBL dataset; in contrast the performance of the two approaches was comparable for all other comparisons (more details can be found in the supplementary materials).
Finally, we examined how a classifier trained on a certain dataset would perform when applied to other datasets (i.e. cross-cohort generalizability). The results of this examination are presented in Tables 3 and 4. When predicting AD, the classifiers had a higher mean performance than the normative approach in most cases (except when the model was trained on the MIRIAD dataset and evaluated on the ARWiBo dataset). However, the difference was not statistically significant in almost half of the cases. When predicting MCI, the classifiers presented a lower mean performance in all cases, but the difference was not statistically significant.

Discussion
In this study, we evaluated the performance of the normative modelling approach based on deep autoencoders on data from patients with MCI and AD. Consistent with our first hypothesis, we found that the approach was effective in generating deviation values that reflect the severity of the disease, with patients with AD showing higher deviations than patients with MCI, and patients with LMCI showing larger deviations than patients with EMCI. We also measured how much each brain region deviated from normality and contributed to the observed deviation. Here, we found that regions from the ventricular system and medial temporal lobe were among those making the greatest significant contribution to deviation, consistent with our second hypothesis. Finally, we compared the performance of the normative approach versus a traditional classification approach. Although a higher performance was found for traditional classifiers in most cases, the difference was not statistically significant in the majority of cases.
We have replicated previous findings that the autoencoder is capable of detecting neuroanatomical deviation in individuals with brain disorders 5. In particular, in each of our five independent datasets, the normative model was able to assign higher values to patients with AD than to healthy controls. This pattern was expected since the disorder is associated with profound alterations in brain morphometry which were not present in the training set 13,14. In addition, we have expanded these findings by demonstrating for the first time that autoencoders are capable of discriminating between different stages of disease progression (i.e. EMCI versus LMCI versus AD). In particular, we observed that the MCI group presented intermediate deviation values in three independent datasets (ADNI, AIBL and ARWiBo). These values were also expected since MCI is considered a transitional stage between HC and AD 43 and usually presents less brain atrophy than AD 44. In addition, within the ADNI dataset, the MCI subjects were divided into two categories, EMCI and LMCI. Although individuals in both stages meet the conventional criteria for MCI, EMCI is associated with less pronounced symptoms thought to reflect an earlier point in the clinical spectrum than LMCI. In our analyses, we found that the patients with LMCI had a significantly larger deviation than patients with EMCI (i.e. the confidence interval of the difference between the groups did not include zero), providing further confirmation that deep autoencoders are capable of discriminating between different stages of the disease course.
With the autoencoder-based approach, it was possible to identify the brain regions with the highest deviations from the expected normative values. Consistent with our second hypothesis, the AD group showed high levels of deviation in structures that are part of the ventricular system (such as the lateral ventricles, temporal horns, and 3rd ventricle) and in the medial temporal cortex, including the hippocampus, entorhinal cortex, parahippocampal cortex, and amygdala. Progressive ventricular expansion is one of the most reliable morphological changes in dementia patients, reflecting the increasing atrophy of the brain 45. Likewise, medial temporal cortex atrophy is among the most consistent findings in neuroimaging studies of AD 13,46 and an established marker of AD 47. While deviations in the MCI group had smaller effect sizes than those in the AD group, there was a high degree of overlap in the hippocampus, parahippocampal cortex and several temporoparietal regions, consistent with previous neuroimaging studies of MCI 48-51. The smaller effect size in MCI might be explained by two (not mutually exclusive) factors: (i) an earlier stage in the AD course, hence milder atrophy, and (ii) the heterogeneity of the MCI construct. Since MCI patients were not selected based on AD biomarkers (i.e., presence of beta-amyloid and tau protein in the cerebrospinal fluid) 52, this group will likely include a mixture of AD and non-AD cases, hence the milder/diluted effect.
Table 3. Generalization performance of the classifiers for the classification between HC and patients with Alzheimer's disease. In this table, the rows indicate the dataset on which the classifier was trained and the columns indicate the dataset on which the performance was tested. The area under the receiver operating characteristic curve is shown with the upper and lower bounds of its 95% confidence interval. Performance significantly different from the normative approach, calculated using the confidence interval of the difference between the approaches across the bootstrap scheme, is indicated by "*".
Finally, we compared the performance of our normative approach with traditional classifiers. The performance of the classifiers was measured in two schemes: on data from the same dataset where the model was trained, and on data from independent clinical datasets (generalization performance). Although the traditional classifiers had a better mean performance in most cases, the differences between the two approaches were not statistically significant in most of the cases, especially when predicting the subjects from the ARWiBo, OASIS-1 and MIRIAD datasets. This similarity was more evident during the prediction of the patients with MCI (with the exception of the ADNI dataset).
Although we evaluated our method using a range of different datasets, we did not assess the impact of MRI scanners and acquisition parameters. Recent studies have shown that these variables can have a measurable impact on the performance of machine learning models, highlighting the importance of inter-scanner harmonisation 53,54. In particular, MRI scanners and acquisition parameters have been shown to influence the results not only in traditional machine learning classification but also in normative modelling 55. For this reason, further studies are needed to analyse the influence of inter-scanner harmonisation, which can be implemented using tools such as Neuroharmony 54 or ComBat 56, on the performance of autoencoder-based methods.
Unlike a case-control approach, the normative approach does not need to be trained on a dataset that is reasonably balanced between HC and patient groups. It is trained using only healthy controls, which enables the use of large cohorts of HC participants 1,9, such as the UK Biobank and the Human Connectome Project 57. Our approach is not linked with any labels during training; this enables its application to an array of clinical tasks (including diagnosis, prognosis, treatment selection and mechanistic inference) for any brain disorder without the necessity of re-training or fine-tuning. Finally, since our approach involves anomaly detection, it can also work cooperatively with conventional discriminative models to identify and mitigate circumstances where supervised methods could fail catastrophically due to a test example very distinct from the training set ("out-of-distribution" examples). In order to promote open science, we have made all scripts and the trained models available to the wider research community (https://github.com/Warvito/Normative-modelling-using-deep-autoencoders).