Introduction

Non-invasive magnetic resonance (MR) brain imaging can support the quantitative characterization of neurological conditions such as Alzheimer’s disease (AD). MR imaging can provide informative biomarkers even before clinical symptoms are apparent or irreversible neuronal damage has occurred1,2. The diagnostic potential of biomarkers based on structural imaging has been outlined by Frisoni et al.3 and Klöppel et al.4. Automatically extracted biomarkers can provide diagnostic decision support, increase objectivity in the disease assessment and improve differential diagnosis3,4,5,6. Another important avenue is the use of biomarkers for AD screening or for enrolling suitable participants for pharmaceutical trials4,7. Clinical trials can also benefit from MR biomarkers as they enable enrichment strategies8 or more rigorous inclusion criteria, leading to more homogeneous study groups1,2,3,4.

Subjects with mild cognitive impairment (MCI) do not fulfil the diagnostic criteria for AD9, but are at increased risk of developing AD10. Predicting conversion to AD is of particular importance to patients, clinicians and caregivers, but also for clinical trials4,9,10. An illustration of anatomical changes over a period of two years is shown in Fig. 1 for a healthy control (HC) subject, a progressive MCI (pMCI) subject converting to AD and a patient with AD. Modern neuroimaging can help to improve the accuracy of MCI diagnosis by adding positive predictive value when combined with other diagnostic criteria5,9,11,12. It is, however, uncertain whether information based on individual brain structures is sufficient to fully characterize the complex progression of AD or even to enable a differential dementia diagnosis7,13. Recent studies further suggest that structural MR imaging in combination with other diagnostic procedures, such as positron emission tomography (PET) or chemical analysis of cerebrospinal fluid (CSF), can detect pathological AD-related change years before the onset of AD dementia3,14. Many studies have shown that with progression of the disease, there is significant atrophy in structures of the medial temporal lobe (MTL) such as the hippocampus, amygdala, and entorhinal and parahippocampal cortices4,15,16,17. In the future, structural MR imaging will thus play an important role not only in the diagnosis of AD, but also in monitoring its treatment2,3,4. The development of automatic, robust, quantitative techniques to assess MR images of the brain is therefore an important factor to further increase the utility of structural imaging in the context of neurocognitive disorders.

Figure 1

Three examples of MR images (brain-extracted) of subjects from the ADNI1 cohort in coronal section. Top row: a healthy control subject (male, 84.8 years at baseline); middle row: MCI subject (female, 71.8 years at baseline) who converted to AD after three years; bottom row: an AD patient (male, 77.5 years at baseline). Left: baseline; middle: 2-year follow-up; right: baseline with overlaid difference image of rigidly aligned images (blue: volume loss/atrophy, red: positive volume change). The differences are visually subtle, but the increased atrophy in the medial temporal lobe and the enlarged ventricles are apparent in the difference image.

A vast number of studies have shown correlations between quantitative measures calculated from brain MR images and AD progression. Automatic methods perform similarly to trained radiologists when classifying MR images of patients with AD5. There is strong evidence that different anatomical brain structures are affected at different stages of the disease18, with early involvement of the hippocampus, amygdala and entorhinal cortex consistently shown in the literature9,15,17. Although sensitive to dementia, these markers may not yet be sufficiently specific to AD3,12. A uniform approach that considers pathological changes of multiple structures within the whole brain promises to increase specificity in dementia diagnosis and to support differential assessment of various types of dementia3,13. It is thus desirable to follow a holistic approach and to analyse a large number of structures of the whole brain rather than only a limited selection of brain structures. Because the hippocampus is involved early in dementia, many published methods focus on segmenting it to quantify its volume or shape11,18,19,20,21,22,23,24,25,26,27,28,29. Grey matter (GM) tissue maps10,13 and cortical thickness have also been shown to be of high predictive value in the context of AD12,30. Other approaches are based on voxel-based morphometry (VBM)31, deformation-based morphometry (DBM)32, or tensor-based morphometry (TBM)33 to study group differences. In general, methods exploring the whole brain outperform those focusing on individual structures such as the hippocampus12,34.

Many studies have shown that biomarkers of morphometry, such as volume or shape, correlate with AD progression. However, intersubject variability can lead to substantial overlap with the healthy population and thus limit the discriminative power of these features21. Temporal change of the whole brain or individual structures tends to be more consistent between subjects. Atrophy rates are usually given as a percentage per year. Popular approaches rely on 3D + t optimization using graph cuts22, expectation maximization35, or the boundary shift integral21,23,36. The measurable increase in GM atrophy in patients with AD is a consequence of a substantially accelerated, regionally selective loss of neurons20,37. Most of the published studies can, however, only be compared qualitatively. In many cases, different features, a different methodology to extract the features and different classification techniques are used7. In addition, methods are applied to different cohorts or different subsets thereof. Most studies lack histopathologically confirmed ground truth diagnoses. Subjects are not simply either healthy or diseased: AD is a dynamic process in which biomarkers gradually begin to change before current diagnostic criteria are met38,39,40,41,42,43,44. Advances in machine learning, e.g. Gaussian process modelling, have also introduced novel opportunities for personalized healthcare, shifting from “one-size-fits-all” population modeling towards personalized models45,46,47,48.

Of particular importance for routine use in clinical practice is the interpretability of biomarkers7. Many recently developed methods rely heavily on machine learning techniques, e.g. learned manifolds34,49, multiple instance learning50, or region grading51. Even though these methods are often highly accurate, the interpretation of their results can be difficult, and this impedes their adoption into clinical practice. Thus, it is desirable to calculate biomarkers that are easy to interpret, but at the same time as informative as features obtained through such sophisticated machine learning techniques. A further overview can be found in numerous surveys1,3,7,12,52.

In this manuscript we employ multi-atlas label propagation with expectation-maximisation based refinement (MALPEM)53, a state-of-the-art automatic segmentation method for robust segmentation of whole-brain MR images into 138 distinct anatomical structures. Johnson et al.54 recently validated a number of established segmentation methods (SPM31,55, ANTs Atropos56, MALP-EM53,57, FSL FAST58, FreeSurfer59) in the context of Huntington’s disease and found that “MALP-EM appeared to be the most visually accurate tool, […]”. In 2015, an entry based on MALPEM won a third prize in the CADDementia disease classification challenge held in conjunction with MICCAI52. We applied MALPEM to a set of 5074 images of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort with the goal to identify biomarkers that characterize the whole brain, specifically structural volumes and atrophy rates. Our main contributions are:

  • We confirm the accuracy and robustness of MALPEM/MALPEM4D in a cross-sectional/longitudinal study based on a large number of images from the ADNI database.

  • We assess the quality of extracted biomarkers with a clear clinical interpretation (volumes/atrophy rates) and show that their discriminative value is on par with published literature in the context of AD.

  • We share the employed methodology for brain extraction (pincram) and segmentation (MALPEM, MALPEM4D) as open source projects. The validated methodology can readily be applied to other imaging cohorts.

  • We share quality-checked brain masks, structural segmentations and extracted biomarkers for 5074 ADNI images with the community. This resource holds great potential for a deeper analysis and enables training of sophisticated model-based approaches by interested research groups.

Results

Cross-sectional analysis

In a cross-sectional analysis, we investigated the potential to discriminate AD disease stages based on the volumes of individual brain regions. An example segmentation result of a healthy control subject and a patient diagnosed with AD is shown in Fig. 2. The distribution of the measured volumes of six selected structures is shown in Fig. 3 for the four disease groups: HC, stable MCI (sMCI), pMCI and AD. The reduced GM volume of structures in the medial temporal lobe and the increased ventricular volume in patients with AD are apparent. All cross-sectional volumetric measurements were corrected for the nuisance variables subject age, gender, and intracranial volume. In the following, these corrected volumes are used to investigate their potential to classify relevant disease stages in AD.

Figure 2

Example cross-sectional segmentation results of a patient diagnosed with AD (ADNI_018_S_0286, male, 66 years of age) in axial (left), coronal (middle) and sagittal (right) view-plane.

Figure 3

Boxplots of structural volumes at baseline for six selected structures before correcting for nuisance factors for distinct disease groups. Structures were selected based on their performance in classifying the investigated disease groups (cf. Table 1).

Disease classification using structural volumes

An overview of the results of the classification experiments is given in Table 1. The volume of the amygdala allows the strongest distinction between AD and HC with an effect size (Cohen’s d) of 1.56 and a balanced classification accuracy (bACC) of 80%, a sensitivity (SENS) of 76% and a specificity (SPEC) of 84%. Other structures in the MTL region such as the hippocampus (bACC: 78%) or the entorhinal area (bACC: 78%) were similarly discriminative (d > 1.5). Established findings that total brain volume, and cortical grey matter in particular, are smaller in patients with AD were confirmed in our experiments. Accuracy was further improved by combining all structural volumes as features in a support vector machine (SVM; bACC: 89%, SENS: 86%, SPEC: 92%) or random forest (RF; bACC: 86%, SENS: 83%, SPEC: 90%). As indicated in Table 1, numerous structural differences were highly significant even after Bonferroni correction for multiple comparisons. The structural volumes of the amygdala (bACC: 65%, SENS: 63%, SPEC: 68%) and inferior lateral ventricle (bACC: 64%, SENS: 54%, SPEC: 74%) were most discriminative for the classification of pMCI vs. sMCI. Combining all extracted structural volumes into a single RF classifier further increased classification accuracies (bACC: 68%, SENS: 72%, SPEC: 64%). The classification results for all structures can be found in the supplementary material.

Table 1 Classification results in % (6-fold cross-validation, LDA 100 runs, RF/SVM 20 runs) obtained distinguishing between AD and HC (top) and sMCI from pMCI (bottom).

Longitudinal analysis

In a further longitudinal analysis we investigate the volume change of individual anatomical regions with respect to disease stage. Atrophy was measured for the month-12 (m12) or month-24 (m24) follow-up images with respect to their corresponding baseline (bl) images. In the following, we show atrophy rates with their corresponding sample sizes as well as discriminative power to distinguish between disease stages. Sections from an example segmentation obtained on a subject diagnosed with AD are shown in Fig. 4.
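The volume change of a structure between two time points can be illustrated with a few lines of Python. This is a minimal sketch under the assumption that change is expressed relative to the baseline volume; the function name and the example volumes are hypothetical and not taken from the study.

```python
def percent_volume_change(baseline_volume: float, followup_volume: float) -> float:
    """Volume change of one structure in percent of its baseline volume.

    Negative values indicate volume loss (atrophy), positive values
    indicate enlargement (e.g. of the ventricles).
    """
    if baseline_volume <= 0:
        raise ValueError("baseline volume must be positive")
    return 100.0 * (followup_volume - baseline_volume) / baseline_volume


# Hypothetical volumes in mm^3, chosen only for illustration.
hippocampus_change = percent_volume_change(3000.0, 2850.0)    # volume loss
ventricle_change = percent_volume_change(40000.0, 42000.0)    # enlargement
```

With this sign convention, atrophy of grey matter structures appears as a negative change and ventricular expansion as a positive change, matching how the values are reported in the following sections.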

Figure 4

Example longitudinal segmentation results of baseline (left) and month 24 (middle) follow-up images of a patient diagnosed with AD (ADNI_018_S_0286) in coronal section. Substantial hippocampal atrophy (measured: −7.81%) and ventricular enlargement (16.5%) are apparent in the difference image after affine registration (right).

Atrophy rates and sample sizes

Atrophy rates for selected structures can be found in Table 2 as well as in Table 3 for AD/HC and in Table 4 for pMCI/sMCI. The distribution of the volume change of six selected structures is shown in Fig. 5 for the four investigated clinical groups (HC, sMCI, pMCI and AD). In AD patients, GM structures such as the hippocampus (HC: −1.1%, AD: −4.8%) or the medial temporal gyrus (HC: −1.1%, AD: −3.8%) are subject to significant volume loss between baseline and m12. Concurrently, ventricles expand rapidly in AD patients (HC: 2.8%, AD: 7.2%). Overall, sMCI subjects show atrophy patterns similar to HC, while atrophy patterns in MCI subjects converting to AD are similar to those in AD patients. The amounts of atrophy measured between baseline and m24 are on the order of twice those measured between baseline and m12. The results indicate that the atrophy rate in the entorhinal region is slightly higher in pMCI subjects than in patients with AD. A volume change rate of −4.0% (±3.7) for pMCI subjects and −3.7% (±4.1) for AD patients was measured at m12. At m24, the corresponding volume change rates were −7.4% (±4.7) for pMCI subjects and −7.0% (±5.5) for AD patients.

Table 2 Mean volume change of selected structures in % with corresponding sample sizes for different clinical groups. Standard deviation in parentheses.
Table 3 Classification results in % (6-fold cross-validation, LDA 100 runs, RF/SVM 20 runs) for distinguishing between AD and HC based on volume change from baseline to m12 (top) or m24 (bottom). Individual structures are sorted by effect size. The 5 structures with largest effect size are listed explicitly. Significant group differences indicated by + (p < 0.05) and ++ (p < 0.001). Bonferroni-corrected significance in parentheses.
Table 4 Classification results in % (6-fold cross-validation, LDA 100 runs, RF/SVM 20 runs) for distinguishing between pMCI and sMCI based on volume change from baseline to m12 (top) or m24 (bottom). Individual structures are sorted by effect size. The 5 structures with largest effect size are listed explicitly. Significant group differences indicated by + (p < 0.05) and ++ (p < 0.001). Bonferroni-corrected significance in parentheses.
Figure 5

Boxplots of volume changes for selected brain structures (top) and surrogate structures (bottom) from baseline to month 24 follow-up image for different clinical groups. Features were selected based on their performance in classifying the investigated disease groups (cf. Tables 3 and 4).

Based on the atrophy rates, sample sizes were calculated to detect a 25% change in atrophy rate with 80% power at a 5% significance level. The atrophy rates and sample sizes for selected structures are shown in Table 2. The smallest sample sizes (corrected for normal aging) were computed for the inferior lateral ventricles with 228 subjects for bl → m12 and 126 subjects for bl → m24. The measured atrophy rates of other structures such as the hippocampus or the medial temporal gyrus yielded sample sizes at a similar level. Atrophy rates for all investigated structures can be found in the supplementary material.
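How such a sample size estimate can be obtained is sketched below. The text does not spell out the formula, so the code assumes the common two-arm normal approximation, n = 2 (z_{1−α/2} + z_{power})² σ² / Δ², with Δ taken as 25% of the patient atrophy rate in excess of the rate seen in healthy controls (the correction for normal aging). The function name and the choice of formula are assumptions made for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(mean_patient: float, sd_patient: float,
                        mean_control: float = 0.0, reduction: float = 0.25,
                        power: float = 0.80, alpha: float = 0.05) -> int:
    """Subjects per arm needed to detect a relative reduction in atrophy rate.

    Assumed formula (two-arm normal approximation, two-sided test):
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2
    where delta = reduction * (mean_patient - mean_control).
    Setting mean_control to the rate in healthy controls corrects for the
    atrophy attributable to normal aging.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    delta = reduction * (mean_patient - mean_control)
    if delta == 0:
        raise ValueError("patient and control rates must differ")
    return ceil(2.0 * (z_alpha + z_power) ** 2 * sd_patient ** 2 / delta ** 2)


# Hippocampal change bl -> m12 as reported in the text:
# AD: -4.8% (SD 3.7), HC: -1.1%.
n_aging_corrected = sample_size_per_arm(-4.8, 3.7, mean_control=-1.1)
n_uncorrected = sample_size_per_arm(-4.8, 3.7)
```

Correcting for normal aging shrinks the detectable effect Δ and therefore increases the required sample size. Numbers produced by this sketch need not match Table 2 exactly if the original computation used a different formula.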

Disease classification using structural volume change

We further investigated the potential of structural atrophy rates to distinguish between the clinical groups AD vs. HC and pMCI vs. sMCI. P-values, effect sizes and classification accuracies were calculated to quantify group separation. The results are shown in Table 3 (AD vs. HC) and Table 4 (pMCI vs. sMCI).

The most discriminative structure to distinguish between AD and HC was the hippocampus (dm12 = 1.40, dm24 = 1.88). Based on m12 atrophy, a bACCm12 of 78% (SENS: 67%, SPEC: 88%) was calculated; based on m24 atrophy, the bACCm24 was 85%. Combining all derived atrophy features in an RF classifier substantially increased classification accuracy to bACCm12: 84%; bACCm24: 88%. Hippocampal atrophy was also a very good feature for classifying progressive versus stable MCI subjects: bACCm12: 67%; bACCm24: 74%. However, the highest individual classification accuracy was obtained at m12 for the medial temporal gyrus (bACCm12: 70%, SENS: 65%, SPEC: 75%) and the inferior lateral ventricles (bACCm12: 69%, SENS: 63%, SPEC: 75%). Over 24 months, atrophy in the fusiform gyrus was most informative for classifying MCI subjects (bACCm24: 76%). Exploiting all available longitudinal features in an RF classifier increased MCI classification accuracy to bACCm12: 73%; bACCm24: 78%. Ventricular enlargement is more discriminative than a reduction in brain tissue for both AD versus HC and pMCI versus sMCI classification. An overview of all considered features and their individual classification performance can be found in the supplementary material.

Discussion

In this study, sets of 1069 baseline, 802 m12, and 532 m24 follow-up images from the ADNI-1/-GO/-2 cohort were analyzed. Considering the size and heterogeneity of the database, we expect our findings to be applicable to other cohorts.

A selection of articles is listed in Table 5 to present our classification results in the context of those reported in the literature. When comparing results to other studies it must be noted that these did not use identical data subsets from the ADNI cohorts. A further potential confounding factor is the definition of the sMCI and pMCI disease groups. In our study, the stratification of MCI subjects into sMCI and pMCI is well-defined. However, this definition is not identical across published studies.

Table 5 Overview of selected articles that use features from T1w MR images from the ADNI cohort. Table adapted from Falahati et al.7. CTH: cortical thickness, ENR: elastic net regression, HV: hippocampus, LLE: locally linear embedding, MBL: manifold-based learning, MIL: multiple instance learning, OPLS: orthogonal partial least square to latent structure, SR: sparse regression, TBM: tensor-based morphometry.

Our cross-sectional classification results are very similar to those presented in Wolz et al.34, which are based on the ADNI-1 cohort. Wolz et al.34 classified AD vs. HC (bACC: 89%, SENS: 93%, SPEC: 85%) and pMCI vs. sMCI (bACC: 68%, SENS: 67%, SPEC: 69%) based on a multitude of features, including more abstract criteria derived from TBM and manifold-learning based methods. Other studies report even higher classification results of up to bACC: 92.4% for AD vs. HC and bACC: 74.0% for pMCI vs. sMCI classification60. Unlike our study, most studies shown in Table 5 analyzed the 1.5 Tesla (T) images of ADNI-1 only. These studies also usually employ more complex features, such as cortical features34,60,61,62, textural features63, manifold-based features34,61 or grading-based features28. In summary, the results presented in this study are comparable to the state of the art. This is encouraging, as we analyzed a large and heterogeneous dataset acquired at both 1.5T and 3T using structural volumes only, which are features with clear biological interpretations. Furthermore, structures that were found to be most discriminative agree well with those highlighted in Fennema et al.2. Examples are the hippocampus, amygdala, entorhinal area, and regions within the temporal gyrus in general.

Based on hippocampal atrophy alone, a bACCm12 of 78% and a bACCm24 of 85% were calculated for classifying AD vs. HC. These results are similar to those obtained on a different ADNI subset with a method dedicated to hippocampal atrophy measurement22: bACCm12 of 82% (SENS: 81%, SPEC: 83%); bACCm24 of 86% (SENS: 85%, SPEC: 87%). The results for AD vs. HC classification using all longitudinal features are also on par with those using all structural volumes at baseline. However, classification accuracies substantially above 90% were not expected due to potential bias in the study data caused, for example, by diagnostic misclassification, variations in scanner type and field strength, as well as possible remnant differences between participating centres that even strict protocols such as ADNI’s cannot prevent. Using longitudinal information available at the month 24 follow-up visit increased accuracy for classifying sMCI vs. pMCI groups from 68% at baseline to 78%. This confirms the discriminative value of higher structural atrophy rates in MCI subjects who progress to AD. Unlike in patients with AD, atrophy in MCI subjects has not yet manifested itself in substantially reduced structural volumes at baseline.

In Barnes et al.64 the authors concluded in a large meta-analysis that the annualized hippocampal volume change of healthy elderly people is −1.4% compared with −4.6% for patients with AD. Our results are similar: −1.1% (±1.7) for HC and −4.8% (±3.7) for AD subjects. A mean change of −0.5% (±0.8) and −1.3% (±1.1) from bl to m12 was observed for HC and AD groups for brain tissue. These results are also in line with previously published annual brain volume change rates of around −0.6% for controls and −1.5% for AD patients65. This confirms that the employed methodology yields realistic atrophy measurements on individual structures, while providing a comprehensive overview of structural change throughout the whole brain.

For the four structures hippocampus, amygdala, inferior lateral ventricle, and lateral ventricle, their respective counterparts in the left/right brain hemisphere were analyzed separately. The results suggest that structural change in the left hippocampus is slightly more discriminative than change in the right hippocampus. There is no consistent trend for the other investigated structures. In general, features of left-right paired structures perform similarly for all investigated structures and combining them seems, in summary, beneficial.

The use of either SVM or RF classifiers provided substantial improvements over the results obtained using individual features only. Overall SVM and RF performed similarly, with slight advantages for one or the other in individual experiments.

Our experiments confirm that MALPEM is an accurate and sensitive approach for brain image analysis. One of MALPEM’s main advantages is that it delivers a full morphometric analysis of all 138 structures, unlike specialized methods that only work on a small selected set of individual structures. Another important strength is that our methods allow the accurate analysis of both single images (MALPEM) and image series (MALPEM4D). Thus, the presented methodology has strong potential to support both cross-sectional and longitudinal studies that include MR imaging of the brain.

As part of this work, we created a morphometry database of unprecedented size and accuracy, which we share with the community. This database provides pincram brain extractions and MALPEM segmentations of 5074 MR images, as well as longitudinal features extracted from 1334 MR image series. In previous work, we shared a database built on images from ADNI-166. The present development differs from this past effort in important ways that reflect developments of ADNI (data from ADNI-2 and ADNI-GO became available in the meantime) as well as software improvements (MALPEM yields more accurate segmentations than the MAPER method67 used in the previous project). The previous database was substantially smaller and contained only cross-sectional data (996 baseline and screening images). Another difference is in the atlas database chosen: compared to the atlases used for the previous resource68,69, the NMM atlases offer more detailed cortical subdivisions.

We anticipate that our new morphometry database will be an immensely valuable resource for future research on classification and modeling approaches. It can further enable the optimization of training-data-intensive deep learning methodologies.

Methods

Materials

For this study a subset of T1-weighted (T1w) MR brain images was analyzed from all studies by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) for which data are currently available (ADNI-1/-GO/-2). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. ADNI enrols participants between the ages of 55 and 90 who are recruited at 57 sites in the United States and Canada. After obtaining informed consent, participants undergo a series of initial tests that are repeated at intervals over subsequent years, including a clinical evaluation, neuropsychological tests, genetic testing, lumbar puncture, and MRI and PET scans. Participants were studied under ADNI protocols that were approved by the Institutional Review Board (IRB) at each recruitment site. A listing of sites with named Site Investigators can be found online at https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf (last accessed 30 June 2018). ADNI procedures manuals were developed as a resource for ADNI research sites. They are tailored to site Study Coordinator and support staff. The ADNI MRI Technical Procedures manuals are developed by the ADNI Imaging Core to direct MRI technicians in the scanning of ADNI subjects. For further details and up-to-date information please refer to the supplementary material and http://www.adni-info.org (last accessed 15 March 2018). All methods described in this study were performed in accordance with relevant guidelines and regulations. When the present study was started, we retrieved clinical information and corresponding MR images based on the ADNIMERGE package. Specifically, 5074 (1674 baseline, 3400 follow-up) images were processed with MALPEM. Subjects had been scanned up to 10 times, with the last follow-up image acquired 8 years after the baseline. For the present analysis, a subset of these 5074 processed images was selected based on clinical information. 
We applied the following criteria for inclusion/exclusion of individual subjects:

  • All subjects who reverted at any time point from a more severe to a less severe disease stage, i.e. AD → MCI or MCI → HC, were excluded from both the cross-sectional (N = 68) and the longitudinal analysis.

  • All subjects with baseline diagnosis early MCI (N = 277) or SMC (N = 76) were excluded from both the cross-sectional and the longitudinal analysis.

  • The sMCI group was defined as those subjects who were diagnosed as MCI (called ‘late MCI’ in ADNI-GO/-2) at baseline and remained at the MCI stage for at least two years and until the most recent diagnosis which was available. This means that subjects for whom this information was not available, e.g. because the repository did not contain a corresponding m24 or later visit, were excluded (N = 130).

  • The pMCI group was defined as those subjects who were diagnosed as MCI at baseline and converted within two years follow-up to a diagnosis of probable AD.

  • Subjects who were diagnosed as MCI at baseline but converted to probable AD more than two years later (N = 54) were excluded from the analyses and neither considered as sMCI nor pMCI.

  • The m12 image (I373205) of one subject (ADNI_007_S_4568) was reviewed after irregular volume measurements and excluded manually from the analysis due to poor image quality.

  • All subjects listed in Table 6 fulfil the above criteria. For the longitudinal analysis, however, all subjects that converted at any time of the study from HC to a symptomatic stage (e.g. to MCI or even to AD) were excluded (N = 52).
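The stratification of baseline-MCI subjects described by the criteria above can be sketched as a small decision function. This is an illustrative reading of the listed rules, not the code used in the study: the data structure (a mapping from visit month to diagnosis), the function name, and the label strings are hypothetical, and the exclusions for early MCI, SMC and image quality are not modelled.

```python
from typing import Dict, Optional

SEVERITY = {"HC": 0, "MCI": 1, "AD": 2}


def stratify_mci(diagnosis_by_month: Dict[int, str]) -> Optional[str]:
    """Return 'sMCI', 'pMCI', or None (excluded) for a subject with baseline MCI.

    diagnosis_by_month maps the visit month (0 = baseline) to 'HC', 'MCI' or 'AD'.
    """
    months = sorted(diagnosis_by_month)
    if not months or months[0] != 0 or diagnosis_by_month[0] != "MCI":
        return None  # this sketch only handles subjects diagnosed as MCI at baseline

    # Exclude subjects who revert to a less severe stage at any time point.
    stages = [SEVERITY[diagnosis_by_month[m]] for m in months]
    if any(later < earlier for earlier, later in zip(stages, stages[1:])):
        return None

    conversion_months = [m for m in months if diagnosis_by_month[m] == "AD"]
    if conversion_months:
        # Converters within two years are pMCI; later converters are excluded.
        return "pMCI" if conversion_months[0] <= 24 else None

    # Non-converters need a diagnosis at m24 or later to count as stable.
    return "sMCI" if months[-1] >= 24 else None


stable = stratify_mci({0: "MCI", 12: "MCI", 24: "MCI"})
progressive = stratify_mci({0: "MCI", 12: "AD"})
late_converter = stratify_mci({0: "MCI", 12: "MCI", 24: "MCI", 36: "AD"})
short_follow_up = stratify_mci({0: "MCI", 12: "MCI"})
reverter = stratify_mci({0: "MCI", 12: "HC"})
```

Returning `None` for late converters, reverters and subjects with insufficient follow-up keeps the two retained groups cleanly separated, which is the purpose of the exclusion rules.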

Table 6 gives an overview of the baseline images considered in the conducted analyses. Note that this is a well-defined subset of all 5074 images processed. Lists of the processed filenames that also include the unique image identifier are available online at https://doi.org/10.12751/g-node.aa605a70.

Table 6 Overview of the analyzed subjects from the ADNI cohort, including age and clinical information at baseline.

Preprocessing

As preprocessed versions of the images were downloaded from ADNI, no additional preprocessing was performed71. Brain masks were calculated for all available baseline images using pincram57. Brain masks were visually reviewed and some were recalculated with an updated pincram atlas database. Follow-up images were brain-extracted utilizing the corresponding baseline brain masks, which were transformed using rigid intrasubject registration.

Cross-sectional and longitudinal segmentation

All 5074 baseline and follow-up images were segmented individually using MALPEM as described in Ledig et al.53. As the atlas database, we used the manually annotated Neuromorphometrics (NMM) brain atlases (n = 30; provided by Neuromorphometrics, Inc. under academic subscription, http://Neuromorphometrics.com/, last accessed 15 March 2018). The atlas label sets contain expert delineations of 40 non-cortical and 98 cortical brain regions. A description of the individual structures is provided in the supplementary material. MALPEM was recently validated in an independent study led by Johnson et al.54 where the authors compared state-of-the-art segmentation methods in the context of Huntington’s disease. The refined, time-point specific probabilistic segmentation output and the intensity-normalized, brain-extracted images of MALPEM are then employed to perform the consistent longitudinal segmentation as described in Ledig et al.35 (MALPEM4D). MALPEM4D is an approach that employs spatially and temporally varying coupling weights between time points to obtain temporally consistent segmentation estimates. In this work, MALPEM4D incorporates symmetric affine intra-subject registration72,73 and corrects for differential bias between intra-subject acquisitions using unweighted differential bias correction74. MALPEM4D is run on pairs of images to separately estimate volume changes bl → m12 and bl → m24.

Features and classification

For the cross-sectional analysis at baseline, structural volumes were calculated with MALPEM and not MALPEM4D, as the latter exploits information of later scanning time points, which was not available at baseline. For the longitudinal analysis structural volumes were extracted based on the respective MALPEM4D segmentations.

As features, all available structural volumes were employed. For paired structures, the left and right volumes were merged (98/2 = 49 cortical plus 28/2 + 7 = 21 non-cortical features). Note that seven non-cortical structures are unpaired (3rd ventricle, 4th ventricle, brainstem, CSF, cerebellar vermal lobules I–V, cerebellar vermal lobules VI–VII, and cerebellar vermal lobules VIII–X). Exceptions were made for the amygdala, hippocampus, inferior lateral ventricles, and lateral ventricles. As it is expected that these structures are particularly informative, their left and right volumes were retained as separate features (8 features). This allowed us to investigate asymmetric involvement of these structures in the disease progression. Individual structures were further summarized as ventricles, cortical grey matter, deep grey matter, white matter, brain tissue and total brain volume (brain tissue including ventricles/CSF) (6 features). In total 86 features were considered, including age and gender.
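The feature count can be checked by adding up the groups named in this paragraph. The snippet is only a bookkeeping aid; the dictionary keys are descriptive labels introduced here, not identifiers from the study.

```python
# Feature groups as described in the text.
feature_groups = {
    "merged cortical pairs": 98 // 2,                     # 49
    "merged non-cortical pairs plus unpaired": 28 // 2 + 7,  # 21
    "left/right kept separate (4 structures)": 4 * 2,     # 8
    "summary volumes": 6,
    "age and gender": 2,
}
total_features = sum(feature_groups.values())
```

The groups sum to the 86 features stated in the text.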

For classification, a 6-fold cross-validation (CV) was performed using a linear discriminant analysis (LDA) classifier for individual features. When combining multiple features, both SVM and RF classifiers were employed. A classification framework was implemented using MATLAB (The MathWorks Inc, Natick, MA, USA) that relies on classify (LDA), TreeBagger (RF, 100 trees) and libSVM (linear SVM75). Features were normalized (rescaled) individually to a range from 0 to 1 for the SVM classification. In addition to the standard classification accuracy (ACC), we also quantified the balanced classification accuracy (bACC76) to account for imbalanced group sizes. The bACC is calculated as the arithmetic mean of SENS and SPEC.
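The per-feature rescaling and the balanced accuracy can be illustrated with a short sketch. The study itself used MATLAB; the Python below is a stand-in written for illustration, and the function names, the 0/1 label coding and the example group sizes are assumptions.

```python
from typing import List, Sequence, Tuple


def rescale_to_unit_range(values: Sequence[float]) -> List[float]:
    """Min-max rescale one feature to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - lowest) / (highest - lowest) for v in values]


def balanced_accuracy(truth: Sequence[int],
                      predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Return (SENS, SPEC, bACC) for binary labels, 1 = patient, 0 = control."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, 0.5 * (sens + spec)


# Hypothetical, imbalanced example: 50 patients (38 detected)
# and 200 controls (168 correctly classified).
truth = [1] * 50 + [0] * 200
predicted = [1] * 38 + [0] * 12 + [0] * 168 + [1] * 32
sens, spec, bacc = balanced_accuracy(truth, predicted)
plain_accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
scaled = rescale_to_unit_range([2.0, 4.0, 6.0])
```

In this example the plain accuracy is pulled towards the specificity because controls outnumber patients, whereas the bACC weights both groups equally. This is the motivation for reporting bACC when group sizes differ.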

Significance levels were quantified as p-values of two-sided, unpaired Student’s t-tests. We employed the conservative Bonferroni correction to correct for multiple comparisons. Further, effect sizes were calculated as Cohen’s d by dividing the differences of the sample means (absolute value) by their pooled standard deviation2,77,78. According to Cohen77 an effect size of d = 0.2 can be considered as small, of d = 0.5 as medium and of d = 0.8 as large. Reporting the effect size in addition to the p-value is important as it quantifies the magnitude of a group difference, while a low p-value by itself only confirms its existence79.
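As a worked example, Cohen's d and the Bonferroni correction can be sketched as follows. The pooled standard deviation is computed here with the common (n − 1)-weighted formula, which is an assumption since the text does not spell out the exact definition; the function names and the toy values are likewise illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Absolute difference of the sample means divided by the pooled standard deviation.

    Assumption: pooled variance weighted by (n - 1) per group.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return abs(mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy values (assumed), not data from the study.
controls = [1.0, 2.0, 3.0, 4.0, 5.0]
patients = [3.0, 4.0, 5.0, 6.0, 7.0]
d = cohens_d(controls, patients)
adjusted = bonferroni([0.01, 0.04, 0.5])
print(round(d, 3))
print(adjusted)
```

With Cohen's thresholds quoted above, the toy groups would show a large effect (d > 0.8).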

Correction for nuisance factors

The volume of many individual brain structures diminishes during the course of normal, healthy aging. Also, strong correlations between structural volumes and overall head size are well established80. This is illustrated in Fig. 6, where linear regressors are fit to hippocampal volumes with respect to age, gender, and brain volume based on the processed HC subjects.

Figure 6

Top: Dependence of hippocampal volume on age (left), gender (middle) and brain volume (right). Bottom: Corresponding volumes corrected for the nuisance factors age, gender and brain size. The distinct disease groups are overlaid with their corresponding regression lines.

We followed a multiple linear regression approach as described in Koikkalainen et al.81 and investigated the nuisance factors age, gender and brain size. Specifically, for each individual structure a separate linear regressor was fit on the processed healthy control subjects using the abovementioned predictor variables. Each regressor is defined by one slope per predictor (\({c}_{{\rm{age}}}^{m}\), \({c}_{{\rm{gender}}}^{m}\) and \({c}_{{\rm{size}}}^{m}\)) and an intercept \({b}^{m}\). Assuming M distinct features, feature m of subject n, denoted by \({{F}_{n}}^{m}\), is corrected as:

$${{\mathop{F}\limits^{ \sim }}_{n}}^{m}={{F}_{n}}^{m}-({c}_{{\rm{a}}{\rm{g}}{\rm{e}}}^{m}{{\rm{a}}{\rm{g}}{\rm{e}}}_{n}+{c}_{{\rm{g}}{\rm{e}}{\rm{n}}{\rm{d}}{\rm{e}}{\rm{r}}}^{m}{{\rm{g}}{\rm{e}}{\rm{n}}{\rm{d}}{\rm{e}}{\rm{r}}}_{n}+{c}_{{\rm{s}}{\rm{i}}{\rm{z}}{\rm{e}}}^{m}{{\rm{s}}{\rm{i}}{\rm{z}}{\rm{e}}}_{n}+{b}^{m}).$$
(1)

To correct for head size, the total brain size (sum of all structures) was used as an approximation of the intracranial volume. This is a commonly used approximation80.
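The correction in Equation 1 can be sketched for a single feature as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the regressor is fit by ordinary least squares via the normal equations, gender is coded as 0/1, and all function names and numerical values are hypothetical.

```python
def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_nuisance_model(volumes, age, gender, size):
    """Least-squares fit of volume = c_age*age + c_gender*gender + c_size*size + b."""
    X = [[a, g, s, 1.0] for a, g, s in zip(age, gender, size)]
    k = 4
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * v for row, v in zip(X, volumes)) for i in range(k)]
    return solve_linear_system(XtX, Xty)  # [c_age, c_gender, c_size, b]


def correct_feature(volume, age, gender, size, coeffs):
    """Apply Equation 1: subtract the volume predicted from the nuisance factors."""
    c_age, c_gender, c_size, b = coeffs
    return volume - (c_age * age + c_gender * gender + c_size * size + b)


# Hypothetical healthy controls: age in years, gender coded 0/1 (assumed coding),
# brain size in litres, hippocampal volume in ml.
hc_age = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]
hc_gender = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
hc_size = [1.10, 1.25, 1.05, 1.20, 1.30, 1.00]
hc_volume = [3.9, 4.1, 3.6, 3.8, 3.9, 3.2]

coeffs = fit_nuisance_model(hc_volume, hc_age, hc_gender, hc_size)
hc_corrected = [correct_feature(v, a, g, s, coeffs)
                for v, a, g, s in zip(hc_volume, hc_age, hc_gender, hc_size)]
hc_mean_corrected = sum(hc_corrected) / len(hc_corrected)
print(abs(hc_mean_corrected) < 1e-6)  # True
```

Because the model contains an intercept and is fit on the healthy controls only, the corrected values of that group average to zero by construction. The same coefficients would then be applied unchanged to subjects from the other groups.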

The effect of correcting for the nuisance factors patient age, gender and head size is shown in Table 7. The correction substantially increased classification accuracies and effect sizes obtained on individual structures. The benefit of the correction for all investigated structures, including effect sizes, can be found in the supplementary material.

Table 7 Balanced classification accuracies in % for distinguishing between HC and AD subjects (effect sizes in parentheses) after correcting for various nuisance factors (100 runs, 6-fold cross-validation, LDA). Largest effect size in bold.

The observed benefit of correcting for these confounding factors is in agreement with the literature81. In the conducted experiment, correcting for brain size had the biggest effect and correcting for gender had a stronger impact than correcting for age. Correcting for all three nuisance factors (age, gender and brain size) was most beneficial in terms of both classification accuracy and effect size.

An illustration of the volumes before and after correction is provided in Fig. 6. Healthy control subjects have zero mean after correction, and the overall dependence on nuisance factors is clearly reduced.

The corrected volumes were used for the cross-sectional analysis. This means that feature values are no longer actual volumes, but rather volume differences with respect to a healthy population of matched age, gender and brain size. Note that the three independent variables (patient age, gender and head size) were themselves left uncorrected.

Calculation of atrophy rates and sample sizes

For a volume \({v}_{{t}^{1}}\) at baseline and a volume \({v}_{{t}^{2}}\) at a follow-up time point, we calculated atrophy rates using the logarithmic transform as \({{\rm{\Delta }}}_{v}({t}^{1},{t}^{2})={{\rm{\Delta }}}_{v}^{\mathrm{log}}({t}^{1},{t}^{2})=\mathrm{ln}({v}_{{t}^{2}}/{v}_{{t}^{1}})\cdot 100\% \). Note that the terms atrophy rate and volume change are used interchangeably, which means that a positive atrophy rate indicates an increase in volume.
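A minimal sketch of the logarithmic atrophy rate is given below; the function name and the example volumes are hypothetical. One known property of the log transform is that rates over consecutive intervals add up to the rate over the whole interval, which the example also checks.

```python
from math import log


def atrophy_rate_percent(v_baseline, v_followup):
    """Log-transformed volume change in percent; negative values indicate volume loss."""
    return log(v_followup / v_baseline) * 100.0


# Hypothetical hippocampal volumes in ml at baseline, 12 and 24 months.
v0, v12, v24 = 4.0, 3.8, 3.7

rate_0_12 = atrophy_rate_percent(v0, v12)    # roughly -5.13
rate_12_24 = atrophy_rate_percent(v12, v24)
rate_0_24 = atrophy_rate_percent(v0, v24)

# Log rates over consecutive intervals sum to the rate over the full interval.
print(abs((rate_0_12 + rate_12_24) - rate_0_24) < 1e-9)  # True
```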

For a power (1 − β) and significance level α the sample size can be calculated23 as:

$$N={({z}_{1-{\rm{\beta }}}+{z}_{1-{\rm{\alpha }}\mathrm{/2}})}^{2}\cdot \frac{\mathrm{(2}{{\rm{\sigma }}}_{{\rm{g}}}^{2})}{{{\rm{\Delta }}}^{2}}.$$
(2)

Here Δ is the difference in atrophy rate that is to be shown between the clinical groups. In this study sample sizes were calculated to detect a 25% change in atrophy rate (\({\rm{\Delta }}=0.25\,{{\rm{\mu }}}_{{\rm{g}}}\)) with 80% power (\({z}_{0.8}\approx 0.84\)) at a 5% significance level (\({z}_{1-0.05/2}\approx 1.96\)). These parameter choices are commonly found in the literature22,23,72. It is important to relate atrophy rates in dementia to normal atrophy during ageing, as the uncorrected case assumes that a 100% treatment effect would reduce the structural atrophy to zero23. Sample sizes were thus corrected for normal ageing by evaluating Equation 2 with \({\rm{\Delta }}=0.25\,({{\rm{\mu }}}_{{\rm{g}}}-{{\rm{\mu }}}_{{\rm{healthy}}})\) to reduce the maximal treatment effect to the level of normal ageing. In Equation 2 it is assumed that measurements of healthy atrophy have the same variance as measurements of diseased subjects (\({{\rm{\sigma }}}_{{\rm{g}}}\approx {{\rm{\sigma }}}_{{\rm{healthy}}}\))23. This usually leads to a more conservative estimate.
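Equation 2 can be evaluated as in the following sketch, both without and with the correction for normal ageing. The mean atrophy rates and the standard deviation are hypothetical values chosen for illustration and are not results of this study; the function name and the rounding up to the next integer are likewise assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size(mu_group, sigma_group, mu_healthy=0.0,
                effect=0.25, power=0.80, alpha=0.05):
    """Evaluate Equation 2 with Delta = effect * (mu_group - mu_healthy).

    With mu_healthy = 0 this is the uncorrected case (Delta = effect * mu_group).
    """
    z_power = NormalDist().inv_cdf(power)          # approx. 0.84 for 80% power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # approx. 1.96 for alpha = 0.05
    delta = effect * (mu_group - mu_healthy)
    return (z_power + z_alpha) ** 2 * (2 * sigma_group ** 2) / delta ** 2


# Hypothetical annual atrophy rates in percent (assumed, not from the study).
mu_g, sigma_g, mu_hc = -4.0, 2.0, -1.0

n_uncorrected = ceil(sample_size(mu_g, sigma_g))
n_corrected = ceil(sample_size(mu_g, sigma_g, mu_healthy=mu_hc))
print(n_uncorrected, n_corrected)  # 63 112
```

Since subtracting the healthy atrophy rate shrinks Δ, the corrected sample size is larger than the uncorrected one whenever healthy subjects show atrophy in the same direction.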

Data availability

The datasets generated during and/or analyzed in this study are available online (all resources last accessed 15 March 2018).