Introduction

The cost of drug development is highest, by far, for neurodegenerative diseases, with unparalleled failure rates1. In this respect, the controversial approval of aducanumab on 7 June 2021 by the Food and Drug Administration (FDA) represents a turning point in Alzheimer’s disease (AD) drug development2. This decision raises the critical issue of demonstrating the clinical benefit of a compound acting on a key biological process, the accumulation of amyloid plaques in the brain3.

It remains unclear why an effective intervention for such a key biological mechanism is only weakly associated with lower levels of cognitive decline. It is likely that the core biological processes and their interactions are not yet fully understood. Another, non-exclusive explanation is that the issue of who and when to treat must be addressed with greater precision to demonstrate clinical efficacy. In 2019, Cummings and coworkers were already stressing the need to improve clinical trials, by targeting the right participant with the right biomarker in the right trial4. The motivation, here, is simple: it is not possible to show that a candidate therapy slows down the degradation of the endpoint if this endpoint is not expected to worsen during the trial. The treatment effect size will be larger if one includes participants right before the disease progression would cause a significant change in the endpoint without an intervention. Such a target period depends on the endpoint selected to demonstrate efficacy.

It is particularly difficult to identify the most appropriate time frame for a disease like AD, which progresses over decades, in a non-linear manner, and with different clinical presentations between patients. The thresholds currently used for the main biomarkers and clinical endpoints are not sufficiently effective for the selection of patient populations with homogeneous progression profiles5. Disease modeling uses computational and statistical methods to address this question6,7,8,9,10,11,12,13,14. These models learn the variability of disease progression from observational longitudinal cohort data and can then predict the progression of patients from their historical data. They require various clinical or biomarker assessments at one or several time points as input. These techniques are beginning to be evaluated for clinical trial design. For example, a retrospective analysis showed that the effect size of treatment could be increased by targeting participants with a predicted type of progression at trial entry15. Other studies indicated that predicting the value of endpoints might make it possible to reduce sample sizes in clinical trials16,17.

We propose here a software tool using a disease progression model for participant selection in clinical trials. The goal is to enrich the selected population of participants likely to display progression during the trial, a concept called prognostic enrichment18 by the FDA and already applied in some AD trials19,20. We will use AD Course Map as a disease progression model. It is a non-linear mixed-effect model, which predicts both the dynamics of progression and the clinical presentation of the disease21,22. This technique outperformed the 56 alternative methods for predicting cognitive decline in the framework of the TADPOLE challenge6,23. We will compare this model with RNN-AD, which is a recurrent neural network, namely a deep learning method that learns temporal dynamic behavior. In June 2020, it ranked 2nd for the prediction of cognitive decline in the TADPOLE challenge24.

We will first evaluate the ability of the model to predict progression for the main endpoints used as outcomes in current clinical trials. We will use five independent data sets with data from more than 4600 patients spread over four continents. We will analyze the systematic biases of such algorithms, their robustness to missing data, and suitability for generalization across countries, ethnicities, and disease stages. Finally, we will simulate inclusion procedures for clinical trials by varying several key parameters: the chosen outcome, trial duration, and selection criteria. Finally, we will show that participants predicted to be at risk of the outcome worsening constitute a population likely to show a greater and more homogeneous response to treatment.

Results

Characteristics of the study population

We used data from 4687 participants from five longitudinal multicenter cohorts from North America, Australia, Japan, and Europe: the Alzheimer’s disease neuroimaging initiative (ADNI)25,26,27,28,29,30,31 (N = 1652), the Australian imaging, biomarker and lifestyle flagship study of aging (AIBL)32,33 (N = 460), the Japanese Alzheimer’s disease neuroimaging initiative (J-ADNI)34,35 (N = 470), the PharmaCog cohort36,37 (N = 111) and the MEMENTO cohort38 (N = 1994). Each study enrolled participants attending memory clinics.

Tables 1 and 2 summarize the characteristics of each data set. These data sets contain diverse patient profiles from different ethnic, genetic, and geographic backgrounds, with follow-up visits at different disease stages. For all these studies, the neuropsychological examinations were performed in accordance with international standards, and the image acquisition procedures were performed in accordance with the protocols established by the ADNI consortium. Together, these data sets, therefore, correspond to a relevant pool of patients for simulating inclusion procedures for a typical large multicenter phase III trial.

Table 1 Characteristics of study participants
Table 2 Distribution of endpoints for study participants

Disease progression models learn the timing of changes in biomarker levels during disease progression

We train disease progression models using the ADNI participants with confirmed pathological amyloid levels as the training set (N = 866) with baseline and all available follow-up data. We kept the data from the other ADNI participants and the members of the four external cohorts as the validation set (N = 3821). The same protocol for training and validating the models is used for AD Course Map and RNN-AD. See Methods for details.

The two models include the following endpoints: Mini-Mental State Examination (MMSE), Alzheimer’s Disease Assessment Scale—cognitive sub-scale with 13 items (ADAS-Cog13), Clinical Dementia Rating—sum of boxes (CDR-SB), volumes of the left and right hippocampus and lateral ventricles, Aβ1–42 and p-tau181 levels in the cerebrospinal fluid (CSF), standard uptake value ratio (SUVR) for Amyloid PET and Tau PET scans. See Methods for details.

AD Course Map assumes that these endpoints follow a logistic progression curve during disease progression with distinct progression rate and age at the inflexion point21,22. It learns how this set of logistic curves need to be adjusted to fit individual data by changing the dynamic of progression and disease presentation (i.e., the relative value of the endpoints at a given disease stage). By contrast, RNN-AD learns how the values of the endpoints will change in the next month given the values of the endpoint at a given time-point. The 1-month transition is assumed to be a non-linear function (e.g. a neural network) of the current value of the endpoints and the current diagnosis. Supplementary Table 1 shows the goodness-of-fit on the training set, consistent with the results of our previous studies on AD Course Map6 and RNN-AD24. See Methods for details.

Disease progression models forecast cognitive decline

The disease progression models predict the subject-specific trajectory of biomarker changes from data collected from the subject concerned at one or several visits. The predicted trajectory is used to forecast the values of the biomarkers at future time points. Figure 1 illustrates this forecast procedure.

Fig. 1: Disease progression models forecast the progression of endpoints from historical data of a participant.
figure 1

In this simplified example, the model has only three endpoints (Amyloid PET, Hippocampus volume, and mini-mental state examination (MMSE)). The participant has been observed twice at 70 and 71 years old (colored crosses). After normalizing the data to a 0–1 scale (0 being the most normal and 1 the maximum pathological change), the model predicts the participant-specific progression curves. From these curves, one forecasts the values of the three endpoints in 4 years’ time (colored dots). As shown in this example, AD Course Map does not require the imputation of missing data. In trial simulations, the curves are predicted from the data at a single time point, e.g. the baseline. CL centiloid scale, ICV intracranial volume.

We repeatedly assessed the errors of AD Course Map and RNN-AD for forecasting cognitive endpoints (ADAS-Cog13, MMSE and CDR-SB) for participants in the validation set. We blinded the latest visits of the participants and tried to predict them from the unblinded data (see Supplementary Fig. 1 and Methods for details of the procedure). From 44,435 forecasts for ADAS-Cog13 (96,970 for MMSE and 96,849 for CDR-SB), we determined the absolute difference between predicted and actual results as a function of the characteristics of the participants and the information used for forecasting purposes.

Figure 2 shows the distribution of mean absolute errors (MAE) for AD Course Map and RNN-AD adjusted for co-founding factors. The reported errors are for the reference participant in the reference forecast design: a 75-year-old American woman from the ADNI cohort with an average education level, no APOE-ε4 mutations, and an A + T + N + status with a questionable dementia (CDR = 0.5 noted C~), for whom we forecast neuropsychological assessments in three years’ time, based on two past visits separated by eight months with no missing data. AD Course Map yields a mean absolute error of 5.98 (95% CI = [5.44, 6.48]) on a scale of 85 for ADAS-Cog13, of 2.54 (95% CI = [2.39, 2.71]) on a scale of 30 for the MMSE, and of 1.86 (95% CI = [1.75, 1.99]) on a scale of 18 for the CDR-SB.

Fig. 2: AD Course Map forecasts cognitive decline better than alternative methods.
figure 2

The mean absolute error is reported for the reference participant: a 75-year-old American woman from ADNI with an average level of education, no APOE-ε4 mutation, and a A + T + N + C~ status (i.e., with CDR global of 0.5), for whom we forecast neuropsychological assessments in three years’ time, based on two past visits separated by eight months and for which all data were available. Box plots represent median value, first and third quartiles; whiskers represent the empirical 95% confidence interval. Statistics are computed for n = 100 resampling of the validation set (see Methods). Source data are provided as a Source Data file. MAE mean absolute error.

On all occasions, AD Course Map and RNN-AD yielded significantly smaller errors than two alternative methods: no-change prediction (predicting the same value as obtained at the participant’s last visit) and a linear mixed-effects model (p < 0.01 for both, see Supplementary Table 2). These two alternatives were shown to be good predictor of short-term progression, essentially because of the overall slow pace of progression of the disease6,23. The deep learning method RNN-AD yields intermediate performance with adjusted mean absolute errors of 6.53 (95% CI = [6.02, 7.19]), 2.75 (95% CI = [2.57, 2.92]), and 1.95 (95% CI = [1.81, 2.09]) for the prediction of ADAS-Cog13, MMSE and CDR-SB respectively.

We investigated the change in MAE for ADAS-Cog 13 score for different categories of participants and forecast designs (Fig. 3). For AD Course Map, the number of previous visits considered (1, 2, or 3) did not significantly affect forecasting error. By contrast, for every additional year of time to prediction, MAE for ADAS-Cog13 score increased by 0.80 (95% CI = [0.71, 0.93]). Forecasts were not significantly affected by sex nor APOE genotype but were slightly improved for participants who are older than average and had longer education. On average, the forecasts for the European participants from PharmaCog cohort as well as the Japanese participants from the J-ADNI cohort were better than those for the American participants from the ADNI cohort, by about 1.1 and 0.6 points respectively. Forecasts were robust to missing CSF or Tau PET data, and slightly worsened when MRI or Amyloid PET were missing with differences in MAE of 0.27 (95% CI = [0.00, 0.55]) and 0.54 points (95% CI = [0.15, 1.02]) respectively. The model forecasts better at earlier stages of the AD continuum than at later clinical stages (Fig. 3a). The method was readily generalizable to the included participants with suspected non-amyloid pathology (SNAP) and possible concomitant pathological non-Alzheimer’s changes.

Fig. 3: Changes in forecast absolute errors depending on covariates.
figure 3

Results are presented for the forecast of ADAS-Cog13 with the AD Course Map. a Changes due to forecast design (4 top rows, in brown), genetic and sociodemographic characteristics of the participant (rows 5–10, in blue), the cohort of the participant (rows 11 and 12, in pink), and missing data (rows 13 to 16, in gray). b Changes due to A(myloid)/T(au)/N(eurodegeneration)/C(linical) status of the participant, grouped in: Alzheimer’s continuum at the top (8 top rows, in green), possible Alzheimer’s disease and concomitant non-Alzheimer’s pathologic change in between (row 9, in orange), and suspected non-Alzheimer’s pathophysiology (SNAP) at the bottom (3 bottom rows, in gray). Coefficients below zero indicate a lower mean absolute error (MAE) (better forecast) than those for the reference participant and design. For example, if the reference participant comes from J-ADNI instead of ADNI, the prediction of ADAS-Cog13 is more accurate, resulting in a 0.63 point decrease in MAE (95% CI = [0.32, 0.96]). Box plots represent median value, first and third quartiles; whiskers represent the empirical 95% confidence interval. Statistics are computed for n = 100 resampling of the validation set (see Methods). Source data are provided as a Source Data file. MAE mean absolute error.

Similar conclusions were drawn for predictions of MMSE and CDR-SB (see Supplementary Figs. 2 and 3). AD Course Map performed better on all but one external validation cohort. Errors were robust to changes in the available information used to make the prediction, such as the number of unblinded visits and missing data. This method did not produce biased forecasts for women. Forecasts for those two endpoints however displayed slightly worse results for participants older than average or with an education level that is below the average, and for APOE-ε4 carriers.

Disease progression models select participants displaying progression for trials

We now use disease progression models to identify the participants likely to experience significant cognitive decline during a trial (see Fig. 4). The definition of participants displaying progression depends on the endpoint used to measure the condition and the duration of the trial. We simulated six clinical trials with different primary outcomes, trial durations, and inclusion criteria. These designs were inspired by real phase III trials (see Table 3).

Fig. 4: Illustration of the prognostic enrichment procedure in a clinical trial.
figure 4

Participants are selected first using standard inclusion criteria and undergo a series of exams. A disease progression model, such as AD Course Map, then forecasts the progression of each participant’s data and predicts if the participant is likely to progress significantly during the trial, as measured by the predicted outcome change, which is the mini-mental state examination (MMSE) in this example. The treatment effect (e.g., a 25% reduction of the change of the MMSE during trial) leads to a greater effect size, and therefore a smaller sample size, on the group of predicted fast progressors compared to the group of predicted slow progressors or the two groups combined. As a result, one may demonstrate the treatment efficacy with fewer participants by monitoring only the group of predicted fast progressors.

Table 3 Description of the simulated trials

For each trial, we selected the participants in the validation set who met the inclusion criteria at one of their visits (considered as the baseline visit for the simulated trial) and attended a follow-up visit after a period equal to the theoretical duration of the trial. We split this population into two equal halves: fast and slow progressors, according to whether the outcome considered (e.g. the annual change in endpoint relative to baseline) was above or below the population median value. We aimed to identify the participants in these two groups exclusively on the basis of their baseline data.

We used the disease progression models to forecast the values of the endpoint at the end of the trial from the baseline data for each participant. The predicted outcome was used as a prognostic score. For AD Course Map, Pearson correlations with the true outcome range from 28% to 47% depending on the trial, while for RNN-AD they range from 13% to 36% (see Supplementary Table 3). Participants with a prognostic score above a given threshold were considered to be likely to be fast progressors. We plotted receiver operating characteristics (ROC) curves for the six simulated trials (Fig. 5). The area under the ROC curve (AUC) of the six simulated trials fell within the 65–80% range for AD Course Map and within the 55–80% range for RNN-AD (see Fig. 5).

Fig. 5: AD Course Map and RNN-AD select participants at risk of experiencing a worsening of the outcome during the trial.
figure 5

Receiver operating characteristic (ROC) curves are shown. They demonstrate the performance of AD Course Map and RNN-AD in selecting the group of participants with the largest change in primary outcome during follow-up. Shaded areas correspond to the empirical 95% confidence interval. The green circle and orange triangle on each curve correspond to selections splitting the participants into two equal groups, with bars representing the 95% confidence intervals. The cross in gray gives the specificity and sensitivity when APOE-ε4 carriers (with 1 or 2 copies) are selected, with bars indicating the 95% confidence interval (note: the first trial includes only APOE-ε4 carriers, and there is, therefore, no gray cross). Statistics are computed for n = 100 resampling of the validation set (see Methods). Source data are provided as a Source Data file. AUC: area under the ROC curve (mean ± standard deviation with 95% confidence interval).

We compared this prognostic enrichment strategy with two alternative methods: selecting participants at random (bisector of the ROC curve) as currently done in most trials, or selecting participants based on their APOE genotype (gray crosses in Fig. 5). All selection methods were significantly better than random selection, meaning that disease progression models succeed in identifying the progressors compared to the current practice that does make any difference among the participants meeting the inclusion criteria. In all but one case, selections with AD Course Map were significantly better than selection on the basis of APOE genotype. RNN-AD also compares favorably against the two alternatives. Nevertheless, it has significantly worse performance than AD Course Map in two out of six tested scenarios, with a drop of 9% and 14% in the ROC AUC. AD Course Map shows therefore more robust results than RNN-AD when the trial design is varied.

We analyzed whether our assessment of the risk of progression led to an over- or under-selection of certain types of participants relative to the true progressors (see Supplementary Fig. 4). Depending on the design, the group that was selected using AD Course Map displayed slight enrichment in men or women, and tended to be biased towards older participants. The selected participants were often, but not always, enriched in carriers of the APOE-ε4 variant. The presented disease progression models do not use sociodemographic or genetic factors as proxies for the selection of participants displaying progression. They limit therefore the biases of sex, age, or APOE-ε4 carriership, which are the basis of current practices to increase the likelihood that a participant progresses during a trial.

Disease progression models can be used to design more powered clinical trials

The automatic selection of participants displaying progression makes it possible to implement prognostic enrichment strategies in trials (see Fig. 4). For each trial design, we simulated a hypothetical treatment decreasing the outcome value. We calculated the sample size required to show the effect of this treatment for a range of treatment effects (see Methods). We compared the results when all eligible participants were included to those obtained when only participants predicted to be fast progressors at baseline were included.

We plotted sample size against the treatment effect for all six simulated trials (Fig. 6). The selection of participants at risk of progression with AD Course Map allowed a significant reduction in sample size relative to current inclusion criteria alone, across all scenarios tested. For a treatment effect of 25%, the sample size was reduced by 50.2% (±7.1) for participants at risk of the onset of AD, by 40.9% (±4.9) for a trial targeting individuals with preclinical AD and high brain amyloid levels, by between 38.1% (±1.6) and 45.4% (±2.0), depending on the outcome considered, for subjects with early AD and high levels of brain amyloid, by 44.6% (±3.9) for subjects with early AD and high brain levels of tau, and by 43.1% (±0.8) for participants with mild cognitive impairment probably due to AD or mild AD.

Fig. 6: Enrichment based on AD Course Map significantly decreases the sample size for a hypothetical treatment effect ranging from 20% to 30%.
figure 6

Reported sample sizes are the total size for two arms. The light-shaded areas represent the 95% confidence interval and the dark-shaded areas the 50% confidence interval around the median value. For all preclinical and early Alzheimer’s disease (AD) trials, enrichment based on AD Course Map significantly outperformed the enrichment based on APOE-ε4 carriership. Statistics are computed for n = 100 resampling of the validation set (see Methods). Source data are provided as a Source Data file.

For all preclinical and early AD trials, enrichments based on AD Course Map significantly outperformed the selection of APOE-ε4 variant carriers only. For mild cognitive impairment due to AD or a mild AD trial, the performance of enrichment based on AD Course Map was not significantly different from targeting APOE-ε4 carriers. AD Course Map achieved a similar decrease in sample size, but without the need to target a specific genetic profile. In this case, we also found that 49.2% (95% CI = [48.6, 49.9]) of the participants would be selected by AD Course Map, versus 39.1% (95% CI = [38.7, 39.4]) for heterozygous APOE-ε4 carriers, facilitating recruitment with AD Course Map (see Supplementary Table 4).

RNN-AD also allowed a significant reduction of the sample size compared to current practice, from 21% to 42% depending on the tested scenario. Nevertheless, the reduction was never better than with AD Course Map with an increase of 10% and 35% participants to be selected for the two scenarios where RNN-AD yielded a lower AUC (see Supplementary Table 5).

Discussion

We used disease progression models to forecast cognitive decline across all stages of the AD continuum. Using five independent cohorts containing more than 4,600 participants, we show here that AD Course Map provides a fair, robust, and generalizable predictive method. It is fair, in that its predictions are not biased with respect to sex, and are only marginally affected by level of education and the age of the participant. The method is robust to missing CSF or Tau PET biomarkers, but in general better results are achieved when MRI and Amyloid PET data are present. The model was trained on data acquired in North America, but it is readily generalizable to participants from Europe, Asia, and Oceania, with no loss of performance. It performed better at the earliest preclinical stages of the AD continuum than at later disease stages, and is therefore relevant for early-stage interventions.

Disease progression models automatically identify the participants already at risk of experiencing cognitive decline at baseline in a trial. They can therefore be used to enrich the trial population in participants likely to experience a worsening of a given endpoint during the trial. By targeting more homogeneous groups of participants displaying progression, AD Course Map makes it possible to decrease sample size significantly, by 38% up to 50%, at the expense of discarding about half of the screened participants. It shows better and more robust performance than the deep learning method RNN-AD. Disease progression models adapt seamlessly to various clinical trial designs targeting different disease stages with different outcomes and trial durations. They do so without the need to re-train the model for each new trial. In comparison, a recent method based on another prognosis score reported sample size reductions of 20% to 28%17.

The main limitation of the method is the data used to monitor disease progression. Cognitive assessment displays about 10% inter-rater variability39,40,41. MRI biomarkers also display a similar degree of variability between two scans acquired on the same day for the same participant, and their reliability is further decreased by possible variations in the processing pipelines42. Mapping CSF biomarkers from different immunoassays also limit their reliability43. These factors limit the accuracy of the method for forecasting disease progression. Increasing the reliability of these measurements would improve the performance of the approach described here. In the future, disease progression models such as AD Course Map may also benefit from the inclusion of promising new biomarkers, such as plasma biomarkers, neurofilament light chain44, or digital biomarkers45.

Given these limitations, it is notable that such large sample size reductions can be achieved with data already available in routine clinical practice. These findings demonstrate the benefits of companion software tools for patient recruitment in trials and for supporting clinicians in the future, enabling them to prescribe the right treatment to the right patient at the right time.

Methods

Participants

We used the data from five longitudinal multicenter cohorts: the ADNI25,26,27,28,29,30,31 (N = 1652), the Australian imaging, biomarker, and lifestyle flagship study of aging (AIBL)32,33 (N = 460), the JJ-ADNI34,35 (N = 470), the PharmaCog cohort36,37 (N = 111) and the MEMENTO cohort38 (N = 1994).

The study protocols were approved by the ethical committees of the university of southern California (ADNI), Austin Health, St Vincent’s Health, Hollywook Private Hospital and Edith Cowan University (AIBL), IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli (PharmaCog), Comité de protection des personnes sud-ouest et outre-mer III (MEMENTO), the National Bioscience Database Center Human Database (J-ADNI). Informed consent forms were obtained from research participants. The research has been performed in accordance with the Declaration of Helsinki and relevant guidelines and regulations. Participants were not compensated for the current study.

The five cohorts are longitudinal observational studies with an average observation period ranging from 2.0 years for PHARMACOG to 4.8 years for ADNI, with an average number of visits ranging from 3.7 in AIBL to 6.9 in MEMENTO. We considered all participants with at least one year of follow-up. The sociodemographic, genetic, biological and clinical characteristics of the selected participants are reported in Tables 1 and 2, as well as the proportion of available data in each cohort.

Neuropsychological assessments

In our experiments, we considered the following neuropsychological assessments:

  • The mini-mental state examination39 (MMSE),

  • The Alzheimer’s disease assessment scale–cognitive sub-scale with 13 items40,46 (ADAS-Cog13),

  • The clinical dementia rating scale41,47 – sum of the boxes score (CDR-SB).

Structural magnetic resonance imaging/anatomical imaging biomarkers

We extracted cortical and subcortical volumes from three-dimensional T1-weighted magnetization-prepared rapid gradient-echo imaging (MPRAGE) sequences.

For the ADNI study, scans were acquired in the standardized protocol for morphometric analyses (http://adni.loni.usc.edu/methods/documents/mri-protocols/). The ADNI MRI core processed raw scans, using Gradwarp for the correction of geometric distortion due to gradient nonlinearity48, B1-correction for the adjustment of image intensity inhomogeneity26, N3 bias field correction for reducing residual intensity inhomogeneity49,50, and geometric scaling for adjusting scanner- and session-specific calibration errors26,51. The same MRI protocol was also used in AIBL32, J-ADNI52, PharmaCog36, and MEMENTO38,53.

For all studies, cortical reconstruction and volumetric segmentation were performed with the Freesurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/). Version 5.3 was used for J-ADNI, MEMENTO, and PharmaCog, and version 6.0 for ADNI and AIBL, operated within Clinica for reproducibility purposes54. The cohort effect in the following analyses accounts for possible differences due to different versions of the software.

We calculated the mean volume of the left and right hippocampus, and the total volume of the lateral ventricles (including inferior lateral volume). Hippocampus segmentation with Freesurfer was previously reported to have good reproducibility55,56. Both volumes were normalized by estimated total intracranial volume (ICV).

Cerebrospinal fluid biomarkers

We used the concentrations in cerebrospinal fluid (CSF) of β-Amyloid 1–42 peptide (Aβ1–42), Tau protein, phosphorylated at the threonine 181 residue (p-Tau181), and total tau protein (t–Tau).

ADNI used the automated Elecsys immunoassay (Roche); AIBL, PharmaCog, and MEMENTO used INNOTEST single-analyte ELISA tests (Innogenetics/Fujirebio NV), and J-ADNI used the multiplex xMAP Luminex platform with the INNO-BIA AlzBio3 immunoassay kit (Innogenetics/Fujirebio NV).

We harmonized the measurements to account for the differences in immunoassays and participants' characteristics across cohorts. Within each cohort, we regressed each biomarker against age, APOE genotype, and CDR global score with a linear mixed model with random intercept. We then linearly transformed the measurements so that the intercept is 0 and the total variance is 1 for all cohorts. Harmonization equations used are listed in Supplementary Table 6 for reproducibility purposes.

Positron emission tomography/functional imaging biomarkers

For ADNI participants, we used regional standardized uptake value ratios (SUVR) extracted from Amyloid PET scans ([18F]-Florbetapir and [18F]-Florbetaben radiotracers), and, starting from ADNI 3, Tau PET scans ([18F]-AV-1451 radiotracer). Each PET scan was registered together with the MRI for the subject performed as close as possible to the PET scan in terms of time.

For Amyloid PET, we used a cortical-summary region consisting of the frontal, anterior/posterior cingulate, lateral parietal, and lateral temporal regions; data were normalized with a composite reference region consisting of the whole cerebellum, brainstem/pons, and eroded subcortical white matter57,58,59. These PET SUVR values were converted to the centiloid scale (CL)60 using equations from the literature61 listed in Supplementary Table 6. In the AIBL cohort, the processed Amyloid PET SUVR data that correspond to the published centiloid conversion equations were not publicly available. In the MEMENTO cohort, Amyloid PET SUVR data are not directly comparable with ADNI data and equations for centiloid conversion were not available. Therefore, we used Amyloid PET data on these cohorts only to define the Amyloid status of the participants, using pathological thresholds provided by these studies.

For Tau PET, we used a volume-weighted average SUVR value for all anatomical Braak regions of interest (I-VI)62, normalized against the inferior cerebellum gray matter63.

A/T/N/C classification

We classified participants with the A(myloid)/T(au)/N(eurodegeneration) classification64,65, together with a C(ogntion)/C(linical) group based on the Clinical Dementia Rating (CDR) global score (see the Supplementary Table 7 for all thresholds used). Participant category at a given visit was based on the patient’s all-time worst biomarker levels to date. Incomplete A/T/N/C profiles are denoted with a star after any of the biomarkers that could not be determined.

Disease progression models

We trained and tested two disease progression models: AD Course Map and RNN-AD. AD Course Map is built on the principles of a parametric Bayesian non-linear mixed-effects model21,22. RNN-AD is built on the principles of recurrent artificial neural networks24,66. The implementation of both models relies on the open-source software that was made publicly available by their respective authors.

Both models use the same set of endpoints as input: MMSE, CDR-SB, ADAS-Cog13, volume of the left and right hippocampus and lateral ventricles, CSF Aβ1–42 and p-tau181 levels, together with cortical-summary SUVR on Amyloid PET and Tau PET scans. They consider these endpoints at one or several visits of a participant, allowing for possible missing data, and predict the value of all these endpoints at any time-point in the future. AD Course Map also takes into account the age of the participant at each visit, while RNN-AD takes into account only the duration between two consecutive visits, irrespective of the age of the participant. In addition, RNN-AD needs the diagnosis of the participant at the corresponding visit, the diagnosis being cognitively normal, mild cognitive impairment, or demented, as defined in the ADNI protocol.

AD Course Map assumes that these endpoints follow a logistic progression curve during disease progression with distinct progression rate and age at the inflexion point21,22. It learns how this set of logistic curves needs to be adjusted to fit individual data, by changing the dynamic of progression and disease presentation (i.e., the ordering and timing of progression among the endpoints. The shape and position of the reference set of logistic curves are the fixed effects, and the parameters changing these curves to fit individual data are the random effects. The model parameters (fixed effects together with the mean and variance of the random effects) are estimated using a training data set containing the repeated measurements of a multitude of participants. After the training phase, the model is fit to the measurements of one test participant (outside the training test) at one or several visit, using the learnt distribution of the random effects as a regularizer. As a result, the model predicts a subject-specific set of logistic curves, which shows the value of each endpoint at any age of the participant.

By contrast, RNN-AD does not make any assumption on the life-long pattern of progression of the endpoints. It learns instead how the values of the endpoints will change in the next month given the values of the endpoint at a given time-point. The 1-month transition is assumed to be a non-linear function of the current value of the endpoints and the current diagnosis (e.g. artificial neurons). The parameters of this transition function are estimated using a training data set containing the repeated measurements of a multitude of participants. After the training phase, the measurements of one test participant (outside the training set) at one or several visits are used as input of the model. The model then computes the values of all the endpoints at each month in the future.

AD Course Map can be trained and tested with missing data: the likelihood is optimized using the available data only. Model training is robust to missing data6, so we did not perform data imputation. By contrast, RNN-AD needs complete data at the baseline visit. We imputed missing data with the mean value of the endpoint in the training set, following authors’ recommendations;24 missing data at subsequent visits are imputed recurrently using model predictions.

Both models also need an internal step of data normalization. For AD Course Map, cognitive assessments were normalized to a 0 to +1 scale according to the theoretical minimum and maximum values of each assessment, 0 representing the theoretical best value (unaffected participants) and +1 the worst possible value. Harmonized amyloid PET data are clipped between 0 and 100 and converted to a (0,1) scale. MRI, tau PET, and Harmonized CSF data were clipped at the first and last centile, and then linearly mapped to a (0,1) scale. For RNN-AD, normalization consists in a z-score transformation estimated from training data.

Regardless of the normalization procedure, the outputs of the models are always converted back to the native scale (and unit) of the measurement before being analyzed (see Fig. 1). Predicted values are therefore comparable with the true, non-normalized data. Forecast errors can be compared across methods that do not use the same normalization procedure.

Validation procedure

We split the data sets in two (see Supplementary Fig. 1). We first considered the ADNI participants who were amyloid-positive according to CSF or PET data on at least one visit (shown in red in Supplementary Fig. 1). We then kept the other ADNI participants and all participants from the four other cohorts as an external validation set (shown in blue in the Supplementary Fig. 1).

We then split the amyloid-positive ADNI participants into five random folds and trained AD Course Map and RNN-AD using all available data of the participants in four out of the five folds, e.g., the training set. We repeated this procedure with another split, so that we ended up with 10 instances of each model. Each participant has been counted twice as a test subject in the left-out fold. Therefore, it can be used twice for evaluating prediction tasks with two different instances of each model. By contrast, each participant in the external validation set can be tested with 10 different instances of each model. In the following, we averaged the prediction made by the 2 instances of the model for the participants in the test sets, and by the 10 instances for the participants in the external validation set.

The test subjects did not contribute to any model selection or hyperparameter tuning neither for AD Course Map nor for RNN-AD. Therefore, we pooled the forecasts of test subjects with the ones in the external validation set.

Forecasting endpoints

We aimed to assess the accuracy of each model to forecast the values of the endpoints of a participant in the test set or the external validation set. The general principle is to blind the latest data of the participant, use the unblinded data as input of the model, and compare the predicted value with the blinded data.

We used a combinatorial procedure to generate prediction tasks, as described in Supplementary Fig. 1. Because we have multiple follow-up visits, we assessed several forecast errors for a single participant: we blinded the data of the participant except at one to three consecutive visits, we predict the individual trajectory using the unblinded data, and forecast the data at the blinded visits after the latest unblinded visit. We required that the participants are between 50 and 90 years old and have a CDR global of at most 2 at the latest unblinded visit to exclude severely demented participants, and that the blinded visits used to assess the forecast fall between 1.4 and 6.6 years after the latest unblinded visit. We computed the forecast error as the absolute difference between this value and the value of the endpoint at the follow-up visit concerned.

Analysis of forecast errors

We analyzed the distribution of mean absolute errors with a mixed-effects model. We corrected the errors for several possible cofounding factors and accounted for the fact that multiple forecasts originated from the same participant. In practice, for a given endpoint and a given model, we performed the following procedure 100 times:

  • We randomly picked a subset of disjointed prediction tasks, namely predictions not sharing any common visit (neither the blinded visit to forecast, nor the unblinded visits used to forecast);

  • We fit a multivariate linear mixed-effects model with a random intercept for each individual, using the following categorical explanatory variables: A/T/N/C stage at prediction, cohort, number of APOE-ε4 alleles, sex, level of education, number of unblinded visits, and continuous explanatory variables: actual patient’s age at prediction centered on 75 years and normalized by 7.5 years, years to prediction centered on three years and normalized by one year, mean time between unblinded visits centered on eight months and normalized by three months, percentage of missing data for the unblinded visits per modality.

Education level was classified as low if the subject had followed no more than nine years of formal education and high if the subject had followed at least 16 years of education, in accordance with the guidelines of the international standard classification of education of the United Nations.

We derived the mean and empirical confidence interval for the model intercept (the mean absolute error adjusted for cofounding factors) and regression coefficients (association between the mean absolute errors and each cofounding factor).

Comparison with alternative methods

We also compared AD Course Map with two additional alternative methods. The first, the no-change prediction or last-observation-carried-forward method, forecasts the future value of an endpoint to be the same as it was at the last unblinded visit. The second method, the linear mixed model method, involved generating a linear mixed-effects model for each endpoint, regressing endpoint values against the age of the participant at the successive visits, with a random intercept and a random slope per subject. The model was fitted to an unseen participant with a maximum a posteriori estimator67. We used the same validation procedure for all models: AD Course Map, RNN-AD, no-change prediction, and the linear mixed model.

Clinical trial simulation, enrichment evaluation, and sample size calculation

We simulated clinical trials in subjects at risk of developing AD or at an early stage of AD, as described in Table 3. For each trial, we selected all pairs of visits from all participants in the five data sets satisfying the following criteria:

  • The primary endpoint of the trial was assessed at both visits,

  • The patient fulfilled the inclusion criteria and had none of the exclusion criteria of the trial at the baseline visit,

  • Visits were separated by the duration of the trial, with a certain tolerance, depending on the trial.

For each pair of visits, the first was considered to be the baseline visit at inclusion and the second was considered to be the visit at the end of the trial. We did not take into account possible intermediate visits. Supplementary Table 8 summarizes the characteristics of participants included in all the simulated trials.

We first evaluated our prognostic enrichment strategy from a diagnostic test standpoint. For each trial, we forecast the value of the primary endpoint at the follow-up visit from the baseline data only, using the procedure described above. We calculated the median value of the outcome (i.e. the annual rate of change between baseline and follow-up visit). Participants above this threshold were considered to be fast progressors and formed the target population to be identified. A threshold for predicted outcomes was used to split the population into two groups: one considered at high risk of progression and the other at low risk of progression. We let the low-risk vs. high-risk threshold vary and calculated the resulting receiver operator characteristic (ROC) curve. On this curve, we identified the point splitting the population into a low-risk and a high-risk group of equal sizes, which was used as the operating point. We determined confidence intervals by performing our analyses 100 times on half the samples selected at random. Within any given run, any visit of a patient was used no more than once. The regions of confidence around ROC curves were constructed graphically as envelopes of both sensitivity and specificity confidence intervals along thresholds.

We evaluated possible biases in the group at high risk of progression. We used a logistic regression predicting selection status from population covariates (age, sex, education, number of APOE-ε4 alleles), cohort, and missing baseline modalities, together with the true indicator of fast progression. This last binary predictor was included to check for biases emerging in addition to the biases naturally present in the target population.

We then evaluated our prognostic enrichment strategy by calculating statistical power. We used a hypothetical individual treatment model: if the outcome actually worsened between baseline and follow-up for the participant, we changed the annual rate of change by the treatment effect, e.g. a 20% improvement of the annual rate of change. We did not apply a treatment effect if the participant improved between baseline and follow-up. For treatment effects ranging from 20% to 30%, we computed effect size (Cohen’s d) and sample size from a two-independent sample asymptotic t-test, with a 5% bilateral level of significance and 80% statistical power. We compared this sample size for the population selected with the trial inclusion criteria alone, and for the subpopulation identified as at high risk of progression. We reported the total sample size for two arms. We did not account for the drop-out rate in the calculation, as the goal was to compare statistical power with and without enrichment.

In these two experiments, we compared the results obtained with those for a method selecting APOE-ε4 carriers (heterozygous or homozygous) as participants at high risk of progression. We were unable to use this method for the trial targeting participants at risk of the onset of AD since this trial included only APOE-ε4 carriers.

Statistics and reproducibility

No statistical method was used to predetermine the sample size. We considered all available data from all the cohorts and excluded only the data of the participants with less than one year of follow-up. The experiments were not randomized since only observational data were used. The investigators were not blinded to allocation during experiments and outcome assessment since only observational data were used. Simulations of clinical trials included a random unblinded allocation into treated and control arms with assessment of biases in sex, center, level of education, and APOE genotype.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.