Abstract
Most approaches to machine learning from electronic health data can only predict a single endpoint. The ability to simultaneously simulate dozens of patient characteristics is a crucial step towards personalized medicine for Alzheimer’s Disease. Here, we use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to simulate detailed patient trajectories. We use data comprising 18-month trajectories of 44 clinical variables from 1909 patients with Mild Cognitive Impairment or Alzheimer’s Disease to train a model for personalized forecasting of disease progression. We simulate synthetic patient data including the evolution of each subcomponent of cognitive exams, laboratory tests, and their associations with baseline clinical characteristics. Synthetic patient data generated by the CRBM accurately reflect the means, standard deviations, and correlations of each variable over time to the extent that synthetic data cannot be distinguished from actual data by a logistic regression. Moreover, our unsupervised model predicts changes in total ADAS-Cog scores with the same accuracy as specifically trained supervised models, additionally capturing the correlation structure in the components of ADAS-Cog, and identifies subcomponents associated with word recall as predictive of progression.
Introduction
Two patients with the same disease may present with different symptoms, progress at different rates, and respond differently to the same therapy. Understanding how to predict and manage differences between patients is the primary goal of precision medicine^{1}. Computational models of disease progression developed using machine learning approaches provide an attractive tool to combat such patient heterogeneity. One day these computational models may be used to guide clinical decisions; however, current applications are limited both by the availability of data and by the ability of algorithms to extract insights from those data.
Most applications of machine learning to electronic health data have used techniques from supervised learning to predict specific endpoints^{2,3,4,5,6,7}. An alternative to developing separate supervised models to predict each characteristic is to build a single model that simultaneously predicts the evolution of many characteristics. Statistical models based on artificial neural networks provide one avenue for developing tools that can simulate patient progression in detail^{8,9,10}.
Clinical data present a number of challenges that are not easily overcome with current approaches to machine learning^{11}. For example, most clinical datasets contain multiple types of data (i.e., they are “multimodal”), have a relatively small number of samples, and many missing observations. Dealing with these issues typically requires extensive preprocessing^{3} or simply discarding variables that are too difficult to model. For example, one recent study focused on only four variables that were frequently measured across all 200,000 patients in an electronic health dataset from an intensive care unit^{9}. Developing methods that can overcome these limitations is a key step towards broader applications of machine learning in precision medicine.
Precision medicine is especially important for complex disorders in which patients exhibit different patterns of disease progression and therapeutic responses. Alzheimer’s Disease (AD) and Mild Cognitive Impairment (MCI) are complex neurodegenerative diseases with multiple cognitive and behavioral symptoms^{12}. The severity of these symptoms is usually assessed through exams such as the Alzheimer’s Disease Assessment Scale (ADAS)^{13} or Mini Mental State Exam (MMSE)^{14}. The heterogeneity of AD and related dementias makes these diseases difficult to diagnose, manage, and treat, leading to calls for better methods to forecast and monitor disease progression and to improve the design of AD clinical trials^{15}. The challenge of distinguishing between related disorders makes differential diagnosis also of interest^{16}.
A variety of disease progression models have been developed for MCI and AD using clinical data^{17,18,19,20,21} or imaging studies^{22,23,24,25,26,27,28,29,30,31}. Although previous approaches to forecasting disease progression have proven useful^{32,33}, they have focused on predicting a single endpoint, such as the change in the ADAS Cognitive (ADAS-Cog) score from baseline. Given that AD is heterogeneous and multifactorial, we set out to model the progression of more than just the ADAS-Cog score. We accomplished this by simulating the progression of entire patient profiles, describing the evolution of each subcomponent of the ADAS-Cog and MMSE scores, laboratory tests, and their associations with baseline clinical characteristics.
The manuscript is structured as follows. Section 2.1 describes our dataset and Section 2.2 describes our machine learning model. Section 2.3 assesses the goodness-of-fit of our machine learning model. Predictions for individual components are discussed in Section 2.4. Section 2.5 assesses the accuracy of our approach, which simulates each subcomponent of the cognitive scores, at predicting changes in overall disease activity measured by the ADAS-Cog exam. Finally, Section 3 discusses implications.
Results
Data
Our statistical models were trained and tested on data extracted from the Coalition Against Major Diseases (CAMD) Online Data Repository for AD (CODRAD)^{34,35}. We extracted 18-month longitudinal trajectories of 1909 patients with MCI or AD covering 44 variables including the individual components of the ADAS-Cog and MMSE scores, laboratory tests, and background information. Each patient profile consisted of 44 covariates (Table 1) that were classified as binary, ordinal, categorical, or continuous. Patient trajectories described the time evolution of all 44 variables in 3-month intervals. Detailed data processing steps are described in Section 5.1 and in the Supporting Information.
Modeling with Conditional Restricted Boltzmann Machines
A statistical model is generative if it can be used to draw new samples from an inferred probability distribution. Generative modeling of clinical data involves two tasks: i) randomly generating patient profiles with the same statistical properties as real patient profiles and ii) simulating the evolution of these patient profiles through time. Each of these tasks is complicated by common properties of clinical data, namely that they are typically multimodal and have many missing observations. Moreover, patient progression is best regarded as a stochastic process and it is important to capture the inherent randomness of the underlying processes in order to make accurate forecasts.
Let x_{i}(t) be a vector of covariates measured in patient i at time t. Creating a generative model to solve problem (i) involves finding a probability distribution P(x) such that we can randomly draw x_{i}(t = 0) ~ P(x). Solving problem (ii) involves finding a conditional probability distribution P(x(t) | x(t − 1)) so that we can iteratively draw x_{i}(t) ~ P(x_{i}(t) | x_{i}(t − 1)) to generate a patient trajectory.
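The two sampling steps can be sketched in a few lines of Python. This is a minimal illustration of the iteration only, not the CRBM itself: `sample_baseline` and `sample_next` are hypothetical stand-ins (independent Gaussians and a Gaussian random walk) for draws from P(x) and P(x(t) | x(t − 1)).

```python
import random

def sample_baseline(n_vars, rng):
    """Hypothetical stand-in for a draw from P(x): independent standard normals."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_vars)]

def sample_next(x_prev, rng, drift=0.1, noise=0.5):
    """Hypothetical stand-in for a draw from P(x(t) | x(t-1)): a noisy random walk."""
    return [x + drift + rng.gauss(0.0, noise) for x in x_prev]

def simulate_trajectory(n_vars, n_steps, rng):
    """Draw x(0), then iteratively draw x(t) conditioned on x(t-1)."""
    trajectory = [sample_baseline(n_vars, rng)]
    for _ in range(n_steps):
        trajectory.append(sample_next(trajectory[-1], rng))
    return trajectory

rng = random.Random(0)
# Six steps after baseline give seven time points in total.
traj = simulate_trajectory(n_vars=3, n_steps=6, rng=rng)
```

Replacing the two stand-in samplers with draws from a trained model would leave the outer loop unchanged.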
Our statistical model for patient progression is a latent variable model called a Conditional Restricted Boltzmann Machine (CRBM)^{36,37,38,39}. A CRBM is an undirected neural network capable of learning and sampling from the joint probability distribution of covariates across multiple times. To construct the model, the covariates were divided into two mutually exclusive subsets: static covariates that were determined solely from measurements at the beginning of the study \({{\bf{x}}}_{i}^{{\rm{static}}}(t=0)\), and dynamic covariates that changed during the study \({{\bf{x}}}_{i}^{{\rm{dynamic}}}(t)\). To train the model, we defined vectors \({{\bf{v}}}_{i}(t)=\{{{\bf{x}}}_{i}^{{\rm{dynamic}}}(t),{{\bf{x}}}_{i}^{{\rm{dynamic}}}(t-1),{{\bf{x}}}_{i}^{{\rm{static}}}(t=0)\}\) by concatenating neighboring time points with the static covariates. All neighboring time points are combined into a single dataset used to train a single statistical model that applies to all neighboring time points. Rather than directly modeling the correlations between these covariates, a CRBM models these correlations indirectly using a vector of latent variables h_{μ}(t). These latent variables can be interpreted in much the same way as directions identified through principal components analysis.
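The construction of the training vectors v_{i}(t) can be illustrated as follows. This is a sketch under simple assumptions: each patient's dynamic covariates are held as a list of per-visit lists, and the function name `build_training_vectors` and the example values are hypothetical.

```python
def build_training_vectors(dynamic, static):
    """Concatenate neighboring visits with the static covariates.

    dynamic: per-visit lists of dynamic covariates for one patient, baseline first.
    static:  static covariates measured at baseline.
    Returns one vector [x_dyn(t), x_dyn(t-1), x_static] per pair of neighboring visits.
    """
    return [dynamic[t] + dynamic[t - 1] + static for t in range(1, len(dynamic))]

# Illustrative values only: two dynamic covariates over three visits, two static covariates.
dynamic = [[10.0, 28.0], [11.0, 27.0], [13.0, 27.0]]
static = [72.0, 1.0]
vectors = build_training_vectors(dynamic, static)
```

Pooling the vectors from all patients and all neighboring visit pairs yields the single dataset on which one model is trained.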
The CRBM is a parametric statistical model for which the probability density is defined as
and Z is a normalization constant that ensures the total probability integrates to one. Here, a_{j}(v_{j}) and b_{μ}(h_{μ}) are functions that characterize the data types of covariate v_{j} and latent variable h_{μ}, respectively. The parameters σ_{j} and ε_{μ} set the scales of v_{j} and h_{μ}, respectively. We used 50 normally distributed latent variables that were lower truncated at zero, which is known as a rectified linear (ReLU) activation function in the machine learning literature^{40}. To deal with missing data, we divide the visible vector v into mutually exclusive groups v_{missing} and v_{observed} and impute the missing values by drawing from the conditional distribution p(v_{missing} | v_{observed}).
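For orientation, a generic RBM-style joint density that is consistent with the symbols defined above can be written as follows. This is an assumed form: the coupling matrix W_{jμ} is not named in the text, and the sign conventions for a_{j} and b_{μ} may differ from those used in the model.

\[
p({\bf{v}},{\bf{h}})={Z}^{-1}\exp \left(\sum _{j}{a}_{j}({v}_{j})+\sum _{\mu }{b}_{\mu }({h}_{\mu })+\sum _{j,\mu }{W}_{j\mu }\frac{{v}_{j}}{{\sigma }_{j}^{2}}\frac{{h}_{\mu }}{{\varepsilon }_{\mu }^{2}}\right)
\]

In a form of this kind the visible units are conditionally independent given the latent variables and vice versa, which is what makes block sampling of either layer straightforward.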
Traditionally, CRBMs are trained to maximize the likelihood of the data under the model using stochastic maximum likelihood^{41}. Recent results have shown that one can improve on maximum likelihood training of RBMs by adding an additional term to the loss function that measures how easy it is to distinguish patient profiles generated from the statistical model from real patient profiles^{42}. Therefore, we used a combined maximum likelihood and adversarial training method to fit the CRBM; more details of the machine learning methods are described in the Supporting Information. An overview of our statistical model is depicted in Fig. 1.
To explore and better quantify the performance of the CRBM, we used 5-fold cross validation (CV) for the analysis. On each of 5 folds a CRBM was trained on 80% of patients (75% for training, 5% for validation), and the remaining 20% was used to test that CRBM. In the analysis, results over the 5 folds were either averaged (and the standard deviation over the folds used as an uncertainty), or aggregated (e.g., for plotting distributions over the test data). Every result shown is on out-of-sample test data.
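A patient-level split of this kind might be set up as in the sketch below. The shuffling seed, the interleaved fold assignment, and the function name `five_fold_splits` are assumptions made for illustration; only the 75%/5%/20% proportions come from the text.

```python
import random

def five_fold_splits(patient_ids, seed=0, val_fraction=0.05):
    """Split patients into 5 folds; each fold is the test set once (about 20%).

    Of the remaining patients, val_fraction of the whole cohort is held out
    for validation and the rest (about 75%) is used for training.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[k::5] for k in range(5)]
    n_val = round(val_fraction * len(ids))
    splits = []
    for k in range(5):
        rest = [p for j, fold in enumerate(folds) if j != k for p in fold]
        splits.append({"train": rest[n_val:], "validation": rest[:n_val], "test": folds[k]})
    return splits

splits = five_fold_splits(range(1909))
```

Splitting by patient rather than by visit keeps all time points of a given patient in the same partition, so test trajectories are never partially seen during training.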
We generated two types of synthetic patient trajectories with the trained CRBMs: (i) synthetic trajectories starting from baseline values for real patients, and (ii) entirely synthetic patients. The first type is useful for many tasks in precision medicine and clinical trial simulation, while the second type has interesting applications for maintaining the privacy of clinical data^{43}. To generate trajectories of type (i), an initial population of patients was selected and then the model was used to predict their future state. To accomplish this, we started with baseline data and used the CRBM to iteratively add new time points. To generate trajectories of type (ii), entirely synthetic patients were generated by first simulating the baseline data, then iteratively adding new time points so that the patient data was entirely simulated for the full trajectory.
Goodness-of-fit of the model
The fundamental assumption underlying our analysis is that each time-dependent variable in a patient’s clinical record is stochastic; it does not take on a single deterministic value, but is sampled from a distribution of values. For example, if we had the ability to repeatedly measure the cognitive function of a particular patient 12 months from a baseline measurement, we would not observe the same value every time but would instead observe a distribution of values. A CRBM describes this time-dependent probability distribution associated with a patient’s characteristics. If we could actually perform this thought experiment, then we could compare the distribution of values observed for a particular patient at each time point to the distribution predicted by the model in order to assess how well the model fits the data. In practice, of course, we are only able to observe one draw from each patient’s distribution. Therefore, we use a variety of metrics to assess if the time-dependent means, standard deviations, correlations, etc., determined from the model are consistent with those observed in the test dataset.
To make these comparisons, we used the first type of synthetic patient trajectories described in the previous section. Starting with the baseline values for actual patients, we simulated the trajectory of these patients beyond baseline. We repeated this many times to measure the distribution of each covariate at each timepoint for each patient. All of the actual patients were taken from the test dataset associated with the appropriate CV fold.
First, we focus on assessing the time-dependent means and standard deviations computed from a CRBM. For a particular patient i, the observed value of variable j at time t is x_{ij}(t). The conditional mean and variance computed from the CRBM are denoted \({\rm{E}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]\) and \({\rm{Var}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]\), respectively. Because we only have a single observation for any given patient, we had to aggregate data across patients in order to perform any statistical comparisons. To do so, we computed a z-score
\[{z}_{ij}(t)=\frac{{x}_{ij}(t)-{\rm{E}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]}{\sqrt{{\rm{Var}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]}}\]
by subtracting the predicted mean and dividing by the predicted standard deviation for each observed data point. If the predicted means and standard deviations are consistent with the data then z_{ij}(t) will have zero mean and unit standard deviation when viewed across all of the patients (i.e., taking the average with respect to the patient index i). The computed means and standard deviations of the z-scores for each time-dependent variable are shown in Fig. 2, where they are compared to the ideal values of zero and one, respectively.
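The z-score check can be illustrated with a toy example in which the simulator is correct by construction. In this sketch the per-patient conditional mean and standard deviation are estimated from repeated simulated draws; the Gaussian toy data, the sample sizes, and the function name `z_scores` are assumptions for illustration.

```python
import random
import statistics

def z_scores(observed, simulated):
    """observed[i]: the single observed value for patient i.
    simulated[i]: repeated simulated draws for patient i, conditioned on baseline."""
    zs = []
    for x, draws in zip(observed, simulated):
        zs.append((x - statistics.fmean(draws)) / statistics.pstdev(draws))
    return zs

rng = random.Random(0)
# Toy cohort: each patient has their own true mean; observations and simulations
# share the same distribution, so the z-scores should be close to standard normal.
true_means = [rng.uniform(5.0, 30.0) for _ in range(1000)]
observed = [rng.gauss(m, 4.0) for m in true_means]
simulated = [[rng.gauss(m, 4.0) for _ in range(200)] for m in true_means]

z = z_scores(observed, simulated)
z_mean, z_sd = statistics.fmean(z), statistics.pstdev(z)
```

A simulator with a biased mean would shift `z_mean` away from zero, and one that understates the variance would push `z_sd` above one.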
We made a simplifying assumption to enable us to compute p-values for each of these comparisons. If the actual conditional distribution of x_{ij}(t) were normal with mean \({\rm{E}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]\) and variance \({\rm{Var}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(0)]\), then \({z}_{ij}(t) \sim {\mathscr{N}}(0,1)\) would be drawn from a standard normal distribution. As above, we aggregate z_{ij}(t) across patients to gain enough observations to perform a statistical test of whether the moments computed from the CRBM for variable j at time t are consistent with those observed in the test set. We computed the Kolmogorov-Smirnov test statistic for the mean μ and standard deviation σ, \({D}_{KS}(\mu ,\sigma )={{\rm{\sup }}}_{x}|\Phi (x;\mu ,\sigma )-\Phi (x;0,1)|\), and computed a p-value from the Kolmogorov distribution using the test statistic \(\sqrt{n}{D}_{KS}\). Differences that were significant at p < 0.05 and survive a Bonferroni multiple-testing correction^{44} are marked in red. A non-significant p-value for a variable indicates that the per-patient distribution of that variable obtained from the CRBM has a mean and standard deviation that are consistent with the data. That most variables show no statistically significant differences supports the accuracy of the CRBM. Additional comparisons of univariate distributions are provided in Supplementary Figs S1–S4, and Supplementary Fig. S5 shows a detailed comparison between the data and CRBM for the first and second moment statistics of each variable and each time point.
Next, we move beyond univariate statistics to assess if the CRBM correctly captures the correlations between the variables. Figure 3A shows that many pairs of variables are indeed correlated, so modeling these correlations is nontrivial. These equal-time correlations suggest that the variables can be grouped into three categories: cognitive scores, laboratory and clinical tests, and background information. There are strong correlations between variables belonging to the same category but only weak inter-category correlations. Figure 3B shows a comparison of the pairwise correlations computed from the model with those computed from the test data for all CV folds, with an R^{2} = 0.82 ± 0.01.
In addition to measuring the correlations between pairs of variables at the same time, one can measure correlations between pairs of variables at different times to get an idea for how the variables change over time. Comparisons between time-lagged correlations computed from the model and from the test data are shown in Fig. 3C for a 3-month time lag (R^{2} = 0.91 ± 0.01) and Fig. 3D for a 6-month time lag (R^{2} = 0.90 ± 0.01). The good agreement between the data and model for the 6-month time lag correlations is an important check because the CRBM only includes parameters to account for the 3-month autocorrelations.
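The test statistic and p-value described above can be sketched with the standard library alone. The grid-based supremum, the grid limits, and the truncated series for the Kolmogorov survival function are implementation assumptions; the function names are hypothetical.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance_to_standard_normal(mu, sigma, lo=-10.0, hi=10.0, n_grid=20001):
    """Approximate sup_x |Phi(x; mu, sigma) - Phi(x; 0, 1)| on a regular grid."""
    step = (hi - lo) / (n_grid - 1)
    xs = (lo + i * step for i in range(n_grid))
    return max(abs(normal_cdf(x, mu, sigma) - normal_cdf(x)) for x in xs)

def kolmogorov_sf(x, n_terms=100):
    """P(K > x) for the Kolmogorov distribution, from its alternating series."""
    if x <= 0.0:
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x) for k in range(1, n_terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def ks_p_value(mu, sigma, n):
    """p-value for z-scores with mean mu and standard deviation sigma over n patients."""
    return kolmogorov_sf(math.sqrt(n) * ks_distance_to_standard_normal(mu, sigma))

p_consistent = ks_p_value(0.0, 1.0, n=400)  # moments match the ideal values
p_shifted = ks_p_value(0.5, 1.0, n=400)     # mean of the z-scores is off by 0.5
```

A Bonferroni correction would then compare each p-value with 0.05 divided by the number of variable and time-point combinations tested.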
It is important to note that missing data can affect the ability to estimate correlations between variables. Imputation of missing data was not performed for the statistics calculated on the data; instead, only the samples in which both variables were present were used to compute a correlation. The fraction of time a pair of variables was present is represented in Fig. 3B–D with a blue color gradient. In addition, the R^{2} was computed using a weighted regression in which the weights on each correlation were determined by the fraction of data present in the computation.
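The treatment of missing values in these comparisons can be sketched as below: correlations use only the samples in which both variables are present, and the fraction present becomes the weight in the regression. Missing values are represented by `None`, and the function names are hypothetical; for a straight-line fit with an intercept, the weighted R^{2} equals the squared weighted correlation, which is the identity used here.

```python
import math

def pairwise_complete_corr(a, b):
    """Pearson correlation over samples where both values are present.
    Returns (correlation, fraction of samples with both values present)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n / len(a)

def weighted_r2(x, y, w):
    """R^2 of a weighted least-squares line y = a + b*x with weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy ** 2 / (sxx * syy)

r, fraction_present = pairwise_complete_corr([1, 2, None, 4, 5], [2, 4, 6, None, 10])
```

Here `x` and `y` in `weighted_r2` would hold the data and model correlations for each pair of variables, with `w` set to the corresponding fractions present.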
As a final test of goodness-of-fit, we evaluated the ability of logistic regression to differentiate actual and synthetic patient data at each time point beyond baseline. At each time point we compared actual and synthetic patient data in which each synthetic patient was conditioned on the corresponding actual patient’s baseline data. A logistic regression model was trained to differentiate these two groups of patients, and the performance was estimated using the Area Under the receiver operating characteristic Curve (AUC) metric computed using 5-fold cross validation. Note that this type of analysis is commonly employed to assess differences between populations using propensity score matching^{45}. The AUC was averaged over 100 simulations from the CRBM, with the mean and standard deviation for each CV fold shown in Fig. 4. For all points, the AUC of the logistic regression model is consistent with a score of 0.5, meaning the logistic regression model cannot reliably distinguish between actual and synthetic patient data at any timepoint.
Figures 2, 3 and 4 quantitatively assess the accuracy of the CRBM, directly comparing actual and synthetic patient data. These figures demonstrate the model is accurately predicting the first and second moments and correlations of the distribution of actual patient data, even at the per-patient level. The equal-time and lagged autocorrelations between variables as well as the mean and standard deviation of each variable at each time point are all well modeled. Additionally, a standard linear classifier is unable to distinguish actual and synthetic patient data at each time point beyond baseline. We now turn our attention to comparing the performance of the CRBM to other models and examining the ways the model may be applied to patient data.
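The logic of this discriminator test can be sketched with a small logistic regression fitted by gradient descent. Several simplifications are assumed for brevity: a single hold-out split replaces the 5-fold cross validation, the data are toy Gaussians rather than patient records, and all function names are hypothetical.

```python
import math
import random

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression weights (intercept first) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def decision_scores(w, X):
    return [w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) for xi in X]

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def discriminator_auc(actual, synthetic, rng):
    """Train on half of the pooled data (actual = 1, synthetic = 0), score the other half."""
    data = [(x, 1) for x in actual] + [(x, 0) for x in synthetic]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    w = fit_logistic([x for x, _ in train], [l for _, l in train])
    return auc(decision_scores(w, [x for x, _ in test]), [l for _, l in test])

rng = random.Random(0)
draw = lambda shift: [[rng.gauss(shift, 1.0) for _ in range(3)] for _ in range(300)]
auc_same = discriminator_auc(draw(0.0), draw(0.0), rng)     # same distribution
auc_shifted = discriminator_auc(draw(0.0), draw(2.0), rng)  # clearly different
```

An AUC near 0.5 on held-out data indicates that the classifier cannot separate the two groups, whereas a mismatch in the means drives the AUC towards 1.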
Simulating conditional patient trajectories
Predictions for any unobserved characteristics of a patient can be computed from our model by generating samples from the model distribution conditioned on the values of all observed variables. Sampling from the conditional distributions can be used to fill in any missing observations (i.e., imputation) or to forecast a patient’s future state. The ability to sample from any conditional distribution is one advantage a modeling framework based on CRBMs has over alternative generative models based on directed neural networks.
A CRBM is designed to capture the underlying time-dependent probability distribution of values under the assumption that disease progression is a stochastic process. To distill this distribution into a single ‘predicted’ value for variable j in patient i at time t, we computed the conditional expectation \({\rm{E}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(t=0)]\), which is the minimum mean squared error predictor for x_{ij}(t) under the model.
For comparison, we trained a series of Random Forest (RF) models that use the baseline data to predict each of the 35 time-dependent variables for all 6 time points. Note that there is a separate RF model for each variable at each time point, for a total of 210 different RF models. We also trained an ensemble of 6 multivariate RFs, each of which predicted all 35 covariates for a given time point, but were unable to get reasonable accuracies (see Supporting Information). For each RF model, mean imputation was used to replace missing data; when the dependent variable to be predicted was missing for a sample, that sample was excluded for both RF and CRBM models. The RMS error of the random forest prediction sets a benchmark for a predictive model that is specially trained for an individual problem. By contrast, a single CRBM model is used to predict all variables, and all time points. Figure 5 presents a detailed comparison between the single CRBM and the ensemble of 210 RF models. The accuracy of the CRBM is close to the specialized RF model for each variable and time point, with the CRBM performing best relative to the RF on the components of ADAS-Cog and more poorly on the non-cognitive variables.
As with most supervised models, a decision tree in a RF is trained to minimize the mean squared error. Therefore, a RF learns a function \({f}_{jt}({\bf{x}}(0))\approx {\rm{E}}[{x}_{ij}(t)|{{\bf{x}}}_{i}(t=0)]\). With that in mind, it is not surprising that the performance of the mean computed from the CRBM and the prediction from the ensemble of RFs have similar mean squared errors. This also means that, unlike a CRBM, the RF cannot generate realistic trajectories that capture the correlations between the covariates. Figure 6A shows that the RF ensemble underpredicts covariance values, as evidenced by the slope of the outlier-robust Theil-Sen regression between the data and the RF ensemble. By comparison, the CRBM is in much better agreement with the covariances computed from the data.
The difference between the higher-order statistics computed from the RF ensemble and the CRBM can be understood in terms of the law of total variance. If x(t) are the covariates at time t, then the law of total variance divides the covariance values into two contributions conditioned upon baseline covariates:
\[{\rm{Cov}}[{\bf{x}}(t)]={\rm{Cov}}[{\rm{E}}[{\bf{x}}(t)|{\bf{x}}(0)]]+{\rm{E}}[{\rm{Cov}}[{\bf{x}}(t)|{\bf{x}}(0)]]\]
Samples drawn from the CRBM reflect both terms, but deterministic predictions from the RF ensemble neglect contributions to the total covariance arising from the second term. Figure 6B illustrates this for the distribution of the ADAS-Cog11 score. Treating the predictions of the RF ensemble as trajectories would lead one to underestimate the variance of the distribution, particularly in the right tail. By contrast, the distribution computed from the CRBM fits the observed distribution quite well. More details on the comparison between RFs and the CRBM are provided in the Supporting Information.
In summary, stochastic simulations of disease progression have two main advantages compared to supervised machine learning models that aim to predict a single, predefined endpoint. The first is that the simultaneous modeling of entire patient profiles captures correlations between the covariates. This allows for the quantitative exploration of alternative endpoints and different patient subgroups. The second is that stochastic simulations provide in-depth estimates of risk for individual patients that can be aggregated to estimate risks in larger patient populations. Moreover, our model provides accurate estimates of variance in addition to forecasts for expected progression of individual patients (Figs S1 and S10).
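A one-variable toy simulation makes the role of the two terms concrete. In this sketch the conditional mean is taken to be a linear function of a single baseline value and the conditional noise is Gaussian; both choices and all numerical values are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1)
baseline = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# E[x(t) | x(0)]: what a deterministic regressor such as a random forest aims to predict.
cond_mean = [2.0 * b for b in baseline]
# Stochastic draws from the conditional distribution: mean plus patient-level noise.
samples = [m + rng.gauss(0.0, 1.5) for m in cond_mean]

var_of_cond_mean = statistics.pvariance(cond_mean)  # first term, about 4
mean_of_cond_var = 1.5 ** 2                         # second term, 2.25 by construction
total_var = statistics.pvariance(samples)           # about 6.25
```

Treating `cond_mean` as if it were a set of trajectories recovers only the first term, which is why the deterministic predictions understate the spread of the observed distribution.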
Forecasting and interpreting disease progression
In this last section, we focus on disease progression as assessed by the overall ADAS-Cog11 score rather than the individual components. Our model is trained to simulate the evolution of the individual components of the cognitive exams, laboratory tests, and clinical data. As a result, it is also possible to simulate the evolution of any combination of these variables, such as the 11-component ADAS-Cog score that is commonly used as a measure of overall disease activity. Note that the ADAS delayed word recall component, which is present in the dataset, is not part of the 11-component ADAS-Cog score but can be used as an additional probe of disease severity, especially for MCI^{46}. Figure 7A shows a violin plot describing the evolution of the ADAS-Cog score distribution within the population. The data and model show the same trend: an increase in the mean ADAS-Cog score with time along with a widening right tail of the distribution. This implies that much of the trend of increasing ADAS-Cog scores in the population is driven by a subset of patients.
As in the previous section, the CRBM can be used to compute the mean ADAS-Cog11 score for a patient conditioned on the baseline measurements of each variable. In Fig. 7B, we have compared the accuracy of the CRBM predictions for the change in ADAS-Cog11 score from baseline to each possible endpoint in 3-month steps through 18 months to a variety of supervised models (a linear regression, a random forest, and a deep neural network). The figure shows the root-mean-square error (RMS error) of each model’s prediction for the 18-month change in ADAS-Cog11 score. The figure shows the mean value and standard deviation over all 5 CV folds. Each of the supervised models was trained to predict a specific endpoint (e.g., the change in ADAS-Cog score after 6 months). The CRBM has equivalent performance to these models over the entire range. That is, despite only being trained on data on the individual components with a 3-month time lag, the mean ADAS-Cog11 score computed from the CRBM is as accurate as supervised models trained only for this task. More details on the comparison are provided in the Supporting Information.
To gain more insight into the origin of fast and slow progressing patients, we simulated 18-month patient trajectories conditioned on a baseline ADAS-Cog11 score of 10 and an initial diagnosis of MCI. This initial ADAS-Cog11 score was chosen because it is representative of a typical patient with MCI. The 5% of synthetic patients with the largest ADAS-Cog11 score increase were designated “fast progressors” and the 5% of synthetic patients with the smallest ADAS-Cog11 score increase were designated “slow progressors”. Differences in baseline characteristics between the fast and slow progressors (the “absolute effect size”) were quantified using the absolute value of Cohen’s d statistic^{47}, as shown in Fig. 7C. The majority of baseline variables are not associated with disease progression; however, there are strong associations with cognitive tests based on recall (i.e., MMSE recall, ADAS word recall, and ADAS delayed word recall) and word recognition. That is, patients with poor performance on the ADAS delayed word recall test tend to progress more rapidly, even after controlling for the total ADAS-Cog11 score. Variables associated with progression in patients who already have AD are described in the Supporting Information.
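The fast-versus-slow comparison can be sketched as follows. The pooled-standard-deviation form of Cohen’s d is assumed here, since the text does not state which variant was used, and the function names and the dictionary layout of the baseline data are hypothetical.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (an assumed variant)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

def absolute_effect_sizes(changes, baseline, fraction=0.05):
    """changes[i]: simulated score change for synthetic patient i.
    baseline[i]: dict mapping baseline variable names to values for patient i.
    Compares the top and bottom `fraction` of patients ranked by score change."""
    order = sorted(range(len(changes)), key=lambda i: changes[i])
    k = max(2, int(fraction * len(changes)))
    slow, fast = order[:k], order[-k:]
    return {
        name: abs(cohens_d([baseline[i][name] for i in fast],
                           [baseline[i][name] for i in slow]))
        for name in baseline[0]
    }

d_example = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])  # means differ by 1, pooled SD is 2
```

Variables with large absolute effect sizes are those whose baseline values differ most between the two extreme groups.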
Discussion
The ability to simulate the stochastic disease progression of individual patients in high resolution could have a transformative impact on patient care by enabling personalized data-driven medicine. Each patient with a given diagnosis has unique risks and a unique response to therapy. Due to this heterogeneity, predictive models cannot currently make individual-level forecasts with a high degree of confidence. Therefore, it is critical that data-driven approaches to personalized medicine and clinical decision support provide estimates of variance in addition to expected outcomes.
Previous efforts for modeling disease progression in AD have focused on predicting changes in predefined outcomes such as the ADAS-Cog11 score or the probability of conversion from MCI to AD^{17,18,19,20,21,22,23,24,25,27,28,29,30,31}. Here, we have demonstrated that an approach based on unsupervised machine learning can create stochastic simulations of entire patient trajectories that achieve the same level of performance on individual prediction tasks as specific models while also accurately capturing correlations between variables. Machine learning-based generative models provide much more information than specific models, thereby enabling a simultaneous and detailed assessment of different risks.
Our approach to modeling patient trajectories in AD overcomes many of the limitations of previous applications of machine learning to clinical data^{3,8,9,11}. CRBMs can directly integrate multimodal data with both continuous and discrete variables, and time-dependent and static variables, within a single model. In addition, bidirectional models like CRBMs can easily handle missing observations in the training set by performing automated imputation during training. Combined, these factors dramatically reduce the number of data preprocessing steps needed to train a generative model to produce synthetic clinical data. We found that a single time-lagged connection was sufficient for explaining temporal correlations in AD; additional connections may be required for diseases with more complex temporal evolution.
The utility of cognitive scores as a measure of disease activity for patients with AD has been called into question numerous times^{48}. Here, we found that the components of the ADAS-Cog and MMSE scores were only weakly correlated with other clinical variables. One possible explanation is that the observed stochasticity may simply reflect heterogeneity in performance on the cognitive exam that cannot be predicted from any baseline measurements. However, we did find that some of the individual components of the baseline cognitive scores are predictive of progression. Specifically, patients with poor performance on word recall tests tend to progress more rapidly than other patients, even after controlling for the ADAS-Cog11 score.
There are a number of improvements to our dataset and methodology that are important steps for future research. Here, we limited ourselves to modeling 44 variables that are commonly measured in AD clinical trials. We excluded some interesting covariates such as leukocyte populations because they were not measured in the majority of patients in our dataset constructed from the CAMD database. We also lack data from neuroimaging studies and tests for levels of amyloid-β. Incorporating additional data into our model development will be a crucial next step, especially as surrogate biomarkers become a standard part of clinical trials.
Conclusions
This work provides a proof-of-concept that patient-level simulations are technologically feasible with the right tools and data. We have shown that generative models capable of sampling conditional probability distributions over a diverse array of clinical variables can accurately model the progression of Alzheimer’s Disease. These models have been broadly validated, from the ability to capture statistics of distributions of clinical variables to their ability to predict progression and model composite endpoints. The flexibility and diverse functionality of these models to handle the challenges of clinical data, make probabilistic predictions for individual patients, and accurately predict disease progression means that there are clear applications for clinical trials and precision medicine.
The approach to simulating disease progression that we describe here can be easily extended to other diseases. Widespread application of generative models to clinical data could produce synthetic datasets with lower privacy concerns than real medical data^{10}, or could be used to run simulated clinical trials to optimize study design or as synthetic control arms. In certain disease areas, tools that use simulations to forecast risks for specific individuals could help doctors choose the right treatments for their patients. Currently, progress towards these goals is slowed by the limited availability of high quality longitudinal health datasets and the limited ability of current machine learning methods to produce insights from these datasets.
Methods
Data Processing
Our statistical model was trained and tested on data extracted from the Coalition Against Major Diseases (CAMD) Online Data Repository for AD (CODR-AD)^{34,35}. The development and composition of this database have been previously described in detail^{35}. The CAMD database contains 6955 patients from the placebo arms of 28 clinical trials on MCI and AD. These trials have varying duration, visit frequency, and inclusion criteria; nearly all patients have no data beyond approximately 18 months. We chose a 3-month spacing between time points based on the visit frequency of the bulk of patients with long trajectories to ensure that most patients had no gaps in their data. The fall-off in patient data after the 18-month time point led us to select that as the final time point. Therefore, patient trajectories are represented by 7 time points (0, 3, 6, 9, 12, 15, and 18 months).
Data in the CAMD database is stored in the CDISC format^{49,50}. The covariates used in our statistical model of AD progression originate from tables in the database on demographics, disposition events, laboratory results, medical histories, questionnaires, subject characteristics, subject visits, and vital signs. We designated some variables, such as height, as static. Multiple values for any of the static variables were averaged to produce a single estimate. Time-dependent variables were bucketed into 90-day windows centered on each time point. Multiple entries in any window were averaged, or extremal values were taken as appropriate. Any data with units (such as laboratory tests) were converted to a common unit for each test for all patients (e.g., g/L for triglycerides). Results for both the ADAS-Cog and MMSE tests were available for many patients to the level of individual components. Individual question data were available for some patients, which we aggregated into component scores. A final processing step converted data into numerical values more suitable for statistical modeling. Categorical variables were one-hot encoded and positive continuous variables were log-transformed and standardized. All variables were transformed back to canonical form before analysis.
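The bucketing and transformation steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the approximation of 3 months as 90 days, the half-open window boundaries, and the use of the population standard deviation are all assumptions made for illustration, and only the averaging rule (not the extremal-value rule) is shown.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

# Assumption: the 7 time points (0, 3, ..., 18 months) are approximated
# as multiples of 90 days.
TIME_POINTS_DAYS = [90 * k for k in range(7)]
HALF_WINDOW = 45  # half of a 90-day window centered on each time point


def bucket_visits(visits):
    """Average (day, value) measurements that fall into the same window.

    Returns one entry per time point; None marks a window with no data.
    Measurements outside every window are dropped.
    """
    grouped = defaultdict(list)
    for day, value in visits:
        for idx, center in enumerate(TIME_POINTS_DAYS):
            # Half-open window [center - 45, center + 45) so windows never overlap.
            if center - HALF_WINDOW <= day < center + HALF_WINDOW:
                grouped[idx].append(value)
                break
    return [mean(grouped[i]) if i in grouped else None
            for i in range(len(TIME_POINTS_DAYS))]


def log_standardize(values):
    """Log-transform positive values, then shift and scale the logs to zero
    mean and unit (population) standard deviation.

    Missing entries (None) are passed through unchanged. Assumes at least
    two distinct observed values, otherwise the scale would be zero.
    """
    logs = [math.log(v) for v in values if v is not None]
    mu, sigma = mean(logs), pstdev(logs)
    return [None if v is None else (math.log(v) - mu) / sigma for v in values]


def one_hot(category, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if category == c else 0 for c in categories]
```

For example, `bucket_visits([(2, 1.0), (85, 2.0), (95, 4.0), (360, 3.0)])` places the measurements from days 85 and 95 in the 3-month window and averages them, leaving windows without visits as missing.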
Our statistical model can perform imputation of missing data during training. However, using covariates that are missing in a large fraction of patients would lead to poor performance. Therefore, we chose 44 variables that were observed in a reasonably large fraction of patients. Table 1 describes each of the variables included in our analysis. Because we are interested in modeling AD progression, we focused on patients in the CAMD database with long trajectories. This led us to select the 1909 patients from CAMD that have a valid ADAS-Cog score (i.e., data is not missing for any of the 11 components) for either of the 15-month or 18-month time points.
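The selection rule for patients with long trajectories can be expressed as a short filter. The sketch below assumes a hypothetical in-memory layout (a dictionary mapping patient identifiers to a list of 7 visits, each visit being either `None` or a list of 11 ADAS-Cog component scores); the names and layout are illustrative and are not taken from the CAMD database schema.

```python
N_ADAS_COMPONENTS = 11
# Positions of the 15- and 18-month visits in the 7-point grid
# (0, 3, 6, 9, 12, 15, 18 months).
LATE_TIME_INDICES = (5, 6)


def has_valid_adas(components):
    """True when all 11 ADAS-Cog component scores are present."""
    return (
        components is not None
        and len(components) == N_ADAS_COMPONENTS
        and all(c is not None for c in components)
    )


def select_patients(trajectories):
    """Keep patients with a valid ADAS-Cog score at 15 or 18 months.

    `trajectories` maps a patient id to a list of 7 visits (hypothetical
    layout); each visit is None or a list of 11 component scores.
    """
    return [
        pid
        for pid, visits in trajectories.items()
        if any(has_valid_adas(visits[t]) for t in LATE_TIME_INDICES)
    ]
```

A patient whose only late visit has one missing component is excluded, since the total score cannot be computed from an incomplete set of components.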
One feature of the real patient data that complicates the comparison in Fig. 4 is the presence of missing data. To handle missing data, we mean-impute each missing variable. Because the synthetic data has no missing entries, this would create a significant difference between real and synthetic data and a classifier would be able to distinguish them based solely on the missing data. However, since there is a one-to-one correspondence between real and synthetic patients, we assign the mean-imputed entries to the corresponding entries in the synthetic data. This removes the ability of the logistic regression to distinguish between the two groups based on the missingness of data. We note that as a natural consequence, higher proportions of missing data limit the classification ability of the logistic regression.
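The paired imputation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: rows are assumed to be aligned so that row i of the synthetic data corresponds to row i of the real data, the column means are assumed to be computed from the observed real values only, and the function name is hypothetical.

```python
from statistics import mean


def mean_impute_pair(real, synthetic):
    """Mean-impute missing real entries and mirror them in the synthetic data.

    `real` is a list of rows with None marking missing values; `synthetic`
    has the same shape and no missing values. Assumes every column has at
    least one observed real value.
    """
    n_cols = len(real[0])
    # Column means from the observed real entries only (an assumption).
    col_means = [
        mean(row[j] for row in real if row[j] is not None)
        for j in range(n_cols)
    ]
    real_out, synth_out = [], []
    for real_row, synth_row in zip(real, synthetic):
        real_out.append(
            [col_means[j] if v is None else v for j, v in enumerate(real_row)]
        )
        # Overwrite the synthetic entry wherever the paired real entry was
        # missing, so missingness carries no signal for a classifier.
        synth_out.append(
            [col_means[j] if real_row[j] is None else synth_row[j]
             for j in range(n_cols)]
        )
    return real_out, synth_out
```

After this step, the two datasets are identical in every position where the real value was missing, which is why a larger fraction of missing data leaves the classifier with less to work with.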
Data Availability
Data used in the preparation of this article were obtained from the Coalition Against Major Diseases (CAMD) database. In 2008, Critical Path Institute, in collaboration with the Engelberg Center for Health Care Reform at the Brookings Institution, formed the Coalition Against Major Diseases (CAMD). The Coalition brings together patient groups, biopharmaceutical companies, and scientists from academia, the U.S. Food and Drug Administration (FDA), the European Medicines Agency (EMA), the National Institute of Neurological Disorders and Stroke (NINDS), and the National Institute on Aging (NIA). CAMD includes over 200 scientists from member and non-member organizations. The data available in the CAMD database has been volunteered by CAMD member companies and non-member organizations.
References
 1.
Collins, F. S. & Varmus, H. A new initiative on precision medicine. New Engl. J. Medicine 372, 793–795 (2015).
 2.
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digit. Medicine 1, 18 (2018).
 3.
Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. reports 6, 26094 (2016).
 4.
Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, 301–318 (2016).
 5.
Lasko, T. A., Denny, J. C. & Levy, M. A. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one 8, e66341 (2013).
 6.
Lipton, Z. C., Kale, D. C., Elkan, C. & Wetzel, R. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677 (2015).
 7.
Myers, P. D., Scirica, B. M. & Stultz, C. M. Machine learning improves risk stratification after acute coronary syndrome. Sci. reports 7, 12692 (2017).
 8.
Choi, E. et al. Generating multi-label discrete electronic health records using generative adversarial networks. arXiv preprint arXiv:1703.06490 (2017).
 9.
Esteban, C., Hyland, S. L. & Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633 (2017).
 10.
Beaulieu-Jones, B. K., Wu, Z. S., Williams, C. & Greene, C. S. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv 159756 (2017).
 11.
Goldstein, B. A., Navar, A. M., Pencina, M. J. & Ioannidis, J. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J. Am. Med. Informatics Assoc. 24, 198–208 (2017).
 12.
Kumar, A. et al. A review on Alzheimer’s disease pathophysiology and its management: an update. Pharmacol. Reports 67, 195–203 (2015).
 13.
Rosen, W. G., Mohs, R. C. & Davis, K. L. A new rating scale for Alzheimer’s disease. The Am. journal psychiatry (1984).
 14.
Folstein, M. F., Folstein, S. E. & McHugh, P. R. “Mini-Mental State”: a practical method for grading the cognitive state of patients for the clinician. J. psychiatric research 12, 189–198 (1975).
 15.
Cummings, J. et al. Drug development in Alzheimer’s disease: the path to 2025. Alzheimer’s research & therapy 8, 39 (2016).
 16.
Raamana, P. R. et al. Three-Class Differential Diagnosis among Alzheimer Disease, Frontotemporal Dementia, and Controls. Front. Neurol. 5, 71, https://doi.org/10.3389/fneur.2014.00071 (2014).
 17.
Rogers, J. A. et al. Combining patient-level and summary-level data for Alzheimer’s disease modeling and simulation: a beta regression meta-analysis. J. pharmacokinetics pharmacodynamics 39, 479–498 (2012).
 18.
Ito, K. et al. Understanding placebo responses in Alzheimer’s disease clinical trials from the literature metadata and CAMD database. J. Alzheimer’s Dis. 37, 173–183 (2013).
 19.
Kennedy, R. E., Cutter, G. R., Wang, G. & Schneider, L. S. Post hoc analyses of APOE genotype-defined subgroups in clinical trials. J. Alzheimer’s Dis. 50, 1205–1215 (2016).
 20.
Tishchenko, I., Riveros, C., Moscato, P. & Diseases, C. A. M. Alzheimer’s disease patient groups derived from a multivariate analysis of cognitive test outcomes in the Coalition Against Major Diseases dataset. Futur. science OA 2, FSO140 (2016).
 21.
Szalkai, B. et al. Identifying combinatorial biomarkers by association rule mining in the CAMD Alzheimer’s database. Arch. gerontology geriatrics 73, 300–307 (2017).
 22.
Mueller, S. G. et al. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & Dementia: journal Alzheimer’s Assoc. 1, 55–66 (2005).
 23.
Risacher, S. L. et al. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Curr. Alzheimer Res. 6, 347–361 (2009).
 24.
Hinrichs, C. et al. Predictive markers for AD in a multimodality framework: an analysis of MCI progression in the ADNI population. Neuroimage 55, 574–589 (2011).
 25.
Ito, K. et al. Disease progression model for cognitive deterioration from Alzheimer’s Disease Neuroimaging Initiative database. Alzheimer’s & Dementia: journal Alzheimer’s Assoc. 7, 151–160 (2011).
 26.
Weiner, M. W. et al. The Alzheimer’s Disease Neuroimaging Initiative: a review of papers published since its inception. Alzheimer’s & Dementia: The J. Alzheimer’s Assoc. 8, S1–68, https://doi.org/10.1016/j.jalz.2011.09.172 (2012).
 27.
Suk, H.-I. & Shen, D. Deep learning-based feature representation for AD/MCI classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 583–590 (Springer, 2013).
 28.
Suk, H.-I. et al. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage 101, 569–582 (2014).
 29.
Liu, S. et al. Early diagnosis of Alzheimer’s disease with deep learning. In Biomedical Imaging (ISBI), 2014 IEEE 11th International Symposium on, 1015–1018 (IEEE, 2014).
 30.
Ortiz, A., Munilla, J., Gorriz, J. M. & Ramirez, J. Ensembles of deep learning architectures for the early diagnosis of the Alzheimer’s disease. Int. journal neural systems 26, 1650025 (2016).
 31.
Samper-Gonzalez, J. et al. Yet another ADNI machine learning paper? Paving the way towards fully-reproducible research on classification of Alzheimer’s disease. In International Workshop on Machine Learning in Medical Imaging, 53–60 (Springer, 2017).
 32.
Corrigan, B. et al. Clinical trial simulation in Alzheimer’s disease. In Applied Pharmacometrics, 451–476 (Springer, 2014).
 33.
Romero, K. et al. The future is now: Model-based clinical trial design for Alzheimer’s disease. Clin. Pharmacol. & Ther. 97, 210–214 (2015).
 34.
Romero, K. et al. The Coalition Against Major Diseases: developing tools for an integrated drug development process for Alzheimer’s and Parkinson’s diseases. Clin. Pharmacol. & Ther. 86, 365–367 (2009).
 35.
Neville, J. et al. Development of a unified clinical trial database for Alzheimer’s disease. Alzheimer’s & Dementia: journal Alzheimer’s Assoc. 11, 1212–1221 (2015).
 36.
Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cogn. science 9, 147–169 (1985).
 37.
Hinton, G. A practical guide to training restricted Boltzmann machines. Momentum 9, 926 (2010).
 38.
Taylor, G. W., Hinton, G. E. & Roweis, S. T. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems, 1345–1352 (2007).
 39.
Mnih, V., Larochelle, H. & Hinton, G. E. Conditional restricted Boltzmann machines for structured output prediction. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 514–522 (AUAI Press, 2011).
 40.
Tubiana, J. & Monasson, R. Emergence of compositional representations in restricted Boltzmann machines. Phys. review letters 118, 138301 (2017).
 41.
Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, 1064–1071 (ACM, 2008).
 42.
Fisher, C. K., Smith, A. M. & Walsh, J. R. Boltzmann encoded adversarial machines. arXiv preprint arXiv:1804.08682 (2018).
 43.
Dankar, F. K. & El Emam, K. Practicing differential privacy in health care: A review. Transactions on Data Priv. 6, 35–67 (2013).
 44.
Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc. 56, 52–64, https://doi.org/10.1080/01621459.1961.10482090 (1961).
 45.
Zhang, Z. Use of area under the curve (AUC) from propensity model to estimate accuracy of the estimated effect of exposure. Master’s thesis, University of Pittsburgh (2007).
 46.
Sano, M. et al. Adding delayed recall to the Alzheimer Disease Assessment Scale is useful in studies of mild cognitive impairment but not Alzheimer disease. Alzheimer Dis Assoc Disord 25, 122–127 (2011).
 47.
Cohen, J. Statistical power analysis for the behavioral sciences (Lawrence Erlbaum Associates, 1988).
 48.
Benge, J. F., Balsis, S., Geraci, L., Massman, P. J. & Doody, R. S. How well do the ADAS-Cog and its subscales measure cognitive dysfunction in Alzheimer’s disease? Dementia geriatric cognitive disorders 28, 63–69 (2009).
 49.
Kubick, W. R., Ruberg, S. & Helton, E. Toward a comprehensive CDISC submission data standard. Drug information journal 41, 373–382 (2007).
 50.
Hume, S., Aerts, J., Sarnikar, S. & Huser, V. Current applications and future directions for the CDISC operational data model standard: A methodological review. J. biomedical informatics 60, 352–362 (2016).
Acknowledgements
We would like to thank Yannick Pouliot, Pankaj Mehta, and Diane Dickel for helpful comments while preparing the manuscript.
Author information
Author notes
Data used in the preparation of this article were obtained from the Coalition Against Major Diseases database (CAMD). As such, the investigators within CAMD contributed to the design and implementation of the CAMD database and/or provided data, but did not participate in the analysis of the data or the writing of this report.
Contributions
C.K.F., A.M.S. and J.R.W. developed the idea, J.R.W. constructed and tested the models, and C.K.F., A.M.S. and J.R.W. wrote the paper.
Ethics declarations
Competing Interests
C.K.F., A.M.S. and J.R.W. are owners and employees of Unlearn.AI, Inc., a company that creates software for clinical research. CAMD includes members from the biopharmaceutical industry.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fisher, C.K., Smith, A.M., Walsh, J.R. et al. Machine learning for comprehensive forecasting of Alzheimer’s Disease progression. Sci Rep 9, 13622 (2019). https://doi.org/10.1038/s41598-019-49656-2