A blood-based signature of cerebral spinal fluid Aβ1–42 status

It is increasingly recognized that Alzheimer’s disease (AD) exists before dementia is present and that shifts in amyloid beta occur long before clinical symptoms can be detected. Early detection of these molecular changes is key to the success of interventions aimed at slowing rates of cognitive decline. Recent evidence indicates that, of the two established methods for measuring amyloid, a decrease in cerebral spinal fluid (CSF) amyloid β1−42 (Aβ1−42) may be an earlier indicator of Alzheimer’s disease risk than measures of amyloid obtained from positron emission tomography (PET). However, CSF collection is highly invasive and expensive. In contrast, blood collection is routinely performed, minimally invasive and cheap. In this work, we use a machine learning approach to develop a blood-based signature that provides a cheap and minimally invasive estimate of an individual’s CSF amyloid status. We show that a Random Forest model derived from plasma analytes can accurately classify subjects as having abnormal (low) CSF Aβ1−42 levels indicative of AD risk (0.84 AUC, 0.78 sensitivity, and 0.73 specificity). Refinement of the modeling indicates that only APOEε4 carrier status and four analytes are required to achieve a high level of accuracy. Furthermore, we show in an independent validation cohort that individuals with predicted abnormal CSF Aβ1−42 levels transitioned to an AD diagnosis over 120 months significantly faster than those predicted to have normal CSF Aβ1−42 levels, and that the resulting model also performs reasonably across PET Aβ1−42 status. This is the first study to show that a machine learning approach, using plasma protein levels, age and APOEε4 carrier status, is able to predict CSF Aβ1−42 status, the earliest risk indicator for AD, with high accuracy.


Introduction
Alzheimer's disease (AD) is a terminal neurodegenerative disease that has historically been diagnosed based on "clinically significant" cognitive decline of an individual and exclusion of other conditions. However, it is increasingly recognized that AD is a decades-long neurodegenerative process, with shifts in amyloid β 1−42 (Aβ 1−42) providing the first indicators of disease development, long before "Alzheimer's dementia" (significant cognitive decline) is clinically apparent 1-5. There is currently no cure or disease-modifying therapy for this terminal illness despite hundreds of clinical trials being conducted since 2002 6,7. It is hypothesized that the high failure rate of AD trials is in part due to the trials targeting AD patients with significant cognitive impairment, who are therefore in the late stages of the disease and likely have suffered a level of brain tissue loss that cannot be compensated for. Compounding this is the discovery that many patients enrolled in clinical trials, up to 20% in one study 8, were retrospectively found to have normal levels of amyloid and hence did not have AD 9
Table 1. Demographic characteristics of the ADNI data set used separated into training and validation cohorts, corresponding to individuals with and without CSF measures respectively. Columns in each cohort provide a further breakdown into individuals that are cognitively normal (CN), have mild cognitive impairment (MCI) or Alzheimer's disease (AD). The units of each cell are shown in parentheses in the row names and commonly include number of patients (n) or mean of a given quantity (mean). If a secondary measure (percent (%) or standard deviation (sd)) is also present, it is listed in brackets next to the primary measure.
566 individuals by baseline diagnosis and CSF availability is shown in Table 1. This cohort was split into a training and a validation cohort with 356 and 210 individuals with and without measures of CSF Aβ 1−42, respectively, where CSF was measured using the Luminex xMAP platform and Innogenetics INNO-BIA AlzBio3 immunoassay. The training set was used to build predictive models and evaluate their performance directly against the measured Aβ 1−42 levels. The validation cohort was not used to build the predictive model, but was instead used to evaluate the generalizability and utility of the model's predictions. For each cohort, we also considered a subset of individuals for whom Aβ 1−42 status from PET was available for at least one timepoint (not just at baseline), using either the Pittsburgh compound B (PiB) or Florbetapir (AV45) tracer, for further validation of our modelling. Further demographic information for these cohorts can be found in Supplementary Tables 1 to 3.

Binary and regression modelling tasks
The primary aim of this work was to produce a model that predicts whether an individual's CSF Aβ 1−42 level is below the recognized clinical threshold of 192 pg/ml for the platform used, indicating an abnormal CSF Aβ 1−42 level, and hence AD risk. Given the continuous CSF Aβ 1−42 measures in the ADNI cohort, two approaches were considered: a 'regression' task, learning the continuous CSF Aβ 1−42 levels and thresholding these post-prediction, and a 'binary' task, learning the dichotomized CSF Aβ 1−42 status based on clinical thresholds directly. While both tasks result in a binary classifier, they face different trade-offs. The regression task makes use of the full information in the CSF levels but needs to learn a suitable threshold to convert its continuous predictions into binary labels, whereas the binary task only learns from the dichotomized CSF levels. Given these trade-offs, we have investigated both modelling approaches throughout this work.
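The difference between the two tasks can be illustrated with a minimal Python sketch. The 192 pg/ml cutoff is taken from the text; the measured values, model outputs and learned cutoff are hypothetical and serve only to show where the dichotomization happens in each task.

```python
# Clinical cutoff quoted in the text for the platform used (pg/ml).
CLINICAL_CUTOFF = 192.0

def csf_status(level, cutoff=CLINICAL_CUTOFF):
    """Return 1 (abnormal, low CSF Abeta1-42) if level is below cutoff, else 0."""
    return 1 if level < cutoff else 0

# Hypothetical measured CSF levels (pg/ml) for four individuals.
measured = [150.0, 210.5, 191.9, 192.0]

# Binary task: the model is trained on labels dichotomized up front.
binary_targets = [csf_status(v) for v in measured]

# Regression task: the model is trained on `measured` itself, and its
# continuous predictions are dichotomized afterwards with a learned cutoff.
predicted = [170.2, 188.0, 201.3, 230.0]  # hypothetical model outputs
learned_cutoff = 185.0                    # hypothetical learned value
regression_calls = [csf_status(v, learned_cutoff) for v in predicted]

print(binary_targets)
print(regression_calls)
```

In the regression task the cutoff applied to the predictions need not equal the clinical cutoff, which is why it is treated as a quantity to be learned.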

Statistical modeling
We used Random Forests (RF) as the modeling approach to predict CSF Aβ 1−42 levels for both the binary and regression tasks. RF is a widely used ensemble method that has a number of advantages for the small sample size and disparate types of features observed in the ADNI dataset. RFs are invariant to the scale of the observed features and make few assumptions about the distributions of observed data, allowing them to be applied to multiple data modalities easily.
All analysis in this work made use of the RF implementation in the R package ranger 27. Each forest contained 2000 individual trees, each making use of a random selection of $p^{3/4}$ features, where $p$ was the total number of variables used in a given model. These parameter choices were based on recommendations provided in Ishwaran et al. (2011) 28. All other parameters in the ranger implementation were set to their default values.
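As a small worked example of the feature-sampling rule, the sketch below computes the number of features drawn for a given total p. Rounding down to an integer is an assumption (the text gives only the exponent), the value p = 151 is purely illustrative, and the dictionary keys are descriptive labels rather than the exact argument names of any package.

```python
import math

def features_per_split(p):
    """Number of randomly selected features, taken as floor(p ** (3/4)).

    Rounding down is an assumption; the text only states the exponent.
    """
    return max(1, math.floor(p ** 0.75))

N_TREES = 2000  # forest size stated in the text
p = 151         # illustrative feature count, not taken from the paper

rf_settings = {"num_trees": N_TREES, "mtry": features_per_split(p)}
print(rf_settings)
```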
To estimate the performance of our models, we used a nested cross-validation (CV) framework, whereby an inner CV was used to determine model parameters, and the outer CV was used to estimate the model's performance on unseen data 29. In this study, we used 3 repetitions of 3-fold CV for the inner loop and 10 repetitions of 10-fold CV for the outer loop.
As the RF used pre-determined parameter values, only a single parameter had to be determined: the threshold on the continuous regression predictions necessary to generate binary labels. This threshold was selected based on performance in the inner CV loop, using the R package OptimalCutoffs 30 to evaluate six potential cutoff metrics (Supplementary Methods) and selecting the method which maximized the accuracy over all of the test folds from the inner cross-validation loops. The best performing cutoff criterion was then used in the current iteration of the outer cross-validation loop, and the accuracy, sensitivity and specificity derived from this threshold were recorded for that fold. While this approach means that a different method could be used to derive the regression threshold for each fold in the outer cross-validation, the resulting estimate of performance is unbiased and hence is likely to be more representative of performance on unseen data compared with selecting a threshold based on the entire set of training data.
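The nested procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a one-feature least-squares line stands in for the Random Forest, only two hypothetical cutoff criteria are compared instead of six, each cutoff is derived from predictions on the training portion of a split, and the data are synthetic.

```python
import random

CLINICAL_CUTOFF = 192.0  # pg/ml, from the text

def kfold(indices, k, rng):
    """Shuffle `indices` and return k (train, test) index splits."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [([j for m, f in enumerate(folds) if m != i for j in f], folds[i])
            for i in range(k)]

def accuracy(preds, truth, cutoff):
    """Accuracy when predictions below `cutoff` are called abnormal (1)."""
    return sum(int(p < cutoff) == t for p, t in zip(preds, truth)) / len(truth)

def youden_cutoff(preds, truth):
    """Cutoff maximizing sensitivity + specificity on the given predictions."""
    pos = max(sum(truth), 1)
    neg = max(len(truth) - sum(truth), 1)
    def score(c):
        tp = sum(1 for p, t in zip(preds, truth) if p < c and t == 1)
        tn = sum(1 for p, t in zip(preds, truth) if p >= c and t == 0)
        return tp / pos + tn / neg
    return max(sorted(set(preds)) + [max(preds) + 1.0], key=score)

def max_accuracy_cutoff(preds, truth):
    """Cutoff maximizing raw accuracy on the given predictions."""
    candidates = sorted(set(preds)) + [max(preds) + 1.0]
    return max(candidates, key=lambda c: accuracy(preds, truth, c))

# Hypothetical stand-ins for the cutoff criteria compared in the text.
CRITERIA = {"youden": youden_cutoff, "max_accuracy": max_accuracy_cutoff}

def fit_predict(train_x, train_y, new_x):
    """Stand-in for the Random Forest: a one-feature least-squares line."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    slope = sxy / sxx if sxx else 0.0
    return [my + slope * (x - mx) for x in new_x]

def cutoff_and_accuracy(x, y, status, train, test, criterion):
    """Derive a cutoff on the training predictions, score it on the test fold."""
    tx, ty = [x[i] for i in train], [y[i] for i in train]
    cutoff = criterion(fit_predict(tx, ty, tx), [status[i] for i in train])
    test_preds = fit_predict(tx, ty, [x[i] for i in test])
    return accuracy(test_preds, [status[i] for i in test], cutoff)

def nested_cv(x, y, outer=(10, 10), inner=(3, 3), seed=0):
    """Return one test accuracy per outer fold; `outer`/`inner` are (reps, k)."""
    rng = random.Random(seed)
    status = [int(v < CLINICAL_CUTOFF) for v in y]
    results = []
    for _ in range(outer[0]):
        for train, test in kfold(range(len(x)), outer[1], rng):
            totals = {name: 0.0 for name in CRITERIA}
            for _ in range(inner[0]):
                for in_train, in_test in kfold(train, inner[1], rng):
                    for name, crit in CRITERIA.items():
                        totals[name] += cutoff_and_accuracy(
                            x, y, status, in_train, in_test, crit)
            best = max(totals, key=totals.get)  # criterion chosen for this fold
            results.append(cutoff_and_accuracy(
                x, y, status, train, test, CRITERIA[best]))
    return results

# Synthetic data: one plasma-like feature loosely related to CSF level.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(60)]
y = [190 + 30 * xi + data_rng.gauss(0, 20) for xi in x]
fold_accuracies = nested_cv(x, y)
print(len(fold_accuracies), sum(fold_accuracies) / len(fold_accuracies))
```

Because the criterion is chosen using only the outer training folds, the accuracy recorded on each outer test fold is not influenced by the selection step.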

Measures of model performance
Model performance was summarized by the mean and standard deviation of the area under the Receiver Operating Characteristic (ROC) curve (AUC), accuracy, sensitivity and specificity from the testing performance across the different cross-validation runs. R² values were also calculated for the regression task. Increases in AUC between models were tested for significance using a one-tailed Wilcoxon signed-rank test. ROC curves were constructed by aggregating all of the test predictions from the outer cross-validation.
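For concreteness, the metrics named above can be computed as in the following sketch. The labels, predicted CSF levels and per-fold AUCs are hypothetical. Because a low predicted CSF level indicates abnormality, the predictions are negated before being used as scores for the AUC.

```python
import statistics

def auc(scores, labels):
    """Mann-Whitney form of the ROC AUC; a higher score means more likely positive."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_metrics(calls, labels):
    """Accuracy, sensitivity and specificity for binary calls (1 = abnormal)."""
    tp = sum(1 for c, l in zip(calls, labels) if c == 1 and l == 1)
    tn = sum(1 for c, l in zip(calls, labels) if c == 0 and l == 0)
    fp = sum(1 for c, l in zip(calls, labels) if c == 1 and l == 0)
    fn = sum(1 for c, l in zip(calls, labels) if c == 0 and l == 1)
    return {"accuracy": (tp + tn) / len(labels),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Hypothetical test fold: 1 = abnormal (low) CSF status.
labels = [1, 1, 1, 0, 0, 0]
predicted_csf = [150.0, 160.0, 200.0, 190.0, 210.0, 220.0]

fold_auc = auc([-v for v in predicted_csf], labels)  # low level => high score
calls = [1 if v < 192.0 else 0 for v in predicted_csf]
fold_metrics = confusion_metrics(calls, labels)

# Summary across (hypothetical) per-fold AUCs.
fold_aucs = [0.80, 0.84, 0.88]
summary = (statistics.mean(fold_aucs), statistics.stdev(fold_aucs))
print(fold_auc, fold_metrics, summary)
```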

Evaluating importance of different modalities
The input variables were separated into three classes: a commonly used baseline model (B) including age and APOEε4 carrier status; Proteomics (P), which included the 146 analytes measured on the RBM panel as well as homocysteine and plasma Aβ 1−40 and Aβ 1−42 ; Metabolomics (M), including 138 metabolites and lipids.
Four separate random forests were created using different subsets of these features to determine which were most useful for modeling CSF Aβ 1−42 . We denoted these models by the combination of features they included; for example 'BPM' refers to a model built using all three classes of features. The best performing model was selected for all subsequent analysis.

Discovery of smallest set of markers needed for strong predictive performance
After evaluating the impact of the different modalities, we determined the minimum set of individual analytes necessary to achieve high predictive performance. This was done by treating the number of included features as a parameter to be determined in our nested CV framework. Within each fold of the inner cross-validation loop, we used a recursive feature elimination approach, ranking features according to their Variable Importance (the difference in the prediction error on the out-of-bag data when a given feature was permuted and unpermuted 31) and removing the lowest ranking features in a stepwise fashion. The AUC of the resulting RF was recorded, and the procedure was repeated over increasingly smaller subsets of features until no features were left to be removed. After the inner CV loop finished, we determined the number of features that achieved the optimal trade-off between model complexity and performance by selecting the smallest subset of features that achieved within 4% of the maximal observed AUC. A model using this subset of features was then trained on all training folds of the outer CV loop and evaluated on the test fold. Again, by determining the number of features to include within our nested cross-validation framework, we are able to obtain an unbiased estimate of the model's expected performance on unseen data.
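The selection rule at the end of this procedure can be written down compactly. The sketch below assumes that "within 4%" means an absolute difference of 0.04 in AUC (the text does not say whether the tolerance is absolute or relative), and the AUC values per subset size are hypothetical.

```python
def smallest_adequate_subset(auc_by_size, tolerance=0.04):
    """Return the smallest feature count whose AUC is within `tolerance`
    (absolute, by assumption) of the best AUC observed."""
    best = max(auc_by_size.values())
    eligible = [size for size, a in auc_by_size.items() if a >= best - tolerance]
    return min(eligible)

# Hypothetical AUCs recorded during recursive feature elimination,
# keyed by the number of features remaining in the model.
auc_by_size = {149: 0.84, 50: 0.83, 10: 0.82, 5: 0.81, 2: 0.76}

chosen_size = smallest_adequate_subset(auc_by_size)
print(chosen_size)
```

With these illustrative numbers the rule accepts a small loss in AUC in exchange for a much smaller panel, which is the trade-off the text describes.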

Survival Analysis
Survival analysis was conducted to determine if the rate of conversion from MCI to AD was different between those with predicted low and normal CSF Aβ 1−42 levels, enabling us to determine if our predictions lead to useful clinical outcomes in the validation cohort.
Four separate analyses were performed: using the measured CSF status on the training set (n=198), and using the predicted CSF status from the B, BP or BP_fs model (the feature-selected model, i.e. the model with the smallest set of features) in the validation set (n=198). For each analysis, we examined the hazard ratios using Cox regression and used log-rank tests to compare the survival distributions of the low/normal CSF Aβ 1−42 stratifications in the three analyses, as well as to compare equivalence between the actual and predicted stratifications.
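To make the comparison of survival distributions concrete, the following is a minimal sketch of a two-group log-rank test in plain Python. It is a textbook formulation with a chi-square approximation on one degree of freedom, not the software used in the study, and the follow-up times and event indicators are invented for illustration. Cox regression is not shown.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    `times_*` are follow-up times (e.g. months); `events_*` are 1 if the
    individual converted to AD at that time and 0 if censored.
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    return chi2, p_value

# Invented example: months to AD conversion (event = 1) or censoring (0).
low_times, low_events = [6, 12, 18, 24, 30], [1, 1, 1, 1, 1]
normal_times, normal_events = [60, 84, 100, 120, 120], [1, 0, 1, 0, 0]

chi2, p_value = logrank_test(low_times, low_events, normal_times, normal_events)
print(round(chi2, 2), p_value)
```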

Validation performance over PET Aβ 1−42 status
In order to further validate our model, we have examined the ability of our model to differentiate PET Aβ 1−42 abnormal and normal status. While it is known that Aβ 1−42 status from PET can differ from that observed in CSF, measurements from the two modalities are correlated and should be very similar for individuals who are not close to the cutoff indicating pathology. This provides a further indication of our model's ability to determine Aβ 1−42 status in individuals in the validation cohort, where CSF measurements are not available.
Table 2. Mean and standard deviation (in parentheses) of performance metrics (area under the receiver operating characteristic curve, AUC; accuracy, Acc; sensitivity, Sens; specificity, Spec; and R² for the regression models) for the different Random Forest models using different feature sets across all cross-validation folds. Left and right halves are for the Regression and Binary tasks respectively. Bold-faced text on AUCs indicates the best performing model or those that are statistically equivalent (via a Wilcoxon signed-rank test, with a Bonferroni-corrected significance threshold of 0.05/5=0.01). Feature sets describe combinations of (B) baseline model (age and APOEε4 carrier status), (P) Proteomics, (M) Metabolomics.
Given that only a limited number of individuals had associated measures of PET imaging at baseline (n=18 and 27 for training and validation cohorts respectively), we have made use of the earliest PET image available, leaving us with 108 and 68 individuals in the training and validation cohort to evaluate. The threshold for abnormality was defined as a SUVR of 1.5 and 1.11 for PET images using PiB and AV45 tracers respectively. The mean number of years past baseline that a scan was taken was 3.07 and 2.97 years for training and validation cohorts respectively.
The use of imaging at non-baseline times assumes that the change between baseline and the time the image was taken is relatively small (which may be reasonable assuming a slow rate of Aβ 1−42 accumulation), and that few individuals are close to the defined threshold for abnormality. If these assumptions do not hold, predictive performance is likely to worsen, making this analysis somewhat conservative.
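The derivation of PET status from the earliest available scan can be sketched as below. The SUVR cutoffs of 1.5 (PiB) and 1.11 (AV45) come from the text; treating values strictly above the cutoff as abnormal is an assumption, and the scan records are hypothetical.

```python
# Tracer-specific SUVR cutoffs stated in the text.
SUVR_CUTOFF = {"PiB": 1.5, "AV45": 1.11}

def earliest_pet_status(scans):
    """Return 1 (abnormal) or 0 (normal) from the earliest available scan.

    `scans` is a list of (years_after_baseline, tracer, suvr) tuples.
    Using a strict '>' comparison is an assumption, not stated in the text.
    """
    years, tracer, suvr = min(scans, key=lambda s: s[0])
    return int(suvr > SUVR_CUTOFF[tracer])

# Hypothetical individuals with one or more scans each.
person_a = [(3.1, "AV45", 1.25), (1.0, "AV45", 1.05)]
person_b = [(2.0, "PiB", 1.62)]

status_a = earliest_pet_status(person_a)
status_b = earliest_pet_status(person_b)
print(status_a, status_b)
```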

Models utilizing protein levels accurately predict CSF positivity
We evaluated the ability of blood-based biomarkers to predict CSF Aβ 1−42 normal/abnormal status using RFs trained using different subsets of input variables, treating the modelling of CSF Aβ 1−42 as either a regression or binary task. Summaries of the performance metrics from the resulting models are shown in Table 2 with their corresponding ROC curves shown in Figure 1.
We observed strong overall predictive performance for both the regression and binary tasks within our cross-validation framework. All sets of features outperformed the base model of age and APOEε4 carrier status, with BP-based models leading to the highest AUC in both tasks of 0.84 and 0.83 for the regression and binary tasks respectively. Standard deviation for the AUC was relatively high (7-8%), likely due to the noise inherent in both the analytes being used for prediction as well as in the CSF Aβ 1−42 measurements.
The BP models resulted in a mean R² of 0.29 for the regression task. The automatically derived threshold for this regression RF yielded a mean accuracy of 0.77, with a sensitivity of 0.78 and a specificity of 0.73. Across the 100 cross-validation runs, the chosen threshold ranged from 164 pg/ml to 194 pg/ml with a median of 185 pg/ml. Similar AUC and accuracy could be observed for learning the dichotomized CSF labels directly (e.g. AUC 0.83, accuracy 0.77 for the BP model). For the binary task, a slight drop in AUC, as well as an altered trade-off between sensitivity and specificity, was observed across all different feature sets compared to the regression task. Given this, we chose to focus on the regression model for much of the follow-up analysis.
While all sets of features outperformed the base model, models that made use of the protein level measurements consistently achieved the strongest predictive performance, whereas metabolites appeared to be of limited utility. In both the regression and binary tasks, models containing metabolites and proteomic data (BPM) achieved equivalent or worse AUCs than models containing only the proteomic data (BP). Furthermore, we observed that the use of the base features and metabolites alone (BM) led to decreased performance compared to the baseline model, indicating that the set of measured metabolites may have contributed little predictive information or may have been too noisy to be useful for predicting CSF status. These findings are in contrast to the utility of metabolites in predicting PET Aβ 1−42 positivity 22.
While the results presented in this section include clinically diagnosed AD individuals, who are almost all CSF Aβ 1−42 positive, it is worth considering only 'pre-clinical' individuals as this may be more relevant for selective screening in drug trials. Evaluating our model's performance on CN and MCI individuals only, we find that similarly strong predictive performance can be obtained (Supplementary Table 4, Supplementary Figure 1, AUC 0.80, Acc 0.77 for the BP model), supporting our primary finding that plasma protein levels can be utilized to predict amyloid pathology status.
To ensure that our imputation procedure did not bias our results, we also built similar models using only complete cases after applying more stringent QC (removal of plasma analytes where more than 1% of measurements were missing), obtaining similar AUCs of 0.81 for the regression and binary tasks (Supplementary Figure 2).

Strong predictive performance is maintained using only four proteins
The models described so far used all (> 140) available features in this dataset. In practice, measuring hundreds of analytes is costly, negating a key advantage of using blood biomarkers for screening. Given this, we applied feature selection to the BP regression RF, the most parsimonious of the models achieving high predictive performance, to identify the smallest number of features that still achieved high predictive performance. Within cross-validation, we find that the average performance of this feature selection approach, denoted BP_fs, yields an AUC, sensitivity and specificity of 0.81, 0.81 and 0.63. The number of features selected in the model ranges from 2 to 15, with a median of 5 features included. When applying this feature selection procedure to the entire set of training data, we identified a subset of four plasma analytes, as well as APOEε4 genotype status, as critical for model performance: Chromogranin-A (CGA), Aβ 1−42 (AB42), Eotaxin 3, and Apolipoprotein E (APOE). This combination of protein levels, together with APOEε4, is denoted as BP5.
Figure 2 indicates how each variable influences the model predictions after we have accounted for the influence of all other variables. As expected, the strongest relationship with CSF Aβ 1−42 is with APOEε4 carrier status, where being a carrier (APOEε4 = 1) leads to a low predicted Aβ 1−42 level. While the relationships between the proteins and CSF Aβ 1−42 are non-linear (expected given the nature of RFs), the overall correlation with CSF Aβ 1−42 is positive for CGA, plasma Aβ 1−42, and APOE protein levels and negative for Eotaxin 3.

Validation of clinical utility
To demonstrate the utility of our modeling on unseen data, we conducted a survival analysis over the validation cohort (n=198), evaluating the probability of baseline MCI individuals transitioning to AD diagnosis over 120 months, stratified by predicted CSF Aβ 1−42 status from either the B, BP or reduced BP5 model. These survival distributions could then be compared to those of the real Aβ 1−42 status observed over the training cohort. Given the demographic similarity of the two cohorts, we would expect to see strong similarities in rates of conversion.
From Figure 3, we observed that in all cases, the predicted low CSF Aβ 1−42 group transitioned to AD significantly faster than the Aβ 1−42 normal group. Comparing the predictions from the BP and BP5 models on the validation cohort to the actual CSF Aβ 1−42 status on the training cohort, we find that there is no significant difference between the survival distributions for either the normal (log-rank test p = 0.19, 0.2, 0.21) or abnormal (log-rank test p = 0.97, 0.31, 0.23) survival distributions, respectively, reflecting the overlapping confidence intervals of the hazard ratios. However, it can be observed that due to differences in the thresholding of the Aβ 1−42 levels, fewer individuals are deemed as CSF Aβ 1−42 'normal' in the actual data (n=53), compared with any of the three models applied to the validation datasets (n=95, 73, and 71 for BP, BP5, and B models respectively), highlighting the well-recognized issues of defining standardized cutoff values across studies 32. The significant differences in conversion rates between the predicted normal/low strata, especially from the more parsimonious BP5 model, together with their similarity to the survival distributions of the actual CSF measures, provide strong evidence that our blood-based biomarkers have similar utility to actual CSF Aβ 1−42 levels, given the similar properties of the two cohorts (Table 1).

Concordance with PET Aβ 1−42 status
To further validate and quantify our model's performance, we have explored the relationship between the predicted CSF Aβ 1−42 scores and PET imaging status. While there are differences between CSF and PET Aβ 1−42 status (such as indications that CSF Aβ 1−42 pathology can be detected before PET Aβ 1−42 13), the two biomarkers are correlated, with Aβ 1−42 status for the first PET image differing from the baseline CSF Aβ 1−42 status in only seven individuals. As such, evaluating our model against the PET Aβ 1−42 status should provide a conservative estimate for the AUC on the validation cohort, despite the lack of CSF measures.
The resulting ROC curves in Figure 4 provide further evidence that the BP and BP5 models are able to predict Aβ 1−42 status, with AUCs on the validation cohort of 0.78 and 0.8 for the BP and BP5 models respectively. These results are similar to those for CSF status from the training data, with a small expected drop due to the inherent differences between the two tasks. Interestingly, we observe stronger performance for the reduced BP5 model compared to the full BP model, with both again significantly improved compared to the baseline model of age and APOEε4 status.

Discussion
The most positive results from AD trials to date have been found in patients with early forms of the disease, leading to an increasing awareness that treatments are likely to be most successful if applied at the earliest stages of AD 10. While some AD clinical trials are enriching for pre-symptomatic AD individuals with PET screening, recent findings that shifts in CSF amyloid can be observed up to a decade before those from PET may indicate that CSF positive individuals are more suitable for clinical trial enrichment 33. Direct measurement of CSF biomarkers is too invasive to be used in such a screening test 34. However, the development of a minimally invasive, low-cost solution that provides the same type of information would overcome these limitations. The current study evaluates the utility of a blood-based signature of CSF Aβ 1−42 status using a Random Forest approach. We demonstrated that CSF Aβ 1−42 normal/abnormal status can be predicted from age, APOEε4 carrier status and protein levels with a high AUC, sensitivity and specificity of 0.84, 0.78 and 0.73 respectively. Compared to the base model (age and APOEε4 genotype), the inclusion of the plasma analytes improved the performance (AUC) by 6%. To make the model more suitable for clinical application, we identified four plasma analytes which, together with APOEε4 carrier status, still achieved a high AUC, sensitivity and specificity of 0.81, 0.81 and 0.64 respectively. These predictive models were then validated on a separate cohort of individuals to demonstrate that MCI subjects with predicted abnormal (low) CSF Aβ 1−42 levels transitioned to an AD diagnosis at a significantly higher rate than those predicted to have normal CSF Aβ 1−42 levels. Furthermore, these rates were similar to those observed in a demographically similar cohort of MCIs using actual CSF Aβ 1−42 levels. This is a strong validation of our modelling, as blood-based biomarkers for CSF Aβ 1−42 status are only useful if they can replicate the behavior of the actual Aβ 1−42 status for clinically relevant endpoints in individuals that were not used to build the predictive model. Strong predictive power for PET Aβ 1−42 status on the validation cohort provides further evidence for the generalizability and robustness of our modeling.
A number of studies have previously investigated the use of blood analytes to predict the burden of amyloid in the neocortex, as measured by PET 15, 16, 18-20, 22, 23 . Some of these studies showed similar performance metrics to those reported in this work (> 0.80 AUCs 15, 23-25 or > 0.78 accuracy 17 ), indicating that prediction of PET and CSF Aβ 1−42 status are of similar difficulty. PET Aβ is directly related to brain fibrillar amyloid, whereas CSF amyloid is a marker of soluble Aβ 1−42 and they may, therefore, give different insights into AD progression and mechanisms. For example, CSF Aβ 1−42 has been shown to be associated with APOEε4 whereas PET has been shown to have a greater association with tau 35 . Thus, the development of a blood-based screening test for CSF Aβ 1−42 levels is a complementary approach to existing blood-based biomarkers of PET amyloid status.
Of the above studies, the study by Nakamura et al. 25 not only showed a very high AUC in a discovery and an independent validation dataset for PET Aβ 1−42 status (AUC 0.94 and 0.96 respectively), but also showed strong performance for predicting abnormalities in CSF Aβ 1−42 levels (AUC 0.88) in a small cohort (n=45) of their validation set. While these results are promising, the automation of the novel technique used (IP-MALDI-TOF-MS), and hence transfer to a clinical setting, is non-trivial, motivating the search for complementary approaches. The protein signature presented in this study, based on a multiplex immunoassay, is likely to require a far shorter timeframe for clinical translation given the high level of automation that already exists for multiplex immunoassays, and that biomarkers from such platforms have already been used in commercially available diagnostic tests that have been approved by the FDA.
The use of metabolites appeared to be of limited utility for predicting CSF Aβ 1−42. In both the regression and binary tasks, models containing metabolites achieved equivalent or worse AUCs than models without. These findings contrast with the utility of metabolites in predicting PET Aβ 1−42 positivity 22 and their association with AD more broadly 36. Alternative methods for integrating this source of data 37 may be required in order to find robust associations with CSF Aβ 1−42 status.
The subset of features found to be critical for our BP5 Random Forest model included APOEε4 genotype, Chromogranin-A (CGA), Eotaxin 3, Aβ 1−42 (AB42), C-reactive protein (CRP) and plasma Apolipoprotein E levels (APOE). Several of these identified proteins have known associations with Alzheimer's disease. Unsurprisingly, the levels of plasma APOE are associated with CSF amyloid levels. APOEε4 is the strongest genetic risk factor for AD. APOE is involved in the clearance of Aβ 1−42 [38][39][40] and there is a strong relationship between APOEε4 genotype and APOE plasma levels, where APOEε4 carriers have lower plasma levels 41,42. Plasma Aβ 1−42 showed a positive relationship in our model for CSF Aβ 1−42, in line with a prior observation 43. This is interesting, as the link between alterations of Aβ 1−42 levels in the blood and the progression of the disease is still controversial and studies assessing the Aβ 1−42 concentration in blood of AD patients have produced conflicting results [43][44][45][46][47][48][49]. Chromogranin A (CGA) is associated with synaptic function, and has traditionally been used as an indicator of neuroendocrine tumors 50. More recent work has shown that CGA has a degree of co-localisation with amyloid plaques in the brain 51,52. However, levels of CGA in the CSF and blood serum do not appear to be correlated 53, and serum CGA has not previously been linked to AD. Eotaxin 3, also known as C-C chemokine ligand 26 (CCL26), plays an important role in the innate immune system and has been found to be dysregulated in AD patients 54. CSF Eotaxin 3 has been shown to be significantly elevated in patients with prodromal AD; however, Eotaxin 3 levels in plasma or the CSF do not correlate with rates of disease progression 54,55.
This study has several limitations. The training and validation cohorts are both part of the ADNI study and thus all measures were conducted on the same platforms. Hence further cross-cohort and cross-platform replication is required. This remains an ongoing issue within the development of all AD biomarkers relating to early screening and requires significant future investment 56. Furthermore, the current cohort is neuropathology biased, i.e. 84% of the cohort have MCI or AD, and are thus likely to have neuronal damage, potentially confounding the analysis of CSF Aβ 1−42 status. Finally, it needs to be noted that there are other medical conditions that are known to affect CSF Aβ 1−42 levels and it is unclear whether these affect any of the patients in our cohort.
The early identification of AD is paramount and a major global focus, as the success of disease-modifying or preventative therapies in AD may depend on detecting the earliest signs of abnormal amyloid-beta load. The differences between CSF Aβ 1−42 and PET Aβ 1−42 in preclinical stages of AD could also have implications in clinical trial settings. Blood-based biomarkers of amyloid can serve as the first step in a multistage screening procedure, similar to those that have been clinically implemented in cancer, cardiovascular disease, and infectious diseases 56. In conjunction with biomarkers for neocortical amyloid burden, the CSF Aβ 1−42 biomarkers presented in this work may help yield a cheap, non-invasive tool for both improving clinical trials targeting amyloid and population screening.