Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

A machine learning approach to brain epigenetic analysis reveals kinases associated with Alzheimer’s disease


Alzheimer’s disease (AD) is influenced by both genetic and environmental factors; thus, brain epigenomic alterations may provide insights into AD pathogenesis. Multiple array-based Epigenome-Wide Association Studies (EWASs) have identified robust brain methylation changes in AD; however, array-based assays only test about 2% of all CpG sites in the genome. Here, we develop EWASplus, a computational method that uses a supervised machine learning strategy to extend EWAS coverage to the entire genome. Application to six AD-related traits predicts hundreds of new significant brain CpGs associated with AD, some of which are further validated experimentally. EWASplus also performs well on data collected from independent cohorts and different brain regions. Genes found near top EWASplus loci are enriched for kinases and for genes with evidence for physical interactions with known AD genes. In this work, we show that EWASplus implicates additional epigenetic loci for AD that are not found using array-based AD EWASs.


Alzheimer’s disease (AD) is an age-dependent, neurodegenerative disorder, the leading cause of dementia, and a major public health concern world-wide1. AD is a complex illness due to environmental and genetic factors with a heritability of ~70%2,3. Compared to genome-wide association studies (GWASs), there are relatively fewer studies examining AD-associated epigenetic changes in the human brain. Yet, understanding epigenetic changes in the brain is important because they will likely illuminate both heritable and environmental aspects of AD pathogenesis. One of the most well-described epigenetic changes, DNA methylation (DNAm), is strongly linked with transcription regulation4, is heritable5, and notably changes in response to environmental exposure6,7,8, such as smoking9,10. Important for AD and other age-dependent illnesses, it is also known to change with age11.

Epigenome-wide association studies (EWASs) use array-based assays to test whether DNAm at particular CpG sites (abbreviated CpGs hereafter) is associated with a disease12,13,14,15. Multiple AD EWASs have identified differential DNAm associated with AD in different regions of the human brain, including prefrontal cortex (PFC)16, entorhinal cortex (EC), superior temporal gyrus (STG), cerebellum (CER)17, temporal pole region, temporal cortex, glia, neuron nuclei, non-neuronal nuclei18, and superior temporal gyrus19. These works revealed AD-associated differential DNAm such as those near ANK116,19 and CDH2316,17, which are distinct from AD GWAS signals. Although these studies have identified new AD-associated genes, array-based methods are limited because they only test about 2–3% of all CpGs in the human genome and have known technical limitations20. To overcome these challenges, we tested whether a machine learning approach could be used to identify additional AD-associated CpGs on a genome-wide scale.

In this work, we construct a supervised machine learning (ML) binary classifier named EWASplus to identify CpGs associated with AD. Given that epigenetic features and DNAm status are interconnected, we hypothesize that we can identify AD-associated CpGs using genomic and epigenetic features. Training data are derived from array-based EWASs, and features include relevant genomic and epigenomic profiling data (e.g., chromatin accessibility, histone modifications). After model training, we apply the trained model to the entire genome to identify additional AD-associated CpGs. Finally, we perform targeted bisulfite sequencing experiments to validate our in silico predictions. We find the highest rate of AD association for regions harboring putative CpGs predicted by EWASplus (65.8%; 25 out of 38), follow by CpGs known to associate with AD by methylation arrays (60.0%; 6 out of 10). Experimental validation shows predicted CpGs are 2.2 times more likely to be associated with AD (p < \(1.00\,\times\, {10}^{-9}\)) than negative control CpGs. These results suggest EWASplus is capable of providing credible information to identify additional AD-associated CpGs.


EWASplus overview

The goal of EWASplus is to identify additional disease-associated CpGs that are not included on the methylation arrays. Currently, the most popular methylation arrays only represent 2–3% of all CpGs in the human genome. EWASplus aims to increase the number of CpGs tested in EWASs to a genome-wide scale. A comparison of CpG coverage between the 450K methylation array and EWASplus is shown in Supplementary Fig. 1. Standard EWAS operates under a testing framework, but EWASplus frames the problem as a supervised learning (i.e., classification) framework. The EWASplus approach (Fig. 1) is to (1) use summary statistics from array-based EWASs to classify all CpGs on the array into either trait-associated (positive) or neutral (negative) group; (2) perform feature selection to identify the most informative features from a collection of 2256 genomic and epigenomic annotations; (3) train an ensemble learning model capable of identifying CpGs for trait association; and (4) score all CpGs in the entire genome to identify additional trait-associated CpGs not present on the array.

Fig. 1: Overview of EWASplus approach.

The EWASplus procedure is composed of four major steps: (1) Training data collection from existing EWASs; (2) External feature (from sources such as ENCODE and Roadmap Epigenome consortia) selection; (3) Ensemble learning; and (4) Genome-wide CpGs risk prediction, in which trained ensemble learning model is applied genome-wide to score all CpGs.

To prepare the training set, EWASplus gathers the most significant CpGs identified from array-based EWAS to form a positive training set. To reflect the fact that there are far fewer significant trait-associated CpGs in the genome than the trait-neutral ones, EWASplus selects a matching negative training set with similar genomic context that is ten times larger than the positive training set.

EWASplus employs an ensemble learning strategy and four different methods were chosen as the base learner: regularized logistic regression (RLR), support vector machine (SVM) classifier, random forest (RF), and gradient boosting decision trees (GBDT). To identify the best ensemble model, we tested all possible combinations of these base learners and found that the combination of RLR and GBDT gives the best performance overall (Supplementary Tables 16), and hence was selected to be the ensemble model in this study. RLR has the best recall but relatively low precision, while GBDT has the best precision but relatively low recall. When these two models are ensembled, the underfitting property of RLR can effectively offset the overfitting from GBDT while still keeping enough model complexity. More detailed description can be found in “Performance evaluation metrics” in the Methods section.

EWASplus can be applied to any array-based EWAS to extend its coverage. In this study, we tested EWASplus on data collected from four different cohorts: ROS/MAP (sample size 717), London (sample size 113), Mount Sinai (sample size 146), and Arizona (sample size 302). Cohort characteristics are given in Supplementary Table 7. All original EWASs were performed using the Illumina 450K methylation array.

EWASplus performance compared to methylation array

To evaluate the performance of EWASplus, we first considered its performance on CpGs present on the Illumina 450K methylation array (henceforth referred to as the “array”). Given the large sample size (n = 717), we choose data from the ROS/MAP cohort as the main dataset for performance evaluation. Methylation is measured on DNA derived from post-mortem PFC. Standard EWAS were conducted on six different AD-related traits: beta-amyloid density, Braak staging, the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) score, cognitive trajectory, global AD pathology, and neurofibrillary tangle density (Supplementary Table 8). We trained a separate classifier for each of the six traits.

EWASplus results are summarized in Table 1 (see section “Hyperparameter tuning and ensemble model” for detailed description of the approach of evaluation). The area under the receiver operator characteristic (ROC) curve (AUC) values from the six traits range from 0.831 (cognitive trajectory) to 0.962 (neurofibrillary tangles) (Fig. 2a). The area under the precision-recall curve (PRC) (AUPRC) values from the six traits range from 0.502 (CERAD) to 0.858 (neurofibrillary tangles) (Fig. 2b). These results indicate that EWASplus works well to predict significant AD-associated CpGs for methylation measured by the array. Among the six traits, we observe the best performance for neurofibrillary tangles.

Table 1 Summary of performance evaluation of all six AD-related traits.
Fig. 2: Summary EWASplus results.

a ROC curves of the predictive performance of EWASplus on the six traits in the ROS/MAP cohort. b Precision-recall curves of the predictive performance of EWASplus on the six traits in the ROS/MAP cohort. Source data are provided as a Source data file.

To further evaluate EWASplus, we asked whether the EWASplus prediction score is capable of distinguishing CpGs with differential DNAm between AD case and control status. To answer this question, we selected four groups of CpGs that differ with respect to differential DNAm association with AD: (a) AD-associated CpGs in the positive training set (i.e., p-value less than the EWAS threshold); (b) CpGs suggestively associated with AD (i.e., p-value slightly greater than the EWAS threshold); (c) CpGs not associated with AD but not in negative training set; and (d) CpGs not associated with AD and in the negative CpG training set (i.e., p-value greater than the EWAS threshold and in negative training set). On average, we find a significant difference for EWASplus prediction scores between suggestively positive and negative CpGs that are not in the training sets (Wilcoxon rank-sum test; p < 3.64 × 10−16; Supplementary Fig. 2) for all six traits. Scores for CpGs in group b are similar to those in group a (positive training set), albeit with higher variation, whereas scores in group c have almost the same scores as group d (negative training set). As expected, our results demonstrated excellent capability of EWASplus in distinguishing CpGs that show AD association or not.

EWASPlus performance for off-array CpGs

We applied the six classifiers trained on the six AD traits using EWASplus to the entire human genome to obtain a prediction score for every CpG (Fig. 3a). The top ten CpGs with the highest composite scores are listed in Table 2. The total number of CpGs with a prediction score is about 78 times the number of CpGs present on the Illumina 450K methylation array. The prediction scores for all CpGs are provided at the EWASplus Github site.

Fig. 3: Genome-wide prediction results.

a Manhattan plots for neurofibrillary tangles: the top panel is for on-450K CpGs with EWAS p-values and the bottom panel is for whole-genome CpGs with imputed LRS by EWASplus. The y-axis is the log-scale rank scores. The top-ranked CpG has the LRS of 7.42 (about empirical p-value of 3.8 × 10−8); the top 100th ranked CpG has the LRS of 5.42 (about empirical p-value of 3.8 × 10−6) and the top 10,000th ranked CpG (about empirical p-value of 3.8 × 10−4) has the LRS of 3.42. b Raw and normalized stacked-proportion histograms for different genomic annotation types. Source data are provided as a Source data file. c The difference of observed and expected chromatin states proportion for the top 10,000 loci across the six AD-related traits: Beta-amyloid, Braak staging, CERAD, cognitive trajectory, global pathology, and neurofibrillary tangles. Source data are provided as a Source data file. The annotated chromatin states are from Roadmap Epigenetics Project and we used the core 15-state model chromatin states for the dorsolateral prefrontal cortex tissue type. To minimize ambiguity, we require only a single annotation type is assigned for each CpG site. if a CpG has multiple annotations, we only record the most “significant” annotation with the following order: enhancer > promoter > exon > intron > near gene (1–5 kb to the TSS) > intergenic. We do not list 5′ UTR and 3′ UTR since these two types are within the first and last exon of each gene according to the UCSC annotation system.

Table 2 Top ten CpG loci for six AD-relevant traits.

For the off-array CpGs, we examined the distribution of prediction scores for different types of genomic regions. We hypothesized that top CpGs with the highest prediction scores would be located in functional regions such as enhancers and promoters, and we find this to be the case (Fig. 3b). The normalized proportion for enhancers ranges from 15.73 to 43.19% and exons range from 4.69 to 23.99%, which are both significantly higher than the expected occurrence of these regions in the high prediction score percentile intervals (binomial test for the highest prediction score quantile interval: p < 1.00 × 10−99 for both enhancers and exons). To better understand the properties and context of top-ranked CpGs predicted by EWASplus, we selected the top 10k CpGs with the highest overall EWASplus prediction scores and analyzed their chromatin states (15-state model) defined in dorsolateral prefrontal cortex. We calculated the enrichment (or depletion) of the 15 chromatin states in the top 10k CpGs. As a result, we found that all six AD-related traits are enriched for sites annotated as flanking active transcription start site (TSS) (binomial test; p < 1.00 × 10−99 for all traits), active TSS (binomial test; p < 1.00 × 10−99 for all traits), enhancers (binomial test; p < 1.22 × 10−9 for all traits), and repressed polycomb (binomial test; p < 1.00 × 10−99 for all traits), and under-represented for sites within quiescent regions (binomial test; p < 1.00 × 10−99 for all traits) (Fig. 3c). There is no significant difference in the enrichment patterns across the six AD traits. These results support the conclusion that top CpGs associated with AD tend to be located in functional regions.

Comparison with a competing method

In a recent work, using array-measured methylation levels, Zhang et al.21 develop a computational algorithm to impute the methylation levels on CpG sites genome-wide including those not on the Illumina 450K array. Their approach employed about 125 genomic and epigenomic features (the number varies when including different sets of individual-level features) mainly composed of regulatory marks from ENCODE project. Although not designed for trait-association prediction, one could apply this method to impute methylation levels for every individual sample and on every CpG site. Subsequently, association test can be conducted on these imputed methylation measures to identify CpGs significantly associated with a trait of interest.

To compare such a strategy with EWASplus, we applied Zhang et al.’s method and used the imputed methylation values to conduct an association test. We found that the AUC for EWASplus is between 0.178 and 0.329 higher compared to the adapted Zhang et al. approach; AUPR for EWASplus is 0.219 to 0.364 higher than adapted Zhang et al. approach across six AD-related traits (see Supplementary Fig. 3a, b for performance comparison).

Experimental validation of EWASplus predictions

To experimentally test the validity of the prediction scores reported by EWASplus, we performed targeted bisulfite sequencing to measure the methylation level at 559 selected CpGs from 150 randomly selected participants from the Religious Orders Study (ROS) or Memory and Aging Project (MAP) cohorts who are representative of both studies and have available brain tissue for bisulfite sequencing (Supplementary Table 9). CpGs were selected for independent validation from the top EWASplus predicted sites using a stepwise selection process that prioritized regions with the highest predicted scores that were physically separated by at least 500 bp. For comparison purposes, we also randomly selected CpGs from regions with predicted scores in the lower half but similar physical characteristics (e.g., GC content). In addition, we targeted CpGs on the array that could serve as positive controls. After quality control, 319 CpGs were analyzed including 31 CpGs on the 450K array identified as AD-associated22, 260 off-array CpGs predicted to be AD-associated based on EWASplus, and 28 off-array CpGs predicted to not be AD-associated. These 319 CpGs can be grouped into 58 independent clusters (referred to as CpG cluster hereafter) on the genome that belongs to three groups: 38 off-array predicted AD-associated, 10 on-array AD-associated22, and 10 off-array predicted not AD-associated. For performance comparison, we combined test results from the six individual traits. Due to the limited sample size, we call a CpG cluster AD-associated if at least one of the CpGs at the locus achieves unadjusted p-value for differential DNAm <0.05 for any of the six traits. Similar to our results from individual traits, we found that positive CpG clusters predicted by EWASplus have the highest rate of association with at least one AD trait (65.8%, or 25 of 38), followed by CpG clusters identified by array-based EWAS (60.0%, or 6 of 10). In contrast, the negative control CpG clusters predicted by EWASplus have the lowest (30.0%, 3 of 10) (Table 3). Thus, CpGs with top EWASplus scores are about 2.2 times more likely to be associated with an AD trait (Binomial test, p < \(1.00\times {10}^{-9}\)).

Table 3 Comparison of number and proportion of differentially methylated CpGs in various categories of CpGs.

EWASplus performance on multiple cohorts

To further test EWASplus, we examined its performance using data from three additional cohorts: London cohort17 (prefrontal cortex, N = 113), Mount Sinai cohort23 (prefrontal cortex, N = 146), and Arizona cohort24 (middle temporal gyrus, N = 302). In all three studies, Braak staging (treated as a continuous variable) is used as the trait in the EWAS studies, as described in Smith et al.25. Detailed information about these cohorts is summarized in Supplementary Table 7.

We found that EWASplus performed well in all three datasets. The AUC values range from 0.697 (London 1) to 0.863 (Mount Sinai) (Supplementary Fig. 4a). The AUPRC values range from 0.233 (Arizona) to 0.604 (Mount Sinai) (Supplementary Fig. 4b). The complete results including all evaluation metrics can be found in Table 4.

Table 4 Summary of performance evaluation on three additional cohorts of samples: London, Mount Sinai, and Arizona.

To understand the most relevant factors influencing EWASplus performance among the different datasets, we treated the performance measurement testing AUC as the response variable and tested numerous independent variables using the linear regression model. We found that when choosing the positive EWAS threshold (negative logarithm transformed p-values) as the independent variable, simple linear regression achieved R2 of 0.588 using other performance measures such as AUPRC and F1 values produced similar results (Supplementary Fig. 5). These results suggest that perhaps the most relevant factor that influences EWASplus performance is the quality and power of the original EWAS, which depends on the effect and sample sizes.

Biological insights into AD

To glean biological insights from the EWASplus results, we examined genes surrounding some of the highest EWASplus scoring CpGs. Interestingly, we found that the highest scoring CpG is located inside the HOXA gene cluster, which has been identified by three independent array-based EWASs of cortical brain tissue associated with Braak staging, a measure of neurofibrillary tangles17,23,26. In contrast to prior analyses that identified individual HOX genes, EWASplus results identify a 40 kb region on chromosome 7 that includes multiple homeobox genes, e.g., HOXA2, HOXA3, HOXA4, HOXA5, and HOXA6, that are associated with AD (Fig. 4c).

Fig. 4: Manhattan plot of neurofibrillary tangles EWAS at the HoxA locus on chromosome 7.

a Array-based EWAS p-values. The most significant CpG identified by De Jager et al. are shown with an arrow. b EWASplus predicted LRS. c The landscape of the HoxA cluster genes.

In addition, of the top 10 detached EWASplus scoring CpGs, seven were not previously implicated in any EWAS of AD. Here detached means any two CpGs on this list are at least 10 kb away from each other. Gene set enrichment analysis by GeNets27 using all genes located within 5 kb of the top 100 EWASplus scoring CpGs (123 genes, Supplementary Data 1) revealed a significant enrichment of protein kinases (\(p=0.044\), Supplementary Fig. 6)—ALPK3, DMPK, MAP3K11, MAP4K1, and TAOK328. Identification of kinases within AD is of particular interest given that neurofibrillary tangles, a hallmark neuropathology of AD, result from hyperphosphorylation of microtubule-associated protein tau (MAPT)29. In addition, we found that genes within the top EWASplus regions have evidence of physical interaction with known AD genes or AD GWAS loci (Supplementary Fig. 6) (e.g., PRKAG2 and TNS3 interact with APOE, CLU, APP, PSEN1/2, and RIN2 and RIN3 interact with BIN1). These analyses support the idea that EWASplus is able to identify interesting underlying biological relationships in AD.


EWAS has been shown to be a powerful and effective approach to derive associations between methylation changes and phenotypes. EWAS studies of human brain have elucidated additional genes involved in AD16,17,18,19. To expand our understanding of potential AD-relevant regions in the genome, we developed EWASplus to explore the 97% of CpGs that are not included on the methylation arrays. EWASplus uses an ensemble learning-based computational pipeline to learn relevant features from a large set of potential omics features.

EWASplus is a powerful machine learning method based on disease-specific EWAS results and has some parallels with genotyping imputation strategy used in genetics studies30. The fundamental difference is that genomic imputation relies on linkage disequilibrium31, but DNAm does not share the same degree of physical correlation21. In fact, the correlation of methylation levels between two adjacent CpGs decays rapidly with distance32,33. Thus, EWASplus takes an alternative approach by inferring whether a CpG is trait-associated. This is achieved under a supervised classification framework.

EWASplus can effectively identify AD-associated differentially methylated CpGs according to multiple experiments conducted to evaluate its performance. First, using only CpGs on the methylation array, in silico cross-validation revealed high AUC and AUPRC for all six traits. Second, we observed good separation of EWASplus prediction scores between near positive CpGs versus negative CpGs not in the training set (Supplementary Fig. 2; Wilcoxon rank-sum test p-value ranges from \(2.82\times {10}^{-99}\) to \(3.64\,\times\, {10}^{-16}\)) in EWASplus prediction scores. For sites not assayed by the methylation array, we found significant enrichment of high-scoring CpGs in genomic regions of functional annotations such as TSS regions and enhancers. Finally, and most importantly, we performed experimental validation using targeted bisulfite sequencing on CpGs not included on the methylation arrays.

Our EWASplus results are notable, in general, for two reasons. First, high-scoring EWASplus CpGs are more likely to be located in regions with functional annotations such as enhancers or promoters. Both of these results are consistent with other work showing that gene regulation is a key facet of many diseases34,35. Second, EWASplus results illustrate how epigenetic “fine mapping” may illuminate disease pathophysiology. For example, in the HOXA locus EWASplus results suggest that epigenetic changes are occurring across the gene cluster in AD rather than one gene-family member.

A key idea of EWASplus is that it bypasses inferring the individual-level DNAm level directly. A similar approach has been used to predict additional trait-associated genetic variants using GWAS and machine learning36. Since our goal is to identify disease-associated DNAm CpGs rather than methylation status directly our approach avoids much complexity associated with accounting for the many factors that can influence DNAm CpG status (e.g., age, cell type proportion). This is illustrated by the performance of EWASplus compared to the modification of Zhang et al.’s method to address disease-association, which it was not originally designed to do, admittedly.

EWASplus results for AD reveal several interesting biological insights. First, we identified a 40 kb region in the homeobox A cluster of genes that are associated with AD, which expands upon the previously described association with individual genes within that cluster (e.g., HOXA3) and AD. Since these are known transcription factors, these findings may suggest important transcriptional regulation occurs in AD or its progression. Second, we find enrichment of kinases—ALPK3, DMPK, MAP3K11, MAP4K1, and TAOK3—in the top EWASplus loci. This finding is particularly relevant for AD given that the pathologic hyperphosphorylation of tau is a hallmark neuropathologic feature of AD (i.e., neurofibrillary tangles). Of these kinases, only ALPK3 and MAP4K1 were previously suggested to associate with AD23,37,38,39. DMPK is notable for causing myotonic dystrophy type 1 due to a repeat expansion within an intronic region in carriers that leads to altered gene expression of genes within that region40. Interestingly, differential DNAm of MAP4K1 has been associated with AD in human hippocampus39 and Braak staging (a measure of neurofibrillary tangle pathology)23 in independent human brain datasets. TAOKs (thousand and one amino acid kinases, also referred to as prostate-derived STE20-like kinases [PSKs]) have been extensively investigated for their ability to phosphorylate MAPT and regulate microtubule assembly41; yet, to our knowledge, methylation of TAOK3 has not been previously associated with AD. Finally, from the top 10 EWASplus results (Table 2) we found four genes that have intriguing connections with AD or cognitive decline from approaches other than methylation. These genes include DUSP1, PPARD, JUN, and PRKAG2. For example, a PPARD null mouse model shows cognitive impairment42, and PPARD is highly expressed in the brain43 and implicated in type 2 diabetes and obesity44, which are risk factors for AD. In addition, there is experimental evidence to suggest that JUN and PRKAG2 regulate or interact with APP45,46, which is of interest in AD given APP is cleaved to beta-amyloid. Thus, these findings from the literature provide complementary support that EWASplus identifies disease-relevant findings and is likely to provide fresh insight into AD.

DNA methylation is tissue-specific. Most of the tests done in this study are conducted on the PFC region of the brain. We focused on PFC for several reasons. First, epigenetic marks are correlated across neocortical regions47. Second, cell loss in PFC is relatively less even in people with high neuropathological burden from AD compared to other cortical regions. Third, the majority of available reference human brain transcriptomes and proteomes are from the PFC allowing future work to test predictions of EWASplus using existing data. Despite focusing on PFC, EWASplus performs well on the middle temporal gyrus. Thus, we expect EWASplus to perform well for other tissues because the genome-wide features used are from many different tissue types. From all the tests we performed, we found that the number and level of significant CpGs seem to have a strong impact on the EWASplus performance. Therefore, we are confident that EWASplus will be able to successfully extend the coverage of high quality, well-powered array-based EWAS studies.

Although the EWASplus methodology is general and can be applied to any tissue type, the methylation profiles are tissue-specific, may change with age/environment and demographics. This implies that the trained EWASplus model is only valid for the specific tissue type collected from samples with certain age/environmental profile and demographics. One should exercise caution when trying to extrapolate the results to other tissue types such as blood, or subjects with different age or environmental and demographic profiles. Since the major utility of EWASplus is to expand the coverage of EWAS beyond the array within a specific experimental dataset, this limitation will not hamper the utility of EWASplus.

A potential limitation of EWASplus is the limited number of underlying training datasets and the focus on subjects of recent European descent. Thus, it is of particular importance to expand the number and diversity of additional EWAS data in future work. The underlying methylation data were also from PFC, which is affected relatively late in AD; however, the findings may not generalize to other neocortical regions. Thus, training data from additional relevant brain regions would improve EWASplus models. Likewise, while we started with a large number of potential features, many were from non-neuronal sources, which may limit generalizability to brain tissue. However, as those data are generated, our approach can be easily retrained with those data for improved specificity for brain-cell types from different regions. A strength of this work is that the underlying methylation data were derived from participants enrolled in a population-based study of aging, and there is a wide range of neuropathology findings that reflects the general population rather than a clinic-based ascertainment48. We also show a high degree of experimental validation and note that future work could employ targeted bisulfite sequencing49 or a custom array platform50 to profile candidate CpGs in a cost-effective and high-throughput manner.

EWASplus does not provide a significant cut-off threshold since it is a supervised classification approach, not a testing-based method. In practice, one can select the threshold empirically by checking whether top CpGs identified by array-based EWAS made the cutoff. Deciding on the number of significant EWAS CpGs to include in training is a tradeoff between the quantity and quality of the training set in EWASplus. Thus, the significance threshold for each EWAS should be decided based on the effect size and sample size of the EWAS. Future work should examine the utility of including different thresholds and use cross-validation to select the desired significance cutoff. For a CpG, no matter how highly ranked by EWASplus, should only be considered as “putative” in terms of trait association unless it can be validated using experimental approaches.

In conclusion, we present EWASplus, a powerful machine learning approach to identify disease-associated CpGs with high reliability. Application of EWASplus to AD highlights important regions and genes that likely contribute to AD pathogenesis, which a valuable addition to the investigation of the epigenetic landscape of AD. In addition, EWASplus is a general approach that may be applied to extend any existing EWAS results obtained using array-based technology, regardless of the trait or phenotypes being studied. We anticipate more exciting findings from its future applications.



The main dataset used in this study comes from the ROS/MAP cohorts. ROS and MAP are longitudinal cohort studies of aging and AD led by investigators at the Rush Alzheimer’s Disease Center51,52. Participants give written informed consent for annual assessments, signed an Anatomic Gift Act, and a repository consent to allow their data and biospecimens to be repurposed. Each year, participants undergo a detailed medical, neurological, and neuropsychiatric assessment. After death, each participant undergoes a detailed brain autopsy with neuropathologic examination. Both ROS and MAP were approved by the Institutional Review Board of Rush University Medical Center. They share a large common core of data at the item level to allow efficient merging of datasets. ROS/MAP resources can be requested at

In addition, we also obtained data from three separate cohorts: London, Mount Sinai, and Arizona. The “London” cohort refers to prefrontal cortex tissue obtained from 113 individuals archived in the MRC London Neurodegenerative Disease Brain Bank. The details of the cohort are described in Lunnon et al.17. The “Mount Sinai” cohort refers to prefrontal cortex tissue obtained from 146 individuals archived in the Mount Sinai Alzheimer’s Disease and Schizophrenia Brain Bank. Details of this cohort is described in Smith and colleagues25. The “Arizona” cohort refers to 302 middle temporal gyrus samples from The Sun Health Research Institute Brain Donation Program24. The details of this cohort are described in Brokaw et al.53.

Sample preparation and differential DNAm CpGs identification

DNAm data were generated from dorsolateral PFC (Broadman area 46) of post-mortem samples obtained from individuals in the ROS/MAP cohorts.

DNAm profiling was performed with the Illumina HumanMethylation450 Beadchip array16. After excluding non-Caucasian subjects, 717 ROS/MAP participants with array DNAm data remained for analysis. We obtained raw IDAT files from the Synapse website (Synapse ID: syn7357283) and removed probes annotated to multiple chromosomes or the X and Y chromosomes by Illumina, probes that cross-hybridize with other probes due to sequence similarity (identified by Chen et al.54), probes with a detection p-value > 0.01 in any sample, probes without a CpG, and probes that overlap with a common SNP (identified by Barfield et al.55). After this filtering, a total of 334,465 autosomal CpGs remained for analysis.

For the EWAS analyses, each probe was normalized using the BMIQ algorithm from the Watermelon R package56, and adjusted for batch effects using the ComBat function from the sva R package57. We used the CpGassoc58 R package to test if the methylation level of each array CpG is associated with the trait of interest via regression methods58. All models were adjusted for proportion of neurons, age at death, sex, post-mortem interval, plate, study, and years of education. Neurons were added as a covariate to avoid potential confounding due to differences in the cellular composition of the tissue samples. The proportion of neurons in each sample was estimated using the CETS R package and reference methylation data from isolated neuronal nuclei59.

We performed EWASs for the following six AD-related traits: (1) beta-amyloid load which is the percent area of beta-amyloid based on image analysis; (2) neurofibrillary tangle density by stereology; (3) CERAD score; (4) Braak stage; (5) global AD pathology burden; (6) cognitive trajectory based on the average z-score of 17 cognitive function tests. Beta-amyloid and neurofibrillary tangle were measured in the cortex using immunohistochemistry with antibodies specific to beta-amyloid and phosphorylated-tau, as described52. We used square-root-transformed values for both traits to improve their normality. CERAD score and Braak stage are semi-quantitative measures that reflect both a neuropathologist’s opinion of AD diagnosis and the distribution and amount of silver-stain-identified neuritic and diffuse plaque and neurofibrillary tangle pathologies, respectively60,61,62. CERAD scores can take on values from one to four indicating definite AD, probable AD, possible AD, and no AD, respectively. CERAD was treated as a continuous trait. Braak stages can take on values from one to six, indicating the increasing spread of neurofibrillary tangle pathology in the brain, and Braak was coded as a binary trait with stages one to three as controls and stages four to six as affected. Global AD pathology burden is a summary measure of silver-stain-identified neuritic plaque, diffuse plaque, and neurofibrillary tangle pathologies52. As global AD pathology burden has a skewed distribution, we used square-root-transformed values. Cognitive trajectory, or the rate of change in cognition over time, was estimated for each ROS/MAP participant using a linear mixed model22. For each person, cognitive trajectory was estimated as the person-specific random slope of a linear mixed model that included global cognitive function as the longitudinal outcome63, follow-up year as the independent variable, and sex, age at enrollment, and years of education as covariates.

For the London, Mount Sinai, and Arizona cohorts, we directly used the processed EWAS results reported in Smith et al.25. Details of the sample preparation and differential DNAm CpGs identification have been described in previous studies17,24,53.

Training sets selection

For each trait, positive CpGs in the training set were selected based on association test p-values (threshold ranges from \(1.00\,\times\, {10}^{-7}\) to \(1.00\,\times {10}^{-5}\)). For each positive CpG, ten matching negative CpGs were selected from the Illumina 450K array such that they have similar \({{{\rm{\beta }}}}\)-values as the positive CpG, but none is considered significant in any of the EWASs conducted on the six traits. We used a conservative threshold (\(p \, > \, 0.40\)) for being not-significant, and \({{{\rm{\beta }}}}\)-values were calculated as the mean values of methylation intensity over 717 ROS/MAP samples for each CpG on the Illumina 450K array.

Base classifiers

We used four different methods as base classifiers with varying model complexity. The goal of this approach was to select the model with the least error to achieve an optimal overall performance. We used four models that included: (1) RLR with L2-penalty, which alleviates overfitting and feature collinearity; (2) SVM classifier64, which performs well with linearly non-separable classification, a common feature for real-world problems; (3) RF65, which is a bagging method with decision tree as base learner; (4) GBDT66, which differs from RF in that a new tree is added to model to gradually optimize the objective function that was set as log loss. EWASplus uses an accurate and efficient implementation of GBDT from package XGBoost67.

Feature selection

We assembled a comprehensive collection of 2256 genomic/epigenomic profiles as well as multiple functional annotation scores as features to be used in the model. Omics profiles include TF and histone ChIP-seq, open chromatin, total RNA-seq, and WGBS. Functional annotation scores include CADD68, GenoCanyon69, and Eigen/EigenPC70. More details can be found in Supplementary Table 10.

The moderate size of the training sets (between 1706 and 3181) may result in overfitting if all features are included in the training. Thus, we used a dimension reduction/feature selection step before the model training. For each trait, we performed feature selection for each of the four base classifiers: RLR, SVM, RF, and GBDT, respectively. For each base classifier, we selected the top 100 most informative features using the training data. In RLR and SVM, features were ranked based on the weights of the fitted model. For RF, features were ranked based on the Gini impurity measure. For GBDT, features were ranked by the gain metric when fitting the model, or, in other words, the improvement in accuracy brought by a feature to the branches it is on.

Next, we ranked the features by the number of times that this feature was selected by the four base learners as informative. We selected the top 60 features (testing on the number of top features ranges 30–100, 60 was selected because it gave the best performance overall). Features were ranked by the number of methods that select the feature as informative. To break a tie, we introduced a secondary sorting method. For each feature, we conducted the Wilcoxon rank-sum test comparing feature values between positive and negative CpGs, and features were ranked from the most significant to the least significant. The top 60 features for all six traits from the ROS/MAP cohort and Braak staging from three additional cohorts are shown in Supplementary Data 210. These features may be informative about underlying disease mechanisms.

Hyperparameter tuning and ensemble model

We used the Tree-of-Parzan Estimators (TPE) implemented in Hyperopt71 to adaptively search the hyperparameter space of each component model (base learner) for the best hyperparameter settings. This model-based hyperparameter tuning method is thought to achieve better performance than random search in terms of both accuracy and efficiency71.

The hyperparameter tuning for each component model is conducted separately. In the training dataset, we uniformly up-sample the positive CpGs to match the number of negative CpGs to alleviate the imbalance problem. In the outer CV, the whole dataset of positive and negative CpGs were split into training and testing sets in a nine-to-one ratio in each round. Within each round, the ninefolds were further split into threefold to conduct the inner 3-fold CV for hyperparameter searching. The best set of hyperparameters was decided by the highest F1 score and it was then used for the remaining onefold in the outer 10-fold CV. Each one of the tenfolds in the outer CV layer is used once as the testing set in a round-robin way so that out-sample predictions cover the whole dataset. We evaluate our model with the out-of-bag estimates for testing error and report the evaluation results in Table 1 for the ROS/MAP cohort and Table 4 for other additional cohorts.

After the outer 10-fold CV, we then built the ensemble model by selecting the best combination of component models. The out-sample predictions of each base learner from the outer 10-fold CV were aggregated in a soft-voting manner to give the ensemble prediction probabilities in different combinations of component models. Due to the problem of class imbalance, we evaluated the performance of the ensemble models using AUC, AUPR, precision, recall, and F1 score.

Performance evaluation metrics

To assess the performance of EWASplus, we used three classes of evaluation metrics: precision and accuracy, AUC and AUPRC, as well as F1 score. Precision measures the true positive rate of a classifier. Accuracy measures the percentage that a classifier correctly labels test samples. For imbalanced datasets where positive samples are of more interest, precision is preferred over accuracy. PRC is preferred over ROC. The F1 score is another widely used performance measure for imbalanced datasets. It takes into consideration both accuracy and precision by assigning each an equal weight in the following calculation formula: \({F}_{1}=2\,\times \left(\frac{{{{{\mathrm{recall}}}}}\ast {{{{\mathrm{precision}}}}}}{{{{{\mathrm{recall}}}}}+{{{{\mathrm{precision}}}}}}\right)\). The focus of F1 score is on the positive samples which is usually under-represented.

Binomial test for enrichment of protein kinases

We selected the top 100 CpGs with the highest EWASplus prediction scores across six AD-related traits in a stepwise forward manner such that any two CpGs in the top 100 list are at least 10 kb away from each other. Next, we searched through the 5 kb neighborhood of these 100 CpGs to retrieve all genes that overlapped, for a total of 123 genes. Among these genes, five are known protein kinases. Given a complete list of human kinases (492 from the autosomes) from ( and a complete list of human genes (31,684 from the autosomes) from Ensembl (, we conducted an enrichment test using binomial distribution which returned an enrichment p-value of 0.044.

Log-scale rank score (LRS) for prioritizing AD-associated loci

In order to better present the whole-genome prediction result, we sorted the prediction scores of each trait and calculated the log-scale rank score (LRS) for each CpG (\({{{\rm{LRS}}}}=-{{\log }}_{10}\frac{{{{{\mathrm{rank}}}}}}{{{{{\mathrm{total}}}}}\,{{{{\mathrm{num}}}}}\,{{{{\mathrm{CpGs}}}}}}\); \({{{{\rm{total}}}}}\,{{{{\mathrm{num}}}}}\,{{{{\mathrm{CpGs}}}}}=26,573,858\)). The LRS is similar to a log-transformed empirical p-value. A higher LRS means the CpG is more likely to be associated with the trait.

Loci selection for targeted bisulfite sequencing

Targeted bisulfite sequencing was conducted on selected CpGs (with neighboring CpGs profiled unintentionally, as well) for 150 randomly selected samples from the ROS/MAP cohort. Since most features used in model training having only one value in every 200 bp bin, CpGs within a 200 bp bin tend to have similar prediction scores. In order to select a more representative (less clustered) set of loci for experimental validation, we required any pair of selected CpGs must be at least 500 bp apart. The forward selection process is performed in the stepwise manner, starting from the CpG with the highest total LRS score. Due to the limitation of sequencing primer design, not all loci on the candidate list were selected for bisulfite sequencing. The selection process was stopped when a pre-determined sequencing capacity is reached. For comparison, we selected 38 off-array CpG clusters with high prediction scores, 10 clusters of on-array CpGs listed in de Jager et al.16 and 10 clusters of off-target negative control CpGs.

Adaptation of Zhang et al. for comparison with EWASplus

For the purpose of fair comparison, we selected 1000 CpGs that are not from the training set used by EWASplus. Instead, we selected 500 “near positive” CpGs with p-values just above the threshold and 500 negative CpGs with p-values > 0.40 but not in the negative training set used by EWASplus. Comparison is performed in two steps: (1) predict methylation levels for the 1000 CpGs across the 717 samples used to train EWASplus following instruction in Zhang et al., and (2) perform association test with R package CpGassoc58 to test for differential methylated based on the predicted methylation level from the first step.

Targeted bisulfite sequencing

Multiplex primers were designed to amplify the identified regions using MPD software72 (primers list can be found in Supplementary Data 11). The 200–500 ng purified genomic DNA was used for bisulfite conversion (EpiTect Bisulfite Kit (Qiagen)). The treated DNA were used for PCR amplification and PCR amplicons were further purified and pooled together in equal molar. Mixed amplicons were then purified for libraries preparation and deep sequencing (100× or above) using a MiSeq following standard procedures recommended by Illumina. Image analysis and base calling were performed using standard Illumina pipelines. Quality control was performed in the same fashion for the array-based genotyping, except the missingness threshold was raised to 50%.

Association testing was performed using the same approach as the array-based methylation with CpGassoc58 with modifications. The methylation levels were modeled as logit transformation of β values (log(β/1 − β)) to stabilize the variance73. Next, we grouped adjacent CpGs into clusters and conducted the test for differential DNAm. Due to the limited sample size of the study, we call a CpG cluster differential DNAm if the lowest p-value from the Differential DNAm test is less than an unadjusted p-value 0.05 among all CpGs in the CpG cluster. We adjusted for the following covariates: age of death, sex, years of formal education, post-mortem interval, study, and cell type proportion, which was estimated using the CETS R package59.

Protein–protein interaction and pathway analyses

To identify potential cross-talk among known AD genes and genes suggested by EWASplus, we used web platform GeNets27 ( to query a combined list of 28 known AD-associated genes and 123 genes near the top 100 detached CpGs ranked by EWASplus prediction scores (Supplementary Data 1).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

EWASplus annotated genome-wide risk scores for six AD-related traits, the processed features used for EWASplus training, the list of gene names residing in the flanking regions of the top predicted CpG loci, and the result of protein–protein interaction and pathway analyses are available at ( The data that support the findings of this study (The processed EWAS data from the London, Mount Sinai, and Arizona cohorts) are made available to us from the authors of Smith et al.25, but restrictions apply to the availability of these data, which were used under agreement for the current study, and so are not publicly available. Targeted bisulfite sequencing data and ROS/MAP EWAS summary statistics of all traits are available through Synapse (!Synapse:syn25832093). Source data are provided with this paper.

Code availability

The EWASplus software implements feature selection, ensemble model training, and annotation of risk-prediction scores. The source code is available via Github at (


  1. 1.

    Hoyert, D. L., Heron, M. P., Murphy, B. S. & Kung, H. National Vital Statistics Reports—Deaths: Final Data for 2003. Vol. 54, 1–16 (U.S. Department of Health and Human Services, Center for Disease Control and Prevention, 2006).

  2. 2.

    Wingo, T. S., Lah, J. J., Levey, A. I. & Cutler, D. J. Autosomal recessive causes likely in early-onset Alzheimer disease. Arch. Neurol. 69, 59–64 (2012).

    PubMed  Article  Google Scholar 

  3. 3.

    Gatz, M. et al. Heritability for Alzheimer’s disease: the study of dementia in Swedish twins. J. Gerontol. A Biol. Sci. Med. Sci. 52, M117–M125 (1997).

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Deaton, A. M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Robertson, K. D. DNA methylation and human disease. Nat. Rev. Genet. 6, 597–610 (2005).

    CAS  PubMed  Article  Google Scholar 

  6. 6.

    Fraga, M. F. et al. Epigenetic differences arise during the lifetime of monozygotic twins. Proc. Natl Acad. Sci. USA 102, 10604–10609 (2005).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Teschendorff, A. E. et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20, 440–446 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Rakyan, V. K. et al. Human aging-associated DNA hypermethylation occurs preferentially at bivalent chromatin domains. Genome Res. 20, 434–439 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Petronis, A. Epigenetics as a unifying principle in the aetiology of complex traits and diseases. Nature 465, 721–727 (2010).

    ADS  CAS  PubMed  Article  Google Scholar 

  10. 10.

    Feinberg, A. P., Ohlsson, R. & Henikoff, S. The epigenetic progenitor origin of human cancer. Nat. Rev. Genet. 7, 21–33 (2006).

    CAS  PubMed  Article  Google Scholar 

  11. 11.

    Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 14, R115 (2013).

    PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 31, 142–147 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  13. 13.

    Edris, A., den Dekker, H. T., Melen, E. & Lahousse, L. Epigenome-wide association studies in asthma: a systematic review. Clin. Exp. Allergy 49, 953–968 (2019).

  14. 14.

    Everson, T. M. et al. Cadmium-associated differential methylation throughout the placental genome: epigenome-wide association study of two U.S. birth cohorts. Environ. Health Perspect. 126, 017010 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Kingsley, S. L. et al. Maternal residential proximity to major roadways, birth weight, and placental DNA methylation. Environ. Int. 92-93, 43–49 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    De Jager, P. L. et al. Alzheimer’s disease: early alterations in brain DNA methylation at ANK1, BIN1, RHBDF2 and other loci. Nat. Neurosci. 17, 1156–1163 (2014).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  17. 17.

    Lunnon, K. et al. Methylomic profiling implicates cortical deregulation of ANK1 in Alzheimer’s disease. Nat. Neurosci. 17, 1164–1170 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  18. 18.

    Do, C. et al. Mechanisms and disease associations of haplotype-dependent allele-specific DNA methylation. Am. J. Hum. Genet. 98, 934–955 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  19. 19.

    Watson, C. T. et al. Genome-wide DNA methylation profiling in the superior temporal gyrus reveals epigenetic signatures associated with Alzheimer’s disease. Genome Med. 8, 5 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  20. 20.

    McCartney, D. L. et al. Identification of polymorphic and off-target probe binding sites on the Illumina Infinium MethylationEPIC BeadChip. Genom. Data 9, 22–24 (2016).

    PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Zhang, W., Spector, T. D., Deloukas, P., Bell, J. T. & Engelhardt, B. E. Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol. 16, 14 (2015).

    PubMed  PubMed Central  Article  Google Scholar 

  22. 22.

    De Jager, P. L. et al. A genome-wide scan for common variants affecting the rate of age-related cognitive decline. Neurobiol. Aging 33, 1017 e1–15 (2012).

    Article  CAS  Google Scholar 

  23. 23.

    Smith, R. G. et al. Elevated DNA methylation across a 48‐kb region spanning the HOXA gene cluster is associated with Alzheimer’s disease neuropathology. Alzheimer’s Dement. 14, 1580–1588 (2018).

    Article  Google Scholar 

  24. 24.

    Beach, T. G. et al. Arizona study of aging and neurodegenerative disorders and brain and body donation program. Neuropathology 35, 354 (2015).

    PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    Smith, R. G. et al. A meta-analysis of epigenome-wide association studies in Alzheimer’s disease highlights novel differentially methylated loci across cortex. Nat. Commun. 12, 3517 (2021).

    ADS  CAS  PubMed  PubMed Central  Article  Google Scholar 

  26. 26.

    De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5, 180142 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  27. 27.

    Li, T. et al. GeNets: a unified web platform for network-based genomic analyses. Nat. Methods 15, 543–546 (2018).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  28. 28.

    Kothari, V. et al. Outlier kinase expression by RNA sequencing as targets for precision therapy. Cancer Discov. 3, 280–293 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  29. 29.

    Wang, J. Z., Xia, Y. Y., Grundke-Iqbal, I. & Iqbal, K. Abnormal hyperphosphorylation of tau: sites, regulation, and molecular mechanism of neurofibrillary degeneration. J. Alzheimers Dis. 33, S123–S139 (2013).

    PubMed  Article  CAS  Google Scholar 

  30. 30.

    Das, S., Abecasis, G. R. & Browning, B. L. Genotype imputation from large reference panels. Annu. Rev. Genomics Hum. Genet. 19, 73–96 (2018).

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Abecasis, G. R., Ghosh, D. & Nichols, T. E. Linkage disequilibrium: ancient history drives the new genetics. Hum. Hered. 59, 118–124 (2005).

    PubMed  Article  Google Scholar 

  32. 32.

    Bell, J. T. et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 12, R10 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  33. 33.

    Eckhardt, F. et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat. Genet. 38, 1378–1385 (2006).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  34. 34.

    Yao, D. W., O’Connor, L. J., Price, A. L. & Gusev, A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat. Genet. 52, 626–633 (2020).

  35. 35.

    Battle, A. et al. Genomic variation. Impact of regulatory variation from RNA to protein. Science 347, 664–667 (2015).

    ADS  CAS  PubMed  Article  Google Scholar 

  36. 36.

    Chen, L., Jin, P. & Qin, Z. S. DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles. Genome Biol. 17, 252 (2016).

  37. 37.

    Rosenthal, S. L., Barmada, M. M., Wang, X., Demirci, F. Y. & Kamboh, M. I. Connecting the dots: potential of data integration to identify regulatory SNPs in late-onset Alzheimer’s disease GWAS findings. PLoS ONE 9, e95152 (2014).

    ADS  PubMed  PubMed Central  Article  Google Scholar 

  38. 38.

    Kamboh, M. I. et al. Genome-wide association analysis of age-at-onset in Alzheimer’s disease. Mol. Psychiatry 17, 1340–1346 (2012).

    CAS  PubMed  Article  Google Scholar 

  39. 39.

    Altuna, M. et al. DNA methylation signature of human hippocampus in Alzheimer’s disease is linked to neurogenesis. Clin. Epigenetics 11, 91 (2019).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  40. 40.

    Johnson, N. E. Myotonic muscular dystrophies. CONTINUUM: Lifelong Learn. Neurol. 25, 1682–1695 (2019).

    Google Scholar 

  41. 41.

    Tavares, I. A. et al. Prostate-derived sterile 20-like kinases (PSKs/TAOKs) phosphorylate tau protein and are activated in tangle-bearing neurons in Alzheimer disease. J. Biol. Chem. 288, 15418–15429 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. 42.

    Hall, M. G., Quignodon, L. & Desvergne, B. Peroxisome proliferator-activated receptor beta/delta in the brain: facts and hypothesis. PPAR Res. 2008, 780452 (2008).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Braissant, O., Foufelle, F., Scotto, C., Dauca, M. & Wahli, W. Differential expression of peroxisome proliferator-activated receptors (PPARs): tissue distribution of PPAR-alpha, -beta, and -gamma in the adult rat. Endocrinology 137, 354–366 (1996).

    CAS  PubMed  Article  Google Scholar 

  44. 44.

    Shin, H. D. et al. Genetic polymorphisms in peroxisome proliferator-activated receptor delta associated with obesity. Diabetes 53, 847–851 (2004).

    CAS  PubMed  Article  Google Scholar 

  45. 45.

    Trejo, J. et al. A direct role for protein kinase C and the transcription factor Jun/AP-1 in the regulation of the Alzheimer’s beta-amyloid precursor protein gene. J. Biol. Chem. 269, 21682–21690 (1994).

    CAS  PubMed  Article  Google Scholar 

  46. 46.

    Bharadwaj, P. & Martins, R. N. PRKAG2 gene expression is elevated and its protein levels are associated with increased amyloid-beta accumulation in the Alzheimer’s disease brain. J. Alzheimers Dis. 74, 441–448 (2020).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  47. 47.

    Johnson, E. C. B. et al. Large-scale proteomic analysis of Alzheimer’s disease brain and cerebrospinal fluid reveals early changes in energy metabolism associated with microglia and astrocyte activation. Nat. Med. 26, 769–780 (2020).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  48. 48.

    Schneider, J. A., Aggarwal, N. T., Barnes, L., Boyle, P. & Bennett, D. A. The neuropathology of older persons with and without dementia from community versus clinic cohorts. J. Alzheimer’s Dis. 18, 691–701 (2009).

    Article  Google Scholar 

  49. 49.

    Meissner, A. et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 33, 5868–5877 (2005).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  50. 50.

    Voight, B. F. et al. The metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits. PLoS Genet. 8, e1002793 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  51. 51.

    Bennett, D. A. et al. Overview and findings from the rush memory and aging project. Curr. Alzheimer Res. 9, 646–663 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  52. 52.

    Bennett, D. A. et al. Religious orders study and rush memory and aging project. J. Alzheimers Dis. 64, S161–S189 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  53. 53.

    Brokaw, D. L. et al. Cell death and survival pathways in Alzheimer’s disease: an integrative hypothesis testing approach utilizing-omic data sets. Neurobiol. Aging 95, 15–25 (2020).

    CAS  PubMed  Article  Google Scholar 

  54. 54.

    Chen, Y. A. et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics 8, 203–209 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  55. 55.

    Barfield, R. T. et al. Accounting for population stratification in DNA methylation studies. Genet. Epidemiol. 38, 231–241 (2014).

    PubMed  PubMed Central  Article  Google Scholar 

  56. 56.

    Teschendorff, A. E. et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 29, 189–196 (2013).

    CAS  Article  Google Scholar 

  57. 57.

    Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  58. 58.

    Barfield, R. T., Kilaru, V., Smith, A. K. & Conneely, K. N. CpGassoc: an R function for analysis of DNA methylation microarray data. Bioinformatics 28, 1280–1281 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  59. 59.

    Guintivano, J., Aryee, M. J. & Kaminsky, Z. A. A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics 8, 290–302 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  60. 60.

    Bennett, D. A. et al. Neuropathology of older persons without cognitive impairment from two community-based studies. Neurology 66, 1837–1844 (2006).

    CAS  PubMed  Article  Google Scholar 

  61. 61.

    Braak, H. & Braak, E. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathol. 82, 239–259 (1991).

    CAS  PubMed  Article  Google Scholar 

  62. 62.

    Mirra, S. S. et al. The Consortium to Establish a Registry for Alzheimer’s Disease (CERAD). Part II. Standardization of the neuropathologic assessment of Alzheimer’s disease. Neurology 41, 479–486 (1991).

    CAS  PubMed  Article  Google Scholar 

  63. 63.

    Wilson, R. S. et al. Temporal course and pathologic basis of unawareness of memory loss in dementia. Neurology 85, 984–991 (2015).

    PubMed  PubMed Central  Article  Google Scholar 

  64. 64.

    Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).

    MATH  Google Scholar 

  65. 65.

    Liaw, A. & Wiener, M. Classification and regression by randomForest. R. N. 2, 18–22 (2002).

    Google Scholar 

  66. 66.

    Friedman, J. H. Stochastic gradient boosting. Computat. Stat. Data Anal. 38, 367–378 (2002).

    MathSciNet  MATH  Article  Google Scholar 

  67. 67.

    Chen, T. & Guestrin, C. Xgboost: a scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).

  68. 68.

    Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  69. 69.

    Lu, Q. et al. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data. Sci. Rep. 5, 10576 (2015).

    ADS  PubMed  PubMed Central  Article  Google Scholar 

  70. 70.

    Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  71. 71.

    Bergstra, J., Komer, B., Eliasmith, C., Yamins, D. & Cox, D. Hyperopt: a Python library for model selection and hyperparameter optimization. Computat. Sci. Discov. 8, 014008 (2015).

    ADS  Article  Google Scholar 

  72. 72.

    Wingo, T. S., Kotlar, A. & Cutler, D. J. MPD: multiplex primer design for next-generation targeted sequencing. BMC Bioinformatics 18, 14 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  73. 73.

    Du, P. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

Download references


We are grateful to the participants and staff of the ROS and MAP for their time and participation. We are indebted to Ehsan Pishva, Danielle Brokaw, Katie Lunnon, Diego Mastroeni as well as the Lunnon group and the Mastroeni group for sharing EWAS summary data from the London, Mount Sinai, and Arizona cohorts. We thank Mesa Schumacher for graphical design help. We thank Leo Zhang for technical assistance. The work reported in this publication was supported in part by Imagine, Innovate and Impact (I3) Funds from Emory University School of Medicine, the Georgia CTSA NIH Award (UL1-TR002378), NIH awards R56 AG062256, R56 AG060757, R01 AG056533, the Accelerating Medicine Partnership for AD (U01 046152; U01 AG046161; U01 AG061356; U01 AG061357), the Emory Alzheimer’s Disease Research Center (P50 AG025688), the NINDS Emory Neuroscience Core (P30 NS055077), Rush University Alzheimer’s Disease Center (P30 AG10161), and grants supporting ROS/MAP studies (R01 AG017917, R01 AG015819, RC2 AG036547, RF1 AG036042, U01 AG61356). The views expressed in this work do not necessarily represent the views of the Veterans Administration or the United States Government.

Author information




X.S., Y.H., Z.S.Q., T.S.W., and P.J. conceptualized and designed the study. X.S., Y.H., R.L., and Z.M. acquired the data. Y.H., X.S., H.J., S.Y., C.R., M.A., X.S., and E.S.G. conducted analyses. Y.H., X.S., Z.S.Q., T.S.W., and P.J. interpreted the data. X.S. and Y.H. wrote the first draft of the manuscript. X.S., T.S.W., Z.S.Q., P.J., A.P.W., D.A.B., and P.L.D.G. supervised the study. All authors critically revised and reviewed the manuscript.

Corresponding authors

Correspondence to Xiaobo Sun or Thomas S. Wingo or Zhaohui S. Qin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Huang, Y., Sun, X., Jiang, H. et al. A machine learning approach to brain epigenetic analysis reveals kinases associated with Alzheimer’s disease. Nat Commun 12, 4472 (2021).

Download citation


By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing