Identification of candidate DNA methylation biomarkers related to Alzheimer’s disease risk by integrating genome and blood methylome data

Alzheimer's disease (AD) is a common neurodegenerative disease with a late onset. It is critical to identify novel blood-based DNA methylation biomarkers to better understand the extent of the molecular pathways affected in AD. Two sets of blood DNA methylation genetic prediction models developed using different reference panels and modelling strategies were leveraged to evaluate associations of genetically predicted DNA methylation levels with AD risk in 111,326 (46,828 proxy) cases and 677,663 controls. A total of 1,168 cytosine-phosphate-guanine (CpG) sites showed a significant association with AD risk at a false discovery rate (FDR) < 0.05. Methylation levels of 196 CpG sites were correlated with expression levels of 130 adjacent genes in blood. Overall, 52 CpG sites of 32 genes showed consistent association directions for the methylation-gene expression-AD risk pathway, including nine genes (CNIH4, THUMPD3, SERPINB9, MTUS1, CISD1, FRAT2, CCDC88B, FES, and SSH2) reported here for the first time as AD risk genes. Nine of the 32 genes were enriched in dementia and AD disease categories (P values ranged from 1.85 × 10⁻⁴ to 7.46 × 10⁻⁶), and 19 genes were found in a neurological disease network (score = 54). Our findings improve the understanding of the genetics and etiology of AD.


INTRODUCTION
As the most common form of neurodegenerative illness, Alzheimer's disease (AD) remains the sixth leading cause of death in the United States and the fifth leading cause of death among Americans aged ≥ 65 years [1]. AD is a slowly progressing neurodegenerative disorder, which can start 20-30 years before the appearance of the first clinical symptoms [2]. An improved understanding of AD etiology is critical to reduce the public health burden of this common disease.
Epidemiological studies provide strong support for a genetic predisposition to AD [3]. To date, genome-wide association studies (GWAS) have identified more than 56 gene loci [4], and transcriptome-wide association studies (TWAS) [5][6][7][8][9][10][11][12][13][14] and splicing TWAS [15] have identified 29 genomic loci and over 280 genes associated with AD risk. However, together these variants and genes explain only a proportion of the familial relative risk of AD [16,17]. One potential explanation is that, for AD, some risk-associated single nucleotide polymorphisms (SNPs) may regulate the expression of their target genes by influencing DNA methylation levels. As the most extensively investigated epigenetic marker, DNA methylation represents one kind of molecular regulatory mechanism affecting gene expression that could further influence the risk of phenotypes [18]. It has been reported that specific aberrant changes in DNA methylation trigger alterations in the transcriptional levels of genes involved in the pathogenesis of AD [19]. Indeed, previous work reported that lower DNA methylation levels at TREM2 intron 1 increased AD risk because the lower methylation led to higher TREM2 mRNA expression in the leukocytes of AD patients than in healthy controls [20]. DNA methylation at SORL1, SIRT1, UQCRC1, ABCA7, CNP, and DPYSL2 [21][22][23][24] has also been reported to influence AD through similar mechanisms. However, a comprehensive study to assess methylation markers that potentially influence AD risk through the DNA methylation-gene expression-AD risk pathway is largely lacking.
In this study, we leveraged two sets of DNA methylation prediction models built using large reference methylation datasets in blood (Framingham Heart Study (FHS) and Biobank-based Integrative Omics Study (BIOS); up to 4008 samples) with different modelling strategies [25][26][27] to evaluate the associations of genetically predicted DNA methylation levels with AD risk. For the association analyses with AD risk, we used the latest data from the AD GWAS involving 111,326 (46,828 proxy) cases and 677,663 controls of European ancestry from ten consortia/datasets, including the European Alzheimer & Dementia Biobank (EADB) datasets, the Genomic Research at Ace study (GR@ACE), the European Alzheimer's Disease Initiative Consortium (EADI), Genetic and Environmental Risk in AD/Defining Genetic, Polygenic and Environmental Risk for Alzheimer's Disease Consortium (GERAD/PERADES), the Norwegian DemGene Network (DemGene), Bonn Studies (Bonn), the Rotterdam study, the Copenhagen City Heart Study (CCHS), the Neocodex-Murcia study (NxC) and the UK Biobank (UKBB) [28].

MATERIALS AND METHODS
DNA methylation genetic prediction models
Two sets of DNA methylation prediction models established using different modelling strategies, FHS [25,26] and BIOS [27], were used in the current study.
FHS models. The detailed information for FHS models has been described in previous studies [25,26,29,30]. In brief, the individual-level genome-wide genotyping and white blood cell DNA methylation data were obtained from the FHS Offspring Cohort (dbGaP accession numbers: phs000342 and phs000724) [25]. A total of 1595 genetically unrelated subjects of European descent with genetic and DNA methylation data were used to build FHS DNA methylation prediction models. Genomic DNA was genotyped using the Affymetrix 500 K array, and DNA methylation was measured using the Illumina HumanMethylation450 BeadChip. The genotype data were imputed to the Haplotype Reference Consortium reference panel [31]. SNPs meeting the following conditions were used to build DNA methylation prediction models: (1) high imputation quality (R² ≥ 0.8), (2) minor allele frequency ≥ 0.05, (3) included in the HapMap Phase 2 version, and (4) not strand ambiguous. For DNA methylation data, quality control and normalization were performed using the "minfi" package [32]. The quality control steps included removing low-quality samples, excluding low-quality methylation probes, estimating cell-type composition, and calculating methylation beta values. The same scale methylation profile of each sample was first acquired using quantile normalization. A standard normal distribution of methylation values of each cytosine-phosphate-guanine (CpG) site was further obtained using rank normalization. The DNA methylation data were adjusted for age, sex, cell type composition variables, and the top 10 principal components (PCs). The DNA methylation level of each CpG site was predicted using the elastic net method as implemented in the "glmnet" package of R, with α = 0.5 [26,33]. In short, we estimated the genetically regulated component of methylation levels for each CpG by including variants within a 2 Mb window flanking the CpG site, inclusive. The square of the correlation between predicted and observed levels (R²) was calculated to estimate the prediction performance of each of the CpG prediction models established.
BIOS models. BIOS DNA methylation prediction models were built using whole-blood methylation data from the BIOS Consortium involving 4008 samples (Illumina 450 K arrays). The detailed information of the model building has been described elsewhere [27,34]. Briefly, in total, 881,977 unambiguous HapMap SNPs in the genetic data meeting the following criteria were retained: (1) minor allele frequency > 5%, (2) minor allele count > 10, and (3) imputation info score > 0.8. The genotype data were also imputed to the Haplotype Reference Consortium reference panel [31]. For methylation quantitative trait loci (meQTL) analysis, linear regression on each SNP-CpG site pair closer than 250 kb was performed. At a false discovery rate (FDR) of 5% (P < 9.3 × 10⁻⁵), there were 151,729 CpG sites with a significant meQTL. For each CpG with a significant meQTL, a prediction model of methylation was established based on local SNPs within 250 kb using glmnet; each model is a weighted linear combination of SNPs. We derived the unstandardized prediction models by leveraging the original standardized models and the standard deviation of variants in European populations of the 1000 Genomes Project data.
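The conversion from standardized to unstandardized models can be illustrated with a minimal sketch. The text does not give the exact transformation, so the sketch assumes the common convention that a weight fitted on standardized genotypes is divided by the variant's standard deviation to obtain a per-allele weight. The function names, SNP identifiers, weights and allele frequencies are hypothetical, and the Hardy-Weinberg standard deviation stands in for the 1000 Genomes estimates.

```python
from math import sqrt


def unstandardize_weights(std_weights, variant_sd):
    """Rescale weights fitted on standardized genotypes to the dosage scale.

    Assumption: the standardized model used (dosage - mean) / sd, so the
    per-allele weight is weight / sd (centering only shifts the intercept).
    """
    rescaled = {}
    for snp, weight in std_weights.items():
        sd = variant_sd[snp]
        if sd <= 0:
            continue  # a monomorphic variant cannot contribute to prediction
        rescaled[snp] = weight / sd
    return rescaled


def predict_methylation(weights, dosages):
    """Weighted linear combination of SNP dosages (intercept omitted).

    SNPs missing from `dosages` are treated as contributing zero.
    """
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())


# Hypothetical model for a single CpG site
std_weights = {"rs_a": 0.20, "rs_b": -0.10}
allele_freq = {"rs_a": 0.30, "rs_b": 0.50}
# Assumption: dosage SD under Hardy-Weinberg equilibrium, sqrt(2p(1-p))
variant_sd = {snp: sqrt(2 * p * (1 - p)) for snp, p in allele_freq.items()}

weights = unstandardize_weights(std_weights, variant_sd)
score = predict_methylation(weights, {"rs_a": 1, "rs_b": 2})
```

Dividing by the standard deviation leaves the ranking of individuals unchanged; it only moves the weights onto the scale of allele counts so that they can be combined with GWAS effect sizes expressed per allele.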

Associations between predicted methylation levels and AD risk
Associations between genetically predicted DNA methylation levels and AD risk were analyzed using S-PrediXcan [33] by applying FHS and BIOS DNA methylation prediction models to summary statistics of AD GWAS. These summary data were generated from 111,326 (46,828 proxy) cases and 677,663 controls of European ancestry from ten consortia/datasets, including EADB, GR@ACE, EADI, GERAD/PERADES, DemGene, Bonn, the Rotterdam study, the CCHS study, NxC and the UKBB [28]. Instead of using the conventional approach of including clinically diagnosed AD alone, this dataset included both clinically confirmed cases and by-proxy phenotypes based on parental diagnoses, an approach that has been demonstrated to substantially increase statistical power [35]. It has been found that AD-by-proxy, based on parental diagnoses, shows a strong genetic correlation with AD (r_g = 0.81) [35]. Detailed information on study participants, genotyping, and imputation methods has been included in the original GWAS paper [28]. In our association analysis, the FDR-corrected P value threshold of ≤ 0.05 was used to determine significant associations between genetically predicted DNA methylation levels and AD risk.
To further pinpoint the putative causal CpG sites for AD risk, fine-mapping of causal gene sets (FOCUS), as described elsewhere, was applied [36]. The two sets of blood methylation prediction models and results of the main association analyses were used as inputs, and for each independent LD block defined by LDetect [37], the posterior probability for each CpG site in the LD block was output. In the FOCUS analysis, putative causal CpG sites were prioritized using the default 90% credible set of CpG sites.
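As an illustration of the summary-statistics approach, the S-PrediXcan association statistic for one CpG site can be sketched as Z ≈ Σ_l w_l (σ_l / σ_m) (β_l / se_l), where w_l are the model weights, β_l / se_l are the GWAS z-scores, σ_l is the SNP standard deviation and σ_m is the standard deviation of the predicted methylation level, both taken from a reference panel. The sketch below is not the MetaXcan implementation; the function name and the toy inputs are assumptions made for illustration.

```python
from math import sqrt


def spredixcan_z(weights, gwas_z, snp_cov):
    """Approximate S-PrediXcan z-score for one CpG site.

    weights : prediction-model weights, one per SNP
    gwas_z  : GWAS z-scores (beta / se) for the same SNPs, same order
    snp_cov : SNP covariance matrix (list of lists) from a reference panel
    """
    n = len(weights)
    if not (len(gwas_z) == n == len(snp_cov)):
        raise ValueError("inputs must cover the same SNPs")
    # Variance of the predicted methylation level: w' * Cov * w
    var_pred = sum(weights[i] * snp_cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    if var_pred <= 0:
        raise ValueError("predicted methylation has no variance")
    # Each SNP contributes weight * SNP standard deviation * GWAS z-score
    numerator = sum(weights[k] * sqrt(snp_cov[k][k]) * gwas_z[k]
                    for k in range(n))
    return numerator / sqrt(var_pred)


# Hypothetical two-SNP model with made-up covariance and GWAS z-scores
toy_z = spredixcan_z([0.5, -0.2], [3.0, -1.0], [[0.42, 0.10], [0.10, 0.50]])
```

For a single-SNP model the expression reduces to the GWAS z-score of that SNP, with the sign set by the weight, which is a convenient sanity check.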
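The idea of a 90% credible set can be sketched as follows: within one LD block, CpG sites are ranked by posterior probability and added until the cumulative normalized probability reaches the chosen level. This is a simplified illustration only; the FOCUS software handles further details (for example, a null model) that are not reproduced here, and the function name and probe labels are hypothetical.

```python
def credible_set(pips, level=0.90):
    """Return the smallest set of CpG sites whose normalized posterior
    probabilities sum to at least `level`, ranked from highest to lowest.

    pips : dict mapping CpG identifier -> posterior inclusion probability
    """
    total = sum(pips.values())
    if total <= 0:
        return []
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for cpg, pip in ranked:
        selected.append(cpg)
        cumulative += pip / total  # normalize within the LD block
        if cumulative >= level:
            break
    return selected


# Hypothetical posterior probabilities for three CpG sites in one LD block
block = {"cpg_1": 0.70, "cpg_2": 0.25, "cpg_3": 0.05}
prioritized = credible_set(block)
```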

Correlations of AD-associated CpG sites with their nearby genes
For the AD-associated CpG sites, correlation analysis of their methylation levels and the expression levels of their nearby genes was performed using data of 1367 unrelated European individuals from the FHS Offspring Cohort (dbGaP accession numbers: phs000363 and phs000724). We were not able to use the BIOS Consortium data due to a lack of access to the individual-level data. The detailed information about such DNA methylation and gene expression data has been described elsewhere [25,26,29,30]. After adjusting for age, sex, cell type composition variables and top principal components (PCs), the correlation between the normalized methylation levels and the expression levels of genes near the AD-associated CpG sites was calculated.

Associations of potential target genes of CpG sites with AD risk
For identified putative target genes of AD-associated CpG sites, we further assessed associations of their predicted expression in blood with AD risk. Here two sets of gene expression prediction models were used, one established using a modified unified test for molecular signatures (UTMOST) strategy for the Genotype-Tissue Expression Project (GTEx) v8 dataset, and the other developed using a LASSO strategy for the BIOS dataset. For the UTMOST models, transcriptome and genome data from GTEx v8 were used to develop genetic imputation models for genes expressed in whole blood (N = 670). The cross-tissue UTMOST framework was used to build models [8]. SNPs within 1 Mb upstream and downstream of each gene of interest were considered as candidate predictors. It was shown that there is no significant difference in prediction quality from applying linkage disequilibrium (LD) pruning [41]. Therefore, LD pruning (r² = 0.9) was performed before model training to reduce the computational burden. In the joint-tissue prediction model, the effect sizes were estimated by minimizing the loss function with a logistic least absolute shrinkage and selection operator (LASSO) penalty on the columns (within-tissue effects) and a group-LASSO penalty on the rows (cross-tissue effects). The group penalty term implemented sharing of the information from SNP selection across all the tissues. Two hyperparameters, λ1 and λ2, for the within-tissue and cross-tissue penalization, were used for model optimization. For hyperparameter tuning, five-fold cross-validation was performed. A reliable estimate of the imputation performance was obtained by the modified model training approach. The original model training [8] was modified by unifying the hyperparameter pairs to avoid overestimation of the prediction performance [42].
For the BIOS gene expression prediction models, a reference transcriptome dataset involving 3344 subjects was used. The detailed information for the establishment of this set of models has been described elsewhere [27]. For each of the 13,870 genes with a significant expression quantitative trait locus (eQTL), a prediction model was fitted in R with glmnet to assess the potential predictive value of SNPs within 250 kb of the gene for gene expression. We used these sets of gene expression prediction models to estimate the associations between genetically predicted gene expression levels in blood and AD risk, using the same AD GWAS data, involving 111,326 (46,828 proxy) cases and 677,663 controls, as described above [43].

Consistent direction of effect for the DNA methylation-gene expression-AD risk
To assess whether genetically predicted DNA methylation might influence AD risk by regulating the expression of nearby target genes, we identified CpG site-gene pairs showing a consistent direction of effect across three sets of associations: (1) associations between genetically predicted DNA methylation levels in blood and AD risk, (2) associations between DNA methylation and gene expression in blood, and (3) associations between genetically predicted gene expression in blood and AD risk.
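One plausible reading of "consistent direction of effect" can be expressed as a sign check: the direction implied by combining the methylation-AD association with the methylation-expression correlation should match the direction of the expression-AD association. The exact rule is not spelled out in the text, so the sketch below is an assumption, and the function names and input values are hypothetical.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def is_consistent(meth_ad_effect, meth_expr_corr, expr_ad_effect):
    """Check direction consistency for one CpG site-gene pair.

    meth_ad_effect : effect of predicted methylation on AD risk (e.g. log OR)
    meth_expr_corr : correlation between methylation and gene expression
    expr_ad_effect : effect of predicted expression on AD risk (e.g. log OR)

    If higher methylation raises risk and methylation lowers expression,
    then higher expression is expected to lower risk, and so on.
    """
    implied = sign(meth_ad_effect) * sign(meth_expr_corr)
    return implied != 0 and implied == sign(expr_ad_effect)


# Hypothetical pair: methylation raises risk, represses expression,
# and higher expression is protective -> consistent
example = is_consistent(0.10, -0.30, -0.05)
```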

Functional enrichment analysis
For the genes showing consistent directions of associations across DNA methylation, gene expression and AD risk, their top canonical pathways, disease and biological function categories, and networks were identified using Ingenuity Pathway Analysis (IPA) software (Qiagen, Redwood City, USA, version summer release, July 2023).

RESULTS
DNA methylation prediction models
FHS models. Of a total of 223,592 CpG sites for which we were able to develop DNA methylation prediction models using the FHS dataset, 81,360 showed a prediction performance (R²) of at least 0.01 (≥ 10% correlation between predicted and measured DNA methylation levels). Considering that DNA methylation measurement tends to be biased when SNPs are located within the probe-binding site [26,42], we focused on 72,848 of those CpG sites for which there were no SNPs located within the probe-binding site. These models were used for the association analyses between predicted DNA methylation levels and AD risk.
BIOS models. As described elsewhere [27], leveraging the BIOS data, DNA methylation prediction models for 151,729 CpG sites were established, of which 103,354 showed a prediction performance (R²) of at least 0.01. For 93,442 of those CpG sites, there were no SNPs residing within the binding site. These models were also used for the association analyses.
Overall, models for a total of 104,102 unique CpG sites (either the FHS or BIOS models) were used in our association analyses for AD risk.Of them, for 62,188 CpG sites both sets of models were used; for 10,660 CpG sites only FHS models were used; and for the remaining 31,254 CpG sites only BIOS models were used (Supplementary Fig. S1).

Association between genetically predicted methylation levels and AD risk
Of the 104,102 CpG sites, genetically predicted DNA methylation of 1168 was associated with AD risk at the false discovery rate significance threshold (FDR ≤ 0.05), including 123 sites that met the more stringent Bonferroni correction threshold (P < 3.01 × 10⁻⁷, 0.05/166,290) (Supplementary Tables S1, S2 and Manhattan plot in Fig. 1), after removing 253 CpG sites in LD regions. Of the 1168 associated CpG sites, 750 showed significant associations using the FHS methylation prediction models and 827 showed associations using the BIOS prediction models. There were 409 CpG sites showing significant associations using both sets of prediction models (Supplementary Fig. S2). Reassuringly, the CpG sites showed the same association directions with AD risk when using the two sets of models (Supplementary Tables S1 and S2). Of those 1168 CpG sites associated with AD risk, 509 sites were located more than 500 kb away from any known AD risk variants from GWAS (Supplementary Table S1). Of these 509 CpG sites, a positive association between predicted DNA methylation levels and AD risk was observed for 266 sites; conversely, an inverse association with AD risk was observed for 243 CpG sites. The remaining 659 CpG sites were located at known AD risk loci (Supplementary Table S2).
Based on the FOCUS analyses, 26 CpG sites, representing 27 associations, were further prioritized as putatively causal CpG sites for AD risk (Table 1). Of them, four CpG sites (cg09323728, cg18059933, cg26140475, and cg20555462) were located more than 500 kb away from any known AD risk variants (Supplementary Table S1), involving genes NDUFAF6, TRIB1, LINC00861, and UBASH3B.
Based on annotation using eFORGE v2.0 (https://eforge.altiusinstitute.org/) [39,40], positions of the 509 novel AD-associated CpG sites overlapped with regions containing lysine 4 mono-methylated H3 histone (H3K4me1) markers across 36 of 39 cell types in the consolidated Roadmap Epigenomics Project, including blood (primary T cells from cord blood and peripheral blood, primary B cells, natural killer cells and monocytes from peripheral blood, and primary hematopoietic stem cells G-CSF-mobili) (Supplementary Fig. S3). These results indicated that our identified CpG sites associated with AD risk might be enriched in enhancers and regions of transcriptional activation, supporting the potential functional significance of our findings.
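The thresholds and counts reported above can be checked with a short sketch. The Bonferroni denominator is the number of models tested (72,848 FHS plus 93,442 BIOS), and the number of associated sites follows from inclusion-exclusion over the two model sets. The Benjamini-Hochberg helper is a generic textbook implementation run on made-up P values; it is not the authors' code, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def bh_critical_p(pvalues, alpha=0.05):
    """Largest P value declared significant by the Benjamini-Hochberg
    step-up procedure, or None if nothing is significant."""
    m = len(pvalues)
    critical = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * rank / m:
            critical = p  # keep the largest P value passing its rank bound
    return critical


# Number of models tested: FHS models plus BIOS models
n_models = 72_848 + 93_442
bonf = bonferroni_threshold(0.05, n_models)

# Inclusion-exclusion: sites significant with FHS or BIOS models
n_sites = 750 + 827 - 409

# Made-up P values, used only to exercise the FDR helper
toy_cutoff = bh_critical_p([0.001, 0.008, 0.039, 0.041, 0.6])
```

The Bonferroni value agrees with the reported 3.01 × 10⁻⁷ after rounding, and the inclusion-exclusion count reproduces the 1168 associated CpG sites.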

Potential target genes of associated CpG sites
Whether DNA methylation of the associated CpG sites could influence flanking gene expression was investigated by analyzing the FHS data. Of the 1168 AD-associated CpG sites, correlation analyses were performed for 1038 pairs of 892 CpG sites and their 485 flanking genes. Two hundred and five CpG site-gene pairs were observed to have statistically significant correlations at FDR P value < 0.05, including 196 CpG sites and 130 genes (Supplementary Table S4). Of these 205 significant correlations, 131 were negative and 74 were positive. The associations between genetically predicted expression of these 130 genes in blood and AD risk were further evaluated using the same AD GWAS summary statistics described above (111,326 cases, including 46,828 proxy cases, and 677,663 controls of European ancestry). Of these 130 genes, 46 showed an association with AD risk at FDR P value < 0.05 (Supplementary Table S5).

DISCUSSION
This is the first large-scale study to comprehensively evaluate associations of genetically predicted DNA methylation levels in blood with AD risk. Using two sets of DNA methylation prediction models developed using different reference datasets and modelling strategies, we identified 1168 CpG sites with predicted DNA methylation levels in blood associated with AD risk, including 509 located at novel loci. Through additional analyses involving gene expression, 52 CpG sites and their 32 nearby putative target genes showed consistent directions of effect on AD risk. Our study provides substantial information to improve the understanding of the genetics and etiology of AD.
Previous work has suggested that specific DNA methylation biomarkers could potentially be useful for AD risk assessment [18,53]. For example, methylation at COASY, BER, HOXB6 and BIN1 has been reported to be potentially associated with AD risk [18][54][55][56][57]. However, some of the findings have not been entirely consistent [58], potentially due to several limitations of conventional epidemiological studies, including selection bias, uncontrolled confounding, and reverse causation [26]. One strategy to reduce some of these biases is to use genetic instruments to assess the association between DNA methylation levels and AD risk. Similar to the design of a transcriptome-wide association study (TWAS) [41], the genetically determined proportion of DNA methylation levels is expected to be less susceptible to selection bias and reverse causation. We have conducted several such methylome-wide association studies (MWAS) and identified multiple candidate DNA methylation biomarkers for the risk of several diseases [25,26,29].
In our study, we used two sets of DNA methylation genetic prediction models to estimate the genetically predicted DNA methylation levels in blood. The fact that the identified associated CpG sites were supported by both sets of models, when available, provided further assurance of the robustness of the associated methylation markers. Importantly, our design of using comprehensive methylation prediction models as instruments is more powerful than studies based on the single-meQTL approach [25,26]. Our analyses leveraging a large number of available cases and controls also provide substantially higher power than studies evaluating directly measured methylation levels in relatively small samples. As a comparison, a previous study of AD risk evaluating directly measured methylation levels in 120 LOAD patients and 115 controls only had a priori power to detect differences of about 5% in mean methylation levels for the six genes under investigation, and there were no significant findings, probably due to the low statistical power [58].
Several potential limitations need to be considered for appropriate interpretation of our findings. Similar to results from TWAS, the associations observed in our analyses focusing on CpG sites are also vulnerable to confounding due to pleiotropy and co-localization of genetic signals [26]. Correlated total methylation levels across CpG sites, correlated predicted DNA methylation across CpG sites, as well as shared genetic variants between DNA methylation genetic prediction models, could all lead to spurious associations in our analyses [26,59]. When faced with two correlated predictors, regularized regression models will randomly down-weight one of them, which may be the true causal variant [26].
Despite these potential limitations, our study has several potential implications. First, our study can help fill the gap in systematic methylation analysis of AD risk, which can provide insights into the etiology of AD [60]. DNA methylation (of CpG sites) can be inherited [61] and plays a key role in regulating gene expression in a wide range of diseases and biological processes [62]. For AD, it has been shown that blood DNA methylation levels of specific CpG sites were changed in AD patients compared with controls [63,64] and that they could be associated with AD risk [65]. In our study, we identified 1168 CpG sites and 485 nearby target genes in blood tissue for AD, which may substantially improve our understanding of the etiology of this disease. In particular, by integrating the methylation, gene expression and AD data, we identified 52 CpG sites and their 32 nearby genes with consistent associations across DNA methylation, gene expression and AD risk. Most target genes of these CpG sites were known AD risk genes, such as FCER1G, BIN1, and MS4A6A. FCER1G encodes a high-affinity IgE receptor that is involved in innate immunity. A recent study showed that higher expression of this protein in microglia was related to pathologic inflammatory responses in the brain as amyloid accumulation increased [66]. For blood tissue, previous research showed that FCER1G was down-regulated (log2 fold change = -0.02, FDR-adjusted P = 3.63 × 10⁻³) in AD patients (n = 49) compared with controls (n = 67) (GEO: GSE63060) [67]. In our study, we also detected an inverse association between predicted expression of FCER1G and AD risk (OR = 0.97, FDR-adjusted P = 7.57 × 10⁻⁵). These results are intriguing and warrant further investigation. BIN1 encodes bridging integrator 1 and is a key susceptibility gene for LOAD [68]. Lower methylation levels of the BIN1 promoter in peripheral blood were found in Chinese participants with subjective cognitive decline and significant AD biological characteristics than in controls, based on analyses of the Chinese Alzheimer's Biomarker and LifestylE (CABLE) database [68]. Another study showed that decreased methylation levels of three CpG sites in the BIN1 3' intergenic region were observed in 50 LOAD cases compared with 50 age- and sex-matched controls [57]. In our study, higher predicted expression levels of BIN1 and higher methylation levels of its intergenic or exonic region CpG sites (cg08563189, cg19153828, cg19590598 and cg22376361) were associated with increased AD risk. MS4A6A, a member of the membrane-spanning 4A gene family, encodes membrane-spanning 4-domains A6A. Previous studies have revealed that MS4A6A is a risk gene for AD [69][70][71]. A previous investigation has also reported that MS4A6A transcripts were increased in blood tissue of AD patients compared with that of controls [71], which is consistent with the findings of the present study. Moreover, we identified novel AD risk-associated CpG sites and their target genes (CNIH4, THUMPD3, SERPINB9, MTUS1, CISD1, FRAT2, CCDC88B, FES, and SSH2). Three target genes (CNIH4, MTUS1, and FES) were enriched in a neurological disease-related network. The remaining six genes (THUMPD3, SERPINB9, CISD1, FRAT2, CCDC88B, and SSH2) were enriched in an inflammatory response-related network; inflammation is known to be one of the pathological features of AD [72]. In the future, functional studies focusing on the implicated CpG sites and target genes are needed to better understand their exact roles in AD development.
In the current work we focused on blood for DNA methylation prediction models. It is known that DNA methylation can be tissue-specific. It is unclear whether the DNA methylation markers identified in this study are also associated with AD risk in more relevant brain tissues. Future research in this area is needed to identify brain-specific methylation markers relevant to AD risk.
In summary, in an integrative multi-omics study, we identified multiple CpG sites associated with AD risk and found that 52 CpG sites might affect AD risk by regulating the expression of putative target genes. Our findings provide new insights into the etiology of AD.

Fig. 1 Manhattan plot of the association results from the Alzheimer's disease methylome-wide association study. The x axis represents the genomic position of the corresponding CpG site, and the y axis represents the -log10-transformed P value of the associations. Each dot represents the genetically predicted DNA methylation of one specific CpG site. The red line represents P = 5.55 × 10⁻⁴ for the false discovery rate significance threshold and the blue line represents P = 3.01 × 10⁻⁷ for the Bonferroni correction threshold (0.05/166,290). The names of the top five CpG sites and their nearby genes on four chromosomes are annotated.

a BIOS Biobank-based Integrative Omics Studies, Chr chromosome, CI confidence interval, CpG CpG sites, FHS Framingham Heart Study, OR odds ratio per SD increase in genetically predicted DNA methylation level (continuous variable), P value P value after false discovery rate (FDR) correction, UTR untranslated region.
b MetaXcan was used to estimate ORs, 95% CIs and P values. All statistical tests were two-sided.

Table 1. Twenty-six putatively causal CpG sites for AD risk prioritized by FOCUS.
a BIOS Biobank-based Integrative Omics Studies, Chr chromosome, CpG CpG sites, FHS Framingham Heart Study, kb kilobase, ncRNA noncoding RNA, UTR untranslated region.
b TWAS associations with FDR-corrected P value < 0.05 considered significant.
Y. Sun et al.

Table 2. Fifty-two consistent directions of associations across DNA methylation, gene expression and AD risk.