Introduction

Alzheimer’s Disease (AD) is a fatal neurodegenerative illness characterized by progressive dementia. Familial early onset AD (FAD, <1% AD cases)1 is dominantly inherited and involves mutations in APP 2,3,4 or PSEN1/24,5,6,7 genes. The more prevalent sporadic late-onset AD (LOAD, >90% of cases1) is a complex trait disease, which stems from genetic risk8 and environmental factors9,10,11,12 with an estimated heritability of 0.60~0.8013,14. Genome-wide association studies (GWAS) have found more than 30 LOAD-associated loci15,16,17,18 accounting for ~0.3319 of the heritability (mostly explained by APOEε4)20,21. Often LOAD-associated GWAS loci fall in difficult to interpret non-coding regions.

Males and females differ in AD prevalence and progression. After controlling for APOE status and age22,23,24, females suffer faster cognitive loss25, cerebral atrophy26,27, and hippocampal volume loss27, while males experience greater mortality28,29. Depression30,31,32,33, sleep disturbances34, and cardiometabolic disorders28,35 associated with menopause combined with longer average life expectancy may partially explain the increased likelihood of AD in female36,37,38. However, the role of genetics in sex-specific AD risk has not been systematically studied. Females who carry an APOEε4 allele have higher cerebrospinal fluid (CSF) tau levels and higher AD risk than males39. Gene-by-sex interaction analyses have revealed sex-specific effects on AD risk for ACE40, BDNF41 and RELN42 and a recent family-based association study found four additional genes (GRID1, RIOK3, MCPH1, ZBTB7C) that conferred a sex-specific association to AD43. Beyond these targeted studies, only one GWAS has been performed where the cohort was separated by sex using CSF Aβ42 and tau as endophenotypes44. Most large-scale genome wide meta-analyses have focused on AD status and have not attempted sex-based separation, likely due to the loss of statistical power that halving the sample entails. It is critical to identify genetic contributors that underlie sex-differences in AD as it could lead to more accurate disease risk assessment and more tailored therapeutic approaches45,46. Bridging this gap will require analytical methods capable of extracting meaningful genetic information from smaller samples than those used in traditional genome-wide approaches.

In order to identify novel sex-specific genetic drivers linked to AD, we developed a machine learning method that exploits whole exome sequencing (WES) data from the Alzheimer’s Disease Sequencing Project (ADSP)18 to identify genes that differentiate cases from controls. Figure 1 presents a graphical summary of this study. Unlike other approaches to this problem, our algorithm focuses on the functional impact of non-synonymous coding variants. These coding variants typically have unknown significance, but here we estimated their deleterious effect with the evolutionary action (EA) score47. For a given amino acid substitution at a given protein sequence location, this score is a product of the magnitude of the substitution times the functional sensitivity of the position. The former is estimated from amino acid substitution matrices. For example, an alanine to serine transition represents a small substitution magnitude while an alanine to tryptophan is a large one. The latter portion of EA is estimated from the evolutionary importance of each sequence position. For example, a position that varies often between phylogenetically close species is less sensitive than one that varies seldom and between phylogenetically distant species48. In this way, evolutionary action interprets the potential harm of human coding variants in light of past evolutionary divergences, providing consistently good performance with respect to other state-of-the-art methods and support in many practical applications. Notably, the evolutionary action score performed well in objective, blinded community challenges48,49 and was instrumental in suggesting candidate genes in autism spectrum disorder50, Alzheimer’s disease51, and cancer52,53.

Fig. 1: Schematic overview of study.
figure 1

This figure illustrates 4 steps of the study: ADSP WES data preparation, EAML run on the ADSP WES cohort, EAML runs of male and female separated cohort, and the criteria of success experiments to investigate predicted genes from EAML.

Here for the first time, we combine evolutionary action with machine learning (EAML) by using it as a training feature to rank all genes in the genome for their ability to separate AD cases from controls. Strikingly, EAML maintains its accuracy in smaller samples and can be used separately on males and females to search for sex-specific AD genes. The genes we find are significantly involved in AD biology by multiple computational and experimental criteria and point to a cell-cycle/DNA repair module predictive of AD status specifically in females. These proof-of-concept findings support a general approach to identify genetic mechanisms linked to complex diseases by machine learning over case-control sequence data using phylogenetic evolutionary information. In AD, the results are new potential biomarkers for sex-sensitive diagnosis, drug development, and therapy.

Results

Learning AD-associated genes from the mutational impact of coding variants

To identify genes underlying LOAD we studied 2729 AD patients and 2441 control subjects from the ADSP cohort (dbGaP phs000572.v7. p4). Our ensemble computational approach, EAML, included nine separate machine learning methods, namely, PART54, JRip55, Multi-layer Perceptron56, Naive Bayes57, Logistic Regressions58, Nearest Neighbors59, Decision Trees Random Forest60, J4861, and Adaboost62. Each one used EA scores and the homo- or heterozygous status of coding variants of subjects (see Methods 4) to measure, with a Matthews Coefficient Correlation (MCC)63, how well each gene could separate patients from controls. After trying multiple aggregation metrics, including a voting system, cross-entropy-based ranking, and the MCC average, the average MCC over all nine algorithms was used to rank 17,400 human genes by their ability to predict AD and 98 genes met a significance cutoff (FDR < 0.01, Supplementary Data 1). The top gene was APOE, which suggested that EAML has the potential to identify AD risk genes.

EAML genes are dysregulated in LOAD brains and enriched in pathways disrupted in AD

In order to assess these 98 EAML candidate genes, we asked whether they were functionally connected with genes previously linked to AD. Label propagation in biological networks measures the functional proximity between two sets of genes64,65,66,67,68, and we computed propagation over the generic STRING v11 protein-protein interaction (PPI) network, first removing APOE from the 98 genes given its known connectivity to AD. To control for biases due to the high network connectivity of the EAML candidate genes, we also performed label propagation for 100 sets of 97 random genes with a similar distribution of degree connectivity. Compared to the random genes with equivalent connectivity, the 97 EAML candidates diffused significantly to twenty-five LOAD-associated genes from GWAS17 (z-score ~3.76 and AUC of 0.74, Fig. 2A and Supplementary Table 1). Additionally, previous studies show that genes and diseases whose keywords are co-mentioned in biomedical literature are likely to be biologically connected69,70 and the confidence of their associations is based on the number of papers with co-mentions. Thus, we repeated the diffusion analysis on a different network71 built from the keyword co-occurrences of genes, diseases, and drugs in biomedical papers from PubMed. The 97 EAML genes were significantly connected to AD genes (z-score ~6.2) and to dementia-related disorders compared to random disorders (z-score >3, Table 1). These data suggest that EAML genes are related to functional pathways enriched for AD genes and are consistent with roles in AD pathology and dementia.

Fig. 2: The top 98 EAML genes are connected to GWAS genes, dysregulated in AD patients, and capable of separating AD and healthy control samples.
figure 2

A Diffusions from the top 98 EAML genes from the full cohort to the 25 GWAS genes. AUC-ROC curve, x: False Positive Rate, y: True Positive Rate. This AUC-ROC curve represents the predictive power of 97 EAML genes to prioritize the 25 GWAS genes. Density distribution of randomly generated 100 AUCs based on randomly selected genes. The arrow indicates the AUC calculated from Fig. 2A. B (left) Integration of EAML candidates with expression networks dysregulated in AD. EAML candidates mapped into the AD consensus modules network (organic distribution) based on gene co-expression analysis75,76,78. The main function enriched in each module75,76,78 is indicated. Darker, thicker edges connect genes that are more highly correlated. Genes with a red ring are dysregulated in at least one brain region in AD vs control. (right) Heat map indicating which of the genes in the “immune system” module are dysregulated in AD brains. Also shown is the cell type in which their expression is enriched in the brain. C Risk prediction that based on the EAML genes for the full cohort. The box plot indicates minimum and maximum (lower and upper) whiskers, median (horizontal line), and first and third quartile (box). The mean AUCs of two groups (APOE vs. 98 FDR genes) and p-value of standard t-test (two sided) are shown.

Table 1 Diffusion (Z scores) from EAML genes to MeTeOR network

To further assess EAML candidates, we investigated whether they were connected to AD-related molecular changes. Leaving APOE aside again, we assessed the expression of the remaining 97 EAML candidate genes across AD brain data using the AMP-AD sequencing repository72,73,74,75,76,77. Forty-five EAML candidates were significantly dysregulated in AD patients versus controls in at least one brain region, suggesting that they either respond to or underlie AD-related insults and may therefore play a role in AD pathogenesis (Supplementary Fig. 1, hypergeometric test p = 0.04).

Next, we performed a functional enrichment analysis with respect to AD-related and brain-specific pathways using transcriptomic data from AMP-AD72,73,74,75,76,77,78 brain tissue. Various approaches have been developed to discover consensus transcriptional changes taking place in AD brains compared to controls. These efforts have led to the identification of several AD co-expression modules that are enriched in specific biological processes75,76,77,78. We mapped the EAML predicted genes onto these co-expression networks. We found that of the 53 EAML candidates that have been analyzed in the AD transcriptome, 22 belonged to modules related to immune response (Fig. 2B left, Fisher’s p < 0.005, consensus module B in75,78). Subnetworks within these immune modules revealed EAML candidates potentially involved in cytokine signaling (HLA-C, ACSL5, PTGDR2, BAZ1A, DCLRE1B), synapse pruning (FGD2, RHBDF2) and microglia pathogen phagocytosis (EHP1L1, CD300A, FGD2, RHBDF2 and PTGS1) (Supplementary Fig. 2). These functional enrichment results are consistent with the cell types in which these genes are expressed. The expression profile of these 22 genes in published snRNAseq79,80 datasets revealed that nine are enriched in astrocytes, 10 in microglia and three in endothelial cells, supporting their potential roles in neuroinflammation (Fig. 2B right). Among the three endothelial genes, the prostaglandin receptor PTGDR2 stood out as we also identified the prostaglandin-endoperoxide synthase PTGS1, a key enzyme in prostaglandin production, which has been linked to AD pathology81. We also found enrichment in the immune system module when we integrated the 45 EAML candidate genes that were dysregulated in the AD brain transcriptome with the AD consensus modules (Fig. 2B left, consensus module A in75,76,78). These results support a role of many EAML candidates in neuroinflammation, microglial, and astrocytic biology. This is consistent with the observation that many AD-risk factors identified17,18 using different genetic approaches are involved in neuroinflammation and suggests that the EAML candidates may act through these same pathways.

EAML candidates are modifiers of neurodegeneration

If EAML candidate genes belong to pathways that contribute to AD pathophysiology, we hypothesized that they would modify neurological phenotypes in vivo from two well-characterized Drosophila AD models expressing either secreted Aβ42 or wild-type human 2N4R tau82,83 specifically in neurons. Expression of either Aβ42 or tau in Drosophila leads to late-onset progressive neuronal dysfunction that can be accurately quantified using behavioral (i.e., motor performance) readouts. We used an automated system that video records animals as they climb a vial and uses their trajectories to calculate movement metrics such as speed84. This task provides a quantitative measure of motor performance that can be monitored longitudinally, as the animals age, and serves as an assay for neuronal dysfunction. We obtained loss of function, and shRNA strains available from public repositories targeting the Drosophila homologs of the 98 EAML candidate genes and tested each one in tau and Aβ42 Drosophila models (Supplementary Fig. 3A, B and Supplementary Data 2). In total we were able to test the homologs of 73 genes. We found that 36 of these genes modulated tau-induced degeneration when their function was decreased (12 worsened tau-induced degeneration while 24 ameliorated it). In the case of the secreted Aβ42 model, 17 genes were loss of function enhancers while 12 ameliorated Aβ42-induced neurodegeneration. These results represented a significant enrichment in genetic modifiers (Fisher’s test p = 0.0001 for tau modifiers and p = 0.0115 for the Aβ42 ones) when compared with the usual hit rate (between 15–20%) of our frequent unbiased genetic screens82,85,86. These results strongly connect the EAML candidates with the ability to modulate neurodegeneration in vivo. Importantly, we identified 27 genes whose knockdown attenuates neuronal deficits in vivo, highlighting their therapeutic potential.

AD risk prediction

Since our EAML candidates arose from their individual ability to separate AD patients from controls, we reasoned that the combined genes set should perform well in predicting patient risk stratification. For this model, three efficient classifiers were retained: (1) Adaboost62 is an ensemble method that combines weak learners (e.g., decision stumps) into a stronger one with significant performance for binary classification; (2) Logistic regression58 is a less computationally intensive classifier that is particularly useful when relevant learning features are employed; (3) Random Forest60 is a decision tree-based model that effectively handles high dimensional genomic data. The three classifiers were then combined using a stacking approach87 supported by a decision tree algorithm in order to strengthen the performance of each individual method. Our predictive AD model was trained by 10-fold cross-validation, measuring predictive performance using the area under the curve (AUC) of the receiver operator characteristic. As a control, a similar model was built based on APOE genotype status plus age of onset information. This predictive model built with EAML candidates significantly outperformed the one built with APOE genotype plus age of onset (t-test p < 0.0001) (Fig. 2C). These data show that machine learning predictors trained with gene features selected by EAML perform better than APOE status and may have prognostic value for the risk of developing AD.

EAML retains robust predictive power at small cohort sizes

Given the strength of these associations between EAML genes and AD, we next sought to test the robustness of these findings through down-sampling. We compared gene candidates found by EAML when applied to sequentially smaller sub cohorts of randomly selected case and control subjects, starting from the full original 2729 cases versus 2441 controls down to 60 versus 60. For each different sample size, we performed independent EAML analyses on 10 different sets of randomly picked cases and controls (n = 10), resulting in 100 total experiments. For comparison, we ran in parallel the commonly used gene-based association analysis using SKAT-O88 (see Methods 3) on the same randomly selected cases and controls (10 iterations), and the same non-synonymous variants used in the corresponding EAML experiments. To assess performance in these down-sampled cohorts, we compared the top 50 candidates of 10 iterated experiments at each sample size from the EAML or SKAT-O results to the top 50 genes obtained from applying EAML or SKAT-O to the full ADSP cohort, calculating the Kendall-Tau ranking coefficient89 and hypergeometric overlap p-values to measure consistency. Strikingly, EAML produced consistent outputs and robust prediction capabilities at progressively smaller sample sizes. For example, at n = 700 AD cases (Fig. 3A), EAML identifies greater than 50% (hypergeometric p = 10−58) of the candidates identified in the full cohort compared to 8% (hypergeometric p = 10−4) identified by SKAT-O. Top EAML genes also ranked consistently across the decreasing sample sizes compared to the performance of the SKAT-O predictions at similar cohort sizes (Supplementary Fig. 4). Interestingly, the down-sampled hypergeometric p-value curve of SKAT-O was similar to the in-silico simulated GWAS power curve (gray line in Fig. 3A; see Methods 3). The simulated GWAS power decreased drastically as the sample size dropped (max: 0.85 at 2500, min: 0 at 250). Nine genes including APOE were recovered consistently by EAML (10 out of 10 iterations per sample size) across all conditions starting at 500 cases and 500 controls: PRSS57, CPXM2, GJA3, PTGDR2, EHBP1L1, GZMA, DTL and GLB1L3 (Supplementary Data 3). Notably, most of these recurrent top genes have been linked to AD biology or are dysregulated in AD. PRSS57 falls near the ABCA790,91 locus. CPXM2 regulates CLU levels and is linked to synaptic remodeling and several CPXM2 SNPs are associated with LOAD92,93,94,95. PTGDR2 is a receptor of prostaglandins, which play critical roles in inflammatory response and may contribute to AD96,97,98,99,100,101,102. EHBP1L1 is an interactor of BIN117 and its levels are increased in AD103. GZMA is a factor related to cytotoxicity following Herpesvirus infection and its levels are increased in effector memory T cells from AD patients104. DTL may mediate cell-cycle re-entry in AD neurons105 and GLB1L3 is decreased in AD bulk brain tissue and single cells transcriptome (Supplementary Fig. 1). Taken together these data suggest that the power of EAML remains stable, consistent, and robust even at sample sizes too small for other methods, opening the possibility of interrogating smaller cohorts separating male and female to identify sex-specific modifiers.

Fig. 3: Down-sampling analyses and sex-separated analysis led to better risk prediction.
figure 3

A Down-sampling analyses, x: The number of randomly selected samples of AD cases; an equal number of control samples were also randomly selected. y: Hypergeometric P-values (one-tailed Fisher’s Exact test) comparing the top 50 genes from each iterated experiment to the top 50 genes from full cohort, solid-line: EAML, dot-line: SKAT-O; numbers indicates mean number of overlapped genes between each set and full cohort ADSP, the error bar indicates standard error for the mean number of overlapped genes; gray-line indicates simulated GWAS power using GWAS power calculator. B Venn-diagrams intersection between top EAML predicted genes from full cohort, male, and female. The number indicates the number of overlap genes between each EAML analysis. The hypergeometric p-value (one-tailed Fisher’s Exact test) was calculated using SuperExactTest R package156. C Diffusions from the top 157 genes from male (blue) and the top 127 genes from female (red) to the 25 GWAS genes. D Risk predictions that based on the EAML genes for male (blue), female (red). EAML significantly (p ~ 0.0001) improved prediction AUC comparing to APOE in male and female combined. When separating sexes, male’s genes significantly better predict than the combined (AUC 0.878 vs. 0.825) although female genes’ prediction AUC stay same (AUC 0.824 vs. 0.825). (*GD: APOE alleles + Age). The box plot indicates minimum and maximum (lower and upper) whiskers, median (horizontal line), and first and third quartile (box). The mean AUCs of two groups (APOE vs. 98 FDR genes) and p-value of standard t-test (two sided) are shown.

EAML identifies sex-specific LOAD-associated genes

In light of its robustness to small sample sizes, we searched for sex-specific AD related genes in the ADSP cohort by applying EAML separately to male (1215 AD cases + 1104 controls) and female (1514 AD cases + 1337 controls). EAML identified 157 and 127 top AD-associated genes in male and female, respectively (FDR < 0.01, Fig. 3B and Supplementary Data 1). In each of the sex-separated cohorts, we recovered APOE as the top hit. Next, we investigated whether the identified genes in the sex-separated EAML approach are connected to the 25 AD GWAS genes17 using the same network-based label propagation analyses performed above. Network diffusion of either the 157 male or the 127 female EAML candidates revealed robust and significant connectivity to the 25 GWAS genes over the STRING network (z-scores ~6.22 and ~5.0, Fig. 3C), and to Alzheimer’s Disease in the MeTeOR network (z-scores ~5.29 and ~6.79 for male and female, respectively, Table 1). Remarkably, connectivity for genes identified by the sex-separated EAML was significantly higher for both male and female (male z-score 6.22, female z-score 5.0, Fig. 3C) than in the sex combined approach (full cohort z-score 3.76, Fig. 2A). This increased connectivity of the sex-separated EAML candidates was also supported by the alternative prioritization approach, where sex-specific EAML candidates significantly prioritized the GWAS genes more than hits from the combined EAML (full cohort AUC of 0.74 in Fig. 2A, male AUC 0.86 and female AUC 0.85 in Fig. 3C). This increased network connectivity to the AD genes may not derive from the higher overlap of the sex-specific genes to the GWAS catalog loci since the network connectivity relies on STRING d/b while the overlap to GWAS catalogue loci depends on genomic locations. Taken together our results indicate that applying the EAML algorithm to sex-specific cohorts improves our ability to discover AD-related genes.

As with the full cohort, we next used the male and the female EAML candidates to perform sex-specific risk prediction. Even with half the cohort size, the ability to predict AD status was significantly improved in male (Fig. 3D left, AUC 0.878, p 0.0001) and was similar in female (Fig. 3D right, AUC 0.824, p 0.99) compared to full cohort (Fig. 2C, AUC 0.825). This further highlights the potential value of performing sex-specific analyses to improve the pre-symptomatic diagnosis of AD in male and female.

EAML candidates from sex-separated cohorts identify pathways affected differentially in male and female

We sought to gain functional insights from the EAML candidates identified in the male and female cohorts. We integrated the EAML candidates with transcriptomic data from AD brains using the AMP-AD72,73,74,75,76,77,78 transcriptomic dataset to assess for enrichment in AD-related brain-specific pathways. As with the EAML candidates in the combined cohort, we observed a significant enrichment in genes belonging to the AD-consensus modules associated with immune response and extracellular matrix in both the male and female EAML candidates (Supplementary Fig. 5). We next explored the underlying biology of the EAML candidates that were identified only in the female cohort (female specific EAML genes), or only in the male cohort (male specific EAML genes). Integration of the male specific EAML candidates with the AD transcriptomic network revealed an enrichment in modules involved in stress response and organelle biology (p < 0.05, Supplementary Fig. 5A). On the other hand, integration of the female specific EAML candidates with the AD transcriptional signatures revealed an enrichment in modules associated with cell-cycle and DNA quality control (p < 0.05, Supplementary Fig. 5B). Prompted by this, we sought to identify specific pathways in which these male or female candidate genes may act. We used the STRING database to connect the sex-specific EAML candidates with sex-specific transcriptionally dysregulated genes from the AMP-AD database72,73,74,75,76,77,78,106. This effort did not highlight any additional modules in the male specific EAML candidates. However, this integration uncovered a module composed of several female specific EAML candidates that interact with genes dysregulated more frequently in female AD brains (Supplementary Fig. 6). Interestingly, this module included two genes previously linked to AD: CD2AP15,107 and MCM7108 (Fig. 4A). The main functional enrichment in this module was cell cycle control and DNA quality control. Among all the genes in the module, ANLN stood out as a female-specific EAML candidate and is dysregulated at the transcriptional level only in female AD cases. Furthermore, expression of ANLN and POLD1 correlated with AD neuropathology in human brains (Fig. 4B). Accumulation of cell cycle markers in post-synaptic neurons is a hallmark of AD and other neurodegenerative diseases109,110,111,112,113,114,115,116,117,118,119. Interestingly, a large number of cell cycle/DNA quality control genes that accumulate in AD neurons are direct STRING interactors of genes identified in the female specific EAML module. Prompted by this, we investigated whether the EAML candidates in this module and their interactors could modulate neuronal dysfunction associated with AD. We tested the Drosophila homologs of the genes in this module and their interactors (Fig. 4B) for their ability to modify Aβ42 or tau-induced neurotoxicity using loss of function and overexpression alleles. Decreasing the expression of 3 female specific EAML candidates (ANLN, POLD1 and WDHD1) resulted in amelioration of Aβ42 and/or tau-induced neuronal dysfunction in vivo (Fig. 4C). Knockdown of the Drosophila homologs for three other genes in this module (BARD1, CCNA2 and CD2AP120) also modulated neuronal dysfunction in Drosophila. Among the cell cycle genes that accumulate in AD neurons, BRCA1 stood out in this module as it interacts with ANLN and BARD1, two genes whose modulation ameliorates neurodegeneration. Therefore, we tested the effect of knocking down the Drosophila homologs of BRCA1 o`n neurodegeneration. We found that reducing expression of BRCA1 Drosophila homologs significantly attenuated tau-induced neuronal deficits (Fig. 4C). Taken together, our results support a role for these cell-cycle and DNA quality control genes in AD pathogenesis, but more importantly they also suggest that these genes play a different role in female versus male, as they are more strongly associated to AD risk in female than male.

Fig. 4: Characterization of a cell-cycle/DNA repair-associated module enriched in female-specific EAML candidates.
figure 4

A Integration of five female-specific EAML candidates involved in cell-cycle/DNA repair with genes predominantly dysregulated in female AD brains and cell cycle genes known to accumulate in AD neurons. B positive correlation of gene expression and neuropathologic features for some of the genes in the module shown in (A). C graphs representing longitudinal analyses of neuronal dysfunction assessed as speed of the animals as a function of age (days) for the indicated alleles. Blue corresponds to negative (healthy) controls expressing a non-targeting hp-RNA. Purple shows the performance of β42/ non-targeting and grey the performance of tau/ non-targeting hp-RNA diseased animals. Green or red show the performance of animals carrying the allele indicated on top and either expressing β42 (green) or tau (red) paneuronally. Knockdown of the Drosophila homologs of three of the EAML female-specific candidates (ANLN, POLD1, WDHD1) results in amelioration of the neurodegenerative phenotypes. Knockdown of the Drosophila homolog of BRCA1 also ameliorates neuronal dysfunction (specific alleles used are indicated in Supplementary Data 2). Each graph shows the fit curve of the third-degree polynomial regression (dark line) and the confidence intervals for the same regression (shaded). All experiments shown are statistically significantly different p < 0.05 (exact p values are shown in Supplementary Data 2) when analyzed using non-linear random mixed effects model ANOVA. Four replicates of ten animals each per genotype were used for these experiments.

Discussion

Sex-specific differences in brain physiology and function are beginning to emerge and may underlie the differential predisposition to CNS diseases such as autism and depression between male and female46,121. Although sex differences greatly impact AD risk, the specific effect of sex has been largely ignored in the context of AD genetics28. In this study, we developed a methodology that specifically targeted the genetic factors that influence AD risk separately in males and females. We were able to achieve this goal by combining machine learning with evolutionary data in the form of EA scores that predict the effect of coding mutations on protein function. This EAML framework proved robust with sufficient predictive power at small sample sizes to examine samples of male and female separately. EAML identified numerous AD-associated genes not found in the combined sex cohort (Fig. 3B). Remarkably, the sex-separated analyses had better connectivity to known GWAS genes (z-scores  6.22 and  5.0 for male and female, respectively, Fig. 3C) than candidates derived from the combined cohort. This suggests that separating the cohort by sex produced a more sensitive analysis. A total of 50 EAML candidates overlapped between male and female (Fig. 3A, p 5 × 10−71). Interestingly, 21 of these were not identified by applying EAML to the combined cohort (Fig. 3A). Five of these 21 genes (PTPLA, ABI3, OR5AC2, MAPT, ECE2) have previously been associated to AD risk by other groups122,123,124,125,126, reinforcing that EAML is effective in identifying AD-associated genes and highlighting the increase in sensitivity upon sex separation. Importantly, EAML also uncovered sex-specific candidates associated with AD risk in female but not male and vice versa (Fig. 3A), suggesting that certain biological pathways may play a greater role in AD for one sex than the other. In line with this, we found that while neuroinflammation was the most enriched process among the EAML candidates identified in the combined cohort, the picture changed once we focused on sex specific EAML genes. We found genes associated with stress response and organelle biology were enriched in male specific EAML hits but found cell-cycle and DNA quality control/replication genes in female specific hits. Since cell cycle markers are known to accumulate in AD patients, we followed up on this set of genes. Five female specific EAML candidates (ESCO2, WDHD1, POLD1, SWSAP1 and ANLN) clustered together with genes whose expression is dysregulated predominantly in female AD brain. These genes correlated with neuropathological hallmarks of AD and modulated tau/Aβ42-induced neuronal dysfunction in vivo. Even though dysregulation of cell-cycle proteins and their abnormal aggregation in AD neurons has been previously reported109,110,111,112,113,114,115,116,117,118,119, our findings suggest that this pathway may contribute more to AD risk in females than males.

The above findings could have implications in how AD therapeutic strategies are developed and implemented. Of all the genes identified in this study, eleven have drugs that have been characterized as agonists or antagonists of their function (PTGDR2, SLC6A15, OPRD1, PSMF1, NQO1, GJA3, DDR1, TPO, PTGS1, POLD1, RET). Interestingly, of a total 97 compounds that target these genes, 40 have co-mentions with AD in PubMed. Of the 11 genes, GJA3 and DDR1 are EAML AD predictors in male only. Interestingly, the DDR1 inhibitor nilotinib is currently being investigated in a clinical trial for AD127 while a second DDR1 inhibitor (imatinib) has shown protective effects in mice128. On the other hand, four genes with known pharmacological agents (TPO, PTGS1, POLD1 and RET) are EAML AD predictors in female but not male. RET stands out in this group, as three of its inhibitors - sunitinib129, imatinib128 and regorafenib130 - have shown beneficial effects in mouse AD models. The data presented here supports the stratification of clinical trials based on sex and indicates that some strategies may be more effective in female or male due to sex specific genetic risk factors.

EAML also improved risk prediction ability. Early diagnosis of AD risk based on genomic profiles would be a significant advance as patients will likely require early intervention to avert dementia131,132. Current risk prediction for AD using Polygenic Risk Scores (PRS) ranges from 60133 to 80%134,135 accuracy. This needs to be improved before the results can be used clinically as many of the loci belong to non-coding regions. Recently, ML approaches have gotten attention for risk prediction in complex diseases136 due to their ability to evaluate non-linear genotype-phenotype associations and their interactive effects137. The combination of EAML candidates with their EA profile as features for ML resulted in great accuracy at predicting AD risk (AUC 0.825, Fig. 2C). Remarkably, the sex separated EAML candidates did not exhibit a decreased prediction accuracy even though we used half of the cohort (Fig. 3D). Furthermore, that accuracy was significantly improved in the case of male (AUC 0.878, Fig. 3D left), highlighting the importance of sex-separated genomic analysis in AD risk prediction approaches.

Of note, we did not perform hyperparameter tuning and optimization with each individual ML algorithm and instead we used the preset parameters. Tuning can improve ML prediction accuracy. However, if done inadequately, tuning can lead to overtraining with poor performance on test data different from the training sets. This general contest between optimization and overtraining currently pervades ML approaches. In order to add robustness to the ML predictions we took several steps. First, rather than optimizing each individual algorithm, we ranked the hit genes based on their reproducibility across classifiers, thus attenuating the risk of artefactual overfitting in one specific algorithm. Second, additional confidence in the hit genes came from performing ten-fold cross validation with each classifier. Importantly, we optimized the quality of the information we input to the off-the-shelf classifiers by incorporating functional mutational information. We reasoned these measures will decrease the likelihood of selective overtraining and instead increase our method’s likely range of validity to data other than those trained upon. Our confidence in the resulting hit genes is supported by multiple independent criteria for success that all show that our candidate genes are reliably linked to AD, experimental evidence showing candidate genes modulate neurodegeneration in a live animal model and the ability of the hit genes to stratify patients. These independent lines of evidence suggest that parameter tuning was not a significant issue in this study, and it is likely not weakening the findings. We speculate that the theoretical advantage of optimizing parameters may not warrant the risk of overfitting that this would entail. We will explore this possibility in future studies. Additionally, we did not take into account synonymous variants or those affecting non-coding DNA regions, epigenetic phenomena such as methylation and histone modification, and we also limited to European ancestry. These additional sources of data would be complementary to our findings since studies have shown that different genetic ancestries play important role in AD138,139 as well as non-coding DNA variations140 and epigenetic efects141. As more and larger cohorts become accessible, we will expand the EAML approach to additional ancestries, other interesting subgroup comparisons such as stratifying by comorbidities or presence of certain neuropathological features as well as hyperparameter tuning of each ML algorithm to improve performance without overfitting.

In summary this work indicates that application of machine learning approaches to WES and WGS increased our resolution for AD risk genes compared with current standard statistical methods (e.g., SKATO). These data offer a proof-of-concept for the combination of evolutionary information and phylogenetic speciation with case-control sequencing data to identify genetic mechanisms linked to complex diseases. Importantly, focusing the study on sex-specific sub-cohorts increased our resolution to identify disease-related genes. This emphasizes the need to systematically apply sex separation to disease-gene association analyses, beyond analyzing the combined cohorts. Our pipeline has identified a significant number of AD-associated genes with sex-sensitive pre-symptomatic diagnostic power and potential therapeutic value that we will follow up on in pre-clinical studies. As larger sequencing datasets become available, application of EAML will help generate a more complete picture of the different sex-specific mechanisms involved in AD and other polygenic disorders.

Methods

This study protocol (H-37394) was approved by the Institutional Review Board for Human Subject Research for Baylor College of Medicine and Affiliated Hospitals (BCM IRB). The information about informed consent from the study participants can be found in the study homepage (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000572.v7.p4#restricted-access-section).

Whole exome sequencing data

We obtained whole exome sequencing (WES) data from the Alzheimer’s Disease Sequencing Projects (ADSP) Discovery cohort from dbGaP (phs000572.v7.p4). In total, we analyzed 2729 Alzheimer’s Disease cases and 2441 healthy control white samples.

Quality controls (QC)

We carried out QC processes to identify potentially false-positive variants and outlier samples. We calculated the Ti/Tv and the number of variants in different variant classifications such as non-synonymous, synonymous as well as the total number of variants and singletons. HWE (Hardy Weinberg Equilibrium) exact test142 was performed on the control samples of each cohort. We counted the ratio of heterozygotes and homozygotes in the sex chromosome to compare to the self-reported sex. Then we filtered out non-European descendants. We used Annovar143 to annotate the consequences of variants. We focused on the non-synonymous single nucleotide variants (SNVs) and small indels, which lead to the loss of functions of genes, excluding CNVs (copy number variants). We used PCA (Principal Component Analysis) to cluster the genetic background and identify outliers within the cohort samples. We inferred the genetic relationships between each sample by estimating the kinship coefficients and IBD (Identical by Descent). We used BCFTOOLS144 and PLINK145 for variant filtering and statistics, KING146 for inferring relationships, and SMARTPCA from Eigenstrat package for PCA147.

GWA (genome wide association) analysis and power simulation

We performed variant-wide GWA for common variants (MAF ≥ 0.05) and gene-based association test for rare variants (MAF < 0.05). The APOEε4 variant (rs429358) yielded genome-wide significant p-values 10−20 without covariate APOEε4 status. With covariates sex, APOEε4 status, PC1 and 2, only TREM2 gene yielded a significance (p < 5 × 10−6) in ADSP. In the variant levels of TREM2 gene, we found 50 alleles in AD cases versus 9 in controls (Fisher’s Exact p = 2.1 × 10−7) for the previously known missense variant p.R47H148,149.

We performed association analyses for non-synonymous variants, which were used in the EAML analyses. For a common variant (≥ minor allele frequency 0.05), we carried out variant-wide association test including the covariates - sex, age, APOEε4 allele counts, and the first and second principal components from PCA to correct the effects of sex, age, ethnic differences and APOEε4 status using the logistic-regression function in PLINK software. We used the SKAT-O test88 in EAPCTS analysis package (https://genome.sph.umich.edu/wiki/EPACTS) to test gene-wide associations for burdens of rare variants (minor allele frequencies <0.05). The GWAS power calculated on GAS power calculator (http://csg.sph.umich.edu/abecasis/cats/gas_power_calculator/) assuming Significance Level: 5E-6, Prevalence: 0.01, Disease Allele Frequency: 0.05, Genotype Relative Risk: 1.6.

EAML framework

The first step measures the impact of the amino acid on overall fitness. Rather than attempt to compute this effect stepwise on successive but poorly characterized features, such as protein folding, dynamics, expression, translation, and myriad interactions among diverse components of multiple pathways, we modeled all these perturbations in aggregate through a formal evolutionary fitness (potential) function f that maps genotypes γ to points in the fitness landscape φ. Assuming, as a hypothesis, that f exists and is differentiable, a single coding perturbation then has an impact given by:

$$\nabla f\cdot d\gamma=d\varphi$$
(1)

where, as shown before47, f is the gradient of f approximated with the Evolutionary Trace algorithm150, and is the magnitude of a single missense substitution approximated with amino acid substitution log-odds.

Gene-level metric

The underlying basis of our approach rests on the ability to amalgamate all variant level effects into one metric at the gene level. To do this, we developed a novel scoring system which uses EA at its core, dubbed EA probability or (pEA). This is defined as:

$$\left\{\begin{array}{c}0\,{iff}\,{silent}\,{mutation}\\ {x}_{{EA}\, > \,C}^{i}=1-{\prod }_{j=1}^{k}{\left(1-{{EA}}_{j}/100\right)}^{{zygo}}\,\forall {EA} \, > \,C\\ 1,\,{iff}\,{stop}\,{or}\,{indels}\,{mutations}\end{array}\right.$$
(2)

where k is the total number of variants in an individual, and EAj is the Evolutionary Action scores from (1) of a given variant. This allows us to estimate the complete mutational effect of all variants within a given gene for a given individual.

Feature development and design matrix architecture

By presuming no information about a gene beforehand we allow the possibility that any given gene can affect the phenotype in either an autosomal dominant or recessive manner. We also presume that different levels of EA correspond to different magnitudes of fitness effects. These levels are as follows: EA > 1, EA > 30, EA > 70. In doing this, we can separately address the impact of variants above a given EA threshold. To develop features on which our framework will learn, we combine these different levels and manners of inheritance into six features of: Autosomal Dominant & EA > 1, Autosomal Dominant & EA > 30, Autosomal Dominant & EA > 70, Autosomal Recessive & EA > 1, Autosomal Recessive & EA > 30, Autosomal Recessive & EA > 70. We define autosomal dominant features to include all variants of a gene which are either heterozygous or homozygous, while recessive features only consider variants that are homozygous. This allows us to strictly address recessive features with only homozygous variants, while still considering all variant effects in the dominant features. Finally, these features are aggregated into a design matrix of the architecture n x p where n is number of samples and p is (6 features × number of genes). The genes tested come from the canonical RefSeq gene set.

Sub-setting of design matrix for 10-fold cross validation

To avoid overfitting and determine key driver genes with the highest prediction accuracy, we used a 10 folds cross validation (10-CV) method. The cohort was separated into 90% training data to fit the classification models and 10% testing data to validate the predictions. This process was repeated 10 times, shuffling the training and testing data sets.

Machine learning architecture

Our learning architecture consists of 9 different classifiers. Exploiting the uncertainty in linear and nonlinear genotype-to-phenotype relationships, these classifiers include Association Rules (PART54, JRip55), Function Optimizations (Multi-layer Perceptron56, Naive Bayes57, Logistic Regressions58, and Nearest Neighbors59), Decision Trees (Random Forest60 and J4861), and Meta Classifiers (Adaboost)62. This architecture was implemented in Weka (https://www.cs.waikato.ac.nz/ml/weka/). In MultiLayerPerceptron classifier, we used backpropagation with learning rate of 0.3, momentum of 0.2 and 4 hidden layers. In PART classifier, the confidence threshold for pruning is set to 0.25 and the minimum 5 objects per leaf is used. In Random Forest classifier we used 10 trees. In JRip classifier we used 3 folds, where one fold is used as pruning set, minimal weights of instances within a split is set to 2, with 2 runs of optimization. In J48 classifier, confidence threshold for pruning is set to 0.25 with the minimum of 2 instances per leaf. In Naïve Bayes classifier we used kernel density estimator. In Logistic Regression classifier we used Ridge for the regularization. In KNN classifier we set k = 3, using Euclidean distance with linear nearest neighbor algorithm. In Adaboost classifier, we used decision stump as the base learner.

Evaluating independent gene-level associations

Notably, our framework allows us to evaluate each gene in the context of the phenotype independently from effects of other genes. To evaluate the predictive power of an individual gene, we use a Matthew’s Correlation Coefficient (MCC) defined as:

$${MCC}=\frac{{TP}\,x\,{TN}-{FP}\,x\,{FN}}{\sqrt{({TP}+{FP})({TP}+{FN})({TN}+{FP})({TN}+{FN})}}$$
(3)

This measurement is utilized due to its robustness when considering imbalanced class sizes.

Prioritization of top EAML predictions for further analyses

To prioritize EAML prediction genes for further analyses, we selected genes with an FDR corrected p-value less than 0.01, which is derived from the MCC distribution. These thresholds resulted in 98, 157, 127 genes from male/female full cohort, male, and female, respectively (Supplementary Data 1).

Risk prediction (classifier)

EAML successfully identifies several genes which have been demonstrated to be quantitatively and biologically related to AD pathophysiology. However, EAML identifies these genes in the context of their individual and independent effects on disease. In order to understand the combined effect of the EAML predicted genes on disease status, we designed an experiment to test predictive power of these genes. In other words, can we combine the genes, identified for their individual influence in distinguishing affected versus healthy, to create a powerful predictor of disease status? We further extend this question by asking if we can create predictive models that are sex specific. In all three cases (cohort wide, male specific, female specific) the predictor would stratify an individual into a phenotypic class (healthy or affected) by learning some arbitrary function over the combined mutational profiles of the individual’s genes, specifically the EAML predicted genes. In order to determine whether the predictive power of the EAML genes is superior to traditional predictive factors, a second predictive model was developed based on the APOE genotype status, which is independently genotyped, and normalized age of onset of an AD individual.

In order to build these predictive models, input data must first be reformatted. For predictive models based on EAML genes, this reformatting was done as follows. Consider any one sample n out of N samples and h EAML predicted genes, for the purpose of a predictive model this sample will be represented as a vector of h elements, where each element is given by the pEA of particular EAML gene: n = [pEA1, pEA2,...,pEAh] where pEAi is the pEA value of \({gen}{e}_{i}\). Each sample n is therefore represented as a vector of dimensions 1xh. For N samples in a cohort, we then have an input matrix of size Nxh. Similar processing is done for all male specific, female specific, and cohort level predictive models.

Similar reformatting was done for predictive models based on APOE genotype status and the age of onset, which were also used in the complementary GWAS (see Methods 3). Each individual in the cohort was represented as a 1 × 2 vector where the first element corresponds to APOE variant status, and the second element of the vector is the age of onset. APOE variant status of an individual was represented as the pEA value for the individuals APOE variants. For example, if an individual had APOE3/APOE3 (pEA = 0) variant and age of onset was 64, they were represented as n = [0,64]. For N samples in a cohort, we then have an input matrix of size Nx2. After the cohort level matrix is calculated, normalization is done for age of onset features by subtracting the mean and dividing by standard deviation so that age of onset feature is on a similar scale as pEA features, this is a commonly used practice in machine learning to standardize and scale input data. Similar preprocessing was performed for all male specific, female specific, and cohort level predictive models.

In order to fairly compare and contrast the predictive models, the same machine learning architecture was used for all six predictors. This architecture was an ensemble learner composed of three different classifiers: random forest, logistic regression, and adaboost. Ensemble learning, in particular stacking, was used because stacked ensemble learning models have been shown to be a simple yet effective method for improving predictive accuracy compared to individual classifiers used separately151,152,153. Random forest, logistic regression, and adaboost were chosen as the ensemble classifiers due to two main characteristics. First, out of the nine classifiers used in the EAML architecture, these three demonstrated consistent and high MCC predictions. Second, these three classifiers were empirically identified to train and converge the fastest out of the nine EAML classifiers. Finally, in order to learn how to combine predictions from the three classifiers into one final prediction for a sample, we use a decision tree-based algorithm (Hoeffding tree) to train the individual classifiers together. Here, the Hoeffding tree serves the purpose of computing how much weight to give a specific classifier when making the final prediction, e.g., 30% of the final decision may be based on logistic regression prediction, 30% from adaboost prediction, and 40% from random forest prediction. All classifiers were trained using default hyperparameters and regularizers. The split decision was set to 1e-6, the minimum fraction of weight for info gain splitting was fixed to 0.01 and finally two grace periods of resp. 200 and 300 were considered. Implementation of the complete architecture and learning schemes, like EAML, were done using WEKA package for machine learning.

Complete training of the architecture is done through a k-fold cross validation approach. In this approach, the cohort is split into k equal sub-cohorts, and one of the k sub-cohorts is set aside for testing while the other k−1 sub-cohorts are used for training. This procedure is repeated k times such that each sample in the cohort is both trained on and tested on independently. This is a commonly used approach in machine learning as it is a particularly good preventative measure against model overfitting. For our implementation, we use k = 10 folds. Predictive power of the models is measured as area under curve (AUC) for receiver operator characteristic. AUCs are reported per test set of each of the k-folds. Finally, to compare between model performance using EAML predicted genes and APOE variant + age, t-tests were performed between the distribution of AUCs seen in both types of models and the corresponding p-values are reported.

Network analyses

We performed graph-based diffusion (GID) method64,65,66 in order to evaluate how well the 98 EA-ML predictions are connected to manually curated AD gold standard genes 25 GWAS genes and to dementia-related disorders in biological networks.

STRING network version 11.0 was downloaded from http://version11.string-db.org. We used the combined score that cover evidence from all sources. The network contains 19,247 genes and gene products, in which 98 EAML genes are present. Graph-based information diffusion (GID) method64,65,66 was applied to measure how well two groups of genes are connected to each other. Through GID, functional information was propagated from genes of interest to all genes in the network through their connections. Genes receiving significantly more diffusion signals than random genes are more connected and, thus functionally related to the original genes. Signals were diffused from one group or their comparative random sets to another group. Random genes were also selected from other genes in the network and had similar degrees of connectivity with predicted genes that initiate signals. We validated whether the AD known genes receive significantly more diffusion signals from the predicted AD genes than other genes in the network, and vice versa through area under the curve (AUC) for receiver operating characteristic (ROC). For a diffusion experiment of each pair of predicted groups, random was performed 100 times to obtain a distribution of random AUCs, which was tested for normality. Z-score, which is the number of standard deviations from the random mean, was computed for the experimental AUC based on the distribution of the random AUCs.

A literature network, called MeTeOR71, was used to explore potential literature relatedness of the predicted genes with dementia-related disorders. MeTeOR aggregates publication co-occurrences of Medical Subject Headings (MeSH) terms, which are manually curated by PubMed to annotate key topics, genes, diseases, and chemicals of given articles. Biological entities that are co-mentioned together in publications are more likely to be functionally related. MeTeOR has previously been utilized to explore known and novel meaningful biological associations70,71. We performed the GID method to evaluate whether the predicted 98 genes are well connected to dementia-related disorders in the MeTeOR literature network and thus, are likely to be involved in dementia pathology. Dementia-related disorders were selected from disease terms annotated in MeTeOR that consist of “dementia” and “Alzheimer”. This approach yielded 6 terms for dementia-related disorders Table 1). We compared the diffusion signals that each of the 98 genes received from the 6 dementia-related disorders against random genes that matched the connectivity degrees with the genes in the MeTeOR network. We also evaluated diffusion signals for each dementia disorders from the 98 genes against random. We computed z-scores to compare diffusion signals of the predicted genes and dementia disorders against random. Z-scores above 2.5 were considered significant.

For the coexpression analysis of single cell RNAseq, cell type specific WGCNA networks were obtained from154. Using Cytoscape, we identified the primary degree coexpressed nodes for the EAML genes and built coexpression communities using the HiDef-Louvain algorithm tool in the Community Detection extension. We obtained ~100 communities for each cell type, and then run the functional enrichment tool in the Community Detection extension to explore functional overlap in gProfiler, enrichR and iQuery databases applying an FDR q < 0.05.

For the coexpression analysis of the AD-specific coexpression networks, we used the AD- coexpression networks built in76 and used the first-degree nodes between genes as edges to map the EAML genes into the different functional modules. Enrichment was calculated using Fisher’s test and considered significant if p < 0.05.

Experimental integration

Drosophila strains and motor performance assay

Genetics and strains: the Drosophila lines used to drive expression of either wild-type human 2N4R tau (UAS-Tau) or secreted β42 (UAS−Aos:β42) were previously reported82,83 and are available from the Bloomington Drosophila Stock Center (BDSC), University of Indiana. For pan-neuronal expression we used the elav-GAL4C155 driver obtained also from BDSC. The alleles and shRNAs tested as candidate modifiers were obtained from the BDSC or from the Vienna Drosophila Resource Center (VDRC).

Motor performance assays

To assess motor performance of fruit flies as a function of age, we used ten age-matched females per replica per genotype as previously described84. Flies are collected in a 24 h period and transferred into a new vial containing 300 μl of media every day. Tau animals were kept at 23 C while β42 experiments were maintained at 28 C. Four replicates were used per genotype. Using an automated platform, the animals are taped to the bottom of a plastic vial and recorded for 7.5 s as they climb back up on the walls of the vials. Videos are analyzed using custom software to assess the speed of each individual animal. Four trials per replicate are performed each day shown, and four replicates per genotype are used. Using the average performance of all 10 animals in each replicate and 4 replicates per genotype, a nonlinear random mixed effect model ANOVA155 was applied to the average using each four replicates to establish statistical significance across genotypes. Specifically, we looked at differences in regression between genotypes (genotype p value) and also between genotypes with time (additive effect, represented by a shift in the curve, genotype+time p value). P-values were adjusted for multiplicity using Holm’s procedure. Code for this analysis is available upon request from the Botas Laboratory. All graphing and statistical analyses were performed in R. The non-targeting shRNA line V2691 from the VDRC was used to generate negative controls (Elav-GAL4/UAS-V2691) to establish the healthy baseline motor performance and disease controls (either Elav-GAL4/UAS-Tau/UAS-V2691 or Elav-GAL4/UAS-Aos:β42/UAS-V2691) for the disease baseline.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.