Novel Alzheimer’s disease genes and epistasis identified using machine learning GWAS platform

Alzheimer’s disease (AD) is a complex genetic disease, and variants identified through genome-wide association studies (GWAS) explain only part of its heritability. Epistasis has been proposed as a major contributor to this ‘missing heritability’; however, many current methods are limited to modelling additive effects. We use VariantSpark, a machine learning approach to GWAS, and BitEpi, a tool for epistasis detection, to identify AD-associated variants and interactions across two independent cohorts, ADNI and UK Biobank. By incorporating significant epistatic interactions, we captured 10.41% more phenotypic variance than logistic regression (LR). We validate the well-established AD locus, APOE, and identify two novel genome-wide significant AD-associated loci in both cohorts, SH3BP4 and SASH1, which are also in significant epistatic interactions with APOE. We show that the SH3BP4 SNP has a modulating effect on the known pathogenic APOE SNP, demonstrating a possible protective mechanism against AD. SASH1 is involved in a triplet interaction with the pathogenic APOE SNP and ACOT11, where the SASH1 SNP lowered the pathogenic interaction effect between ACOT11 and APOE. Finally, we demonstrate that VariantSpark detects disease associations with 80% fewer controls than LR, unlocking discoveries in well-annotated but smaller cohorts.

rate (FDR) method 22, we are able to use VariantSpark's random-forest-based feature selection approach to narrow down the genome-wide search space to the subset of variants enriched with epistatic interactions. We then apply BitEpi 23 to perform an exhaustive search of this subset to annotate pairwise and higher-order, statistically significant interactions between the variants. We also explore the proportion of phenotypic variance captured by VariantSpark versus the traditional logistic regression (LR) methods. Finally, we demonstrate that VariantSpark has improved sensitivity to detect signal with fewer control samples compared with LR approaches.

VariantSpark identifies known AD loci across two independent cohorts
Using the ML genomics platform VariantSpark 19 and a novel RFlocalfdr approach 22, we identified genetic variants that are both marginally and interactively associated in two independent AD cohorts, UKBB and ADNI (7,573 and 784 samples of 4.5M SNPs each). Because VariantSpark captures both types of association, we expect to find more significant variants than an LR approach at < 5% FDR.
As expected, the APOE locus was identified in both the ADNI and UKBB cohorts (Supplementary Tables 1 and 2). To evaluate the functional context of the other significantly associated independent variants, we performed functional enrichment analysis using MAGMA. Gene-set analysis (Supplementary Table S4) identified 9 (ADNI) and 3 (UKBB) significantly associated gene sets (after Bonferroni correction). Many of the significant gene sets, and those with suggestive significance levels (P < 0.05), fell into the categories of transmembrane and metal ion transport proteins (known to be key in neuronal signalling in the brain). Tissue expression analysis using MAGMA and GTEx (Supplementary Table S5, Supplementary Table S6) ranked brain tissues most highly, although they did not pass Bonferroni correction.

VariantSpark identifies novel loci associated with AD
We next investigated which loci replicated between the two independent cohorts. Despite the phenotypic heterogeneity across the two cohorts, we replicated three independent, significantly associated genes, APOE (rs429358), SASH1, and SH3BP4 (Table 1 and Supplementary Table S2). It is important to note that the significance threshold for RFlocalfdr is 0.05, compared with the traditional genome-wide significance threshold of P < 5 × 10⁻⁸, which needs to correct for multiple tests. Both thresholds, RFlocalfdr for VariantSpark and P < 5 × 10⁻⁸ for logistic regression, control for Type 1 error and correct for the multiple testing burden. For further information, see the Methods section.
SASH1 (SAM and SH3 domain-containing 1) encodes a scaffold protein that is ubiquitously expressed, including in brain tissues, and is also a positive regulator of the NF-κB signalling pathway through the activation of TLR4 27. SH3BP4 (SH3 domain binding protein 4) encodes a protein involved in the amino acid-induced TOR signalling pathway 28. Both SASH1 and SH3BP4 are membrane-bound phosphoproteins with SH3 domains.
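To make the contrast between the two kinds of threshold concrete, the sketch below computes a Bonferroni-style cut-off and applies a generic false discovery rate procedure. It rests on two assumptions that are not taken from this study: the conventional justification of 5 × 10⁻⁸ as 0.05 divided by roughly one million independent tests, and the Benjamini-Hochberg step-up procedure as a stand-in for FDR control. RFlocalfdr is a different, local-FDR method applied to random forest importance scores and is not reproduced here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure.

    Illustrative stand-in for FDR control; this is NOT RFlocalfdr.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose P value falls under its own threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Assumed figure: about one million independent common-variant tests.
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical P values, for illustration only.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(toy_p, fdr=0.05)
```

The point of the sketch is only that a 0.05 cut-off on an FDR scale and a 5 × 10⁻⁸ cut-off on a per-test P value scale are not directly comparable numbers, even though both account for multiple testing.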

BitEpi identifies novel interactions between known and novel AD genes
BitEpi was used to identify epistatic interactions between significantly associated variants in both cohorts. The β and α metrics, reflecting association power and interaction effect respectively, were used to select interactions that were strongly associated with the AD phenotype due to an epistatic effect. We identified 37 interactions with significant β and α values in the UKBB cohort, of which 17 were 2-SNP, 16 were 3-SNP, and 4 were 4-SNP interactions (Fig. 2, Supplementary Table S7). Using the ADNI cohort, we identified 58 interactions with significant β and α values, of which 39 were 2-SNP, 17 were 3-SNP, and 2 were 4-SNP interactions (Fig. 3, Supplementary Table S8). Interestingly, the two replicating AD-associated genes, SASH1 and SH3BP4, were involved in epistatic interactions.
In the UKBB cohort, the SNP (rs114656810) mapping to SH3BP4 was found to interact with rs429358, which is a reported pathogenic APOE SNP in ClinVar 29, where the alternate 'C' allele plays a part in the high AD-risk APOE-ε4 isoform. This pairwise interaction was interrogated to identify the genotype combinations associated with AD (Supplementary Table S9). Due to the low number of samples with the homozygous alternate genotype (AA) of the SH3BP4 SNP, we reduced the genotypes to two classes: presence or absence of the alternate 'A' allele. In the absence of the alternate SH3BP4 SNP allele, there was no absolute difference in control rates between the SH3BP4xAPOE interaction and the APOE SNP alone (Fig. 4A). This indicates a limited effect of the homozygous reference genotype of rs114656810 on AD. However, in the presence of the alternate allele of the SH3BP4 SNP, the pathogenic effect of the APOE C allele is modulated (Fig. 4A), suggesting that SH3BP4 may have a protective mechanism against AD for carriers of the APOE 'CC' genotype. In the ADNI cohort, this pairwise interaction between SH3BP4 and APOE was marginally significant but did not pass Bonferroni correction.
In the ADNI cohort, the SNP rs9918382 mapping to SASH1 was involved in a triplet interaction with the same pathogenic APOE SNP, rs429358. The other SNP in the triplet, rs7552961, maps to ACOT11 and has been shown to be associated with mild cognitive decline 30. This triplet interaction was also examined further (Supplementary Table S10). Again, due to the low number of samples with the homozygous alternate genotype of rs9918382 (n = 15), the genotype was reduced to two classes: presence or absence of the alternate 'G' allele. Figure 4B shows that the alternate 'G' allele of the SASH1 SNP has a protective effect, reversing the pathogenic interaction effect of the rs7552961 (ACOT11) TT genotype and rs429358 (APOE) TC genotype and increasing the relative control rate from -0.139 to 0.028 (Supplementary Table S10). However, when the alternate ACOT11 allele (G) is present with the APOE CC genotype, the SASH1 SNP has no effect. In fact, none of the possible pairwise interactions between these three genotypes passed significance for the α metric, which suggests that the association with AD was carried by the interaction of all three SNPs. This highlights the complexity and difficulty of detecting epistatic interactions, where exacerbating or protective properties are exerted through specific combinations of genotypes.
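The relative control rate used to interrogate these interactions is the control rate within a genotype combination minus the control rate of the whole cohort (the definition given in the Figure 4 caption). A minimal Python sketch of that calculation follows. The function names and the toy data are illustrative assumptions, not the study's code or data.

```python
from collections import defaultdict


def carries_allele(genotype, allele):
    """Collapse a genotype such as 'GA' to presence/absence of one allele."""
    return allele in genotype


def relative_control_rates(combos, is_control):
    """Control rate per genotype combination minus the cohort control rate."""
    if not combos or len(combos) != len(is_control):
        raise ValueError("need one control flag per sample")
    cohort_rate = sum(is_control) / len(is_control)
    totals = defaultdict(int)
    controls = defaultdict(int)
    for combo, ctrl in zip(combos, is_control):
        totals[combo] += 1
        controls[combo] += int(ctrl)
    return {c: controls[c] / totals[c] - cohort_rate for c in totals}


# Toy cohort (hypothetical): (APOE genotype, carries alternate allele of a second SNP)
toy_combos = [
    ("CC", False), ("CC", False), ("CC", True), ("CC", True),
    ("TT", False), ("TT", False), ("TT", True), ("TT", True),
]
toy_is_control = [False, False, True, False, True, True, True, False]

toy_rates = relative_control_rates(toy_combos, toy_is_control)
```

A positive value means the genotype combination is enriched for controls relative to the cohort (a protective direction), and a negative value means it is enriched for cases.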

VariantSpark can detect more disease associated signal than logistic regression
Next, we compared VariantSpark with the more traditional GWAS approach implemented in PLINK's logistic regression (LR) to estimate the power to detect disease-associated signal with limited control samples. To do this, in addition to using the ADNI cohort, we subset two datasets from the UKBB cohort: the first contained a ratio of 10 controls to 1 case (UKBB10to1) and the second 2 controls to 1 case (UKBB2to1).
In contrast, VariantSpark identified associations outside of the APOE region, such as rs79486209 on chromosome 10, which mapped to PLPP4, a gene previously associated with AD 31. VariantSpark identified 53 significantly associated independent SNPs (104 in total) in UKBB10to1 (Table 1) and 20 significantly associated independent SNPs (69 in total) in UKBB2to1 (Supplementary Table S12).
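Building a dataset with a fixed control-to-case ratio can be sketched as below. This assumes controls are drawn uniformly at random without replacement, which the text does not state; the function name, the seed, and the sample identifiers are hypothetical.

```python
import random


def subsample_controls(case_ids, control_ids, controls_per_case, seed=0):
    """Pick controls_per_case controls for every case, without replacement."""
    n_wanted = controls_per_case * len(case_ids)
    if n_wanted > len(control_ids):
        raise ValueError("not enough controls for the requested ratio")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(list(control_ids), n_wanted))


# Hypothetical identifiers, for illustration only.
toy_cases = [f"case_{i}" for i in range(10)]
toy_controls = [f"ctrl_{i}" for i in range(500)]

ten_to_one = subsample_controls(toy_cases, toy_controls, controls_per_case=10)
two_to_one = subsample_controls(toy_cases, toy_controls, controls_per_case=2)
```

Keeping the cases fixed and varying only the number of controls isolates the effect of control-sample size on the signal each method can detect.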

VariantSpark captures more phenotypic variance in AD than Logistic Regression
A key goal of this study was to explore whether epistasis can explain some of the missing heritability that is well documented in AD 2-4. To this end, we measured the proportion of phenotypic variance captured by genetic variants identified in the UKBB cohort using Nagelkerke's pseudo-R² and fitting three LR models with: (1) significant and independent SNPs identified by LR (n = 3); (2) significant and independent SNPs identified by VariantSpark (n = 53); and (3) significant and independent SNPs identified by VariantSpark together with significant interactions identified by BitEpi (n = 122).
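For reference, Nagelkerke's pseudo-R² rescales the Cox-Snell R² by its maximum attainable value so that a perfect model scores 1. The sketch below computes it from observed case/control labels and the fitted probabilities of any logistic model, using the intercept-only model (the cohort prevalence) as the null. The function names and the toy inputs are illustrative assumptions; fitting the LR models themselves is outside the sketch.

```python
import math


def log_likelihood(y, probs, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 labels y under predicted probabilities."""
    total = 0.0
    for yi, pi in zip(y, probs):
        # Clamp to avoid log(0) for extreme predictions.
        total += math.log(max(pi, eps)) if yi else math.log(max(1.0 - pi, eps))
    return total


def nagelkerke_r2(y, fitted_probs):
    """Nagelkerke's pseudo-R² relative to an intercept-only null model."""
    n = len(y)
    prevalence = sum(y) / n
    if prevalence in (0.0, 1.0):
        raise ValueError("labels must contain both cases and controls")
    ll_null = log_likelihood(y, [prevalence] * n)
    ll_model = log_likelihood(y, fitted_probs)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Toy labels and fitted probabilities (hypothetical).
toy_y = [1, 0, 1, 0]
r2_null = nagelkerke_r2(toy_y, [0.5, 0.5, 0.5, 0.5])
r2_some = nagelkerke_r2(toy_y, [0.8, 0.2, 0.7, 0.3])
r2_perfect = nagelkerke_r2(toy_y, [1.0, 0.0, 1.0, 0.0])
```

A model that predicts only the prevalence scores 0 and a model that separates cases from controls perfectly scores 1, so values from models with different numbers of predictors sit on the same scale.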
Within the UKBB cohort, the VariantSpark-BitEpi model (model (3)) captured the most variance at 23.18%, compared with 17.12% for model (2), without the BitEpi interactions, and 12.77% for model (1), the LR SNPs (Supplementary Fig. S2). To test whether the performance increase of the VariantSpark-BitEpi model was driven merely by its additional variables, we calculated an empirical P value. We fitted 1000 models containing the 3 LR SNPs  2)). In contrast, VariantSpark-BitEpi's model had a small but significant (p = 0.006) performance improvement over the random models (23.18% vs 19.33%), confirming that additional signal was captured. We make a similar observation for these models when tested on the independent ADNI cohort. LR (model 1) captured 7.09%, while the random models captured 25% on average and VariantSpark-BitEpi (model 3) achieved 27.20%. The increase in variance explained on the ADNI set is likely due to a more easily detected signal, which is predominantly driven by APOE (as observed in Section C).
These findings indicate that VariantSpark-identified SNPs and BitEpi-identified epistatic interactions together explain up to 10.41% more phenotypic variance in AD (23.18% versus 12.77% in the UKBB cohort) than traditional LR approaches that focus only on marginal effects. This also aligns with previous studies where the addition of 87 marginal effect SNPs (without APOE) explained only 2.1% more variance 32 and 2,042,105 SNPs (without known AD SNPs) accounted for 25.3% variance 3. Taken together, these results suggest that epistatic interactions across the genome play a part in AD aetiology and should be accounted for when developing therapeutics and genetic risk scores.
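An empirical P value of this kind compares the observed variance explained with the distribution obtained from the random models. The sketch below shows one common estimator; the add-one correction, which prevents a P value of exactly zero, is an assumption and may differ from the calculation used in the study. The numbers in the example are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Share of null models that perform at least as well as the observed one.

    Uses the add-one estimator (r + 1) / (n + 1); this choice is an assumption.
    """
    if not null_values:
        raise ValueError("need at least one null model")
    n_as_good = sum(1 for value in null_values if value >= observed)
    return (n_as_good + 1) / (len(null_values) + 1)


# Hypothetical null distribution: 1000 random-model scores, 5 of which
# match or beat the observed score.
toy_null = [0.19] * 995 + [0.24] * 5
toy_p = empirical_p_value(0.23, toy_null)
```

Because the random models contain as many variables as the model under test, a small empirical P value indicates that the gain in variance explained is not simply a consequence of model size.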

Transcriptome-wide association (TWAS) lookup of SASH1 and SH3BP4
Finally, we looked at transcriptome-level information for the mapped genes SASH1 and SH3BP4, as previous studies 33,34 have shown that this can add confidence that GWAS-identified genes are capturing actual disease-related signal. Using the TWAS-hub 35, SASH1 showed strong evidence (ENET-P = 7.5 × 10⁻⁹) of involvement in the prefrontal cortex tissue and a strong association with "Alzheimer's Disease (in father)" (Supplementary Table S13). In contrast, SH3BP4 showed an association with nerve tibial tissue at non-suggestive levels for Alzheimer's Disease (Supplementary Table S14). Another resource used was the gene expression tests built into FUMA 36, which use GTEx v8 37 data. In this analysis, both SASH1 and SH3BP4 showed increased expression levels in brain tissue (Supplementary Fig. S3).

Discussion
Using VariantSpark, an ML approach to GWAS, we have identified two novel genes, SASH1 and SH3BP4, that are associated with AD at genome-wide significance. SASH1 is a known tumour suppressor protein that has been shown to be differentially expressed between AD and control samples 38,39. Furthermore, a previous study found SNP rs9390537 (located 91,233 bp upstream of SASH1) to be nominally associated with LOAD (χ² P = 8.17 × 10⁻⁶) 25. Indeed, SASH1 is a nominated AD drug target in the Agora database, a database curated by AD researchers from the Accelerating Medicines Partnership-Alzheimer's Disease consortium and other research teams.
SH3BP4, or transferrin trafficking protein (TTP), interacts with endocytic proteins including clathrin, dynamin, and the transferrin receptor 40 and is involved in the amino acid-Rag GTPase-mTORC1 signalling pathway. It is a central link between Akt signalling and cell-matrix adhesion regulation 28. Although SH3BP4 has no established link to AD, a SNP (rs66501349, intergenic to SH3BP4 and CEP19P1) has been marginally associated with poorer cognitive function (χ² P = 2 × 10⁻⁶) 26, and its interactor dynamin has strong evidence of a role in AD pathophysiology 41,42. In particular, the expression of the gene DNM2 was significantly decreased in AD patients, and neuronal cell lines transfected with dominant negative DNM genes were observed to have an accumulation of APP and increased Aβ secretion 43.
The key contribution of our work is adding the lens of epistasis to association. We identified a total of 95 epistatic interactions, including 2-SNP, 3-SNP and 4-SNP interactions associated with AD, in two independent cohorts. This elevated the previously only nominally associated SASH1 25 to pass FDR significance when its interaction with ACOT11 and APOE is accounted for. Specifically, our epistasis analysis revealed that the alternate 'G' allele of SASH1 SNP rs9918382 appears to have a protective effect against AD, as it reverses the pathogenic effect of the ACOT11 rs7552961 'TT' and APOE rs429358 'TC' genotype combination (Supplementary Fig. S3). However, this modulating effect was not found in the presence of two copies of the pathogenic APOE 'C' allele (rs429358, Supplementary Fig. S3). This result is consistent with co-expression patterns found between AD and control brains 44 and the high expression levels of SASH1 in prefrontal cortex tissue in the TWAS-hub. Taken together, it is likely that SASH1 plays a role in AD pathophysiology and warrants further investigation.
Although most of our identified epistasis is concentrated between APOE and a small number of other loci, our methodology can explore genome-wide epistasis in an unbiased manner, unlike previous studies 45,46. Additionally, a genome-wide search allows for the identification of epistasis in non-coding regions of the genome, which have been empirically demonstrated to affect gene expression 47.
For example, our epistasis analysis revealed a modulating effect of the alternate allele of SNP rs119656810 (SH3BP4) on the APOE locus. A possible explanation for this effect is that SH3BP4 has the ability to regulate the activity of dynamin 40, whereby it enables the processing of amyloid β protein precursors, resulting in lower levels of Aβ deposition and AD pathology. Together, these observations suggest that SH3BP4 is a novel gene that may play a role in AD pathophysiology through its pathway mechanisms and in combination with APOE.
While VariantSpark identified SH3BP4 and SASH1 in both cohorts due to their cumulative additive and epistatic effects on AD, the exact epistatic interactions they are involved in were not replicated, although SH3BP4-APOE showed marginal significance. This is likely due to the varying number of individuals who might have this exact modulating disease physiology and genotype combination across the two cohorts. This illustrates the benefits of using VariantSpark instead of traditional LR models on binary traits with potential polygenic interactions, like Alzheimer's disease.
Using VariantSpark, we were also able to detect disease genes with fewer controls than traditional approaches. This is relevant as a recent study calculates that 10,000,000 cases would be needed for a traditional GWAS to find significant SNPs explaining 50% of Alzheimer's disease heritability 48. Even for large initiatives such as FinnGen or 23andMe, such numbers are hard to achieve. Our method offers an alternative and enables discoveries in smaller but well-annotated cohorts for AD and other genetic studies.
The limitations of our study are as follows. Firstly, ADNI used whole genome sequencing mapped to the GRCh38 reference genome, while the UKBB used array technology mapped to the GRCh37 reference genome, resulting in a final set of 4.5 million common SNPs, which was around 50% of the total number of SNPs for both cohorts. Secondly, the ADNI and UKBB cohorts differ in ascertainment. In particular, UKBB is a relatively healthy volunteer cohort and contained a mix of AD phenotypes, while ADNI recruited patients based on their health status and included samples with mild cognitive impairment to maximise sample size, although this is only an AD-proxy phenotype. Lastly, the ADNI cohort was substantially smaller than the UKBB cohort, with 784 samples compared to 7582 samples. These three factors in combination limit our power to discover and replicate disease variants and epistatic interactions across cohorts. Furthermore, we restricted our samples in both cohorts to those of European descent, as is commonplace 49. However, it has been shown that ethnicity plays a crucial role in AD aetiology 50,51, and more diverse genomic datasets are needed to gain better unbiased insights 52.
In conclusion, we have established an ML approach for detecting genetic signals associated with disease, which goes some way towards explaining the missing heritability observed in previous literature.

Sample selection
Data for AD were obtained from two sources: ADNI and UKBB. ADNI aimed to test combinations of imaging and biological markers to measure the progression of AD and mild cognitive impairment (MCI). For this study, cases were samples labelled as early and late MCI and AD (Supplementary Note 1). The UKBB contains phenotypic and biological information from 500,000 participants; see their previous publication for more details 21. For this study, ICD10 codes from hospital inpatient records and participant responses were used to identify cases of AD. See the supplementary material for the specific codes, question, and responses used. Additionally, individuals with indication

Figure 1. Miami plot showing significant SNPs identified by VariantSpark in UK Biobank (10 controls to 1 case) (top) and ADNI (bottom) cohorts. Red asterisks mark those variants that have been replicated by position (only independent variants) between the two cohorts. Annotation (black) represents gene annotations that are novel and replicated between the two cohorts. Annotations in grey represent previously identified variants.

Figure 2. Network diagram of significant BitEpi interactions from the UK Biobank cohort. Nodes in green are known AD associations, in red is the novel gene replicated in this study, and in blue are variants which are novel but unreplicated. All 2-SNP, 3-SNP, and 4-SNP interactions are included. Node size represents node degree, calculated with the NetworkAnalyzer plug-in in Cytoscape.

Figure 3. Network diagram of significant BitEpi interactions from the ADNI cohort. Nodes in green are known AD associations, in red is the novel gene replicated in this study, and in blue are variants which are novel but unreplicated. All 2-SNP, 3-SNP, and 4-SNP interactions are included. Node size represents node degree, calculated with the NetworkAnalyzer plug-in in Cytoscape.

Figure 4. Relative control rates of the interactions (A) rs119656810 (SH3BP4) and rs429358 (APOE) in the UK Biobank cohort, and (B) rs7552961 (ACOT11), rs9918382 (SASH1), and rs429358 (APOE) in the ADNI cohort. Relative control rates were calculated as the difference between the control rate of each genotype combination and the control rate of the entire cohort. Due to sample size restrictions, the rs119656810 SNP and the rs9918382 SNP were each reduced to two categories: presence or absence of the alternate allele. There is evidence of a modulating effect of the alternate allele of rs119656810 on the APOE-e4 (rs429358 CC) genotype, as seen from the increase in relative control rates in the top middle and top right cells in (A). There is evidence of a protective effect of the alternate allele of rs9918382 on the ACOT11 × APOE genotypes, as seen from the increase in relative control rates in the top middle cell and the bottom right cell in (B). However, there is no evidence of the same effect for the APOE-e4 (rs429358 CC) genotype in an interaction with the ACOT11 alternate allele (rs7552961).

Table 1. Annotated statistically significant independent SNPs identified using VariantSpark and the UKBB cohort. Replication status in the ADNI validation cohort and the GWAS Catalog is included.