Alzheimer's disease (AD) is the most common form of dementia and predominantly affects individuals over 65 years of age1. The vast majority (99%) of AD cases are late onset (LOAD) and are driven by multiple genetic and environmental influences, with genetics accounting for between 53 and 80% of total phenotypic variance2,3,4. The heritability of LOAD is predominantly carried by the APOE locus, which explains about 25% of the total heritability of the disease5. In addition to APOE, large-scale genome-wide association study (GWAS) meta-analyses identified 40 (ref. 6) and 75 (ref. 7) additional risk loci, but more than 30% of genetic variability remains unknown3. Recent studies8,9 predict that there are 100 to 1000 causal variants with modest effects associated with LOAD, of which only a small proportion have been identified.

Part of the missing heritability in LOAD might be explained by non-additive interactions10, which are ignored by GWAS. Indeed, a genome-wide replicated scan has found epistasis to be a ubiquitous phenomenon across multiple phenotypes11. Epistatic interactions have long been implicated in complex genetic disease, including neurological diseases12 and LOAD itself13. However, owing to the computational complexity of finding genome-wide gene–gene interactions, searches have been limited to candidate gene approaches13,14,15,16,17 or to genome-wide approaches exploring interactions between APOE and other risk loci18.

Using the machine learning (ML) platform VariantSpark19, we overcome the shortcomings of traditional statistical GWAS approaches and computationally limited epistasis discovery tools to identify genome-wide variants associated with LOAD and AD in both the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort (512 cases, 272 controls)20 and the UK Biobank (UKBB) cohort (704 cases, up to 6869 controls)21. Using a novel false discovery rate (FDR) method22, we are able to use VariantSpark’s random-forest-based feature selection approach to narrow down the genome-wide search space to the subset of variants enriched with epistatic interactions. We then apply BitEpi23 to perform an exhaustive search of this subset to annotate pairwise and higher-order, statistically significant interactions between the variants. We also explore the proportion of phenotypic variance captured by VariantSpark versus the traditional logistic regression (LR) methods. Finally, we demonstrate that VariantSpark has improved sensitivity to detect signal with fewer control samples compared with LR approaches.


VariantSpark identifies known AD loci across two independent cohorts

Using the ML genomics platform VariantSpark19 and a novel RFlocalfdr approach22, we identified genetic variants that are both marginally and interactively associated with AD in two independent cohorts, UKBB and ADNI (7,573 and 784 samples, respectively, with 4.5 M SNPs each). Because VariantSpark captures both types of association, we expect it to find more significant variants than an LR approach at < 5% FDR.

We identified 104 SNPs (53 independent) to be significantly associated with AD in the UKBB cohort (Table 1, Fig. 1, Supplementary Table S1) and 207 significantly associated SNPs (124 independent) in the ADNI cohort (Fig. 1, Supplementary Table S2). When we compared these associations with those reported for AD in the GWAS Catalog (trait ID ‘MONDO_0004975’, accessed 16/05/22)24 using locus bins, we observed an approximately 70% overlap for the significant SNPs identified in both the UKBB (72/104) and ADNI (145/207) cohorts. Among the independent SNPs, 31 of the 53 UKBB SNPs (58.49%) and 82 of the 124 ADNI SNPs (66.13%) overlapped (Table 1 and Supplementary Table S2).

Table 1 Annotated statistically significant independent SNPs identified using VariantSpark and the UKBB cohort. Replication status to ADNI validation cohort and GWAS Catalog included.
Figure 1

Miami plot showing significant SNPs identified by VariantSpark in UK Biobank (10 controls to 1 case) (top) and ADNI (bottom) cohorts. Red asterisks mark those variants that have been replicated by position (only independent variants) between the two cohorts. Annotation (black) represents gene annotations that are novel and replicated between the two cohorts. Annotations in grey represent previously identified variants.

As expected, the APOE locus was identified in both the ADNI and UKBB cohorts (Supplementary Tables S1 and S2). To evaluate the functional context of the other significantly associated independent variants, we performed functional enrichment analysis using MAGMA. Gene-set analysis (Supplementary Table S4) identified 9 (ADNI) and 3 (UKBB) significantly associated gene sets (after Bonferroni correction). Many of the significant gene sets, and those with suggestive significance levels (P < 0.05), fell into the categories of transmembrane and metal ion transport proteins (known to be key in neuronal signalling in the brain). Tissue expression analysis using MAGMA and GTEx (Supplementary Table S5, Supplementary Table S6) revealed brain tissues to be the most highly ranked, although they did not pass Bonferroni correction.

VariantSpark identifies novel loci associated with AD

We next investigated which loci replicated between the two independent cohorts. Despite the phenotypic heterogeneity across the two cohorts, we replicated three independent, significantly associated genes, APOE (rs429358), SASH1, and SH3BP4 (Table 1 and Supplementary Table S2). It is important to note that the significance threshold for RFlocalfdr is 0.05, compared with the traditional genome-wide significance threshold of P < 5 × 10⁻⁸, which is set to correct for multiple tests. Both thresholds, RFlocalfdr for VariantSpark and P < 5 × 10⁻⁸ for logistic regression, control for Type 1 error and correct for the multiple testing burden. For further information, see the Methods section.

Both SASH1 and SH3BP4 were novel to our study and were not yet present among the GWAS Catalog SNPs, although there is a marginally associated SNP (rs9390537, χ²-p = 8.17 × 10⁻⁶) mapping to an intergenic region 91,233 bp upstream of SASH1 associated with AD25 and another marginally associated SNP (rs66501349, χ²-p = 2 × 10⁻⁶) intergenic to SH3BP4 and CEP19P1 associated with poorer cognitive function26. The corresponding rsIDs from the UKBB cohort are rs117160741 (Chr 6:148512131) for SASH1 and rs114656810 (Chr 2:235751287) for SH3BP4 (Supplementary Table S1). Both are intergenic and located upstream of the genes. Similarly, in the ADNI cohort, rs9918382 (Chr 6:148265029) is an intergenic variant located upstream of SASH1, while rs6711272 (Chr 2:235131361) is an intergenic variant located downstream of SH3BP4 (Supplementary Table S2).

SASH1 (SAM and SH3 domain-containing 1) encodes a scaffold protein that is ubiquitously expressed, including in brain tissues, and is also a positive regulator of the NF-κB signalling pathway through the activation of TLR427. SH3BP4 (SH3 domain binding protein 4) encodes a protein involved in the amino acid-induced TOR signalling pathway28. Both SASH1 and SH3BP4 are membrane-bound phosphoproteins with SH3 domains.

BitEpi identifies novel interactions between known and novel AD genes

BitEpi was used to identify epistatic interactions between significantly associated variants in both cohorts. The β and α metrics, reflecting association power and interaction effect respectively, were used to select interactions that were strongly associated with the AD phenotype due to an epistatic effect. We identified 37 interactions with significant β and α values in the UKBB cohort, of which 17 were 2-SNP, 16 were 3-SNP, and 4 were 4-SNP interactions (Fig. 2, Supplementary Table S7). Using the ADNI cohort, we identified 58 interactions with significant β and α values, of which 39 were 2-SNP, 17 were 3-SNP, and 2 were 4-SNP interactions (Fig. 3, Supplementary Table S8). Interestingly, the two replicating AD-associated genes, SASH1 and SH3BP4, were involved in epistatic interactions.

Figure 2

Network diagram of significant BitEpi interactions from the UK Biobank cohort. Nodes in green are known AD associations, in red is the novel gene replicated in this study, and in blue are variants which are novel but unreplicated. All 2-SNP, 3-SNP, and 4-SNP interactions are included. Node size represents node degree, calculated using the NetworkAnalyzer plug-in in Cytoscape.

Figure 3

Network diagram of significant BitEpi interactions from the ADNI cohort. Nodes in green are known AD associations, in red is the novel gene replicated in this study, and in blue are variants which are novel but unreplicated. All 2-SNP, 3-SNP, and 4-SNP interactions are included. Node size represents node degree, calculated using the NetworkAnalyzer plug-in in Cytoscape.

In the UKBB cohort, the SNP (rs114656810) mapping to SH3BP4 was found to interact with rs429358, which is a reported pathogenic APOE SNP in ClinVar29, where the alternate ‘C’ allele plays a part in the high AD-risk APOE-ε4 isoform. This pairwise interaction was interrogated to identify the genotype combinations associated with AD (Supplementary Table S9). Due to the low number of samples with the homozygous alternate genotype (AA) of the SH3BP4 SNP, we reduced the genotypes to two classes: presence or absence of the alternate ‘A’ allele. In the absence of the alternate SH3BP4 SNP allele, there was no absolute difference in control rates between the SH3BP4 × APOE interaction and the APOE SNP alone (Fig. 4A). This indicates a limited effect of the homozygous reference genotype of rs114656810 on AD. However, in the presence of the alternate allele of the SH3BP4 SNP, the pathogenic effect of the APOE C allele is modulated (Fig. 4A), suggesting that SH3BP4 may have a protective mechanism against AD for carriers of the APOE ‘CC’ genotype. In the ADNI cohort, this pairwise interaction between SH3BP4 and APOE was marginally significant but did not pass Bonferroni correction.

Figure 4

Relative control rates of the interactions (A) rs119656810 (SH3BP4) and rs429358 (APOE) in the UK Biobank cohort, and (B) rs7552961 (ACOT11), rs9918382 (SASH1), and rs429358 (APOE) in the ADNI cohort. Relative control rates were calculated as the difference between the control rate of each genotype combination and the control rate of the entire cohort. Due to sample size restrictions, the rs119656810 and rs9918382 SNPs were each reduced to two categories: presence or absence of the alternate allele. There is evidence of a modulating effect of the alternate allele of rs119656810 on the APOE-ε4 (rs429358 CC) genotype, as seen from the increase in relative control rates in the top middle and top right cells in (A). There is evidence of a protective effect of the alternate allele of rs9918382 on the ACOT11 × APOE genotypes, as seen from the increase in relative control rates in the top middle cell and the bottom right cell in (B). However, there is no evidence of the same effect for the APOE-ε4 (rs429358 CC) genotype in an interaction with the ACOT11 alternate allele (rs7552961).

In the ADNI cohort, the SNP rs9918382 mapping to SASH1 was involved in a triplet interaction with the same pathogenic APOE SNP, rs429358. The other SNP in the triplet, rs7552961, maps to ACOT11, which has been shown to be associated with mild cognitive decline30. This triplet interaction was also examined further (Supplementary Table S10). Again, due to the low number of samples with the homozygous alternate genotype of rs9918382 (n = 15), the genotype was reduced to two classes: presence or absence of the alternate ‘G’ allele. Figure 4B shows that the alternate ‘G’ allele of the SASH1 SNP has a protective effect, reversing the pathogenic interaction effect of the rs7552961 (ACOT11) TT genotype and rs429358 (APOE) TC genotype and increasing the relative control rate from −0.139 to 0.028 (Supplementary Table S10). However, when the alternate ACOT11 allele (G) is present with the APOE CC genotype, the SASH1 SNP has no effect. In fact, none of the possible pairwise interactions between these three genotypes passed significance for the α metric, which suggests that the association with AD was carried by the interaction of all three SNPs. This highlights the complexity and difficulty of detecting epistatic interactions, where exacerbating or protective properties are exerted through specific combinations of genotypes.

VariantSpark can detect more disease associated signal than logistic regression

Next, we compared VariantSpark with the more traditional GWAS approach implemented in PLINK’s logistic regression (LR) to estimate the power to detect disease associated signal with limited control samples. To do this, in addition to using the ADNI cohort, we subset two datasets from the UKBB cohort: the first contained a ratio of 10 controls to 1 case (UKBB10to1) and the second with 2 controls to 1 case (UKBB2to1).

Using LR, we identified multiple variants at suggestive significance levels in the ADNI cohort (ranging from χ²-p = 8.34 × 10⁻⁸ to χ²-p = 2.63 × 10⁻⁶), all falling within the APOE locus (Chr19:45,326,217 to Chr19:45,445,517). In the UKBB cohort, we identified three significantly associated independent SNPs in UKBB10to1 (127 in total) (Supplementary Table S3) and one significantly associated independent SNP in UKBB2to1 (74 in total) (Supplementary Table S11). All SNPs found using LR fell within the APOE locus (Chr19:45,326,217 to Chr19:45,445,517).

In contrast, VariantSpark identified associations outside of the APOE region such as rs79486209 on chromosome 10 which mapped to PLPP4, a gene previously associated with AD31. VariantSpark identified 53 significantly associated independent SNPs (104 in total) in UKBB10to1 (Table 1) and 20 significantly associated independent SNPs (69 in total) in UKBB2to1 (Supplementary Table S12).

This demonstrates that we have 15% (1/3 vs. 20/53) more power to detect disease-associated variants with 80% fewer (2 vs. 10) controls using VariantSpark compared with an LR approach.

VariantSpark captures more phenotypic variance in AD than Logistic Regression

A key goal of this study was to explore whether epistasis can explain some of the missing heritability that is well documented in AD2,3,4. To this end, we measured the proportion of phenotypic variance captured by genetic variants identified in the UKBB cohort using Nagelkerke’s pseudo-R² and fitting three LR models with the following predictors: (1) significant and independent SNPs identified by LR (n = 3); (2) significant and independent SNPs identified by VariantSpark (n = 53); and (3) significant and independent SNPs identified by VariantSpark together with significant interactions identified by BitEpi (n = 122).

Within the UKBB cohort, the VariantSpark-BitEpi model (model (3)) explained the most variance at 23.18%, compared with 17.12% for model (2) without the BitEpi interactions and 12.77% for model (1) with the LR SNPs (Supplementary Fig. S2). To test whether the performance increase of the VariantSpark-BitEpi model was driven merely by its additional variables, we calculated an empirical P value. We fitted 1000 models containing the 3 LR SNPs as well as 50 randomly selected SNPs and 69 interactions to emulate the degrees of freedom of the VariantSpark-BitEpi model (3). As shown in Supplementary Fig. S2, these models achieved an average pseudo-R² of 19.33%, outperforming the models with fewer predictors (models (1) and (2)). Nevertheless, the VariantSpark-BitEpi model showed a small but significant (p = 0.006) performance improvement over the random models (23.18% vs 19.33%), confirming that additional signal was captured. We made a similar observation when these models were tested on the independent ADNI cohort: LR (model 1) captured 7.09%, the random models captured 25% on average, and VariantSpark-BitEpi (model 3) achieved 27.20%. The increase in variance explained on the ADNI set is likely due to an easier signal, which is predominantly driven by APOE (as observed in Section C).
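The empirical P value described above can be sketched in a few lines. This is a minimal illustration, assuming a one-sided comparison in which the P value is the fraction of random models whose pseudo-R² is at least as large as the observed value; the exact formula used in the study is not stated in this section, and the function name and toy values below are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null (random-model) statistics that are >= the observed one.

    Assumption: one-sided test without a +1 pseudo-count correction.
    """
    if not null_values:
        raise ValueError("null_values must not be empty")
    exceed = sum(1 for value in null_values if value >= observed)
    return exceed / len(null_values)


# Toy example with made-up pseudo-R2 values (not study data):
# one of the five random models reaches the observed value.
toy_null = [0.18, 0.19, 0.20, 0.24, 0.17]
toy_p = empirical_p_value(0.2318, toy_null)
```

With 1000 random models, as in the analysis above, the smallest non-zero P value this estimator can return is 0.001.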

These findings indicate that VariantSpark-identified SNPs and BitEpi-identified epistatic interactions together explain up to 10.41% more phenotypic variance in AD than traditional LR approaches that focus only on marginal effects. This also aligns with previous studies where the addition of 87 marginal effect SNPs (without APOE) explained only 2.1% more variance32 and 2,042,105 SNPs (without known AD SNPs) accounted for 25.3% variance3. Taken together, these results suggest that epistatic interactions across the genome play a part in AD aetiology and should be accounted for when developing therapeutics and genetic risk scores.

Transcriptome-wide association (TWAS) lookup of SASH1 and SH3BP4

Finally, we looked at transcriptome-level information for the mapped genes SASH1 and SH3BP4, as previous studies33,34 have shown that this can add confidence that GWAS-identified genes are capturing actual disease-related signal. Using the TWAS-hub35, SASH1 showed strong evidence (ENET-P = 7.5 × 10⁻⁹) of involvement in the prefrontal cortex tissue and a strong association with “Alzheimer’s Disease (in father)” (Supplementary Table S13). In contrast, SH3BP4 showed an association with nerve tibial tissue at non-suggestive levels for Alzheimer’s Disease (Supplementary Table S14). Another resource used was the set of gene expression tests built into FUMA36, which uses GTEx v8 (ref. 37) data. In this analysis, both SASH1 and SH3BP4 showed increased expression levels in brain tissue (Supplementary Fig. S3).


Using VariantSpark, an ML approach to GWAS, we have identified two novel genes, SASH1 and SH3BP4, to be associated with AD, reaching genome-wide significance.

SASH1 is a known tumour suppressor protein that has been shown to be differentially expressed between AD and control samples38,39. Furthermore, a previous study found SNP rs9390537 (located 91,233 bp upstream of SASH1) to be nominally associated with LOAD (χ²-p = 8.17 × 10⁻⁶)25. Indeed, SASH1 is a nominated AD drug target in the Agora database, which is curated by AD researchers from the Accelerating Medicines Partnership-Alzheimer’s Disease (AMP-AD) consortium and other research teams.

SH3BP4, or transferrin trafficking protein (TTP), interacts with endocytic proteins including clathrin, dynamin, and the transferrin receptor40 and is involved in the amino acid-Rag GTPase-mTORC1 signalling pathway. It is a central link between Akt signalling and cell–matrix adhesion regulation28. Although SH3BP4 has no established link to AD, a SNP (rs66501349, intergenic to SH3BP4 and CEP19P1) has been marginally associated with poorer cognitive function (χ²-p = 2 × 10⁻⁶)26, and its interactor dynamin has strong evidence of a role in AD pathophysiology41,42. In particular, the expression of the gene DNM2 was significantly decreased in AD patients, and neuronal cell lines transfected with dominant negative DNM genes were observed to have an accumulation of APP and increased Aβ secretion43.

The key contribution of our work is adding the lens of epistasis to association. We identified a total of 95 epistatic interactions, including 2-SNP, 3-SNP and 4-SNP interactions associated with AD, in two independent cohorts. This elevated the previously only nominally associated SASH1 (ref. 25) to pass FDR significance when its interaction with ACOT11 and APOE is accounted for. Specifically, our epistasis analysis revealed that the alternate ‘G’ allele of SASH1 SNP rs9918382 appears to have a protective effect against AD, as it reverses the pathogenic effect of the ACOT11 rs7552961 ‘TT’ and APOE rs429358 ‘TC’ genotype combination (Supplementary Fig. S3). However, this modulating effect was not found in the presence of two copies of the pathogenic APOE ‘C’ allele (rs429358, Supplementary Fig. S3). This result is consistent with co-expression patterns found between AD and control brains44 and the high expression levels of SASH1 in pre-frontal cortex tissue in the TWAS-hub. Taken together, it is likely that SASH1 plays a role in AD pathophysiology and warrants further investigation.

Although most of our identified epistasis is concentrated between APOE and a small number of other loci, our methodology can explore genome-wide epistasis in an unbiased manner, unlike previous studies45,46. Additionally, a genome-wide search allows for the identification of epistasis in non-coding regions of the genome, which have been empirically demonstrated to affect gene expression47.

For example, our epistasis analysis revealed a modulating effect of the alternate allele of SNP rs119656810 (SH3BP4) on the APOE locus. A possible explanation for this effect is that SH3BP4 has the ability to regulate the activity of dynamin40, whereby it enables the processing of amyloid β protein precursors resulting in lower levels of Aβ depositions and AD pathology. Together, SH3BP4 is a novel gene that may play a role in AD pathophysiology through its pathway mechanisms and in combination with APOE.

While VariantSpark identified SH3BP4 and SASH1 in both cohorts due to their cumulative additive and epistatic effects on AD, the exact epistatic interactions they are involved in were not replicated, although SH3BP4-APOE showed marginal significance. This is likely due to the varying number of individuals who might have this exact modulating disease physiology and genotype combinations across the two cohorts. This illustrates the benefits of using VariantSpark instead of traditional LR models on binary traits with potential polygenic interactions, like Alzheimer’s disease.

Using VariantSpark, we were also able to detect disease genes with fewer controls than traditional approaches. This is relevant as a recent study calculates 10,000,000 cases would be needed for a traditional GWAS to find significant SNPs explaining 50% of Alzheimer’s disease heritability48. Even for large initiatives such as FinnGen or 23andMe, such numbers are hard to achieve. Our method offers an alternative and enables discoveries in smaller but well annotated cohorts for AD and other genetic studies.

The limitations of our study are as follows. Firstly, ADNI used whole genome sequencing mapped to the GRCh38 reference genome, while the UKBB used array technology mapped to the GRCh37 reference genome, resulting in a final set of 4.5 million common SNPs, which was around 50% of the total number of SNPs for both cohorts. Secondly, the ADNI and UKBB cohorts differ in ascertainment. In particular, UKBB is a relatively healthy volunteer cohort and contained a mix of AD phenotypes, while ADNI recruited patients based on their health status and included samples with mild cognitive impairment to maximise sample size, which is only an AD-proxy phenotype. Lastly, the ADNI cohort was substantially smaller than the UKBB cohort, with 784 samples compared to 7582 samples. These three factors in combination limit our power to discover and replicate disease variants and epistatic interactions across cohorts. Furthermore, we restricted our samples in both cohorts to those of European descent, as is commonplace49. However, it has been shown that ethnicity plays a crucial role in AD aetiology50,51, and more diverse genomic datasets are needed to gain better unbiased insights52.

In conclusion, we have established an ML approach for detecting genetic signals associated with disease, which goes some way to explain the missing heritability observed in previous literature.


Sample selection

Data for AD were obtained from two sources: ADNI and UKBB. The ADNI aimed at testing combinations of imaging and biological markers to measure progression of AD and mild cognitive impairment (MCI). For this study, cases were samples labelled as early and late MCI and AD (Supplementary Note 1). The UKBB contains phenotypic and biological information from 500,000 participants; see their previous publication for more details21. For this study, ICD10 codes from hospital inpatient records and participant responses were used to identify cases of AD. See the supplementary material for the specific codes, questions, and responses used. Additionally, individuals with an indication of early onset AD and/or a family history of AD were excluded. Based on the UK Biobank, two subsets were generated to identify differences in detection power for novel variants. One contained a ratio of 1 case to 2 controls (labelled UKBB2to1) and the other a ratio of 1 case to 10 controls (labelled UKBB10to1). The UKBB10to1 cohort was used for all result sections, unless specified. Counts of individuals included in the analyses are shown in Supplementary Note 1. This research was approved by the UK Biobank's governing Research Ethics Committee.

Quality control

Quality control (QC) included exclusion of variants with minor allele frequency (MAF) < 0.01, imputation quality < 0.9, genotype missingness > 0.1 and those deviating from Hardy–Weinberg equilibrium (P < 1 × 10⁻⁶). Furthermore, individuals with a discrepancy between their genetic and reported sex were excluded, as were individuals whose genotype-derived principal components 1 and 2 were more than 6 standard deviations away from those of the 1000 Genomes European population. After QC, we had 11.7 M variants in UKBB and 9.5 M variants in ADNI, with 4.6 M in common between the two cohorts. Notably, the ADNI cohort was mapped to the GRCh38 reference while the UK Biobank was mapped to the GRCh37 reference.
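The variant-level QC thresholds listed above can be expressed as a single predicate. The following is a minimal sketch, not the pipeline used in the study: the `Variant` record, its field names and the example values are hypothetical, and only the four thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant summary record (field names are illustrative)."""
    variant_id: str
    maf: float          # minor allele frequency
    info: float         # imputation quality score
    missingness: float  # fraction of samples with a missing genotype
    hwe_p: float        # Hardy-Weinberg equilibrium test P value


def passes_qc(v, min_maf=0.01, min_info=0.9, max_missing=0.1, min_hwe_p=1e-6):
    """Return True if the variant survives all four exclusion criteria."""
    return (
        v.maf >= min_maf
        and v.info >= min_info
        and v.missingness <= max_missing
        and v.hwe_p >= min_hwe_p
    )


# Made-up example records (not study data).
toy_variants = [
    Variant("var_ok", maf=0.05, info=0.95, missingness=0.02, hwe_p=0.5),
    Variant("var_rare", maf=0.005, info=0.95, missingness=0.02, hwe_p=0.5),
    Variant("var_hwe", maf=0.05, info=0.95, missingness=0.02, hwe_p=1e-9),
]
kept = [v.variant_id for v in toy_variants if passes_qc(v)]
```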

Genome-wide association study using logistic regression

Association testing between AD and genetic variants was conducted using a whole-genome LR model implemented in PLINK53 (v1.90beta). Sex, age and the top 20 principal components were used as covariates for the association analysis.

Genome-wide association study using VariantSpark

VariantSpark19, a distributed implementation of the random forest (RF) algorithm, was used for association testing on Amazon Web Services. The same QC’d input files from the LR analyses were used in the VariantSpark analyses. Optimisation of four hyperparameters (mTry, minNodeSize, MaxDepth, and nTree) was run on all cohorts. The optimised settings for all three cohorts were the same for mTry (0.1), MaxDepth (10), and nTree (20,000), except for minNodeSize, where UKBB10to1 = 758, UKBB2to1 = 211, and ADNI = 78.

We assessed the reliability of VariantSpark on real datasets by computing Pearson’s correlations between the Gini importance scores of three runs on the UKBB10to1 and ADNI cohorts (Supplementary Fig. S1). Further, we tested the effect of covariates (as used in LR) in an RF model by comparing the out-of-bag error metric between a Ranger54 run with covariates and a VariantSpark run without covariates. We did not observe any difference between the models; thus, covariates were not included in the final VariantSpark analysis.
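The run-to-run reliability check can be illustrated with a plain Pearson correlation between two vectors of importance scores. This is a generic sketch with made-up scores; it does not read VariantSpark output, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up importance scores for the same five variants in two runs.
run_a = [0.10, 0.40, 0.05, 0.30, 0.15]
run_b = [0.12, 0.38, 0.06, 0.29, 0.15]
r_ab = pearson_r(run_a, run_b)
```

A value of r close to 1 across repeated runs indicates that the importance ranking is stable.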

Compute resources

LR analyses were conducted on a machine with 16 cores and 48 GiB of memory. VariantSpark analyses were conducted using Amazon EMR (Elastic MapReduce) with a total of 64 vCores and 488 GiB of memory.

Post-GWAS analyses

P value calculation

The primary measure of association from VariantSpark is the importance score derived from Gini-Index55. While this score can rank variants by importance, it is unable to determine significantly associated variants. To determine significance from importance scores, we used a recently developed method22. Briefly, this approach is based on the empirical Bayes method56 which uses RF tree information as a threshold to fit a skew normal distribution and correct for multiple testing akin to Efron’s local false discovery rate approach.

Identification of independent variants, functional mapping and annotation

Variants identified in the GWAS were annotated using SNPTracker57 and clumped using PLINK53 (v.1.90b3.31) within a window of 1000 kb and an r² of 0.01. Significantly associated variants were functionally mapped and annotated using ANNOVAR (v.7 2020-06-08)58. Furthermore, all significantly associated variants were mapped into locus bins, where each locus bin was created based on a two million base-pair sliding window around the variants. This allowed known associations from the GWAS Catalog to be mapped to our results by identifying bins that are shared between the GWAS Catalog and our study’s associations.
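The locus-bin comparison can be sketched as a simple interval check. The sketch below assumes that a bin is a two million base-pair window centred on a variant (one million base pairs on either side) and that a catalogue association is shared when it lies on the same chromosome inside that window; the binning actually used in the study may differ, and all names and coordinates here are hypothetical.

```python
HALF_WINDOW = 1_000_000  # assumption: 2 Mb window centred on the variant


def locus_bin(chrom, pos, half_window=HALF_WINDOW):
    """Return (chromosome, start, end) of the window around a variant."""
    return (chrom, max(0, pos - half_window), pos + half_window)


def overlaps_catalog(variant, catalog, half_window=HALF_WINDOW):
    """True if any catalogue entry falls inside the variant's locus bin.

    variant: (chromosome, position); catalog: iterable of (chromosome, position).
    """
    chrom, start, end = locus_bin(variant[0], variant[1], half_window)
    return any(c == chrom and start <= p <= end for c, p in catalog)


# Made-up coordinates for illustration only.
toy_catalog = [("6", 148_000_000), ("19", 45_400_000)]
hit = overlaps_catalog(("6", 148_500_000), toy_catalog)   # 0.5 Mb away
miss = overlaps_catalog(("6", 150_000_000), toy_catalog)  # 2 Mb away
```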

General quality assurance of the UKBB (discovery) and ADNI (replication) cohort

PLINK LR results were used to identify potential population stratification using LDSC. No evidence for inflated statistics due to hidden population stratification was detected (LDSC intercept estimate was 1.03 ± 0.01 and 1.03 ± 0.01 for UKBB10to1 and ADNI, respectively).

Epistasis calculation using BitEpi

To identify 2-SNP, 3-SNP, and 4-SNP interactions, BitEpi was applied to the significant VariantSpark associations in the UKBB and ADNI cohorts separately. The methods behind BitEpi have already been discussed elsewhere59 but, briefly, BitEpi calculates two entropy metrics, α and β. The β metric reflects the combined association power of all the SNPs involved in the interaction, while the α metric represents the gain in association power due to the epistatic effect of all interactive SNPs. Therefore, an interaction with a large α and β has a strong association with the phenotype caused by an epistatic effect between all of the SNPs in the interaction. Quantiles for each order (2-SNP, 3-SNP or 4-SNP interactions) were used to select interactions with high α and β values before P-values were computed through a permutation procedure. Bonferroni-corrected significance thresholds were calculated based on all possible combinations, with < 0.05 denoting significance. SNPs involved in significant interactions were annotated with their independent SNP to remove any redundant interactions.
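The Bonferroni correction over all possible combinations can be illustrated as follows. This sketch assumes that the number of tests for a given interaction order is the number of unordered SNP combinations of that size; the SNP count in the example is illustrative rather than taken from the study, and the function name is hypothetical.

```python
from math import comb


def bonferroni_threshold(n_snps, order, alpha=0.05):
    """Per-test significance threshold for all `order`-SNP combinations."""
    n_tests = comb(n_snps, order)
    if n_tests == 0:
        raise ValueError("no combinations of this order exist")
    return alpha / n_tests


# Illustrative numbers: 100 SNPs searched for pairwise, triplet and
# quadruplet interactions.
thresholds = {order: bonferroni_threshold(100, order) for order in (2, 3, 4)}
```

Because the number of combinations grows rapidly with interaction order, the per-test threshold for 4-SNP interactions is several orders of magnitude stricter than for pairs, which is one reason the search is restricted to a pre-selected subset of variants.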

Using an in-house Python script, we generated contingency tables for some of the significant interactions found by BitEpi (Fig. 4, Supplementary Table S9). The control rate is the number of controls divided by the number of samples, computed for each genotype combination or for the overall cohort. The relative control rate is then the genotype combination control rate minus the overall control rate. A genotype combination with a negative relative control rate can be considered deleterious, and one with a positive rate protective.
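The relative control rate can be sketched as below. This is not the in-house script: it follows the convention given in the Figure 4 caption (control rate of the genotype combination minus the control rate of the entire cohort, so that negative values indicate fewer controls than expected), and the function name and counts are made up for illustration.

```python
def relative_control_rates(counts):
    """Relative control rate per genotype combination.

    counts maps a genotype-combination label to (n_controls, n_cases).
    Returns combination control rate minus overall cohort control rate.
    """
    total_controls = sum(controls for controls, _ in counts.values())
    total_samples = sum(controls + cases for controls, cases in counts.values())
    if total_samples == 0:
        raise ValueError("counts must contain at least one sample")
    overall_rate = total_controls / total_samples
    return {
        combo: controls / (controls + cases) - overall_rate
        for combo, (controls, cases) in counts.items()
        if controls + cases > 0  # skip empty cells of the contingency table
    }


# Made-up contingency table (not study data).
toy_counts = {"TT/TT": (90, 10), "TT/TC": (60, 40)}
toy_rates = relative_control_rates(toy_counts)
```

In this toy table the overall control rate is 0.75, so the first combination is enriched for controls and the second is depleted.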

Variance explained calculation

The significant associations from the VariantSpark, PLINK LR, and BitEpi analyses of the UKBB cohort were used to calculate the variance explained, measured as Nagelkerke’s pseudo-R² (ref. 60), within the UKBB and ADNI cohorts. Logistic models were run in R v4.1.3 (ref. 61) with the following predictors: (1) significant and independent PLINK LR SNPs (n = 3), (2) significant and independent VariantSpark SNPs (n = 53), and (3) significant and independent VariantSpark SNPs and all significant BitEpi interactions as interacting variables (n = 122). For all three models, the response was the AD case/control status. An empirical P-value was calculated from 1000 ‘random noise’ models, which were built to mimic the structure of model 3 by including the known APOE SNPs found by VariantSpark alongside SNPs with no association with AD.
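For reference, Nagelkerke’s pseudo-R² can be computed from the log-likelihoods of the fitted and intercept-only logistic models. The study ran its models in R; the Python function below is only an illustrative sketch of the standard formula, with a hypothetical name and made-up inputs.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's pseudo-R2 from log-likelihoods.

    ll_null:  log-likelihood of the intercept-only model (must be negative)
    ll_model: log-likelihood of the fitted model
    n:        number of samples
    """
    if n <= 0 or ll_null >= 0:
        raise ValueError("n must be positive and ll_null must be negative")
    # Cox-Snell R2, then rescaled by its maximum attainable value.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Made-up log-likelihoods for 200 samples (not study data).
toy_r2 = nagelkerke_r2(ll_null=-130.0, ll_model=-110.0, n=200)
```

The rescaling step is what allows the statistic to reach 1 for a perfectly fitting model, which the unscaled Cox-Snell value cannot do for a binary outcome.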