Genome-wide association of polygenic risk extremes for Alzheimer's disease in the UK Biobank

In just over a decade, advances in genome-wide association studies (GWAS) have offered an approach to stratify individuals based on genetic risk for disease. Using recent Alzheimer's disease (AD) GWAS results as the base data, we determined each individual's polygenic risk score (PRS) in the UK Biobank dataset. Using individuals within the extreme risk distribution, we performed a GWAS that is agnostic of AD phenotype and is instead based on known genetic risk for disease. To interpret the functions of the new risk factors, we conducted phenotype analyses, including a phenome-wide association study. We identified 246 loci surpassing the significance threshold of which 229 were not reported in the base AD GWAS. These include loci that showed suggestive levels of association in the base GWAS and loci not previously suspected to be associated with AD. Among these, there are loci, such as IL34 and KANSL1, that have since been shown to be associated with AD in recent studies. We also show highly significant genetic correlations with multiple health-related outcomes that provide insights into prodromal symptoms and comorbidities. This is the first study to utilize PRS as a phenotype-agnostic group classification in AD genetic studies. We identify potential new loci for AD and detail phenotypic analysis of these PRS extremes.

www.nature.com/scientificreports/ were rs11218343 in SORL1 and the SUZ12P1 locus that did not show significant differences between groups. In addition to SNPs that were below our MAF threshold, there were others that we did not replicate, and these were either borderline significant in our data, or were not further replicated by more recent AD GWAS. Conversely, some loci that were sub-significant in the base GWAS reached significance in this analysis. Some loci, such as www.nature.com/scientificreports/ IL34, that were not significant in the Jansen GWAS, have surpassed the significance threshold in our study and have also been independently shown to be associated with AD 5 .

Phenotype-based gene set enrichment.
To determine if there were sets of genes associated with other phenotypes enriched in AD PRS extremes, we performed a gene set enrichment analysis using FUMA (Supplementary Table 4). In Fig. 4 we report the top 10 most significant GWAS Catalog traits where genes overlap between the GWAS results for each trait and the GWAS results from the AD PRS extremes. To consider the strong effect of the APOE locus we separated results according to the presence (Fig. 4B) or absence (Fig. 4C) of genes located in this locus in the resulting overlapping gene sets.
Genetic correlation. We performed a genetic correlation to determine the relationship to other traits associated with these loci. In Fig. 5 we report the most significant correlated traits when analyzing all datasets available within the "ieu-a" batch available in the OpenGWAS project from the MRC Integrative Epidemiology Unit (IEU) (Fig. 5A). Again, to account for the strong effect of APOE we also conducted this analysis excluding the APOE locus (Fig. 5B). Results for all correlations performed are available in Supplementary Table 6.

PheWAS.
To determine if any of the phenotypes reported in the UKBB dataset were associated with extreme genetic risk for AD and potentially find traits that could be prodromal in AD, we performed a PheWAS using 1424 traits. We also performed the same analysis to understand if the associations were being driven by APOE, by excluding the APOE locus from the underlying GWAS. We focused on associations that surpassed the adjusted p value threshold and had a β ≥ |0.5| (Fig. 6, Table 3). Family history of AD or dementia (represented by parents and by siblings) was significantly associated with the AD PRS extremes. Interestingly, a proxy to longevity in the parents (mother and father still alive) was also associated with AD PRS extremes but in the opposite direction. Other traits also significantly associated with AD PRS extremes included tendency to fall, fecal incontinence, Parkinson's disease in mother, and usage of Gingko forte as medication.

Discussion
Given the difficulty in assembling ever larger cohorts of well characterized AD cases and the fact that other genetic loci contributing to disease risk are still to be identified, it is important to find new analytical methods to fully characterize the genetic architecture of Alzheimer's disease. Here we used an alternative approach to the typical GWAS by performing an association study on genetic risk extremes for AD. The rationale for this approach is based on the hypothesis that by taking a population of individuals and enriching those that carry variants that are of risk for a phenotype, one is also enriching for other variants associated with that same phenotype. Thus, we are using the extreme polygenic risk of AD as a surrogate for disease status and the variants identified here are not necessarily associated with AD itself but with the polygenic risk of AD. In fact, of the 18,892 individuals included in the high PRS extreme, only 365 were reported to have a diagnosis of AD. It is important to note that we are not using the PRS to predict AD status; this would suffer from overfitting since the UKBB samples were used in the study in our base summary statistics. In short, we are separating individuals in the UKBB population solely by their PRS for AD, selecting the extremes of this distribution, and then testing their genetic and phenotypic differences.
In this approach, we used 38,000 individuals selected from the PRS extremes obtained for almost 400,000 individuals in the UKBB. By comparing these genetic risk extremes, we were able to identify 23 loci with p values of association below 1 × 10 -15 that were not shown as associated with AD in the base GWAS study. Among these, the identification of genome-wide significant signals at the IL34, ACE and KANSL1 loci-loci that were not significant in the base GWAS but were subsequently identified to be associated with AD in independent studies-shows the validity of this approach. These results also offer the possibility of auditing the many loci now associated with AD risk. Comparing the results from the recent GWAS studying over 1 million individuals 23 and the previous results by the same group (base GWAS), using largely data on the same samples it is interesting  ADAMTS4 is an interesting locus as it has not been associated with AD by any other GWAS but shows a very significant association in this GWAS of AD PRS extremes (p = 6.9 × 10 -18 ) and functionally is very relevant for the beta-amyloid pathway 25,26 . On the other hand, three of the four loci that were not replicated in this study (HESX1, SUZ12P1 and ALPK2) all have top SNPs that were not included in this study due to low MAF. Still, there is no hint of association in any of these loci and they have not been associated with disease in more recent GWAS, indicating these may be false positives. Also interesting to note is the CD33 locus that has repeatedly been found to be significantly associated (including in the base GWAS), or to not associate with AD risk in the main GWAS for the disease. These inconsistent results may reflect an association that is stronger, or only present in some populations, but can also represent a false positive. The most significant variant in this locus in this GWAS of AD PRS extremes reached a p value of 10 -5 , supporting the latter. Still, the several studies showing different effects of CD33 in AD, such as an elevated expression in AD brain associated with amyloid pathology, disease progression, and microglial activation may reflect the important role of CD33 biological pathway in AD, most likely dependent on TREM2 27 . Given the design of this study, it is not possible to perform a formal replication stage to confirm the novel loci identified that potentially associate with AD risk. Still, some of the loci have now been identified in other GWAS (e.g. WDR12 and DOC2A) 5 and many of the genes nominated in the loci have features suggesting a potential role in AD. This recent GWAS also identified GRN and TMEM106B as novel loci for AD and suggested a continuum between AD and FTD. Interestingly, our results identified several loci that have been previously associated with the risk of Parkinson's disease (e.g.: LRRK2, ITKB, CCDC62), but no loci overlapping with frontotemporal dementia in addition to the MAPT locus. This indicates that instead of a continuum between AD and FTD at the GWAS risk level, the identification of FTD loci most likely reflected the misclassification in the diagnoses of clinical AD and proxy-AD. Misdiagnoses have always been part of GWAS, and these should be more apparent as the sample sizes increase with the inclusion of not-so-well characterized samples.
Both genetic correlation and gene set enrichment analyses identified interesting overlaps. Particularly, gene set overlap and genetic correlation can be observed with psychiatric traits such as schizophrenia, neuroticism, and depression. A previous association study of the shared genetics of autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia by the Psychiatric Genomics Consortium was one of the most significant correlations in this analysis. Etiologically and clinically, these psychiatric traits and AD are different diseases. Still, in many cases, they have similarities in the patterns of regional brain and biochemical dysfunctions, as well as in symptomatology 28 . Psychotic events are experienced by up to 50% of AD patients over the course of their illness 29 and, when compared with the general population, individuals with schizophrenia have a significantly higher risk (2-4 times) of developing AD and other dementias 30 . It is interesting to note that the base GWAS reported a nominally significant genetic correlation   www.nature.com/scientificreports/ between schizophrenia and AD 7 . More recently, by applying a schizophrenia PRS to AD with psychosis, it was shown that psychosis in AD shares some genetic liability with schizophrenia 31 . It is also interesting that this approach using AD PRS extremes identified enrichment and correlation of genes overlapping with phenotypes such as sleep duration and neuroticism. A robust association of sleep duration in middle and old age with the incidence of dementia has recently been established, using the 25-year follow-up Whitehall II study 32 . It should also be noted that sleep duration is anticorrelated with risk of AD in our study. Similarly, neuroticism has been associated with the risk of AD but also with disease pathology and progression both in sporadic and autosomal dominant disease [33][34][35] .
When examining the presence of APOE in these gene sets, the most enriched phenotypes with prior association to APOE included body mass index, AD CSF biomarkers and HDL/LDL levels, and several of these enrichments are further corroborated with evidence for correlation of anthropometric traits in the genetic correlation analyses. This is indicative of the strong effect that APOE has on these phenotypes, but also that the approach to separate individuals based on their AD PRS captures an enrichment of genes directly associated with AD, but also AD-related phenotypes, such as CSF Abeta and tau levels. Excluding the APOE locus from the overlapping genes also showed an enrichment of AD-related phenotypes, but also of other diseases, such as sarcoidosis and Parkinson's disease.
To explore the phenotypes associated with each quantile and potentially find new phenotypes and traits that could be seen as either comorbidity or predicting factors for AD, we also performed a PheWAS, using more than 1000 traits available in the UKBB dataset.
As expected, AD diagnosis and AD in the family were the most significant phenotypes associated with the AD PRS extremes comparison. Tendency to fall has been previously shown to be significantly higher in a small cohort of 140 AD patients versus 137 controls 36 , a result that we replicate in this study. The use of Ginkgo Forte was also significantly associated in these results, which could be a result driven by individuals with a family history of dementia searching for pharmaceutical options to improve or maintain memory. Parental longevity is inversely associated with AD PRS since individuals in the high PRS group seemed to have higher mortality in their parents. Previous studies have also reported that individuals with parents who live longer tend to have a more preserved brain structure and lower evidence of AD 37,38 .
There are a few features of the approach taken in this study that need to be kept in mind when interpreting these results: as previously mentioned, an extreme PRS for AD does not equate to a clinical diagnosis of AD. The associations described here are not with AD itself but rather with the genetic risk for AD. Related to this, high risk individuals may never develop AD, but they are still genetically predisposed to it. It is not possible to easily www.nature.com/scientificreports/ replicate the results obtained here, given the absence of a similar, independent dataset. Like most GWAS, this study also focuses on individuals of European ancestry-a feature of our method that utilizes the largest available "genetically homogenous" publicly available dataset, but an important aspect that is necessary to address in future studies 39 . Using publicly available data from a previous GWAS on Alzheimer's Disease 7 we computed polygenic risk scores for all genetically unrelated Caucasians in the UK Biobank cohort. To our knowledge, this is the first study using an AD PRS to separate individuals purely based on genetic risk, agnostic to disease status. We identified Figure 6. Phenome Scan results. The analysis was made with individuals in each extreme of the PRS distribution (including and excluding the APOE locus) using PHESANT. Each color represents a group of traits/ diseases, according to the UKBB hierarchical tree. Downward triangles represent results from individuals in the extremes of the PRS calculated when excluding APOE (no APOE). The size of triangles represents the beta value for the association. Vertical dashed lines connect results for the same trait in the APOE and no APOE analyses. Y-axis is the logarithmic scale for the p value, multiplied by the beta value, to depict whether the trait is negatively or positively associated. Red dashed lines represent the adjusted Bonferroni p value threshold. *Note: "Illnesses of father/mother: Alzheimer's Disease/Dementia" had a software output p value of 0, to represent these we attributed a p value of 1 × 10 -150 . Table 3. PheWAS results for AD PRS extremes. Shown are results with adjusted p value ≤ 4.3 × 10 -5 and β ≥ |0.5|. APOE indicates the PheWAS using individuals from the PRS extremes when APOE was included; Excl. APOE indicates PheWAS using individuals in the PRS extremes when the APOE locus was excluded from the PRS analysis (see "Methods"). *p value was too significant, and software output was zero-to plot these values (Fig. 6), we attributed a p value of 1 × 10 -150 . **Traits were excluded from PheWAS analysis using individuals from the PRS calculated without APOE due to lack of representation. www.nature.com/scientificreports/ the two extremes of AD risk from the polygenic risk distribution and analyzed genetic and phenotypic differences between these groups. The power of this unique approach allowed us to identify novel associations, not only at loci that were subsignificant in the base study but also at loci that were not suggestively significant. Some of the loci identified here have been recently and independently associated with AD by typical GWAS, validating this approach. Our findings indicate the urgent need of a systematic and comprehensive audit of all loci currently associated with AD risk. The inclusion of loosely characterized samples and the use of the same samples and/or data by different GWAS contributes to the difficulty in assessing true loci for the disease.
In summary, this is the first time PRS are used as the only defining characteristic to differentiate groups of individuals to identify novel loci associated with the underlying phenotype. Furthermore, we used phenotype analyses to identify comorbidities, traits, and diseases that can point towards new prodromal characteristics of high genetic risk for AD.

Methods
Dataset. We used the UKBB cohort, containing 487,409 whole-genome genotyped individuals (version 3) 40 , with about 200,000 of which also whole-exome sequenced (released in October 2020) 41 . This work was conducted as part of UK Biobank application number 11036 and follows all applicable guidelines and regulations. Individuals are from the United Kingdom and aged between 40 and 69 at recruitment 40 . We included individuals identified in the UKBB documentation as genetically defined "Caucasian" and removed individuals with greater than 3rd-degree relatedness to any other sample in the dataset, by applying a KING cutoff of 0.0884 as implemented in the ukbtools package (v0.11.3).
Polygenic risk score. To derive polygenic risk scores, we applied PRSice-2 42 to the summary statistics of one of the recent GWAS for AD 7 . Variants with p values below 0.05 in the AD GWAS were selected from the UKBB dataset and filtered to keep only variants with a Hardy-Weinberg equilibrium exact test p value above 1 × 10 -15 , missing call rates less than 1% and a minor allele frequency of at least 0.1% in the UKBB dataset. We used the following covariates throughout the analysis: sex, year of birth, Townsend deprivation index at recruitment, genotype measurement batch, and the first ten principal components provided by the UKBB. We defined quantiles from the PRS distribution with individuals in the 5% lowest PRS (18,892 individuals; 53.8% females, 46.2% males) and the 5% highest PRS (18,892 individuals; 54.6% females, 45.4% males). Individuals in the upper quantile will be referred to as "high PRS" individuals, while individuals in the lower quantile will be referred to as "low PRS" individuals. Additionally, to determine how much of the PRS was dependent on the APOE locus, we calculated PRSs using the complete set of markers and excluding SNPs within 1 Mb of the most significant variant in this locus, while using APOE genotype as a covariate.
Genome-wide association study. We determine which individuals fall in the highest and lowest 5% of the PRS distribution and perform a GWAS using these PRS extremes as the classification of groups in an agnostic approach to the clinically defined phenotype. We adjusted this analysis with the same covariates used in the PRS analysis described above. We filtered out all variants with a minimum allele frequency below 5%, Hardy-Weinberg equilibrium exact test p value below 1 × 10 -6 , and missing call rates above 1%. Association analyses were performed using the logistic regression function in PLINK1.9 43 . We then used FUMA package v1.3.6a 15 to annotate, analyze and interpret the results using the SNP2GENE function. All SNPs prioritized as the lead had a p value of less than or equal to 5 × 10 -8 .
Additionally, candidate SNPs were included in the annotation if they had a maximum p value of 0.05. Significant SNPs were considered as independent if they had a clumping R 2 threshold of at least 0.6 while lead SNPs were prioritized from independent SNPs and only considered as such if they had an R 2 threshold for the second clumping step of at least 0.1 (or if it was the same as the first clumping). We used Phase 3 of 1000 Genomes (European samples only) as a reference panel to assess linkage disequilibrium.
Genomic inflation was calculated for lambda (λ) in the QCEWAS package in R.
Phenotype-based gene set enrichment. Using results from the GWAS applied to individuals in the extreme PRS, we performed gene set enrichment analyses through GENE2FUNC in the FUMA package v1.3.6a 15 . Positional gene mapping aligned significant SNPs (p value < 5 × 10 −8 ) by their location within or immediately up/downstream [± 10 kilobases (kb)] of known gene boundaries. We report gene sets that had an overlap of at least two genes between the input list of genes (from SNP2GENE) and the gene sets that were significantly enriched at a maximum adjusted p value threshold of 0.05. Multiple test correction for gene-set enrichment was performed using the Benjamini-Hochberg (FDR) method 44 .
Genetic correlation. A genetic correlation analysis was performed using LD score regression 45 . We analyzed traits available through the OpenGWAS platform 46 , specifically using the ieu-a batch, which has been well described elsewhere 47 . These summary statistics were filtered to only include datasets with more than 2000 male and female samples, and only those reported in European ancestry groups, yielding 149 datasets. These correlations were also performed in the absence of the APOE locus. All SNPs within the region of 19:45236729-45618959 (hg19) were excluded in this analysis. www.nature.com/scientificreports/ performed including and excluding the APOE locus in the GWAS. Phenotypes with more than 20% missing answers were filtered out. We adjusted for sex, age at recruitment, Townsend deprivation index at recruitment, genotype measurement batch, and the first ten principal components. In addition, we considered phenotype categories with a minimum size of 200 answers and converted fields with multiple instances to categorical (multiple) fields as implemented in PHESANT. In total, 1424 traits were analyzed. p values were adjusted for multiple testing correction using Bonferroni.