Identifying individuals with high risk of Alzheimer’s disease using polygenic risk scores is most accurate when using all genetic information.

There is little agreement regarding the approach and optimal p-value threshold of SNPs to calculate genetic risk scores for Alzheimer’s disease (AD). This reflects a fundamental underlying debate on the polygenic versus oligogenic disease architecture. We re-investigated the assumptions underlying the choice of specific p-value thresholds defining genetic loci used to determine polygenic risk scores (PRS). We find the optimal p-value threshold for SNP selection is 0.1, which supports the polygenic architecture of AD. We found that previous studies supporting an oligogenic model of AD did not take account of the reduction of APOE-ε4 allele frequency in older individuals, which skewed the results towards lower p-value thresholds and eclipsed the contribution of genes associated with AD at higher p-values. The polygenic approach to AD is also effective in identifying individuals at high or low AD risk when only APOE-ε3 homozygous individuals are considered. We also introduce the standardisation of PRS against population data, which ensures comparability of the PRS between studies. In conclusion, our work demonstrates that AD is fundamentally a polygenic disease and that stratification of populations for AD risk should take the full PRS into account.


Introduction
Alzheimer's disease (AD) is the most common type of dementia, and mainly affects the elderly population. AD is a progressive condition, which means that clinical features develop gradually over many years before diagnosis 1 . The ability to predict AD risk before disease onset is of great importance for personalised prevention and intervention therapies, stratifying people for clinical trials, or the selection of candidates for functional experimental studies.
The common genome-wide significant variants discovered through GWAS have small individual effects, with the exception of APOE-ε4 2 . Findings on the optimal p-value threshold (pT) of SNP association with AD for inclusion in the polygenic risk score (PRS) suggest a range of thresholds from pT≤5e-8 to pT≤0.5 3−6 . PRS is used as a global term for any number of SNPs included in a risk score. Although it is clear that many genes are involved in disease development and progression, there is no agreement in the field as to whether AD is polygenic or oligogenic. We define as an oligogenic risk score (ORS) any PRS that includes only SNPs associated with AD risk (pT≤1e-5), as in Zhang et al. 2020 7 . The debate in the field became quite heated with recent papers strongly arguing in favour of an oligogenic view of AD 7,8 . However, substantial evidence suggests that the risk of AD is polygenic, similar to other major neurological disorders 3,6 . Given the important implications for the field, we set out to re-investigate the various methodologies and assumptions used for the different calculations, to see whether the disparate viewpoints can be reconciled and to present the arguments from both sides to a broader readership, so that readers can make an informed decision on the use of PRS in their studies. We further present an approach to standardise PRS using population-based datasets to facilitate its utility for research and future clinical applications.
About 35% of life-time risk of dementia is modifiable by factors such as education, vascular aspects, social deprivation, etc. 9 , which has potentially led to the decrease in incidence of dementia over the last decades 10 . As a consequence, while AD cases are (relatively) easy to detect by clinical assessment (although dementia also has other causes), comparative control samples are likely to be enriched with future AD cases who are yet to show symptoms. Furthermore, if controls are enrolled from a population and/or are younger than cases, not only are a large proportion of them likely to develop AD given time 11 , but also the genes with small effect sizes associated with AD due to age-related pathological changes 12 can be overlooked. For example, the ε4 allele is associated with earlier age at onset 13 but the ε4 allele frequency in a population decreases from 0.18 to 0.09 with increasing age 14 . A recent study comparing AD cases with relatively young age at onset, as extreme cases, with centenarian controls 15 observed that the effect sizes of the GWAS significant SNPs in that study are on average twice as high as those identified by the original GWAS studies, which confirms the importance of controls being age-matched to or even older than cases. It has been shown that the PRS contribution to AD risk differs with age and APOE-ε4 allele status. For example, the effect of PRS (pT ≤ 0.5) is more pronounced in older people 16 , and the effect of oligogenic risk scores constructed using SNPs with an association pT ≤ 1e-5 is greater in ε4 homozygotes 8 . Based upon these observations, we hypothesised that unaccounted age-related genetic differences (in particular the APOE-ε4 age-dependent frequency) lead to the disagreement about the optimal p-value threshold and the consequent debate about oligogenic (ORS) vs polygenic (PRS) disease models.
We tested this hypothesis in simulated data for 10,000 cases and 10,000 controls, varying the ε4 frequencies and the PRS distribution parameters with age, and confirmed this in a real dataset of 549 AD cases and controls.
The accuracy of predicting disease in the individuals at the extremes of the PRS distribution is high 17 . However, the choice of the PRS calculation methodology may lead to identification of different sets of individuals with high/low risk in these extremes of the distribution curve. All methods for PRS calculations attempt to improve the signal-to-noise ratio by including fewer SNPs while keeping the most informative ones; of these, PRS(P + T) is the simplest. Bayesian-based methods use all SNPs and offer strategies to adjust the effect sizes for LD, instead of LD pruning [18][19][20][21] . Functionally informed Bayesian approaches vary the strength of LD-adjustment for each SNP based on its functional annotations. As a result, SNPs with low or medium functional annotation score will have their effect sizes directly scaled down, whereas SNPs with high effect sizes in large LD blocks will be adjusted for LD, but will still be promoted due to less penalisation compared to other SNPs. These methods may reach higher prediction accuracy in a population, but the posterior SNP effect sizes will differ from the "true" effect sizes if they were obtained from, e.g., multivariate regression 22 . Having in mind the goal of robust identification of AD PRS extremes, we explore a variety of PRS generation approaches.
Finally, to choose individuals with high and low PRS in a reliable, replicable and comparable manner, we investigated PRS standardisation against a population.

Data sets and Quality Control
The 1000 Genomes (1000G) Project 23 applied whole genome sequencing to individuals from different populations in order to compile a detailed resource of common human genetic variation. In this study we only consider individuals from a European population, N = 503.
The UK Biobank (UKBB) is a large prospective cohort of approximately 500,000 individuals from the UK containing extensive phenotypic and genotypic data which is still being collected 24 . Participants recruited were aged 39-73 years with a mean age of 56.8. The data here were used under UKBB approval for application 15175 "Further defining the genetic architecture of Alzheimer's disease" and contain 443,018 individuals after Quality Control (QC) analysis. Additional information can be found at (Data Citation1).
HipSci (Human Induced Pluripotent Stem Cell Initiative) 25 is an initiative which is generating a large, high quality reference panel of human iPSC lines for the research community. These are created from tissue donations from both healthy volunteers and patients from particular rare disease communities. There were 1,228 samples from healthy volunteers available from this study.
ADNI (Alzheimer's Disease Neuroimaging Initiative) is a longitudinal study that was developed for the early detection of AD with the use of clinical, genetic and imaging data 26 . The data was collected from 900 participants between ages 55-90. Initially, participants were followed for 2-3 years with repeated imaging scans and psychometric measurements (ADNI1). The study was subsequently extended with the addition of new participants (ADNI-GO and ADNI2). Longitudinal data contained information on clinical assessments from the first, baseline visit to the latest available visit with mean follow up time approximately 5 years. Genetic data was available for 770 participants who provided written consent. More information can be found at (Data Citation2).
ROSMAP - Religious Orders Study (ROS) and the Rush Memory and Aging Project (MAP) are both ongoing longitudinal clinical-pathologic cohort studies of aging and AD. Older participants were recruited without dementia and multi-layer data were collected that include structural and functional neuroimaging, quantitative clinical phenotypes, neuropathologic and neurobiological traits, multi-level omics and genetics [27][28][29] . The data were downloaded from (Data Citation3, Data Citation4), resulting in 1,196 samples with available genetic information. ROSMAP data can be requested at (Data Citation5).
MSBB (The Mount Sinai Brain Bank) study generated gene expression, genomic variant, proteomic and neuropathological data from brain specimens. Clinical dementia rating scale (CDR) was conducted for assessment of dementia and cognitive status 30 . The data was downloaded (Data Citation6, Data Citation7), resulting in 349 samples with available genetic information.
MAYO - Mayo Clinic Brain Bank is a post-mortem cohort that contains neuropathological, genetic, biochemistry and cell biology data. The samples that are used here are described in MAYO eGWAS 31 . Data was available to download (Data Citation8, Data Citation9), resulting in 349 samples with available genetic information.
All standard Quality-Control (QC) steps were performed separately in each dataset using PLINK 32 ([...] Table 1 and Supplementary Table 2). This data will be referred to as the case-control dataset for the remainder of the manuscript.

Primary PRS calculation (P + T)
For the PRS calculation we used the summary statistics from the largest available clinically assessed case-control GWAS study on AD 2 (N = 63,926) to generate genetic scores for all participants in the cohorts described above as the weighted sum of the risk alleles. PRS were generated with the PLINK genetic data analysis toolset 32 (Data [...] within the sample and b) against population cohorts. For the latter, the dataset was merged with the population data, PCs were derived on the merged data, then the data was standardised using the mean and standard deviation (SD) from the population subsample. Table 1 details the description for each of the PRS models used throughout this manuscript. We also computed PRS using the whole genome data without any prior pruning and thresholding with LDpred-inf, PRS-CS and LDAK, but software issues prevented us from being able to run this with SBayesR. The traditional approach (PRS(P + T)) 33 requires additional LD pruning. PRSice 34 is software that implements the PRS(P + T) method automatically, and so the same LD-pruning parameters were specified for this approach. LDAK 18 does not require LD-pruning and calculates PRS adjusting SNP effect sizes for LD by reducing the contribution of SNPs in regions of high LD. LDpred-inf 19 , PRS-CS 20 and SBayesR 21 are all Bayesian approaches which use estimates of SNP effect sizes based on SNP-based heritability and also account for regional LD structure. LD was estimated using the case-control dataset for LDpred-inf and SBayesR and the 1000 Genomes data for PRS-CS (as this was the only option available in the PRS-CS software). All methods were otherwise implemented using default options. The PRS generated were standardised against the 1000 Genomes population data.
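As an illustration of the "weighted sum of the risk alleles" with a p-value threshold, the following is a minimal sketch only, not the PLINK or PRSice implementation. The SNP identifiers, effect sizes and p-values are invented, the function name `prs_p_t` is hypothetical, and the SNP set is assumed to have been LD-pruned already; treating a missing genotype as a dosage of 0 is also an assumption made here for brevity.

```python
from typing import Dict, Tuple

# Hypothetical GWAS summary statistics: SNP id -> (effect size beta, p-value).
# All values are invented for illustration.
SUMSTATS: Dict[str, Tuple[float, float]] = {
    "snp_a": (0.20, 1e-9),
    "snp_b": (0.05, 0.03),
    "snp_c": (-0.04, 0.08),
    "snp_d": (0.02, 0.40),
}


def prs_p_t(dosages: Dict[str, float],
            sumstats: Dict[str, Tuple[float, float]],
            p_threshold: float) -> float:
    """Sum of beta * risk-allele dosage over SNPs with p <= p_threshold.

    Assumes the SNPs in `sumstats` are already LD-pruned. A SNP missing
    from `dosages` contributes 0 (an illustrative simplification).
    """
    return sum(
        beta * dosages.get(snp, 0.0)
        for snp, (beta, p_value) in sumstats.items()
        if p_value <= p_threshold
    )


# One hypothetical individual: risk-allele counts (0, 1 or 2) per SNP.
person = {"snp_a": 1, "snp_b": 2, "snp_c": 0, "snp_d": 1}

ors_score = prs_p_t(person, SUMSTATS, 1e-5)  # oligogenic-style threshold
prs_score = prs_p_t(person, SUMSTATS, 0.1)   # threshold favoured in the text
print(round(ors_score, 2), round(prs_score, 2))
```

Raising the threshold from 1e-5 to 0.1 brings `snp_b` and `snp_c` into the score, which is the only difference between the two calls.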

Statistical analysis
The case-control association analysis was performed using logistic regression with the glm() function in R (Data Citation11). The prediction accuracy was estimated in terms of a) area under the receiver operating characteristic curve (AUC) and b) R², the proportion of the variance explained by the regression model. The extremes at ± 2 SD were compared in terms of OR with 95% Confidence Intervals (CI), AUC, cases and controls at each tail of the PRS distribution, and pairwise overlap between the extremes for all methods. For the PRS extremes we compare the results of ORS (pT ≤ 1e-5) and PRS (pT ≤ 0.1), including the PRS.AD model. We used the Haldane correction 35 in instances when cell counts were zero in the 2 × 2 contingency table.
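The odds ratio calculation for the extremes can be sketched as follows. This is an illustrative sketch, not the analysis code: the function name `odds_ratio_haldane` is hypothetical, the correction is taken to be the usual Haldane-Anscombe form (adding 0.5 to every cell), and the confidence interval uses the Woolf log-scale standard error, which is an assumption since the text does not state how the intervals were obtained.

```python
import math
from typing import Tuple


def odds_ratio_haldane(a: float, b: float, c: float, d: float,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and approximate 95% CI for a 2 x 2 table.

    a: cases in the high-PRS tail      b: controls in the high-PRS tail
    c: cases in the low-PRS tail       d: controls in the low-PRS tail
    """
    # Haldane-Anscombe correction: add 0.5 to every cell, applied here
    # only when at least one cell count is zero.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    # Woolf standard error of log(OR); assumed, not taken from the paper.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts: no controls fall in the high-PRS tail, so the
# correction is triggered.
print(odds_ratio_haldane(10, 0, 2, 8))
```

With few individuals in each tail the standard error is large, which is why intervals computed this way are wide even when the point estimate is high.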

Simulation study
Independent genotypes were simulated in a sample of 10,000 cases and 10,000 controls. APOE-ε4 allele frequency was set at 0.142 in controls and 0.356 in cases 36 . For simplicity, we assumed that the age of cases is above the late onset (e.g., over 85) but the age of the controls is below the average early onset (e.g., below 60 years). To estimate the number of controls who will develop the disease by age 84, we used results from 13 , which report AD in 91% of ε4ε4 homozygotes with a mean age at onset of 68 years, in 47% of ε4 heterozygotes at 76 years, and in 20% of ε4 non-carriers at 84 years, suggesting about 28% "hidden" or putative cases among the controls. Then we re-simulated ε4 genotypes with slightly reduced allele frequency (f = 0.355) in cases and slightly elevated allele frequency (f = 0.36) for the putative cases among the controls, so the joint allele frequency is ~ 0.356 for the true cases (10,000 + 2,800) and for the 10,000 young population controls matching the distribution of ε4 frequency by age 14 .
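The figures quoted above can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions for the ε4 genotypes in controls, which is an assumption introduced here and not stated in the text; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def genotype_freqs(f: float) -> Tuple[float, float, float]:
    """Frequencies of (e4e4, e4 heterozygote, non-carrier) for e4 allele
    frequency f, assuming Hardy-Weinberg proportions."""
    return f * f, 2 * f * (1 - f), (1 - f) ** 2


def hidden_case_fraction(f: float, risk_by_genotype: Iterable[float]) -> float:
    """Expected fraction of controls who go on to develop AD."""
    return sum(g * r for g, r in zip(genotype_freqs(f), risk_by_genotype))


def pooled_allele_freq(groups: Iterable[Tuple[int, float]]) -> float:
    """Allele frequency in the union of groups given (size, frequency) pairs."""
    groups = list(groups)
    total = sum(n for n, _ in groups)
    return sum(n * f for n, f in groups) / total


# Controls: e4 frequency 0.142; AD frequency 91%, 47% and 20% for e4e4,
# e4 heterozygotes and non-carriers respectively (values from the text).
hidden = hidden_case_fraction(0.142, (0.91, 0.47, 0.20))

# True cases: 10,000 observed cases (f = 0.355) plus 2,800 putative
# cases among the controls (f = 0.36).
joint = pooled_allele_freq([(10_000, 0.355), (2_800, 0.36)])

print(round(hidden, 2), round(joint, 3))
```

Under these assumptions the hidden-case fraction comes out at about 0.28 and the pooled ε4 frequency at about 0.356, in line with the values used in the simulation.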

Optimal p-value threshold
In our earlier work on PRS in AD 3,37 we observed that using the directly genotyped APOE isoforms ε2 and ε4 as separate terms in the regression model, in addition to the PRS excluding the APOE region (PRS.AD), provides higher prediction accuracy than modelling the APOE region as part of a full PRS. In the case-control dataset presented here, we observed that the optimal p-value threshold for the PRS depends upon how the APOE effect is accounted for. We make the assumption that the population controls are younger than cases for our simulations, as this is often observed in real studies. This implies that some of the control population have not yet reached the age of disease onset. Based upon ε4 frequency and studies of ε4-dependent age at onset 13 [...] (Supplementary Fig. 1). In particular, ORS.full has an advantage over PRS.full; however, when APOE is accounted for separately in addition to PRS.no.APOE, PRS.AD has the best prediction accuracy (AUC) and variance explained (R²).
Informed by the simulation results, we explored the ε4 allele frequencies in the case-control dataset with age (see Fig. 1 (A), and Supplementary Table 2). As reported in other studies, the ε4 allele frequency in this data set decreases with age, the ε3 frequency increases and the ε2 frequency remains approximately the same. Figure 1 (B) and (C) shows that ε4 frequency reduces faster in cases than in controls (pink line). The oligogenic risk score (ORS.no.APOE) (based on SNPs with pT ≤ 1e-5) also decreases in cases with age but is on average higher than in controls, with the highest being in ORS.no.APOE for ε44 cases as reported in 8 . Contrary to ORS.no.APOE, the mean of PRS.no.APOE (blue line) is higher in older cases and lower in older controls 16 . Thus, because of the changing allelic frequencies of APOE genotypes over age, it is clear that the APOE genotype by itself and the ORS.no.APOE become much less accurate predictors in older cases, while the reverse is seen with the PRS.no.APOE score. Clearly, APOE and ORS will serve as better predictors of AD risk at younger ages. The PRS increases with age; whether this is a true effect or is due to random variation requires further investigation and replication. Figure 1 shows that the net age effect for the sum of ORS and PRS is smaller than the separate score changes with age. Since these changes are in opposite directions, they cancel each other out if taken as a sum. Moreover, the net effect is approximately the same in cases and in controls. This net effect corresponds to the model that is referred to as "polygenic" in the field and leads to a conclusion in favour of an "oligogenic" model. However, the differential age effect, leveraging the polygenic disease architecture, can only be discovered when considering APOE (and/or ORS) and PRS.no.APOE separately. Adjusting the combined score for age only corrects for the small net effect. Thus, these sample and simulation data demonstrate that even though the ORS is a good predictor for AD at younger ages, it is mainly driven by the age-specific APOE allele frequency distribution.
[...] against 1000 Genomes (Supplementary Fig. 3, Supplementary Table 4), it can be clearly seen that, as expected, the PRS distribution of the population lies between controls (shifted to the left) and cases (shifted to the right).
In addition, the population-based standardisation increases the variation in the case-control sample, implying more cases and controls falling above and below, respectively, a predefined PRS cut-off (e.g. 2SD).

Individuals at the extreme tails of the PRS distribution
We next investigated to what extent the PRS score can be used to identify, with good confidence, individuals with high and low risk of AD. We define PRS extremes as individuals with a score exceeding ± 2 SD from the data mean or from the population mean, depending on the method of standardisation. We assess the effects of 1000G-based standardisation on a human iPSC resource, i.e. HipSci, which is population based, as well as on a case-control dataset. For the PRS.AD model when the HipSci sample is standardised within the sample, 11 positive and 2 negative extremes are observed. When standardised against the 1000G population cohort there are 6 positive and 5 negative extremes. It appears that standardisation of the HipSci data against the population provides no advantage above considering them internally, which is not surprising as the PRS distributions in the population and in the population-based HipSci should be the same.
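The definition of extremes under the two standardisation schemes can be sketched as follows. This is a toy illustration with invented scores, not the study pipeline; the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def standardise(scores: Sequence[float],
                reference: Sequence[float]) -> List[float]:
    """Z-scores of `scores` using the mean and SD of `reference`.

    Pass the sample itself as `reference` for within-sample
    standardisation, or population scores for population-based
    standardisation. stdev() is the n - 1 estimator (an assumption).
    """
    mu, sd = mean(reference), stdev(reference)
    return [(s - mu) / sd for s in scores]


def extremes(z_scores: Sequence[float],
             cutoff: float = 2.0) -> Tuple[List[int], List[int]]:
    """Indices of individuals above +cutoff and below -cutoff."""
    high = [i for i, z in enumerate(z_scores) if z > cutoff]
    low = [i for i, z in enumerate(z_scores) if z < -cutoff]
    return high, low


# Invented raw scores: a widely spread case-control sample and a
# narrower population reference.
sample = [-2.0, -0.2, 0.1, 0.3, 2.5]
population = [-1.0, -0.5, 0.0, 0.5, 1.0]

within = extremes(standardise(sample, sample))
against_population = extremes(standardise(sample, population))
print(within, against_population)
```

In this toy example the wide spread of the sample inflates its own SD, so no individual passes the ± 2 SD cut-off within the sample, whereas the same two tail individuals are flagged once the narrower population SD is used.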
The results become much more interesting when considering a case-control dataset. Now the number of positive and negative extremes is greater for PRS standardised against the population than within the sample (see Table 3 and Supplementary Table 4). The highest OR and prediction accuracy are observed with PRS.AD (OR = 124, AUC = 88.2) and the lowest with ORS.full (OR = 10, AUC = 74.6). Often, when selecting individuals at the extremes of risk for AD, researchers may want to understand risk beyond APOE. Thus, in Table 3 we also present the results for extremes selection in the ε3 homozygotes using a score excluding the APOE region. As expected, the number of extremes is lower when APOE is excluded, but the accuracy remains high with PRS.no.APOE (OR = 95, AUC = 95.7). The ORS.no.APOE accuracy for ε33 carriers drops to AUC = 56.3 with an OR smaller than 1, showing that the prediction is in the wrong direction. Therefore, the oligogenic model is not useful for discrimination between ε33 cases and controls in these data.
Legend: In the case-control dataset, the number of cases (N cases) and controls (N controls) in the PRS extremes were identified and the prediction accuracy of these extremes was assessed with AUC and OR (95% Confidence Intervals) when standardised a) using sample mean and SD b) using mean and SD from 1000 Genomes data. We define PRS extremes as individuals with a score exceeding ± 2 SD from the data mean or population mean. Three models were used for the whole dataset [...] Supplementary Fig. 4).
It can be observed that the greatest number of shared extremes is between PRS(P + T) and PRSice, which, again, was anticipated given the methodological similarities of these approaches. The smallest number of shared identifications is between SBayesR and other methods. Overall, the individuals identified with LDpred-inf, PRS(P + T), PRSice and PRS-CS overlap considerably, in contrast to LDAK and SBayesR.
It can be seen that there are fewer negative extremes identified by ORS than by PRS in all methods. This is explained by the fact that ORS is predominantly driven by APOE-ε4, with the consequence that ORS is not very good at identifying negative extremes. Additional plots for mapping 5 top and 5 bottom PRS.no.APOE extremes in ε33 individuals across different methods are presented in Supplementary Fig. 5. The individuals with the most extreme PRS in both the positive and negative tails are consistent between PRS(P + T) and PRSice, while the identified extremes may differ substantially across the other PRS methods. We advise using PRS(P + T) or PRSice for the selection of individuals at risk because the SNPs contributing to the PRS can be easily identified, which is crucial for future experiments.

Discussion
PRS could be useful to identify individuals at risk of disease development; however, the accuracy of current methods for the distribution as a whole precludes the use of PRS in the clinic (too many false positives and false negatives).
Results of this and other studies 17 confirm that identification based on having a PRS above/below a certain threshold provides much better prediction accuracy than attempting to classify all individuals in a dataset.
In this study we provide ample evidence that AD should be modelled as a polygenic disease. In fact, risk of AD is not different from other diseases where liability to disease is continuous, and disease becomes evident after a threshold has been passed (the liability threshold model). In the threshold model, liability for a genetic disorder is (normally) distributed across the population and polygenic risk scores are a measure of disease liability 39 . The relative contributions of alleles of various effect sizes and frequencies are not fully resolved; while common alleles of small risk, captured by genome-wide association study arrays, capture between a third and a half of the genetic variance in liability, APOE alone substantially increases risk for the disorder 40 . A major problem with AD and using PRS to categorise people at risk is the age of the study participants. Here we show that APOE-ε4 carriers have a lower burden of common AD risk alleles of small effect, implying that under the liability threshold model, the APOE risk is substantial enough to develop the disease with a lesser burden of common risk alleles with small effects. Since allelic variation at the APOE locus impacts survival, altering the age at onset of AD and risk of other conditions (hyperlipidaemia, atherosclerosis, cardiovascular disease [41][42][43][44][45][46]), the frequency of APOE-ε4 goes down with age whereas genetic liability to AD measured by the PRS increases. In other polygenic diseases like schizophrenia, penetrance of the phenotype is mostly complete at 40 years of age 47 , while for AD even at 80 there are still individuals at risk who have not yet developed AD. Looking at the means of the oligogenic risk score and the polygenic risk score across age groups, we found that, following the pattern of APOE-ε4 frequency, the ORS decreased with age in cases but was on average higher than in controls. Conversely, PRS increased with age in cases, but decreased in controls.
This can be explained if APOE and most of the GWAS significant SNPs used to calculate oligogenic scores point to genes which are in the same or overlapping pathways [48][49][50][51][52] . This would also explain why adding the oligogenic scores to the calculation does not improve prediction very much compared to APOE genotype alone (see Table 2). An important point here is that ORS is likely not very suitable to identify genes that provide protection, while PRS becomes lower in controls.
Since 1) age is the major confounding factor, 2) APOE is strongly associated with the age at onset, and 3) it is difficult to disentangle the aging and disease pathogenic components, we suggest modelling APOE and PRS.no.APOE as two independent predictors or using PRS.no.APOE as a predictor in subsamples stratified by the APOE genotype. In this study, the results show that the prediction accuracy of the oligogenic risk score was not better than using the effect of the APOE gene alone. The best performance overall was found here (and in our earlier study 3 ) using the model with two variables (i.e. PRS.AD), APOE and PRS at pT ≤ 0.1, which excludes the APOE region (PRS.no.APOE).
These differences in prediction modelling also explain why different optimal pTs may have been observed in other studies 5,7,8 .
We also looked at individuals at the extremes of the PRS distributions (above and below 2SDs) and found that both OR and AUC are very high in the whole sample (OR = 124, 95% CI = [6, 2707]) and for the ε3-homozygous individuals (OR = 95, 95% CI = [3, 2683]) using the proposed approach. The confidence intervals for the ORs are of course broad, as the sample size is small when looking at the extremes, but the accuracy remains high. The ORs for the extremes identified by ORS were smaller (OR = 10, 95% CI = [1, 75]) and the ORs had narrower CIs, suggesting that this model identifies a greater number of extremes than the polygenic model, but with poorer accuracy. The oligogenic score was not suitable to identify the extremes in the ε33 individuals with OR = 0.6, i.e. misclassifying high ORS cases as controls and vice versa.
Notably, the prediction accuracies using p-value thresholds of 0.1 and 0.5 (the latter reported in earlier work by us and others 36 ) were similar. The reduction of the optimal pT from 0.5 to 0.1 is likely due to the improved estimation of SNP effect sizes, imputation quality and increased GWAS sample size in the latest GWAS 2 in comparison to the earlier GWAS study 53 . Similar findings have been observed for other polygenic disorders, e.g. in Schizophrenia and Bipolar datasets of the Psychiatric Genetic Consortium 54 .
Comparing six PRS calculation methods, we conclude that the prediction accuracy in the whole sample is very similar; however, the individuals' scores differ. The choice of the individuals at the extremes of the PRS distribution was concordant across PRSice, LDpred-inf, PRS-CS and PRS(P + T). LDAK and SBayesR showed more differences. Due to the lack of transparency of the Bayesian approaches, it is difficult to explain why certain individuals are at high polygenic risk whereas others are not, compared to PRS(P + T), where the SNP effect sizes and the LD pruning parameters are traceable. All these reasons allow us to conclude that for AD, PRS(P + T) is the method of choice.
An interesting and important conclusion of our study is that projecting a relatively small case/control sample onto the general population results in a much better representation of risk in the study. Since case-control samples are enriched for cases as compared to the general population, the PRS distribution of the former is a mixture of two distributions (cases and controls) with distinct means. The PRS distribution for a population sample is likely to have a mean between the means of cases and controls, and a smaller variance (and hence, standard deviation) than that of the combined case-control sample. Standardising the case-control sample to the population sample will result in the shift of the individual scores in the case-control samples to the positive or negative side of the population mean. This makes the detection of more patients at high risk or with high protection possible. Increasing the size of the population sample will provide better estimates of the population PRS mean and SD (since the standard errors of these estimates will decrease as N increases). Note that including a larger population sample will proportionally increase the total number of people in the above and below 2SD categories in the joint (population plus case-control) sample, but that this will not necessarily be enriched by the individuals from the case-control sample. Hence the use of the 1000G population is an easy and straightforward way to obtain this beneficial effect.
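The variance argument above can be made explicit under simplifying assumptions that are introduced here and not stated in the text: the PRS $S$ has the same within-group variance $\sigma^2$ in cases and controls, with means $\mu_1$ and $\mu_0$ respectively, and a fraction $\pi$ of the sample are cases. The mean and variance of the mixture are then

```latex
\begin{align}
\mathrm{E}[S]   &= \pi\,\mu_1 + (1-\pi)\,\mu_0, \\
\mathrm{Var}(S) &= \sigma^2 + \pi\,(1-\pi)\,(\mu_1-\mu_0)^2 .
\end{align}
```

The second term grows with $\pi$ on the interval $[0, 0.5]$, so a population sample, in which the proportion of (current or future) cases is lower than in a case-control sample with $\pi \le 0.5$, has the smaller variance, and its mean lies between $\mu_0$ and $\mu_1$. Dividing case-control scores by this smaller population SD therefore places more individuals beyond a fixed cut-off such as ± 2 SD.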
In conclusion, identifying individuals at high and low polygenic risk is very important for further work to understand how genetic risk translates into mechanisms of disease 40 . It might also become very relevant for drug development efforts targeting precise mechanisms of disease, as the PRS scores could be used to select small samples of patients in which proof of concept for the treatment can be obtained before testing the drug in larger cohorts. We show here that for AD, the optimal p-value threshold is pT ≤ 0.1, and the PRS calculation should account for the age-related genetic differences in cases and controls either by modelling APOE separately to the PRS or matching cases and controls for age and APOE status. This adjustment will be refined when we have a better idea of which genes are contributing to the disease aetiology via aging and which are directly on the pathology pathway.

Declarations
Contributions
GL, EB and JSH performed the statistical analysis and wrote the manuscript; AS and MF reviewed and approved the manuscript and added valuable comments; JW, BS and VEP conceived and designed the study and wrote the manuscript.