## Introduction

Genome-wide association studies (GWAS) have revealed that the inherited genetic component of most traits that are not caused by variation in a single gene is highly polygenic. Dozens to thousands of single nucleotide polymorphisms (SNPs) can be combined to produce a polygenic risk score (PRS) representing an individual's genetic propensity for a given trait or disease.

There is much enthusiasm for using PRS to inform individuals about their risk of future health conditions, either as stand-alone information or combined with non-genetic data in integrated risk scores1,2. PRS have been proposed for a wide variety of settings, such as prioritising people for disease screening, informing the prescription of preventive medicines, and even embryo selection3. Variants in single genes associated with diseases are already used in clinical settings; however, recent large-scale studies have found that PRS could potentially identify a greater proportion of at-risk individuals4.

Early in the development of PRS, researchers5,6 quantified the degree to which a PRS with a limited number of SNPs would misclassify people when compared with future PRS containing many additional variants. At the time, it was thought that once dozens or even hundreds of SNPs were included, diminishing returns would set in and the PRS would be relatively stable. This perception appeared to be supported by the fact that, in many large multi-study consortia, the additional SNPs now being identified have very small odds ratios (OR; as low as 1.02 per allele, or lower) and the area under the receiver operating characteristic curve (AUC) of newer versions of a PRS is only minimally higher than that of previous versions.

Recent discussion on the use of PRS in the clinic has largely focused on the reporting standards for the derivation and archiving of PRS7,8, the health economic value of PRS, the potential contribution of PRS to health disparities given the limited databases available for non-European ancestry populations9, how to use PRS for patient benefit10,11, and the means of communicating PRS to patients or members of the public.

However, in addition to the implementation of PRS in practice, it is important to consider the methodological choices made when deriving different PRS. During the construction of PRS, there are multiple design options for deciding the number of SNPs to include and for assigning an appropriate weight to each SNP. Consequently, multiple sets of SNPs exist, resulting in multiple PRS for the same trait. For example, the 313-SNP PRS for breast cancer12 was developed using hard-thresholding stepwise forward regression, whereas the 118k-SNP PRS13 was selected by the penalised regression method "lassosum" and the highest pseudo-R². However, these PRS are typically compared at a population level using metrics such as the AUC or OR, and limited attention has been paid to how they differ from each other in risk prediction for individuals. Therefore, it is important to understand how the use of different PRS affects an individual’s classification of risk for future disease, as this has important implications for the use of PRS in wider practice.

## Materials and methods

### Study populations

We used data from the UK Biobank (UKB), a large-scale, population-based prospective cohort study of approximately 500,000 individuals aged 40–69 years at recruitment, enrolled across the United Kingdom between March 2006 and October 2010. The full details of the genotyping and imputation are described elsewhere14,15.

Our study populations for each of the three disease outcomes are defined as follows:

• The breast cancer eligible population was women who had not had breast cancer, carcinoma in situ or mastectomy prior to baseline.

• For the hypertension eligible population, we excluded individuals with missing or implausible systolic blood pressure (SBP) measurements (< 70 or > 270 mmHg) at baseline, and those with Major Adverse Cardiovascular Events (MACE) prior to baseline.

• The dementia eligible population was restricted to individuals without a diagnosis of dementia prior to baseline.

We further restricted to genetically White British individuals (UKB Data Field 22006), and excluded individuals who were related (3rd degree or higher), sex discordant, or outliers for genotype missingness or heterozygosity based on UK Biobank-derived sample quality control data (UKB Data Field 22020). This yields the final size N of each study population.

Disease ascertainment in UKB during the follow-up period utilised linkage to death registry, cancer registry, and Hospital Episode Statistics (HES). Hypertension was defined as SBP ≥ 140 mmHg at baseline; the International Classification of Diseases (ICD) codes for breast cancer and dementia can be found in Supplementary Tables 14–16.

### Selection of PRS

For each of the three disease outcomes, we selected a pair of recently published PRS to compare, typically published within two years of each other. The earlier PRS is denoted as PRS-A, while the more recent one is PRS-B.

When choosing PRS we identified scores that were derived using the same trait definition in primarily White European populations, to be appropriate for use in the UK Biobank, and that had been derived in large consortia for their respective disease. Where possible, we selected scores listed in the Polygenic Score Catalog (PGS Catalog), an online database that collects and curates PRS from across the literature and makes metadata available in a standardised way.

Table 1 summarises the characteristics of the chosen PRS for each trait, including the construction method. Supplementary Table 1 contains further information about each PRS, including source of weights (i.e. derivation dataset), population characteristics, and validation dataset.

• For breast cancer, PRS-A (313 SNPs, PGS ID: PGS000004)12 has been widely validated and is included in the current implementation of the BOADICEA breast cancer risk model16,17. For PRS-B, we used a score (118,388 SNPs, PGS ID: PGS000511)13 which was largely developed from the same Breast Cancer Association Consortium (BCAC) GWAS data as PRS-A18.

• For hypertension, PRS-A for SBP (267 SNPs, not in PGS Catalog)19 was selected from the literature (further details later in this Section). The effect sizes of PRS-B for SBP (884 SNPs, PGS ID: PGS000812)20 are derived from the International Consortium of Blood Pressure-Genome Wide Association Studies (ICBP), Million Veteran Program (MVP) and Estonian Genomic Centre of the University of Tartu (EGCUT).

• For dementia, both PRS-A (57 SNPs, PGS ID: PGS000812)21 and PRS-B (39 SNPs, PGS ID: PGS001775)22 used effect sizes from the International Genomics of Alzheimer’s Project (IGAP) GWAS23. PRS-A was constructed in January 2021 while PRS-B was developed in September of the same year. We selected PRS that did not contain the two APOE SNPs (rs429358 and rs7412), in order to prevent the APOE genotype from dominating the PRS.

We examined the overlap in SNPs between PRS-A and PRS-B for each disease, including those in high linkage disequilibrium (LD) (R² > 0.8).
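The overlap check can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `fraction_represented`, the SNP identifiers, and the pairwise LD lookup `ld_r2` are all hypothetical, and in practice the LD values would come from a reference panel.

```python
from typing import Dict, FrozenSet, Set


def fraction_represented(
    prs_a: Set[str],
    prs_b: Set[str],
    ld_r2: Dict[FrozenSet[str], float],
    threshold: float = 0.8,
) -> float:
    """Fraction of PRS-A SNPs that are in PRS-B or in high LD with a PRS-B SNP."""

    def represented(snp: str) -> bool:
        if snp in prs_b:
            return True
        # Pairs missing from the lookup are treated as not in LD
        # (an assumption of this sketch).
        return any(
            ld_r2.get(frozenset((snp, other)), 0.0) > threshold for other in prs_b
        )

    return sum(represented(snp) for snp in prs_a) / len(prs_a)


# Hypothetical identifiers and LD values, for illustration only.
prs_a = {"snp1", "snp2", "snp3", "snp4"}
prs_b = {"snp1", "snp5", "snp6"}
ld_r2 = {frozenset(("snp2", "snp5")): 0.93, frozenset(("snp3", "snp6")): 0.41}

overlap = fraction_represented(prs_a, prs_b, ld_r2)
```

Here `snp1` is shared directly and `snp2` is represented through its LD proxy `snp5`, so half of the PRS-A SNPs count as represented. The nested scan is quadratic in the number of SNPs; a real analysis of a score with over 100,000 SNPs would more likely index LD partners per SNP.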

We sought to avoid sample overlap, a potential pitfall for PRS24. Since we planned to calculate PRS in the UKB population (i.e. target cohort), we preferred PRS that were not derived in the UKB population. During the PRS selection stage, we examined the base GWAS cohorts used for PRS derivation and attempted to ensure that they did not contain our target cohort (i.e. UKB).

We were able to identify such PRS for breast cancer and dementia. For SBP, however, to the best of our knowledge no suitable PRS-A without sample overlap exists in the recent literature. We investigated all the available PRS for SBP in the PGS Catalog, and found that UKB was present in all the derivation populations. Given the extensive blood pressure measures and genetic data in the UKB, it is unsurprising that researchers would include UKB in their derivation population for PRS.

### Calculating PRS

We computed the PRS of an individual $$j$$ by the weighted sum of trait-associated SNPs,

$$PRS_{j} = \sum_{i=1}^{N} \beta_{i} \times dosage_{ij}$$

where N is the total number of SNPs in the score, $$\beta_{i}$$ is the effect size (or beta) of SNP $$i$$, and $$dosage_{ij}$$ is the number of effect alleles of SNP $$i$$ carried by individual $$j$$ (usually encoded as 0, 1 or 2).
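As a minimal sketch of this weighted sum, assuming per-SNP weights and dosages are held in dictionaries keyed by a SNP identifier: the function name `polygenic_score` and the example identifiers and values are hypothetical. SNPs with no dosage available (for example, those dropped at QC) are skipped, which is an assumption of the sketch rather than a statement of the study's handling.

```python
from typing import Mapping


def polygenic_score(weights: Mapping[str, float], dosages: Mapping[str, float]) -> float:
    """Sum of beta_i * dosage_ij over the SNPs that have both a weight and a dosage."""
    return sum(beta * dosages[snp] for snp, beta in weights.items() if snp in dosages)


# Hypothetical effect sizes (betas) and one individual's effect-allele dosages.
weights = {"snp1": 0.10, "snp2": -0.05, "snp3": 0.20}
dosages = {"snp1": 2, "snp2": 1, "snp3": 0}

score = polygenic_score(weights, dosages)  # 0.10*2 + (-0.05)*1 + 0.20*0
```

Imputed genotypes give fractional dosages between 0 and 2, which the same function accepts unchanged.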

We applied genetic quality control (QC) pipelines for both SNPs and samples. During SNP QC, we removed ambiguous SNPs (A/T or C/G SNPs with MAF > 0.49) and rare variants with MAF < 0.005; we only retained SNPs with high imputation quality (imputation information score > 0.4) (Supplementary Table 1). During sample QC, we excluded participants who were sex-discordant, outliers for missingness or heterozygosity, or related at 3rd degree or higher, using UKB Data Field 22020.
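The SNP-level filters described above can be expressed as a single predicate. This is a sketch under stated assumptions: the `Variant` record and its field names are hypothetical, and only the three thresholds (MAF > 0.49 for A/T or C/G SNPs, MAF < 0.005, information score > 0.4) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    rsid: str
    effect_allele: str
    other_allele: str
    maf: float   # minor allele frequency
    info: float  # imputation information score


# A/T and C/G allele pairs read the same on both strands.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def passes_snp_qc(v: Variant) -> bool:
    alleles = frozenset((v.effect_allele.upper(), v.other_allele.upper()))
    ambiguous = alleles in STRAND_AMBIGUOUS and v.maf > 0.49
    return (not ambiguous) and v.maf >= 0.005 and v.info > 0.4


# Hypothetical variants, one per filter outcome.
variants = [
    Variant("snp1", "A", "G", 0.20, 0.95),   # kept
    Variant("snp2", "A", "T", 0.495, 0.99),  # dropped: ambiguous with MAF > 0.49
    Variant("snp3", "C", "T", 0.001, 0.90),  # dropped: rare (MAF < 0.005)
    Variant("snp4", "G", "C", 0.30, 0.30),   # dropped: information score <= 0.4
    Variant("snp5", "C", "G", 0.30, 0.80),   # kept: C/G but MAF <= 0.49
]
kept = [v.rsid for v in variants if passes_snp_qc(v)]
```

Note that, as the text describes it, an A/T or C/G SNP is removed only when its MAF is also close to 0.5, since that is when the strand cannot be resolved from allele frequency.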

We then weighted the SNPs that passed our QC using the published effect sizes from the source paper for each score, given either in the supplementary materials or in the PGS Catalog, to compute PRS for those within the study population for each of the three disease outcomes.

### Quantifying the stability of PRS

In each disease-specific study population, we first calculated the correlation coefficient between each pair of PRS. We then computed the age- and sex-adjusted odds ratios (ORs) at various cut-points (e.g. top 1% or top 5% of the PRS), with the middle quintile of the PRS being the reference group.
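To make the cut-point comparison concrete, the sketch below computes a crude (unadjusted) odds ratio for the top fraction of a PRS against the middle quintile. It is a deliberate simplification: the age- and sex-adjusted estimates described above would need a regression model, which this example does not attempt. The function name `crude_odds_ratio` and the toy data are hypothetical.

```python
from typing import Sequence


def crude_odds_ratio(
    scores: Sequence[float], cases: Sequence[int], top_fraction: float
) -> float:
    """Odds of disease in the top fraction of the PRS relative to the middle quintile."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices by ascending PRS
    n_top = max(1, round(n * top_fraction))
    top = order[n - n_top:]
    middle = order[(2 * n) // 5:(3 * n) // 5]  # 40th to 60th percentile

    def odds(group: Sequence[int]) -> float:
        n_cases = sum(cases[i] for i in group)
        return n_cases / (len(group) - n_cases)

    return odds(top) / odds(middle)


# Toy data: 100 participants whose PRS equals their index.
scores = list(range(100))
cases = [0] * 100
for i in (40, 41, 42, 43):      # 4 cases among the 20 in the middle quintile
    cases[i] = 1
for i in (90, 91, 92, 93, 94):  # 5 cases among the 10 in the top 10%
    cases[i] = 1

odds_ratio = crude_odds_ratio(scores, cases, top_fraction=0.10)
```

In this toy example the odds are 5/5 in the top group and 4/16 in the reference group. The function raises `ZeroDivisionError` if a group contains no non-cases or the reference group contains no cases, so it is only suitable for groups with both outcomes present.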

We computed two versions of the AUC obtained from two separate logistic regression models for each continuous PRS:

1. Crude-AUC, where only the PRS was fitted, adjusting for genetic array and the first 5 principal components (PCs) of ancestry (UK Biobank Field 22009), to reflect the predictive performance of the PRS itself.

2. Multivariable adjusted AUC (Multi-AUC), where the model was further adjusted for age and sex (if applicable).

The outcome of these logistic regression models is the disease status (Yes/No), for each of the three diseases. The Crude-AUC measures the predictive ability of the PRS on its own, while the Multi-AUC measures this after having taken account of age and sex.
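For intuition about what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen case has a higher score than a randomly chosen non-case, counting ties as one half. This rank-based calculation is a stand-in, not the study's procedure: the AUCs described above come from logistic regression models that also include covariates, which this example ignores. The function name `auc` and the data are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a case outranks a non-case (ties count as 0.5)."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        (c > k) + 0.5 * (c == k) for c in case_scores for k in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores for two non-cases followed by two cases.
example_scores = [0.10, 0.40, 0.35, 0.80]
example_labels = [0, 0, 1, 1]

example_auc = auc(example_scores, example_labels)
```

Three of the four case/non-case pairs are ranked correctly here. The pairwise loop is quadratic, so a cohort-scale calculation would normally use a rank-sum formulation instead.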

The continuous net reclassification index (NRI) was used to compare PRS-A with PRS-B in multivariable logistic models (Table 2). The categorical NRI was used in cross-classification of PRS percentile risk categories. Percentage reclassification for participants who experienced the outcome is shown in Supplementary Tables 5–7 and 11–13, for top 1% and top 5% risk categorisations, respectively.

The 95% confidence intervals (CIs) of the AUC and NRI were estimated using 1,000 bootstrap replications.
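The continuous NRI and a bootstrap interval can be sketched as below. Several details are assumptions rather than statements of the study's method: the text does not say which bootstrap interval was used, so a percentile interval is shown; the record layout, the function names `continuous_nri` and `bootstrap_ci`, and the synthetic data are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (predicted risk under old model, predicted risk under new model, event 0/1)
Record = Tuple[float, float, int]


def continuous_nri(records: Sequence[Record]) -> float:
    """Net upward movement of risk in events minus net upward movement in non-events."""

    def net_up(group: List[Record]) -> float:
        up = sum(new > old for old, new, _ in group)
        down = sum(new < old for old, new, _ in group)
        return (up - down) / len(group)

    events = [r for r in records if r[2] == 1]
    non_events = [r for r in records if r[2] == 0]
    return net_up(events) - net_up(non_events)


def bootstrap_ci(
    records: Sequence[Record],
    statistic: Callable[[Sequence[Record]], float],
    replications: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample records with replacement."""
    rng = random.Random(seed)
    n = len(records)
    estimates = sorted(
        statistic([records[rng.randrange(n)] for _ in range(n)])
        for _ in range(replications)
    )
    lower = estimates[round(replications * alpha / 2)]
    upper = estimates[round(replications * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic data for illustration: 200 records, alternating event status.
data_rng = random.Random(42)
records = []
for i in range(200):
    old = data_rng.random()
    new = min(1.0, max(0.0, old + data_rng.uniform(-0.1, 0.1)))
    records.append((old, new, i % 2))

point_estimate = continuous_nri(records)
lower, upper = bootstrap_ci(records, continuous_nri)
```

The statistic is undefined for a resample that contains no events or no non-events, so this sketch assumes both groups are large enough for that not to occur.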

### Ethics approval and consent to participate

The UK Biobank study (https://www.ukbiobank.ac.uk) received ethical approval from the North West Multi-center Research Ethics Committee (REC reference: 11/NW/03820). All participants gave written informed consent before enrolment in the study, which was conducted in accordance with the principles of the Declaration of Helsinki. This study has been conducted under the UK Biobank application ID 33952.

### Patient and community involvement

The analyses presented here are based on existing data from the UK Biobank cohort study, and the authors were not involved in participant recruitment. To the best of our knowledge, no patients were explicitly engaged in the design or implementation of the UK Biobank study. No patients were asked to advise on interpretation or writing these results. Results from UK Biobank are routinely disseminated to study participants via the study website and social media outlets.

### Transparency statement

The lead author affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

## Results

Our study populations were N = 171,490 (incident cases = 6347) for breast cancer, N = 317,581 (prevalent cases = 137,649) for hypertension, and N = 335,689 (incident cases = 4460) for dementia. In comparing different PRS, we focused on two aspects. First, we examined the consistency of the selected SNPs and performance metrics. Second, we assessed the correlation between each pair of PRS, and the extent to which PRS-B gave the same predictions for individuals as PRS-A.

We found that fewer than 80% of SNPs in PRS-A were represented in PRS-B for all three diseases, after having taken LD into account (R² > 0.8). This is somewhat surprising, as one might expect a newer score (PRS-B) to incorporate most of the previously identified SNPs from PRS-A.

Table 2 presents the performance characteristics of each PRS against the corresponding disease outcome in UKB. In each case the more recent PRS-B was associated with a slightly higher OR than the earlier PRS-A. For example, the OR of breast cancer among women in the top 1% compared to those in the middle quintile was 3.41 for PRS-A and 3.94 for PRS-B. Their corresponding AUCs were only minimally different (0.638 vs 0.641), and the ROCs appear almost identical (Fig. 1). Similar results were obtained for hypertension (PRS-A OR = 1.83, AUC = 0.69; PRS-B OR = 2.18, AUC = 0.70) and dementia (PRS-A OR = 1.78, AUC = 0.80; PRS-B OR = 2.30, AUC = 0.80).

The Crude-AUC of the PRS were modest, within 0.55–0.64. After including age and sex in the multivariable model, we observed an expected increase in AUC. Despite similar Multi-AUC, PRS-A and PRS-B were not very highly correlated for any outcome, with their Pearson correlation coefficient $$r$$ only in the range of 0.51 to 0.65. Multi-AUC was noticeably higher than Crude-AUC for dementia, whereas the improvement was less prominent for breast cancer and hypertension. The likely explanation is that age is a more influential risk factor for dementia than for breast cancer and hypertension.

Table 2 also shows NRI with PRS-B being the updated model. The positive NRI indicates that PRS-B is better at correctly assigning people to the appropriate risk categories. The small positive NRI is in line with the slight increase in the AUC.

Compatible with these correlation coefficients, there was substantial reclassification of predicted risk according to percentiles of PRS-A and PRS-B for all three diseases. Table 3 shows that for women in the top 1% of breast cancer risk by PRS-A, only 23.1% were in the top 1% of risk by PRS-B. The equivalent percentages were 22.9% and 22.7% for hypertension and dementia, respectively (Supplementary Tables 3–4). We focused on the top 1% of risk because of the widely promulgated concept that these risks approximate the risks for monogenic traits4.

For participants in the top 5% (a more relaxed risk category than the top 1%) of risk for breast cancer by PRS-A, only 35.7% were in the top 5% of risk by PRS-B. The equivalent percentages were 35.8% and 40.0% for hypertension and dementia, respectively (Supplementary Tables 8–10).
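The reclassification percentages above are the share of people in the top fraction of PRS-A who are also in the top fraction of PRS-B. A minimal sketch of that calculation follows; the function names, participant identifiers, and scores are hypothetical, and ties at the cut-point are broken by sort order, which is an assumption of the sketch.

```python
from typing import Dict, Set


def top_fraction_members(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Identifiers of the participants in the top `fraction` of a score."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_top])


def top_concordance(
    scores_a: Dict[str, float], scores_b: Dict[str, float], fraction: float
) -> float:
    """Share of the top group by score A who are also in the top group by score B."""
    top_a = top_fraction_members(scores_a, fraction)
    top_b = top_fraction_members(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


# Hypothetical standardised scores for ten participants under two PRS.
scores_a = {"p1": 2.1, "p2": 1.8, "p3": 0.9, "p4": 0.5, "p5": 0.1,
            "p6": -0.2, "p7": -0.4, "p8": -0.9, "p9": -1.3, "p10": -2.0}
scores_b = {"p1": 1.9, "p2": 0.4, "p3": 1.7, "p4": 0.6, "p5": -0.1,
            "p6": 0.2, "p7": -0.6, "p8": -0.7, "p9": -1.5, "p10": -1.1}

shared = top_concordance(scores_a, scores_b, fraction=0.20)
```

In this toy example the top 20% by the first score is `p1` and `p2`, while the top 20% by the second is `p1` and `p3`, so only half of the first group keeps its classification.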

## Discussion

The clinical utility of PRS depends on the clinical validity of the predictions. Clinical validity is not only dependent on the information PRS provides on the risk of future events, but also on the stability of these estimates. Here, we demonstrate that for three common conditions (breast cancer, hypertension and dementia), the risk estimates derived from different PRS would result in very different information on risk of future disease being provided to a high proportion of individuals. Choice of the PRS may also influence the use of PRS as covariates or effect modifiers in epidemiologic analyses.

We found that the more recent PRS resulted in minimal increases in AUC compared to older PRS, in line with the small improvement measured by NRI. However, the PRS differed substantially in how they assigned participants to risk categories: a substantial proportion of individuals classified as very high risk by one PRS were not classified as such by the other. This suggests a major potential problem for the use of these PRS in clinical practice, given the changes in clinical recommendations associated with labelling a person in the same category of risk as a monogenic disorder. Our results demonstrated large differences across all percentiles of risk; although the clinical consequences at the lower percentiles may not be as extreme as at the higher percentiles, the clinical utility will still be reduced by incorrect classification.

The continual growth of large GWAS studies means researchers are more likely to encounter the issue of inter-cohort sample overlap between derivation and target data sets that may artificially increase the concordance of PRS25. This was indeed our experience when selecting appropriate PRS. We were able to confirm that most of our selected PRS do not have the sample overlap issue, but not PRS-A for SBP (detail in “Selection of PRS” section). The reason is that the UKB represents by far the largest single study contributing to PRS for blood pressure, and is unlikely to be excluded from any attempt to derive a clinically-applicable PRS for blood pressure. We anticipate that fellow researchers will encounter the same obstacle, and would like to highlight this challenge to the research community.

The sample overlap between the base GWAS (i.e. derivation cohort for deriving PRS) and UKB (i.e. target cohort for calculating PRS) in our PRS-A for SBP is a limitation of our study. As a result, our results on PRS for SBP should be interpreted with the awareness of potential bias introduced by the sample overlap in PRS-A for SBP.

Current methodological development in the sample overlap issue includes the EraSOR (Erase Sample Overlap and Relatedness) method25, which is a potential direction for further research to clarify the roles of sample overlap and different statistical methods in the genesis of PRS discordance.

The portability of PRS depends on the characteristics (such as socio-economic status, age or sex) of the individuals in the base GWAS studies, as well as on the GWAS design, even within a single ancestry group26. A difference in study characteristics between the PRS derivation and a target cohort could contribute to the disagreement among different PRS that we have observed in this study. The choice of GWAS sample makes implicit assumptions on sample characteristics that may not hold for the prediction set (i.e. target cohort)26. An obvious example is different ethnic backgrounds, but we minimized this by restricting to the White British subset of the UKB. Our study further supports the importance of providing study characteristics along with the PRS.

We note that we have not established the reasons for the extent of misclassification between different PRS. It does not appear to be attributable solely to the number of SNPs included in the PRS. We show this for PRS composed of over 100 thousand versus several hundred SNPs (breast cancer), for PRS composed of hundreds of SNPs (SBP), and for PRS composed of approximately 50 SNPs (dementia). The surprisingly small number of SNPs held in common by different PRS for the same condition published only a year apart indicates that the different analytical methods used to derive the PRS account for some of the discrepancies in classification; understanding this phenomenon is clearly important for PRS selection in broad clinical practice. The correlations we observed between PRS for the same condition are in the same range as is seen for variation in risk predictors measured several years apart such as blood pressure and serum cholesterol, far from the “fixed” or “one-time” value at birth that is often assumed for PRS.

This phenomenon has been previously described. For instance, Läll27 compared the performance of four PRS in breast cancer prediction, noting that some of the correlations were as low as r = 0.3, and observed that a “metaGRS” of the PRS performed better than any of the individual PRS28. After completing our analysis and submitting this paper, we found a preprint29 with similar findings for two different phenotypes (cardiovascular disease, and educational attainment) in the UKB, providing further generalizability to our results. However, this issue does not seem to be widely appreciated, and most publications comparing a new PRS with previous versions assert the superiority of the new PRS and do not address the issue of misclassification of risk between PRS.

Our observations show that two PRS that differ only minimally in predictive performance at a population level may differ substantially in terms of individual risk classification, even among individuals with the same continental ancestry. This issue requires careful consideration before utilising PRS in real-world settings, because such an arbitrary element in health care is obviously undesirable. An individual’s genetic profile is generally considered fixed at birth, leading to the widely held conviction that genetic susceptibility is an immutable value; however, our findings show that methodologic decisions in the construction of the PRS can lead to meaningful differences in PRS derived from the genome. Although it is reasonable to expect incremental improvements in any risk prediction algorithm over time, these results suggest there is still considerable uncertainty associated with estimates of risk derived from different PRS for the same disease. It will be important to develop guidelines on best practice in constructing PRS to minimize the extent to which people could be given inaccurate or contradictory information over short periods of time.