Introduction

The advent of genome-wide association studies (GWAS) has led to the discovery of numerous loci associated with the most common diseases1. These discoveries also provide the opportunity for predicting risks from an individual’s genotypes2. Accurate genetic risk prediction can enable us to identify high-risk individuals and facilitate disease prevention and early treatment3.

Polygenic risk score (PRS) is commonly used in genetic risk prediction due to its simplicity, which results from the additive assumption. Both empirical and theoretical studies have shown that the additive component is expected to account for most of the genetic variance of complex traits4. Under this additive assumption, a PRS sums the allele dosages of single nucleotide polymorphisms (SNPs) weighted by their estimated effect sizes5.

Various PRS methods have been proposed to estimate the effect sizes of SNPs from a GWAS dataset. Compared to individual-level genotype data, summary statistics are more accessible and raise fewer security and privacy concerns6,7. Many recently proposed PRS methods estimate SNP effects with GWAS summary statistics. One of the simplest is clumping and thresholding (C+T)8,9,10,11,12,13,14, in which linkage disequilibrium (LD) clumping is applied to the SNPs that pass a p-value threshold. A related method is pruning and thresholding (P+T), which only includes the SNPs whose p-values pass a threshold after LD pruning. Both LD clumping and LD pruning are step-wise heuristic procedures that select a set of approximately independent SNPs. Unlike LD pruning, LD clumping selects the independent SNPs after p-value thresholding, so SNPs showing stronger associations with the disease are preserved, which is preferred in constructing a PRS. We note that some literature refers to C+T as P+T, but we treat them as distinct methods in the following discussion.

It is important to note that for both C+T and P+T, only a portion of independent SNPs is utilized in constructing the PRS model, while other SNPs and LD information are ignored. To further improve the prediction accuracy of genetic risks, many PRS methods have been proposed to incorporate genome-wide SNPs and their LD information, such as LDpred15, LDpred216, sBayesR17, PRS-CS18, and SDPR19. LDpred imposes a point-normal prior on the SNP effect sizes and infers the posterior mean effect sizes using a Markov chain Monte Carlo (MCMC) procedure. LDpred2 was further proposed to increase computational efficiency and provide more stable results than LDpred when dealing with long-range LD regions and traits with sparse genetic architectures. To allow more general effect size distributions, sBayesR performs Bayesian posterior inference based on a mixture prior of a point mass and three normal distributions that represent SNPs with small, medium, and large effects, respectively. SDPR performs Bayesian posterior inference based on a Dirichlet process, modeling effect sizes with a mixture of 1000 normal distributions. To reduce the computational burden from combining different components over millions of SNPs, PRS-CS places a continuous shrinkage prior on the SNP effect sizes in a Bayesian framework. All these LD-based methods have demonstrated superior performance on some datasets of complex diseases; however, none of them dominates the others across all settings.

Among these PRS methods, P+T, C+T, LDpred, and LDpred2 rely on parameters that need to be specified by users beforehand. Although PRS-CS and sBayesR have options to estimate parameters with an additional layer of prior distributions, users can also specify the parameters themselves. For all PRS methods that require tuning parameters, an external individual-level genotype dataset is needed to evaluate different parameter values and choose the best-performing ones. However, as mentioned before, individual-level genotype data are less accessible than summary statistics. Moreover, it is inefficient to leave out a portion of the data just for tuning parameters and to estimate SNP effects with the remaining data, as this leads to information loss and reduced performance of the PRS methods. These concerns motivated us to develop a method that can evaluate the performance of a PRS model based on the summary statistics used for model training.

For diseases with a binary phenotype, the area under the receiver operating characteristic (ROC) curve (AUC) is the most commonly used criterion in practice for evaluating PRS5,20,21. In 2018, Song et al.22 proposed an estimator of AUC using only summary statistics. This method makes use of an equivalent definition of AUC, i.e. the probability of a PRS from a random case being larger than a PRS from a random control. Based on this definition, AUC can be approximated by a function of the GWAS summary statistics. This method can tune the parameters of a PRS model with summary statistics from another GWAS.
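The equivalent probability definition of AUC used by Song et al. can be illustrated with a short sketch (ours, for illustration only): the AUC equals the fraction of case-control pairs in which the case has the larger PRS, with ties counted as one half.

```python
import numpy as np

def auc_prob(prs_cases, prs_controls):
    """AUC as P(PRS of a random case > PRS of a random control), ties counted as 1/2."""
    cases = np.asarray(prs_cases, dtype=float)
    controls = np.asarray(prs_controls, dtype=float)
    # all pairwise case-minus-control differences
    diff = cases[:, None] - controls[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
```

For example, perfectly separated scores give `auc_prob([2, 3], [0, 1]) == 1.0`, while identical scores give 0.5, matching the usual ROC-based computation.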

To maximize the power of identifying loci associated with common diseases, some large consortia have conducted meta-analyses of all accessible studies and released the resulting summary statistics. These summary statistics are usually used as training data to optimize the prediction power of PRS models. In this situation, it is difficult to gain access to summary statistics from another independent GWAS. This problem cannot be well addressed by simply plugging the summary statistics from the training data into the derived AUC function, because the variants with larger effects tend to have their effect sizes overestimated, and these variants have a larger influence on the PRS than the variants exhibiting small effects. This phenomenon is known as overfitting23. If we use the observed effects directly, the overfitting leads to an inflated predicted value of the AUC and incorrectly selected parameter values.

Built on Song’s method, we propose PRStuning, a method that requires only summary statistics from the training data to predict the conventional AUC that needs to be evaluated on another individual-level genotype dataset. We incorporate empirical Bayes (EB) theory to shrink the effect sizes of SNPs, which leads to the attenuation of the predicted AUC so as to overcome the overfitting phenomenon24. In PRStuning, we adopt a point-normal mixture model as the prior distribution of SNP effects and estimate the parameters in the model with GWAS summary statistics from the training data. There are two settings depending on the dependency across the selected SNPs used for training the PRS model. When the SNPs are independent, e.g., the SNPs used in P+T, we utilize an expectation-maximization (EM) algorithm to estimate the parameters in the prior distribution and calculate the posterior distribution of the AUC based on a closed-form formula. When SNPs are dependent due to LD, we use a Gibbs-sampling-based State-Augmentation for Marginal Estimation (SAME) algorithm25 to estimate the parameters in the model and obtain the Monte-Carlo (MC) samples of the predicted AUC. Once this is accomplished, we can select the parameter values for the PRS method with the best predicted AUC.

We applied PRStuning to GWAS datasets of four common diseases, including coronary artery disease (CAD), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and breast cancer (BC), with four PRS methods, namely P+T, C+T, LDpred, and LDpred2. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict the PRS performance across PRS methods and parameters, and it can help with parameter selections.

Results

Overview of PRStuning

Define gi,m ∈ {0, 1, 2} as the genotype score of SNP m for individual i. The PRS for individual i is the sum of the genotypes gi = (gi,1, …, gi,M) weighted by the corresponding effects ω = (ω1, …, ωM), i.e.,

$$PR{S}_{i}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}{g}_{i,m}.$$
(1)

Here M is the total number of pre-selected SNPs used for constructing the PRS. Please note that not all SNPs collected in the training GWAS data are necessarily used in the PRS calculation. Some PRS methods, such as P+T, select SNPs based on criteria unrelated to association strengths. For those methods, we just need to consider the selected SNPs in estimating the AUC. However, some other PRS methods incorporate SNP selection steps based on the associations of the SNPs with the disease, resulting in inflation of their observed association effects8,16,17. For those methods, we consider the SNPs used before the selection step to address the effect size inflation issue with the Empirical-Bayes-based method introduced later. Here we define the pre-selected SNPs as the SNPs used in building the PRS model before running any selection step related to association strengths. For example, the pre-selected SNPs in C+T are actually the genome-wide SNPs collected in the training GWAS data, and the LD clumping procedure used in C+T is a selection step based on the observed association strength. In this situation, we have ωm = 0 for SNPs not selected for building the PRS in C+T. In contrast, LD pruning is a selection step unrelated to SNP associations with the disease. Therefore, the pre-selected SNPs in P+T are the SNPs selected after an LD pruning step. Different PRS methods have been proposed to estimate the weight vector ω from a GWAS dataset or its summary statistics for the disease of interest. Hereafter we regard ω as the values inferred by the PRS method of interest.
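Eq. (1) is a weighted sum over SNPs, which for a batch of individuals is simply a genotype-matrix-by-weight product. A minimal sketch (ours; the function name is hypothetical):

```python
import numpy as np

def polygenic_risk_score(G, omega):
    """PRS for each individual: weighted sum of allele dosages (Eq. 1).

    G     : (n_individuals, M) array of genotype scores in {0, 1, 2}
    omega : (M,) array of per-SNP effect-size weights
    """
    G = np.asarray(G, dtype=float)
    return G @ np.asarray(omega, dtype=float)
```

SNPs excluded by a selection step simply carry a weight of zero, as in the C+T description above.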

Based on the definition of AUC and the distribution of PRS, Song22 formulated AUC as

$$\mathrm{AUC}=\Phi(\Delta),$$
(2)

where

$$\Delta:=\frac{2\sum_{m=1}^{M}\omega_{m}\delta_{m}}{\sqrt{\tau_{0}^{2}+\tau_{1}^{2}}}\quad\mathrm{and}\quad\tau_{j}^{2}=\sum_{m=1}^{M}\omega_{m}^{2}s_{j,m}^{2}+2\sum_{m_{1}<m_{2}}\omega_{m_{1}}\omega_{m_{2}}R_{m_{1},m_{2}}s_{j,m_{1}}s_{j,m_{2}},$$
(3)

where j = 0 indicates controls and j = 1 indicates cases. Here for SNP m, we use fj,m to denote the frequency of the reference allele, \({s}_{j,m}^{2}:=2{f}_{j,m}(1-{f}_{j,m})\) to denote the variance of the genotype, and δm := f1,m − f0,m to record the difference between the allele frequencies of the cases and controls; Φ(·) is the cumulative distribution function of the standard normal distribution. We use \({R}_{{m}_{1},{m}_{2}}\) to denote the LD coefficient between SNPs m1 and m2.

We can calculate \({\tau }_{j}^{2}\) (j = 0, 1) by directly plugging in the observed values of allele frequencies and LD coefficients since \({\tau }_{j}^{2}\) is not directly related to the SNPs’ effects on the disease. The observed allele frequencies can be obtained from summary statistics of the GWAS, and LD information can be extracted from another genotype dataset. Some large projects such as the 1000 Genomes project (1KG)26 and the HapMap3 project (HM3)27 have made their data publicly available and we can use them as reference panels to calculate the LD coefficients.
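Because R has a unit diagonal, the expression for \({\tau }_{j}^{2}\) in Eq. (3) collapses to the quadratic form (ω ∘ sj)ᵀ R (ω ∘ sj). A small sketch (ours; the function name is hypothetical) computing it from observed allele frequencies and a reference-panel LD matrix:

```python
import numpy as np

def tau_squared(omega, f_j, R):
    """Variance of the PRS in group j (Eq. 3), written as a quadratic form.

    omega : (M,) SNP weights
    f_j   : (M,) reference-allele frequencies in group j
    R     : (M, M) LD coefficient matrix (unit diagonal)
    """
    s_j = np.sqrt(2.0 * f_j * (1.0 - f_j))   # per-SNP genotype standard deviations
    v = omega * s_j
    return float(v @ R @ v)
```

With R equal to the identity matrix this reduces to the first sum in Eq. (3), i.e., the no-LD case.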

For δm in Eq. (3), if we directly plug in the observed allele frequencies \({\hat{f}}_{0,m}\) and \({\hat{f}}_{1,m}\) from GWAS, the SNPs exhibiting large allele frequency differences tend to have their effect sizes overestimated, and these SNPs have larger contributions to the PRS than the SNPs showing smaller effects. The overfitting of the SNP effects would lead to an inflated predicted value of the AUC and incorrectly selected values of the parameters. Therefore, we adopt an Empirical Bayes method in PRStuning to shrink the effects so as to reduce the influence of overfitting. In the Supplementary Methods section, we provide a theoretical demonstration of how overfitting happens and the rationale of alleviating overfitting with a Bayes estimator.

In GWAS, z-scores from the allele frequency difference test are usually used to assess the association of each SNP with the disease. Each z-score is calculated with the following formula:

$$z_{m}=\frac{\hat{f}_{1,m}-\hat{f}_{0,m}}{\sqrt{s_{1,m}^{2}/4n_{1}+s_{0,m}^{2}/4n_{0}}},$$
(4)

where \({\hat{f}}_{j,m}\) is the observed allele frequency in each group, \({s}_{j,m}^{2}\) is the variance of the genotype in the controls or cases, and n0, n1 are the sample sizes of the two groups. To simplify this expression, we define \({s}_{m}:=\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}}\). Based on this definition, we have zm ∣ δm ~ N(δm/sm, 1) given the allele frequency difference δm.
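Eq. (4) can be written directly as a vectorized computation over SNPs; the sketch below is ours for illustration:

```python
import numpy as np

def af_test_z(f1_hat, f0_hat, n1, n0):
    """z-score of the allele frequency difference test (Eq. 4).

    f1_hat, f0_hat : observed reference-allele frequencies in cases / controls
    n1, n0         : case / control sample sizes
    """
    s1_sq = 2.0 * f1_hat * (1.0 - f1_hat)  # genotype variance among cases
    s0_sq = 2.0 * f0_hat * (1.0 - f0_hat)  # genotype variance among controls
    s_m = np.sqrt(s1_sq / (4.0 * n1) + s0_sq / (4.0 * n0))
    return (f1_hat - f0_hat) / s_m
```

Equal observed frequencies give a z-score of zero, and a larger frequency difference or a larger sample size produces a larger absolute z-score.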

Here we denote the allele frequencies among controls and cases when SNP m is assumed to be independent of other SNPs as p0,m and p1,m, respectively. Note that fj,m is the allele frequency of SNP m marginalized over other SNPs, which is different from pj,m (j = 0, 1). We use βm to denote the underlying effect of SNP m in terms of changing allele frequencies between controls and cases, i.e., βm = p1,m − p0,m. If SNP m has no risk effect on the disease, we have βm = 0. Let β = (β1, …, βM). In the Supplementary Methods section, we further demonstrate that the marginalized allele frequency differences δ = (δ1, …, δM) are related to the LD pattern among the pre-selected SNPs and β, i.e.,

$$\delta=SRS^{-1}\beta,$$
(5)

where S is a diagonal matrix with the m-th diagonal element equal to sm, and R is the LD coefficient matrix. Given δ, the joint distribution of the z-scores z = (z1, …, zM) is

$$\boldsymbol{z}\,|\,\delta\sim N(S^{-1}\delta,\,R).$$
(6)
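Eqs. (5) and (6) together describe a generative model for the z-scores given the underlying effects. The following sketch (ours; not the PRStuning implementation) draws one z-score vector, avoiding explicit construction of the diagonal matrix S:

```python
import numpy as np

def sample_z(beta, s, R, rng):
    """One draw from Eqs. (5)-(6): delta = S R S^{-1} beta, z | delta ~ N(S^{-1} delta, R).

    beta : (M,) underlying effects
    s    : (M,) standard errors s_m (diagonal of S)
    R    : (M, M) LD matrix
    rng  : numpy random Generator
    """
    delta = s * (R @ (beta / s))       # S R S^{-1} beta, elementwise via the diagonal of S
    return rng.multivariate_normal(delta / s, R)  # mean S^{-1} delta, covariance R
```

Note that the mean of z simplifies to R(β/s), so with R equal to the identity the z-scores are centered at the standardized effects, consistent with the independent-SNP case.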

We further assume that the standardized effect βm/sm follows a point-normal distribution, i.e.,

$$\frac{\beta_{m}}{s_{m}}\overset{iid}{\sim}(1-\pi)\delta_{0}+\pi N(0,\sigma^{2}).$$
(7)

Here δ0 is a point mass at zero, π represents the prior proportion of SNPs that have an effect on the disease, and σ2 is the variance of βm/sm in the risk SNPs. This point-normal distribution is also used in LDpred as the prior distribution. The relationship between σ2 and the heritability of the disease is presented in Section “Notations and assumptions” and the Supplementary Methods section. With this assumption, we derived an expectation-maximization (EM) algorithm to estimate (π, σ2) and calculated the posterior distribution of the AUC when pre-selected SNPs are independent. When SNPs are linked by LD, we derived a Gibbs-sampling-based SAME algorithm to estimate (π, σ2) and obtained the MC samples of the predicted AUC. Once this is accomplished, we can select the parameter values for the PRS method with the best predicted AUC. Details of PRStuning are presented in Section “Methods”.
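In the independent-SNP setting, the marginal distribution of each z-score under Eq. (7) is a two-component mixture (1 − π)N(0, 1) + πN(0, 1 + σ²), so (π, σ²) can be fitted by a standard EM iteration. The sketch below is our simplified illustration of such an EM, not the PRStuning implementation:

```python
import numpy as np

def norm_pdf(z, var):
    """Density of N(0, var) evaluated at z."""
    return np.exp(-z**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def em_point_normal(z, n_iter=200):
    """EM for (pi, sigma^2) under z_m ~ (1 - pi) N(0, 1) + pi N(0, 1 + sigma^2)."""
    z = np.asarray(z, dtype=float)
    pi, sig2 = 0.1, 1.0                      # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each SNP is a risk SNP
        null = (1.0 - pi) * norm_pdf(z, 1.0)
        alt = pi * norm_pdf(z, 1.0 + sig2)
        gamma = alt / (null + alt)
        # M-step: update mixing proportion and risk-component variance
        pi = gamma.mean()
        sig2 = max(gamma @ z**2 / gamma.sum() - 1.0, 1e-6)
    return pi, sig2
```

The posterior responsibilities from the E-step also give the shrunken (posterior mean) effects used to attenuate the predicted AUC.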

Simulation experiments

For our simulation experiments, we considered predicting the performance and tuning the parameters of four commonly used PRS methods, namely P+T, C+T, LDpred, and LDpred2. In the experiments, we varied the p-value thresholds for P+T and C+T over {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6}. In P+T, a p-value threshold of 1 means that no further p-value-based filtering is applied to the pre-selected approximately independent SNPs after LD pruning. In C+T, a p-value threshold of 1 means we conduct LD clumping on genome-wide SNPs without p-value-based filtering. For LDpred, we chose the proportion of risk SNPs π from {1, 3e − 1, 1e − 1, 3e − 2, 1e − 2, 3e − 3, 1e − 3, 3e − 4, 1e − 4, 3e − 5, 1e − 5}, which is the default setting of LDpred. Because LDpred2 had convergence issues when the risk SNP proportion was set to an extremely small value in simulations based on simulated genotype data, we varied π over {1, 6e − 1, 3e − 1, 1e − 1, 6e − 2, 3e − 2, 1e − 2}, a set with a smaller range but finer resolution than that used for LDpred. For simulations based on real genotype data, we used the same parameter values as for LDpred.

There are two purposes of our method: to predict the AUC and to select tuning parameters. In our experiments, we used another independent dataset with individual-level genotype data as testing data. The AUC of the PRS assessed on the testing data and the parameters showing the best prediction performance on the testing data were treated as benchmarks. We evaluated the performance of PRStuning with two measures: the correlation of the AUC estimates (ρAUC) and the relative difference of the highest AUC estimates (rdAUC). We define ρAUC as the correlation between the PRStuning-predicted AUC values and those estimated on the testing data. A high value of ρAUC indicates that the AUC predicted by our method is highly correlated with the AUC on the testing data. We define rdAUC as the relative difference between the predicted AUC with the best-performing parameter tuned by PRStuning and the AUC with the best-performing parameters on the testing data. Here best-performing parameters are defined as those achieving the highest AUC values. A small value of rdAUC indicates that the tuning parameter selected by PRStuning and the actual best-performing parameter have comparable performance. These two metrics are complementary: ρAUC measures how well the AUC patterns across parameter values for PRStuning and the testing data align with each other, while rdAUC measures the point difference between the highest AUC values for the two. Therefore, we evaluate the results with both metrics.
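Under one plausible reading of these definitions (our interpretation; in particular, we take rdAUC to compare the testing AUC at the PRStuning-selected parameter against the best testing AUC), the two measures can be computed as:

```python
import numpy as np

def rho_auc(auc_pred, auc_test):
    """Correlation of predicted and testing AUC across parameter values."""
    return float(np.corrcoef(auc_pred, auc_test)[0, 1])

def rd_auc(auc_pred, auc_test):
    """Relative difference between the testing AUC at the PRStuning-selected
    parameter and the highest testing AUC."""
    selected = int(np.argmax(auc_pred))     # parameter PRStuning would pick
    best = float(np.max(auc_test))          # actual best testing AUC
    return abs(auc_test[selected] - best) / best
```

When PRStuning selects the same parameter as the testing benchmark, rdAUC is exactly zero regardless of how the remaining AUC values differ.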

We first consider the case where the pre-selected SNPs are independent. In our simulations, we set the prevalence of the disease to κ = 1%. For each SNP, we simulated its allele frequency in the general population based on a uniform distribution U(0.05, 0.95). Then we generated its risk effects on the disease based on the two-component mixture model Eq. (7), in which we set the proportion of the risk SNPs to π = 0.05 and the variance of the risk effects to σ2 = 0.001n. Here n is the total sample size of the GWAS used in the training data. We assume the GWAS is balanced with an equal number of cases and controls. According to the central limit theorem, we have \({s}_{m}\propto 1/\sqrt{n}\). Hence it is reasonable to assume σ2n.

In total, we simulated M = 10,000 independent SNPs and varied the sample size of the training GWAS from 4000 to 10,000 to explore the performance trend across different sample sizes. Each sample size setting was replicated 50 times, and for each replication, we simulated an additional 1000 cases and 1000 controls as testing data. We used the AUC evaluated on the testing data as the benchmark, and compared the AUC predicted by PRStuning and the unadjusted AUC obtained by directly plugging in the training summary statistics with the benchmark. Since all SNPs are independent, we only considered P+T as the PRS method.
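At the summary-statistics level, this generative model can be sketched directly in terms of z-scores: standardized effects βm/sm are drawn from the point-normal prior of Eq. (7), and the observed z-scores add unit sampling noise. The code below is our shortcut illustration of that model (simulating z-scores directly rather than individual genotypes); names and defaults are ours:

```python
import numpy as np

def simulate_summary_stats(M=10_000, n=4_000, pi=0.05, sigma2=None, seed=0):
    """Simulate GWAS z-scores for M independent SNPs (sigma^2 = 0.001 * n by default,
    mirroring the sigma^2 proportional to n assumption in the text)."""
    rng = np.random.default_rng(seed)
    if sigma2 is None:
        sigma2 = 0.001 * n
    # standardized true effects beta_m / s_m from the point-normal prior (Eq. 7)
    is_risk = rng.random(M) < pi
    std_effect = np.where(is_risk, rng.normal(0.0, np.sqrt(sigma2), M), 0.0)
    # observed z-scores: truth plus unit sampling noise
    z = std_effect + rng.standard_normal(M)
    return z, std_effect
```

Null SNPs then have z-scores with variance close to one, while risk SNPs show inflated variance, which is what the EM fit exploits.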

Figure 1 shows the boxplots of AUC values corresponding to different p-value thresholds and sample sizes of training data for P+T. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC calculated from testing data, and the unadjusted AUC obtained by directly plugging in the training summary statistics, respectively. As expected, the unadjusted AUC estimates were inflated compared to the benchmark due to the overfitting problem. In contrast, with the same summary statistics from the training data, PRStuning was able to shrink the estimates of allele frequency differences and produce AUC estimates comparable to those from the testing data.

Fig. 1: AUC boxplots for P+T in the simulation experiments with independent SNPs.
figure 1

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the p-value threshold from {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results. Source data are provided as a Source Data file.

In order to further demonstrate the accuracy of PRStuning, we summarize the average correlation of the AUC estimates ρAUC and the average relative difference of the best-performing AUC estimates rdAUC in Table 1. These metrics are complementary since two vectors can be perfectly correlated but still differ substantially. The values of ρAUC were at least 0.976, which indicates that PRStuning is capable of accurately predicting the AUC pattern on the testing data. Moreover, the average values of rdAUC were at most 1.3%, indicating that PRStuning can effectively select parameter values that achieve performance comparable to the best-performing parameter on the testing data. Note that ρAUC increased and rdAUC decreased as the sample size of the training GWAS increased. This is expected because a larger sample size in the training data leads to higher accuracy in estimating the allele frequency differences.

Table 1 Summary of the average values of ρAUC and rdAUC in the simulation experiments with independent SNPs

We also evaluated PRStuning when the training and testing data are heterogeneous. Specifically, we considered two different scenarios. In the first scenario, we assumed that the allele frequencies from the training and testing data were different, with the differences generated from N(0, 0.01^2). In the other scenario, we assumed that the effect sizes differed between the training and testing data, with the differences between effects of risk SNPs following N(0, 0.0005n). The results of these experiments for P+T are provided in Supplementary Figures 3-4. The figures demonstrate that PRStuning can still estimate the AUC well when the pooled allele frequencies differ between the training and testing data. However, when the effects of risk SNPs differ between training and testing data, the AUC from PRStuning was overestimated, leading to inaccurate parameter tuning.

We then considered the case where the pre-selected SNPs are not filtered by any independence criterion. In this case, the pre-selected SNPs are correlated through LD. We first performed simulations with SNPs under an AR(1) auto-regressive LD structure. We fixed the auto-regressive coefficient ρ, the correlation coefficient between two adjacent SNPs, to 0.2. Similar to the simulation scenario with independent SNPs, we simulated the reference allele frequencies in the population from U(0.05, 0.95) and the risk effects from the point-normal distribution Eq. (7), in which π = 0.05 and σ2 = 0.0005n. The variance of the risk effects is proportional to the sample size of the GWAS since \({s}_{m}\propto 1/\sqrt{n}\) according to the central limit theorem.

We varied the sample size of the training GWAS from 4000 to 10,000 and generated 50 replications for each sample size. We used CorBin28, an R package for generating high-dimensional binary data with a specified correlation structure, to generate individual-level genotype data. Specifically, we generated 1000 cases and 1000 controls as testing data for each replication. We additionally simulated 1000 samples as a reference panel for calculating LD coefficients. We used both C+T and LDpred as the PRS methods in this experiment. In LDpred, we need to specify another parameter named the LD radius, which is the number of SNPs on each side of a given SNP used for computing pairwise LD. The LD radius was set to 5, indicating that the SNPs used for computing LD have pairwise correlations above 0.2^5 ≈ 3 × 10−4 under the AR(1) LD structure.
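Under the AR(1) structure, the correlation between two SNPs decays geometrically with their distance in SNP index, so the LD matrix is fully determined by ρ. A small sketch (ours) constructing it:

```python
import numpy as np

def ar1_ld_matrix(M, rho):
    """AR(1) LD matrix: R[i, j] = rho ** |i - j|."""
    idx = np.arange(M)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

With ρ = 0.2, SNPs at the edge of an LD radius of 5 have correlation 0.2^5 ≈ 3 × 10−4, matching the threshold quoted above.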

To demonstrate the predictive accuracy of PRStuning, we again regarded the AUC evaluated on the testing data as the benchmark and compared both the AUC predicted by PRStuning and the unadjusted AUC with it. Figures 2, 3 and Supplementary Figure 1 show the AUC boxplots for C+T, LDpred, and LDpred2 with different parameter values, respectively. For all three PRS methods, the unadjusted AUC estimates were largely inflated compared to the benchmark due to overfitting. In contrast, the AUC estimates predicted by PRStuning were very close to the benchmark, especially when the sample size became large.

Fig. 2: AUC boxplots for C+T in the simulation experiments with correlated SNPs.
figure 2

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the p-value threshold from {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results. Source data are provided as a Source Data file.

Fig. 3: AUC boxplots for LDpred in simulation experiments with correlated SNPs.
figure 3

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the proportion of risk SNPs from {1, 3e − 1, 1e − 1, 3e − 2, 1e − 2, 3e − 3, 1e − 3, 3e − 4, 1e − 4, 3e − 5, 1e − 5} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC calculated from testing data, and the unadjusted AUC, respectively. Source data are provided as a Source Data file.

We summarize the average values of ρAUC and rdAUC for C+T and LDpred in Table 2. For both C+T and LDpred, the average values of ρAUC were at least 0.754 in all sample size settings, indicating PRStuning can accurately predict the AUC on testing data. The average values of rdAUC were below 3.1%, meaning PRStuning can effectively select a parameter that achieves performance comparable to the actual best-performing parameter on the testing data. Again, we can observe an increasing tendency in ρAUC and a decreasing tendency in rdAUC as we increase the sample size of the training GWAS as the result of the increase in estimation accuracy of the allele frequency differences.

Table 2 Summary of the average values of ρAUC and rdAUC in the simulation experiments with correlated SNPs for C+T and LDpred

We evaluated PRStuning when the training and testing data were heterogeneous, considering three different scenarios. In the first scenario, we assumed the allele frequencies from the training and testing data were different, with the differences generated from N(0, 0.01^2). In the second scenario, we assumed that the effect sizes differed between the training and testing data, with the differences between effects of risk SNPs following N(0, 0.0002n). In the third scenario, the LD structure of the testing data was AR(1) with an auto-regressive coefficient ρ = 0.15, different from that of the training data. The results of these experiments for C+T, LDpred, and LDpred2 are provided in Supplementary Figures 5-13. Generally speaking, the figures demonstrate that PRStuning can still estimate the AUC well when the pooled allele frequencies and the LD matrix differ between the training and testing data. However, when the effects of risk SNPs differ between training and testing data, the AUC from PRStuning was overestimated, leading to inaccurate parameter tuning.

To investigate whether including more individuals in the reference panel can improve the performance of PRStuning, we conducted simulation experiments to compare its performance with the performance based on the ground truth LD matrix. The comparison results using C+T, LDpred, and LDpred2 to construct PRS can be found in Supplementary Figures 14-16, respectively. From the figures, we observe that the performance of PRStuning based on the LD matrix estimated from 1,000 individuals was almost the same as the performance based on the ground truth LD matrix. Thus, with a sufficient number of individuals in the reference panel, there may be little improvement in performance by including more individuals in the LD matrix calculation.

To further demonstrate the effectiveness of PRStuning, we calculated the sensitivity of the PRS model tuned by PRStuning, i.e., the proportion of true cases among the individuals predicted to be cases by the PRS model. The cutoff value for the PRS was selected by Youden’s J statistic, which is defined as the sum of sensitivity and specificity minus one and is the most commonly used criterion for selecting the cutoff value of a binary classifier29. The true case proportions in the simulation experiments for the four PRS methods are summarized in Supplementary Figure 2.
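Youden's J selects the cutoff that jointly maximizes sensitivity and specificity. The exhaustive search over candidate cutoffs can be sketched as follows (ours, for illustration; function name is hypothetical):

```python
import numpy as np

def youden_cutoff(prs, labels):
    """Pick the PRS cutoff maximizing Youden's J = sensitivity + specificity - 1.

    prs    : (n,) risk scores
    labels : (n,) 0/1 disease status
    """
    prs = np.asarray(prs, dtype=float)
    labels = np.asarray(labels)
    n_case = (labels == 1).sum()
    n_ctrl = (labels == 0).sum()
    best_j, best_t = -np.inf, None
    for t in np.unique(prs):               # every observed score is a candidate cutoff
        pred = prs >= t                    # predicted cases at this cutoff
        sens = (pred & (labels == 1)).sum() / n_case
        spec = (~pred & (labels == 0)).sum() / n_ctrl
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t, best_j
```

For perfectly separating scores, the selected cutoff sits at the lowest case score and J attains its maximum of 1.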

We also evaluated PRStuning with simulations based on real genotype data. The experiments were conducted on genotype data collected from the UK Biobank (UKBB)30, which has collected genetic and health records from around 500,000 participants in the UK. The quality control procedure is summarized in the Supplementary Methods section. We only selected independent individuals with European ancestry in the experiments. Since only SNPs present in the HapMap 3 project (HM3 SNPs) were used in the reference panel, for reliable LD estimation and computational efficiency, we focused on the HM3 SNPs in the UKBB dataset. This resulted in a total of 1,027,699 HM3 SNPs and 272,751 individuals passing the quality control criteria.

We used the two-component mixture model Eq. (7) to simulate risk effects for SNPs with π = 0.1% and σ2 = 0.04. The phenotypes of the individuals were simulated based on the additive assumption. Among all individuals, we randomly selected 80% of them for GWAS analysis to calculate the summary statistics as training data and the rest as testing data. We used the data collected from the 1000 Genomes Project (1KG)26 as the reference panel for calculating LD. In the experiments, we used both C+T and LDpred as the PRS methods and compared the AUC estimates predicted by PRStuning with the values calculated on the testing data. The LD radius to be specified in LDpred was set to M/3000 ≈ 343, which is the default practice suggested by LDpred and corresponds to a 2Mb LD window on average in the human genome15.

In Table 3, we summarize the AUC results of C+T, LDpred, and LDpred2 with different parameter values for both PRStuning and testing genotype data. The AUC estimates from PRStuning were very close to the actual AUC values obtained from the testing data. For C+T, the correlation ρAUC reached 0.994, the relative difference rdAUC was 3.8%, and the sensitivity of the tuned PRS model based on PRStuning was 80.6%. For LDpred, ρAUC reached 0.998, rdAUC was just 1.3%, and the sensitivity was 74.8%. It is worth noting that PRStuning was able to detect the dramatic decrease in the testing performance of LDpred when π was dropped from 1e − 1 to 3e − 2. For LDpred2, ρAUC reached 0.989, rdAUC was 7.0%, and the sensitivity was 85.3%. These results further suggest the accuracy in AUC estimation and effectiveness in parameter tuning using PRStuning on SNPs linked by LD.

Table 3 The predicted AUC values for C+T, LDpred, and LDpred2 with different parameters in the simulation experiment based on the UKBB data

Real data applications

We applied PRStuning to GWAS summary statistics from four diseases: coronary artery disease (CAD), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and breast cancer (BC). Table 4 summarizes the sources of the publicly available GWAS summary statistics and their corresponding sample sizes. Note that the summary statistics from all four datasets are results of meta-analyses, and the reported sample sizes represent the total numbers of individuals across all aggregated studies. The actual sample size used to calculate the summary statistics of each SNP was less than the reported sample size, since some of the studies may not have genotyped that SNP.

Table 4 Summary of the publicly available GWAS summary statistics used in real data applications

We used these summary statistics to train the PRS models based on P+T, C+T, LDpred, and LDpred2. We then used the data collected from the UKBB as the testing data for evaluating the actual prediction performance of the built PRS models. Only the SNPs with minor allele frequencies greater than 5% were included in building the PRS models. Details of the quality control procedure and phenotype extraction method for the UKBB data are provided in the Supplementary Methods section. In line with the simulation experiments based on UKBB genotype data, we only incorporated independent European-ancestry individuals and HM3 SNPs in the UKBB dataset, resulting in 272,751 individuals and 1,027,699 HM3 SNPs. Regardless of which PRS method is considered, only the SNPs overlapping between the GWAS summary statistics and the testing data were considered in our analyses. The numbers of overlapping SNPs for these diseases are summarized in Table 4.

In PRStuning, we adopted the EM algorithm (Algorithm 1) for PRS models built by P+T, since the pre-selected SNPs were approximately independent, and the Gibbs-sampling-based SAME algorithm (Algorithm 2) for C+T and LDpred, due to the presence of LD among the pre-selected SNPs. The LD radius in LDpred was set to M/3000, the default suggested by LDpred. Figure 4 shows the predicted AUC by PRStuning and the actual AUC on testing data for the four diseases with different PRS models. The dotted and solid horizontal lines refer to the highest AUC for PRStuning and testing data, respectively. It is evident in the figure that the AUC predicted by PRStuning and the AUC calculated from testing data had similar patterns across different parameter values, particularly for LDpred. For CAD, the AUC of LDpred increased when the risk SNP proportion π was reduced from 1 to 1e − 2. It peaked at 1e − 2 and then started to decrease as we kept reducing the value of π. This pattern was exactly predicted by PRStuning. More complex patterns of AUC were observed for LDpred in T2D and IBD. The AUC values in both diseases had double modes across parameter values. For T2D, the AUC of LDpred peaked at 3e − 2 and 3e − 4. For IBD, the AUC of LDpred peaked at 3e − 2 and 1e − 5. Still, PRStuning predicted exactly the same patterns of AUC for both diseases, demonstrating its high predictive accuracy. More detailed information on the predicted AUC by PRStuning and the actual AUC on testing data is summarized in Supplementary Table 2.

Fig. 4: The predicted AUC by PRStuning and the actual AUC on testing data for four diseases with PRS models built from P+T, C+T, LDpred, and LDpred2 using different parameters.

The four panels present the results of P+T, C+T, LDpred, and LDpred2, respectively. The dotted and solid horizontal lines refer to the highest AUC for PRStuning and testing data, respectively. The overall patterns of AUC predicted by PRStuning and calculated from testing data across different parameter values were similar. Detailed AUC values for different methods and tuning parameters are summarized in Supplementary Table 2. Source data are provided as a Source Data file.

To further explain why there were double modes for AUC with different parameter values, we refer back to the calculation of Δ in Eq. (3), since AUC is monotonically increasing with respect to Δ. The numerator of Δ is a linear combination of the weights \(\omega={({\omega }_{1},\ldots,{\omega }_{M})}^{T}\) used in PRS, whereas the denominator is the square root of a quadratic function of ω, which can be further expressed as

$$\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}=\sqrt{{\omega }^{T}({S}_{0}R{S}_{0}+{S}_{1}R{S}_{1})\omega },$$
(8)

where S0 and S1 are diagonal matrices with diagonal elements (s0,1, …, s0,M) and (s1,1, …, s1,M), respectively. The weights in the PRS model were calculated based on different parameter values. In Supplementary Figure 17, we show the denominators and numerators of Δ under different parameter values in LDpred for the four diseases. From the figure, we can observe that both the numerator and the denominator were unimodal functions of the parameter values, but they peaked at different values. Their ratio therefore made Δ a bimodal function of the parameter values.
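The decomposition of Δ described above is straightforward to compute directly; a minimal sketch with hypothetical two-SNP inputs (the function name and toy values are ours):

```python
import numpy as np

def delta_numerator_denominator(omega, delta, s0, s1, R):
    """Numerator 2*omega^T*delta and denominator
    sqrt(omega^T (S0 R S0 + S1 R S1) omega) of Delta, following Eq. (8)."""
    S0, S1 = np.diag(s0), np.diag(s1)
    num = 2.0 * float(omega @ delta)
    den = float(np.sqrt(omega @ (S0 @ R @ S0 + S1 @ R @ S1) @ omega))
    return num, den

# Toy example: two SNPs in mild LD
omega = np.array([0.5, -0.2])
delta = np.array([0.01, 0.02])
s0 = s1 = np.array([0.6, 0.7])
R = np.array([[1.0, 0.3], [0.3, 1.0]])
num, den = delta_numerator_denominator(omega, delta, s0, s1, R)
```

Plotting `num` and `den` separately over a tuning grid reproduces the unimodal-numerator/unimodal-denominator view discussed above.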

In Figure 4, we do observe some underestimation of AUC for C+T, LDpred, and LDpred2 on CAD and IBD. This is because the summary statistics collected are results of meta-analyses. The actual sample size used for calculating the summary statistics of each SNP is less than the reported sample size, because some of the studies may not have genotypes at this SNP. Some consortia, such as GLGC31, provide the sample size used for calculating the summary statistics of each SNP, but most consortia do not provide this information. Even if we have the sample size for each SNP, we cannot infer the number of non-overlapping individuals used for calculating the summary statistics of two SNPs. The non-overlapping individuals will change the correlations between z-values. In our analysis, we simply plugged the total sample sizes reported by the summary statistics into PRStuning. According to Eq. (16), the inflation of the sample size would lead to a systematic underestimation of sm. Based on Eq. (2), we know that AUC is monotonically increasing with respect to Δ, and we have \({{\Delta }}\propto \mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}\) and δ = SRS−1β. We estimate S−1β directly from the z-scores, which are not influenced by the underestimation of sm. Therefore, the underestimation of sm further leads to the underestimation of AUC.

To further illustrate the predictive accuracy of PRStuning, we calculated ρAUC and rdAUC with different PRS methods for the four diseases. The results are summarized in Table 5. The low values of rdAUC indicate that the prediction performance under the PRStuning-selected parameter closely approximated the best performance on the testing data, especially for C+T and P+T. Even though LDpred had a higher rdAUC than the other PRS methods, it yielded ρAUC values all above 0.95. The high values of ρAUC indicate that PRStuning can accurately predict the pattern of AUC with respect to the parameters on the testing data, as can be clearly observed in Figure 4. These results show that PRStuning can help us select the best-performing parameters in PRS methods with only summary statistics from the training data.
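A sketch of how these two summaries can be computed from a grid of predicted and testing AUCs. The rdAUC form below (the relative shortfall of the testing AUC under the PRStuning-selected parameter from the best testing AUC) is our reading of the metric, and the function names are ours:

```python
import numpy as np

def rho_auc(pred_auc, test_auc):
    """Pearson correlation between the predicted and testing AUCs
    across the tuning-parameter grid."""
    return float(np.corrcoef(pred_auc, test_auc)[0, 1])

def rd_auc(pred_auc, test_auc):
    """Relative difference between the testing AUC achieved by the
    parameter selected from the predictions and the best testing AUC."""
    selected = int(np.argmax(pred_auc))        # parameter chosen by prediction
    best = float(np.max(test_auc))             # best achievable on testing data
    return (best - float(test_auc[selected])) / best

# Toy grid of three candidate parameter values
pred = np.array([0.60, 0.70, 0.65])
test = np.array([0.61, 0.69, 0.66])
rho, rd = rho_auc(pred, test), rd_auc(pred, test)
```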

Table 5 Summary of ρAUC and rdAUC when using PRStuning to predict AUCs for four PRS methods on four diseases

We note that the correlation between the AUC predicted by PRStuning and that calculated from the testing data was negative with C+T for CAD. However, the standard deviations of the AUC values across different parameters were less than 0.01 for both the predicted and the actual AUCs in this scenario. Such extremely small standard deviations of AUC lead to a large variation of the correlation, making the correlation relatively uninformative in characterizing the relationship between the predicted and the actual AUC values. On the other hand, the small value of rdAUC (0.4%) suggests the effectiveness of PRStuning. The sensitivity values of the tuned PRS model based on PRStuning and Youden’s J statistic are summarized in Supplementary Table 3.

We also compared PRStuning with PUMAS32, a method that estimates the predictive R2 of PRS models by sampling pseudo-summary statistics from GWAS summary statistics. To compare predictive R2 with AUC, we first converted Pearson’s correlation to Spearman’s rank correlation and then linearly mapped the latter to AUC33. In Supplementary Table 4, we summarize ρAUC and rdAUC based on PUMAS. We observe that PRStuning outperformed PUMAS across all real data and PRS methods, and that PUMAS particularly struggled to predict the AUC for LDpred and LDpred2.

Discussion

PRS methods have been proven useful for predicting common disease risks, which can help improve disease prevention and early treatment. Some PRS methods require users to specify parameter values. To tune these parameters, an external individual-level genotype dataset is often needed to evaluate the prediction performance under different parameter values. However, individual-level genotype data are much less accessible than GWAS summary statistics due to privacy and security concerns. Additionally, leaving out part of the data for parameter tuning can also reduce the predictive accuracy of the PRS model.

These concerns motivated us to propose PRStuning, an empirical Bayes method that only requires summary statistics from the training GWAS to evaluate PRS and tune the parameters. PRStuning is based on an AUC estimator proposed in ref. 22, which is a function of the GWAS summary statistics. However, plugging the training summary data directly into this estimator would cause overfitting, leading to an inflation of the predicted AUC. To tackle this problem, we adopted an empirical Bayes approach to shrink the predicted AUC based on the estimated genetic architecture. Extensive simulation experiments and real data applications on four diseases with four PRS methods demonstrated that PRStuning is capable of accurately predicting the AUC on the testing data and selecting the best-performing parameters.

The core of PRStuning is to estimate the allele frequency differences among SNPs. To do so, we need to input the sample sizes of the cases and controls in the training data. Usually, they are provided in the sources of GWAS summary statistics. However, if the summary statistics were derived from a meta-analysis, not all SNPs were genotyped in all studies included in the meta-analysis. In this case, the actual sample sizes used for calculating the summary statistics are less than the reported total sample sizes in the meta-analysis for some SNPs. This may lead to underestimation in AUC according to Eq. (2). This phenomenon was observed when we applied PRStuning to C+T and LDpred on CAD and IBD, where the AUC estimates from PRStuning were lower than the actual values in the testing data. Nevertheless, according to our experimental results, the underestimation phenomenon will not influence the performance of parameter selection since the overall pattern of the AUC values with different parameter values can still be well-predicted by PRStuning.

Currently, we only considered tuning parameters for PRS methods on diseases or other binary phenotypes. For quantitative phenotypes, predictive r2, rather than AUC, is commonly used as the evaluation criterion of the PRS model. Extending PRStuning to evaluate predictive r2 and select parameters for quantitative phenotypes is left as future work.

In PRStuning, we select the best-performing parameter by predicting the AUC of the PRS built under each candidate parameter value. Although AUC is the most commonly used evaluation metric for PRS on binary disease outcomes22, it may be helpful to incorporate additional covariates, such as age and sex, into the AUC since they may also have an impact on disease risks34. Two notable variants of AUC that incorporate covariate information are the covariate-specific AUC (AUCx)35 and the covariate-adjusted AUC (AAUC)36. Similar to the definition of the ordinary AUC, AUCx is defined as the probability that the PRS of a random individual from the case group is larger than that of a random individual from the control group, conditional on both individuals sharing the common covariate value x. AAUC is the weighted average of AUCx, where the weight is the probability density of the covariate value x. If the genetic risk of a disease is independent of other covariates, both AUCx and AAUC will have the same value as the ordinary AUC34. To estimate AUCx and AAUC, we need to estimate the conditional distribution of PRS given a covariate value, which can only be inferred with the help of individual-level data. Since we focus on using GWAS summary statistics to predict the AUC and tune parameters of PRS, we leave the prediction of covariate-incorporated AUCx and AAUC based on individual-level training data as future work.

The basic assumption of PRStuning is that the training and testing datasets are homogeneous, meaning that both datasets come from the same population and therefore share the same LD matrix and the same expected allele frequencies among controls and cases. The same assumption is needed for traditional PRS analyses that tune parameters on an independent validation dataset. If the validation and testing datasets are heterogeneous, neither the AUC estimated from the validation dataset nor the parameter selected based on it will be accurate. Without additional information about the heterogeneity between the two datasets, it is challenging to estimate AUC and tune parameters based on training or validation datasets. We note that some recent PRS methods consider multiple populations from different ancestries together, which can transfer knowledge from the European population to other populations with limited sample sizes37,38,39,40. In PRStuning, we currently focus on addressing the overfitting issue when the homogeneity assumption holds. Adjusting the selected parameter value based on additional information about the heterogeneity will be considered in our future work. Supplementary Figures 3-13 present the performance of PRStuning when the pooled allele frequency, effect size, and LD matrix differ between training and testing datasets. The figures demonstrate that PRStuning can estimate the AUC well when heterogeneity exists in the pooled allele frequency and LD matrix. However, if the heterogeneity between training and testing data lies in the effects of changing allele frequencies between controls and cases, the AUC from PRStuning will be overestimated and unreliable.

Recent research suggests that combining all PRSs under a tuning grid using ensemble methods can improve the prediction performance8,41,42,43. In the ensemble methods, an independent validation dataset is needed to estimate the weights used for combining PRSs. In PRStuning, we estimate the AUC and select the best-performing parameters for a PRS method based on the SNP weights derived from that method. If the PRS weights used in the ensemble methods have already been estimated in an individual-level validation dataset, we can combine the SNP weights in each PRS and the PRS weights to derive the ensembled SNP weights. In this situation, PRStuning can be used to predict the AUC of the PRS from the ensembled weights without another individual-level dataset. However, without an individual-level validation dataset to estimate the PRS weights used in the ensemble methods, PRStuning cannot estimate the PRS weights based solely on GWAS summary statistics from the training data.

Methods

Notations and assumptions

Based on the additive assumption, the PRS for individual i is the sum of the genotypes gi = (gi,1, …, gi,M) weighted by the corresponding effects ω = (ω1, …, ωM):

$$PR{S}_{i}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}{g}_{i,m},$$
(9)

where M is the total number of the pre-selected SNPs used for constructing PRS. Depending on the specific PRS method, not all SNPs collected in the training GWAS data are necessarily used in the PRS calculation. Note that some PRS methods incorporate steps for selecting SNPs based on their associations with the disease. Here we define the pre-selected SNPs as the SNPs used in building the PRS model before any selection step related to association strengths. LD clumping is an example of a selection step based on the observed association strength; hence, we refer to the pre-selected SNPs in C+T as the genome-wide SNPs collected in the training GWAS data. In contrast, LD pruning is a selection step unrelated to the associations of SNPs with the disease; therefore, the pre-selected SNPs in P+T are the SNPs retained after an LD pruning step. Different PRS methods have been proposed to estimate the weight vector ω = (ω1, …, ωM) from a GWAS dataset or its summary statistics for the disease of interest. Hereafter, we simply use ω to denote the effects already estimated from a PRS method.
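Eq. (9) amounts to a dot product between the dosage matrix and the weight vector; a minimal sketch (the function name is ours):

```python
import numpy as np

def polygenic_risk_score(G, omega):
    """PRS_i = sum_m omega_m * g_{i,m} (Eq. 9): each row of G holds an
    individual's allele dosages (0/1/2) over the M pre-selected SNPs."""
    return G @ omega

# Toy usage: two individuals, three SNPs
G = np.array([[0, 1, 2], [2, 2, 0]], dtype=float)
omega = np.array([0.1, -0.2, 0.3])
prs = polygenic_risk_score(G, omega)  # array([ 0.4, -0.2])
```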

Based on disease status, we divide individuals into the case and control groups. In the following, we use subscripts j = 0 and j = 1 to denote those from the control and case groups, respectively. For example, the frequencies of the reference allele for SNP m among controls and cases are denoted as f0,m and f1,m, respectively. The genotype gi,m of SNP m for an individual in the control group follows a binomial distribution Bino(2, f0,m) with mean \({\mathbb{E}}[{g}_{0,m}]=2{f}_{0,m}\) and variance \({s}_{0,m}^{2}:\!\!=Var({g}_{0,m})=2{f}_{0,m}(1-{f}_{0,m})\). Similarly, we have gi,m ~ Bino(2, f1,m) if the individual i is from the case group.

By the central limit theorem, PRS approximately follows a normal distribution in each group when the SNP number M is sufficiently large. For PRS methods involving SNP selection steps unrelated to the SNPs’ associations with the disease, such as P+T, M varies from ~10 to ~10K depending on the selection threshold. For PRS methods with genome-wide pre-selected SNPs, M ranges from ~100K to ~1M, determined by the number of SNPs genotyped or imputed in the training data. Accordingly, the PRS variables in the two groups approximately follow normal distributions:

$$PR{S}_{i} \sim \left\{\begin{array}{ll}N({\eta }_{0},\,{\tau }_{0}^{2})\quad &{{{{{{{\rm{if}}}}}}}}\,i\in {{{{{{{\rm{control}}}}}}}}\,{{{{{{{\rm{group}}}}}}}}\\ N({\eta }_{1},\,{\tau }_{1}^{2})\quad &{{{{{{{\rm{if}}}}}}}}\,i\in {{{{{{{\rm{case}}}}}}}}\,{{{{{{{\rm{group}}}}}}}}\quad \end{array}\right.,$$
(10)

where

$${\eta }_{j}=\mathop{\sum }\limits_{m=1}^{M}2{\omega }_{m}{f}_{j,m},$$
(11)

and

$${\tau }_{j}^{2}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}^{2}{s}_{j,m}^{2}+2\mathop{\sum}\limits_{{m}_{1} < {m}_{2}}{\omega }_{{m}_{1}}{\omega }_{{m}_{2}}{R}_{{m}_{1},{m}_{2}}{s}_{j,{m}_{1}}{s}_{j,{m}_{2}},$$
(12)

for j = 0 or 1. Here \({R}_{{m}_{1},{m}_{2}}\) corresponds to the correlation between SNP m1 and SNP m2, which is known as the LD coefficient.
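Because the diagonal of R is one, the double sum in Eq. (12) equals the quadratic form ωᵀSjRSjω, so Eqs. (11)-(12) can be evaluated in matrix form; a small sketch (the function name is ours):

```python
import numpy as np

def prs_moments(omega, f, R):
    """Mean and variance of PRS in one group (Eqs. 11-12), given the
    group's allele frequencies f and the LD correlation matrix R."""
    s = np.sqrt(2.0 * f * (1.0 - f))       # per-SNP genotype SDs
    eta = 2.0 * float(omega @ f)           # Eq. (11)
    S = np.diag(s)
    tau2 = float(omega @ (S @ R @ S) @ omega)  # Eq. (12) in matrix form
    return eta, tau2
```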

For a binary phenotype, we usually use AUC as the criterion for evaluating the prediction performance of PRS. AUC is defined as the area under the ROC curve, which can also be calculated as the probability that a random PRS from the case group is larger than a random PRS from the control group44. Based on this fact and the distributions of PRS, Song et al.22 formulated AUC as

$${{{{{{{\rm{AUC}}}}}}}}={{\Phi }}({{\Delta }}),$$
(13)

where

$${{\Delta }}:\!\!=\frac{{\eta }_{1}-{\eta }_{0}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}=\frac{2{\sum }_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}.$$
(14)

Here δm := f1,m − f0,m records the difference between the allele frequencies of the two groups for SNP m, and Φ(  ) is the cumulative distribution function of the standard normal distribution.
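Eqs. (13)-(14) combine into a one-line computation; the sketch below (function names are ours) writes the standard normal CDF via `math.erf` to stay dependency-free:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    # Phi(x) expressed through the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def predicted_auc(omega, delta, tau0_sq, tau1_sq):
    """AUC = Phi(Delta) with Delta = 2*sum_m omega_m*delta_m /
    sqrt(tau0^2 + tau1^2), following Eqs. (13)-(14)."""
    Delta = 2.0 * float(np.dot(omega, delta)) / sqrt(tau0_sq + tau1_sq)
    return std_normal_cdf(Delta)

# With no allele frequency differences, AUC is exactly 0.5
auc_null = predicted_auc(np.array([1.0]), np.array([0.0]), 1.0, 1.0)
```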

To calculate \({\tau }_{0}^{2}\) and \({\tau }_{1}^{2}\) in Eq. (14), we can directly plug the observed values of the allele frequencies and LD coefficients into Eq. (12), since they are not directly related to the SNP effects on the disease. We can extract allele frequencies from the summary statistics of the GWAS and use a genotyping dataset as the reference panel for extracting the LD information. Some large projects, such as the 1000 Genomes Project26 and the HapMap 3 project27, can be used to calculate the LD coefficients. We provide the details of these calculations in Section “Calculating LD from a reference panel”.

In Eq. (14), the allele frequency differences δm (m = 1, …, M) are critical. One may think of directly plugging in the observed allele frequencies \({\hat{f}}_{0,m}\) and \({\hat{f}}_{1,m}\) from the GWAS used for building the PRS model to obtain δm. However, the allele frequency differences of SNPs that exhibit large effects tend to be overestimated, and these SNPs contribute more to PRS than the SNPs showing small effects, a phenomenon known as overfitting in the machine learning community23. Overestimating the SNP effects would lead to an inflated value of the predicted AUC and incorrectly selected parameter values. Here we adopt an empirical Bayes method to reduce the influence of overfitting by shrinking the observed allele frequency differences obtained from the summary statistics of the training GWAS.

In GWAS, we usually use the z-score calculated from the allele frequency difference test to assess the association of each SNP with the disease. Since z-scores are standardized values following a standard normal distribution N(0, 1) under the null hypothesis, we will use z-scores as surrogates to derive the posterior distribution of δm. The z-score is calculated with the following formula:

$${z}_{m}=\frac{{\hat{f}}_{1,m}-{\hat{f}}_{0,m}}{\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}}},$$
(15)

where \({\hat{f}}_{j,m}\) denotes the observed allele frequency among controls (j = 0) or cases (j = 1), and \({s}_{j,m}^{2}\) is the variance of the genotypes in each group. We use n0 and n1 to denote the sample sizes of controls and cases in the GWAS, respectively. To simplify the expression, we use sm to denote the denominator of the z-score, i.e.,

$${s}_{m}:\!\!=\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}},$$
(16)

and denote s = (s1, …, sM). We use z to encode the z-scores of all the pre-selected SNPs. Based on these definitions, given the allele frequency difference δm, we have zm ∣ δm ~ N(δm/sm, 1).
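Eqs. (15)-(16) in code: a sketch computing each z-score and its denominator sm from the observed group allele frequencies and sample sizes (the function name is ours):

```python
import numpy as np

def z_and_s(f1_hat, f0_hat, n1, n0):
    """Per-SNP z-score of the allele-frequency-difference test (Eq. 15)
    and its standard error s_m (Eq. 16)."""
    s1_sq = 2.0 * f1_hat * (1.0 - f1_hat)   # genotype variance in cases
    s0_sq = 2.0 * f0_hat * (1.0 - f0_hat)   # genotype variance in controls
    s = np.sqrt(s1_sq / (4.0 * n1) + s0_sq / (4.0 * n0))
    z = (f1_hat - f0_hat) / s
    return z, s
```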

Consider the hypothetical condition in which SNP m is independent of all other SNPs; its allele frequencies among controls and cases under this condition may differ from the marginal ones. We denote these potential allele frequencies as p0,m and p1,m, respectively; note that they should be distinguished from the marginal allele frequencies f0,m and f1,m. We denote the effect of SNP m as βm = p1,m − p0,m. If the SNP has no effect on the disease, then βm = 0; for risk SNPs, βm ≠ 0. In the Supplementary Methods section, we further prove that δ = (δ1, …, δM) is related to the LD among the pre-selected SNPs and the underlying SNP effects β = (β1, …, βM) in terms of changing allele frequencies between the two groups, i.e.,

$$\delta =SR{S}^{-1}\beta .$$
(17)

We further assume that the standardized effect βm/sm follows a point-normal distribution, i.e.,

$$\frac{{\beta }_{m}}{{s}_{m}}\mathop{ \sim }\limits^{iid}(1-\pi ){\delta }_{0}+\pi N(0,\,{\sigma }^{2}).$$
(18)

Here δ0 is a point mass at zero and π represents the prior proportion of the SNPs having effects on the disease. We use σ2 to denote the variance of βm/sm in the risk SNPs. In the Supplementary Methods section, we derive the following relationship between σ2 and the heritability (\({h}_{l}^{2}\)) of the disease on the liability scale:

$${\sigma }^{2}=\frac{{N}_{e}{h}_{l}^{2}}{4M\pi }\frac{\phi {({{{\Phi }}}^{-1}(\kappa ))}^{2}}{{\kappa }^{2}{(1-\kappa )}^{2}},$$
(19)

where \({N}_{e}=\frac{4{n}_{0}{n}_{1}}{{n}_{0}+{n}_{1}}\) is the effective sample size of the GWAS, κ is the prevalence of the disease, and ϕ and Φ are the probability density function and cumulative distribution function of the standard normal distribution N(0, 1), respectively.
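Eq. (19) can be evaluated directly; a sketch in which the bisection-based normal quantile is our dependency-free stand-in for Φ−1, and the function name is ours:

```python
from math import erf, exp, pi, sqrt

def sigma2_from_heritability(h2_liability, kappa, n0, n1, M, pi_frac):
    """Prior effect variance sigma^2 implied by the liability-scale
    heritability, following Eq. (19)."""
    phi = lambda x: exp(-0.5 * x * x) / sqrt(2.0 * pi)   # standard normal pdf
    Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))     # standard normal cdf

    def Phi_inv(p, lo=-10.0, hi=10.0):
        # simple bisection inverse of the standard normal CDF
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if Phi(mid) < p:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    n_eff = 4.0 * n0 * n1 / (n0 + n1)       # effective sample size N_e
    t = Phi_inv(kappa)                       # liability threshold
    return (n_eff * h2_liability / (4.0 * M * pi_frac)) * \
           phi(t) ** 2 / (kappa ** 2 * (1.0 - kappa) ** 2)
```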

In the following two subsections, we will demonstrate how to estimate allele frequency differences in two different scenarios by reducing the effect of overfitting based on the empirical Bayes theory.

Estimating AUC on independent SNPs

First, we consider the situation in which the pre-selected SNPs used for constructing PRS are independent. For example, the pre-selected SNPs in P+T are approximately independent because they are selected after an LD pruning step.

In this scenario, we have δ = β based on Eq. (17) and the joint distribution of z-scores follows a multivariate normal distribution with the covariance matrix equaling to the identity matrix IM, i.e.,

$${{{{{{{\boldsymbol{z}}}}}}}}| {{{{\beta }}}} \sim {N}_{M}({{{{{ S}}}}}^{-1}{{{{\beta }}}},{{{{{ I}}}}}_{M}),$$
(20)

where S = diag(s) is a diagonal matrix with diagonal elements encoding the standard errors of the observed allele frequency differences.

With the point-normal prior (18) on each entry of β, the log-likelihood of the z-scores is the summation of the log-likelihood for each individual z-score, i.e.

$$\log P({{{{z}}}}| \pi,{\sigma }^{2})=\mathop{\sum }\limits_{m=1}^{M}\log P({z}_{m}| \pi,{\sigma }^{2}).$$
(21)

With this property, we can use an EM algorithm to obtain estimates of π and σ2 by maximizing the likelihood P(z ∣ π, σ2).

After getting estimates of parameters π and σ2, we can derive a closed-form solution for the posterior distribution of δm:

$${\delta }_{m}| {z}_{m} \sim (1-{h}_{m}){\delta }_{0}+{h}_{m}N(\lambda {z}_{m}{s}_{m},\lambda {s}_{m}^{2}),$$
(22)

where

$${h}_{m}=\frac{\frac{\pi }{\sqrt{1+{\sigma }^{2}}}\phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})}{(1-\pi )\phi ({z}_{m})+\frac{\pi }{\sqrt{1+{\sigma }^{2}}}\phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})}\,{{{{{{{\rm{and}}}}}}}}\,\lambda=\frac{1}{1+1/{\sigma }^{2}}.$$
(23)

Here ϕ(  ) is the probability density function of a standard normal distribution N(0, 1). Derivation details of this posterior distribution can be found in the Supplementary Methods section. With Eq. (22), we draw MC samples of δm ∣ zm and plug them in as the allele frequency differences in Eq. (14) to compute the posterior distribution of AUC. The shrinkage estimator of δm in Eq. (22) reduces the effect of overfitting. Details of the EM algorithm for estimating π, σ2, δm, and AUC are summarized in Algorithm 1.

Algorithm 1

Estimate AUC on independent SNPs

Input: z-scores z = (z1, …, zM)

Output: Estimated π, σ2, δ and AUC

1: Initialize π and σ2;

2: repeat

3: for m = 1, 2, . . . , M do

4: E step:

5: \({h}_{m}\leftarrow \frac{\pi \phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})/\sqrt{1+{\sigma }^{2}}}{(1-\pi )\phi ({z}_{m})+\pi \phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})/\sqrt{1+{\sigma }^{2}}}\)

6: M step:

7: \(\pi \leftarrow \frac{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}}{M}\)

8: \({\sigma }^{2}\leftarrow \frac{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}{z}_{m}^{2}}{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}}-1\)

9: end for

10: until π and σ2 converge

11: for m = 1, 2, …, M do

12: \({\delta }_{m} \sim (1-{h}_{m}){\delta }_{0}+{h}_{m}N(\frac{{z}_{m}{s}_{m}}{1+1/{\sigma }^{2}},\frac{{s}_{m}^{2}}{1+1/{\sigma }^{2}})\)

13: end for

14: \({{\Delta }}\leftarrow \frac{2\mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}\) and AUC ← Φ(Δ)
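The E and M steps of Algorithm 1 have the closed forms shown above; a compact sketch of the EM core (the function name, starting values, and convergence tolerance are our choices; posterior sampling of δm then follows Eq. (22)):

```python
import numpy as np

def em_point_normal(z, n_iter=500, tol=1e-8):
    """EM for the point-normal mixture on independent z-scores
    (Algorithm 1). Returns estimates of pi, sigma^2 and the per-SNP
    non-null posterior probabilities h_m."""
    pi_hat, sigma2 = 0.1, 1.0   # arbitrary starting values
    for _ in range(n_iter):
        sd = np.sqrt(1.0 + sigma2)
        # E-step: responsibility of the non-null component
        # (the 1/sqrt(2*pi) factor of the normal pdf cancels)
        num = pi_hat * np.exp(-0.5 * (z / sd) ** 2) / sd
        den = (1.0 - pi_hat) * np.exp(-0.5 * z ** 2) + num
        h = num / den
        # M-step: closed-form updates
        pi_new = float(h.mean())
        sigma2_new = max(float((h * z ** 2).sum() / h.sum()) - 1.0, 1e-8)
        converged = (abs(pi_new - pi_hat) < tol
                     and abs(sigma2_new - sigma2) < tol)
        pi_hat, sigma2 = pi_new, sigma2_new
        if converged:
            break
    return pi_hat, sigma2, h
```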

Estimating AUC on SNPs linked by LD

When the pre-selected SNPs are not filtered by the independence criterion, their genotypes may be correlated due to LD. We can estimate the LD matrix R from a publicly available genotyping reference panel.

In this scenario, we have δ = SRS−1β based on Eq.(17) and the conditional joint distribution of the z-scores given β is

$${{{{{{{\boldsymbol{z}}}}}}}}| {{{{\beta }}}} \sim N({{{{R}}}}{{{{{S}}}}}^{-1}{{{{\beta }}}},{{{{R}}}}),$$
(24)

where S = diag(s) is a diagonal matrix encoding the standard errors of observed allele frequency differences.

We use the same point-normal prior (18) on each entry of β as in the independent-SNP scenario. There are two unknown parameters, π and σ2, in the prior distribution. We intend to use maximum likelihood estimation (MLE) to estimate them based on the observed z-scores. However, due to the extremely large number of component combinations (i.e., \({2}^{M}\)), the joint likelihood of z-scores P(z ∣ π, σ2) is intractable. Here we use a Gibbs-sampling-based State-Augmentation for Marginal Estimation (SAME) algorithm to find the maximizer of the likelihood in a stochastic manner25.

Let γm ∈ {0, 1} (m = 1, …, M) denote whether SNP m has an effect on the disease, and let γ = (γ1, …, γM). In the SAME algorithm, instead of evaluating the original likelihood, we assess the likelihood of the augmented data P(z, β, γ ∣ π, σ2). With flat priors on π and σ2, we derive a Gibbs sampler for sampling the full parameters β, γ, π, and σ2 with the joint probability proportional to the augmented-data likelihood. We leave the derivation details to the Supplementary Methods section.

By making some simple changes to the originally derived sampler, we can obtain another Gibbs sampler for simultaneously sampling π, σ2, and D artificial replicates of the nuisance parameters \({\{{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)\}}_{d=1}^{D}\), whose joint probability is proportional to

$${q}_{D}\left(\pi,\,{\sigma }^{2},\,{\{{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)\}}_{d=1}^{D}| {{{{{{{\boldsymbol{z}}}}}}}}\right)\propto \mathop{\prod }\limits_{d=1}^{D}P\left({{{{z}}}},\,{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)| \pi,{\sigma }^{2}\right).$$
(25)

Based on this probability, the generated replicates of {β, γ} in the sampler are conditionally independent. With this new sampler, the marginal probability of (π, σ2) can be calculated by integrating/summing over all replicates of {β, γ}:

$${q}_{D}\left(\pi,{\sigma }^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)= {\int}_{{{{{\beta (D)}}}}}\mathop{\sum}\limits_{{{{{\gamma (D)}}}}}\ldots {\int}_{{{{{\beta (1)}}}}}\mathop{\sum}\limits_{{{{{\gamma (1)}}}}}{q}_{D}\left(\pi,\, {\sigma }^{2},\, {\{{{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)\}}_{d=1}^{D}| {{{{{{{\boldsymbol{z}}}}}}}}\right)d{{{{\beta }}}}(1)\ldots d{{{{\beta }}}}(D)\\ \propto {\int}_{{{{{\beta (D)}}}}}\mathop{\sum}\limits_{{{{{\gamma (D)}}}}}\ldots {\int}_{{{{{\beta (1)}}}}}\mathop{\sum}\limits_{{{{{\gamma (1)}}}}}\mathop{\prod }\limits_{d=1}^{D}P\left({{{{{{{\boldsymbol{z}}}}}}}},\, {{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)| \pi,\, {\sigma }^{2}\right)d{{{{\beta }}}}(1)\ldots d{{{{\beta }}}}(D)\\= \mathop{\prod }\limits_{d=1}^{D}\left({\int}_{{{{{\beta (d)}}}}}\mathop{\sum}\limits_{{{{{\gamma (d)}}}}}P\left({{{{{{{\boldsymbol{z}}}}}}}},\, {{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)| \pi,\, {\sigma }^{2}\right)d\beta (d)\right)\\= P{({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\, {\sigma }^{2})}^{D}.$$

In other words, (π, σ2) is actually sampled from \({q}_{D}\left(\pi,\,{\sigma }^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)\propto P{({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\,{\sigma }^{2})}^{D}\) in the sampler. We further denote \((\hat{\pi },\,{\hat{\sigma }}^{2})=\arg \mathop{\max }\limits_{(\pi,{\sigma }^{2})}P({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\,{\sigma }^{2})\) and \((\tilde{\pi },\,{\tilde{\sigma }}^{2})\) as another set of parameters. If we let D increase to infinity, the relative probability of sampling \((\tilde{\pi },\,{\tilde{\sigma }}^{2})\) compared to sampling \((\hat{\pi },{\hat{\sigma }}^{2})\) will become

$$\frac{{q}_{D}\left(\tilde{\pi },\,{\tilde{\sigma }}^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)}{{q}_{D}\left(\hat{\pi },\,{\hat{\sigma }}^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)}={\left(\frac{P({{{{{{{\boldsymbol{z}}}}}}}}| \tilde{\pi },\,{\tilde{\sigma }}^{2})}{P({{{{{{{\boldsymbol{z}}}}}}}}| \hat{\pi },\,{\hat{\sigma }}^{2})}\right)}^{D}\xrightarrow{D\to \infty }0.$$
(26)

Therefore, the sampled (π, σ2) will converge to their maximum likelihood estimates \((\hat{\pi },{\hat{\sigma }}^{2})\) in the end.

Given their estimates, the Gibbs sampler in the SAME algorithm can provide MC samples of the nuisance parameters {β, γ} with probability \(P({{{{\beta }}}},{{{{\gamma }}}}| {{{{{{{\boldsymbol{z}}}}}}}},\hat{\pi },{\hat{\sigma }}^{2})\). With them, we can also obtain MC samples of δ = SRS−1β and the corresponding AUC based on Eq. (13). The complete Gibbs-sampling-based SAME algorithm for estimating π, σ2, δm, and AUC is summarized in Algorithm 2.

Algorithm 2

Estimate AUC on SNPs linked by LD

Input: z-scores z = (z1, …, zM)

Output: Estimated π, σ2, δ and AUC

Initialize π, σ2, γm ~ Bernoulli(π) and βm ~ (1 − γm)δ0 + γmN(0, σ2) for m = 1…M

D ← 1

\(\lambda \leftarrow \frac{1}{1+1/{\sigma }^{2}}\)

repeat

for d ← 1 to D do

for m ← 1 to M do

If γm = 0, βm ← 0

\({\mu }_{m}\leftarrow \lambda ({z}_{m}-{\sum }_{m{\prime} \ne m}\frac{{R}_{mm{\prime} }{\beta }_{m{\prime} }}{{s}_{m{\prime} }})\)

If γm = 1, sample \({\beta }_{m} \sim N({s}_{m}{\mu }_{m},\lambda {s}_{m}^{2})\)

\({r}_{m}\leftarrow \pi \sqrt{\frac{\lambda }{{\sigma }^{2}}}\exp \left(\frac{{\mu }_{m}^{2}}{2\lambda }\right)\)

\({h}_{m}\leftarrow \frac{{r}_{m}}{(1-\pi )+{r}_{m}}\)

Sample γm ~ Bernoulli(hm)

end for

β(d) ← β and γ(d) ← γ

end for

Sample \(\pi \sim {{\rm{Beta}}}\left(\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D,\,MD-\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D\right)\)

Sample \({\sigma }^{-2} \sim {{\rm{Gamma}}}\left(\frac{1}{2}\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D,\,\frac{1}{2}\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\beta }_{m}{(d)}^{2}{\gamma }_{m}(d)\right)\)

D ← D + 1

until (π, σ2) converge.

\(\delta \leftarrow SR{S}^{-1}\beta\), \({{\Delta }}\leftarrow \frac{2\mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}\) and AUC ← Φ(Δ)
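The inner loop of Algorithm 2 can be sketched as follows. This is a minimal numpy illustration, assuming standardized effects (sm = 1 for all m) and omitting the augmented replicates d = 1, …, D and the (π, σ2) updates; the variable names are illustrative and not taken from the authors' implementation.

```python
# One inner Gibbs sweep over the M SNPs of a spike-and-slab model,
# simplified from Algorithm 2 with s_m = 1 for all SNPs.
import numpy as np

def gibbs_sweep(z, R, beta, gamma, pi, sigma2, rng):
    """Update (beta, gamma) in place given z-scores z and LD matrix R."""
    M = len(z)
    lam = 1.0 / (1.0 + 1.0 / sigma2)   # posterior variance of a causal effect
    for m in range(M):
        if gamma[m] == 0:
            beta[m] = 0.0
        # residualized z-score: remove the LD contributions of all other SNPs
        mu = lam * (z[m] - (R[m] @ beta - R[m, m] * beta[m]))
        if gamma[m] == 1:
            beta[m] = rng.normal(mu, np.sqrt(lam))
        # spike-and-slab inclusion odds and probability
        r = pi * np.sqrt(lam / sigma2) * np.exp(mu**2 / (2.0 * lam))
        h = r / ((1.0 - pi) + r)
        gamma[m] = rng.binomial(1, h)
    return beta, gamma
```

A production version would vectorize the residual update and guard the exponential against overflow for very large z-scores; the loop above mirrors the pseudocode one statement at a time for readability.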

Calculating LD from a reference panel

Algorithm 2 requires the LD matrix among the pre-selected SNPs as input. Some projects, such as the 1000 Genomes Project26 and the HapMap 3 project27, have released individual-level genotype data, which can serve as reference panels for extracting the LD matrix. In our method, we chose the 1000 Genomes Project as the default reference panel because of its larger sample size. Note that most PRS methods compute weights on the SNPs genotyped in the HapMap 3 project (HM3 SNPs), because they constitute a set of commonly used tag SNPs that are usually well imputed across different GWAS. To obtain reliable LD estimates and to reduce the computational cost of Algorithm 2, we only included HM3 SNPs from the reference panel in our experiments.
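A hypothetical sketch of the SNP-matching step this implies: keep only the HM3 SNPs that appear in both the GWAS summary statistics and the reference panel, so that the LD matrix is computed on exactly the SNPs carrying z-scores. All identifiers below are illustrative, not taken from the authors' code.

```python
# Intersect GWAS SNPs with reference-panel and HM3 SNP sets, preserving
# the GWAS ordering so z-scores and LD matrix rows stay aligned.
def match_snps(gwas_snps, panel_snps, hm3_snps):
    """Return the ordered list of GWAS SNPs usable for LD extraction."""
    usable = set(panel_snps) & set(hm3_snps)
    return [rsid for rsid in gwas_snps if rsid in usable]

shared = match_snps(
    gwas_snps=["rs1", "rs2", "rs3", "rs4"],   # toy rsIDs for illustration
    panel_snps=["rs2", "rs3", "rs4", "rs9"],
    hm3_snps=["rs1", "rs2", "rs4"],
)
# shared == ["rs2", "rs4"]
```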

We note that LD between SNPs tends to decay as the distance between them increases45; the genotypes of SNPs that are far apart are approximately independent. We therefore use LDetect to divide the whole genome into approximately independent blocks46. For human genomes of European ancestry, LDetect partitions the genome into a total of 1703 blocks.
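Given precomputed block boundaries, assigning each SNP to its block reduces to a binary search over the block start positions on its chromosome. The sketch below assumes the LDetect boundaries are available as a sorted list of per-chromosome start coordinates; the positions shown are made up for illustration.

```python
# Assign a SNP to its approximately independent LD block by base-pair
# position, using binary search over sorted block start positions.
import bisect

def assign_block(position, block_starts):
    """Index of the block whose interval contains `position`."""
    return bisect.bisect_right(block_starts, position) - 1

block_starts = [0, 1_000_000, 2_500_000]   # toy boundaries for one chromosome
assert assign_block(1_200_000, block_starts) == 1
```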

Within each partitioned block, the correlation matrix among the SNP genotypes must be estimated as an input. Many methods have been proposed to estimate the SNP covariance matrix47,48,49, but most are sensitive to the structure of the covariance matrix or to the distribution of the sample data. The Ledoit-Wolf estimator, by contrast, does not rely on assumptions about the covariance structure or the sample distribution49. In our method, we first standardized the genotypes in the reference panel and then applied the Ledoit-Wolf estimator to the standardized genotypes to obtain the correlation matrix.
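The Ledoit-Wolf estimator shrinks the sample correlation matrix toward a scaled identity with a data-driven intensity. The numpy-only sketch below follows the standard Ledoit-Wolf construction on standardized genotypes; it is an illustration rather than the authors' implementation (scikit-learn's `sklearn.covariance.LedoitWolf` provides an equivalent, optimized version).

```python
# Ledoit-Wolf shrinkage of the per-block SNP correlation matrix:
# a convex combination of the sample correlation S and a scaled identity,
# with the shrinkage intensity estimated from the data.
import numpy as np

def ledoit_wolf_corr(X):
    """X: n x p genotype matrix; returns the shrunk correlation matrix."""
    n, p = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each SNP
    S = X.T @ X / n                                # sample correlation matrix
    m = np.trace(S) / p                            # scale of the identity target
    d2 = np.linalg.norm(S - m * np.eye(p)) ** 2 / p
    # average squared deviation of per-sample outer products around S
    b2_bar = sum(np.linalg.norm(np.outer(x, x) - S) ** 2 for x in X) / (n**2 * p)
    b2 = min(b2_bar, d2)                           # cannot shrink past the target
    rho = b2 / d2 if d2 > 0 else 1.0               # shrinkage intensity in [0, 1]
    return rho * m * np.eye(p) + (1.0 - rho) * S

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 8)).astype(float)  # toy 0/1/2 genotype matrix
C = ledoit_wolf_corr(G)
```

Because the result is a convex combination of two positive semi-definite matrices, the shrunk estimate is always well-conditioned, which matters when it is later inverted or factorized inside the Gibbs updates.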

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.