Introduction

The advent of genome-wide association studies (GWAS) has led to the discovery of numerous loci associated with the most common diseases1. These discoveries also provide the opportunity for predicting risks from an individual’s genotypes2. Accurate genetic risk prediction can enable us to identify high-risk individuals and facilitate disease prevention and early treatment3.

Polygenic risk score (PRS) is commonly used in genetic risk prediction due to its simplicity, which results from the additive assumption. Both empirical and theoretical studies have shown that the additive component is expected to account for most of the genetic variance of complex traits4. Under this additive assumption, a PRS sums the allele dosages of single nucleotide polymorphisms (SNPs) weighted by their estimated effect sizes5.

Various PRS methods have been proposed to estimate the effect sizes of SNPs from a GWAS dataset. Compared to individual-level genotype data, summary statistics are more accessible and raise fewer security and privacy concerns6,7. Many recently proposed PRS methods estimate SNP effects with GWAS summary statistics. One of the simplest is clumping and thresholding (C+T)8,9,10,11,12,13,14, in which linkage disequilibrium (LD) clumping is applied to the SNPs that pass a p-value threshold. A related method is pruning and thresholding (P+T), which only includes the SNPs whose p-values pass a threshold after LD pruning. Both LD clumping and LD pruning are step-wise heuristic procedures that select a set of approximately independent SNPs. Unlike LD pruning, LD clumping selects the independent SNPs after p-value thresholding, so SNPs showing stronger associations with the disease are preserved, which is preferred in constructing a PRS. We note that some literature refers to C+T as P+T, but we treat them as distinct methods in the following discussion.

It is important to note that for both C+T and P+T, only a portion of independent SNPs is utilized in constructing the PRS model, while other SNPs and LD information are ignored. To further improve the prediction accuracy of genetic risks, many PRS methods have been proposed to incorporate genome-wide SNPs and their LD information, such as LDpred15, LDpred216, sBayesR17, PRS-CS18, and SDPR19. LDpred imposes a point-normal prior on the SNP effect sizes and infers the posterior mean effect sizes using a Markov chain Monte Carlo (MCMC) procedure. LDpred2 was further proposed to increase computational efficiency and provide more stable results than LDpred when dealing with long-range LD regions and traits with sparse genetic architectures. To allow more general effect size distributions, sBayesR performs Bayesian posterior inference based on a mixture prior of a point mass and three normal distributions that represent SNPs with small, medium, and large effects, respectively. SDPR performs Bayesian posterior inference based on a Dirichlet process, modeling effect sizes with a mixture of 1000 normal distributions. To reduce the computational burden from combining different components over millions of SNPs, PRS-CS places a continuous shrinkage prior on the SNP effect sizes in a Bayesian framework. All these LD-based methods have demonstrated superior performance on some datasets of complex diseases; however, none of them dominates the others across all settings.

Among these PRS methods, P+T, C+T, LDpred, and LDpred2 rely on parameters that need to be specified by users beforehand. Although PRS-CS and sBayesR have options to estimate parameters with an additional layer of prior distributions, users can also specify the parameters themselves. For all PRS methods that require tuning parameters, an external individual-level genotype dataset is needed to evaluate different parameter values and choose the best-performing ones. However, as mentioned before, individual-level genotype data are less accessible than summary statistics. Moreover, it is inefficient to leave out a portion of the data just for tuning parameters and to estimate SNP effects with the remaining data, as this leads to information loss and reduced performance of the PRS methods. These concerns motivated us to develop a method that can evaluate the performance of a PRS model based on the summary statistics used for model training.

For diseases with a binary phenotype, the area under the receiver operating characteristic (ROC) curve (AUC) is the most commonly used criterion in practice for evaluating PRS5,20,21. In 2018, Song et al.22 proposed an estimator of AUC using only summary statistics. This method makes use of an equivalent definition of AUC, i.e. the probability of a PRS from a random case being larger than a PRS from a random control. Based on this definition, AUC can be approximated by a function of the GWAS summary statistics. This method can tune the parameters of a PRS model with summary statistics from another GWAS.
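The equivalent probability definition of AUC used by Song et al. can be illustrated with a short sketch (ours, for illustration only): the AUC equals the fraction of case-control pairs in which the case has the larger PRS, with ties counted as one half.

```python
import numpy as np

def auc_prob(prs_cases, prs_controls):
    """AUC as P(PRS of a random case > PRS of a random control), ties counted as 1/2."""
    cases = np.asarray(prs_cases, dtype=float)
    controls = np.asarray(prs_controls, dtype=float)
    # all pairwise case-minus-control differences
    diff = cases[:, None] - controls[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
```

For example, perfectly separated scores give `auc_prob([2, 3], [0, 1]) == 1.0`, while identical scores give 0.5, matching the usual ROC-based computation.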

To maximize the power of identifying loci associated with common diseases, some large consortia have conducted meta-analyses of all accessible studies and released the resulting summary statistics. These summary statistics are usually used as training data to optimize the prediction power of PRS models. In this situation, it is difficult to gain access to summary statistics from another independent GWAS. This problem cannot be well addressed by simply plugging the summary statistics from the training data into the derived AUC function, because the variants with larger effects tend to have their effect sizes overestimated, and these variants have a larger influence on the PRS than the variants exhibiting small effects. This phenomenon is known as overfitting23. If we use the observed effects directly, the overfitting leads to an inflated predicted value of the AUC and incorrectly selected parameter values.

Built on Song’s method, we propose PRStuning, a method that requires only summary statistics from the training data to predict the conventional AUC that needs to be evaluated on another individual-level genotype dataset. We incorporate empirical Bayes (EB) theory to shrink the effect sizes of SNPs, which leads to the attenuation of the predicted AUC so as to overcome the overfitting phenomenon24. In PRStuning, we adopt a point-normal mixture model as the prior distribution of SNP effects and estimate the parameters in the model with GWAS summary statistics from the training data. There are two settings depending on the dependency across the selected SNPs used for training the PRS model. When the SNPs are independent, e.g., the SNPs used in P+T, we utilize an expectation-maximization (EM) algorithm to estimate the parameters in the prior distribution and calculate the posterior distribution of the AUC based on a closed-form formula. When SNPs are dependent due to LD, we use a Gibbs-sampling-based State-Augmentation for Marginal Estimation (SAME) algorithm25 to estimate the parameters in the model and obtain the Monte-Carlo (MC) samples of the predicted AUC. Once this is accomplished, we can select the parameter values for the PRS method with the best predicted AUC.

We applied PRStuning to GWAS datasets of four common diseases, including coronary artery disease (CAD), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and breast cancer (BC), with four PRS methods, namely P+T, C+T, LDpred, and LDpred2. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict the PRS performance across PRS methods and parameters, and it can help with parameter selections.

Results

Overview of PRStuning

Define gi,m ∈ {0, 1, 2} as the genotype score of SNP m for individual i. The PRS for individual i is the sum of the genotypes gi = (gi,1, …, gi,M) weighted by the corresponding effects ω = (ω1, …, ωM), i.e.,

$$PR{S}_{i}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}{g}_{i,m}.$$
(1)

Here M is the total number of pre-selected SNPs used for constructing the PRS. Please note that not all SNPs collected in the training GWAS data are necessarily used in the PRS calculation. Some PRS methods, such as P+T, select SNPs based on criteria unrelated to association strengths. For those methods, we just need to consider the selected SNPs in estimating the AUC. However, some other PRS methods incorporate SNP selection steps based on the associations of the SNPs with the disease, resulting in inflation of their observed association effects8,16,17. For those methods, we consider the SNPs used before the selection step to address the effect size inflation issue with the Empirical-Bayes-based method introduced later. Here we define the pre-selected SNPs as the SNPs used in building the PRS model before running any selection step related to association strengths. For example, the pre-selected SNPs in C+T are actually the genome-wide SNPs collected in the training GWAS data, and the LD clumping procedure used in C+T is a selection step based on the observed association strength. In this situation, we have ωm = 0 for SNPs not selected for building the PRS in C+T. In contrast, LD pruning is a selection step unrelated to SNP associations with the disease. Therefore, the pre-selected SNPs in P+T are the SNPs selected after an LD pruning step. Different PRS methods have been proposed to estimate the weight vector ω from a GWAS dataset or its summary statistics for the disease of interest. Hereafter we regard ω as the values inferred by the PRS method of interest.
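Eq. (1) is a weighted sum over SNPs, which for a batch of individuals is simply a genotype-matrix-by-weight product. A minimal sketch (ours; the function name is hypothetical):

```python
import numpy as np

def polygenic_risk_score(G, omega):
    """PRS for each individual: weighted sum of allele dosages (Eq. 1).

    G     : (n_individuals, M) array of genotype scores in {0, 1, 2}
    omega : (M,) array of per-SNP effect-size weights
    """
    G = np.asarray(G, dtype=float)
    return G @ np.asarray(omega, dtype=float)
```

SNPs excluded by a selection step simply carry a weight of zero, as in the C+T description above.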

Based on the definition of AUC and the distribution of PRS, Song22 formulated AUC as

$$\mathrm{AUC}=\Phi(\Delta),$$
(2)

where

$$\Delta:=\frac{2\sum_{m=1}^{M}\omega_{m}\delta_{m}}{\sqrt{\tau_{0}^{2}+\tau_{1}^{2}}}\quad\mathrm{and}\quad\tau_{j}^{2}=\sum_{m=1}^{M}\omega_{m}^{2}s_{j,m}^{2}+2\sum_{m_{1}<m_{2}}\omega_{m_{1}}\omega_{m_{2}}R_{m_{1},m_{2}}s_{j,m_{1}}s_{j,m_{2}},$$
(3)

where j = 0 indicates controls and j = 1 indicates cases. Here for SNP m, we use fj,m to denote the frequency of the reference allele, \({s}_{j,m}^{2}:=2{f}_{j,m}(1-{f}_{j,m})\) to denote the variance of the genotype, and δm := f1,m − f0,m to record the difference between the allele frequencies of the cases and controls; Φ(·) is the cumulative distribution function of the standard normal distribution. We use \({R}_{{m}_{1},{m}_{2}}\) to denote the LD coefficient between SNPs m1 and m2.

We can calculate \({\tau }_{j}^{2}\) (j = 0, 1) by directly plugging in the observed values of allele frequencies and LD coefficients since \({\tau }_{j}^{2}\) is not directly related to the SNPs’ effects on the disease. The observed allele frequencies can be obtained from summary statistics of the GWAS, and LD information can be extracted from another genotype dataset. Some large projects such as the 1000 Genomes project (1KG)26 and the HapMap3 project (HM3)27 have made their data publicly available and we can use them as reference panels to calculate the LD coefficients.
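Because R has a unit diagonal, the expression for \({\tau }_{j}^{2}\) in Eq. (3) collapses to the quadratic form (ω ∘ sj)ᵀ R (ω ∘ sj). A small sketch (ours; the function name is hypothetical) computing it from observed allele frequencies and a reference-panel LD matrix:

```python
import numpy as np

def tau_squared(omega, f_j, R):
    """Variance of the PRS in group j (Eq. 3), written as a quadratic form.

    omega : (M,) SNP weights
    f_j   : (M,) reference-allele frequencies in group j
    R     : (M, M) LD coefficient matrix (unit diagonal)
    """
    s_j = np.sqrt(2.0 * f_j * (1.0 - f_j))   # per-SNP genotype standard deviations
    v = omega * s_j
    return float(v @ R @ v)
```

With R equal to the identity matrix this reduces to the first sum in Eq. (3), i.e., the no-LD case.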

For δm in Eq. (3), if we directly plug in the observed allele frequencies \({\hat{f}}_{0,m}\) and \({\hat{f}}_{1,m}\) from GWAS, the SNPs exhibiting large allele frequency differences tend to have their effect sizes overestimated, and these SNPs have larger contributions to the PRS than the SNPs showing smaller effects. The overfitting of the SNP effects would lead to an inflated predicted value of the AUC and incorrectly selected values of the parameters. Therefore, we adopt an Empirical Bayes method in PRStuning to shrink the effects so as to reduce the influence of overfitting. In the Supplementary Methods section, we provide a theoretical demonstration of how overfitting happens and the rationale of alleviating overfitting with a Bayes estimator.

In GWAS, z-scores from the allele frequency difference test are usually used to assess the association of each SNP with the disease. Each z-score is calculated with the following formula:

$$z_{m}=\frac{\hat{f}_{1,m}-\hat{f}_{0,m}}{\sqrt{s_{1,m}^{2}/4n_{1}+s_{0,m}^{2}/4n_{0}}},$$
(4)

where \({\hat{f}}_{j,m}\) is the observed allele frequency in each group, \({s}_{j,m}^{2}\) is the variance of the genotype in the controls or cases, and n0, n1 are the sample sizes of the two groups. To simplify this expression, we define \({s}_{m}:=\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}}\). Based on this definition, we have zm ∣ δm ~ N(δm/sm, 1) given the allele frequency difference δm.
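Eq. (4) can be written directly as a vectorized computation over SNPs; the sketch below is ours for illustration:

```python
import numpy as np

def af_test_z(f1_hat, f0_hat, n1, n0):
    """z-score of the allele frequency difference test (Eq. 4).

    f1_hat, f0_hat : observed reference-allele frequencies in cases / controls
    n1, n0         : case / control sample sizes
    """
    s1_sq = 2.0 * f1_hat * (1.0 - f1_hat)  # genotype variance among cases
    s0_sq = 2.0 * f0_hat * (1.0 - f0_hat)  # genotype variance among controls
    s_m = np.sqrt(s1_sq / (4.0 * n1) + s0_sq / (4.0 * n0))
    return (f1_hat - f0_hat) / s_m
```

Equal observed frequencies give a z-score of zero, and a larger frequency difference or a larger sample size produces a larger absolute z-score.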

Here we denote the allele frequencies among controls and cases when SNP m is assumed to be independent of other SNPs as p0,m and p1,m, respectively. Note that fj,m is the allele frequency of SNP m marginalized over other SNPs, which is different from pj,m (j = 0, 1). We use βm to denote the underlying effect of SNP m in terms of changing allele frequencies between controls and cases, i.e., βm = p1,m − p0,m. If SNP m has no risk effect on the disease, we have βm = 0. Let β = (β1, …, βM). In the Supplementary Methods section, we further demonstrate that the marginalized allele frequency differences δ = (δ1, …, δM) are related to the LD pattern among the pre-selected SNPs and β, i.e.,

$$\delta=SRS^{-1}\beta,$$
(5)

where S is a diagonal matrix with the m-th diagonal element equal to sm, and R is the LD coefficient matrix. Given δ, the joint distribution of the z-scores z = (z1, …, zM) is

$$\boldsymbol{z}\,|\,\delta\sim N(S^{-1}\delta,\,R).$$
(6)
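Eqs. (5) and (6) together describe a generative model for the z-scores given the underlying effects. The following sketch (ours; not the PRStuning implementation) draws one z-score vector, avoiding explicit construction of the diagonal matrix S:

```python
import numpy as np

def sample_z(beta, s, R, rng):
    """One draw from Eqs. (5)-(6): delta = S R S^{-1} beta, z | delta ~ N(S^{-1} delta, R).

    beta : (M,) underlying effects
    s    : (M,) standard errors s_m (diagonal of S)
    R    : (M, M) LD matrix
    rng  : numpy random Generator
    """
    delta = s * (R @ (beta / s))       # S R S^{-1} beta, elementwise via the diagonal of S
    return rng.multivariate_normal(delta / s, R)  # mean S^{-1} delta, covariance R
```

Note that the mean of z simplifies to R(β/s), so with R equal to the identity the z-scores are centered at the standardized effects, consistent with the independent-SNP case.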

We further assume that the standardized effect βm/sm follows a point-normal distribution, i.e.,

$$\frac{\beta_{m}}{s_{m}}\overset{iid}{\sim}(1-\pi)\delta_{0}+\pi N(0,\sigma^{2}).$$
(7)

Here δ0 is a point mass at zero, π represents the prior proportion of SNPs that have an effect on the disease, and σ2 is the variance of βm/sm in the risk SNPs. This point-normal distribution is also used in LDpred as the prior distribution. The relationship between σ2 and the heritability of the disease is presented in Section “Notations and assumptions” and the Supplementary Methods section. With this assumption, we derived an expectation-maximization (EM) algorithm to estimate (π, σ2) and calculated the posterior distribution of the AUC when pre-selected SNPs are independent. When SNPs are linked by LD, we derived a Gibbs-sampling-based SAME algorithm to estimate (π, σ2) and obtained the MC samples of the predicted AUC. Once this is accomplished, we can select the parameter values for the PRS method with the best predicted AUC. Details of PRStuning are presented in Section “Methods”.
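In the independent-SNP setting, the marginal distribution of each z-score under Eq. (7) is a two-component mixture (1 − π)N(0, 1) + πN(0, 1 + σ²), so (π, σ²) can be fitted by a standard EM iteration. The sketch below is our simplified illustration of such an EM, not the PRStuning implementation:

```python
import numpy as np

def norm_pdf(z, var):
    """Density of N(0, var) evaluated at z."""
    return np.exp(-z**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def em_point_normal(z, n_iter=200):
    """EM for (pi, sigma^2) under z_m ~ (1 - pi) N(0, 1) + pi N(0, 1 + sigma^2)."""
    z = np.asarray(z, dtype=float)
    pi, sig2 = 0.1, 1.0                      # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each SNP is a risk SNP
        null = (1.0 - pi) * norm_pdf(z, 1.0)
        alt = pi * norm_pdf(z, 1.0 + sig2)
        gamma = alt / (null + alt)
        # M-step: update mixing proportion and risk-component variance
        pi = gamma.mean()
        sig2 = max(gamma @ z**2 / gamma.sum() - 1.0, 1e-6)
    return pi, sig2
```

The posterior responsibilities from the E-step also give the shrunken (posterior mean) effects used to attenuate the predicted AUC.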

Simulation experiments

For our simulation experiments, we considered predicting the performance and tuning the parameters of four commonly used PRS methods, namely P+T, C+T, LDpred, and LDpred2. In the experiments, we varied the p-value thresholds for P+T and C+T over {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6}. In P+T, a p-value threshold of 1 means that no further p-value-based filtering is applied to the pre-selected approximately independent SNPs after LD pruning. In C+T, a p-value threshold of 1 means we conduct LD clumping on genome-wide SNPs without p-value-based filtering. For LDpred, we chose the proportion of risk SNPs π from {1, 3e − 1, 1e − 1, 3e − 2, 1e − 2, 3e − 3, 1e − 3, 3e − 4, 1e − 4, 3e − 5, 1e − 5}, which is the default setting of LDpred. Because LDpred2 had convergence issues when the risk SNP proportion was set to an extremely small value in simulations based on simulated genotype data, we varied π over {1, 6e − 1, 3e − 1, 1e − 1, 6e − 2, 3e − 2, 1e − 2}, a set with a smaller range but finer resolution than that used for LDpred. For simulations based on real genotype data, we used the same parameter values as for LDpred.

There are two purposes of our method: to predict the AUC and to select tuning parameters. In our experiments, we used another independent dataset with individual-level genotype data as testing data. The AUC of the PRS assessed on the testing data and the parameters showing the best prediction performance on the testing data were treated as benchmarks. We evaluated the performance of PRStuning with two measures: the correlation of the AUC estimates (ρAUC) and the relative difference of the highest AUC estimates (rdAUC). We define ρAUC as the correlation between the PRStuning-predicted AUC values and those estimated on the testing data. A high value of ρAUC indicates that the AUC predicted by our method is highly correlated with the AUC on the testing data. We define rdAUC as the relative difference between the predicted AUC with the best-performing parameter tuned by PRStuning and the AUC with the best-performing parameters on the testing data. Here best-performing parameters are defined as those achieving the highest AUC values. A small value of rdAUC indicates that the tuning parameter selected by PRStuning and the actual best-performing parameter have comparable performance. These two metrics are complementary: ρAUC measures how well the AUC patterns across parameter values for PRStuning and the testing data align with each other, while rdAUC measures the point difference between the highest AUC values for the two. Therefore, we evaluate the results with both metrics.
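Under one plausible reading of these definitions (our interpretation; in particular, we take rdAUC to compare the testing AUC at the PRStuning-selected parameter against the best testing AUC), the two measures can be computed as:

```python
import numpy as np

def rho_auc(auc_pred, auc_test):
    """Correlation of predicted and testing AUC across parameter values."""
    return float(np.corrcoef(auc_pred, auc_test)[0, 1])

def rd_auc(auc_pred, auc_test):
    """Relative difference between the testing AUC at the PRStuning-selected
    parameter and the highest testing AUC."""
    selected = int(np.argmax(auc_pred))     # parameter PRStuning would pick
    best = float(np.max(auc_test))          # actual best testing AUC
    return abs(auc_test[selected] - best) / best
```

When PRStuning selects the same parameter as the testing benchmark, rdAUC is exactly zero regardless of how the remaining AUC values differ.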

We first consider the case where the pre-selected SNPs are independent. In our simulations, we set the prevalence of the disease to κ = 1%. For each SNP, we simulated its allele frequency in the general population based on a uniform distribution U(0.05, 0.95). Then we generated its risk effects on the disease based on the two-component mixture model Eq. (7), in which we set the proportion of the risk SNPs to π = 0.05 and the variance of the risk effects to σ2 = 0.001n. Here n is the total sample size of the GWAS used in the training data. We assume the GWAS is balanced with an equal number of cases and controls. According to the central limit theorem, we have \({s}_{m}\propto 1/\sqrt{n}\). Hence it is reasonable to assume σ2n.

In total, we simulated M = 10,000 independent SNPs and varied the sample size of the training GWAS from 4000 to 10,000 to explore the performance trend across different sample sizes. Each sample size setting was replicated 50 times, and for each replication, we simulated an additional 1000 cases and 1000 controls as testing data. We used the AUC evaluated on the testing data as the benchmark, and compared the AUC predicted by PRStuning and the unadjusted AUC obtained by directly plugging in the training summary statistics with the benchmark. Since all SNPs are independent, we only considered P+T as the PRS method.
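At the summary-statistics level, this generative model can be sketched directly in terms of z-scores: standardized effects βm/sm are drawn from the point-normal prior of Eq. (7), and the observed z-scores add unit sampling noise. The code below is our shortcut illustration of that model (simulating z-scores directly rather than individual genotypes); names and defaults are ours:

```python
import numpy as np

def simulate_summary_stats(M=10_000, n=4_000, pi=0.05, sigma2=None, seed=0):
    """Simulate GWAS z-scores for M independent SNPs (sigma^2 = 0.001 * n by default,
    mirroring the sigma^2 proportional to n assumption in the text)."""
    rng = np.random.default_rng(seed)
    if sigma2 is None:
        sigma2 = 0.001 * n
    # standardized true effects beta_m / s_m from the point-normal prior (Eq. 7)
    is_risk = rng.random(M) < pi
    std_effect = np.where(is_risk, rng.normal(0.0, np.sqrt(sigma2), M), 0.0)
    # observed z-scores: truth plus unit sampling noise
    z = std_effect + rng.standard_normal(M)
    return z, std_effect
```

Null SNPs then have z-scores with variance close to one, while risk SNPs show inflated variance, which is what the EM fit exploits.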

Figure 1 shows the boxplots of AUC values corresponding to different p-value thresholds and sample sizes of training data for P+T. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC calculated from testing data, and the unadjusted AUC obtained by directly plugging in the training summary statistics, respectively. As expected, the unadjusted AUC estimates were inflated compared to the benchmark due to the overfitting problem. In contrast, with the same summary statistics from the training data, PRStuning was able to shrink the estimates of allele frequency differences and produce AUC estimates comparable to those from the testing data.

Fig. 1: AUC boxplots for P+T in the simulation experiments with independent SNPs.
figure 1

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the p-value threshold from {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results. Source data are provided as a Source Data file.

In order to further demonstrate the accuracy of PRStuning, we summarize the average correlation of the AUC estimates ρAUC and the average relative difference of the best-performing AUC estimates rdAUC in Table 1. These metrics are complementary since two vectors can be perfectly correlated but still differ substantially. The values of ρAUC were at least 0.976, which indicates that PRStuning is capable of accurately predicting the AUC pattern on the testing data. Moreover, the average values of rdAUC were at most 1.3%, indicating that PRStuning can effectively select parameter values that achieve performance comparable to the best-performing parameter on the testing data. Note that ρAUC increased and rdAUC decreased as the sample size of the training GWAS increased. This is expected because a larger sample size in the training data leads to higher accuracy in estimating the allele frequency differences.

Table 1 Summary of the average values of ρAUC and rdAUC in the simulation experiments with independent SNPs

We also evaluated PRStuning when the training and testing data are heterogeneous. Specifically, we considered two different scenarios. In the first scenario, we assumed that the allele frequencies from the training and testing data were different, with the differences generated from N(0, 0.01^2). In the other scenario, we assumed that the effect sizes differed between the training and testing data, with the differences between effects of risk SNPs following N(0, 0.0005n). The results of these experiments for P+T are provided in Supplementary Figures 3-4. The figures demonstrate that PRStuning can still estimate the AUC well when the pooled allele frequencies differ between the training and testing data. However, when the effects of risk SNPs differ between training and testing data, the AUC from PRStuning was overestimated, leading to inaccurate parameter tuning.

We then considered the case where the pre-selected SNPs are not filtered by any independence criterion. In this case, the pre-selected SNPs are correlated through LD. We first performed simulations with SNPs under an AR(1) auto-regressive LD structure. We fixed the auto-regressive coefficient ρ, the correlation coefficient between two adjacent SNPs, to 0.2. Similar to the simulation scenario with independent SNPs, we simulated the reference allele frequencies in the population from U(0.05, 0.95) and the risk effects from the point-normal distribution Eq. (7), in which π = 0.05 and σ2 = 0.0005n. The variance of the risk effects is proportional to the sample size of the GWAS since \({s}_{m}\propto 1/\sqrt{n}\) according to the central limit theorem.

We varied the sample size of the training GWAS from 4000 to 10,000 and generated 50 replications for each sample size. We used CorBin28, an R package for generating high-dimensional binary data with a specified correlation structure, to generate individual-level genotype data. Specifically, we generated 1000 cases and 1000 controls as testing data for each replication. We additionally simulated 1000 samples as a reference panel for calculating LD coefficients. We used both C+T and LDpred as the PRS methods in this experiment. In LDpred, we need to specify another parameter named the LD radius, which is the number of SNPs on each side of a given SNP used for computing pairwise LD. The LD radius was set to 5, indicating that the SNPs used for computing LD have pairwise correlations above 0.2^5 ≈ 3 × 10−4 under the AR(1) LD structure.
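Under the AR(1) structure, the correlation between two SNPs decays geometrically with their distance in SNP index, so the LD matrix is fully determined by ρ. A small sketch (ours) constructing it:

```python
import numpy as np

def ar1_ld_matrix(M, rho):
    """AR(1) LD matrix: R[i, j] = rho ** |i - j|."""
    idx = np.arange(M)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

With ρ = 0.2, SNPs at the edge of an LD radius of 5 have correlation 0.2^5 ≈ 3 × 10−4, matching the threshold quoted above.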

To demonstrate the predictive accuracy of PRStuning, we again regarded the AUC evaluated on the testing data as the benchmark and compared both the AUC predicted by PRStuning and the unadjusted AUC with it. Figures 2, 3 and Supplementary Figure 1 show the AUC boxplots for C+T, LDpred, and LDpred2 with different parameter values, respectively. For all three PRS methods, the unadjusted AUC estimates were largely inflated compared to the benchmark due to overfitting. In contrast, the AUC estimates predicted by PRStuning were very close to the benchmark, especially when the sample size became large.

Fig. 2: AUC boxplots for C+T in the simulation experiments with correlated SNPs.
figure 2

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the p-value threshold from {1, 5e − 1, 5e − 2, 5e − 3, 5e − 4, 5e − 5, 5e − 6} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results. Source data are provided as a Source Data file.

Fig. 3: AUC boxplots for LDpred in simulation experiments with correlated SNPs.
figure 3

Each box represents 50 replications and is presented as median values and the first and third quartiles. The upper/lower whisker extends from the hinge to the largest/smallest value at most 1.5 IQR from the hinge. We changed the proportion of risk SNPs from {1, 3e − 1, 1e − 1, 3e − 2, 1e − 2, 3e − 3, 1e − 3, 3e − 4, 1e − 4, 3e − 5, 1e − 5} and the sample sizes of training data from 4000 to 10,000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC calculated from testing data, and the unadjusted AUC, respectively. Source data are provided as a Source Data file.

We summarize the average values of ρAUC and rdAUC for C+T and LDpred in Table 2. For both C+T and LDpred, the average values of ρAUC were at least 0.754 in all sample size settings, indicating PRStuning can accurately predict the AUC on testing data. The average values of rdAUC were below 3.1%, meaning PRStuning can effectively select a parameter that achieves performance comparable to the actual best-performing parameter on the testing data. Again, we can observe an increasing tendency in ρAUC and a decreasing tendency in rdAUC as we increase the sample size of the training GWAS as the result of the increase in estimation accuracy of the allele frequency differences.

Table 2 Summary of the average values of ρAUC and rdAUC in the simulation experiments with correlated SNPs for C+T and LDpred

We evaluated PRStuning when the training and testing data were heterogeneous, considering three different scenarios. In the first scenario, we assumed the allele frequencies from the training and testing data were different, with the differences generated from N(0, 0.01^2). In the second scenario, we assumed that the effect sizes differed between the training and testing data, with the differences between effects of risk SNPs following N(0, 0.0002n). In the third scenario, the LD structure of the testing data was AR(1) with an auto-regressive coefficient ρ = 0.15, different from that of the training data. The results of these experiments for C+T, LDpred, and LDpred2 are provided in Supplementary Figures 5-13. Generally speaking, the figures demonstrate that PRStuning can still estimate the AUC well when the pooled allele frequencies and the LD matrix differ between the training and testing data. However, when the effects of risk SNPs differ between training and testing data, the AUC from PRStuning was overestimated, leading to inaccurate parameter tuning.

To investigate whether including more individuals in the reference panel can improve the performance of PRStuning, we conducted simulation experiments to compare its performance with the performance based on the ground truth LD matrix. The comparison results using C+T, LDpred, and LDpred2 to construct PRS can be found in Supplementary Figures 14-16, respectively. From the figures, we observe that the performance of PRStuning based on the LD matrix estimated from 1,000 individuals was almost the same as the performance based on the ground truth LD matrix. Thus, with a sufficient number of individuals in the reference panel, there may be little improvement in performance by including more individuals in the LD matrix calculation.

To further demonstrate the effectiveness of PRStuning, we calculated the sensitivity of the PRS model tuned by PRStuning, i.e., the proportion of true cases among the individuals predicted to be cases by the PRS model. The cutoff value for the PRS was selected by Youden’s J statistic, which is defined as the sum of sensitivity and specificity minus one and is the most commonly used criterion for selecting the cutoff value of a binary classifier29. The true case proportions in the simulation experiments for the four PRS methods are summarized in Supplementary Figure 2.
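Youden's J selects the cutoff that jointly maximizes sensitivity and specificity. The exhaustive search over candidate cutoffs can be sketched as follows (ours, for illustration; function name is hypothetical):

```python
import numpy as np

def youden_cutoff(prs, labels):
    """Pick the PRS cutoff maximizing Youden's J = sensitivity + specificity - 1.

    prs    : (n,) risk scores
    labels : (n,) 0/1 disease status
    """
    prs = np.asarray(prs, dtype=float)
    labels = np.asarray(labels)
    n_case = (labels == 1).sum()
    n_ctrl = (labels == 0).sum()
    best_j, best_t = -np.inf, None
    for t in np.unique(prs):               # every observed score is a candidate cutoff
        pred = prs >= t                    # predicted cases at this cutoff
        sens = (pred & (labels == 1)).sum() / n_case
        spec = (~pred & (labels == 0)).sum() / n_ctrl
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t, best_j
```

For perfectly separating scores, the selected cutoff sits at the lowest case score and J attains its maximum of 1.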

We also evaluated PRStuning with simulations based on real genotype data. The experiments were conducted on genotype data collected from the UK Biobank (UKBB)30, which has collected genetic and health records from around 500,000 participants in the UK. The quality control procedure is summarized in the Supplementary Methods section. We only selected independent individuals with European ancestry in the experiments. Since only SNPs present in the HapMap 3 project (HM3 SNPs) were used in the reference panel, for reliable LD estimation and computational efficiency, we focused on the HM3 SNPs in the UKBB dataset. This resulted in a total of 1,027,699 HM3 SNPs and 272,751 individuals passing the quality control criteria.

We used the two-component mixture model Eq. (7) to simulate risk effects for SNPs with π = 0.1% and σ2 = 0.04. The phenotypes of the individuals were simulated based on the additive assumption. Among all individuals, we randomly selected 80% of them for GWAS analysis to calculate the summary statistics as training data and the rest as testing data. We used the data collected from the 1000 Genomes Project (1KG)26 as the reference panel for calculating LD. In the experiments, we used both C+T and LDpred as the PRS methods and compared the AUC estimates predicted by PRStuning with the values calculated on the testing data. The LD radius to be specified in LDpred was set to M/3000 ≈ 343, which is the default practice suggested by LDpred and corresponds to a 2Mb LD window on average in the human genome15.

In Table 3, we summarize the AUC results of C+T, LDpred, and LDpred2 with different parameter values for both PRStuning and testing genotype data. The AUC estimates from PRStuning were very close to the actual AUC values obtained from the testing data. For C+T, the correlation ρAUC reached 0.994, the relative difference rdAUC was 3.8%, and the sensitivity of the tuned PRS model based on PRStuning was 80.6%. For LDpred, ρAUC reached 0.998, rdAUC was just 1.3%, and the sensitivity was 74.8%. It is worth noting that PRStuning was able to detect the dramatic decrease in the testing performance of LDpred when π was dropped from 1e − 1 to 3e − 2. For LDpred2, ρAUC reached 0.989, rdAUC was 7.0%, and the sensitivity was 85.3%. These results further suggest the accuracy in AUC estimation and effectiveness in parameter tuning using PRStuning on SNPs linked by LD.

Table 3 The predicted AUC values for C+T, LDpred, and LDpred2 with different parameters in the simulation experiment based on the UKBB data

Real data applications

We applied PRStuning to GWAS summary statistics from four diseases: coronary artery disease (CAD), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and breast cancer (BC). Table 4 summarizes the sources of the publicly available GWAS summary statistics and their corresponding sample sizes. Note that the summary statistics from all four datasets are results of meta-analyses, and the reported sample sizes represent the total numbers of individuals across all aggregated studies. The actual sample size used to calculate the summary statistics of each SNP was less than the reported sample size, since some of the studies may not have genotyped that SNP.

Table 4 Summary of the publicly available GWAS summary statistics used in real data applications

We used these summary statistics to train the PRS models based on P+T, C+T, LDpred, and LDpred2. We then used the data collected from the UKBB as the testing data for evaluating the actual prediction performance of the built PRS models. Only the SNPs with minor allele frequencies greater than 5% were included in building the PRS models. Details of the quality control procedure and phenotype extraction method for the UKBB data are provided in the Supplementary Methods section. In line with the simulation experiments based on UKBB genotype data, we only incorporated independent European-ancestry individuals and HM3 SNPs in the UKBB dataset, resulting in 272,751 individuals and 1,027,699 HM3 SNPs. Regardless of which PRS method is considered, only the SNPs overlapping between the GWAS summary statistics and the testing data were considered in our analyses. The numbers of overlapping SNPs for these diseases are summarized in Table 4.

In PRStuning, we adopted the EM algorithm (Algorithm 1) for PRS models built by P+T, since the pre-selected SNPs were approximately independent, and the Gibbs-sampling-based SAME algorithm (Algorithm 2) for C+T and LDpred, due to the presence of LD among the pre-selected SNPs. The LD radius in LDpred was set to M/3000, the default suggested by LDpred. Figure 4 shows the predicted AUC by PRStuning and the actual AUC on testing data for the four diseases with different PRS models. The dotted and solid horizontal lines refer to the highest AUC for PRStuning and testing data, respectively. It is evident in the figure that the AUC predicted by PRStuning and the AUC calculated from testing data had similar patterns across different parameter values, particularly for LDpred. For CAD, the AUC of LDpred increased when the risk SNP proportion π was reduced from 1 to 1e − 2. It peaked at 1e − 2 and then started to decrease as we kept reducing the value of π. This pattern was exactly predicted by PRStuning. More complex patterns of AUC were observed for LDpred in T2D and IBD. The AUC values in both diseases had double modes across parameter values. For T2D, the AUC of LDpred peaked at 3e − 2 and 3e − 4. For IBD, the AUC of LDpred peaked at 3e − 2 and 1e − 5. Still, PRStuning predicted exactly the same patterns of AUC for both diseases, demonstrating its high predictive accuracy. More detailed information on the predicted AUC by PRStuning and the actual AUC on testing data is summarized in Supplementary Table 2.

Fig. 4: The predicted AUC by PRStuning and the actual AUC on testing data for four diseases with PRS models built from P+T, C+T, LDpred, and LDpred2 using different parameters.

The four panels present the results of P+T, C+T, LDpred, and LDpred2, respectively. The dotted and solid horizontal lines refer to the highest AUC for PRStuning and testing data, respectively. The overall patterns of AUC predicted by PRStuning and calculated from testing data across different parameter values were similar. Detailed AUC values for different methods and tuning parameters are summarized in Supplementary Table 2. Source data are provided as a Source Data file.

To further explain why there were double modes for AUC with different parameter values, we refer back to the calculation of Δ in Eq. (3), since AUC is monotonically increasing with respect to Δ. The numerator of Δ is a linear combination of the weights \(\omega={({\omega }_{1},\ldots,{\omega }_{M})}^{T}\) used in PRS, whereas the denominator is the square root of a quadratic function of ω, which can be further expressed as

$$\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}=\sqrt{{\omega }^{T}({S}_{0}R{S}_{0}+{S}_{1}R{S}_{1})\omega },$$
(8)

where S0 and S1 are diagonal matrices with diagonal elements (s0,1, …, s0,M) and (s1,1, …, s1,M), respectively. The weights in the PRS model were calculated based on different parameter values. In Supplementary Figure 17, we show the denominators and numerators of Δ under different parameter values in LDpred for the four diseases. From the figure, we can observe that both the numerator and the denominator were unimodal functions of the parameter values, but they peaked at different values. Their ratio therefore made Δ a bimodal function of the parameter values.
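The decomposition of Δ described above is straightforward to compute directly; a minimal sketch with hypothetical two-SNP inputs (the function name and toy values are ours):

```python
import numpy as np

def delta_numerator_denominator(omega, delta, s0, s1, R):
    """Numerator 2*omega^T*delta and denominator
    sqrt(omega^T (S0 R S0 + S1 R S1) omega) of Delta, following Eq. (8)."""
    S0, S1 = np.diag(s0), np.diag(s1)
    num = 2.0 * float(omega @ delta)
    den = float(np.sqrt(omega @ (S0 @ R @ S0 + S1 @ R @ S1) @ omega))
    return num, den

# Toy example: two SNPs in mild LD
omega = np.array([0.5, -0.2])
delta = np.array([0.01, 0.02])
s0 = s1 = np.array([0.6, 0.7])
R = np.array([[1.0, 0.3], [0.3, 1.0]])
num, den = delta_numerator_denominator(omega, delta, s0, s1, R)
```

Plotting `num` and `den` separately over a tuning grid reproduces the unimodal-numerator/unimodal-denominator view discussed above.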

In Figure 4, we do observe some underestimation of AUC for C+T, LDpred, and LDpred2 on CAD and IBD. This is because the summary statistics collected are results of meta-analyses. The actual sample size used for calculating the summary statistics of each SNP is less than the reported sample size, because some of the studies may not have genotypes at this SNP. Some consortia, such as GLGC31, provide the sample size used for calculating the summary statistics of each SNP, but most consortia do not provide this information. Even if we have the sample size for each SNP, we cannot infer the number of non-overlapping individuals used for calculating the summary statistics of two SNPs. The non-overlapping individuals will change the correlations between z-values. In our analysis, we simply plugged the total sample sizes reported by the summary statistics into PRStuning. According to Eq. (16), the inflation of the sample size would lead to a systematic underestimation of sm. Based on Eq. (2), we know that AUC is monotonically increasing with respect to Δ, and we have \({{\Delta }}\propto \mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}\) and δ = SRS−1β. We estimate S−1β directly from the z-scores, which are not influenced by the underestimation of sm. Therefore, the underestimation of sm further leads to the underestimation of AUC.

To further illustrate the predictive accuracy of PRStuning, we calculated ρAUC and rdAUC with different PRS methods for the four diseases. The results are summarized in Table 5. The low values of rdAUC indicate that the prediction performance under the PRStuning-selected parameter closely approximated the best performance on the testing data, especially for C+T and P+T. Even though LDpred had a higher rdAUC than the other PRS methods, it yielded ρAUC values all above 0.95. The high values of ρAUC indicate that PRStuning can accurately predict the pattern of AUC with respect to the parameters on the testing data, as can be clearly observed in Figure 4. These results show that PRStuning can help us select the best-performing parameters in PRS methods with only summary statistics from the training data.
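A sketch of how these two summaries can be computed from a grid of predicted and testing AUCs. The rdAUC form below (the relative shortfall of the testing AUC under the PRStuning-selected parameter from the best testing AUC) is our reading of the metric, and the function names are ours:

```python
import numpy as np

def rho_auc(pred_auc, test_auc):
    """Pearson correlation between the predicted and testing AUCs
    across the tuning-parameter grid."""
    return float(np.corrcoef(pred_auc, test_auc)[0, 1])

def rd_auc(pred_auc, test_auc):
    """Relative difference between the testing AUC achieved by the
    parameter selected from the predictions and the best testing AUC."""
    selected = int(np.argmax(pred_auc))        # parameter chosen by prediction
    best = float(np.max(test_auc))             # best achievable on testing data
    return (best - float(test_auc[selected])) / best

# Toy grid of three candidate parameter values
pred = np.array([0.60, 0.70, 0.65])
test = np.array([0.61, 0.69, 0.66])
rho, rd = rho_auc(pred, test), rd_auc(pred, test)
```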

Table 5 Summary of ρAUC and rdAUC when using PRStuning to predict AUCs for four PRS methods on four diseases

We note that the correlation between the AUC predicted by PRStuning and that calculated from the testing data was negative with C+T for CAD. However, the standard deviations of the AUC values across different parameters were less than 0.01 for both the predicted and the actual AUCs in this scenario. Such extremely small standard deviations of AUC lead to a large variation of the correlation, making the correlation relatively uninformative in characterizing the relationship between the predicted and the actual AUC values. On the other hand, the small value of rdAUC (0.4%) suggests the effectiveness of PRStuning. The sensitivity values of the tuned PRS model based on PRStuning and Youden’s J statistic are summarized in Supplementary Table 3.

We also compared PRStuning with PUMAS32, a method that estimates the predictive R2 of PRS models by sampling pseudo-summary statistics from GWAS summary statistics. To compare predictive R2 with AUC, we first converted Pearson’s correlation to Spearman’s rank correlation and then linearly mapped the latter to AUC33. In Supplementary Table 4, we summarize ρAUC and rdAUC based on PUMAS. We observe that PRStuning outperformed PUMAS across all real data and PRS methods, and that PUMAS particularly struggled to predict the AUC for LDpred and LDpred2.

Discussion

PRS methods have been proven useful for predicting common disease risks, which can help improve disease prevention and early treatment. Some PRS methods require users to specify parameter values. To tune these parameters, an external individual-level genotype dataset is often needed to evaluate the prediction performance under different parameter values. However, individual-level genotype data are much less accessible than GWAS summary statistics due to privacy and security concerns. Additionally, leaving out part of the data for parameter tuning can also reduce the predictive accuracy of the PRS model.

These concerns motivated us to propose PRStuning, an empirical Bayes method that only requires summary statistics from the training GWAS to evaluate PRS and tune the parameters. PRStuning is based on an AUC estimator proposed in ref. 22, which is a function of the GWAS summary statistics. However, plugging the training summary data directly into this estimator would cause overfitting, leading to an inflation of the predicted AUC. To tackle this problem, we adopted an empirical Bayes approach to shrink the predicted AUC based on the estimated genetic architecture. Extensive simulation experiments and real data applications on four diseases with four PRS methods demonstrated that PRStuning is capable of accurately predicting the AUC on the testing data and selecting the best-performing parameters.

The core of PRStuning is to estimate the allele frequency differences among SNPs. To do so, we need to input the sample sizes of the cases and controls in the training data. Usually, they are provided in the sources of GWAS summary statistics. However, if the summary statistics were derived from a meta-analysis, not all SNPs were genotyped in all studies included in the meta-analysis. In this case, the actual sample sizes used for calculating the summary statistics are less than the reported total sample sizes in the meta-analysis for some SNPs. This may lead to underestimation in AUC according to Eq. (2). This phenomenon was observed when we applied PRStuning to C+T and LDpred on CAD and IBD, where the AUC estimates from PRStuning were lower than the actual values in the testing data. Nevertheless, according to our experimental results, the underestimation phenomenon will not influence the performance of parameter selection since the overall pattern of the AUC values with different parameter values can still be well-predicted by PRStuning.

Currently, we only considered tuning parameters for PRS methods on diseases or other binary phenotypes. For quantitative phenotypes, predictive r2, rather than AUC, is commonly used as the evaluation criterion of the PRS model. Extending PRStuning to evaluate predictive r2 and select parameters for quantitative phenotypes is left as future work.

In PRStuning, we select the best-performing parameter by predicting the AUC of the PRS built under each candidate parameter value. Although AUC is the most commonly used evaluation metric for PRS on binary disease outcomes22, it may be helpful to incorporate additional covariates, such as age and sex, into the AUC since they may also have an impact on disease risks34. Two notable variants of AUC that incorporate covariate information are the covariate-specific AUC (AUCx)35 and the covariate-adjusted AUC (AAUC)36. Similar to the definition of the ordinary AUC, AUCx is defined as the probability that the PRS of a random individual from the case group is larger than that of a random individual from the control group, conditional on both individuals sharing the common covariate value x. AAUC is the weighted average of AUCx, where the weight is the probability density of the covariate value x. If the genetic risk of a disease is independent of other covariates, both AUCx and AAUC will have the same value as the ordinary AUC34. To estimate AUCx and AAUC, we need to estimate the conditional distribution of PRS given a covariate value, which can only be inferred with the help of individual-level data. Since we focus on using GWAS summary statistics to predict the AUC and tune parameters of PRS, we leave the prediction of covariate-incorporated AUCx and AAUC based on individual-level training data as future work.

The basic assumption of PRStuning is that the training and testing datasets are homogeneous, meaning that both datasets come from the same population and therefore share the same LD matrix and the same expected allele frequencies among controls and cases. The same assumption is needed for traditional PRS analyses that tune parameters on an independent validation dataset. If the validation and testing datasets are heterogeneous, neither the AUC estimated from the validation dataset nor the parameter selected based on it will be accurate. Without additional information about the heterogeneity between the two datasets, it is challenging to estimate AUC and tune parameters based on training or validation datasets. We note that some recent PRS methods consider multiple populations from different ancestries together, which can transfer knowledge from the European population to other populations with limited sample sizes37,38,39,40. In PRStuning, we currently focus on addressing the overfitting issue when the homogeneity assumption holds. Adjusting the selected parameter value based on additional information about the heterogeneity will be considered in our future work. Supplementary Figures 3-13 present the performance of PRStuning when the pooled allele frequency, effect size, and LD matrix differ between training and testing datasets. The figures demonstrate that PRStuning can estimate the AUC well when heterogeneity exists in the pooled allele frequency and LD matrix. However, if the heterogeneity between training and testing data lies in the effects of changing allele frequencies between controls and cases, the AUC from PRStuning will be overestimated and unreliable.

Recent research suggests that combining all PRSs under a tuning grid using ensemble methods can improve the prediction performance8,41,42,43. In the ensemble methods, an independent validation dataset is needed to estimate the weights used for combining PRSs. In PRStuning, we estimate the AUC and select the best-performing parameters for a PRS method based on the SNP weights derived from that method. If the PRS weights used in the ensemble methods have already been estimated in an individual-level validation dataset, we can combine the SNP weights in each PRS and the PRS weights to derive the ensembled SNP weights. In this situation, PRStuning can be used to predict the AUC of the PRS from the ensembled weights without another individual-level dataset. However, without an individual-level validation dataset to estimate the PRS weights used in the ensemble methods, PRStuning cannot estimate the PRS weights based solely on GWAS summary statistics from the training data.

Methods

Notations and assumptions

Based on the additive assumption, the PRS for individual i is the sum of the genotypes gi = (gi,1, …, gi,M) weighted by the corresponding effects ω = (ω1, …, ωM):

$$PR{S}_{i}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}{g}_{i,m},$$
(9)

where M is the total number of the pre-selected SNPs used for constructing PRS. Depending on the specific PRS method, not all SNPs collected in the training GWAS data are necessarily used in the PRS calculation. Note that some PRS methods incorporate steps for selecting SNPs based on their associations with the disease. Here we define the pre-selected SNPs as the SNPs used in building the PRS model before any selection step related to association strengths. LD clumping is an example of a selection step based on the observed association strength; hence, we refer to the pre-selected SNPs in C+T as the genome-wide SNPs collected in the training GWAS data. In contrast, LD pruning is a selection step unrelated to the associations of SNPs with the disease; therefore, the pre-selected SNPs in P+T are the SNPs retained after an LD pruning step. Different PRS methods have been proposed to estimate the weight vector ω = (ω1, …, ωM) from a GWAS dataset or its summary statistics for the disease of interest. Hereafter, we simply use ω to denote the effects already estimated from a PRS method.
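Eq. (9) amounts to a dot product between the dosage matrix and the weight vector; a minimal sketch (the function name is ours):

```python
import numpy as np

def polygenic_risk_score(G, omega):
    """PRS_i = sum_m omega_m * g_{i,m} (Eq. 9): each row of G holds an
    individual's allele dosages (0/1/2) over the M pre-selected SNPs."""
    return G @ omega

# Toy usage: two individuals, three SNPs
G = np.array([[0, 1, 2], [2, 2, 0]], dtype=float)
omega = np.array([0.1, -0.2, 0.3])
prs = polygenic_risk_score(G, omega)  # array([ 0.4, -0.2])
```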

Based on disease status, we divide individuals into the case and control groups. In the following, we use subscripts j = 0 and j = 1 to denote those from the control and case groups, respectively. For example, the frequencies of the reference allele for SNP m among controls and cases are denoted as f0,m and f1,m, respectively. The genotype gi,m of SNP m for an individual in the control group follows a binomial distribution Bino(2, f0,m) with mean \({\mathbb{E}}[{g}_{0,m}]=2{f}_{0,m}\) and variance \({s}_{0,m}^{2}:\!\!=Var({g}_{0,m})=2{f}_{0,m}(1-{f}_{0,m})\). Similarly, we have gi,m ~ Bino(2, f1,m) if the individual i is from the case group.

By the central limit theorem, PRS approximately follows a normal distribution in each group when the SNP number M is sufficiently large. For PRS methods involving SNP selection steps unrelated to the SNPs’ associations with the disease, such as P+T, M varies from ~10 to ~10K depending on the selection threshold. For PRS methods with genome-wide pre-selected SNPs, M ranges from ~100K to ~1M, determined by the number of SNPs genotyped or imputed in the training data. Accordingly, the PRS variables in the two groups approximately follow normal distributions:

$$PR{S}_{i} \sim \left\{\begin{array}{ll}N({\eta }_{0},\,{\tau }_{0}^{2})\quad &{{{{{{{\rm{if}}}}}}}}\,i\in {{{{{{{\rm{control}}}}}}}}\,{{{{{{{\rm{group}}}}}}}}\\ N({\eta }_{1},\,{\tau }_{1}^{2})\quad &{{{{{{{\rm{if}}}}}}}}\,i\in {{{{{{{\rm{case}}}}}}}}\,{{{{{{{\rm{group}}}}}}}}\quad \end{array}\right.,$$
(10)

where

$${\eta }_{j}=\mathop{\sum }\limits_{m=1}^{M}2{\omega }_{m}{f}_{j,m},$$
(11)

and

$${\tau }_{j}^{2}=\mathop{\sum }\limits_{m=1}^{M}{\omega }_{m}^{2}{s}_{j,m}^{2}+2\mathop{\sum}\limits_{{m}_{1} < {m}_{2}}{\omega }_{{m}_{1}}{\omega }_{{m}_{2}}{R}_{{m}_{1},{m}_{2}}{s}_{j,{m}_{1}}{s}_{j,{m}_{2}},$$
(12)

for j = 0 or 1. Here \({R}_{{m}_{1},{m}_{2}}\) corresponds to the correlation between SNP m1 and SNP m2, which is known as the LD coefficient.
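Because the diagonal of R is one, the double sum in Eq. (12) equals the quadratic form ωᵀSjRSjω, so Eqs. (11)-(12) can be evaluated in matrix form; a small sketch (the function name is ours):

```python
import numpy as np

def prs_moments(omega, f, R):
    """Mean and variance of PRS in one group (Eqs. 11-12), given the
    group's allele frequencies f and the LD correlation matrix R."""
    s = np.sqrt(2.0 * f * (1.0 - f))       # per-SNP genotype SDs
    eta = 2.0 * float(omega @ f)           # Eq. (11)
    S = np.diag(s)
    tau2 = float(omega @ (S @ R @ S) @ omega)  # Eq. (12) in matrix form
    return eta, tau2
```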

For a binary phenotype, we usually use AUC as the criterion for evaluating the prediction performance of PRS. AUC is defined as the area under the ROC curve, which can also be calculated as the probability that a random PRS from the case group is larger than a random PRS from the control group44. Based on this fact and the distributions of PRS, Song et al.22 formulated AUC as

$${{{{{{{\rm{AUC}}}}}}}}={{\Phi }}({{\Delta }}),$$
(13)

where

$${{\Delta }}:\!\!=\frac{{\eta }_{1}-{\eta }_{0}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}=\frac{2{\sum }_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}.$$
(14)

Here δm := f1,m − f0,m records the difference between the allele frequencies of the two groups for SNP m, and Φ(  ) is the cumulative distribution function of the standard normal distribution.
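Eqs. (13)-(14) combine into a one-line computation; the sketch below (function names are ours) writes the standard normal CDF via `math.erf` to stay dependency-free:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    # Phi(x) expressed through the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def predicted_auc(omega, delta, tau0_sq, tau1_sq):
    """AUC = Phi(Delta) with Delta = 2*sum_m omega_m*delta_m /
    sqrt(tau0^2 + tau1^2), following Eqs. (13)-(14)."""
    Delta = 2.0 * float(np.dot(omega, delta)) / sqrt(tau0_sq + tau1_sq)
    return std_normal_cdf(Delta)

# With no allele frequency differences, AUC is exactly 0.5
auc_null = predicted_auc(np.array([1.0]), np.array([0.0]), 1.0, 1.0)
```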

To calculate \({\tau }_{0}^{2}\) and \({\tau }_{1}^{2}\) in Eq. (14), we can directly plug the observed values of the allele frequencies and LD coefficients into Eq. (12), since they are not directly related to the SNP effects on the disease. We can extract allele frequencies from the summary statistics of the GWAS and use a genotyping dataset as the reference panel for extracting the LD information. Some large projects, such as the 1000 Genomes Project26 and the HapMap 3 project27, can be used to calculate the LD coefficients. We provide the details of these calculations in Section “Calculating LD from a reference panel”.

In Eq. (14), the allele frequency differences δm (m = 1, …, M) are critical. One may think of directly plugging in the observed allele frequencies \({\hat{f}}_{0,m}\) and \({\hat{f}}_{1,m}\) from the GWAS used for building the PRS model to obtain δm. However, the allele frequency differences of SNPs that exhibit large effects tend to be overestimated, and these SNPs contribute more to PRS than the SNPs showing small effects, a phenomenon known as overfitting in the machine learning community23. Overestimating the SNP effects would lead to an inflated value of the predicted AUC and incorrectly selected parameter values. Here we adopt an empirical Bayes method to reduce the influence of overfitting by shrinking the observed allele frequency differences obtained from the summary statistics of the training GWAS.

In GWAS, we usually use the z-score calculated from the allele frequency difference test to assess the association of each SNP with the disease. Since z-scores are standardized values following a standard normal distribution N(0, 1) under the null hypothesis, we will use z-scores as surrogates to derive the posterior distribution of δm. The z-score is calculated with the following formula:

$${z}_{m}=\frac{{\hat{f}}_{1,m}-{\hat{f}}_{0,m}}{\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}}},$$
(15)

where \({\hat{f}}_{j,m}\) denotes the observed allele frequency among controls (j = 0) or cases (j = 1), and \({s}_{j,m}^{2}\) is the variance of the genotypes in each group. We use n0 and n1 to denote the sample sizes of controls and cases in the GWAS, respectively. To simplify the expression, we use sm to denote the denominator of the z-score, i.e.,

$${s}_{m}:\!\!=\sqrt{{s}_{1,m}^{2}/4{n}_{1}+{s}_{0,m}^{2}/4{n}_{0}},$$
(16)

and denote s = (s1, …, sM). We use z to encode the z-scores of all the pre-selected SNPs. Based on these definitions, given the allele frequency difference δm, we have zm ∣ δm ~ N(δm/sm, 1).
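Eqs. (15)-(16) in code: a sketch computing each z-score and its denominator sm from the observed group allele frequencies and sample sizes (the function name is ours):

```python
import numpy as np

def z_and_s(f1_hat, f0_hat, n1, n0):
    """Per-SNP z-score of the allele-frequency-difference test (Eq. 15)
    and its standard error s_m (Eq. 16)."""
    s1_sq = 2.0 * f1_hat * (1.0 - f1_hat)   # genotype variance in cases
    s0_sq = 2.0 * f0_hat * (1.0 - f0_hat)   # genotype variance in controls
    s = np.sqrt(s1_sq / (4.0 * n1) + s0_sq / (4.0 * n0))
    z = (f1_hat - f0_hat) / s
    return z, s
```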

Consider the hypothetical condition in which SNP m is independent of all other SNPs; its allele frequencies among controls and cases under this condition may differ from the marginal ones. We denote these potential allele frequencies as p0,m and p1,m, respectively; note that they should be distinguished from the marginal allele frequencies f0,m and f1,m. We denote the effect of SNP m as βm = p1,m − p0,m. If the SNP has no effect on the disease, then βm = 0; for risk SNPs, βm ≠ 0. In the Supplementary Methods section, we further prove that δ = (δ1, …, δM) is related to the LD among the pre-selected SNPs and the underlying SNP effects β = (β1, …, βM) in terms of changing allele frequencies between the two groups, i.e.,

$$\delta =SR{S}^{-1}\beta .$$
(17)

We further assume that the standardized effect βm/sm follows a point-normal distribution, i.e.,

$$\frac{{\beta }_{m}}{{s}_{m}}\mathop{ \sim }\limits^{iid}(1-\pi ){\delta }_{0}+\pi N(0,\,{\sigma }^{2}).$$
(18)

Here δ0 is a point mass at zero and π represents the prior proportion of the SNPs having effects on the disease. We use σ2 to denote the variance of βm/sm in the risk SNPs. In the Supplementary Methods section, we derive the following relationship between σ2 and the heritability (\({h}_{l}^{2}\)) of the disease on the liability scale:

$${\sigma }^{2}=\frac{{N}_{e}{h}_{l}^{2}}{4M\pi }\frac{\phi {({{{\Phi }}}^{-1}(\kappa ))}^{2}}{{\kappa }^{2}{(1-\kappa )}^{2}},$$
(19)

where \({N}_{e}=\frac{4{n}_{0}{n}_{1}}{{n}_{0}+{n}_{1}}\) is the effective sample size of the GWAS, κ is the prevalence of the disease, and ϕ and Φ are the probability density function and cumulative distribution function of the standard normal distribution N(0, 1), respectively.
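Eq. (19) can be evaluated directly; a sketch in which the bisection-based normal quantile is our dependency-free stand-in for Φ−1, and the function name is ours:

```python
from math import erf, exp, pi, sqrt

def sigma2_from_heritability(h2_liability, kappa, n0, n1, M, pi_frac):
    """Prior effect variance sigma^2 implied by the liability-scale
    heritability, following Eq. (19)."""
    phi = lambda x: exp(-0.5 * x * x) / sqrt(2.0 * pi)   # standard normal pdf
    Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))     # standard normal cdf

    def Phi_inv(p, lo=-10.0, hi=10.0):
        # simple bisection inverse of the standard normal CDF
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if Phi(mid) < p:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    n_eff = 4.0 * n0 * n1 / (n0 + n1)       # effective sample size N_e
    t = Phi_inv(kappa)                       # liability threshold
    return (n_eff * h2_liability / (4.0 * M * pi_frac)) * \
           phi(t) ** 2 / (kappa ** 2 * (1.0 - kappa) ** 2)
```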

In the following two subsections, we will demonstrate how to estimate allele frequency differences in two different scenarios by reducing the effect of overfitting based on the empirical Bayes theory.

Estimating AUC on independent SNPs

First, we consider the situation in which the pre-selected SNPs used for constructing PRS are independent. For example, the pre-selected SNPs in P+T are approximately independent because they are selected after an LD pruning step.

In this scenario, we have δ = β based on Eq. (17) and the joint distribution of z-scores follows a multivariate normal distribution with the covariance matrix equaling to the identity matrix IM, i.e.,

$${{{{{{{\boldsymbol{z}}}}}}}}| {{{{\beta }}}} \sim {N}_{M}({{{{{ S}}}}}^{-1}{{{{\beta }}}},{{{{{ I}}}}}_{M}),$$
(20)

where S = diag(s) is a diagonal matrix with diagonal elements encoding the standard errors of the observed allele frequency differences.

With the point-normal prior (18) on each entry of β, the log-likelihood of the z-scores is the summation of the log-likelihood for each individual z-score, i.e.

$$\log P({{{{z}}}}| \pi,{\sigma }^{2})=\mathop{\sum }\limits_{m=1}^{M}\log P({z}_{m}| \pi,{\sigma }^{2}).$$
(21)

With this property, we can use an EM algorithm to obtain estimates of π and σ2 by maximizing the likelihood P(z ∣ π, σ2).

After getting estimates of parameters π and σ2, we can derive a closed-form solution for the posterior distribution of δm:

$${\delta }_{m}| {z}_{m} \sim (1-{h}_{m}){\delta }_{0}+{h}_{m}N(\lambda {z}_{m}{s}_{m},\lambda {s}_{m}^{2}),$$
(22)

where

$${h}_{m}=\frac{\frac{\pi }{\sqrt{1+{\sigma }^{2}}}\phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})}{(1-\pi )\phi ({z}_{m})+\frac{\pi }{\sqrt{1+{\sigma }^{2}}}\phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})}\,{{{{{{{\rm{and}}}}}}}}\,\lambda=\frac{1}{1+1/{\sigma }^{2}}.$$
(23)

Here ϕ(  ) is the probability density function of a standard normal distribution N(0, 1). Derivation details of this posterior distribution can be found in the Supplementary Methods section. With Eq. (22), we draw MC samples of δm ∣ zm and plug them in as the allele frequency differences in Eq. (14) to compute the posterior distribution of AUC. The shrinkage estimator of δm in Eq. (22) reduces the effect of overfitting. Details of the EM algorithm for estimating π, σ2, δm, and AUC are summarized in Algorithm 1.

Algorithm 1

Estimate AUC on independent SNPs

Input: z-scores z = (z1, …, zM)

Output: Estimated π, σ2, δ and AUC

1: Initialize π and σ2;

2: repeat

3: for m = 1, 2, . . . , M do

4: E step:

5: \({h}_{m}\leftarrow \frac{\pi \phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})/\sqrt{1+{\sigma }^{2}}}{(1-\pi )\phi ({z}_{m})+\pi \phi ({z}_{m}/\sqrt{1+{\sigma }^{2}})/\sqrt{1+{\sigma }^{2}}}\)

6: M step:

7: \(\pi \leftarrow \frac{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}}{M}\)

8: \({\sigma }^{2}\leftarrow \frac{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}{z}_{m}^{2}}{\mathop{\sum }\nolimits_{m=1}^{M}{h}_{m}}-1\)

9: end for

10: until π and σ2 converge

11: for m = 1, 2, …, M do

12: \({\delta }_{m} \sim (1-{h}_{m}){\delta }_{0}+{h}_{m}N(\frac{{z}_{m}{s}_{m}}{1+1/{\sigma }^{2}},\frac{{s}_{m}^{2}}{1+1/{\sigma }^{2}})\)

13: end for

14: \({{\Delta }}\leftarrow \frac{2\mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}\) and AUC ← Φ(Δ)
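The E and M steps of Algorithm 1 have the closed forms shown above; a compact sketch of the EM core (the function name, starting values, and convergence tolerance are our choices; posterior sampling of δm then follows Eq. (22)):

```python
import numpy as np

def em_point_normal(z, n_iter=500, tol=1e-8):
    """EM for the point-normal mixture on independent z-scores
    (Algorithm 1). Returns estimates of pi, sigma^2 and the per-SNP
    non-null posterior probabilities h_m."""
    pi_hat, sigma2 = 0.1, 1.0   # arbitrary starting values
    for _ in range(n_iter):
        sd = np.sqrt(1.0 + sigma2)
        # E-step: responsibility of the non-null component
        # (the 1/sqrt(2*pi) factor of the normal pdf cancels)
        num = pi_hat * np.exp(-0.5 * (z / sd) ** 2) / sd
        den = (1.0 - pi_hat) * np.exp(-0.5 * z ** 2) + num
        h = num / den
        # M-step: closed-form updates
        pi_new = float(h.mean())
        sigma2_new = max(float((h * z ** 2).sum() / h.sum()) - 1.0, 1e-8)
        converged = (abs(pi_new - pi_hat) < tol
                     and abs(sigma2_new - sigma2) < tol)
        pi_hat, sigma2 = pi_new, sigma2_new
        if converged:
            break
    return pi_hat, sigma2, h
```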

Estimating AUC on SNPs linked by LD

When the pre-selected SNPs are not filtered by the independence criterion, their genotypes may be correlated due to LD. We can estimate the LD matrix R from a publicly available genotyping reference panel.

In this scenario, we have δ = SRS−1β based on Eq.(17) and the conditional joint distribution of the z-scores given β is

$${{{{{{{\boldsymbol{z}}}}}}}}| {{{{\beta }}}} \sim N({{{{R}}}}{{{{{S}}}}}^{-1}{{{{\beta }}}},{{{{R}}}}),$$
(24)

where S = diag(s) is a diagonal matrix encoding the standard errors of observed allele frequency differences.

We use the same point-normal prior (18) on each entry of β as in the independent-SNP scenario. There are two unknown parameters, π and σ2, in the prior distribution. We intend to use maximum likelihood estimation (MLE) to estimate them based on the observed z-scores. However, due to the extremely large number of component combinations (i.e., \({2}^{M}\)), the joint likelihood of z-scores P(z ∣ π, σ2) is intractable. Here we use a Gibbs-sampling-based State-Augmentation for Marginal Estimation (SAME) algorithm to find the maximizer of the likelihood in a stochastic manner25.

Let γm ∈ {0, 1} (m = 1, …, M) denote whether SNP m has an effect on the disease, and let γ = (γ1, …, γM). In the SAME algorithm, instead of evaluating the original likelihood, we assess the likelihood of the augmented data P(z, β, γ ∣ π, σ2). With flat priors on π and σ2, we derive a Gibbs sampler for sampling the full parameters β, γ, π, and σ2 with the joint probability proportional to the augmented-data likelihood. We leave the derivation details to the Supplementary Methods section.

By making some simple changes to the originally derived sampler, we can obtain another Gibbs sampler for simultaneously sampling π, σ2, and D artificial replicates of the nuisance parameters \({\{{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)\}}_{d=1}^{D}\), whose joint probability is proportional to

$${q}_{D}\left(\pi,\,{\sigma }^{2},\,{\{{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)\}}_{d=1}^{D}| {{{{{{{\boldsymbol{z}}}}}}}}\right)\propto \mathop{\prod }\limits_{d=1}^{D}P\left({{{{z}}}},\,{{{{\beta }}}}(d),\,{{{{\gamma }}}}(d)| \pi,{\sigma }^{2}\right).$$
(25)

Based on this probability, the generated replicates of {β, γ} in the sampler are conditionally independent. With this new sampler, the marginal probability of (π, σ2) can be calculated by integrating/summing over all replicates of {β, γ}:

$${q}_{D}\left(\pi,{\sigma }^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)= {\int}_{{{{{\beta (D)}}}}}\mathop{\sum}\limits_{{{{{\gamma (D)}}}}}\ldots {\int}_{{{{{\beta (1)}}}}}\mathop{\sum}\limits_{{{{{\gamma (1)}}}}}{q}_{D}\left(\pi,\, {\sigma }^{2},\, {\{{{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)\}}_{d=1}^{D}| {{{{{{{\boldsymbol{z}}}}}}}}\right)d{{{{\beta }}}}(1)\ldots d{{{{\beta }}}}(D)\\ \propto {\int}_{{{{{\beta (D)}}}}}\mathop{\sum}\limits_{{{{{\gamma (D)}}}}}\ldots {\int}_{{{{{\beta (1)}}}}}\mathop{\sum}\limits_{{{{{\gamma (1)}}}}}\mathop{\prod }\limits_{d=1}^{D}P\left({{{{{{{\boldsymbol{z}}}}}}}},\, {{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)| \pi,\, {\sigma }^{2}\right)d{{{{\beta }}}}(1)\ldots d{{{{\beta }}}}(D)\\= \mathop{\prod }\limits_{d=1}^{D}\left({\int}_{{{{{\beta (d)}}}}}\mathop{\sum}\limits_{{{{{\gamma (d)}}}}}P\left({{{{{{{\boldsymbol{z}}}}}}}},\, {{{{\beta }}}}(d),\, {{{{\gamma }}}}(d)| \pi,\, {\sigma }^{2}\right)d\beta (d)\right)\\= P{({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\, {\sigma }^{2})}^{D}.$$

In other words, (π, σ2) is actually sampled from \({q}_{D}\left(\pi,\,{\sigma }^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)\propto P{({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\,{\sigma }^{2})}^{D}\) in the sampler. We further denote \((\hat{\pi },\,{\hat{\sigma }}^{2})=\arg \mathop{\max }\limits_{(\pi,{\sigma }^{2})}P({{{{{{{\boldsymbol{z}}}}}}}}| \pi,\,{\sigma }^{2})\) and \((\tilde{\pi },\,{\tilde{\sigma }}^{2})\) as another set of parameters. If we let D increase to infinity, the relative probability of sampling \((\tilde{\pi },\,{\tilde{\sigma }}^{2})\) compared to sampling \((\hat{\pi },{\hat{\sigma }}^{2})\) will become

$$\frac{{q}_{D}\left(\tilde{\pi },\,{\tilde{\sigma }}^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)}{{q}_{D}\left(\hat{\pi },\,{\hat{\sigma }}^{2}| {{{{{{{\boldsymbol{z}}}}}}}}\right)}={\left(\frac{P({{{{{{{\boldsymbol{z}}}}}}}}| \tilde{\pi },\,{\tilde{\sigma }}^{2})}{P({{{{{{{\boldsymbol{z}}}}}}}}| \hat{\pi },\,{\hat{\sigma }}^{2})}\right)}^{D}\xrightarrow{D\to \infty }0.$$
(26)

Therefore, the sampled (π, σ2) will converge to their maximum likelihood estimates \((\hat{\pi },{\hat{\sigma }}^{2})\) in the end.

Given their estimates, the Gibbs sampler in the SAME algorithm can provide MC samples of the nuisance parameters {β, γ} with probability \(P({{{{\beta }}}},{{{{\gamma }}}}| {{{{{{{\boldsymbol{z}}}}}}}},\hat{\pi },{\hat{\sigma }}^{2})\). With them, we can also obtain MC samples of δ = SRS−1β and the corresponding AUC based on Eq. (13). The complete Gibbs-sampling-based SAME algorithm for estimating π, σ2, δm, and AUC is summarized in Algorithm 2.

Algorithm 2

Estimate AUC on SNPs linked by LD

Input: z-scores z = (z1, …, zM)

Output: Estimated π, σ2, δ and AUC

Initialize π, σ2, γm ~ Bernoulli(π) and βm ~ (1 − γm)δ0 + γmN(0, σ2) for m = 1…M

D ← 1

\(\lambda \leftarrow \frac{1}{1+1/{\sigma }^{2}}\)

repeat

for d ← 1 to D do

for m ← 1 to M do

If γm = 0, βm ← 0

\({\mu }_{m}\leftarrow \lambda ({z}_{m}-{\sum }_{m{\prime} \ne m}\frac{{R}_{mm{\prime} }{\beta }_{m{\prime} }}{{s}_{m{\prime} }})\)

If γm = 1, sample \({\beta }_{m} \sim N({s}_{m}{\mu }_{m},\lambda {s}_{m}^{2})\)

\({r}_{m}\leftarrow \pi \sqrt{\frac{\lambda }{{\sigma }^{2}}}\exp \left(\frac{{\mu }_{m}^{2}}{2\lambda }\right)\)

\({h}_{m}\leftarrow \frac{{r}_{m}}{(1-\pi )+{r}_{m}}\)

Sample γm ~ Bernoulli(hm)

end for

β(d) ← β and γ(d) ← γ

end for

Sample \(\pi \sim {{\rm{Beta}}}\left(\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D,\,MD-\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D\right)\)

Sample \({\sigma }^{-2} \sim {{\rm{Gamma}}}\left(\frac{1}{2}\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\gamma }_{m}(d)+D,\,\frac{1}{2}\mathop{\sum }\limits_{d=1}^{D}\mathop{\sum }\limits_{m=1}^{M}{\beta }_{m}{(d)}^{2}{\gamma }_{m}(d)\right)\)

D ← D + 1

until (π, σ2) converge.

\(\delta \leftarrow SR{S}^{-1}\beta\), \({{\Delta }}\leftarrow \frac{2\mathop{\sum }\nolimits_{m=1}^{M}{\omega }_{m}{\delta }_{m}}{\sqrt{{\tau }_{0}^{2}+{\tau }_{1}^{2}}}\) and AUC ← Φ(Δ)
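The inner loop of Algorithm 2 can be sketched as follows. This is a minimal numpy illustration, assuming standardized effects (sm = 1 for all m) and omitting the augmented replicates d = 1, …, D and the (π, σ2) updates; the variable names are illustrative and not taken from the authors' implementation.

```python
# One inner Gibbs sweep over the M SNPs of a spike-and-slab model,
# simplified from Algorithm 2 with s_m = 1 for all SNPs.
import numpy as np

def gibbs_sweep(z, R, beta, gamma, pi, sigma2, rng):
    """Update (beta, gamma) in place given z-scores z and LD matrix R."""
    M = len(z)
    lam = 1.0 / (1.0 + 1.0 / sigma2)   # posterior variance of a causal effect
    for m in range(M):
        if gamma[m] == 0:
            beta[m] = 0.0
        # residualized z-score: remove the LD contributions of all other SNPs
        mu = lam * (z[m] - (R[m] @ beta - R[m, m] * beta[m]))
        if gamma[m] == 1:
            beta[m] = rng.normal(mu, np.sqrt(lam))
        # spike-and-slab inclusion odds and probability
        r = pi * np.sqrt(lam / sigma2) * np.exp(mu**2 / (2.0 * lam))
        h = r / ((1.0 - pi) + r)
        gamma[m] = rng.binomial(1, h)
    return beta, gamma
```

A production version would vectorize the residual update and guard the exponential against overflow for very large z-scores; the loop above mirrors the pseudocode one statement at a time for readability.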

Calculating LD from a reference panel

Algorithm 2 requires the LD matrix among the pre-selected SNPs as input. Some projects, such as the 1000 Genomes Project26 and the HapMap 3 project27, have released individual-level genotype data, which can serve as reference panels for extracting the LD matrix. In our method, we chose the 1000 Genomes Project as the default reference panel because of its larger sample size. Note that most PRS methods compute weights on the SNPs genotyped in the HapMap 3 project (HM3 SNPs), because they constitute a set of commonly used tag SNPs that are usually well imputed across different GWAS. To obtain reliable LD estimates and to reduce the computational cost of Algorithm 2, we only included HM3 SNPs from the reference panel in our experiments.
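A hypothetical sketch of the SNP-matching step this implies: keep only the HM3 SNPs that appear in both the GWAS summary statistics and the reference panel, so that the LD matrix is computed on exactly the SNPs carrying z-scores. All identifiers below are illustrative, not taken from the authors' code.

```python
# Intersect GWAS SNPs with reference-panel and HM3 SNP sets, preserving
# the GWAS ordering so z-scores and LD matrix rows stay aligned.
def match_snps(gwas_snps, panel_snps, hm3_snps):
    """Return the ordered list of GWAS SNPs usable for LD extraction."""
    usable = set(panel_snps) & set(hm3_snps)
    return [rsid for rsid in gwas_snps if rsid in usable]

shared = match_snps(
    gwas_snps=["rs1", "rs2", "rs3", "rs4"],   # toy rsIDs for illustration
    panel_snps=["rs2", "rs3", "rs4", "rs9"],
    hm3_snps=["rs1", "rs2", "rs4"],
)
# shared == ["rs2", "rs4"]
```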

We note that LD between SNPs tends to decay as the distance between them increases45; the genotypes of SNPs that are far apart are approximately independent. We therefore use LDetect to divide the whole genome into approximately independent blocks46. For human genomes of European ancestry, LDetect partitions the genome into a total of 1703 blocks.
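Given precomputed block boundaries, assigning each SNP to its block reduces to a binary search over the block start positions on its chromosome. The sketch below assumes the LDetect boundaries are available as a sorted list of per-chromosome start coordinates; the positions shown are made up for illustration.

```python
# Assign a SNP to its approximately independent LD block by base-pair
# position, using binary search over sorted block start positions.
import bisect

def assign_block(position, block_starts):
    """Index of the block whose interval contains `position`."""
    return bisect.bisect_right(block_starts, position) - 1

block_starts = [0, 1_000_000, 2_500_000]   # toy boundaries for one chromosome
assert assign_block(1_200_000, block_starts) == 1
```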

Within each partitioned block, the correlation matrix among the SNP genotypes must be estimated as an input. Many methods have been proposed to estimate the SNP covariance matrix47,48,49, but most are sensitive to the structure of the covariance matrix or to the distribution of the sample data. The Ledoit-Wolf estimator, by contrast, does not rely on assumptions about the covariance structure or the sample distribution49. In our method, we first standardized the genotypes in the reference panel and then applied the Ledoit-Wolf estimator to the standardized genotypes to obtain the correlation matrix.
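The Ledoit-Wolf estimator shrinks the sample correlation matrix toward a scaled identity with a data-driven intensity. The numpy-only sketch below follows the standard Ledoit-Wolf construction on standardized genotypes; it is an illustration rather than the authors' implementation (scikit-learn's `sklearn.covariance.LedoitWolf` provides an equivalent, optimized version).

```python
# Ledoit-Wolf shrinkage of the per-block SNP correlation matrix:
# a convex combination of the sample correlation S and a scaled identity,
# with the shrinkage intensity estimated from the data.
import numpy as np

def ledoit_wolf_corr(X):
    """X: n x p genotype matrix; returns the shrunk correlation matrix."""
    n, p = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each SNP
    S = X.T @ X / n                                # sample correlation matrix
    m = np.trace(S) / p                            # scale of the identity target
    d2 = np.linalg.norm(S - m * np.eye(p)) ** 2 / p
    # average squared deviation of per-sample outer products around S
    b2_bar = sum(np.linalg.norm(np.outer(x, x) - S) ** 2 for x in X) / (n**2 * p)
    b2 = min(b2_bar, d2)                           # cannot shrink past the target
    rho = b2 / d2 if d2 > 0 else 1.0               # shrinkage intensity in [0, 1]
    return rho * m * np.eye(p) + (1.0 - rho) * S

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 8)).astype(float)  # toy 0/1/2 genotype matrix
C = ledoit_wolf_corr(G)
```

Because the result is a convex combination of two positive semi-definite matrices, the shrunk estimate is always well-conditioned, which matters when it is later inverted or factorized inside the Gibbs updates.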

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.