Introduction

There is a great demand for more accurate genetic prediction models of complex traits. Better models will, for example, improve our ability to investigate genetic architecture, detect genetic overlap between traits and search for gene–environment interactions1,2. They will also enable more widespread use of precision medicine, for example, by enabling us to better identify subgroups of individuals with elevated risk of developing a particular disease, or those with lowest chance of responding to a particular medication3,4,5,6,7.

Many complex traits have high SNP heritability, which justifies the use of genome-wide, linear, SNP-based prediction models8,9. The resulting predictions are called polygenic risk scores (PRS). They take the form P = X1 β1 + X2 β2 + … + Xm βm, where m is the total number of SNPs, while Xj and βj denote, respectively, the genotypes and estimated effect size for SNP j. Tools for constructing PRS differ in how they estimate the SNP effect sizes. The simplest way to construct a PRS is using effect size estimates from single-predictor regression (classical PRS). However, it is generally better to use an advanced prediction tool that estimates effect sizes using a multi-SNP regression model10,11,12,13,14.
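As a minimal sketch of the formula above, the score for one individual is simply the dot product of their genotypes with the estimated effect sizes. The function name and the toy values are illustrative assumptions, not part of any tool described here.

```python
from typing import Sequence


def polygenic_score(genotypes: Sequence[float], effects: Sequence[float]) -> float:
    """Return P = X1*b1 + X2*b2 + ... + Xm*bm for a single individual."""
    if len(genotypes) != len(effects):
        raise ValueError("need exactly one effect size per SNP")
    return sum(x * b for x, b in zip(genotypes, effects))


# Toy example with m = 3 SNPs (allele counts and effect sizes are made up).
score = polygenic_score([0, 1, 2], [0.10, -0.05, 0.02])
```

Tools differ only in how the `effects` vector is estimated; the scoring step itself is the same for classical and advanced PRS.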

Advanced prediction tools start by making prior assumptions regarding how SNPs contribute toward the phenotype. These assumptions include specifying a heritability model, which describes how E[h2j], the expected heritability contributed by each SNP, varies across the genome15. Almost all existing advanced prediction tools automatically assume that E[h2j] is constant. We refer to this as the GCTA Model, because it is a core assumption of the software GCTA (Genome-wide Complex Trait Analysis)8. In particular, the GCTA Model is assumed by any prediction tool that uses a multi-SNP regression model and assigns the same penalty or prior distribution to standardized SNP effect sizes9,16. However, the GCTA Model is suboptimal. Recently, we provided a method for comparing different heritability models using summary statistics from genome-wide association studies17. Across tens of complex traits, the model that fit real data best was the BLD-LDAK Model, in which E[h2j] depends on minor allele frequency (MAF), local levels of linkage disequilibrium and functional annotations.

In this paper, we construct PRS for a variety of complex traits using eight new prediction tools. The main difference between these and existing tools is that they allow the user to specify the heritability model. We show that for all eight tools, the accuracy of the PRS improves when we switch from the GCTA Model to the BLD-LDAK Model. When individual-level genotype and phenotype data are available, we recommend using our new tool LDAK-Bolt-Predict (a generalized version of the prediction tool contained within the existing software Bolt-LMM18). With access only to summary statistics and a reference panel, we recommend using our new tool LDAK-BayesR-SS (a generalized version of the existing prediction tool SBayesR14). Both tools are available in our software LDAK15 (www.ldak.org).

Results

Overview of methods

Figure 1a classifies our eight new prediction tools based on the form of the prior distribution they assign to SNP effect sizes. Our four individual-level tools, big_spLinReg, LDAK-Ridge-Predict, LDAK-Bolt-Predict and LDAK-BayesR-Predict, use the same prior distribution forms as the existing individual-level data tools Lasso (least absolute shrinkage and selection operator)16, BLUP (best linear unbiased prediction)19, Bolt-LMM18 and BayesR11, respectively. Our four new summary statistic tools, LDAK-Lasso-SS, LDAK-Ridge-SS, LDAK-Bolt-SS and LDAK-BayesR-SS, use the same prior distribution forms as the existing summary statistic tools lassosum13, sBLUP20, LDpred12 and SBayesR14, respectively. Figure 1b illustrates how our new tools incorporate alternative heritability models by allowing the parameters of the effect size prior distribution to vary across SNPs. We provide full details of our new tools in Methods, and scripts for repeating our analyses in Supplementary Note 1.

Fig. 1: Prior distributions for SNP effect sizes.

a We divide prediction tools based on the form of the prior distribution they assign to SNP effect sizes, and whether they use individual-level data or summary statistics. For each of our eight new tools (names in blue), there is an existing tool that uses the same prior distribution form (names in red). b Having selected the form of the effect size prior distribution, most existing prediction tools use the same parameters for each SNP. Our new tools, by contrast, use SNP-specific prior distribution parameters. To illustrate this difference, we consider lasso-based prediction tools that assign a double exponential prior distribution to standardized SNP effect sizes. While existing tools might, for example, set the variance of the prior distribution to 5e−7 (so that E[h2j]=5e−7 for all SNPs), our new tools instead let the variance vary across the genome (allowing E[h2j] to be set according to the chosen heritability model).

In total, we construct PRS for 225 phenotypes from the UK Biobank21,22 (Supplementary Data 1). When using individual-level prediction tools, we restrict to the 14 phenotypes for which we have access to individual-level data. Of these, eight are continuous (body mass index, forced vital capacity, height, impedance, neuroticism score, pulse rate, reaction time and systolic blood pressure), four are binary (college education, ever smoked, hypertension and snorer), and two are ordinal (difficulty falling asleep and preference for evenings). For each phenotype, we have 220,000 distantly-related (pairwise allelic correlations <0.03125), white British individuals, recorded for 628,694 high-quality (information score >0.9), common (MAF > 0.01), autosomal, directly-genotyped SNPs. When constructing PRS, we use 200,000 individuals as training samples, and the remaining 20,000 individuals as test samples. When we require a reference panel, we use the genotypes of 20,000 individuals picked at random from the 200,000 training samples. We measure the accuracy of a PRS via R2, the squared correlation between observed and predicted phenotypes across the 20,000 test samples, and estimate the s.d. of R2 via jackknifing. For a given phenotype, R2 is upper-bounded by h2SNP, the SNP heritability, estimates of which range from 0.07 to 0.61 (Supplementary Table 1). When using summary statistic prediction tools, we construct PRS for all 225 phenotypes, using results released by the Neale Lab. These results come from association studies with average sample size 285k (range 35–361k), and the average h2SNP is 0.22 (range 0.07–0.63).
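The accuracy measure described above can be sketched as follows. The text does not state which jackknife variant was used, so the delete-one-block scheme and the choice of 10 blocks below are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def squared_correlation(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """R2: squared Pearson correlation between observed and predicted phenotypes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    return sxy * sxy / (sxx * syy)  # undefined if either vector is constant


def jackknife_sd(observed: Sequence[float], predicted: Sequence[float], blocks: int = 10) -> float:
    """Delete-one-block jackknife estimate of the s.d. of R2 (block count is an assumption)."""
    n = len(observed)
    estimates: List[float] = []
    for b in range(blocks):
        keep = [i for i in range(n) if i % blocks != b]  # drop every sample in block b
        estimates.append(
            squared_correlation([observed[i] for i in keep], [predicted[i] for i in keep])
        )
    mean = sum(estimates) / blocks
    return math.sqrt((blocks - 1) / blocks * sum((e - mean) ** 2 for e in estimates))
```

In the analyses described here, `observed` and `predicted` would each hold values for the 20,000 test samples.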

We consider three different heritability models: the GCTA Model assumes E[h2j] is constant, the LDAK-Thin Model allows E[h2j] to vary based on the MAF of SNP j, while the BLD-LDAK Model allows E[h2j] to vary based on the MAF of SNP j, local levels of linkage disequilibrium and functional annotations17. Our previous work compared heritability models based on how well they fit real data17. Specifically, we measured their performance via the Akaike Information Criterion23 (AIC), equal to 2 K - 2logl, where K is the number of parameters in the heritability model and logl is the approximate log likelihood (lower AIC is better). Across the 12 models we considered, AIC was lowest for the BLD-LDAK Model, highest for the GCTA Model, and intermediate for the LDAK-Thin Model (we reproduce these results in Supplementary Table 2).
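The AIC comparison can be written in a few lines. The log likelihoods in the usage example are hypothetical placeholders, not values from Supplementary Table 2; only the parameter counts (1 and 66) come from the text.

```python
from typing import Dict, Tuple


def aic(num_parameters: int, log_likelihood: float) -> float:
    """AIC = 2K - 2 logl; lower values indicate a better-fitting model."""
    return 2 * num_parameters - 2 * log_likelihood


def best_model(models: Dict[str, Tuple[int, float]]) -> str:
    """Return the name of the model with lowest AIC, given {name: (K, logl)}."""
    return min(models, key=lambda name: aic(*models[name]))


# Hypothetical log likelihoods, chosen only to illustrate the comparison.
example = {"GCTA": (1, -1000.0), "LDAK-Thin": (1, -900.0), "BLD-LDAK": (66, -700.0)}
chosen = best_model(example)
```

The penalty term 2K means that a model with many parameters is preferred only when its gain in log likelihood exceeds the number of extra parameters.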

Supplementary Fig. 1 shows that when run assuming the GCTA Model, each of our new prediction tools performs at least as well as the corresponding existing tool. For some pairs of tools, the results are almost identical. For example, the PRS constructed using LDAK-Bolt-Predict and LDAK-BayesR-SS assuming the GCTA Model have similar accuracy to those constructed using Bolt-LMM and SBayesR, respectively. However, for other pairs, our tools are superior. For example, the PRS constructed using LDAK-Lasso-SS and LDAK-Ridge-SS assuming the GCTA Model tend to be more accurate than those from lassosum and sBLUP, respectively. We explain the algorithmic innovations that lead to these improvements in Supplementary Note 2. As the aim of this paper is to demonstrate the impact on prediction accuracy of improving the heritability model (not due to algorithmic innovations), for the analyses below, we always use our new tools.

Performance of individual-level data prediction tools

First we use our four new individual-level data tools to construct PRS for the first 14 UK Biobank phenotypes. When using all 200,000 training samples, the tools take ~4 h (LDAK-Ridge-Predict), 20 h (LDAK-Bolt-Predict) or 50 h (big_spLinReg and LDAK-BayesR-Predict), and require 35 Gb memory (note that for big_spLinReg, LDAK-Bolt-Predict and LDAK-BayesR-Predict, the runtimes can be reduced substantially by using multiple CPUs).

Figure 2 and Supplementary Table 3 show that the accuracy of PRS always increases when we replace the GCTA Model with either the LDAK-Thin or BLD-LDAK Model (i.e., for all four tools and for all 14 phenotypes). For our recommended tool, LDAK-Bolt-Predict, replacing the GCTA Model with the LDAK-Thin Model increases R2 by on average 9% (s.d. 2%), while replacing the GCTA Model with the BLD-LDAK Model increases R2 by on average 14% (s.d. 2%). Moreover, when run assuming the BLD-LDAK Model, LDAK-Bolt-Predict outperforms our implementations of the existing tools Lasso, BLUP, Bolt-LMM and BayesR for all 14 phenotypes. We note that the performances of LDAK-Bolt-Predict and LDAK-BayesR-Predict are very similar. For example, when run assuming the BLD-LDAK Model, the tools have average R2 0.080 and 0.081, respectively (s.d.s 0.001), and each tool produces the most accurate PRS for seven of the 14 phenotypes. Therefore, our decision to recommend LDAK-Bolt-Predict simply reflects its faster runtime.

Fig. 2: Impact of changing the heritability model when using individual-level data.

a We use our four new individual-level data prediction tools to construct PRS for the first 14 UK Biobank phenotypes (using all 200,000 training samples). We measure the accuracy of each PRS via R2, the squared correlation between observed and predicted phenotypes across 20,000 test samples. Points report the percentage increase in R2 for individual phenotypes when each tool is switched from assuming the GCTA Model to either the LDAK-Thin or BLD-LDAK Model (boxes mark the median and inter-quartile range across the 14 phenotypes). b For the same analysis as a, bars report R2 averaged across the 14 phenotypes (vertical segments mark 95% confidence intervals). Colors indicate the assumed heritability model, while blocks indicate the prediction tool. The horizontal lines mark average R2 for classical PRS and a 95% confidence interval. Source data are provided within the Source Data file.

Figure 3 and Supplementary Table 1 show how the accuracy of PRS constructed using LDAK-Bolt-Predict varies with the number of training samples. We find that the increase we observed when we switched from the GCTA Model to the BLD-LDAK Model is equivalent to increasing the number of training samples by about 24%. The ratio R2/h2SNP indicates the accuracy of a PRS relative to the maximum possible accuracy. When we use 200,000 training samples, the PRS achieve between 13% (difficulty falling asleep) and 62% (height) of their potential. The lines of best fit suggest that if we had individual-level data for 400,000 samples, the PRS would explain between 23 and 78% of SNP heritability.

Fig. 3: Dependency of prediction accuracy on sample size.

a We use LDAK-Bolt-Predict to construct PRS for the first 14 phenotypes, varying n, the number of training samples, between 100,000 and 200,000. We measure the accuracy of each PRS via R2, the squared correlation between observed and predicted phenotypes across 20,000 test samples. Points report R2 averaged across the 14 phenotypes; colors indicate the assumed heritability model. The lines of best fit are obtained by regressing average R2 on a + bn + cn2; for the GCTA Model, we use the best fit line to predict average R2 if the sample size was 24% higher than specified (dashed line). b For the same analysis as a, points report R2/h2SNP for PRS constructed assuming the BLD-LDAK Model, where h2SNP is the estimated SNP heritability (the maximum possible R2). The lines of best fit are obtained by regressing R2/h2SNP on 1-exp(a + bn). Source data are provided within the Source Data file.

Performance of summary statistic prediction tools

Now we use our four new summary statistic tools to construct PRS for all 225 UK Biobank phenotypes. Constructing each PRS takes under 2 h (regardless of which tool we use) and requires <10 Gb memory. Supplementary Fig. 2 and Supplementary Data 1 show that switching from the GCTA Model to the LDAK-Thin Model increases R2 for between 217 and 225 phenotypes (depending on tool), while switching from the GCTA Model to the BLD-LDAK Model increases R2 for between 223 and 225 phenotypes. LDAK-BayesR-SS has the highest average R2 of the four prediction tools, and produces the most accurate PRS for 137 of the 225 phenotypes.

Figure 4 shows that when run assuming the BLD-LDAK Model, LDAK-BayesR-SS outperforms our implementations of the existing tools lassosum, sBLUP, LDpred and SBayesR for 223 of the 225 phenotypes. Compared to the best existing tool, the average increase in R2 is 14% (s.d. 1%). Consistent with simulations (Supplementary Figs. 3 & 4), we find that the increase tends to be higher for phenotypes with lower R2. Nonetheless, the average increase remains substantial and significant (P < 1e−16 from a one-sided Wald Test) if we consider only the 106 phenotypes with R2 < 0.05, only the 51 phenotypes with 0.05 < R2 < 0.1, or only the 68 phenotypes with R2 > 0.1.

Fig. 4: Impact of changing the heritability model when using summary statistics.

a For each of the 225 phenotypes, we compare PRS constructed using LDAK-BayesR-SS assuming either the LDAK-Thin or BLD-LDAK Model, to PRS constructed using our implementations of the existing tools lassosum, sBLUP, LDpred and SBayesR. We measure the accuracy of PRS via R2, the squared correlation between observed and predicted phenotypes. The x-axis reports highest R2 across the four existing tools, while the y-axis reports the percentage increase in R2 if instead of using the existing tool with highest R2, we use LDAK-BayesR-SS assuming either the LDAK-Thin or BLD-LDAK Model (improvements above 50% are truncated). b The same as a, except phenotypes are grouped based on highest R2 across the four existing tools: 0.01–0.05 (106 phenotypes), 0.05–0.10 (51 phenotypes) or 0.10–0.33 (68 phenotypes). Boxes mark the median increase in R2 and the inter-quartile range. Source data are provided within the Source Data file.

Additional analyses

For our main analyses, we measured the accuracy of PRS using R2. Supplementary Fig. 5 shows that improving the heritability model also improves accuracy if we instead measure mean absolute error or (for the binary phenotypes) area under the curve. For our main analyses, we used only directly-genotyped SNPs. Supplementary Fig. 6 shows that improving the heritability model also improves the accuracy of PRS when we increase the number of SNPs from 629,000 to 7.5 M by including imputed genotypes.

For Supplementary Fig. 7 and Supplementary Table 4, we consider eight diseases: asthma, atrial fibrillation, breast cancer, inflammatory bowel disease, prostate cancer, rheumatoid arthritis, schizophrenia and type 2 diabetes. For each disease, we construct PRS using summary statistics from published studies (average sample size 117,000, range 35,000–215,000) that did not include UK Biobank data24,25,26,27,28,29,30,31, then test them using UK Biobank data. Again, we find that for all phenotypes, the accuracy of PRS improves when we replace the GCTA Model with the LDAK-Thin or BLD-LDAK Model. This indicates that the improvements we observed in the main analyses are not an artifact of genotyping errors (as were this the case, we would expect the improvements to disappear when using training and test individuals that have been genotyped independently).

For our main analyses, we used white British individuals from the UK Biobank both to train and test the PRS. For Supplementary Fig. 8 and Supplementary Table 5, we instead test the PRS using UK Biobank individuals of South Asian, African and East Asian ancestry. While absolute accuracy is substantially lower, it remains that PRS constructed assuming the LDAK-Thin or BLD-LDAK Models are more accurate than those constructed assuming the GCTA Model. This indicates that the improvements we observed in the main analyses are not due to population structure (as were this the case, we would expect prediction models constructed assuming the LDAK-Thin or BLD-LDAK Models to perform worse across populations than those constructed assuming the GCTA Model).

Discussion

Most existing prediction tools start with the assumption that each SNP contributes equal heritability9. We have instead developed tools that allow the user to specify more realistic heritability models, and shown how these enable the creation of substantially more accurate PRS. Of our eight new tools, we recommend using LDAK-Bolt-Predict when analyzing individual-level data, and LDAK-BayesR-SS when analyzing summary statistics (in both cases, we advise using the tools assuming the BLD-LDAK Model).

When using LDAK-Bolt-Predict, the average increase in R2 due to changing from the GCTA Model to the BLD-LDAK Model was 14% (s.d. 2%). We showed that this increase is equivalent to increasing the sample size by about a quarter. To provide further perspective, consider that the average increase when switching from using LDAK-Bolt-SS to LDAK-Bolt-Predict (i.e., changing from using summary statistics to individual-level data) was 2% (s.d. 2%), the average increase when switching from using directly-genotyped SNPs to imputed genotypes was 7% (s.d. 2%), the average increase when switching from using LDAK-Ridge-Predict to LDAK-Bolt-Predict (i.e., changing from a single prior distribution for effect sizes to a mixture prior) was 16% (s.d. 2%), while the average increase when switching from classical PRS to LDAK-Ridge-Predict (i.e., changing from classical PRS to the worst-performing advanced prediction tool) was 17% (s.d. 3%).

A strength of our study is that we have considered a variety of complex traits. These include continuous, binary and ordinal phenotypes that have low, medium and high SNP heritability, and that are both closely and distantly related to diseases. Therefore, the fact that we increased prediction accuracy for almost all of the 225 phenotypes we analyzed makes us confident that improvements will be observed for many more complex traits. Similarly, our new prediction tools have varying forms of prior distribution for SNP effect sizes. Therefore, the fact that prediction accuracy increased for all tools indicates that if a new tool is developed with a superior prior distribution form, it is likely that this tool could also be made more accurate by improving the heritability model.

We are aware of two existing summary statistic prediction tools where the user can specify the heritability model, AnnoPred32 and LDpred-funct33. AnnoPred is similar to LDAK-Bolt-SS. It assumes that SNP effect sizes have the prior distribution p0 N(0,σ2) + (1-p0) δ0, then incorporates the chosen heritability model by allowing either σ2 or p0 to vary across SNPs32. Supplementary Fig. 1 shows that AnnoPred is outperformed by LDAK-Bolt-SS, regardless of whether we assume the BLD-LDAK Model (our recommended model) or the Baseline LD Model (recommended by the authors of AnnoPred). LDpred-funct is similar to LDAK-Ridge-SS. It first estimates effect sizes assuming the prior distribution N(0,σ2), where σ2 varies across SNPs according to the chosen heritability model, then regularizes these estimates via cross-validation33. Supplementary Fig. 1 shows that LDpred-funct is outperformed by LDAK-Ridge-SS, regardless of whether we assume the BLD-LDAK Model (our recommended model) or the Baseline LD Model (recommended by the authors of LDpred-funct).

When performing heritability analysis, we previously recommended choosing the heritability model with lowest AIC17. We now recommend the same when constructing PRS. Based on average AIC, the BLD-LDAK, LDAK-Thin and GCTA Models rank first, second and third, respectively, which matches their order when ranked based on the average accuracy of the corresponding PRS. We additionally construct PRS assuming the GCTA-LDMS-I34 and Baseline LD Models35, those currently recommended by the authors of GCTA8 and LDSC36, respectively. Based on average AIC, these two models rank between the LDAK-Thin and BLD-LDAK Models (Supplementary Table 2), which similarly matches their order when ranked based on the average accuracy of the corresponding PRS (Supplementary Fig. 9).

Although we observed improvement for almost all of the 225 UK Biobank phenotypes, we found that the relative advantage of our new prediction tools was largest for phenotypes with small and modest R2 (e.g., those with R2 < 0.1). This is relevant because, at present, most successful applications of genetic prediction models37,38 involve PRS with small or modest R2. For example in psychiatric research, a PRS with R2 ≈ 0.05 was used to show that impulsivity is an endophenotype for attention deficit hyperactivity disorder39, a PRS with R2 ≈ 0.07 was used to show that individuals with chronic schizophrenia had higher-than-average genetic liability to schizophrenia40, a PRS with R2 ≈ 0.02 was used to identify clinically-defined subtypes of autism that have significantly different genetic liabilities41, a PRS with R2 < 0.05 was used to demonstrate that risk of developing emotional problems is moderated by an interaction between environmental sensitivity and type of parenting42, and a PRS with R2 ≈ 0.01 was used to demonstrate that stressful life events and childhood trauma are risk factors for the development of major depressive disorder43. Away from psychiatric research, Khera et al.5 demonstrated the utility of genetic risk prediction for atrial fibrillation, breast cancer, coronary artery disease, inflammatory bowel diseases and type 2 diabetes using PRS with R2 between 0.02 and 0.04.

We finish by noting that the performance of our new prediction tools will increase as more realistic heritability models are developed. To date, most of the improvement in PRS accuracy has come from increasing sample size, algorithmic innovations or developing more effective forms of prior distribution for SNP effect sizes. Our work indicates that in future, more focus should be placed on improving the heritability model.

Methods

We begin by explaining our new prediction tools. Note that before running each tool, it is necessary to estimate the expected heritability contributed by each SNP, given the heritability model. Our prediction tools then use these estimates to set the parameters of the effect size prior distribution for each SNP.

Suppose there are n individuals and m SNPs. Let X denote the matrix of genotypes (size n x m, where column Xj contains the genotypes for SNP j), and Y denote the vector of phenotypes (length n). For convenience, the Xj and Y are standardized, so that Mean(Xj) = Mean(Y) = 0 and Var(Xj) = Var(Y) = 1. We assume that the chi-squared (one degree of freedom) test statistic for SNP j from single-SNP analysis is Sj = n rj2 / (1 – rj2), where rj = XjY/n is the correlation between SNP j and the phenotype (this assumes the analysis performed linear regression, but remains a good approximation for Sj computed using logistic regression44). We consider prediction tools that use the linear model

$${\rm{E}}[{\bf{Y}}]={{\bf{X}}}_{1}{\beta }_{1}+{{\bf{X}}}_{2}{\beta }_{2}+\ldots +{{\bf{X}}}_{{\rm{m}}}{\beta }_{m}={\bf{X}}{\boldsymbol{\beta }}$$
(1)

where βj is the effect size for SNP j, and β = (β1, β2, …, βm)T. Because Xj and Y are standardized, the heritability contributed by SNP j is h2j = βj2.
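The following sketch illustrates the standardization and the single-SNP test statistic defined above. The function names are illustrative assumptions; the variance is computed with denominator n so that the standardized vectors have variance exactly 1, matching the convention in the text.

```python
import math
from typing import List, Sequence, Tuple


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale values to have mean 0 and variance 1 (denominator n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def single_snp_statistic(genotypes: Sequence[float], phenotypes: Sequence[float]) -> Tuple[float, float]:
    """Return (r_j, S_j), where r_j = Xj'Y / n and S_j = n r_j^2 / (1 - r_j^2)."""
    n = len(genotypes)
    x = standardize(genotypes)
    y = standardize(phenotypes)
    r = sum(a * b for a, b in zip(x, y)) / n
    return r, n * r * r / (1 - r * r)  # undefined when |r_j| = 1


# Toy data for four individuals (values are made up).
r_j, s_j = single_snp_statistic([0, 1, 2, 1], [1, 3, 2, 2])
```

For these toy data the correlation is 0.5, so the statistic is 4 × 0.25 / 0.75 = 4/3.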

Heritability models

The heritability model takes the form17

$$E[{h}_{j}^{2}]={a}_{j1}{\tau }_{1}+{a}_{j2}{\tau }_{2}+\ldots +{a}_{jK}{\tau }_{K}$$
(2)

where the ajk are pre-specified SNP annotations, while the parameters τk are estimated from the data44. In total, we consider five heritability models (see Supplementary Tables 6 and 7 for formal definitions): the one-parameter GCTA Model assumes E[h2j] is constant8; the one-parameter LDAK-Thin and 20-parameter GCTA-LDMS-I Models allow E[h2j] to vary based on MAF and local levels of linkage disequilibrium34,35; the 66-parameter BLD-LDAK and 75-parameter Baseline LD Models allow E[h2j] to vary based on MAF, linkage disequilibrium and functional annotations17,35. The GCTA Model is the most used model in statistical genetics9. The GCTA-LDMS-I and Baseline LD Models are the recommended models of the authors of GCTA8 and LDSC36, respectively. The BLD-LDAK Model is our preferred model; however, we recommend the LDAK-Thin Model for applications that demand a simple heritability model17. We explain the biological intuition behind the GCTA, LDAK-Thin and BLD-LDAK Models in Supplementary Fig. 10.
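Equation (2) is a linear combination of annotations, which the sketch below evaluates for every SNP. The two-annotation example is hypothetical; the formal definitions of the five models are in Supplementary Tables 6 and 7 and are not reproduced here.

```python
from typing import List, Sequence


def expected_heritability(annotations: Sequence[Sequence[float]], taus: Sequence[float]) -> List[float]:
    """Evaluate E[h2_j] = a_j1*tau_1 + ... + a_jK*tau_K for each SNP j.

    annotations[j][k] holds a_jk; taus[k] holds tau_k.
    """
    return [sum(a * t for a, t in zip(row, taus)) for row in annotations]


# A constant-heritability model is the one-annotation case with a_j1 = 1 for all SNPs.
constant = expected_heritability([[1.0], [1.0], [1.0]], [0.125])

# Hypothetical two-annotation model (annotation values are made up).
varying = expected_heritability([[1.0, 0.0], [1.0, 2.0]], [0.5, 0.25])
```

With a single annotation equal to 1 for every SNP, each SNP receives the same expected heritability, which is the defining property of the GCTA Model.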

For a given phenotype, we estimate the τk in Eq. (2) using SumHer (an existing tool within the LDAK software), which requires summary statistics from single-SNP analysis and a reference panel44. SumHer first calculates the expected value of Sj given the heritability model

$${\rm{E}}[{S}_{j}]\approx 1+n{\Sigma }_{l}{c}_{jl}^{2}({a}_{l1}{\tau }_{1}+{a}_{l2}{\tau }_{2}+\ldots +{a}_{lK}{\tau }_{K})$$
(3)

where c2jl is the squared correlation between SNPs j and l in the reference panel, while the summation is across SNPs near SNP j (e.g., within 1 cM). Then SumHer estimates the τk by regressing the Sj on the E[Sj]. For further details see our earlier publications17,44. Note that while SumHer can allow for confounding bias (by adding an extra parameter to Eq. (3) designed to capture inflation of test statistics due to population structure and familial relatedness), we no longer recommend this feature, nor use it when constructing prediction models45. The computational demands of SumHer depend on the complexity of the heritability model; for our analyses, it took ~20 min when assuming the GCTA or LDAK-Thin Model, and about 1 h when assuming the BLD-LDAK Model (each time requiring <10 Gb memory). As well as estimating τk, SumHer also reports ej, the estimate of E[h2j] obtained by replacing the τk in Eq. (2) with their estimated values.
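To make Eq. (3) concrete, the sketch below computes E[Sj] from a matrix of squared correlations and then fits a one-parameter model by unweighted least squares. The unweighted fit is a deliberate simplification chosen for illustration and should not be read as SumHer's actual estimator; all function names are assumptions.

```python
from typing import List, Sequence


def expected_statistics(
    n: int,
    sq_corr: Sequence[Sequence[float]],
    annotations: Sequence[Sequence[float]],
    taus: Sequence[float],
) -> List[float]:
    """E[S_j] ~ 1 + n * sum_l c_jl^2 * (a_l1*tau_1 + ... + a_lK*tau_K).

    sq_corr[j][l] holds c_jl^2 for SNPs l near SNP j (zero elsewhere).
    """
    per_snp = [sum(a * t for a, t in zip(row, taus)) for row in annotations]
    return [1 + n * sum(c2 * h for c2, h in zip(row, per_snp)) for row in sq_corr]


def fit_single_tau(
    n: int,
    sq_corr: Sequence[Sequence[float]],
    annotation: Sequence[float],
    stats: Sequence[float],
) -> float:
    """Unweighted least-squares fit of S_j - 1 = tau * z_j for a one-parameter model."""
    z = [n * sum(c2 * a for c2, a in zip(row, annotation)) for row in sq_corr]
    return sum(zj * (s - 1) for zj, s in zip(z, stats)) / sum(zj * zj for zj in z)


# Two SNPs in partial linkage disequilibrium (c^2 = 0.25); values are made up.
c2 = [[1.0, 0.25], [0.25, 1.0]]
stats = expected_statistics(1000, c2, [[1.0], [1.0]], [0.001])
tau_hat = fit_single_tau(1000, c2, [1.0, 1.0], stats)
```

Because the toy statistics are generated exactly from Eq. (3), the fit recovers the value of τ used to create them; with real data the Sj are noisy and the estimate is approximate.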

New prediction tools

Each of our new tools assumes that the error terms in Eq. (1) are normally distributed, so that Y ~ N(Xβ, σ2e), where σ2e is the residual variance. They differ in their forms of prior distributions for SNP effect sizes (Fig. 1a). big_spLinReg and LDAK-Lasso-SS use a double exponential distribution, βj ~ DE(λ / E[h2j]0.5). LDAK-Ridge-Predict and LDAK-Ridge-SS use single Gaussian distributions, βj ~ N(0, E[h2j]) and βj ~ N(0, vE[h2j]), respectively. LDAK-Bolt-Predict and LDAK-Bolt-SS use a mixture of two Gaussian distributions, βj ~ p N(0, (1-f2)/p E[h2j]) + (1-p) N(0, f2/(1-p) E[h2j]). LDAK-BayesR-Predict and LDAK-BayesR-SS use a mixture of a point mass at zero and three Gaussian distributions, βj ~ π1 δ0 + π2 N(0, sE[h2j]/100) + π3 N(0, sE[h2j]/10) + π4 N(0, sE[h2j]), where π1 + π2 + π3 + π4 = 1 and s = (π2/100 + π3/10 + π4)−1. The biological intuition behind the different prior distribution forms is explained in Supplementary Fig. 11. For each tool, we set E[h2j]=ej (the estimate from SumHer), and σ2e = 1-Σej. The remaining prior parameters (λ, v, p, f2, π1, π2, π3 and π4) are decided using cross-validation, as explained below.
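A useful property of the two mixture priors is that, whatever values the mixing parameters take, the overall prior variance of βj equals E[h2j]. The sketch below checks this numerically. The function names and example parameter values are illustrative assumptions; each mixture is represented as a list of (weight, variance) pairs for zero-mean components.

```python
from typing import List, Tuple

Component = Tuple[float, float]  # (mixture weight, variance of a zero-mean component)


def bolt_components(p: float, f2: float, eh: float) -> List[Component]:
    """Two-Gaussian mixture: p N(0, (1-f2)/p * eh) + (1-p) N(0, f2/(1-p) * eh)."""
    return [(p, (1 - f2) / p * eh), (1 - p, f2 / (1 - p) * eh)]


def bayesr_components(pi1: float, pi2: float, pi3: float, pi4: float, eh: float) -> List[Component]:
    """Point mass at zero plus three Gaussians, scaled by s = 1/(pi2/100 + pi3/10 + pi4)."""
    s = 1 / (pi2 / 100 + pi3 / 10 + pi4)  # requires pi2 + pi3 + pi4 > 0
    return [(pi1, 0.0), (pi2, s * eh / 100), (pi3, s * eh / 10), (pi4, s * eh)]


def mixture_variance(components: List[Component]) -> float:
    """Variance of a mixture of zero-mean components: the weighted sum of variances."""
    return sum(weight * variance for weight, variance in components)


eh = 2e-6  # example value of E[h2_j]
bolt_var = mixture_variance(bolt_components(0.1, 0.3, eh))
bayesr_var = mixture_variance(bayesr_components(0.94, 0.03, 0.02, 0.01, eh))
```

Both `bolt_var` and `bayesr_var` equal `eh` up to floating-point rounding, which is why the heritability model can be changed without altering the form of the prior.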

Model fitting using individual-level data

big_spLinReg is a function within our R package bigstatsr46. The original version of the function is described in Privé et al.47; the most recent version is the same, except that it allows the user to provide penalty factors that transform the prior from βj ~ DE(λ) to βj ~ DE(λ / E[h2j]0.5). In summary, big_spLinReg estimates the βj using coordinate descent with warm starts48,49. Given a value for λ, the βj are updated iteratively (starting from zero) until they converge. Within each iteration, each βj within the strong set (the subset of predictors determined most likely to have non-zero effects49) is updated once, by replacing its current value with its conditional posterior mode. λ starts at a value sufficiently high that βj = 0 for all SNPs, then is gradually lowered to allow an increasing number of SNPs to have non-zero effects. big_spLinReg uses ten-fold cross-validation to decide when to stop reducing λ.

LDAK-Bolt-Predict uses the same algorithm for estimating the βj and deciding values for p and f2 as the existing tool Bolt-LMM18. In summary, LDAK-Bolt-Predict uses variational Bayes to estimate the βj. Given values for p and f2, LDAK-Bolt-Predict updates the βj iteratively (starting from zero), until the approximate log likelihood converges. Within each iteration, each βj is updated once, by replacing its current value with its conditional posterior mean. LDAK-Bolt-Predict considers six values for p (0.01, 0.02, 0.05, 0.1, 0.2, and 0.5) and three values for f2 (0.1, 0.3, and 0.5), resulting in 18 possible pairs for p and f2. First LDAK-Bolt-Predict estimates effect sizes for each of the 18 pairs using data from 90% of samples. Then it identifies which pair results in the best fitting model (based on the mean squared difference between observed and predicted phenotypes for the remaining 10% of samples). Finally, for the best-fitting pair, it re-estimates effect sizes using data from all samples. Note that whereas Bolt-LMM begins by using REML (restricted maximum likelihood)50 to estimate h2SNP, then sets E[h2j]=h2SNP/m, LDAK-Bolt-Predict does not require this step because it instead sets E[h2j] based on estimates from SumHer (see above). Supplementary Fig. 1 shows that the results from LDAK-Bolt-Predict, when run assuming the GCTA Model, are very similar to those from Bolt-LMM.
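The grid search over (p, f2) can be sketched as below. The variational Bayes fit itself is not shown: `fit_and_predict` is a hypothetical callable standing in for "estimate effect sizes on 90% of samples, then predict the remaining 10%", and is an assumption of this sketch rather than part of LDAK.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

P_VALUES = (0.01, 0.02, 0.05, 0.1, 0.2, 0.5)
F2_VALUES = (0.1, 0.3, 0.5)


def candidate_pairs() -> List[Tuple[float, float]]:
    """All 6 x 3 = 18 combinations of p and f2."""
    return list(product(P_VALUES, F2_VALUES))


def mean_squared_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def select_pair(
    fit_and_predict: Callable[[float, float], Sequence[float]],
    observed_holdout: Sequence[float],
) -> Tuple[float, float]:
    """Return the (p, f2) pair whose hold-out predictions have the lowest mean squared error."""
    return min(
        candidate_pairs(),
        key=lambda pair: mean_squared_error(observed_holdout, fit_and_predict(*pair)),
    )
```

After `select_pair` returns, the final step described above is to refit the model for the chosen pair using all samples.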

The prior distribution used by LDAK-Ridge-Predict matches that used by LDAK-Bolt-Predict when p = f2 = 0.5. Therefore, LDAK-Ridge-Predict uses the same algorithm as LDAK-Bolt-Predict, except that it fixes p = f2 = 0.5 and it is no longer necessary to perform the cross-validation step. Supplementary Fig. 1 shows that the results from LDAK-Ridge-Predict, when run assuming the GCTA Model, are very similar to those from the existing tool BLUP19 (Best Linear Unbiased Prediction).

The existing tool BayesR estimates all parameters using Markov Chain Monte Carlo (MCMC)11. However, we do not have sufficient resources to apply BayesR to the full UK Biobank data (we estimate that this would require ~900 Gb and weeks of CPU time). Therefore, LDAK-BayesR-Predict instead uses variational Bayes and cross-validation. The algorithm is the same as for LDAK-Bolt-Predict, except that it is now necessary to select suitable values for π1, π2, π3 and π4. In total, we consider 35 different combinations: the first is the ridge regression model (π1, π2, π3, π4) = (0, 0, 0, 1); the remaining 34 are obtained by allowing each of π2, π3 and π4 to take five values (0, 0.01, 0.05, 0.1, 0.2), with the restrictions π4 ≤ π3 ≤ π2 and π2 + π3 + π4 > 0. We investigated omitting the restriction π4 ≤ π3 ≤ π2, in which case there are 125 different triplets; however, we found that while this took approximately four times longer to run, it did not significantly improve prediction accuracy. In Supplementary Fig. 1, we compare LDAK-BayesR-Predict to BayesR (for computational reasons, we analyze only 20,000 individuals and 99,852 SNPs); the accuracy of LDAK-BayesR-Predict is consistent with that of BayesR, yet our tool is ~60 times faster (takes under 20 min, compared to 20 h) and requires 10 times less memory (2 Gb instead of 20 Gb).
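The count of 35 combinations follows directly from the stated restrictions, as the enumeration below confirms. Setting π1 = 1 - π2 - π3 - π4 uses the constraint that the four proportions sum to one; the function name is an illustrative assumption.

```python
from itertools import product
from typing import List, Tuple

VALUES = (0, 0.01, 0.05, 0.1, 0.2)


def bayesr_combinations() -> List[Tuple[float, float, float, float]]:
    """Enumerate (pi1, pi2, pi3, pi4): the ridge model plus all restricted triplets."""
    combos = [(0.0, 0.0, 0.0, 1.0)]  # ridge regression model
    for pi2, pi3, pi4 in product(VALUES, repeat=3):
        # Keep triplets with pi4 <= pi3 <= pi2 that are not all zero.
        if pi4 <= pi3 <= pi2 and pi2 + pi3 + pi4 > 0:
            combos.append((1 - pi2 - pi3 - pi4, pi2, pi3, pi4))
    return combos


combinations = bayesr_combinations()
```

There are 35 non-increasing triplets drawn from five values; excluding the all-zero triplet leaves 34, and adding the ridge model brings the total back to 35.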

The runtimes reported in the main text (~50, 4, 20 and 50 h for big_spLinReg, LDAK-Ridge-Predict, LDAK-Bolt-Predict and LDAK-BayesR-Predict, respectively) correspond to using a single CPU. However, for big_spLinReg, LDAK-Bolt-Predict and LDAK-BayesR-Predict, we also provide parallel versions. For LDAK-Bolt-Predict and LDAK-BayesR-Predict, the parallel versions utilize the fact that models corresponding to different parameter choices can be generated independently (i.e., on different CPUs). For big_spLinReg, this is not possible (because the final βj for one value of λ are used as the starting βj when λ is reduced), but instead, each of the ten cross-validation runs can be performed independently. Additionally, for the functions LDAK-Ridge-Predict, LDAK-Bolt-Predict and LDAK-BayesR-Predict, LDAK automatically creates a save-point every 10 iterations, so that the job can be restarted if it fails to complete within the allocated time.

Model fitting using summary statistics

LDAK-Lasso-SS, LDAK-Ridge-SS, LDAK-Bolt-SS and LDAK-BayesR-SS are all contained within a new tool called MegaPRS. Running MegaPRS requires a reference panel and three sets of summary statistics: full summary statistics (computed using all samples), training summary statistics (computed using, say, 90% of samples) and test summary statistics (computed using the remaining samples). In some cases, you will already have (or be able to construct) training and test summary statistics. However, most likely, you will only have full summary statistics, in which case you should first generate pseudo training and test summary statistics (see below).

MegaPRS exploits the fact that, in the absence of individual-level data, XjTY can be recovered from the results of single-SNP regression (as explained above, we assume Sj = n rj2 / (1 – rj2), where n is the sample size and rj = XjTY/n), while XjTXl can be estimated from the reference panel (specifically, MegaPRS uses XjTXl = n cjl, where cjl is the observed correlation between SNPs j and l in the reference panel). Note that in the equations below, Xj, Y and n vary depending on context. When using full summary statistics, Xj and Y contain genotypes and phenotypes for all samples, and n is the total number of samples. When using training (test) summary statistics, Xj and Y contain genotypes and phenotypes for only training (test) samples, and n is the number of training (test) samples.
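Rearranging Sj = n rj2 / (1 – rj2) gives rj2 = Sj / (n + Sj), so the cross-product of genotypes and phenotypes follows from the test statistic once a sign is known. The sketch below assumes the sign is taken from the reported direction of effect; the function names are introduced here and are not part of MegaPRS.

```python
import numpy as np


def correlation_from_statistic(s, n, sign):
    """Invert S = n r^2 / (1 - r^2) to obtain r = sign * sqrt(S / (n + S)).

    `sign` holds the direction of each SNP's effect (an assumed input).
    """
    s = np.asarray(s, dtype=float)
    return np.sign(sign) * np.sqrt(s / (n + s))


def xty_from_statistic(s, n, sign):
    """Recover X_j^T Y, using r_j = X_j^T Y / n."""
    return n * correlation_from_statistic(s, n, sign)
```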

MegaPRS has three steps. In Step 1, it uses the reference panel to estimate SNP–SNP correlations. In Step 2, it constructs pairs of prediction models, first using the training summary statistics (we refer to these as the “training models”), then using the full summary statistics (the “full models”). In Step 3, it uses the test summary statistics to identify the most accurate of the training models, then reports effect sizes for the corresponding full model. For our analyses, each step took <30 min and required <10 Gb memory.

In Step 1, MegaPRS searches the reference panel for local pairs of SNPs with significant cjl (by default, we define local as within 3 cM and significant as P < 0.01 from a two-sided likelihood ratio test that cjl = 0). MegaPRS saves the local, significant pairs in a binary file, which requires 8 bytes for each pair (one integer to save the index of the second SNP, one float to save the correlation). For the UK Biobank data, there were 260 M local, significant pairs (on average, 413 per SNP), and so the corresponding binary file had size 1.9 Gb.
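The file size follows directly from the record layout. The sketch below assumes a 4-byte integer and a 4-byte float per pair (the text specifies only that each pair takes 8 bytes), uses hypothetical field names, and takes the SNP count of 628,694 from the Data section.

```python
import numpy as np

# Assumed layout: 4-byte index of the second SNP plus 4-byte correlation.
PAIR_DTYPE = np.dtype([("second_snp", np.int32), ("correlation", np.float32)])

n_pairs = 260_000_000  # local, significant pairs in the UK Biobank data
n_snps = 628_694       # autosomal SNPs used in the main analyses

file_bytes = n_pairs * PAIR_DTYPE.itemsize
file_gib = file_bytes / 2**30     # size in units of 2^30 bytes
pairs_per_snp = n_pairs / n_snps  # average number of pairs per SNP
```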

In Step 2, MegaPRS uses the training and full summary statistics to estimate effect sizes for training and full prediction models, respectively. Pairs of training and full models correspond to different prior distribution parameters. MegaPRS constructs 11 pairs of models if running LDAK-Lasso-SS or LDAK-Ridge-SS, 132 pairs of models if running LDAK-Bolt-SS and 84 pairs of models if running LDAK-BayesR-SS (Supplementary Table 8 lists the prior parameters to which these correspond). MegaPRS estimates effect sizes iteratively using variational Bayes. As explained above (in the description of LDAK-Bolt-Predict), the variational Bayes algorithm replaces each current estimate of βj with its conditional posterior mean. This is possible because for all four tools, the posterior distribution for βj can be expressed in terms of XjTY and XjTXl. For example, if we write the prior distribution for LDAK-Bolt-SS in the form βj ~ p N(0, σ2Big) + (1-p) N(0, σ2Small), then the conditional posterior distribution of βj has the form p’N(μBig,vBig) + (1-p’)N(μSmall,vSmall), where

$$\begin{array}{l} {\mu }_{{\rm{Big}}}={{\bf{X}}}_{j}^{T}({\bf{Y}}-{\bf{X}}{\boldsymbol{\beta }}+{{\bf{X}}}_{j}{\beta }_{j})/({{\bf{X}}}_{j}^{T}{{\bf{X}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{{\rm{Big}}}^{2})\\ {v}_{{\rm{Big}}}={\sigma }_{e}^{2}/({{\bf{X}}}_{j}^{T}{{\bf{X}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{{\rm{Big}}}^{2})\\ {\mu }_{{\rm{Small}}}={{\bf{X}}}_{j}^{T}({\bf{Y}}-{\bf{X}}{\boldsymbol{\beta }}+{{\bf{X}}}_{j}{\beta }_{j})/({{\bf{X}}}_{j}^{T}{{\bf{X}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{{\rm{Small}}}^{2})\\ {v}_{{\rm{Small}}}={\sigma }_{e}^{2}/({{\bf{X}}}_{j}^{T}{{\bf{X}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{{\rm{Small}}}^{2})\\ p{\prime} ={[1+(1-p)/p\,({v}_{{\rm{Small}}}/{v}_{{\rm{Big}}})({\sigma }_{{\rm{Big}}}/{\sigma }_{{\rm{Small}}})\exp ([{\mu }_{{\rm{Small}}}^{2}/{v}_{{\rm{Small}}}-{\mu }_{{\rm{Big}}}^{2}/{v}_{{\rm{Big}}}]/2)]}^{-1}\end{array}$$
(4)
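To illustrate why updates of this kind need only the cross-products of genotypes with phenotypes and with each other, the sketch below uses the simpler case of a single Gaussian prior N(0, σ2β), for which the conditional posterior mean of βj has the same form as μBig with σ2Big replaced by σ2β. It is a simplified stand-in, not the two-component update of MegaPRS, and the function and argument names are introduced here.

```python
import numpy as np


def ridge_updates_from_summaries(xty, xtx, sigma2_e, sigma2_beta,
                                 max_iter=200, tol=1e-10):
    """Replace each beta_j in turn by its conditional posterior mean.

    xty : vector of X_j^T Y
    xtx : matrix of X_j^T X_l (for example n times SNP-SNP correlations)
    Only these summaries are used; no individual-level data are needed.
    """
    xty = np.asarray(xty, dtype=float)
    xtx = np.asarray(xtx, dtype=float)
    beta = np.zeros(len(xty))  # start from zero
    shrink = sigma2_e / sigma2_beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(len(beta)):
            # X_j^T (Y - X beta + X_j beta_j) written with summaries only
            partial = xty[j] - xtx[j] @ beta + xtx[j, j] * beta[j]
            new_value = partial / (xtx[j, j] + shrink)
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:
            break
    return beta
```

For this single-Gaussian case the iterations converge to the solution of (XTX + (σ2e/σ2β) I) β = XTY, which gives a simple check on the updates.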

When performing variational Bayes using summary statistics, we found it was not feasible to iterate over all SNPs in the genome. This was due to differences between estimates of XjTXl from the reference panel and their true values (a consequence of the fact that individuals in the reference panel are different from those used in the original association analysis, and because we assume XjTXl = 0 for pairs of SNPs that are either distant or not significantly correlated). These differences accumulate over the genome, resulting in poor estimates of XjTXβ = Σl XjTXl βl, and therefore poor estimates of the conditional posterior distribution of βj. To avoid these problems, MegaPRS uses sliding windows (see Supplementary Fig. 12 for an illustration). By default, MegaPRS iteratively estimates effect sizes for all SNPs in a 1 cM window, stopping when the estimated proportion of variance explained by these SNPs converges (changes by <1e−5 between iterations). At this point, MegaPRS moves 1/8 cM along the genome, and repeats for the next 1 cM window. Within each window, MegaPRS assumes σ2e = 1 (this approximation is reasonable because the expected heritability contributed by a single window will be close to zero). If a window fails to converge within 50 iterations, MegaPRS resets the βj to their values prior to that window. We found this happened rarely. For example, our main analyses constructed 160,650 full models (across 225 phenotypes, four tools and three heritability models), and for only 990 of these (0.6%) did any of the ~12,000 regions fail to converge.
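The window layout can be sketched as below. This is an illustration of overlapping 1 cM windows advanced in steps of 1/8 cM for a single chromosome with sorted genetic positions; how MegaPRS handles window boundaries and chromosome ends is not specified here, so those details are assumptions, as is the function name.

```python
import numpy as np


def sliding_windows(positions_cm, width=1.0, step=0.125):
    """Yield arrays of SNP indices for windows [start, start + width).

    positions_cm must be sorted genetic positions (in cM) on one chromosome.
    Windows that contain no SNPs are skipped.
    """
    positions_cm = np.asarray(positions_cm, dtype=float)
    start = positions_cm[0]
    while start <= positions_cm[-1]:
        first = np.searchsorted(positions_cm, start, side="left")
        last = np.searchsorted(positions_cm, start + width, side="left")
        if last > first:
            yield np.arange(first, last)
        start += step
```

Because consecutive windows overlap, each SNP is updated in several windows before the procedure moves past it.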

In Step 3, MegaPRS uses the test summary statistics to measure the accuracy of the training models. If β denotes the vector of estimated effect sizes for a model, MegaPRS calculates R = βTXTY / (n βTXTXβ)1/2, an estimate of the correlation between observed and predicted phenotypes for the individuals used to compute the test summary statistics. MegaPRS then constructs the final model by extracting effect sizes from the full model corresponding to the training model with highest R.
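The correlation between observed and predicted phenotypes can be written using cross-products alone. The sketch below assumes standardized genotypes and a standardized phenotype (so that YTY = n), in which case the correlation between Xβ and Y equals βTXTY divided by the square root of n βTXTXβ. The function name is introduced here; in practice the genotype cross-products would be approximations rather than exact values.

```python
import numpy as np


def accuracy_from_summaries(beta, xty, xtx, n):
    """Correlation between X beta and Y, computed from X^T Y and X^T X.

    Assumes standardized genotypes and phenotype (Y^T Y = n).
    """
    beta = np.asarray(beta, dtype=float)
    return float(beta @ xty / np.sqrt(n * (beta @ xtx @ beta)))
```

With exact cross-products the value matches the correlation computed from individual-level data, which gives a direct way to check the expression.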

In Supplementary Fig. 1, we compare LDAK-Lasso-SS with lassosum13, LDAK-Ridge-SS with sBLUP20, LDpred-inf12 and LDpred-funct33, LDAK-Bolt-SS with LDpred251 and AnnoPred32, and LDAK-BayesR-SS with SBayesR14. When we run our new tools assuming the GCTA model, they perform at least as well as the corresponding existing tools.

Pseudo summary statistics

Given results from single-SNP analysis using n samples, we wish to generate two sets of results, mimicking those we would obtain from first analyzing nA < n samples, then analyzing the remaining nB = n - nA samples. We can reword the task as follows. Let γ = (γ1, γ2, …, γm)T denote the vector of true SNP effect sizes from single-SNP analysis (note that γj differs from βj, because βj reflects how much SNP j contributes directly to the phenotype, whereas γj reflects how much contribution the SNP tags). Given XTY/n, the estimate of γ from all n samples, our aim is to generate XATYA/nA and XBTYB/nB, estimates of γ from nA and nB samples, respectively.

The method we use to generate XATYA/nA and XBTYB/nB is a modified version of that proposed by Zhao et al.52. First we sample XATYA/nA from N(XTY/n, (nB/nA) V/n), where V is the variance of XTY, then we set XBTYB/nB = (XTY − XATYA)/nB. In their method, Zhao et al. restrict to independent SNPs, leading them to derive V = I + XTYYTX/n2, where I is an identity matrix. However, as we wish to accommodate SNPs in linkage disequilibrium, we instead use V = XTX, as proposed by Zhu and Stephens53. If X’ denotes the (standardized) genotypes of the reference panel (size n’ x m), then an estimate of V is X’TX’ n/n’, and therefore we achieve the desired sampling by setting XATYA/nA = XTY/n + (nB/nA)1/2 X’T g / n’1/2, where g is a vector of length n’ with elements drawn from a standard Gaussian distribution.
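Two features of this construction can be shown in a few lines: noise with the linkage disequilibrium pattern of the reference panel is obtained by multiplying the transposed reference genotypes by a Gaussian vector, without forming an m x m matrix, and the second set of statistics is fixed by the first because the two cross-products must sum to the full one. In the sketch below the multiplier on the noise is left as an argument (`noise_scale`, a name introduced here) because its value depends on how V and the genotypes are scaled; the function is an illustration, not the MegaPRS code.

```python
import numpy as np


def pseudo_split(gamma_full, x_ref, n, n_a, noise_scale, rng):
    """Generate pseudo training and test estimates from full estimates.

    gamma_full  : X^T Y / n computed from all n samples (length m)
    x_ref       : standardized reference genotypes, shape (n_ref, m)
    noise_scale : multiplier applied to the correlated noise (assumed input)
    """
    gamma_full = np.asarray(gamma_full, dtype=float)
    n_b = n - n_a
    n_ref = x_ref.shape[0]
    g = rng.standard_normal(n_ref)
    # x_ref.T @ g / sqrt(n_ref) has covariance x_ref.T @ x_ref / n_ref.
    noise = x_ref.T @ g / np.sqrt(n_ref)
    gamma_a = gamma_full + noise_scale * noise
    # X^T Y = X_A^T Y_A + X_B^T Y_B fixes the second set of estimates.
    gamma_b = (n * gamma_full - n_a * gamma_a) / n_b
    return gamma_a, gamma_b
```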

As explained above, our primary use of pseudo summary statistics is to construct and test training prediction models, in order to decide parameters of the effect size prior distribution. Supplementary Fig. 13 investigates this use of pseudo summary statistics for the first 14 UK Biobank phenotypes and the eight additional diseases (those we used in Supplementary Fig. 7 and Supplementary Table 4). We see that, in general, the estimates of R for the training models (measured using pseudo test summary statistics) mirror the estimates of R for the corresponding full models (measured using the independent test data), indicating that it is valid to use pseudo summary statistics to decide prior distribution parameters. However, we note two caveats. Firstly, we observe that estimates of R can be unreliable when calculated using a reference panel that was also used to create the pseudo summary statistics and/or to construct the prediction models. Therefore, when running MegaPRS using pseudo partial summary statistics, we ensure that the reference panel used in Step 3 is distinct from the reference panel used in Steps 1 and 2. Secondly, we found that estimates of R can be unreliable when there are strong effect loci within regions of long-range linkage disequilibrium (e.g., this was an issue for rheumatoid arthritis, where a single SNP within the major histocompatibility complex explains 2% of phenotypic variation). Therefore, when estimating R using pseudo summary statistics, we recommend excluding regions of long-range linkage disequilibrium (a list of these is provided at www.ldak.org/high-ld-regions).

Data

When using our individual-level tools, we constructed prediction models for 14 phenotypes from UK Biobank21,22, for which we have access to phenotype and genotype data via Application 21432. These phenotypes are: body mass index (data field 21001), forced vital capacity (3062), height (50), impedance (23106), neuroticism score (20127), pulse rate (102), reaction time (20023), systolic blood pressure (4080), college education (6138), ever smoked (20160), hypertension (20002), snorer (1210), difficulty falling asleep (1200) and preference for evenings (1180). Starting with all 487k UK Biobank individuals, we first filtered based on ancestry (we only kept individuals who were both recorded and inferred through principal component analysis to be white British)17, then filtered so that no pair remained with allelic correlation >0.0325 (that expected for second cousins). Depending on phenotype, there were between 220,399 and 253,314 individuals (in total, 392,214 unique). From these, we picked 200,000 and 20,000 individuals to use for training and testing prediction models, respectively. For all analyses, we used adjusted phenotypes, obtained by regressing the original phenotypic values on 13 covariates (across all 220,000 training and test individuals). These covariates are age (data field 21022), sex (31), Townsend Deprivation Index (189) and ten principal components (five from the UK Biobank data, five derived from the 1000 Genomes Project54 data). Supplementary Table 1 reports the estimated proportion of phenotypic variation explained by cryptic relatedness (population structure and familial relatedness); across the 14 phenotypes, it is at most 0.001, and never significant (all P > 0.7 from a one-sided likelihood ratio test).
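Adjusting a phenotype for covariates amounts to taking residuals from a least-squares regression. The sketch below assumes that an intercept is included alongside the covariates; the function name is introduced here for illustration.

```python
import numpy as np


def adjust_phenotype(y, covariates):
    """Return residuals from regressing y on an intercept plus covariates.

    covariates : array of shape (n_samples, n_covariates), for example
                 age, sex, deprivation index and principal components.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef
```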

The UK Biobank provides imputed genotype data, but in general we restricted to the 628,694 autosomal SNPs with information score >0.9, MAF > 0.01 and present on the UK Biobank Axiom Array (the exception is for Supplementary Fig. 6, where we did not require that SNPs were present on the Axiom Array). We converted dosages to genotypes using a hard-call-threshold of 0.1 (i.e., dosages were rounded to the nearest integer, unless they were between 0.1 and 0.9 or between 1.1 and 1.9, in which case the corresponding genotype was considered missing). After this conversion, on average, 0.1% of genotypes were missing. Note that big_spLinReg does not allow missing values, so when using this tool, we used a hard-call-threshold of 0.5.
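The conversion rule can be written compactly: a dosage is rounded to the nearest integer unless it lies further than the threshold from that integer, in which case the genotype is set to missing. The sketch below uses NaN to represent a missing genotype, which is a choice made here for illustration.

```python
import numpy as np


def hard_call(dosages, threshold=0.1):
    """Convert dosages in [0, 2] to genotypes, with NaN for missing calls."""
    dosages = np.asarray(dosages, dtype=float)
    nearest = np.rint(dosages)
    uncertain = np.abs(dosages - nearest) > threshold
    return np.where(uncertain, np.nan, nearest)
```

With a threshold of 0.5 no dosage is further than the threshold from its nearest integer, so no genotypes are set to missing.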

When using our summary statistic tools, we constructed prediction models for 225 phenotypes from UK Biobank (Supplementary Data 1), using the August 2018 results from the Neale Lab. In total, the Neale Lab analyzed 4,203 UK Biobank phenotypes, using up to 361,194 British individuals. We downloaded results for the 283 phenotypes that were computed using both sexes and had estimated SNP heritability >0.05 (using details provided in the file ukb31063_h2_topline.02Oct2019.tsv.gz). For each phenotype, we began by generating pseudo training and test summary statistics corresponding to 90% and 10% of samples, respectively. We subsequently used the pseudo training summary statistics to construct prediction models, and the pseudo test summary statistics to measure their accuracy. Supplementary Fig. 2 confirms that it is possible to both construct and (fairly) measure the accuracy of different prediction models using a single set of summary statistics. Specifically, it shows that for the first 14 phenotypes, estimates of R2 are similar whether we use data from our own UK Biobank application (for which we have independent training and test data) or use summary statistics from the Neale Lab. Although we downloaded results for 283 phenotypes, in the main text, we restrict to the 225 phenotypes for which it was possible to generate a PRS with R2 > 0.01 (using any tool and any heritability model). We made this choice because it is difficult to reliably compare the performance of tools using PRS with very low, and often non-significant, R2. However, we note that for the 58 phenotypes we rejected, it remained that improving the heritability model increased R2 on average 93% of the time, and that the average improvement in R2 was 19% (s.d. 3%) when we switched from the best existing tool to LDAK-BayesR-SS assuming the BLD-LDAK Model.

When we required a reference panel, we used genotypes of 20,000 individuals from the UK Biobank (when requiring multiple reference panels, we ensured 20,000 different individuals were used for each). Note that when using individual-level data prediction tools, for which we use a reference panel to estimate E[h2j] given the heritability model, we always picked the 20,000 individuals from the 200,000 training samples, to ensure that there was no overlap between the data used to train and test prediction models.

For the analysis in Supplementary Fig. 7 and Supplementary Table 4, we used results from published studies to construct PRS for eight diseases: asthma29, atrial fibrillation28, breast cancer31, inflammatory bowel disease26, prostate cancer30, rheumatoid arthritis24, schizophrenia25 and type 2 diabetes27. We chose these diseases as they were the ones for which we could find both cases in the UK Biobank and publicly-available summary statistics from a genome-wide association study that did not use UK Biobank data. We excluded SNPs with ambiguous alleles (A&T or C&G) or that were not present in our UK Biobank dataset, after which on average 470,000 SNPs remained (range 191,000 to 559,000).

For the analysis in Supplementary Fig. 8 and Supplementary Table 5, we measured how well prediction models for the first 14 UK Biobank phenotypes (constructed using data from white British individuals) performed for individuals of non-European ancestry. For this, we used principal component analysis to identify 7,057, 2,717 and 1,331 individuals from the UK Biobank whose ancestries were consistent with individuals reported to be South Asian (Indian or Pakistani), African and East Asian (Chinese), respectively.

Sensitivity of MegaPRS to setting choices

In Supplementary Fig. 14, we test the impact on prediction accuracy of changing the definitions of local and significant when calculating SNP–SNP correlations, the window settings and convergence threshold used when estimating effect sizes, and the choice of reference panel. In general, the impact on accuracy is small. It is largest when we replace the UK Biobank reference panel (20,000 individuals) with genotypes of 489 European individuals from the 1000 Genomes Project54. In this case, average R2 reduces by ~3% (about two thirds of this is due to reducing the number of individuals, and one third due to replacing UK Biobank genotypes with 1000 Genomes Project genotypes).

Other prediction tools

Supplementary Fig. 1 compares the performance of our new tools with nine existing tools, using the first 14 UK Biobank phenotypes. Note that when using summary statistic tools, we use our own UK Biobank data, rather than summary statistics from the Neale Lab (so there was no need to use pseudo summary statistics). When running BLUP19, Bolt-LMM18, BayesR11, lassosum13, sBLUP20, LDpred-funct33, LDpred251, AnnoPred32 and SBayesR14, we used the default settings of each software (see Supplementary Note 1 for scripts). When it was necessary to select prior parameters (this was the case for lassosum, LDpred2 and AnnoPred, as well as for our four summary statistic tools), we used cross-validation. Similar to above, we constructed models corresponding to different parameter choices using 90% of the training samples (180,000 individuals), then tested these using the remaining 10% of the training samples (20,000 individuals). Having identified the best-performing parameters, we then used these to make the final model (using all training samples). Note that for sBLUP, we found that average R2 improved if we repeated the analyses excluding regions of long-range linkage disequilibrium14,33, while for AnnoPred, we found it was necessary to exclude SNPs from the major histocompatibility complex (otherwise, the software would fail to complete).

When constructing Classical PRS, we used estimates of βj from single-SNP analysis. We considered six p value thresholds (P ≤ 5e−8, P ≤ 0.0001, P ≤ 0.001, P ≤ 0.01, P ≤ 0.1, all SNPs) and four clumping thresholds (c2jk ≤ 0.2, c2jk ≤ 0.5, c2jk ≤ 0.8, and no clumping). We decided the most-suitable pair of thresholds via cross-validation. Similar to above, we first constructed 24 models (one for each pair of thresholds) using 90% of training samples, then tested these models using the remaining 10% of training samples. We then used the best-performing p value and clumping thresholds to construct the final model (using summary statistics from all training samples).
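A Classical PRS for one pair of thresholds can be sketched as follows. The greedy clumping used here (visit SNPs in order of increasing p value and keep a SNP only if its squared correlation with every SNP already kept is within the threshold) is a common approach and an assumption of this sketch, as are the function names.

```python
import numpy as np


def clump_and_threshold(pvalues, corr, p_threshold, c2_threshold=None):
    """Return indices of SNPs retained for one pair of thresholds.

    corr         : matrix of SNP-SNP correlations
    c2_threshold : maximum squared correlation allowed between kept SNPs,
                   or None for no clumping
    """
    pvalues = np.asarray(pvalues, dtype=float)
    candidates = [int(j) for j in np.argsort(pvalues)
                  if pvalues[j] <= p_threshold]
    if c2_threshold is None:
        return candidates
    kept = []
    for j in candidates:
        if all(corr[j, k] ** 2 <= c2_threshold for k in kept):
            kept.append(j)
    return kept


def classical_prs(genotypes, beta, kept):
    """Score individuals using single-SNP effect sizes of the kept SNPs."""
    return genotypes[:, kept] @ np.asarray(beta, dtype=float)[kept]
```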

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.