Abstract
Polygenic risk scores (PRS) have shown promise in predicting human complex traits and diseases. Here, we present PRS-CS, a polygenic prediction method that infers posterior effect sizes of single nucleotide polymorphisms (SNPs) using genome-wide association summary statistics and an external linkage disequilibrium (LD) reference panel. PRS-CS utilizes a high-dimensional Bayesian regression framework, and is distinct from previous work by placing a continuous shrinkage (CS) prior on SNP effect sizes, which is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six common complex diseases and six quantitative traits in the Partners HealthCare Biobank, and further demonstrate the improvement of PRS-CS in prediction accuracy over alternative methods.
Introduction
Polygenic risk scores (PRS), which summarize the effects of genome-wide genetic markers to measure the genetic liability to a trait or a disorder, have shown promise in predicting human complex traits and diseases, and may facilitate early detection, risk stratification, and prevention of common complex diseases in healthcare settings^{1,2}.
To maximize the translational potential of PRS, statistical and computational methods are needed that can (1) jointly model genetic markers across the genome to make full use of the available information while accounting for local linkage disequilibrium (LD) structures; (2) accommodate varying effect size distributions across complex traits and diseases, from highly polygenic genetic architectures (e.g., height and schizophrenia), to a mixture of small effect sizes and clusters of genetic loci that have moderate to larger magnitudes of effects (e.g., autoimmune diseases and Alzheimer’s disease); (3) produce prediction from summary statistics of genome-wide association studies (GWAS) without access to individual-level data; and (4) retain computational scalability.
To date, most applications calculate PRS from a subset of the genetic markers after pruning out single nucleotide polymorphisms (SNPs) in LD and applying a P-value threshold to GWAS summary statistics^{3}. Although this approach has advantages in terms of computational and conceptual simplicity, and has been used to predict genetic liability across a broad phenotypic spectrum, recent studies have shown that this conventional method for PRS construction discards information and limits prediction accuracy^{4}. More sophisticated Bayesian polygenic prediction methods that rely on GWAS summary statistics, including LDpred^{4} and the recently developed normal-mixture model^{5,6}, can incorporate genome-wide markers and accommodate varying genetic architectures, and thus have enhanced performance and flexibility. However, the type of prior on SNP effect sizes used in these methods, known as discrete mixture priors, imposes daunting computational challenges and may result in inaccurate adjustment for local LD patterns.
In this work, we present a polygenic prediction method, PRS-CS, which utilizes a Bayesian regression framework and places a conceptually different class of priors, the continuous shrinkage (CS) priors, on SNP effect sizes. Continuous shrinkage priors allow for marker-specific adaptive shrinkage (i.e., the amount of shrinkage applied to each genetic marker is adaptive to the strength of its association signal in GWAS), and thus can accommodate diverse underlying genetic architectures. In addition, continuous shrinkage priors enable conjugate block update of the SNP effect sizes in posterior inference (i.e., effect sizes for SNPs in each LD block are updated jointly, in a multivariate fashion, in contrast to updating the effect size for each marker separately and sequentially), and thus can accurately model local LD patterns and provide substantial computational improvements. Several special cases of continuous shrinkage priors have been applied to quantitative trait prediction or gene mapping^{7,8,9,10,11,12}. However, all previous work required individual-level data and was limited to small-scale analyses (both in terms of the sample size and the number of genetic markers). PRS-CS only requires GWAS summary statistics and an external LD reference panel, and therefore can be applied in a broader range of settings.
We conduct simulation studies using the UK Biobank genetic data^{13,14}, and demonstrate that PRS-CS dramatically improves the predictive performance of PRS over existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six curated common complex diseases (breast cancer (BRCA), coronary artery disease (CAD), depression (DEP), inflammatory bowel disease (IBD), rheumatoid arthritis (RA), and type 2 diabetes mellitus (T2DM)) and six quantitative traits (height, body mass index, high-density lipoproteins, low-density lipoproteins, cholesterol, and triglycerides) in the Partners HealthCare Biobank^{15}, and further demonstrate the potential of PRS-CS for the clinical translation of polygenic prediction.
Results
Conceptual frameworks
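We consider a Bayesian high-dimensional regression framework for polygenic modeling and prediction:
\[ \mathbf{y}_{N \times 1} = \mathbf{X}_{N \times M}\,\boldsymbol{\beta}_{M \times 1} + \boldsymbol{\varepsilon}_{N \times 1}, \tag{1} \]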
where N and M denote the sample size and number of genetic markers, respectively, y is a vector of traits, X is the genotype matrix, β is a vector of effect sizes for the genetic markers, and ε is a vector of residuals. By assigning appropriate priors on the regression coefficients β to impose regularization, additive PRS can be calculated using posterior mean effect sizes.
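As a minimal illustration of the last point, an additive PRS is the genotype matrix multiplied by the vector of posterior mean effect sizes. The sketch below uses NumPy; the function name `polygenic_score` and the toy values are hypothetical and are not part of the PRS-CS software.

```python
import numpy as np

def polygenic_score(genotypes, posterior_mean_effects):
    """Return additive scores: (N x M genotype matrix) @ (M effect sizes)."""
    genotypes = np.asarray(genotypes, dtype=float)
    effects = np.asarray(posterior_mean_effects, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != effects.shape[0]:
        raise ValueError("genotypes must be N x M and effects must have length M")
    return genotypes @ effects

# Toy example (hypothetical values): 2 individuals, 3 markers coded as allele counts.
toy_genotypes = np.array([[0, 1, 2],
                          [2, 0, 1]])
toy_effects = np.array([0.1, -0.2, 0.05])
toy_scores = polygenic_score(toy_genotypes, toy_effects)
```

In practice the same marker set and allele coding must be used for the effect sizes and the target genotypes.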
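Essentially all widely used prior densities for β can be represented as scale mixtures of normals:
\[ p(\beta_j) = \int N(\beta_j \mid 0, \Psi_j)\, dG(\Psi_j), \tag{2} \]
or equivalently, as the following hierarchical form:
\[ \beta_j \mid \Psi_j \sim N(0, \Psi_j), \quad \Psi_j \sim G, \tag{3} \]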
where N(μ, σ^{2}) is a normal distribution with mean μ and variance σ^{2}, and G is a mixing distribution. For example, if G places all its mass at a single point, i.e., \(G(\Psi_j) = \delta_{\sigma_\beta^2}\), where δ_{•} is the Dirac delta measure, then marginally \(\beta_j \sim N(0, \sigma_\beta^2)\), and we have recovered the infinitesimal model^{16}. To create a more flexible model of the genetic architecture, a discrete mixture of two or more point masses or densities can be used, which allows for a wider effect size distribution than a normal prior can produce. For example, \(G(\Psi_j) = (1 - \pi)\delta_0 + \pi\delta_{\tau^2}\), where π is the mixing probability (the fraction of causal variants), produces the point-normal prior on effect sizes, β_{j} ~ (1−π)δ_{0} + πN(0, τ^{2}), which was used in LDpred^{4}. Although discrete mixture priors offer a natural and intuitive approach to model non-infinitesimal genetic architectures, posterior inference requires a stochastic search over an exponentially large discrete model space, and does not allow for multivariate block update of effect sizes, which limits computational efficiency and may result in inaccurate modeling of local LD patterns.
In this work, we investigate a conceptually different class of priors: the continuous shrinkage priors. In particular, we consider the following prior on SNP effect sizes, which can be represented as global-local scale mixtures of normals:
\[ \beta_j \sim N(0, \phi\psi_j), \quad \psi_j \sim g, \tag{4} \]
where ϕ is a global scaling parameter that is shared across genetic markers and controls the degree of sparseness of the model, and g is an absolutely continuous density function, in contrast to a discrete mixture of atoms or densities. By appropriately choosing the continuous mixing density g, this modeling framework can produce a variety of shapes of the prior distribution on β_{j}. In particular, g can be designed to induce a prior distribution on the SNP effect sizes that has a sizable amount of mass near zero to impose strong shrinkage on noise, while at the same time having heavy tails to avoid over-shrinkage of truly non-zero effects. The marker-specific local shrinkage parameter ψ_{j} can then adaptively squelch small noisy estimates towards zero, while leaving data-supported large signals unshrunk. In this work, we investigate a specific g (known as the Strawderman-Berger prior^{17,18}; see Methods section), and present two versions of the algorithm, which differ in how the global scaling parameter ϕ is learnt. In PRS-CS, we search a small number of fixed ϕ values, select the value that produces the best predictive performance in a validation data set, and evaluate the algorithm in an independent testing set. In the second version of the algorithm, which we call PRS-CS-auto, we use a fully Bayesian approach and place a standard half-Cauchy prior on the global shrinkage parameter^{19,20}: ϕ^{1/2} ~ C^{+}(0, 1), such that ϕ is automatically learnt from data and no validation data set is needed.
Individual-level Bayesian regression models of the form in Eq. (1) with a prior on SNP effect sizes can often be approximated using an external LD reference panel and turned into summary-statistics-based methods^{4,6,21,22}. Here we enable posterior inference of SNP effect sizes from GWAS summary statistics under continuous shrinkage priors using an efficient Gibbs sampler with multivariate block update of the effect sizes (see Methods section).
Overview of polygenic prediction methods
We compare PRS-CS and PRS-CS-auto with four polygenic prediction methods that rely on GWAS summary statistics in both simulations and real data analyses: polygenic scoring based on all genetic markers (unadjusted PRS), informed LD-pruning (also known as LD-clumping) and P-value thresholding (P+T), LDpred and LDpred-inf^{4}. Throughout the paper, we use the 1000 Genomes Project (1KG) European sample (N = 503) as the external LD reference panel, but also assess the impact of using an in-sample LD reference panel on prediction accuracy in Supplementary Information.
Simulations
We first compared the predictive performance of six polygenic prediction methods across different genetic architectures and training sample sizes (i.e., GWAS sample sizes) in simulation studies (Fig. 1 and Supplementary Table 1). SNP effect sizes were simulated using (1) a point-normal model with different numbers of causal variants, and (2) a normal mixture model, as described in the Methods section. Tuning parameters (P-value threshold in P+T, fraction of causal SNPs in LDpred, and global shrinkage parameter in PRS-CS) were selected in a validation data set (N = 3000). Prediction accuracy for all methods was quantified by R^{2} between the observed and predicted traits in an independent testing set (N = 3000).
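The exact simulation design is given in the Methods section and is not reproduced here. As a rough sketch only, point-normal effect sizes could be drawn as follows, assuming causal variants are picked uniformly at random and each receives variance h²/(number of causal variants) under standardized genotypes. The function name, the marker count, and the heritability value are our assumptions.

```python
import numpy as np

def simulate_point_normal_effects(n_markers, n_causal, heritability, rng):
    """Draw effects that are zero except at n_causal randomly chosen markers."""
    if not 0 < n_causal <= n_markers:
        raise ValueError("n_causal must be between 1 and n_markers")
    effects = np.zeros(n_markers)
    causal_idx = rng.choice(n_markers, size=n_causal, replace=False)
    # Assumption: total heritability is split evenly (in expectation) across causal markers.
    per_variant_sd = np.sqrt(heritability / n_causal)
    effects[causal_idx] = rng.normal(0.0, per_variant_sd, size=n_causal)
    return effects

rng = np.random.default_rng(0)
sim_effects = simulate_point_normal_effects(10_000, 100, 0.5, rng)
```

With standardized, unlinked genotypes the sum of squared effects approximates the simulated heritability; LD between causal markers changes the realized value.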
Figure 1 shows that polygenic prediction methods that do not account for non-infinitesimal genetic architectures (unadjusted PRS and LDpred-inf) performed poorly when the number of causal variants is small, but became more comparable to other methods when the genetic architectures are highly polygenic. For all the methods, the prediction accuracy decreased as the number of causal variants increases with fixed heritability, because as more causal SNPs are in LD (as a result of more causal SNPs being randomly sampled across the genome) and their effect sizes decline, it becomes increasingly difficult to distinguish real signals from noise. Overall, methods that account for local LD patterns (LDpred, PRS-CS, and PRS-CS-auto) outperformed P+T, which discards LD information. However, one unexpected observation is that, when the genetic architecture is sparse, the prediction accuracy of LDpred decreased dramatically as the training sample size grows. This is likely because when the number of causal variants is small and the training sample size is large, all markers in LD with the causal variant become highly statistically significant in association tests, and LDpred does not accurately adjust for the LD structure, resulting in a decrease in predictive performance. In contrast, PRS-CS and PRS-CS-auto were minimally affected in the combination of sparse genetic architectures and large training sample sizes, which demonstrates the advantage of multivariate modeling and block update of the effect sizes for genetic markers in LD. In a few scenarios where the training sample size is small, PRS-CS produced lower prediction accuracy than LDpred, but it outperformed LDpred as the sample size grows across all genetic architectures. PRS-CS-auto did not perform well when the training sample size is small and the genetic architecture is sparse (e.g., in the case of 100 causal variants and 10,000 training samples), but approached the performance of PRS-CS as the sample size increases.
In addition to prediction accuracy, we assessed the calibration of polygenic prediction methods by regressing the true phenotype onto the PRS predictor and inspecting the regression slope. A slope close to one indicates that a predictor is correctly calibrated. Consistent with predictive performance, as the training sample size grows, our Bayesian approach provides the best calibration among all methods examined (Supplementary Table 7). PRS-CS-auto is particularly well calibrated for large training sample sizes, because it automatically learns the sparseness of the genetic architecture from data and adjusts for the LD structure accordingly.
Secondary simulation studies using (1) the point-normal model with different total heritability (0.2 and 0.8); (2) a point-t model with different numbers of causal variants; and (3) a point-gamma model with different numbers of causal variants produced similar patterns of prediction accuracy (Supplementary Figs. 1–4; Supplementary Tables 2–5) and calibration properties (Supplementary Tables 8–11). Using the combined UK Biobank validation and testing data sets (N = 6000) as an in-sample LD reference panel in the point-normal simulations produced, in general, slightly higher prediction accuracy for methods making use of LD information (Supplementary Fig. 5; Supplementary Tables 6 and 12), suggesting that using a larger reference panel that better aligns with the LD structure of the target sample may increase predictive performance. However, as the improvement was marginal, it appears that the performance of PRS-CS(-auto) is not particularly sensitive to the LD reference panel, and 1KG can serve as a valid reference despite its relatively small sample size.
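The calibration check described above amounts to a simple linear regression of the phenotype on the PRS. A minimal sketch, assuming no covariates are included (the function name and toy data are hypothetical):

```python
import numpy as np

def calibration_slope(phenotype, prs):
    """Ordinary least squares slope from regressing phenotype on PRS (with intercept)."""
    phenotype = np.asarray(phenotype, dtype=float)
    prs = np.asarray(prs, dtype=float)
    prs_centered = prs - prs.mean()
    denom = prs_centered @ prs_centered
    if denom == 0:
        raise ValueError("PRS has zero variance")
    return float(prs_centered @ (phenotype - phenotype.mean()) / denom)

# Toy data (hypothetical): the phenotype equals the true signal plus a constant.
true_signal = np.array([-1.0, 0.0, 1.0, 2.0])
toy_phenotype = 3.0 + true_signal
slope_calibrated = calibration_slope(toy_phenotype, true_signal)       # predictor on the right scale
slope_inflated = calibration_slope(toy_phenotype, 2.0 * true_signal)   # predictor twice too large
```

A predictor whose effects are too large by a factor of two yields a slope of one half, which is the sense in which slopes below one indicate insufficient shrinkage.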
Polygenic prediction in the Partners Biobank
We applied PRS-CS, PRS-CS-auto, and alternative methods to predict six curated common complex diseases (breast cancer, coronary artery disease, depression, inflammatory bowel disease, rheumatoid arthritis, and type 2 diabetes mellitus), and six quantitative traits (height, body mass index, high-density lipoproteins, low-density lipoproteins, cholesterol, and triglycerides) in the Partners HealthCare Biobank. Large-scale GWAS summary statistics for each disease and trait were downloaded from public domains (Table 1 and Supplementary Data 1). SNP heritability for each disease (both on the observed scale and the liability scale) and trait, estimated using GWAS summary statistics and LD score regression^{23}, is presented in Supplementary Table 13.
Predictive performance measured by Nagelkerke’s R^{2} (for disease phenotypes) and R^{2} (for quantitative traits) is summarized in Fig. 2. Additional prediction accuracy metrics, including area under the receiver operating characteristic (ROC) curve (known as AUC), area under the precision-recall curve, and the odds ratio (OR) comparing the top 10% of the participants having high polygenic risk with the remaining 90% of the sample, produced similar results in terms of the ranked performance of polygenic prediction methods and are reported in Supplementary Data 2.
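To make the odds ratio metric concrete, the sketch below computes an unadjusted OR from a 2 × 2 table that contrasts the top PRS decile with the remaining 90%. It is an illustration only: the analysis reported here may have adjusted for covariates, and the function name and toy data are hypothetical.

```python
import numpy as np

def top_decile_odds_ratio(prs, is_case):
    """Unadjusted odds ratio: top 10% of PRS versus the remaining 90%."""
    prs = np.asarray(prs, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    high = prs >= np.quantile(prs, 0.9)
    cases_high = int(np.sum(is_case & high))
    controls_high = int(np.sum(~is_case & high))
    cases_rest = int(np.sum(is_case & ~high))
    controls_rest = int(np.sum(~is_case & ~high))
    if controls_high == 0 or cases_rest == 0:
        raise ValueError("odds ratio undefined: empty cell in the 2 x 2 table")
    return (cases_high * controls_rest) / (controls_high * cases_rest)

# Toy data (hypothetical): 100 participants, 5 of 10 cases in the top decile,
# 9 of 90 cases in the rest.
toy_prs = np.arange(100)
toy_case = np.zeros(100, dtype=bool)
toy_case[90:95] = True
toy_case[0:9] = True
toy_or = top_decile_odds_ratio(toy_prs, toy_case)
```

Here the odds are 5/5 in the top decile and 9/81 in the rest, giving an OR of 9.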
Consistent with previous work, unadjusted PRS performed poorly regardless of the genetic architecture, and LDpred showed an overall improvement over P+T. Among the six curated disease phenotypes, PRS-CS produced substantially better predictions for breast cancer (41.85% relative increase in Nagelkerke’s R^{2} compared to LDpred) and rheumatoid arthritis (28.62% relative increase in Nagelkerke’s R^{2} compared to LDpred). For coronary artery disease, depression and type 2 diabetes mellitus, LDpred and PRS-CS had similar predictive performance, and both performed dramatically better than P+T. PRS-CS was only inferior to LDpred in the prediction of inflammatory bowel disease (10.24% relative decrease in Nagelkerke’s R^{2}). However, we note that inflammatory bowel disease has the smallest training sample size among all diseases and traits (Table 1). The lower prediction accuracy of PRS-CS for this disease is thus consistent with our simulation studies, where we observed that when the training sample size is limited, LDpred can outperform PRS-CS. PRS-CS-auto produced lower prediction accuracy than LDpred except for breast cancer, indicating that the current GWAS sample sizes for most diseases may not be large enough to accurately learn the global shrinkage parameter from GWAS summary statistics.
For the six quantitative traits, both PRS-CS and PRS-CS-auto consistently outperformed all alternative methods examined. The relative improvement in prediction accuracy for PRS-CS compared to LDpred ranged from 8.01% for LDL and 8.75% for BMI, to 27.75% for height and 32.05% for cholesterol, with an average improvement of 18.17%. The average improvement of PRS-CS-auto relative to LDpred across the six quantitative traits was 11.41%. The average improvements of PRS-CS and PRS-CS-auto relative to P+T were 48.16% and 38.62%, respectively. We note that LDpred was the best method after PRS-CS and PRS-CS-auto for all quantitative traits except height, for which its prediction accuracy was lower than LDpred-inf and P+T. This is theoretically expected and consistent with a recent study, which also observed that for highly polygenic traits, LDpred-inf often outperforms LDpred^{24}.
Overall, using the Partners HealthCare Biobank data as an in-sample LD reference (N = 19,136) instead of the 1KG reference panel slightly increased the prediction accuracy, but the improvement was marginal (Supplementary Fig. 6 and Supplementary Data 3).
Discussion
Polygenic prediction, which exploits genome-wide genetic markers to estimate the genetic liability to a complex human disease or trait, is likely to become useful in clinical care and contribute to personalized medicine. As a high-dimensional regression problem that requires regularization, a majority of the existing methods that jointly model genetic markers across the genome employ Bayesian approaches and assign a discrete mixture prior on SNP effect sizes. Although intuitively appealing, this class of priors generates daunting computational challenges: the model space grows exponentially with the number of markers, which is difficult to fully explore, and more importantly, discrete mixture priors do not allow for block update of effect sizes and thus hinder accurate LD adjustment in polygenic prediction. LDpred^{4} partially addressed this issue by making several simplifying assumptions to the posterior distribution and using marginal posterior without LD to approximate the true posterior. However, our simulation studies suggest that this approximation may be inaccurate.
We have presented a conceptually different class of priors, the continuous shrinkage priors, which can be represented as global-local scale mixtures of normals, for polygenic modeling. By using a continuous mixing density on the scales of the marker effects, continuous shrinkage priors enable a simple and efficient Gibbs sampler with multivariate block update of the effect sizes, and thus resolve a major technical hurdle of discrete mixture priors. A second feature of the continuous shrinkage prior is its ability to shrink adaptively. By constructing a prior density on SNP effect sizes that is both peaked at zero and heavy-tailed, the method imposes strong shrinkage on small effects that are likely to be noise, while applying practically no shrinkage to data-supported truly non-zero signals. Simulated and real data analyses showed that PRS-CS consistently outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We note that previous work often extrapolated prediction accuracy for larger effective sample sizes by restricting the analysis to a subset of the genetic markers^{4,24}. However, our simulations suggest that this approach may not fully capture the behavior of a polygenic prediction algorithm when the training sample size grows, and underscore the need for actually scaling up the sample size in future studies.
PRS-CS has a tuning parameter, i.e., the global shrinkage parameter ϕ, which needs to be fixed based on prior beliefs about the sparseness of the genetic architecture, or selected by testing a small number of values. If a grid search is used, like other polygenic prediction methods that have tuning parameters such as P+T and LDpred, the optimal value of ϕ should be selected using a validation data set that is independent of the testing set where predictive performance is assessed to avoid overfitting. In this work, we also presented PRS-CS-auto, a fully Bayesian approach that enables automatic learning of ϕ from GWAS summary statistics. Although analyses in the Partners Biobank indicate that, for many disease phenotypes, the current GWAS sample sizes may not be large enough to accurately learn ϕ and the prediction accuracy of PRS-CS-auto may be lower than PRS-CS and LDpred, simulation studies and quantitative trait analyses suggest that PRS-CS-auto can be useful when the training sample size is large or when an independent validation set is difficult to acquire.
Although continuous shrinkage priors enable multivariate modeling of the LD structure, simultaneous updating of the effect sizes for genome-wide markers remains computationally infeasible. In this work, we used a genome partition computed and validated by prior work^{25}, which divides the genome into 1703 largely independent genomic regions, and has been successfully used in local heritability and genetic correlation analyses^{26,27}. Block update of posterior SNP effect sizes can thus be performed within each LD block, assuming no LD between blocks. Using a sliding window approach as implemented in LDpred^{4} may capture LD across blocks more accurately, but is more memory intensive and computationally expensive. By restricting the analysis to HapMap3 variants, the partition we employed gives a moderate number of SNPs within each block (on average ~500 SNPs per block), and the Bayesian computation with 1000 MCMC iterations on the longest chromosome can be completed within an hour using one Intel(R) Xeon(R) CPU core and 2 GB of memory. Expanding the size of LD blocks may improve prediction accuracy but increases computational cost (as each MCMC iteration requires inverting an L × L matrix where L is the block size), while reducing the size of LD blocks has the potential risk of missing long-range LD. Therefore, the partition we chose represents a balance between modeling accuracy and computational burden. Including multi-million SNP predictors may increase prediction accuracy^{28} but requires further work.
We note that the prior we investigated in this work, i.e., the Strawderman-Berger prior on the local marker-specific shrinkage parameter, is only one of the possible choices within the class of continuous shrinkage priors, which includes the normal-gamma prior^{29,30}, the normal-inverse-Gaussian prior^{29}, the generalized t (generalized double Pareto) prior^{31,32}, and the normal-exponential-gamma prior^{33,34}, among others. In addition, most frequentist regularization procedures, such as LASSO, elastic net and bridge regression, have a Bayesian counterpart that can be represented as global-local scale mixture priors in combination with posterior mode inferences. Each of these priors uses a different continuous mixing density to produce a different marginal prior on the SNP effect sizes. These alternatives may perform equally well or better than the Strawderman-Berger prior for certain genetic architectures. However, we found that as long as the prior on the effect sizes places a sizable amount of mass around zero and has heavier-than-exponential tails, variation in the shape of the prior does not seem to have a large impact on prediction accuracy. Therefore, we believe that the primary gain of PRS-CS over existing methods lies in its more accurate multivariate modeling of local LD patterns and its block-updated Gibbs sampling that can improve the mixing and convergence rate of the Markov chain. We thus recommend using the Strawderman-Berger prior as a default choice. A systematic investigation and comparison of different continuous shrinkage priors is a direction of future work.
We note several additional directions for further technical developments that may be useful. First, although this paper is focused on polygenic prediction methods that only require GWAS summary statistics, PRS-CS and PRS-CS-auto can be straightforwardly applied to individual-level data. Given that a majority of the existing Bayesian genomic prediction models, including Bayes alphabetic methods^{10,35,36,37,38,39,40}, BayesR^{41,42}, BVSR^{43}, BSLMM^{44}, and DPR^{45}, have used discrete mixture priors on SNP effect sizes, we expect that PRS-CS can provide substantial improvements in computational efficiency and prediction accuracy for genomic prediction that leverages individual-level data. Second, jointly modeling multiple genetically correlated traits and including functional annotations in polygenic modeling are expected to increase the predictive performance of PRS, as shown by recent studies^{24,46,47}. Lastly, current research on polygenic prediction has largely been restricted to European samples. Heterogeneity between the GWAS, LD reference and testing samples may reduce prediction accuracy, as recently demonstrated in genetic correlation analysis and fine-mapping^{48,49}. Expanding genomic prediction methods to handle unknown ancestry of the target sample (e.g., applications in forensic science) and enable trans-ethnic risk prediction is critical to maximize the value of PRS in a diverse population.
Although PRS-CS provides a substantial improvement over existing methods for polygenic prediction, current prediction accuracy of PRS is still lower than what can be considered clinically useful, and much work is needed to further improve the predictive performance and translational value of PRS. In theory, the utility of PRS depends on multiple factors, including the GWAS sample size, and the heritability and genetic architecture of the disease. For example, among the six complex diseases we analyzed, depression had the lowest prediction accuracy (Nagelkerke’s R^{2} less than 1%), likely due to a combination of its relatively low heritability, extremely polygenic genetic architecture, and the heterogeneous nature of the disorder. A recent study projected that a GWAS with multi-million subjects is needed to identify genetic variants that explain 80% of the SNP heritability for major depressive disorder^{5}. In contrast, it may be easier to produce a clinically useful prediction for some autoimmune diseases or late-onset chronic diseases (e.g., coronary artery disease and type 2 diabetes), due to the existence of SNPs with moderate to larger effect sizes. That said, as the GWAS sample size continues to grow, we believe that the predictive value of PRS will keep increasing, and PRS-CS(-auto) will demonstrate bigger advantages over existing methods with larger training sample sizes.
Methods
PRS-CS and PRS-CS-auto
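We consider the following phenotype model:
\[ \mathbf{y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathrm{MVN}(\mathbf{0}, \sigma^2\mathbf{I}), \quad p(\sigma^2) \propto \sigma^{-2}, \tag{5} \]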
where y is a vector of standardized phenotypes from N individuals, Z is an N × M matrix of standardized genotypes (each column is mean centered and has unit variance), β is a vector of effect sizes, ε is a vector of independent environmental effects, and we have assigned a non-informative scale-invariant Jeffreys prior on the residual variance σ^{2}. In contrast to discrete mixture priors, we consider a conceptually different class of priors:
\[ \beta_j \sim N\!\left(0, \frac{\sigma^2}{N}\phi\psi_j\right), \quad \psi_j \sim g, \tag{6} \]
where the variance of β_{j} scales with the residual variance and the sample size, ϕ is a global scaling parameter that is shared across all effect sizes, ψ_{j} is a local, marker-specific parameter, and g is an absolutely continuous mixing density function. This type of prior is known as global-local scale mixtures of normals.
We first note that, given variance parameters σ^{2}, ϕ and ψ_{j}, j = 1, 2, …, M, and the marginal least squares effect size estimates of the regression coefficients \(\hat{\boldsymbol{\beta}} = \mathbf{Z}^{\mathrm{T}}\mathbf{y}/N\), the posterior mean of β is
\[ \mathrm{E}[\boldsymbol{\beta} \mid \hat{\boldsymbol{\beta}}] = (\mathbf{D} + \mathbf{T}^{-1})^{-1}\hat{\boldsymbol{\beta}}, \tag{7} \]
where T = diag{ϕψ_{1}, ϕψ_{2}, …, ϕψ_{M}} is a diagonal matrix, and D = Z^{T}Z/N is the LD matrix. It can be seen that the posterior mean is a matrix shrinkage version of the least squares estimate. In the degenerate special case where ψ_{j} ≡ 1, the model becomes ridge regression and all effect sizes are shrunk towards zero at the same constant rate controlled by the overall shrinkage parameter ϕ. The introduction of the local shrinkage parameter ψ_{j} thus allows heterogeneity in the scales of effect sizes.
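To provide further intuition, assuming that all genetic markers are unlinked (i.e., no LD), we have D = I and thus
\[ \mathrm{E}[\beta_j \mid \hat{\beta}_j] = \left(1 - \frac{1}{1 + \phi\psi_j}\right)\hat{\beta}_j = (1 - \tau_j)\hat{\beta}_j, \tag{8} \]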
where τ_{j} = 1/(1 + ϕψ_{j}) is the shrinkage factor for the jth marker, which depends on both ϕ and ψ_{j}, and describes the amount of shrinkage from the marginal least squares solution towards zero; τ_{j} = 0 indicates no shrinkage while τ_{j} = 1 yields total shrinkage. Therefore, ϕ controls the overall sparsity level of the model and plays a similar role as the regularization parameter in penalized regression, while ψ_{j} adaptively modifies the amount of shrinkage for each marker. By assigning a prior on ψ_{j} that produces a marginal prior density on β_{j} with both a sharp peak at zero and heavy tails, the model can pull small effects towards zero, while exerting little influence on larger effects.
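The matrix shrinkage can be checked numerically. The sketch below assumes the posterior mean takes the conjugate form \((\mathbf{D} + \mathbf{T}^{-1})^{-1}\hat{\boldsymbol{\beta}}\) implied by the model above, and verifies that with D = I it reduces to the per-marker factor 1 − τ_{j}. Function names and toy values are hypothetical.

```python
import numpy as np

def posterior_mean_effects(beta_hat, ld_matrix, phi, psi):
    """Solve (D + T^{-1}) x = beta_hat, with T = diag(phi * psi_j)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    psi = np.asarray(psi, dtype=float)
    t_inverse = np.diag(1.0 / (phi * psi))
    return np.linalg.solve(np.asarray(ld_matrix, dtype=float) + t_inverse, beta_hat)

def shrinkage_factor(phi, psi):
    """tau_j = 1 / (1 + phi * psi_j); 0 means no shrinkage, 1 means total shrinkage."""
    return 1.0 / (1.0 + phi * np.asarray(psi, dtype=float))

# Toy values (hypothetical): one strong signal with a large local scale, two weak ones.
toy_beta_hat = np.array([0.30, 0.02, -0.01])
toy_psi = np.array([50.0, 0.1, 0.1])
toy_phi = 1.0

no_ld = posterior_mean_effects(toy_beta_hat, np.eye(3), toy_phi, toy_psi)
per_marker = (1.0 - shrinkage_factor(toy_phi, toy_psi)) * toy_beta_hat
```

The first marker keeps about 98% of its marginal estimate, while the other two are shrunk by roughly 91%, which illustrates the adaptive behavior driven by ψ_{j}.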
In this work, we investigate a specific continuous shrinkage prior. We assign an independent gamma-gamma prior on the local shrinkage parameter ψ_{j}:
\[ \psi_j \sim G(a, \delta_j), \quad \delta_j \sim G(b, 1), \tag{9} \]
where G(α, β) denotes the gamma distribution with shape parameter α and rate parameter β (the rate parameterization is what makes the equivalence below hold). By using change of variables, it can be verified that placing a gamma-gamma prior on ψ_{j} is equivalent to placing a three-parameter beta (TPB) prior on the shrinkage factor τ_{j}^{33}:
\[ \tau_j \sim \mathrm{TPB}(a, b, \phi), \tag{10} \]
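where the TPB distribution has the following density function:
\[ f(x; a, b, \phi) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\phi^{b}\,x^{b-1}(1 - x)^{a-1}\left[1 + (\phi - 1)x\right]^{-(a+b)}, \tag{11} \]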
with 0 < x < 1, a > 0, b > 0 and ϕ > 0. When ϕ = 1, the TPB distribution becomes a standard Beta distribution. For a fixed value of ϕ, a controls the behavior of the TPB prior near one, and thus the behavior of the prior on β_{j} around zero; b controls the behavior of the TPB prior near zero, and thus affects the tails of the prior on β_{j}. Figure 3 shows the prior densities on τ_{j} (upper panel) and β_{j} (middle and lower panels) with ϕ = 1, b = 1/2, and three different values of a: a = 1/2, a = 1 and a = 3/2. It can be seen that when a = 1/2 and b = 1/2, the TPB prior has substantial mass near zero and one (Fig. 3, upper panel), and thus the corresponding prior density on β_{j} has a very sharp peak around the origin, with zero being a pole (singular point; Fig. 3, middle panel), along with heavy, Cauchy-like tails (Fig. 3, lower panel). This prior is known as the horseshoe prior^{50}, due to the horseshoe-shaped prior density on the shrinkage factor τ_{j}. As a increases, the prior on β_{j} becomes less peaked at zero but the tails remain heavy. Finally, for fixed a and b, decreasing the global shrinkage parameter ϕ shifts the TPB prior from left to right, which imposes stronger shrinkage on the regression coefficients β_{j}.
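The equivalence between the gamma-gamma prior and the TPB prior can be sanity-checked by simulation. The sketch below assumes the TPB density \(f(x; a, b, \phi) \propto \phi^{b} x^{b-1}(1-x)^{a-1}[1 + (\phi-1)x]^{-(a+b)}\) and treats the second argument of the gamma distributions as a rate; both are our reading of the text rather than a quotation of the authors' code. With ϕ = 1, a = 1 and b = 1/2, τ_{j} should follow Beta(1/2, 1), whose mean is 1/3.

```python
import math
import numpy as np

def tpb_density(x, a, b, phi):
    """Three-parameter beta density on (0, 1), under the form assumed above."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return (norm * phi ** b * x ** (b - 1.0) * (1.0 - x) ** (a - 1.0)
            * (1.0 + (phi - 1.0) * x) ** (-(a + b)))

def sample_shrinkage_factors(a, b, phi, size, rng):
    """Draw tau_j = 1 / (1 + phi * psi_j) with psi_j from the gamma-gamma prior."""
    delta = rng.gamma(shape=b, scale=1.0, size=size)
    # NumPy parameterizes the gamma by scale, so a rate of delta is a scale of 1 / delta.
    psi = rng.gamma(shape=a, scale=1.0 / delta)
    return 1.0 / (1.0 + phi * psi)

rng = np.random.default_rng(1)
tau = sample_shrinkage_factors(a=1.0, b=0.5, phi=1.0, size=200_000, rng=rng)
density_at_quarter = tpb_density(0.25, a=1.0, b=0.5, phi=1.0)
```

The simulated mean of `tau` should be close to 1/3, and the density at x = 0.25 matches the Beta(1/2, 1) value of 1.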
For all continuous shrinkage priors that take the general form in Eq. (6), Gibbs samplers with block update of the regression coefficients β (i.e., SNP effect sizes) can be easily derived. By using LD information from an external reference panel, the method can be applied to GWAS summary statistics and does not require individual-level data. We describe the Gibbs sampler in Supplementary Note. In this study, we focus on a specific set of parameter values of the gamma-gamma prior on ψ_{j} (or equivalently, the TPB prior on τ_{j}): a = 1 and b = 1/2. This particular specification is known as the Strawderman-Berger prior^{17,18} or the quasi-Cauchy prior^{51}, and appears to work well across a range of simulated and real genetic architectures.
In practice, we partition the genome into 1703 largely independent genomic regions estimated using data from the 1KG European sample^{25,26,27} [http://bitbucket.org/nygcresearch/ldetect-data], and conduct a multivariate update of the effect sizes within each LD block (see Supplementary Note). To avoid numerical issues caused by collinearity between SNPs, we set a lower bound on the amount of regularization applied to the genetic markers (i.e., restricting \(\phi^{-1}\psi_j^{-1} \ge \rho\), where ρ is a small constant). We use ρ = 1 throughout this paper.
We find that the predictive performance of the model is not sensitive to the global shrinkage parameter ϕ, and setting ϕ^{1/2} roughly to the proportion of causal variants^{52} works well. If a prior guess of the sparseness of the genetic architecture is not available, we provide two ways to learn ϕ. In PRS-CS, we search a small number of ϕ values: ϕ^{1/2} ∈ {0.0001, 0.001, 0.01, 0.1, 1}, and select the ϕ that produces the best predictive performance in a validation data set, which is independent of the testing set where prediction accuracy of the algorithm is evaluated. In PRS-CS-auto, we use a fully Bayesian approach and assign a standard half-Cauchy prior on ϕ^{1/2}^{19,20}, such that ϕ is automatically learnt from GWAS summary statistics and no validation data set is needed. See Supplementary Note for the Gibbs updates of ϕ.
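The grid search over ϕ can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `scores_for_phi` is a hypothetical callable standing in for "fit the model with global shrinkage ϕ and score the validation individuals", and the toy scoring function exists only to make the example executable.

```python
import math

# Grid on phi^{1/2} as listed in the text.
PHI_SQRT_GRID = (1e-4, 1e-3, 1e-2, 1e-1, 1.0)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_y = sum(observed) / n
    mean_p = sum(predicted) / n
    s_py = sum((p - mean_p) * (y - mean_y) for p, y in zip(predicted, observed))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_yy = sum((y - mean_y) ** 2 for y in observed)
    if s_pp == 0.0 or s_yy == 0.0:
        return 0.0  # a constant predictor or phenotype carries no information
    return s_py ** 2 / (s_pp * s_yy)


def select_phi(scores_for_phi, y_valid, phi_sqrt_grid=PHI_SQRT_GRID):
    """Return (phi, R^2) for the grid value with the highest validation R^2."""
    best_phi, best_r2 = None, float("-inf")
    for phi_sqrt in phi_sqrt_grid:
        phi = phi_sqrt ** 2
        r2 = r_squared(y_valid, scores_for_phi(phi))
        if r2 > best_r2:
            best_phi, best_r2 = phi, r2
    return best_phi, best_r2


# Toy stand-in (hypothetical): predictions get noisier as phi moves away from 1e-4.
y_valid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def toy_scores(phi):
    noise = abs(math.log10(phi) + 4.0)
    return [y + noise * (-1) ** i for i, y in enumerate(y_valid)]


best_phi, best_r2 = select_phi(toy_scores, y_valid)
```

The selected ϕ should then be evaluated on a testing set that shares no individuals with the validation set, as described above.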
For both PRS-CS and PRS-CS-auto, the Gibbs sampler usually attains reasonable convergence after 1000 Markov Chain Monte Carlo (MCMC) iterations and produces prediction accuracy close to what can be achieved by much longer MCMC runs. We thus use 1000 MCMC iterations with the first 500 steps as burn-in in simulation studies to reduce computational cost. In practice, we recommend using longer MCMC runs when time and computational resources permit. In the Partners HealthCare Biobank analysis, we report the predictive performance of PRS-CS and PRS-CS-auto based on 10,000 MCMC iterations in total and 5000 burn-in steps.
Unadjusted PRS
The unadjusted PRS is the sum of all genetic markers across the genome, weighted by their marginal effect size estimates. More specifically, the unadjusted polygenic score for the ith individual is \({\mathrm{PRS}}_i = \mathop {\sum}\nolimits_{j = 1}^M X_{ij}\hat b_j\), where M is the total number of genetic markers, X_{ij} is the genotype for the ith individual and the jth SNP, and \(\hat b_j\) is the estimated marginal per-allele effect size of the jth SNP.
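The weighted sum above can be written in a few lines of Python. This is an illustrative sketch only; the function name and the toy genotypes and effect sizes are made up for the example.

```python
def unadjusted_prs(genotypes, marginal_effects):
    """Compute PRS_i = sum_j X_ij * b_hat_j for every individual.

    genotypes: one row per individual, one allele count (0, 1 or 2) per SNP.
    marginal_effects: estimated marginal per-allele effect size of each SNP.
    """
    scores = []
    for row in genotypes:
        if len(row) != len(marginal_effects):
            raise ValueError("each genotype row needs one entry per SNP")
        scores.append(sum(x * b for x, b in zip(row, marginal_effects)))
    return scores


# Toy example: two individuals, three SNPs (values are illustrative).
toy_genotypes = [[0, 1, 2], [2, 0, 1]]
toy_effects = [0.5, -0.2, 0.1]
toy_scores = unadjusted_prs(toy_genotypes, toy_effects)
```

In practice the same computation is usually delegated to dedicated software; the sketch only makes the indexing of X_{ij} and \(\hat b_j\) explicit.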
P+T
The P+T method refers to the calculation of PRS using informed LD-pruning (also known as LD-clumping) and P-value thresholding. In this study, we use the implementation of the P+T method in the software package PRSice-2^{53} [https://choishingwan.github.io/PRSice] and its default parameter settings. Specifically, for any pair of SNPs that have a physical distance smaller than 250 kb and an R^{2} greater than 0.1, the less significant SNP is removed. The polygenic score is then calculated as the sum of the remaining, largely independent SNPs with a GWAS association P-value below a threshold P_{T}, weighted by their marginal effect size estimates. We consider P_{T} ∈ {1E−8, 1E−7, 1E−6, 1E−5, 3E−5, 1E−4, 3E−4, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1} in this paper. The P_{T} value that produces the highest prediction accuracy in a validation data set is selected, and the predictive performance is assessed in an independent testing set.
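The pruning and thresholding steps can be illustrated with a greedy clumping sketch in Python. This is one common way to realize the rule described above and is not the PRSice-2 implementation; the record fields (`id`, `chrom`, `pos`, `p`, `beta`) and the `r2` lookup are hypothetical, and the sketch uses `p <= p_threshold` so that P_{T} = 1 retains every clumped SNP.

```python
def clump_and_threshold(snps, r2, p_threshold, window_bp=250_000, r2_max=0.1):
    """Greedy LD clumping followed by P-value thresholding.

    snps: list of dicts with keys "id", "chrom", "pos", "p", "beta".
    r2: callable (id_a, id_b) -> squared correlation between two SNPs.
    """
    kept = []
    # Visit SNPs from most to least significant; a SNP is dropped if it is
    # close to, and in LD with, a more significant SNP that was retained.
    for snp in sorted(snps, key=lambda s: s["p"]):
        clashes = any(
            k["chrom"] == snp["chrom"]
            and abs(k["pos"] - snp["pos"]) < window_bp
            and r2(k["id"], snp["id"]) > r2_max
            for k in kept
        )
        if not clashes:
            kept.append(snp)
    return [s for s in kept if s["p"] <= p_threshold]


# Toy data (illustrative): rsB is in strong LD with the more significant rsA.
toy_snps = [
    {"id": "rsA", "chrom": 1, "pos": 1_000, "p": 1e-9, "beta": 0.05},
    {"id": "rsB", "chrom": 1, "pos": 50_000, "p": 1e-4, "beta": 0.03},
    {"id": "rsC", "chrom": 1, "pos": 10_000_000, "p": 0.2, "beta": -0.01},
]
toy_ld = {frozenset(("rsA", "rsB")): 0.8}


def toy_r2(id_a, id_b):
    return toy_ld.get(frozenset((id_a, id_b)), 0.0)


strict = [s["id"] for s in clump_and_threshold(toy_snps, toy_r2, 0.01)]
lenient = [s["id"] for s in clump_and_threshold(toy_snps, toy_r2, 1.0)]
```

The polygenic score is then the weighted sum of the retained SNPs, and P_{T} is chosen by repeating the thresholding step over the grid and comparing accuracy in the validation set.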
LDpred and LDpred-inf
LDpred [https://github.com/bvilhjal/ldpred] is a method that infers the posterior mean effect size of each genetic marker from GWAS summary statistics while accounting for LD, using a point-normal prior on the SNP effect sizes and LD information from an external reference panel^{4}. Consider the linear model y = Zβ + ε, where both the phenotype y and the genotype matrix Z have been standardized. LDpred places an independent point-normal prior on each regression coefficient β_{j}:
\[\beta_j \sim \begin{cases} N\left(0, \dfrac{h_g^2}{M\pi}\right) & \text{with probability } \pi, \\ 0 & \text{with probability } 1 - \pi, \end{cases} \tag{12}\]
where \(h_{g}^{2}\) is the heritability explained by genome-wide genetic markers (known as SNP heritability), and π is the fraction of causal variants. Given π and an estimate of \(h_{g}^{2}\), which can be obtained, for example, by applying LD score regression^{23} to the GWAS summary statistics, LDpred employs an MCMC sampler to approximate the posterior mean of β_{j}, conditioning on marginal least squares effect size estimates and LD information from a reference panel. In this paper, we consider π ∈ {1E−5, 3E−5, 1E−4, 3E−4, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1}. The π value with the highest prediction accuracy in a validation data set is selected, and the predictive performance is assessed in an independent testing set.
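LDpred-inf is a special case of LDpred when all variants are assumed to be causal (i.e., π = 1). Under this infinitesimal model, the posterior mean effect sizes in the \(\ell\)th LD window have a closed-form approximation:
\[\mathrm{E}\left(\boldsymbol{\beta}_\ell \mid \hat{\boldsymbol{\beta}}_\ell, \mathbf{D}_\ell\right) \approx \left(\frac{M}{Nh_g^2}\mathbf{I} + \mathbf{D}_\ell\right)^{-1}\hat{\boldsymbol{\beta}}_\ell,\]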
where \(\hat{\boldsymbol{\beta}} _\ell\) is a vector of marginal least squares effect size estimates, \({\mathbf{D}}_\ell\) is the LD matrix that can be estimated from an external reference panel, I is an identity matrix, and it has been assumed that \(h_\ell ^2\), the heritability explained by SNPs in the \(\ell\)th LD window, is small such that \(1 - h_\ell ^2 \approx 1\). In this work, we use an LD radius of M/3000 to approximate local LD patterns, as suggested in Vilhjálmsson et al.^{4}
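To make the infinitesimal-model shrinkage concrete, the sketch below solves the linear system for a single LD window in plain Python. It assumes the closed form reported by Vilhjálmsson et al.^{4}, in which the posterior mean is approximately (M/(N h_g^2) I + D_ℓ)^{−1} applied to the marginal estimates, with N the GWAS sample size; the function names and toy numbers are illustrative, and the small Gaussian-elimination helper stands in for a numerical library.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ldpred_inf_window(beta_hat, ld, n_gwas, m_snps, h2):
    """Approximate posterior mean effects in one LD window (assumed closed form).

    beta_hat: marginal least squares estimates for the SNPs in the window.
    ld: LD (correlation) matrix for the same SNPs, as a list of rows.
    n_gwas, m_snps, h2: GWAS sample size, total marker count, SNP heritability.
    """
    ridge = m_snps / (n_gwas * h2)
    k = len(beta_hat)
    system = [[ld[i][j] + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    return solve_linear(system, beta_hat)


# Toy window with two SNPs; M / (N * h2) = 1000 / (10000 * 0.5) = 0.2.
beta_hat = [0.012, -0.006]
no_ld = ldpred_inf_window(beta_hat, [[1.0, 0.0], [0.0, 1.0]], 10_000, 1_000, 0.5)
with_ld = ldpred_inf_window(beta_hat, [[1.0, 0.5], [0.5, 1.0]], 10_000, 1_000, 0.5)
```

Without LD the estimator reduces to a uniform rescaling of the marginal estimates by 1/(1 + M/(N h_g^2)); with LD, correlated SNPs share the signal through the off-diagonal entries of D_ℓ.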
UK Biobank genetic data
UK Biobank [http://www.ukbiobank.ac.uk] is a prospective cohort study of ~500,000 individuals recruited across Great Britain during 2006–2010^{13}. The protocol and consent were approved by the UK Biobank’s Research Ethics Committee. Data for the current analyses were obtained under an approved data request.
The genetic data for the UK Biobank comprises 488,377 samples and was phased and imputed to ~96 million variants with the Haplotype Reference Consortium (HRC) haplotype resource and the UK10K + 1KG reference panel. We leveraged the QC metrics provided by the UK Biobank^{14} and removed samples that had a mismatch between genetically inferred sex and self-reported sex, high genotype missingness or extreme heterozygosity, sex chromosome aneuploidy, and samples that were excluded from kinship inference and autosomal phasing. We further restricted the analysis to unrelated white British participants. We conducted simulation studies using 819,941 HapMap3 SNPs after removing ambiguous (A/T and C/G) SNPs and markers with minor allele frequency (MAF) <1%, missing rate >1%, imputation quality INFO score <0.8, and significant deviation from Hardy-Weinberg equilibrium (HWE) with P < 1 × 10^{−10}. All genetic analyses in the UK Biobank were conducted using PLINK 1.9^{54} [https://www.cog-genomics.org/plink/1.9].
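The marker-level filters can be expressed as a simple predicate. The Python sketch below is illustrative only: the analyses themselves were run in PLINK 1.9, and the record fields (`a1`, `a2`, `maf`, `missing`, `info`, `hwe_p`) are hypothetical names chosen for the example.

```python
# Strand-ambiguous allele pairs: the two alleles are complementary bases.
AMBIGUOUS_PAIRS = {frozenset(("A", "T")), frozenset(("C", "G"))}


def is_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose strand cannot be resolved from alleles."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def passes_marker_qc(snp, maf_min=0.01, missing_max=0.01, info_min=0.8, hwe_p_min=1e-10):
    """Apply the marker filters listed in the text to one SNP record."""
    return (
        not is_ambiguous(snp["a1"], snp["a2"])
        and snp["maf"] >= maf_min          # drop MAF < 1%
        and snp["missing"] <= missing_max  # drop missing rate > 1%
        and snp["info"] >= info_min        # drop INFO < 0.8
        and snp["hwe_p"] >= hwe_p_min      # drop HWE P < 1e-10
    )


# Toy records (values are illustrative).
good = {"a1": "A", "a2": "G", "maf": 0.25, "missing": 0.001, "info": 0.99, "hwe_p": 0.4}
ambiguous = dict(good, a1="C", a2="G")
rare = dict(good, maf=0.004)
```

Ambiguous SNPs are removed because an A/T or C/G marker reads the same on both strands, so its effect allele cannot be matched reliably across data sets.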
Simulations
We performed simulation studies using real genetic data from the UK Biobank and the 1KG European sample (N = 503) as an external LD reference panel. SNP effect sizes were simulated using (1) a point-normal model as specified in Eq. (12) with different numbers of causal variants (100, 1000, 10,000, and 100,000), which represent extremely sparse to highly polygenic genetic architectures; and (2) a normal mixture model comprising 10 group-one SNPs, 1000 group-two SNPs and 10,000 group-three SNPs, in which the three effect size groups explained 10%, 20%, and 70% of the total heritability, respectively. The simulated trait was generated as the sum of all genetic markers, weighted by their simulated effect sizes, plus a normally distributed noise term scaled to fix the heritability at 0.5. We then conducted GWAS to produce a marginal least squares effect size estimate for each SNP, and applied each polygenic prediction method to the GWAS summary statistics. For P+T, LDpred, and PRS-CS, tuning parameters were selected in a validation data set of 3000 individuals that are unrelated to the training sample. The predictive performance of all six methods was evaluated in 3000 individuals (the testing set) that are unrelated to both the training sample and the validation set. R^{2} between the observed and predicted traits was used to quantify the prediction accuracy. We regressed the true phenotype onto the PRS predictor, and used the regression slope as a measure of calibration. A slope close to one indicates that a predictor is well calibrated. For each combination of the genetic architecture and the training sample size (10,000, 20,000, 50,000, and 100,000), the simulation was repeated 20 times.
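Two ingredients of this design, drawing point-normal effect sizes and computing R^{2} with the calibration slope, can be sketched in Python as follows. The sketch assumes standardized genotypes, so that each of the n_causal causal SNPs receives variance h^{2}/n_causal; the function names, seed and toy inputs are illustrative and do not reproduce the simulations reported here.

```python
import random


def simulate_point_normal_effects(m_snps, n_causal, h2, seed=0):
    """Draw effect sizes: n_causal SNPs ~ N(0, h2 / n_causal), the rest exactly 0."""
    rng = random.Random(seed)
    effects = [0.0] * m_snps
    sd = (h2 / n_causal) ** 0.5
    for j in rng.sample(range(m_snps), n_causal):
        effects[j] = rng.gauss(0.0, sd)
    return effects


def r2_and_calibration_slope(observed, predicted):
    """R^2 between trait and PRS, and the slope from regressing trait on PRS."""
    n = len(observed)
    mean_y = sum(observed) / n
    mean_p = sum(predicted) / n
    s_py = sum((p - mean_p) * (y - mean_y) for p, y in zip(predicted, observed))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_yy = sum((y - mean_y) ** 2 for y in observed)
    slope = s_py / s_pp
    r2 = s_py ** 2 / (s_pp * s_yy)
    return r2, slope


effects = simulate_point_normal_effects(m_snps=10_000, n_causal=100, h2=0.5)

# A predictor that is perfectly correlated with the trait but half its scale:
# R^2 is 1, yet the slope of 2 signals that the scores are over-shrunk.
toy_r2, toy_slope = r2_and_calibration_slope([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0])
```

The toy case shows why both metrics are reported: R^{2} is invariant to rescaling of the predictor, whereas the slope departs from one when the scores are on the wrong scale.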
In order to systematically compare polygenic prediction methods across a wide range of settings, we conducted a number of secondary simulation studies: (1) sampling SNP effect sizes using a point-normal model with heritability fixed at 0.2 or 0.8; (2) sampling SNP effect sizes using a point-t model with heavy tails (a mixture of a point mass at zero and a Student’s t-distribution with 4 degrees of freedom); (3) sampling SNP effect sizes using a point-gamma model (a mixture of a point mass at zero and a gamma distribution with the shape parameter set to 2), which produces an effect size distribution that is asymmetric about zero and positively skewed with the right tail being long and thin and the left tail being short and fat; (4) using the combined UK Biobank validation and testing data sets (N = 6000) as an in-sample LD reference panel in the point-normal simulations. For each setting and training sample size considered (10,000, 20,000, 50,000, and 100,000), the simulation was repeated 20 times.
Partners HealthCare Biobank genetic data
The Partners HealthCare Biobank [https://biobank.partners.org] is a collection of plasma, serum, DNA and buffy coat samples collected from consented subjects, which are linked to their electronic health records (EHR) and survey data on lifestyle, environment, and family history^{55}. To date, Partners Biobank has enrolled more than 96,000 participants, and released genome-wide genetic data for 25,482 subjects. A study protocol is not required for Partners investigators to obtain de-identified data sets from Partners Biobank.
We performed QC on each genotyping batch separately with the following steps: (1) SNPs with genotype missing rate >0.05 were removed; (2) samples with genotype missing rate >0.02 or absolute value of heterozygosity >0.2, or samples that failed sex checks were excluded; (3) SNPs with missing rate >0.02, or HWE test P < 1 × 10^{−6} were discarded. We then removed SNPs that showed significant batch associations with P < 1 × 10^{−6}, and merged genotyping batches for subsequent processing and analyses.
The Partners HealthCare Biobank included individuals from diverse populations. We used the 1KG samples as a population reference panel to infer the ancestry of Partners Biobank participants. Specifically, we computed principal components (PCs) of the genotype data in all the 1KG samples, and trained a random forest model using the top 4 PCs on the super population labels (African [AFR], American [AMR], East Asian [EAS], European [EUR], and South Asian [SAS]), in which EUR (N = 503) included TSI, IBS, GBR, CEU, and FIN subpopulations. The random forest model was then applied to the Partners Biobank participants, and identified 19,136 unrelated subjects (\(\hat \pi \, < \, 0.2\)) with European ancestry.
We used the Eagle2 software^{56} [https://data.broadinstitute.org/alkesgroup/Eagle] for pre-phasing and Minimac3^{57} [https://genome.sph.umich.edu/wiki/Minimac3] for imputation in the Partners Biobank European sample. Lastly, we removed markers with MAF <1%, missing rate >2%, imputation quality INFO score <0.8, and significant deviation from HWE with P < 1 × 10^{−10}. All genetic analyses in the Partners Biobank were conducted using PLINK 1.9^{54}.
Partners Biobank curated disease populations and quantitative traits
For a number of common complex diseases, the Partners Biobank trained and validated a classification algorithm, which leverages both structured and unstructured EHR data, and combines natural language processing and statistical methods, in a gold standard training set created by expert chart review. The algorithm was then applied to all the participants in the Biobank to identify cases and controls, and create curated disease populations. We selected six curated diseases (BRCA, CAD, DEP, IBD [Crohn’s disease or ulcerative colitis], RA, and T2DM) for which there are more than 500 cases in the Biobank that have been genotyped, and external large-scale GWAS summary statistics are publicly available. For all the diseases, cases have an algorithm-based positive predictive value (PPV) of having current or past history of the disease greater than 0.90, and controls have a negative predictive value (NPV) of having no history of the disease greater than 0.99.
In addition, we selected six quantitative traits (height [HGT], body mass index [BMI], high-density lipoproteins [HDL], low-density lipoproteins [LDL], cholesterol [CHOL], and triglycerides [TRIG]) that have been measured in the Partners Biobank healthy control population with a Charlson age-comorbidity index 0–2 and the predicted 10-year survival probability greater than 90%. We predicted these quantitative traits in a relatively healthy population to avoid measurements affected by severe diseases or medications. For participants that have multiple measurements of a trait of interest, we used the median value. Table 1 presents the sample size for each curated disease and quantitative trait in the Partners Biobank.
Summary statistics and polygenic prediction
GWAS summary statistics for all the diseases and quantitative traits are publicly available (Supplementary Data 1). We removed ambiguous (A/T and C/G) SNPs and mapped the genetic markers to the Genome Reference Consortium human genome build 37. SNP heritability for each disease and trait was estimated using GWAS summary statistics and LD score regression^{23}. Heritability estimates for diseases on the observed scale were transformed to the liability scale as described in Lee et al.^{58} using the assumed population and sample prevalences shown in Supplementary Table 13. For unadjusted PRS and P+T, we used all the genetic markers that are present in the summary statistics, LD reference panel and the Partners Biobank genetic data. For LDpred(-inf) and PRS-CS(-auto), we further restricted the genetic markers to the HapMap3 panel to reduce memory and computational cost. Table 1 shows the total number of markers included in the analysis for each disease and quantitative phenotype. We note that the GWAS samples and the Partners Biobank sample may overlap. However, by carefully examining the sample composition of each GWAS study, we believe that sample overlap is minimal (if any) and does not impact the comparison among polygenic prediction methods.
For each curated disease and quantitative trait, the Partners HealthCare Biobank sample was repeatedly and randomly split into a validation set comprising 1/3 of the data and a testing set comprising 2/3 of the data. Tuning parameters (P-value threshold in P+T, fraction of causal SNPs in LDpred, and global shrinkage parameter in PRS-CS) were selected in the validation set, and the predictive performance was evaluated in the testing set. We use the average R^{2} between the observed and predicted phenotypes across 100 random splits to assess the predictive performance for the quantitative traits, and report the average Nagelkerke’s R^{2} metric across 100 random splits for disease (case–control) phenotypes. Nagelkerke’s R^{2} is defined as \(R_{{\mathrm{nag}}}^2 = R^2/R_{{\mathrm{max}}}^2\), where \(R^2 = 1 - ({\cal{L}}_{{\mathrm{res}}}{\mathrm{/}}{\cal{L}}_{{\mathrm{full}}})^{2/N}\), \(R_{{\mathrm{max}}}^2 = 1 - {\cal{L}}_{{\mathrm{res}}}^{2{\mathrm{/}}N}\), \({\cal{L}}_{{\mathrm{res}}}\) is the likelihood of a restricted logistic regression model with covariates only (an intercept, current age, sex and top 10 PCs of the genotype data), \({\cal{L}}_{{\mathrm{full}}}\) is the likelihood of the full logistic regression model (covariates and the PRS predictor), and N is the sample size. We define the relative increase or decrease in R^{2} of a polygenic prediction method A compared to method B as \((R_{\mathrm{A}}^2 - R_{\mathrm{B}}^2)/R_{\mathrm{B}}^2\). In addition to R^{2} or Nagelkerke’s R^{2}, we also report area under the ROC curve (known as AUC), area under the precision-recall curve, and the odds ratio (OR) comparing top 10% of the participants having high polygenic risk with the remaining 90% of the sample. We adjusted for current age, sex and top 10 PCs of the genotype data in the calculation of all predictive performance metrics.
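Nagelkerke’s R^{2} and the relative change in R^{2} follow directly from the definitions above. The Python sketch below works with log-likelihoods rather than raw likelihoods to avoid numerical underflow; the function names and toy values are illustrative, and fitting the two logistic regression models is assumed to happen elsewhere.

```python
import math


def nagelkerke_r2(loglik_restricted, loglik_full, n):
    """Nagelkerke's R^2 from the log-likelihoods of the two logistic models.

    loglik_restricted: log-likelihood of the covariates-only model.
    loglik_full: log-likelihood of the model with covariates and the PRS.
    n: sample size.
    """
    scale = 2.0 / n
    # R^2 = 1 - (L_res / L_full)^(2/N), evaluated on the log scale.
    r2 = 1.0 - math.exp(scale * (loglik_restricted - loglik_full))
    # R^2_max = 1 - L_res^(2/N).
    r2_max = 1.0 - math.exp(scale * loglik_restricted)
    return r2 / r2_max


def relative_change(r2_a, r2_b):
    """Relative increase or decrease in R^2 of method A compared with method B."""
    return (r2_a - r2_b) / r2_b


# Toy check: 100 samples and a restricted model that assigns probability 0.5 to
# every outcome, so its log-likelihood is 100 * log(0.5).
n = 100
ll_res = n * math.log(0.5)
no_gain = nagelkerke_r2(ll_res, ll_res, n)   # full model adds nothing
perfect = nagelkerke_r2(ll_res, 0.0, n)      # full model fits perfectly
```

Dividing by R^{2}_{max} rescales the statistic so that a perfectly fitting full model attains a value of one, which the unscaled R^{2} cannot reach for binary outcomes.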
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
UK Biobank data are available to registered investigators under approved applications [http://www.ukbiobank.ac.uk]. All genome-wide association summary statistics used in this study are publicly available. Download links are included in Supplementary Data 1. Other relevant data are available from the corresponding author upon request.
Code availability
A Python package for PRS-CS is available on GitHub [https://github.com/getian107/PRScs].
References
 1.
Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406 (2016).
 2.
Khera, A. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
 3.
International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
 4.
Vilhjálmsson, B. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
 5.
Zhang, Y., Qi, G., Park, J. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018).
 6.
Lloyd-Jones, L. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. BioRxiv Preprint 522961 (2019).
 7.
Hoggart, C., Whittaker, J., De Iorio, M. & Balding, D. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 4, e1000130 (2008).
 8.
De Los Campos, G. et al. Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics 182, 375–385 (2009).
 9.
Makowsky, R. et al. Beyond missing heritability: prediction of complex traits. PLoS Genet. 7, e1002051 (2011).
 10.
Meuwissen, T., Hayes, B. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
 11.
Xu, S. Estimating polygenic effects using markers of the entire genome. Genetics 163, 789–801 (2003).
 12.
Yi, N. & Xu, S. Bayesian LASSO for QTL mapping. Genetics 179, 1045–1055 (2008).
 13.
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
 14.
Bycroft, C. et al. The UK biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
 15.
Gainer, V. et al. The Biobank Portal for Partners personalized medicine: a query tool for working with consented biobank samples, genotypes, and phenotypes using i2b2. J. Pers. Med. 6, 11 (2016).
 16.
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
 17.
Strawderman, W. Proper Bayes minimax estimators of the multivariate normal mean. Ann. Math. Stat. 42, 385–388 (1971).
 18.
Berger, J. A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Ann. Stat. 8, 716–761 (1980).
 19.
Gelman, A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1, 515–534 (2006).
 20.
Polson, N. & Scott, J. Shrink globally, act locally: sparse bayesian regularization and prediction. Bayesian Stat. 9, 501–538 (2010).
 21.
Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).
 22.
Pasaniuc, B. & Price, A. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 18, 117–127 (2017).
 23.
Bulik-Sullivan, B. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
 24.
Marquez-Luna, C. et al. Modeling functional enrichment improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. BioRxiv Preprint 375337 (2018).
 25.
Berisa, T. & Pickrell, J. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, 283–285 (2016).
 26.
Shi, H., Kichaev, G. & Pasaniuc, B. Contrasting the genetic architecture of 30 complex traits from summary association data. Am. J. Hum. Genet. 99, 139–153 (2016).
 27.
Shi, H., Mancuso, N., Spendlove, S. & Pasaniuc, B. Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet. 101, 737–751 (2017).
 28.
Lee, S., Clark, S. & van der Werf, J. Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS ONE 12, e0189775 (2017).
 29.
Caron, F. & Doucet, A. Sparse bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine learning. pp. 88–95 (ACM, New York, NY, USA, 2008).
 30.
Griffin, J. & Brown, P. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 5, 171–188 (2010).
 31.
Lee, A., Caron, F., Doucet, A. & Holmes, C. Bayesian sparsity-path-analysis of genetic association signal using generalized t priors. Stat. Appl. Genet. Mol. Biol. 11 (2012).
 32.
Armagan, A., Dunson, D. & Lee, J. Generalized double pareto shrinkage. Stat. Sin. 23, 119–143 (2013).
 33.
Armagan, A., Clyde, M. & Dunson, D. Generalized beta mixtures of Gaussians. Adv. Neural Inf. Process. Syst. 24, 523–531 (2011).
 34.
Griffin, J. & Brown, P. Bayesian hyper-lassos with non-convex penalization. Aust. N.Z. J. Stat. 53, 423–442 (2011).
 35.
Yi, N., George, V. & Allison, D. Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics 164, 1129–1138 (2003).
 36.
Meuwissen, T. & Goddard, M. Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet. Sel. Evol. 36, 261–279 (2004).
 37.
Verbyla, K., Hayes, B., Bowman, P. & Goddard, M. Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genet. Res. 91, 307–311 (2009).
 38.
Hayes, B., Pryce, J., Chamberlain, A., Bowman, P. & Goddard, M. Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in Holstein cattle as contrasting model traits. PLoS Genet. 6, e1001139 (2010).
 39.
Verbyla, K., Bowman, P., Hayes, B. & Goddard, M. Sensitivity of genomic selection to using different prior distributions. BMC Proc. 4, S5 (2010).
 40.
Habier, R. D., Fernando, R. L., Kizilkaya, K. & Garrick, D. Extension of the Bayesian alphabet for genomic selection. BMC Bioinform. 12, 186 (2011).
 41.
Erbe, M. et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 95, 4114–4129 (2012).
 42.
Moser, G. et al. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 11, e1004969 (2015).
 43.
Guan, Y. & Stephens, M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5, 1780–1815 (2011).
 44.
Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet. 9, e1003264 (2013).
 45.
Zeng, P. & Zhou, X. Nonparametric genetic prediction of complex traits with latent Dirichlet process regression models. Nat. Commun. 8, 456 (2017).
 46.
Shi, J. et al. Winner’s curse correction and variable thresholding improve performance of polygenic risk modeling based on genome-wide association study summary-level data. PLoS Genet. 12, e1006493 (2016).
 47.
Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50, 229–237 (2018).
 48.
Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. 101, 539–551 (2017).
 49.
Ni, G. et al. Estimation of genetic correlation via linkage disequilibrium score regression and genomic restricted maximum likelihood. Am. J. Hum. Genet. 102, 1185–1194 (2018).
 50.
Carvalho, C., Polson, N. & Scott, J. The horseshoe estimator for sparse signals. Biometrika 97, 465–480 (2010).
 51.
Johnstone, I. & Silverman, B. Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Stat. 32, 1594–1649 (2004).
 52.
Piironen, J. & Vehtari, A. On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. J. Mach. Learn. Res. 54, 905–913 (2017).
 53.
Euesden, J., Lewis, C. & O’Reilly, P. PRSice: polygenic risk score software. Bioinformatics 31, 1466–1468 (2014).
 54.
Chang, C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
 55.
Karlson, E., Boutin, N., Hoffnagle, A. & Allen, N. Building the partners healthcare biobank at partners personalized medicine: informed consent, return of research results, recruitment lessons and operational considerations. J. Pers. Med. 6, 2 (2016).
 56.
Loh, P. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
 57.
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
 58.
Lee, S., Wray, N., Goddard, M. & Visscher, P. Estimating missing heritability for disease from genomewide association studies. Am. J. Hum. Genet. 88, 294–305 (2011).
 59.
Michailidou, K. et al. Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92–94 (2017).
 60.
Nikpay, M. et al. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
 61.
Wray, N. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).
 62.
Liu, J. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
 63.
Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
 64.
Scott, R. et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes 66, 2888–2902 (2017).
 65.
Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700,000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).
 66.
Willer, C. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
Acknowledgements
This work involved the use of the Enterprise Research Infrastructure & Services (ERIS) at Partners HealthCare. We thank the Partners HealthCare Biobank for providing genomic and health information data. This research was funded in part by National Institutes of Health (NIH) U01HG008685 supporting the eMERGE Network, and K99AG054573 (T.G.). J.W.S. is a Tepper Family MGH Research Scholar and was supported in part by a gift from the Demarest Lloyd, Jr. Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This research has been conducted using the UK Biobank resource under an approved data request (ref: 32568). The breast cancer genome-wide association analyses were supported by the Government of Canada through Genome Canada and the Canadian Institutes of Health Research, the ‘Ministère de l'Économie, de la Science et de l’Innovation du Québec’ through Genome Québec and grant PSR-SIIRI-701, The National Institutes of Health (U19CA148065, X01HG007492), Cancer Research UK (C1287/A10118, C1287/A16563, C1287/A10710) and The European Union (HEALTH-F2-2009-223175 and H2020 633784 and 634935). All studies and funders are listed in Michailidou et al.^{59}. Data on coronary artery disease have been contributed by CARDIoGRAMplusC4D investigators and have been downloaded from http://www.cardiogramplusc4d.org.
Author information
Contributions
T.G. conceived the study. T.G. and C.Y.C. designed the experiments. T.G. developed the statistical methods with contributions from Y.N. C.Y.C. preprocessed the Partners HealthCare Biobank genetic data. T.G. performed the simulations and real data analyses, with contributions from C.Y.C. and Y.C.A.F. T.G. developed the software, with input from C.Y.C. and Y.C.A.F. T.G. wrote the paper. C.Y.C., Y.N., Y.C.A.F., and J.W.S. provided critical revision for the manuscript. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Journal peer review information: Nature Communications thanks Sang Hong Lee and the other anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ge, T., Chen, CY., Ni, Y. et al. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776 (2019). https://doi.org/10.1038/s41467-019-09718-5