Introduction

Genome-wide association studies (GWASs) have identified thousands of genetic loci harboring associated single-nucleotide polymorphisms (SNPs) for many complex traits and diseases, providing unprecedented insights into the genetic basis of phenotypic variation1,2,3,4,5,6,7,8. The accumulation of genetic data from existing association studies has led to a growing interest in predicting traits and diseases using genetic markers (in addition to using traditional environmental or clinical variables)9. In animals or plants, accurate phenotype prediction with genetic markers can assist the selection of individuals with desirable breeding values and can improve the effectiveness of breeding programs10. In humans, accurate phenotype prediction with genetic markers can facilitate disease prevention and intervention at early stages and can aid in the development of personalized medicine by using genotype information to customize the treatment and predict the outcome11. Phenotype prediction has also been proposed recently as a key step for integrating functional genomic sequencing studies with GWASs: we can construct more powerful and interpretable gene-set tests in GWASs by setting variant weights to be the coefficients inferred from predictive models in expression quantitative trait locus mapping studies12.

Progress toward accurate phenotype prediction requires the development of statistical methods that can model all SNPs jointly. Previous association studies have demonstrated that most complex traits and common diseases have a polygenic background and are each influenced by many genetic variants with small effects. For instance, it is estimated that thousands of causal variants influence human height13. Similarly, many animal or plant traits are contributed by hundreds of causal variants (e.g., maize-related traits, such as kernel oil and growing degree days (GDDs)14, 15; and cattle-related traits, such as backfat thickness, milk yield (MY) and hot carcass weight)16, 17. Because most complex traits and common diseases have a polygenic architecture, a handful of identified associated SNPs often only capture a small proportion of the phenotypic variation and thus cannot be used to yield accurate phenotype and risk prediction. Instead, accurate phenotype prediction requires polygenic models that can make use of all genome-wide SNPs9, 18,19,20. In the past decade, successful development and application of many polygenic models in the context of genomic selection has revolutionized many animal breeding programs16, 21,22,23. More recently, applications of polygenic models to human GWASs have also yielded fruitful results11, 24,25,26,27.

Most existing polygenic models for prediction make an assumption on the effect size distribution and different methods differ mainly in such modeling assumption. For example, the commonly used linear mixed model (LMM), also known as the best linear unbiased predictor (BLUP), assumes that the effect sizes from all variants follow a normal distribution9, 28. The Bayes alphabetic (e.g., BayesA and BayesB) methods assume that the variant effect sizes follow a t-distribution or its variation10, 18, 29. The Bayesian lasso assumes a double exponential/Laplace distribution30, 31. NEG generalizes the Bayesian lasso by assuming a normal exponential gamma distribution32. BVSR and BayesCπ assume a point-normal distribution29, 33. BSLMM assumes a mixture of two normal distributions34 and is closely related to the early reversible jump Markov Chain Monte Carlo (rjMCMC) method20. BayesR35 assumes a three-component normal mixture together with a point mass at zero. Given the large number of modeling choices, one naturally wonders which method to use for any given trait. Previous studies have suggested that accurate prediction requires choosing a prior effect size distribution that can closely match the shape of the true effect size distribution, such that the inferred posterior can approximate well the polygenic architecture of the given trait24, 35, 36. However, the effect size distribution for any given trait or disease is unknown a priori and varies for different diseases in terms of the number of causal variants, their minor allele frequencies (MAFs), and their individual effect sizes11. Therefore, to achieve robust performance, it is important to design prior distributions that are flexible enough to resemble the true effect distribution in many traits as close as possible34, 35.

Up to now, almost all existing polygenic models are parametric in nature and use a prior effect size distribution that is characterized by a few parameters. From the information channel perspective37, the number of parameters in a parametric model determines model complexity and bounds the amount of information in data that can be captured by the model37,38,39,40. Therefore, using only a few parameters to characterize the effect size distribution can limit the flexibility of the model37, 38 and impede its robust performance across a range of genetic architectures. As an example, the commonly applied LMM uses a normal distribution with one variance component parameter to characterize the effect size distribution. For highly polygenic traits, the assumed normal distribution can approximate the true effect size distribution well, and as a result, LMM can achieve good predictive performance34, 35. However, for traits with large effect variants, the assumed normal distribution can no longer capture the true effect size distribution well and the performance of LMM decays34, 35.

To allow for greater flexibility on the a priori effect size distribution and to enable robust phenotype prediction performance across a range of phenotypes, we develop a Bayesian non-parametric model, which we refer to as the latent Dirichlet process regression (DPR). DPR does not use any fixed parametric distribution as the prior choice for the effect size distribution. Instead, DPR relies on the Dirichlet process to assign a prior on the effect size distribution itself and is thus capable of inferring an effect size distribution from the data at hand. Effectively, DPR uses infinitely many parameters a priori to character the effect size distribution, and with such a flexible modeling assumption, DPR is capable of adapting to a broad spectrum of genetic architectures and achieves robust predictive performance across a wide range of complex traits. We illustrate the benefits of DPR with simulations and real data applications for gene expression prediction, gene-based test via PrediXcan, genomic selection for four traits in two species, as well as genetic prediction of eight complex traits in a human cohort.

Results

Method overview

An overview of our method is provided in the Methods section with details provided in the Supplementary Note. Briefly, we use a Dirichlet process to introduce a non-parametric effect size distribution that can robustly resemble a large classes of unimodal distributions. Indeed, our prior effect size distribution can be used to adaptively and accurately approximate a t-distribution, a point-t mixture distribution, a mixture of step functions, as well as the marginal effect sizes estimated from a real data set; whereas a normal distribution cannot (Fig. 1). Therefore, our prior distribution on the effect size can adaptively approximate a wide range of possible effect size distributions underlying complex traits. Since accurate modeling of the effect size distribution is a key to achieve accurate prediction performance24, 34, 36, we expect our non-parametric model to perform robustly well across a range of polygenic architectures. Our method is implemented in the DPR software, freely available at http://www.xzlab.org/software.html.

Fig. 1
figure 1

The induced non-parametric Dirichlet process (nonparametric DP) normal mixture prior on the effect sizes can be used to approximate a large number of unimodal distributions. We either simulated 2000 values from a a standard t-distribution with df = 4; b a point-t mixture distribution with the zero proportion being 0.2, or equivalently, 0.8 × t(df = 4) + 0.2 × δ 0, where δ 0 denotes a point mass at zero; c a four-component uniform step mixture distribution 0.50 × U(−0.05,0.05) + 0.25 × U(−0.3,0.3) + 0.15 × U(−0.8,0.8) + 0.05 × U(−2,2), where U denotes a uniform distribution; or obtained d the estimated marginal effect sizes from a linear mixed model in the cattle data with SCS (somatic cell score) as the phenotype. To make the first three data comparable with the last data in d, we centered and scaled the values from the first three data for them to have a mean of zero and within the range of (−0.3,0.3). We then fit each data with either our non-parametric distribution (red) or a normal distribution (blue), and displayed the fitted curves on top of the sample distribution (black). Clearly, the non-parametric Dirichlet process normal mixture can approximate all these distributions well, while a simple normal distribution cannot

Simulations

We first compare the performance of DPR with several other commonly used prediction methods using simulations. A total of seven different methods are included for comparison: (1) BVSR29; (2) BayesR35 ; (3) LMM28; (4) MultiBLUP41; (5) rjMCMC20; (6) DPR.VB, the variational Bayesian (VB) version of DPR; and (7) DPR.MCMC, the Markov chain Monte Carlo (MCMC) version of DPR. Note that both BayesR and MultiBLUP have been recently demonstrated to outperform a range of existing prediction methods; thus, we do not include other prediction methods into comparison here.

To make our simulations as real as possible, we used genotypes from an existing cattle GWAS data set17 with 5024 individuals and 42,551 SNPs and simulated phenotypes. To cover a range of possible genetic architectures, we consider eight simulation settings from four different simulation scenarios with the phenotypic variance explained (PVE) by all SNPs being either 0.2, 0.5, or 0.8 (details in Methods). In each setting for each PVE value, we performed 20 simulation replicates. In each replicate, we randomly split the data into a training data with 80% individuals and a test data with the remaining 20% individuals. We then fitted different methods on the training data and evaluated their prediction performance on the test data (i.e., Monte Carlo cross validation). We evaluated prediction performance using either the squared correlation coefficient (R 2) or mean squared error (MSE). We contrasted the prediction performance of all other methods with that of DPR.MCMC by taking the difference of R 2 or MSE between the other methods and DPR.MCMC. Therefore, an R 2 difference below zero or an MSE difference above zero suggests worse performance than DPR.MCMC. Figure 2 shows R 2 differences for different methods across 20 replicates in each of the eight simulation settings for PVE = 0.5. Because Fig. 2 shows prediction performance difference, a large sample variance of a method in the figure only implies that the prediction performance of the method differs a lot from that of DPR.MCMC, but does not imply that the method itself has a large variation in predictive performance. Supplementary Table 1 shows the standard deviation of absolute R 2 values across cross variation replicates; various methods display similar prediction variability. Supplementary Figs. 1 and 2 show the R 2 differences for PVE = 0.2 and PVE = 0.8, respectively. The corresponding results for MSE differences are shown in Supplementary Figs. 35. The R 2 and MSE values of the baseline method, DPR.MCMC, are shown in the corresponding figure legend.

Fig. 2
figure 2

Comparison of prediction performance of several methods with DPR.MCMC in simulations when PVE = 0.5. Performance is measured by R 2 difference with respect to DPR.MCMC, where a negative value (i.e., values below the red horizontal line) indicates worse performance than DPR.MCMC. The sample R 2 differences are obtained from 20 replicates in each scenario. Methods for comparison include BVSR (cyan), BayesR (chocolate), LMM (purple), MultiBLUP (green), DPR.VB (red), rjMCMC (black blue), and DPR.MCMC. Simulation scenarios include: a Scenario I, which satisfies the DPR modeling assumption; b Scenario II, which satisfies the BayesR modeling assumption; c Scenario III, where the number of SNPs in the large effect group is 10, 100, or 1000; and d Scenario IV, where the effect sizes are generated from either a normal distribution, a t-distribution or a Laplace distribution. For each box plot, the bottom and top of the box are the first and third quartiles, while the ends of whiskers represent either the lowest datum within 1.5 interquartile range of the lower quartile or the highest datum within 1.5 interquartile range of the upper quartile. For DPR.MCMC, the mean predictive R 2 in the test set and the standard deviation for the eight settings are, respectively, 0.272 (0.031), 0.299 (0.026), 0.295 (0.026), 0.281 (0.030), 0.277 (0.035), 0.278 (0.030), 0.282 (0.025), and 0.273 (0.022)

Overall, while each method works the best when their individual modeling assumption is satisfied, DPR.MCMC is robust and works well across all eight settings from four scenarios. For example, if we rank the methods based on their median performance across replicates, then when the total PVE is moderate (e.g., PVE = 0.5, Fig. 2; note that for each PVE there are a total of eight simulation settings for the four scenarios), DPR.MCMC is the best or among the best in seven simulation settings (i.e., scenario I, c = 10, 100, and 1000 in scenario III, and normal, t and Laplace distributions in scenario IV; where “among the best” refers to the case when the difference between the given method and the best method is within ± 0.001) and is ranked as the second best in the rest one simulation setting (i.e., scenario II). Similarly, when the total PVE is high (e.g., PVE = 0.8, Supplementary Fig. 2), DPR.MCMC is the best or among the best in seven simulation settings, and it is ranked as the second best in scenario IV when the effect size follows a normal distribution. Even when DPR.MCMC is ranked as the second best method, the difference between DPR.MCMC and the best method is often small. Among the rest of the methods, LMM, MultiBLUP, and rjMCMC all work well in polygenic settings (scenario I; c = 1000 in scenario III; scenario IV) but can perform poorly in sparse settings (scenario II; c = 10 and c = 100 in scenario III). The performance of LMM, MultiBLUP, and rjMCMC in polygenic vs. sparse settings presumably stems from their polygenic assumptions on the effect size distribution. In contrast, because of the sparse assumption on the effect size distribution, both BayesR and BVSR have an advantage in sparse settings (scenario II; c = 10 or 100 in scenario III) but suffers in polygenic settings (c = 1000 in scenario III; scenario IV). The performance of BVSR is also generally worse than BayesR in the challenging setting when PVE is either small or moderate, presumably because of the much simpler prior assumption employed in BVSR for the non-zero effects. Finally, the VB version of DPR (i.e., DPR.VB) performs considerably less well compared with the MCMC version of DPR (i.e., DPR.MCMC), especially when PVE is high (Supplementary Fig. 2). However, DPR.VB still compares favorably with the other methods when PVE is small or moderate (Supplementary Fig. 1).

Real data applications

To gain further insights, we compare the performance of DPR with the other methods in several real data sets to (1) predict gene expression levels using cis-SNPs; (2) conduct subsequent PrediXcan based gene set test; (3) perform genomic selection in animal studies; and (4) predict complex traits in humans.

Our first application is predicting gene expression levels using cis-SNPs in the GEUVADIS data42. The GEUVADIS data contains gene expression measurements on 15,810 genes and 465 individuals after quality control (Methods). These individuals have their genotypes measured in the 1000 Genomes project43. In the data, we first identified cis-SNPs that are within 100 kb of each gene and obtained an average of 175 cis-SNPs per gene. Then, for each gene in turn, we applied different methods to predict gene expression levels using these cis-SNPs. To measure prediction performance, we carried out 20 Monte Carlo cross validation data splits as in simulations. In each data split, we fitted methods in a training set with 80% of randomly selected individuals and evaluated method performance using R 2 in the test set with the remaining 20% of individuals. In addition to the seven methods used in the simulations (i.e., LMM, BVSR, MultiBLUP, BayesR, rjMCMC, DPR.VB, and DPR.MCMC), we also applied Elastic Net (ENET)44, which is the default method used in the original PrediXcan paper12. Table 1 lists the number of genes with a predictive R 2 above different thresholds for different methods. The predictive R 2 obtained from DPR.MCMC vs. various other methods across all genes is shown as scatter plots in Supplementary Fig. 6, where each plot also lists the number of genes for which DPR.MCMC performs better and the number of genes for which DPR.MCMC performs worse.

Table 1 Comparison of seven different methods in predicting gene expression levels in the GEUVADIS data

The results are largely consistent with these in simulations. Overall, DPR.MCMC generally achieves better predictive performance than the other methods. For example, DPR.MCMC is able to achieve a higher predictive R 2 > 0.10 in ~1300 genes, which is ~100 more than that by the second best method at this threshold (i.e., LMM; Table 1). Similarly, compared with other methods, not only does DPR.MCMC achieve a higher R 2 for most genes; the R 2 improvement from DPR.MCMC can be large for many genes (Supplementary Fig. 6). Among the rest of the methods, the performance of LMM, DPR.VB, and ENET are comparable with each other and are ranked right behind DPR.MCMC. On the other hand, the two sparse models (i.e., BVSR and BayesR) perform poorly in this data, especially for some genes whose expression levels are highly predictive by the other methods (Table 1, Supplementary Fig. 6).

The robust performance of DPR.MCMC in predicting gene expression levels also translates to a relatively high power in the subsequent PrediXcan gene set test. To perform PrediXcan gene set test, we consider the seven common diseases from Wellcome Trust Case Control Consortium (WTCCC)4 as in Gamazon et al.12. For each disease and each gene in turn, we used the estimated cis-SNP effect sizes on expression levels from GEUVADIS as weights to construct gene set tests in WTCCC. Following Gamazon et al.12, we focused on a set of 4343 genes that had a predictive R 2 above 0.01 from all methods. The results are shown in Table 2, which lists the number of significant genes identified by different methods for different diseases. In total, DPR.MCMC identified 38 genes associated with different phenotypes, more than that identified by any other methods. The performance of DPR.MCMC is followed by DPR.VB and subsequently LMM and rjMCMC. Supplementary Table 2 lists the significant genes identified by DPR.MCMC, which are all consistent with previous studies. We also note that, in general, a higher gene expression predictive performance leads to a higher power in the subsequent gene set analysis. In addition, consistent with their relatively poor gene expression prediction performance, the two sparse models (BayesR and BVSR) do not perform well in the gene set test as well.

Table 2 Comparison of seven different methods in the PrediXcan gene set test in the WTCCC data

Finally, we compare the performance of DPR with the other methods in predicting phenotypes in three GWAS data sets: (1) a cattle study17, where we focus on three phenotypes: milk fat percentage (MFP), MY, as well as somatic cell score (SCS); (2) a maize study15, where we use GDD as the phenotype; (3) the Framingham heart study (FHS) data45, where we focus on five plasma traits that include low-density lipoprotein (LDL) cholesterol, glucose (GLU), high-density lipoprotein (HDL) cholesterol, total cholesterol (TC) and triglycerides (TGs), and three anthropometric traits that include height, weight and body mass index (BMI). As in simulations, for each phenotype, we performed 20 Monte Carlo cross validation data splits. In each data split, we fitted methods in a training set with 80% of randomly selected individuals and evaluated method performance using R 2 or MSE in a test set with the remaining 20% of individuals. We again contrasted the performance of the other methods with that of DPR.MCMC by taking the R 2 difference or MSE difference with respect to DPR.MCMC. The results are shown in Fig. 3 (R 2 difference) and Supplementary Fig. 7 (MSE difference), with R 2 and MSE of DPR.MCMC presented in the corresponding figure legend. Supplementary Table 1 shows the standard deviation of absolute R 2 values across cross variation replicates. Supplementary Fig. 8 also displays trace plots of the log posterior likelihood from DPR.MCMC for all traits, suggesting reasonable convergence of our method.

Fig. 3
figure 3

Comparison of prediction performance of several methods with DPR.MCMC for twelve traits from three data sets. a Prediction performance for MFP, MY and SCS in the cattle data, and for GDD in the maize data; b Prediction performance for the five plasma traits in the FHS data; c Prediction performance for the three anthropometric traits in the FHS data. Performance is measured by R 2 difference with respect to DPR.MCMC, where a negative value (i.e., values below the red horizontal line) indicates worse performance than DPR.MCMC. Methods for comparison include BVSR (cyan), BayesR (chocolate), LMM (purple), MultiBLUP (green), DPR.VB (red), rjMCMC (black blue), and DPR.MCMC. For each box plot, the bottom and top of the box are the first and third quartiles, while the ends of whiskers represent either the lowest datum within 1.5 interquartile range of the lower quartile or the highest datum within 1.5 interquartile range of the upper quartile. The sample R 2 differences are obtained from 20 replicates of Monte Carlo cross validation for each trait. For DPR.MCMC, the mean predictive R 2 in the test set and the standard deviation across replicates are 0.751 (0.011) for MFP, 0.624 (0.012) for MY, 0.551 (0.017) for SCS and 0.828 (0.012) for GDD, 0.081 (0.033) for LDL, 0.047 (0.017) for GLU, 0.153 (0.044) for HDL, 0.050 (0.020) for TC, 0.044 (0.015) for TG, 0.478 (0.031) for height, 0.169 (0.038) for weight, and 0.137 (0.037) for BMI. The SNP heritability estimates are 0.912 (0.007) for MFP, 0.810 (0.012) for MY, 0.801 (0.012) for SCS, 0.880 (0.013) for GDD, 0.397 (0.024) for LDL, 0.357 (0.036) for GLU, 0.418 (0.024) for HDL, 0.402 (0.036) for TC, 0.334 (0.034) for TG, 0.905 (0.013) for height, 0.548 (0.022) for weight and 0.483 (0.023) for BMI

Overall, consistent with simulations, DPR.MCMC shows robust performance across all traits and is ranked either as the best or the second best method. In the cattle data (Fig. 3a), for MFP and MY, both DPR.MCMC and BayesR perform the best. For SCS, DPR.MCMC performs the best, followed by BayesR. rjMCMC performs well for MFP and MY but poorly for SCS; while LMM and MultiBLUP do not perform well for MFP and MY in the cattle data but their performance improves for SCS. The relative performance of BayesR vs. LMM and MultiBLUP in the cattle data is consistent with the distinct genetic architectures that underlie the three complex traits17, 46: while MFP and MY are affected by a few large or moderate effect SNPs together with many small effect SNPs, SCS is a highly polygenic trait influenced by many SNPs with small effects. BVSR performs poorly for these three traits in the cattle data. In the maize data (Fig. 3a), DPR.MCMC performs the best, followed by BayesR, suggesting that GDD is influenced by a few SNPs with large effects15. In the FHS data (Fig. 3b, c), DPR.MCMC performs the best or among the best for LDL, GLU, TC, Weight, and BMI. Its performance is comparable to BayesR and rjMCMC for Height, and follows right behind BayesR for HDL and TG. Besides DPR and BayesR, rjMCMC also performs well in FHS and is often ranked as the third best method (e.g., for LDL, GLU, Height, and Weight). In contrast to the relatively robust performance of DPR.MCMC, however, all other methods can perform poorly for certain phenotypes. In Fig. 3, for example, BayesR is the second worst method for predicting GLU; LMM is the second worst method for predicting LDL; MultiBLUP is the worst method for predicting Weight and BMI; DPR.VB performs among the worst for LDL and HDL; rjMCMC performs poorly for HDL; while BVSR performs the worst for almost all traits except for LDL, Height, and Weight. The poor performance of BVSR presumably stems from its relatively simple and sparse assumption on the effect sizes.

Because the FHS is a family based study, we use this data to further examine the influence of individual relatedness on prediction performance. To do so, we divided the FHS data into two sub data sets (D1 and D2) with equal sample size but different levels of relatedness (details in Methods): individuals in D1 are more closely related to each other than those in D2. We then compared method performance by performing cross validation in each of the two data sets separately. While the difference between methods becomes smaller due to smaller sample size in the two sub data sets, the relative performance of most methods for the eight traits are largely unchanged in these two sub data sets as compared to that in the complete data (Supplementary Fig. 9 vs. Fig. 3b, c). For example, DPR.MCMC is ranked as the best method or among the best methods for six traits in D1 and for four traits in D2. BayesR performs similarly and is ranked as the best or among the best for four traits in D1 and for five traits in D2. LMM ranks right after DPR.MCMC and BayesR, while the other methods do not perform well here. In addition, all methods generally perform better in D1 than in D2 (Supplementary Fig. 10), suggesting that relatedness improves prediction performance—a phenomenon that has been well recognized by previous studies9, 23, 47,48,49. Besides cross validation within each data set separately, we also performed cross validation between D1 and D2 by predicting traits in one data with parameters inferred from another. The results are again largely consistent with the main results. In particular, DPR.MCMC is ranked as the best or among the best for six traits in D1 to D2 prediction and for eight traits in D2 to D1 prediction. BayesR is ranked as the best or among the best for six traits in D1 to D2 prediction and for eight traits in D2 to D1 prediction. rjMCMC also performs reasonably well and follows right behind DPR.MCMC and BayesR (Supplementary Fig. 11).

Computational time

Finally, we list the computational time of the seven methods for the 12 traits in Table 3. Note that some differences in computational time among methods may reflect implementation issues, including the language environment in which the methods are implemented, rather than fundamental differences between algorithms. In addition, we only list in the table the computation time in the fitting stage. Computation time spent in the prediction stage by plugging in estimated coefficients in the new data is almost ignorable and is thus not listed. For sampling based methods (BVSR, rjMCMC, DPR.MCMC, and BayesR), we measure the computational time based on a fixed number of iterations. However, due to different convergence properties of different algorithms (e.g., BVSR uses a Metropolis-Hastings algorithm, rjMCMC uses a reversible jump MCMC algorithm, while both DPR.MCMC and BayesR use a Gibbs sampling algorithm), a fixed number of iterations in different methods may correspond to different mixing performance. Nevertheless, we can see that DPR.MCMC has a similar computational cost as the other Gibbs based approach (e.g., BayesR), though in the human data both these Gibbs based approaches (DPR.MCMC and BayesR) can be slower than the Metropolis-Hastings approach (BVSR) and the reversible jump MCMC algorithm (rjMCMC) that effectively update only a small subset of significant SNPs in each iteration. In contrast, DPR.VB is orders of magnitude faster than its MCMC counterpart, and is computationally as efficient as the other two non-MCMC based approaches (LMM and MultiBLUP).

Table 3 Mean computational time of the seven methods in the model fitting stage for 12 traits across three data sets

Discussion

We have presented a novel statistical method, DPR, for genetic prediction of complex traits. DPR uses an infinitely many parameters a priori to flexibly model the effect size distribution, and represents the first non-parametric method developed for modeling polygenic traits in genetic association studies. By flexibly modeling the effect size distribution, DPR is capable of adapting to the polygenic architecture underlying many complex traits and enjoys robust performance across a range of phenotypes. With simulations and applications to four real data sets, we have illustrated the benefits of DPR.

We have focused on one application of DPR—genetic prediction of phenotypes. Like some other polygenic methods34, 35, 50, DPR can also be applied to many other polygenic applications. For example, DPR can be used to estimate the proportion of variance in phenotypes explained by all SNPs, a quantity that is commonly referred to as SNP heritability28, 34. Because DPR assumes a flexible effect size distribution that is adaptive to the genetic architecture underlying a given trait, it has the potential to provide accurate estimation of SNP heritability. As another example, DPR can be applied to association mapping. There, we can view the normal component with the smallest variance as the polygenic background, and we can estimate the probability of a SNP being in any normal components other than the smallest one as the posterior inclusion probability (PIP). PIP computed in this way measures SNP marginal association strength in the presence of polygenic effects, and may represent a more powerful association indicator than standard single SNP association test statistics33, 50. An important feature of using PIP in the context of Bayesian models is that PIP quantifies the uncertainty of association strength33, 50, which is a desirable feature that is not easily achieved by penalized frequentist counterparts51.

Here, we have restricted ourselves to applying DPR to continuous phenotypes. For case control studies, we could follow previous approaches of treating binary phenotypes as continuous traits and apply DPR directly34, 35, 41. However, it would be desirable to extend DPR to accommodate case control data or other discrete data types in a principled way, by, for example, extending DPR into the generalized linear model framework. In particular, we could use a probit or a logistic link to extend DPR to directly model case control data. We could use a Poisson or an over-dispersed Poisson distribution to extend DPR to model count data. Extending DPR to various discrete data types would likely lead to wider applications of DPR beyond GWASs. For instance, by modeling count data, DPR could be used to perform differential expression analysis52 or expression QTL mapping in RNA sequencing studies53, 54. Similarly, by modeling proportional data, DPR could be used to perform differential methylation analysis or methyl-QTL mapping in bisulfite sequencing studies55. Extending DPR to modeling discrete data types using the generalized linear model framework is thus an important avenue for future research.

In the present study, while we used unrelated individuals for GEUVADIS gene expression prediction and PrediXcan tests, we used related individuals for the other two real data applications. Related individuals not only share similar genetic background but also are likely influenced by a common set of environmental factors47, 56. In addition, untyped causal SNPs in related individuals can be more easily tagged by neighboring typed SNPs than that in unrelated individuals, thanks to the relatively high linkage disequilibrium (LD) in related data. Because both the shared environmental factors and easy tagging of causal SNPs can facilitate prediction, cross validation using related individuals often results in better prediction performance than using unrelated individuals9, 23, 47,48,49. However, we caution that the prediction accuracy measured in the test data obtained with cross-validation in related individuals are likely inflated if our ultimate goal is to perform prediction in unrelated individuals instead of related ones. In addition, the predictive model inferred from related individuals may not generalize well to unrelated individuals who are not necessarily influenced by the same set of environmental factors and who do not share the same LD pattern near the causal SNPs. We have attempted to tease apart the influence of relatedness on prediction performance by splitting the FHS data into two parts with different levels of relatedness. Our results indeed show that, while the relative performance of various methods remains largely the same, the absolute performance of all methods do increase with individual relatedness. Additionally, while our method performs well relative to the other methods, we caution that DPR’s prediction accuracy is still unlikely of practical use in human clinical setting. Studies on unrelated individuals or studies using a fully independent validation data are likely required to establish the practical utility of prediction methods, which often have unsatisfactory performance there9, 47, 57. Despite the practical importance of using completely independent or cross-population studies for prediction performance validation, however, we also want to point out its potential caveat: using completely independent data for cross-validation may fail to correctly characterize the relative performance of different methods. In particular, a good method that properly captures the signal in the training data may suffer in the validation data due to different LD patterns between the two data sets. Similarly, a poor method that fails to capture the signal in the training data may perform well in the validation data where such signal is no longer relevant. Therefore, using training and validating data that are both representative of the study population is important to not only ensure a proper comparison among methods but also to ensure the clinical relevance and wide applicability of the prediction methods. Exploring the use of such data is an important direction for future research.

DPR is not without its limitations. Perhaps the biggest limitation is its computational cost. Like any other MCMC-based approaches34, 35, 58, our Gibbs algorithm for fitting DPR is computationally slow and can only be applied to moderate-sized GWAS studies. To make DPR widely applicable, we have explored the use of VB approximation for fitting DPR. VB approximation obtains an approximate posterior distribution through optimization59 and represents a much faster alternative to MCMC sampling. Indeed, DPR.VB is orders of magnitude faster than DPR.MCMC. However, despite its faster computational speed, the VB algorithm is less accurate than MCMC when SNP heritability is large, sometimes by quite a large margin (e.g., PVE = 0.80 in simulations). The loss of accuracy in VB is not unexpected because our VB assumes that the posterior distributions of the SNP effect sizes are independent from each other. Posterior independence is an unrealistic assumption given that SNP genotypes are correlated through LD. Therefore, it is important to explore alternative VB algorithms to incorporate the posterior correlation among effect sizes, by, for example, adapting algorithms developed elsewhere60, 61. It would be ideal if we could develop algorithms that can achieve a high predictive performance as DPR.MCMC but incurs a small computational cost as DPR.VB. Certainly, besides developing alternative algorithms to MCMC, there is still room for improvement on our MCMC algorithm. For example, we could use all individuals to compute some quantities while use only a subset of individuals to compute other quantities, as in our previous MQS method62, in order to reduce the computational burden while maintaining the accuracy of the algorithm. In any case, developing efficient and accurate algorithms likely represents a key step to adapt existing polygenic methods to association studies that are orders of magnitude larger.

Methods

Overview of DPR

We provide a brief overview of DPR here. Detailed methods and algorithms are provided in the Supplementary Note. To model the relationship between phenotypes and genotypes, we consider the following multiple regression model

$${\bf{y}} = {\bf{W\alpha }} + {\bf{X\beta }} + {\bf{\epsilon }},\,{\bf{\epsilon }}\sim N\left( {0,\sigma _e^2{I_n}} \right),$$
(1)

where y is an n-vector of phenotypes measured on n individuals; W is an n by c matrix of covariates including a column of 1 s for the intercept term; α is a c-vector of coefficients; X is an n by p matrix of genotypes; β is the corresponding p-vector of effect sizes; ε is an n-vector of residual errors where each element is assumed to be independently and identically distributed from a normal distribution with variance \(\sigma _e^2\).

Like many previous methods9, 19, 28, 34, 41, we assume that the effect size of ith SNP, β i , follows a normal distribution with variance σ 2, i.e., β i  ~ N(0, σ 2). Unlike previous methods, however, we specify a non-parameter prior on the hyper-parameter σ 2 to induce a non-parametric prior on β i . To motivate our prior choice for σ 2, it helps to provide a brief review of the previous polygenic prediction methods. Among the many polygenic prediction methods developed recently, a surprisingly large number of them assume a priori that the effect sizes follow a particular class of prior distribution—the scale mixture of normal distributions63. Specifically, these methods assume that each effect size β i follows a normal distribution β i  ~ N(0, σ 2), with the variance parameter (i.e., the scale parameter) σ 2 following another distribution p(σ 2). The prior distribution on σ 2, p(σ 2), thus differentiates many different predictive methods. For example, LMM assumes a flat prior p(σ 2) that is proportional to a constant9, 28. The Bayes alphabetic methods assume that σ 2 follows an inverse gamma distribution to induce a t-prior on β i 10, 18, 64. The Bayesian lasso assumes that σ 2 follows a Rayleigh distribution to induce a double exponential distribution (a.k.a. Laplace distribution) on β i 30, 58. NEG assumes an exponential gamma distribution on σ 2 to induce an NEG prior on β i 32. BVSR and BayesCπ assume a mixture of a point mass at zero with another flat prior to induce a point-normal distribution on β i 29, 33. BSLMM assumes a mixture of two point masses to induce a normal mixture distribution on β i 34. While BayesR assumes a three point masses together with another point mass at zero on σ 2 to also induce a normal mixture distribution on β i 35.

The scale mixture of normal distributions is flexible because different distributions on the scale parameter σ 2 can be used to induce many smooth unimodal distributions on β i . However, existing predictive methods explicitly make a parametric prior assumption on σ 2, which necessarily relies on a limited number of parameters to characterize the distribution on σ 2. Consequently, the induced effect size distribution on β i from a parametric prior on σ 2 can be restrictive and may sometimes fail to resemble closely the unknown truth effect size distribution underlying complex traits. Motivated by the potential drawback of parametric priors on σ 2, we instead develop a non-parametric prior distribution on σ 2 to induce a more flexible distribution on β i . Because a non-parametric distribution is characterized by effectively infinitely many parameters, our induced effect size distribution on β i has the potential to resemble a wide range of genetic architectures and achieve robust predictive performance across a variety of traits.

Technically, we assume σ 2 follows a Dirichlet process (DP) prior37,38,39,40

$${\sigma ^2}\sim G,G\sim {\rm{DP}}\left( {H,\lambda } \right),$$
(2)

where H is the base distribution, and λ is the concentration parameter that describes how the distribution on σ 2, G, deviates from the base distribution. Here, we use an inverse gamma distribution as the base distribution and set the two parameters in the inverse gamma distribution to small values to keep the prior relatively uninformative. We treat the concentration parameter λ as an unknown hyper-parameter and intend to infer it from the data at hand. Because we use the Dirichlet process as a prior for the latent variance parameter σ 2 we refer to our regression model based on Eqs. (1) and (2) as the latent Dirichlet Process Regression, or DPR. The induced marginal distribution on β i (after integrating out σ 2) is also non-parametric and can robustly resemble a large classes of unimodal distributions. Indeed, the distribution on β i can be used to adaptively and accurately approximate a t-distribution, a point-t mixture distribution, a mixture of step functions, as well as the marginal effect sizes estimated from a real data set; whereas a normal distribution cannot (Fig. 1). Therefore, our prior distribution on the effect size can adaptively approximate a wide range of possible effect size distributions underlying complex traits. Since accurate modeling of the effect size distribution is a key to achieve accurate prediction performance24, 34, 36, we expect our non-parametric model to perform robustly well across a range of polygenic architectures.

It is important to point out that our modeling assumption on the effect sizes β i is also mathematically equivalent to a Dirichlet process normal mixture, which is a mixture of normal distributions with infinitely many normal components. Specifically, using the stick-breaking constructive representation of the Dirichlet process59, we can re-write our modeling assumption on β i in an equivalent form as

$$\begin{array}{c}\\ {\beta _i}\sim \mathop {\sum }\limits_{k = 1}^{ + \infty } {\pi _k}N\left( {0,\sigma _k^2} \right),\\ \\ {\pi _k} = {\nu _k}\mathop {\prod }\limits_{l = 1}^{k - 1} (1 - {\nu _l}),{\nu _k}\sim {\rm{Beta}}\left( {1,\lambda } \right),\\ \end{array}$$
(3)

where λ is the same concentration parameter as in Eq. (2), and determines the number of normal components in the model and subsequently the model complexity59. Each \(\sigma _k^2\) in the above equation follows the base distribution H. From the normal mixture equivalence aspect, our method effectively generalizes many previous methods18, 34, 35 that use a fixed, often small, number of normal components to using infinitely many normal components a priori. Although the prior number of normal components in our model is infinite, the posterior number of components for any given data set will be finite, and can be automatically inferred based on the data at hand. Therefore, our model has the potential to adjust the model complexity according to the data complexity, and has the potential to adapt to a wide range of polygenic architectures.

To fit our model, we develop two complementary algorithms: one is based on the MCMC algorithm, and the other is based on the VB approximation. The MCMC sampling algorithm, which we refer to as DPR.MCMC, is accurate but computationally slow. The VB algorithm, which we refer to as DPR.VB, is computationally fast, but, as we will show in the results, is often less accurate. The two algorithms provide users the choice of speed vs. accuracy. The details of the two algorithms are provided in Supplementary Note.

Simulations

We used genotypes from an existing cattle GWAS data set17 with 5024 individuals and 42,551 SNPs and simulated phenotypes. To cover a range of possible genetic architectures, we consider eight simulation settings from four different simulation scenarios to cover a range of possible genetic architectures:

Scenario I satisfies the DPR modeling assumption, where all SNPs are causal and SNPs in different effect-size groups have different effects. Specifically, we randomly selected 10 group-one SNPs, 100 group-two SNPs, 1000 group-three SNPs, and set the remaining SNPs as group-four SNPs. We simulated SNP effect sizes all from a standard normal distribution but scaled their effects in each group separately so that the proportion of genetic variance explained by the four groups are 0.05, 0.15, 0.20, and 0.60, respectively. We set the total proportion of phenotypic variance (PVE; i.e., SNP heritability) to be either 0.2, 0.5, or 0.8, representing low, moderate, and high heritability, respectively. This simulation scenario consists of one simulation setting for each PVE.

Scenario II satisfies the BayesR modeling assumption, where a small proportion of SNPs are causal. These causal SNPs come from three effect-size groups. The simulations were similar to scenario I with the only exception that the group-four SNPs have zero effects. Here, the proportion of PVE by the three groups are 0.1, 0.2, and 0.7, respectively. Again, we set the total PVE to be either 0.2, 0.5, or 0.8. This simulation scenario consists of one simulation setting for each PVE.

Scenario III is similar to Scenario I except that SNPs come from two effect-size groups, thus representing a simpler scenario than I. In particular, we selected either c = 10, 100, or 1000 SNPs as group-one SNPs and set the remaining SNPs as group-two SNPs. We simulated their effect sizes from a standard normal distribution and scaled their effects in each group separately so that the proportion of PVE by the two groups are 0.2 and 0.8, respectively. Again, we set the total PVE to be either 0.2, 0.5, or 0.8. This simulation scenario consists of three simulation settings for each PVE (c = 10, 100, or 1000).

Scenario IV is related to the assumption made in LMM and MultiBLUP. Here, all SNPs have non-zero effects and their effect sizes come from either a normal distribution, a t-distribution with four degrees of freedom, or a Laplace distribution. We scaled their effect sizes further so that the total PVE equals 0.2, 0.5, or 0.8. This simulation scenario consists of three simulation settings for each PVE (normal, t, or Laplace).

In each setting, we performed 20 simulation replicates. In each replicate, we randomly split the data into a training data with 80% individuals and a test data with the remaining 20% individuals. We then fitted different methods on the training data and evaluated their prediction performance on the test data (i.e., Monte Carlo cross validation).

GEUVADIS data

The GEUVADIS data42 contains gene expression measurements for 465 individuals from five different populations: CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI), and Yoruba (YRI). Following previous studies65, we focused only on protein coding genes and lincRNAs that are annotated from GENCODE66 (release 12). We removed lowly expressed genes that have zero counts in at least half of the individuals and obtained a final set of 15,810 genes. Afterwards, following previous studies67, we performed PEER normalization to remove confounding effects and unwanted variations. In order to remove potential population stratification, we quantile normalized the gene expression measurements across individuals in each population to a standard normal distribution, and then quantile normalized the gene expression measurements to a standard normal distribution across individuals from all five populations. In addition to the gene expression data, all individuals in GEUVADIS also have their genotypes sequenced in the 1000 Genomes Project. Among the sequenced genotypes, we filtered out SNPs that have a Hardy-Weinberg equilibrium (HWE) p-value < 10−4, a genotype call rate < 95%, or an MAF < 0.01. We retained a total of 7,072,917 SNPs for analysis. We intersected these SNPs with imputed SNPs from WTCCC data4 (see below; for the purpose of performing gene set tests) and kept a final set of 2,793,818 overlapping SNPs for analysis. Then, for each gene in turn, we obtained its cis-SNPs that are located within either 100 kb upstream of the transcription start site or 100 kb downstream of the transcription end site, resulting in an average of 175 cis-SNPs per gene.

WTCCC data

The WTCCC4 1 data consists of about 14,000 cases from seven common diseases and 2938 shared controls. The cases include 1963 individuals with type 1 diabetes (T1D), 1748 individuals with Crohn’s disease (CD), 1860 individuals with rheumatoid arthritis (RA), 1868 individuals with bipolar disorder (BD), 1924 individuals with type 2 diabetes (T2D), 1926 individuals with coronary artery disease (CAD), and 1952 individuals with hypertension (HT). We obtained quality controlled genotypes from WTCCC and imputed missing genotypes using BIMBAM68. We obtained a total of 458,868 SNPs shared across all individuals. We then further imputed SNPs using the 1000 Genomes as the reference panel with SHAPEIT and IMPUTE269. We filtered out SNPs that have a HWE p-value < 10−4, a genotype call rate < 95%, or an MAF < 0.01 to obtain a total of 2,793,818 imputed SNPs. For PrediXcan analysis12, as in the GEUVADIS data (see above), we focused on the same 15,810 genes. As in ref. 12, we further restricted our association analysis on a set of 4343 genes that have a predictive R 2 above 0.01 by all predictive methods.

Cattle data

The cattle data17 consists of 5024 samples and 42,551 SNPs after removing SNPs that have a HWE p-value < 10−4, a genotype call rate <95%, or an MAF < 0.01. For the remaining SNPs, we imputed missing genotypes with the estimated mean genotype of that SNP. We analyzed three traits: MFP, MY, and SCS. All phenotypes were quantile normalized to a standard normal distribution before analysis.

Maize data

The maize data15 consists of 2267 inbred accessions and 98,385 SNPs after removing SNPs that have a HWE p-value < 10−4, a genotype call rate <95%, or an MAF < 0.01. For the remaining SNPs, we imputed missing genotypes with the estimated mean genotype of that SNP. We used the GDDs to silking as the phenotype in genomic selection. GDD was calculated using climate data from weather stations located near the farms15, and was quantile normalized to a standard normal distribution before analysis.

FHS data

The FHS data contains genotype data on 6950 individuals and 394,174 SNPs. We filtered out SNPs that have a HWE p-value < 10−4, a genotype call rate <95%, or an MAF < 0.01 to obtain a final set of 387,741 SNPs. For these SNPs, we imputed missing genotypes with the estimated mean genotype of that SNP. We performed analysis on eight traits: five commonly used plasma traits that include LDL cholesterol, GLU, HDL cholesterol, TC, and TGs; and three anthropometric traits that include height, weight, and BMI. Each trait was quantile normalized to a standard normal distribution before analysis. Note that the FHS data is a family-based study where individuals are genetically related. To tease apart the influence of individual relatedness on prediction performance among methods, we also divided the samples in FHS into two separate data sets with different levels of relatedness. Specifically, we first used genotypes to compute the genome-wide genetic relatedness matrix. We then ordered individual pairs based on their genetic relatedness values. From top to bottom of the ordered individual pair list, we selected individuals from individual pairs with high levels of relatedness into a data set D1, and continued this process until the sample size of D1 was half of the full data. We then kept the remaining individuals from individual pairs with low levels of relatedness into a data set D2. The relatedness threshold for separating individual pairs between the two data sets was 0.151. Nevertheless, the majority pairs in D1 and D2 have genetic relatedness values close to zero: 99.6% of pairs in D1 and 99.9% of pairs in D2 have a genetic relatedness value between +/−0.01. As another way of measuring relatedness, we also computed the effective number of chromosome segments (M e)49 in the two data. M e is a crucial parameter that measures the effective number of independent SNPs and is also closely related to the effective number of independent individuals: M e is small in a data with related individuals and is large in a data with unrelated individuals. A small value of M e often correlates to high prediction accuracy48, 49, 70. With 20 cross-validation replicates, we estimated M e in D1 and D2 sub data to be 34541.39 (sd = 140.87) and 81786.01 (sd = 651.52), respectively.

Other methods

We compared the performance of DPR.MCMC and DPR.VB mainly with five existing methods: (1) LMM28 as implemented in the GEMMA software (version 0.95alpha); (2) BVSR29 as implemented in the GEMMA software (version 0.95alpha); (3) MultiBLUP41 as implemented in the LDAK software (version 4.9); (4) BayesR35 as implemented in the bayesR software; (5) rjMCMC20 as implemented in the gwas_rjmc1.163 software. We used default settings to fit all these methods. For rjMCMC, because it requires us to provide a variance component parameter, we used LMM to estimate the variance component first in all analyses. In addition, rjMCMC does not output parameter estimates. Therefore, for the PrediXcan analysis, we first merged the GEUVADIS and WTCCC files for every gene, labeled WTCCC individuals as having missing phenotypes, and then ran rjMCMC on these files to obtained predicted gene expression values using the WTCCC genotype data. The same strategy was also applied to perform cross-validation prediction between D1 and D2 sub data sets in FHS. For gene expression prediction and PrediXcan analysis, following the original PrediXcan paper12, we also used ENET44, which is implemented in the R package glmnet (version 1.9-5). For ENET, following12, we set one penalty parameter (i.e., α) to be 0.5 and selected the other one using 100-fold cross validation in the training data.

Code availability

Our method is implemented in the DPR software, freely available at http://www.xzlab.org/software.html.

Data availability

No data were generated in the present study. The GEUVADIS gene expression data is publicly available at http://www.geuvadis.org. The genotype data from the 1000 genomes project is publicly available at http://www.internationalgenome.org. The WTCCC genotype and phenotype data is publicly available at https://www.wtccc.org.uk. The genotype and phenotype data from the cattle and maize studies are available from the authors upon reasonable request and with permission of Prof. Xiaolei Liu at the HuaZhong Agriculture University (xiaoleiliu@mail.hzau.edu.cn). The FHS genotype and phenotype data is available in dbGaP (https://www.ncbi.nlm.nih.gov/gap) with accession number phs000007.