Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models

Using genotype data to perform accurate genetic prediction of complex traits can facilitate genomic selection in animal and plant breeding programs, and can aid in the development of personalized medicine in humans. Because most complex traits have a polygenic architecture, accurate genetic prediction often requires modeling all genetic variants together via polygenic methods. Here, we develop such a polygenic method, which we refer to as the latent Dirichlet process regression model. Dirichlet process regression is non-parametric in nature, relies on the Dirichlet process to flexibly and adaptively model the effect size distribution, and thus enjoys robust prediction performance across a broad spectrum of genetic architectures. We compare Dirichlet process regression with several commonly used prediction methods using simulations. We further apply Dirichlet process regression to predict gene expression levels, to conduct PrediXcan-based gene set tests, to perform genomic selection for four traits in two species, and to predict eight complex traits in a human cohort.

The DPR model can be written as

$$\mathbf{y} = \mathbf{W}\boldsymbol{\alpha} + \mathbf{X}\tilde{\boldsymbol{\beta}} + \boldsymbol{\epsilon},$$

where $\mathbf{y}$ is an $n$-vector of phenotypes measured on $n$ individuals; $\mathbf{W}$ is an $n$ by $c$ matrix of covariates including a column of 1s for the intercept term; $\boldsymbol{\alpha}$ is a $c$-vector of coefficients; $\mathbf{X}$ is an $n$ by $p$ matrix of genotypes; $\tilde{\boldsymbol{\beta}}$ is the corresponding $p$-vector of effect sizes; and $\boldsymbol{\epsilon}$ is an $n$-vector of residual errors, each element of which is assumed to be independently and identically distributed from a normal distribution with variance $\sigma_e^2$. Note that we use $\tilde{\boldsymbol{\beta}}$ here instead of $\boldsymbol{\beta}$ as in the main text for reasons that will become clear shortly.

As explained in the main text, we assign a normal prior $N(0, \sigma^2\sigma_e^2)$ on each element of $\tilde{\boldsymbol{\beta}}$, and we further assign a Dirichlet process prior on the variance parameter $\sigma^2$. (Note that, different from the main text, we also scale the variance with the error variance $\sigma_e^2$ to simplify the algorithm.) Integrating out $\sigma^2$ induces a Dirichlet process normal mixture prior on each $\tilde{\beta}_i$, with mixture proportions constructed through the stick-breaking representation of the Dirichlet process (equation (2)). For the hyper-parameters, we set $a_{0k}$, $b_{0k}$, $a_{0b}$, $b_{0b}$, $a_{0e}$, and $b_{0e}$ in the inverse gamma distributions to be 0.1; we set $a_{0\lambda}$ and $b_{0\lambda}$ in the gamma distribution to be 1 and 0.1, respectively; and we use a limiting normal prior for each $\alpha_j$ with the normal variance going to infinity, since there is generally enough information in the likelihood to overwhelm any reasonable prior assumption for these parameters.

To improve mixing, following [1], we group the effect sizes that correspond to the first normal component, which has the smallest variance $\sigma_b^2$ in equation (2), into a vector $\mathbf{b}$, and sample their combined genetic contribution $\mathbf{u}$ in place of the individual elements of $\mathbf{b}$ (equation (4) gives the relationship between $\mathbf{u}$ and $\mathbf{b}$).

Difference between DPR and BayesR

Before we proceed further, it is useful to clarify the differences between DPR and the previously proposed method BayesR [3]. While our method is motivated in part by BayesR, DPR differs from BayesR in five important areas. First, BayesR is a sparse model while DPR is a non-sparse model: BayesR assumes that most SNPs have zero effects, while DPR assumes that all SNPs have non-zero effects. As a result, BayesR and DPR are expected to perform differently in sparse versus non-sparse settings. Second, BayesR fixes the ratio between the variance parameters of the three non-zero components, whereas DPR estimates these variance parameters from the data.

Joint posterior and conditional distributions

The joint posterior distribution of all model parameters follows from combining the likelihood with the priors described above; in it, $\mathbf{H} = \sigma_b^2\mathbf{K} + \mathbf{I}_n$ and $C$ is a normalizing constant. To simplify notation, we will ignore all constant terms from now on. Based on the joint posterior, we can derive the conditional posterior distribution for each parameter in turn. When we derive these conditional distributions, we will also ignore the other parameters on which these distributions are conditioned, to simplify the presentation.
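Before turning to the individual sampling steps, the short Python sketch below illustrates the prior specified above by drawing effect sizes from a truncated stick-breaking approximation to the Dirichlet process normal mixture, using the hyper-parameter values quoted in the text (inverse gamma parameters of 0.1; Gamma(1, 0.1), in rate parameterization, on the concentration parameter). The Beta(1, lambda) stick-breaking form is the standard construction assumed here, and the truncation level K, the function name, and the use of numpy are illustrative choices rather than part of the DPR software.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_dp_normal_mixture_prior(p, K=20, a0k=0.1, b0k=0.1,
                                      a0lam=1.0, b0lam=0.1, sigma_e2=1.0):
    """Draw p effect sizes from a truncated stick-breaking approximation
    to the Dirichlet process normal mixture prior (a sketch; K and the
    rate parameterization of the gamma prior are assumptions)."""
    # concentration parameter: lambda ~ Gamma(a0lam, rate=b0lam)
    lam = rng.gamma(a0lam, 1.0 / b0lam)
    # stick-breaking proportions: nu_k ~ Beta(1, lambda)
    nu = rng.beta(1.0, lam, size=K)
    nu[-1] = 1.0                                   # close the last stick
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - nu[:-1])))
    pi = nu * sticks                               # pi_k = nu_k * prod_{l<k}(1 - nu_l)
    # per-component variances: sigma_k^2 ~ Inverse-Gamma(a0k, b0k) (heavy tailed)
    sigma_k2 = 1.0 / rng.gamma(a0k, 1.0 / b0k, size=K)
    # assign each SNP to a component, then draw its effect size;
    # the component variance is scaled by sigma_e^2 as in the text
    z = rng.choice(K, size=p, p=pi)
    beta = rng.normal(0.0, np.sqrt(sigma_k2[z] * sigma_e2))
    return beta, pi, sigma_k2

beta, pi, sigma_k2 = simulate_dp_normal_mixture_prior(p=1000)
```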
The conditional distributions for sampling $\beta_{ik}$ and $\gamma_{ik}$, for $\nu_k$, for $\lambda$, and for $\sigma_e^2$ all take standard forms and can be sampled from directly. The conditional distribution for $\sigma_b^2$, in contrast, is in an unknown distributional form. Nevertheless, it is straightforward to sample from this univariate distribution using rejection sampling, importance sampling, or other standard methods [4]. Here, we sample $\sigma_b^2$ based on a re-parameterization of $\sigma_b^2$ following [1,5]. Specifically, we define a new parameter $h^2$ [2,6,7] and use the Metropolis-Hastings algorithm to generate posterior samples for $h^2$. In particular, we use an independent proposal for $h^2$, with a Beta(2,8) distribution as the proposal distribution. With each sampled value of $h^2$, we can then obtain a sampled value of $\sigma_b^2$.
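As an illustration, the sketch below implements a single Metropolis-Hastings update for $h^2$ with an independent Beta(2,8) proposal, as described above. The log conditional posterior of $h^2$ (log_post) and the mapping from $h^2$ back to $\sigma_b^2$ are model specific and are left as user-supplied placeholders; the function name and signature are hypothetical rather than part of the DPR software.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def mh_update_h2(h2_current, log_post, rng=rng, a=2.0, b=8.0):
    """One Metropolis-Hastings update for h^2 with an independent
    Beta(a, b) proposal; `log_post` returns the (unnormalized) log
    conditional posterior of h^2 and is a placeholder here."""
    h2_prop = rng.beta(a, b)
    # acceptance ratio for an independent proposal: posterior ratio
    # times the reverse/forward proposal density ratio
    log_alpha = (log_post(h2_prop) - log_post(h2_current)
                 + stats.beta.logpdf(h2_current, a, b)
                 - stats.beta.logpdf(h2_prop, a, b))
    if np.log(rng.uniform()) < log_alpha:
        return h2_prop
    return h2_current
```

Because the proposal does not depend on the current state, the acceptance ratio includes the ratio of proposal densities evaluated at the current and proposed values.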

Sampling b
Finally, because of the relationship between $\mathbf{u}$ and $\mathbf{b}$ in equation (4), we can obtain the posterior conditional distribution for $\mathbf{b}$. Parameter estimates are then formed by averaging over the stored posterior samples, where $L$ denotes the total number of MCMC iterations after burn-in. Note that this selection strategy may lead to a local optimum and consequently hinder the performance of our method; alternative and better strategies may further improve DPR's prediction performance. For the final 50,000 MCMC iterations, we discarded the first 10,000 as burn-in and kept the remaining 40,000 for parameter estimation. We did not thin the MCMC chain [18]; retaining all post burn-in samples may help improve prediction performance further. Finally, we also provide trace-plots of the log posterior likelihood of our model in all real data analyses, following the recommendation in [15,19]. These trace-plots serve as a summary assessment of parameter convergence.
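A minimal sketch of how the retained draws might be summarized under this setup (the first 10,000 of 50,000 iterations discarded as burn-in, no thinning), together with a trace-plot of the log posterior likelihood, is given below; the array names and the helper function are hypothetical and not taken from the DPR code.

```python
import numpy as np
import matplotlib.pyplot as plt

def summarize_chain(beta_draws, log_post_trace, burn_in=10_000):
    """Posterior-mean effect sizes from stored MCMC draws plus a
    trace-plot of the log posterior likelihood.

    `beta_draws` is an (iterations x p) array of sampled effect sizes
    and `log_post_trace` a vector of per-iteration log posterior
    values; both stand in for the quantities stored by the sampler."""
    kept = beta_draws[burn_in:]          # e.g. the remaining 40,000 of 50,000 draws
    beta_hat = kept.mean(axis=0)         # posterior mean estimate, no thinning
    plt.plot(log_post_trace)             # trace-plot as a convergence summary
    plt.xlabel("MCMC iteration")
    plt.ylabel("log posterior likelihood")
    plt.savefig("trace_plot.png")
    return beta_hat
```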

Variational Bayesian approximation

Despite the many algorithmic innovations described above, the resulting MCMC algorithm is still computationally heavy. Therefore, we develop an alternative, much faster algorithm based on variational Bayesian approximation [12,20-23]. Variational Bayesian approximation attempts to approximate the joint posterior distribution by a variational distribution $q(\boldsymbol{\theta}) = \prod_j q(\theta_j)$ that assumes posterior independence among the parameters $\theta_j$. To do so, we maximize the evidence lower bound (ELBO) of the marginal likelihood with respect to the variational distribution. To obtain the variational approximation, we maximize this quantity with respect to each $q(\theta_j)$ in turn: for each $\theta_j$, we set the corresponding derivative to zero. Because $p(\boldsymbol{\theta}, \mathbf{y})$ does not depend on the variational distribution $q(\theta_j)$, this leads to an update for each $\theta_j$ of the form

$$q(\theta_j) = C \exp\left\{E_{q(\boldsymbol{\theta}_{-j})}\left[\log p(\boldsymbol{\theta}, \mathbf{y})\right]\right\},$$

where again $C$ is a normalizing constant and the expectation is taken over the variational distributions of all parameters other than $\theta_j$. We will ignore the constant terms in the following updates.

We follow the truncated stick-breaking approximation approach of Blei and Jordan [12] and use a finite mixture with a fixed number of normal components, $K$, as an approximation to the posterior distribution. The parameter $K$ is treated as a variational parameter, and we choose $K$ by optimizing the ELBO. Note again that although we use a finite mixture as an approximation to the posterior distribution, our likelihood still consists of a mixture of infinitely many normal distributions [12].
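The sketch below shows the generic shape of such a coordinate-wise variational scheme, including the ELBO-based choice of the truncation level K; the model-specific factor updates and ELBO computation are left as placeholder arguments, and all names are hypothetical.

```python
import numpy as np

def run_cavi(update_fns, compute_elbo, state, max_iter=1000, tol=1e-6):
    """Generic coordinate-ascent skeleton for a mean-field variational
    algorithm. `update_fns` holds one function per factor q(theta_j),
    each taking and returning the dictionary of variational parameters
    `state`; `compute_elbo` evaluates the ELBO for the current state.
    Both are placeholders for the model-specific expressions."""
    elbo_old = -np.inf
    for _ in range(max_iter):
        for update in update_fns:      # update each q(theta_j) in turn
            state = update(state)
        elbo = compute_elbo(state)
        if elbo - elbo_old < tol:      # ELBO is non-decreasing under these updates
            break
        elbo_old = elbo
    return state, elbo

def select_K(candidate_Ks, fit_for_K):
    """Choose the truncation level K by maximizing the ELBO;
    `fit_for_K` runs the variational algorithm for a given K and
    returns its final ELBO (a placeholder for the DPR-specific fit)."""
    elbos = {K: fit_for_K(K) for K in candidate_Ks}
    return max(elbos, key=elbos.get)
```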
Here, the covariance matrix is diagonal, which facilitates computation. As in the MCMC algorithm, we use the relationship in equation (4) to obtain the mean of $\mathbf{b}$ at the end of the algorithm. The estimated mean of $\mathbf{b}$ is added back to the mean of $\boldsymbol{\beta}$ to obtain a mean estimate for $\tilde{\boldsymbol{\beta}}$.

Variational distribution for
where $\Sigma_{ii}$ is the $i$th diagonal element of $\boldsymbol{\Sigma}$ given in equation (44).

ELBO and convergence
Finally, combining the terms above yields the ELBO, which we use both to choose the truncation level $K$ and to monitor convergence of the variational algorithm.
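The model-specific expansion is not reproduced here; for reference, the generic mean-field form that it instantiates is

$$\mathrm{ELBO}(q) = E_{q}\left[\log p(\mathbf{y}, \boldsymbol{\theta})\right] - \sum_{j} E_{q}\left[\log q(\theta_j)\right],$$

and monitoring the change in this quantity after each full round of coordinate updates provides the convergence check referred to in this section, alongside its role in choosing $K$.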