Prediction of genomic breeding values using new computing strategies for the implementation of MixP

Dong, Linsong; Fang, Ming; Wang, Zhiyong

doi:10.1038/s41598-017-17366-2

Article
Open access
Published: 08 December 2017

Scientific Reports volume 7, Article number: 17200 (2017)

Abstract

MixP is an implementation that uses the Pareto principle to perform genomic prediction. This study was designed to develop two new computing strategies: one for non-MCMC-based MixP (FMixP) and the other for MCMC-based MixP (MMixP). The difference is that MMixP can estimate the variances of SNP effects and the probability that a SNP has a large variance, whereas FMixP cannot. Simulated data from an international workshop and real data on large yellow croaker were used as the materials for the study. Four Bayesian methods, BayesA, BayesCπ, MMixP and FMixP, were compared for their predictive results. The results show that BayesCπ, MMixP and FMixP perform better than BayesA for the simulated data, but all methods have very similar predictive abilities for the large yellow croaker. However, FMixP is computationally much faster than the MCMC-based methods. These strategies may be useful for future applications of genomic prediction.

Introduction

The advent of next generation sequencing technology has accelerated the development of the theory behind quantitative molecular genetics approaches, such as quantitative trait loci (QTL) mapping, genome-wide association (GWA) studies and genomic selection. Genomic selection was first proposed by Meuwissen et al. as an efficient method to predict animal breeding outcomes¹. Recently, various implementations have been proposed for genomic prediction, such as genomic best linear unbiased prediction (GBLUP)², ridge-regression BLUP (RRBLUP), BayesA, BayesB¹, BayesCπ³, BayesLASSO^4,5, BayesSSVS⁶, fast Bayesian methods^7,8, MixP⁹, among others. GBLUP and RRBLUP assume a constant variance for all SNP loci, which may be an imprecise assumption if a trait is affected by a small number of QTL loci¹. Bayesian methods propose more flexible prior assumptions for SNP effects (or variances). Generally, the prior distributions of Bayesian methods assume that there are large variances in some SNP loci and small or even zero variances at other loci, which seems to be more realistic. The implementations of Bayesian methods, such as BayesA, B, Cπ and LASSO, are mainly based on Markov chain Monte Carlo (MCMC) algorithms, requiring much more computation time to estimate SNP effects. To increase the computational speed, researchers have suggested some fast Bayesian methods, such as fast BayesB⁷ and emBayesB⁸. Yu and Meuwissen proposed another fast Bayesian method using the Pareto principle to perform genomic prediction⁹.

The Pareto principle was proposed by the economist Vilfredo Pareto at the beginning of the 20th century¹⁰. This principle states that approximately 20% of the population possesses 80% of the wealth in a country. Similar theories have been further applied in various fields, such as in genomic prediction by Yu and Meuwissen⁹, resulting in the method termed MixP. The prior distribution of MixP is a mixture of two normal distributions, which assumes that x% of the SNPs cause (100 − x)% of the genetic variance, so the remaining (100 − x)% of SNPs decide the remaining x% of genetic variance. Here we assume γ = x%, and (1 − γ) = (100 − x)%. The large and small variances are proposed as follows⁹:

$$\left\{\begin{array}{l}{\sigma }_{1}^{2}=\frac{(1-\gamma ){V}_{g}}{\gamma M}\\ {\sigma }_{2}^{2}=\frac{\gamma {V}_{g}}{(1-\gamma )M}\end{array}\right.,$$

(1)

where σ₁² and σ₂² represent the large and small variance of a SNP effect, respectively; V_g is the total additive genetic variance; M is the number of SNPs; and γ ≤ 0.5. The prior for MixP assumes that all SNPs have effects, but each SNP has only two possible variances: σ₁² or σ₂². This is similar but not identical to the assumptions of two other Bayesian methods (BayesA and BayesCπ). In BayesA, the prior also assumes that all SNPs have effects, but each SNP has its own variance. The variances of the SNP effects in BayesA follow an inverse-chi-squared distribution^1,11. The prior for BayesCπ assumes that SNPs with non-zero effects have a common variance, which resembles the MixP assumption that “large” SNPs have a common variance (σ₁²). However, SNP effects with small variance may be shrunk to zero in BayesCπ.
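Equation (1) can be checked numerically. The sketch below is illustrative only: the function name `mixp_variances` and the value `v_g = 1.0` are assumptions, not part of the original study. It shows that the two variances, weighted by the expected numbers of "large" and "small" SNPs, add up to V_g.

```python
def mixp_variances(gamma, v_g, m):
    """Return (sigma1_sq, sigma2_sq) from Eq. (1); requires 0 < gamma <= 0.5."""
    if not 0.0 < gamma <= 0.5:
        raise ValueError("gamma must lie in (0, 0.5]")
    sigma1_sq = (1.0 - gamma) * v_g / (gamma * m)  # large variance
    sigma2_sq = gamma * v_g / ((1.0 - gamma) * m)  # small variance
    return sigma1_sq, sigma2_sq


# Hypothetical inputs: v_g = 1.0 is an arbitrary genetic variance.
gamma, v_g, m = 0.07, 1.0, 5726
s1, s2 = mixp_variances(gamma, v_g, m)

# gamma*M SNPs carry sigma1_sq each, (1 - gamma)*M SNPs carry sigma2_sq each.
total = gamma * m * s1 + (1.0 - gamma) * m * s2
```

With γ = 0.5 the two variances coincide at V_g/M, which is the constant-variance assumption of RRBLUP.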

MixP is also a fast Bayesian method that is not based on an MCMC algorithm⁹. However, its derivation involves a multivariate normal density and a matrix inverse, which makes it harder to follow. In the non-MCMC-based MixP, γ is fixed rather than estimated, so its optimal value must be found by cross-validation. With an MCMC algorithm, however, γ can be estimated. To distinguish the two algorithms, the MixP not based on the MCMC algorithm is termed fast MixP (FMixP) here, and the MixP based on the MCMC algorithm is termed MCMC-based MixP (MMixP).

In this study, we developed two new computing strategies, one for FMixP and one for MMixP. The first strategy uses a product of univariate densities instead of the multivariate normal density to estimate SNP effects for FMixP; the second uses the MCMC algorithm to derive the solutions for MMixP. Both strategies were applied to simulated data from an international workshop and to real data on large yellow croaker, and their predictive abilities were compared with those of BayesA and BayesCπ.

Results

Results for simulated data

The predictive results of various Bayesian methods for the simulated data are shown in Table 1. The predictive accuracies of BayesCπ, MMixP, and FMixP (γ = 0.07) are very close. The accuracy of BayesA is lower than that of BayesCπ, MMixP, and FMixP (γ = 0.07), but higher than that of FMixP with γ = 0.5. BayesCπ and MMixP yield comparatively accurate estimates for π and γ, respectively. As 48 QTLs were simulated in the genome, the true value of π (or γ) is 48/5726 ≈ 0.0084, which is very close to the values estimated by BayesCπ and MMixP. The γ estimates over the Gibbs sampling cycles are shown in Fig. 1; the value converges after approximately 1000 cycles.

Table 1 Correlation and regression coefficients of TBV on GEBV for various methods in simulated data.

We compared the predictive results of the MixP introduced by Yu and Meuwissen⁹ with those of our FMixP, and found that the two derivations yield the same prediction accuracies. Graphs of the correlation and regression coefficients of TBV on GEBV (r_(TBV,GEBV) and b_(TBV,GEBV), respectively) against γ for FMixP are presented in Fig. 2. Both measures follow a similar trend in response to γ. Overall, FMixP yields the highest accuracy when γ is close to 0.07, which is higher than the true value (0.0084). The distributions of SNP effects estimated by FMixP and MMixP are shown in Fig. 3. All the QTLs with absolute effects >0.2 can be located by nearby SNPs in both methods, indicating that MixP may be a promising implementation for GWA studies.
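The two measures reported in Table 1 and Fig. 2 can be computed as in the following minimal sketch. The function name and the toy vectors are hypothetical; the regression coefficient is taken to be the ordinary least-squares slope of TBV on GEBV, i.e. cov(TBV, GEBV)/var(GEBV).

```python
import math


def accuracy_and_bias(tbv, gebv):
    """Return (r, b): correlation and regression slope of TBV on GEBV."""
    n = len(tbv)
    mean_t = sum(tbv) / n
    mean_g = sum(gebv) / n
    s_tg = sum((t - mean_t) * (g - mean_g) for t, g in zip(tbv, gebv))
    s_tt = sum((t - mean_t) ** 2 for t in tbv)
    s_gg = sum((g - mean_g) ** 2 for g in gebv)
    r = s_tg / math.sqrt(s_tt * s_gg)  # predictive accuracy
    b = s_tg / s_gg                    # slope; 1.0 means no scale bias
    return r, b


# Toy example: GEBV are exactly half the TBV, so r = 1 and b = 2.
r, b = accuracy_and_bias([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0])
```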

Results for real data

Table 2 shows the predictive abilities of various Bayesian methods for four quantitative traits in large yellow croaker. The results estimated by BayesA, BayesCπ, MMixP and FMixP are very similar for all traits, with no within-trait difference in predictive ability greater than 0.01. The value of γ (the probability of a SNP with a large variance) estimated by MMixP is much higher than that estimated in the simulated data, indicating that there may be many QTLs affecting the phenotypes. The results of FMixP show that predictive abilities are optimized when the probability of a SNP with a large variance in specific traits is 0.02 or 0.05. However, these optimal points are not obvious because the predictive abilities are still very close to the best results when γ = 0.5, which is not consistent with the results from the simulated data. Figure 4 shows graphs of the predictive ability against γ for FMixP for various traits. It shows that the value of γ barely affects the predictive ability as long as γ is larger than 0.05 or even 0.02. The values of γ estimated by MMixP are 0.28, 0.32, 0.27 and 0.31 for the traits body weight, body length, body height and length/height, respectively.

Table 2 Predictive abilities of various methods for four traits in large yellow croaker.

Computation time

Table 3 shows the computation time of each method for the simulated data and the trait length/height in large yellow croaker. The Fortran90 codes were run on a computer with an Intel Xeon CPU E7-4820. The computation time of MMixP is the longest of all the methods. Compared with BayesCπ, BayesA is slightly slower in the simulated data but slightly faster in the real data. However, all MCMC-based Bayesian methods are much slower than FMixP. The computation time for FMixP with γ = 0.5 is longer than that for FMixP with γ = 0.05 in the simulated data, but this difference is not obvious in the real data. We also compared the computation time of the MixP introduced by Yu and Meuwissen with that of our FMixP; their MixP took approximately 20–25% longer than our FMixP.

Table 3 Computation time of genomic prediction using various Bayesian methods for trait length/height.

Discussion

In this study, we compared the predictive abilities of BayesA, BayesCπ, MMixP and FMixP. When γ = 0.5, the results of FMixP are equivalent to those of GBLUP or RRBLUP, as also noted by Yu and Meuwissen⁹. Hence, the predictive result of FMixP with γ = 0.5 is the same as that of GBLUP in Shepherd et al.⁸, who used the same simulated data. In effect, we therefore compared five methods (BayesA, BayesCπ, MMixP, FMixP and GBLUP). The results show that the ranking of the methods is not consistent between the simulated and real data. In the simulated data, the ranking by predictive accuracy is: BayesCπ ≈ MMixP ≈ FMixP (γ = 0.07) > BayesA > GBLUP. In the real data from large yellow croaker, however, all of the methods yield almost the same result within a given trait. A reasonable explanation is that there are few QTLs in the simulated data but many more in the real data. Two observations support this speculation: (i) The simulations of Yu and Meuwissen showed that accuracy was not sensitive to γ when the number of QTLs was large, but that FMixP with γ < 0.5 performed better than GBLUP when the number of QTLs was small⁹. The results shown in Figs 2 and 4 are consistent with these two cases. (ii) The value of γ estimated by MMixP in the simulated data is much lower than the values estimated in the real data, indicating that many QTLs may affect the phenotypes in large yellow croaker. Another explanation is that when the LD between markers is not strong, the accuracy may be due to the relationships captured by markers^12,13. In this case, GBLUP and the various Bayesian methods may yield similar predictive results.

In addition to predictive accuracy, computational speed is another important aspect of genomic prediction. This study shows that FMixP is much faster than the MCMC-based Bayesian methods. The main reason is that FMixP does not rely on MCMC algorithms, which are sampling processes that require many cycles to obtain a precise solution. Table 3 shows that the computation time for BayesCπ is slightly longer than that for BayesA in the real data, but slightly shorter in the simulated data. This is because the computational speed of BayesCπ depends on the value of π. A smaller π means that more SNPs have zero effects and thus do not need to be sampled from the posterior normal distribution. MMixP needs more computation time than BayesA and BayesCπ because more variables must be sampled. For example, a SNP effect whose variance equals zero is not sampled in BayesCπ, whereas in MMixP all SNP effects must be sampled in each Gibbs sampling cycle, because each SNP has either a large or a small variance. FMixP with γ = 0.05 is faster than FMixP with γ = 0.5 in the simulated data. A possible reason is that the number of QTLs in the simulated data is very small: FMixP with γ = 0.05 is closer to the real QTL distribution and therefore converges faster. This also suggests that there may be more QTLs in the real data, because there is no obvious difference in computation time between FMixP with γ = 0.05 and γ = 0.5.

This study proposed two new computing strategies: one for FMixP and the other for MMixP. Compared with the derivation of Yu and Meuwissen⁹, we used a simpler derivation to obtain the solutions in FMixP. The advantage of FMixP is its extremely fast computational speed. However, this implementation cannot estimate the probability that a SNP has a large variance (γ) or the variances of SNP effects. The MCMC algorithm, by contrast, can estimate γ and the variances, but is much slower than FMixP. The two strategies may serve as a reference for others who wish to perform genomic prediction in the future.

Material and Methods

Ethics approval

This study and all experimental protocols were approved by the Animal Care and Use Committee of the Fisheries College of Jimei University (Animal Ethics no. 1067). All methods were performed in accordance with approved guidelines.

Analytical derivation for FMixP

The linear model for genomic prediction was as follows:

$${\bf{y}}={\bf{X}}{\bf{u}}+{\bf{B}}{\bf{g}}+{\bf{e}},$$

(2)

where y is a vector of phenotypic records, X is the design matrix for fixed effects, and u is a vector of fixed effects. In the simulated data, X = (1 1 … 1)′ and u is the overall mean, whereas in the real data the fixed effect was sex, with X_i = (1 0) for males and (0 1) for females. B is the matrix of SNP genotypes (coded as 0 for genotype ‘A_A’, 1 for ‘A_a’ and 2 for ‘a_a’), g is a vector of SNP effects, and e is a vector of residual effects, where e ~ N(0, Iσ_e²). Genotypic codes were standardised using the formula $B^{\prime}_{ij}=({B}_{ij}-2{p}_{j})/\sqrt{2{p}_{j}(1-{p}_{j})}$, where p_j is the frequency of allele ‘a’ at locus j.
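The standardisation of genotype codes can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and p_j is estimated here from the sample itself (the sum of the 0/1/2 codes in a column divided by 2n), which the text does not specify.

```python
import math


def standardise_genotypes(b):
    """Standardise a genotype matrix given as a list of rows of 0/1/2 codes."""
    n = len(b)
    m = len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        # Assumption: allele frequency of 'a' estimated from the sample.
        p = sum(row[j] for row in b) / (2.0 * n)
        sd = math.sqrt(2.0 * p * (1.0 - p))
        if sd == 0.0:
            raise ValueError("monomorphic locus %d cannot be standardised" % j)
        for i in range(n):
            out[i][j] = (b[i][j] - 2.0 * p) / sd
    return out


# Toy data: four individuals, two loci, both with p = 0.5.
z = standardise_genotypes([[0, 2], [1, 1], [2, 0], [1, 1]])
```

After this transformation every column has mean zero, so the expected contribution of each SNP to the genetic variance is on the same scale.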

In this study, the prior distribution was the same as that described by Yu and Meuwissen⁹. According to the prior distribution for SNP variance, the prior for SNP effect g _j can be written as a mixture of normal distributions:

$$\pi ({g}_{j})=\gamma {\varphi }({g}_{j}|0,{{\sigma }_{1}}^{2})+(1-\gamma ){\varphi }({g}_{j}|0,{{\sigma }_{2}}^{2}),$$

(3)

where g _j is the effect of SNP j.

Here, we used an Iterative Conditional Expectation (ICE) algorithm⁷ to estimate the SNP effects. This algorithm estimates E(g|y) for each SNP effect in turn, treating the current effects of the other SNPs as known values. For example, to estimate E(g_j|y_−j), the current effects of all other SNPs are used to calculate y_−j, i.e.,

$${{\bf{y}}}_{-j}={\bf{y}}-{\bf{X}}{\bf{u}}-\sum _{k\ne j}{{\bf{B}}}_{k}{g}_{k},$$

(4)

where B_k is the kth column of B. The expectation of the SNP effect, E(g_j|y_−j), is estimated by a Bayesian model^7,9:

$$\begin{array}{rcl}E({g}_{j}|{{\bf{y}}}_{-j}) & = & {\int }_{-\infty }^{+\infty }{g}_{j}f({g}_{j}|{{\bf{y}}}_{-j})d{g}_{j}\\ & = & \frac{{\int }_{-\infty }^{+\infty }{g}_{j}f({{\bf{y}}}_{-j}|{{\bf{B}}}_{j}{g}_{j},{\bf{I}}{{\sigma }_{e}}^{2})\pi ({g}_{j})d{g}_{j}}{{\int }_{-\infty }^{+\infty }f({{\bf{y}}}_{-j}|{{\bf{B}}}_{j}{g}_{j},{\bf{I}}{{\sigma }_{e}}^{2})\pi ({g}_{j})d{g}_{j}}\end{array},$$

(5)

where f(y_−j|B_j g_j, Iσ_e²) is a multivariate normal density. Evaluating this multivariate density is computationally intensive because it involves calculating the determinant and inverse of the variance-covariance matrix for the data y_−j. However, f(y_−j|B_j g_j, Iσ_e²) is proportional to the product of univariate normal densities f(Y|g_j, σ²), where Y = (B_j′B_j)⁻¹B_j′y_−j and σ² = (B_j′B_j)⁻¹σ_e² (see Appendix 2 of Meuwissen et al.⁷). Unlike the derivation of Yu and Meuwissen⁹, we did not calculate the multivariate likelihood but simplified the derivation by using f(Y|g_j, σ²) in place of f(y_−j|B_j g_j, Iσ_e²). Thus, equation (5) can be rewritten as:

$$E({g}_{j}|{{\bf{y}}}_{-j})=\frac{{\int }_{-\infty }^{+\infty }{g}_{j}\,f(Y|{g}_{j},{\sigma }^{2})\pi ({g}_{j})d{g}_{j}}{{\int }_{-\infty }^{+\infty }f(Y|{g}_{j},{\sigma }^{2})\pi ({g}_{j})d{g}_{j}}.$$

(6)

Combined with equation (3), the numerator of equation (6) can be split into two terms:

$$\gamma {\int }_{-\infty }^{+\infty }{g}_{j}f(Y|{g}_{j},{\sigma }^{2}){\varphi }({g}_{j}|0,{\sigma }_{1}^{2})d{g}_{j}+(1-\gamma ){\int }_{-\infty }^{+\infty }{g}_{j}f(Y|{g}_{j},{\sigma }^{2}){\varphi }({g}_{j}|0,{\sigma }_{2}^{2})d{g}_{j}.$$

(7)

The first term in formula (7) can be derived as follows:

$$\begin{array}{ll} & \gamma {\int }_{-\infty }^{+\infty }{g}_{j}f(Y|{g}_{j},{\sigma }^{2}){\varphi }({g}_{j}|0,{\sigma }_{1}^{2})d{g}_{j}\\ = & \frac{\gamma }{\sqrt{2\pi }}{\int }_{-\infty }^{+\infty }\frac{{g}_{j}}{\sqrt{2\pi }\sigma {\sigma }_{1}}\exp [-\frac{{(Y-{g}_{j})}^{2}}{2{\sigma }^{2}}-\frac{{{g}_{j}}^{2}}{2{\sigma }_{1}^{2}}]d{g}_{j}\\ = & \frac{\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{1}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}}{\int }_{-\infty }^{+\infty }\frac{{g}_{j}}{\sqrt{2\pi }\frac{\sigma {\sigma }_{1}}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}}}\exp [-\frac{{({g}_{j}-\frac{Y{\sigma }_{1}^{2}}{{\sigma }_{1}^{2}+{\sigma }^{2}})}^{2}}{2{(\frac{\sigma {\sigma }_{1}}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}})}^{2}}]d{g}_{j}.\end{array}$$

(8)
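The step from the second to the third line of formula (8) relies on completing the square in g_j. Written out (a worked expansion added for clarity, using only the symbols defined above):

$$\frac{{(Y-{g}_{j})}^{2}}{2{\sigma }^{2}}+\frac{{g}_{j}^{2}}{2{\sigma }_{1}^{2}}=\frac{{\left({g}_{j}-\frac{Y{\sigma }_{1}^{2}}{{\sigma }_{1}^{2}+{\sigma }^{2}}\right)}^{2}}{2\frac{{\sigma }^{2}{\sigma }_{1}^{2}}{{\sigma }_{1}^{2}+{\sigma }^{2}}}+\frac{{Y}^{2}}{2({\sigma }_{1}^{2}+{\sigma }^{2})}.$$

The second term on the right does not depend on g_j and can be taken outside the integral, while the first term is the exponent of a normal density in g_j. The normalising constants match because 1/(σσ₁) equals 1/√(σ₁² + σ²) multiplied by √(σ₁² + σ²)/(σσ₁).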

The last factor in formula (8) is the expected value of g_j under a normal distribution with mean Yσ₁²/(σ₁² + σ²) and variance σ²σ₁²/(σ₁² + σ²), so it equals Yσ₁²/(σ₁² + σ²). Thus, the first term of formula (7) becomes:

$$\frac{\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{1}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}}\frac{Y{{\sigma }_{1}}^{2}}{{\sigma }_{1}^{2}+{\sigma }^{2}}.$$

(9)

Similarly, the second term becomes:

$$\frac{1-\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{2}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{{\sigma }_{2}}^{2}+{\sigma }^{2}}}\frac{Y{\sigma }_{2}^{2}}{{\sigma }_{2}^{2}+{\sigma }^{2}}.$$

(10)

Thus, the numerator of equation (6) equals:

$$\frac{\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{1}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}}\frac{Y{\sigma }_{1}^{2}}{{\sigma }_{1}^{2}+{\sigma }^{2}}+\frac{1-\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{2}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{2}^{2}+{\sigma }^{2}}}\frac{Y{\sigma }_{2}^{2}}{{\sigma }_{2}^{2}+{\sigma }^{2}}.$$

(11)

The derivation of the denominator in equation (6) is very similar to that of the numerator, except that there is no g_j in the integrand. The integral therefore no longer gives an expected value; instead, it is the integral of a normal density from −∞ to ∞, which equals 1. Thus, the denominator in equation (6) can be written as:

$$\frac{\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{1}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{1}^{2}+{\sigma }^{2}}}+\frac{1-\gamma }{\sqrt{2\pi }}\exp [-\frac{{Y}^{2}}{2({\sigma }_{2}^{2}+{\sigma }^{2})}]\frac{1}{\sqrt{{\sigma }_{2}^{2}+{\sigma }^{2}}}.$$

(12)

Thus, we derive the final form for equation (6),

$$E({g}_{j}|{{\bf{y}}}_{-j})=\frac{\gamma \frac{Y{{\sigma }_{1}}^{2}}{{{\sigma }_{1}}^{2}+{\sigma }^{2}}+(1-\gamma )\exp [\frac{1}{2}(\frac{{Y}^{2}}{{{\sigma }_{1}}^{2}+{\sigma }^{2}}-\frac{{Y}^{2}}{{{\sigma }_{2}}^{2}+{\sigma }^{2}})]\frac{\sqrt{{{\sigma }_{1}}^{2}+{\sigma }^{2}}}{\sqrt{{{\sigma }_{2}}^{2}+{\sigma }^{2}}}\frac{Y{{\sigma }_{2}}^{2}}{{{\sigma }_{2}}^{2}+{\sigma }^{2}}}{\gamma +(1-\gamma )\exp [\frac{1}{2}(\frac{{Y}^{2}}{{{\sigma }_{1}}^{2}+{\sigma }^{2}}-\frac{{Y}^{2}}{{{\sigma }_{2}}^{2}+{\sigma }^{2}})]\frac{\sqrt{{{\sigma }_{1}}^{2}+{\sigma }^{2}}}{\sqrt{{{\sigma }_{2}}^{2}+{\sigma }^{2}}}}.$$

(13)

The fixed effects are estimated in each iteration by the formula û = (X′X)⁻¹X′(y − Bĝ). The solutions were considered to have converged at the tth iteration when $({{\bf{G}}}^{t}-{{\bf{G}}}^{t-1})^{\prime}({{\bf{G}}}^{t}-{{\bf{G}}}^{t-1})/({{\bf{G}}}^{t\prime}{{\bf{G}}}^{t})<{10}^{-8}$, where G = (û′ ĝ′)′.
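The ICE iteration built on equation (13) can be sketched in a few lines. This is not the authors' Fortran90 code: the function names are hypothetical, the only fixed effect is an overall mean (X = 1), and σ_e² is passed in as a known value, which is an assumption of this sketch.

```python
import math


def posterior_mean(y_stat, sigma_sq, gamma, sigma1_sq, sigma2_sq):
    """Eq. (13): E(g_j | y_-j) under the two-component normal mixture prior."""
    t1 = sigma1_sq + sigma_sq
    t2 = sigma2_sq + sigma_sq
    # Relative weight of the small-variance component (exponent <= 0 if t1 >= t2).
    w = (1.0 - gamma) * math.exp(0.5 * y_stat ** 2 * (1.0 / t1 - 1.0 / t2))
    w *= math.sqrt(t1 / t2)
    num = gamma * y_stat * sigma1_sq / t1 + w * y_stat * sigma2_sq / t2
    return num / (gamma + w)


def fmixp(y, b, gamma, v_g, sigma_e_sq, max_iter=1000, tol=1e-8):
    """ICE sketch; b holds standardised genotypes as a list of rows."""
    if not 0.0 < gamma <= 0.5:
        raise ValueError("gamma must lie in (0, 0.5]")
    n, m = len(y), len(b[0])
    sigma1_sq = (1.0 - gamma) * v_g / (gamma * m)   # Eq. (1)
    sigma2_sq = gamma * v_g / ((1.0 - gamma) * m)
    cols = [[b[i][j] for i in range(n)] for j in range(m)]
    btb = [sum(x * x for x in col) for col in cols]
    u = sum(y) / n
    g = [0.0] * m
    resid = [yi - u for yi in y]                    # y - Xu - Bg
    for _ in range(max_iter):
        old = [u] + g[:]
        for j in range(m):
            col = cols[j]
            # B_j' y_-j, using y_-j = resid + B_j g_j (Eq. 4)
            bty = sum(c * r for c, r in zip(col, resid)) + btb[j] * g[j]
            new_gj = posterior_mean(bty / btb[j], sigma_e_sq / btb[j],
                                    gamma, sigma1_sq, sigma2_sq)
            shift = new_gj - g[j]
            resid = [r - c * shift for r, c in zip(resid, col)]
            g[j] = new_gj
        # u_hat = (X'X)^-1 X'(y - B g_hat) with X a column of ones
        new_u = u + sum(resid) / n
        resid = [r - (new_u - u) for r in resid]
        u = new_u
        new = [u] + g
        change = sum((a - c) ** 2 for a, c in zip(new, old))
        norm = sum(a * a for a in new)
        if norm > 0.0 and change / norm < tol:      # convergence rule from the text
            break
    return u, g
```

Keeping a running residual vector avoids recomputing y_−j from scratch for every SNP, so each pass over the markers costs on the order of n × M operations.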

Derivation for MMixP

FMixP does not estimate the parameter γ, so a direct search is needed to obtain its optimal value for genomic prediction. However, γ can be estimated by the MCMC algorithm. MMixP requires prior distributions for the variables γ, u, g, σ₁², σ₂² and σ_e². The priors for γ, u and σ_e² were assumed to follow uniform distributions. The prior for g_j depended on γ and the variances:

$${g}_{j}|\gamma ,{{\boldsymbol{\sigma }}}^{2} \sim \left\{\begin{array}{l}N(0,{\sigma }_{1}^{2})\quad {\rm{with}}\,{\rm{probability}}\,\gamma \\ N(0,{\sigma }_{2}^{2})\quad {\rm{with}}\,{\rm{probability}}\,(1-\gamma )\end{array}\right.,$$

(14)

where γ is the probability that a SNP has a large variance, and ${{\sigma }_{1}}^{2}$ and ${{\sigma }_{2}}^{2}$ represent the large and small variance, respectively. The priors of ${{\sigma }_{1}}^{2}$ and ${{\sigma }_{2}}^{2}$ were assumed to follow the inverse-chi-squared distributions:

$$\left\{\begin{array}{l}{\sigma }_{1}^{2} \sim {\chi }^{-2}(v,{s}_{1}^{2})\quad {\rm{where}}\,\,{s}_{1}^{2}=\frac{(v-2)(1-\gamma ){V}_{g}}{v\gamma M}\\ {\sigma }_{2}^{2} \sim {\chi }^{-2}(v,{s}_{2}^{2})\quad {\rm{where}}\,\,{s}_{2}^{2}=\frac{(v-2)\gamma {V}_{g}}{v(1-\gamma )M}\end{array}\right..$$

(15)

The scale parameter s₁² was chosen so that $E({\sigma }_{1}^{2})=\frac{v{s}_{1}^{2}}{v-2}=\frac{(1-\gamma ){V}_{g}}{\gamma M}$, according to the properties of the inverse-chi-squared distribution and the Pareto principle. The parameter s₂² was set in the same way. An indicator variable δ_j was used to indicate whether SNP j had a large or small variance. The prior for δ_j was $p({\delta }_{j}|\gamma )={\gamma }^{{\delta }_{j}}{(1-\gamma )}^{(1-{\delta }_{j})}$, where δ_j = 1 denotes σ_j² = σ₁² (with probability γ) and δ_j = 0 denotes σ_j² = σ₂² (with probability 1 − γ).
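The choice of scale parameters in equation (15) can be verified with a short calculation. In this sketch the function names are hypothetical, and the default v = 5.0 is borrowed from the value the paper reports for BayesA and BayesCπ; whether the same value was used for MMixP is an assumption.

```python
def prior_scales(gamma, v_g, m, v=5.0):
    """Scale parameters s1^2 and s2^2 of Eq. (15); requires v > 2."""
    s1_sq = (v - 2.0) * (1.0 - gamma) * v_g / (v * gamma * m)
    s2_sq = (v - 2.0) * gamma * v_g / (v * (1.0 - gamma) * m)
    return s1_sq, s2_sq


def inv_chi2_mean(v, s_sq):
    """Mean of a scaled inverse-chi-squared(v, s^2) distribution, v > 2."""
    return v * s_sq / (v - 2.0)


# Hypothetical inputs chosen only for illustration.
gamma, v_g, m = 0.1, 1.0, 1000
s1_sq, s2_sq = prior_scales(gamma, v_g, m)
prior_mean_large = inv_chi2_mean(5.0, s1_sq)  # equals (1 - gamma) V_g / (gamma M)
prior_mean_small = inv_chi2_mean(5.0, s2_sq)  # equals gamma V_g / ((1 - gamma) M)
```

The prior means therefore reproduce the two fixed variances of equation (1).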

δ_j and g_j are sampled from their joint conditional distribution, because the sampling of g_j depends on the value of δ_j. The joint conditional distribution can be written as:

$$f({\delta }_{j},{g}_{j}|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )=f({\delta }_{j}|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )f({g}_{j}|{\delta }_{j},{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma ),$$

(16)

where g_−j and δ_−j represent the vectors of SNP effects and indicator variables excluding g_j and δ_j, respectively, and σ² = (σ₁², σ₂²). The conditional distribution for δ_j can then be written as:

$$\begin{array}{cc}f({\delta }_{j}|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma ) & \propto \,f({\bf{y}}|{\delta }_{j},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )p({\delta }_{j}|\gamma )\\ & \propto \,f({{\bf{y}}}_{-j}|{\delta }_{j},{{\sigma }_{j}}^{2},{{\sigma }_{e}}^{2})p({\delta }_{j}|\gamma )\end{array},$$

(17)

where ${{\bf{y}}}_{-j}={\bf{y}}-{\bf{X}}{\bf{u}}-\sum _{k\ne j}{{\bf{B}}}_{k}{g}_{k}={{\bf{B}}}_{j}{g}_{j}+{\bf{e}}$, as in equation (4). Thus, $f({{\bf{y}}}_{-j}|{\delta }_{j},{{\sigma }_{j}}^{2},{{\sigma }_{e}}^{2})p({\delta }_{j}|\gamma )$ can be represented as:

$$f({{\bf{y}}}_{-j}|{\delta }_{j},{\sigma }_{j}^{2},{\sigma }_{e}^{2})p({\delta }_{j}|\gamma )=\left\{\begin{array}{ll}f({{\bf{y}}}_{-j}|{\sigma }_{1}^{2},{\sigma }_{e}^{2})\gamma & {\rm{when}}\,{\delta }_{j}=1\\ f({{\bf{y}}}_{-j}|{\sigma }_{2}^{2},{\sigma }_{e}^{2})(1-\gamma ) & {\rm{when}}\,{\delta }_{j}=0\end{array}\right..$$

(18)

As $f({\delta }_{j}=1|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{\sigma }_{e}^{2},\gamma )+f({\delta }_{j}=0|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{\sigma }_{e}^{2},\gamma )=1$, δ_j = 1 is sampled with probability:

$$\begin{array}{c}f({\delta }_{j}=1|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )\\ =\frac{f({\delta }_{j}=1|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )}{f({\delta }_{j}=1|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )+f({\delta }_{j}=0|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )}\\ =\frac{f({{\bf{y}}}_{-j}|{{\sigma }_{1}}^{2},{{\sigma }_{e}}^{2})\gamma }{f({{\bf{y}}}_{-j}|{{\sigma }_{1}}^{2},{{\sigma }_{e}}^{2})\gamma +f({{\bf{y}}}_{-j}|{{\sigma }_{2}}^{2},{{\sigma }_{e}}^{2})(1-\gamma )}\\ =\frac{1}{1+\frac{f({{\bf{y}}}_{-j}|{{\sigma }_{2}}^{2},{{\sigma }_{e}}^{2})(1-\gamma )}{f({{\bf{y}}}_{-j}|{{\sigma }_{1}}^{2},{{\sigma }_{e}}^{2})\gamma }}.\end{array}$$

(19)

Note that f(y_−j|σ_j², σ_e²) is a multivariate density, as in FMixP. An efficient approach is to use the product of univariate distributions of B_j′y_−j instead of the distribution of y_−j^13,14. The density f(B_j′y_−j|σ_j², σ_e²) has zero mean and variance (B_j′B_j)²σ_j² + B_j′B_jσ_e². Thus, equation (19) can be written as:

$$\begin{array}{lll} & & f({\delta }_{j}=1|{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2},\gamma )\\ & = & \frac{1}{1+\frac{f({{\bf{B}}}_{j}^{^{\prime} }{{\bf{y}}}_{-j}|{\sigma }_{2}^{2},{\sigma }_{e}^{2})(1-\gamma )}{f({{{\bf{B}}}_{j}}^{^{\prime} }{{\bf{y}}}_{-j}|{\sigma }_{1}^{2},{\sigma }_{e}^{2})\gamma }}\\ & = & \frac{1}{1+\exp [0.5\,\mathrm{log}({V}_{1})-0.5\,\mathrm{log}({V}_{2})+\frac{0.5{({{\bf{B}}}_{j}^{^{\prime} }{{\bf{y}}}_{-j})}^{2}}{{V}_{1}}-\frac{0.5{({{\bf{B}}}_{j}^{^{\prime} }{{\bf{y}}}_{-j})}^{2}}{{V}_{2}}+\,\mathrm{log}(1-\gamma )-\,\mathrm{log}(\gamma )]},\end{array}$$

(20)

where V₁ = (B_j′B_j)²σ₁² + B_j′B_jσ_e² and V₂ = (B_j′B_j)²σ₂² + B_j′B_jσ_e². After δ_j has been updated, g_j is sampled as:

$$f({g}_{j}|{\delta }_{j},{\bf{y}},{\bf{u}},{{\bf{g}}}_{-j},{{\boldsymbol{\delta }}}_{-j},{{\boldsymbol{\sigma }}}^{2},{\sigma }_{e}^{2},\gamma ) \sim \left\{\begin{array}{ll}N\left(\frac{{{\bf{B}}}_{j}^{\prime}{{\bf{y}}}_{-j}}{{{\bf{B}}}_{j}^{\prime}{{\bf{B}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{1}^{2}},\frac{{\sigma }_{e}^{2}}{{{\bf{B}}}_{j}^{\prime}{{\bf{B}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{1}^{2}}\right) & {\rm{if}}\,{\delta }_{j}=1\\ N\left(\frac{{{\bf{B}}}_{j}^{\prime}{{\bf{y}}}_{-j}}{{{\bf{B}}}_{j}^{\prime}{{\bf{B}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{2}^{2}},\frac{{\sigma }_{e}^{2}}{{{\bf{B}}}_{j}^{\prime}{{\bf{B}}}_{j}+{\sigma }_{e}^{2}/{\sigma }_{2}^{2}}\right) & {\rm{if}}\,{\delta }_{j}=0\end{array}\right..$$

(21)
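Equations (20) and (21) together define one Gibbs update for a single SNP. A minimal sketch follows; the function and argument names are hypothetical (`bty` stands for B_j′y_−j and `btb` for B_j′B_j), and the overflow guard is an implementation detail added here, not something described in the paper.

```python
import math
import random


def prob_large(bty, btb, sigma1_sq, sigma2_sq, sigma_e_sq, gamma):
    """Eq. (20): conditional probability that delta_j = 1."""
    v1 = btb * btb * sigma1_sq + btb * sigma_e_sq
    v2 = btb * btb * sigma2_sq + btb * sigma_e_sq
    log_ratio = (0.5 * math.log(v1) - 0.5 * math.log(v2)
                 + 0.5 * bty * bty / v1 - 0.5 * bty * bty / v2
                 + math.log(1.0 - gamma) - math.log(gamma))
    if log_ratio > 700.0:  # guard against overflow in exp()
        return 0.0
    return 1.0 / (1.0 + math.exp(log_ratio))


def sample_delta_and_effect(bty, btb, sigma1_sq, sigma2_sq, sigma_e_sq,
                            gamma, rng=random):
    """Draw delta_j from Eq. (20), then g_j from Eq. (21)."""
    p1 = prob_large(bty, btb, sigma1_sq, sigma2_sq, sigma_e_sq, gamma)
    delta = 1 if rng.random() < p1 else 0
    sigma_j_sq = sigma1_sq if delta == 1 else sigma2_sq
    lhs = btb + sigma_e_sq / sigma_j_sq
    g = rng.gauss(bty / lhs, math.sqrt(sigma_e_sq / lhs))
    return delta, g
```

Working with the log of the density ratio, as equation (20) does, keeps the computation stable when B_j′y_−j is large.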

As σ₁² appears only in its own prior and in the normal distribution of g_j with δ_j = 1, the posterior distribution of σ₁² can be derived as:

$$\begin{array}{lll}f({\sigma }_{1}^{2}|{\bf{y}},{\bf{u}},{\bf{g}},{\boldsymbol{\delta }},{\sigma }_{2}^{2},{\sigma }_{e}^{2},\gamma ) & \propto & f({\sigma }_{1}^{2}){\prod }_{{\delta }_{j}=1}f({g}_{j}|{\sigma }_{1}^{2})\\ & \propto & {({\sigma }_{1}^{2})}^{-\frac{k+v+2}{2}}\exp \left(-\frac{v{s}_{1}^{2}+{\sum }_{{\delta }_{j}=1}{g}_{j}^{2}}{2{\sigma }_{1}^{2}}\right)\\ & \sim & {\chi }^{-2}\left(k+v,\frac{v{s}_{1}^{2}+{\sum }_{{\delta }_{j}=1}{g}_{j}^{2}}{k+v}\right)\end{array},$$

(22)

where k is the number of SNP loci with δ_j = 1. Similarly, the posterior distribution of σ₂² follows the inverse-chi-squared distribution ${\chi }^{-2}\left(m+v,\frac{v{s}_{2}^{2}+{\sum }_{{\delta }_{j}=0}{g}_{j}^{2}}{m+v}\right)$, where m is the number of SNP loci with δ_j = 0.
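A draw from the posterior in equation (22) can be obtained by dividing the posterior sum of squares by a chi-squared variate. The sketch below uses hypothetical function names and relies on the standard identity that a scaled inverse-chi-squared(ν, τ²) variate equals ντ²/χ²_ν, with χ²_ν generated as a gamma variate with shape ν/2 and scale 2.

```python
import random


def sample_scaled_inv_chi2(df, scale, rng=random):
    """Draw from scaled inverse-chi-squared(df, scale) as df*scale / chi2_df."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    return df * scale / chi2


def sample_component_variance(effects, v, s_sq, rng=random):
    """Posterior draw of sigma_1^2 (or sigma_2^2) given the effects in that class.

    effects: current g_j values with delta_j = 1 (or delta_j = 0).
    """
    k = len(effects)
    sum_sq = sum(g * g for g in effects)
    return sample_scaled_inv_chi2(k + v, (v * s_sq + sum_sq) / (k + v), rng)
```

The same function serves both variance components; only the subset of effects and the prior scale change.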

The starting value of γ was set to 0.5, and in each cycle γ is drawn from its conditional posterior, Beta(k + 1, m + 1):

$$\begin{array}{lll}f(\gamma |{\bf{y}},{\bf{u}},{\bf{g}},{\boldsymbol{\delta }},{{\boldsymbol{\sigma }}}^{2},{{\sigma }_{e}}^{2}) & \propto & f(\gamma )f({\boldsymbol{\delta }}|\gamma )\\ & \propto & {\gamma }^{k}{(1-\gamma )}^{m}\end{array}.$$

(23)

Note that if the sampled value of γ is larger than 0.5, the labels of the variances σ₁² and σ₂² can be switched and γ set to 1 − γ. The posterior distributions of the fixed effect u and the residual variance σ_e² are the same as in BayesA, which have been described in many studies^1,13,15.
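The update of γ and the label switch can be sketched as follows. The function names are hypothetical. Flipping the indicators δ_j together with the variance labels is an assumption of this sketch, made so that δ_j = 1 continues to mark the large-variance class; the text mentions only the variances and γ.

```python
import random


def sample_gamma(k, m, rng=random):
    """Draw gamma from Beta(k + 1, m + 1), Eq. (23), under a uniform prior."""
    return rng.betavariate(k + 1, m + 1)


def enforce_label_order(gamma, sigma1_sq, sigma2_sq, delta):
    """If gamma > 0.5, swap the variance labels and replace gamma by 1 - gamma."""
    if gamma > 0.5:
        # Assumption: indicators are flipped too, to stay consistent.
        flipped = [1 - d for d in delta]
        return 1.0 - gamma, sigma2_sq, sigma1_sq, flipped
    return gamma, sigma1_sq, sigma2_sq, delta


# Example with k = 3 "large" SNPs and m = 7 "small" SNPs.
draw = sample_gamma(3, 7, random.Random(0))
```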

Genomic prediction by other approaches

Two other Bayesian methods, BayesA¹ and BayesCπ³, were used for comparison with MixP. The prior distribution of the variances of SNP effects in BayesA follows an inverse-chi-squared distribution, i.e., σ_j² ~ χ⁻²(v, s²)^1,11. In BayesCπ, SNPs with non-zero effects have a common variance that also follows an inverse-chi-squared distribution³. The degrees of freedom (v) of the inverse-chi-squared distribution were set to 5.0. As the SNP genotypes had been standardised, the parameter s² was set without ∑2p_j(1 − p_j) in the denominator, which differs from the formula derived by Habier et al.³ and Gianola et al.¹⁶. In this study, s² = [(v − 2)V_g]/(vM) in BayesA and s² = [(v − 2)V_g]/(πvM) in BayesCπ, where π represents the probability that a SNP has a non-zero effect and is estimated by the MCMC algorithm. V_g is the total additive genetic variance, which was estimated using the R package “EMMREML” (Version 3.1), one of several packages^17,18,19,20,21,22 used to estimate genetic parameters. Before V_g was estimated, a genomic relationship matrix (G matrix) was calculated using the formula²: ${\bf{G}}=\frac{({\bf{B}}-{\bf{P}})({\bf{B}}-{\bf{P}})^{\prime}}{2\sum {p}_{j}(1-{p}_{j})}$, where the jth column of P is a vector of the frequency of allele ‘a’ at the jth locus, i.e., ${{\bf{P}}}_{j}=({p}_{j},{p}_{j},\ldots ,{p}_{j})^{\prime}$. Gibbs sampling was run for 20000 cycles, and the first 10000 cycles were discarded as burn-in.

Simulated data

Both simulated and real data were used to compare the predictive results of the statistical methods. The simulated data had been distributed to the participants of the QTLMAS XII workshop. The data were described in detail by Lund et al.²³; a summary is given here. Through a simulation of a historic population of 50 generations, 4,665 and 1,200 individuals were simulated in the training and testing data sets, respectively. Six thousand biallelic SNP loci were evenly spaced on 6 Morgan chromosomes, and the 5,726 SNPs with a minor allele frequency (MAF) ≥0.05 were used in this study. Forty-eight QTL loci were simulated, and their effects were sampled from a gamma distribution with a scale parameter of 5.4 and a shape parameter of 0.42. The residual values were sampled to obtain a heritability of 0.3 for the trait.
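To make the last two steps concrete, the sketch below draws QTL effects from a gamma distribution with the stated shape and scale and then sizes the residual variance so that the expected heritability is 0.3. It is not the QTLMAS XII simulation itself (the historic population, linkage and any sign assignment of effects are not reproduced); the function name and the random genotype matrix are hypothetical.

```python
import numpy as np

def simulate_phenotypes(qtl_genotypes, h2=0.3, shape=0.42, scale=5.4, rng=None):
    """Simulate phenotypes from QTL genotypes with target heritability h2.

    qtl_genotypes: (n_individuals, n_qtl) array of allele counts (assumed coding).
    Returns (phenotypes, true breeding values, QTL effects).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, n_qtl = qtl_genotypes.shape
    effects = rng.gamma(shape=shape, scale=scale, size=n_qtl)
    tbv = qtl_genotypes @ effects      # true breeding value = sum of QTL effects
    var_g = tbv.var()
    var_e = var_g * (1.0 - h2) / h2    # so that var_g / (var_g + var_e) = h2
    y = tbv + rng.normal(0.0, np.sqrt(var_e), size=n)
    return y, tbv, effects

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(4665, 48)).astype(float)
y, tbv, effects = simulate_phenotypes(genotypes, rng=rng)
```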

Real data on large yellow croaker

The experimental material was large yellow croaker (Larimichthys crocea), one of the most commercially important marine fish species in southeast China and Eastern Asia²⁴. All fish were reared in a breeding nucleus farm, ‘Jinling Aquaculture Science and Technology Co. Ltd.’, in Ningde City, Fujian Province, P.R. China. In total, 30 males and 30 females were mated randomly in a pool, and 500 progeny (237 males and 263 females) were randomly selected and measured in the experiment. The trial was carried out in the Key Laboratory of Healthy Mariculture for the East China Sea when the fish were two years old. Four quantitative traits, body weight, body length, body height and the length/height ratio, were selected for genomic prediction because growth rate and body shape (customers prefer fish with slender bodies) are the important traits for large yellow croaker. The parameters of the four traits are shown in Table 4.

Table 4 Statistical results of the phenotypic data for four quantitative traits in large yellow croaker.

Next generation sequencing and genotyping

Fin samples from 500 individuals were collected for genotyping. The Genotyping-By-Sequencing (GBS) method was used to construct the libraries for next generation sequencing (NGS). Genomic DNA was incubated at 37 °C with EcoRI and NlaIII, CutSmart™ buffer and MilliQ water. Digestion reactions were heat-inactivated at 65 °C for 20 minutes, and the reaction system was then held at 8 °C. The digested DNA was ligated to adapter sequences with CutSmart™ buffer, ATP, T4 DNA ligase, adapter mix and MilliQ water at 16 °C. The restriction-ligation reaction was also heat-inactivated at 65 °C for 20 minutes and then held at 8 °C. PCR was performed using diluted restriction-ligation samples, dNTP, Taq DNA polymerase (NEB), and IlluminaF and indexing primers. Fragments of 200–300 bp were isolated using a Gel Extraction Kit (Qiagen). Paired-end sequencing was then performed on an Illumina high-throughput sequencing platform. The raw sequencing reads were quality checked with FastQC²⁵. The high-quality, filtered reads were mapped to the large yellow croaker reference genome sequence with BWA version 0.7.10²⁶. The alignment files were then sorted and duplicates were marked with Picard (http://picard.sourceforge.net). The GATK package²⁷ was then applied for SNP calling. As a result, 29,748 SNPs with a missing rate ≤20%, a minor allele frequency (MAF) ≥0.05 and genotypes in Hardy-Weinberg equilibrium were selected for further analysis. Beagle Version 3.3.2 software was used to impute the missing SNPs²⁸.
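The missing-rate and MAF thresholds quoted above can be illustrated with a small filter. This is a sketch under assumptions, not the pipeline used in the study: genotypes are assumed to be coded as 0/1/2 allele counts with NaN for missing calls, the function name is hypothetical, and the Hardy-Weinberg equilibrium check is deliberately left out.

```python
import numpy as np

def snp_filter_mask(genotypes, max_missing=0.20, min_maf=0.05):
    """Return a boolean mask of SNPs passing the missing-rate and MAF thresholds.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts,
               with np.nan marking a missing call (assumed coding).
    """
    G = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(G)
    missing_rate = 1.0 - called.mean(axis=0)
    n_called = called.sum(axis=0)
    allele_sum = np.nansum(G, axis=0)
    # Allele frequency among called genotypes; NaN if a SNP has no calls at all.
    freq = np.divide(allele_sum, 2.0 * n_called,
                     out=np.full(G.shape[1], np.nan), where=n_called > 0)
    maf = np.minimum(freq, 1.0 - freq)
    return (missing_rate <= max_missing) & (maf >= min_maf)

example = np.array([
    [0, 0, np.nan, 0],
    [1, 0, np.nan, 2],
    [2, 0, 1,      1],
    [1, 0, 1,      0],
    [0, 0, 1,      1],
])
mask = snp_filter_mask(example)
```

In this toy matrix the second SNP is monomorphic and the third has 40% missing calls, so only the first and fourth columns pass.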

Cross-validation

A replicated training-testing method was used to evaluate the predictive results for the real data. Ten-fold cross-validation was performed: all 500 individuals were randomly and evenly divided into 10 groups of 50 individuals each, and in each replicate one group served as the testing data set while the remaining nine groups were used as the training data set. To observe the relationship between the predictive results of FMixP and γ, we varied the value of γ from 0.01 to 0.5 (50 levels with a step size of 0.01).
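The partition and the γ grid described here can be sketched as follows. The function names and the random seed are hypothetical; the model fit inside each replicate is not shown.

```python
import numpy as np

def make_groups(n_individuals=500, n_groups=10, seed=0):
    """Randomly and evenly split individual indices into n_groups groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_individuals)
    return np.array_split(order, n_groups)

def train_test_splits(groups):
    """Yield (train_idx, test_idx), using each group once as the testing set."""
    for i, test_idx in enumerate(groups):
        train_idx = np.concatenate([g for j, g in enumerate(groups) if j != i])
        yield train_idx, test_idx

# 50 levels of gamma from 0.01 to 0.50 in steps of 0.01.
gamma_grid = np.round(np.arange(1, 51) * 0.01, 2)

groups = make_groups()
splits = list(train_test_splits(groups))
```

Building the grid from integers avoids the floating-point drift that a fractional step in `np.arange` can introduce.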

Predictive accuracy and predictive ability

In the simulation, the correlation coefficient between true and predicted genetic values, $r_{(TBV,GEBV)}$, was used to measure the predictive accuracy¹, where GEBV = Bĝ, ĝ is the vector of estimated SNP effects and B is the matrix of SNP genotypes; the true breeding value (TBV) of an individual is obtained by summing all of its simulated QTL effects. If an individual's GEBV is close to its TBV, the prediction for that individual is accurate. To assess the predictive accuracy of a set of GEBV, one can use $r_{(TBV,GEBV)}$; a higher $r_{(TBV,GEBV)}$ indicates higher predictive accuracy.

In the real data analysis, because the true breeding values are unknown, we used the predictive ability in place of the predictive accuracy. Predictive ability is defined as the correlation coefficient between the GEBV and the phenotypes adjusted for the covariates (${\bf{y}}-{\bf{X}}\hat{{\bf{u}}}$, in which only genetic and residual effects are left), $r_{({\bf{y}}-{\bf{X}}\hat{{\bf{u}}},GEBV)}$²⁹. The higher this correlation, the more genetic variance is captured by the SNPs, and hence the higher the predictive ability.
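Both measures reduce to a Pearson correlation against GEBV = Bĝ. The sketch below shows the two computations side by side; the function names are hypothetical, and the inputs (estimated SNP effects, fixed-effect estimates) are assumed to come from whichever prediction method was fitted.

```python
import numpy as np

def gebv(B, g_hat):
    """GEBV = B @ g_hat for a genotype matrix B and estimated SNP effects g_hat."""
    return np.asarray(B, dtype=float) @ np.asarray(g_hat, dtype=float)

def pearson_r(a, b):
    """Pearson correlation coefficient between two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def predictive_accuracy(tbv, B, g_hat):
    """Simulated data: correlation between true breeding values and GEBV."""
    return pearson_r(tbv, gebv(B, g_hat))

def predictive_ability(y, X, u_hat, B, g_hat):
    """Real data: correlation between adjusted phenotypes y - X u_hat and GEBV."""
    adjusted = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ np.asarray(u_hat, dtype=float)
    return pearson_r(adjusted, gebv(B, g_hat))
```

Because the residual remains in the adjusted phenotype, predictive ability is generally expected to be lower than predictive accuracy for the same model.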

All 500 individuals were included in the prediction model to estimate the computation time of the Bayesian methods. All calculations (except the REML process) were implemented in Fortran 90 code and run on the computer server of Jimei University.

Availability of data

Raw DNA sequencing reads were deposited in NCBI with the project accession PRJNA309464 and SRA accession SRR3114179.

References

1. Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157, 1819–1829 (2001).
2. Vanraden, P. M. Efficient methods to compute genomic predictions. J Dairy Sci. 91, 4414–4423 (2008).
3. Habier, D., Fernando, R. L., Kizilkaya, K. & Garrick, D. J. Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics. 12, 1–12 (2011).
4. Campos, G. D. L. et al. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 182, 375–385 (2009).
5. Mutshinda, C. M. & Sillanpaa, M. J. Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics. 186, 1067–1075 (2010).
6. Yi, N., George, V. & Allison, D. B. Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics. 164, 1129–1138 (2003).
7. Meuwissen, T. H., Solberg, T. R., Shepherd, R. & Woolliams, J. A. A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 41, 1–10 (2009).
8. Shepherd, R. K., Meuwissen, T. H. & Woolliams, J. A. Genomic selection and complex trait prediction using a fast EM algorithm applied to genome-wide markers. BMC Bioinformatics. 11, 2568 (2010).
9. Yu, X. & Meuwissen, T. H. Using the Pareto principle in genome-wide breeding value estimation. Genet Sel Evol. 43, 1–7 (2011).
10. Ronen, B. The Pareto managerial principle: When does it apply? Int J Prod Res. 45, 2317–2325 (2007).
11. Wang, C. S., Rutledge, J. J. & Gianola, D. Marginal inferences about variance components in a mixed linear model using Gibbs sampling. Genet Sel Evol. 25, 41–62 (1993).
12. Habier, D., Fernando, R. L. & Dekkers, J. C. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 177, 2389–2397 (2007).
13. Fernando, R. L. & Garrick, D. Bayesian methods applied to GWAS. Methods in Molecular Biology. 1019, 237–274 (2013).
14. Cheng, H., Long, Q., Garrick, D. J. & Fernando, R. L. A fast and efficient Gibbs sampler for BayesB in whole-genome analyses. Genet Sel Evol. 47, 1–7 (2015).
15. Campos, G. D. L., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. L. Whole genome regression and prediction methods applied to plant and animal breeding. Genetics. 193, 327–345 (2013).
16. Gianola, D., Campos, G. D. L., Hill, W. G., Manfredi, E. & Fernando, R. Additive genetic variability and the Bayesian alphabet. Genetics. 183, 347–363 (2009).
17. Wang, C. et al. GVCBLUP: A computer package for genomic prediction and variance component estimation of additive and dominance effects. BMC Bioinformatics. 15, 1–9 (2014).
18. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics. 178, 1709–1723 (2008).
19. Lee, S. H. & van der Werf, J. H. MTG2: An efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics. 32, 1420–1422 (2016).
20. Covarrubias-Pazaran, G. Genome-Assisted prediction of quantitative traits using the r package sommer. PLoS One. 11, e156744 (2016).
21. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. Am J Hum Genet. 88, 76–82 (2011).
22. Gilmour, A. R. et al. ASReml user guide release 1.0. University of Hamburg Department for. 104, 20617–20637 (2009).
23. Lund, M. S., Sahana, G., Koning, D. J. D. & Su, G. Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection. BMC Proceedings. 3(Suppl 1), S1 (2009).
24. Xiao, S. et al. Functional marker detection and analysis on a comprehensive transcriptome of large yellow croaker by next generation sequencing. PLoS One. 10, e124432 (2015).
25. Xi, Y. et al. HTQC: A fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics. 14, 68–70 (2013).
26. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25, 1754–1760 (2010).
27. Mckenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
28. Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 84, 210–223 (2009).
29. Legarra, A., Robert-Granié, C., Manfredi, E. & Elsen, J. M. Performance of genomic selection in mice. Genetics. 180, 611–618 (2008).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (U1205122), Key projects of the Xiamen Southern Ocean Research Centre (14GZY70NF34) and the Foundation for Innovation Research Team of Jimei University (2010A02). Shijun Xiao performed the SNP discovery. Kun Ye, Qingkai Chen, Junwei Chen, Yang Liu and other colleagues in the laboratory participated in fish sampling and traits measurement. We also thank the editors and reviewers for their many helpful suggestions for this article.

Author information

Authors and Affiliations

Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture; Fisheries College, Jimei University, Xiamen, Fujian, P.R. China
Linsong Dong, Ming Fang & Zhiyong Wang
Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, 266235, P.R. China
Zhiyong Wang

Contributions

Z.W. designed the experiments and revised the manuscript. L.D. performed the analyses and drafted the manuscript. M.F. revised the manuscript. All of the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Zhiyong Wang.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dong, L., Fang, M. & Wang, Z. Prediction of genomic breeding values using new computing strategies for the implementation of MixP. Sci Rep 7, 17200 (2017). https://doi.org/10.1038/s41598-017-17366-2

Received: 22 December 2016
Accepted: 25 October 2017
Published: 08 December 2017
DOI: https://doi.org/10.1038/s41598-017-17366-2
