Introduction

Meuwissen et al. (2001) first proposed the widely used genomic selection method, which uses a dense marker panel for the genetic evaluation of animals and plants. This method achieves higher genetic evaluation accuracy and has the advantage of reducing generation intervals for some species such as dairy cattle with progeny testing schemes. The accuracy of Genomic Best Linear Unbiased Prediction (GBLUP) was assumed to be mainly due to the linkage disequilibrium (LD) between markers and quantitative trait loci (QTL). However, Habier et al. (2007) demonstrated that the accuracy of genomic estimated breeding values (GEBV) depends not only on the LD between markers and QTL, but also on the genetic relationships among individuals captured by markers. According to their simulation study, GEBV accuracy decreases rapidly as the validation generation becomes distant from the generations of the training population, even when LD still exists between markers and QTL. Daetwyler et al. (2012) decomposed the components of GEBV accuracy by using a multi-breed sheep population. Surprisingly, they found that single-nucleotide polymorphism (SNP) markers from a single chromosome could achieve up to 86% of the accuracy of using all SNP markers, thus indicating that GEBV accuracy is not only due to LD between markers and QTL, but also due to population structure or genetic relationships among individuals. Habier et al. (2013) further demonstrated that the accuracy of GEBV within families depends largely on additive–genetic relationship information, and is also determined by the effective number of SNP markers and training data size.

A single-step approach was proposed to overcome the limitation that not all animals are genotyped (Christensen and Lund 2010; Legarra et al. 2009). This approach has the merit of using all the genotyped and non-genotyped animals in one analysis, and it can estimate GEBV for all the animals in the analysis. It has been applied to the genetic evaluation of many livestock species, including pigs, chickens, and cattle (Aguilar et al. 2010; Chen et al. 2011; Christensen et al. 2012; Liu et al. 2014). Christensen et al. (2012) have shown that the single-step method provides improved accuracy for both genotyped and ungenotyped animals, whereas GBLUP can only be implemented for genotyped animals. The single-step method, compared with the GBLUP model, allows for less biased and more accurate GEBV predictions when the population is under strong selection (Vitezica et al. 2011). Moreover, Fernando et al. (2014) have presented single-step Bayesian regression models, which have the merit of modeling SNP effects with more flexible distributions (such as a t-distribution).

However, it is still unclear how different components, such as LD between markers and QTL and population structure, contribute to GEBV accuracy in the single-step analysis. Furthermore, very few studies to date have investigated the relative performance of various single-step Bayesian models (Lee et al. 2017). Therefore, we used a simulation study to investigate how these components contribute to GEBV accuracy in the single-step analysis. Different numbers of generations between the validation and training populations, along with various numbers of QTL, were simulated to show the contributions of these components to GEBV accuracy. We further investigated the performance of different single-step models (SSGBLUP, SS-BayesA, and SS-BayesB) in scenarios with different numbers of QTL (5, 50, and 500).

Materials and methods

Models

SSGBLUP model

Legarra et al. (2009) and Christensen and Lund (2010) first proposed the single-step BLUP model, which has been further extended by Fernando et al. (2014) as follows:

$$\left[ {\begin{array}{*{20}{c}} {{\bf{y}}_1} \\ {{\bf{y}}_2} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{X}}_1} \\ {{\bf{X}}_2} \end{array}} \right]{\beta} + \left[ {\begin{array}{*{20}{c}} {{\bf{Z}}_1} & {\bf{0}} \\ {\bf{0}} & {{\bf{Z}}_2} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{\bf{g}}_1} \\ {{\bf{g}}_2} \end{array}} \right] + {\bf{e}}$$
(1)

where y 1 is the vector of phenotypes for ungenotyped individuals and y 2 is the vector of phenotypes for genotyped individuals. β is the vector of fixed effects, and X 1 and X 2 are the incidence matrices for fixed effects of ungenotyped and genotyped individuals, respectively. Z 1 and Z 2 are the incidence matrices relating phenotypes to the breeding values of ungenotyped and genotyped individuals, respectively. Here, g 1 and g 2 are the GEBV of ungenotyped and genotyped individuals. Fernando et al. (2014) have further extended the SSGBLUP model from the animal model to the marker effect model by defining

$$\left[ {\begin{array}{*{20}{c}} {{\hat {\bf g}}_{\mathrm{1}}} \\ {{\hat {\bf g}}_{\mathrm{2}}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\hat {\bf T}}_{\mathrm{1}}\hat \alpha + \hat \varepsilon } \\ {{\bf{T}}_{\mathrm{2}}\hat \alpha } \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{A}}_{{\mathrm{12}}}{\bf{A}}_{{\mathrm{22}}}^{ - {\mathrm{1}}}{\bf{T}}_{\bf{2}}\hat \alpha + \hat \varepsilon } \\ {{\bf{T}}_{\mathrm{2}}\hat \alpha } \end{array}} \right]$$
(2)

where T 2 is the centered and scaled observed genotype matrix of genotyped individuals (\({\bf{T}}_{\mathrm{2}} = \frac{{\left( {{\bf{M}}_{\bf{j}} - 2p_j} \right)}}{{\sqrt {\mathop {\sum}\limits_{j = 1}^m {2p_j\left( {1 - p_j} \right)} } }}\), M j is a vector of the genotype for all individuals of marker j, p j is the minor allele frequency of marker j), \(\hat \alpha\) is the vector of estimated marker effects, and \({\hat{\bf T}}_{\mathrm{1}}\) is the predicted or imputed genotype matrix for ungenotyped individuals with \({\hat {\bf T}}_{\mathrm{1}} = {\bf{A}}_{{\mathrm{12}}}{\bf{A}}_{{\mathrm{22}}}^{ - {\mathrm{1}}}{\bf{T}}_{\mathrm{2}}\), where A ij is the partition of the pedigree relationship matrix A that relates to \({\hat {\bf g}}_{\mathrm{1}}\) and \({\hat {\bf g}}_{\mathrm{2}}\). The variance and covariance matrix of \({\hat {\bf g}}_{\mathrm{1}}\) and \({\hat {\bf g}}_{\mathrm{2}}\) is \({\mathrm{cov}}\left( {{\hat {\bf g}}_{\mathrm{1}},{\hat {\bf g}}_{\mathrm{2}}} \right) = {\bf{H}}\), where H was defined as (Legarra et al. 2009):

$${\bf{H}} = \left[ {\begin{array}{*{20}{c}} {{\bf{A}}_{{\mathrm{11}}}{\bf{ + A}}_{{\mathrm{12}}}{\bf{A}}_{{\mathrm{22}}}^{{\mathrm{ - 1}}}\left( {{\bf{G}}_{\mathrm{2}}{\bf{ - A}}_{{\mathrm{22}}}} \right){\bf{A}}_{{\mathrm{22}}}^{{\mathrm{ - 1}}}{\bf{A}}_{{\mathrm{12}}}} & {{\bf{A}}_{{\mathrm{12}}}{\bf{A}}_{{\mathrm{22}}}^{{\mathrm{ - 1}}}{\bf{G}}_{\mathrm{2}}} \\ {{\bf{G}}_{\mathrm{2}}{\bf{A}}_{{\mathrm{22}}}^{{\mathrm{ - 1}}}{\bf{A}}_{{\mathrm{12}}}} & {{\bf{G}}_{\mathrm{2}}} \end{array}} \right]$$
(3)

where G 2 is the genomic relationship matrix for genotyped individuals. The marker effects are assumed to be normally distributed with \(N\left( {0{\mathrm{,}}{\rm I}\sigma _\alpha ^2} \right)\). The imputation residuals, ε, are assumed to follow a multivariate normal distribution, \(N\left( {{\bf{0}},\left( {{\bf{A}}_{11} - {\bf{A}}_{12}{\bf{A}}_{22}^{ - 1}{\bf{A}}_{12} \prime } \right)\sigma _g^2} \right)\). Here, \(\sigma _\alpha ^2\) and \(\sigma _g^2\) are the SNP variance and polygenic variance, respectively. The model further becomes

$$\begin{array}{ccccc}\\ \left[ {\begin{array}{*{20}{c}} {{\bf{y}}_{\bf{1}}} \\ {{\bf{y}}_{\bf{2}}} \end{array}} \right] & = \left[ {\begin{array}{*{20}{c}} {{\bf{X}}_{\bf{1}}} \\ {{\bf{X}}_{\bf{2}}} \end{array}} \right]\beta + \left[ {\begin{array}{*{20}{c}} {{\bf{Z}}_{\bf{1}}} & {\bf{0}} \\ {\bf{0}} & {{\bf{Z}}_{\bf{2}}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{\hat {\bf T}}_{\bf{1}}\alpha + \varepsilon } \\ {{\bf{T}}_{\bf{2}}\alpha } \end{array}} \right] + {\bf{e}}\\ \\ & = {\bf{X}}\beta + {\bf{W}}\alpha + {\bf{U}}\varepsilon + {\mathrm{e}}\\ \end{array}$$
(4)

where \({\bf{X}} = \left[ {\begin{array}{*{20}{c}} {{\bf{X}}_1} \\ {{\bf{X}}_2} \end{array}} \right], {{\bf{U}} = \left[ {\begin{array}{*{20}{c}} {{\bf{Z}}_{\mathrm{1}}} \\ {\bf{0}} \end{array}} \right]}\), and \({\bf{W}} = \left[ {\begin{array}{*{20}{c}} {{\bf{W}}_{\mathrm{1}}} \\ {{\bf{W}}_{\mathrm{2}}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{Z}}_{\mathrm{1}}{\hat {\bf T}}_{\mathrm{1}}} \\ {{\bf{Z}}_{\mathrm{2}}{\bf{T}}_{\mathrm{2}}} \end{array}} \right]\).
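As an illustration of how the quantities in Eqs. (2) and (4) fit together, the following Python sketch centers and scales a small genotype matrix, forms the regression coefficients A 12 A 22 −1, and computes the imputed genotypes and the covariance factor of the imputation residuals. It is not the R and C implementation used in this study. The toy pedigree (a genotyped sire, dam, and full sib, plus one ungenotyped full sib), the genotype values, and all function names are assumptions made for this example, and allele frequencies are simply estimated from the genotyped sample.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a x = b for a square matrix a and a matrix b (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def center_and_scale(genotypes):
    """Return (M - 2p) / sqrt(sum_j 2 p_j (1 - p_j)) and the allele frequencies p."""
    n, m = len(genotypes), len(genotypes[0])
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    scale = math.sqrt(sum(2.0 * pj * (1.0 - pj) for pj in p))
    scaled = [[(row[j] - 2.0 * p[j]) / scale for j in range(m)] for row in genotypes]
    return scaled, p

# Genotyped individuals (allele counts at three markers): sire, dam, full sib.
M2 = [[0, 1, 2],
      [2, 2, 0],
      [1, 2, 1]]
T2, p = center_and_scale(M2)

# Assumed pedigree relationships for one ungenotyped full sib of the third individual.
A11 = [[1.0]]
A12 = [[0.5, 0.5, 0.5]]
A22 = [[1.0, 0.0, 0.5],
       [0.0, 1.0, 0.5],
       [0.5, 0.5, 1.0]]

# A12 A22^-1 from a linear solve (A22 is symmetric), avoiding an explicit inverse.
coeff = transpose(solve(A22, transpose(A12)))
T1_hat = matmul(coeff, T2)

# Covariance factor of the imputation residuals: A11 - A12 A22^-1 A12'.
quad = matmul(coeff, transpose(A12))
resid_factor = [[A11[i][j] - quad[i][j] for j in range(len(A11))]
                for i in range(len(A11))]
```

In this toy pedigree the regression coefficients are 0.5 for each parent and 0 for the genotyped full sib, so the imputed genotype is the parental average and the residual factor is 0.5. The weights come from A 12 A 22 −1 as a whole rather than from the elements of A 12 alone.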

The mixed model equation (MME) corresponding to Eq. (4) for the SSGBLUP marker effects model is

$$\left[ {\begin{array}{*{20}{c}} {{\bf{X \prime X}}} & {{\bf{X \prime W}}} & {{\bf{X \prime Z}}_1} \\ {{\bf{W}} \prime {\bf{X}}} & {{\bf{W}} \prime {\bf{W}} + {\bf{I}}\frac{{\sigma _e^2}}{{\sigma _\alpha ^2}}} & {{\bf{W}} \prime {\bf{Z}}_1} \\ {{\bf{Z}}_1 \prime {\bf{X}}} & {{\bf{Z}}_1 \prime {\bf{W}}} & {{\bf{Z}}_1 \prime {\bf{Z}}_1 + {\bf{A}}^{11}\frac{{\sigma _e^2}}{{\sigma _g^2}}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {\hat \beta } \\ {\hat \alpha } \\ {\hat \varepsilon } \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{X \prime y}}} \\ {{\bf{W \prime y}}} \\ {{\bf{Z}}_1^{\bf{ \prime }}{\bf{y}}_1} \end{array}} \right]$$
(5)

where \({\bf{A}}^{11} = \left( {{\bf{A}}_{11} - {\bf{A}}_{12}{\bf{A}}_{22}^{ - 1}{\bf{A}}_{12} \prime } \right)^{ - 1}\), \({\bf{y}} = \left[ {\begin{array}{*{20}{c}} {{\bf{y}}_{\bf{1}}} \\ {{\bf{y}}_{\bf{2}}} \end{array}} \right]\), and \(\sigma _e^2\) is the residual variance.
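To make the block structure of Eq. (5) concrete, the sketch below assembles and solves the mixed model equations for a deliberately tiny case: one ungenotyped and two genotyped individuals, an overall mean as the only fixed effect, and two markers. All numbers (phenotypes, genotypes, variance ratios) are made up for illustration, the incidence matrices Z 1 and Z 2 are taken to be identity matrices, and the direct solver is a stand-in for illustration only; the analyses in this study were run by MCMC sampling.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a x = b for a square matrix a and a flat right-hand side b (Gauss-Jordan)."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Toy data: individual 0 is ungenotyped, individuals 1 and 2 are genotyped.
y = [[1.2], [0.4], [2.1]]
X = [[1.0], [1.0], [1.0]]             # overall mean only
T2 = [[0.8, -0.4], [-0.2, 0.6]]       # centered, scaled genotypes (made up)
T1_hat = [[0.3, 0.1]]                 # imputed row, here the parental average
W = T1_hat + T2                       # W = [Z1 T1_hat; Z2 T2] with Z1, Z2 identity
U = [[1.0], [0.0], [0.0]]             # U = [Z1; 0]
A11_inv = [[2.0]]                     # A^11, the inverse of A11 - A12 A22^-1 A12' = 0.5
lam_alpha, lam_g = 5.0, 4.0           # assumed sigma_e^2/sigma_alpha^2 and sigma_e^2/sigma_g^2

blocks = [X, W, U]
# Left-hand side: cross-products B_i' B_j, assembled one block row at a time.
lhs = []
for bi in blocks:
    block_row = [matmul(transpose(bi), bj) for bj in blocks]
    for r in range(len(block_row[0])):
        lhs.append([v for blk in block_row for v in blk[r]])

# Add the shrinkage terms to the marker and imputation-residual diagonal blocks.
n_beta, n_alpha = len(X[0]), len(W[0])
off = n_beta + n_alpha
for j in range(n_alpha):
    lhs[n_beta + j][n_beta + j] += lam_alpha
for i in range(len(A11_inv)):
    for j in range(len(A11_inv)):
        lhs[off + i][off + j] += A11_inv[i][j] * lam_g

# Right-hand side: [X'y; W'y; U'y], where U'y equals Z1'y1.
rhs = [v for b in blocks for row in matmul(transpose(b), y) for v in row]

solution = solve(lhs, rhs)
beta_hat = solution[:n_beta]
alpha_hat = solution[n_beta:off]
eps_hat = solution[off:]

# GEBV as in Eq. (2).
g1_hat = [sum(t * a for t, a in zip(T1_hat[0], alpha_hat)) + eps_hat[0]]
g2_hat = [sum(t * a for t, a in zip(row, alpha_hat)) for row in T2]
```

Replacing the scalar `lam_alpha` on the marker diagonal with marker-specific values gives the same structure as the SS-BayesA equations, conditional on the sampled marker variances.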

SS-BayesA/B model

The BayesA/B model can be simply extended to the single-step analysis by using the predicted genotypes for the ungenotyped individuals (Fernando et al. 2014). In the BayesA model, marker variances are assumed to be different for different SNP markers, and marker variances are commonly handled with a scaled-inverse χ 2 prior (Fernando and Garrick 2013; Gianola et al. 2009; Meuwissen et al. 2001):

$$p(\sigma _{\alpha _j}^2|\nu _\alpha ,s_\alpha ^2) = \frac{{\left( {\frac{{\nu _\alpha s_\alpha ^2}}{2}} \right)^{\frac{{\nu _\alpha }}{2}}}}{{\Gamma \left( {\frac{{\nu _\alpha }}{2}} \right)}}\left( {\sigma _{\alpha _j}^2} \right)^{ - \left( {\frac{{\nu _\alpha }}{2} + 1} \right)}e^{ - \frac{{\nu _\alpha s_\alpha ^2}}{{2\sigma _{\alpha _j}^2}}}$$
(6)

where ν α and \(s_\alpha ^2\) are the degrees of freedom and scale of the scaled-inverse χ 2 prior, respectively, and j indexes the markers. The mixed model equation (MME) for the single-step BayesA (SS-BayesA) model further becomes

$$\left[ {\begin{array}{*{20}{c}} {{\bf{X \prime X}}} & {{\bf{X \prime W}}} & {{\bf{X \prime Z}}_{\bf{1}}} \\ {{\bf{W \prime X}}} & {{\bf{W \prime W}} + {\bf{D}}^{ - 1}\sigma _e^2} & {{\bf{W \prime Z}}_{\bf{1}}} \\ {{\bf{Z}}_{\bf{1}}{\bf{ \prime X}}} & {{\bf{Z}}_{\bf{1}}{\bf{ \prime W}}} & {{\bf{Z}}_{\bf{1}}{\bf{ \prime Z}}_{\bf{1}} + {\bf{A}}^{11}\frac{{\sigma _e^2}}{{\sigma _g^2}}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {\hat \beta } \\ {\hat \alpha } \\ {\hat \varepsilon } \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{X \prime y}}} \\ {{\bf{W \prime y}}} \\ {{\bf{Z}}_{\bf{1}}{\bf{ \prime y}}_{\bf{1}}} \end{array}} \right]$$
(7)

where \({\bf{D}} = \rm{Diag}\left( {\sigma _{\alpha _j}^2} \right)\).
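The scaled-inverse χ 2 prior of Eq. (6) and the marker-specific shrinkage term in Eq. (7) can be sketched as follows. The hyper-parameter values (`nu_alpha`, `s2_alpha`), the residual variance, and the function names are hypothetical choices for illustration, and the draws come from the prior only, not from the full conditional posteriors used in the actual MCMC scheme.

```python
import math
import random

def scaled_inv_chi2_logpdf(x, nu, s2):
    """Log density of the scaled-inverse chi-square distribution at x > 0."""
    half_nu = nu / 2.0
    return (half_nu * math.log(half_nu * s2) - math.lgamma(half_nu)
            - (half_nu + 1.0) * math.log(x) - nu * s2 / (2.0 * x))

def sample_scaled_inv_chi2(nu, s2, rng):
    """Draw nu * s2 / chi2_nu, with chi2_nu generated as Gamma(shape=nu/2, scale=2)."""
    return nu * s2 / rng.gammavariate(nu / 2.0, 2.0)

rng = random.Random(1)
nu_alpha, s2_alpha = 10.0, 1.0   # hypothetical hyper-parameters
marker_vars = [sample_scaled_inv_chi2(nu_alpha, s2_alpha, rng) for _ in range(20000)]
mean_var = sum(marker_vars) / len(marker_vars)

# Diagonal of D^-1 * sigma_e^2: each marker gets its own shrinkage parameter.
sigma_e2 = 1.0                   # hypothetical residual variance
shrinkage = [sigma_e2 / v for v in marker_vars[:5]]
```

For ν α > 2 the prior mean is ν α s α 2 / (ν α − 2), so `mean_var` should be close to 1.25 with these settings. Markers that draw a small variance receive a large shrinkage value and are pulled strongly toward zero.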

For the single-step BayesB (SS-BayesB) model, the mixed model equation (MME) is the same as Eq. (7), and marker effects are assumed to be independently distributed as follows:

$$\alpha _j|\pi _m,\sigma _{\alpha _j}^2 = \left\{ {\begin{array}{*{20}{l}} 0 \hfill & {{\mathrm{with}}\,{\mathrm{probability}}\,\left( {1 - \pi _m} \right)} \hfill \\ {\sim N\left( {0,\sigma _{\alpha _j}^2} \right)} \hfill & {{\mathrm{with}}\,{\mathrm{probability}}\,\pi _m} \hfill \end{array}} \right.$$
(8)

where π m is the proportion of markers that have non-zero effect. We can estimate π m using a Beta(α π ,β π ) prior (Habier et al. 2011). π m was fixed at 0.01 in this study.
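The mixture prior of Eq. (8) can be illustrated by drawing marker effects directly from it. In this sketch the variance of each non-zero effect is drawn from a scaled-inverse χ 2 distribution, as for BayesA; that choice, the hyper-parameter values, and the function name are assumptions for illustration, while π m = 0.01 and the roughly 40,000 markers follow the settings of this study.

```python
import random

def sample_bayesb_effects(n_markers, pi_m, nu, s2, rng):
    """Draw marker effects from a point mass at zero mixed with N(0, sigma_j^2)."""
    effects = []
    for _ in range(n_markers):
        if rng.random() < pi_m:
            # Non-zero effect: marker-specific variance, then a normal draw.
            var_j = nu * s2 / rng.gammavariate(nu / 2.0, 2.0)
            effects.append(rng.gauss(0.0, var_j ** 0.5))
        else:
            effects.append(0.0)
    return effects

rng = random.Random(42)
effects = sample_bayesb_effects(40000, 0.01, 4.0, 0.05, rng)
n_nonzero = sum(1 for a in effects if a != 0.0)
```

With these settings the expected number of non-zero effects is 40,000 × 0.01 = 400, which is the figure that matters when the fixed π m is compared with the number of simulated QTL.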

The joint posterior densities of each single-step model and the Markov Chain Monte Carlo (MCMC) sampling strategies for the other parameters and hyper-parameters are described in Supplementary File 1.

Data simulation

A simulation study was conducted with the program QMSim (Sargolzaei and Schenkel 2009). First, 5000 historical generations (generations 1–5000), each with 2000 animals, were simulated to generate LD between SNP markers (Fig. 1). Then, five recent generations (generations 5001–5005) were generated from the last historical generation (generation 5000) by random mating of 50 randomly selected males and 1000 females from the previous generation. There was no selection for the trait in any recent generation. In the recent population, each female had one offspring, assuming a male-to-female ratio of 1:1, and each recent generation had 1000 individuals. The dam culling rate was 0.5: fifty percent of dams were from the last generation, and 50% were from the generation before that. All sires were from the last generation. For the genome, we simulated 20 chromosomes for each individual, and each chromosome had a length of one Morgan. On each chromosome, 2000 SNP markers were generated in generation 1. After data editing (MAF > 0.01 and r 2 ≤ 0.98), the total number of SNPs retained for the analysis was close to 40,000 (ranging from 39,956 to 39,972 across replicates). For the phenotype, heritability was set at 0.2. All individuals’ phenotypes were generated by summing true breeding values (QTL genotypes multiplied by QTL effects) and residual effects (sampled from a normal distribution). Three scenarios with different numbers of QTL (5, 50, or 500) were considered, and QTL were randomly selected among all the SNPs. All other SNPs were assumed to have no effect on the trait. QTL effects were simulated from a normal distribution. The total number of replicates was 10 for each QTL scenario.
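The phenotype construction described above (true breeding value plus a normal residual, with heritability 0.2) can be sketched in a few lines. This is not QMSim and does not reproduce its population history; genotypes are drawn independently from assumed allele frequencies, the population and marker counts are scaled down, and all variable names are choices made for this example.

```python
import random

def variance(values):
    """Sample variance with an n - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(7)
n_ind, n_snp, n_qtl, h2 = 500, 200, 5, 0.2   # scaled-down, illustrative sizes

# Allele counts 0/1/2 drawn per marker from an assumed allele frequency.
freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snp)]
genotypes = [[(rng.random() < f) + (rng.random() < f) for f in freqs]
             for _ in range(n_ind)]

# QTL are a random subset of the SNPs; all other SNPs have no effect.
qtl_idx = rng.sample(range(n_snp), n_qtl)
qtl_eff = {j: rng.gauss(0.0, 1.0) for j in qtl_idx}

# True breeding value: QTL genotypes multiplied by QTL effects, summed over QTL.
tbv = [sum(row[j] * a for j, a in qtl_eff.items()) for row in genotypes]

# Residual variance chosen so that sigma_g^2 / (sigma_g^2 + sigma_e^2) = h2.
sigma_g2 = variance(tbv)
sigma_e2 = sigma_g2 * (1.0 - h2) / h2
phenotypes = [g + rng.gauss(0.0, sigma_e2 ** 0.5) for g in tbv]
```

With h 2 = 0.2 the residual variance is four times the genetic variance, regardless of how many QTL contribute to the true breeding values.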

Fig. 1 The simulated historical and recent generations

Data analysis

To investigate the influence of genetic relationships between the training and validation populations on the GEBV accuracy, we carried out single-step analysis by using all individuals from generations 5000–5002 as the training population, and all individuals from each of generations 5003, 5004, and 5005 as a separate validation population. For each QTL scenario (5, 50, and 500 QTL), the design of the training and validation populations is shown in Table 1. To mimic the single-step analysis, we set the genotyping rate at 50% for the training population and for each validation population. The genotyped individuals were randomly selected from the training and validation populations, and the remaining individuals were treated as ungenotyped. To compare the prediction performance of different models, we computed GEBV accuracy as the correlation between GEBV and true breeding values (TBV). GEBV for the ungenotyped individuals were computed by \({\hat {\bf g}}_{\bf{1}} = {\hat {\bf T}}_{\bf{1}}\hat \alpha + \hat \varepsilon\), and GEBV for the genotyped individuals were computed by \({\hat {\bf g}}_2 = {\bf{T}}_{\bf{2}}\hat \alpha\).
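The two bookkeeping steps in this design, the random 50% genotyping split and the accuracy computed as a correlation, can be sketched as below. The TBV and GEBV values are random placeholders rather than model output, and the helper name `pearson` and the population size are assumptions for this example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(3)
ids = list(range(1000))                      # one validation generation

# Randomly genotype half of the individuals; the rest are treated as ungenotyped.
genotyped = sorted(rng.sample(ids, len(ids) // 2))
genotyped_set = set(genotyped)
ungenotyped = [i for i in ids if i not in genotyped_set]

# Placeholder TBV and GEBV (not produced by any of the single-step models).
tbv = [rng.gauss(0.0, 1.0) for _ in ids]
gebv = [0.6 * t + rng.gauss(0.0, 0.8) for t in tbv]

# Accuracy is reported separately for the two groups.
acc_genotyped = pearson([gebv[i] for i in genotyped], [tbv[i] for i in genotyped])
acc_ungenotyped = pearson([gebv[i] for i in ungenotyped], [tbv[i] for i in ungenotyped])
```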

Table 1 The training and validation design for genomic prediction of the single-step analysis

To analyze the simulated data, we ran MCMC for 50,000 iterations with 5000 as the burn-in for all three models (SSGBLUP, SS-BayesA, and SS-BayesB) for each replicate within each QTL scenario. For the SS-BayesA model, we estimated the hyper-parameters scale (\(s_\alpha ^2\)) and degrees of freedom (ν α ). For the SS-BayesB model, to simplify the model structure, we fixed the proportion of non-zero effect SNPs (π m ) at 0.01. The data were analyzed with self-developed R and C code, which is available on request.

Results

Influence of the relationships between the training and validation populations

We investigated the GEBV accuracy by using different generations (5003, 5004, and 5005) as the validation population and generations 5000–5002 as the training population for all three single-step models (SSGBLUP, SS-BayesA, and SS-BayesB). Figure 2 shows GEBV accuracy for both genotyped and ungenotyped individuals, with generations 5003, 5004, and 5005 each used as the validation population, for all QTL scenarios. As the generation gap between the validation and training populations increased, GEBV accuracy decreased significantly for both genotyped and ungenotyped individuals. For SSGBLUP, GEBV accuracy always decreased with validation generation for both ungenotyped and genotyped individuals. For SS-BayesA and SS-BayesB, the GEBV accuracy for genotyped individuals decreased less sharply than with SSGBLUP in the scenarios with 5 or 50 QTL. For ungenotyped validation individuals, the accuracy of GEBV decreased as the generation of the validation population increased. However, for genotyped individuals, the influence of the validation generation on GEBV accuracy depended more strongly on both the single-step model and the number of QTL (Fig. 2).

Fig. 2 Accuracy of GEBV for different validation population generations, using SSGBLUP, SS-BayesA, and SS-BayesB models at scenarios of 5, 50, and 500 QTL. These results are the means and standard errors of 10 replications. Different letters indicate a significant difference at P value < 0.05

Comparison of single-step models

We also compared GEBV accuracy for the three single-step models (SSGBLUP, SS-BayesA, and SS-BayesB). For the scenarios with 5 and 50 QTL, SS-BayesA and SS-BayesB always achieved higher accuracy than SSGBLUP, and SS-BayesB performed better than SS-BayesA at validation population generations 5003 and 5004 (Fig. 3). However, when the number of QTL was 500 in the simulation, no advantage of SS-BayesA and SS-BayesB was found, and SS-BayesB realized the lowest GEBV accuracy. These findings indicated that the single-step Bayesian-type models had an advantage over the SSGBLUP model when there were fewer QTL affecting the trait. Moreover, considering the scenario with 5 and 50 QTL, we observed that the single-step Bayesian models exceeded SSGBLUP by a larger margin for the genotyped animals than for the ungenotyped animals (Fig. 3). When there were only 5 QTL in the simulation (with h 2 = 0.2), the GEBV accuracy of SS-BayesA and SS-BayesB for genotyped animals exceeded 0.93, while it was below 0.5 with SSGBLUP.

Fig. 3 Accuracy of GEBV for different single-step models at each validation generation. Note: The single-step models are SSGBLUP (single-step GBLUP), SS-BayesA (single-step BayesA), and SS-BayesB (single-step BayesB). These results are the means and standard errors of 10 replications. Different letters indicate a significant difference at P value < 0.05

Influence of different number of QTL

We further compared the effect of different numbers of QTL for each single-step model. According to Fig. 4, the GEBV accuracy of SSGBLUP did not change significantly as the number of QTL increased. However, the GEBV accuracy of SS-BayesA and SS-BayesB for both ungenotyped and genotyped individuals decreased significantly when the number of QTL increased. These results indicate that single-step Bayesian models are more sensitive to the number of QTL affecting the trait, even when the relationship structure of the training and validation populations is almost the same across the QTL scenarios. In contrast, SSGBLUP is robust to the number of QTL.

Fig. 4 Accuracy of GEBV for different numbers of QTL at each validation generation. Note: The single-step models are SSGBLUP (single-step GBLUP), SS-BayesA (single-step BayesA), and SS-BayesB (single-step BayesB). These results are the means and standard errors of 10 replications. Different letters indicate a significant difference at P value < 0.05

Discussion

The objective of this study was to analyze the influence of relationships between the training and validation populations and of LD between markers and QTL on the GEBV accuracy with various single-step models. We further extended the single-step Bayesian models of Fernando and Garrick (2013) to a single-step BayesB model, and investigated three single-step models (SSGBLUP, SS-BayesA, and SS-BayesB) with a simulation study. To investigate the influence of relationships between the training and validation populations, we used each of three successive generations (5003, 5004, and 5005) as the validation population. Generally, GEBV accuracy decreased for all single-step models as the distance of the validation population (measured as the number of generations between the training and validation populations) increased, which was in agreement with the results of many previous studies (Habier et al. 2013; Habier et al. 2010; Kang et al. 2016; Wolc et al. 2011). The relationship between the training and validation populations influenced GEBV accuracy more substantially than LD between markers and QTL. Moreover, we observed that Bayesian-type single-step models (SS-BayesA and SS-BayesB) outperformed SSGBLUP in the scenarios with fewer QTL (5 or 50 QTL), whereas SSGBLUP outperformed SS-BayesB when the number of QTL reached 500 in the simulation.

Influence of relationships between the training and validation populations

GEBV accuracy decreased as the generation of the validation population increased, especially with SSGBLUP. To further investigate the reason for this, we computed the average linkage disequilibrium (r 2) of all adjacent SNP pairs for all individuals in each recent generation (generations 5001–5005), along with the means and standard deviations of pedigree-based genetic relationships (A 12) between the training (generations 5000–5002) and validation populations (each of generations 5003–5005). It can be seen from Fig. 5 that the LD between adjacent SNP markers increased slightly with generation number. The average A 12 between the training and validation populations was almost the same for validation generations 5003, 5004, and 5005. However, the standard deviation of A 12 decreased with generation number. These results indicated that more individuals had close genetic relationships with the training population in validation generation 5003 than in generation 5005. These few animals that had close relationships with the training population caused the overall GEBV accuracy of generation 5003 to be higher than that of generation 5005 (results not shown).
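One way to compute the average LD statistic described above is the squared correlation between allele counts at neighbouring markers, averaged over all adjacent pairs. The sketch below takes that reading as an assumption; genotype-based r 2 only approximates haplotype-based r 2, and the function name and the toy genotypes are illustrative.

```python
def mean_adjacent_r2(genotypes):
    """Mean squared genotype correlation between neighbouring markers.

    genotypes: list of individuals, each a list of allele counts (0/1/2)
    ordered by map position. Pairs involving a monomorphic marker are skipped.
    """
    n_snp = len(genotypes[0])
    cols = [[row[j] for row in genotypes] for j in range(n_snp)]
    values = []
    for a, b in zip(cols, cols[1:]):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        saa = sum((x - ma) ** 2 for x in a)
        sbb = sum((x - mb) ** 2 for x in b)
        if saa == 0 or sbb == 0:
            continue
        sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
        values.append(sab * sab / (saa * sbb))
    return sum(values) / len(values) if values else float("nan")

# Four individuals, three markers; the first two markers are identical.
toy = [[0, 0, 2],
       [1, 1, 1],
       [2, 2, 0],
       [1, 1, 2]]
mean_r2 = mean_adjacent_r2(toy)
```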

Fig. 5 The plot of averaged linkage disequilibrium (r 2) and the means and standard deviations of A 12 by generation. Note: The r 2 was calculated using genotypes of all adjacent SNP markers, and A 12 was the pedigree-based numerator relationship between the training and validation populations. All these statistics are the means of 10 replicates in the scenario of 50 QTL

Habier et al. (2010) have also found that the accuracy of GEBV for four traits (milk yield, fat yield, protein yield, and somatic cell score) decreased when the relationship between the training and validation populations decreased in German Holstein bulls’ data. Kang et al. (2016) have also found that the GEBV accuracy declined by generation in the single-step analysis with a simulation study. Daetwyler et al. (2012) have conducted a genomic prediction analysis using a multiple-breed sheep population, and have also found that a large amount of GEBV accuracy was due to population structure or family relationships instead of LD between markers and QTL at current marker densities. Therefore, our results and those of Kang et al. (2016) indicate that the accuracy of GEBV for the single-step analysis decreases when the generation gap between the training and validation populations increases. In addition, our results indicate that this decrease in GEBV accuracy occurred even when the LD between markers increased marginally (with P value <0.001 for t-test of r 2) (Fig. 5). This finding indicates that the relationship between training and validation populations plays a more important role than the LD between markers and QTL in the GEBV accuracy for both genotyped and ungenotyped individuals, especially with the single-step BLUP model. Habier et al. (2007) have also concluded that the GBLUP model (or RR-BLUP) was influenced mostly by genetic relationships.

For the ungenotyped individuals, a pedigree-based relationship is used for the prediction of their genotypes. According to the formula for the predicted genotype of ungenotyped individuals, \({\hat {\bf T}}_1 = {\bf{A}}_{12}{\bf{A}}_{22}^{ - 1}{\bf{T}}_2\), it is obvious that a larger element in A 12 corresponds to a higher regression coefficient for the corresponding element of T 2 (here, T 2 is the centered genotype matrix of genotyped individuals and \({\hat {\bf T}}_{\bf{1}}\) is the predicted genotype matrix of ungenotyped individuals). Chen et al. (2014) have also reported that individuals with close relatives in the training population had higher genotype imputation accuracy and higher accuracy of genomic prediction. Our results further illustrate that the genetic relationships between the training and validation populations affect GEBV accuracy more strongly for ungenotyped individuals than for genotyped individuals in single-step analysis (Fig. 2). This is because the GEBV of ungenotyped individuals is composed of two parts: (1) estimated marker effects and (2) imputation residuals. Imputation residuals are estimated on the basis of a pedigree relationship matrix \(\left({\mathrm{\varepsilon }} \sim N\left( {{\bf{0}},\left( {{\bf{A}}_{11} - {\bf{A}}_{12}{\bf{A}}_{22}^{ - 1}{\bf{A}}_{12} \prime } \right)\sigma _g^2} \right)\right)\), whereas the GEBV of genotyped individuals only depends on the estimated marker effects.

Single-step models comparison

In this study, we used three single-step models: SSGBLUP, SS-BayesA, and SS-BayesB. The Bayesian-type models outperformed the SSGBLUP model when there were fewer QTL (5 or 50 QTL) in the simulation. These results were in agreement with both previous simulation and real data analyses of genomic selection (Habier et al. 2007; Hayes et al. 2009; Meuwissen et al. 2001). The SS-BayesB model showed extremely optimistic prediction ability in the case of 5 QTL (Fig. 3). Here the GEBV accuracy values were 0.98 for SS-BayesB and 0.48 for SSGBLUP. The advantage of SS-BayesA and SS-BayesB was mainly due to both models assuming a non-normal distribution of marker effects, in which a t-distribution was assumed for SS-BayesA and a two-component mixture distribution for SS-BayesB (Gianola et al. 2009; Habier et al. 2011). These assumptions about marker effects more closely matched the QTL and marker structure in our simulation, especially in the 5 and 50 QTL scenarios. Meanwhile, according to Figs. A1–A3, the estimated marker effects had a much larger range (−600.0 to 200.0) for SS-BayesA/B than for SSGBLUP (−10.0 to 10.0) in the 5 and 50 QTL scenarios. Zhang et al. (2016) found greater accuracies using weighted genomic relationships (vs. regular single-step GBLUP, BayesB, and BayesC) when few QTL were simulated, and their weighted genomic relationship approaches (WssGBLUP) were more similar to the SS-BayesA/B models in this study.

However, when the number of QTL increased to 500, the Bayesian-type models had no advantage over SSGBLUP (Fig. 3). Interestingly, SS-BayesB obtained lower accuracy than SSGBLUP and SS-BayesA for the 500 QTL scenario. In this scenario, 500 SNPs were simulated as QTL in the phenotype simulation, while only ~400 SNPs (1% of 40,000 SNPs) were allowed to have non-zero effects in SS-BayesB model, as the non-zero proportion (π m ) of markers was fixed at 0.01. This fixation potentially limited the power of SS-BayesB model to capture all existing 500 QTL.

Generally, considering all different scenarios of QTL, SS-BayesA model was the most efficient and robust according to our simulation analysis. The SS-BayesB model with the freedom of estimating π m would capture more LD between markers and QTL, and may obtain better GEBV prediction performance. Karaman et al. (2016) have reported that BayesB and BayesC have no advantage over GBLUP when the reference population is small (<6000 individuals). Therefore, given the findings from Karaman et al. (2016) and several other studies (Habier et al. 2013; Habier et al. 2010; Kang et al. 2016), the advantages of Bayesian models in genomic selection and single-step analysis depend on the training population size, number of QTL for the trait, and other potential factors.

Three single-step models also performed differently in terms of prediction bias. On the basis of regression coefficients of TBV on GEBV and means of deviation between TBV and GEBV, SSGBLUP model achieved the least prediction bias for ungenotyped individuals, and SS-BayesA and SS-BayesB models realized less prediction bias for genotyped individuals. These models need to be further investigated for prediction bias for the application in real data.
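The two bias measures mentioned above can be computed as in the following sketch. The function names and the toy values are assumptions for illustration; a regression coefficient of TBV on GEBV close to 1 and a mean deviation close to 0 are usually read as indicating little bias.

```python
def regression_slope(y, x):
    """Least-squares slope of the regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def mean_deviation(tbv, gebv):
    """Average of TBV minus GEBV over individuals."""
    return sum(t - g for t, g in zip(tbv, gebv)) / len(tbv)

# Toy values: GEBV are deflated (half the spread of TBV) and shifted upward by 1.
tbv = [1.0, 2.0, 3.0, 4.0]
gebv = [0.5 * t + 1.0 for t in tbv]

slope_tbv_on_gebv = regression_slope(tbv, gebv)
bias_mean = mean_deviation(tbv, gebv)
```

In this toy case the slope is 2, reflecting GEBV with too little spread, and the mean deviation is 0.25.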

Influence of the number of QTL

The three single-step models performed differently as the number of QTL increased in the simulation. For the SSGBLUP model, the GEBV accuracy changed minimally as the number of QTL increased from 5 to 500 (Fig. 4). This could be explained by the fact that SSGBLUP mainly utilized the genomic relationship among the training and validation populations, instead of capturing the LD between markers and QTL. When the number of QTL changed from 5 to 500, the genetic relationship between the training and validation populations did not vary, and thus the GEBV accuracy of SSGBLUP showed little change. However, for the SS-BayesA and SS-BayesB models, the GEBV accuracy decreased as the number of QTL increased. This was because the marker effects were more accurately estimated for the scenarios of 5 and 50 QTL with the single-step Bayesian models. From Figs. A2 and A3 in the Supplementary File 2 (the plots of estimated marker effects for one replicate), it can be seen that only a few SNPs adjacent to the true QTL were estimated with large non-zero effects by SS-BayesA and SS-BayesB at the 5 and 50 QTL scenarios. When the number of QTL was 500 in the simulation, SSGBLUP and SS-BayesA, which allowed all markers to have a non-zero effect, showed better agreement between estimated marker effects and true QTL effects (Figs. A1–A3 in Supplementary File 2), in addition to higher GEBV accuracy compared with SS-BayesB (Fig. 3). To further investigate the influence of QTL numbers, we have also simulated a scenario of 5000 QTL. The results (Fig. A4) also indicated that SSGBLUP and SS-BayesA had obvious advantages over SS-BayesB model, which was similar to the scenario of 500 QTL.

Generally, our results suggest that single-step Bayesian models have appealing advantages when the number of QTL controlling the trait is small (Zhang et al. 2016). Kang et al. (2016) have proposed a single-step random regression model (single-step random regression test-day model, SS RR-TDM) for longitudinal traits, and SS RR-TDM has been found to have an advantage over the pedigree-based RR-TDM and GBLUP. It will be meaningful to further extend single-step Bayesian models to longitudinal traits.

Currently, a new algorithm that uses recursion to compute the inverse of the genomic relationship matrix has become commonly applied (Misztal 2016; Misztal et al. 2014). This algorithm, also called the “algorithm for proven and young,” splits genotyped animals into core (proven) animals and noncore (young) animals. This methodology can produce an inverse genomic relationship matrix of all genotyped animals by only computing the inverse for the core animals (Misztal et al. 2014), thereby dramatically decreasing the computing cost compared with the traditional single-step GBLUP. Because the breeding values of noncore animals can be derived by recursions on the breeding values of core animals (Misztal 2016), these results indicate that phenotypes of core or proven animals are sufficient for estimating marker effects in the Bayesian-like model. Therefore, how to extend the core and noncore concept to the Bayesian and single-step Bayesian genomic models will be an interesting and valuable research topic. Fernando et al. (2016) further proposed a hybrid model for the single-step Bayesian models with an efficient new computing algorithm, which is easy to extend to multiple-trait and multiple-breed analyses.

Data archiving

The simulated data analyzed in this study is available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.hk14j.