Introduction

Quantitative genetic models partition genetic variance into additive, dominance and epistatic components (Falconer and Mackay, 1996). This partitioning has been widely analyzed in plant genetics and breeding using robust statistical methods based on linear mixed model theory. Some studies have shown very little involvement of dominance variance (Gallais, 1990), whereas others have found that this factor can contribute substantially, for example, in annual plants (Wardyn et al., 2007) or tree species (Bouvet et al., 2009). In addition, combining dominance and additive effects in models can improve genotype prediction, for example, in genomic selection (Denis and Bouvet, 2013). Epistatic effects—although a potential source of bias in estimating additive and dominance effects—are often overlooked in experimental quantitative genetic studies (Lynch and Walsh, 1998). Research is now underway to determine its importance with regard to complex traits (Mäki-Tanila and Hill, 2014). Different strategies have been implemented through quantitative genetics models, quantitative trait loci (QTL) or mutation-QTL experiments (Mackay, 2014), but there is still no consensus on its importance. For plants, depending on the population and trait considered, some studies have shown that epistasis can significantly contribute to total genetic variance, for example, in maize (Dudley and Johnson, 2009), rice (Luo et al., 2009) or cotton (Li et al., 2014), while also improving the prediction accuracy (Dudley and Johnson, 2009; Hu et al., 2011). Meanwhile, other studies have shown that epistasis has a little impact regarding the genetic architecture of some traits in maize (Buckler et al., 2009) or in the prediction accuracy (Lorenzana and Bernardo, 2009).

Recently, with the advent of high-throughput molecular technology, numerous markers distributed throughout the whole genome have been used to develop new models of genetic variation (Legarra et al., 2009). Additive and non-additive genetic relationship matrices can, for instance, be constructed from genome-wide single-nucleotide polymorphism (SNP) markers to disentangle confounding factors for the estimation of genetic variance, leading to more accurate models than those based on pedigree (Lee et al., 2010; Su et al., 2012; Muñoz et al., 2014).

This new approach in estimating relationships among individuals has markedly broadened the prospects for modeling genetic effects. This is especially the case regarding animal breeding, when purebred animals are evaluated for their crossbred performance, where models that take the allele origin into account are implemented. Some rely on the relationship among purebred parents (Lo et al., 1997), which are referred to as ‘parent models’ in the following, while others are based on the relationship among crossbred progenies estimated by their gametotype (haplotype of each parental gamete) (Ibánẽz-Escriche et al., 2009; Kinghorn et al., 2010; Zeng et al., 2013), which are referred to as ‘progeny models’ hereafter. Although both parent and progeny models have been fully studied in assessing the performance of crossbreeds, they have mainly been focused on genetic additive effects. However, it could be important to take the magnitude of non-additive effects into consideration in some breeding contexts.

In plants, significant progress has been achieved in estimating additive and non-additive variance within a single population, but modeling the genetic variation requires further investigations with hybrids resulting from the cross of two different species or from the cross of two different populations of the same species presenting different allelic frequencies. The gene origin should be taken into account when estimating variance in hybrid progenies, as shown by Stuber and Cockerham (1966). Massmann et al. (2013) and Technow et al. (2012), with dense genotyping in maize, implemented such an approach to estimate additive and dominance effects, but further investigations are needed to address pending questions in hybrid populations. Does a marker-based relationship matrix perform better than one that is pedigree-based for estimating non-additive variance? Are variance estimation and prediction of genetic value improved by using parent or progeny models?

Beyond the advantages of using appropriate models, hybrids derived from a pure species cross or from two different populations of a single species are often used in crop and perennial species breeding to capture the genetic gain resulting from heterosis (Gallais, 2009). Estimating additive and non-additive variance components is crucial to understand adaptive trait expression and guide breeding programs for long rotation species.

In that setting, the objectives of this study were as follows: (i) to test the performance of pedigree- or marker-based models, (ii) to analyze the influence of parent- and progeny-based relationship matrices in estimating additive and non-additive effects (dominance, epistasis) in hybrid populations, and (iii) to assess the impact of including non-additive effects on the model prediction accuracy.

Data of a first-generation hybrid population, resulting from an Eucalyptus urophylla × Eucalyptus grandis cross achieved under field conditions and genotyped with SNP, were used to model the environmental and genetic effects, estimate the additive and non-additive variance and analyse the prediction efficacies of the different models.

Materials and methods

Field experimental data

Thirteen E. urophylla females and 9 E. grandis males were crossed by controlled pollination to generate 69 full-sib families and 1415 progenies. The males and females were selected in progeny/provenance trials set up in the Republic of the Congo. Each parent tree was selected in a different family. Previous studies have shown that panmixy is the mating pattern in continuous natural E. urophylla populations (Tripiana et al., 2007) with a preferential out-crossing system for both species (Horsley and Johnson, 2007). We thus considered that the coefficient of inbreeding and the coefficients of relationship among parents within each species were null. Each of the 1415 progenies was replicated three times using cutting and a clonally replicated progeny test was planted with 1415 clones (4415 trees at a stocking density of 833 trees ha−1). The field experiment involved a complete block design with three replications. Around 25 trees replicated in three blocks represented each full-sib family. At 32 months, the total number of trees used in this study was reduced to 3596, representing 1130 clones, owing to natural mortality and elimination of non-genotyped trees. The trial site was located in the Republic of the Congo, east of Pointe-Noire (11°59′21″ E, 4°45′51″ S). Rainfall averaged 1200 mm year−1. The soils were characterized by low water retention and a very low level of organic matter as well as a poor cationic exchange capacity.

The total tree height (HT) at 32 months was used for model comparison. It averaged 12.02 m and had a high among-individual coefficient of variation (30%).

Molecular data

Among the 20 000 SNP identified, 3303 were finally selected based on the repeatability, a minor allele frequency >2.5% and a rate of missing data per marker <5%. The 1152 individuals (13 females, 9 males and 1130 progenies) were genotyped with the 3303 SNPs using the DArTseq technology (Wenzl et al., 2004). Haplotype phasing and missing-data inference were done with the localized haplotype clustering method developed in Beagle version 4.0 (Browning and Browning, 2007). Pedigree information among the 1152 individuals, that is, the parent–offspring, half-sib and full-sib relationships, was used by the Beagle algorithm to estimate haplotypes and missing data (Browning and Browning, 2013).

Genetic model

The models developed in the next section are derived from the Stuber and Cockerham (1966) model. The authors considered two genetically divergent populations (species or populations of the same species but with different allele frequencies) and the hybrid population generated by the crosses of random individuals from the two parent populations. By considering the hybrid population with the gene effect depending on the parent origin, that is, female parents from population F (E. urophylla in our experiment) and male parents from population M (E. grandis in our experiment), assuming linkage equilibrium in each parent population and limiting epistatic effects to first-order epistasis, the total genetic variance was partitioned as follows:

where σ2aF and σ2aM are the female and male additive variances of the hybrid population due to alleles from females (males) crossed with males (females); σ2dFM is the dominance variance due to the female × male cross; σ2aFaF and σ2aMaM are the female (male) additive × additive epistatic variances; σ2aFdFM and σ2aMdFM are the female (male) epistatic additive × dominance variances and σ2dFMdFM is the dominance × dominance epistatic variance involving alleles from the M and F parent populations.

The genetic covariance among two individual hybrids x and y:

where ϕf and ϕm denote the coancestry coefficient among females of population F and males of population M, respectively; for example, for half-sib hybrids with the same female from F and different males from M: ϕf=(1+Ψf/2) and ϕm=0 and for full sib hybrids, with the same female from F and the same male from M: ϕf=(1+Ψf/2) and ϕm=(1+Ψm/2). Ψf and Ψm are the coefficients of inbreeding of the parent in population F and M that are considered null in this study.

Statistical model

We developed two types of model based on the equation by Stuber and Cockerham (1966). The first, called the parent model, was based on the relationship among female (f=13) and male (m=9) parents of the hybrid progenies:

where y is the vector of the phenotypic variable (tree height at 32 months), β is the vector of fixed effects due to the general mean and blocks, col~N(0, σ2colId) is the vector of random spatial environmental effects due to the field design column, with σ2col being the variance related to the spatial effects, Id is the identity matrix, r:b~N(0, σ2r:bId) is the vector of random spatial environmental effects due to field design row by block interaction, with σ2r:b being the variance related to the spatial effects, plot~N(0, σ2plotId) is a vector of random spatial environmental effects common to the individuals planted in the same square plot, with σ2plot being the variance related to the spatial effects, af~N(0, σ2afAf) is a vector of random additive effects due to E. urophylla females, with Af[f,f] being the coancestry coefficient matrix among females {ϕf} estimated from the pedigree (AfP) or marker (AfG), with σ2af being the additive variance of the hybrid population due to alleles from females crossed with males, am~N(0, σ2amAm) is the vector of random additive effects due to E. grandis males, with Am[m,m] being the coancestry coefficient matrix among males {ϕm} estimated from the pedigree (AmP) or marker (AmG), with σ2am being the additive variance of the hybrid population due to alleles from males crossed with females, d~N(0, σ2dD) is a vector of random dominance effects due to the female × male cross, D[f × m,f × m] is estimated from the pedigree (DP) or marker (DG), with σ2d being the dominance variance of the hybrid population due to alleles from males crossed with females (see the estimation in the next section), i~N(0, σ2iE) is the term representing random epistatic effects modeled by either the sum of three additive × additive effects aaf~N(0, σ2aaf EAAf),aam~N(0, σ2aam EAAm) and afam~N(0, σ2afamEAfAm) or the sum of additive × dominance effects afd~N(0, σ2afdEAfD), amd~N(0, σ2amd EAmD) or by the dominance × dominance effects dd~N(0, σ2ddEDD) and ɛ~N(0, σ2ɛId) is the vector of residual effects. The epistatic matrices EAAf[f,f], EAAm[m,m], EAfAm[f × m,f × m], EAfD[f × m,f × m], EAmD[f × m,f × m] and EDD[f × m,f × m] were calculated with the Hadamard and Kronecker products (formulas detailed in the next section).

X, Zcol, Zr:b, Zp, Zm, Zf, Zmf and Zi are the incidence matrices connecting the fixed and random effects to the data. The coancestry matrices AfG, AmG and DG were estimated using the formulas defined in the next section. Based on the generic model, 10 parent models combining additive, dominance and epistatic effects and marker or pedigree coancestry matrices were developed (Table 1). Additive × additive, additive × dominance and dominance × dominance epistatic effects were estimated in separate models because the restricted maximum likelihood (REML) algorithm failed to converge when all effects were included in a single model.

Table 1 Main characteristics of the models and the associated relationship matrix

The second type of models, called the progeny model, was developed using relationships among the (c=1130) hybrid progenies. They differed from the parent models because they were defined using the female- and male-origin haplotypes of each progeny inferred by long-range phasing. Such models were implemented in previous studies (Ibánẽz-Escriche et al., 2009; Kinghorn et al., 2010; Zeng et al., 2013) while only considering additive effects in the model. We extended this approach by including dominance and epistatic effects.

The fixed effects and environmental random effects were defined as for the first model. The genetic effects are defined by: af~N(0, σ2af AHfG), am~N(0, σ2amAHmG), d~N(0, σ2dDHG) and i~N(0, σ2i EHG), with AHfG[c,c] and AHmG[c,c] the molecular-based female and male additive relationship matrices, respectively, and DHG[c,c] the dominance relationship matrix. The i~N(0, σ2iEH) is the term representing random epistatic effects modeled by either the sum of three additive × additive effects aaf~N(0, σ2aafEHAAf), aam~N(0, σ2aamEHAAm) and afam~N(0, σ2afamEHAfAm), or the sum of additive × dominance effects afd~N(0, σ2afdEHAfD), amd~N(0, σ2amdEHAmD), or by the dominance × dominance effects dd~N(0, σ2ddEHDD). The ɛ~N(0, σ2ɛId) is the vector of residual effects. The epistatic matrices all had the same dimensions [c,c] and were estimated using the male and female haplotype of each progeny with the Hadamard products (formulas detailed in the next section). Five progeny models combining additive, dominance and epistatic effects were developed (Table 1).

The best linear unbiased predictors (BLUP related to the genetic effects) were computed by solving the mixed model equations. The variance component estimation based on the REML method and the BLUP calculations were done using the ASReml version 3 program (Gilmour et al., 2006) implemented in R software (R Development Core Team, 2011).

Narrow- and broad-sense heritabilities were defined by h2=σ2a/σ2p and H2=σ2g/σ2p, and dominance and espistatic variance ratios by d2=σ2d/σ2p, i2=σ2i/σ2p, with σ2a, σ2i, σ2g and σ2p being the total additive, epistatic, genetic and phenotypic variances, respectively, that varied according to the model. Details on variance components are given in Table 1. The delta method was used to obtain a first-order approximate s.e. for a nonlinear function of a vector of random variables with known or estimated covariance matrix (Oehlert, 1992).

Pedigree and marker relationship matrices

For the parent models, the pedigree coancestry coefficients (or kinship) ϕf and ϕm were estimated based on the pedigree of the female and male parent population. As the pedigree was unknown within the E. urophylla and E. grandis parents, the AfP and AmP matrices were simply the identity matrix multiplied by 0.5 (equal to the self coancestry of a non-inbred individual), DP was the identity matrix with 0.25, epistatic EAAf, EAAm EAfAm were diagonals with 0.25, EAfD, EAmD were diagonals with 0.125 and EDD was a diagonal with 0.0625. The molecular marker-based coancestry between two individuals, defined as the probability that two alleles at the locus taken from each individual are equal, that is, identical by state, was defined using Van Raden’s estimator (Van Raden, 2008). The female AfG[f,f] and male AmG[m,m] coancestry matrices were calculated as follows:

where Mf[f,s] (Mm[m,s]) is the matrix of female (male) marker information (coded as 0, 1 and 2 for, respectively, AA, Aa and aa) and s the total number of SNPs (3303 SNPs). Pf[f,s] (Pm[m,s]) contains frequencies pfi (pmi) of the second allele at each locus such that column i of Pf(Pm) is 2 pfi (2 pmi).

The dominance relationship matrices DP and DG were defined as follows. For two individual hybrids X and Y resulting from the cross of female fi and male mi and female fj and male mj, respectively, the dominance relationship coefficient was defined as the product of coancestry coefficients between females and males DX,Y=ϕfi,fj × ϕmi,mj. D can be calculated by the Kronecker product AfAm after removing lines and columns corresponding to the absent crosses. In the parent models, according to the Stuber and Cockerham (1966) models, the relationship matrices due to the first-degree epistatic terms were computed using the Hadamard and Kronecker products. They were defined as follows: for additive × additive terms EAAf=Af#Af, EAAm=Am#Am, EAfAm=Af Am, for the dominance × dominance term EDD=D#D and for the additive × dominance terms EAfD=Af#Af Am and EAmD=Am#Am Af.

For the progeny models, the female and male additive molecular marker-based coancestry matrices AHfG[c,c] and AHmG[c,c] were derived from the haplotypes of each progeny using Van Raden’s estimator (Van Raden, 2008):

where Hf[c,s] and Hm[c,s] are the matrices of female and male haplotype marker information (coded as 0 and 1 for A and a, respectively), respectively, and c being the number of clones and s the total number of SNPs (3303 SNPs). PHf(PHm) contained frequencies pfi (pmi) of the allele at each locus such that column i of PHf(PHm) is 1 pfi (1 pmi).

The dominance relationship matrix was defined using the Hadamard product: DHG=AHfG#AHmG. The relationship matrices due to the first-degree epistatic terms were computed using the Hadamard product, for additive × additive terms EHAAfG=AHfG#AHfG, EHAAmG=AHmG#AHmG, EHAfAmG=AHfG#AHmG, for the dominance × dominance term EDD=DHG#DHG and for the additive × dominance terms EAfD=AHfG#(AHfG#AHmG), EAmD=AHmG#(AHfG#AHmG).

The correspondence between models and the relationship matrices are given in Table 1.

Model comparison

Models were compared using the Akaike information criterion (AIC; Akaike, 1974) defined as AIC=−2ln(R)+2t, where ln(R) is the log-likelihood of the model and t is the number of variance parameters. The model with the lowest AIC value presented the best data fit.

To assess the dependency among variance components, we used the procedure described in Muñoz et al. (2014): with COVES being the asymptotic variancecovariance matrix of estimates of variance parameters, the asymptotic sampling correlation matrix of estimates (CORES) was computed as CORES=(VARES1/2COVES VARES−1/2), where VARES is a diagonal matrix containing the diagonal elements of V (variances of estimates). Analyzing the magnitude of the correlations of the CORES matrix allows assessment of the dependency between the pairwise estimates. To achieve an overall assessment of dependency between the estimates, Muñoz et al. (2014) suggested assessing the percentage of variation explained by eigenvalues of the CORES matrix. This was carried out by plotting the cumulative percentage of variance explained by the model vs the eigenvalue order (Figure 1). The eigenvalues were computed using the R statistical software function ‘eigen’, which computes eigenvalues and eigenvectors of matrices (R Development Core Team, 2011).

Figure 1
figure 1

Proportion of variance explained by eigenvalues of asymptotic correlation matrices of variance estimates by considering: additive effects from the pedigree parent model (P_A), vs the marker parent model (G_A), vs the progeny model (G_AH) (a); additive and dominance effects from the pedigree parent model (P_A+D), vs the marker parent model (G_A+D), vs the progeny model (G_AH+DH) (b); additive, dominance and additive × dominance effects from the marker parent model (G_A+D+A#D), vs the progeny model (G_AH+DH+ADH) (c); additive, dominance and dominance × dominance effects from the marker parent model (G_A+D+D#D), vs the progeny model (G_AH+DH+DDH) (d). The diagonal represents an independent correlation matrix between variance estimates.

Goodness-of-fit was evaluated with the full data set (1130 clones) by assessing the correlation between predicted additive genetic values and phenotypes of individual trees (average of three replicates of each individual clone) r(Âfull, Ŷfull) and between predicted total genetic values and phenotypes r(Ĝfull, Ŷfull).

The prediction accuracy was tested using the pedigree-based relationship matrix for parent models, or the so-called P-BLUP method, and the marker relationship matrix for parent and progeny models, that is, the G-BLUP method (Cros et al., 2015). A cross-validation procedure was implemented to evaluate the prediction accuracy, with the data set divided into nine subsets. The half-sib progeny phenotypes related to one female and the half-sib progeny phenotypes related to one male were removed at each set. As the smallest number of parents corresponded to the nine males, nine subsets were possible. This process allows comparison of the prediction ability of the parent and progeny models. We chose this method to make sure that the progenies in the training and candidate sets would not have any common parents, which would have biased the comparison of parent and progeny models regarding their predictive capacity. Eight of the sets, including trees with phenotype and genotype, were used in turn for training the models to estimate the model parameters, and the remaining set (candidates set), made of trees with only genotypes, was used for testing the prediction accuracy. This procedure led to different sizes of the training and candidate populations, which ranged from 794 to 1013 clones and from 117 to 336 clones, respectively. The prediction accuracy of the models was evaluated on the candidate set by the correlation between the predicted additive and total genetic values using the training set and phenotypes of individual tree (average of three replicates for each individual clone): r(Âcan, Ŷcan) and r(Ĝcan, Ŷcan).

We also used the prediction stability defined by Muñoz et al. (2014), which is the correlation between breeding or total genetic values of clones of the candidate population predicted with the full data set and the breeding or total genetic values predicted with the training population: r(Âcan, Âfull) and r(Ĝcan, Ĝfull). These correlations measured the dependency of the predicted additive or total genetic values on the phenotypes. Associated mean square errors MSE(Âcan, Âfull) and MSE(Ĝcan, Ĝfull) were calculated.

An analysis of variance (ANOVA) was performed to test the null hypothesis, that is, an absence of difference among models and subsets. ANOVA was conducted with prediction accuracy, prediction stability and MSE as dependent variables. When the ANOVA detected a significant model effect, a pairwise mean comparison among models was carried out for each variable using Newman–Keuls test (Shaffer, 2007).

Results

Variance components

The variance component estimates for the 15 models are presented in Table 2. When comparing the 10 parent models, small-to-moderate changes were observed in the genetic, spatial environmental and residual variances, except for the plot variance σ2plot, which markedly decreased by the inclusion of non-additive effects. Epistatic variances could not be estimated with the three pedigree-based parent models, that is, P_A+D+AA, P_A+D+AD and P_+A+D+DD, because matrix singularities were detected and the REML algorithm did not converge. This is partly explained by some redundancy in the dominance and epistasis matrices (as described in the pedigree and marker relationship matrices section). Otherwise, epistatic components were estimated with the three equivalent marker-based models, but there were some particularities in the estimates. σ2aaf and σ2aam were null with G_A+D+AA, while σ2afd and σ2amd were quite high (1.022 and 4.863, respectively) with G_A+D+AD, but the dominance variance σ2d became null. Finally, with G_A+D+DD, the σ2dd espitatic variance was estimated without affecting the other variances, but it had a very high s.e., as also the σ2d estimates (see Supplementary Table S1 in Supplementary Material).

Table 2 Environmental and genetic variance components and parameters related to the quality of the different models

With the five progeny models, the spatial environmental and residual variances did not change when including non-additive effects, but the additive variances decreased (Table 2). According to the ASReml version 3 (Gilmour et al., 2006) algorithm output, models that included epistatic effects converged at the boundary for the σ2aaf, σ2aam, σ2afd, σ2amd and σ2dd espitatic variances. The components were thus considered as null.

The comparison between the parent and progeny models showed a minor change in the additive variance estimate but a marked change for dominance variance, for example, for P_A+D and G_A+D, additive and dominance were close (σ2af=1.239, σ2am=0.921, σ2d=2.640) and (σ2af=1.260, σ2am=0.953, σ2d=2.395), respectively, whereas an increase in σ2d was noted for G_AH+DH (σ2af=1.340, σ2am=0.819, σ2d=4.313). A change was also observed with residual variance, which decreased from parent models (σ2e=9.772 in average) to progeny models (σ2e=8.029 on average).

The overall degree of dependency between the model variance estimates is clearly illustrated in Figure 1, which shows, for each model, the cumulative proportion of variance explained by eigenvalues compared with the diagonal representing orthogonal correlation matrix, that is, independence between estimates. Figure 1a shows that the cumulative percentage of genetic variance of the P_A and G_A models was closer to the diagonal than G_AH, highlighting that these former models led to less correlated variance estimates. A similar trend was observed with P_A+D and G_A+D compared with G_AH+DH (Figure 1b). On the other hand, with epistatic models, Figures 1c and d show that the progeny models generated less correlated variance estimates than the parent models. An examination of the asymptotic correlation matrix of variance estimates led to similar conclusions (Supplementary Tables S2A–C in Supplementary Materials).

The variance ratios differed between models (Table 3). The narrow sense heritability (h2) estimated with all of the additive effect models (P_A, G_A and G_AH) decreased after including the dominance effect. Adding epistatic effects led to a decrease in both additive and dominance variance when the epistatic estimates differed from zero. As expected, the narrow-sense heritability, h2 (h2=0.136 in average), was lower than the broad-sense heritability H2 with a marked difference (H2=0.334 on average). This observation stresses the preponderance of non-additive effects, while the (σ2d+σ2i)/σ2a ratio was 122, 108, 226 and 293%, with P_A+D, G_A+D, G_A+D+DD and G_A+D+AD parent models, respectively. The same trend was observed with the progeny model, where the (σ2d+σ2i)/σ2a ratio was 199% with G_AH+DH.

Table 3 Heritabilities and variance ratios and their s.e

Model goodness-of-fit and prediction accuracy

The relative model quality was assessed through the AIC. The results showed that the progeny models, with AIC ranging from 8841 to 8847, better fitted the observations than the parent models, with AIC ranging from 8889 to 8955 (Table 2).

The goodness-of-fit was estimated using the full data set (Table 4). With the parent models, the correlations between additive or total genetic values and phenotypes were low, with r(Âfull, Ŷfull) varying in the (0.361; 0.368) interval and r(Ĝfull, Ŷfull) in the (0.478; 0.490) interval, while both did not show marked differences between models. Regarding the progeny models, the correlations were higher, showing a better fit to the observations, with r(Âfull, Ŷfull) varying in the (0.710; 0.770) interval and r(Ĝfull, Ŷfull) had the same value for all models (0.858) due to the null epistatic variance estimates.

Table 4 Comparison of models for goodness-of-fit, prediction accuracy ability and stability

As expected, the prediction accuracy had much lower correlations than goodness-of-fit, with r(Âcan, Ŷcan) varying in the (0.191; 0.288) interval and r(Ĝcan, Ŷcan) in the (0.191; 0.294) interval (Table 4). With parent models, r(Âcan, Ŷcan) had higher estimates with marker than with pedigree-based relationships, for example, r(Âcan, Ŷcan)=0.198 with P_A and r(Âcan, Ŷcan)=0.226 with G_A (Table 4). Regarding the progeny models, the predictive ability estimates were higher than the parent models, that is, on average r(Âcan, Ŷcan)=0.277. However, although ANOVA showed a global significant model effect (P=0.002), only two overlapping groups were detected by the Newman–Keuls test (Table 4). The predictive ability for total genetic value showed higher estimates for the progeny models, that is, r(Ĝcan, Ŷcan)=0.294 on average, than for the parent models, that is, r(Ĝcan, Ŷcan)=0.199. ANOVA (P=0.000) and the mean comparison test (Table 4) clearly differentiated the two groups. As shown by the boxplot graphs (Figures 2a and b), all the models presented a marked variation in the prediction accuracy between the nine subsets of the cross-validation process, and no clear differences were detected among models. Parent model using a marker-based pedigree relationship matrix (P_A and P_A+D) even showed negative values.

Figure 2
figure 2

Comparison of the distribution of prediction accuracy for the breeding value (r(Âcan, Ŷcan)) (a), total genetic value (r(Ĝcan, Ŷcan)) (b), resulting from the cross-validation procedure among the pedigree-based parent models (P_A, P_A+D), marker-based parent models (G_A, G_A+D, G_A+D+AA, G_A+D+DD, G_A+D+AD) and progeny models (G_AH, G_AH+DH).

The prediction stability, which measured the dependency of the additive or total genetic values on the phenotypes, showed better performance for the parent models, with r(Âcan, Âfull)=0.756 and r(Ĝcan, Ĝfull)=0.582 on average, than for the progeny models, with r(Âcan, Âfull)=0.734 and r(Ĝcan, Ĝfull)=0.548 (Table 4). However, ANOVA did not reveal significant differences among models (P=0.484 and P=0.551).

The MSE of prediction of additive values presented, on average for all the models, lower estimates than obtained for the total genetic values, that is, MSE(Âcan, Âfull)=0.346 and MSE(Ĝcan, Ĝfull)=0.734 (Table 4). For the additive value prediction, although the progeny model G_AH+DH had a lower MSE, Fisher test was barely significant, while two overlapping groups were detected by Newman–Keuls test (Table 4). No significant differences between MSE were observed for the prediction of total genetic values. MSE distributions clearly separated the parent from the progeny models, with the latter exhibiting a much narrower distribution (Figures 3a and b).

Figure 3
figure 3

Comparison of the distribution of MSEs for breeding values (MSE(Âcan, Âfull)) (a) and total genetic values (MSE(Ĝcan, Ĝfull)) (b) resulting from the cross-validation procedure among the pedigree-based parent models (P_A, P_A+D), marker-based parent models (G_A, G_A+D, G_A+D+AA, G_A+D+DD, G_A+D+AD) and progeny models (G_AH, G_AH+DH).

Discussion

In this study, we used a first-generation hybrid population resulting from the cross of two parent species, to analyze the ability of the models to separate additive and non-additive effects and to predict genetic values. We compared two sets of models, the first based on pedigree and marker-based parent relationship matrices, with the second using hybrid progeny haplotypes to define marker-based relationship matrices. The model of Stuber and Cockerham (1966) has been implemented with parent pedigree or maker-based relationship matrices in previous studies (for example, Lorenzana and Bernardo, 2009; Massman et al., 2013; Cros et al., 2015), whereas one goal of our study was to develop a new set of models using progeny haplotypes to better estimate non-additive and particularly epistatic effects.

Modeling genetic variance in hybrid populations

Our approach relied on the variance partition of Stuber and Cockerham (1966) while considering two parental populations and complemented the recent study of Muñoz et al. (2014) in which variance was modeled when considering a single population. According to Stuber and Cockerham (1966), distinguishing the parental origin of alleles into models should theoretically lead to variance components different from considering alleles originating from a single population; in the first case we have two allele frequency distributions at each locus and in the second case a unique distribution. We verified this assumption by estimating additive and dominance variances while considering a single population in our design (Appendix Table A1, in annex). Variance components, estimated with a single population approach, were lower for the pedigree-based relationship P_A+D model (σ2a=1.959 and σ2d=1.097) and much lower with the marker-based relationship G_A+D model (σ2a=0.08 and σ2d=0.01) as compared with the estimates presented in Table 2.

With the dual population approach, a first set of nested models based on the parent relationship was implemented. Variance components were affected by the use of marker-based (AfG, AmG, DG) or pedigree-based relationship matrices (AfP, AmP, DP). The additive variance decreased from P_A to G_A (16% for σ2af and 8% for σ2am) but increased from P_A+D to G_A+D (1.6% for σ2af and 3.4% for σ2am), while the dominance variance decreased (9%) (Table 2). However, these variations were much smaller than the additive and dominance variance s.es. (Supplementary Table S1 in Supplementary Material), reflecting the moderate impact of using marker- or pedigree-based relationship matrices. This result could first be explained by the ability of AfG, AmG and DG to reflect the assumption of an absence of genetic relationship between E. urophylla parents and between E. grandis parents. This assumption was suggested by the fact that parents are progenies of trees selected in natural populations and the mating pattern in Eucalyptus is instead panmixy. Second, the set of markers used for establishing AfG, AmG and DG, resulting from strong selection (3303 SNPs were selected among 20000), was of sufficiently good quality to correctly reflect the relationship. Our findings were consistent with those obtained in previous studies comparing the use of pedigree- and marker-based relationship matrices in modeling (Forni et al., 2011; Wang et al., 2014).

Using the dual population approach, we developed the so-called progeny models based on the female and male haplotypes of each hybrid progeny. The relationship estimation was based on the classic formula of Van Raden (2008), that is, using SNPs of each haplotype independently to estimate the coancestry coefficient. It differs from approaches constructing haplotype segments surrounding putative QTLs to estimate the relationship coefficient (Makgahlela et al., 2013). Our progeny models estimated variances with smaller errors as compared with parent models (Table 3). However, they did not perform better in estimating the non-additive effects (epistatic variance estimates were all null). As already mentioned, this result could be attributed to an overparametrization of the model.

Our results showed that including non-additive effects in parent and progeny models decreased the additive variance (Table 2) and consequently the narrow-sense heritability (Table 3). This trend was expected from a theoretical standpoint (Gallais, 1990, Falconer and Mackay, 1996) as additive variance in the additive-only model actually captures non-additive variation, as confirmed experimentally in other studies (Su et al., 2012; Muñoz et al., 2014; Sun et al., 2014).

Estimating non-additive effects

One of the goals of our study was to evaluate the extent of epistatic variance in a hybrid population. We used the Hadamard product to define the relationship coefficients of epistatic variance–covariance matrices. This way of calculating the coefficients supposes that the population is in linkage equilibrium and the markers are not linked (Schnell, 1963). To assess the potential biais of our approach, we estimated the linkage disequilibrium (LD) using the ‘genetics’ R package (R Development Core Team, 2011) and the LD estimators ‘r2’ from Hill and Robertson (1968). The mean and s.d of r2 were r2=0.03 (0.11) for E. urophylla and r2=0.05 (0.12) for E. grandis. These values showed very low LD, indicating that Hadamard matrix multiplication should not substantially bias epistatic variance estimates. Regarding the linkage between markers, Schnell (1963) showed that the recombination rate should be taken into account in the epistatic coefficient matrices. Omitting this parameter upward biased the epistatic variance components. For Eucalyptus, the recombination rate is not well known and there are wide variations in estimates (Silva-Junior and Grattapaglia, 2015), which makes bias evaluation difficult.

Our analyses of the data used for the model did not lead to conclusive results. With parent models, non-null epistatic variances were estimated with G_A+D+DD and G_A+D+AD; however, the s.es. of σ2dd, σ2afd and σ2amd were very substantial, that is, twofold the variance estimates (Supplementary Table S1 in Supplementary Material). With the progeny models, we estimated null epistatic variances and a very high dominance variance. Null epistatic variance may result from a small contribution of this effect to the genetic variance of growth traits, as observed in other forest tree studies (Lepoittevin et al., 2011; Araújo et al., 2012) and/or because we implemented an overfitted model, which could lead to null variance estimates. Increasing the number of parents and reducing the imbalance in the mating design could be a way to better estimate non-null epistatic variances and reduce their s.e.

As shown in previous forest tree studies (Araújo et al., 2012), with an experimental design involving cloned full-sibs and pedigree-based models, modeling cannot separate actual additive and dominance variance from epistatic variance. This design upwardly biased the estimation of true additive and dominance variances and downwardly biased epistatic variance. Lee et al. (2010) and Muñoz et al. (2014) have shown that the use of a marker-based relationship can disentangle non-additive effects from additive and dominance effects. Our models, using a marker-based relationship matrix, had this property (Figures 1a–d), but they were not able to clearly estimate the epistatic variance.

In addition, with the Stuber and Cockerham (1966) model, we could not separate the pure dominance effect from the female-additive × male-additive effect because the relationship matrices {ϕmϕf} used to estimate σ2d and σ2afam were identical. Hence, the dominance variance was inflated and the additive × additive variance was downwardly biased.

Although epistatic variance cannot be clearly estimated, our results have shown that non-additive variance explained a significant part of the total genetic variance, that is, the same extent of additive variance with parent models and twofold higher additive variance with progeny models (Table 3). In the case of height at 32 months, our results stressed the importance of non-additive effects and confirmed the findings of previous studies of Baltunis et al. (2009) with radiata pine and Araújo et al. (2012) with E. globulus dealing with growth traits. However, they differed from earlier forest tree studies using full-sib families with clonal replicates and analyzing growth traits, which revealed a very low or limited proportion of non-additive variance, for example, Lepoittevin et al. (2011) with maritime pine. Our results could be explained by the inter-species hybrid nature of our material, which can exacerbate heterosis (Melchinger et al., 2007). Some examples have been reported in poplar (Wu, 2000), rice (Zhou et al., 2012) and maize hybrids (Guo et al., 2014).

Impact of modeling on the prediction accuracy

We tested the prediction accuracy of the models using the G-BLUP approach, which is among the most widely used parametric methods in genomic selection and giving good results (Heslot et al., 2015). Our results revealed that the prediction accuracy estimated in the candidate population (on average r(Âcan, Ŷcan)=0.242 and r(Ĝcan, Ŷcan)=0.242) remained low compared with the goodness-of-fit (on average r(Âfull, Ŷfull)=0.513 and r(Ĝfull, Ŷfull)=0.649) and prediction stability (on average r(Âcan, Âfull)=0.747 and r(Ĝcan, Ĝfull)=0.567) using the full data set. Although comparisons with other studies should be carried out with caution because the prediction accuracies are calculated according to different methods and formulas, the prediction stability values assessed in our study were similar to those estimated in previous forest tree studies for growth traits (Resende et al., 2012; Muñoz et al., 2014; Beaulieu et al., 2014a, b; Gamal El-Dien et al., 2015). The prediction accuracy defined in our study by r(Âcan, Ŷcan) and r(Ĝcan, Ŷcan) is supposed to give a more objective genomic selection potential. With our experimental breeding data, these parameters were quite low (on average r(Âcan, Ŷcan)=0.242 and r(Ĝcan, Ŷcan)=0.242), which could be explained by factors influencing the genomic selection performance, such as: heritability, training population size, and LD. In our case, narrow- and broad-sense heritabilities of height at 32 months presented low-to-moderate estimates (Table 3), which partially explained the low genomic selection accuracy. LD, which was very low, that is, r2=0.03 and r2=0.05 for female and male populations, respectively, was likely also a key factor explaining our accuracy. Finally, the training population size, ranging from 794 to 1013 clones to predict 117 to 336 clones, was probably not sufficiently large, as this factor is critical, especially when non-additive effects are included in the model (Denis and Bouvet, 2013).

Our model comparison for prediction ability showed some unexpected results. Surprisingly, for parent models, r(Ĝcan, Ŷcan) were smaller than r(Âcan, Ŷcan), whereas we expected a higher value because the total genetic value was more related to the phenotype than to the breeding value. This result could be attributed to the less accurate prediction of non-additive effects owing to overparameterization of the model.

Our study has shown that the prediction accuracy and stability were improved by using marker-based instead of pedigree-based relationship matrices (Table 4). The marker-based matrix had the advantage of capturing both the Mendelian segregation within full-sib families and genetic links through unknown common ancestors, which are not available in the known pedigree. This feature has been observed in genomic selection for animal (Chen et al., 2011) or plant breeding (Zapata-Valenzuela et al., 2013; Muñoz et al., 2014; Beaulieu et al., 2014a; Cros et al., 2015). The comparison of parent and progeny models revealed the higher prediction accuracy of the latter, which better assessed the relationship among progenies by capturing the Mendelian segregation. Note that the performance of this model is improved by reducing the error related to haplotype reconstruction, which can be done by using well-known and multi-generation pedigrees when imputing the data (Hayes et al., 2012).

We tested how the inclusion of non-additive genetic effects affected the prediction accuracy and stability. In our parent model, adding dominance or epistatic effects did not improve the prediction, that is, r(Âcan, Ŷcan) or r(Ĝcan, Ŷcan) even decreased from G_A to G_A+D and G_A+D+AA. However, with progeny models, the prediction accuracy regarding the breeding value increased from G_AH to G_AH+DH. Our study generated new findings to supplement previously published results based on experimental data that had diverse conclusions, for example, in plants, Dudley and Johnson (2009), Hu et al. (2011) and Wang et al. (2012) found that including marker interactions substantially increased the prediction accuracy, but Lorenzana and Bernardo (2009) and Muñoz et al. (2014) came to opposite conclusions. With animals, Su et al. (2012) and Sun et al. (2014) showed that including non-additive effects improved the prediction. According to Denis and Bouvet (2013), with simulated data, the inclusion of non-additive effects improved the prediction accuracy when non-additive effects were preponderant, the training data set size was high and updated over selection cycles to reassess the relationship between markers and QTLs.

Implications for breeding

In the case of single populations, marker-based estimates of coancestry have proved to be a useful alternative to genealogical information for estimating variance components as well as for predicting genetic values (Muñoz et al., 2014). In hybrid populations, we have shown that using genome-wide information can improve the variance partition, although epistasis detection requires further investigation to effectively evaluate its relative magnitude. Using progeny models with a higher number of parents and a more balanced design could improve their performance. The modeling approach incorporating molecular data could reduce the overestimation of additive variance and improve the estimation of non-additive effects and is crucial for accurately predicting genetic gain in hybrid breeding programs, as previously mentioned, for example, in maize (Parvez et al., 2007), oil palm (Cros et al., 2015) or Eucalyptus (Bouvet et al., 2009). Regarding the prediction accuracy, our study has shown that the inclusion of non-additive effects in genomic selection depends on the modeling approach—as compared with parent models, our progeny models, using parental haplotypes, improved the prediction accuracy in the hybrid population.

Data archiving

Data are available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.g73t2