Abstract
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, for whom lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA, and both outperformed the generalized ridge regression heteroscedastic effect model for the traits evaluated.
Introduction
The principal limitation in most tree-improvement programs is the time required for the completion of one cycle of breeding, testing and selection. In some programs, it can take up to 30 years to complete a single cycle of breeding, specifically for traits with late expression patterns. Strategies to maximize genetic gain per unit time should then be the primary focus to rationalize the enormous spatial and economic requirements associated with forest tree-improvement practices (White et al., 2007). The concept of genomic selection (GS) (Meuwissen et al., 2001) has promised to reduce the time associated with breeding cycles, and has since established itself as a paradigm in animal (Hayes et al., 2009) and plant (Heffner et al., 2009) breeding systems, although this movement has yet to occur within a forest tree species context.
The novel GS approach combines phenotypes and genotypes of a training population (TP) to develop a prediction model that estimates genomic breeding values (GEBV) for selection candidates, requiring only their genotypic information (Meuwissen et al., 2001). This method may circumvent the need for the long testing phase that forest trees require to attain accurate phenotypic data for traditional pedigree-based estimation of breeding values, and offers a unique opportunity to substantially increase the response to selection through increasing the number of selection candidates. Previously, marker-assisted early selection (MAES) has been considered as a selection strategy for forest tree breeding to exploit the linkage disequilibrium (LD) between quantitative trait loci (QTL) and genetic markers (White et al., 2007). However, MAES has not been rewarding in forest tree breeding programs owing to its severe limitations (Strauss et al., 1992).
The primary constraint that has withheld MAES from use in forest trees is the low proportion of the phenotypic variance accounted for by the relatively low number of statistically significant markers used in the analysis (White et al., 2007). This limitation results primarily from the infinitesimal genetic architecture of most complex growth-related traits (Hill et al., 2008). Further limitations to MAES in forest trees include low levels of LD (Neale and Savolainen, 2004) and strong QTL-environment and QTL-lineage interactions due to overestimation of the QTL effects (Beavis, 1998). GS is fundamentally different from MAES through its simultaneous use of phenotypes and a dense set of markers (thousands), which are implemented without a prior assumption concerning marker significance. GS is thus thought to capture more variation in traits with complex inheritance because it is assumed that at least some of the many fitted markers will be in LD with some of the QTL of the desired trait (Meuwissen et al., 2001).
The GS method has been enabled via current generation sequencing technologies, their low per-sample cost and the use of high-density single nucleotide polymorphism (SNP) genotyping platforms such as genotyping-by-sequencing (GBS) (Elshire et al., 2011). The GBS method is characterized by its use of methylation-sensitive restriction enzymes to reduce genome complexity, and high levels of multiplexing, which efficiently obtain genome-wide SNP markers. The GBS pipeline thus does not require prior genomic information, making it suitable for non-model species such as forest trees owing to their current lack of high-quality reference genomes (Elshire et al., 2011). Recently, Chen et al. (2013) successfully demonstrated the suitability of GBS for SNP discovery in two economically important forest tree species, white spruce (Picea glauca (Moench) Voss) and lodgepole pine (Pinus contorta Dougl. ex. Loud.). The density of SNP markers obtained by GBS can be increased substantially by tolerating high levels of missing data and use of marker imputation (Crossa et al., 2013). However, the benefit of using imputation and the optimal imputation method have not yet been completely validated in GS studies that utilize GBS (Rutkoski et al., 2013).
The potential for use of GS in forest trees was first explored by Grattapaglia and Resende (2011) through the use of deterministic simulation studies. More recently, empirical studies have all produced promising results with regard to acceleration of the breeding cycle for three tree species, namely eucalypts (Eucalyptus spp.) (Resende et al., 2012a), loblolly pine (Pinus taeda L.) (Zapata-Valenzuela et al., 2012; Resende et al., 2012b, c) and white spruce (Beaulieu et al., 2014a, b). This study represents a novel approach over the preceding studies through the application of the non-model GBS SNP discovery pipeline, in addition to high missing data ratio imputation methods to produce GS prediction models.
In the present study, we randomly selected 769 40-year-old interior spruce (Picea engelmannii × glauca) trees from 25 elite open-pollinated families grown on two progeny test sites near Prince George, BC. Heights at ages 3, 6, 10, 15, 30 and 40 years were used to obtain estimates of pedigree-based breeding values (EBV), narrow-sense heritability, age-age genetic correlations, and in combination with SNP marker data, GS prediction models and associated GEBV. The prediction models were developed using three statistical approaches: ridge regression BLUP (rrBLUP), generalized ridge regression (GRR) and BayesCπ (BCπ). Further, we used the GBS pipeline for discovery of SNP markers for each of the 769 interior spruce trees. Three unordered high-density SNP imputation methods (K-Nearest Neighbor (KNN); Singular Value Decomposition (SVD); Mean Imputation (M60)) were used on a 60% missing data set to produce the SNP table, and used to explore the effect of imputation method.
The objectives of this study are to: (i) assess the predictive accuracy (PA) of GS at different ages for the complex trait, tree height, in interior spruce, (ii) evaluate the temporal PA of GS models for purposes of model retraining, (iii) explore variation of PA in the previous two objectives using combinations of three GS statistical approaches and three imputation methods, (iv) consider the suitability of SNPs discovered through a GBS pipeline in conjunction with unordered high-density imputation, for GS in interior spruce, and (v) assess the relative efficiency of GS to traditional pedigree-based BLUP selection.
Materials and Methods
Genetic material
Fresh foliage was obtained post flush in spring 2013 from two 40-year-old interior spruce progeny trial sites, Quesnel and PGTIS, located within the Prince George Seed Planning Zone (SPZ) of North-central British Columbia, Canada (http://www2.gov.bc.ca/gov/topic.page?id=E06AB7FFB0AA49B8814483B1ADC1F5F8). Tissues were sampled, separated, sealed and placed on ice prior to being stored at −80 °C until DNA extraction. Briefly, the two sites, PGTIS (Lat. 53.771639 N, Long. 122.718778 W, Elev. 610 m) and Quesnel (Lat. 52.990889 N, Long. 122.2085 W, Elev. 915 m), contain a total of 32 940 trees initially planted with 2-year-old container nursery stock in a randomized complete block design. The two sites are represented by the same 174 open-pollinated families, each planted in 10-tree-row plots within 10 replicate blocks at 2.5 by 2.5 m spacing. This study concerns 769 randomly selected trees from within a subset of 25 elite families based on breeding value for tree volume.
Phenotypic data and MBLUP analysis
Tree height (m) at ages 3, 6, 10, 15, 30 and 40 years was used in this study. Mature tree heights were obtained using an ultrasonic clinometer Vertex III (Haglöf, Långsele, Sweden). The full data set, consisting of a maximum of 29 475 trees from 174 open-pollinated families, was used for estimation of variance components, EBVs, age-age genetic correlations (r_{ij}), narrow-sense individual tree heritabilities and their respective standard errors. The following pedigree-based, multivariate polygenic model (MBLUP) (Mrode, 2014) was implemented using ASReml v. 3.0 software (VSN International, Hemel Hempstead, UK; Gilmour et al., 2009):
$$Y_{i} = X_{i}\beta_{i} + Z_{i}a_{i} + Z_{i}b_{i} + Z_{i}ae_{i} + e_{i} \qquad (1)$$
where Y_{i} is the vector of phenotypes for height at the i^{th} year, and X_{i} and Z_{i} are the incidence matrices relating observations of height at the i^{th} year to the vector of fixed site effect (β_{i}), and vectors of random additive genetic effect (a_{i}), block effect (b_{i}), additive genetic by site effect (ae_{i}), and residual effect (e_{i}). We assume a_{i} ~ N(0, Aσ^{2}_{a_{i}}), where σ^{2}_{a_{i}} is the additive genetic variance for the i^{th} trait and A is the average numerator relationship matrix; b_{i} ~ N(0, Iσ^{2}_{b_{i}}), where σ^{2}_{b_{i}} is the block variance for the i^{th} trait, and I is the identity matrix; ae_{i} ~ N(0, Iσ^{2}_{ae_{i}}), where σ^{2}_{ae_{i}} is the additive genetic by site interaction variance for the i^{th} trait; and e_{i} ~ N(0, Iσ^{2}_{e_{i}}), where σ^{2}_{e_{i}} is the residual variance for the i^{th} trait. The covariance matrix of the additive genetic term was modeled with a heterogeneous general correlation structure (‘CORGH’) in ASReml to directly obtain age-age genetic correlations (Gilmour et al., 2009).
Estimates of individual tree narrow-sense heritability, ĥ^{2}_{i}, for trait i were calculated as the ratio of estimated additive variance to total phenotypic variance from equation (1). Accuracy of individual EBVs for height at age i obtained from the MBLUP model was estimated following Dutkowski et al. (2002):
$$\hat{r}_{i} = \sqrt{1 - \frac{PEV}{\hat{\sigma}^{2}_{a_{i}}}} \qquad (2)$$
where PEV is the individual tree prediction error variance (square of standard error), and σ̂^{2}_{a_{i}} is the estimated additive genetic variance component of the i^{th} trait from equation (1).
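As an illustration of the two quantities described above, the following minimal Python sketch computes a narrow-sense heritability as a variance ratio and an EBV accuracy from a standard error. The function names and all numeric values are hypothetical, and which non-additive components enter the phenotypic variance is left to the caller because the text does not list them explicitly.

```python
import math


def narrow_sense_heritability(var_additive, other_variances):
    """Ratio of additive variance to total phenotypic variance.

    `other_variances` holds the remaining components the analyst chooses to
    include in the phenotypic variance (an assumption of this sketch).
    """
    total = var_additive + sum(other_variances)
    return var_additive / total


def ebv_accuracy(standard_error, var_additive):
    """Accuracy r = sqrt(1 - PEV / var_additive), with PEV = standard_error ** 2."""
    pev = standard_error ** 2
    return math.sqrt(1.0 - pev / var_additive)


# Hypothetical variance components and standard error, for illustration only.
h2 = narrow_sense_heritability(0.40, [0.10, 0.10, 0.40])
acc = ebv_accuracy(0.4, 0.40)
print(round(h2, 3), round(acc, 3))
```

With these made-up inputs the accuracy depends only on how large the prediction error variance is relative to the additive variance, which is why trees with little information (large standard errors) receive low accuracies.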
SNP genotyping and missing data imputation
SNP markers were discovered utilizing a GBS pipeline for non-model species; see Chen et al. (2013) for details. Additionally, the three SNP imputation methods evaluated in this study for SNP tables with 60% missing data were: mean imputation (M60), KNN with special family weighting (KNN) and SVD. The KNN and SVD imputation methods are described in detail by Gamal El-Dien et al. (2015). Mean imputation was carried out using the ‘A.mat’ function provided in the ‘rrBLUP’ R package, and refers to imputation using the mean for each marker (Endelman, 2011).
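The idea behind mean imputation can be sketched in a few lines of Python. This is not the ‘A.mat’ implementation; it is a stand-alone illustration in which missing calls are represented by None (an assumption of the sketch) and each is replaced by the mean of the observed genotypes at that marker.

```python
def mean_impute(snp_table):
    """Return a copy of the table with missing calls (None) set to the marker mean.

    Rows are individuals and columns are markers coded -1, 0, 1.
    """
    n_markers = len(snp_table[0])
    imputed = [row[:] for row in snp_table]  # leave the input untouched
    for k in range(n_markers):
        observed = [row[k] for row in snp_table if row[k] is not None]
        if not observed:
            continue  # a marker with no calls at all stays missing
        mean_k = sum(observed) / len(observed)
        for row in imputed:
            if row[k] is None:
                row[k] = mean_k
    return imputed


# Toy table: 4 individuals by 3 markers, with missing calls.
toy = [[1, None, -1],
       [0, 1, None],
       [None, 1, -1],
       [-1, 0, None]]
filled = mean_impute(toy)
```

Because every missing call at a marker receives the same value, this approach ignores relatedness among individuals, which is the information the KNN and SVD methods attempt to exploit.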
Genomic selection
Three GS analytical approaches were assessed: a common shrinkage model, rrBLUP (Whittaker et al., 2000), and two variable selection models, GRR (Shen et al., 2013) and BayesCπ (BCπ) (Habier et al., 2011). Further, rrBLUP and BCπ are both homoscedastic effects models using common marker variances for shrinkage, whereas GRR is a heteroscedastic effects model (that is, marker-specific variances are used). SNP tables were coded as: aa=−1, Aa=0, AA=1, where a is the reference allele, and A is the alternative allele. All analyses were completed using R software (R Core Team, 2014). The following base model was implemented (Moser et al., 2009):
$$y_{i} = 1\mu + g(x_{i}) + e_{i} \qquad (3)$$
where y_{i} is the EBV of individual i obtained from equation (1), 1 is an incidence matrix of ones, μ is the overall mean, x_{i} is a 1 × p vector of SNP genotypes for individual i, g(x_{i}) is a function to estimate the GEBV as the combined effect of p SNP markers on the EBV of individual i, and e_{i} is the residual error.
Ridge regression BLUP (rrBLUP)
rrBLUP estimates of GEBV were obtained using the R package ‘rrBLUP’ (Endelman, 2011). The model for rrBLUP follows:
$$g(x_{i}) = \sum_{k=1}^{p} x_{ik}u_{k} \qquad (4)$$
where x_{ik} is the genotype of individual i for SNP marker k, and u_{k} is the additive effect of SNP marker k. The BLUP solution for marker effects, û, is obtained using Henderson’s methods for mixed model equations (MME) (Henderson, 1953):
$$\hat{u} = (Z'Z + \lambda I)^{-1}Z'y \qquad (5)$$
where Z is an incidence matrix relating SNP markers to individuals, I is an identity matrix, y is the vector of EBV and assuming u ~ N(0, Iσ^{2}_{u}). The SNP shrinkage parameter is expressed as the ratio between the residual and common marker variances, λ = σ^{2}_{e}/σ^{2}_{u}. This method shrinks all marker effects equally, where the shrinkage is dependent on the marker allele frequency.
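The ridge solution for marker effects, û = (Z′Z + λI)^{−1}Z′y, can be illustrated with a small self-contained Python sketch. It is not the ‘rrBLUP’ package code: here λ is supplied by the user rather than estimated from variance components, the response is assumed to be centred so that the overall mean can be ignored, and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_marker_effects(z, y, lam):
    """Return u_hat = (Z'Z + lam * I)^-1 Z'y for a centred response y."""
    p = len(z[0])
    ztz = [[sum(row[j] * row[k] for row in z) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    zty = [sum(row[j] * yi for row, yi in zip(z, y)) for j in range(p)]
    return solve(ztz, zty)


# Toy data: 4 individuals, 2 markers coded -1/0/1, centred pseudo-EBVs.
Z = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = [2.0, 1.0, -2.0, -1.0]
u_unshrunk = ridge_marker_effects(Z, y, 0.0)
u_shrunk = ridge_marker_effects(Z, y, 2.0)
gebv = [sum(x * u for x, u in zip(row, u_shrunk)) for row in Z]
```

In this toy case a penalty of λ = 2 halves both marker effects relative to the unpenalized fit, showing how a single common λ pulls every effect toward zero.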
Generalized ridge regression (GRR)
GRR is a two-step variable selection process, and was carried out using the R package ‘bigRR’ (Shen et al., 2013). In the first step, initial estimates of û, σ̂^{2}_{u} and σ̂^{2}_{e} were obtained through the same MME as in rrBLUP. However, the BLUP estimate, û, is modified to accommodate a SNP-specific shrinkage parameter:
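$$\hat{u} = (Z'Z + \mathrm{diag}(\lambda))^{-1}Z'y \qquad (6)$$
where Z and y are the same as in equation (5) and λ is a vector of p shrinkage parameters with λ_{k} = σ̂^{2}_{e}/σ̂^{2}_{u_{k}} as the shrinkage parameter for SNP k, and σ̂^{2}_{u_{k}} is the variance component for SNP k computed as: σ̂^{2}_{u_{k}} = û_{k}^{2}/(1−h_{kk}), where û_{k} is the marker effect BLUP for SNP k obtained in equation (5) and h_{kk} is the (n+k)^{th} diagonal of the hat matrix, H=T(T′T)^{−1}T′, where T is defined as in Shen et al. (2013).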
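The second GRR step can be illustrated with a short Python sketch of the shrinkage update alone. It assumes, following our reading of Shen et al. (2013), that the variance for SNP k is û_{k}^{2}/(1−h_{kk}) and that its shrinkage parameter is the residual variance divided by that quantity. The first-step effects, leverages and residual variance are taken as given inputs with hypothetical values, and the hat matrix itself is not computed.

```python
def marker_specific_shrinkage(u_hat, leverages, var_e):
    """Return one shrinkage parameter per SNP from first-step effects.

    u_hat      first-step BLUP marker effects
    leverages  the matching hat-matrix diagonals h_kk (taken as given)
    var_e      residual variance estimate
    """
    lambdas = []
    for u_k, h_kk in zip(u_hat, leverages):
        var_u_k = u_k ** 2 / (1.0 - h_kk)
        # A marker with zero estimated effect is shrunk completely.
        lambdas.append(float("inf") if var_u_k == 0.0 else var_e / var_u_k)
    return lambdas


# Hypothetical inputs: one larger and one smaller first-step effect.
lam = marker_specific_shrinkage([0.5, 0.1], [0.5, 0.0], 1.0)
```

The larger first-step effect receives the smaller penalty, which is the mechanism by which a heteroscedastic model lets apparent large-effect SNPs persist while shrinking small effects strongly.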
BayesCπ (BCπ)
BayesCπ was developed by Habier et al. (2011) as an extension to the Bayesian GS methods developed by Meuwissen et al. (2001). The statistical model for BCπ follows:
$$g(x_{i}) = \sum_{k=1}^{p} x_{ik}u_{k}\delta_{k} \qquad (7)$$
where x_{ik} and u_{k} are the same as in equation (4) and δ_{k} is an additional dummy (indicator) variable that equals zero with probability π, in which case SNP marker k has no effect in the model.
The proportion of markers with null effect in the model, π, is inferred from the data using a uniform prior distribution (0,1). BCπ assumes a common SNP effect variance, σ^{2}_{u}, with a scaled inverse chi-square prior, v_{u} degrees of freedom and scale parameter, S^{2}_{u}. The proportion (1–π) of markers included in the model have effects following a mixture of multivariate Student’s t-distributions, t(0, v_{u}, IS^{2}_{u}), as described in Habier et al. (2011).
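The role of the indicator variable can be illustrated with a small Python sketch. It only shows how indicators drawn with null probability π switch marker effects on or off in the GEBV sum; it is not the Gibbs sampler used by ‘BGLR’, and the function names and toy values are hypothetical.

```python
import random


def draw_indicators(p, pi, rng):
    """Draw p indicators, each equal to 0 with probability pi and 1 otherwise."""
    return [0 if rng.random() < pi else 1 for _ in range(p)]


def indicator_gebv(x_i, u, delta):
    """Sum of genotype * effect * indicator over markers for one individual."""
    return sum(x * u_k * d for x, u_k, d in zip(x_i, u, delta))


x_i = [1, 0, -1, 1]          # toy genotypes coded -1/0/1
u = [0.2, 0.5, 0.1, -0.3]    # toy marker effects
all_in = indicator_gebv(x_i, u, [1, 1, 1, 1])
one_out = indicator_gebv(x_i, u, [1, 1, 0, 1])
sampled = draw_indicators(1000, 0.04, random.Random(1))
```

When π is close to zero nearly every indicator equals one, so the sum reduces to the rrBLUP form in which all markers contribute.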
The R package ‘BGLR’ (Perez and de los Campos, 2014) was used to implement the BCπ algorithm. Gibbs sampling was used to generate the Markov chain Monte Carlo samples using 100 000 iterations thinned at a rate of 100, with the first 20 000 discarded for burn-in. The ‘BGLR’ package default estimate for the scale parameter, S^{2}_{u}, and degrees of freedom, v_{u}=5, was used. Trace plots were visually checked for model convergence.
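For orientation, the number of posterior draws implied by these sampler settings can be worked out with a short Python sketch. It assumes that a draw is kept at every iteration that is a multiple of the thinning rate and falls after the burn-in period; the exact bookkeeping inside ‘BGLR’ may differ.

```python
def retained_samples(n_iter, burn_in, thin):
    """Count iterations kept when every thin-th draw after burn-in is saved."""
    return sum(1 for i in range(1, n_iter + 1)
               if i % thin == 0 and i > burn_in)


n_kept = retained_samples(100_000, 20_000, 100)
print(n_kept)
```

Under this assumption the settings above leave 800 draws for posterior summaries.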
Cross validation, PA and relative efficiency of GS
To assess GS prediction accuracy we used 10 replications in a 10-fold random cross-validation scheme. Under this scenario, 90% of available data are selected randomly as the TP while the remaining 10% is designated as the validation population (VP). Here, we define prediction accuracy of GS as the mean Pearson product-moment correlation between EBV from the MBLUP model [1] and GEBV from the GS models, for the VP from the 10 replications, that is, r(GEBV, EBV). The temporal PA (TPA) is then r(GEBV_{j}, EBV_{k}), where k is age 40 and j is an age less than 40.
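The replicated cross-validation scheme can be sketched in Python as follows. The callback `fit_predict` is a hypothetical stand-in for any GS model (it receives training and validation indices and returns predictions for the validation individuals), and the sketch assumes that predictions are pooled over the 10 folds before one correlation is computed per replication; the text does not state whether correlations were pooled or computed per fold.

```python
import math
import random


def pearson(a, b):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def kfold_indices(n, k, rng):
    """Randomly split indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validated_accuracy(ebv, fit_predict, k=10, reps=10, seed=1):
    """Mean over replications of r(GEBV, EBV) for held-out individuals."""
    rng = random.Random(seed)
    n = len(ebv)
    accuracies = []
    for _ in range(reps):
        predicted = [None] * n
        for fold in kfold_indices(n, k, rng):
            held_out = set(fold)
            train = [i for i in range(n) if i not in held_out]
            for i, g in zip(fold, fit_predict(train, fold)):
                predicted[i] = g
        accuracies.append(pearson(predicted, ebv))
    return sum(accuracies) / len(accuracies)


# Toy check with an "oracle" model that returns a linear function of the EBV.
toy_ebv = [float((i * 7) % 13) for i in range(40)]
oracle = lambda train, valid: [2.0 * toy_ebv[i] + 1.0 for i in valid]
toy_accuracy = cross_validated_accuracy(toy_ebv, oracle)
```

For the temporal PA, the same loop would be used with the model trained on EBV from an earlier age and the correlation taken against EBV at age 40.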
To explore the relative efficiency of GS to TS, we obtained estimates of PA from model [1] using raw phenotypes of the genotyped individuals as training data, and applying the same cross-validation method previously stated. Breeding value estimates were then obtained under two scenarios. The first scenario utilized the pedigree-based average numerator relationship matrix, A (EBV_{TS}). The second replaced A with the realized relationship matrix, G (GEBV_{GS}) (Habier et al., 2007; VanRaden, 2008). The latter method is known colloquially as GBLUP and is equivalent to rrBLUP (Mrode, 2014). We used the SVD marker data to compute G as:
$$G = \frac{ZZ'}{2\sum_{j} p_{j}(1-p_{j})} \qquad (8)$$
where Z=M–P with M as the genotype matrix and P as the vector of 2(p_{j}–0.5), and p_{j} is the frequency of the alternative allele of the SNP at the j^{th} locus. The relative efficiency of GS to TS, assuming selection response is inversely proportional to the length of the breeding cycle, is:
$$RE = \frac{r(GEBV_{GS}, EBV)}{r(EBV_{TS}, EBV)} \times \frac{t_{TS}}{t_{GS}} \qquad (9)$$
where EBV is the estimated breeding value using the full data from model [1], and t_{TS} and t_{GS} are the length of time to complete a breeding cycle under TS and GS, respectively (Grattapaglia and Resende, 2011).
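Both quantities in this subsection can be illustrated with a minimal Python sketch. It assumes the realized relationship matrix takes the VanRaden (2008) form G = ZZ′/(2Σp_{j}(1−p_{j})) with genotypes coded −1/0/1, and that relative efficiency is the ratio of accuracies scaled by the ratio of cycle lengths. The function names and toy inputs are hypothetical.

```python
def realized_relationship(m):
    """Realized relationship matrix from a genotype table coded -1/0/1.

    Rows are individuals, columns are markers, and 1 denotes the homozygote
    for the alternative allele.
    """
    n, p = len(m), len(m[0])
    # Alternative-allele frequency per marker under the -1/0/1 coding.
    freqs = [(sum(row[j] for row in m) / n + 1.0) / 2.0 for j in range(p)]
    # Z = M - P, with P_j = 2 * (p_j - 0.5).
    z = [[row[j] - 2.0 * (freqs[j] - 0.5) for j in range(p)] for row in m]
    denom = 2.0 * sum(f * (1.0 - f) for f in freqs)
    return [[sum(a * b for a, b in zip(zi, zk)) / denom for zk in z] for zi in z]


def relative_efficiency(pa_gs, pa_ts, t_ts, t_gs):
    """Accuracy per unit time under GS divided by accuracy per unit time under TS."""
    return (pa_gs / t_gs) / (pa_ts / t_ts)


M = [[1, 0], [-1, 0], [0, 1], [0, -1]]   # toy genotypes
G = realized_relationship(M)
# Equal accuracies with a 25% shorter cycle under GS (hypothetical accuracies).
re_equal = relative_efficiency(0.4, 0.4, 1.0, 0.75)
```

With equal accuracies, a 25% reduction in cycle length alone gives a relative efficiency of 4/3, which shows how the time saving drives the expected gain even when GS is no more accurate than TS.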
Results
Phenotypic and MBLUP analysis
Variation in measures of height among trees within the two sites was relatively stable across years, with CV ranging from 25.09% (HT40) to 34.44% (HT6) (Table 1). Narrow-sense individual tree heritability estimates from the MBLUP analysis ranged from low (0.25, HT10) to moderate (0.64, HT30) (Table 2). As expected, the age-age genetic correlation between juvenile and mature height (HT40) increased with increasing juvenile age (Table 2). Mean individual accuracy of breeding values estimated with the MBLUP method was consistent across years of measurement and ranged from 0.74 to 0.76 (Table 2).
Prediction accuracy
Imputation method and statistical approach
The number of SNP markers retained after filtering for minimum minor allele frequency (0.05), averaged from the cross-validation scenarios, was 34 570 (M60), 39 915 (KNN) and 50 803 (SVD). On average, SVD, followed by the novel KNN family-based imputation approach, surpassed the PA of M60 imputation in all GS analysis methods, with the observed differences being comparatively minor between SVD and KNN (Figure 1, left; Table 3). The increase in PA relative to the baseline M60 imputation method, averaged across statistical approaches and ages, was greatest for SVD (7.9%), followed by KNN (6.6%). The average increase in PA of SVD relative to KNN was 1.2%. This small gain, despite the larger number of markers retained by SVD, suggests a trend of diminishing returns with increasing number of markers used in GS analyses.
On average, variation in the relative difference between GS analytical approaches was less than that among imputation methods and number of markers (Figure 1, right; Table 3). PA was, on average, equal between the rrBLUP and BCπ methods and both performed better than GRR, with differences being largest at ages 15, 30 and 40 years. The relative increase in PA, averaged across imputation methods and ages, indicated that rrBLUP and BCπ outperformed GRR by the same margin (4.3%). Pairwise groupings of statistical approach and imputation method scenarios indicated that rrBLUP or BCπ in combination with SVD produced the greatest PA, on average, across all ages. Standard errors of the predictions computed from the 10 replicates were low, ranging from 0.001 to 0.006 (Table 3).
Predictive accuracy across time
The GS PA was inconsistent across ages (Table 3; Figure 1). The greatest PA occurred for HT3, and while a reduction occurred for all other ages, it was largest for HT10 and HT15 across all imputation and GS analytical approach combinations. As expected, the TPA decreased with increasing difference between the training and VP age of height measurement (Table 4; Figure 2, right). Differences in TPA mirrored the results from imputation comparisons, with SVD and KNN producing the greatest average relative increases over M60 by 22.2% and 12.6%, respectively. Diversity in TPA between analytical approaches was lower than that between imputation procedures with average relative increases of rrBLUP and BCπ at 5.5% and 2.6% over GRR, respectively. Age-age genetic correlations from the MBLUP model were plotted with TPA of GS models (Figure 2, left). As anticipated, the Pearson product-moment correlation between the two was, on average, near perfect (r=0.99, not tabulated), with the TPA and age-age genetic correlation both decreasing with difference in years. Interestingly, the TPA of GS models based on 30-year data was often equivalent to those based on 40-year data.
Distribution of SNP effects
The scatter plots, histograms and correlations of estimated marker effects from the three GS analytical methods suggested greatest similarity between the rrBLUP and BCπ methods, and least similarity between GRR and BCπ (Figure 3, Supplementary Figures 1–5). Evident in the plots is the relative tendency of GRR to apply intense shrinkage to the minor effect SNPs while allowing SNPs with large effect to persist. The latter result contrasts with rrBLUP and BCπ where they tended to distribute marker effects more widely owing to the common shrinkage parameter. The posterior mean of the π parameter (that is, probability of null effect) for BCπ was relatively constant across ages, with estimates ranging from 0.03 to 0.04 (not tabulated), accounting for some of the similarity between the rrBLUP and BCπ methods. The intensity of shrinkage due to marker quantity is also apparent when comparing the SVD to KNN and M60 marker effects plot. Pearson product-moment correlation was used to assess the linear relationship between the absolute value of marker effects in the three analytical approaches (Figure 3, Supplementary Figures 1–5; upper triangles). Pearson product-moment correlation, averaged across imputation methods and ages (not tabulated), was greatest between rrBLUP-BCπ (r=0.88), followed by rrBLUP-GRR (r=0.86) and BCπ-GRR (r=0.86). Spearman rank correlation, averaged across imputation methods and ages (not tabulated), of marker effects yielded nearly identical ranking between rrBLUP-GRR (ρ=0.99), as expected because the rrBLUP procedure precedes GRR. Lower rank correlations were observed between rrBLUP-BCπ (ρ=0.93) and BCπ-GRR (ρ=0.90). Additional marker effect plots for the remaining height measurement years are available in the supplement.
Relative efficiency
PA from rrBLUP and the SVD marker data (GS) was compared with those using TS (Table 5). Under the early selection scenario (10 and 15 years), the TPA using GS was greater than that of TS resulting in 112% and 106% increase in selection response, respectively. GS PA for mature tree height (30 and 40 years) of the same age as model estimation was lower (HT30) or equal (HT40) to their TS counterparts; however, assuming a 25% reduction in breeding cycle length resulted in increases of selection response by 137% and 139%, respectively. Additionally, a high (r=0.85, not tabulated) Pearson correlation between PA and proportion of additive genetic variance explained by the GS model (that is, narrow-sense genomic heritability) was observed in this study. A lower correlation (r=0.75, not tabulated) was also observed between TS narrow-sense heritability and TS PA.
Discussion
Accuracy of GS prediction through time
In this study, repeated measures of tree height over time permitted the testing of GS models’ accuracy (PA) at different ages (Figure 1, Table 3). PA reported in this study varied substantially throughout time (Figure 1, Table 3). This may reflect the capacity of the SNP markers to account for differential gene expression due to physiological or G × E interaction over time. Interestingly, the large drop in PA at age 10 and 15 years seems to coincide with a period of intense competitive exclusion between trees at this age, perhaps exacerbated by the relatively narrow tree spacing (2.5 × 2.5 m). The observed extent of PA for tree height was comparable with that reported in other studies using clonal eucalypts (Resende et al., 2012a) and 1–6-year tree height in loblolly pine (Resende et al., 2012b, c; Zapata-Valenzuela et al., 2012). More recently, Beaulieu et al. (2014a) tested GS in a half-sib population of white spruce and reported PA for 22-year tree height that were similar to those described here.
Next, we trained GS models with EBV of tree height at ages 3, 6, 10, 15 and 30 years and validated the GEBV against EBV from 40-year measurements (Figure 2, Table 4). This scenario is of interest because the PA is expected to decline after each breeding cycle owing to the decay of SNP-QTL LD as a result of recombination in the offspring (Habier et al., 2013). Thus, the TPA of GS methods is an important consideration for retraining said models as it offers potential to further accelerate the breeding cycle if target phenotypes can be selected earlier. TPA in this study decreased as the difference in age of training and VP increased. Interestingly, the TPA of GS models based on 30-year height was nearly equivalent to those based on 40-year height despite the 10-year difference in measurements, suggesting consistency between the EBVs and SNP effects at both ages. This analysis is in agreement with that of Resende et al. (2012b), who tested TPA for ages 1–6 years in clonal loblolly pine and concluded that model retraining will likely require phenotypic data from mid-rotation age or later to accurately reflect mature growth trait performance (White et al., 2007). These results are not unexpected as conifers typically have weak age-age genetic correlations (Namkoong et al., 1988) attributed to their long life spans and exposure to a wide range of environmental contingencies over time. Currently, selection based on growth attributes is carried out at age 15 years in interior spruce.
Model comparison
We tested the PA of rrBLUP, GRR and BCπ statistical approaches for a time series of tree height measurements in interior spruce (Figure 1, Table 3). The rrBLUP and BCπ models both performed consistently well, producing the highest PA across all ages. These two models can be considered equivalent when the posterior mean of π in the BCπ model approaches zero. The π parameter posterior mean estimate ranged from 0.03 to 0.04 in this study, accounting for some of the models’ observed similarity. The marker effect plots (Figure 3, Supplementary Figures 1–5) illustrate the likeness of the number of markers fitted, marker effect distributions and shrinkage for both methods. Similarly, Resende et al. (2012b) found no difference in PA between rrBLUP and BCπ for 6-year height in loblolly pine, although they did find that BCπ outperformed rrBLUP for an oligogenic disease resistance trait. This demonstrates the flexibility of the BCπ algorithm, where the π parameter allows the model to behave like rrBLUP when traits follow the Fisher’s infinitesimal model (Fisher, 1918). This concept gives BCπ the possibility to be useful in prediction of traits with unknown genetic basis at the cost of additional computational time (BCπ took upwards of five times longer than both rrBLUP and GRR combined). rrBLUP has often been suggested as a baseline model to which comparisons should be made because it has been shown to yield high and stable PA across a wide variety of species and traits with low computational time investment (Heslot et al., 2012). Indeed, this result has been found to be true in the present as well as other studies involving forest trees (Resende et al., 2012c; Beaulieu et al., 2014b).
As anticipated, the GRR model did not offer improved PA over rrBLUP or BCπ for mature tree height under this study’s conditions. Tree height is widely regarded as having a complex inheritance pattern under the Fisher’s infinitesimal model (Fisher, 1918). In theory, the statistical approach used in GS can lead to variation in PA, depending on the genetic architecture of the trait (Daetwyler et al., 2010). Hence, variable selection methods are generally expected to perform optimally for traits with simple genetic architecture (that is, few loci with large effect), because SNPs of low effect are strongly shrunk toward zero, while those of large effect persist. Beaulieu et al. (2014b) and Resende et al. (2012c) evaluated BayesA and Bayesian LASSO, approaches similar to GRR, where the improvement in PA of growth-related traits was found to be null. The GRR model did, however, offer PA comparable with rrBLUP and BCπ at juvenile ages (3, 6 and 10 years). As observed in the distributions of marker effects at these juvenile ages, the GRR model appeared to shrink all markers equally because of an apparent absence of those with large effect compared with mature ages (Figure 3, Supplementary Figures 1). However, in mature ages where large marker effects were perceived to exist by the GRR model, the intense shrinkage applied to markers of low effect led to an obvious impairment of PA. This is expected because of the complex genetic nature of tree height, which suggests that these large effect markers in mature tree height were likely false positives. Further, the increase in PA by rrBLUP and BCπ over GRR at later ages may be accounted for by the knowledge that when markers are shrunk equally, the kinship component of PA is more effectively captured, when compared with heteroscedastic models (Heslot et al., 2012).
GS models should, however, ideally be based on the LD between SNP and QTL rather than kinship, because the SNP-QTL LD component of PA is expected to persevere in subsequent generations following breeding (Habier et al., 2007).
GBS and marker imputation
The use of GBS, in combination with several imputation methods, was successful in supplying a dense genome-wide marker data set, and further, enabled moderate levels of PA to be captured for tree height in interior spruce. This study represents the first use of GBS data as a base for genomic prediction in a forest tree species. We report that both SVD and novel KNN imputation methods offered increases in PA over mean imputation (M60) (Figure 1, left). However, it is unclear whether this result is the product of the number of markers retained by the imputation method or the imputation method itself. Although the initial marker matrix with 60% missing information was used for the three imputation methods, variation in the number of retained markers should be expected as some marker loci may be non-imputable with certain algorithms and differences will occur because of filtering of minor allele frequency. Thus, the imputation method should be wholly evaluated based on both its marker yield and PA, because restricting the markers to a common set would unintentionally penalize methods that yield greater numbers of markers. Similarly, Rutkoski et al. (2013) noted comparable increases in PA over mean imputation while using both SVD and KNN imputation on GBS-derived markers for wheat (Triticum aestivum L.). Further, we found that on average, the SVD imputation method yielded only slightly better PA than KNN. This result alludes to diminishing returns given the average difference in available markers after filtering the TP SNP tables for a minimum minor allele frequency of 0.05. This result corresponds well to the asymptotic relationship between marker density and PA derived by Grattapaglia and Resende (2011) who used simulated data and deterministic formulae.
Foundationally, the GBS method produces a large amount of missing data due to low read depth and possible mutation at restriction sites in some individuals (Elshire et al., 2011); thus, it relies heavily on accurate imputation methods to derive dense marker data. With the availability of a complete reference genome assembly of white spruce, it may be possible to increase the accuracy of imputation through use of methods designed for ordered data and constructed haplotypes (Rutkoski et al., 2013). In the interim, scaffolds of the draft genome assembly for white spruce published by Birol et al. (2013) may suffice to aid in aligning the unordered genomic data produced in this study, and additionally assist in discovering further markers that can be used for GS. This would be greatly beneficial because the genome size and complexity of conifers demand a large number of markers to sufficiently saturate the genome and provide high PA. Marker density is also particularly important in determining PA for forest tree populations where the effective size (N_{e}) is commonly large (Grattapaglia and Resende, 2011).
Relative efficiency
The observed GS prediction accuracies (PA) in the present study are adequate to produce greater genetic gain than TS (Table 5). The relative efficiency results are compatible with those described by other studies involving white spruce and loblolly pine (Resende et al., 2012b; Beaulieu et al., 2014a). However, we chose to limit the reduction in breeding cycle time to a conservative value of 25%, as opposed to the 50% used in other studies, because the reliability of the PA produced in the present study has not been validated in a progeny generation. The theoretical increase in genetic gain produced by GS hinges on the capacity of the prediction models to remain relevant in the next generation. For this to occur, PA must ideally be based on SNP-QTL LD rather than kinship (Habier et al., 2013). Recently, the source of the relationship between marker and QTL described by GS models has been decomposed by Habier et al. (2013) into factors involving both kinship and pure marker-QTL LD, raising questions about the validity of such models in subsequent generations.
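The effect of the assumed cycle reduction on relative efficiency can be sketched as follows. The formula behind Table 5 is not reproduced in this section, so the sketch rests on a common simplifying assumption: at equal selection intensity and equal additive genetic standard deviation, gain per unit time is proportional to accuracy divided by cycle length. The function name and all input values are hypothetical and are not results from this study.

```python
def relative_efficiency(pa_gs, acc_ts, cycle_ts_years, cycle_reduction=0.25):
    """Ratio of GS to TS genetic gain per year under equal selection intensity.

    pa_gs: predictive accuracy of the GS model (assumed input).
    acc_ts: accuracy of traditional pedigree-based selection (assumed input).
    cycle_ts_years: length of one traditional breeding cycle in years.
    cycle_reduction: fraction by which GS shortens the cycle (0.25 = 25%).
    """
    if not 0.0 <= cycle_reduction < 1.0:
        raise ValueError("cycle_reduction must be in [0, 1)")
    if acc_ts <= 0 or cycle_ts_years <= 0:
        raise ValueError("acc_ts and cycle_ts_years must be positive")
    cycle_gs_years = cycle_ts_years * (1.0 - cycle_reduction)
    # Gain per year is taken as proportional to accuracy / cycle length.
    return (pa_gs / cycle_gs_years) / (acc_ts / cycle_ts_years)
```

Under this simplification, equal accuracies with a 25% shorter cycle give a ratio of 1/0.75, and a 50% shorter cycle gives a ratio of 2, which is why the choice of cycle reduction strongly influences the reported efficiency.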
The origin of the PA produced by the GS models in this study is currently unknown, and further testing via a progeny VP, or the partitioning of families in a restricted cross-validation scheme as demonstrated by Beaulieu et al. (2014a), will be required in the future. In the optimal case, the PA of the models arises from LD between markers and QTL. In the suboptimal case, PA derives from kinship information between individuals in the training and VPs. GS PA can be composed of a combination of the two factors, so that estimates of PA may be inflated without proper validation, because the kinship component is anticipated to decay more rapidly than marker-QTL LD in the progeny generations (Habier et al., 2007, 2010, 2013). In a study of a large population of white spruce open-pollinated families, Beaulieu et al. (2014a) observed that PA decreased significantly when kinship between the training and VPs was restricted. This is not unexpected in forest trees, where decay in LD is typically fast (Neale and Savolainen, 2004). Thus, it may be necessary in the future to create ‘designer’ breeding populations through intense management to maximize marker-QTL LD. Alternatively, selective within-family genotyping of phenotypically extreme individuals (that is, best and worst) could be used to train accurate GS models based on haplotype blocks (Odegard and Meuwissen, 2014).
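A family-restricted cross-validation split of the kind referred to above can be sketched as follows. This is an illustration under stated assumptions, not the scheme of Beaulieu et al. (2014a): whole families are dealt at random into folds so that no family contributes trees to both the training and validation sets of a given fold. The function name, the number of folds and the seed are hypothetical choices.

```python
import numpy as np

def family_restricted_folds(family_ids, n_folds=5, seed=0):
    """Assign every tree to a fold such that each family sits in exactly one fold.

    family_ids: sequence with one family label per tree.
    Returns an integer array of fold indices, one per tree.
    """
    family_ids = np.asarray(family_ids)
    families = np.unique(family_ids)
    if n_folds > families.size:
        raise ValueError("n_folds cannot exceed the number of families")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(families)
    # Deal shuffled families round-robin so fold sizes stay balanced.
    fold_of_family = {fam: i % n_folds for i, fam in enumerate(shuffled)}
    return np.array([fold_of_family[fam] for fam in family_ids])
```

For fold k, trees with fold index k form the validation set and all other trees form the training set. Comparing PA from this split with PA from a split that ignores family gives a rough indication of how much accuracy depends on kinship between the two sets.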
In the open-pollinated testing used in the present study, GS models were trained using traditional pedigree-based EBVs that incorporate expected additive genetic relationships between individuals into the matrix, A, to estimate genetic parameters (Lynch and Walsh, 1998). Mixed model theory assumes that covariance matrices are error-free (that is, ideally reflecting the segregation of only QTLs); thus, the accuracy of the information contained in A is critical in obtaining unbiased and accurate estimates of genetic parameters and breeding values (Mrode, 2014). Ideally, data fitted to train GS models should be analogous to the true additive genetic merit of each individual in the TP (Garrick et al., 2009). Earlier studies have fitted deregressed EBV in GS models to improve prediction accuracy (Garrick et al., 2009). However, there are empirical results in animal breeding that suggest EBV to be superior in some cases (Guo et al., 2010). Kinship explained by marker data can be used to overcome this limitation and has been used in the past to correct pedigree errors (Munoz et al., 2014) and to increase the accuracy of EBV (El-Kassaby and Lstiburek, 2009; El-Kassaby et al., 2011). Munoz et al. (2014) applied this concept to GS and noted improved PA after correcting pedigree errors prior to estimating EBVs of the training data.
It appears that in the short term, single-step methods such as those incorporating the realized (G) (VanRaden, 2008) or augmented (H) (Legarra et al., 2009; Christensen and Lund, 2010) genomic relationship matrix may be better suited to tree-breeding programs with simple mating structures and shallow pedigrees, because EBV accuracy suffers from the insufficiencies produced by the simplified mating design. At present, selection could be made with greater accuracy than with traditional BLUP methods, with the added benefit of accumulating genotypic information that can be used in the long term once deep pedigrees and ‘designer’ TPs have been established. This concept is most relevant to young breeding programs with an open-pollinated mating structure, such as the one studied here.
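For illustration, a realized genomic relationship matrix of the form commonly attributed to VanRaden (2008) can be computed as G = ZZ' / (2 Σ p(1 − p)), where Z holds genotypes centered by twice the allele frequency. The sketch below is a minimal version under stated assumptions: genotypes are complete (already imputed) and coded 0/1/2, and allele frequencies are estimated from the sample itself rather than from a base population. It does not construct the augmented H matrix, and the function name is hypothetical.

```python
import numpy as np

def vanraden_g(genotypes):
    """Realized genomic relationship matrix from a complete 0/1/2 genotype matrix.

    genotypes: (n_individuals, n_markers) array with no missing values.
    Returns an (n_individuals, n_individuals) symmetric matrix.
    """
    m = np.asarray(genotypes, dtype=float)
    if np.isnan(m).any():
        raise ValueError("genotypes must be imputed before computing G")
    # Sample allele frequency of the counted allele at each marker.
    p = m.mean(axis=0) / 2.0
    # Center each marker by its expected genotype value, 2p.
    z = m - 2.0 * p
    denom = 2.0 * np.sum(p * (1.0 - p))
    if denom == 0.0:
        raise ValueError("all markers are monomorphic")
    return z @ z.T / denom
```

In a single-step setting, a matrix of this kind would replace or be blended with the pedigree-based matrix A, so that relationships among open-pollinated half-sibs reflect realized rather than expected allele sharing.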
Data Archiving
Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.m4vh4.
References
Beaulieu J, Doerksen T, Clement S, MacKay J, Bousquet J. (2014a). Accuracy of genomic selection models in a large population of open-pollinated families in white spruce. Heredity (Edinb) 113: 343–352.
Beaulieu J, Doerksen TK, MacKay J, Rainville A, Bousquet J. (2014b). Genomic selection accuracies within and between environments and small breeding groups in white spruce. BMC Genomics 15: 1–16.
Beavis WD. (1998). QTL Analyses: Power, Precision, and Accuracy. In: Paterson AH. (ed.) Molecular Dissection of Complex Traits. CRC Press: Boca Raton, FL.
Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA et al. (2013). Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics 29: 1492–1497.
Chen C, Mitchell SE, Elshire RJ, Buckler ES, El-Kassaby YA. (2013). Mining conifers’ mega-genome using rapid and efficient multiplexed high-throughput genotyping-by-sequencing (GBS) SNP discovery platform. Tree Genetics & Genomes 9: 1537–1544.
Christensen OF, Lund MS. (2010). Genomic prediction when some animals are not genotyped. Genet Sel Evol 42: 2.
Crossa J, Beyene Y, Kassa S, Perez P, Hickey JM, Chen C et al. (2013). Genomic prediction in maize breeding populations with genotyping-by-sequencing. G3 (Bethesda) 3: 1903–1926.
Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA. (2010). The impact of genetic architecture on genome-wide evaluation methods. Genetics 185: 1021–1031.
Dutkowski GW, Silva JCE, Gilmour AR, Lopez GA. (2002). Spatial analysis methods for forest genetic trials. Can J For Res 32: 2201–2214.
El-Kassaby YA, Lstiburek M. (2009). Breeding without breeding. Genet Res 91: 111–120.
El-Kassaby YA, Cappa EP, Liewlaksaneeyanawin C, Klápště J, Lstibůrek M. (2011). Breeding without breeding: is a complete pedigree necessary for efficient breeding? PLoS One 6: e25737.
Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES et al. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6: e19379.
Endelman JB. (2011). Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
Fisher RA. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52: 399–433.
Gamal El-Dien O, Ratcliffe B, Klápště J, Chen C, Porth I, El-Kassaby YA. (2015). Genomic prediction accuracy of growth and wood attributes of interior spruce in space using genotyping-by-sequencing. BMC Genomics 16: 370.
Garrick DJ, Taylor JF, Fernando RL. (2009). Deregressing estimated breeding values and weighting information for genomic regression analyses. Genet Sel Evol 41: 55.
Gilmour AR, Gogel B, Cullis B, Thompson R. (2009). ASReml User Guide Release 3.0. VSN International Ltd: Hemel Hempstead, UK.
Grattapaglia D, Resende MDV. (2011). Genomic selection in forest tree breeding. Tree Genetics & Genomes 7: 241–255.
Guo G, Lund MS, Zhang Y, Su G. (2010). Comparison between genomic predictions using daughter yield deviation and conventional estimated breeding value as response variables. J Anim Breed Genet 127: 423–432.
Habier D, Fernando RL, Dekkers JC. (2007). The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
Habier D, Tetens J, Seefried FR, Lichtner P, Thaller G. (2010). The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet Sel Evol 42: 5.
Habier D, Fernando RL, Kizilkaya K, Garrick DJ. (2011). Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12: 186.
Habier D, Fernando RL, Garrick DJ. (2013). Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607.
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. (2009). Invited review: Genomic selection in dairy cattle: progress and challenges. J Dairy Sci 92: 433–443.
Heffner EL, Sorrells ME, Jannink JL. (2009). Genomic selection for crop improvement. Crop Science 49: 1–12.
Henderson CR. (1953). Estimation of variance and covariance components. Biometrics 9: 226–252.
Heslot N, Yang HP, Sorrells ME, Jannink JL. (2012). Genomic selection in plant breeding: a comparison of models. Crop Science 52: 146–160.
Hill WG, Goddard ME, Visscher PM. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet 4: e1000008.
Legarra A, Aguilar I, Misztal I. (2009). A relationship matrix including full pedigree and genomic information. J Dairy Sci 92: 4656–4663.
Lynch M, Walsh B. (1998). Genetics and Analysis of Quantitative Traits, Vol 1. Sinauer Associates: Sunderland, MA.
Meuwissen THE, Hayes BJ, Goddard ME. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Moser G, Tier B, Crump RE, Khatkar MS, Raadsma HW. (2009). A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers. Genet Sel Evol 41: 56.
Mrode RA. (2014). Linear Models for the Prediction of Animal Breeding Values. CAB International: Wallingford, Oxfordshire.
Munoz PR, Resende MFR, Huber DA, Quesada T, Resende MDV, Neale DB et al. (2014). Genomic relationship matrix for correcting pedigree errors in breeding populations: impact on genetic parameters and genomic selection accuracy. Crop Science 54: 1115–1123.
Namkoong G, Kang HC, Brouard JS. (1988). Tree Breeding: Principles and Strategies. Springer-Verlag: New York, NY.
Neale DB, Savolainen O. (2004). Association genetics of complex traits in conifers. Trends Plant Sci 9: 325–330.
Odegard J, Meuwissen TH. (2014). Identity-by-descent genomic selection using selective and sparse genotyping. Genet Sel Evol 46: 3.
Perez P, de los Campos G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics 198: 483–495.
R Core Team. (2014). Open access available at http://cran.r-project.org. R Foundation for Statistical Computing: Vienna, Austria.
Resende MD, Resende Jr MF, Sansaloni CP, Petroli CD, Missiaggia AA, Aguiar AM et al. (2012a). Genomic selection for growth and wood quality in Eucalyptus: capturing the missing heritability and accelerating breeding for complex traits in forest trees. New Phytol 194: 116–128.
Resende Jr MF, Munoz P, Acosta JJ, Peter GF, Davis JM, Grattapaglia D et al. (2012b). Accelerating the domestication of trees using genomic selection: accuracy of prediction models across ages and environments. New Phytol 193: 617–624.
Resende Jr MF, Munoz P, Resende MD, Garrick DJ, Fernando RL, Davis JM et al. (2012c). Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics 190: 1503–1510.
Rutkoski JE, Poland J, Jannink JL, Sorrells ME. (2013). Imputation of unordered markers and the impact on genomic selection accuracy. G3 (Bethesda) 3: 427–439.
Shen X, Alam M, Fikse F, Ronnegard L. (2013). A novel generalized ridge regression method for quantitative genetics. Genetics 193: 1255–1268.
Strauss SH, Lande R, Namkoong G. (1992). Limitations of molecular-marker-aided selection in forest tree breeding. Can J For Res 22: 1050–1061.
VanRaden PM. (2008). Efficient methods to compute genomic predictions. J Dairy Sci 91: 4414–4423.
White TL, Adams WT, Neale DB. (2007). Forest Genetics. CAB International: UK.
Whittaker JC, Thompson R, Denham MC. (2000). Marker-assisted selection using ridge regression. Genet Res 75: 249–252.
Zapata-Valenzuela J, Isik F, Maltecca C, Wegrzyn J, Neale D, McKeand S et al. (2012). SNP markers trace familial linkages in a cloned population of Pinus taeda: prospects for genomic selection. Tree Genetics & Genomes 8: 1307–1318.
Acknowledgements
We thank T Funda and I Fundova for phenotyping, T Funda and J Korecky for DNA extraction, and SE Mitchell and K Hyme for GBS. This study is funded by the Johnson’s Family Forest Biotechnology Endowment, FPInnovations’ ForValueNet, and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to YAE.
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Supplementary Information accompanies this paper on Heredity website
Cite this article
Ratcliffe, B., El-Dien, O., Klápště, J. et al. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods. Heredity 115, 547–555 (2015). https://doi.org/10.1038/hdy.2015.57