Plant and animal breeders have used quantitative genetics effectively to increase mean phenotypic performance in selected populations. Traditionally, genetic progress has been achieved by combining phenotypic evaluations with the pedigree record, which involves visual evaluation and trait screening over several successive generations (Goddard and Hayes 2007). These approaches have brought significant gains in recent decades. However, it is important to take into account the effort required to achieve these gains; for the majority of perennial crops, this approach is costly and time-consuming, particularly for traits expressed late in a plant’s life-cycle.

The advent of molecular markers has provided an opportunity to achieve faster genetic gains (Lande and Thompson 1990). Meuwissen et al. (2001) first proposed to use all available molecular markers to predict quantitative traits in breeding programs. Known as genomic selection (GS), the methodology has become widely adopted in the animal and plant breeding communities because of its potential to increase genetic gains and shorten the breeding cycle. The rationale behind this approach is that, whenever marker density is high enough, most quantitative trait loci (QTLs) will be in linkage disequilibrium (LD) with some markers and hence estimates of marker effects will lead to accurate predictions of genetic merit for a complex trait (Goddard and Hayes 2007).

When confronting the problem of modeling the relationship between genetic variation and variation in the observed traits, an important question is what statistical method might better describe this relationship. Several analytical approaches have been proposed for genome-based prediction of genetic values, such as penalized and Bayesian estimation procedures, as well as nonparametric regression and dimension reduction methods (Gianola et al. 2009; Kärkkäinen and Sillanpää 2012; Gianola 2013; de Los Campos et al. 2013). A common feature of all these methods is that they were designed to handle highly-dimensional data, with a particular focus on producing accurate estimates in settings in which the number of variables, or SNPs (p), is larger than the number of samples (n). Most successful approaches are based on variable selection and/or shrinkage techniques from the statistics literature (Kärkkäinen and Sillanpää 2012; Zhou et al. 2013; Garrick et al. 2014).

Comparisons between genomic prediction models have been carried out in a variety of scenarios for different species and traits (Heslot et al. 2012; Riedelsheimer et al. 2012; Daetwyler et al. 2013; Wang et al. 2015). Empirical and simulation studies have suggested that different models work better in different scenarios, since biological and technical factors affect prediction accuracy. These factors include population size, genetic architecture and differences between the training and validation data sets (de Los Campos et al. 2013; Daetwyler et al. 2013). Because of this, when considering analyses of new species and breeding scenarios, it can be helpful to compare and assess several methods before carrying out the final genomic analyses. Here we perform such an assessment for genomic prediction in coffee, an important agricultural commodity in which genomic studies are still emerging.

So far, genomic prediction accuracy has usually been evaluated within single environments (Windhausen et al. 2012a, b; Beaulieu et al. 2014; Gamal El-Dien et al. 2015). In coffee, however, breeding schemes are most commonly performed in multiple environments to measure performance of genotypes across a range of conditions. In this study, therefore, we focus on the following question: are marker effects estimated in one set of environments useful for prediction in other environments? This question has important practical implications for the effectiveness of GS in perennial species such as coffee. If feasible, for example, a single prediction model could be used across different environments resulting in time and cost economy (Resende et al. 2012a).

This investigation builds on our earlier work that explored the potential of GS for production of coffee beans (Ferrão et al. 2017). In that work, we used a mixed model framework with integration of spatial and temporal variance-covariance structures. In combination, this investigation explores the potential of several statistical methods to predict three important traits—production of coffee beans, leaf rust incidence, and yield of green beans—evaluated in two recurrent selection populations of Coffea canephora. Our results demonstrate the usefulness of genome-base prediction for coffee breeding. We also provide guidance on implementing molecular assisted selection in practical breeding programs.

Materials and methods

The description of the Materials and methods is organized as follows. In the Sections “Plant material” and “Experimental design”, we describe the development of the population used in the experiments, collection of the phenotype data, and breeding scenarios proposed for GS investigation. We take additional steps, described in the Section “Phenotypic model”, to prepare the phenotype data for analysis with whole-genome prediction models. At this stage of the analysis, no marker information was included in the model. The protocols for DNA sequencing and calling SNP genotypes are described in the Section “Genotypic data”. Finally, in the Sections “Genomic prediction methods” and “Evaluation of genomic predictions” we describe the genomic prediction methods used in our experiments and explain how these methods were compared.

Plant material

We consider an experimental population designed by the Instituto Capixaba de Pesquisa, Assistência Técnica e Extensão Rural (Incaper), ES State, Brazil. Phenotype measurements were collected from two recurrent selection populations formed from the recombination of superior C. canephora clones. Clones were selected by the Incaper as progenitors (founders) based on high production of coffee beans and similar stages of fruit maturity. The latter is an important feature of new coffee varieties because it allows harvests to be standardized. Based on the maturity group, coffee populations were designated as “Intermediate” or “Premature”. The Intermediate population, on average, had fruits that started to ripen in March and April, and then were harvested in June. The fruits of the Premature population started ripening, and were harvested one month earlier, on average.

In 1997, the original Intermediate and Premature populations were derived from crosses of 16 and 9 progenitors, respectively. Each population was planted in an isolated field under open pollination conditions. In 2000, after one cycle of recombination, seeds were derived from each maternal plant, which were then use to develop a new population. At this point, the same number of seeds per maternal plant were harvested in order to preserve genetic diversity. After four consecutive harvest-production years (2002–2005), 103 progenies from the Intermediate population and 87 progenies from the Premature population were selected based on their high performance in terms of production of coffee beans and tolerance to biotic and abiotic stress over these four years. In 2006, founders and the selected progenies belonging to both populations were cloned and assigned to randomized complete blocks with three replications and five plants per plot. The average measurement per plot was used as the phenotype for all subsequent analyses in this study.

Both populations were established in two locations, chosen to be representative of Brazilian coffee production: Marilândia Experimental Farm, or FEM (latitude 19024’ south, longitude 40031’ west, 70 m altitude); and Sooretama Experimental Farm, or FES (latitude 15047’ south, longitude 43018’ west, 40 m altitude).

The complete experiment used phenotype measurements from 3570 coffee trees in the Intermediate population, and from 2880 coffee trees in the Premature population. Measures were recorded over four consecutive harvest-production years (2008–2011) for three traits: production of coffee beans (mature coffee fruit in the “cherries” stage, in 60-kg bags per hectare); natural infection of coffee leaf rust, caused by the Hemileia vastatrix fungus (levels ranging from 1 to 9, according to visual sporulation intensity evaluated in field); and yield of green beans post-harvest trait (ripened beans, in g, after processing by dry methods to remove dried husks in samples of 2 kg of coffee fruit in the cherries stage).

Experimental design

To investigate potential for GS, we considered two aspects of plant breeding: (i) prediction accuracy for different traits, and (ii) prediction accuracy within and across environments. Here we define “environment” as a specific combination of location and population. Figure 1 summarizes our experiments.

Fig. 1
figure 1

Genomic selection experimental scenarios. Here, “environment” is defined as a combination of location (FEM, FES) and population (Intermediate, Premature). Scenarios ad assess genomic selection within the same environment; Scenarios 1–4 compare GS performance across-locations; Scenarios 5–8 evaluate GS performance across populations; and Scenarios 9–12 assess both across-location and cross-population predictions. Direction of the arrows indicate differences in the training and test data sets

For within-environment experiments (Scenarios A–D), predictions were evaluated using a Replicated Training-Testing evaluation (Crossa et al. 2013). In each replication, 80% of the individuals were assigned randomly to the training set (TRN), while the remaining 20% were assigned to the test set (TST). This division was repeated 30 times with different random assignments to TRN and TST. Models were fitted to the training data and prediction accuracy was evaluated in the test data.

For across-environment experiments (Scenarios 1–12), we subdivide the experiments as follows (see Fig. 1): (i) Scenarios 1–4 capture across-location predictions, in which the test set contains samples collected from a different location than the training set, while the source population is kept the same; (ii) Scenarios 5–8 consider across-population predictions, in which the training and test sets contain samples from different populations, while the location is kept the same; (iii) Scenarios 9–12 capture both across-location and across-population predictions, in which the test set has samples from a different population and a different location than the training set. In these experiments we did not use the Replicated Training-Testing design; for example, in Scenario 1 the model was trained with all Intermediate individuals from one location (e.g., FEM) and validated using all Intermediate individuals from another location (e.g., FES).

Phenotypic model

The phenotypes were adjusted for linear effects of environmental covariates, and other experimental covariates. In particular, in our experiments we collected longitudinal data across multiple harvest-years. Different variance-covariance structures were tested to describe this temporal variation across harvest, and therefore improve the estimation of genetic effects. Using a similar notation to Pastina et al. (2012), we considered the following statistical model (underlined terms indicate random variables):

$$y_{ijk} = \mu + B_j + H_k + \underline G _{ij} + \varepsilon _{ijk}$$

where yijk is the phenotype measured in individual i {1,...,n} from block j {1,...,r} and harvest k {1,...,K}; µ is the intercept; Hk is the fixed harvest effect; Bj is the fixed block effect; Gij is a random genetic effect of genotype i at harvest k; and εijk is a random non-genetic residual error term. Here, r = 3 (the number of blocks) and K = 4 (the number of harvests).

To model the random genetic effects, we assumed a multivariate normal distribution with a zero mean and a variance-covariance matrix G. We formulated G as the Kronecker product \(G = \mathop {\sum}\nolimits_H^{K \times K} { \otimes I_g^{n \times n}}\), in which \(I_g^{n \times n}\) is the n × n identify matrix. Four structures \(\sum _H^{K \times K}\) different levels of complexity (i.e., number of model parameters to be estimated) were investigated (see Supplementary Table S1).

Similarly, for the residual error, we assumed a multivariate normal distribution with a zero mean and variance-covariance matrix R defined as \(R = R_H^{K \times K} \otimes I_B^{r \times r} \otimes I_g^{n \times n}\). The term \(I_B^{r \times r}\) is an Identity of dimension equal to the number of blocks, r. For the term \(R_H^{K \times K}\), the “Ident” and “Diag” variance-covariance structures were considered (Supplementary Table S1). Our previous study showed no improvements in the goodness-of-fit values when spatial trends were evaluated (see Ferrão et al. 2017); therefore, we did not consider spatial analysis here.

The final model choices were based on AIC (Akaike Information Criterion) (Akaike 1974) and BIC (Bayesian Information Criterion) (Schwarz 1978). Since calculation of heritability in complex linear mixed models is not straightforward (Cullis et al. 2006; Oakey et al. 2016), broad-sense heritability (h2) was estimated from the simplest phenotypic model—that is, identity structure for the genetic and residual matrices—as: \(h^2 = \frac{{\sigma _g^2}}{{\sigma _g^2 + \sigma _e^2/rK}}\); where \(\sigma _g^2\) is the estimated variance of the genotype component, \(\sigma _e^2\) is the estimated variance of the residual component, and r and K are the number of blocks and harvests, respectively.

All analyses described in this section were performed using the R package nlme (Pinheiro et al. 2013).

Genotypic data

The Intermediate and Premature populations were genotyped using the Genotype-by-Sequencing (GBS) approach that was first developed by Elshire et al. (2011). We followed the GBS protocol used by the Genomic Diversity Facility at Cornell University.

Leaves were collected and lyophilized. DNA was extracted using Qiagen DNeasy Plant, and the genomic libraries were prepared following Elshire et al. (2011). The DNA samples were digested using the ApeKI restriction enzyme, and 96 samples were multiplexed per Illumina flow cell for sequencing.

The GBS analysis pipeline was implemented with the TASSEL-GBS software, version 4.3.7 (Glaubitz et al. 2014). Sequenced tags were aligned against the C. canephora genome assembly (Denoeud et al. 2014). SNPs were extracted from the raw Variant Call Format (VCF file) and filtered manually as follows: (i) triallelic SNPs were removed; (ii) SNPs with minor allele frequency (MAF) less than 1% were removed; and (iii) SNPs with genotypes that were called in less than 70% of the samples were discarded.

To ensure that all genotypes were called consistently, we used a Bayesian approach that incorporates genetic background information, similar to Chan et al. (2016), to call genotypes with low coverage (which we defined as genotypes with less than or equal to five sequenced reads). Specifically, we used a two-step approach: first, the maximum-likelihood estimates of genotypes were computed following Chan et al. (2016) (assuming a uniform genotype prior); second, the inferred parental genotypes were provided as prior information for inferring the genotypes of the progenies. To improve accuracy of the parental genotype estimates, we increased the sequencing coverage of the parents to 3× the coverage of the progenies. To call the genotypes, we retained the maximum-probability genotype, encoded as reference allele counts (0, 1 or 2) in our files.

All SNP manipulation and genotype calling (aside from genotypes of low-coverage samples) were carried out using VCFtools (Danecek et al. 2011). We used R (R Core Team 2013) to implement the Bayesian genotype calling incorporating parental genotype estimates. SNP density plots were created using the synbreed R package (Wimmer et al. 2012).

Genomic prediction methods

We compared 13 different methods for genomic prediction of coffee traits. Most of the genomic prediction approaches included in our experiments are based on a linear regression in which the outcome of interest y is modeled as a linear combination of the SNP markers:

$$y = \mu 1_n + X\beta + \varepsilon$$

Here, y is an n-vector of phenotypes measured on n individuals, after adjusting for linear effects of environmental factors and other experimental factors, as explained in Sec. 2.3; X is an n × p matrix of genotypes measured at p SNPs; β is a p-vector of SNP effects to be estimated; 1n is an n-vector of 1’s; μ is the intercept, and ε is an n-vector of normally distributed residuals, \(\varepsilon \sim N\left( {0,\,\sigma _e^2I_{n \times n}} \right)\).

For comparing genomic prediction approaches, we defined three classes of methods: (1) fixed multiple regression, (2) Bayesian methods, and (3) a third class of methods based on techniques originally developed in machine learning (which don’t already fit into the first two categories).

Fixed multiple regression

This class of method builds on standard statistical association analysis approaches used in genome-wide association studies (GWAS) which test each SNP, one at a time, for association with the phenotype (we refer to this as “single-SNP” analyses). We implement a fixed regression procedure following Spindel et al. (2015), using a subset of markers identified from a single-SNP analysis. For each replication in the cross-validation scheme (Replicated Training-Testing evaluation), single-marker regression was applied to all SNPs, and association p-values were computed using an F-test. A linear regression model was then fitted to the data using the 100 most significant markers. We refer to this method as “fixedMRL”.

Machine learning methods

We consider three approaches from the machine learning literature: regularized regression, dimension reduction, and random forests.

  1. 1.

    Regularized regression: This method fits a regression model with all p SNPs, shrinking all coefficients toward zero. Regression coefficients are fitted by solving an optimization problem that balances goodness-of-fit against model complexity (de Los Campos et al. 2013; James et al. 2013). Several regularized approaches have been proposed, and they differ in the choice of penalty function. Ridge regression (RR) and LASSO are the two most prominent approaches. RR shrinks all coefficients toward zero, with a penalty applied to the ℓ2-norm of the coefficients. In contrast, LASSO uses the ℓ1-norm (James et al. 2013). The RR-BLUP is a version of RR that implements best linear unbiased prediction (BLUP) using a mixed model approach (Endelman 2011). The RR-BLUP and LASSO approaches were implemented using, respectively, the rrBLUP (Endelman 2011) and glmnet (Friedman et al. 2010) R packages. The LASSO penalty strength was chosen via cross-validation, following Silva et al. (2011).

  2. 2.

    Partial least squares regression (PLSR): This is a dimension-reduction approach that transforms the variables (SNPs), then fits a model with the transformed variables. PLSR is similar to principal component regression (PCR); both methods construct a matrix of latent components as a linear transformation (James et al. 2013). PLSR was implemented using the pls R package (Wehrens and Mevik 2007) with the default settings.

  3. 3.

    Random forest (RF): A random forest is a collection of regression trees, in which a subset of SNPs is used to define the best split at each node (James et al. 2013). Different variables are used at each split in different trees. The RF prediction for an observation is obtained by averaging the predictions over trees. One feature of the RF approach is that it allows for non-linear relationships between genotype and phenotype. RF was implemented in our study using the RandomForest R package (Liaw and Wiener 2002) with the default settings.

Bayesian methods

All Bayesian approaches are based on a hierarchical linear regression method, building on (2), and differ primarily in the priors placed on the regression coefficients and other model parameters (Gianola 2013). Using notation similar to de Los Campos et al. (2013), the posterior distribution of the model parameters μ, β, σ2 given the hyperparameters ω is expressed as:

$$p\left( {\mu ,\beta ,\sigma ^2\left| {y,\omega } \right.} \right) \propto {\mathrm{p}}\left( {y\left| {\mu ,\beta ,{\mathrm{\sigma }}^2} \right.} \right){\mathrm{p}}\left( {\mu ,\beta ,{\mathrm{\sigma }}^2\left| {\mathrm{\omega }} \right.} \right)$$

where \(p\left( {\mu ,\beta ,\sigma ^2\left| {y,\omega } \right.} \right)\) is the posterior density of model parameters μ, β, σ2 given the data (y) and the hyperparameters ω, \(p\left( {{\mathrm{y}}\left| {{\mathrm{\mu }},{\mathrm{\beta }},{\mathrm{\sigma }}^2} \right.} \right)\) is the regression likelihood based on (2), and \(p\left( {\mu ,\beta ,{\mathrm{\sigma }}^2\left| {\mathrm{\omega }} \right.} \right)\) is the prior density of model parameters. Table 1 summarizes the Bayesian methods evaluated in our experiments.

Table 1 Summary of effect size distributions used in selected genomic prediction methods, adapted from Zhou et al. (2013)

For all Bayesian methods except bayesVS, we ran the Markov chain for 20,000 time steps, with a burn-in of 2000. The bayesVS method uses a variational approximation instead of Markov chain Monte Carlo (MCMC) (Carbonetto and Stephens 2012). The bayesA, bayesB, bayesC, bayesRR and bayesLASSO models are implemented in the BGLR package (Pérez and de los Campos 2013); the bayesR method is implemented in the BayesR package (Erbe et al. 2012); and the BSLMM (Bayesian Sparse Linear Mixed Model) method is implemented as part of the GEMMA software (Zhou et al. 2013). For all methods, we adopted the default hyperparameter and prior settings.

Evaluation of genomic predictions

We applied each of the 13 methods to predict within-environment phenotypes (Scenarios A–D in Fig. 1). The best performing method was used in the others scenarios (Scenarios 1–12 in Fig. 1). To compare the models, we primarily focused on quantifying the prediction ability, commonly used in the GS literature as a measure of the prediction accuracy. To this end, we compute the Pearson correlation (rgp) between predicted (\(\widehat {y_i}\)) and adjusted phenotypes (yi) obtained with Eq. (1).

Following Asoro et al. (2011), we used analysis of variance (ANOVA) to investigate how different factors might be responsible for differences in accuracy among methods. We used the following model in the ANOVA:

$$\begin{array}{l}r_{gp} = \mu + \mathrm{method} + \mathrm{trait} + \mathrm{pop} + \mathrm{loc} + \left( {\mathrm{method} \times \mathrm{trait}} \right)\cr \quad \quad + \left( {\mathrm{method} \times \mathrm{pop}} \right) + \left( {\mathrm{mehotd} \times \mathrm{loc}} \right) + \left( {\mathrm{trait} \times \mathrm{loc}} \right) + \mathrm{error}\end{array}$$

where μ is the intercept; the levels of method are the prediction methods; the levels of trait are the three traits (production of coffee beans, coffee leaf rust, and yield of green beans); the levels of pop are the two populations (Intermediate and Premature), and the levels loc are the two locations (FEM and FES) considered. Other terms in (4) correspond to double interactions among factors.

Alternatively, we also estimate the mean squared prediction error (MSPE), slope and computational time to compare the 13 models. MSPE was computed using the formula:

\({\mathrm{MSPE}} = \frac{1}{{\mathrm{n}}}\mathop {\sum}\nolimits_{i = 1}^n {\left( {{\mathrm{y}}_{\mathrm{i}} - \widehat {{\mathrm{y}}_{\mathrm{i}}}} \right)^2}\), where n is the number of samples in the test set. To compute the slope, adjusted phenotypes were linearly regressed on predicted phenotypes to express the degree of bias of the predictions, as suggested by Moser et al. (2009). Runtimes for model fitting were recorded in minutes for all methods and data sets. All computations were single threaded and performed on an Intel Core i7-3770 processor (3.40 GHz) with 8 GB of memory.

Since the degree of genetic relationships between training and test sets can impact accuracy of the predictions, the relationships of both populations were investigated using principal components analysis (PCA) and Fst.

In our experiments, we also investigated the effect of number of included SNPs on the predictive ability. To this end, we considered two approaches to selecting SNPs: (i) guided subsets, and (ii) random subsets. To construct the “guided” SNP subsets, we selected 10 SNPs within windows of the same length (in base-pairs) in each C. canephora chromosome. To construct SNP subsets of different sizes, we considered different window sizes, ranging from 5 to 900 Kb, by increments of 100 Kb. Following Spindel et al. (2015), we selected the SNPs with highest minor allele frequencies (MAF) and best call rates within each window. This resulted in SNP subsets with the following numbers of SNPs: 35,427 (smallest windows), 20,450, 13,690, 10,189, 7989, 6577, 5559, 4780, and 4240 (largest windows). For the Premature population the number of SNPs in each subset was 40,767, 21,433, 13,969, 10,283, 8019, 6587, 5560, 4780, and 4240. To construct random SNP subsets, we used exactly the same number of SNPs as in the “guided” subsets; however, SNPs were randomly sampled in each window instead of selecting them based on MAF and call rate.


Phenotypic data

Figure 2 summarizes the phenotypic variation in both populations and at both locations. On average, FES location was more productive than FEM location, and showed higher incidence of rust. We observed a lack of annual production stability in coffee bean production over different years in both populations. This instability was quantified in the mixed model analysis, with better goodness-of-fit values (lower AIC and BIC) when heterogeneity of residual and genetic variance were taken into account (Supplementary Table S3). Further, the boxplots in Fig. 2 highlight this cyclical production, interleaving years of high (2008, 2010) and low bean production (2009, 2011).

Fig. 2
figure 2

Left-hand panel: summary of production of coffee beans (in 60-kg bags of mature coffee fruit in the cherries stage per hectare) and yield of green beans (in g, of mature beans after processing using dry methods to remove the entire dried husk in samples of 2 kg of coffee fruit in the cherries stage) evaluated in two locations (FEM, FES), four harvests (2008–2011) and two populations (Intermediate, Premature). Right-hand panel: Summary of coffee leaf rust scores (Hemileia vastatix), ranging from 1 to 9, according to sporulation observation. Curves are kernel density estimates, which are smoothed version of the histogram

Heritability estimates of the three traits, in different environments, ranged from 0.56 to 0.92 (Table 2). Incidence of rust and yield of green beans showed the highest heritability values (0.89 and 0.92, respectively). On average, traits evaluated in the FES location and in the Premature population showed higher heritabilities than the FEM location and Intermediate population.

Table 2 Broad-sense heritability of production (weight of mature coffee fruit in the cherries stage, in kg), incidence coffee leaf rust (1–9 scale) and yield of green beans (weight, in g, in samples of 2 kilograms of coffee fruit in the cherries stage) across two locations (FEM and FES) and two recurrent selection populations of Coffea canephora (Intermediate and Premature)

Genotyping-by-Sequencing in C. canephora

After following the quality-control steps (see 'Materials and methods'), a total of 45,748 (on average, 64.4 SNPs per Mb) and 59,332 (on average, 83.5 SNPs per Mb) molecular markers (SNPs) were retained in the Intermediate and Premature populations, respectively. Among these, 38,106 SNPs (on average, 53.7 SNPs per Mb) were identified in both populations (Fig. 3b). GBS yielded good coverage of SNPs for most of the C. canephora genome in both populations (Fig. 3b).

Fig. 3
figure 3

a Principal component analysis (PCA) of the two Coffea canephora breeding populations; b Venn diagram is showing common and distinct SNPs to both population (k = 103 SNPs) c SNP density (number of SNPs per 400,000 Mb) across the 11 C. canephora chromosomes in Premature, Intermediate, and common to both populations

Genetic similarity between training and test populations is an important factor affecting prediction accuracy (de Los Campos et al. 2013; Daetwyler et al. 2013). Based on the GBS genotypes, the Intermediate and Premature population are very similar; the Fst measure is 0.0158, and both populations strongly overlap in the projection of the samples onto their first two principal components (Fig. 3a).

Comparison of methods for genomic prediction

For phenotype prediction within the same environment, we evaluated prediction accuracy using 13 previously developed genomic prediction methods for three traits. Although the methods differ in assumptions of the marker effects, most methods yield predictions at comparable levels of accuracy (Fig. 4). The one exception is the fixedMLR approach that consistently yielded poor predictions.

Fig. 4
figure 4

Evaluation of 13 statistical methods for predicting three coffee traits—production of coffee beans, incidence of coffee leaf rust and yield of green beans—in two Coffea canephora populations (Intermediate and Premature) evaluated in two locations (FEM and FES). Refer to the “Genomic prediction methods” section for an overview of the methods compared. Prediction accuracy was computed as the Pearson correlation between the predicted and adjusted phenotype in test samples. Cross-validation were performed using a Replicated Training-Testing. In each replication, 80% of the data were assigned to train the model and 20% remaining to validate the model. The procedure was replicated 30 times

Aside from the fixedMLR method, average predictive ability in different traits and environments ranged from 0.17 to 0.69 (Supplementary Table S4). On average, Bayesian methods were slightly more accurate than methods labeled as “machine learning” (0.47 versus 0.45). Differences in predictive ability were statistically significant (P< 0.05) for different traits, locations and populations (Supplementary Table S2). Average predictive ability was slightly higher for incidence of leaf rust and yield of green beans (0.50 and 0.49, respectively) than production (0.38), in accordance with the heritability results (Table 2). Average prediction accuracy was slightly higher at the FEM location than FES (0.47 vs. 0.45), and predictions in the Premature population were on average more accurate than in the Intermediate population (0.54 vs. 0.39). Accordingly, similar performance across the competing models was observed for the slope and MSPE values (Supplementary Tables S5 and S6).

Although Bayesian approaches tend to yield higher predictive values, they typically come with a much greater computational cost than the alternatives methods (Fig. 5). Of the 13 methods compared, we found that the RR-BLUP and BSLMM methods achieved the best combination of high accuracy and low computational cost. Based on this result, in subsequent experiments we focused on the RR-BLUP method.

Fig. 5
figure 5

Total runtimes, in minutes, for fitting the 13 genomic prediction models in all cross-validation runs. Runtimes are shown separately for the three coffee traits—production of coffee beans, incidence of coffee leaf rust and yield of green beans—in the two Coffea canephora populations (Intermediate and Premature). Cross-validation were performed using a Replicated Training-Testing. In each replication, 80% of the data were assigned to train the model and 20% remaining to validate the model. The procedure was replicated 30 times

In order to check the impact of the SNP density on the predictive ability, we sampled SNPs across the chromosomes (i) guided by MAF and genotyping call rate value; and (ii) randomly sampled. Regardless to the approach taken, we observed a similar predictive ability across different SNP densities (Supplementary Figure S1). For several traits, predictive ability improved modestly as more SNPs were used, but for other traits we observed little to no improvement from using a larger number of SNPs (~35,000) versus a small subset of SNPs drawn from across the genome (e.g., ~4000 SNPs).

In Fig. 6, we summarize the prediction accuracy in different environments (Scenarios 1–12) using the RR-BLUP method. In most cases, positive prediction values were obtained (Fig. 6). When making predictions across locations (Scenarios 1–4), the predictive ability remained high for all traits. The incidence of rust and yield of green beans were consistently predicted with greater accuracy than production trait. These results suggest the potential for using GS models to make predictions in different locations. Across-population (Scenarios 5–8) analyses also yielded some positive values, but, on average, at lower magnitude than across-location predictions. Our results also indicated that production of grain beans was more impacted in across-population predictions and negative values were observed in Scenarios 6 and 8, respectively. Rust incidence and the yield of green beans yielded higher correlation values. Models trained in the Premature population and tested in the Intermediate population had lower prediction accuracy. This could be explained by the fact that the Premature population size was smaller than the Intermediate population.

Fig. 6
figure 6

Predictive ability of cross-environment genomic predictions using the RR-BLUP method. “Environment” refers to a combination of location (FEM, FES) and population (Intermediate, Premature). As explained in Fig. 1, Scenarios 1–4 are used to evaluate GS performance across locations; Scenarios 5–8 assess GS performance across populations; and Scenarios 9–12 evaluate prediction accuracy when both the location and population differ between the training and tests sets. Predictive ability was recorded for three coffee traits: production of coffee beans, incidence of coffee leaf rust, and yield of green beans

In the last set of scenarios, we evaluated prediction accuracy when both locations and populations differed between the training and test sets (Scenarios 9–12). As expected, overall lower predictive accuracy among all comparison were observed. This fact was more evidence for production traits that, once again, yielded negative values of predictive accuracy.


The benefits of GS compared with traditional phenotypic evaluations are well documented, and increasingly widely appreciated (Hayes et al. 2009; de Los Campos et al. 2013). Nonetheless we believe this potential remains under exploited in coffee crops. Possible reasons include: (i) limited genomic resources available; (ii) difficulty in maintaining field experimentation given the long generation cycle, late expression of target traits and requirement of large areas for cultivation; and (iii) physiological makeup (low genetic diversity, ploidy barrier in C. arabica, and self-incompatibility in C. canephora. Herein, we have presented the potential to implement GS in conventional coffee breeding schemes. To this end, accurate phenotypic metrics, high-throughput genotyping and appropriate whole-genome statistical models are important requirements.

For studying coffee crops, complex quantitative traits are typically evaluated across multiple locations and harvests, which are collectively referred to as Multi-Environments Trials (MET) (Smith et al. 2005). Several statistical models have been proposed specifically for MET analyses, since the data collected in these setting typically violate basic assumptions of conventional ANOVA models (e.g., homogeneity and independence of variances). As a consequence, bias can be introduced in estimation of genetic values, which may ultimately affect the predictive ability in GS studies. Guided by this previous work, we used a mixed model framework with appropriate covariance structures to account for genetic and non-genetic effects on the phenotypes. The flexibility to fit the residual and genetic variances showed better goodness-of-fit values than traditional ANOVA results.

High-throughput genotyping capacity has been increased by rapid progress in next-generation DNA sequencing (NGS). Genotyping-by-sequencing (GBS) is a product of this advance. Using GBS, we identified a total of 45,748 and 59,332 SNPs in the Intermediate and Premature populations, respectively. We emphasize that this total number of SNPs is larger than the set identified in a recent study on C. arabica that used a similar approach (DArT methodology) (Del Moncada et al. 2015). This difference in SNP identification could be explained by the fact that C. canephora possesses higher genetic diversity due to its origin, reproduction method and dissemination (Ferrão et al. 2015).

For predictive analysis, we initially compared 13 predictive models on within-environment predictions. Assuming that GS models are align with the true genetic architecture of the phenotype, we were expecting a dependence between predictive ability and trait. For example, bayesRR models assumes that marker effects are normally distributed with fixed variance, similar to the Fisher’s infinitesimal model proposes (Fisher 1919; Meuwissen et al. 2001). In contrast, bayesB assumes that most loci have no effect on the phenotypic variation, that is, traits controlled by relatively few loci whose effects vary in size (Meuwissen et al. 2001). Although conceptually different, we observed similar predicative performances of the competition models, evidencing a somewhat difference of our empirical results with previous simulation studies (Meuwissen et al. 2001; Coster et al. 2010).

Recently, several empirical evaluations have been published comparing predictive models and, like ours, reporting similar results across models (Moser et al. 2009; Heffner et al. 2011; Riedelsheimer et al. 2012; Resende et al. 2012b; Daetwyler et al. 2013; Wang et al. 2015). Some aspects of this similarity might be associated to statistical and biological properties. Statistically, the high discrepancy between number of observation and parameters can restrict the process of statistical learning resulting in similar predictive performances among methods (de Los Campos et al. 2013; Gianola 2013). Biologically, this similarity can be associated with the complex nature of traits. For real data, distribution of QTLs effects for most traits is perhaps less extreme than has been hypothesized in simulation studies (Hayes et al. 2009; de Los Campos et al. 2013; Daetwyler et al. 2013).

One method that consistently performed worse than the others was the fixed regression method (denoted by “fixedMLR” in the results). Fixed regression has been useful to detect genome-wide associations. However, these associations typically explain only a small fraction of the genetic variance of quantitative traits (Manolio et al. 2009). By contrast, methods that simultaneously fit effects for all markers are able to account for a much greater proportion of the genetic variation and, consequently, these approaches are more appropriate to predictive purposes (Meuwissen et al. 2001; Moser et al. 2009).

Contrasting to the predictive results, computational requirements significantly differ across the models. Consistent with previous studies, we found that Bayesian methods typically involved greater computational demand (Moser et al. 2009; Heslot et al. 2012; Neves et al. 2012). Particularly, computational cost is an important consideration since frequent re-estimation of marker effects is necessary in breeding programs (Moser et al. 2009). Judged by the overall performance, we found that RR-BLUP method presented important attributes for GS implementation, including straightforward implementation using existing mixed models software, relative simplicity, good performance, and limited computing time.

In a GS context, the possibility to predict phenotypic performance within and across environments is an outstanding question that has not been fully explored in coffee. As expected, within-environment predictions (Scenarios A–D) yielded higher correlation values than cross-predictions (Scenarios 1–12). It has long been recognized that expression of genotypes are affected by environmental conditions and, as a consequence, across-location predictions (Scenarios 1–4) exhibited lower predictive performance than within-environment predictions (Scenarios A–D). In particular, this suggests that genotype-by-location (G×L) interactions are important, even considering that both locations are within the same breeding zone. In theory, G×L interactions occur because the capture and conversion abilities of a plant are determined by its particular ensemble of genes, which are expressed conditionally to the amount and quality of inputs received in each environment (Malosetti et al. 2013). This differential expression is captured by the estimate of marker effects and ultimately influences the predictions. Decaying accuracy across locations has been observed in GS studies in trees (Resende et al. 2012a; Beaulieu et al. 2014; Gamal El-Dien et al. 2015), cassava (Ly et al. 2013), and maize (Windhausen et al. 2012a, b).

For predictions in different populations (Scenarios 5–8), lower accuracy values can be explained by quantitative genetic concepts, which supports allele substitution effects varying between populations due to differences in allele frequency and LD pattern between SNPs and QTLs (Asoro et al. 2011; Windhausen et al. 2012a, b; Lehermeier et al. 2015). Similar results are presented in Neves et al. (2012) in mice populations and by Hayes et al. (2009) in dairy cattle. Scenarios 9–12 represented the most challenging condition for GS as these scenarios combine predictions across locations and populations simultaneously. In most cases, models yielded poor prediction accuracy, with predictive abilities near to zero for the production trait.

Our results suggest the feasibility of incorporating GS into recurrent selection programs so long as predictive models are used to make predictions within the same environment, or within the same breeding zone. Traditionally, one cycle of phenotypic recurrent selection in coffee is divided into three phases: (i) progenies are obtained from a base population; (ii) field trials are conducted in multiple environments and harvests; and (iii) a new base population is generated via selection and recombination of the best individuals. In coffee, due to the long juvenile period associated with multiple evaluations across harvests, 5–6 years are required on average to complete a breeding cycle. Another challenge is that evaluating and maintaining multiple field trials is expensive and laborious. Therefore, incorporating GS prediction models can potentially reduce the time and expense of recurrent selection. We suggest applying GS methods in the second and third stages of recurrent selection programs by coupling prediction and selection during the seedling phase inside of greenhouses. Rapid-cycle recurrent selection supported by GS has potential to accelerate the increase of favorable alleles in the population and reduce both monetary and time costs associated with phenotyping (Windhausen et al. 2012a, b; Grenier et al. 2015). In a modern breeding scheme, phenotypic trials in multiple environments might be considered in advanced phases (e.g., third recurrent cycle), in order to re-estimate marker effects.

Outside the main topic of genomic prediction models, several other aspects of our study may be of interest to the development of GS in coffee. For some traits, we found that prediction accuracy did not greatly improve as we included more SNPs in the models. In particular, we noted high predictive values for models trained with ~10,000 SNPs. Other studies have reported similar results (Vazquez et al. 2010; Spindel et al. 2015). According to de Los Campos et al. (2013), the point at which adding SNPs does not yield any improvement depends, mainly, on the span of LD in the genome and sample size. In both coffee populations, we expected large LD blocks since the populations were originated from only one cycle of recombination between a finite number of selected candidates. Regarding that resources need to be allocated to genotyping, 10,000 markers might be considered as a reference to design a SNP array as an alternative to reduce genotyping cost and popularize the use of genomic resources in coffee. Additional recommendations are given in Ferrão et al. (2017), including the possibility of exploiting the genomic information generated in GS investigations to guide parental selection.

Finally, we view this work as an initial investigation of genomic prediction in the coffee breeding, and there are additional questions that remain unanswered. In particular, we primarily focused on additive models for GS. Considering that C. canephora is a clonally propagated species, predictive models designed to explore the total genetic per se value of an individual is also a relevant question. Our option to predict additive effects was motivated by the breeding context, which includes the accumulation of favorable alleles through early and short cycles of recurrent selection. We believe that our results are sufficiently promising to justify further research, including the extension to modelling non-additive effects and incorporation of genotype-by-environment (G×E) interactions by considering benefit from genetic correlations between locations (Lopez-Cruz et al. 2015; Cuevas et al. 2016, 2017). We also emphasize that one of the main difficulties in conducting coffee studies is the lack of information about genetic architecture of complex traits (Tran et al. 2016). Therefore, advances in our basic understanding of the genetic architecture, which include studies at genomic and phenotypic level, will lead to further improvements in GS.


In this research, we explored the potential for genomic prediction in C. canephora, a perennial species for which GS concepts are appealing, but the basis for its implementation is still in its infancy. To this end, we investigated GS in multiple recurrent selection populations that were evaluated for three agronomic traits. In addition, we explored different models for accurate prediction of these traits within and across environments.

Some of our key findings include: (i) similar predictive abilities were obtained in most of the models that were compared, consistent with previous studies using real data; (ii) the predictive models exhibited very different computational costs; (iii) the RR-BLUP method achieved a good balance of predictive ability and computational cost; (iv) diversity and genetic relationship between training and testing data sets are important requirements; and (v) positive predictive ability supports the idea of implementing GS in conventional schemes of recurrent selection in coffee. Compared to traditional phenotypic methods, we expect that GS implementation can accelerate breeding cycles in coffee.

Data archiving statement

The genotypes in the study belong to the germplasm collection and breeding program of the Incaper institution (ES, Brazil). Data available from Dryad