Maize (Zea mays) has emerged as an important crop for food, feed production, and various industrial applications, providing livelihoods for millions of people around the world1,2. However, its production is affected by several factors, with drought being one of the most common causes of agricultural shortages in rainfed systems3. This fact, combined with the high demand for this crop and the prospect of worldwide population growth of more than 2 billion people over the next 20 years4, makes it necessary to develop cultivars that are increasingly productive and better adapted to climate change and to different growing regions, such as those under tropical conditions.

Cultivars can exhibit different phenotypic responses across environments, and a genotype may perform well in one environment but not in another5,6,7. To address this, breeders must evaluate the developed hybrids in multi-environment trials (MET). In MET, the main objectives are to study the interaction between genotypes and environments and to evaluate overall genotypic performance and stability8. However, MET phenotyping faces challenges such as limited seed availability, the large number of genotypes to be tested in preliminary trials, and the associated costs, resulting in unbalanced experimental designs across environments9,10. In sparse designs, where hybrids are not evaluated in all environments, accurately selecting superior hybrids for the next cycle can be difficult, as some hybrids may not be stable across many environments, and discarded genotypes might have outperformed the selected ones in untested environments.

To address these challenges, genomic prediction (GP) has been proposed as a tool to predict the genetic value of individuals that were not evaluated in the field11,12. Several GP methods have been proposed, with the Genomic Best Linear Unbiased Predictor (GBLUP) being one of the most commonly used. In the context of MET, predicting the genetic value of individuals not observed in specific environments has led to the development of several models, including models that account for genotype-by-environment interaction, environmental covariates, and additive and dominance effects9,13,14,15,16.

Recently, machine learning methodologies have gained attention due to their ability to recognize complex interaction structures in data sets17. Machine learning algorithms approximate the mapping function linking input variables to output variables (e.g., a phenotypic trait) from the training datasets without making a priori assumptions about data distribution or the genotype–phenotype relationship18. This flexibility allows these methods to capture more complex genetic architectures in prediction models.

Among the machine learning methodologies used in genomic prediction, decision trees and their refinements (such as bootstrap aggregation (bagging), random forest, and boosting) stand out, as they stratify the predictor space into many sub-regions19,20. These refinements aim to build more accurate prediction models; for example, bagging (Bag) reduces the variance observed in decision trees, random forest (RF) improves accuracy by avoiding high tree correlation21, and boosting (Boost) builds trees sequentially using information from previously built trees22.

Studies using simulated and real data have concluded that tree-based machine learning tools can serve as an alternative to traditional techniques for genomic prediction23,24,25,26. For instance, Sousa et al.27, evaluating genomic prediction for resistance to rust disease in Coffea arabica, observed that Bag, RF, and Boost showed superior predictive abilities to Generalized Bayesian Linear Regression. Westhues et al.28, using genomic and environmental variables, found that machine learning models can provide similar or slightly superior predictive abilities to GBLUP models for traits strongly influenced by environmental factors. Despite the potential of tree-based machine learning, there are still few studies that have evaluated MET data, and these methodologies could prove beneficial in such cases.

Therefore, our study aimed to evaluate and compare the efficiency of a statistical methodology (GBLUP) and machine learning methodologies (Bag, RF, and Boost) for genomic prediction of single cross hybrids evaluated for drought tolerance traits in MET. We considered two prediction scenarios, mimicking two situations that plant breeders may encounter: (i) predicting the performance of newly developed single cross hybrids for which there are no existing phenotypic records; (ii) predicting the performance of single cross hybrids in sparse design trials, where some hybrids were evaluated in some environments, but not in others.

Material and methods

Phenotypic data

The data comprise 265 single cross hybrids from the maize breeding program of Embrapa Maize and Sorghum, evaluated in eight combinations of trials/locations/years under irrigated (WW) and water stress (WS) conditions at two locations in Brazil (Janaúba, Minas Gerais, and Teresina, Piauí) over two years (2010 and 2011). The hybrids were obtained from crosses between 188 inbred lines and two testers. The inbred lines belong to three heterotic groups: dent (85 inbred lines), flint (86 inbred lines), and an additional group, referred to as group C (17 inbred lines), which is unrelated to the dent and flint origins. The two testers are inbred lines belonging to the flint (L3) and dent (L228-3) groups. Among the inbred lines, 120 were crossed with both testers, 52 were crossed with the L228-3 tester only, and 16 were crossed with the L3 tester only. Silva et al. (2020) evaluated the genetic diversity and heterotic groups in the same database and showed the existence of subgroups within each heterotic group. Therefore, since these groups were not genetically well defined and the Embrapa Maize breeding program was at an early stage, the effect of allelic substitution is assumed to be the same in both groups. More details on the experimental design and procedures can be found in Dias et al.13,30.

The experiment originally included 308 entries, but hybrids that were not present in all environments were removed so that genomic prediction could be evaluated within each environment, resulting in a total of 265 hybrids for analysis. Each trial consisted of 308 maize single cross hybrids, randomly divided into six sets: sets 1–3 for crosses with L3 (61, 61, and 14 hybrids, respectively), and sets 4–6 for crosses with L228-3 (80, 77, and 15 hybrids, respectively). Four checks (commercial maize cultivars) were included in each set, and the experiment was designed in completely randomized blocks. Between trials, hybrids within each set remained the same, but hybrids and checks were randomly allocated into groups of plots within each set. This allocation varied between replicates of sets and between trials. The WS trials had three replications, except for the set containing 15 hybrids and the trials evaluated in 2010, which had two replications. All WW trials, except for the trial in 2011, had two replicates.

Two agronomic traits related to drought tolerance were analyzed: grain yield (GY) and female flowering time (FFT). GY was determined by weighing all grains in each plot, adjusted to 13% grain moisture, and converted to tons per hectare (t/ha), accounting for differences in plot sizes across trials. FFT was measured as the number of days from sowing until the stigmas appeared in 50% of the plants. A summary of means, standard deviations, and ranges of both evaluated traits is available in Table 1.

Table 1 Summary of means (Mean), standard deviations (SD), minimum (Min), and maximum (Max) for grain yield (GY) and female flowering time (FFT) obtained under irrigated (WW) and water stress (WS) conditions, evaluated in the years 2010 and 2011, at the locations of Janaúba (J) and Teresina (T).

To conduct the analyses, hybrids considered outliers for the GY and FFT traits were removed (i.e., hybrids with phenotypic values more than 1.5 × the interquartile range above the third quartile or below the first quartile). The variations in predictive abilities among T2, T1, and T0 hybrids are widely recognized31. However, the primary aim of our study was to compare different prediction methodologies in MET assays. In this study, there were 240 T2 hybrids and 68 T1 hybrids; T2 hybrids have both parents evaluated in different hybrid combinations, while T1 hybrids are single-cross hybrids sharing one parent with the tested hybrids. Given the realistic nature of our scenario, these hybrid groups have a limited and imbalanced distribution, making a fair comparison challenging. Consequently, we opted to construct a training set comprising T2 and T0 hybrids.
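The outlier rule described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the quartile method, and the example values are assumptions, and the text does not state whether the rule was applied per environment or across environments.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(records, k=1.5):
    """Keep (hybrid, value) pairs whose value lies within the fences."""
    lower, upper = iqr_fences([value for _, value in records], k)
    return [(hybrid, value) for hybrid, value in records if lower <= value <= upper]

# Hypothetical grain-yield values (t/ha), for illustration only.
example = [("H1", 5.1), ("H2", 5.4), ("H3", 4.9), ("H4", 5.0), ("H5", 12.0)]
kept = drop_outliers(example)  # H5 lies above the upper fence and is dropped
```

The quartile definition is a design choice; other conventions (for example R's default quantile type) can move the fences slightly for small samples.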

Statistical analysis of phenotypic data

To correct the phenotypic values for experimental design effects, each condition (WW and WS) and environment was analyzed independently to obtain the empirical Best Linear Unbiased Estimates (eBLUEs) for each hybrid, for the two traits evaluated. The estimates were obtained based on the following model:

$$\boldsymbol{y} = \boldsymbol{1}\mu + \boldsymbol{X}_{1}\boldsymbol{r} + \boldsymbol{X}_{2}\boldsymbol{s} + \boldsymbol{X}_{3}\boldsymbol{h} + \boldsymbol{e}$$

where \(\boldsymbol{y}\ (n \times 1)\) is the phenotype vector for \(f\) replicates and \(t\) sets of \(p\) hybrids, with \(n\) the number of observations; \(\mu\) is the mean; \(\boldsymbol{r}\ (f \times 1)\) is the fixed effect vector of the replicates; \(\boldsymbol{s}\ (t \times 1)\) is the fixed effect vector of the sets; \(\boldsymbol{h}\ (p \times 1)\) is the fixed effect vector of the hybrids; and \(\boldsymbol{e}\ (n \times 1)\) is the residual vector, with \(\boldsymbol{e} \sim MVN\left(\boldsymbol{0}, \boldsymbol{I}\sigma_{e}^{2}\right)\), where \(\boldsymbol{I}\) is an identity matrix of corresponding order and \(\sigma_{e}^{2}\) is the residual variance. \(\boldsymbol{X}_{1}\ (n \times f)\), \(\boldsymbol{X}_{2}\ (n \times t)\), and \(\boldsymbol{X}_{3}\ (n \times p)\) represent incidence matrices for their respective effects. The eBLUEs of each environment were used in further analyses.

Genotypic data

A total of 57,294 single nucleotide polymorphism (SNP) markers were obtained from the 188 inbred lines and two testers used as parents of the 265 single cross hybrids. The genotyping by sequencing (GBS) strategy is detailed in Dias et al.13. For quality control, SNPs were discarded if the minor allele frequency was smaller than 5%, more than 20% of genotypes were missing, and/or more than 5% of genotypes were heterozygous. After filtering, missing data were imputed using NPUTE. Then, for each SNP, the genotypes of the hybrids were inferred from the genotypes of their parents (inbred line and tester). The number of SNPs per chromosome ranged from 3121 (chromosome 10) to 7705 (chromosome 1), totaling 47,127 markers.
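The filtering thresholds and the parent-based inference of hybrid genotypes can be illustrated with the sketch below. It is not the pipeline used in the study: the dosage coding (0, 1, 2 copies of one allele, `None` for a missing call), the function names, and the choice to compute the heterozygosity rate over observed calls are assumptions made for illustration.

```python
def passes_qc(calls, maf_min=0.05, max_missing=0.20, max_het=0.05):
    """calls: allele dosages (0, 1, 2) across inbred lines, None if missing."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return False
    missing_rate = 1 - len(observed) / len(calls)
    het_rate = sum(g == 1 for g in observed) / len(observed)
    freq = sum(observed) / (2 * len(observed))   # frequency of the counted allele
    maf = min(freq, 1 - freq)
    return maf >= maf_min and missing_rate <= max_missing and het_rate <= max_het

def hybrid_dosage(line, tester):
    """Dosage of a single cross between two homozygous parents (0 or 2)."""
    if line not in (0, 2) or tester not in (0, 2):
        raise ValueError("this sketch assumes homozygous, imputed parents")
    return (line + tester) // 2

# Hypothetical SNP scored on ten lines, one call missing.
snp = [0, 2, 0, 2, 2, 0, None, 2, 0, 2]
keep = passes_qc(snp)
```

Each parent contributes one allele to the hybrid, so two homozygous parents with dosages 0 and 2 give a heterozygous hybrid with dosage 1.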

Genomic relationship matrix

The additive and dominance genomic relationship matrices were constructed32 based on information from the SNPs using the package AGHmatrix33, following VanRaden34 and Vitezica et al., respectively.
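For reference, the two relationship matrices are commonly written as shown below. This follows the standard formulations attributed to VanRaden and to Vitezica et al. and is not a transcription of the AGHmatrix settings used in the study; the symbols \(m_{ij}\), \(p_{j}\), \(q_{j}\), \(\boldsymbol{Z}\), and \(\boldsymbol{W}\) are notation assumed here for illustration. Let \(m_{ij} \in \{0, 1, 2\}\) be the number of copies of the counted allele carried by hybrid \(i\) at SNP \(j\), \(p_{j}\) the frequency of that allele, and \(q_{j} = 1 - p_{j}\):

$$\boldsymbol{A} = \frac{\boldsymbol{Z}\boldsymbol{Z}^{\prime}}{2\sum_{j} p_{j} q_{j}}, \qquad z_{ij} = m_{ij} - 2p_{j}$$

$$\boldsymbol{D} = \frac{\boldsymbol{W}\boldsymbol{W}^{\prime}}{\sum_{j} \left( 2 p_{j} q_{j} \right)^{2}}, \qquad w_{ij} = \begin{cases} -2q_{j}^{2} & \text{if } m_{ij} = 2 \\ 2p_{j}q_{j} & \text{if } m_{ij} = 1 \\ -2p_{j}^{2} & \text{if } m_{ij} = 0 \end{cases}$$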

Genomic prediction

Genomic predictions were performed with the Genomic Best Linear Unbiased Prediction (GBLUP) method, implemented in the package AsReml v. 436. Two groups were considered: the first comprised the four environments under WW conditions, and the second the four environments under WS conditions. The linear model is described below:

$$\overline{\boldsymbol{y}} = \mu\boldsymbol{1} + \boldsymbol{X}\boldsymbol{b} + \boldsymbol{Z}_{1}\boldsymbol{u}_{a} + \boldsymbol{Z}_{2}\boldsymbol{u}_{d} + \boldsymbol{e}$$

where \(\overline{\boldsymbol{y}}\ (pq \times 1)\) is the vector of eBLUEs previously estimated for each environment, with \(p\) hybrids and \(q\) environments; \(\mu\) is the mean; \(\boldsymbol{b}\ (q \times 1)\) is the vector of environmental effects (fixed); \(\boldsymbol{u}_{a}\ (pq \times 1)\) is the vector of individual additive genetic values nested within environments (random), with \(\boldsymbol{u}_{a} \sim MVN\left(\boldsymbol{0}, \left[\boldsymbol{I}_{q}\sigma_{u_{a}}^{2} + \rho_{a}\left(\boldsymbol{J}_{q} - \boldsymbol{I}_{q}\right)\right] \otimes \boldsymbol{A}\right)\), where \(\boldsymbol{A}\) is the genomic relationship matrix between individuals for additive effects, \(\rho_{a}\) is the additive genetic correlation coefficient between environments, \(\boldsymbol{I}_{q}\ (q \times q)\) is an identity matrix, \(\boldsymbol{J}_{q}\ (q \times q)\) is a matrix of ones, and \(\otimes\) denotes the Kronecker product; \(\boldsymbol{u}_{d}\ (pq \times 1)\) is the vector of individual dominance genetic values nested within environments (random), with \(\boldsymbol{u}_{d} \sim MVN\left(\boldsymbol{0}, \left[\boldsymbol{I}_{q}\sigma_{u_{d}}^{2} + \rho_{d}\left(\boldsymbol{J}_{q} - \boldsymbol{I}_{q}\right)\right] \otimes \boldsymbol{D}\right)\), where \(\boldsymbol{D}\) is the genomic relationship matrix between individuals for dominance effects and \(\rho_{d}\) is the dominance correlation coefficient between environments; and \(\boldsymbol{e}\ (pq \times 1)\) is the vector of random residuals, with \(\boldsymbol{e} \sim MVN\left(\boldsymbol{0}, \boldsymbol{I}\sigma_{e}^{2}\right)\). \(\boldsymbol{X}\ (pq \times q)\), \(\boldsymbol{Z}_{1}\ (pq \times pq)\), and \(\boldsymbol{Z}_{2}\ (pq \times pq)\) are the incidence matrices for their respective effects, and \(\boldsymbol{1}\ (pq \times 1)\) is a vector of ones. The (co)variance components were obtained using the residual maximum likelihood method (REML)37.
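To make the covariance structure of the genetic effects concrete, the sketch below builds the \(q \times q\) between-environment matrix (one value on the diagonal, one common value off the diagonal) and takes its Kronecker product with a relationship matrix. The function names and the numerical values are hypothetical; the off-diagonal argument plays the role of the \(\rho\) term in Eq. (2).

```python
def compound_symmetry(q, diagonal, off_diagonal):
    """q x q matrix equal to I*diagonal + off_diagonal*(J - I)."""
    return [[diagonal if i == j else off_diagonal for j in range(q)]
            for i in range(q)]

def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]

# Hypothetical example: 2 environments, 2 hybrids with relationship 0.5.
env_cov = compound_symmetry(2, diagonal=2.0, off_diagonal=1.0)
A = [[1.0, 0.5],
     [0.5, 1.0]]
cov_u = kronecker(env_cov, A)   # (pq x pq) covariance, ordered by environment
```

With this ordering, each diagonal block is the within-environment covariance among hybrids, and each off-diagonal block carries the covariance of the same set of hybrids between two environments, which is what lets records from one environment inform predictions in another.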

Two alternative models were also used. The first retained only additive effects for genomic prediction, removing \(\boldsymbol{u}_{d}\) from Eq. (2). The second was used to estimate the genetic parameters within each environment separately.

The significance of random effects was tested using the Likelihood Ratio Test (LRT)38, given by:

$$LRT = 2\left( LogL_{c} - LogL_{r} \right)$$

where \(LogL_{c}\) is the logarithm of the restricted likelihood function of the complete model (with all effects included), and \(LogL_{r}\) is the logarithm of the restricted likelihood function of the reduced model (without the effect under test). Significance was assessed by comparing the LRT statistic with the chi-square (\(\chi^{2}\)) distribution with one degree of freedom at a significance level of 5%39.
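As a worked illustration of the test, the sketch below computes the LRT statistic and its p-value from two log-likelihoods. The function name and the log-likelihood values are hypothetical. For one degree of freedom the chi-square tail probability equals \(\mathrm{erfc}(\sqrt{x/2})\), which avoids any dependency beyond the standard library.

```python
import math

def lrt_one_df(loglik_complete, loglik_reduced, alpha=0.05):
    """Return (statistic, p_value, significant) for a 1-df likelihood ratio test."""
    statistic = max(2.0 * (loglik_complete - loglik_reduced), 0.0)
    # P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, p_value < alpha

# Hypothetical REML log-likelihoods of the complete and reduced models.
statistic, p_value, significant = lrt_one_df(-100.0, -102.0)
```

A statistic of 4.0 exceeds the 5% critical value of about 3.84, so the effect dropped from the reduced model would be declared significant in this example.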

The narrow-sense heritability (\(h^{2}\)), the proportion of variance explained by dominance effects (\(d^{2}\)), and the broad-sense heritability (\(H^{2}\)) for each trait were estimated following Falconer and Mackay 199635.
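The three parameters can be computed from variance components as sketched below. The text does not give the exact expressions, so the ratios used here (each genetic component over the sum of additive, dominance, and residual variances) are an assumption based on the textbook definitions; the function name and input values are hypothetical.

```python
def genetic_parameters(var_additive, var_dominance, var_residual):
    """Return h2, d2 and H2 as proportions of the total phenotypic variance."""
    total = var_additive + var_dominance + var_residual
    h2 = var_additive / total      # narrow-sense heritability
    d2 = var_dominance / total     # proportion due to dominance
    return {"h2": h2, "d2": d2, "H2": h2 + d2}

# Hypothetical variance components, for illustration only.
params = genetic_parameters(2.0, 1.0, 1.0)
```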

Machine learning

As in the GBLUP analysis, the trials were divided between WW and WS conditions, and the potential of regression trees (RT) was explored using the following three algorithms: bagging, random forest, and boosting22. Bagging (Bag) is a methodology that aims to reduce the RT variance22. It consists of drawing \(D\) bootstrap samples (sampling with replacement) from the training data, fitting one tree to each sample to obtain \(D\) models \(\hat{f}^{1}\left( x \right), \hat{f}^{2}\left( x \right), \ldots, \hat{f}^{D}\left( x \right)\), and finally averaging the generated models:

$$\hat{f}_{avg}\left( x \right) = \frac{1}{D}\sum_{d = 1}^{D} \hat{f}^{d}\left( x \right)$$

This decreases the variability obtained in the decision trees. Increasing the number of trees in Bag does not lead to overfitting of the model. In practice, the number of trees is increased until the error has stabilized22. The number of trees for Bag was set at 500.

Random forest (RF) was proposed by Ho40 and is an improvement of Bag that avoids high correlation among the trees and improves the accuracy in the selection of individuals. RF changes only the number of predictor variables considered at each split. That is, each time a split in a tree is considered, a random sample of \(m\) variables is chosen as candidates from the complete set of \(p\) variables. Hastie et al.21 suggest using \(m = p/3\) predictor variables at each split for regression trees. The number of trees for RF was set at 500.
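The difference between Bag and RF can be sketched with a toy ensemble. This is an illustration of the idea only, not the randomForest implementation: to keep it short, each "tree" is a single-split stump, and all function names and data are hypothetical. Setting `mtry` to the number of predictors gives bagging; a smaller `mtry` gives the random forest variant.

```python
import random

def fit_stump(X, y, features):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:                       # no candidate feature varies: predict the mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, thr, ml, mr = stump
    return ml if row[j] <= thr else mr

def default_mtry(p):
    """Number of candidate predictors per split, following m = p/3 for regression."""
    return max(p // 3, 1)

def fit_ensemble(X, y, n_trees=500, mtry=None, seed=0):
    """mtry=None uses all predictors (bagging); mtry < p gives a random forest."""
    rng = random.Random(seed)
    p = len(X[0])
    mtry = p if mtry is None else mtry
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        stumps.append(fit_stump(Xb, yb, rng.sample(range(p), mtry)))
    return stumps

def predict_ensemble(stumps, row):
    """Average of the individual predictions, as in the bagging equation."""
    return sum(predict_stump(s, row) for s in stumps) / len(stumps)

# Hypothetical data: two predictors, response driven by the first one.
X = [[0, 0], [0, 1], [2, 0], [2, 1]]
y = [1.0, 1.0, 3.0, 3.0]
bag = fit_ensemble(X, y, n_trees=25)
rf = fit_ensemble(X, y, n_trees=25, mtry=default_mtry(2))
```

Because RF sometimes withholds the strongest predictor from a split, its trees differ more from one another than bagged trees, which is the decorrelation effect described above.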

Boosting (Boost) fits RT to the residuals of an initial model. Trees are grown sequentially, each one fitted to the residuals left by the previous trees, and the residuals are updated after each tree. The final prediction is a combination of a large number of trees, such that:

$$\hat{f}\left( x \right) = \sum_{b = 1}^{B} \lambda \hat{f}^{b}\left( x \right)$$

The function \(\hat{f}\left( \cdot \right)\) refers to the final model that combines the \(B\) sequentially fitted trees, and \(\lambda\) is the shrinkage parameter that controls the learning rate of the method. Furthermore, the number of splits in each tree must be set. This parameter controls the complexity of Boost and is known as the depth. For Boost, the number of trees was 250, with a learning rate of 0.1 and a depth of 3.
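The sequential residual fitting can be sketched as follows. This is a simplified illustration, not the gbm implementation: each tree is a single-split stump (depth 1 rather than the depth of 3 used in the study), the model starts from zero rather than from an initial constant, and all names and data are hypothetical.

```python
def fit_stump(X, r):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [ri for row, ri in zip(X, r) if row[j] <= thr]
            right = [ri for row, ri in zip(X, r) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:                       # no feature varies: predict the mean
        mean = sum(r) / len(r)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, thr, ml, mr = stump
    return ml if row[j] <= thr else mr

def fit_boost(X, y, n_trees=250, shrinkage=0.1):
    """Fit each stump to the current residuals, then shrink and update them."""
    residual = list(y)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(X, residual)
        stumps.append(stump)
        residual = [ri - shrinkage * predict_stump(stump, row)
                    for row, ri in zip(X, residual)]
    return stumps

def predict_boost(stumps, row, shrinkage=0.1):
    """Sum of shrunken tree predictions, as in the boosting equation."""
    return shrinkage * sum(predict_stump(s, row) for s in stumps)

# Hypothetical data: one predictor separating two response levels.
X = [[0], [0], [2], [2]]
y = [1.0, 1.0, 3.0, 3.0]
model = fit_boost(X, y)
```

With a learning rate of 0.1, each tree removes only a tenth of the remaining residual, so many trees are needed before the fit approaches the training values.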

To predict hybrids in each environment from the MET dataset, we propose including the location and year in which the experiments were carried out as factors in the input data, together with the SNP markers, as predictors in the machine learning methodologies. The eBLUEs previously estimated by Eq. (1) were used as the response variable.
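The input layout described above, one row per hybrid and environment with location, year, and markers as predictors, can be sketched as follows. The function name, the dictionary keys, and the example values are hypothetical; in R the location and year columns would be stored as factors.

```python
def build_rows(eblues, markers):
    """eblues: {(hybrid, location, year): value}; markers: {hybrid: [dosages]}.

    Returns a list of (predictors, response) pairs, one per hybrid-environment cell.
    """
    rows = []
    for (hybrid, location, year), value in sorted(eblues.items()):
        predictors = {"location": location, "year": year}
        predictors.update({f"snp_{j}": g for j, g in enumerate(markers[hybrid])})
        rows.append((predictors, value))
    return rows

# Hypothetical eBLUEs (t/ha) and marker dosages for one hybrid in two environments.
eblues = {("H1", "J", 2010): 5.0, ("H1", "T", 2011): 4.0}
markers = {"H1": [0, 1, 2]}
rows = build_rows(eblues, markers)
```

The same marker profile is repeated in every row of a hybrid, so only the location and year columns distinguish its records across environments.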

For the construction of the bagging and random forest models, the randomForest function from the package randomForest41 was used. For boosting, the gbm function from the package gbm42 was used. All analyses were implemented in the software R43.

Model validation

Genomic predictions were carried out following Burgueño et al.16, considering two different prediction problems, CV1 and CV2, which simulate two possible scenarios a breeder can face. In CV1, the ability of the algorithms to predict the performance of hybrids that have not yet been evaluated in any field trial was evaluated. Thus, predictions derived from the CV1 scenario are entirely based on phenotypic and genotypic records from other related hybrids. In CV2, the ability of the algorithms to predict the performance of hybrids using data collected in other environments was evaluated. It simulates the prediction problem found in incomplete MET trials. Here, information from related individuals is used, and the prediction can benefit from genetic relationships between hybrids and correlations between environments. Within the CV2 scenario, two different situations of data imbalance were evaluated. In the first, called CV2 (50%), the tested hybrids were not present in half of the environments, while in the second, called CV2 (25%), the tested hybrids were not present in only 25% of the environments. Table 2 provides a hypothetical representation of this CV1, CV2 (50%), and CV2 (25%) validation scheme.

Table 2 Representation of the three scenarios (CV1, CV2-50% and CV2-25%) for four hybrids (Hybrid 1–4) and four environments (J10, J11, T10, T11).
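The three validation schemes differ only in how many environments are hidden for each validation hybrid, which the sketch below makes explicit. The function name and scenario labels are hypothetical, and drawing the hidden environments at random for each hybrid is an assumption; the text does not state how they were chosen.

```python
import random

HIDDEN_FRACTION = {"CV1": 1.0, "CV2-50": 0.5, "CV2-25": 0.25}

def mask_validation(validation_hybrids, environments, scenario, seed=0):
    """Return the set of (hybrid, environment) cells removed from the training data."""
    rng = random.Random(seed)
    n_hidden = round(HIDDEN_FRACTION[scenario] * len(environments))
    hidden = set()
    for hybrid in validation_hybrids:
        for env in rng.sample(environments, n_hidden):
            hidden.add((hybrid, env))
    return hidden

environments = ["J10", "J11", "T10", "T11"]
cv1 = mask_validation(["H1", "H2"], environments, "CV1")       # hidden everywhere
cv2_50 = mask_validation(["H1", "H2"], environments, "CV2-50")  # hidden in 2 of 4
cv2_25 = mask_validation(["H1", "H2"], environments, "CV2-25")  # hidden in 1 of 4
```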

To separate the training and validation sets, k-fold cross-validation was used, with \(k = 5\). The set of 265 hybrids was divided into five groups; in each fold, four groups (80% of the hybrids) formed the training population and the remaining group (20% of the hybrids) formed the validation population. The groups were formed so that each contained the cross types (Dent × Dent, Dent × Flint, Flint × Flint, C × Dent, C × Flint) in similar proportions. The cross-validation process was performed separately for each trait, condition (WS or WW), and scenario (CV1, CV2-50%, and CV2-25%) and was repeated five times to assess the predictive ability of the analyses.
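One way to form folds that keep the cross types in similar proportions is sketched below. It is an illustration under stated assumptions rather than the procedure used in the study: the function name and the round-robin assignment are hypothetical choices.

```python
import random
from collections import defaultdict

def stratified_folds(cross_type, k=5, seed=0):
    """cross_type: {hybrid: cross label}. Returns k lists of hybrids."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for hybrid, label in sorted(cross_type.items()):
        by_type[label].append(hybrid)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_type):
        members = by_type[label]
        rng.shuffle(members)
        for hybrid in members:            # deal hybrids of each type round-robin
            folds[position % k].append(hybrid)
            position += 1
    return folds

# Hypothetical example: 10 Dent x Flint and 10 Flint x Flint hybrids.
cross_type = {f"DF{i}": "Dent x Flint" for i in range(10)}
cross_type.update({f"FF{i}": "Flint x Flint" for i in range(10)})
folds = stratified_folds(cross_type)
```

Each fold serves once as the validation population while the other four form the training population; repeating the procedure with a different seed gives a new partition.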

The predictive ability within each environment for each condition (WS and WW) was estimated as the Pearson correlation coefficient44 between the corrected phenotypic values (eBLUEs) from Eq. (1) for each environment and the genomic estimated breeding values (GEBVs) predicted by each fitted method.
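The per-environment predictive ability can be computed as in the sketch below. The function names and the example values are hypothetical; only cells with both an observed eBLUE and a prediction enter the correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def predictive_ability(observed, predicted):
    """observed, predicted: {(hybrid, env): value}. Returns {env: correlation}."""
    by_env = {}
    for (hybrid, env), value in observed.items():
        if (hybrid, env) in predicted:
            obs, pred = by_env.setdefault(env, ([], []))
            obs.append(value)
            pred.append(predicted[(hybrid, env)])
    return {env: pearson(obs, pred) for env, (obs, pred) in by_env.items()}

# Hypothetical eBLUEs and predictions for three validation hybrids in one environment.
observed = {("H1", "J10"): 1.0, ("H2", "J10"): 2.0, ("H3", "J10"): 3.0}
predicted = {("H1", "J10"): 2.0, ("H2", "J10"): 4.0, ("H3", "J10"): 6.0}
ability = predictive_ability(observed, predicted)
```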

Ethics statement

The authors confirm that all methods were carried out in accordance with the relevant guidelines described in the methods section. The authors also confirm that the handling of the plant materials used in the study complies with relevant institutional, national, and international guidelines and legislation.

Statement of handling of plants

The authors confirm that the appropriate permissions and/or licenses for the collection of plant or seed specimens were obtained.


Results

Variance components and estimation of genetic parameters

Estimates of variance components and genetic parameters for GY and FFT under WW and WS conditions, obtained from the joint analysis of the four environments and from analyses within each environment, are shown in Table 3. For the joint analysis, the heritability estimates for GY and FFT differed slightly from those obtained by Dias et al.13 using the same material, since a different statistical model was used here to estimate the genetic parameters, and hybrids that were not present in all environments were removed.

Table 3 Estimates of variance components and genetic parameters for grain yield (GY) and female flowering time (FFT), obtained from the joint analysis of the four evaluated environments and from analyses within each environment, for the irrigated (WW) and water stress (WS) conditions.

The additive variance found for GY and FFT was greater than the variance due to dominance effects in both WS and WW conditions. For GY, the variances due to dominance effects represented about 33.3% and 31.0% of the genetic variance in WS and WW conditions, respectively. Lower broad-sense heritability was observed for this trait in WW (0.42) than in WS conditions (0.53). As for FFT, the variances due to dominance effects represented about 19.9% and 20.8% of the genetic variance in WS and WW conditions, respectively, and the broad-sense heritability for FFT was greater in WW (0.71) than in WS conditions (0.56).

For GY, the narrow-sense heritabilities within environments ranged from 0.30 (T11) to 0.38 (J10) under WS conditions and from 0.20 (T11) to 0.57 (J10) under WW conditions. The proportion of genetic variance explained by dominance deviations ranged from 0.04 (T10) to 0.43 (T11) under WS conditions and from 0.10 (T11) to 0.29 (J11) under WW conditions. The broad-sense heritabilities were lower for the experimental tests that had a lower number of repetitions (2010 under WS conditions, and 2011 under WW conditions).

For FFT, the narrow-sense heritabilities ranged from 0.09 (T10) to 0.80 (J10) under WS conditions and from 0.49 (T10) to 0.74 (J10) under WW conditions. The proportion of genetic variance explained by dominance deviations ranged from 0.01 (T10) to 0.26 (T11) under WS conditions and from 0.07 (J11) to 0.23 (T11) under WW conditions. The broad-sense heritabilities were higher for J10 (0.89 and 0.88) under WS and WW conditions.

Equation (2) is an implicit model for MET analyses45 and provides genetic correlations for additive and dominance effects across environments. These correlations reflect the level of genotype-by-environment interaction. In particular, for GY, the correlations between environments were 0.35 and 0.24 for WS and WW conditions, respectively, indicating an inconsistent ranking of hybrids across environments. As for FFT, the lowest correlation was observed for dominance effects.

Efficiency of prediction methodologies in multi-environment trials

Figures 1 and 2 show the predictive abilities observed in the three scenarios (CV1, CV2-50%, and CV2-25%) for each of the five compared methods: GBLUP additive model (GBLUP-A), GBLUP additive-dominant model (GBLUP-AD), bagging (Bag), random forest (RF) and boosting (Boost). The numerical results of these figures are presented in Supplementary Tables 1, 2, and 3.

Figure 1

Mean predictive abilities and their respective standard errors for grain yield (GY), evaluated in the environments of Janaúba/2010 (J10), Janaúba/2011 (J11), Teresina/2010 (T10) and Teresina/2011 (T11), under the CV1, CV2 (50%) and CV2 (25%) scenarios, considering irrigated (WW) and water stress (WS) conditions. The evaluated methodologies include bagging (Bag), random forest (RF), boosting (Boost), GBLUP additive model (GBLUP-A), and GBLUP additive-dominant model (GBLUP-AD).

Figure 2

Mean predictive abilities and their respective standard errors for female flowering time (FFT), evaluated in the environments of Janaúba/2010 (J10), Janaúba/2011 (J11), Teresina/2010 (T10) and Teresina/2011 (T11), under CV1, CV2 (50%) and CV2 (25%) scenarios, considering irrigated (WW) and water stress (WS) conditions. The evaluated methodologies include bagging (Bag), random forest (RF), boosting (Boost), GBLUP additive model (GBLUP-A), and GBLUP additive-dominant model (GBLUP-AD).

For the GBLUP models, the predictive abilities were lowest in the CV1 scenario, where the predictions were based only on phenotypic and genotypic records of other related hybrids. The predictive ability increased when the predicted hybrid was present in some environments, being intermediate in CV2-50% and highest in CV2-25%. The most notable improvement in predictive abilities was observed when moving from CV1 to CV2-50%. Specifically, for GBLUP-A, the average increase was about 107%, 19%, 18%, and 19% for GY-WS, GY-WW, FFT-WS, and FFT-WW, respectively. As for GBLUP-AD, the average increase was 19%, 7%, 12%, and 15% for GY-WS, GY-WW, FFT-WS, and FFT-WW, respectively (Supplementary Table 4). For GY, considering only the CV1 scenario, the average predictive abilities were higher in WW conditions. For FFT, mean predictive abilities were higher in WW than in WS conditions in all scenarios (Supplementary Table 4).

The environments with the highest broad-sense heritabilities also exhibited the highest average predictive abilities across all evaluated methodologies (Table 3, Figs. 1 and 2). The GBLUP-AD model demonstrated superior predictive abilities for the GY and FFT traits in almost all environments and scenarios (CV1, CV2-50%, and CV2-25%). Conversely, the GBLUP-A performed equally or better than the GBLUP-AD model for FFT in environments where the dominance effect was not significant (T10-WS and J11-WW). However, for GY, in environments where the proportion of dominance genetic variance was close to or greater than the additive variance (J11 and T11 under WS), GBLUP-A exhibited lower predictive ability.

Unlike GBLUP, machine learning methodologies did not show a consistent pattern of increased predictive ability when phenotypic records of the hybrid to be predicted were available in correlated environments. Overall, an increase in predictive abilities for GY was observed when the scenario changed from CV1 to CV2 under WS conditions, which was not observed in many environments under WW conditions. For example, for RF, the average predictive abilities across environments increased by approximately 51% for GY in WS conditions, while a decrease of about 10% was observed when the prediction changed from CV1 to CV2-50% for the same trait in WW conditions (Supplementary Table 4). As for FFT, Bag drew the most attention, presenting a large standard error in predictive abilities with statistically similar results for CV1, CV2-50%, and CV2-25% in WS and WW conditions, in almost all environments.

In environments where dominance effects accounted for a large portion of the genetic variance, the Bag, RF and Boost methodologies showed intermediate results between the two GBLUP models. Among the evaluated machine learning methodologies, Bag and RF produced very similar predictions and Boost presented the lowest predictive ability for GY. As for FFT, machine learning did not show improvement in predictive abilities when compared to GBLUP.


Discussion

We employed both statistical and machine learning methods to evaluate the efficiency of genomic predictions for GY and FFT. These models allow us to leverage information about relationships between hybrids and correlated environments. Three scenarios, CV1, CV2-50% and CV2-25%, were used, each characterized by a different degree of data imbalance. Predicting the performance of new hybrids that have not been evaluated in the field (CV1) is more challenging than predicting hybrids that have been evaluated in an unbalanced manner across correlated environments (CV2-50% and CV2-25%)15,16.

Among the evaluated methodologies, GBLUP stands out because it allows relationships between individuals to be estimated from molecular markers for additive and dominance effects32,34. This methodology also allows variance–covariance structures to be incorporated to handle correlated environments and unbalanced data10,14. As expected, GBLUP showed the highest predictive abilities in the CV2 validation schemes (CV2-50% and CV2-25%), as the predictions benefited from phenotypic records of the same hybrids in other correlated environments. Similar results have been reported in previous studies using wheat inbred lines and maize single cross hybrids13,14,15,16.

GBLUP-AD predicted hybrids more efficiently for traits and environments in which the effects due to dominance deviations were significant. As a substantial part of the genetic variance in maize hybrids for GY is attributed to dominance effects46, incorporating these effects in the model significantly improved the prediction of this trait. However, when the dominance effect accounts for a small portion of the genetic variance (as observed for FFT), additive-dominant models tend to show only minor improvements over additive models in the prediction of the total genetic effects13,47. This emphasizes the importance of understanding the genetic basis of hybrid performance, which can be decomposed into general combining ability (GCA) and specific combining ability (SCA). For GCA, the additive variance is the major component of the variance48, whereas SCA is largely dependent on genes with dominance or epistatic effects49. In this context, understanding the relative magnitudes of GCA and SCA variances is instrumental in guiding the formulation of optimal strategies for hybrid breeding programs, as highlighted in previous research50.

One possible reason why tree-based learning models did not benefit from the presence of phenotyped hybrids in correlated environments is related to their shared concept of recursive partitioning51. These models aim to find decision rules that naturally partition the predictor space to provide an informative and robust hierarchical model21,52. In this context, it is conceivable that the location/year variables played a crucial role in most of the trees, separating records of the same hybrid at the first splits and making it difficult for them to be grouped in the terminal nodes, which are responsible for the predicted response.

Among the evaluated methodologies, Bag showed, in most environments, the same prediction pattern whether the predicted hybrid was absent from all environments or absent from only some of them. Bag is a variation of the decision tree that resamples the original data into bootstrap subsamples, one for each tree. This process may lead to high correlation between the generated trees, with the same variable consistently being the most important one21. As a consequence, the same hybrid in different environments may end up at very distant nodes of the tree. RF further reduces the dependence between the decision trees by randomly selecting part of the original data and of the variables to build the trees22,51. This gives different variables a chance to appear at the top of the trees. Boost, on the other hand, sequentially combines different predictors, fitting a new tree to the residuals of the previous model using a specified loss function (e.g., mean squared error for regression)53. It incorporates automatic indirect selection of markers and is generally recommended for regression analysis17,20. It is important to note that these variations of decision trees are typically used when obtaining accurate predictions is more important than the biological interpretation of the model itself52.

One of the advantages of using machine learning methodologies is that they do not require the inheritance of the trait to be specified and can, in principle, capture non-additive effects in the genome, which are often not specified in traditional statistical methods25. Tree-based machine learning methodologies possibly captured part of the dominance effect, giving better results than GBLUP-A in environments where dominance represented a large part of the genetic variance. These methodologies are competitive with statistical models and tend to outperform them when applied to large data sets. However, when applied to small training sets, machine learning is probably not able to capture all non-additive information and linear models may perform better54.

Tree-based machine learning methodologies (Bag, RF, and Boost) are considered promising for genomic prediction, especially for traits controlled by a few quantitative trait loci (QTL), as they capture non-additive effects in their models25. Among the machine learning methodologies evaluated in this study, Boost is considered the most sensitive to variation in traits and environments25,26,55. In these studies, carried out with simulated data, the lower the heritability and the greater the number of QTL that control the trait, the lower the Boost prediction efficiency. This may be related to the lower predictive ability of Boost for GY when compared to other methodologies, since GY is a quantitative trait with low heritability that is greatly influenced by the environment56,57.

A possible explanation for the lower predictive ability of machine learning methodologies when compared to GBLUP in FFT prediction is that, even though FFT is a complex trait58,59, additive genetic variance accounted for a large proportion of its genetic variance. If the relationship between the predictors and the response is close to linear, then a linear approach will probably work well and be superior to a method like a regression tree that does not exploit this linear structure22,60.
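The argument about linear structure can be illustrated with a small sketch in Python, assuming a single predictor, a simulated linear signal with Gaussian noise, and a regression tree limited to depth 3. The data, the depth, and all function names are hypothetical. A deeper tree would narrow the gap, but it would still approximate the straight line with a step function.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sse(ys):
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def fit_tree(xs, ys, depth):
    """Regression tree on one predictor; a leaf is a float, a split is (threshold, left, right)."""
    values = sorted(set(xs))
    if depth == 0 or len(values) < 2:
        return sum(ys) / len(ys)
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
            fit_tree([x for x, _ in right], [y for _, y in right], depth - 1))

def predict_tree(node, x):
    # Walk down the splits until a leaf (a float) is reached.
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x <= t else right
    return node

# Simulated linear signal (assumption): y = 1 + 0.5 x + noise.
rng = random.Random(1)
xs = list(range(40))
ys = [1.0 + 0.5 * x + rng.gauss(0, 0.2) for x in xs]
train_x = [x for x in xs if x % 2 == 0]
train_y = [y for x, y in zip(xs, ys) if x % 2 == 0]
test_x = [x for x in xs if x % 2 == 1]
test_y = [y for x, y in zip(xs, ys) if x % 2 == 1]

intercept, slope = fit_line(train_x, train_y)
tree = fit_tree(train_x, train_y, depth=3)

linear_mse = sum((intercept + slope * x - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_y)
tree_mse = sum((predict_tree(tree, x) - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_y)
```

On data of this kind the linear fit recovers the slope almost exactly, while the tree can only predict one constant per terminal node, which is the behaviour the paragraph above attributes to regression trees when the underlying relationship is close to linear.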

In the context of a hybrid breeding program, identifying high-performance genotypes is essential61. However, extensive field trials are needed to identify the best hybrids in the target environment. These trials require resources in terms of people, equipment, and area to carry out the phenotyping of the plants. Furthermore, most crosses are discarded after evaluation due to poor overall performance62. Associated with this is the trend of stagnating or rising costs of field evaluations, unlike genotyping costs, which tend to decrease63. In this sense, the use of genomic information in multiple-environment trials is an alternative to traditional field trials, as it reduces phenotyping costs and increases the number of hybrid combinations evaluated by predicting the genetic value of individuals that were not evaluated in the field11,64.

Based on the results of this study, a practical application of cost reduction and the efficiency of genomic selection in a breeding program can be demonstrated10. Assuming the cost of an experimental maize trial is $17.00 per plot65 and the cost of standard genotyping-by-sequencing (GBS) is $25.38 per parental line66, the total budget needed for a breeding program would be $108,120.00 for phenotyping the 265 single cross hybrids in eight experimental trials using three replications, and $5,076.00 for genotyping the 188 parental lines and the two testers. Using the GBLUP-AD model, it was shown that the predictive abilities for the untested single cross hybrids averaged greater than 0.40 for GY under WW conditions, with an imbalance of 25% of hybrids randomly missing in each environment. Consequently, the cost reduction for the breeding program would be approximately 25%, or $27,030.00, compared to phenotyping alone, which would cover the costs of genotyping the lines ($5,076.00).
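The budget figures above can be reproduced with a short calculation in Python. This is a sketch under the assumptions stated in the text (one plot per hybrid, trial, and replication; savings proportional to the fraction of hybrids left unphenotyped). The variable names are hypothetical, and the genotyping total is taken as reported rather than recomputed from the unit price.

```python
# Inputs taken from the text.
PLOT_COST = 17.00          # USD per plot
N_HYBRIDS = 265            # single cross hybrids
N_TRIALS = 8               # experimental trials
N_REPS = 3                 # replications per trial
MISSING_FRACTION = 0.25    # hybrids randomly missing in each environment
GENOTYPING_TOTAL = 5076.00 # USD, as reported in the text (not recomputed)

# Full phenotyping: every hybrid in every trial and replication.
n_plots = N_HYBRIDS * N_TRIALS * N_REPS
phenotyping_cost = n_plots * PLOT_COST

# Saving when 25% of the plots are not planted and are predicted instead.
phenotyping_saving = phenotyping_cost * MISSING_FRACTION

# Saving remaining after paying for genotyping (simple subtraction, an added illustration).
net_saving = phenotyping_saving - GENOTYPING_TOTAL

print(f"plots: {n_plots}")
print(f"phenotyping cost: ${phenotyping_cost:,.2f}")
print(f"phenotyping saving: ${phenotyping_saving:,.2f}")
print(f"net saving after genotyping: ${net_saving:,.2f}")
```

Under these assumptions the calculation yields 6,360 plots, a phenotyping budget of $108,120.00 and a saving of $27,030.00, in agreement with the values given in the text.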

The models proposed in this study can be applied to other crops that use hybrids to exploit heterosis, and they can be expanded to include environmental variables to predict non-evaluated environments. In addition, this study demonstrates the application of machine learning methodologies in multiple-environment trials for predicting hybrid performance, comparing their performance with GBLUP with non-additive effects, thus highlighting the potential of both methodologies.


Genomic prediction for untested maize single cross hybrids was applied in MET using both statistical and machine learning approaches. The results demonstrate that both methodologies are efficient and can be valuable tools in maize breeding programs to accurately predict the performance of hybrids in specific environments. The choice of the best methodology depends on the specific case. To optimize the predictive ability of GBLUP, it is crucial to accurately model the variance components. On the other hand, machine learning methodologies have the advantage of capturing non-additive effects without making any assumptions at the outset of the model. We found that predicting the performance of new hybrids that were not evaluated in any field trials was more challenging than predicting hybrids in unbalanced trials. However, regardless of the methodology used, environments with the lowest heritability showed the lowest predictive abilities, underscoring the importance of conducting well-designed and properly replicated experiments.