Genomic prediction in multi-environment trials in maize using statistical and machine learning methods

Barreto, Cynthia Aparecida Valiati; das Graças Dias, Kaio Olimpio; de Sousa, Ithalo Coelho; Azevedo, Camila Ferreira; Nascimento, Ana Carolina Campana; Guimarães, Lauro José Moreira; Guimarães, Claudia Teixeira; Pastina, Maria Marta; Nascimento, Moysés

doi:10.1038/s41598-024-51792-3

Download PDF

Article
Open access
Published: 11 January 2024

Genomic prediction in multi-environment trials in maize using statistical and machine learning methods

Cynthia Aparecida Valiati Barreto¹,
Kaio Olimpio das Graças Dias²,
Ithalo Coelho de Sousa³,
Camila Ferreira Azevedo¹,
Ana Carolina Campana Nascimento¹,
Lauro José Moreira Guimarães⁴,
Claudia Teixeira Guimarães⁴,
Maria Marta Pastina⁴ &
…
Moysés Nascimento¹

Scientific Reports volume 14, Article number: 1062 (2024) Cite this article

1807 Accesses
Metrics details

Subjects

Abstract

In the context of multi-environment trials (MET), genomic prediction is proposed as a tool that allows the prediction of the phenotype of single cross hybrids that were not tested in field trials. This approach saves time and costs compared to traditional breeding methods. Thus, this study aimed to evaluate the genomic prediction of single cross maize hybrids not tested in MET, grain yield and female flowering time. We also aimed to propose an application of machine learning methodologies in MET in the prediction of hybrids and compare their performance with Genomic best linear unbiased prediction (GBLUP) with non-additive effects. Our results highlight that both methodologies are efficient and can be used in maize breeding programs to accurately predict the performance of hybrids in specific environments. The best methodology is case-dependent, specifically, to explore the potential of GBLUP, it is important to perform accurate modeling of the variance components to optimize the prediction of new hybrids. On the other hand, machine learning methodologies can capture non-additive effects without making any assumptions at the outset of the model. Overall, predicting the performance of new hybrids that were not evaluated in any field trials was more challenging than predicting hybrids in sparse test designs.

Genomic prediction applied to multiple traits and environments in second season maize hybrids

Article 29 May 2020

Leveraging data from the Genomes-to-Fields Initiative to investigate genotype-by-environment interactions in maize in North America

Article Open access 30 October 2023

Weighted kernels improve multi-environment genomic prediction

Article Open access 15 December 2022

Introduction

Maize (Zea mays) has emerged as an important crop for food, feed production, and various industrial applications, providing livelihoods for millions of people around the world^1,2. However, its production is affected by several factors, with drought being one of the most common causes of agricultural shortages in rainfed systems³. This fact, combined with the high demand for this crop and the prospect of a worldwide growth of more than 2 billion people over the next 20 years⁴, makes it necessary to cultivate increasingly productive crops, as well as more adapted to climate change and also to different planting regions, such tropical conditions.

Cultivars can exhibit differentiated phenotypic responses between environments, and it is possible that a genotype may perform well in one environment but not in another^5,6,7. To address this, breeders must submit the developed hybrids to multiple environment trials (MET). In MET, the main objectives are to study the interaction between genotypes and environments and to evaluate genotypic overall performance and stability⁸. However, MET phenotyping faces challenges such as the limited seeds availability, a high number of genotypes to be tested in the preliminary trials, and the associated costs, resulting in unbalanced experimental designs in different environments^9,10. In sparse designs, where hybrids are not evaluated in all environments, accurately selecting superior hybrids for the next cycle can be difficult, as some hybrids may not be stable in many environments, and other genotypes that are discarded may outperform in untested environments.

To address these challenges, genomic prediction (GP) is proposed as a tool to predict the genetic value of individuals that were not evaluated in the field^11,12. Several GP methods have been proposed, with Genomic Best Linear Unbiased Predictor (GBLUP) being one of the most commonly used methods. In the context of MET, predicting the genetic value of individuals not observed in specific environments has led to the development of several models, including interaction model of genotypes by environments, environmental covariates, and additive and dominance effects^{9,13,14,15,16}.

Recently, machine learning methodologies have gained attention due to their ability to recognize complex interaction structures in data sets¹⁷. Machine learning algorithms approximate the mapping function linking input variables (e.g., phenotypic trait) to output variables from the training datasets without making a priori assumptions about data distribution or the genotype–phenotype relationship¹⁸. This flexibility allows these methods to capture more complex genetic architectures in prediction models.

Among the machine learning methodologies used in genomic prediction, decision trees and their refinements (such as bootstrap aggregation (bagging), random forest, and boosting) stand out, as they stratify the predictor space into many sub-regions^19,20. These refinements aim to build more accurate prediction models; for example, bagging (Bag) reduces the variance observed in decision trees, random forest (RF) improves accuracy by avoiding high tree correlation²¹, and boosting (Boost) builds trees sequentially using information from previously built trees²².

Studies using simulated and real data have concluded that tree-based machine learning tools can serve as an alternative to traditional techniques for genomic prediction^23,24,25,26. For instance, Sousa et al.²⁷, evaluating genomic prediction for resistance to rust disease in Coffea arabica, observed that Bag, RF, and Boost showed superior predictive abilities to Generalized Bayesian Linear Regression. Westhues et al.²⁸, using genomic and environmental variables, found that machine learning models can provide similar or slightly superior predictive abilities to GBLUP models for traits strongly influenced by environmental factors. Despite the potential of tree-based machine learning, there are still few studies that have evaluated MET data, and these methodologies could prove beneficial in such cases.

Therefore, our study aimed to evaluate and to compare the efficiency of statistical methodologies (GBLUP) and machine learning (Bag, RF, and Boost) for genomic prediction for single cross hybrid evaluated for drought tolerance traits in MET. We considered two different prediction scenarios, mimicking two situations that plant breeders may encounter: (i) predicting the performance of newly developed single hybrids for which there are no existing phenotypic records; (ii) predicting the performance of single cross hybrids in sparse design trials, where some hybrids were evaluated in some environments, but not in others.

Material and methods

Phenotypic data

The data are composed of 265 single cross hybrids from the maize breeding program of Embrapa Maize and Sorghum evaluated in eight combinations of trials/locations/years under irrigated trials (WW) and water stress (WS) conditions at two locations in Brazil (Janaúba—Minas Gerais and Teresina—Piauí) over two years (2010 and 2011). The hybrids were obtained from crosses between 188 inbred lines and two testers. The inbred lines belong to heterotic groups: dent (85 inbred lines), flint (86 inbred lines), and an additional group, referred to as group C (17 inbred lines), which is unrelated to the dent and flint origins. The two testers are inbred lines belonging to the flint (L3) and dent (L228-3) groups. Among the inbred lines, 120 were crossed with both testers, 52 were crossed with the L228-3 tester only, and 16 lines were crossed with the L3 tester only. Silva et al. (2020) evaluated the genetic diversity and heterotic groups in the same database. These authors showed the existence of subgroups within each heterotic group. Therefore, once these groups were not genetically well defined and the breeding program from Embrapa Maize was in the beginning, the effect of allelic substitution in both groups are assumed to be the same. More details on the experimental design and procedures can be found in Dias et al.^13,30.

The experiment originally included 308 entries, but hybrids that were not present in all environments were also removed to evaluate the genomic prediction within each environment, resulting in a total of 265 hybrids for analysis. Each trial consisted of 308 maize single cross hybrids, randomly divided into six sets: sets 1–3 for crosses with L3 (61, 61, and 14 hybrids each), and sets 4–6 for crosses with L228-3 (80, 77, and 15 hybrids each). Four checks (commercial maize cultivars) were included in each set, and the experiment was designed in completely randomized blocks. Between trials, hybrids within each set remained the same, but hybrids and checks were randomly allocated into groups of plots within each set. This allocation varied between replicates of sets and between trials. The WS trials had three replications, except for the set containing 15 hybrids and the trials evaluated in 2010, which had two replications. All WW trials, except for the trial in 2011, had two replicates.

Two agronomic traits related to drought tolerance were analyzed: grain yield (GY) and female flowering time (FFT). GY was determined by weighing all grains in each plot, adjusted for 13% grain moisture, and converted to tons per hectare (t/ha), accounting for differences in plot sizes across trials. FFT was measured as the number of days from sowing until the stigmas appeared in 50% of the plants. A summary of means, standard deviations, and ranges of both evaluated traits are available in Table 1.

Table 1 Summary of means (Mean), standard deviations (SD), minimum (Min), and maximum (Max) for grain yield (GY) and female flowering time (FFT) obtained under irrigated (WW) and water stress (WS) conditions, evaluated in the years 2010 and 2011, at the locations of Janaúba (J) and Teresina (T).

Full size table

To conduct the analyses, hybrids considered as outliers were removed (i.e., hybrids that presented phenotypic values greater than 1.5 × interquartile range above the third quartile or below the first quartile) for the GY and FFT traits. The variations in predictive abilities among hybrids of T2, T1, and T0 are widely recognized³¹. However, the primary aim of our study was to compare different prediction methodologies in MET assays. In this study, there were 240 T2 hybrids and 68 T1 hybrids, with T2 hybrids had both parents evaluated in different hybrid combinations, while hybrids being single-cross hybrids sharing one parent with the tested hybrids. Given the realistic nature of our scenario, we have a limited and imbalanced distribution of these hybrid groups, making a fair comparison challenging. Consequently, we opted to construct a training set comprising T2 and T0 hybrids.

Statistical analysis of phenotypic data

To correct the phenotypic values for experimental design effects, each trial (WW and WS) and environment were analyzed independently to obtain the Best Linear Unbiased Estimator (eBLUEs) for each hybrid, for the two traits evaluated. The estimates were obtained based on the following model:

$${\varvec{y}} = 1\mu + \user2{ X}_{1} {\varvec{r}} + {\varvec{X}}_{2} {\varvec{s}} + {\varvec{X}}_{3} {\varvec{h}} + {\varvec{e}}$$

(1)

where $\user2{y }\left( {n \times 1} \right)$ is the phenotype vector for $f$ replicates, $t$ sets of $p$ hybrids, and $n$ is the number of observations; $\mu$ is the mean; $\user2{r }\left( {f \times 1} \right)$ is the fixed effect vector of the replicates; $\user2{s }\left( {t \times 1} \right)$ is the fixed effect vector of the sets; ${\varvec{h}}$ $\left( {p \times 1} \right)$ is the fixed effect vector of the hybrids; and $\user2{e }\left( {k \times 1} \right)$ is the residue vector, with $\user2{ e} \sim ,MVN\left( {0,{\varvec{I}}\sigma_{e}^{2} } \right)$, where ${\varvec{I}}$ is an identity matrix of corresponding order, and $\sigma_{e}^{2}$ the residual variance. ${\varvec{X}}_{{1\user2{ }}} \left( {k \times f} \right)$, ${\varvec{X}}_{{2\user2{ }}} \left( {k \times t} \right)$ e ${\varvec{X}}_{{3\user2{ }}} \left( {k \times p} \right)$ represents incidence matrices for their respective effects. The eBLUES of each environment were used in further analyses.

Genotypic data

A total of 57,294 Single Nucleotide Polymorphisms (SNPs) markers were obtained from 188 inbred lines, and two testers used as parents of the 265 single cross hybrids. The genotyping by sequencing (GBS) strategyare detailed in Dias et al.¹³. For the quality control, SNPs were discarded if: the minor allele frequency was smaller than 5%, more than 20% of missing genotypes were found, and/or there were more than 5% of heterozygous genotypes. After filtering, missing data were imputed using NPUTE. Then, for each SNP, the genotypes of the hybrids were inferred based on the genotype of their parents (inbred line and tester). The number of SNPs per chromosome ranged from 3121 (chromosome 10) to 7705 (chromosome 1), totalizing 47,127 markers.

Genomic relationship matrix

The additive and dominance genomic relationship matrices were constructed³² based on information from the SNPs using the package AGHmatrix³³, following VanRaden³⁴ and Vitezica et al., respectively.

Genomic prediction

Genomic predictions were performed using the Genomic Best Line Unbiased Prediction (GBLUP) method using the package AsReml v. 4³⁶. Two groups were considered: the first group comprised four environments under WW conditions, and the second included four environments under WS conditions. The linear model is described below:

$$\overline{\user2{y}} = \mu 1 + \user2{ Xb} + {\varvec{Z}}_{1} {\varvec{u}}_{{\varvec{a}}} + {\varvec{Z}}_{2} {\varvec{u}}_{{\varvec{d}}} + {\varvec{e}}$$

(2)

where $\user2{\overline{y} }\left( {pq \times 1} \right)$ is the vector of eBLUES previously estimated for each environment with $p$ hybrids and $q$ environments;$\mu$ is the mean; $\user2{b }\left( {q \times 1} \right)$ is the vector of environmental effects (fixed); ${\varvec{u}}_{{\user2{a }}} \left( {pq \times 1} \right)$ is the vector of individual additive genetic values nested within environments (random), with ${\varvec{u}}_{{\varvec{a}}} \sim MVN\left( {0,\left[ {{\varvec{I}}_{{\varvec{q}}} \sigma_{{u_{a} }}^{2} + \rho_{a} \left( {{\varvec{J}}_{{\varvec{q}}} - {\varvec{I}}_{{\varvec{q}}} } \right)} \right] \otimes {\varvec{A}}} \right)$, where ${\varvec{A}}$ is the genomic relationship matrix between individuals for additive effects, $\rho_{a}$ is the additive genetic correlation coefficient between environments, ${\varvec{I}}_{{\varvec{q}}} \user2{ }\left( {q \times q} \right)$ is an identity matrix, ${\varvec{J}}_{{\varvec{q}}} \user2{ }\left( {q \times q} \right)$ is a matrix of ones, and $\otimes$ denotes the Kronecker product; ${\varvec{u}}_{{\varvec{d}}}$ $\left( {pq \times 1} \right)$ is the vector of individual dominance genetic values nested within environments (random), with ${\varvec{u}}_{{\varvec{d}}} \sim MVN\left( {0,\left[ {{\varvec{I}}_{{\varvec{q}}} \sigma_{{u_{d} }}^{2} + \rho_{d} \left( {{\varvec{J}}_{{\varvec{q}}} - {\varvec{I}}_{{\varvec{q}}} } \right)} \right] \otimes {\varvec{D}}} \right)$, where ${\varvec{D}}$ is the genomic relationship matrix between individuals for dominance effects, $\rho_{{\varvec{d}}}$ is the dominance correlation coefficient between environments; ${\varvec{e}}$ $\left( {pq \times 1} \right)$ is the random residuals vector with ${\varvec{e}}\sim MVN\left( {0,{\varvec{I}}\sigma_{e}^{2} } \right)$. The capital letters $\user2{X }\left( {pq \times q} \right),\user2{ Z}_{1} \left( {pq \times pq} \right)$ and ${\varvec{Z}}_{2} \user2{ }\left( {pq \times pq} \right)$ represent the incidence matrices for their respective effects, $1\user2{ }\left( {pq \times 1} \right)$ is a vector of ones. The (co)variance components were obtained using the residual maximum likelihood method (REML)³⁷.

Two alternative models were also used. The first for genomic prediction retained only additive effects by removing ${\varvec{u}}_{{\varvec{d}}}$ from Eq. (2). The second model was used to estimate the genetic parameters within each environment separately.

The significance of random effects was tested using the Likelihood Ratio Test (LRT)³⁸, given by:

$$LRT = 2*\left( {LogL_{c} - LogL_{r} } \right)$$

(3)

where $LogL_{c}$ is the logarithm of the likelihood function of the complete model (with all effects included), and $LogL_{r}$ is the logarithm of the restricted likelihood function of the reduced model (without the effect under test). Effect significance was tested by LRT using the chi-square (X²) probability density function with a degree of freedom and significance level of 5%³⁹.

The narrow-sense heritability (${ }h^{2}$), the proportion of variance explained by dominance effects ($d^{2}$), and the broad-sense heritability $\left( {H^{2} } \right)$ for each trait were estimated following Falconer and Mackay 1996³⁵.

Machine learning

Similar to the previous topic, the trials were divided between WW and WS conditions, and the potential of regression trees (RT) was explored using the following three algorithms: bagging, random forest, and boosting²². Bagging (Bag) is a methodology that aims to reduce the RT variance²². In other words, it consists of obtaining D samples with available sampling replacement, thus obtaining D models $\hat{f}^{1} \left( x \right), \hat{f}^{2} \left( x \right), \ldots , \hat{f}^{D} \left( x \right)$, and finally use the generated models to obtain an average, given by:

$$\hat{f}_{medio} \left( x \right) = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \hat{f}^{d} \left( x \right)$$

(4)

This decreases the variability obtained in the decision trees. The number of trees used in Bag is not a parameter that will result in overfitting of the model. In practice, a number of trees is used until the error has stabilized²². The number of trees sampled for Bag was set at 500 trees.

Random forest (RF) was proposed by HO⁴⁰ and it is an improvement of Bag to avoid the high correlation of the trees and to improve the accuracy in the selection of individuals. RF changes only the number of predictor variables used in each split. That is, each time a split in a tree is considered, a random sample of $m$ variables is chosen as candidates from the complete set of $p$ variables. Hastie et al.²¹ suggest that the number of predictor variables used in each partition is equal to $m = \frac{p}{3}$ for regression trees. The number of trees for the RF was set at 500.

Boosting uses RT by adjusting the residual of an initial model. The residual is updated with each tree that grows sequentially from the previous tree's residual, and the response variable involves a combination of a large number of trees, such that:

$$\hat{f}\left( x \right) = \mathop \sum \limits_{b = 1}^{B} {\uplambda } \hat{f}^{b} \left( x \right)$$

(5)

The function $\hat{f}\left( . \right)$ refers to the final tree combined with sequentially adjusted trees, and λ is the shrinkage parameter that controls the learning rate of the method. Furthermore, this method needs to be adjusted with several splits in each of the trees. This parameter controls the complexity of the Boost and is known as the depth. For Boosting, the number of trees sampled was 250, with a learning rate of 0.1 and a depth of 3.

To perform hybrid prediction for each environment based on MET dataset, we propose the incorporation of location and year information in which the experiments were carried out as factors in the data input file together with SNPs markers as predictors in machine learning methodologies. As a response variable, the eBLUEs previously estimated by Eq. (1) were used.

For the construction of the bagging and random forest models, the randomForest function from the package randomForest⁴¹ was used. Finally, the package's gbm function gbm⁴² was used for boosting. All analyzes were implemented in the software R⁴³.

Model validation

Genomic predictions were carried out following Burgueño et al.¹⁶, considering two different prediction problems, CV1 and CV2, which simulate two possible scenarios a breeder can face. In CV1, the ability of the algorithms to predict the performance of hybrids that have not yet been evaluated in any field trial was evaluated. Thus, predictions derived from the CV1 scenario are entirely based on phenotypic and genotypic records from other related hybrids. In CV2, the ability of the algorithms to predict the performance of hybrids using data collected in other environments was evaluated. It simulates the prediction problem found in incomplete MET trials. Here, information from related individuals is used, and the prediction can benefit from genetic relationships between hybrids and correlations between environments. Within the CV2 scenario, two different situations of data imbalance were evaluated. In the first, called CV2 (50%), the tested hybrids were not present in half of the environments, while in the second, called CV2 (25%), the tested hybrids were not present in only 25% of the environments. Table 2 provides a hypothetical representation of this CV1, CV2 (50%), and CV2 (25%) validation scheme.

Table 2 Representation of the three scenarios (CV1, CV2-50% and CV2-25%) for four hybrids (Hybrid 1–4) and four environments (J10, J11, T10, T11).

Full size table

To separate the training and validation sets, the k-folds procedure was used, considering $k = 5$. The set of 265 hybrids was divided into five groups, with 80% of the hybrids considered as the training population, and the remaining 20% hybrids considered as the validation population. The hybrids were separated into sets proportionally containing all the crosses performed (Dent × Dent, Dent × Flint, Flint × Flint, C × Dent, C × Flint). The cross-validation process was performed separately for each trait, condition (WS or WW) and scenario (CV1, CV2-50% and CV2-25%) and was repeated five times to assess the predictive ability of the analyses.

The predictive ability within each environment for the conditions (WS and WW) was estimated by the Pearson correlation coefficient⁴⁴ between the corrected phenotypic values (eBLUES) of Eq. (1) for each environment and the GEBVs predicted by each fitted method.

Ethics statement

The authors confirm that all methods were carried out by relevant guidelines in the method section. The authors also confirm that the handling of the plant materials used in the study complies with relevant institutional, national, and international guidelines and legislation.

Statement of handling of plants

The authors confirm that the appropriate permissions and/or licenses for collection of plant or seed specimens are taken.

Results

Variance components and estimation of genetic parameters

Estimates of variance components and genetic parameters for GY and FFT under WW and WS conditions, obtained for the joint analysis with the four environments and analyses within each environment, are shown in Table 3 . For the joint analysis, the heritability estimates for GY and FFT were slightly different from those obtained by Dias et al.¹³ using the same material, since here, a different statistical model was used to estimate the genetic parameters, and hybrids that were not present in all environments were removed.

Table 3 Estimates of variance components and genetic parameters for grain yield (GY) and female flowering time (FFT) were obtained considering the joint analysis for the four evaluated environments and analyses within each environment, for the irrigated (WW) and water stress (WS) conditions.

Full size table

The additive variance found for GY and FFT was greater than the variance due to dominance effects, in both WS and WW conditions. For GY, the variances due to dominance effects represented about 33.3% and 31.0% of the genetic variance in WS and WW conditions, respectively. Lower broad-sense heritability was observed for this trait in WW (0.42) when compared to the WS condition (0.53). As for FFT, the variances due to dominance effects represented about 19.9% and 20.8% of the genetic variance in WS and WW conditions, respectively, and the broad-sense heritabilities for FFT were greater in WW conditions (0.71) than in conditions WS (0.56).

For GY, the narrow-sense heritabilities within environments ranged from 0.30 (T11) to 0.38 (J10) under WS conditions and from 0.20 (T11) to 0.57 (J10) under WW conditions. The proportion of genetic variance explained by dominance deviations ranged from 0.04 (T10) to 0.43 (T11) under WS conditions and from 0.10 (T11) to 0.29 (J11) under WW conditions. The broad-sense heritabilities were lower for the experimental tests that had a lower number of repetitions (2010 under WS conditions, and 2011 under WW conditions).

For FFT, the narrow-sense heritabilities ranged from 0.09 (T10) to 0.80 (J10) under WS conditions and from 0.49 (T10) to 0.74 (J10) under WW conditions. The proportion of genetic variance explained by dominance deviations ranged from 0.01 (T10) to 0.26 (T11) under WS conditions and from 0.07 (J11) to 0.23 (T11) under WW conditions. The broad-sense heritabilities were higher for J10 (0.89 and 0.88) under WS and WW conditions.

The Eq. (2) is an implicit model to perform MET analyses⁴⁵ and provide genetic correlations for additive and dominance effects across environments. This model, reflects on the levels of genotypes-by-environment interaction. Particularly, for GY, the environmental correlations were 0.35 and 0.24 for WS and WW conditions, respectively, indicating an inconsistent ranking of hybrids across environments. As for FFT, the lowest correlation was observed for dominance effects.

Efficiency of prediction methodologies in multi-environment trials

Figures 1 and 2 show the predictive abilities observed in the three scenarios (CV1, CV2-50%, and CV2-25%) for each of the five compared methods: GBLUP additive model (GBLUP-A), GBLUP additive-dominant model (GBLUP-AD), bagging (Bag), random forest (RF) and boosting (Boost). The numerical results of these figures are presented in Supplementary Tables 1, 2, and 3.

For the GBLUP models, the predictive abilities were lower for the CV1 scenario, where the predictions were based only on phenotypic and genotypic records of other related hybrids. However, the predictive ability increased when the predicted hybrid was present in some environments, being intermediate in CV2-50% and higher in CV2-25%. A more notable improvement in predictive abilities was observed when transitioning from CV1 to CV2-50%. Specifically, for GBLUP-A, the average increase was about 107%, 19%, 18%, and 19% for GY-WS, GY-WW, FFT-WS, and FFT-WW, respectively. As for GBLUP-AD, the average increase was 19%, 7%, 12%, and 15% for GY-WS, GY-WW, FFT-WS, and FFT-WW, respectively (Table Sup. 4). For GY, considering only CV1 scenario, the average predictive abilities were higher in WW conditions. For FFT, mean predictive abilities for WW conditions were higher than for WS for all scenarios (Table Sup. 4).

The environments with the highest broad-sense heritabilities also exhibited the highest average predictive abilities across all evaluated methodologies (Table 3, Figs. 1 and 2). The GBLUP-AD model demonstrated superior predictive abilities for the GY and FFT traits in almost all environments and scenarios (CV1, CV2-50%, and CV2-25%). Conversely, the GBLUP-A performed equally or better than the GBLUP-AD model for FFT in environments where the dominance effect was not significant (T10-WS and J11-WW). However, for GY, in environments where the proportion of dominance genetic variance was close to or greater than the additive variance (J11 and T11 under WS), GBLUP-A exhibited lower predictive ability.

Unlike GBLUP, machine learning methodologies did not show a consistent pattern of increased predictive ability with the presence of phenotypic records of the hybrid to be predicted in correlated environments. Overall, an increase in predictive abilities for GY was observed when the scenario changed from CV1 to CV2, under WS conditions, which was not observed in many environments under WW conditions. For example, for RF, the average predictive abilities in the environments increased by approximately 51% for GY in WS conditions, while a decrease of about 10% was observed when the prediction changed from CV1 to CV2-50% for the same trait in WW conditions (Table Sup. 4). As for FFT, Bag drew the most attention, presenting a large standard error in predictive abilities with statistically similar results for CV1, CV2-50% and CV2-25% in WS and WW conditions, in almost all environments.

In environments where dominance effects accounted for a large portion of the genetic variance, the Bag, RF and Boost methodologies showed intermediate results between the two GBLUP models. Among the evaluated machine learning methodologies, Bag and RF produced very similar predictions and Boost presented the lowest predictive ability for GY. As for FFT, machine learning did not show improvement in predictive abilities when compared to GBLUP.

Discussion

We employed both statistical and machine learning methods to evaluate the efficiency of genomic predictions for GY and FFT. These models allow us to leverage information about relationships between hybrids and correlated environments. Three scenarios, CV1, CV2-50% and CV2-25%, were used, each characterized by different degrees of data imbalance. Predicting the performance of new hybrids that have not been evaluated in the field (CV1) is more challenging compared to predicting hybrids that have been evaluated in an unbalanced manner across correlated environments (CV2-50% and CV2-25%)^15,16.

Among the evaluated methodologies, GBLUP stands out as it allows to estimate relationships between individuals based on molecular markers for additive and dominant effects^32,34. This methodology also allows to incorporate variance–covariance structures to handle correlated environments and unbalanced data^10,14. As expected, GBLUP showed the highest predictive abilities in the CV2 validation schemes (CV2-50% and CV2-25%) as the predictions benefited from phenotypic records of the same hybrids in other correlated environments. Similar results have been reported in previous studies using wheat inbred lines and maize single cross hybrids^13,14,15,16.

GBLUP-AD demonstrated greater efficiency to predict hybrids for traits and environments in which the effects due to dominance deviations were significant. As a substantial part of the genetic variance in maize hybrids for GY is attributed to dominance effects⁴⁶, incorporating theses effects in the model significantly improved the prediction of this trait. However, when the dominance effect accounts for a small portion of the genetic variance (as observed for FFT), additive and additive-dominant models tend to show minor improvements in the predictions of the estimated total genetic effects^13,47. This emphasizes the importance of understand the genetic bases of hybrids which can be decomposed into general combining ability (GCA) and specific combining ability (SCA). For GCA, the additive variance is the major component of the variance⁴⁸ and, SCA is largely dependent on genes with dominance or epistatic effects⁴⁹. In this context, gaining a comprehension of the relative magnitudes of GCA and SCA variations is instrumental in guiding the formulation of optimal strategies for hybrid breeding programs, as highlighted in previous research⁵⁰.

One possible reason why tree-based learning models did not benefit from the presence of phenotyped hybrids in correlated environments is related to their shared concept of recursive division⁵¹. These models aim to find decision rules that naturally partition the prediction space to provide an informative and robust hierarchical model^21,52. In this context, it is conceivable that the location/year variables played a crucial role in most of the tree construction scenarios, leading to the split of the same hybrids at the beginning of branching and making it difficult for them to be grouped at the last nodes, which are responsible for the predictive response.

Among the evaluated methodologies, Bag showed, in most environments, the same prediction pattern for both scenarios where the predicted hybrid was absent in all environments and where it was absent only in some environments. Bag is a variation of the decision tree that uses resampling of the original data in subsamples (bootstrap) according to the determined number of trees. This process may lead to high correlation between the generated trees, resulting in the same variable consistently being the most important one²¹. As consequence, the same hybrids in different environments may end up at very distant nodes in the construction of the tree. The RF further reduces the dependence between the decision trees by randomly selecting part of the original data and variables to build the trees^22,51. This allows different variables to have a chance to be at the top in their construction. On the other hand, boost sequentially combines different predictors, fitting a new tree to the residuals of the previous model using a specified loss function (e.g. mean squared error for regression)⁵³. It incorporates automatic indirect selection of markers and is generally recommended for regression analysis^17,20. It is important to note that these variations of decision trees mentioned above are typically used when obtaining accurate predictions is more important than the biological interpretation of the model itself⁵².

One of the advantages of using machine learning methodologies is that they do not require the specification of the trait inheritance, having as an initial hypothesis the possibility of capturing non-additive effects in the genome, which are often not specified in traditional statistical methods²⁵. Possibly tree-based machine learning methodologies managed to capture part of the dominance effect, presenting better results than GBLUP-A in environments where dominance represented a large part of the genetic variance. These methodologies are competitive with statistical models and tend to outperform them when applied to large data sets. However, when applied to small training sets, machine learning is probably not able to capture all non-additive information and linear models may perform better⁵⁴.

Tree-based machine learning methodologies (Bag, RF, and Boost) are considered promising for genomic prediction, especially for traits controlled by a few quantitative trait locus (QTL), capturing non-additive effects in their models²⁵. Among the machine learning methodologies evaluated in this study, Boost is considered the most sensitive methodology for the variation of traits and environments^25,26,55. In these studies, carried out with simulated data, the lower the heritability and the greater the number of QTL that control the trait, the lower the Boost prediction efficiency. This fact may be related to the lower values of predictive ability for GY when compared to other methodologies, since this is a quantitative trait with low heritability, greatly influenced by the environment^56,57.

A possible explanation for the lower values of predictive ability of machine learning methodologies when compared to GBLUP in FFT prediction is that, even though it is a complex trait^58,59, additive genetic variance contributed with a large proportion of this trait. If the relationship between trait and response is close to a linear model, then a linear approach will probably work well and be superior to a method like a regression tree that does not explore this linear structure^22,60.

In the context of a hybrid breeding program, identifying high-performance genotypes is essential⁶¹. However, extensive field trials are needed to identify the best hybrids in the target environment. These trials require resources in terms of people, equipment, and area to carry out the phenotyping of the plants. Furthermore, most crosses are discarded after evaluation due to poor overall performance⁶². Associated with this fact is the trend of stagnating or rising costs of field evaluations, unlike genotyping which tends to decrease⁶³. In this sense, the use of genomic information in multiple-environment trials is an alternative to traditional field trials, as it allows to reduce the phenotyping costs and to increase the number of hybrid combinations evaluated when performing the prediction of the genetic value of individuals that were not evaluated in the field^11,64.

Based on the results of this study, a practical application of cost reduction and the efficiency of genomic selection in a breeding program can be demonstraded¹⁰. Assuming the cost of an experimental maize trial is $17.00 per plot⁶⁵ and the cost of sequencing genotyping (GBS) standard is $25.38 per parental line⁶⁶, the total budget needed for a breeding program would be $108,120.00 for phenotyping the 265 single cross hybrids in eight experimental trials using three replications, and $5,076.00 for genotyping the 188 parental lines and the two testers. Using the GBLUP-AD model, it was shown that the predictive abilities for the untested single hybrids averaged greater than 0.40 for GY under WW conditions, with an imbalance of 25% of hybrids randomly missing in each environment. Consequently, the cost reduction for the breeding program would be approximately 25% or $27,030.00 compared to phenotyping alone, which would cover the costs of genotyping the lines ($5,076.00).

The models proposed in this study can be applied to other crops that use hybrids to explore heterosis, and they can be expanded to include environmental variables to predict non-evaluated environments. In addition, this study demonstrates the application of machine learning methodologies in tests of multiple environments for predicting of hybrids, comparing their performance with GBLUP with non-additive effects, thus highlighting the potential of both methodologies.

Conclusion

Genomic prediction for untested maize single cross hybrids using both statistical and machine learning approaches were applied in MET assays. The results demonstrate that both methodologies are efficient and can be valuable tools in maize breeding programs to accurately predict the performance of hybrids in specific environments. The choice of the best methodology depends on the specific case. To optimize the predictive ability of GBLUP, it is crucial to accurately model the variance components. On the other hand, machine learning methodologies have the advantage of capturing non-additive effects without making any assumptions at the outset of the model. We found that predicting the performance of new hybrids that were not evaluated in any field trials was more challenging than predicting hybrids in unbalanced trials. However, regardless of the methodology used, environments with the lowest heritability showed the lowest predictive abilities, underscoring the importance of conducting well-designed and properly replicated experiments.

Data availability

Data available in the Dryad Digital Repository: https://doi.org/https://doi.org/10.5061/dryad.ps22r.

References

Hossain, F. et al. Molecular breeding for increasing nutrition quality in maize: recent progress. In Molecular Breeding in Wheat, Maize and Sorghum: Strategies for Improving abiotic Stress Tolerance and Yield 360–379 (CABI, 2021). https://doi.org/10.1079/9781789245431.0021.
Hossain, F. et al. Maize Breeding. in Fundamentals of Field Crop Breeding 221–258 (Springer Nature Singapore, 2022). https://doi.org/10.1007/978-981-16-9257-4_4.
Lobell, D. B. et al. Greater sensitivity to drought accompanies maize yield increase in the U.S. midwest. Science 344, 516–519 (2014).
Article ADS CAS PubMed Google Scholar
ONU. World Population Prospects 2022. https://population.un.org/wpp/Graphs/Probabilistic/POP/TOT/900 (2022).
Cruz, C. D., Regazzi, A. J. & Carneiro, P. C. S. Modelos biométricos aplicados ao melhoramento. UFV, Viçosa (2012).
Malosetti, M., Ribaut, J.-M. & van Eeuwijk, F. A. The statistical analysis of multi-environment data: Modeling genotype-by-environment interaction and its genetic basis. Front. Physiol. 4, 44 (2013).
Article CAS PubMed PubMed Central Google Scholar
Crossa, J. Statistical Analyses of Multilocation Trials. in 55–85 (1990). https://doi.org/10.1016/S0065-2113(08)60818-4.
Burgueño, J., Crossa, J., Cotes, J. M., Vicente, F. S. & Das, B. Prediction assessment of linear mixed models for multienvironment trials. Crop Sci. 51, 944–954 (2011).
Article Google Scholar
Jarquin, D. et al. Genomic prediction enhanced sparse testing for multi-environment trials. G3 Genes Genomes Genetics 10, 2725–2739 (2020).
Article CAS PubMed PubMed Central Google Scholar
Krause, M. D. et al. Boosting predictive ability of tropical maize hybrids via genotype-by-environment interaction under multivariate GBLUP models. Crop Sci. 60, 3049–3065 (2020).
Article Google Scholar
Bernardo, R. Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci. 34, 20–25 (1994).
Article Google Scholar
Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
Article CAS PubMed PubMed Central Google Scholar
Dias, K. O. D. G. et al. Improving accuracies of genomic predictions for drought tolerance in maize by joint modeling of additive and dominance effects in multi-environment trials. Heredity 121, 24–37 (2018).
Article CAS PubMed PubMed Central Google Scholar
Jarquin, D. et al. Increasing genomic‐enabled prediction accuracy by modeling genotype × environment interactions in Kansas wheat. Plant Genome 10, (2017).
Jarquin, D. et al. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127, 595–607 (2014).
Article PubMed Google Scholar
Burgueño, J., Campos, G., Weigel, K. & Crossa, J. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707–719 (2012).
Article Google Scholar
González-Recio, O., Rosa, G. J. M. & Gianola, D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest. Sci. 166, 217–231 (2014).
Article Google Scholar
Zhou, Z.-H. Machine Learning (Springer, 2021).
Book Google Scholar
Jannink, J.-L.J.-L., Lorenz, A. J. & Iwata, H. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9, 166–177 (2010).
Article CAS PubMed Google Scholar
Ogutu, J. O., Piepho, H.-P. & Schulz-Streeck, T. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proc. 5, S11 (2011).
Article PubMed PubMed Central Google Scholar
Hastie, T., Tibshirani, R., Friedman, J., Cruz, C. D. & Nascimento, M. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
Book Google Scholar
Gareth, J., Daniela, W., Trevor, H. & Robert, T. An Introduction to Statistical Learning: with Applications in R (Spinger, 2013).
Google Scholar
Sarkar, R. K., Rao, A. R., Meher, P. K., Nepolean, T. & Mohapatra, T. Evaluation of random forest regression for prediction of breeding value from genomewide SNPs. J. Genet. 94, 187–192 (2015).
Article PubMed Google Scholar
Farooq, M., van Dijk, A. D. J., Nijveen, H., Mansoor, S. & de Ridder, D. Genomic prediction in plants: Opportunities for ensemble machine learning based approaches. F1000Research 11, 802 (2022).
Article PubMed Google Scholar
Barbosa, I. P. et al. Genome-enabled prediction through machine learning methods considering different levels of trait complexity. Crop Sci. 61, 1890–1902 (2021).
Article CAS Google Scholar
da Costa, W. G. et al. Genomic prediction through machine learning and neural networks for traits with epistasis. Comput. Struct. Biotechnol. J. 20, 5490–5499 (2022).
Article CAS PubMed PubMed Central Google Scholar
de Sousa, I. C. et al. Genomic prediction of leaf rust resistance to Arabica coffee using machine learning algorithms. Sci. Agric. 78, e20200021 (2021).
Article Google Scholar
Westhues, C. C. et al. Prediction of maize phenotypic traits with genomic and environmental predictors using gradient boosting frameworks. Front. Plant Sci. 12, 699589 (2021).
Article PubMed PubMed Central Google Scholar
Silva, K. J. et al. High-density SNP-based genetic diversity and heterotic patterns of tropical maize breeding lines. Crop Sci. 60, 779–787 (2020).
Article CAS Google Scholar
Dias, K. O. D. G. et al. Estimating genotype × environment interaction for and genetic correlations among drought tolerance traits in maize via factor analytic multiplicative mixed models. Crop Sci. 58, 72–83 (2018).
Article Google Scholar
Technow, F. et al. Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197, 1343–1355 (2014).
Article PubMed PubMed Central Google Scholar
Vitezica, Z. G., Varona, L. & Legarra, A. On the additive and dominant variance and covariance of individuals within the genomic selection scope. Genetics 195, 1223–1230 (2013).
Article PubMed PubMed Central Google Scholar
Amadeu, R. R. et al. AGHmatrix: R package to construct relationship matrices for autotetraploid and diploid species: A blueberry example. Plant Genome https://doi.org/10.3835/plantgenome2016.01.0009 (2016).
Article PubMed Google Scholar
VanRaden, P. M. Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423 (2008).
Article CAS PubMed Google Scholar
Falconer, D. S. & Mackay, T. F. C. Introduction to quantitative genetics. Essex. UK Longman Gr. (1996).
Gilmour, A. R., Gogel, B. J., Cullis, B. R., Welham, S. J. & Thompson, R. ASReml User Guide Release 4.2 Functional Specification. VSN Int. Ltd (2021).
Corbeil, R. R. & Searle, S. R. Restricted maximum likelihood (REML) estimation of variance components in the mixed model. Technometrics 18, 31 (1976).
Article MathSciNet Google Scholar
Wilks, S. S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9, 60–62 (1938).
Article Google Scholar
Dobson, A. & Barnett, A. An Introduction to Generalized Linear Models (Chapman and Hall/CRC, 2008). https://doi.org/10.1201/9780367807849.
Book Google Scholar
Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition vol. 1 278–282 (IEEE, 1995).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Article Google Scholar
Greenwell, B., Boehmke, B., Cunningham, J. & GBM, D. gbm: Generalized boosted regression models. R package version 2.1. 5. Website https//cran. r-project. org/package= gbm [accessed 12 January 2020] (2019).
R Core Team. R: A language and environment for statistical computing. at (2021).
Resende, M. D. V. de, Silva, F. F. e & Azevedo, C. F. Estatística matemática, biométrica e computacional: Modelos mistos, multivariados, categóricos e generalizados (REML/BLUP), inferência bayesiana, regressão aleatória, seleção genômica, QTL-GWAS, estatística espacial e temporal, competição, sobrevivência. Viçosa Ed. UFV 1–881 (2014).
Gezan, S. A., de Carvalho, M. P. & Sherrill, J. Statistical methods to explore genotype-by-environment interaction for loblolly pine clonal trials. Tree Genet. Genomes 13, 1 (2017).
Article Google Scholar
Fernandes, S. B., Dias, K. O. G., Ferreira, D. F. & Brown, P. J. Efficiency of multi-trait, indirect, and trait-assisted genomic selection for improvement of biomass sorghum. Theor. Appl. Genet. 131, 747–755 (2018).
Article CAS PubMed Google Scholar
Nishio, M. & Satoh, M. Including dominance effects in the genomic BLUP method for genomic evaluation. PLoS One 9, e85792 (2014).
Article ADS PubMed PubMed Central Google Scholar
Reif, J. C., Gumpert, F.-M., Fischer, S. & Melchinger, A. E. Impact of interpopulation divergence on additive and dominance variance in hybrid populations. Genetics 176, 1931–1934 (2007).
Article CAS PubMed PubMed Central Google Scholar
Sprague, G. F. & Tatum, L. A. General vs. specific combining ability in single crosses of corn. J. Am. Soc. Agron. (1942).
Giraud, H. et al. Reciprocal genetics: Identifying QTL for general and specific combining abilities in hybrids between multiparental populations from two maize (Zea mays L.) heterotic groups. Genetics 207, 1167–1180 (2017).
Article CAS PubMed PubMed Central Google Scholar
Hofmarcher, P. & Grün, B. Macroeconomic Forecasting in the Era of Big Data (Springer, 2020).
Google Scholar
Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. An introduction to decision tree modeling. J. Chemom. 18, 275–285 (2004).
Article CAS Google Scholar
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189 (2001).
Article MathSciNet Google Scholar
Westhues, C. C., Simianer, H. & Beissinger, T. M. learnMET: an R package to apply machine learning methods for genomic prediction using multi-environment trial data. G3 Fenes Genomes Genetics 12, jkac226 (2022).
Article Google Scholar
Ghafouri-Kesbi, F., Rahimi-Mianji, G., Honarvar, M. & Nejati-Javaremi, A. Predictive ability of random forests, boosting, support vector machines and genomic best linear unbiased prediction in different scenarios of genomic evaluation. Anim. Prod. Sci. 57, 229 (2017).
Article Google Scholar
Zhang, X. et al. Genetic architecture of maize yield traits dissected by QTL mapping and GWAS in maize. Crop J. 10, 436–446 (2022).
Article Google Scholar
Zhang, X. et al. A combination of linkage mapping and GWAS brings new elements on the genetic basis of yield-related traits in maize across multiple environments. Theor. Appl. Genet. 133, 2881–2895 (2020).
Article ADS CAS PubMed Google Scholar
Steinhoff, J. et al. Detection of QTL for flowering time in multiple families of elite maize. Theor. Appl. Genet. 125, 1539–1551 (2012).
Article PubMed Google Scholar
Buckler, E. S. et al. The genetic architecture of maize flowering time. Science 325, 714–718 (2009).
Article ADS CAS PubMed Google Scholar
Abdollahi-Arpanahi, R., Gianola, D. & Peñagaricano, F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet. Sel. Evol. 52, 12 (2020).
Article PubMed PubMed Central Google Scholar
Technow, F., Riedelsheimer, C., Schrag, T. A. & Melchinger, A. E. Genomic prediction of hybrid performance in maize with models incorporating dominance and population specific marker effects. Theor. Appl. Genet. 125, 1181–1194 (2012).
Article PubMed Google Scholar
Windhausen, V. S. et al. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 Genes Genomes Genet. 2, 1427–1436 (2012).
Article Google Scholar
Krchov, L.-M. & Bernardo, R. Relative efficiency of genomewide selection for testcross performance of doubled haploid lines in a maize breeding program. Crop Sci. 55, 2091–2099 (2015).
Article CAS Google Scholar
Massman, J. M., Gordillo, A., Lorenzana, R. E. & Bernardo, R. Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126, 13–22 (2013).
Article PubMed Google Scholar
Tech Services. Pricing brochure TSI 2023 test sites. Bluffton IN:TechServices https://techservicespro.com/test-locations/ (2023).
University of Minnesota. Genotyping-by-sequencing (Pricing). Genomics Center https://genomics.umn.edu/service/standard-genotyping-sequencing (2023).

Download references

Acknowledgements

This work was financially supported by the Foundation for Research Support of the state of Minas Gerais (FAPEMIG), by the National Council of Scientific and Technological Development (CNPq), and Coordination for the Improvement of Higher Education Personnel (CAPES - code 001).

Author information

Authors and Affiliations

Department of Statistics, Universidade Federal de Viçosa, Viçosa, Minas Gerais, Brazil
Cynthia Aparecida Valiati Barreto, Camila Ferreira Azevedo, Ana Carolina Campana Nascimento & Moysés Nascimento
Department of General Biology, Universidade Federal de Viçosa, Viçosa, Minas Gerais, Brazil
Kaio Olimpio das Graças Dias
Department of Mathematics and Statistics, Universidade Federal de Rondônia, Ji-Paraná, RO, Brazil
Ithalo Coelho de Sousa
Embrapa Maize and Sorghum, Sete Lagoas, Minas Gerais, Brazil
Lauro José Moreira Guimarães, Claudia Teixeira Guimarães & Maria Marta Pastina

Authors

Cynthia Aparecida Valiati Barreto
View author publications
You can also search for this author in PubMed Google Scholar
Kaio Olimpio das Graças Dias
View author publications
You can also search for this author in PubMed Google Scholar
Ithalo Coelho de Sousa
View author publications
You can also search for this author in PubMed Google Scholar
Camila Ferreira Azevedo
View author publications
You can also search for this author in PubMed Google Scholar
Ana Carolina Campana Nascimento
View author publications
You can also search for this author in PubMed Google Scholar
Lauro José Moreira Guimarães
View author publications
You can also search for this author in PubMed Google Scholar
Claudia Teixeira Guimarães
View author publications
You can also search for this author in PubMed Google Scholar
Maria Marta Pastina
View author publications
You can also search for this author in PubMed Google Scholar
Moysés Nascimento
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization: C.A.V.B., M.N., and K.O.G.D.; methodology: C.A.V.B. and I.C.d.S.; validation: C.A.V.B.; data curation: M.M.P.; Writing original draft: C.A.V.B.; Writing review and editing: C.A.V.B., M.N., K.O.G.D., C.F.A., A.C.C.N., M.M.P., L.J.M.G., and C.T.G.; Supervision: M.N. and K.O.G.D.; Project administration; M.N.; funding acquisition: M.N.

Corresponding author

Correspondence to Moysés Nascimento.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Tables.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Barreto, C.A.V., das Graças Dias, K.O., de Sousa, I.C. et al. Genomic prediction in multi-environment trials in maize using statistical and machine learning methods. Sci Rep 14, 1062 (2024). https://doi.org/10.1038/s41598-024-51792-3

Download citation

Received: 04 September 2023
Accepted: 09 January 2024
Published: 11 January 2024
DOI: https://doi.org/10.1038/s41598-024-51792-3

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Genomic prediction applied to multiple traits and environments in second season maize hybrids

Leveraging data from the Genomes-to-Fields Initiative to investigate genotype-by-environment interactions in maize in North America

Weighted kernels improve multi-environment genomic prediction

Introduction

Material and methods

Phenotypic data

Statistical analysis of phenotypic data

Genotypic data

Genomic relationship matrix

Genomic prediction

Machine learning

Model validation

Ethics statement

Statement of handling of plants

Results

Variance components and estimation of genetic parameters

Efficiency of prediction methodologies in multi-environment trials

Discussion

Conclusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's note

Supplementary Information

Supplementary Tables.

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links