A statistical package for evaluation of hybrid performance in plant breeding via genomic selection

Chen, Szu-Ping; Tung, Chih-Wei; Wang, Pei-Hsien; Liao, Chen-Tuo

doi:10.1038/s41598-023-39434-6


Article
Open access
Published: 27 July 2023


Scientific Reports volume 13, Article number: 12204 (2023)


Abstract

Hybrid breeding exploits heterosis, which can improve the yield and quality of a crop. Genomic selection (GS) is a promising approach for the selection of quantitative traits in plant breeding. The main objectives of this study are to (i) propose a GS-based approach to identify potential parental lines and superior hybrid combinations from a breeding population composed of hybrids produced by a half diallel mating design; and (ii) develop a software package for users to carry out the proposed approach. An R package, designated EHPGS, was developed to facilitate the use of the genomic best linear unbiased prediction (GBLUP) model considering additive plus dominance marker effects for hybrid performance evaluation. The R package contains a Bayesian statistical algorithm for calculating genomic estimated breeding values (GEBVs), GEBV-based specific combining ability, general combining ability, mid-parent heterosis, and better-parent heterosis. Three datasets published in the literature, covering pumpkin (Cucurbita maxima), maize (Zea mays), and wheat (Triticum aestivum L.), were reanalyzed to illustrate the use of EHPGS.


Introduction

Hybrid plant breeding exploits heterosis and can boost yield stability, allow the combination of dominant major genes, and offer a built-in plant variety protection system¹ (https://www.pnas.org/doi/full/10.1073/pnas.1514547112). Hybrids are used in several field, vegetable, and flower crops, including maize, sorghum, and sunflower. Interestingly, hybrid rice has been adopted and hybrid wheat research is drawing new attention². Therefore, it is important and challenging to develop a highly efficient approach for identifying potential parental lines and superior hybrid combinations from many possible candidates. To create such an approach, we constructed a prediction model based on genomic selection (GS)³ to screen for the desired individuals. To facilitate practical applications, we also developed a software package to implement our proposed GS-based approach.

Diallel mating designs have traditionally been used to evaluate the combining ability of parental lines in hybrids and to predict hybrid performance for quantitative traits of interest. To analyze diallel crosses, the total genetic variability is often separated into the general combining ability (GCA) for parental lines and the specific combining ability (SCA) for hybrid combinations. The GCA is a measure of additive gene action that relates to the average performance of a particular inbred line in hybrid combinations. The SCA is a measure of combining ability that is linked to the non-additive effects, including dominance and epistatic effects. In addition, mid-parent heterosis (MPH) is defined as the difference between a hybrid’s performance and the average performance of its parental lines, while better-parent heterosis (BPH) is defined as the difference between a hybrid’s performance and that of its better parental line⁴. However, the number of crossing combinations can be prohibitively high for extensive testing in a field experiment.
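To make the two heterosis definitions concrete, the following minimal Python sketch computes MPH and BPH directly from observed trait values. The function names and the numbers are hypothetical, and the sketch assumes that a larger trait value is better.

```python
def mid_parent_heterosis(hybrid, parent_1, parent_2):
    """Hybrid value minus the average of its two parents."""
    return hybrid - (parent_1 + parent_2) / 2.0


def better_parent_heterosis(hybrid, parent_1, parent_2):
    """Hybrid value minus the better parent (assumed here: the larger value)."""
    return hybrid - max(parent_1, parent_2)


# Hypothetical trait values, for illustration only.
hybrid, p1, p2 = 3.0, 2.0, 2.5
mph = mid_parent_heterosis(hybrid, p1, p2)
bph = better_parent_heterosis(hybrid, p1, p2)
print(mph, bph)  # 0.75 0.5
```

By construction BPH can never exceed MPH, because the better parent is at least as good as the mid-parent value.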

With the availability of high-density single nucleotide polymorphism (SNP) markers across the entire genome, GS has become a promising approach to reduce cost and accelerate breeding cycles in plant breeding⁵,⁶. The conceptual basis of GS is to use a training population with known phenotype and genotype data to build a prediction model, which is then used to predict genomic estimated breeding values (GEBVs) for individuals with known genotype data only⁷. This GS-based approach has been applied to predict hybrid performance for several crops, such as barley⁸, maize⁹,¹⁰, rice¹¹,¹², wheat¹³,¹⁴, and pumpkin¹⁵. More recently, hybrid rice performance based on parental characteristics was evaluated using artificial neural networks, an adaptive neuro-fuzzy inference system, and a support vector machine¹⁶.

In this study, we obtained the required estimates for hybrid performance evaluation based on a genomic best linear unbiased prediction (GBLUP) model, which takes both additive and dominance marker effects into account. The GBLUP model was built from a training population with known phenotype and genotype data. We propose a Bayesian statistical algorithm for the parameter estimation. Three datasets published in the literature, covering pumpkin (Cucurbita maxima), maize (Zea mays), and wheat (Triticum aestivum L.), were reanalyzed to illustrate the application of our proposed approach.

Materials and methods

The genomic selection-based approach

The GBLUP model

The GBLUP model considering additive plus dominance effects can be described as follows:

$${\varvec{y}}={1}_{n}\mu +{{\varvec{g}}}_{A}+{{\varvec{g}}}_{D}+{\varvec{e}},$$

(1)

where ${\varvec{y}}$ is the vector of the phenotypic values; ${1}_{n}$ is the vector of ones of length $n$ (here $n$ is the number of phenotypic values); ${{\varvec{g}}}_{A}$ is the vector of genotypic values for the additive effects; ${{\varvec{g}}}_{D}$ is the vector of genotypic values for the dominance effects; and ${\varvec{e}}$ is the vector of random errors. It is assumed that ${{\varvec{g}}}_{A}$, ${{\varvec{g}}}_{D}$, and ${\varvec{e}}$ are mutually independent and follow multivariate normal distributions, denoted by ${{\varvec{g}}}_{A}\sim N\left(0, {{\sigma }_{A}^{2}{\varvec{K}}}_{A}\right)$, ${{\varvec{g}}}_{D}\sim N\left(0, {{\sigma }_{D}^{2}{\varvec{K}}}_{D}\right)$, and ${\varvec{e}}\sim N\left(0, {\sigma }_{e}^{2}{{\varvec{I}}}_{n}\right)$. Here, ${{\varvec{K}}}_{A}=\frac{1}{p}{({\varvec{X}}}_{A}{{\varvec{X}}}_{A}^{T})$ is the genomic relationship matrix for the additive effects, abbreviated as A-GRM; the variance component ${\sigma }_{A}^{2}$ represents the cumulative variability of the additive marker effects, abbreviated as A-VC; ${{\varvec{K}}}_{D}=\frac{1}{p}{({\varvec{X}}}_{D}{{\varvec{X}}}_{D}^{T})$ is the genomic relationship matrix for the dominance effects, abbreviated as D-GRM; and the variance component ${\sigma }_{D}^{2}$ represents the cumulative variability of the dominance marker effects, abbreviated as D-VC. For the additive effects, the SNP at each locus is coded as −1, 0, or 1 for the homozygote of the minor allele, the heterozygote, and the homozygote of the major allele, respectively. For the dominance effects, the marker score is coded as 1 for the heterozygote, and 0 for both homozygotes. Then, ${{\varvec{X}}}_{A}$ and ${{\varvec{X}}}_{D}$ are the standardized marker score matrices for the additive effects and dominance effects, respectively, and $p$ is the number of SNP markers.
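The marker coding and the two relationship matrices can be illustrated with a small Python sketch. It is not the EHPGS implementation: the function names and genotypes are hypothetical, and standardization is assumed to mean centring each marker column and scaling it to unit standard deviation, which the text does not spell out.

```python
from statistics import mean, pstdev


def additive_code(genotype, major, minor):
    """Additive score: -1 (minor homozygote), 0 (heterozygote), 1 (major homozygote)."""
    if genotype == minor * 2:
        return -1
    if genotype == major * 2:
        return 1
    return 0


def dominance_code(genotype):
    """Dominance score: 1 for a heterozygote, 0 for either homozygote."""
    return 1 if genotype[0] != genotype[1] else 0


def standardize_columns(scores):
    """Centre and scale each marker column (assumed form of standardization)."""
    columns = []
    for col in zip(*scores):
        m, s = mean(col), pstdev(col)
        # A marker without variation carries no information; set it to zero.
        columns.append([(v - m) / s if s > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*columns)]


def grm(x):
    """Genomic relationship matrix K = X X^T / p for an n x p matrix X."""
    p = len(x[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / p for r2 in x] for r1 in x]


# Hypothetical genotypes: three individuals (rows) at two SNPs (columns).
genotypes = [["AA", "CT"], ["AG", "CC"], ["GG", "TT"]]
major, minor = ["A", "C"], ["G", "T"]

raw_a = [[additive_code(g, major[k], minor[k]) for k, g in enumerate(row)]
         for row in genotypes]
raw_d = [[dominance_code(g) for g in row] for row in genotypes]
k_a = grm(standardize_columns(raw_a))  # A-GRM
k_d = grm(standardize_columns(raw_d))  # D-GRM
```

Under this standardization every polymorphic column has a sum of squares equal to n, so the diagonal of each matrix averages to 1. A matrix built from centred columns is singular, so a real analysis that needs its inverse has to deal with that separately.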

Estimation for GEBVs and genomic heritability

Let $\widehat{\mu }$ be the best linear unbiased estimate (BLUE) for $\mu $, ${\widehat{{\varvec{g}}}}_{A}$ be the BLUP for ${{\varvec{g}}}_{A}$, and ${\widehat{{\varvec{g}}}}_{D}$ be the BLUP for ${{\varvec{g}}}_{D}.$ Then, $\widehat{\mu }$, ${\widehat{{\varvec{g}}}}_{A}$, and ${\widehat{{\varvec{g}}}}_{D}$ can be obtained from Henderson’s mixed model equations¹⁷:

$$\left[\begin{array}{ccc}n& {1}_{n}^{T}& {1}_{n}^{T}\\ {1}_{{\varvec{n}}}& {{\varvec{I}}}_{n}+{{{\varvec{K}}}_{A}^{-1}\lambda }_{A}& {{\varvec{I}}}_{n}\\ {1}_{{\varvec{n}}}& {{\varvec{I}}}_{n}& {{\varvec{I}}}_{n}+{{{\varvec{K}}}_{D}^{-1}\lambda }_{D}\end{array}\right]\left[\begin{array}{c}\widehat{\mu }\\ \begin{array}{c}{\widehat{{\varvec{g}}}}_{A}\\ {\widehat{{\varvec{g}}}}_{D}\end{array}\end{array}\right]=\left[\begin{array}{c}{1}_{n}^{T}y\\ y\\ y\end{array}\right],$$

(2)

where ${\lambda }_{A}={\sigma }_{e}^{2}/{\sigma }_{A}^{2}$ and ${\lambda }_{D}={\sigma }_{e}^{2}/{\sigma }_{D}^{2}$. In practice, ${\lambda }_{A}$ and ${\lambda }_{D}$ are computed by replacing the unknown ${\sigma }_{e}^{2}$, ${\sigma }_{A}^{2}$, and ${\sigma }_{D}^{2}$ with suitable estimates, respectively denoted by ${\widehat{\sigma }}_{e}^{2}$, ${\widehat{\sigma }}_{A}^{2}$, and ${\widehat{\sigma }}_{D}^{2}$. The estimate for genomic heritability was then obtained as:

$${h}^{2}=\frac{{\widehat{\sigma }}_{A}^{2}+{\widehat{\sigma }}_{D}^{2}}{{\widehat{\sigma }}_{A}^{2}+{\widehat{\sigma }}_{D}^{2}+{\widehat{\sigma }}_{e}^{2}}.$$

(3)
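Equation (2) can be assembled and solved numerically on a toy example. The sketch below is a plain-Python illustration rather than the routine used in EHPGS: the helper names (`solve`, `inverse`, `henderson`), the phenotypes, the relationship matrices, and the two ratios are all hypothetical, and the ratios are treated as known.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def inverse(k):
    """Invert k column by column using solve()."""
    n = len(k)
    cols = [solve(k, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def henderson(y, k_a, k_d, lam_a, lam_d):
    """Build the coefficient matrix of Eq. (2) and return (mu, g_A, g_D)."""
    n = len(y)
    ka_inv, kd_inv = inverse(k_a), inverse(k_d)
    size = 1 + 2 * n
    c = [[0.0] * size for _ in range(size)]
    c[0][0] = float(n)
    for i in range(n):
        c[0][1 + i] = c[1 + i][0] = 1.0            # 1_n blocks, additive part
        c[0][1 + n + i] = c[1 + n + i][0] = 1.0    # 1_n blocks, dominance part
        for j in range(n):
            eye = 1.0 if i == j else 0.0
            c[1 + i][1 + j] = eye + lam_a * ka_inv[i][j]
            c[1 + i][1 + n + j] = eye
            c[1 + n + i][1 + j] = eye
            c[1 + n + i][1 + n + j] = eye + lam_d * kd_inv[i][j]
    rhs = [sum(y)] + list(y) + list(y)
    sol = solve(c, rhs)
    return sol[0], sol[1:1 + n], sol[1 + n:]


# Hypothetical inputs, for illustration only.
y = [2.0, 3.0, 4.0]
k_a = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]]
k_d = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]
mu, g_a, g_d = henderson(y, k_a, k_d, lam_a=2.0, lam_d=4.0)
```

The first row of Eq. (2) offers a quick sanity check: n times the estimated mean plus the sums of the two effect vectors must equal the sum of the phenotypes.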

In this study, the breeding population was composed of all possible hybrid combinations in a half diallel mating design. Let ${{\varvec{K}}}_{A}^{(bp)}$ and ${{\varvec{K}}}_{D}^{(bp)}$ respectively denote the A-GRM and D-GRM between the breeding population and the training population. Moreover, let ${\widehat{{\varvec{g}}}}_{A}^{(bp)}$ and ${\widehat{{\varvec{g}}}}_{D}^{(bp)}$ denote the BLUPs for the breeding population of additive and dominance effects, respectively. From the article¹⁸, ${\widehat{{\varvec{g}}}}_{A}^{(bp)}$ and ${\widehat{{\varvec{g}}}}_{D}^{(bp)}$ can be obtained as:

$${\widehat{{\varvec{g}}}}_{A}^{(bp)}={{\varvec{K}}}_{A}^{(bp)}{{\varvec{K}}}_{A}^{-1}{\widehat{{\varvec{g}}}}_{A},$$

(4)

and

$${\widehat{{\varvec{g}}}}_{D}^{(bp)}={{\varvec{K}}}_{D}^{(bp)}{{\varvec{K}}}_{D}^{-1}{\widehat{{\varvec{g}}}}_{D}.$$

(5)

The genomic estimated genotypic values for the individuals in the breeding population were then predicted by:

$${\widehat{{\varvec{y}}}}^{(bp)}={1}_{{N}_{1}}\widehat{\mu }+{\widehat{{\varvec{g}}}}_{A}^{(bp)}+{\widehat{{\varvec{g}}}}_{D}^{(bp)},$$

(6)

where ${N}_{1}$ is the number of hybrid combinations in the breeding population. Here, ${N}_{1}={C}_{2}^{{N}_{0}}$ with ${N}_{0}$ as the number of parental lines.
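The size of the breeding population and the marker scores of its hybrids can be sketched in Python as follows. The function names are hypothetical, and the derivation of hybrid scores assumes fully homozygous parents coded −1 or 1 with no missing data, which is an assumption of this sketch rather than a statement about EHPGS.

```python
from itertools import combinations
from math import comb


def half_diallel(parents):
    """All unordered pairs of distinct parents (no selfs, no reciprocals)."""
    return list(combinations(parents, 2))


def hybrid_scores(a_i, a_j):
    """Additive and dominance scores of a hybrid from two inbred parents.

    Parents are assumed homozygous at every SNP, coded -1 or 1.
    """
    additive = [(x + y) // 2 for x, y in zip(a_i, a_j)]
    dominance = [1 if x != y else 0 for x, y in zip(a_i, a_j)]
    return additive, dominance


# Number of hybrid combinations N1 for the parental line counts used in this article.
for n0 in (142, 24, 72):
    print(n0, comb(n0, 2))  # 142 10011, 24 276, 72 2556

additive, dominance = hybrid_scores([1, -1, 1], [1, 1, -1])
print(additive, dominance)  # [1, 0, 0] [0, 1, 1]
```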

Estimation for GCA, SCA, MPH, and BPH

Let ${GCA}_{i}$ and ${GCA}_{j}$ separately denote the GCAs for the parental lines ${P}_{i}$ and ${P}_{j}$, and ${SCA}_{ij}$ denote the SCA for their hybrid combination ${P}_{i}\otimes {P}_{j}$. Moreover, let ${g}_{A}^{(ij)}$ and ${g}_{D}^{(ij)}$ denote the genotypic values of ${P}_{i}\otimes {P}_{j}$ for the additive and dominance effects, respectively. From the article¹⁹,

$${g}_{A}^{(ij)}={GCA}_{i}+{GCA}_{j},$$

(7)

and

$${g}_{D}^{(ij)}={SCA}_{ij}.$$

(8)

From Eq. (8), the BLUP for ${SCA}_{ij}$ was obtained as:

$${\widehat{SCA}}_{ij}={\widehat{g}}_{D}^{(ij)}.$$

(9)

Let

$$ \overline{G}_{A}^{\left( i \right)} = \frac{{\mathop \sum \nolimits_{j \ne i}^{{N_{0} }} \hat{g}_{A}^{{\left( {ij} \right)}} }}{{N_{0} - 1}} $$

(10)

and

$${\overline{G} }_{A}=\frac{{\sum }_{i=1}^{{N}_{0}-1}{\sum }_{j=i+1}^{{N}_{0}}{\widehat{g}}_{A}^{(ij)}}{{N}_{1}}$$

(11)

where ${\overline{G} }_{A}^{(i)}$ is the average of the additive genotypic values over the ${N}_{0}-1$ hybrids that involve parental line $i$, and ${\overline{G} }_{A}$ is the average of the additive genotypic values over all ${N}_{1}$ hybrids, each counted once. From Eq. (7), the BLUP for ${GCA}_{i}$ is given by:

$${\widehat{GCA}}_{i}=\frac{{(N}_{0}-1){\overline{G} }_{A}^{(i)}}{{N}_{0}-2}-\frac{{N}_{0}{\overline{G} }_{A}}{2({N}_{0}-2)}.$$

(12)
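Equation (12) follows from Eq. (7) in a few steps. The short derivation below is a sketch written for the true values (the estimator is obtained by replacing them with their BLUPs). It introduces the shorthand $S$ for the sum of all GCAs, which is not a symbol used elsewhere in this article, and it takes ${\overline{G} }_{A}$ as the average over the ${N}_{1}={N}_{0}({N}_{0}-1)/2$ distinct hybrids, each counted once.

$$S=\sum_{k=1}^{N_0}{GCA}_{k},\qquad (N_0-1)\,\overline{G}_A^{(i)}=\sum_{j\ne i}\left({GCA}_{i}+{GCA}_{j}\right)=(N_0-2)\,{GCA}_{i}+S,$$

$$N_1\,\overline{G}_A=\sum_{i<j}\left({GCA}_{i}+{GCA}_{j}\right)=(N_0-1)\,S\quad\Rightarrow\quad S=\frac{N_1\,\overline{G}_A}{N_0-1}=\frac{N_0\,\overline{G}_A}{2},$$

$${GCA}_{i}=\frac{(N_0-1)\,\overline{G}_A^{(i)}-S}{N_0-2}=\frac{(N_0-1)\,\overline{G}_A^{(i)}}{N_0-2}-\frac{N_0\,\overline{G}_A}{2(N_0-2)}.$$

The second line uses the fact that every parental line takes part in exactly ${N}_{0}-1$ hybrids.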

From the article¹⁵, the GEBV-based MPH and BPH for ${P}_{i}\otimes {P}_{j}$ can be estimated by:

$${\widehat{MPH}}_{ij}={\widehat{SCA}}_{ij}$$

(13)

and

$${\widehat{BPH}}_{ij}={\widehat{SCA}}_{ij}-\left|{\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}\right|$$

(14)

where $\left|{\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}\right|$ is the absolute value of (${\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}$). Under the positive heterosis assumption, the larger the value of MPH or BPH, the stronger the heterosis of the hybrid combination.
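Equations (9) to (14) amount to a few lines of arithmetic once the additive and dominance BLUPs of every hybrid are available. The Python sketch below is illustrative only: the function names and all values are hypothetical, hybrids are indexed by unordered pairs (i, j) with i < j, the overall average counts each hybrid once, and at least three parental lines are required.

```python
from itertools import combinations


def estimate_gca(g_a, n0):
    """GCA of each parental line from the additive BLUPs of all hybrids.

    g_a maps each unordered pair (i, j), i < j, to the additive BLUP of the hybrid.
    """
    n1 = n0 * (n0 - 1) // 2
    grand_mean = sum(g_a.values()) / n1                       # Eq. (11)
    gca = []
    for i in range(n0):
        line_mean = sum(v for pair, v in g_a.items() if i in pair) / (n0 - 1)  # Eq. (10)
        gca.append((n0 - 1) * line_mean / (n0 - 2)
                   - n0 * grand_mean / (2 * (n0 - 2)))        # Eq. (12)
    return gca


def heterosis(g_d, gca):
    """MPH (Eq. 13) and BPH (Eq. 14) from the dominance BLUPs, i.e. the SCAs (Eq. 9)."""
    mph = dict(g_d)
    bph = {(i, j): sca - abs(gca[i] - gca[j]) for (i, j), sca in g_d.items()}
    return mph, bph


# Hypothetical example with four parental lines and known GCAs.
true_gca = [0.5, -0.2, 0.1, 0.4]
pairs = list(combinations(range(4), 2))
g_a = {(i, j): true_gca[i] + true_gca[j] for i, j in pairs}  # Eq. (7)
g_d = {pair: 0.3 for pair in pairs}                          # Eq. (8), constant SCA
gca = estimate_gca(g_a, 4)
mph, bph = heterosis(g_d, gca)
```

Because the additive values were built from known GCAs through Eq. (7), `estimate_gca` should return those GCAs up to rounding, which is a convenient check of Eq. (12).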

The Bayesian statistical algorithm

For a given training population with known phenotype and genotype data, a Bayesian Gibbs sampling (BGS) algorithm, modified from an algorithm presented in the article²⁰, was used to estimate the required parameters. The algorithm can be described as follows.

Step 1: Set initial values for the parameters in the model.

The default values are given by:

$\upmu =\overline{y }$ (the sample mean of the phenotypic values), ${{\varvec{g}}}_{A}={{\varvec{g}}}_{D}=0$, ${\sigma }_{e}^{2}=1$, and ${\sigma }_{A}^{2}={\sigma }_{D}^{2}=0.5$.

Step 2: Rewrite Eq. (2) as
$$\left[\begin{array}{ccc}{{\varvec{C}}}_{11}& {{\varvec{C}}}_{12}& {{\varvec{C}}}_{13}\\ {{\varvec{C}}}_{21}& {{\varvec{C}}}_{22}& {{\varvec{C}}}_{23}\\ {{\varvec{C}}}_{31}& {{\varvec{C}}}_{32}& {{\varvec{C}}}_{33}\end{array}\right]\left[\begin{array}{c}{{\varvec{g}}}_{1}\\ {{\varvec{g}}}_{2}\\ {{\varvec{g}}}_{3}\end{array}\right]=\left[\begin{array}{c}{{\varvec{\gamma}}}_{1}\\ {{\varvec{\gamma}}}_{2}\\ {{\varvec{\gamma}}}_{3}\end{array}\right].$$
(15)

Update ${{\varvec{g}}}_{i}$ by ${{\varvec{g}}}_{i}\sim N({{\varvec{g}}}_{i}^{*}, {\sigma }_{e}^{2}{{\varvec{C}}}_{ii}^{-1})$, where ${{\varvec{g}}}_{i}^{*}={{\varvec{C}}}_{ii}^{-1}({{\varvec{\gamma}}}_{i}-{{\varvec{C}}}_{i,-i}{{\varvec{g}}}_{-i})$ for i = 1, 2, 3. Here, ${{\varvec{C}}}_{i,-i}$ denotes ${{\varvec{C}}}_{i,j}$ for all $j\ne i$; and ${{\varvec{g}}}_{-i}$ is ${{\varvec{g}}}_{j}$ for all $j\ne i$.

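Step 3: Calculate the vector of residuals as: ${\varvec{e}}={\varvec{y}}-{1}_{n}{g}_{1}-{{\varvec{g}}}_{2}-{{\varvec{g}}}_{3}$, where ${g}_{1}=\mu $, ${{\varvec{g}}}_{2}={{\varvec{g}}}_{A}$, and ${{\varvec{g}}}_{3}={{\varvec{g}}}_{D}$ following the order of the unknowns in Eq. (2).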
Step 4: Update ${\sigma }_{e}^{2}$ as ${\sigma }_{e}^{2}=({{\varvec{e}}}^{T}{\varvec{e}}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}$, where ${\chi }_{n+{v}^{\boldsymbol{*}}}^{2}$ is the chi-square random variate with $n+{v}^{\boldsymbol{*}}$ degrees of freedom; ${S}^{\boldsymbol{*}}=0.5V$ with V as the sample variance of the values in ${\varvec{y}}$; and ${v}^{\boldsymbol{*}}=5$.
Step 5: Update ${\sigma }_{A}^{2}$ as ${\sigma }_{A}^{2}=({{\varvec{g}}}_{A}^{T}{{\varvec{K}}}_{A}^{-1}{{\varvec{g}}}_{A}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}$; and ${\sigma }_{D}^{2}$ as ${\sigma }_{D}^{2}=({{\varvec{g}}}_{D}^{T}{{\varvec{K}}}_{D}^{-1}{{\varvec{g}}}_{D}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}$.
Step 6: Update the equations in Eq. (15) with ${\lambda }_{A}={\sigma }_{e}^{2}/{\sigma }_{A}^{2}$, and ${\lambda }_{D}={\sigma }_{e}^{2}/{\sigma }_{D}^{2}$.
Step 7: Repeat Steps 2–6 K times to generate a series of results over the K iterations, which are denoted by:
${\mu }^{(k)}$, ${{\varvec{g}}}_{A}^{(k)}$, ${{\varvec{g}}}_{D}^{(k)}$, ${\sigma }_{A}^{2(k)}$, ${\sigma }_{D}^{2(k)}$, and ${\sigma }_{e}^{2(k)}$ for $k=1, 2, \cdots , \mathrm{K}$.
Step 8: Discard the results from the first $0.9\mathrm{K}$ iterations as burn-in, and average the results from the remaining $0.1\mathrm{K}$ iterations. The default number of iterations K is 5000.
Step 9: Repeat Steps 1–8 M times to generate M sets of the averages of the parameters generated from Step 8. The default number of chains M is five.
Step 10: Average the resulting mean values of the parameters over the M chains, and the resulting averages are treated as the estimates for the parameters.
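The variance updates in Steps 4 and 5 and the averaging in Steps 8 to 10 can be sketched with the Python standard library. This is a partial illustration under stated assumptions, not the EHPGS sampler: the function names are hypothetical, the block update of Step 2 is omitted, and the quadratic form is held fixed across iterations purely to keep the example short, whereas in the full algorithm it changes at every iteration.

```python
import random


def chi_square(df, rng):
    """Chi-square variate with df degrees of freedom (gamma, shape df/2, scale 2)."""
    return rng.gammavariate(df / 2.0, 2.0)


def draw_variance(quad_form, n, s_star, v_star, rng):
    """Steps 4-5: (quadratic form + S* v*) divided by a chi-square(n + v*) variate."""
    return (quad_form + s_star * v_star) / chi_square(n + v_star, rng)


def posterior_mean(draws, burn_in_fraction=0.9):
    """Step 8: discard the first 90% of the draws and average the rest."""
    start = int(len(draws) * burn_in_fraction)
    kept = draws[start:]
    return sum(kept) / len(kept)


def average_chains(chain_means):
    """Step 10: average the per-chain means."""
    return sum(chain_means) / len(chain_means)


# Hypothetical setting: n = 100 observations, v* = 5, S* = 0.5 V with V = 0.2.
n, v_star = 100, 5
s_star = 0.5 * 0.2
quad_form = 12.0  # stands in for e^T e; fixed here only for illustration

chain_means = []
for chain in range(5):                 # M = 5 chains
    rng = random.Random(chain)         # a different seed per chain
    draws = [draw_variance(quad_form, n, s_star, v_star, rng) for _ in range(5000)]
    chain_means.append(posterior_mean(draws))
estimate = average_chains(chain_means)
```

With the quadratic form fixed, the draws follow a scaled inverse chi-square distribution whose mean is (quadratic form + S* v*) / (n + v* − 2), so the estimate should land close to 12.5 / 103.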

An R package called EHPGS, developed to execute the proposed approach, is available from GitHub (https://github.com/spcspin/EHPGS). A reference manual and a tutorial including a demonstration example are provided with the package.

A comparison study

The pumpkin dataset was analyzed using a two-stage approach in the article¹⁵, in which the authors first estimated GEBVs, SCAs, GCAs, MPHs, and BPHs based on a whole genome regression model using Bayes C estimation in the R package BGLR²¹. Then, they calculated the A-GRM and D-GRM by two different formulas²²,²³, and estimated the variance components by restricted maximum likelihood (REML) using another R package, sommer²⁴. The results obtained from the two-stage approach and from ours are compared in the Results and Discussion section.

The Bayesian reproducing kernel Hilbert space (RKHS) method in BGLR is another Bayesian algorithm that has been commonly used to perform GEBV prediction for the GBLUP model in Eq. (1). To compare the Bayesian RKHS method with our proposed BGS algorithm, the three datasets were reanalyzed using BGLR. The priors specified in BGLR were the same as ours, the number of iterations was set to 10,000, the burn-in was fixed at 9000 iterations, and the number of chains was set to five (the BGLR function was run five times). These settings are exactly the same as those used for our algorithm in analyzing the datasets.

A simulation study

To further examine whether the proposed BGS algorithm can estimate known variance components more accurately than established methods, such as the REML method in sommer and the Bayesian RKHS method in BGLR, a simulation study was conducted as follows. The estimated values for the model parameters obtained from the training data (displayed in Table 3) were used to generate 3000 sets of phenotype data for the training population in each dataset (119, 276, and 600 realized observations in each simulated dataset for the pumpkin, maize, and wheat datasets, respectively). For each simulated dataset, the variance components were estimated by the REML, Bayesian RKHS, and our BGS methods.
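Generating one simulated phenotype vector from the model in Eq. (1) can be sketched as follows. The sketch assumes positive definite relationship matrices so that a Cholesky factor exists. The function names, the 3 × 3 matrices, and the constant term are hypothetical, while the three variance components are the pumpkin estimates reported in this article.

```python
import random
from math import sqrt


def cholesky(k):
    """Lower-triangular L with L L^T = K for a positive definite matrix K."""
    n = len(k)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][t] * low[j][t] for t in range(j))
            if i == j:
                low[i][j] = sqrt(k[i][i] - s)
            else:
                low[i][j] = (k[i][j] - s) / low[j][j]
    return low


def draw_effects(low, variance, rng):
    """One draw from N(0, variance * K), given the Cholesky factor of K."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    scale = sqrt(variance)
    return [scale * sum(low[i][t] * z[t] for t in range(i + 1))
            for i in range(len(low))]


def simulate_phenotypes(mu, k_a, k_d, var_a, var_d, var_e, rng):
    """y = mu + g_A + g_D + e with independent normal components, as in Eq. (1)."""
    g_a = draw_effects(cholesky(k_a), var_a, rng)
    g_d = draw_effects(cholesky(k_d), var_d, rng)
    return [mu + a + d + rng.gauss(0.0, sqrt(var_e)) for a, d in zip(g_a, g_d)]


# Hypothetical relationship matrices and constant term, for illustration only.
k_a = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]]
k_d = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]
rng = random.Random(2023)
replicates = [simulate_phenotypes(2.0, k_a, k_d, 0.306, 0.159, 0.111, rng)
              for _ in range(3)]  # the study used 3000 replicates per dataset
```

Each replicate would then be passed to the estimation methods under comparison, and the resulting estimates compared with the values used to generate the data.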

A cross-validation analysis

A tenfold cross-validation analysis using empirical data was also performed to compare the accuracy of GEBV prediction among the three methods. There were 119 and 276 empirical observations available in the pumpkin and maize datasets, respectively. To reduce the computational cost, 500 individuals randomly selected from the 2556 available hybrids in the wheat dataset were used for this analysis. The procedure can be described as follows. Step 1: Each of the three datasets was partitioned at random into 10 mutually exclusive clusters. Step 2: Each of the 10 clusters was used in turn as the testing set, while the remaining nine clusters were pooled as the training set. Step 3: After the GEBV prediction by each method, Pearson’s correlation between the GEBVs and the phenotypic values in the testing set was calculated for each dataset. The procedure was repeated five times to generate 50 correlation coefficients for each dataset.
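The cross-validation procedure can be sketched in Python as shown below. The function names are hypothetical, and `predict` is a placeholder for any of the three GEBV prediction methods; the toy predictor used here simply returns a hidden signal and is not a genomic model.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def k_fold_indices(n, k, rng):
    """Randomly partition the indices 0..n-1 into k mutually exclusive clusters."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def repeated_cv(phenotypes, predict, k=10, repeats=5, seed=1):
    """Return k * repeats correlations between predictions and held-out phenotypes.

    predict(train, test) is a placeholder: it receives training and testing
    indices and returns one predicted value per testing index.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(repeats):
        for test in k_fold_indices(len(phenotypes), k, rng):
            held_out = set(test)
            train = [i for i in range(len(phenotypes)) if i not in held_out]
            gebv = predict(train, test)
            correlations.append(pearson(gebv, [phenotypes[i] for i in test]))
    return correlations


# Toy data with 119 individuals (the size of the pumpkin dataset).
rng = random.Random(0)
signal = [float(i) for i in range(119)]
phenotypes = [s + rng.gauss(0.0, 10.0) for s in signal]


def toy_predict(train, test):
    return [signal[i] for i in test]


correlations = repeated_cv(phenotypes, toy_predict)
print(len(correlations))  # 50
```

The mean and standard deviation of the 50 values are the summaries that would be reported for each method and dataset.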

The genome datasets

Three datasets that have been published in literature were reanalyzed to illustrate the use of EHPGS.

Pumpkin dataset

A pumpkin dataset containing 119 intra-crossing hybrid combinations of C. maxima with phenotypic values for fruit weight (FWT) (kg) was analyzed for evaluation of hybrid performance¹⁵. The phenotype data were historical data collected from 1988 to 2016. All the trials were conducted at a single location in southern Taiwan. Every hybrid had six to ten observations at each time point, and their average was used as the phenotypic observation for the hybrid in that year. Because every hybrid was observed in more than one year, the year effects were removed under the assumption that they were random effects following a normal distribution.

The germplasm collection of the pumpkin set consisted of 320 parental lines, which were classified into three clusters: C. maxima with 142 inbred lines, C. pepo with 60 inbred lines, and C. moschata with 118 inbred lines. After SNP calling, 76,815 SNPs were extracted from the parental lines, and only 4,521 SNPs remained for C. maxima after removing markers with missing rate ≥ 0.05 or minor allele frequency (MAF) < 0.05 and applying a series of operations for determining linkage disequilibrium (LD) blocks. The 142 inbred lines produced ${C}_{2}^{142}=\mathrm{10,011}$ potential hybrid combinations in a half diallel mating design. The means adjusted for the year effects for the 119 C. maxima hybrids were used in the current study to build a GBLUP model for evaluating the performance of the 10,011 hybrid combinations.

Maize dataset

A maize dataset consisting of 276 hybrids derived from 24 parental lines in a half diallel mating design was previously analyzed to study optimal designs for GS in hybrid crops². The 24 diverse parents were classified into two groups according to the germplasm origin and a principal component analysis. The two groups were (i) the temperate and mixed (TM) group, consisting of 11 inbred lines (i.e., B73, B97, Ky21, M162W, Mo17, MS71, Oh43, OH7B, M37W, Mo18W, and Tx303); and (ii) the tropical and sub-tropical (TS) group, consisting of the remaining 13 inbred lines (i.e., CML52, CML69, CML103, CML228, CML247, CML277, CML322, CML333, Ki3, Ki11, NC350, NC358, and Tzi8). There were ${C}_{2}^{11}=55$ hybrid combinations in the TM group, ${C}_{2}^{13}=78$ hybrids in the TS group, and 11 × 13 = 143 hybrids between the two groups. Three traits, flowering time, ear height, and grain yield (YLD) (Mg/ha), were evaluated for all of the hybrids at two locations (i.e., Columbia, MO and Clayton, NC) in 2005 and 2006. In our study, the combined BLUP values from the two locations for YLD were evaluated.

Genotype data for the 24 inbred lines were extracted from the Maize HapMap V2²⁵ at www.panzea.org, which consisted of 10,296,310 SNP markers. SNP markers with missing rate ≥ 0.05 or MAF ≤ 0.1 were first removed, leaving 134,726 SNPs. Missing genotypes were then imputed with the homozygote of the major allele. To select reliable SNPs for building a GBLUP model, the retained SNPs were further filtered by LD blocks. The LD parameter ${r}^{2}$ (i.e., the squared Pearson’s correlation coefficient) of the SNPs for each chromosome was estimated using TASSEL 5.2.41²⁶ with a sliding window of 10. A smooth function between ${r}^{2}$ and the physical distance (bp) was built using the R function loess.smooth() with a second-degree locally weighted polynomial regression. The LD decay of the ten chromosomes is displayed in Fig. S1 of the Supplementary Materials. The 134,726 SNP markers were then filtered by the LD block sizes at which ${r}^{2}$ approached 0.2, leaving 46,134 SNPs. A SNP was also deleted if its corresponding column for the dominance effects was a zero vector. Finally, 30,239 SNP markers were retained for further analysis. In the current study, all 276 hybrids with known trait values were used as the training population for the prediction model construction.
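The first filtering and imputation steps for the inbred lines can be sketched in Python. The function name, the SNP names, and the calls are hypothetical. The sketch assumes biallelic SNPs with one allele recorded per homozygous inbred line and `None` for a missing call, and it computes the MAF from the observed calls only, which the text does not specify. The LD-based filtering is not covered.

```python
from collections import Counter


def filter_and_impute(markers, max_missing=0.05, min_maf=0.1):
    """Drop SNPs with missing rate >= max_missing or MAF <= min_maf.

    markers maps a SNP name to a list with one allele per inbred line
    (None marks a missing call). Missing calls of the retained SNPs are
    imputed with the major allele.
    """
    kept = {}
    for name, calls in markers.items():
        observed = [c for c in calls if c is not None]
        missing_rate = calls.count(None) / len(calls)
        if missing_rate >= max_missing or not observed:
            continue
        counts = Counter(observed).most_common()
        major = counts[0][0]
        # Frequency of the second most common allele among the observed calls.
        maf = counts[1][1] / len(observed) if len(counts) > 1 else 0.0
        if maf <= min_maf:
            continue
        kept[name] = [major if c is None else c for c in calls]
    return kept


# Hypothetical calls for 24 inbred lines at three SNPs.
markers = {
    "snp1": ["A"] * 15 + ["G"] * 8 + [None],        # kept, one call imputed
    "snp2": ["A"] * 12 + ["G"] * 10 + [None] * 2,   # missing rate 2/24 >= 0.05
    "snp3": ["A"] * 23 + ["G"],                     # MAF 1/24 <= 0.1
}
kept = filter_and_impute(markers)
print(sorted(kept))  # ['snp1']
```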

Wheat dataset

A genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding was investigated in a study based on 135 advanced elite winter wheat lines²⁷. A set of 1604 wheat hybrids produced from crosses among 15 male lines and 120 female lines was evaluated for grain yield (YLD) (Mg/ha) in 11 environments. Grain yield data for all ${C}_{2}^{135}=9045$ unique hybrids were predicted based on those of the phenotyped individuals. For the genotype data, the 135 lines were fingerprinted using a 90,000 SNP array based on an Illumina Infinium array. After quality tests, 17,372 high-quality SNP markers were retained.

To study optimal designs for GS, 2556 hybrid combinations, produced by the half diallel mating design on 72 lines selected from the original 135 elite wheat lines, were analyzed in the article². An optimal training population with 600 individuals, determined by the r-score criterion²⁸, was used in the current study to build the GBLUP model for the performance evaluation on the 2556 hybrid combinations.

Results and Discussion

Pumpkin dataset

By the half diallel mating design, the 142 parental lines produced ${C}_{2}^{142}=\mathrm{10,011}$ hybrid combinations in the breeding population. For illustration purposes, we only reported the top 25 superior hybrid combinations with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 1; and the top 10 potential parental lines with the largest GCAs in Table 2. Table 1 shows that both ${MPH}_{ij}$ and ${BPH}_{ij}$ are greater than 0 for all of the selected hybrids, indicating that they performed better in FWT than both of their parents. Moreover, every superior hybrid presented in Table 1 was derived from one or two of the potential parental lines presented in Table 2. In particular, P026, the parental line with the highest GCA, was a parent of the top 11 hybrids with the greatest GEBVs among the 25 selected hybrids.

Table 1 The top 25 superior hybrid combinations with the largest GEBVs for fruit weight (FWT) within a pumpkin population.


Table 2 The top 10 potential parental lines with the largest GCAs for fruit weight (FWT) within a pumpkin population.


The estimates for the variance components and genomic heritability are shown in Table 3. From the table, the estimates of the A-VC, D-VC, and genomic heritability are given by ${\widehat{\sigma }}_{A}^{2}=0.306$, ${\widehat{\sigma }}_{D}^{2}=0.159$, and ${h}^{2}=0.807$. The high heritability partially explains why the values of ${MPH}_{ij}$ and ${BPH}_{ij}$ in Table 1 are all positive, which indicates strong heterosis in FWT among the intra-crossing hybrid combinations of C. maxima.

Table 3 The estimates for the variance components, genomic heritability, and constant term in fruit weight (FWT) for the pumpkin dataset and in grain yield (YLD) for the maize and wheat datasets.


Maize dataset

There were ${C}_{2}^{24}=276$ hybrid combinations derived from the 24 parental lines.

For illustration purposes, we reported the top 15 superior hybrids with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 4; and the top 5 potential parental lines with the largest GCAs in Table 5. From Table 4, both ${MPH}_{ij}$ and ${BPH}_{ij}$ are greater than 0 for all of the selected hybrids, showing that they performed better in YLD than both of their parents. A total of 12 out of the 15 selected hybrids belong to the inter-crossing group between TM and TS. From Table 5, the top five parental lines with the greatest GCAs are CML228, CML103, Mo17, B97, and B73, and at least one of them was a parent of each of the 15 superior hybrids, with the exception of MS71 ⨂ Tzi8.

Table 4 The top 15 superior hybrid combinations with the largest GEBVs for grain yield (YLD) within a maize population.


Table 5 The top five potential parental lines with the largest GCAs for grain yield (YLD) within a maize population.


The estimates for the variance components and genomic heritability are also displayed in Table 3. From the table, the estimates of the A-VC, D-VC, and genomic heritability are given by ${\widehat{\sigma }}_{A}^{2}=0.434$, ${\widehat{\sigma }}_{D}^{2}=0.420$, and ${h}^{2}=0.415$, partially explaining why the values of ${MPH}_{ij}$ and ${BPH}_{ij}$ in Table 4 are all positive, and showing that there is obvious heterosis in YLD within the breeding population.

Wheat dataset

By the half diallel mating design, the 72 parental lines produced ${C}_{2}^{72}=2556$ hybrid combinations in the breeding population. For illustration purposes, we only reported the top 20 superior hybrids with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 6; and the top 10 potential parental lines with the largest GCAs in Table 7. The estimates for the variance components and genomic heritability are also displayed in Table 3. From Table 6, all of the ${MPH}_{ij}$ are greater than 0, showing that the selected hybrids had a larger YLD than the mean YLD of their parents. Most of the ${BPH}_{ij}$ are noticeably smaller than ${MPH}_{ij}$, probably because the additive effects (${\widehat{\sigma }}_{A}^{2}=0.066$, Table 3) were stronger than the dominance effects (${\widehat{\sigma }}_{D}^{2}=0.014$, Table 3). Moreover, 11 of the 20 BPHs are negative, showing that the corresponding hybrids were inferior to their better parents. Every superior hybrid presented in Table 6 was derived from one or two of the potential parental lines presented in Table 7. In particular, F102, the parental line with the highest GCA (Table 7), was a parent of 17 of the 20 selected superior hybrids (Table 6).

Table 6 The top 20 superior hybrid combinations with the largest GEBVs for grain yield (YLD) within a wheat population.


Table 7 The top 10 potential parental lines with the largest GCAs for grain yield (YLD) within a wheat population.


In summary, the BPH values were consistently positive for the top hybrids in both the pumpkin and maize datasets, implying strong and useful heterosis in the two crops. This result agrees with the literature¹⁵,²⁹. In the case of wheat, however, fewer than half of the top hybrids had a positive BPH value, and those values were small, indicating that the heterosis existing in this dataset may not be adequate for practical utility. A wheat hybrid can have a small positive or a negative BPH value when one of its parents is inferior³⁰.

The correlation between phenotypic values and GEBVs

Scatter plots of all available phenotypic values (119, 276, and 2556 individuals in the pumpkin, maize, and wheat datasets, respectively) and their GEBVs in each dataset are displayed in Figs. 1, 2 and 3. The respective Pearson’s correlation coefficients are 0.9691, 0.6786, and 0.9445. From the figures, most of the selected superior hybrids appear in the upper right-hand corners, meaning that the selected hybrids with higher GEBVs also have higher observed phenotypic values. This is a valuable result because phenotypic selection is usually costly and time-consuming. The strong consistency between the results of genomic selection and phenotypic selection supports recommending the proposed GS-based approach for practical applications.

The results of the comparison study

The top 25 superior hybrids identified by the two-stage approach¹⁵, together with those identified by our proposed approach, are displayed in Table S1 of the Supplementary Materials. The corresponding 10 potential parental lines are displayed in Table S2. The two sets of results are highly consistent with each other. From Table S1, 18 hybrids were in common among the 25 hybrids selected by each approach, and the top six hybrids with the highest GEBVs were the same, even though the order was slightly different. Table S2 indicates seven potential parental lines in common among the 10 selected by each approach. The estimates of the variance components for the additive, dominance, and random error effects and of the genomic heritability from the two-stage approach are 0.195, 0.119, 0.066, and 0.826, respectively. The corresponding estimates by our approach are 0.306, 0.159, 0.111, and 0.807. Even though the variance component estimates differ between the two approaches, the two estimates of the genomic heritability are fairly close.
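The two heritability values quoted above can be reproduced from the reported variance components with Eq. (3). The short Python check below uses a hypothetical function name and only the numbers given in this paragraph.

```python
def genomic_heritability(var_a, var_d, var_e):
    """Eq. (3): share of the total variance explained by additive plus dominance effects."""
    genetic = var_a + var_d
    return genetic / (genetic + var_e)


h2_two_stage = genomic_heritability(0.195, 0.119, 0.066)
h2_bgs = genomic_heritability(0.306, 0.159, 0.111)
print(round(h2_two_stage, 3), round(h2_bgs, 3))  # 0.826 0.807
```

Both approaches attribute roughly four fifths of the variance to genetic effects, even though the individual components differ in scale.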

Overall, the top parental lines and hybrid combinations identified by the Bayesian RKHS method in BGLR and by our BGS algorithm were highly consistent with each other. Pearson’s correlations between GEBVs and phenotypic values for the datasets are displayed in Table 8. From the table, our proposed algorithm led to higher Pearson’s correlations in the pumpkin and maize datasets, and almost equal correlations in the wheat dataset. Additionally, the estimates for variance components and genomic heritability obtained with the Bayesian RKHS method are displayed in Table 9. In comparison with those obtained from our BGS algorithm (Table 3), BGLR resulted in relatively low genomic heritability.

Table 8 Pearson’s correlations between GEBVs and phenotypic values for the datasets obtained from Bayesian RKHS method in BGLR and our proposed BGS algorithm.

Table 9 The estimates for the variance components and genomic heritability in fruit weight (FWT) for the pumpkin dataset and in yield (YLD) for the maize and wheat datasets by using the Bayesian RKHS method in BGLR.

The results of the simulation study and the cross-validation analysis

Side-by-side box plots of the variance component estimates over the 3000 repetitions in the simulation study are displayed in Fig. 4. In the figure, the two Bayesian methods, the BGS algorithm and the Bayesian RKHS method, generally led to larger bias but smaller dispersion than the REML method. The relative performance of the methods depended on the combination of dataset and variance component. For example, in the pumpkin dataset the BGS algorithm tended to overestimate ${\sigma }_{A}^{2}$, whereas the Bayesian RKHS method tended to underestimate it. Moreover, in the same dataset the BGS algorithm performed slightly better for ${\widehat{\sigma }}_{e}^{2}$ but worse for ${\widehat{\sigma }}_{D}^{2}$ than the Bayesian RKHS method.
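The trade-off between bias and dispersion seen in such box plots can be summarized with three numbers per method: bias, standard deviation, and root mean squared error. The sketch below uses hypothetical estimators and a hypothetical true variance. It does not reproduce the simulation settings or results of this study, and `summarize_estimates` is an illustrative helper rather than an EHPGS function.

```python
import numpy as np

def summarize_estimates(estimates, true_value):
    """Bias, sample standard deviation, and RMSE of repeated estimates."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value
    sd = est.std(ddof=1)
    rmse = np.sqrt(np.mean((est - true_value) ** 2))
    return {"bias": float(bias), "sd": float(sd), "rmse": float(rmse)}

rng = np.random.default_rng(1)
true_sigma2 = 0.30  # hypothetical true variance component

# Hypothetical method A: biased upward but tightly concentrated.
method_a = rng.normal(loc=0.36, scale=0.03, size=3000)
# Hypothetical method B: unbiased on average but more dispersed.
method_b = rng.normal(loc=0.30, scale=0.08, size=3000)

for name, est in [("A", method_a), ("B", method_b)]:
    summary = summarize_estimates(est, true_sigma2)
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Since squared RMSE is approximately squared bias plus variance, a method with larger bias can still have a competitive overall error when its dispersion is small.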

The mean and the standard deviation over the 50 resulting values in the cross-validation analysis are displayed in Table 10. The three methods performed similarly across the three datasets. The BGS algorithm, the REML method, and the Bayesian RKHS method performed best in the maize, wheat, and pumpkin datasets, respectively, but the margins were very small. Given these results, the REML method in sommer and the Bayesian RKHS method in BGLR were also incorporated into EHPGS as options for GEBV prediction and variance component estimation.
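A summary like the one in Table 10 can be produced by repeating a train/test split, recording Pearson's correlation between predicted and observed values each time, and reporting the mean and standard deviation. The sketch below makes several assumptions that are not taken from this study: it uses random hold-out splits with a 20% test set, synthetic marker data, and plain ridge regression on markers as a stand-in predictor rather than the BGS, REML, or Bayesian RKHS methods. All function names and parameter values are hypothetical.

```python
import numpy as np

def ridge_predict(X_train, y_train, X_test, lam):
    """Ridge regression on centered markers with a fixed shrinkage parameter lam."""
    mu = y_train.mean()
    col_means = X_train.mean(axis=0)
    Xc = X_train - col_means
    # Dual form: solve an n x n system, cheaper when markers outnumber individuals.
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y_train - mu)
    return mu + (X_test - col_means) @ (Xc.T @ alpha)

def repeated_holdout_correlations(X, y, n_repeats=50, test_fraction=0.2,
                                  lam=1.0, seed=0):
    """Pearson's r between predictions and observations over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(2, int(round(test_fraction * n)))
    correlations = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = ridge_predict(X[train], y[train], X[test], lam)
        correlations.append(np.corrcoef(pred, y[test])[0, 1])
    return np.array(correlations)

# Synthetic markers coded 0/1/2 and a synthetic trait (assumed values).
rng = np.random.default_rng(42)
X = rng.integers(0, 3, size=(120, 300)).astype(float)
beta = rng.normal(scale=0.1, size=300)
y = X @ beta + rng.normal(scale=0.5, size=120)

cors = repeated_holdout_correlations(X, y, lam=50.0)
print(f"mean = {cors.mean():.3f}, sd = {cors.std(ddof=1):.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether small differences between methods exceed the split-to-split variation.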

Table 10 Means and standard deviations (in parentheses) over the 50 resulting Pearson’s correlation coefficients in the cross-validation analysis.

Conclusion

In this study, a software package called EHPGS was developed for identifying potential parental lines and superior hybrid combinations from a breeding population composed of all possible hybrids produced by a half diallel mating design. A training population with known phenotype and genotype data is required to build the GBLUP model, and a set of parental lines with known genotype data is required to predict GEBVs for the hybrid combinations derived from them. Any dataset with such a training population and parental line set can be analyzed with the package. For an input dataset, EHPGS generates GEBVs, SCAs, GCAs, MPHs, and BPHs for all potential candidates.
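As a reminder of what the two heterosis measures express, the sketch below uses the conventional definitions in absolute trait units: mid-parent heterosis is the hybrid value minus the mean of its two parents, and better-parent heterosis is the hybrid value minus the better parent. It assumes that larger trait values are better. The GEBV-based formulas used by EHPGS are given in the Materials and methods and may differ in detail, for example by being expressed as percentages. The function names and numeric values are hypothetical.

```python
def mid_parent_heterosis(hybrid, parent1, parent2):
    """Hybrid value minus the average of the two parental values."""
    return hybrid - (parent1 + parent2) / 2.0

def better_parent_heterosis(hybrid, parent1, parent2):
    """Hybrid value minus the larger parental value (larger assumed better)."""
    return hybrid - max(parent1, parent2)

# Hypothetical predicted values for one hybrid and its two parents.
hybrid_value, p1_value, p2_value = 5.0, 3.0, 4.0

mph = mid_parent_heterosis(hybrid_value, p1_value, p2_value)
bph = better_parent_heterosis(hybrid_value, p1_value, p2_value)
print(mph, bph)  # 1.5 1.0
```

Better-parent heterosis is never larger than mid-parent heterosis under these definitions, so a positive value is the stricter evidence that a hybrid outperforms its parents.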

Data availability

All phenotype and genotype datasets that were analyzed in this study can be downloaded from Figshare (https://doi.org/10.6084/m9.figshare.22359883.v2).

References

Longin, C. F. H. et al. Hybrid breeding in autogamous cereals. Theor. Appl. Genet. 125, 1087–1096 (2012).
Guo, T. et al. Optimal designs for genomic selection in hybrid crops. Mol. Plant 12, 390–401 (2019).
Jannink, J. L., Lorenz, A. J. & Iwata, H. Genomic selection in plant breeding: From theory to practice. Brief. Funct. Genom. 9, 166–177 (2010).
Falconer, D. S. & Mackay, T. F. C. Introduction to Quantitative Genetics 4th edn. (Benjamin-Cummings Pub Co., 1996).
Heffner, E. L., Lorenz, A. J., Jannink, J. L. & Sorrells, M. E. Plant breeding with genomic selection: Gain per unit time and cost. Crop Sci. 50, 1681–1690 (2010).
Nakaya, A. & Isobe, S. N. Will genomic selection be a practical method for plant breeding?. Ann. Bot. 110, 1303–1316 (2012).
Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
Schmid, K. J. & Thorwarth, P. Genomic selection in barley breeding. Biotechnol. Approaches Barley Improv. 69, 367–378 (2014).
Technow, F., Riedelsheimer, C., Schrag, T. A. & Melchinger, A. E. Genomic prediction of hybrid performance in maize with models incorporating dominance and population specific marker effects. Theor. Appl. Genet. 125, 1181–1194 (2012).
Technow, F. et al. Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197, 1343–1355 (2014).
Xu, S., Zhu, D. & Zhang, Q. Predicting hybrid performance in rice using genomic best linear unbiased prediction. Proc. Natl. Acad. Sci. 111, 12456–12461 (2014).
Wang, X. et al. Predicting rice hybrid performance using univariate and multivariate GBLUP models based on North Carolina mating design II. Heredity 118, 302–310 (2016).
Zhao, Y., Zeng, J., Fernando, R. & Reif, J. C. Genomic prediction of hybrid wheat performance. Crop Sci. 53, 802–810 (2013).
Haile, J. K. et al. Genomic selection for grain yield and quality traits in durum wheat. Mol. Breed. 38, 75 (2018).
Wu, P. Y., Tung, C. W., Lee, C. Y. & Liao, C. T. Genomic prediction of pumpkin hybrid performance. Plant Genome 12, 180082 (2019).
Sabouri, H. & Sajadi, S. J. Predicting hybrid rice performance using AIHIB model based on artificial intelligence. Sci. Rep. 12, 9709 (2022).
Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics 32, 69–84 (1975).
Henderson, C. R. Best linear unbiased prediction of breeding values not in the model for records. J. Dairy Sci. 60, 783–787 (1977).
Werner, C. R. et al. Genome-wide regression models considering general and specific combining ability predict hybrid performance in oilseed rape with similar accuracy regardless of trait architecture. Theor. Appl. Genet. 131, 299–317 (2018).
Xavier, A., Muir, W. M., Craig, B. & Rainey, M. Walking through the statistical black boxes of plant breeding. Theor. Appl. Genet. 129, 1933–1949 (2016).
Perez, P. & de los Campos, G. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495 (2014).
Endelman, J. B. & Jannink, J. L. Shrinkage estimation of the realized relationship matrix. G3 Genes Genomes Genet. 2, 1405–1413 (2012).
Su, G., Christensen, O. F., Ostersen, T., Henryon, M. & Lund, M. S. Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PLoS ONE 7, e45293 (2012).
Covarrubias-Pazaran, G. Genome-assisted prediction of quantitative traits using the R package sommer. PLoS ONE 11, e0156744 (2016).
Chia, J. M. et al. Maize HapMap2 identifies extant variation from a genome in flux. Nat. Genet. 44, 803–807 (2012).
Bradbury, P. J. et al. TASSEL: Software for association mapping of complex traits in diverse samples. Genet. Pop. Anal. 23, 2633–2635 (2007).
Zhao, Y. et al. Genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding. Proc. Natl. Acad. Sci. 112, 15624–15629 (2015).
Ou, J. H. & Liao, C. T. Training set determination for genomic selection. Theor. Appl. Genet. 132, 2781–2792 (2019).
Schrag, T. A. et al. Prediction of hybrid performance in maize using molecular markers and joint analyses of hybrids and parental inbreds. Theor. Appl. Genet. 120, 451–461 (2010).
Martin, J. M., Talbert, L. E., Lanning, S. P. & Blake, N. K. Hybrid performance in wheat as related to parental diversity. Crop Sci. 35, 104–108 (1995).

Acknowledgements

The authors thank two reviewers for their constructive comments, which helped to improve the content and presentation of the manuscript.

Funding

This research was supported by the Ministry of Science and Technology, Taiwan (grant number MOST 110-2118-M-002-002-MY2).

Author information

Authors and Affiliations

Department of Agronomy, National Taiwan University, Taipei, Taiwan
Szu-Ping Chen, Chih-Wei Tung, Pei-Hsien Wang & Chen-Tuo Liao

Contributions

S.P.C.: prepared Tables 3, 4, 5, 6, 7, 8, 9 and 10, and Figs. 1, 2, 3 and 4. C.W.T.: wrote the main manuscript text. P.H.W.: prepared Tables 1, 2, S1, S2, and Fig. S1. C.T.L.: wrote and edited the main manuscript text. All authors contributed to the article and approved the submitted version.

Corresponding author

Correspondence to Chen-Tuo Liao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chen, SP., Tung, CW., Wang, PH. et al. A statistical package for evaluation of hybrid performance in plant breeding via genomic selection. Sci Rep 13, 12204 (2023). https://doi.org/10.1038/s41598-023-39434-6

Received: 03 May 2023
Accepted: 25 July 2023
Published: 27 July 2023
DOI: https://doi.org/10.1038/s41598-023-39434-6
