Introduction

Hybrid plant breeding can potentially be used as a method that employs heterosis to boost yield stability, allow the combination of dominant major genes, and offer a built-in plant variety protection system1 https://www.pnas.org/doi/full/10.1073/pnas.1514547112. Several field, vegetable, and flower crops use hybrids, including maize, sorghum, and sunflower. Interestingly, hybrid rice has been adopted and hybrid wheat research is drawing new attention2. Therefore, it is important and challenging to develop a highly efficient approach for identifying potential parental lines and superior hybrid combinations from many possible candidates. To create such an approach, we constructed a prediction model to screen out the desired individuals based on genomic selection (GS)3. To facilitate practical applications, we also generated a software package to implement our proposed GS-based approach.

Diallel mating designs have been traditionally used to evaluate the combining ability of parental lines in hybrids and to predict hybrid performance on quantitative traits of interest. To analyze diallel crosses, the total genetic variability is often separated into the general combining ability (GCA) for parental lines, and the specific combining ability (SCA) for hybrid combinations. The GCA is a measure of additive gene activity that relates to the average performance of a particular inbred line in hybrid combinations. The SCA is a measure of combining ability that links to the non-additive effects, including dominance and epistatic effects. In addition, mid-parent heterosis (MPH) is defined as the difference between a hybrid’s performance and the average performance of its parental lines, while better-parent heterosis (BPH) is defined as hybrid performance superior to the higher or better parental line4. However, the number of crossing combinations can be prohibitively high for extensive testing in a field experiment.

Due to the availability of high-density single nucleotide polymorphism (SNP) markers across an entire genome, GS becomes a promising approach to reduce cost and accelerate breeding cycles for plant breeding5,6. The conceptual basis of GS is the utilization of a training population with known phenotype and genotype data to build a prediction model that uses individuals with known genotype data only to predict genomic estimated breeding values (GEBVs)7. This GS-based approach has been applied to predict hybrid performance for several crops, such as barley8, maize9,10, rice11,12, wheat13,14, and pumpkin15. More recently, hybrid rice performance based on parental characteristics was evaluated using artificial neural networks, adaptive neuro-fuzzy inference system, and support vector machine16.

In this study, we obtained the required estimates for hybrid performance evaluation based on a GBLUP model, which took both additive and dominance marker effects into account. The GBLUP model was built based on a training population with known phenotype and genotype data. Here, we proposed a Bayesian statistical algorithm for the parameter estimation. Three datasets that have been published in literature, including pumpkin (Cucurbita maxima), maize (Zea mays), and wheat (Triticum aestivum L.), were reanalyzed to illustrate the application of our proposed approach.

Materials and methods

The genomic selection-based approach

The GBLUP model

The GBLUP model considering additive plus dominance effects can be described as follows:

$${\varvec{y}}={1}_{n}\mu +{{\varvec{g}}}_{A}+{{\varvec{g}}}_{D}+{\varvec{e}},$$
(1)

where \({\varvec{y}}\) is the vector of the phenotypic values; \({1}_{n}\) is the unit vector of length n (here n is the number of phenotypic values); \({{\varvec{g}}}_{A}\) is the vector of genotypic values for the additive effects; \({{\varvec{g}}}_{D}\) is the vector of genotypic values for the dominance effects; and \({\varvec{e}}\) is the vector of random errors. It is assumed that \({{\varvec{g}}}_{A}\), \({{\varvec{g}}}_{D}\), and \({\varvec{e}}\) are mutually independent and follow multivariate normal distributions, denoted by \({{\varvec{g}}}_{A}\sim N\left(0, {{\sigma }_{A}^{2}{\varvec{K}}}_{A}\right)\), \({{\varvec{g}}}_{D}\sim N\left(0, {{\sigma }_{D}^{2}{\varvec{K}}}_{D}\right)\), and \({\varvec{e}}\sim N\left(0, {\sigma }_{e}^{2}{{\varvec{I}}}_{n}\right)\). Here, \({{\varvec{K}}}_{A}=\frac{1}{p}{({\varvec{X}}}_{A}{{\varvec{X}}}_{A}^{T})\) is the genomic relationship matrix for the additive effects, abbreviated as A-GRM; the variance component \({\sigma }_{A}^{2}\) represents the cumulative variability of additive marker effects, abbreviated as A-VC; \({{\varvec{K}}}_{D}=\frac{1}{p}{({\varvec{X}}}_{D}{{\varvec{X}}}_{D}^{T})\) is the genomic relationship matrix for the dominance effects, abbreviated as D-GRM; and the variance component \({\sigma }_{D}^{2}\) represents the cumulative variability of the dominance marker effects, abbreviated as D-VC. For the additive effects, the SNP at each locus is coded as − 1, 0, or 1 for the homozygote of the minor allele, the heterozygote, and the homozygote of the major allele, respectively. For the dominance effects, the marker score is coded as 1 for the heterozygote, and 0 for both homozygotes. Then, \({{\varvec{X}}}_{A}\) and \({{\varvec{X}}}_{D}\) are the standardized marker score matrices for the additive effects and dominance effects, respectively, and \(p\) is the number of the SNP markers.

Estimation for GEBVs and genomic heritability

Let \(\widehat{\mu }\) be the best linear unbiased estimate (BLUE) for \(\mu \), \({\widehat{{\varvec{g}}}}_{A}\) be the BLUP for \({{\varvec{g}}}_{A}\), and \({\widehat{{\varvec{g}}}}_{D}\) be the BLUP for \({{\varvec{g}}}_{D}.\) Then, \(\widehat{\mu }\), \({\widehat{{\varvec{g}}}}_{A}\), and \({\widehat{{\varvec{g}}}}_{D}\) can be obtained from the Henderson’s equations17:

$$\left[\begin{array}{ccc}n& {1}_{n}^{T}& {1}_{n}^{T}\\ {1}_{{\varvec{n}}}& {{\varvec{I}}}_{n}+{{{\varvec{K}}}_{A}^{-1}\lambda }_{A}& {{\varvec{I}}}_{n}\\ {1}_{{\varvec{n}}}& {{\varvec{I}}}_{n}& {{\varvec{I}}}_{n}+{{{\varvec{K}}}_{D}^{-1}\lambda }_{D}\end{array}\right]\left[\begin{array}{c}\widehat{\mu }\\ \begin{array}{c}{\widehat{{\varvec{g}}}}_{A}\\ {\widehat{{\varvec{g}}}}_{D}\end{array}\end{array}\right]=\left[\begin{array}{c}{1}_{n}^{T}y\\ y\\ y\end{array}\right],$$
(2)

where \({\lambda }_{A}={\sigma }_{e}^{2}/{\sigma }_{A}^{2}\) and \({\lambda }_{D}={\sigma }_{e}^{2}/{\sigma }_{D}^{2}\). Here, \({\lambda }_{A}\) and \({\lambda }_{D}\) can be replaced with suitable estimates for \({\sigma }_{e}^{2}\), \({\sigma }_{A}^{2}\), and \({\sigma }_{D}^{2}\), respectively denoted by \({\widehat{\sigma }}_{e}^{2}\), \({\widehat{\sigma }}_{A}^{2}\), and \({\widehat{\sigma }}_{D}^{2}\). The estimate for genomic heritability was then obtained as:

$${h}^{2}=\frac{{\widehat{\sigma }}_{A}^{2}+{\widehat{\sigma }}_{D}^{2}}{{\widehat{\sigma }}_{A}^{2}+{\widehat{\sigma }}_{D}^{2}+{\widehat{\sigma }}_{e}^{2}}.$$
(3)

In this study, the breeding population was composed of all possible hybrid combinations in a half diallel mating design. Let \({{\varvec{K}}}_{A}^{(bp)}\) and \({{\varvec{K}}}_{D}^{(bp)}\) respectively denote the A-GRM and D-GRM between the breeding population and the training population. Moreover, let \({\widehat{{\varvec{g}}}}_{A}^{(bp)}\) and \({\widehat{{\varvec{g}}}}_{D}^{(bp)}\) denote the BLUPs for the breeding population of additive and dominance effects, respectively. From the article18, \({\widehat{{\varvec{g}}}}_{A}^{(bp)}\) and \({\widehat{{\varvec{g}}}}_{D}^{(bp)}\) can be obtained as:

$${\widehat{{\varvec{g}}}}_{A}^{(bp)}={{\varvec{K}}}_{A}^{(bp)}{{\varvec{K}}}_{A}^{-1}{\widehat{{\varvec{g}}}}_{A},$$
(4)

and

$${\widehat{{\varvec{g}}}}_{D}^{(bp)}={{\varvec{K}}}_{D}^{(bp)}{{\varvec{K}}}_{D}^{-1}{\widehat{{\varvec{g}}}}_{D}.$$
(5)

The genomic estimated genotypic values for the individuals in the breeding population were then predicted by:

$${\widehat{{\varvec{y}}}}^{(bp)}={1}_{{N}_{1}}\widehat{\mu }+{\widehat{{\varvec{g}}}}_{A}^{(bp)}+{\widehat{{\varvec{g}}}}_{D}^{(bp)},$$
(6)

where \({N}_{1}\) is the number of hybrid combinations in the breeding population. Here, \({N}_{1}={C}_{2}^{{N}_{0}}\) with \({N}_{0}\) as the number of parental lines.

Estimation for GCA, SCA, MPH, and BPH

Let \({GCA}_{i}\) and \({GCA}_{j}\) separately denote the GCAs for the parental lines \({P}_{i}\) and \({P}_{j}\), and \({SCA}_{ij}\) denote the SCA for their hybrid combination PiPj. Moreover, let \({g}_{A}^{(ij)}\) and \({g}_{D}^{(ij)}\) denote the BLUPs for PiPj of additive and dominance effects, respectively. From the article19,

$${g}_{A}^{(ij)}={GCA}_{i}+{GCA}_{j},$$
(7)

and

$${g}_{D}^{(ij)}={SCA}_{ij}.$$
(8)

From Eq. (8), the BLUP for \({SCA}_{ij}\) was obtained as:

$${\widehat{SCA}}_{ij}={\widehat{g}}_{D}^{(ij)}.$$
(9)

Let

$$ \overline{G}_{A}^{\left( i \right)} = \frac{{\mathop \sum \nolimits_{j \ne i}^{{N_{0} }} \hat{g}_{A}^{{\left( {ij} \right)}} }}{{N_{0} - 1}} $$
(10)

and

$${\overline{G} }_{A}=\frac{{\sum }_{i=1}^{{N}_{0}}{\sum }_{j\ne i}^{{N}_{0}}{\widehat{g}}_{A}^{(ij)}}{{N}_{1}}$$
(11)

where \({\overline{G} }_{A}^{(i)}\) is the average over the additive genotypic values of the parental line i, and \({\overline{G} }_{A}\) is the average over all of the additive genotypic values. From Eq. (7), the BLUP for \({GCA}_{i}\) is given by:

$${\widehat{GCA}}_{i}=\frac{{(N}_{0}-1){\overline{G} }_{A}^{(i)}}{{N}_{0}-2}-\frac{{N}_{0}{\overline{G} }_{A}}{2({N}_{0}-2)}.$$
(12)

From the article15, the GEBV-based MPH and BPH for PiPj can be estimated by:

$${\widehat{MPH}}_{ij}={\widehat{SCA}}_{ij}$$
(13)

and

$${\widehat{BPH}}_{ij}={\widehat{SCA}}_{ij}-\left|{\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}\right|$$
(14)

where \(\left|{\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}\right|\) is the absolute value of (\({\widehat{GCA}}_{i}-{\widehat{GCA}}_{j}\)). Under the positive heterosis assumption, the value of MPH or BPH is larger, and the heterosis of the hybrid combination is stronger.

The Bayesian statistical algorithm

For a given training population with known phenotype and genotype data, a Bayesian Gibbs sampling (BGS) algorithm, modified from an algorithm presented in the article20, was used to estimate the required parameters. The algorithm can be described as follows.

  • Step 1: Set initial values for the parameters in the model.

The default values are given by:

\(\upmu =\overline{y }\) (the sample mean of the phenotypic values), \({{\varvec{g}}}_{A}={{\varvec{g}}}_{D}=0\), \({\sigma }_{e}^{2}=1\), and \({\sigma }_{A}^{2}={\sigma }_{D}^{2}=0.5\).

  • Step 2: Rewrite Eq. (2) as

    $$\left[\begin{array}{ccc}{{\varvec{C}}}_{11}& {{\varvec{C}}}_{12}& {{\varvec{C}}}_{13}\\ {{\varvec{C}}}_{21}& {{\varvec{C}}}_{22}& {{\varvec{C}}}_{23}\\ {{\varvec{C}}}_{31}& {{\varvec{C}}}_{32}& {{\varvec{C}}}_{33}\end{array}\right]\left[\begin{array}{c}{{\varvec{g}}}_{1}\\ {{\varvec{g}}}_{2}\\ {{\varvec{g}}}_{3}\end{array}\right]=\left[\begin{array}{c}{{\varvec{\gamma}}}_{1}\\ {{\varvec{\gamma}}}_{2}\\ {{\varvec{\gamma}}}_{3}\end{array}\right].$$
    (15)

Update \({{\varvec{g}}}_{i}\) by \({{\varvec{g}}}_{i}\sim N({{\varvec{g}}}_{i}^{*}, {\sigma }_{e}^{2}{{\varvec{C}}}_{ii}^{-1})\), where \({{\varvec{g}}}_{i}^{*}={{\varvec{C}}}_{ii}^{-1}({{\varvec{\gamma}}}_{i}-{{\varvec{C}}}_{i,-i}{{\varvec{g}}}_{-i})\) for i = 1, 2, 3. Here, \({{\varvec{C}}}_{i,-i}\) denotes \({{\varvec{C}}}_{i,j}\) for all \(j\ne i\); and \({{\varvec{g}}}_{-i}\) is \({{\varvec{g}}}_{j}\) for all \(j\ne i\).

  • Step 3: Calculate the vector of residuals as: \({\varvec{e}}={\varvec{y}}-{{\varvec{g}}}_{1}-{{\varvec{g}}}_{2}-{{\varvec{g}}}_{3}\).

  • Step 4: Update \({\sigma }_{e}^{2}\) as \({\sigma }_{e}^{2}=({{\varvec{e}}}^{T}{\varvec{e}}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}\), where \({\chi }_{n+{v}^{\boldsymbol{*}}}^{2}\) is the chi-square random variate with \(n+{v}^{\boldsymbol{*}}\) degrees of freedom; \({S}^{\boldsymbol{*}}=0.5V\) with V as the sample variance of the values in \({\varvec{y}}\); and \({v}^{\boldsymbol{*}}=5\).

  • Step 5: Update \({\sigma }_{A}^{2}\) as \({\sigma }_{A}^{2}=({{\varvec{g}}}_{A}^{T}{{\varvec{K}}}_{A}^{-1}{{\varvec{g}}}_{A}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}\); and \({\sigma }_{D}^{2}\) as \({\sigma }_{D}^{2}=({{\varvec{g}}}_{D}^{T}{{\varvec{K}}}_{D}^{-1}{{\varvec{g}}}_{D}+{S}^{\boldsymbol{*}}{v}^{\boldsymbol{*}})/{\chi }_{n+{v}^{\boldsymbol{*}}}^{2}\).

  • Step 6: Update the equations in Eq. (15) with \({\lambda }_{A}={\sigma }_{e}^{2}/{\sigma }_{A}^{2}\), and \({\lambda }_{D}={\sigma }_{e}^{2}/{\sigma }_{D}^{2}\).

  • Step 7: Repeat Steps 2–6 K times to generate a series of results over the K iterations, which are denoted by:

  • \({\mu }^{(k)}\), \({{\varvec{g}}}_{A}^{(k)}\), \({{\varvec{g}}}_{D}^{(k)}\), \({\sigma }_{A}^{2(k)}\), \({\sigma }_{D}^{2(k)}\), and \({\sigma }_{e}^{2(k)}\) for \(k=1, 2, \cdots , \mathrm{K}\).

  • Step 8: Discard the results from the first \(0.9\mathrm{K}\) iterations, and average the results from the remaining 0.1 K iterations. The number of iterations K is defaulted as 5000.

  • Step 9: Repeat Steps 1–8 M times to generate M sets of the averages of the parameters generated from Step 8. The number of chains M is defaulted as five.

  • Step 10: Average the resulting mean values of the parameters over the M chains, and the resulting averages are treated as the estimates for the parameters.

An R package called as EHPGS generated for executing the proposed approach is available from GitHub (https://github.com/spcspin/EHPGS). A referenced manual and a tutorial including a demonstration example are provided in the package.

A comparison study

The pumpkin dataset was analyzed using a two-stage approach in the article15, in which the authors first estimated GEBVs, SCAs, GCAs, MPHs, and BPHs based on a whole genome regression model using Bayes C estimation in the R package BGLR21. Then, they calculated A-GRM and D-GRM by the two different formulas22,23. The restricted maximum likelihood estimation (REML) method was performed for estimating the variance components by using another R package sommer24. A comparison of the results obtained from the two-stage approach and ours was discussed in the next section.

The Bayesian reproducing kernel Hilbert space (RKHS) method in BGLR is another Bayesian algorithm that has been commonly used to perform GEBV prediction for the GBLUP model in Eq. (1). To compare the use of the Bayesian RKHS method with our proposed BGS algorithm, the three datasets was reanalyzed by using BGLR. The priors specified in BGLR were the same as ours, the number of iterations was set to 10,000, the number of burn-in was fixed at 9000, and the number of chains was set to five (the BGLR function was repeatedly run five times). These settings are exactly the same as our algorithm in analyzing the datasets.

A simulation study

To further examine whether the proposed BGS algorithm can more accurately estimate known variance components compared to established methods, such as the REML method in sommer, and the Bayesian RKHS method in BGLR, a simulation study was conducted as follows. The estimated values for the model parameters obtained from the training data (displayed in Table 3) were used to generate 3000 sets of phenotype data for the training population in each dataset (119, 276, and 600 realized observations in each simulated dataset for the pumpkin, maize, and wheat datasets, respectively). For a stimulated dataset, the variance components were estimated by the REML, Bayesian RKHS, and our BGS methods.

A cross-validation analysis

A tenfold cross-validation analysis using empirical data was also performed to compare the accuracy on GEBV prediction among the three methods. There were 119 and 276 empirical observations available in the pumpkin and maize datasets, respectively. For the sake of computational cost saving, 500 individuals randomly selected from the 2556 available hybrids in the wheat dataset were used for this analysis. The procedure can be described as follows. Step 1: Each of the three datasets was partitioned into 10 exclusive clusters at random. Step 2: During the cross-validation process, each of the 10 clusters was progressively and alternately used as the testing set. At the same time, the remaining nine clusters were pooled as the training set. Step 3: After the GEBV prediction by each method, Pearson’s correlation between GEBVs and phenotypic values in the testing set was calculated for each dataset. Here, the procedure was repeated five times to generate 50 correlation coefficients for each dataset.

The genome datasets

Three datasets that have been published in literature were reanalyzed to illustrate the use of EHPGS.

Pumpkin dataset

A pumpkin dataset which contained 119 intra-crossing hybrid combinations of C. maxima with phenotypic values for fruit weight (FWT) (kg) was analyzed for evaluation of hybrid performance15. The phenotype data were historical data collected from 1988 to 2016. All the trials were conducted at a single location experiment in southern area of Taiwan. Every hybrid had six to ten observations at each time point, and the average of them was used as the phenotypic observation for the hybrid of the year. Because the phenotypic values of every hybrid were observed for more than one year, the different year effects were therefore removed based on the assumption that they were random effects following a normal distribution.

The germplasm collection of the pumpkin set consisted of 320 parental lines, which were classified into three clusters: C. maxima with 142 inbred lines, C. pepo with 60 inbred lines and C. moschata with 118 inbred lines. After SNP calling, 76,815 SNPs were extracted from the parental lines, and only 4,521 SNPs remaining for C. maxima after the filtering by missing rate ≥ 0.05, minor allele frequency (MAF) < 0.05, and a series of operations for determining linkage disequilibrium (LD) blocks. The 142 inbred lines produced \({C}_{2}^{142}=\mathrm{10,011}\) potential hybrid combinations in a half diallel mating design. The means adjusted from the year effects for the 119 C. maxima hybrids were used in the current study to build a GBLUP model for evaluating the performance of the 10,011 hybrid combinations.

Maize dataset

A maize dataset was analyzed to study the optimal designs for GS in hybrid crops, which consisted of 276 hybrids derived from 24 parental lines in a half diallel mating design2. The 24 diverse parents were classified into two groups according to the germplasm origin and a principal component analysis. The two groups were (i) the temperate and mixed (TM) group, consisting of 11 inbred lines (i.e., B73, B97, Ky21, M162W, Mo17, MS71, Oh43, OH7B, M37W, Mo18W, and Tx303); and (ii) the tropical and sub-tropical (TS) group consisting of the remaining 13 inbred lines (i.e., CML52, CML69, CML103, CML228, CML247, CML277, CML322, CML333, Ki3, Ki11, NC350, NC358, and Tzi8). There were \({C}_{2}^{11}=55\) hybrid combinations in the TM group, \({C}_{2}^{13}=78\) hybrids in the TS group, and 11 × 13 = 143 hybrids between the two groups. Three trait values, flowering time, ear height, and grain yield (YLD) (Mg/ha), were evaluated for all of the hybrids at two locations (i.e., Columbia, MO and Clayton, NC) in 2005 and 2006. In our study, the combined BLUP values from the two locations for YLD were evaluated.

Genotype data for the 24 inbred lines were extracted from the Maize HapMap V225 at www.panzea.org, which consisted of 10,296,310 SNP markers. The SNP markers were first filtered by missing rate ≥ 0.05 and MAF ≤ 0.1, resulting in 134,726 SNPs remaining. Missing genotypes were then imputed with the homozygote of the major allele. To screen out reliable SNPs for building a GBLUP model, the retained SNPs were further filtered by LD blocks. The LD parameter \({r}^{2}\) (i.e., the squared Pearson’s correlation coefficient) of the SNPs for each chromosome was estimated using TASSEL5.2.4126 with a sliding window = 10. A smooth function between \({r}^{2}\) and the physical distance (bp) was built using an R function loess.smooth( ) with a second-degree locally weighted polynomial regression. The LD decay of ten chromosomes is displayed in Fig. S1 of the Supplementary Materials. Filtering the 134,726 SNP markers by the LD block sizes if \({r}^{2}\) approached 0.2, resulting in 46,134 SNPs remaining. A SNP was also deleted if its corresponding column for the dominance effects was a zero vector. Finally, 30,239 SNP markers were retained for further analysis. In the current study, all 276 hybrids with known trait values were used as the training population for the prediction model construction.

Wheat dataset

A genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding was investigated, and the study was based on 135 advanced elite winter wheat lines27. A set of 1604 wheat hybrids produced from crosses among the 15 male lines and 120 female lines were then evaluated for grain yield (YLD) (Mg/ha) in 11 environments. Grain yield data for all \({C}_{2}^{135}=9045\) unique hybrids were predicted based on those of the phenotyped individuals. For the genotype data, the 135 lines were fingerprinted by using a 90,000 SNP array based on an Illumina Infinium array. After quality tests, 17,372 high-quality SNP markers were retained.

To study optimal designs for GS, 2556 hybrid combinations, produced by the half diallel mating design on 72 lines selected from the original 135 elite wheat lines, were analyzed in the article2. An optimal training population with 600 individuals, determined by the r-score criterion28, was used in the current study to build the GBLUP model for the performance evaluation on the 2556 hybrid combinations.

Results and Discussion

Pumpkin dataset

By the half diallel mating design, the 142 parental lines produced \({C}_{2}^{142}=\mathrm{10,011}\) hybrid combinations in the breeding population. For illustration purposes, we only reported the top 25 superior hybrid combinations with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 1; and the top 10 potential parental lines with the largest GCAs in Table 2. Table 1 illustrates the important finding that both \({MPH}_{ij}\) and \({BPH}_{ij}\) are greater than 0 for all of the selected hybrids, showing that they had better performance in FWT than both of their parents. More interestingly, every superior hybrid presented in Table 1 was derived from one or two of the potential parental lines presented in Table 2. Particularly, P026, the parental line with the highest GCA, involved the top 11 hybrids with the greatest GEBVs among the 25 selected hybrids.

Table 1 The top 25 superior hybrid combinations with the largest GEBVs for fruit weight (FWT) within a pumpkin population.
Table 2 The top 10 potential parental lines with the largest GCAs for fruit weight (FWT) within a pumpkin population.

The estimates for the variance components and genomic heritability are shown in Table 3. From the table, the estimates of the A-VC, D-VC, and genomic heritability are given by \({\widehat{\sigma }}_{A}^{2}=\) 0.306, \({\widehat{\sigma }}_{D}^{2}=0.159\), and \({h}^{2}=0.807\). The high heritability explains why the values of \({MPH}_{ij}\) and \({BPH}_{ij}\) in Table 1 are all positive, and indicates strong heterosis in FWT among the intra-crossing hybrid combinations of C. maxima.

Table 3 The estimates for the variance components, genomic heritability, and constant term in fruit weight (FWT) for a pumpkin dataset and in yield (YLD) for maize, and wheat datasets.

Maize dataset

There were \({C}_{2}^{24}=276\) hybrid combinations derived from the 24 parental lines.

For illustration purposes, we reported the top 15 superior hybrids with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 4; and the top 5 potential parental lines with the largest GCAs in Table 5. From Table 4, both \({MPH}_{ij}\) and \({BPH}_{ij}\) are greater than 0 for all of the selected hybrids, showing that they had better performance in YLD than both of their parents. A total of 12 out of the 15 selected hybrids belong to the inter-crossing group between TM and TS. From Table 5, the top five parental lines with the greatest GCAs are CML228, CML103, Mo17, B97, and B73, and involved all of the 15 superior parental lines, with the exception of MS71  Tzi8.

Table 4 The top 15 superior hybrid combinations with the largest GEBVs for grain yield (GYD) within a maize population.
Table 5 The top five potential parental lines with the largest GCAs for grain yield (GYD) within a maize population.

The estimates for the variance components and genomic heritability are also displayed in Table 3. From the table, the estimates of the estimates of the A-VC, D-VC, and genomic heritability are given by \({\widehat{\sigma }}_{A}^{2}=\) 0.434, \({\widehat{\sigma }}_{D}^{2}=0.420\), and \({h}^{2}=0.415\), partially explaining why the values of \({MPH}_{ij}\) and \({BPH}_{ij}\) in Table 4 are all positive, and showing that there is an obvious heterosis in YLD within the breeding population.

Wheat dataset

By the half diallel mating design, the 72 parental lines produced \({C}_{2}^{72}=2556\) hybrid combinations in the breeding population. For illustration purposes, we only reported the top 20 superior hybrids with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 6; and the top 10 potential parental lines with the largest GCAs in Table 7. The estimates for the variance components and genomic heritability are also displayed in Table 3. From Table 6, all of the \({MPH}_{ij}\) are greater than 0, showing that they had a larger YLD than the mean YLD of their parents. Most of the \({BPH}_{ij}\) are noticeably smaller than \({MPH}_{ij}\), probably because the additive effects \({(\widehat{\sigma }}_{A}^{2}=\) 0.066, Table 3) were stronger than the dominance effects (\({\widehat{\sigma }}_{D}^{2}=0.014\), Table 3). Moreover, 11 of the 20 BPHs are negative, showing that the corresponding hybrids were inferior to their better-parents. Every superior hybrid presented in Table 6 was derived from one or two of the potential parental lines presented in Table 7. Particularly, F102, the parental line with the highest GCA (Table 7), involved 17 of the 20 selected superior hybrids (Table 6).

Table 6 The top 20 superior hybrid combinations with the largest GEBVs for grain yield (GYD) within a wheat population.
Table 7 The top 10 potential parental lines with the largest GCAs for grain yield (GYD) within a wheat population.

In summary, the BPH values were consistently positive for the top hybrids in both the pumpkin and maize datasets, implying that there exists a strong and useful heterosis in the two crops. The valuable result can also be found in literature15,29. However, only a few of the top hybrids had a positive but too small BPH value in the case of wheat, indicating that the heterosis existing in this dataset may not be adequate for practical utility. A wheat hybrid has a small positive or negative BPH value because one of its parents is inferior30.

The correlation between phenotypic values and GEBVs

Scatter plots of all available phenotypic values (119, 276, and 2556 individuals in the pumpkin, maize, and wheat datasets, respectively) and their GEBVs in each dataset are displayed in Figs. 1, 2 and 3. The respective Pearson’s correlation coefficients are 0.9691, 0.6786, and 0.9445. From the figures, most of the selected superior hybrids appeared in the upper right-hand corners, meaning that the selected hybrids with higher GEBVs also have higher actual phenotypic values. This is a valuable result because phenotypic selection is usually costly and time-consuming for selective breeding. The great consistency exists between the results of genomic selection and phenotypic selection, supporting that the proposed GS-based approach can be recommended for practical applications.

Figure 1
figure 1

The scatter plot for all available phenotypic values (i.e., 119 individuals) and their GEBVs in the pumpkin dataset. The colored points represent the two hybrids out of the 25 selected superior hybrids. Note that the remaining 23 selected hybrids didn’t appear in the plot, because their phenotypic values were not available. Pearson’s correlation for these 119 points was calculated as \(\mathrm{r}=0.9691\).

Figure 2
figure 2

The scatter plot for all available phenotypic values (i.e., 276 individuals) and their GEBVs in the maize dataset. The triangle points represent the top 15 superior hybrid combinations with the highest GEBVs. Pearson’s correlation for these 276 points was calculated as \(\mathrm{r}=0.6786\).

Figure 3
figure 3

The scatter plot for all available phenotypic values (i.e., 2556 individuals) and their GEBVs in the wheat dataset. The colored points represent the top 20 superior hybrids with the highest GEBVs. Pearson’s correlation for these 2556 points was calculated as \(\mathrm{r}=0.9445\).

The results of the comparison study

The top 25 superior hybrids identified by the two-stage approach15, together with those identified by our proposed approach are displayed in Table S1 of the Supplementary Materials. The corresponding identified 10 potential parental lines are displayed in Table S2. Both sets of the results are highly consistent with each other. From Table S1, 18 hybrids were in common among the 25 hybrids selected by each approach, and the top six hybrids with the highest GEBVs were the same, even though the order was slightly different. Table S2 indicates seven potential parental lines in common among the 10 selected by each approach. The variance components for additive, dominance, random error effects, and genomic heritability estimated are 0.195, 0.119, 0.066, and 0.826, respectively, from the two-stage approach. The corresponding estimates by our approach are 0.306, 0.159, 0.111 and 0.807. Even though the two corresponding estimates are different from each other, the two estimates of the genomic heritability are fairly close.

Overall, the results of the identified top parental lines and hybrid combinations between the Bayesian RKHS method in BGLR and our BGS algorithm were highly consistent with each other. Pearson’s correlations between GEBVs and phenotypic values for the datasets are displayed in Table 8. From which, our proposed algorithm led to higher Pearson’s correlations in the pumpkin and three maize datasets, but almost equal in the wheat dataset. Additionally, the estimates for variance components and genomic heritability by using the Bayesian RKHS method are displayed in Table 9. In comparison with those obtained from our BGS algorithm (Table 3), BGLR resulted in relatively low genomic heritability.

Table 8 Pearson’s correlations between GEBVs and phenotypic values for the datasets obtained from Bayesian RKHS method in BGLR and our proposed BGS algorithm.
Table 9 The estimates for the variance components and genomic heritability in fruit weight (FWT) for a pumpkin dataset and in yield (YLD) for maize, and wheat datasets by using Bayesian RKSH method in BGLR.

The results of the simulation study and the cross-validation analysis

Side-by-side box-plots for the estimates of the variance components over the 3000 repetitions in the simulation study are displayed in Fig. 4. From the figure, the two Bayesian methods of BGS algorithm and the Bayesian RKHS method generally led to larger bias but smaller dispersion than the REML method in the estimation. The performance of the methods might be dependent on different dataset-variance-component combinations. For example, BGS algorithm tended to overestimate \({\sigma }_{A}^{2}\), but the Bayesian RKHS method was likely to underestimate it in the pumpkin dataset. Moreover, BGS algorithm had slightly better performance in \({\widehat{\sigma }}_{e}^{2}\), but worse in \({\widehat{\sigma }}_{D}^{2}\) than the Bayesian RKHS method in the dataset.

Figure 4
figure 4

Side-by-side box-plots for the estimates of variance components over the 3000 simulated datasets by using the three different methods. The known values of the variance components in the simulation study are indicated as a red dashed line.

The mean and the standard deviation over the 50 resulting values in the cross-validation analysis are displayed in Table 10. From the table, the three methods had quite close performance in the three datasets. BGS algorithm, the REML method, and the Bayesian RKHS method outperformed the others in the maize, wheat, and pumpkin datasets, respectively. However, the margins were very small. According to the above results, the REML method in sommer and the Bayesian RKHS method in BGLR were also imported in EHPGS as options for the GEBV prediction and variance component estimation.

Table 10 Means and standard deviations (in parentheses) over the 50 resulting Pearson’s correlation coefficients in the cross-validation analysis.

Conclusion

In this study, a software package called EHPGS was generated for identifying potential parental lines and superior hybrid combinations from a breeding population, which is composed of all possible hybrids produced by a half diallel mating design. A training population with known phenotype and genotype data is required to build the GBLUP model, and then a set of parental lines with known genotype data is also required to perform GEBV prediction for its derived hybrid combinations. Any dataset with such training population and parental line set can fit the package. For an input dataset, EHPGS generates GEBVs, SCAs, GCAs, MPHs, and BPHs for all potential candidates to achieve the task.