Introduction

Poaceae (Gramineae) is a very diverse family of flowering plants known as grasses with unquestionable economic importance, as they are sources of food, fuel and fodder. Among Poaceae crop species, sugarcane (Saccharum spp., SU) stands out as responsible for most of the global sugar production1,2 and for its high energy potential3, while tropical forage grasses (FGs), including Urochloa spp. and Megathyrsus maximus (syn. Panicum maximum), are the main food source for beef and dairy cattle in tropical and subtropical regions of the world4,5. In addition to their agricultural relevance, SU and FGs have complex and large polyploid genomes, making the application of marker-assisted selection more challenging than in diploid grasses such as rice, maize and sorghum6,7,8.

As with all crops, many traits of great agricultural importance are highly quantitative and controlled by many loci of very small effect and thus are easily affected by environmental conditions9. However, in these polyploid species, the effect of individual loci tends to be even smaller, as it is distributed among several alleles with different possible combinations, making the genotype-phenotype correspondence uncertain10,11,12. This fact, combined with other characteristics, such as the large and heterozygous nature of their genomes, which have a high content of repetitive regions7,13, explains the shortage of genomic resources for answering relevant biological questions and enabling the advance of the genetic improvement of these crops.

With recent advances in high-throughput sequencing and phenotyping technologies, a compelling approach known as genomic selection (GS) has been used to accelerate selection cycles through the application of statistical methods to predict the breeding value of individuals by using molecular markers, mainly single nucleotide polymorphisms (SNPs)14. First proposed by Bernardo15, this approach has shown great potential in studies on polyploid species of economic importance, as it provides relevant genetic gains considering complex traits controlled by several genes of minor effect16,17,18,19,20. However, there is still a gap in effectively applying GS in SU and FG breeding programs because no available strategy has shown enough improvement in accuracy to significantly increase yearly genetic gains21,22,23.

Among other factors, the accuracies of predictive models vary depending on the statistical and computational methods employed24. However, the several methods available for GS do not work satisfactorily for complex polyploids such as the species mentioned here, and as a consequence, the prediction of marker-trait associations is not accurate. Although representing a valuable alternative to improve predictive accuracy in GS, traditional machine learning (ML) methods have also been shown to be inefficient in increasing genetic gains and, thus, solving the existing limitations to incorporating GS into complex polyploid breeding schemes25,26. Therefore, there is a need to incorporate different learning strategies and algorithms into GS to improve predictive performance for quantitative traits in genetically complex species such as SU and FGs.

In this context, this study is aimed at the development of a predictive system for genomic prediction (GP) of complex traits in polyploid grasses, especially species with incomplete reference genomes and scarce genetic resources. Considering the difficulty of capturing species’ genomic singularities through linear regression techniques and creating predictive models with high accuracies, we incorporated different ML methodologies into an ensemble system composed of classification and regression models. The applicability and potential of this joint learning approach were assessed with several complex traits from three different biparental populations of SU and FGs (U. decumbens and M. maximus) and contrasted with those of traditional linear regression models for GS. Our study provides a novel methodology with strong potential to be implemented in polyploid grass breeding programs, assisting in reducing breeding cycles and increasing genetic gain.

Materials and methods

All the predictive models used in this study were applied to three different biparental populations (\(\mathrm {F}_1\) hybrids): (Pop1) 219 individuals generated from a cross between SU commercial varieties, (Pop2) 239 individuals from a cross between a commercial variety and a genotype of U. decumbens (2n = 4x = 36), and (Pop3) 136 individuals from a cross between a commercial variety and a genotype of M. maximus (2n = 4x = 32). The plant material employed in the present study are in compilance with institutional, national, and international guidelines and legislation. The same data analysis procedures were applied for all populations, with specific differences regarding the genotyping and bioinformatics procedures. Different traits were evaluated for these species, and therefore, specific mixed effect models were estimated for each scenario.

Grass populations

Pop1 hybrids were generated from a cross between the elite clone IACSP95-3018 (female parent) and the commercial variety IACSP93-3046 (male parent). This population was developed by the Sugarcane Breeding Program of the Agronomic Institute of Campinas27. This population was planted in three locations: (i) Ribeirão Preto, São Paulo, Brazil (21\(^{\circ }\) 12\('\) 0.6\(''\) S, 47\(^{\circ }\) 52\('\) 21.5\(''\) W, 546.8 m), in 2005; (ii) Sales de Oliveira, São Paulo, Brazil (20\(^{\circ }\) 46\('\) 19\(''\) S, 47\(^{\circ }\) 50\('\) 16\(''\) W, 730 m), in 2007; and (iii) Piracicaba, São Paulo, Brazil (22\(^{\circ }\) 42\('\) 36.6\(''\) S, 47\(^{\circ }\) 37\('\) 58.3\(''\) W, 554 m), in 2011. Experimental fields were established following an augmented block design with five blocks (44 individuals per block) and plots with 1-m rows and 1.5 m plant spacing. Each individual was evaluated for stalk diameter (SD) and stalk height (SH) in a sample with 10 stalks from individual plots28. These evaluations were performed in 2008 (plant cane) and 2009 (ratoon cane) in location (ii) and (iii), and in 2012 (plant cane), 2013 (ratoon cane) and 2014 (ratoon cane) in location (i).

Pop2 and Pop3 were generated and evaluated by the Embrapa Beef Cattle (Brazilian Agricultural Research Corporation), located in Campo Grande, Mato Grosso do Sul, Brazil (20\(^{\circ }\) 27\('\) S, 54\(^{\circ }\) 37\('\) W, 530 m). Pop2 originated from a cross between U. decumbens D62 (cv. Basilisk) as the male apomictic parent and U. decumbens D24/27 as the female sexual parent obtained by tetraploidization with colchicine29. Plants were evaluated for the following agronomic traits: regrowth capacity (RC), field green weight in kg/ha (FGW), total dry matter (TDM) in kg/ha, leaf dry matter (LDM) in kg/ha, leaf percentage (LP), and the leaf stem ratio (LSR), as described by Mateus et al.30. Briefly, the field experimental design followed a simple lattice of 18 x 18 meters with two replications, seven controls and five plants per plot. These traits were evaluated for seven harvests in 2012: two in the dry season and five in the rainy season.

Pop3 was obtained from a cross between M. maximus cv. Mombaça, the apomictic parent, and an obligate sexual parent, M. maximus S10. According to Deo et al.31, this population was evaluated in an augmented block design with 160 regular treatments distributed in eight blocks with two replicates. Each block was composed of 22 plots with 20 individuals and two controls. Genotypes were evaluated for several agronomic traits: green matter (GM), stem dry matter (SDM), percentage of leaf blade (PLB), TDM, LDM, and RC. Phenotyping was performed in six harvests (three in 2013 and three in 2014), with the exception of phenotyping of RC, which was performed for three harvests (one in 2013 and two in 2014).

Phenotypic data analyses

Raw phenotypic data were processed according to the mixed linear models described by Aono et al.27, Ferreira et al.32, and Deo et al.31 for Pop1, Pop2 and Pop3 respectively. Trait heritabilities were calculated based on model variances considering \(H^2=\sigma ^2_g/\sigma ^2_p\), where \(\sigma ^2_g\) and \(\sigma ^2_p\) are the genetic and phenotypic variances, respectively. Pop1 data were evaluated in R statistical software33 using breedR v.0.1234 and bestNormalize35 for data normalization. We modeled each trait considering the following statistical model:

$$\begin{aligned} Y_{ijrkm}=\mu +L_k+H_m+LH_{km}+B_{j(km)}+G_{i(km)}+e_{ijrkm}, \end{aligned}$$
(1)

where each trait measure \(Y_{ijrkm}\) of the rth replicate of the ith genotype was evaluated at the kth location of the jth block in the mth year. The trait mean is given by \(\mu\), and the contributions of the kth location (\(L_k\)), the mth harvest (\(H_m\)), the jth block at the kth location in the mth year (\(B_{j(km)}\)), and the interaction between the kth location and mth harvest (\(LH_{km}\)) were modeled as fixed effects. The effects of the genotype \(G_{i(km)}\) and the residual error \(e_{ijrkm}\) were included as random effects.

Pop2 phenotypic evaluations were performed using ASReml-R v336,37 and ASRemlPlus38 software. The traits were modeled with the following model:

$$\begin{aligned} Y_{ijmr}=\mu +R_r+H_m+B_{j(r)}+G_i+GH_{im}+e_{ijkm}, \end{aligned}$$
(2)

where each trait measure \(Y_{ijmr}\) of the ith genotype was evaluated in the jth block in the mth harvest considering the rth replicate. The trait mean is given by \(\mu\), and the contributions of the rth replicate (\(R_r\)) and the mth harvest (\(H_m\)) were modeled as fixed effects. The contributions of the ith genotype (\(G_i\)), the jth block in the rth replicate (\(B_{j(r)}\)), and the interaction between the ith genotype and the mth harvest (\(GH_{im}\)) were included as random effects, as well as the residual error \(e_{ijkm}\).

For Pop3, a longitudinal linear mixed model was fit using ASReml-R36, with Box-Cox transformation39 for data normalization. The following model was employed:

$$\begin{aligned} Y_{ijmr}=\mu +H_m+R_{r(m)}+B_{j(m)}+RB_{rj(m)}+G_{i(m)}+e_{ijmr}, \end{aligned}$$
(3)

where each trait measure \(Y_{ijmr}\) of the ith genotype was evaluated in the jth block in the mth harvest considering the rth replicate. The trait mean is given by \(\mu\), and the contributions of the mth harvest (\(H_m\)) and the rth replicate in the mth harvest (\(R_{r(m)}\)) were modeled as fixed effects. The contributions of the jth block in the mth harvest (\(B_{j(m)}\)), the interaction of the rth replicate and jth block in the mth harvest (\(RB_{rj(m)}\)), the ith genotype (or control) in the mth harvest (\(G_{i(m)}\)) and the environmental error (\(e_{ijmr}\)) were modeled as random effects.

The best linear unbiased predictors (BLUPs) (Pop1) and corrected phenotypes (Pop2 and Pop3) were estimated, rescaled to a new range of values between 0 and 1 using min-max normalization and used for phenotypic predictions. For a set of values v, the transformed values f(v) are:

$$\begin{aligned} f(v)=\frac{v-min(v)}{max(v)-min(v)}. \end{aligned}$$
(4)

Visual inspections of phenotypic data distributions were performed using the ggplot2 package40 in R. Trait correlations were estimated with R Pearson correlation coefficients by using the PerformanceAnalytics package41 in R.

Genotyping procedures

As the genotyping approach, we employed a genotyping-by-sequencing (GBS) strategy. GBS libraries were prepared following the protocol of Elshire et al.42 for Pop1 and Pop2 and the protocol of Poland et al.43 for Pop3. DNA extraction from Pop1 was performed using a cetyltrimethylammonium bromide (CTAB)-based protocol44, and two 96-plex GBS libraries were constructed using PstI, with two samples per parent. Five sequencing runs were performed with the Illumina GAIIx and Illumina NextSeq systems27. Genomic DNA of Pop2 was extracted using the DNeasy Plant Kit (QIAGEN, Hilden, Germany). Two 96-plex GBS libraries were constructed using NsiI and five replicates per parent and sequenced on the NextSeq 500 platform (Illumina, San Diego, CA, USA)32. DNA extraction from Pop3 was performed following a CTAB-based method45 with slight modifications. Then, 96-plex GBS libraries were constructed using a combination of a rarely cutting enzyme (PstI) and a frequently cutting enzyme (MspI), with 12 replicates for each parent. Pop3 libraries were also sequenced on the NextSeq 500 platform31.

SNP calling in SU was performed by coupling several bioinformatics pipelines and selecting the intersecting markers by the TASSEL-GBS v.4 pipeline46 modified for polyploids47 and at least one other SNP caller, including SAMtools version 1.648, Stacks version 2.349, Genome Analysis Toolkit (GATK) version 3.7 with the Haplotype Caller algorithm50, and FreeBayes version 1.1.0-351. GBS reads were filtered and preprocessed using FastX-Toolkit scripts52, and comparative alignments were performed against the PhiX genome using BLASTn53, as described by Aono et al.27. Read alignment was performed with BWA-MEM version 0.7.1254 and the methyl-filtered genome of the sugarcane cultivar SP70-114355, retaining only uniquely mapped reads. For GATK, SAMtools and FreeBayes, a common pipeline was established.

For Pop2 and Pop3, SNP calling was performed using TASSEL-4-POLY. Because this pipeline requires a reference genome for SNP calling, we used the Setaria viridis v1.0 and Panicum virgatum v1.0 genomes as references to align the GBS reads of Pop2 and Pop3, respectively, in Bowtie2 v.2.3.156. Both reference genomes were produced by the US Department of Energy Joint Genome Institute and are available from the Phytozome database (http://phytozome.jgi.doe.gov/)57. In this step, the following settings were used for both populations: very-sensitive-location, a limit of 20 dynamic programming problems and a maximum of 4 times to align a read.

After selecting the SNP sets for each population, we kept only biallelic variants with a minimum depth of 50 reads per individual and a maximum of 25% missing data. These markers were represented as allele proportions by calculating the ratio between the number of reads for the reference allele and the total number of reads27. Missing data were imputed as means.

Multivariate analyses

To evaluate genotypic and phenotypic data profiles across populations, we performed principal component analyses (PCAs), t-distributed stochastic neighbor embedding (t-SNE) analyses58, and redundancy analyses (RDAs)59. Scatter plots were constructed using the ggplot2 R package. We evaluated phenotypic data using PCA and genotypic data with t-SNE in the Rtsne R package60, considering a perplexity of 30. We combined both datasets with RDA, which was implemented in the vegan R package61. RDA constrained axes were evaluated with the analysis of variance (ANOVA) F-statistic, and to identify SNPs putatively under association with the phenotypes, we used a three-standard deviation cut-off (p-value of 0.0027).

Additionally, we performed a set of clustering analyses for each trait, considering hierarchical (Ward’s minimum variance with and without squared dissimilarities, herein referred to as WardWSD and WardWoSD, respectively), single, complete, unweighted pair-group method with arithmetic averaging (UPGMA), weighted pair-group method with arithmetic averaging (WPGMA), weighted pair-group method with centroid averaging (WPGMC), unweighted pair group method with centroid averaging (UPGMC) and nonhierarchical (K-means) strategies together with pairwise Euclidean distances. For each of the clustering methods and traits, we calculated the clustering index by using the NbClust R package62, defining the best clustering scheme as that most indicated by the indexes in a cluster number range of 2 to 10. Cluster visualization was performed with the ggtree R package63 and used to create intervals for each phenotype. The best method was defined through visual inspections considering the trait distribution and the cluster configurations on the heatmaps created.

Single regression genomic prediction

We evaluated the prediction accuracies of traditional models for GP and ML-based approaches. As traditional approaches, we selected (a) Bayesian ridge regression (BRR)64; (b) Bayesian reproducing kernel Hilbert spaces regression with kernel averaging (RKHS-KA)65, considering different kernels with bandwidth parameters of 15M, 5M, and 1M, where M represents the median squared Euclidean distance across genotypes; and (c) a single-environment, main genotypic effect model with a Gaussian kernel (SM-GK)66. BRR and RKHS-KA models were implemented in the BGLR R package67, and the SM-GK model was implemented in the BGGE R package66. Given a genotype matrix (SNP data), codified using allele proportions, with p loci and n individuals, the BRR models were estimated considering the following equation:

$$\begin{aligned} {{\textbf {y}}}={{\textbf {1}}}\mu +{{\textbf {Z}}}\varvec{\gamma }+{{\textbf {e}}}, \end{aligned}$$
(5)

where \({{\textbf {y}}}\) represents the response measures, \(\mu\) the overall population mean, \({{\textbf {e}}}\) the residuals (\({{\textbf {e}}}\sim MVN({{\textbf {0}}}_n,{{\textbf {I}}}\sigma ^2_e)\)), \({{\textbf {Z}}}\) the genotype matrix, and \(\varvec{\gamma }\) the vector of SNP effects with the same normal prior distribution for all loci (\(\gamma _i|\sigma ^2_\gamma \sim N(0,\sigma ^2_\gamma )\)) and the hyperparameter following a scaled inverse chi-squared hyperprior distribution (\(\sigma ^2_\gamma |d.f._\gamma ,S_\gamma \sim \chi ^{-2}(d.f._\gamma ,S_\gamma )\), where \(\sigma ^2_\gamma\) is the genetic variance, d.f. the degrees of freedom and \(S_\gamma\) the scale parameter). In SM-GK, considering the same equation, \({{\textbf {Z}}}\) is the incidence matrix for random genetic effects, and \(\varvec{\gamma }\) the genetic effects considering \(\gamma _j|\sigma ^2_\gamma \sim N(0,\sigma ^2_\gamma {{\textbf {K}}})\), where the matrix \({{\textbf {K}}}\) is computed through a Gaussian kernel (GK) function. For pairwise \({{\textbf {K}}}\) calculations between two genotypes \(x_i\) and \(x_{i'}\), \(GK(x_i,x_{i'})=\exp {\{\frac{-h}{p}\sum _{j=1}^{p}{(x_{i,k}-x_{i',k})^2}\}}\), where h is the bandwidth parameter (herein considered 1 for SM-GK). In RKHS-KA, a multikernel model is fitted with the same GK but considering separate random effects for three established kernels. The posterior distribution of the models was assessed with the Gibbs sampler, using 20,000 interactions and discarding 2000 cycles (burn-in).

For ML-based approaches, we selected regression methods based on the following algorithms: (i) k-nearest neighbors (KNN)68, (ii) support vector machine (SVM)69, (iii) random forest (RF)70, and (iv) AdaBoost71. All of these algorithms were implemented with the scikit-learn Python v.3 module72. For KNN (\(k=5\)), pairwise Euclidean distances with uniform weights were considered. In SVM regression, a radial basis function kernel (\(K_{SVM}\)) was used. Considering two genotypes \(x_i\) and \(x_{i'}\), \(K_{SVM}(x_i,x_{i'})=\exp \left( -\theta ||x-x'||^2\right)\), with the free parameter \(\theta\) defined as \(1/(p\times \sigma ^2_Z)\), where p represents the number of loci and \(\sigma ^2_Z\) the variance of the genotype matrix \({{\textbf {Z}}}\). The RF meta-estimator was built considering 100 decision trees for random data splits evaluated through mean squared errors (MSEs) calculated as \(MSE=\frac{1}{n}\sum _{i=1}^n{(y_i-{\hat{y}}_i)^2}\), where \(y_i\) and \({\hat{y}}_i\) are the real and the predicted values, respectively, across n individuals in data split k (from 1 to 100). Tree induction was performed considering a maximum tree height of p (the number of loci). For AdaBoost, the base estimator was a decision tree regressor with a linear loss function for weighting boosting interactions.

All these models were evaluated under a cross-validation (CV) scenario with 50 10-fold repeats and the leave-one-out (LOO) approach. For each trait, predictive accuracies were estimated as MSEs and R Pearson correlations and visualized with the ggplot2 R package. The models’ performances were compared with ANOVA followed by Tukey’s multiple comparisons test (p-value of 0.05), implemented in the agricolae R package73.

Phenotypic interval prediction

In addition to single regression genomic prediction, we evaluated the performance of classification models for predicting phenotypic intervals. The predictive capabilities were estimated using 10-fold CV repeated 50 times with stratification based on the intervals. The following ML algorithms were used: (i) SVM69, (ii) RF70, (iii) multilayer perceptron (MLP) neural network74, and (iv) Gaussian naive Bayes (GNB)75. All these models were estimated using the scikit-learn Python v.3 module72 with the default parameters to provide a baseline score.

Feature selection

In this work, we used feature selection (FS) techniques to extract loci with putative phenotypic associations considering higher predictive accuracies. For this task, we tested the performance of the ML-based regression and classification models described in previous sections for predicting the phenotypes and their interval values considering scenarios with and without (the whole marker dataset) FS in a 10-fold CV repeated 50 times.

For classification FS, we employed the approach described by Aono et al.27. As an FS system, we combined the results of (i) L1-based FS with a linear support vector classification system69, (ii) ANOVA-based univariate FS (p-value of 0.05), and (iii) gradient tree boosting (GTB)76. In this approach, a locus is retrieved by the system if identified as relevant for classification in at least two out of the three FS techniques employed. This methodology was extended for regression FS, considering models adapted for a continuous response variable. Techniques (i) and (iii) were changed, considering SVM and GTB for regression, and approach (ii) was replaced with Pearson correlation (p-value cut-off of 0.05).

Marker annotation

For each phenotype, we retrieved the selected markers and obtained a genomic window of 1000 base pairs upstream and downstream in the reference genome. With these sequences, we performed comparative alignments against coding DNA sequences (CDSs) of fourteen different species from the Poaceae family (Brachypodium distachyon, B. hybridum, B. silvatium, Hordeum vulgare, Oryza sativa, Oropetium thomaeum, Panicum hallii, P. virgatum, Sorghum bicolor, Setaria italica, S. viridis, Triticum aestivum, Thinopyrum intermedium and Zea mays) and Arabidopsis thaliana extracted from the Phytozome v.13 database57. All the correspondences were used to summarize related Gene Ontology (GO) terms using the Revigo tool77 and Cytoscape software78.

Joint learning

With the selected traits, we contrasted the GP methods using traditional statistical methodologies (BRR, RKHS-KA, and SM-GK) and the entire SNP dataset, with the construction of SM-GK models based on the SNPs selected from the FS approaches. We examined FS-selected SNPs considering the established classification (C) and regression (R) techniques, and used the SNPs identified in the intersection of at least two (C2 or R2) or three (C3 or R3) FS methodologies. The tested combinations were SM-GK models estimated based on (i) SNPs selected through the intersection of at least two classification FS techniques (C2); (ii) SNPs selected through the intersection of three classification FS techniques (C3); (iii) SNPs selected through the intersection of at least two regression FS techniques (R2); (iv) SNPs selected through the intersection of three regression FS techniques (R3); (v) the intersection of SNPs selected through C2 and R2 (ICR2); (vi) the union of SNPs selected through C2 and R2 (CR2); (vii) the union of SNPs selected through C3 and R3 (CR3); (viii) the entire set of markers including C3 as fixed effects (C3F); and (ix) the entire set of markers including R3 as fixed effects (R3F).

The inclusion of markers as fixed effects in SM-GK models79,80 was performed with the equation:

$$\begin{aligned} {{\textbf {y}}}={{\textbf {1}}}\mu +{{\textbf {X}}}\alpha +Z\gamma +e, \end{aligned}$$
(6)

where y represents the response measures, \(\mu\) the overall population mean, X the genotype matrix with the FS-associated markers, Z the incidence matrix for random genetic effects (estimated without the inclusion of X), and \({{\textbf {e}}}\) the residuals. The marker effects in the FS set are represented by \(\varvec{\alpha }\), and the genetic effects of \(\varvec{\gamma }\) follow the distribution described above.

To compare predictive performance between the developed models and the other techniques, we employed the same CV scenarios described above (LOO and 10-fold), retrieving the MSE measures and the Pearson correlation coefficients. Correlation plots were constructed with the ggplot2 R package.

Simulation approaches

To evaluate the proposed approach in larger populations in a controlled scenario, we performed simulations using AlphaSimR software81. We selected an entire wheat breeding program82,83 for creating genotypic (21 chromosomes with 1000 SNPs each and coded as 0 (ancestral homozygote), 1 (heterozygote) and 2 (derived homozygote)) and phenotypic data (yield). For simulation scenarios, we considered different quantities of QTLs per chromosome (2, 10, 100 and 1000) with additive effects on yield sampled from a normal distribution (mean of 4 and variance of 0.1). Using such a configuration, 70 inbred lines were created and used for 20 years of phenotypic selection, with the first 10 years discarded as burn-in. The environmental variance was set to 0.4 and the genotype-environment interaction variance was set to 0.2.

The inbred lines were used to create 100 biparental populations with 100 individuals each, for a total of 10,000 genotypes. These were selected using (i) a preliminary yield trial (PYT), (ii) an advanced yield trial (AYT), and (iii) an elite yield trial (EYT). To advance a genotype to the PYT, we created double haploids of the 10,000 lines, visually evaluated in a head row nursery (heritability of 0.1) to select the best 1000 genotypes (10 per family). The best 100 lines in the PYT were selected for the AYT considering an unreplicated trial (heritability of 0.2), and the best 10 in the AYT were advanced to the EYT using a small multilocation replicated trial (heritability of 0.5). From the 10 lines in the EYT, the released variety was selected using a large multilocation replicated trial in 2 consecutive years (heritability of 0.67).

From the last 10 years of phenotypic selection, we selected the genotypic data of the PYT with the corresponding phenotypes simulated to create datasets for evaluating the proposed approach. From a total of 10,000 lines, we sampled different numbers of individuals (100, 250, 500, 1000, 2000, 5000 and 10,000) and performed 10-fold CV evaluations for GP.

Results

Phenotypic and genotypic data analyses

From the defined mixed models, we could estimate, for each trait and population, the corresponding BLUPs and corrected phenotypes (Supplementary Figs. S1S3), which were used for GP. Although there were some evident correlations between traits (Supplementary Figs. S4S6), most of the evaluated phenotypes clearly represented a different prediction task for the development of models. There was a variable range of heritabilities associated with each trait, ranging from 0.49 to 0.51 in Pop1, from 0.52 to 0.81 in Pop2, and from 0.19 to 0.64 in Pop3 (Table 1).

Table 1 Heritability and cluster configuration of each trait across the different traits separated according to the populations: (Pop1) cross between sugarcane commercial varieties, (Pop2) cross between a commercial variety and a genotype of Urochloa decumbens, and (Pop3) cross between a commercial variety and a genotype of Megathyrsus maximus.

A total of 182 (out of 219), 219 (out of 239) and 138 (out of 138) individuals were genotyped for Pop1, Pop2, and Pop3, respectively. For each population, we generated the following number of filtered biallelic SNP markers: (i) 14,540 in Pop1 (from a total of 137,757 raw SNPs), (ii) 4548 in Pop2 (from 9354), and (iii) 5137 in Pop3 (from 23,637). For all these SNP sets, we observed a significant loss in the number of filtered markers (90% for Pop1, 51% for Pop2, and 78% for Pop3), which was expected due to the large quantity of missing data in GBS experiments and the pseudoreference genomes employed in the SNP calling step.

For each of the traits evaluated, we used nine different clustering methods and considered the group configuration as that most indicated by clustering indexes calculated with the NbClust package (with the cluster number ranging between 2 and 10). All nine clustering schemes identified for each trait were compared through visual inspections using dendrograms and heatmaps (Supplementary Figs. S7S9). The most suitable group configuration for separating the phenotypic values was assigned for each trait, resulting in different cluster quantities (Table 1). In Pop1, the WardWoSD method was the best clustering strategy. For Pop2, the WardWoSD method was also most appropriate for most of the traits, except for LSR and TDM, which were evaluated using hierarchical complete clustering and the WardWSD approach, respectively. For Pop3, all the traits were evaluated with the WardWSD approach.

To provide an overview of the genotypic and phenotypic data, we performed different multivariate analyses. For FG populations, contaminating individuals might be introduced during crosses, which could be detected by such analyses84. As expected, we found putative contaminants in Pop2 though t-SNE analysis performed on genotypic data (Supplementary Fig. S10). The individuals with distinct profiles across the Cartesian plane were excluded from the dataset and not considered in further analyses (18 genotypes located apart from most of the population in the t-SNE biplot were discarded).

With the final phenotypic datasets, we performed PCAs (Fig. 1), which were used to color and size the t-SNE results from genotypic data. These plots demonstrate the lack of correspondence between the genotypic and phenotypic datasets, corroborating the difficulty of developing predictive models for such associations in the populations under analysis. By considering linear dependency across SNPs and traits, RDAs were performed, allowing the identification of a small number of putative linear associations (233 in Pop1, 37 in Pop2, and 21 in Pop3), evidencing the need for alternative approaches for categorizing the genomic regions linked to the phenotypes analyzed.

Figure 1
figure 1

Multivariate analyses performed on both genotypic and phenotypic data for the populations of (A) sugarcane (stalk diameter (SD) and stalk height (SH)), (B) Urochloa decumbens (field green weight (FGW), leaf dry matter (LDM), leaf percentage (LP), leaf stem ratio (LSR), regrowth capacity (RC), and total dry matter (TDM)), and (C) Megathyrsus maximus (green matter (GM), LDM, percentage of leaf blade (PLB), RC, stem dry matter (SDM), and TDM). Principal component analyses (PCAs) were performed with phenotypic data; t-distributed stochastic neighbor embedding (t-SNE), with genotypic data; and redundancy analyses (RDAs), with both datasets.

Single regression genomic prediction

The predictive performance for each trait was evaluated with seven different approaches for the regression task: BRR, RKHS-KA, SM-GK, AdaBoost, KNN, RF, and SVM. By employing 10-fold CV 50 times, we observed superior results for BRR, RKHS-KA, SM-GK and SVM (Supplementary Figs. S11S13, Supplementary Table S1). Although not all four methods ranked best in terms of predictive performance for SH (Pop1), RC (Pop2), and PLB (Pop3), there were no pronounced differences between their accuracies and those of the highest ranking models for these traits.

For BRR, RKHS-KA, SM-GK and SVM, we also performed LOO evaluations (Table 2). The predictive performances were small even in a CV scenario with a larger training dataset. The highest observed R values were 0.38, 0.32 and 0.21 for SH in Pop1 (SM-GK and SVM), LDM in Pop2 (SVM), and SDM in Pop3 (SVM), respectively. Additionally, we did not observe clearly superior performance for any model, which would have allowed us to select a specific approach for fitting the regression model. For each trait, one of the four selected methods presented a slightly superior accuracy compared to that of the other three, as already observed in the 10-fold CV evaluations (Supplementary Figs. S11S13). Even with the best accuracy values observed, SH in Pop1, LDM in Pop2 and SDM in Pop3 could not be efficiently predicted, once again demonstrating the need for the development of suitable regression methods for complex polyploid grasses.

Table 2 Accuracies for Bayesian ridge regression (BRR), the reproducing kernel Hilbert space with kernel averaging (RKHS-KA) model, the single-environment, main genotypic effect model with a Gaussian kernel (SM-GK), and support vector machine (SVM)-based regression across the different traits separated according to the populations: (Pop1) cross between sugarcane commercial varieties, (Pop2) cross between a commercial variety and a genotype of Urochloa decumbens, and (Pop3) cross between a commercial variety and a genotype of Megathyrsus maximus.

Feature selection

To improve predictive accuracies, we combined different FS methods for both classification (C) and regression (R) techniques. The regression problem was configured considering trait value prediction, and classification, considering the prediction of the clusters defined for each trait. To supply a more restricted group of markers, we used two different approaches to marker selection: the intersection between the three FS techniques used in classification (C3) or regression (R3) and the intersection of at least two out of the three FS methods (C2 and R2). Each approach generated a different quantity of markers when considering the regression and classification problems separately (Supplementary Table S2).

Regression-based FS generated larger quantities of markers, ranging from 898 to 980 in Pop1, from 242 to 367 in Pop2, and from 266 to 373 in Pop3. Classification-based FS, on the other hand, enabled the identification of more restricted marker subsets (from 208 to 211 in Pop1, from 86 to 117 in Pop2, and from 58 to 133 in Pop3). Interestingly, approximately half of the markers found for classification were also found for regression, showing the compatibility of these approaches. As expected, such FS methods enabled higher accuracies for ML prediction and the categorization of phenotype-associated genomic regions with diverse biological functions (Supplementary Figs. S14S41).

Even though the ML models for predicting the different traits differed in suitability, more pronounced differences were observed when changing the marker dataset and not the ML algorithm. For some trait-model combinations, the more effective strategy was employing C2/R2, and for others, C3/R3. Regarding the biological process GO terms associated with the genes found in these marker regions, there was a clear core set of common interactions for each trait when analyzing the regression- and classification-based approaches, with differences regarding the inclusion of novel connections and categories.

Joint learning

For each trait, we compared the predictive performances of the BRR, RKHS-KA and SM-GK models using the entire set of markers with the performances of SM-GK coupled with the FS approaches considering classification (C) and regression (R) algorithms. In addition to testing the intersection of the three FS techniques tested for both C and R (C3/R3) and the intersection of at least two techniques out of the three (C2/R2), we also evaluated (i) the union of C2 and R2 (CR2), (ii) the union of C3 and R3 (CR3), and (iii) the intersection of C2 and R2 (ICR2). Finally, we tested the addition of C3 and R3 as fixed effects (C3F and R3F) with the maintenance of the remaining markers for estimating the relationship matrix K. All these models were evaluated using R Pearson correlation coefficients together with MSE measures in the 10-fold CV scenario established; these measures were contrasted with Tukey’s multiple comparisons test (Supplementary Figs. S42S55).

The correlation values calculated when subsetting the marker datasets clearly increased due to decreases in MSE values. The addition of markers as fixed effects, on the other hand, increased the R Pearson coefficients together with the MSEs; this indicates that such an increase in the correlation is derived from higher variability in the dataset rather than from better predictive performance. By selecting the best results according to Tukey’s test for both evaluated metrics (Supplementary Table S3), we could establish the best models as SM-GK/R2 and SM-GK/CR2 for Pop1, SM-GK/R2, SM-GK/R3, and SM-GK/CR3 for Pop2, and SM-GK/R2, SM-GK/R3, SM-GK/CR2, and SM-GK/CR3 for Pop3.

Interestingly, we did not observe differences when applying SM-GK with R2 or CR2 and with R3 or CR3. In fact, although presenting intersections with R2 and R3 markers, C2 and C3 markers by themselves did not perform well in regression model estimation (Supplementary Figs. S42S55). To further analyze these results, we employed LOO CV with the SM-GK regression model and the GNB classification algorithm for marker datasets R2, R3, CR2 and CR3 (Supplementary Table S4). We selected the GNB model due to the promising results observed in previous analyses (Supplementary Figs. S14S41).

Similar to the results observed for the regression techniques, the classification algorithms performed better with the inclusion of markers selected for classification. However, as previously pointed out, the use of both strategies combined (CR2 and CR3) produced satisfactory results for both classification and regression, supplying a powerful set of markers to be considered for prediction (Fig. 2).

Figure 2
figure 2

Genomic prediction models’ performances for the populations of (Pop1) sugarcane (stalk diameter (SD) and stalk height (SH)); (Pop2) Urochloa decumbens (field green weight (FGW), leaf dry matter (LDM), leaf percentage (LP), leaf stem ratio (LSR), regrowth capacity (RC), and total dry matter (TDM)); and (Pop3) Megathyrsus maximus (green matter (GM), LDM, percentage of leaf blade (PLB), RC, stem dry matter (SDM), and TDM). The Bayesian ridge regression (BRR) and single-environment main genotypic effect model with a Gaussian kernel (SM-GK) approaches with the inclusion of different feature selection methods (CR2 and CR3) were evaluated.

As the final approach suggested for constructing these GP models for complex polyploids, we highlight the approach of subsetting the marker dataset using the combined FS technique CR3, mainly because of (i) the high predictive accuracies observed for all traits, (ii) its ability to accommodate both classification and regression models, and (iii) the small number of markers compared to that in the CR2 dataset (for all traits, the number of markers selected for CR3 was approximately 10% of the number of markers selected for CR2). Although in some scenarios the prediction accuracies were higher for CR2, even when using CR3, they were significantly better than those observed with the traditional approaches.

Finally, by combining such an FS system based on the joint use of classification and regression ML strategies (CR3) with a popular statistical model for GP (SM-GK), we were able to suggest an ensemble method for predicting quantitative phenotypes in highly complex polyploid species. The increases in accuracy are remarkable (Fig. 3), and such a strategy also enables the construction of classification models for assisting in breeding selection strategies.

Figure 3
figure 3

Correlation plots for the predicted and real values in the populations: (Pop1) sugarcane (stalk diameter (SD) and stalk height (SH)); (Pop2) Urochloa decumbens (field green weight (FGW), leaf dry matter (LDM), leaf percentage (LP), leaf stem ratio (LSR), regrowth capacity (RC), and total dry matter (TDM)); and (Pop3) Megathyrsus maximus (green matter (GM), LDM, percentage of leaf blade (PLB), RC, stem dry matter (SDM), and TDM). The approaches evaluated were Bayesian ridge regression (BRR), colored green, and the single-environment main genotypic effect model with a Gaussian kernel (SM-GK) with the inclusion of the established feature selection machinery, colored black.

Simulation approaches

To evaluate the feasibility of using the proposed approach for wider datasets, we performed simulations of a wheat breeding program and selected lines from the PYT for use in CV strategies. Using ten years of phenotypic selection, we were able to create 4 different datasets of 10,000 rows corresponding to genotypic data created based on 2, 10, 100 and 1,000 quantitative trait loci (QTLs). From these sets, different quantities of rows were sampled for independent evaluations (100, 250, 500, 1000, 2000, 5000, and 10,000). We contrasted the performance of the SM-GK using the entire marker dataset with the performance considering marker selection with CR3 (Fig. 4). For all the configurations tested, the prediction accuracies of SM-GK/CR3 were better than those of SM-GK, with differences in the percentage of increase depending on the configuration established.

Figure 4
figure 4

Accuracies of simulation scenarios created using a wheat breeding program and different quantities of QTLs per chromosome (2, 10, 100, and 1000). From 20 years of phenotypic selection (10 years of burn-in), preliminary yield trials (1000 lines) were selected by year, forming a dataset of 10,000 genotypes, which was sampled for different quantities of individuals (100, 250, 500, 1000, 2000, 5000, and 10,000) to perform 10-fold cross-validation using a single-environment main genotypic effect model with a Gaussian kernel (SM-GK) with and without the inclusion of the established feature selection approach (SM-GK/C3R3). The values in the boxplot labels represent the mean accuracy, and in the gray boxes the quantity of markers used to create the predictive model.

Even though we observed slight differences in model performance when using different QTL quantities for the creation of genotypic datasets, the most pronounced increases in performance for SM-GK/CR3 compared with SM-GK were based on the quantity of individuals sampled from the data simulated and used for CV. In the smallest datasets (100 individuals), the accuracy for SM-GK/CR3 was approximately three times that of SM-GK, with a marker reduction exceeding 99%. The same increases in accuracy and decreases in marker data were also observed for 250 individuals. For 500 individuals, the increase was approximately 2 times, and the marker reduction was 98%. For more than 1000 individuals, the reductions in the marker dataset were also large (approximately 95%, 95%, 94%, and 90% for 1000, 2000, 5000, and 10,000 individuals, respectively); however, the increases in accuracy were more modest, being approximately 50%, 20%, 10%, and 5% for 1000, 2000, 5000, and 10,000 individuals, respectively.

Discussion

SU and FGs have considerable global economic importance, and their production demand is constantly rising1,2. In the current scenario of climate change and food insecurity, the development of novel and more productive cultivars by breeding programs represents the most effective and sustainable strategy for increasing production85,86. Although genomic approaches have revolutionized plant breeding with large associated genetic gains87, the inclusion of such techniques for the species employed in this work is still in its infancy88,89. The genomic complexity of these crops has hindered the development and acquisition of breeding genomic resources for many years7,8. In such a scenario, different molecular strategies have been tested for reducing such genomic complexity through methylation-sensitivity enzymes and facilitating the genotyping process while reducing associated costs, including GBS43,90. In this work, different populations genotyped with the GBS strategy were used to evaluate the estimation performance of GP models.

Even though GBS has been suitable for several studies of QTL mapping in these species31,32,91, developed GP strategies have presented low prediction accuracies20,22,92,93. Therefore, for practical incorporation of such techniques into breeding programs through GS, improved GP strategies are required, especially for most traits of interest, which are quantitative and highly polygenic and show phenotypic effects spread across unknown QTLs organized in a very complex genetic architecture94. Motivated by these facts, we suggested the incorporation of FS into GP as a means of circumventing the difficulty of estimating genotype and phenotype interactions and improving the low accuracies observed.

The traditional approaches employed for GP are based on the creation of regression models25, which estimate marker or genotype effects for the phenotype of interest, providing a means of predicting plant performances in the field25. In recent years, many studies have evaluated different types of statistical paradigms for such model estimation; however, most of the results show little/no changes in accuracy95,96,97. In contrast to the model efficiency observed for species with well-established genomic resources98,99, these strategies do not work properly in our scenario, inaccurately estimating the genetic effects of a specific trait and causing a significant loss of accuracy. We hypothesize that this deficiency is mainly caused by the difficulty of properly genotyping the individuals, which is due to several factors, including a lack of appropriate genomic references, the frequent occurrence of duplicated regions7,13, aneuploidy at different loci100, the difficulty of estimating correct allele dosages101, and the large amount of missing data102. Our first attempt to circumvent some of these limitations involved employing allele proportions instead of dosages, expanding the marker dataset and removing statistical assumptions regarding dosage attributions27.

The inclusion of ML approaches in GP has been controversial, being described as beneficial97,103,104,105 or not advantageous18,106,107. In the first scenarios that we evaluated in our work, we did not observe large increases when incorporating ML into prediction. Most of the ML algorithms performed worse than the traditional modeling approaches. Although we observed reasonable accuracies for some specific k-fold configurations (Supplementary Figs. S11S13), this was not uniform across the CV scenarios established, which corroborates the strong impact of genotyping results on model creation. Beneficial k-fold splits for prediction represent similar training and test datasets (i.e., individuals with similar polymorphic profiles), and in such cases, GP models work accurately. Therefore, we consider that, in addition to our hypothesis that the marker datasets employed were not accurate enough for creating such models, the lack of satisfactory accuracies might be related to an absence of markers linked to several trait-associated QTLs, probably due to the GBS protocol and bioinformatics procedures employed. To evaluate these factors, we decided to check whether the exclusion of nonbeneficial markers for prediction through FS could improve model performance based on (i) possible incorrect marker estimation, (ii) the principle that not all SNPs influence the phenotype108, and (iii) our observation that the models were not able to automatically separate the markers most important for prediction from the rest of the SNPs.

This strategy of FS has already demonstrated promising results across the literature27,109,110,111,112. Subsetting datasets through principles of FS is a common strategy in data science for datasets with a large number of variables113. Considering the principle underlying GS that all markers can be used in model construction with different effects estimated for the traits of interest114, FS methods have not been required in the construction of these predictive models. However, this is not the scenario for several complex datasets, including those used in this work. An FS system in ML aims at removing redundant, incorrect and irrelevant data from the dataset prior to prediction115, and for a GP problem, this strategy supplies a means of selecting the most promising set of markers for prediction, which might also be indicative of QTL associations116.

The reduction of markers through the different FS strategies employed was effective, dramatically reducing the datasets and increasing the prediction accuracies. For the final strategy considered, the intersection of the three FS methods selected enabled the establishment of a very restricted dataset with insights into biological functions potentially associated with different trait configurations. Interestingly, all of the traits were associated with genes having biological functions related to regulatory mechanisms (e.g., regulation of transcription, translation, and DNA duplication), which is in accordance with the control mechanism impacting protein-DNA interactions expected for QTLs117. Additionally, important categories could be retrieved for the evaluated agronomic traits, such as carbohydrate metabolism, defense response, cellulose biosynthetic process, and response to stress. All of these GO terms have been previously associated with important biological processes regulating plant growth performance31,118,119. Although a detailed analysis should be performed to provide in-depth inferences about these mechanisms underlying the trait configurations, this preliminary investigation highlights the potential of the methodology to reveal such important biological features associated with target traits for breeding programs.

Considering that the final purpose of breeding programs is selecting the most promising genotypes for advancing breeding cycles and producing novel cultivars, we also considered for each individual a related phenotypic group, supplying an indicator of its overall performance. Through the initial phenotypic clustering analyses performed (Supplementary Figs. S7S9), it was possible to observe distinct groups of individuals with similar phenotypic values, which were considered classes in an ML-based classification scenario. Such interval prediction can be considered a complementary strategy for selecting the best potential candidates for a specific trait because of its capability of excluding a large number of candidates that have fewer chances to be among the top genotypes in terms of performance. Such interval prediction follows the same principle of ranking approaches, which have already been demonstrated as a potential tool in selective breeding120. Different from traditional GP models based on regression, this classification task can take advantage of a vast set of ML algorithms. Together with the marker dataset selected through FS methods applied to the quantitative traits, we coupled an FS system based on classification, selecting the markers related to these defined intervals and forming a joint strategy for prediction. Therefore, in addition to creating a prediction model for GP of quantitative traits, the same marker dataset can be applied to an independent system estimated for interval prediction.

Regarding the performance comparison of the best models (BRR, RKHS-KA, SM-GK, and SVM), we did not observe significant differences between the accuracy results. For this reason, we selected only one of them, SM-GK, for combination with the FS system. In addition to clear improvements in accuracy by such SNP subsetting, we also evaluated whether such selection could serve as a means of defining fixed effects in GP models. Although we also observed increases in the measurements of accuracy through Pearson correlation coefficients, the same beneficial aspect was not observed when evaluating the models’ performance through MSEs. Instead, we observed higher predictive variability with stronger correlations and larger associated errors. Such a conclusion was previously reached when using peaks from a genome-wide association study (GWAS) as fixed effects in GP121. For this reason, we considered the subsetting strategy as the most promising method for boosting accuracies in our scenarios.

On the basis of the improvements observed (Fig. 3), we concluded that such an FS system was essential for showing the inefficiency of the models tested in terms of capturing real marker effects. Including a larger number of SNPs in GP may introduce background noise110, and in our scenario, such noise is expected to be found due to the elevated genomic complexity of these species. Interestingly, the GP models for SU presented better results than those for FGs. For SU SNP data, an SU pseudoreference genome was used for SNP calling55, different from the approach applied to FGs, where SNPs were estimated using genomic references from closely related species. Using Sorghum bicolor as a genomic reference for SU, a close relative, also decreases the quantity of reliable markers27,112 and, consecutively, the chances of finding SNPs surrounding QTL regions. For this reason, we believe that the increases in accuracy in FG results were more modest than those in SU.

As the last evaluations performed, we created simulated populations using a wheat breeding program to check the feasibility of the approach in scenarios with a broader set of individuals and not only single biparental populations. One of the main limitations that we found was related to the number of individuals used for performing k-fold validation. Increases in the predictive accuracies were modest when very large populations were analysed, but even in these cases our approach still represents an important strategy to reduce marker datasets needed for prediction. This might contribute with cutting down genotyping costs—a constraint that remains highly relevant in breeding programs of the species under study. Increasing the number of observations in the dataset and using the same quantity of folds for CV introduce a larger number of samples to be predicted, which may correspond to a more difficult prediction task if the individuals in the training and test sets are more distantly genetically related122. This fact corroborates the need to develop appropriate populations for the proper application of GS models123. Additionally, using a small number of individuals leads to a restricted set of markers linked to the phenotypic variation being examined, which is not expected for a larger number of genotypes124. As expected, the methodology worked efficiently for a broad number of QTLs, showing its appropriateness for different quantitative traits, as already demonstrated in this study.

In the present study, we provided a method for GP using different datasets of polyploid grasses. Combining an FS engineering system capable of retrieving markers important for classification and regression, our methodology also shows great potential for investigating marker-trait associations. Additionally, our results highlight the benefit of incorporating methodologies for marker selection into prediction, which may also be seen as a promising approach for developing targeted sequencing methodologies that can be applied to create models for GS. With a small number of markers, we could achieve high associated accuracies for quantitative traits and for predicting putative performance intervals through classification, which can be seen as a complementary breeding tool. This strategy has the potential to aid in the development of models for species with elevated genomic complexity, surpassing the limitations of available genomic resources and supplying a means of incorporating GS into their breeding programs.