Metabolic prediction of important agronomic traits in hybrid rice (Oryza sativa L.)

Hybrid crops have contributed greatly to improvements in global food and fodder production over the past several decades. Nevertheless, the growing population and changing climate have produced food crises and energy shortages. Breeding new elite hybrid varieties is currently an urgent task, but present breeding procedures are time-consuming and labour-intensive. In this study, parental metabolic information was utilized to predict three polygenic traits in hybrid rice. A complete diallel cross population consisting of eighteen rice inbred lines was constructed, and the hybrids’ plant height, heading date and grain yield per plant were predicted using 525 metabolites. Metabolic prediction models were built using the partial least square regression method, with predictive abilities ranging from 0.858 to 0.977 for the hybrid phenotypes, relative heterosis, and specific combining ability. Only slight changes in predictive ability were observed between hybrid populations, and nearly no changes were detected between reciprocal hybrids. The outcomes of prediction of the three highly polygenic traits demonstrated that metabolic prediction was an accurate (high predictive abilities) and efficient (unaffected by population genetic structures) strategy for screening promising superior hybrid rice. Exploitation of this pre-hybridization strategy may contribute to rice production improvement and accelerate breeding programs.

The metabolome has manifested potential roles in facilitating crop breeding strategies over the past decades 12,13 . The results from a tomato IL population showed that 73% of the total analysed metabolites were significantly associated with the whole-plant phenotypes 14 . Arabidopsis thaliana biomass could be described as a function of metabolic compositions with high predictive power [15][16][17] . Moreover, metabolite profiling of a rice recombinant inbred line population and diverse accessions showed promise in bridging the gap between the genome and phenome 18,19 . Importantly, parental metabolites as biomarkers have been tested for the prediction of traits such as yield in maize 20 , biomass and tolerance in Arabidopsis thaliana 21,22 and postharvest quality traits in potatoes (Solanum tuberosum L.) 23 .
In the present study, a complete diallel cross population was built with eighteen rice inbred lines as the parents. Three polygenic traits-yield per plant, heading date and maturation stage plant height-were predicted in the hybrids using 525 metabolites with the partial least square regression method. The predictive abilities ranged from 0.858 to 0.977 for the three traits. The population structure and cytoplasmic effects had slight influences on the predictive ability.

Results
Population structure and hybrid performance. Eighteen rice inbred lines were chosen as parental lines for the complete diallel cross design (see Supplementary Table S1). All the parental lines were divided into equal indica (F i values from 0.39 to 1) and japonica (F i values from 0 to 0.33) groups based on the InDel marker estimating method (see Supplementary Table S1) 24 . Principal component analysis (PCA) was also applied to the parental phenotypic and metabolic data to discriminate between the parental lines. PCA of the parental phenotypic performance showed that three indica varieties were divided into the japonica group and three japonica varieties were in the indica group (Fig. 1a). At the metabolic level, a total of 525 analytes were detected in all eighteen inbred lines. Unexpectedly, PCA for the parental metabolic parameters produced a similar grouping result as the consequences of the InDel markers and was also highly similar to the dendrogram (Fig. 1b,c).
Three important agronomic traits were measured: yield per plant (YPP), maturation stage plant height (MSPH), and heading date (HD). Prior to the analysis, the hybrids were divided into an indica-indica group (i-group), a japonica-japonica group (j-group) and an indica-japonica group (ij-group) based on the group to which the corresponding parental lines belonged. For the hybrid phenotypes (trait values per se), the highest mean YPP was in the i-group, the lowest was in the j-group, and the ij-group was between the i-and j-group (Fig. 1d). For MSPH and HD, the highest means were observed for the ij-group, which was consistent with the phenomenon that hybridization between inter-subspecific rice has stronger biomass heterosis (increased plant height) and a later heading date. Box plots depicting the relative low-parent heterosis (LPH), mid-parent heterosis (MPH), better-parent heterosis (BPH) and specific combining ability (SCA) for the three traits are shown in Supplementary Fig. S1a,b. For YPP, the hybrid trait values per se had the closest relationship with SCA (see Supplementary Table S2). Trait values of MSPH per se had the closest relationship with BPH, and HD was most closely associated with MPH.
Predictions of hybrid performance based on parental traits and genetic distance. First, the sum trait values of the parents (sum), the differences between the parents (difference) and the ratios of the parents (ratio) were used as predictive variables. The highest correlation coefficient (0.714) was obtained between the HD and the sum values of the parents (see Supplementary Table S3). After performing linear regression analysis of the sum HD values of the parents and the hybrid HD values, the predicted HD values and measured true values displayed dramatic deviances (see Supplementary Fig. S2). Therefore, the parental traits were not appropriate for the prediction of hybrid performance.
Second, to evaluate whether genetic distance based on parental metabolic data was suitable for predicting hybrid performance, squared Euclidean distances based on all 525 analytes were calculated to analyse the correlations (see Supplementary Table S4). The Pearson correlation coefficients between genetic distance and hybrid performance shown in Supplementary Table S5 suggested that, although genetic distance was significantly correlated with some traits (such as YPP and MSPH), the predictive abilities were quite low, and some traits were not closely correlated with genetic distance.
Stepwise linear regression and partial least square (PLS) regression with metabolic data and hybrid performance. Next, the sums, differences and ratios of the parents' relative metabolite levels were calculated, and stepwise linear regression was used to identify the appropriate metabolic predictors. The adjusted R 2 values in Supplementary Table S6 shows that the highest value was 0.677 for HD, while no metabolite could be used for the regression of traits such as SCA-YPP. Furthermore, the predictive abilities of different traits varied widely, and they were not sufficiently high for most traits.
Then, partial least square regression was used to create a prediction model with metabolic data of the three traits. In Fig. 2a, 17 latent factors were extracted for all the traits using the sums and differences of the parents' relative metabolite levels as variables. Unfortunately, the highest R 2 value obtained was 0.6, and some hybrid traits could not be predicted using these parental variables.
Surprisingly, however, high R 2 values were obtained when the ratios of parental metabolic data were used as predictive variables (Fig. 2b). For this analysis, 104 to 107 latent factors were extracted from the metabolic data; most of the traits exhibited the highest R 2 values at approximately 50 latent factors. Then, the number of latent factors at the top R 2 value for each trait was fixed to obtain the values of variable importance in the projection (VIP) of each variable. Because Latent Factor 1 explained the largest proportion of the variance, the VIP values of Latent Factor 1 were reordered to reduce the number of variables. After excluding variables whose VIP values were smaller than 1, the remaining variables were used for PLS regression. Finally, although more than 300 variables were excluded from the regression model, and only slight decreases were observed in R 2 values compared with the values obtained using all 525 metabolites (Table 1). Among the 15 predicted traits, LPH-YPP used the fewest (149) variables and LPH-HD used the most variables (196).
Considering that 149-196 is still a quite large number of variables and might result in overfitting 22 , the VIP values of each variable were reordered again in the PLS regression model to minimize the amount of variables. Unfortunately, the R 2 values decreased dramatically.

Metabolic prediction of agronomic traits by PLS regression.
The coefficient of each variable in the PLS regression results was used to build the equations (see Supplementary Table S7). Then, a predicted value was assigned for every trait in each hybrid. The relationship between the observed YPP ( Fig. 3a), MPH-YPP (Fig. 3b), BPH-YPP (Fig. 3c), and SCA-YPP (Fig. 3d) values and the predicted values demonstrated the high accuracy for this metabolite-based prediction method. The lowest predictive ability was 0.858 with SCA-YPP and the highest was 0.924 with YPP. High predictive abilities were also achieved through the PLS regression approach for the remaining traits (Table 2).
Because the complete diallel cross population consisted of rice inbred lines ranging from typical indica to typical japonica, the predictive abilities might vary with different hybridization groups. Thus, the predictive abilities for the i-group, j-group and ij-group were calculated to validate the stability of the prediction model ( Table 2). As shown in Fig. 4a,b, the predictive abilities of YPP in the i-group and j-group were 0.849 and 0.806, respectively. Surprisingly, the predictive ability of the ij-group was as high as 0.948 (Fig. 4c). Moreover, the predictive abilities of relative heterosis and SCA of YPP were all the highest for the ij-group, thereby showing a promising approach for exploiting inter-subspecific heterosis 25 . Finally, the cytoplasm from the reciprocal parental lines might have different influences on hybrid performance; these differences might be obstacles to applying the metabolic prediction strategy in screening for potential sterile and restorer lines. Hence, the whole population was divided into two sets of reciprocal hybrids to test whether cytoplasmic effects affected the predictive ability. The results demonstrated that different cytoplasms had weak influences on the predictive ability (see Fig. 4d,e and Supplementary Table S8).

Discussion
Accuracy and efficiency are two pivotal indicators for the application of a biotechnology-assisted prediction model in breeding. Because predictive ability might be heavily influenced in populations consisting of genetically distant lines 4,11 ; therefore, assessing whether a prediction model is appropriate is largely determined by its predictive abilities in different genetic structures. To avoid incomplete bias prediction, the predictive model should be built on representative inbred lines. In addition, the richer the information on these inbred lines, the higher the potential accuracy of the model. In this experiment, eighteen rice inbred lines were used to build a prediction model. These lines were collected from a wide range of locations (China, Italy and India) and possessed different degrees of indica or japonica contents and variable general combining abilities (see Supplementary Table S1).  To some extent, they represented a large range of rice accessions that were suitable for building the prediction model. Therefore, this model may also be applicable to rice inbred lines that were not included in this experiment. The relationship between parental metabolite profiling and hybrid performance should be nonlinear 26 and was treated as a megavariate quantitative structure-activity relationship in this study 27 . Because of its powerful multi-dimensional information regression ability, partial least squares projections to latent structures 28 were applied to build the hybrid trait prediction model. Meanwhile, because cross-validation is built into the PLS algorithms 27 , the predictive significance is assessed and predictive abilities can be insured for every trait. In addition, taking advantage of the values of variable importance in the projection, the low contribution variables were excluded and the representative variables were reserved for the equations 22 . The high predictive abilities (0.858-0.977) in Table 2 demonstrate that PLS matched the metabolic prediction model quite well.
Metabolites have close connections with phenotypes 14,17,19 . Before determining the exact function of a metabolite, correlation analysis between the relative metabolite levels and phenotypes can be used as a type of rough estimating method 29 . In this study, different groups of metabolites were exploited to predict various traits. Because specific metabolites may have single or multiple functions in different pathways, metabolites such as saccharic acid, p-Cresol, and triacontanoic acid methyl ester were found to be predictive variables for two or more traits (see Supplementary Table S9). Different correlation levels were detected for the metabolites that were predictive variables for all three traits (see Supplementary Table S10). Some of these overlapping metabolites manifested significant positive and negative correlations with different traits, which might be the origin of the balance between component traits 30 or outcomes of feedback regulation of the biological networks involved in complex traits 31 . Meanwhile, only a single value was used to represent reciprocal hybrids when calculating the ratios of relative parental metabolite levels. Ignorance of the relationship between the female parent and male parent might decrease the predictive abilities for the reciprocal hybrids. However, almost no predictive ability change was found between the reciprocal hybrids. Therefore, close balances might exist between metabolites and their associations might be controlled by the rules of chemistry 26 . The influences of the complex associations between metabolites on the hybrid trait prediction model remain unknown.
Nevertheless, predictable metabolites in the prediction model might simply have high predictive abilities for the traits. Therefore, biological functions of the predictive variables in this study were analysed to verify their contributions to the prediction model. Fructose (a predictive variable for yield per plant and maturation stage plant height) contributes strongly to the metabolic efficiency of rosette fresh weight and protein concentration in Arabidopsis 29 . Ferulic acid, a phenolamide 32 , plays an important role in plant development 33 and is a predictive variable for plant height. Spermidine (a predictive variable for heading date and yield per plant) is associated with floral induction and development 33,34 . Furthermore, significant correlations have been found between alanine and biomass or yield-related traits in Arabidopsis 35 and tomatoes 14 . Hence, our results demonstrate that the high predictive abilities for the three traits were not merely coincidental.
However, the metabolic prediction model might not be directly applicable to other breeding programs. In this study, metabolite profiling of 15-day-old seedlings of the parental lines was chosen to provide the predictors for hybrid traits. Thus, the metabolite profiling data were only a snapshot of the whole plant growth procedure. However, plant metabolism is a highly dynamic system that changes substantially over time 20,36 . Therefore, although high predictive abilities for three agronomic traits were obtained in this study, the metabolite profiling data presented here might be unsuitable for traits such as grain number per panicle or tiller number per plant. Therefore, the appropriate time points must be adapted when forecasting other polygenic traits from assembled metabolite profiling data.
Furthermore, the performance of a plant in nature is the consequence of the combination of genetic information and environmental influences; thus, the effects of environmental inputs on metabolome-assisted breeding strategies might need to be taken into account 26 . Metabolite profiling data of naturally growing parental seedlings in a specific field environment should be more accurate for the prediction of hybrid performances in corresponding locations.
In summary, a reliable and efficient metabolic prediction strategy was provided by combining parental metabolite profiling with a PLS regression method for rice hybrids. The high predictive abilities for three agronomic traits were implemented with respect to hybrid phenotypes, relative heterosis and specific combining ability. The predictive abilities for the three traits were only slightly influenced by population structures (genetic relatedness) or cytoplasmic effects. Additionally, metabolite-based prediction might be more suitable for traits such as resistance (susceptibility) or tolerance ability 21,23,37 because small molecules function more directly in these defence processes by confronting biotic/abiotic pressures.    For metabolite profiling analysis, seeds of the eighteen rice inbred lines were first submerged in water for two days at 28 °C. Then, the seeds were placed in an incubator at 28 °C for 24 h to accelerate germination. Next, seedlings at approximately the same stage were transferred to soil containers. Three random replicates were applied for each inbred line; the spacing between seedlings was 2 × 2 cm. Finally, all the seedlings were placed in a phytotron with a temperature of 28 °C, 70% relative humidity and an 8 h light/16 h dark photoperiod. On day 15, 100 mg of the seedlings (root excluded) were collected into 2 ml EP tubes for each replicate and immediately frozen in liquid nitrogen.

Methods
Gas chromatography mass spectrometry-based metabolite profiling. For each sample, 0.4 ml of methanol-chloroform (V methanol :V chloroform = 3:1) and 20 μ l of ribitol (0.2 mg/ml stock in dH 2 O, Sigma-Aldrich Co. LLC., USA) were added as internal standards. After vortex mixing for 10 s, steel balls were placed into EP tubes and the samples were homogenized with a ball mill (JXFSTPRP-24, Shanghai Jingxin Experimental Technology, Shanghai, China) for 5 min at 55 Hz. Then, the samples were centrifuged for 15 min at 12,000 rpm at 4 °C. The supernatant (approximately 0.4 ml) was transferred to a new 2 ml GC/MS glass vial. An equal volume of approximately 13 μ l (based on the number of samples) from each sample was transferred into a new 2 ml GC/MS glass vial as a mixed sample for quality control. Then, the extracts were dried in a vacuum concentrator without heating at 30 °C for approximately 1.5 h. Next, 80 μ l of methoxymethyl amine salt (dissolved in pyridine, final concentration of 20 mg/ml) was added to the dried extracts and incubated at 37 °C for 2 h in an oven after mixing and sealing. The lids were opened, and 100 μ l of BSTFA (containing 1% TCMS, v/v, Regis Technologies, Inc., USA) was added to each sample; then, the samples were resealed and incubated at 70 °C for an hour. When the samples had cooled to room temperature, 10 μ l of FAMEs (standard mixture of fatty acid methyl esters, 1 mg/ml C8-C16 and 0.5 mg/ml C18-C30 in chloroform) was added to the mixed sample. Then, the sample was mixed well for GC-MS analysis.
GC/TOF MS analysis was performed using an Agilent 7890 (Agilent Technologies, USA) gas chromatograph system coupled with a Pegasus HT time-of-flight mass spectrometer (LECO Corporation, USA). The system utilized a DB-5 MS capillary column coated with 5% diphenyl cross-linked with 95% dimethyl polysiloxane (30 m × 250 μ m inner diameter, 0.25 μ m film thickness; J&W Scientific, Folsom, CA, USA). A total of 1 μ l of the (a-c) Relationships between observed yield per plant and predicted values in the i, j and ij three subgroups. The entire hybrid population was divided into three subgroups based on whether a parental line was indica or japonica (Fig. 1b). The predictive abilities in different population structures showed only slight changes. (d-e) Relationships between observed yield per plant and predicted yield per plant in reciprocal hybrids. The whole population was divided into two groups of reciprocal hybrids to test cytoplasmic effects on predictive ability. The results demonstrated that different cytoplasms only weakly influenced predictive ability. YPP = Yield per plant. sample was injected in the splitless mode with helium as the carrier gas. The front inlet purge flow was 3 ml min −1 , and the gas flow rate through the column was 20 ml min −1 . The initial temperature was held constant at 50 °C for 1 min, followed by a 10 °C per min ramp up to 330 °C, and then maintained for 5 min at 330 °C. The injection, transfer line, and ion source temperatures were 280 °C, 280 °C, and 220 °C, respectively. The energy was − 70 eV in electron impact mode. Spectra were recorded in full-scan mode with an m/z range of 85-600, at a rate of 20 spectra per second after a solvent delay of 366 s.
The Chroma TOF 4.3X software from LECO Corporation and the LECO-Fiehn Rtx5 database 38 were used for raw peak extraction, data baseline filtering and calibration, peak alignment, deconvolution analysis, peak identification and integration of the peak area. The RI (retention time index) method was used for peak identification; the RI tolerance was 5000.
Data analysis. Relative low-parent heterosis (LPH), mid-parent heterosis (MPH), better-parent heterosis (BPH) and specific combining ability (SCA) for each trait were calculated using the following equations: LPH = (F 1 -P Low )/P Low , MPH = (F 1 -P Mean )/P Mean , BPH = (F 1 -P High )/P High , GCA i (general combining ability) = P i. -P.., and SCA ij = P ij -P..-P i. -P j. 39 . For the heading date, earlier heading indicated positive heterosis. F 1 is the trait value of the hybrid; P Low , P Mean and P High are the low value, mean value and high value of the two corresponding parents, respectively. P ij is the trait value of the hybrid from parent i and parent j, P.. is the mean of all the hybrids, P i. is the mean of the hybrids from parent i and P j. is the mean of the hybrids from parent j. The relative metabolite levels of each analyte were obtained by calculating the ratio between the areas of the analyte and its corresponding ribitol in each repeat. Means among the replications were used as the raw metabolic data. Metabolic data were log 2 -transformed for statistical analysis. Analytes with the same annotations but different mass and similarity (such as 2-hydroxypyridine), the suffixes − 1, − 2, and − 3 were added in retention time order to differentiate between them in the analyses. Microsoft Excel 2010 (Microsoft, USA) was used to calculate the sums (sum = (Female P+ Male P)*0.5) of the parents, differences (difference = Female P-Male P) between the parents, and ratios (ratio = Female P/Male P) of the parents' phenotypic data. Female P represents the mean of the female parent values and male P represents the mean of the male parent values. When parental metabolic data were used as predictive variables, only single sum, difference and ratio values were calculated for the reciprocal hybrids. For example, for reciprocal hybrids YB/Balilla and Balilla/YB, if the relative level of Analyte X is a1 for YB and a2 for Balilla, the sum, difference, and ratio values for Analyte X of these reciprocal hybrids are (a1 + a2)*0.5, a1 -a2, and a1/a2, respectively. The reason for this calculation approach is explained in the Discussion. Correlation analyses and regression analyses were performed with IBM SPSS Statistics 20 (IBM, USA). Average linkage between groups was chosen as the cluster method in the hierarchical cluster analysis. Squared Euclidean distance was calculated as the genetic distance. Principle component analysis was achieved through factor analysis, and eigenvalues greater than 1 were extracted; maximum iterations for convergence was 25; and the non-rotation method was applied for factor analysis. In stepwise linear regression, the probability of F was used as the stepping method criteria. The entry value was 0.05, and the removal value was 0.10. Partial least squares regression was conducted with the PLS extension bundle for SPSS. Main effects were used to specify model effects. Maximum numbers of latent factors were adjusted until no more latent factors could be extracted. Then, in the Proportion of Variance Explained table, the number of latent factor to a trait was determined where the corresponding adjusted R-square value was at the highest. After the number of latent factor was fixed, all the 525 parental metabolic data were applied to PLS again. And variable importance in the projection of Factor 1 was used for reordering. Parameters of independent variables were used to calculate the corresponding values of the dependent variables. Predictive ability was defined based on the Pearson correlation (2-tailed) coefficient of the predicted and observed values 2,20 .