Predicting the quality of soybean seeds stored in different environments and packaging using machine learning

Monitoring and evaluating the physical and physiological quality of seeds throughout storage requires technical and financial resources and is subject to sampling and laboratory errors. Machine learning (ML) techniques could therefore help optimize these processes and provide accurate results for decision-making during seed storage. This study aimed to analyze the performance of ML algorithms in predicting the physical and physiological quality of stored soybean seeds from variables monitored during seed conditioning (temperature, T, and packaging, P) and from storage time (ST). Data analysis was performed using Artificial Neural Networks (ANN), the decision tree algorithms REPTree and M5P, Random Forest (RF), and Linear Regression (LR). In predicting seed quality, the combination of the input variables temperature and storage time with the REPTree and Random Forest algorithms outperformed linear regression, providing higher accuracy indices. Among the most important results, the inputs T + P + ST, T + ST, P + ST, and ST had the highest r means and the lowest MAE means for apparent specific mass; however, Pearson's r coefficient for these inputs was 0.63 and the MAE ranged from 9.59 to 10.47. For germination, the inputs T + P + ST and T + ST gave the best results (r = 0.65 and r = 0.67, respectively) in the ANN, REPTree, M5P, and RF models. Computational intelligence algorithms are an excellent alternative for predicting the quality of soybean seeds from easy-to-measure variables.

Sampling and quality analysis of soybean seeds. The soybean grains were obtained from the production fields of a rural property in the municipality of Chapadão do Céu-GO, Brazil, and were cleaned to remove impurities and foreign matter using an LC 160 machine (Kepler Weber, Panambi, Rio Grande do Sul, Brazil). They were then dried in drying silos with radial airflow (Rome Silos Company, Cambé, Paraná, Brazil). The dryer is built of modulated wooden panels (2.11 m × 0.60 m) with treated boards interspersed with aluminum shutters, fixed by galvanized wire and structured with laminated angle arches, mounted overlapping on a self-draining metallic base. Radial ventilation is provided through a central tube by a centrifugal fan. The grains were dried with air at 40 °C until they reached a moisture content of 12% (w.b.). The grains were then processed using spiral separator equipment (Akyurek Technology, Mersin, Turkey) and a dissymmetric table model SDS-80 (Silomax, Rolândia, Paraná, Brazil) to standardize their size and weight. The grain lots were stored in raffia bags (polypropylene) in air-conditioned warehouses with temperature control. Nine-kilogram grain samples were collected from the bags using a sampler (EAGRI Equipments, Panambi, Rio Grande do Sul, Brazil) with the aid of a manual presser, in order to be stored experimentally in different storage environments.
During the storage period, the temperature of the grain mass was monitored weekly with a digital thermohygrometer model Logbox-RHT-LCD (Novus Electronic Products Company, Canoas, Rio Grande do Sul, Brazil), and every three months grain samples were collected for quality assessment. The moisture content of the grains was determined in a 220 L forced-air circulation oven (Tecnal Company, Piracicaba, São Paulo, Brazil) at 105 °C ± 1 °C for 24 h, with four repetitions. The samples were then removed, placed in a 5 L desiccator (Tecnal Company, Piracicaba, São Paulo, Brazil) for cooling, and weighed on a balance model B13200H (Shimadzu, São Paulo, Brazil), according to the recommendations of the Rules for Seed Analysis 26. The moisture content was determined from the difference between the initial and final sample mass, and the results were expressed as a percentage (w.b.). The apparent specific mass of the grains was determined with a 150 mL beaker and a precision scale, using the mass/volume ratio, with four repetitions 26.
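The two gravimetric calculations described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the sample masses in the usage note are invented, and the conversion of apparent specific mass to kg m⁻³ is an assumption, since the text does not state the unit used.

```python
def moisture_content_wb(initial_mass_g, final_mass_g):
    """Moisture content on a wet basis (%), from the mass lost during oven drying."""
    if initial_mass_g <= 0:
        raise ValueError("initial mass must be positive")
    if not 0 <= final_mass_g <= initial_mass_g:
        raise ValueError("final mass must lie between 0 and the initial mass")
    # Wet basis: water mass divided by the mass of the moist (initial) sample.
    return 100.0 * (initial_mass_g - final_mass_g) / initial_mass_g


def apparent_specific_mass(sample_mass_g, volume_ml):
    """Apparent specific mass as mass/volume, converted from g/mL to kg/m^3.

    The kg/m^3 unit is an assumption made for this sketch.
    """
    if volume_ml <= 0:
        raise ValueError("volume must be positive")
    return 1000.0 * sample_mass_g / volume_ml


def mean_of_repetitions(values):
    """Average of the repetitions (four per sample in the text)."""
    if not values:
        raise ValueError("at least one repetition is required")
    return sum(values) / len(values)
```

With hypothetical numbers, a 50.0 g sample that weighs 44.0 g after drying gives `moisture_content_wb(50.0, 44.0) == 12.0`, and 108.0 g of grain filling the 150 mL beaker gives `apparent_specific_mass(108.0, 150.0) == 720.0`.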
The electrical conductivity evaluation was carried out with four sub-samples, each containing 25 seeds per experimental unit, weighed on a precision scale (0.001 g) and placed in plastic cups with 75 mL of distilled water, which were kept in an incubator at 25 °C for 24 h. After imbibition, the electrical conductivity of the immersion solution was measured with a digital conductivity meter model CD-21 (Digimed, São Paulo, Brazil), and the results were expressed in μS cm⁻¹ g⁻¹ according to the methodology proposed by Brazil 26. For the vigor and germination tests, four sub-samples of 50 seeds from each experimental unit were used, distributed in paper towel rolls (Germitest) and moistened with distilled water in an amount equal to 2.5 times the dry paper mass. The rolls with the seeds were then placed in a germinator model Mangesdorf (Tecnal, Piracicaba, São Paulo, Brazil) set at 25 °C ± 2 °C. The evaluations were carried out on the fifth (vigor) and  30 . The LR model was used as a control model because it serves to predict the relationships between variables that are well correlated and is widely used in statistics. The prediction of the variables moisture content (MC), apparent specific mass (ASM), electrical conductivity (EC), germination (G), and vigor (V) in soybean seeds was performed by all machine learning (ML) models in a tenfold stratified randomized cross-validation with 10 repetitions (100 runs for each model). Different inputs
Statistical analysis. After obtaining the correlation coefficient (r) and the mean absolute error (MAE) statistics, an analysis of variance was performed considering a two-factor factorial scheme (models versus inputs) with 10 repetitions (folds). Pearson's r ranges from −1 to 1; for observed versus predicted values, its proximity to 1 indicates that the model better explains the variability of the sample data.
The MAE is expected to behave inversely to the correlation coefficient, since it measures the error between the values predicted by the model and those observed; the lower the MAE, the closer the model is to the observed outputs. The means were grouped by the Scott-Knott test at 5% probability. Bar charts were constructed for each statistic (r and MAE) considering the models and inputs tested. These analyses were performed in the R software 32 using the ExpDes.pt and ggplot2 packages.
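The two accuracy statistics and the repeated tenfold resampling scheme can be illustrated with a short sketch. It is written in plain Python under stated assumptions: the function names are hypothetical, the splitter shuffles the samples without the stratification mentioned in the text, and it is not the software the authors used.

```python
import math
import random


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between observed and predicted values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    if sd_o == 0 or sd_p == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return cov / (sd_o * sd_p)


def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists: n_folds splits for each of n_repeats shuffles.

    Simplification: plain shuffling, no stratification.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_folds):
            test = order[k::n_folds]          # every n_folds-th sample, offset k
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test
```

With the defaults, the generator produces 100 train/test splits, matching the 100 runs per model described above; r and MAE would be computed on the held-out samples of each split and then averaged.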
Ethics declarations. The experimental research and field studies on plants and plant material complied with local and national regulations. The authors had permission to collect grains, in accordance with local, national, and international regulations. The study complied with institutional, national, and international guidelines and legislation.

Results and discussion
Analysis of variance. Table 2 shows the p-values (for r and MAE) for the prediction of the variables evaluated, considering the different ML models (M) and different inputs (I). There was a significant interaction (p < 0.05) between factors for r and MAE for the variables moisture content and germination, and for MAE for electrical conductivity. The r of the apparent specific mass had a significant effect only
Moisture content. During storage, biological processes in the products continue to occur with greater or lesser intensity, depending on storage conditions and the moisture content of the products 33. The inputs T + P + ST and the combination T + ST had the best performance in predicting seed quality. Juvino et al. 34 observed a wider range of moisture content in environments with uncontrolled temperature than in the environment acclimatized at 18 °C. When the seeds were subjected to lower storage temperatures, they remained in hygroscopic equilibrium with moisture contents close to the initial storage conditions 35. The reduction in grain temperature slows down the biochemical and metabolic reactions of the seeds, in which reserves stored in the support tissue are broken down, transported, and resynthesized in the embryonic axis, and thus allows the initial characteristics of the seeds to be maintained for longer storage periods. The combination of the input variables temperature and storage time was the best predictor of soybean seed moisture content during the storage period. The moisture content of soybean seeds for safe storage is 12% (w.b.), which must remain in equilibrium with intergranular air at 65-67% relative humidity 35. Predicting seed moisture content during storage is of paramount importance, since an increase or reduction in moisture content can influence the metabolic activity of the seeds, the cellular tissues and, consequently, the physiological quality.
For inputs T, T + P, P, P + ST, and ST, there was no difference between the models tested (Tables 3 and 4). However, for inputs T + P + ST and T + ST, the ANN, REPTree, M5P, and RF models had the highest r means compared to LR. When analyzing the inputs within each model, it can be seen that, regardless of the model, the T + P + ST configuration provided the highest r means. The MAE results for the ML algorithms with T + P + ST and T + ST as inputs ranged from 0.30 to 0.41, while LR scored 0.73. For the T + P + ST configuration, all ML models had r values above or equal to 0.94, while for LR the observed r was 0.72.
Table 2. P-values from the analysis of variance for Pearson's correlation coefficient (r) between observed and estimated values of moisture content (MC), apparent specific mass (ASM), electrical conductivity (EC), germination (G), and vigor (V) of soybean seeds by different machine learning models and inputs.
Table 4. Unfolding of the significant model × input interaction for mean absolute error (MAE) between the observed and estimated values of moisture content in soybean seeds by different machine learning models and inputs. Means followed by equal lowercase letters in the same column and equal uppercase letters in the same row do not differ by the Scott-Knott test at 5% probability. T temperature, P packaging, ST storage time.
www.nature.com/scientificreports/
For MAE (Table 4), inputs T, T + P, and P had the highest means in the ANN model, while for input T + P + ST the LR model had the highest mean. For the T + ST input, the ANN and LR models showed the highest means, while for the P + ST and ST inputs there were no statistical differences among the models tested. It is important to highlight that MAE behaved contrary to r: low MAE values represent a closer agreement between the observed and estimated values.
When analyzing the inputs within each model, the T + P + ST configuration provided the lowest MAE means regardless of the model. Figure 2 shows that the ANN, REPTree, M5P, and RF models, when associated with inputs T + P + ST and T + ST, provided the highest r and lowest MAE values. Therefore, the Random Forest algorithm is recommended to predict the moisture content of the seeds during the storage period because it used a smaller amount of data, making it possible to better handle overfitting problems.

Apparent specific mass. The ASM did not differ for r among the tested models. However, the ANN model presented the highest mean MAE in relation to the others, which indicates that this model overestimated the apparent specific mass values. Regarding the inputs tested, Tables 5 and 6 show that T + P + ST, T + ST, P + ST, and ST had the highest r means and the lowest MAE means. Pearson's r coefficient for these inputs was 0.63 and the MAE ranged from 9.59 to 10.47.
In the ASM prediction, storage time was the condition present in all input combinations that best predicted the variable levels. A study carried out by Alencar et al. 36 verified that the ASM changed according to temperature and storage time conditions. According to the findings reported by Alencar et al. 36, the decrease in apparent specific mass occurred after 180 days of storage due to the increased metabolic activity of the grains, influenced by variations in moisture content and temperature of the stored seed mass. Figure 3 shows that the REPTree, M5P, and RF models, when associated with inputs T + P + ST, T + ST, P + ST, and ST, provided the highest r values and lowest MAE values. Importantly, while the ANN model had the best r  The results obtained indicated that storage time had a greater influence on the ASM, that is, it reduced the seed mass in relation to their volume. This loss occurs due to the oxidation reactions of the respiratory process of the seeds, which consume energy accumulated in the form of organic compounds such as sugars, starches, and others, effectively reducing the mass and, therefore, the weight of the seeds 5,8,9. This result indicates that the seeds suffered deterioration and losses in physiological quality. The ANN model can be used to predict the ASM variation.
Electrical conductivity. No significant difference was observed for EC among the models analyzed (Table 7). Even so, the r value for the LR model was lower than those of the other models tested. Regarding the different inputs for EC (Table 8), the combinations T + P + ST and T + ST had the highest r means (0.65 and 0.63, respectively), while the input T had the lowest r mean. For inputs T, T + P, P, P + ST, and ST, the MAE values did not differ among the models tested. The lowest means were verified for inputs T + P + ST and T + ST for the models REPTree, M5P, and RF (Table 9).
Considering that conditions (packaging, temperature, and relative humidity) and storage time can influence seed moisture content by causing seed drying or rewetting, it is expected that the prediction of electrical conductivity as a function of the input conditions tested indicates deterioration of cellular tissues and seed quality. Alencar et al. 36, when evaluating soybean quality by the electrical conductivity test, observed that the interaction between moisture content, temperature, and storage time was significant and influenced the quality of the seeds. Carvalho et al. 37 and Coradi et al. 38 observed that the most significant increase in conductivity of soybean seeds occurred after 180 days of storage, indicating changes in the cellular tissues of the seeds. Figure 4 shows that the T + P + ST inputs obtained the best MAE results (21.67) for the REPTree, M5P, and RF models. Similar results were observed for the T + ST inputs, where the MAE ranged from 21.95 to 22.01. Although the ANN model showed satisfactory r results, its MAE values did not differ from those of the LR model. Therefore, the effect of temperature associated with storage time had a greater influence on the deterioration of cell membranes determined by the electrical conductivity test. Random Forest was the algorithm that best predicted the electrical conductivity results, for the same reason described for moisture content: it used a smaller amount of data, making it possible to better handle overfitting problems.
Table 6. Clustering of means for the Pearson's correlation coefficient (r) and mean absolute error (MAE) between observed and estimated values of apparent specific mass in soybean seeds by different inputs. Means followed by the same letters in the same column do not differ by the Scott-Knott test at 5% probability. T temperature, P packaging, ST storage time.
Germination. The obtained and estimated values for soybean seed germination are presented in Tables 10 and 11. The inputs T, T + P, P, and ST did not show significant variation. The highest means for inputs T + P + ST and T + ST were obtained in the ANN, REPTree, M5P, and RF models, while for the REPTree model the best results were obtained at input P + ST.
Table 10 presents the unfolding of the significant interactions between the models and inputs, considering the observed and estimated seed germination values for MAE. The LR model obtained the highest MAE value (11.26) for the input combination T + P + ST. The REPTree and RF models had the lowest MAE (8.95) for the T + P + ST and T + ST combinations. For inputs T, T + P, P, and ST, the means were higher, as they were for input P + ST, where only the REPTree model showed a low mean (Fig. 5).
Table 7. Clustering of means for the Pearson's correlation coefficient (r) and mean absolute error (MAE) between observed and estimated values of electrical conductivity in soybean seeds by different learning models. Means followed by the same letters in the same column do not differ by the Scott-Knott test at 5% probability.
Table 9. Unfolding of the significant model × input interaction for mean absolute error (MAE) between the observed and estimated values of electrical conductivity in soybean seeds by different machine learning models and inputs. Means followed by equal lowercase letters in the same column and equal uppercase letters in the same row do not differ by the Scott-Knott test at 5% probability. T temperature, P packaging, ST storage time.
Table 10. Unfolding of the significant model × input interaction for Pearson's correlation coefficient (r) between the observed and estimated values of germination in soybean seeds by different machine learning models and inputs. Means followed by equal lowercase letters in the same column and equal uppercase letters in the same row do not differ by the Scott-Knott test at 5% probability. T temperature, P packaging, ST storage time.
High percentages of seed germination are obtained over storage time when seeds are stored at proper temperatures and in proper packaging 39. Coradi et al. 40 verified that artificially cooled soybean seeds maintained their physiological quality for 140 days of storage. Coradi et al. 41 observed that seeds stored in uncontrolled environments showed an increased respiration rate and accelerated deterioration. Table 11 shows that the inputs T + P + ST and T + ST gave the best germination results (r = 0.65 and r = 0.67, respectively) in the ANN, REPTree, M5P, and RF models. The LR model had a low performance (r = 0.48) for the inputs T + P + ST and T + ST, as did the ANN model for the input T + ST. The germination results followed the results obtained for moisture content, apparent specific mass, and electrical conductivity. However, in addition to the temperature and storage time factors, the relationship between storage time and packaging had a very significant influence on the physiological quality of the seeds. Among the models tested, the REPTree model stood out.

Vigor. The statistics obtained for the vigor variable (r and MAE) showed no significant interaction between models and inputs. However, the ANN model presented the highest mean MAE in relation to the others (Table 12), indicating that the ANN model overestimated the vigor values. Regarding the inputs tested, Table 13 shows that T + P + ST, T + ST, P + ST, and ST presented the highest mean r and the lowest mean MAE.
The results shown in Fig. 6 indicate that the REPTree, M5P, and RF models, when associated with the inputs T + P + ST, T + ST, P + ST, and ST, provided the highest r values (0.47 to 0.68) and the lowest MAE values. Ferreira et al. 9 found that seed storage at low temperatures (T + ST) reduced metabolic activity and maintained physiological quality. However, the choice of the combinations T + P + ST and T + ST was justified when analyzing the mean absolute errors. The MAE was 13.09 for T + P + ST and 13.24 for T + ST. Importantly, although the ANN obtained good r results with the aforementioned inputs, the model showed high MAE values for all inputs compared to the LR model.
Seed vigor was mainly influenced by temperature and storage time, as was the case for the other variables evaluated. RF was the model that best predicted the vigor indices of the seeds using a smaller amount of data. The superior performance of RF possibly occurred due to the internal structure of the algorithm, which is based on sets of multiple decision trees. RF regression has advantages when predictor or explanatory variables are highly correlated, which is especially true for the variables temperature and storage time evaluated here. Variable collinearity can be a critical problem in traditional prediction models that are derived from linear regression 21,42,43. Moreover, RF has been considered superior to other machine learning algorithms because it can easily handle many model parameters, reduces estimate bias, and is robust to overfitting 18. Recent studies have classified RF as an effective and versatile machine learning method for crop yield predictions 19. To our knowledge, there are no studies to date on predicting stored seed quality from conditioning variables using ML models. Our study shows that it is possible to obtain satisfactory accuracy in predicting quality variables of stored soybean seeds using computational intelligence techniques, especially by employing the RF model. Furthermore, our findings provide support for decision-making about which conditioning variables should be evaluated and included in such prediction models, contributing to more efficient soybean seed processing.
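The idea of averaging many decision trees can be illustrated with a deliberately simplified sketch: bootstrap-aggregated one-split regression trees (stumps) on a single predictor. This is not the Random Forest implementation used in the study, which also samples predictors at random and grows deeper trees; the function names and the data in the usage note are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no split is possible between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are equal: fall back to predicting the overall mean.
        overall = sum(ys) / len(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Ensemble prediction: the average of the individual tree predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)
```

As a usage example with invented values, fitting `fit_bagged_stumps([0, 3, 6, 9, 12], [95.0, 92.0, 85.0, 70.0, 60.0])` on storage months versus germination percentage yields an ensemble whose prediction at month 0 is higher than at month 12. Averaging over resamples is what smooths the step-like output of any single tree.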

Conclusion
The preservation of seed quality involves controlling the storage environment and using technologies, such as packaging, that reduce the metabolic activity of the seeds over time. In this study, which evaluated the prediction of the quality of soybean seeds stored in different environments and packaging using machine learning, it was concluded that:
a. The combination of the input variables temperature and storage time was the best predictor of soybean seed quality indices during the storage period. The input variable packaging did not influence the prediction of the physiological quality of soybean seeds. The packaging effect was suppressed by the low storage temperatures, allowing the same results to be achieved with a smaller number of input variables.
b. The ML techniques outperformed the proposed control model (linear regression). The Random Forest algorithm was the one that best predicted the physiological quality indices of the seeds during the storage period with a smaller amount of data, making it possible to better handle overfitting problems. On the other hand, the Artificial Neural Network had the highest errors (MAE).
Table 13. Clustering of means for the Pearson's correlation coefficient (r) and mean absolute error (MAE) between observed and estimated values of vigor in soybean seeds by different inputs. Means followed by the same letters in the same column do not differ by the Scott-Knott test at 5% probability. T temperature, P packaging, ST storage time.
The proposed approach stood out in terms of speed compared with the analysis methods routinely used, making the processes more robust and less costly to operate than the laboratory analysis strategies traditionally used. ML can be an auxiliary tool for decision-making within the seed storage environment, thereby contributing to loss reduction.