Coupling machine learning and crop modeling improves crop yield prediction in the US Corn Belt

This study investigates whether coupling crop modeling and machine learning (ML) improves corn yield predictions in the US Corn Belt. The main objectives are to explore whether a hybrid approach (crop modeling + ML) would result in better predictions, investigate which combinations of hybrid models provide the most accurate predictions, and determine the features from the crop modeling that are most effective to be integrated with ML for corn yield prediction. Five ML models (linear regression, LASSO, LightGBM, random forest, and XGBoost) and six ensemble models have been designed to address the research question. The results suggest that adding simulation crop model variables (APSIM) as input features to ML models can decrease yield prediction root mean squared error (RMSE) from 7 to 20%. Furthermore, we investigated partial inclusion of APSIM features in the ML prediction models and we found soil moisture related APSIM variables are most influential on the ML predictions followed by crop-related and phenology-related variables. Finally, based on feature importance measure, it has been observed that simulated APSIM average drought stress and average water table depth during the growing season are the most important APSIM inputs to ML. This result indicates that weather information alone is not sufficient and ML models need more hydrological inputs to make improved yield predictions.

1. Explore whether a hybrid approach (simulation crop modeling + ML) would result in better corn yield predictions in three major US Corn Belt states (Illinois, Indiana, and Iowa); 2. Investigate which combinations of hybrid models (various ML x crop model) provide the most accurate predictions; 3. Determine the features from the crop modeling that are most relevant for use by ML for corn yield prediction. Figure 1 depicts the conceptual framework of this paper. The remainder of this paper is organized as follows. "Materials and methods" section describes the methodology and the materials used in this study, and "Results" section presents and discusses the results and the possible improvements. "Discussion" section discusses the analysis and findings and finally, "Conclusion" section concludes the paper.

Materials and methods
Since the main objective is to evaluate the performance of a hybrid simulation-machine learning framework in predicting corn yield, this section is split into two parts. The first describes the Agricultural Production Systems sIMulator (APSIM) and the second the Machine learning (ML) algorithms. Each of them explains the details of the prediction/forecasting framework, including the inputs to the models, the data processing tasks, the details of selected predictive models, and evaluation metrics used to compare the results, for simulation and machine learning.

Agricultural Production Systems sIMulator (APSIM). APSIM run details. The Agricultural Produc-
tion Systems sIMulator 37 (APSIM) is an open source advanced simulator of cropping systems. It includes many crop models along with soil water, C, N, crop residue modules, which all interact on a daily time step. In this project, we used the APSIM maize version 7.9 and in particular the calibrated model version for US Corn Belt environments as outlined by Archontoulis et al. 1 that includes simulation of shallow water tables and inhibition of root growth due to excess water stress 38 and waterlogging functions 39 . Within APSIM we used the following modules: maize 40 , SWIM soil water 41 , soil N and carbon 42 , surface residue 42,43 , soil temperature 44 and various management rules to account for tillage and other management operations. The crop models simulate potential biomass production based on a combined radiation and water use efficiency concept. This potential is reduced to attainable yields by incorporating water and nitrogen limitation to crop growth (For additional information, we refer to www.apsim .info).
To run APSIM across the three states Illinois, Indiana, and Iowa, we used the parallel system for integrating impact models and sectors (pSIMS) software 45 . pSIMS is a platform for generating simulations and running point-based agricultural models across large geographical regions. The simulations used in this study were created on a 5-arcminute grid across Iowa, Illinois and Indiana considering only cropland area when creating soil profiles. Soil profiles for these simulations were created from Soil Survey Geographic database (SSURGO) 46 , a soil database based off of soil survey information collected by the National Cooperative Soil Survey. Climate information used by the simulations came from a synthetic weather data set called "IEM Reanalysis", which was engineered at Iowa Environmental Mesonet (mesonet.agron.iastate.edu). This database is developed from a combination of several weather sources. The temperature data comes from National Weather Service Cooperative Observer Program (NWS COOP) observers (www.weath er.gov/coop). The precipitation data is derived from radar-based estimates of National Oceanic and Atmospheric Administration Multi-Radar / Multi-Sensor System (NOAA MRMS) (www.nssl.noaa.gov/proje cts/mrms), Oregon State's PRISM data set (https ://prism .orego nstat e.edu/), and NWS COOP reports. Finally, the radiation data comes from NASA POWER (power.larc.nasa. gov). The synthetic product was tested against point weather stations and proved accurate (see more information here: https ://crops .exten sion.iasta te.edu/facts /weath er-tool). Current management APSIM model input databases include changes in plant density, planting dates, cultivar characteristics and N fertilization rate to corn from 1984 to 2019. Planting date and plant density data derived from USDA-NASS 47 . Cultivar traits data derived through regional scale model calibration. N fertilizer data derived from a combined analysis of USDA-NASS 47 and Cao et al. 48 including N rates to corn by county and by year. Over the historical period, 1984-2019, APSIM captured 78% of the variability in the NASS yields having a RMSE of 1 Mg/ha and RRMSE of 10% (See Fig. 2). This version of the model is used to provide outputs to the machine learning.
APSIM output variables used as inputs to ML models. The first step to combine the developed data set with APSIM variables was to extract all APSIM simulations from its outputs and prepare the obtained data to be added to the mentioned data set. The APSIM outputs include 22 variables (the details are presented in Table 1). The granularity level for the APSIM variables was different from USDA obtained data, as the APSIM variables made at 5 arc (approximately 40 fields within a county). Therefore, to calculate a county-level value for each of them, the median of all corresponding values is used. The reason to use median instead of a simple average is • Imputing zero values with the average of other values of the same feature • Removing rows with missing values • Normalizing the data to be between 0 and 1 • Cross-referencing the new data with the developed data set.
Then, all feature selection procedures explained in "Data pre-processing" section were executed on the newly created data set to keep only the variables that carry the most relevant information for the prediction task.
The developed data set considers data from 1984 to 2018. The data from three years, namely 2012, 2017, and 2018 are in turn considered as the test data and for each scenario, the training data is set to be the data from the other years. In essence, we considered average to wet years (2017 and 2018) and an extremely dry year (2012) as the test years to assess the model performance in all situations.
Machine learning (ML). The machine learning models are developed using a data set spanning from 1984 to 2018 to predict corn yield in three US Corn Belt states (Illinois, Indiana, and Iowa). The data set is comprised of the environment (soil and weather) and management as input variables, and actual corn yields for the period under study as the target variable. The input data are comprised of weather, management, and soil data 1 . Environment data includes several soil parameters at a 5 km resolution 46 and weather data.
Data set. The county-level historical corn yields were downloaded from the USDA National Agricultural Statistics Service 47 for years 1984-2018. A data set including observed information of the environment, management, and yields was developed, which consists of 10,016 observations of yearly average corn yields for 293 counties. The factors that mainly affect crop yields are alleged to be the environment, genotype, and management. To this end, weather and soil as environmental features and plant population and planting progress as management features were included in the data set. It should be noted that data preprocessing has been designed to address the increasing trends in yields due to technological and genotypic advances over the years 49,50 . This is mainly due to that there is no publicly available genotype data set. The data set with 598 variables (including target variable) are described below.  Data pre-processing. Several pre-processing tasks were conducted to ensure the data is prepared for fitting machine learning models. Since it is favorable for some machine learning models especially weighted ensemble models for the data input to have similar ranges, the first pre-processing task was to scale the input data between 0 and 1 using min-max scaling. The most common scaling methods include min-max scaling and normalization, from which min-max scaling is selected as it keeps the distributions of the input variables. The next preprocessing tasks include adding yearly trends, cumulative weather feature construction, and feature selection.
Add yearly trends feature. Figure 3 suggests an increasing trend in the yields over time. It is evident that there is no input feature in the developed data set that can explain this observed increasing trend in the corn yields. This trend is commonly described as the effect of technological gains over time, such as improvements in genetics (cultivars), management 51 , equipment, and other technological advances 52,53 .
Therefore, to account for the trend as mentioned above, the following actions were taken.
A new feature (yield_trend) was constructed that only explained the observed trend in corn yields. For building this new feature, a linear regression model was built for each location as the trends for each site tend to be different. The year ( YEAR ) and yield ( Y ) features formed the independent and dependent variables of this linear regression model, respectively. Then the predicted value for each data point ( Ŷ ) is added as a new input variable that explains the increasing annual trend in the target variable. Only training data was used for fitting this linear regression model and the corresponding values of the newly added feature for the test set is set to be The following equation shows the trend value ( Ŷ i ) calculated for each location ( i ), that is added to the data set as a new feature.
Aggregated and cumulative weather feature construction. To provide more climate information for the machine learning models, additional weather features were constructed that include cumulated values of the existing weather features. The aggregated precipitation, growing degree days, and shortwave radiation features are computed from summation of weather features, while the aggregated minimum and maximum temperature features come from average of the existing values. There are two sets of new cumulative weather features: Quarterly weather features (20 features), and cumulative quarterly weather features (15 features).
Feature selection. Since the data developed data set has a large number of input variables and is prone to overfitting, feature selection becomes necessary to build generalizable machine learning models. A two-stage feature selection procedure was performed to select the most essential features in the data set and prevent the machine learning models from overfitting on the highly dimensional training data. The two steps to perform feature selection were feature selection based on expert knowledge, and permutation feature selection using random forest.
i. Feature selection based on expert knowledge Using expert knowledge, weather features were reduced by removing features for the period between the end of harvesting and the beginning of next year's planting. Additionally, the number of planting progress features were lowered by eliminating the cumulative planting progress for the weeks before planting, as they did not include useful information. The feature selection based on expert knowledge could reduce the number of features from 550 to 387. ii. Permutation feature selection with random forest Strobl 54 pointed out that the default random forest variable importance (impurity-based) is not reliable when dealing with situations where independent variables have different scales of measurement or different number of categories. This is specifically important for biological and genomic studies where independent variables are often a combination of categorical and numeric features with varying scales. Therefore, to overcome this bias and find decisive importance of input features, permutation feature importance is decided to be used 55 .
Permutation feature importance measures the importance of an input feature by calculating the decrease in the model's prediction error when one feature is not available 56 . To make the unavailability of one feature possible, each feature is permuted in the validation or test set, that is, its values are shuffled, and the effect of this permutation on the quality of the predictions is measured. Specifically, if permutation increases the model error, the permuted feature is considered important, as the model relies on that feature for prediction. On the other hand, if permutation does not change the prediction error significantly, the feature is thought to be unimportant, as the model ignores it for making the prediction 57 .
The second stage of feature selection and likely the most effective one, includes fitting a random forest model with 100 number of trees as the base model and calculating permutation importance of input features with 10 times of repetition and considering a random tenfold cross-validation schema. It should be noted that the number of trees hyperparameter of this random forest model is tuned using a tenfold cross-validation. Afterward, the top 80 input features were selected in the second stage of feature selection.
Model selection. Tuning hyperparameters of machine learning models and selecting best models with optimal hyperparameter values is necessary to achieve high prediction accuracies. Cross-validation is commonly used to evaluate the predictive performance of fitted models by dividing the training set to train and validation subsets. Here, we use a random tenfold cross-validation method to tune the hyperparameter of ML models.
Grid search is an exhaustive search method that tries all the possible combinations of hyperparameter settings to find the optimal selection. It is both computationally expensive and generally dependent on the initial values specified by the user. However, Bayesian search addresses both issues and is capable of tuning hyperparameters faster and using a continuous range of values.
Bayesian search assumes an unknown underlying distribution and tries to approximate the unknown function with surrogate models such as Gaussian process. Bayesian optimization incorporates prior belief about the underlying function and updates it with new observations. This makes tuning hyperparameters faster and ensures finding an acceptable solution, given that enough number of observations are observed. In each iteration, Bayesian optimization gathers observations with the highest amount of information and intends to make a balance between exploration (exploring uncertain hyperparameters) and exploitation (gathering observations from hyperparameters close to the optimum) 58 . That being so, to tune hyperparameters, Bayesian search with 20 iterations was selected as the search method under tenfold cross-validation procedure.
Predictive models. In this study, we combine diverse models in different ways and create ensemble models to make a robust and precise machine learning model. One prerequisite for creating well-performing ensemble models is to show a particular element of diversity in the predictions of base learners as well as preserve excellent performance individually 59 . Thus, several base learners made with different procedures were selected and trained, including linear regression, LASSO regression, Extreme Gradient Boosting (XGBoost), Light Gradient www.nature.com/scientificreports/ Boosting Machine (LightGBM), and random forest. Moreover, an average weighted ensemble that assigns equal weights to all base learners is the simplest ensemble model created. Additionally, optimized weighted ensemble method proposed in Shahhosseini et al. 60 was applied here to test its predictive performance. Several two-level stacking ensembles, namely stacked regression, stacked LASSO, stacked random forest, and stacked LightGBM, were built, which are expected to demonstrate excellent performance. The details of each model can be found at Shahhosseini et al. 61 .
Linear regression. Linear regression intends to predict a measurable response using multiple predictors. It assumes the existence of a linear relationship between the predictors and response variable, normality, no multicollinearity, and homoscedasticity 62 .
LASSO regression. LASSO is a regularization method that is equipped with in-built feature selection. It can exclude some variables by setting their coefficient to zero 62 . Specifically, it adds a penalty term to the linear regression loss function, which can shrink coefficients towards zero (L1 regularization) 63 .
XGBoost and LightGBM. XGBoost and LightGBM are two implementations of gradient boosting tree-based ensemble methods. These types of ensemble methods make predictions sequentially and try to combine weak predictive tree models and learn from their mistakes. XGBoost was proposed in 2016 with new features, such as handling sparse data, and using an approximation algorithm for a better speed 64 , while LightGBM was published in 2017 by Microsoft, with improvements in performance and computational time 65 .
Random forest. Random forest is built on the concept of bagging, which is another tree-based ensemble model. Bagging tries to reduce prediction variance by averaging predictions made by sampling with replacement 66 . Random forest adds a new feature to bagging, which is randomly choosing a random number of features and constructing a tree with them and repeating this procedure many times and eventually averaging all the predictions made by all trees 59 . Therefore, random forest addresses both bias and variance components of the error and is proved to be powerful 67 .
Optimized weighted ensemble. An optimization model was proposed in Shahhosseini et al. 60 , which accounts for the tradeoff between bias and variance of the predictions, as it uses mean squared error (MSE) to form the objective function for the optimization problem 68 . In addition, out-of-bag predictions generated by k-fold crossvalidation are used as emulators of unseen test observations to create the input matrices of the optimization problem, which are out-of-bag predictions made by each base learner. The optimization problem, which is a nonlinear convex problem, is as follows.
where w j is the weights corresponding to base model j ( j = 1, . . . , k ), n is the total number of instances, y i is the actual value of observation i , and ŷ ij is the prediction of observation i by base model j.
Average weighted ensemble. Average weighted ensemble, which we call "average ensemble", is a simple average of out-of-bag predictions made by each base learner. The average ensemble can perform well when the base learners are diverse enough 59 .
Stacked generalization. Stacked generalization tries to combine multiple base learners by performing at least one more level of learning task, that uses out-of-bag predictions for each base learner as inputs, and the actual target values of training data as outputs 69 . The out-of-bag predictions are generated through a k-fold crossvalidation and have the same size of the original training set 70 . The steps to design a stacked generalization ensemble are as follows.
(a) Learn first-level machine learning models and generate out-of-bag predictions for each of them by using k-fold cross-validation. (b) Create a new data set with out-of-bag predictions as the input variables and actual response values of data points in the training set as the response variable. (c) Learn a second-level machine learning model on the created data set and make predictions for unseen test observations.
(2) www.nature.com/scientificreports/ Considering four predictive models as the second-level learners, four stacking ensemble models were created, namely stacked regression, stacked LASSO, stacked random forest, and stacked LightGBM.

Performance metrics.
To evaluate the performance of the developed machine learning models, three statistical performance metrics were used. These metrics together provide estimates of the error (RMSE, RRMSE, MBE) and of the variance explained by the models (R 2 ).

Results
Numerical results of hybrid simulation-ML framework. Table 2 shows the test set prediction errors of the 11 developed ML models for the benchmark (the case that no APSIM variable is added to the data set) and the hybrid simulation-ML (where all 22 APSIM outputs are added to the data set) cases. The relative RMSE (RRMSE) is calculated using the average corn yield value of the test set (see Table S1 in the supplementary material). Adding APSIM variables as input features to ML models improved the performance of the 11 developed ML models. In terms of RMSE, the hybrid model boosted ML performance up to 27%. In addition, comparing the lowest prediction errors (RMSE) of the benchmark and the hybrid scenario, we found that the use of hybrid models achieved 8%-9% better corn yield predictions.
Looking at the average test results (Fig. 4), it can be observed that adding APSIM features makes improvements to all designed ML models. Moreover, considering the smallest decrease in the prediction error (RRMSE) which is the worst-case scenario and is obtained by LASSO model, the hybrid model still is proved to be better than the benchmark. Another observation is the superiority of weighted ensemble models compared to other ML models. It should be noted that the negative R 2 value of some models (XGBoost, Stacked Random forest, and Stacked LightGBM) when having no APSIM variables shows that this models' predictions are worse than taking the mean value as the predictions.
On average, stacked ensemble models benefit the most from inclusion of APSIM outputs in predicting corn yields. Besides, considering Mean Bias Estimate (MBE) values of the ML models, we can observe that all ML models presented less biased predictions after having APSIM information in their inputs and it seems that inclusion of APSIM variables helped reducing the prediction bias significantly. Figure 5 illustrates the goodness of fit of some of the designed ML models for two benchmark and hybrid cases for the test year 2018. As mentioned above, the advantage of including APSIM variables in the machine learning algorithms is the better distribution of the residuals (deviation from the 1:1 line) which decreased overall prediction bias. See FigureS1 from supplementary material for additional summary statistics.

Models performance on an extreme weather year (2012).
To assess the performance of the trained models on an extreme weather year, here the data from the year 2012, which was an exceptionally dry year, is considered as unseen test observations and the quality of the predictions made by the benchmark and the hybrid models are compared. Table 3 demonstrates lower prediction accuracy of the models in year 2012 (extreme dry year) compared to average to wet years model predictions (2017 and 2018, see Table 2). This result was consistent for both ML and hybrid models. However, the hybrid model managed to provide improvements over the benchmark in the 2012 year. This was ranging from 5 to 43% decrease in the prediction RMSE. Comparing the best model of the benchmark (LightGBM) with the best model of the hybrid scenario (Stacked regression ensemble), we observed that the use of hybrid model provided 22% better predictions.

Partial inclusion of APSIM variables. This section investigates the effect of partial inclusion of APSIM
variables considering three different scenarios for the test year 2018 (see Table 4). The scenarios are (1) include only phenology-related APSIM variables (silking date and physiological maturity date); (2) include only croprelated APSIM variables (crop yield, biomass, maximum rooting depth, maximum leaf area index, cumulative transpiration, crop N uptake, grain N uptake, season average water stress (both drought and excessive water), and season average nitrogen stress), and (3) include soil and weather-related APSIM variables (annual evapotranspiration, growing season average depth to the water table, annual runoff, annual drainage, annual gross N mineralization, total N loss that accounts for leaching and denitrification, annual average water table depth, ratio of soil water to field capacity during the growing season at 30, 60, and 90 cm profile depth). When including only phenology-related APSIM variables, results demonstrate that stacked regression ensemble model makes the best predictions, while the least biased predictions are generated from stacked random forest ensemble.
In case of having crop-related APSIM variables as ML inputs, results indicate that stacked regression and stacked random forest ensembles make the best and the least biased predictions, respectively.
When the soil and weather-related APSIM variables are considered as ML inputs, the results show that stacked regression ensemble makes decent predictions with having the least amount of prediction error as well as bias.
Scientific Reports | (2021) 11:1606 | https://doi.org/10.1038/s41598-020-80820-1 www.nature.com/scientificreports/ Table 4 presents the test set prediction errors of designed ML models for all three scenarios of partial inclusion of APSIM variables. Overall results indicate that soil and weather-related APSIM variables as well as crop-related variables have a more significant influence on the predictions made by ML. This is interesting and is partially explained by the fact that ML somehow already accounts for phenology-related parameters, which are largely weather-driven, while the soil-related parameters are more complicated parameters that ML alone cannot see. This is more evident in Fig. 6. Furthermore, it can be observed that some of the soil and weather-related ensemble models provide improvements over the models that we developed earlier including all APSIM variables. This result suggests that not all the included APSIM variables have useful information for ML yield prediction.
Variable importance. The permutation importance (see also "Data pre-processing" section) of five individual base models (linear regression, LASSO regression, XGBoost, LightGBM, and random forest) was calculated using the test data of the year 2018. Figure 8 depicts the top-15 normalized average permutation importance of these ML models. It should be noted that due to black-box nature of ensemble models, only individual learners were used to calculate permutation importance. Figure 7 indicates that the most important input feature for ML models is "yield_trend" which is the feature we constructed for explaining the increasing trend in corn yields and incorporated technological advances over To find out which APSIM features have been more influential in predicting yields, the average permutation importance of five individual models (linear regression, LASSO regression, LightGBM, XGBoost, and random forest) was calculated for each test year. Figure 8 demonstrates the ranking of top 10 APSIM features. Results indicate that the AvgDroughtStress, AvgWTInseason, and CropYield were the most important features for machine learning models to predict yield. Most of these are water-related features suggesting the importance of soil hydrology in crop yield prediction in the US Corn belt. This result was consistent across three years, including the year drought 2012, in which model prediction was lower than the other years.  www.nature.com/scientificreports/

Discussion
We proposed a hybrid simulation-machine learning approach that provided improved county-scale crop yield prediction. To the best of our knowledge, this is the first study that designs ensemble models to increase corn yields predictability. This study demonstrated that introducing APSIM variables into machine learning models and utilizing them as inputs to a prediction task on average can decrease the prediction error measure by RMSE between 7 and 20%. In addition, the predictions made by the hybrid model show less bias toward actual yields.
Other studies in this area, are mainly limited in coupling simplest statistical models, i.e. linear regression variants,  www.nature.com/scientificreports/ with simulation crop models and apart from two recent studies 35,36 there has been no study combining machine learning and simulation crop models. Considering the hybrid models, some of the developed models provided predictions with RRMSE values as small as 6-7%. This indicates that the developed models outperform the corn yield prediction models developed in the literature 24,[72][73][74][75] .
In addition to the prediction advantages achieved by coupling ML and simulation crop modelling, we investigated the value of different types of APSIM variables in the ML prediction and found out that soil water related APSIM variables contributed the most in improving yield prediction. The inclusion of APSIM consistently improved ML yield prediction in all years (2012,2017,2018). We also noticed that neither ML nor the hybrid model could sufficiently predict yields of the 2012 dry year. This suggests that more work is needed to adequately predict yields in extreme weather years, which are expected to increase with climate change [76][77][78][79] , but we noticed that yield prediction of the dry year was better done by the hybrid model. Developing models that are more robust to extreme values, including additional climate information that can help the model to detect the drought, and including remote sensing data can be future research directions.
Designing a method that enables the ML models to capture the yearly increasing trends in corn yields was the main challenge of this work. To address this challenge, an innovative feature was constructed that could explains the trend to a great extent and as the variable importance results showed, it is by far the most important input feature for predicting corn yields.
The significant merits of coupling ML and simulation crop models shown in this study raise the question that whether the ML models can further benefit from addition of more input features from other sources. Hence, a possible extension of this study could be inclusion of remote sensing data into the ML prediction task and investigate the level of importance each data source can exhibit.
It should be also acknowledged that APSIM simulations that used as inputs to ML model leveraged the full weather of each test year. In real word applications, the weather will be unknown and the APSIM model would  www.nature.com/scientificreports/ need to run in a forecasting mode 1,12,80 introducing some additional uncertainty. This is something to be explored further in the future.

Conclusion
We demonstrated improvements in yield prediction accuracy across all designed ML models when additional inputs from a simulation cropping systems model (APSIM) are included. Among several crop model (APSIM in this study) variables that can be used as inputs to ML, analysis suggested that the most important ones were those related to soil water, and in particular growing season average drought stress, and average depth to water table.
We concluded that inclusion of additional soil water related variables (either from simulation model or remote sensing or other sources) could further improve ML yield prediction in the central US Corn Belt. www.nature.com/scientificreports/