Using machine learning methods for supporting GR2M model in runoff estimation in an ungauged basin

Estimating monthly runoff variation, especially in ungauged basins, is inevitable for water resource planning and management. The present study aimed to evaluate the regionalization methods for determining regional parameters of the rainfall-runoff model (i.e., GR2M model). Two regionalization methods (i.e., regression-based methods and distance-based methods) were investigated in this study. Three regression-based methods were selected including Multiple Linear Regression (MLR), Random Forest (RF), and M5 Model Tree (M5), and two distance-based methods included Spatial Proximity Approach and Physical Similarity Approach (PSA). Hydrological data and the basin's physical attributes were analyzed from 37 runoff stations in Thailand's southern basin. The results showed that using hydrological data for estimating the GR2M model parameters is better than using the basin's physical attributes. RF had the most accuracy in estimating regional GR2M model’s parameters by giving the lowest error, followed by M5, MLR, SPA, and PSA. Such regional parameters were then applied in estimating monthly runoff using the GR2M model. Then, their performance was evaluated using three performance criteria, i.e., Nash–Sutcliffe Efficiency (NSE), Correlation Coefficient (r), and Overall Index (OI). The regionalized monthly runoff with RF performed the best, followed by SPA, M5, MLR, and PSA. The Taylor diagram was also used to graphically evaluate the obtained results, which indicated that RF provided the products closest to GR2M's results, followed by SPA, M5, PSA, and MLR. Our finding revealed the applicability of machine learning for estimating monthly runoff in the ungauged basins. However, the SPA would be recommended in areas where lacking the basin's physical attributes and hydrological information.

Precisely estimating hydrological parameters in the ungauged basin has drawn the attention of hydrologists and water resources engineering 1 . In meteorology, the assessment of runoff is extremely important 2 , especially in areas where there is no measuring station that cannot be calibrated. Therefore, the regionalization method is optional for transferring model parameters from the gauged basin to the ungauged basin 3,4 . The popular regionalization methods are physical similarity, spatial proximity, and regression 5 . Previously, many studies have been conducted and compared the performance of the regionalization methods to predict total streamflow or direct runoff 6 in the ungauged catchment with various hydrological models (WASMOD 7 , VIC 8 , SWAT 9 , GR4J 10 , HMETS 10 , MOHYSE 10 , and HEC-HMS 11 ) for different regions. Some of them showed that distance-based (spatial proximity, physical similarity) outperformed regression methods 7,8 . The combined watershed classification of inverse distance weighted (IDW) and physical similarity methods was investigated to predict streamflow in the ungauged catchment by Kanishka and Eldho 9 . Swain and Patra 3 pointed that the spatial proximity between the gauged catchment and the ungauged catchment gave better results than the physical similarity for predicting the continuous streamflow. Arsenault et al. 10 studied the efficacy of three regionalization methods: multiple linear regression (MLR), spatial proximity, and physical similarity, to predict current flow in ungauged catchments of Mexico. They showed that transferring a set of parameters from a nearby reservoir is the most efficient method www.nature.com/scientificreports/ moisture remains at level: S2 (mm). Some soil water infiltrates into the soil as subsurface water: P 2 (mm) and conglomerate with rainfall excess to be surface runoff: P 3 (mm). The surface runoff flows into the river combining with the rest water from the previous month: R (mm). The river runoff can change depending on the direction of water flowing into or out from the basin. Finally, the total runoff hydrograph is obtained.  www.nature.com/scientificreports/ Research methodology. In this study, the research methodology ( Fig. 3) consisted of four main steps, i.e., (1) the GR2M model's calibration and verification; (2) analysis of basin's hydrological data and physical attributes; (3) regionalization methods for estimating the GR2M model's parameters and their performance comparison; and (4) performance evaluation of the regional GR2M model parameters in estimating monthly runoff. The detailed information for each step can be explained as the following.
The GR2M model's calibration and verification. The model's calibration and verification were conducted to make the GR2M model reliable for estimating monthly runoff for 37 different runoff stations in the Southern Basins, Thailand. Before calibrating and verifying the GR2M model, it requires a warm-up period to determine the suitable initial values of X 1 and X 2 so that the model can imitate the existing hydrological characteristics of the considering basin. For doing this, the initial R value raining from 10 to 60 mm was sought. And the appropriate warm-up periods of 4 to 7 months were discovered, depending on the runoff station characteristics. In this study, the fitted values of X1 and X2 parameters for each runoff station were automatically determined with Microsoft Excel Solver's help by setting root mean square error (RMSE) as an objective function and the constraints of X1 and X2 parameters. The available monthly rainfall, evapotranspiration, and runoff data for each runoff station ranged from 41 to 80 months, resulting in the calibration and verification periods were 22 to 48 months and 10 to 39 months, respectively.
Analysis of basin's hydrological data, and physical attributes. We collected the monthly runoff (37 stations), rainfall (38 stations) and air temperature (13 stations) information from the Royal Irrigation Department (RID) and the Thai Meteorological Department (TMD). Figure 2 depicts the locations of rainfall, runoff, and weather. Areal rainfall and air temperature information for each runoff gauged station was analyzed by using Thiessen polygon. Table 1 shows the summary statistical values of hydrological data and physical characteristics of runoff gauged station used in this analysis. We used Thornthwaite 40 equation to calculate monthly evapotranspiration. Figure 4 shows the physical characteristics information of the 37 runoff gauged stations, including basin area (A), river length (L), and river length from the basin's centroid to the basin outlet (L c ) of runoff gauged station, were determined with the help of QGIS, a free and open-source geographic information system software. We examined the time matching to choose the appropriate times for calibrating and verifying the model. Hence, all periods for running the GR2M model were in the range of 41 to 80 months. And its calibration and verification periods were in the range of 22 to 49 months and 10 to 39 months, respectively.
Regionalization methods for estimating the GR2M model's parameters. Estimating the GR2M model's parameters in the ungauged basin was analyzed by using the regionalization concept. It is a method that transfers model parameters from donor catchments to the target station or ungauged catchments 41 . In this study, www.nature.com/scientificreports/ two regionalization methods, i.e., regression-based methods, and distance-based methods, were investigated and compared their performance.
Regression-based methods. Regression analysis was used to determine the relationship between the fitted GR2M model parameters and three basin's physical characteristics (A, L, and L c ), and thirteen hydrological data, including monthly average areal rainfall for 12 months, and annual average areal rainfall. Each fitted GR2M model parameter (i.e., X 1 and X 2 ) was a dependent variable, while the basin's physical characteristics and hydrological data were independent variables. We conducted three scenarios for selecting the most suitable group of independent variables, that is, (1) using only the basin's physical characteristics, (2) using only hydrological data, and (3) combining those mentioned variables in the both scenarios 1 and 2. Three regression-based methods were selected herein, Multiple Linear Regression (MLR), Random forest (RF), and M5 Model Tree (M5). The last two methods are based on a data-driven model.

Multiple linear regression analysis (MLR).
where y is the dependent variable, x i is the independent variable; a i is regression coefficient, b is constant of regression equation, and n is number of the independent variable. We utilized regression function in Microsoft excel to develop regionalized GR2M model parameter equations.

Random forest (RF).
Random forest (RF) popular modification of decision trees and one of the ensemble techniques, was first introduced by Breiman in early 2001 42 . It can use for data classification and regression. The advantage of RF is that it can find a series of complex relationships between predictors and responses without (1) y = a 1 x 1 + a 2 x 2 + · · · + a n−1 x n−1 + a n x n + b  www.nature.com/scientificreports/ any relationships between them by including decision trees 42 .RF creates several trees based on decision trees method, where every tree is produced by arbitrarily selecting training data set, called bagging process, and attributes (or features) from the input vector. By the voting method from the predictive output of every tree created, the model prediction is finally obtained. In regression, the tree predictor proceeds on numerical values as arbitrary to class labels used by the random forest classifier 43 . The most frequently used variable selection measures in tree induction are the Information Gain Ratio criterion 44 and Gini index 45 . Unlike the M5 model tree, full-grown RF trees are not pruned. One of the key advantages of random forest regression over the M5 model tree is that it is more flexible. The speculation error always converges as the number of trees grows, even if the tree isn't pruned, and overfitting isn't a concern because of the Strong Law of Large Numbers 43 . We used WEKA, free and open-source software, and all default RF parameters as recommended by WEKA in our study.
M5 model tree (M5). Quinlan 1992 44 irst developed the M5 model tree by employing a divide-and-conquer strategy to establish the relationship between independent and dependent variables. It can be applied to both qualitative (categorical) and quantitative variables. Building M5 involves three stages. The first stage involves the development of a decision tree by dividing the data set into subsets (or leaves). Second, to prevent an overfitted structure or weak generalizer, the overgrown tree is pruned, and linear regression functions are used to replace the pruned sub-trees. the overgrown tree is pruned and the pruned sub-trees are replaced by linear regression functions. The pruning method requires the merger of some of the lower sub-trees into one node. Finally, the smoothing process is used to compensate for the strong discontinuities that would undoubtedly exist between neighboring linear models on the trimmed trees' leaves, especially for some models with a small number of training samples. For regression-based methods, the suitable group of independent variables for determining two GR2M parameters was investigated. Thus, there were three scenarios, that is, (1) scenario-1: using only the basin's physical characteristics, (2) scenario-2: using only hydrological information, and (3) scenario-3: combining those mentioned variables in scenarios 1 and 2.
Distance-based methods. Distance-based methods are a method for determination hydrological model parameters in the ungauged basin by transferring their values from donor catchments to the target station or ungauged catchments. Two approaches are popular recommended: Inverse Distance Weighted (IDW), and Inverse Similarity Weighted (ISW). The IDW value depends on the proximity of the distance, whereas ISW value depends on the similarity of the physical characteristics 7 . By applying IDW and ISW concepts, Spatial Proximity Approach 25 and Physical Similarity Approach (PSA) were utilized respectively herein and can be concisely explained as follows: Spatial proximity approach (SPA). SPA is the method to select donor stations with a proximity distance to a target station 5 . The distance between a gauged station (or donor station) and ungauged stations (or a target station) can be determined by: where x g , x u are the latitude (UTM), y g , y u are the longitude (UTM); which g is donor station, and u is the target station, and D ug is the distance between g and u stations. The inverse distance weighted can be calculated as: W g_i is the inverse distance weighted, and n is the total number of donor stations. A parameter of the target station can be obtained by: where P ug is the parameter of target station, and p g_i is the parameter of donor station.
Physical similarity approach (PSA). PSA is the method based on the concept that catchments with similar physical characteristics would have similar hydrological behavior 46 .
where SI ug is the similarity index, CD g,i , CD u,i are the catchment descriptor of donor catchments to the target station; CD gi is the rage of ith catchment descriptor, k is the total number of catchment descriptor.
(2) www.nature.com/scientificreports/ W g_i (ISW) is the inverse similarity weighted, and n is the total number of donor stations.
where P ug is the parameter of target station, and p g_i is the parameter of donor station. Model performance in estimating the regional GR2M model parameters was compared using four statistical indices, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Pearson Correlation Coefficient (r), and Combined Accuracy (CA) 47 .
Evaluation of the regional GR2M model parameters applied in the GR2M model. Three performance criteria, including Nash-Sutcliffe Efficiency (NSE), Correlation Coefficient (r), and Overall Index (OI), and A Taylor diagram were used for evaluating the applicability of the GR2M Model. The details for each performance criteria can be delineated as the following: Nash-Sutcliffe Efficiency (NSE) is a prominent index for determining model correctness or model performance, as shown in the following equation: The NSE ranges from − α to 1. If the NSE is near to 1, the observed and calculated runoff are likely to be identical, or it is considered the most efficient or accurate 48 .
The correlation coefficient (r) shows agreement between two variables. The following equation can be used to compute the correlation coefficient between X and Y.
The r-value ranges from − 1 to 1. The plus sign (+) indicates the direct relation between observed and predicted values or vice versa 49 .
The overall index (OI) is a model performance criterion that gives the value between − ∝ to 1. The model's performance is prominent if OI approaches to 1 50 .
where Q obs is the observed runoff, Q cal is the calculated runoff, Q obs is the average observed runoff, Q cal is the average calculated runoff, Q obs,max is the maximum observed runoff, Q obs,min is the minimum observed runoff, and n is the number of runoff data.
A Taylor diagram was used herein to comparatively elaborate and evaluate the efficacy among the developed models. This diagram can simultaneously show three statistic parameters, i.e., correlation, root mean square error, and standard deviation.

Results and discussion
This section presents our finding as follows: (1) calibrating and validating the GR2M Model and its fitted values, (2) the suitable group of independent variables for determining two GR2M parameters using regression-based methods, (3) model performance comparison in estimating the regional GR2M model parameters, and (4) evaluation of the regionalized parameters applied in the GR2M model. The details information is delineated as follows.

Calibrating and validating the GR2M Model and its fitted values. The calibrating and validating
results of the GR2M model is depicted as box plot in Fig. 5. The calibrated NSE, r, and OI values were 0.657, 0.825, and 0.757, respectively, and the verified NSE, r, and OI values were 0.449, 0.743 and 0.599, respectively. It was a satisfactory model prediction as suggested by Lian et al. 51 . The obtained r value of more than 0.70 showed a strong positive linear relationship between the calculated and observed runoff data 52 . The OI value of more than 0.60 showing the model had a relatively high accurate prediction. Figure 6 shows two examples of rainfall and runoff time series at the X.64 and X.70 stations, which were obtained from the GR2M model. The monthly rainfall time-series data is shown as a bar chart in blue. The observed and calculated runoff time-series data are shown with the line graphs in orange and green, respectively. And the solid and dot lines indicate the calibrated and validated periods, respectively. The agreement between observed and calculated runoff time-series data with a bit of underestimating the calculated runoff is found.
The statistical values of the fitted GR2M model parameters (X 1 and X 2 ) for 37 runoff stations are displayed in Table 2. The production store capacity (X 1 ) value varies between the allowable minimum (2.00 mm) and maximum (10.00 mm) with the average and standard deviation values of 5.71 mm and 2.49 mm, respectively. www.nature.com/scientificreports/  www.nature.com/scientificreports/ The skewness and kurtosis X 1 values of − 0.52, and − 1.03, respectively, indicated that the production store capacity (X 1 ) in the southern river basin, Thailand, has left skew platykurtic, and non-symmetric distributions. The groundwater exchange rate (X 2 ) value varies between 0.54 and 1.00. Most X 2 values are 1.0, which is the maximum value, resulting in its average X2 value of 0.93 with a meagre standard deviation value of 0.12. The skewness and kurtosis values of X 2 were − 2.01, and 3.69, respectively, indicated that the groundwater exchange rate (X 2 ) in the southern river basin, Thailand, has left skew, leptokurtic, and non-symmetric distributions. It can observe that the positive obtained groundwater exchange rate (X2) value. Thus, it shows no groundwater runs out of the basin.
The suitable group of independent variables for determining two GR2M parameters using regression-based methods. The results of investigating three scenarios for selecting the most suitable group of independent variables, that is, (1) scenario-1: using only the basin's physical characteristics, (2) scenario-2: using only hydrological information, and (3) scenario-3: combining those mentioned variables in scenarios 1 and 2. The results as shown in Table 3 indicates that scenario-1 received the worse performance for all cases due to giving the highest CA values than other cases. The italic number in Table 3 shows the scenario giving the best performance. For developing the regionalized X 1 and X 2 equations with MLR, we found that the most suitable group of independent variables was scenario-3. Also, the scenario-3 used for developing the regionalized X 1 equation in which RF was the best independent variables. In scenario-2, the regionalized X 2 equation developed by RF gave the best performance. By using the scenario-2, the regionalized X 1 equation developed by M5 was the best. In scenario-3, the regionalized X 2 equation developed by M5 gave the best performance. The explicit equations for estimating X 1 and X 2 values using MLR and M5 were shown in Eqs. (11) to (14). It should be noticed that the equation for X 2 obtained from M5 method excluded variables of the basin's physical characteristics, although scenario-3 was selected the best one. RF is a machine learning algorithm and it has no an explicit equation like MLR and M5.

MLR.
(11)  where A = basin area, L = river length, L c = river length from the basin's centroid to the basin outlet, RF 1, RF 2 , RF 3 , …, and RF 12 = average monthly rainfall in January, February, March, …, and December, respectively, and RF y = average annual rainfall.
Model performance comparison in estimating the regional GR2M model parameters. In this section, the application of five methods (i.e., MLR, RF, M5, SPA, and PSA) were applied for developing the regionalized GR2M parameters, which are presented and discussed. The first three methods are based on the regression-based method and the rest two methods are distance-based method. Comparison of the fitted X 1 and X 2 parameters in the GR2M model and the two parameters obtained from those five methods was conducted.
The results of applying those five methods to estimate X 1 and X 2 values are summarized in Table 4 for all 37 runoff stations. It indicated RF gave the best performance for estimating X 1 due to providing the lowest CA value, following by MLR, M5, SPA, and PSA, respectively. Likewise, RF gave the best performance for estimating X 2 , followed by M5, MLR, SPA, and PSA.
Evaluation of the regionalized parameters applied in the GR2M model. This section aims to present the performance evaluation of the regionalized GR2M parameters developed in the previous section. Those parameters, areal monthly rainfall and evapotranspiration were used as input parameters for the GR2M model. With the same input data sets of areal monthly rainfall, evapotranspiration and different X 1 and X 2 values obtained from five different methods, we got five monthly runoff time-series as the GR2M model's output data. Those monthly runoff time-series were compared to that of the calibrated and validated GR2M model. Figure 7 shows the box plot graph, which was obtained from evaluating the GR2M model's effectiveness by using the regionalized GR2M parameters. Also, Table 5    www.nature.com/scientificreports/ applying X 1 and X 2 values in the GR2M model with five regionalized methods. As usual, the calibrated model's performance for all methods was better than those of the validated ones. Figure 7 and Table 5 show that the average values of NSE, r, and OI obtained from the calibration stage gave better values than those obtained from the validation stage. In addition, RF gave the best results when considering NSE and OI values for both calibration and validation stages. Table 6 shows the number of runoff stations that were categorized into four groups with the same interval for each statistical index. By this way, we can see and compare the five methods' effectiveness in our experiment easily. Considering NSE, r, and OI values of equal or more than 0.70 simultaneously, we found that RF gave the best performance in monthly runoff estimation due to providing the highest total number of runoff stations of 60 (i.e., NSE, r, and OI values are equal or more than 0.70 simultaneously in those 60 stations), followed by SPA (53 stations Figure 9 presents a Taylor diagram that compares among five regionalized GR2M model and the calibrated and validated GR2M model. As shown in Fig. 9, all models gave a standard deviation value less than that of the observed runoff time series, except for PSA. RF provided the results closest to GR2M's results, followed by SPA, M5, PSA, and MLP. However, in case of lack of basin's physical characteristics and hydrological data, it would recommend using SPA since it only needs information on the distance between a gauged station (or donor station) and ungauged stations (or a target station).

Conclusion
The performance investigation of the regionalized GR2M model parameters for estimating monthly runoff in the ungauged basin was conducted in this research work. We selected 37 runoff gauged stations located in the southern basin, Thailand, as the study case. The regression-based and distance-based methods were applied for this purpose. Using regression-based methods to determine two GR2M parameters, the hydrological data was more suitable group of independent variables than the basin's physical characteristics. We also found that RF gave the best performance for estimating X 1 and X 2 values due to providing the lowest error, followed by M5, MLR, SPA, and PSA. However, by simultaneously considering NSE, r, and OI values, RF provided the best performance in estimating monthly runoff time series by giving NSE, r, and OI values of equal or more than 0.70, followed by SPA, M5, MLR, and PSA. Furthermore, by using a Taylor diagram, we found that RF provided the results closest to GR2M's results, followed by SPA, M5, PSA, and MLP. However, in case of lack of basin's physical characteristics and hydrological information, it would recommend using SPA since it only needs information on the distance between a gauged station (or donor station) and ungauged stations (or a target station). Estimating monthly runoff time series in the ungauged basin via the regionalization methods could be drastically useful for water resources planning and management. www.nature.com/scientificreports/  www.nature.com/scientificreports/