Toward the accurate estimation of elliptical side orifice discharge coefficient applying two rigorous kernel-based data-intelligence paradigms

In the present study, two kernel-based data-intelligence paradigms, namely Gaussian Process Regression (GPR) and Kernel Extreme Learning Machine (KELM), along with Generalized Regression Neural Network (GRNN) and Response Surface Methodology (RSM) as validation schemes, were employed to estimate the elliptical side orifice discharge coefficient in rectangular channels. A total of 588 laboratory data points covering various geometric and hydraulic conditions were used to develop the models. The discharge coefficient was considered as a function of five dimensionless hydraulic and geometric variables. The results showed that the machine learning models used in this study outperformed the regression-based relationships. Among the machine learning models, GPR (RMSE = 0.0081, R = 0.958, MAPE = 1.3242%) and KELM (RMSE = 0.0082, R = 0.9564, MAPE = 1.3499%) provided the highest accuracy. Based on the RSM model, a new practical equation was developed to predict the discharge coefficient. The sensitivity analysis of the input parameters showed that the main channel width to orifice length ratio (B/a) has the most significant effect on the discharge coefficient. The leverage approach was applied to identify outliers and the applicability domain of the models.

Figure 1 shows an elliptical side orifice and its geometric parameters. Based on the variables affecting the discharge coefficient of the elliptical side orifice, the following relation can be written:

$$C_d = f_1(a, b, W, B, y_1, V_1, g, \rho, \sigma, \mu) \qquad (1)$$

Using Buckingham's π theorem, the effective dimensionless parameters can be obtained as follows:

$$C_d = f_2\left(\frac{W}{b}, \frac{B}{a}, \frac{B}{b}, \frac{b}{y_1}, Fr_1, Re, We\right) \qquad (2)$$

Given that gravity is the dominant force in open channels, the effects of viscosity and surface tension can be ignored 5,55 , so the Reynolds and Weber numbers can be removed from the above equation:

$$C_d = f_3\left(\frac{W}{b}, \frac{B}{a}, \frac{B}{b}, \frac{b}{y_1}, Fr_1\right) \qquad (3)$$

In the present study, the laboratory data of Vatankhah and Rafeifar 3 , comprising 588 data series, were used. They studied the effect of different geometric and hydraulic parameters on the elliptical side orifice discharge coefficient. A horizontal rectangular channel (12 m long, 0.25 m wide, and 0.5 m high) was used for the experiments. Rectangular and triangular weirs were used to measure the flow through the orifice (Qs) and the flow upstream of the orifice (Qu). Two orifice lengths (a = 15, 20 cm), three orifice heights (b = 2, 3, 4 cm) and two crest heights (W = 5, 10 cm) were used, giving 12 geometric configurations in total. Qu ranged from 13.8 to 39.6 l/s, Qs ranged from 3.66 to 21.41 l/s, and the Froude number in the main channel ranged from 0.22 to 0.77. Finally, the discharge coefficient can be calculated as $C_d = Q_s / (\pi a b \sqrt{2 g h_c})$, where $h_c = y_1 - W - b$.

The 588 laboratory data points were randomly divided into two parts: training (75%) and test (25%). Table 1 shows the statistical specifications of the training and test datasets. Figure 2 shows the relationship between the output variable (Cd) and the independent variables for the dataset used in this study. The numbers in the figure are Pearson correlation coefficients, which measure the linear relationship between variables. The value of this coefficient varies from −1 to 1. Positive values indicate a direct relationship between variables, and negative values indicate an inverse relationship. According to Fig. 2, the two variables b/y1 (r_p = 0.39) and Fr1 (r_p = 0.10) are directly related to the discharge coefficient, which means that as they increase, the discharge coefficient increases. The three variables W/b (r_p = −0.48), B/a (r_p = −0.42) and B/b (r_p = −0.55) are inversely related to the discharge coefficient, and as they increase, the discharge coefficient decreases. According to Fig. 2, the variables B/b and W/b have the highest absolute correlation with the elliptical side orifice discharge coefficient.
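The discharge-coefficient definition used to build the dataset, C_d = Q_s / (π a b sqrt(2 g h_c)) with h_c = y_1 − W − b, can be evaluated as in the following minimal sketch. The function names, the value of g, and the sample reading are illustrative assumptions and are not taken from the experimental data.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def head_above_center(y1, W, b):
    """Head above the orifice center, h_c = y1 - W - b (all lengths in m)."""
    return y1 - W - b


def discharge_coefficient(Qs, a, b, y1, W):
    """C_d = Qs / (pi * a * b * sqrt(2 * g * h_c)), with Qs in m^3/s."""
    h_c = head_above_center(y1, W, b)
    if h_c <= 0.0:
        raise ValueError("upstream depth y1 must exceed W + b")
    return Qs / (math.pi * a * b * math.sqrt(2.0 * G * h_c))


# Hypothetical reading (not from the dataset): Qs = 13 l/s, a = 15 cm,
# b = 3 cm, y1 = 25 cm, W = 10 cm, all converted to SI units.
cd = discharge_coefficient(Qs=0.013, a=0.15, b=0.03, y1=0.25, W=0.10)
print(round(cd, 3))
```

Note that the flow rates in the text are reported in l/s and the lengths in cm, so both need converting to SI units before applying the formula.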

Machine learning techniques. Gaussian process regression (GPR). The Gaussian process regression (GPR) model is a supervised machine learning method 56 . GPR is a kernel-based, non-parametric Bayesian method that offers high computational efficiency and accuracy and is easy to use 57 . It can be applied to both classification and regression problems and is well suited to modeling complex nonlinear problems 58 . A Gaussian process is expressed by the mean function $m(x)$ and the covariance function $k(x_i, x_j)$ as follows 59,60 :

$$f(x) \sim GP\big(m(x), k(x_i, x_j)\big)$$

In the regression problem, $y$ denotes the observations and $\varepsilon$ the noise, which has zero mean and variance $\sigma_n^2$. As a result, the Gaussian process regression model can be expressed as follows:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_n^2)$$

In the above equation, $x$ is the input data matrix, $y$ is the output data vector, and $f$ is the value of the GPR function. The joint distribution is defined by the kernel function as follows:

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, x_*) \\ K(x_*, X) & K(x_*, x_*) \end{bmatrix}\right)$$

where $K(x_*, X) = K(X, x_*)^T$, $X = [x_1, x_2, \ldots, x_n]^T$ is the training input matrix, $y = [y_1, y_2, \ldots, y_n]^T$ is the training output vector, $x_*$ is the test input vector and $f_*$ is the output for the test input vector. Finally, the predictive distribution is expressed by the following equation:

$$f_* \mid X, y, x_* \sim N\big(\bar{f}_*, \operatorname{cov}(f_*)\big)$$

where $\bar{f}_*$ and $\operatorname{cov}(f_*)$ are defined as follows:

$$\bar{f}_* = K(x_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y$$

$$\operatorname{cov}(f_*) = K(x_*, x_*) - K(x_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, x_*)$$

The covariance function measures the effect of data points on each other 57 ; it quantifies how strongly two variables vary together. The proper selection of the kernel (covariance) function is one of the essential factors affecting the performance of the GPR model. Numerous kernel functions are defined for use in the GPR model 59 . In the present study, ten types of kernels were examined and evaluated. Table 2 shows the kernel equations used in the present study.

Table 2. List of kernel functions used for the GPR model (columns: kernel function, kernel equation). $\sigma_f$ is the signal standard deviation (Std), $\sigma_l$ is the characteristic length scale, and $r$ is the Euclidean distance between $x_i$ and $x_j$.

Kernel extreme learning machine (KELM). In the output function of the extreme learning machine (ELM), Eq. (8), $\rho_j$ denotes the output weight vector, which connects the jth hidden layer node and the output layer node, $g(x)$ represents the ELM activation function (AF), $b_j$ is the weight of the input dataset, and $c_j$ is the bias value for the jth hidden layer node. Equation (8) can be written in matrix form as

$$G\rho = Y$$

where $Y$ denotes the model output and $G$ is the matrix of hidden layer outputs. The ELM uses a fitness function to determine the optimum value of $\rho$, which is given as

$$\min_{\rho} \lVert G\rho - T \rVert$$

where $T$ is the target vector. Based on generalized inverse theory, the solution of Eq. (12) is defined as

$$\rho = G^{\dagger} T$$

where $G^{\dagger}$ refers to the Moore-Penrose inverse matrix (MPIM) of $G$. Following the orthogonal projection technique and the theory of ridge regression 64 , a regularization factor (RF) is applied in the optimization so that $\rho$ can be obtained as

$$\rho = G^{T}\left(\frac{I}{C} + G G^{T}\right)^{-1} T$$

where $I$ denotes the identity matrix and $C$ is the regularization factor. Accordingly, the ELM output function is defined as

$$f(x) = g(x)\,\rho = g(x)\, G^{T}\left(\frac{I}{C} + G G^{T}\right)^{-1} T$$

Although the ELM is efficient, it is a randomized method and may therefore become trapped in local optima. For this reason, the kernel extreme learning machine (KELM) was presented by Huang et al. 65 . The main structure of the KELM is displayed in Fig. 3. In this method, a kernel matrix (KM), $KM(x, x_j)$, is employed instead of the AF $g(x)$. The KM can be formulated based on Eq. (18).
The output function of the proposed KELM is expressed as

$$f(x) = \begin{bmatrix} KM(x, x_1) & \cdots & KM(x, x_N) \end{bmatrix} \left(\frac{I}{C} + KM\right)^{-1} T$$

where $N$ is the number of training samples, $I$ is the identity matrix, $C$ is the regularization factor and $T$ is the target vector. In this work, the radial basis function (RBF) is used as the KM, with a constant kernel parameter $\mu$.

Generalized regression neural network (GRNN). The generalized regression neural network (GRNN) is a kind of radial basis function network (RBFN) that is based on kernel regression 66 . Unlike conventional neural networks (CNN), the GRNN does not need an iterative training process such as back-propagation, and it does not get stuck in local solutions 67-69 . This method comprises four layers: input, pattern, summation, and output layers.

The input layer receives the input dataset ($x$). In this layer, the number of neurons is equal to the dimension of the input dataset. In the pattern layer, neurons apply a nonlinear function to transform the input dataset ($x$) into $p_k$, the output of the pattern layer, where $x_k$ denotes the training sample of the kth neuron in the input layer and $\rho$ is the spread factor.

The third layer (the summation layer) consists of two types of neurons: (1) one simple neuron and (2) m weighted neurons, which are denoted by $S_o$ and $S_t$ and are computed from the pattern layer outputs and the target dataset $y_k$. The output layer divides the summation layer results to obtain the predicted output.

Response surface methodology (RSM). In the present study, RSM was used to investigate the effect of the independent variables (geometric and hydraulic conditions) on the response variable (side orifice discharge coefficient) and to provide an optimal regression relationship for predicting the elliptical side orifice discharge coefficient. RSM is a statistical tool for modeling and analyzing the effect of the process (input) variables on the response (output) variable 70 . Using RSM, the most information can be obtained from a minimum of experimental data. The second-order RSM model includes linear, quadratic, and interaction terms of the input variables. The RSM model for the above case can be expressed as follows 71,72 :

$$y = \alpha_0 + \sum_{i=1}^{k} \alpha_i x_i + \sum_{i=1}^{k} \alpha_{ii} x_i^2 + \sum_{i<j} \alpha_{ij} x_i x_j + \varepsilon$$

where $X$ is the input data matrix, $y$ is the output data estimation vector, $k$ is the number of input variables, $\varepsilon$ is a random error vector and $\alpha_0$, $\alpha_i$, $\alpha_{ii}$ and $\alpha_{ij}$ are regression coefficients, which can be calculated by the following equation:

$$\alpha = (X^{T} X)^{-1} X^{T} y$$

A flowchart of the machine learning models for the discharge coefficient of the elliptical side orifice is shown in Fig. 4. In all models, the input is normalized using the minimum and maximum of each variable, where $x$ is the value of the variable and $x_{min}$ and $x_{max}$ are the minimum and maximum values of the variable, respectively.
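The input normalization can be illustrated with a short sketch. It assumes the common min-max form (x − x_min) / (x_max − x_min), which maps each variable to [0, 1], and it assumes that x_min and x_max are taken from the training set and reused for the test set; neither detail is confirmed by the text, and the sample values are hypothetical.

```python
import numpy as np


def fit_min_max(X_train):
    """Column-wise minimum and maximum of the training inputs."""
    return X_train.min(axis=0), X_train.max(axis=0)


def apply_min_max(X, x_min, x_max):
    """Assumed min-max form: (x - x_min) / (x_max - x_min)."""
    return (X - x_min) / (x_max - x_min)


# Hypothetical values for two input variables
X_train = np.array([[0.2, 1.0], [0.6, 3.0], [1.0, 5.0]])
X_test = np.array([[0.4, 2.0]])

x_min, x_max = fit_min_max(X_train)
X_train_n = apply_min_max(X_train, x_min, x_max)
X_test_n = apply_min_max(X_test, x_min, x_max)  # reuses training statistics
print(X_test_n)
```

Reusing the training statistics keeps the test set out of the preprocessing step; test values outside the training range would then fall outside [0, 1].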
Accuracy criteria of approaches. The models were evaluated using five statistical indices: root mean square error (RMSE), mean absolute percentage error (MAPE), correlation coefficient (R), normalized root mean square error (NRMSE), and mean bias error (MBE). In the definitions of these indices, $C_{do,i}$ and $C_{dp,i}$ are, respectively, the observed and predicted values of the discharge coefficient of the elliptical side orifice, $\overline{C_{do}}$ is the mean value of the observations, $\overline{C_{dp}}$ is the mean value of the predictions, and $N$ is the number of data.
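As a sketch of how these five indices might be computed, the function below uses the usual textbook definitions. Two choices are assumptions rather than the paper's stated definitions: NRMSE is normalized by the mean of the observations, and MBE is taken as predicted minus observed. The sample values are hypothetical.

```python
import numpy as np


def evaluate(obs, pred):
    """RMSE, MAPE (%), R, NRMSE and MBE for observed and predicted C_d."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    rmse = np.sqrt(np.mean(err ** 2))
    return {
        "RMSE": rmse,
        "MAPE": 100.0 * np.mean(np.abs(err / obs)),
        "R": np.corrcoef(obs, pred)[0, 1],  # Pearson correlation
        "NRMSE": rmse / np.mean(obs),       # assumption: normalized by mean
        "MBE": np.mean(err),                # assumption: predicted - observed
    }


# Hypothetical observed and predicted discharge coefficients
scores = evaluate([0.60, 0.62, 0.64, 0.66], [0.61, 0.61, 0.65, 0.65])
print(scores)
```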
Outlier detection with leverage approach. When developing a mathematical model, it is necessary to detect outliers among the data obtained from the model. Several methods have been proposed to identify outlier data; among these, the leverage approach is one of the most well-established and widely used. In this method, the difference between the actual data and the data obtained from the model is defined as the residual. To calculate the leverage index (hat), the following matrix must be calculated:

$$H = X (X^{T} X)^{-1} X^{T}$$

where $X$ is the matrix of model input variables and the hat values are the diagonal elements of $H$. The warning value of leverage is $H^* = 3(k + 1)/n$, where $k$ is the number of input variables and $n$ is the number of data points.
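The quantities behind a Williams diagram can be sketched as follows. The hat matrix H = X (XᵀX)⁻¹ Xᵀ and the warning leverage H* = 3(k + 1)/n follow the leverage approach; dividing the residuals by their sample standard deviation is an assumption about how the standardized residual R is obtained, and the data are synthetic.

```python
import numpy as np


def williams_statistics(X, residuals):
    """Hat values, warning leverage H*, standardized residuals, validity mask."""
    X = np.asarray(X, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    n, k = X.shape
    H = X @ np.linalg.pinv(X.T @ X) @ X.T         # hat matrix
    hat = np.diag(H)                              # leverage of each data point
    h_star = 3.0 * (k + 1) / n                    # warning leverage
    std_res = residuals / residuals.std(ddof=1)   # assumed standardization
    valid = (hat < h_star) & (np.abs(std_res) < 3.0)
    return hat, h_star, std_res, valid


# Synthetic example: 50 data points, 5 input variables
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(50, 5))
res_demo = rng.normal(0.0, 0.01, size=50)
hat, h_star, std_res, valid = williams_statistics(X_demo, res_demo)
print(h_star, int(valid.sum()))
```

Points with a hat value above H* lie outside the applicability domain, and points with a standardized residual beyond ±3 are treated as suspected outliers.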

Results and discussion
This section discusses and evaluates the results obtained from the GPR, KELM, GRNN, and RSM models and from the regression-based models, and presents a comprehensive comparison between the AI models and the regression-based models. Error analysis was performed using CDF curves, relative error, and the leverage approach. Finally, a sensitivity analysis is performed to determine the parameters affecting the elliptical side orifice discharge coefficient. All models were implemented in MATLAB 2020a on a personal computer (Intel Core i7 2.6 GHz processor and 16 GB RAM).
Gaussian process regression (GPR) model. The GPR model was created using the dimensionless variables mentioned in the previous section as inputs and the discharge coefficient (Cd) as output. The most important factor in the performance of the GPR model is the type of kernel and its parameters. In the present study, ten kernels were examined, and an LBFGS-based quasi-Newton method was used to optimize the kernel parameters. Table 3 shows the results obtained from the different kernels with their optimal parameters. The kernels were compared using the R and RMSE statistics for the test data series. According to Table 3, the ARDsquaredexponential kernel, with R = 0.9579, RMSE = 0.0081 and MAPE = 1.3243%, had the best performance in estimating the orifice discharge coefficient. The ARDMatern 5/2 kernel, with R = 0.9571, RMSE = 0.0087 and MAPE = 1.5782%, was the second most accurate. The weakest performance was provided by the exponential kernel, with R = 0.9509, RMSE = 0.0087 and MAPE = 1.4063%. The results of the optimal GPR model for the training and test data series are presented in Fig. 5.
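To make the role of the kernel concrete, the following NumPy sketch computes the GPR predictive mean and covariance with an ARD squared exponential kernel, that is, one length scale per input variable. It is not the MATLAB implementation used in the study: the hyperparameters are fixed by hand rather than optimized with a quasi-Newton method, the targets are centered so that a zero-mean prior applies, and the data are synthetic.

```python
import numpy as np


def ard_se_kernel(A, B, sigma_f, length_scales):
    """ARD squared exponential kernel with one length scale per input."""
    A_s, B_s = A / length_scales, B / length_scales
    d2 = (np.sum(A_s ** 2, axis=1)[:, None]
          + np.sum(B_s ** 2, axis=1)[None, :]
          - 2.0 * A_s @ B_s.T)
    return sigma_f ** 2 * np.exp(-0.5 * np.maximum(d2, 0.0))


def gpr_predict(X, y, X_star, sigma_f, length_scales, sigma_n):
    """Predictive mean and covariance of a zero-mean GP with Gaussian noise."""
    K = ard_se_kernel(X, X, sigma_f, length_scales) + sigma_n ** 2 * np.eye(len(X))
    K_s = ard_se_kernel(X_star, X, sigma_f, length_scales)
    K_ss = ard_se_kernel(X_star, X_star, sigma_f, length_scales)
    L = np.linalg.cholesky(K)                         # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s.T)
    return K_s @ alpha, K_ss - v.T @ v


# Synthetic stand-in for normalized inputs and discharge coefficients
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 2))
y = 0.6 + 0.05 * np.sin(3.0 * X[:, 0]) - 0.03 * X[:, 1]
X_star = rng.uniform(0.0, 1.0, size=(5, 2))

y_mean = y.mean()                                     # center the targets
ls = np.array([0.5, 0.5])                             # hand-picked length scales
mean, cov = gpr_predict(X, y - y_mean, X_star, 0.1, ls, 1e-3)
print(mean + y_mean)
print(np.sqrt(np.maximum(np.diag(cov), 0.0)))         # predictive std
```

The Cholesky factorization is used instead of an explicit matrix inverse because it is the numerically safer way to apply the inverse of K(X, X) + σ_n² I.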

Kernel extreme learning machine (KELM).
In the KELM model, the RBF kernel was used as the model kernel 75,76 . The RBF kernel has one parameter, σ , and the KELM model has one adjustment (regularization) parameter, C . The grid search method was used to obtain σ and C , with σ varied from 0.01 to 3 and C from 1 to 1000. The optimal values σ = 0.1 and C = 600 were obtained for the test data series. With these optimal parameters, the best KELM model gave R = 0.9564, RMSE = 0.0082, and MAPE = 1.3499%. The results of the optimal KELM model are presented in Fig. 6 for the test and training datasets.
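The grid search over σ and C can be sketched as below. The RBF form exp(−‖x − x′‖² / (2σ²)) is an assumption, since the exact kernel expression is not given here, and the helper names, grid values and synthetic data are illustrative. The sketch scores each (σ, C) pair on a held-out split, whereas the study reports optimal values for the test series.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """Assumed RBF form: exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def kelm_fit(X, t, sigma, C):
    """Solve (I / C + K) beta = t for the KELM output weights."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + np.eye(len(X)) / C, t)


def kelm_predict(X_new, X, beta, sigma):
    return rbf_kernel(X_new, X, sigma) @ beta


def grid_search(X_tr, t_tr, X_val, t_val, sigmas, Cs):
    """Return (rmse, sigma, C) with the lowest held-out RMSE."""
    best = (np.inf, None, None)
    for sigma in sigmas:
        for C in Cs:
            beta = kelm_fit(X_tr, t_tr, sigma, C)
            pred = kelm_predict(X_val, X_tr, beta, sigma)
            rmse = np.sqrt(np.mean((pred - t_val) ** 2))
            if rmse < best[0]:
                best = (rmse, sigma, C)
    return best


# Synthetic stand-in data, split into fitting and held-out parts
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(80, 2))
t = 0.6 + 0.05 * np.sin(3.0 * X[:, 0]) - 0.03 * X[:, 1]
sigmas = [0.01, 0.1, 0.5, 1.0, 3.0]
Cs = [1.0, 10.0, 100.0, 1000.0]
best_rmse, best_sigma, best_C = grid_search(X[:60], t[:60], X[60:], t[60:], sigmas, Cs)
print(best_rmse, best_sigma, best_C)
```

Because the KELM solution is a single linear solve, the full grid is inexpensive to evaluate for a dataset of a few hundred points.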

Generalized regression neural network (GRNN). The only parameter in the GRNN model is the spread parameter 77 . To obtain its optimal value, the spread was varied between 0.01 and 10 in steps of 0.01. The results showed that the optimal value of this parameter is 0.05. For the optimal GRNN model, R = 0.929, RMSE = 0.0106 and MAPE = 1.6971% were obtained. The results of the optimal GRNN model are presented in Fig. 7 for the test and training datasets.
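A GRNN prediction with a given spread can be sketched as follows, assuming the standard formulation in which the pattern layer applies a Gaussian of the squared distance to each training sample and the output is the ratio of the target-weighted sum to the plain sum. The exact expressions used in the study are not shown here, so this form, the function name and the synthetic data are assumptions.

```python
import numpy as np


def grnn_predict(X_new, X_train, y_train, spread):
    """GRNN output: kernel-weighted average of the training targets."""
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    # Shifting by the row minimum leaves the ratio unchanged and avoids
    # all weights underflowing to zero for very small spreads.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    p = np.exp(-d2 / (2.0 * spread ** 2))   # pattern layer
    s_simple = p.sum(axis=1)                # summation layer, simple neuron
    s_weighted = p @ y_train                # summation layer, weighted neuron
    return s_weighted / s_simple            # output layer


# Synthetic stand-in data; scan a coarse set of spread values
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(80, 2))
y = 0.6 + 0.05 * np.sin(3.0 * X[:, 0]) - 0.03 * X[:, 1]
X_tr, y_tr, X_te, y_te = X[:60], y[:60], X[60:], y[60:]

spreads = [0.01, 0.05, 0.1, 0.5, 1.0]
rmse = [np.sqrt(np.mean((grnn_predict(X_te, X_tr, y_tr, s) - y_te) ** 2))
        for s in spreads]
best_spread = spreads[int(np.argmin(rmse))]
print(best_spread, min(rmse))
```

A small spread makes the prediction follow the nearest training samples, while a large spread pulls every prediction toward the mean of the training targets.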

Response surface methodology (RSM).
The effect of the independent variables on the elliptical side orifice discharge coefficient was evaluated using the RSM model. One of the advantages of the RSM is that it provides a regression relationship between the input and output variables. The RSM model is based on the number of independent variables. Table 4 shows the analysis of variance (ANOVA) for the equation and its coefficients. According to this table, all coefficients are significant (p value < 0.05). In the RSM model, R = 0.9456, RMSE = 0.0092 and MAPE = 1.4921% were obtained for the test dataset. Figure 8 shows the performance of the RSM model for the training and test data.
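A second-order response surface of the kind described above, with an intercept, linear, quadratic and pairwise interaction terms, can be fitted by ordinary least squares as in this sketch. The helper names and synthetic data are illustrative; the coefficients of the study's own equation are not reproduced here.

```python
from itertools import combinations

import numpy as np


def quadratic_design(X):
    """Columns: 1, x_i, x_i^2, and x_i * x_j for i < j."""
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] ** 2 for i in range(k)]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack(cols)


def fit_rsm(X, y):
    """Least-squares estimate of the second-order model coefficients."""
    alpha, *_ = np.linalg.lstsq(quadratic_design(X), y, rcond=None)
    return alpha


def predict_rsm(X, alpha):
    return quadratic_design(X) @ alpha


# Synthetic check: recover known coefficients for five input variables
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(50, 5))
alpha_true = rng.normal(0.0, 1.0, size=21)   # 1 + 5 + 5 + 10 terms
y = quadratic_design(X) @ alpha_true
alpha_hat = fit_rsm(X, y)
print(np.max(np.abs(alpha_hat - alpha_true)))
```

With five inputs the model has 21 coefficients, so the significance test on each term (as in an ANOVA table) matters for deciding which terms to retain in a practical equation.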
Regression-based equations. Vatankhah and Rafeifar 3 presented five regression-based models to calculate the elliptical side orifice discharge coefficient. Table 5 shows the results obtained from these five regression-based models. According to Table 5, Eq. (1), in which all effective parameters are involved, had the best performance with R = 0.9277, RMSE = 0.0106 and MAPE = 1.6846%. Equation (2), which takes the parameters Fr1, B/a and w/b as inputs, ranks second with R = 0.9254, RMSE = 0.0107 and MAPE = 1.6993%. Equations (3) to (5), which use Fr1 and w/b, Fr1 and B/a, and w/b and B/a as inputs, respectively, do not reach acceptable accuracy: R ≤ 0.7 and the MAPE exceeds 3%. The results of the regression-based models are shown in Fig. 9.

Comparison between models. In the previous sections, the GPR, KELM, GRNN, and RSM models were developed to predict the elliptical side orifice discharge coefficient, and their optimal parameters were obtained. This section compares these machine learning models with the best regression model. Table 6 shows the statistical parameters of the best results obtained from the machine learning models and the best regression model for the training and test datasets. According to Table 6, all machine learning models performed better than the regression-based model. The comparison between machine learning models also shows that the GPR model, with R = 0.9556 and RMSE = 0.0077 for the training data and R = 0.9580 and RMSE = 0.0081 for the test data, had the highest accuracy in estimating the orifice discharge coefficient. The KELM model ranks second by a slight margin (R = 0.953 and RMSE = 0.0080 for the training data and R = 0.9564 and RMSE = 0.0082 for the test data). The GRNN model had the lowest accuracy among the machine learning models (R = 0.9202 and RMSE = 0.0104 for the training data and R = 0.9291 and RMSE = 0.0106 for the test data).
The RSM model also estimated the elliptical side orifice discharge coefficient with good accuracy (R = 0.9279 and RMSE = 0.0097 for the training data and R = 0.9456 and RMSE = 0.0092 for the test data) while providing a regression relationship. Figure 10 shows the error distribution as a violin plot for the machine learning models and the five regression equations considered in the present study. According to the figure, the narrowest error range belongs to the GPR model [−3.78% to +4.146%]. The KELM model is in second place with an error range of [−3.981% to +4.222%]. The GRNN model, with an error range of [−5.99% to +4.833%], has the widest error range among the machine learning models. According to Fig. 10, the regression-based models have wider error ranges. The best regression model (Eq. 1) has an error range of [−7.057% to +3.835%], and Eqs. (3) to (5) have the largest error ranges.

Figure 11 shows the cumulative frequency versus the absolute error percentage. According to Fig. 11, the GPR model gives an error of less than 1.7% for 70% of the data. This value is 1.74% for the KELM model and 2.16% and 2.03% for the GRNN and RSM models, respectively. As a result, the GPR model is more accurate and reliable in estimating the elliptical side orifice discharge coefficient. Among the regression models, Eqs. (1) and (2) give an absolute error percentage of less than 2.2% for 70% of the data. For Eqs. (3) to (5), the corresponding values are 4.76%, 5.7%, and 3.65%, respectively. These results from the cumulative frequency curves show the superiority of the machine learning models over the regression-based models.
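The statement that a model stays below a given error for 70% of the data corresponds to the 70th percentile of the absolute percentage error. A minimal sketch with hypothetical values follows; linear interpolation between sorted errors is NumPy's default and is an assumption about how the curves in Fig. 11 were read.

```python
import numpy as np


def abs_pct_error(obs, pred):
    """Absolute percentage error of each prediction."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 100.0 * np.abs((pred - obs) / obs)


def error_at_fraction(obs, pred, fraction=0.70):
    """Error level not exceeded by the given fraction of the data."""
    return np.quantile(abs_pct_error(obs, pred), fraction)


# Hypothetical data: ten predictions that are 1%, 2%, ..., 10% too high
obs = np.full(10, 0.6)
pred = obs * (1.0 + np.arange(1, 11) / 100.0)
e70 = error_at_fraction(obs, pred, 0.70)
print(e70)
```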
Finally, to ensure the statistical validity of the developed models, the H matrix, the leverage index (hat), the standardized residual R and the warning value of leverage H* were calculated according to the leverage approach, and the Williams diagram was plotted for all machine learning and regression-based models. Figure 12 shows the Williams diagram for the GPR, GRNN, KELM, and RSM models. According to Fig. 12, in all models the data obtained from the models lie in the range −3 < R < 3 and 0 < H < H* and are therefore statistically valid. Figure 13 shows the Williams diagram for the regression-based models. As can be seen from Fig. 13, Eqs. (1), (2), and (5) are statistically valid and lie in the range −3 < R < 3 and 0 < H < H*, but Eqs. (3) and (4) are not within the range of confidence, and therefore their use is not recommended for estimating the discharge coefficient of the elliptical side orifice.

Sensitivity analysis. A sensitivity analysis was performed using the GPR model (the superior model) to determine the variables affecting the elliptical side orifice discharge coefficient. One reliable method of sensitivity analysis is to omit each input variable in turn and determine the statistical parameters of the model in its absence 78 . Table 7 shows the sensitivity analysis results for the variables affecting the elliptical side orifice discharge coefficient. According to Table 7, omitting the parameter B/a (channel width to orifice length) had the greatest effect on reducing the model accuracy (R = 0.7932). Therefore, B/a is the most effective parameter in determining the elliptical side orifice discharge coefficient. The Froude number (Fr1), with R = 0.8968, is the second most influential parameter. The parameters w/b with R = 0.9052, b/y1 with R = 0.9432 and B/b with R = 0.9576 rank third to fifth.
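The omit-one-variable procedure can be sketched generically: each input is removed in turn, the model is refitted, and the test-set R is recorded, so that the lowest R marks the most influential variable. The study uses the GPR model for this step; the ordinary least-squares model below is only a lightweight stand-in, and the data are synthetic, constructed so that the second column dominates. The variable names are reused purely as labels.

```python
import numpy as np


def fit_predict_linear(X_tr, y_tr, X_te):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef


def omit_one_sensitivity(X_tr, y_tr, X_te, y_te, names, fit_predict):
    """Test-set R obtained when each input variable is left out in turn."""
    results = {}
    for j, name in enumerate(names):
        keep = [c for c in range(X_tr.shape[1]) if c != j]
        pred = fit_predict(X_tr[:, keep], y_tr, X_te[:, keep])
        results[name] = np.corrcoef(y_te, pred)[0, 1]
    return results


# Synthetic data in which the second column has by far the largest effect
names = ["W/b", "B/a", "B/b", "b/y1", "Fr1"]
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(200, 5))
y = 0.8 - 0.3 * X[:, 1] + 0.05 * X[:, 0] + rng.normal(0.0, 0.005, size=200)

sens = omit_one_sensitivity(X[:150], y[:150], X[150:], y[150:],
                            names, fit_predict_linear)
most_influential = min(sens, key=sens.get)
print(sens)
print(most_influential)
```

Any model with the same fit-and-predict interface, including a GPR, could be passed in place of the stand-in.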

Conclusion
In the present study, four machine learning methods, KELM, GPR, GRNN, and RSM, were used to estimate the elliptical side orifice discharge coefficient, and the results were compared with previously proposed regression-based equations. The data used to develop the models comprised 588 series of laboratory data. Five dimensionless parameters were considered as model inputs: the orifice crest height to orifice height ratio (W/b), the main channel width to orifice length ratio (B/a), the main channel width to orifice height ratio (B/b), the upstream depth (y1) to orifice height ratio (y1/b) and the upstream Froude number (Fr1). The discharge coefficient of the elliptical side orifice (Cd) was the model output. The statistical parameters for the test dataset showed that all four machine learning models performed well in estimating the elliptical side orifice discharge coefficient, with R ranging from 0.9580 for the GPR model (the strongest model) to 0.9291 for the GRNN model (the weakest model). Comparing the machine learning models with the regression-based models showed the superiority of the artificial intelligence models in estimating the orifice discharge coefficient. The highest accuracy belongs to the GPR (RMSE = 0.0081, R = 0.958, MAPE = 1.3242%) and KELM (RMSE = 0.0082, R = 0.9564, MAPE = 1.3499%) models. The RSM model had good accuracy and provided a practical regression equation for calculating the discharge coefficient of the elliptical side orifice. Error analysis using cumulative error distribution curves and the relative error distribution also showed the superiority of the GPR model over the other methods used in the present study. The leverage approach was applied to detect outliers and to determine the applicability domain of the models. The results showed that all proposed machine learning models are statistically valid.
Also, the sensitivity analysis of the model input parameters showed that the B/a parameter has the greatest impact on model performance and the B/b parameter has the least. The results of the present study can be used to refine the measurement of delivered flow for optimal management of water consumption through elliptical side orifice structures.
Limitations and future scope. The results of this research are valid for the range of data used and apply mainly to elliptical sharp-crested side orifices. Therefore, to calculate the discharge coefficient for different types of circular sections, more effort is needed to collect the corresponding datasets. Future work could provide a single model capable of estimating the discharge coefficient of both circular and elliptical orifices by combining the corresponding experimental datasets. Developing an ensemble model that integrates the advantages of each standalone model could also enhance the accuracy of discharge coefficient computation.

Data availability
The dataset and code used in this research are available from the corresponding author upon reasonable request.