Modeling of wave run-up by applying integrated models of group method of data handling

Wave-induced inundation in coastal zones is a serious problem for residents. Accurate prediction of wave run-up height is a complex phenomenon in coastal engineering. In this study, several machine learning (ML) models are developed to simulate wave run-up height. The developed methods are based on optimization techniques employing the group method of data handling (GMDH). The invasive weed optimization (IWO), firefly algorithm (FA), teaching–learning-based optimization (TLBO), harmony search (HS), and differential evolution (DE) meta-heuristic optimization algorithms are embedded with the GMDH to yield better feasible optimization. Preliminary results indicate that the developed ML models are robust tools for modeling the wave run-up height. All ML models’ accuracies are higher than empirical relations. The obtained results show that employing heuristic methods enhances the accuracy of the standard GMDH model. As such, the FA, IWO, DE, TLBO, and HS improve the RMSE criterion of the standard GMDH by the rate of 47.5%, 44.7%, 24.1%, 41.1%, and 34.3%, respectively. The GMDH-FA and GMDH-IWO are recommended for applications in coastal engineering.

contraction scour depth. Sharafati et al. 25 used teaching-learning-based optimization (TLBO), biogeographybased optimization (BBO), and invasive weed optimization (IWO) algorithms to optimize ANFIS parameters for the prediction of scour depth downstream of weirs. The results showed that ANFIS-IWO is a reliable technique for the prediction of scour depth.
Qaderi et al. 26 developed a shuffled complex evolutionary (SCE) algorithm integrated with GMDH to predict bridge pier scour depth. The results showed a good performance of GMDH-SCE. Alizadeh et al. 27 compared the performance of PSO, GA, and imperialist competitive algorithm (ICA) integrated with a support vector machine (SVM) for the estimation of drilling fluid density. The results showed the high performance of PSO in estimating drilling fluid density. Milan et al. 28 applied three optimization algorithms, comprising particle swarm optimization (PSO), gray wolf optimization (GWO), and Harris hawk optimization (HHO) integrated with ANFIS for predicting optimal groundwater exploitation. The results indicated that all optimization algorithms increase the ANFIS accuracy. Haghbin et al. 29 developed support vector regression (SVR) integrated with IWO for predicting channel sinuosity. The results showed that IWO significantly increases the accuracy of SVR.
In this study, invasive weed optimization (IWO), firefly algorithm (FA), teaching-learning-based optimization (TLBO), harmony search (HS), and differential evolution (DE) optimization techniques are embedded with GMDH to predict a two percent wave run-up height in coastal regions. It is worth noting that the applications of GMDH integrated with novel meta-heuristic models are scarce. Indeed, to the best knowledge of the authors, no study has ever developed and presented the application of hybrid models of GMDH-IWO for modeling problems in any field of research.

Methods
In this section, a brief review of empirical relations for the prediction of wave run-up as well as the description of GMDH and optimization algorithms are presented. www.nature.com/scientificreports/ Empirical relations. In coastal engineering, several different wave run-up values can be investigated, such as the mean value (R), the 33 percent value of the wave run-up i.e. the significant run-up (R s ), the two percent value of the wave run-up (R 2% ), etc. The mean run-up (R) can be calculated as the average run-up of all observed waves. However, the mean value is of limited interest for engineers and scientists. In this respect, other parameters might be used e.g. R s , R 2% and R 10% . Note that R i% refers to the run-up level reached and exceeded by i% of the incoming waves [30][31][32][33][34][35][36] . Numerous studies have been conducted on the regular wave run-up on smooth and rough beaches. Miche 31 presented an equation to predict wave run-up for non-breaking waves: where R max is the maximum vertical run-up, H is the wave height, and α is the beach slope. Hunt 37 presented empirical relations to predict wave run-up on impermeable slopes based on breaking wave shape in the surf zone. The equation for standing waves on a steep slope was proposed in the following form: where R max may be calculated from: in which ξ 0 is the surf similarity parameter or Iribarren number 38 , L 0 is the deep-water wavelength g is the gravitational acceleration and T is the wave period.
One of the first equations presented for estimating irregular wave run-up on mild uniform slopes (tanα ≤ 1/3) was proposed by Wassing 39 : Ahrens 40 studied irregular wave run-up on smooth-impermeable slopes (1/4 ≤ tanα ≤ 1.1) and suggested two equations for breaking and non-breaking waves. Coastal Engineering Manual 41 analyzed Ahrens 40 data and proposed two relations for predicting irregular wave run-up. Mase 42 performed laboratory experiments on the irregular run-up on mild slopes (2° ≤ α ≤ 11.4°) and proposed an equation for estimating run-up due to breaking waves.
Van der Meer and Stam 8 suggested the following relations for estimating wave run-up on smooth slopes for irregular waves: where H s is the significant wave height, γ is the reduction factor that depends on various parameters such as roughness, shallow water conditions, oblique wave attack, berms, and ξ max is known as the Iribarren number corresponding to the maximum wave period.
In another study, Van der Meer and Stam 8 applied the regression model proposed by Van der Meer 3 and derived the following relations for predicting irregular wave run-up on permeable and impermeable slopes: where ξ m is the Iribarren number corresponding to the average wave period. Schimmels et al. 5 proposed the following relations: where H mo is the wave height, γ p is the porosity coefficient, ξ m−1,0 is the Iribarren number. In this study, the formulas derived by Van der Meer and Stam 8 , and Schimmels et al. 5 are applied to determine the two percent wave run-up height.
Group method of data handling. The group method of data handling (GMDH) is a machine learning model belonging to artificial neural networks (ANNs), which was introduced by Ivakhnenko 43 for modeling complex systems. This method has been successfully applied in different fields of science and engineering. Similar to ANNs, the GMDH consists of neurons connected in different layers. In the GMDH network, the neurons in the next layer are produced as a combination of two neurons from the previous layer. Then, the output of each neuron is calculated by quadratic polynomial expressions, and the most effective neurons are selected to be con- R u2% www.nature.com/scientificreports/ nected to neurons in the next layer. In other words, the GMDH algorithm generates the structure of the network through successive generations of quadratic regression polynomials with two input variables or neurons. In Fig. 2, a schematic plan of a five-layer GMDH with effective and ignored neurons is shown. The GMDH network is created and trained layer-by-layer and neuron-by-neuron. This strategy allows users to have access to the neurons' information during the network training process. As can be seen in Fig. 2, in the second, third, and fourth layers, there are some ignored neurons. The ignoring of neurons and the selection of appropriate ones prevent the immense growth of the network. The number of ignored neurons in each middle layer is affected by the error evaluation criteria e.g. MSE criterion: where N is the number of data, y p is the predicted output of neurons in each layer, and y m is the observed value. In the GMDH structure, contrary to conventional matrix structure, a number of mathematical equations are applied to speed up calculation process 44 . The output of each neuron is calculated according to Volterra-Kolmogorov-Gabor (VKG) polynomial. The second-order polynomial is incorporated in the structure of GMDH 43 and is used in this study as the transfer functions in each neuron. The second-order polynomial may be written in the following form: where y is the output, (x 1 , x 2 ) is the input vector, and c is the weighting coefficient. The intricacy of the neurons will be increased layer by layer, which causes that the final network is becoming complex 67 . The weighting coefficients are calculated using regression techniques: where c represents the weighting coefficient vector, A denotes the following matrix: and Y is the matrix of outputs: in which m is the number of samples.
In this study, the number of layers is equal to 5. The maximum number of neurons in the next layer ( N i+1 np ) can be calculated by applying the following equation:   (14), N i+1 np results in 28. In this study, the maximum number of neurons in each layer was determined to be 10. In summary, the following seven steps are taken to build the GMDH network (1) determining the GMDH structure-in this study 5 layers are considered for the network with a maximum of 10 neurons in each layer; (2) standardization of the data; (3) entering the data to the neurons of the next layer; (4) allocating the polynomialbased fit to each neuron in layers based on the values of two neurons of the previous layer; (5) calculating weights for a polynomial-Eq. 11; (6) calculating the output of the neurons and selected appropriateness of them-Eq. 9; (7) move to the next layer and repeat the steps of 3 to 6 to create the entire GMDH network.
Hybrid technique. Instead of using the least-squares technique of GMDH, meta-heuristic optimization algorithms can be embedded with the GMDH model. This technique can be applied to optimize either the weighting coefficients in Eq. 10 or the structure of the network. In this study, meta-heuristic optimization algorithms are used to optimize the weighting coefficients. The main difference between the general execution procedure of the standard GMDH and hybrid GMDH is the calculation process of the fifth step of the GMDH network mentioned in the previous section. In hybrid GMDHs, the meta-heuristic optimization approaches will be employed to optimize the weights of the polynomial. At the first step, a number of candidate solutions are distributed in the search space. Each member of this population represents a solution to Eq. 10. The fitness of members is calculated by applying RMSE, and the population is ranked. At the next step, the new values of the members will be calculated by meta-heuristic algorithms. Henceforth, the fitness of the population will be calculated and the members will be ranked. This process will be repeated until the final iteration. In the end, the best member represents the optimized values of Eq. 10. In Fig. 3, the general flowchart for setting up the hybrid GMDH models is shown.
The applied algorithms comprise new swarm intelligent-based models including invasive weed optimization, IWO, firefly algorithm, FA, and teaching-learning-based optimization, TLBO, as well as evolutionary-based models including harmony search, HS, and differential evolution, DE. The integration of the GMDH model and IWO has been developed and executed for the first time in this study. In Table 1, descriptions of the applied algorithms are presented, and a brief description of these algorithms is provided in the following sections.

Invasive weed optimization. The invasive weed optimization (IWO) algorithm is inspired by weed colo-
nization and was first proposed by Mehrabian and Lucas 54 . In this algorithm, a population of initial solutions (weeds) dispreads randomly in the entire search space. The fitness of weeds is evaluated and produces a number of seeds-the population with better fitness produces more seeds. Produced seeds are randomly distributed in search space by normally distributed random numbers with a mean equal to zero, but with a varying variance. The IWO applies the standard deviation (σ) of the random function which is defined between the ranges of the pre-defined initial value (σ initial ) to a final value (σ final ) and is calculated in each step from the following equation 54 : where σ iter is the standard deviation of the present iteration, iter max is the maximum number of iterations, and n is the nonlinear modulation index usually set as 2 55 .
After some iterations, the number of produced plants reaches a maximum value. At this stage, competitive exclusion eliminates undesirable plants based on the fitness function. Consequently, those with better fitness would survive and are allowed to replicate. This process might be continued either reaching the maximum epoch or achieving the exact solution.
Firefly algorithm. This algorithm was inspired by the flashing and illuminating behavior of the fireflies and was proposed by Yang 56 . This algorithm is based on three general rules, (1) all fireflies in the search space are considered to be unisex so they can be attracted to the others; (2) the attractiveness of a firefly is proportional to its light intensity. Hence, the brighter fireflies attract the less bright ones. In the case of fireflies of similar brightness, their movements are assumed to be random; (3) the brightness of a firefly is determined by the objective function. In FA, the new position of agents ( X t+1 i ) is calculated according to the following equation: where a is a randomized parameter (mutation coefficient), ε i is the random vector, β 0 is the attractiveness at distance r = 0, and γ is the absorption coefficient of light. More information about the FA algorithm can be found in the original work of Yang 56 .
Differential evolution. The differential evolution (DE) is a stochastic and population-based algorithm proposed by Storn and Price 57 applied in global optimization problems. The main structure of DE is based on extracting individual differences from a current population to build a new population. In other words, it is one of the optimization algorithms which has both evolutionary and swarm intelligence features 58 . Mutation, crossover, www.nature.com/scientificreports/ and selection are evolutionary operators, while distance and direction of the population can be considered as swarm intelligence features. Assuming three agents of the population as X i1 , X i2 , and X i3 , the mutation scheme can be expressed as a trial vector ( v t i ) which will be developed for each member of the population by applying the following equation: where t is the iteration index and F is the scale factor that controls the amount of differential variation.
In the crossover scheme, offspring ( u t i1 ) will be generated as: www.nature.com/scientificreports/ where C r is the crossover controller assuming values between 0 to 1 and j rand i is a random integer number between 1 and the dimension of the problem, D. In the end, the selection operator is applied and the new position is calculated from: Teaching-learning-based optimization. The teaching-learning-based optimization (TLBO) is inspired by the philosophy of the teaching and learning process and was proposed by Rao et al. 49 The algorithm is based on the evaluation of the influence of a teacher on the performance of students. The TLBO consists of two main phases: Teaching and Learning Phases. Among all the designated populations in the search space, the best solution (base on fitness) is assigned to the teacher, and the learners would learn and update their knowledge from the teacher according to the teaching operation: where x new i is the new positions of the ith learner, x old i is the old positions of the ith learner, rand i is a random number between 0 to1, x best is the position of the teacher, x mean is the mean individual position of the current class, and T F is the teaching factor that is applied to change the mean value.
After this stage, learners increase their knowledge by interacting randomly with other learners in class as the equation below: where x j is a member of the population that is selected randomly.
Harmony search. The harmony search (HS) is a meta-heuristic optimization method inspired by musicians which simulate the improvisation process of the group of musicians 59 . To create the HS structure, some parameters, including the harmony memory consideration rate (HMCR), pitch adjusting rate (PAR), fret width (FW), and harmony memory size (HMS) have to be set at the initialization stage. The harmony memory is a matrix of candidate solutions. The HMCR varies between 0 and 1 and controls the balance between exploration and exploitation. The PAR and FW are useful parameters in adjusting the convergence rate of the algorithm. The quantity X j = x j 1 , x j 2 , ... , x j D represents the jth harmony vector. The HM is filled with the HMS harmony vectors: At the first iteration, HM is filled with random solution vectors. In the next iterations, the HM must be updated with a new solution. If the new solution vector is better than the worst vector, then it is stored and replaces the worst vector. Experimental data and model setup. Datasets. In this study, eight parameters are used as the input vector to build and train the GMDH models for estimating the wave run-up (R u2% ). The input vector includes spectral peak period (T P ), mean wave period (T m ), significant wave height (H s ), beach slope (cotα), the relative size of bed material (D 85 /D 15 ), the surf similarity parameter corresponding to the mean wave period ( ξ m ), the  www.nature.com/scientificreports/ surf similarity parameter corresponding to the peak period ( ξ p ), and the bed permeability (S p ). The S p characterizes bed material especially particle grading and permeability. Figure 4 shows the range of S p for different types of coastal slopes. The S p data covers both impermeable and permeable slopes (see Table 2). The 256 data set was extracted from Van der Meer and Stam 8 report. All tests were conducted in a wave flume of 1 m wide, 1.2 m deep, and 50 m long. Data were measured for two types of particle grading including uniform rock and riprap, and different values of S p . In this study, all available parameters that have a direct effect on the wave run-up height were selected as input parameters for the predictive models. Information about the statistical properties of the dataset is presented in Table 2.
Evaluation criteria. In the present study, the statistical assessment of the training data set was conducted for checking the reliability of the developed GMDH models. A similar assessment of the testing data set was conducted for the evaluation of the performance of GMDH models and empirical relations. This was done by applying different statistical parameters including root mean square error (RMSE), determination coefficient (R 2 ), mean absolute error (MAE), and the index of agreement (IA).
The RMSE determines the root mean square error between the observed and predicted values, while the MAE describes the distance between the observed and predicted values. The R 2 indicates how well the predicted values fit the regression model. The IA is the measure of the agreement between the predicted and observed values. These parameters are calculated as follow:  www.nature.com/scientificreports/ where N is the number of data, x m i denotes the predicted value, x o i is the observed value, and bar stands for the average of the variables. The best fitting between models and observed data is for RMSE and MAE equal to 0, and R 2 and IA equal to 1.

Models setup and structure.
To begin with the model development process and implementation of data, the data set was divided randomly into a training set comprising 80% of the available data set and a testing set covering 20% of the data set. The general structure of the GMDH models is presented in Table 3. The allocated values of parameters are based on their values from the previous studies [60][61][62][63][64] . For all the developed models, 300 epochs were considered. It is worth noting that this iteration number was reached based on the convergence criteria of GMDH models.

Results
In this study, the GMDH model and the five hybrid GMDH models described in the previous section were used to estimate the wave run-up. Results obtained for the training stage are presented in Table 4.
The results in Table 4 show that the applied GMDH models are capable of modeling wave run-up, which confirm low values of RMSE and MAE, and the close to one values of R 2 and IA. Among all methods developed in this study GMDH-TLBO, with the lowest RMSE and MAE (RMSE = 0.0254 m and MAE = 0.0195 m) and the highest R 2 and IA (R 2 = 0.9863 and IA = 0.9965) provide the best results at the training stage. However, the results Table 3. Structure of models and initial parameters of meta-heuristic algorithms. *The TLBO algorithm can be applied without allocating any specific primary or adjusting parameter.  www.nature.com/scientificreports/ show that HS, FA, DE, and IWO, decrease the performance of the standard GMDH. Nevertheless, the decision on the superiority of the derived models can be conducted by the evaluation of the results at the testing stage. In this respect, Table 5 summarizes the statistical parameters derived for the evaluation of the performance of developed models at a testing stage. The results in Table 5 show that the standard and hybrid GMDH models provide more accurate results than the empirical relations. However, the development of the derived models is a time consuming process and their execution requires more time in comparison to the empirical models. The average values of RMSE, MAE, R 2 , and IA for the GMDH models are 0.027 m, 0.0209 m, 0.9847, and 0.9959, respectively, whereas for empirical relations are 0.0993 m, 0.0692 m, 0.886, and 0.9490, respectively. The developed GMDH models improve the prediction of wave run-up by about 72.81% in comparison to the empirical methods. Among the developed GMDH models, the GMDH-FA with the lowest values of RMSE and MAE (RMSE = 0.0209, MAE = 0.0172) and the highest R 2 and IA (R 2 = 0.9908, IA = 0.9977) can be considered as the most precise predictive model. The application of meta-heuristic optimization algorithms improves the performance of the standard GMDH model by 39.70%. The performance of most hybrid GMDH models is lower at the training stage than the standard GMDH. The results confirm a relatively good performance of the derived models at the testing stage. The standard GMDH model may trap into the local solutions, which causes the over-fitting problem. Figure 5 presents the scatter plots for the testing stage. Scatter points of the hybrid GMDH models are closer to 1:1 line than corresponding points obtained by applying the standard GMDH technique, which indicates that the meta-heuristic algorithms increase the performance of GMDH. www.nature.com/scientificreports/ To provide further insight into the outcomes of the derived models, the results of the models are plotted in the Taylor diagram 65 . The purpose of the Taylor diagram is to present on a single plot three statistical indices including the standard deviation, centered RMSE, and correlation. In Fig. 6, the values on the vertical and horizontal axes represent the standard deviation, the values on the dashed lines represent the correlation, and the values on the dash-curved lines represent the centered RMSE. Figure 6 shows that the dots representing the GMDH models are closer to the observation points than corresponding dots obtained by the application of the empirical relations. The results of the standard and hybrid GMDH models are magnified in a separate box. The plots show the superiority of the GMDH-FA over the remaining GMDH models.
Further discussion. An important factor in the analysis of the performance of machine learning methods and meta-heuristic techniques is the speed of the derived algorithms. Based on the observed CPU time presented in Table 6, it is becoming clear that the standard GMDH model is more efficient in terms of computational cost than the derived hybrid GMDH models. It is worth mentioning that the application of empirical equations in practical projects is easier than the ML models. However, nowadays, with tremendous progress that has been made in technology and minicomputers, the applications of ML-trained models to determine wave run-up, is straightforward.
In the previous section, the analyses of the performance of the GMDH models versus the empirical relations showed the superiority of the developed GMDH models. But in spate of that, the question may arise whether the differences between the results obtained by applying the developed models are statistically significant. Thus, the Kruskal-Wallis tests were carried out and the results are presented in Table 7. The non-parametric Kruskal-Wallis test is a technique often applied in statistical analyses. Mahdavi-Meymand et al. 18 used the Kruskal-Wallis test to compare several machine learning (ML) techniques and empirical equations applied to predict spillways   Table 7. The results of the Kruskal-Wallis test for assessing the significant statistical differences between the applied models. www.nature.com/scientificreports/ air demand and reported that there is no significant difference at the 99% confidence level between the applied ML approaches.
The results in Table 7 show that the probability value of the Kruskal-Wallis test for the empirical relations is 0.7251, which is higher than 0.05 and 0.01. This shows that there are no significant statistical differences between the results obtained by applying two empirical relations at both 95% and 99% confidence levels. However, significant differences exist between the results obtained by applying the GMDH models and empirical relations (p-value ˂0.0006). Thus, the GMDH models may be recommended to be applied instead of empirical relations to predict wave run-up. Moreover, although the statistical criteria confirmed that meta-heuristic algorithms increase the efficiency of the GMDH, the results in Table 7 show that there is no significant statistical difference between the outcome of different GMDH models. Hence, it is recommended to use the standard GMDH in situations where computational time is an important factor for users. More insight into the results obtained in the modelling of wave run-up in coastal regions is provided in Table 8.
Based on the findings of this study following potential subjects are proposed for future studies: Due to literature restrictions, it is recommended to take into consideration other possible future data resources to further evaluate predictive models developed in the present study. There is not a robust formula or procedure to select the best architecture of the GMDH model. More studies should be conducted to facilitate the construction of GMDH models. Based on the data used in this study and related calculated values of the correlation coefficient between the input and output variables, it was found that the beach slope has a limited effect on wave run-up height. Future laboratory studies should take into account a wider range of beach slopes. It is recommended to consider other non-parametric statistical tests such as Mann-Whitney test and evaluate the results of different methods.

Conclusion
The ability to accurately estimate the maximum wave run-up is vital for the maintenance and development of coastal areas and the safety of the coastal zone population. In this study, hybrid swarm and evolutionary intelligent GMDH models as well as the standard GMDH technique were developed and applied to predict the two percent value of the wave run-up (R 2% ). The invasive weed optimization (IWO), firefly algorithm (FA), differential evolution (DE), teaching-learning-based optimization (TLBO), and harmony search (HS) optimization algorithms were used as the meta-heuristic optimization methods to train the GMDH. The results show that the GMDH models have better performance than the empirical relations. The Kruskal-Wallis tests show significant statistical differences between the results of the empirical relations and GMDH models.
The results show that the application of meta-heuristic optimization algorithms improves the performance of the standard GMDH model by 39.70% at the testing stage. However, the performance of most hybrid GMDH models is lower than the standard GMDH at the training stage. Among all developed models, the GMDH-FA www.nature.com/scientificreports/ provides the best results in the testing stage with RMSE = 0.0209 m and IA = 0.9977. Moreover, the results show that among all methods developed in this study GMDH-TLBO, with the lowest RMSE = 0.0254 m and MAE = 0.0195 m, and the highest R 2 = 0.9863 and IA = 0.9965. The computational costs indicate that the standard GMDH model and empirical equations are significantly faster than the embedded meta-heuristic GMDH techniques. Moreover, the results show that there are no significant statistical differences between the GMDH models and meta-heuristic algorithms. Hence, the application of time-consuming models such as GMDH-FA is not recommended in situations where computational cost is a decisive factor. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.