Machine learning algorithm as a sustainable tool for dissolved oxygen prediction: a case study of Feitsui Reservoir, Taiwan

Water quality status in terms of one crucial parameter such as dissolved oxygen (D.O.) has been an important concern in the Fei-Tsui reservoir for decades since it’s the primary water source for Taipei City. Therefore, this study aims to develop a reliable prediction model to predict D.O. in the Fei-Tsui reservoir for better water quality monitoring. The proposed model is an artificial neural network (ANN) with one hidden layer. Twenty-nine years of water quality data have been used to validate the accuracy of the proposed model. A different number of neurons have been investigated to optimize the model's accuracy. Statistical indices have been used to examine the reliability of the model. In addition to that, sensitivity analysis has been carried out to investigate the model's sensitivity to the input parameters. The results revealed the proposed model capable of capturing the dissolved oxygen's nonlinearity with an acceptable level of accuracy where the R-squared value was equal to 0.98. The optimum number of neurons was found to be equal to 15-neuron. Sensitivity analysis shows that the model can predict D.O. where four input parameters have been included as input where the d-factor value was equal to 0.010. This main achievement and finding will significantly impact the water quality status in reservoirs. Having such a simple and accurate model embedded in IoT devices to monitor and predict water quality parameters in real-time would ease the decision-makers and managers to control the pollution risk and support their decisions to improve water quality in reservoirs.

Deterministic and stochastic models are used to model the D.O. concentration changes and capture any pattern from the measured data; However, these models require massive data to model the D.O. pattern and consider very complex 8 . Other models, such as the statistical model introduced to overcome the conventional models. Since many factors impact the concentration of D.O. in the reservoir, which can cause to nonlinearity pattern, the statistical model fails to capture it since it assumes that the relationship between D.O. and other parameters is linear 9 . Alternatively, Machine Learning (ML) techniques have been proposed as an other technique to capture the nonlinearity in any complex system 10,11 .
Artificial Neural Network (ANN) methods were used in conjunction with numerical simulation models to boost the simulation results 12 . Recently, ML techniques have been used intensively in modeling complex parameters related to water resources, such as predicting sea-level rise [13][14][15] , rainfall prediction 16,17 , reservoir water level prediction 18,19 , and streamflow forecasting 11,20,21 . Inspired by the robust performance of ML in capturing the nonlinearity patterns in most of the engineering systems, different algorithms of ML have been adopted to predict the water quality parameters. Predicting the class Water Quality Index (W.Q.I.) has been carried out using different ML algorithms by many researchers [22][23][24] . Artificial Neural Network (ANN) has been used to predict total nitrogen and phosphorus in the United States (U.S.) lakes 25 . At the same time, a support vector machine (SVM) was developed to predict the concentration of biological oxygen demand (B.O.D.) at the Johor river, Malaysia 26 .
Regarding dissolved oxygen concentrations, an adaptive neuro-fuzzy inference system (A.N.F.I.S.) was proposed to predict D.O. at the Johor river, Malaysia 8 . However, the limitations of the A.N.F.I.S. model were reported by Ahmadlou et al. 27 . These drawbacks are that it is not very accurate and cannot find the best parameters; it is also prone to get stuck in a local minimum, contributing to its lack of prediction abilities.
A model was developed by Heddam 28 to predict dissolved oxygen concentration using an optimally pruned extreme learning machine (O.P.E.L.M.). The study found that O.P.E.L.M. provided a reasonable estimate of D.O. However, Sánchez-Monedero et al. 29 found that O.P.E.L.M. tends to degrade too many neurons, which results in noticeable performance degradation in some data sets.
The least-squares support vector (L.S.S.V.R.) has been proposed by Liu et al. 30 to predict the amount of dissolved oxygen in intensive anaerobic ponds. However, It has been found that L.S.S.V.R. performance depends heavily on selecting the kernel coefficient and regularization coefficient, which are necessary for the optimization process and the final L.S.S.V.R. model. Regrettably, there is no unique, perfect method to specify the given parameters in the L.S.S.V.R. model. Extreme machine learning was developed by 31 and used to predict dissolved oxygen in urban rivers. In addition to that, recently, the concentration of dissolved oxygen in fishery pond was predicted using a gated recurrent unit 32 .
To overcome the inherent limitations established by standalone models, hybrid models have been proposed to optimize these algorithms' hyper-parameters by augmenting them with different optimization algorithms. For instance, different hybrid models have been developed and used to predict dissolved oxygen concentration [33][34][35] . Teaching-learning-based optimization algorithm (T.L.B.O.) is used to predict dissolved oxygen 36 . Various regression equations were optimized, including quadratic, exponential, logarithmic, and linear using T.L.B.O. Then the findings from T.L.B.O. compared with an artificial bee colony (A.B.C.) optimizer. Better results were obtained by hybridizing the quadratic regression equation with T.L.B.O. Besides the hybridized model's complexity, the authors used many parameters (twenty parameters) as inputs to develop the model. One of the drawbacks of such a model is to have access to a significant amount of available and reliable water quality parameters data, which is challenging.
Despite the acceptable performances these models achieved, however, there are few limitations associated with the hybridization of ML. One of these limitations is the complexity and complicated architecture and the difficulties in initializing the input parameters for these hybrid models 37 . Kumar et al. 38 found that the artificial neural network's prediction performance can be enhanced by improving the training approach without hybridizing it with optimization algorithms. In addition to that, a recent study highlighted the importance of the input combinations of ML algorithms' output accuracy, where the optimal input combinations can lead to a high level of accuracy without the need to augment ML with optimizers 39,40 .
Therefore, this study's chief aim is to propose an artificial intelligence model with simple architecture and a high-performance level to predict dissolved oxygen concentrations. This study will use historical data recorded for 29 years from the Fei-Tsui Reservoir to train the model to accomplish this goal. The number of neurons will be optimized in order to obtain the desired results. Different input combinations will be investigated and examined to enhance the model's performance. Recently many researchers have been developing AI models with a few inputs 41 . For example, Moghadam et al. used four input parameters and DO concentration to predict DO concentration in three different lead times 42 . Therefore, in this study few input parameters will be investigated.
Sensitivity analysis and uncertainty analysis will be carried out to validate the proposed model. Different statistical indices will be introduced to inspect the proposed model performance. For better visualization, Taylor's diagram, violin plot, and percentage of relative error between the projected data and the observed one have been implemented in this study.

Methodology
Study area and data description. Located in Taiwan's north region, the Fei-Tsui reservoir serves a 300 km 2 catchment area approximately, as shown in Fig. 1 43 . Since the 1980's, for Taipei city, the Fei-Tsui reservoir is considered the primary source of drinking water. One hundred fifty days is the approximate duration when the water resides in the reservoir. Since 1987, monthly measurements have been conducted to examine the reservoir's water status based on different water quality parameters. The water quality samples have been collected at the outlet of the five inflow tributaries of the reservoir and another seven sampling locations at the reservoir's Model development. An Artificial Neural Network with a single hidden layer was proposed to predict the dissolved oxygen concentration in the Fei-Tsui reservoir. The architecture of the proposed model can be seen in Fig. 2 45 . The proposed model consists of an input layer, which presents the input parameters that will be used to develop the model. In contrast, the output layer presents the model's output, which is the dissolved oxygen concentrations. Weights and biases connect the input and output layers to the hidden layer.
The hidden layer consists of several neurons. In this study, a different number of neurons will be investigated. In the beginning, the number of neurons will be set to equal five, then ten, fifteen, and finally, twenty. The predicted dissolved oxygen concentration will be compared with the observed concentration to choose the best model with the best-optimized number of neurons that give the lowest error. 29 years of monthly water quality data (348) will be used in developing the proposed model. 80% of the data will be used to train the model, while 20% will be used to test the model's accuracy. The pre-processing step was carried out by scaling the dataset between 0 to 1. Different types of activation functions and transfer functions will be explored and optimized.  Similarly, different training algorithms will be investigated to develop a model with a high level of precision. Finally, cross-validation with different k-fold (3, 5, 7 and 9) will be carried out to minimize the risk of overfitting. This validation procedure has been implemented using the trails-and-errors procedure, each trail with different value of the k-fold until achieving the one with minimal possibility of experienced overfitting. MATLAB Programming language was used to develop the proposed model.
One of the primary reasons for developing a model to predict the D.O. from other surface water parameters is that the D.O. is relatively costly and time-consuming to acquire and monitor. On the other hand, the main reason for selecting the Water Temperature, Biological Oxygen Demand, Iron, and Total Organic Carbon as a predictor for the D.O. is first because of the availability of these parameters. Secondly, there is a direct relationship between all these parameters and the D.O.; for example, the greater the amount of Biological Oxygen Demand in the water stream, the more rapidly is the depletion of the D.O. in water. Similarly, for the temperature (T), the more the T, the less the D.O. in the water stream will have occurred. Iron could critically consume the D.O. because of D.O. will be consumed as an oxidant for the Iron concentration. Hence, the D.O. concentration could dramatically reduce its amount in water stream. Finally, the Total Organic Carbon is the measuring indicator for how pure is the water stream considering the organisms' life, which is indirectly affected by the level of D.O. in the water stream. Therefore, these parameters have been considered as predictors for D.O. in the current research. Table 2 shows the statistical analysis and the correlations between these parameters and D.O.
Three different statistical indices will be applied to measure how the proposed model predicts dissolved oxygen concentration. These indices are Root Mean Square Error (R.M.S.E.), Coefficient of Correlation (Correlation), and Coefficient of Determination (R-squared). The formulas for these indices with comprehensive explanation can be found in study carried out by Najah et al. 46 . In addition to that, Taylor's diagram and violin plots will be performed to assess the correlation between observed and predicted data. Sensitivity and uncertainty analysis will be carried to validate the proposed model's reliability. Figure 3 demonstrates the flow of the proposed method in this study. As can be seen from the flowchart, after the secondary data is collected, a pre-processing step was carried out to normalize the dataset before building the models. Then, different models will be built using different algorithms, and each model optimized by tuning the hyper-parameters of each model. In addition to that, a comparison between the proposed model and the developed models in literature will be carried out to highlight the contribution of this research.

Results and discussion
One of the main characteristics of defining artificial neural network models is choosing the number of neurons for the hidden layer. An insufficient number of neurons can cause the model not to capture the data's nonlinearity 47 . However, if more neurons are introduced, that might increase the model's time and lead to overfitting. Therefore, to overcome such issues, in this study, different models were developed to find the optimum number of neurons where the number of neurons set to be 5, 10,15, and 20. In addition to that, identifying the right input combination is one of the vital factors that need to be considered in optimizing the performance of  3) used three input parameters (biological oxygen demand, iron, and total organic carbon). The fourth model (M.4) was developed using a different combination of three parameters (water temperature, biological oxygen demand, and iron). And finally, model five (M.5) was developed using all four parameters as input (water temperature, biological oxygen demand, iron, and total organic carbon). Twenty models have been developed with different inputs and numbers on neurons to find the best model for predicting dissolved oxygen concentrations changes. Table 3 presents the performance of each developed model for the optimum number of neurons. It can be seen that the best number of neurons falls in the range of 5 to 15 for each model. It can be seen, in all the proposed models, the number of neurons contributes significantly to the improvement of the models' accuracy, except in model five (M.5). It has been noticed that the poor performance in M.5 is associated with the input combinations, not with the number of neurons. Such findings reveal the input combinations' importance in developing a robust model. In addition to that, it can be observed from Table 3 that the best performance model (M.4) can be achieved when the number of neurons is equal to 15 and the combination of the input is water temperature, biological oxygen demand, and iron. This is followed by M.2 with the same number of neurons but with two input combinations (water temperature and biological oxygen demand). It can also be observed that when the total organic carbon was introduced as input in M.3 and M.5, the artificial neural network models' accuracy dropped. This indicates that the total organic carbon should not be considered input in developing reliable models to predict dissolved oxygen concentration changes. Feed-forward MLP model is used in this study. Regarding the training algorithm, three different algorithms were investigated, namely Levenberg-Marquardt, Bayesian regularization and Scaled conjugate gradient. The best results obtained by using the latter training algorithm. Scaled conjugate gradient is powerful training algorithm where there is no need for much memory. It is also proved to be faster in convergence compared to the other two used training algorithms. With regards to activation function, it was found that tanh (hyperbolic) function is best among the different inspected functions. It can be seen from Table 4 the performance of each developed model with a different input combination. It can be observed that the fourth model (M.4) outperforms all other models in predicting the dissolved oxygen and manages to capture the peak and low concentration of the dissolved oxygen. Moreover, it can be seen that the mean of the predicted data is close to the actual observed data.
To test the proposed model's reliability and to determine the model's validity, Taylor's diagram is recommended by many researchers and is commonly used 19,49 . It can be seen from Fig. 4 the relation between the correlation and the standard deviation for the actual and the predicted concentration of dissolved oxygen for the five models. It can be seen that M.4 is outperforming all other models where the distribution of standard deviation for the predicted data is close to the actual one, which suggests that the proposed model is consistent in capturing the observed data pattern.
The average percentage of relative error for each model has been computed to examine the error percentage between the predicted and the actual observed data, as shown in Fig. 5. The value confirmed this study, where M.4 indicates the lowest error compared to other developed models. While M.2 is ranked second, and the highest error observed with M.5.
A Violin plot is used to demonstrate the difference between the actual and predicted data from each model, as shown in Fig. 6. This plot helps to understand the probability distribution of the data. It can be seen that the best model is M.4, which its predicted data have similar distribution compared with the actual data. www.nature.com/scientificreports/ To measure the proposed models' performance when new data are introduced, the d-factor value is used for this purpose. When the d-factor values close to zero mean that the model can still perform well if a new data set is introduced 50 . This study uses the following equations to calculate d-factor values:  www.nature.com/scientificreports/ σ x represents the standard deviation of actual data x and d x represents the average distance between the upper X U (the value that is greater than or equal to every element in the dataset) and lower X L (the value that is less than or equal to every element in the dataset), i denotes the order of the record in the time series data (i = 1,…,N), while N represents the number of the observed dataset. It can be seen from Fig. 7 that M.4 shows the lowest d-factor value, which indicates this model architecture is reliable to be adopted when a new set of data used and can perform with a high level of accuracy. Four input parameters were used as inputs to develop the four models. In the current study, only three parameters were used as input to the model. As mentioned earlier, choosing the right combinations has a crucial impact on the performance of the model. In the current study, when water temperature, biological oxygen demand, and iron were used as input, the artificial neural network achieved a high accuracy level where r-squared equals 0.98. Multi-layer perceptron neural network developed to predict dissolved oxygen concentration in Malaysia's Johor River 52 . In www.nature.com/scientificreports/ this study, five different input combinations were used to develop the model. The performance of the model was acceptable were r-squared was equal to 0.95. Compared with the proposed model in the current study, a high accuracy level has been achieved where r-squared equals 0.98 with a smaller number of input combinations. It can be concluded that the current proposed model is more accurate and can be adopted as a tool to predict the changes in the concentration of dissolved oxygen. A point that can be raised from the earlier comparison is that the studies were conducted using different datasets in different countries. Therefore, for future works, a more valid comparison should be performed to consider these algorithms in predicting dissolved oxygen concentrations at the Fei-Tsui reservoir. For comparison purposes, the performance of the developed model (M.4) was compared with two other models, namely Random Forest (R.F) and Boosted Tree (B.T) regressions. The comparison was carried out using maximum and average relative percentage error. It has been observed that the maximum relative percentage error for M.4 is equal to 4.7%, while the maximum relative percentage error for B.T and R.F is 46% and 49%, respectively. At the same time, the average relative percentage error for M.4 is 1.3% which is the lowest than both B.T (4.1) and R.F (4.6).

Conclusion
The study focuses on predicting dissolved oxygen concentration as crucial water quality parameters in the Fei-Tsui reservoir in Taiwan using an artificial neural network model with simple architecture. Twenty-nine years of historical data provided the basis for development of the model. To test the model's reliability and optimize the algorithm, different numbers of neurons were used. Various numbers of input combinations were used to enhance the model's accuracy. Statistical indices were used to validate the accuracy of the model. The results reveal that the best number of neurons equals fifteen, while the best input combinations are three input parameters. These parameters are water temperature, biological oxygen demand and iron. The proposed model exhibits a high level of accuracy in predicting dissolved oxygen concentration changes where the r-squared is equal to 0.98. Taylor's diagram shows that the proposed model (M-4) displays a high consistency and accuracy level. Further investigation in implementing the proposed model in this research can predict other water quality parameters and be applied at locations with different climatic conditions for generalization purposes. There is a need to investigate machine learning models' integration with sensing technologies to efficiently monitor and predict water quality parameters for a smart early warning system. In addition, although the proposed optimization of the hyper parameters of the ANN modeling approach could provide proper prediction accuracy for DO, the accuracy could be improved by implementing the optimization of the hyper parameters of other AI model such as Random Forest and Boosted Tree algorithm.

Funding
This research received no external funding.