Abstract
Advances in Artificial Intelligence (AI) enable the development of new analytical and management tools, which open new opportunities in the field of environmental protection. This study showcases the use of Machine Learning (ML) techniques as a complementary method for assessing the status of water bodies. Since the main goal of the Water Framework Directive (WFD) is to improve water quality and reach good status in all water bodies across Europe, an intensive monitoring program was launched together with a water status assessment procedure. Based on the requirements of the European Union’s WFD concerning ecological status assessment, we present how ML can be used to assess unmonitored Polish river water bodies. In the absence of monitoring data, the foremost challenge lay in securing relevant alternative data, for which data on anthropogenic pressures were chosen. The pivotal solution was the implementation of ML techniques, which enable the processing of seemingly unrelated information on pressures in the catchment. Decision Tree, Random Forest, K-Nearest Neighbours (KNN), Support Vector Machine, Multinomial Naive Bayes and XGBoost models were tested, and the results indicated the most suitable techniques. The study shows that XGBoost and Random Forest are the most effective algorithms for classifying unmonitored water bodies. The models were compared by their overall accuracy (OA), which reached approximately 93% for binary classification and 72% for comprehensive classification, as well as by partial class accuracies and the Probability of Misclassification (PoM) parameter. The analysis demonstrates a practical application of AI to the assessment of unmonitored water bodies, both for the binary classification used in reporting on the water status objectives of the WFD and for possible use of the full classification for planning and operational purposes. OA and PoM are postulated as the best measures of goodness of classification.
Introduction
The assessment of the ecological and chemical status of water is a critical step in the water management cycle. Its results are crucial for undertaking appropriate actions to protect, preserve and manage water resources in a sustainable manner. The assessment is essential not only for purely administrative reasons, e.g. to prepare spatial characteristics of existing water bodies, but is also a formal obligation under the European Commission’s Water Framework Directive (WFD), both in Poland and throughout the European Union. All member states are expected to report the status assessment in six-year cycles, i.e. within each River Basin Management Plan.
Together with the obligation to assess water status, the WFD imposed a new method of performing this assessment, i.e. by comparison with the status of reference water bodies. Such a comparison can be based on measurements of biological, physicochemical and hydromorphological water quality elements performed in water bodies. The biological elements, which include indices for phytoplankton, phytobenthos, macrophytes, benthic invertebrates and ichthyofauna, are particularly significant for the assessment. The traditional approach to water status assessment requires these measurements to be made through sample collection or in situ analysis at monitoring stations, which allows for a standardized water quality assessment.
In accordance with the principles of the comprehensive assessment of water quality status outlined in the WFD, there are five ecological status classes, i.e. high, good, moderate, poor and bad. The WFD establishes a primary environmental objective for all water bodies to reach at least “good status”. Following the Common Implementation Strategy guidance1, the One-Out-All-Out (O-O-A-O) principle is also applied to the status assessment. A critical feature of this principle is that the outcome of the classification is determined by the lowest-scoring quality element among all biological elements. As a result of applying such a stringent criterion, only 6% of flowing water bodies on Polish territory have been classified as having good status2.
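The One-Out-All-Out rule can be illustrated with a minimal Python sketch. The element names and class values below are hypothetical examples, not data from the study, and the function name is introduced here only for illustration.

```python
# Ecological status classes ordered from best to worst, as listed in the text.
CLASS_ORDER = ["high", "good", "moderate", "poor", "bad"]


def one_out_all_out(element_classes):
    """Return the worst class among the assessed quality elements."""
    if not element_classes:
        raise ValueError("at least one quality element is required")
    # The element with the highest index in CLASS_ORDER is the worst one.
    return max(element_classes.values(), key=CLASS_ORDER.index)


# Hypothetical assessment of a single water body.
example = {
    "phytobenthos": "good",
    "macrophytes": "high",
    "benthic invertebrates": "moderate",
}
overall = one_out_all_out(example)
# The environmental objective is met only if the overall class is good or better.
meets_objective = CLASS_ORDER.index(overall) <= CLASS_ORDER.index("good")
```

In this example a single element in the moderate class is enough to place the whole water body below the good status threshold.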
The poor quality of Poland’s surface waters in terms of ecological status indicators is mainly due to the relatively low efficiency of many wastewater treatment plants in removing nutrients. The main problem is the pronounced eutrophication of waters and the heavy pollution from municipal and industrial waste water. In addition, runoff from agricultural areas and poorly organized wastewater management of livestock facilities account for a very large proportion of the total biogenic load entering water bodies.
It should be noted that the environmental objectives set by the WFD are very difficult to achieve, and not only for Poland and other Eastern European countries whose economies have only recently emerged from socialist patterns of functioning. No country in Western Europe has managed to achieve good status for all its waters, either in the first timeframe, i.e. 2015, or in the next six-year timeframe, i.e. 2022. Of all the biological elements of water quality, macroinvertebrates in particular lower the ecological status assessment of Polish waters, because they frequently fall into the moderate or even poor class.
Since the biological quality elements in the surveillance monitoring program are measured only once during the six-year water management cycle, and only two full cycles have been completed, it is still likely that some measurements and condition class ratings are subject to large uncertainties. The issue of uncertainty in monitoring measurements and in status assessments based on those measurements is discussed in detail in other works3 and is not considered in this article.
Conducting a well-executed monitoring program is time-consuming and costly. Given limited resources and the resource-intensive nature of comprehensive water monitoring, it is economically unfeasible for any EU member state to monitor all designated water bodies in its territory and to assess their ecological status exclusively on the basis of actual measurement data. The method for assessing the ecological status of surface waters is described in detail in the Regulations4.
Despite the rather high costs, the involvement of qualified staff and the need for well-equipped laboratories, the resulting status assessment must be treated only as an estimated and biased measure of the true status1,2. Moreover, due to limited financial and logistical resources, not all water bodies (w.b.) in Poland are included in the surface water monitoring network. In this context, the application of tools such as machine learning algorithms presents a promising alternative approach to augment the existing methodologies.
Regardless of the method used to perform water body status classification, i.e. based on monitoring data, expert judgement or any formal procedure, the WFD mandates that the member states report the status assessment of all water bodies through a classification system comprising five classes of ecological status and two classes of chemical status. This study focuses on ecological status only, for which a result in five classes is the most desirable. However, from the perspective of evaluating whether the environmental objectives have been attained for individual w.b., the crucial aspect is to check whether the status is above or below the threshold between good and moderate status. Consequently, several model variants were prepared to find the most reliable method for establishing solid grounds for water management decisions.
This paper aims to present the application of machine learning to the assessment of water status in unmonitored water bodies in Poland, in order to increase the scope and effectiveness of water management. The analysis of available data on pressures in water body catchments and of results obtained from monitoring data, combined with advanced classification algorithms, enables good estimates of water status for water bodies that have not hitherto been systematically monitored. Machine learning (ML) techniques provide an efficient and precise alternative to labour-intensive in situ measurements and highly complicated run-off models.
The use of artificial intelligence algorithms to classify water quality, expressed for example in the form of the so-called water quality index (WQI), was presented by Nasir5. In that example, Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), CatBoost and Multi-layer Perceptron (MLP) were applied.
Many studies have successfully leveraged machine learning capabilities to discover relationships between water quality indicators (WQI)6 or specific elements of water quality assessment7. However, all of these attempts have sought to establish correlations between the status of surface water and actual data collected from samples in a given research area.
A comprehensive review of machine learning methods in water treatment and monitoring applications8,9 makes clear that AI and ML methods have been successfully applied to monitoring and modelling the growth process and health of water-based agricultural systems10 such as hydroponics and aquaponics, where the control of fish production is based on optimised ranges of water quality indicators used as input variables.
There have also been studies dedicated to predicting and assessing the ecological status of surface waters that show promising results11,12. Empirical studies validate the efficacy of ML techniques for water quality assessment, yielding high-accuracy, reliable results. However, to the best of our knowledge, these methodologies have not been explicitly incorporated into WFD reporting by any EU member state, nor have they been specifically developed for Polish river systems. This gap presents an opportunity for innovative application of ML in regulatory water quality monitoring and reporting frameworks13,14.
It is reported that many AI and ML techniques suffer from poor reproducibility due to the use of random weights and of a specific set of hyperparameters, which work only with data whose characteristics are similar to those of the dataset on which the hyperparameters were set. That is why a solution found for a particular dataset cannot be successfully applied to another application, and models are closely dependent on the selected data8. The research community has neither commonly accepted nor even suggested a measure of goodness that should be used for the solution reached. Consequently, any comparison of different ML models depends on the importance that the modeller subjectively assigns to the various accuracy measures.
The authors intended to formulate some general advice for water managers on whether to apply the results of a particular classification to unmonitored water bodies. This decision is of great importance, as the status class of a water body, whether monitored or not, is binding for the whole six-year period of the River Basin Management Plan. The discussion of this issue presented in this paper is based on the example of river water bodies in Poland.
The aim of the study was to verify whether standard ML algorithms can be used in the assessment of unmonitored water bodies. The results presented in this study can significantly contribute to a better understanding of water quality in Poland and assist in the effective implementation of measures within the framework of the European Union’s Water Framework Directive. The authors are convinced that machine learning can provide more versatile and effective tools for water status assessment, which will help achieve the environmental goals of water resource protection and sustainable management.
Data and methods
The research problem in this article encompasses situations where no monitoring data are available. That kind of situation completely changes the approach to input data on which algorithms typically learn. The research assumption was to focus on water bodies for which the absence of monitoring data does not allow for the execution of the standard procedure for assessing ecological status. It was assumed that a relation between pressures present in the w.b. catchment and the status class can be derived to classify the status of unmonitored water bodies.
To find a model describing such a relationship, our analysis first focused on monitored water bodies and their pressures. The input data represented characteristics of anthropogenic pressures, which were collected for the entire territory of Poland. This relationship, once established, can be applied to unmonitored water bodies, providing a practical solution for assessing their ecological status when only data on pressures are available.
Input data
This paper is based on data collected in Poland during the period of the last River Basin Management Plan (2016–2021) and on the information concerning pressures resulting from the implementation of the Water Framework Directive.
Information on the ecological status of monitored water bodies was acquired from the State Environmental Monitoring system. The data used included 2667 riverine water bodies located all across Poland. The data set is highly diverse as it includes w.b. of over 20 river types present in Poland, which have varying lengths and basin sizes ranging from a couple of square kilometres to thousands of square kilometres.
The pressure data included information on different anthropogenic pressures in the area of a given w.b., represented by pollution releases (or water uptakes) resulting from human activity that affect the aquatic environment. Easily accessible information characterizing these water bodies, such as their length or catchment area, was also utilized. For unmonitored w.b., information on the state of the water body is scarce. It is therefore necessary to obtain data that are widely available, preferably for all Polish waters, and that in some capacity reflect water quality. All the data were acquired from the State Water Holding Polish Waters (Wody Polskie). Eleven tables describing pressures from specific discharges were obtained from all voivodeships. The data cover:
1. Discharge of brine, medicinal waters, and thermal waters.
2. Discharge of liquid animal waste, except for manure and slurries intended for agricultural use.
3. Discharge of water from drainage of buildings or excavation sites.
4. Discharge of rainwater and meltwater.
5. Discharge from stormwater overflows.
6. Discharge of domestic wastewater.
7. Discharge of municipal wastewater.
8. Discharge of industrial wastewater (including washwater and dewatering from mining facilities).
9. Discharge of leachate from waste storage facilities and places of their storage.
10. Aquaculture points.
11. Points for the introduction of industrial wastewater containing substances particularly harmful to the aquatic environment into sewage treatment facilities.
The intention was to create a dataset in which each water body of known ecological status has a list of wastewater discharges. Most of the aforementioned data included information such as BOD5, COD5, TSS and chemical content for the specific discharge, as well as additional parameters of the w.b., including its length and river basin information. The data assigned in this manner constituted a dataset of 176 features and responses that were used in the machine learning processes.
Machine learning algorithms
Machine learning consists of various computation methods and specialized algorithms tailored to specific needs. Algorithms can be categorized into four main prediction methods: classification, regression, clustering, and dimensionality reduction, as presented in Fig. 1. In assessing the ecological status, classification algorithms were employed. In all cases, scikit-learn, a Python library for machine learning, was used.
Fig. 1. Algorithm type decision flowchart.
Classifiers in machine learning are algorithms that learn to assign classes (labels) to objects characterised by their features. They operate based on a subset of data, which contains feature vectors (input data) and their corresponding class labels (responses). Training optimizes the model’s internal parameters, while hyperparameters, which are set before training, are tuned to achieve the best possible classification performance.
The entire dataset is typically divided into two parts: a training subset and a testing subset. The training data subset is prepared in the form of feature vectors and their corresponding class labels. An appropriate classification model is selected to represent a classification function. This can be, for example, the logistic regression algorithm, decision trees, support vector machines (SVM), neural networks or another classifier. The model is trained on the training dataset, where the model’s parameters are optimized to minimize prediction error (the discrepancy between actual class labels and model predictions).
After training, the model’s performance is evaluated on the previously unused testing subset. Various metrics such as accuracy, precision and recall are used to assess the goodness of the resulting classification. Once the model has been successfully trained and evaluated, it can be used to classify new data previously ‘unseen’ by the model, as shown in Fig. 2.
Fig. 2. Flowchart of model schematics.
Classifiers in ML are versatile tools which find applications in various fields such as data analysis, image recognition, medical diagnosis, natural language processing, and many others. The choice of the classifier depends on the data characteristics, performance requirements, and the complexity of the classification problem, and must be selected on a case-by-case basis.
In this paper, ML algorithms were applied to the ecological classification of water bodies (objects), which were characterised by features representing the pressures present in the water body catchment. Various classification algorithms were used and compared in terms of their goodness of fit, and the best ones were then selected for the analysed case of water status classification. Short descriptions of the applied algorithms are presented in the following subchapters.
K-nearest neighbours - KNN
K-Nearest Neighbours (KNN) is a machine learning algorithm used for classification and regression problems. Training data can be represented as points in a multi-dimensional space, where each object corresponds to one point, and each feature corresponds to a dimension. Before using KNN, input data should be normalized because distance can be sensitive to feature scales. KNN finds a given number, k, of the nearest neighbours of a given sample in the feature space (the closest points in the training data). The value of k is chosen by the user. The sample is assigned to the class that predominates among its closest neighbours. The algorithm requires defining a distance measure to determine which samples are the nearest neighbours. The most commonly used distance measures are the Euclidean distance and the Manhattan distance. The choice of an appropriate value for k and distance measure is crucial to achieving good results with KNN7,15,16,17,18,19.
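The steps above (normalization, distance computation and majority voting) can be sketched in plain Python. This is a minimal illustration rather than the scikit-learn implementation used in the study; the two features (catchment area and number of discharges) and all values are hypothetical.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Scale each feature (column) to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    scaled = [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]
    return scaled, lows, spans


def knn_predict(train_x, train_y, sample, k=3):
    """Classify `sample` by majority vote among its k nearest neighbours."""
    # Euclidean distance from the sample to every training point.
    dists = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: [catchment area in km2, number of discharges].
raw = [[10, 0], [15, 1], [12, 0], [900, 14], [1200, 20], [800, 11]]
labels = ["at least good"] * 3 + ["below good"] * 3
scaled, lows, spans = min_max_scale(raw)

# A new water body must be scaled with the same minima and ranges.
new = [(20 - lows[0]) / spans[0], (1 - lows[1]) / spans[1]]
pred = knn_predict(scaled, labels, new, k=3)
```

Without scaling, the catchment area would dominate the distance simply because its numerical range is much larger than that of the discharge count.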
Support vector machine - SVM
Support Vector Machine (SVM) is, like KNN, an ML method used for classification and regression tasks. In classification tasks, SVM finds the optimal hyperplane (or hyperplanes in the case of multiclass classification) that best separates the objects being classified into different classes in a multidimensional space whose dimensionality equals the number of features in the training data.
SVM finds the hyperplane that maximally separates samples from different classes, i.e. the one with the largest margin between the closest objects from different classes (known as support vectors). After the optimal hyperplane or an appropriate kernel transformation has been found, objects from a new dataset are assigned to the respective classes based on their position relative to this hyperplane or transformation. Different kernel functions can be chosen, such as the Gaussian RBF (Radial Basis Function), linear, polynomial and sigmoid kernels, and they can yield different levels of precision. SVM has been used for analyses of water quality indexes and can be useful for water quality assessment as well as improvement in water resources management20,21.
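For the linearly separable two-class case, the maximum-margin hyperplane is commonly written as the optimisation problem below. The notation (weights $\mathbf{w}$, bias $b$, feature vectors $\mathbf{x}_i$, labels $y_i \in \{-1, +1\}$, kernel parameter $\gamma$) is introduced here for illustration and does not come from the study.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n

% Width of the margin between the two classes:
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}

% Gaussian RBF kernel, which replaces the inner product for non-linear separation:
K(\mathbf{x}, \mathbf{x}') = \exp\!\left( -\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2} \right)
```

Minimising $\lVert \mathbf{w} \rVert$ is equivalent to maximising the margin, and the training points for which the constraint holds with equality are the support vectors.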
Multinomial Naïve Bayes - MNB
Multinomial Naive Bayes (MNB) is an implementation of the Naive Bayes classification method, which assumes feature independence. Training data are represented as vectors of numbers, where each sample is a vector and each element is the number of occurrences of a particular feature. MNB estimates probability distributions for each class based on the training data. For each class, it calculates the probability that a given sample belongs to that class, as well as conditional probabilities that determine which features are most likely for a given class. During classification, conditional probabilities are calculated for each class based on the presence of features, and then the class with the highest conditional probability is chosen. MNB is particularly well-suited for text classification because it works effectively with word frequency counts or term frequency-inverse document frequency representations of text data. It is a simple yet powerful algorithm for tasks like spam detection, sentiment analysis, and document categorization. It may prove useful in finding patterns in water quality datasets7,22.
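A minimal sketch of the estimation and classification steps is given below. It assumes Laplace (add-one) smoothing, which the text does not specify, and uses hypothetical counts of discharge points per type as features.

```python
import math
from collections import defaultdict


def train_mnb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, count in enumerate(row):
            feature_counts[label][j] += count
    model = {}
    for label, n in class_counts.items():
        # alpha is the smoothing constant (an assumption of this sketch).
        total = sum(feature_counts[label]) + alpha * n_features
        model[label] = (
            math.log(n / len(y)),
            [math.log((c + alpha) / total) for c in feature_counts[label]],
        )
    return model


def predict_mnb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_like))
    return max(model, key=score)


# Hypothetical counts per water body: [municipal, industrial, aquaculture].
X = [[0, 0, 1], [1, 0, 2], [0, 0, 2], [5, 3, 0], [4, 2, 0], [6, 4, 1]]
y = ["at least good"] * 3 + ["below good"] * 3
model = train_mnb(X, y)
pred_a = predict_mnb(model, [0, 0, 3])
pred_b = predict_mnb(model, [4, 3, 0])
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied.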
Decision tree - DT
A decision tree is one of the fundamental algorithms used in machine learning; it builds a structure of decisions in a manner similar to a tree. The process begins at the root, which represents the entire training dataset. For each node (the graphic interpretation of a step of the classification procedure), the algorithm selects a feature and a threshold value that best divide the data into subgroups, minimizing the heterogeneity of classes. In a single step the data are split into two subgroups. New nodes represent these subgroups, and the algorithm repeats the splitting process for each of them. The splitting process continues until certain termination conditions are met, such as maximum tree depth or a minimum number of objects in a node. When a node can no longer be split, it becomes a leaf and is assigned the label of the most frequently occurring class in that group. Data samples traverse the tree from the root to the appropriate leaves, and predictions are made based on the class labels at the leaves. The decision tree is a simple and effective model that has advantages for use in water quality analysis23,24.
Random Forest - RF
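The choice of a threshold at a single node can be sketched as follows. The sketch assumes Gini impurity as the heterogeneity measure, which is one common choice and is not stated in the text; the feature values and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical feature: number of discharges in the catchment.
discharges = [0, 1, 2, 8, 10, 12]
status = ["at least good"] * 3 + ["below good"] * 3
threshold, score = best_split(discharges, status)
```

A full tree repeats this search over all features at every node and recurses into the two resulting subgroups until a termination condition is met.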
Random Forest (RF) is an ensemble learning method that combines multiple decision trees to improve overall predictive accuracy and reduce the risk of overfitting. Given a training dataset with n cases/ objects and m features, RF creates many random subsets of data (with replacement) by randomly sampling n instances from the original dataset. Each subset is called a “bootstrap sample” and is used to train a single decision tree in the forest. In each decision tree node, only a random subset of features (typically the square root of the total number of features) is considered to find the best split. The random feature selection ensures diversity among individual trees in the forest.
During the prediction phase, each tree in the forest independently classifies the input data. In classification tasks, each tree “votes” for a class label, and the RF classifier takes the class with the most votes as the final prediction. In regression tasks, the final prediction is typically the average of all tree predictions. The performance of the RF classifier can be further optimized by adjusting hyperparameters such as the number of trees in the forest, the maximum depth of each tree, and the number of features considered at each split.
Random Forest offers many advantages, such as high accuracy, resistance to overfitting, and the ability to work with large datasets with a high number of features while being a simple and fast-to-implement algorithm. The characteristics of RF make it a very useful tool for overall environmental and water analyses as well as complex simulations7,23,25,26,27.
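The three randomisation and aggregation steps of Random Forest described above (bootstrap sampling, random feature subsets and majority voting) can be sketched in isolation. Tree training itself is omitted, the helper names are hypothetical, and the feature count of 176 is used only as an illustrative figure.

```python
import math
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    k = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Combine per-tree class labels into the forest's final prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_sample(8, rng)             # indices may repeat
features = random_feature_subset(176, rng)  # illustrative feature count
final = majority_vote(["below good", "at least good", "below good"])
```

Because each tree sees a different bootstrap sample and different feature subsets, the errors of individual trees are less correlated, which is what makes the vote more reliable than any single tree.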
XGBoost
Like the Random Forest classifier, XGBoost is an ensemble method, but it adds weak learners (models with low individual accuracy, such as shallow decision trees) sequentially to improve overall predictive accuracy.
It begins with a simple prediction and builds trees to correct the errors of the model. Each tree is trained to predict the negative gradient of the loss function, guiding it to minimize errors. Regularization terms control tree complexity, preventing overfitting. Trees are pruned for efficiency, and a learning rate regulates their contribution to the ensemble. XGBoost iteratively refines its predictions, gradually enhancing performance through a careful combination of trees, regularization, and gradient-based training. As an ensemble method, it is a very promising technique that can be used for research on water quality problems28,29.
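The core idea of fitting each new tree to the negative gradient of the loss can be sketched with plain gradient boosting on one feature. This is a simplified illustration, not XGBoost itself: it omits the regularization terms, pruning and second-order information that XGBoost uses, and all names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals (least squares).

    Assumes x contains at least two distinct values.
    """
    best = None
    distinct = sorted(set(x))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left)
        sse += sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]  # (threshold, left value, right value)


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Sequentially add stumps fitted to the negative gradient of the log loss."""
    scores = [0.0] * len(x)  # initial prediction: log-odds 0, i.e. p = 0.5
    stumps = []
    for _ in range(n_rounds):
        # For the log loss, the negative gradient w.r.t. the score is y - p.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        t, lval, rval = fit_stump(x, residuals)
        stumps.append((t, lval, rval))
        scores = [
            s + learning_rate * (lval if v <= t else rval)
            for s, v in zip(scores, x)
        ]
    return stumps


def predict_proba(stumps, value, learning_rate=0.3):
    """Probability of class 1 for a single feature value."""
    score = sum(learning_rate * (l if value <= t else r) for t, l, r in stumps)
    return sigmoid(score)


# Hypothetical data: number of discharges; 1 = below good, 0 = at least good.
x = [0, 1, 2, 8, 10, 12]
y = [0, 0, 0, 1, 1, 1]
stumps = boost(x, y)
p_high = predict_proba(stumps, 11.0)
p_low = predict_proba(stumps, 1.0)
```

The learning rate scales down each stump's contribution, so the ensemble approaches the target gradually instead of fitting the training data in a single step.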
Model efficiency measures
Measures of goodness of classification are based on the confusion matrix, which allows the actual assignment of classified elements to be compared with the assignment produced by the algorithm. The more alike the two assignments are, the better the classification algorithm. There are four fundamental metrics available: accuracy, precision, recall, and F1-score. Based on the values of these metrics, one can evaluate the model’s quality and select the best one.
Accuracy is a measure of overall classification correctness. It is the ratio of all correctly classified instances to the whole set. This measure of correctness of classification is useful when classes are balanced (have a similar number of elements).
Assuming that there are just two classes, “positive”, whose elements are characterised by some desirable feature, and “negative”, whose elements do not have the desired feature, as shown in Fig. 3, the accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
where TP - True Positive; instances where the model has correctly classified elements to the “positive” class. TN - True Negative; instances where the model has correctly identified the “negative” class. FP - False Positive; instances where the model has incorrectly identified “positive” when it should be “negative”. FN - False Negative; instances where the model has incorrectly identified the “negative” class when it should be “positive”.
Precision is the ratio of the number of items correctly classified to the “positive” class (true positives) to the total number of samples classified to the positive class, both correctly and incorrectly. In this study, it is the ratio of the number of instances in which the model correctly assigned a w.b. to a given class to the total number of w.b. assigned to that class. It shows how many of the positive predictions were actually correct. This measure is useful when it is important to avoid false positives.
Recall is the ratio of the number of correctly classified positive samples (true positives) to the total number of actual positive samples. For the data used here, it is the ratio of the correctly classified instances of a given class to all instances that should have been assigned to that class. It helps to understand how many of the positive samples were detected. It is useful when it is important to avoid false negatives.
F1-score is the harmonic mean between precision and recall. It is used when we simultaneously care about precision and recall. It provides a more balanced assessment of the classifier’s performance, especially when classes are imbalanced (have different numbers of examples).
Fig. 3. Example of confusion matrix for the two-class variant model.
To compare the goodness of classification of monitored and unmonitored w.b., the probability of misclassification (PoM) can be used alongside or instead of the measures described above.
PoM indicates the likelihood that an instance or sample is incorrectly placed in a given class. A high probability of misclassification could lead to erroneous assessments of water quality, potentially resulting in inappropriate interventions or inadequate regulatory measures. For the purpose of this study, PoM was assumed to be the complement of Recall, i.e. PoM = 1 - Recall.
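The measures defined in this section can be computed directly from the four confusion-matrix counts, as in the short sketch below. The counts are hypothetical, and returning 0.0 for an undefined ratio is a convention assumed here, not one stated in the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Undefined ratios (empty denominators) are reported as 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    pom = 1.0 - recall  # PoM taken as the complement of recall
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pom": pom,
    }


# Hypothetical counts for one class of a two-class model.
m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
```

With these counts the accuracy is 0.85, the recall is 0.90 and the PoM is therefore 0.10, meaning that one in ten water bodies truly belonging to the class is assigned elsewhere.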
Model preparation
All calculations were based on the Spyder environment30, used for data analysis, scientific computing, and programming in Python30. Libraries such as Pandas31, NumPy32, Scikit-learn33, and others were used for the calculations.
Data received from the State Water Holding Polish Waters (Wody Polskie) and the Chief Inspectorate of Environmental Protection were prepared and used for all calculations. To avoid large differences in magnitude between features, which could lead to misleading results, the data were scaled. The scaled data were then divided into training and testing subsets in a 75:25 ratio, respectively, a standard practice that facilitates robust model development and performance evaluation. This partitioning ensures that the model is trained on one portion of the data while a separate subset is reserved for evaluating its performance.
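The 75:25 partitioning can be sketched in plain Python as follows. Whether the study shuffled or stratified the records, and which random seed was used, is not stated, so the shuffling and the seed below are assumptions; the rows are placeholders standing in for already scaled feature vectors.

```python
import random


def train_test_split_75_25(rows, labels, seed=0):
    """Shuffle the records and split them 75:25 into training and testing subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # assumed: random shuffle, fixed seed
    cut = int(round(0.75 * len(indices)))
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (
        pick(rows, train_idx),
        pick(rows, test_idx),
        pick(labels, train_idx),
        pick(labels, test_idx),
    )


# Placeholder records: one per monitored water body (2667 in the dataset).
rows = [[i] for i in range(2667)]
labels = ["below good"] * 2667
x_train, x_test, y_train, y_test = train_test_split_75_25(rows, labels)
```

For 2667 water bodies this yields 2000 training and 667 testing records, and no record appears in both subsets.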
The computational part consists of code segments with algorithms for performing calculations. In each segment, the aforementioned algorithms were used to create a model with standard hyperparameters, which were later calibrated. The relevance of the data set was not known, so all classifiers were used to identify the most suited approach. During this process, each model iteratively adjusted its internal parameters to minimize the discrepancy between predicted and actual class labels. Following the training phase, the script proceeds to the testing phase.
During preparation of the model, Principal Component Analysis (PCA) was applied. PCA is a data analysis method used for dimensionality reduction, which identifies the most significant components (principal components) in the dataset. The analysis showed that over 99% of the data is relevant and that the best results are obtained with the full dataset, so nothing was excluded from further computations; PCA is therefore not practical in this specific case.
Results
The results of all classification algorithms were initially compared using overall accuracy (Eq. 1). The initial analysis aimed to select the algorithm that achieved the highest results and to focus on tuning the hyperparameters of this algorithm.
Model variants
Because only one high-status water body occurred in the entire dataset, it was decided to merge the high and good classes into a single class labelled at least good status. Three main model variants were analysed, which differ in the number of classes applied:
- Variant 1: Two possible status classes - at least good (consisting of high and good status classes) and below good (consisting of moderate, poor and bad).
- Variant 2: Three possible classes - at least good (as in variant 1), moderate, and below moderate (consisting of classes poor and bad).
- Variant 3: Four possible classes - at least good, moderate, poor, and bad.
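The three variants amount to three different mappings from the five WFD classes to model labels. A minimal sketch of such a relabelling is given below; the dictionary and function names are hypothetical.

```python
# Mapping from the five WFD ecological status classes to the labels
# used in each model variant, as described in the list above.
VARIANTS = {
    1: {"high": "at least good", "good": "at least good",
        "moderate": "below good", "poor": "below good", "bad": "below good"},
    2: {"high": "at least good", "good": "at least good",
        "moderate": "moderate", "poor": "below moderate", "bad": "below moderate"},
    3: {"high": "at least good", "good": "at least good",
        "moderate": "moderate", "poor": "poor", "bad": "bad"},
}


def relabel(statuses, variant):
    """Convert five-class statuses into the labels of the chosen variant."""
    mapping = VARIANTS[variant]
    return [mapping[s] for s in statuses]


# Hypothetical statuses of five water bodies.
observed = ["high", "good", "moderate", "poor", "bad"]
v1 = relabel(observed, 1)
v2 = relabel(observed, 2)
v3 = relabel(observed, 3)
```

Keeping the mapping separate from the feature data means the same features can be reused to train a model for each variant.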
Analysing the data in terms of the number of w.b. assigned to the respective status classes, a significant disproportion in the at least good class compared to the others was observed. In the case of Variant 1, this proportion was approximately 7% of at least good status w.b. compared to 93% of below good status w.b. In the conducted research, such a large difference had an impact on the accuracy of the algorithm’s classification results shown in Table 1. Based on this fact, it was decided to use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes in order to prevent distortion of the results34.
This allowed the ML algorithms to learn, recognize and assess the at least good status more comprehensively. Tables 1 and 2 show the results for the original data and after applying SMOTE. Although the accuracy values are at a similar level in both cases, in Table 1 the precision for the good status class is 0%, which means that the algorithm makes no correct predictions for this class. After applying the SMOTE technique (Table 2), a considerable improvement can be seen in algorithm performance for both classes (below good and good).
Tables 3, 4 and 5 below show the classification results obtained by the six algorithms in all analysed variants.
The two-class variant shown in Table 3, in which the algorithms performed simple binary classification, was characterized by high model accuracy. Three out of five algorithms showed accuracy above 80%. Two algorithms, XGBoost and Random Forest, tied for the highest accuracy, both exceeding 90%. The same two algorithms also gave the best partial accuracies, greater than 90%.
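The core idea of SMOTE is to create synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that idea only; it is not the implementation used in the study, and the function name, parameters and toy data are assumptions made for this example.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly picked minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)                          # cannot have more neighbours than points
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)      # starting sample for each new point
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy data with roughly the 7% / 93% split reported for Variant 1.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(93, 4))   # stands in for "below good"
X_minor = rng.normal(2.0, 1.0, size=(7, 4))    # stands in for "at least good"
X_synth = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_synth])
print(X_major.shape, X_minor_balanced.shape)
```

In practice, oversampling of this kind is normally applied to the training portion of the data only, so that the test set keeps the original class proportions.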
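To make the distinction between overall and partial accuracies concrete, the sketch below computes both from lists of true and predicted labels. It assumes that partial accuracy corresponds to the per-class share of correctly predicted members (per-class recall); the function names and the toy labels are hypothetical. The toy case also shows why overall accuracy alone can mislead on imbalanced data.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Share of all water bodies whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """For each true class, the share of its members predicted correctly
    (per-class recall; assumed here to correspond to partial accuracy)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Toy labels mirroring a 7% / 93% imbalance: a model that always answers
# "below good" looks accurate overall but never finds the minority class.
y_true = ["at least good"] * 7 + ["below good"] * 93
y_pred = ["below good"] * 100
print(overall_accuracy(y_true, y_pred))
print(per_class_accuracy(y_true, y_pred))
```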
The finer division of the below good class into moderate, poor and bad, which is in line with the WFD, was meant to help assess, in consecutive water management cycles, whether the water management actions performed in the catchment contributed to at least some improvement of the status. If a water body classified as poor in one cycle were assessed as moderate in the next, this would confirm that the actions undertaken in its catchment were correct. Here, however, the classification is derived from data on pressures, which in Poland are not updated more frequently than required by the WFD, i.e. every six years, so a coarse division into just two classes can be sufficient.
The results of classification into three classes, presented in Table 4, are characterised by lower overall accuracy than those of binary classification. In this case too, the XGBoost and Random Forest algorithms achieved higher accuracies than the other algorithms, but only barely above 60%.
For the four-class variant, the Random Forest and XGBoost algorithms again achieved the highest accuracy, at 72%, as shown in Table 5. The partial accuracy in identifying the good status was significantly higher than for the other status classes, at around 86% and 85%, while the average for the other classes is approximately 50%.
It can be observed that, in each variant, the algorithms that classified w.b. status more efficiently (Decision Tree, KNN, Random Forest) had the greatest difficulty in correctly classifying the moderate status or its equivalent. In the variants with three or four classes, the algorithms evidently found it easier to recognize the other classes. In the variant with four status classes, the moderate status has distinctly the lowest accuracy of all classes. As a result of the difficulty the algorithms have in recognizing the moderate status, the overall accuracy of the models drops by approximately 5–8%.
Results of classification based on ML techniques were also analysed using the probability of misclassification (PoM) (5). Tables 6 and 7 contain the values of this probability expressed as percentages. It must be stressed that these numbers correspond to the overall resultant classification and cannot be interpreted as the PoM of the class of a particular water body.
Regarding the probability of misclassification, the Random Forest and XGBoost algorithms show superior results in comparison to the other algorithms. In the two-class variant, the PoM for both of these algorithms is below 10%. The results of the four-class variant show much greater variation in PoM, but once again XGBoost gives the most promising results, with PoM below 50% for all classes. For the Random Forest and Decision Tree algorithms, PoM exceeds 60% for the moderate class.
Principal Component Analysis (PCA) and feature importance analysis were applied to check for possible noise in the data set and to identify the most significant features. The PCA showed that approximately 99% of the data is viable for computation and that using it in full leads to the best results. To further verify the results, Pearson’s correlation coefficient was calculated and the seaborn visualization tools were used to draw a heatmap of feature correlation35. The analysis revealed the presence of some significant correlations within the dataset. No strong negative correlations (r < -0.7) were observed, whilst strong positive correlations (r > 0.7) were sparse, amounting to only 1.42% of all correlations, as illustrated in Fig. 4.
Given the results of PCA and the overall complexity of the data, which makes the dataset highly unusual, it was decided to retain the dataset in its entirety without any feature elimination. This decision was made to preserve potentially crucial information that, while only weakly correlated with individual features, may contribute to more complex multivariate interactions within the entire set of features.
Fig. 4. Heatmap of features that demonstrated positive correlation of r > 0.7.
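The share of strongly correlated feature pairs reported above can be computed directly from the Pearson correlation matrix. The following minimal sketch shows one way to do this with NumPy; the function name and the synthetic data are hypothetical, and the 0.7 threshold is the one quoted in the text.

```python
import numpy as np

def strong_correlation_share(X, threshold=0.7):
    """Share of distinct feature pairs with Pearson r above +threshold
    or below -threshold. Assumes no constant columns in X."""
    r = np.corrcoef(X, rowvar=False)           # features are columns
    upper = np.triu_indices_from(r, k=1)       # each pair once, no diagonal
    pairs = r[upper]
    return {
        "positive": float(np.mean(pairs > threshold)),
        "negative": float(np.mean(pairs < -threshold)),
    }

# Synthetic data, for illustration only: five features, one a near copy of another.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=300)

share = strong_correlation_share(X)
print(share)
```

The same matrix `r` is what a correlation heatmap would display; only the counting step is shown here.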
Analysis of feature importance gave similar results, showing that most features individually have a very low impact on the model. Feature importance is a technique that assigns scores to features based on their impact on the model’s predictions, indicating how much each feature contributes to the overall outcome. The features with the highest impact were: river length, basin area, hydraulic discharge, basin characteristics parameter (rainwater), receiver w.b. dimension (agricultural), discharge velocity, substances extractable with petroleum (municipal), sediment discharge, runoff and concentration of mercury (industrial), as presented in Fig. 5. Notably, all of the features with the highest importance oscillate around the level of 0.04–0.06. The highest importance noted was 5.8% for the length of the w.b., followed by 5.5% for the area of the river basin. These findings are consistent with the PCA results, revealing that the majority of features exhibit low individual importance. It must be highlighted that the nature of the dataset suggests a complex relationship between groups of features. Notably, the major factors were connected to river hydromorphology, as seen in Fig. 5. Most of the important features are connected either to river geomorphology or to hydraulic parameters of the wastewater discharges rather than to the concentrations or loads themselves.
Fig. 5. Feature importance results for the 10 most important features.
The high number of features and the results of the feature importance analysis may raise a concern about model overfitting. During model preparation, cross-validation techniques were used to monitor performance on the training and testing sets. The difference between training and validation performance was about 8%, and as such it does not raise significant concerns about overfitting. The model is expected to perform well on unseen data, and further feature elimination risks discarding valuable information that contributes to its predictive performance.
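A train-versus-validation comparison of the kind described above can be sketched with a simple k-fold loop. In the example below a hand-written k-nearest-neighbours classifier stands in for the study's tuned models, and the data are synthetic; all names and parameter values are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (Euclidean distance).
    Labels must be non-negative integers."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]                   # shape (n_test, k)
    return np.array([np.bincount(v).argmax() for v in votes])

def train_validation_accuracy(X, y, n_folds=5, k=5, seed=0):
    """Mean training and validation accuracy over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    train_acc, val_acc = [], []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        train_acc.append(np.mean(knn_predict(X_tr, y_tr, X_tr, k) == y_tr))
        val_acc.append(np.mean(knn_predict(X_tr, y_tr, X[val_idx], k) == y[val_idx]))
    return float(np.mean(train_acc)), float(np.mean(val_acc))

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(3.0, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

train_acc, val_acc = train_validation_accuracy(X, y)
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

A large positive gap between the two figures would point towards overfitting, whereas a small gap suggests that performance carries over to data not seen during training.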
Conclusions
The presented results demonstrate the successful application of artificial intelligence in assessing the quality of surface waters not covered by any monitoring system. The binary classification variant, with at least good and below good statuses, is characterized by a significantly higher resultant accuracy than the other variants. This suggests that binary classification may be the most effective for this specific problem. The achieved results support the novel approach of utilizing ML techniques to assess unmonitored w.b., an approach which can be tested for accuracy and other statistical measures and can be used universally for all Polish flowing waters.
As previously stated, the two-class variant of the model has the greatest significance for reporting water status in fulfilment of the WFD’s environmental objectives. A result indicating failure to reach the goal of good status may, in the future, lead to penalties being imposed on the member state. Since the accuracies of both the at least good and the below good classes are exceptionally high, there is only a limited chance that an assessment based on monitoring results, if available, would contradict the outcomes of this model. This creates a clear and decisive tool that can be utilized for water bodies that were omitted from the monitoring programme due to challenging terrain, time or budgetary limitations. In other words, the classification resulting from the model can be trusted by water managers when deciding on necessary remediation actions in catchments of water bodies assessed as below good status.
All variants have shown that the XGBoost and Random Forest algorithms excelled in identifying the good status, which could be crucial in water quality protection, where certainty in detecting the good state is essential. For algorithms that performed better in classifying a greater number of states, recognizing the moderate state proves to be the most challenging. This suggests that there are certain subtle features or variables that make its correct classification difficult.
The use of machine learning and predictive models for assessing water quality status in terms of compliance with the requirements of the Water Framework Directive can significantly facilitate the evaluation of the quality of unmonitored waters in Poland by providing a cost-effective method that achieves 93% accuracy and can undergo statistical evaluation. This method yields PoM at the level of 6–9%. The application of a formal technique for the evaluation of water quality status, represented by machine learning algorithms, is a methodologically sound approach that offers distinct advantages over traditional assessments based on expert opinion. The variability inherent in subjective assessments, which may diverge between individual experts, is mitigated by rigorous computational techniques. This approach enhances the reliability and consistency of water quality assessment and yields significant efficiencies by circumventing the logistical complexities associated with classical methodologies.
The four-class variant yields notably lower performance, although its primary applicability lies in operational usage and not in WFD reporting. More precise reasoning about the advisability of particular water management decisions should be derived from the water quality indicators examined within operational monitoring rather than from the overall status statement; the reporting duty concerning water status, however, can be based on the presented classification algorithm. The two-class variant should be used as a pivotal tool for discerning adherence to the WFD and specifically for gauging compliance with prescribed standards. Complementary to this, the four-class variant serves as a separate tool, enabling the differentiation of below good water quality states into bad, poor and moderate status. Despite the challenges in accurately classifying the moderate status, the model retains utility as an initial screening mechanism, guiding the delineation of priority areas that require remedial interventions. Thus, while acknowledging limitations in precision, both model variants can be integrated and used as a joint package for informed decision-making and targeted resource allocation in water management.
For ecological status assessment based on monitoring data, a methodology for calculating the probability of misclassification was elaborated and applied by Loga3. A probability of misclassification within the range (50–100%] for good ecological status was found for 40–50% of river water bodies, whereas a probability of misclassification in the same range for moderate status characterizes even more, 50–60%, of river water bodies, which to some extent confirms that the moderate status is the most challenging to classify correctly.
It is clear that even full information concerning all biological and physicochemical quality elements cannot lead to a water status assessment of high certainty. The levels of probability of misclassification for monitored water bodies presented in the quoted paper create a solid base for assessing goodness of fit for models of classification for unmonitored water bodies.
When analysing the rather low PoM values, one must remember that this classification algorithm was based on water status assessments that are themselves biased by the uncertainty of classes derived from measurements. As in all predictive models, the quality of the results depends strongly on the input data. Accordingly, the acquisition of viable input data was critical to achieving reliable results in this study. Given additional data, it is plausible that the quality of results will improve.
It can be concluded that the reported study shows the high efficiency of ML algorithms for classifying unmonitored water bodies, together with accuracy and probability of misclassification used as goodness of classification measures. Given the positive results achieved, a future study should examine the possible usage of the described methods regarding lakes and transitional waters.
Data availability
All monitoring data used in this study are available via https://www.gov.pl/web/gios. Data on pressures are available on request via https://www.gov.pl/web/wody-polskie.
References
Guidance No 11 - Planning Process (WG 2.9).
Monitoring i ocena jednolitych części wód powierzchniowych rzecznych. Rzeki - System monitoringu i klasyfikacji wód - Portal jakości wód powierzchniowych. https://wody.gios.gov.pl/pjwp/publication/RIVERS/88
Loga, M. & Wierzchołowska-Dziedzic, A. Probability of misclassifying biological elements in surface waters. Environ. Monit. Assess. 189, 647 (2017).
Rozporządzenie Ministra Środowiska z dnia 21. Lipca 2016 r. w sprawie sposobu klasyfikacji stanu jednolitych części wód powierzchniowych oraz środowiskowych norm jakości dla substancji priorytetowych. https://isap.sejm.gov.pl/isap.nsf/DocDetails.xsp?id=wdu20160001187
Nasir, N. et al. Water quality classification using machine learning algorithms. J. Water Process. Eng. 48, 102920 (2022).
Gupta, S. & Gupta, S. K. A critical review on water quality index tool: Genesis, evolution and future directions. Ecol. Inf. 63, 101299 (2021).
Krtolica, I., Savić, D., Bajić, B. & Radulović, S. Machine learning for water quality assessment based on macrophyte presence. Sustainability 15, 522 (2023).
Lowe, M., Qin, R. & Mao, X. A review on machine learning, artificial intelligence, and smart technology in water treatment and monitoring. Water 14, 1384 (2022).
Moghadam, S. H., Ashofteh, P. S. & Loáiciga, H. A. Investigating the performance of data mining, lumped, and distributed models in runoff projected under climate change. J. Hydrol. 617, 128992 (2023).
Mehra, M., Saxena, S., Sankaranarayanan, S., Tom, R. & Veeramanikandan, M. IoT based hydroponics system using deep neural networks. Comput. Electron. Agric. 155, 473–486 (2018).
Béjaoui, B. et al. Machine learning predictions of trophic status indicators and plankton dynamic in coastal lagoons. Ecol. Indic. 95, 765–774 (2018).
Najafzadeh, M., Ahmadi-Rad, E. S. & Gebler, D. Ecological states of watercourses regarding water quality parameters and hydromorphological parameters: deriving empirical equations by machine learning models. Stoch. Environ. Res. Risk Assess. 38, 665–688 (2024).
Arrighi, C. & Castelli, F. Prediction of ecological status of surface water bodies with supervised machine learning classifiers. Sci. Total Environ. 857, 159655 (2023).
Gebler, D., Kolada, A., Pasztaleniec, A. & Szoszkiewicz, K. Modelling of ecological status of Polish lakes using deep learning techniques. Environ. Sci. Pollut. Res. 28, 5383–5397 (2021).
Chernoff, K. & Nielsen, M. Weighting of the k-Nearest-neighbors, 666–669. https://doi.org/10.1109/ICPR.2010.168 (2010).
Memiş, S. Determining Water Quality using picture fuzzy soft kNN(PFS-kNN) and fuzzy parameterized fuzzy soft kNN (FPFS-kNN) (2023).
Water Quality Prediction Using KNN Imputer and Multilayer Perceptron. https://www.researchgate.net/publication/362894874_Water_Quality_Prediction_Using_KNN_Imputer_and_Multilayer_Perceptron
Azadi, F., Ashofteh, P. S., Shokri, A. & Loáiciga, H. A. Development of the FA-KNN hybrid algorithm and its application to reservoir operation. Theor. Appl. Climatol. 155, 1261–1280 (2024).
Khorsandi, M., Ashofteh, P. S., Azadi, F. & Chu, X. Multi-objective Firefly integration with the K-Nearest neighbor to reduce Simulation Model calls to accelerate the optimal operation of Multi-objective reservoirs. Water Resour. Manag. 36, 3283–3304 (2022).
Abu El-Magd, S. A., Ismael, I. S., El-Sabri, M. A. Sh., Abdo, M. S. & Farhat, H. I. Integrated machine learning–based model and WQI for groundwater quality assessment: ML, geospatial, and hydro-index approaches. Environ. Sci. Pollut. Res. 30, 53862–53875 (2023).
Bozorg-Haddad, O., Aboutalebi, M., Ashofteh, P. S. & Loáiciga, H. A. Real-time reservoir operation using data mining techniques. Environ. Monit. Assess. 190, 1–22 (2018).
Metsis, V., Androutsopoulos, I. & Paliouras, G. Spam Filtering with Naive Bayes - Which Naive Bayes? (2006).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Sattari, M., Naebzad, M. & Mirabbasi, R. Surface water quality prediction using decision tree method. Iran. Irrig. Water Eng. 4, 76–88 (2014).
Jena, P., Rahaman, S., Mohapatra, P., Barik, D. & Surabhi, D. Surface water quality assessment by Random Forest. Water Pract. Technol. 18, 201–214 (2022).
Habib, M. A., Abolfathi, S., O’Sullivan, J. J. & Salauddin, M. Efficient data-driven machine learning models for scour depth predictions at sloping sea defences. Front. Built. Environ. 10, 1343398 (2024).
Habib, M. A., O’Sullivan, J. J., Abolfathi, S. & Salauddin, M. Enhanced wave overtopping simulation at vertical breakwaters using machine learning algorithms. PLoS One 18, e0289318 (2023).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785 (2016).
Catajan, A. Jr, Fajardo, A. & Limbago, J. Classification of Water Quality Index in Laguna De Bay using XGBoost, 403–408. https://doi.org/10.1109/JCSSE58229.2023.10202029 (2023).
Home — Spyder IDE. https://www.spyder-ide.org/
pandas documentation. — pandas 2.2.0 documentation. https://pandas.pydata.org/docs/
NumPy Documentation. https://numpy.org/doc/
scikit-learn: machine learning in Python — scikit-learn 1.4.0 documentation. https://scikit-learn.org/stable/
SMOTE: Synthetic Minority Over-sampling Technique | Journal of Artificial Intelligence Research. https://www.jair.org/index.php/jair/article/view/10302
Lavanya, A. et al. Assessing the performance of Python data visualization libraries: a review. Int. J. Comput. Eng. Res. Trends 10, 28–39 (2023).
Funding
The research described in the article was funded by Grant no. 11/2023 in the discipline of Environmental Engineering, Mining and Energy.
Author information
Contributions
KP and ML were responsible for the conception. AM performed all computations, was responsible for data curation and wrote the first draft of the manuscript. ML supervised the work. KP supported the computations. All authors revised and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Martyszunis, A., Loga, M. & Przeździecki, K. Using machine learning for the assessment of ecological status of unmonitored waters in Poland. Sci Rep 14, 24509 (2024). https://doi.org/10.1038/s41598-024-74511-4
DOI: https://doi.org/10.1038/s41598-024-74511-4







