Predicting suspended sediment load in Peninsular Malaysia using support vector machine and deep learning algorithms

Essam, Yusuf; Huang, Yuk Feng; Birima, Ahmed H.; Ahmed, Ali Najah; El-Shafie, Ahmed

doi:10.1038/s41598-021-04419-w


Published: 07 January 2022

Scientific Reports volume 12, Article number: 302 (2022)

Abstract

High loads of suspended sediments in rivers are known to have detrimental effects on potable water sources, river water quality, irrigation activities, and dam or reservoir operations. For this reason, the study of suspended sediment load (SSL) prediction is important for monitoring and damage mitigation purposes. The present study develops and tests machine learning (ML) models, based on the support vector machine (SVM), artificial neural network (ANN) and long short-term memory (LSTM) algorithms, to predict SSL based on 11 different river data sets comprising streamflow (SF) and SSL data obtained from the Malaysian Department of Irrigation and Drainage. The main objective of the present study is to propose a single model that is capable of accurately predicting SSLs for any river data set within Peninsular Malaysia. The ANN3 model, based on the ANN algorithm and input scenario 3 (inputs consisting of current-day SF, previous-day SF, and previous-day SSL), is determined to be the best model in the present study, as it produced the best predictive performance for 5 of the 11 tested data sets and obtained the highest average ranking mean (RM), with a score of 2.64, among the tested models. This indicates that it is the most reliable at producing relatively high-accuracy SSL predictions across different data sets. Therefore, the ANN3 model is proposed as a universal model for the prediction of SSL within Peninsular Malaysia.

Introduction

The background of the present study is first described in this section. This is followed by descriptions of the literature review, research gap, and contributions of the present study.

Background

The conservation of river water quality is important for human civilization, as river water often serves as a source of potable water and is also used for irrigation in many regions, including Peninsular Malaysia^1,2,3,4. High suspended sediment loads (SSLs), which essentially comprise tiny clay, silt, and sand particles, are known to have detrimental effects on the quality of river water, as the sediments may act as transport media for pollutants and bacteria^5,6. The pollutants include phosphorus and heavy metals, namely zinc, mercury, and manganese. High SSLs also affect river ecosystems by reducing the survivability of aquatic plants, as less sunlight is able to penetrate the river water and be utilised for photosynthesis.

History shows many instances of pollution and disasters caused by unmonitored or unregulated SSL in Peninsular Malaysia and around the globe. In 2016, Malaysia’s Natural Resources and Environment Minister reported that a major Malaysian river recorded a turbidity of 6000 Nephelometric Turbidity Units (NTU), indicating a very high concentration of suspended sediments and poor water quality. In 2021, Sungai Pinang was reported to be polluted with sediments consisting of broken-down organic matter, which gave the river a black appearance. This sediment-based pollution was the source of a foul stench affecting a nearby food court and condominium in the vicinity of Karpal Singh Drive. Also in 2021, 305 lakes and rivers in Minnesota, United States were listed as too polluted to meet the required standards. Among the causes of high pollution were high sediment concentrations, which harmed fish as they struggled to find food due to high bacteria environments and algal blooms caused by eutrophication. Toxic algal blooms caused by sediments rich in nutrients such as phosphorus were also reported in 2018 at the St. Lucie River, Florida, United States, causing respiratory problems as well as irritation in the eyes and noses of the locals.

Increased SSLs may also affect dam and reservoir operations^1,7. Dam inlets and channels can be obstructed by suspended sediments, while reservoir capacity may be reduced by the settling of suspended sediments in the relatively slow-moving water in the reservoir vicinity. Therefore, the ability to foresee the SSL within a particular river through predictions is especially important as a means to preserve the quality and supply of river water resources; to minimize or mitigate damage to the environment and to hydrological structures, namely dams and reservoirs; and to ensure the healthy continuity of hydrology-related activities such as irrigation^4,8.

Literature review

Traditionally, the sediment rating curve (SRC), which is a fitted relationship between suspended sediment concentration and river water discharge, has been utilised to assess trends and obtain predictions of SSLs, despite having long response times and requiring a lot of information. However, a branch of artificial intelligence known as machine learning (ML) has been shown to effectively address these issues⁵ while producing more accurate SSL predictions compared to SRCs^{1,2,3,9,10,11}. ML and deep learning, which is a more specialized version of ML typically consisting of neural networks, have also been used to solve important prediction problems within various fields. ML algorithms such as the decision tree (DT), random forest (RF), and support vector machine (SVM) have been used for short-term water quality prediction to improve water management and pollution control, maize crop-yield prediction, and the prediction of earnings from blockchain financial products to reduce investors' concern about the risks and returns of blockchain-based financial products^12,13,14. Deep learning algorithms such as the artificial neural network (ANN), long short-term memory (LSTM), and gated recurrent unit (GRU) have lately been utilized to solve relatively more complex problems, such as the prediction of points-of-interest for purposes such as monitoring and maintaining public health following the outbreak of coronavirus disease (COVID-19), the prediction of greenhouse climate to ensure crop growth stability, and the prediction of health data with privacy preservation to combat the issue of missing data due to healthcare equipment failure and system updates^15,16,17,18. In recent years, the ANN and SVM algorithms have been shown to be among the most established and effective algorithms for the prediction of SSLs, as shown by numerous existing studies^{3,6,11,19,20,21,22,23,24,25,26,27,28,29,30}.
Beyond the ANN and SVM, other algorithms have also been studied for the purpose of SSL prediction. Meshram et al.⁹ studied the iterative classifier optimizer-based pace regression (ICO-PR) and iterative classifier optimizer-based random forest (ICO-RF) for SSL prediction in the Seonath River basin, India. It was shown that the ICO-RF is more accurate than the ICO-PR and the stand-alone PR and RF models. The study by Samadianfard et al.³¹ hybridized RF and multi-layer perceptron (MLP) with genetic algorithm (GA) and stochastic gradient descent (SGD) to produce four suspended sediment concentration (SSC) predictive algorithms, namely GA-RF, GA-MLP, SGD-RF, and SGD-MLP. These algorithms were tested using data from the Minnesota and San Joaquin rivers, and it was determined that the GA-RF and GA-MLP models performed the best in predicting SSC for the Minnesota River, while the SGD-RF and SGD-MLP models were the most accurate for the San Joaquin River. Shadkani et al.³² used MLP, MLP-SGD, and gradient boosted tree (GBT) to predict SSL for the St. Louis and Chester stations along the Mississippi River, United States. It was found that the SGD optimization on the MLP resulted in more accurate SSL predictions, hence SGD-MLP was put forward as the most accurate model for SSL prediction. Hazarika et al.⁵ applied the coiflet wavelet-based large margin distribution machine-based regression (LDMR) and coiflet wavelet-based large margin distribution machine-based extreme learning machine (ELM) to predict SSL in the Tawang Chu River, India. The study showed that the two coiflet wavelet-based models produced better predictions compared to other tested models based on twin support vector regression (SVR), stand-alone LDMR, and stand-alone ELM. AlDahoul et al.² studied the application of LSTM in predicting SSL at the Johor River basin, Malaysia.
It was demonstrated that LSTM is capable of outperforming several other ML algorithms, namely elastic net linear regression (ENLR), ANN, and extreme gradient boosting (XGB). The prediction of SSL using LSTM was also investigated in the study by Nourani and Behfar³³, in which it was found that the LSTM-based models were superior to classical feed-forward neural networks in predicting SSL at the Mississippi River. The adaptive neuro-fuzzy inference system (ANFIS) was trialled with different membership functions to predict SSL for the Cumberland River, United States in the study by Babanezhad et al.³⁴. ANFIS with the trimf membership function was found to produce the best predictive performance among the tested models, including the ant colony optimization-based fuzzy inference system (ACOFIS). ANFIS was also hybridized with the bat algorithm (ANFIS-BA) in the study by Ehteram et al.³⁵, in which it was found that ANFIS-BA was more reliable for SSL prediction in the Atrek River, Iran compared to the other tested models, namely ANFIS hybridized with the whale algorithm (ANFIS-WA) and multi-feedforward neural network (MFNN) models hybridized with the BA and WA algorithms (MFNN-BA and MFNN-WA). The study by Azamathulla et al.³⁶ applied gene expression programming (GEP)-based models to predict SSLs in the Muda River, Langat River, and Kurau River in Malaysia. The GEP-based model was found to produce better predictive performance than the other tested models, namely ANFIS and a benchmark regression model. The dynamic evolving neural fuzzy inference system (DENFIS) was studied by Adnan et al.³⁷ for the prediction of SSL at two locations within China, namely Guangyuan and Beibei. DENFIS was shown to have a higher predictive accuracy compared to the other two models tested, which are ANFIS with fuzzy c-means clustering (ANFIS-FCM) and multivariate adaptive regression splines (MARS).
However, in the study by Yilmaz et al.¹, MARS was found to be capable of predicting SSL for the Çoruh River basin with the lowest error, compared to models based on the artificial bee colony (ABC) and teaching–learning based optimization. Tao et al.⁸ applied the radial basis M5Tree (RM5Tree) to predict SSL for the Trenton hydrological station on the Delaware River, United States⁸. Results of the study showed that the RM5Tree model produced predictions with enhanced accuracy and outperformed the other tested models based on the response surface method (RSM), ANN and the classical M5Tree. Using the same data set applied in the study by Tao et al.⁸, Salih et al.⁷ used M5P, attribute classifier M5P (AS-M5P), M5Rule (M5R) and K Star (KS) models to predict SSL⁷. Different input scenarios of streamflow (SF) and SSL were used in this study, in which it was found that M5P was superior among the tested models. A hybrid version of the M5P, named bagging-M5P, was utilized by Khosravi et al.³⁸ for SSL prediction in the Estero Morales River, Chile. The study showed bagging-M5P to be superior to the classical M5P, reduced error pruning tree (REPT), instance-based learning (IBK), and hybridized versions of the REPT model. Tabatabaei et al.¹⁰ predicted SSL using data from the Ramian hydrological station on the Ghorichay River, Iran by utilizing an SRC model optimized with the non-dominated sorting genetic algorithm II (NSGA-II), which increased prediction efficiency. In the study by Uca et al.⁴, multiple linear regression (MLR) and ANN were tested to predict SSL for the Jenderam catchment, Malaysia. The results demonstrated the capability of MLR in outperforming ANN with regard to SSL prediction accuracy.
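As a point of reference for the sediment rating curve described at the start of this review, the following is a minimal sketch of how such a fitted relationship might be obtained. It assumes the commonly used power-law form C = a·Q^b (concentration C, discharge Q), which is not stated in the text above, and fits it by least squares on log-transformed values. The function names and the synthetic numbers are illustrative only and are not taken from the study.

```python
import math

def fit_rating_curve(discharge, concentration):
    """Fit C = a * Q**b by ordinary least squares in log-log space.

    The power-law form is an assumption made for this sketch; other
    rating curve forms exist.
    """
    xs = [math.log(q) for q in discharge]
    ys = [math.log(c) for c in concentration]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back to linear space
    return a, b

def predict_concentration(a, b, discharge):
    """Evaluate the fitted curve for a list of discharge values."""
    return [a * q ** b for q in discharge]

# Synthetic, noise-free values for illustration (not data from the study).
q_obs = [5.0, 10.0, 20.0, 40.0, 80.0]
c_obs = [2.0 * q ** 1.5 for q in q_obs]
a_fit, b_fit = fit_rating_curve(q_obs, c_obs)
c_pred = predict_concentration(a_fit, b_fit, q_obs)
```

Because the synthetic data follow the assumed power law exactly, the fit recovers the generating coefficients; real records are noisy, which is one reason the ML approaches reviewed above are compared against SRCs.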

Research gap

A limitation present in the majority of the aforementioned studies on SSL prediction is that most have focused on utilizing ML algorithms to develop predictive models for only one hydrological station or river, which means the models were developed based on a single data set. As the magnitude and behaviour of SSLs differ from river to river, the suitability of certain ML algorithms for the task of SSL prediction may vary. Certain ML algorithms may be suitable and produce good SSL predictions for a hydrological station at a particular river but may not perform well in predicting SSLs for a different river, due to variance in anthropogenic and natural factors. In the case study of Peninsular Malaysia, existing studies have utilized ML algorithms, particularly ANN, MLR, LSTM, and GEP, to develop SSL predictive models^{2,4,27,28,29,36}. Apart from the study by Azamathulla et al.³⁶, all studies on SSL prediction within Peninsular Malaysia have focused on developing ML models solely based on data sets from single hydrological stations located in rivers such as Sungai Johor, Johor; Sungai Pari, Perak; Sungai Langat, Selangor; and the Jenderam catchment, Selangor. This creates a noteworthy research gap for the Peninsular Malaysia case study, as it is unknown whether there is a model or algorithm that is capable of producing accurate SSL predictions for multiple different rivers within the region. The present study contributes towards addressing this research gap through the development of predictive models for SSLs based on time series data sets of SF and SSLs from hydrological stations located along 11 different rivers throughout Peninsular Malaysia. The two algorithms that are well established in the existing literature within the current field, namely SVM and ANN, were selected for utilization in the present study.
In addition, LSTM was chosen for the development of predictive models as it has recently been documented to predict SSLs accurately^2,33, while also performing well in other fields relating to flood forecasting, wind turbine fault diagnosis, rainfall-runoff modelling, building energy consumption forecasting, and drought forecasting^{39,40,41,42,43}.

Contributions

The present study was motivated by the aforementioned cases of SSL pollution in Malaysian and American rivers, such as Sungai Pinang and the St. Lucie River. Early anticipation and mitigation measures through the application of ML models could have played a significant role in reducing damage to the local people and natural habitat. As there are many novel and advanced SSL-predicting models being developed in different study regions and demonstrated in scientific literature, practical adoption of ML predictive models for real-life application at hydrological stations may not be straightforward, due to the uncertainty of whether a selected ML model is able to replicate its good performance for different rivers with varying SSL behaviour and magnitude caused by different anthropogenic and natural factors. Therefore, the scientific novelty of the present study is the selection and proposal of a single predictive model that is capable of producing SSL predictions of good accuracy for different rivers throughout Peninsular Malaysia. The major contribution of the present study is the testing and development of predictive ML models based on 3 different ML algorithms for hydrological stations on 11 different rivers throughout Peninsular Malaysia, in order to determine and propose a single ML model that is capable of predicting SSLs with high accuracy for multiple different rivers. Using time series data of SF and SSL for each river, SVM, ANN, and LSTM are tested to predict SSLs for each river using four different input scenarios. The performance of each model is evaluated using selected performance evaluation measures, namely mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R²) and ranking mean (RM). The ML model that produces the best SSL predictions for the most rivers and obtains the best average RM is then proposed as a universal model that may be used for any specific case study within Peninsular Malaysia.
The findings obtained in the present study may mainly be of interest to hydrological organizations looking for suitable or proven ML models for practical application within Peninsular Malaysia, as the models have been developed and tested using 11 different river data sets within the selected region. However, audiences from abroad may also take interest in the findings of the present study, as the proposed SSL predictive model may possibly produce accurate SSL predictions for case studies in other regions around the world as well. The method of selecting the best SSL predictive ML model in the present study, which is by using performance evaluation measures to determine the model that produces the best SSL predictions for the most rivers and obtains the best average RM, may also be a point of interest for a wider audience regardless of geographic location. The rest of the present study is organized as follows: Section 2 describes the materials and methods used to carry out the present study. Section 3 reports and discusses the results of the present study. Section 4 concludes the overall study.
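To make the evaluation set-up concrete, the following is a minimal sketch of how input scenario 3 (current-day SF, previous-day SF, and previous-day SSL, as given in the abstract) could be assembled from daily series, and how MAE, RMSE and R² could be computed. The function names and sample values are hypothetical. The R² shown uses the common 1 − SS_res/SS_tot definition, which is an assumption, since the study's own formula is not reproduced in this part of the text. The ranking mean (RM) is not sketched because its definition is not given here.

```python
import math

def build_scenario3(sf, ssl):
    """Build rows [SF(t), SF(t-1), SSL(t-1)] with target SSL(t).

    The first day is dropped because it has no previous-day values.
    """
    inputs, targets = [], []
    for t in range(1, len(sf)):
        inputs.append([sf[t], sf[t - 1], ssl[t - 1]])
        targets.append(ssl[t])
    return inputs, targets

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily values for illustration (not data from the study).
sf_series = [10.0, 12.0, 15.0, 11.0]
ssl_series = [100.0, 130.0, 180.0, 120.0]
features, targets = build_scenario3(sf_series, ssl_series)

# Naive baseline for demonstration only: predict today's SSL as yesterday's SSL.
baseline_pred = [row[2] for row in features]
baseline_mae = mae(targets, baseline_pred)
baseline_rmse = rmse(targets, baseline_pred)
baseline_r2 = r_squared(targets, baseline_pred)
```

In practice the feature rows would be passed to an SVM, ANN, or LSTM model rather than the naive baseline used here, which only serves to exercise the metric functions.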

Materials and methods

In this section, the materials and methods employed in the process of predicting SSL for the 11 selected rivers within Peninsular Malaysia are explained. Important information regarding the location and data of the case study, the model development process, ML algorithms, data pre-processing, and performance evaluation measures is described.

Location and data of case study

Peninsular Malaysia represents the western region of Malaysia, comprising 11 states and 2 federal territories. It encompasses a total area of 132,265 km², which is about 40% of the total area of Malaysia, and is located just north of the equator. Peninsular Malaysia has approximately 1235 river basins⁴⁴, with Sungai Pahang representing the longest river in the region at 459 km in length. In the present study, raw data in the form of daily average SF and daily total SSL were obtained from the Water Resources Management and Hydrology Division of the Malaysian Department of Irrigation and Drainage for different rivers within 11 states in Peninsular Malaysia. Based on the volume and continuity of the available data, and the relevance of the rivers to their respective states, one river is selected per state for the purpose of the present study. Information on the selected rivers for each state, the identification and location of the hydrological measuring stations, and the duration of data provided by the respective station for each selected river is shown in Table 1.

Table 1 Information on selected rivers’ data for each state.
