Evaluation of water quality based on a machine learning algorithm and water quality index for the Ebinur Lake Watershed, China

The water quality index (WQI) has been used to identify threats to water quality and to support better water resource management. This study combines a machine learning algorithm, the WQI, and remote sensing spectral indices (difference index, DI; ratio index, RI; and normalized difference index, NDI) through fractional derivative methods and in turn establishes a model for estimating and assessing the WQI. The results show that the calculated WQI values range between 56.61 and 2,886.52. We also explore the relationship between reflectance data and the WQI. The number of bands with correlation coefficients passing a significance test at the 0.01 level first increases and then decreases with derivative order, with a peak at the 1.6 order. The correlation coefficients between the WQI and the optimal band combinations of the RI, DI and NDI also peak at the 1.6 order, with R² values of 0.92, 0.58 and 0.92, respectively. Finally, 22 WQI estimation models were established by PSO-SVR to compare their predictive performance. The models based on a spectral index of order 1.6 were found to perform much better than the others, with an R² of 0.92, an RMSE of 58.4, an RPD of 2.81 and a curve-fitting slope of 0.97.


Results and Analysis
Statistical analysis of the water quality index. A summary of water quality observations for surface water of the Ebinur Lake Watershed, covering the Boertala River, the Jing River, the Akeqisu-Kuitun River (A-KR) and artificial reservoirs (RES) in October 2016, is presented in Table 1. Across the different water quality levels, pH varied from 7.62 to 8.46, spanning one order of magnitude, with a mean value of 7.97 and a coefficient of variation of 12.29%. TDS concentrations also varied considerably, from 81.4 mg/L to 9470 mg/L, with a mean value of 728.88 mg/L and a coefficient of variation of 19.2%. TDS values of the Ebinur Lake Watershed were found to be lower and strongly variable, most likely because upstream reservoirs of the Boertala and Jing Rivers serve as settling basins. Ca levels of the four rivers were found to be similar and to range from low to moderate (42. […] SO₄²⁻ also varied considerably, from 4.8 mg/L to 8424 mg/L, with a mean value of 961.88 mg/L and a coefficient of variation of 172.28%. SO₄²⁻ levels in the Ebinur Lake Watershed were found to be lower and strongly variable, most likely due to the presence of the Boertala and Jing River reservoirs upstream, which serve as settling basins. PO₄³⁻ was found to vary considerably, from 0 to 1.7 mg/L, spanning one order of magnitude, with a mean value of 0.2237 mg/L and a high coefficient of variation of 153.57%. Cr was found to vary considerably, from 0.01 to 0.16 mg/L, thus spanning one order of magnitude, with a mean value of 0.0283 mg/L and a high coefficient of variation of 102.47%. In short, the water quality index changes considerably in this watershed while pH, DO and TDS values change less. Water quality levels thus vary considerably in the watershed.
Assessment of water quality based on the WQI. In this study, the quality of the Ebinur Lake Watershed surface water was evaluated using the WQI method. The WQI was calculated from selected water quality parameters (pH, HCO₃⁻, […], SO₄²⁻, Na, Ca, Mg, COD, PO₄³⁻ and Cr), and a WQI map for the river was prepared using Geographic Information System (GIS) techniques. The results are presented in Fig. 4 and Table 2.
Spatially, water quality index (WQI) levels are high for most areas of the Boertala River downstream from Ebinur Lake (Fig. 1) and fall in category V. This water is unsuitable for drinking. The highest value of 438 is observed for the Kuitun River. As this water body is located in the town of Tuotuo, the effects of human factors are severe, and water quality levels in this river are poor. In other words, as water quality worsens, the WQI increases. The best water quality in the Ebinur Lake Watershed is found in the upper reaches of the Boertala River. Its WQI value is less than 100 (grade I water quality), and the water is suitable for drinking. Poor water quality is observed in the midstream reaches of the Boertala River in Wenquan County, where the effects of human factors are severe and have produced abrupt changes and anomalies in the water quality index. From an ecological perspective, the ecological environment of Ebinur Lake is the worst in the watershed. Rivers originate from mountains surrounding the watershed, where the ecological environment is superior to that of Ebinur Lake.
Hyperspectral characteristics of surface water. On the basis of the river areas described above, the 48 water samples were classified into 5 categories, and the spectral plots of each category were averaged to give a representative spectral curve for that water quality level (Fig. 2a). The five spectral plots have similar shapes, with two pronounced absorption features located at approximately 700 and 950 nm. Of the five categories, sample site 31 exhibited the lowest reflectance and a location slightly downstream exhibited the highest reflectance. Sample site 21 presented the highest reflectance value. This sample site is located in the downstream area of the river (into the lake). For each class, an average spectrum was calculated (Fig. 2a), and the plots show reflectance curves with two deep absorption regions at 750 and 980 nm and several weak absorption regions at approximately 452 nm, 703 nm, and 850 nm. Differences in water quality were easy to identify at roughly 700-720 nm and at the 1,070 nm peak. Average and standard deviation values are shown in Fig. 2(b), with no outliers and a normal distribution.
Table 2. Assessment of water quality using the WQI.
Correlations between the water quality index and spectra. Sensitive wave band selection is central to constructing a water quality index (WQI) estimation model, and correlation coefficients between the WQI and single-band spectral reflectance are usually used to identify the sensitive wave bands. All correlation coefficients between the WQI and reflectance data treated by fractional derivatives (orders 0 to 2.0 in steps of 0.2, where order 0 is the raw reflectance) were tested at a significance level of 0.01 (|r| = 0.24 or above). The correlation curves for these 11 orders are plotted in Fig. 3. For the raw reflectance data, 45 bands passed the significance test at 0.01. As the order of the derivative increases, correlation coefficients exceed the 0.01 significance threshold in some wavelength ranges, while other bands do not pass the test. In addition, as the order increases from 1.0 to 2.0, more bands pass the significance test at 0.01. When the order reaches 1.6, the correlation coefficient reaches 0.68 at 1,368 nm. On the whole, the curves fluctuate greatly, and it is not clear from Fig. 3 how many bands passed the significance test at each order. The number of passing bands was therefore counted for the raw reflectance and for each fractional derivative order, and the corresponding trend lines and relationships with the WQI are shown in Fig. 4.
For these 11 mathematical forms of reflectance, different numbers of bands passed the significance test. With an increase in derivative order, the numbers first decreased and then increased, reaching a minimum at the 1.0 order and a maximum at the 1.6 order.

Relationships between the water quality index (WQI) and the spectral indices. Contour maps of r values between the WQI and the two-band spectral indices (DI, NDI and RI) are shown in Fig. 5. Strong correlations between the DI, NDI and RI and the WQI are largely found in the visible and near-infrared ranges (Fig. 5). While the performance of the three spectral indices as predictors of the WQI appears to vary by wavelength, consistent patterns emerge. Wavelength combinations in the 350-1100 nm region of the R² maps (Fig. 5) show a significant correlation between the RI and the WQI.
Band combinations (DI, RI and NDI) of the raw reflectance and of reflectance treated by fractional derivatives that correlate strongly with the WQI were mainly concentrated in two zones (Fig. 5). The sensitivity regions of the ratio index (RI) and the normalized difference index (NDI) were found to be nearly consistent. However, the sensitivity zones of the difference index (DI) were found to differ. For the RI, the best wavelength combinations gave R² values ranging from 0.40 to 0.92 (Table 3). The weakest correlation was found for the raw reflectance combination R₈₈₃/R₉₃₄, and the strongest for reflectance treated by the 1.6 order derivative, in the R₆₀₀ to R₉₀₀ range. For the DI, the best wavelength combinations gave R² values ranging from 0.497 to 0.585 (Table 3). The weakest correlation was found for the raw reflectance combination R₅₈₃ − R₈₄₄, and the strongest for reflectance treated by the 1.6 order derivative, in the R₅₀₀ to R₉₀₀ range. For the NDI, the best wavelength combinations gave R² values ranging from 0.764 to 0.914 (Table 3). The weakest correlation was found for reflectance treated by the 0.2 order derivative with the combination (R₅₂₀ − R₇₆₀)/(R₅₂₀ + R₇₆₀), and the strongest for reflectance treated by the 1.6 order derivative in the R₄₅₂ and R₇₀₃ zones. The raw observations show several weak absorption regions close to 452 and 703 nm, and the NDI band combinations in the R₄₅₂ and R₇₀₃ zones have the highest correlation coefficients. The spectral absorption valleys are therefore central to the study of water quality sensitivity. In addition, the reflectance at 964 nm lies in the most important region of the sensitive bands.
This analysis reveals strong correlations between the DI, RI and NDI and the water quality index. The strongest correlations are reported as r values in Table 3.

Particle swarm optimization (PSO)-support vector regression model. Establishing a WQI estimation model based on a support vector regression model. MATLAB 2014a was used to design a particle swarm optimization (PSO) support vector regression model. Hyperspectral parameters of the sensitive wave bands, the spectral indices and the water quality index (WQI) of the Ebinur Lake wetlands were used to develop the PSO-SVR model. Data were randomly chosen and split into training and testing sets at a 7:3 ratio. After training, the PSO-SVR model was tested on the 30% of the data that was not in the training set. This was done to assess the generalization accuracy of the trained model and to ascertain its capacity to use the pattern learned by the SVR to predict target values for previously unseen data. This step is referred to as model validation, and a performance assessment is only as good as the criteria set for it. The input factors use different measurement units. To eliminate dimensional effects and give each input factor equivalent weight, the input factors were standardized with a non-dimensional method that compresses the range of each factor to −1 to 1. The premnmx function in MATLAB 2014a was applied to normalize the input factors. Once the model is satisfactorily accurate, the postmnmx function can be applied to recover the original magnitude of the normalized data. The input parameters of the PSO-SVR models used for comparison are described in Table 4.
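The scaling and splitting steps described above can be sketched in a few lines. The paper's MATLAB code is not shown, so the following Python is only an illustration: the function names are hypothetical, and the scaling formula is the usual linear map onto [−1, 1] that premnmx and postmnmx are understood to apply.

```python
import random


def scale_to_unit_range(values):
    """Map values linearly onto [-1, 1]; return the scaled list and (min, max)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant input factor")
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
    return scaled, (lo, hi)


def restore_from_unit_range(scaled, bounds):
    """Invert scale_to_unit_range using the stored (min, max) pair."""
    lo, hi = bounds
    return [0.5 * (s + 1.0) * (hi - lo) + lo for s in scaled]


def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and testing sets (7:3 by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]
```

Each input factor would be scaled separately, with the (min, max) pair kept so that predictions can be mapped back to WQI units afterwards.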
Verifying the estimation model of the water quality index. After modeling the water quality index (WQI), the accuracy of the obtained models was examined using an independent dataset consisting of 11 samples. The corresponding validation results are shown in Figs 6, 7 as scatter diagrams of predicted and measured values, and statistical results are summarized in Table 5. The coefficient of determination R² between predicted and measured values is high, the measured and predicted values are basically linearly related, the RMSE is low, and the slope of the fitting curve is close to 1. Therefore, the PSO-SVR model exhibits a strong non-linear fitting capacity, indicating that the hyperspectral spectral indices are effective for monitoring the WQI. Figures 6, 7 and Table 5 show that the predicted WQI values are very consistent with the measured values. The 15 water quality index estimation models were validated by the 22 water samples. In total, 22 models present acceptable results at RPD > 1.4 and with a slope of close to 1. The sensitive wave band estimation model is most accurate for the 1.6 order derivatives: R² is 0.92, RMSE is 58.40, RPD is 2.71, and the slope is 0.85. The spectral index estimation model is also most accurate for the 1.6 order derivatives: R² is 0.92, RMSE is 61.15, RPD is 2.81, and the slope is 0.97. The accuracy of the machine learning algorithm was then compared with that of geographically weighted regression (GWR). With R₈₈₃/R₉₃₄, R₅₈₃ − R₈₄₄, and (R₅₂₀ − R₇₆₀)/(R₅₂₀ + R₇₆₀) as the independent variables, the GWR model was used for regression analysis of the WQI, giving an AIC value of 402.69, an R² of 0.86 and a residual sum of squares of 879.91.
When this model was tested with the validation samples, R² was 0.75, RMSE was 80.33, and RPD was 1.90. Scatter diagrams of the predicted and measured values for this model are presented in Fig. 8.
Compared with GWR, the spectral index estimation model based on the machine learning algorithm is more accurate for the 1.6 order derivatives, with an R² of 0.92, an RMSE of 61.15, an RPD of 2.81 and a slope of 0.97. Therefore, the water

Discussion
Assessment of water quality and of the spatial variability of the water quality index (WQI). In this study, the water quality of Ebinur Lake Watershed surface water was evaluated. Rivers of the Ebinur Lake Watershed recharge Ebinur Lake. To evaluate the water quality of the surface water, 48 sampling sites and 20 water quality parameters were selected for monitoring and analysis. The water quality parameters pH, HCO₃⁻, […], Ca, Mg, COD, PO₄³⁻ and Cr were used to calculate WQI values and evaluate river water quality. WQI values were found to range between 56.61 and 2886.52. The WQI classification shows that the Ebinur Lake Watershed presents varying levels of water quality. The downstream areas of the rivers present poor water quality. For the Boertala River, the main pollutant sources include wastewater discharged from Wenquan County and the city of Bole, leather and marble factories downstream in the Boertala River Valley, and agricultural activities in the oasis of the Ebinur Lake Watershed. For the Jing River, the main pollutant sources include wastewater discharged from Jinghe County, the leather industry, saltworks and saline land downstream along the river, and agricultural and grazing activities in the oasis of the Ebinur Lake Watershed. The Kuitun-Akeqisu River is located in the southwestern area of the watershed. A large amount of salt is found on either side of the river, and water in the area is highly saline. The effects of the water quality parameters on the WQI map were investigated, and the results indicate that environmental pollutants negatively affect all water surfaces of the Ebinur Lake Watershed. Necessary protection measures should therefore be taken when planning the use of river water.
Estimating the water quality index (WQI) from hyperspectral remote sensing data. In this study, a WQI estimation model is established based on sensitive wave bands and spectral indices of hyperspectral data, so that water quality levels are directly estimated and assessed via remote sensing techniques. Most previous studies 18,34,35 have focused on single indices of water quality such as chlorophyll-a, TDS, and turbidity (NTU). Although monitoring models for single water quality parameters can be highly accurate, their results are uncertain as indicators of overall water quality, because spectral reflectance responds to the combined effect of all water quality parameters rather than to any single one. The WQI, by contrast, reflects overall water quality conditions. The evaluation and estimation of surface water quality based on hyperspectral remote sensing is therefore feasible. In this study the accuracy of the estimation model is improved through the use of new hyperspectral indices (DI, RI, and NDI) and via particle swarm optimization-support vector regression. Remote sensing techniques make it possible to develop a spatial and temporal understanding of surface water quality indices and to monitor water surfaces more effectively and efficiently. Such tools can also be used to estimate water quality distributions. Future studies must assess the applicability of satellite remote sensing data and of unmanned aerial vehicle (UAV) technologies for estimating WQI values.
As the number of in situ samples continues to increase, a unified regression model that effectively measures the water quality parameters of different watersheds should be developed for arid regions.
Table 5. Summary of parameter correlations between the measured verification values and predicted values.

Conclusions
The Ebinur Lake Watershed of the Xinjiang Autonomous Region, China, was used as a study area. We used optimal bands based on the difference index, ratio index, and normalized difference index algorithms to assess the WQI, applying eleven orders (interval 0.2) of fractional derivatives to the remote sensing spectral data, and we measured the performance of the proposed models using GA-SVR and the band difference algorithm. The results are as follows: (1) Water quality levels for drinking purposes were evaluated via the water quality index (WQI) method. The computed WQI values were found to range between 56.6133 and 2,886.5198. The prepared WQI map shows that the arid area generally presents low levels of water quality. (2) As the order increased, the number of bands with correlation coefficients passing a significance test at 0.01 first increased and then decreased, with a peak at the 1.6 order and an R² of 0.525. The correlation coefficients between the WQI and the derivative spectral data for the optimal DI, RI and NDI band combinations also show a peak with the 1. This study not only estimates a water quality index using different techniques for the semi-arid area of central Asia but also develops a new algorithm that can be applied to this area and to other areas.

Materials and Methods
Study area. The Ebinur Lake Watershed (44°05′−45°08′N, 82°35′−83°16′E) (Fig. 9) is located on the northern slope of the Tien Shan Mountains southwest of the Junggar Basin. The watershed covers an area of 50,621 km². It is surrounded by a mountainous region (24,317 km²; Alatau Mountains, Maliyi Mountains and Biezhentao Mountains) and by plains (Jinghe Oasis) (26,304 km²) to the north, west and south 36. Artificial reservoirs (RES) are found southwest of the watershed. The area is characterized by a typical temperate arid continental climate, with the mountain-oasis-desert system presenting typical temperate arid ecological characteristics. The study region is located inland (2,000 km from the Pacific and Indian Ocean and 3,000 km from the Arctic Ocean); moisture in the study area is derived from the Atlantic Ocean (7,000 km), but water vapor transport from maritime areas is limited 36. The lake is a terminal lake fed by the Kuitun Mountains, Akeqisu River, Jing River, Tuotuo River, Sikeshu River, Boertala River, Akaer River and Daheyanzi River. Surface water levels of Ebinur Lake and the Tuotuo River are currently low, and thus water ecological safety is threatened. Severe water shortage problems and the presence of large volumes of sewage have rendered river and lake water pollution levels high in the Ebinur Lake Watershed, a typical arid area of central Asia 37.

Materials
Sample collection. Water samples were collected on October 5, 2016 from 48 locations within the Ebinur Lake Watershed (Fig. 9). Collected samples were stored at low temperatures (under 2 °C) during transport before water quality measurements were carried out in a laboratory. Samples were transported in polyethylene plastic bottles previously rinsed with 10% HCl and cleaned with deionized water to minimize changes in water chemical characteristics. We used a handheld global positioning system (GPS) receiver to determine the central coordinates of each sample and a digital camera to photograph the sampling area (see Fig. 9). Temperature and pH levels were recorded along the shore at the time of sampling. All other measurements were taken in the lab within a day of sample collection. Biochemical oxygen demand (BOD₅), total nitrogen (TN), total phosphorus (TP), iron, copper, chemical oxygen demand (COD), zinc, volatile phenol (V.P.), ammonia nitrogen (NH 3 + -N), bicarbonate (HCO₃⁻), dissolved oxygen (DO), total dissolved solids (TDS), chloride (Cl⁻), sulphate (SO₄²⁻), sodium (Na), calcium (Ca), magnesium (Mg), phosphate (PO₄³⁻) and chromium(VI) (Cr) concentrations collected over five days were determined according to the corresponding methods shown in Table 6.
Hyperspectral data collection. The FieldSpec® 3 ASD spectroradiometer is an optical sensor that uses detectors other than photographic film to measure the distribution of radiation in a particular wavelength region and hence the radiant energy level (radiance and irradiance). It was used to record the spectral reflectance patterns of lake water corresponding to water content levels. Observation methods applied to water surfaces can be found in Supplementary Fig. S1.
To observe the water surfaces (Fig. S1), the spectral range of the spectrometer was set to 350-1050 nm with a 1 nm sampling interval. To account for changes in illumination conditions, measurements of the water target, the sky, and the white reference panel were collected at each station. Sky conditions were also recorded at each station during spectral measurement.
All field spectrometer measurements were processed to remove sky and sun glare using a constant water body reflection coefficient 38. Hyperspectral reflectance values, R_rs, were calculated using the following equation:
\[ R_{rs} = \frac{L_u - \rho L_s}{E_d} \]
where L_u is the total upwelling radiance, L_s is the sky radiance, ρ is the water surface reflection coefficient (0.028), and E_d is the measured downwelling solar irradiance.
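As a small worked illustration of the glare correction, the sketch below assumes the standard above-water form R_rs = (L_u − ρ L_s) / E_d, which matches the symbol definitions in the text. The function name and the sample numbers are hypothetical, not values from the study.

```python
def remote_sensing_reflectance(l_u, l_s, e_d, rho=0.028):
    """Glare-corrected reflectance for one band.

    l_u: total upwelling radiance, l_s: sky radiance,
    e_d: downwelling irradiance, rho: surface reflection coefficient.
    """
    if e_d <= 0:
        raise ValueError("downwelling irradiance must be positive")
    return (l_u - rho * l_s) / e_d


# Hypothetical per-band readings (not measured values).
upwelling = [0.50, 0.42, 0.30]
sky = [1.00, 0.90, 0.80]
irradiance = [10.0, 9.5, 9.0]
spectrum = [remote_sensing_reflectance(u, s, e)
            for u, s, e in zip(upwelling, sky, irradiance)]
```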

Methods
Fractional Derivative Method. Fractional derivative methods have been widely used in certain fields because models described by fractional derivatives are more accurate and efficient than methods based on integer derivatives 39,40. The most frequently used definitions are the following: Grünwald-Letnikov (G-L), Riemann-Liouville (R-L), and Caputo 41. As it is less complex than the others, the G-L definition was employed in this study. The G-L fractional derivative is defined as follows 42:
\[ d^{\alpha} f(x) = \lim_{h \to 0} \frac{1}{h^{\alpha}} \sum_{m=0}^{[(t-a)/h]} (-1)^{m} \frac{\Gamma(\alpha+1)}{m!\,\Gamma(\alpha-m+1)} f(x - mh) \]
where α is the order, h is the step length, and t and a are the respective upper and lower limits of the derivative. The Gamma function is written as follows:
\[ \Gamma(\beta) = \int_{0}^{\infty} e^{-t} t^{\beta-1}\, dt = (\beta-1)! \]
Based on our use of ASD spectrometer data, when the sampling interval is 1 nm, h = 1. The fractional order derivative of f(x) is then defined as follows:
\[ \frac{d^{\alpha} f(x)}{dx^{\alpha}} \approx f(x) + (-\alpha) f(x-1) + \frac{(-\alpha)(-\alpha+1)}{2} f(x-2) + \cdots + (-1)^{n} \frac{\Gamma(\alpha+1)}{n!\,\Gamma(\alpha-n+1)} f(x-n) \quad (5) \]
Therefore, (5) can be regarded as the numerical algorithm used to calculate the fractional derivative of hyperspectral data, and a zero order denotes that hyperspectral data are not treated by the derivative algorithm.
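The numerical G-L scheme can be sketched as follows. This is an illustrative Python version, not the authors' MATLAB code: the function names are hypothetical, and the coefficients are generated with the standard recurrence w_0 = 1, w_m = w_{m−1}(m − 1 − α)/m, which reproduces the terms 1, −α, (−α)(−α+1)/2, and so on.

```python
def gl_weights(order, n_terms):
    """Grunwald-Letnikov coefficients: 1, -a, (-a)(-a+1)/2, ... for order a."""
    weights = [1.0]
    for m in range(1, n_terms):
        weights.append(weights[-1] * (m - 1 - order) / m)
    return weights


def gl_fractional_derivative(spectrum, order, step=1.0):
    """Fractional derivative of a uniformly sampled spectrum.

    spectrum: reflectance values in band order (e.g. 1 nm spacing, step = 1).
    Each output value uses the current band and all preceding bands.
    """
    n = len(spectrum)
    weights = gl_weights(order, n)
    result = []
    for k in range(n):
        acc = sum(weights[m] * spectrum[k - m] for m in range(k + 1))
        result.append(acc / step ** order)
    return result


# The eleven orders used in the text: 0, 0.2, ..., 2.0.
orders = [round(0.2 * i, 1) for i in range(11)]
```

With order 0 the spectrum is returned unchanged, and with order 1 the scheme reduces to the first difference between adjacent bands, which is a quick sanity check on any implementation.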

Determination of the best indices. In obtaining the most sensitive bands from water environment data, previous studies show that the combination of various bands can improve the sensitivity of hyperspectral reflectance data to water quality values 43. Therefore, this method explores the relationships between water quality and spectral reflectance and then applies a 2D correlation diagram to study relationships between the difference index (DI), ratio index (RI), normalized difference index (NDI), and the water quality index 44. Optimal band combinations for the water quality index are selected from 350 nm-1,050 nm and are entered into MATLAB 2014a (MathWorks, 2014).
\[ DI = R_i - R_j \]
\[ RI = \frac{R_i}{R_j} \]
\[ NDI = \frac{R_i - R_j}{R_i + R_j} \]
where R_i and R_j denote the original reflectivity values of any two bands i and j selected from 350 nm-1,050 nm.
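The band-pair search behind the 2D correlation diagram can be sketched as an exhaustive loop over all (i, j) pairs. The Python below is a minimal illustration using only the standard library; the function names and data layout are assumptions, reflectance values are assumed to be positive so that the RI and NDI are defined, and a full 350-1,050 nm search would in practice be vectorized.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


INDEX_FORMULAS = {
    "DI": lambda ri, rj: ri - rj,
    "RI": lambda ri, rj: ri / rj,
    "NDI": lambda ri, rj: (ri - rj) / (ri + rj),
}


def best_band_pair(spectra, wqi, index_name):
    """Return (r, i, j) for the band pair whose index correlates most with WQI.

    spectra: one list of reflectance values per sample, all in the same band order.
    wqi: one WQI value per sample.
    """
    formula = INDEX_FORMULAS[index_name]
    n_bands = len(spectra[0])
    best = (0.0, None, None)
    for i in range(n_bands):
        for j in range(n_bands):
            if i == j:
                continue
            values = [formula(s[i], s[j]) for s in spectra]
            r = pearson_r(values, wqi)
            if abs(r) > abs(best[0]):
                best = (r, i, j)
    return best
```

With a 1 nm sampling interval starting at 350 nm, a band index k would correspond to a wavelength of 350 + k nm.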

Calculation of the Water Quality Index (WQI). The Water Quality Index (WQI) is an extracted and estimated index that reflects the composite effects of all water quality parameters 45. First, each water quality parameter was assigned a weight (w_i) on a scale of 1 (lowest effect on water quality) to 5 (strongest effect on water quality) based on perceived effects on primary health and according to its relative importance to the surface water environment 46,47. PO₄³⁻, SO₄²⁻ and Cr were assigned the highest weight (8) due to their primary role in water quality assessments; a minimum weight of 1 was assigned to Ca, Mg and Na due to their limited importance for water quality assessments 48. The relative weight (W_i) is computed from the following equation:
\[ W_i = \frac{w_i}{\sum_{i=1}^{n} w_i} \]
where W_i is the relative weight, w_i is the weight of each parameter, and n is the number of parameters. Then, a quality rating (Q_i) for each parameter is assigned by dividing its concentration in each water sample by its limit given in the WHO 33 quality standards for surface water quality for the People's Republic of China. This result is multiplied by 100:
\[ Q_i = \frac{C_i}{S_i} \times 100 \]
where Q_i is the quality rating, C_i is the concentration of each water quality parameter for each water sample, and S_i is the surface water standard for each water quality parameter according to WHO guidelines 33 (2008). To measure the WQI, the SI_i value should be calculated first using the following equations:
\[ SI_i = W_i \times Q_i \]
\[ WQI = \sum_{i=1}^{n} SI_i \]
where SI_i is the sub-index of the ith parameter and Q_i is the quality rating based on the ith water quality parameter 49.
Estimate the WQI using a machine learning algorithm. Machine learning algorithms have become very popular in the era of big data. Machine learning is a science of the artificial: the field's main objects of study are artifacts, specifically algorithms that improve their performance with experience. Support vector regression (SVR) is one of the main machine learning algorithms.
We used the Support Vector Regression Model to estimate the WQI for the arid area 50-52.
Given sample data (x_i, y_i), i = 1, 2, …, l, where x_i denotes the input vector and y_i = f(x_i) is the estimated output, the estimation function can be written as:
\[ f(x) = \omega \cdot \varphi(x) + b \]
where φ(x) is a nonlinear mapping from the input space to a high dimensional space; ω is a weight vector; and b is the offset.
The regression target is to identify the parameters ω and b that minimize the regression error function. The regression error function can be defined as:
\[ R(\omega) = \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{l} L\big(f(x_i), y_i\big) \]
where L(·) is a loss function and the constant C > 0 is a fixed penalty parameter. The most commonly used loss function is the ε-insensitive loss function:
\[ L\big(f(x), y\big) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \]
This shows that the loss is 0 when the difference between the measured and predicted value is less than a small positive number ε. To smooth the regression function, a minimum ω must be found, and based on the fitting error, the regression function can be solved as a constrained optimization problem:
\[ \min_{\omega, b, \xi, \xi^{*}} \; \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{l} (\xi_i + \xi_i^{*}) \quad \text{s.t.} \quad \begin{cases} y_i - \omega \cdot \varphi(x_i) - b \le \varepsilon + \xi_i \\ \omega \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i, \xi_i^{*} \ge 0 \end{cases} \quad (14) \]
where ξ_i and ξ_i* are slack variables of the upper and lower constraints on the outputs of the system. The dual of the optimization problem in Equation (14) leads to a quadratic programming (QP) problem, obtained with the Lagrange method, that can be expressed as:
\[ \max_{a, a^{*}} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (a_i - a_i^{*})(a_j - a_j^{*}) K(x_i, x_j) - \varepsilon \sum_{i=1}^{l} (a_i + a_i^{*}) + \sum_{i=1}^{l} y_i (a_i - a_i^{*}) \quad \text{s.t.} \quad \sum_{i=1}^{l} (a_i - a_i^{*}) = 0, \; 0 \le a_i, a_i^{*} \le C \]
where a_i and a_i* are Lagrange multipliers and K(x_i, x_j) = φ(x_i) · φ(x_j) is the kernel function. After solving the optimization problem, denote the optimal solution as (a_1, a_1*, …, a_l, a_l*) and b. Three parameters, the penalty coefficient C, the kernel function parameter σ and the width of the insensitive loss function ε, constitute the model parameters and have a considerable impact on the performance of the SVR model. These parameters are often chosen by trial and error, which makes optimal values difficult to obtain. PSO can find the optimal value quickly, searching a complicated space in parallel 54, and we adopt it to select optimal parameters of the SVR model. PSO uses a population of particles, corresponding to individuals in an evolutionary algorithm, to explore the solution space of a problem 55,56. A flowchart for the proposed PSO-SVR algorithm can be found in Supplementary Fig. S2.
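To make the parameter search concrete, the sketch below shows a generic PSO loop in Python. It is not the authors' MATLAB implementation: the inertia and acceleration constants are common textbook choices, the search bounds are hypothetical, and `toy_validation_error` is a made-up stand-in for the validation error of an SVR trained with a given (C, σ, ε), which would be the real objective in a PSO-SVR workflow.

```python
import random


def particle_swarm_minimize(objective, bounds, n_particles=30, n_iterations=100,
                            inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimize objective(position) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_cost = [objective(p) for p in positions]
    g = min(range(n_particles), key=personal_cost.__getitem__)
    global_best, global_cost = personal_best[g][:], personal_cost[g]

    for _ in range(n_iterations):
        for k in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Pull each particle toward its own best and the swarm's best.
                velocities[k][d] = (
                    inertia * velocities[k][d]
                    + cognitive * r1 * (personal_best[k][d] - positions[k][d])
                    + social * r2 * (global_best[d] - positions[k][d])
                )
                # Move, then clamp the coordinate to the search bounds.
                positions[k][d] = min(hi, max(lo, positions[k][d] + velocities[k][d]))
            cost = objective(positions[k])
            if cost < personal_cost[k]:
                personal_best[k], personal_cost[k] = positions[k][:], cost
                if cost < global_cost:
                    global_best, global_cost = positions[k][:], cost
    return global_best, global_cost


def toy_validation_error(params):
    """Hypothetical stand-in for SVR validation error as a function of (C, sigma, eps)."""
    c, sigma, eps = params
    return (c - 10.0) ** 2 + (sigma - 0.5) ** 2 + (eps - 0.1) ** 2


# Hypothetical search ranges for C, sigma and epsilon.
search_bounds = [(0.1, 100.0), (0.01, 10.0), (0.001, 1.0)]
best_params, best_error = particle_swarm_minimize(toy_validation_error, search_bounds)
```

In an actual PSO-SVR run, each objective evaluation would train an SVR on the training set and return its error on held-out data, so the number of particles and iterations directly sets the training cost.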
Statistical analysis. Test data analyses were carried out using Origin 8.0 (OriginLab Corporation, USA), and MATLAB 2014a (MathWorks, USA) was used as the programming environment. The significance of the statistical correlations was evaluated from P values. Predicted and measured values were compared using the coefficient of determination (R²), the root mean square error (RMSE), the standard deviation (SD) and the relative analysis error (RPD) 57 as follows: In formulas (4), (5), (6), and (7), x_i is the predicted value; y_i is the measured value; N is the total number of samples; x̄ is the average value of the sampled value, and ȳ is the average sample forecast value. SD is the standard deviation of the dataset and RMSE is the root mean square error; the smaller the RMSE, the more stable the model's predictive capacity. As R² approaches a value of 1, the accuracy of the model improves. For a low relative analysis error (RPD < 1.4), the model is not reliable. When 1.4 < RPD < 2, the model is moderately accurate, and when RPD > 2, the model presents a high level of predictive ability.
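The evaluation indices can be sketched as below. Because the original formulas are not reproduced in the text, the definitions here are assumptions: R² is taken as the squared Pearson correlation between measured and predicted values, SD as the sample standard deviation of the measured values, and RPD as SD divided by RMSE. The handling of the exact boundary values 1.4 and 2 is also a choice made for this sketch.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def r_squared(measured, predicted):
    """Squared Pearson correlation (assumed form of R^2)."""
    n = len(measured)
    mean_m, mean_p = sum(measured) / n, sum(predicted) / n
    smp = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    smm = sum((m - mean_m) ** 2 for m in measured)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    if smm == 0 or spp == 0:
        return 0.0
    return smp * smp / (smm * spp)


def rpd(measured, predicted):
    """Sample standard deviation of the measured values divided by the RMSE."""
    n = len(measured)
    mean_m = sum(measured) / n
    sd = math.sqrt(sum((m - mean_m) ** 2 for m in measured) / (n - 1))
    return sd / rmse(measured, predicted)


def reliability(rpd_value):
    """Interpret RPD using the thresholds given in the text."""
    if rpd_value < 1.4:
        return "not reliable"
    if rpd_value < 2.0:
        return "moderately accurate"
    return "high predictive ability"
```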
Besides R², RMSE, SD and RPD, a geographically weighted regression (GWR) model (http://gwr4.software.informer.com/download/) was selected in this study as a benchmark for the accuracy of the machine-learning-based WQI estimation model. As highlighted in the literature 58,59, the main contribution of the GWR technique is its ability to explore the spatial variation of explanatory variables in the model, where the coefficients of explanatory variables may vary significantly over geographical space. The accuracy of the machine learning algorithm was compared with that of the GWR model in order to verify the reliability of the machine learning model.
Water quality assessment standards. The calculated WQI values are classified into five categories as follows 32. When the WQI value is < 50, the water quality level is excellent and the water is suited for drinking, and values of 50 < WQI < 100 denote that water quality levels are good. Values of 100 < WQI < 200 denote poor water quality levels. When 200 < WQI < 300, water quality levels are very poor. A value of WQI > 300 denotes that water is unsuitable for drinking (see Table 7).
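The WQI calculation described under "Calculation of the Water Quality Index" and the five-category scheme above can be sketched together as follows. This is an illustrative Python version with hypothetical function names: it takes the relative weight as each weight divided by the sum of weights, the quality rating as concentration over standard times 100, and the WQI as the sum of their products. The thresholds are those stated in the text, with the exact boundary values (50, 100, 200, 300) assigned to the higher category as an assumption.

```python
def water_quality_index(concentrations, standards, weights):
    """Weighted-sum WQI for one water sample.

    All three arguments are dicts keyed by parameter name.
    """
    total_weight = sum(weights.values())
    wqi = 0.0
    for name, weight in weights.items():
        relative_weight = weight / total_weight                          # W_i
        quality_rating = concentrations[name] / standards[name] * 100.0  # Q_i
        wqi += relative_weight * quality_rating                          # SI_i
    return wqi


def classify_wqi(wqi):
    """Map a WQI value to the five water quality categories."""
    if wqi < 50:
        return "excellent"
    if wqi < 100:
        return "good"
    if wqi < 200:
        return "poor"
    if wqi < 300:
        return "very poor"
    return "unsuitable for drinking"
```

For example, the extremes of the reported WQI range, 56.61 and 2,886.52, would fall in the "good" and "unsuitable for drinking" categories respectively.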

Table 7. Columns: Class, Threshold value, Water quality.