Fuzzy clustering and distributed model for streamflow estimation in ungauged watersheds

This paper proposes a regionalization method for streamflow prediction in ungauged watersheds in the 7461 km2 area above the Gharehsoo Hydrometry Station in the Ardabil Province, in the north of Iran. First, the Fuzzy c-means clustering method (FCM) was used to divide 46 gauged (19) and ungauged (27) watersheds into homogenous groups based on a variety of topographical and climatic factors. After identifying the homogenous watersheds, the Soil and Water Assessment Tool (SWAT) was calibrated and validated using data from the gauged watersheds in each group. The calibrated parameters were then tested in another gauged watershed that we considered as a pseudo ungauged watershed in each group. Values of R-Squared and Nash–Sutcliffe efficiency (NSE) were both ≥ 0.70 during the calibration and validation phases; and ≥ 0.80 and ≥ 0.74, respectively, during the testing in the pseudo ungauged watersheds. Based on these metrics, the validated regional models demonstrated a satisfactory result for predicting streamflow in the ungauged watersheds within each group. These models are important for managing stream quantity and quality in the intensive agriculture study area.

The area occupying the center of the Ardabil Province of Iran is a high-yield agriculture production region where recent human activities and land degradation have contributed to accelerated erosion and increased flood frequency 23 . These factors have reduced fertility, lowered the volume of available water, and degraded the water resource reservoirs. All these impacts threaten the agricultural sustainability of region 24 . Addressing these issues requires an assessment of relevant runoff, erosion, and sediment transport processes to guide water resource planning, hazard assessments, and catchment management. Hydrological models are useful management tools in watersheds where streamflow and sediment yields are tied to natural phenomena, anthropogenic activity, and climate variability 13,25 . However, models require data for calibration and testing-and lack of data is the main limitation for applying them in regions such as Ardabil Province where many watersheds are ungauged, having no hydrological monitoring stations. Therefore, the main objective of the work is to develop reliable predictions that can guide management in the ungauged watersheds of this important agriculture region in Iran. This study develops a regionalization approach based on the SWAT model for estimating streamflow in the ungauged watersheds in the Ardabil Province, Iran. To achieve this objective, we applied the FCM method to group 46 watersheds based on their similarity in a variety of properties (such as area, maximum elevation, minimum elevation, slope length, slope, rainfall, and temperature); and used the SWAT model for streamflow regionalization objectives in each homogenous region.

Materials and methods
Study area. The area of interest is the headwaters of the Gharehsoo River in the Ardabil Province, Iran ( Fig. 1). The 287-km river flows past many cities before entering the Aras River, then the Caspian Sea. The river system provides water for domestic, industrial, agriculture, fishing sectors that have about 1.25 million users 23 . In upstream, the river is formed from outflows from basins in the Talesh, Bozgosh, and Sabalan mountains.
The highest elevation in the study area is 4783 m a.s.l. in the Sabaln Mountain; the lowest elevation is 731 m a.s.l. in the outlet of the watershed. The area is dominated by a semi-arid climate with a mean annual rainfall of 340 mm, although there is uncertainty associated with precipitation totals at high elevations. May and August SWAT model. We used the physics-based and semi-distributed Soil and Water Assessment Tool (SWAT), developed by the USDA Agricultural Research Service 28 . SWAT uses hydro-meteorological and topographic information to simulate streamflow and is therefore useful for management purposes related to runoff calculations, sediment load estimation, and agricultural production [29][30][31] . Simulation of catchment hydrology in SWAT considers the land and river routing phases of the hydrology cycle. The first part controls the amount of water transported to the channel in each sub-watershed. The second relates to the flow through the channel network to the catchment outlet 32 . The SWAT model divides a catchment into sub-basins based on a DEM (digital elevation map). Maps are added regarding land use, topography, and soil, which allow further delineation and creation of hydrological response units (HRU s ) to represent a catchment 15,33 .
The SWAT model is composed of various mathematical equations and empirical formulas that are used for determining parameters at daily, monthly, or annual time steps 34 . For example, we employ the Soil Conservation Service (SCS) Curve Number 35 equation for estimating surface runoff: where Q surf is surface runoff elevation (mm); R day is daily rainfall (mm); S is the soil surface moisture (mm). Runoff occurs when R day is greater than 0.5 36 . The S parameter varies based on the soil type, slope of the catchment, and land use management 35 : where, CN is a dimensionless parameter ranging from 0 to 100, which is determined by the SWAT from the input parameters.
The Manning coefficient is used to determine the rate and velocity of streamflow. Finally, we used the Muskingum River routing method to handle variable storage and river routing of water.
To calculate potential evapotranspiration, we used the Penman-Monteith Equation 32 : where E is the latent heat flux density (MJ m −2 day −1 ); E is the depth rate of evaporation (mm day −1 ); ∆ shows the saturation vapor pressure-temperature curve (kPa °C −1 ); H net is the net radiation (MJ M −2 day −1 ); G is the heat flux density to the ground (MJ M −2 day −1 ); e o z is the saturation vapor pressure of air at height z (kPa); γ is the psychometric constant (kPa °C −1 ); r c is the plant canopy resistance (s m −1 ); r a is the diffusion resistance of the air layer (aerodynamic resistance) (s m −1 ); ρ air is dry air density; and C p is specific heat capacity of air (J kg −1 K −1 ).

Watershed characteristics.
Before implementing the SWAT model, we selected seven parameters for 19 gauged and 27 ungauged watersheds according to the literature review and data availability: area, maximum elevation, minimum elevation, slope length, slope, rainfall, and temperature. These parameters have been widely used for watershed clustering in hydrological studies 7,37-39 . In each watershed, the maximum elevation, minimum elevation, and slope parameters were calculated using an ASTER DEM map (with 30 × 30 m resolution), and the average in space for rainfall and temperature (from 1985 to 2015) were calculated using the weights calculated by the Voronoi maps 12 in the ArcGIS 10.5. The number of parameters is high compared with other regionalization applications 7,37-39 . The area of the watersheds ranges between 8 (SW 39) and 1194 km 2 (SW 33). The ranges of rainfall and temperature are considerable. In the time period of 1985-2015, the minimum average rainfall (264 mm) is found in SW27 in the east of the region; and the maximum average rainfall (487 mm) occurs in SW19 in the west. In this period, the temperature range is between 7.3 and 13.0 °C, indicating a cold-weather regime. The mountainous topography contributes to a high elevation range: 731-4783 m. The minimum average catchment slope is 4.2% in SW9; and the average maximum slope is 34.5% in SW19. Slope length in SW33 (161 km) is much higher than in other watersheds: SW27 has the smallest slope length (6.6 km). All of these data were used for watershed grouping via clustering and individual watershed simulations.
Fuzzy c-means clustering. Watersheds with the same climate and topographic circumstances often have the same hydrological conditions, and therefore, they can be grouped by clustering, thereby providing a means of developing validated models for ungauged river systems 13,40 . Indeed, clustering is a pre-processing step for regionalization objectives. Clustering methods divide a data set into classes that similar objects are in an equal group. Among the clustering methods, soft clustering methods such as the Fuzzy c-means (FCM) technique reduces the uncertainty in identifying the members of the group by assigning a degree of membership for each member in each group 41,42 . Therefore, we used the FCM method for identifying similar donor basins to the www.nature.com/scientificreports/ ungauged basins. In this method, the degrees of membership for each watershed in each group is determined using fuzzy memberships 43 . This algorithm has been widely used for clustering similar watersheds for several applications 41,42 . The algorithm, created by Dunn 44 , and improved with Bezdek 45 , is represented by the following objective function: with the restraint in the form where m is a fuzzification parameter (number greater than 1); n is the number of data points; C is the cluster number considered; U ij is the membership of pixel x i in the ith cluster, and v j is the jth cluster number. As some parameters require standardizing before clustering to eliminate a heterogeneous effect of different units 42 , we normalized them to a range of 0 to 1 as follows: where X norm is the normalized value, x is the initial value, x min and x max are the minimum and maximum values.
As the number of clusters is not pre-defined, the optimum clustering number was selected through an iterative process that is repeated until the optimization criteria are met. Our criteria were maximization of the partition coefficient ( V PC ) and minimization of the partition entropy ( V PE ), which are calculated as follows 45 : where n and c are number of data points and clusters, respectively, u ij is the degree of membership of x i in the cluster j. The ranges of V PC and V PE are between 1/c to 1 and 0 to log c 2 , respectively. The processes of the FCM can be explained in: standardizing the data set, choosing m and c, calculate partition matrix ( u ij ), and detecting the number of clusters for a successful clustering.
SWAT forcing data. Hydro-meteorological forcing data for the SWAT model consists of rainfall, minimum and maximum temperature, relative humidity, wind speed, and solar radiation with daily time step that covered 30 years period from January 1985 to December 2015. Rainfall data were derived from 27 stations maintained by the Iran Water Resources Management Company (IWRMC), and eight climate stations managed by the Iran Meteorological Organization (IMO). Wind speed, minimum and maximum temperature, relative humidity, and solar radiation were associated with the IMO climatic and synoptic stations. Missing values were filled using linear regression. The corrected data were prepared in the ArcSWAT2012 format and defined as model inputs. Daily stream discharge values  for model calibration/testing were associated with only six IWRMC hydrometric stations.
The spatially distributed layers for input into the SWAT include a DEM, a soil map, and a land-use map. The DEM is the raster layer consisting of a range of elevation values for each pixel. We used a DEM with a 30 × 30 m resolution, which was downloaded from the ASTER Global Digital Elevation Model (ASTGTM) in four sheets (N38E47, N37E47, N38E48, and N37E48) and merged. The map projection was Universal Transverse Mercator (UTM) on the spheroid of WGS84; it was processed in the first step of the SWAT execution. Sub-watershed characteristics such as slope gradient, slope length, and details of the stream network were determined from the DEM.
We made the land-use map using the ENVI 5.2 software and Landsat 8 satellite imagery available at the US Geographical Survey (www. usgs. gov). The real ground truth points were identified randomly from the high accuracy of the map. According to the literature, the land-use changes are not considerable in the model calibration and validation periods (i.e., 2004-2014) 27,34 . The soil layer was obtained from the Ardabil Department of Natural Resource and completed with the FAO digital soil map of Iran 46 . Information about the soil's polygons such as the number of layers, hydrological soil groups, soil texture was imported to the SWAT database in Access format. The soil characteristics of the study area are presented in Table 1.
Calibration and validation of the SWAT model. In each cluster, two gauged watersheds that have a high degree of membership and sufficient data were used for simulation with the SWAT model. One of these gauged stations is considered as the "donor" watershed, and another is the 'pseudo' ungauged watershed. The donor watershed is used to parameterize and calibrate the SWAT model parameters. The pseudo ungauged watershed is used to test the calibrated model. More details/descriptions of the donor and pseudo ungauged watersheds can be found in Choubin et al. 12 . For assessing the model performance, the determination coefficient In the SWAT Calibration Uncertainty Procedure (SWAT-CUP) software, we used a sequential uncertainty fitting (SUFI-2) algorithm for sensitivity analysis and model calibration and validation. This algorithm is an inverse semi-automatic reversed model that has high performance when calibrating many parameters 34,47 . For sensitivity analysis of the parameters, we used SWAT-CUP, which is based on the SUFI-2 algorithm. For each donor watershed, the periods 2004-2011 and 2012-2014 were used for calibration and validation, respectively. During the calibration, we determined optimal values for all sensitive parameters. In the validation period, the model was executed for watersheds using the optimal parameter value. In testing the model on the pseudo ungauged watersheds, the model was run with calibrated parameters for the period 2004-2011. A flowchart of research steps is shown in Fig. 2.  Fig. 3). Degree of membership in each cluster ranged from 0.46 to 0.82, 0.47 to 0.96, and 0.48 to 0.95, respectively. The first cluster includes watersheds with low elevation and mostly plain areas in the western part of Gharehsoo Watershed (Fig. 4). Watersheds in the second cluster are located in the southeast part with high elevation mountains. Watersheds in the third cluster are located in the south and east parts of the region with intermediate elevation (Fig. 4). Mean monthly temperature and mean annual rainfall from 1985 to 2015 time period are listed in Table 3. In these three clusters are 13, 14, and 18 watersheds, of which 7, 6, and 13 watersheds are ungauged, respectively. Based on the FCM results (Table 2), the SW32 was selected as the donor watershed in the first cluster; and SW25 as the pseudo ungauged watershed (Fig. 4). In the second cluster, the SW33 was selected as the donor watershed and the SW42 as the pseudo ungauged watershed. Finally, in the third cluster, the SW26 was selected as the donor watershed and the SW29 as the pseudo ungauged watershed (Fig. 4). It should be noted that the selected donor/pseudo ungauged watersheds have sufficient climatic data rather than others (for streamflow simulation by the SWAT model).

Sensitivity analysis results.
The global sensitivity analysis performed using the SUFI2 algorithm in each donor watershed. Parameters that have p-value < 0.05 and a high t-stat value (modulus) were considered sensitive. In each cluster, we identified six sensitive parameters for simulating streamflow (Table 4). In all clusters, the curve number (CN) parameter had the highest sensitivity, a result that is consistent with other studies in Iran 34,48 . In the first cluster, with smooth topography, groundwater parameters (GW-Delaye, GE-Revap, Rchrg_DP) and soil parameters (Sol_Awc, Sol_Z) were the most sensitive 15 . In the second cluster, which includes mountains, the  www.nature.com/scientificreports/ Sol-K is one of the sensitive parameters 13 . In the third cluster, the Esco parameter is the second sensitive parameter that shows in this cluster the evapotranspiration has a considerable effect on runoff.
Model calibration and testing results. The maps (Fig. 5) Fig. 9). The values of the RMSE are high compared with observations, this is mostly because of bad prediction in low flows. We then used the calibrated parameters in the validation of the same donor watersheds for the 2012-2014 time period. The obtained hydrographs are shown in Figs. 6, 7, and 8; and the calculated statistical coefficient is indicated in Fig. 9 and Table 5. Results showed that the model performance in the Gilan Watershed (SW33) was higher than in other watersheds, probably because of the comparatively large watershed area. The SWAT model generally performs best in large watersheds 55,56 . Streamflow prediction in the ungauged watersheds. After calibration and validation of the SWAT in gauged (donor) watersheds, it was run for the three pseudo watersheds in the three groups (SW25, SW42, and SW29). The calculated statistical coefficients indicated that the SWAT model had a very good performance  (Table 5; Fig. 10). The good performance on the pseudo ungauged watersheds indicates the usefulness of the model for estimating the streamflow in ungauged watersheds (which are in the same group) ( Table 5). The model is somewhat poor at simulating peak flows in April month. This issue is mainly because of the effect of snow melting in this month that leads to increasing streamflow. The observed streamflow in this month is higher than the simulated value. This issue mainly is mentioned in the watersheds covered by snow such as Vilaysane et al. 33 (in the Xedone River Basin, Thai) and Choubin et al. 12 (in the Karkheh River Basin, Iran). The time and the value of most of the peak flow events were simulated with high precision that shows the used SCS equation in the SWAT model has a high ability for simulation streamflow 12,33,36 . The overall results indicated a good performance of the SWAT model for regionalization objectives in ungauged watersheds. The obtained results of the model simulation in ungauged watersheds could have many implications on water resource management such as having an estimation of the streamflow average volume, peak flow time, peak flow volume, and etc. 52,57 . These data are also very important in determining the type and location of watershed management operation.  www.nature.com/scientificreports/ Therefore, we ran the model in the ungauged watersheds SW43, SW30, and SW3, in groups 1, 2, and 3, respectively. Ungauged watershed SW43 in group 1, which has a 15.

Conclusions
We found that the SWAT model was useful in developing a regionalization method to simulate streamflow in ungauged watersheds in the Gharehsoo watershed in Iran. In this paper, we used fuzzy c-means clustering to partition the 46 watersheds into three homogenous groups that were similar to seven hydro-geomorphic parameters: area, maximum elevation, minimum elevation, slope length, slope, rainfall, and temperature. The model was calibrated and validated using optimal parameters and data from one gauged donor watershed in each group, then tested on another gauged watershed from the same group (considered as a pseudo ungauged watershed). Values of R 2 and NSE > 0.7 indicated that the model performance in estimating streamflow in all three groups was acceptable for applying to the ungauged watersheds in the region. In most of the arid and semi-arid regions, the small watersheds are ungauged therefore this approach provides a reliable method for streamflow estimation. Short hydrologic data availability was the main limitation of this study. Experience of the model with different dry and wet periods in the calibration period can affect the model performance. Separating the dry and wet periods or considering a longer period may improve the model performance, which is recommended for future studies. The impact of water resources in developing agriculture products in arid and semi-arid regions is vital and, in these regions, accurate streamflow simulation is important for the management of the watersheds. Table 5. Evaluation metrics in the donor and pseudo ungauged watersheds.