Mapping wind erosion hazard with regression-based machine learning algorithms

Land susceptibility to wind erosion hazard in Isfahan province, Iran, was mapped by testing 16 advanced regression-based machine learning methods: Robust linear regression (RLR), Cforest, Non-convex penalized quantile regression (NCPQR), Neural network with feature extraction (NNFE), Monotone multi-layer perception neural network (MMLPNN), Ridge regression (RR), Boosting generalized linear model (BGLM), Negative binomial generalized linear model (NBGLM), Boosting generalized additive model (BGAM), Spline generalized additive model (SGAM), Spike and slab regression (SSR), Stochastic gradient boosting (SGB), support vector machine (SVM), Relevance vector machine (RVM) and the Cubist and Adaptive network-based fuzzy inference system (ANFIS). Thirteen factors controlling wind erosion were mapped, and multicollinearity among these factors was quantified using the tolerance coefficient (TC) and variance inflation factor (VIF). Model performance was assessed by RMSE, MAE, MBE, and a Taylor diagram using both training and validation datasets. The result showed that five models (MMLPNN, SGAM, Cforest, BGAM and SGB) are capable of delivering a high prediction accuracy for land susceptibility to wind erosion hazard. DEM, precipitation, and vegetation (NDVI) are the most critical factors controlling wind erosion in the study area. Overall, regression-based machine learning models are efficient techniques for mapping land susceptibility to wind erosion hazards.


Results
Multicollinearity test. Table1 shows the values of the tolerance coefficient (TC) and the variance inflation factor (VIF) for the controlling factors for wind erosion. VIF > 10 and TC < 0.1 indicate multicollinearity among the effective factors. Based on our results, the lowest TC value was obtained for electrical conductivity (EC), while the highest VIF value (5.93) value was calculated for bulk density. The results indicated the absence of any multicollinearity between the 13 factors controlling wind erosion in the study area.
Relative importance of the factors affecting wind erosion. The model with the highest performance (MMLPNN) was applied to quantify the relative importance of the effective factors for wind erosion. Based on Fig. 1, three factors, DEM (with relative importance 0.95), precipitation (with relative importance 0.8), and NDVI (with relative importance 0.54), were recognized as the most important factors controlling wind erosion in the study area. Wind erosion has been shown to be affected by many factors such as wind, precipitation, temperature, soil properties (texture, composition, and aggregation), topography, aerodynamic roughness, vegetation, and land use practice 16 .

Discussion
Maps of wind erosion hazard. The  Overall, MMLPNN, SGAM, Cforest, BGAM, and SGB were identified as the most accurate models for predicting land susceptibility to wind erosion. Based on MMLPNN (Fig. 2e), the four susceptibility classes covered 32.8%, 1.1%, 1.2% and 64.9% of the total area of Isfahan province, respectively. The land susceptibility map to wind erosion hazard generated using SGAM shows the high and very high susceptibility classes covered 5.4% and 61.5% of the total area, respectively, whereas the low and moderate susceptibility classes occupied 27.4% and 5.6%, respectively (Fig. 3d). According to Cforest (Fig. 2b), 26%, 6.4%, 6.6%, and 61% of the total area belonged to the low, moderate, high and very high susceptibility classes, respectively. Using BGAM (Fig. 3c), the very high susceptibility class covered 62% of the study area, whereas the low, moderate, and high classes occupied 23.2%, 7.8% and 7% of the total area, respectively. Finally, in the case of the SGB model (Fig. 3f), the results classified 32%, 0.6%, 2.2% and 65.2% of the study area as low, moderate, high, and very high susceptibility, respectively.
The map of wind erosion hazard produced by MMLPNN is the most accurate. Overall, multi-layer perception networks (MLPS) as universal estimators are well-known techniques for system identification. The monotonicity of MMLPNN does not depend on the quality of the training data because it is guaranteed by its structure 17 . GAM with spline function (SGAM) was one of the 5 most accurate models for wind erosion hazard mapping. The spline functions allow the flexible representation of non-linear marginal relationships of the explanatory and response variables without the necessity to define a specific function 18 . Cforest, as a random forest (RF) model, uses conditional inference trees for prediction 19 . Several studies confirm the performance of RF as a suitable model for spatial predictions of environmental hazards. For example 20 , reported that the RF model is the best model for digital mapping of soil carbon fractions.
Some studies 21 have also argued that RF has the highest predictive capability for modelling landslide susceptibility in comparison with other ML models. Some previous studies 22 have also reported that in comparison with other methods, RF has better performance in estimating PM 2.5 monthly concentration. In this study, we applied the boosting with generalized additive model (BGAM), and based on the indicators for examining model performance, this model exhibited satisfactory performance and was selected as one of the five most accurate models for mapping wind erosion hazard. Boosting is a technique for improving prediction rules, and it can be applied to classification and regression methods to increase the accuracy of the predictions 23 . SGB is related to both boosting and bagging 24,25 . Previous research 26 has reported that SGB provides stable predictions for tree species presence.

Conclusions
This research assessed the performance of 16 individual regression-based ML algorithms for mapping land susceptibility to wind erosion hazard in an arid region in central Iran. In all, 13 effective factors for wind erosion were considered and regions with active wind erosion were mapped using a "wind erosion inventory map". Based on three statistical indicators and a Taylor diagram, the MMLPNN model was the most accurate model.  www.nature.com/scientificreports/ We conclude that MMLPNN is powerful tool for mapping wind erosion hazard in arid and semi-arid region ecosystems worldwide. We recommend that future work should focus on testing and comparing the performance of regression-based and classification-based ML models for the mapping and spatial modelling of wind erosion and dust sources to ensure that robust evidence is provided to support management decisions.

Material and methods
Study area. Isfahan province (Fig. 7), an arid region, is located in central Iran, between the latitudes 30°45′59.51" to 34°27′13.27" N, and between the longitudes 49°41′53.86" to 55°30′13.67" E. It is experiencing intensive wind erosion on the southeastern side (Segzi plain) and its northern parts. Based on a digital elevation model (DEM), there is high variability in altitude with maximum and minimum elevations ranging between 686 m (in the northern part of the study area and southern parts of Dasht-e-Kavir) to 4398 m (in the vicinity of the Dena Mountain in the southwestern part of the study area). The average annual precipitation ranges between 72 mm (in the eastern part with a corresponding annual mean temperature of 18 °C) and 320 mm (in the western part with an average annual temperature of 13 °C).  www.nature.com/scientificreports/ Soil characteristics. Seven soil characteristics (e.g., available water content (AWC) (Fig. 8a), bulk density ( Fig. 8b), calcium carbonate percentage (Fig. 8c), electrical conductivity (EC) (Fig. 8d), exchangeable sodium percentage (ESP) (Fig. 8e), organic carbon content (OCC) (Fig. 8f) and soil texture (Fig. 9a)) were extracted from the world soil map 30 and mapped by interpolation in ArcGIS 10.4.1. It should be noted that a total of 803 points (Fig. 7) were used for generating spatial maps.
Lithology and land use. Lithology (Fig. 9b) and land use (Fig. 9c) were mapped spatially based on the maps produced by the Forests, Rangelands, and Watershed Management Organization of Iran (FRWMOI).
Vegetation cover. The normalized difference vegetation index (NDVI) (Fig. 9d) 31 as the most common index used for the spatial mapping of vegetation cover was applied in our study. NDVI is the difference between the red (RED) and near-infrared (NIR) band combination divided by the sum of the red and near-infrared band combination (Eq. 1).

Elevation.
A digital elevation model (DEM) (Fig. 9e) for the study area was generated using shuttle radar topography mission (SRTM) images with a 30*30 m resolution 8 .

Climatic variables.
Wind speed (Fig. 9f) and precipitation (Fig. 10a) were used as climatic factors affecting wind erosion. The spatial maps of these variables were generated based on the daily average wind speed and total annual precipitation data from 23 meteorological stations located in the Isfahan province. All spatial maps of factors controlling wind erosion were generated in ArcGIS 10.4.1.

Inventory map of wind erosion. An inventory map shows regions with active three-stage processes,
comprising detachment, transportation, and sedimentation due to wind erosion. An inventory map is needed for predicting land susceptibility to wind erosion hazard. We used a map of regions with active wind erosion produced by the Forest, Rangeland and Watershed Management Organization of Iran (FRWMOI) (Fig. 10b). Based on the inventory map, wind erosion active regions covered ~ 10,961 km 2 (440 pixels) in the study area. Pixels with active wind erosion were randomly selected for the training (70% or 308 pixels) and validation (30% or 112 pixels) datasets for the ML models (Fig. 10c). Based on field work and FRWMOI, inventory map of wind erosion was generated in ArcGIS 10.4.1.
Multicollinearity among the factors controlling wind erosion. The tolerance coefficient (TC) (Eq. 2) and variance inflation factor (VIF) (Eq. 3) 8,15,32 were applied to examine multicollinearity among the factors for wind erosion in the Isfahan province.
(1) NDVI = (NIR b4 − RED b3 ) / (NIR b4 + RED b3 )  Background of the ML algorithms used. This section briefly describes the 16 individual regressionbased ML algorithms, which were adopted for mapping wind erosion hazard. These algorithms are available in the caret package, in R software.

Robust linear regression (RLR).
Robust regression is designed to overcome some limitations of traditional parametric and non-parametric methods. Available robust regression methods include M-estimates 33 , R-estimates 34 , least median of squares (LMS) estimates 35 , least trimmed squares (LTS) estimates and S-estimates 36 , generalized S-estimates (GS-estimates) 37 and MM-estimates 38 . We used a robust linear regression model with M-estimates for predicting land susceptibility to wind erosion.
Cforest. Random forest (RF), introduced by 39 , is the most popular method for regression and classification in decision tree learning 40 . RF makes a large number of decision trees in the training phase, and then by averaging the output values of the trees, the output of the model is finalized. Cforest is a type of RF commonly applied for prediction purposes 19 .
(3) VIF = 1 TC www.nature.com/scientificreports/ Non-convex penalized quantile regression (NCPQR). Quantile regression (QR) has gained considerable attention in different fields of modelling since the work of 41 . In comparison with mean regression (MR), QR provides an alternative that is more efficient when the error term follows a non-normal heavy-tailed distribution 42 . We used a penalized QR with a non-convex function 42 for mapping wind erosion hazard.
Neural networks (NN). NN can accurately approximate complicated non-linear input/output relationships 43 . The NN structure includes a set of interconnected units or neurons that estimates the non- www.nature.com/scientificreports/ linear correlations between each variable. The input neurons or predictor variables are connected to a single or multiple layers of hidden neurons, which are then linked to the output neurons 44 . We used a NN with the feature extraction algorithm (NNFE) 45 and a monotone multi-layer perception neural network (MMLPNN) 46 for mapping wind erosion hazard. The feature extractors used textural features based on the spatial relationships between pixels 45 .

Ridge regression with variable selection.
Ridge regression (RR), which was proposed by 47 , is expressed as follows (Eq. 4): Given a set of n vectors, x 1 , … , x n in R m , where m is the number of properties, and the dependent variable y i ∈ R, i = 1, …, n, the objective is to minimize the loss function, i.e., the discrepancy between the real values y i and the predicted values ỹ i = w.x.
We applied a RR model with a kernel function 48 as follows: where K(x, x i ) is the kernel function and β i is the weighting.
Generalized linear models (GLMs). GLMs have been applied to a wide range of research 49 . GLMs have three components, comprising an observation model, a linear predictor, and an invertible link function 50 . Using boosting with GLMs can improve prediction accuracy 23 . We applied two GLMs; boosting GLM (BGLM) and negative binomial GLM (NBGLM) 51 .
Generalized additive models (GAMs). GAMs 52 can be expressed as follows: with where Y i is the ith value of the response variable from an exponential distribution family (EF) with a location parameter ( µ i ) and a scale parameter ( ∅),Z * i indicates the ith row of a parametric model matrix with the vector
Stochastic gradient boosting (SGB). SGB or gradient boosting machine, proposed by 24 is a hybrid algorithm that combines both the advantages of bagging and boosting. This model makes additive regression models by the least-squares at each iteration. where w n indicates the model weights and K (. , .) is a kernel function. We applied two algorithms, SVM with linear kernel function and RVM with polynomial kernel function.
Cubist. Cubist, a rule-based regression tree algorithm, is based on the M5 theory 56 . This model involves four main steps as follows: (1) growing a tree by branching data, (2) developing the model, (3) pruning the tree, and (4) smoothing the tree 57 .
Adaptive network-based fuzzy inference system (ANFIS). This model has been applied in different sciences. ANFIS works based on the fussy if/then rules 58 : Rule 1 : if (x is A 1 ) and y is B 1 then f 1 = p 1 x + q 1 y + r 1 (11) Rule 2 : if (x is A 2 ) and y is B 2 then f 2 = p 2 x + q 2 y + r 2 www.nature.com/scientificreports/ where x and y are as input parameters for FIS, f as FIS output, A and B are fuzzy sets, and p, q, and r are parameters. In all 16 models, the predicted values for pixels ranged between 0-1. Therefore, we can divide susceptibility predictions into four classes (low (0-0.25), moderate (0.25-0.50), high (0.50-0.75) and very high (0.75-1)).

Assessment of model performance.
In order to evaluate model performance in predicting land susceptibility to wind erosion hazard in the study area, three statistical methods comprising root mean square error (RMSE), mean absolute error (MAE) 59,60 and mean bias error (MBE) were used: (v k − v p ) Figure 11. Flowchart of the methodology for mapping of wind erosion hazard.
Scientific Reports | (2020) 10:20494 | https://doi.org/10.1038/s41598-020-77567-0 www.nature.com/scientificreports/ where m is number of the observations, v k and v p indicate the measured and predicted values, respectively. Also, a Taylor diagram was applied as a further test for assessing the performance of individual regression-based ML models 14 .

Prioritization of the factors controlling wind erosion.
Among the 16 ML models tested, a model with the lowest error (RMSE, MAE, and MBE) was applied to quantify the relative importance of the factors controlling wind erosion. In this study, MMLPNN had the lowest error (with RMSE, MAE, and MBE < 0.002%) and was therefore applied for determining the relative importance of the factors for wind erosion. A brief overview of the main steps used in our methods is presented in Fig. 11.