Assessing Susceptibility of Debris Flow in Southwest China Using Gradient Boosting Machine

A gradient boosting machine (GBM) was developed to model the susceptibility of debris flow in Sichuan, Southwest China, for risk management. A total of 3839 debris-flow events during 1949–2017 were compiled from the Sichuan Geo-Environment Monitoring program, field surveys, and satellite imagery interpretation. In cross-validation, the GBM outperformed the benchmark models (Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Artificial Neural Network), with a prediction accuracy of 82.0% and an area under the curve of 0.88. The elevation range, precipitation, and aridity index played the most important roles in determining the susceptibility. In addition, the water erosion intensity, road construction, channel gradient, and human settlement sites also contributed substantially to the formation of debris flow. The susceptibility map produced by the GBM shows that the spatial distribution of high-susceptibility watersheds was highly coupled with the locations of the topographical extreme belt, fault zone, seismic belt, and dry valleys. This study provides critical information for the mitigation and prevention of debris-flow risk.


Results and Discussion
Predictive performance. In the cross-validation, the final GBM model performed well in predicting the susceptibility of debris flows, with an AUC of 0.88 and an accuracy of 82.0% (Table 1). The prediction accuracy for the watersheds without debris-flow observations (85.4%) was higher than that for the watersheds with observed debris flows (73.5%), indicating that the predictions tended to be biased towards low susceptibility. As the important hyperparameters of the GBM model, the number of trees and the tree depth were tuned to 700 and 10, respectively. The hyperparameter tuning process was essential for improving the predictive performance of the GBM model. Through variable selection, the final GBM model retained 37 of the 72 predictors in the initial GBM model; during the selection, the prediction deviance initially fluctuated and then increased dramatically after 35 iterations (Fig. 1). Variable selection reduced the data requirement and avoided spurious details in estimating the susceptibility of debris flow. Because of the difficulty in data collection, the debris-flow events were compiled from multiple sources over a long time span. As this study focused on the spatial pattern of debris flow, the effects of data-source inconsistency were assumed to be negligible. The GBM model was superior to the benchmark models (i.e., LR, KNN, SVM, and ANN) in predicting the susceptibility of debris flow (Table 2). For the KNN model, the best predictive performance was achieved when the number of neighbors was 15. For the SVM model, the kernel, gamma, and cost of constraint violation were tuned to radial, 0.01, and 10, respectively. For the ANN model, the number of units in the hidden layer was set to 3, and the decay was set to 0.1.
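As a minimal sketch of how the metrics reported above (overall accuracy, accuracy for watersheds with and without observed debris flows, and AUC) can be computed from predicted probabilities, the following Python example may be helpful. The 0.5 probability threshold, the function names, and the toy labels and probabilities are assumptions for illustration only; they are not taken from the study, which was implemented in R.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels (1 = debris flow observed)."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 0:
            tn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy_report(y_true, y_prob, threshold=0.5):
    """Overall accuracy plus the accuracy within each observed class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # watersheds with debris flow
        "specificity": tn / (tn + fp),  # watersheds without debris flow
    }


def auc(y_true, y_prob):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 watersheds with debris flow, 5 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.45]
report = accuracy_report(y_true, y_prob)
toy_auc = auc(y_true, y_prob)
```

In this toy case the specificity exceeds the sensitivity, which is the same kind of asymmetry as the 85.4% versus 73.5% class-wise accuracies reported for the GBM.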
Previous studies also found that GBM models performed better in simulating the susceptibility of debris flows than SVM and mixture discriminant analysis did, although the research domains of these studies were different 33,34. In the future, more comprehensive model comparisons will be necessary to guide model selection for simulating debris flow.
www.nature.com/scientificreports
… exhibited an importance value of 4.1 (Fig. 2). The elevation range plays a critical role in the formation of debris flow by determining the level of potential energy. A larger elevation difference leads to higher potential energy, creating favorable conditions for debris flows. Debris flows mainly occurred in the mountainous areas, as well as the surroundings of undulating plateau 38,39. A previous study found that debris flow tended to happen when the height difference exceeded 300 m 38. In our study area, more than 97% of the valley river basins where debris flows occurred had a height difference of 400 to 4000 m. In addition, the channel gradient provides the conditions for converting the potential energy of loose material in the watershed into kinetic energy. It has been acknowledged that a higher channel gradient favors the occurrence of debris flow 40.
Figure 2. Variable importance plot for the gradient boosting machine predicting the susceptibility of debris flow in Sichuan. The relative importance values are normalized to sum to 100 for more intuitive interpretation. Please refer to Table 4 for the description of the variable acronyms.
The maximum daily rainfall was the second most important predictor variable, with an importance value of 8.6, while the importance values of the annual rainfall and the maximum 3-day rainfall were 2.9 and 2.5, respectively (Fig. 2). Rainfall is one of the essential trigger factors of debris flow 41,42. Heavy rainfall, as indicated by the maximum daily rainfall, tends to trigger debris flow when source materials are abundant. The maximum 3-day rainfall, which covers a longer time span, is supplementary to the maximum daily rainfall. The annual rainfall and the aridity index together reflect the long-term dry-wet condition.
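The variable importance plot normalizes the relative importance values so that they sum to 100. A minimal Python sketch of that normalization and of the resulting ranking is given below. The predictor names and raw scores are hypothetical placeholders, not values from the study, and the function names are assumptions introduced for illustration.

```python
def normalize_importance(raw):
    """Rescale non-negative raw importance scores so that they sum to 100."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("raw importance scores must have a positive sum")
    return {name: 100.0 * value / total for name, value in raw.items()}


def rank_predictors(relative):
    """Return (name, importance) pairs sorted from most to least important."""
    return sorted(relative.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores (for example, summed deviance reductions per predictor).
raw = {
    "elev_range": 30.0,
    "max_daily_rain": 20.0,
    "aridity": 18.0,
    "clay": 12.0,
}
rel = normalize_importance(raw)
ranking = rank_predictors(rel)
```

Because the values are relative, an importance of 8.6 for a single predictor among several dozen is well above the average share, even though it looks small in absolute terms.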
The aridity index, with an importance value of 8.0, was the third most important predictor variable (Fig. 2). Extremely arid climates have been found to be highly associated with occurrences of debris flows, which are usually caused by extremely dry periods followed by wet seasons 43,44. A drought background or dry-wet alternating climate conditions aggravate soil cracks, change the structure and composition of soil, and lower the rainfall thresholds for triggering debris flows. Drought degrades vegetation cover, weakens soil structure, and increases the amount of loose solid material (varied debris and disturbed soil) prone to debris flows 45–47. Debris flows were found to occur more frequently on the sunny side than on the shady side of a mountain, suggesting that hydrothermal conditions, particularly droughts, influenced occurrences of debris flows 47.
The water erosion intensity and the negative effects of anthropogenic activity were also important factors for the susceptibility of debris flows. As indicated above, Sichuan lies in the transition area between the Qinghai-Tibet Plateau and the plain region. Previous studies showed that the soil erosion rate was 0.5–7 mm/y in the Qinghai-Tibet Plateau from 30 Ma (million years) ago to the present 47–49. While the rock and soil types play a critical role in the formation and accumulation of surface sediments, rapid soil erosion provides massive unconsolidated materials, which are source materials for debris flows. Earthquakes induce secondary disasters such as landslides, which provide debris flows with source materials; this impact was indicated by the seismic intensity. In addition, anthropogenic activities such as road construction and land overexploitation accelerate soil erosion and consequently exacerbate debris flow 50, which is reflected by the high importance of the national road length, the number of settlement sites, and the population density (Fig. 2).
As the predictor variables with respect to soil types, the area proportions of clay, silt, and sand were of relatively low importance to the susceptibility of debris flow, with importance values of 2.3, 2.2, and 1.3, respectively. Soil types directly affect the sediment concentration of debris flow, which in turn influences its size and flow state. The clay content influences the formation of debris flow by affecting its initiation, especially for viscous debris flow 51. A moderate clay content is an essential precondition for forming large-scale debris flow with a high sediment concentration. Under the effect of precipitation, loose clay soil expands after absorbing water, leading to an increase in pore pressure and failure of viscous resistance, which accelerates the formation of debris flow.
The effects of the factors influencing debris-flow formation were complicated by non-linearity and interactions. It was therefore important to identify the key controlling factors. According to the present modeling of debris-flow susceptibility in Sichuan Province, topographic conditions, geological background, precipitation, and anthropogenic activities played important roles in the formation of debris flow. In addition, the susceptibility of debris flow was associated with drought conditions, road construction, soil types, and land use, which were indispensable factors in evaluating the susceptibility of debris flow at the regional level.
Susceptibility mapping. The debris-flow susceptibility map was constructed from the output of the GBM model. It shows the spatial distribution of the susceptibility, which was classified into five categories: very low, low, moderate, high, and very high (Fig. 3a). Table 3 shows the areas and numbers of watersheds by susceptibility category. The watersheds of very low susceptibility occupy the largest area (226,600 km²) and are the most numerous (1,342), accounting for 47% of the study area. These watersheds are mainly distributed in the western plateau and mountainous areas, as well as the eastern plain and hilly areas. The moderate-susceptibility watersheds are the fewest (212) and cover the smallest area (33,500 km²; 7% of the study area). The watersheds with high or very high debris-flow susceptibility (110,100 km²), accounting for 22% of the total area, are mainly distributed in the central mountainous region across Sichuan from north to south. These areas are located in the lower reaches of the Yalong River and the Dadu River, and the upper reaches of the Minjiang River near the Wenchuan earthquake region.
Table 3. Classification of the predicted susceptibility of watershed-based debris flow by using the gradient boosting machine. a N/A: Not applicable. The debris-flow formation conditions were inadequate in the plateaus and plains, and thus these areas were excluded from the susceptibility modeling.
The susceptibility map of watershed-based debris flow evaluated by GBM was considerably different from those evaluated by the benchmark models (Fig. 3). The number of watersheds with very high susceptibility was largest as predicted by GBM (297), followed by ANN (243), SVM (234), LR (194), and KNN (130). The watersheds with high or very high susceptibility predicted by GBM were more concentrated near the Wenchuan earthquake region than those predicted by the other models. In addition, KNN overestimated the areas with moderate susceptibility and underestimated the areas with very low or very high susceptibility, indicating that KNN did not perform well in mapping the susceptibility of debris flow. The map predicted by the GBM model characterized, both qualitatively and quantitatively, the spatial distribution of debris-flow susceptibility for the watersheds.
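The mapping step, in which predicted probabilities are grouped into five susceptibility categories and summarized by watershed count and area, can be sketched as follows. The class breaks used in the study are not stated in this section, so the equal-interval breaks below (0.2, 0.4, 0.6, 0.8) are purely hypothetical, as are the function names and the toy watershed records.

```python
from bisect import bisect_right
from collections import Counter

LABELS = ["very low", "low", "moderate", "high", "very high"]
BREAKS = [0.2, 0.4, 0.6, 0.8]  # hypothetical equal-interval breaks, not from the study


def classify(prob, breaks=BREAKS):
    """Map an occurrence probability in [0, 1] to a susceptibility category.

    A probability equal to a break falls into the higher category.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return LABELS[bisect_right(breaks, prob)]


def summarize(watersheds, breaks=BREAKS):
    """Return {category: (number of watersheds, total area in km2)}.

    `watersheds` is an iterable of (probability, area_km2) pairs.
    """
    counts = Counter()
    areas = Counter()
    for prob, area_km2 in watersheds:
        label = classify(prob, breaks)
        counts[label] += 1
        areas[label] += area_km2
    return {label: (counts[label], areas[label]) for label in LABELS}


# Toy records (hypothetical): predicted probability and watershed area.
toy = [(0.05, 100.0), (0.10, 50.0), (0.50, 30.0), (0.90, 20.0)]
summary = summarize(toy)
```

Running the same summary on the predictions of each model would give a table of counts and areas per category, which is the kind of comparison made between GBM and the benchmark models above.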
The western part of the study area is mainly located in the hinterland of the Qinghai-Tibet Plateau, where the topography is dominated by plateaus and hilly areas with gentle relief. Except for some deep-cut river valleys, the environmental conditions there were insufficient for the development of debris flow, and the watersheds were dominated by those with very low susceptibility. The eastern part of the study area mainly covers the Sichuan Basin and hilly landforms, where the topography varies little. Other than the moderate-susceptibility watersheds in the Qujiang River basin, most of the watersheds in eastern Sichuan were of low or very low susceptibility.
The watersheds of high debris-flow susceptibility were mainly concentrated in the western part of the study area. Topographically, the highly susceptible areas were located in the topographic belt transitioning from the Tibetan Plateau to the Sichuan Basin. In the Hengduan Mountains, which run from north to south, the terrain is fragmented and the hills are steep, creating adequate conditions for debris flows. The fault zones of Longmenshan, Xianshuihe, and Anninghe are distributed in a "Y" shape (shaded area in Fig. 4), which is generally consistent with the seismic zones. In these zones, earthquakes and rock fractures occurred frequently, together with a number of secondary mountain disasters, providing abundant source materials for debris flows.
In addition, the high-susceptibility areas were coupled with the dry valley landscape in the study area. Among those areas, the Yalong River and its tributaries (including the Anning River Valley), the Dadu River, the upper reaches of the Min River, and the middle and lower reaches of the Jinsha River were the areas where debris flows were concentrated, and they were also identified as areas with high or very high susceptibility of debris flow. Dry valleys with fragile ecosystems and severe soil erosion are found along all of the Yalong, Dadu, Min, and Jinsha rivers. The dry valleys were affected by local circulation and forming activity. Evaporation in the valleys was far greater than precipitation, so vegetation was hard to grow and soil erosion was severe. Moreover, in the dry valleys, inappropriate cultivation, such as steep slope reclamation and smooth slope cultivation, led to severe gravity erosion prone to the formation of debris flow. Meteorologically, heavy rains tended to trigger debris flows in these areas.
In addition, the construction of roads and hydropower stations was intensive in these areas and tended to aggravate the susceptibility of debris flows.
In general, the spatial distribution of high debris-flow susceptibility in Sichuan Province overlapped to a degree with the topographical extreme belt, fault zone, seismic belt, and dry valleys. Prevention and control of debris-flow risk in the study area should focus on these four types of highly coupled areas to prevent or mitigate sudden mass casualties caused by debris flow. We studied the spatial distribution of debris flow for the watersheds in Sichuan Province and identified the critical areas for the monitoring and early warning of debris flow. The results have practical significance and social benefits for disaster prevention and reduction.

Conclusions
On the basis of the comprehensive dataset associated with debris flows, a GBM model was developed to simulate the susceptibility of debris flows in Sichuan, Southwest China. The GBM model showed strong predictive performance by adequately capturing the complex relationships between the predictor and response variables, and was superior to the benchmark models (i.e., LR, KNN, SVM, and ANN). The elevation range, maximum daily …
Model description. For simulating the susceptibility of debris flow (i.e., occurrence probability) by watersheds, a GBM model was trained to minimize a loss (deviance) function 62. For k = 1 to K, the steps below were repeated in order to obtain f_K(x):
• Draw a subsample from the training dataset at random without replacement.
• Use the model updated at step k−1 to calculate the residuals (ỹ_j) for this subsample.
• Develop a new classification tree ρ_k to fit ỹ_j.
• Update the model by adding the fitted tree with a shrinkage rate (default: λ = 0.05).
The model output was the occurrence probability or susceptibility of debris flow.
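The boosting steps listed above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study, which relied on the R packages gbm and dismo. It assumes the Bernoulli deviance L = −2 Σ_j [y_j f(x_j) − log(1 + exp(f(x_j)))], for which the residual (the negative gradient, up to a constant factor) is ỹ_j = y_j − p_j with p_j = 1/(1 + exp(−f(x_j))). For brevity it fits depth-one regression trees (stumps) to the residuals and takes a plain gradient step with shrinkage λ, whereas the study tuned deeper trees. All function names and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, r):
    """Least-squares stump on residuals r: (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - lm) ** 2 for ri in left)
                   + sum((ri - rm) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no split possible: fall back to a constant prediction
        mean = sum(r) / len(r)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbm(X, y, n_trees=200, shrinkage=0.05, subsample=0.5, seed=0):
    """Stochastic gradient boosting for binary y (both classes must occur)."""
    rng = random.Random(seed)
    n = len(X)
    p = sum(y) / n
    f0 = math.log(p / (1.0 - p))  # initial model: log-odds of the base rate
    F = [f0] * n                  # current model value f_{k-1}(x_j)
    stumps = []
    m = max(2, int(subsample * n))
    for _ in range(n_trees):
        idx = rng.sample(range(n), m)                # subsample, no replacement
        resid = [y[i] - sigmoid(F[i]) for i in idx]  # residuals from f_{k-1}
        stump = fit_stump([X[i] for i in idx], resid)  # new tree rho_k
        stumps.append(stump)
        for i in range(n):                           # f_k = f_{k-1} + lambda * rho_k
            F[i] += shrinkage * predict_stump(stump, X[i])
    return f0, shrinkage, stumps


def predict_proba(model, row):
    """Occurrence probability from the boosted log-odds."""
    f0, shrinkage, stumps = model
    return sigmoid(f0 + shrinkage * sum(predict_stump(s, row) for s in stumps))


# Toy data (hypothetical): one predictor, events only at larger values.
X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10
model = fit_gbm(X, y)
p_low = predict_proba(model, [2.0])
p_high = predict_proba(model, [17.0])
```

The shrinkage rate trades off step size against the number of trees: a smaller λ generally needs a larger K to reach the same deviance, which is one reason the number of trees is tuned by cross-validation.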
Hyperparameter tuning and variable selection were performed to further refine the GBM model. The hyperparameters, including the number of trees (K) and the tree depth, were set to the values at which the prediction deviance reached its minimum in the 10-fold cross-validation (explained in the next subsection). Similarly, the predictor variables of the GBM model (initially 72 variables) were selected by using a backward selection strategy, in which the least important variable (explained in the next subsection) was removed from the model one at a time. The set of predictor variables with the lowest prediction deviance in the cross-validation was selected to build the final GBM. The R packages gbm and dismo were used for training the GBM model and making predictions 62,63. The R package doParallel was used to run the modeling process in parallel to reduce the computing time 64.
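The backward selection strategy described above can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the study's R workflow: `cv_deviance` and `importance` are hypothetical callables standing in for a cross-validated model fit and its variable importance, and the toy scoring functions and predictor names are invented so that the example runs on its own. The helper `kfold_indices` only shows how observations could be split into 10 folds; a real `cv_deviance` would refit the model on each training split and average the held-out deviance.

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of nearly equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def backward_select(variables, cv_deviance, importance):
    """Remove the least important variable one at a time.

    Returns the subset with the lowest cross-validated deviance, that
    deviance, and the (number of variables, deviance) history.
    """
    current = list(variables)
    best_vars, best_dev = list(current), cv_deviance(current)
    history = [(len(current), best_dev)]
    while len(current) > 1:
        scores = importance(current)
        weakest = min(current, key=lambda v: scores[v])
        current.remove(weakest)
        dev = cv_deviance(current)
        history.append((len(current), dev))
        if dev < best_dev:
            best_vars, best_dev = list(current), dev
    return best_vars, best_dev, history


# Toy stand-ins (hypothetical): three informative predictors, two noise ones.
SIGNAL = {"elev_range": 3.0, "max_daily_rain": 2.0, "aridity": 1.5}
NOISE = {"noise_a": 0.1, "noise_b": 0.05}


def toy_importance(variables):
    return {v: SIGNAL.get(v, NOISE.get(v, 0.0)) for v in variables}


def toy_cv_deviance(variables):
    signal = sum(SIGNAL[v] for v in variables if v in SIGNAL)
    n_noise = sum(1 for v in variables if v in NOISE)
    return 10.0 - signal + 0.5 * n_noise  # noise variables add spurious error


all_vars = list(SIGNAL) + list(NOISE)
best_vars, best_dev, history = backward_select(
    all_vars, toy_cv_deviance, toy_importance
)
folds = kfold_indices(25, k=10)
```

In the toy history the deviance first decreases as noise variables are dropped and then rises sharply once informative predictors are removed, which is the general shape the selection procedure looks for.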
With the same data and predictor variables, the GBM model was compared with four benchmark models, namely LR, KNN, SVM, and ANN, to evaluate its performance in predicting the susceptibility of debris flow. LR is a generalized linear model for classification fitted by maximum likelihood. KNN, a non-parametric …
… Table 4 for the description of the variable acronyms. The color of each grid cell represents the correlation strength (annotated on the bottom bar) of the two variables labelled in the leftmost and topmost ends.