Effects of non-landslide sampling strategies on machine learning models in landslide susceptibility mapping

This study aims to explore the effects of different non-landslide sampling strategies on machine learning models in landslide susceptibility mapping. Non-landslide samples are inherently uncertain, and the selection of non-landslide samples may suffer from issues such as noisy or insufficient regional representations, which can affect the accuracy of the results. In this study, a positive-unlabeled (PU) bagging semi-supervised learning method was introduced for non-landslide sample selection. In addition, buffer control sampling (BCS) and K-means (KM) clustering were applied for comparative analysis. Based on landslide data from Qiaojia County, Yunnan Province, China, collected in 2014, three machine learning models, namely, random forest, support vector machine, and CatBoost, were used for landslide susceptibility mapping. The results show that the quality of samples selected using different non-landslide sampling strategies varies significantly. Overall, the quality of non-landslide samples selected using the PU bagging method is superior, and this method performs best when combined with CatBoost for predicting (AUC = 0.897) landslides in very high and high susceptibility zones (82.14%). Additionally, the KM results indicated overfitting, displaying high accuracy for validation but poor statistical outcomes for zoning. The BCS results were the worst.


Study area
Qiaojia County is located in the northeastern part of Yunnan Province, China, and belongs to the city of Zhaotong. It lies between longitudes 102°52′E and 103°26′E and latitudes 26°32′N and 27°25′N and covers an area of 3245 km² (Fig. 1). By the end of 2020, the county had 17 towns, 192 administrative villages, and a total population of approximately 625,000. Qiaojia County is bordered by rivers on three sides: the Jinsha River in the north and west and the Niulan River in the northeast. The terrain, which has been shaped by erosion and dissolution by the Jinsha and Niulan rivers, is complex. With strong neotectonic movement, Qiaojia County is one of the key prevention areas for geological hazards in Yunnan Province.

Data sources and impact factor processing
Selecting appropriate impact factors is an important step in landslide susceptibility mapping 30,31. In selecting the impact factors, we considered field investigations, study area characteristics, relevant literature, data availability, and the quality of the acquired data. Fifteen impact factors were selected from five aspects (topography and geomorphology, geological structure, hydrology and ecology, human activities, and earthquake conditions) for landslide susceptibility mapping: elevation, slope, aspect, profile curvature (PC), terrain ruggedness index (TRI), lithology, distance to faults (DTF), soil type, average annual precipitation (AAP), topographic wetness index (TWI), distance to rivers (DTRI), normalized difference vegetation index (NDVI), distance to roads (DTR), land use type, and peak ground acceleration (PGA).
The sources of the impact factors were as follows. A digital elevation model (DEM) for the Qiaojia area was acquired from the China Geospatial Data Cloud site (http://www.gscloud.cn). Based on this DEM, the elevation, slope, aspect, PC, TRI and TWI were extracted. The lithology and faults were derived from the 1:200,000 geological map of China, and the lithology description is shown in Table 1. The NDVI was extracted from Landsat-8 OLI images (http://www.gscloud.cn). Soil type and precipitation were provided by the Resource and Environmental Science Data Center of the Chinese Academy of Sciences (http://www.resdc.cn). River and road data were obtained from OpenStreetMap (http://download.geofabrik.de/asia/china.html). Land use type data were extracted from 30 m land cover data (https://doi.org/10.5281/zenodo.4417810) 32. PGA was derived from the United States Geological Survey (https://earthquake.usgs.gov/earthquakes/eventpage/usb000rzmg/shakemap). Using ArcGIS software, all the impact factors were converted into raster format with a cell size of 30 m × 30 m and placed in the same projected coordinate system (Fig. 2).

Methodology
Landslide data collected in Qiaojia County in 2014 were taken as the research object. First, the study area was divided into the landslide area and the remaining area using the landslide data, and impact factors were collected and preprocessed from five aspects (topography and geomorphology, geological structure, hydrology and ecology, human activities, and earthquake conditions). Second, landslide samples were selected in the landslide area, and non-landslide samples were selected in the remaining area by PU bagging, BCS and K-means, respectively, to build three sample datasets. Finally, the three sample datasets were combined with three machine learning models (RF, SVM, CatBoost) to map and evaluate landslide susceptibility, with the confusion matrix and ROC curve used to verify accuracy. The flowchart of the research method is shown in Fig. 3.

PU bagging
PU bagging is a semi-supervised iterative classification algorithm 33,34. The landslide sample data are learned first, and the learned knowledge is then used to classify the unlabeled samples. In this way, the probability of landslides occurring outside the mapped landslide areas is quantified, and non-landslide samples are selected in areas with low probability values, which improves the quality of the selected samples. The specific steps are as follows (Fig. 4):
(1) A number of unlabeled samples equal to the number of landslide samples is randomly drawn and temporarily treated as non-landslide samples; together with the landslide samples, they form a training sample set.
(2) A decision tree is trained on the training sample set to generate a classifier.
(3) The classifier predicts the unlabeled samples that were not drawn (the out-of-bag samples), and the predicted value is taken as the probability that each sample belongs to the landslide samples.
(4) Steps (1)-(3) are repeated until a probability has been calculated for all unlabeled samples.
The average probability obtained from these repeated calculations is used as the final landslide probability of each unlabeled sample, which mitigates prediction uncertainty and the risk of overfitting.
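The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the base learner is a depth-one decision tree written inline so that the sketch has no dependencies (the study does not report its decision tree settings), and the names `fit_stump`, `pu_bagging`, `n_rounds`, and the toy one-feature data are hypothetical.

```python
import random


def fit_stump(rows, labels):
    """Fit a depth-one decision tree on 0/1 labels and return a scoring function."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            p_left = sum(left) / len(left)
            p_right = sum(right) / len(right) if right else 0.5
            # Sum of squared errors; for 0/1 labels it ranks splits like weighted Gini impurity.
            error = sum((y - p_left) ** 2 for y in left) + sum((y - p_right) ** 2 for y in right)
            if best is None or error < best[0]:
                best = (error, j, t, p_left, p_right)
    _, j, t, p_left, p_right = best
    return lambda row: p_left if row[j] <= t else p_right


def pu_bagging(positives, unlabeled, n_rounds=50, seed=0):
    """Return the mean out-of-bag landslide score of every unlabeled sample."""
    rng = random.Random(seed)
    sums = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Step 1: draw as many unlabeled samples as there are positives, label them 0.
        drawn = set(rng.sample(range(len(unlabeled)), len(positives)))
        rows = positives + [unlabeled[i] for i in drawn]
        labels = [1] * len(positives) + [0] * len(drawn)
        # Step 2: train the base classifier.
        score = fit_stump(rows, labels)
        # Step 3: score only the out-of-bag unlabeled samples.
        for i, row in enumerate(unlabeled):
            if i not in drawn:
                sums[i] += score(row)
                counts[i] += 1
    # Step 4: average over all rounds in which a sample was out of bag.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data (hypothetical): one feature; known landslides have high values.
positives = [[0.8 + 0.01 * i] for i in range(10)]
unlabeled = [[0.01 * i] for i in range(30)] + [[0.85], [0.9]]  # last two resemble landslides
scores = pu_bagging(positives, unlabeled)
```

Unlabeled samples with low averaged scores are the candidates for non-landslide samples; in this toy case the two hidden landslide-like points receive high scores and would be excluded.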

Buffer control sampling
The BCS method was inspired by the first law of geography 35, which implies that areas closer to landslides are more prone to landslides, and vice versa. The principle by which this method selects non-landslide samples is simple and easy to implement. In this method, which is the most commonly used in landslide susceptibility mapping, buffer zones are established around all landslide sample points, and the area outside the buffer zones is considered the non-landslide area. Non-landslide samples are then randomly selected from this area, as shown in Fig. 5. The buffer distance is determined according to the scale of the landslides. However, areas outside the buffer zones may contain ancient or potential landslides. When potential landslide samples are mislabeled as non-landslide samples, the model becomes harder to train and is misled during learning, which ultimately reduces the accuracy of the final predictions.
www.nature.com/scientificreports/

K-means clustering
The KM clustering method is an unsupervised classification algorithm that is applicable to the classification of unlabeled sample data 36. It does not need to know the label (landslide or non-landslide) of each sample when training the model. Instead, it groups samples into categories according to the attribute values of their impact factors. If the similarity within each category is high and the difference between categories is large, the classification result can be considered good.
The specific process of the KM clustering method in non-landslide sample selection is as follows. First, the study area is converted into individual samples, and the impact factor attribute values of the samples are used as input data. Then, the KM clustering method is used to classify the sample data into several categories. Finally, the number of landslide samples in each category is counted, and the category with the fewest landslide samples is selected as the data source for the non-landslide samples. The non-landslide samples selected using this method are highly similar to each other, so they represent only a portion of the non-landslide areas and cannot fully reflect the complexity and variation of those areas. When the non-landslide samples are insufficiently representative, the model may not adequately learn their characteristics and may overfit by learning excessively from the landslide samples during training.

Landslide susceptibility mapping based on machine learning models
Random forest
RF is a highly representative bagging ensemble algorithm that consists of multiple decision trees and is widely used in landslide susceptibility mapping 37. It builds multiple independent decision trees in parallel and then derives the final prediction from the predictions of the individual trees by voting. Each tree is constructed from a number of randomly selected impact factors, from which an optimal impact factor is selected whenever a node in the decision tree splits. The optimal impact factor can be determined using the information entropy or Gini index, which indicates the correlation between the impact factor and the predicted result 38,39. Compared with a single decision tree, RF has a stronger generalization ability and reduces the risk of overfitting by averaging decision trees.

Support vector machine
SVM is a machine learning model that follows the principle of structural risk minimization 40,41. It maps the landslide sample data from a low-dimensional space to a high-dimensional space, which converts a nonlinear classification problem in the low-dimensional space into a linear classification problem in the high-dimensional space. By finding an optimal hyperplane, the landslide and non-landslide data are separated by the maximum margin. The kernel function is the core of the SVM, and common choices include linear, polynomial, radial basis, and sigmoid functions. With a suitable kernel function, both linear and nonlinear classification problems can be handled.
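For reference, the four kernel families named above can be written out as follows. This is only a sketch of the textbook forms; the parameter names `gamma`, `coef0`, and `degree` follow common convention and are assumptions, since the study does not report its kernel settings.

```python
import math


def linear_kernel(x, z):
    """Plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, z> + coef0) ** degree"""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2); equals 1 when the two vectors coincide."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, z> + coef0)"""
    return math.tanh(gamma * linear_kernel(x, z) + coef0)
```

Each function returns a similarity between two samples; the SVM only ever needs these pairwise values, which is why the high-dimensional mapping never has to be computed explicitly.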

CatBoost
The CatBoost model is a modification of the gradient boosting decision tree (GBDT) algorithm framework 42. Compared with the mainstream GBDT algorithms (extreme gradient boosting and light gradient boosting machine), CatBoost has three main advantages. First, it handles categorical factors using target statistics, so the categorical data do not have to be converted into numerical data in advance. Second, it uses an ordered boosting framework to solve the gradient estimation bias problem and reduce the complexity of the algorithm. Finally, the complete binary tree used in the CatBoost model reduces the occurrence of overfitting and increases the speed of prediction 43.
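The idea behind target statistics can be illustrated with a short sketch. It encodes each categorical value using only the labels of samples that appear earlier in the sequence, which is the ordering principle CatBoost relies on. The function name, the `prior` and `weight` parameters, and the lithology labels are hypothetical, and the library itself works on random permutations of the data rather than on the given order.

```python
def ordered_target_statistic(categories, targets, prior=0.5, weight=1.0):
    """Encode each category from the targets of earlier samples with the same category."""
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; falls back to the prior when nothing was seen.
        encoded.append((seen_sum + weight * prior) / (seen_count + weight))
        # Only now is the current sample's own label added to the running totals.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


# Hypothetical lithology classes and landslide labels (1 = landslide).
lithology = ["shale", "shale", "basalt", "shale"]
labels = [1, 0, 0, 1]
encoded = ordered_target_statistic(lithology, labels)
```

Because a sample's own label never enters its encoding, the encoded value does not leak the target, which is the bias that ordered schemes are designed to avoid.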

Accuracy verification
Landslide susceptibility prediction is a binary classification problem (landslide or non-landslide), and the confusion matrix and ROC curve are the most commonly used evaluation tools 44-47. The confusion matrix clearly shows how the samples of each category were classified. We use several metrics to evaluate the performance of the model, including sensitivity, specificity, precision, accuracy, and F1-score. The five metrics vary between 0 and 1, and larger values indicate better model prediction performance 48. The ROC curve is based on the confusion matrix and reflects the true positive rate (TPR, sensitivity) and false positive rate (FPR, 1 - specificity) under different thresholds. Each inflection point of the ROC curve has an FPR value as its x-coordinate and a TPR value as its y-coordinate 49. The area under the ROC curve (AUC) measures the prediction performance of the model. The AUC value is between 0 and 1, and the larger the AUC value is, the better the prediction performance of the model 50,51. The confusion matrix of landslides and non-landslides is shown in Table 2, and the equation for each metric is shown in Table 3.
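The five metrics and the AUC can be computed as in the sketch below. It uses the standard definitions, which the equations in Table 3 are assumed to follow. In the example call, 47 of 56 test landslides are correctly identified, as reported for CatBoost with PU bagging samples; the false positive and true negative counts are hypothetical placeholders.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def auc(labels, scores):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# tp and fn follow the text (47 of 56 landslides found); fp and tn are hypothetical.
metrics = confusion_metrics(tp=47, fn=9, fp=10, tn=46)
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The pairwise form of the AUC is equivalent to the area under the ROC curve and makes its meaning explicit: it is the probability that a randomly chosen landslide sample receives a higher score than a randomly chosen non-landslide sample.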

Sample dataset construction
First, the sample data needed for the model were prepared. The landslide samples were obtained from a detailed survey of geological hazards conducted in 2014 by the Yunnan Institute of Geological Sciences, which yielded a total of 188 landslide points. An equal number of non-landslide samples were selected to form a sample set, 70% of which were used as training samples, and the remaining 30% were used as test samples 52,53. A total of 264 training samples and 112 test samples were obtained. The sampling results for the three non-landslide sampling methods are as follows.
(1) PU bagging method to construct non-landslide samples.
The grid corresponding to the study area was converted into individual sample points, giving a total of 3,586,374 sample points. To improve operational efficiency, 1 million sample points (188 landslide samples and 999,812 unlabeled samples) were extracted from all the data for the experiment. To ensure the accuracy of sample selection, the model was first trained. We selected 70% of the samples as training data and 30% as test data (56 samples were drawn from the 188 landslide samples, and 56 samples were randomly selected from the 999,812 unlabeled samples). Then, the trained model was used to calculate the probability of each unlabeled sample being a landslide. These steps were repeated five times, and the average of the five probability values was used as the final probability value. Finally, the probability threshold for landslide occurrence was set to 0.5: samples exceeding this threshold were classified as landslide samples, and those equal to or below it were classified as non-landslide samples. The recall rate, which is the ratio of the number of correctly predicted landslide samples to the total number of landslide samples, was used to verify the model training results. It was chosen as the evaluation basis because only the landslide samples were known among all samples. The recall rate of the test samples was 0.95, indicating that the model predicts landslide samples well and can be used to select non-landslide samples from the unlabeled sample set based on probability values. In the end, 272,008 samples in the unlabeled sample set were classified as landslide samples and 727,880 as non-landslide samples. The 188 non-landslide samples were then randomly selected from these 727,880 samples (Fig. 6a).
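The bookkeeping in this procedure (split sizes, averaging over five runs, thresholding at 0.5, and recall on the known landslides) can be sketched as follows. The function names and the toy probability values are hypothetical, and the split is assumed to be made per class, which is what reproduces the reported 56 + 56 test samples and 264 training samples.

```python
# Per-class 70/30 split of 188 landslide and 188 drawn unlabeled samples (assumed stratified).
n_per_class = 188
test_per_class = round(n_per_class * 0.30)      # 56.4 rounds to 56
train_per_class = n_per_class - test_per_class  # 132
n_train, n_test = 2 * train_per_class, 2 * test_per_class


def average_runs(runs):
    """Average the per-sample probabilities over repeated runs."""
    return [sum(values) / len(values) for values in zip(*runs)]


def split_by_threshold(probabilities, threshold=0.5):
    """Indices above the threshold are landslide-like; the rest are non-landslide candidates."""
    landslide_like = [i for i, p in enumerate(probabilities) if p > threshold]
    non_landslide = [i for i, p in enumerate(probabilities) if p <= threshold]
    return landslide_like, non_landslide


def recall(landslide_probabilities, threshold=0.5):
    """Share of known landslide samples that the model scores above the threshold."""
    hits = sum(p > threshold for p in landslide_probabilities)
    return hits / len(landslide_probabilities)


# Toy example: five runs over four unlabeled samples (values are hypothetical).
runs = [[0.75, 0.25, 0.5, 1.0]] * 3 + [[1.0, 0.0, 0.5, 0.5]] * 2
final_probabilities = average_runs(runs)
landslide_like, non_landslide = split_by_threshold(final_probabilities)
toy_recall = recall([0.9, 0.7, 0.4, 0.8])
```

Note that a sample with a probability of exactly 0.5 falls on the non-landslide side, matching the rule stated in the text.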
(2) Buffer control sampling method to construct non-landslide samples.
The BCS sample set was constructed on the basis of the landslide samples. A total of 188 samples outside the 500 m buffer zone of the landslide points were randomly selected as non-landslide samples. To prevent the selected non-landslide samples from being too close to one another, the minimum distance between them was set to 500 m (Fig. 6b).
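A minimal sketch of this selection rule is given below, assuming point coordinates in a projected system with units of metres. The function name, the greedy acceptance order, and the toy grid are assumptions for illustration; the study performed this step in GIS software and does not describe its exact procedure.

```python
import math
import random


def buffer_control_sample(landslides, candidates, n_samples,
                          buffer_m=500.0, min_spacing_m=500.0, seed=0):
    """Randomly pick candidates outside every landslide buffer and not too close together."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # random visiting order gives a random selection
    selected = []
    for point in pool:
        # Reject points that fall inside the buffer of any landslide.
        if any(math.dist(point, landslide) <= buffer_m for landslide in landslides):
            continue
        # Reject points closer than the minimum spacing to an already selected sample.
        if any(math.dist(point, other) < min_spacing_m for other in selected):
            continue
        selected.append(point)
        if len(selected) == n_samples:
            break
    return selected


# Toy layout (hypothetical): one landslide at the origin, candidates on a 250 m grid.
landslides = [(0.0, 0.0)]
candidates = [(float(x), float(y))
              for x in range(-1500, 1501, 250)
              for y in range(-1500, 1501, 250)]
samples = buffer_control_sample(landslides, candidates, n_samples=5)
```

If fewer than `n_samples` points satisfy both constraints, the function returns a shorter list, so a caller would need to check the length of the result.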
(3) K-means clustering method to construct non-landslide samples.
The KM clustering algorithm used the same data as the PU bagging method, namely the 1 million samples. The attribute values of all samples were input into the KM clustering algorithm, and the number of clusters was set to 5. To select a non-landslide sample set, the number of landslides in each cluster was counted. The number of landslides in each cluster and the relative landslide ratio are shown in Table 4. The cluster with the fewest landslide samples and the lowest relative landslide ratio was selected as the source of non-landslide samples. The table shows that cluster 3 met these requirements: it had the fewest landslide samples (9), and its relative landslide ratio was also the lowest. Therefore, 188 samples were randomly selected from cluster 3 as non-landslide samples (Fig. 6c).
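Once every sample has a cluster label, choosing the source cluster is a counting exercise, sketched below. The relative landslide ratio is assumed here to be the number of landslide samples in a cluster divided by the cluster size; the definition used in Table 4 may differ. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def pick_non_landslide_cluster(cluster_labels, is_landslide):
    """Return the cluster with the fewest landslides (ties broken by ratio) and all statistics."""
    sizes = Counter(cluster_labels)
    landslides = Counter(c for c, flag in zip(cluster_labels, is_landslide) if flag)
    # For each cluster: (landslide count, landslide count / cluster size).
    stats = {c: (landslides[c], landslides[c] / sizes[c]) for c in sizes}
    best = min(stats, key=lambda c: stats[c])
    return best, stats


# Toy example with three clusters (labels are hypothetical).
cluster_labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
is_landslide = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
best_cluster, cluster_stats = pick_non_landslide_cluster(cluster_labels, is_landslide)
```

Non-landslide samples would then be drawn at random from the samples whose label equals `best_cluster`.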

Landslide susceptibility assessment
To enhance model performance, the best hyperparameters for each model were determined using a Bayesian optimization algorithm 54 and then substituted into the models. All models were built in Python using the PyCharm software. Test samples were used to validate the prediction accuracy of the models.

Accuracy assessment
Five metrics, namely sensitivity, specificity, precision, accuracy, and F1-score, were used to validate the accuracy of the nine models (Table 5). The results differed significantly among the non-landslide sample selection methods. Taken together, the five indicators show that KM yields the highest values, followed by PU bagging, and that BCS yields the lowest. Because the problem is landslide prediction, sensitivity (the proportion of landslides that were successfully predicted) was examined separately, and it gives the same ranking. However, the zone statistics in section "Statistical analysis by zone" show that the KM prediction results are overfitted. Notably, the prediction accuracy of the PU bagging method is superior, being 0.089 higher than that of BCS. Regarding the machine learning models, for the PU bagging samples, the SVM model performs best in terms of specificity and precision, whereas the CatBoost model yields the highest sensitivity, accuracy, and F1-score. Specificity reflects how well non-landslide samples are predicted, and precision is the proportion of predicted landslides that are actual landslides. Because landslides are highly hazardous, precision is important, but greater attention should be given to the correct identification of landslides. The SVM identified 43 landslides, whereas CatBoost identified 47, indicating that the CatBoost model performs better. For the BCS samples, the SVM model outperformed both the RF and CatBoost models, which each showed strengths and weaknesses across different metrics. For the KM samples, RF performed best in terms of specificity, precision, accuracy, and F1-score, whereas CatBoost had the highest sensitivity, correctly predicting 94.6% of landslides. An accuracy assessment was also performed using ROC curves, and the results are shown in Fig. 7.
Overall, the AUC values varied widely among models. From the perspective of non-landslide samples, the AUC values obtained with different non-landslide sampling strategies differed widely. The KM clustering results were the highest, the PU bagging results were the second highest, and the BCS results were the lowest. The values obtained with the same strategy differed less. This shows that the non-landslide sampling strategy has a large impact on the prediction results. From the perspective of machine learning models, CatBoost always displayed excellent prediction performance. With PU bagging sampling, the AUC values of the different models differed by a maximum of 0.032. Because the data used for accuracy verification were test samples, a statistical analysis by zone was conducted to further explore the prediction performance of the three non-landslide sample selection methods across the study area.

Statistical analysis by zone
The trained classifier was used to predict the study area and generate the landslide susceptibility prediction map of Qiaojia County. The landslide susceptibility probability map was divided into five classes according to the equal interval method 55: very low (0-0.2), low (0.2-0.4), moderate (0.4-0.6), high (0.6-0.8), and very high (0.8-1) (Fig. 8).
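The equal interval classification, together with the share of cells or landslide points falling in each class, can be sketched as follows. The function names and sample values are hypothetical, and the sketch assigns a probability that lies exactly on a class boundary to the upper class, which the text does not specify.

```python
from collections import Counter

ZONES = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_zone(probability):
    """Map a probability in [0, 1] to one of five equal-width classes."""
    index = min(int(probability * 5), 4)  # a probability of 1.0 stays in the top class
    return ZONES[index]


def zone_ratios(probabilities):
    """Fraction of the given probabilities that falls in each susceptibility class."""
    counts = Counter(susceptibility_zone(p) for p in probabilities)
    return {zone: counts[zone] / len(probabilities) for zone in ZONES}


# Hypothetical predicted probabilities for four raster cells.
cell_probabilities = [0.1, 0.3, 0.7, 0.95]
area_ratio = zone_ratios(cell_probabilities)
```

Applying `zone_ratios` to the predicted probabilities of all raster cells gives an area ratio per zone, and applying it to the probabilities at the landslide points gives a landslide ratio per zone.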
In the zone statistics, two indicators were used for each susceptibility zone, the area ratio and the landslide ratio, and the results are shown in Fig. 9. The susceptibility maps show that more than 90% of the area in the results obtained using the KM clustering method was classified as high or very high susceptibility. There were no very low or very high susceptibility areas in the KM_RF and KM_CatBoost maps. These prediction results are missing certain zones, which obviously does not match the actual situation. The KM clustering method, which had the highest AUC value, gave the worst prediction results for the study area and only the illusion of better prediction accuracy, whereas the results of the other two non-landslide sampling methods were distributed among all five classes. In terms of the landslide ratio, the BCS value dropped suddenly in the very high susceptibility zone, whereas the PU bagging value was the largest. In general, landslides should occur in the very high susceptibility zone. In the very high and high susceptibility zones combined, the percentage of landslides based on the BCS method was less than 66.1%. In contrast, the percentage was higher than 66.1% based on the PU bagging method, and the best value was 82.14%. Overall, the best landslide susceptibility results were obtained using the PU bagging method.

It was found that the non-landslide sampling strategy has an important impact on the prediction results. For landslide susceptibility mapping, only the landslide area sample data are known, and non-landslide samples are not directly available. The area beyond the landslides contains both non-landslide and potential landslide areas. Whether the selected non-landslide samples can represent the whole research area affects the learning and generalization abilities of the model.
The spatial distribution of the sample dataset constructed based on PU bagging was inferred, as shown in Fig. 10a. The non-landslide samples were randomly selected from the sample points with a probability less than 0.5, the data quality of both landslide and non-landslide samples was relatively high, and the distribution was balanced. Because the quality of the training sample data was high and the characteristics of the landslide and non-landslide samples were relatively easy to separate, the calculated AUC values were relatively high. When the selected training samples represent the research area, the model learns more easily. The statistical results of the landslide susceptibility classification obtained by the PU bagging method show that its landslide susceptibility zoning also conforms to basic laws. Some of the landslides were in very low susceptibility areas because they occurred on slopes behind buildings. Models tend to overlook special cases when they learn general laws. Landslides are a natural hazard whose occurrence does not follow a fixed law, so certain special cases exist.
The spatial distribution of the sample dataset constructed based on BCS can be assumed, as shown in Fig. 10b. Non-landslide samples were randomly sampled outside the landslide buffer zone. They may contain many false non-landslide samples, and some of the non-landslide samples will have similar characteristics to the landslide samples. When some potential landslide samples are regarded as non-landslide samples, the learning difficulty of the model increases, and it becomes difficult to find regularity. When the number of false non-landslide samples reaches a certain level, the learning process of the model is misled. Thus, the model's prediction ability is insufficient, and the prediction accuracy is reduced. Overlaying the landslide data on the landslide susceptibility map and examining the susceptibility classes to which the landslide sites were assigned showed that the overall predictive ability of the BCS method was insufficient. Except for the high landslide occurrence area in the eastern part of Qiaojia County, landslides in other regions were not well predicted.
The spatial distribution of the sample dataset constructed based on KM clustering can be assumed, as shown in Fig. 10c. KM clustering uses distance as its similarity index. There is little similarity between different categories and high similarity within each category. The non-landslide samples selected according to this method are highly similar to each other, so they represent only a part of the non-landslide area. Thus, the complexity and variation of the non-landslide area cannot be fully reflected. The difference between the landslide and non-landslide samples was obvious, so the AUC value obtained with these training samples was very high, causing the illusion of high predictive power. The attribute features of the non-landslide samples were simple and easy for the model to learn, so the model preferred to learn the features of the landslide samples. During model learning, there was little interference from the non-landslide samples, so susceptibility in the study area was easily overestimated. The susceptibility classification results also showed that approximately 90% of the study area was predicted as high or very high susceptibility, but the apparently high predictive ability was due to model overfitting. In addition, most of the area in the zoning results fell into only two of the susceptibility zones, which is obviously unreasonable.

The impact of different machine learning models on landslide susceptibility mapping
We evaluated the performance of the machine learning models using several confusion matrix metrics (sensitivity, specificity, precision, accuracy, and F1-score) and ROC curves. No single model is optimal for all metrics. In view of the overfitting exhibited by KM, the discussion focuses on the relationship between the different machine learning models and the two remaining methods, PU bagging and BCS. For the RF model, the PU bagging method is best in terms of specificity, precision, accuracy, and F1-score, and the BCS method is best in terms of sensitivity. For the SVM model, PU bagging performed best under all metrics. For the CatBoost model, PU bagging also performs best. Because the metrics in the confusion matrix emphasize different aspects, they may conflict with one another. Therefore, the appropriate metrics should be selected according to the requirements of the practical application. In terms of the number of correctly predicted landslides (sensitivity), CatBoost combined with PU bagging predicted the most (83.9%). In addition, the model that gives the optimal results is not fixed, and this problem has been reported in several previous modeling studies 56. This is because the classification criteria of models vary for different datasets and are influenced by the structure and underlying mechanisms of the models. The results show that the accuracy of the SVM and CatBoost models is higher than that of RF. However, in the validation using ROC curves, the results display some regularity. The CatBoost model always maintains excellent prediction performance regardless of the sample dataset. CatBoost is a GBDT framework that uses oblivious trees as base learners. This model can efficiently and effectively handle categorical factors and solve the gradient bias and prediction shift problems. Thus, this approach reduces the occurrence of overfitting and improves the accuracy and generalization ability of the model. In addition, when evaluating model performance, the actual application of a model should be taken into account, and model performance should be analyzed comprehensively. In general, model performance is evaluated using training data or test data, but it is important to avoid overly general or biased conclusions, as occurred with KM clustering.

Conclusions
To overcome the difficulty of selecting high-quality non-landslide samples, an innovative hybrid model combining PU bagging and machine learning was proposed. In addition, BCS and KM were applied for comparative analysis. Based on landslide data from Qiaojia County, Yunnan Province, China, collected in 2014, three machine learning models, namely, RF, SVM, and CatBoost, were used for landslide susceptibility mapping. Then, the performance of the different non-landslide sampling strategies was evaluated using the analysis results. The results of the study showed the following: (1) In machine learning models, there is a significant difference in the results obtained with different non-landslide sampling strategies, indicating that the quality of the selected non-landslide samples affects the effectiveness of model training and prediction. However, the AUC values calculated from the same non-landslide sampling strategy displayed relatively minor differences.

Figure 1. Location map of the study area: (a) administrative boundaries map of China, (b) administrative boundaries map of Yunnan Province, and (c) a digital elevation model of Qiaojia County, where triangles show landslides in the study area. (Created using ArcGIS v10.2 29).

Figure 3. Flowchart of the methods.

Figure 5. Schematic diagram of the BCS method.

Figure 10. The spatial distribution of samples based on different non-landslide sampling methods: (a) PU bagging, (b) buffer control sampling, and (c) k-means clustering.

Table 1. Description of stratum lithology in Qiaojia County.

Table 2. Landslide and non-landslide confusion matrix.

Table 3. Quantitative evaluation metrics for accuracy verification.

Table 4. Statistical table of k-means clustering analysis.

Table 5. Model performance based on several evaluation metrics. Maximum values are in [bold].