A comprehensive transferability evaluation of U-Net and ResU-Net for landslide detection from Sentinel-2 data (case study areas from Taiwan, China, and Japan)

Earthquakes and heavy rainfalls are the two leading causes of landslides around the world. Since they often occur across large areas, landslide detection requires rapid and reliable automatic detection approaches. Currently, deep learning (DL) approaches, especially different convolutional neural network and fully convolutional network (FCN) algorithms, are reliably achieving cutting-edge accuracies in automatic landslide detection. However, these successful applications of various DL approaches have thus far been based on very high resolution satellite images (e.g., GeoEye and WorldView), which makes it easier to achieve such high detection performances. In this study, we use freely available Sentinel-2 data and an ALOS digital elevation model to investigate the application of two well-known FCN algorithms, namely the U-Net and residual U-Net (or so-called ResU-Net), for landslide detection. To our knowledge, this is the first application of FCN for landslide detection only from freely available data. We adapt the algorithms to the specific aim of landslide detection, then train and test them with data from three different case study areas located in Western Taitung County (Taiwan), Shuzheng Valley (China), and Eastern Iburi (Japan). We evaluate sample patches of three different window sizes to train the algorithms. Our results also contain a comprehensive transferability assessment achieved through different training and testing scenarios in the three case studies. The highest f1-score value of 73.32% was obtained by ResU-Net, trained with a dataset from Japan, and tested on China’s holdout testing area using the sample patch size of 64 × 64 pixels.


Materials and methods
Study areas and inventory maps. Eastern Iburi (Japan). On September 6, 2018, an earthquake with a magnitude (Mw) of 6.6 struck Eastern Iburi, Hokkaido, Japan. It resulted in widespread destruction, including power cuts, the destruction of power distribution networks, and damage to the Tomato-Atsuma Power Station, which provides electricity for Hokkaido Island 38,39. Furthermore, the earthquake caused several deep-seated and shallow landslides. Of the 41 deaths related to this earthquake, 36 were due to landslides 40. Following the earthquake, almost 5600 landslides occurred near Atsuma town. The main reason for the significant number of landslides was typhoon Jebi, which brought torrential rainfalls to the region the day before the earthquake and soaked the region's subsurface, making it more prone to landslides 40,41. Since the depth of the surface soil layer varies from 4 to 5 m, most of the landslides were shallow and primarily affected the hilly regions between the elevations of 200 and 400 m 38. In addition to shallow landslides, the area was also affected by planar and spoon-type deep-seated landslides 41. The landslide inventory map for this region was generated by the Geographical Survey Institute (GSI) of Japan using aerial ortho-photographs 42. Zhang et al. 43 updated the landslide inventory map by performing spatial analyses on VHR aerial images and a 10-m resolution DEM. As a result, 5625 features were reported as landslides over an area of 46.3 km². In this study, we used their landslide geodatabase for Eastern Iburi in ESRI shapefile format, and then extracted all landslides (almost 4940 features with a total area of 43.17 km²) within our pre-defined study area (Fig. 1C) in Eastern Iburi. The maximum, minimum, and mean areas of the landslide features are 569,904.02 m², 89.6 m², and 8688.8 m², respectively.
Western Taitung county (Taiwan). Natural hazards such as earthquakes, floods, and landslides are common phenomena in Taiwan 44. In August 2009, Morakot, the deadliest typhoon in Taiwan's recorded history, struck the country. It led to 652 deaths, 47 missing people, and damage to property and infrastructure of over 3 billion USD 44. In 5 days, typhoon Morakot brought over 2884 mm of precipitation to southern Taiwan, which caused severe floods and induced more than 22,705 landslides covering an area of 274 km². The landslides were mainly characterized as shallow, but some deep-seated landslides also occurred across the mountainous regions of southern Taiwan 45. For this study, we chose an area in the western region of Taitung County in the south of the country as a case study area. Since this study's objective was not to map all the landslides in southern Taiwan (Fig. 1a), we selected a region with an area of 467.91 km² to train and test our model. Unlike the Eastern Iburi case study, there was no proper database on which to base a landslide inventory map for southern Taiwan. So, based on Google Earth's archive images (2011-2013), we digitized every landslide within our selected area. In total, we digitized 895 landslides with a total area of 31. 33
Sentinel-2 multispectral imagery. The Google Earth Engine (GEE) environment is a cloud system developed by Google for processing satellite images. We used GEE to acquire cloud-free Sentinel-2 images. Currently, GEE provides two Sentinel-2 products called Level-1C and Level-2A, whereby the former has global coverage but no atmospheric correction and the latter includes atmospheric correction but does not have global coverage. Therefore, we retrieved and masked the Sentinel-2 Level-1C product for each study site and then imported it into the Sen2Cor 50 plugin in the SNAP software developed by the European Space Agency (ESA) to apply atmospheric corrections.
We used the Sentinel-2A product rather than Sentinel-2B because it had less cloud cover for all three sites. The Copernicus Sentinel-2 mission is a constellation of two polar-orbiting satellites for earth observation. It uses Multispectral Instrument (MSI) sensors to acquire optical imagery. These images are acquired at various spatial resolutions ranging from 10 to 60 m, in 13 bands of the visible, near-infrared, and short-wave infrared electromagnetic spectrum 48. The constellation has two satellites, Sentinel-2A and 2B, at antipodal points in the same orbit, which provides a high revisit frequency of 5 days. The main themes of the Sentinel-2 mission are climate change, land monitoring, and emergency management, which includes mapping and monitoring landuse/landcover changes, forests, farmlands, water resources, and natural hazards. More information on the Sentinel-2 constellation is available in the User Handbook 49. In this study, we only used the high-resolution image bands blue (2), green (3), red (4), and near-infrared (8) of Sentinel-2A with a 10-m spatial resolution to acquire imagery for the three study areas. Table 1 provides information on the acquired images for each site.
Fully convolutional network (FCN). One of the common ways in which DL approaches learn to deal with features with various shapes and sizes is by increasing the depth of the algorithm and using more convolutional layers. However, adding more layers usually causes a training degradation problem 35. Long et al. 50 introduced the FCN, which does not have any fully connected (FC) layer, but instead uses convolutional and upsampling layers in its place to increase the training capability. The FCN is designed to represent image-to-image mapping, which is suitable for calculating per-pixel probability labels 51. Therefore, sample image patches of arbitrary size can be used as the input for the algorithm 52. Since there is no FC layer in an FCN, the algorithm can learn to recognize various representations of the local spatial input 53. Moreover, FCNs also have skip connections between the downsampling and upsampling paths for refining image segmentation 35,54. Of the various types of FCN algorithms, the U-Net has received much interest. It is a simple and effective algorithm for feature extraction that can be trained with a limited number of sample patches 34,55.
U-Net. U-Net was initially introduced in the context of bio-medical image segmentation 51 and further adopted in a variety of semantic segmentation tasks, generally achieving good performances 56. The U-Net architecture consists of a contracting path (an encoder) to capture low-level representations, along with an expanding path (a decoder) to capture the high-level ones 34. While the expanding path is an asymmetrical structure for recovering the feature localization information that is lost during contraction 57, the contracting path is similar to the standard CNN architectures, made up of consecutive convolution blocks. Each block contains two convolutional layers with a filter size of 3 × 3, leveraging a rectified linear unit (ReLU) activation function, and a max-pooling layer with a filter size of 2 × 2 and a stride of 2. After each convolution block, the number of feature maps is doubled, and a total of 512 feature maps are generated after the last block. The expanding path is an inverted form of the contracting one, whereby the input to a certain decoder block is represented by the concatenation of the previously outputted feature map and the corresponding output of the encoder block at the same level. The number of feature maps over the expanding path is halved after each block 51. Overall, the U-Net algorithm implemented in our case comprises a total of 23 convolutional layers, including 19 convolutional and 4 convolutional-transpose ones. The U-Net structure is summarized in Table 2.
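The layer counts quoted for the U-Net can be cross-checked with a little arithmetic. The sketch below is only an illustration: the starting width of 32 feature maps, the five encoder levels, and the single 1 × 1 output convolution are assumptions chosen so that the deepest block reaches the 512 feature maps stated in the text; the actual configuration is the one listed in Table 2.

```python
def unet_layout(base_filters=32, depth=5, convs_per_block=2):
    """Return feature-map counts and layer totals for a plain U-Net.

    base_filters and depth are assumptions (not taken from the paper),
    picked so that the deepest block has 512 feature maps.
    """
    # Contracting path: the number of feature maps doubles at every block.
    encoder = [base_filters * 2 ** level for level in range(depth)]
    # Expanding path: mirrors the encoder without the deepest block,
    # so the number of feature maps is halved at every block.
    decoder = encoder[-2::-1]
    # Two 3x3 convolutions per block, plus one assumed 1x1 output layer.
    n_conv = convs_per_block * (len(encoder) + len(decoder)) + 1
    # One transposed convolution (upsampling) in front of each decoder block.
    n_transpose = len(decoder)
    return encoder, decoder, n_conv, n_transpose


encoder, decoder, n_conv, n_transpose = unet_layout()
print("encoder feature maps:", encoder)
print("decoder feature maps:", decoder)
print("conv layers:", n_conv, "transpose layers:", n_transpose,
      "total:", n_conv + n_transpose)
```

Under these assumptions the totals agree with the 19 convolutional and 4 convolutional-transpose layers reported above.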
Residual U-Net. The ResU-Net design is a variant of the U-Net algorithm that leverages residual learning blocks. This modification has been shown to often improve learning performance 58 and can help avoid the vanishing gradient problem 59. The architecture of a residual neural block is described as a stacked sequence of residual units, whereby a single residual unit is defined as:
$$y_i = h(x_i) + F(x_i) \qquad (1)$$
$$x_{i+1} = f(y_i) \qquad (2)$$
whereby $x_i$ and $x_{i+1}$ refer to the input and output of the $i$-th residual unit; $f(y_i)$ and $F(\cdot)$ are the activation and the residual functions, respectively; and $h(\cdot)$ is the identity mapping $h(x_i) = x_i$. The convolutional layer reduces the spatial resolution of the applied sample patch image in the feature maps, so that the dimension of the input ($x_i$) might be higher than that of the output ($F(x_i)$). Therefore, a linear projection $W_i$ is applied to match the dimensions of the input and output of the convolutional layers (see Fig. 2).
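The residual unit can be illustrated with a small sketch. It assumes the usual formulation in which the shortcut h(x_i) is added to the output of the residual branch F(x_i) and the activation f is applied to the sum. The function names, the toy residual branch, and the slicing used as a stand-in for the linear projection are all hypothetical and only serve to show the data flow; real residual branches are stacks of convolutional layers.

```python
def relu(values):
    """Rectified linear unit applied element-wise to a list of floats."""
    return [max(0.0, v) for v in values]


def residual_unit(x, residual_fn, activation=relu, shortcut=lambda v: v):
    """Compute x_{i+1} = f(h(x_i) + F(x_i)) for one residual unit."""
    fx = residual_fn(x)   # residual branch F(x_i)
    hx = shortcut(x)      # identity mapping h(x_i), or a projection
    if len(fx) != len(hx):
        raise ValueError("shortcut and residual branch must have equal size")
    y = [h + f for h, f in zip(hx, fx)]
    return activation(y)


def toy_branch(x):
    """Hypothetical residual branch standing in for convolutional layers."""
    return [-0.5 * v for v in x]


# Identity shortcut: input and output have the same size.
x_next = residual_unit([2.0, -4.0, 1.0], toy_branch)
print(x_next)

# If the branch halves the size (as a stride-2 convolution would), the
# shortcut must be projected to the same size; slicing is a toy stand-in.
halved = residual_unit(
    [2.0, -4.0, 1.0, 3.0],
    residual_fn=lambda v: [0.0 for _ in v[::2]],
    shortcut=lambda v: v[::2],
)
print(halved)
```

The second call shows why a projection is needed on the shortcut whenever the residual branch changes the size of its input.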
In the ResU-Net, the 2 × 2 max-pooling layer is absent, and the downsampling is instead obtained with a convolution stride of 2. Moreover, a batch normalization (BN) procedure is inserted before each convolutional layer. Finally, the identity mapping h(x i ) adds the input of a block to its output. The expanding path comprises three residual learning blocks, each of which is preceded by a corresponding upsampling layer (Conv2DTranspose). To generate the probabilities of the semantic categorization, a final convolutional layer with a 1 × 1 filter size and a sigmoid activation function is added on top of the ResU-Net architecture to associate each pixel with an output probability value between 0 and 1. Thereby, the probability of a pixel belonging to each of the pre-defined segmentation categories is reported, which is relevant for solving our defined classification problem. The overall network structure of our applied ResU-Net consists of 15 convolutional layers, as listed in Table 3. To give a visual overview of the network architecture, Fig. 3 shows a schematic representation of the U-Net and ResU-Net structures.
Evaluation metrics. The resulting binary maps of areas detected as landslides were compared with the ground truth inventories in the holdout testing areas to calculate the precision, recall, and f1-score accuracy assessment metrics. The metrics were calculated based on true positives (TPs), which are the correctly detected landslide areas, false positives (FPs), which are the non-landslide areas that have been incorrectly detected as landslides, and false negatives (FNs), which are the landslide areas that have not been detected by the algorithm. The precision metric denotes the proportion of the areas detected as landslides that are actually landslide areas. The recall, also known as the sensitivity metric, is the proportion of the ground truth landslide areas that were detected by the algorithm. The f1-score is a quantitative metric that is useful to assess the balance between precision and recall (see Eqs. 3-5).
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$
$$\text{f1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$
Implementation details. Each of the three images under examination (representing three different study areas) was divided into a training area and a testing area in a 3:2 ratio. This study is the first attempt to use only 10-m resolution satellite imagery for landslide detection using FCNs, and so far there is no established optimal sample patch size for this purpose. Therefore, three different window sizes (32 × 32, 64 × 64, and 128 × 128 pixels) were used to generate the input sample patches by applying a regular grid approach without any overlap or data augmentation. In total, 7888, 1972, and 592 sample patches were generated based on the window sizes of 32 × 32, 64 × 64, and 128 × 128 pixels, respectively (see Fig. 4).
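The precision, recall, and f1-score described in the evaluation metrics above can be sketched in a few lines. This is a minimal illustration using the standard definitions based on TP, FP, and FN; the tiny binary maps are made-up toy data, and the function names are hypothetical rather than part of the study's code.

```python
def confusion_counts(predicted, truth):
    """Count TP, FP and FN pixels between two binary maps (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1        # landslide correctly detected
            elif p and not t:
                fp += 1        # non-landslide detected as landslide
            elif t and not p:
                fn += 1        # landslide missed by the algorithm
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy 3 x 4 maps (1 = landslide, 0 = non-landslide), for illustration only.
predicted = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
truth = [[1, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

tp, fp, fn = confusion_counts(predicted, truth)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, precision, recall, f1)
```

In practice the same counts would be accumulated over all pixels of a holdout testing area rather than over a single small map.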
To assess the transferability of the models and the impact of different areas on the network performance, the following scenarios were defined, and each of them was evaluated on the testing areas of Taiwan, China, and Japan:
• Scenario 1: training the models on the collection of the three training sets.
• Scenario 2: training the models on the Taiwan training set.
• Scenario 3: training the models on the China training set.
• Scenario 4: training the models on the Japan training set.
Moreover, the four scenarios were tested for each of the three different window sizes; therefore, each testing area was described by a total of 16 different results (see Fig. 5).
We used the binary cross-entropy for both U-Net and ResU-Net. The cross-entropy was used as the loss function to penalize, at each pixel position x, the deviation of the predicted probability of the true label, p l(x) (x), from the highest probability of 1 using Eq. (6), where K is the number of classes and w is a weight map, which is introduced to give some pixels more importance than others in the training process 51. The models were trained by backpropagation through mini-batch stochastic training and the Adam optimization algorithm 60, setting a learning rate of 0.001 (with β 1 = 0.9 and β 2 = 0.999). The batch size was chosen to include four images per step; the optimal results were derived by following an early stopping approach based on the evaluation loss.
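To make the role of the weight map concrete, the following is a minimal sketch of a pixel-wise weighted binary cross-entropy. It assumes the common form in which the negative log of the probability assigned to the true label is weighted per pixel and then normalized by the sum of the weights; the normalization, the clipping constant, and the function name are assumptions for illustration and not necessarily the exact form of Eq. (6).

```python
import math


def weighted_bce(probabilities, labels, weights=None, eps=1e-7):
    """Weighted binary cross-entropy over a flat list of pixels.

    probabilities: predicted landslide probability per pixel (sigmoid output)
    labels:        1 for landslide, 0 for non-landslide
    weights:       optional per-pixel weight map (defaults to all ones)
    """
    if weights is None:
        weights = [1.0] * len(labels)
    total = 0.0
    for p, y, w in zip(probabilities, labels, weights):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        p_true = p if y == 1 else 1.0 - p    # probability of the true label
        total += -w * math.log(p_true)       # penalize deviation from 1
    return total / sum(weights)


good = weighted_bce([0.9, 0.1], [1, 0])
bad = weighted_bce([0.1, 0.9], [1, 0])
# Up-weighting the wrongly classified pixel increases the loss.
emphasized = weighted_bce([0.9, 0.4], [1, 1], weights=[1.0, 3.0])
plain = weighted_bce([0.9, 0.4], [1, 1])
print(good, bad, plain, emphasized)
```

A loss of this kind is what an optimizer such as Adam would minimize during training; the optimizer settings themselves are those given in the text.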

Results and accuracy assessment
We used the same input data to train and test both algorithms in different scenarios to compare their performance and transferability. The algorithms were evaluated on the defined testing areas using the precision, recall, and f1-score accuracy assessment metrics.

Scenario 1 and accuracy.
In this scenario, both the U-Net and the ResU-Net algorithms were trained with a dataset containing sample patches from all training areas of Taiwan, China, and Japan. The testing procedure was carried out on the datasets with sample patches from the holdout testing area of each study area and with data from all our study areas. The algorithms were trained and tested separately based on sample patch window sizes of 32 × 32, 64 × 64, and 128 × 128 pixels. For simplicity, Fig. 6 only shows the results of the ResU-Net trained with all training datasets using a sample patch size of 64 × 64 pixels. The accuracy assessment metrics were calculated for each resulting landslide detection map (see Table 4). ResU-Net obtained the higher f1-score values of just under 73% and 71.29% when tested on Taiwan's testing area using sample patch sizes of 64 × 64 pixels and 32 × 32 pixels, respectively [...] sample patch size of 128 × 128 pixels. Higher recall values were generally obtained by testing the algorithms on the case study area of Taiwan, and higher precision values were achieved in that of China. This means that when the algorithms were trained with all training datasets, they were able to detect most of the landslide areas in Taiwan, and they yielded fewer incorrectly identified landslide areas in China than in Taiwan.

Scenario 2 and accuracy.
To evaluate the generalisation performance of the algorithms, they were also trained separately with data from each individual training area of each case study. Figure 7 shows the results for only an enlarged part of the Taiwan testing area. According to this figure, both algorithms, when trained and tested on the Taiwan datasets, produced most of their false positives within the runouts. The accuracy of the algorithms that were trained only with the training dataset of the case study area of Taiwan was also evaluated on the Taiwan, China, Japan, and all test datasets separately, and the resulting values are given in Table 5. ResU-Net again obtained the higher f1-score values of 72.68% and 70.54% when tested on Taiwan's testing area using sample patch sizes of 64 × 64 pixels and 32 × 32 pixels, respectively. This algorithm also resulted in the highest recall value of 80.72% in the same study area using a sample patch size of 32 × 32 pixels, followed by 79.65% achieved with a sample patch size of 128 × 128 pixels using the U-Net. Therefore, training the algorithms with the Taiwan dataset and testing them against its test area was helpful for these two sample patch sizes in reducing the number of landslide areas that were missed. However, the proportion of detected areas that were actual landslides was considerably higher in Japan using the sample patch size of 128 × 128 pixels, which achieved a precision value of well over 82%.

Scenario 3 and accuracy.
In this scenario, both the U-Net and the ResU-Net algorithms were trained with the China training dataset (see Fig. 8), where the ResU-Net algorithm was again able to achieve the highest f1-score value of 72.9% (see Table 6). The accuracy assessment metric values of this scenario fluctuated around 69% for the U-Net algorithm with three different sample patch sizes, and the highest one was over 70% for a window size of 32 × 32 pixels. The U-Net algorithm also achieved the highest recall value of approximately 91%, but in Taiwan's holdout testing area. The ResU-Net also showed a high recall value of 89.92% in Taiwan. The highest recall value of the U-Net algorithm was achieved with a 64 × 64 pixel sample patch size, whereas the highest recall value of the ResU-Net algorithm was achieved with a 128 × 128 pixel sample patch size. As in the previous scenario, high precision values were achieved with both algorithms in Japan's case study area, and the highest precision value of 81.25% was achieved by the U-Net with a window size of 128 × 128 pixels. However, the resulting recall values in Japan were lower than those of the other study areas, which means that although most of the areas detected as landslides in Japan were correct, the algorithms at the same time missed a number of landslide areas whose characteristics were not represented in the Chinese training dataset.

Scenario 4 and accuracy.
Although the last scenario, namely training the algorithms using the Japan dataset, showed a fairly good accuracy for Japan itself (see Fig. 9 and Table 7), the highest accuracies achieved overall were those tested on the China dataset. The highest precision (71.82%) and f1-score (73.32%) values were obtained by ResU-Net using a sample patch size of 64 × 64 pixels. The highest recall values were achieved with U-Net in China's case study area and were 81.18% and 78.84% for window sizes of 32 × 32 and 64 × 64 pixels, respectively. The f1-score values obtained on China's holdout testing area fluctuated around 70%, whereas those of Japan were around 60%. Moreover, this scenario was not successful when tested on Taiwan's case study area, as this dataset yielded the lowest f1-score, precision, and recall values of 31.2%, 27.07%, and 24.7%, respectively.

Discussion
Transferability assessment. We comprehensively assessed the performance and transferability of the algorithms by comparing the landslide detection results of all the different possible combinations of training and testing areas. Therefore, while the first scenario is based on collective training on all the case study areas, the other scenarios each focus the training process on one specific area while still evaluating it on all three different study areas. Figure 10 shows the overall mean f1-score values for U-Net and ResU-Net in each of the four scenarios. The ResU-Net scores were higher than the U-Net ones whenever the training and testing sets belonged to the same study area. Specifically, ResU-Net achieved a mean f1-score value of 66.33 while U-Net achieved a value of 63.75 when training and testing on all the study areas, 70.54 versus 64.04 when training and testing in Taiwan, 71.02 versus 69.56 when training and testing in China, and 63.21 versus 61.14 when training and testing in Japan. A few curious findings are worth mentioning. The first scenario (collective training) leads to the best evaluation results for the Taiwan area, compared to the other testing areas, which may be due to higher similarities among the training areas of Japan and China and the test area of Taiwan. Furthermore, when training the models on the Japan training area, we observed that the highest mean f1-score values were obtained for China's testing area, and not Japan's (70.64 and 70.14 versus 61.14 and 63.21, for U-Net and ResU-Net, respectively). Finally, regarding the second and third scenarios (training with the Taiwan and China datasets, respectively), the highest mean f1-score was obtained for the local testing areas (namely Taiwan and China, respectively).

Table 4. The resulting precision, recall, and f1-score values for Scenario 1 (training the algorithms with all training datasets and testing them on the Taiwan, China, Japan, and all test datasets separately). The highest values of precision, recall, and f1-score are indicated in bold.

Impact of sample input sizes on the network results.
We sampled the images using a regular grid approach to evaluate three different window sizes, namely 32 × 32, 64 × 64, and 128 × 128 pixels. The mean f1-score values of the U-Net model were highest for the sample size of 128 × 128 pixels, except in the first scenario. On the other hand, the ResU-Net achieved the best results with a sample size of 64 × 64 pixels, except in the third scenario. However, aside from a slightly superior performance when using 128 × 128 pixels for U-Net and 64 × 64 pixels for ResU-Net, we did not observe a definitive advantage to guide the choice of window sizes that would potentially provide substantial improvements (see Fig. 11). Although there is no clear evidence of a consistent impact of different window sizes across the multitude of assessed scenarios, Fig. 12 provides a visual example of the testing results of the U-Net model trained on the Taiwan dataset. The figure shows that, in this case, an increased window size led to higher recall values due to an increase in the correct detection of landslides (true positives), but not all cases followed this trend because of smaller landslide features.
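The regular grid sampling used for the three window sizes can be sketched as follows. This is a minimal illustration, assuming that partial windows at the image edges are discarded; the image dimensions are hypothetical and are not those of the study areas, so the patch counts differ from the ones reported in this study.

```python
def grid_patches(height, width, window):
    """Yield (row, col) upper-left corners of non-overlapping windows.

    Windows that would extend past the image border are dropped
    (an assumption; padding the border would be an alternative).
    """
    for row in range(0, height - window + 1, window):
        for col in range(0, width - window + 1, window):
            yield row, col


# Hypothetical image size in pixels, for illustration only.
height, width = 1000, 1500

counts = {
    window: sum(1 for _ in grid_patches(height, width, window))
    for window in (32, 64, 128)
}
print(counts)
```

Because the windows do not overlap, doubling the window size reduces the number of patches by roughly a factor of four, which is one reason why larger windows leave fewer samples for training.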
Challenges from imbalanced datasets. A total of 7888, 1972, and 592 sample patches were generated based on the window sizes of 32 × 32, 64 × 64, and 128 × 128 pixels, respectively, but the extent of the images of the three study areas was very different, with Taiwan being the largest one with 467.91 km². This leads us to conclude that the first scenario, where the models were trained on the combination of all datasets, performed better in Taiwan because most of the training data were based in Taiwan. The fourth scenario is a curious case, where both U-Net and ResU-Net, trained on the Japan training set, demonstrated better results when tested in China than in the testing area of Japan. We think that the spectral and textural features of the Japan training area are more similar to those of the China testing area than to the ones of the Japan testing area. This qualitatively suggests the hypothesis of a substantially heterogeneous profile of Japan's landslide sample and a more homogeneous profile in China's image, which is partially covered by a similar profile in a section of the Japan training area. Indeed, a high number of small landslide features are visible in Japan that were not represented in the case study of China, while the bigger ones are present in both case study areas of China and Japan. Even if we selected study areas with the same area, the ratio between landslide areas and non-landslide areas might vary, as might the frequency and number of landslides in different geo-environmental case studies.
Compared with recent works. The main objective of this study was to evaluate the performance of the U-Net and ResU-Net algorithms for landslide detection using freely available Sentinel-2 data and an ALOS DEM. Although, to our knowledge, our study is the first application of the U-Net and ResU-Net algorithms for landslide detection from freely available data, we compared our accuracy assessment results to those of the few recently published articles that applied these algorithms for landslide detection using VHR data. For instance, in the studies that were carried out by Liu et al. 34, Qi et al. 35, and Yi and Zhang 61, the U-Net and ResU-Net algorithms were compared to each other in terms of their applicability for landslide detection from VHR imagery. In our study, the ResU-Net produced more accurate landslide detection results than the U-Net algorithm. In another instance, Soares et al. 33 used only U-Net to evaluate different sample patch generation methods and window sizes for landslide detection from VHR satellite imagery. The same sample patch window sizes of 32 × 32, 64 × 64, and 128 × 128 pixels were used in their study, and the U-Net that was trained with 128 × 128 pixels achieved the highest f1-scores of 0.55 and 0.58 in two different testing areas. Our results confirmed theirs, as our U-Net yielded the highest mean f1-score values when trained with 128 × 128 pixel sample patches.
Limitations. The application of U-Net and ResU-Net for landslide detection is associated with some issues. For instance, with these algorithms and this image resolution, large landslides could easily be distinguished from dense vegetation, but cases of neighboring bare land increased the false positives substantially. Nevertheless, the inclusion of topographic slope data helped to discriminate landslides from bare land in many cases. However, a precise and detailed extraction of landslides from bare land requires proper auxiliary data like displacement information from SAR data or prior expert knowledge, which was not used in this study. Further factors that need to be carefully considered in the future are the imbalanced nature of the dataset and a detailed analysis of the impact of the dataset size, which will help tackle the remaining unsolved issues. Moreover, our initial expectation was that the global generalised performance would be improved by collectively merging training data from multiple geo-environmental case study areas. However, our hypothesis was not confirmed, and local training data based only on each target study area often outperformed a collective training dataset. Furthermore, it is not yet clear why a specific sample patch size yields better performance than another one in some specific local contexts and scenarios, and not in others, with fluctuations not directly following a consistent trend. This illustrates issues of low transparency related to the use of the proposed models.

Conclusions and outlook
This work evaluated the generalization and transferability of two well-known FCN algorithms (U-Net and ResU-Net) for landslide detection in different scenarios. We demonstrated the effectiveness of these algorithms on landslide detection using freely available Sentinel-2 data and an ALOS DEM. We selected three different geo-environmental study areas in Taiwan, China, and Japan to train and test the algorithms. The applied semantic segmentation models were trained on each individual area and on a combined dataset of all areas to detect landslides based on Sentinel-2 data and an ALOS DEM. To the best of our knowledge, no study has yet explored the possibility of using freely available satellite imagery for landslide annotation using FCN deep learning algorithms. Sample patches of three different sizes were generated from pre-defined window sizes for training and testing the algorithms. Multiple experiments were thus designed to evaluate the transferability of the algorithms and the impact of window sizes in different operational scenarios. Based on our results, we explored relationships among the applied models, the window sizes of sample patches, and the training datasets for landslide detection. Our results show that although the ResU-Net led to higher performances, the U-Net has more transferability capabilities. ResU-Net demonstrated the highest score in those cases where it was trained on only the local training datasets.
In future work, we aim to integrate the FCN algorithms with frameworks that enable us to incorporate prior knowledge into different parts of FCNs, e.g., by optimally selecting the sample patch window size and location, and to enhance the detection results by considering possible post-processing classification approaches.