Background & Summary

As the most fundamental and widely used transportation infrastructure, highways ensure the rapid and efficient flow of people and goods and play a key role in economic and social development. However, under repeated vehicle loads and harsh environmental conditions, road surface structures age and deteriorate, eventually leading to road damage, which severely degrades road performance1. Therefore, the rapid and precise monitoring of road pavement damage and its distribution plays a crucial role in extending the service life of highway roads.

Conventional road damage detection techniques typically depend on manual visual inspection and vehicle-mounted Pavement Monitoring Systems (PMSs). Manual approaches are greatly influenced by the experience of road maintenance personnel, who primarily employ ground measurements and visual assessments to evaluate road pavement health. These techniques are time-consuming, inefficient, and traffic-disruptive, making them unsuitable for monitoring extensive road networks2. PMSs equipped with multiple sensors can acquire comprehensive road information, yet they entail high acquisition and operating costs. Hence, they are generally employed only on high-grade roads and offer low efficiency3.

With the development of computer vision and deep learning, image classification, object detection, and segmentation techniques have been widely employed in the detection of road pavement damage. Currently, the image data for road pavement damage detection predominantly originates from ground-based platforms, encompassing top-down, wide, and street view perspectives. Top-down view images are usually captured by vehicle-mounted or handheld cameras or smartphones positioned a few meters above the ground at a perpendicular angle to the road surface. Wide view images are usually taken by smartphones installed on the dashboard or windshield of vehicles, capturing road conditions from a forward and slightly downward angle. Street view images are collected by street view imaging systems along roads; their front-facing views provide insight into road conditions. The camera set-ups, angles, and ranges on a vehicle platform for these three perspectives are illustrated in Fig. 1.

Fig. 1

Camera set-ups, angles, and ranges under vehicle platform for different views of road damage detection datasets. (a) top-down view; (b) wide view; (c) street view.

The present public datasets for road pavement damage detection are limited to top-down and wide view images. Table 1 reports an overview of the major available datasets4,5,6,7,8,9,10,11,12,13,14,15,16,17,18. The Crack Forest Dataset (CFD) is a representative dataset for pavement damage segmentation. Datasets of this kind use smartphones and vehicle-mounted cameras to capture road surfaces from a top-down view and annotate pavement damage pixel by pixel. However, they only distinguish between crack and background categories, while other damage categories are not considered; moreover, the amount of data is small and the image resolution is inconsistent4,5,6,7,8,9,10,11. The German Asphalt Pavement Distress (GAPs) v112 and GAPs v213 datasets use vehicle-mounted CCD imagers to capture road surfaces from a top-down view at a resolution of 1920 × 1080 pixels. Six road damage categories are annotated with bounding boxes, making these datasets suitable for object detection of road damage. In particular, the GAPs 10 m14 dataset, released in 2021, contains 20 high-resolution images (5030 × 11,505 pixels) covering 200 m of asphalt pavement of different road categories; in total, 22 categories of objects and damage instances are annotated at the pixel level, facilitating fine-grained segmentation of road damage. The Road Damage Dataset (RDD) series15,16,17,18, updated between 2018 and 2022, uses smartphones installed on windshields to capture wide view road images and annotates four damage categories (i.e., longitudinal cracks, transverse cracks, alligator cracks, and potholes). Furthermore, the Global Road Damage Detection Challenge (GRDDC) has attracted much attention in research on road damage detection19. Despite the great progress made by these studies on applying deep learning algorithms to road pavement damage detection, they are usually associated with high acquisition costs, varying image resolutions, and restricted image views, which impede the practical application of the associated models.

Table 1 Public datasets used for the detection of road pavement damage.

Street view images are geotagged images collected by map service providers (e.g., Google Maps and Baidu Maps) through street view imaging systems that travel along roads and capture multiple viewing angles. These images are then processed and maintained according to standard methods20. Street view images can accurately depict the urban physical environment21 and have been employed in numerous research applications, including the estimation of poverty, violent crime, health behaviour, and travel patterns22,23. The front view of street view images provides insight into road conditions. These images, collected with state-of-the-art devices, have the advantages of low additional cost, easy accessibility, regular updates, and rich content. Thus, street view images offer a new data source for the detection of road pavement damage. Table 2 summarizes the street view image datasets used in existing road damage detection studies24,25,26,27,28,29,30,31. These datasets vary in quantity, spatial resolution, and annotated damage categories, and none of them is publicly available. Classifying pavement damage solely at the image level is inadequate, as it can only indicate the presence or absence of damage25,29. While segmenting pavement damage at the pixel level provides detailed information on shape and size, current segmentation studies predominantly focus only on the presence or absence of damage and are restricted by limited image datasets24,28. Some road damage detection studies relying on bounding boxes cover only a few damage categories, failing to meet industry requirements30, while some damage categories may not be suitable for street view image scenarios26,27. Moreover, some datasets have relatively limited coverage31. Although the models proposed in these studies have made progress in monitoring road damage, their performance has only been validated on self-built datasets, and there is a lack of training and testing on unified public datasets. This makes it challenging to fairly evaluate and compare the performance of different models.

Table 2 Reported street view image datasets and their application in the detection of road damage.

Compared to top-down and wide view images, street view images stand out with distinct characteristics when used for road pavement damage detection. Top-down and wide view images cover only the areas where they were privately captured; applying a trained model to a new area therefore requires acquiring new images of that area. Street view images, by contrast, are captured by map service providers using specialized equipment with guaranteed image quality. The data covers almost all cities in the world, is publicly available for download, and is updated regularly. A model trained on a road damage dataset of annotated street view images can thus be readily applied to street view images of other areas. In addition, datasets from these three views can serve as mutual source and target data for domain adaptation studies; since wide view and street view images share similar perspectives and the same complex road background, they are particularly well suited to pavement damage domain adaptation research. Table 3 compares the attributes of the three views' image datasets.

Table 3 Attributes of top-down view, wide view, and street view image dataset for road pavement damage detection.

In this study, we propose the Street View Image Dataset for Automated Road Damage Detection (SVRDD), a dataset based on street view images for the detection of road pavement damage. To the best of our knowledge, SVRDD is the first public street view image dataset for pavement damage detection. It comprises a total of 8000 street view images from the Dongcheng, Xicheng, Haidian, Chaoyang, and Fengtai Districts of Beijing City, encompassing a variety of urban road types and pavement conditions. The dataset comprehensively annotates pavement damage at the bounding box level, with a total of 20,804 annotated instances. In terms of both the number of images and the number of damage instances, SVRDD stands out among current street view based road damage detection datasets, at both the bounding box and pixel annotation levels. The annotated categories comprise six damage categories and one easily confused non-damage object, namely, longitudinal crack, transverse crack, alligator crack, pothole, longitudinal patch, transverse patch, and manhole cover. From an application perspective, the pavement damage categories in SVRDD are highly relevant to the transportation industry and are well suited to detection in street view imagery. At the same time, the inclusion of manhole cover annotations can significantly enhance pothole detection16,31. SVRDD provides bounding box annotations in two formats (i.e., Pascal VOC and YOLO) to facilitate easy use of the dataset. The backgrounds of the street view images in SVRDD include pedestrians, vehicles, buildings, viaducts, trees, and their shadows, and the images were collected across multiple seasons and under varying weather and lighting conditions. To evaluate SVRDD, we trained and tested ten well-established object detection algorithms on the dataset. We then analysed model performance with varying numbers of training images and evaluated the impact of removing training subsets from different districts on model training, to assist users in utilizing the dataset. Additionally, we analysed potential extensions of the SVRDD dataset, which opens the door for further research.

Methods

The creation of the SVRDD dataset includes three key steps, namely, image collection, data cleaning, and damage annotation (Fig. 2).

Fig. 2

Flowchart of the generation process of SVRDD.

Image collection

A total of 844,432 street view images of Beijing City were acquired from Baidu Maps32. Notably, the use of Baidu Maps street view images must comply with its terms and conditions33. First, road location information was obtained from OpenStreetMap (OSM) road network data, which was converted to the BD09 coordinate system used by Baidu Maps. A sampling point was then generated every five meters along the road network, and the coordinates of the sampling points and other parameters (e.g., image width, height, and viewing angle) were input into the Baidu Maps API to download the street view images. Two images, with pitch angles of 0° and 45°, respectively, were obtained for each sampling location and vertically concatenated to form a complete front-view street view image of 1024 × 1024 pixels. The street view images used for the dataset were mainly captured in 2019 and 2020.
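The collection pipeline can be sketched in a few lines of Python. The endpoint URL, parameter names, tile size, and the stacking order of the two pitch angles below are illustrative assumptions rather than the exact Baidu Maps API calls used here:

```python
import requests
from io import BytesIO
from PIL import Image

PANO_URL = "https://api.map.baidu.com/panorama/v2"  # hypothetical endpoint

def fetch_tile(x, y, pitch, ak="YOUR_AK"):
    """Download a single 1024 x 512 street view tile at the given pitch."""
    params = {
        "ak": ak,                    # application key (placeholder)
        "location": f"{x},{y}",      # BD09 coordinates of the sampling point
        "coordtype": "bd09ll",
        "width": 1024, "height": 512,
        "pitch": pitch, "fov": 90,
    }
    resp = requests.get(PANO_URL, params=params, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content))

def build_front_view(x, y):
    """Vertically concatenate the 45-degree and 0-degree tiles into the
    1024 x 1024 front-view image described in the text."""
    canvas = Image.new("RGB", (1024, 1024))
    canvas.paste(fetch_tile(x, y, pitch=45), (0, 0))   # upper half (assumed order)
    canvas.paste(fetch_tile(x, y, pitch=0), (0, 512))  # lower half
    return canvas
```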

Data cleaning

The large number of obtained street view images ensures a wide distribution of damage features. However, as street view images are not specifically designed for the detection of road damage, data cleaning was required to guarantee the quality of the pavement damage dataset. The data cleaning process followed three steps: i) removal of images without damage; ii) removal of redundant images, given the high sampling frequency of street view images and the minimal differences in road damage between adjacent images; and iii) deletion of images with either blurry or densely overlapping damage instances. Considering the annotation workload and district areas, we ultimately obtained 8000 street view images for the detection of road pavement damage, with 1000 images each from Dongcheng and Xicheng Districts and 2000 images each from Haidian, Chaoyang, and Fengtai Districts.
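Steps i) and iii) required visual inspection, but step ii) can be approximated programmatically. The sketch below thins near-duplicate images taken at adjacent 5 m sampling points; the 20 m spacing threshold is an illustrative assumption, not a value used in building SVRDD:

```python
import math

def thin_samples(points, min_spacing_m=20.0):
    """Drop near-duplicate neighbours along a road.

    points: list of (pid, x, y) tuples in a metric projection, ordered
    along the road. Returns only the points at least min_spacing_m apart.
    """
    kept = []
    for pid, x, y in points:
        if not kept or math.hypot(x - kept[-1][1], y - kept[-1][2]) >= min_spacing_m:
            kept.append((pid, x, y))
    return kept
```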

Damage annotation

All the selected images were manually annotated with object bounding boxes using LabelImg. The annotated damage categories include longitudinal cracks, transverse cracks, alligator cracks, potholes, longitudinal patches, and transverse patches. Because potholes and manhole covers are easily confused16,31, we added a manhole cover category. The annotation was performed by three trained annotators, and each annotator's results were cross-checked by the other two. Annotations are provided in both the Pascal VOC and YOLO formats. In the Pascal VOC format, data is stored in .xml files, and a bounding box is represented as (xmin, ymin, xmax, ymax), with (xmin, ymin) and (xmax, ymax) as the top-left and bottom-right coordinates, respectively. In the YOLO format, data is stored in .txt files, and a bounding box is represented as (x, y, w, h), with (x, y), w, and h as the centre coordinate, width, and height of the bounding box, respectively.
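For illustration, converting between the two formats is straightforward. The following sketch maps a Pascal VOC box to the YOLO representation, assuming the standard YOLO convention of normalizing coordinates by the image width and height:

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert VOC corner coordinates (pixels) to normalized YOLO centre/size."""
    x = (xmin + xmax) / 2.0 / img_w   # normalized centre x
    y = (ymin + ymax) / 2.0 / img_h   # normalized centre y
    w = (xmax - xmin) / img_w         # normalized width
    h = (ymax - ymin) / img_h         # normalized height
    return x, y, w, h

# Example: a 200 x 100-pixel box in a 1024 x 1024 SVRDD image
print(voc_to_yolo(100, 700, 300, 800, 1024, 1024))
```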

Image properties

Statistical analysis was performed to determine the distribution and properties of the SVRDD dataset. Figure 3 presents example images from the dataset with different damage category annotations. The backgrounds of the images contain pedestrians, vehicles, buildings, viaducts, trees, and their shadows, and the images were collected across different seasons and under varying weather and lighting conditions. These variations make the detection of road damage from street view images challenging.

Fig. 3

Examples of different damage categories included in the SVRDD dataset. (a) longitudinal crack and manhole cover; (b) transverse crack; (c) alligator crack; (d) pothole; (e) longitudinal patch; (f) transverse patch.

The number of damage instances of each category in the SVRDD dataset and the key statistics of each district are illustrated in Fig. 4. The 8000 images in the SVRDD dataset contain a total of 20,804 annotated damage instances. Among the six damage categories, the numbers of potholes and alligator cracks are relatively low, as these two damage types emerge only after severe road aging. Among the five districts, Chaoyang District exhibits the lowest average number of damage instances.

Fig. 4

Statistics of the number of damage instances included in the SVRDD dataset.

The position and shape statistics of the damage instances in the SVRDD dataset, namely, the central point coordinates and the height–width distribution, are presented in Fig. 5. The central points of the damage instances are concentrated primarily in the lower half of the images, while the upper half generally represents the background. Due to the perspective effect, the road surfaces in the upper half of the images appear narrower, making damage harder to identify there. The height–width distribution of the damage instances reveals that the damage width can span the entire image, while the height is at most about half the image height.

Fig. 5

Position and shape statistics of the damage instances. (a) distribution of the central point coordinate; (b) distribution of the ratio between the height and width.

The area statistics of damage instances in the SVRDD dataset, calculated as the ratio of the area of a damage instance to the area of its image, are illustrated in Fig. 6. The area distributions of longitudinal cracks, transverse cracks, longitudinal patches, and transverse patches are approximately the same and are concentrated within 10% of the image area. Alligator cracks have relatively large areas, reaching up to 50% of the image area. Potholes and manhole covers have smaller areas, mostly less than 0.5% of the image area. These results demonstrate significant differences in the areas of damage instances; thus, object size needs to be considered when constructing a deep learning network for road damage detection.
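These statistics can be reproduced directly from the VOC annotations. The sketch below aggregates bounding-box centres, normalized sizes, and area ratios per category (Figs. 5 and 6); the folder layout follows the Data Records section:

```python
import glob
import xml.etree.ElementTree as ET

def instance_stats(annotation_dir, img_w=1024, img_h=1024):
    """Collect per-instance statistics from Pascal VOC .xml files."""
    stats = []
    for path in glob.glob(f"{annotation_dir}/*.xml"):
        root = ET.parse(path).getroot()
        for obj in root.iter("object"):
            box = obj.find("bndbox")
            xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
            xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
            w, h = xmax - xmin, ymax - ymin
            stats.append({
                "category": obj.find("name").text,
                "cx": (xmin + xmax) / 2 / img_w,          # centre x (Fig. 5a)
                "cy": (ymin + ymax) / 2 / img_h,          # centre y (Fig. 5a)
                "w": w / img_w, "h": h / img_h,           # size (Fig. 5b)
                "area_ratio": (w * h) / (img_w * img_h),  # area ratio (Fig. 6)
            })
    return stats
```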

Fig. 6

Area statistics of damage instances. (a) longitudinal crack; (b) transverse crack; (c) alligator crack; (d) pothole; (e) manhole cover; (f) longitudinal patch; (g) transverse patch; (h) total.

Data Records

The SVRDD dataset has been published in the Zenodo repository34. Its data structure and format are described in the following.

The dataset includes two folders, namely, ‘SVRDD_VOC’ and ‘SVRDD_YOLO’. The ‘SVRDD_VOC’ folder organizes the data in Pascal VOC format and contains the ‘JPEGImages’ folder with the street view images of all five districts of Beijing City, as well as the ‘Annotations’ folder with the corresponding bounding box annotation files in .xml format. The ‘SVRDD_YOLO’ folder organizes the data in YOLO format and contains the ‘images’ folder with the street view images of all districts and the ‘labels’ folder with the corresponding bounding box annotation files in .txt format. The directory structure of the SVRDD dataset is shown in Fig. 7.

Fig. 7

Directory structure of the SVRDD dataset (the file types in folders are exemplified in the ‘Dongcheng’ folder).

Each image filename consists of the image serial number (PID), the horizontal coordinate X of the shooting position, and the vertical coordinate Y of the shooting position, separated by ‘_’. The PID is the serial number of the image provided by Baidu Maps, and the horizontal and vertical coordinates of the shooting position are given in the Baidu BD09 coordinate system.
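A minimal sketch of parsing such a filename (the example filename below is hypothetical, not an actual record from the dataset):

```python
from pathlib import Path

def parse_filename(filename):
    """Split an SVRDD filename 'PID_X_Y.jpg' into its components."""
    # rsplit from the right so a PID containing '_' would still parse
    pid, x, y = Path(filename).stem.rsplit("_", 2)
    return pid, float(x), float(y)  # serial number, BD09 x, BD09 y

# hypothetical example:
# parse_filename("12345_12967432.18_4837025.65.jpg")
# -> ('12345', 12967432.18, 4837025.65)
```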

Technical Validation

The technical validation of the SVRDD dataset evaluates its applicability for constructing deep learning models to detect road damage in street view images. For the validation, the dataset was randomly split into a training set of 6000 images, a validation set of 1000 images, and a testing set of 1000 images, i.e., at a ratio of 6:1:1. The proportion of images from each district in each of the three sets is consistent with that district's share of the overall dataset.
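One way to realize this district-stratified 6:1:1 split, assuming per-district image lists (a sketch, not the exact splitting script used here):

```python
import random

def split_district(images, seed=0):
    """Shuffle one district's images and split them 6:1:1."""
    imgs = list(images)
    random.Random(seed).shuffle(imgs)
    n_train = int(len(imgs) * 6 / 8)
    n_val = int(len(imgs) * 1 / 8)
    return (imgs[:n_train],
            imgs[n_train:n_train + n_val],
            imgs[n_train + n_val:])

def stratified_split(districts, seed=0):
    """districts: dict mapping district name -> list of image paths.
    Splitting each district separately keeps per-district proportions
    identical across the three sets, as described above."""
    train, val, test = [], [], []
    for images in districts.values():
        tr, va, te = split_district(images, seed)
        train += tr; val += va; test += te
    return train, val, test
```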

Performance of object detection algorithms using SVRDD

The SVRDD dataset was used to train and evaluate ten mainstream object detection algorithms: Faster R-CNN35, Cascade R-CNN36, Dynamic R-CNN37, RetinaNet38, FCOS39, ATSS40, YOLOv341, YOLOF42, YOLOv543, and YOLOX44. The experimental hardware configuration includes an Intel(R) Xeon(R) Silver 4116 CPU, 128 GB RAM, and four NVIDIA GeForce 1080Ti GPUs. The open-source object detection library MMDetection45 was used as the implementation platform. For the parameter settings, the batch size was set to 16, the stochastic gradient descent algorithm was used for optimization with a momentum of 0.9 and a weight decay coefficient of 0.0001, and a warm-up method was employed to initialize the learning rate. Table 4 compares the performance of these models on the SVRDD testing set. The results indicate that YOLOv5, YOLOX, and Cascade R-CNN perform best on the SVRDD dataset. Among them, YOLOv5 achieved the best detection performance, with an F1-score of 0.709 and a mAP@0.5 of 0.733. YOLOv5 characterizes object features at four scales with strides of 8, 16, 32, and 64 while thoroughly fusing features across layers, which significantly enhances the detection of pavement damage with substantial variations in size. As shown in Table 4, YOLOX trailed closely with an F1-score of 0.691 and a mAP@0.5 of 0.703, while achieving the highest mAP@0.5:0.95 of 0.420. Cascade R-CNN, which has the largest number of parameters and FLOPs, recorded an F1-score of 0.664 and a mAP@0.5 of 0.674, followed by Dynamic R-CNN; the classic Faster R-CNN and YOLOv3 also achieved good performance. The remaining algorithms (i.e., RetinaNet, FCOS, ATSS, and YOLOF) achieved F1-score and mAP@0.5 values close to 0.6. These results confirm the effectiveness of the SVRDD dataset for training deep learning algorithms to detect road damage in street view images.
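For reference, these optimization settings can be expressed in MMDetection 2.x config style. Only the batch size, momentum, weight decay, and the use of warm-up are reported above; the learning rate, warm-up length, and decay schedule below are illustrative assumptions:

```python
# Sketch of an MMDetection 2.x training config fragment for SVRDD.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',      # step decay schedule (assumed)
    warmup='linear',    # linear warm-up, as described in the text
    warmup_iters=500,   # assumed warm-up length
    warmup_ratio=0.001,
    step=[8, 11])       # assumed decay epochs
data = dict(samples_per_gpu=4, workers_per_gpu=2)  # 4 images x 4 GPUs = batch 16
runner = dict(type='EpochBasedRunner', max_epochs=12)  # assumed schedule length
```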

Table 4 Performance comparison of different mainstream object detection algorithms.

Performance with varying training set sizes

The model performance was investigated using sub-datasets with different numbers of training images. These datasets share the same validation and testing sets as the original SVRDD dataset, while the number of images in the training set was varied. Figure 8 presents the division of the datasets. SVRDD6K is the original SVRDD dataset, with 6000 training images; SVRDD5K is derived from SVRDD6K by removing 1000 training images in proportion to each district's share of the images, and so forth for the remaining datasets. Notably, the proportion of images from each district in the training set of every sub-dataset remains consistent with the original per-district proportions.
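A sketch of how such sub-datasets could be derived, dropping 1000 training images in proportion to each district's share (rounding makes the removal count approximate; this is an illustration, not the exact procedure used):

```python
import random

def remove_proportionally(train_by_district, n_remove=1000, seed=0):
    """train_by_district: dict mapping district name -> list of training
    images. Randomly drop n_remove images in proportion to each district's
    share, yielding e.g. SVRDD5K from SVRDD6K."""
    total = sum(len(v) for v in train_by_district.values())
    rng = random.Random(seed)
    reduced = {}
    for name, imgs in train_by_district.items():
        k = round(n_remove * len(imgs) / total)  # district's share of removals
        keep = list(imgs)
        rng.shuffle(keep)
        reduced[name] = keep[k:]
    return reduced
```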

Fig. 8

Dataset splitting for different training set sizes (a) and the proportion of images for each district in the SVRDD1K dataset (b).

As YOLOv5 achieved the best performance in the comparative experiments, it was trained and tested on these six datasets (i.e., SVRDD6K to SVRDD1K) (Table 5). As the number of training images decreased, the model accuracy also decreased. Compared to SVRDD6K, with 5000, 4000, 3000, 2000, and 1000 training images the model's F1-score decreased by 0.28%, 0.42%, 1.97%, 8.04%, and 23.13%, and its mAP@0.5 decreased by 0, 2.18%, 4.50%, 11.60%, and 29.33%, respectively. The performance on SVRDD6K and SVRDD5K is approximately equal, with only a slight decrease in F1-score on the latter. This suggests that model performance is significantly influenced by dataset size; however, as the dataset grows, this influence diminishes and performance becomes more dependent on the network structure. Performance drops sharply on SVRDD1K because the training set is too small and does not contain sufficient damage instances, causing the model to overfit noise and irrelevant features in the training data and resulting in poor generalization.

Table 5 Model performance under sub-datasets with different numbers of training images.

The mAP@0.5 values for the different pavement damage categories change as the training set size varies, with significant differences among categories (Fig. 9). Overall, the mAP@0.5 for alligator crack is higher than those for patches, which in turn are higher than those for cracks, with pothole the lowest; additionally, the mAP@0.5 values for longitudinal damages exceed those for transverse ones. The mAP@0.5 values of all categories generally decrease as the training set shrinks. From SVRDD6K to SVRDD1K, pothole is the most affected category, with a mAP@0.5 decrease of 40.33%; transverse crack and transverse patch decrease by 35.34% and 31.81%, respectively, followed by longitudinal patch and longitudinal crack with decreases of 28.01% and 23.88%, while alligator crack decreases by 27.28%. All categories show a substantial drop when the dataset shrinks from SVRDD2K to SVRDD1K; transverse crack and pothole exhibit the most significant changes in mAP@0.5, with decreases of 30.65% and 29.87%, respectively, while the remaining reductions are around 15%. Meanwhile, the mAP@0.5 for transverse crack changes the least from SVRDD6K to SVRDD2K, by only 6.77%. Therefore, when collecting data or training models, it is crucial to focus on potholes and transverse cracks, as these damage types are the most affected by dataset size.

Fig. 9

The mAP@0.5 values of different damage categories with different training set sizes.

Performance on removing training sets from different districts

The impact on the model of removing training data from individual districts was also investigated. The validation and testing sets of the SVRDD dataset remained unchanged, while the training images of Dongcheng, Xicheng, Haidian, Chaoyang, or Fengtai were removed to form new datasets. Considering the proximity of Dongcheng and Xicheng Districts, and that the sum of their image quantities equals the individual image quantity of each of the other three districts, we also included the case of simultaneously removing the training images of both districts. ‘SVRDD-Dongcheng’ denotes the SVRDD dataset with the training data of Dongcheng District removed, and so forth. The YOLOv5 network, which demonstrated the best performance in the comparative experiments, was used for training and testing. Figure 10 reports the mAP@0.5 values on the testing set of each district. Model performance decreased to varying degrees when training data from different districts was removed. The detection accuracies for Dongcheng and Xicheng Districts were lowest when the training data of these two districts was removed simultaneously. After the training data of Haidian District was removed, its mAP@0.5 decreased by 0.074, while the decreases in the other cases were insignificant. This suggests that the image features of Haidian District are relatively independent, as the district spans both the plain and the West Mountain area. Removing the training images of Fengtai District had a significant impact on all districts and thus the largest effect on overall performance. This may be because Fengtai District provides the majority of the pothole annotations in the dataset, which greatly influences pothole detection accuracy in all districts and ultimately affects the overall performance.

Fig. 10

The mAP@0.5 values of each district after removing training data from each district.

Usage Notes

Dataset usage

The SVRDD dataset described in this paper is offered to the scientific community to promote progress in road damage detection from street view images. It is the first publicly available street view image dataset for road damage detection, annotated by trained remote sensing image interpreters. SVRDD can be employed ‘as is’ to train deep learning models for road damage detection and can be downloaded from the link provided. Moreover, the data division strategy used in the technical validation is also provided, and users can divide the dataset according to their research needs. Users are advised to cite this article and acknowledge the contribution of this dataset to their work.

Limitations

The SVRDD dataset has some limitations as follows:

  • Limited coverage. Although the SVRDD dataset covers five districts of Beijing City, China, and spans diverse road types and pavement conditions, this coverage remains small relative to the vast road network and the massive volume of available street view data, and pavement damage characteristics may vary from region to region.

  • Imbalanced classes. The numbers of instances in the damage categories of the SVRDD dataset are unequal. Potholes were rare in the studied area, so there are significantly fewer pothole instances than instances of other damage types. This may explain the low accuracy of pothole detection (Fig. 9).

Supported extension

The applicability of SVRDD can be broadened to other scenarios. Users can select extensions based on their research interests; we recommend first addressing the limitations of the dataset. Potential extensions include:

  • Multi-city imagery. The first aim is to include more cities to expand the dataset. This extension to multi-city street view images is anticipated to enhance diversity and improve pavement damage detection performance in various urban settings.

  • Multi-temporal imagery. Each image in the SVRDD dataset is the most recent available for its location. Given the regular updates of street view images, capturing them at different times enables a comprehensive analysis of temporal variations in pavement conditions at specific locations. This temporal dimension adds valuable insights for ongoing monitoring and assessment.

  • Multi-view imagery. Acknowledging the importance of diverse views, the SVRDD dataset can be used together with images from other views. Leveraging domain adaptation techniques, models trained on the SVRDD dataset can be deployed effectively on images captured from other views. Current domain-adaptive object detection is mainly based on teacher-student and transformer frameworks; applying these to the SVRDD dataset is a worthwhile direction for further research.

  • Balancing the dataset. An imbalanced dataset decreases the sensitivity of a model towards minority classes. To handle the class imbalance problem, additional images of the under-represented damage categories can be collected, and techniques such as data augmentation, resampling, and the synthetic minority over-sampling technique can be employed.

  • Additional annotations. Beyond its primary focus, SVRDD images can be annotated for various applications, such as identifying road traffic signs and road markings, detecting congestion, and more. This expanded annotation approach adds versatility and utility to the dataset for diverse applications.