Background & Summary

Unmanned Aerial Vehicle (UAV)-based object detection algorithms are widely used in domains such as forest inventory1, mapping2, traffic monitoring3, and humanitarian relief4. With the rapid development of deep learning5 and edge computing6, UAVs can now carry edge computing devices to run artificial intelligence (AI) algorithms, which increases their value in the aforementioned applications. Driven by the rapid progress of object detection, several general-purpose datasets such as PASCAL VOC7, MSCOCO8, and ImageNet9 have been proposed to support algorithm training and evaluation. Unlike natural images, however, aerial images contain more object instances because of their wider field of view, which poses greater challenges. Table 1 compares the average number of object bounding boxes per image in general-purpose datasets and in the HIT-UAV10; the HIT-UAV10 contains a markedly higher average. Figure 1a,b use samples from the COCO and VisDrone datasets to illustrate the differences between natural and aerial images.

Table 1 The average number of bounding boxes (Avg. Bbox) per image for general-purpose datasets and the HIT-UAV.
Fig. 1 The samples of different datasets.

Many aerial-perspective datasets have been introduced to improve the detection performance of algorithms. The Stanford11, UAV12312, CARPK13, VisDrone14, and AU-AIR15 datasets provide visible-light images, while the ASL-TID16, BIRDSAI17, FLAME18, DroneRGBT19, DroneVehicle20, and Salient Map21 datasets provide thermal infrared images. The Salient Map dataset contains pedestrian and vehicle objects because its authors found no publicly available thermal dataset for detecting pedestrians and vehicles from a UAV perspective.

However, despite the many datasets introduced for UAV-based object detection, several challenges remain in this field:

  • Limited application range. Several existing UAV-based datasets comprise only visible-light images, which limits their use during night-time operations and raises privacy concerns. As shown in Fig. 2, infrared thermal cameras offer distinct advantages over visible-light cameras for night-time imaging. Additionally, Fig. 3 shows a sample image from the HIT-UAV10, in which persons appear as white blocks devoid of any personal appearance, clothing, or gender information, thus fully protecting individual privacy.

    Fig. 2 Sample images captured by visible-light and infrared thermal cameras at night, at the same flight altitude and camera angle. Car and bicycle objects are readily identifiable in the infrared thermal image but difficult to discern in the visible-light image, demonstrating the advantage of infrared thermal cameras for night-time UAV operations.

    Fig. 3 A sample image and recorded information of the HIT-UAV.

  • Insufficient record information. Numerous UAV-based datasets lack critical flight information, such as altitude and camera perspective, which prevents researchers from investigating pertinent issues such as the influence of these factors on detection accuracy. Table 2 lists the information recorded by different datasets.

    Table 2 The information recorded by different datasets.
  • Non-diversified data distribution. Many UAV-based datasets cover only a narrow range of conditions, such as synthetic scenes12,17, low altitudes12,13,16,21, single scenes11,16, or specific object categories13,18,19,20. The drawbacks of synthetic scenes and low altitudes are illustrated with sample images in Fig. 4. Moreover, focusing on a single scene or object category restricts the applicability of a dataset in broader scenarios, such as detection across multiple scenes or of multiple object categories. Fig. 1c–h present samples from current UAV-based infrared thermal datasets to illustrate these drawbacks.

    Fig. 4 Sample images from synthetic scenes, low-altitude real scenes, and high-altitude real scenes. Synthetic scenes often lack the lighting variations and details present in real scenes, which can degrade detection performance when models trained on synthetic scenes are applied to real scenes. Compared to low-altitude perspectives, high-altitude perspectives capture more objects and enable UAVs to scan a larger area. In addition, flying at higher altitudes allows UAVs to operate over areas with tall buildings, making high-altitude datasets advantageous for practical tasks and for expanding the application of UAVs in real-world scenarios.

To overcome the aforementioned challenges, we present the HIT-UAV10 dataset. The HIT-UAV10 comprises infrared thermal images collected to expand the application range of UAVs at night. To facilitate research on diverse issues, such as the impact of UAV flight altitude and camera perspective on object detection accuracy, the HIT-UAV10 records crucial information, including flight altitude, camera perspective, daylight intensity, and image shooting date. Figure 3 shows a sample image and the recorded information of the HIT-UAV10. Covering a wide range of conditions, including high altitudes (60 to 130 meters), different camera perspectives (30 to 90 degrees), various scenes (such as schools, parking lots, roads, and playgrounds), and common object categories (persons, cars, bicycles, and other vehicles), the HIT-UAV10 aims to broaden the data distribution available for various tasks.

The dataset comprises 2,898 infrared thermal images extracted from 43,470 frames in hundreds of videos; all frames were collected in public spaces and desensitized. To promote effective use of the dataset across different tasks, the HIT-UAV10 provides two types of annotated bounding boxes for each object: oriented and standard. The oriented bounding box addresses the significant overlap between object instances in aerial images, while the standard bounding box facilitates efficient use of the dataset. The HIT-UAV10 includes five object categories, namely Person, Car, Bicycle, OtherVehicle, and DontCare, with a total of 24,899 annotated objects. The DontCare category contains objects that the annotators could not categorize accurately (see the Methods section). The dataset comprises 2,029 training images, 579 test images, and 290 validation images. To evaluate the HIT-UAV10, we trained and tested well-established object detection algorithms, namely YOLOv422, YOLOv4-tiny, Faster R-CNN23, and SSD24, on the dataset. The results show that, compared to visible-light datasets, these algorithms achieve excellent performance on the HIT-UAV10, indicating the potential of infrared thermal datasets to significantly improve object detection applications on UAVs. Furthermore, we analyzed the performance of YOLOv4 and YOLOv4-tiny at different altitudes and camera perspectives, yielding observations that help users understand UAV-based object detection.

To the best of our knowledge, the HIT-UAV10 is the first publicly available high-altitude UAV-based infrared thermal dataset for detecting persons and vehicles. The HIT-UAV10 has great potential to enable several research activities, such as (1) studying the application range of infrared thermal cameras in object detection tasks, (2) assessing the feasibility of UAV-based search and rescue missions at night, (3) examining the relationship between flight altitude and object detection precision on UAVs, and (4) investigating the impact of camera perspective on UAV-based object detection.

Methods

The UAV platform selected for image capture was the DJI Matrice M210 V225, which costs approximately 10,000 US dollars. The setup of the DJI Matrice M210 V2 is detailed in Table 3. A DJI Zenmuse XT2 camera26 was mounted on the UAV to capture the images. The DJI Zenmuse XT2 features a FLIR longwave infrared thermal camera with a resolution of 640 × 512 pixels and a 25 mm lens, as well as a visible-light camera that captures 4K videos and 12 MP photos. The DJI Zenmuse XT2 costs approximately 8,000 US dollars.

Table 3 DJI Matrice M210 setup.

The dataset generation pipeline comprises four stages: video capture, frame extraction and data cleaning, object annotation, and dataset generation.

Video capture

We captured videos in a variety of scenes, including schools, parking lots, roads, and playgrounds. The flight altitude ranged from 60 to 130 meters, and the camera perspective ranged from 30 to 90 degrees. Flights were conducted during both daytime and nighttime. For each video, we recorded the flight altitude, camera perspective, flight date, and daylight intensity.

Frame extraction and data cleaning

Consecutive video frames differ only slightly in image features, so most frames contribute little to improving an object detection model. Although many datasets retain all frames for training detection models, this approach does not address the limited feature distribution. The HIT-UAV10 provides a sufficient number of original frames (43,470) to ensure a wide feature distribution. The frame resolution is 640 × 512, the bit depth is 8, and the average compression rate is 21.059%. To filter out adjacent frames with little difference, we sampled one image every 15 frames (the video frame rate is 7 FPS), resulting in 2,898 infrared thermal images.
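As a rough illustration of this sampling strategy, the following Python sketch (using OpenCV) saves every 15th frame of a video as an image; the video file name and output directory are hypothetical, not the ones used to build the dataset.

```python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, step: int = 15) -> int:
    """Save every `step`-th frame of a video as a PNG image and return the count."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/{Path(video_path).stem}_{index:05d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example with a hypothetical flight video:
# print(sample_frames("flight_0001.mp4", "frames/flight_0001"))
```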

Object annotation

We annotated the objects in the dataset using two types of bounding boxes: standard and oriented. The standard bounding box is represented as (xc, yc, w, h), where (xc, yc) denotes the center coordinate and w and h denote the width and height of the bounding box, respectively. However, accurately labeling objects in aerial images from the perspective of UAVs can be challenging. To address this issue, we used the θ-based oriented bounding box27 to label object instances. The oriented bounding box is represented as (xc, yc, w, h, θ), where θ denotes the angle of rotation from the horizontal orientation of the standard bounding box. As shown in Fig. 5a, standard bounding boxes can overlap significantly, making it difficult for state-of-the-art object detection algorithms to distinguish the objects. Oriented bounding boxes annotate the objects more tightly and resolve this issue, as shown in Fig. 5b. Note that bounding boxes on the image boundary remain standard because an oriented bounding box cannot extend beyond the image edge. One drawback of oriented bounding boxes is that few object detection algorithms natively support training with them. To help users utilize the dataset, we provide both oriented and standard bounding box annotation files.

Fig. 5 Samples of the standard bounding box, oriented bounding box, and DontCare object. Oriented bounding boxes overlap less than standard bounding boxes. In (c), the red box marks a DontCare object; it is difficult to identify whether the objects in this area are persons.

We performed manual annotation of oriented object bounding boxes for all images using a modified version of the LabelImg tool. Difficult and truncated object instances were also labeled. Three individuals were involved in the annotation process, and each annotation was verified by the others. To facilitate the use of the dataset, we developed a tool to convert oriented bounding boxes to standard bounding boxes. The conversion method is as follows: first, we obtained the minimum and maximum x and y coordinates (xmin, xmax, ymin, ymax) of the oriented bounding box; then, we used (xmin, xmax, ymin, ymax) as the boundary of the standard bounding box, whose center is xc = (xmin + xmax)/2 and yc = (ymin + ymax)/2, and whose width and height are w = xmax − xmin and h = ymax − ymin, respectively.
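A minimal Python sketch of this conversion is given below. It assumes θ is the rotation angle, in degrees, of the box around its center relative to the horizontal direction (the exact angle convention in the annotation files may differ); it rotates the four corners and takes their extremes, following the method described above.

```python
import math

def oriented_to_standard(xc, yc, w, h, theta_deg):
    """Convert an oriented box (xc, yc, w, h, theta) to a standard box (xc, yc, w, h)."""
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Corner offsets of the unrotated box, relative to its center.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    xs, ys = [], []
    for dx, dy in offsets:
        xs.append(xc + dx * cos_t - dy * sin_t)
        ys.append(yc + dx * sin_t + dy * cos_t)
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    # Standard box: center is the midpoint of the extremes, size is their range.
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)

# Example: a 40 x 20 box rotated by 30 degrees around its center.
# print(oriented_to_standard(100.0, 80.0, 40.0, 20.0, 30.0))
```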

Dataset generation

We developed a dataset generation tool whose functions include XML and JSON label-file generation and dataset splitting. The original images were organized into different folders based on flight data, and the tool generated XML and JSON label files for each image. To facilitate object detection model training, we split the dataset into training, test, and validation sets with a ratio of 70%, 20%, and 10%, respectively, using the hold-out method28.
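The hold-out split can be reproduced with a simple routine such as the Python sketch below; the random seed and image identifiers are illustrative and not those used to produce the released split.

```python
import random

def holdout_split(image_ids, train=0.7, test=0.2, seed=42):
    """Shuffle image identifiers and split them into train/test/validation sets.

    The remaining fraction (here 10%) is used for validation.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_test = int(len(ids) * test)
    return (ids[:n_train],                     # training set
            ids[n_train:n_train + n_test],     # test set
            ids[n_train + n_test:])            # validation set

# Example with hypothetical image names:
# train_set, test_set, val_set = holdout_split([f"img_{i:05d}" for i in range(2898)])
```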

Data Records

The dataset is available at Zenodo10.

Folder structure and recording format

We offer two types of annotation files: XML files in the VOC dataset format and JSON files in the MS COCO dataset format, both of which are widely used in computer vision object detection benchmarks. The top-level folder of the dataset includes four subfolders: normal_json, normal_xml, rotate_json, and rotate_xml. The normal_json and normal_xml folders contain annotation files with standard bounding boxes in JSON and XML formats, respectively, while the rotate_json and rotate_xml folders contain annotation files with oriented bounding boxes in JSON and XML formats, respectively.
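Because the JSON files follow the MS COCO layout, they can be read with standard JSON tooling. The sketch below counts annotated objects per category; the annotation file name inside normal_json is an assumption and may differ in the released dataset.

```python
import json
from collections import Counter

# Hypothetical path; the actual JSON file name inside normal_json/ may differ.
with open("normal_json/annotations.json", "r") as f:
    coco = json.load(f)

# Standard MS COCO layout: "images", "annotations", and "categories" lists.
categories = {c["id"]: c["name"] for c in coco["categories"]}
images = {im["id"]: im["file_name"] for im in coco["images"]}

# Count annotated objects per category.
counts = Counter(categories[a["category_id"]] for a in coco["annotations"])
print(counts)

# Each annotation carries a bbox; for the standard boxes this follows the
# COCO convention [x, y, width, height].
first = coco["annotations"][0]
print(images[first["image_id"]], first["bbox"])
```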

The image files are named according to the format T_HH(H)_AA_W_NNNNN, where T indicates the shooting time (0 for day, 1 for night), HH(H) indicates the flight altitude (60 to 130 meters), AA denotes the camera perspective (30 to 90 degrees), W indicates the weather condition (only images captured under rain-free conditions were included in the dataset), and NNNNN denotes the serial number of the image.
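The metadata encoded in an image name can be recovered with a small parser such as the Python sketch below; the width of the weather field is an assumption based on the description above.

```python
import re

# Pattern for the naming scheme T_HH(H)_AA_W_NNNNN described above.
NAME_RE = re.compile(r"^(?P<time>[01])_(?P<altitude>\d{2,3})_(?P<angle>\d{2})_"
                     r"(?P<weather>\d+)_(?P<serial>\d{5})$")

def parse_name(stem: str) -> dict:
    """Extract shooting time, altitude, camera angle, weather code, and serial number."""
    m = NAME_RE.match(stem)
    if m is None:
        raise ValueError(f"unexpected file name: {stem}")
    d = m.groupdict()
    return {
        "night": d["time"] == "1",
        "altitude_m": int(d["altitude"]),
        "angle_deg": int(d["angle"]),
        "weather": int(d["weather"]),
        "serial": int(d["serial"]),
    }

# Example with a hypothetical file name (night, 100 m altitude, 60 degree perspective):
# print(parse_name("1_100_60_0_00001"))
```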

Properties

The annotated object categories include four types that frequently appear in search and rescue missions: Person, Car, Bicycle, and OtherVehicle. In addition, we labeled unrecognizable objects as DontCare, because annotators cannot identify the specific type of many objects in high-altitude aerial images. As shown in Fig. 5c, the red box marks a DontCare object; it is difficult to determine whether the objects in this area are persons. The DontCare category therefore flags easily confused objects in an image.

Figure 6a shows the distribution of annotations across object categories. Person, the primary object of interest for rescue missions, appears more frequently than the other categories. Additionally, the substantial number of Car and Bicycle objects makes the HIT-UAV10 suitable for a wide range of common tasks. To enhance the versatility of the dataset for high-altitude missions, flight altitudes were recorded in intervals of 10 meters, ranging from 60 to 130 meters, as depicted in Fig. 6b. The camera perspectives were recorded in increments of 10 degrees, ranging from 30 to 90 degrees, as shown in Fig. 6c. Infrared thermal images differ significantly between day and night because the background temperature is higher during the day; as shown in Fig. 7, objects are easier to identify in night-time infrared thermal images. To increase the diversity of the dataset, infrared thermal images were collected both during the day and at night, as presented in Fig. 6d. Figure 6e,f present the distribution of instances of each category across flight altitudes and camera perspectives, respectively. The average pixel area of each category across flight altitudes is depicted in Fig. 8a. In theory, the average pixel area should decrease with increasing altitude, since objects appear smaller from higher altitudes. However, for OtherVehicle, the values fluctuate considerably because of the limited number of instances, which amplifies the influence of truncated objects on the statistics. The average pixel area of the remaining categories generally decreases with altitude, with slight fluctuations due to differences in image coverage at different altitudes and angles. The average pixel area of each category across camera perspectives is shown in Fig. 8b, where it first increases and then decreases with increasing angle. This is because objects appear larger as the field of view narrows, but their visible surface area decreases at steeper angles. Figure 9 illustrates these visual changes.

Fig. 6 The data distribution of the HIT-UAV.

Fig. 7 The samples of the night and day images.

Fig. 8 The average pixel area of categories across flight altitudes and camera perspectives.

Fig. 9 The sample images taken at 80 meters with varying camera perspectives. At 30 degrees, objects in the far distance appear smaller due to the wider field of view. Conversely, at 50 degrees, objects appear larger. However, at 90 degrees, objects once again become smaller due to the reduction in the visible surface area of objects.

Technical Validation

We trained four well-established object detection algorithms, namely YOLOv4, YOLOv4-tiny, Faster R-CNN, and SSD, on the HIT-UAV10. The dataset consisted of 2,029 training images, 290 validation images, and 579 test images. The experiments were performed on an RTX 2080Ti GPU. YOLOv4 and YOLOv4-tiny were trained with the Darknet framework, while Faster R-CNN (with a ResNet-101 backbone) and SSD-512 were trained with the MMDetection29 framework. The pre-trained models for YOLOv4 and YOLOv4-tiny were obtained from official sources. Training ran for a maximum of 10,000 steps with a batch size of 64 and 16 subdivisions. The learning rate was set to 0.0013 and multiplied by 0.1 at steps 8,000 and 9,000; the momentum and weight decay were set to 0.949 and 0.0005, respectively. For Faster R-CNN and SSD, the official ResNet-101 and VGG16 models were used as pre-trained models. The maximum number of epochs was 32, with a batch size of 16. The learning rate was set to 0.02 with a warm-up ratio of 0.001 and 500 warm-up iterations; the momentum and weight decay were set to 0.9 and 0.0001, respectively.
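For reference, the Faster R-CNN/SSD settings above correspond roughly to the following MMDetection-style (v2.x) configuration fragment. This is a sketch rather than the exact configuration file used; in particular, the learning-rate decay epochs are assumptions, since they are not stated in the text.

```python
# Optimizer and schedule sketch in MMDetection (v2.x) config style.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,       # warm-up iterations, as described above
    warmup_ratio=0.001,     # warm-up ratio, as described above
    step=[24, 28])          # assumed decay epochs; not specified in the text
runner = dict(type='EpochBasedRunner', max_epochs=32)
data = dict(samples_per_gpu=16, workers_per_gpu=2)  # batch size 16 on a single GPU; worker count assumed
```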

Table 4 presents the precision of the aforementioned models on the HIT-UAV10 test set, together with the precision of YOLOv4 and YOLOv4-tiny trained on the COCO dataset and the highest accuracy (attained by RRNet) in the VisDrone-2019 challenge30. We observe that the Average Precision (AP) for the Person category is significantly lower for YOLOv4-tiny on the HIT-UAV10, which may be attributed to its weaker detection capability for small objects compared to the other models. Additionally, the AP for the OtherVehicle category is low, which may be due to category imbalance; the SSD-512 model performs better on this imbalanced category. In the VisDrone-2019 challenge, the highest precision achieved was 55.82% mean Average Precision (mAP), by the RRNet method, whereas the official YOLOv4 model achieves 65.7% mAP on the COCO dataset. This gap indicates that aerial images are more complex than natural images. Finally, on the HIT-UAV10, YOLOv4 achieved an mAP of 84.75%, which supports the following observations:

  • Infrared thermal images effectively filter out extraneous information, leading to improved object identification.

  • Infrared thermal images facilitate the outstanding performance of common detection models with limited image data, due to the easily recognizable features of the objects in such images. The HIT-UAV10 has the potential to facilitate the detection of vehicles and persons by UAVs.

Table 4 The Average Precision (AP) of the baseline models.

We used YOLOv4 and YOLOv4-tiny as representative models to study the impact of flight altitude and camera perspective on UAV-based object detection. The Person and Car categories were selected for this experiment, as the OtherVehicle and Bicycle categories have a limited number of objects in the HIT-UAV10, which would cause fluctuations in the statistical results. The results of the study are shown in Fig. 10, from which the following observations and insights were gleaned:

  • The AP of YOLOv4 demonstrates stability within a certain range, suggesting that variations in altitudes and angles do not significantly impact the detection performance of robust algorithms.

  • The AP of YOLOv4-tiny for the Person category tends to decrease with increasing altitude. This decrease is observed in three stages, ranging from 60 m to 80 m, 80 m to 90 m, and 100 m to 130 m, suggesting that the detection performance of lightweight algorithms is significantly impacted when objects fall outside of a certain size range. Higher altitudes provide a wider field of view, enabling UAVs to cover larger areas within the same flight time. In some UAV tasks, such as person rescue, users may need to weigh the trade-off between detection precision and altitude to achieve optimal performance.

  • The AP of YOLOv4-tiny for the Person category first increases and then decreases with increasing camera angle. This result highlights the impact of the visible surface of objects on detection precision. At 90 degrees, as shown in Fig. 9c, individuals appear as points, making them more challenging to identify compared to when viewed at 50 degrees. As a result, it is crucial for users to choose the appropriate camera perspective when performing object detection tasks.

Fig. 10 The Average Precision (AP) of categories in the HIT-UAV test set.

Sample detection results of the YOLOv4 model trained on the HIT-UAV10 are shown in Fig. 11. The results demonstrate that the model effectively recognizes objects in infrared thermal aerial images. We hope the HIT-UAV10 will promote the development of UAV-based object detection.

Fig. 11 The sample results of YOLOv4 detection.

Usage Notes

The HIT-UAV10 is available at https://pegasus.ac.cn. Users can download the dataset to train object detection algorithms. The VOC and MS COCO datasets are widely used benchmarks for object detection, and we provide label files in both VOC and MS COCO formats so that users can easily work with the HIT-UAV10.

The HIT-UAV10 was collected in a diverse range of environments, including schools, parking lots, roads, and playgrounds. This allows trained object detection models to be applied to these scenarios, as well as to other environments through the generalization capabilities of deep learning. Researchers can use the HIT-UAV10 to train object detection models and investigate the application range of infrared thermal imaging in different object detection tasks. Additionally, the trained models can be employed in UAV-based search and rescue missions at night to evaluate their feasibility.