HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection

We present the HIT-UAV dataset, a high-altitude infrared thermal dataset for object detection applications on Unmanned Aerial Vehicles (UAVs). The dataset comprises 2,898 infrared thermal images extracted from 43,470 frames in hundreds of videos captured by UAVs in various scenarios, such as schools, parking lots, roads, and playgrounds. Moreover, the HIT-UAV provides essential flight data for each image, including flight altitude, camera perspective, date, and daylight intensity. For each image, we have manually annotated object instances with bounding boxes of two types (oriented and standard) to tackle the challenge of significant overlap of object instances in aerial images. To the best of our knowledge, the HIT-UAV is the first publicly available high-altitude UAV-based infrared thermal dataset for detecting persons and vehicles. We have trained and evaluated well-established object detection algorithms on the HIT-UAV. Our results demonstrate that the detection algorithms perform exceptionally well on the HIT-UAV compared to visual light datasets, since infrared thermal images do not contain significant irrelevant information about objects. We believe that the HIT-UAV will contribute to various UAV-based applications and research. The dataset is freely available at https://pegasus.ac.cn.


Background & Summary
Unmanned Aerial Vehicle (UAV)-based object detection algorithms are widely used in various domains such as forest inventory1, mapping applications2, traffic monitoring3, and humanitarian relief4. With the rapid development of deep learning5 and edge computing6, UAVs can now carry edge computing devices to run artificial intelligence (AI) algorithms, thereby increasing their value in the aforementioned applications. Motivated by the rapid development of object detection, several general datasets such as PASCAL VOC7, MS COCO8, and ImageNet9 have been proposed to support algorithm training and evaluation. However, unlike natural environments, aerial images contain more object instances due to the wider view, which poses greater challenges. Table 1 shows the average quantity of object bounding boxes per image for general datasets and the HIT-UAV10. Compared to general datasets, the HIT-UAV10 contains a higher average quantity of object bounding boxes. Figure 1 (a) and (b) use samples from the COCO and VisDrone datasets to show the differences between natural and aerial images.
Many aerial-perspective datasets have been introduced to help improve the detection performance of algorithms. The Stanford11, UAV12312, CARPK13, VisDrone14, and AU-AIR15 datasets provide visual light images. The ASL-TID16, BIRDSAI17, FLAME18, DroneRGBT19, DroneVehicle20, and Salient Map21 datasets provide thermal infrared images. The Salient Map dataset contains pedestrian and vehicle objects because its authors found no publicly available thermal dataset for detecting pedestrians and vehicles from the perspective of UAVs.

Figure 1 (g) and (h) show samples from the DroneRGBT and DroneVehicle datasets. The DroneRGBT dataset contains only human objects and limited angles, as it targets crowd counting.

However, although many datasets have been introduced for object detection on UAVs, several challenges remain in this field:
• Limited application range. Several extant UAV-based datasets comprise only visual light images, which limits their use during night-time operations and raises privacy concerns. As shown in Figure 2, infrared thermal cameras offer distinct advantages over visual light cameras for night-time imaging. Additionally, Figure 3 shows a sample image from the HIT-UAV10, wherein persons are represented as white blocks devoid of any personal appearance, clothing, or gender information, thus ensuring complete protection of individual privacy.
• Insufficient record information. Numerous UAV-based datasets lack critical flight information, such as altitude and camera perspective, thereby precluding researchers from investigating pertinent issues, such as the influence of these factors on detection accuracy. Table 2 shows the recorded information of different datasets.
• Non-diversified data distribution. Many UAV-based datasets focus on a narrow range of aspects, such as synthetic scenes12,17, low altitudes12,13,16,21, single scenes11,16, or specific object categories13,18-20. The limitations of synthetic scenes and low altitudes are illustrated with sample images in Figure 4. Moreover, focusing on a single scene or object category restricts the applicability of a dataset in scenarios such as multi-scene object detection and multi-category detection. To provide a comprehensive view of current UAV-based infrared thermal datasets and their drawbacks, Figure 1 (c)-(h) presents sample images from synthetic scenes, low-altitude real scenes, and high-altitude real scenes. Synthetic scenes often lack the lighting variations and details present in real scenes, which can result in poorer detection performance when models trained on synthetic scenes are applied to real scenes. Compared to low-altitude perspectives, high-altitude perspectives capture more objects and enable UAVs to scan a larger area. Additionally, flying at higher altitudes allows UAVs to pass over areas with tall buildings, making high-altitude datasets advantageous for practical tasks. These advantages highlight the importance of high-altitude datasets in expanding the application of UAVs to real-world scenarios.
To overcome the aforementioned challenges, we present the HIT-UAV10 dataset, which comprises infrared thermal images collected to expand the application range of UAVs at night. To facilitate research on diverse issues, such as the impact of UAV flight altitude and camera perspective on object detection accuracy, the HIT-UAV10 records crucial information, including flight altitude, camera perspective, daylight intensity, and image shooting date. Figure 3 shows a sample image and the recorded information of the HIT-UAV10. Covering a wide range of aspects, including higher altitudes (60 to 130 meters), different camera perspectives (30 to 90 degrees), various scenes (schools, parking lots, roads, and playgrounds), and common object categories (persons, cars, bicycles, and other vehicles), the HIT-UAV10 aims to broaden the data distribution for various tasks.
The dataset comprises 2,898 infrared thermal images extracted from 43,470 frames in hundreds of videos, and all frames were collected in public spaces and desensitized. To promote effective use of the dataset on different tasks, the HIT-UAV10 provides two types of annotated bounding boxes for each object in the images: oriented and standard. The oriented bounding box addresses the issue of significant overlap between object instances in aerial images, while the standard bounding box facilitates efficient use of the dataset. The HIT-UAV10 includes five object categories, namely Person, Car, Bicycle, OtherVehicle, and DontCare, with a total of 24,899 annotated objects. The DontCare category includes objects that could not be accurately categorized by the annotators (as further detailed in the Methods section). The dataset comprises 2,029 training images, 579 test images, and 290 validation images. To evaluate the HIT-UAV10, we trained and tested well-established object detection algorithms, namely YOLOv422, YOLOv4-tiny, Faster R-CNN23, and SSD24, on the dataset. The results show that, compared to visual light datasets, the algorithms exhibit exceptional performance on the HIT-UAV10, indicating the potential of infrared thermal datasets to significantly improve object detection applications on UAVs. Further, we analyzed the performance of YOLOv4 and YOLOv4-tiny at different altitudes and camera perspectives, yielding insightful observations to aid users in their understanding of UAV-based object detection.
To the best of our knowledge, the HIT-UAV10 is the first publicly available high-altitude UAV-based infrared thermal dataset for detecting persons and vehicles. The HIT-UAV10 has great potential to enable several research activities, such as (1) studying the application range of infrared thermal cameras in object detection tasks, (2) assessing the feasibility of UAV-based search and rescue missions at night, (3) examining the relationship between flight altitude and object detection precision on UAVs, and (4) evaluating the impact of camera perspective on UAV-based object detection.

Methods
The UAV platform selected for image capture was the DJI Matrice M210 V225, which costs approximately 10,000 US dollars. The setup of the DJI Matrice M210 V2 is detailed in Table 3. The DJI Zenmuse XT2 camera26 was mounted on the UAV to capture the images. The DJI Zenmuse XT2 features a FLIR long-wave infrared thermal camera with a resolution of 640×512 pixels and a 25 mm lens, as well as a visual camera that captures 4K video and 12 MP photos. The DJI Zenmuse XT2 camera costs approximately 8,000 US dollars.
The dataset generation pipeline comprises four stages: video capture, frame extraction and data cleaning, object annotation, and dataset generation.

Video capture
We captured videos under varying conditions, including schools, parking lots, roads, playgrounds, and more. The flight altitude ranged from 60 to 130 meters, and the camera perspective ranged from 30 to 90 degrees. We conducted flights during both daytime and nighttime. For each video, we recorded the flight altitude, camera perspective, flight date, and daylight intensity.

Frame extraction and data cleaning
There is little variation in image features between consecutive video frames, making most frames unsuitable for improving the performance of object detection models. Although many datasets retain full frames to train detection models, this approach does not address the limited feature distribution problem. Fortunately, the HIT-UAV10 provides a sufficient number of original frames (43,470) to ensure a wide distribution of features. The frame resolution is 640×512, the bit depth is 8, and the average compression rate is 21.059%. To filter out adjacent frames with little difference, we sampled one image every 15 frames (the video refresh rate is 7 FPS), resulting in 2,898 infrared thermal images.
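The sampling step described above can be sketched as follows; this is an illustrative re-creation, not the authors' released code, and the frame indexing from 0 is an assumption:

```python
def sample_frames(total_frames, step=15):
    """Keep one frame every `step` frames to discard near-duplicate
    consecutive frames (one image per 15 frames at 7 FPS)."""
    return list(range(0, total_frames, step))

# Sampling every 15th of the 43,470 original frames yields 2,898 images,
# matching the published dataset size.
kept = sample_frames(43470)
print(len(kept))  # 2898
```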

Object annotation
We annotated the objects in the dataset using two types of bounding boxes: standard and oriented. The standard bounding box is represented as (x_c, y_c, w, h), where (x_c, y_c) denotes the center coordinate and w and h denote the width and height of the bounding box, respectively. However, accurately labeling objects in aerial images from the perspective of UAVs can be challenging. To address this issue, we used the θ-based oriented bounding box27 to label object instances. The oriented bounding box is represented as (x_c, y_c, w, h, θ), where θ denotes the angle of rotation from the horizontal direction of the standard bounding box. As shown in Figure 5 (a), the overlap of standard bounding boxes can be significant, making it difficult for state-of-the-art object detection algorithms to distinguish them. Using oriented bounding boxes annotates the objects accurately and resolves this issue, as shown in Figure 5 (b). Note that bounding boxes on the image boundary are standard because an oriented bounding box cannot exceed the edge. One drawback of oriented bounding boxes is that few native object detection algorithms support training with them. To help users utilize the dataset, we provide both oriented and standard bounding box annotation files. We manually annotated oriented object bounding boxes for all images using a modified version of the LabelImg tool. Difficult and truncated object instances were also labeled. Three individuals were involved in the annotation process, and each annotation was verified by the others. To facilitate the use of the dataset, we developed a tool to convert oriented bounding boxes to standard bounding boxes. The conversion method is as follows: first, we obtain the minimum and maximum x and y coordinates (x_min, x_max, y_min, y_max) of the oriented bounding box; then we use (x_min, x_max, y_min, y_max) as the boundary of the standard bounding box, whose center coordinate is x_c = (x_min + x_max)/2 and y_c = (y_min + y_max)/2, and whose width and height are w = x_max − x_min and h = y_max − y_min, respectively.
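The corner-based conversion described above can be sketched in a few lines. This is an illustrative implementation rather than the authors' released tool, and it assumes θ is given in degrees measured from the horizontal:

```python
import math

def oriented_to_standard(xc, yc, w, h, theta_deg):
    """Convert an oriented box (xc, yc, w, h, theta) into the axis-aligned
    standard box that bounds its four rotated corners."""
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Corner offsets of the unrotated box, relative to its center.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner by theta, then translate back to image coordinates.
    xs = [xc + dx * cos_t - dy * sin_t for dx, dy in offsets]
    ys = [yc + dx * sin_t + dy * cos_t for dx, dy in offsets]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    # Build the standard box from the boundary, as described in the text.
    return ((x_min + x_max) / 2, (y_min + y_max) / 2,
            x_max - x_min, y_max - y_min)
```

With θ = 0 the conversion is the identity, and with θ = 90 degrees the width and height of the resulting standard box are swapped.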

Dataset generation
We developed a dataset generation tool whose functions include XML and JSON label file generation and dataset splitting. The original images were organized into different folders based on flight data, and the tool generated XML and JSON label files corresponding to each image. To facilitate object detection model training, we split the dataset into training, test, and validation sets with a ratio of 70%, 20%, and 10%, respectively, using the hold-out method28.
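A minimal sketch of such a hold-out split follows; the seed and the rounding of the cut points are assumptions, so the exact counts may differ slightly from the published 2,029/579/290 split:

```python
import random

def holdout_split(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle once, then cut the list into train/test/validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

train, test, val = holdout_split(range(2898))
print(len(train), len(test), len(val))
```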

Data Records
The dataset is available at Zenodo 10 .

Folder structure and recording format
We offer two types of annotation files: XML files based on the PASCAL VOC dataset format and JSON files based on the MS COCO dataset format. Both formats are commonly used benchmarks for object detection in computer vision. The top-level folder of our dataset includes four subfolders: normal_json, normal_xml, rotate_json, and rotate_xml. The normal_json and normal_xml folders contain annotation files with standard bounding boxes in JSON and XML formats, respectively, while the rotate_json and rotate_xml folders contain annotation files with oriented bounding boxes in JSON and XML formats, respectively. The image files are named according to the format T_HH(H)_AA_W_NNNNN, where T indicates the shooting time (0 for day, 1 for night), HH(H) indicates the flight altitude (60 to 130 meters), AA denotes the camera perspective (30 to 90 degrees), W indicates the weather condition (only images captured under no-rain conditions were included in the dataset), and NNNNN denotes the serial number of the image.
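The naming scheme above can be parsed programmatically. The sketch below is illustrative; the file extension and the returned field names are assumptions:

```python
def parse_image_name(filename):
    """Extract the recorded flight data from a name of the form
    T_HH(H)_AA_W_NNNNN (extension stripped if present)."""
    stem = filename.rsplit(".", 1)[0]
    t, altitude, angle, weather, serial = stem.split("_")
    return {
        "time": "night" if t == "1" else "day",
        "altitude_m": int(altitude),       # 60-130 meters
        "perspective_deg": int(angle),     # 30-90 degrees
        "weather": int(weather),           # only no-rain images are included
        "serial": int(serial),
    }

print(parse_image_name("1_80_30_0_00001.jpg"))
```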

Properties
The annotated object categories include four types that frequently appear in search and rescue missions: Person, Car, Bicycle, and OtherVehicle. In addition, we labeled unrecognizable objects as DontCare, because annotators cannot identify the specific type of many objects in high-altitude aerial images. As shown in Figure 5 (c), the red box represents a DontCare object; in this area, it is difficult to accurately determine whether the objects are persons. The DontCare category thus marks easily confused objects in the image.
Figure 6 (a) shows the distribution of annotations across object categories. The main object of interest for rescue missions (Person) appears more often than other objects. Additionally, the presence of a substantial number of Car and Bicycle objects makes the HIT-UAV10 suitable for a wide range of common tasks. To enhance the versatility of the dataset for high-altitude missions, flight altitudes were recorded in intervals of 10 meters, ranging from 60 to 130 meters; this information is depicted in Figure 6 (b). The camera perspectives were recorded in increments of 10 degrees, varying from 30 to 90 degrees, as shown in Figure 6 (c). Infrared thermal images differ significantly between day and night due to the higher background temperature during the day. As shown in Figure 7, objects are easier to identify in night-time infrared thermal images than in daytime images because the background temperature is lower at night. To increase the diversity of the dataset, infrared thermal images were collected both during the day and at night, as presented in Figure 6 (d).
Figure 6 (e) and (f) present the distribution of instances of each category across flight altitudes and camera perspectives, respectively. The average pixel size of each category across flight altitudes is depicted in Figure 8 (a). Theoretically, the average pixel size should decrease with increasing altitude, since higher altitudes result in smaller object sizes. However, for the OtherVehicle category, the fluctuations are large due to its limited number of instances, which amplifies the influence of truncated objects on the results. The average pixel size of the remaining categories generally decreases with altitude, with slight fluctuations due to differences in image coverage at different altitudes and angles. The average pixel size of each category across camera perspectives is shown in Figure 8 (b), where the average pixel size first increases and then decreases with increasing angle. This is because objects occupy more pixels as the field of view narrows, but their visible surface area decreases at steeper angles. Figure 9 illustrates these visual changes.

Technical Validation
We trained four well-established object detection algorithms, namely YOLOv4, YOLOv4-tiny, Faster R-CNN, and SSD, on the HIT-UAV10. The dataset consists of 2,029 training images, 290 validation images, and 579 test images. The experiments were performed on an RTX 2080Ti GPU. YOLOv4 and YOLOv4-tiny were trained using the Darknet framework, while Faster R-CNN (with a ResNet-101 backbone) and SSD-512 were trained using the MMDetection29 framework. The pre-trained models for YOLOv4 and YOLOv4-tiny were obtained from official sources. Training was performed for a maximum of 10,000 steps, with a batch size of 64 and a subdivision of 16. The learning rate was set to 0.0013 and was multiplied by 0.1 at steps 8,000 and 9,000. The momentum and weight decay were set to 0.949 and 0.0005, respectively. For Faster R-CNN and SSD, the official ResNet-101 and VGG16 models were used as pre-trained models. The maximum number of epochs was 32, with a batch size of 16. The learning rate was set to 0.02, with a warm-up ratio of 0.001 and 500 warm-up iterations. The momentum and weight decay were set to 0.9 and 0.0001, respectively. Table 4 presents the precision of the aforementioned models on the HIT-UAV10 test set, as well as the precision of YOLOv4 and YOLOv4-tiny trained on the COCO dataset and the highest accuracy (attained by RRNet) in the VisDrone-2019 challenge30. Our observations indicate that the Average Precision (AP) for the Person category is significantly lower when using YOLOv4-tiny on the HIT-UAV10. This discrepancy may be attributed to the lower detection capability of YOLOv4-tiny for small objects in comparison to other models. Additionally, the AP for the OtherVehicle category is subpar, which may be due to category imbalance; the SSD-512 model performs better on this imbalanced category. In the VisDrone-2019 challenge, the highest precision of 55.82% mean Average Precision (mAP) was achieved by the RRNet method, whereas the official YOLOv4 model achieved 65.7% mAP on the COCO dataset, surpassing RRNet in the VisDrone challenge. This indicates that aerial image information is more complex than that of natural images. Finally, on the HIT-UAV10, YOLOv4 achieved an mAP of 84.75%, which suggests the following observations:
• Infrared thermal images effectively filter out extraneous information, leading to improved object identification.
• Infrared thermal images enable outstanding performance from common detection models with limited image data, due to the easily recognizable features of objects in such images. The HIT-UAV10 thus has the potential to facilitate the detection of vehicles and persons by UAVs.
We used YOLOv4 and YOLOv4-tiny as examples to study the impact of altitude and camera perspective on UAV-based object detection. The Person and Car categories were selected for this experiment, because the OtherVehicle and Bicycle categories have a limited number of objects in the HIT-UAV10, which would cause fluctuations in the statistical results. The results of the study are shown in Figure 10. The following observations and insights were gleaned from the results:
• The AP of YOLOv4 remains stable within a certain range, suggesting that variations in altitude and angle do not significantly impact the detection performance of robust algorithms.
• The AP of YOLOv4-tiny for the Person category tends to decrease with increasing altitude. This decrease occurs in three stages (60 m to 80 m, 80 m to 90 m, and 100 m to 130 m), suggesting that the detection performance of lightweight algorithms is significantly impacted when objects fall outside a certain size range. Higher altitudes provide a wider field of view, enabling UAVs to cover larger areas within the same flight time. In some UAV tasks, such as person rescue, users may need to weigh the trade-off between detection precision and altitude to achieve optimal performance.
• The AP of YOLOv4-tiny for the Person category first increases and then decreases with increasing camera angle. This result highlights the impact of the visible surface of objects on detection precision. At 90 degrees, as shown in Figure 9 (c), individuals appear as points, making them more challenging to identify than at 50 degrees. It is therefore crucial for users to choose an appropriate camera perspective when performing object detection tasks.
Sample detection results of the YOLOv4 model trained on the HIT-UAV10 are shown in Figure 11. The results demonstrate that the model effectively recognizes objects in infrared thermal aerial images. We hope the HIT-UAV10 will promote the development of UAV-based object detection tasks.

Usage Notes
The HIT-UAV10 is available at https://github.com/suojiashun/HIT-UAV-Infrared-Thermal-Dataset. Users can download the dataset to train object detection algorithms. The PASCAL VOC and MS COCO formats are widely used benchmarks for object detection, and we provide label files in both formats, so users can easily work with the HIT-UAV10.
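As a sketch of how the JSON label files can be consumed, the helper below groups annotations by image. It assumes the files follow the standard MS COCO schema (`categories`, `annotations`, `image_id`, `bbox`), which the dataset's COCO-format files are expected to match:

```python
import json

def load_coco_labels(json_path):
    """Group COCO-format annotations by image id as (category, bbox) pairs."""
    with open(json_path) as f:
        coco = json.load(f)
    # Map numeric category ids to readable names (e.g. Person, Car).
    names = {c["id"]: c["name"] for c in coco["categories"]}
    by_image = {}
    for ann in coco["annotations"]:
        by_image.setdefault(ann["image_id"], []).append(
            (names[ann["category_id"]], ann["bbox"]))
    return by_image
```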
The HIT-UAV10 was collected in a diverse range of environments, including schools, parking lots, roads, and playgrounds. This allows trained object detection models to be applied to these scenarios, as well as to other environments through the generalization capabilities of deep learning. Researchers can use the HIT-UAV10 to train object detection models and investigate the application range of infrared thermal imaging in different object detection tasks. Additionally, the trained models can potentially be employed in UAV-based search and rescue missions at night to evaluate their feasibility.
Figure 1 (c)-(f) show samples from the ASL-TID, BIRDSAI, FLAME, and Salient Map datasets. The ASL-TID dataset has only a very low altitude and a small number of objects. The FLAME dataset contains only forest fire objects. The BIRDSAI dataset contains only animal and human objects. The Salient Map dataset contains only a few objects due to its low altitudes. The DroneVehicle dataset contains only limited categories (vehicles), altitudes (80 m, 100 m, and 120 m), and angles (15°, 35°, and 45°).

Figure 3. A sample image and the recorded information of the HIT-UAV.

Figure 4. Sample images from (a) a synthetic scene (UAV123 dataset), (b) a low-altitude real scene, and (c) a high-altitude real scene. Synthetic scenes often lack the lighting variations and details present in real scenes, which can result in poorer detection performance when models trained on synthetic scenes are applied to real scenes. Compared to low-altitude perspectives, high-altitude perspectives capture more objects and enable UAVs to scan a larger area. Additionally, flying at higher altitudes allows UAVs to pass over areas with tall buildings, making high-altitude datasets advantageous for practical tasks. These advantages highlight the importance of high-altitude datasets in expanding the application of UAVs to real-world scenarios.

Figure 5. Samples of the standard bounding box, oriented bounding box, and DontCare object. Oriented bounding boxes overlap less than standard bounding boxes. In (c), the red box represents a DontCare object; it is difficult to accurately identify whether the objects in this area are people.

Figure 6. The data distribution of the HIT-UAV.

Figure 7. Sample night and day images.
Figure 8. The average pixel size of object categories across flight altitudes and camera perspectives.

Figure 9. Sample images taken at 80 meters with varying camera perspectives. At 30 degrees, distant objects appear smaller due to the wider field of view; at 50 degrees, objects appear larger; at 90 degrees, objects become smaller again due to the reduced visible surface area.

Figure 10. The Average Precision (AP) of categories in the HIT-UAV test set.

Figure 11. Sample detection results of YOLOv4.

Table 1. The average bounding box (Avg. Bbox) quantity per image for general datasets and the HIT-UAV.

Table 2. The recorded information of different datasets.

Table 4. The Average Precision (AP) of the baseline models.