Drone-Person Tracking in Uniform Appearance Crowd: A New Dataset

Drone-person tracking in uniform appearance crowds poses unique challenges due to the difficulty of distinguishing individuals with similar attire and under multi-scale variations. To address this issue and facilitate the development of effective tracking algorithms, we present a novel dataset named D-PTUAC (Drone-Person Tracking in Uniform Appearance Crowd). The dataset comprises 138 sequences with over 121 K frames, each manually annotated with bounding boxes and attributes. During dataset creation, we carefully consider 18 challenging attributes encompassing a wide range of viewpoints and scene complexities. These attributes are annotated to facilitate attribute-wise performance analysis. Extensive experiments are conducted using 44 state-of-the-art (SOTA) trackers, and the performance gap between visual object trackers on existing benchmarks and on our proposed dataset demonstrates the need for a dedicated end-to-end aerial visual object tracker that accounts for the inherent properties of the aerial environment.


Background & Summary
The advancement of Unmanned Aerial Vehicles (UAVs), commonly known as drones, has significantly improved security surveillance capabilities. Drones excel at tracking and pinpointing individuals of interest, rendering person-following and tracking systems 1 invaluable in domains like surveillance 2 , search and rescue missions 3 , and healthcare 4,5 . These systems leverage Visual Object Tracking (VOT) techniques, which involve locating and estimating the trajectory of specific objects within a sequence of consecutive frames 6,7 . VOT finds applications in various fields, including autonomous vehicles 8 , robotics 9,10 , and robot-assisted person following 10 . In VOT, a fundamental challenge is to learn an appearance model from the initial state of a target object, which is essential for locating the target object in subsequent frames 6,7 . This challenge becomes particularly pronounced in the presence of similar appearance distractors 6 .
A reliable visual object tracker is vital for the effectiveness of a vision-based drone-person following system in the face of numerous challenges. To train a robust tracker capable of excelling in diverse scenarios, it must be exposed to various challenging tracking scenarios. In recent years, several large-scale object tracking datasets have been released, such as UAV123 11 , OTB100 12 , VOT2018 13 , TrackingNet 14 , LaSOT 15 , GOT-10k 16 , and LaTOT 17 , which cover diverse real-world tracking challenges. However, none of these datasets specifically cover person tracking in uniform appearance environments. Such settings are common in regions like the Gulf and many parts of Asia and pose unique tracking challenges due to similar clothing. Introducing a dataset for uniform appearance tracking can therefore be a valuable addition to the computer vision community. A recent study 18 introduced a dataset named PTUA, which focuses on ground robot-person tracking in a uniform dressing environment. However, this dataset has several limitations in the settings chosen while recording it. Firstly, it only considers a maximum person density of four, which does not reflect a truly crowded scenario. Secondly, the dataset was captured in a controlled environment, which does not accurately simulate the challenges of real-world tracking scenarios. Thirdly, the dataset does not simulate scenarios where a person may be imitating intruder behavior, such as moving quickly to evade a robot, which is an essential factor in assessing the robustness of a tracking algorithm in a person tracking scenario.
In scenarios that involve tracking individuals using drones in a uniform appearance crowd, the task of maintaining the trajectory of a designated target becomes notably challenging. This challenge arises from the complexities introduced by drone-based capturing and the substantial presence of uniform appearance distractors.
The combined impact of these challenges creates obstacles in effectively tracking the intended object, thereby rendering it more intricate compared to tracking tasks in other datasets, as visually illustrated in Fig. 1, and as comprehensively discussed in "Technical Validation" section.
To address the above gaps, we introduce the Drone-Person Tracking in Uniform Appearance Crowd (D-PTUAC) dataset 19 for uniform-clothed crowds. The dataset also stands out in having a target person behaving as an intruder to single them out from the crowd. The sequences were collected by controlling a camera-equipped drone, specifically a DJI Mavic 3 Pro (https://www.dji.com/ae/mavic-3-pro), with a wireless manual controller using the DJI GO 4 Android application (https://www.dji.com/ae/downloads/djiapp/dji-go-4) to follow a target person among a crowd wearing the same attire. To enhance the diversity of the dataset, we performed the collection under different challenging scenarios such as Uniformity (UF), Abrupt Appearance Change (AAC), Background Clutter (BC), Aspect Ratio Change (ARC), Scale Variation (SV), Low Resolution (LR), Rotation (ROT), Pose Variation (PV), Occlusion (OCC), Out-of-View (OV), Short-Term (ST), Long-Term (LT), Motion Blur (MB), Fast Motion (FM), Illumination Variations (IV), Weather Conditions (WC), Crowd Densities (CD), Deformation (DEF), and Surveillance Settings (SS). Figure 2 illustrates sample images taken from the proposed D-PTUAC dataset. The evaluation of RGB trackers revealed a substantial performance decline, particularly in the presence of LR and BC, as illustrated in Fig. 3. This decline can be attributed to the nature of the dataset captured by a drone, where the subjects tend to be low resolution and wear clothing with a uniform appearance, intensifying the challenges of UF and BC. Additionally, to highlight the challenges of drone-based uniform appearance crowd tracking, we show that previous frameworks that rely on estimated depth fusion and segmentation fail on our dataset. For this purpose, we employed two frameworks: MiDaS for monocular depth estimation 20 to generate RGB-D data and ViT-B SAM 21 to generate segmentation masks for tracking. The dataset 19 is available for access on Figshare at (https://doi.org/10.6084/m9.figshare.24590568.v2).

Methods
Human subjects. Our study involving human subjects was approved by the Research Ethics Committee of Khalifa University of Science and Technology (IRB protocol number H23-029), ensuring adherence to ethical standards in research. Following this approval, we specifically targeted the Khalifa University community for participant recruitment, encompassing students, staff, faculty, and local residents in Abu Dhabi, United Arab Emirates. This effort successfully engaged approximately 40-50 subjects, all of whom voluntarily participated in the construction of the dataset. The Ethics office, in collaboration with the investigators, played a key role in disseminating detailed information about the study and obtaining informed written consent from all participants. The inclusion criteria for our participant cohort were clearly defined: individuals aged 18 years and above, of any sex, either UAE nationals or residents, who were capable of understanding and providing consent. Exclusion criteria included individuals below 18 years of age and those unwilling or unable to give consent. Importantly, participants were explicitly informed that their likenesses captured in videos and images would be shared as part of an open-access dataset, thus ensuring their full awareness and understanding of the extent of their involvement and how their data would be utilized in the research community.

Proposed D-PTUAC dataset.
To construct a benchmark dataset tailored for drone-person tracking scenarios, we conducted RGB video recordings using a DJI Mavic 3 Pro drone. During these recordings, the drone was manually operated to track and follow a designated individual within a group. This approach allowed us to capture a range of typical drone navigation characteristics and challenges, including ego-motion, MB, and occurrences of OCC. Subsequent sections provide detailed insights into the dataset construction process.
Fig. 1 A visual performance evaluation encompassing a comparative analysis of six state-of-the-art trackers, specifically STARK-ST50 25 , ToMP50 22 , KeepTrack 46 , MixFormer-CvT 37 , OSTrack384-NeighborTrack 47 , and OSTrack384 41 . This evaluation is carried out across seven distinct datasets, followed by a direct comparison with our proposed D-PTUAC dataset. It is noteworthy that the results underscore a noticeable decline in the performance of the state-of-the-art trackers when applied to our D-PTUAC dataset, in sharp contrast to their performance on the other seven datasets.
Surveillance settings. The D-PTUAC dataset comprises videos for dynamic and static SS to simulate real-world scenarios. Sample frames extracted from these videos are visually depicted in Fig. 2. Below, we provide a comprehensive overview of the specific applications and dataset particulars pertaining to these two distinct SS.
• Dynamic surveillance involves actively monitoring a particular subject or group of subjects using a moving drone. The D-PTUAC dataset features 88 videos specifically recorded for dynamic surveillance. Participants were instructed to walk from point A to point B while the drone captured their movement from both the front and back views, although not simultaneously. The drone closely monitors the subject by flying a few meters ahead of them, and user cooperation is not necessary, as the drone is designed to follow the subject's movements via manual control using the DJI GO application.
• Static surveillance involves using a static drone to monitor an area or event without a specific focus on any particular subject or object. The D-PTUAC dataset includes 50 videos captured for static surveillance, each featuring participants instructed to move around, engage in discussions, or walk within a designated area while the drone captured the entire scene. Unlike dynamic surveillance, the drone's movement is kept static, and its objective is to record the events of the specified region rather than monitor specific individuals. This approach simulates real-world surveillance scenarios, making the D-PTUAC dataset a challenging and valuable resource for drone-based tracking and understanding of human activities.
Dataset setup. The D-PTUAC dataset consists of twenty-four distinct settings, offering variation in terms of SS, angle of view, CD, and times of capture. The dataset was meticulously collected on an outdoor tennis court located at Khalifa University in Abu Dhabi, United Arab Emirates. The tennis court has dimensions of 20×10×4 meters.
The data collection spanned two seasons, Fall and Spring, with diverse crowds to enhance dataset diversity. In the Fall season, Crowd 1 videos were recorded during both morning and evening sessions, whereas in the Spring season, Crowd 2 videos were exclusively captured during the morning hours, except for one instance characterized by rainy weather conditions. The dataset introduces three CD categories, namely sparse, medium, and compact. For each CD category, participants were recorded from both front and back views, resulting in a total of 87 videos recorded in the morning across the CD and SS combinations, along with an additional 51 videos captured in the evening. It is worth noting that the morning videos possess high illumination, while the evening videos exhibit relatively lower illumination levels. To ensure sufficient lighting in the evening videos, floodlights were employed.
Figure 4 visually illustrates scenarios involving a crowd with an intruder among them. These scenarios are constructed under the assumption that the individual to be tracked is an intruder, and the objective is for the drone video tracker to effectively follow their movements as they navigate through and within the crowd, occasionally attempting to evade the tracker's surveillance.
4. Scenario 4 (S4): The uniform appearance crowd moves randomly within the tennis court area, while the drone tries to follow the intruder. The intruder attempts to overlap with the crowd to confuse the tracker and hide. This scenario represents a dynamic SS, as the drone follows the intruder throughout the video.
5. Scenario 5 (S5): The uniform appearance crowd is instructed to move randomly within a confined area of the tennis court. Meanwhile, the intruder moves in a circular path towards the drone. Once the intruder notices the drone, they immediately join the crowd and attempt to overlap with them to confuse the drone tracker and evade detection. The drone tries to follow the intruder throughout the video, while the intruder tries to hide within the crowd.
The D-PTUAC dataset contains repeated videos, and this is intentional due to the division of each crowd into groups of intruders. Each group participated in one of the five scenarios mentioned earlier, leading to multiple videos of the same scenario but with different intruders. The collection process of the D-PTUAC dataset is described in Algorithm 1. In terms of subject demographics, all individuals captured in the videos fall within the age range of 20 to 35 years. Additionally, a subset of videos includes two participants who exceed the age of 35 years. For each specific combination of settings, a single group of 2-5 intruders appears in only one video, resulting in a total of 138 videos across the twenty-four combinations. Detailed statistics of the D-PTUAC dataset can be found in Table 1, which includes over 76 K frames for dynamic SS and over 44 K frames for static SS.
The dataset also encompasses high-resolution gallery images of each subject captured in constrained settings using a 12-megapixel smartphone. These gallery images were taken in optimal lighting conditions and encompass four distinct poses. The video scenarios themselves were captured at a frame rate of 30 Frames Per Second (FPS) and a resolution of 3840×2160 pixels using the DJI Mavic 3 Pro drone. The gallery images serve multiple purposes, including the development of a facial identification system capable of recognizing intruders in aerial images, even when the facial area is limited to only a few pixels. Furthermore, these images are integral to a comprehensive tracking framework. Initially, a face detector identifies the first bounding box of a specific person, which is then confirmed by a face recognizer. This information is subsequently passed on to the tracker to initiate the tracking process.
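The detect-recognize-track initialization described above can be sketched as a simple pipeline. This is an illustrative sketch, not code released with D-PTUAC: `detect_faces`, `recognize`, and `start_tracker` are hypothetical callables standing in for a real face detector, a gallery-based face recognizer, and a tracker.

```python
def init_track_from_gallery(frames, detect_faces, recognize, start_tracker):
    """Hypothetical initialization pipeline: a face detector proposes
    boxes, a recognizer matched against the gallery images confirms the
    intruder, and the confirmed box seeds the tracker.  Returns whatever
    the tracker's start routine returns, or None if no match is found."""
    for idx, frame in enumerate(frames):
        for box in detect_faces(frame):        # candidate face boxes
            if recognize(frame, box):          # gallery match confirms identity
                return start_tracker(frame, box, idx)
    return None                                # intruder never confirmed
```

The tracker is only started once the recognizer has confirmed the identity, which matches the framework's intent of tracking a specific person rather than any detected face.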
Video annotations. In the annotation process for the D-PTUAC dataset, the heads of the subjects were selected as the most suitable body part for annotation. This choice was made to address the challenges presented by subject overlap and OCC. Precisely defined bounding boxes were employed to encompass the visible region of the heads, taking into consideration the perspective from the drone.
To assess the quality of head annotations in the D-PTUAC dataset, a comparison was made against full-body annotations using twenty video sequences and a pretrained ToMP50 tracker 22 . The head-tracking success rate achieved 55.29%, significantly outperforming the full-body tracking success rate of 11.09%. This outcome underscores the superiority of head annotations for the specific tracking tasks in this dataset.
The annotation process for the dataset was meticulous, involving a team of experienced individuals with expertise in VOT. Manual annotation was performed using the Computer Vision Annotation Tool (CVAT) (https://www.cvat.ai/) with precise attribute labels assigned. The process underwent three stages of scrutiny and refinement to ensure the annotations' high quality. Several challenges were encountered during the annotation process, including small head sizes in the video frames, OCC, MB, LR, UF, and BC. Addressing these challenges necessitated re-annotation for approximately 80% of the dataset.
Annotating only the head of the target person in the D-PTUAC dataset results in small bounding boxes of size 64×64 pixels, as depicted in Fig. 5, leading to challenging LR samples. These small targets lack sufficient appearance information and pose difficulties for deep networks, which produce weak features when directly processing LR regions. Enlarging the regions introduces blur and sawtooth artifacts, compromising LR image representation and increasing computational costs. Additionally, when LR objects occupy a small portion of the image, they are vulnerable to interference from background objects and noise. These combined challenges impede the localization and discrimination capabilities of general visual tracking networks when dealing with LR objects 17 .
Dataset tracking attributes. Drone-person tracking scenarios often present various challenging factors, many of which have been intentionally incorporated into the scenarios described in the "Dataset Setup" section. The tracking attributes in the D-PTUAC dataset, as shown in Fig. 6, can be classified into two groups: controlled and implicitly inherited. This study focuses on four crucial video-level controlled attributes that are relevant to aerial environments and significantly impact tracking algorithm performance in aerially captured video sequences. These attributes are summarized as follows:
1. Abrupt Appearance Change: It describes sudden and significant changes in the appearance of the tracked object. In the scenarios discussed earlier, such as those in S3 (Fig. 4a) and S5 (Fig. 4e), the intruder's zigzag movement and blending into the crowd can cause AAC. This results in instances where the initially tracked region corresponds to the front of the head but subsequently shifts to the back of the head.
Additionally, 14 implicit attributes in the dataset were inherited from nuisance and distraction factors. These attributes arise from factors that were not explicitly controlled or introduced by humans but were inherent in the dataset recordings. They are briefly described below:
1. Uniformity: The D-PTUAC dataset exhibits a unique characteristic wherein all individuals, including the target and distractors, wear a white dress and headscarf throughout the recording, as depicted in Fig. 2a-e. This feature distinguishes our proposed dataset from others, such as the recent robot-person tracking dataset 18 , which includes a similar attribute but with fewer people. In our dataset, scenes consist of 15-30 people, making individuals appear as moving blobs without visible leg movements. The scenes also feature two to five intruders who subsequently join the crowd, augmenting the dynamic nature and appearance of the scene.
To determine LR, we follow the method proposed in 17 , which calculates the object's relative size by dividing its bounding box area by the image area in each frame. The average relative size is then computed across all frames within each video sequence. We set the average relative size threshold at 1%, as suggested in 17 . However, to prevent misclassification of larger object sequences as LR sequences, we incorporate the concept of average absolute size, with a threshold set at 22×22 pixels. For a video sequence to be classified as LR, both the average absolute and relative sizes must be below these thresholds. This dual-criteria approach enhances the accuracy of LR sequence identification in our study.
Frameworks for using RGB-D and segmentation-based trackers. RGB-D framework. To comprehensively evaluate different categories of trackers, it is important to consider RGB-D trackers that rely on both RGB and depth data for fusion algorithms. However, since the D-PTUAC dataset is captured solely with an RGB camera, we have developed a framework to generate depth information from RGB data. The RGB-D framework utilizes monocular depth estimation techniques, leveraging the MiDaS network 20 . Specifically, we employed the DPT-Swin2-Tiny-256 network, which balances accurate depth estimation and real-time inference on embedded devices, achieving a framerate of 90 FPS 20 . This choice is particularly important for deploying the framework on resource-constrained systems such as drones.
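The dual-criteria LR classification described above can be expressed compactly. The sketch below is an illustrative implementation (not code released with the dataset), assuming per-frame bounding boxes in (x, y, w, h) pixel format:

```python
import numpy as np

def is_low_resolution(boxes, image_area, rel_thresh=0.01, abs_thresh=22 * 22):
    """Classify a sequence as Low Resolution (LR) using the dual criteria:
    both the average relative box size (box area / image area) and the
    average absolute box size must fall below their thresholds
    (1% and 22x22 pixels, respectively)."""
    boxes = np.asarray(boxes, dtype=float)  # one (x, y, w, h) row per frame
    areas = boxes[:, 2] * boxes[:, 3]       # absolute box area per frame
    avg_abs = areas.mean()                  # average absolute size
    avg_rel = (areas / image_area).mean()   # average relative size
    return bool(avg_rel < rel_thresh and avg_abs < abs_thresh)
```

For example, a sequence of 20×20-pixel head boxes on 3840×2160 frames satisfies both criteria and is labeled LR, while a sequence of 500×500-pixel boxes fails the relative-size criterion.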
Segmentation mask generation framework. Given the uniqueness of our dataset, a segmentation model capable of generating masks for various objects, with a focus on the head in our case, is required. To address this, we have utilized the SAM model 21 , which has been trained on a large dataset of over 1 billion masks derived from 11 million licensed and privacy-respecting images. The extensive training enables the SAM model to generalize well and accurately segment specific targets, such as the head of the person of interest. Among the available options in the SAM model, we have chosen the ViT-base model due to its lightweight nature in terms of parameters and floating point operations (FLOPs), as well as its fast inference speed compared to other models discussed in the paper 21 .

Data records
The D-PTUAC dataset 19 has been made available for public download through Figshare at (https://doi.org/10.6084/m9.figshare.24590568.v2). Access to the data does not necessitate any registration process. The dataset occupies a combined storage of 15.01 gigabytes. The folder structure of the dataset and its associated files is described below.
Folder structure. An overview of the D-PTUAC dataset is given in Fig. 7. The dataset's root directory, labeled

Technical Validation
We conducted a comprehensive performance evaluation of existing state-of-the-art (SOTA) trackers on our proposed D-PTUAC dataset, which included attribute-wise analysis to test the trackers' robustness against specific challenges. To further enhance the tracking performance, we finetuned 10 high-quality SOTA trackers on a training split of the D-PTUAC dataset. All experiments were conducted on a workstation equipped with one Nvidia GeForce RTX 3080 GPU, an 11th Gen Intel 2.3 GHz CPU, 32GB RAM, and 8GB VRAM. We used the official source codes provided by the respective authors to implement all trackers.
Evaluation metrics. To evaluate the trackers, we used the popular One-Pass Evaluation (OPE) protocols proposed by OTB 12 and LaSOT 15 to measure Success Rate (SR), Precision Rate (PR), and Normalized Precision Rate (NPR).
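As an illustration of the OPE protocol, the sketch below initializes a tracker once on the first ground-truth box and then runs it through the remaining frames without re-initialization. `tracker_init` and `tracker_update` are hypothetical callables standing in for a real tracker's interface, not part of any benchmark toolkit:

```python
def one_pass_evaluation(tracker_init, tracker_update, frames, gt_boxes):
    """One-Pass Evaluation (OPE): the tracker is initialized once with the
    ground-truth box of the first frame and then run through the rest of
    the sequence without re-initialization.  Per-frame predictions are
    returned for computing SR/PR/NPR against the ground truth."""
    state = tracker_init(frames[0], gt_boxes[0])   # init from first GT box
    preds = [gt_boxes[0]]                          # first frame is given
    for frame in frames[1:]:
        state, box = tracker_update(state, frame)  # no re-initialization
        preds.append(box)
    return preds
```

Because there is no re-initialization, a tracker that loses the target in an early frame is penalized for every subsequent frame, which is exactly what makes OPE sensitive to the distractor-heavy sequences in D-PTUAC.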
The SR is calculated from the Intersection over Union (IoU) between the tracker's predicted bounding box, boxP, and the ground-truth bounding box, boxG:

IoU = |boxP ∩ boxG| / |boxP ∪ boxG|
Tracking algorithms are ranked based on their SR, which is determined by the Area Under the Curve (AUC) ranging from 0 to 1.A higher AUC indicates a better success rate for the tracker.The ranking is done from the worst to the best-performing tracker.
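The SR computation can be sketched as follows, assuming (x, y, w, h) boxes and a uniform 21-point overlap-threshold grid for the AUC (the exact threshold grid used by specific benchmark toolkits may differ):

```python
import numpy as np

def iou(box_p, box_g):
    """Intersection over Union of two (x, y, w, h) boxes."""
    x1 = max(box_p[0], box_g[0])
    y1 = max(box_p[1], box_g[1])
    x2 = min(box_p[0] + box_p[2], box_g[0] + box_g[2])
    y2 = min(box_p[1] + box_p[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_p[2] * box_p[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(ious, thresholds=np.linspace(0, 1, 21)):
    """Success curve: fraction of frames whose IoU exceeds each overlap
    threshold; the tracker's score is the mean over thresholds (AUC)."""
    ious = np.asarray(ious, dtype=float)
    curve = [(ious > t).mean() for t in thresholds]
    return float(np.mean(curve))
```

A perfect prediction gives IoU 1.0 for every frame, pushing the AUC towards 1; a tracker that never overlaps the ground truth scores 0.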
In general, the PR is computed from the distance between the center of the ground-truth bounding box and the center of the predicted bounding box generated by the tracker. Following 12 , the center location error is defined as ||C_P − C_G||_2, where C_P and C_G denote the centers of the predicted and ground-truth boxes, respectively. The ranking of trackers is determined by varying threshold values from 0 to 20 pixels in this measure, and those with a higher PR are considered to have better performance.
To address the sensitivity of the PR measurement to image resolution and bounding box sizes, we incorporated the NPR measurement. Following 14 , the center error is normalized by the ground-truth bounding box dimensions: the NPR error is ||W^{-1}(C_P − C_G)||_2, where W = diag(bb_w, bb_h) contains the width and height of the ground-truth box. Trackers are ranked based on the AUC for NPR threshold values ranging from 0 to 0.5. A higher NPR score indicates better performance of the tracker.
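The two center-error quantities can be sketched as follows, assuming (x, y, w, h) boxes; the size normalization follows the TrackingNet definition cited in the text:

```python
import numpy as np

def center(box):
    """Center point of an (x, y, w, h) box."""
    return np.array([box[0] + box[2] / 2.0, box[1] + box[3] / 2.0])

def precision_error(box_p, box_g):
    """Center location error in pixels: the quantity thresholded at
    0..20 px for the Precision Rate (PR)."""
    return float(np.linalg.norm(center(box_p) - center(box_g)))

def normalized_precision_error(box_p, box_g):
    """Center error normalized by the ground-truth box width and height:
    the quantity thresholded at 0..0.5 for the Normalized Precision
    Rate (NPR)."""
    diff = center(box_p) - center(box_g)
    return float(np.linalg.norm(diff / np.array([box_g[2], box_g[3]])))
```

Normalizing by the ground-truth box size makes the metric comparable across the small head boxes of D-PTUAC and the much larger targets of conventional benchmarks.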
Baseline trackers. The evaluation of the dataset involved a careful selection of representative SOTA baseline trackers to highlight the challenges posed by the proposed dataset. A total of 44 prominent trackers were included to ensure a comprehensive evaluation. These trackers were chosen from different categories, including Discriminative Correlation Filter (DCF)-based trackers like ATOM 23 and DiMP 24 ; hybrid transformer-based trackers like STARK 25 and TATrack 26 , which combine Siamese networks and transformer networks for improved feature discrimination; and the DCF-based RGB-D tracker DeT 27 , which extends the DiMP tracker 24 to incorporate depth information. Segmentation-based trackers such as RTS 28 were also included in the evaluation.
Evaluating trackers in static drone settings provides insights into their performance in the presence of multiple uniform appearance distractors.
(c) Scenario Performance: Conducting individual evaluations for each scenario depicted in Fig. 4 to analyze the tracker's performance under different intruder behaviors, such as circular paths (Fig. 2c,e) or attempts to blend in with the crowd (Fig. 2a).
(d) Crowd Density Performance: Performing separate evaluations for sparse, medium, and compact CD levels to understand the impact of uniform appearance distractors on tracker performance, as they act as dynamic obstacles with similar appearances.
(e) Different Daytime Performance: Evaluating the performance of trackers in morning and evening scenarios to assess their adaptability and robustness under varying lighting conditions and environmental changes.
(f) Attribute Evaluation: Using trackers to assess distinct attributes exhibited in the videos, enabling in-depth analysis of tracker performance related to specific attributes.

Evaluation protocols.
For the sections "Drone Surveillance Settings (Multi-scale) Performance", "Scenario Performance", "Crowd Density Performance", "Different Daytime Performance", and "Attribute Evaluation", we specifically selected 10 SOTA trackers that underwent finetuning, based on the results presented in Tables 3 and 4. A comparison is then made between these chosen trackers for each evaluation protocol. The objective behind these evaluations is to conduct a comprehensive analysis of the trackers' performance while also gaining insights into the impact of various scenarios and attributes on their effectiveness.
Training/testing split. The D-PTUAC dataset was split into training and testing sets. The training set consists of 90 videos, while the testing set consists of 48 videos. The training set contains approximately 78 K frames, while the testing set contains around 42 K frames. A comprehensive comparison of the training and testing sets of D-PTUAC is presented in Table 2. The analysis shows that the minimum, mean, median, and maximum frame counts exhibit similarity between these two subsets. Additionally, Fig. 9 demonstrates that the ratios of sequences across all attributes and settings are also similar. These findings, derived from both Table 2 and Fig. 9, provide evidence of the consistency and coherence of our training/testing split.
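The per-split frame statistics compared in Table 2 can be computed with a small helper; this is a sketch assuming a list of per-sequence frame counts for each split:

```python
import numpy as np

def split_frame_stats(frame_counts):
    """Summary statistics used to compare the training and testing splits:
    min, mean, median, and max frames per sequence."""
    c = np.asarray(frame_counts)
    return {
        "min": int(c.min()),
        "mean": float(c.mean()),
        "median": float(np.median(c)),
        "max": int(c.max()),
    }
```

Running the helper on the frame counts of both splits and comparing the resulting dictionaries is the consistency check the text describes.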
Overall performance on the testing set. To conduct a comprehensive analysis, we evaluate the performance of 44 pretrained trackers on the testing set of D-PTUAC, as depicted in Fig. 8. Among these trackers, 24 are listed in Table 3, which showcases their performance in the pretrained state. Additionally, we present 20 trackers in Table 4, which demonstrates their performance before and after finetuning on the training set of D-PTUAC, allowing us to assess the impact of our training set on tracker performance. It is important to note that no changes were made to the hyperparameters of these 20 trackers.
The analysis of the results reveals that algorithms that combine the Siamese network and Transformers, such as TATrack 26 , SeqTrack 29 , and ToMP 22 , demonstrate a higher level of robustness in performance. These algorithms effectively leverage the advantages of capturing contextual information and addressing LT dependencies, which are crucial for accurate tracking. However, it is worth noting that certain algorithms that combine the Siamese network and Transformers, including STARK 25 , TrTr 30 , and TransT 31 , fail to achieve satisfactory results on our dataset. A similar pattern is observed in DCF trackers, such as DiMP18 24 , SuperDiMP 32 , PrDiMP18 33 , and ATOM 23 . While SuperDiMP 32 demonstrates robust performance by employing effective scale regression techniques and online learning strategies, ATOM 23 falls short in achieving desirable results.
Specifically, the results indicate that TATrack-Base384 26 achieved the highest performance with values of 59.39% for SR and 71.54% for NPR. Regarding PR, SeqTrack-B256 29 outperformed other trackers, achieving a PR of 65.60%. Upon finetuning, a similar trend was observed where TATrack-Base384 26 continued to demonstrate its effectiveness, yielding the best results with values of 64.74% for SR and 79.59% for NPR. Notably, AiATrack 34 achieved the highest PR of 73.72%.
RGB-D and segmentation-based trackers, such as DeT 27 and RTS 28 , face significant challenges and exhibit notable failures on our testing set. The homogeneous distribution of the crowd at the same distance from the camera leads to identical depth outputs for all individuals. Consequently, the depth data presents a similar appearance for the entire crowd, making it difficult for trackers to differentiate and accurately track the target person. This lack of depth variation hampers the tracker's ability to distinguish between individuals, leading to the loss of track on the target and compromising performance in such scenarios. The presence of OCC and multi-scale targets further exacerbates the challenges faced by segmentation-based trackers. These trackers rely on pixel-level segmentation masks, which can be unreliable when multiple individuals in the crowd have similar appearances. This results in inaccurate tracking and compromised performance on our testing set.
Upon analyzing the data presented in Table 4, it is clear that each of the 20 trackers studied shows a consistent enhancement in performance following the finetuning process using our training set.This notable improvement not only validates the effectiveness of our training set but also emphasizes its critical importance in the context of drone-person following in scenarios involving a crowd with uniform appearance.
Additionally, Fig. 10 provides a visual representation of the comparatively limited performance exhibited by the 20 finetuned SOTA trackers when evaluated against the testing set.This diminished performance is indicative of the increased complexity and challenge inherent in the testing set, thereby underscoring the need for further advancements in tracker technology to effectively address such demanding scenarios.
Surveillance settings (Multi-scale) performance. As detailed in the "Surveillance Settings" section, the D-PTUAC dataset comprises videos categorized into dynamic and static SS. In dynamic drone settings, the scale of the target's bounding box remains relatively consistent due to the drone's efforts to maintain a constant distance. However, in static drone settings, the target's bounding box exhibits significant SV as the object moves closer to or farther away from the drone, resulting in multi-scale bounding boxes. This introduces challenges such as SV and necessitates tracking algorithms that effectively handle these changes and maintain accurate localization.
Based on the comparison provided in Table 5, the AiATrack tracker 34 demonstrates the highest performance in videos with multi-scale variations, particularly in static drone settings. Compared to the baseline tracker and the second-best tracker, TrDiMP 35 , AiATrack achieves a performance improvement of 2.94%. Furthermore, TATrack-Base384 26 is the top-performing tracker in dynamic drone settings, characterized by challenges such as MB and FM. In comparison to the baseline tracker and the second-best tracker, SuperDiMP, TATrack-Base384 exhibits a performance enhancement of 1.08%.
Scenario performance. As outlined in the "Dataset Setup" section, the D-PTUAC dataset consists of five distinct scenarios that aim to simulate various intruder behaviors within a uniform appearance crowd. These scenarios are designed to replicate real-life situations where a law enforcement drone is deployed to track an intruder amidst a crowd with similar attire. Evaluating tracker performance in these scenarios is crucial for assessing their effectiveness in real-world applications. By evaluating the performance of the trackers on the D-PTUAC dataset, our objective is to gain insights into their capabilities and limitations when dealing with intruder tracking in uniform appearance crowd scenarios.
The benchmarking results presented in Table 6 reveal notable variations in tracker performance across scenarios. Specifically, in scenarios S3 and S5, where appearance ambiguity challenges the trackers, a significant decline in performance is observed. The random behavior of the crowd in S5 further amplifies the degradation due to increased OCC. In S2, S3, and S5, AiATrack 34 outperforms the other trackers, achieving performance improvements of 0.62%, 19.89%, and 35.18%, respectively, over the second-best performing trackers. Similarly, in scenario S1, characterized by high OCC levels as the intruder attempts to blend into the crowd, SeqTrack-B256 29 achieves the highest performance of 80.45%. In contrast, in scenario S4, where the drone closely follows the intruder, trackers exhibit a significant performance boost because the target remains within the drone's field of view most of the time. TATrack-Base384 26 demonstrates the best performance in S4, surpassing the second-best performing tracker, SuperDiMP 26 , by 3.13%.
Crowd density performance. The evaluation of trackers on different CDs, including sparse, medium, and compact, provides valuable insights into their capabilities and limitations in real-world surveillance scenarios. It allows for a comprehensive assessment of their performance in handling diverse crowd configurations, distinguishing targets from the background, and coping with OCC. Understanding the specific challenges and limitations associated with each density enables researchers and developers to enhance the trackers' capabilities and address density-specific obstacles. Furthermore, benchmarking and comparing tracker performance across different densities facilitates informed decision-making for selecting suitable trackers based on specific CD requirements.
As shown in Table 7, there is a noticeable decline in performance among various trackers, such as SLT-TrDiMP 36 , SeqTrack-B256 29 , and AiATrack 34 , when transitioning from sparse to medium to compact CD. These variations align with increased OCC and the emergence of more complex BC. In contrast, at the sparse and medium CD levels, ToMP50 22 achieves impressive AUC values of 72.15% and 68.94%, respectively. In the compact CD scenario, SuperDiMP achieves a slight performance improvement of 0.82% over the second-best performer, TrDiMP 35 .
Daytime performance. The evaluation of trackers on morning and evening scenarios in SS offers multiple benefits. It provides valuable insights into their performance under varying lighting conditions, ensuring their adaptability to different times of the day. This evaluation also allows for the assessment of trackers' robustness in handling challenges such as shadows, IV, and low light conditions specific to morning and evening environments. Additionally, it enables the analysis of potential temporal variations in tracking accuracy, aiding in the selection of trackers that exhibit consistent performance throughout the day.
As indicated in Table 8, all trackers performed better in the evening than in the morning. A possible reason for a tracker performing worse in videos captured in the morning is variation in lighting conditions. In the morning, the lighting may be softer, with lower contrast and potentially more shadows, making it challenging for the tracker to accurately detect and track objects. Additionally, the angle and intensity of sunlight change throughout the day, leading to different levels of illumination and potential glare in morning videos. These variations in lighting conditions can affect the quality and reliability of the visual features used by the tracker, resulting in decreased performance.
TrDiMP 24 demonstrates the best performance in evening videos, surpassing the second-best performing tracker, SeqTrack-B256 29 , by 4.45%. Moreover, the best-performing tracker in videos captured in the morning is TATrack-Base384 26 , with a performance improvement of 2.32% over the second-best performing tracker, AiATrack 34 .
Attribute-wise performance. In order to comprehensively evaluate the performance of various tracking algorithms, we evaluated ten finetuned trackers on 17 attributes using the D-PTUAC testing set. The results of this evaluation are presented in Table 9. For better visualisation, we plot the top nine unique attributes in our D-PTUAC dataset against the performance of the chosen trackers, as depicted in Fig. 11.
TATrack-Base384 26 and AiATrack 34 demonstrate effective mitigation of challenges such as AAC, BC, FM, LT, ST, POC, FOC, PV, ROT, and SV when compared to other algorithms. Fig. 12 presents qualitative results on tracking scenarios with attributes such as AAC, ARC, SV, OCC, BC, LR, FM, MB, ROT, different CD levels, and different SS. To facilitate observation, we have enlarged the regions containing the target objects and presented them on the right side of the original images. In the D-PTUAC dataset, sequences often exhibit multiple challenge attributes, posing significant difficulties for tracking multi-scale uniform appearance objects and leading to frequent failures of current SOTA trackers. For instance, in Fig. 12a-c, the S1, S2, and S3 sequences present challenges such as ARC, FM, MB, OCC, SV, BC, LR, and UF, which pose considerable difficulties for existing trackers.
Additionally, we present some examples of failed cases for SOTA trackers on the D-PTUAC dataset in Fig. 12d,e. These failed cases involve various challenge attributes, including AAC, ARC, SV, BC, MB, FM, OCC, LR, and UF. The FM attribute can cause the target to move beyond the trackers' search area. Although some re-detection tracking algorithms are capable of addressing this problem, trackers often struggle to track the multi-scale uniform appearance object due to the lack of sufficient appearance information and interference from the BC. MB, often accompanied by FM and camera motion, further degrades the quality of the feature representation. Moreover, as shown in Fig. 12d,e, the OCC attribute frequently results in model drift and targets moving beyond the search area. In summary, the main reasons for the failure of trackers on the D-PTUAC dataset can be attributed to (1) the LR, UF, and limited informative content of multi-scale uniform appearance objects, which hinder effective feature extraction and precise target localization, and (2) the presence of multiple challenge attributes within the same video sequence, posing substantial challenges for tracking methods.

Usage Notes
The D-PTUAC dataset 19 is publicly accessible on Figshare at (https://doi.org/10.6084/m9.figshare.24590568.v2). This dataset is offered for unrestricted use, permitting users to freely copy, share, and distribute the data in any format or medium. Additionally, users are granted the flexibility to adapt, remix, transform, and build upon the material. To foster reproducibility, the predicted bounding boxes and finetuned weights of the visual object trackers are also available on Figshare at (https://doi.org/10.6084/m9.figshare.24590268.v2) 38 . Both the dataset and the evaluation scripts are licensed under the Creative Commons "Attribution 4.0 International" license, which can be reviewed at (https://creativecommons.org/licenses/by/4.0/).
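For readers who want to script against the downloaded annotations, the sketch below parses a per-sequence ground-truth file. It is a minimal example that assumes the common single-object-tracking convention of one comma- (or tab-) separated `x,y,w,h` bounding box per line; the function name and the exact file layout are assumptions, so consult the folder structure in Fig. 7 and the Figshare record for the authoritative format.

```python
from typing import List, Tuple

def parse_groundtruth(text: str) -> List[Tuple[float, float, float, float]]:
    """Parse an annotation file with one 'x,y,w,h' bounding box per line.

    Assumed layout (OTB/LaSOT-style); verify against the released files.
    """
    boxes = []
    for line in text.strip().splitlines():
        # Tolerate tab-separated variants by normalizing to commas first.
        x, y, w, h = (float(v) for v in line.replace("\t", ",").split(",")[:4])
        boxes.append((x, y, w, h))
    return boxes

# Hypothetical two-frame annotation snippet.
sample = "100,50,32,48\n104,52,30,46"
print(parse_groundtruth(sample))  # [(100.0, 50.0, 32.0, 48.0), (104.0, 52.0, 30.0, 46.0)]
```

The same parser can be pointed at the predicted bounding boxes released alongside the finetuned weights, provided they follow the same per-line convention.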

Fig. 2
Fig. 2 Selected samples from the proposed dataset highlighting challenging attributes (IV, BC, UF, OCC, PV, MB, FM, OV, AAC, ARC, DEF, ROT, CD, SV, ST, LT) through a structured layout: Row 1 showcases RGB sample images, Row 2 presents depth sample images, and Row 3 displays segmentation mask sample images. The columns within the figure showcase samples that encompass multiple attributes: Column (a) features LR, IV, BC, and UF; Column (b) includes OCC, BC, IV, and UF; Column (c) portrays IV, BC, and UF; Column (d) demonstrates OCC, BC, IV, UF, and PV; Column (e) encompasses DEF, LR, BC, IV, and UF; Column (f) showcases MB and FM; and Column (g) encompasses OV. These images emphasize the importance of developing robust drone-person tracking methods.

Fig. 3
Fig. 3 Evaluation results of 44 state-of-the-art pretrained trackers on D-PTUAC on videos with LR and BC attributes using (a) Success Rate, (b) Precision Rate, and (c) Normalized Precision Rate. Please zoom in for better clarity.

Algorithm 1
Algorithmic Overview of the D-PTUAC Dataset Collection Process. The D-PTUAC dataset covers a range of visual factors, including PV, IV, OCC, and LR, as depicted in Fig. 2. Each video in the dataset contains approximately 40-50 subjects across the two crowds, with 15-30 subjects per combination of CD, angle of view, drone SS (static or dynamic), and time of capture.

2. Crowd Density: It represents the proximity and density of individuals in a crowd. The dataset categorizes CD into three levels: sparse (more than 1-meter distance), shown in Fig. 2a; medium (1-meter distance), shown in Fig. 2c; and compact (shoulder-to-shoulder proximity), shown in Fig. 2e.
3. Surveillance Settings: It refers to dynamic and static SS. Dynamic surveillance, with 88 videos, involves capturing subjects' movements while the drone follows them, while static surveillance, with 50 videos, documents events in a designated area without actively tracking individuals.
4. Illumination Variation: It denotes significant changes in lighting conditions within a scene. In the drone-person following scenario, IV can arise from various sources such as the sun's position, artificial lights, shadows, and reflections. The dataset captures videos in different lighting conditions, including morning, evening, and rainy weather. Examples are shown in Fig. 2a,b.

Fig. 5
Fig. 5 The histogram represents the distribution of ground truth annotated heads based on their size. It can be observed that a significant portion of the ground truth regions contain fewer than 4,000 pixels, i.e., they are smaller than 64×64 pixels.

Fig. 6
Fig. 6 Distribution of Sequences Across Each Attribute within the D-PTUAC Dataset.

2. Fast Motion: It occurs when the tracked object or the drone moves quickly, challenging the tracker to keep up.
3. Motion Blur: It arises from the drone's or target's movement, causing blurred frames, as illustrated in Fig. 2f.
4. Pose Variation: It captures the variability of human poses, including actions like running or hugging, resulting in significant pose changes between consecutive frames, as shown in Fig. 2a,d.
5. Scale Variation: It occurs when the ratio of the object's size in the current frame to its size in the first frame falls outside the range [0.5, 2], particularly in static scenarios where the target moves closer to or farther from the drone, as shown in Fig. 2a-d.
6. Background Clutter: It occurs when the target's appearance resembles the background, leading to challenges in accurate differentiation, as shown in Fig. 2a-g.
7. Low Resolution: It describes the characteristics of the target object in the video frames, specifically referring to the tracked person's head in our D-PTUAC dataset.

8. Rotation: It happens when the target deliberately conceals themselves within the crowd after being detected by the drone.
9. Aspect Ratio Change: The bounding box aspect ratio falls outside the specified range of [0.5, 2].
10. Deformation: The target undergoes deformations and changes in shape during the tracking process, as depicted in Fig. 2e.
11. Occlusion - Partial Occlusion (POC)/Full Occlusion (FOC): It arises when parts of or the entire target is obstructed by objects or people in the scene, as depicted in Fig. 2b-d.
12. Out of View: It refers to a situation where the target fully leaves the camera's field of view, as shown in Fig. 2g.
13. Short-Term Videos: It refers to a sequence length of fewer than 1,000 frames. Our dataset contains 95 ST videos.
14. Long-Term Videos: It refers to a sequence length of more than 1,000 frames. Our dataset contains 43 LT videos.
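To illustrate how the SV and ARC rules above can be checked automatically, the following sketch flags a sequence from its ground-truth boxes. It interprets both attributes as ratios relative to the first frame, a convention used by several tracking benchmarks; the [0.5, 2] thresholds come from the definitions above, while the helper name and the relative-ratio reading of ARC are our assumptions, not the paper's released tooling.

```python
def has_attribute(boxes, which):
    """Flag SV or ARC using the [0.5, 2] ratio rule (relative to frame 0).

    boxes: list of (x, y, w, h) ground-truth boxes; frame 0 is the reference.
    which: 'SV' compares box area; 'ARC' compares aspect ratio (w / h).
    Assumed helper, not the dataset's official attribute-annotation code.
    """
    def measure(box):
        _, _, w, h = box
        return w * h if which == "SV" else w / h

    ref = measure(boxes[0])
    return any(not 0.5 <= measure(b) / ref <= 2.0 for b in boxes[1:])

# A head that grows from 10x10 to 30x30 triggers SV (area ratio 9)
# but not ARC (the aspect ratio is unchanged).
track = [(0, 0, 10, 10), (5, 5, 30, 30)]
print(has_attribute(track, "SV"), has_attribute(track, "ARC"))  # True False
```

The same pattern extends to frame-count attributes such as ST/LT (sequence length below or above 1,000 frames), which need only `len(boxes)`.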

Fig. 7
Fig. 7 The folder structure of the D-PTUAC dataset.

Fig. 9
Fig. 9 Comparison of sequence distribution in each attribute between training and testing sets.
The evaluation protocol is designed to assess the following aspects: (a) Overall Performance on the Testing Set: comparing the performance of 44 SOTA trackers on the D-PTUAC testing set before and after finetuning. (b) Drone Surveillance Settings (Multi-scale) Performance: assessing trackers in dynamic and static drone scenarios to understand their capabilities and limitations under different operational conditions, including object tracking during drone movement and challenges such as MB, FM, OCC, and changing trajectories.
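The SR, precision, and normalized precision numbers reported in the tables follow the one-pass-evaluation definitions standard in visual tracking benchmarks. The sketch below reimplements the two most common of these metrics for illustration; the paper's authoritative numbers come from the evaluation scripts released on Figshare, so treat this as an assumed, simplified reimplementation.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(gt, pred, steps=21):
    """Success Rate: area under the curve of the fraction of frames whose
    IoU exceeds each threshold, sampled uniformly over [0, 1]."""
    ious = [iou(g, p) for g, p in zip(gt, pred)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(sum(o > t for o in ious) / len(ious) for t in thresholds) / steps

def precision_at(gt, pred, thr=20.0):
    """Precision Rate: fraction of frames whose center error is <= thr pixels."""
    hits = 0
    for g, p in zip(gt, pred):
        dx = (g[0] + g[2] / 2) - (p[0] + p[2] / 2)
        dy = (g[1] + g[3] / 2) - (p[1] + p[3] / 2)
        hits += (dx * dx + dy * dy) ** 0.5 <= thr
    return hits / len(gt)

# Toy two-frame example with hypothetical ground-truth and predicted boxes.
gt = [(10, 10, 40, 60), (12, 11, 40, 60)]
pred = [(12, 12, 38, 58), (60, 80, 40, 60)]
print(round(success_auc(gt, pred), 3), precision_at(gt, pred))
```

Normalized precision additionally rescales the center error by the ground-truth box size before thresholding, which makes the precision metric comparable across the multi-scale targets discussed above.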

Fig. 12
Fig. 12 Qualitative evaluation on six representative trackers. To enhance visibility, we have magnified the object regions and presented them on the right side of the original images. The enlarged regions are shown for the following examples: (a) S1, (b) S2, (c) S3, (d) S4, and (e) S5.

Table 1 .
Comprehensive Statistics of the D-PTUAC Dataset.

Table 2 .
Comparison between training and testing sets of D-PTUAC.

Table 3 .
Comparative results of pretrained trackers on the D-PTUAC testing set. Trackers are ranked per metric, with the best indicated by *, second best by **, and third best by ***.

Table 4 .
Comparative results of pretrained versus finetuned trackers on the D-PTUAC testing set. Trackers are ranked per metric, with the best indicated by *, second best by **, and third best by ***.

Table 5 .
Comparative results of finetuned trackers on the D-PTUAC testing set per SS. Trackers are ranked per SR (%) metric, with the best indicated by *, second best by **, and third best by ***.

Table 6 .
Comparative results of finetuned trackers on the D-PTUAC testing set per scenario. Trackers are ranked per SR (%) metric, with the best indicated by *, second best by **, and third best by ***.

Table 7 .
Comparative results of finetuned trackers on the D-PTUAC testing set per CD. Trackers are ranked per SR (%) metric, with the best indicated by *, second best by **, and third best by ***.

Table 8 .
Comparative results of finetuned trackers on the D-PTUAC testing set per daytime. Trackers are ranked per SR (%) metric, with the best indicated by *, second best by **, and third best by ***.