Lightweight air-to-air unmanned aerial vehicle target detection model

The rapid expansion of the drone industry has resulted in a substantial increase in the number of low-altitude drones, giving rise to concerns regarding collision avoidance and countermeasure strategies among these unmanned aerial vehicles. These challenges underscore the urgent need for air-to-air drone target detection. An effective target detection model must exhibit high accuracy, real-time capabilities, and a lightweight network architecture to achieve a balance between precision and speed when deployed on embedded devices. In response to these requirements, we initially curated a dataset comprising over 10,000 images of low-altitude operating drones. This dataset encompasses diverse and intricate backgrounds, significantly enhancing the model’s training capacity. Subsequently, a series of enhancements were applied to the YOLOv5 algorithm to realize lightweight object detection. A novel feature extraction network, CF2-MC, streamlined the feature extraction process, while an innovative module, MG, in the feature fusion section aimed to improve detection accuracy and reduce model complexity. Concurrently, the original CIoU loss function was replaced with the EIoU loss function to further augment the model’s accuracy. Experimental results demonstrate an enhancement in the accuracy of drone target detection, achieving mAP values of 95.4% on the UAVfly dataset and 82.2% on the Det-Fly dataset. Finally, real-world testing conducted on the Jetson TX2 revealed that the YOLOv5s-ngn model achieved an average inference speed of 14.5 milliseconds per image. The code utilized in this paper can be accessed via https://github.com/lucien22588/yolov5-ngn.git.

In ground-to-air detection, UAVs are detected by cameras on the ground or stationary platforms. This method finds applications across various domains, including military air defense surveillance and civilian aviation air traffic management. Conversely, air-to-air detection involves UAVs using their onboard cameras or sensors to detect other UAVs in flight. This is critical for swift and accurate detection to avoid collisions and enable cooperative operations, especially in multi-UAV systems.
In recent years, research on ground-to-air UAV detection has gained significant attention due to the rising utilization of UAVs. However, air-to-air UAV detection poses even greater challenges that remain underexplored. A key reason for this complexity is the contrasting monitoring environments.
In many ground-to-air UAV detection scenarios, stationary or minimally mobile ground-based cameras [15][16][17][18][19][20] capture images under relatively stable atmospheric conditions, such as clear or cloudy skies. Detection in such settings is comparatively straightforward due to the static background. Conversely, air-to-air UAV detection involves capturing images of UAVs in flight within dynamic and complex backgrounds, including urban and natural settings. These scenarios introduce additional visual complexities, as background elements, ground structures, and architectural features can obscure UAVs, exacerbating detection challenges. Furthermore, the dynamic flight characteristics of onboard cameras can lead to significant variations in the appearance of UAVs, affecting their shape, size, proportions, and color. This visual variability intensifies the difficulty of precise detection across diverse contexts. Another complicating factor is the small size of micro UAVs, making them inconspicuous in airspace and further increasing detection complexities. Effective solutions to air-to-air UAV detection must address these challenges of diversity, complexity, and small-scale characteristics, representing a critical research focus in the current landscape.

Methods for unmanned aerial vehicle (UAV) detection
Unmanned aerial vehicle (UAV) detection involves employing various methodologies rooted in sensor technology and signal processing:
• Radar Systems: These systems detect radar signals reflected from UAVs, ensuring reliable long-range detection even amidst varying environmental conditions.
• Infrared Sensors: Utilizing infrared radiation, these sensors capture thermal emissions from UAVs.
• Sonar Systems: Primarily intended for underwater vehicles, sonar technology is occasionally adapted for detecting aerial UAVs using sound waves.
• Radio Frequency Spectrum Analysis: This method identifies UAV communication signals, uncovering interactions between the UAV and its controller and facilitating inferences about the UAV's position and operator.
• Image Processing and Machine Learning: Employing computer vision and machine learning algorithms, analysts assess aerial images and videos for UAV detection. These methods rely on discerning UAV visual features and movement patterns.
However, sensors such as radar, sonar, and radio frequency spectrum analyzers can suffer from interference from other onboard UAV sensors, potentially compromising accuracy. Additionally, most sensors lack the ability to identify specific target objects. Recent strides in computer vision have introduced vision-based techniques as innovative detection methods. Vision sensors operate in more relaxed experimental settings, boast lower production costs, and offer attributes such as extensive data collection and broad detection capabilities. Consequently, researchers are increasingly focusing on vision-based target detection techniques.

Traditional approaches
Visual object detection is the process of identifying, recognizing, and labeling specific objects in images, and is closely related to object classification, tracking, and image segmentation. Traditional object detection methods [33][34][35] typically employ a sliding window strategy, scanning the entire image with a series of sliding windows to determine possible object locations. Hand-crafted features, such as scale-invariant feature transform 36 and histogram of oriented gradients 37 , are then extracted from each image window, followed by classification using support vector machine (SVM) or AdaBoost classifiers. However, traditional algorithms based on the sliding window strategy suffer from high computational complexity, limited detection efficiency, and difficulty in handling changes in object shape and background. Additionally, designing hand-crafted features for each new object class requires considerable time.
Current research on UAV detection employs two fundamental technical approaches. Firstly, one method utilizes feature extraction techniques to capture UAV characteristics within images, such as shape, color, and texture. Subsequently, discriminative classifiers, such as Support Vector Machines (SVM) or Convolutional Neural Networks (CNN), are employed to analyze and categorize these extracted features, thereby determining the UAV's position, category, or state 38,39 .
The second approach revolves around the detection of moving objects within images, with UAVs typically categorized as one type of moving object. Generative classifiers are then deployed to assess whether these moving objects correspond to UAV targets. This method emphasizes the analysis of trajectories, velocities, and motion patterns of moving objects to distinguish UAVs from other mobile entities [40][41][42] .

Deep learning-based approach
The domain of target detection methodologies in deep learning encompasses a diverse array of techniques and architectures.
• Convolutional Neural Network (CNN) Methods:
R-CNN Series: Encompassing R-CNN 43 , Fast R-CNN 44 , and Faster R-CNN 25 , these methodologies achieve target detection by introducing candidate regions and Region Proposal Networks (RPN).
YOLO Series: Including YOLO (You Only Look Once) 45 , YOLOv2, and subsequent versions, which cast detection as a single-stage regression problem 51,52 .
These methodologies exhibit distinctive advantages and applicability in various application scenarios and tasks. Researchers continually explore novel deep learning approaches to address challenges in target detection, thereby expanding the developmental landscape of deep learning-based target detection technologies.

UAVfly dataset
Datasets are indispensable for training object detection models, providing essential information to facilitate target learning. Their quality significantly influences model performance and generalization capabilities. Therefore, constructing datasets with high quality, diversity, representativeness, and balance is crucial for training and evaluating object detection models. However, collecting datasets for visually detecting air-to-air unmanned aerial vehicles (UAVs) is challenging due to inherent complexities such as complex backgrounds and image distortions from dynamic flights.
Currently, datasets explicitly designed for the precise detection of unmanned aerial vehicle (UAV) targets in the air-to-air scenario are exceedingly rare. Zheng et al. 32 introduced a dataset named "Det-Fly" to address this gap. The dataset comprises 13,271 images of UAVs in the air-to-air scenario, captured from three different angles across four environmental backgrounds. Notably, it incorporates challenging conditions such as varying lighting, dynamic blurring, and other factors that faithfully reflect the authentic operational processes of UAVs. However, practical utilization of models trained on this dataset reveals suboptimal accuracy in detecting UAVs against complex backgrounds. To address this issue, we propose a novel dataset named "UAVfly." Its primary distinction from Det-Fly lies in providing more intricate air-to-air scenario images, thereby enhancing the dataset's generalization capabilities. Table 1 compares Dataset Det-Fly and Dataset UAVfly.
In this study, we employed three unmanned aerial vehicle devices (DJI AIR2s) to collect datasets in an air-to-air fashion, conducted across three distinct Chinese provinces: Shanxi, Sichuan, and Guangdong.
The dataset consists of 10,281 images with a resolution of 1280 × 720 pixels. It comprehensively covers diverse geographical environments, encompassing urban blocks, suburbs, deserts, fields, lakes, skies, and mountains. Each environmental background type contributes nearly equally to the entire dataset, ensuring a high degree of diversity and representativeness. The dataset collection process spans an entire day, segmented into three time periods: morning, noon, and evening, with each period contributing almost equally to the dataset. This diverse and uniformly distributed data collection methodology ensures the dataset's comprehensiveness and applicability, offering researchers a wealth of experimental material.
Regarding challenging backgrounds, we adopted the collection strategy of Dataset Det-Fly, accounting for factors such as varying lighting conditions, dynamic blurring, and partial occlusion of target objects. During the data collection process, we adhered to the following strategies.
(1) Throughout the data collection process, strict adherence to local regulations governing unmanned aerial vehicle (UAV) operations was maintained to ensure full compliance and safety.
(2) To render the dataset versatile for a broad spectrum of applications within the low-altitude airspace domain, a comprehensive evaluation of background complexity was undertaken, leading to the establishment of a maximum UAV flight altitude of 100 m. Additionally, a mandatory constraint was enforced, stipulating that UAVs must maintain a minimum separation distance of 5 m, thereby guaranteeing the safe execution of data collection activities.
(3) Within the dataset, a subset of images was captured by onboard monocular cameras at regular 0.5-s intervals. This acquisition strategy proved advantageous in capturing temporal dynamics, including positional and state alterations of the targets at distinct time points, thereby imbuing the dataset with valuable temporal context.
(4) Furthermore, to ensure data accuracy and usability, the professional annotation software LabelMe was employed, and the annotation process was conducted by highly skilled experts. This meticulous annotation procedure produced a repository of high-quality annotated data, establishing a reliable cornerstone for subsequent research.
The self-constructed dataset is illustrated in Fig. 1. The dataset was partitioned randomly in a 7:3 ratio into training and validation sets. This proportional division maximizes the effective use of the dataset for model training. Subsequent evaluation and validation were performed on the dedicated validation set to assess the model's generalization performance. This random partitioning upholds dataset diversity and mitigates the risk of the model overfitting to specific data distributions.
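The 7:3 random partition described above can be sketched as follows (the function name and fixed seed are illustrative, not taken from the released code):

```python
import random

def split_dataset(image_paths, train_ratio=0.7, seed=0):
    """Randomly partition image paths into training and validation sets (7:3)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed makes the split reproducible
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```

Fixing the seed keeps the partition reproducible across runs while still sampling all background types uniformly at random.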

Methods Abbreviation
As depicted in Table 2, explanations for the abbreviations referenced in the article have been provided.

The original YOLOv5s network
YOLOv5s is a highly efficient object detection model and the smallest in scale in the YOLOv5 series 47 . The model uses a lightweight backbone network together with multi-scale feature fusion and an efficient output prediction strategy to achieve efficient, accurate, and real-time object detection 53,54 .
www.nature.com/scientificreports/
The YOLOv5s network architecture comprises an input section, backbone network, feature fusion module, and output section. The input segment employs the Mosaic-4 data augmentation technique, which enriches image information through cropping, concatenation, and resizing operations. This method is particularly effective for small object detection. The backbone network uses CSPDarknet53 as the feature extractor, which improves feature extraction capabilities while maintaining efficiency. The feature fusion module employs a hybrid of Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) architectures to enhance the features extracted by the backbone network, thereby strengthening the network's feature fusion capacity. The output end uses an efficient prediction strategy, including the use of multi-scale feature maps from the backbone network, feature fusion in convolution layers, and efficient processing in the output layer to achieve fast and accurate object detection.
Compared with other models in the YOLOv5 series, YOLOv5s has a smaller model size and faster inference speed but slightly lower accuracy than the larger models. The model demonstrates strong performance across various public object detection datasets and has found extensive utilization in diverse real-world applications, including intelligent surveillance, autonomous driving, and robotic vision. The YOLOv5s structure is illustrated in Fig. 2.

Lightweight feature extraction
The feature extraction network in this paper has been improved in two parts. Firstly, the C3 module within the backbone network is replaced with CF2. Secondly, the downsampling segment following the C3 module in the backbone network is transformed into MC. The CF2 network structure, illustrated in Fig. 3, processes the input by applying a 1 × 1 convolution and a channel split operation to divide it into two sub-features (referred to as the gray and blue features). In convolutional operations, channel splitting segments the input channels into multiple subsets, each undergoing convolution independently. This approach enhances network parallelism, thereby improving the computational efficiency of the model and enabling more effective utilization of hardware resources.
Sub-feature 1 (gray) remains unchanged, while sub-feature 2 (blue) is sent to the F module for feature learning. The learned blue feature is then fed into the F module again to generate the purple feature. The gray, blue, and purple multi-scale features are concatenated to enhance feature reuse, and are then passed to a 1 × 1 convolution to obtain the output feature. The idea of multi-scale feature concatenation and feature reuse comes from VoVNet 55 . Feature reuse refers to utilizing features extracted from preceding layers in subsequent layers. In neural networks, lower-level features usually encompass more fundamental information, while higher-level features encapsulate more abstract and advanced characteristics. By enabling the model to efficiently leverage information from earlier layers, feature reuse enhances the model's performance and efficiency. This technique can diminish computational complexity, decrease the number of parameters, and shorten training time, all while enhancing the model's generalization capabilities.
The F module consists of two 3 × 3 grouped convolutions, between which channel shuffling is performed to facilitate inter-group feature interactions 56 . The channel shuffle technique is frequently applied to the output of convolutional layers with multiple channels. Its purpose is to facilitate the exchange and fusion of information among channels by reordering their arrangement, thereby amplifying the network's representational capacity. The channel shuffle operation encompasses grouping, interlacing, and reconfiguring feature maps derived from the various channels. This procedure enhances the model's capability to abstractly represent features, reinforcing interrelations and diversification among features and ultimately contributing to optimizing the model's performance.
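A minimal PyTorch sketch of the F module and CF2 block as described above. Class names, the activation choice, and whether the two F passes share weights are our assumptions; the exact configuration in the released code may differ:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Reorder channels so subsequent grouped convolutions mix information across groups.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class FModule(nn.Module):
    """Two 3x3 grouped convolutions with a channel shuffle in between (sketch)."""
    def __init__(self, channels, groups=2):
        super().__init__()
        self.groups = groups
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False)
        self.act = nn.SiLU()
    def forward(self, x):
        x = self.act(self.conv1(x))
        x = channel_shuffle(x, self.groups)
        return self.act(self.conv2(x))

class CF2(nn.Module):
    """1x1 conv -> channel split -> two passes through F -> concat -> 1x1 conv (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.f = FModule(in_ch // 2)          # shared weights for both passes (assumption)
        self.fuse = nn.Conv2d((in_ch // 2) * 3, out_ch, 1, bias=False)
    def forward(self, x):
        x = self.reduce(x)
        gray, blue = x.chunk(2, dim=1)        # channel split into two sub-features
        blue2 = self.f(blue)                  # blue feature after first F pass
        purple = self.f(blue2)                # purple feature after second F pass
        return self.fuse(torch.cat([gray, blue2, purple], dim=1))
```

The concatenation of gray, blue, and purple features is the VoVNet-style multi-scale reuse; the final 1 × 1 convolution restores the desired channel count.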
The MC network structure, shown in Fig. 3, is mainly inspired by the downsampling method in PeleeNet 57 . The stem block structure in PeleeNet preserves strong feature expression ability while eliminating a significant number of parameters. In the MC structure, one branch uses max pooling and a 1 × 1 convolution to reduce the number of channels by half (represented by c in the figure), while the other branch uses a 3 × 3 convolution with a stride of 2 to reduce the number of channels by half. The outputs of the two branches are then concatenated to obtain the output feature, ensuring that the final result has sufficient semantic information while reducing the number of parameters and avoiding excessive loss of information.
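The two-branch downsampling above can be sketched in PyTorch as follows (padding and bias settings are our assumptions):

```python
import torch
import torch.nn as nn

class MC(nn.Module):
    """Two-branch downsampling (sketch): max pooling + 1x1 conv in one branch,
    stride-2 3x3 conv in the other; each branch yields half the output channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),   # spatial downsampling
            nn.Conv2d(in_ch, half, 1, bias=False),   # channel reduction
        )
        self.conv_branch = nn.Conv2d(in_ch, half, 3, stride=2, padding=1, bias=False)
    def forward(self, x):
        # Concatenating both branches keeps pooled detail and learned semantics.
        return torch.cat([self.pool_branch(x), self.conv_branch(x)], dim=1)
```

The pooling branch preserves strong responses cheaply, while the strided convolution branch learns a complementary downsampled representation.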

Lightweight feature fusion
In the feature fusion section, this paper extensively utilizes spatial and channel attention mechanisms to holistically model the high-level semantic information and channel-wise characteristics derived from the feature pyramid structure of the backbone network 58 .
The spatial and channel attention mechanism is a technique employed in neural network architectures, notably convolutional neural networks (CNNs), to refine feature representation. The spatial attention mechanism is geared toward identifying and accentuating pertinent spatial areas within an image. It allocates weights to distinct spatial locations, highlighting regions of greater significance for the given task. This enables the network to focus on vital image sections, thereby bolstering its capacity to capture spatial information effectively. Conversely, the channel attention mechanism strives to accentuate crucial channels within the feature maps generated by diverse network layers. It assesses interdependencies among channels and assigns weights to emphasize channels harboring more pertinent information for the designated task. This empowers the network to prioritize and concentrate on informative channels, enhancing its ability to extract relevant features from the data. The coordinated interplay of spatial and channel attention mechanisms enables the network to better discern critical spatial areas and channel-specific features within the data. On this basis, a novel fusion module (MG) is devised, and predictions are generated from a single distinct feature layer. The MG network structure is shown in Fig. 4.
To ensure alignment in channel count and feature map size, the feature pyramid levels C3, C4, and C5 obtained from the backbone network undergo operations such as convolution with a stride of 2, 1 × 1 convolution, and upsampling with a 1 × 1 convolution. Efficient spatial attention structures (ESA) are subsequently introduced after each convolutional structure. Within the ESA, channel information is compressed by employing two global compression techniques along the channel dimension, namely global average squeezing (GAS) and global maximum squeezing (GMS). This compression facilitates the formation of comprehensive global information in the spatial domain. The two branches of features are then concatenated and processed with a 7 × 7 convolution, enabling effective fusion of the global information. The resulting score map, after sigmoid activation, is element-wise multiplied with the original feature map.
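The ESA operation described above can be sketched in PyTorch as follows (the class name and layer details are illustrative; only the GAS/GMS squeeze, 7 × 7 convolution, and sigmoid gating come from the description):

```python
import torch
import torch.nn as nn

class ESA(nn.Module):
    """Efficient spatial attention (sketch): squeeze channels via global average (GAS)
    and global max (GMS) along the channel axis, concat, 7x7 conv, sigmoid gate."""
    def __init__(self):
        super().__init__()
        # 2 input channels: one from GAS, one from GMS.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
    def forward(self, x):
        gas = x.mean(dim=1, keepdim=True)       # global average squeezing over channels
        gms, _ = x.max(dim=1, keepdim=True)     # global maximum squeezing over channels
        score = torch.sigmoid(self.conv(torch.cat([gas, gms], dim=1)))
        return x * score                        # element-wise spatial reweighting
```

The single-channel score map acts as a per-pixel gate, so the output keeps the input's shape while emphasizing spatially salient regions.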
Moreover, the concatenated feature maps from the three layers undergo global average pooling (GAP), producing a one-dimensional vector that encapsulates the global channel-wise information of the multi-scale feature maps. This vector is subsequently subjected to global information compression through fully connected (FC) layers, a process referred to as "squeeze" 59 . The squeezed vector is then fed separately into three distinct FC layers, enabling separate learning of channel-wise information for the three feature maps. Consequently, three different vectors (designated as green, dark yellow, and light yellow) are obtained. By applying the softmax function to each vector, attention levels for the diverse multi-scale global information are effectively discerned. Following the softmax operation, the vectors are element-wise multiplied with their corresponding scale feature maps and subsequently aggregated to yield the final single-feature prediction head.
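A possible reading of this channel-attention fusion is sketched below. The reduction ratio, activation, and the choice to take the softmax across the three scales per channel are our assumptions, not details given in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelFusion(nn.Module):
    """Sketch of the channel-attention fusion: GAP over the concatenated maps,
    a shared 'squeeze' FC, three per-scale FC branches, softmax across scales,
    and a weighted sum of the three feature maps."""
    def __init__(self, channels, num_scales=3, reduction=4):
        super().__init__()
        self.squeeze = nn.Linear(channels * num_scales, channels // reduction)
        self.branches = nn.ModuleList(
            nn.Linear(channels // reduction, channels) for _ in range(num_scales)
        )
    def forward(self, feats):                            # feats: 3 maps of equal shape
        cat = torch.cat(feats, dim=1)                    # (B, 3C, H, W)
        z = F.adaptive_avg_pool2d(cat, 1).flatten(1)     # GAP -> (B, 3C)
        z = F.relu(self.squeeze(z))                      # "squeeze" compression
        logits = torch.stack([b(z) for b in self.branches], dim=1)   # (B, 3, C)
        weights = logits.softmax(dim=1)                  # attention across scales
        out = sum(w.unsqueeze(-1).unsqueeze(-1) * f
                  for w, f in zip(weights.unbind(dim=1), feats))
        return out                                       # fused single-feature map
```

Because the softmax normalizes across the three scales for each channel, every channel of the fused output is a convex combination of the corresponding channels of the three inputs.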

Improvement of the loss function
Intersection over union (IoU) loss
As illustrated in Fig. 5, the red box signifies the predicted bounding box, while the green box represents the ground truth box (annotated box). The Intersection over Union (IoU) quantifies the extent of overlap between the ground truth box and the predicted box through the following steps:
• Compute the intersection area between the ground truth box and the predicted box.
• Calculate the union area between the ground truth box and the predicted box.
• Determine the IOU ratio by dividing the intersection area by the union area.
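The three steps above can be sketched as:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Step 1: intersection area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Step 2: union area = sum of areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Step 3: the IoU ratio.
    return inter / union if union > 0 else 0.0
```

For identical boxes the ratio is 1; for disjoint boxes the intersection (and hence the IoU) is 0.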
The IoU Loss is defined as:

$$L_{IoU} = 1 - IoU$$

The Intersection over Union (IoU) metric assesses the detection performance between predicted and ground truth bounding boxes. It possesses scale invariance, demonstrating insensitivity to scale variations. In regression tasks, IoU serves as a direct indicator of the distance between the predicted box and ground truth (GT). However, IoU encounters the following limitations:
• Inability to conduct gradient backpropagation: the IoU cannot be used directly for gradient updates. When IoU equals 0 (indicating no intersection between the two boxes), computing the loss as 1 − IoU leads to difficulties in gradient backpropagation.
• Incapability to ascertain the distance between predicted and ground truth boxes: IoU yields 0 whenever there is no intersection, making it impossible to gauge how far apart the boxes are.
• Lack of information about the nature of the intersection: IoU fails to describe how the two boxes intersect or the overlap pattern they exhibit.
• Inability to precisely quantify the degree of overlap: IoU lacks the granularity needed to precisely measure the level of overlap or coincidence between the two bounding boxes.

CIoU loss function
YOLOv5s utilizes the CIoU Loss function, which is defined as:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{1}$$

The values of α and v are defined as:

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

The gradient of v with respect to the side lengths w and h is:

$$\frac{\partial v}{\partial w} = \frac{8}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)\frac{h}{w^2 + h^2}, \qquad \frac{\partial v}{\partial h} = -\frac{8}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)\frac{w}{w^2 + h^2}$$

In these equations, b and b^gt correspond to the centers of the predicted box and the ground truth box, respectively. The term ρ² denotes the squared Euclidean distance between these centers, while c represents the diagonal length of the smallest enclosing region that contains both the predicted and ground truth boxes.
While the CIoU Loss function incorporates considerations for bounding box regression, including overlap area, center point distance, and aspect ratio, its relative-proportion penalty term ceases to influence the regression when the width-height aspect ratios of the predicted and ground truth boxes are linearly related. Additionally, the gradients of v with respect to w and h have opposite signs, so an increase in one dimension forces a decrease in the other, making it impossible for both to grow or shrink simultaneously.

EIoU Loss function
The EIoU Loss function represents an enhanced iteration of the CIoU Loss function 60 . In contrast to CIoU, the EIoU Loss function employs an EIoU metric to quantify the overlap between the predicted and ground truth bounding boxes and directly imposes penalties on the predicted width and height. The EIoU metric effectively penalizes discrepancies between the predicted width and height and the ground truth values. Furthermore, it circumvents the gradient conflict issue encountered during gradient computation in the CIoU Loss function. The EIoU Loss function is defined as:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \frac{\rho^2\!\left(w, w^{gt}\right)}{C_w^2} + \frac{\rho^2\!\left(h, h^{gt}\right)}{C_h^2}$$

The variables C_w and C_h represent the width and height of the minimum bounding rectangle covering the predicted box and the ground truth box. The EIoU Loss function partitions the loss into three components: overlap loss (L_IoU), center distance loss (L_dis), and width-height loss (L_asp). The aspect ratio loss term is separated into the discrepancy between the predicted and ground truth widths and the discrepancy between the predicted and ground truth heights, each normalized by the corresponding dimension of the minimum bounding rectangle. This approach accelerates convergence of the width-height loss and enhances regression precision.
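The three components of the EIoU loss can be sketched for a single box pair as follows (a plain-Python illustration, not the training implementation from the released code):

```python
def eiou_loss(pred, gt, eps=1e-7):
    """EIoU loss (sketch) for two boxes in (x1, y1, x2, y2) format."""
    # Overlap loss: 1 - IoU.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    w1, h1 = pred[2] - pred[0], pred[3] - pred[1]
    w2, h2 = gt[2] - gt[0], gt[3] - gt[1]
    union = w1 * h1 + w2 * h2 - inter + eps
    l_iou = 1.0 - inter / union
    # Minimum bounding rectangle of the two boxes (widths Cw, Ch).
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    # Center distance loss, normalized by the rectangle's squared diagonal.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    l_dis = rho2 / (cw ** 2 + ch ** 2 + eps)
    # Width-height loss, normalized by Cw^2 and Ch^2 separately.
    l_asp = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)
    return l_iou + l_dis + l_asp
```

Unlike the IoU loss alone, the distance and width-height terms remain informative even when the boxes do not intersect, which is what gives EIoU its usable gradients in that regime.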

Results
We conducted extensive model training on the UAVfly and Det-Fly 32 datasets to comprehensively validate the effectiveness of our proposed algorithm.This meticulous training process was undertaken with the aim of ensuring the algorithm's robustness and generalization capabilities across diverse datasets, further substantiating its outstanding performance in the field of unmanned aerial vehicle (UAV) visual detection.The combined utilization of these two datasets facilitated the capture of UAV images in various scenarios and contexts, enabling a more comprehensive assessment of the algorithm's performance and reliability.

Experimental environment
The training environment for all models in this experiment was Windows 10, with an AMD Ryzen 5 4600H with Radeon Graphics 3.00 GHz CPU, 16 GB of RAM, an NVIDIA GeForce GTX 1660 Ti GPU, and Pytorch as the deep learning framework.The Python version used was 3.7.

Datasets
• UAVfly. The detailed information regarding the UAVfly dataset is elaborated in "UAVfly dataset". We collected data in air-to-air form using three unmanned aerial vehicle devices (DJI AIR2s). The dataset encompasses residential areas, streets, fields, lakes, and mountains, with each environmental background type contributing nearly equally to the overall dataset.
• Det-Fly. The Det-Fly dataset 32 comprises over 13,000 images of airborne unmanned aerial vehicles taken by a DJI Mavic2 drone. It encompasses diverse real-world scenarios, featuring varied background scenes, viewing angles, relative distances, and flying altitudes. The dataset contains data from three different perspectives: front view, top view, and bottom view.

Floating-point operations (FLOP) serve as a metric of algorithmic complexity and are commonly used as an indirect measure of the speed of neural network models. Each multiplication or addition represents one FLOP. The computational complexity (FLOP) indicates the model's demand on hardware computational units and reflects the number of multiplications and additions required for forward inference.
Frames Per Second (FPS) refers to the number of frames (images) the network can process (detect) per second. It evaluates the detection speed, depicting the quantity of images processed per second or, equivalently, the time needed to process a single image; a shorter time implies a faster speed. FPS serves as a direct measure of the neural network model's speed, validating the algorithm's detection speed in the subsequent embedded experiments.
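A simple way to measure per-image latency and FPS is sketched below (the function name and warm-up count are illustrative; on GPU devices such as the Jetson TX2, synchronization before timing would also be needed):

```python
import time

def measure_fps(model_fn, images, warmup=10):
    """Return (milliseconds per image, FPS) for a detection callable (sketch)."""
    for img in images[:warmup]:          # warm-up runs are excluded from timing
        model_fn(img)
    start = time.perf_counter()
    for img in images:
        model_fn(img)
    elapsed = time.perf_counter() - start
    per_image = elapsed / len(images)
    return per_image * 1000.0, 1.0 / per_image
```

Averaging over many images, as done here, smooths out per-frame jitter; FPS is simply the reciprocal of the average per-image time.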

The performance evaluation metrics of the model
In machine learning and deep learning, the prediction outcomes of a classification task are categorized into the following four types, collectively known as the confusion matrix:
• True Positive (TP): Predicted positive and labeled positive, indicating a correct prediction.
• False Negative (FN): Predicted negative but labeled positive, indicating a misclassification.
• False Positive (FP): Predicted positive but labeled negative, indicating a misclassification.
• True Negative (TN): Predicted negative and labeled negative, indicating a correct prediction.
Here, positive and negative refer specifically to unmanned aerial vehicles. Object detection algorithms typically use several metrics to evaluate their performance, including precision, recall, average precision (AP), and mean average precision (mAP), calculated as:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

Analysis of the precision formula shows that precision measures, among all predicted positives, the fraction that are truly positive. As precision increases, the number of False Positives (FP) diminishes: fewer instances of other categories are misclassified as the designated class, indicating higher purity in the predicted positives. Higher precision corresponds to fewer false alarms or erroneous detections.
Analysis of the recall formula shows that recall measures, among all actual positives, the fraction that are correctly predicted as positive. As recall increases, the number of False Negatives (FN) decreases: fewer positives are misclassified as negatives, indicating a higher capacity to capture a larger portion of all actual positives. Higher recall corresponds to fewer missed detections.
Although named average precision (AP), AP is not computed by averaging precision values. Instead, it is the area enclosed by the Precision-Recall curve and the coordinate axes for each class, computed by integration. A larger AP, signifying a larger area under the Precision-Recall curve, implies higher precision and recall overall.
The mAP averages the AP values over all classes. AP reflects the precision of predictions for each class, while mAP represents the average AP across all classes, serving as an indicator of the overall accuracy of the entire model.
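The computations above can be sketched as follows (helper names are illustrative; AP is approximated here by rectangle summation over sorted recall points):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (recalls sorted increasing)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p      # rectangle between consecutive recall points
        prev_r = r
    return ap

def mean_average_precision(aps):
    """mAP: mean of per-class AP values."""
    return sum(aps) / len(aps)
```

Practical evaluators add refinements such as precision-envelope interpolation, but the area-under-the-curve idea is the same.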
The mAP encompasses two forms. One is mAP@0.5, which denotes the mAP value at an IoU threshold of 0.5: when the Intersection over Union (IoU) between the predicted box and the annotated box exceeds 0.5, the object is considered predicted correctly, and the mAP is computed under this premise. The other is mAP@[0.5:0.95], which represents the mAP averaged over multiple IoU thresholds in the range [0.5, 0.95] at an interval of 0.05: the ten thresholds in this range yield ten mAP values, which are then averaged. A larger mAP@[0.5:0.95] indicates more accurate predicted boxes, as it encompasses a wider range of higher IoU thresholds.

Training parameters
The YOLO algorithm is a widely used method for object detection, and the training parameters play a crucial role in determining the model's effectiveness and precision. Here is a succinct overview of key parameters and their respective functions:
• Number of Epochs: Specifies the number of complete passes the model makes through the entire dataset during training. Increasing epochs may improve model performance but can also lead to overfitting.
• Batch Size: Determines the quantity of samples fed into the model for weight updates in each iteration. Larger batch sizes generally expedite training but may require more memory resources.
• Learning Rate: Governs the size of adjustments made to model parameters during training. A higher learning rate can hasten convergence but might result in unstable training; conversely, a lower learning rate demands more training time but contributes to a more stable convergence towards an optimal model.
• Optimizer: The algorithm employed to adjust model weights to minimize the loss function. Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop, each offering distinct advantages in various contexts.
• Image Size: Specifies the dimensions of input images. Larger image sizes generally enhance detection accuracy but also increase computational load and training duration.
The meticulous selection and fine-tuning of these parameters can significantly impact training speed, model performance, and convergence. Strategic adjustments help the model adapt to the dataset, ultimately enhancing detection accuracy.
The hyperparameter settings for this experiment are as follows. Initially, experiments varied the training epochs among 100, 200, 300, and 400 to optimize model performance while minimizing overfitting risks; we ultimately fixed the training epochs at 300. Considering the influence of batch size on memory usage, and guided by equipment limitations, a batch size of 16 was chosen for training. The training images were standardized to 640 × 640 pixels. Stochastic Gradient Descent (SGD) served as the optimizer. To dynamically regulate the learning rate during training, we initialized lr0 (the initial learning rate) at 0.01 and applied the cosine annealing algorithm for adjustments. Figure 6 depicts the comparative analysis of mean Average Precision (mAP) between our modified algorithm and the original approach after parameter tuning. The graph distinctly exhibits the enhanced model's superiority in both detection accuracy and convergence speed over the original YOLOv5s.
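The cosine annealing schedule described above can be sketched as follows; lr0 = 0.01 and 300 epochs are the settings from this experiment, while the final-learning-rate fraction lrf_ratio is an assumed value for illustration, not taken from the paper:

```python
import math

def cosine_annealed_lr(epoch, total_epochs=300, lr0=0.01, lrf_ratio=0.01):
    """Cosine-annealed learning rate for a given epoch.

    The rate decays smoothly from lr0 down to lr0 * lrf_ratio over training.
    """
    cos = (1 + math.cos(math.pi * epoch / total_epochs)) / 2  # 1 -> 0
    return lr0 * (lrf_ratio + (1 - lrf_ratio) * cos)

print(cosine_annealed_lr(0))    # 0.01 at the start of training
print(cosine_annealed_lr(300))  # lr0 * lrf_ratio = 0.0001 at the final epoch
```

The smooth decay is what allows large, fast updates early in training and fine, stable updates near convergence, matching the learning-rate discussion above.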
Table 3 shows that Model 2, which directly replaces the YOLOv5s backbone with the lightweight convolutional neural network CF2-MC, exhibits slightly reduced detection accuracy compared to the original YOLOv5s, while its lightweight metrics significantly outperform the latter. The experimental results illustrate the efficacy of employing CF2-MC for model lightweighting: the substantial reduction in parameter count and computational demand effectively lowers the hardware prerequisites for both training and inference. To compensate for the accuracy loss caused by lightweighting, the CF2-MC backbone was optimized by adding an SPP module after its last convolutional layer, raising the model's mAP@0.5 from 92.16% to 93.5%. The outcomes indicate that employing the enhanced CF2-MC-SPP as the backbone leads to a substantial improvement in detection accuracy compared to Model 2. This experiment validates the effectiveness of the lightweight backbone based on CF2-MC-SPP.

Ablation experiments of feature fusion network models
Merely relying on the lightweight design of the backbone network is insufficient to meet our requirements. In this section, we conducted lightweight ablation experiments on the feature fusion part of YOLOv5s using the UAVfly dataset to achieve a lower parameter count and higher detection accuracy.
In the ablation experiments, YOLOv5s was used as the baseline. Model A was obtained by replacing the backbone network of YOLOv5s with CF2-MC-SPP. Building upon Model A, Model B was derived by introducing the MG fusion module, and Model C was developed by incorporating the EIoU function.
Table 4 presents the results of the ablation experiments, demonstrating that by replacing the backbone network of YOLOv5s with CF2-MC-SPP and utilizing both the MG fusion module and the EIoU function, the YOLOv5s-ngn model achieved a reduction in parameter count and an improvement in accuracy. Model B exhibited a 57.7% reduction in parameter count and a 41% reduction in FLOP compared to Model A, with a corresponding 0.7% increase in mAP@0.5. Model C, which solely employed the EIoU function, showed no change in parameter count or FLOP compared to Model A, but achieved a 1.3% increase in mAP@0.5. Subsequently, Model D (YOLOv5s-ngn) was derived by incorporating the EIoU function into Model B, further enhancing detection accuracy. These experiments validate the performance of the proposed improved model: YOLOv5s-ngn achieves a balance between speed and accuracy by reducing the parameter count while surpassing the detection accuracy of the original YOLOv5s.

Experimental results on the self-constructed UAVfly dataset
In order to confirm the superiority of the proposed YOLOv5s-ngn network over conventional object detection algorithms, we trained various object detection network models on the dataset we created. To ensure experimental reliability, YOLOv7 [61], YOLOX [62], YOLOv5, YOLOv4 [27], and YOLOv3 [26] were trained using hyperparameters identical to those of their unimproved counterparts. The evaluation metrics for these eight network models are presented in Table 5.
Based on the detection results in Table 5 for the eight network models, it is evident that the YOLOv5s-ngn model outperforms all of the compared models except YOLOv7 in mAP for UAV object detection. Furthermore, it surpasses the other seven models in lightweight metrics, including parameter count and FLOP. These findings underscore the exceptional performance of the YOLOv5s-ngn model in UAV object detection tasks, whereas the alternative approaches exhibit limited generality in this context, encountering challenges such as insufficient detection accuracy and slow detection speeds. The choice of YOLOv5s-ngn as the target detection algorithm for identifying UAV objects is therefore highly compelling, ensuring high detection accuracy while remaining lightweight enough for real-time deployment.

Experimental results on the Det-Fly dataset
We conducted a comprehensive comparison between YOLOv5s-ngn and mainstream algorithms on the Det-Fly dataset, with specific results presented in Table 6. These findings reveal that YOLOv5s-ngn achieves outstanding detection accuracy in urban and sky backgrounds. Additionally, its mean Average Precision (mAP) significantly surpasses that of the other eight algorithms across the four different background conditions. This underscores the exceptional performance of YOLOv5s-ngn across various background environments, highlighting its prowess in target detection tasks.

Embedded experiments
The NVIDIA TX2 is an embedded artificial intelligence computing platform designed by NVIDIA. It incorporates a high-performance NVIDIA Pascal architecture GPU, providing robust computational power and energy efficiency. Specifically tailored for machine learning, deep learning, computer vision, and associated domains, the TX2 offers multiple connectivity interfaces and comprehensive software support. Consequently, it has found extensive use across diverse domains including unmanned aerial vehicles, robotics, and intelligent cameras. The specific specifications of the NVIDIA TX2 used are detailed in Table 7.
We established the experimental setup on the NVIDIA TX2 platform using JetPack 4.5.1 and CUDA 10.1, with PyTorch 1.8.1. The trained model was then deployed on the TX2 for real-time unmanned aerial vehicle (UAV) detection tests based on the UAVfly dataset scenarios. Detailed information regarding the average inference speed of various models processing individual frame images is presented in Table 8. The results show that our proposed model achieves an average inference speed of 14.5 ms per single-frame image, a processing rate of 69 frames per second (FPS), demonstrating substantial potential to meet practical demands in UAV detection tasks.
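The reported throughput follows directly from the per-frame latency via FPS = 1000 / latency (in ms); a quick illustrative check:

```python
def fps_from_latency(latency_ms):
    """Convert average per-frame inference time (ms) to frames per second."""
    return 1000.0 / latency_ms

# 14.5 ms per frame corresponds to roughly 69 FPS, consistent with the
# figures reported above for the TX2 experiments
print(round(fps_from_latency(14.5)))  # 69
```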

Figure 5. Intersection over Union (IoU) schematic diagram: (a) illustrates the positions of the predicted box and the ground truth box, (b) shows the intersection between the predicted box and the ground truth box, and (c) represents the union of the predicted box and the ground truth box.

Figure 6. Comparison of mAP between the improved algorithm and the original algorithm during training.

Figure 7 demonstrates the detection performance of the original YOLOv5s model and the improved model on unmanned aerial vehicle (UAV) targets. Based on Fig. 7, it is noticeable that the enhanced model demonstrates heightened confidence in UAV detection outcomes, with anchor boxes closely aligned with UAV positions, implying improved localization accuracy. The improved network model can accurately identify UAVs; therefore, the proposed lightweight network-based model can perform real-time detection of UAVs.

Discussion
To address the challenge of achieving real-time detection among drones during operation, we propose a lightweight air-to-air drone detection model based on the YOLOv5s architecture. First, we achieved lightweight feature extraction by integrating the CF2-MC-SPP lightweight feature extraction network into YOLOv5s. Simultaneously, we introduced an innovative feature fusion module (MG) and applied the EIoU loss function, aiming to enhance detection accuracy while reducing the complexity of the YOLOv5s model. The optimized model demonstrated outstanding performance on the UAVfly and Det-Fly datasets. Lastly, embedded experiments conducted on the NVIDIA TX2 platform revealed an average inference speed of only 14.5 ms per single-frame image. Despite YOLOv5s-ngn accomplishing real-time detection of air-to-air drones, it still faces limitations, such as the datasets' inability to fully represent the actual operational environments of drones and challenges in distinguishing between multiple overlapping drone images. Our future work will involve expanding the dataset to encompass a more comprehensive range of real-world drone operating environments, including scenarios with multiple overlapping drones. Furthermore, we will explore model optimization techniques to achieve faster and more accurate drone detection, which will remain a significant focal point in our future endeavors.
YOLOv3 [26], YOLOv4 [27], and YOLOv5 [47]: these approaches treat target detection as a regression problem, enabling real-time detection. SSD (Single Shot MultiBox Detector) [28]: efficient single-stage object detection is attained by concurrently detecting targets at different hierarchical levels and producing multiple detection results. RetinaNet [48]: addressing the imbalance between positive and negative samples through Focal Loss, RetinaNet enhances detection performance for small targets while maintaining high recall.
• Non-Convolutional Approaches: In addition to CNNs, certain deep learning-based target detection methods adopt non-convolutional structures, such as those based on Recurrent Neural Networks (RNNs) [49] or attention mechanisms.
• Graph Neural Network (GNN) Approaches: Certain studies explore the application of GNNs in target detection, leveraging relational information captured within graph structures to enhance detection performance [50].
• Transformer-Based Approaches: Recently, some endeavors have applied Transformer architectures to target detection, incorporating self-attention mechanisms to capture global and local relationships.

Table 1. Comparison between the Det-Fly and UAVfly datasets. Columns: categories of backgrounds, relative viewing angles, proportion of small targets, challenging scenarios.
Deliberate efforts were not made to collect images of small target objects (objects with a height and width less than 10% of the entire image). Instead, images were captured at fixed time intervals during the UAV's operation, ensuring coverage of various UAV horizontal-distance scenarios in the dataset collection.
The evaluation metrics for lightweightness
This paper employs three key metrics to evaluate the lightweight nature of the model: the number of parameters (Param), floating-point operations (FLOP), and frames per second (FPS). The parameter count represents the total number of model parameters, affecting memory usage and program initialization time. The parameter count is specific to the network model and, once the architecture is fixed, remains unchanged; during model lightweighting, the parameter count decreases. Upon completion of training, each parameter has a precise value, allowing direct use of parameter files for target prediction during detection tasks.
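As an illustration of the Param metric, the parameter count of a standard convolutional layer can be computed in closed form (a sketch for intuition, not the accounting code used in the paper):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard 2-D convolution.

    Each output channel owns a k*k*c_in weight kernel plus an optional
    bias term, so the total is (k*k*c_in + 1) * c_out with bias.
    """
    return (k * k * c_in + (1 if bias else 0)) * c_out

# A single 3x3 convolution from 64 to 128 channels
print(conv2d_params(64, 128, 3))  # 73856
```

Summing such per-layer counts over a network explains why swapping in a lightweight backbone like CF2-MC reduces Param so sharply: fewer channels and smaller kernels shrink every term in the sum.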

Table 3. Comparative experiment of different backbone networks for YOLOv5s.

Table 5. Comparative experiment of classic target detection networks.

Table 6. Comparative experiments of YOLOv5s-ngn with other mainstream methods on the Det-Fly dataset.

Table 7. Detailed specifications of the Jetson TX2 embedded system.