A novel dual-pooling attention module for UAV vehicle re-identification

Vehicle re-identification (Re-ID) aims to identify, given a query vehicle image, the same vehicle captured by other cameras. It plays a crucial role in building safe and smart cities. With the rapid growth and deployment of unmanned aerial vehicle (UAV) technology, vehicle Re-ID in UAV aerial photography scenes has attracted significant attention from researchers. However, because UAVs fly at high altitudes, the shooting angle of vehicle images is sometimes close to vertical, leaving fewer local features available for Re-ID. This paper therefore proposes a novel dual-pooling attention (DpA) module that extracts and enhances locally important vehicle information in both the channel and spatial dimensions by constructing two branches, channel-pooling attention (CpA) and spatial-pooling attention (SpA), and applying multiple pooling operations to strengthen attention to fine-grained vehicle information. Specifically, the CpA module operates across the channels of the feature map and combines four pooling operations so that vehicle regions containing discriminative information receive greater attention. The SpA module uses the same pooling strategy to identify discriminative representations and merges vehicle features of image regions in a weighted manner. The feature information of the two dimensions is finally fused, and the network is trained jointly with a label-smoothing cross-entropy loss and a hard-mining triplet loss, addressing the loss of detail caused by the high altitude of UAV shots. The proposed method's effectiveness is demonstrated through extensive experiments on the UAV-based vehicle datasets VeRi-UAV and VRU.


Introduction
As an important component of intelligent transportation systems, vehicle re-identification (Re-ID) aims to find the same vehicle in images taken by different surveillance cameras. A vehicle Re-ID algorithm can perform image matching automatically, solving identification problems caused by external conditions such as deliberately obscured license plates, occlusion by obstacles, and blurred images. It saves manpower, consumes little time, and provides strong technical support for building and maintaining urban security and safeguarding public safety. Driven by deep learning, more and more researchers have turned to deep convolutional neural networks, which overcome the insufficient feature representation of traditional hand-crafted methods.
Existing vehicle Re-ID work [1][2][3][4][5][6] obtains vehicle data mainly from road surveillance video. The large number of surveillance cameras deployed on highways, at intersections, and in other areas can only provide vehicle images from a specific angle and within a small range. In certain special circumstances, such as camera failure or a target vehicle outside the monitoring coverage, the target vehicle cannot be identified and re-identified. In recent years, unmanned aerial vehicle (UAV) technology 7 has developed significantly in flight time, wireless image transmission, and automatic control. Mobile cameras on UAVs offer a wider range of viewpoints as well as better maneuverability, mobility, and flexibility, and UAVs can track and record specific vehicles in urban areas and on highways 8 . Therefore, the vehicle Re-ID task in the UAV scenario has received increasingly wide attention from researchers as a complement to the traditional road surveillance scenario and has great application value in practical public safety management, traffic monitoring, and vehicle statistics. Figure 1 compares vehicle images from road surveillance with those from UAV aerial photography. The two are similar in that each captured image contains a single complete vehicle. They differ in that a UAV usually flies higher than a fixed surveillance camera is mounted, so the angle of the vehicle image is sometimes approximately vertical; moreover, the UAV's altitude varies, producing scale variation in the captured vehicle images.
Since a UAV usually flies higher than a fixed surveillance camera is mounted, the obtained vehicle images are taken at a nearly vertical angle, and therefore fewer local features of the vehicle are available for Re-ID. On the one hand, the attention mechanism has proven effective, and it is important to build an attention module that focuses on informative channels and important regions. On the other hand, average pooling 9 takes the mean value in each rectangular region, which preserves background information in the image and passes information from all features of the feature map to the next layer. Generalized mean pooling 10 can focus on regions of different fineness by adjusting a parameter. Minimum pooling 11 focuses on the smallest pixel values in the feature map. Soft pooling 12 uses softmax weighting to retain the basic attributes of the input while amplifying stronger feature activations, i.e., it minimizes the information loss of the pooling process and better retains informative features. Unlike maximum pooling, soft pooling is differentiable, so the network receives a gradient for every input during backpropagation, which facilitates training. Researchers have successively proposed a series of pooling methods [13][14][15] , each with different advantages and disadvantages. Previous studies usually combine only average pooling and maximum pooling to capture key image features, ignoring combinations of multiple pooling methods. In addition, the pooling layer is an important component of convolutional neural networks, playing a significant role in reducing the number of training parameters, easing network optimization, and preventing overfitting 16 .
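The four pooling operators discussed above can be sketched over a single pooling region; the following minimal NumPy example (assuming non-negative, post-ReLU activations, and an illustrative GeM coefficient α = 3) shows how each one summarizes the same region differently.

```python
import numpy as np

def avg_pool(region):
    # Average pooling: mean of all elements, keeps background context.
    return region.mean()

def min_pool(region):
    # Minimum pooling: smallest activation, tends to pick background pixels.
    return region.min()

def gem_pool(region, alpha=3.0):
    # Generalized mean pooling: alpha = 1 reduces to average pooling,
    # alpha -> infinity approaches max pooling.
    return np.mean(region ** alpha) ** (1.0 / alpha)

def soft_pool(region):
    # Soft pooling: softmax-weighted sum, emphasizes strong activations
    # while remaining differentiable everywhere.
    w = np.exp(region) / np.exp(region).sum()
    return (w * region).sum()

region = np.array([[0.0, 1.0], [2.0, 3.0]])
print(avg_pool(region))  # 1.5
print(min_pool(region))  # 0.0
# GeM and SoftPool both lie between the average (1.5) and the max (3.0)
print(avg_pool(region) <= gem_pool(region) <= region.max())   # True
print(avg_pool(region) <= soft_pool(region) <= region.max())  # True
```

The interleaving of these behaviors (background-preserving, background-seeking, tunable, and activation-weighted) is what the DpA branches below exploit.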

Vehicle Re-ID methods
In recent years, most vehicle Re-ID methods have been based on traditional road surveillance images. Their main ideas include using local vehicle features to extract detailed information [17][18][19] , using attention mechanisms to improve the model's focus on important regions [20][21][22] , designing appropriate loss functions to optimize training and improve recognition rates 23,24 , and using unsupervised learning without manual labeling to improve generalization in complex real-world scenes [25][26][27] . For example, Jiang et al. 28 designed a global reference attention network (GRA-Net) with three branches to mine a large number of useful discriminative features, making similar-looking but different vehicles easier to distinguish. The viewpoint-aware network (VANet) 29 learns feature metrics for same-viewpoint and different-viewpoint pairs. Generative adversarial networks (GANs) are used to alleviate the labeling difficulty of Re-ID datasets 30,31 . EMRN 32 proposes a multi-resolution feature dimension uniform module to extract fixed-dimensional features from images of varying resolutions, addressing the multi-scale problem. Besides, GiT 33 uses a graph network approach, proposing a structure in which graphs and transformers interact constantly, enabling close collaboration between global and local features for vehicle Re-ID.
However, current vehicle datasets such as VeRi-776 and VehicleID were captured by fixed surveillance cameras, with insufficient viewpoint diversity, so the above feature extraction methods target only vehicle images captured by traditional road surveillance. Since the first aerial-image vehicle Re-ID dataset, VRAI 34 , appeared in 2019, vehicle Re-ID on UAV-captured images has begun to attract researchers' attention [35][36][37] . For example, in the aerial-image scenario, a normalized softmax loss 37 was proposed to increase the inter-class distance and decrease the intra-class distance, combined with a triplet loss to train the model.

Attention mechanism
The attention mechanism 38,39 is widely used across deep convolutional neural networks; its core idea is to select the information most important to the current task from a large amount of input. The channel-attention squeeze-and-excitation (SE) network 40 emphasizes important channels while suppressing noise. The spatial-attention non-local module 41 weights the features of all locations to obtain more comprehensive semantic information. Triplet attention 42 , which predicts channel and spatial attention separately, considers the relationship between pairs of dimensions through three branches to achieve cross-domain interaction.
Recently, many researchers have combined attention mechanisms with vehicle Re-ID models to substantially improve feature representation. For example, the dual-relational attention module (DRAM) 39 measures the importance of feature points in the spatial and channel dimensions to form a three-dimensional attention module, improving the performance of the attention mechanism and mining more detailed semantic information. In addition, Zhang et al. 43 proposed a dual-attention granularity network for vehicle Re-ID, which uses an embedded self-attention model to obtain an attention heat map and then localizes accurate local regions on it. However, unlike traditional road surveillance cameras, UAVs shoot from more flexible and usually greater heights, so the captured images mostly show incomplete vehicles from a top-down view. Consequently, some generic attention modules cannot capture local detail regions well, leading to poor performance of vehicle Re-ID models on UAV aerial images.
Based on the above analysis, this paper presents a novel dual-pooling attention (DpA) module for UAV vehicle Re-ID. Specifically, we first design the channel-pooling attention (CpA) module and the spatial-pooling attention (SpA) module by combining multiple pooling operations, so that the network focuses better on detailed information while avoiding redundant information; the pooling operations also help prevent overfitting. The CpA module focuses on important vehicle features while ignoring unimportant information, and the SpA module captures the local-range dependence of spatial regions. The two are then combined into the dual-pooling attention module, which is embedded into a conventional ResNet50 backbone to improve the model's channel and spatial perception. Omni-dimensional dynamic (OD) convolution is also introduced into the CpA and SpA modules to dynamically extract rich contextual information. In addition, the paper trains with a hard-mining triplet loss combined with a label-smoothing cross-entropy loss, improving the triplet loss's discriminative power on difficult vehicle samples. Finally, extensive experiments verify the effectiveness of our model: the proposed method achieves an mAP of 81.74% on the VeRi-UAV dataset and reaches 98.83%, 97.90%, and 95.29% mAP on the three test subsets of VRU. This indicates that the DpA module can address the lack of fine-grained information in vehicle Re-ID images taken by UAVs.

Overall network architecture
The overall network architecture of this paper is shown in Figure 2. It consists of three parts: input images, feature extraction, and output results. First, the input image is augmented by the AugMix 44 method, which overcomes the image distortion caused by earlier MixUp augmentation by randomly applying different augmentations to the same image. Then, the ResNet50 backbone with a dual-pooling attention (DpA) module forms the feature extraction part of the network. After the gallery set to be queried and the target query vehicle are fed into the model for feature extraction, the similarity between the features of the target query image and those of the gallery images is computed with a distance metric. Finally, the similarities are ranked and the vehicle retrieval results are obtained.
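The ranking step at the end of the pipeline can be sketched as follows. The paper does not specify the similarity metric, so cosine similarity is used here purely as an illustrative assumption; `rank_gallery` and the toy 2-D features are likewise hypothetical.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    # Cosine similarity between the query feature and each gallery feature,
    # then gallery indices sorted from most to least similar.
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims), sims

# Toy gallery of three 2-D feature vectors.
gallery = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
order, sims = rank_gallery(np.array([1.0, 0.1]), gallery)
print(order.tolist())   # [0, 1, 2]: most similar gallery image first
```

In the real system the features would be the high-dimensional embeddings produced by the ResNet50 + DpA extractor, and the ranked list is what the evaluation metrics (Rank-n, mAP) are computed over.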

Channel-pooling attention
To focus more on the discriminative features of vehicle images and avoid interference from background clutter, four pooling methods are introduced to process the channel features. The module is shown in Figure 3 (a). First, let the output feature of the third residual block (Conv4_x) of ResNet50 be the input matrix X. Suppose X ∈ R^{C×H×W}, where C, H, and W represent the channel number, height, and width of the feature map, respectively. Four copies of X are made, and average pooling (AvgP) 9 , generalized mean pooling (GeMP) 10 , minimum pooling (MinP) 11 , and soft pooling (SoftP) 12 are applied to them. The first three poolings change the dimension from C × H × W to C × 1 × 1 channel descriptors: the feature map X ∈ R^{C×H×W} is taken as input, and a vector f ∈ R^{C×1×1} is generated as the output of the pooling operation. The vector f = [f_1, ..., f_k, ..., f_C] for AvgP, MinP, and GeMP is given, respectively, by

$f_k^{avg} = \frac{1}{|R|} \sum_{(p,q) \in R} x_{pq}^{k},$

$f_k^{min} = \min_{(p,q) \in R} x_{pq}^{k},$

$f_k^{gem} = \left( \frac{1}{|R|} \sum_{(p,q) \in R} \left( x_{pq}^{k} \right)^{\alpha} \right)^{1/\alpha},$

where x_{pq}^{k} denotes the element located at (p, q) in the pooling region R of the k-th feature map, |R| denotes the number of elements in R, and α is a control coefficient.
The feature map generated by SoftP remains C × H × W; each activation is reweighted by a softmax over its region:

$\tilde{x}_{mn}^{k} = \frac{e^{x_{mn}^{k}}}{\sum_{(p,q) \in R} e^{x_{pq}^{k}}} \, x_{mn}^{k},$

where x_{mn}^{k}, analogous to x_{pq}^{k} above, denotes the element located at (m, n) in the region R. From one perspective, since AvgP attends to every pixel of the feature map equally and SoftP captures important regions better than maximum pooling, their outputs are summed to obtain a_1 ∈ R^{C×H×W}, giving more attention to important vehicle features. From another perspective, GeMP can adaptively focus on regions of different granularity by adjusting its parameter, whereas minimum pooling focuses on small pixel values in the feature map, i.e., background regions; GeMP and MinP are therefore subtracted to obtain a_2 ∈ R^{C×1×1}, emphasizing fine-grained vehicle features while suppressing the background as much as possible. The two outputs are then dot-multiplied to obtain the channel attention map C* ∈ R^{C×H×W}:

$C^{*} = \mathrm{Conv}(a_1) * \mathrm{Conv}(a_2),$

where Conv stands for a convolution operation and * represents the dot product. The OBR module, composed of OD convolution, batch normalization (BN), and a rectified linear unit (ReLU) activation, is then applied twice in succession to C*. Compared with ordinary convolution, the dynamic convolution used here is a linear weighting of multiple convolution kernels that depends on the input data, learning flexible attention and enhancing feature extraction. Finally, the original input X is added to the output of the OBR module and normalized by a sigmoid function to obtain the final channel-pooling attention output X_c ∈ R^{C×H×W}:

$X_{c} = \sigma\left( X + \mathrm{OBR}\left( \mathrm{OBR}\left( C^{*} \right) \right) \right),$

where σ(·) is the sigmoid activation function and the OBR module comprises a 3×3 OD convolution, BN, and ReLU.
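A rough NumPy sketch of the CpA combination logic follows. It is not the paper's implementation: the learned convolutions and OD-convolution OBR blocks are omitted (replaced by identities), inputs are assumed non-negative (post-ReLU), and α = 3 is an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cpa_sketch(X, alpha=3.0):
    # X: (C, H, W) feature map, assumed non-negative (post-ReLU).
    C, H, W = X.shape
    flat = X.reshape(C, -1)                                    # (C, H*W)

    avg = flat.mean(axis=1).reshape(C, 1, 1)                   # AvgP -> C x 1 x 1
    gem = (np.mean(flat ** alpha, axis=1) ** (1 / alpha)).reshape(C, 1, 1)  # GeMP
    mn = flat.min(axis=1).reshape(C, 1, 1)                     # MinP
    w = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True) # per-channel softmax
    soft = (w * flat).reshape(C, H, W)                         # SoftP, keeps C x H x W

    a1 = avg + soft    # broadcast sum: emphasize important activations
    a2 = gem - mn      # fine-grained response minus background response
    c_star = a1 * a2   # channel attention map, C x H x W (broadcast dot product)
    return sigmoid(X + c_star)  # residual connection + sigmoid normalization

X = np.random.rand(4, 8, 8)
out = cpa_sketch(X)
print(out.shape)                             # (4, 8, 8)
print(bool(((out > 0) & (out < 1)).all()))   # True: sigmoid-normalized output
```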

Spatial-pooling attention
Spatial attention is computed from feature relations, similarly to the channel-pooling attention module above. As shown in Figure 3 (b), the output feature X of the third residual block (Conv4_x) of ResNet50 is first transposed to obtain X^T ∈ R^{H×W×C}, whose spatial dimensions H and W are then multiplied and flattened into a matrix of size HW × C × 1. This matrix is copied four times, and AvgP, SoftP, GeMP, and MinP are applied along the channel axis, changing the dimension from H × W × C to HW × 1 × 1 spatial descriptors. As before, the outputs of AvgP and SoftP are added and passed through a convolution layer to obtain b_1 ∈ R^{HW×1×1}, and the outputs of GeMP and MinP are subtracted to obtain b_2 ∈ R^{HW×1×1}. The two are concatenated to obtain the output S* ∈ R^{2HW×1×1}:

$S^{*} = [\, b_1 , b_2 \,],$

where [·, ·] is the concatenation operation. A convolution is then applied to S* to expand it to C × 1 × 1. As in the CpA branch, the OBR module is applied twice to S* to dynamically enhance the acquisition of spatial-domain features. Finally, the original input X is added to obtain the output matrix X_s ∈ R^{C×H×W} of the spatial-pooling attention module:

$X_{s} = X + \mathrm{OBR}\left( \mathrm{OBR}\left( \mathrm{Conv}\left( S^{*} \right) \right) \right).$
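The computation of the SpA spatial descriptors can be sketched in NumPy as below. As with the CpA sketch, the learned convolutions and OBR blocks are omitted, inputs are assumed non-negative, and α = 3 is illustrative; only the per-location pooling and the concatenation into S* are shown.

```python
import numpy as np

def spa_descriptors(X, alpha=3.0):
    # X: (C, H, W) feature map, assumed non-negative (post-ReLU).
    C, H, W = X.shape
    flat = X.reshape(C, -1).T                      # (H*W, C): one row per location

    avg = flat.mean(axis=1, keepdims=True)         # AvgP along channels -> (HW, 1)
    gem = np.mean(flat ** alpha, axis=1, keepdims=True) ** (1 / alpha)  # GeMP
    mn = flat.min(axis=1, keepdims=True)           # MinP
    w = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    soft = (w * flat).sum(axis=1, keepdims=True)   # SoftP -> (HW, 1)

    b1 = avg + soft                                # important spatial responses
    b2 = gem - mn                                  # fine-grained minus background
    return np.concatenate([b1, b2], axis=0)        # S*: (2*HW, 1)

X = np.random.rand(4, 6, 6)
s = spa_descriptors(X)
print(s.shape)   # (72, 1), i.e. 2 * H * W rows
```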

Loss functions
In vehicle Re-ID, a combination of identity loss and metric loss is often used. In the training phase, we therefore use a cross-entropy (CE) loss for classification and a triplet loss for metric learning. The CE loss, commonly used in classification tasks, measures the difference between true and predicted values; the smaller the value, the better the model's prediction. The label smoothing (LS) strategy 45 is introduced to mitigate overfitting. The label-smoothing cross-entropy (LSCE) loss is thus

$L_{LSCE} = \sum_{i=1}^{N} -q_i \log p_i, \quad q_i = \begin{cases} 1 - \frac{N-1}{N}\varepsilon, & i = y \\ \frac{\varepsilon}{N}, & i \neq y \end{cases}$

where N is the number of classes, p_i is the predicted probability of class i, y is the ground-truth label, and ε is the smoothing factor, set to 0.1 in our experiments. The core idea of the triplet loss is to build triplets of anchor, positive, and negative samples; through learning, positive samples of the same category are pulled closer to the anchor in feature space, while negative samples of different categories are pushed farther away. In this paper, we use the hard-mining triplet (HMT) loss, which selects the hardest-to-distinguish positive and negative pairs within a batch for training, further improving mining ability on difficult vehicle samples:

$L_{HMT} = \sum_{i=1}^{T} \sum_{a=1}^{S} \left[ m + \max_{p=1,\dots,S} \left\| f(A_a^i) - f(P_p^i) \right\|_2 - \min_{\substack{j=1,\dots,T,\ j \neq i \\ n=1,\dots,S}} \left\| f(A_a^i) - f(N_n^j) \right\|_2 \right]_{+}$

where T denotes the number of vehicle identities in each training batch and S the number of images per identity; A, P, and N denote the anchor sample, the vehicle of the same category as the anchor but least similar to it, and the vehicle of a different category but most similar to it, respectively; f(·) is the feature embedding; m represents the margin of the loss; and [·]_+ is the max(·, 0) function. In summary, this work combines the LSCE loss and the HMT loss. The final loss is

$L = \lambda_1 L_{LSCE} + \lambda_2 L_{HMT},$

where λ_1 and λ_2 are weights for the two losses, with λ_1 = λ_2 = 1.
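Both losses can be sketched numerically as below, a minimal NumPy version assuming Euclidean distances between embeddings and averaging the triplet terms over the batch; it is an illustration of the definitions above, not the paper's implementation.

```python
import numpy as np

def lsce_loss(logits, label, eps=0.1):
    # Label-smoothing cross-entropy for one sample.
    # logits: (N,) raw class scores; label: ground-truth class index.
    N = logits.shape[0]
    p = np.exp(logits - logits.max())
    p = p / p.sum()                          # softmax probabilities
    q = np.full(N, eps / N)                  # smoothed target distribution
    q[label] = 1.0 - (N - 1) * eps / N
    return -(q * np.log(p)).sum()

def hmt_loss(feats, labels, margin=0.3):
    # Hard-mining triplet loss over a batch (averaged over anchors).
    # feats: (B, D) embeddings; labels: (B,) identity labels.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    loss = 0.0
    for a in range(len(labels)):
        pos = labels == labels[a]
        hardest_pos = d[a][pos].max()        # farthest same-identity sample
        hardest_neg = d[a][~pos].min()       # closest different-identity sample
        loss += max(margin + hardest_pos - hardest_neg, 0.0)
    return loss / len(labels)

feats = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([0, 0, 1, 1])
print(hmt_loss(feats, labels))   # 0.0: identities are well separated
```

With uniform logits, `lsce_loss` reduces to log N regardless of the label, which is a handy sanity check.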

Datasets
Song et al. 46

Implementation details
In this paper, we use the weights of ResNet50 pre-trained on ImageNet to initialize the network. All experiments were performed in PyTorch. Each training image is sampled with balanced identity sampling, resized to 256 × 256, and pre-processed with AugMix data augmentation. In the training phase, the model was trained for a total of 60 epochs, with a linear learning-rate warm-up strategy. For the VeRi-UAV dataset, the training batch size is 32 and an SGD optimizer is used with an initial learning rate of 0.35e-4, together with the CosineAnnealingLR learning-rate schedule. For the VRU dataset, the training batch size is 64 and the Adam optimizer is used with an initial learning rate of 1e-4, together with the MultiStepLR schedule, which decays the rate to 1e-5 and 1e-6 at the 30th and 50th epochs. The batch size for testing is 128 in all cases. In the testing phase, we use Rank-n and mean average precision (mAP) as the main evaluation metrics; mINP is also introduced in the ablation experiments to further demonstrate the experimental effects.
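The optimizer and scheduler settings above map onto PyTorch as in the following configuration sketch. The `model` stand-in and the SGD momentum value are assumptions not stated in the text; everything else mirrors the stated hyperparameters.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual ResNet50 + DpA network

# VeRi-UAV: SGD, initial lr 0.35e-4, cosine-annealing schedule over 60 epochs,
# training batch size 32. (momentum=0.9 is an assumed, common default.)
opt_veri = torch.optim.SGD(model.parameters(), lr=0.35e-4, momentum=0.9)
sched_veri = torch.optim.lr_scheduler.CosineAnnealingLR(opt_veri, T_max=60)

# VRU: Adam, initial lr 1e-4, decayed tenfold at epochs 30 and 50
# (1e-4 -> 1e-5 -> 1e-6), training batch size 64.
opt_vru = torch.optim.Adam(model.parameters(), lr=1e-4)
sched_vru = torch.optim.lr_scheduler.MultiStepLR(opt_vru, milestones=[30, 50], gamma=0.1)
```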

Comparison with state-of-the-art methods
Comparisons on VeRi-UAV. The methods compared on the VeRi-UAV dataset include the handcrafted-feature-based methods BOW-SIFT 48 and LOMO 49 , and the deep-learning-based methods Cross-entropy Loss 45 , Hard Triplet Loss 50 , Triplet+ID Loss 34 , VANet 51 , RANet 52 , ResNeSt 53 , and PC-CHSML 46 . Among them, LOMO 49 handles vehicle viewpoint and lighting changes through handcrafted local features. BOW-SIFT 48 performs feature extraction with content-based image retrieval and SIFT. VANet 51 learns a viewpoint-aware deep metric and can retrieve images of different viewpoints despite interference from similar-viewpoint images. RANet 52 implements a resolution-adaptive deep CNN. PC-CHSML 46 targets UAV aerial photography scenarios, improving recognition and retrieval of aerial images by combining pose-calibrated cross-view matching with hard-sample-aware metric learning. Table 1 shows the detailed comparison with the above methods. First, the results show that the deep-learning-based approaches achieve clear improvements over the handcrafted-feature-based ones. Second, compared with methods designed for fixed surveillance scenarios such as VANet 51 , DpA shows some improvement in the highly flexible shooting situation. Additionally, compared with PC-CHSML 46 , designed for the UAV aerial scenario, DpA improves mAP, Rank-1, Rank-5, and Rank-10 by 4.2%, 10.0%, 9.5%, and 9.6%, respectively. This further verifies the effectiveness of the module.
Comparisons on VRU. VRU is a relatively recently released UAV-based vehicle dataset, so few results have been reported on it. Table 2 compares DpA with other methods 47,[54][55][56] on VRU. Among them, MGN 54 integrates information of different granularities through one global branch and two local branches to improve the robustness of the model. SCAN 55 uses channel and spatial attention branches to adjust the weights of different locations and channels, making the model focus on regions with discriminative information. Triplet+CE loss 56 trains the model with an ordinary triplet loss and cross-entropy loss. GASNet 47 captures effective vehicle information by extracting viewpoint-invariant and scale-invariant features. The results show that DpA improves mAP by 0.32%, 0.59%, and 1.36% on the three subsets of VRU. Taken together, this indicates that the DpA module enhances the model's ability to extract discriminative features and can well address the neglect of local features in UAV scenes.

Ablation experiments
In this section, we design ablation experiments on the VeRi-UAV dataset to evaluate the effectiveness of the proposed framework. The detailed results are listed in Tables 3, 4, 5, and 6. Note that an additional evaluation metric, mINP, is introduced in these experiments. The mINP is a recently proposed metric for Re-ID models that measures the cost of finding the hardest correct match: for each query, it is the fraction of correct samples among all results that must be examined up to and including the last correct one.
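Following that definition, the per-query inverse negative penalty (INP) and its mean (mINP) can be computed with a small sketch in pure Python; the function names are illustrative.

```python
def inp(ranked_matches):
    # ranked_matches: 0/1 flags over the ranked gallery for one query,
    # 1 marking a correct (same-identity) result.
    total_correct = sum(ranked_matches)
    # 1-based rank position of the hardest (last) correct match.
    hardest_rank = max(i + 1 for i, m in enumerate(ranked_matches) if m)
    return total_correct / hardest_rank

def minp(all_queries):
    # Mean INP over a list of queries.
    return sum(inp(q) for q in all_queries) / len(all_queries)

# All 3 correct results within the first 3 ranks -> INP = 1.0
print(inp([1, 1, 1, 0, 0]))   # 1.0
# 3 correct results, but the last one only at rank 5 -> INP = 3/5
print(inp([1, 1, 0, 0, 1]))   # 0.6
```

The second query illustrates why mINP complements mAP: both ranked lists have good early precision, but the one whose hardest match sits deep in the list is penalized.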

Evaluation of DpA module
To verify the validity of the DpA module, the baseline uses a ResNet50 backbone combined with generalized mean pooling, a batch normalization layer, a fully connected layer, the LSCE loss, and the HMT loss. The detailed ablation results on the VeRi-UAV dataset are shown in Table 3. First, adding CpA alone to the baseline improves mAP and Rank-1 by 1.62% and 0.36%, respectively, indicating that CpA enhances channel information and extracts discriminative local vehicle features. Adding SpA alone improves mAP and Rank-1 by 0.67% and 0.54%, showing a greater focus on important regions in the spatial dimension. Finally, combining CpA and SpA on top of the baseline yields further gains of 2.49%, 0.63%, 0.36%, and 2.27% on mAP, Rank-1, Rank-5, and mINP, respectively. We draw two conclusions: first, extracting features from the channel and spatial dimensions separately effectively extracts more discriminative fine-grained vehicle features; second, connecting the two attention modules in parallel improves Re-ID accuracy.

Comparison of different attention modules
This subsection compares performance with previously proposed attention modules: SE 40 , Non-local 41 , CBAM 57 , and CA 58 . SE 40 weights different channels through a learned weight matrix to emphasize more important feature information. Non-local 41 models long-distance dependencies between pixel locations, enhancing attention to non-local features. CBAM 57 sequentially infers attention maps along the channel and spatial dimensions and multiplies them with the input feature map for adaptive feature refinement. CA 58 decomposes channel attention into two one-dimensional feature-encoding processes that aggregate features along the vertical and horizontal directions, efficiently integrating spatial coordinate information into the generated attention maps. Table 4 shows the experimental comparison results. First, adding the SE and CA modules slightly improves the accuracy of the model, while adding the Non-local or CBAM module does not produce a corresponding effect. Second, compared with the newer CA module, the proposed DpA module achieves gains of 2.16% mAP, 0.36% Rank-1, 0.27% Rank-5, and 4.55% mINP on VeRi-UAV. This demonstrates that the proposed DpA module is more robust in UAV aerial scenarios with near-vertical shooting angles and long shooting distances.
To further validate the effectiveness of DpA, we use the Grad-CAM++ technique to visualize the attention maps of the different modules. As shown in Figure 4, from left to right are the attention maps of residual layer 3 (without any attention), SE, Non-local, CBAM, CA, and DpA. Clearly, all six methods focus on the vehicle itself. However, the SE, Non-local, CBAM, and CA modules pay less attention to local vehicle information, even ignoring some important parts, whereas the red regions of the DpA module are more pronounced, showing stronger attention to important cues at different granularities and improving the feature extraction capability of the network.

Comparison of DpA module placement in the network
We designed a set of experiments to demonstrate the effectiveness of the DpA module by adding it at different stages of the backbone network; each configuration in Table 5 indicates the residual block(s) after which the DpA module is added.
Table 5 shows the experimental results of adding the DpA module after different residual blocks of the backbone. First, the results show that the position of the added module affects network robustness. Specifically, adding DpA after the 2nd (No. 1), the 3rd (No. 2), or both the 3rd and 4th (No. 6) residual blocks improves accuracy over the baseline (No. 0), indicating that the module effectively extracts fine-grained vehicle features at these positions. In contrast, adding DpA after the 4th block (No. 3), or after the 2nd and 3rd (No. 4) or the 2nd and 4th (No. 5) blocks, decreases accuracy relative to the baseline (No. 0), suggesting that at these positions the network's attention becomes more dispersed and more irrelevant information is introduced. Second, the table shows that using a single DpA is mostly more robust for feature learning than using two jointly, and it saves some training time. In particular, No. 2, with DpA added after the third residual block, improves mAP by at least 2.17% over any configuration using two DpAs. In brief, weighing the pros and cons, we choose to add the DpA module only after Conv3_x of ResNet50.

Comparison of different metric losses
Metric losses have been shown to be effective in Re-ID tasks, which aim to maximize intra-class similarity while minimizing inter-class similarity. Current metric losses treat each instance as an anchor: the HMT loss and circle loss 59 utilize the hardest anchor-positive sample pairs; the multi-similarity (MS) loss 60 selects anchor-positive pairs based on the hardest negative pairs; and the supervised contrastive (SupCon) loss 61 samples all positives of each anchor, obtaining richer information at the cost of introducing noisy triplets. How well a loss function fits a scenario often depends on the characteristics of the training dataset. Table 6 shows the results of training with different metric losses on the VeRi-UAV dataset. The HMT loss improves mAP over the other losses, indicating that it strengthens the network's ability to discriminate difficult samples and yields more robust performance for vehicle Re-ID in the UAV scenario.

Discussion
Although current attention mechanisms achieve good results on some vision tasks, applying them directly is ineffective due to the special characteristics of UAV shooting angles. The main idea of this paper is therefore to design an attention module that combines multiple pooling operations and to embed it into the backbone network, improving fine-grained information extraction for vehicle Re-ID in UAV scenes and addressing the lack of local vehicle information caused by the near-vertical shooting angle and varying height of UAVs. In addition, Figure 5 shows the top-1 to top-20 matching rates of the different models discussed above. The curve of our proposed method lies above the others as a whole, further validating the method's effectiveness in actual vehicle retrieval and thus providing support for integrating UAV technology into intelligent transportation systems.

Visualization of model retrieval results
To illustrate the superiority of our model more vividly, Figure 6 visualizes the top-10 retrieval results of the baseline and our model on the VeRi-UAV dataset. Four randomly chosen query images are shown with their retrieval results, the first row for the baseline and the second row for our method. Images with green borders are correctly retrieved samples; images with red borders are incorrect ones. On the one hand, the baseline focuses on general appearance features, so its top-ranked negative samples all have similar body poses, whereas our method focuses on vehicle features carrying discriminative information, such as the parts marked with red circles in the query images of Figure 6 (vehicle type symbol, front window, rear window, and side window). On the other hand, as in the second query image, our method retrieves the top 5 target vehicle samples within the first 5 results, while the baseline needs 9 results to retrieve 5 correct ones.

Conclusion and future work

Figure 1 .
Figure 1. Comparison of two types of vehicle images: (a) road surveillance-based images; (b) UAV-based images.

Figure 2 .
Figure 2. The overall framework of the network for vehicle Re-ID.

Figure 4 .
Figure 4. Heat map comparison of different attention modules. The red area indicates the part with the highest attention value, and the blue area the part with the lowest attention value.

Figure 5 .
Figure 5. Comparisons of CMC curves for the case of: (a) CpA, SpA and DpA modules, (b) five different attention mechanisms, and (c) DpA placed in different positions of the backbone network.

Figure 6 .
Figure 6. Visualization of the ranking lists of our model and the baseline on VeRi-UAV. For each query, the top and bottom rows show the ranking results of the baseline and of the network with the DpA module, respectively. The green (red) boxes denote correct (wrong) results.
In this paper, we propose a dual-pooling attention (DpA) module for vehicle Re-ID to address the difficulty of extracting local vehicle features in UAV scenarios caused by high shooting heights and vertical shooting angles. The DpA module consists of a channel-pooling attention module and a spatial-pooling attention module, which extract fine-grained important vehicle features from the channel and spatial dimensions. We then fuse the features of the two branches to improve the discriminability of the feature representation. Extensive experiments on the VeRi-UAV and VRU datasets show that the proposed framework improves the effective extraction of features from UAV-captured vehicle images and achieves competitive performance on Re-ID tasks. Because research on vehicle Re-ID in UAV aerial photography scenes is still scarce, there is great potential for future work, such as expanding UAV-scene datasets (e.g., placing drones at different angles to collect vehicle images with multiple views), incorporating spatiotemporal vehicle information, and combining vehicle images captured by fixed surveillance cameras and UAVs for the vehicle Re-ID task.

Table 1 .
Comparison of various proposed methods on the VeRi-UAV dataset (in %). Bold numbers indicate the best-ranked results.

Table 2 .
Comparison of various proposed methods on the VRU dataset (in %). Bold numbers indicate the best-ranked results.

Table 3 .
Ablation experiments of the DpA module on VeRi-UAV (in %). Bold and underlined numbers indicate the best and second-best ranked results, respectively.

Table 4 .
Ablation experiments of different attention modules on VeRi-UAV (in %). Bold and underlined numbers indicate the best and second-best ranked results, respectively.

Table 5 .
Ablation experiment of adding the DpA module at different residual blocks of the backbone network on VeRi-UAV (in %). Bold and underlined numbers indicate the best and second-best ranked results, respectively.

Table 6 .
Ablation experiments of different metric losses on VeRi-UAV (in %). Bold and underlined numbers indicate the best and second-best ranked results, respectively.