Anchor-free Siamese network tracker with transformer for RGB-T tracking

In recent years, many RGB-thermal (RGB-T) tracking methods have been proposed to meet the needs of single-object tracking under varied conditions. However, most of these trackers rely on anchor-based algorithms and feature cross-correlation operations, which limit further gains in tracking success rate. We propose SiamAFTS, an anchor-free, fully convolutional Siamese tracking network with a Transformer module, designed for RGB-T target tracking. The model addresses the low success rate of current mainstream algorithms. We also incorporate channel and spatial attention modules into the network to reduce background interference in the predicted bounding boxes. Unlike current anchor-based trackers such as MANet, DAPNet, SGT, and ADNet, the proposed framework eliminates anchors entirely, avoiding the challenges of anchor hyperparameter tuning and reducing human intervention. Repeated experiments on three datasets demonstrate the improved tracking success rate achieved by our network.


Proposed method
Figure 1 depicts the overall framework, and we outline each part of our technique in this section.
Backbone. As shown in Fig. 1, the network takes four images as input: a visible-light branch and a thermal-infrared branch. The template branch contains the visible input Z1 and the thermal-infrared input Z2, and the search-area branch contains the visible input X1 and the thermal-infrared input X2. Because the model is Siamese, the visible and thermal-infrared branches use the same ResNet-50 backbone to extract feature maps. After backbone feature extraction, the visible branch yields the feature maps ϕ(Z1) and ϕ(X1), and the thermal-infrared branch yields ϕ(Z2) and ϕ(X2). One copy of ϕ(Z1), ϕ(X1), ϕ(Z2), and ϕ(X2) is fed directly into the Transformer module for the next step of computation, while another copy is first enhanced by the SA and CA modules before being passed to the Transformer module for further computation.
During tracking, we aim to include more image feature information in the response map. Inspired by reference 7, we extract feature maps from different layers of the backbone during feature extraction. Deep and shallow features play different roles in target tracking: deep features discriminate semantic properties well, which benefits the classification task, whereas shallow features are rich in visual attributes such as edges and colors, which benefits target localization. Inspired by references 7 and 15, we modify the last module of ResNet-50 to obtain feature maps from layers 6, 7, and 8, yielding F6(X1), F7(X1), and F8(X1) for the visible branch and F6(X2), F7(X2), and F8(X2) for the thermal-infrared branch, where the indices 6, 7, and 8 denote the layers from which the features are extracted. Each of F6(X2), F7(X2), and F8(X2) has 256 channels, as sketched below.
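A minimal PyTorch sketch of this weight-shared, multi-layer extraction, assuming standard torchvision ResNet-50 stages; the paper's modifications to the last block (e.g., stride adjustments) are omitted, and the 1 × 1 projections to 256 channels are our assumption about how the channel counts are unified:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class SiameseBackbone(nn.Module):
    """Shared ResNet-50 backbone returning features from three late stages
    (standing in for the paper's layers 6, 7, and 8), each projected to
    256 channels with a 1x1 convolution."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # ImageNet pre-training
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # 1x1 projections so every tapped stage yields 256 channels
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, 256, kernel_size=1) for c in (512, 1024, 2048)]
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        f6 = self.layer2(x)   # shallower: edges, colors -> localization
        f7 = self.layer3(f6)  # deeper: semantics -> classification
        f8 = self.layer4(f7)
        return [p(f) for p, f in zip(self.proj, (f6, f7, f8))]

# The Siamese property: the same weights process all four inputs.
backbone = SiameseBackbone().eval()
z1 = torch.randn(1, 3, 127, 127)   # visible template
x1 = torch.randn(1, 3, 255, 255)   # visible search region
feats_z1, feats_x1 = backbone(z1), backbone(x1)
```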
Channel attention module. The channel attention mechanism can enhance the predictive capability of the network. To improve information transfer between the two modalities, we designed a Channel Attention feature-enhancement module (CA module), shown in Fig. 2.
In the CA module (Fig. 2), the feature maps ϕ(Z1) and ϕ(Z2) extracted by the backbone serve as input, and we obtain the joint feature U_ca. Denoting the outputs as x^ca_rgb and x^ca_tir, the overall CA module is built from δ (global pooling), ω (a fully connected layer), ε (the Sigmoid function), ⊗ (the channel-wise product), and Split (the operation of separating features along the channel dimension).
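A minimal PyTorch sketch of the CA module, assuming an SE-style gate computed on the joint feature and split back per modality; the reduction ratio of 16 is our choice, not a value from the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Hedged sketch of the CA module: a channel gate computed on the
    joint RGB/TIR feature U_ca, split into per-modality weights."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # delta: global pooling
        self.fc = nn.Sequential(                       # omega: fully connected
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
        )
        self.sigmoid = nn.Sigmoid()                    # epsilon

    def forward(self, z_rgb, z_tir):
        u_ca = torch.cat([z_rgb, z_tir], dim=1)        # joint feature U_ca
        b, c, _, _ = u_ca.shape
        w = self.sigmoid(self.fc(self.pool(u_ca).view(b, c)))
        w_rgb, w_tir = torch.split(w, c // 2, dim=1)   # Split along channels
        x_ca_rgb = z_rgb * w_rgb.view(b, c // 2, 1, 1) # channel-wise product
        x_ca_tir = z_tir * w_tir.view(b, c // 2, 1, 1)
        return x_ca_rgb, x_ca_tir
```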
To suppress the effect of background noise on the classification task, we designed a spatial attention module (SA module), shown in Fig. 3. This module exploits the spatial inter-relationships of features. We take ϕ(X1) and ϕ(X2) extracted by the backbone as input feature maps and, through the SA module, obtain the feature map U_sa, built from ρ (average pooling), φ (max pooling), Cat (the operation of concatenating features along the channel dimension), ϕ (a two-dimensional convolution with a collection of kernel weights H), and ε (the Sigmoid function).
The outputs are then denoted x^sa_rgb and x^sa_tir, where ⊙ denotes cascading the CA module with the SA module to form the final SA-enhanced features, as sketched below.
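A minimal PyTorch sketch of the SA module, assuming a CBAM-style spatial gate; the 7 × 7 kernel size is our choice:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Hedged sketch of the SA module: average pooling (rho) and max
    pooling (phi) along the channel axis, concatenated (Cat) and passed
    through a 2-D convolution and Sigmoid to form a spatial mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)       # rho: channel-wise mean
        mx, _ = torch.max(x, dim=1, keepdim=True)      # phi: channel-wise max
        mask = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                                # suppress background

# Cascading CA with SA (the paper's "⊙"), sketched for one branch:
# x_sa_rgb = SpatialAttention()(x_ca_rgb)
```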
Transformer network. Inspired by reference 16, we designed a Transformer network, shown in Fig. 4.
From Fig. 4, the template and search features extracted by the backbone first pass through the CA and SA modules, yielding x^sa_rgb, x^sa_tir, zf^ca_rgb, and zf^ca_tir. These feature vectors are fed into the Transformer module: first, x^sa_rgb and zf^ca_rgb pass through a Transformer cross-attention module (BCTM); then x^sa_tir and zf^ca_tir pass through another BCTM. The BCTM fuses information from the different branches, and to make the fusion more accurate, the fusion process is repeated four times. Finally, an extra Transformer module (ACTM) fuses the feature vectors of the template and search branches. BCTM and ACTM share the same network structure, so we explain BCTM in detail as an example.
Figure 5 shows the BCTM module. The BCTM uses positional encoding to distinguish the position information of feature sequences, a residual multi-head cross-attention to integrate feature vectors from different inputs, and a residual feed-forward network (FFN) to obtain the final output. Its inputs are $W_Q \in \mathbb{R}^{d \times N_Q}$ and $W_{KV} \in \mathbb{R}^{d \times N_{KV}}$, taken from the two different branches, while $P_Q \in \mathbb{R}^{d \times N_Q}$ and $P_{KV} \in \mathbb{R}^{d \times N_{KV}}$ are the spatial positional encodings of $W_Q$ and $W_{KV}$. With $W'_{CF}$ denoting the output of the residual multi-head cross-attention and $W_{CF}$ the final output, the computation is

$$W'_{CF} = W_Q + \mathrm{MultiHead}(W_Q + P_Q,\ W_{KV} + P_{KV},\ W_{KV}), \qquad W_{CF} = W'_{CF} + \mathrm{FFN}(W'_{CF}).$$
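A minimal PyTorch sketch of the BCTM, assuming d = 256 and eight heads; the layer normalizations and the FFN width are our assumptions:

```python
import torch
import torch.nn as nn

class BCTM(nn.Module):
    """Hedged sketch of the cross-attention module (BCTM): positional
    encodings are added to queries and keys, a residual multi-head
    cross-attention mixes the two branches, and a residual FFN refines
    the result."""
    def __init__(self, d_model=256, nhead=8, dim_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)   # norms are our assumption
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(inplace=True),
            nn.Linear(dim_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, w_q, w_kv, p_q, p_kv):
        # Sequences of shape (N, B, d); positions are added to Q and K only
        attn_out, _ = self.attn(w_q + p_q, w_kv + p_kv, w_kv)
        w_cf_prime = self.norm1(w_q + attn_out)               # W'_CF (residual)
        w_cf = self.norm2(w_cf_prime + self.ffn(w_cf_prime))  # W_CF
        return w_cf
```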
After the ACTM, we obtain enhanced image features R1 of size 25 × 25 × 256. Next, referring to the lower part of Fig. 4, the image features that have not passed through the CA and SA modules are first fused separately: fusing ϕ(X1) and ϕ(X2) yields the image feature ϕ(X) of size 31 × 31 × m.
Fusing ϕ(Z1) and ϕ(Z2) yields the image feature ϕ(Z) of size 7 × 7 × m. First, ϕ(X) passes through a Transformer self-attention module (STM); then ϕ(Z) passes through another STM. Finally, we transmit the obtained Q, K, and V to the cross-attention module (BCTM), from which we obtain the image features R2 of the original image. Concatenating R1 and R2 yields $R = \mathrm{Cat}(R_1, R_2)$. Figure 6 shows the self-attention module (STM): it first applies positional encoding to differentiate the position information within feature sequences, then uses multi-head self-attention to consolidate feature vectors from different positions, and finally employs a residual form to obtain the output. The specific calculation of the STM is described below.
The STM computation is

$$W_{SF} = W + \mathrm{MultiHead}(W + P_K,\ W + P_K,\ W),$$

where $P_K \in \mathbb{R}^{d \times N_x}$ denotes the spatial positional encoding obtained by applying a sine function, $W \in \mathbb{R}^{d \times N_x}$ is the input to the STM, and $W_{SF} \in \mathbb{R}^{d \times N_x}$ is the output after the STM's operations.
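A minimal PyTorch sketch of the STM, assuming the standard sinusoidal encoding for the sine-based P_K; the layer normalization is also our assumption:

```python
import math
import torch
import torch.nn as nn

def sine_positional_encoding(n_positions, d_model):
    """Standard sinusoidal encoding (our assumption for the sine-based P_K)."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe.unsqueeze(1)   # (N, 1, d), broadcasts over the batch dimension

class STM(nn.Module):
    """Hedged sketch of the self-attention module (STM): positions are
    added to queries and keys, and a residual connection forms W_SF."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)   # normalization is our assumption

    def forward(self, w):
        # w: (N, B, d) flattened feature map
        p = sine_positional_encoding(w.size(0), w.size(2)).to(w.device)
        out, _ = self.attn(w + p, w + p, w)
        return self.norm(w + out)           # W_SF = W + MultiHead(...)
```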

Anchor-free based bounding box prediction
a. Position prediction head. The location prediction head in Fig. 1 includes classification and regression modules. After the Transformer attention module, we obtain image features of size (*, 256, 25, 25). The location prediction head then produces features of size (*, 2, 25, 25) for the classification branch and (*, 4, 25, 25) for the regression branch, as sketched below.
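A minimal PyTorch sketch of the position prediction head; only the input and output shapes come from the text, and the two-layer convolutional towers are our assumption:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Hedged sketch of the position prediction head: fused (B, 256, 25, 25)
    transformer features are mapped to a 2-channel classification map and a
    4-channel regression map."""
    def __init__(self, in_channels=256):
        super().__init__()
        def tower(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
            )
        self.cls = tower(2)   # foreground/background score per location
        self.reg = tower(4)   # distances (d_l, d_t, d_r, d_b) per location

    def forward(self, r):
        return self.cls(r), self.reg(r)

head = PredictionHead()
r = torch.randn(1, 256, 25, 25)
cls_map, reg_map = head(r)   # shapes (1, 2, 25, 25) and (1, 4, 25, 25)
```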
b. Training loss. First, we classify the input samples into positive and negative samples. Since negative samples have a low probability of containing the target, we perform regression only on positive samples. Whether a sample is positive or negative is determined by drawing two ellipses, S1 and S2, around the target.
If a sample point (k, j) lies outside the ellipse S1, it is labeled a negative sample; conversely, if it lies inside S2, it is labeled a positive sample.
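For concreteness, a common anchor-free labeling (a SiamBAN-style formulation, which we assume here) defines the two ellipses from the ground-truth box center $(g_{xc}, g_{yc})$, width $g_w$, and height $g_h$; the symbol names are ours:

```latex
% Assumed SiamBAN-style label ellipses (a reconstruction, not verbatim):
S_1:\ \frac{(k - g_{xc})^2}{(g_w/2)^2} + \frac{(j - g_{yc})^2}{(g_h/2)^2} = 1,
\qquad
S_2:\ \frac{(k - g_{xc})^2}{(g_w/4)^2} + \frac{(j - g_{yc})^2}{(g_h/4)^2} = 1
```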
For the coordinates of positive samples, we perform regression operations. In anchor-based regression, predicted boxes are typically compared with ground-truth boxes; in the anchor-free algorithm we employ, regression is computed directly from per-location distances.
Here d_l, d_t, d_r, and d_b are the distances from a location to the four edges of the surrounding box (see the reconstruction below). For the loss function we use IoU (Intersection over Union): by adjusting the coordinates of the predicted bounding box's top-left and bottom-right corners, we obtain a predicted box for each point on the feature map corresponding to the search image, and IoU is the ratio of the intersection area to the union area of the ground-truth and predicted boxes.
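For concreteness, under the standard anchor-free convention (our assumption), the regression targets for a location $(k, j)$ inside a ground-truth box with corners $(x_0, y_0)$ and $(x_1, y_1)$ can be written together with the IoU ratio as follows; the $1 - \mathrm{IoU}$ loss form is also our assumption:

```latex
% Hedged reconstruction; (x_0, y_0) and (x_1, y_1) are assumed names for
% the ground-truth box corners.
d_l = k - x_0, \qquad d_t = j - y_0, \qquad d_r = x_1 - k, \qquad d_b = j \text{-to-bottom: } y_1 - j
% IoU and an IoU-based regression loss (the 1 - IoU form is our assumption):
\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}, \qquad L_{reg} = 1 - \mathrm{IoU}
```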
If the regression values are all greater than 0 and the point (k, j) marked as a positive sample lies within the ellipse S2, then the IoU value falls between 0 and 1.
Tracking. Each RGB-T sequence consists of visible-light and thermal-infrared images, which undergo a cropping process: the search image is resized to 255 × 255 pixels and the template image to 127 × 127 pixels. From these images, two sets of 60 samples (40 negative samples and 20 positive samples) are selected, one for the visible-light images and one for the thermal-infrared images.
The first step of the prediction process is to initialize the tracker, which handles the first frame, whose image information we save. The search image (second frame) is processed through the backbone to extract feature maps from the 6th, 7th, and 8th layers, which are then resized to 7 × 7. The feature maps extracted from the visible-light and thermal-infrared images are separately enhanced by the SA and CA attention mechanisms in preparation for the next step.
We separately input the extracted unenhanced and enhanced visible-light and thermal-infrared feature maps into the Transformer network, then perform classification and regression on its outputs. Regression decodes each prediction box as

$$P_{x1} = x - d_l, \quad P_{y1} = y - d_t, \quad P_{x2} = x + d_r, \quad P_{y2} = y + d_b,$$

where the top-left and bottom-right corners of the prediction box are (P_x1, P_y1) and (P_x2, P_y2), and (d_l, d_t, d_r, d_b) are the predicted regression values at location (x, y). The optimal tracking box is chosen from the generated prediction boxes, and the tracking-box coordinates are updated through linear interpolation with the previous frame's state. After generating the prediction boxes, a cosine window is applied to mitigate large displacements, and penalties discourage substantial changes in size and scale; through this series of operations we ultimately obtain the best predicted bounding box, as sketched below.
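A minimal Python sketch of this post-processing, in the spirit of common Siamese trackers; the function name and the constants are illustrative assumptions, not values from the paper:

```python
import numpy as np

def select_best_box(score, boxes, prev_size, hanning,
                    k_penalty=0.04, w_infl=0.44):
    """Hedged sketch of the post-processing described above: a cosine
    (Hanning) window suppresses large displacements, and a penalty
    discourages abrupt size/aspect-ratio changes between frames.

    score:   (N,) classification scores
    boxes:   (N, 4) decoded boxes in (x1, y1, x2, y2) form
    prev_size: (w, h) of the previous frame's tracking box
    hanning: (N,) cosine window flattened to match the score map
    """
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    prev_w, prev_h = prev_size

    def change(r):
        return np.maximum(r, 1.0 / r)

    s_c = change(np.sqrt(w * h) / np.sqrt(prev_w * prev_h))  # scale change
    r_c = change((prev_w / prev_h) / (w / h))                # ratio change
    penalty = np.exp(-(s_c * r_c - 1.0) * k_penalty)         # size/scale penalty
    pscore = score * penalty
    pscore = pscore * (1 - w_infl) + hanning * w_infl        # cosine window
    return boxes[np.argmax(pscore)]
```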

Experiments
Data set and device description. In this study, we evaluate our model on three datasets: GTOT 1, RGBT210 16, and RGBT234.
The algorithm is implemented in PyTorch and trained on two RTX 3080 Ti GPUs. For fair comparison, the input size of the search region is 255 × 255 pixels and that of the template is 127 × 127 pixels. We build a training subnetwork with ResNet-50 as its core; the backbone was pre-trained on ImageNet, and the pre-trained weights served as the initialization for further training of our model.
To broaden the scope of comparison, we also include the transformer-based method APFNet 48 and non-deep RGB-T trackers, such as CMCF 40, NRCMR 32, CMR 41, SGT 42, the method proposed by Li et al. 43, CSR 32,44, MEEM 45 + RGBT, and KCF 46 + RGBT. This selection encompasses a wide range of RGB-T methods across various categories.

Results on GTOT.
Our method is evaluated on the GTOT dataset, which contains 50 video sequences covering different environmental conditions, as shown in Table 1. Figure 7 shows the comparison between our proposed model and other anchor-based models on the GTOT dataset.
Table 3 displays the tracking results on the GTOT dataset. In terms of success rate, our RGB-T tracker surpasses almost all anchor-based RGB-T trackers. However, our precision is comparatively low. The performance gap between our tracker and the state of the art can be attributed to their use of large-scale annotated RGB-T image pairs for training and their more complex models. In the future, we will modify our model to improve precision.
Results on RGBT210. The tracking results of our method on the RGBT210 dataset are shown in Fig. 8. RGBT210 contains 210 annotated visible and thermal-infrared video clips and covers many challenging cases, as illustrated in Table 2 5.
Validation on the RGBT210 dataset (Fig. 8) shows that our tracker outperforms all compared trackers.
Results on RGBT234. The results on the RGBT234 dataset, presented in Table 3, demonstrate that our proposed RGB-T tracker outperforms both supervised and non-learning-based RGB-T trackers in terms of MSR. However, its performance on RGBT234 is weaker than on the GTOT dataset. This discrepancy can be attributed to the greater difficulty of RGBT234, which comprises 234 video sequences and 12 challenging attributes, compared with the 7 attributes of GTOT. On the RGBT234 dataset, our model surpasses almost all compared anchor-based algorithms, although its precision is lower in certain respects. To address this gap, our future work will explore stronger backbone trackers and larger training datasets. Validation on the RGBT234 dataset (Fig. 9) shows that our tracker outperforms all compared trackers.
Attribute-based results. The performance on the challenging attributes of the RGBT234 dataset is shown in Figs. 10 and 11. Our RGB-T tracker achieves highly competitive performance in most respects. In the MSR graph of the challenging attributes, our model performs well on all attributes except TC, PO, NO, and LI, where it is less effective than other models. In the MPR graph, our model performs poorly on SV, PO, and NO but demonstrates excellent performance on the other nine attributes. Overall, our model struggles with PO and NO, indicating directions for improvement in future work.
Qualitative results. As shown in Fig. 12, our RGB-T tracker is compared qualitatively with other anchor-based RGB-T trackers on the RGBT210 dataset. The images in Fig. 12 are sourced from the RGBT210 dataset 17, and we thank Li et al. 17 for making the dataset publicly available. We selected several RGB-T trackers, including SOWP 18, SOWP + RGBT, KCF 19 + RGBT, CSR 20, SGT, and MEEM 21 + RGBT. The figure shows that our RGB-T tracker performs better on the three sequences Baketballwaliking, Balancebike, and car41.
Ablation studies and analysis. Figure 13 presents the results of our ablation experiment, in which we used the RGBT234 dataset as the training set and the GTOT dataset as the test set. The results without any added module are significantly worse than those with the modules added. In the PR score graph, "ours-GTOT-SA" and "ours-GTOT-CA" are significantly better than "ours-GTOT-no (CA-SA-TS)", indicating that adding the SA and CA modules helps improve the tracker's precision. In the SR score graph, "ours-GTOT-SA" is better than "ours-GTOT-no (CA-SA-TS)", but "ours-GTOT-CA" is worse than "ours-GTOT-no (CA-SA-TS)", indicating that the CA module does not improve the success rate, although it still improves precision. The results of "ours-AFTS" are the highest in this experiment, indicating that the TS module has a significant impact on improving the success rate.

Conclusion
This paper introduces a novel approach for RGBT tracking, specifically an adaptive tracker based on the Transformer model with dual-Siamese architecture and anchor-free design.
The proposed method incorporates a Transformer attention mechanism to replace the correlation operation in the Siamese network, leading to an improved tracking success rate. By eliminating candidate boxes and reducing human-induced interference, our approach addresses the limitations of anchor-based methods while removing the need for many hyperparameters. Experimental results demonstrate the reliability of the proposed algorithm, which successfully exploits the complementary information from the visible-light and thermal-infrared modalities. As future work, we are exploring the integration of RGB-D tracking design, aiming to expand the application scope and enhance performance in challenging scenarios.

Figure 4. Illustration of the feature fusion network based on the transformer.

Figure 8. The results of our proposed model compared to other anchor-based models on the RGBT210 dataset. (a) Precision Rate; (b) Success Rate.

Table 2. A list of annotated attributes for the RGBT210 data set.

Table 1. A list of annotated attributes for the GTOT data set.

Table 3. Comparison with existing anchor-based RGB-T trackers on the GTOT and RGBT234 datasets. The results marked with '∞' are computed by us from raw tracking results; the results marked with '*' are copied from references 32 and 48; other results are taken from the corresponding papers. '--' means not mentioned in the corresponding paper. Values worse than our method are marked in pink and yellow.