SiamFDA: feature dynamic activation siamese network for visual tracking

In this paper, we present a novel anchor-free visual tracking framework, referred to as feature dynamic activation siamese network (SiamFDA), which addresses the issue of ignoring global spatial information in current Siamese network-based tracking algorithms. Our approach captures long-range dependencies between distant pixels in space, which enables robustness to unreliable regions. Additionally, we introduce a hierarchical feature selector that adaptively activates features at different layers, and an adaptive sample label assignment method to further improve tracking performance. Our extensive evaluations on six benchmark datasets, including VOT-2018, VOT-2019, GOT10k, LaSOT, OTB-2015, and OTB-2013, demonstrate that SiamFDA outperforms several state-of-the-art trackers in various challenging scenarios, with a real-time frame rate of 40 frames per second.

Visual tracking is a fundamental task in computer vision, with various practical applications in the real world such as video surveillance, human-machine interaction and biomedical image analysis. Generally, given the initial state of a target, we are expected to predict its motion trajectory in subsequent frames. Though much effort has been devoted to the problem recently, visual tracking still needs to cope with scale variation, appearance deformation, background clutter and so on.
Recently, tracking algorithms based on the Siamese network 1,2 have attracted great attention because of their balanced accuracy and speed. The pioneering work SiamFC 1 simply matches the initial patch of the target in the first frame with candidates in subsequent frames and returns the most similar patch via a learned matching function. SiamRPN 2 introduces the region proposal network to discard traditional multi-scale tests, which inevitably introduces many anchor-related hyper-parameters that require careful tuning and impose heavy computational burdens. To solve these problems, SiamBAN 3 introduces an anchor-free tracker, which directly regresses the positions of the target in a video frame. Although the above methods have obtained excellent performance on visual object tracking, they merely focus on the local characteristics of the target and inevitably ignore the intrinsic structural information within the global region. Such long-range features are particularly suitable for specific constraints of set prediction 4 such as background clutter and other challenges. Therefore, as shown in Fig. 1, SiamBAN 3 cannot identify the target ant among similar objects in the first sequence, and even cannot discriminate between different objects such as the knee and the football. Recently, the non-local network (NLNet) 5 was proposed to model long-range dependencies via the self-attention mechanism 6. Intuitively, an NL block computes the response at a position as a weighted sum of the features at all positions in the input feature map, to attain an attention map. The input features are then aggregated with the importance weights defined by the above attention map, thus allowing distant pixels to contribute to the filtered response at a local location. However, for a given image, different query positions obtain almost the same global context information through the non-local structure 7. Moreover, the NL block has to compute the pixel-level pairwise relations among all positions, which results in a heavy computational load.
In this work, we propose a simple yet effective anchor-free visual tracking framework named feature dynamic activation siamese network (SiamFDA), which consists of a Siamese network backbone for feature extraction and a feature dynamic activation (FDA) subnetwork for accurate target location estimation as well as bounding box prediction. Specifically, we design a novel FDA block for efficiently modeling long-range dependencies of the target, whose modeling framework can be abstracted into three steps: (1) a context modeling module obtains position-independent context information as attention weights to make the tracking model focus on crucial regions.
(2) A transform module further strengthens the representation power of the meaningful contextual information and captures channel-wise interdependencies at the same time. (3) A fusion module merges the original input feature with the global context features to improve discriminability. Besides, to fuse fine-grained information and abstract semantic information of features adaptively, we introduce a squeeze-and-excitation (SE) block 8, which makes up for the lack of channel attention. Furthermore, based on the observation that when the target aspect ratio is close to 1, the number of positive samples captured by an ellipse is smaller than that captured by a circle, we modify the original label assignment method to add more reliable samples, thus improving the tracking accuracy to some extent. Figure 1 shows that, compared with SiamBAN 3, our SiamFDA pays more attention to the tracking target without being misled by similar objects and the background. For example, in the third sequence, our SiamFDA focuses more on the player's jersey number instead of other places, which is more consistent with human perception: when we look at fast-moving players on the court, the jersey numbers help us quickly determine the identity of a player. In Fig. 2, we provide a qualitative comparison between our SiamFDA and SiamBAN on the VOT-2018 dataset. It is evident from the visualization results that our tracker outperforms the baseline (SiamBAN) in terms of precise tracking.
The main contributions of our work can be summarized as:
• We propose a simple yet effective anchor-free Siamese network, SiamFDA, to accurately estimate scale variation and aspect-ratio changes, thus boosting the generalization ability of the tracker.
• We design a novel FDA block which encodes rich global context information into the target representation along the spatial dimension. This block activates reliable patches and enables our model to be robust to unreliable regions during tracking. Furthermore, we adopt the SE block as a hierarchical feature selector in the classification and regression branches, which further maximizes discriminative abilities by exploiting inter-channel relationships.
• We introduce an adaptive sample label assignment method to add more reliable positive samples, thus improving the tracking performance.
• The effectiveness of SiamFDA is verified on six datasets, and the results demonstrate that SiamFDA is very promising for various challenging scenarios compared with several state-of-the-art (SOTA) trackers, with real-time performance of 40 fps.

Visual tracking
Recently, the Siamese network has emerged as a pioneering architecture in the visual tracking community due to its end-to-end training capability and high efficiency. SiamFC 1 presents a real-time tracking algorithm that utilizes a novel fully-convolutional Siamese network, trained end-to-end. SiamRPN 2 introduces a region proposal network for precise bounding box regression. Building upon this, SiamRPN++ 10 adopts a deeper backbone architecture for improved performance. Although these anchor-based methods effectively address scale variation and aspect-ratio changes, they introduce numerous additional hyper-parameters that necessitate careful tuning and impose significant computational burdens. Furthermore, the anchor setting is not in line with the spirit of generic visual tracking, as it requires pre-defined hyper-parameters to describe the shape. Therefore, SiamFC++ 11 introduces a set of guidelines that include the decomposition of classification and state estimation, non-ambiguous scoring, being prior knowledge-free, and estimation quality assessment. SiamBAN 3 proposes a simple yet effective visual tracking framework by exploiting the expressive power of the fully convolutional network. With the emergence of Transformer architectures, their significant advantages in handling complex sequential data have increasingly captured the attention of researchers. Despite this, Transformer-based trackers [12][13][14][15][16][17][18] face significant challenges in practical applications, particularly due to their higher computational burden, which limits their feasibility in real-time tracking scenarios. In contrast, while CNN-based trackers may lag behind Transformer-based models in certain performance metrics, their lower computational complexity makes them more advantageous in scenarios requiring quick response times. Similar to SiamBAN 3, we design an anchor-free Siamese network, which avoids hyper-parameters associated with candidate boxes and makes the tracker more flexible and general.

Long-range dependency modeling
Recently, many new approaches focusing on long-range dependency modeling have emerged in object classification and detection. To model the pairwise relation, NLNet 5 computes the response at a position as a weighted sum of the features at all positions. GCNet 7 has found that the global contexts modeled by NLNet 5 are almost the same for different positions within an image. Therefore, GCNet 7 creates a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet 5 but with significantly less computation. To model the query-independent global context, SENet 8 focuses on the channel relationship and adaptively recalibrates channel-wise feature responses. CBAM 19 exploits both spatial and channel-wise attention based on an efficient architecture. Particularly, recent tracking approaches have achieved great success by integrating attention mechanisms. SiamAttn 20 learns strong context information and aggregates rich contextual inter-dependencies between the two branches of a Siamese network, via deformable self-attention and cross-attention jointly.
In our paper, we introduce a novel FDA block designed to effectively model long-range dependencies, addressing the NL block's inherent limitations.This approach enables our model to adaptively focus on reliable regions across the spatial dimension.The SE block is further exploited to determine the effectiveness of each output channel.

SiamFDA framework
As displayed in Fig. 3, the proposed SiamFDA consists of a Siamese network backbone for feature extraction and a FDA subnetwork for accurate target location estimation as well as bounding box prediction. Specifically, the Siamese network backbone encodes the appearance information of the template image and the search image. The FDA subnetwork includes a classification branch and a regression branch, which consider the spatial layout of the target and model the query-independent global context via three novel FDA blocks. Besides, a SE block is introduced to further amplify the discriminative ability along the channel dimension.
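To make the role of the SE block concrete, the following minimal numpy sketch shows the squeeze-excite-recalibrate pipeline of a SE block 8. The function name and the two bottleneck weight matrices `w1`/`w2` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation over a feature map x of shape (C, H, W).

    w1: (C//r, C) and w2: (C, C//r) are the two FC layers of the bottleneck,
    with r the reduction ratio.
    """
    s = x.mean(axis=(1, 2))                    # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ s + b1, 0.0)           # excitation: FC -> ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # FC -> sigmoid gate in (0, 1)
    return x * g[:, None, None]                # recalibrate each channel by its gate
```

With zero weights the gate is sigmoid(0) = 0.5 for every channel, i.e. a uniform down-weighting; trained weights instead learn which channels to amplify or suppress.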

Revisiting Siamese network backbone
Siamese network-based trackers view visual tracking as a cross-correlation problem and learn a tracking similarity map from a fully-convolutional network, which compares a template image Z against a search image X of the same size and returns a high score if the two images depict the same object and a low score otherwise. We use the initial appearance feature of the target as the template and a larger crop centered on the last estimated position of the target as the input of the search branch. The two branches share parameters in the Siamese backbone so that the two patches are implicitly encoded by the same transformation, which is suitable for the subsequent network. We use the modified ResNet-50 3 pretrained on ImageNet 21 as the backbone. The down-sampling operations in the last two convolution blocks are removed to preserve detailed spatial information and thus enable dense prediction. Besides, atrous convolutions with different atrous rates are adopted to enlarge the receptive field.

Feature dynamic activation subnetwork
The FDA subnetwork consists of a classification branch and a regression branch, which capture long-range dependencies of the target via three novel FDA blocks. As illustrated in Fig. 4, our FDA block contains three modules: a context modeling module, a transform module and a fusion module. Specifically, as different instantiations achieve comparable performance 5, we adopt the embedded Gaussian version as the basic NL block to compute similarity in an embedding space. Suppose the input features are X with shape C × H × W, where H represents the height, W denotes the width and C denotes the number of channels; the number of spatial positions is N_p = H × W.

Context modeling module
Based on the observation that the attention maps for different positions are almost the same in NLNet 5, we replace the pixel-level pairwise operation with a 1 × 1 convolution W_c, and obtain a position-independent attention map via a softmax function. These attention weights are then aggregated with the input features by matrix multiplication, to recalibrate the importance of different spatial positions. Thus, the context modeling procedure can be formulated as

z = Σ_{j=1}^{N_p} ( exp(W_c X_j) / Σ_{m=1}^{N_p} exp(W_c X_m) ) X_j,

where j enumerates all possible positions, m indexes the softmax normalization, and the resulting context z is shared by every query position i.
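The context modeling step can be sketched in a few lines of numpy. This is a reading of the description above under the assumption that `w_c` collapses the channel dimension to one attention logit per spatial position; it is illustrative, not the paper's code:

```python
import numpy as np

def global_context(x, w_c):
    """Query-independent context modeling (first step of the FDA block).

    x:   input features of shape (C, H, W)
    w_c: weights of the 1x1 attention convolution, shape (C,)
    Returns the global context vector z of shape (C,), shared by all queries.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                      # N_p = H * W spatial positions
    logits = w_c @ flat                             # 1x1 conv -> one logit per position
    logits = logits - logits.max()                  # numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum()   # softmax attention over positions
    return flat @ alpha                             # weighted sum of position features
```

Because `alpha` does not depend on a query index, the O(N_p^2) pairwise computation of the NL block collapses to a single O(N_p) weighted sum.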

Transform module
To further strengthen the representation power of the global context features, we aggregate the global context with each position of the input feature via element-wise multiplication, and adopt a 1 × 1 convolution W_t to capture channel-wise dependencies, i.e., W_t(X_i ⊙ z), where z denotes the global context produced by the context modeling module.

Fusion module
We broadcast a simple element-wise addition for the final feature fusion. Besides, a subsampling trick via a 1 × 1 convolution W_s is applied before the context modeling module to further lower the computation, i.e., X̂_j = W_s X_j and X̂_m = W_s X_m. Thus, the overall procedure can be expressed as

Y_i = X_i + W_t( X_i ⊙ Σ_{j=1}^{N_p} ( exp(W_c X̂_j) / Σ_{m=1}^{N_p} exp(W_c X̂_m) ) X_j ).

In this paper, the FDA block is not inserted between features of different layers of the backbone, but acts directly on the outputs of layers 3-5, respectively. This not only effectively utilizes the global context information of different layers, but also avoids the false guidance of low-level features on high-level feature extraction. The final output is attained by a concatenation operation.
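Putting the three modules together, a full FDA block forward pass can be sketched as follows. This is one possible reading of the procedure above (the exact placement of the subsampling convolution W_s is an assumption), written in plain numpy with 1 × 1 convolutions expressed as matrix multiplications:

```python
import numpy as np

def softmax(v):
    v = v - v.max()          # numerical stability
    e = np.exp(v)
    return e / e.sum()

def fda_block(x, w_s, w_c, w_t):
    """Sketch of one FDA block.

    x:   input features, shape (C, H, W)
    w_s: subsampling 1x1 conv, shape (Cs, C)  -- lowers computation before attention
    w_c: attention 1x1 conv, shape (Cs,)      -- one attention logit per position
    w_t: transform 1x1 conv, shape (C, C)     -- channel-wise interdependencies
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    sub = w_s @ flat                      # subsampled features X_hat = W_s X
    alpha = softmax(w_c @ sub)            # (1) position-independent attention weights
    z = flat @ alpha                      # (1) context modeling: global context, (C,)
    t = w_t @ (flat * z[:, None])         # (2) transform: element-wise mul + 1x1 conv
    return (flat + t).reshape(c, h, w)    # (3) fusion: element-wise addition
```

Note that with `w_t` set to zero the block degenerates to the identity, which is the usual safe initialization for residual context blocks.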
Considering that FDA blocks mainly pay attention to the global spatial information which decides 'where' to focus, and miss the complementary channel attention which decides 'what' to focus on, a SE block 8 is introduced and placed in a sequential manner. The SE block serves as a hierarchical feature selector which directly selects features that are more conducive to identifying the current target and amplifies their discriminative abilities, leading to more accurate tracking. Specifically, the concatenated features from the three FDA blocks are fed into a SE block, and are then decoupled according to their corresponding layers. For convenience, the decoupled features of the template branch and the search branch are simply denoted as F^z_se and F^x_se, respectively. Then, we copy F^z_se and F^x_se of each layer to the classification branch and the regression branch, denoted as [F^z_se]_cls, [F^z_se]_reg, [F^x_se]_cls and [F^x_se]_reg. Each branch combines the feature maps via a depth-wise cross-correlation layer:

P_cls = [F^x_se]_cls ⋆ [F^z_se]_cls,  P_reg = [F^x_se]_reg ⋆ [F^z_se]_reg,

where ⋆ represents the convolution operation, and P_cls and P_reg denote the classification and regression maps, respectively. Finally, the classification maps and the regression maps from different layers are fused independently, and the corresponding weights are optimized through training. Specifically, each location (i, j) on the classification map is considered a positive sample if its corresponding position (⌊w_im/2⌋ + (i − ⌊w/2⌋) × s, ⌊h_im/2⌋ + (j − ⌊h/2⌋) × s) on the input image falls within the ground-truth bounding box, and a negative sample otherwise. Here, w_im and h_im represent the width and the height of the input image, w and h denote the width and height of the map, and s denotes the total stride of the network. For each location (i, j) on the regression map, we estimate a 4D vector, which represents the relative offsets from the four sides of the bounding box to the center location.
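The depth-wise cross-correlation above can be illustrated with a minimal numpy sketch. A naive sliding-window loop is used here for clarity; real implementations express this as a grouped convolution on the GPU:

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """Depth-wise cross-correlation: the template feature acts as a per-channel
    correlation kernel slid over the search feature (valid mode, no padding).

    search: (C, Hs, Ws), kernel: (C, Hk, Wk)
    Returns a response map of shape (C, Hs - Hk + 1, Ws - Wk + 1).
    """
    c, hs, ws = search.shape
    _, hk, wk = kernel.shape
    out = np.zeros((c, hs - hk + 1, ws - wk + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = search[:, i:i + hk, j:j + wk]
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))  # per-channel score
    return out
```

Unlike a plain cross-correlation that sums over channels, the depth-wise variant keeps one response map per channel, so the subsequent convolutional heads can still exploit channel-wise information.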

Ground-truth and Loss
As illustrated in SiamBAN 3, the sample label assignment is important for tracking performance, yet it is usually ignored by most Siamese network-based trackers. SiamBAN 3 adopts two ellipses to define both negative labels and positive labels. However, as shown in Fig. 5, we find that if the target aspect ratio is close to 1, which means that the target shape approximates a circle, the number of positive samples contained in the ellipse E2 is smaller than that contained in the circle C2. Therefore, to add more reliable positive samples, we preserve the setting for negative labels and modify the one for positive labels. Specifically, following the definitions in 3, the width, height, top-left corner, center point and bottom-right corner of the ground-truth bounding box are represented by g_w, g_h, (g_x1, g_y1), (g_xc, g_yc) and (g_x2, g_y2), respectively. Then the border for negative labels can be formulated as

E1: (p_i − g_xc)² / (g_w/2)² + (p_j − g_yc)² / (g_h/2)² = 1,

where (p_i, p_j) denotes the location of the feature maps. The border for positive labels is chosen adaptively. When min(g_w, g_h) < 0.25 × max(g_w, g_h), which represents a target shape close to a long rectangle, the area of the ellipse with g_w/4 and g_h/4 as semi-axes is larger than the area of the circle with min(g_w, g_h)/2 as radius, so we use the ellipse

E2: (p_i − g_xc)² / (g_w/4)² + (p_j − g_yc)² / (g_h/4)² = 1.

When min(g_w, g_h) ≥ 0.25 × max(g_w, g_h), which represents a target shape close to a square, the area of the circle is larger than that of the ellipse, so we use the circle

C2: (p_i − g_xc)² + (p_j − g_yc)² = (min(g_w, g_h)/2)².

Therefore, the location (p_i, p_j) is assigned a positive label if it falls within E2/C2, and a negative label if it falls outside E1. Positions falling between E2/C2 and E1 are ignored. It should be noticed that only locations with a positive label are used for bounding box regression. Finally, the multi-task loss function is minimized as

L = λ1 L_cls + λ2 L_reg,

where L_cls is the focal loss for the classification result and L_reg is the intersection over union (IoU) loss for the regression result. Similar to SiamBAN 3, we do not search for the hyper-parameters of the loss function and simply set λ1 = λ2 = 1.
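The adaptive label assignment can be sketched as a small decision function. This follows our reading of Fig. 5 (in particular, the C2 radius of min(g_w, g_h)/2 is a reconstruction from the area comparison in the text, so treat the constants as an assumption):

```python
import numpy as np

def assign_label(p_i, p_j, g_xc, g_yc, g_w, g_h):
    """Return +1 (positive), -1 (negative) or 0 (ignored) for a location
    (p_i, p_j) mapped back onto the input image.
    """
    dx, dy = p_i - g_xc, p_j - g_yc
    # E1: ellipse with semi-axes g_w/2 and g_h/2 -- border for negative labels
    if (dx / (g_w / 2)) ** 2 + (dy / (g_h / 2)) ** 2 > 1:
        return -1
    if min(g_w, g_h) < 0.25 * max(g_w, g_h):
        # long rectangle: E2, ellipse with semi-axes g_w/4 and g_h/4
        inside_pos = (dx / (g_w / 4)) ** 2 + (dy / (g_h / 4)) ** 2 <= 1
    else:
        # near-square: C2, circle with radius min(g_w, g_h)/2 (larger than E2)
        inside_pos = dx ** 2 + dy ** 2 <= (min(g_w, g_h) / 2) ** 2
    return 1 if inside_pos else 0      # between E2/C2 and E1 -> ignored
```

For a 100 × 100 (square) target, a location 49 pixels from the center is positive under the circle rule but would have been ignored under the original ellipse rule, which is exactly the extra-positive-samples effect the modification aims for.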

Experiments Implementation details
Our approach is implemented in Python using PyTorch on a PC with an Intel i7 CPU and four NVIDIA GeForce 1080Ti GPUs.

Training phase
Our proposed SiamFDA is trained end-to-end with image pairs picked from ImageNet VID 21, YouTube-BoundingBoxes 22, COCO 23, ImageNet DET 21, GOT10k 24 and LaSOT 25, using stochastic gradient descent (SGD) with a minibatch of 32 pairs. The size of a template patch is 127 × 127 pixels, and the size of a search patch is 255 × 255 pixels. We adopt the modified ResNet-50 3 pretrained on ImageNet 21 as the backbone, and the parameters of the first two layers are frozen. The total number of training epochs is 20. We first train our model for 5 warm-up epochs with a learning rate linearly increased from 0.001 to 0.005, then use a learning rate exponentially decayed from 0.005 to 0.00005 in the last 15 epochs. In the first 10 epochs, we only train the layers without pretraining, and fine-tune the remaining parameters in the last 10 epochs.
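The learning-rate schedule described above (linear warm-up, then exponential decay) can be written as a small helper. The epoch indexing and interpolation endpoints are illustrative assumptions; only the 0.001 → 0.005 → 0.00005 values come from the text:

```python
def learning_rate(epoch, total=20, warmup=5, base=0.001, peak=0.005, final=0.00005):
    """Per-epoch learning rate (epoch is 0-based).

    Linear warm-up from `base` to `peak` over the first `warmup` epochs, then
    exponential (log-linear) decay from `peak` to `final` over the rest.
    """
    if epoch < warmup:
        # linear interpolation across the warm-up epochs
        return base + (peak - base) * epoch / (warmup - 1)
    # decay so that the last epoch lands exactly on `final`
    steps = total - warmup
    return peak * (final / peak) ** ((epoch - warmup) / (steps - 1))
```

With the defaults, epoch 0 yields 0.001, epoch 4 reaches 0.005, and epoch 19 decays to 0.00005.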

Tracking phase
The template feature in the first frame is computed via the Siamese backbone once, and is then continuously matched against subsequent search images, generating the target center location and bounding boxes via the classification branch and the regression branch, respectively. In order to achieve a more stable and smoother prediction between adjacent frames, cosine windows and scale change penalties 2 are used. Cosine windows reduce boundary effects by applying a cosine-shaped weight distribution within the tracking window, placing the highest weight at the center and gradually decreasing towards the edges. This focuses the tracker on the target at the center of the window and minimizes the disruptive influence of the window's edges, making the tracking process smoother. Scale change penalties, on the other hand, manage changes in the target's size within the video. As the target moves away from or closer to the camera, its size in the frame changes; by penalizing rapid or significant scale changes, this mechanism helps the tracking algorithm adjust the size of the tracking window smoothly and gradually, avoiding instability due to abrupt scale changes. The combination of these two techniques significantly enhances the coherence and stability of frame-to-frame predictions. We then identify the predicted bounding box with the highest score as the most probable location of the target in each frame. This bounding box is linearly interpolated with the states from historical frames to maintain a continuous and accurate trajectory of the target. The interpolation leverages both the current frame's data and historical information, ensuring reliable tracking even when the target undergoes sudden changes in motion or appearance. Subsequently, the target state, including position and size, is updated based on this interpolated data. To further enhance tracking accuracy, especially in scenarios of occlusion where the target is partially or completely obscured, we employ a Kalman filter. This filter predicts the target's location by extrapolating from previous observations, compensating for moments when the target is not clearly visible. The integration of a Kalman filter proves crucial in maintaining robust tracking in complex environments, effectively mitigating the challenges posed by occlusions.
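The cosine-window and scale-penalty re-scoring can be sketched as below. The penalty form follows the SiamRPN-style convention cited above, and the constants (`win_weight`, `k`) as well as the function name are illustrative, not the paper's tuned values:

```python
import numpy as np

def rescore(scores, prev_w, prev_h, pred_w, pred_h, win_weight=0.4, k=0.04):
    """Re-score a classification map with a scale penalty and a cosine window.

    scores:          raw classification map, shape (H, W)
    prev_w, prev_h:  target size in the previous frame
    pred_w, pred_h:  candidate size per location (scalars or (H, W) arrays)
    """
    h, w = scores.shape
    hann = np.outer(np.hanning(h), np.hanning(w))    # peak at the window center
    # penalize large relative size changes between adjacent frames
    change = np.maximum(pred_w / prev_w, prev_w / pred_w) * \
             np.maximum(pred_h / prev_h, prev_h / pred_h)
    penalty = np.exp(-k * (change - 1.0))            # = 1 when the size is unchanged
    pscores = scores * penalty
    return pscores * (1 - win_weight) + hann * win_weight
```

When the candidate size equals the previous size, the penalty is exactly 1 and the cosine window alone biases the argmax toward the previous target position, which is what suppresses jitter between frames.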

GOT10k
The GOT10k 24 test set is a large-scale, high-diversity dataset containing 180 videos, with the average overlap (AO) and success rates (SR) at two thresholds as measure metrics. We evaluate our SiamFDA against SiamFC 1, DaSiamRPN 43 and other SOTA trackers; the results are reported in Table 3.

LaSOT
The LaSOT 25 test set (280 videos, average length of 2448 frames) is a long-term visual object tracking evaluation dataset, which uses success plots and normalized precision plots to evaluate tracking performance. We evaluate our tracker against trackers including SiamBAN 3, SiamRPN++ 9, UpdateNet 35, SPLT 44, SiamDW 38, ASRCF 45, ATOM 32 and SiamFC 1. Figure 9 shows that our SiamFDA tracker achieves an advantageous result, with a success score of 0.536 and a normalized precision of 0.540.

OTB-2015
OTB-2015 28 consists of 100 sequences and adopts one-pass evaluation (OPE) success plots and precision plots as evaluation metrics. Our SiamFDA tracker is compared with numerous SOTA trackers including ATOM 32, TADT 46, DaSiamRPN 43, SiamRPN 2, GradNet 47, SiamTri 48 and SiamFC 1. As the results displayed in Fig. 10 show, our SiamFDA tracker outperforms the other trackers, with a success score of 0.672 and a precision score of 0.879.

OTB-2013
OTB-2013 29 consists of 50 challenging image sequences, which is a subset of OTB-2015 28 annotated with bounding boxes and several different attributes. We compare our tracker SiamFDA with other SOTA trackers including TADT 46, SiamRPN 2, GradNet 47, DaSiamRPN 43, ATOM 32, SiamTri 48 and SiamFC 1. Table 4 shows that our proposed SiamFDA performs favorably against other outstanding trackers, especially when encountering low resolution and background clutter.

TNL2K

TNL2K 49 is a recently developed benchmark specifically tailored for visual-language (VL) tracking, encompassing a comprehensive dataset with 2000 video sequences. This benchmark distinguishes itself through a combination of key attributes, including superior quality, the inclusion of challenging adversarial samples, and extensive variation in appearance. We compare our tracker SiamFDA with other SOTA trackers including TNL2K 49, SNLT 50, CTRNLT 51, VLTTT 52 and JointNLT 53. From Table 5, we can conclude that SiamFDA exhibits superior performance on the assessed dataset compared to most of the current state-of-the-art methods. Notably, even though Transformer-based approaches surpass SiamFDA in accuracy, they fall significantly short in terms of real-time performance. This contrast highlights SiamFDA's advantage in delivering efficient tracking capabilities, particularly in scenarios that demand rapid response and minimal computational resources. Therefore, despite the superior accuracy of Transformer-based methods, SiamFDA emerges as a more practical solution for real-time tracking, striking a balance between high accuracy and operational feasibility.

Ablation studies
Ablation studies are performed on VOT-2018 26 and VOT-2019 27 to demonstrate the impact of key components of SiamFDA. As shown in Table 6, FDA block, NL block and SE block represent the feature dynamic activation block, the non-local block and the squeeze-and-excitation block, respectively. Rectangle and Circle represent rectangle labels (E1 + E2) and adaptive labels (E1 + E2/C2), respectively.

Ablation studies on blocks
As shown in Table 6, we perform an ablation study on the effects of the blocks we adopt. Comparing A1 with A2, we find that the introduction of the SE block increases the EAO criterion from 0.406 to 0.435 on VOT-2018 26 and from 0.281 to 0.314 on VOT-2019 27. Based on A2, when using our proposed FDA block, the performance achieves better results. From A3 to A4, though NL blocks 5 reach competitive results on object detection/segmentation, they are not effective enough when applied directly to object tracking; we speculate that this is because of the essential differences among these fields.

Ablation studies on sample label assignments
To explore the impact of sample label assignments on tracking performance, we take the target shape into account. Comparing A3 with A5, we can conclude that the adaptive sample label assignment method contributes to better tracking results.

Conclusions
In this paper, we propose a novel anchor-free network named SiamFDA, which consists of a Siamese network backbone for feature extraction and a feature dynamic activation subnetwork for accurate target location estimation as well as bounding box prediction. Specifically, a simple yet effective FDA block is designed to capture long-range dependencies between distant pixels in space and to activate reliable regions, thus improving tracking robustness. Besides, a SE block serves as a hierarchical feature selector to focus on features which are more advantageous for tracking the current target. Furthermore, we adjust the sample label assignment method adaptively according to the target shape. Extensive experiments are conducted on six datasets, where our method obtains competitive results with real-time running speed.

Figure 1 .
Figure 1. Visualization of attention maps (heatmaps) of SiamBAN (columns 2 and 3) and our proposed SiamFDA (columns 4 and 5) on three challenging video sequences, from which we can see that SiamFDA can effectively identify ambiguous patches, enabling our model to be robust to unreliable regions.

Figure 2 .
Figure 2. Qualitative comparison of our SiamFDA with SiamBAN on VOT-2018. Frames 1, 2, 3 and 4 each represent a consecutive frame in the tracking process. As observed from the visualization results, our tracker is better than the baseline in terms of accurate tracking.

Figure 3 .
Figure 3. Overview of the proposed SiamFDA architecture. The top branch is the template branch, which encodes the appearance information of the target, and the bottom branch is the search branch. Conv3_z, Conv4_z and Conv5_z represent the feature maps of the template branch, while Conv3_x, Conv4_x and Conv5_x represent the feature maps of the search branch. The features of each stage of the Siamese network backbone are extracted and then modulated by three FDA blocks, which generate global context features and feed them into a SE block to further exploit the channel attention. The network finally outputs a 2D classification map and a 4D regression map.

Figure 4 .
Figure 4. Architecture of the NL block (left) and our FDA block (right), both of which contain three modules: a context modeling module, a transform module and a fusion module. The feature maps are displayed by their dimensions, e.g., C × H × W denotes a feature map with channel number C, height H and width W. ⊗ denotes matrix multiplication, ⊙ denotes element-wise multiplication and ⊕ denotes element-wise addition. The blue boxes denote 1 × 1 convolutions and the purple ellipses denote the softmax operation.

Figure 5 .
Figure 5. The sample label assignment methods of SiamBAN and SiamFDA. E1 denotes ellipse E1, which is the border for negative labels; E2 and C2 denote ellipse E2 and circle C2, which are the borders for positive labels.

Figure 6 .
Figure 6. Expected average overlap (EAO) graph with trackers ranked from right to left on VOT-2018. The right-most tracker achieves the top-performing result.

Figure 8 .
Figure 8. Expected average overlap (EAO) graph with trackers ranked from right to left on VOT-2019. The right-most tracker achieves the top-performing result.

Table 3 .
Performance comparisons on GOT10k. Italic, bolditalic and underline fonts indicate the top-3 trackers.
Figure 7. A comparison of the quality and the speed of SOTA trackers on VOT-2018, from which we can conclude that our SiamFDA significantly outperforms nearly all top-performing SOTA trackers in all performance metrics.

Table 4 .
Comparisons on OTB-50, evaluated by precision and success rate. Italic, bolditalic and underline fonts indicate the top-3 trackers.

Table 5 .
Performance comparisons on TNL2K. Italic, bolditalic and underline fonts indicate the top-3 trackers.