Deep asymmetric extraction and aggregation for infrared small target detection

Infrared small target detection is widely applied in military and civilian fields. Because infrared targets are small, they lack textural detail. Common target detection methods extract semantic features by down-sampling the feature map several times, which may cause small targets to be lost in the deep layers, so these methods are not effective for infrared small target detection. To solve this problem, we propose a novel network called deep asymmetric extraction and aggregation (DAEA). The network mainly consists of two processes, vertical feature extraction and horizontal feature aggregation, both of which are enhanced by an asymmetric attention mechanism. In the vertical process, the asymmetric attention mechanism combined with a reduced number of down-sampling operations allows small targets to be better retained in the deep layers. Then, in the horizontal process, shallow spatial features and deep semantic features are aggregated to further highlight small targets while suppressing background noise. Experiments on the public datasets NUAA-SIRST, NUDT-SIRST and MDvsFA-cGAN show that the proposed network outperforms state-of-the-art methods in terms of detection accuracy and parameter efficiency.

1. A novel DAEA network is proposed for infrared small target detection. DAEA distinguishes small targets from background noise more effectively by making full use of shallow spatial features and deep semantic features through iterative aggregation. Iterative aggregation also reduces the influence of shallow features on the predicted image.
2. Semantic features are more relevant to small targets because the targets are better retained in the deep layers, which is achieved by fusing global and local attention and reducing the number of down-sampling operations. This, in turn, makes the subsequent feature aggregation more meaningful.
3. Experiments on the public datasets NUAA-SIRST, NUDT-SIRST and MDvsFA-cGAN demonstrate the superiority of the proposed network, which outperforms current SOTA methods with more accurate detection and fewer parameters.

Related work
There are two main types of methods for infrared small target detection: multi-frame detection [6] and single-frame detection. The former generally uses a sequence of consecutive images and exploits the continuity of the target across adjacent frames, assuming that the background of adjacent frames is stationary. However, in real scenarios the infrared sensor needs to constantly adjust its angle to capture fast-moving objects, so the assumption of a stationary background no longer holds [7]. Moreover, multi-frame detection is inefficient and cannot meet the requirements of real-time detection. Therefore, single-frame detection has attracted more attention.
For single-frame infrared small target detection, approaches are mainly classified into model-driven and data-driven ones. Most model-driven approaches convert the problem to outlier detection [8,9], highlighting small targets by measuring the discontinuity between them and the background. They include filter-based methods [10,11], local contrast-based methods [12,13], and low-rank-based methods [14,15], which mostly use the local mean or maximum gray level as features and manually set thresholds for target segmentation. These models do not need to be trained and follow a predefined process and hyper-parameters [16] to produce detection results. In practice, however, the biggest problem with these methods is that fixed hyper-parameters rarely give good detection results when the scene changes. They also have difficulty distinguishing background noise from targets, resulting in a large number of false detections. Small targets of varying size and with insignificant features are easily overwhelmed by the background, leading to inaccurate detection.
Deep learning learns target features in a data-driven manner and has been highly effective in computer vision in recent years. Thanks to the powerful fitting ability of CNNs and the large amount of data labeling work, it is practical for CNNs to learn target features accurately. Data-driven approaches show superior performance compared to traditional model-driven approaches. Liu et al. [17] first used a target detection framework for detecting infrared small targets, and their network was a 5-layer multilayer perceptron. Zhao et al. [18] proposed a detection model based on a generative adversarial network (GAN) for infrared small target detection. Wang et al. [19] used a conditional generative adversarial network (cGAN), which treated miss detection and false alarm as two opposing problems and trained the network to make a trade-off between the two metrics.
Image segmentation approaches have also received much attention. In particular, U-Net [20], which is used extensively for medical image segmentation, is now applied to infrared small target detection. Zhao et al. [21] used U-Net combined with a semantic constraint module to achieve semantic segmentation of infrared small targets. Dai et al. [5] designed an asymmetric contextual module for an image segmentation network that fuses high-level and low-level features to extract rich semantic information and spatial detail. In a subsequent improvement, Dai et al. [12] designed a trainable attentional local contrast network that incorporates a model-driven approach. Li et al. [22] designed a tri-direction dense nested interactive module and incorporated a cascaded channel and spatial attention module, with multiple interconnected nodes in the encoding and decoding paths to achieve repeated feature fusion and enhancement. Although these networks have improved performance, they still cannot solve the problem of small targets being lost during encoding in a deep network. Keeping small targets in the deep layers is the key to solving the problem of infrared small target detection.

Network architecture
The network architecture is shown in Fig. 1. The input is an infrared small target image, which is passed downwards through the vertical feature extraction module, called the backbone network. The backbone network consists of several stacked AAM-blocks and is divided into three stages. At each stage, high-level semantic features are extracted, and each stage except the last is followed by a max pooling layer. Features are then propagated through the horizontal feature aggregation module. Features from neighboring stages are aggregated, with an up-sampling operation applied to the deeper feature map so that the shapes match before aggregation. The resolution of the feature map is gradually restored to the input resolution. The Predict module takes the final feature map as input and produces a binary image as output, which is the final detection result of the model.
Let $L_{i,j}$ denote the outputs of the nodes in Fig. 1, where i denotes the i-th iteration of feature aggregation and j denotes the j-th feature extraction stage. The backbone network consists of the nodes $L_{0,j}$, $j \in \{0, 1, 2\}$. The expression of $L_{i,j}$ is given in Eq. (1).
In Eq. (1), input is the input infrared small target image, Ext(·) denotes feature extraction, Agg(·) denotes feature aggregation, $P_{max}(\cdot)$ is the down-sampling operation using max-pooling, and U(·) is the up-sampling operation using bilinear interpolation.
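Read together, these definitions suggest the following form for the node outputs. It is a sketch reconstructed from the description of Fig. 1 rather than a quotation of Eq. (1); in particular, pairing node $L_{i-1,j}$ with the up-sampled node $L_{i-1,j+1}$ is an assumption drawn from the statement that features from neighboring stages are aggregated.

```latex
L_{i,j} =
\begin{cases}
\mathrm{Ext}(\mathit{input}), & i = 0,\ j = 0 \\
\mathrm{Ext}\big(P_{max}(L_{0,j-1})\big), & i = 0,\ j > 0 \\
\mathrm{Agg}\big(L_{i-1,j},\ U(L_{i-1,j+1})\big), & i > 0
\end{cases}
```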
Our network structure is similar to U-Net [20] in that both have encoding and decoding processes. However, the way features are aggregated during decoding is different. The common approach in image segmentation is simply to aggregate shallow features with deeper features using skip connections. Our approach is to iteratively aggregate deeper features starting from the shallowest ones, and we also add attention to the process. Deep-layer features have rich global semantic information and relatively little local detail [23]. In infrared small target detection, small target features are not obvious, so it is important to leverage global semantic features from the deep layers for small target recognition. Since small target features are easily overwhelmed in the deep layers, we reduce the number of down-sampling operations, enhance feature extraction with an asymmetric attention mechanism, and iteratively aggregate deep and shallow features. Small target features are continuously enhanced, and the final feature map has rich global semantic information.
As shown in Fig. 1, our backbone network has blocks of cascaded convolutional layers like those in ResNet [2]. We extend the residual block with an extra attention layer, SA, to form the AAM-block, which extracts global channel features and local spatial detail and uses channel shuffle to let channel and spatial information interact. The learning capability of the network is thus adaptively enhanced. As shown in Table 1, down-sampling is applied to the output of each stage except the last one, i.e., $L_{0,0}$ and $L_{0,1}$. The length of the backbone network can be adjusted by the hyperparameter S, which is the number of cascaded convolutional blocks. The number of down-sampling operations limits the depth of the backbone network.
The input of a horizontal aggregation node is two feature maps from the preceding adjacent nodes. Because the two feature maps have different sizes, the deep-layer feature map is up-sampled to the same size as the shallow-layer feature map before entering the aggregation node. The aggregation node uses global attention and local attention to extract the semantics of the high-level feature and the detail of the low-level feature, respectively. Thus, the semantic understanding of the low-level feature is enhanced and the detail deficiencies of the high-level feature are filled in. Finally, the modulated high-level and low-level features are aggregated.
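The way resolution shrinks along the backbone and is restored by aggregation can be traced with a few lines of code. The sketch below only tracks spatial sizes, not features. It assumes that each max-pooling halves the side length, that each up-sampling doubles it, and that aggregation node (i, j) combines node (i-1, j) with the up-sampled node (i-1, j+1); the function name `node_resolutions` is hypothetical, and the default 256 matches the input size given under Training details.

```python
def node_resolutions(input_size=256, stages=3):
    """Return {(i, j): side length} for every node L_{i,j}.

    Row i = 0 is the backbone; rows i > 0 are aggregation iterations.
    Assumes 2x down-sampling between stages and 2x up-sampling
    before aggregation (illustrative assumptions).
    """
    if input_size % (2 ** (stages - 1)) != 0:
        raise ValueError("input_size must be divisible by 2**(stages - 1)")

    # Backbone: stage j sees the input down-sampled j times.
    sizes = {(0, j): input_size // (2 ** j) for j in range(stages)}

    # Aggregation: the deeper neighbor is up-sampled to the size of
    # the shallower one, so the output keeps the shallower size.
    for i in range(1, stages):
        for j in range(stages - i):
            sizes[(i, j)] = sizes[(i - 1, j + 1)] * 2
    return sizes


if __name__ == "__main__":
    for node, size in sorted(node_resolutions().items()):
        print(node, size)
```

With three stages, the last node (2, 0) is back at the input resolution, which is the map the Predict module would receive under these assumptions.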

Asymmetric attention mechanism
Retaining small targets in the deep layers is the key to solving the problem of infrared small target detection. An attention mechanism is employed in the network to enhance target features while suppressing the interference of background noise. In computer vision, the main types are channel attention and spatial attention, and the two can also be combined, as in the Convolutional Block Attention Module (CBAM) [24]. Channel attention is a global attention that focuses on global semantic features and on which of them are important, whereas spatial attention is a local attention that focuses on the local detail of the target and on which positions need attention. It is more effective to combine the two in parallel or in sequence [25]. We call such a combination of global attention and local attention the asymmetric attention mechanism, or AAM for short. In this paper, we apply AAM in both the feature extraction module and the feature aggregation module. AAM has two forms: self-attention in feature extraction and cross-attention in feature aggregation.
In the vertical feature extraction process, if small targets are lost as the network goes deeper, the extracted global semantic information is also invalid. Therefore, it is crucial to protect small targets in the backbone network during feature extraction. We use the self-attention form of AAM to enhance feature extraction. As shown in Fig. 2a, a global attention (GA) is used to extract semantic features, and a local attention (LA) is used to extract detail features. Both GA and LA are self-attention and are applied in parallel. The two branches are then blended so that the two kinds of features complement each other. We call this AAM-extraction.
Similarly, in the horizontal feature aggregation process, both GA and LA are applied, and semantic features and detail features are blended. One difference is that both GA and LA are in cross-attention form, as in Fig. 2b. The deep features undergo global attention to extract semantic information that enhances the shallow features, and the shallow features undergo local attention to extract detail information that enhances the deep features. We call this module AAM-aggregation.
The overall architecture of the SA module is shown in Fig. 3. The input feature map $X \in \mathbb{R}^{C \times H \times W}$ is divided into G groups along the channel dimension, and each group is again divided in half along the channel dimension into sub-features $X_{k1}, X_{k2} \in \mathbb{R}^{\frac{C}{2G} \times H \times W}$, on which the global attention and the local attention are applied, respectively.
Specifically, the globally attended sub-feature $X'_{k1}$ is produced by applying global average pooling g to $X_{k1}$, then a learnable scaling and shifting, and then the sigmoid function σ; the resulting channel-wise weights modulate $X_{k1}$.
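In symbols, the two branches of the SA module can be sketched as follows. The form follows the Shuffle Attention design, to which this module's group split, Group Norm and channel shuffle correspond; the names $W_1, b_1$ (global branch) and $W_2, b_2$ (local branch) are assumed labels for the scaling and shifting parameters, and $\odot$ denotes element-wise multiplication with broadcasting.

```latex
g(X_{k1}) = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{k1}(h, w), \qquad
X'_{k1} = \sigma\big(W_1 \cdot g(X_{k1}) + b_1\big) \odot X_{k1}

X'_{k2} = \sigma\big(W_2 \cdot \mathrm{GN}(X_{k2}) + b_2\big) \odot X_{k2}
```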
For the locally attended sub-feature $X'_{k2}$, GN denotes Group Norm [26], which is followed by its own learnable scaling and shifting parameters. All attended sub-features are then concatenated, and the channel shuffle operation [27] is applied to the concatenated feature so that global and local information interact along the channel dimension. The model extracts both channel and spatial information from the deep-layer feature. It can thus focus adaptively on semantic regions as well as on the local detail of the target, which significantly improves the segmentation of small targets.
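The channel shuffle step can be illustrated on a plain list of channel labels. This is a minimal sketch of the usual reshape, transpose and flatten formulation of channel shuffle, not the authors' implementation; the function name `channel_shuffle` and the choice of two groups in the usage line are illustrative assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to reshaping a length-C channel axis to
    (groups, C // groups), transposing to (C // groups, groups),
    and flattening again.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Walk position k inside each group, visiting the groups in turn.
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Channels a-c come from one branch, d-f from the other.
    print(channel_shuffle(list("abcdef"), groups=2))
```

After shuffling, neighboring channels come from different groups, which is what lets later layers mix the globally and locally attended sub-features.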

Asymmetric attention for feature aggregation
The global attention and local attention modules in AAM-aggregation can be implemented in a variety of ways, and we use the global and local modules of ACM as a specific implementation.
Let X be a low-level feature map and Y a high-level feature map of the same shape, $C \times H \times W$. Cross attention is used so that the high-level semantic feature can attend to spatial details and the low-level feature can attend to abstract semantics.
The globally attended feature $X'$ is produced by cross attention: a channel-wise weight is computed from Y and used to modulate X. In this computation, g denotes global average pooling, and β, δ and σ denote Batch Normalization (BN), the Rectified Linear Unit (ReLU) and the sigmoid function, respectively; the learnable parameters are those of two fully connected layers. The hyperparameter r is the channel reduction ratio, and r = 4 is used in this paper.
The locally attended feature $Y'$ is produced by cross attention in the same way, except that a position-wise weight is computed from X and used to modulate Y. Here $\mathrm{PWC}_1$ and $\mathrm{PWC}_2$ denote two point-wise convolution layers with kernel sizes of $\frac{C}{r} \times C \times 1 \times 1$ and $C \times \frac{C}{r} \times 1 \times 1$, respectively, where r is again the channel reduction ratio. Finally, the globally attended feature and the locally attended feature are aggregated according to $Z = X' + Y'$. The aggregated feature map $Z \in \mathbb{R}^{C \times H \times W}$ is thus enriched with both deep semantic and spatial detail information.
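Written out, the cross attention described above can be sketched as follows, based on the ACM formulation that this module adopts. The weight names $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ for the two fully connected layers are assumed labels, the gate names $A_g$ and $A_l$ are introduced only for this sketch, and $\otimes$ denotes element-wise multiplication with broadcasting.

```latex
A_g(Y) = \sigma\Big(\beta\big(W_2\, \delta\big(\beta(W_1\, g(Y))\big)\big)\Big), \qquad X' = A_g(Y) \otimes X \\
A_l(X) = \sigma\Big(\beta\big(\mathrm{PWC}_2\big(\delta\big(\beta(\mathrm{PWC}_1(X))\big)\big)\big)\Big), \qquad Y' = A_l(X) \otimes Y \\
Z = X' + Y'
```

The asymmetry is visible in the shapes of the gates: $A_g(Y)$ has one weight per channel, while $A_l(X)$ has one weight per channel and spatial position.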

Experiment
Loss function
As in most infrared small target detection work, we use the soft-IoU loss function for network training, which is defined in Eq. (6).
In Eq. (6), $P \in \mathbb{R}^{H \times W}$ is the prediction output of the network, and $L \in \mathbb{R}^{H \times W}$ denotes the labels.
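For reference, a commonly used form of the soft-IoU loss is $1 - \sum (P \cdot L) / \sum (P + L - P \cdot L)$, with sums over all pixels. The sketch below implements that form in plain Python for a single image; it is an illustration under that assumption and may differ in detail from Eq. (6). The smoothing term `eps` and the function name `soft_iou_loss` are additions for this example.

```python
def soft_iou_loss(pred, label, eps=1e-6):
    """Soft-IoU loss for one image.

    pred  : H x W nested list of probabilities in [0, 1]
    label : H x W nested list of 0/1 ground-truth values
    eps   : small constant that avoids division by zero (assumption)
    """
    inter = 0.0
    union = 0.0
    for p_row, l_row in zip(pred, label):
        for p, l in zip(p_row, l_row):
            inter += p * l          # soft intersection
            union += p + l - p * l  # soft union
    return 1.0 - (inter + eps) / (union + eps)


if __name__ == "__main__":
    # Perfect prediction gives a loss close to 0.
    print(soft_iou_loss([[1.0, 0.0]], [[1, 0]]))
    # An undecided prediction is penalized.
    print(soft_iou_loss([[0.5, 0.5]], [[1, 0]]))
```

Because the loss is built from products and sums of probabilities, it stays differentiable with respect to the predictions, unlike the thresholded IoU used for evaluation.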

Evaluation metrics
Some commonly used pixel-level evaluation metrics are not well suited to small infrared targets, which lack detailed texture. For small targets covering only a few pixels, an incorrect prediction can cause a sharp drop in pixel-level metric values, so we also include metrics that measure the localization ability of the model. In this paper, the following evaluation metrics are used to evaluate infrared small target detection.
1. Intersection over Union (IoU) is a pixel-level evaluation metric that evaluates the contour description capability of the algorithm by the ratio of the intersection to the union of the predicted target pixels and the label pixels:
$$\mathrm{IoU} = \frac{N_{inter}}{N_{union}}$$
where $N_{inter}$ and $N_{union}$ denote the number of pixels in the intersection and in the union of the predicted target and the label, respectively.
2. Probability of Detection ($P_d$) is an evaluation metric for target localization, defined as the ratio of the number of correctly predicted targets to the number of all labelled targets. It indicates the capability to cover labelled targets, and a higher value means fewer missed targets:
$$P_d = \frac{T_{correct}}{T_{all}}$$
where $T_{correct}$ and $T_{all}$ denote the number of correctly predicted targets and the number of all labelled targets, respectively. A target is counted as correctly predicted if its center-of-mass deviation is less than a given threshold. In this paper, the threshold is set to 3.
3. False Alarm Rate ($F_a$) is also a target-level evaluation metric. It measures the ratio of false alarm pixels to all image pixels and indicates the probability of incorrectly predicting a target, with smaller values indicating fewer incorrectly detected targets:
$$F_a = \frac{P_{false}}{P_{all}}$$
where $P_{false}$ and $P_{all}$ denote the number of falsely predicted pixels and the number of all image pixels, respectively. A pixel is counted as falsely predicted if the centroid deviation of its target is larger than a given threshold. In this paper, the threshold is set to 3.
4. The receiver operating characteristic (ROC) curve describes the trend between the true positive rate (TPR) and the false positive rate (FPR) of a model at different thresholds:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively. The Area Under Curve (AUC) is a quantitative indicator of the ROC, with a higher AUC value indicating better detection performance.
In addition, the number of parameters (Params) and FLOPs are used to describe the complexity of the neural network, and inference time (Time) is used to indicate the inference speed of the model.
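The ratio-based metrics above can be computed directly from binary masks and target centroids. The following is a minimal sketch, not the evaluation code used in the experiments. It assumes flat 0/1 masks, centroids given as (row, column) pairs, Euclidean distance for the centroid deviation, and a labelled target counting as detected when any predicted centroid lies within the threshold; all function names are hypothetical.

```python
import math


def iou(pred, label):
    """Pixel-level IoU of two flat 0/1 masks of equal length."""
    inter = sum(1 for p, l in zip(pred, label) if p and l)
    union = sum(1 for p, l in zip(pred, label) if p or l)
    # Returning 0.0 for an empty union is a convention chosen here.
    return inter / union if union else 0.0


def probability_of_detection(pred_centroids, label_centroids, threshold=3.0):
    """Fraction of labelled targets with a predicted centroid
    closer than `threshold` (Euclidean distance, an assumption)."""
    if not label_centroids:
        return 0.0
    hits = 0
    for ly, lx in label_centroids:
        if any(math.hypot(py - ly, px - lx) < threshold
               for py, px in pred_centroids):
            hits += 1
    return hits / len(label_centroids)


def tpr_fpr(pred, label):
    """True and false positive rates of two flat 0/1 masks."""
    tp = sum(1 for p, l in zip(pred, label) if p and l)
    fp = sum(1 for p, l in zip(pred, label) if p and not l)
    fn = sum(1 for p, l in zip(pred, label) if not p and l)
    tn = sum(1 for p, l in zip(pred, label) if not p and not l)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


if __name__ == "__main__":
    print(iou([1, 1, 0, 0], [1, 0, 0, 0]))
    print(probability_of_detection([(11, 10), (80, 80)], [(10, 10), (50, 50)]))
    print(tpr_fpr([1, 1, 0, 0], [1, 0, 0, 0]))
```

Sweeping the binarization threshold of the network output and recording `tpr_fpr` at each value yields the points of the ROC curve.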

Datasets description
The datasets used in this experiment are NUAA-SIRST [5] by Dai et

Training details
Using the NUAA-SIRST, NUDT-SIRST and MDvsFA-cGAN datasets, we conducted experiments on the PyTorch platform with a single P5000-16G GPU and CUDA 11.2. The input images are first resized to a resolution of 256 × 256 and then normalized to accelerate network convergence. Our network is trained with the soft-IoU loss function, Adagrad [28] as the optimizer, and randomly initialized network parameters. We use a batch size of 8 and an initial learning rate of 0.05. The network is trained for 500 epochs on NUAA-SIRST, 400 epochs on NUDT-SIRST and 50 epochs on MDvsFA-cGAN. The threshold used in the Predict module is 0.5.

Comparison to the state-of-the-art methods
We compare the proposed network with several state-of-the-art (SOTA) methods. The selected model-driven methods include Top-Hat [10], Max-Median [11], weighted strengthened local contrast measure (WSLCM) [29], multiscale tri-layer local contrast measure (TLLCM) [30], infrared patch-image (IPI) [15], non-convex rank approximation minimization (NRAM) [31], reweighted infrared patch-tensor (RIPT) [16], partial sum of the tensor nuclear norm (PSTNN) [32], and multiple subspace learning and spatial-temporal patch-tensor (MSLSTIPT) [33]. The selected data-driven methods include U-Net [20], Asymmetric Contextual Modulation (ACM) [5], Attentional Local Contrast (ALC) [12], Infrared Small-Target Detection U-Net (ISTDU) [13] and the dense nested attention network (DNANet) [22]. The adaptive thresholds applied in the model-driven methods are calculated by Eq. (11). For the data-driven methods, we keep the same experimental parameter settings as in the respective papers.
In Eq. (11), Max(G), Avg(G) and σ(G) denote the maximum value, the average value and the standard deviation of the output, respectively.
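A threshold built from these three statistics can be sketched as below. The specific combination (the larger of a fraction of the maximum and the mean plus a multiple of the standard deviation) and the coefficients 0.7 and 0.5 are assumptions chosen for illustration; they are not taken from Eq. (11). The function names `adaptive_threshold` and `binarize` are hypothetical.

```python
from statistics import mean, pstdev


def adaptive_threshold(output, max_coef=0.7, std_coef=0.5):
    """Threshold from the max, mean and standard deviation of a 2D map.

    output   : H x W nested list of response values G
    max_coef : weight on Max(G) (illustrative assumption)
    std_coef : weight on sigma(G) (illustrative assumption)
    """
    values = [v for row in output for v in row]
    return max(max_coef * max(values),
               std_coef * pstdev(values) + mean(values))


def binarize(output, threshold):
    """Mark pixels whose response exceeds the threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in output]


if __name__ == "__main__":
    g = [[0, 0], [0, 10]]
    t = adaptive_threshold(g)
    print(t, binarize(g, t))
```

Because the threshold is recomputed per image from its own response map, it adapts to the output range of each model-driven method.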

Quantitative results
The quantitative results are shown in Tables 2, 3 and 4. The data-driven methods are more effective than the model-driven methods on all three datasets. In terms of IoU in particular, the model-driven methods reach only 30.41 at best. These methods focus on target localization and are not good at handling the contour details of the target. In addition, the manually selected parameters limit the generalization ability of the model, which cannot adapt to complex background changes. Although several model-driven methods achieve better $P_d$ results on the MDvsFA-cGAN dataset, a comparison of AUC results shows that this high detection probability comes with a high probability of false detection. As can also be seen in Fig. 5, these methods produce a large number of false detections.
Compared with other data-driven methods, our model takes the shortest time to train and has the fewest parameters. On the MDvsFA-cGAN and NUAA-SIRST datasets, DAEA achieves the best results in IoU, $P_d$ and $F_a$. On the NUAA-SIRST dataset, our method outperforms the current SOTA by a margin of 0.26 in IoU and 2.87 in $P_d$. On the MDvsFA-cGAN dataset, our method is also at the leading level. On the newly published NUDT-SIRST dataset, our method is not yet adapted and does not take the lead for now. However, its results are not far from those of DNANet, and it outperforms DNANet on the other two datasets. In summary, our method is superior in detection accuracy as well as in shape matching of small targets. In terms of parameter efficiency, with S = 3 our method already outperforms other methods in IoU and $P_d$. We also experiment with the length of the network by tuning the hyperparameter S. The results show the typical U-shaped trend, with the best performance on the NUAA-SIRST dataset at S = 5.
As can be seen from Fig. 4, DAEA has the best AUC values on both the NUAA-SIRST and MDvsFA-cGAN datasets, indicating that our method has excellent detection performance. The figure also shows that data-driven approaches generally outperform model-driven approaches.
The images in these datasets have different complex backgrounds, target shapes and irregular target sizes, which means that DAEA can learn features that are robust to scene changes.

Qualitative results
Figure 5 shows the results of different methods on 9 test images, and Fig. 6 shows the 3D visualization of these results. The prediction results show that the model-driven methods perform well only on the 2nd image. These methods have difficulty distinguishing the target from background noise with high local contrast. Hence there are many missed detections and false detections, and the detected targets appear very faint. This is because the features are manually selected at a shallow level and the parameters are preset rather than learned, which results in limited generalization capability. Data-driven methods outperform model-driven methods, but their performance varies. Comparing the results of the data-driven methods in Fig. 5, all methods except ours have missed detections or false detections. In addition, Fig. 5(5) shows that DAEA segments the target shape more accurately than ACM. This is because the asymmetric attention for feature extraction in the backbone network plays a key role. Small targets are retained in the deep layers by blending global and local features using asymmetric attention. The deep layer therefore carries accurate global semantic information, and the global attention for feature aggregation is more effective.

Ablation study
To investigate the role of AAM in the feature extraction and feature aggregation processes, we remove SA and ACM from DAEA, respectively. The results are shown in Table 5. The performance of the model is severely degraded for both DAEA without SA and DAEA without ACM compared to the full model (DAEA). In particular, after removing the SA module the IoU decreases by 2.74 on the NUAA-SIRST dataset, by 2.34 on the MDvsFA-cGAN dataset and by 5.11 on the NUDT-SIRST dataset, indicating that AAM plays a role in both processes. Moreover, the joint use of AAM in both processes works much better than using it in only one of them. We also replaced SA with two other attention mechanisms, the Convolutional Block Attention Module (CBAM) [24] and Squeeze-and-Excitation (SE) [34], and observed a significant decrease in model effectiveness. This suggests that asymmetric attention mechanisms are more appropriate for infrared small target detection. We further investigate the advantage of asymmetry by using symmetric attention (local attention in both branches or global attention in both branches) in the AAM. As shown in Table 5, the models with symmetric attention do not achieve the best results, indicating that local detail information and global semantic information need to be attended to jointly during feature extraction and feature aggregation. Asymmetric attention is more advantageous than symmetric attention.
To investigate the role of iterative aggregation in the network, we replace iterative aggregation with the skip connections of U-Net. The results show that performance decreases on all three datasets, with a particularly significant rise in the false alarm rate, suggesting that iterative aggregation can reduce the effect of shallow noise on the predicted images.
We also found that, in the AAM-extraction module, using LA in both branches improves the IoU more than using GA in both branches on both datasets, indicating that LA helps small targets to be retained in the deep feature map. In the AAM-aggregation module, using GA in both branches is more effective than using LA in both branches: the IoU is improved by 0.77 on the NUAA-SIRST dataset and by 0.66 on the MDvsFA-cGAN dataset, which indicates that semantic features are more important in the feature aggregation process.

Figure 1. DAEA network architecture. The green and red arrows represent down-sampling and up-sampling operations, respectively. The dashed box shows the detailed flow of the AAM-block.

Figure 2. Asymmetric attention mechanism flowchart. (a) AAM-extraction in the feature extraction process. (b) AAM-aggregation in the feature aggregation process. GA is the global attention, LA is the local attention.

Figure 3. SA module flowchart. GA is the global attention, LA is the local attention.

Figure 4. ROC curves of different infrared small target detection methods on three datasets.

Figure 5. The visualization of the results achieved by different methods on 9 test images. The zoomed-in targets are shown in the red boxes. The red circles mark the areas of correctly detected targets, and the green circles mark the areas of missed targets and false alarms. Our DAEA model achieves accurate target localization as well as shape segmentation.

Table 2. Comparison of different infrared small target detection methods on the NUAA-SIRST dataset. The best results according to each metric are marked in italic, and the second in bold.

Table 3. Comparison of different infrared small target detection methods on the MDvsFA-cGAN dataset. The best results according to each metric are marked in italic, and the second in bold.

Table 4. Comparison of different infrared small target detection methods on the NUDT-SIRST dataset. The best results according to each metric are marked in italic, and the second in bold.

Table 5. Results of ablation studies on asymmetric attention mechanism.