RA-Net: reverse attention for generalizing residual learning

Since residual learning was proposed, identity mapping has been widely used in neural networks. It transfers information without attenuation, which plays a significant role in training deeper networks. However, interference carried along this unhindered path also hurts the network's performance. Accordingly, we propose a generalized residual learning architecture called reverse attention (RA), which applies high-level semantic features to supervise the low-level information on the identity mapping branch: higher-level semantics selectively transmit low-level information to deeper layers. In addition, we propose Modified Global Response Normalization (M-GRN) to implement reverse attention, and derive RA-Net by embedding M-GRN in the residual learning framework. Experiments show that RA-Net brings significant improvements over residual networks on typical computer vision tasks. For classification on ImageNet-1K, RA-Net improves Top-1 accuracy over ResNet101 by 1.7% with comparable parameters and computational cost. For COCO detection with Faster R-CNN, reverse attention improves box AP by 1.9%. For semantic segmentation on ADE20K, reverse attention improves UperNet's mIoU by 0.7%.

Our main contributions are summarized as follows:
• A reverse attention mechanism is proposed. It utilizes high-level semantics to supervise relatively low-level semantics; the semantic gap between the two enhances its effectiveness.
• With the help of reverse attention, we propose the RA architecture, a generalized form of residual learning. It shows that performance can also be improved by optimizing the architecture itself.
• We propose Modified Global Response Normalization (M-GRN) to implement the RA architecture, which introduces negligible extra parameters.
• RA-Net is derived from the RA architecture. Experiments show that RA-Net brings significant improvements over residual networks.

Residual learning
Residual learning can be traced back to the proposal of ResNet 2. As deeper networks began to converge, a problem was exposed: as network depth increases, accuracy saturates and then degrades rapidly.
ResNet addresses this degradation problem by adding an identity mapping branch. Rather than expecting each stack of layers to fit the desired underlying mapping directly, ResNet explicitly lets these layers fit a residual mapping. In the extreme case that an identity mapping is optimal, it is much easier to push the residual to zero than to fit an identity mapping with a stack of layers. Residual learning is favored by the computer vision community and is present in almost all advanced frameworks. Pure CNN frameworks such as ConvNeXt 4,8, ParC-Net 9, RepLKNet 10, PoolFormer 3, MobileNet 11, ShuffleNet 12, and EfficientNet 13 focus on different aspects of accuracy and efficiency. Another category is attention-based frameworks such as ViT 5, Swin Transformer 14, and PVT 15. Hybrid structures are also a hot spot of current research, e.g., Mobile-Former 16, CoaT 17, and MobileViT 18. Without exception, these approaches all rely on the identity mapping branch to optimize deeper networks.

Attention mechanism
The attention mechanism plays a crucial role in computer vision tasks, especially self-attention in transformers. ViT 5 applies a transformer directly to sequences of image patches for classification. Meanwhile, more attention-based models are applied to computer vision tasks such as detection 19 and segmentation. Self-attention establishes relationships among patches, and because all patches undergo the same operations before the self-attention computation, they share the same semantic level.
Besides transformers, there are other types of attention mechanisms. SE-Net 20 re-estimates the channel responses of convolutional features and belongs to channel-wise feature supervision. Building on SE-Net, CBAM 21 adds a spatial attention module. These are plug-and-play attention methods with great flexibility. Different from the above methods, we propose a reverse attention mechanism: high-level semantics are used to supervise low-level semantics in reverse. This is more in line with human learning patterns, in which more experienced teachers instruct students. Other attention mechanisms rarely consider the issue of semantic level.

General architecture
Transformers show great potential in computer vision tasks. It is widely believed that their attention-based token-mixer module contributes most to their capabilities 22. However, MetaFormer 3 shows that the general architecture of the transformer matters more to model performance: PoolFormer, a model derived from MetaFormer that uses a pooling-based token mixer, surprisingly achieves competitive performance on several computer vision tasks. This suggests the importance of architecture in neural networks, and makes it attractive to explore the impact of the general architecture on performance. Deeply inspired by MetaFormer, we apply reverse attention to generalize the residual learning framework, prioritizing the strengths of the architecture itself to improve performance. Reverse attention is implemented by M-GRN, which introduces negligible parameters and computation. Consequently, the performance improvements achieved through reverse attention can be attributed primarily to the inherent advantages of the architecture.

RA architecture
Residual learning (identity mapping) is the dominant architecture in current models. It is formulated as

y = F(x) + x, (1)

where x is the identity mapping branch, representing shallow-layer information that is added directly, without hindrance, to the higher-level semantic features F(x). This indiscriminate transmission of information also lets interference accumulate in deeper layers. Therefore, we propose the reverse attention architecture, which generalizes residual learning. Reverse attention, as the name implies, runs opposite to the forward process.
It is an attention mechanism that utilizes high-semantic features to supervise low-semantic information. The generalized equation is

y = F(x) + G(F(x)) · x, (2)

where G(F(x)) implements reverse attention. F contains several convolutional layers that aggregate information along the spatial and channel dimensions, improving the receptive field and the capacity of the network. In general, the semantic "level" can be enriched by the number of stacked layers (depth) 2. Therefore, compared with x, F(x) has a higher semantic level, which ensures that it contains richer information. This semantically dominant, cross-level attention mechanism adaptively determines the degree to which x flows to deeper layers.
The model adaptively retains valuable information while blocking interference. In particular, the RA architecture retains the advantages of residual learning: when G(F(x)) = 1, Eq. (2) is equivalent to Eq. (1), i.e., the RA architecture degenerates to residual learning. This is the key reason why residual learning can be generalized through reverse attention.
Nowadays, the differences among models lie mainly in F; that is, model capacity is increased by optimizing the design of F. The RA architecture provides another way. The branch G takes F(x) as input, and its output is multiplied directly by x; how G is fitted directly affects the model's performance. In the following subsection, we introduce the RA block derived from the RA architecture, including a simple instantiation of G.
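For concreteness, the following minimal sketch (PyTorch-style Python, not the original implementation) contrasts the two formulations; F and G are placeholders for the residual function and the reverse attention branch:

```python
# Minimal sketch contrasting Eq. (1) and Eq. (2); F and G are arbitrary
# callables standing in for the residual function and the RA branch.
def residual_forward(x, F):
    return F(x) + x                 # Eq. (1): y = F(x) + x

def reverse_attention_forward(x, F, G):
    fx = F(x)                       # higher-level semantics
    return fx + G(fx) * x           # Eq. (2): y = F(x) + G(F(x)) * x
```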

RA block
We present the RA block derived from the RA architecture. As shown in Fig. 2, there is already a comprehensive literature on optimizing F, so this paper focuses on the design of the reverse attention branch. The branch must satisfy two conditions. First, G(F(x)) = 1 whenever F(x) = 0, i.e., G(0) = 1; this preserves the advantages of residual learning 2. Second, the semantic level of F(x) must be higher than that of x. The two commonly used attention functions, Sigmoid 23 and Softmax 24, do not satisfy the first condition. Therefore, we propose a novel implementation of the RA branch (G) based on M-GRN. As shown in Fig. 3, given an input feature X ∈ R^(H×W×C), M-GRN consists of three steps: (1) global feature aggregation, (2) feature normalization, and (3) feature calibration.
To obtain global features and reduce extra computational cost (FLOPs), the spatial dimensions (H × W) of the feature are compressed at the beginning of the RA branch. There are many ways to obtain global features, such as Global Average Pooling (GAP), Global Max Pooling (GMP), the L1-norm, and the L2-norm. Through comparative experiments we choose the L2-norm:

P(X)_i = ‖X_i‖, i = 1, …, C, (3)

where P(X)_i is a scalar aggregating the information of the i-th channel, C is the number of feature channels, and ‖·‖ denotes the L2-norm.
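A short sketch of this aggregation step, assuming a standard N × C × H × W feature layout (the shapes are hypothetical, not taken from the paper):

```python
import torch

# Hypothetical feature map: batch of 2, C = 64 channels, 56 x 56 spatial.
X = torch.randn(2, 64, 56, 56)

# Eq. (3): P(X)_i = ||X_i||, the L2-norm over the H x W values of channel i,
# i.e. one scalar per channel (output shape: 2 x 64 x 1 x 1).
P = torch.linalg.vector_norm(X, ord=2, dim=(2, 3), keepdim=True)
```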
Regarding the normalization function, the most commonly used is standardization, as in BN 25 and LN 26:

N(x_i) = (x_i − μ) / √(σ² + ε), (4)

where μ and σ² are the mean and variance of the aggregated values, and ε is a small float added to the denominator to avoid division by zero. This approach does not satisfy the first condition of reverse attention: when the aggregated values are zero, the output of Eq. (4) is generally nonzero. GRN 8 provides another efficient normalization method, in which the aggregated values are normalized as

N(X_i) = X_i / ((1/C) ∑_{j=1}^{C} X_j + ε), (5)

where X_i here denotes the aggregated value P(X)_i. When X_i is equal to zero, N(X_i) is also zero. Compared with Eq. (4), the normalization of Eq. (5) is easier to modify to satisfy the reverse attention condition, as described in Step 3.
To facilitate optimization, two learnable parameters, γ and β, are usually introduced to calibrate the features:

y = γ · N(x) + β. (6)

This form cannot be applied directly in the reverse attention branch. We therefore remove the bias term β in Eq. (6) and add the constant 1, modifying it to

G(X) = γ · N(P(X)) + 1. (7)

It is easy to verify that G(0) = 1. In addition, γ is the only learnable parameter we introduce, adding insignificant parameters and computational cost, so the performance of the RA architecture itself can be verified effectively.
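The following small check, assuming the GRN-style channel-mean form of Eq. (5), illustrates that the first reverse attention condition holds: a zero input yields G = 1 for any γ.

```python
import torch

eps = 1e-6
gamma = torch.randn(1, 64, 1, 1)              # arbitrary learned scale
P = torch.zeros(1, 64, 1, 1)                  # F(x) = 0 -> aggregated stats are 0

N = P / (P.mean(dim=1, keepdim=True) + eps)   # Eq. (5): zero stays zero
G = gamma * N + 1.0                           # Eq. (7)
assert torch.allclose(G, torch.ones_like(G))  # G(0) = 1 regardless of gamma
```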
It is worth discussing how to ensure that the semantic level of F(x) is higher than that of x, the second condition the RA block must satisfy. ResNet 2 shows that the "level" of semantics can be enriched by stacking layers (depth). However, this semantic gap does not exist in an untrained model with randomly initialized parameters, so F should be trained before the RA branch. We adopt two measures to make F(x) train preferentially. First, we initialize γ to 0, so that G(X) = 1; this lets the RA block initially perform plain residual learning and adapt gradually during training. Second, we modify Eq. (7) as

G(X) = γ · N(P(X)) / temp + 1, (8)

where temp implements a temperature annealing strategy 27 that facilitates the training process. temp is not a fixed value but changes dynamically with training iterations; in this paper, it is reduced linearly from 30 to 1 over the first 10 epochs. This slows down the training of G, widening the semantic gap between F(x) and x.
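Putting the three steps together, a minimal PyTorch sketch of the full M-GRN branch might look as follows; it assumes the GRN-style normalization above, treats temp as an externally updated buffer, and is an illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class MGRN(nn.Module):
    """Sketch of the M-GRN reverse attention branch G (Eqs. 3, 5 and 8),
    assuming GRN-style channel-mean normalization; not the authors' code."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        # gamma is the only learnable parameter, initialized to 0 so that
        # G(x) = 1 at the start of training (pure residual learning).
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps
        # temp is updated externally by the annealing schedule (30 -> 1).
        self.register_buffer("temp", torch.tensor(30.0))

    def forward(self, fx: torch.Tensor) -> torch.Tensor:
        # Step 1, Eq. (3): per-channel L2-norm over the spatial dimensions.
        p = torch.linalg.vector_norm(fx, ord=2, dim=(2, 3), keepdim=True)
        # Step 2, Eq. (5): divide each channel by the mean over channels.
        n = p / (p.mean(dim=1, keepdim=True) + self.eps)
        # Step 3, Eq. (8): calibrate with gamma, temperature and constant 1.
        return self.gamma * n / self.temp + 1.0
```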
As reverse attention generalizes residual learning, we first apply it to ResNet 2: we keep every part of ResNet and only embed the reverse attention branch, yielding RA-ResNet. We further validate the mechanism on lightweight models such as MobileNetV2 11. Notably, MobileNetV2 contains two different unit blocks, which differ in whether a skip connection (identity mapping) is included; we embed RA branches only in the blocks that contain skip connections, yielding RA-MobileNetV2.
Experiments are carried out mainly on these two types of models to comprehensively verify the performance of the reverse attention mechanism.
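As an illustration of the embedding, the following hypothetical bottleneck reuses the MGRN sketch above to gate the identity branch while leaving F untouched; the layer configuration is our example, not the paper's:

```python
import torch.nn as nn

class RABottleneck(nn.Module):
    """Hypothetical RA-ResNet block: an unchanged ResNet residual function F,
    with the MGRN module sketched above gating the identity mapping."""

    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.F = nn.Sequential(                  # residual function F
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.G = MGRN(channels)                  # reverse attention branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fx = self.F(x)
        return self.relu(fx + self.G(fx) * x)   # Eq. (2)
```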

Datasets
ImageNet-1K 28 is one of the most classic classification datasets. It contains about 1.3M training images and 50K validation images, covering rich scenes and 1,000 common categories, so it accurately reflects accuracy differences among methods. Consistent with most approaches, performance is evaluated by Top-1 and Top-5 recognition rates on the ImageNet-1K validation set. More training details are listed in Table 1.

CNN backbones
The MobileNetV2 11 and ResNet 2 families are selected to cover both lightweight and large CNN architectures. In particular, we choose six backbones: ResNet18, ResNet50, ResNet101, and MobileNetV2 (1.0×, 0.75×, 0.5×). These backbones are used to verify the effect of the reverse attention mechanism on models of different sizes and depths.

Results comparison with ResNets
We first conduct experiments on ResNets 2; the results are shown in Table 2. When ResNet18 is the baseline, RA-ResNet18 improves Top-1 accuracy by 1.0% with comparable parameters and FLOPs, while SE-ResNet18 and CBAM-ResNet18 improve Top-1 accuracy by 0.8% and 0.9%, respectively. Similarly, with ResNet50 as the baseline, the Top-1 accuracy of SE-ResNet50, CBAM-ResNet50, and RA-ResNet50 increases by 1.1%, 1.2%, and 1.4%, respectively. The FLOPs of RA, SE, and CBAM differ only slightly, owing to the use of feature aggregation.
To verify the performance of the reverse attention mechanism on a deeper network, we further conduct comparative experiments on ResNet101. The proposed RA-ResNet101 shows a significant improvement over SE and CBAM. As shown in Table 2, embedding SE and CBAM in ResNet101 increases the model's parameters by 4.78M and 9.56M, with Top-1 accuracy gains of 1.0% and 1.1%. Meanwhile, the RA branch embedded in ResNet101 leads to a 1.7% improvement in Top-1 accuracy with only 0.03M additional parameters.
Overall, compared with SE and CBAM, RA uses fewer extra parameters to bring greater performance improvements. Moreover, the advantage of reverse attention grows with network depth. Introducing SE into ResNet18, ResNet50, and ResNet101 increases Top-1 accuracy by 0.8%, 1.1%, and 1.0%; CBAM improves it by 0.9%, 1.2%, and 1.1%. The improvement stays around 1.0%. By contrast, RA improves ResNet18, ResNet50, and ResNet101 by 1.0%, 1.4%, and 1.7%, showing an upward trend. We infer that the reverse attention mechanism is more conducive to the optimization of deeper networks, without compromising the advantages of residual learning. Comparing training times, the "Training hours" of RA and SE are close and slightly higher than the baseline's, whereas CBAM's "Training hours" are significantly larger than those of the other methods, caused by its simultaneous use of spatial and channel attention.

Results comparison with MobileNetV2
We further verify the performance of the reverse attention mechanism on the lightweight model MobileNetV2. The results are shown in Table 3. Overall, RA, SE, and CBAM all improve performance, and RA achieves the highest accuracy with parameters and FLOPs comparable to the baseline. For example, with MobileNetV2 (0.5×) as the baseline, SE, CBAM, and RA improve Top-1 accuracy by 0.8%, 1.0%, and 1.5%. For MobileNetV2 (0.75×), embedding SE, CBAM, and RA improves Top-1 accuracy by 0.8%, 1.1%, and 1.2%. Similarly, SE, CBAM, and RA improve Top-1 accuracy on MobileNetV2 (1.0×) by 0.5%, 0.8%, and 0.9%. We conclude that the reverse attention mechanism is also effective for lightweight models. Furthermore, the training times ("Training hours") of these methods are at the same level, because the baseline MobileNetV2 is a lightweight model.

Datasets
We evaluate the performance of the reverse attention mechanism on downstream tasks using the COCO 31 and ADE20K 32 datasets. Following standard training and testing protocols, the trainval35k set is used for training and the minival set (5K images) for testing. Consistent with most detectors 33, performance is evaluated by Average Precision (AP) 31.

Detection results
The performance of SE, CBAM, and RA is compared on two classical detectors, Faster R-CNN and Mask R-CNN, with ResNet50 adopted as the backbone. The object detection results are shown in Table 4. With ResNet50 as the backbone of Faster R-CNN, RA increases box AP by 1.9%, whereas SE and CBAM increase box AP by 1.6% and 1.5% while adding 2.51M and 5.01M parameters, respectively. On the Mask R-CNN detector, SE, CBAM, and RA improve box AP by 1.4%, 1.4%, and 1.7%.

Segmentation results
We further verify the performance of the reverse attention mechanism on segmentation tasks. The results are presented in Table 5. For instance segmentation with Mask R-CNN 35, RA improves mask AP by 1.6%; for semantic segmentation with UperNet 36, RA improves mIoU by 0.7%. Clearly, RA brings significant performance improvements.

Feature aggregation
The purpose of the feature aggregation step is to obtain global features while reducing the cost of computation.
We compare the performance of several common feature aggregation methods: Global Max Pooling (GMP), Global Average Pooling (GAP), the L1-norm, and the L2-norm. When GMP and GAP are adopted directly, training is unstable; however, the model converges stably when their absolute values (GMP†, GAP†) are used. Compared with the baseline, both GMP† and GAP† improve accuracy; for example, GAP† improves Top-1 accuracy by 0.3%. We also conduct experiments with L1-norm and L2-norm aggregation. The results in Table 6 demonstrate that the L2-norm yields the best performance, improving Top-1 and Top-5 accuracy over the baseline by 1.0% and 0.7%, respectively. Therefore, we implement the feature aggregation step of the reverse attention branch with the L2-norm.

Feature normalization
Feature normalization effectively accelerates the convergence of the model. We therefore enumerate the commonly used normalization methods, BN and LN, and explore their effects on reverse attention. The experimental results of the different feature normalization methods are reported in Table 7.
To verify the importance of the feature normalization step, we first conduct experiments without any normalization. Its absence degrades performance, while adding normalization improves it significantly; for instance, BN and LN yield 0.5% and 0.9% improvements in Top-1 accuracy. M-GRN uses the normalization method of Eq. (5); compared with BN and LN, its Top-1 accuracy is higher by 0.5% and 0.1%, respectively.

Activation function
We explore the effect of activation functions on reverse attention performance. The experimental results are shown in Table 8.
The Softmax activation function almost cuts off the identity mapping in residual learning, so it degrades performance: compared with the baseline, using Softmax in the reverse attention branch drops Top-1 and Top-5 accuracy by 0.3% and 0.2%, respectively. The Sigmoid activation function brings a 0.7% improvement in Top-1 accuracy. In M-GRN, the constant term 1 ensures that the RA architecture degenerates to residual learning when γ · N(P(X)) = 0. The experimental results in Table 8 prove its effectiveness: without adding extra cost, it improves Top-1 accuracy by 1.0% over the baseline.
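A quick numerical check (our illustration, not from the paper) makes the violation concrete: at F(x) = 0, Sigmoid outputs 0.5 and Softmax outputs the uniform 1/C, while the M-GRN form returns exactly 1.

```python
import torch

z = torch.zeros(1, 4, 1, 1)               # the F(x) = 0 case
print(torch.sigmoid(z).flatten())         # 0.5 everywhere: G(0) != 1
print(torch.softmax(z, dim=1).flatten())  # uniform 1/C = 0.25: G(0) != 1
print((0.5 * z + 1.0).flatten())          # M-GRN form 1 + gamma*N(0): exactly 1
```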

Temperature annealing
The temperature annealing strategy lets the network learn F(x) first, ensuring that the semantic level of F(x) is higher than that of x. temp implements this strategy: it is initialized to 30 and decreases gradually to 1 over the first 10 epochs. We verify the effect of temp on three models, RA-ResNet18, RA-ResNet50, and RA-MobileNetV2 (1.0×); the results are shown in Table 9. Finally, we visualize RA activations for instances of different classes (Fig. 5). As expected, the reverse attention mechanism adaptively scales features: along the channel dimension, it enhances effective information and blocks the transmission of interference.
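A possible implementation of this schedule (a sketch using the values reported in the paper; the function name and the update loop are our assumptions):

```python
def annealed_temp(epoch: int, start: float = 30.0, end: float = 1.0,
                  warmup_epochs: int = 10) -> float:
    """Linear schedule: temp falls from 30 to 1 over the first 10 epochs
    and stays at 1 afterwards."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

# Hypothetical usage: refresh every M-GRN branch at the start of each epoch.
# for module in model.modules():
#     if isinstance(module, MGRN):
#         module.temp.fill_(annealed_temp(epoch))
```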

Conclusion
In this work, we propose a reverse attention mechanism that utilizes high-level semantics to supervise low-level information. Based on reverse attention, we introduce the RA architecture, a generalized residual learning framework. We implement the RA architecture with the proposed M-GRN and derive RA-Net from it. Compared with residual learning networks, RA-Net significantly improves performance with comparable model size and computational cost, showing that a model's performance can also be improved through the advantages of its architecture. Meanwhile, RA's high-to-low guidance approach can also be applied to building frameworks in other areas.

Figure 1. Comparison of residual learning and the proposed reverse attention (RA) architecture. (a) Residual learning architecture in ResNet 2. (b) Our proposed generalized architecture. The approach of scaling x with the high-semantic G(F(x)) is called reverse attention.

Figure 2. Illustration of the RA block. M-GRN denotes Modified Global Response Normalization.

Figure 5. Activations induced by reverse attention at different depths in RA-ResNet50 on ImageNet-1K. The nomenclature for each set of activations follows the RA_stageID_blockID scheme; for instance, the activation of the third block in the second stage is denoted RA_2_3.

Table 2. All these models are trained only on the ImageNet-1K training set, and their accuracy is reported on the validation set. We mainly compare RA-Net with ResNet, SE-Net, and CBAM, analyzing the results in terms of parameters, FLOPs, and Top-1 and Top-5 accuracy. FLOPs are obtained when the input size is 224 × 224.

Table 3. Comparison of the proposed RA with SE and CBAM on MobileNetV2 backbones. The experimental results are obtained by training for 150 epochs on ImageNet-1K. FLOPs are obtained when the input image is 224 × 224. "Training hours" is evaluated on two RTX 4090 GPUs. Significant values are in bold.

Table 4. Results of object detection on the COCO dataset. ResNet50 is adopted as the backbone. The final model weights pre-trained on ImageNet-1K are used to initialize the detector. FLOPs are obtained when the input image is 1280 × 800. Significant values are in bold.

Table 5. Results of instance and semantic segmentation on the COCO and ADE20K datasets. ResNet50 is adopted as the backbone of Mask R-CNN and UperNet 36. Significant values are in bold.

Table 6. Comparison of different feature aggregation approaches. † means taking the absolute value. ResNet18 is adopted as the baseline. Significant values are in bold.

Table 7. Comparison of different feature normalization approaches. "None" indicates no normalization step. ResNet18 is adopted as the baseline. Significant values are in bold.

Table 8. Comparison of different activation functions. ResNet18 is adopted as the baseline. Significant values are in bold.

Table 9. Effect of the temperature annealing strategy.