Feature fusion network based on strip pooling

Contextual information is a key factor affecting semantic segmentation. Recently, many methods have tried to use the self-attention mechanism to capture more contextual information, but self-attention carries a heavy computational cost. To address this problem, a novel self-attention network, called FFANet, is designed to capture contextual information efficiently; it reduces the amount of computation through strip pooling and linear layers. A feature fusion (FF) module is proposed to calculate the affinity matrix, which captures the relationships between pixels. The affinity matrix is then multiplied with the feature map, selectively increasing the weight of the regions of interest. Extensive experiments on the public datasets (PASCAL VOC 2012, CityScapes) and a remote sensing dataset (DLRSD) achieve mean IoU scores of 74.5%, 70.3%, and 63.9%, respectively. Compared with current typical algorithms, the proposed method achieves excellent performance.

Semantic segmentation, a fundamental and challenging problem in computer vision, aims to parse the category of each pixel in an image. It has been extensively researched in a variety of fields, such as autonomous driving, remote sensing, medical diagnosis, and so on.
With the emergence of Fully Convolutional Networks (FCNs)1, many methods have made remarkable progress in semantic segmentation. However, due to the limitations of its structure, the traditional FCN only captures local information and lacks sufficient contextual information, which can easily lead to incorrect segmentation results.
In recent years, many novel networks [2][3][4][5][6][7] have sought new ways to address these issues of the FCN. UperNet 8 uses a feature pyramid network (FPN) to capture multi-scale features and parse different scenes. DenseASPP 9 combines dense connections with ASPP, which is composed of dilated convolutions with different rates, to generate different receptive fields. Affinity Loss 10 was proposed by Yu et al. to distinguish the relationships between different pixels. HRNet 11 maintains high-resolution representations by connecting high-to-low-resolution convolutions in parallel. LedNet 12 uses an attention pyramid network (APN) to capture contextual information, and uses convolutional decomposition and channel splitting to reduce network complexity. HANet 13 introduces a height-driven attention module to improve segmentation of urban scenes. SPNet 14 proposes strip pooling to model the long-range dependencies of the network.
In order to complete the semantic segmentation task more quickly and accurately, a novel semantic segmentation network is designed, which can efficiently aggregate contextual information. Specifically, it consists of a series of convolution branches and two FF modules. The FF module uses strip pooling and two linear layers to generate the affinity matrix, which captures the correlation between features: each spatial position of the affinity matrix collects information from the whole local feature map. The main contributions of this study can be summarized as follows:
1. We design a new network with a self-attention mechanism to solve the long-range dependency problem in semantic segmentation tasks.
2. An FF module is proposed to reduce the computational cost of the affinity matrix. It efficiently captures contextual information by converting matrix multiplication into vector multiplication.
3. Experiments show that the proposed method achieves better performance on three mainstream benchmarks, including PASCAL VOC 2012, CityScapes, and DLRSD.
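The cost reduction claimed in contribution 2 can be illustrated with a toy shape comparison. This is an informal sketch with arbitrary sizes, not the paper's implementation: it only shows that relating a pooled width-vector to a pooled height-vector yields an H × W affinity, whereas standard self-attention needs an (HW) × (HW) one.

```python
import torch

# Toy feature map sizes (hypothetical, for illustration only).
C, H, W = 8, 16, 16
x = torch.randn(C, H, W)

# Standard self-attention: the affinity matrix relates every pixel
# to every other pixel, so it has (H*W) x (H*W) entries.
flat = x.reshape(C, H * W)
full_affinity = flat.t() @ flat          # (H*W, H*W)

# Strip-pooling approach: compress along one spatial axis first,
# then relate a width-vector to a height-vector. The resulting
# affinity has only H x W entries.
q = x.mean(dim=1)                        # column average pooling -> (C, W)
k = x.mean(dim=2)                        # row average pooling    -> (C, H)
strip_affinity = k.t() @ q               # (H, W)

print(full_affinity.numel(), strip_affinity.numel())  # 65536 vs 256
```

For a 16 × 16 map the entry count drops from 65,536 to 256, which is the essence of converting matrix multiplication into vector multiplication.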
The remainder of this paper is organized as follows. "Related work" reviews representative work on semantic segmentation. The proposed method is introduced in "Methods". In "Experiments", a large number of ablation and comparative experiments are conducted to verify the effectiveness of the proposed method. "Conclusion" summarizes this paper.

Network architecture. The overall architecture of the network is shown in Fig. 1. Each convolutional layer in the figure denotes a convolution followed by BN and ReLU. The CNN backbone is ResNet50 with dilated convolution: to retain more detailed information, dilated convolution is used in the last two blocks of ResNet50, so the height and width of the output are 1/8 of the input I. The extracted feature is processed by a 3 × 3 convolutional layer to obtain I' (the number of channels is reduced from 2048 to 512). Then a network with Q, V, and X branches is designed. The Q branch contains two serial FF modules: the first FF module generates the feature map F by extracting information in the horizontal and vertical directions, and the second FF module generates the affinity matrix F', which encodes the weighted summation over all pixels. The V and X branches directly reduce the channel dimension through a 3 × 3 convolution with BN and ReLU (the number of channels is reduced from 512 to 128). The result of the V branch is multiplied with F' to generate the attention matrix M, and the result of the X branch is added to M to enhance the feature representation. Finally, the fused feature map is sent to a convolutional layer to generate the final prediction.

Feature fusion module. As shown in Fig. 2, given a feature map F (C × H × W), the module divides into two branches, q and k. The q branch performs a 1 × 1 convolution to reduce the dimension to C' × H × W (C' is half of C) and a column average pooling to compress the height dimension, yielding Y (C' × 1 × W). Y is then reshaped to C' × W (removing the height dimension), and the feature vector q' is obtained by applying two linear layers. The role of the linear layers is to transform the strip pooling result and reduce the feature loss caused by strip pooling. It is worth noting that the sizes of the linear layers follow C' → C'/4 → C', and they all use linear activation functions. This process can be described as:

Y = (1/H) Σ_{i=1}^{H} t_i,  (2)

q' = g(g(Y, W_1), W_2),  (3)

Equation (2) describes the column average pooling, where H is the height of the feature map, t_i is the ith element of each column, and Y is the pooling result. Equation (3) describes the fully connected layers, where g denotes a linear layer and W is its learnable weight matrix. Thus q' is generated by compressing the input feature map and then performing a spatial transformation.
The k branch is similar to the q branch: k' is obtained after row average pooling and two linear layers. After reshaping q' and k', matrix multiplication is performed to produce the output E, which is then used to generate the output O. Note that O is equal to E in the first FF module, while in the second FF module, O is obtained by passing E through the Softmax function. As shown in Fig. 2, the way the FF module collects information is marked in red: each position of O combines information from one row and one column of the feature map F. A single FF module cannot collect enough global information, so O is fed to a second FF module to capture global information, and the affinity matrix between pixels is computed through the Softmax function. Since a linear layer can only produce a fixed output size, a 1 × 1 convolution is used instead of the linear layer to handle inputs of any size; experiments confirm that the 1 × 1 convolution and the linear layer are equivalent here.
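A minimal PyTorch sketch of a single FF module as described above may help make the shapes concrete. This is our own reading of the text, not the authors' code: the linear layers are realized as 1 × 1 convolutions (following the note about arbitrary input sizes), the Softmax over all positions is an assumption, and the channel handling between the two stacked modules is not fully specified in the text, so only one module is shown.

```python
import torch
import torch.nn as nn


class FFModule(nn.Module):
    """Sketch of one FF module: strip pooling plus two 'linear' layers
    (1x1 convolutions) per branch, then a vector product giving an
    H x W affinity map."""

    def __init__(self, channels, apply_softmax=False):
        super().__init__()
        mid = channels // 2                     # C' = C / 2
        self.q_conv = nn.Conv2d(channels, mid, 1)
        self.k_conv = nn.Conv2d(channels, mid, 1)
        # Two linear layers per branch: C' -> C'/4 -> C' (linear activations).
        self.q_fc = nn.Sequential(nn.Conv1d(mid, mid // 4, 1),
                                  nn.Conv1d(mid // 4, mid, 1))
        self.k_fc = nn.Sequential(nn.Conv1d(mid, mid // 4, 1),
                                  nn.Conv1d(mid // 4, mid, 1))
        self.apply_softmax = apply_softmax      # True only in the second module

    def forward(self, f):                       # f: (B, C, H, W)
        b, _, h, w = f.shape
        q = self.q_conv(f).mean(dim=2)          # column avg pooling -> (B, C', W)
        k = self.k_conv(f).mean(dim=3)          # row avg pooling    -> (B, C', H)
        q = self.q_fc(q)                        # (B, C', W)
        k = self.k_fc(k)                        # (B, C', H)
        e = torch.bmm(k.transpose(1, 2), q)     # E: (B, H, W)
        if self.apply_softmax:                  # affinity between pixels
            e = torch.softmax(e.view(b, -1), dim=1).view(b, h, w)
        return e


torch.manual_seed(0)
ff = FFModule(64, apply_softmax=True)
a = ff(torch.randn(2, 64, 32, 32))
print(a.shape)  # torch.Size([2, 32, 32])
```

Each entry of the output combines one pooled row and one pooled column, matching the red markings in Fig. 2.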

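The Q/V/X fusion described under "Network architecture" can likewise be sketched. The `affinity_fn` argument below is a hypothetical placeholder standing in for the two stacked FF modules; its default is a dummy softmax over the channel-averaged map, not the paper's module, and the class count of 21 assumes PASCAL VOC 2012.

```python
import torch
import torch.nn as nn


def conv_bn_relu(cin, cout):
    # A "convolutional layer" in Fig. 1: convolution + BN + ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))


class FusionHead(nn.Module):
    """Sketch of the Q/V/X fusion: V is weighted by the affinity F',
    X is added residually, and a classifier produces the prediction."""

    def __init__(self, channels=512, reduced=128, num_classes=21,
                 affinity_fn=None):
        super().__init__()
        self.v = conv_bn_relu(channels, reduced)   # V branch: 512 -> 128
        self.x = conv_bn_relu(channels, reduced)   # X branch: 512 -> 128
        # Placeholder for the stacked FF modules of the Q branch.
        self.affinity_fn = affinity_fn or (lambda f: torch.softmax(
            f.mean(dim=1).flatten(1), dim=1).view(f.size(0), f.size(2), f.size(3)))
        self.classifier = nn.Conv2d(reduced, num_classes, 1)

    def forward(self, f):                          # f = I': (B, 512, H, W)
        a = self.affinity_fn(f)                    # affinity F': (B, H, W)
        m = self.v(f) * a.unsqueeze(1)             # attention matrix M
        fused = self.x(f) + m                      # enhance feature representation
        return self.classifier(fused)


head = FusionHead()
out = head(torch.randn(1, 512, 16, 16))
print(out.shape)  # torch.Size([1, 21, 16, 16])
```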
Experiments
We first introduce the PASCAL VOC 2012, CityScapes, and DLRSD datasets, then describe the experimental environment and implementation details, and finally compare and verify the proposed method on the different datasets.
Datasets. PASCAL VOC 2012 is a segmentation dataset with 21 categories, including airplanes, bicycles, boats, etc. It provides 10,582 images for training and 1449 images for validation.
CityScapes is an urban segmentation dataset. It collects road-scene images from 50 cities, each with a size of 2048 × 1024 pixels, and contains 19 common road-scene categories with a total of 5000 high-quality pixel-level labels. The training set contains 2979 images, the validation set 500 images, and the test set 1525 images.
DLRSD is a dense labeling dataset built for remote sensing image segmentation tasks. It contains 2100 images of 256 × 256 pixels, covering 17 common remote sensing scene categories. We split each category into training and validation sets at a ratio of 0.8:0.2.
Experimental settings. The implementation of our network is based on the PyTorch framework (version 1.1.0) with CUDA 10.0. All experiments are run on a single Nvidia GTX 1080 Ti. Following previous methods, the 'Poly' strategy is used to update the learning rate. Segmentation accuracy is measured by the mean intersection over union:

mIoU = (1/k) Σ_{i=1}^{k} TP_i / (TP_i + FP_i + FN_i),

where TP represents true positives, TN true negatives, FP false positives, FN false negatives, and k the number of categories.
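The schedule and metric above can be sketched in a few lines. The power of 0.9 in the 'Poly' schedule is the value commonly used in the segmentation literature and is an assumption here, as the text does not state it.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # 'Poly' learning-rate schedule: decays from base_lr to 0.
    return base_lr * (1 - cur_iter / max_iter) ** power


def mean_iou(pred, target, k):
    """mIoU from per-class TP / FP / FN counts over flat label lists."""
    ious = []
    for c in range(k):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn:                 # skip classes absent from both
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


# Tiny worked example: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2))  # 7/12 ≈ 0.5833
```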

Results analysis
Ablation study. We use the same hyperparameters for all experiments. As shown in Table 1, ablation experiments are performed on PASCAL VOC 2012.
In Table 1, the second row is the result of one FF module and the fifth row is the result of two FF modules. The FF module significantly improves segmentation accuracy: compared with the baseline FCN8s (using ResNet50 as the backbone network), a single FF module brings an 8.4% improvement in mIoU, and stacking two FF modules increases mIoU from 72.8% to 74.5%, helping the network better aggregate contextual information. We also add the FF module to different backbone networks to verify its effectiveness. As with ResNet50, we replace the last convolutional layers of each backbone with dilated convolutions and fine-tune. Combined with the lightweight backbones MobileNet v2 and EfficientNet b0, the FF module achieves 67.7% and 70.4% mIoU, respectively. Notably, ResNet101, whose feature extraction ability is stronger, brings the highest mIoU of 75.8%.
In Fig. 3, we visualize feature maps (from ResNet50) at different positions. The images in the 4th and 5th columns are the outputs of the 13th and 15th channels, respectively, showing that the proposed method obtains better features. The 6th and 7th columns are the outputs of the first and second FF modules, respectively. After the second FF module, the relationships between pixels have been calculated, and important information is given higher weight (such as the bright spots in Fig. 3). The attention map (from the attention matrix), generated after aggregating context information, is shown in the final image; it makes the network pay more attention to the regions of interest.
Results on PASCAL VOC 2012. The comparison with other methods is shown in Table 2.
The proposed method clearly outperforms the other methods. Compared with other attention methods, such as DANet and CCNet, it achieves a higher mIoU (74.5%). In terms of model complexity, the proposed model is only 279 MB, about 1/3 smaller than recent mainstream models such as SPNet and DRANet. The per-category segmentation results on PASCAL VOC 2012 (val) are shown in Table 3. For categories with few instances and small areas, such as "bicycle" and "bottle", the proposed model exploits rich contextual information, which makes the segmentation more delicate and yields better results.

Results on CityScapes.
We conduct experiments on CityScapes; the results are shown in Table 4. The proposed method achieves 70.3% mIoU, surpassing the previous mainstream methods. Compared with DANet and DRANet, which also use the self-attention mechanism, it improves mIoU by 2.9% and 1.1%, respectively.
As shown in Fig. 4, we visualize the results of recent mainstream methods on CityScapes. The proposed network obtains a global perspective and accurately segments the image based on contextual information. For example, in the red boxes around the "road" or "building" regions, the proposed method correctly judges the targets around the "road" according to the context, making the segmentation more accurate.

Results on DLRSD.
The DLRSD images are taken from the air; the backgrounds of the objects are complex and their scales change drastically, which makes segmentation very difficult. Table 5 shows the validation results on DLRSD, where FLOPs are measured with an input size of 3 × 248 × 248 and 17 output classes. Compared with DANet, which also uses the self-attention mechanism, the proposed method has 20% fewer parameters. Computational complexity is measured in FLOPs: the proposed network requires far fewer FLOPs than the dual-attention network DANet. Compared with the lightweight network LedNet, the proposed method has higher computational complexity, but the extra computation brings higher segmentation accuracy. The proposed method automatically aggregates contextual information and achieves 63.9% mIoU. Figure 5 shows the corresponding visualization results. In the red boxes, for large-scale targets such as "aircraft", the proposed method produces more complete segmentation; for small-scale targets such as "cars", it perceives their existence from a global perspective, with fewer missed detections than other methods.
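As a side note on how parameter-size figures like those compared above can be estimated, a rough helper is sketched below. This is our own sketch assuming 32-bit (4-byte) weights, not the paper's measurement protocol, and the 512 → 128 convolution is an arbitrary illustrative layer.

```python
import torch.nn as nn


def model_size_mb(model):
    """Approximate parameter size in MB, assuming 4-byte floats."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / (1024 ** 2)


# Example: a single 3x3 convolution, 512 -> 128 channels, with bias.
conv = nn.Conv2d(512, 128, 3)
print(round(model_size_mb(conv), 2))  # 2.25
```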

Conclusion
We propose an efficient self-attention segmentation network, FFANet. An FF module that efficiently captures contextual information is designed: it uses strip pooling to reduce the complexity of the affinity matrix, and its linear layers perform a spatial transformation to compensate for the information loss caused by strip pooling. Experiments show that the proposed method effectively models long-range dependencies and makes the segmentation results more accurate, achieving 74.5% mIoU on PASCAL VOC 2012, 70.3% on CityScapes, and 63.9% on DLRSD. Although the linear layers reduce the information loss caused by the pooling operation, some information is still lost. In future research, we will therefore explore other feature compression methods to capture global information more effectively (Suppl. Information).