SU2GE-Net: a saliency-based approach for non-specific class foreground segmentation

Salient object detection is vital for non-specific class subject segmentation in computer vision applications. However, accurately segmenting foreground subjects with complex backgrounds and intricate boundaries remains a challenge for existing methods. To address these limitations, our study proposes SU2GE-Net, which introduces several novel improvements. We replace the traditional CNN-based backbone with the transformer-based Swin-TransformerV2, known for its effectiveness in capturing long-range dependencies and rich contextual information. To tackle under- and over-attention phenomena, we introduce Gated Channel Transformation (GCT). Furthermore, we adopted an edge-based loss (Edge Loss) for network training to capture spatial structural details. Additionally, we propose Training-only Augmentation Loss (TTA Loss) to enhance spatial stability using augmented data. Our method is evaluated on six common datasets, achieving an impressive $F_{\beta}$ score of 0.883 on DUTS-TE. Compared with other models, SU2GE-Net demonstrates excellent performance in various segmentation scenarios.

Extensive experiments show that the first GRSU-L module severely affects the subsequent detection task. When a large area is missing from the detection result, that part is already missing from the feature map of the first GRSU-L module, so the loss is not caused by downsampling. The third problem should therefore be solved by enhancing the spatial perceptibility and global context awareness of the model, and TTA Loss is proposed for this purpose. By fusing the predictions for spatially augmented data (for example, horizontally flipped images) with the predictions for the original images, and back-propagating the loss calculated on the fused result, the model converges faster and better. Because the Transformer captures global features better than a CNN, the Swin TransformerV2 is added to obtain more global information.
The final modified network, SU2GE-Net, achieves foreground segmentation of non-specific classes of image subjects. In addition, experimental validation is performed on DUTS-TE 9, ECSSD 10, PASCAL-S 11, HKU-IS 12, DUT-OMRON 13, and SOD 14, showing that the proposed model performs well.
The main contributions of the paper are: (1) A saliency-based foreground subject segmentation model, SU2GE-Net, is proposed to effectively separate non-specific class foregrounds from backgrounds. (2) Building on U2-Net, the Swin TransformerV2 15 is used for feature extraction. SU2GE-Net is rebuilt following the architecture of U2-Net, which improves the backbone network's ability to extract image features. The GRSU-L module is reconstructed by integrating GCT to achieve better segmentation of non-specific categories of subjects. (3) TTA Loss and Edge Loss are added to the model training process to improve convergence efficiency and to optimize both edge details and subject recognition.

Related works
Salient object detection. Salient object detection produces results similar to those of the segmentation task; the difference lies in the Ground Truth labels of the two tasks. The purpose of salient object detection is to identify the main part of an image and estimate the probability that each pixel belongs to the main class, whereas the label of the segmentation task marks the category of each pixel. Salient object detection can be divided into two categories according to data type: detection on RGB images, whose input is a common RGB image, and detection on RGB-D [16][17][18] images, whose input includes a depth map in addition to the RGB image. Since depth maps require an additional depth camera to capture, only RGB images are considered here for acquiring the main part of the image.
U2-Net uses an Encoder-Decoder structure for salient object detection. Its encoder is designed to make feature extraction more efficient and richer, so as to better distinguish the main part.
Non-specific class foreground subject segmentation. The task of non-specific class foreground subject segmentation is to find and segment the main part of an input image. This task is also very similar to the matting task 19, which feeds a trimap together with the image to the Encoder-Decoder, predicts the alpha mask of the image, and refines the alpha mask with a small network for more detailed edges.
Sengupta et al. 20 proposed a background matting technique that enables casual capture of high-quality foreground+alpha mattes in natural settings. This approach avoids using a green screen or painstakingly constructing a detailed trimap, as is typically needed for high matting quality.
Attention mechanism. Attention mechanisms in computer vision take various forms, such as channel attention [22][23][24], spatial attention 25,26, self-attention 27,28, and gated attention 10,29. Channel attention and spatial attention set learnable weights at the channel and spatial levels of an image, respectively, and use these weights to distinguish the importance of different channels and locations. The self-attention mechanism forgoes pooled weights and instead maps feature maps into distinct spaces, combining features from three different spaces in a specific manner to achieve the attention effect. Gated attention uses learnable parameters to model the channel relations in the feature map, which correspond to the competition and cooperation of neurons in the neural network, and guides this competition and cooperation through gating parameters, thus remedying insufficient attention.

Swin TransformerV2.
Compared to CNNs, the Transformer 30 extracts global features better. The Swin TransformerV2, modified from the Swin Transformer 31, makes the network model larger and adapts to images of different resolutions and windows of different sizes. Test Time Augmentation 32 is a widely recognized trick for improving predictions and is often used in leaderboard competitions. Specifically, it creates multiple augmented copies of each image in the test set, lets the model make a prediction for each copy, and fuses the corresponding predictions into the final result. Although Test Time Augmentation yields better predictions, it increases inference time, so we propose TTA Loss. The process is described in the section Training-only Augmentation Loss.

Proposed method
Initially, we present an overview of the utilized modules and elaborate on the specifics of the SU2GE-Net network architecture in Fig. 2a. The network supervision strategy and the loss are described at the end of this section. Swin TransformerV2. The Swin TransformerV2 tackles three major issues in the training and application of large vision models. A residual-post-norm method combined with cosine attention is used to improve training stability. A log-spaced continuous position bias method is proposed to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs. The Transformer requires a large dataset and has a large computational cost; therefore, the Swin TransformerV2 uses a self-supervised pre-training method, SimMIM, to reduce the need for vast numbers of labeled images. We tested various Swin TransformerV2 models on DUTS-TE and ended up using the following configuration: input size 256, drop path rate 0.3, embed dim 96, depths [2, 2, 18, 2], num heads [3, 6, 12, 24], window size 16.
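The backbone configuration reported above can be collected in one place for reference. The key names below are illustrative, mirroring common Swin-style configs rather than any specific library's exact argument names:

```python
# Backbone configuration reported in the text, gathered into a dictionary.
# Key names are assumptions modeled on common Swin-style configs.
swinv2_cfg = {
    "input_size": 256,            # side length of the (square) input image
    "drop_path_rate": 0.3,        # stochastic-depth rate
    "embed_dim": 96,              # channel width of stage 1
    "depths": [2, 2, 18, 2],      # number of blocks per stage
    "num_heads": [3, 6, 12, 24],  # attention heads per stage
    "window_size": 16,            # local attention window
}

# Sanity checks: four stages, and the head count doubles at each stage.
assert len(swinv2_cfg["depths"]) == len(swinv2_cfg["num_heads"]) == 4
assert all(b == 2 * a for a, b in
           zip(swinv2_cfg["num_heads"], swinv2_cfg["num_heads"][1:]))
```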

GRSU-L module.
The GRSU-L module is the basic unit of SU2GE-Net, and its internal RSU-L structure is the same U-shaped structure as in U2-Net. The structure of the GRSU-L module is illustrated in Fig. 2b, where GCT denotes the module that implements Gated Channel Transformation. To solve the attention problem, GCT is introduced into the GRSU-L module to obtain the most attention-worthy part of the image before feature extraction. It is placed before each convolutional layer of GRSU-L so that the module can extract more attentive features 33. The regions with higher attention weights are often the main part of the image, and gated attention enables the model to segment the main part better. In the GRSU-L module, L denotes the number of layers in the Encoder-Decoder phase. By controlling the number of layers and the extraction method used at different stages of the network, features from images of different scales are utilized effectively.

Gated channel transformation (GCT).
The utilization of GCT addresses the model attention problem by leveraging the competitive and cooperative dynamics among neurons in neural networks.This mechanism stimulates collaboration among network neurons when attention is insufficient, enhancing focus on the subject of interest.Conversely, it encourages competition among neurons in situations of excessive attention, facilitating the retention of more competitive components.The GCT is illustrated in Fig. 3.
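The gating described above can be sketched in PyTorch. This is a minimal reading of Gated Channel Transformation as published (Yang et al., 2020), not the paper's released code; details such as the epsilon value are assumptions:

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Minimal sketch of Gated Channel Transformation.

    alpha scales a per-channel global embedding, the embedding is
    normalized across channels (competition), and gamma/beta control a
    tanh gate (cooperation). At initialization the gate is exactly 1,
    so the module starts as an identity mapping.
    """
    def __init__(self, channels, eps=1e-5):  # eps value is an assumption
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Global context embedding: per-channel L2 norm over spatial dims.
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True) \
                                  .add(self.eps).sqrt()
        # Channel normalization: channels compete through a shared norm.
        norm = self.gamma * embedding / embedding.pow(2) \
            .mean(dim=1, keepdim=True).add(self.eps).sqrt()
        # Gating: 1 + tanh keeps an identity path when gamma and beta are 0.
        return x * (1.0 + torch.tanh(norm + self.beta))
```

Because gamma and beta start at zero, inserting GCT before a convolution does not perturb a pretrained network at the first step, which is one reason gated attention of this form is easy to retrofit.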

Architecture of SU 2 GE-Net.
In order to enhance our model's ability to capture long-range dependencies and extract comprehensive contextual information, we substitute the conventional CNN-based backbone with the Swin TransformerV2. Additionally, GCT is introduced to autonomously regulate the interplay between competition and cooperation among neurons during training, guiding the model to prioritize its attention toward the main component. The GRSU-L module built with GCT, the basic unit of SU2GE-Net, has an internal U-shaped Encoder-Decoder structure, and the extracted features differ according to the depth of the network. SU2GE-Net is a two-level nested U-structure; combined with the GRSU-L module, it obtains segmentation results at different scales (6 scales are used in this article), which are stitched together and fused again by a 1 × 1 convolution. To address rough object-edge segmentation, edge detection is performed on the segmentation results and the real labels, and the difference between the two edges is measured with a Binary Cross-Entropy loss. The gradient is computed using TTA Loss, which ultimately guides the network to output a more refined segmentation of the subject. This loss function only guides the segmentation results toward finer edges during training; it does not increase the number of parameters or the computational cost of the model. We upsample the output of each Swin TransformerV2 block back to the size of the original images, concatenate it with the original images, and then use a 1 × 1 convolution to reduce it to three channels. Figure 2a shows the SU2GE-Net structure diagram in detail.
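The final fusion step, stitching the six scale outputs and reducing them with a 1 × 1 convolution, can be sketched as follows. Module and variable names are our own, not from the paper's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleFusion(nn.Module):
    """Sketch of the multi-scale fusion described in the text: six
    single-channel side outputs are upsampled to a common resolution,
    concatenated, and reduced by a 1x1 convolution."""
    def __init__(self, num_scales=6):
        super().__init__()
        self.fuse = nn.Conv2d(num_scales, 1, kernel_size=1)

    def forward(self, side_outputs, out_size):
        # Upsample every side output to the target resolution.
        ups = [F.interpolate(s, size=out_size, mode="bilinear",
                             align_corners=False) for s in side_outputs]
        # Concatenate along channels, then fuse to one saliency map.
        return torch.sigmoid(self.fuse(torch.cat(ups, dim=1)))
```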
Edge loss. Following the loss function design of U2-Net, the subject mask generated by each decoding layer of the outer U-shaped structure and the real labels are compared with a Binary Cross-Entropy loss for deeply supervised training. This loss L_global can be expressed as Eq. (1):

L_global = Σ_{i=1}^{N} BCELoss(x_i, y_i)    (1)

where N denotes the total number of layers in the outer U-shaped structure, i indexes the subject mask output at layer i of the decoder, x_i represents the predicted result, and y_i denotes the true label corresponding to the input image.
In the model's computation, the output of each layer has a different size, so the outputs are interpolated to the size of the input image before summation. BCELoss() denotes the Binary Cross-Entropy loss function, calculated as in Eqs. (2) and (3):

l_j = -w_j [ b_j log a_j + (1 - b_j) log(1 - a_j) ]    (2)

BCELoss = mean(l)    (3)

where mean() denotes the average of all l_j, a_j and b_j denote the pixel values of the jth input image and its corresponding label, the BatchSize is S, and w_j denotes the weight of the jth image.
To address the defects in U2-Net's segmentation of object edges, the edge operator Edge() is used to perform edge detection on the prediction mask and the real label, and Edge Loss is designed from the two edge-detection results. Its calculation can be expressed as Eq. (4):

L_edge = BCELoss(Edge(x_0), Edge(y_0))    (4)

where x_0 and y_0 denote the final subject segmentation mask of the model and the true label, respectively, and Edge() denotes edge detection of the input image using edge operators (e.g. Canny, Laplacian, Sobel, Scharr).
The loss function Loss for model training combines L_edge and L_global as shown in Eq. (5):

Loss = w_g · L_global + w_e · L_edge    (5)

where w_g and w_e are hyperparameters that can be set independently. In the early training period, the foreground segmentation model of non-specific class subjects is not yet accurate and the edge operator extracts image edges poorly, so w_g is set larger than w_e; as the segmentation results are gradually refined later in training, w_g is set smaller than w_e. The process is illustrated in Fig. 4.
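The weighted combination of Eq. (5), together with the deep supervision of Eq. (1) and the edge term of Eq. (4), can be sketched as below. The function and argument names are ours; `edge_fn` stands in for whichever edge operator is chosen:

```python
import torch
import torch.nn.functional as F

def combined_loss(side_outputs, final_mask, gt, edge_fn, w_g=0.7, w_e=0.3):
    """Sketch of Loss = w_g * L_global + w_e * L_edge.

    side_outputs: per-decoder-layer masks in [0, 1] (Eq. 1 term).
    final_mask:   the final fused mask x_0 (Eq. 4 term).
    edge_fn:      an edge operator applied to both masks.
    The default weights 0.7/0.3 follow the paper's early-training setting.
    """
    # Eq. (1): deeply supervised BCE over all side outputs.
    l_global = sum(F.binary_cross_entropy(x, gt) for x in side_outputs)
    # Eq. (4): BCE between edge maps of prediction and ground truth.
    l_edge = F.binary_cross_entropy(edge_fn(final_mask), edge_fn(gt))
    # Eq. (5): weighted combination.
    return w_g * l_global + w_e * l_edge
```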
Training-only augmentation loss. In a manner similar to Test Time Augmentation, the input x is augmented again before being fed into the network, the results of the multiple augmentations are predicted and their corresponding losses computed, and the mean of these losses is back-propagated to give the model a more accurate convergence guide. The procedure is depicted in Fig. 5. Loss_tta is defined in Eq. (6):

Loss_tta = Ave( Loss_{x∈X}( net(x), gt ) )    (6)

where net() is SU2GE-Net, gt is the Ground Truth, X is the set of predicted images containing the original images and their augmented versions (the augmentation can be flipping, rotation, shifting, etc.), and Ave() is the averaging function. We believe that the idea of TTA Loss can be used in various tasks and is not limited to saliency object detection. Loss_tta is the value ultimately used in this article to guide SU2GE-Net's back-propagation.
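Eq. (6) can be sketched as below. Horizontal flipping is shown because it is the example augmentation named in the text; the exact augmentation set, and the helper names, are assumptions:

```python
import torch
import torch.nn.functional as F

def tta_loss(net, image, gt, loss_fn=F.binary_cross_entropy):
    """Sketch of Eq. (6): average the loss over the original image and
    its augmented copies. Each augmentation is paired with an `undo`
    that maps the prediction back into the original image frame."""
    variants = [
        (image, lambda p: p),                   # identity (the original)
        (torch.flip(image, dims=[-1]),          # horizontal flip ...
         lambda p: torch.flip(p, dims=[-1])),   # ... undone on the output
    ]
    losses = []
    for aug_img, undo in variants:
        pred = undo(net(aug_img))   # prediction in the original frame
        losses.append(loss_fn(pred, gt))
    return torch.stack(losses).mean()           # Ave() over the set X
```

Because every loss in the average is computed against the same ground truth, a spatially unstable model (one whose prediction changes under a flip) is penalized, which is the stability effect the text attributes to TTA Loss.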

Results and discussion
Dataset. Train set: We use DUTS-TR to train SU2GE-Net. DUTS-TR contains 10,553 images and is generally used for salient object detection. Test set: SOD includes 300 images originally intended for image segmentation. ECSSD contains 1000 images that are semantically meaningful but structurally complex. DUT-OMRON has 5168 images, each with one or two objects. PASCAL-S is made up of 850 images with cluttered backgrounds and intricate foreground objects. HKU-IS has 4447 images, most of which contain multiple connected or disconnected foreground objects. DUTS-TE is a part of DUTS and has 5019 images for testing.

Implementation details.
Training was performed on a Tesla A100 GPU (40 GB). The images were first scaled to 320 × 320, then horizontally flipped for data augmentation, and finally randomly cropped to 288 × 288. The hyperparameters w_g and w_e in the total loss function are set to 0.7 and 0.3 in the first 100 epochs and swapped in the second 100 epochs. We use an AdamW optimizer with OneCycleLR as the schedule; the maximum learning rate was set to 1e−5, betas = (0.9, 0.999), eps = 1e−8, and weight decay = 0.05. The metric results are reported in Table 1, and the metrics were calculated once per 1000 iterations. TTA Loss was used after training for 20 epochs.
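The optimizer and schedule settings above translate directly to PyTorch. The stand-in model and the `total_steps` value are placeholders, not values stated in the text:

```python
import torch

# Stand-in for SU2GE-Net; only the optimizer/schedule wiring is the point.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# AdamW with the hyperparameters reported in the text.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.05,
)

# OneCycleLR schedule; total_steps is an assumption for illustration.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-5, total_steps=1000,
)
```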

Evaluation metrics.
In our evaluation, we employed five widely adopted metrics, namely MAE, MaxF, MeanF, F_β, and S-measure, to assess the performance of the model. MaxF, MeanF, and F_β were computed from precision-recall pairs, using a weight β² of 0.3. MaxF is the maximum value over all thresholds, while MeanF is the average over all thresholds; F_β was determined at the fixed middle threshold of 127. The S-measure combines two components, object-aware (So) and region-aware (Sr), weighted equally with α set to 0.5.
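The fixed-threshold F_β described above can be sketched as follows; the function name and the treatment of empty denominators are our own choices:

```python
import numpy as np

def f_beta(pred, gt, threshold=127, beta2=0.3):
    """Sketch of F_beta at a fixed threshold: binarize a 0-255 prediction
    map at 127, then combine precision and recall with beta^2 = 0.3."""
    binary = pred >= threshold
    gt = gt.astype(bool)
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```

With β² = 0.3 the metric weights precision more heavily than recall, which is the conventional choice in the salient object detection literature.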
Tests with different edge operators. In this subsection, four operators, Sobel, Scharr, Laplacian, and Canny, were used in the Edge Loss calculation of SU2GE-Net, and the model was trained separately with each to determine the most suitable operator. After the input image is downsampled by the model, the resolution decreases and the edges become less clear; even if the image is restored by upsampling, some of the original edge information is lost. Since the real label is not downsampled, its edge information is preserved. The article therefore designs the Edge Loss function by performing edge detection on the output mask and the real label (both binarized images) and using the intact edge information of the real label as the basis for refining the edges of the output mask. Traditional edge detection operators usually perform well on images without complex pixel patterns, especially binarized images; edge detection results for some binarized images are displayed in Fig. 6. Sobel and Scharr are first-order operators, while Laplacian and Canny are improved second-order operators built on top of the first-order ones, and second-order operators usually perform better. DUTS-TE contains many subjects with complex structures and noisy edges; although Canny is generally sensitive to noise, its results on most of these images are consistent and more stable, and Canny also achieved better results in our experiments. Therefore, the Canny operator was used to calculate Edge Loss in the training of SU2GE-Net.
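To make the operator comparison concrete, a first-order Sobel edge detector on a binarized mask can be written out in a few lines (the paper ultimately selects Canny; Sobel is shown here only because it is the simplest of the four to express directly, and the threshold convention is our own):

```python
import numpy as np

def sobel_edges(mask, threshold=0.5):
    """Minimal Sobel edge detector for a binarized mask: convolve with the
    horizontal/vertical Sobel kernels and mark pixels with nonzero gradient."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    m = (mask > threshold).astype(float)
    pad = np.pad(m, 1, mode="edge")       # replicate borders
    gx = np.zeros_like(m)
    gy = np.zeros_like(m)
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    # Edge pixels are those with a nonzero gradient magnitude.
    return (np.hypot(gx, gy) > 0).astype(np.uint8)
```

On a binarized mask the gradient is zero everywhere except at foreground/background transitions, so the result is exactly the mask boundary, which is what Eq. (4) compares between prediction and label.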
Comparison with state-of-the-arts. We compare the proposed algorithm with eight state-of-the-art saliency detection methods: U2-Net, BASNet 34, P2T 35, MSIN 36, SCRN 37, EGNet-R 38, RCSB 39, and DC-Net 40. All saliency maps of these methods are computed with their released code for fair comparison. The superiority of SU2GE-Net can be seen in Table 1.
Qualitative evaluation. Some representative examples are shown in Fig. 7, reflecting a variety of situations. The 1st to 3rd rows reflect the recognition and segmentation of the subject in different situations. In the 1st row, all other models miss the segmentation above the ankle because of the white harness, while SU2GE-Net segments the edges even more smoothly than the ground truth. In the 2nd row, other models fail to distinguish the subject because of the changes in the color of the dog's fur and the overlap between the tree and the dog, resulting in incorrect segmentation; SU2GE-Net completely distinguishes the body of the dog. In the 3rd row, other models segment the subject inaccurately because the rusty nail and the tree stump are similar in color, whereas SU2GE-Net extracts the main body of the iron nail. The 4th row reflects the segmentation of an occluded subject: U2-Net and BASNet incorrectly identify the branch as the subject, while P2T predicts the branch as background but does not recognize the tail well due to the lack of global semantic information; SU2GE-Net removes the branches and retains the bird as a whole. The 5th and 6th rows show the segmentation of single and multiple objects with low contrast between foreground and background. Even when the subject is difficult to identify by eye, SU2GE-Net still segments it effectively, whereas the remaining models segment only a single object, partly erroneously. In summary, SU2GE-Net consistently generates more accurate and complete saliency maps, effectively segments holes and occluded objects, and identifies low-contrast borders and small objects.
Ablation experiment. The effectiveness of the different trick overlays can be seen in Table 1. Figure 8 shows the ablation experiments in three groups. The comparison between the proposed methods and U2-Net demonstrates that our methods converge markedly faster. The Base is U2-Net + Swin TransformerV2, and the other two groups build on it by adding TTA Loss, Edge Loss, and GCT. The model has almost converged by the 45th epoch, while U2-Net needs about 230 epochs.

Conclusions
An edge-based loss function was designed to train for edges, and GCT is used to promote cooperative and competitive relationships between neurons. Feature extraction is performed with the Swin TransformerV2. The non-specific class foreground subject segmentation algorithm SU2GE-Net was proposed based on U2-Net, and TTA Loss was used to make the network converge more efficiently, solving the attention and edge-roughness problems illustrated in Fig. 1. The feasibility of edge-based loss computation was verified by showing the edge detection results of traditional edge operators on true label masks. Four edge detection operators were compared experimentally, and the Canny operator, which gave the best results, was selected for computing the Edge Loss function. Validation on multiple datasets demonstrated the excellent performance of SU2GE-Net, which compares favorably with several SOTA methods. The comparative experiments show that SU2GE-Net converges quickly and applies broadly across multiple image scenes. However, the model does not perform well when segmenting fine details such as hair, indicating a limitation in the size and type of the edge operator used in calculating Edge Loss. We believe that future research should focus on improved approaches to this issue.

Figure 2 .
Figure 2. Network structure chart: (a) the SU2GE-Net structure and (b) the GRSU-L module structure.

Figure 4 .
Figure 4. The pipeline of calculating Edge Loss.

Figure 6 .
Figure 6. Detection results of different edge detection operators.

Figure 7 .
Figure 7. Several visual examples with style-varying objects and their predictions generated by the proposed SU2GE-Net, U2-Net, BASNet, and P2T methods.