Full stage networks with auxiliary focal loss and multi-attention module for submarine garbage object detection

Submarine garbage is constantly destroying the marine ecological environment and polluting the ocean. It is critical to use detection methods to quickly locate and identify submarine garbage. The background of submarine garbage images is much more complex than that of natural scene images, and object deformation and missing contours place higher demands on the detection network. To solve the problem of low accuracy under complex backgrounds, full stage auxiliary (FSA) networks with auxiliary focal loss and a multi-attention module, based on YOLO, are proposed for submarine garbage object detection. A hierarchical feature fusion mechanism and a split-and-merge strategy are used to maximize the difference in gradient combinations and obtain full-stage features. Then the criss-cross attention module is used to precisely extract multi-scale features of regions with dense small objects while removing noise information from complex backgrounds. Finally, the auxiliary focal loss function addresses the imbalance between positive and negative samples, focusing on the learning of difficult samples while improving overall detection precision. In comparative and ablation experiments, the FSA networks achieve state-of-the-art performance and are applicable to real-time object detection of submarine garbage in complex backgrounds.


Hui Zheng 1, Xinwei Guo 1,2*, Guihai Guo 1, Yizhi Cao 1, Xinglei Hu 3 & Pujie Yue 4
Submarine garbage is becoming an increasingly serious issue in the marine ecological environment. Due to poor management, a large amount of garbage from man-made products enters the marine environment, causing serious pollution of the ocean. Wood, fishing nets, glass, metals, plastics, and other durable and corrosion-resistant materials may be found in this garbage, and once in the ocean, they become persistent pollutants. As a result, controlling and managing submarine garbage pollution is critical [1][2][3][4]. At the technological level, detecting submarine garbage, that is, rapidly locating and identifying it, obtaining basic information on the distribution and quantity of submarine garbage pollution, and formulating control policies, is a critical link in promoting submarine garbage pollution cleanup and recycling.
The majority of studies use remote sensing technology to detect and classify floating marine plastic waste; few scholars have studied submarine garbage, and the many types of garbage greatly increase the difficulty of object detection. Xu 5 employed YOLOv3 to detect fish in underwater environments for waterpower applications and achieved a mean average precision (mAP) of 54.92%. Asyraf 6 studied the efficiency of the YOLOv3 detector in detecting underwater life on two open-source datasets; the results indicate that YOLOv3 can detect underwater objects with high accuracy, with mAP scores ranging from 74.88 to 97.56%. Rosli 7 used YOLOv4 to detect underwater animals, and the training results showed a mAP of 97.86%. Chen 8 used YOLOv4 to detect 4757 images with 4 categories on the URPC dataset, and the results showed a mAP of 73.48%. Zhang 9 trained and tested on the URPC dataset with a mAP of 81.01%. Gašparović 10 improved YOLOv4 and achieved better detection results in underwater pipeline object detection with 94.21% mAP.
Object detection technology is of great significance as the basis of more complex and higher-level visual tasks such as pattern recognition, object tracking, event detection, and activity recognition. Currently, deep

Results
Experimental environment and parameters. The software environment and hardware parameters used in this paper are shown in Table 1. The hyperparameters used to train the FSA networks are shown in Table 2.
This paper mainly adopts mAP50:95 (abbreviated as mAP) as the evaluation index of model performance.
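As a reminder of what mAP50:95 measures, the following minimal sketch averages the AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95. It assumes the usual COCO-style definition; the function names and the `(x1, y1, x2, y2)` box format are illustrative choices, not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * k, 2) for k in range(10)]


def map_50_95(ap_per_threshold):
    """Average a dict {IoU threshold: class-averaged AP} over all thresholds."""
    thresholds = iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
```

A prediction counts as a true positive at threshold t only if its IoU with a ground-truth box is at least t, so the metric rewards tight localization more than mAP50 alone.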
Baseline experiments. To verify the effectiveness of the proposed FSA networks model, ablation experiments are conducted to evaluate the effect of different modules on the performance of the object detection algorithm under the same experimental conditions. Before determining the baseline model, comparison experiments are conducted between the YOLOv5 and YOLOv7 series models.
Table 3 shows that the layers, parameters, GFLOPS, and mAP of the YOLOv5 series all increase with model size, and the mAP reaches its maximum at YOLOv5x. The layers, parameters, and GFLOPS of the YOLOv7 series also increase with model size, and the mAP reaches its maximum at YOLOv7-w6. Therefore, YOLOv7 was selected as the baseline model for the ablation experiments.
Ablation experiments on FSA networks. The various YOLO-family improvements currently perform well on the COCO dataset, but their extension to bespoke datasets has not yet been thoroughly demonstrated. In this paper, an FSA network is designed using an FSS module with heavy feature reuse in the backbone and an efficient group convolutional SPPCSPC module in the neck. Additionally, the criss-cross attention mechanism is connected to the FSS module in the head and combined with the features from the backbone, and the object detection task is completed using a lead head and an auxiliary head. In this section, ablation experiments are conducted to verify the effectiveness of the FSA network and to compare it with the current leading detectors. The results show that the FSA network proposed in this paper achieves state-of-the-art performance on the submarine garbage dataset.
As can be seen from Table 4, model A uses a combination of HS and FS to extract features using standard convolution, with 464 layers, 121.22 M parameters, and an mAP of 52.5%. This demonstrates the ability of the HS and FS modules to extract features while improving accuracy and reducing the number of parameters compared to the YOLOv7 series models. The backbone and head of model B both use FS modules, which leads to a slight increase in the number of parameters and GFLOPS and a 0.2% increase in mAP compared to HS modules. This is mainly because the FS module retains the detailed features of each layer in the module. To further reduce the complexity of the model and the number of parameters, model C uses depthwise separable convolution in the FS module, so that each convolution kernel operates on only one channel and does not change the number of channels, but some channel information is lost. Therefore, in the experiment, the kernel size is increased to expand the feature extraction. The results show that the high-density feature map extracted by the FS module after depthwise separable convolution has the same accuracy as model A.
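The parameter saving behind model C can be made concrete with a small calculation. The sketch below compares the weight count of a standard convolution with that of a depthwise separable one (a per-channel k × k depthwise filter followed by a 1 × 1 pointwise convolution). Biases are ignored, and the channel count of 256 is a hypothetical value chosen for illustration, not a figure from Table 4.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer with 256 input and 256 output channels.
print(standard_conv_params(256, 256, 3))        # 589824
print(depthwise_separable_params(256, 256, 3))  # 67840
print(depthwise_separable_params(256, 256, 5))  # 71936
```

Under these assumptions, enlarging the kernel from 3 × 3 to 5 × 5 adds only a few thousand weights to the depthwise separable layer, which is why a larger kernel is a cheap way to widen the feature extraction.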
The FSS module in model D is the final structure adopted in this paper. Compared with the FS module, the FSS module adds shortcut connections, which makes it similar to the residual structure of ResNet. Moreover, the feature maps of layers P3-P6 in the backbone are passed to layers P3-P6 in the head, which makes up for the information lost in depthwise separable convolution, and the mAP improves more noticeably (+1.6%). Model E and model F were trained by adding the criss-cross attention mechanism and the auxiliary head in turn, and the mAP improved by 2.5% and 3.0%, respectively, compared with model A. Therefore, each component of the FSA network proposed in this paper contributes to a clear gain in detection accuracy.
Figure 1 displays the heat map and detection accuracy at the head, SPPCSPC, and backbone outputs of the FSA networks. The benefit of the attention mechanism grows as the network's depth increases, and accuracy rises accordingly.

Comparisons with state-of-the-art methods.
To validate the effectiveness of the FSA networks, this paper compares them with several state-of-the-art methods on the submarine garbage dataset. The models involved in the comparison are two-stage detectors and one-stage detectors (including anchor-free detectors such as TOOD and YOLOX). The compared algorithms are trained with PaddleDetection and MMDetection, with epochs set to 100 for two-stage detectors and 300 for one-stage detectors, and the rest of the parameters unchanged. The results show that the FSA model achieves 55.5% mAP, which is more accurate than many state-of-the-art methods; the comparison results are shown in Table 5.
As can be seen from Table 5, the mAP values of misc, can, tire, plastic, rod, and metal are below 0.5. On the one hand, the dataset contains few training samples for these classes (170, 89, 488, 136, 17, and 30, respectively); on the other hand, the objects are deformed by water flow, image resolution, light refraction, etc., so the object features learned by the model are not completely consistent with the inherent attributes of the objects. Metal has one of the lowest detection accuracies. Besides the small number of samples, tin, coin, iron cage, rusty anchor, and tin can were all labeled as metal when the dataset was annotated; even though the FSA networks model incorporates an attention module, it can only extract the abstract features of the object, and the mAP of metal is only 23.1%. Because rod has very few samples, the focal loss was used to counter the lower accuracy caused by sample imbalance, yet the final mAP was only 30.2%. During the acquisition of the dataset, some of the plastic was in the marine soil, some was floating on the ocean surface, and some overlapped with other object samples, making the features learned by the model incomplete and resulting in a final mAP of only 30.7%.
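To illustrate how the focal loss mentioned above counters sample imbalance, here is a minimal sketch of the standard binary focal loss of Lin et al. for a single prediction. The defaults `alpha=0.25` and `gamma=2.0` are the values commonly cited from that work and are an assumption here; the paper's own settings are not stated in this section.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class, in (0, 1).
    y: ground-truth label, 1 (positive) or 0 (negative).
    alpha, gamma: assumed defaults, not values taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weight factor
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # well-classified positive, strongly down-weighted
hard = focal_loss(0.1, 1)  # misclassified positive, keeps most of its loss
print(easy < hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to the alpha-weighted cross-entropy, which is a convenient sanity check.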
In summary, the FSS module and group SPPCSPC module introduced in the FSA networks can extract shallow and deep features and reconstruct the image while reducing the number of parameters; CCA focuses more attention on the dense object feature region, while introducing the residual operation to improve the fusion of shallow and deep feature maps; the joint use of the auxiliary head and the lead head at the output end allows the lead head to focus on learning the remaining features that have not yet been learned, effectively improving the feature extraction ability for objects in complex environments and making the model more advantageous in complex submarine garbage object detection tasks.
The detection results are shown in Fig. 2. As can be seen from the figures, the object detection accuracy is generally high, and objects are detected well even when they are occluded or located at the junction of water bodies. Figure 3 shows the detection effect under significant light variation. Some scenes are dim and the object color is similar to the background color, yet the results illustrate that the model is only slightly affected by light variation. Figure 4 presents images with blur and distortion caused by underwater shooting; the detection results reveal that the model can detect the objects in blurred scenes, which indicates good robustness. Figure 5 shows the detection of dense small objects, where the pbag, pbottle, and tire are extremely small but are all detected precisely, which illustrates the model's detection ability for small objects as well. The detection effect in the presence of occlusion is shown in Fig. 6. The results demonstrate that the FSA network model detects objects correctly even when they are occluded by other objects or incomplete.
The left columns of Figs. 7 and 8 show the FSA detection results, and the right columns show the YOLOv7 detection results. As can be seen from Fig. 7, the FSA detection results have more accurate bounding boxes, while YOLOv7 produces bounding boxes that are too large or too small. This suggests that the prediction box obtained with the CIoU used in this paper is more consistent with the real position of the object.
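For readers unfamiliar with CIoU, the sketch below computes it for two axis-aligned boxes using the commonly used definition: IoU minus a normalized center-distance term and an aspect-ratio consistency term. The `(x1, y1, x2, y2)` box format and the function name are illustrative assumptions, and boxes are assumed to have positive width and height.

```python
import math


def ciou(box, gt):
    """Complete IoU between a predicted box and a ground-truth box."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Plain IoU.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)

    # Squared distance between the two box centers.
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) ** 2
            + (box[1] + box[3] - gt[1] - gt[3]) ** 2) / 4.0

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    denom = 1.0 - iou + v
    alpha = v / denom if denom > 0 else 0.0

    return iou - rho2 / c2 - alpha * v
```

The corresponding regression loss is `1 - ciou(box, gt)`; unlike plain IoU, it still provides a useful signal when the boxes do not overlap, because the center-distance term remains non-zero.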
Figure 8 demonstrates that the detection results of the FSA networks model are more accurate. This indicates that the FSA networks model, through the attention module, enlarges the receptive field of the feature map, strengthens the feature extraction ability of the network for small objects, and preserves more feature information of the object area. The combination of auxiliary head and lead head weakens the interference of background noise, fuses shallow and deep features, improves the global feature extraction ability, and performs better on object detection in complex backgrounds, which not only reduces missed and false detections but also makes the model less sensitive to environmental and illumination changes. Overall, the FSA networks model obtains more accurate object positions, is more robust to illumination changes, clearly improves detection in complex backgrounds, and reaches an inference speed of 72.15 FPS.

Discussion
In this paper, we propose a one-stage detector based on YOLO: full stage auxiliary networks with auxiliary focal loss and a multi-attention module. It aims to improve the performance of dense small object detection in complex backgrounds for real-time submarine garbage object detection tasks. To avoid overfitting and improve the generalization ability of the model, data augmentation is performed on the submarine garbage dataset using left-right flipping, Mosaic, Mixup, and other strategies. Then, channel streaming and a cross-stage connection strategy are used to obtain all features of each stage as well as hierarchical cross features. The criss-cross attention module added afterwards better extracts the deep abstract features of the full stage by calculating the distance between intra-class and inter-class features, which makes the obtained features focus more on dense small objects. In the regression analysis stage, the auxiliary focal loss function is used to calculate the object
Although the FSA networks proposed in this paper perform well in terms of speed and accuracy, the number of parameters and the computational effort are greatly increased by introducing the attention module at the end of each FSS module, and the effect of model width is not discussed. Future work can therefore investigate how to reduce the computational overhead brought by the attention module while increasing the model width.

Methods
Overall architecture. The detector proposed in this paper, called full stage auxiliary networks (FSA Networks), is based on the YOLO detection framework. As shown in Fig. 9, the images undergo data augmentation before being sent to the backbone. $P_i$ ($i \in [1, 6]$) indicates that the feature map output by the current layer is $1/2^i$ of the original image size, and after the group SPPCSPC convolution operation, the feature map fed to the head is 1/64 of the original image size. FSS-i indicates the full stage shortcut convolution operation for the current layer. In the backbone and head, FSS is used to extract shallow features and deep features, respectively, and the attention mechanism is added after the FSS in the head to obtain image context information along the vertical and horizontal paths of each pixel, so that the model can focus more on capturing feature information of dense small target regions and reduce the noise interference of complex backgrounds. The outputs of P3, P4, P5, and P6 of the FSS in the backbone are used as auxiliary heads in the regression analysis of classification, location, and confidence, and their loss is calculated together with the lead head output by each FSS-CCA layer.
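The relation between the stage index and the feature map size can be checked with a few lines of arithmetic. The sketch below assumes a square input whose side is divisible by 64; the input size of 640 is a hypothetical value, not a setting taken from Table 2.

```python
def stage_sizes(input_size, num_stages=6):
    """Side length of the P1..P6 feature maps for a square input image."""
    return {f"P{i}": input_size // (2 ** i) for i in range(1, num_stages + 1)}


sizes = stage_sizes(640)  # 640 is a hypothetical input size
print(sizes)
# {'P1': 320, 'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20, 'P6': 10}
```

Under this assumption, P6 is 1/64 of the input side length, matching the size of the feature map that enters the head.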
Full stage convolution. Different from the backbone of CSPNet and YOLOv7, the HS (Fig. 10) and FS (Fig. 11) modules preserve DenseNet's advantage of reusing features while preventing excessive repetitive gradient information transfer and learning by truncating the gradient flow, mainly through a hierarchical feature fusion strategy. First, the upper feature map is split into two parts: one part goes through the stage and transition layers, and the other part is concatenated with the transmitted feature map and passed to the next stage. The module extends the number of channels and the cardinality of the computational module by group convolution, uses channel streaming and a cross-stage connection strategy, retains all features of the upper layer, and fuses the channel features of each stage. The output has twice as many channels as the input, so the module acquires more features in depth and width at the same time, better preserves the actual feature structure of the object, and makes the model more robust with stronger generalization ability.
Figure 9 shows that, in order to obtain feature maps at various scales, each FS module in the head must be upsampled. However, this adds irregular pixels, which causes the image to lose some of its finer details. To better reuse the features in the backbone and compensate for the missing data introduced by upsampling, the FS-i feature map in the backbone is transmitted to the FS-i layer corresponding to $P_i$ in the head. A shortcut connects these residual features with the features at the end of the FS module, thereby using the full features both before and after upsampling. Therefore, the FSS (Full Stage Shortcut) module (Fig. 12) is used throughout the backbone and head in this paper.

Attention module.
In recent years, attention models have been widely used in image processing 46, speech recognition 47, natural language processing 48, and other fields 49. In essence, an attention module is a set of weight coefficients learned autonomously by the network, which emphasizes the areas of interest while suppressing irrelevant background areas through "dynamic weighting".
Therefore, in order to reduce the GPU occupancy, use a larger batch size, and improve the detection accuracy, this paper uses the CCA 50 (Criss-Cross Attention) module to upgrade the model. Given a local feature map $F \in \mathbb{R}^{C \times W \times H}$, two feature maps $G$ and $K$ are generated by two 1 × 1 convolution layers, respectively, where $G, K \in \mathbb{R}^{C' \times W \times H}$ and $C'$ is the number of channels, which is less than $C$ for dimension reduction. After obtaining feature maps $G$ and $K$, the three-dimensional feature map with the shape $C' \times H \times W$ can be easily reshaped into a two-dimensional $C' \times (H \times W)$ matrix. The attention map $A \in \mathbb{R}^{(H+W-1) \times W \times H}$ is generated by the Affinity operation. For each position $u$ in the feature map $G$, a vector $G_u \in \mathbb{R}^{C'}$ with dimension $C'$ can be obtained. At the same time, the set $\Omega_u \in \mathbb{R}^{(W+H-1) \times C'}$ can be obtained from the feature vectors of $K$ that lie in the same row or column as position $u$.
The features acquired by the FSS module are hierarchical cross features. Therefore, adding the CCA module after the FSS module (Fig. 13) can obtain the category consistency loss and better extract more in-depth abstract features by calculating the distance between intra-class and inter-class features while preserving the feature structure.
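The Affinity operation can be sketched in plain Python for a single position u. This follows the original criss-cross attention formulation, in which the affinity is the dot product between the vector at u and each vector on its row and column, followed by a softmax; that this paper uses exactly the same form is an assumption. Feature maps are represented as nested lists indexed `[channel][row][column]`, and the function names are illustrative.

```python
import math


def criss_cross_positions(i, j, height, width):
    """Positions sharing a row or column with (i, j); (i, j) is counted once."""
    column = [(r, j) for r in range(height)]          # H positions in column j
    row = [(i, c) for c in range(width) if c != j]    # W - 1 others in row i
    return column + row


def affinity_at(G, K, i, j):
    """Softmax-normalized affinities between G at (i, j) and K on its criss-cross path."""
    channels, height, width = len(G), len(G[0]), len(G[0][0])
    g_u = [G[ch][i][j] for ch in range(channels)]
    scores = []
    for r, c in criss_cross_positions(i, j, height, width):
        omega = [K[ch][r][c] for ch in range(channels)]
        scores.append(sum(a * b for a, b in zip(g_u, omega)))
    top = max(scores)                                  # for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Each position attends to only H + W − 1 other positions instead of all H × W, which is the source of the memory saving that motivates the choice of CCA.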
Auxiliary focal loss function. The one-stage method discards the stage of generating candidate boxes in order to improve the detection speed and directly classifies the anchor boxes at a fine-grained level, so many boxes are predicted but few contain the correct object, leading to the category imbalance problem. To solve this problem, Lin 26 proposed the focal loss function on the basis of the binary balanced cross-entropy loss function, adding a weight factor in front of each category to address the imbalance between positive and negative samples, and a modulating factor (γ ≥ 0 is an adjustable focusing parameter) to reduce the weight of easy-to-classify samples, focus on the training of difficult samples, and prevent easy-to-classify samples from dominating the gradient transfer. The definition is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t$ is the predicted probability of the true class and $\alpha_t$ is the category weight factor.
To improve the overall accuracy and performance of the model, this paper uses the FSS module in the backbone to generate the auxiliary head for auxiliary training. The lead head generated by the FSS-CCA module is the main prediction result. Different from YOLOv7, the lead head and auxiliary head (Fig. 14) participate in optimizing the model simultaneously and are assigned different weights (Fig. 15) to calculate classification, confidence, and regression losses. This is done to reduce the impact of the auxiliary head's "coarse" label and prevent a reduction in the lead head's detection accuracy. The lead head and auxiliary head both extract the IoU of the top 20 samples for summing in the actual calculation, and the classification and regression loss weights are set to 1:0.25. Similar to YOLOv5, the confidence loss is set at a ratio of 1/4 based on the output scale of the
Materials: dataset and health check. To validate the robustness and generalization of the proposed model, an open-source submarine garbage dataset is used to learn all types of labeled objects without any human screening, and the ratio of training, validation, and test sets is set to 0.7:0.2:0.1 (Table 6). The dataset consists of 5,136 images of marine debris in 15 categories, with 2.3 labels per image on average. The aspect ratio distribution of each category from the dimension insights is shown in Table 7; most images are close to the median width by median height (300 × 199 pixels), and a few categories, such as cellphone, have high aspect ratios.
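A 0.7:0.2:0.1 split of this kind can be sketched as follows. The shuffling seed and the rounding rule (floor for training and validation, remainder to test) are assumptions made for illustration, so the resulting counts need not match Table 6 exactly.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for a reproducible split
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


# Image indices stand in for the 5,136 images of the dataset.
train, val, test = split_dataset(range(5136))
print(len(train), len(val), len(test))
```

Because the split is random rather than stratified, rare categories may end up with very few validation or test samples, which matters for a dataset with pronounced class imbalance.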
It can be seen from the distribution of the original dataset that there is a serious sample imbalance in the labeled images. Therefore, strategies are needed to expand the dataset, such as Random Erasing Data Augmentation 51, RandAugment 52, Mixup 53, Cutout 54, CutMix 55, Mosaic 32, Copy-Paste 56, etc.
Visualization analysis. This paper uses representative and difficult images from the submarine garbage test set to evaluate the actual results of the algorithm for all classes of objects and to visualize and analyze them.

Figure 1. Performance of different feature layers of FSA networks. (a) Original image; (b) Heatmap of backbone output; (c) Heatmap of SPPCSPC output; (d) Heatmap of head output.

Figure 9. The structure of full stage auxiliary networks.

Figure 15. Loss with differently weighted heads.

Table 1. Software and hardware configuration of the experimental environment.

Table 2. FSA networks experimental training parameters.

Table 4. Ablation experiments of parameters with different modules.