Abstract
Submarine garbage continuously damages the marine ecological environment and pollutes the ocean, so detection methods that quickly locate and identify it are critical. The background of submarine garbage images is much more complex than that of natural scene images, and object deformation and missing contours place higher demands on the detection network. To address low accuracy under complex backgrounds, this paper proposes full stage auxiliary (FSA) networks with an auxiliary focal loss and a multi-attention module for submarine garbage object detection based on YOLO. A hierarchical feature fusion mechanism and a split-and-merge strategy are used to maximize the difference in gradient combinations and obtain full-stage features. The criss-cross attention module is then used to precisely extract multi-scale features of dense small-object regions while removing noise from complex backgrounds. Finally, the auxiliary focal loss function addresses the imbalance between positive and negative samples, focusing learning on difficult samples while improving overall detection precision. In comparative and ablation experiments, the FSA networks achieve state-of-the-art performance and are applicable to real-time object detection of submarine garbage in complex backgrounds.
Introduction
Submarine garbage is becoming an increasingly serious issue in the marine ecological environment. Due to poor management, a large amount of garbage from man-made products enters the marine environment and causes serious ocean pollution. Wood, fishing nets, glass, metals, plastics, and other durable and corrosion-resistant materials may be found in this garbage, and once in the ocean, they become persistent pollutants. As a result, controlling and managing submarine garbage pollution is critical1,2,3,4. At the technological level, detection is a critical link in promoting the cleanup and recycling of submarine garbage: it rapidly locates and identifies submarine garbage, provides basic information on the distribution and quantity of the pollution, and supports the formulation of control policies.
Most studies use remote sensing technology to detect and classify floating marine plastic waste; however, few scholars have studied submarine garbage, and the many types of garbage greatly increase the difficulty of object detection. Xu5 employed YOLOv3 to detect fish in underwater environments for waterpower applications and achieved a mean average precision (mAP) of 54.92%. Asyraf6 studied the efficiency of the YOLOv3 detector in detecting underwater life on two open-source datasets, and the results indicate that YOLOv3 can detect underwater objects with high accuracy, with mAP scores ranging from 74.88 to 97.56%. Rosli7 used YOLOv4 to detect underwater animals, and the training results showed a mAP of 97.86%. Chen8 used YOLOv4 to detect 4757 images with 4 categories on the URPC dataset, and the results showed a mAP of 73.48%. Zhang9 trained and tested on the URPC dataset and obtained a mAP of 81.01%. Gašparović10 improved YOLOv4 and achieved better detection results in underwater pipeline object detection, with 94.21% mAP.
Object detection technology is of great significance as the basis of more complex and higher-level visual tasks such as pattern recognition, object tracking, event detection, and activity recognition. Currently, deep learning-based object detection algorithms are divided into two main categories: one-stage object detection and two-stage object detection.
The most classical two-stage algorithm is R-CNN11 (Region-based Convolutional Neural Networks), proposed by Girshick on the basis of the AlexNet12 architecture, which combines region proposals with a CNN; however, this detector is time consuming. He proposed the SPPNet (Spatial Pyramid Pooling Networks)13 detector, which addressed the time-consuming problem of R-CNN. Girshick14 proposed the Fast R-CNN detector based on R-CNN and SPPNet, which improved the mAP (mean Average Precision) to 70.0% and reduced the elapsed time. Ren15 proposed the Faster R-CNN detector based on the RPN (Region Proposal Network), which unifies the generation of candidate regions, feature extraction, confirmation of candidate objects, and border coordinate regression into the same network framework. Dai16 proposed the region-based detector R-FCN (Region-based Fully Convolutional Networks) based on FCN17 (Fully Convolutional Networks) to resolve the contradiction between the location insensitivity of classification networks and the location sensitivity of detection networks. Lin18 proposed the FPN (Feature Pyramid Networks) detector based on Faster R-CNN, which is better at detecting small objects and objects with large scale variations. He19 introduced ROI (Region of Interest) into Faster R-CNN and proposed Mask R-CNN to achieve fast detection and instance segmentation of objects. Cai20 proposed the cascaded multi-stage network architecture Cascade R-CNN to solve the problem of IoU (Intersection over Union) threshold selection in object detection. Hu21 proposed the RelationNet detector, which exploits the interrelationships between objects to improve detection. Zhang22 put forward the RefineDet detector based on SSD23 (Single Shot Multibox Detector), which combines the ideas of one-stage and two-stage detection and integrates the SSD, RPN, and FPN algorithms to improve detection.
OverFeat24 is an early classic one-stage object detection algorithm based on AlexNet, which implements a unified network framework for recognition, localization, and detection. Redmon25 proposed the YOLO algorithm, which treats object detection as a regression problem, so that the object position and category information can be output with a single pass over the image. Lin26 proposed RetinaNet based on the ResNet27 structure using FPN to compensate for the accuracy gap caused by class imbalance in one-stage detectors. Duan28 proposed CenterNet, which replaces the bounding box of the detected object with its centroid, avoiding post-processing by non-maximum suppression and eliminating the need for border regression. Tan29 used EfficientNet as the backbone and scaled the model using a bidirectional feature pyramid network and multiscale features, enhancing advanced feature fusion with better efficiency, higher accuracy, and smaller size.
Redmon30 proposed YOLOv2 based on YOLO, which trains the object detector on both detection and classification datasets, using the data from the detection dataset to learn the exact location of objects and the data from the classification dataset to increase the number of categories. Among the YOLO series of object detection models, YOLOv331 is a classic one-stage model divided into four parts: input, backbone, neck, and prediction. YOLOv432 made many innovations on the basis of YOLOv3. YOLOv533 computes the scaling ratio between the original image size and the input size to obtain the scaled image size; its main differences from YOLOv4 are that mosaic data augmentation is adopted at the input, and CSPDarknet34, the Mish activation function35, DropBlock, etc. are used in the backbone. The neck adopts the structure of SPP13 and FPN18 with PAN (Path Aggregation Network)36, and CIoU (Complete IoU)37 loss and DIoU (Distance-IoU)38 NMS (Non-Maximum Suppression) are used at the output. YOLOX39 combines the best advances in the field of object detection with YOLO, such as decoupled heads, data augmentation, label assignment, and an anchor-free design, to achieve a significant performance improvement. YOLOv740 combines a collection of existing tricks with module re-parameterization and dynamic label assignment strategies, ultimately outperforming the vast majority of object detectors in speed and accuracy in the 5 FPS to 160 FPS range. YOLOv841 is a SOTA model that builds on the success of previous YOLO versions and introduces new features and improvements to further enhance performance and flexibility.
PicoDet42 is a compact one-stage object detector that employs attention mechanisms and multi-scale feature pyramids to enhance detection. By aligning the two tasks, task-aligned one-stage object detection (TOOD)43 addresses the inconsistency between classification and localization predictions and delivers accurate and effective detection. RTMDet44 is a more recent industrial detector that also performs strongly in real-time instance segmentation and rotated object detection, with the best parameter-accuracy trade-off across tiny, small, medium, large, and extra-large model sizes for diverse application scenarios. PP-YOLOE45 is an object detection model that improves on the YOLOv3 algorithm with a redesigned network structure and a more efficient convolution operation. This enables PP-YOLOE to process images in real time while maintaining high detection accuracy.
YOLO-based network models achieve a balance between detection speed and accuracy and are the most widely used one-stage object detection approaches. However, submarine garbage images are often blurred, object contours are incomplete, and objects captured underwater are deformed, which makes object detection of submarine garbage more challenging. In response, this paper identifies and detects 15 types of submarine garbage and proposes full stage shortcut convolutional neural networks with an auxiliary focal loss and a multi-attention module for submarine garbage object detection based on YOLO. The hierarchical feature fusion mechanism alleviates the drawbacks caused by cascading through explicit feature map replication. Adding the criss-cross attention module and fusing it with full-stage cross features yields dense features that focus more on small objects. The auxiliary head and weighted focal loss address the imbalance between positive and negative samples. Together, these components ease the extraction of submarine garbage objects from complex backgrounds, boost overall detection accuracy, enrich the types of submarine garbage that can be identified, and provide more reference information for the cleanup and recycling of submarine garbage.
Results
Experimental environment and parameters
The software environment and hardware parameters used in this paper are shown in Table 1. The hyperparameters used to train the FSA networks are shown in Table 2.
This paper mainly adopts mAP50:95 (denoted mAP) as the performance evaluation index of the model.
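As a rough illustration of what mAP50:95 measures, the sketch below computes the IoU of two axis-aligned boxes and averages per-threshold AP values over the ten IoU thresholds 0.50, 0.55, ..., 0.95. The helper names and the (x1, y1, x2, y2) box format are assumptions for illustration, and the per-threshold AP values are taken as given rather than computed from detections.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds():
    """The ten IoU thresholds from 0.50 to 0.95 in steps of 0.05."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def map_50_95(ap_per_threshold):
    """Average the AP values (a dict keyed by IoU threshold) over all thresholds."""
    thresholds = iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box reaches that threshold, so the averaged metric rewards tight localization more than mAP50 alone.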
Baseline experiments
In order to verify the effectiveness of the proposed FSA networks model, ablation experiments are conducted to evaluate the effect of different modules on the performance of the object detection algorithm under the same experimental conditions. Before determining the baseline model, comparison experiments are conducted between YOLOv5 and YOLOv7 series models.
From Table 3, it can be seen that the layers, parameters, GFLOPS, and mAP of the YOLOv5 series all increase with model size, and the mAP reaches its maximum at YOLOv5x. The layers, parameters, and GFLOPS of the YOLOv7 series also increase with model size, and the mAP reaches its maximum at YOLOv7-w6. Therefore, YOLOv7 was selected as the baseline model in the ablation experiments.
Ablation experiments on FSA networks
On COCO datasets, the various YOLO family improvement methods currently perform significantly better, but the extension to bespoke datasets has not yet been thoroughly demonstrated.
In this paper, an FSA network is designed using an FSS module with heavy feature reuse in the backbone and an efficient group-convolution SPPCSPC module in the neck. Additionally, the criss-cross attention mechanism is connected to the FSS module in the head and combined with the features from the backbone, and the object detection task is completed using a lead head and an auxiliary head. In this section, ablation experiments are conducted to verify the effectiveness of the FSA network and to compare it with the current leading detectors. The results show that the FSA network proposed in this paper achieves state-of-the-art performance on the submarine garbage dataset.
As can be seen from Table 4, model A combines HS and FS modules and extracts features using standard convolution, with 464 layers, 121.22 M parameters, and an mAP of 52.5%. This demonstrates that the HS and FS modules can extract features while improving accuracy and reducing the number of parameters compared with the YOLOv7 series models. Model B uses FS modules in both the backbone and the head, which slightly increases the number of parameters and GFLOPS and raises the mAP by 0.2% compared with HS modules, mainly because the FS module retains the detailed features of each layer in the module. To further reduce model complexity and the number of parameters, model C uses depthwise separable convolution in the FS module, so that each convolution kernel operates on only one channel and does not change the number of channels, at the cost of losing some channel information. Therefore, in the experiment, the kernel size is increased to expand the range of feature extraction. The results show that the high-density feature maps extracted by the FS module after depthwise separable convolution achieve the same accuracy as model A.
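To make the parameter saving of depthwise separable convolution concrete, the following sketch counts the weights of a standard convolution and of a depthwise convolution followed by a 1 × 1 pointwise convolution. The function names, the channel counts in the usage note, and the presence of the pointwise step are assumptions for illustration; the exact layer layout of the FS module is not specified here.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

With a hypothetical 256-channel layer, `conv_params(256, 256, 3)` gives 589,824 weights, whereas `depthwise_separable_params(256, 256, 3)` gives 67,840; enlarging the kernel to 5 × 5 only raises the latter to 71,936, which is why a larger kernel is an inexpensive way to widen the extraction range.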
The FSS module in model D is the final structure adopted in this paper. Compared with the FS module, the FSS module adds shortcut connections, similar to the residual structure of ResNet. Moreover, the feature maps of layers P3–P6 in the backbone are passed to layers P3–P6 in the head, which makes up for the information lost in depthwise separable convolution, and the mAP improves more noticeably (+1.6%). Models E and F were trained by adding the criss-cross attention mechanism and the auxiliary head in turn, and the mAP improved by 2.5% and 3.0%, respectively, compared with model A. Therefore, the FSA network proposed in this paper markedly improves detection.
Figure 1 displays the heat maps and detection accuracy at the head, SPPCSPC, and backbone outputs of the FSA networks. The benefit of the attention mechanism grows as the network deepens, and accuracy rises accordingly.
Comparisons with state-of-the-art methods
To validate the effectiveness of the FSA networks, this paper compares them with several state-of-the-art methods on the submarine garbage dataset. The compared models include two-stage and one-stage detectors (including anchor-free detectors such as TOOD and YOLOX). The compared algorithms are trained with PaddleDetection and MMDetection, with the number of epochs set to 100 for two-stage detectors and 300 for one-stage detectors and the remaining parameters left unchanged. The FSA model achieves 55.5% mAP, which is more accurate than many state-of-the-art methods; the comparison results are shown in Table 5.
As can be seen from Table 5, the mAP values of misc, can, tire, plastic, rod, and metal are below 0.5. On the one hand, these classes have few training samples (170, 89, 488, 136, 17, and 30, respectively). On the other hand, the objects are deformed by water flow, image resolution, light refraction, and so on, so the object features learned by the model are not completely consistent with the inherent attributes of the objects. Metal has one of the lowest detection accuracies. Besides the small number of samples, tin, coin, iron cage, rusty anchor, and tin can were all labeled as metal during annotation; even though the FSA networks model incorporates an attention module, it can only extract abstract features of such heterogeneous objects, and the mAP of metal is only 23.1%. Because rod has very few samples, focal loss was used to mitigate the accuracy drop caused by sample imbalance, yet the final mAP was only 30.2%. During dataset acquisition, some of the plastic lay in the marine soil, some floated on the ocean surface, and some overlapped with other objects, so the features learned by the model were incomplete, resulting in a final mAP of only 30.7%.
In summary, the FSS module and the group SPPCSPC module introduced in the FSA networks can extract shallow and deep features and reconstruct the image while reducing the number of parameters; CCA focuses more attention on dense object feature regions and introduces a residual operation to improve the fusion of shallow and deep feature maps; and the joint use of the auxiliary head and the lead head at the output allows the lead head to focus on learning the residual features that have not yet been learned. This effectively improves feature extraction for objects in complex environments and gives the model an advantage in complex submarine garbage object detection tasks.
Visualization analysis
This paper uses representative and difficult images from the submarine garbage test set to evaluate the actual results of the algorithm for all classes of objects and visualize and analyze them. The detection results are shown in Fig. 2. As can be seen from the figures, the object detection accuracy is generally high, and even if there is an occlusion or object at the junction of water bodies, it can be detected very well.
Figure 3 shows the detection effect under significant light variation. Some scenes in the figure are dim, and the object color is similar to the background color; the results illustrate that the model is little affected by light variation and retains good detection ability. Figure 4 presents images with blur and distortion caused by underwater shooting; the model still detects the objects in these blurred scenes, which indicates good robustness. Figure 5 shows the detection of dense small objects, where the pbag, pbottle, and tire are extremely small yet are all detected precisely, which illustrates the model’s strong detection ability for small objects as well. The detection effect in the presence of occlusion is shown in Fig. 6. The results demonstrate that the FSA network model detects objects correctly even when they are occluded by other objects or incomplete.
The left columns of Figs. 7 and 8 show the FSA detection results, and the right columns show the YOLOv7 detection results. As can be seen from Fig. 7, the FSA results have more accurate bounding boxes, whereas YOLOv7 sometimes produces bounding boxes that are too large or too small. This suggests that the prediction boxes obtained with the CIoU used in this paper are more consistent with the real positions of the objects.
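For readers unfamiliar with CIoU, the sketch below evaluates the commonly used Complete IoU score for a predicted box and a ground-truth box: the IoU minus a normalized center-distance penalty and an aspect-ratio penalty. The (x1, y1, x2, y2) box format, the function name, and the small `eps` stabilizer are assumptions for illustration; this is a generic formulation rather than the exact implementation used for the FSA networks.

```python
import math


def ciou(box, gt, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    # Plain IoU.
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((box[0] + box[2]) - (gt[0] + gt[2])) ** 2 / 4 \
         + ((box[1] + box[3]) - (gt[1] + gt[3])) ** 2 / 4

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v
```

The corresponding regression loss is usually taken as `1 - ciou(box, gt)`, so boxes that overlap equally but are off-center or have the wrong aspect ratio are penalized more.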
Figure 8 demonstrates that the detection results of the FSA networks model are more accurate. This indicates that the FSA networks model, through the attention module, enlarges the receptive field of the feature map, strengthens the network's feature extraction ability for small objects, and preserves more feature information of the object area. The combination of the auxiliary head and the lead head weakens the interference of background noise, fuses shallow and deep features, and improves global feature extraction, which gives better performance in complex backgrounds: missed and false detections are reduced, and the model is less affected by environmental and illumination changes. Overall, the FSA networks model obtains more accurate object positions, is more robust to illumination changes, clearly improves detection in complex backgrounds, and reaches an inference speed of 72.15 FPS.
Discussion
In this paper, we propose a one-stage detector based on YOLO, full stage auxiliary networks with an auxiliary focal loss and a multi-attention module. It aims to improve the detection of dense small objects in complex backgrounds for real-time submarine garbage object detection tasks. To avoid overfitting and improve the generalization ability of the model, data augmentation is performed on the submarine garbage dataset using left–right flipping, Mosaic, Mixup, and other strategies. Then, channel streaming and a cross-stage connection strategy are used to obtain all features of each stage and hierarchical cross features; the criss-cross attention module added afterwards better extracts the deep abstract features of the full stage by calculating the distance between intra-class and inter-class features, which makes the obtained features focus more on the dense features of small objects. In the regression stage, the auxiliary focal loss function is used to calculate the object class and confidence losses, which counteracts the imbalance between positive and negative samples, focuses training on difficult samples, and improves overall detection accuracy. The experimental results demonstrate that the FSA networks achieve state-of-the-art performance compared with mainstream networks while ensuring efficient inference, and can be applied to real-time object detection tasks.
Although the FSA networks proposed in this paper perform well in terms of performance and accuracy, introducing the attention module at the end of each FSS module greatly increases the number of parameters and the computational cost, and the effect of model width is not discussed. Future work can therefore investigate how to reduce the computational overhead of the attention module while increasing the model width.
Methods
Overall architecture
The detector proposed in this paper, called full stage auxiliary networks (FSA networks), is based on the YOLO detection framework. As shown in Fig. 9, the images undergo data augmentation before being sent to the backbone. Pi (i ∈ [1, 6]) indicates that the feature map output by the current layer is \(1/2^{i}\) the size of the original image; after the group SPPCSPC convolution operation, the feature map fed to the head is 1/64 the size of the original image. FSS-i denotes the full stage shortcut convolution operation of the corresponding layer. In the backbone and head, FSS is used to extract shallow and deep features, respectively, and an attention mechanism is added after each FSS in the head to gather image context along the vertical and horizontal paths of each pixel, so that the model focuses more on the features of dense small-object regions and suppresses noise from complex backgrounds. The outputs of P3, P4, P5, and P6 of the FSS modules in the backbone serve as auxiliary heads for the regression of classification, location, and confidence, and their loss is calculated together with that of the lead heads output by each FSS-CCA layer.
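The downsampling arithmetic behind the Pi notation can be checked with a few lines of Python. The sketch assumes that level Pi has a stride of 2 to the power i, as described above; the function name and the 1280 × 1280 input size in the usage note are hypothetical and chosen only for illustration.

```python
def pyramid_sizes(height, width, levels=range(1, 7)):
    """Feature-map size at each level P_i, assuming size = input / 2**i."""
    return {f"P{i}": (height // 2 ** i, width // 2 ** i) for i in levels}
```

For a hypothetical 1280 × 1280 input, `pyramid_sizes(1280, 1280)` yields 160 × 160 at P3, 80 × 80 at P4, 40 × 40 at P5, and 20 × 20 at P6, which correspond to the four output scales 1/8, 1/16, 1/32, and 1/64.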
Full stage convolution
Different from the backbones of CSPNet and YOLOv7, the HS (Fig. 10) and FS (Fig. 11) modules preserve DenseNet's advantage of feature reuse while truncating the gradient flow, mainly through a hierarchical feature fusion strategy, to prevent excessively repeated gradient information from being propagated and learned. First, the upper feature map is split into two parts: one part passes through the stage and transition layers, and the other is concatenated with the transmitted feature map and passed to the next stage. The module implementation extends the number of channels and the cardinality of the computational module through group convolution, uses channel streaming and a cross-stage connection strategy, retains all features of the upper layer, and fuses the channel features of each stage. As a result, the output has twice as many channels as the input, which captures more features in both depth and width, better preserves the actual feature structure of the object, and makes the model more robust with stronger generalization ability.
Figure 9 demonstrates that in order to obtain feature maps at various scales, each FS module in the head must be upsampled. However, this adds irregular pixels, which causes the image to lose some of its finer details. In order to better reuse the features in the backbone and compensate for the missing data introduced by upsampling, the FS-i feature map in the backbone is transmitted to the FS-i layer corresponding to Pi in the head in this study. The shortcut is used to connect the residuals with the features at the end of the FS module, thereby employing both the full features before and after upsampling. Therefore, the FSS (Full Stage Shortcut) module (Fig. 12) is used for all the backbone and head in this paper.
Attention module
In recent years, attention models have been widely used in image processing46, speech recognition47, natural language processing48, and other fields49. In essence, an attention module is a set of weight coefficients learned autonomously by the network, which emphasizes the regions of interest while suppressing irrelevant background regions in a “dynamic weighting” way.
Therefore, in order to reduce GPU memory usage, allow a larger batch size, and improve detection accuracy, this paper uses the CCA50 (Criss-Cross Attention) module to upgrade the model. Given a local feature map \({\text{F}} \in {\mathbb{R}}^{{\text{C}} \times {\text{W}} \times {\text{H}}}\), two feature maps G and K are generated by two 1 × 1 convolution layers, where \({\text{G}}, {\text{K}} \in {\mathbb{R}}^{{\text{C}}^{\prime} \times {\text{W}} \times {\text{H}}}\) and C’ is the number of channels, which is smaller than C for dimension reduction. After obtaining G and K, each three-dimensional feature map of shape C’ × H × W can be reshaped into a two-dimensional C’ × (H × W) matrix. The attention map \({\text{A}} \in {\mathbb{R}}^{({\text{H}} + {\text{W}} - 1) \times {\text{W}} \times {\text{H}}}\) is generated by the Affinity operation. For each position u in the feature map G, a vector \({\text{G}}_{\text{u}} \in {\mathbb{R}}^{{\text{C}}^{\prime}}\) of dimension C’ can be obtained. At the same time, the set \(\Omega_{\text{u}} \in {\mathbb{R}}^{({\text{H}} + {\text{W}} - 1) \times {\text{C}}^{\prime}}\) can be obtained by collecting the feature vectors of K that lie in the same row or column as position u.
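The Affinity operation can be sketched for a single position u in plain Python. The code assumes G and K are nested lists indexed as [channel][row][column], collects the H + W − 1 vectors of K on the criss-cross path through u, takes their dot products with the vector of G at u, and normalizes the scores with a softmax. The function names and data layout are assumptions for illustration, not the authors' implementation.

```python
import math


def criss_cross_positions(y, x, height, width):
    """Positions in the same column or row as (y, x); H + W - 1 in total."""
    column = [(r, x) for r in range(height)]
    row = [(y, c) for c in range(width) if c != x]  # (y, x) is already in the column
    return column + row


def affinity_at(G, K, y, x):
    """Softmax-normalized affinities between G at (y, x) and the criss-cross set of K."""
    channels, height, width = len(G), len(G[0]), len(G[0][0])
    g_u = [G[c][y][x] for c in range(channels)]
    scores = []
    for r, col in criss_cross_positions(y, x, height, width):
        k_vec = [K[c][r][col] for c in range(channels)]
        scores.append(sum(g * k for g, k in zip(g_u, k_vec)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because each position attends to only H + W − 1 others instead of all H × W, the attention map is much smaller than that of full non-local attention, which is the source of the memory saving mentioned above.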
The features acquired by the FSS module are hierarchical cross features, therefore, adding the CCA module after the FSS module (Fig. 13) can obtain the category consistency loss and better extract more in-depth abstract features by calculating the distance between intra-class and inter-class features while preserving the feature structure.
Auxiliary focal loss function
The one-stage method discards the candidate box generation stage in order to improve detection speed and directly classifies the anchor boxes at a fine-grained level, so many boxes are predicted but few contain the correct object, which leads to the class imbalance problem. To solve this problem, Lin26 proposed the focal loss function on the basis of the binary balanced cross-entropy loss function. It adds a weight factor in front of each category to address the imbalance between positive and negative samples, and a modulating factor (γ ≥ 0 is an adjustable focusing parameter) to reduce the weight of easy-to-classify samples, focus training on difficult samples, and prevent easy samples from dominating the gradient. The definition is as follows:
\({\text{FL}}(p_{\text{t}}) = -\alpha_{\text{t}}(1 - p_{\text{t}})^{\gamma}\log (p_{\text{t}})\)
where \(p_{\text{t}}\) is the predicted probability of the true class and \(\alpha_{\text{t}}\) is the class weight factor.
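A minimal Python sketch of the binary focal loss of Lin et al. for a single prediction is given below. The default values alpha = 0.25 and gamma = 2.0 are those commonly quoted for the original focal loss and are an assumption here, as are the function and argument names; the values used for the FSA networks are not stated in this section.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction.

    p:      predicted probability of the positive class
    target: 1 for a positive sample, 0 for a negative sample
    """
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if target == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 the modulating factor disappears and the expression reduces to the balanced cross-entropy; increasing gamma shrinks the loss of well-classified samples (p_t close to 1) much faster than that of hard samples.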
In order to improve the overall accuracy and performance of the model, this paper uses the FSS module in the backbone to generate the auxiliary head for auxiliary training. The lead head generated by the FSS-CCA module is the main prediction result. Different from YOLOv7, the lead head and auxiliary head (Fig. 14) participate in the optimization model simultaneously and assign different weights (Fig. 15) to calculate classification, confidence, and regression losses. This is done to reduce the impact of the auxiliary head's "coarse" label and prevent a reduction in the lead head's detection accuracy. The lead head and auxiliary head both extract the IoU of the top 20 samples for summing in the actual calculation, and the classification and regression loss weights are set to 1:0.25. Similar to YOLOv5, the confidence loss is set at a ratio of 1/4 based on the output scale of the detection head. According to Fig. 9, the output contains 4 scales (1/8, 1/16, 1/32, 1/64), so it is very suitable for small object detection in multi-scale complex backgrounds.
Because of the additional training of the auxiliary head, the focal loss functions all need to be added for synchronous training, and the modified auxiliary focal loss function is as follows:
where b is the batch size during training, l is the number of groups of detection heads (in this paper, there are 4 groups of auxiliary head and lead head, therefore, l = 4), b(i) = [4, 1, 1/4, 1/16] is the balance factor of auxiliary head and lead head.
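One plausible reading of how the per-scale balance factors and the lead-to-auxiliary weighting interact is sketched below. It is an assumption-laden illustration, not a reproduction of the authors' formula: it takes one already-computed loss value per output scale for each head, weights the auxiliary head by 0.25 relative to the lead head, and scales each output scale by the balance factor b(i) = [4, 1, 1/4, 1/16] quoted above. All names are hypothetical.

```python
# Balance factors b(i) quoted in the text, one per output scale (1/8 ... 1/64).
BALANCE = [4.0, 1.0, 0.25, 0.0625]
# Lead : auxiliary weighting of 1 : 0.25, as stated for the loss weights.
AUX_WEIGHT = 0.25


def combined_loss(lead_losses, aux_losses, balance=BALANCE, aux_weight=AUX_WEIGHT):
    """Weighted sum of lead-head and auxiliary-head losses over the output scales.

    lead_losses, aux_losses: one scalar loss per output scale (assumed inputs).
    """
    total = 0.0
    for b_i, lead, aux in zip(balance, lead_losses, aux_losses):
        total += b_i * (lead + aux_weight * aux)
    return total
```

Down-weighting the auxiliary head keeps its coarser labels from dominating the gradient, while the larger balance factors on the high-resolution scales emphasize the outputs that carry most small objects.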
Materials: dataset and health check
To validate the robustness and generalization of the proposed model, an open-source submarine garbage dataset is used, and all labeled object types are learned without any human screening. The ratio of the training, validation, and test sets is 0.7:0.2:0.1 (Table 6). The dataset consists of 5,136 images of marine debris in 15 categories, with an average of 2.3 labels per image. The aspect ratio distribution of each category is shown in Table 7. Most images are close to the median size of 300 × 199 pixels (median width × median height), and a few categories, such as cellphone, have high aspect ratios.
It can be seen from the distribution of the original dataset that the labeled images suffer from a serious sample imbalance. Strategies such as Random Erasing Data Augmentation51, RandAugment52, Mixup53, Cutout54, CutMix55, Mosaic32, and Copy-Paste56 are therefore needed to expand the dataset.
In this paper, we adjust hue, saturation, and value in the HSV color model, rotate the images by 10 degrees, and shift them within the range [−0.2, 0.2]. At the same time, we expand the dataset using left–right flipping, Mosaic, Mixup, and Copy-Paste. The augmented images are shown in Fig. 16.
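When an image is flipped left to right, its bounding-box labels must be mirrored as well. The sketch below shows this for labels stored in the normalized (class_id, x_center, y_center, width, height) layout commonly used by YOLO tooling; that layout and the function name are assumptions for illustration, since the label format of the dataset is not described here.

```python
def flip_boxes_left_right(boxes):
    """Mirror normalized (class_id, x_center, y_center, width, height) labels.

    Only the horizontal center changes; the size and vertical position stay the same.
    """
    return [(cls, 1.0 - xc, yc, w, h) for cls, xc, yc, w, h in boxes]
```

For example, a box centered at x = 0.25 moves to x = 0.75 after the flip, and flipping twice restores the original label.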
Data availability
The dataset analyzed during the current study is available at https://universe.roboflow.com/ncwu-mdh99/submarine-garbage.
References
Ciappa, A. C. Marine litter detection by Sentinel-2: A case study in North Adriatic (summer 2020). Remote Sens. 14, 2409 (2022).
Topouzelis, K. et al. Floating marine litter detection algorithms and techniques using optical remote sensing data: A review. Mar. Pollut. Bull. 170, 112675 (2021).
Fulton, M. et al. Robotic detection of marine litter using deep visual detection models. In 2019 International Conference on Robotics and Automation (ICRA). IEEE (2019).
Garaba, S. P. & Dierssen, H. M. An airborne remote sensing case study of synthetic hydrocarbon detection using short wave infrared absorption features identified from marine-harvested macro-and microplastics. Remote Sens. Environ. 2018(205), 224–235 (2018).
Xu, W. & Matzner, S. Underwater fish detection using deep learning for water power applications. In 2018 International Conference on Computational Science and Computational Intelligence (CSCI). Las Vegas, USA: IEEE, 313–18 (2018).
Asyraf, M. S., Isa, I. S., Marzuki, M. I. F., Sulaiman, S. N. & Hung, C. C. CNN-based YOLOv3 comparison for underwater object detection. J. Electr. Electron. Syst. Res. (JEESR) 18(APR2021), 30–3716 (2021).
Rosli, M. S. A. B., Isa, I. S., Maruzuki, M. I. F., Sulaiman, S. N. & Ahmad, I. Underwater animal detection using YOLOV4. In 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, IEEE, 158–63. (2021).
Chen, L., Zheng, M., Duan, S., Luo, W. & Yao, L. Underwater target recognition based on improved YOLOv4 neural network. Electronics 10(14), 1634 (2021).
Zhang, M., Xu, S., Song, W., He, Q. & Wei, Q. Lightweight underwater object detection based on YOLO v4 and multi-scale attentional feature fusion. Remote Sens. 13(22), 4706 (2021).
Gašparović, B., Lerga, J., Mauša, G. & Ivašić-Kos, M. Deep learning approach for objects detection in underwater pipeline images. Appl. Artif. Intell. 36(1), 2146853 (2022).
Girshick, R. et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2014, 580–587 (2014).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 2012, 25 (2012).
He, K. et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015).
Girshick, R. Fast R-CNN. Proc. IEEE Int. Conf. Comput. Vis. 2015, 1440–1448 (2015).
Ren, S. et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural. Inf. Process. Syst. 2015, 28 (2015).
Dai, J. et al. R-FCN: Object detection via region-based fully convolutional networks (Curran Associates Inc, Red Hook, 2016). https://doi.org/10.48550/arXiv.1605.06409.
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440 (2015).
Lin, T. Y., Dollár, P. & Girshick, R., et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2117–2125 (2017).
He, K., Gkioxari, G. & Dollár, P., et al. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969 (2017).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6154–6162 (2018).
Hu, H., Gu, J. & Zhang, Z., et al. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3588–3597 (2018).
Zhang, S., Wen, L. & Bian, X., et al. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4203–4212 (2018).
Liu, W., Anguelov, D. & Erhan, D., et al. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, Cham 21–37 (2016).
Sermanet, P., Eigen, D. & Zhang, X., et al. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229 (2013).
Redmon, J., Divvala, S. & Girshick, R., et al. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788 (2016).
Lin, T. Y., Goyal, P. & Girshick, R., et al. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988 (2017).
He, K., Zhang, X. & Ren, S., et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (2016).
Duan, K., Bai, S. & Xie, L., et al. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6569–6578 (2019).
Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10781–10790 (2020).
Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7263–7271 (2017).
Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement. arXiv:1804.02767 (2018).
Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934 (2020).
Ultralytics. Yolov5. https://github.com/ultralytics/yolov5 (2023).
Wang, C. Y., Liao, H. Y. M. & Wu, Y. H., et al. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 390–391 (2020).
Misra, D. Mish: A self regularized non-monotonic neural activation function. arXiv:1908.08681 (2019).
Wang, K., Liew, J. H. & Zou, Y., et al. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9197–9206 (2019).
Zheng, Z. et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52, 8574–8586 (2021).
Zheng, Z., Wang, P. & Liu, W., et al. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence. 34(07), 12993–13000 (2020).
Ge, Z., Liu, S. & Wang, F., et al. Yolox: Exceeding yolo series in 2021. arXiv:2107.08430 (2021).
Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv:2207.02696 (2022).
Ultralytics. Yolov8. https://github.com/ultralytics/ultralytics (2023).
Yu, G., Chang, Q. & Lv, W., et al. PP-PicoDet: A better real-time object detector on mobile devices. arXiv:2111.00902 (2021).
Feng, C., Zhong, Y. & Gao, Y., et al. Tood: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 3490–3499 (2021).
Lyu, C., Zhang, W. & Huang, H., et al. RTMDet: An empirical study of designing real-time object detectors. arXiv:2212.07784 (2022).
Xu, S., Wang, X. & Lv, W., et al. PP-YOLOE: An evolved version of YOLO. arXiv:2203.16250 (2022).
Niu, Z., Zhong, G. & Yu, H. A review on the attention module of deep learning. Neurocomputing 452, 48–62 (2021).
Wang, F., Jiang, M. & Qian, C., et al. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3156–3164 (2017).
Azuma, R. T. A survey of augmented reality. Presence Teleoper. Virtual Environ. 6(4), 355–385 (1997).
Fritsch, J., Kuehnl, T. & Geiger, A. A new performance measure and evaluation benchmark for road detection algorithms. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013). IEEE 1693–1700 (2013).
Huang, Z., Wang, X. & Huang, L., et al. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision 603–612 (2019).
Zhong, Z., Zheng, L. & Kang, G., et al. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence. 34(07), 13001–13008 (2020).
Cubuk, E. D., Zoph, B. & Shlens, J., et al. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 702–703 (2020).
Zhang, H., Cisse, M. & Dauphin, Y. N., et al. mixup: Beyond empirical risk minimization. arXiv:1710.09412 (2017).
DeVries, T. & Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552 (2017).
Yun, S., Han, D. & Oh, S. J., et al. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6023–6032 (2019).
Ghiasi, G., Cui, Y. & Srinivas, A., et al. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2918–2928 (2021).
Acknowledgements
I would like to express my sincere gratitude to my colleague, Dr. Guo, for his valuable suggestions and support throughout my research. I would also like to thank my colleagues in the laboratory, especially Professor Guo Guihai, Hu Xinglei, Yue Pujie, and Dr. Cao Yizhi, for their valuable discussions and technical support. Their assistance has been instrumental in the completion of this project.
Author information
Contributions
H.Z.: conceptualization, method design, experiments, paper writing, and visualization. X.G.: project administration, writing (review and editing). G.G. and Y.C.: validation experiments. X.H.: investigation, resources, and supervision. P.Y.: formal analysis and data curation. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zheng, H., Guo, X., Guo, G. et al. Full stage networks with auxiliary focal loss and multi-attention module for submarine garbage object detection. Sci Rep 13, 16115 (2023). https://doi.org/10.1038/s41598-023-42896-3
DOI: https://doi.org/10.1038/s41598-023-42896-3