Introduction

Submarine garbage is becoming an increasingly serious issue for the marine ecological environment. Due to poor management, large amounts of garbage generated from artificial products enter the marine environment and cause serious pollution of the ocean. Wood, fishing nets, glass, metals, plastics, and other durable, corrosion-resistant materials may be found in this garbage, and once in the ocean they become persistent pollutants. As a result, controlling and managing submarine garbage pollution is critical1,2,3,4. At the technological level, the detection of submarine garbage, that is, rapidly locating and identifying it, obtaining basic information on the distribution and quantity of the pollution, and using this information to formulate control policies, is a critical link in promoting the cleanup and recycling of submarine garbage.

The majority of studies use remote sensing technology to detect and classify marine floating plastic waste; few scholars have studied submarine garbage, and the many types of garbage greatly increase the difficulty of object detection. Xu5 employed YOLOv3 to detect fish in underwater environments for waterpower applications and achieved a mean average precision (mAP) of 54.92%. Asyraf6 studied the efficiency of the YOLOv3 detector in detecting underwater life on two open-source datasets, and the results indicate that YOLOv3 can detect underwater objects with high accuracy, with mAP scores ranging from 74.88 to 97.56%. Rosli7 used YOLOv4 to detect underwater animals, and the training results showed a mAP of 97.86%. Chen8 used YOLOv4 to detect 4757 images across 4 categories on the URPC dataset, and the results showed a mAP of 73.48%. Zhang9 trained and tested on the URPC dataset with a mAP of 81.01%. Gašparović10 improved YOLOv4 and achieved better results in underwater pipeline object detection with 94.21% mAP.

Object detection technology is of great significance as the basis of more complex, higher-level visual tasks such as pattern recognition, object tracking, event detection, and activity recognition. Currently, deep learning-based object detection algorithms fall into two main categories: one-stage object detection and two-stage object detection.

The most classical two-stage algorithm is R-CNN11 (Region-based Convolutional Neural Networks), proposed by Girshick12 on the basis of the AlexNet architecture by combining region proposals with a CNN, but this detector is time consuming. He proposed the SPPNet (Spatial Pyramid Pooling Networks)13 detector, which alleviated the time-consuming problem of R-CNN. Girshick14 proposed the Fast R-CNN detector based on R-CNN and SPPNet, which improved the mAP (mean Average Precision) to 70.0% and reduced the elapsed time. Ren15 proposed the Faster R-CNN detector based on the RPN (Region Proposal Network), which unifies candidate-region generation, feature extraction, candidate-object confirmation, and bounding-box regression into a single network framework. Dai16 proposed the region-based detector R-FCN (Region-based Fully Convolutional Networks) based on FCN17 (Fully Convolutional Networks) to resolve the contradiction between the location insensitivity of classification networks and the location sensitivity of detection networks. Lin18 proposed the FPN (Feature Pyramid Networks) detector based on Faster R-CNN, which has advantages for small objects and objects with large scale variations. He19 introduced RoI (Region of Interest) alignment into Faster R-CNN and proposed Mask R-CNN to achieve fast detection and instance segmentation of objects. Cai20 proposed the cascade multi-stage network architecture Cascade R-CNN to address the problem of IoU (Intersection over Union) threshold selection in object detection. Hu21 proposed the RelationNet detector, which exploits the interrelationships between objects to optimize detection. Zhang22 put forward the RefineDet detector based on SSD23 (Single Shot Multibox Detector), adopting ideas from both one-stage and two-stage methods and integrating the SSD, RPN, and FPN algorithms to improve detection.

OverFeat24 is an early classic one-stage object detection algorithm based on AlexNet, which implements a unified network framework for recognition, localization, and detection. Redmon25 proposed the YOLO algorithm, which treats object detection as a regression problem so that object position and category information can be output by processing the image only once. Lin26 proposed RetinaNet, built on the ResNet27 structure with FPN, to compensate for the accuracy gap caused by the category imbalance of one-stage detectors. Duan28 proposed CenterNet, which transforms detection of the object bounding box into detection of the object center point, avoiding non-maximum suppression post-processing and eliminating the need for border regression. Tan29 used EfficientNet as the backbone and scaled the model using a bidirectional feature pyramid network and multiscale features, enhancing high-level feature fusion with better efficiency, higher accuracy, and smaller size.

Redmon30 proposed YOLOv2 based on YOLO, training the object detector jointly on detection and classification datasets: the detection data are used to learn precise object locations, and the classification data are used to increase the number of categories. Among the YOLO series of object detection models, YOLOv331 is a classic one-stage model divided into four parts: input, backbone, neck, and prediction. YOLOv432 introduced many innovations on the basis of YOLOv3. YOLOv533 mainly calculates the scaling ratio between the original image size and the input size to obtain the scaled image; compared with YOLOv4, it adopts mosaic data enhancement at the input, CSPDarknet34 in the backbone, the mish activation function35, drop block, and other tricks. The neck adopts the SPP11 and FPN18 structures with a PAN (Path Aggregation Network)36, and CIoU (Complete IoU)37 loss and DIoU (Distance-IoU)38 NMS (Non-Maximum Suppression) are used at the output. YOLOX39 combines recent advances in object detection with YOLO, such as decoupled heads, data augmentation, label assignment, and an anchor-free module, achieving a significant performance improvement. YOLOv740 combines a collection of existing tricks with module re-parameterization and a dynamic label assignment strategy, ultimately outperforming the vast majority of object detectors in speed and accuracy in the 5 FPS to 160 FPS range. YOLOv841 is a SOTA model that builds on the success of previous YOLO versions and introduces new features and improvements to further enhance performance and flexibility.

PicoDet42 is a compact object detector that employs attention mechanisms and multi-scale feature pyramids to enhance detection within a one-stage network. Task-aligned one-stage object detection (TOOD)43 addresses the inconsistency between classification and localization predictions by aligning the two tasks and delivers accurate, efficient detection. RTMDet44 is a recent industrial detector that achieves state-of-the-art performance in real-time instance segmentation and rotated object detection, with the best parameter-accuracy trade-off across tiny, small, medium, large, and extra-large model sizes for diverse application scenarios. PP-YOLOE45 is an object detection model that improves on the YOLOv3 algorithm with a redesigned network structure and more efficient convolution operations, enabling it to process images in real time while maintaining high detection accuracy.

YOLO-based network models achieve a balance between detection speed and accuracy and are the most popular one-stage object detection approach. However, because submarine garbage images are often blurred and the objects captured underwater may have incomplete contours or be deformed, object detection of submarine garbage is more challenging. In response, this paper identifies and detects 15 types of submarine garbage and proposes a full stage shortcut convolutional neural network with auxiliary focal loss and a multi-attention module for submarine garbage object detection based on the YOLO method. The hierarchical feature fusion mechanism alleviates the drawbacks of cascading by explicit feature map replication; adding a criss-cross attention module and fusing it with full stage cross features yields dense features that focus more on small objects; and the auxiliary head with a weighted focal loss addresses the imbalance between positive and negative samples. Together, these components ease the extraction of submarine garbage objects from complex backgrounds, boost overall detection accuracy, enrich the types of submarine garbage identified, and provide more reference information for the cleanup and recycling of submarine garbage pollution.

Results

Experimental environment and parameters

The software environment and hardware parameters used in this paper are shown in Table 1. The hyperparameters used in training the FSA networks are shown in Table 2.

Table 1 Software and hardware configuration of the experimental environment.
Table 2 FSA networks experimental training parameters.

This paper mainly adopts mAP50:95 (hereafter mAP) as the performance evaluation metric of the model.
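For reference, mAP50:95 can be computed with the standard COCO evaluation protocol. The sketch below uses pycocotools with placeholder file paths; it is not the exact evaluation script used in this work.

```python
# Minimal sketch: computing mAP@[.50:.95] with pycocotools.
# File paths are placeholders; the paper's exact evaluation pipeline is not specified.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_test.json")          # ground-truth labels
coco_dt = coco_gt.loadRes("detections/fsa_results.json")   # detector outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                  # stats[0] is AP averaged over IoU = 0.50:0.95
print("mAP50:95 =", evaluator.stats[0])
```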

Baseline experiments

In order to verify the effectiveness of the proposed FSA networks model, ablation experiments are conducted to evaluate the effect of different modules on the performance of the object detection algorithm under the same experimental conditions. Before determining the baseline model, comparison experiments are conducted between YOLOv5 and YOLOv7 series models.

From Table 3, it can be seen that the layers, parameters, GFLOPS, and mAP of the YOLOv5 series all increase with model size, with the mAP reaching its maximum at YOLOv5x. The layers, parameters, and GFLOPS of the YOLOv7 series also increase with model size, and the mAP reaches its maximum at YOLOv7-w6. Therefore, YOLOv7 was selected as the baseline model for the ablation experiments.

Table 3 Baseline experiment.

Ablation experiments on FSA networks

On the COCO dataset, the various improved methods of the YOLO family currently perform significantly better, but their extension to bespoke datasets has not yet been thoroughly demonstrated.

In this paper, an FSA network is designed using the feature-reusing FSS module in the backbone and an efficient group convolutional SPPCSPC module in the neck. Additionally, the criss-cross attention mechanism is connected to the FSS module in the head and combined with the features from the backbone, and the object detection task is completed using a lead head and an auxiliary head. In this section, ablation experiments are conducted to verify the effectiveness of the FSA network and to compare it with current leading detectors. The results show that the FSA network proposed in this paper achieves state-of-the-art performance on the submarine garbage dataset.

As can be seen from Table 4, model A uses a combination of HS and FS modules to extract features with standard convolution, with 464 layers, 121.22 M parameters, and an mAP of 52.5%, which demonstrates that the HS and FS modules can extract features while improving accuracy and reducing the number of parameters compared to the YOLOv7 series models. The backbone and head of model B both use FS modules, which slightly increase the number of parameters and GFLOPS and raise the mAP by 0.2% compared to HS modules, mainly because the FS module retains the detailed features of each layer in the module. To further reduce the complexity of the model and the number of parameters, model C uses depthwise separable convolution in the FS module, so that each convolution kernel operates on only one channel and does not change the number of channels; however, some cross-channel information is lost. Therefore, in the experiment, the kernel size is increased to enlarge the receptive field for feature extraction. The results show that the high-density feature map extracted by FS with depthwise separable convolution achieves the same accuracy as model A.
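For clarity, the following sketch shows a depthwise separable convolution with an enlarged kernel of the kind used in model C; the kernel size of 5 is an illustrative assumption, since the exact value is not stated here.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution. A larger depthwise kernel (here 5,
    an illustrative choice) widens the receptive field to offset the loss of
    cross-channel information noted for model C."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```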

Table 4 Ablation experiments of parameters with different module.

The FSS module in model D is the final structure adopted in this paper. Compared with the FS module, the FSS module adds shortcut connections, similar to the residual structure of ResNet. Moreover, the feature maps of layers P3–P6 in the backbone are passed to layers P3–P6 in the head, which makes up for the information lost in depthwise separable convolution, and the mAP improves noticeably (+ 1.6%). Models E and F were trained by adding the criss-cross attention mechanism and the auxiliary head in turn, and their mAP improved by 2.5% and 3.0%, respectively, compared with model A. Therefore, the FSA network proposed in this paper has a very significant detection effect.

Figure 1 displays the heat map and detection accuracy at the head, SPPCSPC, and backbone outputs of FSA networks. The benefit of the attention mechanism grows enormously as the network’s depth increases, and accuracy likewise rises.

Figure 1
figure 1

Performance of different feature layers of FSA networks. (a) Original image; (b) Heatmap of backbone output; (c) Heatmap of SPPCSPC output; (d) Heatmap of head output.

Comparisons with state-of-the-art methods

To validate the effectiveness of FSA networks, this paper compares them with several state-of-the-art methods on the submarine garbage dataset. The models involved in the comparison include two-stage detectors and one-stage detectors (including anchor-free detectors such as TOOD and YOLOX). The compared algorithms are trained with PaddleDetection and MMDetection, with epochs set to 100 for the two-stage detectors and 300 for the one-stage detectors, and the remaining parameters unchanged. The results show that the FSA model achieves 55.5% mAP, which is more accurate than many state-of-the-art methods; the comparison results are shown in Table 5.

Table 5 Comparisons with state-of-the-art methods on submarine garbage test set.

As can be seen from Table 5, the mAP of misc, can, tire, plastic, rod, and metal is less than 0.5. On the one hand, the training samples for these classes are few (170, 89, 488, 136, 17, and 30, respectively); on the other hand, the objects are deformed by water flow, image resolution, light refraction, and other factors, so the object features learned by the model are not completely consistent with the inherent attribute features of the objects. Metal has one of the lowest detection accuracies: besides the small sample size, tins, coins, iron cages, rusty anchors, and tin cans were all labeled as metal during annotation, so even though the FSA networks model incorporates an attention module, it can only extract abstract features of this class, and the mAP of metal is only 23.1%. Because of the small number of rod samples, focal loss was used to mitigate the accuracy loss caused by sample imbalance, yet the final mAP was only 30.2%. During data acquisition, some of the plastic was buried in marine sediment, some was floating on the ocean surface, and some overlapped with other objects, so the features learned by the model were incomplete, resulting in a final mAP of only 30.7%.

In summary, the FSS module and the group SPPCSPC module introduced in the FSA networks can extract shallow and deep features and reconstruct the image while reducing the number of parameters; CCA focuses more attention on dense object feature regions, while the residual operation improves the fusion of shallow and deep feature maps; the joint use of the auxiliary head and the lead head at the output allows the lead head to focus on learning the remaining features that have not yet been learned. This effectively improves the ability to extract object features in complex environments and makes the model more advantageous for complex submarine garbage image object detection tasks.

Visualization analysis

This paper uses representative and difficult images from the submarine garbage test set to evaluate the actual performance of the algorithm for all object classes and to visualize and analyze the results. The detection results are shown in Fig. 2. As can be seen from the figures, the object detection accuracy is generally high, and objects are detected well even when they are occluded or located at the boundary between water bodies.

Figure 2
figure 2

Object detection results.

Figure 3 shows the detection results under significant illumination variation. The scenes in the figures are dim and the object colors are similar to the background, yet the results illustrate that the model is little affected by illumination changes and retains good detection ability. Figure 4 presents images with fuzzy distortion caused by underwater shooting, and the detection results reveal that the model can still detect the objects in these blurred scenes, indicating good robustness. Figure 5 shows the detection of dense small objects: the pbag, pbottle, and tire instances are extremely small, yet all of them are detected precisely, illustrating the model's outstanding detection ability for small objects. The detection results in the presence of occlusion are shown in Fig. 6; the FSA network model is able to detect objects correctly even when they are occluded by other objects or incomplete.

Figure 3
figure 3

Detection results under illumination changes.

Figure 4
figure 4

Fuzzy image detection results.

Figure 5
figure 5

Small object detection results.

Figure 6
figure 6

Object under occlusion detection results.

The left columns of Figs. 7 and 8 show the FSA detection results, and the right columns show the YOLOv7 detection results. As can be seen from Fig. 7, the FSA results have more accurate bounding boxes, whereas YOLOv7 sometimes produces boxes that are too large or too small. This indicates that the prediction boxes obtained with the CIoU loss used in this paper are more consistent with the true positions of the objects.

Figure 7
figure 7

Detection results (boundary box).

Figure 8
figure 8

Detection results (accuracy).

Figure 8 demonstrates that the detection results of the FSA networks model are more accurate and have better detection effects. The attention module increases the receptive field of the feature map, strengthens the network's feature extraction for small objects, and preserves more feature information of the object regions. The combination of the auxiliary head and the lead head weakens the interference of background noise, fuses shallow and deep features, and improves global feature extraction, giving better performance for object detection in complex backgrounds: missed and false detections are reduced, and the model is less affected by environmental and illumination changes. Consequently, the FSA networks model has higher detection accuracy and more precise detection results. Overall, it obtains more accurate object positions, is more robust to illumination changes, and clearly improves object detection in sophisticated backgrounds, with an inference speed of 72.15 FPS.

Discussion

In this paper, we propose a one-stage detector based on YOLO: full stage auxiliary networks with auxiliary focal loss and a multi-attention module. It aims to improve the detection of dense small objects in complex backgrounds for real-time submarine garbage object detection tasks. To avoid overfitting and improve the generalization ability of the model, the submarine garbage dataset is augmented with left–right flipping, mosaic, mix-up, and other strategies. Channel streaming and a cross-stage connection strategy are then used to obtain all features of each stage as hierarchical cross features, and the criss-cross attention module added afterwards better extracts the deep abstract features of the full stage by calculating the distances between intra-class and inter-class features, making the obtained features focus more on the dense features of small objects. In the regression stage, the auxiliary focal loss function is used to calculate object class and confidence, balancing the problem of unbalanced positive and negative samples, focusing training on difficult samples, and improving overall detection accuracy. The experimental results demonstrate that the FSA networks achieve state-of-the-art performance compared with mainstream networks while ensuring efficient inference, and can be applied to real-time object detection tasks.

Although the FSA networks proposed in this paper perform well in terms of accuracy, the number of parameters and the computational effort increase considerably because an attention module is introduced at the end of each FSS module, and the effect of model width is not discussed. Future work could therefore investigate how to reduce the computational overhead brought by the attention module while increasing the model width.

Methods

Overall architecture

The detector proposed in this paper, called full stage auxiliary networks (FSA networks), is based on the YOLO detection framework. As shown in Fig. 9, the images undergo data enhancement before being sent to the backbone. Pi (i ∈ [1, 6]) indicates that the feature map output by the current layer is 1/2^i of the original image size; after the group SPPCSPC convolution operation, the feature map fed to the head is 1/64 of the original image size. FSS-i indicates the full stage shortcut convolution operation of the current layer. In the backbone and head, FSS is used to extract shallow and deep features respectively, and an attention mechanism is added behind the FSS in the head to obtain image context information along the vertical and horizontal paths of each pixel, so that the model can focus more on capturing feature information of dense small object regions and reduce the noise interference of complex backgrounds. The outputs of P3, P4, P5, and P6 of the FSS in the backbone are used as auxiliary heads in the regression analysis of classification, location, and confidence, and their loss is calculated together with the lead head of each FSS-CCA output layer.
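The exact layer configuration of the group SPPCSPC module is given by the architecture in Fig. 9; the following is only an illustrative PyTorch sketch modeled on YOLOv7's SPPCSPC, with grouped 3 × 3 convolutions added to reduce parameters. The channel widths, group count, and pooling kernel sizes (5, 9, 13) are assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class GroupSPPCSPC(nn.Module):
    """Simplified sketch of a grouped SPPCSPC block: a CSP-style split, spatial
    pyramid pooling on one branch, a shortcut branch, and grouped 3x3
    convolutions to cut parameters. Group count is illustrative."""
    def __init__(self, c_in, c_out, groups=4, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_mid = c_out // 2
        self.branch = nn.Sequential(
            conv_bn_act(c_in, c_mid, 1),
            conv_bn_act(c_mid, c_mid, 3, groups=groups),
            conv_bn_act(c_mid, c_mid, 1),
        )
        self.shortcut = conv_bn_act(c_in, c_mid, 1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        self.fuse_spp = nn.Sequential(
            conv_bn_act(c_mid * (len(pool_sizes) + 1), c_mid, 1),
            conv_bn_act(c_mid, c_mid, 3, groups=groups),
        )
        self.fuse_out = conv_bn_act(c_mid * 2, c_out, 1)

    def forward(self, x):
        y = self.branch(x)
        y = torch.cat([y] + [p(y) for p in self.pools], dim=1)   # pyramid pooling
        y = self.fuse_spp(y)
        return self.fuse_out(torch.cat([y, self.shortcut(x)], dim=1))  # CSP merge
```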

Figure 9
figure 9

The structure of full stage auxiliary networks.

Full stage convolution

Different from the backbones of CSPNet and YOLOv7, the HS (Fig. 10) and FS (Fig. 11) modules preserve DenseNet's advantage of feature reuse while preventing excessive repetitive gradient information from being transferred and learned, by truncating the gradient flow through a hierarchical feature fusion strategy. First, the upper feature map is split into two parts: one part goes through the stage and transition layers, and the other part is concatenated with the transmitted feature map in the next stage. The module extends the number of channels and bases of the computational module through group convolution, uses channel streaming and a cross-stage connection strategy, retains all features of the upper layer, and fuses the channel features of each stage; finally, the output has twice as many channels as the input. This allows the module to acquire more features in depth and width at the same time, better preserves the actual feature structure of the object, and makes the model more robust with stronger generalization ability.
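As an illustration of the channel splitting and full-stage concatenation described above, the following PyTorch sketch shows one possible form of an FS-style block; the number of stages, kernel sizes, and the exact transition layout are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FullStageBlock(nn.Module):
    """Illustrative full-stage (FS) style block: the input is split along the
    channel dimension; one half passes through stage convolutions while the
    other half is carried forward unchanged, every stage's features are kept
    and fused, and the output has twice the input channels."""
    def __init__(self, channels, n_stages=2):
        super().__init__()
        half = channels // 2
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.SiLU(),
            )
            for _ in range(n_stages)
        )
        self.transition = nn.Sequential(
            nn.Conv2d(half * (n_stages + 1), channels * 2 - half, 1, bias=False),
            nn.BatchNorm2d(channels * 2 - half),
            nn.SiLU(),
        )

    def forward(self, x):
        carry, y = torch.chunk(x, 2, dim=1)      # split into carried and processed halves
        feats = [y]
        for stage in self.stages:
            y = stage(y)
            feats.append(y)                      # keep every stage's features (full stage)
        fused = self.transition(torch.cat(feats, dim=1))
        return torch.cat([carry, fused], dim=1)  # output = 2x input channels
```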

Figure 10
figure 10

Half stage convolution module.

Figure 11
figure 11

Full stage convolution module.

Figure 9 shows that each FS module in the head must be upsampled in order to obtain feature maps at various scales. However, upsampling adds irregular pixels, which causes the image to lose some of its finer details. To better reuse the features in the backbone and compensate for the information lost by upsampling, the FS-i feature map in the backbone is transmitted to the FS-i layer corresponding to Pi in the head. A shortcut connects this residual with the features at the end of the FS module, thereby employing the full features both before and after upsampling. Therefore, the FSS (Full Stage Shortcut) module (Fig. 12) is used throughout the backbone and head in this paper.
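The following minimal sketch illustrates how such a backbone-to-head shortcut could be realized; the inner convolution stands in for the FS module, and the 1 × 1 projection used to match channel widths is an assumption.

```python
import torch
import torch.nn as nn

class FullStageShortcut(nn.Module):
    """Sketch of the FSS idea: a head-side full-stage style block whose output is
    summed with the backbone feature map of the same Pi scale (a residual-style
    shortcut), compensating for detail lost during upsampling. Channel widths
    and the stand-in block are illustrative."""
    def __init__(self, head_channels, backbone_channels):
        super().__init__()
        self.fs = nn.Sequential(                              # stand-in for the FS module
            nn.Conv2d(head_channels, head_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(head_channels),
            nn.SiLU(),
        )
        self.project = nn.Conv2d(backbone_channels, head_channels, 1, bias=False)

    def forward(self, head_feat, backbone_feat):
        # head_feat: upsampled feature map in the head (scale Pi)
        # backbone_feat: FS-i output from the backbone at the same scale
        return self.fs(head_feat) + self.project(backbone_feat)
```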

Figure 12
figure 12

Full stage shortcut convolution module.

Attention module

In recent years, attention models have been widely used in image processing46, speech recognition47, natural language processing48, and other fields49. The essence of an attention module is a set of weight coefficients learned autonomously by the network, which emphasizes regions of interest while suppressing irrelevant background regions in a "dynamic weighting" manner.

Therefore, in order to reduce GPU memory occupancy, allow a larger batch size, and improve detection accuracy, this paper uses the CCA50 (Criss-Cross Attention) module to upgrade the model. Given a local feature map \(\text{F} \in \mathbb{R}^{C \times W \times H}\), two feature maps G and K are generated by two 1 × 1 convolution layers, where \(\text{G, K} \in \mathbb{R}^{C^{\prime} \times W \times H}\) and C′ is the number of channels, which is less than C for dimension reduction. After obtaining G and K, the three-dimensional feature map of shape C′ × H × W can easily be reshaped into a two-dimensional C′ × (H × W) matrix. The attention map \(\text{A} \in \mathbb{R}^{(H+W-1) \times W \times H}\) is generated by the Affinity operation. For each position u in the feature map G, a vector \(\text{G}_{u} \in \mathbb{R}^{C^{\prime}}\) of dimension C′ can be obtained. At the same time, the set \(\Omega_{u} \in \mathbb{R}^{(H+W-1) \times C^{\prime}}\) can be obtained from the feature map K, consisting of the positions in the same row or column as u.

The features acquired by the FSS module are hierarchical cross features; therefore, adding the CCA module after the FSS module (Fig. 13) yields a category consistency loss and better extracts deeper abstract features by calculating the distances between intra-class and inter-class features while preserving the feature structure.
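For reference, the following sketch reproduces the core of the criss-cross attention operation along the lines of the public CCNet reference implementation; the reduction C′ = C/8 and the learned residual scale are conventions of that implementation, not values reported for the FSA networks.

```python
import torch
import torch.nn as nn

def inf_diag(B, H, W, device):
    # -inf on the diagonal: the vertical path already covers position u itself,
    # so the horizontal affinity with u is masked to avoid counting it twice.
    return -torch.diag(torch.full((H,), float("inf"), device=device)).unsqueeze(0).repeat(B * W, 1, 1)

class CrissCrossAttention(nn.Module):
    """Criss-cross attention: each position attends only to positions in its own
    row and column (H + W - 1 affinities per pixel), giving dense context at
    low memory cost."""
    def __init__(self, in_dim):
        super().__init__()
        self.query_conv = nn.Conv2d(in_dim, in_dim // 8, 1)   # G in the text (C' = C/8)
        self.key_conv = nn.Conv2d(in_dim, in_dim // 8, 1)     # K in the text
        self.value_conv = nn.Conv2d(in_dim, in_dim, 1)
        self.softmax = nn.Softmax(dim=3)
        self.gamma = nn.Parameter(torch.zeros(1))              # learned residual scale

    def forward(self, x):
        B, _, H, W = x.size()
        q, k, v = self.query_conv(x), self.key_conv(x), self.value_conv(x)
        # rearrange so that bmm computes column-wise (H) and row-wise (W) affinities
        q_H = q.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H).permute(0, 2, 1)
        q_W = q.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W).permute(0, 2, 1)
        k_H = k.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H)
        k_W = k.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W)
        v_H = v.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H)
        v_W = v.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W)
        energy_H = (torch.bmm(q_H, k_H) + inf_diag(B, H, W, x.device)).view(B, W, H, H).permute(0, 2, 1, 3)
        energy_W = torch.bmm(q_W, k_W).view(B, H, W, W)
        attn = self.softmax(torch.cat([energy_H, energy_W], dim=3))  # softmax over the H + W affinities
        att_H = attn[:, :, :, 0:H].permute(0, 2, 1, 3).contiguous().view(B * W, H, H)
        att_W = attn[:, :, :, H:H + W].contiguous().view(B * H, W, W)
        out_H = torch.bmm(v_H, att_H.permute(0, 2, 1)).view(B, W, -1, H).permute(0, 2, 3, 1)
        out_W = torch.bmm(v_W, att_W.permute(0, 2, 1)).view(B, H, -1, W).permute(0, 2, 1, 3)
        return self.gamma * (out_H + out_W) + x               # residual back onto the FSS features
```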

Figure 13
figure 13

Full stage shortcut criss-cross attention module.

Auxiliary focal loss function

The one-stage method discards the candidate-box generation stage in order to improve detection speed and directly classifies the anchor boxes at a fine-grained level, so many boxes are predicted but few contain a correct object, leading to the category imbalance problem. To solve this problem, Lin26 proposed the focal loss function on the basis of the binary balanced cross-entropy loss, adding a weight factor for each category to address the imbalance between positive and negative samples, and using a modulating factor (with γ ≥ 0 an adjustable focusing parameter) to reduce the weight of easy-to-classify samples, focus training on difficult samples, and prevent easy samples from dominating the gradient. The definition is as follows:

$${F}_{L}=\left\{\begin{array}{ll}-\alpha {\left(1-p\right)}^{\gamma }\mathrm{log}\left(p\right)& if\; y=1\\ -\left(1-\alpha \right){p}^{\gamma }\mathrm{log}\left(1-p\right)& if\; y=0\end{array}\right.$$
(1)
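
A compact PyTorch version of Eq. (1) is sketched below; the values α = 0.25 and γ = 2.0 are the common defaults from the focal loss paper and are not the settings reported for this model.

```python
import torch
import torch.nn.functional as F

def focal_loss(pred_logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss corresponding to Eq. (1): alpha balances positives and
    negatives, and the (1 - p_t)^gamma factor down-weights easy examples so
    training focuses on hard ones. `targets` is a float tensor of 0/1 labels."""
    p = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # p if y=1, 1-p if y=0
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```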

To improve the overall accuracy and performance of the model, this paper uses the FSS module in the backbone to generate an auxiliary head for auxiliary training, while the lead head generated by the FSS-CCA module produces the main prediction result. Different from YOLOv7, the lead head and the auxiliary head (Fig. 14) participate in optimizing the model simultaneously and are assigned different weights (Fig. 15) for the classification, confidence, and regression losses; this reduces the impact of the auxiliary head's "coarse" labels and prevents a reduction in the lead head's detection accuracy. In the actual calculation, both the lead head and the auxiliary head take the IoU of the top 20 samples for summation, and the classification and regression loss weights are set to 1:0.25. Similar to YOLOv5, the confidence loss is weighted by a factor of 1/4 between successive output scales of the detection head. According to Fig. 9, the output contains 4 scales (1/8, 1/16, 1/32, 1/64), making the model well suited to small object detection in multi-scale complex backgrounds.

Figure 14
figure 14

Lead head and auxiliary head.

Figure 15
figure 15

Loss with weighted different head.

Because of the additional training of the auxiliary head, the focal loss terms of both heads must be added for synchronous training, and the modified auxiliary focal loss function is as follows:

$${F}_{L\_class}=b\cdot \left({F}_{L\_class\_lead}+\frac{1}{4}{F}_{L\_class\_aux}\right)$$
(2)
$${F}_{L\_conf}=b\cdot \sum_{i=0}^{l-1}\left[b(i){F}_{L\_conf\_lead}+{\frac{b(i)}{4}F}_{L\_conf\_aux}\right]$$
(3)

where b is the batch size during training, l is the number of groups of detection heads (in this paper there are 4 groups of auxiliary and lead heads, so l = 4), and b(i) = [4, 1, 1/4, 1/16] are the balance factors of the auxiliary head and lead head at each scale.
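The following sketch shows how Eqs. (2) and (3) could be combined in code; the dictionary layout of the per-scale losses is an assumption made for illustration and does not reflect the authors' implementation.

```python
def auxiliary_focal_loss(lead_losses, aux_losses, batch_size):
    """Sketch of Eqs. (2)-(3): per-scale lead-head and auxiliary-head losses are
    combined with a 1 : 1/4 weighting, and the confidence terms additionally use
    the balance factors b(i) = [4, 1, 1/4, 1/16] over the 4 output scales
    (1/8, 1/16, 1/32, 1/64). `lead_losses` / `aux_losses` are assumed to be
    dicts holding per-scale lists of classification ("cls") and confidence
    ("conf") losses."""
    balance = [4.0, 1.0, 0.25, 0.0625]                       # b(i) for the 4 scales
    cls = batch_size * (sum(lead_losses["cls"]) + 0.25 * sum(aux_losses["cls"]))
    conf = batch_size * sum(
        b * lead + (b / 4.0) * aux
        for b, lead, aux in zip(balance, lead_losses["conf"], aux_losses["conf"])
    )
    return cls + conf
```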

Materials: dataset and health check

To validate the robustness and generalization of the proposed model, an open-source submarine garbage dataset is used, with all types of labeled objects learned without any human screening, and the ratio of training, validation, and test sets set to 0.7:0.2:0.1 (Table 6). The dataset consists of 5,136 images of marine debris in 15 categories, with 2.3 labels per image on average. The aspect ratio distribution for each category is shown in Table 7; most images are close to the median width × median height of 300 × 199 pixels, and a few categories, such as cellphone, have high aspect ratios.

Table 6 Submarine garbage datasets.
Table 7 Aspect ratio distribution.

The distribution of the original dataset shows a serious sample imbalance among the labeled images, so strategies are needed to expand the dataset, such as Random Erasing Data Augmentation51, RandAugment52, Mixup53, Cutout54, CutMix55, Mosaic32, Copy-Paste56, etc.

In this paper, we adjust the hue, saturation, and value in the HSV color model and enhance the images by rotating them by 10 degrees and shifting them within the range [−0.2, 0.2]. At the same time, we expand the dataset using left–right flipping, Mosaic, Mix-Up, and Copy-Paste. Images after data enhancement are shown in Fig. 16.
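The following Albumentations sketch illustrates the per-image portion of such an augmentation pipeline (HSV jitter, 10-degree rotation, ±0.2 shift, left–right flip); the probability values and jitter limits are assumptions, and Mosaic, Mix-Up, and Copy-Paste are applied at the batch level in YOLO-style training loops and are not shown.

```python
import numpy as np
import albumentations as A

# Illustrative per-image augmentation pipeline; limits and probabilities are assumptions.
transform = A.Compose(
    [
        A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=20, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.2, scale_limit=0.0, rotate_limit=10, p=0.5),
        A.HorizontalFlip(p=0.5),                      # left-right flipping
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Placeholder inputs to show the call signature.
image = np.zeros((416, 416, 3), dtype=np.uint8)       # dummy image
bboxes = [(0.5, 0.5, 0.2, 0.3)]                        # one YOLO-format box (cx, cy, w, h)
class_labels = ["pbottle"]
augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
```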

Figure 16
figure 16

Images after data enhancement.