Introduction

The oceans are a vast reservoir of energy and resources and an indispensable foundation for humankind’s sustainable development. The rich resources of the oceans can be rationally developed and utilized, and the oceans, lakes, and rivers provide the environment for much human economic activity. Aquaculture is a rapidly expanding sector of agriculture, contributing significantly to both the fishery economy and the national economy; it can vigorously drive the economic development of coastal areas and gradually become a pillar industry. Globally, the production of farmed aquatic products rises yearly, and the scale of farming and the quality of its output determine the industry’s financial returns. Traditional low-density, low-yield aquaculture methods, limited in both the quality and yield of aquatic products, can no longer keep up with the development demands of the marine economy. As a result, the aquaculture industry has incorporated modern information and intelligent technologies to optimize the aquaculture chain. In particular, computer vision technologies can modernize and automate production, improve efficiency, promote the development of the fishery economy, and save labor and material resources.

Since 2014, deep learning-based target detection technology has advanced quickly, and its range of applications keeps growing, including medical image analysis1,2, road surface collapse detection3, crop pest detection4, ship target detection5, and automatic driving6. Mainstream underwater target recognition technologies currently include sonar image detection7 and underwater optical image detection8,9,10. Sonar equipment is expensive and introduces a certain degree of noise pollution that affects the growth and development of aquatic organisms, so sonar-based underwater target recognition is poorly suited to aquaculture. Optical images taken by underwater cameras, combined with optical image processing, can achieve accurate recognition of submerged targets; however, turbidity severely degrades detection accuracy. Reference11 reviews the problems encountered in target detection in turbid water and the particular influence of such environments on target recognition, classifying the main problems into image degradation and target recognition in turbid water. Reference12 proposed a foreground detection method for real-time monitoring of outdoor swimming pools, focusing on robust, high-accuracy dynamic aquatic background modeling and nighttime reflection elimination; foreground visibility was enhanced by new background modeling techniques and decimation-based filtering and was evaluated in real scenes. Reference13, addressing the obstacles and external disturbances faced by AUVs in ocean tasks, established kinematic and dynamic models and proposed a trajectory tracking method that accounts for obstacles and uncertainties. Underwater cameras mounted on underwater robots and other automated equipment can complete the feeding and harvesting of specific aquatic products without polluting the underwater environment. As the technology advances, so do camera precision and frame rate, which can fully satisfy the automation requirements of the aquaculture sector.

Target detection techniques built on deep learning have become increasingly popular as a result of deep learning’s advancement and superior performance. Classical target recognition algorithms are mainly divided into single-stage and multi-stage approaches. Single-stage approaches include the YOLO (You Only Look Once) series14, SSD15, and RetinaNet16. Single-stage detectors locate targets in a single feature-extraction pass, so their recognition speed is faster, but their accuracy is inferior to multi-stage methods. Multi-stage methods include RCNN17, Fast RCNN18, Faster RCNN19, and Mask RCNN20. They trade inference speed for detection accuracy by first extracting candidate bounding boxes from the input image and then refining those candidate regions to obtain the final detections, making them suitable for domains with high accuracy requirements but no real-time constraint. Even though its prediction accuracy still trails the two-stage detectors, the YOLO series’ faster inference has made it a mainstay in industry. Some advanced object detectors are difficult to apply to new domains without a pre-trained model for the corresponding domain, so the most widely used real-time object detector remains the YOLO family. Compared with the YOLOv8 and YOLOv9 networks, YOLOv5 has stronger generalization ability, a more stable network structure, more mature technology, better fault tolerance, and better real-time efficiency; it is therefore more robust. Furthermore, a real-time object detection network favors a lightweight structure. In summary, we choose YOLOv5 as the basic framework of the WBi-YOLOSF target detection network.

Traditional aquatic organism identification can use machine learning techniques to recognize the class of marine organisms under study. For example, classifiers such as Naive Bayes21 (NB), Decision Tree22 (DT), and Support Vector Machine (SVM) rely on manually selected features, i.e., traits chosen according to human judgment. Such feature selection is highly subjective and insufficient, and it tends to overlook essential features, so the accuracy of these methods is limited. In the last several years, researchers have made rapid progress implementing target detection algorithms with deep learning.

In 2020, Song et al.23 achieved mAP values greater than 90% on a small dataset of aquatic organisms by combining Mask R-CNN with the MSRCR technique for image augmentation; although detection accuracy increased, the training period was lengthy. In the same year, Han et al.24 integrated a refined YOLOv3 algorithm into an underwater robot to detect enhanced marine-creature images in real time. Nevertheless, the method suffers from missed detections and cannot identify marine species with hazy edges. Using an enhanced YOLOv4 network, Mao et al.25 presented a model in 2021 for detecting marine species in shallow waters; an Embedded Connection (EC) component was built into the YOLOv4 network, lowering computational work while increasing detection accuracy. Iqbal et al.26 presented an end-to-end CNN in 2022 to classify fish behavior into two groups, normal and hungry, assessing the CNN’s performance by varying the number of fully connected layers and the use of max pooling. According to the experimental results, including max pooling in the CNN’s shallow architecture and adding three fully connected layers increased the method’s accuracy by 10%, reaching 98%. Kaya et al.27 developed the CNN-based model IsVoNet8 to categorize fish species, demonstrating 91.37% classification accuracy in 2023. The same year, Ren et al.28 used LIBS and Raman spectroscopy to create a new method of fish species identification, combining two machine learning algorithms, SVM and CNN, with Raman spectra from 13 fish species; the proposed CNN model achieved the highest classification accuracy of 96.2%. Even though underwater target recognition has made great strides, there is still much room for improvement, particularly in aquatic-creature detection, localization, species identification, and quantitative statistics. At the same time, underwater target recognition demands high real-time performance, so designing a fast, high-accuracy underwater target detection model is vital.

In economics and engineering, optimization algorithms can be used to solve real-world problems, such as using improved Pareto Front Evolution to solve multiobjective optimization problems29. In deep learning, the diversity of swarm intelligence algorithms allows them to be applied to different models to solve various problems. For example, the classical particle swarm optimization (PSO)30 algorithm is a robust stochastic optimization method whose continuous optimization process accommodates multiple objectives and frequent changes. Inspired by the foraging behavior of bird flocks, it is used to solve continuous nonlinear optimization problems; although it may be slower than some gradient-based algorithms, it performs well on multimodal, nonlinear, multidimensional problems and is easy to tune and apply. In recent years, researchers have proposed many novel metaheuristics. The Liver Cancer Algorithm (LCA)31 simulates the growth and takeover process of liver tumors. The Parrot Optimizer (PO)32 simulates the relationship between a parrot and its owner, exhibiting four distinct parrot behavior patterns. The Slime Mould Algorithm (SMA)33 simulates slime mold foraging behavior and morphological changes: the slime mold enhances cytoplasmic flow by approaching food through a biological oscillator, and the higher the food concentration, the faster the flow. The Artemisinin Optimization (AO)34 algorithm, inspired by the artemisinin drug treatment process for malaria, conceptualizes drug particles as the search agents of the algorithm, whose overall set constitutes the solution set. The Moth Swarm Algorithm (MSA)35 represents the feasible solution and fitness value of the problem to be optimized by the position and luminous intensity of a light source, with the moths mainly divided into exploration moths, detection moths, and observation moths. The Hunger Games Search (HGS)36 algorithm, inspired by the foraging behavior of animals, emphasizes the role of the hunger drive in decision-making and demonstrates its performance on optimization problems by simulating the cooperation and competition of social animals. The Runge Kutta optimizer (RUN)37 is based on the fourth-order Runge-Kutta method, takes the gradient as the search direction, and establishes population update rules based on an enhanced individual-quality mechanism. The Carnivorous Plant Algorithm (CPA)38 simulates the ability of carnivorous plants to adapt and survive in harsh environments, such as trapping insects for food and pollinating for reproduction. The weighted mean of vectors algorithm (INFO)39 achieves optimization through different weighted-average rules over vectors and features strong optimization ability and fast convergence. The Harris Hawks Optimization (HHO)40 algorithm is derived from the cooperative hunting behavior of Harris hawks and their surprise-pounce hunting style; its optimization process includes three stages: exploration, transition from exploration to exploitation, and exploitation. The Rime Optimization Algorithm (RIME)41 performs its search by simulating the motion of soft-rime particles and its exploitation by simulating crossover behavior between hard-rime agents.

This research has practical value in underwater object detection and can be embedded into hardware devices to automate fisheries. In summary, the main contributions of our work are as follows:

  • In this study, an underwater target detection dataset was created, which included 2108 underwater images containing underwater biological targets collected by underwater image sensors, and these images included 15 common aquatic species. The dataset is divided into 1920 training and 188 validation set images. The aquaculture dataset is available at the following link: https://github.com/muyujie/Aquatic-dataset.

  • Images captured by underwater cameras are often of low quality: low contrast, high noise, color deviation from the real world, and uneven or dark brightness. An underwater image quality improvement technique is proposed to solve these issues. It raises image quality, automatically adjusts brightness, and corrects color deviation, significantly improving the accuracy of subsequent target recognition.

  • Because underwater image acquisition conditions are limited, the dataset is small. A data augmentation method is proposed to optimize the training phase of the model and improve detection accuracy, especially for small targets, addressing the low detection accuracy of small and aggregated underwater targets.

  • In this paper, the swarm intelligence algorithm is introduced into the underwater target detection model for the first time to optimize the hyperparameters of the network rather than manually setting the initial hyperparameters for the network to learn by itself. The experimental results show that the optimization method accelerates the convergence speed of model training and dramatically improves the algorithm’s accuracy. We use the swarm intelligence algorithm to find the global optimal solution of the network parameters and verify the application value of the swarm intelligence algorithm. This will be demonstrated in the experimental section.

  • In this paper, we propose a new object detection network: WBi-YOLOSF. Based on the YOLO series framework, it innovatively introduces a new feature-extraction structure, AU-BiFPN, which enhances the network’s feature extraction ability and alleviates the gradient information bottleneck. In addition, echoing the small-target-oriented improvements of the data augmentation method, the Varifocal Loss (a focal-loss variant) is introduced to further improve the prediction accuracy of dense targets.

This is how the remainder of the paper is structured. The “Materials and Methods” section describes the dataset, the dataset preprocessing process, and the WBi-YOLOSF target detection network. The “Results” section shows the experimental results. The “Discussion” section analyzes and discusses the experimental results. The “Conclusions” section summarizes the work of this paper.

Materials and methods

Experimental dataset and its pre-processing

Dataset preparation

This study creates an underwater biological target detection dataset with carefully annotated biological targets. Model training aims to accurately locate and classify the biological targets covered by the dataset. The dataset consists of 2108 images collected by underwater image sensors, divided into 1920 training images and 188 validation images, and contains 15 aquatic species: abalone, carp, salmon, jellyfish, scallop, perch, silver pomfret, catfish, grouper, shrimp, tilefish, crab, squid, yellow croaker, and turbot. These biological targets have economic value, and this research can further promote the development of underwater target detection and the improvement of fishery automation. Samples from the dataset are shown in Fig. 1.

Figure 1
figure 1

The example of dataset samples. (a) Abalone; (b) Carp; (c) Salmon; (d) Jellyfish; (e) Scallop; (f) Perch; (g) Silver pomfret; (h) Catfish; (i) Grouper; (j) Shrimp; (k) Tilefish; (l) Crab; (m) Squid; (n) Yellow croaker; (o) Turbot.

Underwater image enhancement network

Underwater images captured by underwater cameras suffer from varying degrees of quality degradation, which affects the accuracy of target detection, so it is essential to pre-process underwater images beforehand. Underwater optical images generally exhibit three types of issues: (1) uneven brightness or overall darkness; (2) wavelength-dependent absorption and scattering by the water medium, which gives underwater images a blue-green cast and a characteristic color deviation; and (3) absorption and scattering of propagating light, which causes fogging and reduced contrast. For these three types of problems, this paper proposes an underwater image enhancement network that can better restore degraded underwater images. The backbone structure of the underwater image enhancement network is shown in Fig. 2.

Figure 2
figure 2

Backbone structure of underwater image enhancement network.

First, the white balance algorithm is used to enhance the contrast and adjust the hue of the image; its principle is given in Eq. (1).

$$\begin{aligned} \left\{ \begin{array}{l} C\left( {R'} \right) = C\left( R \right) * \frac{{{\overline{R}} + {\overline{G}} + {\overline{B}} }}{{3{\overline{R}} }}\\ C\left( {G'} \right) = C\left( G \right) * \frac{{{\overline{R}} + {\overline{G}} + {\overline{B}} }}{{3{\overline{G}} }}\\ C\left( {B'} \right) = C\left( B \right) * \frac{{{\overline{R}} + {\overline{G}} + {\overline{B}} }}{{3{\overline{B}} }} \end{array}\right. \end{aligned}$$
(1)

where \(C\left( R \right) \), \(C\left( G \right) \), and \(C\left( B \right) \) represent the R, G, and B channel components of the input image; \(C\left( {R'} \right) \), \(C\left( {G'} \right) \), and \(C\left( {B'} \right) \) represent the output channel components; and \({\overline{R}}\), \({\overline{G}}\), and \({\overline{B}}\) represent the mean values of the image over the three channels.
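The white-balance step condenses to a few lines. Below is a minimal NumPy sketch of the gray-world correction in Eq. (1); the function name and the use of 8-bit RGB arrays are illustrative assumptions, not the authors’ released code.

```python
import numpy as np

def gray_world_white_balance(img: np.ndarray) -> np.ndarray:
    """Gray-world white balance per Eq. (1): scale each channel so its
    mean matches the overall gray mean (R + G + B) / 3."""
    img = img.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)   # per-channel means
    gray_mean = channel_means.mean()                  # (mean_R + mean_G + mean_B) / 3
    balanced = img * (gray_mean / channel_means)      # per-channel gain
    return np.clip(balanced, 0, 255).astype(np.uint8)
```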

Second, the image’s brightness is adjusted by an improved gamma correction. Standard gamma correction is ineffective for overly bright or dark regions because of its limited range of gamma values; the improved gamma correction handles both kinds of uneven brightness, as given in Eqs. (2)–(5).

$$\begin{aligned} O\left( {x,y} \right)&= 255 \times {\left( {\frac{{i\left( {x,y} \right) }}{{255}}} \right) ^\gamma } \end{aligned}$$
(2)
$$\begin{aligned} \gamma 1&= \frac{1}{{1 + \left( {1 - \theta \times \frac{m}{{255}}} \right) \times \cos \left( {\pi \times \frac{{L\left( {x,y} \right) }}{{255}}} \right) }} \end{aligned}$$
(3)
$$\begin{aligned} \gamma 2&= \frac{1}{{1 + \left( {1 - \theta \times \left( {1 - \frac{m}{{255}}} \right) } \right) \times \cos \left( {\pi \times \frac{{L\left( {x,y} \right) }}{{255}}} \right) }} \end{aligned}$$
(4)
$$\begin{aligned} \gamma&= \left\{ \begin{array}{l} \gamma 1,\frac{m}{{255}} \le 0.5\\ \gamma 2,\frac{m}{{255}} > 0.5 \end{array}\right. \end{aligned}$$
(5)

where \(O\left( {x,y} \right) \) denotes the pixel value of the image after improved gamma correction, m denotes the average value of the pixels of the input image, \(L\left( {x,y} \right) \) denotes the value of the pixels of the input image, and \(\theta =0.6\) for the best correction effect.
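To make the piecewise rule of Eqs. (2)–(5) concrete, the following hedged NumPy sketch applies the improved gamma correction to a single-channel (grayscale or luminance) image; treating the correction channel-wise is our assumption.

```python
import numpy as np

def improved_gamma_correction(L: np.ndarray, theta: float = 0.6) -> np.ndarray:
    """Improved gamma correction, Eqs. (2)-(5), with theta = 0.6 as stated."""
    L = L.astype(np.float64)
    m = L.mean()                                # mean pixel value of the input
    cos_term = np.cos(np.pi * L / 255.0)        # per-pixel modulation term
    if m / 255.0 <= 0.5:                        # dark image -> gamma1, Eq. (3)
        gamma = 1.0 / (1.0 + (1.0 - theta * m / 255.0) * cos_term)
    else:                                       # bright image -> gamma2, Eq. (4)
        gamma = 1.0 / (1.0 + (1.0 - theta * (1.0 - m / 255.0)) * cos_term)
    O = 255.0 * (L / 255.0) ** gamma            # Eq. (2)
    return np.clip(O, 0, 255).astype(np.uint8)
```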

Finally, underwater color deviation is corrected using the unsupervised color correction approach. The algorithm simultaneously and linearly stretches the histograms of the R, G, and B channels in the RGB color model and the S and I channels in the HSI color model, improving contrast and restoring the true color and brightness of the image. A comparison of the original and improved test-set images is shown in Fig. 3.
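The linear stretching underlying this color-correction step can be sketched as follows; stretching each channel between its 1st and 99th percentiles is our assumption for outlier robustness, and the S and I channels of the HSI model would be treated the same way after a color-space conversion.

```python
import numpy as np

def stretch_channel(ch: np.ndarray, low_pct: float = 1.0, high_pct: float = 99.0) -> np.ndarray:
    """Linearly stretch one channel's histogram to the full [0, 255] range."""
    lo, hi = np.percentile(ch, [low_pct, high_pct])
    stretched = (ch.astype(np.float64) - lo) * 255.0 / max(hi - lo, 1e-6)
    return np.clip(stretched, 0, 255).astype(np.uint8)

def correct_color(img_rgb: np.ndarray) -> np.ndarray:
    """Stretch the R, G, and B channels independently."""
    return np.dstack([stretch_channel(img_rgb[..., c]) for c in range(3)])
```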

Figure 3
figure 3

Comparison between original images and enhanced images. (a) Original images; (b) Enhanced images.

To confirm its efficacy, the approach described in this research is compared with four other image enhancement algorithms: MSRCR, UDCP, CLAHE, and Water-Net42. The first three are traditional machine learning techniques, and the last is a deep learning network. Two measures targeted at underwater image enhancement, UIQM43 and UCIQE44, are employed to evaluate image quality. Table 1 reports the experimental results, and the proposed method evidently outperforms the other examined algorithms.

Table 1 Comparison of experimental results of different underwater image enhancement algorithms.

Augmentation for small object detection

During model training, we found that accurately locating small objects is challenging. Underwater biological targets are mostly small and often densely distributed, and the overlap between the prediction box and the ground truth box of a small target is generally lower than the expected intersection-over-union threshold; the accuracy of small-target prediction therefore strongly affects the model’s overall performance. The root cause is that small objects, although numerous and dense, occupy a small proportion of the pixels in the dataset, so the network allocates them too little attention during feature extraction. To address this, this paper proposes a new data augmentation method: Copy-Pasting Strategies. The strategy works as follows: small objects of different shapes and types are extracted from the training set and, after scaling, rotating, and flipping at different scales, pasted onto background images without objects and onto images containing large objects. In addition, small objects are overlaid on large objects, improving the robustness of small-object detection. This method raises the pixel proportion of small targets, but on its own it would lower large-target detection accuracy, which is undesirable. Therefore, the Mosaic data augmentation45 method is fused in, strengthening large-target features at the same time so that small-target accuracy improves without reducing large-target accuracy. Figure 4 illustrates the proposed augmentation process: on average, every two training images are combined with two background images to randomly generate six new training images (the figure does not list all generated images). The augmentation yields 5760 generated training images; the original training data is retained alongside the data generated by the Copy-Pasting Strategies, and the final dataset is expanded to 11520 images. The proposed augmentation significantly enriches the dataset and strengthens its features.
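The pasting step of the Copy-Pasting Strategies can be illustrated as below; the OpenCV-based helper, the transform ranges, and the paste count are our illustrative assumptions (label boxes would be updated with the same transforms).

```python
import random
import cv2
import numpy as np

def paste_small_objects(background: np.ndarray, crops: list, n_paste: int = 5) -> np.ndarray:
    """Paste randomly scaled/flipped small-object crops onto a background image."""
    out = background.copy()
    H, W = out.shape[:2]
    for crop in random.sample(crops, min(n_paste, len(crops))):
        scale = random.uniform(0.5, 1.5)              # random rescale
        crop = cv2.resize(crop, None, fx=scale, fy=scale)
        if random.random() < 0.5:
            crop = cv2.flip(crop, 1)                  # random horizontal flip
        h, w = crop.shape[:2]
        if h >= H or w >= W:
            continue                                  # skip crops larger than the canvas
        y, x = random.randint(0, H - h), random.randint(0, W - w)
        out[y:y + h, x:x + w] = crop                  # paste; new box = (x, y, w, h)
    return out
```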

Figure 4
figure 4

Data augmentation method.

Adaptive anchor box calculation and adaptive image scaling

During training, the model outputs prediction boxes based on the initial anchor boxes, compares them with the ground truth boxes, computes the error, and updates the network parameters by backpropagation, iterating over the training set. To reduce computational cost and enhance the network’s adaptability, an adaptive anchor calculation method is used, which computes the best anchor values for the training set at the start of each training run. In addition, because image resolutions in the dataset differ, images must be uniformly scaled to a standard size before being fed into the model; sizes commonly used in YOLO are 416×416 and 608×608. Since aspect ratios differ, the amount of black border padded after scaling differs as well, and excessive padding causes information redundancy and slows inference. This paper therefore adaptively adds the fewest black borders possible, reducing the amount of computation, as sketched below.
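The sketch below condenses the adaptive scaling in the spirit of YOLOv5’s letterbox: the image is resized with its aspect ratio preserved, then padded only to the nearest stride multiple rather than to a full square; the gray fill value 114 follows YOLOv5 convention and is an assumption here.

```python
import cv2
import numpy as np

def adaptive_letterbox(img: np.ndarray, new_size: int = 416, stride: int = 32) -> np.ndarray:
    """Resize keeping aspect ratio, then add the minimal borders."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)               # uniform scale ratio
    nh, nw = round(h * r), round(w * r)
    pad_h = (stride - nh % stride) % stride           # minimal padding to a stride multiple
    pad_w = (stride - nw % stride) % stride
    img = cv2.resize(img, (nw, nh))
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(img, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=(114, 114, 114))
```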

WBi-YOLOSF target detection network

WBi-YOLOSF network structure

The network structure of WBi-YOLOSF is similar to that of the YOLO networks and is divided into input, backbone, neck, and head. To concentrate the spatial (W and H) information of the image into the channel dimension and act as downsampling without information loss, the input image is first sliced before being fed into the backbone network. A convolutional layer, batch normalization, and the Funnel Activation (FReLU) function make up the CBF module in the backbone. FReLU outperforms activation functions such as ReLU and SiLU on tasks including target identification, semantic segmentation, and image classification, overcoming their insensitivity to spatial cues in visual tasks. An activation function created specifically for vision, FReLU extends ReLU and PReLU to a 2D activation at very little spatial-conditioning overhead and raises small-target detection accuracy. The formula for FReLU is given in Eqs. (6)–(7).

$$\begin{aligned} y&= \max \left( {x_{c,i,j},T\left( {x_{c,i,j}} \right) } \right) \end{aligned}$$
(6)
$$\begin{aligned} T\left( x_{c,i,j} \right)&= x_{c,i,j}^\omega \cdot p_c^\omega \end{aligned}$$
(7)

where \(T\left( x_{c,i,j} \right) \) represents a simple and efficient spatial context feature extractor, \(x_{c,i,j}^\omega \) represents the window centered at the 2D position (i, j) on channel c, and \(p_c^\omega \) represents the parameters of this window shared within the same channel.
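Eqs. (6)–(7) translate directly into a module: the funnel condition T(x) is a depthwise convolution (one spatial window per channel with shared parameters) followed by batch normalization. The PyTorch sketch below follows the published FReLU design; the 3×3 window size is the usual choice.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """FReLU: y = max(x, T(x)), Eqs. (6)-(7)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # depthwise conv = per-channel spatial window with shared parameters p_c
        self.funnel = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.max(x, self.bn(self.funnel(x)))
```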

In the backbone, the image undergoes a series of CBF downsampling and CSP feature-extraction modules to generate a set of feature maps at different resolutions. At the end of the backbone, the SPPCSP module divides the features into two parts: one part is processed by a regular CBF module, and the other by the SPP structure, i.e., max pooling at four different scales. Finally, the two parts are concatenated, which halves the amount of calculation and improves speed. In the neck, this study innovatively proposes a new feature extraction structure, AU-BiFPN; incorporating it into the YOLO framework is the key innovation of this paper, and it is described in detail in the next section. In the head, the RepConv46 structure is introduced. RepConv obtains good performance on the VGG47 structure through reparameterization, raising prediction accuracy without adding parameters or convolutional computation. RepConv uses different network constructions for training and inference: during training, the output is the sum of two branches with different convolutional kernel sizes plus a normalization branch; during inference, the branch parameters are reparameterized into the main branch. The number of channels in the final output feature map is 3 \(\times \) (NC + 5), where 3 denotes three anchor boxes with different aspect ratios, NC is the number of categories, and 5 comprises the two center-point parameters and the two width-height parameters of the anchor box plus a foreground probability. WBi-YOLOSF reduces overfitting with DropBlock48 regularization, derived from the 2017 Cutout49 data augmentation method: whereas Cutout zeroes out portions of the input image, DropBlock applies the same idea to each feature map. Rather than using a fixed zeroing ratio, it starts with a tiny ratio and grows it linearly as training proceeds. Compared with Cutout, DropBlock is more efficient and provides a thorough upgrade to the network’s regularization. Figure 5 displays the core components and network architecture of the WBi-YOLOSF target detection network; for clarity, the AU-BiFPN structure is not drawn in Fig. 5.
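The essence of RepConv’s train/inference split is that parallel linear branches can be fused into one kernel. A simplified sketch, ignoring batch-normalization fusion, merges a 3×3 branch and a 1×1 branch into a single 3×3 convolution; by linearity, the fused convolution reproduces the sum of the two branches exactly.

```python
import torch
import torch.nn.functional as F

def merge_branches(w3x3: torch.Tensor, b3x3: torch.Tensor,
                   w1x1: torch.Tensor, b1x1: torch.Tensor):
    """Fuse a 3x3 conv branch and a 1x1 conv branch into one 3x3 kernel."""
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])   # zero-pad the 1x1 kernel to 3x3
    return w3x3 + w1x1_padded, b3x3 + b1x1    # single reparameterized kernel and bias
```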

Figure 5
figure 5

Network architecture and basic components of WBi-YOLOSF target detection network. (a) network architecture; (b) basic components.

AU-BiFPN feature extraction structure

The feature map’s main feature-extraction task is completed in the neck of the detector. This paper proposes the AU-BiFPN (Auxiliary Weighted Bidirectional Feature Aggregation Network) feature extraction network, which comprises an improved BiFPN50 sub-network and an auxiliary network. The schematic of the improved BiFPN is shown in Fig. 6c: efficient bidirectional cross-scale connections and weighted feature fusion are introduced to aggregate features across the feature map’s resolutions, and each node in Fig. 6 corresponds to features at a different scale. First, the BiFPN in Fig. 6c removes nodes that have only a single input and perform no feature fusion, since they contribute little to the fusion network; excising them yields a simplified PANet bidirectional network. Second, to improve the network’s feature fusion, an extra edge is added from each original input node to the output node at the same level; these added edges correspond to the dashed and red solid arrows in Fig. 5a and add little computational cost. Finally, to achieve higher-level fusion, the feature network layer, i.e., the top-down plus bottom-up bidirectional path, is repeated multiple times; in Fig. 5a this corresponds to Fig. 6c with repeated blocks = 3, i.e., the feature network layer is repeated three times. The principle of multi-scale aggregation is shown in Eq. (8).

$$\begin{aligned} {\overrightarrow{P}}^{out} = f\left( {\overrightarrow{P}}^{in} \right) ,\quad {\overrightarrow{P}}^{in} = \left( P_{l_1}^{in},P_{l_2}^{in},\ldots \right) \end{aligned}$$
(8)

where \(P_{l_i}^{in}\) denotes the feature of layer \(l_{i}\); the network aims to find a transformation f that can efficiently aggregate the input features \({\overrightarrow{P} }^{in}\) and output a new set of features \({\overrightarrow{P}} ^{out}\).

For input features with different resolutions, whose importance varies because they contribute differently to the output features, an additional weight should be assigned to each input, and the network should be allowed to learn the value of the weight. Here, the weights are calculated using fast normalized feature fusion with the following formula:

$$\begin{aligned} O = \sum \nolimits _i {\frac{{\omega _i}}{{\varepsilon + \sum \nolimits _j {\omega _j} }} \cdot I_i} \end{aligned}$$
(9)

where \(\omega _i \ge 0\) is ensured by applying the ReLU activation function after each \(\omega _i\), \(\varepsilon = 0.0001\) ensures numerical stability, and the normalization keeps the weights between 0 and 1. In summary, BiFPN integrates bidirectional cross-scale connections with the fast normalized feature fusion weighting method to optimize multi-scale feature fusion in the neck, and the ablation experiments validate the effectiveness of introducing the improved BiFPN network.
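Eq. (9) amounts to a tiny learnable module. The PyTorch sketch below implements fast normalized fusion with ReLU-constrained weights; the class name is ours.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fast normalized feature fusion, Eq. (9)."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # one learnable weight per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)                        # enforce w_i >= 0
        w = w / (w.sum() + self.eps)                  # normalize to (0, 1)
        return sum(wi * x for wi, x in zip(w, inputs))
```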

Figure 6
figure 6

Feature network structure diagram. (a) FPN; (b) PANet; (c) BiFPN.

Improving the BiFPN structure indeed attends to more feature information, but it has a disadvantage that cannot be ignored: as the network deepens, vital information in the image is easily lost, and the deep network may not fully capture all the information related to the predicted target at the output. In that case, the information the network relies on during training is incomplete, which may make the gradient calculation inaccurate, degrade convergence, and reduce the reliability of the training results; this is the information bottleneck problem, and it is why model performance can drop when we optimize the parameters. To let the model achieve its best effect with optimal parameters, an auxiliary multi-level structure is added to the BiFPN architecture. Figure 7 shows the AU-BiFPN network structure diagram. The reversible structure helps the main branch recover lost important information but increases the computational burden of inference; the auxiliary reversible branch can therefore be removed at inference, retaining the original network’s inference speed. At the same time, the backbone can obtain stable gradient information from the reversible branches, which effectively alleviates gradient vanishing. In addition, a multi-level auxiliary information structure is introduced to address the information loss of deep feature pyramids in multi-scale object detection. The structure integrates the gradient information of different branches through an ensemble network and feeds it back to the main branch, enhancing the model’s ability to retain target information, reducing error propagation, and optimizing parameter updates. The principle of the auxiliary reversible branch calculation is as follows.

Figure 7
figure 7

AU-BiFPN network structure diagram.

$$\begin{aligned} {P^{in}} = {f'_\varsigma }\left( {{f_\psi }\left( {{P^{in}}} \right) \cdot M} \right) \end{aligned}$$
(10)

where \({P ^ {in}}\) denotes the input features, M is a dynamic binary mask, \({f '_\varsigma }\) is the inverse transformation of \({{f_\psi }}\), and \(\psi \) and \(\varsigma \) are the parameters of the function f and the inverse function \(f'\), respectively.

SimAM attention mechanism

To further increase the prediction accuracy of the WBi-YOLOSF target detection network, this study incorporates the simple, parameter-free attention module (SimAM) into the neck. The SimAM attention module is lightweight, practical, and simple to use; it attends to information along both channel and spatial dimensions. The 3-D attention weights are computed without any extra parameters, successfully avoiding the growth in model parameters that structural modifications usually cause, and can be obtained from an energy function alone. The lower a neuron’s energy, the more it is distinguished from other neurons and the greater its significance. Consequently, the SimAM attention mechanism has broad applicability and can precisely capture the critical information in image features. Figure 8 depicts the architecture of the SimAM attention mechanism.

Figure 8
figure 8

SimAM attention mechanism structure.

In Fig. 8, the 3-D weights are calculated as follows:

$$\begin{aligned} \mathop X\limits ^ \bullet&= sigmoid\left( {\frac{1}{E}} \right) \odot X \end{aligned}$$
(11)
$$\begin{aligned} E&= \frac{{4\left( {{\sigma ^2} + \lambda } \right) }}{{{{\left( {t - \mu } \right) }^2} + 2{\sigma ^2} + 2\lambda }} \end{aligned}$$
(12)

where X is the input feature and E is the energy function on each channel; the sigmoid function limits possible oversized values of 1/E. t is the value of the input feature, \(t \in X\); \(\lambda \) is the constant \(1e - 4\); and \(\mu \) and \(\sigma ^2\) denote the mean and variance of each channel of X, respectively, calculated by the following formulas:

$$\begin{aligned} \mu&= \frac{1}{M}\sum \nolimits _{i = 1}^M {x_i} \end{aligned}$$
(13)
$$\begin{aligned} {\sigma ^2}&= \frac{1}{M}\sum \nolimits _{i = 1}^M {{{\left( {x_i - \mu } \right) }^2}} \end{aligned}$$
(14)

where \(M = H \times W \) denotes the number of neurons on each channel. The weight of each neuron can be obtained through the above calculation, and introducing this attention mechanism improves the model’s detection accuracy without appreciably increasing the network’s computational burden.
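Because SimAM is parameter-free, Eqs. (11)–(14) map to a short forward pass. The sketch below follows those equations directly (λ = 1e-4); note that it computes 1/E and passes it through the sigmoid, as in Eq. (11).

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention, Eqs. (11)-(14)."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)  # (t - mu)^2, Eq. (13)
        v = d.mean(dim=(2, 3), keepdim=True)               # sigma^2, Eq. (14)
        e_inv = d / (4 * (v + self.lam)) + 0.5             # 1/E per neuron, Eq. (12)
        return x * torch.sigmoid(e_inv)                    # Eq. (11)
```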

Loss function of the model

The object detection loss function comprises classification loss, regression loss, and target (objectness) loss. VFL51 (Varifocal Loss) is selected as the classification loss in WBi-YOLOSF; according to the experimental results, VFL benefits the detection accuracy of dense small targets. Its principle is to treat positive and negative samples asymmetrically: by giving balanced attention to positive and negative samples of different importance, it coordinates the contribution of the two sample types during learning. The VFL is calculated as follows.

$$\begin{aligned}{} & {} L_{class}\left( {p,q} \right) = \left\{ \begin{array}{l} - q\left( {q\log \left( p \right) + \left( {1 - q} \right) \log \left( {1 - p} \right) } \right) ,q > 0\\ - \alpha {p^\gamma }\log \left( {1 - p} \right) ,q = 0 \end{array} \right.{} & {} \end{aligned}$$
(15)

where p is the predicted IoU-aware classification score and q is the target score. When \(q > 0\), VFL applies no hyper-parameters to positive samples, meaning their weights keep their original values and are not attenuated. When \(q = 0\), VFL introduces hyper-parameters for negative samples: \(\gamma \) reduces the weight of negative samples and their influence on the model, and \(\alpha \) prevents excessive attenuation of their weight. This design effectively reduces the contribution of negative samples to the final result.
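A hedged sketch of Eq. (15) follows; α = 0.75 and γ = 2 are common defaults from the Varifocal Loss paper, not values stated here, and the inputs are assumed to be probabilities.

```python
import torch

def varifocal_loss(p: torch.Tensor, q: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Varifocal Loss, Eq. (15). p: predicted IoU-aware score; q: target score."""
    p = p.clamp(1e-6, 1 - 1e-6)                       # numerical stability
    pos = q > 0
    loss = torch.zeros_like(p)
    loss[pos] = -q[pos] * (q[pos] * torch.log(p[pos])
                           + (1 - q[pos]) * torch.log(1 - p[pos]))
    loss[~pos] = -alpha * p[~pos].pow(gamma) * torch.log(1 - p[~pos])
    return loss.mean()
```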

Regression loss uses SIoU LOSS52. SIoU LOSS comprises Angle loss, Distance loss, and Shape loss. SIoU LOSS performs excellently in object detection tasks, especially in scenarios that require accurate bounding box regression. The Angle loss is calculated as follows:

$$\begin{aligned} \Lambda&= \cos \left( {2 * \left( {\arcsin \left( {\frac{{{c_h}}}{\sigma }} \right) - \frac{\pi }{4}} \right) } \right) \end{aligned}$$
(16)
$$\begin{aligned} \frac{{{c_h}}}{\sigma }&= \sin \left( \alpha \right) \end{aligned}$$
(17)
$$\begin{aligned} \sigma&= \sqrt{{{\left( {b_{{c_x}}^{gt} - {b_{{c_x}}}} \right) }^2} + {{\left( {b_{{c_y}}^{gt} - {b_{{c_y}}}} \right) }^2}} \end{aligned}$$
(18)
$$\begin{aligned} {c_h}&= \max \left( {b_{{c_y}}^{gt},{b_{{c_y}}}} \right) - \min \left( {b_{{c_y}}^{gt},{b_{{c_y}}}} \right) \end{aligned}$$
(19)

where \(\sigma \) is the distance between the center points of the ground truth box and the prediction box, \({{c_h}}\) is the height difference between the two center points, \(b_{c_x}^{gt}\) and \(b_{c_y}^{gt}\) are the ground-truth box center coordinates, and \(b_{c_x}\) and \(b_{c_y}\) are the prediction box center coordinates. The Distance loss is calculated as follows:

$$\begin{aligned} \begin{array}{l} \Delta = \sum \limits _{t = x,y} {\left( {1 - {e^{ - \gamma {\rho _t}}}} \right) } = 2 - {e^{ - \gamma {\rho _x}}} - {e^{ - \gamma {\rho _y}}},\\ {\rho _x} = {\left( {\frac{{b_{{c_x}}^{gt} - {b_{{c_x}}}}}{{{c_w}}}} \right) ^2},{\rho _y} = {\left( {\frac{{b_{c_y}^{gt} - {b_{c_y}}}}{{{c_h}}}} \right) ^2},\gamma = 2 - \Lambda \end{array} \end{aligned}$$
(20)

where \(c_{w}\) and \(c_{h}\) are the width and height of the smallest external rectangle of the ground truth box and the prediction box. The Shape loss is calculated as follows:

$$\begin{aligned} \begin{array}{l} \Omega = \sum \limits _{t = w,h} {{{\left( {1 - {e^{ - {w_t}}}} \right) }^\theta }} = {\left( {1 - {e^{ - {w_w}}}} \right) ^\theta } + {\left( {1 - {e^{ - {w_h}}}} \right) ^\theta },\\ {w_w} = \frac{{\left| {w - {w^{gt}}} \right| }}{{\max \left( {w,{w^{gt}}} \right) }},{w_h} = \frac{{\left| {h - {h^{gt}}} \right| }}{{\max \left( {h,{h^{gt}}} \right) }} \end{array} \end{aligned}$$
(21)

where w, h, \(w^{gt}\), and \(h^{gt}\) are the widths and heights of the prediction box and ground truth box, respectively, and \(\theta \) controls the emphasis on shape loss, taking values in the range \(\left[ {2,6} \right] \). In summary, SIoU LOSS is defined as follows:

$$\begin{aligned} {L_{local}} = 1 - IoU + \frac{{\Delta + \Omega }}{2} \end{aligned}$$
(22)
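Eqs. (16)–(22) can be assembled into one function. The sketch below assumes boxes given as (cx, cy, w, h) tensors and takes θ = 4 within the stated [2, 6] range; it illustrates the loss structure and is not the authors’ exact code.

```python
import math
import torch

def siou_loss(pred: torch.Tensor, gt: torch.Tensor,
              theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    # IoU of axis-aligned boxes
    iw = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # smallest enclosing box (denominators of Eq. (20))
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ce = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    # angle cost, Eqs. (16)-(19)
    sigma = torch.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    sin_a = (torch.abs(gy - py) / sigma).clamp(0, 1)
    lam = torch.cos(2 * (torch.asin(sin_a) - math.pi / 4))
    # distance cost, Eq. (20)
    gamma = 2 - lam
    rho_x, rho_y = ((gx - px) / (cw + eps)) ** 2, ((gy - py) / (ce + eps)) ** 2
    delta = 2 - torch.exp(-gamma * rho_x) - torch.exp(-gamma * rho_y)
    # shape cost, Eq. (21)
    ww = torch.abs(pw - gw) / (torch.max(pw, gw) + eps)
    wh = torch.abs(ph - gh) / (torch.max(ph, gh) + eps)
    omega = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta
    return (1 - iou + (delta + omega) / 2).mean()      # Eq. (22)
```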

The target loss uses the binary cross-entropy loss function. The specific calculation formula is as follows:

$$\begin{aligned} L_{conf}\left( {o,c} \right)&= - \frac{{\sum \limits _{i = 1}^N {\left( {o_i\ln \left( {{\widehat{c}}_i} \right) + \left( {1 - o_i} \right) \ln \left( {1 - {\widehat{c}}_i} \right) } \right) } }}{N} \end{aligned}$$
(23)
$$\begin{aligned} {\widehat{c}}_i&= sigmoid\left( {c_i} \right) \end{aligned}$$
(24)

where \(o_i \in \left[ {0,1} \right] \) denotes the IoU of the prediction box and ground truth box, c is the predicted value, \({\widehat{c}}_i\) is the predicted confidence obtained from c via sigmoid activation function, and N is the number of positive and negative samples. The total loss function for the target detection model is calculated as follows:

$$\begin{aligned} LOSS = {\lambda _1}L_{class} + {\lambda _2}L_{local} + {\lambda _3}L_{conf} \end{aligned}$$
(25)

where \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are balancing parameters.

The DIoU Non-Maximum Suppression (DIoU-NMS) technique is used in the post-processing step of the target detection algorithm to reduce false detections and eliminate duplicate boxes. The conventional NMS algorithm computes the IoU between the highest-scoring box and every other detected box and filters out the boxes whose IoU exceeds the NMS threshold; IoU is thus the only factor it considers. In real-world scenarios, however, when two distinct objects are close together, their relatively large IoU often leaves only one detection box after NMS, producing missed detections. Because IoU only measures the overlap region between predicted and actual boxes, it ignores the aspect ratio and the center-point distance. DIoU-NMS therefore considers both the IoU and the distance between box centers: if the IoU between two boxes is relatively large but their centers are far apart, the boxes are regarded as belonging to two objects and neither is filtered out. The DIoU-NMS algorithm effectively reduces the missed detections caused by the traditional NMS method.
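The suppression rule can be written compactly: the usual IoU term is replaced by IoU minus the normalized squared center distance, so overlapping boxes with distant centers survive. A NumPy sketch, assuming boxes in (x1, y1, x2, y2) form:

```python
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5) -> list:
    """DIoU-NMS: suppress box j if IoU(i, j) - d^2/c^2 > thresh."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # IoU between box i and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0]); y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2]); y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # squared center distance over squared enclosing-box diagonal
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        d2 = ((ci - cr) ** 2).sum(axis=1)
        ex1 = np.minimum(boxes[i, :2], boxes[rest, :2])
        ex2 = np.maximum(boxes[i, 2:], boxes[rest, 2:])
        c2 = ((ex2 - ex1) ** 2).sum(axis=1) + 1e-7
        order = rest[iou - d2 / c2 <= thresh]          # keep boxes below the DIoU threshold
    return keep
```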

Artificial rabbits optimization

Manually set hyper-parameters have significant limitations, so an optimization algorithm is introduced here to improve the convergence speed, prediction accuracy, and robustness of model training. Artificial Rabbits Optimization (ARO) is a novel bionic meta-heuristic proposed by Wang et al.53 in 2022, inspired by the survival strategies of rabbits in nature, especially their behavior patterns when foraging and avoiding predators, which are abstracted and simulated to solve optimization problems. Specifically, the algorithm models rabbit behavior in two ways:

  • Detour Foraging. In nature, rabbits tend to feed away from the area of their nest, reducing the risk of natural enemies finding it. In the ARO algorithm, this behavior is abstracted as each “rabbit” (that is, each search individual) exploring a “meadow” (a potential solution) away from its current position within its search area. This strategy encourages individuals to jump out of locally optimal regions and explore a broader solution space to find better solutions. In the implementation, detour foraging is simulated by letting a search individual randomly choose another individual’s position in the population and add a perturbation to it. This perturbation not only helps individuals discover new solutions but also increases the population’s diversity and prevents premature convergence to a local optimum. The algorithm works as follows:

    $$\begin{aligned}{}&\begin{array}{l} {{\mathop c\limits ^ \rightarrow }_i}\left( {t + 1} \right) = {{\mathop p\limits ^ \rightarrow }_j}\left( t \right) + \delta \left( {{{\mathop p\limits ^ \rightarrow }_i}\left( t \right) - {{\mathop p\limits ^ \rightarrow }_j}\left( t \right) } \right) + round\left( {0.5 \cdot \left( {0.05 + {r_1}} \right) } \right) \cdot {n_1},\\ i,j = 1,...,n\left( {j \ne i} \right) \end{array} \end{aligned}$$
    (26)
    $$\begin{aligned}{}&\delta = d \cdot \alpha \end{aligned}$$
    (27)
    $$\begin{aligned}{}&d = \left( {e - {e^{{{\left( {\frac{{t - 1}}{T}} \right) }^2}}}} \right) \cdot \sin \left( {2\pi {r_2}} \right) \end{aligned}$$
    (28)
    $$\begin{aligned}{}&\begin{array}{l} \alpha \left( x \right) = \left\{ \begin{array}{l} 1,\;if\;x = = g\left( l \right) \\ 0,\;\mathrm{{ else}} \end{array} \right. ,\\ x = 1,...,m\;\;\mathrm{{ and }}\;\;l = 1,...,\left\lceil {{r_3} \cdot m} \right\rceil \end{array} \end{aligned}$$
    (29)
    $$\begin{aligned}{}&g = randperm\left( m \right) \end{aligned}$$
    (30)
    $$\begin{aligned}{}&{n_1} \sim N\left( {0,1} \right) \end{aligned}$$
    (31)

    where \({{\mathop c\limits ^ \rightarrow }_i}\left( {t + 1} \right) \) is the candidate position of the i-th rabbit at iteration \({t + 1}\); \({{\mathop p\limits ^ \rightarrow } _i}\left( t \right) \) is the current position of the i-th rabbit at iteration t; n is the size of the rabbit colony; m is the dimension of the problem; T is the maximum number of iterations; randperm(m) returns a random permutation of the integers from 1 to m; \({r_1}\), \({r_2}\), and \({r_3}\) are random numbers in the interval (0, 1); d is the search path length; and \({n_1}\) is a random number following the standard normal distribution. In ARO, the perturbation term assists the global search and helps avoid local optima: a large step length d initially promotes exploration, and d gradually decreases over the iterations to refine the search. The mapping vector \(\alpha \) introduces randomness and maintains diversity, and the running operator \(\delta \) simulates the rabbits’ running behavior, promoting global exploration and strengthening ARO’s ability to find the optimal solution.

  • Random Hiding. To avoid predators, rabbits dig multiple burrows around their nests as hiding places. In the ARO algorithm, this behavior is modeled as search individuals randomly choosing burrows to update their positions. This randomness gives the algorithm exploration capability, so the search is not limited to the neighborhood of the current solution but ranges widely over the solution space. During execution, the random hiding strategy lets a search individual visit positions randomly, independent of the distance or quality between individuals, helping the algorithm jump out of locally optimal regions and explore parts of the solution space that might otherwise be overlooked. The following formula generates the j-th burrow of the i-th rabbit:

    $$\begin{aligned}{}&\begin{array}{l} {{\mathop h\limits ^ \rightarrow }_{i,j}}\left( t \right) = {{\mathop p\limits ^ \rightarrow }_i}\left( t \right) + \text{K} \cdot g \cdot {{\mathop p\limits ^ \rightarrow }_i}\left( t \right) ,\\ i = 1,...,n\mathrm{{ \;\;and\;\; }}j = 1,...,m \end{array} \end{aligned}$$
    (32)
    $$\begin{aligned}{}&\text{K} = \frac{{T - t + 1}}{T} \cdot {r_4} \end{aligned}$$
    (33)
    $$\begin{aligned}{}&{n_2} \sim N\left( {0,1} \right) \end{aligned}$$
    (34)
    $$\begin{aligned}{}&\begin{array}{l} g\left( x \right) = \left\{ \begin{array}{l} 1,\mathrm{{ if\;\; }}x = = j\\ 0,\mathrm{{ else}} \end{array} \right. ,\\ x = 1,...,m \end{array} \end{aligned}$$
    (35)

    where m burrows are generated near the rabbit’s position, one along each dimension. During the iteration, the factor \(\left( T - t + 1 \right) / T\) in \(\text{K}\) decreases linearly from 1 to 1/T, so the burrows generated early in the iteration lie in a larger field around the rabbit, and the field gradually shrinks as the iterations proceed. To model the random hiding strategy mathematically, we introduce the following formulas:

    $$\begin{aligned}{}&\begin{array}{l} \mathop {{c_i}}\limits ^ \rightarrow \left( {t + 1} \right) = \mathop {{p_i}}\limits ^ \rightarrow \left( t \right) + \delta \left( {{r_4} \cdot {{\mathop h\limits ^ \rightarrow }_{i,r}}\left( t \right) - \mathop {{p_i}}\limits ^ \rightarrow \left( t \right) } \right) ,\\ i = 1,...,n \end{array} \end{aligned}$$
    (36)
    $$\begin{aligned}{}&\begin{array}{l} {g_r}\left( x \right) = \left\{ \begin{array}{l} 1,\mathrm{{ if\;\; }}x = = \left\lceil {{r_5} \cdot m} \right\rceil \\ 0,\mathrm{{ else}} \end{array} \right. ,\\ x = 1,...,m \end{array} \end{aligned}$$
    (37)
    $$\begin{aligned}{}&{{\mathop h\limits ^ \rightarrow } _{i,r}}\left( t \right) = \mathop {{p_i}}\limits ^ \rightarrow \left( t \right) + \text{K} \cdot {f_r} \cdot \mathop {{p_i}}\limits ^ \rightarrow \left( t \right) \end{aligned}$$
    (38)

    where \({{{\mathop h\limits ^ \rightarrow }_{i,r}}\left( t \right) }\) represents the burrow randomly chosen from the rabbit’s m burrows for hiding, and \({r_4}\) and \({r_5}\) are two random numbers in the range (0, 1). The i-th rabbit randomly selects one of its m burrows to update its position.

After implementing one of the above two strategies, the position of the individual rabbit is updated to:

$$\begin{aligned}{}&{{\mathop p\limits ^ \rightarrow } _i}\left( {t + 1} \right) = \left\{ \begin{array}{l} {{\mathop p\limits ^ \rightarrow }_i}\left( t \right) ,f\left( {{{\mathop p\limits ^ \rightarrow }_i}\left( t \right) } \right) \le f\left( {{{\mathop c\limits ^ \rightarrow }_i}\left( {t + 1} \right) } \right) \\ {{\mathop c\limits ^ \rightarrow }_i}\left( {t + 1} \right) ,f\left( {{{\mathop p\limits ^ \rightarrow }_i}\left( t \right) } \right) > f\left( {{{\mathop c\limits ^ \rightarrow }_i}\left( {t + 1} \right) } \right) \end{array} \right. \end{aligned}$$
(39)

This formula states that if the candidate position has a better (lower) fitness value than the rabbit’s current position, the rabbit abandons its original position and occupies the new position determined by Eq. (26) or (36).

The rabbit’s energy level serves as a regulatory mechanism in the algorithm, governing each individual’s transition between the two strategies. As the energy level decreases, rabbits lean toward the random hiding strategy, which manifests as more random search behavior. This energy contraction mechanism lets the algorithm strike a dynamic balance between global and local search, improving the probability of finding the global optimum. An energy factor is therefore introduced to model the transition from exploration to exploitation; in the ARO algorithm it is defined as follows:

$$\begin{aligned} A\left( t \right) = 4\left( {1 - \frac{t}{T}} \right) \ln \frac{1}{r} \end{aligned}$$
(40)

where r is a random number in the range (0, 1).

The energy factor \(A\left( t \right) \) shows an oscillating downward trend toward zero. A high energy factor means the rabbits are active and tend toward detour foraging; a low energy factor means a rabbit is less active and more prone to random hiding. In the ARO algorithm, rabbits explore other areas when \(A\left( t \right) > 1\) and dig burrows to hide when \(A\left( t \right) \le 1\); ARO thus switches between exploration and hiding according to the value of the energy factor \(A\left( t \right) \).

In this paper, the network hyper-parameters optimized by the ARO algorithm and their description are shown in Table 2.

Table 2 Hyper-parameters and their descriptions.

The flowchart of applying the ARO algorithm to optimize the network parameters is shown in Fig. 9. In Fig. 9, \({F_p}\) represents the fitness function of ARO, considering the two evaluation indicators, mAP50 and mAP50-95. The calculation formula is as follows:

$$\begin{aligned} {F_p} = {\omega _1}\,\text{mAP50} + {\omega _2}\,\text{mAP50-95} \end{aligned}$$
(41)

where \({\omega _1}\) and \({\omega _2}\) are the weights of the two evaluation indexes, respectively.
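A compact, hedged sketch of the whole ARO loop follows, tying Eqs. (26)–(40) together for hyperparameter search: each position vector holds one value per Table 2 hyperparameter, and since this sketch minimizes, the objective can be set to the negative of the fitness F_p in Eq. (41). The dimension-selection mapping vectors of Eqs. (29) and (35)–(37) are simplified to scalar random factors.

```python
import numpy as np

def aro(f, lb: np.ndarray, ub: np.ndarray, n: int = 30, T: int = 500, seed: int = 0):
    """Minimize f over box bounds [lb, ub] with a simplified ARO loop."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(lb, ub, (n, lb.size))                 # rabbit positions
    fit = np.array([f(p) for p in P])
    for t in range(1, T + 1):
        r = rng.random() + 1e-12
        A = 4 * (1 - t / T) * np.log(1 / r)               # energy factor, Eq. (40)
        for i in range(n):
            if A > 1:                                     # detour foraging, Eqs. (26)-(31)
                j = rng.choice([k for k in range(n) if k != i])
                d = (np.e - np.e ** (((t - 1) / T) ** 2)) * np.sin(2 * np.pi * rng.random())
                c = P[j] + d * (P[i] - P[j]) \
                    + round(0.5 * (0.05 + rng.random())) * rng.standard_normal()  # n1
            else:                                         # random hiding, Eqs. (32)-(38)
                K = (T - t + 1) / T * rng.random()
                burrow = P[i] + K * rng.random() * P[i]
                c = P[i] + rng.random() * (burrow - P[i])
            c = np.clip(c, lb, ub)
            fc = f(c)
            if fc < fit[i]:                               # greedy update, Eq. (39)
                P[i], fit[i] = c, fc
    best = fit.argmin()
    return P[best], fit[best]

# Usage sketch: lb/ub bound the Table 2 hyperparameters, and
# f = lambda h: -(w1 * mAP50(h) + w2 * mAP50_95(h)) per Eq. (41).
```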

Figure 9
figure 9

Optimization parameter flowchart.

We select the F13 multi-modal test function to measure the optimization effect of the ARO algorithm. Figure 10b–e show the ARO search history, the average fitness curve, the one-dimensional search trajectory curve, and the convergence curve, respectively. Figure 10f shows the proportion of exploration and exploitation during ARO iteration. It can be concluded that the ARO algorithm finds the global optimal solution of hyper-parameters faster and better, accelerates the convergence speed of model training, and saves computational costs.

Figure 10
figure 10

Qualitative analysis of the ARO optimization algorithm. (a) F13 multi-modal test function; (b) ARO search history; (c) the average fitness curve; (d) the one-dimensional search trajectory curve; (e) the convergence curve; (f) the proportion of exploration and exploitation during ARO iteration.

Results

Experimental environment

Model training and testing were carried out on a PC with an NVIDIA GeForce RTX 3060 Laptop GPU, 16 GB memory, an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz CPU, and Windows 11, using PyTorch 1.7.1, CUDA 11.0, CUDNN 8.2, and Python 3.8.

ARO optimization parameter experiment comparison

To evaluate the effectiveness of introducing the ARO algorithm, models trained with and without algorithm optimization were compared, with mAP50 and mAP50-95 as evaluation indexes. The experimental results are shown in Fig. 11: the model optimized by ARO converges faster and reaches higher accuracy. Introducing the parameter optimization algorithm dramatically improves the model’s performance, raising the two evaluation indexes by 2.9% and 4.2%, respectively, which verifies the necessity of the algorithm. In addition, Fig. 12 illustrates how mAP50 and mAP50-95 change during the ARO optimization iterations. As Fig. 12 shows, when ARO optimizes the hyperparameters, the model accuracy essentially converges after the 300th iteration; that is, the ARO algorithm escapes local optima and finds the global optimal solution.

Figure 11
figure 11

Experimental comparison diagram of ARO algorithm effectiveness. (a) mAP50; (b) mAP50-95.

Figure 12
figure 12

Convergence curves of mAP50 and mAP50-95 during ARO algorithm optimization iterations.

Evaluation metrics of experimental result

This paper uses the following evaluation measures for the experimental results: F1, AP (Average Precision), mAP (mean Average Precision), R (Recall), P (Precision), and FPS (Frames per Second). Precision P denotes the fraction of model-predicted positive samples that are truly positive. Recall R denotes the fraction of all positive samples that the model successfully detects. Average precision (AP) measures the detection accuracy for one class as the average precision over different recall rates; its value equals the area under the PR curve. The precise formulas are as follows:

$$\begin{aligned} P&= \frac{{TP}}{{TP + FP}} \end{aligned}$$
(42)
$$\begin{aligned} R&= \frac{{TP}}{{TP + FN}} \end{aligned}$$
(43)
$$\begin{aligned} AP&= \int _0^1 {P\left( R \right) dR} \end{aligned}$$
(44)

where TP denotes positive samples correctly predicted as positive, FP denotes negative samples incorrectly predicted as positive, and FN denotes positive samples incorrectly predicted as negative. Equation (44) integrates the PR curve, obtained by sweeping the confidence threshold with R as the horizontal coordinate and P as the vertical coordinate; the area enclosed by this curve and the coordinate axes is the value of AP.
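As a concrete illustration of Eqs. (42)–(44), the sketch below computes P and R from assumed confusion counts and approximates AP by numerically integrating a toy PR curve; all numbers are illustrative, not the paper's results.

```python
import numpy as np

# Eqs. (42)-(43): precision and recall from assumed confusion counts.
tp, fp, fn = 90, 10, 20
precision = tp / (tp + fp)   # Eq. (42)
recall = tp / (tp + fn)      # Eq. (43)

# A toy PR curve sampled at increasing recall; real curves come from
# sweeping the confidence threshold over the detector's outputs.
r = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
p = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.3])
ap = np.trapz(p, r)          # Eq. (44): area under the PR curve

print(f"P={precision:.3f}, R={recall:.3f}, AP={ap:.3f}")
```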

The accuracy of multi-category detection is measured by the mean average precision (mAP), the average of the APs over all categories. The mAP50 and mAP50-95 are the evaluation measures used in this work: mAP50 denotes the average precision at an intersection-over-union (IoU) threshold of 0.5, and mAP50-95 denotes the average precision over IoU thresholds from 0.5 to 0.95. The F1 Score, the harmonic mean of precision and recall, takes both into account and provides a balanced assessment of the model's strengths and weaknesses. The number of frames per second (FPS) that the model can process is used to gauge its inference speed. The specific calculation formulas are as follows:

$$\begin{aligned} mAP&= \frac{{\sum \limits _{i = 1}^N {AP_i} }}{N} \times 100\% \end{aligned}$$
(45)
$$\begin{aligned} F1&= \frac{{2PR}}{{P + R}} \end{aligned}$$
(46)

where N is the number of categories and \({AP_i}\) is the AP of the i-th category.
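A worked sketch of Eqs. (45)–(46), together with a simple FPS timing loop, is given below; the per-class AP values, the P and R values, and the `infer` placeholder are assumptions for illustration only.

```python
import time
import numpy as np

ap_per_class = np.array([0.98, 0.95, 0.97, 0.99, 0.96])  # assumed per-class APs
mAP = ap_per_class.mean() * 100                          # Eq. (45), in percent

P, R = 0.97, 0.96                                        # assumed P and R
f1 = 2 * P * R / (P + R)                                 # Eq. (46)

def infer(frame):
    time.sleep(0.005)  # placeholder standing in for a model forward pass

# FPS: frames processed per unit wall-clock time over a timed loop.
n_frames = 100
t0 = time.perf_counter()
for _ in range(n_frames):
    infer(None)
fps = n_frames / (time.perf_counter() - t0)

print(f"mAP={mAP:.1f}%, F1={f1:.3f}, FPS={fps:.0f}")
```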

Experimental results

For the 15 categories of this study, the experimental results are shown in Table 3.

Table 3 Experimental results of each category.

Figure 13 displays the visualization results of the proposed WBi-YOLOSF target detection network on the experimental test set. It is evident that the 15 species of aquatic products involved in this study are accurately classified and localized. Paired with suitable hardware, this detection performance makes automated operation feasible for the aquaculture sector, and the model's high recognition accuracy and low misdetection rate make it applicable to real-world scenarios. Figure 14 displays sampled frames from real-time target detection in a natural undersea scene.

Figure 13

Experimental test set visualization results.

Figure 14

Real-time target detection sampling.

Discussion

Experimental label database

Each image in the dataset created for this paper was annotated with ground truth boxes. The distribution of labels in the dataset is shown in Fig. 15. The horizontal coordinates of the histogram in Fig. 15a indicate the different categories, and the vertical coordinates indicate the number of label instances per category. Figure 15b summarizes the length and width of all ground truth boxes in the dataset, with the center coordinates of every box placed at the same position. The scatter plots in Fig. 15c,d show the distribution of the ground truth boxes' center coordinates \(\left( {x,y} \right) \) in the image and the width and height distribution of the ground truth boxes, respectively.
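The statistics behind Fig. 15 can be gathered with a few lines of Python, assuming YOLO-format label files (one normalized `class x_center y_center width height` row per object); the directory path is a placeholder, not the paper's actual layout.

```python
from collections import Counter
from pathlib import Path

class_counts, centers, sizes = Counter(), [], []
for label_file in Path("dataset/labels/train").glob("*.txt"):  # assumed path
    for row in label_file.read_text().splitlines():
        if not row.strip():
            continue
        cls, x, y, w, h = row.split()
        class_counts[int(cls)] += 1               # per-class counts (Fig. 15a)
        centers.append((float(x), float(y)))      # center distribution (Fig. 15c)
        sizes.append((float(w), float(h)))        # width/height distribution (Fig. 15d)

print(class_counts)
```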

Figure 15

Labels and label distribution. (a) number and class of labels in the dataset; (b) ground truth box; (c) center point coordinate distribution of ground truth box; (d) width and height distribution of ground truth box.

In addition, Fig. 16 summarizes the labels of the training set and plots the pairwise relationships among the four label variables: x, y, width, and height. The histogram in the first row of Fig. 16 shows the distribution of the horizontal coordinate x of the box centers: x concentrates at the 0.25, 0.50, and 0.75 positions of the image, with the central 0.50 position the most prevalent. The histogram in the second row shows the distribution of the vertical coordinate y of the box centers, which is distributed uniformly across the 0.25, 0.50, and 0.75 positions. The histograms in the third and last rows show the distributions of box width and height, respectively; most boxes are no wider and no taller than half of the image.
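A Fig. 16-style correlogram can be reproduced with pandas' scatter_matrix, as in the sketch below; the uniform random boxes are stand-ins for the real (x, y, width, height) label columns.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Synthetic stand-in data: 1000 random normalized boxes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, size=(1000, 4)),
                  columns=["x", "y", "width", "height"])

# Pairwise scatter plots with per-variable histograms on the diagonal.
scatter_matrix(df, diagonal="hist", alpha=0.3, figsize=(8, 8))
plt.savefig("labels_correlogram.png")
```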

Figure 16

Labels correlogram.

Model performance

The target detection model's confusion matrix is displayed in Fig. 17; the horizontal and vertical axes represent the actual and predicted categories, and the number in each cell is the probability that the model assigns that prediction to that true category. The confusion matrix shows that the model presented in this research accomplishes the classification task accurately. Figure 18 displays curve diagrams for the model's operational results, from which performance can be assessed more intuitively. Figure 18a shows the F1-Confidence curve, reflecting the relationship between the F1 score and the confidence level; Fig. 18b shows the Precision-Confidence curve, reflecting precision under different confidence thresholds; Fig. 18c shows the PR curve, the area under which is taken as AP; Fig. 18d shows the Recall-Confidence curve, reflecting the relationship between recall and confidence. Figure 19 presents a visual examination of the model evaluation metrics during training, where the horizontal axis spans 100 epochs. In addition to the changes in precision, recall, mAP50, and mAP50-95 throughout training, the figure shows the variations in box loss, objectness loss, and classification loss for the training and validation sets.
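A row-normalized confusion matrix such as Fig. 17 can be built as follows; the label vectors are illustrative stand-ins for real model predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted class labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 1])

cm = confusion_matrix(y_true, y_pred)
# Normalize each row so cells read as the probability of predicting
# each class given the true class.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
```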

Figure 17

Confusion matrix diagram.

Figure 18

Operation results curve. (a) F1-confidence curve; (b) Precision-confidence curve; (c) Precision-recall curve; (d) Recall-confidence curve.

Figure 19

Visual analysis of model evaluation indicators during training.

Ablation experiment analysis

To verify the effectiveness of dataset preprocessing, we performed ablation experiments, the results of which are shown in Table 4. The results show that underwater image enhancement and training set data augmentation significantly improve model prediction accuracy. All other experimental variables were held constant, and the network structure used is the WBi-YOLOSF target detection network proposed in this paper.

Table 4 Ablation experiment results of dataset preprocessing.

To verify the effectiveness of the proposed WBi-YOLOSF target detection network, we performed ablation experiments on each component of the improved model; the results are shown in Table 5, where every model was trained on the preprocessed dataset. The results show that the AU-BiFPN structure contributes the most to the model's performance: prediction accuracy improves markedly while sacrificing little inference speed, so the accuracy gain far outweighs the speed decrease. The first row of the table uses the original YOLOv5 framework as the baseline. The improved model raises mAP50 by 0.096, mAP50-95 by 0.083, FPS by 48 frames per second, and F1 Score by 0.092.

Table 5 Ablation experiment results of WBi-YOLOSF.

In addition, although this paper uses an optimization algorithm to select the optimal parameter values, the optimal network architecture must still be found through experiments. Therefore, four base frameworks of different scales are compared, with the results shown in Table 6. To keep the network lightweight, the medium-scale framework (YOLOv5m) is selected as the base. Using this lightweight network, the proposed model outperforms the larger-scale networks while saving considerable computing power. All other experimental variables were again held constant.

Table 6 Comparisons of different scale networks.

Comparative experiment analysis

The WBi-YOLOSF target detection network is compared against mainstream target detection models, and the experimental results in Table 7 demonstrate that it performs better overall than all comparison models. The algorithms selected for comparison are Faster RCNN, Mask RCNN, SSD, RetinaNet, YOLOv5, YOLOv754, YOLOv855, YOLOv956, MCCNN57, RTAL58, and the underwater multi-target detection method proposed by Yue et al.59 in 2023. Two-stage algorithms such as Faster RCNN and Mask RCNN attain comparatively high detection accuracy, but their FPS is significantly lower than that of the other models. YOLOv7's intricate network structure allows it to attain greater accuracy at the cost of reduced detection speed. MCCNN is a multi-model cascaded convolutional neural network proposed by Li et al. in 2022 for the automatic detection and classification of household waste; we trained and tested it on the dataset created in this paper, and the results in Table 7 show that, although its original application field differs, the model retains a degree of compatibility and applicability on other datasets. RTAL is an algorithmic model used by autonomous underwater vehicles equipped with side-scan sonar to detect underwater environments and targets; we likewise trained it on our dataset to obtain experimental results. Finally, we compare against an underwater multi-target detection method, and the detection accuracy of the network proposed in this paper is higher than that of all compared algorithms. Figure 20 shows the mAP50 and mAP50-95 comparison curves of our method against the comparison algorithms, visualizing the overall better performance of the proposed network; the improvement on the mAP50-95 index is especially pronounced. Figure 21 illustrates the correlation matrix between the studied algorithms and reports the corresponding p-values, showing the statistical differences between the compared algorithms. Figure 22 compares the visual predictions of the original YOLO framework, YOLOv9 (the second-best performer overall), and the WBi-YOLOSF network; the improved model's accuracy in predicting stacked targets and small targets in the background, as well as its classification and localization accuracy, is significantly better. Qualitative and quantitative analyses verify the improved model's efficiency, robustness, and accuracy.
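The paper does not specify how the Fig. 21 p-values are computed; one common approach is a paired test over matched per-class or per-run scores of two detectors, as sketched below with illustrative numbers.

```python
import numpy as np
from scipy import stats

# Matched per-class mAP scores for two detectors (illustrative values,
# not the paper's data).
ours = np.array([0.98, 0.97, 0.99, 0.96, 0.98])
baseline = np.array([0.95, 0.93, 0.96, 0.94, 0.95])

# Paired t-test: are the score differences significantly nonzero?
t_stat, p_value = stats.ttest_rel(ours, baseline)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
```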

Table 7 Comparison of different target detection models.
Figure 20

Comparison curves of mAP50 and mAP50-95 for different target detection models. (a) mAP50; (b) mAP50-95.

Figure 21

Correlation matrix of comparison algorithms.

Figure 22

Detection result comparison. (a) YOLO Architecture. (b) YOLOv9. (c) WBi-YOLOSF.

Conclusions

This study proposes a real-time target detection network, WBi-YOLOSF. To promote the development of aquaculture and the rapid economic growth of coastal areas, and to address the low efficiency and small scale of traditional aquaculture, our work aims to modernize and automate aquaculture and save fishermen's human and material resources.

First, this paper creates a dataset containing 15 categories of aquatic products collected by underwater image sensors, providing a resource for developing aquatic product recognition. Second, this study introduces two innovations in data preprocessing: underwater image quality enhancement and training set data augmentation. The former corrects the low contrast, uneven brightness, and color deviation of underwater images, improving their quality; the latter enriches the training set, notably through a new augmentation method for small targets, so the deep learning network can better learn target characteristics. The ablation experiments confirm the effectiveness of this preprocessing.

Next, the WBi-YOLOSF target detection network is proposed. The FReLU activation function overcomes the spatial insensitivity of conventional activation functions in visual tasks and improves small-object detection accuracy with negligible computational overhead. A new feature extraction structure, AU-BiFPN, addresses the information bottleneck caused by network depth through multi-scale cross-feature fusion without increasing inference overhead, and the aggregated SimAM attention module alleviates information overload, improving the efficiency and accuracy of target prediction in all aspects. In the head network, RepConv reparameterizes the convolution structure, further improving prediction accuracy.

In addition, the ARO algorithm is applied to optimize the network hyperparameters, eliminating the traditional reliance on manually preset initial values. Experimental data confirm that this strategy significantly improves the convergence rate of model training and enhances the algorithm's accuracy; the ARO algorithm successfully determines the globally optimal parameter values, demonstrating the practical value of swarm intelligence algorithms in this field. Moreover, VFL combined with SIoU loss is innovatively used as the loss function to focus on small and aggregated targets that are difficult to detect.

Finally, ablation and comparison experiments are carried out. The ablation experiments verify the effectiveness of the data preprocessing and of the WBi-YOLOSF network, and in the comparison experiments, WBi-YOLOSF significantly outperforms other mainstream deep learning models in both the experimental data and the visualized results. The prediction accuracy of the WBi-YOLOSF network reaches 98.2% at 203 frames per second, improvements of 2.4% and 48 frames per second over the original base network. The dataset labels were also analyzed statistically before the experiments. In addition, we present test results for natural underwater scenes, including sampled underwater video detection results captured by underwater camera equipment. In conclusion, the network model proposed in this study has good robustness and detection performance, with practical application and reference value for the automation of the aquaculture industry.