Autonomous driving integrates environmental perception, dynamic planning, and control execution in automobiles, it has received considerable scholarly attention in recent years1. Object detection serves as the principal perception method for autonomous vehicles, and the crux of this task is to enhance the accuracy of object detection.

Object detection algorithms can be divided into two major classes: two-stage detectors e.g. Faster R-CNN2, TS4Net3, AccLoc4, and Part-\(A^ 2\) net5, and one-stage detectors e.g. YOLO6, CG-SSD7, and PAOD8. Two-stage algorithms generate region proposals, then classify and localize objects according to them. Contrary to two-stage algorithms, one-stage ones perform classification and localization using pre-defined candidate proposals. In general, two-stage detectors achieve greater accuracy but are more time-consuming than one-stage detectors.

To obtain a more robust and accurate detection model, the following literature provides different improvement methods. Shi et al.9 introduced GIoU into K-means++ to obtain better anchors. Manuel et al.10 used an evolutionary algorithm to search for optimal region-based anchors. Wang et al.11 proposed a feature extraction network to ensure that small objects are correctly detected. Wang et al.12 adopted the dynamic attention module to improve detection performance. As a model with arbitrary hyperparameters leads to unsatisfactory performance, optimization algorithms e.g. Bayesian optimization13, and fitness sorted rider optimization algorithm14 have been used to find the optimal hyperparameter group.

How to quickly and accurately obtain the optimal high-dimensional parameter combination optimization is a major problem, the metaheuristic optimization algorithms such as ant colony optimization(ACO)15, particle swarm optimization(PSO)16, whale optimization algorithm(WOA)17, grey wolf optimization(GWO)18, and firefly algorithm(FA)19 aim to solve this. WOA is known for its simplicity and outstanding global solving ability among a variety of optimization algorithms, it has been applied in solving the optimal hyperparameter group20,21,22, data clustering23,24,25,26, multi-objective problems27,28, etc. However, the performance of canonical WOA is limited by low convergence and unsatisfactory accuracy. Therefore, WOA must be improved by weighing up exploration and exploitation29,30, integrating other algorithms31,32,33, and using better update strategies34,35. For example, in SHADE-WOA, Chakraborty et al.36 added an extra parameter \(\alpha\) which is used to control the exploration and exploitation phases. In WhaleFOA37, the original FOA’s random search strategy is replaced by WOA’s hunting strategy to enrich FOA’s global exploration capability. Chen et al.38 developed a double adaptive weight strategy, the results show that the WOA using this method has better global optimization capability.

Referring to the above literature, a novel modified whale optimization algorithm(MWOA) is proposed by fusing the structure of WOA and the hunting strategy of GWO with multi-faceted improvements. The core procedures of MWOA are as follows: the scaling factor is calculated using an adaptive update formula based on the population’s fitness. To improve GWO’s optimization performance, the position of the local optima is added as one instructor along with \(\alpha\) wolf, \(\beta\) wolf, and \(\delta\) wolf in the GWO’s hunting strategy. The instruction vectors \(V^t_{k,i}\) are improved using the new formula and they are weighted by fitness. Then the population’s concentration ratio is leveraged as the controller of the MWOA’s hunting branches. To verify the performance of MWOA, six multi-dimensional benchmark functions are used as the fitness function. The test result shows that MWOA possesses better performance. A novel G-YOLO network is proposed and MWOA is implemented to optimize its hyperparameters. A self-built dataset including pedestrians, cyclists, and cars serves as the training set and test set of G-YOLO, and the final whale optimization G-YOLO(WOG-YOLO) model presents a stronger detection ability and stability.

Proposed MWOA algorithm

Description of WOA and GWO

The canonical WOA is enlightened by the foraging mechanism of the humpback whales, it defines three behaviors to search for the best global solution. Its optimization stages can be concluded as follows: initialize the population and related parameters, calculate each individual’s fitness and identify the best global solution, then update the individuals’ position through the following formulas:

$$\begin{aligned} \vec {X}_i^{t+1}= & {} \left\{ \begin{array}{lll} V_i^t &{}&{} {p_h<0.5,|A|<1}\\ V_{rand}^t &{}&{} {p_h<0.5,|A|\ge 1}\\ \vec {X^t_*}+|\vec {X^t_*}-\vec {X^t_i}|\cdot e^{bl}\cdot cos(2\pi l) &{} &{} {p_h\ge 0.5}\\ \end{array} \right. \end{aligned}$$
$$\begin{aligned} \vec {V_i^t}= & {} \vec {X_*^t}-A\cdot |C\cdot \vec {X_*^t}-\vec {X_i^t}| \end{aligned}$$
$$\begin{aligned} A= & {} 2a\cdot r_{1}-a \end{aligned}$$
$$\begin{aligned} C= & {} 2\cdot r_{2} \end{aligned}$$
$$\begin{aligned} a= & {} 2-\frac{2t}{T} \end{aligned}$$

where \(\vec {X_i^t}\) and \(\vec {X}_i^{t+1}\) are the positions of i th individual in t th and \(t+1\) th iteration respectively, \(\vec {X^t_*}\) is the optimal position of in t th iteration, \(p_h\), \(r_1\), \(r_2\) are random numbers in the range of [0, 1], l is a random number in the range of \([-1,1]\), b is the spiral constant(in this paper, b equals 1), and a is the scaling factor that hinges on the current iteration step t and the maximum number of iteration steps T.

Repeat the above steps until the end requirements are satisfied.

The standard GWO selects \(\alpha\) wolf, \(\beta\) wolf, and \(\delta\) wolf from the wolf pack by each individual’s fitness, then updates the individual’s position by the following formulas:

$$\begin{aligned} \vec {X}^{t+1}_i= & {} \frac{\vec {V_{\alpha ,i}^t}+\vec {V_{\beta ,i}^t}+\vec {V_{\delta ,i}^t}}{3} \end{aligned}$$
$$\begin{aligned} \vec {V_{k,i}^t}= & {} \vec {X_k^t}-A\cdot |C\cdot \vec {X_k^t}-\vec {X_i^t}| \end{aligned}$$

where \(\vec {X_i^t}\) and \(\vec {X}_i^{t+1}\) are the positions of i th individual in t th and \(t+1\) th iteration respectively, \(\vec {X_k^t}\) is the position of k, k can represent the position of \(\alpha\) wolf, \(\beta\) wolf, and \(\delta\) wolf, A and C are the same as the formula in WOA.

Modified WOA

Adaptive scaling factor

In WOA and GWO, the scaling factor decreases linearly to control the process of conversion from global optimation to local optimation. However, this approach fails to accommodate the practical condition as most optimization problems are complicated non-linear processes. Hence an adaptive scaling factor formula is proposed as follows:

$$\begin{aligned} a=2-min\{2,\frac{f^{t-1}_*}{f^t_*}(1-cos\frac{t\pi }{T})\} \end{aligned}$$

where \(f^t_*\) and \(f^{t-1}_*\) are the optimal fitness in the current iteration and last iteration respectively.

In the above formula, the scaling factor is modulated by \(\frac{f^{t-1}_*}{f^t_*}\), thus expanding the searching scope if \(\frac{f^{t-1}_*}{f^t_*}<1\) or vice versa if \(\frac{f^{t-1}_*}{f^t_*}\ge 1\). The cosine function introduces non-linearity into the scaling factor. Furthermore, the minimum value function is used to ensure the scaling factor is greater than or equals to zero.

Improved GWO’s hunting strategy

In WOA, the optimal individual’s position instructs the update of other individuals’ positions. This method can facilitate convergence but has poor robustness, namely, it may stagnate around the local optimal solution. To accelerate the convergence of local optimation and strengthen the ability to search for global solutions, the position update formula of GWO is introduced to replace the original optimal position update method. Furthermore, each individual’s historical optimal position \(\vec {X^t_l}\) is introduced to calculate \(\vec {V^t_l}\). The new position update formula is as follows:

$$\begin{aligned} \vec {X}^{t+1}_i=w_\alpha ^t\vec {V_\alpha ^t} +w_\beta ^t\vec {V_\beta ^t}+w_\delta ^t\vec {V_\delta ^t}+w_l^t\vec {V_l^t} \end{aligned}$$

where \(w_\alpha\), \(w_\beta\), \(w_\delta\), and \(w_l\) denote the weights of \(\alpha\), \(\beta\), \(\delta\), and optimal local position, \(\vec {V_\alpha ^t}\), \(\vec {V_\beta ^t}\), \(\vec {V_\delta ^t}\), and \(\vec {V_l^t}\) denote the instruct vectors of \(\alpha\), \(\beta\), \(\delta\), and each individual’s historical optimal position.

The weights \(w_\alpha ^t\), \(w_\beta ^t\), \(w_\delta ^t\), and \(w_l^t\) depend on their respective positions’ fitness. Taking the case of minimal optimization, the position gets greater weight with smaller fitness, and then the weights are normalized into the range of [0, 1]. The weight can be calculated by:

$$\begin{aligned} w_k^t=\frac{1}{f_k^t\cdot (\sum {\frac{1}{f_j^t}})} \end{aligned}$$

where \(f_k^t\) denotes the fitness in position k, j in \(\sum {\frac{1}{f_j}}\) can represent the fitness of \(\alpha\), \(\beta\), \(\delta\), or each individual’s historical optimal position l.

In GWO, a random number \(C_i\) in the range of [0, 2] is used to control the influence of optimal position. However, this method is uncontrollable, namely that it lacks the precise trade-off between global optimization and local optimation.

To solve this issue, \(|1-a|\) and \(1-|1-a|\) are introduced to improve the original \(\vec {V_{k,i}^t}\). \(|1-a|\) is used to control the influence of \(\alpha\), \(\beta\), and \(\delta\), \(1-|1-a|\) is used to control the influence of historical optimal position l. As a decreases from 2 to 0, \(|1-a|\) firstly decreases from 1 to 0, and then increases from 0 to 1, this method fully utilizes the position of \(\alpha\), \(\beta\), and \(\delta\) and can facilitate convergence in both early and final stages. In the middle stage, \(|1-a|\) is close to zero and \(1-|1-a|\) is close to 1, the position update is mainly instructed by l and random variations, thus the population has better global optimization ability. \(\vec {V_t}\) can be expressed as follows:

$$\begin{aligned} \vec {V_{k,i}^t}= & {} |1-a|\vec {X_{k}^t}-A*||1-a|\vec {X^t_{k}}-\vec {X^t_i}| \end{aligned}$$
$$\begin{aligned} \vec {V_{l,i}^t}= & {} (1-|1-a|)\vec {X_l^{t}}-A*|(1-|1-a|)\vec {X^t_l}-\vec {X^t_i}| \end{aligned}$$

where k in \(X_{k}^t\) can be represented by \(\alpha\), \(\beta\), and \(\delta\).

During the optimation process, the warmup skill is used: in the first \(N_{warmup}\) (e.g. 2) iterations, the scaling factor is set to be very small(e.g. 0.1), and after the warmup iterations the scaling factor reverts to normal behavior. This method helps the entire population find its better optimization direction and recognize the most efficient way to enhance their fitness.

Incorporation of improved GWO’s hunting strategy into WOA

In WOA, spiral hunting and the normal optimal position update method are used, and they both have a 50\(\%\) possibility of being executed. As the spiral hunting method has a larger search scope, and the optimal position update method searches comparatively in the local scope, a new possibility \(p_{h}\) is proposed. \(p_{h}\) is used to serve as the possibility of the spiral hunting method, and it decreases to 0 gradually as a decreases. This branch control method keeps the population to be neither too concentrated nor too sparse.

$$\begin{aligned} p_{h}= & {} \theta ^t\cdot \frac{a}{2} \end{aligned}$$
$$\begin{aligned} \theta ^t= & {} \frac{N\cdot f_*^{t-1}}{\sum _{i=1}^N f_i^{t-1}} \end{aligned}$$

where a is mentioned above in Eq. (8), \(\theta\) is the population’s concentration ratio, \(\sum _{i=1}^N f_i^{t-1}\) stands for the sum of all individuals’ fitness, N denotes the total number in the population, and \(f_*^{t-1}\) is the best fitness of the current population.

The crucial problem of swarm intelligence is that the population’s concentration ratio \(\theta\) graduates to being huge. As \(\theta\) gets larger, it results in narrower diversity of the population, hence making it harder to continue global optimation. Therefore, the population’s concentration ratio is calculated and leveraged to control the ratio \(\theta\).

The graphic process of MWOA is shown in the Fig. 1.

Figure 1
figure 1

The flow diagram of MWOA.

Benchmark function test

Six multi-dimensional benchmark functions36 are used to verify the effectiveness and precision. F1, F2, F5, and F6 have many local minima, hence the optimization algorithm is prone to stagnate around them. F6 possesses many global mininum positions with the same value and its minimum value is determined by the dimension. F3 and F4 are bowl-shaped and they don’t have a local minimum. The hyperparameters of the target detection model are generally less than 20, thus the dimension of the test function d is set to 20. They are listed as follows:

$$\begin{aligned} F_1(x)= & {} -d e^{-0.2\sqrt{\frac{1}{d}\sum _{i=1}^{d}{x_i^2}}}-e^{\frac{1}{d}\sum _{i=1}^{d}{cos(2\pi x_i)}}+d+e \end{aligned}$$
$$\begin{aligned} F_2(x)= & {} \sum ^{d}_{i=1}\frac{x_i^2}{4000}-\prod ^{d}_{i=1}{cos(\frac{x_i}{\sqrt{i}})}+1 \end{aligned}$$
$$\begin{aligned} F_3(x)= & {} \sum ^{d}_{i=1}x_i^2+(\sum _{i=1}^{d}0.5ix_i)^2+(\sum _{i=1}^{d}0.5ix_i)^4 \end{aligned}$$
$$\begin{aligned} F_4(x)= & {} \sum ^{d}_{i=1}|x_i|^{i+2} \end{aligned}$$
$$\begin{aligned} F_5(x)= & {} \frac{1}{2}\sum ^{d}_{i=1}(x_i^4-16x_i^2+5x_i) \end{aligned}$$
$$\begin{aligned} F_6(x)= & {} \prod _{i=1}^{d}\left( \sum _{j=1}^{6}j\cos {(j+1)x_i+j}\right) \end{aligned}$$

PSO16, WOA17, GWO18, WhaleFOA37, and MWOA are employed to solve the above functions. In order to obtain the objective results, the common parameters are set to be consistent: the maximum number of iterations is 100, and the number of individuals in the population is 50. In PSO, the local coefficient and global coefficient are set to be 2.05, the minimum weight of the bird is 0.4, and the maximum weight of the bird is 0.8. In WhaleFOA, the safety threshold value is 0.8, and the number of producers percentage is 0.2. Test functions and MWOA are implemented using NumPy39. The core algorithms of PSO, GWO, WOA, and WhaleFOA are implemented by mealpy40.

All algorithms are tested using a device with an i5-10600KF processor and 32.0 GB RAM, each benchmark function is run independently thirty times, Table 1 shows the best and worst results in thirty solutions. Dealing with F1, F2, F5, and F6, MWOA is blessed with a more robust global searching ability and hardly falls into stagnation. As for bowl-shaped problems like F3 and F4, MWOA achieves better accuracy and stability. To get a scrutiny of the iteration process, Fig. 2 provides the average convergence curves of PSO, GWO, WOA, WhaleFOA, and MWOA. Contrasting with other algorithms, MWOA possesses faster convergence faculty during both the early and final stages.

Table 1 Benchmark function test.
Figure 2
figure 2

The average iteration process.


Network structure

YOLOv541 is one of the most famous and utilitarian object detectors, it’s known for its high detection speed and elegant structure. Nevertheless, limited by its grid-based mechanism, YOLOv5 lacks competence in detecting small objects, thus the new YOLOv5 structure, named as G-YOLO, is proposed. The attention mechanism in SKNet42 is introduced into G-YOLO’s backbone network and the original C3 block is replaced by the G-C3 block. As convolution with a \(3\times 3\) kernel is sensitive to small features and convolution with a \(5\times 5\) kernel is sensitive to larger features, the SKConv can switch to the \(3\times 3\) or \(5\times 5\) perceptive field easily to obtain smaller scale features or bigger features. However, using the above two convolutions isn’t cost-effective compared with a single \(3\times 3\) convolution. Hence depth-wise convolution43 is used to replace the vanilla convolution, furthermore, the \(5\times 5\) convolution is replaced by \(3\times 3\) convolution whose dilation is set to 2. The improved SKConv has the same perceptive field and it has fewer parameters. The structures of G-C3, improved SKConv and GhostConv are shown in Fig. 3a–c.

Figure 3
figure 3

The structures of G-C3, improved SKConv and GhostConv.

An extra detection head is added to the G-YOLO to detect small objects more efficaciously and precisely. The number of branches in the PAN structure changes from 3 to 4, namely, a new branch using the \(160\times 160\) feature map is added for minor objects. The new network requires more parameters, to keep the network lightweight, GhostConv43 is introduced into the G-C3 block. GhostConv takes advantage of both vanilla convolution and depth-wise convolution, thus the number of trainable parameters reduces sharply without losing too much detection precision. Compared to the mostly used lightweight model YOLOv5s, the parameter size of G-YOLO is close to YOLOv5s. The new structure is shown in Fig. 4.

Figure 4
figure 4


Data preparation and processing


A great deal of previous research into autonomous driving has focused on the detection of cars, cyclists, and pedestrians using the Kitti dataset44, nevertheless, the precision in pedestrians and cyclists is unsatisfactory in comparison with that of cars. The low detection accuracy is far from the practical application of autonomous driving. The images containing vehicles, cyclists and pedestrians are extracted from the Kitti dataset, then vans and trucks are labelled cars. The final dataset contains 5325 images.

Data augmentation

In YOLOv5, mosaic is used as an image augment method, which gives the YOLOv5 network considerable enhancement in both precision and recall. The input image should be square in YOLOv5, but the width and height of images in the Kitti dataset are 1240 and 370 pixels respectively. Thus a large part of the input image is padded with blank. To reduce the padded area, three images are concatenated vertically as one image before using the mosaic method.

Optimization of YOLOv5’s hyperparameter

Parameters function as a critical part of the convolution network, it controls the entire training process and has a great impact on the performance of the final detection model. For models with a brand-new framework, tuning parameters one by one can be time-consuming and inefficient. Furthermore, inadequately tuned parameters can not fully reflect the performance of the model. Fine-tuned parameters can boost the recall and precision by setting a suitable threshold which is instrumental in obtaining a performant model.

In this paper, 12 parameters in G-YOLO are chosen to evolve using MWOA, their names and ranges are shown in Table 2. To be cost-efficient, the number of individuals is set to 5, the iteration number is set to 10.

Table 2 12 hyperparameters in YOLOv5.

The fitness function receives the newcome parameters, dispatches them to G-YOLO, and then activates the training of the detection model. After the G-YOLO training process, the evaluation score of the model is passed back to MWOA as the fitness. Trained with the optimal hyperparameters, the final whale optimization G-YOLO (WOG-YOLO) model is obtained.

The evaluating indicators are P, R, F1, and mAP. P refers to precision, which calculates the ratio of the number of correct detection results TP and the number of total detection results(\(TP+FP\)). R refers to recall, which calculates the ratio of the number of correct detection results TP and the number of actual objects(\(TP+FN\)). F1 is based on the harmonic mean of P and R, which considers both P and R. The indicators and scores are calculated by following formulas:

$$\begin{aligned} P= & {} \frac{TP}{TP+FP} \end{aligned}$$
$$\begin{aligned} R= & {} \frac{TP}{TP+FN} \end{aligned}$$
$$\begin{aligned} F1= & {} 2\cdot \frac{P\times R}{P+R} \end{aligned}$$
$$\begin{aligned} AP= & {} \int _0^1P(R)dR \end{aligned}$$
$$\begin{aligned} mAP= & {} \frac{\sum _{i=1}^{k}AP_i}{k} \end{aligned}$$
$$\begin{aligned} score= & {} \frac{1}{0.1P+0.1R+0.2F1+0.6mAP} \end{aligned}$$

where k is the number of classes.


The experiment is based on Ubuntu 18.04, using NVIDIA A2000 GPU. The batch size is 8, the number of training epochs is 100, the image size is \(640\times 640\), the confidence threshold is 0.25, and the NMS IOU threshold is 0.5.

The default and the optimized hyparameters are shown in Table 3. The most representative indicator mAP is used in the evaluation of YOLOv5s, WOG-YOLO, YOLOv745, YOLOX46 and Faster-RCNN2 and the results are shown in Table 4, the ablation study of WOG-YOLO is shown in Table 5 and the loss curve is shown in Fig. 5. The mAP of YOLOv5s is 92.5\(\%\) and the F1-score of YOLOv5s is 90.0\(\%\). By adding an extra detection head, its mAP improved by 0.6\(\%\). Based on YOLOv5-4heads, its C3 module is replaced by the lightweight G-C3 module (G-YOLO), which reduces the mAP by 0.3\(\%\). Compared with the YOLOv5s model, the final WOG-YOLO’s overall mAP increases by 1.7\(\%\), its mAP of the pedestrian increases by 2.6\(\%\), and its mAP of the cyclist increases by 2.3\(\%\). As pedestrians and cyclists have comparatively smaller features than cars, the WOG-YOLO model is more sensitive to small objects and has greater precision.

Table 3 The default and optimized hyperparameters in YOLOv5.
Table 4 Indicators of YOLOv5s, WOG-YOLO and other algorithms.
Table 5 Results of WOG-YOLO with different improvements.
Figure 5
figure 5

Loss of YOLOv5s and WOG-YOLO in the training processes.

As shown in Fig. 6, WOG-YOLO has excellent capability to detect small objects. Moreover, when part of the object is covered by other things, WOG-YOLO still has reasonable detection ability.

Figure 6
figure 6

The detection effect of WOG-YOLO and YOLOv5s.


To accurately identify objects in autonomous driving, a stable and effective detecting algorithm is needed. A novel and efficient optimization algorithm with WOA and GWO is proposed for improving the G-YOLO model.

The hunting strategy of GWO is improved and it’s integrated into WOA, thus the basic structure of MWOA is formed, furthermore, adaptive scaling factor, population concentration ratio, and improved position update method are implemented in MWOA. In comparison with PSO, GWO, WOA, and WhaleFOA, MWOA is verified by different kinds of benchmark functions to have greater precision and better global solving ability.

By replacing the C3 block with the G-C3 block and adding an extra detect layer, the highly optimizable G-YOLO is proposed. To improve G-YOLO’s performance, 12 hyperparameters are optimized by MWOA. The G-YOLO model is trained and evaluated using the self-built dataset containing 5325 images, thus the final whale optimization G-YOLO(WOG-YOLO) model is obtained. Compared with the 92.5\(\%\) mAP and 90.0\(\%\) F1 in YOLOv5s, WOG-YOLO is 1.7\(\%\) better in mAP and 1.0\(\%\) in F1. For small objects like pedestrians and cyclists, WOG-YOLO increases the respective mAP by 2.6\(\%\) and 2.3\(\%\).

In conclusion, the proposed method is an applicable and highly optimized approach to obtain a robust and efficient detection model in autonomous driving.