SSD with multi-scale feature fusion and attention mechanism

In the Internet of Things, image acquisition equipment plays a very important role, but it generates a large amount of invalid data during real-time monitoring. By analyzing the data collected directly at the terminal with edge computing, invalid frames can be removed and the detection accuracy of the system improved. The SSD algorithm is relatively lightweight and fast. However, it does not take full advantage of both the shallow and the deep information in the data. This paper therefore proposes a multi-scale feature fusion attention structure based on the SSD algorithm, which combines multi-scale feature fusion with an attention mechanism. The feature layers adjacent to each detection layer are fused to improve the expressive power of the features. An attention mechanism is then added to reweight the channels of the feature maps. Experimental results show that the optimized model improves both the detection accuracy and the reliability of edge computing.

Owing to the booming development of the Internet of Things and the major breakthroughs in computer vision technologies, image acquisition equipment is being used in more and more application scenarios. As the number of equipment nodes increases, the pressure on data transmission rises sharply. The data flow generated by image acquisition equipment may be very large, so it is necessary to use edge computing [1-3] to preprocess images. With the successive breakthroughs of deep learning technologies [4][5][6][7][8] and the rapid development of the world economy, object detection algorithms [9][10][11] have made significant progress in various fields [12][13][14][15]. In particular, target detection is one of the basic research topics in public transportation [16], national defense, and the military. It is widely used in aerospace [17], robot navigation [18], industrial detection [19], pedestrian tracking [20], and military applications [21], and a high-performance target detection algorithm can promote the development of industry. Target detection finds targets in pictures or videos by analyzing their geometric characteristics, determines the specific category of each target accurately, and provides a bounding box for each target. The CNN (Convolutional Neural Network) [22] has proved to be an effective model for visual tasks. Convolutional layers capture hierarchical patterns in the image representation and yield feature layers with different receptive fields. Finding a more powerful representation, so that the network can better capture the information that matters for a specific task [23][24][25][26], is an important topic in object detection research. A better representation in turn improves the accuracy and reliability of edge equipment in screening image data.
In recent years, deep learning has developed rapidly, and more and more scholars have applied it to object detection. There are two kinds of target detection algorithms based on deep learning [27]. One is the candidate-box-based object detection algorithm, represented by RCNN [28], Fast-RCNN [27], and Faster-RCNN [29,30]. This kind of algorithm first uses Selective Search [31], Edge Boxes [32], or other algorithms to generate candidate regions (region proposals) [33] that may contain the target, and then classifies and localizes these candidate regions to detect the targets. The other is the regression-based object detection algorithm, represented by the SSD [34][35][36] series and the YOLO [37,38] series. Regression-based algorithms are generally faster than candidate-box-based algorithms, whose main advantage is high accuracy. Object detection requires not only high precision but also real-time speed. Although candidate-box-based algorithms are highly precise, generating the candidate boxes consumes a lot of time and results in unsatisfactory speed. Regression-based algorithms, in contrast, do not need to generate candidate boxes and perform detection directly on the original image. The speed is greatly improved, but the disadvantage is that the accuracy is not high enough. As the algorithms have improved, some regression-based target detection algorithms now achieve both high accuracy and fast detection speed, and their accuracy is even higher than that of comparable candidate-box-based algorithms [39][40][41]. In this paper, the SSD target detection algorithm has been optimized, the extracted feature

Background and related work
Many scholars, both domestically and internationally, are now interested in object detection, especially small object detection, and they have carried out a great deal of research and achieved good results. For example, to raise the detection efficiency for small objects, reference [48] proposes an improved multi-scale feature fusion method, namely the atrous spatial pyramid pooling-balanced-feature pyramid network for object detection. In particular, atrous convolution operators with different dilation rates are applied to fully utilize context information, and skip connections are employed to achieve sufficient feature fusion. In reference [49], the authors show how deep learning may be used to reliably extract higher-level features and then fuse multi-scale features to identify eddies, regardless of their structures and scales. The experimental results show that their method achieves high target detection accuracy.
Next, this chapter describes the basic idea of the traditional SSD algorithm and analyzes its advantages and shortcomings in detail. Then, some model evaluation criteria are introduced. In SSD, each grid point of a feature layer creates several prior boxes with different aspect ratios. The number of prior boxes generated per grid point differs between the feature layers: 4, 6, 6, 6, 4, and 4, respectively. In terms of receptive field, the Conv4_3 and Conv7 layers have large feature maps and small receptive fields and express geometric information strongly, so they are used to detect smaller targets. The Conv10_2 and Conv11_2 layers have small feature maps and large receptive fields and express semantic information strongly, so they are suitable for detecting large targets. The geometric and semantic information of Conv8_2 and Conv9_2 lies between that of these shallow and deep layers, so they are used to detect medium targets. Finally, redundant prior boxes are removed by non-maximum suppression (NMS) to generate the final detection boxes.
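The prior-box bookkeeping and the final suppression step can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this text: the feature-map side lengths 38, 19, 10, 5, 3, and 1 are those of the standard SSD300 configuration, the per-location counts are assigned to the layers in order from Conv4_3 to Conv11_2, boxes are given as (x1, y1, x2, y2), and the IoU threshold of 0.45 is the value commonly used with SSD. The function names are hypothetical.

```python
# Assumed SSD300 feature-map side lengths (not stated in this text), paired
# with the per-location prior-box counts 4, 6, 6, 6, 4, 4 given above.
LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def total_prior_boxes(layers=LAYERS):
    """Sum side * side * boxes_per_location over all detection layers."""
    return sum(side * side * per_cell for _, side, per_cell in layers)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Only boxes that overlap the kept box weakly survive to the next round.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    print(total_prior_boxes())
    demo_boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
    print(nms(demo_boxes, [0.9, 0.8, 0.7]))
```

Under these assumptions the six layers yield 8732 prior boxes per image, which is why a suppression step is needed before the final detections are reported. In practice NMS is applied per class; the sketch handles a single class.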

SSD network structure
The SSD algorithm uses high-level feature information with a large receptive field to predict large objects, and low-level feature information with a small receptive field to predict small objects. This brings a problem: when the feature information of the low-level network is used to predict small targets, the SSD algorithm performs weakly, because the low-level features lack high-level semantic information, while the deep feature maps have lost too much information and have insufficient resolution after being downsampled multiple times.

Loss function
The loss function of the SSD algorithm contains two parts: the location loss ($L_{loc}$) and the confidence loss ($L_{conf}$). An image yields many prior boxes but contains relatively few objects to be detected. Many prior boxes therefore cannot be matched to a real box, which produces too many negative samples. The algorithm conducts hard sample mining to adjust and control the ratio of positive to negative samples, which reduces the influence of the excess negative samples and improves the optimization speed and the stability of the training results. The loss function of the algorithm is defined as Eq. (1):
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \qquad (1)$$
where $L(x, c, l, g)$ is the total loss function; $N$ is the number of default boxes matched to truth boxes; the parameter $\alpha$ adjusts the ratio between the location loss and the confidence loss; $c$ is the category confidence; $l$ represents the positional information of the predicted boxes; $g$ represents the positional information of the truth boxes; and the value of the indicator $x_{ij}^{p} \in \{0, 1\}$ depends on the IoU (intersection over union) [52] between prior box and real box. When the IoU between prior box $i$ and real box $j$ is greater than the threshold, $x_{ij}^{p} = 1$, which indicates that prior box $i$ is matched with real box $j$, whose category is $p$; otherwise $x_{ij}^{p} = 0$. The location loss adopts the Smooth L1 loss and is defined as Eq. (2):
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \qquad (2)$$
The smooth L1 function is defined as Eq. (3):
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (3)$$
Here $cx$ and $cy$ denote the offsets of the box center along the x and y directions, and $w$ and $h$ denote the width and height of the box; $i \in Pos$ indicates that predicted box $i$ is a positive sample, and $Pos$ is the set of positive samples. Because the predicted box is encoded, the real box is encoded in the same way to obtain $\hat{g}$. With $d$ denoting the prior box, the encoding is defined as Eqs. (4)-(7):
$$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \qquad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \qquad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \qquad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}} \qquad (4)\text{-}(7)$$
The confidence loss adopts the softmax loss, which is defined as Eq. (8):
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \qquad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)} \qquad (8)$$
where $i \in Neg$ indicates that predicted box $i$ is a negative sample, and $Neg$ is the set of negative samples. $\hat{c}_{i}^{0}$ is the predicted probability that box $i$ is background, and $\hat{c}_{i}^{p}$, computed by the softmax function, is the predicted probability that box $i$ belongs to the non-background category $p$.
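As an illustration of how the location and confidence terms combine, the following minimal Python sketch evaluates the loss for one image. It is a sketch under stated assumptions rather than the authors' implementation: boxes are given as (cx, cy, w, h), the box encoding follows the original SSD formulation, class index 0 is background, the negatives passed in are assumed to have already been selected by hard sample mining, and all function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 of Eq. (3): quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def encode(gt, prior):
    """Encode a real box against a prior box, both given as (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = prior
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ssd_loss(pos, neg_logits, alpha=1.0):
    """Total loss for one image.

    pos: list of (predicted_offsets, real_box, prior_box, class_logits, label)
         for matched priors; label >= 1 because index 0 is background.
    neg_logits: class logits of the priors kept as negative samples.
    """
    n = len(pos)
    if n == 0:
        return 0.0  # convention of the original SSD formulation when N = 0
    loc = 0.0
    conf = 0.0
    for pred, gt, prior, logits, label in pos:
        target = encode(gt, prior)
        loc += sum(smooth_l1(p - t) for p, t in zip(pred, target))
        conf -= math.log(softmax(logits)[label])
    for logits in neg_logits:
        conf -= math.log(softmax(logits)[0])
    return (conf + alpha * loc) / n
```

Both terms are divided by the number of matched priors, so images with many objects do not dominate training simply because they contribute more positive samples.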

Model evaluation criteria
To evaluate the detection performance of the model, the following criteria are used. The common terms are shown in Table 1.
Table 1. Common terms for object detection evaluation criteria.
1. Accuracy: Accuracy is one of the common evaluation criteria for object detection models. It is the number of correctly classified samples divided by the number of all samples. The higher the accuracy, the better the detection performance of the model. It is defined as Eq. (9):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
2. Precision: Precision is calculated from the test results and indicates the proportion of truly positive samples among the samples predicted as positive. It is defined as Eq. (10):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (10)$$
3. Recall: Recall is calculated from the real sample set and indicates the proportion of all positive samples that are recognized correctly. It is defined as Eq. (11):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (11)$$
4. AP (average precision): In general, precision and recall trade off against each other. AP is therefore proposed to better measure the performance of the model. After the smoothed PR (precision-recall) curve $P(R)$ is drawn, the AP value is calculated as Eq. (12):
$$AP = \int_{0}^{1} P(R)\, dR \qquad (12)$$
5. mAP (mean average precision): AP is the average precision for a single category, while mAP is the average of AP over multiple categories. The value of mAP ranges from 0 to 1, and the higher the mAP, the better the detection performance. This is the most important evaluation criterion for object detection algorithms. With $n$ denoting the number of categories, it is defined as Eq. (13):
$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i} \qquad (13)$$
6. FPS (frames per second): An object detection algorithm requires both high precision and fast detection speed, and the ultimate goal is a model that is both precise and efficient. FPS is the number of pictures that the model can detect per second.
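The criteria above can be computed from the counts in Table 1 with a few lines of Python. This is a minimal sketch with hypothetical function names. The AP routine assumes the all-point interpolation that is often used for the smoothed PR curve (precision is first made monotonically non-increasing, then the area under the curve is summed) and expects recall values sorted in ascending order; the text does not state which interpolation the authors used.

```python
def accuracy(tp, tn, fp, fn):
    """Correctly classified samples divided by all samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of real positives that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Area under the smoothed PR curve (recalls sorted ascending)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Smooth the curve: each precision becomes the maximum to its right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(aps):
    """Average of the per-category AP values."""
    return sum(aps) / len(aps)


def frames_per_second(num_images, seconds):
    """Number of images processed per second of detection time."""
    return num_images / seconds
```

For example, a detector with 8 true positives and 2 false positives has a precision of 0.8, and if it also misses 8 objects its recall is 0.5.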

Improved algorithm based on SSD
In this chapter, we optimize the SSD algorithm and introduce the optimization steps in detail. The optimization has two main steps. The first adopts different feature fusion methods for feature layers of different scales to improve the utilization of the feature maps. The second adopts channel attention to optimize the model.

Multi-scale feature fusion
Based on the basic structure of SSD, a multi-scale feature fusion attention mechanism (MFA) is proposed to improve how well the model uses the extracted features. Different fusion mechanisms are adopted for feature layers of different sizes. The Conv4_3 layer, used to detect small targets, is fused with Conv7 and Conv8_2, as shown in Fig. 3a. Conv7 is fused with Conv8_2 and Conv9_2, as shown in Fig. 3b. Fusing the features of relatively deeper layers strengthens the semantic information of the shallow feature layers and increases the accuracy of small target detection. We select an arbitrary channel of the corresponding feature layer for visualization, as shown in Fig. 2a. In this paper, we name this method multi-scale feature fusion attention for small objects (MFA_S).
Conv8_2, used to detect medium targets, is fused with Conv7 and Conv9_2, as shown in Fig. 3c, and Conv9_2 is fused with Conv8_2 and Conv10_2, as shown in Fig. 3d, which makes full use of the information from adjacent feature layers to improve the ability of information expression. The fusion operation for detecting medium-sized targets is called multi-scale feature fusion attention for medium objects (MFA_M), and the visualization results are shown in Fig. 2b.
Finally, Conv10_2, used to detect large objects, is fused with Conv8_2 and Conv9_2, as shown in Fig. 3e, and Conv11_2 is fused with Conv9_2 and Conv10_2, as shown in Fig. 3f. As the deep feature layers go through multiple convolution and downsampling steps, the receptive field becomes larger, but much feature information is lost, which affects the detection accuracy, especially for smaller objects. This influence can be reduced by fusing relatively shallow features, and this operation is named multi-scale feature fusion attention for large objects (MFA_L). The fusion steps are visualized in Fig. 2c. In the fusion step, we change the size of the feature maps by upsampling and by convolution with a stride of 2, and use convolution with a 1 × 1 kernel to change the number of channels. The persons and animals in Fig. 2 are from reference [53].
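The resizing and channel-matching operations of the fusion step can be sketched in plain Python for the medium-object case, where a detection layer is fused with one shallower and one deeper neighbour (for example Conv8_2 with Conv7 and Conv9_2). Several points are assumptions made only for illustration: feature maps are nested lists indexed as [channel][row][column], the toy maps differ by exact factors of 2, the learned stride-2 convolution is replaced by a fixed 2 × 2 averaging kernel, the 1 × 1 convolution weights are supplied by hand, and the aligned maps are combined by element-wise addition, since the text does not state whether addition or concatenation is used. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling: every pixel becomes a 2 x 2 block."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def downsample2x(fmap):
    """Stride-2 convolution with a fixed 2 x 2 averaging kernel
    (a stand-in for the learned stride-2 convolution)."""
    out = []
    for channel in fmap:
        h, w = len(channel), len(channel[0])
        out.append([
            [
                (channel[y][x] + channel[y][x + 1]
                 + channel[y + 1][x] + channel[y + 1][x + 1]) / 4.0
                for x in range(0, w - 1, 2)
            ]
            for y in range(0, h - 1, 2)
        ])
    return out


def conv1x1(fmap, weights):
    """1 x 1 convolution: weights[o][i] mixes input channel i into output o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(wo[i] * fmap[i][y][x] for i in range(len(fmap))) for x in range(w)]
            for y in range(h)
        ]
        for wo in weights
    ]


def add_maps(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [
        [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(ca, cb)]
        for ca, cb in zip(a, b)
    ]


def fuse(shallower, target, deeper, w_shallow, w_deep):
    """Bring both neighbours to the target's shape, then sum element-wise."""
    s = conv1x1(downsample2x(shallower), w_shallow)
    d = conv1x1(upsample2x(deeper), w_deep)
    return add_maps(add_maps(target, s), d)
```

The small-object and large-object variants differ only in which neighbours are resized: for Conv4_3 both fused layers are deeper and are upsampled, while for Conv11_2 both are shallower and are downsampled.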
In this paper, different fusion methods are adopted for feature layers of different depths, which greatly improves the utilization of feature information. To reduce overfitting, the following random data augmentation was applied to the original data to improve the diversity of the input: (1) zoom: randomly scale the image to a certain size; (2) flip: randomly flip the image horizontally; (3) color transformation: transform the image from RGB color space to HSV color space and fine-tune its hue (H), saturation (S), and value (V). The test results of the different fusion methods on the PASCAL VOC 2007 dataset are shown in Table 2, where MFA denotes the combination of MFA_i for i ∈ {S, M, L}. Table 2 indicates that the real-time detection FPS of each fusion method is slightly lower than that of the conventional SSD algorithm, while the mAP of each fusion method is higher. The mAP of the SSD algorithm with MFA is 90.57%, an increase of 3.27% over the conventional SSD algorithm. The average detection speed of the SSD algorithm with MFA on the experimental platform is 26.11 frames/second, which is 3.2 frames/second lower than that of the conventional SSD algorithm.
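The three augmentation steps listed above can be sketched with the Python standard library alone. The sketch assumes an image stored as rows of (r, g, b) tuples with components in [0, 1]; the jitter ranges, the zoom range, and the function names are hypothetical, since the text does not give the parameters that were actually used.

```python
import colorsys
import random


def random_zoom(image, rng, low=0.5, high=1.5):
    """Nearest-neighbour resize by a random scale factor."""
    scale = rng.uniform(low, high)
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [image[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def random_flip(image, rng, p=0.5):
    """Mirror the image left to right with probability p."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


def jitter_hsv(image, rng, hue=0.1, sat=0.3, val=0.3):
    """Shift hue and rescale saturation and value by small random amounts."""
    dh = rng.uniform(-hue, hue)
    ks = 1.0 + rng.uniform(-sat, sat)
    kv = 1.0 + rng.uniform(-val, val)
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h = (h + dh) % 1.0               # hue wraps around
            s = min(1.0, max(0.0, s * ks))   # clamp to the valid range
            v = min(1.0, max(0.0, v * kv))
            new_row.append(colorsys.hsv_to_rgb(h, s, v))
        out.append(new_row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[(0.2, 0.4, 0.6), (0.9, 0.1, 0.3)]]
    augmented = jitter_hsv(random_flip(random_zoom(image, rng), rng), rng)
    print(len(augmented), len(augmented[0]))
```

For detection data, the real boxes have to be transformed together with the image: scaled by the same factor for the zoom and mirrored for the flip. The sketch omits this step.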

Feature channel attention mechanism
The squeeze-and-excitation network (SENet) was proposed by senior R&D engineer Hu Jie and his team, and it won the image classification task of ImageNet 2017, the last ImageNet competition, by a large margin. The SENet structure is shown in Fig. 4.
SENet adjusts the attention between feature channels to improve the feature extraction of the model. It learns the importance of each feature channel automatically and, according to this importance, pays more attention to the effective channels while suppressing the ineffective or inefficient ones. SENet consists of two important parts, squeeze and excitation. The squeeze operation compresses each two-dimensional feature map into a single real number through global average pooling over the spatial dimensions, so that this number has a global receptive field. The excitation operation then learns a weight for each channel. The additional parameters of SENet come mainly from its two fully connected layers, and the first fully connected layer reduces the number of parameters through the compression ratio r (r = 16). Therefore, the detection speed of the proposed algorithm is only slightly reduced. In this study, SENet is added to each feature layer after the fusion operations, and the framework of the proposed SSD algorithm is shown in Fig. 5.
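The squeeze and excitation operations can be illustrated with a short Python sketch on a feature map stored as nested lists indexed as [channel][row][column]. It follows the commonly described form of the block, in which the two fully connected layers are followed by ReLU and sigmoid activations, respectively; the random initialisation and all function names are hypothetical and serve only to make the example runnable.

```python
import math
import random


def squeeze(fmap):
    """Global average pooling: one real number per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def dense(vec, weights, bias):
    """Fully connected layer: weights[o][i] connects input i to output o."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def excitation(z, w1, b1, w2, b2):
    """FC (C -> C/r), ReLU, FC (C/r -> C), sigmoid: one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(fmap, w1, b1, w2, b2):
    """Rescale every channel of fmap by its learned importance."""
    scales = excitation(squeeze(fmap), w1, b1, w2, b2)
    return [[[v * s for v in row] for row in ch] for ch, s in zip(fmap, scales)]


def init_se(channels, r=16, seed=0):
    """Hypothetical random initialisation of the two fully connected layers."""
    rng = random.Random(seed)
    hidden = max(1, channels // r)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(channels)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(channels)]
    return w1, [0.0] * hidden, w2, [0.0] * channels


def se_parameter_count(channels, r=16):
    """Weights and biases of the two fully connected layers."""
    hidden = max(1, channels // r)
    return 2 * channels * hidden + hidden + channels
```

With r = 16, a layer with 512 channels adds 33,312 parameters under this counting, compared with more than 500,000 weights if the two fully connected layers kept all 512 units, which illustrates why the compression ratio keeps the overhead small.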

Experimental equipment and data
The experimental equipment configuration in this paper is as follows: Intel(R) Core(TM) i5-9300HF CPU @ 2.  [53], and the experimental data are listed in Table 3.

Analysis of experimental results
In this paper, four common target detection algorithms, namely SSD, YOLOv3, YOLOv4, and Faster RCNN, are compared with the improved SSD algorithm. Table 4 shows the experimental results.
Experiments are carried out on the PASCAL VOC 2007 dataset, and the detection performance indexes are mAP and FPS. mAP is the average of the AP over all classes, and FPS is the detection speed. mAP50 refers to the mean average precision when the IoU threshold between the real box and the prior box is 0.5, and mAP75 refers to the mean average precision when this threshold is 0.75. mAP50:90 is the average of mAP50, mAP60, mAP70, mAP80, and mAP90. The experimental data show that the improved SSD algorithm (SSD + MFA) has the highest mAP under every IoU threshold. With mAP50 as the evaluation standard, the improved SSD algorithm is 2.00% better than the second-ranked YOLOv4 algorithm. With mAP75 as the evaluation standard, it is 14.82% higher than the second-ranked Faster RCNN algorithm. With mAP50:90 as the evaluation standard, it is 11.90% higher than the second-ranked Faster RCNN algorithm. The improved SSD algorithm ranks second in average detection speed: it is only 3.2 frames/second slower than the first-ranked SSD algorithm and 8.35 frames/second faster than the third-ranked YOLOv3 algorithm. The comprehensive comparison shows that the improved SSD algorithm has the best overall performance.
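The effect of the IoU threshold on these indexes can be made concrete with a small Python sketch. The IoU values and mAP values in the usage example are invented purely for illustration and are not experimental results; the function names are hypothetical.

```python
THRESHOLDS = (0.5, 0.6, 0.7, 0.8, 0.9)


def true_positive_counts(matched_ious, thresholds=THRESHOLDS):
    """For each threshold, count detections whose IoU with the real box reaches it."""
    return {t: sum(1 for v in matched_ious if v >= t) for t in thresholds}


def map_50_90(map_by_threshold, thresholds=THRESHOLDS):
    """Average the mAP values measured at IoU thresholds 0.5, 0.6, ..., 0.9."""
    return sum(map_by_threshold[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Hypothetical IoU values of four detections with their matched real boxes.
    print(true_positive_counts([0.55, 0.72, 0.91, 0.48]))
    # Hypothetical mAP values at each threshold.
    print(map_50_90({0.5: 0.9, 0.6: 0.8, 0.7: 0.7, 0.8: 0.6, 0.9: 0.5}))
```

A stricter threshold can only turn true positives into false positives, so mAP75 cannot exceed mAP50 for the same set of detections, and mAP50:90 rewards models whose boxes are tightly localised.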
Figure 6 compares the precision-recall curves and the average precision of the different algorithms in different categories, where SSD stands for the conventional SSD algorithm, SSD + MFA for the improved SSD algorithm, YOLOv3 for the YOLOv3 algorithm, YOLOv4 for the YOLOv4 algorithm, and Faster-RCNN for the Faster-RCNN algorithm. As the figure shows, for the classes 'person' and 'motorbike', the precision-recall curves of the detection algorithms differ only slightly, but the improved SSD algorithm performs best. And the precision-recall curve of the improved SSD
and the IoU between the detection result (prediction box) of the improved SSD algorithm and the corresponding real box is also improved. The persons and other objects in Fig. 7 are from reference [53].

Conclusion
In this paper, an improved SSD algorithm (SSD + MFA) is proposed. It adopts different fusion methods for feature extraction layers of different scales and uses a channel attention mechanism to reallocate the channel weights of the fused feature maps. The mAP on the PASCAL VOC 2007 dataset reaches 90.57%, which is 3.27% higher than that of the conventional SSD algorithm and 2.00% higher than that of the YOLOv4 algorithm. The improved SSD algorithm can effectively reduce the false detection rate. The mAP for detection targets of different sizes is improved to some extent, which significantly improves the precision of edge equipment in screening images.
Table 1.
True positive (TP): number of positive samples correctly classified as positive.
True negative (TN): number of negative samples correctly classified as negative.
False positive (FP): number of negative samples incorrectly classified as positive.
False negative (FN): number of positive samples incorrectly classified as negative.
https://doi.org/10.1038/s41598-023-41373-1 www.nature.com/scientificreports/

Figure 3. Fusion methods for different feature layers.

Figure 6. Comparison of the precision-recall curves of the five algorithms in different categories.

Table 2. Performance comparison between different methods.

Table 3. Experiment data table of PASCAL VOC 2007.

Table 4. Performance comparison of the different object detection algorithms.
Scientific Reports | (2023) 13:21387 | https://doi.org/10.1038/s41598-023-41373-1
algorithm in the classes 'chair' and 'dining table' increases significantly, with APs of 88% and 92%, respectively, and its detection accuracy is clearly better than that of the other detection algorithms. For the classes 'cow' and 'sofa', the improved SSD algorithm, the SSD algorithm, and the YOLOv4 algorithm differ little in detection accuracy, but all three are significantly better than the YOLOv3 and Faster RCNN algorithms. For the class 'bottle', the YOLOv4 algorithm has the highest precision, and the AP of the improved SSD algorithm ranks second. For the class 'pottedplant', the detection accuracy of the improved SSD algorithm and the YOLOv4 algorithm is clearly better than that of the other algorithms. To sum up, the detection performance of the improved SSD algorithm is improved for targets of different sizes.