YOLOFM: an improved fire and smoke object detection algorithm based on YOLOv5n

To address the current difficulties in fire detection algorithms, including inadequate feature extraction, excessive computational complexity, limited deployability on resource-constrained devices, missed and inaccurate detections, and low accuracy, we developed a highly accurate algorithm named YOLOFM. We used LabelImg software to manually annotate a dataset of 18,644 images, named FM-VOC Dataset18644. In addition, we constructed a FocalNext network using the FocalNextBlock module from the CFNet network, which improves the integration of multi-scale information while reducing model parameters. We also proposed QAHARep-FPN, a feature pyramid network that integrates quantization awareness and hardware awareness, effectively reducing the model's redundant computation. A new channel-compression decoupled head, named NADH, was created to strengthen the correlation between the decoupled head structure and the computation logic of the loss function. Finally, instead of the CIoU loss for bounding box regression, we proposed a Focal-SIoU loss, which accelerates network convergence and improves regression precision. The experimental results showed that YOLOFM improved the baseline network's precision, recall, F1, mAP50, and mAP50-95 by 3.1%, 3.9%, 3.0%, 2.2%, and 7.9%, respectively. It achieves a balance of performance and speed, providing a more dependable and accurate solution for detection tasks.

Fires often cause serious casualties and property damage. Early detection and accurate identification of fires are therefore crucial for reducing losses and protecting people's lives and property. Traditional fire detection technologies rely primarily on temperature, light, and smoke sensors. However, these approaches have several limitations, such as limited detection range and low detection accuracy. The development of computer vision has brought substantial progress in tackling these challenges. Celik et al. 1 applied YCbCr color-space separation and analyzed shape, color, and texture to identify fire smoke. Yamagishi et al. 2 employed color CCD cameras. Nevertheless, these approaches performed well only in well-lit, uncomplicated contexts and struggled in complex environments with insufficient lighting, resulting in poor detection and incorrect alerts. Support vector machines (SVMs) 3 are inadequate at detecting fires because they frequently generate false alarms when cameras move or operate in vibrating surroundings. Chi et al. 4 addressed several problems related to videos captured by stationary cameras; however, difficulties such as restricted placement choices and expensive upkeep remain. Toreyin et al. 5 proposed a real-time video processing system but encountered performance degradation when dealing with high-resolution videos.
With the upgrading of computer hardware and the development of deep learning technology, an increasing number of deep learning algorithms are being applied to fire detection. Commonly used algorithms include Faster R-CNN 6, YOLO 7, EfficientDet 7, YOLOX 8, SSD 9, RetinaNet 10, and CenterNet 11. Chaoxia et al. 6 reduced Faster R-CNN false alarms by adopting a color-guided anchoring strategy, although this improvement came at the expense of increased computational complexity. Xu et al. 7 improved EfficientDet to detect forest fires; nevertheless, acquiring complete global information remained a challenge. Liau et al. 9 improved the detection speed of SSD networks, but accuracy in complicated scenes still needed improvement. To boost network robustness, Li et al. 11 proposed a lightweight backbone network and anchor-free detection methods; however, this approach has serious drawbacks in complicated scenarios with shifting lighting conditions. These deep learning algorithms comprehensively analyze various fire features, such as color, texture, and shape. Compared with traditional visual processing algorithms, they show greater resilience in complex scenarios, reducing incorrect detections and better meeting the requirements of complex tasks.
• To address the limited availability and inadequate quality of publicly accessible fire object detection datasets, we created a dataset named FM-VOC Dataset18644. The dataset contains 16,844 images depicting fire and smoke. In addition, we employed image enhancement methods such as flipping, rotating, and adjusting image brightness to preprocess the dataset, which improved the quality and quantity of data for the experiments.
• Considering the importance of YOLOv5n's fusion network in multiscale feature fusion and the insufficient feature fusion caused by its limited parameters, we proposed the FocalNext network. This network takes inspiration from the design concept of the CFNet network 22,23 and incorporates the FocalNextBlock focusing module to reconstruct the backbone network. It integrates feature fusion operations into the backbone network, simultaneously merging detailed local features and broad global characteristics, allowing the fusion network to function efficiently in the subsequent stage.
• We integrated network quantization and reparameterization methods to construct a QARepVGG-style 24,25 feature pyramid network, QAHARep-FPN. It addresses the loss of detection accuracy during network quantization and reparameterization, as well as the difficulty of running complex fire and smoke detection tasks on mobile devices and embedded systems with constrained hardware resources. This design achieves an effective balance between detection accuracy and inference speed.
• The original YOLOv5n head network uses an integrated, shared structure for the classification and regression tasks. However, this results in inadequate focus on the bounding box regression task and uneven feature acquisition. To address this issue, we proposed a new asymmetric decoupled head (NADH) that uses multi-level channel compression to address insufficient feature learning in bounding box regression tasks 26,27.

Proposed FocalNext network
Traditional YOLO models employ the backbone network to extract multiscale features, which are subsequently fused in lightweight structures such as the feature pyramid network (FPN). However, the lightweight YOLOv5n model assigns fewer parameters to the FPN than to the backbone network. We proposed the FocalNext network, which incorporates the FocalNextBlock focusing module and draws inspiration from the architecture of CFNet 22,23, to improve feature integration without compromising the lightweight design. This network integrates feature fusion operations into the backbone, simultaneously merging detailed local features and broad global characteristics. This increases the number of parameters available for feature fusion while still allowing the model to benefit from pre-trained weights. The structure of the FocalNext network is shown in Fig. 1. It consists of a skip connection and a series of stacked FocalNextBlock modules. The input tensor X passes through a sub-path with an independent convolution operation before being combined with the feature X2, which has passed through the stacked FocalNextBlocks, to produce X3. Finally, the combined feature X3 undergoes a convolutional operation to produce the final output X4. Through skip connections across feature fusion and multilayer processing, the FocalNext network better represents fine details and effectively mitigates the gradient-vanishing issue that arises with increasing network depth.
The FocalNextBlock is the focusing block within the FocalNext network. The module combines two skip connections with extended depthwise convolution, allowing fine-grained local interactions and coarse-grained global interactions to merge simultaneously. Fig. 1 illustrates the internal structure of the FocalNextBlock. The input tensor X first passes through a 7 × 7 convolution in the backbone path and is fused with X; DropPath processing is then applied to obtain X1. Next, X1 is fused with itself after another 7 × 7 convolution, and the combined features pass through DropPath, Permute, and normalization operations to derive X2. X2 is then processed by a 1 × 1 convolution and a GELU activation function to obtain X3. Finally, X3 is subjected to a 1 × 1 convolution and a permutation operation before being combined with the input tensor X, and a final DropPath step yields the fusion feature X4.
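The data flow described above can be sketched in PyTorch. This is an illustrative reconstruction from the text, not the authors' released code: the channel width, the 4x expansion ratio of the pointwise layers, and the use of `nn.Dropout` as a stand-in for stochastic DropPath are all assumptions.

```python
import torch
import torch.nn as nn

class FocalNextBlock(nn.Module):
    """Sketch of the FocalNextBlock data flow described in the text.
    Expansion ratio and DropPath stand-in are illustrative assumptions."""
    def __init__(self, dim: int, drop_path: float = 0.0):
        super().__init__()
        self.dw1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # 7x7 depthwise
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # second 7x7
        self.norm = nn.LayerNorm(dim)        # applied channels-last after Permute
        self.pw1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as Linear (NHWC)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * dim, dim)
        self.drop = nn.Dropout(drop_path)    # stand-in for stochastic DropPath

    def forward(self, x):
        x1 = self.drop(self.dw1(x) + x)      # 7x7 conv, fuse with input, DropPath
        y = self.drop(self.dw2(x1) + x1)     # second 7x7 conv and fusion
        y = y.permute(0, 2, 3, 1)            # Permute: NCHW -> NHWC
        y = self.norm(y)                     # normalization -> X2
        y = self.pw2(self.act(self.pw1(y)))  # 1x1 conv + GELU, then 1x1 conv
        y = y.permute(0, 3, 1, 2)            # Permute back to NCHW
        return self.drop(y + x)              # final fusion with input -> X4
```

The block preserves spatial resolution and channel count, so it can be stacked freely along the backbone as Fig. 1 shows.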

Proposed QAHARep-FPN network
The neck network plays a crucial role in efficiently handling multiscale features from the backbone network. Increasing the number of convolutional layers in the neck can strengthen the benefits of fusion; however, it also increases computational complexity, adversely affecting processing efficiency, especially on devices with restricted resources. Network quantization 38,39 can decrease cost and computational requirements but may sacrifice detection accuracy. Reparameterization 24,25 can trade off detection performance against speed, although it may suffer a decline in performance when subjected to quantization. In this paper, we integrated network quantization and reparameterization methods to construct a QARepVGG-style 24,25 feature pyramid network, QAHARep-FPN. The QAHARep-FPN structure uses QARepVGGB, QARepNeXt, the Transpose operation 40,41, and the GhostConv convolution 42,43, as shown in Fig. 2. This approach seeks an optimal balance between maintaining fire detection accuracy and achieving fast, efficient inference on devices with limited resources.
Both the RepVGG-style and QARepVGG-style convolutional structures contain a 3 × 3 convolution, a 1 × 1 convolution, an identity branch, and batch normalization (BN) (see Fig. 3). During inference, the multibranch structure is converted into a single-branch 3 × 3 convolution through reparameterization. However, the incorporation of three branches introduces a covariate shift, which leads to significant performance deterioration during quantization. To address this problem, the QARepVGG-style convolutional structure removes the BN operations after the 1 × 1 convolution and identity branches and adds a further BN operation after the branches are merged, making training more stable. This adjustment greatly enhances the quantization behavior of the QARepVGG-style convolutional structure 24,25. The QARepVGGB module in this paper employs the QARepVGG-style convolutional structure, and we substituted the QARepVGGB module for two standard convolutions.
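The core identity behind reparameterization can be demonstrated in one dimension. The pure-Python sketch below is illustrative (it is not the QARepVGG code, and BN folding is omitted): a 1-tap kernel and an identity branch, zero-padded to 3 taps, can be summed into the 3-tap kernel without changing the output, which is exactly how the three branches collapse into a single convolution at inference time.

```python
def conv1d(signal, kernel):
    """Naive 'same' 1-D convolution with zero padding."""
    pad = len(kernel) // 2
    s = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * s[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

# Three branches of a RepVGG-style block (BN folding omitted for clarity).
k3 = [0.2, 0.5, -0.1]   # 3-tap branch
k1 = [0.7]              # 1-tap branch
identity = 1.0          # identity branch, i.e. a centered 1-tap kernel

def multibranch(x):
    """Training-time form: sum of the three branch outputs."""
    return [a + b + c for a, b, c in
            zip(conv1d(x, k3), conv1d(x, k1), x)]

# Inference-time form: pad the small kernels to 3 taps and sum them.
fused = [k3[0], k3[1] + k1[0] + identity, k3[2]]

x = [1.0, -2.0, 3.0, 0.5]
assert all(abs(a - b) < 1e-9
           for a, b in zip(multibranch(x), conv1d(x, fused)))
```

The same padding-and-summing argument carries over to 2-D kernels and, after folding, to the BN parameters of each branch.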
Furthermore, drawing inspiration from the EfficientRep network 44, we designed the QAR Unit (Fig. 4) and the QARepVGG-Block (Fig. 5). The QAR Unit linearly connects two QARepVGG-style convolutional structures, and the QARepVGG-Block linearly connects n/2 QAR Units. The structure of QARepNeXt, built from QARepVGGB and the QARepVGG-Block, is illustrated in Fig. 5. The input X undergoes a QARepVGG-style convolutional operation in the backbone path, generating X1. X1 is then fed into the QARepVGG-Block for more extensive feature extraction, yielding the feature X2. After a QARepVGG-style convolutional operation in the sub-path, the feature X3 is fused with X2. Finally, the combined features undergo a QARepVGG-style convolutional operation to produce X4.
Moreover, we replaced the nn.Upsample operation with the Transpose operation 40,41. nn.Upsample resizes feature maps by interpolation; although useful in some image-processing applications, its fixed interpolation weights cannot be learned, whereas the transposed convolution provides learnable upsampling.
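The contrast between interpolation and transposed convolution can be made concrete with a pure-Python sketch (illustrative only; the model itself uses PyTorch layers). Nearest-neighbour upsampling is a fixed rule, while a stride-2 transposed convolution has trainable weights; with the all-ones kernel it happens to reproduce nearest-neighbour exactly.

```python
def upsample_nearest(x):
    """Parameter-free nearest-neighbour upsampling (what interpolation does)."""
    return [v for v in x for _ in (0, 1)]

def conv_transpose1d(x, kernel, stride=2):
    """Naive 1-D transposed convolution: upsampling with learnable weights."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for j, k in enumerate(kernel):
            out[i * stride + j] += v * k
    return out

x = [1.0, 2.0, 3.0]
print(upsample_nearest(x))              # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(conv_transpose1d(x, [1.0, 1.0]))  # same result with the all-ones kernel
```

During training the kernel values are free parameters, so the network can learn an upsampling rule suited to the detection task rather than being limited to a fixed interpolation.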

Proposed NADH decoupled head
The head network produces the final predictions from the features processed by the neck network. The YOLOv5 head adopts an integrated, shared structure for classification and regression tasks (Fig. 6a). This structure can cause conflicts between the classification and regression tasks, ultimately leading to subpar performance. YOLOX 8,21 divides the classification and regression tasks into separate subnetworks (Fig. 6b), which effectively resolves the conflicts but increases the parameters and computations. YOLOv6 16 employs hybrid channels as a solution, reducing parameters at the expense of accuracy (Fig. 6c). YOLOCS 27 uses asymmetric multichannel compression and a decoupled head to create separate subnetworks for different detection tasks, substantially improving detection accuracy; however, it has difficulty in tuning the number of convolutional layers and in resolving the vanishing-gradient problem (Fig. 6d).
We proposed a new asymmetric multistage channel compression decoupled head named NADH (Fig. 6e). Within NADH, three separate subnetworks handle classification, object scoring, and bounding box regression. For the bounding box regression branch, we employed three GhostConv convolutions, which effectively expand the receptive field and augment the parameter count. For the object scoring and classification branches, we used a 3 × 3 GhostConv convolution followed by two 3 × 3 DWConv convolutions to extend each network path. At the same time, we compressed the features of the three branches to the same dimension, allowing each branch to maintain a three-layer convolutional architecture (Fig. 6).
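The branch layout above can be sketched as a PyTorch module. This is a structural illustration only, not the authors' implementation: plain convolutions stand in for GhostConv, and the compressed channel width `c_mid` is an assumption.

```python
import torch
import torch.nn as nn

def dwconv(c: int) -> nn.Sequential:
    """Depthwise 3x3 followed by a pointwise 1x1 (a DWConv stand-in)."""
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                         nn.Conv2d(c, c, 1))

class NADHSketch(nn.Module):
    """Illustrative three-branch decoupled head. Plain 3x3 convolutions
    stand in for GhostConv; channel width c_mid is an assumption."""
    def __init__(self, c_in: int, num_classes: int, c_mid: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(c_in, c_mid, 1)  # channel compression

        def task_branch(c_out):  # one 3x3 conv + two DWConvs, then 1x1 out
            return nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1),
                                 dwconv(c_mid), dwconv(c_mid),
                                 nn.Conv2d(c_mid, c_out, 1))

        self.cls_branch = task_branch(num_classes)  # classification
        self.obj_branch = task_branch(1)            # object scoring
        self.box_branch = nn.Sequential(            # three 3x3 convs, regression
            *[nn.Conv2d(c_mid, c_mid, 3, padding=1) for _ in range(3)],
            nn.Conv2d(c_mid, 4, 1))

    def forward(self, x):
        f = self.stem(x)
        return self.cls_branch(f), self.obj_branch(f), self.box_branch(f)
```

Each branch keeps a three-layer convolutional path over the shared compressed features, matching the asymmetric design in Fig. 6e.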

Proposed Focal-SIoU loss
The loss function is divided into three parts: classification loss, object scoring loss, and bounding box regression loss. The classification loss evaluates the model's accuracy in assigning each bounding box to the correct class. Categorical cross-entropy is commonly employed to quantify the difference between the model's classification prediction and the actual label. The calculation procedures for the classification loss are represented by Eqs. (1)-(2), where N denotes the total number of classes, x i represents the predicted value for the current class, y i represents the probability of the current class after the activation function, and y * i represents the true value of the current class (either 0 or 1).
The object scoring loss quantifies the model's confidence for each bounding box, assessing whether the box contains an object. Binary cross-entropy is commonly used to quantify the discrepancy between the predicted confidence and the true label; Eq. (3) illustrates the computation. L obj denotes the object scoring loss, and N obj denotes the number of positive samples, i.e., the number of bounding boxes that contain an actual target. y i represents the actual label of sample i, usually 1 for the presence of a target and 0 for its absence. C i denotes the model's confidence estimate for sample i, which, after the sigmoid function, falls between 0 and 1.
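As a minimal sketch of the classification and object scoring losses above (assuming the standard softmax cross-entropy and binary cross-entropy definitions behind Eqs. (1)-(3)):

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Eqs. (1)-(2): softmax over class logits, then cross-entropy
    against the one-hot true label."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]          # y_i after the activation
    return -math.log(probs[target_index]), probs

def bce_objectness(confidences, labels):
    """Eq. (3): binary cross-entropy between sigmoid confidences C_i and
    0/1 objectness labels y_i, averaged over the boxes."""
    eps = 1e-12  # numerical guard against log(0)
    return -sum(y * math.log(c + eps) + (1 - y) * math.log(1 - c + eps)
                for c, y in zip(confidences, labels)) / len(confidences)

# A confident, correct class prediction incurs a small loss.
loss, probs = softmax_cross_entropy([2.0, 1.0, 0.1], target_index=0)
```

Predicting the wrong class, or a low confidence for a box that does contain a target, increases the corresponding loss, which is the behavior both equations are designed to penalize.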
The bounding box regression loss evaluates the precision of bounding box localization, which is essential for successful detection. YOLOv5n uses the CIoU loss, which considers the overlap between bounding boxes, the position of the center point, and the difference in size. Nevertheless, the CIoU loss has difficulty balancing sample weights, handling overlapping objects, and adapting to diverse aspect ratios during training. We proposed a Focal-SIoU loss that combines the SIoU loss 29,32 with the Focal L1 loss to improve object detection. This loss function considers both positive and negative sample weights, as well as the angle, distance, shape, and IoU between the predicted and true bounding boxes. It expedites model convergence and enhances accuracy in the bounding box regression task (as shown in Eq. (4), where γ is usually set to 0.5).
The calculation procedures for the angle cost are represented by Eqs. (5)-(8). Fig. 7 illustrates that the angle cost Λ depends on α, the relative angle between the two boxes. The calculation uses x = sin α together with the threshold π/4. When α approaches 0, the angular disparity between the two boxes is negligible; when x approaches 1, the angle α needs to be optimized; when α approaches π/4, Λ is small, suggesting that β should be optimized instead. The calculation procedures for the distance cost are represented by Eqs. (9)-(10). Fig. 8 illustrates that the distance cost Δ depends on ρ x and ρ y, which quantify the positional difference between the predicted box and the ground-truth box. The weight of the distance cost is controlled by γ, which balances the contributions of ρ x and ρ y, and γ varies in response to changes in Λ. When α falls and γ increases, the impact of the distance cost diminishes, suggesting that distance optimization is de-emphasized; when α approaches π/4, γ decreases and the distance cost grows in importance, making distance optimization more prominent.
The calculation procedures for the shape cost are represented by Eqs. (11)-(12). When θ = 1, the shape cost optimizes the bounding box's shape and constrains the freedom of the shape. ω w and ω h denote the relative differences in width and height, respectively. Eq. (13) represents the IoU cost calculation, where L IoUCost measures the intersection-over-union (IoU) loss between the predicted and ground-truth boxes, quantifying the extent of their overlap. The SIoU loss consists of the angle cost, the distance cost, the shape cost, and the IoU cost, and its calculation is represented by Eq. (14).
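The four costs can be sketched together in pure Python. This follows the published SIoU formulation (angle cost Λ, distance cost Δ with γ = 2 − Λ, shape cost Ω, and IoU); the focal weighting applied at the end is an assumption modeled on the Focal-EIoU pattern, since the paper's exact Eq. (4) is not reproduced here.

```python
import math

def siou_loss(pred, gt, theta=4, eps=1e-7):
    """Sketch of the SIoU loss. Boxes are (cx, cy, w, h);
    theta follows the common setting from the SIoU paper."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # IoU cost
    iw = max(0.0, min(px + pw/2, gx + gw/2) - max(px - pw/2, gx - gw/2))
    ih = max(0.0, min(py + ph/2, gy + gh/2) - max(py - ph/2, gy - gh/2))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Angle cost: x = sin(alpha) = c_h / sigma (Eqs. (5)-(8))
    sigma = math.hypot(gx - px, gy - py) + eps
    sin_alpha = abs(gy - py) / sigma
    angle = 1 - 2 * math.sin(math.asin(min(sin_alpha, 1.0)) - math.pi / 4) ** 2
    # Distance cost over the enclosing box, weighted by gamma = 2 - angle
    cw = max(px + pw/2, gx + gw/2) - min(px - pw/2, gx - gw/2) + eps
    ch = max(py + ph/2, gy + gh/2) - min(py - ph/2, gy - gh/2) + eps
    gamma = 2 - angle
    dist = sum(1 - math.exp(-gamma * r)
               for r in (((gx - px) / cw) ** 2, ((gy - py) / ch) ** 2))
    # Shape cost from the relative width/height differences
    shape = sum((1 - math.exp(-w)) ** theta
                for w in (abs(pw - gw) / max(pw, gw), abs(ph - gh) / max(ph, gh)))
    return 1 - iou + (dist + shape) / 2, iou

def focal_siou_loss(pred, gt, gamma=0.5):
    """Assumed focal weighting (IoU**gamma) on the SIoU loss, following the
    Focal-EIoU pattern; the paper's exact Eq. (4) may differ."""
    loss, iou = siou_loss(pred, gt)
    return (iou ** gamma) * loss
```

For a perfectly matched pair of boxes every cost term vanishes and the loss is zero; any offset in position, angle, or shape raises it.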

Experimental setup and data enhancement
The environmental parameters are displayed in Table 1. The input image size is 640 × 640, epochs 200, batch size 16, optimizer SGD, patience 100, mosaic 1.0, learning rate 0.01, momentum 0.937, and weight decay coefficient 0.0005. We used LabelImg software to label the fire images with two types of labels, "fire" and "smoke". Afterward, we divided the dataset into training and test sets at a 9:1 ratio. We also used image enhancement techniques, including flipping, rotating, and adjusting brightness, to enlarge the dataset. Finally, we acquired a target detection dataset with 18,644 fire images, which we named FM-VOC Dataset18644. This dataset includes various fire scenarios, such as structure fires, grassland fires, indoor fires, forest fires, road fires, and small-target fires. To assess model performance, we employed metrics including precision, recall, F1, mean average precision at an IoU threshold of 50% (mAP50), mean average precision over IoU thresholds from 50% to 95% (mAP50-95), frames per second (FPS), parameters (Params), and giga floating-point operations (GFLOPs). The calculation procedures for these metrics are shown in Eqs. (15)-(18).
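As a minimal illustration of the precision, recall, and F1 formulas referenced above (assuming the standard definitions from true positives, false positives, and false negatives; the mAP computation, which additionally integrates over confidence thresholds, is omitted):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard detection metrics from raw counts:
    precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct detections, 10 false alarms, 5 missed fires
p, r, f1 = precision_recall_f1(90, 10, 5)  # p = 0.9, r = 18/19, f1 = 12/13
```

The harmonic mean in F1 penalizes an imbalance between precision and recall, which is why the paper reports all three alongside mAP.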

The comparative experimental analysis of backbone network improvement
In Table 2, we report a series of integration experiments covering several networks, including InceptionNeXtBlock 45, FasterNext 46, ShuffleNetV2Block 47, BiFormerBlock 48, CB2D 49, ELANB 50, and ConXBv2 51. The FocalNext network exhibits superior precision and recall compared with the other networks, illustrating that it can enhance detection precision while minimizing both false positives and false negatives. The FocalNext network also achieves the highest mAP50 and mAP50-95, indicating superior performance at a 50% overlap threshold and across the full range of IoU thresholds.

The comparative experimental analysis of head network improvement
The experimental results presented in Table 4 demonstrate the distinct advantages of NADH in enhancing the performance of the YOLOv5n head network, surpassing the other head networks. NADH achieves a precision of 93.8% and a recall of 92.9%, significantly better than the other head networks, demonstrating that it attains remarkably high detection accuracy while maintaining exceptional recall. The mAP50 and mAP50-95 for NADH are also remarkably high, at 96.2% and 70.6%, respectively, showing that NADH performs well across various IoU thresholds.

The comparative experimental analysis of loss function improvement
The experimental data presented in Table 5 demonstrate that Focal-SIoU outperforms the other loss functions to a significant degree. It exhibits high precision and recall, achieving 92.7% and 91.3%, respectively, showing that the Focal-SIoU method achieves accurate object detection with an elevated recall. Focal-SIoU reaches a high mAP50 and mAP50-95 of 95.7% and 68.6%, respectively, demonstrating stability over various IoU overlaps. It also attains a high FPS of 82.29, indicating a comparatively rapid processing rate in high-accuracy object detection tasks. Its parameters and GFLOPs are comparable to those of the other loss functions, at 6.72 MB and 4.1 G, respectively.

The ablation experiment
The data in Table 6 demonstrate that each improvement has a substantial impact on the performance of the YOLOv5n fire detection model. Overall, smoke detection is markedly better than fire detection. The difference may be attributed to the differing visual attributes of smoke and fire targets: smoke targets are more discernible and their features are simpler, whereas fire features are comparatively intricate, making fire harder for the model to learn. Despite these differences, accurate detection of either smoke or fire can significantly mitigate fire risk. In Fig. 9, it can be observed that as training reaches 200 epochs, the model converges gradually without any signs of overfitting. The training and validation losses of YOLOFM are lower than those of YOLOv5n, with a more pronounced downward trend, suggesting a superior capacity to fit the data. As illustrated in Fig. 10, YOLOFM detects fire and smoke more accurately, outperforming the YOLOv5n model by a substantial margin. While the FPS of YOLOFM decreased slightly (Table 6), YOLOFM achieves notable advances in precision, recall, mAP50, and mAP50-95. Fire detection tasks require balancing performance and speed, and greater precision and recall are generally considered more crucial. Taken together, the substantial enhancement in overall performance outweighs the minor decrease in FPS, offering a more dependable and precise solution for fire detection. The ablation results thus provide essential evidence of the effectiveness of these improvements for fire detection.

The SOTA comparison experiment
To fully illustrate the originality and effectiveness of the upgraded YOLOFM network, we compared its results on the FM-VOC Dataset18644 with those of other state-of-the-art target detection techniques, including Faster R-CNN 6, EfficientDet 7, SSD 9, RetinaNet 10, CenterNet 11, the YOLO series, and EfficientNet-YOLOv3 12. To ensure fairness, all networks underwent the same fine-tuning process: image dimensions of 640 × 640, 200 epochs, batch size of 16, SGD optimizer, patience of 100, mosaic factor of 1.0, and learning rate of 0.01. To minimize the impact of software and hardware on model inference time, the experiments were conducted in the controlled experimental setup shown in Table 1. Table 7 shows that YOLOFM performs well across all metrics, notably precision, recall, and mAP50. While some algorithms achieve slightly better FPS in specific conditions, YOLOFM remains an outstanding fire detection algorithm that can reliably identify fires. Furthermore, its parameter count and computational complexity are quite low, making it suitable for resource-constrained environments and providing a more dependable and accurate solution for fire detection on such equipment.

Conclusion
This paper discussed the shortcomings of current fire detection algorithms: insufficient feature extraction, excessive network computational complexity, limited deployability on resource-constrained devices, and missed, false, and low-accuracy detections. Optimizing the YOLOv5n algorithm yields YOLOFM, a high-precision, hardware-aware, and quantization-aware fire detection algorithm. The optimization plan includes rebuilding the backbone network, augmenting the neck structure, introducing an asymmetric compression decoupled head, and substituting the loss function. These improvements maximize algorithm efficiency and detection performance. However, the complexity

c h = max(b gt cy , b cy ) − min(b gt cy , b cy )   (8)

Figure 10. The comparison of real instance detection results between YOLOv5n and YOLOFM.

Table 1. The experimental environment settings.

Table 2. The experimental results of backbone network improvement.

Table 3. The experimental results of neck network improvement.

Table 4. The experimental results of head network improvement.

Table 5. The experimental results of loss function improvement.

Table 7. The SOTA comparison experiment.