Thermal fault diagnosis of complex electrical equipment based on infrared image recognition

This paper implements infrared image denoising, recognition, and semantic segmentation for complex electrical equipment and proposes a thermal fault diagnosis method that incorporates temperature differences. We introduce a deformable convolution module into the Denoising Convolutional Neural Network (Dn-CNN) and propose an image denoising algorithm based on the improved network (DeDn-CNN). By replacing Gaussian wrap-around filtering with anisotropic diffusion filtering, we propose an image enhancement algorithm that combines Weighted Guided Filtering (WGF) with an enhanced Single-Scale Retinex (Ani-SSR) technique to prevent strong edge halos. Furthermore, we propose a refined detection algorithm for electrical equipment that builds upon an improved RetinaNet. This algorithm incorporates a rotating rectangular frame and an attention module, addressing the challenge of precise detection when electrical equipment is densely arranged or tilted. We also introduce a thermal fault diagnosis approach that combines temperature differences with DeeplabV3+ semantic segmentation. The improved RetinaNet's recognition results are fed into the DeeplabV3+ model to further segment structures prone to thermal faults. The recognition accuracy for the three fault-prone components reached 87.23%, 86.54%, and 90.91%, with respective false alarm rates of 7.50%, 8.20%, and 7.89%. The proposed method spans preprocessing, target recognition, and thermal fault diagnosis for infrared images of complex electrical equipment, providing practical insights and robust solutions for future automation of electrical equipment inspections.


Infrared image preprocessing

Image denoising
Image denoising processes degraded, noisy images to estimate the original image. The conventional Denoising Convolutional Neural Network (Dn-CNN) uses a fixed 3 × 3 convolutional kernel to extract noise features from images. However, Dn-CNN mainly learns noise information from noisy images, and this noise follows no regular shape, which limits the effectiveness of feature extraction with a fixed-shape convolutional kernel 13. To overcome this, a deformable convolution module is introduced to form the DeDn-CNN, which employs a deformable 3 × 3 convolution in place of the original convolution operation. The network's first layer is changed from Conv + ReLU to Deform Conv + ReLU, and the last layer is changed from Conv to Deform Conv, as depicted in Fig. 1.
The deformable convolution module adds an offset to each sampling point, as illustrated in Fig. 2. The top branch generates the index offsets by passing the input feature map through a regular convolution layer, while the bottom branch convolves the input feature map with the corresponding kernel to produce the output feature map 14. The deformable convolution kernels can therefore adapt to complex noise patterns in images.
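The offset mechanism can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's DeDn-CNN layer: it handles a single channel, uses zero padding, takes the offsets as a ready-made array (in a network they would come from the regular convolution branch), and the names `bilinear_sample` and `deform_conv2d` are hypothetical.

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample a 2-D array at a fractional (y, x); points outside the image read as 0."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy, xx]
    return value

def deform_conv2d(img, weight, offsets):
    """Single-channel 3x3 deformable convolution (illustrative sketch).

    img:     (H, W) input feature map
    weight:  (3, 3) kernel
    offsets: (H, W, 9, 2) per-pixel (dy, dx) shift for each of the 9 kernel taps
    """
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for t, (dy, dx) in enumerate(taps):
                oy, ox = offsets[i, j, t]
                # The regular grid position is shifted by the learned offset.
                acc += weight[dy + 1, dx + 1] * bilinear_sample(img, i + dy + oy, j + dx + ox)
            out[i, j] = acc
    return out
```

With all offsets set to zero the function reduces to an ordinary zero-padded 3 × 3 convolution, which is a convenient sanity check; non-zero offsets let each tap read from wherever the noise pattern actually lies.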

Image enhancement
The original infrared image is decomposed into two layers, basic and detail, using Weighted Guided Filtering (WGF). These layers are processed individually and then combined to produce the enhanced image. For the basic layer, which suffers from low contrast and poor quality, an improved SSR algorithm integrated with anisotropic diffusion filtering is employed to adjust the grayscale, enhancing dark regions in the image and improving overall contrast. For the detail layer, which contains numerous edge and texture features, an arctan nonlinear function is applied to emphasize these details without introducing additional noise.

Image layering based on weighted guided filtering
Traditional guided filtering applies a fixed regularization factor ε to every region of the image, which ignores the textural differences among regions. To address this limitation, WGF introduces an edge weighting factor Γ_G, allowing ε to be adjusted adaptively according to the local smoothness of the image, thereby improving the algorithm's ability to preserve image edges 15. The edge weighting factor Γ_G and the modified linear factor a_k are defined in the following equation.
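The equation itself is not reproduced in this text. A form consistent with the description that follows and with the weighted guided filtering literature is sketched below; the guide image G, the small constant ε_0, the window statistics μ_k, σ_k² and p̄_k, and the summation index j are notational assumptions rather than the paper's own symbols.

```latex
\[
\Gamma_G(i) = \frac{1}{N}\sum_{i'=1}^{N}
  \frac{\delta^{2}_{G,i}(i) + \varepsilon_0}{\delta^{2}_{G,i}(i') + \varepsilon_0},
\qquad \varepsilon_0 = (0.001 \times L)^{2}
\]
\[
a_k = \frac{\dfrac{1}{|w_k|}\sum_{j \in w_k} G_j\,p_j - \mu_k\,\bar{p}_k}
           {\sigma_k^{2} + \varepsilon / \Gamma_G(k)}
\]
```

Here μ_k and σ_k² would be the mean and variance of the guide image in window w_k, p̄_k the mean of the input image in the same window, and Γ_G(k) the edge weight at the window's center pixel.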
where δ²_{G,i}(i) is the variance within the window w_k of the image centered on pixel i; Γ_G(i) is the mean, taken over all windows in the image, of the ratio of the current window's variance to each window's variance; N is the total number of pixels; and L is the dynamic range of the image gray levels 16.
If a pixel lies in a region of the image with sharp variations, the variance within the window centered on that pixel is larger, so Γ_G(i) is greater than 1. This leads to a higher value of a_k, which in turn better preserves edge details. In contrast, in smoother regions of the image, Γ_G(i) is likely to be less than 1, resulting in a smaller a_k and a smoother output in the filtered image.
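The behavior of the edge weighting factor can be checked numerically. The sketch below assumes a 3 × 3 variance window, edge-replicated borders, a dynamic range of 256, and a stabilizing constant of (0.001 × L)²; none of these choices is stated in the text, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_variance(img, radius=1):
    """Variance inside the (2r+1) x (2r+1) window centered on every pixel."""
    padded = np.pad(img.astype(float), radius, mode="edge")
    size = 2 * radius + 1
    windows = sliding_window_view(padded, (size, size))  # shape (H, W, size, size)
    return windows.var(axis=(-2, -1))

def edge_weight(img, dynamic_range=256.0, radius=1):
    """Edge weighting factor: mean over all pixels i' of (var(i)+eps0)/(var(i')+eps0)."""
    var = local_variance(img, radius)
    eps0 = (0.001 * dynamic_range) ** 2  # assumed stabilizing constant
    return (var + eps0) * np.mean(1.0 / (var + eps0))

# A vertical step edge: flat on both sides, sharp change in the middle.
step = np.zeros((8, 8))
step[:, 4:] = 100.0
gamma = edge_weight(step)
print(gamma[4, 4] > 1.0, gamma[4, 0] < 1.0)  # edge pixel vs. flat pixel
```

On this step image the factor exceeds 1 at the edge columns and falls below 1 in the flat areas, matching the qualitative description above.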
WGF is applied to the input image to yield a smoother base layer, and the detail layer is obtained by subtracting this base layer from the original image, as given in the following equations 17.
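The equations are missing from this text; from the sentence above and the symbol definitions that follow, they presumably take this form:

```latex
\[
q = \mathrm{WGF}(p), \qquad O = p - q
\]
```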
where p is the original image to be enhanced, q is the basic layer output by weighted guided filtering, O is the decomposed detail layer, and WGF denotes the weighted guided filtering operation. The basic layer is subsequently enhanced by the improved SSR algorithm. The detail layer O is processed by a nonlinear function to suppress the noise in the image, with the expression shown:

Ani-SSR algorithm
According to Retinex theory, the illumination component of an image is relatively uniform and changes gradually. Single-Scale Retinex (SSR) typically uses Gaussian wrap-around filtering to extract low-frequency information from the original image as an approximation of the illumination component L(x, y). However, Gaussian wrap-around filtering tends to skew the estimate of the illumination component at the strong edges of the image, often resulting in a pronounced halo effect around object edges in the enhanced image 18. As a solution, anisotropic diffusion filtering is used in place of Gaussian wrap-around filtering. This alternative provides a more accurate estimate of the illumination at image boundaries and reduces halo artifacts at strong edges. The anisotropic diffusion equation is presented below.
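The equation is not reproduced in this text. The standard Perona-Malik form, which matches the operators listed in the definitions that follow, is assumed here:

```latex
\[
\frac{\partial A}{\partial t}
  = \operatorname{div}\bigl(c\,\nabla A\bigr)
  = c\,\Delta A + \nabla c \cdot \nabla A
\]
```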
where A is the input grayscale image; t is the diffusion time; div is the divergence operator; ∇ is the gradient operator; Δ is the Laplace operator; and c is the diffusion function, which controls the rate of diffusion.
where k is the thermal conductivity coefficient, which controls the filtering sensitivity: the larger the value of k, the smoother the resulting image, but the more the image details are blurred 19. ‖·‖ is the norm for calculating the difference between predicted noise and true noise. Using anisotropic diffusion filtering instead of Gaussian wrap-around filtering makes the estimate of the illumination component at image boundaries more accurate and attenuates the halo at strong edges in the enhanced image.
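A minimal sketch of how anisotropic diffusion could stand in for the Gaussian surround in SSR is given below. It assumes the exponential diffusion function c = exp(−(‖∇A‖/k)²), a four-neighbor discretization, and a log(1 + x) transform to avoid log(0); the text does not confirm any of these choices, and the function names and default parameters are hypothetical.

```python
import numpy as np

def anisotropic_diffusion(img, n_iter=10, k=15.0, step=0.2):
    """Perona-Malik style diffusion; step <= 0.25 keeps the 4-neighbor scheme stable."""
    a = img.astype(float).copy()
    for _ in range(n_iter):
        padded = np.pad(a, 1, mode="edge")
        # Differences to the north, south, west and east neighbors.
        diffs = (
            padded[:-2, 1:-1] - a,
            padded[2:, 1:-1] - a,
            padded[1:-1, :-2] - a,
            padded[1:-1, 2:] - a,
        )
        # The diffusion function is close to 1 for small differences (smooth
        # areas) and close to 0 for large ones (strong edges).
        a += step * sum(np.exp(-(d / k) ** 2) * d for d in diffs)
    return a

def ani_ssr(img, **diffusion_kwargs):
    """Single-scale Retinex with the illumination estimated by anisotropic diffusion."""
    img = img.astype(float)
    illumination = anisotropic_diffusion(img, **diffusion_kwargs)
    return np.log1p(img) - np.log1p(illumination)
```

Because the diffusion coefficient collapses at large gradients, the illumination estimate does not bleed across strong edges, which is the property the text relies on to suppress halos.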

Preprocessing results
Infrared temperature measurements were conducted using a Testo 875-1i thermal imaging camera at various substations in Northwest China. A total of 508 infrared images of complex electrical equipment, each 320 × 240 pixels, were collected. Of these, 457 were randomly selected as the training set after artificial noise was added, and the remaining 51 images formed the test set. The DeDn-CNN was benchmarked against Dn-CNN, NL-means 20, wavelet transform 21, and Lazy Snapping 22 for denoising, as shown in Fig. 3. Figure 3 shows that NL-means and the wavelet transform denoise somewhat less effectively than Dn-CNN, with NL-means leaving more residual noise and more severe image distortion. The infrared image denoised with Dn-CNN has fewer residual noise spots because Dn-CNN autonomously extracts more abstract feature information from the noise by learning the difference between the noise map and the clean map, rather than relying on manually summarized statistical noise properties. This allows it to better fit the noise distribution of the image. The DeDn-CNN achieves the best denoising results because its feature extraction adapts better to noise with chaotic distributions and irregular shapes, leaving the least residual noise and attaining the highest image fidelity. The average PSNR values for NL-means, wavelet transform, Dn-CNN, and DeDn-CNN are 33.47, 34.82, 38.25, and 40.33, respectively, which further demonstrates that DeDn-CNN is more effective at removing noise from infrared images.
The Ani-SSR algorithm is compared with histogram equalization, the original SSR, and bilateral filter layering 23, as depicted in Fig. 4. The original infrared image exhibits a low overall gray level, low contrast, and a poor visual effect. Histogram equalization enhances the brightness and contrast of the image but narrows the range of gray levels and degrades image details more severely. The original SSR enhancement of the infrared image produces a pronounced halo effect and a serious loss of texture, which hinders subsequent equipment recognition. The bilateral filter results show over-enhancement, leaving the image overexposed and visually unappealing. In contrast, Ani-SSR improves image contrast while preserving rich edge information and texture details. It overcomes the halo effect of the original SSR, particularly at strong edges with drastic gradient changes, and provides superior overall enhancement of infrared images of electrical equipment.
The average gradient (AG) is also used as an objective evaluation index, as shown in the following equation.
where G_{i,j} is the gradient value of the pixel at (i, j) in the image. The larger the AG, the richer the edge and texture information. The AG of each algorithm is compared in Table 1. Table 1 shows that the original SSR achieves a lower AG because it cannot adapt to regions with drastic edge changes: it uses a Gaussian function during enhancement, which loses image edges and texture details. Ani-SSR preserves more image details while enhancing contrast and achieves a higher average gradient than the other three algorithms, which objectively demonstrates the effectiveness of the proposed algorithm.
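Since the AG equation is not reproduced in this text, the sketch below uses one common definition as an assumption: G_{i,j} is the root mean square of the horizontal and vertical forward differences, and AG is its mean over the image. The paper's exact normalization may differ.

```python
import numpy as np

def average_gradient(img):
    """Mean of per-pixel gradient magnitudes built from forward differences."""
    f = img.astype(float)
    dx = f[:-1, 1:] - f[:-1, :-1]   # horizontal difference
    dy = f[1:, :-1] - f[:-1, :-1]   # vertical difference
    g = np.sqrt((dx ** 2 + dy ** 2) / 2.0)
    return float(g.mean())

flat = np.full((8, 8), 120.0)
ramp = np.tile(np.arange(8.0), (8, 1))   # gray level rises by 1 per column
print(average_gradient(flat))            # 0.0: no edges or texture
print(round(average_gradient(ramp), 4))  # 0.7071, i.e. sqrt(1/2)
```

A featureless image scores zero, and the score grows with the density and strength of edges, which is why a higher AG is read as richer detail.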

Refined detection of complex electrical equipment
The single-stage target detection network RetinaNet 24,25 has been improved to better suit the detection of electrical equipment, which often has a large aspect ratio, is tilted, and is densely arranged. The horizontal rectangular frame of the original RetinaNet has been altered to a rotating rectangular frame to accommodate

Original RetinaNet
Contemporary mainstream target detection networks fall into two categories: two-stage target detection algorithms exemplified by Faster-RCNN and one-stage target detection algorithms such as the YOLO algorithms.
The former relies on a Region Proposal Network (RPN), which introduces additional computational complexity, while the latter directly predicts the target classification confidence and location parameters through regression, typically with lower accuracy. RetinaNet employs the Focal Loss function to balance the weights of difficult and easy samples within the loss calculation, combining the benefits of detection accuracy and speed 26.
RetinaNet comprises three components: the backbone, neck, and head, as illustrated in Fig. 5. The backbone is primarily responsible for feature extraction, often using ResNet-101; the neck uses Feature Pyramid Networks (FPN), which integrate features from the different scales output by the backbone to adapt to objects of various sizes; the head, employing Fully Convolutional Networks (FCN), predicts the target location regression parameters and classification confidence for feature maps at different scales 27.

Rotating rectangular frame
Given the dense arrangement and potential tilt of electrical equipment due to the angle of capture, the standard horizontal rectangular frame of RetinaNet may only provide an approximate equipment location and can lead to overlaps. When the tilt angle is significant, such as close to 45°, the horizontal frame includes more irrelevant background information. By predicting the equipment's tilt angle and changing the horizontal rectangular frame to a rotated rectangular frame, the accuracy of localization and identification of electrical equipment can be considerably improved. The two detection frames are compared in Fig. 6.
The rotational frame defined in this paper is illustrated in Fig. 7. Here, the side forming an acute angle with the positive direction of the x-axis is labeled h, while the other side of the rectangle is labeled w. The angle θ is defined as the acute angle between h and the x-axis, with its value in the range [−π/2, 0). Five parameters are needed to define a rotated frame: (x, y, w, h, θ), which represent the coordinates, width, height, and inclination angle, respectively.
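To make the five-parameter representation concrete, the sketch below converts (x, y, w, h, θ) into the four corner points. It assumes that (x, y) is the center of the frame and that θ is measured from the x-axis to the h side, which is one reading of the definition above; the helper names are hypothetical.

```python
import math

def rotated_box_corners(x, y, w, h, theta):
    """Corner points of a rotated frame, assuming (x, y) is its center."""
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the h side
    vx, vy = -uy, ux                           # unit vector along the w side
    corners = []
    for sh, sw in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        cx = x + sh * (h / 2) * ux + sw * (w / 2) * vx
        cy = y + sh * (h / 2) * uy + sw * (w / 2) * vy
        corners.append((cx, cy))
    return corners

def polygon_area(points):
    """Shoelace formula for a simple polygon given as ordered vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

box = rotated_box_corners(10.0, 20.0, 2.0, 6.0, -math.pi / 4)
print(round(polygon_area(box), 6))  # 12.0, the same as w * h
```

Rotation leaves the enclosed area at w × h, whereas the axis-aligned frame needed to cover the same tilted object would be larger and would take in more background.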
The pixel areas at the five detection scales are 32², 64², 128², 256², and 512². Each pixel area is combined with three scale factors [2^0, 2^(1/3), 2^(2/3)] and three aspect ratios [0.5, 1, 2], producing nine frames. Since electrical equipment typically has an elongated shape with a large aspect ratio, this paper extends the original three aspect ratio factors to seven scales:
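The nine-frame construction of the original RetinaNet can be enumerated as follows. The sketch assumes that the aspect ratio means height divided by width and that the frame area is preserved when the ratio changes; the function name is hypothetical, and the seven extended ratios are not listed here because they are not reproduced in this text.

```python
import math

AREAS = (32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2, 512 ** 2)
SCALES = (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))
RATIOS = (0.5, 1.0, 2.0)

def anchor_shapes(areas, scales, ratios):
    """Return {area: [(w, h), ...]} with one entry per scale/ratio combination."""
    shapes = {}
    for area in areas:
        level = []
        for scale in scales:
            side = math.sqrt(area) * scale      # side of the equivalent square
            for ratio in ratios:                # ratio is taken as h / w
                w = side / math.sqrt(ratio)
                h = side * math.sqrt(ratio)
                level.append((w, h))
        shapes[area] = level
    return shapes

shapes = anchor_shapes(AREAS, SCALES, RATIOS)
print([len(v) for v in shapes.values()])  # [9, 9, 9, 9, 9]
```

Passing a seven-element tuple of ratios to the same function yields 3 × 7 = 21 shapes per level, which is how the extension described above increases the number of frames.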

Attention mechanism
The Attention module enhances the network's ability to discern prominent features in both the channel and spatial dimensions of the feature map by integrating average and maximum pooling. In this paper, the detection target is power equipment in substations, environments that are often cluttered and have complex backgrounds. Therefore, the network is improved with the Attention module 28. Adding the Attention module to the shallow-layer feature maps does not significantly improve performance because of the limited number of channels and the minimal feature information extracted at these levels. Conversely, adding it to the deeper network layers is less effective because the feature map's information extraction and fusion operations are already complete; it would also unnecessarily complicate the network. Consequently, in this study, the Attention module is introduced after the backbone and before the FPN module, as shown in Fig. 9.
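The description matches a channel-then-spatial attention block of the CBAM type. The NumPy sketch below is written under that assumption; the shared two-layer MLP, the 7 × 7 spatial kernel, the random weights in the usage lines, and all function names are illustrative choices, not details confirmed by the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Average- and max-pooled descriptors pass through a shared MLP."""
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # Linear -> ReLU -> Linear
    return sigmoid(mlp(avg) + mlp(mx))             # one weight per channel

def spatial_attention(x, kernel):
    """kernel: (2, k, k), convolved over the channel-wise [average, max] maps."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])        # (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(padded, kernel.shape[-2:], axis=(1, 2))
    return sigmoid(np.einsum("chwij,cij->hw", windows, kernel))  # (H, W)

def attention_module(x, w1, w2, kernel):
    """Reweight channels first, then spatial positions."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 5, 6))              # toy backbone output
out = attention_module(features,
                       rng.normal(size=(2, 8)),    # reduction to C / 4
                       rng.normal(size=(8, 2)),
                       rng.normal(size=(2, 7, 7)))
print(out.shape)  # (8, 5, 6)
```

Both attention maps lie in (0, 1), so the block only rescales the backbone features; it leaves the tensor shape unchanged and can therefore sit between the backbone and the FPN without altering either.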

Path aggregation network (PAN)
The Path Aggregation Network (PAN) is incorporated after the FPN module, as indicated in Fig. 10. The original FPN module conveys the deep feature map's strong semantic information to the shallow feature map via a "top-down" approach but does not carry the detailed target location and texture information from the shallow feature map to the deep feature map 29. The PAN structure enables a "bottom-up" feature fusion mechanism by downsampling the shallow feature map with Conv + BN + ReLU and then superimposing it onto the deeper feature map. This approach enriches the target texture and position information conveyed from the shallow to the deeper feature map. The integration of the FPN and PAN modules optimizes the use of features extracted by the backbone, fuses feature parameters across different layers, and addresses the limitation of single-scale feature maps in one-stage methods, which may not effectively represent object location and semantic information across multiple scales simultaneously.

Head structure and loss function
The original head predicts the classification confidence parameter and the location regression parameter using Fully Convolutional Networks (FCN) 30. Because the number of frames increases in this paper, the FCN must be changed accordingly, as presented in Fig. 11. The original RetinaNet only needs to predict the 4 parameters (t_x′, t_y′, t_w′, t_h′) of the horizontal rectangular frame, so the last layer outputs a tensor of size W × H × 4A. The rotating rectangular frame adds the prediction of the angle, so the network must be adjusted to predict the 5 parameters (t_x′, t_y′, t_w′, t_h′, t_θ′), outputting a tensor of size W × H × 5A. The loss function of the original RetinaNet is divided into two parts: classification loss and position regression loss. To detect tilted electrical equipment accurately, the angular offset of the target is added to the position regression loss, as shown in the following equation.
where (x, y, w, h, θ) and (x_a, y_a, w_a, h_a, θ_a) are the position coordinates and tilt angle of the real frame and predicted frame, respectively, and (t_x, t_y, t_w, t_h, t_θ) represents the offset of the predicted frame relative to the real frame. The position regression loss is calculated with the Smooth L1 function.
where t_i ranges over (t_x, t_y, t_w, t_h, t_θ) and t_i′ ranges over (t_x′, t_y′, t_w′, t_h′, t_θ′). The total loss of target classification and position regression is calculated as:
where N denotes the number of frames; t_n′ takes 1 when the frame is foreground and 0 when the frame is background; t_ni′ represents the coordinate offset of the predicted position corresponding to the n-th frame; t_ni expresses the coordinate offset of the n-th frame with respect to the real frame; p_n denotes the multi-category confidence distribution of the n-th frame predicted by the sub-network after the Sigmoid function is applied; and t_n expresses the category label of the real target corresponding to the n-th frame. L_cls denotes the category loss, calculated using the Focal Loss function of the original RetinaNet; the parameters λ_1 and λ_2 are taken as 1 by default.
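The loss equations are not reproduced in this text, so the following is only a sketch of how the pieces described above could fit together. It uses the usual Smooth L1 definition with a threshold of 1, the Focal Loss defaults α = 0.25 and γ = 2 from the original RetinaNet, normalization by the number of frames N, and an assumed assignment of λ_1 to the regression term and λ_2 to the classification term.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def regression_loss(pred, target, is_foreground):
    """pred, target: (N, 5) offsets (x, y, w, h, theta); is_foreground: (N,) of 0/1."""
    per_frame = smooth_l1(pred - target).sum(axis=1)
    # Background frames contribute nothing to the regression term.
    return float((is_foreground * per_frame).sum() / len(pred))

def focal_loss(p, t, alpha=0.25, gamma=2.0):
    """p: (N, K) sigmoid outputs; t: (N, K) one-hot labels."""
    p_t = np.where(t == 1, p, 1.0 - p)
    alpha_t = np.where(t == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).sum() / len(p))

def total_loss(pred, target, is_foreground, p, t, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the regression and classification terms (weights assumed)."""
    return lambda1 * regression_loss(pred, target, is_foreground) + lambda2 * focal_loss(p, t)

print(smooth_l1(np.array([0.5, 2.0])))  # [0.125 1.5]
```

The only change relative to a horizontal-frame detector is the fifth column of the offset arrays, which carries the angular term into the same Smooth L1 sum.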

Performance comparison
Infrared images of six types of substation equipment (insulator strings, potential transformers (PTs), current transformers (CTs), switches, circuit breakers, and transformer bushings) were selected for recognition. The detection accuracy of the improved RetinaNet is evaluated using Average Precision (AP) and mean Average Precision (mAP). AP assesses the detection accuracy for a specific type of electrical equipment, while mAP is the mean of the APs across all equipment types, indicating the overall detection accuracy. AP and mAP are defined as follows.
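The defining equations are missing from this text. The standard definitions, which agree with the symbols explained next, are assumed to be the following, with K the number of equipment types (six here):

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{K}\sum_{k=1}^{K} AP_k
\]
```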
where TP represents the number of positive samples classified correctly, FP represents the number of negative samples incorrectly classified as positive, FN is the number of positive samples incorrectly labeled as negative, and P and R are the precision and recall, respectively. Table 2 presents the APs and mAPs of different models detecting six types of electrical equipment, including Faster R-CNN, YOLOv3, the original RetinaNet, and the improved RetinaNet. The improved RetinaNet's AP values surpass those of the other three models for all six equipment types. The model's mAP is 1.9 percentage points higher than that of the original RetinaNet, indicating improved detection accuracy. Additionally, where electrical equipment is densely arranged at various angles, the rotating rectangular frame achieves more precise detection than the horizontal frame, as illustrated in Fig. 12. For tilted electrical equipment, the rotating rectangular frame introduces less background information than the horizontal rectangular frame, and the detection results for densely arranged equipment overlap less, which helps separate the equipment for fault diagnosis based on thermal information.
In Fig. 12, the two rows display the detection results of the original RetinaNet and the improved RetinaNet, respectively. Figures 12a,b show that insulator strings and CTs, which have large tilt angles, are poorly served by algorithms using horizontal rectangular frames, because these frames include a significant amount of background unrelated to the electrical equipment. In contrast, the improved RetinaNet follows the edges of the equipment more closely, reducing the inclusion of extraneous background information. Figures 12c,d show that, due to the camera angle, the equipment appears not only tilted but also densely arranged, which makes it difficult for traditional detection networks based on horizontal rectangular frames to separate individual equipment. The improved RetinaNet uses rotating frames to locate and identify equipment, circumventing the limitations of conventional framing and reducing overlap, thereby achieving more precise detection.
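For readers who want to reproduce AP figures like those compared in Table 2, the sketch below computes AP as the area under the precision-recall curve using all-point interpolation. This evaluation protocol is an assumption, since the text does not state which variant was used, and the matching of detections to ground truth (which for rotated frames would need a rotated IoU) is taken as given.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_true_positive, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / n_ground_truth
    precision = tp / (tp + fp)
    # Add sentinels, then make precision monotonically non-increasing.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return float(np.sum((mrec[1:] - mrec[:-1]) * mpre[1:]))

def mean_average_precision(ap_values):
    return float(np.mean(ap_values))

# Three detections ranked by confidence; the second one is a false positive.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], n_ground_truth=2)
print(round(ap, 4))  # 0.8333
```

The toy values are made up for illustration. A detector whose detections are all correct and cover every object scores 1.0, and each false positive ranked above a true positive pulls the score down.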

Semantic segmentation of electrical equipment
Semantic segmentation classifies an image pixel by pixel into different semantic categories based on pixel features, as exemplified in Fig. 13. DeeplabV3+ uses a classic encoder-decoder structure 32. Its encoder eliminates pooling operations to preserve more detail and positional information. Additionally, by incorporating a channel-separable convolution module, the encoder decouples spatial from channel information, reducing the parameter count during network training 33. The decoder produces prediction maps that match the original image's resolution; for instance, Fig. 13 classifies pixels on the top of the transformer bushing and on the bushing itself 34. Our focus is on segmenting three vulnerable structures: the cap of the transformer bushing (Cap), the disconnecting link of switches (Disconnecting Link), and the potential transformer bushing (Bushing).

Fault diagnosis of thermal fault-prone structures
The relative temperature-difference method identifies faults from the temperature difference between corresponding points on two pieces of equipment with the same or similar basic states, such as category, load, and environment. First, the temperature difference between the corresponding temperature points of the two pieces of equipment is measured; then the temperature rise of the hotter of the two points is calculated. Lastly, the relative temperature difference δ_t is computed as the ratio of the two: δ_t = (τ_1 − τ_2) / τ_1 × 100% = (T_1 − T_2) / (T_1 − T_0) × 100%, where δ_t is the relative temperature difference between the two pieces of equipment under test, τ_1 is the temperature rise of the hot spot under test (unit: K), T_1 is the temperature of the hot spot (unit: K), τ_2 and T_2 are the temperature rise and temperature of the normal temperature point, and T_0 is the ambient temperature. The relative temperature-difference method is primarily applicable to judging current-heating faults. In particular, for abnormal heating under a small load current, it reduces the probability of missing a defect.
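The three steps above translate directly into a small helper. The function name is hypothetical. In the usage line, the hot-spot and normal temperatures are those of the cap example reported in the experimental analysis, while the ambient temperature of 20.0 °C is an assumed value that the text does not report.

```python
def relative_temperature_difference(t_hot, t_normal, t_ambient):
    """Relative temperature difference in percent.

    All three temperatures must share one unit (K or degrees Celsius); only
    differences enter the formula, so the choice does not affect the result.
    """
    tau_hot = t_hot - t_ambient        # temperature rise of the hot spot
    tau_normal = t_normal - t_ambient  # temperature rise of the normal point
    if tau_hot <= 0:
        raise ValueError("the hot spot must be warmer than the ambient temperature")
    return (tau_hot - tau_normal) / tau_hot * 100.0

# Ambient temperature of 20.0 is an assumption made for this example.
print(round(relative_temperature_difference(59.5, 25.9, 20.0), 2))  # 85.06
```

Because the measure is normalized by the hot spot's own temperature rise, a lightly loaded device with a modest absolute temperature can still produce a large δ_t, which is what makes the method useful under small load currents.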
The similar comparison method compares the temperatures of equipment of the same type under the same working conditions and the same external environment to identify thermal defects, and it can be used to diagnose potential-heating faults.
Diagnostic criteria are set for the Cap, Disconnecting Link, and Bushing. The Cap and Disconnecting Link are prone to current-heating faults, and their criteria are shown in Table 3. The Bushing is prone to potential-heating faults: if the temperature difference is less than 2 K, no fault is reported, and if the temperature difference is greater than this threshold, a potential-heating fault is reported.

Thermal fault diagnosis of the cap
The Cap is prone to current-heating faults, often because loosened internal bolts or aged and corroded wiring increase the resistance and therefore the heat generated. Figure 14 illustrates the fault diagnosis process of the Cap. The Cap is first detected using the improved RetinaNet, and the results are input into the DeeplabV3+ model for segmentation, which separates n regions of the Cap. The local temperature maxima T_1, T_2, T_3 … T_n are extracted; the largest is taken as the hot spot temperature T_max and the smallest as the normal temperature T_min, and the relative temperature difference δ_t is computed. If T_max and δ_t satisfy the discriminating conditions, the corresponding fault level is assigned; otherwise, the equipment is judged to be normal.

Thermal fault diagnosis of the disconnecting link
The Disconnecting Link is prone to current-heating faults. Frequent switching operations of the Disconnecting Link often result in insufficient spring clamping force and abrasion of the contact fingers. Figure 15 illustrates the fault diagnosis process of the Disconnecting Link. The local temperature maxima T_1, T_2, T_3 … T_n are obtained; the largest is taken as the hot spot temperature T_max and the smallest as the normal temperature T_min, and the relative temperature difference δ_t is computed. T_max and δ_t are then used to determine whether the equipment is faulty.

Thermal fault diagnosis of the bushing
The Bushing is prone to abnormal heating caused by failure of the internal capacitance unit, which is a potential-heating fault. Capacitor unit faults arise primarily from moisture, aging of capacitive components, and other factors, and are usually more frequent in the wet season. The fault diagnosis process of the Bushing is shown in Fig. 16. Since the Bushing is subject to potential-heating faults, the basis for judgment differs from that for current-heating faults. Potential transformers are first detected using the improved RetinaNet, and the results are input into the DeeplabV3+ model for segmentation. The maximum temperatures T_1, T_2, T_3 … T_n are extracted for each region, and the hot spot temperature max(T_1, T_2, T_3 … T_n) and the normal temperature min(T_1, T_2, T_3 … T_n) are selected. If the temperature difference exceeds 2 K, the Bushing is judged to have a potential-heating fault; otherwise, it is judged to be normal.
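The decision rule for the Bushing can be sketched as follows, taking the segmentation output as an integer label map in which 0 marks the background. That encoding, the function name, and the toy temperature values are assumptions made for illustration.

```python
import numpy as np

def diagnose_bushing(temperature_map, region_labels, threshold_k=2.0):
    """Compare the hottest and coolest regional maxima against a fixed threshold."""
    regions = [r for r in np.unique(region_labels) if r != 0]  # 0 = background
    if not regions:
        raise ValueError("no segmented bushing region found")
    # One local maximum per segmented region: T_1 ... T_n.
    maxima = [float(temperature_map[region_labels == r].max()) for r in regions]
    t_hot, t_normal = max(maxima), min(maxima)
    difference = t_hot - t_normal
    return {
        "t_hot": t_hot,
        "t_normal": t_normal,
        "difference": difference,
        "fault": difference > threshold_k,
    }

# Toy 2 x 4 temperature map with two segmented regions (labels 1 and 2).
temps = np.array([[30.0, 29.0, 21.0, 21.0],
                  [21.0, 21.0, 33.2, 31.0]])
labels = np.array([[1, 1, 0, 0],
                   [0, 0, 2, 2]])
result = diagnose_bushing(temps, labels)
print(result["fault"])  # True: 33.2 - 30.0 exceeds the 2 K threshold
```

The Cap and Disconnecting Link would follow the same extraction of regional maxima, but their final decision depends on the fault-level criteria in Table 3, which are not reproduced in this text.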

Experimental analysis
A selection of 282 infrared images containing bushings, disconnecting links, and PTs was chosen for fault diagnosis. The test set includes 47 infrared images of thermal faults on transformer bushing caps and 52 images showing abnormal heating at disconnecting links, as shown in Table 4. The images of PTs comprise 44 with faults and 38 without faults. The fault diagnosis results for the three types of equipment are displayed in Tables 5, 6, and 7, respectively. Of the 143 fault images, faults were identified in 41 images of caps, 45 images of disconnecting links, and 40 images of PT bushings. The recognition accuracies reached 87.23%, 86.54%, and 90.91%, with false alarm rates of 7.50%, 8.20%, and 7.89%, respectively. The recognition results for some of the thermal fault images are presented in Fig. 17. The cap shown in Fig. 17 exhibits a current-heating fault due to corrosion. The maximum temperature of the cap was 59.5 °C, the normal temperature was 25.9 °C, and the relative temperature difference δ_t was 85.06%. The algorithm in this paper identifies this as a severe fault, which is consistent with the actual sample's fault level. The disconnecting link underwent oxidation due to long-term operational switching, causing an abnormal temperature rise. The maximum temperature recorded for the structure was 103.3 °C, the normal temperature was 41.4 °C, and δ_t was 70%. The diagnostic model in this paper classified this as a severe fault. The temperature difference between the faulty and non-faulty states of the bushing was 3.2 K, exceeding the judgment threshold and indicating a potential-heating fault.
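The reported accuracies follow from the counts given above, as the short check below shows. Pairing the 47, 52, and 44 fault images with caps, disconnecting links, and PT bushings respectively is inferred from the order in which the figures are reported.

```python
# (correctly identified fault images, total fault images) per structure
counts = {
    "cap": (41, 47),
    "disconnecting link": (45, 52),
    "PT bushing": (40, 44),
}

def accuracy_percent(identified, total):
    return round(100.0 * identified / total, 2)

accuracies = {name: accuracy_percent(*pair) for name, pair in counts.items()}
print(accuracies)
# {'cap': 87.23, 'disconnecting link': 86.54, 'PT bushing': 90.91}
print(sum(total for _, total in counts.values()))  # 143 fault images in all
```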

Conclusion
This paper presents a deep learning-based fault diagnosis method for electrical equipment that handles denoising, detection, recognition, and semantic segmentation of infrared images and combines them with temperature difference information. A comprehensive approach is proposed, ranging from preprocessing to recognition, for diagnosing thermal faults in infrared images of electrical equipment. This contributes valuable experience and viable solutions for future automation of electrical equipment inspection.
(1) A denoising algorithm for infrared images, DeDn-CNN, is introduced.It incorporates a deformable convolution module into the Dn-CNN to autonomously learn noise features in infrared images.Additionally,

Figure 4. Comparison of image enhancement results.

Figure 6. Comparison of the detection effect of two frames.

Figure 14. Diagnostic process of the cap.

Table 1. Comparison of AG score.

Table 2. Comparison of detection results of different models.

Table 3. Diagnostic criteria for faults.

Table 4. Fault diagnosis data set.

Table 5. Fault diagnosis results of the cap.

Table 6. Fault diagnosis results of the disconnecting link.

Table 7. Fault diagnosis results of the bushing.

Figure 17. Diagnostic effect of some images.