Transformer fault diagnosis method based on SMOTE and NGO-GBDT

In order to improve the accuracy of transformer fault diagnosis and improve the influence of unbalanced samples on the low accuracy of model identification caused by insufficient model training, this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT. Firstly, the Synthetic Minority Over-sampling Technique (SMOTE) was used to expand the minority samples. Secondly, the non-coding ratio method was used to construct multi-dimensional feature parameters, and the Light Gradient Boosting Machine (LightGBM) feature optimization strategy was introduced to screen the optimal feature subset. Finally, Northern Goshawk Optimization (NGO) algorithm was used to optimize the parameters of Gradient Boosting Decision Tree (GBDT), and then the transformer fault diagnosis was realized. The results show that the proposed method can reduce the misjudgment of minority samples. Compared with other integrated models, the proposed method has high fault identification accuracy, low misjudgment rate and stable performance.

is related to the stability of the power system.When a transformer malfunctions, if accurate diagnosis cannot be made in a timely manner, it will cause significant economic losses.Therefore, how to improve the accuracy of transformer fault diagnosis has always been a hot topic for scholars to study.
As the aging process of transformer insulation progresses, H 2 , CH 4 , C 2 H 6 , C 2 H 4 , C 2 H 2 , CO 2 , and other gases are produced and dissolve into the insulating oil.The present condition of the transformer may be inferred from the concentration and composition of these dissolved gases within the oil 1 .The predominant analytical techniques employed to assess the transformer's condition encompass the IEC three-ratio method 2 , Rogers' four-ratio method 3 , Duval Pentagon 4 , Doernberg's ratio method 5 , among others.In 6 , a fuzzy logic approach was proposed to overcome the shortcomings of traditional IEC methods and enhance the accuracy of model diagnosis.In 7 , based upon the data of dissolved gases within oil, a fuzzy logic-based transformer fault diagnosis model employing the Rogers Four Ratio Method has been developed.The model's implementation has demonstrated its capacity to rectify the deficiencies inherent in conventional fault diagnosis methods, thereby enhancing the accuracy of fault diagnosis.Conversely, this method lacks comprehensive coding and the diagnostic threshold is too rigidly defined, thereby failing to capture the intricate nature of faults within the transformer and compromising the accuracy of fault diagnosis 8 .In 9 , the ratio coding method and raw gas data are used to construct 24-dimensional features, which improves the model's ability to distinguish between different faults and makes it more versatile.Ref. 10 .proposes a PSO-RF diagnostic model that extracts transformer fault characteristic information without using coding ratios, thereby improving the model's fault diagnosis capabilities.However, in existing research, the dimensionality explosion problem is less considered when constructing feature parameters.Because as the sample size increases, the fault diagnosis model becomes better.However, the increase in feature dimension leads to an exponential increase in the amount of calculation and an increase in redundant information.Therefore, it is necessary to remove redundant information to improve model operation efficiency and diagnostic accuracy.
As artificial intelligence technology advances, machine learning applications in transformer fault diagnosis have gained momentum.Support Vector Machine [11][12][13] , Convolutional Neural Network(CNN) 14,15 , Self-Organizing Mapping Neural Network(SOM) 16 , Gate Recurrent Unit(GRU) 17,18 , Cloud Model(CM) 19 , Adaptive Boosting(AdaBoost) 20 , Gradient Boosting Decision Tree(GBDT) 21 and other models have demonstrated remarkable success in classification identification.Yet, The fault diagnosis models mentioned above were all constructed based on the assumption of having a relatively large dataset.However, in practical operations, transformers rarely experience failures and the frequencies of different types of faults vary significantly.This makes it difficult to meet the precision requirements using big data samples.Therefore, when addressing the practical challenges of transformer fault diagnosis, the issue of sample imbalance needs to be given immediate attention in order to achieve precision.
The formulation of transformer fault diagnosis models hinges upon an abundance of data sets.In practical operations, the likelihood of transformer malfunction is slim; the variance of diverse fault types is vast, thereby making it challenging to attain the requisite standards for extensive datasets.
Research on imbalanced datasets mainly focuses on developing classifiers and data preprocessing techniques.Data-level processing involves reconstructing the dataset to better align with its inherent characteristics, thereby addressing issues arising from an imbalance in sampling frequency.undersampling 22 involves selecting a subset of the most representative samples from the majority classes to mitigate the issue of class imbalance.However, this approach may result in the loss of crucial information regarding the bulk of sample classes, ultimately impairing the performance of classifiers.Oversampling involves artificially increasing a limited sample size to achieve data balance.This can be done through techniques such as Synthetic Minority Oversampling Technique(SMOTE) 23,24 , SVM SMOTE 25 , Borderline-SMOTE 26 , Adaptive Synthetic Sampling(ADASYN) 27 , Generative Adversarial Network(GAN) 28 , and others.Common approaches at the classification algorithm level include CostSensitive 29 and Ensemble Learning 30 .In 31 , cost-sensitive classifiers are used to address class disparities and improve fault categorization accuracy.The Auxiliary Generation Mutual Countermeasure Network (AGMAN) was proposed in Ref. 32 .to enhance the accuracy of small sample class imbalance fault diagnosis.In 33 , MeanRadius-SMOTE is proposed based on the traditional SMOTE oversampling algorithm, which effectively avoids the generation of useless samples and noisy samples, and the generalization of this algorithm is verified.
The main contributions of this work are as follows: (1) Improved classification performance on imbalanced and small sample data using oversampling methods, avoiding classifiers focusing too much on majority samples and causing the classifier's hyperplane to shift towards minority class samples.(2) Established a deep relationship between dissolved gases in oil and fault types, reduced redundancy between features by using LightGBM feature selection, and improved the computational efficiency of the diagnostic model.(3) Optimized algorithm for parameter optimization of the diagnostic model to establish the optimal diagnostic model.Finally, the effectiveness of the proposed methods in this paper was verified through different sampling methods, different feature selection methods, and different diagnostic models.

Composite minority oversampling technique
The main idea of the SMOTE is to randomly select a majority class sample, then find the k nearest neighbors, and select a sample from the k nearest neighbors according to the sampling probability to generate a new sample based on formula (1), repeatedly balancing the dataset.Among them, Z 1 is the majority class sample; Z 2 is one of the k samples closest to Z i ; rand belongs to a random number of [0,1]; Y represents the newly generated minority class sample.

Northern goshawk optimization algorithm
The northern goshawk optimization algorithm 34 is a new meta heuristic algorithm proposed in 2022, which simulates the behavior of northern goshawks during hunting.The hunting strategy is mainly divided into two stages: prey identification stage and chase and escape behavior stage.
(1) Initialization phase Population initialization, as shown in formula (2): Among them, X represents the matrix of the population, X i is the initial value of the ith individual, x i,j are the values of the jth dimension of the ith individual, N is the number of populations, and m is the dimension of the search space.
The objective function of the population is shown in formula (3): (1)

Based on LightGBM feature selection
LightGBM 35 is an efficient framework of gradient enhancement decision tree algorithm, which can evaluate the importance of features, and speed up the training of models by eliminating features of low importance to avoid dimensional disasters.The steps for calculating the feature importance are as follows: For the training set W = {w 1 ,w 2 ,…,w n } corresponding to{g 1 ,g 2 ,…,g n }, the sampling rate of the sample is a, and the sampling rate of the small gradient sample is b, then the steps for calculating feature importance are as follows: Step 1: sorting the fault samples in descending order according to the absolute value of the gradient; Step 2: selecting the initial a × N samples to form a large gradient sample subset C 1 ; Step 3: a random selection of b × N samples is drawn from the remaining faulty samples to form a smaller gradient sample subset D 1 ; Step 4: adopting C 1 × D 1 to learn new decision trees, assign weights (1-a)/b to small gradient faulty samples when calculating information gain at computing nodes; Step 5: reiterate Steps 1-4 until reaching a predetermined iteration count or convergence threshold.Throughout the model, the sum of the information gain of each feature across all nodes of a split feature represents the significance of that feature.

Transformer fault diagnosis process
The methodology for transformer fault diagnosis through SMOTE and NGO-GBDT entails two stages: offline model training and online identification and diagnosis.The specific workflow is depicted in Fig. 1.During the The specific steps in the offline training phase are as follows: Step 1: Sample data preprocessing.The collected DGA samples are normalized, and the data set is balanced through the application of the SMOTE.
Step 2: Feature selection.The candidate feature set is established through the code-free ratio method, while the optimal input feature is determined via the LightGBM.
Step 3: Model training and validation.The training set, validation set, and test set are separated; the parameters of the GBDT model, including those of max_ depth, n_ estimators, and learning_rate, are optimized using the NGO algorithm.Therefore, utilizing the verification set to assess the diagnostic efficacy of each iterationary model, if the disparity in accuracy between successive training sessions does not exceed five percent, save the model parameters upon conclusion of the training; In the event that such conditions are not met, one must retrain the model until they are fulfilled.
Step 4: Model validation.The test dataset is fed into the optimal model; the diagnostic accuracy of the NGO-GBDT model is validated.
The specific steps in the online identification and diagnosis stage are as follows: Step 1: Sample data preprocessing.Normalizing the DGA samples collected in real-time.
Step 2: The candidate feature set is established through the code-free ratio method, while the optimal input feature is determined via the LightGBM.
Step 3: The subset of optimal characteristics is directly inputted into the optimal model, thereby obtaining the results of online diagnosis of transformers.

Transformer fault diagnosis process
In the context of imbalanced data classification, an overabundance of samples from dominant classes can result in the model's tendency to excessively focus on the majority categories, thereby neglecting a select few.This can lead to the plane of the classifier shifting towards a subset of samples within these categories.To effectively assess the efficacy of the transformed transformer fault diagnosis model, this paper selects a multi classification evaluation index system based on confusion matrix, with accuracy, recall, F1 value, G-mean, and Kappa coefficient as the model evaluation indicators. (

1) Precision and recall
The accuracy is the proportion of predicted positive samples to actual positive samples.The recall rate represents the proportion of predicted positive samples in the actual positive sample results.www.nature.com/scientificreports/Among them, represents the accuracy; R represents the recall rate; TP is the case when the classification of the positive sample is correct; FP is the case where the counter example sample is misclassified; F N is the case where the positive sample is misclassified.
(2) F1 value (F1 score) The F1 value represents the harmonic average of accuracy and recall.
(3) Kappa coefficient The kappa coefficient reflects the consistency between real classification and predicted classification, and is one of the commonly used indicators to evaluate the accuracy of fault diagnosis.
Among them, P 0 is the number of correctly predicted samples divided by the total number of samples.Assuming that the true samples for each class are a 1 , a 2 , …, a e , and the predicted unclassified samples are b 1 , b 2 , …, b e , respectively The range of Kappa coefficient values is [0,1], which is generally divided into five groups to represent different levels of consistency: 0-0.20 (extremely low consistency), 0.21-0.40(general consistency), 0.41-0.60(medium consistency), 0.61-0.80(high consistency), and 0.81-10 (almost identical).That is, the closer the kappa coefficient is to 1, the better the diagnostic effect.

Example analysis
According to DL/T722-2014 Analysis and Judgment Criteria for Dissolved Gases in Transformer Oil 27 , transformers are classified into six types based on whether or not the transformer has malfunctioned and the type of fault.They are represented by labels 1-6, including low energy discharge (D1), high energy discharge (D2), medium low temperature heat release (T1&T2), high temperature heat release (T2), partial discharge (PD), and normal (N) This article selects 480 sets of monitoring data provided by a power supply company in Yuhang City, Zhejiang Province as the fault sample set.Each operating state in the sample set includes 5 characteristic gases, including H 2 , CH 4 , C 2 H 6 , C 2 H 4 , and C 2 H 2 .The distribution of the dataset and sample labels are shown in Table 1.

Data preprocessing
When a transformer malfunctions, the composition and content of dissolved gases in the insulation oil will change.This article selects H 2 , CH 4 , C 2 H 6 , C 2 H 4 , and C 2 H 2 dissolved in the oil as sample inputs.Normalize the characteristic gas, as shown in formula (15): Among them, x i and x′ are features before and after normalization; Ximin and Ximax is the minimum and maximum values of each column feature in the raw data before normalization. )

Data balancing processing
In accordance with Table 1, it becomes apparent that mid-to-low temperature overheat failures constitute 30.6% of all instances, whereas partial discharge failures comprise merely 6.0%.Should the data sets employed for model formulation be imbalanced, the model may not acquire sufficient proficiency in certain sample types, leading to an increased likelihood of misclassifying these sample types during the identification stage, thereby compromising the accuracy of the model's classification.In this study, we employ SMOTE to balance the dataset.The sample distribution subsequent to SMOTE oversampling is depicted in Table 2.In preparation for subsequent feature optimization, model training, and diagnostic purposes, the sample count for each category in Table 2 has been harmonized.

Optimization of transformer fault characteristics
In the field of DGA fault diagnosis, the IEC three-ratio method, Rogers four-ratio method, and uncoded ratio method are generally used as references.However, the above methods have incomplete feature selection and insufficient data utilization, and cannot fully reflect the relationship between faults and features.Therefore, this paper uses 5 characteristic gases as the basis to construct 19-dimensional ratio characteristics, as shown in Table 3.The 19-dimensional features constructed in this article expand the feature space and make full use of data information.However, there will be information redundancy.These redundant features will increase the computational burden of the model.It is necessary to reduce the data dimensions and reduce the complexity of the model.Therefore, the LightGBM feature importance evaluation method is introduced to optimize the 19-dimensional features.The feature importance ranking results are shown in Table 4. Features sorted according to the  www.nature.com/scientificreports/importance of LightGBM features are sequentially and incrementally input into the GBDT model for diagnosis and identification.In order to avoid contingency, ten-fold cross-validation is performed on the input data sampling, and the average accuracy is taken as the final result, as shown in Fig. 2. As the number of features increases from 1 to 8 in Fig. 2, the diagnostic accuracy of the GBDT model gradually increases.When the number of features is 8, the average diagnostic accuracy reaches a maximum of 93.68%.When the fault diagnosis accuracy reaches a high point, as the number of features continues to increase, its accuracy remains unchanged or decreases.The reason is that too many features lead to an increase in the complexity of the model.Based on this, the first 8-dimensional features sorted by LightGBM are selected for model training and diagnosis.

Analysis of fault diagnosis results
The selected optimal feature subset is divided into training set, test set and verification set according to the ratio of 6:2:2.The specific distribution is shown in Table 5.
In order to ensure the accuracy and effectiveness of the model, NGO is used to optimize max_depth, learning rate, learning_rate and n_estimators.The GBDT hyperparameter optimization range is set as shown in Table 6. Figure 3 shows the confusion matrix of the transformer fault diagnosis results based on SMOTE and NGO-GBDT.The blue diagonal line in the figure represents the number of correct predictions in the real samples, and the sum of each row of data is expressed as the total number of samples.Among the 174 test samples in Fig. 3, a total of seven fault samples were misjudged.The total accuracy of transformer fault diagnosis was 95.98%.Among them, normal and medium-low temperature exothermic samples were correctly identified.Among them, the misjudgment rates for high-temperature exothermic, low-energy discharge and partial discharge samples are only 7.40%, 7.40% and 3.45%, indicating that the model proposed in the article has good stability.Based on the

Comparative analysis of different feature selection methods
To validate the efficacy of the proposed feature selection strategy, this paper employs four distinct approaches: Recursive Feature Elimination (RFE), XGBoost Feature Selection, RF Feature Optimization, and 19-dimensional feature extraction as inputs for the NGO-GBDT model.The classification results are delineated in Fig. 4 and Table 7.It is apparent from Table 7 that, following the rigorous selection of features, the diagnostic precision and Kappa coefficient undergo significant enhancement across various degrees, while the duration of operation diminishes.Among these methods, the LightGBM feature selection approach exhibits the most favorable diagnostic performance compared to the others, thereby affirming the superiority of the LightGBM feature selection method.

Comparative analysis of sample equalization effects
In order to verify the effectiveness of the diagnostic model in processing unbalanced data, random oversampling and ADASYN oversampling methods were used from the data processing level to compare the diagnostic results with the original data set.The confusion matrix is shown in Fig. 5, and the model evaluation indicators are shown in Table 8 shown.According to the diagnosis results, it can be seen that the original data set without balance processing has insufficient training for minority class samples, resulting in high misjudgment rates for the three types of minority class samples: high temperature overheating, partial discharge, and low energy discharge during identification and diagnosis.After oversampling, the model diagnosis accuracy and Kappa coefficient have been improved to varying degrees.After using SMOTE for data enhancement in this article, the diagnostic accuracy and comprehensive indicators of each type are better than other sampling methods, further validating this article.The superiority of the proposed method in handling imbalanced data.

Comparative analysis of diagnostic effects of multiple models
In order to verify that the integrated learning method proposed in this article can effectively improve the accuracy of transformer fault identification, NGO-GBDT was compared with WOA-GBDT, GBDT, RF and DT, and the model classification effect was evaluated through multiple indicators.In order to make the model more convincing, the GA-XGBoost diagnostic model proposed in Ref. 36 .and The PSO-BiLSTM diagnostic model proposed in Ref. 37 .The WOA-SVM diagnostic model proposed in 38 ensures that the input features are consistent.Table 9 shows the diagnostic results of different models.
From the perspective of a single diagnostic model, GBDT has a better classification effect than RF and DT.After optimizing the hyperparameters of the GBDT model through the optimization algorithm, the model diagnosis accuracy has been improved, indicating that the NGO optimization algorithm has strong optimization capabilities.It can effectively improve model diagnosis performance.At the same time, comparing GA-XGBoost, PSO-BiLSTM and WOA-SVM with NGO-GBDT, the diagnostic accuracy increased by 1.68%, 2.32% and 3.66% respectively.Based on the parameter analysis of recall rate, precision rate, F1 value and Kappa coefficient, it is shown that the model proposed in this article has better diagnostic effect than other models, verifying the superiority of the NGO-GBDT model.

Conclusion
In response to the issue of misjudgment of minority samples caused by imbalanced transformer fault samples, this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT based on data oversampling and ensemble learning algorithm models.The following conclusions are drawn from actual data: (1) By using the LightGBM feature selection method to select the optimal feature subset, redundant information can be avoided and the accuracy of transformer fault identification can be effectively improved.(3) Compared with other ensemble learning models, this article constructs an NGO-GBDT transformer fault diagnosis model with high diagnostic accuracy, and further verifies the superiority of the proposed method through evaluation indicators such as accuracy, recall, F1 value, etc.
In summary, the strategy proposed in this paper enables the online diagnosis of electrical transformers, augmenting the operational efficiency of transformer management; to some extent, addressing the scarcity and imbalance of fault sample occurrence during actual operation.Yet, in the selection of K near-neighbors for the synthesis of new samples, this approach possesses a certain degree of blindness, subject to interference from noisy samples, and lacks clarity regarding the boundary between samples, hindering the model's diagnostic capabilities.The text insufficiently delves into the study of dissolved gases in oil, neglecting the impact of two distinctive gases-CO and CO 2 -on transformer faults.Further research is imperative to thoroughly analyze and enhance these issues.

Figure 1 .
Figure 1.Flow chart of transformer fault diagnosis.

Figure 2 .
Figure 2. The number of feature subsets corresponds to the average diagnostic accuracy of the model.

( 2 )
This article deals with imbalanced fault samples from the data processing level, and solves the problem of low diagnostic accuracy caused by insufficient and imbalanced sample data through the SMOTE oversampling method, reducing the misdiagnosis rate of the diagnostic model.

Figure
Figure Comparison of diagnostic results of different feature optimization methods.(a) Recursive feature elimination, (b) RF feature selection, (c) XGBoost feature selection, (d) 19 dimensional joint features.

Table 1 .
min Dataset distribution and sample labels.

Table 2 .
Data distribution before and after SMOTE balancing.

Table 4 .
Feature importance ranking results.

Table 5 .
Distribution of the sample data.

Table 6 .
Hyperparameter optimization range setting.information in the confusion matrix, the precision P, recall R and F1 values of the diagnostic model are 0.9598, 0.9601 and 0.9599 respectively.The Kappa coefficient of the model is 0.9521, that is, the consistency between the model's true classification and the predicted classification is almost completely consistent, indicating that The model proposed in this article has strong fault identification and classification capabilities.

Table 7 .
The Kappa coefficient of different features was selected for comparison.

Table 8 .
Comparison of diagnostic results under different sampling methods.

Table 9 .
Model comparison analysis results.