A study of dealing class imbalance problem with machine learning methods for code smell severity detection using PCA-based feature selection technique

Detecting code smells can be highly helpful for reducing maintenance costs and raising source code quality. Code smells help developers and researchers understand several types of design flaws. Code smells with high severity can cause significant problems for the software and may challenge the system's maintainability. It is therefore essential to assess the severity of the code smells detected in software, as severity helps prioritize refactoring efforts. The class imbalance problem further increases the difficulty of code smell severity detection. In this study, four code smell severity datasets (Data class, God class, Feature envy, and Long method) are selected to detect code smell severity. To address the issue of class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) class balancing technique is applied. Each dataset's relevant features are chosen using a feature selection technique based on principal component analysis. The severity of code smells is determined using five machine learning techniques: K-nearest neighbor, Random forest, Decision tree, Multi-layer Perceptron, and Logistic Regression. This study obtained a severity accuracy score of 0.99 with the Random forest and Decision tree approaches on the Long method code smell. The models' performance is compared based on accuracy and three other performance measurements (Precision, Recall, and F-measure). Performance is also compared with and without applying SMOTE. The results obtained in the study are promising and can pave the way for further studies in this area.

The proper and efficient maintenance of software has always been a challenge for industry, researchers, and software professionals. Maintenance becomes even more challenging if the software developed is complex. Nowadays, software complexity is rising due to the increased number and size of modules, complicated requirements, and the significant presence of code smells in the developed software. These complexities are challenging to evaluate and comprehend and can go beyond developers' scope, posing obstacles in development as well as in software maintenance. However, researchers have found methods to avoid complexities in the development stage and hence ultimately ease the maintenance effort. One such method is identifying code smells and fixing them to make the software's interface simple, precise, and uncomplicated to create and maintain 1. Developers must follow the required software quality standards by using functional and nonfunctional concepts in the software improvement process 2. It has been quite evident that developers focus only on functional needs while ignoring nonfunctional needs, including maintainability, credibility, reprocessability, and accessibility 3. The lack of focus on nonfunctional requirements reduces software quality and ultimately increases the software maintenance effort and difficulties. The method of software quality assurance includes software inspection as a fundamental component 4. The quality of software is heavily influenced by the quality of the process employed during its development. The software process can be characterized, controlled, evaluated, and enhanced 5.
• RQ1 Which ML algorithm is the most effective for detecting severities of code smell?
• Motivation Fontana et al. 6 and Alazba et al. 13 applied various ML algorithms and compared their performances. Therefore, we applied five ML algorithms to investigate and observe their performance and find the best algorithm for code smell severity (CSS) detection.
• RQ2 What is the impact of the class balancing method (SMOTE) on the performance of various models on the CSS detection?
• Motivation To address the issue of class imbalance, Pandey et al. 14 applied a random sampling technique. Therefore, we used the SMOTE method to determine how the class imbalance issue affected the level of code smell detection.
• RQ3 What is the impact of feature selection technique (FST) on the performance of various models on the CSS detection?
• Motivation Dewangan et al. 7 and Mhawish et al. 15,16 investigated the effect of several FSTs on performance measures. They discovered that using FST improved performance accuracy. We therefore examine the impact of the FST on the models' accuracy in detecting code smell severities, since feature selection plays a substantial role in the CSS detection process.
To address the above three research questions, our key contributions in this study are as follows: (1) This study addresses the class imbalance problem and applies the SMOTE class balancing technique to the four CSS datasets. (2) A principal component analysis (PCA)-based FST is used to assess the effect of FST on model performance for detecting the severity of code smells. (3) We have applied five ML models: Logistic Regression (LR), Multi-layer Perceptron (MLP), Random forest (RF), Decision tree (DT), and K-nearest neighbour (KNN). (4) We have considered four performance measurements: Precision, Recall, F-measure, and Severity Accuracy Score for each severity class for the severity dataset of each code smell.
Thus, our study applied the SMOTE method to handle the class imbalance issue and the Principal Component Analysis (PCA)-based feature selection technique to improve the model accuracy and achieved a severity accuracy score of 0.99 using the Random Forest and Decision Tree algorithms in the context of detecting the Long Method code smell.
The paper is organized as follows: "Background/literature review" section discusses related works and provides a brief description of CSS detection by applying ML algorithms. "Description of the proposed model and dataset" section describes the datasets and the proposed models. The experimental results of the proposed model are described in "Experiment results" section. "Discussion and result analysis" section discusses the results and compares our outcomes with other related studies. Finally, "Conclusion" section concludes with future research directions.

Code smell severity (CSS) detection based on the machine learning algorithms
Numerous researchers have employed a variety of machine learning (ML) algorithms to detect CSS. Fontana et al. 6 explored a range of ML techniques, including regression, multinomial classification, and a binary classifier for ordinal classification. Their evaluation demonstrated a correlation between predicted and actual severity, achieving 88-96% accuracy measured by Spearman's ρ. Another study by Abdou et al. 12 utilized different ML models, including ordinal, regression, and multinomial classifiers for CSS classification. They also applied the LIME approach to interpret ML models and rules of prediction, utilizing the PART algorithm to assess feature efficiency. The highest accuracy they achieved was 92-97%, measured using Spearman's correlation.
Tiwari et al. 17 introduced a tool to identify long methods and their severity, emphasizing the significance of refactoring long methods. Their findings showed that this tool matched expert evaluations for approximately half of the approaches with a one-level tolerance. Additionally, they identified high severity evaluations that closely aligned with expert judgments.
For closed-source software bug reports with varying degrees of severity, Baarah et al. 18 investigated the use of eight ML models, including Support Vector Machine, Naive Bayes, Naive Bayes Multinomial, Decision Rules (JRip), Decision Tree (J48), Logistic Model Trees, K-Nearest Neighbor, and Random Forest. The Decision Tree (DT) model outperformed the others with 86.31% accuracy, 90% Area under the Curve (AUC), and 91% F-measure.
Gupta et al. 19 introduced a hybrid technique to assess code smell intensity in the Kotlin language and identified identical code smells in the Java language. Their work involved applying various ML models, with the JRip algorithm achieving the best outcome at 96% precision and 97% accuracy.
Hejres et al. 20 utilized three ML models (J48, SMO, and ANN) to detect CSS from four datasets. The SMO model yielded the best results for the god class and feature envy datasets, while the ANNE with the SMO model showed the highest accuracy for the long method dataset.
Hu et al. 21 reexamined the efficacy of ten classification approaches and eleven regression methods for predicting code smell severity. The evaluation of these methods is based on two key performance metrics: the Cumulative Lift Chart (CLC) and Severity@20%. Additionally, Accuracy is considered as a secondary performance indicator. The findings indicate that the Gradient Boosting Regression (GBR) technique has superior performance in relation to these criteria. Sandouka et al. 22 proposed a Python code smell dataset based on the Large Class and Long Method code smells. They utilized six ML models for Python code smell detection and measured Accuracy and MCC. They obtained the best MCC of 0.89 using the DT model.
Zakeri-Nasrabadi et al. 23 surveyed 45 pre-existing datasets to examine the factors contributing to a dataset's effectiveness in detecting smells. They found that the suitability of a dataset for this purpose is heavily influenced by various properties, including its size, severity level, project types, number of each type of smell, overall number of smells, and the proportion of smelly to non-smelly samples within the dataset. Most currently available datasets support identifying code smells such as God Class, Long Method, and Feature Envy. However, it is worth noting that there are six code smells included in Fowler and Beck's catalog that do not have corresponding datasets available for analysis. It may be inferred that the current datasets exhibit imbalanced sample distributions, a shortage of severity level support, and a limitation to the Java programming language.

Code smell severity (CSS) detection based on the ensemble and deep learning algorithms
Numerous research studies have explored the application of various ensemble learning methods for code smell detection. Alazba et al. 13 conducted experiments with fourteen ML and stacking ensemble learning methods on six code smell datasets and reported a remarkable accuracy of 99.24% on the LM dataset using the Stack-SVM algorithm.
Malathi et al. 24 introduced a deep learning approach for detecting class code smells. This approach leverages a diverse set of characteristics specifically designed for different types of code smells. Noting that such a deep learning model effectively detects instances belonging to a single class of code smell only, they proposed an advanced Deep Learning Based many Class type Code Smell detection (DLMCCMD) to automatically detect many kinds of code smells, such as huge class, misplaced class, lazy class, and data clumps. The CNN-LSTM architecture was devised to classify features that encompass both source code information and code metrics. The acquired data is consolidated to conduct positive testing of source code programs with reduced computational time.
Dewangan et al. 25 utilized four ML models (LR, RF, KNN, DT) and three ensemble models (AdaBoost, XG Boost, and Gradient Boosting) to detect CSS from four datasets. They used chi-square FST and two parameter optimization methods (Grid search and Random search). They found that the XG Boost model achieved a high accuracy rate of 99.12% when applied to the Long method code smell dataset, utilizing the Chi-square-based feature selection strategy. Nanda et al. 26 employed a hybrid approach that integrated the Synthetic Minority Over-sampling Technique (SMOTE) with the Stacking model to effectively classify datasets related to the severity of DC, GC, LM, and FE, achieving performance improvement from 76 to 92%.
Pushpalatha et al. 27 proposed a method for predicting bug report severity in closed-source datasets, utilizing the NASA project dataset (PITS) from the PROMISE Repository. To enhance accuracy, they employed ensemble learning methods and two dimensionality reduction techniques, information gain and chi-square.
Zhang et al. 28 introduced MARS, a brain-inspired method for code smell detection that relies on the Metric-Attention method. They applied various ML and deep learning models and found that MARS outperformed conventional techniques in terms of accuracy.
Liu et al. 29 presented a severity prediction approach for bug reports based on FSTs and established a ranking-based policy to enhance existing FSTs and create an ensemble learning FST by combining them. Among the eight FST methods applied, the ranking-based approach achieved the highest F1 score of 54.76%.
Abdou et al. 30 suggested using ensemble learning techniques to detect software defects. They explored three ensemble approaches: Bagging, Boosting, and Rotation Forest, which combine re-sampling techniques. The experiments conducted on seven datasets from the PROMISE repository showed that the ensemble method outperforms single learning methods, with the rotation forest using the re-sampling approach achieving a maximum accuracy of 93.40% for the KC1 dataset.
Dewangan et al. 11 employed ensemble and deep learning methods to discover code smells. They achieved a remarkable 100% accuracy for the LM dataset by applying all ensemble methods and using Chi-square FST and SMOTE class balancing methods.

Code smell severity (CSS) detection dealing with class imbalance problem
Zhang et al. 31 proposed the DeleSmell method to identify code smells using a deep learning model. They constructed the dataset by collecting data from 24 real-world projects. To address the imbalance in the dataset, they designed a refactoring technique that automatically transforms normal source code into smelly code, generating positive samples from actual cases. They employed the SVM method and found that DeleSmell enhances the efficiency of brain class code smell detection by up to 4.41% compared to conventional techniques. Pecorelli et al. 32 implemented five imbalance techniques (Class Balancer, SMOTE, Resample, Cost-Sensitive Classifier, and One Class Classifier) to identify their impact on the ML-based detection of five code smells. They found that ML models relying on SMOTE obtained the best performance. A random sampling approach was applied by Pandey et al. 14 to address the problem of class imbalance. With the random sampling technique, they obtained better results.
The related work shows that various authors have used learning-based techniques (machine learning, ensemble learning, and deep learning). "Code smell severity (CSS) detection based on the machine learning algorithms", "Code smell severity (CSS) detection based on the ensemble and deep learning algorithms", and "Code smell severity (CSS) detection dealing with class imbalance problem" sections discussed the related studies that worked on the CSS datasets. The above literature has some limitations: only some studies have addressed the class imbalance problem in the datasets, and they have yet to address the datasets' class-wise accuracy. Also, only some studies have used a feature selection technique and examined its effect on performance accuracy.

Description of the proposed model and dataset
We followed the steps depicted in Fig. 1 to detect the severity. The four CSS datasets were obtained from Fontana et al. 6. The min-max preprocessing technique was used to ensure data comparability, normalizing data values to fall within the range of 0-1. A SMOTE class balancing algorithm was then applied to handle the class imbalance issue. Next, a PCA-based FST was used to select the most relevant features from each dataset. Subsequently, each dataset was split into two parts, an 80% training set for model training and a separate 20% test set for model evaluation, using fivefold cross-validation. Finally, machine learning algorithms were applied, and performance evaluations were conducted.
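As an illustration of the splitting step, the following Python sketch shows how fivefold cross-validation yields an 80% training part and a 20% test part in every fold. The function name `k_fold_indices` and the round-robin assignment of instances to folds are assumptions made for illustration; the study does not state how the folds were formed.

```python
def k_fold_indices(n_instances, k=5):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Assumption: instance i is assigned to fold i % k (round-robin).
    folds = [list(range(start, n_instances, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 420 is the per-dataset instance count reported before class balancing.
splits = list(k_fold_indices(420, k=5))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

With 420 instances, each fold trains on 336 instances and tests on 84. A stratified assignment, which keeps the severity proportions equal across folds, would be a natural refinement but is not shown here.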

Description of the dataset
The four datasets from Fontana et al. 6 considered here comprise two class-level datasets (DC, GC) and two method-level datasets (FE, LM). The datasets are available at http://essere.disco.unimib.it/reverse/MLCSD.html. Out of 111 systems, Fontana et al. 6 selected 76, which vary in size and for which a large number of object-oriented features were computed. For the system selection, they considered the Qualitas Corpus compiled by Tempero et al. 33. The tools and rules used for determining the intensity of code smells included iPlasma (Brain Class, GC), Anti-pattern Scanner 34, PMD 35, iPlasma, Fluid Tool 36, and Marinescu detection rules 37. Table 1 displays the automatic detection tools.

Code smells severity classification
After manually assessing each instance of a code smell, a severity score is assigned.
• 1: A class or method that is unaffected receives a score of 1 ("no smell");
• 2: A class or method that is only marginally affected receives a score of 2 (non-severe smell);
• 3: A class or method receives a score of 3 if it possesses all of the qualities of a smell;
The datasets are defined below:
• DC It refers to classes that hold fundamental data with essential functionality and are extensively utilized by other classes. A DC typically exposes numerous features through simple accessor methods, presenting a straightforward and uncomplicated design 6.
• GC It refers to classes that centralize the system's intelligence, often being considered one of the most complex code smells. GCs tend to accumulate numerous responsibilities, actions, and tasks, leading to issues related to code size, coupling, and complexity 6.
• FE It pertains to methods that heavily rely on data from classes other than their own. It shows a preference for utilizing features exposed through accessor methods in other classes 6.
• LM It describes methods that concentrate a class's functionality, frequently leading to long and complicated code. Because they rely so heavily on information from other classes, LMs are difficult to understand 6.

Dataset structure
Each dataset contains 420 instances (classes or methods). Specifically, 63 instances are selected for the DC and GC datasets, while 84 instances are chosen for the FE and LM datasets. The dataset configuration is shown in Table 2 [...] least number of occurrences in the datasets. Additionally, the class-based smells (DC and GC) exhibit a different balance of severity levels 1 and 4 compared to the method-based smells (FE and LM) 6.

Preprocessing technique
The datasets encompass a diverse set of features; consequently, it is preferable to normalize the features before applying the ML techniques. In this study, the Min-Max preprocessing method is used to rescale each feature's values to the range 0 to 1 38. The min-max formula in Eq. (1) calculates the normalized value X' from the original value X:
$$X' = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{1}$$
The feature's minimum value (Xmin) is mapped to 0 and its maximum value (Xmax) is mapped to 1. All other values are scaled proportionally to decimals within the range 0-1.
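A minimal Python sketch of min-max scaling for a single feature column is given below. The function name `min_max_scale`, the example values, and the choice to map a constant column to 0.0 are illustrative assumptions rather than details taken from the study.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical metric values for four methods (e.g. lines of code).
loc = [10, 20, 30, 50]
scaled = min_max_scale(loc)
```

The smallest value maps to 0, the largest to 1, and intermediate values keep their relative spacing.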

Class balancing technique
From Table 2, we observed that the datasets (Fontana et al. 6) have four severity levels. The distribution of the severity levels differs between datasets, and the class distribution within each dataset is not balanced. In this research, the classes of each dataset were balanced using the SMOTE class balancing approach. SMOTE is a well-known oversampling method that was developed to improve on random oversampling 39.
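SMOTE creates synthetic minority-class instances by interpolating between a minority instance and one of its nearest minority-class neighbours. The sketch below illustrates only this interpolation idea in plain Python. The function name `smote_sample`, its parameters, and the toy data are assumptions for illustration, and it is not the implementation used in the study.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical normalized feature vectors of a rare severity class.
minority = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25]]
new_points = smote_sample(minority)
```

Each synthetic point lies on a line segment between two existing minority instances, so the new data stays inside the region the minority class already occupies.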

Feature selection technique
Feature selection aims to identify the most relevant features in a dataset, which can enhance model performance by clarifying which features contribute most to distinguishing the classes 40. In this study, we utilized the PCA (Principal Component Analysis) feature selection technique to extract the most informative features from each dataset. PCA is a dimensionality-reduction method commonly employed to reduce the number of variables in large datasets, creating a smaller set that preserves most of the data's variability 41. The selected best features from each dataset and their impact on performance accuracy are discussed in "Effect of PCA feature selection technique on the model's severity accuracy score" section.
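To make the idea of preserving variability concrete, the following Python sketch finds the first principal component of a small dataset by power iteration on its covariance matrix and reports the share of total variance that component explains. The function name, the iteration count, and the toy data are assumptions for illustration. The study itself retains eight to ten components per dataset and does not state which PCA implementation was used.

```python
import math


def first_principal_component(rows, iters=200):
    """Return (unit direction, explained-variance ratio) of the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Power iteration: repeatedly apply the covariance matrix and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical two-metric data in which the second metric is roughly twice the first.
rows = [[1, 2], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.9]]
direction, ratio = first_principal_component(rows)
```

Here a single component captures nearly all of the variance because the two metrics are almost perfectly correlated. Note that each principal component is a linear combination of the original features rather than a subset of them.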

Machine learning models
Machine learning is a computational approach that encompasses a range of methodologies employed by computers to make predictions, enhance predictive accuracy, and forecast behavior patterns using datasets 42. In this study, we have applied five ML models to detect the CSS from CSS datasets. The five ML models are Logistic regression, Multi-layer perceptron, Random forest, Decision tree, and K-nearest neighbor. The five ML models are described in the following subsections:

Logistic regression (LR)
To analyze and categorize binary and proportional response data sets, researchers frequently use the LR method, one of the most significant statistical and data mining approaches. One of its key features is that LR can be extended to multi-class classification problems and automatically produces class probabilities.

Multi-layer perceptron (MLP)
This classifier is made up of layers of units. In the fully connected network considered here, every node in one layer is connected to every node in the layer below it. A minimum of three layers, including an input layer, one or more hidden layers, and an output layer, make up each MLP.
The input layer distributes the inputs to the following layers. Input nodes lack thresholds and have linear activation functions. Each hidden unit node and each output node has a threshold in addition to its weights. The output nodes have linear activation functions, while the hidden unit nodes have nonlinear activation functions.

Random forest (RF)
In the proposed model, we employed Random Forest (RF) as one of the machine learning classifiers. In RF, each tree depends on the values of a random vector sampled independently and with the same distribution for all the trees in the forest. With an increasing number of trees in the forest, the generalization error asymptotically converges to a limit. The overall generalization error of the forest of tree classifiers is determined by the quality of each individual tree and the relationships between them.

Decision tree (DT)
In a DT model, each internal branch is connected to a decision, and the leaf node is often connected to a result or class label. Each internal node tests one or more attribute values that result in two or more links or branches. Each connection has a potential decision value attached to it. These connections are distinct and comprehensive 7.

K-nearest neighbor (KNN)
The KNN method is a supervised ML technique that can be used for both classification and regression problems, although most of its applications in industry are classification problems. The KNN model uses "feature similarity" to predict the value of a new data point, which means that the prediction depends on how closely the new data point resembles the points in the training set 7.
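The feature-similarity idea can be sketched in a few lines of Python: the severity of a new instance is predicted as the majority label among its k closest training instances. The function name `knn_predict`, the use of Euclidean distance, and the toy data are illustrative assumptions, since the study does not report the distance measure or the value of k.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    # Squared Euclidean distance preserves the neighbour ordering.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalized metric vectors labelled with severity 1 or 4.
train_x = [[0.1, 0.1], [0.2, 0.1], [0.15, 0.2], [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
train_y = [1, 1, 1, 4, 4, 4]
low = knn_predict(train_x, train_y, [0.2, 0.2])
high = knn_predict(train_x, train_y, [0.8, 0.8])
```

Because the prediction depends directly on distances, KNN benefits from the min-max normalization described earlier, which keeps any single metric from dominating the distance.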

Performance evaluations
We employed four performance measures (Precision, Recall, F-measure, and Severity Accuracy Score) to determine the performance of the five machine learning models. These measures are described briefly in the following subsections. Four terms are considered while calculating them: True positive (TP), False positive (FP), True negative (TN), and False negative (FN). These four terms are obtained from the confusion matrix (CM), which contains the actual and predicted values recognized by the CSS models. Figure 2 shows the confusion matrix.

Precision (P)
Precision (P) is the proportion of instances predicted to have a given severity that actually have that severity 43. It is calculated with Eq. (2), which divides the number of true positives (TP) by the sum of TP and false positives (FP):
$$P = \frac{TP}{TP + FP} \tag{2}$$

Recall (R)
Recall (R) is the proportion of instances that actually have a given severity that the ML model identifies correctly 43. It is calculated with Eq. (3), which divides the number of true positives (TP) by the sum of TP and false negatives (FN):
$$R = \frac{TP}{TP + FN} \tag{3}$$

F-measure (F)
F-measure (F) is the harmonic mean of precision and recall and balances the two values 43. Its value lies between 0 and 1, where 0 is the poorest performance and 1 is the best performance. It is calculated with Eq. (4):
$$F = \frac{2 \times P \times R}{P + R} \tag{4}$$

Severity accuracy score (SAS)
Severity accuracy score (SAS) is the proportion of correctly classified instances across the positive and negative classes 43. It is calculated with Eq. (5), which divides the sum of TP and TN by the sum of TP, TN, FP, and FN:
$$SAS = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
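Because each dataset has four severity classes, the four terms must be derived per class. The Python sketch below does this in a one-vs-rest manner from a multi-class confusion matrix and then applies the four measures. The function name, the convention that rows are actual classes and columns are predicted classes, and the example matrix are assumptions for illustration and are not results from the study.

```python
def per_class_metrics(cm, k):
    """Return (precision, recall, f_measure, accuracy) for class index k.

    Assumption: cm[i][j] counts instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp   # another class, predicted as k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / total
    return precision, recall, f_measure, accuracy


# Hypothetical confusion matrix for severity levels 1 to 4 (10 instances each).
cm = [
    [8, 2, 0, 0],
    [1, 7, 2, 0],
    [0, 1, 9, 0],
    [0, 0, 1, 9],
]
p, r, f, acc = per_class_metrics(cm, 0)
```

For severity level 1 in this example, TP = 8, FN = 2, FP = 1, and TN = 29. Averaging the per-class values over the four levels gives the kind of average reported alongside the per-severity results.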

Experiment results
To address RQ1, five ML models are used. The GC, DC, FE, and LM code smell severity datasets are chosen. In this study, each dataset has four categories of severity (severity 1, severity 2, severity 3, and severity 4). We show individual outcomes for each dataset's severity levels. In addition, the average outcome over all severity classes is presented. The following "Outcomes for data class" to "Outcomes for long method" sections display, in tabular form, the experimental outcomes of the five ML models (LR, MLP, RF, DT, and KNN) with fivefold cross-validation for the four datasets.

Outcomes for data class
This subsection presents the effect of applying the five ML models to the DC dataset. Table 3 shows the severity detection outcomes with four measurements (Precision, Recall, F-measure, and Severity Accuracy Score) for the DC dataset (for each level of severity, with the average of all levels of severity) applying five ML models.
Figure 3 shows the accuracy comparison of the data class dataset for all the classifiers. For the DC dataset, the DT model achieved the highest severity accuracy score (averaged over all the severity classes) of 0.83, with a precision of 0.84, recall of 0.83, and F-measure of 0.84, while the MLP model achieved the lowest severity accuracy score of 0.40.

Outcomes for god class
This subsection presents the effect of applying the five ML models to the GC dataset. Table 4 shows the severity detection outcomes with four measurements for the GC dataset (for each level of severity, with the average of all levels of severity) applying five ML models. Figure 4 shows the accuracy comparison of the god class dataset for all the classifiers. For the GC dataset, the RF model achieved (averaged over all the severity classes) the highest Severity Accuracy Score, precision, recall, and F-measure of 0.85, while the MLP model achieved the lowest Severity Accuracy Score of 0.43.

Outcomes for feature envy
This subsection presents the effect of applying the five ML models to the feature envy dataset. Table 5 shows the severity detection outcomes with four measurements for the feature envy dataset (for each level of severity, with the average of all levels of severity) applying five ML models. Figure 5 shows the accuracy comparison of the feature envy dataset for all the classifiers. For the feature envy dataset, the MLP and RF models achieved (averaged over all the severity classes) the highest Severity Accuracy Score, precision, and recall of 0.96, with an F-measure of 0.96 for the RF model and 0.95 for the MLP model. The lowest Severity Accuracy Score, 0.90, was obtained by the LR model.

Outcomes for long method
This subsection presents the effect of applying the five ML models to the LM dataset. Table 6 shows the severity detection outcomes with four measurements for the LM dataset (for each level of severity, with the average of all levels of severity) applying five ML models. Figure 6 shows the accuracy comparison of the long method dataset for all the classifiers. For the LM dataset, we observed that both the RF and DT models detected (with an

The impact of SMOTE's class-balancing method on predictive performance
RQ2 was addressed using the SMOTE class balancing method. This experiment was done to observe SMOTE's impact on balancing the classes of the four code smell severity datasets.

Effect of PCA feature selection technique on the model's severity accuracy score
In this study, we have applied the PCA-based FST to select the best features from the severity datasets. PCA retains eight components for the DC dataset, nine components for the GC dataset, nine components for the FE dataset, and ten components for the LM dataset. Table 8 shows the best-selected features from each dataset using PCA. All selected feature descriptions are provided in Table 12 in the appendix.

Discussion and result analysis
In this study, three research questions are presented in "Introduction" section. To address RQ1, we applied five ML algorithms (LR, MLP, RF, DT, and KNN) to the four CSS datasets, and their results are discussed in "Outcomes for data class" to "Outcomes for long method" sections. The results answer RQ1: the RF model achieved the highest Severity Accuracy Score on the GC, FE, and LM datasets, and the DT model achieved the highest Severity Accuracy Score on the DC and LM datasets.
To address RQ2, the SMOTE class balancing method is applied to the four CSS datasets, as discussed in "The impact of SMOTE's class-balancing method on predictive performance" section. All datasets have four severity classes (severity1, severity2, severity3, and severity4), and the classes were highly imbalanced. The dataset configuration with severity class is shown in Table 2. Table 7 presents the results of applying the SMOTE technique to the CSS datasets with five ML models. The results confirm that most of the models achieved a better Severity Accuracy Score for all the datasets when the SMOTE class balancing method was applied.
To address RQ3, we have applied the PCA technique to the four CSS datasets, as discussed in "Effect of PCA feature selection technique on the model's severity accuracy score" section. Table 8 shows the important features selected from each dataset, and Table 9 shows the result comparison between with and without the applied PCA

Evaluation of our results with relevant research studies
This section presents a comparative summary of the proposed approach's results and those of other relevant research studies.
To the best of our knowledge and available literature on CSS detection, only three groups of authors (Fontana et al. 6; Abdou et al. 12; Dewangan et al. 25) have studied the severity dataset. They applied different methodologies, which are shown in Table 10. Table 10 compares our outcomes with Fontana et al. 6, Abdou et al. 12, and Dewangan et al. 25. Fontana et al. 6 applied eighteen ML models and implemented binary classification, multinomial classification, and regression techniques with a linear correlation filter method. Abdou et al. 12 applied forty binary and multinomial classification techniques with a ranking correlation algorithm. Dewangan et al. 25 applied seven ML and ensemble methods. Our approach applied five ML models (LR, MLP, RF, DT, and KNN) with PCA-based feature selection and SMOTE class balancing techniques. The comparison for each dataset is shown in the following points:
(1) For the DC dataset, in our approach, the DT model achieved the highest Severity Accuracy Score of 0.83, while Fontana et al. 6 achieved a Severity Accuracy Score of 0.77 applying the O-RF method and Abdou et al. 12 achieved a Severity Accuracy Score of 0.93 applying the O-R-SMO method. Dewangan et al. 25 achieved a Severity Accuracy Score of 0.88 using the gradient boosting model. Therefore, the Abdou et al. 12 approach performs best on this dataset.
(2) For the GC dataset, in our approach, the RF model achieved the highest Severity Accuracy Score, 0.85, while Fontana et al. 6 achieved a Severity Accuracy Score of 0.74 with the O-DT approach and Abdou et al. 12 achieved a Severity Accuracy Score of 0.92 with the R-B-RF approach. Dewangan et al. 25 achieved a Severity Accuracy Score of 0.86 using the DT model. Therefore, the Abdou et al. 12 approach performs best on this dataset.
(3) For the FE dataset, in the proposed approach, the MLP and RF models achieved the highest Severity Accuracy Score, 0.96, while Fontana et al. 6 achieved a Severity Accuracy Score of 0.93 applying the J48-Pruned method and Abdou et al. 12 achieved a Severity Accuracy Score of 0.97 applying the R-B-JRIP and O-R-SMO methods. Dewangan et al. 25 achieved a Severity Accuracy Score of 0.96 using the DT model. Therefore, the Abdou et al. 12 approach performs best on this dataset.
(4) For the LM dataset, in the proposed approach, the RF and DT models achieved the highest Severity Accuracy Score of 0.99, while Fontana et al. 6 achieved a Severity Accuracy Score of 0.92 applying the B-Random Forest algorithm and Abdou et al. 12 achieved a Severity Accuracy Score of 0.97 applying the R-B-JRIP, O-B-RF, and O-R-JRip algorithms. Dewangan et al. 25 achieved a Severity Accuracy Score of 0.99 using the XG boosting model. So, the Dewangan et al. 25 approach and our proposed approach are best for the LM dataset.

Comparing machine learning models statistically
From Tables 7 and 9, it is observed that different models often produce similar results on the same dataset. When two models produce similar results, the better of the two must still be chosen. To select the best of the five ML models, we applied a paired t-test to determine whether there was a statistically significant difference between two ML models, allowing us to use only the best one. The paired t-test requires N distinct test sets on which each classifier is evaluated; we obtained them through tenfold cross-validation (N = 10). The mean accuracy and standard deviation of each ML model on each dataset were computed in this study.
• Mean accuracy: For a given dataset, a model with a greater mean accuracy performs better than one with a lower mean accuracy. We used tenfold cross-validation and a significance level of 0.05 for the statistical analysis. Table 11 shows the mean accuracy and standard deviation of each classification model on each code smell dataset. For the DC dataset, the LR model had a mean accuracy of 0.99 with a standard deviation of 0.01. For the GC and LM datasets, the LR model achieved the highest mean accuracy of 1.00 with a standard deviation of 0.00. For the FE dataset, the LR model achieved the highest mean accuracy of 0.97 with a standard deviation of 0.02. As a result, the LR model is determined to be the best model for severity detection on the four code smell datasets because it has a high mean accuracy and a low standard deviation across all datasets.
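The paired comparison described above can be sketched as follows. This is a minimal illustration, not the authors' script: the per-fold accuracies are hypothetical, the function names are invented for the example, and the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used. The critical value 2.262 is the standard two-sided value of Student's t for a significance level of 0.05 and 9 degrees of freedom (ten folds minus one).

```python
import math
import statistics

# Two-sided critical value of Student's t for alpha = 0.05 and
# 9 degrees of freedom (10 folds - 1), from standard t tables.
T_CRITICAL_DF9 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two models scored on the same folds."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("identical fold differences; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


def mean_and_std(scores):
    """Mean accuracy and sample standard deviation over the folds."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical tenfold accuracies for two models (not taken from Table 11).
model_a = [0.98, 0.99, 1.00, 0.99, 0.98, 1.00, 0.99, 0.99, 1.00, 0.98]
model_b = [0.95, 0.97, 0.96, 0.98, 0.94, 0.97, 0.96, 0.95, 0.98, 0.96]

t_stat = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > T_CRITICAL_DF9
print(mean_and_std(model_a), mean_and_std(model_b), round(t_stat, 2), significant)
```

When the absolute t statistic exceeds the critical value, the difference between the two models is treated as statistically significant at the 0.05 level, and the model with the higher mean accuracy is preferred.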

Conclusion
Class imbalance is one of the primary challenges in CSS datasets. We considered four CSS datasets: GC, DC, LM, and FE. Five ML models were applied to these four datasets. The SMOTE method was applied to address the class imbalance problem, and we also compared performance without SMOTE. We likewise applied the PCA-based FST and compared performance without PCA. The conclusions obtained from the study are presented below. Ensemble learning has good scope for application to the CSS datasets. Deep learning-based models are not yet practical because of the small number of instances in each dataset; however, data augmentation may increase the size of the training set so that deep learning-based models can be applied effectively. Deep learning methods and other FSTs can be explored in future studies. https://doi.org/10.1038/s41598-023-43380-8

Figure 3. Accuracy comparison of the data class dataset for all classifiers.

Figure 4. Accuracy comparison of the god class dataset for all classifiers.

Figure 5. Accuracy comparison of the feature envy dataset for all classifiers.

Figure 6. Accuracy comparison of the long method dataset for all classifiers.

(1) For the Data class dataset, the highest Severity Accuracy Score of 0.83 was achieved by the DT model using eight features selected by the PCA feature selection technique.
(2) For the God class dataset, the highest Severity Accuracy Score of 0.85 was achieved by the RF model using nine features selected by the PCA feature selection technique.
(3) For the Feature envy dataset, the highest Severity Accuracy Score of 0.96 was achieved by the MLP and RF models using nine features selected by the PCA feature selection technique.
(4) For the Long method dataset, the highest Severity Accuracy Score of 0.99 was achieved by the RF and DT models using ten features selected by the PCA feature selection technique.

Table 2 includes the distribution of instances across severity levels. It is observed that severity level 2 has the

Table 3. Outcomes for data class dataset. Significant values are in bold.

$$\text{Severity Accuracy Score (SAS)} = \frac{TP + TN}{TP + TN + FP + FN}$$
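The Severity Accuracy Score formula can be computed directly from confusion counts, as in the short sketch below. The function names and the counts are hypothetical and are not taken from the paper's tables. The precision, recall, and F-measure shown alongside use their standard textbook definitions, which is an assumption about how the study computed them.

```python
def severity_accuracy_score(tp, tn, fp, fn):
    """SAS = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    return (tp + tn) / total


def precision_recall_f_measure(tp, fp, fn):
    """Standard definitions; returns (precision, recall, F-measure)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical confusion counts for one severity level (not from the paper).
tp, tn, fp, fn = 40, 45, 5, 10
sas = severity_accuracy_score(tp, tn, fp, fn)
precision, recall, f_measure = precision_recall_f_measure(tp, fp, fn)
print(f"SAS={sas:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F={f_measure:.2f}")
```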

Table 4. Outcomes for god class dataset. Significant values are in bold.

Table 7 shows how the Severity Accuracy Score of each model is affected by SMOTE on the four CSS datasets. According to the comparison, the SMOTE class balancing technique helps almost all ML models improve their Severity Accuracy Score on all datasets, although it affects each model and each dataset in slightly different ways. We observed the following points for each dataset:

Table 5. Outcomes for feature envy dataset. Significant values are in bold.

(1) For the DC dataset, the MLP, RF, DT, and KNN models achieved higher Severity Accuracy Scores with SMOTE, while the LR model achieved a better Severity Accuracy Score without SMOTE. The DT model achieved the highest Severity Accuracy Score of 0.83 using the SMOTE balancing technique.
(2) For the GC dataset, the RF, DT, and KNN models achieved higher Severity Accuracy Scores with SMOTE, while the LR model achieved a better Severity Accuracy Score without SMOTE, and the MLP model produced the same result with and without SMOTE. The RF model achieved the highest Severity Accuracy Score of 0.85 using the SMOTE balancing technique.
(3) For the FE dataset, all five models achieved higher Severity Accuracy Scores with SMOTE. The highest Severity Accuracy Score of 0.96 was achieved by the MLP and RF models using the SMOTE balancing technique.
(4) For the LM dataset, all five models achieved higher Severity Accuracy Scores when we applied the SMOTE technique. The RF and DT models obtained the highest Severity Accuracy Score of 0.99 using the SMOTE balancing technique.
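The core idea of SMOTE, which creates synthetic minority-class instances by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in plain Python and not the implementation used in the study, which the text does not name. The function name `smote_oversample`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (two metrics per instance).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]]
synthetic = smote_oversample(minority, n_new=6, k=3)
print(len(synthetic), synthetic[0])
```

Because every synthetic point lies on a segment between two real minority instances, oversampling adds instances inside the region already occupied by the minority class rather than duplicating existing rows. Oversampling is normally applied to the training folds only, so that synthetic instances do not leak into the test data.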

Table 6. Outcomes for long method dataset. Significant values are in bold.

Table 7. Result comparison with and without SMOTE. Significant values are in bold.

Table 8. PCA-selected features from each dataset.

Table 9 compares the results with and without the PCA-based FST on each dataset for the five ML algorithms. We observed the following points for each dataset:
(1) For the DC dataset, the RF, DT, and KNN models achieved higher Severity Accuracy Scores when the PCA feature selection technique was applied, while the LR and MLP models achieved better Severity Accuracy Scores without PCA. The highest Severity Accuracy Score of 0.83 was achieved by the DT model using the PCA feature selection technique.
(2) For the GC dataset, the LR, RF, and DT models achieved higher Severity Accuracy Scores when the PCA feature selection technique was applied, while the MLP and KNN models achieved better Severity Accuracy Scores without PCA. The highest Severity Accuracy Score of 0.85 was achieved by the RF model using the PCA feature selection technique.
(3) For the FE dataset, the MLP and RF models achieved higher Severity Accuracy Scores when the PCA feature selection technique was applied, while the LR and KNN models achieved better Severity Accuracy Scores without PCA. The DT model achieved the same result with and without PCA. The highest Severity Accuracy Score of 0.96 was achieved by the MLP and RF models using the PCA feature selection technique.
(4) For the LM dataset, the LR, MLP, RF, and DT models achieved higher Severity Accuracy Scores when the PCA feature selection technique was applied, while the KNN model achieved a better Severity Accuracy Score without PCA. The highest Severity Accuracy Score of 0.99 was achieved by the RF and DT models using the PCA feature selection technique.
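One way PCA can drive feature selection is sketched below. The text does not specify how features are chosen from the principal components, so this example assumes a common heuristic: standardize the features, find the first principal component of the correlation matrix, and keep the original features with the largest absolute loadings. The data, the metric names, and the function names are hypothetical, and power iteration stands in for a full eigendecomposition to keep the sketch dependency-free.

```python
import math
import statistics


def standardize(rows):
    """Scale each column to zero mean and unit sample variance."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    if any(s == 0 for s in stds):
        raise ValueError("constant feature cannot be standardized")
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def correlation_matrix(z_rows):
    """Covariance of standardized data, i.e. the correlation matrix."""
    n, d = len(z_rows), len(z_rows[0])
    return [[sum(r[a] * r[b] for r in z_rows) / (n - 1) for b in range(d)]
            for a in range(d)]


def first_principal_component(matrix, iterations=500):
    """Dominant eigenvector via power iteration (assumes the start vector
    is not orthogonal to it, which holds for the example data)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("matrix maps the vector to zero")
        v = [x / norm for x in w]
    return v


def rank_features_by_loading(rows, names, top_k):
    """Keep the top_k features with the largest absolute loadings."""
    loadings = first_principal_component(correlation_matrix(standardize(rows)))
    ranked = sorted(zip(names, loadings), key=lambda t: abs(t[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical data: metric_a and metric_b vary together, metric_c does not.
rows = [[1, 2.1, 3], [2, 3.9, 1], [3, 6.2, 2],
        [4, 7.8, 2], [5, 10.1, 1], [6, 11.9, 3]]
names = ["metric_a", "metric_b", "metric_c"]
selected = rank_features_by_loading(rows, names, top_k=2)
print(selected)
```

In this toy data the first component is dominated by the two correlated metrics, so they are retained and the uncorrelated metric is dropped. Other selection rules, such as weighting loadings across several components by explained variance, are equally plausible.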

Table 9. Result comparison with and without the PCA-based FST. Significant values are in bold. Columns: model name; Severity Accuracy Score with PCA; Severity Accuracy Score without PCA.
After comparison, we observed that PCA improves the Severity Accuracy Score of most, but not all, ML models across the datasets.

Table 10. Evaluation of our findings with relevant research.

• Standard deviation: A high standard deviation indicates that the values are spread out over a wide range, whereas a low standard deviation indicates that most values are close to the mean. As a result, the model with the lowest standard deviation is the preferred choice.