Research on expansion and classification of imbalanced data based on SMOTE algorithm

With the development of artificial intelligence, big data classification technology provides valuable support for research on computer-aided medical diagnosis. However, because samples are collected under different conditions, medical big data is often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can generate sample points randomly to improve the imbalance ratio, but its application suffers from marginalization of the generated samples and blindness in parameter selection. Focusing on this problem, an improved SMOTE algorithm based on the Normal distribution is proposed in this paper, so that the new sample points are distributed closer to the center of the minority class with a higher probability, avoiding marginalization of the expanded data. Experiments show that the classification effect is better when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin datasets. In addition, the parameter selection of the proposed algorithm is analyzed, and it is found that the classification effect is best when appropriate parameters are selected so that the distribution characteristics of the original data are maintained in our designed experiments.

Imbalanced data typically refers to classification problems in which the classes are not represented equally, in both binary and multi-class settings 1 . In a multi-class problem, a category with more data samples is called a majority category, while a category with fewer data samples is called a minority category. In binary classification problems, the category with more data samples is called the negative class, and the category with fewer data samples is called the positive class 2 . In recent years, the classification of imbalanced data sets has attracted wide attention. Highly skewed class distributions tend to bias the results of machine learning or data mining algorithms, because the performance index used by machine learners [3][4][5][6][7][8] is usually the overall accuracy 9 . For example, suppose a disease classification task contains 90 normal samples and only 10 diseased samples. Even if all diseased samples are misclassified, the accuracy of the model is still 90%, but the sensitivity and specificity are both 0. In a practical sense, then, the characteristics of the data cannot be accurately learned by the model, and the samples cannot be accurately classified. For patients, such misdiagnosis has a great impact and can have serious consequences.
Nowadays, the classification of imbalanced data sets has become a hot issue in data mining 10 , and has been thoroughly studied at both the data level and the algorithm level.
At the algorithm level, classification performance is improved through the structural design of the algorithm itself. Galar proposed a new ensemble algorithm (EUSBoost) based on RUSBoost, which combines random under-sampling with a boosting algorithm and effectively avoids overfitting 11 . In Datta's paper, a Near-Bayesian Support Vector Machine (NBSVM) is developed based on the ideas of decision boundary shift and unequal regularization costs 12 . Qian proposed a resampling ensemble algorithm for classification problems on imbalanced datasets; in this method, the majority classes are under-sampled and the minority classes are over-sampled 13 . Chen proposed a Long Short-Term Memory-based Property and Quantity Dependent Optimization (LSTM.PQDO) method, which dynamically optimizes the resampling proportion and overcomes the difficulties of imbalanced datasets 14 . Hou proposed a time-varying optimization module to optimize the results of special periods and effectively eliminate imbalances 15 .
The main idea at the data level is to construct minority samples to increase the imbalance ratio 12 (the ratio of the number of minority samples to the number of majority samples). Chawla proposed the SMOTE (Synthetic Minority Over-sampling Technique) algorithm 16 . Blagus investigated the properties of SMOTE from theoretical and empirical points of view, using simulated and real high-dimensional data 17 . To address the noise generated by the SMOTE algorithm, Mi incorporated the classification performance of support vector machines and proposed an imbalanced data classification method based on active learning SMOTE 18 . Seo used machine learning algorithms to find effective SMOTE ratios for rare categories (such as U2R, R2L, and Probe) 19 . A novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), has been proposed for Imbalanced Data Learning (IDL) problems 20 . Based on ensemble learning, Yang proposed a novel intelligent classification model based on SMOTE and ensemble learning to classify railway signal equipment faults 21 . Douzas presented G-SMOTE, a new over-sampling algorithm that extends the SMOTE data generation mechanism by generating synthetic samples within a safe region around each selected minority instance 22 . Ma proposed the CURE-SMOTE (Combination of Clustering Using Representatives and Synthetic Minority Over-sampling Technique) algorithm 23 ; experiments on imbalanced UCI data show that the original SMOTE is effectively enhanced by combining it with clustering using representatives. In Prusty's paper, SMOTE is modified to Weighted-SMOTE (WSMOTE), where the over-sampling of each minority data sample is carried out based on the weight assigned to it 24 . Xwl proposed the LR-SMOTE algorithm.
This algorithm makes the newly generated samples close to the sample center, avoiding the generation of outlier samples or changes to the distribution of the data set 25 . Fernandez reflected on the SMOTE journey, discussed the current state of affairs with SMOTE and its applications, and identified the next set of challenges in extending SMOTE to big data problems 26 . Majzoub proposed Hybrid Clustering Affinitive Borderline SMOTE (HCAB-SMOTE), which minimizes the number of generated instances while improving classification accuracy 27 . Chen introduced relative density to measure the local density of each minority sample, and adaptively divided non-noise minority samples into boundary samples and safe samples according to the distinguishing characteristics of relative density, which effectively enhances the separability of the boundary 28 .
The SMOTE algorithm can improve the classification of imbalanced data by randomly generating new minority sample points, increasing the imbalance ratio to a certain extent. However, the SMOTE algorithm has two shortcomings. On the one hand, it generates minority sample points by random linear interpolation between minority sample points and their neighbors, so the edge points of the minority class may produce a marginalized distribution. On the other hand, the value of k (the number of nearest neighbors used when generating new points from a given minority sample point) must be set manually when the SMOTE algorithm performs data expansion. Based on the SMOTE algorithm and the idea of the Normal distribution, this paper proposes a novel data expansion algorithm for imbalanced data sets. The Uniform random number in the original SMOTE algorithm is replaced by a Normal random number, so that the newly generated sample points are distributed near the center of the minority class with a higher probability, which avoids marginalization of the expanded data. This paper then analyzes the parameter selection of the proposed algorithm; appropriate parameter selection allows the expanded data to maintain the distribution characteristics (inter-class distance and sample variance) of the original data. The experimental results show that the classification effect of the random forest after data expansion by the proposed algorithm is better than with the original SMOTE on the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin data sets.

SMOTE algorithm. SMOTE (Synthetic Minority Over-sampling Technique) is an expansion algorithm for imbalanced data proposed by Chawla 16 . In essence, the SMOTE algorithm obtains new samples by random linear interpolation between minority samples and their neighboring samples. The data imbalance ratio is increased by generating a certain number of artificial minority samples, so that the classification of the imbalanced data set is improved 18 . The specific process of SMOTE is as follows.
Step 1. For each minority sample x i (i = 1, 2, . . . , n), calculate its distance to the other minority samples according to a chosen metric to obtain its k nearest neighbors.
Step 2. According to the over-sampling magnification, m neighbors are randomly selected from the k nearest neighbors of each sample x i and denoted as x ij (j = 1, 2, . . . , m); then an artificially constructed minority sample p ij is calculated by Eq. (1):

p ij = x i + rand(0, 1) × (x ij − x i),    (1)

where rand(0, 1) is a random number uniformly distributed within the range of [0, 1]. The interpolation of Eq. (1) is repeated until the fused data reaches a certain imbalance ratio.
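As a concrete illustration, the interpolation of Steps 1–2 can be sketched in NumPy as below. This is a minimal sketch, not the authors' implementation: the function name is ours, and a plain Euclidean distance is assumed for the neighbor search.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style linear
    interpolation between a minority sample and one of its k nearest
    minority neighbors (Eq. 1)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude the point itself
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per sample
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a minority sample x_i
        j = neighbors[i, rng.integers(k)]     # pick one of its k neighbors x_ij
        gap = rng.random()                    # rand(0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # Eq. (1)
    return np.array(new)
```

Because each synthetic point lies on the segment between a sample and its neighbor, all generated points stay inside the coordinate-wise bounding box of the minority class.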

Motivation.
Marginalization may occur when the SMOTE algorithm constructs data. If a positive (minority) sample point lies near the edge of the distribution of the positive sample set, the "artificial" sample points generated from it and its adjacent sample points may also lie on this edge and become more and more marginalized 23 . As a result, the boundary between the positive and negative (majority) samples becomes blurred. Therefore, an improved SMOTE algorithm based on the Normal distribution 29,30 is proposed in this paper, in which the distribution of the generated data samples is controlled by appropriate parameter selection.
In Eq. (1), rand(0, 1) denotes a random number falling in the interval (0, 1) with equal probability, so the generated sample points are evenly distributed between the sample point x i and its neighbor x ij (j = 1, 2, . . . , m), which leads to marginalization of the expanded data when the sample point x i lies near or on the edge of the minority class. However, if the Uniform random number rand(0, 1) is replaced by a Normal random number randn, and the minority sample center is used in place of x ij , then the expanded points are distributed near the sample center with a higher probability (details in Eq. 5). Here randn denotes a random number obeying a Normal distribution with mean µ = 1 and an adjustable standard deviation σ, and the number p = randn has the following distribution characteristics.
By the three-sigma rule, p falls in the interval (1 − σ, 1 + σ) with probability 68.26% and in (1 − 3σ, 1 + 3σ) with probability 99.74%. The core of the improved SMOTE algorithm based on the Normal distribution is thus to make the generated new samples gather toward the center of the minority class with high probability, while preserving the statistical characteristics of the original minority class through proper parameter selection.
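The three-sigma property above is easy to check numerically. The sketch below (our own simulation, not part of the paper) draws f(x) ~ N(1, (σ0/3)²) for an arbitrary example value of σ0 and measures how often it falls within the stated intervals.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma0 = 0.6                       # example per-feature standard deviation (assumed)
# randn with mean mu = 1 and standard deviation sigma = sigma0 / 3
f = rng.normal(loc=1.0, scale=sigma0 / 3, size=200_000)

within_3sigma = np.mean(np.abs(f - 1.0) < sigma0)      # interval (1 - sigma0, 1 + sigma0)
within_1sigma = np.mean(np.abs(f - 1.0) < sigma0 / 3)  # interval (1 - sigma0/3, 1 + sigma0/3)
print(round(within_3sigma, 3))   # close to 0.997
print(round(within_1sigma, 3))   # close to 0.683
```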

Improved algorithm design. The process of the improved SMOTE algorithm based on the Normal distribution is as follows.
Step 1. Standardize the original data by Eq. (2) to avoid errors caused by different dimensions:

x′ ij = (x ij − x j min) / (x j max − x j min),    (2)

where x ij is the value of the i-th sample under the j-th feature of the original data, and x j min and x j max are the minimum and maximum values of the j-th feature, respectively.
Step 2. Calculate the center point x′ center of the minority samples by Eq. (3):

x′ center = (1/n) Σ i=1..n x′ i,    (3)

where n is the total number of minority samples, and each standardized sample x′ i has r features.
Step 3. Estimate the Normal distribution of the n × 1-dimensional normalized minority samples under each feature. Let σ 0 denote the standard deviation vector of the normalized minority data set, given by Eq. (4):

σ 0 = (σ 01, σ 02, . . . , σ 0r),    (4)

where σ 0i is the standard deviation of the i-th feature.
Step 4. Synthesize new samples based on the interpolation formula of Eq. (5):

p i = x′ i + f(x) × (x′ center − x′ i),    (5)

where p i (i = 1, 2, . . . , n) is a newly generated minority sample. According to Eq. (5), the main control part for data generation is f(x). When the value of f(x) is 1, p i is the minority sample center x′ center . If f(x) takes values near 1 with a higher probability, then the expanded minority samples will be closer to the center point x′ center . Let f(x) be a random number obeying a Normal distribution with mean µ = 1 and standard deviation σ. Then, if σ = σ 0 /3 is taken, the value of f(x) appears in the interval (1 − σ 0 , 1 + σ 0 ) with a probability of 99.74% and in the interval (1 − σ 0 /3, 1 + σ 0 /3) with a probability of 68.26%.
Step 5. The expansion stops when the imbalance ratio reaches 0.7. Ma conducted expansion experiments on 5 imbalanced data sets, including Breast-cancer-wisconsin, from the UCI database 23 ; the results showed that the classification effect was better when the imbalance ratio reached 0.7. Accordingly, we choose 0.7 as the threshold of the imbalance ratio for judging whether the expansion is sufficient.
Step 6. The newly generated minority data is fused with the original minority data. The flow chart of the improved SMOTE algorithm based on Normal distribution is shown in Fig. 1.
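Steps 1–6 can be sketched as follows. This is a minimal NumPy illustration under the description above (min–max standardization, per-feature standard deviation σ0, and Normal interpolation toward the minority center), not the authors' exact implementation; in particular, applying a separate f(x) draw per feature is our reading of Eq. (5).

```python
import numpy as np

def improved_smote(X_min, n_maj, sigma_scale=1/3, target_ratio=0.7, rng=None):
    """Expand the minority class toward its center using Normal interpolation
    until |minority| / |majority| >= target_ratio."""
    rng = np.random.default_rng(rng)
    # Step 1: min-max standardization of the minority data (Eq. 2)
    xmin, xmax = X_min.min(axis=0), X_min.max(axis=0)
    Xs = (X_min - xmin) / np.where(xmax > xmin, xmax - xmin, 1.0)
    # Step 2: minority center (Eq. 3)
    center = Xs.mean(axis=0)
    # Step 3: per-feature standard deviation sigma0 (Eq. 4)
    sigma0 = Xs.std(axis=0, ddof=1)
    # Steps 4-5: Normal interpolation toward the center, stopping when the
    # imbalance ratio reaches target_ratio
    n_new = max(0, int(np.ceil(target_ratio * n_maj)) - len(Xs))
    new = []
    for _ in range(n_new):
        i = rng.integers(len(Xs))
        f = rng.normal(1.0, sigma_scale * sigma0)    # f(x) ~ N(1, sigma^2), per feature
        new.append(Xs[i] + f * (center - Xs[i]))     # interpolation of Eq. (5)
    # Step 6: fuse the new samples with the original minority data
    return np.vstack([Xs] + new) if new else Xs
```

For example, with 30 minority samples, 100 majority samples and target ratio 0.7, the function generates 40 new points, returning 70 minority rows in total.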

Classification method and evaluation index
Random Forest algorithm. With the rapid development of the field of machine learning, random forests are widely used because of their high error tolerance and strong classification performance 16 .
Traditional random forest algorithms are designed for balanced data sets, but imbalanced data sets are more common, especially in practical problems. Random Forest (RF) 31-34 is a bagging ensemble learning algorithm proposed by Leo Breiman in 2001. Multiple decision trees are constructed in parallel and combined to complete a learning task, and the final prediction and classification results are obtained by voting 35 . The process of the random forest is as follows.
Step 1. The data is randomly divided into two sets: a training set and a test set.
Step 2. During training, the data are randomly stratified into K parts for K-fold cross-validation.
Step 3. For each decision tree, the bootstrap method is used to randomly draw a training subset from the training set within K-fold cross-validation.
Step 4. h features are randomly selected from the r features at each subnode of the decision tree as the split attribute set.
Step 5. The N decision trees trained in parallel are combined to construct the random forest model.
Step 6. Based on majority voting, the random forest votes to obtain the final experimental results.
Step 7. Fivefold cross-validation is used in the experiments, and the average accuracy over the validation sets is calculated.

Experimental evaluation index. The evaluation indexes AUC, F-value, G-value and OOB_error used in this paper are introduced in Eqs. (6)-(10). Suppose the data is divided into two categories, positive and negative; the confusion matrix is introduced in Table 1.
Classification Accuracy (AUC): here, AUC represents the ratio of the number of correctly classified samples to the total number of samples. Generally, the higher the AUC value, the better the classification effect of the model. F-value: β ∈ (0, 1], and β is generally taken to be 1. The F-value is an index for evaluating the classification performance on imbalanced sets from the perspective of the positive samples; the higher the F-value, the better the classification effect of the model. OOB_error: ntree is the number of decision trees, and the OOB_error of the overall sample data is the arithmetic mean of the out-of-bag errors of the individual decision trees. The smaller the OOB_error, the better the classification effect of the model.
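These indexes can be computed directly from the confusion matrix of Table 1. The sketch below is a plain-Python illustration with β = 1; the G-value is taken as the geometric mean of the positive and negative recalls, a common definition that the paper appears to follow, and the function name is our own.

```python
import math

def evaluation_indexes(tp, fn, fp, tn, beta=1.0):
    """Accuracy, F-value and G-value from a binary confusion matrix.
    tp/fn count positive (minority) samples, fp/tn negative ones."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total                # correctly classified / total
    precision = tp / (tp + fp)                  # positive predictive value
    recall = tp / (tp + fn)                     # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    f_value = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    g_value = math.sqrt(recall * specificity)   # geometric mean of the two recalls
    return accuracy, f_value, g_value

# Toy example: 90 negatives all correct, 6 of 10 positives found
acc, f1, g = evaluation_indexes(tp=6, fn=4, fp=0, tn=90)
print(round(acc, 3), round(f1, 3), round(g, 3))  # 0.96 0.75 0.775
```

The toy example mirrors the 90/10 disease scenario from the introduction: accuracy stays high even though nearly half of the positives are missed, while the F-value and G-value expose the weakness.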

Simulation experiment
Experimental environment. The experimental data are derived from five data sets in the UCI (University of California Irvine) database, namely the Pima, WPBC, Breast-cancer-wisconsin, WDBC, and Ionosphere datasets. The specific information of these datasets is summarized in Table 2. The hardware configuration is Intel(R) Core(TM) i5-3210 M, with a CPU speed of 2.50 GHz and 4.00 GB of RAM. The random forest model is implemented in Python 3.7, and the improved SMOTE algorithm based on the Normal distribution is jointly implemented with Python, Excel, and SPSS. SPSS is used to estimate the Normal distribution of each feature column and obtain the variance of the Normal distribution; the version used in this paper is IBM SPSS Statistics 23. For random forest model training, fivefold stratified cross-validation is used to prevent overfitting. The number of features trained in each decision tree is generated from the empirical formula h = log2(r) + 1. When each decision tree is split, the Gini index is used to select the best features. To simulate the actual situation appropriately and preserve the degree of imbalance of the original data, the training and testing sets were divided by stratified random sampling at a ratio of 3:1.
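For example, the per-tree feature count from the empirical formula h = log2(r) + 1 can be computed as below. This is a trivial sketch; truncating the logarithm to an integer is our assumption, since the paper does not state a rounding rule.

```python
import math

def features_per_tree(r):
    """Empirical formula h = log2(r) + 1, with log2(r) truncated to an integer."""
    return int(math.log2(r)) + 1

# e.g. WDBC has 30 features, Pima has 8
print(features_per_tree(30))  # 5
print(features_per_tree(8))   # 4
```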
Numerical experiment. Part 1. Comparative experiment between the proposed algorithm and the original SMOTE algorithm. The two algorithms are used to expand the 5 imbalanced data sets respectively, and the expansion stops when the imbalance ratio reaches 0.7. Random forests are then used to classify the expanded data and the original data for comparison. To obtain more scientific and reasonable experimental results, each data set is expanded 5 times by both the SMOTE algorithm and the proposed algorithm, and the average of the 5 classification results is taken as the final experimental result (the same applies to the experiments in Part 2). Part 2. The influence of different parameters on the classification result. When the proposed algorithm is used for expansion, three values of the standard deviation of the Normal distribution in Eq. (5) are considered, namely σ = σ 0 , σ = 2σ 0 /3 and σ = σ 0 /3, where σ 0 is the standard deviation of the original normalized minority data. Random forest is used to classify the data expanded under the different parameters. Part 3. Analysis of parameter selection according to inter-class distance and sample variance. Based on the classification results of Part 2, the inter-class distance and sample variance of the original data set and the fused data set are calculated. The inter-class distance is obtained by calculating the Euclidean distance between the center points of the majority and minority samples.

Compared with the original data and with the data expanded by the SMOTE algorithm, the WPBC dataset expanded by the improved SMOTE algorithm shows an increase in classification accuracy of 2.073% and 2.267%, respectively; the OOB_error value decreased by 3.445% and 2.4%; the F-value increased by 20.188% and 7.88%; and the G-value increased by 10.987% and 6.571%.
For the Ionosphere dataset, expansion by the improved SMOTE yields a 7.152% increase in F-value and a 5.851% increase in G-value compared with the original SMOTE. In addition, the CURE-SMOTE algorithm was used to expand the Breast-cancer-wisconsin dataset, with random forest used for classification, in 23 . According to Figs. 7, 8, 9, 10 and 11, for the Pima data set, the classification effect of the random forest is generally better when the parameter σ of the Normal distribution in Eq. (5) takes σ = σ 0 , although the corresponding F-value is higher when σ = 2σ 0 /3. For the WPBC, Ionosphere, and Breast-cancer-wisconsin data sets, the classification effect of the random forest is best when σ = σ 0 ; for the WDBC data set, it is best when σ = σ 0 /3 in the improved SMOTE algorithm.

Results
Part 3 experimental results. The experimental results of Part 3 are shown in Tables 3 and 4, where bold font indicates the best experimental results.
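The two statistics compared in Tables 3 and 4 can be sketched as follows. The paper does not give its exact computation, so this NumPy sketch assumes the stated Euclidean distance between class centers, and takes the sample variance as the mean of the per-feature variances (one plausible reading).

```python
import numpy as np

def interclass_distance(X_maj, X_min):
    """Euclidean distance between the majority and minority class centers."""
    return float(np.linalg.norm(X_maj.mean(axis=0) - X_min.mean(axis=0)))

def sample_variance(X):
    """Overall sample variance, taken here as the mean per-feature variance."""
    return float(X.var(axis=0, ddof=1).mean())
```

Comparing these values for the original minority data and for the data expanded under σ = σ0, 2σ0/3 and σ0/3 reproduces the kind of comparison reported in Tables 3 and 4.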
According to Table 3, for the Pima, WPBC and Breast-cancer-wisconsin datasets, when the standard deviation σ of the Normal distribution in Eq. (5) takes σ = σ 0 , the inter-class distance after expansion is closest to that of the original unexpanded data. For the Ionosphere data set, when σ = 2σ 0 /3 the inter-class distance after expansion is closest to that of the original data, and it is nearly as close when σ = σ 0 . For the WDBC data set, when σ = σ 0 /3 the inter-class distance after expansion is closest to that of the original unexpanded data. According to Table 4, for the Pima, WPBC, Ionosphere, and Breast-cancer-wisconsin datasets, when σ = σ 0 the sample variance of the expanded data is closest to that of the original unexpanded data. For the WDBC dataset, when σ = 2σ 0 /3 the sample variance of the expanded data is closest to that of the original unexpanded data.
Combining these with the experimental results in Figs. 7, 8, 9, 10 and 11, it can be seen that for the Pima, WPBC, and Breast-cancer-wisconsin datasets, the classification effect is best when the parameter σ (the Normal distribution standard deviation) in Eq. (5) is taken as σ = σ 0 to expand the data set, and in this condition the inter-class distance and sample variance after expansion are the closest to those of the original data. This suggests that the better the distribution characteristics of the original minority data are maintained, that is, the more similar the expanded data is to the original minority data, the better the classification effect. For the Ionosphere and WDBC datasets, the results do not show a fully consistent pattern. For Ionosphere, however, it is still reasonable to conclude that the classification effect is best and the statistical characteristics are well maintained under the parameter selection σ = σ 0 , because the sample variance is then closest to that of the original data, and the inter-class distance is very similar to that obtained under σ = 2σ 0 /3. For the WDBC data set, the inter-class distance after expansion is closest to the original when σ = σ 0 /3, while the variance of the expanded data is closest to the original when σ = 2σ 0 /3, making the parameter selection less obvious. Considering the nature of the classification problem, a data set is more separable when the inter-class distance between the categories is greater and the divergence within the classes is smaller, so it is natural to choose σ = σ 0 /3, which gives expanded data with the inter-class distance closest to the original condition and a smaller divergence.
The experimental results suggest that the optimal parameters are those for which the statistical characteristics of the expanded data are closest to those of the original data. To verify this conclusion more rigorously, more detailed options for parameter selection should be considered in the future.

Conclusion
Aiming at the classification of imbalanced data sets, a new data expansion algorithm based on the idea of the Normal distribution is proposed in this paper. The algorithm expands the minority data by linear interpolation, following a Normal distribution, between the minority sample points and the minority center, so that the newly generated minority data is distributed closer to the center of the minority class with a higher probability, effectively expanding the minority samples while avoiding marginalization. The experiments show that a better classification effect is obtained when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the five imbalanced datasets. In addition, the inter-class distance and sample variance of the data augmented by the proposed algorithm with different parameters (σ = σ 0 , σ = 2σ 0 /3 and σ = σ 0 /3) are calculated, and the corresponding classification effects of the random forests are compared. It is revealed that the classification effect of the random forest is best, in the designed experiments, when the inter-class distance and sample variance of the expanded data are closest to those of the original data.

Data availability
The data used to support the results of this study is publicly available and can be obtained from the website http://archive.ics.uci.edu/ml/.