Elitist Binary Wolf Search Algorithm for Heuristic Feature Selection in High-Dimensional Bioinformatics Datasets

Due to the high-dimensional characteristics of such datasets, we propose a new method based on the Wolf Search Algorithm (WSA) for optimising the feature selection problem. The proposed approach uses the natural strategy articulated by Charles Darwin; that is, 'It is not the strongest of the species that survives, but the most adaptable'. This means that in the evolution of a swarm, the elitists are motivated to quickly obtain more and better resources. The memory function helps the proposed method avoid repeatedly searching the worst positions, enhancing the effectiveness of the search, while the binary strategy recasts the feature selection problem as an analogous function optimisation problem. Furthermore, the wrapper strategy pairs these strengthened wolves with an extreme learning machine classifier to find a sub-dataset with a reasonable number of features that maximises the accuracy of the global classification models. The experimental results on the six public high-dimensional bioinformatics datasets tested demonstrate that the proposed method outperforms some of the conventional feature selection methods by up to 29% in classification accuracy, and outperforms previous WSAs by up to 99.81% in computational time.

Scientific Reports | 7: 4354 | DOI: 10.1038/s41598-017-04037-5
The filter approach to feature selection has a higher computational efficiency, evaluating the quality of features with certain metrics, such as distance 26 , information gain 27 , correlation 28 and consistency 29 . The RELIEF 30 series of algorithms is commonly used in the filter approach. The RELIEF algorithm targets binary class datasets. First, it randomly selects m samples from the training dataset; for each selected sample, it finds its two nearest samples, one in the same class and one in the different class, and uses the differences between the sample and these neighbours to estimate the correlation between each feature and the class. Then, the values averaged over the multiple selections become the weights of each feature; finally, the algorithm obtains the correlation between each feature and the class, and selects the features with higher weights as the selected feature combination. RELIEFF 31 is an extended version for solving multi-class and regression problems. It estimates selected features by the discriminative ability of close samples; that is, samples with better feature combinations in the same class lie close together in the search space, and vice versa.
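The RELIEF weighting scheme described above can be sketched as follows. This is a minimal illustration, not the original implementation; the L1 neighbour metric and the helper names are simplifying assumptions.

```python
import numpy as np

def relief_weights(X, y, m=20, rng=None):
    """Sketch of binary-class RELIEF: weight features by how well they
    separate each sampled instance's nearest hit from its nearest miss."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distance to every sample
        dists[i] = np.inf                      # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class
        # features differing from the miss gain weight; from the hit, lose it
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / m   # average over the m sampled instances
```

Features whose weights exceed a threshold are then kept as the selected combination.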
The wrapper approach in feature selection depends on a machine learning algorithm. It uses the selected sub-feature set to train the machine learning algorithm directly, then estimates the quality of the selected sub-feature set from its performance when testing the machine learning algorithm. The wrapper approach can achieve a significant solution when it combines machine learning (classification) algorithms with the random strategy algorithms mentioned in the previous section. Previous researchers have combined GA and decision trees into classification models that select the optimal combination of features with the lowest error rate 32 ; moreover, they have combined different classifiers, such as neural networks, Naive Bayes 11,12 and support vector machines 32 , with random strategy algorithms, such as PSO and bat-inspired (BAT) algorithms 11 , to optimise the wrapper approach. Therefore, researchers are constantly trying to optimise machine learning and random strategy algorithms to enhance their computational efficiency and the quality of selected features. Given that the wrapper approach requires these classifiers to be constantly called and trained to verify and evaluate the performance of selected sub-feature sets, it takes more computation time than the filter approach. The wrapper approach offers higher accuracy, but when tackling high-dimensional datasets, the filter approach is more commonly used.
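The wrapper evaluation loop can be illustrated with a short sketch. The nearest-centroid stand-in classifier is hypothetical (the paper's experiments wrap an ELM), but the pattern of retraining on each candidate subset and returning test accuracy as the fitness is the same.

```python
import numpy as np

def nearest_centroid_accuracy(Xtr, ytr, Xte, yte):
    """Tiny stand-in classifier: predict the class with the closest centroid."""
    classes = np.unique(ytr)
    centroids = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[d.argmin(axis=1)]
    return (pred == yte).mean()

def wrapper_fitness(mask, Xtr, ytr, Xte, yte):
    """Wrapper evaluation: keep only the features flagged in `mask`,
    retrain the classifier, and return test accuracy as the fitness."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                 # an empty subset selects nothing
    return nearest_centroid_accuracy(Xtr[:, cols], ytr, Xte[:, cols], yte)
```

Each candidate mask produced by the random strategy algorithm is scored this way, which is exactly why the wrapper approach costs so many classifier calls.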

Results
The classification results are assessed on distinct training and testing parts. We perform a strict 10-fold cross-validation 33,34 to test the corresponding performance of the current dataset classification model. The dataset is randomly subdivided into ten near-equal parts, and each part takes a turn as the testing dataset with the other nine parts as training datasets across the ten repeated classifications. Accuracy and other performance measures of this cross-validation process are then averaged over these ten classifications. To maintain the fairness of the experiment, because our proposed method, PSO, BPSO and WSA are random searching strategy algorithms, their experiments are also repeated ten times, and the final results are reported as mean values. Tables 1 to 3 record the accuracy, dimension (%) and kappa statistics 35,36 of the selected sub-datasets with different methods. Tables 4 and 5 present the precision and recall values, respectively, to help us evaluate and compare these methods. Given the randomness of swarm intelligence algorithms, the results of this category are reported as averages with their standard deviations to verify and reflect the impartiality of our experiment.
Table 2. Dimensions of all datasets with different methods.
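The 10-fold splitting procedure can be sketched as follows, assuming only that the folds are random and near-equal in size:

```python
import numpy as np

def kfold_indices(n, k=10, rng=None):
    """Randomly split n sample indices into k near-equal folds; each fold
    serves once as the test set, with the other k-1 folds as training."""
    idx = np.random.default_rng(rng).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Averaging the k per-fold accuracies gives the reported cross-validated score.
```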
The classification accuracy of the original datasets for ELM is around 0.5 to 0.6. The first three methods, with heuristic and filter strategies, improve the accuracy a little, whereas RFAE obtains worse accuracy when processing high-dimensional datasets. CHSAE is the best of the first three; it evaluates the chi-squared statistic of each feature with respect to the class. As mentioned in Section 2, heuristic searching strategies combined with filters can achieve some good effects, but random searching strategies based on a wrapper approach can obtain better feature sets with higher accuracy: even their worst algorithm (PSO) is still better than CHSAE. Combining the results shown in Fig. 1 with the values in Table 1, WSA and BPSO (the binary version of PSO) are both better than PSO, which is a typical, effective swarm intelligence algorithm. However, it can be observed that WSA and BPSO do not increase the classification accuracy by much, whereas the features selected by the EBWSA exhibit more than a 10% increase in classification accuracy. The Kappa statistic measures the robustness of classification models 35 , with a bigger value indicating greater reliability. Table 3 and Fig. 2 illustrate that the robustness of the classification model for the original high-dimensional bioinformatics datasets is weak. After feature selection, the Kappa statistic of each classification model is enhanced for the bioinformatics dataset SMK_CAN_187, even with filter methods. The EBWSA circle in Fig. 2 is much bigger than the others, along with the point of CHSAE for the Colon dataset. Table 2 records the selected optimal features of each method and dataset, and Fig. 3 presents the percentage of the original feature set that the selected optimal feature subset contains; that is, the length of the selected feature subset as a percentage of the dimension of the original feature set 37 , or % dimension.
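For reference, the Kappa statistic compares the observed agreement po with the chance agreement pe, κ = (po − pe)/(1 − pe); a minimal computation:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    po = (y_true == y_pred).mean()                        # observed agreement
    pe = sum((y_true == c).mean() * (y_pred == c).mean()  # chance agreement
             for c in classes)
    return (po - pe) / (1 - pe)
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than chance.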
It can be observed that the dimensions of the experimental datasets are in the thousands and tens of thousands. Figure 3 intuitively reflects how CHSAE and EBWSA selected smaller, more precise feature sets, resulting in superior performances within the filter and wrapper approaches, respectively. However, the lengths of the selected feature sets do not imply that more refined feature sets perform better, because INFORGAE was much worse than most of the other methods despite its longer lengths.
Table 5. Recall of all datasets with different methods (best results highlighted in bold).

Discussion
Filter methods are indeed much faster than the wrapped random searching algorithms, since the latter need to call the classifier tens of thousands of times. However, on the premise of a reasonable and acceptable time cost, the wrapper approach is able to obtain better performance. The conventional WSA for feature selection has been verified to be better than PSO and BAT with different classifiers in classification performance 11 . The WSA's shortcoming is its large computation time, caused by its multi-leader searching and escape mechanism in a vast search space. Figure 4 displays the time cost, in seconds, of the proposed method and the three other compared swarm intelligence algorithms. The very high time cost of WSA is clearly observed, as is how extremely and effectively the EBWSA shortens the consumption time. Although the EBWSA needs more time than PSO and BPSO, they are sufficiently close to each other. In addition, the experimental results indicate that it improves the computational time by up to 99.81% over the WSA. Furthermore, the EBWSA can also obtain a better second-best optimal feature set with higher classification performance, as displayed in Fig. 5. Figure 5 shows the average accuracy, kappa and dimension (%) over all datasets for each method; apart from RFAE, the results demonstrate gradual growth from left to right. The wrapper approach is better than the filter approach at obtaining selected features that give higher classification model performance, and swarm intelligence algorithms are able to select a more suitable length of selected features. The conclusions are as follows. This paper proposes the EBWSA to optimise feature selection for high-dimensional bioinformatics datasets. Based on the WSA, the EBWSA selects a better second-best feature set with higher accuracy for classification within a more reasonable computation time. It uses the wrapper strategy that combines the EBWSA and ELM classification to implement the feature selection operation.
The elitist strategy motivates stronger wolves to find better solutions in severe environments to accelerate population updates, while the weaker wolves are allocated some resources when the environment improves; the resources are distributed according to variable weights for each wolf as their fitness values change. Based on their searching abilities, different wolves have different step sizes, and the memory function makes the search more effective by promoting the convergence of the population. Meanwhile, the binary approach converts the feature selection problem into an analogous function optimisation problem to obtain an optimal feature set with optimal classification accuracy and optimal length. The experimental results show that the wrapper approach is better than the filter approach in classifying selected features, albeit within a longer computing time. However, with an extremely high-dimensional dataset, the wrapper approach is more effective and useful than the filter approach within a reasonable time (when faced with tens of thousands of features, a few hundred seconds is needed to obtain a better solution). The EBWSA outperforms other conventional feature selection methods in classification accuracy by up to 29%, and it outperforms the previous WSA by up to 99.81% in computational time.

Methods and Materials
Elitist Binary Wolf Search Algorithm. As mentioned, the random strategy algorithm is an essential part of the wrapper model in feature selection. Different new algorithms have been proposed to improve feature selection. The swarm intelligence algorithm, also called a bio-inspired algorithm, is a unique random strategy algorithm that exhibits significant performance (some examples include PSO 38 and BAT 39 ). As their names reflect, they are inspired by natural biological behaviour and use swarm intelligence to find an optimal solution. The WSA 10 is a new swarm intelligence algorithm inspired by the hunting behaviour of wolves. However, it differs from the other bio-inspired algorithms because the behaviour in the WSA is assigned to each wolf rather than to a single leader, as in the traditional swarm intelligence algorithms. In other words, the WSA obtains an optimal solution by gathering multiple leaders, rather than by searching in a single direction. Figure 6 uses an example to show the hunting behaviour of the WSA in 2-D. The original WSA observes three rules and follows the related steps to achieve the algorithm 37 : Each wolf has a full-circle visual field in full range with v as the radius. Here, distances are calculated by the Minkowski distance, as in Equation 1, where x(i) is the current position, X denotes all of the candidate neighbouring positions, x(c) is one of X and μ is the order of the dimensional space. Each wolf moves towards its companions, who appear within its visual circle, at a step size that is usually smaller than its visual distance. Equation 2 is the absorption coefficient, where β o is the ultimate incentive and r is the distance between the food, or the new position, and the wolf. Equation 2 is needed because the distance between the current wolf's position and its companion's position must be considered. The distance and attraction are inversely proportional, so the wolf is eager to move towards the position with the minimum distance.
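The visual-field rule built on the Minkowski distance of Equation 1 can be sketched directly; the helper names are illustrative:

```python
import numpy as np

def minkowski(a, b, mu=2):
    """Minkowski distance of order mu between two positions (Equation 1);
    mu = 2 reduces it to the Euclidean distance."""
    return float((np.abs(np.asarray(a) - np.asarray(b)) ** mu).sum() ** (1 / mu))

def visible_peers(i, positions, v, mu=2):
    """Indices of companions inside wolf i's full-circle visual field of radius v."""
    return [c for c in range(len(positions))
            if c != i and minkowski(positions[i], positions[c], mu) <= v]
```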
Given a different environment, the wolf may encounter its enemies. It will then escape to a random position far from its current position and beyond its visual field. Equations 3 and 4 are the movement formulas, where the escape() function obtains a random position to jump to with a minimum length constraint, x(j) is the peer with a better position and better fitness than x(i), and s and rand() in Equation 4 are the step size and a random value within −1 and 1, respectively. In effect, the step size of the WSA corresponds to the velocity of PSO. The escape mechanism effectively prevents the population from falling into a local optimum. The WSA outperforms the other swarm intelligence algorithms in accuracy of feature selection when used with the wrapper strategy 35,37 . However, the multi-leader searching strategy and escape mechanism deliver better-performing selected features at a higher time cost. In a vast search space, a fixed step size limits the wolves' visual fields and movement speeds, so the population of wolves converges slowly towards the optimal solution, and, as mentioned, the wrapper strategy typically needs more computation time. Therefore, we propose the Elitist Binary Wolf Search Algorithm (EBWSA) to improve the performance and reduce the time cost of the WSA for feature selection.
The EBWSA, which is based on the WSA 10 , uses the weight of each wolf while searching to determine the step size dynamically from the fitness values in the current iteration, and to update the wolves' own weights 44 . At initiation, each wolf is treated as a weak searcher. During the search, the stronger wolves that find better results while their weights total less than half of the whole gain more weight and become Elitist wolves. In contrast, under this mechanism, the wolves with poor ability weaken as their Elitist counterparts take more than half of the total weight. If the Elitist wolves hold less than half the total weight, it means that the living environment is poor, which motivates them to gain more resources. When the living environment improves, the weak wolves gain resources to balance the whole population. This simulates Darwinian evolution 40 ; specifically, natural selection and survival of the fittest. To avoid throwing the whole population out of balance, a weak wolf is eliminated and reborn as a stronger searcher when its weight reaches the elimination-and-rebirth threshold, and a normalisation operation redistributes the weights to each wolf. Meanwhile, if any wolf's weight is equal to or more than half of the whole, its weight will be reset and the weights of the population will be redistributed. Inspired by the Eidetic WSA 9 , the EBWSA also has a memory function to avoid repetition and promote efficient searching. This function records the worst position of each wolf at every iteration, so that wolves in subsequent iterations keep away from the previous worst positions. To reduce the time cost, the earliest records of the worst positions are forgotten once the default memory length is full. This memorising-forgetting mechanism makes the whole population more intelligent, better imitating a natural population.
Figure 7 demonstrates the 2-D hunting behaviour of a pack of five wolves, with their weights being adjusted in an iteration according to the EBWSA.
An example is shown in Fig. 7. A 'better' wolf is one whose fitness is better than in its previous iteration. If the total weight of the better wolves in the last iteration is smaller than 0.5, the current living environment is judged to be worse, and the better wolves' weights will be increased in the next iteration; in this diagram, w4 and w5 are the better wolves in a worse living environment in the current iteration. On the contrary, if the total weight of the better wolves is bigger than 0.5 in the last iteration, the worse wolves will receive some weight from the better wolves to increase their weights in the better living environment; in this case, w1 and w2 are the better wolves in the current iteration. In addition, if the total weight of the better wolves is equal to 0.5, the wolves keep their own weights. It should be noted that if the weight of w4 or w5 becomes bigger than 0.5, all of the weights are re-assigned; and if the weights of w1 and w2 become smaller than one percent of the initial weight, all of the weights are re-assigned and these two positions are reborn. The following are the weight variation steps: 1. The total weight is 1; each wolf's weight W i is 1/N in the initial phase, where N is the size of the population. 2. For m = 1,…, M (M is the maximum iteration time): a) Count the t better wolves and compute γ, the sum of the better wolves' weights. b) Choose σ m to measure the living condition of the population: when γ < 0.5, σ m > 0, and as γ decreases, σ m increases. This means that if the current population of Elitist wolves takes less than half of the total resources, then the living environment is worse; if γ is bigger than 0.5, it indicates that the environment is good. c) Update the distribution of weights for the wolves. If γ is smaller than 0.5, then σ m has a positive value; if e^(σ m) is bigger than 1, then W m+1 i will increase in the next iteration. In other words, in a poor living environment, the weights of these wolves must be increased to motivate them.
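The weight variation steps above can be sketched as follows. Since the exact σ m formula appears in the paper's equations (not reproduced here), a fixed boost factor stands in for the e^(σ m) term; the reset guard mirrors the re-assignment rule described above.

```python
import numpy as np

def update_weights(w, improved, boost=1.2):
    """Simplified sketch of the elitist weight redistribution.
    `improved[i]` is True if wolf i bettered its fitness this iteration.
    `boost` is an illustrative stand-in for the paper's e^(sigma_m) factor."""
    w = np.asarray(w, dtype=float)
    improved = np.asarray(improved, dtype=bool)
    gamma = w[improved].sum()          # total weight of the 'better' wolves
    if gamma < 0.5:                    # poor environment: reward the elitists
        w[improved] *= boost
    elif gamma > 0.5:                  # good environment: help the weak wolves
        w[~improved] *= boost
    w /= w.sum()                       # renormalise so the weights total 1
    if (w >= 0.5).any():               # dominance guard: re-assign all weights
        w[:] = 1.0 / len(w)
    return w
```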
In contrast, the weaker wolves will receive some weight from the population in a better environment. When the population is weak overall, the Elitist wolves gain increasing rewards; when the population is strong overall, the weaker wolves are helped. The update formulas in equations (3) and (4) become equations (9) and (10). In equation (9), each wolf multiplies by its corresponding weight to update its position. The dynamic weight replaces the fixed value of the step size so that an Elitist wolf is able to go further in its searching. The total value of the weights is normalised in equation (8). Thus, w i,j is less than 1, and w i,j × N is a value floating around 1 that scales the step size.
The EBWSA recasts the feature selection process as a binary optimisation problem. The number of features determines the dimensions and positions of each wolf. This means that in the EBWSA's feature selection, the position of each individual particle can be given in binary form (0 or 1), which adequately reflects the straightforward 'yes/no' choice of whether a feature should be selected. The scope of a position is from −0.5 to 1.5. Then, equation (9) is used to calculate the binary value of the position.
Here, x i,j denotes particle x(i) in position (dimension) j, and the round() function calculates the binary value X i,j of the corresponding position to achieve the binary optimisation operation. Figure 8 presents a wolf's movement in an iteration of the binary strategy. The EBWSA's feature selection can be regarded as a high-dimensional function optimisation problem wherein the values of the independent variables are 0 or 1. In addition, values of 0 and 1 can be given to the dependent variables calculated by the rounding function, whose independent variables can be assigned from −0.49 to 1.49. The step size of each position is a very small value in a fixed range. At the beginning of Section 2, we described the classical definition of feature selection as selecting a sub-dataset d with f features from the primary dataset D with F features, f ≤ F, where d has the optimal performance among all of the sub-datasets with f features from the primary dataset 6 . Thus, the value of f is fixed in this definition, whereas it should be a variable; that is, algorithms should find the optimal length together with the optimal combination. The EBWSA resolves this problem, obtaining the optimal feature set using a method similar to function optimisation. Figure 9 illustrates the weighted EBWSA process, and we present pseudo EBWSA code. The comments after the symbols "//" denote explanatory information. The function Generate_new_location() calls a classifier to calculate the accuracy of the classification model and returns it as the fitness. The EBWSA is implemented using the above steps and code to select the optimal feature set. In our experiment, we used classification accuracy as the evaluation metric to estimate the quality of the selected features. Higher classification accuracy signified a better combination of features, and vice versa.
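The rounding step can be sketched as follows; clipping to the −0.49 to 1.49 range mentioned above keeps the rounded values strictly binary, and floor(x + 0.5) is used because Python's built-in round() rounds halves to even:

```python
import numpy as np

def binarise(position):
    """Map a continuous wolf position to a binary feature mask:
    dimensions rounding to 1 select the feature, those rounding to 0 drop it."""
    x = np.clip(np.asarray(position, dtype=float), -0.49, 1.49)
    return np.floor(x + 0.5).astype(int)   # standard half-up rounding
```

For example, a position of 0.8 in dimension j selects feature j, while −0.2 leaves it out.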
Figure 10 is an example of the EBWSA with 5 wolves and 100 iterations for the feature selection of the dataset Prostate_Ge, a high-dimensional bioinformatics dataset introduced in the next section. The first subfigure represents the wolves' survival environment, expressed as the total weight of the Elitist wolves, whose fitness values are better than those in the previous iteration. If this value is smaller than 0.5, the survival environment is considered to be bad, and the Elitist wolves will get more weight from their weaker counterparts. Because the search spaces of high-dimensional datasets are large and the populations are small, most of the wolves have a difficult time finding a better solution. The next five subfigures describe the variations in each wolf's (weight × population). Although each weight is smaller than 1, weight × population takes a value bigger or smaller than 1 to change the step size of each wolf. Therefore, these five subfigures indirectly present the step size changes of each wolf over 100 iterations.
Dataset benchmarks. The six binary class bioinformatics datasets in Table 6 are used to test the effectiveness of the proposed method, and to compare the algorithms. They are biological data downloaded from the Arizona State University website 41 . It can be observed that these are high-dimensional bioinformatics datasets with tens of thousands of features, except for the Colon and Leukemia datasets. Such dimensions are commonly seen in biological or bioinformatics datasets.
Comparison algorithms. In addition to the proposed method, six algorithms are compared: three that use heuristic and filter strategies and three that use random and wrapper strategies. Classification accuracy is the evaluation metric for the selected features in our experiment. Hence, the first comparison is with the basic classifier, the extreme learning machine (ELM) 42 , which classifies the original high-dimensional datasets. The ELM is a traditional single hidden layer feed-forward neural network (SLFN) that reduces the computational time cost while guaranteeing learning accuracy. It is a network structure composed of an input layer, a hidden layer and an output layer; the hidden layer fully connects the input and output layers. The whole learning process can be briefly divided into the following parts. First, determine the number of neurons in the hidden layer, then randomly set the thresholds of the hidden-layer neurons and the connection weights between the input and hidden layers. Second, select an activation function to calculate the output matrix of the hidden-layer neurons. Finally, calculate the output weights. The ELM's fast computational speed, simple parameters, strong generalisation ability and simple, quick construction of the SLFN make it ideal for use as the basic classifier in our experiment. ELMs also classify the datasets with features selected by the different feature selection methods, providing their classification accuracies for comparison. The first three approaches are chi-squared attribute evaluation (CHSAE), information gain attribute evaluation (INFORGAE) and RELIEFF attribute evaluation (RFAE) from the Waikato Environment for Knowledge Analysis 43 .
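The three ELM training steps above (random hidden weights and thresholds, a chosen activation function, then solving the output weights) reduce to a few lines. The sigmoid activation and the least-squares solve via pseudo-inverse are standard ELM choices; the sizes here are purely illustrative.

```python
import numpy as np

def elm_train(X, y, n_hidden=40, rng=0):
    """Sketch of ELM training: random input weights and hidden thresholds,
    a sigmoid hidden layer, and output weights solved by least squares."""
    g = np.random.default_rng(rng)
    W = g.normal(size=(X.shape[1], n_hidden))  # random input-to-hidden weights
    b = g.normal(size=n_hidden)                # random hidden thresholds
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y               # output weights: H @ beta ~ y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because only beta is solved (no iterative backpropagation), training is fast, which is what makes the ELM affordable inside a wrapper loop.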
CHSAE evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class; INFORGAE does so by measuring the information gain with respect to the class; and RFAE does so by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and different classes. RFAE can operate on both discrete and continuous class data. As mentioned before, these filter approaches rank attributes by the measured value; thus, we retain and collect the features whose measured values are greater than 0. The other three feature selection methods separately wrap PSO, binary PSO and a preliminary version of the WSA with an ELM classifier to perform the feature selection operation and discover the feature combinations with the optimal classification accuracy. The swarm intelligence iterative methods and the ELM are programmed in Matlab 2014b with a population of 15, a maximum of 100 iterations and an inertia weight of 0.8. The computing platform for the entire experiment is CPU: E5-1650 V2 @ 3.50 GHz, RAM: 32 GB.