A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications

Gene/feature selection is an essential preprocessing step for building models with machine learning techniques, and it plays a critical role in biological applications such as the identification of biomarkers. Although many feature/gene selection algorithms and methods have been introduced, they may suffer from problems such as parameter tuning or a low level of performance. To tackle such limitations, in this study, a universal wrapper approach is introduced based on our previously introduced optimization algorithm and the genetic algorithm (GA). In the proposed approach, candidate solutions have variable lengths, and a support vector machine scores them. To show the usefulness of the method, thirteen classification and regression datasets with different properties were chosen from various biological scopes, including drug discovery, cancer diagnostics, and clinical applications. Our findings confirm that the proposed method outperforms most of the other currently used approaches and can also free users from the difficulties of tuning various parameters. As a result, users may optimize their biological applications, such as obtaining a biomarker diagnostic kit with the minimum number of genes and maximum separability power.

In computational biology, researchers often handle large omics datasets with many features (e.g., genomics, proteomics, metabolomics) 1 . For instance, the total number of profiled genes is usually more than 20,000 in human samples, which have been exploited for different purposes such as the detection of biomarkers 2 . Given that the number of features from proteomics and metabolomics data is potentially much larger 3 , it is almost impossible to extract a biomarker kit of manageable size from such large datasets 4 . For instance, in the field of genomic data, researchers aim to (i) select genes having higher separability power between different states, such as cancerous and noncancerous samples, and (ii) confine them to a manageable number 5 . From the machine learning perspective, features or genes can be categorized into three classes as follows: (i) Negative features 6 , which can mislead a learner and reduce its performance; thus, they must not be selected in the application. (ii) Neutral features 7 , which play no role in the performance of a learner and only increase prediction time; like the first group, these features should be avoided. (iii) Positive features 8 , which play a determinant role in distinguishing between samples and enhance the performance of a learner. Even for such features, feature selection (FS) methods need to be applied, since some of them may be redundant with others; further, a large set of them may be represented by a small subset.
Due to its combinatorial nature, FS is an NP-hard problem: it cannot be solved in polynomial time by any known algorithm, although a candidate solution can be verified in polynomial time (i.e., the problem is accepted by a nondeterministic Turing machine) 9 . To overcome this time complexity, heuristic and metaheuristic algorithms, which find acceptable approximate answers to such problems, have been developed 10 .
Different studies have shown that metaheuristic algorithms, which do not confine themselves to a specific region of the search space, are generally more suitable than heuristic algorithms [11][12][13] . In addition, two-step methods may obtain better results than single methods 14,15 . Therefore, in this study, we capitalized on a two-step method based on a genetic algorithm (GA) 16 and our previously developed world competitive contests (WCC) optimization algorithm 17 , the so-called "GA_WCC" method. In the first step of the GA_WCC method, the GA reduces the total number of features to a minimal upper bound. Next, the WCC selects an optimal subset of features for the desired application. Overall, the GA_WCC method is a two-step process for FS, which (i) does not require limiting the number of features to a predefined value and (ii) outperforms other currently used methods.
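The two-step GA_WCC pipeline described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (ga_reduce, wcc_select) and the toy fitness function are hypothetical stand-ins, and the real method scores candidates with an SVM under cross-validation.

```python
import random

random.seed(0)

def fitness(subset):
    # Toy stand-in for the SVM cross-validation score used in the paper:
    # reward a hypothetical "informative" feature set, lightly penalize size.
    informative = {2, 5, 11}
    return len(informative & set(subset)) - 0.05 * len(subset)

def ga_reduce(n_features, pop_size=30, generations=40):
    """Step 1 (GA): evolve variable-length feature subsets, then return the
    feature pool covered by the fittest candidates (a reduced upper bound)."""
    pop = [random.sample(range(n_features), random.randint(2, 8))
           for _ in range(pop_size)]
    for _ in range(generations):
        child = random.choice(pop)[:]
        child[random.randrange(len(child))] = random.randrange(n_features)
        child = list(dict.fromkeys(child))      # collapse repeated features
        pop.append(child)
        pop.sort(key=fitness, reverse=True)
        pop = pop[:pop_size]                    # elitism
    return sorted({f for cs in pop[:5] for f in cs})

def wcc_select(pool, iterations=200):
    """Step 2 (WCC-like refinement): perturb a candidate within the reduced
    pool, keeping a change only when the score does not decrease."""
    best = random.sample(pool, min(3, len(pool)))
    for _ in range(iterations):
        trial = best[:]
        if random.random() < 0.5 and len(trial) > 1:
            trial.pop(random.randrange(len(trial)))
        else:
            trial.append(random.choice(pool))
        trial = list(dict.fromkeys(trial))
        if fitness(trial) >= fitness(best):
            best = trial
    return sorted(best)

pool = ga_reduce(n_features=50)
selected = wcc_select(pool)
```

The key design choice mirrored here is the handoff: the GA never has to find the final subset, only a credible pool, which keeps the more expensive WCC search small.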

Related works
In this section, we discuss the limitations of related works, which can be divided into six classes as follows: (i) Filter methods: These techniques look for relationships among features and investigate how much information a feature carries. For this purpose, various mathematical formulas have been proposed, including entropy 18 , mutual information 19 , the Fisher score 20 , correlation 21 , the Laplacian score 22 , etc. Although these approaches are simple and have a low time complexity, their performance is lower than that of the other categories 23 . To tackle this limitation, wrapper-based methods have been developed, and they are built upon in this paper. (ii) Wrapper methods: Unlike the first class, these approaches score the selected features using a learner such as a support vector machine (SVM) 24 , artificial neural network (ANN) 25 , decision tree (DT) 26 , or others [27][28][29] .
Usually, optimization algorithms are applied to select an optimal subset of features 30,31 . Different studies have shown that these approaches can achieve remarkable outcomes 32 , but most FS studies do not employ state-of-the-art optimization algorithms. Here, we used the WCC algorithm for the FS problem. (iii) Ensemble methods: For the FS, ensemble methods create a learner, such as a decision tree 33 , and select the features that the learner chooses when generating a model 34,35 . Due to their greedy nature, ensemble methods may fall into locally optimal solutions and fail to reach the optimal result. To deal with this limitation, we introduce the WCC algorithm, which has a low probability of falling into local optima. (iv) Hybrid methods: A combination of the three mentioned classes of methods is applied to the FS problem 36 . For example, the total number of features is reduced by filter methods, and then an optimal subset of features is chosen by wrapper or ensemble methods 37,38 . In this class of related works, it is essential to combine the algorithms properly. Therefore, we assumed that a wrapper-wrapper approach, which merges two wrapper-based algorithms, might be a suitable option for FS. (v) Hypothesis-based studies: A concept is hypothesized based on prior knowledge, and its correctness is tested via various experiments on gold-standard datasets 39 . Although these techniques can help in making proper decisions, they do not avoid the mentioned limitations. (vi) Review works: These works survey different methods, such as filter 40 , wrapper 41 , ensemble 42 , and hybrid 43 methods, and discuss their advantages and disadvantages. Further, they study the role of FS in diverse areas and often outline future directions 44 .

Materials and methods
The datasets. Several datasets with diverse properties were selected from various sources, such as the machine learning repository developed at the University of California, Irvine (UCI) 45 and seminal published literature. For every dataset, the total number of samples is approximately the same across its different classes. Table 1 lists the properties of the datasets and describes them.
The proposed method. Our proposed GA_WCC method (Fig. 1) selects features using a two-step wrapper approach. To this end, in the first step, the genetic algorithm (GA) limits the total number of genes or, more generally, features, and then the world competitive contests (WCC) algorithm selects an optimal subset from the reduced set of features. Overall, this study has been established based on the following rationale: (i) The GA starts with a first population of candidate solutions, each of which consists of several variables (a subset of features). Unlike other optimization algorithms such as particle swarm optimization (PSO) 53 , for the GA, the probability of falling into local optima is minimal, because it produces a high number of candidate sets. However, the convergence speed of the GA is usually lower than that of other optimization algorithms (e.g., TLBO 54 and FOA 55 ). Hence, this limitation may be addressed when the GA is combined with other state-of-the-art optimization algorithms, which is done in the present study by merging the GA and the WCC algorithm. (ii) The WCC begins with a first population of potential answers and applies all of its operators to all the existing candidate solutions (CSs), so it spends more time than other optimization algorithms. Hence, when applying the WCC algorithm to an optimization problem, the total number of CSs is limited. This algorithm has an acceptable convergence speed, but its main limitation relates to its complex stages, which increase the execution time. Further, for a given CS, the WCC calls the cost function more often than other algorithms do, due to the nature of its operators. Optimization algorithms differ from each other in the way they change CSs (i.e., in their operators). In this study, the WCC algorithm is adapted to the FS problem, and its operators are modified to select an optimal subset of features.
Given the advantages and disadvantages of the GA and the WCC algorithm (the modified version of the WCC algorithm), it is expected that their limitations will be diminished when they are combined with each other. Inspired by this idea, this study was designed, and an efficient two-step feature selection method based on a wrapper approach has been introduced. As shown in Fig. 1, the GA_WCC method includes several steps as follows: (i) Applying the genetic algorithm: In the first step of the proposed method, a version of the GA is used for the FS 56 . In many FS studies, CSs are binary, and their length is constant and equal to the total number of features. In this study, for both the GA and WCC algorithms, CSs have variable sizes and contain the indices of the selected features. In the optimization scope, the GA is the basis for other optimization algorithms. However, the GA generally exhibits a low level of performance in comparison with other algorithms. This notwithstanding, the GA produces diverse CSs, which may help other optimization algorithms obtain better results 57 . In Fig. 2, the flowchart of the employed GA is shown, which includes the following main steps: (a) Creating a first population of CSs: potential answers or CSs are called 'chromosomes' in the GA, and the values of their genes are randomly quantified. Every CS incorporates some features chosen from a given feature set (the total number of variables in a CS depends on the size of the dataset). In the proposed method, the CSs initially have an identical length, but their lengths may come to differ because of repeated values. For instance, when generating the initial CSs, it is possible that a CS contains some repeated features. In such a case, only one of the repeated values is retained and the others are discarded. (b) Applying GA operators: The GA consists of three main operators named mutation, crossover, and selection.
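The chromosome representation described in step (a) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name init_population and its parameters are hypothetical, but the behavior (index-based, variable-length CSs with repeated features collapsed) follows the description above.

```python
import random

def init_population(n_features, pop_size, cs_len):
    """Create chromosomes holding indices of selected features.
    Repeated indices are collapsed (keeping the first occurrence),
    so chromosome lengths may differ even with a common initial length."""
    population = []
    for _ in range(pop_size):
        raw = [random.randrange(n_features) for _ in range(cs_len)]
        population.append(list(dict.fromkeys(raw)))
    return population

pop = init_population(n_features=100, pop_size=10, cs_len=8)
```

Storing indices rather than a fixed-length binary mask is what lets the method avoid committing in advance to a target number of features.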
In the employed mutation operator, a variable of a chromosome is randomly selected, and its value is replaced by another randomly selected value. In the crossover operator, two ranges of the CSs with the same length are randomly chosen, and their contents are exchanged. Finally, in the selection operator, the elitism technique is used, which forms the new population from the fittest chromosomes of the current population. In Figs. 3 and 4, instances of the mutation and crossover operators are depicted, which illustrate how these operators are applied to generate new CSs. (c) Scoring the selected features: The proposed method is a wrapper method in which a learner evaluates the selected features. Because the datasets are approximately class-balanced, we primarily use the accuracy score (Eq. 1). Other criteria are also inspected in the experimental section.
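The three GA operators just described can be sketched as follows. This is a minimal pure-Python illustration under the paper's variable-length index representation; the names (mutate, crossover, select_elite) and the range-selection details are assumptions, not the authors' exact implementation.

```python
import random

def mutate(cs, n_features):
    """Replace one randomly chosen variable with a random feature index."""
    child = cs[:]
    child[random.randrange(len(child))] = random.randrange(n_features)
    return list(dict.fromkeys(child))       # collapse any repeated feature

def crossover(a, b):
    """Swap two equally sized random ranges between two chromosomes."""
    span = random.randint(1, min(len(a), len(b)))
    i = random.randrange(len(a) - span + 1)
    j = random.randrange(len(b) - span + 1)
    a2 = a[:i] + b[j:j + span] + a[i + span:]
    b2 = b[:j] + a[i:i + span] + b[j + span:]
    return list(dict.fromkeys(a2)), list(dict.fromkeys(b2))

def select_elite(population, score, k):
    """Elitism: keep the k fittest chromosomes for the next generation."""
    return sorted(population, key=score, reverse=True)[:k]
```

Note that both operators deduplicate their output, which is exactly how chromosomes end up with different lengths over the generations.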
Score = Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

Figure 2. Flowchart of the employed GA. This algorithm begins with several randomly generated potential answers (subsets of the existing features) and applies its operators to produce new CSs, which contain the selected features. To calculate the fitness of the CSs, a model is created using SVM, and its accuracy (based on fivefold cross-validation) is reported. To generate the new population, the elitism method (which forms the new population from the CSs with the highest fitness) is used. CS, candidate solution; GA, genetic algorithm; SVM, support vector machine.

(ii) Applying the proposed algorithm (the WCC): As mentioned before, at the end of the first step, the GA passes the created CSs to the proposed algorithm (the flowchart of the WCC algorithm is shown in Fig. 5), and they constitute its first population of CSs. Next, the WCC changes the CSs using its operators, which are explained and formulated as follows: (a) Attacking operator: As formulated in Eq. (2), this operator replaces k randomly chosen variables of a given CS with random feature values (CS(r) = rand(n), repeated k times), where CS, n, and k denote a given candidate solution, the total number of features, and an integer random value between 1 and n, respectively. In other words, the parameter k determines how many variables of a CS must be changed; the sigma sign in Eq. (2) denotes a loop, and r is, like k, an integer value between 1 and n. An example of the attacking operator is given in Fig. 6. (b) Transferring operator: Based on the scores (classification accuracy using a given CS), this operator selects several CSs with the highest scores (Selected_CS) and then randomly chooses some values (features) from them. Next, for a given CS, this operator imports the selected values, as formulated in Eq. (3). Figure 7 describes the transferring operator in detail.
CS(r) = Selected_CS_m(rand(l))   (3)

Each change induced by the operators is accepted only if it increases the accuracy score. Further, repeated features may appear after applying the operators; in such situations, only one of the repeated features is kept and the others are removed. Hence, the length of the CSs may vary.
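The two WCC operators and the accept-if-better rule can be sketched in pure Python. This is a hedged illustration, not the published WCC code: here k is assumed to be capped at the CS length (the paper draws it from 1 to n), and the score function is left generic where the paper uses SVM classification accuracy.

```python
import random

def attack(cs, n_features):
    """Attacking operator (Eq. 2): re-draw k randomly chosen positions of a
    candidate solution; k is random (capped here at the CS length)."""
    out = cs[:]
    k = random.randint(1, len(out))
    for _ in range(k):
        r = random.randrange(len(out))
        out[r] = random.randrange(n_features)
    return list(dict.fromkeys(out))     # repeated features collapse

def transfer(cs, elite):
    """Transferring operator (Eq. 3): copy randomly chosen feature values
    from a high-scoring candidate (Selected_CS) into the given one."""
    out = cs[:]
    donor = random.choice(elite)
    for _ in range(random.randint(1, len(donor))):
        out[random.randrange(len(out))] = random.choice(donor)
    return list(dict.fromkeys(out))

def accept_if_better(cs, trial, score):
    """A change is kept only if it raises the score; otherwise discard it."""
    return trial if score(trial) > score(cs) else cs
```

Both operators deduplicate their output, so the CS length varies exactly as described in the text above.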
(d) Investigating the termination conditions: To terminate the algorithms, several options (e.g., a predefined number of iterations, time, accuracy, etc.) can be used. In the present study, two different strategies were chosen. As mentioned before, when the value of accuracy remains approximately constant over the last ten iterations, the GA is finished. For the WCC algorithm, a predetermined number of iterations was considered as the termination condition.
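The two stopping rules above can be sketched as follows. The function names and the tolerance value are illustrative assumptions; the paper only specifies "approximately constant over the last ten iterations" for the GA and a fixed iteration budget for the WCC.

```python
def stalled(history, window=10, tol=1e-4):
    """GA stopping rule: stop once the best accuracy has stayed roughly
    constant (within tol) over the last `window` iterations."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tol

def run_wcc(step, iterations=100):
    """WCC stopping rule: simply run for a fixed iteration budget."""
    best = None
    for _ in range(iterations):
        best = step(best)
    return best
```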

Results
To obtain the results, a computer system with a dual-core 2.2 GHz processor and 12 GB of RAM was employed. Further, our designed FeatureSelect software application and the MATLAB programming language were used for the implementations. In this section, all reported outcomes refer to results from the five-fold cross-validation technique. For comparing the algorithms and methods, the same conditions were considered; for example, the GA, the WCC algorithm, and the GA_WCC method were allowed to run for an identical time. The population sizes for the GA, the WCC algorithm, and the GA_WCC method were determined by trial and error, balancing run time against the best observed performance of the algorithms. Based on the outcomes, the population sizes were set to 100, 20, and 100 for the GA, the WCC algorithm, and the GA_WCC method, respectively. The mutation and crossover rates were set to 30%, values at which the GA behaves suitably. In addition to the population-size parameters, the WCC algorithm has a match-time parameter (the total number of attempts to change a CS), which was set to 2; for the GA_WCC method, this parameter was set to 1. The outcomes (which encompassed the results of five popular filter FS methods, the GA, the WCC, a two-step filter-wrapper method (EN_WCC), and the proposed wrapper-wrapper method (GA_WCC)) were divided into the following three categories: (i) The first category of the results: This class consists of the results obtained from applying the mentioned algorithms and methods to the classification datasets having more than 50 features. Tables 2 and 3 present the attained outcomes. Also, Fig. 9 depicts the results of the SVM without applying the FS algorithms on the investigated datasets.
Wrapper-based FS methods improve the performance of the SVM, whereas filter-based FS approaches may reduce it. Overall, among the filter methods, the entropy-based (EN) FS method led to more appropriate results than the others. Moreover, between the GA and WCC algorithms, the WCC yields better outcomes. Hence, a combination of EN and WCC (the so-called EN_WCC) was also investigated and compared against the others. For the Cancer dataset, GA_WCC, GA, and WCC yielded the best solutions. However, GA_WCC and GA classify the data with six features, whereas WCC classifies them with ten attributes. For the Arrhythmia dataset, the proposed approach outperforms the others in terms of the total number of features (NOF) and the other classification criteria. For the Diabetes dataset, EN_WCC yielded the minimum number of features and outperformed the filter methods, as was also observed for the Cancer dataset. Nevertheless, GA_WCC, WCC, and GA surpass EN_WCC. Similar outcomes are observed for the other datasets. Tables 2 and 3 show that the wrapper and two-step methods are more efficient than the filter ones, and their performance can be ranked as GA_WCC, WCC, GA, and EN_WCC, respectively.
For further evaluation of the methods, receiver operating characteristic (ROC) curves are shown in Figs. 10 and 11. The area under the curve (AUC) values of the approaches on the datasets of the first class of outcomes are listed in Table 4. The two-step and wrapper approaches show remarkable functionality compared to the others, and the proposed method outperforms all of them (Figs. 10, 11, Tables 2, 3, and 4). In another evaluation of the algorithms' performance, the p-value (PV) measure was considered (Table 5). To this end, every algorithm was run in 50 individual executions, and the results of the proposed method (GA_WCC) were considered as the test base; the outcomes of the other algorithms were then compared with them. Except for the Cancer dataset, on which the effectiveness of the algorithms is the same, the proposed method outperformed the others on the remaining datasets. Figure 12 also presents boxplots of the algorithms' outputs obtained using a one-way ANOVA test. Each execution consists of 100 iterations of the algorithm. At the end of each iteration, the best accuracy obtained so far was stored, and the convergence behavior of the algorithms was investigated for the datasets including more than 1000 features (Fig. 13). It was observed that the convergence speed of the proposed method is higher than that of the GA and WCC algorithms run separately. As mentioned before, the combined method can efficiently address the limitations of the GA and WCC algorithm (the low convergence speed of the GA and the restricted number of CSs in the WCC) and yields better outcomes than either algorithm run individually.
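The paper compares algorithms across 50 executions via p-values (the exact test is not specified above). As a distribution-free illustration of this kind of comparison, the sketch below uses a two-sample permutation test on the mean accuracy difference; the function name and parameters are assumptions for illustration only.

```python
import random

def perm_test(a, b, n_perm=2000, seed=1):
    """Two-sided permutation test on the difference of mean accuracies
    between two lists of per-execution scores. Returns an estimated p-value:
    the fraction of random relabelings at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm
```

On clearly separated score samples the estimated p-value is small; on identical samples it is 1, matching the "effectiveness of the algorithms is the same" case reported for the Cancer dataset.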
In filter FS methods, determining the total number of features to keep is a challenging problem and plays an essential role in the performance of a model. The results of the five filter approaches are shown in Figs. 14, 15, 16, and 17; these outcomes show the performance of the filter FS methods with different numbers of features. (ii) The second category of results: This section includes the results of the algorithms on the datasets having fewer than 50 features/attributes. The main goal of this section is to check the effect of FS methods on datasets that consist of fewer features. For small datasets, single wrapper methods do not face special challenges in the FS; indeed, they may obtain the best solution while also improving run time. Hence, in this section, the functionality of the GA and WCC algorithms is inspected. As in the first part, criteria such as sensitivity, specificity, accuracy, precision, and AUC were investigated. The acquired data are listed in Table 6. Without applying the GA and WCC algorithms, the SVM alone yields accuracy values of 0.5263, 0.6645, and 0.5812 using the fivefold cross-validation technique on the CHD, SHD, and PID datasets, respectively. By applying the algorithms, the accuracy improved for the CHD and SHD datasets and remained unchanged for the PID dataset. Further, the total number of features was remarkably reduced; thus, the models obtained by applying the algorithms operate faster than a model that uses all the existing features. Comparing the GA and WCC algorithms, the WCC led to models with a lower number of features and higher values of the criteria. Therefore, it might be concluded that the state-of-the-art optimization algorithm can obtain more acceptable results than the others.
(iii) The third category of the results: In this section, the outcomes of the methods and algorithms are evaluated on the regression-based datasets (the WDBC and drug datasets). To this end, criteria such as the root mean squared error (RMSE) and the correlation between predicted and real labels were calculated and gathered (Table 7). For the filter FS methods, different numbers of features were tested, and their best results were reported. For the wrapper FS approaches, it is not necessary to limit the total number of features, as they regulate it themselves. Even so, they produce variable results across different executions, so each was executed 50 times, and the best outcome among these executions is reported as the solution to the problem. Thus, several criteria based on these 50 individual executions were reported for them, including the confidence interval (CI), p-value, standard deviation (STD), etc. From the run-time perspective, filter FS methods require less time than wrapper approaches but do not yield improved outcomes. For instance, for the WDBC dataset, the entropy FS approach yields the minimum error and the maximum correlation between the predicted and real labels when the total number of features is limited to 13. The value of correlation can be calculated not only for the entropy method but also for the others. Like the first two classes of results, the third one also shows the remarkable performance of the proposed approach (GA_WCC) in terms of error, correlation, the total number of selected features, run time, etc. Besides, the WCC and GA results show that wrapper FS methods may acquire better results than the filter FS approaches. In Fig. 18, scatter plots of the proposed method on the regression-based datasets are shown.
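The two regression criteria used above, RMSE and the Pearson correlation between predicted and real labels, reduce to short formulas; a minimal pure-Python version is given below (standard definitions, not code from the paper).

```python
import math

def rmse(pred, real):
    """Root mean squared error between predicted and real labels."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real))

def correlation(pred, real):
    """Pearson correlation between predicted and real labels
    (assumes neither sequence is constant)."""
    n = len(real)
    mp, mr = sum(pred) / n, sum(real) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, real))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in real))
    return cov / (sp * sr)
```

A good feature subset for the regression datasets is one that simultaneously drives RMSE down and the correlation toward 1.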

Discussion
Many methods and algorithms have been proposed for selecting an optimal subset of features, which is indeed an NP-hard problem, particularly in machine learning with a biological context. Besides enhancing the separability power of a model, optimal features improve the speed of a model and may lead to valuable results, such as an optimal kit of biomarkers to be used in applications. In this area, it has been shown that two-step FS approaches lead to better outcomes than single methods 59 , and wrapper-based FS methods usually outperform filter and embedded FS techniques 60 . The results of this study also confirm these observations and allow the following key conclusions: First, wrapper FS methods may obtain an optimal subset of features without requiring the total number of features to be confined to a predefined number. Nevertheless, there are some restrictions in determining the total number of selected features. For example, wrapper methods may obtain the subset of attributes with the highest score while the total number of selected features is greater than the number the problem's constraints allow. Even so, we believe that wrapper FS methods are still better than the filter and embedded FS approaches, in large part because they can be formulated in a way that resolves the problem constraints.
Second, limiting the filter methods to a predefined number of features is a challenging problem and affects the performance of filter FS approaches. The results of this work show that the performance of filter FS approaches varies with the number of selected features. Thus, this parameter remains a challenge for researchers. Third, FS is also essential for datasets having a low number of features. In the second part of the results, the performance of wrapper FS methods was investigated on some gold-standard datasets whose total number of features is less than 50. Based on other conducted studies 61 , it seems that FS has been ignored in such works even though it may improve performance. For this class of datasets, considering the total number of features, single wrapper methods might be a proper choice.
Fourth, wrapper-wrapper FS methods may be the best option for selecting an optimal subset of features. In the last decade, different types of hybrid methods have been introduced for the FS problem owing to their promising results. However, most of them combine filter-filter or filter-wrapper approaches, and a suitable configuration of wrapper-wrapper methods has been ignored. In the present investigation, a wrapper-wrapper approach based on the GA and the proposed WCC algorithm was introduced, which yielded superior outcomes compared to the other approaches. The WCC algorithm starts with a first population of CSs and then applies its operators to them in order to obtain a better solution to the FS problem. The main difference between the WCC algorithm and other optimization algorithms relates to the steps of the algorithm and its operators. These two-step approaches differ from hybrid methods that merge optimization algorithms, such as the whale optimization algorithm combined with simulated annealing 62 . In this study, to obtain an efficient combination of the algorithms, the advantages and limitations of the GA and the WCC algorithm were considered. Since the GA produces diverse CSs, the WCC algorithm confines them to a limited number. Unlike the WCC algorithm, the GA may suffer from a low convergence speed and may not show suitable performance relative to other optimization algorithms. For these reasons, the GA and WCC algorithm were combined, and the results showed that their combination yields better outcomes. Fifth, the performance of the algorithms and methods varies across datasets. Every algorithm or method approaches the FS problem differently, so its functionality may differ on different data. Generally, it is impossible to predict a priori which of the methods or algorithms is suitable for a given problem. Nonetheless, wrapper-wrapper FS approaches appear promising for producing the desired results.
As future work, the proposed method can be extended with other algorithms such as the Salp Swarm Algorithm 63 and DE 64 , while considering their limitations and disadvantages. Also, the proposed method scores a set of features but does not rank the features within the obtained set. To address this limitation, the proposed approach can be combined with state-of-the-art ranking techniques such as SVM-RFE 65,66 .

Conclusion
To select an optimal subset of features, a two-step wrapper-wrapper FS method based on the GA and our proposed algorithm (WCC) was introduced and applied to thirteen biological datasets with different properties. In comparison with other approaches, it can be concluded that two-step techniques may lead to better results than single-step methods. Furthermore, among the two-step approaches, wrapper-wrapper FS methods may be more appropriate than the others. For biological applications, wrapper approaches appear to be the most convenient and reliable methods, in large part because they do not need to be restricted to a predefined number of features. Taken together, based on our findings, wrapper-wrapper FS methods can be used to optimize FS problems and produce robust and desired outcomes.