Introduction

The increasing number of high-dimensional datasets in various organizations is driving the need for advanced data mining techniques1,2. However, handling high-dimensional data presents a challenge that limits the application of data mining algorithms. To overcome this, feature selection3 and extraction methods are used to reduce the dimensions of the data. While feature extraction transforms raw data into a new feature space, feature selection algorithms choose the optimal subset of features from the raw data, leading to lower dimensionality and improved interpretability while preserving the actual data space4.

With the rise of high-dimensional data in various organizations, the need for effective feature selection algorithms has become increasingly crucial. Currently, several search mechanisms exist, including ranking-based methods5, swarm intelligence/evolutionary algorithms6, forward/backward search7,8, and nature-inspired meta-heuristics9. These approaches can be further classified as supervised10, semi-supervised11, or unsupervised12 based on the availability of training data labels. Despite their successes, supervised-wrapper configurations of these methods often face limitations in handling high-dimensional data. In this paper, we introduce the Bird's Eye View (BEV) model for feature selection that incorporates the strengths of supervised evolutionary algorithms in a wrapper configuration while addressing their limitations in high-dimensional data spaces.

The BEV model draws inspiration from various natural mechanisms to achieve a comprehensive perspective on feature selection (as illustrated in Fig. 1). Similar to how a bird surveys a vast terrain to search for food from a high altitude, the BEV technique scours high-dimensional datasets for valuable features. Furthermore, the BEV approach resembles the biological process of gene regulation, in which a cell selects which genes to activate from its genome to form a unique gene pattern that enables each cell type to perform its specific function. This integration of nature-inspired mechanisms allows the BEV model to have a more comprehensive view of feature selection.

Figure 1

Eagle at a high altitude seeking the optimum way.

Our method determines which features to retain for optimal performance and discards unnecessary features. This resembles a reward-based training approach, similar to teaching a dog the desired behavior through positive reinforcement with treats, play, and other incentives. Our model's agents evaluate the performance of various subsets of data and reward improved performance with increased probabilities. Conversely, reduced performance results in lower probabilities.

The proposed BEV model is a unique feature selection technique with the following significant contributions:

  1. The design of Markov chain and reinforcement learning paradigms within an evolutionary framework, enabling efficient communication between search agents and convergence to an optimal global solution.

  2. The evolution of agents based on the Markov chain, generating new agents with improved accuracy and associated probabilities.

  3. A new metric for evaluating classifiers, proposed as a fitness function.

  4. The guidance of agent movement through the search space by reinforcement learning, which rewards progress and penalizes regress via changes in the associated probabilities.

  5. An iterative process that produces improved agents and reduces computational complexity by restricting the number of agents involved in each iteration.

  6. A recursive approach that chooses a subset of features at each stage, removing unimportant features while keeping important ones.

Background and literature review

In recent years, various optimization techniques have been developed to tackle complex problems across fields such as computer science, engineering, finance, machine learning, and data science. This section reviews three of the most prominent: the Markov chain, evolutionary algorithms (specifically the genetic algorithm), and reinforcement learning. These methods have proven effective in addressing challenging optimization problems and have been widely used. Despite their importance, they have certain drawbacks, including constrained exploration, the need for parameter tuning, the inability to handle multiple objectives, and slow or premature convergence. It is therefore crucial to take these restrictions into account when applying them to challenging optimization problems. One can overcome these limitations by carefully characterizing the problem, selecting the best algorithm, fine-tuning the parameters, and using complementary strategies to offset the shortcomings of each approach. The following subsections provide a brief overview of each approach: its key concepts, applications, advantages, and usage in the proposed work.

Markov chain

Markov analysis is a technique for estimating the value of a variable that depends solely on its current state, without taking prior activity into account13. It calculates a random variable based on the present state of other variables using a probability matrix. This makes it a useful tool for evaluating state transitions in various fields such as surveillance14, machine learning15, and computer vision16. Its popularity is due to its ease of use and good prediction accuracy, often outperforming more complex models17. Although widely used, few studies have applied it to feature extraction18,19,20, where Markov chain features are extracted to capture dynamic changes in data and used by learning algorithms to make decisions. In our work, a new concept of feature selection based on the transition probabilities of the Markov chain is proposed as an alternative to feature extraction.
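To make the transition-probability idea concrete, here is a minimal sketch of ours that samples successive states of a four-state Markov chain from a uniformly initialized transition matrix, the same initialization our method later uses for feature pairs:

```python
import numpy as np

# Four states for a pair of binary features, as used later in this work.
STATES = ["00", "01", "10", "11"]

# Uniform initial transition matrix: from any state, each next state
# is equally likely (probability 0.25); rows sum to 1.
P = np.full((4, 4), 0.25)

def next_state(current: int, rng: np.random.Generator) -> int:
    """Sample the next state index from the current row of the matrix."""
    return int(rng.choice(4, p=P[current]))

rng = np.random.default_rng(seed=0)
state = 0  # start in state "00"
for _ in range(5):
    state = next_state(state, rng)
    print(STATES[state])
```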

Evolutionary algorithms

An Evolutionary Algorithm (EA) is a computational method that solves problems by mimicking the behavior of living organisms using nature-inspired mechanisms21. The use of EAs for feature selection has received significant attention in academia, with various algorithms being proposed, including Particle Swarm Optimization (PSO)22,23,24, Genetic Algorithm (GA)25,26, Artificial Bee Colony (ABC)27, Genetic Programming (GP)28, Gravitational Search Algorithm (GSA)29 and Ant Colony Optimization (ACO)30,31. One advantage of EAs is their population-based search approach, which involves a team of entities exploring the fitness landscape to find the globally optimal solution. This allows for more effective and efficient exploration of vast and challenging search areas. The sharing of information among team members also enables the discovery of potential regions of the search space and the narrowing of the search to critical areas. Additionally, these methods balance exploration and exploitation, allowing for faster convergence while avoiding local optima. These unique characteristics make EAs a promising approach for designing neural networks32.

Genetic algorithms are the type of evolutionary algorithm used in this work. A genetic algorithm is an optimization technique that uses a process inspired by natural evolution to find the best solution to a problem. The algorithm works by iteratively searching through a space of potential solutions, selecting and breeding the most promising candidates based on a set of rules inspired by genetics, and introducing random mutations to create new solutions. This process is repeated until either a satisfactory solution is found or a specified number of iterations has passed. Genetic algorithms are commonly used in machine learning and data analysis to find optimal model parameters33,34,35 or identify patterns in data36,37. The same approach is applied to feature selection in the proposed work: initially, a set of possible feature combinations is generated randomly, represented as pairs. These pairs are then evaluated using a fitness function that assigns a score based on their accuracy, and the highest-scoring pairs are selected for reproduction, mimicking natural selection. The process repeats until a satisfactory solution is found or a specified number of iterations has been reached.
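As a rough illustration of this loop (a sketch of ours, not the exact BEV procedure detailed in the Methods section), the following code evolves binary feature masks through selection, uniform crossover, and mutation; the population size, mutation rate, and placeholder fitness are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
d, pop_size = 20, 10  # number of features and population size (illustrative)

def fitness(mask: np.ndarray) -> float:
    # Placeholder: in practice, train and evaluate a classifier on the
    # features where mask == 1 and return its score.
    return mask[:5].sum() / 5.0 - 0.01 * mask.sum() / d

population = rng.integers(0, 2, size=(pop_size, d))
for generation in range(50):
    scores = np.array([fitness(ind) for ind in population])
    # Selection: the better half become parents.
    parents = population[np.argsort(scores)[-pop_size // 2:]]
    # Breeding: uniform crossover between randomly chosen parent pairs.
    pairs = rng.integers(0, len(parents), size=(pop_size, 2))
    cross = rng.integers(0, 2, size=(pop_size, d)).astype(bool)
    children = np.where(cross, parents[pairs[:, 0]], parents[pairs[:, 1]])
    # Mutation: flip each bit with a small probability.
    flips = rng.random((pop_size, d)) < 0.02
    population = np.where(flips, 1 - children, children)

best = max(population, key=fitness)
print("selected features:", np.flatnonzero(best))
```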

Reinforcement learning

Reinforcement learning38,39 is a method of learning through interaction with the environment, guided by the rewards received for actions taken. It aims to find the best long-term solution by balancing exploration and exploitation. This type of learning has great potential for effective feature selection in the feature subspace. Feature selection can be performed through single-agent40,41 or multi-agent42 decision processes. In a single-agent process, one agent alone decides on the selection or deselection of features, resulting in a large action space and the risk of getting stuck in a local optimum. In a multi-agent process, by contrast, multiple agents are involved in feature selection, which makes exploration and convergence of the search easier. This approach also resembles natural systems, as there are similarities between reinforcement learning and biological systems43.

A fitness function for better evaluation of classifiers

Classifier evaluation metrics44,45 are used to determine the effectiveness of a classification model by comparing the predicted outcomes to the actual outcomes. Some commonly used metrics for evaluating classifiers include:

  • Accuracy: the percentage of correct predictions made by the model out of all predictions, defined as \(\left( {TP + TN} \right)/\left( {TP + TN + FP + FN} \right)\), where TP (True Positives) is the number of positive instances correctly predicted, TN (True Negatives) the number of negative instances correctly predicted, FP (False Positives) the number of negative instances incorrectly predicted as positive, and FN (False Negatives) the number of positive instances incorrectly predicted as negative.

  • Precision: the ratio of true positive predictions to the sum of true positive and false positive predictions, defined as \(TP/\left( {TP + FP} \right)\). Precision measures the ability of the classifier to avoid false positive predictions.

  • Recall (Sensitivity or True Positive Rate): the ratio of true positive predictions to the sum of true positive and false negative predictions, defined as \(TP/\left( {TP + FN} \right)\). Recall measures the ability of the classifier to detect positive instances.

  • F1-Score: the harmonic mean of precision and recall, defined as \(\left( {2 \cdot {\text{Precision}} \cdot {\text{Recall}}} \right)/\left( {{\text{Precision}} + {\text{Recall}}} \right)\). It balances precision and recall when the two are in conflict.

  • AUC-ROC curve: the receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at different classification thresholds; the area under the ROC curve (AUC) summarizes the performance of the classifier.

  • Confusion matrix: a table used to evaluate the performance of a classification algorithm by comparing the predicted classes to the actual classes.

  • Log Loss (Cross-Entropy Loss): measures the performance of a probabilistic classifier by penalizing predicted class probabilities according to how far they diverge from the actual labels.

The choice of evaluation metric will depend on the problem and the goals of the classifier. For example, precision may be important when false positive predictions are costly, while recall may be important when false negative predictions are costly. Note that in multiclass classification, precision, recall, and F1-Score can be calculated for each class and then averaged using macro-average or micro-average methods. The confusion matrix is a table that has C rows and C columns, where C is the number of classes. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. For example, consider a multiclass classification problem with C = 3 classes. The confusion matrix would be a 3 × 3 table, as shown below in Table 1.

Table 1 Confusion matrix.

Where \(TP_{i}\) represents the number of instances of class i that are correctly predicted as class i, and \(FP_{ij}\) represents the number of instances of class j that are incorrectly predicted as class i.

From the values in the confusion matrix, various evaluation metrics such as accuracy, precision, recall, and F1-Score for each class, as well as macro-average and micro-average across all classes, can be calculated. The choice of evaluation metric will depend on the problem and the goals of the classifier.

In this study, a new metric is proposed to better monitor the performance of classifiers. It measures the accuracy of each class and scores a classifier by its worst-performing class, which makes it well suited to feature selection. This metric can therefore be used as the fitness function in our search algorithm:

$$ \mathop {\min }\limits_{i} \left( {\frac{{TP_{i} }}{{TP_{i} + \mathop \sum \nolimits_{j \ne i} FP_{ij} }}} \right) $$
(1)
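As an illustration, the metric of Eq. (1) can be computed directly from a confusion matrix, assuming the convention of Table 1 that rows hold predicted classes and columns hold actual classes; the following Python sketch is ours, and the example counts are made up:

```python
import numpy as np

def min_class_precision(cm: np.ndarray) -> float:
    """Fitness of Eq. (1): min over classes of TP_i / (TP_i + sum_{j!=i} FP_ij).

    Rows of `cm` are predicted classes and columns are actual classes,
    so TP_i = cm[i, i] and the FP_ij terms fill the rest of row i.
    """
    tp = np.diag(cm)
    row_sums = cm.sum(axis=1)  # TP_i + sum of FP_ij over j != i
    return float(np.min(tp / row_sums))

# Illustrative 3-class confusion matrix (made-up counts).
cm = np.array([[50,  2,  3],
               [ 4, 45,  1],
               [ 6,  3, 40]])
print(min_class_precision(cm))  # the worst-scoring class drives the fitness
```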

Methods

The goal of feature selection is to identify and select the smallest possible subset of relevant features from a larger set of features, to improve the accuracy, interpretability, and computational efficiency of the model. The idea is to remove redundant, irrelevant, and noisy features that may negatively impact the model's performance. The selection of a smaller set of relevant features not only aids in mitigating overfitting but also enhances the interpretability and comprehensibility of the model for human experts. A new tree search algorithm is developed in this paper to better explore the search space representing all the possible subsets. Our algorithm starts from the root node and expands it to generate child nodes until a goal node is found.

The search algorithm begins with a randomly selected subset of features represented by a sequence of 1s and 0s, where 1s indicate selected features and 0s indicate unselected features; i.e., each leaf belongs to \(\left\{ {0,1} \right\}^{d}\), where the integer \(d\) is the total number of features.

The root leaf generates \( {\mathcal{A}}\) new subsets, known as children, by randomly altering the states of each pair of features. The children are formed using the transition probability of the Markov chain of each feature pair; the transition matrices reflect the likelihood of transitioning between the distinct states {00, 01, 10, 11}, with all transition probabilities initialized to 0.25.

Through the expansion, the transition matrices are updated based on a reward function reflecting the performance of the generated children. Each new leaf therefore inherits the transition matrices of each pair of features from its parent and updates them according to the concept of reward described later in this section.

Updating these transition matrices in the right manner will favor certain extensions of the proposed tree to better explore the search space. After each cycle or iteration, only the highest-performing leaves are kept for further expansion.

The following definitions are crucial for a thorough explanation of the approach:

  • States or leaves are defined in \(\left\{ {0,1} \right\}^{d}\), where the integer \(d\) is the total number of features.

  • \({\mathcal{A}}\): number of children generated by each leaf; each offspring represents a subset of selected features.

  • \( {\mathcal{M}}_{{\mathcal{A}}} \): number of top-performing leaves that are selected for further expansion at each iteration.

  • t: number of iterations.

  • s: number of stages.

  • \({\mathcal{F}}_{j}^{t,s}\): represents the status of the jth leaf (i.e., state) at time t and stage s, \({\mathcal{F}}_{j}^{t,s} \in \left\{ {0,1} \right\}^{d}\), j = \(1, \ldots , {\mathcal{M}}_{{\mathcal{A}}} \), which specifies whether each feature has been selected or not. Positions holding a value of 1 mark the features that have been chosen, and positions holding a value of 0 mark the features that have been eliminated.

  • \(f_{i,j}^{t,s} :\) represents the value of the ith feature in the jth leaf at time \(t\) and stage \(s\), \(f_{i,j}^{t,s} \in \left\{ {0,1} \right\}\), \(i = 1,2, \ldots ,d\) and j = \(1, \ldots , {\mathcal{M}}_{{\mathcal{A}}} \) .

  • \(C_{i,j}^{t,s}\): represents the state of the ith feature pair, \(C_{i,j}^{t,s} = \{ f_{2i - 1,j}^{t,s} , f_{2i,j}^{t,s} \}\), at time \(t\) and stage \(s\) of the jth leaf.

  • \(P_{i,j}^{t - 1,s} ({\mathcal{C}}_{i,j}^{t,s} |{\mathcal{C}}_{i,j}^{t - 1,s} )\): transition probability from the pair \({\mathcal{C}}_{i,j}^{t - 1,s}\) to the pair \({\mathcal{C}}_{i,j}^{t,s}\); it represents the actions of the evolutionary algorithm.

  • \(d\): dimension of data or number of features \(f_{i,j}^{t,s} , i = 1,2, \ldots ,d\);

  • n: number of observations of data.

  • \(\varepsilon\): reward function.

Genetic algorithm

The BEV algorithm utilizes a smart branching evolution approach based on dynamic Markov chains. At each new expansion, a fixed number of leaves (\({\mathcal{M}}_{{\mathcal{A}}} \)) is chosen. Each leaf is represented by a sequence of 1s and 0s, organized in pairs within the sequence. The process begins with a root leaf that generates \( {\mathcal{A}}\) children leaves, where \( {\mathcal{A}}\) is less than \( {\mathcal{M}}_{{\mathcal{A}}} \). Since the number of generated leaves does not exceed \( {\mathcal{M}}_{{\mathcal{A}}} \), all of them are selected. During the next expansion, each leaf (or child) generates \( {\mathcal{A}}\) leaves, resulting in \({\mathcal{A}} \cdot {\mathcal{A}}\) children and \( {\mathcal{A}}\) parent leaves. These children and parent leaves are evaluated, and the best \( {\mathcal{M}}_{{\mathcal{A}}} \) leaves are chosen for expansion.

In the subsequent step, each of the selected \({\mathcal{M}}_{{\mathcal{A}}}\) leaves generates \( {\mathcal{A}}\) children, resulting in \({\mathcal{A}} \cdot {\mathcal{M}}_{{\mathcal{A}}}\) children and \( {\mathcal{M}}_{{\mathcal{A}}} \) parent leaves. Again, these leaves are assessed, and only the best \( {\mathcal{M}}_{{\mathcal{A}}} \) leaves are selected for the next expansion. This process continues until there is no further improvement in the quality of the solution.

Figure 2 illustrates the process of the BEV method, which involves expanding the children and selecting the most effective subset of features with \( {\mathcal{A}}\) set to 3 and \( {\mathcal{M}}_{{\mathcal{A}}} \) set to 9. Starting from the root leaf, three leaves are generated, and all of them are selected since they do not exceed \({\mathcal{M}}_{{\mathcal{A}}}\). The next expansion results in 9 children and 3 parent leaves, and the 9 best leaves are chosen based on their performance (step 1). From the selected 9 leaves, a total of 27 leaves (children) are generated, leading to a combined set of 36 leaves (including parents and children). Similarly, in the next expansion, the 9 best leaves among the 36 are chosen (step 2), and this process continues iteratively.
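Schematically, this expand-evaluate-select cycle can be sketched as the following loop (a simplification of ours, not the full BEV procedure); `expand` stands in for the Markov-chain child generation and `fitness` for the classifier evaluation described later in this section:

```python
def bev_search(root, expand, fitness, M_A, max_iters=100):
    """Beam-style expansion: keep only the best M_A leaves, as in Fig. 2."""
    frontier = [root]
    best_score = fitness(root)
    for _ in range(max_iters):
        children = [child for leaf in frontier for child in expand(leaf)]
        # Parent leaves compete with their children for survival.
        candidates = frontier + children
        candidates.sort(key=fitness, reverse=True)
        frontier = candidates[:M_A]
        top_score = fitness(frontier[0])
        if top_score <= best_score:  # no further improvement: stop expanding
            break
        best_score = top_score
    return frontier[0]
```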

Figure 2

Process implementation in recursive levels. The process explains how the search space in an upcoming stage is reduced by considering only the best-performing features from the previous stage. We select or omit the specified features by assigning a 1 or 0 to each feature position.

Each leaf is represented by a sequence of 1s and 0s, where the features are grouped in pairs, as shown in Fig. 3. Every pair of features in each leaf has its own transition matrix that determines the expansion process for that pair. Two scenarios must be taken into account when features are grouped two by two: Fig. 4a,b demonstrate these two scenarios, depending on whether the dimension d is even or odd.

Figure 3

The features are gathered two by two in the leaf \({\mathbf{\mathcal{F}}}_{j}^{t,s}\).

Figure 4

Dividing features into pairs.

Markov decision process (MDP) and reinforcement learning

In order to determine the optimal subset of features that effectively differentiates between classes, the BEV algorithm uses a smart approach to update the transition probabilities during the transition from one state to another. This updating process is based on a reward and penalty mechanism. When the fitness function shows improvement, a reward value is added to the transition probability associated with the corresponding direction, while one third of the reward value is deducted from the transition probabilities of the other directions. Conversely, if the fitness function does not improve, a penalty value is subtracted from the transition probability of the relevant direction, while one third of the penalty value is added to the transition probabilities of the other directions.

As each Markov chain has four states {00, 01, 10, 11}, each pair of features at each leaf of \({\mathcal{F}}_{j}^{t,s}\) has four separate probability mass functions that govern the expansion process. Each child leaf will inherit these probability mass functions, or transition matrices, from the parent leaf and update them based on the fitness function as shown in Figs. 5 and 6.

Figure 5

Process of expanding the tree when \({\mathcal{A}} = 3\) and \({\mathcal{M}}_{{\mathcal{A}}} = 9\).

Figure 6

Dynamic Markov Chain for pairs.

The fitness function, denoted by \(f\), can be interpreted as the classification accuracy at the state \({\mathcal{F}}_{j}^{t,s} ,\)

$$ f:\left\{ {0,1} \right\}^{d} \to \left[ {0,1} \right] $$
(2)

The accuracy is calculated based solely on the features whose positions hold a value of 1. The fitness function \(f\) can be chosen as the minimum per-class accuracy:

$$ \mathop {\min }\limits_{1 \le i \le K} \left( {\frac{{TP_{i} }}{{TP_{i} + \mathop \sum \nolimits_{j \ne i} FP_{ij} }}} \right) $$
(3)

where \(TP_{i}\) represents the number of instances of class i that are correctly predicted as class i, and \(FP_{ij}\) represents the number of instances of class j that are incorrectly predicted as class i. The value \(K\) represents the total number of classes.

In the case where \({\mathcal{A}} =\) 3 and \({\mathcal{M}}_{{\mathcal{A}}} =\) 9, Fig. 5 illustrates the early stages of expansion, where three leaves, denoted \({\mathcal{F}}_{j}^{t = 1,s = 0}\) with \(j = 1, 2, 3\), emerge from the root leaf. Another 9 leaves, denoted \({\mathcal{F}}_{j}^{t = 2,s = 0}\) for \(j = 1\) to 9, are generated from the 3 leaves \({\mathcal{F}}_{j}^{t = 1,s = 0}\). From these 12 leaves, only 9 are selected for continued expansion through the application of the fitness function, \(f\left( {{\mathcal{F}}_{j}^{t = 1,s = 0} } \right) \) for \(j = 1, 2, 3\) and \(f\left( {{\mathcal{F}}_{j}^{t = 2,s = 0} } \right) \) for j = 1 to 9, which determines the most suitable leaves for growth.

The growth of each leaf is achieved through the transitions of each pair of features, represented by \(C_{i,j}^{t,s}\). The progression is guided by the transition probabilities, which are visualized in Fig. 6 through the presentation of four probability mass functions.

The transition probability of the ith pair at time t and stage s can be described as follows:

$$ P_{i,j}^{t - 1,s} ({\mathcal{C}}_{i,j}^{t,s} |{\mathcal{C}}_{i,j}^{t - 1,s} ) = \left\{ {\begin{array}{*{20}l} {{\mathcal{P}}_{0,i,j}^{t - 1,s} } \hfill & {if\quad {\mathcal{C}}_{i,j}^{t,s} = \left\{ {0,0} \right\}} \hfill \\ {{\mathcal{P}}_{1,i,j}^{t - 1,s} } \hfill & {if\quad {\mathcal{C}}_{i,j}^{t,s} = \left\{ {0,1} \right\}} \hfill \\ {{\mathcal{P}}_{2,i,j}^{t - 1,s} } \hfill & {if\quad {\mathcal{C}}_{i,j}^{t,s} = \left\{ {1,0} \right\}} \hfill \\ {{\mathcal{P}}_{3,i,j}^{t - 1,s} } \hfill & {if\quad {\mathcal{C}}_{i,j}^{t,s} = \left\{ {1,1} \right\}} \hfill \\ \end{array} } \right. $$
(4)
$$ \mathop \sum \limits_{h = 0}^{3} {\mathcal{P}}_{h,i,j}^{t - 1,s} = 1 $$
(5)
$$ {\mathcal{C}}_{i,j}^{t - 1,s} \in \left\{ {\left\{ {0,0} \right\},\left\{ {0,1} \right\},\left\{ {1,0} \right\},\left\{ {1,1} \right\}} \right\} $$
(6)

Figure 7 illustrates an example of how the probabilities are updated according to the fitness function values; the transition probabilities are initially assumed to be uniformly distributed, i.e., \(P_{i,1}^{0,s} \left( {C_{i,0}^{t = 0,s = 0} } \right) = 0.25\). When the fitness function improves, a reward value (ε) is added to the transition probability associated with the corresponding direction, while ε/3 is deducted from the transition probabilities of the other directions. Conversely, if the fitness function fails to improve, a penalty is applied by subtracting ε from the transition probability of the relevant direction, while ε/3 is added to the transition probabilities of the other directions.

Figure 7

Probability updating mechanism based on the rewarding scheme: \({{\varvec{\upvarepsilon}}}\) is added to the appropriate direction as a reward and \({{\varvec{\upvarepsilon}}}\)/3 is subtracted from the other directions if the fitness function improved, and vice versa.

Figure 8 clarifies the process of our approach, where each leaf \({\mathcal{F}}_{j}^{t,s} \) of the \({\mathcal{M}}_{{\mathcal{A}}}\) leaves is expanded into \({\mathcal{A}}\) leaves, denoted as follows:

$$ {\mathcal{F} }_{{{\mathcal{A}}\left( {j - 1} \right) + 1}}^{t + 1,s} , {\mathcal{F} }_{{{\mathcal{A}}\left( {j - 1} \right) + 2}}^{t + 1,s} , \ldots , {\mathcal{F} }_{{{\mathcal{A}} \cdot j}}^{t + 1,s} $$
(7)
Figure 8

Expansion of the leaf \({\mathbf{\mathcal{F} }}_{{\mathbf{j}}}^{{{\mathbf{t}},{\mathbf{s}}}}\) into \({\mathcal{A}}\) different children leaves.

The best \({\mathcal{M}}_{{\mathcal{A}}}\) leaves selected according to the fitness function are given the new labels \({\mathcal{F} }_{j}^{t + 1,s}\) for \(j = 1\) to \({\mathcal{M}}_{{\mathcal{A}}}\).

At each stage s and iteration t, new leaves are identified by generating \( {\mathcal{A}} \) independent uniform random variables, denoted \(\alpha_{i,j,r}^{t,s}\), for each leaf j and each pair of features i. These variables are drawn from a uniform distribution between 0 and 1, with r = 1, …, \( {\mathcal{A}} \), as illustrated in Fig. 9.

Figure 9

Expansion of the leaf \({\mathbf{\mathcal{F} }}_{{\mathbf{j}}}^{{{\mathbf{t}},{\mathbf{s}}}}\) into \( {\mathcal{A}} \) = 3 different children leaves, where \({{\varvec{\upalpha}}} _{{{\mathbf{i}},{\mathbf{j}},{\mathbf{r}}}}^{{{\mathbf{t}},{\mathbf{s}}}}\) are generated from independent, identically distributed uniform distributions on [0,1] to define the new pairs in the children leaves according to their p.m.f.s.

The transition of a pair from \({\mathcal{C}}_{i,j}^{t,s}\) to \({\mathcal{C}}_{i,\left(j-1\right)\mathcal{A}+r}^{t+1,s}\) is controlled by the values of the random variable \(\alpha_{i,j,r}^{t,s}\), as indicated by Eq. (8).

$$ { \mathcal{C}}_{{i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t + 1,s} = \left\{ {\begin{array}{*{20}l} {\left\{ {0,0} \right\}} \hfill & { if\quad \alpha_{i,j,r}^{t,s} < {\mathcal{P}}_{{0,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} } \hfill \\ {\left\{ {0,1} \right\}} \hfill & {if\quad {\mathcal{P}}_{{0,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} \le \alpha_{i,j,r}^{t,s} < {\mathcal{P}}_{{0,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} + {\mathcal{P}}_{{1,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} } \hfill \\ {\left\{ {1,0} \right\}} \hfill & {if\quad {\mathcal{P}}_{{0,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} + {\mathcal{P}}_{{1,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} \le \alpha_{i,j,r}^{t,s} < {\mathcal{P}}_{{0,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} + {\mathcal{P}}_{{1,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} + {\mathcal{P}}_{{2,i,\left( {j - 1} \right){\mathcal{A}} + r}}^{t,s} } \hfill \\ {\left\{ {1,1} \right\}} \hfill & {Elsewhere} \hfill \\ \end{array} } \right. $$
(8)

where \(r = 1,2, \ldots ,{ } {\mathcal{A}}.\)
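In code, Eq. (8) amounts to inverse-CDF sampling of a four-way categorical distribution. A minimal sketch of ours for a single feature pair, with `p` holding its four transition probabilities:

```python
import numpy as np

PAIR_STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def sample_pair(p: np.ndarray, rng: np.random.Generator) -> tuple:
    """Draw the next pair state per Eq. (8): compare a uniform draw
    against the cumulative transition probabilities."""
    alpha = rng.random()               # alpha ~ Uniform[0, 1]
    cumulative = np.cumsum(p)          # thresholds P0, P0+P1, P0+P1+P2, 1
    index = int(np.searchsorted(cumulative, alpha, side="right"))
    return PAIR_STATES[min(index, 3)]  # guard against rounding at the top

rng = np.random.default_rng(seed=2)
p = np.array([0.25, 0.25, 0.25, 0.25])  # uniform initial probabilities
print(sample_pair(p, rng))
```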

At every expansion, after inheriting the transition matrices from the parent leaf, one of the four probability mass functions of each pair of features must be updated for each of the \({\mathcal{A}} \cdot {\mathcal{M}}_{{\mathcal{A}}}\) leaves generated from the \({\mathcal{M}}_{{\mathcal{A}}}\) parents. This process is illustrated in Figs. 7, 8 and 10.

Figure 10

Transitions between pairs for the survival leaf and newly generated leaf.

A probability mass function (p.m.f) is a function that describes the probability distribution of a discrete random variable. The following properties of a p.m.f must be preserved during the updating process:

  • Non-negativity The p.m.f must be non-negative, meaning that it can take a value of 0, but it cannot be negative.

  • Non-exceeding 1 The p.m.f must not exceed 1, meaning that it can take a value of 1, but it cannot be greater.

  • Normalization The sum of the p.m.f over all possible outcomes of the discrete random variable must equal 1, meaning that the probabilities of all outcomes add up to 100%.

Therefore, the updating of the transition probabilities can be executed according to the following equation when, for instance, the transition was performed from \({\mathcal{C}}_{i,j}^{t,s} = \left\{ {1,0} \right\}\) to \({\mathcal{C}}_{i,l}^{t + 1,s} = \left\{ {1,1} \right\}\).

$$ {\text{P}}_{i,l}^{t + 1,s} \left( {x|{\mathcal{C}}_{i,j}^{t,s} = \left\{ {1,0} \right\}} \right) = \left\{ {\begin{array}{*{20}l} {{\mathcal{P}}_{0,i,l}^{t + 1,s} = b{\text{ max}}({\mathcal{P}}_{0,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0)} \hfill & {if\quad x = \left\{ {0,0} \right\}} \hfill \\ {{\mathcal{P}}_{1,i,l}^{t + 1,s} = b{\text{ max}}({\mathcal{P}}_{1,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0)} \hfill & {if\quad x = \left\{ {0,1} \right\}} \hfill \\ {{\mathcal{P}}_{2,i,l}^{t + 1,s} = b{\text{ max}}({\mathcal{P}}_{2,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0) } \hfill & {if \quad x = \left\{ {1,0} \right\}} \hfill \\ {{\mathcal{P}}_{3,i,l}^{t + 1,s} = b{\text{ min}}({\mathcal{P}}_{3,i,l}^{t,s} + \varepsilon \gamma ,1)} \hfill & { if \quad x = \left\{ {1,1} \right\}} \hfill \\ \end{array} } \right. $$
(9)

where \(\varepsilon\) is the value given by the reward function, \(\gamma \in \left\{ { + 1, - 1} \right\},\) and

$$ b = \frac{1}{{{\text{max}}({\mathcal{P}}_{0,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0) + {\text{ max}}({\mathcal{P}}_{1,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0) + {\text{ max}}({\mathcal{P}}_{2,i,l}^{t,s} - \frac{\varepsilon }{3}\gamma ,0) + {\text{min}}({\mathcal{P}}_{3,i,l}^{t,s} + \varepsilon \gamma ,1)}} $$
(10)

The other three probability mass functions \({\text{P}}_{i,l}^{t + 1,s} \left( {x|\left\{ {0,0} \right\}} \right),{\text{P}}_{i,l}^{t + 1,s} \left( {x|\left\{ {0,1} \right\}} \right),{\text{P}}_{i,l}^{t + 1,s} \left( {x|\left\{ {1,1} \right\}} \right)\) are kept the same.

The reward may be positive or negative depending on the evolution of the fitness function values from the parent leaf \({\mathcal{F}}_{r}^{t,s}\) to the child leaf \({\mathcal{F}}_{m}^{t + 1,s}\), and it is captured by the variable \(\gamma\) as follows:

$$ \gamma = \left\{ {\begin{array}{*{20}l} { + 1 } \hfill & { if\quad f\left( {{\mathcal{F}}_{m}^{t + 1,s} } \right) > f\left( {{\mathcal{F}}_{r}^{t,s} } \right)} \hfill \\ { - 1 } \hfill & {Elsewhere } \hfill \\ \end{array} } \right. $$
(11)

The reward function \(\varepsilon\) should take small values that depend on the progress of the fitness function; different functions can be proposed, such as:

$$ \varepsilon \left( {fitness\left( {{\mathcal{F}}_{r}^{t,s} } \right) - fitness\left( {{\mathcal{F}}_{m}^{t + 1,s} } \right)} \right) =\upeta {*}\tanh \left| {fitness\left( {{\mathcal{F}}_{r}^{t,s} } \right) - fitness\left( {{\mathcal{F}}_{m}^{t + 1,s} } \right)} \right| $$
(12)

Or

$$ \varepsilon \left( {fitness\left( {{\mathcal{F}}_{r}^{t,s} } \right) - fitness\left( {{\mathcal{F}}_{m}^{t + 1,s} } \right)} \right) = \frac{\upeta }{{\sqrt {1 - \left| {fitness\left( {{\mathcal{F}}_{r}^{t,s} } \right) - fitness\left( {{\mathcal{F}}_{m}^{t + 1,s} } \right)} \right|} + \tau }} $$
(13)

where \(\upeta \) and τ are two parameters that can take any small values; refer to Fig. 11.
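Putting Eqs. (9)-(13) together, one update of the relevant probability mass function can be sketched as follows; the function names are ours, η defaults to an illustrative value, and an extra clip to [0, 1] guards the penalty case:

```python
import numpy as np

def reward(delta: float, eta: float = 0.2) -> float:
    """Reward magnitude of Eq. (12): eta * tanh(|fitness difference|)."""
    return eta * np.tanh(abs(delta))

def update_pmf(p: np.ndarray, taken: int,
               parent_fit: float, child_fit: float) -> np.ndarray:
    """One update of a pair's p.m.f following Eq. (9)."""
    gamma = 1.0 if child_fit > parent_fit else -1.0   # Eq. (11)
    eps = reward(child_fit - parent_fit)              # Eq. (12)
    q = p.astype(float).copy()
    q[taken] += eps * gamma                # reward or penalize the direction taken
    others = [h for h in range(4) if h != taken]
    q[others] -= (eps / 3.0) * gamma       # opposite adjustment, spread over the rest
    q = np.clip(q, 0.0, 1.0)               # preserve the p.m.f properties listed above
    return q / q.sum()                     # renormalize: the role of b in Eq. (10)

p = np.full(4, 0.25)                                         # uniform initial p.m.f
p = update_pmf(p, taken=3, parent_fit=0.80, child_fit=0.85)  # an improvement
print(p)                                                     # direction {1,1} is now favored
```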

Figure 11

The reward function (\({{\varvec{\upvarepsilon}}}\)) plotted against the difference of fitness function of Eq. (13) when τ = 0.01, \(\upeta \) = 0.2.

The process proceeds through stages until the accuracy can no longer be improved or further dimension reduction is not possible. The next stage (s + 1) takes the best features selected in the previous stage (s) as its root. Progression to a new stage occurs as long as performance keeps improving, as shown in Fig. 12.

Figure 12

Progress through different stages.

As shown in Fig. 13, most transition probabilities eventually converge to either 1 or 0, referred to as the equilibrium distribution, after a number of iterations determined by the reward value \(\varepsilon \). At that point, the transition probabilities of the best leaf of the current stage are reset to 0.25, that leaf is taken as the root of the next stage, and the branching process is repeated to see whether higher accuracy can be achieved with fewer features.

Figure 13

The evolution of the transition probability of the best pair of features, used to determine when the equilibrium distribution is attained.

The overall structure of each stage of the BEV approach is summarized in Fig. 14.

Figure 14

BEV feature selection process summary.

Results and discussion

This section evaluates the proposed strategy through experiments on a range of datasets commonly used for testing and comparison. These datasets serve as benchmarks for comparing performance with state-of-the-art methods and showcasing the robustness of our technique. A thorough analysis of the results, in terms of accuracy and the size of the selected feature subsets, provides insights into the strengths and weaknesses of our approach.

Datasets

The evaluation of the suggested method was conducted using 10 real-world high-dimensional datasets, which were used to test its performance on feature selection and classification tasks. These are gene expression datasets with high dimensionality, meaning there are more features than observations; they are additionally challenging because of the imbalanced distribution of observations across classes. Table 2 provides the number of observations, the number of features, and other relevant details for these datasets.

Table 2 Details of datasets.

Experimental settings

The proposed strategy is evaluated using tenfold cross-validation. To account for the limited number of samples in the datasets, the cross-validation technique is used to create the training and test sets (no validation set is used). One fold is reserved as the test set and not used in the feature selection process, while the remaining nine folds form the training data. The selected features are then used to update the training and test sets, which are fed into the KNN algorithm to evaluate their performance. To ensure a fair and comprehensive assessment, each dataset is subjected to ten independent tenfold cross-validation tests with different random seeds, resulting in 100 total runs per dataset. This protocol aligns with previous research and allows direct comparison with the state of the art31,46.
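For reference, this protocol can be sketched as follows with scikit-learn; `select_features` stands in for a BEV search run on the nine training folds only, and the number of neighbors is a placeholder:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

def evaluate(X, y, select_features, n_repeats=10, k=5):
    """Ten independent tenfold cross-validations (100 runs per dataset).
    Feature selection sees only the nine training folds."""
    scores = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        for train_idx, test_idx in cv.split(X, y):
            mask = select_features(X[train_idx], y[train_idx])  # BEV runs here
            clf = KNeighborsClassifier(n_neighbors=k)
            clf.fit(X[train_idx][:, mask], y[train_idx])
            pred = clf.predict(X[test_idx][:, mask])
            scores.append(balanced_accuracy_score(y[test_idx], pred))
    return float(np.mean(scores))
```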

Baseline methods

To demonstrate its effectiveness, the proposed work is compared with several existing feature selection algorithms that cover various techniques such as ant colony optimization, variable-length particle swarm optimization, comprehensive learning PSO with adaptive learning probability, and correlation-based feature selection. The comparison includes evolutionary models (TSHFS-ACO (two-stage hybrid feature selection model based on ant colony optimization)31, IRRF-SACO (relevance-redundancy feature selection based on ant colony optimization)47), particle swarm optimization [Standard PSO, VL-PSO (Variable-Length Particle Swarm Optimization)46, CLPSO (Comprehensive Learning PSO) enhanced with adaptive learning probability48, and CSO (Competitive Swarm Optimizer)49], graph-based methods [TFSACO (text feature selection using ACO)50], and classical methods [LFS (linear forward selection), CFS (correlation-based feature selection)51, and FCBF (fast correlation-based feature selection)52].

Parameter settings

Table 3 presents the parameters utilized in the proposed approach. The parameters of the compared baseline methods are in line with those specified in prior studies31,46.

Table 3 Parameter settings.

Results and discussion

Table 4 demonstrates the performance of the proposed methodology on 10 high-dimensional real-world datasets. The comparison between the actual feature vector and the results of the proposed feature selection method is displayed for each dataset. The developed algorithm significantly improves classification accuracy and reduces the dimensionality of all datasets, as shown in Fig. 15. The graphical comparison highlights the improvement in the performance of the proposed feature selection results compared to the original feature vectors. Table 5 provides a detailed analysis of the performance of the proposed algorithm, including the best, worst, and mean results.

Table 4 Results on different datasets compared to the full feature set.
Figure 15

Performance comparison with the original feature vector of different datasets. (a) Performance in terms of dimensionality reduction. (b) Performance in terms of classification accuracy.

Table 5 Best, worst, and mean results on different datasets by the proposed algorithm.

The dataset size reduction process is applied iteratively until the accuracy and feature count remain consistent over three consecutive stages. During these initial stages, the dimensionality reduction is carried out without sacrificing accuracy. In the following three stages, the criterion of maintaining accuracy is relaxed, allowing further reduction in dimensions with possibly fluctuating accuracy. Figures 16 and 17 summarize the results of 10 separate runs on all datasets using these additional stages. It can be seen that the number of features decreases as the stages progress. Initially, accuracy increases consistently, but in the last three stages accuracy may decline as the feature count decreases. The results show that, while the balanced accuracy may vary across experiments on the same dataset in the early stages, it eventually converges to a similar level in the later stages.

Figure 16

Performance of the proposed algorithm on ten different datasets over ten independent runs. The graphs show the performance in terms of reducing the number of dimensions with recursive stages.

Figure 17

Accuracy of the proposed algorithm on ten different datasets over ten independent runs. The graphs show the improving classification accuracy with recursive stages.

Additionally, Fig. 18 demonstrates that as the feature count decreases, the balanced accuracy for all datasets improves, highlighting the critical role of feature selection in attaining optimal accuracy and its potential for reducing the actual feature vector size. It is noteworthy that there is a trade-off between the number of features and accuracy, as reducing the feature vector size too much can result in decreased accuracy in most cases.

Figure 18

The performance regarding the number of features versus Balanced accuracy. The representation demonstrates the effectiveness of reducing dimensionality over all ten datasets.

To showcase the versatility of our approach, we expanded our analysis by incorporating two additional classification models, Random Forest and Support Vector Machine (SVM), in addition to the KNN model. We conducted experiments on two datasets, ‘brain tumor 1’ and ‘brain tumor 2’, to assess the accuracy of the BEV and Autoencoder algorithms. We evaluated and compared the performance of these algorithms by averaging the results obtained from 10 experiments. These datasets were intentionally selected as they offer potential for improvement beyond what the BEV algorithm achieves in terms of accuracy. The corresponding comparison is presented in Table 6. Details of the Autoencoder parameters used for these evaluations can be found in Table 7.

Table 6 Accuracy comparison of BEV and Autoencoder algorithms on two datasets: ‘brain tumor 1’ and ‘brain tumor 2’ based on the average of 10 experiments.
Table 7 Autoencoder parameters.

The results clearly demonstrate that our proposed algorithm outperforms the Autoencoder when employing different classification models on the aforementioned datasets. Notably, the BEV algorithm achieves optimal performance by selecting only 7 and 5 features for ‘brain tumor 1’ and ‘brain tumor 2’, respectively, whereas the Autoencoder attains its best performance with 100 features on both datasets.

Moreover, our proposed model offers a distinct advantage by eliminating the need for a predefined number of desired feature selections, which is a requirement in the Autoencoder approach. In order to investigate the influence of desired feature selection on the Autoencoder’s performance, we conducted experiments utilizing various feature configurations on the ‘Brain Tumor 1’ and ‘Brain Tumor 2’ datasets.

To ensure a fair comparison, we specifically examined the performance of two feature sets: one with 7 features for the ‘brain tumor 1’ dataset and another with 5 features for the ‘brain tumor 2’ dataset. These feature sets represent the average number of features obtained by the BEV algorithm for each dataset. Additionally, we assessed the performance of the Autoencoder using two different desired feature settings: 50 and 100 features. The performance of the Autoencoder under these settings for the two datasets is presented in Table 8, based on the average results from 10 experiments. These analyses allow us to explore the impact of feature selection on the Autoencoder's performance.

Table 8 Autoencoder performance on different desired features on two datasets i.e., ‘brain tumor 1’ and ‘brain tumor 2’ based on AVG of 10 experiments.

After analyzing the results, we made several key observations. Firstly, the random forest classifier demonstrated the best performance when utilizing the autoencoder with 7 features. However, when employing 50 and 100 features, the KNN classifier outperformed other classification models. It is important to highlight that, despite the varying performance across different feature configurations, none of the results surpassed the accuracy and feature efficiency achieved by the BEV algorithm.

Furthermore, we emphasize that the BEV algorithm excels in extracting precise features, ensuring the preservation of the exact features present in the dataset. In contrast, the autoencoder learns compressed representations that may not directly align with the original features of the data. This distinction highlights the strength of the BEV algorithm in capturing relevant information from the dataset.

Comparison with existing literature

Table 9 presents the results of the proposed methodology against state-of-the-art approaches in terms of balanced classification accuracy. The proposed BEV method outperforms current state-of-the-art techniques, including the two best methods TSHFS-ACO and ERM-FS, in balanced classification accuracy. BEV achieved an average improvement of 9.21% and 4.23% over TSHFS-ACO and ERM-FS, respectively. The largest improvement was observed in the Brain Tumor 2 dataset, with 8.77% and 21.92% over ERM-FS and TSHFS-ACO. The second largest improvement was seen in Brain Tumor 1 dataset, with 5.74% and 17.58% improvement, respectively. The lowest improvement was 1.88% and 5% on Leukemia 2 dataset. The proposed method performed best in the largest dataset, Lung Cancer (with 12,600 dimensions), with 11.38% and 6.75% improvement over TSHFS-ACO and ERM-FS, respectively (see Table 9). Figure 19 highlights the superiority of our approach in comparison with the two best techniques TSHFS-ACO and ERM-FS in terms of accuracy.

Table 9 Comparison in terms of average balanced accuracy with existing studies in 100 feature selection runs (mean ± std).
Figure 19

Comparison of proposed BEV with best performing TSHFS-ACO and ERM-FS in terms of Mean Balanced Accuracy (%) on all the datasets. The datasets are ranked in numbers from the highest dimensions to the lowest dimensions.

Table 10 compares the average number of selected features for various techniques. Despite a higher mean balanced accuracy, the proposed BEV approach results in a lower average number of selected features on 8 out of 10 datasets. This highlights the efficiency of the proposed BEV in identifying the optimal features while reducing dimensions. The TFSACO performed better in reducing dimensions on 2 out of 10 datasets. Table 11 presents a comparison of the average balanced accuracy and the average number of selected features of classical studies. It clearly shows that the proposed BEV approach outperforms all other techniques in overall performance. In conclusion, these results demonstrate the superiority of the proposed BEV for high-dimensional feature selection.

Table 10 Comparison in terms of average number of features selected with existing studies in 100 feature selection runs.
Table 11 Comparison in terms of average balanced accuracy and number of selected features with classical studies in 100 feature selection runs.

To assess the performance of the BEV algorithm, we also conducted an evaluation with recall, an important metric in addition to accuracy. We compared the results of the BEV algorithm with the ERM-FS algorithm, which achieved the second-highest accuracy after our proposed algorithm, as shown in Table 11. Since our scenario involves multiple classes, 'macro' recall was used to ensure a comprehensive evaluation. The results for both algorithms can be found in Table 12.

Table 12 demonstrates that the BEV algorithm consistently surpasses the ERM-FS algorithm in terms of macro recall across the various datasets. This finding highlights the superior performance of the BEV algorithm in accurately capturing important information from the data. In fact, the BEV algorithm achieves a perfect macro recall score of 100% on the majority of the datasets, further emphasizing its effectiveness. However, it is important to mention that in the case of 11 Tumor, Brain Tumor 1, and 9 Tumor datasets, the BEV algorithm exhibits a comparatively lower macro recall of 86.3%, 80%, and 66.6% respectively, indicating an area with potential for improvement.

Table 12 Macro recall comparison of ERM-FS and BEV algorithms on multiple datasets.

Algorithm complexity

The BEV algorithm utilizes the KNN model as its classification model. During training, the time complexity of the KNN model is \(O(1)\), indicating that it does not depend on the size or dimensionality of the dataset. However, during prediction, the time complexity becomes \(O(k\cdot n\cdot d)\), where \(k\) represents the number of neighbors, \(n\) denotes the number of samples/points in the data, and \(d\) represents the dimensionality of the dataset. Note that the time required for distance calculations is typically insignificant compared to the other algorithmic steps. The performance of the BEV algorithm is primarily affected by the dimensionality of the dataset: as the dimensionality increases, the computational time also increases. Consequently, the overall time complexity of the BEV algorithm can be expressed as \(O({d}^{2}\cdot n)\), assuming the number of neighbors (\(k\)) remains constant. Table 13 provides the computational time needed for the different algorithms, which were executed on an Intel Core i7-4770 CPU @3.4 GHz.

According to Table 13, the proposed algorithm is positioned as the third fastest in terms of average computation time across all datasets. It is noteworthy that VLPSO exhibits the highest speed, followed by ERM-FS. However, it is important to emphasize that although VLPSO excels in computational efficiency, it does not rank among the top algorithms in terms of accuracy. Conversely, the proposed algorithm demonstrates slightly slower computation time compared to ERM-FS, but it achieves significantly better accuracy performance while utilizing a reduced number of features.

Table 13 Computational time comparison of various algorithms.

Conclusion

The proposed Bird's Eye View (BEV) feature selection approach offers a solution to the challenge of selecting features in high-dimensional datasets. It combines three different paradigms and employs a rewarding scheme and collective evolution with Markov impact to iteratively reduce the feature space. The BEV algorithm draws inspiration from the genetic algorithm mechanism and implements a smart branching evolution approach that relies on dynamic Markov chains. The algorithm begins by initializing a root leaf and proceeds to generate children leaves, where the number of generated leaves is determined by a predetermined fixed value. Each leaf is represented by a sequence of 1s and 0s, organized in pairs. The best leaves are selected for each expansion based on evaluation. This iterative process continues until no further improvement is observed. The BEV algorithm effectively distinguishes between different classes by utilizing a reward and penalty mechanism to update transition probabilities during state transitions. This mechanism is based on the improvement or lack thereof in the fitness function. As a result, the algorithm achieves a significantly reduced feature subset while preserving high classification performance.

The effectiveness of the proposed BEV approach in high-dimensional feature selection is demonstrated by its ability to generate a significantly reduced feature subset while maintaining a high fitness level. Through evaluation on 10 benchmark datasets, the BEV model outperforms current state-of-the-art methods. Furthermore, it offers advantages such as simplicity in development, ease of hyperparameter configuration, and fast execution.

However, it is important to note that our approach is a stochastic algorithm, which means it may provide suboptimal solutions rather than guaranteed optimal ones. Despite effectively exploring the search space, there is no guarantee that the selected feature subset will be the absolute best. Achieving satisfactory performance in the proposed approach depends heavily on fine-tuning various hyperparameters. One avenue for future research involves exploring the tuning of additional hyperparameters to enhance the algorithm's performance. Additionally, we plan to investigate the inclusion of sets of k features, as opposed to limiting the selection to pairs. This modification aims to assess whether expanding the feature selection scope can further improve the approach's performance.