Dimensionality reduction using singular vectors

Scientific Reports (2021) 11:3832 | https://doi.org/10.1038/s41598-021-83150-y

A common problem in machine learning and pattern recognition is identifying the most relevant features, particularly when dealing with high-dimensional datasets in bioinformatics. In this paper, we propose a new feature selection method, called Singular-Vectors Feature Selection (SVFS). Let D = [A | b] be a labeled dataset, where b is the class label and the features (attributes) are the columns of the matrix A. We show that the signature matrix S_A = I − A†A can be used to partition the columns of A into clusters so that columns in a cluster correlate only with the columns in the same cluster. In the first step, SVFS uses the signature matrix S_D of D to find the cluster that contains b.
We reduce the size of A by discarding the features in the other clusters as irrelevant. In the next step, SVFS uses the signature matrix S_A of the reduced A to partition the remaining features into clusters and chooses the most important features from each cluster. SVFS works perfectly on synthetic datasets, and comprehensive experiments on real-world benchmark and genomic datasets show that it exhibits overall superior performance compared to state-of-the-art feature selection methods in terms of accuracy, running time, and memory usage. A Python implementation of SVFS along with the datasets used in this paper is available at https://github.com/Majid1292/SVFS.

and more informative features. Using the signature matrix S_A, features that correlate with each other are clustered, and the most important features are then picked from each cluster. This process can be tuned via two thresholds, which makes our model capable of handling a wide range of high-dimensional data types. We view the data and the interactions between all features globally, in the sense that we measure the relevancy of the features to b all at once and then break down the original feature space into a collection of lower-dimensional subspaces. In contrast, many FS methods apply one or two discriminative criteria locally, at the individual feature level, to obtain the most important features; hence they may perform well on some types of datasets and poorly on others. For example, as we shall see in Section 4, Fisher score 21 and the Trace ratio criterion 22 perform well on the biological benchmark datasets while producing weak results on the image benchmark datasets.
We show in Section 3 that S_A is the same as the orthogonal projection P onto the null space of A; hence S (or P) can be constructed from the right singular vectors. We define a graph G whose nodes are the columns of A, with an edge between columns F_i and F_j if and only if S_{i,j} ≠ 0. As we shall explain, each connected component of G corresponds to a subset of columns of A that are linearly dependent. In other words, the correlations between the columns of A are encoded in the signature matrix S_A.
We view D as a matrix and form the signature matrix S_D = I − D†D. The cluster of D containing b consists of the features relevant to b; all features in the other clusters are considered irrelevant. After removing the irrelevant features, we update A and use the graph associated to S_A to find the clusters. There are many efficient algorithms for finding the clusters of a graph; we use Breadth-First Search (BFS) 18 to find the features that are directly or indirectly connected to each other. The novelty of our method is to use the signature matrix S_D of D to detect and remove irrelevant features, and then use the signature matrix S_A of the reduced matrix A to partition the columns of A into clusters so that columns within a cluster correlate only with columns within the same cluster. Finally, we rank the features in a cluster based on the entries on the main diagonal of S_A and select a small subset of top-ranked features with the highest Mutual Information (MI) with respect to b.
In order to evaluate the performance and efficiency of our method, we compare it with the state-of-the-art FS methods, namely Conditional Infomax Feature Extraction (CIFE) 19 , Joint Mutual Information (JMI) 20 , Fisher score 21 , Trace ratio criterion 22 , Least angle regression (LARS) 23 , Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso) 24 , Conditional Covariance Minimization (CCM) 25 , and Sparse Multinomial Naive Bayes (SMNB) 26 on a series of high dimensional benchmark as well as biological datasets.
The rest of this paper is structured as follows. An overview of the existing FS approaches is given in Section 2. Then, in Section 3, we give the theoretical background along with some examples on synthetic data to show how our method removes irrelevant features and finds correlations between the remaining features using the signature matrix S. Section 4 gives an account of the specifications of the datasets and reports our experimental results. Finally, we provide a summary in Section 5.

Related work
FS methods are categorized as filter, wrapper, and embedded methods 27. Filter methods rely on intrinsic properties of the features, measured via univariate statistics, while wrapper methods measure the importance of features based on classifier performance. Although optimizing classifier performance is the essential goal of FS, and wrapper methods have their own efficient internal classifiers, wrapper methods are computationally more expensive than filter methods because of their iterated learning steps and the cross-validation needed to avoid overfitting the model. Embedded methods are similar to wrapper methods; however, they mainly use an intrinsic model-building metric during the learning process.
Many FS algorithms work based on information-theoretical approaches which utilize various criteria to measure and rank the importance of features. The basic idea behind many information-theoretic methods is to maximize feature relevance and minimize feature redundancy 21 . Since feature correlation with class labels normally measures the relevance of the feature, most algorithms in this group are applied in a supervised manner. A brief introduction to basic information-theoretic concepts is given here.
Shannon entropy, the primary measurement in information-theoretical approaches, measures the uncertainty of a discrete random variable. The entropy of a discrete random variable X is defined as

H(X) = − Σ_i P(x_i) log P(x_i),

where x_i is a specific value of X and P(x_i) is the probability of x_i over all values of X.
The second concept is the conditional entropy of X given another discrete random variable Y, defined as

H(X | Y) = − Σ_j P(y_j) Σ_i P(x_i | y_j) log P(x_i | y_j),

where P(y_j) is the prior probability of y_j and P(x_i | y_j) is the conditional probability of x_i given y_j.
To measure the amount of information shared between X and Y, MI (information gain) is used, defined as

I(X; Y) = Σ_i Σ_j P(x_i, y_j) log [ P(x_i, y_j) / (P(x_i) P(y_j)) ],

where P(x_i, y_j) is the joint probability of x_i and y_j. MI is symmetric, i.e. I(X; Y) = I(Y; X), and if X and Y are independent, their MI is zero. Since we apply the MI concept in our proposed method, two representative algorithms of the information-theoretical family are selected for comparison: Conditional Infomax Feature Extraction (CIFE) 19 and Joint Mutual Information (JMI) 20. Several studies, including CIFE 19 and 28,29, are based on the idea that the conditional redundancy between unselected and selected features given the class labels should be maximized, rather than simply minimizing feature redundancy. Minimum Redundancy Maximum Relevance (MRMR) reduces feature redundancy during the feature selection process. In contrast, JMI 20,30 was introduced to increase the MI shared between selected and unselected features. There have been some improvements of JMI; see 31.
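These three definitions can be checked numerically. The following sketch (our own helper functions, not from the paper's implementation, assuming base-2 logarithms so that quantities are measured in bits) computes entropy and MI from explicit probability tables:

```python
import numpy as np

def entropy(px):
    """Shannon entropy H(X) = -sum_i P(x_i) log2 P(x_i)."""
    px = np.asarray(px, dtype=float)
    px = px[px > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(px * np.log2(px))

def mutual_information(pxy):
    """I(X;Y) = sum_ij P(x_i,y_j) log2 [ P(x_i,y_j) / (P(x_i) P(y_j)) ]."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)                 # marginal distribution of X
    py = pxy.sum(axis=0)                 # marginal distribution of Y
    mi = 0.0
    for i in range(pxy.shape[0]):
        for j in range(pxy.shape[1]):
            if pxy[i, j] > 0:
                mi += pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
    return mi

# Independent X, Y: the joint is the outer product of marginals, so I(X;Y) = 0.
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
# Fully dependent X = Y (fair coin): I(X;Y) = H(X) = 1 bit.
p_equal = np.array([[0.5, 0.0], [0.0, 0.5]])
```

For p_indep every log term is log(1) = 0, giving zero MI, while for p_equal the MI equals the full entropy of one bit, matching the symmetry and independence properties stated above.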
Another category of FS methods is the similarity-based approaches that measure the feature relevances by their ability to preserve data similarities. The two superior similarity-based methods, i.e. the Fisher score 21 and Trace Ratio criterion 22 are selected to provide a basis for comparison with our proposed method.
Fisher score is a supervised feature selection method that seeks features with high discriminative capacity: for sample points in different classes it aims to maximize the distances between samples, while minimizing the distances between sample points in the same class. The Trace Ratio criterion follows the same idea of separating the classes while keeping samples of the same class close together. It computes a trace ratio by building two affinity matrices, S_w and S_b, which encode within-class and between-class data similarity, respectively.
Some approaches use aggregated sample data to select and rank the features 23,24,32,33. The least absolute shrinkage and selection operator (LASSO) is an estimation method for linear models that performs two tasks: regularization and feature selection. For the first task, it constrains the sum of the absolute values of the model parameters to be less than a fixed upper bound; this regularization (shrinking) process penalizes the regression coefficients, shrinking some of them to exactly zero. For the second task, the features that retain a non-zero coefficient after the regularization process are chosen to be part of the model. The goal of this process is to lessen the prediction error.
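A minimal sketch of LASSO-based feature selection using scikit-learn (the synthetic data and the choice of alpha are ours): features whose coefficients survive the shrinkage are the selected ones.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
# Only features 0, 3, and 7 actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 3] + 1.5 * X[:, 7] + 0.01 * rng.normal(size=100)

# The L1 penalty (controlled by alpha) shrinks irrelevant coefficients to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # features with non-zero coefficients
```

With a sufficiently large alpha, the coefficients of the noise features are driven exactly to zero, so `selected` contains (essentially) the truly informative columns.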
Least angle regression (LARS), proposed by Efron et al. 23, builds on LASSO; it is a linear regression method that computes all LASSO 33 estimates and selects those features which are highly correlated to the already selected ones. Yamada et al. 24 proposed a non-linear FS method for high-dimensional datasets called the Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso). By solving a Lasso problem and using a set of kernel functions, HSIC-Lasso selects informative, non-redundant features. In another work 34, called Least Angle Nonlinear Distributed (LAND), the authors improved the computational efficiency of HSIC-Lasso. They showed through comprehensive experiments that LAND and HSIC-Lasso achieve comparable classification accuracies and dimension reduction; however, LAND has the advantage that it can be deployed on parallel distributed computing.
HSIC-Lasso and LAND are based on a convex optimization problem with an ℓ1-norm penalty on the regression coefficients to promote sparsity, but they incur a significantly high computational cost, especially on high-dimensional data. Very recently, Askari et al. 26 proposed a sparse version of naive Bayes, leading to a combinatorial maximum-likelihood problem that can be solved for binary data and that provides explicit bounds on the duality gap for multinomial data, at a fraction of the computing cost.
We also remark that FS is applied and used in various domains including gene selection, face recognition, handwriting identification, and remote sensing [35][36][37][38] .

Proposed approach
Let A be an m × n matrix of rank ρ and consider the singular value decomposition (SVD) of A, A = UΣV^T, where U (m × m) and V (n × n) are orthogonal matrices and Σ = diag(σ_1, ..., σ_ρ, 0, ..., 0) is an m × n diagonal matrix. We denote column j of V by v_j. Furthermore, we partition v_j as v_j = [v_{j,1}; v_{j,2}], where v_{j,1} consists of the first ρ entries of v_j and v_{j,2} of the remaining n − ρ entries. Note that A v_j = 0 for all ρ + 1 ≤ j ≤ n; moreover, ker(A) is spanned by v_{ρ+1}, ..., v_n. We denote by F_j the j-th column of A.
Let V_2 be the matrix whose columns are v_{ρ+1}, ..., v_n and let P = V_2 V_2^T be the orthogonal projection onto ker(A); that is, the range of P is N(A), P² = P, and P^T = P. We also let S = I − A†A. By Lemma 2.1 in 39, S and P are indeed the same. Nevertheless, the computational complexities of computing S and P might differ. To compute P we only need the right singular vectors of A, which can be obtained from the symmetric matrix A^T A. On the other hand, if A has full row rank, then A† = A^T(AA^T)^{−1}; in that case the complexity of computing S is the same as the complexity of matrix inversion.
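As a numerical sanity check of the identity S = P, the following NumPy sketch (ours, not the paper's implementation) builds a random rank-deficient matrix A, computes S = I − A†A via the pseudo-inverse, and reconstructs P from the right singular vectors that span ker(A):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, rank = 8, 12, 5
# A random m x n matrix of known rank (product of two thin factors).
A = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

# Signature matrix via the pseudo-inverse.
S = np.eye(n) - np.linalg.pinv(A) @ A

# Projection onto ker(A): rows rank..n-1 of Vt span the null space of A.
_, _, Vt = np.linalg.svd(A)
V2 = Vt[rank:].T                        # columns v_{rho+1}, ..., v_n
P = V2 @ V2.T

assert np.allclose(S, P)                # S and P coincide (Lemma 2.1 in 39)
assert np.allclose(A @ P, 0)            # range(P) is the null space of A
```

Both constructions agree up to floating-point round-off, illustrating that S can be obtained from either the pseudo-inverse or the right singular vectors.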
Let D = [A | b] be a dataset, say a binary cancer dataset, where the rows of A are samples (patients), the columns of A are features (gene expressions), and b is the class label whose entries are either 0 (noncancerous) or 1 (cancerous). In large datasets there are a large number of irrelevant features; for example, in gene expression datasets, many genes are not expressed. So, identifying and removing features that have negligible correlation with the class labels is crucial. The aim of FS is to come up with a minimal subset of features that can be used to predict the class labels as accurately as possible. There might also be redundancies (correlations) among the relevant features that must be detected and removed.
As we explain below, we use the matrix S (or P) to divide the set of all features into clusters such that features within a cluster correlate with each other while different clusters are linearly independent from each other. So, a set of linear dependencies defines the correlations within a cluster. Without loss of generality, we assume that {F_1, ..., F_t} is a cluster, that is, F_1, ..., F_t are linearly dependent and independent of the rest of the F_k, k ≥ t + 1. The following theorem from 39 is the first major step in identifying clusters.

Theorem 1 Suppose that {F_1, ..., F_t} is a cluster. Then P_{i,j} = 0 for every 1 ≤ j ≤ t and every i ≥ t + 1.

Example 1 Consider a 100 × 80 synthetic matrix A with the only relations between columns of A as follows:
The signature matrix S_A (rounded to two decimals) is: We note that A is randomly generated and the only constraint on A is the set of dependence relations given above. We can see that S has a block-diagonal form, where each block corresponds to a cluster. So features F_1, ..., F_4 constitute a cluster; similarly, {F_5, ..., F_11} is another cluster. Note that {F_i} is a singleton cluster for all i ≥ 12. We provide some details about these facts in the next lemma.

Lemma 1 Let A be the matrix in Example 1.
Proof We note that the rank of A is ρ = 73. Hence A v_k = 0 for every 74 ≤ k ≤ 80. Since A v_k = 0 yields a dependence relation between the columns of A, and F_1, ..., F_4 are independent from the rest of the columns, we deduce that [F_1 · · · F_4] v_{k,1} = 0, where v_{k,1} consists of the first 4 entries of v_k. We then form the matrix M = [v_{74,1} · · · v_{80,1}]. Since any linear combination of the columns of M provides a dependence relation between F_1, ..., F_4, we can use elementary (column) operations to transform M into a matrix C_1 with [F_1 · · · F_4] C_1 = 0; in other words, the columns of C_1 give us the minimal relations between F_1, ..., F_4. Substituting for F_1 and F_2 in terms of F_3 and F_4 using the matrix C_1, and noting that the resulting equations hold for every k in the range ρ + 1 ≤ k ≤ n, we can, for each j in the range 5 ≤ j ≤ n, take the dot product with v_{j,2}. Let C = [C_1 0; 0 0] be the n × n block matrix, let c_1, ..., c_n be the columns of C, and denote by p_j the j-th row of P.
In general, it follows from Theorem 1 that, after re-ordering the columns of A, the matrix S has a block-diagonal form where each block corresponds to a cluster. Of course, a priori, columns within the same cluster need not be next to each other in the matrix A. Furthermore, the converse of Theorem 1 is not true in general; in other words, P_{i,j} could be zero even when F_i and F_j are in the same cluster, as can be seen in Example 1, where P_{1,3} = P_{5,6} = 0.
To find the clusters, we define a graph G whose vertices are F_1, ..., F_n, with an edge between F_i and F_j if and only if P_{i,j} ≠ 0. The graph associated to the matrix A in Example 1 is depicted in Figure 1.
Even though there may not be an edge between two nodes of the same cluster, it turns out that there is always a path connecting every two nodes in the same cluster. This fact, which is Theorem 2.10 in 39, can be summarized as follows.

Theorem 2
The sub-graph of G consisting of nodes F 1 , . . . , F t and corresponding edges is connected.
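The clustering procedure described above (build the graph from the non-zero pattern of S or P, then take its connected components) can be sketched as follows; the function name and the tolerance tol for deciding whether an entry counts as non-zero are our own choices, not from the paper's code:

```python
import numpy as np
from collections import deque

def clusters_from_signature(S, tol=1e-10):
    """Group columns into clusters: edge i~j iff |S[i, j]| > tol, then take
    connected components of the resulting graph via breadth-first search."""
    n = S.shape[0]
    adj = np.abs(S) > tol               # adjacency from the non-zero pattern
    seen = [False] * n
    comps = []
    for start in range(n):
        if seen[start]:
            continue
        comp, queue = [], deque([start])
        seen[start] = True
        while queue:
            i = queue.popleft()
            comp.append(i)
            for j in np.flatnonzero(adj[i]):
                if not seen[j]:
                    seen[j] = True
                    queue.append(int(j))
        comps.append(sorted(comp))
    return comps
```

For a block-diagonal S such as [[1, .5, 0, 0], [.5, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], the function returns the clusters {F_1, F_2}, {F_3}, {F_4} (0-indexed as [[0, 1], [2], [3]]).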
As we mentioned, in real datasets there are many irrelevant features. To identify them, we construct the signature matrix S_D of D and identify the cluster that includes b. The remaining clusters consist of features that have negligible correlation with b, so we can remove all other clusters from A.

Example 2 Let
A be the synthetic matrix as in Example 1 and b = F_1 − 3F_3 + 2F_9 − F_14. The last row of the signature matrix S_D (rounded to four decimals) is: The cluster containing b consists of the features F_i such that S_{i,n+1} ≠ 0. So we identify the columns F_j, where j = 12, 13 or 15 ≤ j ≤ 80, as irrelevant features and remove them from A.
Alternatively, we can identify irrelevant features by looking at the least-squares solutions of the system Ax = b. Each component x_i of x can be considered as a weight assigned to the feature F_i of A. Hence, the bigger |x_i| is, the more salient F_i is in its correlation with b.

Example 3 Let
A be the synthetic matrix as in Example 1 and b = F_1 − 3F_3 + 2F_9 − F_14. We solve Ax = b using the least-squares method, where the vector x (rounded to two decimals) is: We have x = [x_1, ..., x_n], where each x_i is a weight assigned to F_i. Hence we can approximate b as a linear combination of the form x_1 F_1 + · · · + x_n F_n. Therefore, x_i = 0 implies that F_i has no impact on b and that F_i is irrelevant. According to the vector x, x_i = 0 for i = 12, 13 and 15 ≤ i ≤ n, and we remove the corresponding F_i from A.
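A minimal analogue of Example 3 with our own randomly generated data (not the paper's 100 × 80 matrix): since this A has full column rank, the least-squares weights recover exactly the features from which b was built.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 30
A = rng.normal(size=(m, n))
# b depends only on features 0, 2, and 8 (our own analogue of Example 3).
b = A[:, 0] - 3 * A[:, 2] + 2 * A[:, 8]

# Least-squares solution of A x = b; each x_i is a weight assigned to F_i.
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Weights of truly relevant features dominate; the rest are zero up to
# round-off, because A has full column rank and b lies in its range.
relevant = np.flatnonzero(np.abs(x) > 1e-8)
```

Here `relevant` comes out to [0, 2, 8] with weights 1, −3, and 2, matching the construction of b; in real data the weights are only approximately zero, which motivates the soft threshold introduced next.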
Since the notion of relevancy is not quantitative, one has to be cautious in removing features; we therefore set a soft threshold Th_irr and incorporate both of the methods explained in Examples 2 and 3. In this paper, we first filter out features with minimal weight, that is, features with |x_i| less than (1/n) Σ_{i=1}^{n} |x_i| × Th_irr, where (1/n) Σ_{i=1}^{n} |x_i| is the average of the |x_i|'s. Then we set P_{i,n+1} = 0 whenever |P_{i,n+1}| < Th_irr. Note that the last row of S_D reflects the correlations with b. We sort the last row of S_D in descending order and remove the features whose entries fall below (1/n) Σ_{i=1}^{n} |P_{i,n+1}| × (Th_irr + 1). So we apply a two-step process, with a soft threshold at each step, to remove the irrelevant features. Note that we still denote by A the reduced matrix obtained after removing the irrelevant features.
In the next step, we identify redundant features. To do so, we use the signature matrix S_A of A and consider the associated graph. There are many efficient algorithms for finding the clusters, or connected components, of a graph; one such algorithm is Breadth-First Search (BFS) 18. By applying BFS starting from a vertex F_i, we can determine all vertices reachable from it; thus the different clusters can be identified by running BFS from each as-yet-unvisited vertex F_i. For example, in Fig. 1, the first unvisited vertex (feature) is F_1, and applying BFS to F_1 visits F_2, F_4, and F_3, in that order. Since no connected feature remains unvisited, the first cluster consists of F_1 to F_4. BFS is then applied to the next unvisited F_i, adding the newly visited features to the next cluster, until all vertices have been visited.
From each resulting cluster, the feature that carries the highest MI with b is selected as part of the output of the SVFS method; the selected feature from each cluster is, indeed, the one that best represents that cluster. In real datasets we may inherently encounter minor correlations between features, that is, the matrix S_A may contain very small entries indicating weak correlations. We use a threshold Th_red to map the weak feature correlations to zero. Also, in case we encounter a few clusters with numerous vertices, we set a threshold α to split clusters with more than α vertices into sub-clusters with at most α vertices each. The features in each sub-cluster are then sorted based on the last row of S_D, and the top β features are kept as candidates for the MI computation with b. The choice of β features in each sub-cluster aims to reduce the computational cost of the MI calculations.
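The per-cluster selection step can be sketched as follows, using scikit-learn's mutual_info_classif as a stand-in MI estimator (the helper name and the toy data are ours, not the paper's code):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def pick_representatives(A, b, clusters):
    """From each cluster of column indices, keep the single feature with
    the highest estimated mutual information with the class label b."""
    mi = mutual_info_classif(A, b, random_state=0)
    return [max(cluster, key=lambda j: mi[j]) for cluster in clusters]
```

For example, if column 1 of A duplicates the label while column 0 is noise, then with clusters [[0, 1], [2]] the helper keeps feature 1 from the first cluster and the singleton feature 2 from the second.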
Algorithm. In this section, we present the algorithm and flowchart of SVFS in Figure 2. The while loop in the algorithm essentially finds the connected components of the graph associated to P. The well-known BFS algorithm finds the connected components of a graph G(V, E) with complexity O(|V| + |E|). In our case, |E| is the number of non-zero entries in P, so the worst case occurs when |E| = n(n − 1)/2. Hence, the complexity of the while loop is O(n²). We also mention that parallel algorithms for BFS have been of great interest; see for example 40.
The complexity of computing S = I − A†A is more delicate. There is extensive research on efficient and reliable methods for computing A†; see for example [41][42][43]. One of the most commonly used methods is the Singular Value Decomposition (SVD), which is very accurate but time- and memory-intensive, especially for large matrices. The complexity of computing the SVD of an m × n matrix A is O(min(mn², m²n)).
Pseudo-inverses are used in neural learning algorithms to solve large least-squares systems, so there is great interest in computing the pseudo-inverse efficiently. Courrieu 44 proposed an algorithm called Geninv, based on Cholesky factorization, and showed that its computation time is substantially shorter, particularly for large systems. It is noted in 44 that the complexity of Geninv on a single-threaded processor is O(min(m³, n³)), whereas on a multi-threaded processor the time complexity is O(min(m, n)). The authors of 45 investigated the efficient computation of the pseudo-inverse for neural networks and concluded that QR factorization with column pivoting, together with Geninv, works well. Since our implementation is single-threaded and m ≪ n, the complexity of the pseudo-inverse computation is O(m³). We conclude that the overall complexity of our algorithm is at most O(max(m³, n²)).
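For the full-row-rank case discussed above (m ≪ n), a Cholesky-based sketch in the spirit of Geninv (our own illustration, not Courrieu's implementation) computes A† = Aᵀ(AAᵀ)⁻¹ at O(m³) cost:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def pinv_full_row_rank(A):
    """A_dagger = A^T (A A^T)^{-1} for a full-row-rank A (m << n),
    via a Cholesky factorization of the m x m Gram matrix A A^T."""
    c = cho_factor(A @ A.T)             # O(m^3), cheap when m << n
    # Solve (A A^T) X = A, then transpose: X^T = A^T (A A^T)^{-1}.
    return cho_solve(c, A).T

rng = np.random.default_rng(3)
A = rng.normal(size=(10, 200))          # full row rank with probability 1
Ad = pinv_full_row_rank(A)
```

The result agrees with np.linalg.pinv(A) but only factorizes the small m × m Gram matrix instead of computing a full SVD of A.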

Experimental result
We compared our method with eight state-of-the-art FS methods: Conditional Infomax Feature Extraction (CIFE), Joint Mutual Information (JMI), Fisher score, Trace Ratio criterion, Least angle regression (LARS), Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso), Conditional Covariance Minimization (CCM), and Sparse Multinomial Naive Bayes (SMNB). We used the scikit-feature library, an open-source feature selection repository in Python developed at Arizona State University (ASU), which includes implementations of the CIFE, JMI, LARS, Fisher, and Trace Ratio methods. The rest of the methods, namely HSIC-Lasso, CCM, and SMNB, are implemented in Python by their authors. To have a fair comparison among the different FS methods, we use 5-fold stratified cross-validation (CV) of the dataset so that 80% of each class is selected for FS. Then we use the Random Forest (RF) classifier with its default setting, implemented in 46, to build a model based on the selected features and evaluate the model on the remaining 20% of the dataset. We report the average classification accuracy over 10 independent runs (twice 5-fold CV) using the RF classifier on each dataset.
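The evaluation protocol can be sketched as follows. This is a simplified version in which the selected feature indices are fixed up front, whereas in the actual protocol feature selection is re-run on the 80% training split of each fold; the dataset here is synthetic, not one of the paper's benchmarks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(X, y, feature_idx, n_splits=5, seed=0):
    """Stratified k-fold CV: train a default RandomForest on the selected
    features of each 80% split, score on the held-out 20%."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for train, test in skf.split(X, y):
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X[train][:, feature_idx], y[train])
        accs.append(accuracy_score(y[test], clf.predict(X[test][:, feature_idx])))
    return float(np.mean(accs))

# Synthetic stand-in dataset; shuffle=False keeps informative columns first.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0, shuffle=False)
acc = cv_accuracy(X, y, feature_idx=list(range(10)))
```

Averaging `cv_accuracy` over two independent seeds would reproduce the "10 independent runs (twice 5-fold CV)" reporting used throughout the experiments.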

Datasets.
We selected a variety of publicly available datasets from two sources, i.e. Gene Expression Omnibus (GEO) which has various real genomic data, and the scikit-feature selection repository at Arizona State University which has benchmark biological and face image data to perform feature selection and classification. The specifications of these datasets are given in Tables 1 and 2.
The pre-processing of the GEO datasets used in this research was carried out by cleaning and converting the NCBI datasets to the CSV format. The mapping between the gene samples and the probe IDs was retrieved using GEO2R 47.

Parameters. The input parameters of our proposed SVFS method are k, Th_irr, Th_red, α, and β. The parameter k denotes the number of selected features and is a common parameter in all the methods evaluated in this study. There is no fixed procedure in the literature for determining the optimum value of k, but in many research works [48][49][50][51] it is set to 50, which seems to be satisfactory in many cases. However, we take k in a wider range, from 10 to 90, to ensure a fair ground for comparison. When a subset of k features is returned as the output of an FS algorithm, we feed the first t features from the subset to the classifier to find an optimal t such that the subset of the first t features yields the highest accuracy. This setup is applied across all FS methods. Also, we report the average classification accuracy of a model over 10 independent runs (we run stratified 5-fold CV twice).
The parameter Th irr is the threshold set to filter out the irrelevant features. In this paper, we set the value of Th irr to 3. The parameter Th red is another threshold defined to deal with the low level of sparsity of S. In real-world large datasets, the condition S i,j = 0 might rarely be encountered. Indeed, the threshold Th red maps the weak feature correlations to zero. Here, we have set the value of Th red to 4 for the biological datasets and 7 for the face image datasets. The parameter α is used when facing big clusters to divide the clusters into subclusters with α members. The parameter β is the number of features selected from each of the subclusters with α members. In this work, we have set the values of α and β to 50 and 5, respectively.
Results. The average classification accuracies over 10 independent runs (twice 5-fold CV) using the RF classifier on the datasets described in Section 4.1 are presented in this section. In Figure 3, we present the classification accuracy of SVFS compared to the other FS methods on 4 benchmark face image datasets. As can be seen, our method attains either the best or second-best accuracy compared to the other FS methods. It is interesting to note that SVFS attains 100% accuracy on all of pixraw10P, warpPIE10P, and orlraws10P with at most 90 features. Figure 4 shows the classification accuracy of SVFS compared to the other methods on benchmark biological datasets. As we can see, SVFS performs consistently well and achieves the highest accuracy in 7 out of the 12 cases, while producing reasonably good accuracies in most of the other cases as well. JMI produces the highest accuracy in 3 cases, while Fisher and HSIC-Lasso show their best performance on the GLIOMA and ALLAML datasets, respectively. As mentioned, the thresholds Th_irr and Th_red are set to 3 and 4, respectively, for all biological datasets. However, it is possible to tune these thresholds and get better results. For example, if we set Th_irr = 1.2 and Th_red = 2, we get average accuracies of 94.52 and 96.37 on the ALLAML and Lymphoma datasets, respectively, using at most 50 features (α = 50, β = 15). Similarly, Th_irr = 1.1 and Th_red = 2 gives an average accuracy of 87 on the GLIOMA dataset (α = 50, β = 15), while Th_irr = 1.2 and Th_red = 4 gives an average accuracy of 74.14 on the NCI9 dataset (α = 50, β = 10).
The general superiority of SVFS can be further witnessed on the genomic datasets with large numbers of features, as shown in Figure 5. Note again that Th_irr = 3 and Th_red = 4 for all these datasets. However, it is possible to tune the parameters Th_irr and Th_red to obtain better results per dataset. This can be particularly useful when we focus on specific datasets for disease diagnosis and biomarker discovery.
We conclude from Figures 3, 4, and 5 that our proposed SVFS has achieved the highest accuracy on 12 datasets out of the total 25 datasets, while noting that no other method has achieved the highest accuracy for more than 4 datasets. In cases where SVFS has not produced the highest accuracy, its performance is nonetheless among the most accurate ones.
Since IBM LSF is capable of reporting the running time, CPU time, and memory usage of each feature selection model, we depict the running time in seconds for all feature selection methods in Figure 6. As there are 25 datasets in the evaluation process, Figure 6(a) shows the running time on the benchmark biological and benchmark image datasets and Figure 6(b) covers the running time on the genomic datasets. Note that the reported running times include the RF classification time. It can be seen that the running times of CIFE and JMI are worse than those of the other methods, while the running time of the CCM method on the GEO datasets is high and roughly the same as CIFE and JMI. The other methods, including SVFS, have comparable and very reasonable running times, in the sense that they can be comfortably run on regular PCs. Some methods, because of their immense computing cost, are implemented in parallel to achieve reasonable running times. Since HSIC-Lasso used all available CPU cores, its CPU time is comparable with the CIFE and JMI methods, as shown in Figure 6(c). Moreover, the CCM model takes advantage of TensorFlow 52 with an optimized parallel CPU implementation, leading to a high CPU time on most of the datasets. The rest of the methods are implemented in a non-parallelized manner; therefore, their CPU times are relatively similar to their running times.
In terms of memory usage, Figure 6(d) shows that CIFE, JMI, Fisher, SMNB, and SVFS are efficient and require comparatively little memory. In contrast, CCM, HSIC-Lasso, and Trace Ratio require a high volume of memory, on the order of thousands.

Conclusion
In this paper, we have proposed a feature selection method (SVFS) based on the singular vectors of a matrix. Given a matrix A with pseudo-inverse A†, we showed that the signature matrix S_A = I − A†A can be used to determine correlations between the columns of A. To do this, we associate a graph whose vertices are the columns of A, where columns F_i and F_j are connected if S_{i,j} ≠ 0. We showed that the connected components of this graph are the clusters of columns of A, so that columns in a cluster correlate only with columns in the same cluster. We consider a dataset D = [A | b], where the rows of A are samples, the columns of A are features, and b is the class label. We use the signature matrix S_D and its associated graph to find the cluster of columns of D that correlate with b; this allows us to reduce the size of A by filtering out the columns in the other clusters as irrelevant features. In the next step, we use the signature matrix S_A of the reduced A to partition its columns into clusters and then pick the most important features from each cluster. A comprehensive assessment on benchmark and genomic datasets shows that the proposed SVFS method outperforms state-of-the-art feature selection methods. Our algorithm includes two thresholds, Th_irr and Th_red, used to filter out irrelevant features and to remove redundant features, respectively. The thresholds were set the same for all datasets; however, it is possible to further tune Th_irr and Th_red to obtain better results, which can be particularly useful when focusing on specific datasets for disease diagnosis and biomarker discovery.

Use of experimental animals and human participants. This research did not involve human participants or experimental animals.