Estimation of Discriminative Feature Subset Using Community Modularity

Feature selection (FS) is an important preprocessing step in machine learning and data mining. In this paper, a new feature subset evaluation method is proposed that constructs a sample graph (SG) in different k-features and applies community modularity to select features that are highly informative as a group, even though these features may not be relevant individually. Furthermore, relevant independency rather than irrelevant redundancy among the selected features is effectively measured with the community modularity Q value of the sample graph in the k-features. An efficient FS method called k-features sample graph feature selection is presented. A key property of this approach is that the discriminative cues of a feature subset with the maximum relevant independency among features can be accurately determined. This community modularity-based method is then verified with the theory of K-means clustering. Experimental results show that the proposed approach is more effective than other state-of-the-art methods.

is NP-hard 19 . To avoid a combinatorial search for the optimal subset, variable selection methods are employed. The most popular of these are the forward 20 , backward 21 , and floating sequential schemes 22 , which adopt a heuristic search procedure to provide a sub-optimal solution.
In the subset evaluation method, evaluation of the relevance of a feature subset, including relevance and redundancy in a feature subset, is important in multivariate methods; however, this task is difficult in practice. Relevance evaluation methods based on mutual information (MI) have become popular recently [23][24][25][26][27][28] . However, these algorithms approximately estimate the discriminative power of a feature subset because loss of intrinsic information in raw data can occur while estimating the probability distribution of a feature vector by the discretization of a feature variable 27,28 .
A good feature subset should contain features that are highly correlated with the class but uncorrelated with one another 29 . In other words, in a good feature subset, the samples in different classes can be separated well; that is, the within-class distance between samples is small and the between-class distance is large. Therefore, if the samples are represented as a graph (also referred to as a complex network), the graph should exhibit obvious community structures 30 and a high community modularity Q value 31,32 . Thus, the community modularity Q value can be utilized to evaluate the relevance of a feature subset with regard to the class. In this paper, a novel method is proposed to address the feature subset relevance evaluation problem by introducing a new evaluation criterion based on community modularity. The method accurately assesses the relevant independency of a feature subset by constructing a sample graph in different k-features. To the best of our knowledge, this work is the first to employ community modularity in feature subset relevance evaluation. The proposed method selects relevant features through a forward search strategy. This method not only selects relevant features as a group and eliminates redundant features but also attempts to retain intrinsic interdependent feature groups. The effectiveness of the method is validated through experiments on many publicly available datasets. Experimental results confirm that the proposed method exhibits improved FS and classification accuracy. The discriminative capacity of the selected feature subset is significantly superior to that of other methods.

Related Work
FS has elicited increasing attention in the last few years. In the early stage, individual evaluation methods were more popular, such as those in [7][8][9][10] , which measure the discriminative ability of each feature according to a related evaluation criterion. Because they rely on class information, these methods are supervised FS algorithms. Unsupervised feature ranking algorithms, such as the Laplacian score 33 , have also been proposed; they consider not only the variance of each feature but also its locality-preserving ability.
A known limitation of individual evaluation methods is that the feature subset selected by these methods may contain redundancy 15,34 , which degrades the subsequent learning process. Thus, several subset evaluation-based filter methods, such as those in 17,29,[35][36][37] , have been proposed to reduce redundancy during FS.
MI is gaining popularity because of its capability to provide an appropriate means of measuring the mutual dependence of two variables; it has been widely utilized to develop information theoretic-based FS criteria, such as MIFS 23,38 , CMIM 39 , CMIF 24 , MIFS-U 25 , mrmr 27 , NMIFS 28 , and FCBF 40 . MI is calculated with a Parzen window 41 , which is less computationally demanding and provides better estimation. The Parzen window method is a non-parametric method to estimate densities. It involves placing a kernel function on top of each sample and evaluating density as the sum of the kernels. The author in 42 pointed out that common heuristics for information-based FS (including Markov Blanket algorithms 43 as a special case) approximately and iteratively maximize the conditional likelihood. The author presented a unifying framework for information theoretic-based FS, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. Analysis of the redundancy among selected features is performed by computing the relevant redundancy between the features and the target. However, MI-based FS methods have been criticized for their limitations. First, loss of intrinsic information in raw data could occur because the probability distribution of the feature vector is estimated by the discretization of the feature variable. The second limitation is that these methods only select relevant features as an individual and disregard these informative features as a group 44 . Several researchers have also found that combining optimal features as an individual does not provide excellent classification performance 45 .
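The Parzen window idea mentioned above (one kernel per sample, density evaluated as the sum of the kernels) can be illustrated with a minimal one-dimensional sketch. The Gaussian kernel, the bandwidth `h`, the function name `parzen_density`, and the sample values are assumptions made for illustration only; they are not taken from the cited estimator.

```python
import math

def parzen_density(x, samples, h):
    """Parzen-window estimate of a 1-D density at x.

    Assumption: a Gaussian kernel of width h is placed on every sample,
    and the density is the normalized sum of these kernels.
    """
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

# Illustrative (made-up) samples of one feature.
samples = [-1.2, -0.4, 0.0, 0.3, 1.1]

# Sanity check: the estimate should integrate to about 1 over a wide grid.
step = 0.01
grid = [-10.0 + i * step for i in range(2001)]
area = sum(parzen_density(g, samples, 0.5) for g in grid) * step
print(f"integral of the estimate: {area:.4f}")
```

A smaller `h` makes the estimate follow individual samples more closely, whereas a larger `h` smooths it; how the bandwidth was chosen in the cited work is not stated here.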
Graph-based methods, such as the Laplacian score 33 and improved Laplacian score-based FS methods [46][47][48][49] , have been widely applied to feature learning because these approaches can evaluate the similarity among data. Generally, the graph-based method includes two phases. First, a graph is constructed in which each node corresponds to each feature, and each edge has a weight based on a criterion between features. Second, several clustering methods are implemented to select a highly coherent set of features 50 . Optimization-based FS algorithms are preferred by many researchers. R. Tibshirani 51 proposed a new method called "lasso" for estimation in linear models. Based on graphical lasso (GL), a new multilink, single-task approach that combines GL with neural network (NN) was proposed to forecast traffic flow 52 .
Statistical methods have been widely applied to FS. Two popular feature ranking measures are t-test 53 and F-statistics 54 . Well known statistic-based feature selection algorithms include χ 2 -statistic 55 , odds ratio 56 , bi-normal separation 57 , improved Gini index 58 , measure using Poisson distribution 59 , and ambiguity measure 60 . Most of these methods calculate a score based on the probability or frequency of each feature in bag-of-words to rank features according to a feature's score; the top features are selected. Yan Wang 61 introduced the concept of feature forest and proposed feature forest-based FS algorithm.

Results
Experiments on artificial datasets, including binary class and multi-class datasets, were conducted to test the proposed approach. The proposed approach was also compared with several popular FS algorithms, including MIFS_U, mrmr, CMIM, Fisher, Laplacian score 33 , RELIEF 62 , Simba-sig 63 , and Greedy Feature Flip (G-Flip-sig) 63 . Off-the-shelf codes 42  To evaluate the effectiveness of the proposed method, a nearest-neighbor classifier (1NN) with Euclidean distance and a support vector machine (SVM) 64 with the radial basis function kernel and penalty parameter c = 100 were employed to test the performance of the FS algorithms. We utilized the LIBSVM package 65 for SVM classification. All experiments were conducted on a PC with an Intel(R) Core(TM) i3-2310 CPU @ 2.10 GHz and 2 GB of main memory.

Datasets and preprocessing.
To verify the effectiveness of the proposed method, six continuous datasets from the LIBSVM datasets 65 , two cancer microarray datasets, and two discrete datasets from UCI were utilized in the simulation experiments. All the features in the datasets, except discrete features, were uniformly scaled to zero mean and unit variance. The details of the 10 datasets are shown in Table 1.
Feature selection and classification results. Classification performance was utilized to validate the FS method, and tenfold cross-validation was employed to avoid the over-fitting problem. To reduce unintentional effects, all the experimental results are the average of 10 independent runs. To compare the different methods, the feature subset was produced by picking the top s selected features, and each method was assessed in terms of classification accuracy (s = 1, ..., P). We discretized continuous features to nine discrete levels as performed in 66,67 by converting the feature values between μ − σ/2 and μ + σ/2 to 0, the four intervals of size σ to the right of μ + σ/2 to discrete levels from 1 to 4, and the four intervals of size σ to the left of μ − σ/2 to discrete levels from −1 to −4. Extremely large positive or small negative feature values were truncated and discretized to ±4 appropriately. Table 2 indicates the average classification accuracy of both 1NN and SVM classifiers at different s. A bold value indicates the best among the FS methods under the same classifier and the same number of selected features. To avoid the influence of data scarcity, the average value of accuracy at different s for all datasets in the same selector is shown in the bottom line of Table 2 (Avg.). The results in Table 2 indicate that the proposed method (k-FSGFS) exhibits the best average performance compared with other methods in both classifiers. The Avg. values are 83.65% and 83.97% in 1NN and SVM classifiers, respectively. These values are higher than those of the other methods. CMIM is superior to mrmr and MIFS_U. Figures 1 and 2 show the performance of SVM and 1NN at different numbers s of selected features for six datasets, namely, Sonar, Glass, Svmguide4, Segment, DLBCL_A, and Lung-cancer. The six datasets were selected because they cover a diverse range of characteristics, including continuous and discrete data, in terms of the number of features and number of examples.
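The nine-level discretization described above can be sketched as follows. This is only an illustration of the stated rule: the function names `discretize_value` and `discretize_feature`, the use of the population standard deviation, and the handling of values lying exactly on an interval boundary are assumptions.

```python
import math
import statistics

def discretize_value(v, mu, sigma):
    """Map one feature value to a level in {-4, ..., 4}.

    Values within mu +/- sigma/2 map to 0; each further interval of
    width sigma adds one level; values beyond the fourth interval are
    truncated to +/-4.
    """
    z = (v - mu) / sigma
    if abs(z) <= 0.5:
        return 0
    level = min(4, math.ceil(abs(z) - 0.5))
    return level if z > 0 else -level

def discretize_feature(values):
    """Discretize one feature column using its own mean and std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population std
    if sigma == 0:
        return [0] * len(values)       # constant feature: single level
    return [discretize_value(v, mu, sigma) for v in values]

# Made-up feature column, for illustration only.
column = [2.0, 2.1, 1.9, 2.0, 8.0, -4.0]
print(discretize_feature(column))
```

With μ = 0 and σ = 1, for instance, a value of 0.4 falls in level 0, 0.6 in level 1, and any value above 4.5 is truncated to level 4.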
Figures 1 and 2 show that the proposed method (k-FSGFS) outperforms the other methods. In most cases, the average accuracy of the two classifiers is significantly higher than that of other selectors. High classification accuracy is commonly achieved with minimal selected features, which indicates that our evaluation criterion based on community modularity Q not only selects the most informative features but also provides the solution of relevant independency among selected features. The proposed method can evaluate the discriminatory power of a feature subset.
Additionally, the proposed approach was compared with other popular FS methods, including Laplacian score 33 , Relief 62 , Simba-sig 63 , and Greedy Feature Flip (G-Flip-sig) 63 . Relief 62 , Simba-sig 63 , and G-Flip-sig 63 are margin-based FS or feature weighting methods, in which a large nearest neighbor hypothesis margin ensures a large sample margin. Thus, these algorithms find a feature weight vector to minimize the upper bound of the leave-one-out cross-validation error of a nearest-neighbor classifier in the induced feature space. For fairness, only the 1NN classifier was utilized to evaluate the performance of the compared FS algorithms in all the datasets. Figure 3 shows that the proposed method is also superior or comparable to other methods in most cases. Particularly, the proposed method can achieve significantly higher classification accuracy in the first several features than the other methods in most cases. To verify, the classification accuracy results with the 1NN classifier at different selected features s (s = 2, 3, 4) for different methods are illustrated in Table 3. The table clearly indicates that our method significantly improves the classification results with fewer selected features. Thus, our method achieves optimal performance with an acceptable number of features.
To further confirm the effectiveness of this feature evaluation criterion, the decision boundary of the 1NN classifier in 2D feature spaces from the Wine database was used, as shown in Fig. 4(a-d). The indicated dimensions are the two best features selected by each method. The two features selected by k-FSGFS and CMIM are relatively informative (Fig. 4(d)) and help in effectively separating the sample data. Both Fish Score and mrmr selected the same top two features, as indicated in Fig. 4(a), and separated the samples better than MIFS_U in   Table 2.
The capability of k-FSGFS to obtain the discriminatory attribute of a feature subset and the relevant independency among features is so effective that it can select these informative features with fewer redundancies. Thus, k-FSGFS performs better than other FS algorithms. For parameter K during the construction of k-FSG in our method, numerous experiments demonstrate that a value of K selected from 2 to 11 is effective for most datasets for either SVM or 1NN classifier. In this study, K was set to 2.

Statistical test.
The classification experiments demonstrated that the proposed framework outperforms the other FS algorithms. However, the results also indicate that k-FSGFS does not perform better than several algorithms in a number of cases. Therefore, a paired-sample one-tailed t-test was used to assess the statistical significance of the difference in accuracy. In this test, the null hypothesis states that the average accuracy of k-FSGFS at different subset sizes is not greater than that of the other FS algorithm. The alternative hypothesis states that k-FSGFS is superior to the other FS algorithm in terms of classification. For example, if the performance of k-FSGFS is compared with that of the Fisher Score method (k-FSGFS vs. Fish Score), the null and alternative hypotheses are defined respectively as $H_0: \mu_{k\text{-FSGFS}} \le \mu_{\text{Fish\_Score}}$ and $H_1: \mu_{k\text{-FSGFS}} > \mu_{\text{Fish\_Score}}$. The results in Table 5 indicate that, regardless of whether 1NN or SVM is used, the p-values obtained by the pairwise one-tailed t-test are substantially less than 0.05, which means that the proposed k-FSGFS significantly outperforms the other algorithms.
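A paired one-tailed t-test of this kind can be sketched with the standard library alone. The accuracy lists below are made-up placeholders, not results from the experiments, and the helper names `t_sf` and `paired_one_tailed_t` are hypothetical. The upper-tail probability is obtained by numerically integrating the Student t density, which is one possible way to get a p-value without a statistics package.

```python
import math
import statistics

def t_sf(t, df, steps=20000):
    """Upper-tail probability P(T > t) of Student's t with df degrees
    of freedom, by Simpson integration of the density over [0, |t|]."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area

def paired_one_tailed_t(ours, other):
    """Test H0: mean(ours) <= mean(other) against H1: mean(ours) > mean(other)."""
    diffs = [a - b for a, b in zip(ours, other)]
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, t_sf(t, n - 1)

# Made-up accuracies at five subset sizes (illustration only).
acc_ours = [0.84, 0.86, 0.83, 0.88, 0.85]
acc_other = [0.80, 0.83, 0.82, 0.84, 0.81]
t_stat, p_value = paired_one_tailed_t(acc_ours, acc_other)
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```

The null hypothesis is rejected at the 0.05 level when the returned p-value is below 0.05.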
Justification of k-FSGFS based on K-means cluster. The proposed feature evaluation criterion based on community modularity is justified by adopting the theory of K-means clustering to determine why k features with a higher Q value are more discriminative. K-means 68 is the most well-known clustering algorithm. It iteratively attempts to address the following objective: given a set of points in a Euclidean space and a positive integer c (the number of clusters), the points are split into c clusters to minimize the total sum of the squared Euclidean distances of each point to its nearest cluster center, which can be defined as follows:

$$J(c, \mu) = \sum_{i} \left\| x_i - \mu_{c_t} \right\|_2^2 \qquad (1)$$

where $x_i$ and $\mu_{c_t}$ are the i-th sample point and its nearest cluster center, respectively, and $\|\cdot\|_2$ is the $L_2$-norm. In feature weighting K-means, the feature that minimizes within-cluster distance and maximizes between-cluster distance is preferred and thus obtains a higher weight 56 . It is therefore necessary to confirm whether the features with a high community modularity Q value in our method minimize within-cluster distance and maximize between-cluster distance.
According to Equation (7), a k-FSG with a high Q value exhibits a large inner-degree $d_{in}$ (and a small out-degree $d_{out}$). When the k features are good features as a group, as many sample points with the same label as possible are classified into the same class in the k-features space, and as few as possible into different classes. The expected number of sample points in the k-features space that are correctly classified can be calculated through neighborhood components analysis 69 .
Given the selected feature subset S and a candidate feature f, each sample point i in the S ∪ f feature space selects another sample point j as its neighbor with probability $P_{ij}$. $P_{ij}$ can be defined by a softmax over Euclidean distances as follows:

$$P_{ij} = \frac{\exp\left(-\|x_i - x_j\|_2^2\right)}{\sum_{k \ne i} \exp\left(-\|x_i - x_k\|_2^2\right)}, \qquad P_{ii} = 0 \qquad (2)$$

Under this stochastic selection rule, we can compute the probability $P_i$ that point i will be correctly classified (denote the set of points in the same class as i by $C_t = \{ j \mid c_t = c_j \}$):

$$P_i = \sum_{j \in C_t} P_{ij} \qquad (3)$$
Hence, the expected number of sample points in the S ∪ f space that are correctly classified into the same class (ENC) is defined by

$$\mathrm{ENC}(f \cup S) = \sum_{i} \sum_{j \in C_t} P_{ij} = \sum_{i} P_i \qquad (4)$$

A feature f with a larger ENC is more discriminative. According to Eqs. 2 to 4, maximizing ENC is equivalent to minimizing the K-means cluster objective J(c, μ).
Here c is the number of clusters, and the lower bound of ENC(f ∪ S) is denoted by $\mathrm{ENC}_{L\_bound}$. ENC(f ∪ S) can be maximized by maximizing its lower bound $\mathrm{ENC}_{L\_bound}$. ENC(f ∪ S) obtains its maximum value when the K-means objective (Eq. 1) is minimized.
Minimizing the K-means objective is thus equivalent to maximizing ENC(f ∪ S). Hence, the K-means cluster function J(c, μ) in the S ∪ f space must be minimized when the community modularity Q value of the SG in the S ∪ f space obtains a high value, which indicates that the features selected by the proposed method can minimize within-cluster distance. Similarly, the expected number of points incorrectly classified is defined by

$$\mathrm{ENIC}(f \cup S) = n - \mathrm{ENC}(f \cup S)$$

where n is the number of samples. A small ENIC(f ∪ S) results in few edges between communities and a large between-cluster distance. A feature subset with a high Q value is therefore highly relevant: it not only minimizes within-cluster distance but also maximizes between-cluster distance.
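The neighbor-selection probabilities and the expected number of correctly classified points used in this argument can be computed directly. The sketch below follows the usual neighborhood components analysis definition (a softmax over negative squared Euclidean distances, with a point never selecting itself); the function name `expected_correct` and the toy points are assumptions for illustration.

```python
import math

def expected_correct(points, labels):
    """Expected number of correctly classified points (ENC).

    Each point i picks a neighbor j != i with probability proportional
    to exp(-||x_i - x_j||^2); P_i sums that probability over the points
    sharing the label of i, and ENC is the sum of all P_i.
    """
    n = len(points)
    enc = 0.0
    for i in range(n):
        weights = [
            0.0 if j == i else
            math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            for j in range(n)
        ]
        z = sum(weights)
        if z == 0.0:          # all neighbors too far away (underflow)
            continue
        enc += sum(w for j, w in enumerate(weights) if labels[j] == labels[i]) / z
    return enc

# Toy data: two tight, well-separated groups in a 2-D feature space.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
enc_good = expected_correct(points, [0, 0, 1, 1])   # labels match the groups
enc_bad = expected_correct(points, [0, 1, 0, 1])    # labels cut across groups
enic_good = len(points) - enc_good                  # expected misclassifications
print(enc_good, enc_bad, enic_good)
```

When the class labels coincide with the compact groups, ENC approaches the number of samples and its complement approaches zero; when labels cut across the groups, ENC collapses, which mirrors the within-cluster and between-cluster reasoning above.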

Discussion
In this study, a novel feature subset evaluation criterion using the community modularity Q value by constructing k-features sample graphs (k-FSGs) is presented to measure the relevance of the feature subset with target variable C. To address the redundancy problem of ranking in filter methods, the sample graph in k-features that captures the relevant independency among feature subsets is utilized rather than the conditional MI criteria. By combining the two points above, a new FS method, namely, k-FSGFS, is developed for feature subset selection. The method effectively retains as many interdependent groups as possible during FS. The proposed k-FSGFS works well and outperforms other methods in most cases. The method remarkably or comparatively improves FS and classification accuracy with a small feature subset, which demonstrates the ability of the proposed method to select a discriminative feature subset. The experimental results also verify that interdependent groups commonly exist in the real dataset and play an important role in classification. Unlike the other methods used for comparison, the proposed method accurately evaluates the discriminative power of a feature subset as a group. The Fisher method, which is an individual evaluation criterion, cannot eliminate the redundancy in a feature subset, thereby reducing classification performance. The experiment results for the Fisher method verify this finding. The MI-based methods, such as mrmr, MIFS_U, and CMIM, consider the relevance and redundancy among feature subsets as a group and are superior to the Fisher method. However, these MI-based methods can only approximately estimate the relevance and redundancy in a feature subset (such as considering all the redundancy between pair-wise features to estimate the redundancy among a feature subset as a group in mrmr method) because of the difficulties in accurately computing the probability density function. 
The results in Table 2 and Figs 1 and 2 indicate that the mrmr, MIFS_U, and CMIM methods perform better than the Fisher method but worse than the proposed method.
As discussed above, our method performs better than MI-based methods in most cases. In our method, a larger inter-class distance implies that the local margin of any sample is large enough. By the large margin theory 70 , the upper bound of the leave-one-out cross-validation error of a nearest-neighbor classifier in the feature space is then minimized, and the classifier usually generalizes well on unseen test data 70,71 . However, traditional mutual information-based relevance evaluation between a feature and the class cannot accurately measure the discriminative power of a feature. To illustrate this, consider for simplicity two features $f_1$ and $f_2$. According to MI-based methods, feature $f_1$ has the same relevancy as $f_2$. In our method, feature $f_2$ has more discriminative power than $f_1$ because the community modularity Q for feature $f_2$ is larger than that for feature $f_1$.
Intuitively, feature $f_2$ should be more relevant than $f_1$ because its between-class distance is larger than that of $f_1$. However, the MI-based methods cannot capture the difference between $f_1$ and $f_2$. Therefore, our relevancy evaluation criterion based on community modularity Q is more efficient and accurate.
However, in practice, the proposed method is not efficient for all types of datasets, such as imbalanced datasets, especially when one class has only a few samples compared with the other classes. For example, on the Lung-cancer dataset, our method performs worse than Simba-sig and G-Flip-sig. The reason is that modularity optimization is widely criticized for its resolution limit 72 , illustrated in Fig. 5, which may prevent the approach from detecting clusters that are comparatively small with respect to the graph as a whole. As a result, the maximum modularity Q may not correspond to a good community structure; that is, features with a high Q value may be irrelevant. In addition, the KNN search needs to be conducted iteratively in our method, so its efficiency is low for large amounts of data in real applications. Our future work will focus on resolving these problems.

Methods
In this paper, a new feature evaluation criterion based on the community modularity Q value is proposed to evaluate the class-dependent correlation 73 of features as a group instead of identifying the discriminatory power of a single feature. Detailed information on our method is presented in Algorithm 2. The innovations of our work mainly include the following points.
(1) The discriminatory power of features as a group can be evaluated exactly based on the community modularity Q value of sample graphs in k-features. (2) The proposed method can select features that have discriminatory power as a group but have weak power as an individual. (3) Relevant independency instead of irrelevant redundancy between features is measured using the community modularity Q value rather than information theory.
The proposed framework is presented in a flow diagram in Fig. 6.
Community modularity Q. The community structure in an undirected graph exhibits close connections within each community but relatively sparse connections among different communities 31,32 . Figure 7 shows a schematic example of a graph with three communities to demonstrate the community structure. Thus far, the most widely used quality function is the modularity of Newman and Girvan 32 . Modularity Q can be written as follows:

$$Q = \frac{1}{2E} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2E} \right) \delta(C_i, C_j)$$

where A is the adjacency matrix, E is the total number of edges, $d_i$ is the degree of node i, and $\delta(C_i, C_j)$ is equal to one if nodes i and j are in the same community and equal to zero otherwise. Another popular description of modularity Q can be written as follows:

$$Q = \sum_{c} \left[ \frac{d_c^{in}}{2E} - \left( \frac{d_c^{in} + d_c^{out}}{2E} \right)^2 \right] \qquad (7)$$

where the sum runs over the communities, $d_c^{in}$ is the inner-degree of community c (the degree of its nodes counted over edges inside the community), and $d_c^{out}$ is its out-degree (the number of edges leaving the community). Modularity-based methods 23 assume that a high value of modularity indicates good partitions. In other words, the higher modularity Q is, the more significant the community structure is. Based on the definition of community, the within-class distance in a community is small and the between-class distance is large. Thus, if a graph has a clear community structure, the nodes in different communities can be locally and linearly separated easily, as shown in Fig. 7. The features that minimize within-cluster distance and maximize between-cluster distance are preferred and obtain a high weight. If the sample graph in k-features (k-FSG) has an apparent community structure, these k features will have strong discriminative power as a group because the intra-class distance is small and the inter-class distance is large. This condition can be justified with the theory of K-means clustering.
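To make the modularity measure concrete, the sketch below evaluates Q for a fixed partition in two equivalent ways: the standard Newman-Girvan pairwise form, and a per-community form based on inner and outer degree. The function names, the symbol choices, and the toy graph (two triangles joined by one edge) are assumptions for illustration; in the method described here the communities would be the class labels of the samples.

```python
def modularity_pairwise(adj, communities):
    """Q = (1/2E) * sum over same-community pairs of (A_ij - d_i*d_j/2E)."""
    degree = [sum(row) for row in adj]
    two_e = sum(degree)                      # twice the number of edges
    if two_e == 0:
        return 0.0
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if communities[i] == communities[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_e
    return q / two_e

def modularity_by_community(adj, communities):
    """Q = sum over communities of d_in/2E - ((d_in + d_out)/2E)^2."""
    two_e = sum(sum(row) for row in adj)
    if two_e == 0:
        return 0.0
    q = 0.0
    for c in set(communities):
        members = [i for i, ci in enumerate(communities) if ci == c]
        outside = [i for i, ci in enumerate(communities) if ci != c]
        d_in = sum(adj[i][j] for i in members for j in members)
        d_out = sum(adj[i][j] for i in members for j in outside)
        q += d_in / two_e - ((d_in + d_out) / two_e) ** 2
    return q

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

labels = [0, 0, 0, 1, 1, 1]
q_pair = modularity_pairwise(adj, labels)
q_comm = modularity_by_community(adj, labels)
q_single = modularity_pairwise(adj, [0] * 6)   # everything in one community
print(q_pair, q_comm, q_single)
```

For this graph both forms give Q = 5/14, while placing all nodes in a single community gives Q = 0, which illustrates why a larger inner-degree and a smaller out-degree raise Q.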

Sample graph in k-features (k-FSG).
Given an m × n dataset matrix (m samples and n features), the sample graph in k-features (k-FSG) is constructed as follows: an edge A(i, j) = 1 exists between samples $X_i$ and $X_j$ if $X_i \in K\text{-NN}(X_j)$ or $X_j \in K\text{-NN}(X_i)$, where $X_i$ is node i corresponding to sample i, $K\text{-NN}(X_i)$ is the K-nearest-neighbor set of node i, and A is the adjacency matrix, which is symmetric. K is a predefined parameter that does not take large values; it generally ranges within {3-11}.
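The construction rule above can be sketched as a brute-force K-nearest-neighbor search restricted to the chosen k features. This is a minimal illustration, not the paper's Algorithm 1: the function name `build_kfsg`, the use of squared Euclidean distance, and the tie-breaking by sample index are assumptions.

```python
def build_kfsg(data, feature_idx, K):
    """Build the symmetric adjacency matrix of a sample graph.

    data        : list of m samples, each a list of n feature values
    feature_idx : indices of the k features spanning the graph's space
    K           : number of nearest neighbors per sample

    An edge i-j is set when j is among the K nearest neighbors of i
    or i is among the K nearest neighbors of j.
    """
    m = len(data)
    proj = [[row[f] for f in feature_idx] for row in data]
    adj = [[0] * m for _ in range(m)]
    for i in range(m):
        # (squared distance, index) pairs; ties are broken by index.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(proj[i], proj[j])), j)
            for j in range(m) if j != i
        )
        for _, j in dists[:K]:
            adj[i][j] = adj[j][i] = 1        # "or" rule keeps A symmetric
    return adj

# Toy data: one informative feature (column 0) and one constant feature.
data = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0],
        [5.0, 1.0], [5.1, 1.0], [5.2, 1.0]]
adj = build_kfsg(data, feature_idx=[0], K=2)
for row in adj:
    print(row)
```

On this toy data the graph splits into two disconnected triangles, one per group of samples. The brute-force search costs on the order of m² distance evaluations per graph, which is the step that the time complexity discussion identifies as dominant.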
The discussion above indicates that if the k-FSG in k-features exhibits clear community structures corresponding to a large Q value, these k features are highly informative as a group. The algorithm for constructing the k-FSG is shown as Algorithm 1. In each step, the candidate feature that yields the highest Q value is added to feature subset S. The procedure does not stop until the number of selected features satisfies |S| = P. To facilitate understanding of our evaluation scheme, we take a UCI dataset, iris, as an example. The dataset consists of 150 samples and four features and is divided into three classes with 50 samples in each class. The iris dataset is scaled to zero mean and unit variance. According to the 1-FSGs built in single features, the 3rd feature has the highest Q value and is the most informative as an individual. Given the 3rd feature, Fig. 8 illustrates the sample scatter points in the 2-FSGs for the remaining features {1 2 4} in dataset iris. The three community modularity values $Q_{3 \leftrightarrow q}$ (q = 1, 2, 4) are shown in Table 6. Figure 8 clearly indicates that the 2-FSG in the 3 ↔ 4 feature space exhibits more obvious community structures, and the sample points in different classes in the 3 ↔ 4 features can be easily separated. The results in Table 6 show that the 2-FSG in the 3 ↔ 4 feature space provides the largest community modularity Q value. Thus, the 4th feature has strong informative power combined with the 3rd feature. Given the 3rd and the 4th features, the 1st and the 2nd features can be selected according to the 3-FSGs and 4-FSGs, respectively. The feature subset selected in iris using our method is {3 4 1 2}, which agrees with the features selected by most of the methods. In short, given selected feature subset S, the feature f selected by our criterion is defined as follows:

$$f = \arg\max_{f_i \in F - S} Q_{f_i \cup S}$$

where $Q_{f_i \cup S}$ is the community modularity value of the SG in features $f_i \cup S$, and F and S are the set of all features and the selected feature subset, respectively.
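The greedy forward search implied by this criterion can be sketched independently of how Q is computed. In the sketch below, `forward_select` and its `score` argument are hypothetical names; `score` stands in for "build the sample graph on the candidate subset and return its modularity Q". The demonstration replays the second selection step of the iris example using the three Q values reported in Table 6 instead of recomputing them.

```python
def forward_select(candidates, P, score, selected=()):
    """Greedy forward selection.

    candidates : features not yet selected (F - S)
    P          : target size of the selected subset
    score      : callable mapping a list of features to a Q value
    selected   : features already in S (kept in selection order)
    """
    selected = list(selected)
    remaining = [f for f in candidates if f not in selected]
    while remaining and len(selected) < P:
        # Pick the candidate whose union with S gives the largest Q.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Q values of the 2-FSGs for iris as reported in Table 6 (feature 3 fixed).
table6_q = {
    frozenset({3, 4}): 0.6057,
    frozenset({3, 1}): 0.5719,
    frozenset({3, 2}): 0.5430,
}

def lookup_q(subset):
    """Stand-in scorer: read Q from Table 6 instead of building a graph."""
    return table6_q[frozenset(subset)]

chosen = forward_select(candidates=[1, 2, 4], P=2, score=lookup_q, selected=[3])
print(chosen)
```

Passing the scorer as a callable keeps the search loop separate from graph construction, so a faster approximate neighbor search could be substituted without touching the selection logic.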

Relevancy analysis.
Ranking-based filter methods cannot handle high redundancy among the selected features. To solve this problem, conditional MI (CMI) is applied in this study to obtain the relevant independency (RI) or relevant redundancy 74 instead of the irrelevant redundancy between features, as shown in Fig. 9. RI($f_i$, C; $f_j$) is the amount of information that feature $f_i$ provides about target variable C when feature $f_j$ is given. In this study, the discriminative capability of k features as a group was evaluated using the community modularity Q value of the constructed k-FSG. A high Q value of the k-FSG denotes a large RI among the k features as a group, and the sample points in different classes can be separated well. Thus, the community modularity Q value of the k-FSG in k-features can accurately reflect the relevant independency RI($f_i$, C; S) in selected feature subset S. The community modularity Q value of the k-FSG was utilized to measure relevant independency instead of MI theory. For verification, the iris dataset was used as an example. With the third feature selected, the different RI($f_i$, C; $f_3$) values were calculated (i = 1, 2, 4), as indicated in Table 7.

Table 6. The community modularity Q values of 2-FSG (k = 2) in different pairwise features in the iris dataset. The larger the community modularity is, the more relevant the pairwise features are. Features 3 and 4 as a group have more discriminative power.
2-FSG 3↔q    3 ↔ 4     3 ↔ 1     3 ↔ 2
Q 3↔q        0.6057    0.5719    0.5430

Table 7. The RI in different pairwise features in terms of the third feature in the iris dataset. The larger RI states that features 3 and 4 as a group have more discriminative power.

Table 7 clearly indicates that RI($f_4$, C; $f_3$) is the largest, which demonstrates that the fourth feature $f_4$ can provide more information when the third feature is given. Similarly, $Q_{3 \leftrightarrow 4}$ is the highest value in Table 6, which demonstrates that the community modularity Q value of the k-FSG in k-features can replace MI to effectively evaluate the RI of feature subset S. Thus, our method can resolve relevant redundancy among selected features. CMI can be computed with the FEAST tool 42 .
Relevant independency RI($f_i$, C; S) between feature $f_i$ and selected feature set S was replaced by the community modularity Q value of the SG in $f_i \cup S$, which can be defined as follows:

$$RI(f_i, C; S) := Q_{f_i \cup S} \qquad (9)$$

A larger value of RI($f_i$, C; S) indicates that $f_i$ is highly independent of the features in S but relevant in terms of target variable C and has strong informative power combined with the features in S. These results indicate that our method can select features with more relevancy as a group in terms of class and larger RI among the selected features.
The details of k-FSGFS are presented in Algorithm 2.
Algorithm 2: k-FSGFS: k-features sample graph based feature selection
Time complexity of k-FSGFS. Algorithm 2 shows that k-FSGFS mainly includes two steps. The first step is to construct the k-FSG in the k-features space. The second step is to calculate the community modularity Q value of each k-FSG. The most time-consuming step is establishing the k-FSG, whose time complexity is about $O(Pnm^2)$, where n is the number of features in the feature space, m is the number of samples in the dataset, and P is the number of predefined selected features. Fortunately, fast K-nearest neighbor graph construction methods 75,76 can be applied to the construction of k-FSGs; such application would reduce the time complexity from $O(Pnm^2)$ to $O(Pnm^{1.14})$. In the second step, the time spent is approximately $O(m \log m)$. Thus, the overall time cost of k-FSGFS is approximately $O(Pnm^{1.14}) + O(m \log m)$.