Semi-Supervised Maximum Discriminative Local Margin for Gene Selection

In the present study, we introduce a novel semi-supervised method called the semi-supervised maximum discriminative local margin (semiMM) for gene selection in expression data. The semiMM is a “filter” approach that exploits local structure, variance, and mutual information. We first construct a local nearest neighbour graph and divide it into within-class and between-class local nearest neighbour graphs by weighting the edge between each pair of data points. The semiMM aims to discover the most discriminative features for classification by maximizing the local margin between the within-class and between-class data, the variance of all data, and the mutual information of features with class labels. Experiments on five publicly available gene expression datasets demonstrate the effectiveness of the proposed method compared to three state-of-the-art feature selection algorithms.

Notations. In the present study, the matrix $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m\} \in \mathbb{R}^{n \times m}$ refers to the gene expression data, where m denotes the number of samples and n denotes the number of genes, that is, the dimensionality. $\mathbf{f}_r = (f_{r1}, f_{r2}, \ldots, f_{rm})^T \in \mathbb{R}^{m}$ is an m-dimensional column vector that denotes the rth gene in the gene expression data, where $f_{ri}$ indicates the expression of the rth gene in the ith sample. Matrices are denoted by boldface capital letters, whereas vectors are denoted by boldface lowercase letters.
Maximum Margin Projection. The MMP is a semi-supervised learning method for dimensionality reduction. It rests on two assumptions: smoothness and cluster 31. The smoothness assumption indicates that if two points are close to each other in a high-density region, then their projections should also be close. The cluster assumption states that points in the same cluster tend to be in the same class. MMP follows these two assumptions and aims to capture both the geometrical and discriminating structures of the local data manifold with both labelled and unlabelled data.
The MMP constructs a k nearest neighbour graph G with a binary weight to depict the geometry of the underlying local manifold. G is divided into two subgraphs, that is, the within-class graph $G_w$ and the between-class graph $G_b$, to discover the discriminating information of the data manifold. $N(\mathbf{x}_i)$ denotes the k nearest neighbours of an arbitrary data point $\mathbf{x}_i$ and is naturally composed of $N_b(\mathbf{x}_i)$ and $N_w(\mathbf{x}_i)$. Neighbours whose class label differs from that of $\mathbf{x}_i$ belong to the set $N_b(\mathbf{x}_i)$; the remaining neighbours are placed into $N_w(\mathbf{x}_i)$. $\mathbf{W}_b$ and $\mathbf{W}_w$ are the weight matrices of $G_b$ and $G_w$, respectively. The entry $W_{w,ij}$ is defined case by case: it takes one constant value if $\mathbf{x}_i$ and $\mathbf{x}_j$ share the same class label, a second value if $\mathbf{x}_i$ or $\mathbf{x}_j$ is unlabelled but $\mathbf{x}_i \in N_w(\mathbf{x}_j)$ or $\mathbf{x}_j \in N_w(\mathbf{x}_i)$, and zero otherwise.
This semi-supervised graph embedding is similar to that of the locality sensitive discriminant feature (LSDF), a semi-supervised feature selection algorithm proposed in 32.
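For orientation, the objectives referred to in the next paragraph are usually written as follows in the MMP literature. This is a sketch in the notation of this section, not a reproduction of the numbered equations: the projection $y_i = \mathbf{a}^T \mathbf{x}_i$, the diagonal row-sum matrices $D_w$ and $D_b$, and the between-class Laplacian $L_b = D_b - W_b$ are symbols assumed here.

```latex
\begin{align*}
% y_i = a^T x_i is the one-dimensional projection of sample x_i
&\min_{\mathbf{a}} \sum_{i,j} (y_i - y_j)^2 W_{w,ij},
\qquad
\max_{\mathbf{a}} \sum_{i,j} (y_i - y_j)^2 W_{b,ij},
\qquad y_i = \mathbf{a}^T \mathbf{x}_i \\
% within-class term under the constraint a^T X D_w X^T a = 1
&\tfrac{1}{2} \sum_{i,j} (y_i - y_j)^2 W_{w,ij}
  = \mathbf{a}^T X D_w X^T \mathbf{a} - \mathbf{a}^T X W_w X^T \mathbf{a}
  = 1 - \mathbf{a}^T X W_w X^T \mathbf{a} \\
% between-class term, with L_b = D_b - W_b
&\tfrac{1}{2} \sum_{i,j} (y_i - y_j)^2 W_{b,ij}
  = \mathbf{a}^T X L_b X^T \mathbf{a} \\
% combined problem with trade-off constant 0 <= alpha <= 1
&\mathbf{a}^{*} = \arg\max_{\mathbf{a}} \;
  \mathbf{a}^T X \left( \alpha L_b + (1 - \alpha) W_w \right) X^T \mathbf{a}
  \quad \text{s.t.} \quad \mathbf{a}^T X D_w X^T \mathbf{a} = 1 \\
% generalized eigenvalue problem with eigenvalue gamma
&X \left( \alpha L_b + (1 - \alpha) W_w \right) X^T \mathbf{a}
  = \gamma \, X D_w X^T \mathbf{a}
\end{align*}
```

Under the constraint, minimizing the within-class term is the same as maximizing $\mathbf{a}^T X W_w X^T \mathbf{a}$, which is why the two objectives can be merged into a single maximization.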
MMP detects a linear transformation based on two objective functions, Eqs. (3) and (4), which together maximize the local margin between the within-class graph $G_w$ and the between-class graph $G_b$. Here, $\mathbf{a}$ is a projection vector of the projection matrix $\mathbf{A} \in \mathbb{R}^{d \times n}$. By performing some algebraic steps and imposing the constraint $\mathbf{a}^T \mathbf{X} \mathbf{D}_w \mathbf{X}^T \mathbf{a} = 1$, where $\mathbf{D}_w$ is the diagonal matrix of the row sums of $\mathbf{W}_w$, the objective functions (3) and (4) can be rewritten as (5) and (7), respectively. The two objectives are then combined into a single optimization problem, where α is a tuning constant with 0 ≤ α ≤ 1. The optimal projection vector $\mathbf{a}$ is subsequently obtained by solving the generalized eigenvalue problem defined in Eq. (9), where γ is the generalized eigenvalue. This linear transformation can optimally and simultaneously preserve the local neighbourhood and discriminatory information.
Laplacian Score. The LS is an unsupervised feature selection method proposed in 30. This method is motivated by the observation that two data points close to each other are likely to be in the same class. The LS selects features with more locality preserving power, as evaluated by Eq. (10). Moreover, the LS is closely related to two popular manifold learning methods, namely, Laplacian eigenmaps 33 and locality preserving projection 34. The LS first constructs a k nearest neighbour graph, whose weight matrix W is defined in Eq. (11). Given that the variance in the data manifold can be calculated by Eq. (12) based on spectral graph theory 35, Eq. (10) can be reformulated as Eq. (13) by performing some algebraic steps.
Here, D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and L is the Laplacian matrix defined as L = D − W. Specifically, a "good" feature has more representative power and more local structure preserving power. The former requires a larger variance of the feature, and the latter means that if two data points are very close, then they should have similar values of the feature. In an algebraic sense, increased representative power and local structure preserving power correspond to maximizing the denominator and minimizing the numerator in Eq. (10), respectively. Consequently, feature selection with the LS minimizes the objective function in Eq. (10); that is, a smaller $L_r$ indicates a better feature.
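The equations referenced in this subsection follow the standard Laplacian score formulation, which can be sketched in the notation of this section as follows. The weighted mean $\mu_r$, the centred feature $\tilde{\mathbf{f}}_r$, and the all-ones vector $\mathbf{1}$ are symbols assumed here; the binary neighbour weight is one common choice (a heat-kernel weight is another), and the original equation numbering is not reproduced.

```latex
\begin{align*}
% score of the r-th feature: local variation over weighted variance
L_r &= \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 \, W_{ij}}{\mathrm{Var}(\mathbf{f}_r)} \\
% binary k-nearest-neighbour weight (assumed variant)
W_{ij} &=
  \begin{cases}
    1, & \mathbf{x}_i \in N(\mathbf{x}_j) \ \text{or} \ \mathbf{x}_j \in N(\mathbf{x}_i) \\
    0, & \text{otherwise}
  \end{cases} \\
% variance weighted by the degree matrix D, with D_{ii} = \sum_j W_{ij}
\mathrm{Var}(\mathbf{f}_r) &= \sum_i (f_{ri} - \mu_r)^2 D_{ii},
\qquad
\mu_r = \frac{\sum_i f_{ri} D_{ii}}{\sum_i D_{ii}} \\
% centred feature and the matrix form, with L = D - W
\tilde{\mathbf{f}}_r &= \mathbf{f}_r - \mu_r \mathbf{1},
\qquad
L_r = \frac{2 \, \tilde{\mathbf{f}}_r^T L \, \tilde{\mathbf{f}}_r}
           {\tilde{\mathbf{f}}_r^T D \, \tilde{\mathbf{f}}_r}
\end{align*}
```

The factor 2 comes from $\sum_{i,j} (f_{ri} - f_{rj})^2 W_{ij} = 2\, \mathbf{f}_r^T L\, \mathbf{f}_r$ and from $L \mathbf{1} = \mathbf{0}$; because it is the same for every feature, it does not change the ranking and is often dropped.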

Semi-Supervised Maximum Discriminative Information for Feature Selection
In this section, we introduce the proposed semiMM from two aspects, including the criterion and algorithm flow of the semiMM.
The semiMM is a semi-supervised feature selection method based on manifold learning. Its graph embedding originates from the previously described MMP, which is a semi-supervised manifold learning method (see Section 2). Thus, the semiMM constructs between-class and within-class neighbour graphs to simultaneously characterize the local manifold of the dataset using all samples and the discriminative information from the labelled samples. Moreover, the semiMM also considers the variance of features and the mutual information between the classes and features. This method aims to maximize the local margin between within-class and between-class data and simultaneously discover the features most related to the class labels.
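The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, and the `same_label_weight` parameter (the constant given to neighbouring pairs that share a label) are assumptions, and only neighbouring pairs receive a weight.

```python
import math


def knn_indices(X, k):
    """Return, for each sample, the indices of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, xi in enumerate(X):
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def build_graphs(X, labels, k=5, same_label_weight=1.0):
    """Build symmetric within-class (W_w) and between-class (W_b) weight matrices.

    labels[i] is a class label, or None for an unlabelled sample.
    same_label_weight is an assumed constant for same-label neighbours.
    """
    m = len(X)
    W_w = [[0.0] * m for _ in range(m)]
    W_b = [[0.0] * m for _ in range(m)]
    for i, neighbours in enumerate(knn_indices(X, k)):
        for j in neighbours:
            both_labelled = labels[i] is not None and labels[j] is not None
            if both_labelled and labels[i] != labels[j]:
                # neighbours with different class labels: between-class edge
                W_b[i][j] = W_b[j][i] = 1.0
            elif both_labelled:
                # neighbours with the same class label: within-class edge
                W_w[i][j] = W_w[j][i] = same_label_weight
            else:
                # at least one sample is unlabelled: within-class edge, unit weight
                W_w[i][j] = W_w[j][i] = 1.0
    return W_w, W_b
```

Each sample is given as a sequence of coordinates (one per gene), so `X` here is stored sample by sample rather than as the n × m matrix of the Notations paragraph.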

Criterion of SemiMM. Based on the two basic assumptions of semi-supervised learning mentioned in Section 2, two data points from the same neighbourhood potentially belong to the same class (and vice versa); this property is referred to as local preserving power. A "good" feature possesses more local preserving power and is most discriminative in separating the data. Therefore, the within-class and between-class information should be simultaneously minimized and maximized, respectively, to ensure a maximum local margin. In addition, a good feature for gene selection should be a gene that is differentially expressed across samples with different class labels. This difference in gene expression level can be characterized by the normalized mutual information between features and class labels, denoted by NMI($\mathbf{f}_r$, c). A larger difference indicates more mutual information and vice versa. Maximizing the mutual information between features and class labels might enhance the discriminative capability. A reasonable criterion of the semiMM is therefore to minimize the objective function given in Eq. (14). The first term in Eq. (14) shares the same idea with the LS, which regards variance information as the representative power of all data points, and it represents the local margin preserving power. The second term characterizes the class-related capability, where λ is a tuning parameter with 0 < λ < 1, and semiMM$_r$ denotes the score of the rth feature evaluated by the proposed semiMM. Given $\mathbf{S} = \mathbf{W}_w - \mathbf{W}_b$, the objective function can be rewritten as Eq. (15) through some simple algebraic steps, where L is the Laplacian matrix with L = D − S, and D is a diagonal matrix whose diagonal entries are the column (or, equivalently, row) sums of the symmetric matrix S. The normalized mutual information between features and class labels can be calculated by Eq. (16).
Algorithm flow of SemiMM. In summary, the algorithm flow of the semiMM is presented as follows:
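The scoring step of this flow can be sketched as follows. The sketch rests on two assumptions that are not confirmed by the text: that the score has the form $\tilde{\mathbf{f}}_r^T L \tilde{\mathbf{f}}_r / \tilde{\mathbf{f}}_r^T D \tilde{\mathbf{f}}_r - \lambda\,\mathrm{NMI}(\mathbf{f}_r, c)$ with L = D − S, and that the mutual information is normalized by $\sqrt{H(\mathbf{f}_r) H(c)}$. All function and parameter names are hypothetical, and the gene is assumed to be discretized (`f_bins`) before the NMI is computed on the labelled samples.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def nmi(a, b):
    """Assumed normalization: I(a; b) / sqrt(H(a) * H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    mutual_info = h_a + h_b - entropy(list(zip(a, b)))
    return mutual_info / math.sqrt(h_a * h_b)


def semimm_score(f, W_w, W_b, f_bins, classes, lam=0.6):
    """Assumed semiMM score of one gene; a smaller score means a better gene.

    f: expression of the gene over all m samples.
    f_bins, classes: discretized expression and class labels of the labelled samples.
    """
    m = len(f)
    S = [[W_w[i][j] - W_b[i][j] for j in range(m)] for i in range(m)]
    d = [sum(row) for row in S]                        # diagonal entries of D
    total = sum(d)
    mean = sum(fi * di for fi, di in zip(f, d)) / total if total else 0.0
    ft = [fi - mean for fi in f]                       # centred feature
    den = sum(d[i] * ft[i] ** 2 for i in range(m))     # f~^T D f~
    quad_s = sum(ft[i] * S[i][j] * ft[j] for i in range(m) for j in range(m))
    num = den - quad_s                                 # f~^T L f~ with L = D - S
    margin = num / den if den else float("inf")        # guard a degenerate denominator
    return margin - lam * nmi(f_bins, classes)
```

Ranking all genes by this score in ascending order and keeping the top ones would complete the filter; a gene whose values separate the classes gets a lower score than one that mixes them.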

Experiments
In this section, we conduct extensive experiments to evaluate the performance of the proposed semiMM method in a semi-supervised setting for gene selection. The experiments are conducted on five gene expression profile datasets, all of which are publicly available from GEMS 36. A detailed description of the datasets is shown in Table 1.
The methods presented in this article were evaluated using five tumour datasets and compared to other methods. The following is a brief introduction to the five datasets used in this article.
Experimental Design. In this experiment, we first pre-processed the five gene expression datasets to obtain the prepared data: initial data for feature selection and split data for classification. The optimal parameter values of the proposed method were then selected. Three state-of-the-art feature selection methods were chosen for comparison to better understand the proposed method. Finally, the experiments were conducted, and the outputs were recorded and analysed.

Data Preparation. Initial Data.
In this experiment, we set up a semi-supervised setting to simulate the "small sample, high dimension" problem. In a semi-supervised setting for filter-based feature selection, both labelled and unlabelled samples must be used to calculate the score of each feature, and the feature selection method is then used to rank the features. Here, L denotes the number of labelled samples drawn from each class of a gene expression dataset by stratified random sampling. The values of L are 2, 4, and 6. Thus, the number of labelled samples in a given gene expression dataset is the product of L and the number of classes. The remaining data in the dataset are regarded as unlabelled data. The obtained data are termed initial data for convenience.
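The labelling scheme can be illustrated with a short sketch. The function name, the `seed` argument, and the use of `None` to mark unlabelled samples are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_labelled(labels, per_class, seed=0):
    """Keep `per_class` labels in every class and hide all other labels.

    Returns a list in which hidden labels are replaced by None.
    Raises ValueError if a class has fewer than `per_class` samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    labelled = set()
    for idxs in by_class.values():
        labelled.update(rng.sample(idxs, per_class))   # stratified draw
    return [lab if i in labelled else None for i, lab in enumerate(labels)]
```

With L = 2 and three classes, for example, six samples keep their labels and all others are treated as unlabelled.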
Split Data. During classification, we divided each gene expression dataset into a training and testing set with a ratio of 6:4 through stratified random sampling. We conducted the classification with different numbers of genes ranging from 5 to 300 with a step of 5. Considering the intrinsic characteristics of semi-supervised feature selection, we repeated the experiment 10 times at each step and recorded the average prediction accuracy for evaluation.
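A stratified 6:4 split and the grid of gene counts can be sketched as follows. The function name, the rounding rule for the per-class cut, and the `seed` argument are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Split sample indices into training and testing sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))   # assumed rounding rule
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


# numbers of selected genes evaluated: 5, 10, ..., 300
gene_counts = list(range(5, 301, 5))
```

Repeating the split with ten different seeds and averaging the accuracy at each entry of `gene_counts` would mirror the protocol described above.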

Compared Methods and Experimental Setup. Laplacian score. The LS is an unsupervised feature selection method. In this method, a nearest neighbour graph is constructed to model the local geometric structure 30,37. The LS selects the features with more locality preserving power 25.
Fisher score. As a supervised feature selection method, the Fisher score seeks features according to their discriminating power 32.
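As an illustration of the supervised baseline, one common form of the Fisher score is the ratio of between-class scatter to within-class scatter of a feature. The sketch below assumes this form and a hypothetical function name; the exact variant used in the cited reference may differ.

```python
from collections import defaultdict


def fisher_score(f, labels):
    """Fisher score of one feature; a larger score means more discriminating power."""
    groups = defaultdict(list)
    for value, lab in zip(f, labels):
        groups[lab].append(value)
    overall_mean = sum(f) / len(f)
    between = 0.0
    within = 0.0
    for values in groups.values():
        n_c = len(values)
        mean_c = sum(values) / n_c
        between += n_c * (mean_c - overall_mean) ** 2
        within += sum((v - mean_c) ** 2 for v in values)   # n_c times the class variance
    return between / within if within else float("inf")
```

Only labelled samples enter this score, which is why its quality depends on the number of labelled samples per class.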
LSDF. As a semi-supervised feature selection algorithm, the LSDF utilizes both labelled and unlabelled data and determines the discriminative structure and geometrical structure of the data. Features that can maximize the margin between the within-class and between-class graphs are selected by the LSDF.
In the proposed semiMM, the tuning parameter lambda is searched over the grid {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. The number of nearest neighbours, k, which is used to model the local manifold structure of the data, is empirically set to 5. The weight in all experiments is determined by binary similarity, and the alpha is set as 100, similar to LSDF. After conducting many experiments to select proper parameter values for the proposed algorithm, we found that the proposed method is robust to changes in the parameters, whereas the LSDF is sensitive to k and alpha. Thus, k and alpha are set at 5 and 100, respectively, to ensure that the LSDF can still perform well in the experiments and that a fair comparison between the LSDF and the proposed semiMM can be obtained. In this experiment, the top 300 genes were selected as the feature subset for classification, and each gene was normalized to zero mean and unit variance for further assessment.
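The per-gene normalization and the search grid can be written down in a few lines. This is a sketch with hypothetical names; it assumes the population standard deviation and maps a constant gene to zeros.

```python
import math


def standardize(values):
    """Scale one gene's expression values to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)   # population std
    if std == 0.0:
        return [0.0] * n          # constant gene: nothing to scale
    return [(v - mean) / std for v in values]


# candidate values of the tuning parameter lambda: 0.1, 0.2, ..., 0.9
lambda_grid = [round(0.1 * i, 1) for i in range(1, 10)]
```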
Evaluation Metrics. In this evaluation framework, five evaluation metrics, including accuracy, precision, recall, f-score, and area under the receiver operating characteristic curve (AUC), were used to assess the performance. These metrics are defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). True positives and true negatives refer to the numbers of samples that are correctly classified into their ground-truth class group; i.e., positive samples are predicted to be positive, and negative samples are classified into the negative group. The same logic applies to false negatives and false positives. The tumour dataset Lung is an unbalanced multiclass dataset. As stated in 32, a larger AUC indicates better performance. Thus, the AUC score is applied to assess the prediction performance of classification and to properly evaluate FPs and FNs for cancer classification. The proposed semiMM can manage both binary classification and multi-classification datasets. The gene expression datasets used in the present study include two binary classification datasets and three multi-classification datasets. To perform the multi-classification experiment, we adopted a one-against-rest approach for each class and thus constructed c binary classifiers, where c denotes the number of classes in each dataset. The average results over the c binary classifiers are reported as the final result of multi-classification.
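The count-based metrics and the one-against-rest averaging can be sketched as follows, using the standard definitions of accuracy, precision, recall, and F-score. The function names are hypothetical, the F-score is assumed to be the balanced F1 variant, and the AUC is left out because it needs classifier scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F-score with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / denom if denom else 0.0,
    }


def one_vs_rest_average(y_true, y_pred):
    """Average the binary metrics over the c one-against-rest problems."""
    classes = sorted(set(y_true))
    per_class = [binary_metrics(y_true, y_pred, c) for c in classes]
    return {name: sum(m[name] for m in per_class) / len(classes)
            for name in per_class[0]}
```

For a binary dataset, `binary_metrics` is called once with the class of interest as `positive`; for a multiclass dataset, `one_vs_rest_average` returns the mean over all classes.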
Experimental Results. In this subsection, classification is performed via SVM on the training set with a chosen feature subset (the top 300 genes) in the five gene expression datasets to evaluate the performance of the proposed semiMM method and compare it with three other methods. Figure 1 shows the curves of average prediction accuracy versus gene dimension for the four methods with different labelled samples on two binary classification datasets.
All filter methods achieve higher average prediction accuracy as the number of selected genes increases in most cases. Figures 1 and 2 show that the performance of the supervised Fisher score method improves when the number of labelled samples per subclass L increases from 2 to 6. In contrast, the performance of the unsupervised LS method degrades. A larger L value means that fewer unlabelled samples remain in each dataset; thus, this observation is reasonable. The semiMM and LSDF methods, however, perform better with a larger L.
The semiMM method performs best and converges fastest to the optimal point when fewer than 100 genes are selected. This finding might indicate that the proposed semiMM method is better able to utilize the label information than the LSDF, i.e., the semiMM has more discriminating power than the LSDF method. Roughly speaking, the semiMM and LSDF show stable performance with varying values of L because the shapes of their curves are almost unchanged. This finding can be explained by the semi-supervised properties of the semiMM and LSDF; these methods select features using both labelled and unlabelled samples simultaneously.
The multiclass classification performance on three publicly available datasets is shown in Figs 3-5. The LS behaves consistently across all three multiclass datasets: its average prediction accuracy decreases when additional labelled samples are selected. The Fisher score performance improves with a larger L on the Leukemia2 and SRBCT datasets but degrades slightly on the Lung dataset under the same condition. Thus, not all labelled samples are useful for category recognition. Overall, the proposed semiMM method converges faster and achieves slightly higher optimal average classification accuracy when L increases in all three multiclass datasets. When L equals 2, the semiMM outperforms the supervised Fisher score method when the number of selected genes is less than 50 for multiclass datasets. The other semi-supervised method, LSDF, behaves less consistently; its average classification accuracy is poor when L increases in the Leukemia2 dataset, but it behaves quite differently on the SRBCT and Lung datasets. In addition, its performance does not match that of the other methods in most cases.
Scientific Reports | (2018) 8:8619 | DOI:10.1038/s41598-018-26806-6
Therefore, the semiMM performs well irrespective of the dataset itself, whereas its competitors are sensitive to the dataset. The semiMM is effective for tackling "small sample" problems. The good and stable performance of this method is due to its simple and efficient idea of discovering both geometrical and discriminating information from labelled and unlabelled samples together. Although no method outperforms the other three algorithms in all circumstances, the proposed semiMM is a good choice for gene selection with a small number of labelled samples in terms of both robustness across datasets and prediction accuracy.
Considering that all four methods show a stable and promising performance when the number of selected genes is 150, we list the corresponding classification results with different values of L in Tables 2-6. For a given L, the highest values are shown in bold. The parameter λ is set as 0.6 in the proposed semiMM in all experiments. Across the two binary datasets (Tables 2 and 3) and the three multiclass datasets that follow, the semiMM and Fisher score achieve the highest values in most cases. In the cases where the semiMM is not the best method, its performance remains better than that of the other two methods. This finding supports the conclusion drawn from the analysis of Figs 1 and 2. The proposed semiMM is an effective feature selection method with good and stable performance irrespective of the dataset itself.

Conclusion and Future Work
In the present study, we introduced a novel semi-supervised method called the semiMM that is based on spectral graph and mutual information theories and is used for gene selection. The semiMM method is a "filter" approach that simultaneously exploits local structure, variance, and mutual information. In the first step, we constructed a local nearest neighbour graph and subsequently divided it into within-class and between-class local nearest neighbour graphs by weighting the edge between each pair of data points. This method aims to discover the most discriminative features for classification by maximizing the local margin between within-class and between-class data, the variance of all data, and the mutual information of features with class labels.
Compared with three state-of-the-art methods, i.e., the Fisher score, LS, and LSDF methods, the experimental results show that the semiMM method effectively balances the use of both labelled and unlabelled samples. Regardless of whether the dataset is binary-class or multiclass, the proposed semiMM achieves good performance. The performance of the semiMM is comparable to that of the Fisher score, and the semiMM even outperforms the Fisher score when the number of labelled samples per class equals 2 and the number of selected genes is less than 50. Both the Fisher score and semiMM are superior to the LS and LSDF in most cases.
The following issues should be addressed in future research: No theoretical guideline has been established for selecting the controlling parameter lambda, which tunes the weight between the first and second terms of the present criterion.
The semiMM considers only the discriminating information between features and class labels and does not account for redundancy among the selected features. If redundant features could be removed, then a compact feature subset that is maximally discriminative and minimally redundant could be obtained.
The second term, which is the mutual information between the class label and features, can be time-consuming to compute when dealing with datasets with many subclasses. This factor makes the proposed semiMM method less competitive for multi-classification problems under time constraints.
The analysis of single cell data has become an active research topic, and extending the semiMM method to the analysis of single cell data is an interesting direction.