Fisher Discrimination Regularized Robust Coding Based on a Local Center for Tumor Classification

Tumor classification is crucial to the clinical diagnosis and proper treatment of cancers. In recent years, the sparse representation-based classifier (SRC) has been proposed for tumor classification. The employed dictionary plays an important role in sparse representation-based or sparse coding-based classification. However, existing sparse representation-based tumor classification models have not learned the employed dictionary from the data, thereby limiting their performance. Furthermore, the sparse representation model assumes that the coding residual follows a Gaussian or Laplacian distribution, which may not effectively describe the coding residual in practical tumor classification. In the present study, we formulate a novel and effective cancer classification technique, namely, Fisher discrimination regularized robust coding (FDRRC), by combining the Fisher discrimination dictionary learning method with the regularized robust coding (RRC) model, which searches for a maximum a posteriori solution to the coding problem by assuming that the coding residual and the representation coefficient are independent and identically distributed. The proposed FDRRC model is extensively evaluated on various tumor datasets and shows superior performance compared with various state-of-the-art tumor classification methods in a variety of classification tasks.

test samples are classified based on the solved vector α and the dictionary D. The selection of the vector α and the dictionary D is crucial to the success of the sparse representation model. The previously described SRC-based methods directly regard the training samples of all classes as the dictionary to represent the test sample and classify the test sample by evaluating which class leads to the minimal reconstruction error. Although these methods show interesting results, noise, outliers, incomplete measurements, and trivial information in the raw training data make this classification less effective. These naive methods also do not maximize the discriminative information in the training samples. These problems can be addressed by properly learning a discriminative dictionary.
In general, discriminative dictionary learning methods can be divided into two categories. In the first category, a dictionary shared by all classes is learned, whereas the representation coefficients are discriminative. Jiang et al. proposed that samples of the same class possess similar sparse representation coefficients 21 . Mairal et al. proposed a task-driven dictionary learning framework that minimizes different risk functions of the representation coefficients for different tasks 22 . In general, these methods aim to learn a dictionary shared by all classes and classify test samples with the representation coefficients. However, the shared dictionary loses the class labels of the dictionary atoms. Thus, classifying the test samples based on the class-specific representation residuals is not feasible.
In the second category, discriminative dictionary learning methods learn a dictionary class by class, and atoms of the dictionary correspond to the subject class labels. Yang et al. learned a dictionary for each class, classified the test samples by using the representation residual, and applied dictionary learning methods to face recognition and signal clustering 23 . Wang et al. proposed a class-specific dictionary learning method for sparse modeling in action recognition 24 . In the previously mentioned methods, test samples are classified by using the representation residual associated with each class, but the representation coefficients are not used and are not enforced to be discriminative in the final classification.
To solve the previously discussed problems, Yang et al. proposed a Fisher discrimination dictionary learning framework to learn a structured dictionary 25 . In discrimination dictionary learning, the sparse representation coefficients present large between-class scatter and small within-class scatter. Each class-specific sub-dictionary presents good reconstruction of the training samples from that class and poor reconstruction of the other classes. By Fisher discrimination dictionary learning, the representation residual associated with each class can effectively be used for classification and the discrimination of representation coefficients can be exploited.
All SRC-based methods assume that the coding residual follows a Gaussian or Laplacian distribution, which may not be effective for describing the coding residual in practical GEP datasets. To address this problem, Yang et al. proposed a regularized robust coding (RRC) method for face recognition 26 . The RRC model searches for a maximum a posteriori (MAP) solution of the coding problem by assuming that the coding residual and the representation coefficient are independent and identically distributed. However, neither SRC-based methods nor RRC takes full advantage of the discriminative information in the representation coefficients. In the present study, we present RRC based on the Fisher discrimination dictionary learning method, a novel and effective cancer classification technique combining the RRC method with the concept of Fisher discrimination dictionary learning, which maximizes the use of the discriminative information in both the representation coefficients and the representation residuals. The proposed Fisher discrimination regularized robust coding (FDRRC) model is extensively evaluated on various tumor GEP datasets and shows superior performance to different state-of-the-art SRC-based and machine learning-based methods in a variety of classification tasks.
The remainder of the paper is organized as follows: Section 2 mainly describes the experimental process and presents the experimental results obtained from eight tumor datasets. Section 3 discusses the proposed method, concludes the paper and outlines future studies. Section 4 describes the fundamentals of FDRRC.

Results
In the present study, eight publicly available tumor data sets are used to evaluate the performance of FDRRC. The experiment is divided into four sections. In the first section, the cancer datasets and dataset preprocessing are introduced. In the second section, parameter selection is discussed. The third section describes the various samples used in the experiment with the 400 top genes on the eight datasets. In the fourth section, cross-validation (CV) is presented to make a fair performance comparison. The proposed method is compared with several representative methods, such as SRC 18 , SVD + MSRC 27 and MRSRC 5 . SRC, MSRC, and MRSRC are SRC-based methods that have been widely used in tumor classification in recent years. All experiments are implemented in the Matlab environment and conducted on a personal computer (Intel Core dual-core CPU at 2.93 GHz with 8 GB RAM).
Cancer datasets and dataset preprocessing. For a more comprehensive comparison of the performance of these methods, eight tumor GEP datasets are used to evaluate the proposed method. These datasets include five two-class datasets and three multi-class datasets. The summarized descriptions of the eight GEP datasets are provided in Table 1.
The five two-class tumor datasets are the acute leukemia dataset 28 , colon cancer dataset 29 , gliomas dataset 30 , diffuse large B-cell lymphoma (DLBCL) dataset 31 and prostate dataset 32 . The acute leukemia set contains 72 samples from two subclasses. The colon cancer data set includes 62 samples, with gene expression data for 40 tumor and 22 normal colon tissue samples. The gliomas data set consists of 50 samples from two subclasses (glioblastomas and anaplastic oligodendrogliomas), and each sample contains 12,625 genes. For the DLBCL data set, RNA was hybridized to high-density oligonucleotide microarrays to measure gene expression. The target dataset contains 77 samples of 7,129 genes. The target class has two states, comprising 58 diffuse large B-cell lymphoma samples and 19 follicular lymphoma samples. For the prostate tumor data set, the gene expression profiles were derived from tumor and non-tumor samples from prostate cancer patients, including 59 normal and 75 tumor samples. The number of genes is 12,600. Table 1 provides the details of the data sets. The three multi-class data sets are the small round blue cell tumors (ALL) 33 , MLLLeukemia 34 , and LukemiaGloub 28 datasets. The ALL data set contains a total of 248 samples and 12,626 genes from six subclasses. The MLLLeukemia data set contains 72 samples and 12,582 genes per sample with three subclasses. The LukemiaGloub data set contains 72 samples with three subclasses; each sample contains 7,129 genes. Table 1 also provides the details of these multi-class data sets.
GEP data are characterized by high dimensionality and a small sample size. Redundant and irrelevant genes significantly affect classification. To compare the performance of FDRRC and the SRC-based methods after gene selection, the ReliefF algorithm is applied to the training set 35 . Then, the top 400 genes are selected from each dataset, presenting a good trade-off between computational complexity and biological significance.
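As an illustration of the gene-ranking step, the following Python sketch scores features in a ReliefF-like fashion (a single nearest hit and nearest miss per sample, rather than the full k-neighbor ReliefF of ref. 35; the function names are ours) and keeps the top k genes:

```python
import numpy as np

def relieff_scores(X, y, n_iter=None):
    """Simplified ReliefF-style feature scoring: reward features that
    separate a sample from its nearest miss and agree with its nearest hit."""
    n, m = X.shape
    idx = range(n) if n_iter is None else np.random.choice(n, n_iter)
    w = np.zeros(m)
    for i in idx:
        d = np.abs(X - X[i]).sum(axis=1)   # L1 distances to sample i
        d[i] = np.inf                      # never pick the sample itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, d, np.inf))   # nearest same-class sample
        miss = np.argmin(np.where(diff, d, np.inf))  # nearest other-class sample
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w

def top_genes(X, y, k):
    """Rank genes by score and keep the top k (the paper keeps the top 400)."""
    return np.argsort(relieff_scores(X, y))[::-1][:k]
```

In the paper's protocol the scoring is computed on the training set only, so the selected gene indices are then applied unchanged to the test samples.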
Parameter selection. Five parameters should be set in the FDRRC model. The dictionary learning phase employs two parameters, λ 1 and λ 2 , which are both presented in Eq. (8). In general, we search for λ 1 and λ 2 in a small set {0.001, 0.005, 0.01, 0.05, 0.1} by five-fold CV. The classifying phase includes three parameters, namely, μ and δ from the weight function Eq. (21) and w from the residual function Eq. (24). Parameter μ controls the rate at which the weight W i,i decreases; we simply set μ = s/δ, where s = 8 is a constant. Parameter δ controls the location of the demarcation point and can be set as the ϕ-th element of the squared coding residuals sorted in ascending order, where ϕ = ⌊τm⌋ outputs the largest integer smaller than τm. According to the experiments 7 , τ = 0.9 can be set in the classification of tumors. Parameter w balances the contributions of the representation residual and the representation vector to the classification. We search for w in a small set {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1} by five-fold CV.
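The five-fold CV search over the candidate sets above can be sketched as follows; `fit_score` is a hypothetical callback standing in for training FDRRC with a given parameter tuple and returning the test accuracy:

```python
import numpy as np
from itertools import product

def five_fold_cv_select(X, y, grid, fit_score):
    """Pick the parameter tuple with the best mean five-fold CV accuracy.
    `grid` is a list of candidate sets (e.g. for lambda1 and lambda2);
    `fit_score(X_tr, y_tr, X_te, y_te, params)` is a stand-in for training
    the classifier with `params` and scoring it on the held-out fold."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), 5)
    best, best_acc = None, -1.0
    for params in product(*grid):
        accs = []
        for k in range(5):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            accs.append(fit_score(X[tr], y[tr], X[te], y[te], params))
        acc = float(np.mean(accs))
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc
```

The same routine can be reused for the one-dimensional search over w by passing a single candidate set.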
Comparison of the balance division performance. Different divisions of the training set and test set can greatly affect the classification performance. To avoid the effects of an imbalanced training set, the balance division method (BDM) is designed to divide each original data set into a balanced training set and test set. For this BDM, Q samples from each subclass are randomly selected for use in the training set, and the remaining samples are used in the test set. Here, Q is an integer denoting the number of training samples per class. In the present study, we vary Q from 5 to min(|c i |) − 1 samples per subclass, where min(|c i |) denotes the minimum subclass size, and use the remaining samples for testing to guarantee that at least one sample in each category can be used in the test. For example, when Q is 5, five samples per subclass are randomly selected and used as the training set, and the rest are assigned to the test set. In this experiment, the training/testing split is performed 10 times, and the average classification accuracies are presented.
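The BDM split can be sketched as follows (a Python illustration of the sampling rule above, not the paper's Matlab code):

```python
import numpy as np

def balance_division(y, Q, rng=None):
    """Balance division method (BDM): randomly pick Q samples per subclass
    for the training set; all remaining samples form the test set.
    `y` is an integer label vector; returns (train_idx, test_idx)."""
    rng = np.random.default_rng(rng)
    train = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        train.extend(rng.choice(members, size=Q, replace=False))
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test
```

Keeping Q at most min(|c i |) − 1, as in the text, guarantees every subclass contributes at least one test sample.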
The average prediction accuracies that vary with different values of Q are shown in Figs 1 and 2, showing that, in the case of two-class classification, FDRRC achieves the highest classification accuracy in most cases in the acute leukemia and Gliomas datasets. Although gliomas are difficult to classify, FDRRC can still achieve the best accuracy. Figure 3 presents the average prediction accuracy for the classification of the eight tumor data sets. As shown in Fig. 3, FDRRC achieves the best accuracy in five data sets in most cases, illustrating that FDRRC is robust with respect to the number of top genes. For the Colon, Acute leukemia, DLBCL, Gliomas, Prostate and MLLLeukemia data sets, the accuracy curve increases with the increasing number of selected genes. Clearly, the selection of the top genes can improve the performance of all classification methods. For the Acute leukemia and ALL datasets, the best number of top genes is 400. These results suggest that the selection of the top 400 genes is reasonable.
Comparison of 10-fold CV performance. To evaluate the classification performance on imbalanced training/testing splits, we perform a 10-fold stratified CV experiment comparing FDRRC with the SRC-based methods. All samples are randomly divided into 10 subsets; nine subsets are used for training, and the remaining subset is used for testing.
The 10-fold CV results are summarized in Tables 2, 3 and 4. Table 2 shows that FDRRC achieves the highest accuracy in seven datasets; in particular, FDRRC exhibits the best classification accuracy in all multi-class datasets. Table 3 indicates that FDRRC achieves the highest prediction sensitivity in six datasets and the best classification sensitivity in four two-class datasets. Table 4 shows that FDRRC exhibits the highest specificity in seven datasets and, in particular, the best specificity in all multi-class datasets. Thus, we conclude that FDRRC is applicable to both two-class and multi-class datasets, exhibiting the best classification accuracy, sensitivity, and specificity in most cases.

Discussions
The results of the present study show that FDRRC outperforms the sparse representation-based methods (such as SRC, MSRC, and MRSRC) in most experiments, probably because the representation residual associated with each class can be effectively used for classification, the discrimination of the representation coefficients is exploited, the coding residual is modeled as independent and identically distributed, and the local center helps to distinguish outliers.
In the present study, we proposed a new method, called FDRRC, for classifying tumors. This method combines the Fisher discrimination dictionary learning method and the concept of the local center with the RRC model. The FDRRC model learns a discriminative dictionary and seeks a MAP solution to the coding problem. Classification is achieved by a local center classifier, which takes full advantage of the discriminative information in the representation coefficients. We also compared the performance of FDRRC with those of three sparse representation-based methods by using eight tumor expression datasets. The results demonstrate the superiority of FDRRC and validate its effectiveness and efficiency in tumor classification.
Compared with the other methods, FDRRC exhibits a stable performance with respect to various datasets. The properties of this FDRRC algorithm should be further investigated. Thus, we will extend the algorithm with a superior discriminative dictionary and consider the driver genes to tailor the algorithm in our future studies. In addition, FDRRC will be used to predict miRNA 36 and lncRNA-disease association 37 in future studies.

Methods
Description of the SRC problem. Assume that X = {X 1 , X 2 , …, X c } ∈ R m×n is a training sample set, where c is the number of subclasses, and m and n are the dimensionality and the number of samples, respectively. The training samples of the j-th class can be presented as the columns of a matrix X j = [x j,1 , x j,2 , …, x j,nj ] ∈ R m×nj , where x j,i is a sample of the j-th class and n j is the number of j-th class training samples. Let L = {l 1 , l 2 , … l c } denote the label set, and let y ∈ R m be a test sample. Then, the SRC problem can be represented as follows: α̂ = argmin α {||y − Xα|| 2 2 + γ||α|| 1 }, where α̂ is the sparse representation coefficient vector of y with respect to X, and γ is a small positive constant. After obtaining the representation coefficient α̂, the SRC-based method assigns a label to the test sample y according to the class-specific residuals e i = ||y − X i α̂ i || 2 , where α̂ i is the sparse representation coefficient sub-vector associated with subclass X i . The classification rule is set as identity(y) = argmin i {e i }.
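A minimal Python sketch of this pipeline is given below; the l1-regularized coding problem is solved here with plain ISTA (our choice of solver, which the paper does not prescribe), and the class with the smallest residual e i is returned:

```python
import numpy as np

def src_classify(X, labels, y, gamma=0.01, n_iter=500):
    """Sketch of SRC: solve min ||y - X a||_2^2 + gamma * ||a||_1 by ISTA,
    then assign y to the class with the smallest class-wise residual."""
    n = X.shape[1]
    a = np.zeros(n)
    L = np.linalg.norm(X, 2) ** 2              # squared spectral norm of X
    for _ in range(n_iter):                    # ISTA iterations
        g = X.T @ (X @ a - y)                  # gradient of 0.5*||y - Xa||^2
        z = a - g / L                          # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - gamma / (2 * L), 0.0)  # shrink
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        residuals[c] = np.linalg.norm(y - X[:, mask] @ a[mask])
    return min(residuals, key=residuals.get), a
```

Here the columns of X are the raw training samples, exactly as in the naive SRC scheme this section describes; the dictionary learning of the next section replaces X with a learned D.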
Fisher Discrimination Dictionary Learning. Given the training samples X = {X 1 , X 2 , …, X c }, the Fisher discrimination dictionary learning model not only requires that D should be highly capable of representing X (i.e., X ≈ Dα) but also that D can strongly distinguish the samples in X. The Fisher discrimination dictionary learning model can be expressed as follows: J (D,α) = argmin (D,α) {r(X, D, α) + λ 1 ||α|| 1 + λ 2 f(α)} (8), where f(α) is a discrimination term imposed on the coefficient matrix α, ||α|| 1 is the sparsity penalty, r(X, D, α) is the discriminative data fidelity term, and λ 1 and λ 2 are scalar parameters. We can write α i as [α i 1 ; α i 2 ; …; α i c ], where α i j is the representation coefficient of X i over D j . For the discriminative data fidelity term r(X, D, α), X i should be well represented by D i but not by D j , j ≠ i. This relationship indicates that α i i should present several significant coefficients such that ||X i − D i α i i || F 2 is small, whereas the coefficients in α i j , j ≠ i, should be nearly zero, that is, r(X, D, α) = Σ i (||X i − Dα i || F 2 + ||X i − D i α i i || F 2 + Σ j≠i ||D j α i j || F 2 ). For the discriminative coefficient term f(α), the Fisher discrimination criterion 38 is expected to minimize the within-class scatter of α, denoted by SW(α), and to maximize the between-class scatter of α, denoted by SB(α). SW(α) and SB(α) are defined as follows: SW(α) = Σ i Σ αk∈αi (α k − m i )(α k − m i ) T and SB(α) = Σ i n i (m i − m)(m i − m) T , where m i and m are the mean vectors of α i and α, respectively, and n i is the number of samples in class X i . Thus, the discriminative coefficient term can be defined as follows: f(α) = tr(SW(α)) − tr(SB(α)) + η||α|| F 2 , where tr(⋅) denotes the trace of a matrix and η is a parameter. Optimization of the Fisher discrimination dictionary learning model can be divided into two sub-problems, that is, updating α with a fixed D and updating D with a fixed α. When α is updated, the dictionary D is fixed, and α i can be computed class by class. When computing α i , all α j , j ≠ i, are fixed. The objective function expressed in Eq. (8) is then reduced to a sparse representation problem and can be written as Eq. (9), where M k and M are the mean vector matrices of class k and of all classes, respectively. In this study, we set η = 1 for simplicity. Notably, all terms in Eq.
(9), except for the l1 sparsity term λ 1 ||α|| 1 , are differentiable. We can thus rewrite Eq. (9) in the form of Eq. (10), and the Iterative Projection Method 39 can be employed to solve Eq. (10), as described in Table 5.
When updating D = [D 1 , D 2 , …, D c ], the coefficient matrix α is fixed. We also update D class by class. When updating D i , all D j , j ≠ i, are fixed. The objective function expressed in Eq. (8) is then reduced to Eq. (11), where α i is the representation matrix of X over D i . Eq. (11) can be re-written in the form of Eq. (12).
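For concreteness, the discriminative coefficient term f(α) = tr(SW(α)) − tr(SB(α)) + η||α|| F 2 can be sketched as follows, with the coding vectors stored as columns of a matrix (the function name is ours):

```python
import numpy as np

def fisher_term(A, labels, eta=1.0):
    """Discriminative coefficient term f(A) = tr(SW) - tr(SB) + eta*||A||_F^2.
    Columns of A are coding vectors alpha, grouped by the class labels;
    the elastic term eta*||A||_F^2 stabilizes the otherwise non-convex f."""
    m = A.mean(axis=1, keepdims=True)              # global mean vector
    sw = sb = 0.0
    for c in np.unique(labels):
        Ac = A[:, labels == c]
        mc = Ac.mean(axis=1, keepdims=True)        # class mean vector m_i
        sw += ((Ac - mc) ** 2).sum()               # tr(SW): within-class scatter
        sb += Ac.shape[1] * ((mc - m) ** 2).sum()  # tr(SB): between-class scatter
    return sw - sb + eta * (A ** 2).sum()
```

A smaller value of f means tighter classes and wider class separation in coefficient space, which is exactly what the λ 2 f(α) term in Eq. (8) rewards.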

while convergence or the maximal iteration number is not reached do
α (t+1) = S τ/σ (α (t) − (1/σ)∇F(α (t) )), where S τ/σ is a component-wise soft-thresholding operator defined by Wright et al. 42 .
Here, 0 denotes a zero matrix of the appropriate size based on the context. Eq. (12) can be efficiently solved by updating each dictionary atom one by one via the algorithm of Yang et al. 40 . The update of dictionary D is described in Table 6.

Description of RRC.
In the SRC-based method, the coding residual e = y − Dα is assumed to follow a Gaussian distribution 25 .
However, in practice, the Gaussian prior on e may be invalid, especially when GEP data are corrupted and contain outliers. To deal with this problem, we can consider tumor classification from the viewpoint of Bayesian estimation, specifically MAP estimation. Based on MAP estimation, the sparse representation coefficient α can be expressed as follows 26 : α̂ = argmax α ln P(α|y). Then, by using the Bayesian formula P(α|y) ∝ P(y|α)P(α), we can obtain α̂ = argmax α {ln P(y|α) + ln P(α)}. Assuming that the elements e i of the coding residual e = y − Dα = [e 1 ; e 2 ; … e m ] are independent and identically distributed with probability density function (PDF) f θ (e i ), we obtain P(y|α) = Π i=1 m f θ (e i ). Meanwhile, assuming that the elements α j of the sparse representation coefficient α = [α 1 ; α 2 ; …; α n ] are independent and identically distributed with PDF f σ (α j ), we acquire P(α) = Π j=1 n f σ (α j ). Finally, the MAP estimation of α can be expressed as α̂ = argmax α {Σ i=1 m ln f θ (e i ) + Σ j=1 n ln f σ (α j )}. Letting ρ θ (e) = −ln f θ (e) and ρ σ (α) = −ln f σ (α), the above equation can be converted into the following: α̂ = argmin α {Σ i=1 m ρ θ (y i − d i α) + Σ j=1 n ρ σ (α j )} (18), where d i is the i-th row of D. The above model is called RRC. Two key issues must be considered to solve the RRC model: determining the distributions ρ θ (e) and ρ σ (α), and minimizing the energy function. For ρ θ (e), given the diversity of gene variations, predefining the distribution is difficult. In the RRC model, the unknown PDF f θ (e) is assumed to be symmetric, differentiable, and monotonic. Therefore, ρ θ (e) features the following properties: (1) ρ θ (0) is the global minimum of ρ θ (z); (2) ρ θ (z) = ρ θ (−z); and (3) if |z 1 | < |z 2 |, then ρ θ (z 1 ) < ρ θ (z 2 ). Without loss of generality, we let ρ θ (0) = 0. Meanwhile, ρ θ (e) is allowed to feature a more flexible shape, which adapts to the input testing sample y, to make the system more robust to outliers. Then, by Taylor expansion, Eq. (18) can be approximated as α̂ = argmin α {||W 1/2 (y − Dα)|| 2 2 + Σ j=1 n ρ σ (α j )}, where W is a diagonal matrix whose entries are updated via W i,i = ω θ (e i ) = ρ θ ′(e i )/e i .
Thus, the minimization of RRC focuses on calculating the diagonal weight matrix W. As ρ θ (e) is symmetric, differentiable, and monotonic, ω θ (e i ) can be assumed to be continuous and symmetric while being inversely proportional to e i . With these considerations, the logistic function, which features the same properties, is a good choice for ω θ (e i ) 41 . Thus, we can obtain the following: ω θ (e i ) = exp(μδ − μe i 2 )/(1 + exp(μδ − μe i 2 )) (21). For ρ σ (α), we can assume that each sparse representation coefficient α j follows a generalized Gaussian distribution, as only the representation coefficients associated with training samples from the target class should feature high absolute values. As we do not know beforehand the class of the test sample, a reasonable prior is that only a small percentage of the representation coefficients contain significant values. Then, we can use the generalized Gaussian PDF f σ (α j ) = β exp(−(|α j |/σ) β )/(2σΓ(1/β)), where Γ is the gamma function.
1. Set the initial value of the iteration count: t = 1.
2. Compute the coding residual: e (t) = y − y rec (t) , where the initial reconstruction y rec (1) = m, and m is the mean of all training samples.
3. Estimate the weight of each gene by Eq. (21): W i,i (t) = ω θ (e i (t) ), where μ and δ are estimated in each iteration, and δ is associated with the residual e (t) .
4. Solve the weighted regularized sparse representation problem: α* = argmin α {||(W (t) ) 1/2 (y − Dα)|| 2 2 + Σ j ρ σ (α j )}.
5. Update the sparse representation coefficients: α (t+1) = α (t) + υ (t) (α* − α (t) ), where 0 < υ (t) ≤ 1 is a suitable step size that can be searched from 1 to 0 by the standard line-search process 43 .
6. Reconstruct the test sample from the sparse representation coefficients and all metagenes: y rec (t+1) = Dα (t+1) , and let t = t + 1.

Go back to Step 2 until the convergence condition (||W (t) − W (t−1) || 2 /||W (t−1) || 2 < ϕ, where ϕ is a small positive scalar) is met, or the maximal number of iterations is reached.
Input: testing sample y ∈ R m . Output: label l of y.
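The reweighting loop above can be sketched as follows. To keep the coding step in closed form, this illustration replaces the sparsity prior Σ ρ σ (α j ) with a ridge penalty (our simplification; the paper instead solves the l1-regularized problem as in Table 5), and δ is taken as roughly the τ-quantile of the squared residuals, as described in the parameter-selection section:

```python
import numpy as np

def logistic_weight(e, mu, delta):
    """Logistic weight of Eq. (21): genes with large coding residuals
    (e_i^2 well above delta) receive weights near zero."""
    z = np.clip(mu * delta - mu * e ** 2, -50.0, 50.0)  # clip to avoid overflow
    return 1.0 / (1.0 + np.exp(-z))

def irrc(D, y, lam=1e-3, tau=0.9, s=8.0, n_iter=10):
    """Iteratively reweighted coding loop (IR3C-style sketch)."""
    m, n = D.shape
    a = np.zeros(n)
    y_rec = np.full(m, y.mean())               # crude initial reconstruction
    W = np.ones(m)
    for _ in range(n_iter):
        e = y - y_rec                          # coding residual
        phi = max(int(np.floor(tau * m)) - 1, 0)
        delta = np.sort(e ** 2)[phi]           # ~tau-quantile of squared residuals
        mu = s / max(delta, 1e-12)             # mu = s / delta with s = 8
        W = logistic_weight(e, mu, delta)      # diagonal of the weight matrix
        Dw = D * np.sqrt(W)[:, None]           # W^(1/2) D
        yw = np.sqrt(W) * y                    # W^(1/2) y
        a = np.linalg.solve(Dw.T @ Dw + lam * np.eye(n), Dw.T @ yw)
        y_rec = D @ a                          # reconstruct the test sample
    return a, W
```

The fixed-point structure is the same as in Table 7: residuals determine weights, weights determine a new coding, and the coding determines new residuals.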

Initialize D.
We initialize the atoms of D i as the eigenvectors of X i .

Update coefficient α.
Fix D and solve α i , i = 1, 2, …, c, one by one by solving Eq. (9) with the algorithm presented in Table 5.

Classify test sample y.
Fix α and D, and solve the sparse representation α ∧ of y with the algorithm presented in Table 7.
When the algorithm converges, we can classify the test samples according to Eqs. (24) and (25) below, where W final is the final weight matrix, α̂ i is the final sub-sparse representation coefficient vector associated with class i, and α̂ is the final representation coefficient vector. After determining the distributions ρ θ (e) and ρ σ (α), the energy function can be minimized with the iteratively reweighted RRC (IR 3 C) algorithm, which was designed by Yang et al. to solve the RRC model efficiently 26 ; in the generalized Gaussian PDF used there, Γ is the gamma function. The IR 3 C algorithm is described in Table 7.

Local center classifier. Equation (3) is the classification function of SRC-based methods, which considers only the discrimination capability of the representation residuals and not that of the representation vectors. Assuming that m i is the mean sparse representation coefficient vector of class X i , the mean vector m i can be viewed as the center of class X i in the transformed space determined by D. Thus, we call m i the local center. For tumor classification, when y originates from class i, the residual ||y − D i α̂ i || 2 2 should be small, whereas ||y − D j α̂ j || 2 2 , j ≠ i, should be large. In addition, the sparse representation coefficient vector α̂ should be close to m i but far from the mean vectors of the other classes. Considering the above factors, we define the following classifier: e i = ||W final 1/2 (y − D i α̂ i )|| 2 2 + w||α̂ − m i || 2 2 (24), where w is a parameter for balancing the contributions of the two terms to the classification. Finally, we can obtain the label of y according to the following formula: identity(y) = argmin i {e i } (25).

Algorithm of FDRRC. By combining the IR 3 C algorithm 26 and the Fisher discrimination dictionary learning model, we can obtain the algorithm of FDRRC. Table 8 shows the overall procedure of the algorithm.
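A sketch of the local center classification rule of Eqs. (24) and (25), with hypothetical argument names:

```python
import numpy as np

def local_center_classify(D_list, m_list, alpha_list, alpha, y, W, w):
    """Local center classifier:
    e_i = ||W^(1/2)(y - D_i alpha_i)||_2^2 + w * ||alpha - m_i||_2^2,
    identity(y) = argmin_i e_i.
    D_list[i] is class i's sub-dictionary, m_list[i] its local center
    (mean coefficient vector), alpha_list[i] the sub-vector of alpha
    associated with class i, and W the final diagonal weights."""
    sw = np.sqrt(W)
    errs = []
    for Di, mi, ai in zip(D_list, m_list, alpha_list):
        residual = np.sum((sw * (y - Di @ ai)) ** 2)   # weighted class residual
        center = w * np.sum((alpha - mi) ** 2)         # distance to local center
        errs.append(residual + center)
    return int(np.argmin(errs)), errs
```

Setting w = 0 recovers the residual-only rule of SRC-style classifiers; a positive w additionally pulls the decision toward the class whose coefficient center is nearest to α̂.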