Kernel-based Joint Feature Selection and Max-Margin Classification for Early Diagnosis of Parkinson’s Disease

Feature selection methods usually select the most compact and relevant set of features based on their contribution to a linear regression model; such features might therefore not be the best choice for a non-linear classifier. This is especially crucial for tasks whose performance depends heavily on the feature selection technique, such as the diagnosis of neurodegenerative diseases. Parkinson's disease (PD) is one of the most common neurodegenerative disorders; it progresses slowly while dramatically affecting the quality of life. In this paper, we diagnose PD using multi-modal neuroimaging data, investigating the brain regions known to be affected at the early stages. We propose a joint kernel-based feature selection and classification framework. Unlike conventional feature selection techniques that select features based on their performance in the original input feature space, we select the features that best benefit the classification scheme in the kernel space. We further propose kernel functions specifically designed for our non-negative feature types. We use MRI and SPECT data of 538 subjects from the PPMI database and obtain a diagnosis accuracy of 97.5%, which outperforms all baseline and state-of-the-art methods.


PD Staging, Data Acquisition and Preprocessing
The data used in this paper are acquired from the PPMI database 9, which is the first substantial study for identifying PD progression biomarkers. In this research, we use the subjects with both MRI and SPECT modalities available in the database, resulting in 369 PD and 169 Normal Control (NC) subjects. PD subjects in the PPMI study are de novo PD patients, newly diagnosed and unmedicated. The healthy/normal control subjects are both age- and gender-matched with the PD patients. The demographic information of the subjects is given in Table 1.
PD subjects are often evaluated using the widely used Hoehn and Yahr (H&Y) scale 27 to assess their staging. The H&Y scale defines broad categories that rate the motor function of PD patients on a 5-point scale (Stages 1-5). H&Y stages correlate with several factors, including motor decline, neuroimaging studies of dopaminergic loss, and deterioration in quality of life 28. Most PD studies evaluate disease progression by analyzing patients and the time taken for them to reach one of the H&Y stages. Subjects in the first stage have unilateral involvement only, often with minimal or no functional impairment; their symptoms are mild, inconvenient but not disabling. The second stage involves bilateral or midline involvement, still without impairment of balance; for these subjects, posture and gait are usually affected. Stage three shows the first signs of impaired reflexes, with significant slowing of body movements and moderately severe dysfunction. In the fourth stage, the disease is fully developed and severely disabling; the patient can still walk, but only to a limited extent, and might no longer be able to live alone. In the fifth (final) stage, the patient is confined to bed or bound to a wheelchair. The PD subjects in this study are mostly in the first two H&Y stages. As reported by the studies (http://www.ppmi-info.org/wp-content/uploads/2013/09/PPMI-WW-ADNI.pdf) in PPMI 9, among the PD patients at the time of their baseline image acquisition, 43% of the subjects were in stage 1, 56% in stage 2, and the rest in stages 3 to 5.
Previous clinical studies (e.g., the Braak staging scheme 29,30) show that the disease initiates in the brainstem, mid-brain and subcortical regions and, over time, propagates to many other brain regions. Thus, to diagnose PD and identify its imaging biomarkers at an early stage, we focus on structures in the brainstem and basal ganglia areas, along with the subcortical regions. Based on ref. 29, we employ the 36 ROIs depicted in Fig. 2a, which correspond to the brainstem (4 ROIs), substantia nigra and red nucleus (4 ROIs), limbic lobe (16 ROIs), insula (2 ROIs) and subcortical regions (10 ROIs). We extract features specifically from these ROIs, since we aim to investigate the diagnosis of de novo PD subjects. Including all ROIs in the brain would undermine the motivation of early diagnosis, since the remaining ROIs have previously been investigated and found either not clinically relevant to PD or affected only at very late stages 29-31.
For each subject, a 3D MPRAGE sequence was performed using 3T SIEMENS MAGNETOM TrioTim syngo scanners. T1-weighted images were acquired for 176 sagittal slices with the following parameters: repetition time = 2300 ms, echo time = 2.98 ms, flip angle = 9°, and voxel size = 1 × 1 × 1 mm³. The MR images are preprocessed by skull stripping and then segmented into White Matter (WM), Gray Matter (GM), and CerebroSpinal Fluid (CSF) tissues, using the technique in ref. 32. Then, the Anatomical Automatic Labeling (AAL) atlas with the above 36 labeled ROIs is registered to each subject's native space. Specifically, we used FLIRT from the FSL package 33 for affine alignment, followed by HAMMER 34 for nonlinear registration. The AAL atlas is warped to the subject according to the estimated deformation fields, and the ROI labels are then propagated onto the subject's native space. Finally, the WM and GM tissue volumes in each ROI are calculated as MRI features.
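As a minimal sketch of this final feature extraction step, assuming the tissue masks and the warped 36-ROI atlas are already in subject space as NIfTI files, the per-ROI tissue volumes could be computed as follows (the file names and the nibabel-based I/O are our own assumptions, not part of the original pipeline):

```python
import numpy as np
import nibabel as nib  # assumed here for NIfTI I/O

def roi_tissue_volumes(tissue_path, atlas_path, roi_labels):
    """Volume (mm^3) of one tissue type (e.g., GM or WM) inside each ROI.

    tissue_path: tissue probability/mask image in subject space.
    atlas_path:  AAL atlas warped to the same subject space.
    roi_labels:  iterable of the 36 ROI label values of interest.
    """
    tissue_img = nib.load(tissue_path)
    tissue = tissue_img.get_fdata() > 0.5                    # binarize the mask
    atlas = nib.load(atlas_path).get_fdata()
    voxel_vol = np.prod(tissue_img.header.get_zooms()[:3])   # mm^3 per voxel
    # Count tissue voxels inside each ROI and convert counts to volumes.
    return np.array([np.sum(tissue & (atlas == lbl)) * voxel_vol
                     for lbl in roi_labels])

# Hypothetical usage for one subject:
# gm = roi_tissue_volumes("sub01_gm.nii.gz", "sub01_aal36.nii.gz", range(1, 37))
# wm = roi_tissue_volumes("sub01_wm.nii.gz", "sub01_aal36.nii.gz", range(1, 37))
```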
On the other hand, SPECT images are primarily used for PD diagnosis and analysis. These images register the dopamine transporter levels in the striatum and are commonly known as DaTSCAN. To acquire this image, the ¹²³I-ioflupane neuroimaging radiopharmaceutical biomarker, which binds to the dopamine transporters in the striatum, is injected, and brain images are then acquired. To process these images, the PPMI study (DatScan SPECT Image Processing Methods for Calculation of Striatal Binding Ratio, August 2013) performed attenuation correction on the SPECT images, along with a standard 3D 6.0 mm Gaussian filter. Then, the images were normalized to the standard Montreal Neurological Institute (MNI) space. Next, the transaxial slice with the highest striatal uptake was identified, and the 8 hottest striatal slices around it were averaged to generate a single-slice image. On this average slice, the four caudate and putamen (left and right) ROIs, which lie in the striatum, are labeled and considered as target ROIs. The occipital cortex region is also segmented and used as a reference ROI. The count densities of these regions were used to calculate the striatal binding ratios (SBRs), as

$$\text{SBR} = \frac{\text{target ROI count density}}{\text{occipital reference count density}} - 1.$$

As can be seen in Fig. 2c, the caudate and putamen intensity values differ significantly between the PD and NC subjects.
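The SBR computation itself reduces to a one-line ratio once the mean count densities are extracted; the sketch below is a hypothetical illustration of that formula, not PPMI's actual processing code:

```python
def striatal_binding_ratio(target_density, occipital_density):
    """SBR for one striatal ROI: (target / reference) - 1.

    target_density:    mean count density in a target ROI
                       (left/right caudate or putamen).
    occipital_density: mean count density in the occipital reference ROI.
    """
    return target_density / occipital_density - 1.0

# Four SBR features per subject, e.g.:
# sbr = [striatal_binding_ratio(c, occ) for c in (l_caud, r_caud, l_put, r_put)]
```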

Notations. Throughout the paper, bold capital letters denote matrices (e.g., A) and small bold letters denote vectors (e.g., a). All non-bold letters denote scalars or functions. $a_i^j$ is the scalar in row $i$ and column $j$ of A, while $\mathbf{a}_i$ and $\mathbf{a}^j$ denote the $i$th row and the $j$th column of A, respectively. $\|\mathbf{a}\|_2^2 = \sum_i a_i^2$ and $\|\mathbf{a}\|_1 = \sum_i |a_i|$ represent the ℓ2 and ℓ1 norms of a, respectively.

Kernel-based Feature Selection and Max-Margin Classification
Feature or variable selection is the process of picking a subset of discriminative features to best construct a model for classification or regression. Feature selection approaches fall mainly into two types: unsupervised and supervised. The former selects features without considering the class labels, while the latter aims to select features with maximum relevance to the class labels and minimum redundancy among the selected features. Among the supervised feature selection methods, sparse feature selection 6 has gained much attention in recent years, due to its simplicity and superior performance.
Suppose we have $N$ training samples with $d$ features each. We can arrange the feature vectors in a matrix $\mathbf{X} \in \mathbb{R}^{d \times N}$ and their respective labels in $\mathbf{y} \in \mathbb{R}^{N \times 1}$. Sparse feature selection aims at minimizing the objective

$$\min_{\mathbf{w}} \ \|\mathbf{y} - \mathbf{X}^\top \mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_1,$$

to best represent the samples using the weight vector $\mathbf{w} \in \mathbb{R}^d$. This weight vector is constrained with an ℓ1 norm to obtain a compact set of discriminative features. Obviously, the set of features selected under this setting is appropriate if we plan to build a linear classification model (e.g., linear SVM), because these features are selected to minimize redundancy and maximize relevance to the class labels in the original feature space. For a non-linear classification task, however, we should select the features that replicate the same minimum-redundancy and maximum-relevance criteria in the kernel space.
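For concreteness, the conventional decoupled pipeline described above (ℓ1-regularized regression for selection in the input space, followed by a separately trained non-linear classifier) can be sketched with scikit-learn on toy stand-in data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 50))               # N x d toy data, non-negative features
y = np.sign(X[:, 0] + X[:, 1] - 1.0)    # labels driven by the first two features
y[y == 0] = 1.0

# Sparse feature selection in the ORIGINAL feature space: fit w with an
# l1 penalty, then keep the features with non-zero weights.
w = Lasso(alpha=0.01).fit(X, y).coef_
selected = np.flatnonzero(w)

# The surviving subset is handed to a (possibly non-linear) classifier;
# the decoupling criticized above happens exactly at this hand-off.
clf = SVC(kernel="rbf").fit(X[:, selected], y)
```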
Non-linear classification is usually achieved by first projecting the original feature vectors into a high-dimensional space using a non-linear mapping function (referred to as φ(·)), and then performing the classification in that space. However, instead of explicitly defining a mapping function, a kernel function is often used, which defines the similarity between the samples (or the feature vectors) 8. This is simply because many machine learning algorithms (e.g., SVMs) can be expressed entirely in terms of dot products 8, and, under some conditions, kernel functions can be expressed as dot products in a (possibly infinite-dimensional) feature space.
Here, in order to both select features and classify the data, we adopt a formulation similar to the kernel-based SVM, with a customized kernel that accounts for feature selection. We propose to apply the kernel function to each single feature and define the aggregate kernel as a weighted sum of all these single-feature kernels. In this way, we can select the features that contribute most to constructing a better classification model, through ℓ1 regularization on the kernel weight vector.
Formulation. The original kernel-based SVM formulation finds a max-margin decision boundary in the kernel space, using a loss function $L$ with a trade-off hyperparameter $C$:

$$\min_{f} \ \|f\|_{\mathcal{H}}^2 + C \sum_{j=1}^{N} L\big(y_j, f(\mathbf{x}_j)\big),$$

where $f$ is associated with a Reproducing Kernel Hilbert Space $\mathcal{H}$, and $\mathbf{x}_j$ and $y_j$ are the $j$th sample and its corresponding label, respectively. In terms of a non-linear mapping function, $f$ is defined as

$$f(\mathbf{x}) = \mathbf{w}^\top \varphi(\mathbf{x}) + b.$$

As discussed earlier, we do not directly use the mapping function φ(·). To this end, the representer theorem 35 is incorporated, which states that the solution for $\mathbf{w}$ can always be represented as a linear combination of the training data (in the mapped space). Therefore:

$$\mathbf{w} = \sum_{j=1}^{N} \beta_j \varphi(\mathbf{x}_j) \quad \Rightarrow \quad f(\mathbf{x}) = \sum_{j=1}^{N} \beta_j k(\mathbf{x}_j, \mathbf{x}) + b.$$

Hence, the optimization problem for the SVM reduces to solving for $\beta_j$, $\forall j$.
Here, we would further like to select the features that can best construct the non-linear SVM model. Accordingly, we construct a single kernel for each feature, and then aggregate all these kernels through a weighted average. The kernel between two samples $\mathbf{x}_j$ and $\mathbf{x}$ is therefore calculated as

$$k(\mathbf{x}_j, \mathbf{x}) = \sum_{i=1}^{d} \alpha_i \, k_i\big(x_j^i, x^i\big), \quad \alpha_i \ge 0.$$

This is similar to multiple kernel learning frameworks 36,37, while building each kernel on only a single feature. Applying the newly-introduced kernel $k$ on all training samples, we have $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$. Figure 3 shows the proposed process of building the kernel matrix over all samples (3b), compared with the conventional kernel construction (3a). Now, by regularizing the weight vector α, we can enforce the selection of the most discriminative features projected in the kernel space. The objective function is therefore formed as

$$\min_{f, \boldsymbol{\alpha}} \ \|f\|_{\mathcal{H}}^2 + C \sum_{j=1}^{N} L\big(y_j, f(\mathbf{x}_j)\big) + \lambda \|\boldsymbol{\alpha}\|_1, \quad \text{s.t.} \ \boldsymbol{\alpha} \ge 0. \qquad (6)$$

However, different features and kernels usually characterize different aspects of the data. Therefore, we propose to use different kernel types on all features, and let our method choose which kernel type and which feature best build the classification model. Accordingly, if we define κ different types of kernels $k^m$, $m \in \{1, \dots, \kappa\}$, we combine them over the $d$ features as

$$k(\mathbf{x}_j, \mathbf{x}) = \sum_{m=1}^{\kappa} \sum_{i=1}^{d} \alpha_i^m \, k^m\big(x_j^i, x^i\big). \qquad (7)$$

Since there is an ℓ1 regularization on this weight vector, we can include different kernel types and determine which kernel contributes more to classifying the data, based on the selected feature(s).
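A small sketch of how the aggregate kernel of Equation (7) could be assembled with NumPy is shown below; the layout of α as κ contiguous blocks of d per-feature weights is our own convention, not the paper's:

```python
import numpy as np

def stacked_feature_kernels(X, kernel_fns):
    """One kernel matrix per (kernel type m, feature i) pair, as in Eq. (7).

    X: d x N data matrix (columns are samples). kernel_fns: list of kappa
    functions, each mapping a 1 x N single-feature matrix to an N x N kernel.
    Returns an array Ks of shape (kappa * d, N, N).
    """
    d, _ = X.shape
    return np.stack([k(X[i:i + 1, :]) for k in kernel_fns for i in range(d)])

def aggregate_kernel(Ks, alpha):
    """K = sum_m sum_i alpha_i^m K_i^m, with alpha >= 0 of length kappa * d."""
    return np.tensordot(alpha, Ks, axes=1)   # -> N x N kernel matrix
```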

Optimization.
To solve the optimization problem in (6), we only need to ensure that the loss function is differentiable. We can then decompose the problem into two convex subproblems and solve them using a gradient descent method. The first subproblem solves for $f$ while assuming α is fixed; this reduces to a simple SVM optimization with the kernel characterized by the fixed α:

$$\min_{f} \ \|f\|_{\mathcal{H}}^2 + C \sum_{j=1}^{N} L\big(y_j, f(\mathbf{x}_j)\big).$$

Then, with $f$ fixed, we determine the best values for α by solving the second subproblem:

$$\min_{\boldsymbol{\alpha} \ge 0} \ \|f\|_{\mathcal{H}}^2 + C \sum_{j=1}^{N} L\big(y_j, f(\mathbf{x}_j)\big) + \lambda \|\boldsymbol{\alpha}\|_1.$$

The only non-smooth term is the last one. However, since we need to enforce the constraint α ≥ 0, we use a projection operator $\mathcal{P}(\boldsymbol{\alpha}) = \max(0, \boldsymbol{\alpha})$ at each iteration. This constraint ensures that the last term is also differentiable with respect to α (since $\|\boldsymbol{\alpha}\|_1 = \sum_i \alpha_i$ for α ≥ 0). Therefore, this objective can be simply optimized using projected gradient descent 38.
To guarantee the convergence of the algorithm, based on refs 36,39, we choose the gradient step size according to the Armijo rule 40. As a result, the optimization process alternates between the above two subproblems until convergence. The second subproblem learns the kernel, embedded in $f(\mathbf{x}_j)$, by optimizing the weights α, while the first one uses any SVM solver of choice to learn the SVM function.
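The following condensed sketch illustrates this alternating scheme on the stacked per-feature kernels from the earlier sketch. The gradient of the dual objective with respect to α is formed from the SVM dual coefficients in the standard MKL fashion, and a fixed step size replaces the Armijo line search for brevity; this is an assumption-laden sketch, not the authors' exact implementation:

```python
import numpy as np
from sklearn.svm import SVC

def fit_joint(Ks, y, lam=1.0, C=10.0, lr=1e-3, iters=100, tol=1e-5):
    """Alternate between (1) an SVM solve with alpha fixed and
    (2) a projected gradient step on alpha with the SVM solution fixed."""
    M, N, _ = Ks.shape
    alpha = np.full(M, 1.0 / M)                  # uniform initialization
    for _ in range(iters):
        K = np.tensordot(alpha, Ks, axes=1)      # current aggregate kernel
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        delta = np.zeros(N)                      # delta_j = y_j * beta_j
        delta[svm.support_] = svm.dual_coef_.ravel()
        # Margin-term gradient per kernel, plus lam from the l1 term
        # (for alpha >= 0, ||alpha||_1 is just the sum of its entries).
        grad = np.array([-0.5 * delta @ Km @ delta for Km in Ks]) + lam
        new_alpha = np.maximum(0.0, alpha - lr * grad)   # projection P(alpha)
        if np.linalg.norm(new_alpha - alpha) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha, svm
```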
Algorithm 1 summarizes the whole process and includes the stopping criteria. It is important to note that the main objective function (as in (6)) is convex, and therefore has a global optimum. The above two steps of projected gradient descent are repeated until convergence to the global optimum; it has been shown that, under such settings with a convex objective function, projected gradient descent converges to the global optimum 38.
Note that our proposed framework for joint feature selection and classification integrates the ℓ1 regularization into the max-margin classifier (i.e., SVM), as in (6). Since the optimization process, as described above, consists of two optimization phases, the same idea can be applied to other kernel methods, by using our customized kernel and incorporating the ℓ1 regularization into their formulations.
Kernel Function. The choice of kernel function is very important and depends on the application and its associated data types. Kernels are essentially a measure of similarity between samples. One of the most popular kernels is the Radial Basis Function (RBF), or Gaussian, kernel, formulated as

$$k_{\text{RBF}}(\mathbf{x}_j, \mathbf{x}) = \exp\left(-\frac{D(\mathbf{x}_j, \mathbf{x})}{2\sigma^2}\right),$$

where $D(\cdot,\cdot)$ is a distance measure between two samples and σ is the kernel hyperparameter. Usually, the squared Euclidean norm is used as the distance measure between two samples:

$$D_{\text{EUC}}(\mathbf{x}_j, \mathbf{x}) = \|\mathbf{x}_j - \mathbf{x}\|_2^2.$$

However, the distance metric should be chosen based on the characteristics of the data. As described earlier, in this work we use the volumes of ROIs from MRI and SPECT images, and therefore our features are all non-negative. Hence, our feature vectors resemble histogram features or probability distribution functions (PDFs), if normalized. Many previous works normalize the data using z-scores, which converts the feature values to a common scale with a mean of 0 and a standard deviation of 1; however, this changes both the characteristics and the physical meaning of the features. Here, we instead normalize the feature vectors to the range [0, 1], making them similar to PDFs or the histogram-like features widely used in computer vision applications. A number of other distance metrics exist for non-negative or histogram-like features 41. One such distance is the χ² distance:

$$D_{\chi^2}(\mathbf{x}_j, \mathbf{x}) = \sum_{i=1}^{d} \frac{(x_j^i - x^i)^2}{x_j^i + x^i}.$$

Another popular distance used for such purposes in the computer vision area is the Earth Mover's Distance (EMD) 42, which evaluates the dissimilarity between two multi-dimensional distributions in the feature space. Intuitively, given two distributions, we can consider one as a mass of earth spread in space and the other as a collection of holes in that same space; EMD measures the least amount of work required to fill the holes with earth. For the 1-dimensional case, as in our application, EMD has a simple closed-form solution:

$$D_{\text{EMD}}(\mathbf{x}_j, \mathbf{x}) = \sum_{i=1}^{d} \big|\mathrm{cdf}_i(\mathbf{x}_j) - \mathrm{cdf}_i(\mathbf{x})\big|,$$

where $\mathrm{cdf}_i(\cdot)$ is the cumulative distribution function up to element $i$ and $|\cdot|$ is the absolute value function.
One of the popular kernels used for histogram-like features is the intersection kernel (histogram intersection kernel, or HIK), which captures the similarity of histogram-like features with the following definition:

$$k_{\text{HIK}}(\mathbf{x}_j, \mathbf{x}) = \sum_{i=1}^{d} \min\big(x_j^i, x^i\big).$$

Since, depending on the data distribution, some datasets might perform better with linear kernels (i.e., in the native feature space), we also define the linear kernel function as

$$k_{\text{LIN}}(\mathbf{x}_j, \mathbf{x}) = \mathbf{x}_j^\top \mathbf{x}.$$

To find the best kernel, which can differ for each dataset, we use all these kernel definitions and build our kernel through Equation (7), summing over all kernel types. The set of all kernels is therefore defined as the linear kernel, the RBF kernel with the three different distance metrics (squared Euclidean, χ², and EMD), and the intersection kernel, giving κ = 5 different kernel types. Our optimization framework, through optimizing α, selects which feature(s) and kernel type(s) are best for the problem with the available dataset. We will examine this proposed kernel and compare the results under different kernel and hyperparameter settings.
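A NumPy sketch of these five ingredients (linear, RBF over the three distances, and HIK) for column-sample data is given below; the vectorized broadcasting is our own choice. Applying any of these functions to a single row of X yields exactly the per-feature kernels used in Equation (7):

```python
import numpy as np

def rbf(D, sigma=0.5):
    """RBF kernel on a precomputed pairwise distance matrix D."""
    return np.exp(-D / (2.0 * sigma ** 2))

def sq_euclidean(X):                     # X: d x N, features scaled to [0, 1]
    sq = np.sum(X ** 2, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * X.T @ X

def chi2(X, eps=1e-12):                  # chi-squared distance
    num = (X[:, :, None] - X[:, None, :]) ** 2
    den = X[:, :, None] + X[:, None, :] + eps
    return np.sum(num / den, axis=0)

def emd_1d(X):                           # closed-form 1-D EMD via cumulative sums
    cdf = np.cumsum(X, axis=0)
    return np.sum(np.abs(cdf[:, :, None] - cdf[:, None, :]), axis=0)

def hik(X):                              # histogram intersection kernel
    return np.sum(np.minimum(X[:, :, None], X[:, None, :]), axis=0)

def linear(X):
    return X.T @ X

# The kappa = 5 kernel types of the proposed formulation:
kernel_fns = [linear,
              lambda X: rbf(sq_euclidean(X)),
              lambda X: rbf(chi2(X)),
              lambda X: rbf(emd_1d(X)),
              hik]
```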

Experimental Results
In this section, we conduct several experiments on synthetic data and on the real PD diagnosis dataset. For evaluation, we compare our results with several baseline methods. Baseline classifiers under comparison include the support vector machine (SVM), SparseSVM 43, sparse feature selection followed by an SVM (SFS + SVM), and a multiple kernel learning (MKL) framework for SVM 37, in which one kernel is learned for each modality and feature type (GM, WM and SBR, as described earlier) and classification is learned over these kernels. To further evaluate the effect of the feature selection (ℓ1 regularization), we run the same objective in (6) with conventional ℓ2 regularization on the vector α, denoted as 'Proposed ℓ2-reg'. In addition, we also employ the following widely used linear and non-linear feature selection strategies followed by an SVM, and report the results: t-test 44, elastic-net 45, AutoEncoder-Restricted Boltzmann Machine (AE-RBM) 46, and the mutual-information-based minimum-redundancy maximum-relevance (mRMR) feature selector 7. For all experiments the hyperparameters are fixed to the values reported below; this avoids excessive hyperparameter tuning and makes the results easily reproducible. All results are generated through 10-fold cross-validation. We found that fixing λ = 1 generates the best results, while the best values for the other hyperparameters are σ = 0.5 and C = 10.
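The evaluation loop can be sketched as below, reusing the hypothetical `fit_joint` and `kernel_fns` helpers from the earlier sketches with the fixed hyperparameters above (λ = 1, σ = 0.5, C = 10):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def cross_validate(Ks, y, fit_fn, n_folds=10):
    """10-fold CV over precomputed per-feature kernels Ks (kappa*d x N x N)."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    accs = []
    for tr, te in skf.split(np.zeros((len(y), 1)), y):
        alpha, svm = fit_fn(Ks[:, tr][:, :, tr], y[tr])
        # Test kernel: test samples against training samples.
        K_te = np.tensordot(alpha, Ks[:, te][:, :, tr], axes=1)
        accs.append(accuracy_score(y[te], svm.predict(K_te)))
    return float(np.mean(accs))
```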

Synthetic Data.
To verify the proposed method and analyze its performance, we generate two sets of synthetic data: one linearly separable and one not linearly separable (and hence expected to be best classified non-linearly). In both cases, the data comprise two classes of 100 samples each, with feature vectors of dimensionality 50. For the linearly separable experiment, we first construct two independent subspaces of dimensionality 50, similar to refs 2,47. These two subspaces are constructed with bases $\mathbf{U}_1 \in \mathbb{R}^{50 \times 50}$ (a random orthogonal matrix) and $\mathbf{U}_2 = \mathbf{T}\mathbf{U}_1$, where $\mathbf{T}$ is a random rotation matrix. We then sample 100 vectors from each subspace through $\mathbf{X}_i = \mathbf{U}_i \mathbf{Q}_i$, $i \in \{1, 2\}$, with $\mathbf{Q}_i$ a 50 × 100 matrix whose entries are independent and identically distributed (i.i.d.) from $\mathcal{N}(0, 1)$. This leads to a binary classification problem in which the two classes correspond to the above two subspaces. For the non-linearly separable data, we sample the two classes from two spheres in the 50-dimensional space with different radii. The first class is sampled i.i.d. from a sphere of radius 0.5. For the second class, we construct another sphere of radius 1.0 with the same center as the first one, and sample the data points i.i.d. from the region between the two spheres.
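The two sets can be generated along the following lines; the randomization details (QR factorizations for the orthogonal bases and uniformly drawn radii) are our own simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 100                                      # dimensionality, samples/class

# Linearly separable set: two random 50-D subspaces.
U1, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
T, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random rotation (up to reflection)
U2 = T @ U1
X_lin = np.hstack([U1 @ rng.standard_normal((d, n)),    # class 1: X1 = U1 Q1
                   U2 @ rng.standard_normal((d, n))])   # class 2: X2 = U2 Q2

# Non-linearly separable set: two concentric spheres of radii 0.5 and 1.0.
def sample_shell(r_lo, r_hi, n):
    v = rng.standard_normal((d, n))
    v /= np.linalg.norm(v, axis=0)                  # directions on the unit sphere
    return v * rng.uniform(r_lo, r_hi, n)           # radii drawn uniformly

X_nonlin = np.hstack([sample_shell(0.0, 0.5, n),    # class 1: inner sphere
                      sample_shell(0.5, 1.0, n)])   # class 2: between the spheres

y = np.r_[-np.ones(n), np.ones(n)]
# Each feature is then min-max normalized to [0, 1], as described above.
```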
To visualize the generated data, we project the samples from the two datasets into the 2D space using t-SNE 48. The t-SNE projection visualizes high-dimensional data by giving each sample a location in a two-dimensional map, revealing the neighborhood structure of the sample manifold at many different scales 48. Figure 4 shows the 2D t-SNE projection of the generated data. As can be seen, the first dataset (4(a)) can likely be separated linearly, while the second one (4(b)) is not linearly separable, so we would expect better results from the non-linear kernels. Each feature in the two experiments is normalized to the range [0, 1], mimicking our neuroimaging features, which are volumetric and non-negative. The proposed method is run on the two datasets separately. The weight vector α, defined in (7) and optimized in (6), contains the weight of each feature ($\forall i \in \{1, \dots, d\}$) for each of the κ kernel types ($\forall m \in \{1, \dots, \kappa\}$). The mean weight of the features under each kernel is a good indication of which kernels were selected as useful for the classification task; these mean values are reported in Table 2. Since we impose an ℓ1 regularization on the weight vector, it sparsely selects a compact set of features and kernels; features under less useful or redundant kernels are therefore typically given zero weights.
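Given the optimized weight vector, the per-kernel mean weights reported in Table 2 reduce to a block-wise average, assuming the κ-blocks-of-d layout used in the earlier sketches:

```python
import numpy as np

def mean_weight_per_kernel(alpha, kappa, d):
    """Mean selected-feature weight for each kernel type m:
    (1/d) * sum_i alpha_i^m, with alpha as kappa blocks of d weights."""
    return alpha.reshape(kappa, d).mean(axis=1)

# e.g., with kappa = 5 kernel types:
# table2_row = mean_weight_per_kernel(alpha, 5, d)
```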
As can be seen, for the linearly separable data (Fig. 4(a)), the linear and histogram intersection kernels are selected, while for the non-linear data, the RBF kernels with different distance metrics are selected along with HIK. One of the main conclusions here is that HIK is almost equally important for both linear and non-linear volumetric data; we verify this on the real neuroimaging dataset in the following subsections. HIK has also been found very useful in the computer vision area, in which the features are non-negative and histogram-like. This type of kernel can model non-linear relations, while also being able to model piece-wise linear problems 49, which is why it proves useful for both linear and non-linear problems.
It is noteworthy that our method works equally well for any linearly or non-linearly separable data: it achieves 100% accuracy in both experiments, while selecting different kernel types according to the characteristics of the data.

PD Diagnosis. As described before, we use both MRI and SPECT modalities of 538 subjects in total (169 NC, 369 PD) from the PPMI database to evaluate the proposed method. Table 3 compares the diagnosis accuracies of the proposed and baseline methods, analyzing the effects of the different modalities. The proposed method clearly generates the best results when using both MRI and SPECT modalities. Additionally, Table 4 shows the results of the proposed and baseline kernel methods under different kernel settings using the MRI + SPECT modalities. The proposed kernel, which combines all kernel types, provides the best results compared to the other five kernels. Our method (Proposed ℓ1-reg and ℓ2-reg) is the only one that can interweave all kernels and concurrently select the best kernel and feature combination. The other kernel methods (MKL, SFS + SVM and SVM, as listed in the table) can only operate on a single kernel and hence interpret the data from a single aspect (e.g., linearity or non-linearity).
In addition, Fig. 5 compares the sensitivity, specificity and area under the ROC curve (AUC) of the methods. As can be seen, the proposed framework, using the proposed interweaved kernel on the combination of SPECT and MRI data, achieves the best results with an accuracy of 97.5%. This suggests that the best features are selected in the non-linear kernel space, compared with the results of the other methods, specifically SFS + SVM (90.8% accuracy), in which sparse feature selection is conducted in the original feature space followed by a non-linear classifier. This supports our premise that, if a non-linear classification is intended, we should select the features that form the best classifier in the kernel space, not in the original feature space. Furthermore, our proposed kernel achieves an accuracy improvement of over 1.4% compared with the squared Euclidean distance widely used in previous studies. On the other hand, comparing the results of ℓ1 and ℓ2 regularization of the kernel weight vector indicates that the proposed method (ℓ1-regularized) achieves an accuracy (and AUC) superior to the same formulation with ℓ2 regularization. This is because ℓ1 regularization is suited to recovering sparse signals, and can reduce overfitting and increase generalization capability, thus achieving better performance on unseen testing data. ℓ2 is preferable for data that are not sparse, whereas our features are extracted from brain ROIs and ℓ1 enforces selecting the most discriminant ROIs. To further evaluate the feature selection performance, we conducted experiments with several state-of-the-art linear and non-linear feature selection/reduction and classification techniques used in such neurodegenerative diagnosis applications, which are already included in Table 3.

To analyze the effect of the proposed kernel and the contribution of each single kernel in building the final max-margin classification model, we calculate the mean weight (in the vector α) of the features for each single kernel. For the $m$th kernel, this mean weight is calculated as

$$\bar{\alpha}^m = \frac{1}{d} \sum_{i=1}^{d} \alpha_i^m.$$

This measure shows how often features computed under that kernel are selected, and therefore indicates the importance and significance of that kernel type. Figure 6 shows these mean weight values. As can be seen, the intersection kernel (HIK) again has the largest weight, while the two other variations of the RBF kernel (i.e., with the EMD and Euclidean distances) receive smaller weights but remain important. The other two kernels, linear and RBF with the χ² distance, do not appear useful for the task at hand. This is aligned with the non-linear synthetic experiment and shows that selecting features in accordance with the classifier can boost performance.

To analyze the hyperparameter λ in the objective function (6), we plot the accuracy and the AUC as a function of λ in Fig. 7. This hyperparameter is a trade-off controlling how compact the selected feature set should be. As the figure shows, the whole process is not strongly dependent on the setting of this hyperparameter and performs similarly well over a wide range of values.
One of the major advantages of the proposed technique is that we can analyze the weight vector α and determine which ROIs' features are given non-zero mean weights over the 10 folds of cross-validation. In this way, we can draw conclusions about the importance of each single ROI for the specific disease. The ROIs selected by the proposed method are visualized in Fig. 8, and Table 5 lists the specific ROIs and the tissue types or modalities that were selected. The last row of this table shows the mean values of the coefficient vector α for each specific modality (feature type), indicating the significance and contribution of each feature type in building the kernel-based max-margin classifier. The SPECT features are the most useful, as they are highly discriminative for the task of PD diagnosis; this is also easily inferred by comparing the two subject samples in Fig. 2c. Among the MRI features, WM volumes are more compelling, since the deep brain regions contain more WM volume and fibers. The selected ROIs are also aligned with previous studies 2,20,29,30,50. Specifically, the putamen and caudate regions, along with the brainstem and many subcortical regions, have consistently been studied and found important for assessing PD development.
To analyze the convergence behavior of the proposed method, we plot the value of the objective function in (6), for each of the 10 folds of the experiment on PPMI using both MRI and SPECT modalities, as a function of the number of iterations of the alternating optimization. The plots are illustrated in Fig. 9. As can be seen, the algorithm converges within a small number of iterations for all data splits.
To analyze the significance of the obtained results, we conduct a permutation test 51, a non-parametric test that makes no assumption about the data distribution. The permutation test investigates whether the classifier has found a real class structure, or whether the observed accuracy was obtained by coincidence. To this end, the classification scheme is repeated with randomly permuted class labels π different times (π = 100 in our experiments). A p-value is then calculated as the proportion of runs for which the misclassification rate is better than the original classification error. For our case, we obtained a p-value of less than 0.01, which signifies that the classification error on the original data is indeed substantially small and that the classifier is not generating these results by chance 51.
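A sketch of such a permutation test is shown below; `evaluate` stands for the full cross-validated pipeline (a hypothetical callable returning the misclassification rate), and the +1 smoothing is a common convention rather than the paper's exact formula:

```python
import numpy as np

def permutation_p_value(evaluate, y, n_perm=100, seed=0):
    """Fraction of label permutations whose error beats the true error."""
    rng = np.random.default_rng(seed)
    err_true = evaluate(y)                       # error with the real labels
    hits = sum(evaluate(rng.permutation(y)) <= err_true for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)             # smoothed permutation p-value
```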
Furthermore, we use a bootstrapping procedure to statistically validate the performance of the proposed method and to show, in an alternative way, that the obtained results are not due to over-fitting. Since we only have one dataset, a statistic computed on it (classification accuracy, in our case) gives no indication of its variability. Accordingly, we create a large number of datasets through bootstrapping, by subsampling from the original dataset. This yields a distribution for the accuracy, so the methods can be compared statistically. To this end, we resample $\mathcal{S}$ subjects with replacement from the dataset. Then, we use 90% of these subjects to train the classifier, which is then used to classify the remaining 10% of the subjects. This procedure is repeated $\mathcal{T}$ different times, leading to $\mathcal{T}$ accuracy results. For our case, we set $\mathcal{S}$ = 400 and $\mathcal{T}$ = 500. Since the subjects are sampled with replacement, the trials are independent and identically distributed (i.i.d.). Therefore, we can compare the methods using their $\mathcal{T}$ bootstrap trial performances. Figure 10 shows the histograms of the $\mathcal{T}$ trials for our proposed method (with ℓ1-regularization) and the baseline MKL method.
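This bootstrap procedure can be sketched as follows, with `fit_predict` as a hypothetical train-and-score callable:

```python
import numpy as np

def bootstrap_accuracies(X, y, fit_predict, n_subjects=400, n_trials=500, seed=0):
    """Resample subjects with replacement; train on 90%, test on the rest."""
    rng = np.random.default_rng(seed)
    accs = np.empty(n_trials)
    for t in range(n_trials):
        idx = rng.choice(len(y), size=n_subjects, replace=True)
        cut = int(0.9 * n_subjects)
        tr, te = idx[:cut], idx[cut:]
        accs[t] = fit_predict(X[:, tr], y[tr], X[:, te], y[te])
    return accs   # accuracy distribution, e.g., for the histograms of Fig. 10
```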
Over all trials, our proposed method and the baseline MKL method achieved mean accuracies (± standard deviation) of 95.5 ± 1.1% and 93.0 ± 2.2%, respectively. As can be seen, our proposed method has a better mean accuracy with a tighter standard deviation than the baseline MKL method. In addition, we compare our proposed method against several previously published state-of-the-art methods for the same purpose (PD diagnosis). The comparisons are provided in Table 6, which lists, for each method, the number of subjects, the methodology, and the neuroimaging modalities used. As can be seen, our results are superior to those of all previous works, while our method is evaluated on a large dataset. One important observation is that MRI data alone are much less dependable for the diagnosis task: when a small dataset is utilized, the solutions are prone to overfitting, which can explain the relatively reasonable reported results, whereas with large datasets, MRI-only diagnosis results degrade. Furthermore, the comparison with the work of Prashanth et al. 18 shows that the combination of MRI and SPECT data boosts the diagnosis performance. It is important to note that the experiments in ref. 18 were conducted on the same PPMI dataset, but the authors pooled subjects from multiple time points. This can be prone to several issues: (1) subjects at later time points, who were diagnosed as PD at earlier time points, are by then in the later stages of PD, and (2) different time points of the same subject can be split between the training and testing sets, which unfairly increases the performance. Although the comparison is therefore not entirely fair, we still outperform their results by combining multiple modalities and proposing a more robust method.

Conclusion
In this paper, we introduced a kernel-based feature selection scheme that selects the features and kernels inducing the best classification performance in the kernel space. Furthermore, we presented a kernel function that interweaves different kernel types, in accordance with our feature types, while tolerating both linear and non-linear properties of the data. The results indicate that the proposed framework for joint kernel-based feature selection and max-margin classification yields the best performance for the diagnosis of Parkinson's disease when using features from both SPECT and MRI modalities. We tested the proposed method on a large dataset and obtained results superior to those of all previous studies on PD diagnosis.