Introduction

Diagnosis of neurodegenerative diseases based on neuroimaging data has remarkably improved patient treatment strategies. However, early diagnosis is a very challenging task, due to various sources of inter- and intra-subject variability, within and between different groups, as well as deviations induced by image acquisition and preprocessing procedures. In addition, the need for the identification of biomarkers for the disease further intensifies these challenges. Imaging biomarkers are essentially a set of image-based features, and therefore feature selection techniques are heavily researched (e.g., refs 1 and 2) and have yielded notable results.

Many previous works in the literature have investigated both linear and non-linear classification models for the task of disease diagnosis using neuroimaging data1,3,4. One of the conclusions in almost all these works is that linear and non-linear models obtain similar performances for disease diagnosis based on neuroimaging data. However, a very crucial step in these works is the feature selection procedure, which is often overlooked. Current methods use either unsupervised (e.g., Fisher score5) or supervised feature selection (e.g., sparse feature selection6, or minimum-redundancy maximum-relevancy7) techniques. When a linear feature selection technique is used, the selected features are those most linearly correlated with the class labels (maximum relevance), while being least correlated with each other (minimum redundancy), in the input feature space. Nonetheless, this linearly selected feature set would not be the best for a non-linear classifier, which maps the input feature space to a higher dimensional space (denoted as kernel space). Accordingly, for the case of non-linear classification, it would be better to select the features that result in the best performance in the non-linear kernel space, rather than in the original input feature space, as illustrated in Fig. 1.

Figure 1: Conventional methods (e.g., refs 3 and 7) usually select features in the original input feature space and then learn a classifier based on the selected features (Top).

However, if a non-linear classification is intended, it is better to select the features that can best classify the data in the kernel space (Bottom). Here, φ(·) denotes a non-linear mapping function.

On the other hand, the choice of the mapping function to project the input feature space to a higher dimensional space is itself challenging, and dependent on the application and its associated data types. These non-linear projections are often modeled using kernel functions8. In this paper, we propose a new strategy to select the features that can best construct a classifier in the kernel space, as opposed to the conventional methods that select features only in the original input feature space. In this way, if we use a non-linear kernel, the selected features work best in the non-linear kernel space, or if a linear kernel is used, those features will be best for a linear classifier. To take advantage of more kernel types, we define a new kernel which interweaves different linear and non-linear kernels, so that it takes advantage of both types of models. We apply our method to the problem of Parkinson’s disease (PD) diagnosis using the Parkinson’s Progression Markers Initiative (PPMI) database9, available at http://www.ppmi-info.org/data. We use densities in Regions-of-Interest (ROIs) as features from Magnetic Resonance Imaging (MRI) and Single-Photon Emission Computed Tomography (SPECT) modalities. These features are volumetric non-negative values, and therefore we incorporate such information in our proposed kernel function.

It is important to note that, based on various studies in the literature10,11, non-linear classifiers can usually be fit better if the data span a lower-dimensional space. Feature selection methods are often used in many neuroimaging applications12,13,14, leading to a limited number of selected features, and therefore we are dealing with a low-dimensional space for the classification task. This observation motivates us to specifically investigate non-linear models for classifying these data. However, as mentioned above, in this paper we argue that if one chooses a non-linear classifier, the feature selector should be designed accordingly, and hence we develop a joint model. At the same time, it is not appropriate to assume that a non-linear model is always better than a linear model (or the native feature space). As a result, we formulate the framework such that both linear and non-linear kernels can be utilized, and the model selects whichever operates best on the specific dataset.

Note that current practical diagnosis of PD mainly depends on clinical symptoms. Clinically, PD is characterized by tremor at rest, rigidity, akinesia (or bradykinesia), postural instability, flexed posture and freezing (motor blocks), as well as non-motor symptoms such as cognitive and psychiatric impairment15. PD symptoms start to appear with the loss of neurotransmitters in the brain, particularly dopamine. As discussed in the literature16,17,18,19,20, neuroimaging techniques can be successfully utilized for early diagnosis of PD. For instance, SPECT imaging is usually considered for the diagnosis of PD18,19, and MRI is often employed for the differential diagnosis of PD syndromes19,21,22, as well as the analysis of the structural changes in PD patients16,23,24. Recently, many researchers have started using machine learning and data-driven analysis methodologies for disease diagnosis purposes, and a significant amount of research effort has been dedicated to diagnosis and progression prediction of neurodegenerative diseases using different brain imaging modalities3,18,19,22,25. Automatic PD diagnosis and progression prediction could help physicians and patients avoid unnecessary medical examinations or therapies, as well as potential side effects and safety risks26. Machine learning and pattern recognition methods could simplify the development of these automatic PD diagnosis approaches. For instance, Prashanth et al.18 use intensity features extracted from SPECT images along with an SVM classifier, while Focke et al.23 use voxel-based morphometry (VBM) on T1-weighted MRI with an SVM classifier to identify idiopathic Parkinson syndrome patients. In another work, Salvatore et al.24 propose a method based on principal component analysis (PCA) on morphological T1-weighted MRI, in combination with SVMs, for diagnosis of PD and its syndromes.

Accordingly, the contributions of this paper are multi-fold: (1) Our proposed method selects the features that best classify the data in the kernel space, through learning multiple kernels; (2) We design kernel functions specific to our feature types (volumetric non-negative features); (3) We use multi-modal MRI and SPECT data and propose a non-linear feature fusion strategy to diagnose PD.

PD Staging, Data Acquisition and Preprocessing

The data used in this paper are acquired from the PPMI database9, which is the first substantial study aimed at identifying PD progression biomarkers. In this research, we use the subjects with both MRI and SPECT modalities available in the database, resulting in 369 PD and 169 Normal Control (NC) subjects. PD subjects in the PPMI study are de novo PD patients, newly diagnosed and unmedicated. The healthy/normal control subjects are both age- and gender-matched with the PD patients. The demographic information of the subjects is summarized in Table 1.

Table 1 Details of the subjects from the PPMI dataset used in our study.

PD subjects are often evaluated using the widely used Hoehn and Yahr (H&Y) scale27 to assess their disease stage. The H&Y scale defines broad categories that rate the motor function of PD patients. H&Y stages correlate with several factors including motor decline, neuroimaging studies of dopaminergic loss and deterioration in quality of life28. It is a 5-point scale (Stages 1–5). Most PD studies evaluate disease progression by analyzing patients and the time taken to reach one of the H&Y stages. The subjects in the first stage have unilateral involvement only, often with minimal or no functional impairment. They have mild symptoms, which are inconvenient but not disabling. The second stage involves bilateral or midline involvement, but still with no impairment of balance. For these subjects, the posture and gait are usually affected. Stage three shows the first signs of impaired reflexes. The patient will show significant slowing of the body movements and moderately severe dysfunction. In the fourth stage, the disease is fully developed and is severely disabling; the patient can still walk but to a limited extent, and might not be able to live alone any longer. In the fifth (final) stage, the patient is confined to bed or bound to a wheelchair. The PD subjects in this study are mostly in the first two H&Y stages. As reported by the studies (http://www.ppmi-info.org/wp-content/uploads/2013/09/PPMI-WW-ADNI.pdf) in PPMI9, among the PD patients at the time of their baseline image acquisition, 43% of the subjects were in stage 1, 56% in stage 2 and the rest in stages 3 to 5.

Previous clinical studies (e.g., the Braak staging scheme29,30) show that the disease is initiated in the brainstem, mid-brain and subcortical regions; over time, it propagates to many other brain regions. Thus, in order to diagnose PD and identify its imaging biomarkers at its early stage, we focus on the structures in the brainstem and basal ganglia areas, along with the subcortical regions. Based on ref. 29, we employ the 36 ROIs depicted in Fig. 2a, which correspond to the brainstem (4 ROIs), substantia nigra and red nucleus (4 ROIs), limbic lobe (16 ROIs), insula (2 ROIs) and subcortical regions (10 ROIs). We extract features specifically from these ROIs, since we aim to investigate diagnosis of de novo PD subjects. Including all ROIs in the brain would jeopardize the motivation of early diagnosis, since the other ROIs have previously been investigated and found either not clinically relevant to PD or affected by PD only at very late stages29,30,31.

Figure 2

ROIs from MRI and SPECT used in this study.

For each subject, a 3D MPRAGE sequence was performed, using 3T SIEMENS MAGNETOM TrioTim syngo scanners. T1-weighted images were acquired for 176 sagittal slices with the following parameters: repetition time = 2300 ms, echo time = 2.98 ms, flip angle = 9°, and voxel size = 1 × 1 × 1 mm3. The MR images are preprocessed by skull stripping, and then segmented into White Matter (WM), Gray Matter (GM), and CerebroSpinal Fluid (CSF) tissues, using the technique in ref. 32. Then, the Anatomical Automatic Labeling (AAL) atlas with the above 36 labeled ROIs is registered to each subject’s native space. Specifically, we used FLIRT from the FSL package33 for affine alignment, followed by HAMMER34 for nonlinear registration. The AAL atlas is warped to the subject according to the estimated deformation fields, and the ROI labels are then propagated onto the subject’s native space. Finally, WM and GM tissue volumes in each ROI are calculated as MRI features.

On the other hand, SPECT images are primarily used for PD diagnosis and analysis. These images capture the dopamine transporter levels in the striatum and are commonly known as DaTSCAN. To acquire this image, the 123I-ioflupane neuroimaging radiopharmaceutical biomarker is injected, which binds to the dopamine transporters in the striatum. Brain images are then acquired. To process these images, the PPMI study (DatScan SPECT Image Processing Methods for Calculation of Striatal Binding Ratio, August 2013) has performed attenuation correction on the SPECT images, along with a standard 3D 6.0 mm Gaussian filter. Then, the images were normalized to standard Montreal Neurological Institute (MNI) space. Next, the transaxial slice with the highest striatal uptake was identified and the 8 hottest striatal slices around it were averaged, to generate a single slice image. On this average slice, the four caudate and putamen (left and right) ROIs, which are in the striatum brain region, are labeled and considered as target ROIs. The occipital cortex region is also segmented and used as a reference ROI. Count densities for the regions were used to calculate the striatal binding ratios (SBRs), as

SBR = (target ROI count density − reference ROI count density) / reference ROI count density.

As can be seen in Fig. 2c, the caudate and putamen intensity values differ significantly between the PD and NC subjects.
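As a minimal sketch of this calculation, the snippet below computes SBR values from mean count densities that are assumed to be already extracted for each ROI; the variable names and numbers are purely illustrative.

```python
import numpy as np

def striatal_binding_ratio(target_density, reference_density):
    """SBR of a target striatal ROI relative to the occipital-cortex reference ROI."""
    return (target_density - reference_density) / reference_density

# Illustrative count densities (arbitrary units) for the four striatal target ROIs.
densities = {"caudate_L": 2.9, "caudate_R": 3.1, "putamen_L": 1.2, "putamen_R": 1.4}
occipital_reference = 1.0

sbr_features = {roi: striatal_binding_ratio(v, occipital_reference)
                for roi, v in densities.items()}
print(sbr_features)   # four SBR features per subject
```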

Notations

Throughout the paper, bold capital letters denote matrices (e.g., A). Small bold letters are vectors (e.g., a). All non-bold letters denote scalars or functions. a_ij is the scalar in row i and column j of A, while a^i and a_j denote the ith row and the jth column of A, respectively. ‖a‖_1 and ‖a‖_2 represent the ℓ1 and ℓ2 norms of a, respectively.

Kernel-based Feature Selection and Max-Margin Classification

Feature or variable selection is defined as the process of picking a subset of discriminative features to best construct a model, e.g., for classification or regression. Feature selection approaches can be mainly categorized into two types: unsupervised and supervised. The former selects features without considering the class labels, while the latter aims to select features with maximum relevance to the class labels and the least amount of redundancy among the selected features. Among the supervised feature selection methods, sparse feature selection6 has gained much attention in recent years, due to its simplicity and superior performance.

Suppose we have N training samples, with d features for each. We can arrange the feature vectors in a matrix X = [x_1, …, x_N] ∈ ℝ^{d×N}, and their respective labels in y ∈ ℝ^N. Sparse feature selection aims at minimizing the objective:

min_w ‖y − X^T w‖_2^2 + λ‖w‖_1,

to best represent the samples using the weight vector w ∈ ℝ^d. This weight vector is constrained with an ℓ1 norm to obtain a compact set of discriminative features. Obviously, the set of features selected under this setting would be appropriate if we are planning to build a linear classification model (e.g., linear SVM). This is because these features are selected to minimize redundancy and maximize relevance to the class labels in the original feature space. However, for a non-linear classification task, we should select the features that replicate the same minimum-redundancy and maximum-relevance in the kernel space.
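The following sketch illustrates this conventional, input-space sparse feature selection on toy data, using scikit-learn's Lasso as a stand-in ℓ1 solver; the data, the regularization value and the non-zero-coefficient criterion are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, d = 200, 50                                   # samples, features
X = rng.random((N, d))                           # rows are samples (X^T in the paper's notation)
true_w = np.zeros(d); true_w[:5] = 1.0           # only the first 5 features are informative
y = np.sign(X @ true_w - (X @ true_w).mean())    # binary labels in {-1, +1}

# min_w ||y - X w||_2^2 + lambda ||w||_1  (scikit-learn scales the squared loss by 1/(2N))
lasso = Lasso(alpha=0.01).fit(X, y)
selected = np.flatnonzero(lasso.coef_)           # features kept by the l1 penalty
print("selected features:", selected)
```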

Non-linear classification is usually achieved by first projecting the original feature vectors to a high-dimensional space using a non-linear mapping function (referred to as φ), and then performing classification in the kernel space. However, instead of explicitly defining a mapping function, often a kernel function is used, which defines the similarity between the samples (or the feature vectors)8. This is simply because many machine learning algorithms (e.g., SVMs) can be entirely expressed in terms of dot products8, and, under some conditions, kernel functions can also be expressed as dot products in a (possibly infinite dimensional) feature space.

Here, in order to both select features and classify the data, we adopt a formulation similar to kernel-based SVM, with a customized kernel to account for feature selection. We propose to apply the kernel function on each single feature and define the aggregate kernel through a simple weighted sum of all these kernels. In this way, we can select the features that contribute most to constructing a better classification model, through a sparsity-inducing (ℓ1) regularization on the kernels’ weight vector.

Formulation

The original kernel-based SVM formulation finds a max-margin decision boundary f in the kernel space, using a loss function L, with a trade-off hyperparameter C, as:

min_f ‖f‖_ℋ^2 + C Σ_{j=1}^{N} L(y_j, f(x_j)),

where f is associated with a Reproducing Kernel Hilbert Space ℋ, and x_j and y_j are the jth sample and its corresponding label, respectively. In terms of a non-linear mapping function, f is defined as f(x) = w^T φ(x). As discussed earlier, we do not directly use the mapping function φ. To this end, the representer’s theorem35 is incorporated, which states that the solution for w can always be represented as a linear combination of the training data. Therefore:

w = Σ_{j=1}^{N} β_j φ(x_j),   and hence   f(x) = Σ_{j=1}^{N} β_j k(x_j, x).

Hence, the optimization problem for the SVM reduces to solving for β_j, j = 1, …, N.

Here, we would further like to select the features that can best construct the non-linear SVM model. Accordingly, we construct a single kernel for each feature, and then aggregate all these kernels through a weighted average. Therefore, the kernel between two samples x_j and x would be calculated as:

k(x_j, x) = Σ_{i=1}^{d} α_i k(x_j(i), x(i)),

where x(i) denotes the ith feature of x and α_i ≥ 0 is the weight of the kernel built on the ith feature.

This is similar to multiple kernel learning frameworks36,37, while building each kernel only on a single feature. Applying the newly introduced kernel k on all training samples, we would have K = k(X, X). Figure 3 shows the proposed process of building the kernel matrix over all samples (3b), in comparison with the conventional kernel building (3a). Now, by regularizing the weight vector α with an ℓ1 norm, we can enforce the selection of the most discriminative features projected in the kernel space. Therefore, the objective function would be formed as:

min_{f, α} ‖f‖_ℋ^2 + C Σ_{j=1}^{N} L(y_j, f(x_j)) + λ‖α‖_1,   s.t. α ≥ 0.   (6)

Figure 3

(a) Conventional construction of kernel functions. (b) Kernels learned for each single feature, aggregated through the weights corresponding to each feature. k_m(·,·) denotes the kernel function applied to the features.

However, different features and kernels usually characterize different aspects of the data. Therefore, we propose to use different kernel types on all features, and let our method choose which kernel type and which feature would best build the classification model. Accordingly, if we define κ different types of kernels, we define a combination of different kernels for the d features, as:

k(x_j, x) = Σ_{m=1}^{κ} Σ_{i=1}^{d} α_{m,i} k_m(x_j(i), x(i)).   (7)

In this way, α ∈ ℝ^b would be defined as the concatenation of all α^(m) ∈ ℝ^d, m = 1, …, κ. Note that b = d × κ. Since there is an ℓ1 regularization on this vector, we can include different kernel types and determine which kernel contributes more in classifying the data, based on the selected feature(s).
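A minimal sketch of how such an aggregate kernel might be assembled is given below, assuming each base kernel is applied to a single scalar feature and that the weight vector is stored kernel-major (κ blocks of d weights); only three of the kernel types are included for brevity, and all names and sizes are illustrative.

```python
import numpy as np

def rbf(u, v, sigma=0.5):
    """Gaussian kernel on a single (scalar) feature, squared-Euclidean distance."""
    return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2 * sigma ** 2))

def hik(u, v):
    """Histogram-intersection kernel on a single non-negative feature."""
    return np.minimum(u[:, None], v[None, :])

def linear(u, v):
    return u[:, None] * v[None, :]

KERNELS = [linear, rbf, hik]                     # a subset of the kappa kernel types

def base_kernels(X, Z):
    """Stack one N x M kernel matrix per (kernel type, feature) pair -> shape (b, N, M)."""
    stacks = [k(X[:, i], Z[:, i]) for k in KERNELS for i in range(X.shape[1])]
    return np.stack(stacks)                      # b = d * kappa base kernels, kernel-major

def aggregate_kernel(alpha, X, Z):
    """Weighted sum of the per-feature base kernels, K = sum_m alpha_m K_m."""
    return np.tensordot(alpha, base_kernels(X, Z), axes=1)

# toy usage
X = np.random.rand(20, 4)                        # 20 samples, d = 4 non-negative features
alpha = np.abs(np.random.rand(len(KERNELS) * X.shape[1]))
K = aggregate_kernel(alpha, X, X)                # 20 x 20 aggregate kernel matrix
print(K.shape)
```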

Optimization

To solve the optimization problem in (6), we only need to make sure the loss function is differentiable. We can then easily decompose the problem into two convex subproblems and solve them using a gradient descent method. The first subproblem solves for f, assuming α is fixed. This reduces to a simple SVM optimization with the kernel characterized by the fixed α:

min_f ‖f‖_ℋ^2 + C Σ_{j=1}^{N} L(y_j, f(x_j)).

Then, with f fixed, we can determine the best values for α through solving the second subproblem:

min_{α ≥ 0} C Σ_{j=1}^{N} L(y_j, f(x_j)) + λ‖α‖_1.

The only non-smooth term is the last one. However, since we need to enforce the constraint α ≥ 0, we use a projection operator at each iteration. This constraint ensures that the last term is also differentiable with respect to α. Therefore, this objective can simply be optimized using projected gradient descent38. To guarantee the convergence of the algorithm, following refs 36 and 39, we choose the gradient step size according to the Armijo rule40. As a result, the optimization process alternates between the above two subproblems until convergence. The second subproblem learns the kernel, embedded in f(x_j), by optimizing the weights α, while the first one uses any SVM solver of choice to learn the SVM function.

Algorithm 1 summarizes the whole process and includes the stopping criteria. It is important to note that the main objective function (as in (6)) is a convex function, and therefore has a global optimum. The above two steps of the projected gradient descent are repeated until convergence to the global optimum. It has been shown that, under such settings with a convex objective function, projected gradient descent converges to the global optimum38.
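The sketch below illustrates the alternating structure of this optimization under simplifying assumptions: scikit-learn's SVC with a precomputed kernel solves the first subproblem, and a single projected gradient step with a finite-difference gradient and a fixed step size (instead of the analytic gradient and the Armijo rule) updates α in the second. It is meant only to make the alternation explicit, not to reproduce Algorithm 1.

```python
import numpy as np
from sklearn.svm import SVC

def hinge_loss(K, y, svm):
    """C * sum_j max(0, 1 - y_j f(x_j)) for a fitted SVM and a precomputed kernel matrix."""
    f = svm.decision_function(K)
    return svm.C * np.maximum(0.0, 1.0 - y * f).sum()

def fit_joint(X, y, base_K, lam=1.0, C=10.0, n_iter=20, step=1e-3):
    """Alternate between (i) fitting an SVM with the alpha-weighted kernel and
    (ii) a projected gradient step on alpha with an l1 penalty and alpha >= 0.
    base_K has shape (b, N, N): one precomputed kernel per (kernel type, feature) pair."""
    b = base_K.shape[0]
    alpha = np.full(b, 1.0 / b)
    for _ in range(n_iter):
        K = np.tensordot(alpha, base_K, axes=1)           # aggregate kernel
        svm = SVC(kernel="precomputed", C=C).fit(K, y)    # subproblem 1: fix alpha, learn f
        # subproblem 2: fix f (its dual coefficients), take one projected gradient step on alpha
        obj = hinge_loss(K, y, svm) + lam * alpha.sum()   # ||alpha||_1 = sum(alpha) for alpha >= 0
        grad = np.empty(b)
        for m in range(b):                                # simple finite-difference gradient
            a = alpha.copy(); a[m] += 1e-6
            Km = np.tensordot(a, base_K, axes=1)
            grad[m] = (hinge_loss(Km, y, svm) + lam * a.sum() - obj) / 1e-6
        alpha = np.maximum(alpha - step * grad, 0.0)      # projection onto alpha >= 0
    return svm, alpha
```

In practice, the analytic gradient of the loss with respect to α and an Armijo line search would be used, as described above; the finite-difference version here only serves to keep the sketch short.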

Note that our proposed framework for joint feature selection and classification integrates the ℓ1 regularization into the max-margin classifier (i.e., SVM), as in (6). Since the optimization process, as described above, consists of two optimization phases, the same idea can be applied to other kernel methods, by using our customized kernel and incorporating the ℓ1 regularization into their formulations.

Kernel Function

The choice of kernel functions is very important and depends on the application and the data types associated with the specific application of interest. Kernels are basically a measure of similarity between samples. One of the most popular kernels is the Radial Basis Function (RBF) or Gaussian kernel, formulated as:

k(x_i, x_j) = exp(−D(x_i, x_j) / (2σ^2)),

where D(·,·) is a distance measure between two samples and σ is the kernel hyperparameter. Usually, the squared Euclidean norm is used as the distance measure between two samples:

D(x_i, x_j) = ‖x_i − x_j‖_2^2.

However, this distance metric should be chosen according to the characteristics of the data.

As described earlier, in this work we use the volumes of ROIs from MRI and SPECT images, and therefore our features are all non-negative. Hence, our feature vectors are similar to histogram features or probability distribution functions (PDFs), if normalized. Many previous works normalize the data using z-scores, which converts the feature values to a common scale with a mean of 0 and a standard deviation of 1. However, this changes both the characteristics and the physical meaning of the features. Here, instead, we normalize the feature vectors to the range [0, 1], making them similar to PDFs or the histogram-like features used widely in computer vision applications. A number of other distance metrics have been proposed for non-negative or histogram-like features41. One such distance is the χ2 distance:

D_χ²(x_i, x_j) = Σ_l (x_i(l) − x_j(l))^2 / (x_i(l) + x_j(l)),

where x(l) denotes the lth element of x.

Another popular distance used for such purposes in the computer vision area is the Earth Mover’s Distance (EMD)42, which evaluates the dissimilarity between two multi-dimensional distributions in the feature space. Intuitively, given two distributions, we can consider one as a mass of earth spread in space, and the other as a collection of holes in that same space. EMD measures the least amount of work required to fill the holes with earth. For the 1-dimensional case, as in our application, EMD has a simple closed-form solution:

EMD(x_i, x_j) = Σ_l |cdf(x_i)(l) − cdf(x_j)(l)|,

where cdf() is the cumulative distribution function and |·| is the absolute value function.

One of the popular kernels used for histogram-like features is the intersection kernel (histogram intersection kernel or HIK), which captures the similarity of histogram-like features with the following definition:

k_HIK(x_i, x_j) = Σ_l min(x_i(l), x_j(l)).

Since, depending on the data distribution, some datasets might perform better with linear kernels (i.e., in the native feature space), we also define the linear kernel function as:

k_lin(x_i, x_j) = x_i^T x_j.

To find the best kernel, which can be different for each dataset, we use all these kernel definitions and define our kernel formulation through Equation (7), by summing over all kernel types (k_m, m = 1, …, κ) for each single feature (i = 1, …, d). The set of all kernels could therefore be defined as the linear kernel, the RBF kernel with the three different distance metrics, and the intersection kernel:

{k_lin, k_RBF-Euclidean, k_RBF-χ², k_RBF-EMD, k_HIK},

with κ = 5 different kernel types. Our optimization framework, through optimizing α, selects which feature(s) and kernel type(s) are best for the problem and the available dataset. We examine this proposed kernel and compare the results with different settings of the kernels and hyperparameters.
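A sketch of these five kernel types, applied to feature vectors normalized to [0, 1], is given below; the small epsilon guarding the χ2 denominator and the choice σ = 0.5 (the value reported in the experiments) are assumptions of this illustration.

```python
import numpy as np

def chi2_dist(x, z, eps=1e-12):
    """Chi-squared distance for non-negative, histogram-like feature vectors."""
    return np.sum((x - z) ** 2 / (x + z + eps))

def emd_1d(x, z):
    """Closed-form 1-D Earth Mover's Distance: L1 distance between the CDFs."""
    return np.sum(np.abs(np.cumsum(x) - np.cumsum(z)))

def rbf(dist, sigma=0.5):
    """Gaussian (RBF) kernel built on any of the distances above."""
    return np.exp(-dist / (2 * sigma ** 2))

def hik(x, z):
    """Histogram-intersection kernel."""
    return np.sum(np.minimum(x, z))

def kernel_bank(x, z, sigma=0.5):
    """The kappa = 5 kernel types used to build the combined kernel of Equation (7)."""
    return {
        "linear":   np.dot(x, z),
        "rbf_euc":  rbf(np.sum((x - z) ** 2), sigma),
        "rbf_chi2": rbf(chi2_dist(x, z), sigma),
        "rbf_emd":  rbf(emd_1d(x, z), sigma),
        "hik":      hik(x, z),
    }

x, z = np.random.rand(36), np.random.rand(36)    # e.g., ROI volumes normalized to [0, 1]
print(kernel_bank(x, z))
```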

Experimental Results

In this section, we conduct several experiments on synthetic and real PD diagnosis datasets. For evaluation, we compare our results with several baseline methods. Baseline classifiers under comparison include the support vector machine (SVM), SparseSVM43, sparse feature selection followed by an SVM (SFS + SVM), and a multiple kernel learning (MKL) framework for SVM37, in which one kernel is learned for each modality and feature type (GM, WM and SBR, as described above) and classification is learned over these kernels. To further evaluate the effect of the feature selection (ℓ1 regularization), we run the same objective in (6) with conventional ℓ2 regularization on the vector α, denoted as ‘Proposed ℓ2-reg’. In addition, we also employ the following widely used linear and non-linear feature selection strategies followed by an SVM, and report the results: t-test44, elastic-net45, AutoEncoder-Restricted Boltzmann Machine (AE-RBM)46, and the mutual-information-based feature selector minimum-redundancy maximum-relevancy (mRMR)7. For all the experiments, the hyperparameters are fixed to the values reported below. This avoids excessive hyperparameter tuning procedures and makes the results easily reproducible. All results are generated through 10-fold cross-validation, and we found that fixing λ = 1 generates the best results, while the best values for the other hyperparameters are σ = 0.5 and C = 10.
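As a rough sketch of this evaluation protocol, the snippet below runs a 10-fold cross-validation with the fixed hyperparameters, using a plain precomputed-kernel RBF-SVM as a placeholder for the proposed joint model; it illustrates the protocol only (the placeholder has no feature-selection term, so λ does not appear).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

SIGMA, C = 0.5, 10.0                              # fixed hyperparameters, no per-fold tuning

def rbf_kernel(A, B, sigma=SIGMA):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cross_validate(X, y, n_splits=10):
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        clf = SVC(kernel="precomputed", C=C).fit(rbf_kernel(X[tr], X[tr]), y[tr])
        accs.append(np.mean(clf.predict(rbf_kernel(X[te], X[tr])) == y[te]))
    return np.mean(accs), np.std(accs)
```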

Synthetic Data

To verify the proposed method and analyze its performance, we generate two sets of synthetic data, one linearly separable and one not linearly separable (which we therefore expect to be best classified non-linearly). In both cases, the data comprise two classes of 100 samples each, with feature vectors of dimensionality 50. The linearly separable data are sampled from two separate, randomly generated subspaces. Specifically, we first construct two independent subspaces of dimensionality 50, similar to refs 2 and 47. These two subspaces are constructed with bases U_1 (a random orthogonal matrix) and U_2 = T U_1, where T is a random rotation matrix. We then sample 100 vectors from each subspace through X_i = U_i Q_i, i ∈ {1, 2}, with Q_i, a 50 × 100 matrix, having entries independent and identically distributed (i.i.d.) from a standard normal distribution. This leads to a binary classification problem in which the two classes correspond to the above two subspaces. For the non-linearly separable data, we sample the two classes from two spheres in the 50-dimensional space with different radii. The first-class data are sampled i.i.d. from a sphere of radius 0.5. For the second class, we construct another sphere of radius 1.0 with the same center as the first, and the data points are sampled i.i.d. from the region between the two spheres.
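The following sketch reproduces this data-generation procedure under the assumptions stated above (standard-normal coefficients, U_2 obtained by rotating U_1, and radii drawn uniformly within the stated bounds); it is an illustration rather than the exact generator used for Fig. 4.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
d, n_per_class = 50, 100

# --- linearly separable: samples drawn from two random subspaces ---
U1 = ortho_group.rvs(d, random_state=0)           # random orthogonal basis
T = ortho_group.rvs(d, random_state=1)            # random rotation
U2 = T @ U1                                       # second, rotated basis
Q1 = rng.standard_normal((d, n_per_class))        # i.i.d. N(0, 1) coefficients
Q2 = rng.standard_normal((d, n_per_class))
X_lin = np.hstack([U1 @ Q1, U2 @ Q2]).T           # (200, 50) samples
y_lin = np.repeat([0, 1], n_per_class)

# --- non-linearly separable: inner sphere vs. spherical shell ---
def sample_shell(r_low, r_high, n):
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)            # random directions
    r = rng.uniform(r_low, r_high, size=(n, 1))              # random radii
    return v * r

X_nonlin = np.vstack([sample_shell(0.0, 0.5, n_per_class),   # class 1: within radius 0.5
                      sample_shell(0.5, 1.0, n_per_class)])  # class 2: between the spheres
y_nonlin = np.repeat([0, 1], n_per_class)
```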

To visualize the data generated for these experiments, we employ a dimensionality reduction technique and project the samples from the two datasets into 2D space using t-SNE48. The t-SNE technique visualizes high-dimensional data by giving each sample a location in a two-dimensional map, revealing the neighborhood structure of the sample manifold at many different scales48. Figure 4 shows the 2D t-SNE projections of the generated data. As can be seen, the first dataset (4(a)) can likely be separated linearly, while the second one (4(b)) is not linearly separable, and we would expect better results using the non-linear kernels.
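A minimal sketch of this 2D t-SNE projection with scikit-learn is shown below; the random feature matrix stands in for the synthetic data above, and the perplexity is left at its default since no value is reported.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((200, 50))                         # stand-in for the 50-dimensional synthetic data
y = np.repeat([0, 1], 100)

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
for label, color in [(0, "red"), (1, "blue")]:
    plt.scatter(*emb[y == label].T, c=color, s=10, label=f"class {label}")
plt.legend(); plt.title("2D t-SNE projection"); plt.show()
```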

Figure 4: 2D t-SNE projection of the synthetic data used for evaluation of the method.

Red and Blue dots represent samples from two different classes. (a) Linearly separable data, (b) Nonlinearly separable data.

Each of the features in the linear and non-linear experiments is normalized to the range [0, 1], similar to our real data, since our features from the neuroimaging data are volumetric and non-negative. The proposed method is run on these two sets of data separately. The weight vector α, defined in (7) and optimized in (6), contains the weight of each feature (i = 1, …, d) for each of the κ kernel types (m = 1, …, κ). The mean weight of the features for each kernel is a good indication of which kernels were selected as useful for the classification task. These mean values are reported in Table 2. Since we impose an ℓ1 regularization on the weight vector, it sparsely selects a compact set of features and kernels; features with less useful or redundant kernels are therefore often given zero weights. As can be seen, for the linearly separable data (Fig. 4(a)), the linear and histogram intersection kernels are selected, while for the non-linear data, the RBF kernels with different distance metrics are selected along with the HIK. One of the main conclusions here is that HIK is almost equally important for both linear and non-linear volumetric data; we verify this on the real neuroimaging dataset in the following subsections. The HIK has also been found to be very useful in the computer vision area, where the features are non-negative and histogram-like. This type of kernel can model non-linear relations, while also being able to model piece-wise linear problems49, which is why it is found useful for both linear and non-linear problems.

Table 2 Results on the synthetic data: mean weight of the features for each of the κ = 5 kernels.

It is worth noting that our method works equally well for both linearly and non-linearly separable data. In these experiments, the method achieves 100% accuracy in both cases, while different kernel types are selected based on the characteristics of the data.

PD Diagnosis

As described before, we use both MRI and SPECT modalities of a total of 538 subjects (169 NC, 369 PD) from the PPMI database to evaluate the proposed method. Table 3 compares the diagnosis accuracies of the proposed and baseline methods, analyzing the effects of different modalities. As can be seen, the proposed method generates the best results when using both MRI and SPECT modalities. Additionally, Table 4 shows the results of the proposed and baseline kernel methods with different kernel settings using the MRI + SPECT modalities. The proposed kernel, which combines all kernel types, provides the best results compared to the other five kernels. Our method (Proposed ℓ1-reg and ℓ2-reg) is the only method that can interweave all kernels and concurrently select the best kernel and feature combination. The other kernel methods (MKL, SFS + SVM and SVM, as listed in the table) can only operate on a single kernel and hence would only be able to interpret the data from a single aspect (e.g., linearity or non-linearity).

Table 3 Diagnosis accuracy of the proposed and the baseline methods, with different modalities.
Table 4 Diagnosis accuracy of the proposed and the baseline methods, with different kernels on MRI + SPECT data.

In addition, Fig. 5 compares the sensitivity, specificity and area under the ROC curve of the methods. As can be seen, the proposed framework using the proposed interweaved kernel on the combination of SPECT and MRI data achieves the best results, with an accuracy of 97.5%. This suggests that the best features are selected in the non-linear kernel space, compared to the results of the other methods, specifically SFS + SVM with an accuracy of 90.8%, in which a sparse feature selection is conducted in the original feature space followed by a non-linear classifier. This supports our assumption that, if a non-linear classification is intended, we should select the features that form the best classifier in the kernel space, not in the original feature space. Furthermore, our proposed kernel achieves an accuracy improvement of over 1.4% compared to the case of the squared Euclidean distance, which is widely used in many previous studies. On the other hand, a comparison of the results from the ℓ1 and ℓ2 regularizations of the kernel weight vector indicates that the proposed method (ℓ1 regularized) achieves an accuracy (and AUC) superior to the same formulation with ℓ2 regularization. This is because ℓ1 regularization is suitable for recovering sparse signals, and can reduce overfitting and increase generalization capability, thus achieving better performance on the unseen testing data. ℓ2 is preferable for data that are not sparse, whereas our features are extracted from brain ROIs and ℓ1 enforces selecting the most discriminant ROIs. To further evaluate the feature selection performance, we conducted experiments with several state-of-the-art linear and non-linear feature selection (or reduction) and classification techniques used for such neurodegenerative diagnosis applications, which are already included in Table 3.

Figure 5

Comparisons of the sensitivity (SEN), specificity (SPE) and area under the ROC curve (AUC).

To analyze the effect of the proposed kernel and the contribution of each single kernel in building the final max-margin classification model, we calculate the mean weight (in vector α) of the features for each single kernel. Therefore, for the mth kernel, this mean weight is calculated as:

ᾱ_m = (1/d) Σ_{i=1}^{d} α_{m,i}.
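A one-line sketch of this computation is given below, assuming the learned weight vector α is stored kernel-major (κ blocks of d feature weights); the layout and the example dimensions are assumptions.

```python
import numpy as np

def mean_kernel_weights(alpha, n_kernels, n_features):
    """Average of the learned weights over the d features, for each of the kappa kernels."""
    return alpha.reshape(n_kernels, n_features).mean(axis=1)

alpha = np.abs(np.random.rand(5 * 36))     # illustrative: kappa = 5 kernels, d = 36 ROI features
print(mean_kernel_weights(alpha, 5, 36))
```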

This mean weight simply shows how often the features computed with that kernel are selected, and therefore indicates the importance and significance of that kernel type. Figure 6 shows these mean weight values. As can be seen, the intersection kernel (HIK) again has the largest weight, while the other two variations of the RBF kernel (i.e., with the EMD and Euclidean distances) receive smaller weights but are still important. The other two kernels, linear and RBF with the χ2 distance, however, do not appear useful for the task at hand. This is in line with the non-linear synthetic experiment and shows that if the features are selected in accordance with the classifier, the performance can be boosted. To analyze the hyperparameter λ in the objective function (6), we plot the accuracy and the area under the ROC curve as a function of λ in Fig. 7. This hyperparameter is a trade-off that controls how compact we want the selected feature set to be. As is clear in this figure, the whole process is not strongly dependent on the setting of this hyperparameter and performs similarly well over a wide range of values.

Figure 6: Mean weight of the features for each of the kernels, selected by our method.

The larger the mean weight, the more frequently that kernel is selected with larger weights, and thus the more useful it is.

Figure 7

Accuracy (ACC) and area under the ROC curve (AUC) as a function of the hyperparameter λ, as in (6).

One of the major advantages of the proposed technique is that we can analyze the weight vector α and see which ROIs’ features are given non-zero mean weights over the 10 folds of the cross-validation. In this way, we can draw a conclusion on the importance of any single ROI for the specific disease. The ROIs selected by the proposed method are visualized in Fig. 8, and Table 5 lists the specific ROIs and the tissue types or modalities that were selected. The last row of this table shows the mean values of the coefficient vector α for that specific modality (feature type), which indicates the level of significance and contribution of specific feature types in building the kernel-based max-margin classifier. The SPECT features are the most useful, as they are quite discriminative for the task of PD diagnosis; this can also be easily inferred by comparing the two subject samples in Fig. 2c. Among the MRI features, WM volumes are more compelling, since the deep brain regions contain more WM volume and fibers. The selected ROIs are also aligned with previous studies2,20,29,30,50. Specifically, the putamen and caudate regions, along with the brainstem and many subcortical regions, have consistently been studied and found important for assessing PD development.

Figure 8

The most frequently selected ROIs.

Table 5 ROI names and their modalities, selected as the most important ROIs by our algorithm.

To analyze the convergence behavior of the proposed method, we plot the magnitude of the objective function in (6) for each of the 10 folds in the experiment on PPMI using both MRI and SPECT modalities, as a function of the number of iterations required for solving the alternating optimization problem. The plots are illustrated in Fig. 9. As can be seen, the algorithm converges in a limited number of iterations for all data splits.

Figure 9

The magnitude of the objective function in (6) (as denoted in Algorithm 1), as a function of the number of iterations required for solving the alternating optimization problem, over all 10 folds of the cross-validation experiment.

To analyze the significance of the obtained results, we conduct a permutation test51, which is non-parametric and does not assume any data distribution. The permutation test investigates whether the classifier has found real class structure, or whether the observed accuracy was obtained by coincidence. To this end, the classification scheme is repeated while randomly permuting the class labels, π different times (π = 100 in our experiments). Then, a p-value can be calculated, which shows the proportion of runs for which the permuted-label misclassification rate is lower than the original classification error. For our case, we obtained a p-value less than 0.01, which signifies that the classification error on the original data is indeed substantially small and that the classifier is not generating those results by chance51. Furthermore, we use a bootstrapping procedure to statistically validate the performance of the proposed method and show that the obtained results are not due to over-fitting. Since we only have one dataset, when we compute a statistic on the data (i.e., the classification accuracy in our case), we cannot see how variable that statistic is. Accordingly, we create a large number of datasets through bootstrapping, by subsampling from the original dataset. This leads to a distribution for the accuracy, so the methods can be compared statistically. To this end, we resample subjects with replacement from the dataset, use 90% of these subjects to train the classifier, and then classify the remaining 10% of the subjects. This procedure is repeated over many independent trials, each yielding an accuracy result. Since the subjects are sampled with replacement, the trials are independent and identically distributed (i.i.d.), and we can therefore compare the methods using their bootstrap performances over all trials. Figure 10 shows the histograms of the trials for our proposed method (with ℓ1-regularization) and the baseline MKL method. Over all trials, our proposed method and the baseline MKL method achieved mean and standard deviation accuracies of 95.5 ± 1.1% and 93.0 ± 2.2%, respectively. As can be seen, our proposed method has a better mean accuracy with a tighter bound on the standard deviation, compared to the baseline MKL method.
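The two validation procedures can be sketched as follows, with generic callbacks (cv_accuracy, train_test_accuracy) standing in for one run of the full pipeline; the callbacks, default trial counts and seeds are placeholders, not the paper's settings.

```python
import numpy as np

def permutation_p_value(cv_accuracy, X, y, n_perm=100, seed=0):
    """p-value: fraction of label-permuted runs scoring at least as well as the true labels."""
    rng = np.random.default_rng(seed)
    true_acc = cv_accuracy(X, y)
    perm_accs = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    return (np.sum(perm_accs >= true_acc) + 1) / (n_perm + 1)

def bootstrap_accuracies(train_test_accuracy, X, y, n_trials=100, seed=0):
    """Resample subjects with replacement, train on 90% of the resampled set,
    and test on the remaining 10%; returns the distribution of accuracies."""
    rng = np.random.default_rng(seed)
    n, split = len(y), int(0.9 * len(y))
    accs = []
    for _ in range(n_trials):
        idx = rng.integers(0, n, size=n)                  # sample subjects with replacement
        tr, te = idx[:split], idx[split:]
        accs.append(train_test_accuracy(X[tr], y[tr], X[te], y[te]))
    return np.array(accs)
```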

Figure 10: Histograms of the bootstrapping trials, for the proposed (with ℓ1-regularization) and MKL methods.

The red dashed line indicates the mean accuracy over all trials. Our proposed method has a better mean accuracy with a tighter bound on the standard deviation.

In addition, we compare our proposed method against several previously published state-of-the-art methods for the same purpose (PD diagnosis). The comparisons are provided in Table 6, which includes the information about the datasets and methods used to obtain those results. As can be seen, our results are superior to all previous works, while our method is evaluated on a large dataset. One important note here is that MRI data alone are much less dependable for the diagnosis task. This may explain why, when a small dataset is utilized, the solutions are prone to overfitting and seemingly reasonable results can be achieved, whereas when using large sets of data, the results for diagnosis using only MRI degrade. Furthermore, the comparison with the work of Prashanth et al.18 shows that the combination of MRI and SPECT data boosts the diagnosis performance. It is important to note that the experiments in ref. 18 were conducted on the same PPMI dataset, but the authors pooled subjects from multiple time points and ran their method on the combined set. This could cause several issues: (1) subjects at later time points, who were diagnosed as PD at earlier time points, are by then in late stages of PD, and (2) different time points of the same subject can be split between the training and testing sets, which unfairly inflates the performance. Although the comparison is therefore not entirely fair, we still outperform their results by combining multiple modalities and proposing a more robust method.

Table 6 Comparisons of the proposed method with state-of-the-art methods for PD diagnosis.

Conclusion

In this paper, we introduced a kernel-based feature selection scheme, in which we select the features and kernels that induce the best classification performance in the kernel space. Furthermore, we presented a kernel function that interweaves different kernel types, in accordance with the feature types, while being able to accommodate both linear and non-linear properties of the data. The results indicate that the proposed framework for joint kernel-based feature selection and max-margin classification yields the best performance for the diagnosis of Parkinson’s disease when using features from both SPECT and MRI modalities. We tested the proposed method on a large dataset and obtained results superior to all previous studies on PD diagnosis.

Additional Information

How to cite this article: Adeli, E. et al. Kernel-based Joint Feature Selection and Max-Margin Classification for Early Diagnosis of Parkinson’s Disease. Sci. Rep. 7, 41069; doi: 10.1038/srep41069 (2017).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.