Parkinson’s disease detection based on features refinement through L1 regularized SVM and deep neural network

In previous studies, replicated and multiple types of speech data have been used for Parkinson’s disease (PD) detection. However, two main problems in these studies are lower PD detection accuracy and inappropriate validation methodologies leading to unreliable results. This study discusses the effects of inappropriate validation methodologies used in previous studies and highlights the use of appropriate alternative validation methods that would ensure generalization. To enhance PD detection accuracy, we propose a two-stage diagnostic system that refines the extracted set of features through an L1 regularized linear support vector machine and classifies the refined subset of features through a deep neural network. To rigorously evaluate the effectiveness of the proposed diagnostic system, experiments are performed on two different voice recording-based benchmark datasets. For both datasets, the proposed diagnostic system achieves 100% accuracy under leave-one-subject-out (LOSO) cross-validation (CV) and 97.5% accuracy under k-fold CV. The results show that the proposed system outperforms the existing methods regarding PD detection accuracy. The results suggest that the proposed diagnostic system is essential to improving non-invasive diagnostic decision support in PD.

(1) This paper addresses the issue of inappropriate validation methods employed in prior studies and advocates for the adoption of alternative validation approaches. Furthermore, it demonstrates that consolidating multiple samples per subject into a single sample per subject effectively mitigates the issue of overlap.
(2) We enhance the set of extracted features through the utilization of an L1-regularized SVM. This process effectively eliminates redundant and irrelevant features, yielding a higher-quality feature set for classification.
(3) To the best of our knowledge, the proposed cascaded diagnostic system, referred to as L1 SVM-DNN, represents a pioneering technique for the detection of Parkinson's disease (PD) using voice and speech data.
(4) Only a limited number of studies have explored the evaluation of feature selection at the input level of Deep Neural Networks (DNN) 32. Notably, Taherkhani et al. 32 recently discovered that deep learning models exhibit improved performance when the feature selection and feature extraction capabilities of a DNN are integrated. In this paper, we reinforce this finding by incorporating feature selection at the input level of the DNN.
(5) The proposed cascaded diagnostic system surpasses the performance of state-of-the-art methods as reported on the two benchmark voice recording datasets.
The remainder of the paper is structured as follows: In Section "Materials and methods", we provide a detailed explanation of the datasets and discuss the deep learning-based predictive classification model. In Section "Results and discussion", we present the experimental results and discuss these findings. Section "Comparative study" is dedicated to a comparative study. Section "Limitations of the study" briefly discusses some limitations of the study. Lastly, Section "Conclusion" encapsulates the conclusion of this study.
Table 1. Description of the datasets.

It has been suggested 37 that, to test the effectiveness of a newly developed machine learning method, it is good practice to choose dataset(s) that have been extensively tested. Thus, our choice of datasets in this paper was based on the facts discussed in ref. 37.

The proposed cascaded system based on L1 SVM and DNN
We propose a two-stage feature selection and classification method to detect PD using replicated voice data and various voice records. With the proposed two-stage approach, the time complexity of the predictive model can be reduced. The accuracy can also be improved by eliminating irrelevant features from the feature space. The model that we use for feature refinement is the L1-regularized linear SVM, while a DNN with optimized hyper-parameters is used for classification. The models' formulations, potentially associated problems, and proposed solutions are stated as follows.
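For a given dataset D with q instances,

$$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^{p},\; y_i \in \{-1, 1\}\}_{i=1}^{q},$$

where x_i is the i-th instance and each instance has p dimensions or features, and y_i denotes the class label, which may be −1 or 1 for binary classification. For the classification problem, SVM learns the hyperplane given by w · x = b, where b is the bias and w is the weight vector. The hyperplane maximizes the margin 2/‖w‖_2, which is equivalent to minimizing ‖w‖_2^2. The primal form of the SVM can be formulated as follows:

$$\min_{w, b} \; \frac{1}{2}\|w\|_{2}^{2} \quad \text{s.t.} \quad y_i(w \cdot x_i - b) \ge 1,\; i = 1, \dots, q \tag{1}$$

In 1995, Cortes and Vapnik proposed a modified version of SVM called Soft Margin SVM, which allows for mislabeled instances 38, and it has the following form:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|_{2}^{2} + C\sum_{i=1}^{q}\xi_i \quad \text{s.t.} \quad y_i(w \cdot x_i - b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; i = 1, \dots, q \tag{2}$$

where the regularizer or penalty function is the L2-norm, C > 0 is the error penalty parameter and ξ_i is the slack variable used for misclassification measurement.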
In 1998, Bradley and Mangasarian proposed to use the L1-norm as the regularizer 39, and feature selection can be performed using the L1-norm SVM owing to its sparse solutions. It is formulated as:

$$\min_{w, b, \xi} \; \|w\|_{1} + C\sum_{i=1}^{q}\xi_i \quad \text{s.t.} \quad y_i(w \cdot x_i - b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; i = 1, \dots, q \tag{3}$$

where the regularizer or penalty function is the L1-norm, C > 0 is the error penalty parameter and ξ_i is the slack variable used for misclassification measurement. As discussed above, w in (3) is the weight vector. By changing the value of the hyper-parameter C, different coefficients of w shrink towards zero. In fact, with sufficiently small C, several fitted coefficients become exactly zero, i.e., the solution is sparse. Therefore, L1-norm regularization has an inherent feature selection property: those features whose corresponding coefficients are fitted to zero can be eliminated. Furthermore, as C changes, different coefficients become zero, which results in different feature subsets 40. Thus, the optimal subset of features can be obtained by tuning the hyper-parameter C. For this purpose, we use the hybrid grid search algorithm (HGSA) in this paper, which automatically tunes the C hyper-parameter of the linear SVM model and searches for the optimal subset of features.
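The sparsity effect of C described above can be illustrated with a small, dependency-free sketch. It is not the solver used in the paper: it assumes a proximal subgradient scheme (hinge subgradient step followed by soft-thresholding), omits the bias term for brevity, and uses a made-up four-point dataset in which the first feature separates the classes and the second is weak noise. The names `soft_threshold`, `fit_l1_svm` and `selected_features` are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_svm(X, y, C, lr=0.01, steps=500):
    """Proximal subgradient descent on ||w||_1 + C * sum(hinge loss).

    Assumption: the bias term is omitted to keep the sketch short.
    """
    p = len(X[0])
    w = [0.0] * p
    for _ in range(steps):
        # Subgradient of the hinge term: only margin violators contribute.
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:
                for j in range(p):
                    grad[j] -= C * yi * xi[j]
        # Gradient step on the hinge term, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr) for wj, gj in zip(w, grad)]
    return w


def selected_features(w):
    """Indices of features whose fitted coefficient is not exactly zero."""
    return [j for j, wj in enumerate(w) if wj != 0.0]


# Toy data (illustrative only): feature 0 is informative, feature 1 is noise.
X = [(2.0, 0.1), (3.0, -0.1), (-2.0, 0.1), (-3.0, -0.1)]
y = [1, 1, -1, -1]

print(selected_features(fit_l1_svm(X, y, C=1.0)))   # [0]
print(selected_features(fit_l1_svm(X, y, C=0.05)))  # []
```

With C = 1.0 only the informative feature survives, while with a sufficiently small C every coefficient is shrunk to exactly zero, which is why the selected subset depends on C and has to be searched.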
In feature selection, only the most important features are considered and the irrelevant features are eliminated from the feature space, whereas in feature extraction all the features are considered and new ones are extracted 32. DNNs use a large number of non-linear elements, i.e., neurons, to learn relationships or functions of high complexity. As a result, irrelevant features present in the feature space are likely to be modeled as well, and modeling irrelevant features introduces noise 32. Learning the noise from these irrelevant features negatively affects the knowledge acquired about the overall distribution of the data 32. If the feature space contains irrelevant features, overfitting of the network to the training data is another problem 32,41, i.e., the network learns irrelevant details from the training data. It shows good performance on the training data, as it becomes more biased toward the previously seen data 42, but it fails to generalize to unseen validation or testing data.
To solve these problems posed by irrelevant features in the feature space, we use the L1-regularized SVM to free the feature space from irrelevant features before applying the feature vector to the DNN. To validate that feature selection coupled with the feature extraction capability of the DNN improves the performance of the DNN, in Section "Comparative study" we performed experiments by applying all the features to the DNN, i.e., removing the feature selection SVM model, and compared the results with the proposed L1 SVM-DNN. Accuracies of 96.87% and 62.5% are obtained for datasets 1 and 2, respectively, when all features are applied to the DNN, whereas accuracies of 100% and 97.5% are obtained for datasets 1 and 2, respectively, using the L1 SVM-DNN model. Hence, the simulation results show that the feature selection capability of the SVM model, when combined with the feature extraction capability of the DNN model, improves the performance of the DNN for PD detection problems. HGSA is used to search for the optimal subset of features, which is then given to the DNN model for classification.
For the given m training samples, a DNN models a hypothesis function h_θ(x) parameterized by the DNN parameters θ ∈ R^d, where d denotes the dimension of θ and x represents the input feature vector. The function h_θ(x) tries to predict the label ŷ for the input feature vector x. The aim is to find the optimum values of θ for which the objective function is minimized:

$$\min_{\theta \in \mathbb{R}^{d}} \; J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(h_{\theta}(x_i),\, y_i\big) \tag{4}$$

where ℓ denotes the per-sample loss. We used the ADAM learning algorithm to minimize (4). In this paper, we used the default values for the hyper-parameters of the ADAM algorithm, i.e., 0.9 for β_1, 0.999 for β_2 and 10^{-8} for ε. After optimizing the parameters or weights of the DNN model by ADAM on the training data samples, the model performance is evaluated by applying the testing data samples. The generalization performance (in terms of the percentage of falsely predicted testing samples) is represented by the generalization error η or validation loss L(A, D_train, D_valid). In this expression, A denotes the model, D_valid denotes the data on which the loss is evaluated, and D_train denotes the data on which the model is trained. Our objective is to find the A that minimizes the validation loss. The hyper-parameter optimization problem under k-fold CV is then to minimize the black-box function given as follows:

$$f(\lambda) = \frac{1}{k}\sum_{i=1}^{k} L\big(A_{\lambda},\, D_{train}^{(i)},\, D_{valid}^{(i)}\big) \tag{5}$$

where λ denotes the hyper-parameters of the DNN and A_λ represents the DNN configuration under the hyper-parameter setting λ. In order to obtain good performance, the optimal hyper-parameters of the DNN that lessen the validation loss need to be searched. Hence, two optimization problems are dealt with here, i.e., searching for the optimal value of the hyper-parameter of the SVM model that yields the optimal subset of features, and searching for the optimal hyper-parameters of the DNN model. In this paper, the two optimization problems are merged into one by merging the hyper-parameters of the two models. Thus, after merging, (5) can be formulated as:

$$f(C, \lambda) = \frac{1}{k}\sum_{i=1}^{k} L\big(A_{C,\lambda},\, D_{train}^{(i)},\, D_{valid}^{(i)}\big) \tag{6}$$

The minimization of (6) yields optimized forms of the two models. The merging of the hyper-parameters of the two models yields a hybrid grid. Each point on the grid has several coordinates. The first coordinate of each point on the hybrid grid is C, i.e., the SVM model's hyper-parameter, while the other coordinates are the hyper-parameters of the DNN model. The hyper-parameters of the second model comprise the number of layers of the DNN, denoted by L, the number of neurons in each hidden layer, denoted by N_h, where h indicates the hidden layer number, and dropout regularization. Dropout regularization is considered only in those cases in which the model is overfitting. To solve the minimization of (6), we use HGSA. Algorithm 1 gives the detailed procedure of HGSA.

Algorithm 1. Hyper-parameters optimization of the proposed cascaded system using Hybrid Grid Search Algorithm (HGSA).
Input: {m_0: number of points in the subspace of hyper-parameters of SVM model and k_0: number of points in the subspace of hyper-parameters of DNN model}
Output: {Optimized values of C, L, N_h and dropout hyper-parameters of the two models, i.e., SVM and DNN}
1. Merging and Initialization. Merge the two subspaces of hyper-parameters and initialize the hybrid hyper-parameters space
2. Initialize Highest Accuracy = 0
3. for j = 1 : m_0
4.     for k = 1 : k_0
           Evaluate Accuracy for each point in the hybrid grid or hybrid search space.
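The hybrid grid search of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `evaluate` stands for any user-supplied function that trains the cascaded model at one grid point and returns its cross-validated accuracy, the rule "keep the point with the highest accuracy seen so far" is assumed from the "Highest Accuracy" initialization step, and `toy_evaluate` is a made-up scoring function used only so that the example runs.

```python
from itertools import product


def hybrid_grid_search(c_values, dnn_configs, evaluate):
    """Score every (C, DNN config) point of the merged grid; keep the best."""
    best_acc, best_point = 0.0, None
    for c, cfg in product(c_values, dnn_configs):
        acc = evaluate(c, cfg)  # hypothetical: CV accuracy of the cascade
        if acc > best_acc:
            best_acc, best_point = acc, (c, cfg)
    return best_point, best_acc


def toy_evaluate(c, cfg):
    """Placeholder score peaking at C = 0.5, 5 layers, width 30 (made up)."""
    layers, width = cfg
    return 100.0 - 10.0 * abs(c - 0.5) - abs(layers - 5) - 0.1 * abs(width - 30)


c_values = [0.01, 0.5, 10.0]               # SVM subspace (m_0 = 3 points)
dnn_configs = [(4, 30), (5, 20), (5, 30)]  # DNN subspace (k_0 = 3 points)

best_point, best_acc = hybrid_grid_search(c_values, dnn_configs, toy_evaluate)
print(best_point, best_acc)  # (0.5, (5, 30)) 100.0
```

Because the two subspaces are merged, the feature subset selected by C and the DNN configuration are always scored jointly rather than tuned one after the other.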

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent
Informed consent is not applicable. The study used two publicly available datasets 33,34.

Results and discussion
For evaluation purposes, both types of cross-validation schemes are utilized, i.e., LOSO CV and k-fold CV with data translation. LOSO CV and k-fold CV are two widely adopted validation approaches in data analysis. In LOSO CV, the dataset is initially partitioned into S_n parts, where S_n represents the total number of subjects or individuals. In each iteration of LOSO CV, the data corresponding to one subject, starting with S_1, is reserved for testing, while the data from the remaining subjects are utilized for training the model. Similarly, in k-fold CV, the dataset is divided into k subsets or folds. During the first iteration of k-fold CV, the data in the first fold (k = 1) is set aside for testing, while the data from the other folds are employed for model training. In subsequent iterations, the testing fold shifts to the next one (k = 2), and the remaining data continue to serve as the training set. This cycle repeats until all the folds have been used for testing.
For more practical validation, we carried out model development in phase 1 and model testing in phase 2, as can be seen in Fig. 1. The software package used for these experiments was Python. In all the experiments, N_1 and N_2 represent the number of neurons in hidden layer 1 and hidden layer 2 of the network, respectively, while L denotes the total number of layers in the neural network and N_h represents the number of neurons in each hidden layer when an equal number of neurons is used in all hidden layers. The learning algorithm used is ADAM. Furthermore, C represents the hyper-parameter of the linear SVM model and n denotes the number of features produced by the SVM model. The initial range for the hyper-parameters N_1, N_2 and N_h is set between 5 and 100. Likewise, the initial range of the hyper-parameter L is set between 4 and 10, while the hyper-parameter C takes an initial range spanning from 0.00001 to 1000.
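The difference between the two schemes matters when each subject contributes several recordings. The sketch below, with hypothetical helper names and a made-up list of subject identifiers, builds LOSO splits and naive sample-level k-fold splits and then checks whether any subject appears on both sides of a split. The k-fold variant here is deliberately the plain sample-level one, not the "data translation" variant used in the paper.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one split per subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def kfold_splits(n_samples, k):
    """Yield (train_idx, test_idx) for contiguous sample-level folds."""
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def shared_subjects(train_idx, test_idx, subject_ids):
    """Subjects that have samples in both the training and the testing part."""
    return {subject_ids[i] for i in train_idx} & {subject_ids[i] for i in test_idx}


# Three recordings per subject (illustrative identifiers).
subject_ids = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]

loso_leaks = [shared_subjects(tr, te, subject_ids)
              for _, tr, te in loso_splits(subject_ids)]
kfold_leaks = [shared_subjects(tr, te, subject_ids)
               for tr, te in kfold_splits(len(subject_ids), k=4)]

print(loso_leaks)   # every entry is an empty set
print(kfold_leaks)  # some folds share subjects between train and test
```

LOSO never places recordings of the held-out subject in the training set, whereas sample-level folds can, which is the overlap problem the validation discussion is concerned with.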
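As a rough illustration of the initial search ranges listed above, the snippet below builds one possible discretization. The ranges come from the text, but the spacing is an assumption: the paper does not state the step sizes, so a logarithmic grid for C and a step of 5 for the layer widths are chosen here purely for illustration.

```python
# Assumed discretization of the stated initial ranges (step sizes are guesses).
c_grid = [10.0 ** e for e in range(-5, 4)]  # 0.00001 ... 1000, log-spaced
width_grid = list(range(5, 101, 5))         # neurons per hidden layer: 5 ... 100
layer_grid = list(range(4, 11))             # total number of layers L: 4 ... 10

# Size of the merged (hybrid) grid when all hidden layers share one width.
n_points = len(c_grid) * len(width_grid) * len(layer_grid)
print(n_points)  # 1260
```

Even this coarse discretization yields over a thousand grid points, each of which requires a full cross-validation run, so the granularity of the grid directly drives the cost of the search.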

Simulation results of dataset 1
LOSO cross-validation
In this experiment, LOSO CV is performed on the first dataset. Despite the fact that LOSO CV is the most practical validation scheme for replicated voice data and multiple types of voice data, LOSO CV was ignored in previous studies on this dataset, except ref. 43. The best result of 100% was obtained for C = 0.5, resulting in a subset of only eight features. Moreover, the best result was obtained for an optimally configured DNN with five layers, i.e., L = 5, and 20 and 30 neurons in the hidden layers. The same result is also obtained for L = 4 and N_h = 30. That is, the proposed approach can classify subjects as PD or healthy with an accuracy of 100%. The results of the experiment are reported in Table 3. In the table, the optimal subset of features for n = 8 contains F_1, F_2, F_3, F_10, F_16, F_18, F_19 and F_21. It is evident from the table that if the optimal hyper-parameters of the DNN model are not utilized, we may obtain poor performance even with an optimal subset of features. Thus, better performance can be achieved if the extracted features are refined and an optimally configured DNN is utilized.
As discussed earlier, the first dataset has the problem of imbalanced classes. Imbalanced classes affect the performance of predictive models because models trained on imbalanced data are more sensitive to detecting the majority class and less sensitive to the minority class 44. Thus, there is a need to balance the training process of the predictive model. There are two ways to do so: under-sampling the majority class and over-sampling the minority class. Over-sampling is very easy for image datasets because, with simple operations like rotation and translation, we can easily over-sample the minority class. For voice data, we used the under-sampling method. In the literature, more advanced under-sampling techniques did not significantly improve on simply selecting random samples. Hence, in this paper, we performed random under-sampling during the training process.
The practical demonstration of the problems posed by imbalanced classes is given in Table 3. The last three rows of the table, separated by a horizontal line, are the results obtained when no measure is taken to balance the training process. The simulation results show that the model fails to perform well even with an optimally configured DNN and the optimal subset of features. The reason is that machine learning models are sensitive to detecting the majority class and less susceptible to detecting the minority class when imbalanced classes are used to train the model. That is why, in the last three rows, the model results in poor specificity. Thus, it is of paramount importance to balance classes during the training process.
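Random under-sampling as described above can be sketched as follows. This is an illustrative helper, not the authors' code: the function name, the fixed seed and the toy label vector are assumptions, and in practice the sampling would be applied to the training portion of each cross-validation split only.

```python
import random


def random_undersample(labels, seed=0):
    """Return sorted indices of a class-balanced random subset of the samples."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest class.
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    return sorted(kept)


# Toy training labels: six PD samples (1) and two healthy samples (0).
labels = [1] * 6 + [0] * 2
kept = random_undersample(labels)
balanced = [labels[i] for i in kept]
print(balanced.count(1), balanced.count(0))  # 2 2
```

All minority-class samples are retained, and only the majority class is randomly thinned, so no information about the rarer class is discarded.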

k-fold cross-validation with k=10
The second experiment performed on the first dataset is k-fold CV. The value of k is chosen here to be 10. The results for different hyper-parameter configurations are given in Table 4. HGSA finds a best accuracy of 100% for 10-fold CV. The accuracy achieved via 10-fold CV is the same as the accuracy achieved in ref. 45. In ref. 45, the 10-fold experiment was also conducted on the second dataset and achieved 90% accuracy. Our proposed model achieved 97.5% for 10-fold CV on the second dataset, which demonstrates the effectiveness of the proposed diagnostic system. The optimal subset of features with n = 1 contains F_2, and with n = 7 it contains F_1, F_2, F_3, F_10, F_16, F_19 and F_21.

Simulation results of dataset 2
LOSO cross-validation on training database
In this experiment, LOSO CV is performed on the training database of the second dataset. We achieved state-of-the-art results with an accuracy of 100%, which is the highest classification accuracy reported so far for LOSO CV on the training database. The results of the experiment are given in Table 5. The proposed approach is able to classify subjects as PD or healthy with an accuracy of 100%. The best results are obtained for the C hyper-parameter equal to 0.0015 for this dataset, resulting in a feature subset consisting of only seven features. It is important to note that a 100% result for LOSO CV does not mean that the proposed system correctly classifies all samples of the dataset, because a subject is classified as PD if more than half of its samples are predicted as 1; otherwise, the subject is classified as healthy. Thus, it is expected that for any disease having more than one sample per patient, the proposed system could be an ideal candidate for diagnosis. Moreover, the optimal subset of features for C = 0.0015 with n = 7 contains F_5, F_10, F_15, F_19, F_21, F_24 and F_26. Additionally, the best result of 100% was obtained for an optimally configured DNN with five layers, i.e., L = 5, and 30 neurons in each hidden layer.
It is evident from Table 5 that if the optimal hyper-parameters of the DNN model are not utilized, we may obtain poor performance even with an optimal subset of features. Thus, better performance can be achieved if the extracted features are refined and an optimally configured DNN is utilized.
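The subject-level decision rule stated above (PD if more than half of a subject's samples are predicted as 1) can be written down directly. The function name and the example predictions are illustrative; the handling of an exact tie follows the stated rule literally, so a tie is labeled healthy.

```python
def classify_subject(sample_predictions):
    """Label a subject PD (1) if more than half of its samples are predicted 1."""
    positives = sum(1 for p in sample_predictions if p == 1)
    return 1 if positives > len(sample_predictions) / 2 else 0


# A PD subject whose six recordings are not all classified correctly.
preds = [1, 1, 0, 1, 0, 1]
subject_label = classify_subject(preds)
sample_accuracy = sum(1 for p in preds if p == 1) / len(preds)

print(subject_label)              # 1
print(round(sample_accuracy, 3))  # 0.667
```

The example shows why 100% subject-level accuracy does not imply 100% sample-level accuracy: the subject is classified correctly although only four of its six recordings are.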

LOSO cross-validation on testing database
In this experiment, LOSO CV is performed on the testing database of the second dataset. This is an independent dataset collected from 28 new patients under the same conditions in which the training dataset was collected. This dataset aims to validate the performance that the proposed system achieved on the training dataset. Since this data contains only patient subjects and no healthy subjects, its specificity cannot be reported. The DNN model is trained on the train data file, which is first transformed into a new dataset by extracting only the samples concerned with vowel phonations. The main reason for creating the modified train data is that the test data, in this case, contains only vowel phonations. The simulation results for this experiment are given in Table 6. From the results, it is clear that a maximum accuracy of 78.57% is initially obtained. This is due to the overfitting of the model to the training data. Thus, to prevent the model from overfitting, we take dropout regularization into account. With a dropout rate of 0.3, the proposed method achieved an accuracy of 100%. The dropout regularization is applied to the hidden layers of the DNN model. Dropout is a hyper-parameter that is used when the DNN faces the problem of overfitting. It is important to note that, according to the proper unbiased validation approach depicted in Fig. 1, the accuracy on the testing dataset should be reported as 96.42%, not 100%, because during the model development phase (results given in Table 5) the optimal model was produced under the hyper-parameter configuration n = 7, L = 5 and N_h = 30.
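The point about specificity can be made concrete with a small metrics helper. The sketch below is illustrative: the function name is hypothetical, and the prediction vector is made up so that 27 of 28 patient subjects are classified correctly. Reading the reported 96.42% and 78.57% as 27/28 and 22/28 subjects is our own arithmetic under the assumption that these are subject-level fractions, not a figure stated in the paper.

```python
def subject_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity; undefined ratios are None."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # With no healthy subjects, tn + fp == 0 and specificity is undefined.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Test database: 28 subjects, all patients (label 1); one is misclassified here.
y_true = [1] * 28
y_pred = [1] * 27 + [0]
metrics = subject_metrics(y_true, y_pred)

print(metrics["specificity"])             # None
print(round(100 * metrics["accuracy"], 2))
```

On a patient-only test set, accuracy and sensitivity coincide, so the reported test accuracy says nothing about how the model treats healthy subjects.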

k-fold cross validation with k = 10 on training data of dataset 2
The results of the 10-fold CV experiment for dataset 2 are given in Table 7. It is important to note that the highest accuracy previously achieved for 10-fold CV is 90% (see Table 11). The proposed diagnostic system achieved the best PD detection accuracy of 97.5%. The obtained accuracy is the highest accuracy for k-fold cross-validation on this dataset. Moreover, the optimal subset of features obtained at C = 0.001 with n = 1 contains F_19, while the optimal subset of features at C = 0.01 with n = 4 contains F_10, F_18, F_19 and F_21.

Comparative study
In this section, the performance of the proposed method is compared with other well-known machine learning models and with previously published work that used the two benchmark voice datasets.

Comparison of the proposed method with other models for dataset 1
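For validation purposes, we also carried out experiments by cascading the feature refinement model, i.e., L1 SVM, with other renowned classifiers, namely SVM and artificial neural network (ANN), owing to their remarkable performance on many other biomedical problems. Furthermore, we also checked the performance of the conventional DNN model without any feature refinement module. Next, we developed three similar hybrid systems, i.e., SVM-SVM(Lin), SVM-SVM(RBF) and SVM-ANN, where the first SVM model is the L1-regularized linear SVM used for feature refinement, while the second model is used as a predictive model. In the case of the SVM-SVM hybrid model, we denote the hyper-parameter of the feature selection model by C_1 and the hyper-parameter of the predictive SVM model by C_2. In addition, g denotes the gamma hyper-parameter of the SVM predictive model when it uses the RBF kernel. All these experiments were performed using 10-fold CV. The goal is to evaluate the feature refinement capabilities of the L1 SVM when it is cascaded with state-of-the-art classifiers. Furthermore, all the cascaded models were optimized using the HGSA approach. The results are tabulated in Table 8.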

Comparison of the proposed method with other models for dataset 2
The same types of cascaded models were also developed for the second dataset. The results are reported in Table 9. From Tables 8 and 9, it is clear that the proposed method shows better performance. Additionally, in each case, the L1 SVM produces features of better quality, and hence the performance of the predictive model is improved whether it is SVM, ANN, or DNN. Thus, these results validate the feature refinement capabilities of the developed cascaded systems.

Comparison with previously reported methods
For comparison purposes, Tables 10 and 11 list accuracies obtained in previous studies by different methods applied to the two voice recording-based PD datasets. As shown in these tables, our developed model yields better classification accuracy than previously proposed methods in the literature.
Based on data in Tables 10 and 11, we are in a position to conclude that our developed diagnostic system gives state-of-the-art performance in terms of PD detection accuracy.

Limitations of the study
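Although this study showed good performance in terms of differentiating PD patients from healthy subjects, there are some limitations. One limitation pertains to the data used in the study. Information such as the severity of the disease in PD patients from the testing dataset of the second dataset and whether the data collection was carried out in the ON or OFF state of the disease is missing. The study did not investigate whether accuracy varies depending on disease duration and severity. Another diagnostic challenge in Parkinsonism is differentiating between idiopathic PD and atypical PD (e.g., progressive supranuclear palsy (PSP), multiple system atrophy (MSA), corticobasal syndrome (CBS), dementia with Lewy bodies (DLB)), where vocal dysfunction is also manifested 67. The study did not investigate this kind of differential diagnosis.

Table 8. Results of other models on dataset 1. C_2/N_1: Hyper-parameter of SVM predictive model or width of first hidden layer in case of ANN or DNN predictive model. g/N_2: g hyper-parameter of SVM predictive model or width of second hidden layer for DNN predictive model. C_1: Hyper-parameter of the L1 regularized SVM. n: the size of the optimal subset of features. Significant values are in bold.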

Conclusion
This paper has addressed two primary issues concerning the automated detection of PD. Firstly, it has highlighted the inadequacies of validation methodologies employed in previous studies, which led to the creation of biased predictive models. Secondly, it has recognized the persistent challenge of achieving high PD detection rates when unbiased models are employed. To mitigate bias, this study has adopted appropriate validation approaches.
In addition, to enhance the accuracy of PD detection, a two-stage diagnostic system, referred to as L1 SVM-DNN, has been proposed. Notably, unlike previous methods, this research has emphasized the independence of the model development and testing phases. Two benchmark datasets were employed for validation purposes. The experimental results have demonstrated that the proposed method attains a classification accuracy of 97.5% with 10-fold CV and 100% accuracy with LOSO CV. For generalization purposes, we also evaluated the optimally developed model on the testing dataset and obtained 96.42% accuracy. Based on these outcomes, it can be confidently asserted that the developed cascaded system holds significant promise in automated differentiation of PD patients from healthy subjects.
Although the L1 SVM-DNN approach showed outstanding performance in terms of differentiating PD patients from healthy subjects, from a clinical diagnostic perspective, this kind of automated differentiation has limited significance. This is because, in real-time applications, differentiating between idiopathic PD and atypical PD (e.g., PSP, MSA, CBS, DLB), where vocal dysfunction is also manifested, is a more challenging task. Therefore, future efforts should focus on the collection of a multi-class dataset, including data from healthy subjects, idiopathic PD, and atypical PD and its subtypes. Unbiased machine learning models, like L1 SVM-DNN, should be trained and tested on such multi-class problems. These models would have more significance and could be deployed in hospitals and clinics for real-time diagnostic applications.
Table 2. Details of H&Y scores for PD patients in the first dataset and UPDRS III scores for PD patients in the second dataset.

Table 3. Results.

Table 4. Results of 10-fold CV for dataset 1. C: Hyper-parameter of the SVM model. n: number of selected features. N_1: Width of first hidden layer. N_2: Width of second hidden layer. ACC[URACY]: Percentage of accuracy obtained for 10-fold CV, Sen[sitivity], Spec[ificity]. Significant values are in bold.

Table 5. Results of LOSO on train database of dataset 2.

Table 6. LOSO CV on a test database of dataset no. 2. C: Hyper-parameter of the SVM model. n: number of selected features. L: layers in DNN. N_h: Width of each hidden layer. Dropout: A hyper-parameter utilized when the network is over-fitting. ACC[URACY], Sen[sitivity]. Significant values are in bold.

Table 7. Results of 10-fold on train data file of dataset 2. C: Hyper-parameter of the SVM model. n: number of selected features. N_1: Width of first hidden layer. N_2: Width of second hidden layer. ACC[URACY], Sen[sitivity], Spec[ificity]. Significant values are in bold.

Table 9. Results of other models on dataset 2. C_2/N_1: Hyper-parameter of SVM predictive model or width of first hidden layer in case of ANN or DNN predictive model. g/N_2: g hyper-parameter of SVM predictive model or width of second hidden layer for DNN predictive model. C_1: Hyper-parameter of the L1 regularized SVM. n: the size of an optimal subset of features. Significant values are in bold.

Scientific Reports (2024) 14:1333 | https://doi.org/10.1038/s41598-024-51600-y | www.nature.com/scientificreports/

Table 10. Performance of different methods recently published for dataset 1.

Table 11. Performance of different methods recently published for dataset 2.