Using blood data for the differential diagnosis and prognosis of motor neuron diseases: a new dataset for machine learning applications

Early differential diagnosis of several motor neuron diseases (MNDs) is extremely challenging due to the high number of overlapped symptoms. The routine clinical practice is based on clinical history and examination, usually accompanied by electrophysiological tests. However, although previous studies have demonstrated the involvement of altered metabolic pathways, biomarker-based monitoring tools are still far from being applied. In this study, we aim at characterizing and discriminating patients with involvement of both upper and lower motor neurons (i.e., amyotrophic lateral sclerosis (ALS) patients) from those with selective involvement of the lower motor neuron (LMND), by using blood data exclusively. To this end, in the last ten years, we built a database including 692 blood data and related clinical observations from 55 ALS and LMND patients. Each blood sample was described by 108 analytes. Starting from this outstanding number of features, we performed a characterization of the two groups of patients through statistical and classification analyses of blood data. Specifically, we implemented a support vector machine with recursive feature elimination (SVM-RFE) to automatically diagnose each patient into the ALS or LMND groups and to recognize whether they had a fast or slow disease progression. The classification strategy through the RFE algorithm also allowed us to reveal the most informative subset of blood analytes including novel potential biomarkers of MNDs. Our results show that we successfully devised subject-independent classifiers for the differential diagnosis and prognosis of ALS and LMND with remarkable average accuracy (up to 94%), using blood data exclusively.

www.nature.com/scientificreports/ different MNDs, making the differential diagnosis hard to apply. Particularly, neurologists often fail to make a diagnosis of ALS as compared to other MNDs within the first year of illness 11 . The long latency in the differential diagnosis of most ALS/MND cases limits the possibility of a proper therapeutic approach 12,13 . In contrast, an earlier diagnosis reduces the period of uncertainty for the patient allowing them to plan the future care and the essential support, which may have an impact on the progression of the disease 3 . The current formal diagnosis of ALS is clinical and based on the revised El Escorial criteria 4,[14][15][16] . Other tools such as neuroimaging, electrophysiology, and cerebrospinal fluid (CSF) have the limited role of excluding the possibility of alternative neurological conditions with similar symptoms 17 . Previous studies on blood data have already proven the involvement of altered metabolic pathways in MNDs, despite limiting their investigation on ALS patients, i.e., the most common form of MND [18][19][20][21][22] . However, biomarker-based monitoring tools are still far from being applied in the clinical practice 23 . More specifically, Lu et al. 23 have already evaluated the combined blood expression of neuromuscular and inflammatory biomarkers as predictors of disease progression and prognosis in ALS. Furthermore, ALS-specific systemic inflammatory signals have also been reported, including a reduced frequency of regulatory T cells in the blood in individuals with a faster disease progression [18][19][20][21][22] . A common limitation of studies investigating MND-related blood analytes is due to the small number of analytes that are usually arbitrarily and heuristically chosen. A more innovative way to proceed might start from a bigger number of blood parameters later selected according to a data-driven strategy.
In this work, we introduce a new dataset containing diachronic clinical and biochemical data acquired over the last 10 years from both ALS and LMND patients. Each patient has been clinically followed up by the same experienced neurologist through periodical medical examinations and blood analysis until either the present day or his/her death. Our dataset is unique in the scientific literature as every single record combines clinical outcomes with a remarkable collection of 108 common and rare blood analytes, including haemochrome indexes, haemostasis and metabolism parameters, routine functional profiles of the main organs, and inflammatory/immunological and oxidative markers.
Through the application of robust and well-validated statistical and machine-learning (ML) methods to this new dataset, we aim to detect specific patterns of blood analytes capable of automatically discriminating ALS from LMND patients and of supporting the prognosis.
ML techniques have already been successfully applied to ALS data sets and some promising diagnosis models have been proposed 17 . Prognostic models have been tested using clinical, biological, and neuroimaging data 17 . However, to the best of our knowledge, there are no studies that have applied ML techniques to support a differential diagnosis of different MNDs. The main limitation of classification performance is due to the small number of training samples compared to the large number of features. In our study, we have addressed the issue of the poor sample-to-feature ratio by successfully applying a feature selection algorithm that uses a backward elimination procedure 24,25 . Thanks to this method, we identified the smallest but at the same time most informative subset of blood analytes with the aim of reducing the necessary number of blood analyses and, consequently, increasing the cost-effectiveness.

Methods
Standard protocol approvals, registrations, and patient consents. Ethical approval was obtained from the Tuscany Ethics Committee (N° 14568). All participants signed an informed consent form or, if this was not possible, gave their verbal permission for a carer to sign on their behalf. Moreover, all methods were carried out following relevant guidelines and regulations.
Patient recruitment criteria. Inclusion criteria. Our study included 726 blood samples acquired from 41 ALS and 25 LMND patients, collected diachronically over the last 10 years. Among these, we considered 692 blood samples acquired from 35 ALS and 20 LMND patients to build the dataset and the classifiers presented in this work. The remaining 34 blood samples, acquired from 6 additional ALS and 5 additional LMND patients, were included in the study at a later stage only for performance evaluation of the classification analyses (see "Generalization performance evaluation" section).
All ALS and LMND patients underwent periodical electrophysiological examinations including electromyography, electroneurography, and motor/magnetic evoked potentials. The patients were diagnosed and included in the study according to the El Escorial revised criteria 4,15 . According to these criteria, the ALS patients showed the simultaneous presence of upper (cortical) and lower (brainstem or spinal) motor neuron signs such as spastic tone, hyperreflexia, clonus, pathologic reflexes; the indisputable progression of the disease; the absence of an alternative reasonable explanation for symptoms and signs. On the other hand, LMND criteria considered only patients with the exclusive presence of lower motor neuron signs combined with weakness, muscle atrophy, fasciculations, and the indisputable progression of the disease. Of note, we enrolled only "clinically definite" patients, namely those with clinical signs of the involvement of both the upper and lower motor neurons for the ALS group or of the exclusive involvement of the lower motor neuron for the LMND group, in three out of four body regions: bulbar, cervical, thoracic, and lumbosacral. More in detail, we enrolled in the study 14 ALS and 6 LMND patients showing clinical signs in the bulbar, cervical, and lumbosacral regions; 7 ALS and 3 LMND patients in the bulbar, cervical, and thoracic regions; 5 ALS and 5 LMND patients in the bulbar, thoracic and lumbosacral regions; and 16 ALS and 11 LMND patients in the cervical, thoracic and lumbosacral regions.
It is important to note that LMND patients in the course of their disease could exhibit symptoms and signs related to the involvement of the upper motor neuron, and consequently fall within the diagnosis of ALS. Accordingly, we included in our study only LMND patients who kept exclusive involvement of the lower motor neuron over time: i.e., patients who were diagnosed with LMND one year or longer prior to the study and did not
Data collection. Over the last 10 years, we have collected 692 clinical and blood data records from 35 ALS and 20 LMND patients, sampled approximately every 3 months. The data have been used to develop an ongoing database including symptom onset (defined as the first patient-reported body weakness complaint 23 ), progression rate (PR), and other clinical data, together with 108 blood analytes (Table 1).
Clinical data. Clinical data include demographics, medical history, treatment information, and disease severity index. The latter was scored according to the revised form of the ALS Functional Rating Scale, ALSFRSR 26 . In addition, we calculated the disease PR by subtracting the ALSFRSR score from 48 (i.e., the maximal ALSFRSR score) and dividing by the disease duration (from the symptom onset) expressed in months 23 . Within both our ALS and LMND groups of patients, we considered two sub-groups according to their PR. Specifically, we defined a relatively slower-progressing sub-group and a relatively faster-progressing sub-group using a cut-off of 0.5 as in 23 . Accordingly, within the LMND dataset, 185 blood data were labeled as "low PR" and 99 as "high PR". Concerning the ALS group, blood samples were divided into two groups of 259 and 149 data with low and high PR, respectively.
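As an illustration of the PR definition above, the following minimal Python sketch computes the progression rate and assigns the low/high label with the 0.5 cut-off. The function names and the example patient are hypothetical and are not part of the study's code.

```python
def progression_rate(alsfrs_r: float, months_since_onset: float) -> float:
    """PR = (48 - ALSFRS-R score) / disease duration in months."""
    if months_since_onset <= 0:
        raise ValueError("disease duration must be positive")
    return (48 - alsfrs_r) / months_since_onset


def pr_label(pr: float, cutoff: float = 0.5) -> str:
    """Return 'high PR' when pr >= cutoff, otherwise 'low PR'."""
    return "high PR" if pr >= cutoff else "low PR"


# Hypothetical patient: ALSFRS-R score of 40, 24 months after symptom onset.
pr = progression_rate(40, 24)  # (48 - 40) / 24
print(round(pr, 3), pr_label(pr))
```

A patient who has lost 12 points in the first 12 months would have PR = 1.0 and fall in the faster-progressing sub-group.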
Lab data. Blood analytes (n=108) included haemochrome and routine profiles for kidney, liver, pancreas, and heart functions, together with haemostasis and metabolism parameters, inflammatory and immunological markers (lymphocyte subsets, immunoglobulins, cytokines and growth factors), and oxidative markers, which are thoroughly reported in Table 1.
Database description. We considered three different datasets: one including all the 692 clinical and blood data from both patient groups (all-patients), and two sub-datasets selecting only patients at their early disease stages, namely those with high scores ( ≥ 35/48 ) of ALSFRSR (hSc) and those within their first year from the symptom onset (1-y). More in detail, the hSc dataset represents a group of data taken from patients both with benign prognosis (from the clinical outcome) and at the beginning of their disease course. This included 44 patients (30 ALS and 14 LMND) for a total of 143 blood samples. The 1-y dataset included 31 patients (20 ALS and 11 LMND) for a total of 70 blood samples acquired during the first year of the course of the disease, without considering the prognosis. Comparison between hSc /1-y ALS and hSc /1-y LMND might help us to get information for an early differential diagnosis.
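The two early-stage sub-datasets can be thought of as simple filters over the sample records. The sketch below is only illustrative: the `Sample` record, its field names, and the choice of "at most 12 months" as the first-year boundary are assumptions, not details taken from the study database.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    patient_id: str            # hypothetical identifier
    group: str                 # "ALS" or "LMND"
    alsfrs_r: int              # ALSFRS-R score, 0..48
    months_since_onset: float  # time from symptom onset to blood draw


def select_hsc(samples, min_score=35):
    """hSc subset: samples taken when the ALSFRS-R score was >= 35/48."""
    return [s for s in samples if s.alsfrs_r >= min_score]


def select_first_year(samples, max_months=12):
    """1-y subset: samples taken within the first year from symptom onset
    (assumed here to mean months_since_onset <= 12)."""
    return [s for s in samples if s.months_since_onset <= max_months]


# Toy records, invented for illustration only.
toy = [
    Sample("p1", "ALS", 42, 6.0),
    Sample("p1", "ALS", 30, 20.0),
    Sample("p2", "LMND", 36, 30.0),
]
print(len(select_hsc(toy)), len(select_first_year(toy)))
```

Note that the two filters are independent: a sample can belong to the hSc subset without belonging to the 1-y subset, and vice versa.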
Data are available upon reasonable request and verification of all ethical aspects, at p.bongioanni@ao-pisa.toscana.it.
Statistical and classification analysis. The dataset comprising 692 observations and 108 features, and its subsets described in "Data collection" section, were used to perform exploratory statistical analysis and to build five different pattern recognition systems.
Descriptive statistics. An exploratory group-wise statistical comparison between ALS and LMND patients was performed for each blood analyte. We used a non-parametric Mann-Whitney U test with a Holm-Bonferroni adjustment for multiple testing. The same non-parametric statistical analysis was used also to analyze possible statistical differences between both ALS and LMND patients with low (< 0.5) and high PR ( ≥ 0.5).
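The Holm-Bonferroni step-down adjustment mentioned above can be sketched in a few lines of plain Python. This is a generic illustration that takes already-computed p values as input (e.g., one Mann-Whitney U test per analyte); the function name and the example p values are hypothetical, and this is not the code used in the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p values, in the input order.

    The i-th smallest p value (i = 0, 1, ...) is multiplied by (m - i),
    the results are made monotonically non-decreasing, and capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values for three analytes.
raw = [0.01, 0.04, 0.03]
adj = holm_bonferroni(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

In this toy example only the first analyte remains significant at the 0.05 level after correction, although all three raw p values were below 0.05.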
Classification analysis. For each of the three datasets described in "Database description" (i.e., all-patients, hSc, and 1-y), we performed a classification analysis aiming at distinguishing between the ALS and LMND groups using only blood data information. Moreover, a further classification analysis was performed on the complete dataset only to distinguish, within each of the two groups, between patients with high PR and low PR.
Our learning algorithm is based on a support vector machine (SVM) model. The SVM finds the decision boundary that maximizes the margin separating the two classes of training data points. However, due to the characteristics of our dataset, two main issues needed to be addressed: first, our data were not linearly separable, i.e., the boundary between the two classes could not be linear as in a standard SVM; second, the very high number of features (i.e., analytes) compared to the number of data points led to a high overfitting risk, as well as less interpretable results. To solve the first issue, we adopted an RBF kernel that mapped the original input dataset into a new space where our data became linearly separable (using the "kernel trick") (see Fig. 1). Equivalently, the RBF kernel made our decision boundary nonlinear in the original feature space. To address the second issue, we employed a feature selection (FS) strategy. In particular, we implemented a recently developed recursive-feature-elimination (RFE) algorithm embedded in the SVM model, which also includes a correlation bias reduction strategy 27 . Embedded FS ranked the features based on their importance in separating the two classes through a specific classifier, i.e., the SVM. Once the features were ordered, we iteratively removed the last-ranked one, since it had the least effect on classification. At each iteration step, we estimated the classification performance (i.e., accuracy), until all the features had been removed (Fig. 1). The later a feature was removed, the more important it was. The classifier model was fit and evaluated through a leave-one-subject-out (LOSO) procedure, which is a nearly unbiased estimator of the out-of-sample error [28][29][30] .
More in detail, within the LOSO scheme, considering N subjects, we iteratively split the feature set into a training set, comprising the n observations from ( N − 1 ) patients, and a test set, comprising the m observations from the remaining patient. This approach is a highly reliable procedure, especially in the case of multiple correlated observations from the same source 31 .
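The LOSO splitting described above can be illustrated with a short, dependency-free Python sketch. The function name and the toy subject identifiers are hypothetical; the point is only that all observations of the held-out subject go to the test set and none of them appear in the training set.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one per subject.

    subject_ids[i] is the subject that observation i belongs to.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: six blood samples from three hypothetical patients.
ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
for subject, train_idx, test_idx in loso_splits(ids):
    print(subject, train_idx, test_idx)
```

With N subjects the loop produces N train/test splits, and the per-split results can then be pooled to obtain the overall accuracy estimate.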
To solve the SVM optimization problem, we used the default hyper-parameters and solver suggested by the LIBSVM library 32 . Indeed, when FS algorithms are adopted, they already lead to a deep exploration of the hypothesis space. Therefore, parameter tuning might often lead to an over-searching condition with a consequent overoptimistic accuracy estimation, as well as a high computational cost.
In summary, the employed method combined both the possibility of a nonlinear model and an FS strategy that also mitigates the bias due to correlated features 27 . In particular, FS had a crucial role not only in maximizing the classification accuracy and reducing the overfitting risk, but also in allowing us to remove the irrelevant, noisy, and redundant analytes, thus highlighting the most informative subset 9,33 . Previous studies have shown that embedded FS, i.e., scoring features based on the output of a predictive model, commonly outperforms other FS strategies such as filter and wrapper approaches 9,33 . Of note, further embedded approaches for reducing the dimension of the feature space were tested, e.g., LASSO-based models such as L1-SVM and the LASSO binomial generalized linear model. However, very poor results were achieved, probably because L1-regularization does not enable employing the RBF kernel, which has proved to play a crucial role in the good classification of our datasets.
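The backward elimination loop at the core of RFE can be sketched generically as follows. This is a simplified skeleton, not the SVM-RFE with correlation bias reduction of ref. 27: `rank_fn` and `accuracy_fn` are hypothetical placeholders standing in for the SVM-based feature ranking and for the LOSO accuracy estimate, and the toy weights and analyte names are invented for illustration.

```python
def backward_elimination(features, rank_fn, accuracy_fn):
    """Generic recursive feature elimination loop.

    rank_fn(active)     -> active features ordered from most to least important
    accuracy_fn(active) -> estimated accuracy using only the active features
    Returns (ranking, best_subset, best_accuracy); ranking lists features
    from most important (removed last) to least important (removed first).
    """
    active = list(features)
    removed = []   # features in order of removal
    history = []   # (subset, accuracy) evaluated at each step
    while active:
        history.append((tuple(active), accuracy_fn(active)))
        worst = rank_fn(active)[-1]   # last-ranked feature
        active.remove(worst)
        removed.append(worst)
    ranking = removed[::-1]
    # Highest accuracy wins; ties are broken in favour of the smaller subset.
    best_subset, best_acc = max(history, key=lambda h: (h[1], -len(h[0])))
    return ranking, best_subset, best_acc


# Toy stand-ins: two informative features and two noisy ones.
weights = {"Cre": 0.9, "Tran": 0.7, "noise1": 0.1, "noise2": 0.05}
informative = {"Cre", "Tran"}


def toy_rank(active):
    return sorted(active, key=lambda f: weights[f], reverse=True)


def toy_accuracy(active):
    good = sum(1 for f in active if f in informative)
    bad = len(active) - good
    return 0.5 + 0.2 * good - 0.05 * bad


ranking, subset, acc = backward_elimination(weights, toy_rank, toy_accuracy)
print(ranking, subset, round(acc, 2))
```

In the toy run the noisy features are removed first, and the accuracy peaks once only the two informative features remain, which mirrors how the accuracy-versus-number-of-features curve is used to pick the most informative subset.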
Generalization performance evaluation. As mentioned in "Inclusion criteria", to measure the classifier generalization performance, we recruited 11 additional patients (6 ALS and 5 LMND) to build an independent test set comprising 34 new blood samples. These patients were included in the study only at the end of the model identification analyses to estimate the generalization error in an unbiased way. Since this test set did not include patients at the early stage of the disease, it was used to test the generalization performance of only three classifiers: (i) ALS versus LMND considering the all-patients dataset, (ii) high versus low PR considering the ALS dataset, and (iii) high versus low PR considering the LMND dataset. It is worth noting that, unlike the training and validation sets, this test set included only the reduced subset of analytes previously selected through the LOSO validation procedure.

Results
In this section, we present the results obtained from both statistical analysis and classification. The section is organized into sub-sections according to the kind of comparison (ALS versus LMND or high PR versus Low PR), the dataset considered, and the kind of analysis (statistics and classification).

ALS versus LMND statistical comparison. All-patients dataset.
Concerning the dataset that includes the whole group of patients, we observed that most of the analytes' average values were within the normal healthy ranges for both ALS and LMND (except for IGF1, MMP9, ICAM1, VCAM1, and IgE). Instead, the results of the statistical comparison revealed several analytes that significantly differed between ALS and LMND.
The list and relative descriptive statistics of the significant analytes (corrected p value < 0.05) are shown in Table 2. The most relevant differences are described hereafter. Specifically, ALS patients had significantly fewer RBC, but of larger size (MCV) and containing more Hb (MCH, MCHC), than LMND patients. Chol and Trig blood content was also higher in the LMND group, as were the levels of growth factors (FGF and IGF1) and cell-adhesion molecules (ICAM1 and VCAM1). Instead, ALS patients showed higher levels of Fe and Fer (associated with a reduced amount of Tran), and higher values of CK (associated with lower values of LA) than LMND patients. Within the inflammatory-immunological analyte group, ALS patients had lower amounts of alpha- and beta-globulins, but higher amounts of gamma-globulins as well as higher IgG, IgA, IgM, and IgE content. Other relevant immunological biomarkers were found to differ significantly between the two patient groups: ALS patients showed higher amounts of CD3 and CD4, IL3, and IL8, and lower levels of CD19 lymphocytes, soluble IL2R, soluble IL6R, TNF, TNFRs, and IL12 (Table 2).
hSc and 1-y datasets. When we consider only the subset of patients with high ALSFRSR (i.e., hSc group), we note that ALS patients had average lower values of Tran and higher levels of iron than LMND ones. Moreover, also in this case, several immunological biomarkers were found significantly different between the two patients' groups (see Table 2): results revealed a higher percentage of CD4 cells and IL3 as well as a significantly lower percentage of CD8 and CD25 cells in the hSc-ALS group. Focusing on the 1-y dataset, 1-y-LMND patients had a significantly higher level of soluble IL2R than 1-y-ALS patients.

ALS versus LMND classification results. The results of the SVM-RFE automatic classification between ALS and LMND patients using the different datasets described in "Data collection" section are shown in Fig. 2.
Considering the whole group of patients (all-patients dataset), we achieved a maximum recognition accuracy of 72.53%. This accuracy was obtained by selecting only the 6 most informative analytes according to the RFE criterion (Fig. 2A). Taking into account only those patients at an early stage of the disease (i.e., hSc and 1-y datasets), the maximum accuracy increased to 81.25% for the hSc dataset using 11 analytes (Fig. 2B), and to 93.94% for the 1-y dataset combining the first 10 ranked features (Fig. 2C).
Most informative selected analytes. Exploring the analytes selected by the RFE algorithm (see Table 2), it is worth noting that Cre, Tran, P, and Ca are in the first positions among the selected analytes in the three classifications. However, the most informative ranked analytes are the immunological ones: 3 out of 6 considering the all-patients dataset, i.e., monocytes%, IgM, and CD3 lymphocyte counts; 5 out of 11 considering the hSc group, i.e., IgE, IgM, absolute leucocyte counts, CD4 and CD8 lymphocytes; 5 out of 10 considering the 1-y group, i.e., IgE, IgG, γGl, CD4 and CD8 lymphocyte counts.
Test set evaluation. The generalization performance of the classifier fitted on the all-patients dataset was also assessed on the test set of 11 patients described in "Generalization performance evaluation". The result revealed an accuracy of 70.59%, i.e., consistent with the performance estimated by the LOSO procedure.

Low versus high progression rate statistical comparison. Table 3 shows the results of the statistical comparison between patients with high and low PR, for the ALS and LMND datasets, respectively.
ALS dataset. In the fast progressive ALS group, we found higher levels of Ca, K, Mg, P, Vit B12, Fol, Chol HDL, ESR, and Amy than in the LMND group. On the other hand, in slowly progressing ALS, we found more basophils (BAS), a higher percentage of CD3 and CD8 cells, and higher levels of LDH, albumin, IgG, IL4, IL10, IL2R, IL6R, EPO, ICAM1, ERS, associated with lower percentages of CD16+56 and CD45 cells.
LMND dataset. Considering the LMND group, we observed higher levels of VitB12, Fer (associated with lower values of Tran), ALT, GGT, and LDH in the fast progressive LMND group than in the slow progressive one. Moreover, the fast progressive LMND showed also a reduced quantity of MON, EOS, TPAO, ESR, K, and Na; whereas the soluble IL2R and IGF1 resulted higher than in the slow progressive one.

Low versus high progression rate classification results.
Concerning the automatic recognition of fast and slow progressive ALS and LMND patients, high accuracy was achieved for each of the two groups. In particular, considering the ALS patients, we obtained an accuracy of 87.25% by using the first 16 ranked analytes (Fig. 3A). Likewise, considering the LMND patients, we achieved an accuracy of nearly 93% by using the first 12 ranked analytes (Fig. 3B). Table 2. In both groups, the highest recognition accuracy was achieved by a combination of analytes of different origins. Interestingly, Chol, HDL, Fer, VitB12, CD16+56, and MCP1 are shared between the ALS and LMND datasets.

[Table column headings: Analyte; Median ± MAD (ALS); Median ± MAD (LMND); p values]
Test set evaluation. The results of the generalization performance assessment on the test set showed an accuracy of 81.25% for the ALS group and 90.91% for the LMND one, confirming the very good performance estimated by the LOSO procedure.

Discussion
In our study, we introduce a novel dataset of blood data and present an ML approach aiming at supporting clinicians in making a differential diagnosis of MNDs. Specifically, the applied learning algorithm is able to discriminate ALS from LMND patients, using blood data information exclusively. Moreover, our approach is able to predict the prognosis of MND patients with remarkable accuracy, recognizing whether the patients have high or low disease progression. Our results are obtained performing an automatic selection of the best combination of blood analytes ensuring the maximum classification accuracy.
Over the last 10 years, we have enrolled 55 ALS and LMND patients and collected 692 blood samples from which 108 blood parameters have been extracted. This outstanding collection of blood analytes together with a large number of blood samples is unique in the scientific literature and grants an important value to our results. Moreover, most studies focus on more invasive and expensive methods, such as CSF analysis or neuroimaging, which are not suitable for repeated sampling over time, rather than on routine investigations. Indeed, plasma is easily available and represents an attractive biological fluid for the detection of biomarkers and for extending CSF-based biomarkers 34 . ML models and large datasets offer unprecedented opportunities to appraise candidate diagnostic, monitoring, and prognostic biomarkers 17 . Our database has been used as input of an SVM-RFE algorithm. This method, together with the LOSO cross-validation strategy, mitigates the risk of confounding classification results (i.e., overfitting), which should not be underestimated with such a number of features (i.e., 108). Indeed, on the one hand, the LOSO strategy reduces the risk of a biased optimistic estimation of the classifier accuracy by avoiding the presence of observations of the same subject in both the training and test sets. On the other hand, the RFE algorithm reduces the dimension of the dataset and, at the same time, selects the combination of analytes that maximizes the accuracy using the SVM classifier. To our knowledge, this is the first study that investigates and develops the characterization and classification of different MNDs (ALS and LMND) at the single-subject level, based on blood data alone. In addition, a better understanding as well as an early recognition and prognosis of ALS and LMND may have a significant impact on research activities concerning not only the differential diagnosis but also the development of specific differentiated treatments of ALS and other MNDs.
To this end, given the little progress that has been made in these last years, a novel system able to support the clinical practice is highly desirable. Our results showed a good prediction accuracy (72.53%) in recognizing the disease form of the patient under examination (ALS vs. LMND) that even strongly increased when the early stage of the disease was considered (81.25% based on the ALSFRSR, and 93.94% considering the first year of the disease). The three patients' subgroups are associated with different combinations of blood parameters (Table 2), which allow discriminating between ALS and LMND with the highest accuracy. On the one hand, our selected analytes confirmed the importance of blood immunological properties in the discrimination of MNDs, as already reported by Lu et al 23 .
In fact, we found that, among the selected analytes, most were inflammatory and immunological. Accordingly, some of the most relevant information was found in the leukocytes and their related analytes (i.e., lymphocyte subsets and immunoglobulins). Some of these analytes were highlighted also in the univariate statistical analysis. In particular, ALS patients, as compared with LMND patients, were characterized by increased percentages of CD3, CD4, and CD8, as already observed in ALS patients compared to healthy controls 35 , as well as higher levels of IgM. On the other hand, our classifier revealed the importance of some non-standard predictive features such as P, Cre 36 , and Tran 37 , which have been only recently indicated as factors related to the disease and potential markers and are still under study. Such data could highlight important new mechanisms related to the disease. Moreover, our statistical results have indicated significant differences in other recently proposed non-standard potential markers
Table 3. List of the most informative features selected by the SVM-RFE for each ALS/LMND classification.
Concerning the disease progression, by means of an ML approach, we succeeded in classifying slowly versus fast progressive ALS and LMND patients with very good prediction accuracy (87.25% and 92.8%, respectively), indicating the potential of blood analyte measurements for prognostic purposes. Exploring the selected blood analytes for the evaluation of prognosis, we found that the RFE algorithms were able to select a common group of markers for both diseases: VitB12, CD16+56, Chol, HDL, and Fer. As far as VitB12 is concerned, no data correlating its endogenous levels with disease severity have been reported. CD16+56 has been found to be higher in ALS patients compared to healthy controls 35 . Contrasting results are still reported for Fer 42 or Chol and HDL 43 as biomarkers; nevertheless, some studies suggest that hyperlipidemia is a protective factor in ALS 13 .
This could suggest that the aforementioned analytes play a crucial role in differentiating the disease progression regardless of the type of MND. More in detail, from the statistical comparison we observed that VitB12 was significantly higher in fast versus slow progressive patients for both ALS and LMND groups. Moreover, whereas significantly higher amounts of Chol and HDL characterized fast PR in ALS patients exclusively, higher Fer levels were related to fast PR in the LMND group only. Of note, since the resulting optimal learning model only requires the acquisition of a few blood analytes, some of them typical of routine clinical analysis, not only is the risk of overfitting strongly mitigated, but this also leads to a diagnosis and prognosis support tool with reasonably low costs. Due to the difficult and slow process of recruiting such kinds of patients, the high economic cost of the biochemical analyses, and the strict inclusion criteria, the patient sample size is limited, even if the number of blood samples is large. Moreover, when the hSc and 1-y, as well as the PR classification problems are considered, the datasets are subject to a decrease in the number of observations. This might induce a higher risk of overfitting. For this reason, the applied methodological strategies were specifically conceived to mitigate the risks due to the limited number of recruited patients and to make our 692 observations enough to achieve positive, robust, and replicable results. It is worth noting that even considering the prediction accuracy achieved after selecting only the first five most informative features, and consequently reducing the complexity of the model and the overfitting risk, a recognition accuracy of over 75% was always reached in all classification tasks, except for the hSc problem where 6 features were necessary.
Moreover, to test the generalization performance of the proposed recognition systems, and, therefore, the possibility to export our results in a real clinical scenario, we tested the fitted model on a test set including 11 new patients. The results confirm even in this case very high accuracy consistent with that estimated during the LOSO procedure. This is a further confirmation of the robustness of our recognition system suggesting good replicability of our results, and the fact that the relatively low amount of data did not strongly affect the reliability of the results.
In conclusion, this study, besides strengthening the importance of the immunological components in MNDs, raises many questions about those analytes (widely used but trivial) that have been shown to be important in the discrimination of ALS and LMND but have not yet been specifically related to the different types of MNDs. On the other hand, the immunological information is not sufficient if it is not supported by other blood analytes that so far have been considered non-standard markers for neurodegenerative diseases. Moreover, our data and results strongly support the hypothesis that ALS and LMND represent two different diseases, whereas in many cases they are considered and treated as a single one.
Although significant p-values were reported for several analytes, the reported intervals (Median ± MAD) should not be translated into a list of cut-off levels to be used in clinical practice. Indeed, despite the statistical significance, such intervals often strongly overlap between the two groups under comparison and fall within the ranges of healthy controls. On the other hand, our classification system might provide clinicians with an automatic tool that can easily support the differential diagnosis of LMND and ALS patients, showing the resulting class with the related accuracy level, in an easier and more interpretable way compared to the statistical cut-offs. It is also surprising to note that the accuracy increased when data related to the first year from the onset of the symptoms were considered. Consequently, our results can support the clinician in differentiating between the two diseases at the very early stage of the disease, whereas, with normal clinical practice, it is often difficult to understand the actual involvement of the upper motor neuron.
From the methodological point of view, this study does not add a significant innovation in the machine learning field, although the selected method perfectly fits the aims of our study and the specifications of our type of data. However, this study can be considered as an onset for future innovative methodological applications. Indeed, data collection will go on to increase the number of patients and blood samples. This will give the possibility to apply deep learning-based classification methods, which might lead to further improvement of the classification performance.
To sum up, this work introduces a new tool to apply automatic techniques for the diagnosis and prognosis of different MNDs and paves the way for future research in which clinicians and scientists will search for an effective treatment for MNDs following a differential and selective approach. Our next study will deeply investigate these analytes that have been automatically selected using a data-driven approach and will compare these results with those achieved including some a priori clinical knowledge in the learning models. Moreover, hierarchical regression models will be employed to predict the disease progression at a single-subject level.