Multivariate classification techniques and mass spectrometry as a tool in the screening of patients with fibromyalgia

Fibromyalgia is a rheumatological disorder that causes chronic pain and other symptomatic conditions such as depression and anxiety. Despite its relevance, the disease still presents a complex diagnosis where the doctor needs to have a correct clinical interpretation of the symptoms. In this context, it is valid to study tools that assist in the screening of this disease, using chemical work techniques such as mass spectroscopy. In this study, an analytical method is proposed to detect individuals with fibromyalgia (n = 20, 10 control samples and 10 samples with fibromyalgia) from blood plasma samples analyzed by mass spectrometry with paper spray ionization and subsequent multivariate classification of the spectral data (unsupervised and supervised), in addition to the treatment of selected variables with possible associations with metabolomics. Exploratory analysis with principal component analysis (PCA) and supervised analysis with successive projections algorithm with linear discriminant analysis (SPA-LDA) showed satisfactory results with 100% accuracy for sample prediction in both groups. This demonstrates that this combination of techniques can be used as a simple, reliable and fast tool in the development of clinical diagnosis of Fibromyalgia.

www.nature.com/scientificreports/ studies predicting some types of cancer 18,19 . Considering that there is no standard instrumental technique for fibromyalgia, the diagnosis is clinical by elimination. The PSI-MS technique was chosen because it is innovative, does not require sample preparation and has a very relevant analytical power, lowering analysis costs when compared to ESI-MS, for example, and can be implemented in hospital routines.
In view of the versatility of the PSI method and in order to better understand the resulting results, it is worth involving an analysis of the metabolites so that it is possible to identify and quantify the main metabolites of the biological system 20 . Metabolomic analysis is typically characterized in two approaches 21 : targeted approach on selected metabolite or metabolite class with known chemical properties; and untargeted approach, measuring all possible metabolites providing hypotheses for further testing. The study of metabolomics contributes to a better understanding of the biological characteristics of different types of cancer 22 , as well as in understanding the association with indicator biomolecules for the diagnosis of fibromyalgia 23 . To make this possible, multivariate statistical analysis is commonly used 24 , where the combinatorial effect of multiple variables is considered and can be characterized in unsupervised techniques, which refer to methods that identify hidden structures in the data without knowing the class labels, and supervised techniques that use the information from the class label to build the classification model 21 .
Several types of chemometric algorithms are reported for pattern recognition and classification of MS data, in particular to discriminate between healthy and cancer samples 24 . Some reports on the investigation of biological and metabolomic markers in biofluids from patients with FM are also located in the literature 23,25,26 , but without the occurrence of application of chemometric tools for group classification. In a recent study 2 the blood plasma of control and fibromyalgia patients is used, in which it was possible to reach good results of discrimination of these groups from infrared (IR) spectroscopic analysis coupled with chemometric techniques. In this case, despite the good results in IR, the spectra obtained are difficult to interpret and have low resolution for elucidating molecules of clinical interest, and consequent detection of FM in blood smear. However, working with IR provides a good suggestion for using classification tools, seeking to optimize the best performance results.
In this study, nine algorithm combinations were used (PCA-LDA, PCA-QDA, PCA-SVM, SPA-LDA, SPA-QDA, SPA-SVM, GA-LDA, GA-QDA and GA-SVM) that are executed and compared to discriminate between blood plasma samples from control and fibromyalgia patients. These algorithms used the spectra obtained through the PSI-MS technique, providing a practical and reliable method without complex sample preparation. In addition, associations of selected variables with already cataloged metabolites are also explored, indicating the possible presence of biological fingerprints in the study groups.

Results
It is possible to visualize in Fig. 1 the overlap along the MS signal that does not allows visual differentiation between the control group (CG) and the fibromyalgia group (FG). The raw spectra are pre-processed with weighted automatic least squares baseline correction and vector normalization 27 . Spectral samples were divided into training (70%) and test (30%) groups using the Kennard-Stone sample selection algorithm. In this division, the set of training samples is used in the construction of the model, while the set of test samples are used in the prediction of model performance. The spectra resulting from the pre-processing are submitted to classification of FG and CG groups. To start distinguishing between classes, pattern recognition algorithms were used. At first, the PCA is used as an exploratory analysis for the dataset, where the score plots with confidence ellipse are shown in Fig. 2. The www.nature.com/scientificreports/ plot of scores of the first two principal components (PC1 and PC2) demonstrate a good separation of the two classes, with the samples from the control group clustered in a region different to the samples from the case group.
The presence of a sample from the control group displaced to the right of the others is also visible, which may indicate an outlier and the presence of a patient with FM in the control group. For this initial analysis, sensitivity and specificity calculations were made with 100% accuracy in discriminating groups. These results suggest the PCA as a good tool for classification, demonstrated in the pattern of discrimination observed in the score plots.
Despite the good performance of the exploratory analysis, it is possible to observe that the confidence ellipses demonstrate a large region of intersection between groups, which proposes a more elaborate exploratory method. We opted for subsequent supervised analyzes, also considering the limited number of samples and confirmation of the results. In the further investigation of the data, supervised discriminant analyzes (LDA and QDA) associated with PCA data reduction were used. In this case, the first 5 principal components (PC) were applied with an accumulated explained variance of 76.33%. For these models, the results are more expressive for linear classification techniques, with 100% sensitivity and specificity for the control group, while for the group of people with FM the values decrease to 33.3% (Table 1). By associating variable selection techniques with discriminant analyzes in SPA-LDA, SPA-QDA, GA-LDA and GA-QDA, the best results are observed for the supervised methods obtained in this study, in which the SPA-LDA model presented 100% accuracy in both classes and the GA-LDA model with 95.8% accuracy for the CG and 73.6% for the FG. The low results in the other models, especially in the QDA combinations, may imply a bad adjustment due to the absence of large differences in the variance structures between the classes 28 . For the mentioned LDA models, as well as the good classification results, the possibility of exploring the most significant variables in the distinction of the GC and GF groups can also be demonstrated, which comprise the investigation of possible relationships with the metabolomics present in the biological samples of individuals participants from each class ( Table 2).  www.nature.com/scientificreports/ Another supervised technique of nonlinear classification, in addition to QDA, used in this paper in search of best results, was the SVM using the Polynomial Base Function (PBF) kernel for the spectral dataset, as indicated in a study with blood samples 29 . For the PCA-SVM, good performance was presented for the CG group with 100% accuracy (Table 1). With the SPA-SVM and GA-SVM variable selection algorithms, the results also indicate good distinctions between CG and FG, with emphasis on the classification of the group with FM in the SPA-SVM model with 100% accuracy in this group, and the GA-SVM where 100% sensitivity and 91.66% specificity were achieved for individuals with the disease. Associations with metabolomics were also raised for the small set of selected variables ( Table 2).
In the supervised models with variable selection, for a universe with more than 900 variables present in the MS dataset, a total of sixty-six values were collected due to the need for three executions of the GA models, because it is a non-deterministic algorithm. Thus, among the variables represented in more than one model there are: 65, 77, 118, 173, 360, 390 and 475 m/z. Among these m/z values, the 118 m/z is present in three different SPA models. These selected variables and for those indicated in the differences between the class means and chemical profile of the samples in the PCA (Fig. 3) forward to the study of metabolites.
Platforms with information in the human metabolomics database such as HMDB 30 , LIPID MAPS 31 and PubChem 32 were used. Among the compounds identified from the m/z values selected in all worked models, the following classes of compounds are observed: organics acids and derivatives, inorganic compound, nucleotides, benzenoids, organohalogen and organosulfur. For the value 118 m/z the database directed to compounds organoheterocyclics.
The observations were carried out on the HMDB and LIPID MAPS platforms following the positive mode ion settings in M + H and molecular mass tolerance in ± 0.01 m/z. Among the identified metabolites, those with the highest value attributed are Lysophosphatidylcholine (LysoPC) found in the HMDB library as LysoPC (16:1/0:0) Table 2. Selected variables (m/z) in PCA and supervised models.  www.nature.com/scientificreports/ and LysoPC (18:2/0:0) compounds. In the LIPID MAPS platform there was no association with any information from the database about these compounds. The relationship with the compound LysoPC (16:4) was indentified on the PubChem platform and the other metabolites of the HMDB library were confirmed.

Discussion
Studies carried out by the American College of Rheumatology offer good results for differentiating between groups of patients with and without fibromyalgia, taking into account the occurrence of 11 of the 18 sensitive pain points for FM symptoms 33 , the accuracy obtained in this classification was 84.9% with a sensitivity of 88.4%. In a more recent study using blood plasma samples from groups with and without FM and readings in mid-infrared spectroscopy coupled with multivariate classification 2 , the accuracy results reached values of 84.2% and sensitivity of 89.5% using the GA-LDA model. For this paper, using paper spray ionization mass spectrometry, the best results provided 100% accuracy for SPA-LDA, in addition to classification with PCA that also recognized all case group (FG) samples as patients with fibromyalgia. When comparing these results it is possible to observe good values for distinction between the classes, but the emphasis is on the SPA-LDA model through the PSI-MS analysis technique, which indicates the good potential of this methodology as a support in the screening for the disease. Thus, the use of different classification methods aims to expand the possibilities of satisfactory results. In the case of unsupervised analysis with PCA, in addition to demonstrating good ability to separate the groups, the use also aimed to reduce multivariate data in order to match the complexity of the subsequent supervised classifiers, with quantities of data available to avoid overfitting or under-training 34 . The good results with PCA can be related to the reduced number of samples, which could also to indicate the impossibility of using supervised techniques. However, this study explores the potential of MS for the smallest number of samples, considering that MS has high analytical sensitivity for small amounts of available biological fluids, demonstrating high resolution with accurate mass data for structural elucidation 35 . The SPA and GA algorithms, through the calculations performed to classify the groups, also indicate selected variables that may be associated with the presence of FM in the case group. The LDA supervised models are recurrent in several clinical studies of this nature, such as in cancer 36,37 and viral diseases [38][39][40] , because they work with processes with less complex characteristics. It is also common in the literature to find similar studies that expand the classifications to more robust models such as QDA 41,42 and SVM 43 , other researches seek to further expand the treatment of spectral samples, submitting them to these three types of techniques (LDA, QDA and SVM) and comparing the better accuracy results given the complexities of multivariate data 2,24 . This last approach is worked on and observed in this article.
This work also explored the possibility of associating variables selected by the models used with metabolomics cataloged in the literature in order to indicate possible biochemical markers in the identification of patients with FM. Among sixty-six selected variables, from a universe with more than 900 variables representing the range of mass/charge ratios analyzed, eight m/z ratio values are recurrent in more than one model while the 118 m/z variable is suggested by all SPA models. Despite not identifying new metabolites present in the database, the observation of greatest interest is to confirm the presence of Lysophosphatidylcholine (LysoPC) compounds 30 suggested by a study of patients with FM indicating a possible role of LysoPC as a biomarker, with results indicating an increase in this metabolite in patients with Fibromyalgia 44 .
The exploratory analysis with PCA and the supervised analysis with SPA-LDA demonstrated satisfactory results with 100% accuracy for predicting samples in both groups (CG and FG) using the PSI-MS technique. Other models had more reasonable values such as GA-LDA, PCA-SVM and SPA-SVM, which may indicate overtraining for the SPA-LDA model, but all classification techniques were submitted to confounding tests that validate the effectiveness of the tool, which it consists of new processing with the mutual exchange between the CG and FG groups to try to deceive the mathematical model. The representation of variables selected by the models also indicated possible lines of investigation for the understanding of biological fingerprints in the identification of patients with FM, despite the absence of new metabolites associated with these variables in the referenced database, based on blood samples, it was possible to confirm the presence of Lysophosphatidylcholine that represent compounds associated with the disease, according to the literature. Regarding the results found in the classification analyses, the study proposes the best results for two simple and easy to perform algorithms, when compared to the others. The results of this study also surpass the reference methods already used, demonstrating another method with good efficacy in FM screening. Despite the optimistic results, it is worth remembering that the sample universe was reduced when compared to the reference methods. Another important consideration is the size of the blood sample collected, as only plasma is used in the instrumental analysis, keeping the main biochemical information, which may indicate that statistics are kept for larger samples.
Thus, considering that the experimental technique with PSI-MS and spectral analysis by using PCA as an unsupervised technique and SPA-LDA as a supervised technique reach 100% sensitivity and specificity for both classes of individuals with FM and without the disease, this has promising potential as an analytical approach with good clinical outcomes. This combination of techniques can be used as a simple, reliable and fast tool in making clinical diagnosis of Fibromyalgia. In addition to these considerations, the detection of metabolomics is also a positive factor in the study, proposing the association of this information with possible biological markers for the disease.

Methods
Sample. This  Conventional PSI-MS. For each selected plasma sample, a small 0.05 mL aliquot was removed and applied to triangular paper (Whatman grade 1, GE healthcare, USA, with 1.5 cm sides) left at room temperature (25 °C) until drying. Triangular papers containing small aliquots of blood were positioned in front of the mass spectrometer (at a distance of 4 mm between the tip of the paper and the entrance of the mass spectrometer). The dry paper was held by a metal clip connected to the voltage source of the mass spectrometer, with the tip of the paper at a distance of approximately 5 mm. 10 µL of methanol (0.1% formic acid v/v) was applied to the paper to form the electrospray for MS analysis. Analyzes were performed in triplicate measures.
Instrumental parameters. Mass spectra were obtained using a Thermo Scientific LTQ-XL Linear Ion Trap spectrometer. The optimized parameters were as follows: negative ionization mode; capillary temperature 275 °C; 15 V capillary voltage; 4 kV spray voltage; tube lens 50 V. Mass spectra were acquired using Thermo Tune plus software and processed for chemometric analyzes using Xcalibur Analysis package software (version 2.0, Service release 2, Thermo Electron Corporation).
Computational analysis. Spectral data were processed using MATLAB R2014b software (MathWorks Inc., Natick, USA) with PLS Toolbox version 7.8 (Eigenvector Research Inc., Wenatchee, USA). The set of spectral samples was subjected to pre-processing with weighted automatic least squares baseline correction and vector normalization. Spectral samples were divided into training (70%) and test (30%) groups using the Kennard-Stone sample selection algorithm. In this division, the set of training samples is used in building the model, while the set of testing samples are used in predicting the model's performance. The description of the classification methods used in this study can be found in the Supplementary Information (S2).