Machine learning applied to near-infrared spectra for clinical pleural effusion classification

Lung cancer patients with malignant pleural effusions (MPE) have a particularly poor prognosis, so it is crucial to distinguish MPE from benign pleural effusion (BPE). The present study aims to develop a rapid, convenient and economical diagnostic method based on Fourier transform near-infrared spectroscopy (NIRS) combined with a machine learning strategy for clinical pleural effusion classification. NIR spectra were recorded for 47 MPE samples and 35 BPE samples. The sample data were randomly divided into a train set (n = 62) and a test set (n = 20). Partial least squares, random forest, support vector machine (SVM), and gradient boosting machine models were trained, and their predictive performance was subsequently evaluated on the test set. In addition to models built on the whole spectra, models built on features selected with the SVM recursive feature elimination algorithm were also investigated. Among these models, NIRS combined with SVM showed the best predictive performance (accuracy: 1.0, kappa: 1.0, and AUCROC: 1.0). SVM with the top 50 feature wavenumbers also displayed high predictive performance (accuracy: 0.95, kappa: 0.89, AUCROC: 0.99). Our study showed that the combination of NIRS and machine learning is an innovative, rapid, and convenient method for clinical pleural effusion classification that is worth further evaluation.

www.nature.com/scientificreports/
tool for disease diagnosis, including cancer diagnosis, due to its ability to reflect changes in molecular composition by identifying different bond vibrations in functional groups [17][18][19][20]. The variations in metabolites between MPE and BPE have been revealed by past metabonomic studies, which indicated increased amounts of valine, lactate, alanine, lipids, and free fatty acids (FFAs) (16:0, 18:0, and 18:1) along with decreased amounts of acetoacetate, creatinine, β-glucose, and α-glucose in MPE 1,8,21. In addition, our previous metabolomics results revealed that the composition of metabolites, such as lipids and oxidized polyunsaturated fatty acids, varies between MPE and BPE. Therefore, NIRS might be able to distinguish the differences between the chemical compositions of MPE and BPE, and contribute to a novel diagnostic method with higher sensitivity.
In the present study, a total of 82 pleural effusion samples were analyzed, including 47 MPE samples from patients diagnosed with lung adenocarcinoma and 35 BPE samples from patients diagnosed with tuberculosis or tuberculous pleurisy. NIRS technology combined with machine learning approaches, including partial least squares (PLS), random forest (RF), support vector machine (SVM), and gradient boosting machine (GBM) models, was used to screen for near-infrared spectral characteristics that differ between MPE and BPE samples.

Materials and methods
Pleural effusion samples. A total of 82 pleural effusion samples were obtained from the biobank of Zhejiang Cancer Hospital in Hangzhou, China. MPE samples were collected from 47 patients diagnosed with lung adenocarcinoma complicated with pleural metastases. BPE samples were collected from 35 patients diagnosed with pulmonary tuberculosis and/or tuberculous pleurisy. Informed consent was obtained from all individual participants included in the study, and our study was approved by the Ethics Committee of Zhejiang Cancer Hospital. All methods were performed in accordance with the relevant guidelines and regulations. Diagnoses were based on cytological or histological examination for MPE and on bacterial culture for tuberculosis. All the pleural effusion samples were spun at 1600 × g for 10 min at 4 °C, and aliquots of the supernatant were stored at −80 °C until analysis. Basic information on the patients, including gender, age, and pathological information, was collected (Table 1).
NIRS analysis and spectra collection. The frozen samples were thawed to room temperature before analysis. The NIR spectra of the pleural effusions were collected using an Antaris™ II FT-NIR analyzer (Thermo Nicolet, USA) with air as a reference. A quartz colorimetric tube with an optical path of 2 mm was used as the sample cup. Each spectrum was obtained from 32 successive scans from 4000 to 10,000 cm⁻¹ with a spectral resolution of 4 cm⁻¹. Spectra were recorded as absorbance. Each sample was analyzed in triplicate, and the average spectrum was calculated with TQ Analyst 8.0 data processing software.

Data analysis.
Random split of the data into train set and test set. The cohort was randomly split into a train set of 62 cases (36 MPE and 26 BPE) and a test set of 20 cases (11 MPE and 9 BPE) using the "sample" function in R.
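The random split can be illustrated with a short Python sketch. This is not the authors' R code: the function name `random_split` and the fixed seed are assumptions for illustration, and the split is a plain random draw of 62 of the 82 indices, so the per-class counts will generally differ from the 36/26 and 11/9 reported above.

```python
import random

def random_split(labels, n_train, seed=0):
    """Randomly pick n_train indices for the train set; the rest form the test set."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    indices = list(range(len(labels)))
    train_idx = sorted(rng.sample(indices, n_train))
    in_train = set(train_idx)
    test_idx = [i for i in indices if i not in in_train]
    return train_idx, test_idx

# 47 MPE and 35 BPE samples, matching the cohort sizes in the text
labels = ["MPE"] * 47 + ["BPE"] * 35
train_idx, test_idx = random_split(labels, n_train=62)
print(len(train_idx), len(test_idx))  # prints: 62 20
```

Keeping the seed fixed makes the partition repeatable, which matters when several models are compared on the same test set.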
Preprocessing. Spectra in the train set were mean-centered, scaled to unit variance, smoothed using a Savitzky-Golay filter, and reduced in dimensionality by principal component analysis (PCA) via generation of audit data summarizing discrete variables. Spectra in the test set were preprocessed with the same methods, using the parameter values obtained from the train set.
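The key point of this step is that the scaling parameters are estimated on the train set only and then reused for the test set. The sketch below shows that idea for mean-centering and unit-variance scaling; the helper names `fit_scaler` and `apply_scaler` and the toy absorbance values are hypothetical, and the smoothing and PCA steps are omitted.

```python
from statistics import mean, stdev

def fit_scaler(train_spectra):
    """Compute per-wavenumber mean and standard deviation from the train set only."""
    columns = list(zip(*train_spectra))
    means = [mean(col) for col in columns]
    sds = [stdev(col) or 1.0 for col in columns]  # fall back to 1.0 if a column is constant
    return means, sds

def apply_scaler(spectra, means, sds):
    """Center and scale spectra with parameters that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in spectra]

# Toy absorbance values at two wavenumbers (made up for illustration)
train = [[0.10, 1.20], [0.30, 1.00], [0.20, 1.40]]
test = [[0.40, 1.20]]

means, sds = fit_scaler(train)                  # parameters come from the train set
train_scaled = apply_scaler(train, means, sds)
test_scaled = apply_scaler(test, means, sds)    # test set reuses the train parameters
```

Fitting the scaler on the combined data would leak information from the test set into the model, which is why the two steps are kept separate.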
Model training and testing. For PLS, RF, and GBM, model training and parameter tuning were conducted with the caret R package, using fivefold cross-validation repeated 10 times. For SVM, model training was performed using the e1071 R package with fivefold cross-validation. The model with the highest accuracy was selected as the optimal model. The running time of each model was measured on the following CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60 GHz.
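To make the resampling scheme concrete, the sketch below generates the index partitions for fivefold cross-validation repeated 10 times on 62 train samples. It only illustrates the splitting logic and is not how caret implements it internally; the generator name `repeated_kfold` and the seed are assumptions.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, val_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                        # a fresh random partition per repeat
        folds = [order[f::k] for f in range(k)]   # k roughly equal-sized folds
        for f, val_idx in enumerate(folds):
            held_out = set(val_idx)
            train_idx = [i for i in order if i not in held_out]
            yield r, f, sorted(train_idx), sorted(val_idx)

splits = list(repeated_kfold(n_samples=62))
print(len(splits))  # prints: 50 (5 folds x 10 repeats)
```

Each candidate parameter setting would be scored on all 50 validation folds and the mean accuracy used for model selection.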
Feature wavenumber selection with the SVM-RFE algorithm. The SVM-RFE algorithm was used to rank the wavenumbers in the train set. Briefly, the algorithm proceeds as follows: (1) train the SVM model; (2) compute the weight vector; (3) rank the variables from minimum to maximum by squared weight; (4) update the feature ranking list; (5) eliminate the feature with the smallest squared weight, and repeat from step 1 until all the features are ranked. To optimize the size of the feature subset, a series of subsets with different numbers of wavenumbers (from the top 1 to the total number) were evaluated for their predictive performance.
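The five steps above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the linear SVM is fitted here by batch subgradient descent on the regularised hinge loss without a bias term, and the hyperparameters (`lam`, `lr`, `epochs`), function names, and toy data are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a weight vector by batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def svm_rfe(X, y):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)            # steps 1-2: train, get weight vector
        scores = [wj * wj for wj in w]            # step 3: squared weights
        worst = min(range(len(scores)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst))   # steps 4-5: record and eliminate
    return eliminated[::-1]                       # last eliminated = most important

# Toy data: feature 0 separates the classes, feature 1 is weak, feature 2 is constant
X = [[2.0, 0.3, 0.0], [1.5, -0.2, 0.0], [-1.8, 0.1, 0.0], [-2.2, -0.4, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y)
print(ranking[0])  # prints: 0 (the separating feature ranks first)
```

With thousands of wavenumbers, eliminating one feature per iteration is costly; removing several features per round is a common speed-up, though the text does not say whether it was used here.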

Results
NIR spectral analysis. Plots of the raw NIR spectra of the 82 pleural effusions, of the MPE and BPE groups, and of their average spectra are shown in Fig. 1. Because the spectral peaks are broad and overlapping, there was no apparent difference between MPE and BPE samples in the raw spectra, and direct interpretation is nearly impossible. However, although there were no characteristic peaks, the NIR spectra still contain a great deal of information about the chemical composition of pleural effusion. Four regions refer to different chemical substructures: wavenumbers between 4200 and 5500 cm⁻¹ indicate the CH, OH and NH stretch/CH deformations in the phenyl; wavenumbers between 5400 and 6100 cm⁻¹ refer to the first overtone of CH; wavenumbers of 6200 to 7600 cm⁻¹ indicate the first overtone of OH, NH, and CH; and wavenumbers of 7900 to 9000 cm⁻¹ indicate the second overtone of CH 17,22.
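As a small illustration of how these band assignments could be applied to individual wavenumbers, the sketch below encodes the four regions as a lookup table. The region boundaries and labels are copied from the text; the helper name `assign_regions` is hypothetical. Because the first two regions overlap between 5400 and 5500 cm⁻¹, the function returns every matching region rather than a single one.

```python
# Band assignments as listed in the text (wavenumber ranges in cm^-1)
REGIONS = [
    (4200, 5500, "CH, OH and NH stretch / CH deformations"),
    (5400, 6100, "first overtone of CH"),
    (6200, 7600, "first overtone of OH, NH and CH"),
    (7900, 9000, "second overtone of CH"),
]

def assign_regions(wavenumber):
    """Return every listed region whose range contains the given wavenumber."""
    return [label for low, high, label in REGIONS if low <= wavenumber <= high]

print(assign_regions(5700))  # ['first overtone of CH']
print(assign_regions(5450))  # two labels: the first two regions overlap here
print(assign_regions(7800))  # [] (falls between the listed regions)
```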

Principal component analysis. Principal component analysis (PCA), an unsupervised method, was performed to check the extent of clustering of the samples and to investigate potential NIR features for differentiating between the MPE and BPE classes. Figure 2 shows a scatter plot of the first two principal components (PCs), which account for about 94.7% of the total variation. However, there was no clear separation between MPE and BPE samples, which indicated that the structure of the data might be complicated and nonlinear, and therefore ill-suited to an unsupervised model.
For the GBM model, the final optimized model had the following parameters: n.trees = 50, interaction.depth = 1, shrinkage = 0.1, and n.minobsinnode = 10. The running time was 5.57 s. The predictive accuracy, kappa, and AUCROC were 0.95, 0.9, and 0.99, respectively (Figs. 3G,H, 4D; Table 2).
Among the four models, the performance of PLS was unsatisfactory. RF and GBM exhibited relatively high accuracy and kappa values in both the train and test sets, but required longer computation times. In contrast, SVM was the fastest model to compute and displayed the best predictive performance on the test set. Therefore, SVM was considered the best model for pleural effusion classification in this study. More detailed model performance parameters are given in Table 2.
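To show how accuracy and kappa relate on a test set of this size, the following sketch computes both from a 2×2 confusion matrix. The confusion matrix is hypothetical: the text reports only summary values, so a single misclassified MPE sample among the 20 test cases (11 MPE, 9 BPE) is assumed here. That assumption reproduces an accuracy of 0.95 and a kappa of 0.9.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Accuracy and Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the actual (row) and predicted (column) totals
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Assumed outcome: 11 MPE and 9 BPE test samples, one MPE predicted as BPE
acc, kappa = accuracy_and_kappa(tp=10, fn=1, fp=0, tn=9)
print(round(acc, 2), round(kappa, 2))  # prints: 0.95 0.9
```

If the single error were instead a BPE sample predicted as MPE, the accuracy would stay at 0.95 while kappa would be slightly lower (0.44/0.49, about 0.898), which shows that kappa depends on which class is misclassified.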

Wavenumber selection. After the wavenumbers were ranked by the SVM-RFE algorithm, SVM models with feature subsets of different sizes were tested (from the top 1 to all the features). The results showed that the predic-

Discussion and conclusion
Our study applied several machine learning approaches to NIRS analysis to classify malignant and benign pleural effusion samples, through which a rapid, convenient and accurate diagnostic method was successfully developed.
The diagnostic performance of NIRS has been investigated in past studies. For example, Chen et al. established a NIRS-based method to distinguish between normal and malignant colorectal tissues 17. However, to the best of our knowledge, our study is the first to apply NIRS to the classification of pleural effusion. MPE usually indicates advanced cancer, which contributes to a unique cancerous microenvironment that differs significantly from the surrounding healthy tissues and features variations in metabolites, including proteins and lipids 1,8,10,21. Therefore, NIRS can be used to distinguish the variation of chemicals in samples [17][18][19][20].
Although the NIR spectra of malignant and benign samples overlapped to a great extent, the additional application of machine learning aided the separation of malignant and benign samples, and some spectral regions of high diagnostic value were detected. According to our previous metabolomics results using the same samples, malignant pleural effusion differs from benign samples in metabolites such as acylcarnitines, oxidized polyunsaturated fatty acids (PUFAs), and ether lipids 23. In line with this, the top 50 diagnostic wavenumbers detected by SVM-RFE denoted functional groups including CH, CH2, CH3, NH, and free and bound OH. The spectral intervals of CH2 and CH3 arising from first-overtone stretching vibrations at 5577 to 5889 cm⁻¹, and those of OH stretching vibrations at 7077 to 7093 cm⁻¹ and at 9977 to 9981 cm⁻¹, denoted the change in ether lipids. In addition, the CH group of combined vibrations of the second overtone, detected at 7227 to 7247 cm⁻¹, together with the aforementioned OH groups, explained the presence of oxidized PUFAs. Acylcarnitines can also be accounted for by the detected NH (at 6306 to 6618 cm⁻¹), CH, and OH groups.
Compared with traditional diagnostic methods, such as cytological or histological examination, our method is simpler and more convenient, since the supernatant of the pleural effusion sample is all that is needed. Hence our method could be a supplementary tool when there are difficulties in collecting malignant cells or tissues. In addition, compared with high-throughput methods such as metabolomics, our NIRS method is more economical, less time- and labor-consuming, and needs no additional sample preparation. Altogether, our NIRS method is worth developing further for clinical application.
At present, NIRS-SVM is considered the best model for pleural effusion classification. The SVM algorithm is fast and has high predictive performance for pleural effusion classification. Compared with models using the whole spectra, SVM with the top 50 features is less complex and more stable in application, and is worth further investigation.
The major limitation of our study is that the cohort size was relatively small. In addition, the types of pleural effusions were too limited. A larger cohort with more types of pleural effusions could be studied in the future. In conclusion, our study suggests that NIRS could be a helpful tool in the classification of pleural effusion, with the advantages of high speed and accuracy, which might improve the current clinical diagnostic methods for MPE.