Application of variable selection in the origin discrimination of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb. based on near infrared spectroscopy

Dried sclerotium of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb. is a traditional Chinese medicine. Its chemical components showed difference among geographical origins, which made it difficult to keep therapeutic potency consistent. The identification of the geographical origin of W. cocos is the fundamental prerequisite for its worldwide recognition and acceptance. Four variable selection methods were employed for near infrared spectroscopy (NIR) variable selection and the characteristic variables were screened for the establishment of Fisher function models in further identification of the origin of W. cocos from Yunnan, China. For the obvious differences between poriae cutis (fu-ling-pi in Chinese, or FLP) and the inner part (bai-fu-ling in Chinese, or BFL) of the sclerotia of W. cocos in the pattern space of principal component analysis (PCA), we established discriminant models for FLP and BFL separately. Through variable selection, the models were significant improved and also the models were simplified by using only a small part of the variables. The characteristic variables were screened (13 for BFL and 10 for FLP) to build Fisher discriminant function models and the validation results showed the models were reliable and effective. Additionally, the characteristic variables were interpreted.

Dried sclerotia of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb. is a well-known traditional Chinese medicine, which is a fungal species parasitizing the roots of pine trees 1 . Traditionally, it is used in many prescriptions for inducing diuresis, invigorating the spleen, excreting dampness and tranquilizing the mind. However, poriae cutis (fu-ling-pi in Chinese, or FLP) and the inner part (bai-fu-ling in Chinese, or BFL) of the sclerotia of W. cocos have different therapeutic efficacy. FLP is reported to have only diuretic activity, while BFL has an invigorating activity in addition to diuretic and sedative effects 2 . Modern phytochemical and pharmacological investigations have shown that triterpenes and polysaccharides are the two main kinds of secondary metabolites found in W. cocos, which are responsible for its functions of anti-tumor, anti-oxidant, anti-rejection, antibacterial, anti-inflammatory, anti-hyperglycemic, nematicidal, etc 3 . The previous studies found that the contents of triterpenoid and polysaccharide in W. cocos from different origins were different 4,5 . The difference in chemical components of W. cocos in different geographical origins makes it difficult to keep therapeutic potency consistent. The identification of the geographical origin of W. cocos is the fundamental prerequisite for its worldwide recognition and acceptance.
In China, the poria produced in Yunnan is reputable as Yunnan poria (Yun-ling in Chinese) for its geoherbalism. Yunnan locates in southwest China and is influenced by a low latitude plateau, mountainous country monsoon climate 6 . There are seven climatic zones in Yunnan from the north temperate zone to north tropic zone, and climatic zones distribute according to the elevation 7 . The complex climate condition influences the quality of W. cocos. It was reported that the infrared spectra of W. cocos peels from different producing areas (Hubei, Anhui and Yunnan provinces) revealed obvious regional differences, and for the large geographical span, the component contents in samples from Yunnan were different at a certain extent 8 . Based on ultra performance liquid Principal Component Analysis. In order to remove the redundant information produced by hi gh-frequency line noise and retain the useful information in the low-frequency region, we applied the spectrum standard deviation (SDD) method to filter the original spectra by TQ 9.2 39 . The wave band 7501.74 cm −1 -4088.35 cm −1 (886 wavelength points) was preliminary selected (as shown in Fig. 1). Then we analyzed W. cocos by principal component analysis (PCA). In Fig. 2, we could find that in the pattern space of PCA, BFL and FLP were completely separated. The result indicated the inner chemical compositions of the two parts were different. In view of this, we established the discriminant models of BFL and FLP separately.
We analyzed BFL and FLP by PCA, respectively. The results were shown in Supplementary Table S1. According to Kaiser Criterion, only factors with eigenvalues greater than or equal to one will be accepted as possible sources of variance in the data 40 . The first five factors that accounted for spectrum cumulative 97.858% of BFL and 97.203% of FLP were selected for the next analysis.

Abnormal Samples Diagnosis.
In the course of spectrum information (X) collection and index (Y) measurement, the data (X or Y) might deviate along with the abnormal fluctuation of instrument. The outlier samples could interfere with the discrimination model seriously. Through modular group iterative singular samples diagnosis method, the BFL and FLP were analyzed by Matlab R2010a analysis software. In order to establish steady discriminant model, the exceptional spectra including the number of samples 43 of BFL, 3, 33 and 35 of FLP were removed (see Supplementary Fig. S1). Classification of Training Set and Validation Set. According to K-S method 41,42 , the samples were divided into the training and validation sets of BFL and FLP by the proportion of 2:1, respectively. The training and validation sets of BFL contained 40 and 19 samples, and those of FLP had 39 and 18 samples, respectively. Each set included the samples of all the five regions. The training set was used for variable selection and modeling, and the independent validation set was used for validation of the model.

Variable Selection based on CARS.
The preliminary selected dataset 7501.74 cm −1 -4088.35 cm −1 (886 wavelength points) was intended for investigating the ability for CARS to select key variables by eliminating the redundant information. One hundred replicate running of CARS was executed and the root mean square error of cross validation (RMSECV) values were recorded.
By 10-flod cross validation, the optimal number of PCA was five. The statistics of frequency of each selected wave number of spectrum was implemented. The number of Monte Carlo iterations was set to 50. In each iteration, 80% samples from the training sets were randomly chosen to build a PLS-DA model. The optimized number of variables was confirmed with the lowest RMSECV value. Only a small part of the wavelengths could be selected by CARS. According to the lowest RMSECV values, twenty key variables of FLP (RMSECV = 1.6202) were screened, and forty significant variables of BFL (RMSECV = 1.6767) were selected. Compared with preliminary selected variables (886 wavelength points), the optimized number of variables by CARS was reduced significantly (see Supplementary Fig. S2).
Variable Selection based on MC-UVE. Five hundred replicate running of MC-UVE was executed and the RMSECV values were recorded. Ten-fold cross validation and five principal factors of PLS-DA model were used in this study to explore its prediction performance. Reliability index (RI), defined as the ratio of the mean to the standard deviation of this distribution, was used to assess the reliability of each variable. Based on this reliability, all variables were ranked. Then, these variables were sequentially added to build a PLS-DA model whose performance was assessed by cross validation. The RI corresponding to the variable whose addition results in the minimum RMSECV value was chosen as the threshold. The variables that were related with a RI lower than the threshold value could be removed 35 .
The analysis result showed the variables with the RI values greater than 2.5107 were selected using ten-fold cross validation for BFL, and 95 variables were selected when the minimum ten-fold RMSECV was 1.5601. For FLP, 35 variables with the RI values greater than 2.1589 were selected using ten-fold cross validation as the minimum ten-fold RMSECV was 1.5852 (see Supplementary Fig. S3).
Variable Selection based on SPA. The three parameters of SPA were set to N = 1000 (N, the number of Monte Carlo Simulation), R = 0.8 (R, the ratio of samples to be selected in each Monte Carlo sampling), Q = 10 (Q, the number of variables to be sampled in each Monte Carlo Simulation). 10-flod cross validation and five number of PCA were used in this study to explore its prediction performance. The variable importance assessed by conditional synergetic score (COSS) value was calculated (COSS = − log 10 (P)). RMSECV values were recorded, and the corresponding minimum RMSECV value was chosen as the optimized number of variables. The more significant a variable was, the higher the score it got. Particularly, the variables with COSS values greater than 2 were selected. As the minimum RMSECV value was 1.6235, 90 informative variables of BFL were selected for further analysis. For FLP, as the minimum RMSECV value was 1.6428, 30 informative variables were selected (see Supplementary Fig. S4).
Variable Selection based on LPG. LPG 36 was adopted in wavelength selection for NIR spectral analysis.
The method calculated an LPG (score plot) by performing PCA on the NIR spectral data matrix (7501.74 cm −1 -4088.35 cm −1 ), and then detected the non-collinear variables from the LPG. According to the results of PCA in Supplementary Table S1, the first two principal components were used for LPG. In the end, both BFL and FLP, 129 variables were selected by LPG (see Supplementary Figs S5 and S6).

Evaluation of the Selected Variables.
For further analysis the reliability of CARS, MU-UVE, SPA and LPG methods, PLS-DA models of BFL and FLP were established by SIMCA-P 11.0 software. The performance of models was assessed by determination coefficient (R 2 ), RMSECV and root mean square error of prediction (RMSEP). Generally, a good model should have high value of R 2 and low value of RMSECV 43 . According to Galtier discriminant criterion, the ability of classification was assessed by prediction sets, and values of prediction and deviation (Y pre and Y dev ) were examined. When Y pre > 0.5 and Y dev < 0.5, the prediction samples belonged to a certain kind of training set; Y pre < 0.5 and Y dev < 0.5, the prediction samples did not belong to a certain kind of training set; Y dev > 0.5 and 0.45 < Y dev < 0.5, the prediction samples were suspicious, because they were very close to the threshold 0.5. The 0.45 and 0.55 limits have been chosen because they express 10% of error in the results 44,45 .
Tables 1 and 2 summarized the prediction results of the PLS-DA models performed on the extraction of NIR spectra by the different variables selection methods. Compared with the preliminary variables (7501.74 cm −1 -4088.35 cm −1 , 886 variables), through different variable selection methods (CARS, MC-UVE, SPA and LPG), the number of the selected variables were decreased. Simultaneously, the parameters for assessing the PLS-DA models were improved. The values of accuracy and R 2 increased, RMSECV and RMSEP reduced.
For BFL, the prediction accuracy values of the PLS-DA models performed on the extraction of NIR spectra by the four methods all reached 100%. The sequence of R 2 was CARS > SPA > LPG > MC-UVE, while they were in the exact opposite sequences for RMSECV and RMSEP as CARS < SPA < LPG < MC-UVE. All the four methods showed satisfactory prediction performance for BFL.
For FLP, the highest prediction accuracy values reached 100% in the PLS-DA models performed on the extraction of NIR spectra by MC-UVE and LPG methods, while 94.44% for CARS and SPA methods. The sequence of R 2 was LPG > MC-UVE > CARS > SPA. The values of RMSECV and RMSEP were in the opposite sequence LPG < MC-UVE < CARS < SPA. The results of MC-UVE and LPG were better than CARS and SPA for BFL.
The prediction results of the models were significant improved when conducting variable selection, and also the models were simplified by using only a small part of the variables. The results experimentally proved the necessity to perform variable selection before building a calibration model. PLS-DA was performed based on the results of PCA of 56 common variables of BFL. From Fig. 3a, we found that the first two principal components cumulatively accounted for 64.9% of the variation. It was visible that BFL were separated into five groups. The loading scatter plot (Fig. 3b)    Simultaneously, PLS-DA was conducted for 21 common variables of FLP. In Fig. 4a, the first two principal components cumulatively accounted for 68.0% of the variation. The first principal component explained 38.8% of the total variance and the second principal component explained 29.2% of that. FLP samples were distinctly separated into five groups. Visually analyzed the loading scatter plot (Fig. 4b), we found the variables such as 4508.75, 4952.30, 5230.00, 5233.86, 5303.28, 5634.98, 5685.12, 5874.11 and 5928.11 cm −1 made a significant contribution to the discrimination. The biplot (Fig. 4c)

Establish of Discriminant Analysis Function.
To identify and analyze the unknown samples, the Fisher discriminant function model was established. Through stepwise regression method, the common variables which made a greater contribution to classification were further screened. As a result, thirteen variables including 4092. 21 were selected for FLP. Nine of them were recognized in the discussion of PLS-DA. The results of stepwise regression were in accordance with PLS-DA, which proved that those variables could be seen as the characteristic identification marks of W. cocos.
In the process of Fisher discriminant analysis, the thirteen variables of BFL and ten variables of FLP were used as discriminant variables respectively, and the different BFL and FLP samples were performed as the subjects of the study to establish Fisher discriminant functions. The function of BFL was shown as follow and the coefficients were in Table 3: where X i was the corresponding variables, Y i was the corresponding class. The function of FLP was shown as follow and the coefficients were in Table 4: T  B T  B T  B T  B T  B T  B T  B T  B T  B T   0  1 1  2 2  3 3  4 4  5 5  6 6  7 7  8 8  9 9  1 0 10 where T i was the corresponding variables, Y i was the corresponding class. The Fisher discriminant analysis results were shown in Figs 3d and 4d. The effect of discrimination model was evaluated by cross validation. As seen in the two figures, the ungrouped prediction samples located in different classes. The class of the ungrouped samples could be identified according to the distance from each sample to the centroids of all classes. The validation results were shown in Tables 5 and 6. The original grouped samples 97.50% for BFL and 97.43% for FLP were correctly classified. In the cross validation, the accuracy rates were 94.74% for BFL and 94.44% for FLP. In our previous study, the Fisher discriminant analysis functions built based on the wavelength selected only by the MC-UVE method, the original grouped samples 92.50% for BFL and 92.86% for FLP were correctly classified, and the accuracy rates were 80.95% for BFL and 83.33% for FLP in the cross validation 38 . The correct classification rates were significantly improved both in the original grouped samples and in the cross validation sets in this study. The validation results indicated that the Fisher discriminant function model established based on the characteristic variables selected simultaneously by the four methods CARS, MC-UVE, SPA and LPG could be seen as a reliable and effective method to discriminate BFL and FLP. Absorbance peaks at 4501.04 and 4508.75 cm −1 are the combination of asymmetric stretch of NH and NH 2 rocking in urea (NH 2 -C=O-NH 2 ). Absorbance peak at 4597.46 cm −1 is due to CONH 2 as combination of amide B and amide II modes. The wavelength at 4612.89 cm −1 is assigned to CONH 2 specifically due to the α-helix peptide structure. The absorption band at 5079.58 cm −1 is the combination of N-H stretching vibration and N-H bending in aromatic amine. Absorbance peak at 5866.40 cm −1 corresponds to C-H first overtone stretch vibration mode

Conclusions
In this work, we first systematically collected the near-infrared spectrum of cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos from different regions in Yunnan, China. Interestingly, we found that there were obvious differences between FLP and BFL in the pattern space of PCA. Based on this, we established discriminant models for FLP and BFL separately. Through four variable selection methods CARS, MC-UVE, SPA and LPG, the common variables were selected. Furthermore, the characteristic variables were screened to build Fisher discriminant function models, and the validation results showed the models were reliable and effective. The variable selection method used in NIR spectrum provided a new thought for the origin identification of traditional Chinese medicines. The spectrum difference between the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos provided theoretical basis in the spectrum level for the traditional usage of FLP and BFL separately.     Yunnan (10) and southeastern Yunnan (14). They were identified and authenticated by Professor H. Jin, Yunnan Academy of Agricultural Sciences. The specimens were preserved in the Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences. The samples were separated into FLP and BFL. After drying at room temperature, samples were ground to fine powder and stored in the zip lock bags for further analysis. The detailed sample information is listed in Supplementary Table S2. (Thermo Fisher Scientific INC., USA) was attached with diffuse reflection module. The spectrum collecting software Result TM 2.1 and the analysis software TQ 9.2 included in the instrument were employed. Traditional Chinese medicine grinder DFT-100 (Zhejiang wenling Linda machinery co., LTD) was applied. Stainless steel sieve tray 80 mesh (Tai'an of Chinese and western, Beijing) was used. The multivariate data analysis softwares were SIMCA-P 11.0 (Umetrics, Umea, Sweden), SPSS 19.0 (SPSS Inc., Chicago, USA) and MATLAB R2010a, and the code was derived from http://www. mathworks.cn/.

Instruments. Antaris II Fourier Transform Near Infrared Spectroscopy
Spectra Collection. The powder (20.0 g) was weighed before it was sufficiently mixed, then transferred to the sample cup of NIR and compressed. The parameters of collection were scanning (64 times), resolution (4 cm −1 ), scanning range (10000 cm −1 -4000 cm −1 ) and parallel collection (3 times). The NIR spectra of W. cocos were preprocessed with Norris, mean centering, standardization, and second derivative successively by software TQ 9.2. Through optimizing, the range 7501.74-4088.35 cm −1 was selected according to the spectrum standard deviation. The higher the spectra standard deviation was, the greater a contribution made to classification.