Introduction

The dried sclerotium of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb., a fungus that parasitizes the roots of pine trees, is a well-known traditional Chinese medicine1. Traditionally, it is used in many prescriptions for inducing diuresis, invigorating the spleen, excreting dampness and tranquilizing the mind. However, poriae cutis (fu-ling-pi in Chinese, or FLP) and the inner part (bai-fu-ling in Chinese, or BFL) of the sclerotia of W. cocos have different therapeutic efficacy. FLP is reported to have only diuretic activity, while BFL has an invigorating activity in addition to diuretic and sedative effects2. Modern phytochemical and pharmacological investigations have shown that triterpenes and polysaccharides are the two main kinds of secondary metabolites found in W. cocos, and that they are responsible for its anti-tumor, anti-oxidant, anti-rejection, antibacterial, anti-inflammatory, anti-hyperglycemic and nematicidal activities, among others3. Previous studies found that the contents of triterpenoids and polysaccharides in W. cocos differed among origins4,5. These differences in chemical composition among geographical origins make it difficult to keep therapeutic potency consistent. Identifying the geographical origin of W. cocos is therefore a fundamental prerequisite for its worldwide recognition and acceptance.

In China, the poria produced in Yunnan is reputed as Yunnan poria (Yun-ling in Chinese) for its geoherbalism. Yunnan is located in southwest China and has a low-latitude plateau, mountainous monsoon climate6. There are seven climatic zones in Yunnan, from the north temperate zone to the north tropical zone, and they are distributed according to elevation7. This complex climate influences the quality of W. cocos. It was reported that the infrared spectra of W. cocos peels from different producing areas (Hubei, Anhui and Yunnan provinces) revealed obvious regional differences and that, owing to the large geographical span, the component contents of samples from Yunnan also differed to a certain extent8. Ultra performance liquid chromatography-ultraviolet-mass spectrometry (UPLC-UV-MS) fingerprints likewise showed that habitat had an obvious effect on the quality of peeled and sliced poria9.

Near-infrared (NIR) spectroscopy, a fast and non-destructive technology, has been widely used to identify traditional Chinese medicinal materials10,11,12,13,14. The NIR spectrum reflects overtone and combination absorptions of the fundamental mid-IR vibrations of functional groups such as C-H, O-H and N-H. The NIR region (780 to 2500 nm, or 12000 to 4000 cm−1) is wide and its absorption bands overlap heavily, which makes NIR spectra extremely difficult to analyze with conventional methods15,16. Variable selection is therefore a critical step in the analysis of NIR datasets with thousands of variables17. In recent years, several variable selection methods for NIR have been developed, such as interval partial least-squares (iPLS)18,19, backward interval partial least-squares (biPLS)20, moving window partial least-squares regression (MWPLSR)21, genetic algorithm (GA)22,23,24, simulated annealing algorithm (SAA)25, competitive adaptive reweighted sampling (CARS)26,27,28, Monte Carlo uninformative variable elimination (MC-UVE)29,30,31,32,33, subwindow permutation analysis (SPA)34,35 and latent projective graph (LPG)36,37.

Previously, we used the MC-UVE method to screen the NIR spectral information of W. cocos38. Building on that work, in this study four variable selection methods (CARS, MC-UVE, SPA and LPG) were employed and compared for NIR variable selection. The common variables were taken from the selection results of the four methods. The characteristic variables were then screened from the common variables to establish Fisher function models for identifying the origin of W. cocos from Yunnan, China. The characteristic variables were also interpreted.

Results and Discussion

Stability of NIR

The resulting NIR .spc files were converted to .csv data files with the multivariate statistical analysis software SIMCA-P 11.0. The stability of 25 parallel collections of one sample was assessed by Hotelling's T2. The results showed that the parallel spectrum acquisitions possessed satisfactory stability, with coefficients of 4.26 and 7.82 at the 95% and 99% levels, respectively. The results indicated that NIR was a reliable method for discriminant analysis.

Principal Component Analysis

In order to remove the redundant information produced by high-frequency line noise and retain the useful information in the low-frequency region, we applied the spectrum standard deviation (SDD) method to filter the original spectra with the TQ 9.2 software39. The wave band 7501.74 cm−1 – 4088.35 cm−1 (886 wavelength points) was preliminarily selected (as shown in Fig. 1). Then we analyzed W. cocos by principal component analysis (PCA). In Fig. 2, BFL and FLP were completely separated in the pattern space of PCA. This result indicated that the inner chemical compositions of the two parts were different. In view of this, we established the discriminant models of BFL and FLP separately.

Figure 1

The original spectra of BFL and FLP. The red lines represent BFL samples, while the other colored lines represent FLP samples.

Figure 2

Principal component scores of BFL and FLP. The black triangles represent BFL samples, while the red squares correspond to FLP samples.

We then analyzed BFL and FLP separately by PCA. The results are shown in Supplementary Table S1. According to the Kaiser criterion, only factors with eigenvalues greater than or equal to one are accepted as possible sources of variance in the data40. The first five factors, which cumulatively accounted for 97.858% of the spectral variance of BFL and 97.203% of that of FLP, were selected for the next analysis.
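The Kaiser criterion described above can be illustrated with a minimal sketch. The eigenvalues below are hypothetical placeholders chosen for illustration; they are not the values reported in Supplementary Table S1, and the function name `kaiser_select` is our own.

```python
def kaiser_select(eigenvalues):
    """Keep factors with eigenvalue >= 1 and report their cumulative variance (%)."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ordered if ev >= 1.0]
    cumulative = 100.0 * sum(kept) / total
    return len(kept), cumulative


# Hypothetical eigenvalues (illustration only, not the paper's data).
eigenvalues = [5.2, 2.1, 1.4, 1.1, 1.0, 0.15, 0.05]
n_factors, cum_var = kaiser_select(eigenvalues)
print(n_factors, round(cum_var, 2))
```

With these placeholder values, five factors pass the threshold, which mirrors the number of factors retained in the text but says nothing about the actual eigenvalues.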

Abnormal Samples Diagnosis

During the collection of spectral information (X) and the measurement of indices (Y), the data (X or Y) may deviate because of abnormal fluctuations of the instrument. Such outlier samples can seriously interfere with the discrimination model. The BFL and FLP data were therefore analyzed in Matlab R2010a by the modular group iterative singular sample diagnosis method. To establish a stable discriminant model, the abnormal spectra (sample 43 of BFL and samples 3, 33 and 35 of FLP) were removed (see Supplementary Fig. S1).

Classification of Training Set and Validation Set

According to the K-S method41,42, the BFL and FLP samples were each divided into training and validation sets at a ratio of 2:1. The training and validation sets of BFL contained 40 and 19 samples, and those of FLP contained 39 and 18 samples, respectively. Each set included samples from all five regions. The training set was used for variable selection and modeling, and the independent validation set was used to validate the model.
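The splitting step can be sketched as follows, assuming that "K-S" refers to the Kennard-Stone algorithm (the text does not spell the name out). The sketch uses Euclidean distance on toy two-dimensional points; the data, the function name `kennard_stone` and the set sizes are illustrative assumptions, not the authors' implementation.

```python
import math


def kennard_stone(samples, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Add the sample whose nearest already-selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy points standing in for spectra (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.5, 0.5), (4.0, 4.5)]
train_idx = kennard_stone(points, n_train=4)  # roughly a 2:1 split of six samples
valid_idx = [k for k in range(len(points)) if k not in train_idx]
print(train_idx, valid_idx)
```

The rule spreads the training samples over the data space, so the validation samples tend to lie inside the region covered by the training set.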

Variable Selection based on CARS

The preliminarily selected dataset 7501.74 cm−1–4088.35 cm−1 (886 wavelength points) was used to investigate the ability of CARS to select key variables by eliminating redundant information. One hundred replicate runs of CARS were executed and the root mean square error of cross validation (RMSECV) values were recorded.

By 10-fold cross validation, the optimal number of principal components was five. The frequency with which each wavenumber was selected was then counted. The number of Monte Carlo iterations was set to 50. In each iteration, 80% of the samples from the training set were randomly chosen to build a PLS-DA model. The optimized number of variables was the one with the lowest RMSECV value. Only a small part of the wavelengths was selected by CARS. According to the lowest RMSECV values, twenty key variables of FLP (RMSECV = 1.6202) and forty significant variables of BFL (RMSECV = 1.6767) were selected. Compared with the preliminarily selected variables (886 wavelength points), the number of variables was reduced significantly by CARS (see Supplementary Fig. S2).

Variable Selection based on MC-UVE

Five hundred replicate runs of MC-UVE were executed and the RMSECV values were recorded. Ten-fold cross validation and a PLS-DA model with five principal factors were used to explore its prediction performance. The reliability index (RI), defined as the ratio of the mean to the standard deviation of the distribution of each variable's regression coefficient over the Monte Carlo runs, was used to assess the reliability of each variable. All variables were ranked by this index. The variables were then added sequentially to build a PLS-DA model whose performance was assessed by cross validation. The RI of the variable whose addition resulted in the minimum RMSECV value was chosen as the threshold. Variables with an RI lower than the threshold were removed35.

For BFL, 95 variables with RI values greater than 2.5107 were selected, at which point the ten-fold RMSECV reached its minimum of 1.5601. For FLP, 35 variables with RI values greater than 2.1589 were selected, with a minimum ten-fold RMSECV of 1.5852 (see Supplementary Fig. S3).
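The reliability index can be sketched in a few lines. The coefficient matrix below is hypothetical (rows are Monte Carlo runs, columns are variables), the function names are our own, and taking the absolute value of the mean is an assumption made so that strongly negative coefficients also rank as reliable. Only the threshold 2.5107 is taken from the BFL result in the text.

```python
from statistics import mean, stdev


def reliability_index(coefficients):
    """RI_j = |mean(b_j)| / std(b_j) over Monte Carlo runs, for each variable j."""
    n_vars = len(coefficients[0])
    ri = []
    for j in range(n_vars):
        column = [run[j] for run in coefficients]
        ri.append(abs(mean(column)) / stdev(column))
    return ri


def select_by_threshold(ri, threshold):
    """Return the indices of variables whose RI exceeds the threshold."""
    return [j for j, value in enumerate(ri) if value > threshold]


# Hypothetical regression coefficients from four Monte Carlo runs, three variables.
runs = [
    [0.90, 0.10, -0.50],
    [1.10, -0.20, -0.45],
    [1.00, 0.15, -0.55],
    [0.95, -0.05, -0.50],
]
ri = reliability_index(runs)
selected = select_by_threshold(ri, threshold=2.5107)  # BFL threshold from the text
print(selected)
```

The second variable has coefficients that scatter around zero, so its RI is close to zero and it is treated as uninformative.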

Variable Selection based on SPA

The three parameters of SPA were set to N = 1000 (N, the number of Monte Carlo simulations), R = 0.8 (R, the ratio of samples selected in each Monte Carlo sampling) and Q = 10 (Q, the number of variables sampled in each Monte Carlo simulation). Ten-fold cross validation and five principal components were used to explore its prediction performance. The importance of each variable was assessed by its conditional synergetic score (COSS = −log10(P)). RMSECV values were recorded, and the number of variables giving the minimum RMSECV value was taken as the optimum. The more significant a variable was, the higher its score. In particular, the variables with COSS values greater than 2 were selected. For BFL, 90 informative variables were selected for further analysis at a minimum RMSECV of 1.6235. For FLP, 30 informative variables were selected at a minimum RMSECV of 1.6428 (see Supplementary Fig. S4).
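As a worked illustration of the score, COSS = −log10(P) exceeds 2 exactly when P is below 0.01. The sketch below applies that cut-off to hypothetical P values; the wavenumber keys are borrowed from the text only as labels, and the P values themselves are invented for illustration.

```python
import math


def coss(p_value):
    """Conditional synergetic score: COSS = -log10(P)."""
    return -math.log10(p_value)


def select_informative(p_values, min_coss=2.0):
    """Return the variables whose COSS exceeds min_coss (i.e. P < 10**-min_coss)."""
    return [name for name, p in p_values.items() if coss(p) > min_coss]


# Hypothetical P values keyed by wavenumber (cm^-1); not results from the paper.
p_values = {4092.21: 0.0004, 4308.19: 0.03, 5866.40: 0.009}
informative = select_informative(p_values)
print(informative)
```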

Variable Selection based on LPG

LPG36 was adopted for wavelength selection in the NIR spectral analysis. The method calculated an LPG (score plot) by performing PCA on the NIR spectral data matrix (7501.74 cm−1–4088.35 cm−1), and then detected the non-collinear variables from the LPG. According to the results of PCA in Supplementary Table S1, the first two principal components were used for LPG. In the end, 129 variables were selected by LPG for both BFL and FLP (see Supplementary Figs S5 and S6).

Evaluation of the Selected Variables

To further analyze the reliability of the CARS, MC-UVE, SPA and LPG methods, PLS-DA models of BFL and FLP were established with SIMCA-P 11.0 software. The performance of the models was assessed by the determination coefficient (R2), RMSECV and root mean square error of prediction (RMSEP). Generally, a good model should have a high value of R2 and a low value of RMSECV43. According to the Galtier discriminant criterion, the classification ability was assessed on the prediction sets by examining the values of prediction and deviation (Ypre and Ydev). When Ypre > 0.5 and Ydev < 0.5, the prediction sample belonged to a certain class of the training set; when Ypre < 0.5 and Ydev < 0.5, the prediction sample did not belong to that class; when Ydev > 0.5 or 0.45 < Ypre < 0.55, the prediction sample was suspicious, because in the latter case it was very close to the threshold of 0.5. The 0.45 and 0.55 limits were chosen because they express 10% of error in the results44,45.

Tables 1 and 2 summarize the prediction results of the PLS-DA models built on the NIR variables extracted by the different variable selection methods. Compared with the preliminary variables (7501.74 cm−1–4088.35 cm−1, 886 variables), the number of selected variables was decreased by each of the variable selection methods (CARS, MC-UVE, SPA and LPG). At the same time, the parameters for assessing the PLS-DA models improved: the values of accuracy and R2 increased, while RMSECV and RMSEP decreased.
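The decision rule can be written as a small function. This is a sketch under two assumptions of ours: the suspicious case is read as "Ydev > 0.5 or 0.45 < Ypre < 0.55" (consistent with the stated 0.45 and 0.55 limits), and the suspicious check takes precedence over the other two cases. The function name and the example values are hypothetical.

```python
def galtier_decision(y_pre, y_dev):
    """Classify one prediction sample for one class using Ypre and Ydev."""
    # Assumed precedence: flag uncertain predictions before deciding membership.
    if y_dev > 0.5 or 0.45 < y_pre < 0.55:
        return "suspicious"
    if y_pre > 0.5:
        return "member"
    return "non-member"


# Hypothetical (Ypre, Ydev) pairs for illustration.
examples = [(0.92, 0.10), (0.08, 0.12), (0.52, 0.10), (0.90, 0.60)]
decisions = [galtier_decision(y_pre, y_dev) for y_pre, y_dev in examples]
print(decisions)
```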

Table 1 Prediction results of PLS-DA models of BFL built by different variable selection methods.
Table 2 Prediction results of PLS-DA models of FLP built by different variable selection methods.

For BFL, the prediction accuracy of the PLS-DA models built on the variables selected by each of the four methods reached 100%. The order of R2 was CARS > SPA > LPG > MC-UVE, while RMSECV and RMSEP followed exactly the opposite order, CARS < SPA < LPG < MC-UVE. All four methods showed satisfactory prediction performance for BFL.

For FLP, the prediction accuracy reached 100% for the PLS-DA models built on the variables selected by the MC-UVE and LPG methods, and 94.44% for those built with the CARS and SPA methods. The order of R2 was LPG > MC-UVE > CARS > SPA, and the values of RMSECV and RMSEP followed the opposite order, LPG < MC-UVE < CARS < SPA. The results of MC-UVE and LPG were thus better than those of CARS and SPA for FLP.

The prediction results of the models were significantly improved by variable selection, and the models were also simplified by using only a small part of the variables. The results experimentally demonstrated the necessity of performing variable selection before building a calibration model.

Common Variables Analysis

Based on the variable selection results of the four methods, the variables which were selected more than twice were chosen as the common variables for further analysis. In total, 56 common variables of BFL and 21 common variables of FLP were chosen.
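The voting step can be sketched with a counter over the four selection results. The membership lists below are hypothetical, and the vote threshold is left as an explicit parameter `min_votes` because the wording "more than twice" could be read as at least two or at least three of the four methods; the text does not settle this.

```python
from collections import Counter


def common_variables(selections, min_votes):
    """Return variables chosen by at least min_votes of the selection methods."""
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # each method votes at most once per variable
    return sorted(v for v, n in votes.items() if n >= min_votes)


# Hypothetical selection results (wavenumbers in cm^-1), for illustration only.
selections = {
    "CARS": [4092.21, 4308.19, 5866.40],
    "MC-UVE": [4092.21, 4308.19, 4597.46],
    "SPA": [4092.21, 5866.40],
    "LPG": [4092.21, 4308.19, 7000.00],
}
at_least_two = common_variables(selections, min_votes=2)
at_least_three = common_variables(selections, min_votes=3)
print(at_least_two, at_least_three)
```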

PLS-DA was performed based on the results of PCA of the 56 common variables of BFL. From Fig. 3a, the first two principal components cumulatively accounted for 64.9% of the variation, and the BFL samples were visibly separated into five groups. The loading scatter plot (Fig. 3b) displayed the contribution of each variable to the discrimination. The further a variable lies from the origin of the X-axis and the Y-axis, the more it contributes to the classification46. Through visual analysis, the variables 4092.21, 4096.06, 4308.19, 4439.33, 4597.46, 5079.58 and 5866.40 cm−1 were preliminarily identified. The biplot provided a better understanding of the relationships between samples and variables in one plot (Fig. 3c). The biplot showed that the variable 5866.40 cm−1 was positively correlated with the samples in class 1 in the (+, −) quadrant. The variable 4597.46 cm−1 was positively correlated with the samples in classes 2, 3 and 4 in the (−, +) quadrant, and negatively correlated with those in class 1 in the (+, −) quadrant. The variables 4092.21, 4096.06, 4439.33 and 5079.58 cm−1 were positively correlated with the samples in classes 1, 2 and 3 in the (−, −) quadrant, and negatively correlated with those in class 5 in the (+, +) quadrant. The variable 4308.19 cm−1 was positively correlated with the samples in class 5 in the (+, +) quadrant. These variables were the most important markers for separating the BFL samples into the five classes.

Figure 3

Chemometric analysis of common variables of BFL. (a) PLS-DA scores scatter plot. (b) PLS-DA loading scatter plot. (c) PLS-DA loadings biplot. (d) Fisher discriminant analysis scatter plot.

Similarly, PLS-DA was conducted on the 21 common variables of FLP. In Fig. 4a, the first two principal components cumulatively accounted for 68.0% of the variation: the first principal component explained 38.8% of the total variance and the second explained 29.2%. The FLP samples were distinctly separated into five groups. From a visual analysis of the loading scatter plot (Fig. 4b), the variables 4508.75, 4952.30, 5230.00, 5233.86, 5303.28, 5634.98, 5685.12, 5874.11 and 5928.11 cm−1 made a significant contribution to the discrimination. The biplot (Fig. 4c) showed that the variables 5230.00 and 5233.86 cm−1 were positively correlated with the samples in class 1 in the (+, −) quadrant. The variables 4508.75 and 5303.28 cm−1 were positively correlated with the samples in classes 3 and 4 in the (−, +) quadrant, and negatively correlated with those in class 1 in the (+, −) quadrant. The variables 5634.98 and 5685.12 cm−1 were positively correlated with the samples in class 2 in the (−, −) quadrant, and negatively correlated with those in classes 1 and 5 in the (+, +) quadrant. The variables 4952.30, 5874.11 and 5928.11 cm−1 were positively correlated with the samples in classes 1 and 5 in the (+, +) quadrant. These variables were the most important markers for separating the FLP samples into the five classes.

Figure 4

Chemometric analysis of common variables of FLP. (a) PLS-DA scores scatter plot. (b) PLS-DA loading scatter plot. (c) PLS-DA loadings biplot. (d) Fisher discriminant analysis scatter plot.

Establishment of Discriminant Analysis Functions

To identify and analyze the unknown samples, the Fisher discriminant function model was established. Through stepwise regression method, the common variables which made a greater contribution to classification were further screened. As a result, thirteen variables including 4092.21, 4096.06, 4165.49, 4308.19, 4439.33, 4485.61, 4501.04, 4566.61, 4570.47, 4597.46, 4612.89, 5079.58 and 5866.40 cm−1 were selected for BFL. Seven of them were identified in the above discussion of PLS-DA. Ten variables including 4123.06, 4508.75, 4952.30, 5230.00, 5233.86, 5303.28, 5634.98, 5685.12, 5874.11 and 5928.11 cm−1 were selected for FLP. Nine of them were recognized in the discussion of PLS-DA. The results of stepwise regression were in accordance with PLS-DA, which proved that those variables could be seen as the characteristic identification marks of W. cocos.

In the Fisher discriminant analysis, the thirteen variables of BFL and the ten variables of FLP were used as discriminant variables, and the BFL and FLP samples served as the subjects for establishing the Fisher discriminant functions. The function of BFL is shown below, with its coefficients listed in Table 3:

$$\begin{array}{c}\begin{array}{rcl}{\rm{Y}} & = & -{{\rm{A}}}_{0}+{{\rm{A}}}_{1}{{\rm{X}}}_{1}-{{\rm{A}}}_{2}{{\rm{X}}}_{2}-{{\rm{A}}}_{3}{{\rm{X}}}_{3}+{{\rm{A}}}_{4}{{\rm{X}}}_{4}+{{\rm{A}}}_{5}{{\rm{X}}}_{5}+{{\rm{A}}}_{6}{{\rm{X}}}_{6}-{{\rm{A}}}_{7}{{\rm{X}}}_{7}\\ & & +{{\rm{A}}}_{8}{{\rm{X}}}_{8}-{{\rm{A}}}_{9}{{\rm{X}}}_{9}-{{\rm{A}}}_{10}{{\rm{X}}}_{10}+{{\rm{A}}}_{11}{{\rm{X}}}_{11}+{{\rm{A}}}_{12}{{\rm{X}}}_{12}+{{\rm{A}}}_{13}{{\rm{X}}}_{13}\end{array}\end{array}$$

where Xi (i = 1, ..., 13) are the corresponding characteristic variables, Ai are the coefficients, and Y is the function value for the corresponding class.

Table 3 The coefficients of Fisher functions of BFL.

The function of FLP is shown below, with its coefficients listed in Table 4:

$${\rm{Y}}=-{{\rm{B}}}_{0}-{{\rm{B}}}_{1}{{\rm{T}}}_{1}+{{\rm{B}}}_{2}{{\rm{T}}}_{2}+{{\rm{B}}}_{3}{{\rm{T}}}_{3}+{{\rm{B}}}_{4}{{\rm{T}}}_{4}-{{\rm{B}}}_{5}{{\rm{T}}}_{5}+{{\rm{B}}}_{6}{{\rm{T}}}_{6}-{{\rm{B}}}_{7}{{\rm{T}}}_{7}+{{\rm{B}}}_{8}{{\rm{T}}}_{8}+{{\rm{B}}}_{9}{{\rm{T}}}_{9}-{{\rm{B}}}_{10}{{\rm{T}}}_{10}$$

where Ti (i = 1, ..., 10) are the corresponding characteristic variables, Bi are the coefficients, and Y is the function value for the corresponding class.

Table 4 The coefficients of Fisher functions of FLP.

The Fisher discriminant analysis results are shown in Figs 3d and 4d. The discrimination models were evaluated by cross validation. As seen in the two figures, the ungrouped prediction samples were located in different classes. The class of an ungrouped sample could be identified according to its distance to the centroids of all classes. The validation results are shown in Tables 5 and 6. Of the original grouped samples, 97.50% for BFL and 97.43% for FLP were correctly classified. In the cross validation, the accuracy rates were 94.74% for BFL and 94.44% for FLP. In our previous study, in which the Fisher discriminant functions were built on wavelengths selected only by the MC-UVE method, 92.50% (BFL) and 92.86% (FLP) of the original grouped samples were correctly classified, and the cross-validation accuracy rates were 80.95% for BFL and 83.33% for FLP38. In this study, the correct classification rates were clearly improved both for the original grouped samples and for the cross-validation sets. The validation results indicated that the Fisher discriminant function model built on the characteristic variables selected jointly by the four methods (CARS, MC-UVE, SPA and LPG) could be seen as a reliable and effective way to discriminate BFL and FLP.
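The centroid-based assignment mentioned above can be sketched as follows. The sketch evaluates linear functions of the form constant + Σ coefficient × variable and assigns the sample to the class with the nearest centroid in the score space. All coefficients, centroids and sample values are hypothetical placeholders (they are not the entries of Tables 3 and 4), and only three variables and two functions are used to keep the example short.

```python
import math


def discriminant_scores(x, functions):
    """Evaluate each linear function: constant + sum(coef_i * x_i)."""
    return tuple(
        f["const"] + sum(c * xi for c, xi in zip(f["coefs"], x))
        for f in functions
    )


def assign_class(scores, centroids):
    """Return the label of the class centroid closest to the sample's scores."""
    return min(centroids, key=lambda label: math.dist(scores, centroids[label]))


# Hypothetical functions and class centroids (illustration only).
functions = [
    {"const": -1.0, "coefs": [2.0, -1.0, 0.5]},
    {"const": 0.5, "coefs": [-0.5, 1.5, 1.0]},
]
centroids = {"class 1": (2.0, -1.0), "class 2": (-1.5, 2.0), "class 3": (0.0, 0.0)}

sample = [1.5, 0.2, 0.4]  # hypothetical values of three characteristic variables
scores = discriminant_scores(sample, functions)
label = assign_class(scores, centroids)
print(scores, label)
```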

Table 5 The validation results of the Fisher discriminant analysis of BFL.
Table 6 The validation results of the Fisher discriminant analysis of FLP.

Interpretation of the Characteristic Variables

To further understand the significance of these characteristic variables, we interpreted their spectra-structure correlations. The bands at 4092.21, 4096.06, 4123.06, 4165.49, 4566.61 and 4570.47 cm−1 are related to the aryl C-H vibration of benzene. The absorption band at 4308.19 cm−1 is the combination of C-H stretch and CH2 deformation in polysaccharides. The band at 4439.33 cm−1 is the combination of O-H and C-O stretch in glucose. The band at 4485.61 cm−1 is assigned to second overtones of the symmetric and asymmetric bending vibrations of the CH2 of the uncoupled vinyl group. The absorbance peaks at 4501.04 and 4508.75 cm−1 are the combination of the asymmetric stretch of NH and NH2 rocking in urea (NH2-C=O-NH2). The absorbance peak at 4597.46 cm−1 is due to CONH2 as a combination of amide B and amide II modes. The band at 4612.89 cm−1 is assigned to CONH2, specifically due to the α-helix peptide structure. The absorption band at 5079.58 cm−1 is the combination of N-H stretching vibration and N-H bending in aromatic amine. The absorbance peak at 5866.40 cm−1 corresponds to the first overtone of the C-H stretching vibration in CH3. The absorption band at 4952.30 cm−1 is due to a combination of the OH stretch and CH bending. The bands at 5230.00, 5233.86 and 5303.28 cm−1 are hydroxyl bands. The peaks at 5634.98 and 5685.12 cm−1 are related to C-H in methylene. The band at 5874.11 cm−1 is assigned to C-H in methyl, while that at 5928.11 cm−1 is assigned to C-H in methyl associated with OH47. From these absorption peaks, we could speculate that the chemical compositions of BFL and FLP were different, which provides a theoretical basis at the spectral level for the separate traditional usage of the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos.

Conclusions

In this work, we first systematically collected the near-infrared spectra of the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos from different regions in Yunnan, China. Interestingly, we found obvious differences between FLP and BFL in the pattern space of PCA. Based on this, we established discriminant models for FLP and BFL separately. The common variables were selected through four variable selection methods: CARS, MC-UVE, SPA and LPG. The characteristic variables were then screened to build Fisher discriminant function models, and the validation results showed that the models were reliable and effective. The variable selection approach used for the NIR spectra offers a new way of identifying the origin of traditional Chinese medicines. The spectral difference between the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos provides a theoretical basis at the spectral level for the separate traditional usage of FLP and BFL.

Methods

Materials

Sixty W. cocos samples were collected from five different areas of Yunnan Province, China, from July to August 2015: central Yunnan (19), western Yunnan (12), northwestern Yunnan (5), southwestern Yunnan (10) and southeastern Yunnan (14). They were identified and authenticated by Professor H. Jin, Yunnan Academy of Agricultural Sciences. The specimens were preserved in the Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences. The samples were separated into FLP and BFL. After drying at room temperature, the samples were ground to fine powder and stored in zip lock bags for further analysis. The detailed sample information is listed in Supplementary Table S2.

Instruments

An Antaris II Fourier transform near-infrared spectrometer (Thermo Fisher Scientific INC., USA) equipped with a diffuse reflection module was used. The spectrum collecting software ResultTM 2.1 and the analysis software TQ 9.2 included with the instrument were employed. A traditional Chinese medicine grinder DFT-100 (Zhejiang wenling Linda machinery co., LTD) and an 80-mesh stainless steel sieve tray (Tai’an of Chinese and western, Beijing) were used. The multivariate data analysis software packages were SIMCA-P 11.0 (Umetrics, Umea, Sweden), SPSS 19.0 (SPSS Inc., Chicago, USA) and MATLAB R2010a, and the code was derived from http://www.mathworks.cn/.

Spectra Collection

The powder (20.0 g) was weighed and thoroughly mixed, then transferred to the NIR sample cup and compressed. The collection parameters were 64 scans, a resolution of 4 cm−1, a scanning range of 10000 cm−1–4000 cm−1 and three parallel collections. The NIR spectra of W. cocos were preprocessed successively with Norris smoothing, mean centering, standardization and second derivative in the TQ 9.2 software. After optimization, the range 7501.74–4088.35 cm−1 was selected according to the spectrum standard deviation. The higher the standard deviation of the spectra at a given point, the greater the contribution that point made to classification.
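The idea behind selection by spectrum standard deviation can be sketched as follows: compute the standard deviation of the absorbance across samples at each wavenumber and keep the points where it is large. The toy spectra, the cut-off value and the function names are hypothetical; how TQ 9.2 applies the criterion internally is not described in the text, so this is only an illustration of the principle.

```python
from statistics import pstdev


def spectrum_std(spectra):
    """Standard deviation across samples at each spectral point."""
    return [pstdev(column) for column in zip(*spectra)]


def select_band(wavenumbers, spectra, threshold):
    """Keep the wavenumbers whose across-sample standard deviation >= threshold."""
    sd = spectrum_std(spectra)
    return [w for w, s in zip(wavenumbers, sd) if s >= threshold]


# Toy data: three samples measured at four wavenumbers (cm^-1); hypothetical values.
wavenumbers = [9000.0, 7500.0, 6000.0, 4100.0]
spectra = [
    [0.10, 0.30, 0.50, 0.80],
    [0.10, 0.35, 0.42, 0.95],
    [0.11, 0.25, 0.58, 0.70],
]
band = select_band(wavenumbers, spectra, threshold=0.02)  # hypothetical cut-off
print(band)
```

In this toy case the point at 9000 cm−1 barely varies between samples and is dropped, while the three lower wavenumbers are retained.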