Laser-Induced Breakdown Spectroscopy Coupled with Multivariate Chemometrics for Variety Discrimination of Soil

The aim of this work was to analyze the variety of soil by laser-induced breakdown spectroscopy (LIBS) coupled with chemometrics methods. 6 certified reference materials (CRMs) of soil samples were selected and their LIBS spectra were captured. Characteristic emission lines of main elements were identified based on the LIBS curves and corresponding contents. From the identified emission lines, LIBS spectra in 7 lines with high signal-to-noise ratio (SNR) were chosen for further analysis. Principal component analysis (PCA) was carried out using the LIBS spectra at 7 selected lines and an obvious cluster of 6 soils was observed. Soft independent modeling of class analogy (SIMCA) and least-squares support vector machine (LS-SVM) were introduced to establish discriminant models for classifying the 6 types of soils, and they offered the correct discrimination rates of 90% and 100%, respectively. Receiver operating characteristic (ROC) curve was used to evaluate the performance of models and the results demonstrated that the LS-SVM model was promising. Lastly, 8 types of soils from different places were gathered to conduct the same experiments for verifying the selected 7 emission lines and LS-SVM model. The research revealed that LIBS technology coupled with chemometrics could conduct the variety discrimination of soil.

At present, the calibration curve method with single band data mining and calibration free method were widely used for concentration measurement in LIBS analysis 29 . Recently, multivariate chemometrics methods combined with LIBS technology showed a sharp growth in the field of soil analysis. Examples of soil analyzing based on multivariate chemometrics mainly included predication of element concentrations in soil and classification of soil variety. For determining the elemental content, multivariate analysis methods, such as partial least-squares regression (PLSR), artificial neural network (ANN), support vector machine (SVM), random forest regression (RFR), multi-linear regression (MLR), principal component regression (PCR), standard addition method (SAM), were widely applied to LIBS data [30][31][32][33][34][35][36] . On the other hand, researchers developed a variety of ways to deal with the variety discrimination of material. In general, identification of emission lines of a significant element or use of multiple LIBS emission lines (calculating intensity ratios of the selected lines) were the two methods for classifying the soil samples 29 . Moreover, multivariate analyses have been already widely applied to LIBS data for classification purposes 37,38 , such as classification of slag sample using partial least squares discriminant analysis (PLS-DA) 33 , discrimination of sedimentary rocks based SVM 39 , principal component analysis (PCA) and PLS-DA for distinguishing characteristics of the geological samples 40 , PCA on data of polluted soils using a mobile LIBS system 41 , rock identification using three methods of PCA, PLS-DA and soft independent modeling of class analogy (SIMCA) 42 , PCA for plastic classification 43 , adjusting spectral weightings (ASW) for polymer identification 44 , rock classification by remote LIBS using independent component analysis (ICA) 45 , hierarchical cluster analysis (HCA) for classifying the chicken tissue (brain, lung, spleen, liver, kidney and skeletal muscle) 46 , classification of pharmaceutical tablets based on SIMCA 47 , classification of toys relying on toxic elements using the k-nearest neighbors (KNN) 48 . However, few studies on variety classification of soil using LIBS coupled with multivariate analyses have been reported.
In the current study, LIBS technique was employed to carry out the variety discrimination of soil. Multivariate chemometrics method of PCA, SIMCA and LS-SVM were introduced to conduct the characteristic and discrimination analysis, and then the selected characteristic spectral lines and optimal discriminant model were verified.

Results and Discussion
Overview of soil LIBS spectra. The obtained LIBS spectra curves contained more than 17,000 wavelength channels in a wavelength range starting at 300 nm in the ultraviolet (UV) and extending into the near-infrared (NIR) to 850 nm. Figure 1 shows the original LIBS spectra of the 6 types of soil samples (GBW07410, GBW0746, GBW07447, GBW07454, GBW07455, and GBW07456) in 300-850 nm. It could be observed from Fig. 1 that there were similar profiles of curves of 6 soils. However, the discrepancy of 6 groups only appeared on the different LIBS spectra intensity. In detail, most of the valuable and high-intensity spectral lines were around the region of 300-450 nm. Concurrently, LIBS spectra in 450-850 nm exhibited a relatively low-intensity and stable tendency, except several obvious peaks around the wavelengths at 590 nm, 655 nm, 770 nm, and 820 nm.
From Fig. 1, 6 types of soil samples with similar LIBS spectra curves were attributed to their homologous matrix of chemical and elemental composition 11,49 . In order to analyze information of the LIBS spectral lines, the soil sample numbered GBW07410 was taken as an example to identify those spectral lines according to NIST Atomic spectra database and Kurucz database 50 . Figure 2 illustrated a typical spectrum of soil samples numbered GBW07410. Most of emission lines standing for different elements were accurately identified and labeled in corresponding positions.
In Fig. 2, a number of emission lines (atomic and ionic spectral lines) with different LIBS intensity were observed to contain the information of Al, Ca, Si, Fe, Mg, Na, Mn, Li, Ti, N, K, Ba, H, and O element. The spectral lines of O and N might be caused by O 2 and N 2 in the air and the by chemical substances in soil sample.
The identified emission lines of elements Al, Fe, Mg, Ca, Na, K, and Si in the oxide components present in the soil sample are listed in Table 1. These emission lines had minimal interference from other emission lines and provided enough LIBS intensity.
Principal components analysis (PCA) on LIBS data. Because the soil samples had different matrix and experiment conditions (especially the Echelle spectrometer was sensitive to temperature) changed, which might bring the discrepancy of LIBS data. Hence, it was necessary to conduct the LIBS data preprocessing.  Normalization had a widely application in preprocessing of the LIBS data. Area normalization is a kind of normalization; its purpose is to "scale" samples in order to get all data on approximately the same scale. So, in order to compensate for spectral changes caused by matrix effects and variation in experimental conditions, all spectra were normalized by using area normalization method, accomplishing an equal area under the curve for each spectrum 51,52 .
In this study, the acquired LIBS spectra included 17,173 wavelength channels in 300-850 nm. Classification using all the spectra could enhance the calculation time and increase the requirements of equipment performance for LIBS measurements. Hence, removal of highly superfluous variables and selection of few crucial emission lines of elements were of significance for multivariate analysis.
First, PCA was employed to transform the full spectra into several principal components (PCs) and the loading plot was executed to select the important emission lines. The first seven PCs explained 96.19% of the variations of original all spectral information and their loading plot are shown in Fig. 3. It could be observed that most emission lines of main elements (Al, Fe, Mg, Ca, Na, K, and Si) offered the relatively large loading coefficients, which were in alignment with the labeled lines in Fig. 3.
Finally, the emission lines with high signal-to-noise ratio (SNR) would be selected to conduct the further analysis. Based on above, a total of 7 characteristic lines (marked in Fig. 3 49 nm, which contained LIBS spectral peaks of Si, Al, Fe, Mg, Na, Ca, and K of each soil class, were chosen from the identified emission lines. Then, a matrix with 180 × 7 (LIBS spectra × lines) was obtained to implement the next analysis.
Next, another PCA was carried out using the LIBS spectra at the selected characteristic lines to display any variation among the 6 types of soil sample. The first 2 PCs explained 94.49% (PC-1: 65.69% and PC-2: 28.80%) of the variations among total spectral information, and their score and loading plots are shown in Fig. 4. Each point in the scatter plot (score plot) represented one spectrum. Fig. 4(a) showed that an apparent clustering could produce with PC-1 and PC-2. The LIBS spectra of soil were distinguished in the side of PC-1, while some spectra tended to be on the positive side of PC-2. Meanwhile, there was a slight cross between the classes of GBW07446, GBW07447, GBW07454, and GBW07455. It was worth noticing that there was an obvious difference between 6 groups of soil samples. Figure 4(b) showed the loading plot of the PCA, which also revealed the importance of the analyzed variables. It could be concluded that Ca and Na elements gave expression to dominating contribution on PC-1 and PC-2, respectively. For fully explaining the scatter of score plot, loading plot of Fig. 4(b) and Table 4 were combined to analyze the scatter distribution of 6 types of soils. The GBW07454 and GBW07447 classes with relatively high concentration of Ca element were located in the positive side of PC-1; and the GBW07410 class with relatively low content of Ca was situated in the negative side of PC-1. In addition, the classes of GBW07446 and GBW07447 containing relatively high concentration of Na element distributed in the positive side of PC-2; and classes (GBW07410, GBW07456, GBW07455, and GBW07454) with similar content of Na were scattered in negative side of PC-2. The classes of GBW07446, GBW07447, and GBW07455 exhibited some slightly intersections, which was attributed to their approximate content of Al and K.
Although some differences could be observed in Figs 1 and 4, chemometrics methods were employed to extract and concentrate the connotative information for further discriminating soils 53 . Soft independent modeling of class analogy (SIMCA). For SIMCA analysis, SPXY (sample set partitioning based on joint x-y distances) method proposed by Galvao et al. 54 , was first implemented to divide the data matrix (180 × 7) of LIBS spectra and corresponding labeled classes of each spectrum into a calibration set with 120 LIBS spectra and a predication set with 60 LIBS spectra. SIMCA was applied to calibration set of the LIBS spectra. As mentioned principle of SIMCA, a separate PCA was performed for every class of soil, resulting in 6 individual PCAs. For the predication of the "unknown" LIBS spectra, these PCA models were applied with 4 PCs each, except for the model of class GBW07454, where 3 PCs were used. Then, a SIMCA model was established using the LIBS spectra of the calibration set. Putting the data of predication set into the SIMCA model could compute the forecast results.
The result of SIMCA classification and predication of unknown class samples is shown in Fig. 5. Nearly all the LIBS spectra of soil in the predication set were correctly classified, except 6 ones were misclassified. 3 LIBS spectra of GBW07446 and GBW07456 were identified as the class of GBW07447, respectively. This resulted in Least-squares support vector machine (LS-SVM). Next, LS-SVM methodology was employed to establish models based on the calibration set (same to SIMCA model) for discriminating the 6 classes of soils. To obtain an excellent classification performance, two factors of regularization parameter γ and the RBF kernel function parameter σ 2 in LS-SVM classifier have to be optimized. The parameter γ could determine the tradeoff between maximizing the model performance and minimizing model complexity, and the σ 2 was the bandwidth and implicitly defined the nonlinear mapping from input space to some high-dimensional feature space [55][56][57] . Based on LS-SVM model, its classified results of predication set (60 spectra) are summarized in a confusion matrix presented in Table 2. The numbers of correctly classified samples were listed on the diagonal, and the off-diagonal was the misclassification. Meanwhile, the sum of the numbers in each column was the number of  samples examined for each type. From Table 2, all the analytical spectra of soil samples in 6 classes were correctly discriminated to their own groups, resulting in a correct classification rate of 100%. The results suggested that LIBS technology had a potential to discriminate different types of soil combined with proper chemometrics method, demonstrating the capability of chemometrics-LIBS in field of spectroanalysis.
Evaluation of discrimination models. The results of 2 models could be concluded that LS-SVM method provided more accurate discrimination results compared with the SIMCA model. Then, receiver operating characteristics (ROC) curve was employed to evaluate the performance of 2 models. From the ROC curves of SIMCA and LS-SVM discriminant models, the parameters of "area" and "std" in SIMCA ROC curve were 0.93695 and 0.008533, while those factors were 1 and 0 in LS-SVM ROC curve. The results indicated that the discrimination capability of LS-SVM was superior to SIMCA model. This might arise from the fact that LS-SVM as a nonlinear method has an ability to overcome the variability in LIBS measurements and showed better performance in handling high-dimensional LIBS data sets compared to conventional linear method 38 , like SIMCA in this study. To explore the discrepancy of 8 soils, PCA was adopted on the obtained matrix of LIBS spectra. PCA had compressed most variance of spectra into the first 3 PCs, which explained 99.10% variance of original data. Figure 6 shows the score plot of the first 3 PCs. It could be seen that most points of LIBS spectra of each type were clustered together and the boundaries of different types were relatively clear. So, it could be concluded that there was an obvious differentiation in 8 types of soil and the selected 7 characteristic lines were valid to distinguish the different soils. LS-SVM model was also developed for discriminating the 8 types of soils. The acquired spectral matrix was split into a calibration set (290 spectra × 7 emission lines) and a predication set (150 spectra × 7 emission lines) by the SPXY method with the ratio of 2:1. The sample labels in the database, which were integer varying from 1 to 8, were considered as the class labels used for producing models. Using the spectra in the calibration set, the classification performances approached 100% based on the LS-SVM. To assess the performance of this model, ROC curve of the LS-SVM classifiers displayed the parameters "area" and "std" of 1 and 0, respectively. It could be seen that the LS-SVM model obtained excellent discriminant performances. The above reliable discrimination results indicated that it was feasible to discriminate the LIBS spectra of different types of soils by means of LS-SVM methodology. with high SNR were selected to conduct the further next analysis. PCA was carried out on the LIBS spectra at the 7 selected emission lines. An obvious cluster was observed and analyzed. Then, SIMCA and LS-SVM discrimination models were established, and their performances were evaluated by ROC curve. Results demonstrated that the LS-SVM model was the optimal model for discriminating the different types of soils. Moreover, the 7 selected emission lines and the LS-SVM model were applied to the other 8 types of soil samples, which also achieved outstanding discrimination results. To improve the extendibility of our application, more samples and a diversified analytical data set should be taken into account to obtain enough spectrum data in further investigations. It could provide a theoretical guidance for establishing agrotype system and farmland management.

Materials and Methods
Experimental device and LIBS data acquisition. A typical LIBS system illustrated in Fig. 7 was assembled in our lab using the following main components: a Q-switched Nd: YAG laser (Vilte-200, Beamtech Optronics Co. Ltd., Beijing), a high resolution Echelle spectrometer (Mechelle 5000, Andor Technology) coupled to an intensifier charge coupled device (ICCD) camera (iStar DH340T-18F-03, Andor Technology), a delay generator (DG645, Stanford Research Systems, USA), an X-Y-Z moving stage (Zolix Instruments Co. Ltd., Beijing), and a personal computer (PC) with the Andor SOLIS software (Version 4.25, Andor Technology).
The experiments were finished in air. Before collecting LIBS data, the system was warmed up for 0.5 h to ensure the thermal stability of the instruments. Then, the setup was corrected by a Hg: Ar lamp (Ocean Optical, HG-1, Hg-Ar lines 253-922 nm) for wavelength calibration and a Deuterium-Halogen light source (DH-2000-BAL, Germany) for intensity calibration.
After that, the Q-switched Nd:YAG laser operating with 1 Hz repetition frequency and 7 ns pulse duration emitted a laser pulse with energy of 80 mJ at wavelength of 532 nm. Through the reflection of a mirror, the laser beam was focused vertically onto the soil sample surface with a 100 mm focal distance lens. Then, LIBS emission was collected by the light collector and delivered by optical fiber to the Echelle spectrometer (200-975 nm, 195 mm focal lengths, F/7, resolution of λ /∆ λ 6000) equipped with a time-gated ICCD camera (1024 × 1024 pixels, 13.6 × 13.6 μ m 2 /pixel). The delay generator provided a proper delay time to eliminate the initial continuum emission. In order to obtain fresh locations to be ablated, the soil pellets were placed on the sample holder which could be moved automatically in X, Y and Z directions by stepper motors. For all measurements, the gate width and exposure time of the ICCD were set to 2 μ s and 0.01 s, respectively. A gate delay of 2.5 μ s, which was the gap between the laser pulse and the start of gating time, was chosen. In addition, microchannel plate (MCP) gain was fixed at 500.
For each spectrum, an accumulation of 20 laser pulses per site was collected to increase the signal-to-noise ratio (SNR). To minimize the influence of sample heterogeneity and laser energy fluctuations 52 , 180 shots were performed from 9 sites (9 spectra were obtained) for each sample. Then, the spectra of each 3 sites were averaged into an analytical spectrum, and 3 representative LIBS spectra could be acquired from one soil sample. Meanwhile, 30 LIBS spectra were recorded from each class (10 samples per class). According to the preceding process, a total of 180 LIBS spectra were gathered from 6 classes of soil samples.

Soil samples.
In this research, the certified reference material (CRM) of soil powder sample (grain size is less than 0.075 mm) was provided by National Institute of Metrology, P. R. China. The selected soil samples included 6 classes: black soil (GBW07410) from Heilongjiang province, sandy soil (GBW07446) and saline-alkali soil (GBW0747) from Inner Mongolia Autonomous Region, loess (GBW07454) from Shaanxi province, sediment (GBW07455) from the Huaihe River in Anhui province, and the other sediment (GBW07456) from the  Table 3 (other elemental compositions are not listed). The certified elemental compositions were measured by the Institute of Geophysical and Geochemical Exploration of the Chinese Academy of Geological Science using X-ray fluorescence (XRF), inductively coupled plasma-mass spectrometry (ICP-MS), atomic absorption spectroscopy (AAS), etc.
Meanwhile, the concentrations of main elements (Si, Al, Fe, Mg, Ca, Na, K) in oxide were calculated and listed in Table 4.
To obtain the homogeneous surface of the soil for laser ablating, the soil powder was made into cylindrical pellets using a manual pellet presser (FY-24, SCJS Co., LTD, Tianjin, China). In detail, the applied load of the presser was 15 Mpa lasting for 4 minutes. Each soil pellet had a weight of 3 g, a diameter of 25 mm, and a thickness of 3 mm.

Multivariate chemometrics methods.
PCA is an unsupervised technique (classes or composition of the samples in the data matrix is not involved) and has a wide application in reducing the dimension of multivariate data sets 39,58 . The principle and application of PCA could be found in the paper reported by the literature of 1,59,60 . Meanwhile, the score plot of PCs is used to reveal the features of variable distribution 61 , and the loading plot of PCs can exhibit the importance of different variables.
Least-squares support vector machine (LS-SVM), an optimized version of the standard SVM, is a powerful methodology in pattern recognition and function estimation 55,56,[62][63][64] . In this research, radial basis function (RBF) kernel function was adopted to establish the LS-SVM model, it was convenient to detect the effect of independent variable (X) on dependent variable (Y) based on analysis of linear regression coefficient. The details of LS-SVM could be found in the literatures of 56,62,63 .
Soft independent modeling of class analogy (SIMCA), an application of PCA, is a supervised technique for classification 30,48,59,65 . In this model, the acquired data set is subdivided corresponding to class affiliation, and then a separate PCA model is performed independently for each of those classes. Meanwhile, numbers of PCs are selected individually for the corresponding classes. Samples with unknown class membership are then assigned to a class by projecting them into each subspace. Then, the residual variances of unknown samples are compared to judge which categories the sample belongs to 66,67 .
Lastly, confusion matrix with an advantage to obtain true values and false values from the result 68 could be used to list the discrimination results. Meanwhile, receiver operating characteristics (ROC) curve, a useful tool for organizing classifiers and visualizing their performance [69][70][71] , is employed to assess the performance of discrimination models.
Software tools. The processes of statistical calculations and data analyses were carried out by "The Unscrambler X10.1" (CAMO PROCESS AS, Oslo, Norway) and MATLAB 7.8 (R2009a) software (The MathWorks, Inc., Natick, MA, USA). In addition, Origin Pro 8.0 SR0 (Origin Lab Corporation, Northampton, MA, USA) software was used to design graphs. All processes were run on a PC (CPU: Intel Core i3-3220 @3.30GHz, RAM: 4.00GB) under Windows 7.   Table 4. The concentrations (mg·g −1 ) of main elements in oxide of six soils.