Prediction of soil salinity with soil-reflected spectra: A comparison of two regression methods

To achieve the best high spectral quantitative inversion of salt-affected soils, typical saline-sodic soil was selected from northeast China, and the soil spectra were measured; then, partial least-squares regression (PLSR) models and principle component regression(PCR) models were established for soil spectral reflectance and soil salinity, respectively. Modelling accuracies were compared between two models and conducted with different spectrum processing methods and different sampling intervals. Models based on all of the original spectral bands showed that the PLSR was superior to the PCR; however, after smoothing the spectra data, the PLSR did not continue outperforming the PCR. Models established by various transformed spectra after smoothing did not continue showing superiority of the PCR over the PLSR; therefore, we can conclude that the prediction accuracies of the models were not only determined by the smoothing methods, but also by spectral mathematical transformations. The best model was the PCR based on the median filtering data smoothing technique (MF) + log (1/X) + baseline correction transformation (R2 = 0.7206 and RMSE = 0.3929). To keep the information loss becoming too large, this suggested that an 8 nm sampling interval was the best when using soil spectra to predict soil salinity for both the PLSR and PCR models.

Soil salinization is one of the most important obstacle factors that has caused adverse effects on soil production, such as a decrease in cultivated soil fertility and crop failures, which restrict the global development of agriculture [1][2][3][4][5][6] . At the same time, soil salinization greatly influences the ecological environment, which is closely related to human lives and seriously influences the development of the social economy [7][8][9][10][11] .
Traditional field sampling analysis technology is time-consuming and laborious and different sampling methods have a large number of uncertainties and errors when expressing the soil salinization level in a study area [12][13][14] . Technology regarding hyperspectral analysis is time-saving, can perform rapid analysis, saves energy, has a low cost, is not destructive, and can simultaneously estimate the multiple components in soil given new technology and methods for soil information research 12,[14][15][16][17] . The soil spectrum is a comprehensive reflection of various soil physical and chemical properties.
Regarding the use of a high-spectral quantitative model to predict soil properties, due to multiple spectral variables, the correlation between variables needs to be eliminated when building models. Most authors have established partial least-squares regression (PLSR) models 31 and obtained good precision 28,29,[32][33][34] . Several authors have obtained better principle component analysis (PCR) models 35,36 , which perform better than the PLSR. Some authors asserted that the PLSR method performs better than the PCR method. Others asserted the opposite opinion. Sometimes these models are not directed at the same property (e.g., soil salinity). For salinity, it is hypothesized that the precision of a model is associated with the processing methods, such as smoothing and various mathematical transformations. Therefore, it is not rigorous and arbitrary to determine which modelling method is better to predict soil salinity under non-unified modelling conditions. In addition, we often develop transformations to soil spectral data when building models, such as smoothing, multiplicative scatter correction (MSC), and vector normalization (SNC). A number of different spectral transformations have been carried out when predicting soil organic matter, total nitrogen and soil heavy metals with high spectra 31 . However, the chemical properties of soil decide whether the selected data transformations are different. For soil salinity research, due to geographical differences among regions, the same data processing methods have different model precisions as those for different soil salt components 37,38 . Our early studies have indicated that when the PLSR model is used to predict soil salinization, the best data transformation was smoothing + MSC 37 . However, it is still necessary to prove whether the PCR model has the same regulations when using the same transformations. The same type of saline soil should be selected to ensure uniform condition of modelling, so that the models could be compared.
China has vast areas of saline-sodic soil distributed mainly in the arid and semi-arid areas of northern China 39 . It not only restricts the regional development of agriculture and the economy, but it also has adverse effects on regional food and ecological security. Therefore, monitoring soil salinization is a very important task 40 . The soil type in the study area is classified as Aquic Alkalic Halosol based on Chinese Soil Taxonomy 41 . The above studies of soil salinity inversion models were less focused on saline-sodic soil 37 . Because soil compositions are very complicated, the inversion methods established from the areas with different soil salinization types had certain limitations. Even if an adequate and complex model has already been established in the same type of soil regions 42 , it cannot guarantee the applicability in a wider region. Establishing the best quantitative model in this region has important significance. Therefore, the saline-sodic soil was selected as the study subject.
Based on the above literatures and analysis, we find that the main existed problems were: (1) there was no definite conclusion on which model is more suitable for predicting soil salinity in soil with same salt components, and lacking of systematic analysis because of the different conditions of modelling in previous studies. (2) It is need to verify whether different spectral processing methods affect the accuracy of the two models under the uniform external conditions. Thus, this study aimed to (1) build PCR and PLSR models between the soil reflection spectrum and the soil salinity content and compare the pros and cons of the two methods when predicting soil salinity in saline-sodic soil and (2) analyse the influence of different spectral transformation methods on the accuracy of the two models and determine the best spectral transformation methods. This conclusion can be used as a reference for the establishment of the spectral model and the selection of the spectral transformation method in investigation of soil salinity, and the best model can also be used in the prediction of soil salinity in saline-sodic soil.

Results
The accuracy of the soil electrical conductivity prediction models based on the original spectra. We established the PLSR and PCR models based on the original spectral bands and soil electrical conductivity (EC) values; the prediction accuracy of the established models can be seen in Table 1. The calibration accuracies of the PLSR model and the PCR model were R 2 = 0.8623 and R 2 = 0.5373, respectively. The calibration accuracy of the PLSR method was significantly higher than that of the PCR method, and the independent prediction accuracy of the PLSR method (R 2 = 0.5346 and RMSE = 0.5071) was superior to that of the PCR method (R 2 = 0.4534 and RMSE = 0.5496). However, it was too soon that conclude that the prediction of EC with the PLSR method was significantly higher than that with the PCR method. The soil spectra data required further processing and mathematical transformation; models established based on the processed spectra data may strengthen the results. Therefore, we were able to perform spectral transformations when building models to verify which model was superior.
The accuracy of models based on different spectral smoothing methods. When establishing a soil property inversion model, one of the most commonly used methods for hyperspectral data processing is spectrum smoothing. There are four main methods for spectral smoothing: Moving-Average data smoothing technique (MA), Savitzky-Golay data smoothing technique (SG), Median filtering data smoothing technique (MF), and Gaussian filtering data smoothing technique (GF) 43 . This paper chose four smoothing methods to smooth soil spectra and aimed to determine which smoothing method was better, as well as verify whether the PLSR model continued to outperform the PCR model based on the smoothed spectra. Based on the smoothed spectra data, the PLSR and PCR models for soil EC were established, and the model accuracies are shown in Table 2. Table 2 shows that smoothing improved the accuracy of the models after implementing different spectral smoothing methods. Although the calibration of the models had a good prediction (R 2 > 0.7600), the independent prediction of models showed different results. In addition to the PLSR model, which was established based on spectra that were smoothed with the MA method, the other PLSR models that were established based on the remaining spectral smoothing methods all achieved good results; of these results, the PLSR model based on the median filter smoothing was the best (R 2 = 0.6414 and RMSE = 0.4452).

Calibration
Cross-validation www.nature.com/scientificreports www.nature.com/scientificreports/ As for the PCR models, four smoothing methods significantly improved the precision of the prediction. Among them, the best smoothing method was the median filtering method (R 2 = 0.6799 and RMSE = 0.4206).
However, compared with the models based on the original spectra, the PLSR models did not continue outperforming the PCR models. This indicated that the accuracy of prediction model regarding soil electrical conductivity was also affected by some factors besides the model itself. Looking at the model accuracy based on four types of spectral smoothing, the prediction accuracy of the PLSR with the second smoothing method approached the prediction accuracy of the PCR. For the MA, MF and GF smoothing methods, the prediction accuracies of the PCR were obviously better than those of the PLSR model. The changes of prediction accuracies between PLSR and PCR models mainly occurred after the smoothing. Soil spectra were only processed by the smoothing method. Therefore, from the above results, we concluded that the smoothing method affected the predictive precision of the PLSR and PCR models.
Model accuracies based on different spectral mathematical transformations. From the four types of smoothing methods mentioned above, both the MA and SG smoothing methods represent a linear smoothing spectrum method, while both the MF and GF methods represent a nonlinear smoothing spectrum method. To verify the above deduction, we chose the MA and MF smoothed spectrum methods (both have prediction accuracies of PCR > PLSR) and performed various mathematical transformations. If the deduction was correct, the prediction accuracy of the models, which were established on various transformation spectrums after smoothing, continued showing the accuracies better for the PCR than for the PLSR.
Regarding the MA smoothing method, the prediction accuracy of models, which were established for various transformation spectra after MA smoothing, did not continue to show PCR superiority over the PLSR (Table 3).
As for the MF smoothing method, the prediction accuracy of models, which were established for various transformation spectra after MA smoothing, also did not show PCR superiority to the PLSR (Table 4). Therefore, we can conclude that the prediction accuracy of the models was not only determined by the smoothing methods, but also by the spectra mathematical transformations.
According to the results from the spectral mathematical transformations, three types of methods, including the MF + log(1/X) transformation, the MF + log(1/X) + baseline correction transformation, and the MF + area normalization transformation, had adequate prediction accuracies for the PCR and PLSR models, where the PCR model based on the MF + log(1/X) + baseline correction transformation had the highest prediction accuracy (R 2 = 0.7206 and RMSE = 0.3929).
The accuracy of the PCR models based on different resampled hyperspectral data. The resampling of hyperspectral soil was conducted at intervals of 2, 4, 8, 10, 16, 32, and 64 nm based on the smoothing + log (1/X) processing method in order to find the optimal sampling interval for modelling the prediction of soil salt. The relevant content regarding the effects of different resampling intervals in the PLSR model (based on MF smoothing) has been discussed in our previous studies 37 ; here, this paper mainly studies the effect of different resampling intervals on the PCR (based on MF smoothing).
As Table 5 shows, all the prediction accuracies of the PCR calibration models were high, with R 2 ranging from 0.75 to 0.83; these values were higher than those of the corresponding validation models and prediction models, and the RMSEs for all the calibration models were lower than those of the corresponding validation models and prediction models. With an increasing in sampling interval, the precision of the calibration model also gradually increased, with an R 2 ranging from 0.7518 to 0.8298, and the RMSE decreased to 0.3938 from 0.4602. With an increase in sampling interval, the precision of the cross-validation set slowly increased. There was a significant turning point at the 32 nm interval. When comparing the calibrated PCR models and the cross-validation PCR models, the precision of the independent validation set showed different changes. With an increase in sampling  Table 2. Prediction accuracies of the PLSR and PCR models for EC based on different spectral smoothing methods. "Independent Prediction" stands for the accuracy of the models by independent validation set (36 selected samples). "Number of predictors or factors" denotes the number of spectral principal components extracted. The numbers 1, 2, 3, and 4 represent the moving-average data smoothing technique (MA), the Savitzky-Golay data smoothing technique (SG), the median filtering data smoothing technique (MF), and the Gaussian filtering data smoothing technique (GF) methods, respectively. The data in rows marked with the letter "a" are referenced from the literature 37 .
www.nature.com/scientificreports www.nature.com/scientificreports/ interval from 2 nm to 8 nm, the change in R 2 was minimal; as the sampling interval gradually increased, the RMSE gradually declined.

Discussion
From the view of the established models based on optimal smoothing (MF), most PCR models were superior to the PLSR models (Table 4). Different smoothing methods had different principles, which affected the extraction of the principal components. The MF smoothing method used a filtering principle, which obviously improved the accuracy of the PCR model. The MA smoothing method used a linear principle, which did not obviously improve the accuracy of the PCR model. Different mathematical transformation methods based on different smoothing methods had different prediction accuracies, which indicated that mathematical transformations had different effects than those from smoothing methods 36,43 .
It can be seen from the above results that the treatment of spectral data was necessary and significantly improved the precision of the model 28 Table 4. Prediction accuracy of PLSR and PCR modes for EC based on MF spectral smoothing. "Independent Prediction"stands for the accuracy of models by independent validation set (36 selected samples). "Number of predictors or factors" denotes the number of spectral principal components extracted. The number 2 represents the median filtering data smoothing technique (MF) methods. 2 + A represents MF + log(1/X); 2 + A + B represents MF + log(1/X) + baseline correction; 2 + C represents MF + first derivative; 2 + D represents MF + area normalization; 2 + E represents MF + SNV; and 2 + F represents MF + MSC. The data in rows marked with the letter "a" are referenced from the literature 37 37 . Therefore, we should choose the appropriate mathematical method when modelling. From the perspective of mathematical processing convenience and practicability of the model, we recommend the MF + log(1/X) and MF + area normalization transformations to soil spectra when building PCR and PLSR models.
Because the hyperspectral data interval was small, the hyperspectral data provided rich information regarding redundancy and noise. In addition, the small data interval also caused inconvenience in the calculation due to the huge amounts of hyperspectral data in other spectral applications. Resampling of the spectrum can reduce noise and the number of independent variables, which can improve the efficiency of modelling and prediction accuracy 23 .
Kemper and Sommer 44 thought that a large sampling interval (20 nm or 10 nm) reduced the influence of noise and produced adequate prediction results; their study was similar to our study. However, when the sampling interval was 32 nm, the prediction accuracy of the PCR obviously changed; the PCR models were barely able to predict soil salt. Therefore, if the focus is on reducing noise, a spectrum interval that is too large will cause a loss in spectrum information and a decline in prediction accuracy when using the spectrum.
As for the PLSR models analysed in this study 37 , the performance was similar to that of the PCR models. However, if the prediction accuracy of the PLSR reached a certain level (i.e., R 2 exceeded 0.6), then the sampling interval could not exceed 8 nm. Otherwise, the prediction accuracy would decrease. To avoid the large loss of information, it was suggested that a sampling interval of 8 nm was the best when using soil spectra to predict soil salinity both with PLSR and PCR models.

Conclusions
In this paper, we established PLSR and PCR models based on original spectral bands and soil conductivity; it is feasible to predict salinity in saline-sodic soil using soil spectra. Smoothing improved the accuracy of the models, and the best smoothing method was median filtering for both the PLSR and PCR. According to the results of the spectral mathematical transformations, the best model was the PCR model based on the MF + log(1/X) + baseline correction transformation, which had the highest prediction accuracy. The prediction accuracies of the models were not just determined by smoothing methods, but also by spectral mathematical transformations. To avoid a large loss of information, it was suggested that a sampling interval of 8 nm was best when using soil spectra to predict soil salinity with both PLSR and PCR models. This paper built adequate prediction models and determined the effect of spectral transformations on models; however, it should be noted that the best model we established was suitable for saline-sodic soil. This model whether or not is suitable for other types of soil salt needs to be verified.

Methods
Study area. Our study area is located west of the Jilin Province in northeast China (44°13′57″-46°18′N,121°38′-124°22′50″E). The area has a very typical and large area of saline-sodic soil. The soil type in the study area is classified as Aquic Alkalic Halosol based on Chinese Soil Taxonomy 41 . The study area belongs to the temperate continental monsoon climate, and the average annual precipitation is only 400-450 mm, while annual evaporation reaches 1200 mm. The small amount of precipitation and large amount of evaporation leads to climate droughts. In addition to the special climate, hydrogeological conditions and human activities have contributed to soil salinization in the area 39 .
Field Sampling and Laboratory Measurements. Soil samples were collected from the typical saline-sodic soil of the Songnen Plain, which consists of 6 counties that encompass 29302 km 2 in northeast China (Fig. 1). A soil sample experienced 5-7 subsamples at each sampling point, then the soils were mixed and transfered (1-2 kg) into plastic bags, labelled, then taken back to the lab for analysis. A total of 126 soil samples were collected from the surface to a depth of 20 cm and sieved through a 2-mm mesh, and a 0.147 mm mesh. The soil samples equally represent all soil types and land uses. Soil sieved through a 2-mm mesh was used to measure the  Table 5. Results of calibration, validation and prediction with different resampling intervals by the PCR analysis. "Independent Prediction"stands for the accuracy of models by independent validation set (36 selected samples). "Number of predictors or factors" denotes the number of spectral principal components extracted.
www.nature.com/scientificreports www.nature.com/scientificreports/ soil electrical conductivity, and soil sieved through a 0.147-mm mesh was used to measure the soil spectrum for excluding the effect of particle size on soil spectra 45,46 . The soil electrical conductivity was measured at 1:5 of soil: water using conductometry 47 . Soil spectral measurement and processing. Soil samples sieved through a 0.147-mm mesh were ground until fine particles(<0.038 mm) were obtained and then were tabulated. After the tabulated soil dried at a low temperature, the soil spectrum was measured using the Lamdar900 spectrum test 45,46 . A total of 1051 bands were measured, with wavelengths ranging from 400 to 2500 nm and a spectrum sampling interval of 2 nm (see Supplementary spreadsheet S1).
There were four main methods of spectral smoothing: MA, SG, MF, GF 43 . To study the effects of spectral smoothing on the accuracy of a hyperspectral soil model, our original spectral data were smoothed by the four types of smoothing methods, subsequently.
To study the influence of other mathematical transformations on the precision of the hyperspectral soil model, a variety of mathematical transformations, including the SNV, MSC, baseline correction, area normalization, the maximum normalization (MAX), range normalization, first derivative (FD), and logarithm transformation, were conducted based on the smoothed spectrum 43,48 . Finally, the soil spectra were resampled to compare the influences of different sampling intervals on the prediction accuracy of the spectral model.

Modelling and verification methods.
Because the number of hyperspectral variables was greater than the number of soil samples, the ordinary least-squares model cannot be used. For this situation, the commonly used methods for modelling soil hyperspectral data are the PCR and PLSR methods. These two types of modelling methods extract the principal components from spectral variables and exhibit adequate spectral prediction [49][50][51] . Both the PLSR and the PCR methods extract the maximum information reflecting the variation of the data. The principal components extracted by PCR were orthogonal, while the principal components extracted by PLSR were based on three analytical methods: principal component analysis, canonical correlation analysis and multivariate linear regression analysis. This paper used these two types of common methods for modelling [49][50][51] .
The calibration and validation set would have a significant impact on the results. In this study, the distribution of soil salt content of soil samples collected is widespread and data of soil salt content distributed in each grade of soil salinization (0.02-30 g/kg), therefore, the Rank method 52 was chosen in this paper. The procedure was: all samples were sorted according to the electrical conductivity (EC) content, and then two neighbouring soil samples were selected as calibration sets for each soil sample to avoid this effect. The remaining soil samples were selected as validation sets. The cross-set was the same as the calibration set. In all, ninety samples were selected as a calibration set, and 36 samples were selected as an independent validation set for prediction. The whole established models were tested by the independent validation set. The evaluation of the model precision mainly adopted determination coefficient R 2 and the root mean square error (RMSE) to forecast and measure values, respectively 50 . The larger the value of R 2 , the better the precision of the model. In addition, the smaller the RMSE, the better the precision of the model. The root mean square error algorithm is as follows 2 where X represents the real value, Y represents the predictive value, and N represents the sample number. We had discussed the accuracy of PLSR models under different spectral transformation methods 37 . In this manuscript, we will establish the PCR models under the unified condition on the basis of the previous study, and www.nature.com/scientificreports www.nature.com/scientificreports/ analyze the accuracies of the two types of model (PLSR and PCR). The purpose of this paper is to compare the modelling accuracies between the two models (PLSR and PCR), specifically by analyzing the influence of different spectral transformation methods on the accuracy of the two models and considering the consistency of the models in experiments with four different spectral smoothing methods and a variety of spectral mathematical transformations. For convenience a small portion of the data were quoted from the reference 37 for reuse and analysis. To avoid misunderstanding and ensure the seriousness and rigour of the article, we added annotations to the involved data (Tables 2 and 4).