Introduction

Zinc (Zn) in soil is one of the essential trace elements for plants1,2,3. When a plant is short of Zn, growth of the stem and buds is reduced, and normal growth is significantly affected. Zn is also required in the photosynthesis cycle4. Various factors can affect the accumulation of Zn. For example, geographic location (e.g., latitude and longitude) influences the distribution and content of vegetation, soil nutrients and heavy metals5,6,7,8,9,10,11. Moreover, a considerable number of interactions take place in the soil between physical and chemical properties, such as organic matter (OM), soil reaction (pH), calcium carbonate (CaCO3) and essential macro- and micronutrients (P, K, Ca, Mg, Mn, Fe, Zn, and Cu)12,13.

Previous studies investigating Zn in soils have mainly focused on the influence of Zn on plants14, Zn content prediction15, Zn pollution characteristics16, source analysis17 and potential ecological risk assessment18. Furthermore, spatial analysis and statistical methods, such as multivariate analysis19,20,21, have been used to analyze the relationships of nutrients and heavy metals in plants and soils. However, the nonlinear relationship between soil nutrients and heavy metal elements in soils has not been identified. Moreover, the effect of interactions among influencing factors on the variation of soil Zn content should also be determined. These interactions have an important impact on material circulation in the soil sphere, and they are also important for maintaining the ecological balance of environmental materials and for eliminating pollutants that threaten soil, plants and even humans22. This study aims to identify the interactions of geographical and physical factors that could affect Zn content in soil using a non-parametric model, the generalized additive model (GAM).

GAM has been widely used in medical applications23,24,25, financial research26, fishery surveys27 and environmental and climate studies28,29,30, owing to its specific advantages31,32. For example, it can directly deal with the nonlinear relationship between a response variable and multiple explanatory variables33, especially in the analysis of large data sets34. Furthermore, GAM can be used to analyse the effect of interactions between influencing factors on the response variable35,36,37. In contrast, traditional statistical methods do not perform well in addressing complex nonlinear relationships38,39. In our study, soil Zn content (hereafter referred to as Zn) is taken as an example to study the relationship between soil heavy metals, nutrients and geographic location (latitude and longitude). In the GAM, Zn is used as the response variable, and geographic location and five types of soil nutrients are used as the explanatory variables.

Results

Pre-analysis of selected variables

The normal distribution is rejected at the 5% significance level by the Shapiro-Wilk test. The data also do not meet the requirement for a binomial distribution. Consequently, the log function is selected as the link function33,37. Most explanatory variables pass the significance test at the P-value < 0.01 level, but most of the Pearson correlation coefficients (Rs) are less than 0.3 (Table 1), so the correlations among the variables are low. The largest R, between SAK and AK, is 0.443, which indicates a comparatively high correlation given the large number of samples.

Table 1 Comparison of the R among the explanatory variables.

Univariate analysis of influences on Zn

The regression model with cubic splines is used to analyze the influence of each individual explanatory variable on Zn and the corresponding goodness of fit of the model (Table 2). The results show that all seven explanatory variables pass the significance test at the P-value < 0.01 level, suggesting that each individual variable has a statistically significant influence on Zn, although the deviance explained is low. The deviance explained is highest for AP and longitude, with values of 20.2% and 16.5%, respectively. The corresponding adjusted coefficients of determination (Adj.R2), which account for the number of independent variables, are 0.2 and 0.16 for AP and longitude, respectively. The precision of the model derived from each individual explanatory variable is low. Consequently, interactions among multiple variables are considered when investigating their influences on Zn.

Table 2 Test of the GAM using univariate analysis.

Multivariate analysis of influences on Zn

The variables are gradually added to the GAM, and the tests are carried out using the Akaike information criterion (AIC) score (Table 3). The AIC scores generally decrease as variables are added. However, when SAK is added, the score increases by about 0.6, and the P-value is 0.031. SAK does not pass the significance test at the P-value < 0.01 level, which indicates that SAK has little effect on Zn. The other variables (latitude, longitude, OM, AHN, AP and AK) significantly affect the changes of Zn at the P-value < 0.01 level. When all factors are added, the Adj.R2 = 0.4.

Table 3 Test of the GAM using multivariate analysis.

Model fitting after concurvity diagnosis analysis

The three index values of S(SAK) and S(AK) are all close to or greater than 0.5 (Table 4), suggesting that concurvity exists (a correlation between S(SAK) and S(AK)). Combining the concurvity results for SAK with the multivariate analysis, SAK was removed from the model. After refitting the model in the absence of SAK, the results show that AHN, AK, AP, longitude and latitude significantly influenced Zn (Table 5). The refitted GAM, with a deviance explained of 70.4% and an Adj.R2 of 0.6, is an improved model that does not have concurvity. The refitted GAM identifies the effect of the influencing factors on the changes in Zn content (Fig. 1) and the resulting nonlinear relationships (EDF ≠ 1). The model predicts that Zn content increases with the rise in latitude, peaking at 39.7°N. Zn reaches the maximum at longitude 115.9°E and 39.8°N, and it changes little with the change of OM content.

Table 4 Test of the concurvity of the smooth function.
Table 5 Hypothesis test of the refitted GAM.
Figure 1

Estimated smooth functions of the six variables on Zn; the y-axis is the partial effect of the variable and the shaded area is the standard-error confidence interval.

Cross-validation of the refitted multivariate GAM

To avoid overfitting, cross-validation was used to test the refitted multivariate GAM. The difference between the predicted value and the measured value was small, and the six variables passed the significance test at the P-value < 0.01 level (Table 6). The optimal model can reasonably reflect the influencing factors on Zn.

Table 6 Cross-validation of GAM based Zn variation.

Interactions of multivariate factors on Zn

The deviance explained by the interaction GAM was 72.1%, with an Adj.R2 of 0.63. The estimated degrees of freedom of the longitude-AHN interaction was 1 (Table 7). The F-values for the longitude-latitude interaction, the AHN-AP interaction and the OM-AK interaction are 19.857, 4.678 and 4.433, respectively. These interactions passed the significance test at the P-value < 0.01 level. Similarly, the latitude-AP interaction, the latitude-OM interaction and the latitude-AK interaction passed the significance test at the P-value < 0.05 level.

The interactions that passed the significance test (P-value < 0.01) demonstrate the impact of interactions on Zn (Table 7 and Fig. 2). Figure 2(a) shows the influence of the interaction between latitude and longitude on Zn. When latitude is less than 39.6°N, Zn increases rapidly with increasing longitude up to about 115.8°E. Zn reaches its local maximum at 115.8°E, 39.7°N, after which it increases little as latitude and longitude increase. The influence of the interaction between AHN and AP on Zn can be observed in Fig. 2(b). When AP content is less than 50 mg/kg, Zn varies little with the increase of AHN content. Above 50 mg/kg, Zn increases with AHN until AP content reaches about 200 mg/kg. When AHN content is less than 50 mg/kg, Zn decreases with the increase of AP content. The influence of the interaction between OM and AK on Zn can be observed in Fig. 2(c). When OM content is less than 100 g/kg, Zn increases with a rise in AK. When both OM content and AK content increase, Zn increases. Figure 2(d) reveals the influence of the interaction between latitude and AP on Zn. Zn does not change significantly with the increase of AP content when the latitude is greater than 39.7°N. When the latitude is less than 39.6°N, Zn increases rapidly with the increase of AP content. Figure 2(e) shows the influence of the interaction between latitude and OM on Zn. When OM content approaches 400 g/kg, Zn decreases rapidly as latitude increases. Figure 2(f) shows the influence of the interaction between latitude and AK on Zn. Zn decreases with an increase in latitude when AK remains unchanged. Zn reaches a minimum at latitude 39.8°N when AK is less than 200 mg/kg.

Table 7 Hypothesis test of the interaction GAM model.
Figure 2

Three-dimensional effect graph of interacting influencing factors on the variation of Zn content.

Discussion

Influencing factors on Zn

Latitude and longitude have a significant influence on soil Zn content (P-value < 0.01). In contrast, Richardson et al.40 have shown that there is no correlation between site location and Zn content. The difference from the results of ref.40 may be due to the specific geographic location and land use change in the study area. The Fangshan district is a mountainous region with manufacturing and agriculture as the prime land uses. Zn content in soil in urban and industrial areas may be an order of magnitude greater than that in rural areas41. For example, Zn reaches its maximum at 115.8°E, 39.7°N (Fig. 2).

The increase of Zn with an increase in OM content is consistent with OM and heavy metals coexisting in soil sediment, with OM having been found to have important implications for heavy metal speciation, transport and bioavailability42,43. In addition, Zn content is also affected by other nutrient elements. For example, increases in flax yields in response to Zn application are most likely to occur where P fertilizer is broadcast at relatively high levels or on soils with a history of heavy P application44. Similarly, Zn increased as AP content increased in this study (Fig. 1).

Modeling the Zn content

To explore the variation of Zn content in soil, the linearity of the effects of the influencing factors on Zn was examined. Analysis of the EDFs of the smoothing functions from the univariate GAMs showed that Zn content is affected by complex nonlinear influences. The univariate GAMs of Zn content in soil are able to estimate values for the significant influencing factors of latitude, longitude, OM, AHN, SAK, AP and AK. These factors are considered additive, and hence a multivariate GAM was fitted, improving the goodness of fit over the univariate models. Nevertheless, SAK does not pass the significance test (P-value > 0.01) in the multivariate GAM, although it does pass in the univariate GAM. This suggests that there is a concurvity relationship between S(SAK) and S(AK).

Moreover, there is spatial correlation between AK and SAK in the study area. SAK refers to the potassium that exists between the layers of layered silicate minerals and at grain edges and cannot be extracted by neutral salts in a short time. In contrast, AK can be quickly absorbed and utilized by plants. Zhang et al.45 have revealed that AK is affected more than other potassium forms and can be more sensitive than SAK in directly reflecting productivity. On removal of SAK, the goodness of fit of the multivariate GAM improved, and the model identified that latitude, longitude, OM, AHN, AP and AK have significant influences on the Zn content in soil. Zn content in soils is primarily affected by the interactions between latitude and OM, AP and AK (Fig. 2). The modelling suggests that Zn content in soil is affected more by the north-south direction (latitude) than by the east-west direction (longitude) in the study region. This could be due to the location of manufacturing industries or to natural landforms and soil types. In our study, the GAM derived from the pairwise interactions of the influencing factors can be used to analyze the influence characteristics of Zn content. Zn content is affected by multiple factors, and an interactive GAM could also be constructed using three or more of these factors to analyse influences on Zn content in soil.

Materials and Methods

Description of the study area

Fangshan District is located between longitudes 115.4°–116.3°E and latitudes 39.5°–39.9°N in Beijing, China. It is situated to the east of the Taihang Mountains. The south-eastern region of the district is on a plain, with hill country intersecting the district from the northeast. It is in a warm temperate semi-humid monsoonal climatic zone.

Collection of soil samples

The soil samples were primarily collected in five typical types of agricultural cropland: vegetable land, irrigated land, irrigated paddy field, dry land and orchards. A total of 1,497 soil samples were collected in the study area (Fig. 3). Representative soil samples were collected from random points in the croplands to a depth of 20 cm. Each composite sample was formed from five points, and the samples were then crushed and fully mixed. Two diagonal lines were used to divide each sample into four parts, and two diagonally opposite parts were retained as the final sample. A portable sub-meter GPS receiver was used to accurately acquire the latitude and longitude of the sample points. Atomic Absorption Spectrometry (TAS-990, Xian Yima Optolec Co Ltd) was used to analyze the soil samples for nutrients and heavy metals. Specifically, samples were analyzed for organic matter (OM) (g/kg), alkali-hydrolyzed nitrogen (AHN) (mg/kg), available phosphorus (AP) (mg/kg), slowly available potassium (SAK) (mg/kg) and available potassium (AK) (mg/kg). Heavy metals analyzed were Zn, Fe, Cu, Mn, B and S.

Figure 3

Spatial distribution of the collected soil samples in the study area.

Generalized additive model

A GAM is a regression model that can define the relationships between the response variable and each explanatory variable through smooth functions18,31. A GAM with a log link function and a Gaussian error distribution is used to determine the effects of various factors on soil Zn. The generalized additive model considering interactions of two factors can be given in a general form:

$$g(\mu )=\sum \,{f}_{i}({X}_{i})+\sum \,{f}_{j,k}({X}_{j},{X}_{k})+\varepsilon $$
(1)

where \(\mu =E(Y\mid {X}_{1},{X}_{2},\cdots ,{X}_{p})\); g(μ) is a link function, and in this study log() is used as the link function; \({f}_{i}\) (i = 1, 2, …, 7) are the smooth functions of Xi; and Xi (i = 1, 2, …, 7) are the explanatory variables, namely latitude, longitude, OM, AHN, AP, SAK and AK, respectively. \({f}_{j,k}()\) are the smooth functions for the interactions between the explanatory variables \(({X}_{j},{X}_{k})\), where \(({X}_{j},{X}_{k})\) are (latitude, longitude), (AHN, AP), (OM, AK), (latitude, AP), (latitude, OM) and (latitude, AK), respectively. \(\varepsilon \) is the residual term, with \(E(\varepsilon )=0,Var(\varepsilon )={\sigma }^{2}\).
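To make the role of the log link in Eq. (1) concrete, the following minimal Python sketch evaluates the mean for a single observation from already-estimated smooth-term contributions. It is an illustration only: the function names and the numerical partial effects are hypothetical and are not taken from the fitted model.

```python
import math

def linear_predictor(main_effects, interaction_effects):
    """Sum the smooth-term contributions on the link scale:
    eta = sum_i f_i(X_i) + sum_{j,k} f_{j,k}(X_j, X_k)."""
    return sum(main_effects) + sum(interaction_effects)

def inverse_log_link(eta):
    """Invert g(mu) = log(mu), giving mu = exp(eta)."""
    return math.exp(eta)

# Hypothetical partial effects for one sample (illustrative values only).
main = [0.10, -0.05, 0.20]   # e.g. f(latitude), f(longitude), f(AP)
inter = [0.03]               # e.g. f(latitude, longitude)

eta = linear_predictor(main, inter)
mu = inverse_log_link(eta)
print(round(eta, 4), round(mu, 4))
```

Because the terms are additive on the log scale, each smooth term acts multiplicatively on the expected Zn content once the link is inverted.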

Smooth functions based on cubic regression splines were used in our work. Cubic regression splines are constructed from piecewise cubic polynomials joined together at points called knots. The definition of the cubic smoothing spline basis arises from the solution of the following optimization problem: among all functions f with two continuous derivatives, find the one that minimizes the penalized residual sum of squares:

$$\sum _{i=1}^{n}\,{\{{y}_{i}-f({x}_{i})\}}^{2}+\lambda {\int }_{a}^{b}f^{\prime\prime} {(x)}^{2}dx$$
(2)

where \({y}_{i}(i=1,2,\cdots ,n)\) is a set of observed values of the response variable and \({x}_{i}(i=1,2,\cdots ,n)\) is a set of observed values of the explanatory variable. λ is the smoothing parameter. \(\sum _{i=1}^{n}\,{\{{y}_{i}-f({x}_{i})\}}^{2}\) measures the degree of fit of the function to the data, while \(\lambda \,{\int }_{a}^{b}\,f^{\prime\prime} {(x)}^{2}dx\) adds a penalty for the curvature of the function, and the smoothing parameter controls the size of that penalty. In our study, the knots are evenly spaced along the range of each explanatory variable.
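The trade-off in Eq. (2) can be illustrated with a small numerical sketch. The Python code below approximates the objective on an evenly spaced grid, replacing the second derivative with central differences and the integral with a Riemann sum. The data, the two candidate functions and the name `penalized_rss` are hypothetical choices made for illustration; this is not the spline-fitting routine used in the analysis.

```python
def penalized_rss(x, y, f_vals, lam):
    """Discrete approximation of Eq. (2) on an evenly spaced grid x.
    f_vals[i] is the candidate function f evaluated at x[i]."""
    h = x[1] - x[0]
    # Goodness-of-fit term: residual sum of squares.
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, f_vals))
    # Curvature term: central-difference estimate of f'' at interior points.
    d2 = [(f_vals[i - 1] - 2 * f_vals[i] + f_vals[i + 1]) / h ** 2
          for i in range(1, len(f_vals) - 1)]
    roughness = sum(d ** 2 for d in d2) * h  # Riemann sum of f''(x)^2
    return rss + lam * roughness

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]          # toy data lying on a parabola
curved = [0, 1, 4, 9, 16]     # interpolates the data, but has curvature
straight = [0, 4, 8, 12, 16]  # no curvature, but fits the data worse

for lam in (0.5, 10.0):
    print(lam, penalized_rss(x, y, curved, lam), penalized_rss(x, y, straight, lam))
```

With a small λ the curved candidate gives the lower objective, whereas with a large λ the straight line is preferred, which is how the smoothing parameter controls the wiggliness of the fitted function.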

Statistical analysis

All statistical analysis in this study was undertaken in a free software environment for statistical computing and graphics (R version 3.1.2)46. A Shapiro-Wilk test was employed to check the normality of Zn. The correlation coefficient (R) was used to check the correlation between variables. In general, when there is a definite collinearity relationship between the influencing factors in the model, a concurvity relationship must exist between these factors. The existence of concurvity in a GAM would not only increase the variance of the coefficients but also enlarge the standard deviation of the coefficients. It can cause the narrowing of the confidence interval. Hence, it is necessary to test whether the model has concurvity. The concurvity test has three indicators: worst, observed and estimate (Table 4). Each indicator ranges from 0 to 1 and can be used to judge whether concurvity is present. A value of 0 means no concurvity. The closer the test value is to 1, the more pronounced the concurvity.
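As a reminder of what the correlation coefficient R measures, the sketch below computes the Pearson coefficient for two variables in plain Python. The function name and the sample values are hypothetical; the coefficients reported in Table 1 were computed in R from the measured data. Note that R only captures linear association, whereas concurvity concerns dependence between smooth terms and requires the separate indicators described above.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical values for two soil variables (illustration only).
var_1 = [120.0, 150.0, 90.0, 200.0, 170.0]
var_2 = [600.0, 640.0, 610.0, 700.0, 650.0]
r = pearson_r(var_1, var_2)
print(round(r, 3))
```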

Validation of the model

A forward stepwise procedure was used to choose the most appropriate model by adding each explanatory variable to the model in turn and then evaluating the AIC score. The smaller the score, the better the model fits. The AIC score is calculated as follows:

$$AIC=(2k-2L)/n$$
(3)

where k is the number of parameters in the model; L is the log likelihood; and n is the number of observations.
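Equation (3) can be evaluated directly once k, L and n are known. The Python sketch below computes the score for two candidate models and selects the one with the smaller value. The model labels and the parameter counts and log likelihoods are hypothetical numbers chosen for illustration; they are not results from Table 3.

```python
def aic_score(k, log_lik, n):
    """AIC as defined in Eq. (3): (2k - 2L) / n."""
    return (2 * k - 2 * log_lik) / n

n_obs = 100  # hypothetical number of observations

# Hypothetical candidates: (number of parameters k, log likelihood L).
candidates = {
    "without_extra_term": (5, -520.0),
    "with_extra_term": (6, -519.8),
}

scores = {name: aic_score(k, ll, n_obs) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy case the extra term improves the log likelihood too little to offset the penalty for the additional parameter, so the score rises and the smaller model is retained.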

The 95% confidence interval of the fitted values for Zn was obtained by bootstrapping. Additionally, the estimated degrees of freedom (EDF) were used to determine whether the selected factors were nonlinearly associated with the response variable. To obtain a reliable and stable model, a cross-validation method was used to verify the model. We randomly selected 70% of the sampling data for modeling, and the remaining 30% was used as the test set.
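The 70%/30% hold-out split can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the random seed, the function names and the use of root-mean-square error (RMSE) to compare predicted and measured values are our choices for the sketch and are not specified in the text.

```python
import math
import random

def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it into 70% training
    and 30% test subsets. The seed is an assumption for reproducibility."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def rmse(observed, predicted):
    """Root-mean-square error between measured and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Sample indices standing in for the 1,497 soil samples.
sample_ids = list(range(1497))
train_ids, test_ids = split_70_30(sample_ids)
print(len(train_ids), len(test_ids))
```

The model would be fitted on the training subset only, and its predictions for the test subset would then be compared with the measured Zn values.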

Conclusions

Using the GAM, we analyzed the relationship between Zn content and latitude, longitude, OM, AHN, AK, AP and SAK in Fangshan District, Beijing. Based on our analysis, we find that Zn content in soil is significantly affected by latitude, longitude, OM, AHN, AK and AP, and by the interactions of OM, AP, longitude and AK with latitude. Thus, by fitting a GAM, the influence of interactions between factors affecting Zn content in soil can be quantitatively predicted and analyzed. In addition, to gain a greater understanding of the influencing factors on Zn content in soil, other influencing factors (e.g. pH) need to be included in the GAM.