Analysis of influencing factors on soil Zn content using generalized additive model

Soil zinc (Zn) plays a crucial role in plant growth, but excessive accumulation in the environment may lead to air, water and soil pollution. It is affected by various chemical, environmental and spatial factors. Therefore, it is important to identify the factors influencing Zn content in the landscape. The main motivation for this study is to determine the suitability of a generalized additive model (GAM) to describe change in soil Zn content due to influencing factors. A total of 1497 soil nutrient samples were collected in Fangshan District, Beijing, China. Organic matter (OM), available phosphorus (AP), available potassium (AK), alkali-hydrolyzed nitrogen (AHN) and slowly available potassium (SAK) are considered. The relationship between Zn, nutrients and geographic location (latitude & longitude) is investigated using the GAM. More precisely, the Akaike information criterion (AIC) is used to select influencing factors on Zn content and cross-validated to avoid overfitting of the multivariate model. The results show that Zn content reaches its maximum at latitude 39.8°N and longitude 115.9°E. Zinc content increases as AP increases to 150 mg/kg. When OM content is greater than 90 g/kg, Zinc content decreases with an increase in OM content. Factors that affected Zn content, in descending order of significance derived from deviance explained and adjustment coefficient of determination (Adj.R2) were AP, latitude, AHN, AK and OM. Moreover, the interactions between latitude and longitude, AHN and AP, OM and AK have significant impact on Zn.

Univariate analysis of influences on Zn. The regression model with cubic splines is used to analyze the influences of each individual explanatory variable on Zn and corresponding fitting degree of the model ( Table 2). The results show that all the seven explanatory variables have passed the significance test at the P-value < 0.01 level, suggesting that each individual variable is statistically significant for the influence of Zn, with a low deviance explained. The deviance explained of AP and longitude are higher with the values of 20.2% and 16.5%, respectively. The corresponding adjustment coefficient of determination (Adj.R 2 ) which increases with the increase in the number of independent variables are 0.16 and 0.2 for AP and longitude, respectively. The precision of model derived from each individual explanatory variable is low. Consequently, multiple variable interactions are considered for investigating their influences on Zn.
Multivariate analysis of influences on Zn. The variables are gradually added to the GAM, and the tests are carried out using the Akaike information criterion (AIC) score (Table 3). It can be found that the AIC scores are generally reduced with the gradual increase of variables. Conversely, when SAK is added, the score increases by about 0.6, and the P-value is 0.031. SAK does not pass the significance test at the P-value < 0.01 level, which  Cross-validation of the refitted multivariate GAM. To avoid overfitting, cross-validation was used to test the refitted multivariate GAM. The difference between the predicted value and the measured value was small, and the six variables passed the significance test at the P-value < 0.01 level ( Table 6). The optimal model can reasonably reflect the influencing factors on Zn.
Interactions of multivariate factors on Zn. The model deviance explained derived from GAM was 72.1%, with the Adj.R 2 of 0.63. The estimated degree of freedom of the longitude-HN interaction was 1 ( Table 7). The F-value for the longitude-latitude interaction, the HN-AP interaction and the OM-AK interaction are 19.857, 4.678 and 4.433, respectively. These interactions passed the significance test at the P-value < 0.01 level. Similarly, the latitude-AP interaction, the latitude-OM interaction and the latitude-AK interaction passed the significance test at the P-value < 0.05 level. The interactions that passed the significance test (P-value < 0.01) demonstrates the impact of interactions on Zn (Table 7 and Fig. 2). Figure 2(a) shows the influence of interaction between latitude and longitude on Zn. When latitude is less than 39.6°N, Zn decreases rapidly with the increase of longitude until it reaches at about 115.8°E. Zn reaches its local maximum at 115.8°E, 39.7°N, and then there is little increase with the increase of latitude and longitude. The influence of interaction between AHN and AP on Zn can be observed in Fig. 2 When AP content is less than 50 mg/kg, Zn varies little with the increase of AHN content. Above 50 mg/kg, AHN increases until AP reaches approximately 200 mg/kg until AP content reaches about 200 mg/kg. When AHN content is less than 50 mg/kg, Zn decreases with the increase of AP content. The influence of the interaction between OM and AK on Zn can be observed in Fig. 2(c). When OM content is less than 100 g/kg, Zn increase with a rise in AK. When both OM content and AK content increase, Zn increases. Figure 2(d) reveals the influence of the interaction between latitude and AP on Zn. Zn does not change significantly with the increase of AP content when the latitude is greater than 39.7°N. When the latitude is less than 39.6°N, Zn increases rapidly with the increase of AP content. Figure 2(e) shows the influence of the interaction between latitude and OM on Zn. When OM content approaches 400 g/kg, Zn decreases rapidly as latitude increases. Figure 2(f) shows the influence of the interaction between latitude and AK on Zn. The Zn content decrease with an increase in latitude when AK remains unchanged. Zinc reaches a minimum at latitude 39.8°N when AK is less than 200 mg/kg.

Discussion
Influencing factors on Zn. Latitude and longitude have significant influence on soil Zn content (P-value < 0.01). Conversely, Richardson et al. 40 have shown that there is no correlation between site location and Zn content. The difference in results from ref. 40 may be due to the specific geographic location and land use change in the study area. The Fangshan district is a mountainous region with manufacturing and agriculture as the prime land uses. Zinc content in soil in urban and industrial areas may be an order of magnitude greater than that in rural areas 41 . For example, Zn reaches its maximum at 115.8°E, 39.7°N (Fig. 2).    Table 5. Hypothesis test of the refitted GAM. ***Indicates P-value < 0.01 level.    The increase of Zn due to an increase in OM and content is consistent with OM and heavy metals coexisitng in soil sediment, with OM been found to have important implications on heavy metal speciation, transport and bioavailability 42,43 . In addition, Zn content is also affected by other nutrient elements. For example, increase in flax yields in response to Zn application are most likely to occur where P fertilizer is broadcast at relatively high levels or on soils with a history of heavy P application 44 . Similarly, Zn increased as AP content increased in this study (Fig. 1).

Modeling the Zn content.
To explore the variation of Zn content in soil, the linearity of the influencing factors on Zinc were examined. On analysis of the EDFs of the smoothing functions from the univariate? GAM, it was identified that Zn content is affected by complex nonlinear influences. The univariate GAMs of Zn content in soil are able to estimate values for the significant influencing factors of latitude, longitude, OM, AHN, SAK, AP, AK. These factors are considered additive and hence a multivariate GAM was fitted, improving the goodness of fit over the univariate model. Nevertheless, SAK does not pass the significant test (P-value > 0.01) for the multivariate GAM but it does pass for the univariate GAM. This suggests there is a concave relationship between S(SAK) and S(AK).
Moreover, there is spatial correlation between AK and SAK in the study area. SAK refers to the potassium that exists between layers of layered silicate minerals and grain edges and cannot be reached by neutral salts in a short time. Conversely, AK can be quickly absorbed and utilized by plants. Zhang et al. 45 have revealed that AK is affected more than other potassium forms and can be more sensitive in directly reflecting the productivity than SAK. On removal of SAK the goodness of fit of the multivariate GAM improved and identified that latitude, longitude, OM, AHN, AP and AK have significant influences on the Zn content in soil. Zinc content in soils is primarily affected by the interactions between latitude and OM, AP, AK (Fig. 2). The modelling suggests Zn content in soil is affected more so by the vertical direction (latitude) than the horizontal direction (longitude) in the study region. This could be due to location of manufacturing industries or natural landforms and soil types. In our study, the GAM derived from the pairwise interaction with the influencing factors can be used to analyze the influence characteristics of Zn content. Zn content is affected by multiple factors, and the interactive GAM can be constructed using three or more of these factors to analyse influences on Zn content in soil.

Materials and Methods
Description of the study area. Fangshan District is located between longitudes 115.4°-116.3°E and latitudes 39.5°-39.9°N in Beijing, China. It is situated to the east of the Taihang Mountains. The south-eastern region of the district is on a plain, with hill country intersecting the district from the northeast. It is in a warm temperate semi-humid monsoonal climatic zone.
Collection of soil samples. The soil samples were primarily collected in five typical agricultural croplands including vegetable land, irrigated land, irrigated paddy field, dry land and orchards. A total of 1,497 soil samples is collected in the study area (Fig. 3). Representative soils samples were collected from random points in the croplands to a depth of 20 cm. The hybrid samples were acquired by five points and then the samples were crushed and fully mixed. Two diagonal lines were used to divide the samples into four parts. Any two parts of the diagonal angles were reserved as the final samples. A portable sub-meter GPS receiver was used to accurately acquire latitude and longitude of the sample points. Atomic Absorption Spectrometry (TAS-990, Xian Yima Optolec Co Ltd) was used to analyze the soil samples for nutrients and heavy metals. Specifically, samples were analyzed for organic matter (OM) (g/kg), alkali-hydrolyzed nitrogen (AHN) (mg/kg), available phosphorus (AP) (mg/kg), slowly available potassium (SAK) (mg/kg) and available potassium (AK) (mg/kg). Heavy metals analyzed were Zn, Fe, Cu, Mn, B and S.

Generalized additive model.
It is a regression model that can define the relationships between the response variable and each explanatory variable through smooth functions 18,31 . GAM, using an identity link function with Gaussian error distribution, is used to determine the effects of various factors on soil Zn. The generalized additive model considering interactions of two factors can be given in a general form: . The smooth functions with cubic regression splines were used in our work. Cubic regression splines were constructed with piecewise cubic polynomials joined together at points called knots. The definition of cubic smoothing spline basis arises from the solution of the following optimization problem. Among all the functions f, with two continuous derivatives, find one that minimize the penalized residual sum of squares.
adds a penalty for the curvature of the function, and the smoothing parameter controls the degree of penalty given for the curvature in the function. In our study, the position of the knots will be evenly spaced along the dimension of each explanatory variable.

Statistical analysis.
All statistical analysis in this study was undertaken in a free software environment for statistical computing and graphics (R version 3.1.2) 46 . A Shapiro-Wilk test was employed to check the normality of Zn. Correlation coefficient (R) was used to check the correlation between variables. In general, when there is a definite collinearity relationship between the influencing factors in the model, the concurvity relationship must exist between these factors. The existence of concurvity in GAM would not only increase the variance of coefficients but also enlarge the standard deviation of coefficients. It can cause the narrowing of confidence interval. Hence, it is necessary to test whether model has concurvity. The concurvity test has three indicators: worst, observed and estimate (Table 4). Generally, the three indicators ranging from 0 to 1 can be used to judge whether there is a concurvity. A value of 0 means no concurvity. As the test value approaches 1, the more obvious concurvity is. Validation of the model. A forward stepwise procedure was used to choose the most appropriate model removing each explanatory variable from the model, and then evaluating the AIC score. The smaller the score, the better the model fits. The AIC score is calculated as follows: where k is the number of parameters in the model; L is the log likelihood; and n is the number of observations. The 95% confidence interval of the fitted values for Zn was obtained from bootstrapping. Additionally, the estimated degree of freedom (EDF) was used to determine whether the selected factors were nonlinearly associated with the response variables. In order to get a reliable and stable model, a cross-validation method was used to verify the model. We randomly selected 70% of the sampling data for modeling, and the remaining 30% was used as the test set.

Conclusions
Using the GAM, we analyzed the relationship of Zn content between latitude, longitude, OM, AHN, AK, AP and SAK in Fangshan District, Beijing. Based on our analysis, we find that Zn content in soil is significantly affected by latitude, longitude, OM, AHN, AK, AP and interactions of OM, AP, longitude, AK with latitude. Thus, by fitting a GAM, the influence of interactions between factors affecting Zn content in soil can be quantitatively predicted and analyzed. In addition, to gain a greater understanding on influencing factors on Zn content in soil, other influencing factors (e.g. pH) need to be included in the GAM.