Spatial regression and geostatistics discourse with empirical application to precipitation data in Nigeria

In this study, we propose a robust approach to handling geo-referenced data and discuss its statistical analysis. The linear regression model has been found inappropriate in this type of study. This motivates us to redefine its error structure to incorporate the spatial components inherent in the data into the model. Therefore, four spatial models emanated from the re-definition of the error structure. We fitted the spatial and the non-spatial linear model to the precipitation data and compared their results. All the spatial models outperformed the non-spatial model. The Spatial Autoregressive with additional autoregressive error structure (SARAR) model is the most adequate among the spatial models. Furthermore, we identified the hot and cold spot locations of precipitation and their spatial distribution in the study area.


Spatial data concept and model formulation
The conventional non-spatial sample of n independent observations y i , i = 1, . . . , n that is linearly related to matrix X is known to have a data generating process (DGP) of the form: This specification indicates that each observation has an underlying mean of X i β and a random component µ i . From the classical point of view, for a situation where i represent regions or points in space the observed values at one location (or region) are independent. Alternatively, statistically, independent observations imply that E u i u j = E(u i )E u j = 0 . The assumption of independence greatly simplifies models but in spatial contexts, this simplification seems unattainable.
(1) www.nature.com/scientificreports/ Conversely, "spatial dependence reflects a situation where values observed at one location or region depend on the values of neighbouring observations at nearby locations". In this case, if observations i = 1 and j = 2 represent neighbours (perhaps regions with borders that touch), then there will be a situation which suggests a simultaneous data generating process of the form: This assertion emanates from the fact that the value assumed by y i depends on that of y j and vice versa. The very notion of spatial dependence indicates the need to ascertain which other units in the spatial system have an impact on the particular unit under concern. Properly, this is conveyed in the topological notions of a neighbourhood. This quantification of the locational aspect of our sample data can be done in several ways. The contiguity based neighbourhood such as rook (common side), bishop (common vertex) or queen (common side or vertex) is a common form of representation. Figure 1 illustrates the definition of the various contiguity-based neighbourhood between sites s i ands j while Table 1 gives a queen contiguity relation among the five regions.
From Table 1, region 2 is a neighbour to region 1 and by the symmetric property, region 1 must be a neighbour to region 2. Similarly, regions: 1 and 3; 2, 4 and 5; 3 and 5; and 3 and 4 are neighbours to region 2, 3, 4 and 5, respectively. These give rise to the spatial weight matrix, W which reflects the first-order contiguity relation among the five regions. The W is expressed as: The W is symmetric, and it has zeros on the main diagonal. This is done to prevent a unit from being a neighbour to itself. The spatial weights matrix is row-standardized to have row-sums of unity and produce a spatially weighted average term Wy of the dependent variable in the spatial lag model. Consequently, the spatial parameter associated with Wy has an instinctive interpretation of spatial autocorrelation coefficient; and also accelerates the maximum likelihood (ML) estimation of spatial models. Consequently, row-standardization has become a meeting in practice without further investigation. However, it may not be appropriate in some situations. Hence, the standardized W is given as:

Model formulation
If the expression in Eq. (1) is restated in a matrix form and the error structure takes the form u = ρWy + ε or u = Wu + ε , the two resulting models are called spatial lag and spatial error models. Mathematically, they are given as: From (4) the implied DGP is given as: The model statement in (6) can be "interpreted as indicating that the expected value of each observation y i will depend on the mean value Xβ plus a linear combination of values taken by neighbouring observations scaled by the dependence parameter,ρ 28 . The infinite series expansion of (I n − ρW) −1 is given in (7) according to 28,29 .
Hence, the re-expression of SAR DGP for vector y shown in Eq. (6) follows thus: The ideal expressed in (8) is that rows of the weight matrix W are constructed to signify first-order contiguous neighbours. Equally, matrix W 2 reflects second-order contiguous neighbours, that is, those that are neighbours to the first-order neighbours. This connotes neighbour of the neighbour to an observation i includes observation itself. Hence, W 2 has positive elements on the diagonal. "The implication of this is that higher-order spatial lags can lead to a connectivity relation for an observation i such that W 2 ε will extract observations from the vector ε that point back to itself ". This is in stark contrast with the conventional independence relation in ordinary least-squares regression where the Gauss-Markov assumption rules out dependence observation of ε i on other observations j , by assuming zero covariances between i and j in the data generating process 30 .
The DGP for spatial error model shown in (5) where the disturbances exhibit spatial dependence is given as From the foregoing, it is clear that in the spatial lag model, the spatially lagged dependent variable captures the spatial dependence between the cross-sectional units whereas in the spatial error model, the spatial autocorrelation term captures the spatial dependence 31 . Posited two economic arguments in support of SEM over SAR and Spatial Durbin models. Firstly, they argued that the SEM model constitutes a fuller representation of the spatial dependence than SAR and spatial Durbin model (an extension of the SAR model in which the lag effect of the dependent and independents variables are included in the model specification). This is because with the SEM model the spatial dependence can be influenced by other considerations in addition to shocks to the spatially lagged dependent variable. Secondly, they considered a situation where the total demand is disaggregated into two categories 1 and 2, a Wald test of the whole set of coefficients from the model for category 1 against 2, the set of coefficients from the model for category 2-which is necessary to establish if there is more to be learnt from disaggregating the data can be performed with easy on spatial error model. This is because the set of explanatory variables will be the same for a pair of SEM models. However, such a test cannot be performed on a pair of SAR models or a pair of spatial Durbin model because the spatially lagged dependent variables will differ in the two models. Though an exhaustive discussion of the spatial Durbin model will not be considered in this study, yet it must be remarked that this kind of model was developed with motivation to account for spatial dependence in the independent variable. This rationale stems from the idea that dependence in spatial relationships does not only occur in the dependent variable but also in the explanatory variables. www.nature.com/scientificreports/ Another representative of the family of spatial regression models that is of interest in this study is the one that includes both endogenous interaction impacts and interaction effects among the error terms. Based on 31 and related works, this model type was advocated for in the World Conference of the Spatial Econometrics Association held in 2017 32 . Labelled this model the Kelejian-Prucha model after their article in 1998 since they were the first to set out an estimation method for this model, also when the spatial weights matrix used to specify the spatial lag and the spatial error structure is the same. Whereas it was named Spatial Autoregressive with additional Autoregressive error structure (SARAR) or Cliff-Ord type spatial model by 31 themselves 33 . Termed the model "Spatial Autoregressive Confused" (SAC) 17 . The specification takes the form: The DGP of the model is of the form: At first glance, the specification appears to represent a mixture of both spatial dependences in the dependent variable and the disturbances represented by W 1 y and W 2 u, respectively. A more formal examination of specification produce from a mixture of spatial dependence in the dependent variable and the disturbances is provided by 34 .

Result and discussion
Variables' description and data screening for spatial autocorrelation. Most statistical procedures and inferences usually work well on the assumption of normality of the data. The data used for the study were explored for this essential criterion. The statistical properties of the variables are presented in Tables 2 and it showed that the spread of the variables from their central level is substantial and hence the high level of coefficient of variation. However, northing is less dispersed (CV = 32%) when compared with precipitation, easting and elevation. There is a moderate level of skewness in the variable except for elevation. Due to the instability and skewness tendencies, all the variables were log-transformed, and this enhanced their statistical properties. For instance, the skewness decreased for all the variables while the leptokurtic and platykurtic nature of the variables became smoothed to realize approximately normally distributed variables. The low-level interrelationship among the independent variables depicted by the correlation matrix is a signal that the selected variables passed the Gauss Markov assumption of absence of multicollinearity. Also worthy of note is the relationship of the selected independent variables with precipitation. Negative correlation was established between precipation and northing which indicated that precipitation decreased from south towards the north. However, positive correlation existed between easting and northing and this indicated that precipitation increase from west towards the east 35 . www.nature.com/scientificreports/ The dependent variable was first diagnosed for spatial autocorrelation using Moran's I scatter plot (Fig. 2) 36 . The data in the plot are standardized so that units on the graph are conveyed in standard deviations from the mean 37 . The horizontal axis demonstrates the standardized value of precipitation for a county, the vertical axis shows the standardized value of the average precipitation (WAn P re), for that county's neighbours as defined by the order one queen weights matrix. The slope of the regression line through these points expresses the global Moran's I and this is estimated to be 0.8288 with a p value of 0.0010 in this study 37,38 .
Similarly, a global measure of spatial autocorrelation was computed for the variables using both Moran's and Greary's C. As shown in Table 3, all the variables have positive and significant spatial autocorrelation, and this implies that similar values of each variable occur near their contiguous locations. The upper right quadrant of the Moran's I scatter plot showed those counties with above-average precipitation and share above-average precipitation with neighbouring counties (high-high) 39,40 . These are regarded as the hot spot locations while the lower left quadrant which shows counties with below-average precipitation values and neighbours also with below-average values (low-low) is the cold spot locations 39,40 . The lower right quadrant displays counties with above-average precipitation surrounded by counties with below-average values (high-low), and the upper left quadrant contains the reverse (low-high) 41,42 . They are called spatial outliers. Figure 3 (top, bottom) is a Local Indicator of Spatial Autocorrelation (LISA) and significant maps, respectively. These maps shed light on the clustering suspected in the Moran's I scatter plot. The red colour in this figure (top) depicted the hot spot location of precipitation and there 82 of such locations in our data predominantly in the Southern regions of the country. In the same vein, the blue colour represented the cold spot i.e., clusters with low levels of precipitation and there are 89 points with such attribute in the dataset. Additional information provided by the significant map (Fig. 3, bottom) indicated that 63 (19.3%), 51 (15.6%) and 59 (18.10%) of the sample observation showed statistically significant local spatial autocorrelation at 5%, 1% and 0.1% level of significance, respectively. Spatial variability and continuity. The knowledge of the spatial clustering in the data from the previous subsection necessitated further exploration with geostatistical tools. The 3-D surface map presented in Fig. 4 unambiguously described the interrelationship of precipitation in the study area with other geographic variables. it was noted from the map that precipitation values increase with decreases in the height above the sea level (elevation) and latitudinal values whereas longitudinal values have an irregular pattern as one move from west to east. This result implies that locations with high latitudes tend to experience low precipitation while those in the low latitudes have high precipitation. Equally, the high altitudes locations have high precipitation as against those in the low altitudes. From geography perspectives, Latitudes is simply a measure of how far one is from the equator while elevation measure how high one is above the sea level, so the locations that are far from the equator  www.nature.com/scientificreports/ or have high elevation are prone to cold weather or climate compare with those in the low latitudes. This explains the uneven distribution of precipitation in the northern and southern part of Nigeria. The locations in the core north of the country are at high latitudes and altitudes and by implication far from the equator. This result is of great relevance in agriculture and water resource management. Further Gaussian model experimental variogram was fitted to the precipitation data and the basic spatial parameters were calculated using R statistical software (Fig. 5). The calculated parameters were a nugget, range and sill. The nugget is the value at which the model intercept y-axis and it can be interpreted as the variance at zero distance between a unit and its neighbour, the range is the distance where the model first flattens, and this can be interpreted as the distance where the value of one variable becomes spatially independent 43 while the value at which the model attains the range is called the sill and it is interpreted at the lag distance between the measurements at which one value for a variable does not influence neighbouring values (discontinuity). The nugget, range, and the sill for the variable under investigation were found to be 50, 0.3 and 1200, respectively. The ratio of the www.nature.com/scientificreports/ nugget variance to the sill is the spatial coefficient parameter 43,44 and for this study, it was estimated to be 4.2%. Following the classification of 10,45 , this value indicated strong spatial autocorrelation in the precipitation series. A diagnostic check was conducted for spatial dependence in OLS regression after it has been confirmed that spatial clustering is present in the dataset. The aim here is to unearth the type of spatial effect present in the dataset and model as appropriate. To achieve this, the Lagrange Multiplier specification test for spatial lag and error (LM LAG and LM ERROR ) was conducted on the residuals extracted from the fitted OLS regression 42,46 . If neither the LM LAG nor LM ERROR statistics rejects the null hypothesis then OLS is appropriate. If one of the LM statistics rejects the null hypothesis, but the other does not then the decision is straightforward. The alternative spatial regression model that matches the test statistic that rejects the null hypothesis 47,48 . When there is conflict, that is when both LM LAG and LM ERROR statistics rejects the null hypothesis, to select an adequate model, focus shift to    [40][41][42]42,48 . Typically, only one of them will be significant (for example LM ERROR as in Table 4), or one will be more significant than the other. It is important to note that when both robust forms are significant, a model matching the (most) significant statistics is estimated. When both are highly significant, the model with the larger value of test statistics is considered appropriate however there may be other causes of misspecification 40 . The LM-SARMA will tend to be significant when neither of them is appropriate 23 . From the foregoing illustration, the SEM model was appropriate if a choice was to be made between it and the SAR counterpart. Observe that both the LM LAG and LM ERROR were significant but the robust form of the spatial lag model was insignificant 40 .
To enhance further comparison, five models were estimated of which the first is non-spatial while the rest four take spatial specification form. The models included traditional linear regression, spatial lag (SAR), spatial error (SEM), spatial Durbin and SARAR. The spatial models were estimated by maximizing the corresponding likelihood while the non-spatial model was estimated by the OLS method. Table 5 reported the summary of the result from the five models. Columns 1, 2, 3, 4 and 5 indicated the estimates obtained from OLS, SAR, SEM, Durbin and SARAR models, respectively. A close look at the result showed that the OLS, SAR and SEM models estimates were alike in term of the sign and significance but differs in term of the sign when compared with Durbin and SARAR estimates. However, the OLS was characterized with either under or overestimation of the coefficient. For instance, OLS over-estimated the coefficient for easting by 42.9% compared with the SAR model while it is over-estimating this coefficient by 8.6%, 13.6% and 9.5%, respectively when compared with "SEM", DURBIN and "SARAR" models. This inflation or deflation of the coefficient was not surprising because of the significance of the spatial dependence in the dataset. This result also buttresses the position of 49-51 that nonspatial OLS is devastating and to be avoided unless interdependence is known to be very weak or nonexistent.
The spatial effect parameters ( and ρ ) in the spatial models are found to be highly significant. The SAR, SEM, and DURBIN have the spatial effect of ρ = 0:726, p < 0.01, λ = 0:779, p < 0.01 and ρ = 0:777, p < 0.01, respectively. The SAR coefficient indicates that the association between the dependent variable and its contiguous counties, the SEM coefficient, show the association of the error term with the neighbouring observation while the DURBIN coefficient gives the idea on the level of dependence of the spatial lag of the independent variables. In the case of SARAR, ρ is negative and significant while λ is positive and significant ( ρ = 0:496, p < 0.01;λ = 0:919, p < 0.01).  Table 5. Predictors of annual precipitation and selection criteria estimates. * p < 0.10, ** p < 0.05, *** p < 0.01. www.nature.com/scientificreports/ The spatial effect coefficients of the SARAR model gives a level of spatial association in the dependent variable and its neighbours as well as the error and its connected regions. The model selection criteria statistics are presented in the last panel of Table 5. Using the two criteria, the OLS value is highest among other models. This indicates that OLS performs poorly in the presence of spatial clustering and that the spatial model will produce a robust estimate of the parameter. This finding is in agreement with the earlier report by 3 that spatial models are superior to nonspatial OLS when a spatial effect is detected in the model.

Independent variables
Overall, the SARAR model produced a better fit for the regression relation because its model selection criteria values were smallest compared with other spatial models. Based on the selected model, only the easting significantly explained precipitation. However, it was noted that easting and elevation exerted a positive impact on precipitation while northing impact was negative. It implies that precipitation increase with a corresponding increase in easting and elevation whereas it depreciates as northing appreciate. The positive effect of easting on precipitation indicated there would an increase in precipitation value for any unit movement from west towards the east, while the negative effect of northing (though not significant) depicted a decrease in precipitation for any unit movement from south towards the north.

Conclusion
This study discussed the rationale for an alternate technique to the conventional regression proposition of independence of observation and applied the approach to building a regression relationship between three predictors (Easting, Northing and Elevation) and precipitation. Exploratory data analysis tools were used to detect spatial autocorrelation, hot spot and a cold spot of precipitation in the study area. The results agreed with previous studies on the superiority of spatial models over OLS. On the premise that spatial models achieved significant improvement over their traditional counterpart, it indicated that spatial models were not only the correct specification but also a more effective approach. However, this study added that a spatial model that simultaneously accounted for spatial effect in the dependent variable and error term provides a better fit compared with the SAR and SEM used in the earlier studies of 4,52 for precipitation modelling.
The spatial modelling approach discussed is quite rich and provided the basis for choosing a particular regression specification, unlike the orthodox framework where the model is imposed on the data without investigating what the data reveal about itself and how it should be modelled. Data exploration is very important, and it is one of the key ways of avoiding misspecification and misleading result.

Materials and methods
Study area. The study was conducted in Nigeria, a country in the Sub-Saharan region. The country is situated in West Africa and bordered in the North and Northeast by the Niger Republic and the Republic of Chad, respectively. Also, it shared a boundary with the Republic of Cameroon and the Republic of Benin in the East and West, respectively. To the South, Nigeria is bordered by approximately 850 kms of the Atlantic Ocean, stretching from Badagry in the West to Rio del Rey in the East. It lies within latitudes of (4 14 N) and longitudes of (3 13 E) with a total land area of 923,768 square kilometres (Fig. 6, top) 53 . Nigeria has two distinct seasons: dry and wet. These seasons are based on the proximity of each region or location in the nation to the Intertropical Convergence Zone (ITCZ). The dry season is between October to March while the wet season is between April to September annually with June and July often the wettest (Fig. 6, bottom) (online resources: https:// www. brita nnica. com/ place/ Niger ia/ Clima te).
Data and methods. The data was sourced from the Nigeria Malaria Indicator Survey (MIS) of 2015. The suitability of MIS for the study was based on its national representativeness and provision of geo-referenced information required for spatial modelling. The geographical covariates were provided in a shapefile format and consist of climatic variables for 329 clusters. The precipitation data for the year 2015 for all the 329 clusters and their respective coordinates were extracted but only 326 observations were suitable for analysis after removing inconsistent cases 3,4,35,52 . Modelled precipitation as a function of easting, northing and elevation and this specification is adopted in this study due to limited data. The easting and northing variables for each cluster were obtained by transforming the latitudes and longitudes of these 326 locations to Standard Universal Transverse Mercator (UTM) using "PAleontological STatistics" (PAST) software. By definition, the northing value is the distance of the position from the equator in meters while the easting value is the distance from the central meridian (longitude) of the used UTM zone (the study area has three UTM zones, namely, 31, 32 and 33). Before model estimation, the variables were diagnosed for spatial variability and clustering. Various exploratory tools were used to describe and visualize spatial distributions; identify uneven locations or spatial outliers; discover pattern of association, cold or hot spots 39,[41][42][43]54,55 . Firstly, a 3-D surface contour map was used to examine the spatial variability of precipitation along the lines of longitude and latitude as well as its behaviour relative to a height above sea level. Secondly, the Variogram plot was used to study the precipitation data for a possible tendency of spatial dependence and discontinuity.
The spatial weighting matrix was created by employing GeoDa software using the queen definition of neighbour discussed in Section "Spatial data concept and model formulation" and formatted as "spmat" object and imported to STATA software for further exploration of the data. Basic information about the spatial weighting matrix is presented in Table 6. The number of neighbours among the clusters range between 2 and 12 links with each county having 6 neighbours on average and a total of 1920. The 3-D surface contour map, Variogram plot, weighting matrix creation, and regression modelling were carried out using Surfer, R, GeoDA and STATA statistical software packages, respectively. Each software was chosen based on the ease of undertaken assigned task and the time of execution.   www.nature.com/scientificreports/ The weighting matrix created by "spmat" command was used to produce a cluster map for precipitation and thereafter was converted to a text file by using the "export" command in STATA. The resulting text file was saved as "dta" file and imported as a row standardized weighting matrix of "spatwmat" object. This "spatwmat" weighting matrix format was used for the global and local indicator of spatial autocorrelation as well as to produce Moran's I scatter plot. (12) y = pWy + Xβ + WXθ + u u = Wu + ε | | < 1, |ρ| < 1, |θ| < 1