An enhanced dual IDW method for high-quality geospatial interpolation

Many geoscience problems involve predicting attributes of interest at un-sampled locations. Inverse distance weighting (IDW) is a standard solution to such problems. However, IDW is generally not able to produce favorable results in the presence of clustered data, which is commonly used in the geospatial data process. To address this concern, this paper presents a novel interpolation approach (DIDW) that integrates data-to-data correlation with the conventional IDW and reformulates it within the geostatistical framework considering locally varying exponents. Traditional IDW, DIDW, and ordinary kriging are employed to evaluate the interpolation performance of the proposed method. This evaluation is based on a case study using the public Walker Lake dataset, and the associated interpolations are performed in various contexts, such as different sample data sizes and variogram parameters. The results demonstrate that DIDW with locally varying exponents stably produces more accurate and reliable estimates than the conventional IDW and DIDW. Besides, it yields more robust estimates than ordinary kriging in the face of varying variogram parameters. Thus, the proposed method can be applied as a preferred spatial interpolation method for most applications regarding its stability and accuracy.


Methods
In this section, traditional DIDW-GG is first introduced. Its improved versions, DIDW with two locally varying exponents (DIDW-LL) and the simplified DIDW-LL (SDIDW-LL), are proposed and elaborated in detail. A brief introduction to OK is illustrated in Supplementary Method online.
DIDW-GG. Let x i (i = 1, 2, . . . , N) be a coordinate point in q q ≥ 1 dimensional space and z(x i ) be the sampled (observed) value of a variable z at this location. For an unsampled point x 0 to be estimated, the widely used linear regression estimator ẑ(x 0 ) is defined as 1,30 : with where i (x 0 ) is the estimation weight assigned to the i-th measured value z(x i ) , and n(x 0 ) represents the number of data closest to the estimated location x 0 . For DIDW-GG, its estimation weight is calculated by 29 : where d 0i is the D-U distance from the i-th data to the estimated location x 0 ; d ij represents the D-D distance between the i-th and j-th sample locations; and p 1 ( p 1 ≥ 0 ) and p 2 ( p 2 ≥ 0 ) are the corresponding D-U and D-D exponents to adjust the contributions of d 0i and d ij to the estimation, respectively. Note that in the case of p 2 = 0 , DIDW-GG degrades into the traditional IDW-G, of which the estimation weight is: It is also notable that both D-U and D-D exponents in Eq. (3) are global constants across the study region. This feature may limit DIDW-GG to produce high-quality estimates, especially when the spatial phenomenon under study is involved and the sampling data is irregularly distributed.

DIDW-LL.
Aiming to integrate locally varying exponents in the estimation, each DIDW-GG exponent in Eq. (3) is interpreted as a function of the location being estimated. As a result of this interpretation, the DIDW-LL weight is calculated as follows: where p 1 (x 0 ) and p 2 (x 0 ) are the local exponents that can be applied to adjust the contributions of d 0i and d ij , respectively.
To a large extent, the two locally varying exponents in Eq. (5) entail the flexibility and suitability of the improved DIDW. For an estimated point surrounded by a set of highly clustered local samples, a large D-D exponent (i.e., p 2 (x 0 ) ) should be adopted to produce significant declustering weights. Conversely, if this point is close to a group of regularly distributed samples, a relatively small D-D exponent is preferred to avoid such a strong declustering effect.
Similarly, in the case of p 2 (x 0 ) = 0 , DIDW-LL in Eq. (5) degrades into the traditional IDW-L 23 , of which the estimation weight can be expressed as: Besides, if p 1 (x 0 ) and p 2 (x 0 ) were constant for every estimated location, Eqs. (5) and (3) would be equal; in other words, DIDW-LL degrades into DIDW-GG in this situation. www.nature.com/scientificreports/ SDIDW-LL. As compared with IDW-L, the flexibility of DIDW-LL is at the cost of complexity. Thus, the estimation weights in Eq. (5) are simplified by assuming that p 1 (x 0 ) equals p 2 (x 0 ) , resulting in the SDIDW-LL estimation weights: where p 1 (x 0 ) is the local exponent to simultaneously adjust the influences of d 0i and d ij to the estimation.
Determination of locally varying exponents. Suppose p is a vector consisting of DIDW-LL exponents to be optimized (e.g., p = p 1 (x 0 ), p 2 (x 0 ) T ), and O L p is the objective function to evaluate the suitability of these parameters. Then, the corresponding optimization of the local exponents is: where D is the definition domain of the vector p , and D ⊂ R 2 .
The objective function could be implemented in terms of different assessment criteria, such as the typical error measurements (i.e., true error, absolute error, and so on), interpolation selection index 31 , estimation error variance 1,30,32 , and the intensity of neighboring data 28 . Among these measurements, the error variance is frequently employed in geostatistical methods 23,33 and considered in this research.
According to the statistical theory on random function model 1 , all of the data z(x i ) could be interpreted as a realization of the random variable (RV) Z(x i ) . Likewise, this interpretation of the unknown value z(x 0 ) and measured value z(x i ) as realizations of the RVs Z(x 0 ) and Z(x i ) allows one to define the estimation error as an RV, Ẑ (x 0 ) − Z(x 0 ) . Under the stationarity assumption, the estimation error variance can be calculated by 23,30 : where C(·) stands for the covariance function model used for the study area.
Note that i (x 0 ) and Ẑ (x 0 ) are expressed as i (x 0 ; p) and Ẑ (x 0 ; p) in Eq. (9), respectively. This expression is to explicitly indicate that the DIDW-LL estimate and weight are related to the parameter vector p . Based on Eqs. (8) and (9), the optimized exponents can be rewritten as: The parameter vector p in this optimization process is flexible to be specified. For example, it can contain only the D-D or D-U exponent, or both. In this research, three typical application scenarios are chosen as follows: 1) DIDW with locally varying D-U and D-D exponents (i. e., DIDW-LL). In this way, both D-D and D-U exponents are locally optimized in Eq. (10); 2) SDIDW with locally varying D-U and D-D exponents (i. e., SDIDW-LL). The two exponents are equal for SDIDW-LL, and thus only one element needs to be placed in the vector being optimized; 3) DIDW with a local D-U exponent and a global D-D exponent (i. e., DIDW-LG). In this situation, the local D-U exponent is optimized in Eq. (10), while the global D-D exponent can be determined by minimizing cross-validated estimation error.
Algorithm implementations. The pseudocodes of DIDW-LL and DIDW-LG are described in Algorithm 1 and 2, respectively. It is worth noting that it is necessary to search for an appropriate global D-D exponent based on cross-validation before DIDW-LG is performed.

Results
Experiment design. For the sake of consistency and comparability between this research and our previous work on DIDW-GG 29 , similar experiment data and calculation parameters to that work are adopted in this study.
Experiment data. The standard Walker Lake dataset 1,29 is employed in this research, which is derived from a digital elevation model (DEM) from the western United States, the Walker Lake area in Nevada. Following the interpolation applications in 1 , 470 irregularly spaced samples and 780 regularly distributed locations from this dataset are used as sampled and estimated data, respectively. The origin of the 780 regular points is 5E, 5 N (i.e., X = 5 m, Y = 5 m), and the spacing between points is 10 m in both the north-south and the east-west directions. The locations and the associated attribute values are shown in Fig. 2, along with the complete data in Supplementary Data online. An extensive description of the dataset can be found by 1 .

Experiment methods.
The conventional IDW-L and DIDW-GG are used as benchmarks to assess the interpolation performance of the proposed method. Also, since OK possesses the same optimization objective as DIDW-LL and IDW-L, it is applied as a reference to accomplish the performance evaluation.
Accordingly, there are six methods to be evaluated: DIDW-LL, SDIDW-LL, DIDW-LG, DIDW-GG, IDW-L, and OK. These methods are applied to estimate the 780 grid nodes using the 470 irregular sample points (Fig. 2); their estimates are then compared with the actual values to generate reliable estimation errors. To distinguish it from cross-validated interpolation, this process of interpolating the 780 grid nodes is referred to as "actual interpolation" in the following test.   LG, and IDW-L search for appropriate ones using Eq. (10); DIDW-GG finds its suitable exponents by a cross-validation-based optimization 29 . All local samples within 25 m are chosen to participate in the estimations. Besides, to observe the clustering feature of neighborhood samples, the available data are divided into quadrants, and the variance of the number of samples in the four quadrants could be used as an index of clustering 1 . Note that the reliability of these indices depends on the total number of conditioning data within each neighborhood (in Fig. 3a); an index resulting from a large number of local samples is more reliable than that with a small sample size. Therefore, the sub-region highlighted by the red ellipse in Fig. 3b is of higher reliability than other locations under study.

Experimental parameters.
To obtain the covariance coefficients in Eq. (10), N14°W is chosen as the direction of maximum continuity, and its variogram adopted is 1 : In the direction of minimum continuity (N76°E), the model is: The accompanying experimental and theoretical variograms in these two directions are shown in Fig. 4. First, IDW-L yields unreasonable sample weights with respect to data redundancy. For example, this approach does not recognize the relative importance of the samples indicated by the pentagons in Fig. 5. In contrast, DIDW-LL, DIDW-GG, and OK reasonably account for the underlying data redundancy in this sample configuration.

An illustration of DIDW-LL weights.
Besides, the resulting weights from DIDW-LL and OK are quite similar due to the same estimation objective, implying that DIDW-LL would approximate OK in terms of estimates and the associated error variances. This phenomenon for DIDW-LL is reasonable and expectable since kriging's underlying declustering mechanism is widely accepted 1,34 . On the other hand, DIDW-GG does not bear such a significant resemblance to OK, especially for the first data point (i.e., the sample with an ID of "1") in Fig. 6. It should be pointed out that, by tuning its D-D and D-U exponents, DIDW-GG could account for a specific data configuration satisfactorily. However, it may be difficult for DIDW-GG to search for very suitable D-D and D-U exponents simultaneously for multiple estimated points because its exponents are constant across the study area. Further analyses on the correlation between OK and DIDW-LL, DIDW-LG, IDW-L are illustrated in the following sections.
Moreover, note that the negative OK weights can be observed. Although these weights are valid and acceptable in theory, they would also lead to unrealistic estimates in some practical applications 35 . Noticeably, this issue will not arise in the developed methods as the basic idea of weight assignment of IDW is inherited by DIDW.
Consequently, DIDW-LL has favorable characteristics in the following aspects: (1) compared with IDW-L, it can recognize the clustered sample data more accurately; (2) relative to OK, it entails non-negative estimation weights; and (3) as compared with DIDW-GG, it has more opportunities to appropriately account for the sample configuration regarding every single estimated point.

DIDW-LL and SDIDW-LL estimations.
As stated above, all of the test estimators are applied to interpolate the 780 grid nodes (in Fig. 2). Figure 7a exhibits the D-D exponents resulting from the DIDW-LL estimation. As expected, they are overall in line with the clustering degree of local data represented in Fig. 3b, especially for the highlighted elliptical sub-area. Generally, the more strong clustering is observed, the larger D-D exponents will be.  Fig. 3a. The overall feature is that the estimated locations with a large number of conditioning data tend to be attached with a high D-U exponent; conversely, a relatively low D-U exponent is applied when the number of local samples is small. Figure 8 depicts the comparisons of the actual values and estimates from DIDW-LL, SDIDW-LL, and the reference estimators (IDW-L, DIDW-GG, and OK). DIDW-LL, SDIDW-LL, and OK possess very similar interpolation accuracy, superior to either IDW-L or DIDW-GG. The scatterplots represented are similar to each other, especially for the variogram-based estimators (i.e., DIDW-LL, SDIDW-LL, IDW-L, and OK). This feature is further exhibited in Fig. 9, which indicates that the estimates and the associated error variances from DIDW-LL and SDIDW-LL bear a more significant correlation to the OK results than those from IDW-L and DIDW-GG. This phenomenon is expectable because IDW-L ignores the D-D correlation, and DIDW-GG does not aim to minimize the estimation error variance.

DIDW-LG estimation.
To evaluate the interpolation performance of DIDW-LG, cross-validation is first applied to determine an appropriate global D-D exponent, which is then employed to accomplish the interpolation for the 780 estimated locations.

Cross-validations.
In the process of cross-validation using DIDW-LG, four classical error measurements, including mean true error (MTE), mean absolute error (MAE), root mean square error (RMSE), and the correlation coefficient between actual and estimated values, are used to explore the interpolation accuracy as well as to determine an appropriate global D-D exponent. The corresponding results are shown in Fig. 10, and some observations can be made as follows.
First, in Fig. 10a, as the D-D exponent increases, the MTE presents a monotonic decreasing tendency, indicating a continuous decrease of the associated estimates in total. This decline of the estimates, resulting from the declustering, is in line with the sampling strategy (the samples are preferentially collected in the high-value areas as shown in Fig. 2a) and thus demonstrates the validity of DIDW-LG.
Additionally, it is also notable that the origin of each subplot in Fig. 10 corresponds to the case when IDW-L is used. Obviously, there are numerous D-D exponents, which would entail that DIDW-LG is more accurate than IDW-L.
Moreover, both MAE and RMSE indicate that a D-D exponent of 4.0 is appropriate, thus employed in the actual interpolation below.
Actual interpolations. Based on the optimal D-D exponent stated above, the actual interpolation using DIDW-LG is conducted, and the corresponding results are depicted in Fig. 11. Overall, the essential characteristics of DIDW-LG results, including the D-U exponents, interpolation accuracy, and the similarity compared with OK, are consistent with DIDW-LL and SDIDW-LL (shown in Fig. 7). This consistency demonstrates that DIDW-LG also produces more favorable estimates than IDW-L and DIDW-GG.
Moreover, it is still worth providing qualitative insights into the actual interpolation performance of DIDW-LG with different D-D exponents. In Fig. 12, it can be observed that the behavior of MTE from DIDW-LG is normal as expected, which is rather similar to what is revealed in Fig. 10a. Likewise, as exhibited by RMSE or MAE, there are numerous D-D exponents that would yield more accurate DIDW-LG estimates than the conventional IDW-L. Test with different datasets. Ten sample sub-datasets, drawn as 10%, 20%, …, 100% of the data from the 470 sample points and orderly named as S10, S20, …, S100, are applied to estimate the 780 grid nodes by the tested estimators. The detailed sample locations of these datasets can be found as Supplementary Fig. S1 online. As exhibited in Fig. 13 and its accompanying result in Table 1, in general, IDW-L produces the most inaccurate results among the test methods. The main reason should be that IDW-L completely ignores the correlation among sample data. On the contrary, OK yields the most accurate estimates. Following OK, DIDW-LL and DIDW-LG yield very similar estimation results, which are slightly more accurate than SDIDW-LL. Despite this, SDIDW-LL is still superior to either IDW-GG or IDW-L with respect to interpolation accuracy.
These characteristics are generally consistent with those illustrated in the above tests (as shown in Sect. 4.3 and 4.4), implying the stability of the developed methods in the context of various sample datasets.
Test with different variogram parameters. It is widely accepted that the practical success of kriging estimators heavily depends on the suitability of the chosen variogram 36 . Likewise, due to the introduction of the error variance in Eq. (10), either DIDW-LL or DIDW-LG is unavoidably dependent on the reliability of the spatial structure. Nevertheless, the degree of this dependence is not very clear, which deserves to be elaborated.
To achieve this elaboration, the reference variogram model in Eq. (11) is perturbed to generate a set of spatial structures in the following two aspects: (1) ten main anisotropy angles, evenly dividing the search space, are designed based on the main anisotropic direction (340°) of the reference variogram model; (2) likewise, the first range, 30 m, along the direction of maximum continuity in Eq. (11) is applied to create ten new variogram models through equally increasing its value by 0 m, 10 m, 20 m, …, 90 m. Figure 14 exhibits the resulting interpolation accuracies of the five variogram-based methods with various anisotropy angles. Judging from the bend degree of the RMSE or correlation coefficient curves, the most sensitive method to the main anisotropy angle is OK, followed by IDW-L, DIDW-LL, and SDIDW-LL, which bear similar sensitivities; DIDW-LG presents significant stability under the condition of various directions of maximum continuity. The tested methods sorted by the overall interpolation accuracy from best to worst are OK, DIDW-LG, DIDW-LL, SDIDW-LL, and IDW-L, respectively. Nevertheless, it is noticeable that the DIDW-LG with several main anisotropy angles, such as 40° and 58°, also generates more accurate estimates than OK. Figure 15 reveals the corresponding estimates in the case of varying variogram ranges. Most methods represent favorable stability except OK, which tends to yield less accurate estimates than IDW-L in terms of the RMSE or correlation coefficient. www.nature.com/scientificreports/ Consequently, all three implementations of the proposed DIDW with LVEs (i.e., DIDW-LL, SDIDW-LL, and DIDW-LG) are significantly superior to the traditional IDW-L and DIDW-GG. When the spatial correlation is accurately captured, their results could bear significant similarity to OK outcomes; otherwise, they may outperform OK, especially for DIDW-LG.

Discussion
To some extent, it is rational to consider that DIDW with LVEs approximates OK since they share the same optimization goal, minimizing estimation error variance. This approximation would be enhanced by using variogram distance instead of the Euclidean metric employed in this study, probably improving the estimation accuracy when spatial anisotropy in the study region is significant. However, this replacement should be cautiously applied since it may increase the dependency of the proposed method on the spatial structure.
Moreover, the designed objective function could be implemented more flexibly. For instance, other estimation parameters in the proposed method, such as the type of search model and search radius, can also be added into the vector p in Eq. (10), and optimized together with the local exponents to further improve the interpolation accuracy. For the sake of practicability, more advanced optimization technologies in machine learning methods, such as the genetic algorithm 37,38 and simulation annealing 39 , would be helpful to achieve this goal.
Finally, the main characteristics of OK and DIDW with LVEs is summarized in Table 2. In addition to the two methods, the radial basis function interpolation (RBFI) 40,41 is described in this table, because it is also a frequently used SI method that accounts for the effect of clustering. It is notable that, unlike RBFI and OK, the proposed method does not need to solve a system of equations. This feature would be attractive in a big data or high-dimensional context, where numerical instability of the solution to the system exists.

Conclusions
In this paper, a new dual IDW framework (DIDW with LVEs) that can account for the D-D and D-U correlations flexibly is proposed. It involves two key points: (1) the original DIDW formalism is modified to incorporate the LVEs; (2) a generalized objective function aiming to minimize the estimation error variance is developed to determine appropriate LVEs. Within this framework, DIDW can self-adaptively choose suitable exponents according to local data configuration and correlation. This feature entails that DIDW can capture locally changed physical features, thereby increasing the accuracy and reliability of its estimates.
The real-world application shows that DIDW with LVEs is more flexible and robust than the traditional IDW-L and DIDW-GG. Besides, it is superior to OK in many aspects; for instance, it is immune to negative estimation weights, applicable for high-dimensional SI issues, and less sensitive to variogram parameters.