A new mixture copula model for spatially correlated multiple variables with an environmental application

In environmental monitoring, multiple spatial variables are often sampled at a geographical location that can depend on each other in complex ways, such as non-linear and non-Gaussian spatial dependence. We propose a new mixture copula model that can capture those complex relationships of spatially correlated multiple variables and predict univariate variables while considering the multivariate spatial relationship. The proposed method is demonstrated using an environmental application and compared with three existing methods. Firstly, improvement in the prediction of individual variables by utilising multivariate spatial copula compares to the existing univariate pair copula method. Secondly, performance in prediction by utilising mixture copula in the multivariate spatial copula framework compares with an existing multivariate spatial copula model that uses a non-linear principal component analysis. Lastly, improvement in the prediction of individual variables by utilising the non-linear non-Gaussian multivariate spatial copula model compares to the linear Gaussian multivariate cokriging model. The results show that the proposed spatial mixture copula model outperforms the existing methods in the cross-validation of actual and predicted values at the sampled locations.

Many environmental sampling is often observed multiple spatially correlated variables at a given geographical location. For instance, multiple topsoil heavy metal concentrations, such as cadmium, zinc, and copper, are sampled from the soil sample at a field location. In forestry, multiple biomass variables, such as bole, foliage, stump, branch, and root biomass, are sampled in a tree. Also, the spatial distribution of forest biomass variables may use to understand wildfire behaviour. These variables can depend on each other in complex ways, such as non-linear and non-Gaussian spatial dependence. The spatial modelling by considering these complex multivariate spatial dependence may increase the prediction accuracy of individual variables, which may help forest managers to minimise risk and save lives. This article focusses on the copula-based spatial modelling of spatially correlated multiple variables and predicts the individual variables while utilising multivariate spatial dependence of spatially correlated variables.
Gaussian-based linear kriging method is widely used to model spatial variables and provides a weighted average measure of linear spatial dependence. The kriging weights do not depend on the different values of samples and also assume linear Gaussian spatial dependence over the spatial domain [1][2][3][4][5][6] . However, spatial interpolation (prediction or simulation) based on a spatial model expects to behave differently for different values of samples. That is, spatial correlation between samples varies for the different quantiles of samples. Thus, Bárdossy 7 introduced spatial copula method that can capture the spatial dependence of a spatial variable by considering the different values of samples. In the non-spatial setting, copula method is used to model the dependence between two or more non-spatial variables, which has widely applied in many fields, such as environmental science, finance, economics, medicine and engineering [8][9][10][11][12][13][14] . Bárdossy's 7 spatial copula method divides the distance over which spatial dependence exists into equally spaced intervals, also referred to as distance classes or spatial bins, and requires the same family of copulas to be fitted across all of the spatial bins. Gräler and Pebesma 15 proposed a more flexible spatial copula model that permits copulas from different families to be fitted across the distance classes. The added flexibility of Gräler and Pebesma's model, over Bárdossy's model, permits increased accuracy in modelling and prediction. The spatial copula concept proposed by Bárdossy, Gräler and Pebesma has used in mining, forestry, soil sampling, hydrology, and other environmental applications 8,[16][17][18][19][20][21][22][23][24] . However, these spatial copula methods enable modelling and predicting a univariate spatial variable without considering the multivariate dependence of spatially correlated multiple variables. Recently, Gnann et al. 25 improved Bàrdossy's 7 method to interpolate a primary spatial variable while considering a secondary correlated spatial variable. However, Gnann et al. 25 assumed that the joint distribution of primary and secondary variables follows Gaussian copula.
As a solution to model non-Gaussian multivariate spatial dependence in spatial copula framework, Musafer et al. 26 proposed a multivariate spatial copula model, whereby the correlated spatial variables were transformed into spatially uncorrelated factors using non-linear principal component analysis (NLPCA). Then, Gräler and Pebesma's univariate spatial copula model was used to model and predict spatially uncorrelated factors. Subsequent back transformation is required to transform predicted values to the scale of the original variables and to re-inject correlation. However, Musafer et al. 's 26 method indirectly models the joint dependence between spatial variables through a black-box transformation. We directly extend Gräler and Pebesma's univariate spatial copula to multivariate setting that jointly models spatially correlated multiple variables via a white-box mixture copula 24 . The mixture copula is a joint distribution function of multiple copulas that offers a more flexible framework for parametric statistical modelling and analysis. Also, a single copula family may not be able to capture tail dependencies but the mixture copula capture the tail dependencies as well 24 . The mixture copula has used in the non-spatial setting for modelling multivariate genomic data 27 , and modelling wave height and period 28,29 . We adapt the mixture copula in the spatial setting that offers a more flexible multivariate spatial copula framework for spatially correlated multiple variables.

Methods
The methodology for modelling spatially correlated multiple variables consists of two essential components: modelling each spatial variable separately using Gräler and Pebesma's 15 univariate spatial copula; then joining the univariate spatial copulas using the idea of mixture copula 24 . We also use the proposed spatial mixture model to predict individual variables using inverse conditional approach in a bivariate context 30 . However, the method can be used to predict more than two variables with a trivial generalisation of the bivariate setting.
. . , Z m ] be the second-order stationary multivariate spatial random field Z with m spatial variables that are sampled at the same two-dimensional location x ∈ X , and let X = (x 1 , x 2 , . . . , x n ) be the set of existing locations in the given spatial domain X.
A spatial copula 15 describes the joint spatial dependence of a univariate spatial variable at any two spatial locations x and x + h , where h is the separation distance between two locations. Hence, spatial copulas model dependence of one spatial location relative to another spatial location, rather than modelling dependence using absolute locations.
The methodology for modelling spatially correlated multiple variables is simply shown in Fig. 1, and a detail procedure for the model development is provided in steps 1-4.
Step 1: For each spatial variable Z l , l = 1, 2, . . . , m , models the marginal cumulative distribution functions (CDFs), such as Gamma, Weibull, Normal, Log-normal, and obtain the best fitted CDFs. Let F l denote the best fit CDF of Z l , which is assumed to be same at each location x, i.e., F l (Z l (x)) = F l (Z l (x + h)).
The proposed method is based on the concept of distance dependent spatial copula 15,31 . Hence, the distances between every pair of locations are calculated. Suppose, x 1 , x 2 , x 3 and x 4 are four sampled locations of Z.
As given in Fig. 2, the distances are calculated for each location pair and (a j , b j ) are the coordinates of x i and x j , for i = j , ∀i, j = 1, 2, . . . , n , respectively. Also, n(n + 1)/2 The spatial dependence of copula-based spatial models depends on the distance between locations. As the distance between two points increases, spatial dependence between points decreases until it is independent or negligible enough to be considered independent. The distance at which independence occurs is referred to as the cut-off distance and is determined empirically using a correlogram. The correlogram plots the Kendall's tau τ correlation coefficient for each spatial bin, and a curve is fitted through the plotted points. Similar to a variogram in Kriging, the cut-off distance is visually determined as the distance at which the curve plateaus.
Step 2: Based on the distance between pairs, place each sample pair {F l (Z l (x), F l (Z l (x + h))} into K equally spaced spatial bins as follows: A correlogram is used to determine the cut-off distance as a plot of τ against the mean distance of each bin, which is calculated using the pairs belonging to relevant spatial bin. Figure 3 depicts an example correlogram.
Given the pairs of points for each spatial bin, spatial copula that describes the dependence of spatial variable Z l at any two locations can be calculated as, where k = 1, 2, . . . , K is the index of the spatial bin, u and v are any selected quantiles of the corresponding univariate CDF of Z l at locations x and x + h.
The copulas for each bin are selected using maximum log-likelihood values of competing copulas, such as Gaussian, Student's t, Clayton, Frank, Gumbel, and Joe 30 , which represent variety of dependence structures. Then, a mixture copula is used to determine the multivariate spatial dependence across bins as a weighted linear combination of copula.
Step 3: For each spatial bin k, use the spatial copulas in Eq. (1) to construct the mixture copula C m k,h as,  www.nature.com/scientificreports/ where, w l is the mixture weight, m l=1 w l = 1 , and 0 < w l < 1. An equal weight can be used in Eq. (2) if the correlogram of each variable is not significantly different. Otherwise, compare different weight combinations across bins and obtain the optimal weight combination. Moreover, for small distances, pairs of points will become extremely strong dependent and modelled using a comonotonic copula M (u, v). For large distances, pairs will become independent and modelled using a product copula �(u, v) 32 , as follows The mixture copula in Eq. (2) only describes the multivariate spatial dependence across individual bins.
However, a spatial model should be able to capture spatial autocorrelation between bins 15 . For instance, points near the upper bound of the first bin and the lower bound of the second bin may have similar features; points near the upper bound of the second bin and lower bound of the third bin; and so on. Thus, the spatial dependence is incorporated using the distance dependent parameter k that determines spatial dependence while incorporating spatial autocorrelation.
In practical situations, the first bin is modelled using the best fit copula for that bin, and subsequent bins are modelled using the convex linear combination of copulas with parameter k 15 . Further, pairs that fall above the cut-off distance are often omitted and not incorporated into the convex combination, assumed as an independent copula.
Step 4: Use Eq. (2), construct the distance dependent spatial mixture copula of Z as the convex linear combination of mixture copulas of each spatial bin as follows, Prediction. Prediction of individuals spatial variables at sampled locations based on the spatial mixture copula is described in a bivariate context. That is m = 2 , then C m h is the spatial mixture copula of Z 1 and Z 2 . The prediction method demonstrates the advantage of using a secondary correlated spatial variable in the prediction of a primary spatial variable 25 . Thus, an inverse conditional prediction approach is proposed 30 ,[pp. 40-42] where a secondary correlated spatial variable is known when predicting the primary spatial variable .
Suppose Z 1 is the primary variable of interest, then Z 2 is correlated secondary variable. The prediction of Z 1 at location x conditional on the known given value of Z 2 can be generated at the same location x, using the copula CDF of C m h . The procedure of the inverse conditional approach is given in steps 5-8, Step 5: Obtain the joint CDF values of Z 1 and Z 2 using C m h , and let T be the vector with joint CDF values.
Step 6: Obtain the marginal CDF values of Z 2 using F 2 , and let R be the vector with marginal CDF values.
Step 7: Derive the conditional distribution of T, given R = r , using the partial derivative of C m h as follows, let s = (C m h,r ) −1 (T|R = r) be the conditional predicted value of Z 1 at location x.
Step 8: Take, Z 1 = F −1 1 (s). The prediction of Z 2 , given Z 1 , can be described by simply switching the subscripts 1 and 2 in the steps 5-8. The proposed method can be validated against actual values at sampled locations by cross-validation, and three scenarios are considered with the existing methods.
• Can any improvement in the prediction of individual variables be gained by utilising the multivariate spatial dependence using mixture copula over the univariate pair copula 15 ? • Can any improvement in the prediction of individual variables be gained by utilising the mixture copula over the NLPCA transformation based spatial copula 26 ? • Can any improvement in the prediction of individual variables be gained by utilising the non-linear non-Gaussian multivariate spatial dependence (spatial mixture copula) over the linear Gaussian multivariate spatial dependence (cokriging) 33 ?
The cross-validation study is illustrated using mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE). The MAE, RMSE and MAPE can be calculated using the actual and (  26 . Also, accuracy in the reproduction of the bivariate relationship of Z 1 and Z 2 is evaluated based on the mean square error from the kernel density estimation (KDE MSE). The KDE MSE can be calculated by taking the mean of the squared differences between the bivariate KDEs of the actual and predicted data 26 .

Application
The proposed method was applied to model real forest data that was taken from georeferenced forest inventory plots in the US Department of Agriculture Forest Service Bartlett Experimental Forest (BEF) in Bartlett, New Hampshire 34 . The variables of interest were forest-wide biomass estimations within the area of 1053 hectares (measured in mg/ha). In this study, only foliage biomass ( Z 1 ) and bole biomass ( Z 2 ) were used that sampled at 335 two-dimensional locations. The prediction of bole biomass can be used for carbon accounting purposes, and the prediction of foliage biomass can be used to identify regions with high values of foliage biomass. Also, the behaviour of wildfires depends on pools of biomass variables 26,35 . Table 1 gives the summary statistics of the data. Figure 4a,b show the spatial distributions of Z 1 and Z 2 at observed locations. Figure 4c shows a strong bivariate non-linear relationship between Z 1 and Z 2 . The best marginal distributions were selected based on the maximum log-likelihood (ML) values. The Weibull distribution was achieved as the best distribution for Z 1 based on the ML values, 65.03, 69.43, 33.26, 44.03, and the Gamma distribution was achieved as the best distribution for Z 2 based on the ML values, 378.86, 378.28, 250.20, 377.51, of Gamma, Weibull, Normal, Log-normal distributions respectively. Then, the CDF values of Z 1 and Z 2 were calculated using the corresponding CDFs. The following steps for the modelling is only incorporated the CDF values of Z 1 and Z 2 (Step 1).
The cut-off distance was selected as 800 m using the correlograms of variables, and ten equally spaced (80 m) spatial bins were created (see Table 2). Table 3 shows the best fit copulas and the estimated copula parameters, where C 1,k,h and C 2,k,h are the fitted univariate spatial copulas of Z 1 and Z 2 respectively (Step 2). The correlation across bins almost similar for each variable (see Table 2), and then equal weights were used. Table 4 shows the mixture copulas of each bin (Step 3).
The mixture copulas in Table 4 were used to develop the distance dependent convex combination of mixture copulas as given in the Eq. (3), which is the proposed spatial mixture copula of the spatially correlated Z 1 and Z 2 (Step 4), is given by where 2 = 105−80 160−80 = 0.31, 3 = 0.61, . . . , 10 = 0.48. The proposed spatial mixture copula method was used to predict Z 1 and Z 2 using the inverse conditional approach as described in the steps 5-8. Figure 5 shows the bivariate relationship of Z 1 and Z 2 . Table 5 shows the model validation results with the existing methods.
According to Table 5 almost all the RMSE, MAE, and MAPE values are the lowest for the Z 1 and Z 2 predictions based on the spatial mixture copula. The MAPE value of cokriging method is the smallest for the Z 1 prediction that is very close to the spatial mixture copula. Thus, it can be seen that the proposed method outperformed in the prediction of Z 1 and Z 2 across the observed locations. Also, the proposed method accurately reproduces the bivariate relationship in terms of the minimum value of KDE MSE.  Fig. 5, the univariate pair copula method does not reproduce the tail values of both variables. Cokriging is unable to predict tail values and follows a strictly linear relationship. Although the NLPCA spatial copula method reproduces the non-linear relationship, it cannot reproduce upper tails, specifically for Z 2 . The prediction of individual variables using the novel spatial mixture copula method accurately predicts both upper and lower tail values, and conditional values of the variables reproduce the non-linear relationships between them. Thus, using mixture copula in the multivariate spatial copula framework is improved the accuracy in the univariate prediction.

Conclusions
This article proposed a new mixture copula method for modelling spatially correlated multiple variables. The proposed method models multiple spatial variables without any normalisation of the original variables, such as NLPCA transformation. The method was applied to model bivariate non-linear spatial variables Z 1 (foliage biomass) and Z 2 (bole biomass). The model performance was assessed in the cross-validation of actual versus predicted values at sampled locations. The use of multivariate spatial dependence in the univariate prediction, the strength of the mixture copula in the univariate prediction, and utilising non-linear non-Gaussian multivariate  The method also applied to non-linear simulated bivariate correlated variables (see Supplementary online), where the spatial mixture copula outperformed the existing methods, in terms of predicting individual simulated variables and their bivariate relationship. The proposed method used equal weights for each variable for both BEF application and simulation study. However, further improvement to the spatial mixture model is the optimal weights selection of each variable in the mixture copula modelling. For instance, one spatial variable may have a strong spatial dependence across locations than the other variable, and the optimal weights selection may increase the prediction accuracy of each variable across locations. The prediction method is explained for the bivariate case (m = 2), however, it can be extended to multivariate setting. Also, the proposed spatial mixture copula can be extended to multivariate spatial sampling design methodology for optimally selecting additional sampling to reduce prediction uncertainty by leveraging spatially correlated multiple variables. Moreover, the proposed method assumes isotropic spatial dependence of spatial variables but can be extended to model spatially correlated anisotropic variables, which can be present in mining 36 and soil variables 37 , for example.
The proposed method assumes that the spatial random field is stationary. However, the method can be extended to non-stationary spatial processes. For example, a non-stationary spatial process can be divided into several locally stationary processes. A univariate spatial copula can be modelled to each stationary process, and then a global non-stationary spatial copula can be constructed as a mixture of locally stationary spatial copulas. Furthermore, the proposed method assumes that all data points are known and collected at the same set of locations. However, measurements could be unavailable or difficult to sample at some locations, which is quite www.nature.com/scientificreports/ common in functional spatial data analysis. Thus, the proposed method can be extended for modelling and predicting complex functional spatial data.

Data availability
In implementing the proposed model in this paper, R software version 3.6.3 (R Development Core Team 2020) was used. Specifically, R package "spcopula" version 0.2-4 of Gräler was entirely used in this study (see http://rforge.r-proje ct. org/ proje cts/ spcop ula/), including key dependent packages, such as "copula", "VineCopula" version 2.1.8, "sp", "spBayes" "MASS","fitdistrplus". The data is available in "spBayes" package, where only non-zero values of biomass were considered in this study. For simulation study, "gstat" package was mainly used that facilitated the unconditional prediction of Gaussian random fields. Moreover, in comparing the proposed model with the most relevant NLPCA spatial copula model, MATLAB software was used, specifically "Nonlinear PCA Figure 5. Reproduction of bivariate relationship using various methods. Actual (red), predicted (black), Z 1 given Z 2 (green), and Z 2 given Z 1 (blue).