Introduction

Official socioeconomic statistics data are fundamental to policy-making and multi-disciplinary research1. In China, some official socioeconomic databases are openly available from different sources, such as the China Data Center (http://chinadatacenter.org/default.aspx), the Thematic Database for the Human-Earth System (http://www.data.ac.cn/index.asp) and special research programs2. However, a common problem with these databases is that they are all at the coarse provincial level. The county is the basic data collection unit for the official statistics in China, while the published county-level socioeconomic yearbooks have a considerable amount of missing data. Thus far, the country has not published any socioeconomic statistics databases that completely cover the entire country with data at the county level for an extended period3,4, due to the insufficient survey and data collection infrastructure, especially in some remote areas. This county-level missing data problem has seriously limited the wide use of China’s socioeconomic data.

Various methods for estimating missing values have been applied to official statistics data5,6. The traditional design-based methods, such as the James-Stein estimator7 and the generalized regression estimator8, are commonly used, but they require sufficient samples, which are hard to acquire in many regions. To avoid the challenge of collecting samples, the auxiliary information from other related variables6 has been used to impute the missing data for the official statistics in different countries9,10,11. For instance, a demographic study with census data in the US utilized the covariate information in each area for estimating missing values11, and a cultural participation study in Australia estimated missing data with an additional synthetic database9. However, useful auxiliary information is not always available in practice, similar to samples12. Particularly in China, neither samples nor auxiliary data are available for the socioeconomic dataset on the spatiotemporal scales.

When neither samples nor auxiliary data are available, model-based imputation methods have been proposed, such as k-nearest neighbors (kNN)13, expectation maximum (EM)14, singular value decomposition (SVD)15, and random forest (RF)16. For each record whose value is missing, kNN finds its k nearest neighbors whose values are available using the Euclidean metric and imputes the missing value by averaging the values of the neighbors13,17. EM assumes a distribution for the partially missing data and bases inferences on the likelihood of that distribution14. SVD initializes all missing records with zeros and then estimate them as a linear combination of the k most significant eigen-variables until it reaches a certain convergence threshold13,15. RF imputes data by regressing each variable in turn against all other variables and then predicting missing data for the dependent variable using the fitted forest16. However, none of these methods have taken into account information about spatial and temporal structures in the estimation of missing values.

Spatial agglomeration is a common socioeconomic phenomenon18. Thus large-scale official statistics data usually have spatial structures, particularly spatial autocorrelation19. Moreover, temporal autocorrelation, in which observations that are temporally close to each other tend to be similar, is also likely to be inherent in official statistics data20,21. On the one hand, information about spatial and/or temporal structures can be utilized for estimating missing values, especially when other information, such as that from samples and auxiliary data, is unavailable. In addition, for a spatiotemporal dataset, an imputation model that cannot fully capture the spatial and/or temporal structures in the data may introduce bias into the results, thus leading to low accuracy and high uncertainty22. Unfortunately, the four widely used model-based imputation methods described above are not designed to incorporate either spatial or temporal autocorrelation effects.

Under this situation, spatial statistical models can be applied to estimate missing values by accounting for spatially correlated information as spatial components in the model23. For example, Bihrmann, K. et al. implemented a logistic regression model with a spatially structured random component to impute missing data on Salmonella Dublin in Danish cattle herds24. Baker, J. et al. used a Bayesian model with the spatial intrinsic conditional autoregressive prior to impute missing data in health studies25. Staubach, C et al. used a beta-binomial model incorporating spatially structured and unstructured random effects to complete disease prevalence data26. However, the spatial information has not been widely taken into account in estimating missing values for official statistics data.

In addition, most spatial models for estimating missing data focus on a snapshot situation and neglect the temporal autocorrelations in the data24,25,26. Furthermore, the spatial and temporal structures can have interactivity, which may not be fully captured if the model only considers the two effects separately27. To address these problems, spatiotemporal models are usually formulated within the Bayesian hierarchical modeling (BHM) framework to account for spatial structures, temporal structures and their respective space-time interactions28. BHM is a powerful analytical technique for building spatiotemporal statistical models29, as the information provided by neighboring regions and time trends can be naturally represented as priors and it gives robust posterior estimates30. BHM-based spatiotemporal models have been found in many applications31,32,33, but have not been to missing data estimations of official statistics.

To estimate missing values for China’s socioeconomic official statistics data, for which the problem that neither samples nor auxiliary information are available is frequently encountered, we developed a spatiotemporal modeling procedure under the BHM framework that incorporates spatial autocorrelation, temporal correlation, and space-time interactions as the primary information sources. In addition, this modeling procedure is progressive since it contains two steps. It first imputes those easier-to-impute variables that have only small percentages of missing values, for which models considering only spatial and temporal information can achieve a decent estimation quality. It then uses the estimation results of these easier-to-impute variables as covariate information, along with the spatiotemporal multivariate regression model, to impute those more-difficult-to-impute variables that have large percentages of missing values.

We applied this progressive spatiotemporal (PST) procedure to the estimation of county-level missing data in China’s official socioeconomic statistics from 2002 to 2011. We evaluated different types of spatiotemporal models in order to select an optimal implementation for the PST method. We also evaluated PST’s sensitivity to a change of the missing data percentages and created spatial uncertainty maps. As a comparison, we also applied four other imputation methods, including kNN, EM, SVD and RF to the Chinese dataset.

To our best knowledge, no previous works on missing data estimations of official statistics have constructed BHM-based spatiotemporal models that comprehensively incorporate spatial, temporal, and covariate effects and perform modeling in a progressive way. In the area of official statistics, studies on such a great scale and such a large dataset that cover all of China are rare.

Methods

Study area and data

The socioeconomic data that we used in this study are from three series of official statistics yearbooks published by the National Bureau of Statistics of China (http://www.stats.gov.cn/english/), including the China County Statistical Yearbook, the China Statistical Yearbook for Regional Economy, and the China City Statistical Yearbook34. Data of counties (suburban/rural areas) in China are from the former two yearbooks, in which the statistical variables complement each other. In China’s administrative division system, a city can contain a number of county-level units called municipal districts. The data of municipal districts of cities are from the latter yearbook, which contains the complete set of statistical variables. We conducted logarithmic transformation of each socioeconomic variable to approximate a normal distribution1,35,36 in order to mitigate the impact of extreme values, and to make the effective relationships non-linear while still preserving the linearity of the model37.

The situations of missing data are different across statistical variables and yearbooks, and are generally more serious in the China County Statistical Yearbook and the China Statistical Yearbook for Regional Economy. If we use the county-year as a unit, in these two series of yearbooks, across different variables the minimum missing percentage is 28.69% during the 10-year period of 2002 to 2011, the maximum is 36.08%, and the mean is 30.98%. On the other hand, the China City Statistical Yearbook is almost complete, but it only covers urban areas. We combine the data from these three series into an integrated dataset. The dataset covers a total of 20 socioeconomic variables for 2,310 county-level areal units in China spanning a 10-year period from 2002 to 2011.

For each of the 20 socioeconomic variables in our integrated dataset, we calculated its overall percentage of county-years with missing data during the 10-year period. As shown in Table 1, the overall missing-data percentages of the last six variables (X15 to X20) are much larger than those of the first 14 variables (X1 to X14). We also calculated for each variable the maximum yearly percentage of counties with missing data during the 10 years, based on which we defined that if the percentage is <15%, then the quality of the data for that variable in that year is acceptable, e.g., Fig. 1(a). If it is >85%, the quality is unacceptable, and we named this year a big year of data missing for that variable, e.g., Fig. 1(b). It turned out that a big year of data missing only appears with variables X15-X20. Figure S1 in the additional document provides the detailed information about missing data for all counties in an example year. Based on whether a variable has at least one big year of data missing, we divided the dataset into two parts the first 14 variables (X1 to X14), which have no big year of data missing, and the last six variables (X15 to X20), which have big years of data missing. These two parts were separately used in the two steps of modeling.

Table 1 Missing data situations of 20 socioeconomic variables.
Figure 1
figure 1

Study area and the missing data maps of GDP (variable X15) in the years 2002 (a) and 2011 (b).

Experimental design

The overall design of analysis is illustrated in Fig. 2. Before modeling, we started with a Moran’s I test on the spatial autocorrelation of each target socioeconomic variable in each year (supplementary file S3). Because the counties of China vary greatly in size, where some are very large and some are very small, we chose to use contiguity rather than distance to represent the spatial relationship in measuring the spatial structure. We found that for all socioeconomic variables in each year, the Z-score was positive and significant (>2.58), which indicates that all the variables have significant spatial autocorrelations and that it is reasonable to utilize the spatial autocorrelation information for imputing missing data.

Figure 2
figure 2

Experiment’s overall design flow chart.

The PST modeling process contains two general steps. In the first step, we derived information from the spatial and temporal structures in the existing data. We used the spatiotemporal models that take into account of the random effects of spatial, temporal, and their interactive imformation to estimate the missing values for variables X1 to X14, whose missing percentages are small, with no auxiliary covariates or samples involved (as they were not available). The second step worked for variables X15 to X20, whose missing percentages are large. The second step used multivariable regression modeling because we had the covariates from the first step as the independent variables.

In each of the two steps, we built two alternative statistical models. In step 1, we built two spatiotemporal models, one parametric38 and one nonparametric27 (herein referred to as Model 1 and Model 2, respectively). They have the common components of spatial effects, but Model 1 uses the linear time prior, whereas Model 2 uses the nonlinear time prior for the temporal components and the space-time interaction components. With Models 1 and 2, we intended to discover which type of spatiotemporal model is more suitable for our data, and we chose the optimal one between the two for the next step. In step 2, we constructed spatiotemporal multivariable regression models (herein referred to as Model 3). Compared with Model 2, Model 3 incorporates additional covariate information (the 14 variables imputed in step 1). Model 3 will demonstrate the usefulness of the new imputed covariates in estimating other variables.

After building the models, we used a variety of methods for evaluation and validation. First, we evaluated the two pairs of alternative spatiotemporal models (Models 1 vs. 2 and Models 2 vs. 3) regarding the Bayesian model fitness using the deviance information criterion (DIC) and the predictive quality using the conditional predictive ordinate (CPO). This first step of the evaluation was based on the entire dataset and selected an optimal spatiotemporal model for the PST imputation. Second, we ran a cross-validation to evaluate the predictive performance of PST and the model’s sensitivity to change of a missing data percentage. Specifically, we randomly sampled 10%, 20%, and 30% from the existing data to create three test sets, and used the rest of the data as the training sets.We further obtained the spatial uncertainty maps to evaluate the local prediction errors of the spatiotemporal models applied in the PST method. Third, we compared the proposed PST method with four other widely used imputation methods, including kNN, SVD, EM, and RF. We applied cross-validation (10% random samples) to test the actual accuracy of these imputation methods, and implemented the four methods using R.

Statistical methods

Progressive spatiotemporal (PST) modeling

Spatially, we denote the 2,310 county-level areal units as i = 1, …, I (I = 2310). Temporally, we denote the 10 years as t = 1, …, T (T = 10). Let yit denote the values of a socioeconomic variable in area i and year t. All of our models assume a log-normal likelihood prior distribution. The structured additive linear predictor \({\eta }_{it}=\,\mathrm{log}({y}_{it})\) will be decomposed additively into components of space, time, or both. As aforementioned in the Experimental Design section, we constructed three different models. The details are described in this section.

Parametric spatiotemporal model (Model 1)38:

$${\eta }_{it}=\alpha +{\mu }_{i}+{\nu }_{i}+(\beta +{\delta }_{i})\times t$$
(1)

In the linear predictor \({\eta }_{it}\), α quantifies the intercept fixed effect, and μi and vi are the spatial components that represent two random effects. The term vi assumes a Gaussian exchangeable prior to the model unstructured heterogeneity, which is formalized as \({\nu }_{i} \sim N(0,{\delta }_{\nu }^{2})\), and μi assumes an intrinsic conditional autoregressive (CAR) prior for the spatially structured variability.

The spatial components include two effects: one assuming a Gaussian exchangeable prior to model the unstructured heterogeneity, which is \({\nu }_{i} \sim N(0,{\delta }_{\nu }^{2})\), and the other assuming an intrinsic conditional autoregressive (CAR) prior for the spatially structured variability39, which is:

$${\mu }_{i}|{\mu }_{j\ne i} \sim N(\frac{1}{{m}_{i}}{\sum }_{i \sim j}{\mu }_{i},\frac{{\sigma }^{2}}{{m}_{i}})$$
(2)

where i ~j indicates that areas i and j are neighbors, mi is the number of areas that share boundaries with the i-th area and \({\sigma }^{2}\) is the variance component. The spatial dependence in \({\mu }_{i}\) assumes the CAR prior that extends the well-known Besag model39, with a Gaussian distribution, which implies that each \({\mu }_{i}\) is conditional on the neighbor \({\mu }_{j}\) with the variance dependent on the number of neighboring counties mi of county i. The structured spatial effect is considered as the spatial autocorrelation information that is borrowed from nearby neighbors, and the unstructured spatial effects are seen as the spatial heterogeneity characteristics in a specific area. Model 1 also includes the linear effect β, which represents the main temporal trend, and a differential temporal trend 𝛿i, which represents the area-specific time variation (the differential time trend for each region).

Nonparametric spatiotemporal model (Model 2): As an alternative to the assumption of a linear time trend in Model 1, Model 2 implements a general dynamic nonparametric time trend, which is considered more realistic. It adopts a random walk model for the main temporal trend and the corresponding spatiotemporal interaction term. The linear predictor of a nonparametric spatiotemporal model can be written as27

$${\eta }_{it}=\alpha +{\mu }_{i}+{\nu }_{i}+{\gamma }_{t}+{\varphi }_{t}+{\delta }_{it}$$
(3)

where \({\mu }_{i}\) and \({\nu }_{i}\) represent the spatial main effects, which are the same as in Model 1; \({\gamma }_{t}\) and \({\varphi }_{t}\) represent the temporal main effects; and \({\delta }_{it}\) represents the space-time interactions. The term \({\varphi }_{t}\) represents the unstructured time effect and is specified by using an independent mean-zero normal prior with unknown variance \({\sigma }_{\varphi }^{2}\). The term \({\gamma }_{t}\) represents the structured time effect and is modeled dynamically through a neighboring structure. We used the random walk (RW) dynamic model as a prior for the structured time effect, with its prior density π as follows28:

$$\pi ({\gamma }_{t}|{\sigma }_{\gamma }^{2})\propto \exp (-\frac{1}{2{\sigma }_{\gamma }^{2}}\sum _{t=2}^{T}{({\gamma }_{t}-{\gamma }_{t-1})}^{2})$$
(4)

In the time-space interaction term \({\delta }_{it}\), i = 1,…,I is the space index and t = 1,…,T is the time index. The specification of the prior on \({\delta }_{it}\) depends on the spatial and temporal main effects, which are assumed to interact. Assuming that the spatial main effect \({\nu }_{i}\) and the temporal main effect \({\gamma }_{t}\) interact with each other, each spatial unit \({\delta }_{i}=({\delta }_{i1},{\delta }_{i2},\,\mathrm{..}.,\,{\delta }_{iT})^{\prime} ,i=1,\,\mathrm{..}.\,I\) follows a random walk, and the prior on \({\delta }_{it}\) is thereby written as follows27:

$$p(\delta |{\kappa }_{\delta })\propto \exp \{-\frac{{\kappa }_{\delta }}{2}\sum _{i=1}^{m}{\sum _{t=2}^{T}({\delta }_{it}-{\delta }_{i,t-1})}^{2}\}$$
(5)

where \({\kappa }_{\delta }\) is the precision factor, which is the reciprocal of variance \({\sigma }_{\delta }^{2}\). The space-time interactions \({\delta }_{it}\) are considered as unobserved covariates for each unit (i,t) that have structures in time and space. Such a specfication is suitable when temporal trends are different among counties but the spatial trends are stable. With \({\delta }_{it}\), Model 2 can take into account not only of the spatial heterogeneity of each county but also the temporal variation of each county across ten years for the missing data imputation.

Spatiotemporal multivariable regression model (Model 3): When covariate information (observed and related variables) is available for imputing missing values, a traditional multivariable regression model can be easily specified as \({\eta }_{it}=\alpha +\sum _{k}{\beta }_{k}{X}_{itk}\), where α quantifies the intercept, Xk is the k-th covariate, and \({\beta }_{k}\) are the coefficients23. Combining it with Model 2, we build Model 3 as follows:

$${\eta }_{it}=\alpha +\sum _{k}{\beta }_{k}{X}_{itk}+{\mu }_{i}+{\nu }_{i}+{\gamma }_{t}+{\varphi }_{t}+{\delta }_{it}$$
(6)

where the specifications of these spatial and temporal random effects are the same as in Model 2. With this model, the imputation can comprehensively incorporate the related covariates, spatial effects, temporal effects, and space-time interactions.

Bayesian hierarchical modeling framework

Bayesian hierarchical modeling (BHM) is a statistical process that works on multiple levels to estimate the parameters of posterior distributions using the Bayesian method40. It has demonstrated the advantage of being able to impute missing data in a relatively straightforward way28. By applying BHM to spatiotemporal modeling, we implemented the prediction models in this study with three levels, namely, the data distribution, the spatiotemporal process, and the parameter, where each level can also contain a number of sub-levels. We employed the log-normal likelihood model for the data distribution and combined different sub-models (the CAR and RW models) to form a hierarchical model for the spatiotemporal process to incorporate the random effects of the spatial structures, temporal structures, and space-time interaction. For the parameter level, we used the inverse gamma distribution as the priors for all unknown variance parameters. This non-informative prior specification for the parameters and their variance components allows the observed data to have the greatest influence on posterior distributions without being greatly influenced by the choice of the prior41. The BHM-based PST models presented in this study were solved using the Integrated Nested Laplace Approximation (INLA) approach in the R software42. A major advantage of INLA is that it returns accurate parameter estimates in a relatively short computational time30. The R-INLA package can be directly downloaded from http://www.r-inla.org/. The core codes for fitting spatiotemporal models have been openly published in a few studies28,30,41.

Model evaluation and validation

  1. (1)

    Bayesian model fitness

    The deviance information criterion (DIC) is a well-known Bayesian model comparison criterion, which is defined as43

    $${\rm{DIC}}=\overline{D}+{p}_{D}$$
    (7)

    where \(\bar{D}\) is the mean of the model posterior deviance and \({p}_{D}\) is the effective number of parameters. The greater the value of pD is, the higher the complexity of the model. The greater the mean deviance values are, the greater the error of the representative model. Models with smaller DICs are better supported by the data.

  2. (2)

    Bayesian model predictive quality

    The conditional predictive ordinate (CPO) is defined as a cross-validated predictive density at a given observation and can be used to compute predictive measures44. For continuous distributions, it is defined as

    $$CP{O}_{i}=p({y}_{i}^{\ast }|{y}_{f})$$
    (8)

    where \({y}_{i}^{\ast }\) is the predicted value and \({y}_{f}\) is a sample of observations y, which is used to fit the model and to estimate the posterior distribution of the parameters. In practice, the cross-validated logarithmic score (LS) computed from the CPO is used to evaluate the model’s predictive quality. A smaller LS indicates a better prediction of the model. The LS is calculated as

    $$LS=-\frac{1}{n}\sum _{i=1}^{n}\mathrm{log}(CP{O}_{i})$$
    (9)
  3. (3)

    Actual prediction accuracy

To compare different imputation models with cross-validation, we used three indices to measure the actual prediction accuracy, namely, the standardized allocation error (SAE)11, the mean square error (MSE) and the coefficient of determination (R2)21,35. All these indices compare the model-predicted values with observed values. The SAE, MSE, and R2 are calculated as follows:

$$SAE=\frac{\sum _{i=1}^{n}|{y}_{i}-{y}_{i}^{\ast }|}{\sum _{i=1}^{n}{y}_{i}},\,MSE=\frac{1}{n}\sum _{i=1}^{n}{({y}_{i}-{y}_{i}^{\ast })}^{2},\,{\rm{and}}\,{R}^{2}=1-\frac{\sum _{i=1}^{n}{({y}_{i}-{y}_{i}^{\ast })}^{2}}{\sum _{i=1}^{n}{({y}_{i}-\bar{y})}^{2}},$$
(10)

where \({y}_{i}\) is the observed value, \({y}_{i}^{\ast }\) is the predicted value, \(\bar{y}\) is the mean of the observed values and n is the number of validation samples.

The SAE is a relative error index that is convenient for comparison between alternative models and has been well adopted in the official statistics field to compare various estimation methods11. An SAE value close to 0 indicates a good fit between the actual and estimated distributions. In addition, we could calculate the localized SAE for each spatiotemporal unit with \(SA{E}_{ij}=\frac{|{y}_{ij}-{y}_{ij}^{\ast }|}{{y}_{ij}}\).

The MSE is an absolute error index. A smaller MSE indicates a better prediction of a model. The R2 is an index for assessing the agreement between observed and estimated values, with the value ranging from 0 for complete disagreement to 1 for perfect agreement. Scatterplots were created to compare the observed values and estimated values in the cross-validation1,21.

Data availability

The three governmental yearbook series, which provide the original data for this study, are available from the National Bureau of Statistics of China (http://www.stats.gov.cn/english/). The new datasets generated during the current study are not publicly available due to the limitation of the copyright of the governmental data source but are available from the corresponding authors upon a reasonable request with reference to this paper and a signed confidentiality agreement.

Results

Optimal spatiotemporal models

Table 2 lists the evaluation results for the two pairs of alternative spatiotemporal models. For the first 14 variables, between Model 1 and Model 2, the latter has larger pD values, which indicate that Model 2 is more complex, apparently because it incorporates a spatiotemporal interaction term that is not a part of Model 1. This higher complexity was beneficial, as it led to lower DIC values, thus indicating a better fit to the data. The higher quality of Model 2 is further confirmed by the lower LS values that represent a better predictive ability.

Table 2 Bayesian models’ evaluated results of 20 variables with the alternative spatiotemporal models (M1: parametric spatiotemporal model; M2: nonparametric spatiotemporal model; M3: spatiotemporal multivariable regression model).

To select the covariates from the first 14 variables to assist with the prediction of the last six variables, we first assessed the multicollinearity to select the variables whose variance inflation factor (VIF) <5. We then used the forward stepwise regression method to further select those variables that have a significant association (sig <0.05) with the target variable for modeling. The variables selected in this way were considered to have spatial and temporal structures similar to those of the last six variables (see supplementary file section S2 for details) and could be used in Model 3.

Because Model 3 included additional covariate terms, it has a higher complexity than Model 2, as indicated by the PD values in Table 2. This higher complexity brought better model fitness (lower DIC) and predictive ability (lower LS) to Model 3.

The comparison between Models 1 and 2 indicates the usefulness and necessity of including the main time trend and the space-time interaction. The comparison between Models 2 and 3 demonstrates the effectiveness of the proposed progressive modeling process. That is, easier-to-impute variables (variables with small percentages of missing values) can be helpful in the imputation of those more-difficult-to-impute variables (variables with large percentages of missing values). Through the two step experiment, we selected the nonparametric spatiotemporal model (Model 2) and the spatiotemporal multivariable regression model (Model 3) as the final models for the PST method to perform the imputation for our integrated dataset of China’s official statistics.

Cross-validation and sensitivity to missing value percentage

Figures 3 and 4 give the results of the cross-validation experiments with spatiotemporal Models 2 and 3. The scatter plots in Fig. 3 shows that under the 10% test set setting, the predicted values match the observed values for most variables well.

Figure 3
figure 3

Prediction scatter diagrams of 20 variables in the 10% simulation experiment.

Figure 4
figure 4

Evaluation of 20 socioeconomic variables in the 10%, 20% and 30% cross-validation simulation experiments with the PST method.

The MSE, SAE, and R2 consistently show that (Fig. 4) when the test set contains 10%, 20%, and 30% of all existing data in the dataset, the amount of available data for modeling in the training set decreases and the prediction error increases, but not dramatically. Since the percentage of missing data for a variable rarely goes up to 30% in our database, our models should be able to maintain an acceptable performance when applied to the database.

Furthermore, the mean SAE values of all 20 variables are less than 0.05 for all three test sets, thus indicating that the overall prediction error accuracy under these settings is less than 5%. It is noteworthy that the six variables handled by the second-step modeling, which have larger percentages of missing data, do not have considerably larger prediction errors than the 14 variables in the first-step modeling. Among all 20 variables, a total of 14 (not necessarily the first 14) have an SAE <5%. Among these 14 variables, X1, X2, and X19 are the best-estimated variables, with SAEs of approximately 1%. There are six other variables whose SAEs are between 5% and 10%, which is still acceptable.

Spatial SAE maps

We also calcluated the localized SAE for each county to reveal the spatial variation of the uncertainty (prediction error) in the results generated by PST. As an example, Fig. 5 is a map of the local SAE for the number of hospital beds (X14) in four years. Variable X14 has the highest SAE value among the 20 variables. The map shows that most counties (blue) have a prediction error <0.1 in all four years. The regions with high-quality predictions are stable during 2002–2011, whereas the regions with relatively low- quality of predictions (red) are few and scattered. The SAE maps further illustrate the effectiveness of the applied spatiotemporal model.

Figure 5
figure 5

Spatial SAE maps of variable X14 in the years (a) 2002, (b) 2005, (c) 2008 and (d) 2011.

Comparison of different imputation methods

Finally, using the 10% test set, we ran cross-validation to compare the proposed PST method with four other imputation methods, including kNN, EM, SVD and RF. The comparison evaluation is still based on the SAE, MSE and R2. The results are shown in Fig. 6. On all three indicators, PST outperforms all other methods for all variables. For instance, the mean prediction error of PST is less than 5%, whereas that of RF is between 5% and 10%, and those of the other three methods are all greater than 10% (the top panel of Fig. 3). For the four other methods we compared, the rank from best to worst is RF, SVD, kNN and EM, and kNN and EM are almost the same. For the large-scale spatiotemporal dataset, it is useful to consider the spatial and temporal random effects as the additional information for the missing data imputation.

Figure 6
figure 6

Evaluation of different imputation methods (EM, SVD, kNN, RF, and PST) for the 10% simulation dataset.

Discussion

In this study, we developed a sophisticated progressive spatiotemporal (PST) method and used it to estimate the missing values in China’s county-level official socioeconomic statistics. Our estimation covers the entire country for a 10-year period and includes 20 socioeconomic variables. We developed this procedure for estimating missing values in the official statistics dataset when auxiliary samples and covariate information are not available, which is a situation that prevails in China’s socioeconomic statistics (and is also likely in other countries’ similar datasets) but would not be well addressed by previous model-based methods5,6,7,8. We conducted a variety of evaluations, and they consistently prove the efficacy of the proposed PST method.

PST imputes missing values using a two-step progressive modeling strategy. First, based on the understanding that socioeconomic phenomena tend to agglomerate in space and time (e.g., well-developed cities tend to promote development of nearby towns, and a county’s development tends to maintain a smooth trend during a period)18, we tried to derive information from those county-years that do have data by borrowing information from the spatial and temporal structures in the data and their interactivity. This first step was implemented by constructing spatiotemporal models that incorporate items of spatial autocorrelation, temporal autocorrelation, and space-time interactions, under a Bayesian hierarchical modeling framework. The BHM method is effective in taking into account non-linear spatiotemporal associations as prior information. We found that for a large country such as China, when a variable’s percentage of missing data is <15% in each year, it is possible to achieve high-quality estimation based only on the information derived from the spatial and temporal structures in the existing data. This study is a pilot study on applying this framework to the estimation of missing data in large spatiotemporal databases.

Second, when a variable has a large percentage of missing data, e.g., >85%, taking into account the imputation results of those easier-to-impute variables (variables with small percentages of missing values) can be helpful. For this purpose, we adopted a progressive strategy and implemented a two-step modeling process. That is, if some variables have been well estimated in the first step, they can be further used as covariates in the estimation for those more- difficult-to-impute variables and combined with nearby spatial and temporal information by constructing spatiotemporal multivariable regression models. This second step turned out to be effective in the estimation of the six more- difficult-to-impute variables in our study.

By comparing PST with four widely used imputation methods, including kNN, SVD, EM, and RF, we confirm that PST had a better prediction accuracy and reduced residuals compared to the other methods. The good performance of PST is greatly due to its capability to incorporate spatial and temporal autocorrelation effects, which the other four methods lack but is important for a large-scale spatiotemporal dataset. Among the other four methods, the RF method performed the best compared to the kNN, SVD, and EM imputation methods, and this result is consistent with other studies17,45. Especially, when a county has missing values for all variables46, which means that no covariates exist to estimate the target variable (covariates are fundemental to RF), PST is able to first impute those easy-to-impute variables based solely on the spatial and temporal structure information and then uses the imputation results of the easy-to-impute variables to impute those more-difficult-to-impute variables. The PST method is especially useful for the case without any additional information to use for imputation. The cross-validations also demonstrate that the performance of PST remained acceptable when the percentage of missing values went up to 30%.

The two-step PST method is not limited to the specific socioeconomic statistics variable that we have been working on, and its usefulness can be generalized. The entire procedure can be adapted and applied to the estimation of missing data for other large-scale spatiotemporal datasets. The immediate outcome of this study is a complete county-level socioeconomic dataset of China with 20 variables over a 10-year period, which should be the first of its kind. This new dataset should be of great value to multi-disciplinary research and policy-making practices.

There are some limitations to this study. This imputation method did not consider that some counties that failed to provide the required official statistics data in all ten years are also counties that are far less developed than their neighbors. Thus, assuming a smooth spatial structure when imputing missing data for these counties may result in an over-estimation. A possible solution may be to obtain more local data (unit-level) in these counties from other private sources and apply multilevel mixed models combined with the spatiotemporal models in future research. In addition, since the China National Bureau of Statistics has never publicized the standards it uses (e.g., the sampling range or the sampling method), data inconsistency has been a big concern. At this time, no other openly published county-level socioeconomic dataset is available for us to verify the data that we used in this study. Encouragingly, the results of the cross-validations indicate that even with the existence of data inconsistency, our model can still achieve a good performance and is thus valuable in imputing missing data for the official statistics. Nevertheless, data standardization is an important issue to be considered in future studies.