Background & Summary

Achieving net-zero emissions by 2050 requires accelerating the transition to a carbon-neutral world powered by emerging energy technologies1. The electricity sector, which accounts for 40% of worldwide carbon dioxide (CO2) emissions, is a critical focal point for decarbonization2. Consequently, nations have implemented policies to transform electricity systems and reduce conventional energy use, thus promoting sustainable development3,4,5. China is the world’s largest consumer of electricity, accounting for 31% of global electricity consumption in 2022, and its electricity system suffers from operational inefficiencies and structural irrationalities2. In this context, high spatio-temporal resolution data can help reveal such problems and inform regional energy transition policies6. However, official statistical data are only available at the county level and above, which makes it challenging to estimate the distribution of consumption and analyze spatio-temporal dynamics at fine scales7. Therefore, advanced methods are needed to accurately estimate electricity consumption at high spatio-temporal resolution.

Approaches to estimating electricity consumption at high spatio-temporal resolution fall into “bottom-up” and “top-down” categories8,9,10. The “bottom-up” method accumulates data from individual units into higher aggregates, ensuring high accuracy at granular levels11. For example, total regional demand is calculated by aggregating electricity usage data from buildings12. However, the “bottom-up” approach is labor-intensive and time-consuming, which makes it impractical for processing massive long-term datasets. This becomes particularly challenging when swift data generation is essential for decision-making or emergency responses13.

The “top-down” approach offers an effective alternative14,15. This approach utilizes open-source big data, such as satellite imagery and socio-economic data (e.g., gross domestic product (GDP), population density), serving as proxies for data downscaling16,17,18,19,20. For instance, Zhou et al.21 developed a spatial disaggregation index using multi-source variables to estimate high-resolution building energy consumption. Furthermore, machine learning techniques effectively extract non-linear relationships between variables, enabling the “top-down” method to estimate electricity consumption accurately with minimal loss, thus improving the model’s precision. Chen et al.22 used the particle swarm optimization-back propagation (PSO-BP) algorithm to estimate electricity consumption at the 1 km × 1 km grid using nighttime lights as a proxy. This study demonstrated the synergy between big data and machine learning in downscaling electricity consumption. However, most existing datasets provide only annual data, limiting applications that require monthly data (e.g., energy demand forecasting23 or short-term effect analysis24). According to our survey, monthly electricity consumption data with high spatial resolution are not currently available in China.

Machine learning can effectively extract non-linear relationships between variables25,26 but struggles to capture spatial correlations27, which are vital for accurately rendering detailed geographic information in high-resolution analyses28. Spatial correlation implies that geographically proximate objects are more likely to share similar attributes, and overlooking it increases prediction errors29. Kriging interpolation, an advanced technique grounded in spatial statistics, identifies spatial correlations by analyzing distances and relationships among objects30,31. This method has proven valuable in various downscaling applications, including estimations of population32,33, emissions34, precipitation35 and temperature36,37. Therefore, integrating kriging interpolation is a natural way to address this shortcoming of machine learning and improve prediction accuracy.

To address the aforementioned shortcomings, this study introduces a hybrid downscaling model that integrates machine learning with kriging interpolation38. Using multi-source data, the model estimates the monthly electricity consumption of 280 major cities from April 2012 to December 2019 at 1 km × 1 km spatial resolution. This research makes three contributions: (1) we created the first fine-grained electricity consumption dataset for China at monthly, 1 km × 1 km resolution; (2) the proposed method can extract complex variable correlations, which improves estimation accuracy at different spatio-temporal scales; and (3) kriging interpolation characterizes spatial correlations and also effectively resolves the mismatch between predicted values and statistical data.

Methods

The workflow for estimating high spatio-temporal resolution electricity data in China is shown in Fig. 1. First, we obtained electricity statistics and high spatio-temporal resolution multi-source data from open platforms to serve as variables. Second, by integrating machine learning with kriging interpolation, we developed a “top-down” hybrid model to generate the final results. Finally, we verified the effectiveness of the data through spatio-temporal analysis.

Fig. 1 The framework of the study.

Data collection and processing

This section summarizes the data products and corresponding preprocessing used to estimate high spatio-temporal electricity consumption data. Table 1 provides information on the resolution, source, and role of all datasets.

Table 1 Information on datasets for high spatio-temporal electricity estimation.

Statistical electricity data

Total electricity consumption data come from statistical bureaus across China and cover 280 major prefecture-level cities (including urban and rural areas). The data comprise three spatio-temporal resolutions. The first is annual electricity consumption at the prefecture-level city scale, with a coverage rate of nearly 95%; missing values were filled in by linear interpolation. The second and third are monthly electricity consumption for prefecture-level cities and annual electricity consumption for counties, which provide finer temporal and spatial resolution, respectively. Owing to limited data availability, only approximately 1,500 and 2,000 records, respectively, were obtained for these two datasets.
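The gap filling of the annual city series can be sketched as follows. This is a minimal illustration, assuming each city has a yearly series in which `None` marks a missing year; the function name `fill_linear` and the sample values are hypothetical and are not taken from the dataset. The text does not say how gaps at the start or end of a series were handled, so the sketch leaves them untouched.

```python
from typing import List, Optional


def fill_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) by linear interpolation between known neighbours.

    Leading and trailing gaps stay None, because the source text does not
    describe any extrapolation rule.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of known points and fill what lies between.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical annual consumption for one city (arbitrary units).
series = [100.0, None, 120.0, None, None]
print(fill_linear(series))
```

Interpolating only between observed years keeps the filled values bounded by real records, which seems the more conservative reading of "linear interpolation" here.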

Multi-source high spatio-temporal data

Previous studies usually focus on the effect of a single data source (e.g., nighttime lights19) on electricity consumption and do not integrate multiple factors. To accurately estimate electricity consumption at a fine scale, we incorporated seven high spatio-temporal resolution variables: nighttime lights, average temperature, CO2, population density, GDP, building height, and building surface. These variables characterize urban economic development, material stock and technological progress in multiple dimensions. They have demonstrated strong correlations with electricity data and are drivers for accurately capturing electricity consumption patterns39,40,41,42. The building height and surface datasets each span five years. To address this temporal inconsistency, we use each building dataset to represent the two adjacent years; for example, the 2020 building datasets were used for both 2018 and 2019. This approach accounts for building change cycles, limiting the impact of temporal variation on feature representation43. Because the data sources differ in coordinate system and resolution, we converted all data to the Albers equal-area coordinate system and resampled them to 1 km resolution to facilitate attribute extraction and model training.

Property calculations

Considering that electricity consumption arises almost entirely on built-up land, this study retains only built-up areas, identified from land use data44, to reduce estimation error. Additionally, given the large differences in the magnitude of electricity consumption across spatio-temporal scales, this study follows Cheng et al.38 in using electricity intensity (monthly electricity consumption per 1 km grid) as the dependent variable. For example, for the annual electricity dataset, electricity consumption is divided by the built-up area and by twelve months. The same approach is applied to the dependent variable. In this way, all variables are in a uniform feature space, ensuring that model training is efficient and robust.
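The normalization to electricity intensity amounts to a simple division, sketched below. The function name, argument names and numbers are illustrative assumptions; the units of the source statistics are not stated in the text, so none are implied here.

```python
def electricity_intensity(consumption: float, built_up_km2: float, months: int = 1) -> float:
    """Average consumption per km² of built-up land per month.

    `months` is 12 for an annual record and 1 for a monthly record, so that
    records from both kinds of statistics end up on the same scale.
    """
    if built_up_km2 <= 0:
        raise ValueError("built-up area must be positive")
    if months <= 0:
        raise ValueError("months must be positive")
    return consumption / built_up_km2 / months


# Hypothetical annual record: 2400 units consumed over 100 km² of built-up land.
annual = electricity_intensity(2400.0, 100.0, months=12)

# Hypothetical monthly record for the same city: 150 units in one month.
monthly = electricity_intensity(150.0, 100.0)

print(annual, monthly)
```

Putting annual and monthly records on a common per-area, per-month scale is what allows a single model to be trained on all three statistical datasets.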

Constructing fine spatial and temporal scale methods for electricity estimation

This section introduces a hybrid downscaling model for electricity estimation, combining an improved XGBoost (eXtreme Gradient Boosting) algorithm with kriging interpolation (see Supplementary Figure S1 for details). We utilize processed electricity data as the dependent variable and multi-source spatio-temporal data as independent variables. The XGBoost model, coupled with incremental learning, is employed to extract features across various spatio-temporal scales for training. Subsequently, the output grid results are refined using kriging interpolation to capture geographic autocorrelation features and perform corrections, resulting in high-resolution electricity data products.

Estimation model of electricity consumption

The low resolution of annual electricity data may lead to substantial estimation errors when the model is trained on these data alone. Therefore, we combine finer-grained electricity data with an incremental learning approach that processes the data streams progressively in order to merge, refine and enhance the accumulated information45. This approach enables the model to comprehensively reveal electricity consumption patterns at different spatio-temporal scales.

Specifically, the first step is to build a base model using XGBoost and train it on annual city electricity data. XGBoost is a gradient boosting tree model that combines multiple weak learners and improves generalization by using regularization to prevent overfitting. The second step is spatial incremental learning (XGBoost-SIL): we keep the trained model’s tree node weights and add several trees to fine-tune the model on annual county data. The same approach is applied in the third step, in which monthly city data are added for temporal incremental learning (XGBoost-STIL). To assess the effectiveness of spatio-temporal incremental learning, 20% of the annual county data (spatial-test data) and of the monthly city data (temporal-test data) are held out as test datasets for the three models. A parameter grid search and five-fold cross-validation are used during training to improve the generalization performance and stability of the model.
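The idea of keeping the existing trees and appending new ones can be illustrated with a toy gradient-boosting loop. The sketch below is not XGBoost and is not the authors' implementation: it uses one-split regression stumps on a single synthetic feature, and every name, value and data point in it is an assumption made for illustration. In the xgboost Python package, continued training of an existing booster is available through the `xgb_model` argument of `xgboost.train`; whether the authors used that route is not stated.

```python
import numpy as np

LEARNING_RATE = 0.3  # shrinkage applied to every tree (illustrative value)


def fit_stump(x, residual):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for threshold in np.unique(x)[:-1]:  # dropping the max keeps both sides non-empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_mean) ** 2).sum() + ((residual[~left] - right_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict(trees, x):
    """Sum the shrunken outputs of all trees."""
    out = np.zeros(len(x))
    for threshold, left_mean, right_mean in trees:
        out += LEARNING_RATE * np.where(x <= threshold, left_mean, right_mean)
    return out


def boost(trees, x, y, n_new):
    """Keep the existing trees unchanged and append n_new trees fitted to residuals."""
    trees = list(trees)
    for _ in range(n_new):
        residual = y - predict(trees, x)
        trees.append(fit_stump(x, residual))
    return trees


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic stand-ins: a coarse "annual city" stage and a finer "annual county" stage
# whose relationship differs slightly from the first.
rng = np.random.default_rng(0)
x_city = rng.uniform(0, 10, 200)
y_city = 2 * x_city + rng.normal(0, 1, 200)
x_county = rng.uniform(0, 10, 200)
y_county = 2 * x_county + 5 * (x_county > 5) + rng.normal(0, 1, 200)

base_trees = boost([], x_city, y_city, n_new=20)            # step 1: base model
sil_trees = boost(base_trees, x_county, y_county, n_new=10)  # step 2: add trees on new data

rmse_before = rmse(y_county, predict(base_trees, x_county))
rmse_after = rmse(y_county, predict(sil_trees, x_county))
print(f"county RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

The first 20 trees are identical in both models; only the appended trees adapt to the finer data, which is the sense in which the earlier knowledge is retained.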

Because kriging interpolation is only capable of spatial interpolation, it cannot disaggregate annual data into monthly data. Therefore, this study used the above model to generate monthly county electricity density data as an intermediate product. This dataset is further adjusted by dasymetric mapping46 and used as the baseline dataset for the subsequent area-to-point kriging interpolation.
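One common form of dasymetric adjustment is proportional allocation: a known total for an area is redistributed over its sub-units in proportion to an ancillary weight, so that the parts sum back to the total. The sketch below shows that generic form only. The exact adjustment used in the study is not described in this section, and the function name and numbers are hypothetical.

```python
from typing import List


def allocate_proportionally(total: float, weights: List[float]) -> List[float]:
    """Split `total` across units in proportion to non-negative weights."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("at least one weight must be positive")
    return [total * w / weight_sum for w in weights]


# Hypothetical example: an area total of 100 units and model-predicted
# intensities for four sub-units, used here as the ancillary weights.
shares = allocate_proportionally(100.0, [1.0, 3.0, 0.0, 6.0])
print(shares)
```

A unit with zero weight (for example, one with no built-up land) receives nothing, and the allocated values always sum to the known total.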

Electricity correction and mapping

Machine learning methods excel at capturing nonlinear relationships between variables but fall short in capturing the geographical spatial correlations. We used kriging interpolation to fill this gap. Specifically, area-to-point kriging is a modification of ordinary kriging that allows for low to high resolution redistribution47. This ensures the consistency of residuals across all grids within the same area. The residual for each grid is determined through a weighted linear combination of nearby coarse areas, following the unbiased weighting constraint.

$$\left\{\begin{array}{r}\widehat{{e}_{p}}={\sum }_{k=1}^{K}\lambda ({v}_{k})e({v}_{k})\\ {\sum }_{k=1}^{K}\lambda \left({v}_{k}\right)=1\end{array}\right.$$
(1)

where \(\widehat{{e}_{p}}\) signifies the estimated residual for the grid, K denotes the number of considered counties, \(e\left({v}_{k}\right)\) is the known area residual for county \({v}_{k}\), and \(\lambda \left({v}_{k}\right)\) represents the weight allocated to each county, calculated as follows:

$$\left\{\begin{array}{c}{\sum }_{l=1}^{K}\lambda ({v}_{l})\overline{C}({v}_{k},{v}_{l})-{X}_{p}=\overline{C}(p,{v}_{k}),\quad k=1,\ldots ,K\\ {\sum }_{l=1}^{K}\lambda ({v}_{l})=1\end{array}\right.$$
(2)

where \(\overline{C}({v}_{k},{v}_{l})\) indicates the covariance between areas \({v}_{k}\) and \({v}_{l}\), \(\overline{C}(p,{v}_{k})\) describes the covariance between the target grid and county \({v}_{k}\), and \({X}_{p}\) is the Lagrange multiplier that enforces the unbiasedness constraint. The covariance is derived from coarse to fine resolution by deconvolution, which generates point-support variograms from the input area data by minimizing the difference between the theoretically regularized variogram and the variogram of the input area data. Post-correction with kriging interpolation reduces errors by ensuring that the aggregate of the estimated results for the grids within a county matches the county’s total electricity consumption.
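Equations (1) and (2) together form a small linear system that can be solved directly once the covariances are known. The sketch below assumes the covariances are already available as plain arrays; the covariance and residual values are invented for illustration, whereas in the study they would come from the deconvoluted variogram. The function name `atp_kriging_weights` is hypothetical.

```python
import numpy as np


def atp_kriging_weights(area_cov: np.ndarray, point_cov: np.ndarray):
    """Solve Eq. (2) for the weights lambda(v_k) and the multiplier X_p.

    area_cov  : K x K matrix of area-to-area covariances C(v_k, v_l)
    point_cov : length-K vector of grid-to-area covariances C(p, v_k)
    """
    k = area_cov.shape[0]
    system = np.zeros((k + 1, k + 1))
    system[:k, :k] = area_cov
    system[:k, k] = -1.0  # coefficient of X_p in the first K rows of Eq. (2)
    system[k, :k] = 1.0   # unbiasedness constraint: the weights sum to one
    rhs = np.append(point_cov, 1.0)
    solution = np.linalg.solve(system, rhs)
    return solution[:k], solution[k]


# Made-up covariances for K = 3 neighbouring counties and one target grid.
area_cov = np.array([[1.0, 0.5, 0.2],
                     [0.5, 1.0, 0.4],
                     [0.2, 0.4, 1.0]])
point_cov = np.array([0.8, 0.5, 0.3])
county_residuals = np.array([2.0, -1.0, 0.5])  # e(v_k), also made up

weights, multiplier = atp_kriging_weights(area_cov, point_cov)
grid_residual = float(weights @ county_residuals)  # Eq. (1)
print(weights, multiplier, grid_residual)
```

The last row of the system is what forces the weights to sum to one, so a constant residual field is reproduced exactly.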

Accuracy assessment

The coefficient of determination (R2), the root mean square error (RMSE) and the spatio-temporal coefficient (FI) are used to evaluate model performance by quantifying the error between the predicted values (\({\widehat{y}}_{i}\)) and the true values (yi). The correlation coefficient (R) is used for data correlation analysis. FI, inspired by the F1-score, combines the R2 or RMSE values obtained on the temporal-test data (TD) and the spatial-test data (SD) into a single measure of the model’s validity across both:

$${F}_{I}=2\times \frac{{I}_{SD}\times {I}_{TD}}{{I}_{SD}+{I}_{TD}}$$
(3)

where I is the R2 or RMSE value. The final electricity consumption data are compared with related datasets to validate accuracy. Because no data are available at the same scale, the study uses national monthly data on total, residential, and industrial electricity consumption from statistical yearbooks for quantitative verification. Additionally, representative cities from diverse geographic locations, namely Beijing (North), Shanghai (East), Shenzhen (South), Chengdu (West), and Wuhan (Center), are chosen for spatial comparison with the annual grid electricity data (AGED) created by Chen et al.22.
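The three metrics can be written out in a few lines. The sketch below uses the standard definitions of R2 and RMSE together with Eq. (3); the function names are illustrative. As a check, plugging in the baseline RMSE values reported in the model performance analysis (239.561 on the spatial-test data and 137.072 on the temporal-test data) gives a value close to the reported FRMSE of 174.371.

```python
import math
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def f_index(i_sd: float, i_td: float) -> float:
    """Eq. (3): harmonic mean of a metric on spatial-test and temporal-test data."""
    return 2.0 * i_sd * i_td / (i_sd + i_td)


# Baseline XGBoost RMSE values quoted in the text (SD, then TD).
baseline_f_rmse = f_index(239.561, 137.072)
print(round(baseline_f_rmse, 2))
```

Because FI is a harmonic mean, it is pulled toward the weaker of the two test sets, so a model cannot score well by fitting only the spatial or only the temporal data.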

Data Records

The study estimated high-resolution total electricity consumption data for 280 major Chinese cities, selected on the basis of multi-source data availability, which together account for 90.6% of China’s electricity consumption (https://www.stats.gov.cn/). The dataset is stored in GeoTIFF (.tif) format in the archive “China_1km_Ele_201204_201912.zip” and is projected in the Albers equal-area coordinate system. The archive contains 93 .tif files, each labeled with the year and month, describing monthly electricity consumption in China. Details of the cities are also provided in .csv format. The dataset48 is freely available on Figshare (https://doi.org/10.6084/m9.figshare.25398559.v1).
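A quick way to check a download for completeness is to enumerate the months the dataset should cover. The sketch below only generates year-month labels; the `YYYYMM` label format is an assumption modelled on the archive name, and the actual file names inside the archive may differ.

```python
from typing import List, Tuple


def month_labels(start: Tuple[int, int] = (2012, 4), end: Tuple[int, int] = (2019, 12)) -> List[str]:
    """List every month from start to end (inclusive) as 'YYYYMM' strings."""
    year, month = start
    labels = []
    while (year, month) <= end:
        labels.append(f"{year}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


labels = month_labels()
print(len(labels), labels[0], labels[-1])
```

April 2012 to December 2019 spans 93 months, which agrees with the 93 .tif files stated for the archive.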

Technical Validation

The technical validation of this study comprises four parts: (1) analysis of the correlation between independent and dependent variables; (2) assessment of the model’s performance; (3) comparative analysis between our dataset and existing related datasets; and (4) analysis of the spatio-temporal patterns of high-resolution electricity consumption in China.

Variable correlation analysis

First, the correlation between the independent variables and electricity consumption was analyzed to provide a foundation for accurate estimation. As shown in Fig. 2, all independent variables have statistically significant correlations with the dependent variable, with p-values less than 0.001 and an average correlation coefficient of 0.52. Building height (0.65) and nighttime lights (0.64) show the strongest correlations with electricity consumption, underscoring the role of urbanization and economic activity in electricity demand. The correlation coefficients for GDP, population (POP), building surface, and CO2 fall within the range of 0.45 to 0.6, indicating the considerable influence of economic development, demographics, urban configuration, and environmental factors on electricity consumption patterns. Although the correlation between temperature and electricity consumption (0.26) is lower than for the other variables, the controlled variable experiments (Supplementary Table S1) verified that temperature further improves accuracy, which may be attributed to its effect during specific periods such as summer cooling. Controlled variable experiments were also conducted to verify the validity of each variable, as detailed in the Supplementary Information.

Fig. 2 Correlation analysis of electricity consumption with (a) Building height, (b) Building surface, (c) GDP, (d) POP, (e) Temperature, (f) Nighttime lights, and (g) CO2.

Model performance analysis

Table 2 shows the performance of the machine learning and incremental learning models in this study. The baseline XGBoost model achieved an R2 of 0.678 and an RMSE of 239.561 on the spatial-test dataset, and an R2 of 0.706 and an RMSE of 137.072 on the temporal-test dataset. The corresponding \({F}_{{R}^{2}}\) and \({F}_{RMSE}\) were 0.690 and 174.371, respectively. After integrating spatial incremental learning (XGBoost-SIL), performance on the spatial-test dataset improved markedly, with R2 and \({F}_{{R}^{2}}\) increasing to 0.895 and 0.763, while \({F}_{RMSE}\) decreased to 77.909. Further integrating temporal incremental learning (XGBoost-STIL) raised R2 above 0.9 on both datasets and reduced RMSE to around 60. The overall improvement is reflected in an \({F}_{{R}^{2}}\) of 0.911 and an \({F}_{RMSE}\) of 60.084, indicating that the model captures complex electricity consumption patterns more accurately across both datasets. These improvements show the benefit of integrating spatial and temporal incremental learning over the baseline model trained on annual city data alone.

Table 2 Performance of temporal-test dataset (TD) and spatial-test dataset (SD) in different methods (average results of five-fold cross-validation).

Dataset validation

Further validation and comparisons were conducted using official statistics and existing datasets, from quantitative and qualitative perspectives, respectively. In the absence of electricity data at the same resolution, we used monthly national total, residential and industrial electricity consumption for a quantitative correlation analysis. Subsequently, we conducted a comparative validation against the AGED to evaluate our dataset’s reliability. Fig. 3 shows the correlation of our results with official statistics. The correlation is 0.89 for total electricity consumption, 0.82 for industrial electricity consumption, and 0.93 for residential electricity consumption, with all p-values less than 0.001. These results indicate that the model reflects actual electricity consumption patterns across different sectors. Such statistically significant correlations with established benchmarks support the robustness of our dataset and its use in energy research and policy development.

Fig. 3 Correlation analysis of downscaling results with official statistics on (a) total electricity consumption, (b) industrial electricity consumption, and (c) residential electricity consumption.

We further compared our dataset with the AGED in five large cities in different regions by incorporating land use data49. As shown in Fig. 4, the AGED displays a flat distribution that fails to distinguish adequately between consumption patterns across different regions. A critical shortfall of the AGED is its inability to differentiate between built-up and non-built-up areas, so that electricity demand is mistakenly attributed to non-built-up areas such as vegetation and water bodies. In addition, its methodology lacks a correction against official statistics, which leads to errors. In contrast, our data capture the electricity consumption patterns of different land use types and avoid assigning electricity use to non-built-up zones. By incorporating kriging interpolation, our method corrects the estimates against statistical totals and captures spatial heterogeneity across high-resolution grids, improving the precision of the results.

Fig. 4 Distribution patterns of electricity consumption from this study and the AGED across urban land use types in Beijing (China’s capital, Northern), Shanghai (international economic center, Eastern), Shenzhen (China’s first special economic zone, Southern), Wuhan (central transportation hub) and Chengdu (Western China’s leading city).

This study also reveals the diversity of electricity consumption patterns across functional zones (e.g., residential, industrial, and commercial zones) within a city. Take Shanghai, which had the highest GDP among Chinese cities in 2019, as an example. Its high electricity demand areas are mainly located downtown, which includes the city’s central business district (CBD) and various commercial centers. The prosperity of these areas directly drives their substantial electricity demand. Similarly, Shenzhen, known for its high-tech industries, shows uniformly high levels of electricity consumption across the city. This is particularly pronounced in industrial zones and coastal logistics hubs, reflecting the city’s vibrant industrial production and international trade activities. In Wuhan, the areas with high electricity demand are mainly found along the Yangtze River, the central hub of the city with clusters of Grade A office buildings. There is also high electricity consumption in the northwest, primarily driven by the airport and industrial areas. These high-resolution analyses provide insight into disparities in urban energy consumption, which can help optimize the allocation of energy resources.

Spatio-temporal patterns of electricity consumption in China

In this study, we estimated electricity consumption on a 1 km × 1 km grid from April 2012 to December 2019. December 2019 was chosen to visualize high-resolution electricity distribution patterns in China (Fig. 5). The highly concentrated pattern of electricity consumption in the North China Plain reflects a dense population and thriving service and manufacturing industries. Northeastern China, a traditional industrial area, shows a medium density of hotspots despite economic restructuring. The central and southern regions have a dispersed pattern of electricity consumption due to terrain.

Fig. 5 Distribution patterns of electricity consumption in December 2019 at 1 km resolution for (a) China and typical urban agglomerations: (b) Beijing-Tianjin-Hebei, (c) Pearl River Delta, and (d) Yangtze River Delta.

The results also show that high electricity consumption is concentrated in the Beijing-Tianjin-Hebei (BTH), Yangtze River Delta (YRD) and Pearl River Delta (PRD) urban agglomerations. The electricity demand in these areas reflects not only their advanced levels of industrialization and urbanization but also their pivotal role in the national economy. The BTH is a hub of political and cultural significance in China, where key industries such as government services, finance, and information technology create high energy consumption. The YRD and the PRD, as the centers of China’s manufacturing and export sectors, also show high electricity consumption, highlighting the concentration of industrial activity and substantial energy needs.

In terms of temporal dynamics, our monthly data capture the seasonal fluctuations and trend variations in electricity consumption across the three urban agglomerations, as shown in Fig. 6. In the BTH and YRD, consumption increases from May to October and subsequently decreases, which may reflect the impact of climatic variations on electricity demand. In contrast, the PRD shows a stable monthly electricity consumption trend, a discrepancy that may be attributed to the distinct industrial structures of the regions. Temporal patterns of electricity consumption were further analyzed with land use data. Industrial areas recorded the highest proportion of electricity consumption, accounting for 43.2%, indicating that industrial production has a high demand for electricity throughout the year. Residential electricity consumption shows seasonal variations, especially during the summer peak season. Commercial and transport facility areas have relatively low electricity consumption throughout the year and show no significant seasonal fluctuations.

Fig. 6 Monthly electricity consumption in the three urban agglomerations in 2019, with details by land use type. (a) Monthly electricity consumption of the three urban agglomerations. The YRD’s electricity consumption exceeds that of the PRD and BTH, consistent with its larger GDP of 29.03 trillion yuan, contributed by Shanghai and three prosperous provinces, compared with the PRD’s 7.8 trillion yuan and BTH’s 6.9 trillion yuan. (b–d) Streamgraphs of monthly electricity consumption in the three urban agglomerations by land use type.

This study creates a high spatio-temporal resolution electricity dataset for China, filling an important data gap. The dataset reveals the intricate dynamics of electricity consumption and provides reliable data support for sustainable development research. Future studies can use these data to explore diverse energy scenarios, optimize prediction models, and formulate strategies to shift the world toward a more sustainable and efficient energy future.

Uncertainties and limitations

This study is subject to several sources of uncertainty. First, we mainly use socio-economic and environmental variables to estimate electricity consumption, without fully considering geographic factors. This may limit the model’s ability to capture electricity consumption patterns in diverse regions, such as the differences between southern and northern China due to heating and cooling demands. Northern cities have higher heating demands in winter, while southern cities have higher cooling demands in summer50. Although our model considers temperature data, temperature alone cannot directly reflect these seasonal differences. Future studies should consider more geography-related variables, such as Heating Degree Days (HDD) and Cooling Degree Days (CDD)51. In addition, regional modeling based on climate zones could reduce geographic uncertainty and improve model accuracy.

Uncertainty in the input datasets is a further challenge. Although we included variables with full coverage and availability as far as possible, some relevant data are not included. For example, we used land use data in the analysis but did not integrate them into the downscaling model, which could improve the results52. Energy prices and energy types should also be considered. Moreover, spatio-temporal differences among the original variables (e.g., the GHSL data span 5 years) may affect the results. However, the unavailability of suitable spatio-temporal datasets limits the integration of such data in this study. Currently, our dataset covers 2012 to 2019 at the 1 km × 1 km scale. In the future, we will continue to track the availability of relevant data, refine our approach by incorporating more valuable data, and update the spatio-temporal coverage of the dataset.