City- and county-level spatio-temporal energy consumption and efficiency datasets for China from 1997 to 2017

Understanding the evolution of energy consumption and efficiency in China would contribute to assessing the effectiveness of the government’s energy policies and the feasibility of meeting its international commitments. However, sub-national energy consumption and efficiency data have not been published for China, hindering the identification of drivers of differences in energy consumption and efficiency, and implementation of differentiated energy policies between cities and counties. This study estimated the energy consumption of 336 cities and 2,735 counties in China by combining Defense Meteorological Satellite Program/Operational Line-scan System (DMSP/OLS) and Suomi National Polar-Orbiting Partnership/Visible Infrared Imaging Radiometer Suite (NPP/VIIRS) satellite nighttime light data using particle swarm optimization-back propagation (PSO-BP). The energy efficiency of these cities and counties was measured using energy consumption per unit GDP and data envelopment analysis (DEA). These data can facilitate further research on energy consumption and efficiency issues at the city and county levels in China. The developed estimation methods can also be used in other developing countries and regions where official energy statistics are limited. Measurement(s) fossil fuel Technology Type(s) machine learning

www.nature.com/scientificdata

have set reducing energy consumption per unit GDP as a government policy target to tackle climate change 17 . However, this indicator cannot reveal the improvement path of energy efficiency.
Second, the non-parametric method, in which energy, capital, and labor are regarded as inputs and GDP and pollutants are regarded as the expected and undesirable outputs, respectively, is applied to obtain energy efficiency. Here, energy efficiency means reducing energy consumption as much as possible in exchange for increased economic output and fewer pollutants 24 . The closer the actual energy performance (the output increase and pollutant reduction achieved in exchange for a reduction in energy consumption) is to the potential energy performance, the higher the energy efficiency. Data envelopment analysis (DEA) is the most widely used non-parametric method 24 . Its main advantages are that it makes the drivers of energy efficiency improvement easy to identify, allows relative energy efficiency to be compared between regions, and requires no prior specification of the estimation function. However, it requires socioeconomic data in addition to energy consumption and is relatively complicated to calculate.
Third, the parametric method, which takes energy, capital, and labor as independent variables and GDP as the dependent variable, is used to estimate energy efficiency 24 . Stochastic frontier analysis (SFA) is the most widely used parametric method 27 . However, because it relies on a prior specification of the regression function and requires socioeconomic data in addition to energy consumption, this approach is rarely adopted.
Although research on energy efficiency has practical value, neither Chinese official agencies nor international organizations have published energy efficiency data, unlike energy consumption data. This is mainly because energy efficiency data are derived from energy consumption data: once official agencies and international organizations publish energy consumption data, scholars and institutions can use the above methods to obtain energy efficiency data. However, the widely accepted energy efficiency data focus only on the national and provincial levels, because the energy consumption data published by the CNBS and international organizations are at the national and provincial levels 24,27 .
Socioeconomic development and energy consumption vary considerably between cities and counties at the sub-provincial level in China 27-29 . For example, in 2010, energy consumption was approximately 30 times higher in Tangshan than in Suqian 30 . Meanwhile, even cities with similar scales of energy consumption have different levels of economic development. For example, in 2010, the energy consumption of Wenzhou was similar to that of Jiayuguan, but the GDP of the former was nearly 15 times that of the latter 30 . However, official data are unavailable on city-and-county-level spatio-temporal energy consumption and efficiency in China. This hinders micro-level research on the drivers of energy consumption and the assessment of energy efficiency in China. Moreover, it limits central and provincial governments in setting differentiated energy efficiency targets and city and county governments in adopting targeted energy efficiency improvement initiatives. Numerous studies have therefore recently attempted to overcome these research constraints. These studies can be broadly classified into the following three categories.
First, city-level energy consumption data are collated from provincial and city statistical yearbooks 31-33 . Although the data from these studies are highly reliable because they are published by official agencies, they do not provide annual energy consumption data for every city, and the energy types counted vary between city statistics departments. This is mainly because the lower the government level, the less efficient its data disclosure 34 and the weaker the accounting capacity of its energy statistics department, particularly in developing cities.
Second, city-and-county-level energy consumption surveys are conducted to obtain energy consumption data at these levels, as well as at the enterprise and household levels 35-38 . The data obtained from these surveys are more fine-grained, accurate, and complete; however, annual energy consumption data cannot be obtained for all cities and counties, and the energy consumption status and change characteristics of cities and counties cannot be directly generalized. This is primarily due to the high labor and financial resources required to conduct these surveys 35 , the difficulty of including all cities and counties, and the difficulty of conducting annual surveys.
Third, satellite data and machine learning methods are used to determine city-level CO2 emissions related to energy consumption. For example, Chen et al. 39,40 employed particle swarm optimization-back propagation (PSO-BP) to construct quantitative relationships between provincial nighttime light data and statistical provincial CO2 emission data in China, and applied top-down data inversion to derive CO2 emission data for all cities and counties in China, using the total lighting brightness of these cities and counties as weights. Yue et al. 41 used econometric regression and two nighttime light datasets to obtain 1-km resolution energy consumption data at the regional scale for China. Although these studies do not provide energy consumption data at the city and county levels, they offer a useful way to obtain annual energy consumption data for cities and counties in China.
Our study fills this gap by inverting satellite data with machine learning methods to obtain spatio-temporal energy consumption data for 336 cities and 2,735 counties in China from 1997 to 2017, combining Defense Meteorological Satellite Program/Operational Line-scan System (DMSP/OLS) and Suomi National Polar-Orbiting Partnership/Visible Infrared Imaging Radiometer Suite (NPP/VIIRS) satellite nighttime light datasets using PSO-BP. Based on these energy consumption data and related socioeconomic data, the study also provides spatio-temporal energy efficiency data at the city and county levels, using the energy consumption per unit GDP and the ratio of actual to potential energy performance in the non-radial directional distance function (NDDF) derived from DEA. The proposed methodology and datasets can be widely used in energy consumption and efficiency studies at the city and county levels in China, and can provide a reference for other developing countries and regions with limited energy statistics to analyze their sub-national energy consumption and efficiency.

Methods
Data scopes. In this study, the measured energy consumption and efficiency datasets are at the city and county levels in China. Cities and counties are the grassroots administrative regions in China and the basic units through which the Chinese government implements energy policies 23,42,43 . Energy consumption at the city and county levels excludes the production side of the jurisdiction and the energy consumption transferred across administrative regions.
Our study provides data on energy consumption for 336 cities (including 332 prefecture-level cities and 4 municipalities) and 2,735 counties in mainland China from 1997 to 2017 (excluding Hong Kong, Macau, Taiwan, and Tibet), based on the availability of identifiable satellite nighttime light data at a sufficient resolution. These regions cover approximately 87% of the land area, over 90% of the population, and 90% of the GDP. The definitions of cities and counties are those published in 2010 by the Ministry of Civil Affairs of China. Because the names of some cities and counties have changed since 2010, our study uses their 2010 names for consistency. A more detailed explanation is given in the data files 44 .
Our study also provides two energy efficiency datasets. First, energy efficiency was calculated as energy consumption per unit GDP, covering 336 cities and 2,489 counties (1997-2017). Fewer cities and counties are included in some years because of missing GDP data. Second, energy efficiency was calculated as the ratio of actual to potential energy performance (also known as the energy efficiency performance index, EEPI) using the NDDF of DEA, covering 189 cities (2003-2016). This dataset includes fewer years and cities, and no counties, mainly because of the lack of the basic socioeconomic data used as input variables of the NDDF, for example, fixed asset investment and the fixed asset investment price index.
Matching nighttime light data from two types of satellites. The city-and-county-level energy consumption data in this dataset were inferred from two satellite nighttime light datasets: DMSP/OLS and NPP/VIIRS 45 . The DMSP/OLS data are derived from the OLS scanners from 1992 to 2013 45,46 , and the NPP/VIIRS data from the VIIRS scanners from 2012 onward 47 . Although these two nighttime light datasets offer a long time span and wide spatial coverage, and are widely accepted and applied, they cannot be directly combined. This is mainly because the two sets of nighttime light images come from different sensors, are collected at different times and locations, and differ at the pixel level. For example, inconsistencies in the acquisition time and cloud cover of the nighttime light images in 2013 resulted in a large gap between the DMSP/OLS and NPP/VIIRS images at the pixel level. Moreover, the two datasets have different discontinuities and saturation levels, and differ in spillover and white noise. Thus, to obtain a long and continuous nighttime light dataset (covering the years before and after 2013), we spliced the two satellite nighttime light datasets following the procedures adopted by Liu et al. 48 , Lv et al. 49 , and Chen et al. 39 .
First, inter-calibration 50 , radiometric calibration 51 , intra-annual composition 52,53 , and inter-annual series correction 53 were adopted to correct the DMSP/OLS data and eliminate pixel under-saturation and spillover as well as discontinuities, incomparability, and instability. The inter-calibration method consists of on-board calibration and additional processing of the annual composites of nighttime light data. Because two satellite sensors can transfer nighttime light data in the same year, such as F14 and F15 in 2001, we applied intra-annual composition to improve the stability of lit pixels, as follows: where $DN_{(n,i)}$ represents the digital number (DN) values of lit pixel $i$ from the two satellite sensors in year $n$. The radiometric calibration method eliminates the differences between nighttime light data from two satellites in the same year when conducting comparative time series analysis. We applied the invariant region method to inter-calibrate the DMSP/OLS images, consistent with Wu et al. 51 , because it has better inter-calibration accuracy. A power function form was chosen, and the global radiance calibrated nighttime light (RCNTL) image for 2006 was treated as the reference image. The estimated parameters are shown in Table 1.
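The intra-annual composition equation is not reproduced in the text above. As a minimal sketch, assuming the rule commonly used in the DMSP/OLS literature (a pixel is kept only if both sensors record it as lit, and its value is the mean of the two DN values), the step could look as follows. The function name and the sample DN values are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def intra_annual_composite(dn_a: Sequence[float], dn_b: Sequence[float]) -> List[float]:
    """Combine DN values of the same pixels from two sensors in one year.

    Assumed rule (not recoverable from the text): a pixel that is unlit in
    either image is treated as unstable and set to 0; otherwise the two DN
    values are averaged.
    """
    if len(dn_a) != len(dn_b):
        raise ValueError("both images must cover the same pixels")
    composite = []
    for a, b in zip(dn_a, dn_b):
        if a == 0 or b == 0:
            composite.append(0.0)            # unstable lit pixel
        else:
            composite.append((a + b) / 2.0)  # mean of the two sensors
    return composite


# Hypothetical DN values for four pixels seen by F14 and F15 in 2001.
f14_2001 = [0, 12, 40, 63]
f15_2001 = [5, 10, 44, 63]
composite_2001 = intra_annual_composite(f14_2001, f15_2001)
print(composite_2001)
```

Averaging only pixels that both sensors record as lit is one way to suppress pixels that flicker between sensors; other compositing rules are possible.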
The inter-annual series correction method stabilizes the inter-annual variability of data from the same satellite. Following Hu & Huang 53 , it was performed by assuming that the DN value of each lit pixel would not decrease over time with economic development, as follows:

Moreover, because the higher-intensity summer light at high latitudes can interfere with the accuracy of the NPP/VIIRS data 49 , the monthly NPP/VIIRS data for June, July, and August were excluded, and the annual data were synthesized as the arithmetic mean of the nighttime light data for the remaining nine months. To prevent the removal of the abnormal nighttime light data for these three months from affecting the accuracy of matching, noise was removed in the process of converting the monthly NPP/VIIRS data into annual NPP/VIIRS data, consistent with Lv et al. 49 . Accordingly, we used a Gaussian low-pass filter with a 5 × 5 pixel window to smooth the NPP/VIIRS annual data and reduce spatial variability 54 for better matching with the DMSP/OLS annual data. The Gaussian low-pass filter parameter δ was set to 1.75 54,55 , and negative values were replaced with zero. To further reduce the effect of white noise, a threshold of $0.3\ \mathrm{nW\,cm^{-2}\,sr^{-1}}$ was set in the annual image, and data smaller than this threshold were excluded 52 . Fig. 1 shows the differences between the DMSP/OLS images in 2013 and the NPP/VIIRS images in 2017 before and after pre-processing. A comparison of Fig. 1a and 1b shows that the processed images resolve the saturation issue of the original images, in which all DN values of the saturated pixels were 63. Meanwhile, compared with Fig. 1c, the processed images in Fig. 1d have eliminated negative values and match the DMSP/OLS images better in terms of the distribution of nighttime light values.
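The NPP/VIIRS pre-processing chain described above (nine-month mean, 5 × 5 Gaussian low-pass filter with parameter 1.75, removal of negative values, and the 0.3 noise threshold) can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' program: images are small nested lists, border pixels are smoothed with only the neighbours that exist, and "excluded" pixels are set to zero.

```python
import math
from typing import Dict, List

Image = List[List[float]]

SUMMER_MONTHS = {6, 7, 8}   # June, July, August are excluded
THRESHOLD = 0.3             # nW cm^-2 sr^-1


def annual_mean(monthly: Dict[int, Image]) -> Image:
    """Arithmetic mean of the nine non-summer monthly images."""
    kept = [img for month, img in monthly.items() if month not in SUMMER_MONTHS]
    rows, cols = len(kept[0]), len(kept[0][0])
    return [[sum(img[r][c] for img in kept) / len(kept) for c in range(cols)]
            for r in range(rows)]


def gaussian_kernel(size: int = 5, sigma: float = 1.75) -> Image:
    """Normalized size x size Gaussian weights."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(map(sum, raw))
    return [[w / total for w in row] for row in raw]


def smooth(image: Image, kernel: Image) -> Image:
    """Low-pass filter; border handling by renormalization is an assumption."""
    rows, cols, half = len(image), len(image[0]), len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = weight = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr, cc = r + dy, c + dx
                    if 0 <= rr < rows and 0 <= cc < cols:
                        k = kernel[dy + half][dx + half]
                        acc += k * image[rr][cc]
                        weight += k
            out[r][c] = acc / weight
    return out


def clean(image: Image) -> Image:
    """Set negative values and values below the noise threshold to zero."""
    return [[v if v >= THRESHOLD else 0.0 for v in row] for row in image]


# Hypothetical 3 x 3 monthly images whose radiance equals the month number.
monthly = {m: [[float(m)] * 3 for _ in range(3)] for m in range(1, 13)}
kernel = gaussian_kernel()
annual = clean(smooth(annual_mean(monthly), kernel))
```

In practice a raster library would replace the nested loops; the sketch only makes the order of operations explicit.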
Second, PSO-BP was used to match the DMSP/OLS data with the NPP/VIIRS satellite nighttime light dataset, as the two differ in sensors and resolution. PSO, proposed by Kennedy & Eberhart 56 , is an intelligent global optimization algorithm inspired by the cooperative behavior of bird flocks. Each particle in the swarm represents a possible solution to the problem and has a fitness value, and its position and velocity are guided by the best positions found so far 57 . After each position update, the particle's fitness value is recalculated and its position and velocity are updated again, so that the particles approach the optimal solution after repeated iterations 57 . Rumelhart et al. 58 developed a multilayer feedforward neural network algorithm with a three-layer network topology, also known as the BP network algorithm, which comprises an input layer, a hidden layer, and an output layer. The number of neurons in the hidden layer and the transfer function between nodes depend on the specific object under study, and there is no unified principle for selecting them 59,60 . The BP algorithm is a local search algorithm that performs well in constructing regressions; however, it can become trapped at local extremes, which may cause training to fail 61 . The PSO algorithm is a global search algorithm with the potential to find globally optimal results 62 . Combining the two algorithms helps to improve computational effectiveness 63 . Ismail et al. 64 , Mohamad et al. 63 , Lee & Cheng 57 , and Wang et al. 28 have successively combined these two algorithms and proposed the PSO-BP neural network algorithm.
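The particle update described above can be illustrated with a minimal, generic PSO in plain Python. This sketch covers only the swarm search; in PSO-BP the fitness would be the training error of the BP network as a function of its weights, and the gradient-based BP stage is omitted here. The inertia weight `w` and the acceleration coefficients `c1` and `c2` are assumed values, and the `sphere` test function is purely illustrative.

```python
import random
from typing import Callable, List, Tuple


def pso_minimize(
    fitness: Callable[[List[float]], float],
    dim: int,
    n_particles: int = 20,     # population size used in the study
    n_iter: int = 50,          # maximum number of iterations used in the study
    bounds: Tuple[float, float] = (-5.0, 5.0),
    w: float = 0.7, c1: float = 1.5, c2: float = 1.5,   # assumed coefficients
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Return the best position, its fitness, and the best fitness per iteration."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position of each particle
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm
    history = [gbest_val]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends inertia, pull to own best, pull to swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])               # recalculate fitness after the move
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def sphere(x: List[float]) -> float:
    """Illustrative fitness with its minimum at the origin."""
    return sum(v * v for v in x)


best_pos, best_val, history = pso_minimize(sphere, dim=2)
```

Because the swarm-best value is only replaced by a strictly better one, `history` never increases, which is the property that lets PSO supply a good starting point before local BP training.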
This improved algorithm can overcome the shortcomings of the traditional BP neural network, such as poor learning stability, low reliability, and a tendency to become trapped in local minima. The specific steps are as follows 63,65 :

To build the neural network model, we set the county-level NPP/VIIRS pixel averages for 2013 as input parameters, and derived the central geographical coordinates of each county in China using the minimum boundary method, following Chen et al. 40 . By applying the Lambert projection and ArcGIS zonal statistics, we obtained the area of each county using ArcMap 10.5, and used these coordinates and the area of each county as supplementary input parameters. This approach reduces matching errors and improves the integration accuracy of the two sets of satellite nighttime light data. We log-transformed the input parameters following Li et al. 54 . The 2013 county-level DMSP/OLS pixel averages were used as the output parameters.
The computational program developed by Ismail et al. 64 and Mohamad et al. 63 was used as a reference for the inter-calibration of the nighttime light data from DMSP/OLS and NPP/VIIRS. The BP neural network has three layers, with five nodes in the hidden layer; the maximum number of iterations is 50, and the population size is 20 54 . There are 2,735 counties in mainland China; we randomly selected 2,000 counties as the training set and the remaining 735 counties as the test set. However, even though the algorithm achieved a high coefficient of determination, the large amount of data might still cause problems such as anomalous values, fluctuations, jumps, and breaks in the conversion process. Continuous satellite nighttime light data might then not be obtained, leading to large errors in the estimated energy consumption data. To reduce this error, we further converted the NPP/VIIRS data to DMSP/OLS-scale data and calculated the annual growth of the converted NPP/VIIRS data from 2013 to 2017. For example, the annual growth of the NPP/VIIRS data for 2014 and the existing DMSP/OLS data for 2013 provided the simulated 2014 DMSP/OLS data. The smoothed DMSP/OLS data for 2014-2017 were then obtained from this simulation 40 , as shown in the following equations.
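The equations for this chaining step are not reproduced in the text above. One reading consistent with the description is that each simulated DMSP/OLS value equals the previous year's value multiplied by one plus the annual growth rate of the converted NPP/VIIRS series. The sketch below illustrates that reading for a single county; the function name, the zero-growth fallback, and all numbers are hypothetical.

```python
from typing import Dict


def simulate_dmsp(dmsp_2013: float, converted_viirs: Dict[int, float]) -> Dict[int, float]:
    """Chain annual growth rates of converted NPP/VIIRS values onto the
    observed 2013 DMSP/OLS value (assumed form of the missing equations)."""
    years = sorted(converted_viirs)
    simulated = {years[0]: dmsp_2013}
    for prev, curr in zip(years, years[1:]):
        if converted_viirs[prev] == 0:
            growth = 0.0   # assumption: no growth signal for an unlit base year
        else:
            growth = converted_viirs[curr] / converted_viirs[prev] - 1.0
        simulated[curr] = simulated[prev] * (1.0 + growth)
    return simulated


# Hypothetical converted NPP/VIIRS values for one county, 2013-2017.
converted = {2013: 8.0, 2014: 10.0, 2015: 12.0, 2016: 12.0, 2017: 15.0}
series = simulate_dmsp(dmsp_2013=10.0, converted_viirs=converted)
```

Using growth rates rather than converted levels keeps the series anchored to the observed 2013 DMSP/OLS value, so any level error in the conversion does not create a jump at the splice year.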

Estimation of satellite-based energy consumption at the city and county levels.
For the inversion of county-level energy consumption data, the provincial energy balance sheet data published in the China Energy Statistics Yearbook and the previously mentioned PSO-BP neural network algorithm were used to establish the quantitative relationship between provincial energy consumption data and provincial nighttime light data. The sum of the DN values (SDN) and dummy variables for province identity and year were selected as input parameters, and provincial energy consumption was selected as the output parameter. The dummy variables were included in addition to the SDN values to control for the influence on provincial energy consumption of differences between provinces that do not change over time and differences between years that do not vary across provinces. For example, although two provinces may have the same DN values in a certain year, their energy consumption may still differ because of unobserved factors that vary between provinces. Similarly, although a province may have the same DN value in two years, its energy consumption may still differ because of unobserved factors that vary over time. The total energy consumption sample comprised 630 observations (21 years × 30 provinces), of which 400 were randomly selected as the training set and the remaining 230 as the test set. County energy consumption data 66-68 were obtained using the top-down method and a DN value-based weighted-average strategy. Finally, the energy consumption data for the 2,735 counties were aggregated to obtain energy consumption data for the 336 cities. The equation is as follows:

represents province lighting brightness, $PE_{i,t}$ represents province energy consumption, $i$ represents county ($i = 1, 2, 3, \ldots, 2735$), $j$ represents province ($j = 1, 2, 3, \ldots, 30$), and $t$ represents year ($t = 1997, 1998, 1999, \ldots, 2017$).
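The allocation equation is not reproduced in the text above. A minimal sketch of a top-down, brightness-weighted allocation for one year is given below, assuming that each county receives its province's energy consumption in proportion to its share of the province's summed DN values and that provincial brightness is the sum over its counties. All names and numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def allocate_energy(
    province_energy: Dict[str, float],
    county_sdn: Dict[str, float],
    county_to_province: Dict[str, str],
) -> Dict[str, float]:
    """Split provincial energy consumption across counties by SDN share."""
    province_sdn: Dict[str, float] = defaultdict(float)
    for county, sdn in county_sdn.items():
        province_sdn[county_to_province[county]] += sdn
    county_energy = {}
    for county, sdn in county_sdn.items():
        prov = county_to_province[county]
        total = province_sdn[prov]
        # An unlit province receives no allocation (assumption).
        county_energy[county] = province_energy[prov] * sdn / total if total > 0 else 0.0
    return county_energy


def aggregate_to_cities(
    county_energy: Dict[str, float], county_to_city: Dict[str, str]
) -> Dict[str, float]:
    """Sum county-level estimates to obtain city-level energy consumption."""
    city_energy: Dict[str, float] = defaultdict(float)
    for county, energy in county_energy.items():
        city_energy[county_to_city[county]] += energy
    return dict(city_energy)


# Hypothetical province with three counties belonging to two cities.
county_energy = allocate_energy(
    province_energy={"P": 100.0},
    county_sdn={"A": 30.0, "B": 50.0, "C": 20.0},
    county_to_province={"A": "P", "B": "P", "C": "P"},
)
city_energy = aggregate_to_cities(county_energy, {"A": "X", "B": "X", "C": "Y"})
```

By construction the county estimates in a province sum back to the provincial total, which is what allows the provincial comparison used later for validation.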
Calculation of energy efficiency based on two methods. This dataset uses the DEA and the IMA to measure energy efficiency at the city and county levels in China. DEA is a non-parametric method widely used to measure efficiency 69-71 . Researchers worldwide have adopted and continually improved this approach in their studies on energy efficiency 72-75 . Its advantage is that energy efficiency can be calculated without specifying a production function between inputs and outputs. Following Chen et al. 76 , Jebali et al. 77 , Wang et al. 78 , and Wu et al. 79 , we selected labor, capital, and energy as input variables, GDP as the expected output variable, and particulate matter (PM2.5) as the undesirable output variable. We chose the NDDF with an undesirable output as the non-parametric function. This is because, unlike the traditional Shephard distance function, the NDDF with an undesirable output considers both the expected output and the undesirable output (usually a pollutant), and it does not require price data. Moreover, compared with the traditional directional distance function, the NDDF with an undesirable output avoids overestimating efficiency when slacks exist. Furthermore, the NDDF can calculate the inefficiency value of each input and output factor 80-82 . Therefore, the NDDF is widely accepted for measuring energy efficiency 83-85 . Following Zhou et al. 86 , Zhang et al. 87 , and Lin & Chen 88 , we set the weights for the energy efficiency index EEPI as (0, 0, 0, 1/3, 1/3, 1/3) and the directions as (0, 0, 0, 0, 0, 0), with details as follows: $\vec{D}(\cdot)$ represents the directional distance function; and $K$ represents the capital stock, which is calculated by the perpetual inventory method based on the fixed asset investment data in the China City Statistical Yearbook: where $I_t$ is the current fixed asset investment, and $\alpha_t$ is the depreciation rate of fixed assets. Following Meng et al. 89 , $\alpha_t$ is assumed to be 9.6%, and $K_{t-1}$ is the capital stock of the previous period. After the fixed asset investment is deflated using the fixed asset investment price index, the capital stock at the beginning of the period is set as

get the annual concentration of PM2.5. $g$ is the direction vector; $w_E$, $w_Y$, and $w_B$ represent the standardized weights for the city's energy consumption, GDP, and PM2.5, respectively. $\beta_E$, $\beta_Y$, and $\beta_B$ represent the scaling factors for energy consumption, GDP, and PM2.5 in cities, respectively. In the constraint conditions introduced by s.t., $Z_n$ is the weight coefficient; $n$ is the city; and $\beta_E^*$, $\beta_Y^*$, and $\beta_B^*$ are the inefficiency values of energy consumption, GDP, and PM2.5, respectively.
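The perpetual inventory equation is not reproduced in the text above. Its standard form, which matches the symbols defined there, is $K_t = I_t + (1 - \alpha_t) K_{t-1}$. The sketch below applies that form with the 9.6% depreciation rate; the initial capital stock and the investment series are hypothetical inputs, since the study's initial-stock rule is not recoverable here, and investment is assumed to be already deflated by the fixed asset investment price index.

```python
from typing import List, Sequence


def capital_stock(
    investment: Sequence[float],
    initial_stock: float,
    depreciation: float = 0.096,   # 9.6%, as assumed in the study
) -> List[float]:
    """Perpetual inventory method: K_t = I_t + (1 - alpha) * K_{t-1}.

    `investment` must already be in constant prices; `initial_stock` is a
    hypothetical starting value supplied by the caller.
    """
    stocks = []
    k_prev = initial_stock
    for i_t in investment:
        k_t = i_t + (1.0 - depreciation) * k_prev
        stocks.append(k_t)
        k_prev = k_t
    return stocks


# Hypothetical city: initial stock of 1000 and two years of investment of 100.
stocks = capital_stock(investment=[100.0, 100.0], initial_stock=1000.0)
```

With these numbers the stock rises slowly because new investment only slightly exceeds the depreciation of the existing stock.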
The energy efficiency derived from the DEA method is generally considered more scientific and reliable; however, its calculation must be supported by other socioeconomic data, such as capital stock and labor. These data are not published by the statistical agencies of some cities and counties. Therefore, we could only provide energy efficiency data for 189 cities using the DEA.
Energy consumption per unit GDP has also been used as an approximate measure of energy efficiency in some studies, such as Bor 91 , Duro & Padilla 92 , Cheng et al. 93 , and Cheng et al. 94 . A region that consumes less energy to generate the same economic output is generally considered more energy efficient. Accordingly, we used the energy consumption per unit GDP to estimate the energy efficiency of 336 cities and 2,489 counties in China.

Data Records
This dataset provides a total of 6,085 data records (energy consumption and energy efficiency) 44

Technical Validation
Validity testing for energy consumption based on nighttime light data from satellites. The fitting results of the PSO-BP neural network algorithm are shown in Fig. 7. The coefficient of determination of the training set was 0.999, while those of the test and verification sets were 0.989 and 0.991, respectively. This shows that the model fits well and is valid in the training, testing, and verification stages. The overall coefficient of determination of the model was 0.9956, which is higher than the 0.734 obtained by Yue et al. 41 using an econometric estimation method. Furthermore, we aggregated the estimated county-level energy consumption data by province and compared the provincial sums with the official provincial energy consumption statistics in the China Energy Statistics Yearbook from 1997 to 2017. The latter were published by the official Chinese agency, CNBS, and served as the validation data, as shown in Fig. 8. The provincial energy consumption estimated in this study was highly correlated with the official provincial statistics, with a coefficient of determination of 0.9851. This shows that the energy consumption data obtained by fitting satellite nighttime light data are reliable.
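The validation above rests on the coefficient of determination between estimated and official values. A minimal sketch of that statistic, assuming the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ (the study may instead report the squared correlation from a fitted line), is shown below with hypothetical numbers.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], fitted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(fitted):
        raise ValueError("series must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical official provincial values and the corresponding estimates.
official = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(official, estimated)
```

Values close to 1 indicate that the aggregated estimates track the official statistics closely.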
Limitations and future work. Our datasets have three limitations, which point to future avenues for more accurate estimates of energy consumption and efficiency at the city and county levels in China. First, the city-and-county-level energy consumption data retrieved from satellite nighttime light data comprise only total energy consumption and cannot identify changes in the energy mix. Second, the provincial energy consumption data matched with satellite nighttime light data do not include energy consumption from land-use change and the use of non-commercial biomass fuels, which results in underestimation of the energy consumption of cities and counties. Third, because socioeconomic data such as labor and capital stock are unavailable for some cities and all counties, the energy efficiency data based on the DEA include only 189 cities between 2003 and 2016, even though energy consumption data are provided for 336 cities and 2,735 counties from 1997 to 2017.
Therefore, our future studies will focus on three aspects. First, in the retrieval of city-and-county-level energy consumption data, disaggregated energy consumption data and other socioeconomic variables should be explored and introduced. Second, provincial total energy consumption data covering more energy types should be obtained through investigation before comprehensively inverting the city and county energy consumption data. Third, the socioeconomic data of cities and counties should be obtained to support the use of DEA in evaluating the energy efficiency of all cities and counties in China.

Usage Notes
The city-and-county-level energy consumption and efficiency data provided in this study have practical applications in energy economics, management, and policy, including the following. First, the data have wide spatial coverage and a long time span. This unique panel-structured dataset can be used to observe the trajectories and spatial differences of energy consumption and efficiency at a more micro level than the national and provincial levels. It can therefore also be used to analyze the factors driving the changes and spatial differences in energy consumption and efficiency in cities and counties. Second, the panel-structured dataset of energy consumption and efficiency can be matched with other socioeconomic data at the city and county levels to study, for example, the economic effects of energy consumption, the environmental and social effects of energy consumption, and the coupling relationship between energy efficiency and economic development. Third, energy consumption and efficiency data at the city and county levels can not only contribute to energy management by China's grassroots-level governments, such as in the formulation and implementation of road maps for energy transition and energy efficiency improvement, but also provide a basis for the central and provincial governments to allocate the energy rights of cities and counties under the constraints of the "carbon peak" and "carbon-neutral" targets.
Fourth, energy consumption and efficiency data at the city and county levels could provide a more accurate assessment of the impact of the central government's energy saving, emissions reduction, and low-carbon green policies, as well as other socioeconomic policies, for example, the impact of the "central heating," "coal to electricity," low-carbon pilot city, and carbon emissions trading pilot policies on energy consumption and energy efficiency. Fifth, the method of retrieving micro-level energy consumption data from satellite nighttime light data can also provide a reference for other developing countries and regions with limited energy statistics to evaluate their energy consumption and efficiency at the sub-national level.

Code availability
MATLAB (R2017b), Stata (16), and ArcGIS (10.5) are the main applications used to obtain the energy consumption and efficiency data. The code for the PSO-BP matching of the two sets of satellite nighttime light data and for the inversion of city and county energy consumption is provided in the Appendix. Code and datasets for the DEA method, which measures energy efficiency in 189 cities, are also provided in the Appendix.