County-level CO2 emissions and sequestration in China during 1997–2017

With the implementation of China's top-down CO2 emissions reduction strategy, regional differences should be considered. As the most basic governmental unit in China, counties capture regional heterogeneity better than provinces and prefecture-level cities, and county-level CO2 emissions can be used to develop strategic policies tailored to local conditions. However, most previous accounts of CO2 emissions in China have focused only on the national, provincial, or city levels, owing to limited methods and limited smaller-scale data. In this study, a particle swarm optimization-back propagation (PSO-BP) algorithm was employed to unify the scale of DMSP/OLS and NPP/VIIRS satellite imagery and estimate the CO2 emissions in 2,735 Chinese counties during 1997–2017. Moreover, as vegetation has a significant ability to sequester and reduce CO2 emissions, we calculated the county-level carbon sequestration value of terrestrial vegetation. The results presented here can help fill existing data gaps and enable the development of strategies to reduce CO2 emissions in China.

Measurement(s): carbon dioxide emission • carbon dioxide sequestration. Technology Type(s): machine learning. Factor Type(s): temporal interval • geographic location. Sample Characteristic - Environment: carbon dioxide. Sample Characteristic - Location: China.

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13090370


Background & Summary
As one of the largest carbon emitters globally, China has pledged to reach the peak of its carbon emissions by 2030 [1][2][3], and significant effort has been put into developing a sustainable economy [4][5][6]. An increasing number of studies have focused on topics such as CO2 emissions accounts [7][8][9], driving forces of CO2 emissions 10,11, forecasting future emissions, and more 12,13. However, most of this research has been conducted at the national, provincial [14][15][16], or city level 17,18. In fact, even within the same province or the same prefecture-level city, CO2 emissions can differ markedly among counties. Research at the county level is important for capturing regional heterogeneity and developing policies that can effectively reduce CO2 emissions.
Therefore, records of county-level CO2 emissions are required. Such records could help to fill the gaps in China's CO2 emission data and could be used to develop strategic policies that propose county-specific emission reduction actions. However, very few studies have investigated county-level emissions in China owing to the low availability of data sources, and the existing studies have limitations in methodology, time span, and geographical coverage. These studies can be classified into two main categories: (1) Some studies calculated county-level CO2 emissions on the basis of published energy use data 19,20. For example, Cai et al. 19 estimated the CO2 emissions of 16 counties in Tianjin, China in 2007 based on the construction of a CO2 emissions grid, which was derived from the spatial distributions of energy use in the industrial, agricultural, and residential sectors. Additionally, Guan et al. 20 adopted the CO2 emission coefficients provided by the IPCC and 11 types of energy use, such as coal, coke, coal gas,

Thus, based on the available nighttime light data provided by DMSP/OLS images, China's energy-related CO2 emissions can be calculated for micro-level administrative units. However, because the DMSP/OLS images are only available up to 2013, the research period was limited. Additionally, even though Suomi National Polar-Orbiting Partnership/Visible Infrared Imaging Radiometer Suite (NPP/VIIRS) images provide another source of nighttime light brightness data after 2012, the evident gaps between the two satellites' data have hindered the construction of long-term nighttime light data sets and the calculation of CO2 emissions. Several studies have therefore attempted to unify the two sets of satellite data [27][28][29]. However, matching the results proved difficult, and other problems involving discontinuity and saturation were encountered. Hence, there is room for further improvement.
In addition, the existing literature on CO2 emissions reduction has focused only on energy-related carbon emissions, and the influence of carbon sequestration by vegetation has generally been ignored. Plant carbon sequestration is a natural process that directly counteracts the emission of CO2 into the atmosphere. It mainly originates from vegetation net primary productivity (NPP) or net ecosystem productivity: vegetation in the ecosystem absorbs CO2 from the air, produces carbohydrates such as glucose through photosynthesis, and releases oxygen. Vegetation has a significant effect on CO2 sequestration and can offset a major part of the CO2 emissions associated with energy use. In particular, terrestrial vegetation plays a significant role in CO2 sequestration, and methods for estimating its sequestration capacity have advanced and been widely adopted 29. Therefore, we also estimate the county-level carbon sequestration values of terrestrial vegetation, which facilitates more comprehensive research on reducing CO2 emissions in China and evaluating sustainable development 30.
Thus, the present study makes the following marginal contributions to this field of research: (1) we developed a new model and employed a particle swarm optimization-back propagation (PSO-BP) algorithm to unify the scale of DMSP/OLS and NPP/VIIRS images during 1997-2017, which achieved better fits than previous studies based on original models and conventional econometrics; (2) we adopted the PSO-BP algorithm to downscale provincial energy-related carbon emissions based on the nighttime light data, and calculated energy-related carbon emissions for 2,735 counties during 1997-2017; and (3) we estimated the corresponding county-level carbon sequestration values of terrestrial vegetation, an issue that has rarely been considered in previous studies and that plays a significant role in CO2 emissions mitigation.

Methods
Study areas and data preprocessing. Since China is one of the largest CO2 emitters globally, our aim was to help determine the status of carbon reduction in China and to address current data gaps. In addition, our results could support energy saving and emissions reduction efforts in other countries, especially other developing countries.
As the most basic governmental unit in China, counties play important roles in implementing the emission reduction policies of the central, provincial, and municipal governments. Therefore, we selected county-level CO2 emissions as our research focus. Based on the accessibility of data sets, our study areas cover 2,735 counties in 30 provinces of mainland China (excluding Tibet, Hong Kong, Macau, and Taiwan), which together account for approximately 87% of China's land area, over 90% of the population, and 90% of the GDP.
Three satellite data sets were used in our study: two types of nighttime light data (provided by DMSP/OLS 31,32 and NPP/VIIRS 33 images) and net primary productivity data of terrestrial vegetation (provided by the MODIS NPP products 34). Because the satellite images have several problems, such as discontinuities, white noise, and fill values, these data sets need to be pre-processed before further use.
The DMSP/OLS images cover the period 1992-2013, and these data had problems stemming from discontinuities, saturation, and incomparability. Hence, we adopted several methods proposed by previous studies, including inter-calibration, radiometric calibration, intra-annual composition, and inter-annual series correction, to obtain continuous and stable DMSP/OLS images 27,28,35.

We used the monthly NPP/VIIRS images, an approach consistent with previous studies 26,27. Because of stray light pollution, lighting data in the mid-high latitudes of China showed large errors in summer; thus, we removed the images from June to August and used the remaining monthly data to synthesize the annual data. Then, we applied a Gaussian low-pass filter with a window size of 5 × 5 to reduce the spatial variability of the NPP/VIIRS images and smooth the data to better match the DMSP/OLS images 36. The σ was set to 1.75 in accordance with studies by Li et al. 37 and Zheng et al. 38. Moreover, to further reduce the white noise of the NPP/VIIRS images, we replaced the negative values with zero and set a threshold of 0.3 nW·cm⁻²·sr⁻¹ in the annual images, which is consistent with earlier work 36.
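The NPP/VIIRS preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB/ArcGIS workflow: the function names, the use of a simple mean as the annual compositing rule, the edge padding at image borders, and the interpretation of the 0.3 threshold as "set sub-threshold pixels to zero" are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.75):
    """Return a normalized size x size Gaussian kernel (5 x 5, sigma 1.75 as in the text)."""
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def smooth(image, kernel):
    """Low-pass filter a 2-D image with a small symmetric kernel (edge padding is an assumption)."""
    half = kernel.shape[0] // 2
    padded = np.pad(image, half, mode="edge")
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out

def annual_viirs(monthly, threshold=0.3):
    """Build an annual image from monthly radiance arrays keyed by month number (1-12).

    Radiance is assumed to be in nW cm^-2 sr^-1.
    """
    # Drop June-August, which the text excludes because of stray light.
    kept = [img for month, img in sorted(monthly.items()) if month not in (6, 7, 8)]
    annual = np.mean(kept, axis=0)               # assumed compositing rule: simple mean
    annual = smooth(annual, gaussian_kernel())   # 5 x 5 Gaussian low-pass filter
    annual = np.where(annual < 0, 0.0, annual)   # negative values replaced with zero
    annual[annual < threshold] = 0.0             # assumed handling of the 0.3 threshold
    return annual
```

Because the kernel is normalized, a spatially uniform scene passes through the filter unchanged, which is a quick sanity check before applying the function to real composites.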
The MOD17A3 products provided by the National Aeronautics and Space Administration (NASA) have fill values and need to be multiplied by a 0.0001 conversion factor. Next, following the user guides 39,40 , we obtained the net primary productivity data. Finally, based on the conversion coefficient (i.e., 1.62/0.45) used by Chen et al. 30
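The MOD17A3 handling described in this paragraph can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text and should be checked against the MODIS user guides: that raw values at or above 32761 are fill values, that the scaled NPP is expressed in kg C per square metre per year, and that the coefficient 1.62/0.45 is applied as a simple multiplier to NPP. The function name and the pixel-area argument are hypothetical.

```python
import numpy as np

SCALE = 0.0001              # conversion factor stated in the text
FILL_MIN = 32761            # assumption: raw values >= this are fill values
CO2_PER_NPP = 1.62 / 0.45   # conversion coefficient quoted in the text

def npp_to_sequestration(raw_npp, pixel_area_m2):
    """Return total CO2 sequestration (million tons) for the pixels of one county."""
    raw = np.asarray(raw_npp, dtype=float)
    valid = raw < FILL_MIN
    npp = np.where(valid, raw * SCALE, np.nan)   # assumed unit: kg C m^-2 yr^-1
    co2 = npp * CO2_PER_NPP                      # assumed: kg CO2 m^-2 yr^-1
    total_kg = np.nansum(co2) * pixel_area_m2    # fill pixels are ignored
    return total_kg / 1e9                        # 1 million tons = 1e9 kg
```

In practice the pixels passed to such a function would be those falling inside a county boundary, so that the result is a county-level total comparable with the emission estimates.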

Inter-calibration between DMSP/OLS and NPP/VIIRS based on PSO-BP. Because the DMSP/OLS and NPP/VIIRS images were derived from different types of satellites, there are evident gaps between the two sets of data. Specifically, there are discrepancies caused by factors such as different sensors, different spatial resolutions, and different spread functions 36. However, the mechanisms behind these differences remain a "black box," and any fixed functional form for the inter-calibration between DMSP/OLS and NPP/VIIRS data may fail to produce a good match between the two sets of data and lead to large errors. Therefore, in the present study, we used an artificial neural network (ANN) rather than conventional econometric methods to explore the relationship between the DMSP/OLS and NPP/VIIRS data, because the conventional methods often fail to model non-linear relationships 41. Additionally, because the back-propagation (BP) algorithm has performed well in previous studies for constructing regressions and obtaining locally optimal results 42,43, the BP algorithm was adopted in this research. However, given that the BP algorithm can become trapped in local extrema and fail to train, we combined it with particle swarm optimization (PSO), which has shown great potential for finding globally optimal results 44,45.
As for the input parameters, we followed the approach of Zhao et al. 28 and selected the county-level mean pixel values of NPP/VIIRS in 2013 (V) as the input. Given the geographical heterogeneity of individual data in mainland China, we used the minimum boundary method to obtain each county's central geographic coordinates and used ArcMap 10.5 to obtain the area of each county. Then, we selected the central geographic coordinates (X and Y) and the area of each county (A) as supplementary input parameters, which greatly improved the matching accuracy of the two sets of data and reduced errors. To enhance the accuracy of the modelling, we used the logarithmic form of the input parameters, as suggested by Li et al. 37. As the output parameter, we selected the county-level mean pixel values of DMSP/OLS in 2013 (D).
To initialize the ANN weights with the PSO technique, we set the acceleration coefficients C1 and C2 both to 2.0, the maximum number of iterations to 50, and the population size to 20 45,46. Additionally, the model was structured with one hidden layer containing five nodes, which is consistent with the work of Mohamad et al. 45. The total number of samples was 2,826. Among these, 2,000 samples were randomly selected as training samples, whereas the other 826 samples were used as testing samples. The calculation procedure for the PSO-BP model was consistent with the work of Mohamad et al. 45 and Yin et al. 47.
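The general PSO-BP idea described here (PSO searches for good initial network weights, and back-propagation then refines them) can be sketched in Python with NumPy. This is an illustrative sketch rather than the authors' MATLAB code in Suppl. File 1. The values C1 = C2 = 2.0, 50 iterations, 20 particles, four log-transformed inputs, and five hidden nodes come from the text; the tanh activation, inertia weight, velocity clamp, weight range, learning rate, number of epochs, standardization, and the synthetic data are all assumptions.

```python
import numpy as np

N_IN, N_HID = 4, 5                          # inputs ln V, ln X, ln Y, ln A; five hidden nodes
N_W = N_IN * N_HID + N_HID + N_HID + 1      # all weights and biases in one flat vector

def unpack(w):
    """Split a flat parameter vector into layer weights and biases."""
    i = N_IN * N_HID
    return w[:i].reshape(N_IN, N_HID), w[i:i + N_HID], w[i + N_HID:i + 2 * N_HID], w[-1]

def predict(w, X):
    W1, b1, W2, b2 = unpack(w)
    return np.tanh(X @ W1 + b1) @ W2 + b2   # tanh hidden layer is an assumption

def mse(w, X, y):
    return float(np.mean((predict(w, X) - y) ** 2))

def pso_init(X, y, rng, n_particles=20, n_iter=50, c1=2.0, c2=2.0,
             inertia=0.7, v_max=1.0):
    """PSO search for initial weights; inertia and v_max are assumed values."""
    pos = rng.uniform(-1.0, 1.0, (n_particles, N_W))
    vel = np.zeros_like(pos)
    pbest, pbest_cost = pos.copy(), np.array([mse(p, X, y) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -v_max, v_max)   # keep the swarm from diverging
        pos = pos + vel
        cost = np.array([mse(p, X, y) for p in pos])
        better = cost < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], cost[better]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest

def bp_refine(w, X, y, lr=0.01, epochs=500):
    """Gradient-descent (back-propagation) refinement; returns the best weights seen."""
    w = w.copy()
    best_w, best_cost = w.copy(), mse(w, X, y)
    for _ in range(epochs):
        W1, b1, W2, b2 = unpack(w)
        h = np.tanh(X @ W1 + b1)
        g_out = 2.0 * (h @ W2 + b2 - y) / len(y)        # d(MSE)/d(prediction)
        g_hid = np.outer(g_out, W2) * (1.0 - h ** 2)    # back-propagated to hidden layer
        grad = np.concatenate([(X.T @ g_hid).ravel(), g_hid.sum(axis=0),
                               h.T @ g_out, [g_out.sum()]])
        w -= lr * grad
        cost = mse(w, X, y)
        if cost < best_cost:
            best_w, best_cost = w.copy(), cost
    return best_w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 400                                             # synthetic counties, not real data
    V = rng.lognormal(0.0, 1.0, n)                      # mean NPP/VIIRS pixel value
    lon, lat = rng.uniform(75, 130, n), rng.uniform(20, 50, n)
    area = rng.lognormal(7.0, 0.5, n)
    D = 10 * np.log1p(V) + rng.normal(0, 0.5, n)        # synthetic DMSP/OLS target
    X = np.log(np.column_stack([V, lon, lat, area]))    # logarithmic inputs, as in the text
    X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardization is an assumption
    y = (D - D.mean()) / D.std()
    train, test = slice(0, 300), slice(300, n)
    w = bp_refine(pso_init(X[train], y[train], rng), X[train], y[train])
    print("test correlation:", np.corrcoef(predict(w, X[test]), y[test])[0, 1])
```

Running the same routine without the PSO step, or without the coordinate and area columns, would give control models analogous in spirit to the comparison groups used in the validation.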
To test the validity of the PSO-BP algorithm and the proposed supplementary input parameters, we also trained the BP algorithm, and both algorithms without the supplementary input parameters, as control groups. The best training results are presented in Fig. 1. As shown in Fig. 1, all of the correlation coefficients for the training results of mean pixel values in 2013 exceeded 0.9, which indicated that the ANN was well suited to identifying the potential relationship between DMSP/OLS and NPP/VIIRS data. The models that considered geographic coordinates and area (i.e., panels a and c) fitted better than the models that only used the county-level mean pixel values of NPP/VIIRS data as input parameters (i.e., panels b and d). Additionally, the correlation coefficients for the PSO-BP algorithm were higher than those for the BP algorithm, indicating that the PSO-BP algorithm was better at determining the potential matching relationship between DMSP/OLS and NPP/VIIRS images. Figure 2 shows the test performances of the four models, and these results can be used to assess the fit of each model. The highest correlation coefficient, 0.96361 for model a on the testing dataset, indicated that the proposed PSO-BP algorithm reliably matched the DMSP/OLS images with NPP/VIIRS images in later years (e.g., 2014 and 2015). The R² value obtained with our method was 0.955, which was notably higher than values obtained previously, including the 0.8354 of Lv et al. 27, the 0.9154 of Zhao et al. 28, and the 0.91 of Li et al. 36.
However, the matching was not yet complete at this stage. Although the correlation coefficient is close to 1, there were evident and unavoidable faults in some counties between 2013 (i.e., DMSP/OLS data in 2013) and 2014 (i.e., converted NPP/VIIRS data in 2014), which also exist in previous studies. Therefore, we used the annual increases in the converted NPP/VIIRS data during 2013-2017 to obtain the final simulated DMSP/OLS data for 2014-2017, which avoids the faults and discontinuities in some regions during 2013-2014.

Calculation of CO2 emissions based on satellite data. Because provincial energy balance tables were available but energy use data were lacking for individual counties, we first established the relationship between provincial CO2 emissions and nighttime light data (i.e., the sum of the DN values); then, the sum of the DN values was used as a proxy to estimate the county-level carbon emissions.
First, provincial CO2 emissions were estimated using the following method provided by the Intergovernmental Panel on Climate Change (IPCC), which has been widely adopted 4,48,49:

$$CE_{i}^{t} = \sum_{j} E_{ij}^{t} \times LCV_{ij}^{t} \times CC_{ij}^{t} \times COF_{ij}^{t} \qquad (1)$$

where $CE_{i}^{t}$ represents the provincial CO2 emissions from energy use (unit: million tons); $E_{ij}^{t}$ represents the use of the $j$th type of energy in province $i$; $LCV_{ij}^{t}$ is the low calorific value of the $j$th energy type; $CC_{ij}^{t}$ is the carbon content of the $j$th energy type; and $COF_{ij}^{t}$ is the carbon oxidation factor of the $j$th energy type. In addition, 17 types of fossil fuel used are considered, including raw coal, cleaned coal, other washed coal, briquettes, gangue, coke, coke oven gas, blast furnace gas, converter gas, other gases, other coking products, crude oil, gasoline, kerosene, diesel oil, fuel oil, naphtha, lubricants, paraffin, white spirit, bitumen asphalt, petroleum coke, other petroleum products, liquefied petroleum gas (LPG), refinery gas, and natural gas.

Subsequently, to avoid spurious regression problems, we adopted the unit root test to verify the relationship between provincial CO2 emissions (c) and the sum of DN values (sdn). The results are presented in Table 1. It was evident that the sum of DN values and carbon emissions had to be processed at the same time with Eq. (1). Then, the Pedroni co-integration test was adopted 50, which has been widely accepted in the field of econometrics [51][52][53]. The majority of tests rejected the null hypothesis of no co-integration, suggesting that there was significant co-integration between the provincial carbon emissions and the sum of DN values.
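The IPCC-style accounting described above multiplies, for each fuel, the amount used by its low calorific value, carbon content, and carbon oxidation factor, and then sums over fuels. A minimal Python sketch of that arithmetic follows. The class and function names are hypothetical, and the factor values in the usage example are placeholders chosen for easy checking, not real emission factors; units are assumed to be mutually consistent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FuelFactors:
    lcv: float   # low calorific value of the fuel
    cc: float    # carbon content per unit of energy
    cof: float   # carbon oxidation factor (between 0 and 1)

def provincial_emissions(energy_use, factors):
    """Sum E * LCV * CC * COF over all fuel types for one province and year."""
    return sum(
        amount * factors[fuel].lcv * factors[fuel].cc * factors[fuel].cof
        for fuel, amount in energy_use.items()
    )

# Placeholder numbers for illustration only (not real factors or data).
example_factors = {
    "raw coal": FuelFactors(lcv=2.0, cc=3.0, cof=0.5),
    "natural gas": FuelFactors(lcv=4.0, cc=1.0, cof=1.0),
}
example_use = {"raw coal": 10.0, "natural gas": 5.0}
example_total = provincial_emissions(example_use, example_factors)
```

Keeping the factors in a per-fuel table makes it straightforward to swap in official values for each energy type taken from the provincial energy balance tables.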
Furthermore, because the relationship between provincial CO2 emissions and nighttime light data is non-linear, conventional econometric methods may lead to relatively high errors 27; here, we employed the PSO-BP algorithm to fit and train the relationship. We selected the sum of DN values, dummy variables for province identity, and year as the input parameters, and the provincial CO2 emissions as the output parameter. The other initialized parameters were consistent with those described in the earlier section on the inter-calibration. The results are presented in Fig. 3.
Both the training and test results showed a close fit, indicating that the algorithm was effective. Notably, the coefficient of determination R² of 0.9895 was higher than the 0.94 of Meng et al. 21.
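To illustrate how the sum of DN values can act as a proxy when moving from provincial to county scale, the Python sketch below allocates a provincial total to counties in proportion to each county's DN sum. Proportional allocation is an assumption made only for illustration; it is not claimed to be the procedure used to produce the dataset, and the function name is hypothetical.

```python
def downscale(province_emission, county_dn):
    """Allocate a provincial total to counties in proportion to their DN sums."""
    total_dn = sum(county_dn.values())
    if total_dn <= 0:
        raise ValueError("sum of DN values must be positive")
    return {
        county: province_emission * dn / total_dn
        for county, dn in county_dn.items()
    }

# Illustrative numbers only: one province with two counties.
shares = downscale(100.0, {"county_a": 1.0, "county_b": 3.0})
```

A useful property of this kind of allocation is that the county values add back up to the provincial total, so the downscaled data stay consistent with the provincial accounts.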

Data Records
A total of 5470 data records (county-level CO 2 emissions caused by energy use and carbon sequestration values of terrestrial vegetation). Among them, there were 2,735 CO 2 emission county data records associated with energy use (

Technical Validation
Validity testing for temporal and spatial nighttime light data changes. On the basis of the inter-calibration method described earlier, we were able to obtain continuous and stable county-level nighttime light DN values during 1997 to 2017, and the sum of DN values is presented in Fig. 6. The red line represents the changes in the sum of DN values before the inter-calibration between DMSP/OLS and NPP/VIIRS images, and the green line represents the changes in the sum of DN values (sdn) after the inter-calibration based on the PSO-BP algorithm. Evidently, there was a gap between the scale of DMSP/OLS and NPP/VIIRS images before the matching. Additionally, the trend of our inter-calibrated results continuously increased, which is consistent with previous studies 27,28 .
Subsequently, because nighttime light data tend to be highly consistent with economic output 55,56 and power consumption data [57][58][59] , we used the provincial cross-sectional gross domestic product (GDP) and power consumption to individually perform linear regressions with the sum of DN values (sdn) during 1997-2017. These results are presented in Table 2.
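The cross-sectional regressions described here can be sketched in Python as an ordinary least squares fit that reports the slope, R², and AIC. This is a minimal illustration with made-up numbers. The function name is hypothetical, and the AIC is computed from the Gaussian log-likelihood up to an additive constant, which may differ from the convention behind the values in Table 2.

```python
import numpy as np

def ols_fit(x, y):
    """Fit y = a + b * x by least squares and return (a, b, R^2, AIC)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    n, k = design.shape
    r2 = 1.0 - rss / tss
    aic = n * np.log(rss / n) + 2 * k   # assumed AIC convention (constant dropped)
    return float(beta[0]), float(beta[1]), r2, float(aic)

# Made-up cross-section: sum of DN values (x) against GDP (y) for five provinces.
intercept, slope, r2, aic = ols_fit([1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 8.8, 11.0])
```

Repeating the fit separately for each year, once with GDP and once with power consumption as the dependent variable, would mirror the structure of Models (1) and (2).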
The Model (1) results in Table 2 show a significant positive relationship between the provincial cross-sectional GDP and sdn during 1997-2017. All of the R² values were over 0.87, and the AIC values were small, implying that the inter-calibrated nighttime light data characterized the economic output well. Similarly, the Model (2) results in Table 2 show that the sum of the DN values was consistent with the power consumption data: the slopes were significantly positive, the R² values were high, and the AIC values were small.

Validity testing for CO2 emissions based on satellite data. The method used to calculate the carbon sequestration value of terrestrial vegetation is consistent with that in Chen et al. 30 and, therefore, was deemed a reliable measure. To validate the energy-related carbon emissions based on the nighttime light data, we compared the national and provincial energy-related CO2 emissions provided by existing studies 17,60,61 with the aggregated totals of our simulated energy-related carbon emissions. These results are presented in Fig. 7. Panels (a) and (b) in Fig. 7 show scatter plots of our simulated national and provincial energy-related CO2 emissions, respectively, against the CO2 emissions from the existing literature during 1997-2017. The results in each graph were highly consistent, indicating that the simulated CO2 emissions based on the nighttime light data are reliable.
Limitations and future work. Our datasets have several limitations, which we will address in the future to improve the accuracy of China's county-level emission accounts. First, our estimated county-level CO 2 emissions are based only on the nighttime light data, overlooking other factors such as urbanization rate, population, and earth surface temperature. Secondly, our estimated carbon sequestration values only include carbon sequestration capacity of terrestrial vegetation, without taking into account ocean carbon sequestration capacity 62 .
Therefore, our future work will include two aspects. First, we will combine nighttime light data with other satellite data, such as earth surface temperature provided by MOD11A2 and impervious surface data provided by Gong et al. 63, to improve the accuracy of the calculated county-level carbon emissions. Second, we will further analyze the carbon sequestration capacity of mangrove vegetation to enrich the estimated county-level carbon sequestration data.

Usage Notes
First, the energy-related carbon emissions data for 2,735 counties in mainland China offer wide coverage and a long time span. The data set can help fill existing data gaps and be used in future research. For example, scholars can use the data to analyze the driving forces of CO2 emissions at the county level rather than at the national 3, provincial 4,5, or prefecture level 21,22,27. Additionally, the data can be used to evaluate emission reductions 19,20 or to construct budget allocations of CO2 emission rights among counties.
Second, because vegetation plays a significant role in sequestering and reducing CO2 emissions, the dataset of carbon sequestration values of vegetation for the 2,735 counties can be combined with the county-level energy-related CO2 emissions provided here. The combined dataset could facilitate more comprehensive analyses of China's emissions mitigation 30,64,65, the gap between CO2 emissions and carbon sequestration, and the evaluation of sustainable development 65.
Additionally, the present study was limited by differences in the time spans of the energy-related CO2 emissions and the carbon sequestration values of vegetation. Because of the availability of the original data, the energy-related CO2 emissions data span 1997-2017, while the carbon sequestration values of vegetation span 2000-2017. Likewise, our study area includes only mainland China. The county-level CO2 emissions and carbon sequestration values are provided in million tons.

Code availability
The programs used to generate all the results were MATLAB (R2017b) and ArcGIS (10.5). The PSO-BP codes for matching the scales of the nighttime light data and modelling the relationships among the provincial energy-related CO2 emissions are presented in Suppl. File 1.