China industrial environmental database 1998–2015

The past decade has seen rapid growth in research on China's environmental problems. However, the existing environmental statistics fall far short of researchers' needs. The biggest problem is that official environmental statistics are provided only at either the regional level or the sectoral level, not at both simultaneously. Given the large heterogeneity across regions and sectors, researchers are unable to conduct comprehensive policy evaluations. In this study, we construct a time-series industrial environmental database for China (CIED) at both the regional and the sectoral level. The database covers 31 regions and four types of pollutants: chemical oxygen demand (COD), sulphur dioxide (SO2), ammonia-nitrogen (NH3-N), and nitrogen oxide (NOX). This study also clarifies several important concepts to help researchers better understand China's official environmental statistics.

Measurement(s): environmental pollution. Technology Type(s): computational modeling technique. Sample Characteristic - Location: China.


Background & Summary
Environmental degradation is one of the most important issues accompanying China's rapid industrialization. Although the situation has improved greatly over the past decade, the nation as a whole, and certain regions in particular, still faces severe environmental problems such as air and water pollution. Since the 2000s, a growing body of evidence-based literature has focused on China's environmental problems. Topics include discussions of China's environmental Kuznets curve [1–5], analyses of environmental protection performance [6–9] and evaluations of environmental policies [10,11]. Most of these studies conduct their analyses using either regional-level or sector-level environmental data retrieved from official statistics, and thus fail to account for regional and sectoral heterogeneity simultaneously. The main reason is that official statistics such as the China Statistical Yearbook on Environment report only provincially aggregated and sectorally aggregated pollution data. One way to partly overcome this problem is to use micro-level data such as firm-level or plant-level data, as in several recent studies [12–15]. These studies provide insights into firms' polluting behaviours and new evidence on regional and sectoral variations in China.
There are two main drawbacks to using micro-level data, especially firm-level data, to study environmental issues. Firstly, a number of observations are dropped when constructing a balanced panel dataset, which plays an important role in most empirical studies. Sample representativeness is further weakened if researchers match different firm-level datasets to obtain more variables. As a result, other important topics such as structure, entry and exit effects cannot be investigated at the macro level. Secondly, many important influencing factors are available and applicable only at the regional or sectoral level, whereas micro-level data contain only firm-specific characteristics. Yet most current research interests and policy needs are at the macro level, so firm-level analysis may be unable to deliver the desired results and policy implications. For these two reasons, environmental data estimated at both the regional and the sectoral level will contribute to existing empirical work. For example, such two-dimensional data have great potential to support environmental policy evaluation using treatment-effect methods such as the difference-in-differences (DID) model [16–18] and the synthetic control method (SCM) [4,19,20], which have gained much popularity in this field over the past decade. A second application is environmental efficiency analysis, such as data envelopment analysis (DEA) and stochastic frontier analysis (SFA), in which pollutants are treated as undesirable outputs [21–24]. Furthermore, two-dimensional environmental data can be used to construct environmentally extended multi-regional input-output (EE-MRIO) tables [25–27] and to provide more accurate parameter estimates [28], both of which are extremely useful for complex economic system analyses such as computable general equilibrium (CGE) modelling.
The China Industrial Environmental Database (CIED) covers 31 regions (excluding Hong Kong, Macao, and Taiwan) and 39 industrial sectors. For each sector in each region, we provide environmental data for four types of pollutants: chemical oxygen demand (COD), sulphur dioxide (SO2), ammonia-nitrogen (NH3-N), and nitrogen oxide (NOX). The database contains discharge and removal data for all four pollutants. Owing to data availability, data for COD and SO2 cover 1998 to 2015, data for NH3-N cover 2001 to 2015, and data for NOX cover 2006 to 2015. The full period spans the last three years of the Ninth Five-Year Plan and the three complete Five-Year Plan periods that followed. Since Five-Year Plans play a major role in policy-making, the database can be used to analyze a wide range of institutional and policy adjustment issues during this period.
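As a minimal sketch of how the coverage described above might be encoded when working with the database, the lookup below records the years available for each pollutant and counts region-sector-year cells. The names `AVAILABLE_YEARS`, `is_available` and `panel_size` are hypothetical and not part of the CIED itself, and the cell count assumes a complete panel of 31 regions and 39 sectors.

```python
# Years covered for each pollutant, as stated in the text (end year inclusive).
# The dictionary and function names are illustrative, not part of the CIED.
AVAILABLE_YEARS = {
    "COD": range(1998, 2016),
    "SO2": range(1998, 2016),
    "NH3-N": range(2001, 2016),
    "NOX": range(2006, 2016),
}


def is_available(pollutant, year):
    """Return True if the pollutant is covered in the given year."""
    return year in AVAILABLE_YEARS[pollutant]


def panel_size(pollutant, regions=31, sectors=39):
    """Count region-sector-year cells for one pollutant and one indicator,
    assuming a complete panel with no empty cells."""
    return regions * sectors * len(AVAILABLE_YEARS[pollutant])
```

A check such as `is_available("NOX", 2005)` can guard against requesting years before a pollutant entered the statistics.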

Methods
In this study, we combine a bottom-up method and a top-down method to construct the two-dimensional environmental database. In the first step, the bottom-up method uses firm-level environmental data to calculate pollutants' discharge and removal amounts by region and by sector. In the second step, the top-down method collects the total amounts of pollutants discharged and removed for each region and for each sector. In the final step, the cross-entropy method is used to balance the two-dimensional matrix in each year.
Aggregation of micro-level environmental data. Micro-level environmental data are retrieved from China's Environmental Statistics Database (CESD). The CESD is a micro-level database compiled by China's Ministry of Environmental Protection (MEP) and serves as the data basis for China's official environmental statistics publications, such as the China Statistical Yearbook on Environment, the China Environment Yearbook and the Annual Statistic Report on Environment in China. This study uses the industrial firm-level data from the CESD, which we call China's Industrial Environmental Statistics Database (CIESD). The CIESD is a newly released database available to researchers that covers all major industrial emission sources in China. It has already been used and explained in detail in several recent studies [14,15,29,30]. Table 1 reports the number of firms covered in our database by region and year. Some caution is needed when using the database. In total, 20 cells in Table 1 are zero or close to zero across all regions, which means that these observations are completely or almost completely missing. The overall data coverage rates in 1998 and 1999 are 75.38% and 91.97% respectively, which is imperfect but acceptable for this study. In 2006, a total of 1784 firms cannot be assigned to any region using the 6-digit administrative division code.
After carefully checking these observations manually, we find that they are all duplicates of observations for Anhui province. After comparing the duplicated records, we keep one unique observation with useful information. For the period from 2006 to 2010, environmental statistical data for thermal power plants were collected in a separate database, which is not included in ours. The total number of thermal power plants is also reported in Table 1. Therefore, if we subtract the number of thermal power plants for 2006 to 2010, the overall data coverage rates from 2000 to 2014 are close to 100%, which indicates good sample representativeness.
Before aggregating the micro-level environmental data, we investigated the whole dataset in depth to revise abnormal records and improve data quality. Abnormal records typically contain two types of error: missing data error and measurement error. Missing data error leads to underestimation of pollutant values, whereas measurement error can lead to either underestimation or overestimation. We manually checked each firm's time series to correct both. For missing data, if a water-polluting firm reports both COD and NH3-N discharge in all other years but lacks NH3-N discharge data in one specific year, the missing NH3-N value is identified and filled by interpolation. For measurement error, firms may report incorrect data by using the wrong units. For example, if a firm reports in grams instead of the kilograms required by the regulations, the value recorded in the database will be 1,000 times higher than the true value. Such a value is identified as abnormal and revised to a reasonable level.
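The two corrections described above can be illustrated with a small sketch. It assumes linear interpolation for an isolated missing year and a median-based rule for spotting values recorded in grams instead of kilograms. The manual checks in this study are not specified at this level of detail, so the function names, the tolerance and the detection rule are illustrative assumptions rather than the procedure actually applied.

```python
from statistics import median


def fill_gaps(series):
    """Linearly interpolate interior None values in a firm's yearly series.
    Linear interpolation is an assumption; leading or trailing gaps are left as None."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def fix_unit_errors(series, factor=1000.0, tolerance=0.5):
    """Divide by `factor` any value that is roughly `factor` times the median
    of the firm's other years (hypothetical rule for gram-vs-kilogram errors)."""
    fixed = list(series)
    for i, value in enumerate(series):
        others = [v for k, v in enumerate(series) if k != i and v is not None]
        if value is None or not others:
            continue
        typical = median(others)
        if typical <= 0:
            continue
        ratio = value / typical
        if factor * (1 - tolerance) <= ratio <= factor * (1 + tolerance):
            fixed[i] = value / factor
    return fixed
```

In practice any automatic flag of this kind would still need the manual review described in the text, since a genuine thousand-fold change cannot be ruled out from the series alone.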
The raw two-dimensional environmental data are then calculated from the revised CIESD by aggregating firm-level values by region and by sector.
Collection of macro-level environmental data. In this study, we collect macro-level environmental data from the China Statistical Yearbook on Environment for 2001 to 2015 and from the China Environment Yearbook for 1998 to 2000. Environmental data in the yearbooks are compiled from the CESD, which is collected and processed according to the environmental statistical system (ESS). The ESS was adjusted at the beginning of each Five-Year Plan period to better support environmental statistical work and to improve the quality of environmental statistical data. Within the time period of our database, the ESS has been adjusted four times. These adjustments raise three major challenges for constructing the CIED.
The first challenge is that not all pollutants are included in the CESD from 1998 to 2015. For example, NH3-N has been reported since the Tenth Five-Year Plan (which started in 2001) and NOX since the Eleventh Five-Year Plan (which started in 2006). Owing to data availability, environmental data for these two pollutants are covered from the aforementioned periods onward. Table 2 reports, for each of the four Five-Year Plan periods, whether each pollutant is reported and whether it is listed as a major pollutant.
The second challenge is that pollutant indicators are not consistent over the whole data period. According to the ESS, industrial emission sources are divided into major sources and non-major sources. Only major sources are covered by the CESD, and they are expected to account for at least 85% of total annual discharges of major pollutants. In the environmental statistics compiled from 1998 to 2010, regional removal data, sectoral discharge data and sectoral removal data are simply the sums over major sources. Regional discharge data, however, are adjusted to include estimated discharges from non-major sources. As a result, national totals aggregated from regional data are typically larger than those aggregated from sectoral data. In addition, removal data are no longer reported under the new ESS introduced at the beginning of the Twelfth Five-Year Plan. Instead, the amounts of pollutants produced are reported in the CESD as well as in the yearbooks. Both the new production data and the discharge data are adjusted to include non-major sources. Table 3 compares regional aggregation with sectoral aggregation. For the period from 2011 to 2015, we calculate pollutant removal by subtracting discharge from production. The inconsistent treatment of non-major sources in regional and sectoral data is balanced using the cross-entropy method in this study.
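The removal calculation for 2011 to 2015 and the regional-versus-sectoral comparison can be sketched as follows. The function names and the dictionary layout are hypothetical, and raising an error on a negative result is an illustrative choice, not a rule stated in this study.

```python
def removal_from_production(production, discharge):
    """Removal = production - discharge for each key (a region or a sector)."""
    removal = {}
    for key, produced in production.items():
        value = produced - discharge[key]
        if value < 0:
            # Discharge exceeding production signals a data problem to inspect.
            raise ValueError(f"discharge exceeds production for {key!r}")
        removal[key] = value
    return removal


def aggregation_gap(regional, sectoral):
    """Ratio of the national total summed over regions to the national total
    summed over sectors; a value above 1 reflects non-major sources being
    included only in the regional figures."""
    return sum(regional.values()) / sum(sectoral.values())
```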
The third challenge is that the sector classifications reported in official statistics changed three times within the data period. Sector classifications are based on the Classification Standards of National Economic Industries (CSNEI), which includes 2-digit, 3-digit and 4-digit codes for all sectors and sub-sectors. Table 4 reports the concordance of industrial sectors; its first column is the coordinated sector number. We also find that some firms have different 2-digit sector codes in the CIESD and the Annual Survey of Industrial Enterprises Database (ASIED). Since the 2-digit sector codes in the ASIED are more consistent with the sector classifications of official statistics released by the National Bureau of Statistics of China, we match the firms in the CIESD and the ASIED and revise firms' 2-digit sector codes according to the ASIED before applying the sector concordance. In addition, firms classified as "Others" in the CIESD are all checked and revised manually. The chord diagram in Fig. 1(a) illustrates the changes in all firms' sector codes in the CIESD. Most firms' sector codes are either not revised or remain the same 2-digit codes after revision. Among firms whose 2-digit sector codes are revised, interchanges occur in most sectors except the "Others" sector, whose 2-digit code is 39. Since all firms originally classified in the "Others" sector have been manually checked and revised, they are now assigned the correct 2-digit sector codes.
Moreover, since China conducted its first National Census of Pollution Sources (NCPS) in 2007, the ESS has changed significantly in line with the census results. As a result, two concerns require special attention when using the environmental data.
Firstly, before 2011 the ESS relied heavily on firms' self-reported data to determine the major emission sources. After the complete emission source list provided by the NCPS was adopted, the total number of firms included in the CESD increased significantly in 2011 and has remained stable thereafter. It is therefore not surprising that total discharge data also jump in 2011. Secondly, a large number of discharge coefficients were updated by the NCPS and adopted in the ESS as adjusted in 2011. Consequently, some sectors' pollution levels also changed greatly. If researchers conduct empirical studies using panel models, these systematic changes can be captured by sector and time fixed effects. If, however, researchers use methods such as index decomposition or data envelopment analysis, the aforementioned concerns should be treated carefully.
Balancing the environmental data. We use a two-step adjustment procedure to turn the raw two-dimensional environmental data into the final CIED. In the first step, the raw value is adjusted using formula (3), where $\tilde{X}_{r,j}^{c}$ represents the adjusted raw value and $Y_{r}^{c}$ represents the aggregated macro-level environmental data for region r. The second term on the right-hand side of formula (3) is the region-specific average scale ratio that adds non-major emission sources. Sectors in the same region are assumed to be scaled up proportionally. This step adjusts the aggregated regional raw data and makes them consistent with the macro-level environmental data.
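The raw aggregation and the first adjustment step can be sketched together. The sketch reads formula (3) as multiplying each raw cell by the ratio of the official regional total to the raw regional sum; this reading is inferred from the description above and is an assumption, as are the function names and the tuple layout of the firm records.

```python
from collections import defaultdict


def aggregate_firms(records):
    """Sum firm-level values into a {(region, sector): value} table.
    Each record is a (region, sector, value) tuple (hypothetical layout)."""
    raw = defaultdict(float)
    for region, sector, value in records:
        raw[(region, sector)] += value
    return dict(raw)


def scale_to_regional_totals(raw, regional_totals):
    """Scale every sector in a region by (official regional total / raw regional sum),
    so that all sectors in a region are scaled up proportionally."""
    raw_sums = defaultdict(float)
    for (region, _sector), value in raw.items():
        raw_sums[region] += value
    adjusted = {}
    for (region, sector), value in raw.items():
        if raw_sums[region] > 0:
            ratio = regional_totals[region] / raw_sums[region]
        else:
            ratio = 0.0  # no raw data to scale; such regions need separate handling
        adjusted[(region, sector)] = value * ratio
    return adjusted
```

After this step the sectoral values within each region sum to the official regional total, while the sectoral column sums may still disagree with the official sectoral totals.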
In the second step, we use the cross-entropy method to balance the two-dimensional data matrix [31]. In this paper, we assume that the regional distributions of the environmental data within each sector are more reliable. On this basis, we define the prior coefficients and the new coefficients to be estimated, $a_{r,j}^{c}$.
Table 5. The structure of the environmental data by region and by sector. The names of regions 1 to 31 and sectors 1 to 39 can be found in Tables 1 and 4, respectively.
In the OLS regression comparing the unbalanced and balanced values, $\beta_{c}$ is the slope coefficient and $\varepsilon_{r,j}^{c}$ is the noise term, which follows a normal distribution, $\varepsilon_{r,j}^{c} \sim N(0, \sigma_{c}^{2})$. Table 6 presents the estimated slope coefficients and standard deviations for the different pollutants. All coefficients are statistically significant at the 1% level. Table 7 presents the adjusted $R^{2}$ values for the different pollutants, most of which are greater than 0.99. Figure 2 compares the unbalanced and balanced values visually for the four types of pollutants.
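The balancing in the second adjustment step can be illustrated with iterative proportional fitting (IPF, also known as RAS). This is a stand-in technique, not the cross-entropy formulation used in this study: IPF is known to give the matrix closest to the prior in the cross-entropy (Kullback-Leibler) sense among matrices that match fixed row and column totals. The sketch further assumes that the sectoral totals have already been rescaled so that they sum to the same national total as the regional totals; the function name and arguments are hypothetical.

```python
def ipf_balance(matrix, row_totals, col_totals, iterations=200, tol=1e-9):
    """Alternately rescale rows (regions) and columns (sectors) of a
    non-negative matrix until both sets of totals are matched.
    Assumes sum(row_totals) == sum(col_totals)."""
    balanced = [row[:] for row in matrix]
    for _ in range(iterations):
        # Match regional totals.
        for i, target in enumerate(row_totals):
            row_sum = sum(balanced[i])
            if row_sum > 0:
                balanced[i] = [v * target / row_sum for v in balanced[i]]
        # Match sectoral totals.
        for j, target in enumerate(col_totals):
            col_sum = sum(row[j] for row in balanced)
            if col_sum > 0:
                for row in balanced:
                    row[j] *= target / col_sum
        # Columns now match exactly; stop once rows match as well.
        worst = max(abs(sum(row) - t) for row, t in zip(balanced, row_totals))
        if worst < tol:
            break
    return balanced
```

Zero cells in the prior stay zero under this scheme, which preserves the pattern of region-sector pairs with no recorded emissions.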
Comparison with provincial official statistics. Among all provinces in China, only Zhejiang province reports sectoral-level environmental data. We collect sectoral-level data for the four types of pollutants from the Zhejiang Statistical Yearbook on Environment and the Zhejiang Natural Resources and Statistical Yearbook on Environment, and coordinate all industrial sectors according to Table 4. Figure 3 compares the official values with our values for the four types of pollutants. Observations for all pollutants are well fitted, and the OLS results in Fig. 3 are highly consistent with those in Fig. 2.
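A comparison of this kind can be sketched as a simple OLS fit of one set of values on the other. The sketch assumes a regression with an intercept and reports the plain rather than the adjusted $R^{2}$; whether the regressions in this study include an intercept is not stated here, and the function name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.
    Returns (slope, intercept, r_squared) using the plain, unadjusted R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

A slope near 1 together with an $R^{2}$ near 1 would indicate that the two sets of values agree closely in both level and pattern.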
Comparison with national aggregation data. The national aggregation data for the four types of pollutants have been revised as a result of the corrections to the micro-level firm data. Figure 4 compares the original and revised national aggregation data. The revised national aggregation data are smoother over time than the original data. Abnormal values, such as the national COD removal data in 2002 and 2012, have been corrected. Almost all discharge and removal data for the four types of pollutants have been adjusted downward in all years. Removal data show higher reduction rates than discharge data.