Air pollution emissions from Chinese power plants based on the continuous emission monitoring systems network

To meet the growing electricity demand, China’s power generation sector has become an increasingly large source of air pollutants. Specific control policymaking needs an inventory reflecting the overall, heterogeneous, time-varying features of power plant emissions. Due to the lack of comprehensive real measurements, existing inventories rely on average emission factors that suffer from many assumptions and high uncertainty. This study is the first to develop an inventory of particulate matter (PM), SO2 and NOX emissions from power plants using systematic actual measurements monitored by China’s continuous emission monitoring systems (CEMS) network over 96–98% of the total thermal power capacity. With nationwide, source-level, real-time CEMS-monitored data, this study directly estimates emission factors and absolute emissions, avoiding the use of indirect average emission factors, thereby reducing the level of uncertainty. This dataset provides plant-level information on absolute emissions, fuel uses, generating capacities, geographic locations, etc. The dataset facilitates power emission characterization and clean air policy-making, and the CEMS-based estimation method can be employed by other countries seeking to regulate their power emissions.


Background & Summary
China has become the top power producer globally, holding the largest share (19.5-26.7%) of global power generation from 2010 to 2018 1. The majority (70.4-82.5% during 2010-2018) of China's power generation came from thermal power plants that combust coal, oil plus natural gas, biomass or other fossil energy (accounting for 60.2-73.4% of the total capacity) 2. Owing to the combustion of large amounts of fossil energy, China's thermal power plants have become major sources of air pollutants, emitting 5.0-23.5%, 15.7-38.7% and 19.1-51.5% of China's anthropogenic particulate matter (PM, defined as microscopic solid or liquid matter suspended in the atmosphere) 3-6, SO2 4-10 and NOX 4-11, respectively, from 2010 to 2017. These air pollutants (representing 5.9%, 23.1% and 21.5% of China's anthropogenic PM, SO2 and NOX emissions, respectively, in 2015 5), through a series of physical processes and chemical reactions in the atmosphere 12, contributed to 7.6% of China's population-weighted PM2.5 (PM with an aerodynamic diameter of 2.5 μm or less) concentration as of 2015 13,14, leading to severe haze events and human health damage nationwide.
To control power emissions, an emission inventory at high spatiotemporal resolutions is needed as the foundation for an analysis of power emission characteristics and specific policy designs 15 Table 1). The thermal power plants produce electricity by combusting a variety of fossil energies, which fall into 4 categories: coal, gas plus oil, biomass and others (detailed in Table 2).
The CEAP dataset integrates two databases, i.e., the CEMS data and unit-specific information. The CEMS data (direct, real-time measurements of stack gas concentrations of PM, SO2 and NOX from China's power plant stacks) are monitored by China's CEMS network and reported to the China Ministry of Ecology and Environment (MEE; http://www.envsc.cn/). The CEMS data are recorded on a source and hourly basis. In total, the CEMS dataset covers 4,622 emission sources (i.e., power plant stacks) associated with 5,606 units (accounting for 98% of China's thermal power capacity), 35,064 hours from 2014 to 2017, and 3 air pollutants (i.e., PM, SO2 and NOX) for each source-hour sample (Table 3). The MEE has also provided stack-specific information (latitude and longitude, height, temperature, diameter, etc.; http://permit.mee.gov.cn/).
Unit-specific information is also derived from the MEE, covering activity levels (energy consumption and power generation), operating capacities, geographic locations and pollution control equipment (particularly the types and removal efficiencies) at a yearly frequency. Due to data availability, the unit information is available only until 2016, and the activity levels for 2017 are projected following the overall trends in provincial thermal power generation between 2016 and 2017.
www.nature.com/scientificdata
The CEAP dataset is innovative in that it incorporates comprehensive real CEMS-measured emission data, avoiding the use of average emission factors and the associated operational assumptions and uncertain parameters.
Pre-processing of CEMS data. We have been exclusively granted access to the data from China's CEMS network. Generally, the CEMS consists of a sampling system (for filtering and sampling flue gas), an online analytical component (for monitoring flue gas parameters, particularly emission concentrations) and a data processing system (for collecting, processing and reporting monitoring data) 27,28 . According to the GB13223-2003 regulation 29 , the CEMS network should cover all power plant furnaces that burn coal (except stoker and spreader stoker) and oil and generate >65 tons of steam each hour, as well as those that burn pulverized coal and gas. Thus, some power plants have not yet been incorporated into the CEMS network (accounting for 3-4% of the total thermal power capacity from 2014 to 2017) because their furnaces did not meet the requirements necessary to install a CEMS. For the power plants outside the CEMS network, we assume their stack concentrations are similar to the averages of the units with similar fuel types and similar regions within the CEMS network.
To guarantee the reliability of CEMS data, China's government has made great efforts in developing specific regulations and technical guidelines for power plants and local entities to follow and supervise, respectively 24,28,30-32. These official documents elaborate on all the processes required to regulate the CEMS network, including not only CEMS installation, operation, inspection, maintenance and repair but also CEMS data collection, processing, reporting, analysis and storage 28,32,33. Since 2014, all state-monitored companies have been mandated to report their CEMS data to the local governments through a series of online platforms for different provinces (listed in Supplementary Table 1). Local entities conduct random onsite inspections to check the truthfulness of the reported results on at least a quarterly basis 23,24,28,32,34; this system enables a comparison of CEMS data across different firms to explore potential outliers and abnormalities and prevent data manipulation 28,35. The governments then release the inspection results to the public through the same online platforms (listed in Supplementary Table 1) 24,36,37. Severe financial penalties and criminal punishments can be imposed on firms that manipulate data (by deleting, distorting or forging CEMS data, for example) 38,39.
The malfunction of CEMS monitors may also introduce large uncertainty to CEMS data during the processes of operation (indication errors, span drift, zero drift, etc.), maintenance (particularly the failure to perform calibration and reference tests) and data reporting (invalid data communication, missing data, etc.) 24,28. Accordingly, each power plant is required to perform at least one A-, B- and C-grade overhaul for 32-80, 14-50 and 9-30 days per 4-6, 2-3 and 1 year(s), respectively, as well as one D-grade overhaul (if needed) for 5-15 days per year, to check, maintain and upgrade its technologies, thereby reducing measurement uncertainty 40. During these overhauls, CEMS operators conduct CEMS calibration (i.e., zero and span calibration), maintenance procedures (e.g., examining and cleaning major CEMS components and replacing or upgrading parts, if necessary, such as the optical lens, filter and sampling meter) and a reference test (i.e., relative accuracy test audit). Furthermore, third-party operators examine CEMS operation and maintenance routines to guarantee standardized CEMS operation and facilitate improvement in CEMS data accuracy 27,28,31. All the related activities should be documented according to standardized requirements 27,28. Even with the aforementioned efforts, there is still a small proportion of nulls and outliers in the CEMS database, which represent 1% and 0.1% of the total operating hours, respectively, from 2014 to 2017. We treat these samples carefully by following the relevant official documents released by China's government. Table 4 provides the treatment methods for nulls or zeros, which can be divided into 3 types based on duration. On the one hand, we consider nulls and/or zeros that span at least 5 successive days as downtime or an overhaul and omit them from the estimation, according to the regulation 27.
On the other hand, missing data lasting less than 5 days are treated as outliers (i.e., impossible values in operation) and processed in two different ways according to the guideline 27: nulls and/or zeros lasting more than 24 successive hours are assumed to lie around the valid values near that time and are set to the monthly averages, whereas those lasting 1-24 hours are set to the arithmetic mean of the two nearest valid points before and after them. Furthermore, we treat measurements that fall outside the measurement ranges of CEMS instruments (outside of which the data are unreliable 30,44; detailed in Supplementary Table 2) as abnormal data and process them in a similar way to nulls according to the official regulation 27.
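The duration-based treatment of nulls and zeros can be illustrated with a short Python sketch: runs of at least 5 days are dropped as downtime, runs longer than 24 hours are set to the monthly average, and runs of 1-24 hours are set to the mean of the two nearest valid points. This is an illustration only, not the procedure used to build the dataset. The function name `fill_gaps`, the use of `None` for nulls, the assumption that the input covers one month for one source and pollutant, and the handling of gaps at the edges of the series are all hypothetical choices.

```python
from statistics import mean
from typing import List, Optional

DOWNTIME_HOURS = 5 * 24  # runs at least this long are treated as downtime or overhaul
SHORT_GAP_HOURS = 24     # runs up to this length are filled from the nearest valid points


def is_missing(value: Optional[float]) -> bool:
    """Nulls and zeros are both treated as missing measurements."""
    return value is None or value == 0


def fill_gaps(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill missing hourly concentrations for one source, pollutant and month.

    Entries left as None belong to downtime and should be omitted later.
    """
    valid = [v for v in series if not is_missing(v)]
    if not valid:
        return [None] * len(series)
    monthly_mean = mean(valid)

    filled: List[Optional[float]] = list(series)
    n = len(series)
    i = 0
    while i < n:
        if not is_missing(series[i]):
            i += 1
            continue
        j = i
        while j < n and is_missing(series[j]):
            j += 1  # j ends one past the last missing hour of this run
        run_length = j - i
        if run_length >= DOWNTIME_HOURS:
            fill = None  # downtime or overhaul: omit from the estimation
        elif run_length > SHORT_GAP_HOURS:
            fill = monthly_mean
        else:
            # Assumption: at the edge of the series only one neighbour exists.
            neighbours = []
            if i > 0:
                neighbours.append(series[i - 1])
            if j < n:
                neighbours.append(series[j])
            fill = mean(neighbours)
        for k in range(i, j):
            filled[k] = fill
        i = j
    return filled


short_gap = fill_gaps([10.0, None, 0, 14.0])
long_gap = fill_gaps([20.0] + [None] * 30 + [40.0])
downtime = fill_gaps([20.0] + [None] * 120 + [40.0])
```

In this toy input, the two-hour gap is filled with the mean of its neighbours, the 30-hour gap with the mean of the valid values in the series, and the 120-hour gap is left as `None` so that it can be excluded.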

CEMS-based estimation of emission factors and absolute emissions. The introduction of real CEMS-monitored measurements provides a direct estimation of emission factors on a source and hourly basis, avoiding the use of average emission factors with many assumptions and uncertain parameters 17,42,44:

$$EF_{s,i,y,m,h} = C_{s,i,y,m,h} \, V_{i,y} \qquad (3)$$

where s, i, y, m and h denote the pollutant, unit, year, month and hour, respectively. In Eq. (3), $C_{s,i,y,m,h}$ is the CEMS-monitored stack gas concentration (g m⁻³); $EF_{s,i,y,m,h}$ indicates the emission factor, defined as the amount of emissions per unit of fuel use (in g kg⁻¹ for solid or liquid fuel and in g m⁻³ for gas fuel); and $V_{i,y}$ is the theoretical flue gas rate, defined as the expected volume of flue gas per unit of fuel use under standard production conditions (m³ kg⁻¹ for solid or liquid fuel and m³ m⁻³ for gas fuel) 42, which was estimated by the China Pollution Source Census (2011) 45 based on sufficient field measurements (detailed in Table 5). Based on Eq. (3), abated emission factors can be directly obtained even without the use of removal efficiencies and the relevant parameters, because CEMS monitors the gas concentrations at stacks after the effect of control equipment (if any).
Notably, recent clean air policies (particularly the various emissions standards) target stack concentrations, so a large proportion of data are missing for other measurements (particularly flue gas rates, with missing data accounting for 34.62%, 31.91%, 29.97% and 42.96% of the total samples in 2014, 2015, 2016 and 2017, respectively). Accordingly, we introduce theoretical flue gas rates into the estimation to avoid significant underestimation of the actual volume when too many data values are missing 46. In addition, the adoption of theoretical flue gas rates can address flue gas leakage, a common problem in power plants that greatly distorts the real flue gas volume 46. The theoretical flue gas rates are derived from the China Pollution Source Census, with values varying across operating capacities, fuel types and boiler types 42,45. Thus, the actual volume of flue gas is computed as the theoretical flue gas rate times the actual fuel consumption.
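As a minimal sketch of the calculation described above, the following Python snippet multiplies hourly stack concentrations by a theoretical flue gas rate to obtain hourly emission factors, and multiplies the same rate by fuel consumption to obtain the flue gas volume. The function names and all numerical values are hypothetical and are not taken from the dataset or from Table 5.

```python
from typing import List


def emission_factors(concentrations_g_per_m3: List[float],
                     flue_gas_rate: float) -> List[float]:
    """Hourly emission factors (g per unit of fuel use).

    concentrations_g_per_m3: hourly stack gas concentrations (g per m3).
    flue_gas_rate: theoretical flue gas volume per unit of fuel use
                   (m3 per kg for solid or liquid fuel).
    """
    return [c * flue_gas_rate for c in concentrations_g_per_m3]


def flue_gas_volume(flue_gas_rate: float, fuel_use: float) -> float:
    """Flue gas volume (m3) as theoretical rate times fuel consumption."""
    return flue_gas_rate * fuel_use


# Hypothetical inputs for one unit: three hourly concentrations,
# a flue gas rate of 8.0 m3 per kg and 1,000 kg of fuel burned.
hourly_concentrations = [0.030, 0.035, 0.040]
rate = 8.0
efs = emission_factors(hourly_concentrations, rate)
volume = flue_gas_volume(rate, 1000.0)
```

Because the concentrations are measured at the stack, after any control equipment, no removal efficiency appears in the calculation.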
The absolute emissions of PM, SO2 and NOX from individual power plants can be estimated as the emission factors times the activity levels 21, where $E_{s,i,y,m}$ represents the air pollution emissions (g) and $A_{i,y,m}$ is the activity data, i.e., the amount of fuel use (kg for solid or liquid fuel and m³ for gas fuel). In the CEAP dataset, power plant emissions are estimated on a monthly basis (the smallest scale for activity data), in which the yearly unit-level activity data are allocated to the monthly scale using the monthly province-level thermal power generation as weights 16.
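The monthly allocation and the emission calculation can be sketched in Python as follows. The sketch assumes that the weights are each month's share of the provincial thermal power generation over the months supplied, and that the hourly emission factors within a month are combined by a simple arithmetic mean. Neither detail is confirmed by the text, and the function names and numbers are hypothetical.

```python
from statistics import mean
from typing import List


def allocate_to_months(yearly_activity: float,
                       provincial_generation: List[float]) -> List[float]:
    """Split a unit's yearly fuel use across months.

    Each month receives a share proportional to the provincial thermal
    power generation in that month.
    """
    total = sum(provincial_generation)
    if total <= 0:
        raise ValueError("provincial generation must sum to a positive value")
    return [yearly_activity * g / total for g in provincial_generation]


def monthly_emissions(hourly_efs_by_month: List[List[float]],
                      monthly_activity: List[float]) -> List[float]:
    """Monthly emissions (g) as emission factor times activity.

    Assumption: the monthly emission factor is the mean of the hourly ones.
    """
    return [mean(efs) * a for efs, a in zip(hourly_efs_by_month, monthly_activity)]


# Toy example with three "months" to keep the numbers readable.
activity = allocate_to_months(1200.0, [2.0, 1.0, 1.0])
emissions = monthly_emissions([[0.2, 0.4], [0.3], [0.1, 0.1]], activity)
```

With these toy inputs the yearly fuel use of 1,200 kg is split into 600, 300 and 300 kg, and the allocated amounts always sum back to the yearly total.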

Data records
The CEAP dataset contains a total of 12 data records (emissions and plant/unit information inventories), which have been uploaded to the public repository figshare 47. The CEAP dataset introduces systematic real measurements by China's CEMS network to directly estimate the PM, SO2 and NOX emissions from China's power plants during 2014-2017 (Fig. 1). In particular, the dataset provides plant-level information about absolute emissions, fuel uses, generating capacities and geographic locations for 2,583, 2,714, 2,596 and 2,596 power plants from 2014 to 2017, respectively. In addition, the CEAP dataset presents dynamic stack concentrations by region and fuel type and describes the overall structures of operating units, capacities, ages, emission factors, emissions and CEMS coverage for China's thermal power plants.

Technical Validation
Uncertainties. The CEMS-based estimates are subject to uncertainties arising from volatilities in the CEMS data, the introduction of theoretical flue gas rates and the projection of activity data. Thus, uncertainty analyses are performed to verify the robustness of our estimates. Generally, the uncertainty analysis on each examined model variable or parameter (emission concentrations, theoretical flue gas rates or activity data) includes five main steps: (a) estimate the probability distributions by fitting the data with a given distribution as the input of the Monte Carlo approach; (b) generate random values based on the probability distributions via Monte Carlo simulation; (c) put the random values into Eqs. (3-5) to replace the original values and obtain a new set of estimates for emission factors and total emissions; (d) repeat steps (b) and (c) 10,000 times and obtain 10,000 sets of results 16,17,48,49; and (e) yield the uncertainty ranges of our estimates in terms of 2 standard deviations of the 10,000 new sets of results 21. Table 6 reports the related results and reveals that the uncertainties can be controlled within a small range (i.e., ±9.03% and ±2.47% for emission factors and absolute emissions, respectively).
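Steps (a) to (e) can be sketched in Python for a single unit and month. The sketch assumes a normal distribution fitted to daily average concentrations and a simplified model in which emissions equal concentration times flue gas rate times fuel use. The function name, the fixed random seed and all input values are hypothetical, and the result is not comparable to the ranges reported in Table 6.

```python
import random
from statistics import mean, stdev
from typing import List, Tuple


def monte_carlo_range(daily_concentrations: List[float],
                      flue_gas_rate: float,
                      fuel_use: float,
                      n_draws: int = 10_000,
                      seed: int = 0) -> Tuple[float, float]:
    """Return (central estimate, uncertainty as +/- percent of that estimate)."""
    # Step (a): fit a normal distribution to the observed concentrations.
    mu = mean(daily_concentrations)
    sigma = stdev(daily_concentrations)

    rng = random.Random(seed)
    simulated = []
    for _ in range(n_draws):  # step (d): repeat steps (b) and (c)
        concentration = rng.gauss(mu, sigma)                       # step (b)
        simulated.append(concentration * flue_gas_rate * fuel_use)  # step (c)

    # Step (e): express the range as 2 standard deviations of the simulations.
    central = mu * flue_gas_rate * fuel_use
    relative_range = 100.0 * 2.0 * stdev(simulated) / central
    return central, relative_range


central, percent = monte_carlo_range(
    [0.030, 0.034, 0.032, 0.036, 0.028], flue_gas_rate=8.0, fuel_use=1000.0
)
```

Because the toy model is linear in the concentration, the simulated range should sit close to twice the coefficient of variation of the input data, which gives a quick sanity check on the simulation.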
The guideline (HJ/T 75-2007) 27 suggests setting missing data lasting for 1-24 hour(s) to the arithmetic mean of the two nearest valid points before and after them.
Uncertainties in CEMS data. The volatility in stack gas concentrations (the key model inputs in our estimation) should be considered in the uncertainty analysis 42. As the hourly CEMS measurements are recorded as averages over one-hour periods, the associated volatility reflects real variability in the emissions well (as power demand rises and falls throughout the day, for example) 32. We assume normal distributions for stack concentrations for each unit on a monthly basis and then derive the related distribution parameters (e.g., the mean and the standard deviation) through data fitting based on the associated daily averages of the CEMS measurements 50,51. For a unit without CEMS, the bootstrap method is used to select samples, with equal probability, from the units of the same fuel type and the same region in the CEMS network. Then, the Monte Carlo simulation is performed to generate random stack concentrations based on the associated distributions 17,42. With 10,000 simulations, the uncertainty ranges of the estimates are assessed to be small, i.e., ±8.65% and ±1.09% for the emission factors and absolute emissions, respectively.
Measurement uncertainties lead to a certain level of deviation in CEMS-monitored stack concentrations 28. According to the official regulation 27, a qualified CEMS instrument should control the error tolerance within ±15%, ±5% and ±5% for PM, SO2 and NOX concentrations, respectively. Accordingly, we assume uniform distributions within the allowed tolerance ranges for all stack concentrations on an hourly, unit and pollutant basis. Then, random stack concentrations are generated using the Monte Carlo technique and put into Eq. (3), replacing the associated original values. A total of 10,000 simulations are run to estimate the uncertainty ranges of our estimates (in terms of 2 standard deviations). The results show that the final uncertainties fall within ±10.38% for emission factors and ±0.59% for total emissions.
Uncertainties in theoretical flue gas rates. Given that a large proportion of measurements of actual flue gas rates are missing in CEMS data (29.97-42.96% from 2014 to 2017), we introduce theoretical flue gas rates (fourth column of Table 5) in the estimation. Even though this method can prevent significant underestimations and flue gas leakage, uncertainties might arise due to the heterogeneity across units in factors such as technologies, operational situations and feedstocks. We assess the uncertainty ranges of flue gas rates (defined as the lower and upper bounds of a 95% confidence interval around the central estimates 16,48; sixth column) using the real samples in the CEMS database for 1,373 units that have different unit capacities, fuel types and boiler types and are located throughout mainland China (fifth column). A single-sample two-tailed t-test is conducted, and the results (last column) indicate that the mean CEMS-monitored flue gas rates (fifth column) are at similar levels to the theoretical values that we used (fourth column).
In the uncertainty analysis, Monte Carlo simulation is conducted to produce random flue gas rates following a uniform distribution on the associated uncertainty ranges 48,52 . For the unit types without uncertainty ranges (e.g., those burning solid waste, oil and petroleum coke), the largest range (i.e., ±10.07%) is employed. Relying on 10,000 simulations, the results show that uncertainty ranges can be well controlled within ±6.90% and ±0.23% for the emission factors and absolute emissions, respectively.
Uncertainties in activity data. The unit-specific activity data are available only up to 2016, and the 2017 values are projected using the monthly provincial data for 2017. This approach assumes that the growth rates in the activity levels of different units in a province are uniform from 2016 to 2017, which does not fully reflect reality and introduces uncertainties. To assess such uncertainties, a bootstrap method is used to generate 10,000 samples of the growth rates from the previous values from 2014 to 2016, and a normal distribution is fitted to these samples. The Monte Carlo simulation is then performed to generate random growth rates, and hence the growth of activity levels from 2016 to 2017, for individual units, and the total provincial growth is allocated to each unit using the random growth as weights. With 10,000 simulations, the uncertainty range of total emissions is estimated to be quite small (within ±0.03%).
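One possible reading of this procedure is sketched below in Python for a single simulation draw. The sketch resamples historical growth rates with replacement, fits a normal distribution to the resampled values, draws one random growth rate per unit and distributes the provincial growth in proportion to each unit's random growth. The function name, the exact weighting scheme and the input values are assumptions made for illustration.

```python
import random
from statistics import mean, stdev
from typing import List


def project_activity(activity_2016: List[float],
                     provincial_growth: float,
                     historical_growth_rates: List[float],
                     n_bootstrap: int = 10_000,
                     seed: int = 0) -> List[float]:
    """Project 2017 unit-level activity for one Monte Carlo draw."""
    rng = random.Random(seed)

    # Bootstrap: resample the historical growth rates with replacement.
    resampled = rng.choices(historical_growth_rates, k=n_bootstrap)

    # Fit a normal distribution to the bootstrap sample.
    mu, sigma = mean(resampled), stdev(resampled)

    # Draw a random growth rate per unit and convert it to absolute growth.
    unit_growth = [a * rng.gauss(mu, sigma) for a in activity_2016]
    total_growth = sum(unit_growth)
    if total_growth == 0:
        raise ValueError("random growth sums to zero; weights are undefined")

    # Allocate the known provincial growth using the random growth as weights.
    return [a + provincial_growth * g / total_growth
            for a, g in zip(activity_2016, unit_growth)]


units_2016 = [1000.0, 2500.0, 4000.0]
units_2017 = project_activity(units_2016, provincial_growth=300.0,
                              historical_growth_rates=[0.02, 0.05, 0.03, 0.04])
```

By construction, the projected unit values add up to the 2016 total plus the provincial growth, so only the distribution across units varies between draws. Repeating the draw with different seeds would give the spread used in the uncertainty range.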
Comparison with existing databases. We compare our estimates with existing databases and find that our estimates of Chinese power emissions (using the real CEMS measurements for 2014-2017; purple bars in Fig. 2) are 18.62-91.86%, 54.98-69.77% and 17.55-67.76% below previous estimates (based on average emission factors that were evaluated up to 2012 without considering the recent mitigation effect particularly attributable to the ULE standards policy promulgated in 2014) for PM, SO2 and NOX, respectively. Furthermore, using the detailed measurements on a source and hourly basis, the uncertainty of our estimates can be controlled at a relatively low level (error bars).
Table 6. Uncertainty ranges of the estimated emission factors and absolute emissions.
Limitations and future work. The CEAP dataset can be improved and extended from the following perspectives. First, some power plants, accounting for an average of 3.8% of the total thermal capacity for 2014-2017, have not yet been incorporated into the CEMS network. Therefore, collecting and incorporating these samples is needed to extend the CEAP dataset. Second, apart from air pollutants from power plants, the CEMS network monitors both air and water pollutants from various industries, totalling over 30,000 emission sources. Based on these data, the CEAP database can be extended into multisector datasets for both air and water pollutants in the future. Third, due to data availability, the estimation does not use high-frequency information about activity data, so the CEMS data largely drive the estimated power emissions on a monthly scale. Future research involves incorporating hourly operational data (especially fuel consumption and flue gas rates) for each unit to improve the reliability of emissions estimates. Fourth, although great efforts have been made to guarantee the reliability of CEMS data, rigorous verification work (such as aerial concentration measurements) is still needed to check the data quality of the CEMS system 41.

Code availability
There is no custom code in the generation of the CEAP dataset. In this study, Microsoft Excel is employed to process all the data and Origin is used to draw the figures. Three model inputs have been used in the construction of this dataset, i.e., the measurements from China's CEMS network, theoretical flue gas rates and activity data. First, the CEMS-monitored data are released by the Ministry of Ecology and Environment of China through online platforms for different provinces, and we have documented all the links to these platforms in Supplementary Table 1. Second, theoretical flue gas rates are available in Table 5. Third, activity data are exclusively offered by the Ministry of Ecology and Environment of China.