Quantitative association analysis between PM2.5 concentration and factors on industry, energy, agriculture, and transportation

Rapid urbanization is causing serious PM2.5 (particulate matter ≤2.5 μm) pollution in China. However, the impacts of human activities (including industrial production, energy production, agriculture, and transportation) on PM2.5 concentrations have not been thoroughly studied. In this study, we obtained a regression formula for PM2.5 concentration based on more than 1 million recorded PM2.5 values and data on meteorology, industrial production, energy production, agriculture, and transportation for the 31 provinces of mainland China between January 2013 and May 2017. We used stepwise regression to process 49 factors that influence PM2.5 concentration and obtained the 10 primary influencing factors. Data on PM2.5 concentration and the 10 factors from June to December 2017 were used to verify the robustness of the model. Excluding meteorological factors, the production of natural gas, industrial boilers, and ore has the strongest association with PM2.5 concentration, while nuclear power generation is the factor most strongly associated with lower PM2.5 concentration. Tianjin, Beijing, and Hebei are the provinces most vulnerable to high PM2.5 concentrations caused by industrial production, energy production, agriculture, and transportation (IEAT).

Many studies have been devoted to finding the influencing factors for PM2.5 concentration. Meteorological factors such as temperature [21], wind speed [22], and rainfall [23], and human factors such as industrial processes [24], energy production and consumption [25], and transportation [26] are the most important. Previous research has shown that many human activities, such as crop straw burning [27], coal burning [28], and vehicle exhaust emissions [29], severely affect air pollution. Much research on PM2.5 concentration has been carried out using Chemistry Transport Models (CTMs), which can simulate the distribution of PM2.5 concentration in a given region. However, this method still has some limitations. First, more work is needed on the mechanism of PM2.5 formation and development [30]. Second, data on pollution sources across a whole nation are often not accurate enough, which leads to significant simulation error [31].
In this paper, PM2.5 concentration is studied through a different approach. We sought to identify the most important influencing factors among industrial production, energy production, agriculture, and transportation (IEAT) for PM2.5 concentration based on available statistical data. Based on the more than 1 million PM2.5 records collected, we determined the spatial and temporal characteristics of the PM2.5 distribution in mainland China and analyzed trends in PM2.5 for different provinces. We also collected multisource statistical data that included meteorological and IEAT factors for each month from January 2013 to May 2017. We found associations between PM2.5 and IEAT factors, and developed a regression formula for PM2.5 concentration based on 10 primary factors. In addition, we calculated the meteorological and IEAT contributions to PM2.5 levels in the different provinces. The results can help governments develop macro-level mitigation plans for controlling PM2.5 in the future.

Data and Methods
Data source. We summarized the collected data into 49 influencing factors, as listed in Table S1. These factors can be grouped into 5 categories: meteorology, industrial production, energy production, agriculture, and transportation. PM2.5 data have been collected hourly since 2013 by 391 monitoring stations managed by the Ministry of Environmental Protection, China (MEP); daily PM2.5 data from January 2014 through May 2017 for 190 monitored cities are from the Air Quality Inspection Platform of China [32]; meteorological data between 2013 and 2016 were collected by 195 weather stations of the China Meteorological Administration (CMA) [33]; meteorological data between January and May 2017 are from the National Oceanic and Atmospheric Administration (NOAA) [34]; all data for industrial and energy production are from the National Bureau of Statistics of China (NBSC) [35]; all transportation data are from the Ministry of Transport, China (MOT) [36]; data for straw burning are from the Ministry of Environmental Protection, China (MEP) [37]; data for the geographic division of mainland China are from the Ministry of Civil Affairs, China (MCA) [38].
Data processing. The data sets have different temporal collection frequencies: PM2.5 and meteorological data were collected hourly or daily, while the data for human factors were collected monthly. To unify these data sets, a monthly value is used for each factor in this study, so monthly averages were calculated for the PM2.5 and meteorological data. PM2.5 and meteorological data, which were obtained by station or city, were also averaged by province to obtain provincial values.
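The aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the function name `monthly_province_means`, and the order of averaging (station-level monthly means first, then the provincial mean) are assumptions, since the text does not state them.

```python
from collections import defaultdict
from statistics import mean


def monthly_province_means(records):
    """Aggregate readings to one value per (province, month).

    records: iterable of (province, station, 'YYYY-MM-DD', value) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    # Step 1: average each station's readings within each month.
    by_station_month = defaultdict(list)
    for province, station, date, value in records:
        month = date[:7]  # 'YYYY-MM'
        by_station_month[(province, station, month)].append(value)

    # Step 2: average the station-level monthly means within each province.
    by_province_month = defaultdict(list)
    for (province, _station, month), values in by_station_month.items():
        by_province_month[(province, month)].append(mean(values))

    return {key: mean(vals) for key, vals in by_province_month.items()}


# Made-up readings, used only to show the call.
example_records = [
    ("Beijing", "A", "2016-01-01", 100.0),
    ("Beijing", "A", "2016-01-02", 80.0),
    ("Beijing", "B", "2016-01-01", 60.0),
    ("Hebei", "C", "2016-01-01", 120.0),
]
example_means = monthly_province_means(example_records)
```

Averaging per station first prevents a station with many readings from dominating the provincial value; averaging all readings in one pass would be an equally plausible reading of the text.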
To obtain the regression formula, we used data from the 53 months between January 2013 and May 2017. There are 34 provinces in China. Since no data are available for Hong Kong, Macau, and Taiwan, we considered only the 31 provinces of mainland China. Therefore, the total data volume for each index would be 1643 (31 × 53) if no data were missing. In reality, some data are missing, as shown in Table S1, and we used linear interpolation to fill these gaps. Most of the meteorological factors and some of the human factors vary periodically by year, so we gave priority to linear interpolation by year rather than by month. For example, if the data (X, M, Y) for province X in month M of year Y are missing, we determine the value by linear interpolation between data (X, M, Y − 1) and data (X, M, Y + 1). If data (X, M, Y − 1) or data (X, M, Y + 1) is missing, we use data (X, M − 1, Y) and data (X, M + 1, Y) instead. We do not fill data outside the periods for which they were measured; for example, if data were recorded only from May 2015 to May 2017, data before May 2015 cannot be determined via interpolation. In total, 3.1% of the data were determined using linear interpolation.
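The interpolation priority can be illustrated with a short sketch. The dictionary layout and the function name `fill_missing` are hypothetical, and wrapping the monthly fallback across the year boundary (December to January) is an assumption that the text does not address. With two equally spaced neighbors, linear interpolation reduces to their average.

```python
def fill_missing(series, province, month, year):
    """Estimate a missing value for (province, month, year).

    series maps (province, month, year) -> value; missing entries are absent.
    Returns None when neither pair of neighbors is available, so nothing is
    extrapolated outside the measured period.
    """
    def get(m, y):
        return series.get((province, m, y))

    # Priority 1: the same month in the previous and following year.
    prev_year, next_year = get(month, year - 1), get(month, year + 1)
    if prev_year is not None and next_year is not None:
        return (prev_year + next_year) / 2

    # Priority 2: the adjacent months (assumed to wrap over the year boundary).
    pm, py = (12, year - 1) if month == 1 else (month - 1, year)
    nm, ny = (1, year + 1) if month == 12 else (month + 1, year)
    prev_month, next_month = get(pm, py), get(nm, ny)
    if prev_month is not None and next_month is not None:
        return (prev_month + next_month) / 2

    return None


# Made-up values, used only to show both branches.
yearly = {("Hebei", 3, 2014): 80.0, ("Hebei", 3, 2016): 60.0}
monthly = {("Hebei", 2, 2015): 90.0, ("Hebei", 4, 2015): 70.0, ("Hebei", 3, 2014): 80.0}
by_year = fill_missing(yearly, "Hebei", 3, 2015)
by_month = fill_missing(monthly, "Hebei", 3, 2015)
unfilled = fill_missing({("Hebei", 3, 2016): 60.0}, "Hebei", 3, 2015)
```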
To remove the differences caused by variation in land area among the 31 provinces and in the number of days per month, all cumulative data (industrial production, energy production, agriculture, and transportation) were divided by the province area and by the number of days in the month.
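As a small worked example of this normalization, the sketch below divides a monthly total by area and by the number of days in that month. The function name and the use of km² as the area unit are assumptions made for illustration.

```python
import calendar


def normalize_cumulative(total, area_km2, year, month):
    """Convert a monthly total into a per-area, per-day value."""
    # monthrange returns (weekday of the first day, number of days in the month).
    days = calendar.monthrange(year, month)[1]
    return total / area_km2 / days


# Made-up totals: January 2016 has 31 days, February 2016 (leap year) has 29.
january = normalize_cumulative(3100.0, 100.0, 2016, 1)
february = normalize_cumulative(2900.0, 100.0, 2016, 2)
```

Both calls give the same normalized value, which is the point of the correction: equal daily output per unit area is not distorted by month length.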
In this study, we used stepwise regression [39,40] to identify the primary influencing factors that contribute the most to PM2.5 concentration. In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is made by an automatic procedure [41]. The variables that end up in the final equation represent the best combination of independent variables for predicting the dependent variable [42]. Stepwise regression is frequently used in the statistical analysis of air pollution and has the advantage of helping to avoid collinearity among the selected variables [43]. SPSS software (IBM SPSS, version 20) was used for the statistical analysis [44].
All 49 influencing factors (listed in Table S1) were used as input variables. Stepwise regression identified the 10 factors with the strongest association with PM2.5 and yielded the regression formula for PM2.5 concentration.
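To make the idea of automatic variable selection concrete, the sketch below implements a simplified forward selection in plain Python. It is not the SPSS procedure used in this study: SPSS stepwise regression enters and removes variables using F-test probabilities, whereas this illustration only adds variables and stops when the gain in R² falls below a hypothetical threshold `min_gain`. It also assumes the candidate columns are not perfectly collinear.

```python
def fit_ols(columns, y):
    """Least-squares fit of y on the given columns plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
                 for r in range(n))
    return beta, 1.0 - ss_res / ss_tot


def forward_select(factors, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the factor that raises R^2 most.

    factors: dict mapping a factor name to its list of values.
    """
    selected, best_r2 = [], 0.0
    while len(selected) < len(factors):
        candidates = [name for name in factors if name not in selected]
        scored = [(fit_ols([factors[s] for s in selected + [name]], y)[1], name)
                  for name in candidates]
        r2, name = max(scored)
        if r2 - best_r2 < min_gain:
            break  # no remaining factor improves the fit enough
        selected.append(name)
        best_r2 = r2
    return selected, best_r2


# Made-up data: y depends only on factor "a" (y = 3 + 2a).
demo_factors = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
demo_y = [5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
demo_selected, demo_r2 = forward_select(demo_factors, demo_y)
```

In the toy data, factor "b" is correlated with "a" but adds nothing once "a" is in the model, so it is left out. This is the behavior that makes stepwise selection attractive when many candidate factors overlap.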

Results
General analysis for PM2.5 in China. Because meteorological conditions and human behavior are periodic, PM2.5 variation is also periodic. Figure 1 shows the pollution level and mean concentration of PM2.5 in China for each of the 12 months, based on data from the past four years. The month with the best air quality is August, with a mean PM2.5 concentration of 36.0 μg/m³; 81.1% of the days had good air quality, and 17.5% of the days had moderate air quality. January has the worst air quality, with a mean PM2.5 concentration of 94.3 μg/m³. In January, only 27.6% of the days had good air quality, and 37.1% of the days had light, moderate, heavy, or severe pollution. In general, PM2.5 concentrations are much higher in winter than in summer, so better control of PM2.5 in winter is critical. Figure 1 shows the temporal distribution of PM2.5, while Fig. 2 shows its spatial distribution.
Table 1. Meteorological factors, including temperature, air pressure, and wind speed, are the most strongly associated with PM2.5 concentration. The top six IEAT factors with a strong association with PM2.5 concentration are the production of natural gas, industrial boilers, ore, tractors, nuclear power, and locomotives.
Shaanxi, Tianjin, and Beijing have the highest production of natural gas per unit area. Henan, Shanghai, and Jiangsu are the three biggest industrial boiler producers per unit area. Hebei, Beijing, and Liaoning have the highest production of ore per unit area. Shandong, Henan, and Chongqing are the top three provinces in tractor production per unit area. Zhejiang, Tianjin, and Beijing rank highest in the production of locomotives per unit area. Nuclear power generation is the only human factor among the top ten that is associated with lower PM2.5 concentration, which suggests that using clean energy can effectively reduce PM2.5 air pollution. Just 7 out of 31 provinces have nuclear power generation. The average annual PM2.5 concentrations in Zhejiang, Guangdong, and Fujian provinces (which have high nuclear power generation) are 49.5, 37.2, and 30.4 μg/m³, respectively. The constant in the regression formula (Eq. 1) is 13.7, which means that the background PM2.5 concentration in China is 13.7 μg/m³. In addition, the R² of the linear fit with these 10 factors is 0.721.
Meteorological and IEAT contributions to PM2.5 concentration. In this study, we divided the influencing factors into meteorological and IEAT elements. The meteorological and IEAT contributions to PM2.5 concentration are calculated using Eq. 1 with the meteorological and human factors, respectively. Figure 4(a,b) shows the meteorological and IEAT contributions to PM2.5. In Fig. 4(a), the three northeastern provinces of China and North China (except Inner Mongolia and Tianjin) have a high meteorological contribution to PM2.5, while the south of China has a lower meteorological contribution. Table S2.
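One way to read this split is that each contribution is the sum of the Eq. 1 terms belonging to that group. The sketch below follows that reading. The coefficient values and factor names are placeholders invented for illustration (they are not the fitted coefficients of Eq. 1), and leaving the constant term out of both groups is an assumption.

```python
def split_contributions(coefficients, values, meteorological):
    """Sum the regression terms separately for the two factor groups.

    coefficients, values: dicts keyed by factor name.
    meteorological: set of names treated as meteorological factors;
    every other factor is counted as an IEAT factor.
    """
    met = sum(coefficients[name] * values[name]
              for name in coefficients if name in meteorological)
    ieat = sum(coefficients[name] * values[name]
               for name in coefficients if name not in meteorological)
    return met, ieat


# Placeholder coefficients and inputs, NOT taken from the study.
placeholder_coefficients = {"temperature": -1.5, "wind_speed": -4.0, "natural_gas": 0.5}
placeholder_values = {"temperature": 10.0, "wind_speed": 2.0, "natural_gas": 20.0}
met_part, ieat_part = split_contributions(
    placeholder_coefficients, placeholder_values, {"temperature", "wind_speed"}
)
```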
Verification. A separate data set from June 2017 to December 2017 was used to verify the robustness of the regression formula obtained above (Eq. 1). In Fig. 5, a total of 217 points show the monthly averaged PM2.5 data for the 31 provinces over these 7 months. Comparing the measured data with the values calculated by the regression formula from the ten influencing factors (Fig. 5) gives an R² of 0.62, which is slightly lower than the R² of the regression formula itself (0.72). The average error between the measured and calculated values is 11.1 μg/m³.
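The two verification statistics can be computed as in the sketch below. The definitions are assumptions: R² is taken here as the coefficient of determination of the predictions, and the average error as the mean absolute error; the text does not specify which definitions were used. The sample numbers are made up.

```python
def r_squared(measured, predicted):
    """Coefficient of determination of predictions against measurements."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# 31 provinces over 7 months give the number of verification points.
n_points = 31 * 7

# Made-up monthly values (μg/m³), used only to show the calls.
sample_measured = [10.0, 20.0, 30.0, 40.0]
sample_predicted = [12.0, 18.0, 33.0, 37.0]
sample_r2 = r_squared(sample_measured, sample_predicted)
sample_mae = mean_absolute_error(sample_measured, sample_predicted)
```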

Discussion
In this study, we investigated the effects of 49 influencing factors, covering meteorology, industrial production, energy production, agriculture, and transportation, on PM2.5 concentrations. We determined a quantitative association between PM2.5 concentration and IEAT factors, and obtained a formula for PM2.5 concentration considering 10 primary factors based on stepwise regression. Stepwise regression was used because it is suitable for processing collinear data. We also tried ordinary linear regression and found that its results were not as good as those obtained by stepwise regression (see Supplementary Information for details). We analyzed PM2.5 concentrations in China from January 2013 to May 2017. The average PM2.5 concentration over the 12 months follows an upward-opening parabola (high in winter, low in summer). Severe pollution (>300 μg/m³) appeared in winter in many areas of China. Using clean energy such as nuclear power to replace coal-burning power plants is a very efficient way to reduce the number of severely polluted days in China. North and Central China have serious PM2.5 pollution problems. Research has shown that a 10 μg/m³ increase over the previous day's PM2.5 level results in a 1.78% increase in respiratory-related mortality and a 1.03% increase in stroke-related mortality [45]. In China, residents of areas with high PM2.5 concentrations look forward to gale-force winds to reduce pollution. However, reducing the PM2.5 generated by human activity is the key solution. One approach would be to distribute PM2.5 sources more reasonably so as to balance PM2.5 concentrations between highly populated and rural areas. In addition, efforts should be made to reduce the number of severely polluted days rather than just the average PM2.5 concentration.
Meteorological contributions to PM2.5 are high in the three northeastern provinces of China and North China (excluding Inner Mongolia and Tianjin). These areas are cold in winter, resulting in more coal consumption. Less rainfall in these inland areas is another meteorological reason for high PM2.5 levels. North China (excluding Inner Mongolia), Central China, and some provinces of East China have high IEAT contributions to PM2.5. Beijing is a typical polluted city, with an average PM2.5 concentration of 76.7 μg/m³ from April 2016 to March 2017. In the 13th Five-Year Plan, the Ministry of Environmental Protection, China (MEP) set a target of reducing the PM2.5 concentration in Beijing to 56 μg/m³ by 2020. This is a considerable challenge. Shanghai provides a good example: over the past four years, its PM2.5 concentration has gradually decreased even though urban construction continues and the economy keeps improving.
Using stepwise regression, we found that, among the 49 influencing factors (meteorology, industrial production, energy production, agriculture, and transportation), the production of natural gas, industrial boilers, ore, tractors, nuclear power, and locomotives are the top six IEAT factors associated with PM2.5 concentrations. Since the production of natural gas, energy, ore, locomotives, and tractors does not need to be concentrated in one area, future planning should include spreading these industries over wider areas to avoid creating areas with both high population density and heavy pollution. Finally, the regression formula was verified with data from a further seven months, from June 2017 to December 2017.
Some researchers have focused on the relationship between human factors and PM2.5 pollution. Previous research has suggested that clean fuels such as natural gas should replace the coal used in small domestic boilers to reduce air pollution [46], because the combustion of natural gas is cleaner. However, some recent studies have noted that several air pollutants, including VOCs, NOx, PM2.5, and SO2, are emitted during the production stage of natural gas [47-49]. Our results also show that the production of natural gas is strongly related to PM2.5 concentrations. Using clean energy (such as solar, wind, and nuclear power) to replace fossil energy, rather than using natural gas to replace coal, would be a better solution for PM2.5 control. Industrial boilers, which are usually used for burning coal [50], are bad for air quality [51]. We also found that provinces with high production of industrial boilers usually have high PM2.5 concentrations. Industrial boilers are usually heavy and inconvenient to transport over long distances, so most of the boilers produced are likely to be installed and used locally, which may explain this phenomenon. By analyzing 24-h PM2.5 samples collected in Brazil, researchers found that environmental contamination was driven mainly by ore mining and related activities, such as the transport of products to and from the mines [52]. In our study, the production of ironstone and phosphate ore was the IEAT factor with the third strongest association with PM2.5 (P < 0.001), after the production of natural gas and industrial boilers. Turkish researchers found that the PM2.5 concentration around tractors can reach thousands of μg/m³ [53]. We found that the production of tractors is also strongly related to PM2.5 concentration (P < 0.001). New tractors are usually transported to nearby regions, and tractor operation also generates PM2.5. In New York, PM2.5 levels rapidly peaked when a diesel-powered locomotive passed [54].
In China, locomotives are widely used. We found that the production of locomotives is associated with higher PM2.5 concentration. High production of locomotives generally reflects a high level of heavy industry, which leads to serious air pollution. As for meteorological factors, many studies have shown that PM2.5 concentration is negatively correlated with wind speed and rainfall [55-58], which agrees with our findings. Because many human factors, such as coal burning, are strongly related to meteorological factors, the impacts of meteorological factors on PM2.5 concentration are difficult to measure quantitatively.
There are some limitations to this study. Because of data limitations, we considered only some of the influencing factors in meteorology, industrial production, energy production, agriculture, and transportation. Other meteorological and IEAT factors may also influence PM2.5 levels. For example, energy consumption has an important influence on emissions and air pollution. However, accurate monthly fuel consumption data at the provincial level are unavailable, mainly because there are numerous distributed consumers. In the future, these factors should be taken into consideration if advanced statistical methods are developed and accurate energy consumption data become available. In addition, some unconsidered human factors are strongly related to meteorological factors. For example, the consumption of coal and fireworks is high when the temperature is low (the Spring Festival, during which many fireworks are set off, falls around January and February) [59]. Therefore, the meteorological contribution in this study reflects both meteorological factors and unconsidered human factors. In addition, all data are from January 2013 to May 2017; the precision could be improved with data collected over longer periods. We used the province as the spatial unit in this study. City-scale data should be analyzed in the future once more precise data are obtained. In our data processing, some data were obtained using yearly and monthly linear interpolation, which may slightly influence the precision of the regression formula for PM2.5 concentration. Moreover, since all factors and results are related to industrial technologies, personal living habits, and other characteristics, the association should be adjusted before being used to analyze PM2.5 concentrations in other countries.

Conclusions
PM2.5 in China was analyzed in spatial and temporal dimensions using data from January 2013 to May 2017. We quantified the associations of meteorology, industrial production, energy production, agriculture, and transportation with PM2.5 concentration over an extended period. We found that the production of natural gas, industrial boilers, ore, tractors, and locomotives were the five human factors with the strongest association with PM2.5 concentration. The model and the results provide a useful reference for governments planning the control of PM2.5 concentrations. In the future, more types of data, longer time periods, and finer regional resolution should be considered to improve the precision of the association analysis for PM2.5 concentration.