Exploring the spatial heterogeneity and temporal homogeneity of ambient PM10 in nine core cities of China

We focus on the causes of fluctuations in wintertime PM10 in nine regional core cities of China using two machine learning models, Random Forest (RF) and Recurrent Neural Network (RNN). RF and RNN both show high performance in predicting hourly PM10 using only gaseous air pollutants (SO2, NO2 and CO) as inputs, showing the predominance of secondary inorganic aerosol and implying the existence of a thermodynamic equilibrium between gaseous air pollutants and PM10. We also report the following results. Gaseous air pollutants were more strongly correlated with PM10 than meteorological conditions were. CO was the predominant factor for PM10 in the Beijing-Tianjin-Hebei Plain and the Yangtze River Delta, while SO2 and NO2 were also important features for PM10 in the Pearl River Delta and the Sichuan Basin. The spatial heterogeneity and temporal homogeneity of PM10 in China are revealed. Long-range transport of PM10 was found to be insignificant, except during sandstorms. The severity of PM10 was attributable to the lopsided shift of the thermodynamic equilibrium and the phenology of indigenous flora.

www.nature.com/scientificreports/
of PM10 is much higher in winter than in the other seasons, so we focus on wintertime (December, January and February) PM10 over five winters (December 2014 to February 2019). The scope of this work is as follows: (1) finding the different regional PM10 patterns and their determinants; (2) exploring the contributors to severe wintertime haze from a novel perspective and demonstrating the insignificance of long-range transport. In Section two, we introduce the study areas, the sources of data, and the parameters of the two machine learning models. In Section three, we illustrate the causes of the severity of wintertime haze and show the reason why the long-range transported PM10
However, all of these regions have suffered from severe PM10 pollution for decades due to rapid industrialization. To develop better control measures, the question arises whether the regional patterns of PM10 are the same. Because the natural and anthropogenic sources of PM10 differ among regions, a reasonable assumption is that the determinants of PM10 vary among regions but remain consistent within a region. Nine regionally representative core cities, namely Beijing and Tianjin in BTH, Shanghai, Nanjing and Hangzhou in YRD, Guangzhou and Shenzhen in PRD, and Chengdu and Chongqing in SCB, are selected to investigate the regional PM10 patterns in wintertime. These nine cities, each of which has more than nine million residents, are the most flourishing areas of China, with ever-growing urbanization. According to census data, 154 million permanent residents lived in these nine cities in 2017.
Data of wintertime air pollutants and meteorology. All the data used in this work are publicly accessible online. The study period was restricted to wintertime (December, January and February) from 1 December 2014 to 28 February 2019. Hourly air pollutant data, including sulfur dioxide (SO2), nitrogen dioxide (NO2), tropospheric ozone in the surface air (O3), carbon monoxide (CO), PM2.5 and PM10, were extracted from the official website of the China National Environmental Monitoring Centre (http://beijingair.sinaapp.com/), where the air pollution data from 1563 environmental monitoring sites across China were recorded and documented. We chose the environmental monitoring sites in the nine investigated cities for training and testing. We use the data from all of the environmental monitoring sites in a city to calculate Feature Importance. We then take their average to predict hourly PM10 in Scenarios one and two. The meteorological data were from the NASA Global Modeling and Assimilation Office (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2) and the University of Wyoming (http://www.weather.uwyo.edu/surface/meteorogram/seasia.shtml), including hourly temperature, relative humidity, atmospheric pressure, wind speed and wind direction.
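The selection and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout `(timestamp, site_id, value)`, the function names, and the sample values are all hypothetical, and missing readings are assumed to be marked with `None`.

```python
from collections import defaultdict
from datetime import datetime

WINTER_MONTHS = {12, 1, 2}
STUDY_START = datetime(2014, 12, 1, 0)
STUDY_END = datetime(2019, 2, 28, 23)  # last hour of the study period


def is_study_hour(ts):
    """True if the timestamp is a winter hour inside the study period."""
    return ts.month in WINTER_MONTHS and STUDY_START <= ts <= STUDY_END


def city_hourly_mean(records):
    """Average one pollutant over all sites of a city, hour by hour.

    records: iterable of (timestamp, site_id, value); value is None if missing.
    Returns a dict mapping timestamp -> mean over the sites that reported.
    """
    buckets = defaultdict(list)
    for ts, _site, value in records:
        if value is not None and is_study_hour(ts):
            buckets[ts].append(value)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(buckets.items())}


# Hypothetical PM10 readings (ug/m3) from two sites in one city.
records = [
    (datetime(2019, 1, 5, 8), "site_a", 120.0),
    (datetime(2019, 1, 5, 8), "site_b", 100.0),
    (datetime(2019, 1, 5, 9), "site_a", 90.0),
    (datetime(2019, 1, 5, 9), "site_b", None),   # missing reading is skipped
    (datetime(2018, 7, 1, 8), "site_a", 40.0),   # summer hour is dropped
]
pm10_city = city_hourly_mean(records)
```

Averaging only over the sites that reported in a given hour is one possible way to handle gaps; the source does not say how missing values were treated.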
Parameters of random forest and RNN. A Recurrent Neural Network (RNN) is capable of capturing temporal contextual information, which makes it suitable for simulating the accumulation and deposition of air pollutants. An RNN transfers information from one time step to the next. Random Forest (RF), a tree-structured model, is able to quantitatively rate the significance of each input in shaping the output by calculating the Feature Importance (FI). There are two types of Feature Importance, Variable Importance and Gini Importance. In this work, we chose Gini Importance.
Several setups of RF and RNN were tested and fine-tuned before we selected the best parameter settings. For RF, n-estimator is the number of trees built. A higher n-estimator makes the predictions stronger and more stable, but also makes the code run slower. Increasing max-features generally improves the performance of Random Forest, but decreases the diversity of individual trees and slows down the running speed. To strike the right balance, we set max-features to auto, which takes all features into consideration and puts no restriction on the individual tree. Setting max depth to none means that nodes are expanded until all leaves are pure or all leaf nodes contain fewer samples than min samples split, which is set to two in this work. Min sample leaf is the minimum number of samples on a leaf node. Max leaf nodes limits the number of leaves when trees are grown in best-first fashion, where the best nodes are defined by the relative reduction in impurity. Setting max leaf nodes to none means there is no restriction on the number of leaf nodes. For RNN, the activation function chosen was the most popular non-linear function, the rectified linear unit (ReLU), expressed as f(x) = max(x, 0). As the number of hidden units becomes larger, the prediction accuracy of RNN slightly increases but the running speed slows down. In this work, we choose the number of hidden units to be 300. Candidate learning rates are typically log-spaced, and changing the learning rate commonly does not bring significant improvement. We choose a learning rate of 10^-3. The number of layers is set to 2, because in our tests a two-layer RNN predicted PM10 more accurately than a single-layer one.
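The settings above can be summarized in a short sketch. The parameter names in the text resemble the keyword arguments of scikit-learn's `RandomForestRegressor`, but the source does not name its library, so the keyword spelling below is an assumption; values not stated in the text (the number of trees and the minimum samples per leaf) are deliberately left out. Recent scikit-learn releases no longer accept `"auto"` for `max_features`; for regression it meant using all features, written here as `1.0`. The recurrent part is a plain-Python illustration of a stacked ReLU recurrence with a small hidden size and random, untrained weights; it is not the authors' network.

```python
import random

# Random Forest settings from the text, in scikit-learn-style keywords
# (assumed naming; n_estimators and min_samples_leaf are not given).
RF_KWARGS = {
    "max_features": 1.0,     # all features; older releases spelled this "auto"
    "max_depth": None,       # expand until leaves are pure or too small to split
    "min_samples_split": 2,
    "max_leaf_nodes": None,  # no cap on the number of leaves
}

# RNN settings stated in the text.
RNN_SETTINGS = {"hidden_units": 300, "num_layers": 2, "learning_rate": 1e-3}


def relu(x):
    """Rectified linear unit: f(x) = max(x, 0)."""
    return max(x, 0.0)


def init_layer(n_in, n_hidden, rng, scale=0.1):
    """Small random weights for one recurrent layer (illustrative only)."""
    return {
        "w_in": [[rng.uniform(-scale, scale) for _ in range(n_in)]
                 for _ in range(n_hidden)],
        "w_rec": [[rng.uniform(-scale, scale) for _ in range(n_hidden)]
                  for _ in range(n_hidden)],
        "bias": [0.0] * n_hidden,
    }


def rnn_step(layer, x, h_prev):
    """One time step: h_t = relu(W_in x_t + W_rec h_{t-1} + b)."""
    h = []
    for i, b in enumerate(layer["bias"]):
        z = b
        z += sum(w * v for w, v in zip(layer["w_in"][i], x))
        z += sum(w * v for w, v in zip(layer["w_rec"][i], h_prev))
        h.append(relu(z))
    return h


def run_stacked_rnn(layers, sequence):
    """Feed a sequence of input vectors through stacked recurrent layers."""
    states = [[0.0] * len(layer["bias"]) for layer in layers]
    for x in sequence:
        layer_input = x
        for k, layer in enumerate(layers):
            states[k] = rnn_step(layer, layer_input, states[k])
            layer_input = states[k]
    return states[-1]  # final hidden state of the top layer


rng = random.Random(0)
n_hidden = 8  # kept small for readability; the text uses 300
layers = [init_layer(3, n_hidden, rng), init_layer(n_hidden, n_hidden, rng)]
# Three hypothetical hourly (SO2, NO2, CO) inputs, already scaled.
hours = [[0.2, 0.5, 0.8], [0.3, 0.4, 0.9], [0.1, 0.6, 0.7]]
top_state = run_stacked_rnn(layers, hours)
```

The hidden state carried from hour to hour is what lets a recurrent model represent accumulation and deposition; a trained model would add an output layer mapping `top_state` to PM10.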

Results and discussion
Feature importance of PM10. Feature Importance (FI), calculated by Random Forest, quantifies the significance of each input in shaping the output. The higher the score an input gets, the more significant that input is to the output. The hourly meteorological conditions and air pollutants in wintertime over five winters (December 2014 to February 2019) were used as inputs to calculate the long-term FI of PM10, shown in Fig. 1. First and foremost, Fig. 1 quantitatively demonstrates that gaseous air pollutants (SO2, NO2, O3 and CO) were more significant than the meteorological conditions in shaping PM10, as the FI of the gaseous air pollutants outscored that of the meteorological conditions combined. SO2 and NO2 were positively correlated with PM10, because they were the precursors of sulfate and nitrate, the main components of PM10 [27]. Tropospheric O3 in the surface air and PM10 were negatively associated, because PM10 is a promoter that speeds up the aerosol sink of hydroperoxy radicals [28]. The strongly positive association between CO and PM10 arose because they were emitted from the same sources, such as coal-based domestic heating and traffic. Possible chemical links between CO and PM10 need further investigation. In Beijing and Tianjin of BTH, the influence of CO on PM10 was far greater than that of the other gaseous air pollutants, and NO2 contributed more than SO2 to PM10. In Shanghai, Nanjing and Hangzhou of YRD, SO2 played a more crucial role than NO2 in reproducing PM10. The influence of CO on PM10 was also predominant in YRD but less critical than in BTH. In Guangzhou and Shenzhen of PRD, NO2 and SO2 had higher FI than CO, revealing a pattern of PM10 in stark contrast with BTH and YRD. In SCB, CO and NO2 had the highest FI in Chengdu and Chongqing, respectively.
Therefore, the spatial heterogeneity of regional PM10 in China is corroborated. We then calculate the annual FI for PM10 in the aforementioned nine cities, shown in Table 1. Despite fluctuations of FI in some years, the results are consistent for wintertime PM10 within a city. Because CO is associated with incomplete combustion in coal-based household heating while NO2 is mainly emitted by motor vehicles, curbing coal-based household heating in BTH and YRD and controlling vehicle emissions in PRD and SCB are the best ways to lower PM10.
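To make the idea of an impurity-based importance score concrete, the sketch below computes, for a single tree node, how much each input can reduce the variance of PM10 with one threshold split, and then normalizes the reductions so they sum to one. This is only a toy illustration of the principle: a Random Forest accumulates such reductions over every split of every tree, and the function names and data here are hypothetical, not taken from the study.

```python
def variance(values):
    """Population variance, used as the impurity of a regression node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_impurity_decrease(feature, target):
    """Largest variance reduction from one threshold split on `feature`."""
    n = len(target)
    parent = variance(target)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold,
    # so both children are always non-empty.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        children = (len(left) / n) * variance(left) \
            + (len(right) / n) * variance(right)
        best = max(best, parent - children)
    return best


def normalized_scores(features, target):
    """Impurity decreases rescaled to sum to one, like an importance score."""
    raw = {name: best_impurity_decrease(col, target)
           for name, col in features.items()}
    total = sum(raw.values())
    return {name: score / total for name, score in raw.items()}


# Hypothetical hourly samples: CO tracks PM10 closely, wind speed does not.
features = {
    "CO": [0.5, 0.6, 1.2, 1.4, 2.0, 2.2],
    "wind_speed": [3.0, 1.0, 2.5, 1.5, 2.0, 3.5],
}
pm10 = [40.0, 45.0, 90.0, 100.0, 150.0, 160.0]
scores = normalized_scores(features, pm10)
```

In this toy data the input that orders the samples the same way as PM10 receives the larger share of the score, which is the sense in which a higher FI marks a more influential input.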
Prediction of PM10 using SO2, NO2 and CO as inputs. Because gaseous air pollutants (SO2, NO2 and CO) play the leading roles in shaping PM10, they are used to predict hourly PM10 without meteorological inputs. The training period is December and February, and the testing period is January (Scenario one). Training and testing data are from the same city. The Pearson correlation coefficient (R) and the Root Mean Square Error (RMSE) are used as two statistical indicators to evaluate the performance of RF and RNN, and the results are shown in Table 2 and Fig. 2. As Table 2 indicates, both RF and RNN show good accuracy in simulating hourly PM10 with only three gaseous air pollutants as inputs. In most cases, R between the hourly observed and the RF/RNN-simulated data is larger than 0.8. RNN is suited to time series, as it recursively links the data in the direction of sequence evolution. In this case, however, RNN does not outperform Random Forest in all nine cities, which signals that PM10 was not strongly linked to the time series at a one-hour interval. This suggests that the concentration of PM10 at a given time point is more closely related to the gaseous air pollutants at the same time than to their levels one hour earlier. Also, when the gaseous air pollutants at timestamp T-1 are used as inputs, RF and RNN predict PM10 at timestamp T slightly worse than when the gaseous air pollutants at timestamp T are used. Moreover, the Pearson correlation coefficient between PM10 at timestamp T and the concomitant gaseous pollutants at timestamp T is greater than that between PM10 at timestamp T and the gaseous pollutants at timestamp T-1.
These findings not only indicate that PM10 and gaseous air pollutants were in thermodynamic equilibrium, but also imply that the formation and deposition of PM10 tended to occur in less than one hour. Furthermore, when training data and testing data are taken from different cities, the prediction accuracy is reduced, implying that every city had its own unique pattern of PM10.
Thermodynamic equilibrium between gaseous air pollutants and PM10. As Fig. 2 and Table 3 show, both RF and RNN consistently underestimate PM10 in all nine cities in Scenario one. In Scenario two, by contrast, the testing period is the hourly PM10 of one day in January 2019 and the training period is the hourly PM10 of the remaining days in January 2019. Training and testing data are from the same city. The inputs are again SO2, NO2 and CO. The results are given in Fig. 3. As Fig. 3 shows, the underestimation does not occur in Scenario two. In addition, we use the gaseous pollutants in January 2018 and in December 2017/February 2018 as inputs to train RF and RNN, respectively. The results are similar: the predictions of PM10 in January 2019 obtained with the January 2018 training data are higher than those obtained with the February 2018 training data. The simulations of RF and RNN both underestimate the PM10 level in all nine cities when the data from December 2017 and February 2018 are used for training, similar to Scenario one, indicating that this is a ubiquitous phenomenon.
Two underlying causes account for this. The major reason is that the chemical processes by which sulfur dioxide forms sulfate and nitrogen dioxide forms nitrate are exothermic. Since the temperature in January is lower than that in December and February, the thermodynamic equilibrium shifts lopsidedly in favor of augmenting PM10 in January. Moreover, indigenous flora plays an important role in the removal of PM10 [29-32]. As the leaf area index dwindles and the metabolism of trees slows down with decreasing temperature, the change in the phenology of indigenous plants is the minor reason for the severity of PM10 in wintertime.
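The temperature argument above can be made explicit with the van't Hoff relation, a standard thermodynamic result rather than something derived in this work. Here K is the equilibrium constant of a generic gas-to-particle reaction, ΔH° its standard reaction enthalpy (assumed constant over the winter temperature range), T the absolute temperature, and R_g the gas constant (written R_g to avoid confusion with the Pearson coefficient R). The labels T_Jan and T_Dec are illustrative and stand for representative January and December temperatures.

```latex
\[
\frac{d \ln K}{dT} = \frac{\Delta H^{\circ}}{R_g T^{2}}
\qquad\Longrightarrow\qquad
\ln\frac{K(T_{\mathrm{Jan}})}{K(T_{\mathrm{Dec}})}
  = -\frac{\Delta H^{\circ}}{R_g}
    \left(\frac{1}{T_{\mathrm{Jan}}} - \frac{1}{T_{\mathrm{Dec}}}\right)
\]
```

For an exothermic reaction ΔH° < 0, and T_Jan < T_Dec gives 1/T_Jan - 1/T_Dec > 0, so the right-hand side is positive and K(T_Jan) > K(T_Dec): the colder month favors the particulate products, which is consistent with a model trained on December and February underestimating January PM10.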
Insignificance of long-range transport. This work is partially motivated by the heated debate in several previous studies [33][34][35][36][37]. Guo et al. [33] inferred that primary emissions and regional transport of PM in Beijing were insignificant in spawning haze. Li et al. [34] disagreed with Guo et al. [33], insisting that long-range transport was the major cause of severe haze in Beijing. Zhang et al. [35] contended that the back trajectory analysis by Li et al. [34] was unsuitable for urban-scale investigations and that polluted periods in Beijing were typically linked to stagnant conditions with weak and variable winds. Cao and Zhang [36] criticized Guo et al. [33] for ignoring nonfossil emission sources, such as biomass burning, cooking, and biogenic emissions. Zhang et al. [37] [38], when the horizontal transport of air pollutants exceeds 300 km, it is considered long-distance transport. Machine learning can provide an assessment of this argument. The development of haze can be ascribed to a build-up of gaseous precursors, an increase in primary emissions, or long-range transport. The lifespans of SO2 and NOx are short [33]. The gaseous air pollutants and solid PM10 have different physical characteristics, making them unlikely to be transported together over a long distance. Hence, our criterion for judging the causes of the rises and falls of the PM10 level is as follows: when gaseous air pollutants (SO2, NO2 and CO) are used as inputs, if RF and RNN capture the peaks, the high episodes were induced by an increase in secondary inorganic aerosols or a change in primary sources; otherwise, they were caused by long-range transport.
The average of the monthly average discrepancy between simulation and observation in Scenario two is less than 15% of the observation. Hence, RF and RNN capture the fluctuations of PM10 using only gaseous air pollutants as inputs, indicating the insignificance of long-range transport. In urban areas of China, fugitive dust from roads, construction sites, and unpaved soil sources normally accounts for 30%-50% of PM10, which is referred to as primary PM10 [39]. CO is a plausible indicator of primary PM10. Sporadic sandstorms may induce the long-range transport of PM10 from the far-flung deserts in northwestern China [40]. RF and RNN capture all the fluctuations of PM10 using gaseous air pollutants as inputs, indicating that long-range transport induced by sporadic sandstorms did not occur in January 2019. Thus, our results support the viewpoint of Guo et al. [33].
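The evaluation statistics used in this section (Pearson R, RMSE, and the discrepancy relative to observation) can be written out as follows. This is a minimal sketch with hypothetical function names and made-up numbers; in particular, the relative discrepancy is computed here as the mean absolute difference divided by the mean observation, which is only one possible reading of the 15% figure quoted above.

```python
import math


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated series."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    ss_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(ss_o * ss_s)


def rmse(obs, sim):
    """Root mean square error of the simulation."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def relative_discrepancy(obs, sim):
    """Mean absolute difference as a fraction of the mean observation."""
    mean_abs_diff = sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)
    return mean_abs_diff / (sum(obs) / len(obs))


def peaks_captured(obs, sim, tolerance=0.15):
    """Assumed decision rule: simulation stays within `tolerance` of observation."""
    return relative_discrepancy(obs, sim) < tolerance


# Hypothetical hourly PM10 values (ug/m3): observed vs. model output.
observed = [80.0, 120.0, 200.0, 150.0, 90.0]
simulated = [75.0, 110.0, 185.0, 160.0, 95.0]

r_value = pearson_r(observed, simulated)
error = rmse(observed, simulated)
captured = peaks_captured(observed, simulated)
```

Under the criterion stated earlier, a series for which `captured` is true would be attributed to local secondary formation or primary sources rather than to long-range transport.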

Conclusion
Air pollution has become a hot-button issue in China in recent years. In this work, we take a closer look at PM10 and draw the following conclusions. We find that PM10 was more strongly correlated with the gaseous air pollutants (SO2, NO2 and CO) than with meteorological conditions. The spatial heterogeneity and temporal homogeneity of PM10 in China are quantitatively documented, signifying each city had its own unique
Table 1. FI of wintertime PM10 in nine regional core cities in Scenario one.