Introduction

Air pollution is a major environmental issue faced by heavily polluted regions around the world, including Central and South-Eastern Asia1,2,3. Reducing the number of premature deaths caused by air pollution has been identified as one of the United Nations’ Sustainable Development Goals4. The new air quality guideline set by the World Health Organization has revised the annual concentration of fine particulate matter (PM2.5) from 10 µg m−3 to 5 µg m−3 (ref.5), which requires further tightening of the measures for air pollution prevention and control6. Long-term observations of air pollutants capturing changes in air pollution can be used to evaluate the effectiveness of air quality policies7,8,9,10. The changes in air pollutant concentrations, however, are impacted both by emissions11,12,13 and meteorological conditions13,14,15,16. Using the observed air pollutant concentrations without consideration of meteorological impacts to directly evaluate the effectiveness of measures has been questioned17,18. Therefore, assessing the effectiveness of air quality policies needs to decouple the impacts of emissions and meteorology on air pollutant variations.

Generally, there are three methods to estimate meteorology-normalized air pollutant variations (Supplementary Table 1). One is to use the chemistry transport models (CTMs) such as the Weather Research and Forecasting Model-Community Multiscale Air Quality Model (WRF-CMAQ) and GEOS-Chem (GC). Zhang et al.11 reported that the decrease in PM2.5 in China was predominantly attributed to anthropogenic emissions abatement during 2013–2017 using WRF-CMAQ. With GEOS-Chem, Qiu et al.19 quantified the emission-driven trends of PM2.5 and found a substantial reduction of PM2.5 concentration in eastern and central China from 2013 to 2017. Due to the inherent assumptions, parameterizations, and simplifications of processes in CTMs20,21, and large uncertainties in emission inventories22,23, CTM outputs are subject to large uncertainty24. One alternative method is the traditional statistical method (TSM), such as multiple linear regression (MLR) and Kolmogorov–Zurbenko (KZ) filter. The MLR is widely used to separate the contributions of emissions and meteorology to variations of PM2.525,26,27 and ozone (O3)28,29,30. The KZ filter developed by Rao and Zurbenko (1994) was first used to detect and track changes in O3 in the US31. Since then, the KZ filter has been used in determining long-term trends of other air pollutants32,33,34. The other method is machine-learning (ML), a branch of statistical methods35. For instance, Grange et al.36 developed a method for weather normalization of inhalable particulate matter (PM10) by random forest (RF). Since then, the ML methods have been widely used especially during COVID-19 lockdowns37,38,39. Zheng et al.37 found substantial reductions in air pollutant concentrations due to emission reductions during the lockdown period in Wuhan by the RF model. By the same method, Shi et al.38 found abrupt but smaller-than-expected changes in surface air pollutant concentrations during COVID-19 in 11 cities globally. 
The ML methods are also used to assess the impacts of clean air actions on air pollutants. For instance, Vu et al.40 used the RF model to assess the impacts of clean air action on air pollutant trends in Beijing between 2013 and 2017. Similarly, Dai et al.41 answered the question of whether the Three-Year Action Plan improved the air quality in the Fenwei Plain of China by the RF model. Despite the wide adoption of traditional statistical and ML methods, the results from these two methods are always suspect due to their shortcomings in not considering the physical and chemical processes of air pollutants during their atmospheric lifetime.

Because running CTMs is demanding (e.g., it requires air pollutant emission inventories, computing resources, and trained researchers), the application of CTMs is limited. As an alternative, TSM and ML have been widely used to normalize the effect of weather on air pollutants. It should be noted that none of the existing methods is perfect in decoupling the impacts of emissions and meteorology on air pollutant variations42. The performance and comparability of different methods should therefore be assessed. Intra-comparisons among TSM41,42,43,44 and intercomparisons among TSM, ML, and CTMs40,45,46,47, however, are rarely reported39,40. One of the biggest challenges is the lack of simultaneous CTM results as a reference. CTM simulations usually focus on the first and last years of the study period or on specific months within each year, while TSM and ML make use of the entire study period. Such differences in the study period can introduce bias into the intercomparison. Investigating the performance and bias of different methods in decoupling the impacts of emissions and meteorology on air pollutant observations would increase confidence in using these methods.

The notable air quality improvement in China from 2013 to 2017 has been acknowledged11, which provides an opportunity to assess the performances of different methods in separating PM2.5 variation drivers. The aims of this study are (1) assessing the differences in model performance of TSM, ML, and CTM methods in decoupling the impacts of meteorology and emissions on PM2.5 and (2) comparing the trends (including emission-related and meteorology-related) of PM2.5 and the bias of trends from statistical methods with the CTM result as a reference. The resources needed in different methods and three key factors that have impacts on weather normalization using the ML are discussed finally. This study would be beneficial to select a suitable method for investigating the long-term variations of aerosol compositions.

Results

Performances of different models to reproduce PM2.5 observation

Figure 1 shows the average values of statistical metrics between the observed and predicted PM2.5 concentrations from different methods (the method-specific statistical metrics are shown in Supplementary Figs. 1–3). Overall, the metrics derived from the six methods were averaged (mean value ± standard deviation and hereafter) in the range of 0.55 ± 0.41 to 0.94 ± 0.04 for r, 16.2 ± 28.0% to 28.7 ± 44.6% for NMB, and 0.31 ± 0.60 to 0.83 ± 0.07 for index of agreement (IOA), respectively. It should be noted that the temporal resolution of data to calculate the statistics in Fig. 1 was monthly for CMAQ and GC, daily for MLR and KZ, and hourly for RF and extreme gradient boosting (XGB). If the temporal resolution of the data to calculate the statistical metrics for KZ, MLR, RF, and XGB was scaled to monthly, the TSM and ML showed even better performance to reproduce the observations (Supplementary Fig. 4). For instance, r values produced by MLR and RF significantly increased from 0.79 ± 0.04 to 0.85 ± 0.04 and 0.94 ± 0.02 to 1.0 ± 0.01, respectively, at the 0.001 level.

Fig. 1: Spatial distributions and boxplots of statistical metrics for model evaluation.

a–c Spatial distributions in the average values of Pearson correlation coefficient (r), normalized mean bias (NMB), and index of agreement (IOA) derived from six methods. d–f Boxplots of r, NMB, and IOA for each method. The color and size of dots in the top panels are mapped to the mean values and standard deviations of statistics calculated from the six methods. The gray (“criteria”) and black (“goal”) horizontal dashed lines in the bottom panels represent the recommended benchmarks for model performance evaluation suggested by Emery et al.48. It should be noted that the hyperparameters for RF and XGB are tuned here for model evaluation.

According to the “criteria” values of r greater than 0.4 and NMB within ±30% for 24-h averaged PM2.5 evaluation48, the MLR and KZ methods achieved acceptable performance in all cities. Most cities (71 of 74 sites for MLR) were close to the “goal” of evaluation, with r greater than 0.7 and NMB within ±10%, for the statistical models. Similarly, the level of accuracy for the RF and XGB models was considered to be close to the best a model can be expected to achieve, because the r and NMB values calculated from hourly data already met the “goal” thresholds (Fig. 1d, e), and the values calculated from daily data would be expected to do so as well. The r and NMB values for the CMAQ and GC models calculated from monthly data met the “criteria” of model evaluation for 47 and 63 cities, respectively. If the temporal resolution of the data for CTM evaluation were changed to daily, the performance of CMAQ and GC would decline.
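
The three evaluation metrics used above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names are hypothetical, NMB is taken as the sum of (prediction − observation) divided by the sum of observations, and IOA is assumed to follow the commonly used Willmott form.

```python
import math


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    return cov / (sd_obs * sd_pred)


def nmb_percent(obs, pred):
    """Normalized mean bias in percent (positive means overprediction)."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)


def index_of_agreement(obs, pred):
    """Index of agreement, assuming the Willmott formulation (1 is perfect)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den


# Hypothetical daily PM2.5 values (µg m-3), for illustration only.
observed = [35.0, 60.0, 48.0, 90.0, 72.0]
predicted = [38.0, 55.0, 50.0, 84.0, 75.0]
metrics = (pearson_r(observed, predicted),
           nmb_percent(observed, predicted),
           index_of_agreement(observed, predicted))
```

The resulting r and NMB values can then be compared with the “criteria” and “goal” benchmarks cited in the text.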

In terms of different methods to reproduce PM2.5 variations, the ML methods showed higher r and IOA values, with lower RMSE values. The CTM, however, showed lower r and IOA values and higher RMSE compared to TSM and ML (Fig. 1d, f). For instance, the IOA values from different methods ranked as XGB (0.97 ± 0.01) > RF (0.96 ± 0.01) > KZ (0.76 ± 0.04) > MLR (0.74 ± 0.04) > GC (0.62 ± 0.19) > CMAQ (0.54 ± 0.26). A literature review in Supplementary Table 1 also showed a better performance of TSM and ML than CTM in reproducing the air pollutant concentrations. For instance, the correlation coefficient of the linear regression between the monthly observations and simulations of PM2.5 showed a higher value for the RF (r2 = 0.99) model than CMAQ (r2 = 0.82) in Beijing40. These statistical metrics for MLR, KZ, RF, and XGB models indicated that TSM and ML can capture the spatial-temporal variations of PM2.5 in this study.

Comparison in trends of PM2.5 from different methods

Supplementary Fig. 5 shows the time series of scaled PM2.5 concentrations derived from the six methods. Generally, all of the 74 cities showed a decreasing trend of PM2.5 with contributions from both emission-related and meteorology-related trends (Fig. 2). The trends of 74 cities were averaged as −11.8 ± 2.69 µg m−3 yr−1 ~ −0.37 ± 0.36 µg m−3 yr−1 for \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\), −10.3 ± 2.66 µg m−3 yr−1 ~ −0.27 ± 0.93 µg m−3 yr−1 for \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\), and −2.03 ± 0.80 µg m−3 yr−1 ~ 0.33 ± 1.20 µg m−3 yr−1 for \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\), respectively, from six methods. The high standard deviation of trends suggested the spatial heterogeneity in PM2.5 reduction during 2013–2017 in China (Supplementary Table 2). The high standard deviation of mean trends for \({{\rm{PM}}}_{2.5}\) in Fig. 2a–c was also related to the model differences (Supplementary Fig. 6). \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) calculated from CTMs had an insignificant (p > 0.05) difference between CMAQ and GC. Similarly, \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) calculated from TSM (e.g., KZ and MLR) showed no statistical difference, and the same result for the ML (RF vs. XGB) (Supplementary Fig. 7). The trends of \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) from CTM, TSM, and ML, however, showed significant differences. The trend of \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) derived from CMAQ (−4.09 ± 2.44 µg m−3 yr−1) was significantly higher (less negative) than MLR (−4.97 ± 2.87 µg m−3 yr−1), RF (−5.23 ± 2.96 µg m−3 yr−1), and XGB (−5.23 ± 2.96 µg m−3 yr−1) at the 0.05 level (Fig. 2d). For \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) (Fig. 2e), the intra-comparison of trends within CTM, TSM, and ML showed no differences (p > 0.05). 
Intercomparison of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) also showed insignificant (p > 0.05) differences between CMAQ (−3.98 ± 2.19 µg m−3 yr−1), KZ (−3.29 ± 2.30 µg m−3 yr−1), MLR (−3.84 ± 2.54 µg m−3 yr−1), RF (−4.84 ± 2.79 µg m−3 yr−1), and XGB (−4.80 ± 2.78 µg m−3 yr−1). For \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) (Fig. 2f), trends from CTM and ML showed insignificant differences, while the trends from TSM were significantly lower than those from the other methods. The absence of significant differences between the trends in \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from the TSM, ML, and CMAQ models suggests that the lack of physical-chemical mechanisms in TSM and ML was not important, at least for revealing the emission-related trends of PM2.5 on the national scale.

Fig. 2: Spatial distributions and boxplots of PM2.5 trends from different methods.

a–c Spatial distributions in the average values of trends for PM2.5 observation (\({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\)), emission-related PM2.5 (\({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\)), and meteorology-related PM2.5 (\({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\)) derived from six methods. d–f Boxplots of \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\), \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\), and \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) trends for each method. The meteorological conditions resampling strategy for the RF and XGB was from Grange et al.36. The color and size of dots in the top panels are mapped to the mean values and standard deviations of trends calculated from six methods. The marks in the bottom panels represent the differences between two paired methods; NS. and * indicate that the differences are not significant and significant at the 0.05 level, respectively.

Contributions of emission and meteorology to PM2.5 trend by different methods

Using the scatterplot between \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\), \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\), and \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\), the relative contributions of emissions and meteorology to the variations of PM2.5 were quantified (Supplementary Fig. 8). A contribution of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) to \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) less than 100% indicates that the inter-annual variations of meteorology contribute to the reduction of PM2.5. On the contrary, a percentage of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) greater than 100% suggests the inter-annual variations of meteorology offset the reduction of PM2.5 from emission variations. On the national scale, the decrease in PM2.5 from 2013 to 2017 in China was dominated by emission reductions with contributions of 78.9% (KZ) ~90.5% (RF) according to the six modeling results (Supplementary Table 3). The comparable results between TSM, ML, and CTM suggested their ability to determine the dominant factor to variations of PM2.5 at a large spatial scale.
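
The interpretation of the contribution percentages can be illustrated with a short sketch. It assumes, as a simplification, that the contribution is the ratio of the emission-related trend to the observed trend and that the two components sum to the observed trend; the study itself derives the contributions from scatterplots (Supplementary Fig. 8). The function name and the numbers are hypothetical.

```python
def emission_contribution_percent(emi_trend, obs_trend):
    """Share of the observed trend explained by the emission-related trend (%).

    Simplifying assumption: obs_trend = emi_trend + met_trend.
    """
    return 100.0 * emi_trend / obs_trend


# Hypothetical trends in µg m-3 yr-1, for illustration only.
# Case 1: meteorology also lowers PM2.5 (met trend = -1.0), so the share is below 100%.
helped = emission_contribution_percent(emi_trend=-4.0, obs_trend=-5.0)

# Case 2: meteorology offsets part of the reduction (met trend = +1.0),
# so the emission-related share exceeds 100%.
offset = emission_contribution_percent(emi_trend=-6.0, obs_trend=-5.0)
```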

The estimated contributions of emissions and meteorology to variations in PM2.5 by different methods, however, showed a regional difference (Supplementary Table 3). For instance, the relative contributions of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) to \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) calculated from CTM and ML were higher than 100% in YRD, suggesting the negative role of meteorology on PM2.5 reduction from 2013 to 2017 (Supplementary Fig. 6). The percentages of \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) to \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) from TSM, however, suggested the meteorology variations contributed to the reduction of the observed PM2.5 in the YRD region, similar with previous studies with CTM adopted in YRD49,50. The opposite role of meteorology to the variation of PM2.5 derived from different methods here and previous studies suggested that none of the existing methods can perfectly decouple the effects of emissions and meteorology on the trends of air pollutant concentrations. The different methods demonstrated comparable results in quantifying the influence of meteorological factors on PM2.5 variations at the national scale, whereas differences were observed at a regional scale. Therefore, results from multiple methods (linear/non-linear) should be cross-checked to carefully evaluate the impacts of policies or interventions on regional air pollutant concentrations.

Bias in trends of PM2.5 from different methods compared to CMAQ

With an assumption that the emission constant sensitivity simulation of a CTM (e.g., CMAQ in this study) produced a conceptual minimum of estimation error19, the biases in trends (defined as 100% × (1 – the slope of a linear regression between CMAQ and other methods)) from the other five methods relative to CMAQ were calculated (Supplementary Fig. 9). \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) trends calculated from KZ and MLR were underestimated by 7% and 3%, while the trends from the other three methods were unbiased. Compared to \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from CMAQ, trends from the other five methods showed underestimation with KZ underestimated most by 23%, followed by MLR (13%), XGB (3%), RF (2.8%), and GC (2.4%). Trends of meteorology-related PM2.5 calculated from statistical and machine learning methods were highly biased with underestimation of 79%, 66%, 30%, and 28% for KZ, MLR, RF, and XGB, respectively. The bias of \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) trend from GC, however, was overestimated by 6%.
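
The bias definition above, 100% × (1 − slope), can be sketched as follows. This is an illustration under stated assumptions: the regression is ordinary least squares with an intercept, the CMAQ trends are placed on the x-axis, and the function names and city-level trends are hypothetical.

```python
def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def trend_bias_percent(reference_trends, method_trends):
    """Bias (%) of a method relative to the reference: 100 * (1 - slope).

    A positive value means the method underestimates the reference trends.
    """
    return 100.0 * (1.0 - ols_slope(reference_trends, method_trends))


# Hypothetical city-level trends (µg m-3 yr-1), for illustration only.
cmaq_trends = [-2.0, -3.5, -5.0, -6.5, -8.0]
other_trends = [-1.7, -3.1, -4.2, -5.6, -6.9]
bias = trend_bias_percent(cmaq_trends, other_trends)
```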

The higher biases in trends of PM2.5 from TSM were related to how well the models reproduced the relationship between PM2.5 and meteorological variables. Specifically, \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) was calculated from the residuals of linear fitting models in the KZ and MLR methods (see Supplementary Methods for details), so the higher the residual or the lower the slope of the fitting in the KZ and MLR methods, the lower the bias in \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) (Supplementary Fig. 10a, b). The higher model performance of the KZ filter compared to MLR (e.g., r values of 0.85 ± 0.05 for the KZ filter vs. 0.79 ± 0.04 for MLR) in reproducing the relationship between meteorological variables and PM2.5 can explain the larger bias in the emission-related trend of PM2.5 from the KZ filter. A sensitivity study using the RF instead of the MLR model to build the relationship between the baseline component of PM2.5 and meteorological factors in the KZ method further indicated a higher bias from fitting by the RF model, which showed a higher slope and lower residuals (Supplementary Fig. 10c). The lower biases in PM2.5 trends from ML methods were possibly related to the inclusion of temporal variables (proxies of emission) in model training. The sensitivity analysis of bias in \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from the RF model showed a larger bias when the temporal variables were not included in model training (Supplementary Fig. 10d).

The biases of different methods were also related to the inherent uncertainties of CMAQ, which originated from uncertainties in air pollutant emission inventories and incomplete physical-chemical mechanisms. For instance, when the results from TSM and ML methods were used as the references, higher biases of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) and \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) were obtained for CTM (Supplementary Fig. 11). Additionally, these statistical methods assume that the variation of an air pollutant is a linear sum of meteorological and emission changes18,42, and therefore that the influences of meteorological and emission changes on air pollutants can be cleanly separated from each other19. In practice, the impacts of meteorological variation may not be distinguishable from air pollutant trends driven by emission changes because of their interactions19. Nevertheless, the ML methods performed more robustly than TSM in obtaining the weather-normalized trends of PM2.5. Similar findings were also reported in previous studies, e.g., the widely used MLR did not perform well in correcting for emission-related and meteorology-related trends of air pollutants19.

Discussion

The required input datasets, advantages, disadvantages, biases in trends, and scopes of applications for different methods are summarized in Table 1. Compared to CTMs, the advantages of TSM and ML in weather normalization of air pollutants are that they require fewer input datasets and run faster with fewer computational resources (see Supplementary Note 1 and Supplementary Fig. 12 for details). The lower resource requirements mean that TSM and ML can be run on personal computers, indicating their wide potential applications. Although TSM and ML do not consider physical-chemical processes, this limitation is not significant in capturing the trend of PM2.5, as shown above. Between TSM and ML, TSM has to address assumptions such as sample normality, homoscedasticity, independence, strict adherence to parametric requirements, and interaction effects among variables51. ML is non-parametric and has the critical advantage of not needing to address many of the assumptions required for statistical methods36. Considering the application conditions, the balance between model performance and required resources, and the biases in normalizing the impacts of weather on PM2.5, machine learning methods are recommended.

Table 1 Summary of different models in weather normalization of PM2.5.

To better apply ML methods in the weather normalization of air pollutants, three influencing factors have been discussed in this study. Parameter setting is crucial in the ML methods to achieve optimal learning capacity during the training process and to achieve the best prediction performance during the testing stage39,52. The reported papers usually adopt the fixed parameters for the RF model training10,36,37,38,40,53. As shown in Fig. 3a, r and IOA calculated from the RF model with parameters tuned significantly increased compared to the RF model without parameters tuned. Trends of \({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\) and \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from the tuned and untuned RF models showed insignificant differences, while the trends of \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) showed a significant difference with a higher reduction rate from the tuned RF model (Fig. 3b). Compared to the results from CMAQ, the tuned RF model can reduce the bias of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trend by 9% from 12% to 3% (Fig. 3c). The bias in meteorology-related PM2.5 trend for the tuned RF model also reduced by 12% from 41% to 29% compared to the untuned RF model (Supplementary Fig. 13). The bias in RF model with GC as a reference also verified the improvement of weather normalization of PM2.5 by the tuned RF model (Supplementary Fig. 13). Therefore, the parameters for the ML methods are recommended to be optimized before application.

Fig. 3: Comparison in results derived from the random forest model with hyperparameters tuned and untuned.

a Boxplots of statistics derived from the tuned RF model (tu) and untuned RF model (ut). b Boxplots of the trends for PM2.5 observation (\({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\)), emission-related PM2.5 (\({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\)), and meteorology-related PM2.5 (\({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\)); c Scatterplot between emission-related trends of PM2.5 derived from RF and CMAQ models. NS. and *** represent the differences between the two groups that are not significant (p > 0.05) and are significant at 0.001 levels.

The meteorology resampling strategies adopted by the ML also influence the weather normalization result54. The widely reported meteorology resampling strategies include the methods developed by Grange et al.36 and Vu et al.40 (denoted as “G” and “V”, respectively, hereafter). These two strategies are not well suited to comparison with trends from the CTMs: the meteorological factors in a CTM sensitivity simulation are fixed at a specific year, while the meteorological variables in methods G and V are randomly sampled from the entire study period. We developed a resampling strategy (denoted as the “M” method hereafter, see Supplementary Note 2 for details), which resampled the meteorological variables from the year that was used for the CTM sensitivity simulation, e.g., 2017 in this study. As shown in Fig. 4a, changes in \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends (calculated as \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends from different resampling strategies − \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trend from CMAQ) showed insignificant differences among different strategies. Relative to the CMAQ reference result, the \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends from the different strategies were underestimated by 2.22% (V30) to 18.5% (M). The insignificant differences and low biases of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends from different resampling strategies compared to CMAQ indicated that these strategies can all produce reasonable emission-related trends of PM2.5. Unlike the emission-related trends, which were insensitive to the resampling strategy, the meteorology-related PM2.5 trends were more sensitive to resampling strategies (Fig. 4b). For instance, trends of \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) with G, V5, and V30 strategies were lower than those from CMAQ with a bias of 27.7%, 40.2%, and 45.8%, respectively.
Given the insignificant differences, the low bias of trends, and the easy and fast calculation of the meteorology resampling strategy developed by Grange et al.36, this strategy is recommended for weather normalization of air pollutants.
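
The difference between sampling from the entire study period and sampling from a fixed reference year can be sketched as a choice of meteorological data pool. This is only a schematic: the record layout and function name are hypothetical, and the additional details of the G, V, and M strategies (Supplementary Note 2) are not reproduced here.

```python
def build_met_pool(records, reference_year=None):
    """Select the meteorological records available for resampling.

    records: list of (year, met_values) tuples (hypothetical layout).
    reference_year: None uses the entire study period; an integer restricts
    the pool to that year, mirroring a CTM sensitivity run in which the
    meteorology is fixed at one year.
    """
    if reference_year is None:
        return [met for _, met in records]
    return [met for year, met in records if year == reference_year]


# Hypothetical records: (year, {variable: value}), for illustration only.
records = [
    (2013, {"T2M": 275.1, "BLH": 420.0}),
    (2015, {"T2M": 281.4, "BLH": 610.0}),
    (2017, {"T2M": 279.8, "BLH": 550.0}),
    (2017, {"T2M": 290.2, "BLH": 880.0}),
]
whole_period_pool = build_met_pool(records)
fixed_year_pool = build_met_pool(records, reference_year=2017)
```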

Fig. 4: Comparisons in changes of PM2.5 trends from different meteorology resampling strategies with CMAQ as a reference.

a Emission-related trend of PM2.5. b Meteorology-related trend of PM2.5. The filled colors in boxes represent the bias (%) calculated as 100% × (1 – the slope of a linear regression between the CMAQ and XGB model with different resampling strategies). NS. and * represent the differences between the two groups that are not significant (p > 0.05) and significant at 0.05 levels. The details about the resampling strategy can be found in Supplementary Note 2.

The inclusion or exclusion of the temporal variables (e.g., Julian day and the day of the week in this study) in the prediction process also deserves attention. Previous studies disagree on whether temporal variables should be included. Grange et al.36 recommended the inclusion of temporal variables in weather normalization, and similar research was reported elsewhere41,55,56. In this strategy, both meteorological variables and temporal variables were randomly sampled and used to predict the PM2.5 concentration. In contrast, the exclusion of temporal variables (randomly sampled meteorological variables and fixed temporal variables) in weather normalization has also been adopted40,54. The impact of this factor was examined here using a sensitivity experiment with temporal variables included (RF_wt) and not included (RF_nt). As shown in Fig. 5a, \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from RF_wt decreased continuously, while the time series of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from RF_nt showed periodic decreases from 2013 to 2017. The linear regression between \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends from RF_wt and CMAQ showed a slope approaching 1, which was higher than the slope for the fitting between RF_nt and CMAQ (Fig. 5b). This was because the time series of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from RF_wt agreed more closely with that from CMAQ than the time series from RF_nt did. The temporal variables can be used as proxies for cyclical emission patterns8, and if the temporal variables were randomly sampled in prediction, the signal of emission variations was erased from the normalized time series. As a result, the time series of RF_wt revealed the long-term emissions well54, while the time series of RF_nt was able to characterize both the seasonal and long-term emission trends40,54. In reality, air pollutant emissions have seasonal variations that arise from energy consumption patterns (e.g., heating during the cold season)33,57.
Therefore, the results from the resampling strategy with temporal variables excluded were more reasonable despite it having a higher bias in trend of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) with CMAQ as a reference.

Fig. 5: Comparisons in results of random forest model with temporal variables included and excluded in weather normalization.

a Time series of scaled monthly values in \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) from different methods. b Scatterplot between \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) trends from CMAQ and RF with temporal variables included (RF_wt) and excluded (RF_nt).

Traditional statistical methods and machine learning have been widely used for weather normalization of primary and secondary air pollutants (Supplementary Table 1). These methods may yield different outcomes because the impacts of meteorology on air pollutants vary with the properties of the pollutants19. In particular, separating and quantifying the effects of meteorology on O3 is more challenging due to the complex interaction between meteorology, emissions, and chemical formation58,59. With CTM as a reference, the performance of TSM and ML in weather normalization of other air pollutants should be investigated before their application. With the successful reconstruction of air pollutant datasets derived from satellites60,61,62, CTM63,64, and ground observations63,64,65, long-term and full-coverage datasets have been developed. Coupled with these open-access datasets (e.g., the China High Air Pollutants (CHAP), Track Air Pollution in China (TAP), and MERRA-2), ML methods (e.g., RF and XGB), and the recommendations in this study (e.g., hyperparameter tuning, meteorological condition resampling strategy, and exclusion of temporal variables in resampling), the effectiveness of air pollution prevention and climate change response policies can be evaluated at regional and national scales. These potential evaluations contribute to solving air pollution and fulfilling the United Nations’ Sustainable Development Goals.

Methods

Data sources and preprocessing

Hourly ground observations of PM2.5 during 2013–2017 were from the national air quality monitoring network established and operated by China National Environmental Monitoring Center. 74 key cities (Supplementary Fig. 14) were selected in this study due to their data availability from 2013 to 2017. Data quality control was conducted according to previous studies34,66,67 (see Supplementary Methods for details). Hourly values of meteorological variables including temperature at 2 m (T2M), dewpoint at 2 m (D2M), mean sea-level pressure (MSL), eastward and northward wind components of wind at 10 m (U10, V10), total precipitation (TP), boundary layer height (BLH), total cloud cover (TCC), and surface downward solar radiation (SSR) were obtained from the ERA-5 single-level pressure reanalysis datasets68. The relative humidity (RH) was calculated with T2M and D2M69. The monthly mean concentrations of PM2.5 from 2013 to 2017 by WRF-CMAQ and GEOS-Chem were from Zhang et al.11 and Zhai et al.70, respectively. More details about the input meteorology, emission inventory, and simulation settings (base + sensitivity) by CTM can be found in the references above. The methods for calculating the emission-related and meteorology-related PM2.5 concentrations from CTMs can be found in the Supplementary Methods.

Weather normalization of PM2.5 by TSM and ML

To decouple the effects of meteorology on PM2.5 variations, two traditional statistical methods (MLR, KZ) and machine learning methods (RF and XGB) were adopted using the meteorological variables mentioned above in each city. Specifically, the T2M, MSL, U10, V10, RH, TP, BLH, TCC, and SSR were used to build the MLR and KZ filter models. In addition to these meteorological variables, time variables (Unix time: number of seconds since 1970-1-1, Julian day: day of the year, day of the week) acted as emission proxies8, and clusters of backward trajectories reaching each city acted as transport indicator36,38 were also used in RF and XGB models. For MLR and KZ model building, the daily averages of air pollutants and meteorological variables were used, while the hourly observations were used in RF and XGB models. A flow chart to show the weather normalization of PM2.5 using different methods is shown in Supplementary Fig. 15. The data process, model building, and weather normalization are described below in detail.

MLR

Following previous studies26,28,30 with a slight modification, all nine meteorological variables mentioned above were used to establish the relationship between meteorological factors and PM2.5, instead of employing a stepwise MLR to exclude less important variables. The anomalies of meteorological conditions and PM2.5 were obtained by removing the 5-year mean values of 50-d moving averages from the 10-d mean time series; the anomalies calculated by this method were deseasonalized but not detrended. According to a previous study26, the 50-d moving window was chosen here because the anomalies of PM2.5 and other meteorological variables calculated in this manner were not sensitive to the moving window (Supplementary Fig. 16). The anomalies of PM2.5 and meteorological variables were finally used to build the MLR model. The prediction of MLR was considered as the meteorology-driven PM2.5 concentration and the residuals of fitting were considered as the PM2.5 concentration attributed to emission changes26,30. More details about MLR to separate the meteorology and emission-related PM2.5 concentrations can be found in Supplementary Methods.
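
The separation step described above (MLR prediction as the meteorology-driven part, residual as the emission-related part) can be sketched as follows. This is a minimal illustration, not the code used in this study: the anomalies are assumed to be precomputed, the fit is ordinary least squares with an intercept solved through the normal equations, and all function names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(met_anomalies, pm_anomalies):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in met_anomalies]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, pm_anomalies)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def decompose(met_anomalies, pm_anomalies):
    """Return (meteorology-driven part, emission-related residual)."""
    coef = fit_mlr(met_anomalies, pm_anomalies)
    met_part = [coef[0] + sum(c * v for c, v in zip(coef[1:], row))
                for row in met_anomalies]
    emi_part = [y - m for y, m in zip(pm_anomalies, met_part)]
    return met_part, emi_part


# Hypothetical anomalies: two predictors per time step, for illustration only.
met = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.0, 0.0], [1.5, 0.5]]
pm = [3.2, 3.9, 6.1, 7.8, 1.1, 5.4]
met_part, emi_part = decompose(met, pm)
```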

KZ

The KZ filter, KZ(m, p), applies a moving average of window width m iteratively p times to separate the time series of an air pollutant into different components, e.g., KZ(365, 3) to extract the long-term component31,71,72 and KZ(15, 5) to extract the baseline component (seasonal + long-term components)33. To obtain the long-term component of PM2.5 and its two subcomponents (emission-related and meteorology-related), the baseline components of PM2.5 and the meteorological factors were used to build an MLR model with PM2.5 as the dependent variable. The emission-related concentration was obtained by applying KZ(365, 3) to the residuals of this MLR. The meteorology-related concentration was calculated as the difference between the long-term concentration of PM2.5 and the emission-related concentration. More details about KZ-MLR can be found in Supplementary Methods.
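A minimal sketch of the KZ(m, p) filter is given below, written as p passes of a centred moving average of width m. The treatment of the series edges (averaging over whichever points fall inside a truncated window) and the function names are assumptions made for this example; the split into short-term, seasonal, and long-term parts follows from the definition of the baseline as seasonal + long-term.

```python
def moving_average(x, m):
    """Centred moving average of odd width m; windows are truncated at the edges."""
    half = (m - 1) // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def kz_filter(x, m, p):
    """KZ(m, p): apply the moving average of width m iteratively p times."""
    y = list(x)
    for _ in range(p):
        y = moving_average(y, m)
    return y


def kz_decompose(x):
    """Split a daily series into short-term, seasonal, and long-term parts."""
    baseline = kz_filter(x, 15, 5)    # seasonal + long-term
    long_term = kz_filter(x, 365, 3)  # long-term only
    seasonal = [b - l for b, l in zip(baseline, long_term)]
    short_term = [v - b for v, b in zip(x, baseline)]
    return short_term, seasonal, long_term


smoothed = kz_filter([1.0, 2.0, 3.0, 4.0, 5.0], 3, 1)
```

The three parts returned by `kz_decompose` add up to the original series, so no signal is lost in the separation.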

RF

The meteorological variables, time variables, and clusters of trajectories mentioned above were used to build the RF model. Before model training, the dataset was randomly divided into two subsets with a ratio of 7:3: 70% of the data were used to build the model and the remaining 30% were used to test it. In line with previous studies36,38,53,73,74, the following settings were used to train the RF model: the number of trees (ntree) = 300; the number of variables considered for splitting at each node (mtry) = 3; and the minimum size of terminal nodes (min.node.size) = 5. In addition to these default settings, the parameters were also tuned by random search with 5-fold cross-validation, terminated after 100 evaluations. The 5-fold cross-validation was used to determine the optimal hyperparameter combination75,76. The search space consisted of ntree, mtry, and min.node.size, with ranges of 10–1000, 1–13, and 1–13, respectively. The tuned hyperparameters for the RF model are provided in Supplementary Table 4. After model training, weather normalization was conducted for each observation by randomly sampling the meteorological variables from the meteorological data pool without replacement and predicting the concentration, repeated 500 times (a sensitivity analysis of the number of resampling times is provided in Supplementary Methods and Supplementary Fig. 17). The weather-normalized concentration (emission-related concentration) for each observation was then calculated as the arithmetic mean of the 500 predictions. The meteorology-related PM2.5 concentration was calculated as the difference between the observed concentration and the emission-related concentration13,37,53,73.
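The resampling step can be sketched as follows. The trained model is represented by a generic `predict` callable so that the example does not depend on any particular RF library; the record layout, the function names, and the capping of the number of draws at the pool size are assumptions made for this illustration.

```python
import random


def weather_normalize(predict, records, met_keys, n_samples=500, seed=0):
    """Average predictions over meteorology resampled from the data pool.

    predict   : callable taking one record (dict) and returning a concentration
    records   : list of dicts holding time and meteorological predictors
    met_keys  : names of the predictors that are resampled
    """
    rng = random.Random(seed)
    pool = [{k: r[k] for k in met_keys} for r in records]
    n_draw = min(n_samples, len(pool))  # assumption: cap at pool size
    normalized = []
    for rec in records:
        draws = rng.sample(pool, n_draw)  # sampling without replacement
        # Keep the time variables of this record, swap in sampled meteorology.
        preds = [predict({**rec, **met}) for met in draws]
        normalized.append(sum(preds) / len(preds))
    return normalized


def toy_predict(rec):
    """Hypothetical stand-in for a trained model (not the study's RF)."""
    return 50.0 - rec["unix_time"] + 2.0 * rec["T2M"]


records = [{"unix_time": i, "T2M": float(i % 5)} for i in range(20)]
observed = [toy_predict(r) for r in records]
emission_related = weather_normalize(toy_predict, records, ["T2M"])
meteorology_related = [o - e for o, e in zip(observed, emission_related)]
```

Because only the meteorological predictors are replaced, the averaged prediction retains the dependence on the time variables, which is why it is interpreted as the emission-related concentration.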

XGB

Similar to weather normalization using RF, the XGB tree model was used to build the relationship between hourly air pollutant concentrations and the meteorological and other predictor variables55. Three key parameters, namely the number of gradient-boosted trees (nrounds), the maximum tree depth for base learners (max_depth), and the boosting learning rate (eta)77, were optimized by random search with 5-fold cross-validation55,77. The search space consisted of nrounds, max_depth, and eta, with ranges of 10 to 1000, 1 to 13, and 0 to 1, respectively. The random search was terminated after 100 evaluations. Based on the performance in the 5-fold cross-validation, the optimal hyperparameters were obtained for each city (Supplementary Table 4). After tuning, these parameters were used to train the XGB model. The trained XGB model was then used for weather normalization following the same procedure as described above for the RF model.
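The tuning procedure can be sketched as a random search over the stated ranges, scored by 5-fold cross-validation and stopped after 100 evaluations. The callback `fit_and_score`, which would train a model on the training indices and return an error on the held-out indices, is hypothetical; a toy error function stands in for it here, so no XGB model is actually fitted in this example.

```python
import random


def sample_params(rng):
    """Draw one candidate from the search space given in the text."""
    return {
        "nrounds": rng.randint(10, 1000),
        "max_depth": rng.randint(1, 13),
        "eta": rng.uniform(0.0, 1.0),
    }


def kfold_indices(n, k, rng):
    """Return k (train, test) index pairs that partition range(n)."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [j for j in idx if j not in held_out]
        splits.append((train, fold))
    return splits


def random_search(fit_and_score, n_obs, n_evals=100, k=5, seed=0):
    """Keep the candidate with the lowest mean cross-validated error."""
    rng = random.Random(seed)
    splits = kfold_indices(n_obs, k, rng)
    best_params, best_score = None, float("inf")
    for _ in range(n_evals):
        params = sample_params(rng)
        score = sum(fit_and_score(params, tr, te) for tr, te in splits) / k
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_error(params, train_idx, test_idx):
    """Hypothetical error surface used only to make the sketch runnable."""
    return (params["eta"] - 0.3) ** 2 + abs(params["max_depth"] - 6) / 100.0


best_params, best_score = random_search(toy_cv_error, n_obs=50)
```

Reusing the same five splits for every candidate keeps the comparison between hyperparameter combinations consistent across the 100 evaluations.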

Experiment design

A series of calculations using the methods described above was conducted to compare their performance in weather normalization of PM2.5 (1# experiment in Table 2). To exclude the effects of the data split process on machine learning, the same data were used for model training and testing in the RF and XGB models, e.g., when examining the effects of hyperparameter tuning on the RF model (2# experiment). Additionally, the same trained model (e.g., XGB) was used when examining the effect of the meteorology resampling strategy on the weather normalization results (3# experiment). Finally, the 4# experiment was designed to examine the effect of including or excluding temporal variables on the weather normalization results of the RF model.

Table 2 Experiment design and its purpose in this study.

Trend calculation and statistical parameters

Using the methods mentioned above, the PM2.5 observation (\({{\rm{PM}}}_{2.5}^{{\rm{OBS}}}\)) was decoupled into emission-related (\({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\)) and meteorology-related (\({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\)) concentrations. To ensure that the trend of the PM2.5 observation equaled the sum of the trends of \({{\rm{PM}}}_{2.5}^{{\rm{EMI}}}\) and \({{\rm{PM}}}_{2.5}^{{\rm{MET}}}\) from 2013 to 2017, the trends were calculated by linear regression of the annual PM2.5 concentrations against year33,34,78. The slope of the linear regression equation was regarded as the trend. To evaluate how well the different models reproduced the PM2.5 observations, several statistical parameters, including the Pearson correlation coefficient (r), normalized mean bias (NMB), and index of agreement (IOA), were used (see Supplementary Table 5 for more details).
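The additivity of the trends and the evaluation statistics can be illustrated with the sketch below. The annual values are synthetic and serve only as an example. The formulas for NMB and IOA are commonly used definitions assumed here; the definitions adopted in the study are those given in Supplementary Table 5 and may differ (e.g., NMB expressed in per cent).

```python
import math


def linear_trend(years, values):
    """Least-squares slope of values against years (concentration per year)."""
    n = len(years)
    mx, my = sum(years) / n, sum(values) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, values))
    sxx = sum((x - mx) ** 2 for x in years)
    return sxy / sxx


def pearson_r(obs, mod):
    """Pearson correlation coefficient between observed and modelled values."""
    n = len(obs)
    mo, mm = sum(obs) / n, sum(mod) / n
    cov = sum((o - mo) * (m - mm) for o, m in zip(obs, mod))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sm = math.sqrt(sum((m - mm) ** 2 for m in mod))
    return cov / (so * sm)


def nmb(obs, mod):
    """Normalized mean bias as a fraction (assumed definition)."""
    return sum(m - o for o, m in zip(obs, mod)) / sum(obs)


def ioa(obs, mod):
    """Index of agreement in its commonly used original form (assumed)."""
    mo = sum(obs) / len(obs)
    num = sum((m - o) ** 2 for o, m in zip(obs, mod))
    den = sum((abs(m - mo) + abs(o - mo)) ** 2 for o, m in zip(obs, mod))
    return 1.0 - num / den


# Synthetic annual means in µg m−3 (illustrative values only).
years = [2013, 2014, 2015, 2016, 2017]
pm_obs = [70.0, 65.0, 58.0, 52.0, 45.0]
pm_emi = [68.0, 62.0, 57.0, 50.0, 44.0]
pm_met = [o - e for o, e in zip(pm_obs, pm_emi)]

trend_obs = linear_trend(years, pm_obs)
trend_emi = linear_trend(years, pm_emi)
trend_met = linear_trend(years, pm_met)
```

Because the least-squares slope is linear in the dependent variable and the observation is the sum of the two components, `trend_obs` equals `trend_emi + trend_met` up to floating-point rounding.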