A synchronized estimation of hourly surface concentrations of six criteria air pollutants with GEMS data

Yang, Qianqian; Kim, Jhoon; Cho, Yeseul; Lee, Won-Jin; Lee, Dong-Won; Yuan, Qiangqiang; Wang, Fan; Zhou, Chenhong; Zhang, Xiaorui; Xiao, Xiang; Guo, Meiyu; Guo, Yike; Carmichael, Gregory R.; Gao, Meng

doi:10.1038/s41612-023-00407-1

Article
Open access
Published: 18 July 2023

npj Climate and Atmospheric Science volume 6, Article number: 94 (2023)

Abstract

Machine learning is widely used to infer ground-level concentrations of air pollutants from satellite observations. However, previous studies have commonly targeted a single pollutant, which duplicates effort and neglects the interactions among air pollutants and their common influencing factors. We aim to build a unified model that offers a synchronized estimation of ground-level air pollution levels. We constructed a multi-output random forest (MORF) model and achieved simultaneous estimation of hourly concentrations of PM_2.5, PM₁₀, O₃, NO₂, CO, and SO₂ in China, benefiting from the world’s first geostationary air-quality monitoring instrument, the Geostationary Environment Monitoring Spectrometer (GEMS). MORF yielded high accuracy, with cross-validated R² reaching 0.94. Meanwhile, model efficiency was significantly improved compared to single-output models. Based on the retrieved results, the spatial distributions, seasonality, and diurnal variations of the six air pollutants were analyzed and two typical pollution events were tracked.

Introduction

With rapid and energy-intensive economic development, China has experienced serious air pollution over the past several decades¹. To assist air quality management, the China National Environmental Monitoring Center (CNEMC) began operating a network in 2013 that measures six criteria air pollutants, namely PM_2.5, PM₁₀, sulfur dioxide (SO₂), nitrogen dioxide (NO₂), carbon monoxide (CO), and ozone (O₃). However, these sites are predominantly concentrated in urban or suburban regions, and considerable areas of China remain unmonitored^2,3. Satellite observations, especially those from geostationary satellites, capture the horizontal distribution of atmospheric composition and are thus widely used to supplement ground-based observations⁴.

Many algorithms have been developed or adopted to retrieve ground-level abundances of air pollutants from satellite images. They can be roughly divided into physics-based and statistics-based methods. The physics-based approach converts column densities measured by satellites to ground-level concentrations using their physical connections. For instance, ref. ⁵ used SO₂ profiles from an air quality model and tropospheric column SO₂ from the Ozone Monitoring Instrument (OMI) to estimate ground-level SO₂ concentrations. Similarly, ground-level NO₂ concentrations were inferred from NO₂ vertical column abundances from the TROPOspheric Monitoring Instrument (TROPOMI) using a surface-to-column conversion factor from a chemical transport model⁶. Additionally, a semi-empirical physical approach was developed to obtain ground-level PM_2.5 and PM₁₀ from satellite aerosol optical depth (AOD) through vertical correction, humidity correction, fine mode conversion, and volume correction⁷.

The statistics-based approach uses statistical models to learn relationships between satellite retrievals of aerosols and gases and collocated ground-level concentrations. Owing to its ability to capture nonlinear relationships, machine learning-based retrieval has received considerable attention in recent years^{8,9,10,11,12,13}. Wang and Christopher¹⁴ found a link between AOD and ground-level PM_2.5 mass and estimated air quality categories from AOD using a linear regression model. Later, more influencing factors such as meteorological and topographical variables were considered, and more sophisticated models, such as the land-use regression model¹⁵, space-time regression model¹⁶, geo-intelligent deep neural networks^17,18, and ensemble-learning-based models^19,20, were developed. For gaseous pollutants, satellite retrievals of column density were usually used as the main predictors^21,22,23.

Recently, retrieval models have also been built using satellite radiance or reflectance data. Shen et al.²⁴ proposed replacing AOD with top-of-atmosphere reflectance (TOAR) data for PM_2.5 estimation and successfully retrieved ground-level PM_2.5 concentrations with TOAR data from three MODIS bands (red, blue, and a short-wave infrared band). Subsequently, studies based on Himawari-8, Fengyun-4, and Landsat-8 data²⁵ estimated ground-level PM_2.5 and PM₁₀ concentrations. Reflectance/radiance-based retrieval of ground-level O₃ concentration was also achieved. Luo et al.²⁶ used 32 MODIS wavebands (all MODIS bands except bands 13–16, spanning visible to thermal infrared) and other auxiliary variables to estimate ground-level O₃ concentration with a deep-learning technique. Similarly, ref. ²⁷ leveraged seven Himawari-8 channels (one mid-wave infrared and six thermal infrared channels) for O₃ estimation and also achieved good results. These studies demonstrated that radiance/reflectance-based models achieve accuracy similar to that of column-product-based models while improving resolution and spatial coverage^24,28,29.

The models mentioned above each target a single pollutant (referred to as single-output models hereafter). Single-output models can fully explore the characteristics of a single pollutant and usually yield high accuracy. However, major air pollutants share some common sources, evolve under the same meteorological conditions, and are connected chemically or physically³⁰. As a result, models built for different pollutants use similar predictors and structures, and building multiple single-output models duplicates effort. A model that estimates the concentrations of these pollutants simultaneously can make better use of their correlations and improve efficiency. Multi-output regression methods³¹, which consider both the relationships between features and their target variables and the relationships among targets, have been applied in ecological modeling^32,33, chemometrics³⁴, and signal and image processing^35,36, among other fields, and have demonstrated a strong ability for simultaneous prediction and joint estimation of multiple variables.

Another limitation of previous studies is their reliance on low-Earth-orbit satellites, which provide only one or two observations per day over a given area and miss the dynamic evolution of pollutants during the day^37,38. Geostationary observations of AOD have been available since the launch of geostationary meteorological satellites, yet those of trace gases have been limited because satellite monitoring of trace gases relies largely on spectral information in the ultraviolet (UV) and visible bands. In February 2020, South Korea launched the Geostationary Environment Monitoring Spectrometer (GEMS) on board the Geostationary Korea Multi-Purpose Satellite 2 (GEO-KOMPSAT-2) satellite series. GEMS is the first ultraviolet-visible instrument on a geostationary Earth orbit platform³⁸, and it enables hourly monitoring of trace gases for almost 20 countries in Asia. With its high spatial and temporal resolution, GEMS observes the diurnal variations of multiple atmospheric components, which provides a great opportunity to study hourly estimation of ground-level air pollution.

This study aims to achieve a simultaneous estimation of hourly ground-level concentrations of six criteria air pollutants in China using the multi-output random forest model (MORF) and the latest GEMS data. The joint inversion of multiple pollutants proposed here is expected to simplify the process of retrieving surface concentrations of six criteria air pollutants, and largely improve modeling efficiency compared to traditional models. The proposed method can infer hourly variations of air pollutants with high accuracy and high efficiency, and assist in monitoring the evolution of pollution episodes.

Results and discussion

Statistics of model performance

Figure 1 displays the results of sample-based CV for hourly retrievals. MORF achieves an overall CV R² of 0.95 and RPE of 20.13% for the six air pollutants, yet performance varies by pollutant. R² values range from 0.79 to 0.94 and RPE values range from 14.83 to 25.18%. The best performance is obtained for O₃, with CV R², RMSE, MAE, and RPE of 0.94, 11.19 $\mu\mathrm{g}\,\mathrm{m}^{-3}$, 7.48 $\mu\mathrm{g}\,\mathrm{m}^{-3}$, and 14.83%, respectively. The slope of the fitting line, 0.93, also indicates low bias. Performance for particulates (PM_2.5 and PM₁₀) is also high and comparable to that of state-of-the-art single-output models^18,39. CV R² for PM_2.5 and PM₁₀ reaches 0.92 and 0.94, and the RMSE values are 9.94 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ and 24.77 $\mu\mathrm{g}\,\mathrm{m}^{-3}$, respectively. The retrieval accuracy for NO₂ and CO is lower, with R² of 0.87 and 0.80, and RPE of 22.33 and 17.92%, respectively. The MORF model performs worst for SO₂, with a CV R² of 0.79 and RPE of 25.18%. UV-based satellite retrieval of SO₂ has been reported to be subject to large uncertainties due to the presence of O₃ absorption and strong molecular Rayleigh scattering⁴⁰, which might also explain the poorer performance for SO₂ in our approach. The performance of sample-based CV is relatively stable across different hours, months, and stations (Supplementary Note 1). Generally, model performance is better in the warm season for O₃ and in the cold season for the other pollutants. The model yields higher accuracy at noon than in the morning and afternoon. In addition, model performance is poorer in regions with few sites than in regions with many ground stations, which is consistent with previous studies¹⁶.

**Fig. 1: Scatter plots of the sample-based tenfold cross-validation results of MORF.**

Site-based CV results are slightly worse than those of sample-based CV (Supplementary Note 2). R² ranges from 0.56 to 0.91 across pollutants. O₃ estimation yields the best accuracy, with R² of 0.91, RMSE of 13.91 $\mu\mathrm{g}\,\mathrm{m}^{-3}$, MAE of 9.67 $\mu\mathrm{g}\,\mathrm{m}^{-3}$, and RPE of 19.20%. For SO₂ and CO, R² decreases by ~0.23 and RPE increases by ~8% compared to sample-based CV. Site-based CV R² for the other pollutants ranges from 0.75 to 0.84, and RPE from 24.61 to 29.24%.
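
The difference between sample-based and site-based CV lies in how the folds are drawn: site-based CV holds out whole stations, so no station contributes samples to both training and testing. The following is a minimal sketch of such a split, not the authors' code; the function name `site_based_folds`, the station identifiers, and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def site_based_folds(station_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs in which no station is split.

    station_ids: one station identifier per sample (hypothetical input).
    """
    by_station = defaultdict(list)
    for idx, sid in enumerate(station_ids):
        by_station[sid].append(idx)
    stations = sorted(by_station)
    random.Random(seed).shuffle(stations)
    for k in range(n_folds):
        # Every n_folds-th station goes to the test side of fold k.
        test_stations = set(stations[k::n_folds])
        test_idx = [i for s in test_stations for i in by_station[s]]
        train_idx = [i for s in stations if s not in test_stations
                     for i in by_station[s]]
        yield train_idx, test_idx


# Hypothetical data: 20 stations with 5 hourly samples each.
station_ids = [f"S{n:02d}" for n in range(20) for _ in range(5)]
folds = list(site_based_folds(station_ids))
```

Because the held-out stations are never seen during training, this setup measures how well the model extrapolates to unmonitored locations, which is consistent with the lower scores reported above.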

In addition to CV, we also conducted an independent validation (IV). The results are provided in Supplementary Note 3. The results of IV are similar to those of CV, indicating that the proposed model is stable and generalizes well.

MORF was compared with single-output random forest (SORF) models in terms of accuracy and efficiency. We trained six separate SORF models, each using one of the six air pollutants as output, with the same model parameters as the MORF model. The comparison results are listed in Table 1. The retrieval accuracies of MORF and SORF are very close, but MORF outperforms SORF in efficiency. Training MORF (the time for fitting the MORF model with all samples) took only 10 min, while training the six SORF models took nearly 50 min. In addition, MORF took 4.52 s to retrieve one resampled GEMS image, whereas the SORF models needed 6.84 s to complete the estimation of the six air pollutants. The model size of SORF was also much larger than that of MORF. Considering that building six SORF models also requires more effort in data preparation, data preprocessing, parameter tuning, etc., MORF is much more efficient than SORF.

Table 1 Comparisons between the model performance of MORF and SORF.

Spatiotemporal variations of six criteria air pollutants

Considering the uneven distribution of GEMS data across months and hours (Supplementary Note 4), we first calculated monthly means and then averaged the monthly means to obtain the annual mean, which reduces the bias caused by uneven sample distribution. In addition, we divided the data into two parts when analyzing diurnal variation. For the warm season, data for all hours were considered, while only data from 00:45 UTC to 06:45 UTC were analyzed for the cold season.
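
The two-step averaging described above can be sketched for a single grid cell as follows. This is an illustrative example under assumed inputs: the function name `annual_mean_from_monthly` and the sample values are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import fmean


def annual_mean_from_monthly(samples):
    """Average within each month first, then average the monthly means.

    samples: iterable of (month, value) pairs for one grid cell.
    Months with many valid retrievals therefore do not dominate the result.
    """
    by_month = defaultdict(list)
    for month, value in samples:
        by_month[month].append(value)
    monthly_means = {m: fmean(v) for m, v in by_month.items()}
    return fmean(monthly_means.values()), monthly_means


# Hypothetical values: January has one retrieval, July has three.
samples = [(1, 90.0), (7, 30.0), (7, 30.0), (7, 30.0)]
annual, monthly = annual_mean_from_monthly(samples)
naive = fmean(v for _, v in samples)  # pooled mean, biased toward July
```

In this toy case the monthly-weighted annual mean is 60.0, whereas the pooled mean is 45.0 because July contributes three of the four samples.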

Spatial distributions of air pollutants in 2021 are displayed in Fig. 2. PM_2.5 hotspots are located at the junction of Henan, Hebei, and Shandong provinces and in western Xinjiang (locations of these provinces can be found in Supplementary Note 5). Areas with high PM₁₀ concentrations are mainly located in northwestern China, where dust storms occur frequently⁴¹. O₃ pollution is most serious in Shandong province and surrounding regions and in some coastal cities in southern China. Areas with high NO₂ concentrations coincide closely with major metropolitan areas, such as the Beijing-Tianjin-Hebei (BTH) region, the Yangtze River Delta (YRD), Guangzhou, Wuhan, Chengdu, Chongqing, Lanzhou, and Xi'an, because transportation is the dominant source of NO₂⁴². Unlike particulates and O₃, CO and SO₂ are more associated with point sources⁴³, as indicated by CO hotspots in Shenyang (Liaoning province), Jincheng (Shanxi), Tangshan (Hebei), Wuhan (Hubei), Lanzhou (Gansu), Xining (Qinghai), the border of Chuxiong (Yunnan) and Panzhihua (Sichuan), and Xinjiang. The distribution of SO₂ hotspots is similar to that of CO. The highest SO₂ concentrations are detected in Lanzhou (Gansu), Xining (Qinghai), and some cities in Inner Mongolia, consistent with ground-level observations (Supplementary Note 6). Under national regulations of SO₂ emissions in eastern and southern China, SO₂ concentrations in the YRD, BTH, and Pearl River Delta (PRD) have decreased remarkably over recent years. However, in northwestern China, SO₂ concentrations keep growing due to the expansion and relocation of the energy industry⁴⁴. The seasonal variations are consistent with previous studies (details are provided in Supplementary Note 7).

**Fig. 2: Spatial distributions of annual mean concentrations for six air pollutants.**

Figure 3 displays the diurnal variations in the warm season. PM_2.5 and PM₁₀ concentrations decrease over the day in most regions, which is associated with the development of the boundary layer height (BLH)⁴⁵ and the high emissions during morning rush hours. In contrast, in northwestern China, particulate concentrations first increase from 08:00 Beijing Time (BJT) to 12:00 BJT and then decrease from 12:00 BJT to 16:00 BJT⁴⁶. The different diurnal patterns in northwestern China and other regions can be attributed to differences in pollution sources. O₃ concentrations increase from 09:00 BJT to 15:00 BJT due to enhanced solar radiation and photochemical activity during daytime²². Similar to PM_2.5, NO₂ and CO concentrations decrease gradually from 08:00 BJT to 16:00 BJT under the influence of boundary layer mixing. SO₂ concentrations in northeast China decrease during the daytime. However, in Inner Mongolia and northwestern China, SO₂ increases from 08:00 BJT to 11:00 BJT and then decreases.

**Fig. 3: Diurnal variations of air pollution in the warm season.**

The diurnal variations in the cold season are broadly consistent with those in the warm season, with several small differences (Supplementary Note 8). In the cold season, particulate concentrations in northwestern China peak in the early afternoon (14:00 BJT) rather than at noon (12:00 BJT). The most distinct increase in O₃ concentration occurs in the BTH region in the warm season but in southern China in the cold season. This is related to the different seasonality of O₃ across China⁴⁷.

Application in monitoring pollution episodes

We selected two pollution cases to illustrate how our results can help monitor the dynamic evolution of pollution. As shown in Fig. 4, we use hourly estimations to track a serious O₃ pollution event in Guangdong province on April 30, 2021, and a dust storm event in northern China on March 15, 2021. Comparisons with ground-level observations suggest that our retrieved maps accurately capture the changes in O₃ concentrations during the O₃ episode. O₃ concentrations increase rapidly from 20 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ at 09:00 BJT to >250 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ at 16:00 BJT in Guangzhou and surrounding cities. Another small hotspot in the southeastern corner of Guangdong province is also detected, where the O₃ concentration reaches 200 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ at 16:00 BJT. For the dust storm event, ground-level observations indicate an extremely high PM₁₀ concentration (>3500 $\mu\mathrm{g}\,\mathrm{m}^{-3}$) in Beijing, which is also well reflected in the retrieved maps. In addition, both station observations and our retrievals show that PM₁₀ concentrations in Beijing decrease from >3500 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ at 10:00 BJT to ~2500 $\mu\mathrm{g}\,\mathrm{m}^{-3}$ at 15:00 BJT. These two cases demonstrate that the retrievals from our proposed algorithm capture changes in pollutant concentrations during pollution events well.

**Fig. 4: Variations of pollutant concentrations during two pollution events.**

Discussions

Geostationary satellites offer great potential for monitoring air pollution owing to their spatial and temporal coverage. Previously, a number of machine learning models were built to infer ground-level concentrations of air pollutants from satellite images. These models achieved high estimation accuracy, yet a joint inversion model that improves modeling efficiency and reduces modeling complexity was still lacking. In this study, we framed the retrieval as a multi-output problem and proposed a unified retrieval model based on MORF that achieved simultaneous estimation of hourly concentrations of six criteria air pollutants in China, benefiting from the world’s first geostationary air pollution monitoring spectrometer, GEMS. CV results for all samples and for different months, hours, and stations demonstrated the accuracy and stability of our MORF model. Comparisons with SORF showed that MORF was much more efficient than the single-output models. Based on our retrieval results, the spatial, seasonal, and diurnal variations of the six pollutants were analyzed in detail. In general, the maximum values of daytime PM_2.5, NO₂, and CO appear in the morning. PM₁₀ concentrations peak at noon in the warm season and in the afternoon in the cold season. O₃ concentrations increase from morning to afternoon, in line with photochemical intensity. We also used retrieved maps to monitor the dynamic evolution of pollutants during two pollution events, an O₃ pollution event in Guangdong province and a dust storm event in northern China. Our retrieval results captured the same variations of pollution as ground stations but offered better spatial coverage.

Even so, limitations remain. First, the accuracy of the SO₂ and CO estimates can be further improved. On the one hand, the absorption features of SO₂ and CO, which lie in the ultraviolet-B and infrared bands, respectively, are outside the wavelength range used in this study. Collecting data with wider spectral coverage may help improve model performance. On the other hand, the information that satellites can provide about ground-level air pollution is limited and difficult to extract, so other multi-source data such as emissions and point-of-interest information may also improve estimation accuracy. Second, model performance decreased in regions with few stations. For example, the estimation accuracy of PM_2.5, PM₁₀, and SO₂ was lower in Tibet. This should be considered when the retrieval results are used. In the future, data from more stations in these regions can be used when available to reduce uncertainties⁴⁸. Third, estimating multiple variables with one model can bring useful extra information but may also introduce mutual interference, especially when uncorrelated tasks are included. Therefore, a model that can judge the correlation between multiple tasks may achieve better performance. Deep-learning-based multi-task models that evaluate the correlation between different regression tasks and determine the degree of sharing accordingly are worthy of attention. Finally, the physical relationships between ground-level air pollution and satellite radiance data are not fully explained and explored in this study. Interpretable machine learning models can be used in our future work to offer a deeper understanding.

Method

Data collection

The study area extended from 15°N to 45°N and from 73°E to 135°E (Supplementary Note 5). Ground-level concentrations of the six criteria air pollutants, namely PM_2.5, PM₁₀, SO₂, NO₂, CO, and O₃, were obtained from the China National Environmental Monitoring Center website (http://www.cnemc.cn/en/). There were more than 1600 stations in 2021. These stations covered all provinces in mainland China and provided pollutant concentration data with low uncertainty⁴⁹. Hourly data in 2021 were used in this study, and negative values were removed as outliers⁵⁰.

Hourly normalized radiance data for 2021 at six wavelengths (354, 388, 412, 443, 477, and 490 nm), ranging from UV to visible bands, were taken from the GEMS Level 2 (L2) aerosol product³⁸. Because different air pollutants have different spectral absorption intensities at different wavelengths^51,52,53, radiance data at different wavelengths are likely to provide useful information for estimating concentrations of air pollutants. The nominal spatial resolution of the GEMS aerosol product is 3.5 km × 8 km over Seoul, South Korea.

The information that satellite images can provide is limited, especially for ground-level trace gases such as SO₂ and CO. Therefore, meteorological and spatiotemporal information was also considered in our model. Four meteorological variables, namely hourly boundary layer height (BLH), 2 m temperature (T), 2 m dew point temperature (DT), and surface solar radiation downwards (SR), were taken from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset⁵⁴. The spatial resolution of BLH was 0.25° × 0.25°, while that of the other three variables, taken from the ERA5-Land dataset⁵⁵, was 0.1° × 0.1°.

Data integration

We resampled all the variables to a defined grid of 0.1° × 0.1° using bilinear interpolation^50,56, and then collocated ground measurements and raster data according to time and location (longitude and latitude). Hourly GEMS L2 aerosol products were provided at the starting time of each observation, from 22:45 UTC (Coordinated Universal Time) to 07:45 UTC. Because each GEMS east-west scan takes ~30 min, air pollution and meteorological data at the hour closest to the starting time were matched with GEMS data. For example, meteorological and air pollution data at 01:00 UTC were matched with GEMS observations that started at 00:45 UTC.
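
The temporal matching rule can be illustrated with a short sketch that rounds a GEMS scan start time to the closest whole hour and converts it to Beijing Time (BJT, UTC+8), which is used in the Results. The helper `nearest_hour` and the example date are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

BJT = timezone(timedelta(hours=8))  # Beijing Time is UTC+8


def nearest_hour(start_utc):
    """Round a scan start time to the closest whole hour."""
    floored = start_utc.replace(minute=0, second=0, microsecond=0)
    if start_utc - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


# A scan that starts at 00:45 UTC is paired with the 01:00 UTC records.
scan_start = datetime(2021, 4, 30, 0, 45, tzinfo=timezone.utc)
matched = nearest_hour(scan_start)
matched_bjt = matched.astimezone(BJT)  # 09:00 BJT on the same day
```

Under this rule, the first scan of the day (22:45 UTC) pairs with 23:00 UTC, that is, 07:00 BJT on the following calendar day.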

Previous studies indicated that oversampling could improve the quality of training samples and help the model better learn the relationship between predictors and target variables^20,57,58. A random oversampling technique⁵⁹ was therefore adopted in this study. Details about the oversampling strategy are provided in Supplementary Note 9.

Model development

Spatiotemporal information, satellite observations, and meteorological variables were used to estimate ground-level concentrations of air pollutants, and the model can be expressed as:

$$(P{M}_{2.5},P{M}_{10},{O}_{3},N{O}_{2},CO,S{O}_{2})=f(month,doy,hour,RAA,{R}_{1},{R}_{2},{R}_{3},{R}_{4},{R}_{5},{R}_{6},BLH,SR,T,DT)$$

(1)

in which month, doy (day of the year), and hour are the temporal information, and RAA stands for relative azimuth angle. R₁–R₆ represent normalized radiance at 354, 388, 412, 443, 477, and 490 nm, while BLH, SR, T, and DT are the four meteorological variables considered. f() represents the proposed MORF model.

MORF model was developed from the random forest (RF) model⁶⁰. RF model is a widely used decision-tree-based ensemble-learning model. To overcome overfitting, decision trees in RF were trained using only a random subset of training samples with a random subspace of the input features. Individual trees were then formed using a greedy algorithm that involved, at each split node, the generation of several binary split candidates⁶¹. We used ${Q}_{m}$ and ${n}_{m}$ to represent the data and number of samples at each tree node m. For each candidate node split $\theta =(j,{t}_{m})$ that consisted of a feature $j$ and a threshold ${t}_{m}$, data were partitioned into two subsets: ${Q}_{m}^{{\mathrm{left}}}(\theta )$ with ${n}_{m}^{{\mathrm{left}}}$ samples and ${Q}_{m}^{{\mathrm{right}}}(\theta )$ with ${n}_{m}^{{\mathrm{right}}}$ samples. The quality of a candidate split of node m was then computed using an impurity function $H()$:

$$G({Q}_{m},\theta )=\frac{{n}_{m}^{{\mathrm{left}}}}{{n}_{m}}H({Q}_{m}^{{\mathrm{left}}}(\theta ))+\frac{{n}_{m}^{{\mathrm{right}}}}{{n}_{m}}H({Q}_{m}^{{\mathrm{right}}}(\theta ))$$

(2)

For a single-output regression task (single-output RF, SORF), the impurity function $H()$ with an L2 error (mean squared error) can be written as:

$$H({Q}_{m})=\frac{1}{{n}_{m}}\sum _{y\in {Q}_{m}}(y-{\bar{y}}_{m}{)}^{2},{\rm{where}}\,{\bar{y}}_{m}=\frac{1}{{n}_{m}}\sum _{y\in {Q}_{m}}y$$

(3)

For a multi-output regression problem (multi-output RF, MORF), the splitting criteria were modified to compute the average loss across all ${n}_{t}$ outputs³⁵. The impurity function was thus changed to:

$$H^{\prime} ({Q}_{m})=\frac{1}{{n}_{t}}\mathop{\sum }\limits_{i=1}^{{n}_{t}}{H}_{i}({Q}_{m})$$

(4)

where ${n}_{t}$ is the number of outputs (6 in this study), and $H^{\prime} ({Q}_{m})$ is the new impurity function.
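
To make Eqs. (2)–(4) concrete, the sketch below evaluates the quality of one candidate split for a node with several outputs. It is an illustrative pure-Python example rather than the implementation used in the study; the function names, the toy data, and the convention that samples with a feature value less than or equal to the threshold go to the left child are assumptions.

```python
def mse_impurity(values):
    """Eq. (3): mean squared deviation from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def multi_output_impurity(rows):
    """Eq. (4): average the per-output impurity over all outputs.

    rows: list of target vectors, one per sample at the node.
    """
    n_outputs = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(n_outputs)]
    return sum(mse_impurity(col) for col in columns) / n_outputs


def split_quality(features, targets, j, threshold):
    """Eq. (2): sample-weighted impurity of the two child nodes."""
    left = [t for x, t in zip(features, targets) if x[j] <= threshold]
    right = [t for x, t in zip(features, targets) if x[j] > threshold]
    if not left or not right:
        return float("inf")  # degenerate split, never preferred
    n = len(targets)
    return ((len(left) / n) * multi_output_impurity(left)
            + (len(right) / n) * multi_output_impurity(right))


# Toy node: one feature, two outputs (hypothetical values).
X = [[1.0], [2.0], [3.0], [4.0]]
Y = [[10.0, 1.0], [10.0, 1.0], [20.0, 3.0], [20.0, 3.0]]
good = split_quality(X, Y, j=0, threshold=2.5)  # separates both outputs
poor = split_quality(X, Y, j=0, threshold=1.5)  # leaves a mixed child
```

A lower value of G indicates a better split. Because the impurity is averaged over outputs without rescaling, outputs with larger numerical ranges contribute more to the split decision in this sketch.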

Three parameters were tuned in our experiments, i.e., the number of trees (n_estimators), the minimum number of samples required for internal node split (min_samples), and the number of features to make the split decision (max_features). After a parameter sensitivity test (Supplementary Note 10), n_estimators, min_samples, and max_features were set as 30, 3, and 3 for a balance of model accuracy and efficiency.

Variable importance in the MORF model was evaluated with permutation importance^19,62, defined as the decrease in a model score when the values of a single feature are randomly shuffled. In general, meteorological variables are the most important, followed by radiance data, but the importance rankings differ among air pollutants. The detailed results are provided in Supplementary Note 11.
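
The idea of permutation importance can be sketched as follows. This is a simplified single-output illustration under stated assumptions, not the evaluation code of the study: the score is the negative mean squared error, and `predict` stands in for a hypothetical fitted model.

```python
import random


def neg_mse(predict, X, y):
    """Model score: negative mean squared error (higher is better)."""
    return -sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in score after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = neg_mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:]
                      for row, v in zip(X, column)]
            drops.append(baseline - neg_mse(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    """Hypothetical fitted model that only uses feature 0."""
    return 2.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] for row in X]
importance = permutation_importance(predict, X, y)
```

Shuffling feature 1 leaves the score unchanged because the toy model ignores it, whereas shuffling feature 0 degrades the score, so feature 0 receives the larger importance.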

Model performance was evaluated using tenfold cross-validation (CV)⁶³ and independent validation (IV). Sample-based CV was conducted for all samples and for different months, hours, and stations. In addition, a site-based CV was conducted, in which 10% of stations were selected for testing and the rest for training in each round. For IV, we divided the data into two parts: 70% of the data were used for model fitting, CV, and parameter tuning, and the fitted model was then validated on the remaining 30% of the data⁶⁴. Quantitative metrics, including the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE), and relative predictive error (RPE), were calculated for each air pollutant⁵⁶:

$${R}^{2}={\left(\frac{\sum_{i=1}^{n}(\mathrm{obs}_{i}-\overline{\mathrm{obs}})(\mathrm{est}_{i}-\overline{\mathrm{est}})}{\sqrt{\sum_{i=1}^{n}{(\mathrm{obs}_{i}-\overline{\mathrm{obs}})}^{2}\sum_{i=1}^{n}{(\mathrm{est}_{i}-\overline{\mathrm{est}})}^{2}}}\right)}^{2}\;(\mathrm{unitless})$$

(5)

$${\mathrm{RMSE}}=\sqrt{\frac{{\sum }_{i=1}^{n}{({\mathrm{es}}{{\mathrm{t}}}_{i}-{\mathrm{ob}}{{\mathrm{s}}}_{i})}^{2}}{n}}({\mathrm{unit}}:\mu g/{m}^{3})$$

(6)

$${\mathrm{MAE}}=\frac{{\sum }_{i=1}^{n}|{\mathrm{es}}{{\mathrm{t}}}_{i}-{\mathrm{ob}}{{\mathrm{s}}}_{i}|}{n}({\mathrm{unit}}:\mu g/{m}^{3})$$

(7)

$${\mathrm{RPE}}=\sqrt{\frac{\sum_{i=1}^{n}{(\mathrm{est}_{i}-\mathrm{obs}_{i})}^{2}}{n}}/\overline{\mathrm{obs}}\times 100\%\;(\mathrm{unit}:\%)$$

(8)

where n is the total number of ground sites and i denotes the ith site. obs_i and est_i represent the observed and estimated values at the ith site, respectively. $\overline{{\mathrm{obs}}}$ and $\overline{{\mathrm{est}}}$ are the means of the observed and estimated values over all ground sites. The workflow of this study is summarized in Fig. 5.
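
The four metrics in Eqs. (5)–(8) can be computed with a few lines of code. The sketch below is illustrative only; the function name `evaluate` and the sample values are hypothetical, and RPE is expressed as a percentage, as it is reported in the Results.

```python
import math


def evaluate(obs, est):
    """Return R2, RMSE, MAE, and RPE (%) for paired observations and estimates."""
    n = len(obs)
    obs_mean = sum(obs) / n
    est_mean = sum(est) / n
    cov = sum((o - obs_mean) * (e - est_mean) for o, e in zip(obs, est))
    var_obs = sum((o - obs_mean) ** 2 for o in obs)
    var_est = sum((e - est_mean) ** 2 for e in est)
    r2 = (cov / math.sqrt(var_obs * var_est)) ** 2  # squared Pearson correlation
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / n)
    mae = sum(abs(e - o) for o, e in zip(obs, est)) / n
    rpe = rmse / obs_mean * 100.0  # RMSE relative to the mean observation
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "RPE": rpe}


# Hypothetical concentrations in micrograms per cubic metre.
obs = [10.0, 20.0, 30.0, 40.0]
est = [12.0, 18.0, 32.0, 38.0]
metrics = evaluate(obs, est)
```

For these toy values every estimate is off by 2, so RMSE and MAE are both 2 and RPE is 8% of the mean observation of 25.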

**Fig. 5: The flowchart of this study.**

Data availability

The ERA5 reanalysis dataset is freely available from the Copernicus Climate Change Service (C3S) Climate Data Store. ERA5 hourly data on single levels from 1959 to the present and ERA5-Land hourly data from 1950 to the present were used in this study and are accessible at https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview and https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview, respectively. The China National Environmental Monitoring Center data are available at https://quotsoft.net/air/. GEMS data used in this study were provided by Yonsei University.

Code availability

All additional codes needed to perform the analyses are available upon reasonable request from the corresponding author (mmgao2@hkbu.edu.hk).

References

Geng, G. et al. Drivers of PM2.5 air pollution deaths in China 2002–2017. Nat. Geosci. 14, 645–650 (2021).
Liu, C., Gao, M., Hu, Q., Brasseur, G. P. & Carmichael, G. R. Stereoscopic monitoring: a promising strategy to advance diagnostic and prediction of air pollution. Bull. Am. Meteorol. Soc. 102, E730–E737 (2021).
Liu, C. et al. Stereoscopic hyperspectral remote sensing of the atmospheric environment: Innovation and prospects. Earth Sci. Rev. 226, 103958 (2022).
Yang, Q. et al. Mapping PM_2.5 concentration at a sub-km level resolution: a dual-scale retrieval approach. ISPRS J. Photogramm. Remote Sens. 165, 140–151 (2020).
Kharol, S. K. et al. OMI satellite observations of decadal changes in ground-level sulfur dioxide over North America. Atmos. Chem. Phys. 17, 5921–5929 (2017).
Cooper, M. J., Martin, R. V., McLinden, C. A. & Brook, J. R. Inferring ground-level nitrogen dioxide concentrations at fine spatial resolution applied to the TROPOMI satellite instrument. Environ. Res. Lett. 15, 104013 (2020).
Zhang, Y. & Li, Z. Remote sensing of atmospheric fine particulate matter (PM_2.5) mass concentration near the ground from satellite observation. Remote Sens. Environ. 160, 252–262 (2015).
Yuan, Q. et al. Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241, 111716 (2020).
Article Google Scholar
Ma, Z. et al. A review of statistical methods used for developing large-scale and long-term PM_2.5 models from satellite data. Remote Sens. Environ. 269, 112827 (2022).
Zhang, Y. et al. Satellite remote sensing of atmospheric particulate matter mass concentration: advances, challenges, and perspectives. Fundam. Res. 1, 240–258 (2021).
Article Google Scholar
Gao, M. et al. Seasonal prediction of Indian wintertime aerosol pollution using the ocean memory effect. Sci. Adv. 5, eaav4157 (2019).
Article Google Scholar
Liang, F. et al. Evaluation of a data fusion approach to estimate daily PM2.5 levels in North China. Environ. Res 158, 54–60 (2017).
Article Google Scholar
He, Q. & Huang, B. Satellite-based high-resolution PM_2.5 estimation over the Beijing-Tianjin-Hebei region of China using an improved geographically and temporally weighted regression model. Environ. Pollut. 236, 1027–1037 (2018).
Article Google Scholar
Wang, J. & Christopher, S. A. Intercomparison between satellite-derived aerosol optical thickness and PM_2.5 mass: Implications for air quality studies. Geophys. Res. Lett. 30, 2095 (2003).
Article Google Scholar
Liu, Y., Paciorek Christopher, J. & Koutrakis, P. Estimating regional spatial and temporal variability of PM2.5 concentrations using satellite data, meteorology, and land use information. Environ. Health Perspect. 117, 886–892 (2009).
Article Google Scholar
He, Q. & Huang, B. Satellite-based mapping of daily high-resolution ground PM_2.5 in China via space-time regression modeling. Remote Sens. Environ. 206, 72–83 (2018).
Article Google Scholar
Li, T., Shen, H., Yuan, Q., Zhang, X. & Zhang, L. Estimating ground-level PM_2.5 by fusing satellite and station observations: a geo-intelligent deep learning approach. Geophys. Res. Lett. 44, 985–911,993 (2017). 11.
Article Google Scholar
Wang, B. et al. Estimate hourly PM_2.5 concentrations from Himawari-8 TOA reflectance directly using geo-intelligent long short-term memory network. Environ. Pollut. 271, 116327 (2021).
Article Google Scholar
Yang, N., Shi, H., Tang, H. & Yang, X. Geographical and temporal encoding for improving the estimation of PM_2.5 concentrations in China using end-to-end gradient boosting. Remote Sens. Environ. 269, 112828 (2022).
Article Google Scholar
Geng, G. et al. Tracking air pollution in China: near real-time PM_2.5 retrievals from multisource data fusion. Environ. Sci. Technol. 55, 12106–12115 (2021).
Article Google Scholar
Wei, J. et al. Full-coverage mapping and spatiotemporal variations of ground-level ozone (O₃) pollution from 2013 to 2020 across China. Remote Sens. Environ. 270, 112775 (2022).
Article Google Scholar
Wang, Y., Yuan, Q., Li, T., Zhu, L. & Zhang, L. Estimating daily full-coverage near surface O₃, CO, and NO₂ concentrations at a high spatial resolution over China based on S5P-TROPOMI and GEOS-FP. ISPRS J. Photogramm. Remote Sens. 175, 311–325 (2021).
Article Google Scholar
Wang, Y., Yuan, Q., Li, T. & Zhu, L. Global spatiotemporal estimation of daily high-resolution surface carbon monoxide concentrations using Deep Forest. J. Clean. Prod. 350, 131500 (2022).
Article Google Scholar
Shen, H., Li, T., Yuan, Q. & Zhang, L. Estimating regional ground‐level PM_2.5 directly from satellite top‐of‐atmosphere reflectance using deep belief networks. J. Geophys. Res. Atmos. 123, 13875–13886 (2018).
Article Google Scholar
Chen, B. et al. Estimation of atmospheric PM₁₀ concentration in China using an interpretable deep learning model and top‐of‐the‐atmosphere reflectance data from China’s new generation geostationary meteorological satellite, FY‐4A. J. Geophys. Res. Atmos. 127, e2021JD036393 (2022).
Google Scholar
Luo, N. et al. Explainable and spatial dependence deep learning model for satellite-based O₃ monitoring in China. Atmos. Environ. 290, 119370 (2022).
Article Google Scholar
Chen, B. et al. Estimation of near-surface ozone concentration and analysis of main weather situation in China based on machine learning model and Himawari-8 TOAR data. Sci. Total Environ. 864, 160928 (2023).
Article Google Scholar
Li, M., Yang, Q., Yuan, Q. & Zhu, L. Estimation of high spatial resolution ground-level ozone concentrations based on Landsat 8 TIR bands with deep forest model. Chemosphere 301, 134817 (2022).
Article Google Scholar
Yang, Q., Yuan, Q. & Li, T. Ultrahigh-resolution PM_2.5 estimation from top-of-atmosphere reflectance with machine learning: Theories, methods, and applications. Environ. Pollut. 306, 119347 (2022).
Article Google Scholar
Gao, M., Ji, D., Liang, F. & Liu, Y. Attribution of aerosol direct radiative forcing in China and India to emitting sectors. Atmos. Environ. 190, 35–42 (2018).
Article Google Scholar
Borchani, H., Varando, G., Bielza, C. & Larrañaga, P. A survey on multi-output regression. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5, 216–233 (2015).
Article Google Scholar
Mandal, D. et al. Crop biophysical parameter retrieval from Sentinel-1 SAR data with a multi-target inversion of Water Cloud Model. Int. J. Remote Sens. 41, 5503–5524 (2020).
Article Google Scholar
Tuia, D., Verrelst, J., Alonso, L., Perez-Cruz, F. & Camps-Valls, G. Multioutput support vector regression for remote sensing biophysical parameter estimation. IEEE Geosci. Remote Sens. Lett. 8, 804–808 (2011).
Article Google Scholar
Bediaga, H. et al. Multi-output chemometrics model for gasoline compounding. Fuel 310, 122274 (2022).
Article Google Scholar
Dapogny, A., Bailly, K. & Dubuisson, S. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) 135–140 (2017).
Talavera-Llames, R., Pérez-Chacón, R., Troncoso, A. & Martínez-Álvarez, F. MV-kWNN: A novel multivariate and multi-output weighted nearest neighbours algorithm for big data time series forecasting. Neurocomputing 353, 56–73 (2019).
Article Google Scholar
Saide, P. E. et al. Assimilation of next generation geostationary aerosol optical depth retrievals to improve air quality simulations. Geophys. Res. Lett. 41, 9188–9196 (2014).
Article Google Scholar
Kim, J. et al. New era of air quality monitoring from space: geostationary environment monitoring spectrometer (GEMS). Bull. Am. Meteorol. Soc. 101, E1–E22 (2020).
Article Google Scholar
Mao, F. et al. Estimating hourly full-coverage PM_2.5 over China based on TOA reflectance data from the Fengyun-4A satellite. Environ. Pollut. 270, 116119 (2020).
Article Google Scholar
Gonzalez Abad, G. et al. Five decades observing Earth’s atmospheric trace gases using ultraviolet and visible backscatter solar radiation from space. J. Quant. Spectrosc. Radiat. Transf. 238, 106478 (2019).
Article Google Scholar
Li, J. et al. Mixing of Asian mineral dust with anthropogenic pollutants over East Asia: a model case study of a super-duststorm in March 2010. Atmos. Chem. Phys. 12, 7591–7607 (2012).
Article Google Scholar
Liu, F. et al. Recent reduction in NOx emissions over China: synthesis of satellite observations and emission inventories. Environ. Res. Lett. 11, 114002 (2016).
Article Google Scholar
Li, S. & Xie, S. Spatial distribution and source analysis of SO2 concentration in Urumqi. Int. J. Hydrog. Energy 41, 15899–15908 (2016).
Article Google Scholar
Ling, Z. et al. OMI-measured increasing SO2 emissions due to energy industry expansion and relocation in northwestern China. Atmos. Chem. Phys. 17, 9115–9131 (2017).
Article Google Scholar
Gao, M. et al. Reduced light absorption of black carbon (BC) and its influence on BC-boundary-layer interactions during “APEC Blue”. Atmos. Chem. Phys. 21, 11405–11421 (2021).
Article Google Scholar
Liu, Z. et al. Seasonal and diurnal variation in particulate matter (PM10 and PM2.5) at an urban site of Beijing: analyses from a 9-year study. Environ. Sci. Pollut. Res. Int. 22, 627–642 (2015).
Article Google Scholar
Gao, M. et al. Ozone pollution over China and India: seasonality and sources. Atmos. Chem. Phys. 20, 4399–4414 (2020).
Article Google Scholar
Zeng, Z. et al. Estimating hourly surface PM2.5 concentrations across China from high-density meteorological observations by machine learning. Atmos. Res. 254, 105516 (2021).
Article Google Scholar
Li, T., Shen, H., Zeng, C., Yuan, Q. & Zhang, L. Point-surface fusion of station measurements and satellite observations for mapping PM_2.5 distribution in China: methods and assessment. Atmos. Environ. 152, 477–489 (2017).
Article Google Scholar
Zhou, C. et al. Optimal planning of air quality-monitoring sites for better depiction of PM2.5 pollution across China. ACS Environ. Au. 2, 314–323 (2022).
Article Google Scholar
Krotkov, N. A., Carn, S. A., Krueger, A. J., Bhartia, P. K. & Kai, Y. Band residual difference algorithm for retrieval of SO₂ from the aura ozone monitoring instrument (OMI). IEEE Trans. Geosci. Remote Sens. 44, 1259–1266 (2006).
Article Google Scholar
Veefkind, J. P., Haan, J. F. D., Brinksma, E. J., Kroon, M. & Levelt, P. F. Total ozone from the ozone monitoring instrument (OMI) using the DOAS technique. IEEE Trans. Geosci. Remote Sens. 44, 1239–1244 (2006).
Article Google Scholar
van Geffen, J. H. G. M. et al. Improved spectral fitting of nitrogen dioxide from OMI in the 405–465 nm window. Atmos. Meas. Tech. 8, 1685–1699 (2015).
Article Google Scholar
Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).
Article Google Scholar
Muñoz-Sabater, J. et al. ERA5-Land: a state-of-the-art global reanalysis dataset for land applications. Earth Syst. Sci. Data 13, 4349–4383 (2021).
Article Google Scholar
Yang, Q., Yuan, Q., Li, T. & Yue, L. Mapping PM2.5 concentration at high resolution using a cascade random forest based downscaling model: evaluation and application. J. Clean. Prod. 277, 123887 (2020).
Article Google Scholar
Vu, B. N. et al. Application of geostationary satellite and high-resolution meteorology data in estimating hourly PM2.5 levels during the Camp Fire episode in California. Remote Sens. Environ. 271, 112890 (2022).
Article Google Scholar
Xiao, Q. et al. Separating emission and meteorological contributions to long-term PM2.5 trends over eastern China during 2000–2018. Atmos. Chem. Phys. 21, 9475–9496 (2021).
Article Google Scholar
Xiao, F. Inference-based naïve bayes: turning naïve bayes cost-sensitive. IEEE Trans. Knowl. Data Eng. 25, 2302–2313 (2013).
Article Google Scholar
Zeng, Z. et al. Daily global solar radiation in China estimated from high‐density meteorological observations: a random forest model framework. Earth Space Sci. 7, e2019EA001058 (2020).
Article Google Scholar
Strobl, C., Malley, J. & Tutz, G. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods 14, 323–348 (2009).
Article Google Scholar
Chen, Y.-W., Medya, S. & Chen, Y.-C. Investigating variable importance in ground-level ozone formation with supervised learning. Atmos. Environ. 282, 119148 (2022).
Article Google Scholar
Li, T., Shen, H., Zeng, C. & Yuan, Q. A validation approach considering the uneven distribution of ground stations for satellite-based PM_2.5 estimation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13, 1312–1321 (2020).
Article Google Scholar
Xiao, Y., Wang, Y., Yuan, Q., He, J. & Zhang, L. Generating a long-term (2003− 2020) hourly 0.25° global PM_2.5 dataset via spatiotemporal downscaling of CAMS with deep learning (DeepCAMS). Sci. Total Environ. 848, 157747 (2022).
Article Google Scholar

Download references

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (project nos. HKBU12202021 and HKBU22201820) and the National Natural Science Foundation of China (No. 42005084). The authors are grateful to the GEMS science team for providing GEMS aerosol products and to the China National Environmental Monitoring Center for providing ground-level air pollution data.

Author information

Authors and Affiliations

Department of Geography, Faculty of Social Sciences, Hong Kong Baptist University, Hong Kong, SAR, 999077, China
Qianqian Yang, Fan Wang, Xiaorui Zhang, Xiang Xiao, Meiyu Guo & Meng Gao
School of Geodesy and Geomatics, Wuhan University, Wuhan, Hubei, 430079, China
Qianqian Yang & Qiangqiang Yuan
Department of Atmospheric Sciences, Yonsei University, Seoul, 03722, Korea
Jhoon Kim & Yeseul Cho
Environmental Satellite Center, National Institute of Environmental Research, Incheon, 22689, Korea
Won-Jin Lee & Dong-Won Lee
Department of Computer Science, Faculty of Science, Hong Kong Baptist University, Hong Kong, SAR, 999077, China
Chenhong Zhou & Yike Guo
Department of Chemical and Biochemical Engineering, The University of Iowa, Iowa City, IA, 52242, USA
Gregory R. Carmichael

Contributions

M.G. designed the study, and Q.Y. conducted data analysis with help from J.K., Y.C., W.-J.L., D.-W.L., Q.Y., F.W., C.Z., X.Z., X.X., M.G., Y.G., and G.R.C. Q.Y. and M.G. wrote the paper with inputs from all other authors.

Corresponding author

Correspondence to Meng Gao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, Q., Kim, J., Cho, Y. et al. A synchronized estimation of hourly surface concentrations of six criteria air pollutants with GEMS data. npj Clim Atmos Sci 6, 94 (2023). https://doi.org/10.1038/s41612-023-00407-1

Received: 23 February 2023
Accepted: 25 June 2023
Published: 18 July 2023
DOI: https://doi.org/10.1038/s41612-023-00407-1