Importance of ozone precursors information in modelling urban surface ozone variability using machine learning algorithm

Surface ozone (O3) is primarily formed through complex photo-chemical reactions in the atmosphere, which are non-linearly dependent on precursors. Although many recent studies have explored the potential of machine learning (ML) in modeling surface ozone, the inclusion of the limited available ozone precursor information has received little attention. The ML algorithm with in-situ NO information and meteorology explains 87% (R2 = 0.87) of the ozone variability over Munich, a German metropolitan area, which is 15% higher than an ML algorithm that considers only meteorology. The ML algorithm trained for the urban measurement station in Munich can also explain the ozone variability of the other three stations in the same city, with R2 = 0.88, 0.91, 0.63. The same model also robustly explains the ozone variability of measurement stations in two other German cities (Berlin and Hamburg), with R2 ranging from 0.72 to 0.84, giving confidence in applying the ML algorithm trained for one location to other locations with sparse ozone measurements. The inclusion of satellite O3 precursor information has little effect on the ML model's performance.

www.nature.com/scientificreports/
... are trained using a wide range of meteorological variables, many of which drive photo-chemical processes [30-36]. The variability of surface ozone is well explained by the ML algorithm with meteorological information alone [37-39]. Temperature is identified as a key factor in explaining ozone variability in the ML model [40]. Temperature is also a driver of biogenic VOC emissions (a precursor to O3) in addition to being a driver of photo-chemical processes [7,8]. In the NOx saturated regime, ozone production is directly proportional to VOC emission (and thus to temperature), but in the NOx limited regime, the ozone dependency shifts from VOC to NOx [41]. Given that many urban areas are currently in a NOx saturated regime, it is reasonable to expect that an ML algorithm trained solely on meteorology will be able to explain ozone variability. After a transition to a NOx limited regime, however, an ML algorithm trained solely on meteorology may fail to reproduce the surface ozone variability. Previous studies have also shown that the ozone response to temperature has been decreasing in recent years, as urban regions transition to a NOx limited regime [42,43]. However, only a few studies have focused on the inclusion of precursor information in the ML model [33,34,36].
In-situ VOC and O3 measurements are scarce compared to NOx measurements, and all are even scarcer in rural areas. Satellite data are becoming an indispensable tool for analyzing urban and rural air quality due to their increasing spatial resolution and coverage, but they are column retrievals. Since stratospheric ozone is highly variable, total column ozone retrievals from satellites are unsuitable for studying surface ozone. Satellites do, however, retrieve the ozone precursors NO2 and HCHO (formaldehyde), which can be used to study surface ozone chemistry [44-46]. Because HCHO is an intermediate gas product of VOC oxidation, it can be used as a proxy for VOC emissions. As chemical transport models (CTMs) resolve the physical-chemical processes, whereas ML algorithms do not, a hybrid modelling approach that incorporates the CTM prediction as a predictor variable in the ML model may improve the performance [47]. To this end, the objectives of this study are: (1) to investigate the importance of the limited available (in-situ and satellite) ozone precursor information and coarse CTM ozone simulations in modeling urban surface ozone variability using an ML algorithm; and (2) to investigate the transfer-ability of the ML model, i.e., how well the ML algorithm trained for one location explains ozone variability in other locations. The ultimate goal of these two objectives is to build confidence in modeling the surface ozone variability of locations with sparse or no ozone measurements and in filling data gaps.

Study region, datasets and model
This study focuses on Munich, a southern German metropolitan area where air pollutants are currently measured at five different locations. Given the long-term availability of data for all pollutants, we chose an urban measurement station (Lothstrasse), which continuously measured O3, NO2, NO, and CO from 2001 to 2017, to train and test the ML model. We also used data (2003 to 2017) from three other stations in Munich (Johanneskirchen-suburban, Allach-suburban, and Stachus-urban) to assess the transfer-ability of the ML model. We further tested the ML model's transfer-ability using data (2015 to 2019) from measurement stations in other German cities, including Berlin (Neukollen-urban, Wedding-urban, and Buch-suburban) and Hamburg (Bramfeld-suburban, Neugraben-suburban, and Sternschanze-urban). The geographical locations of the three German metropolitan areas (Munich, Berlin and Hamburg) and their monitoring stations considered in this study are shown in Fig. S1.
Meteorological variables (temperature, boundary layer height, relative humidity, wind speed and wind direction) are obtained from the ERA5 reanalysis dataset, with spatial and temporal resolutions of 0.25° and 1 h, respectively [48]. Surface ozone simulations from the CAMS (Copernicus Atmosphere Monitoring Service) global reanalysis dataset (EAC4), which has a spatial resolution of 0.75° and a temporal resolution of 3 h, are also obtained from the CAMS data store.
The tropospheric column NO2 and HCHO data from the OMI (Ozone Monitoring Instrument) on the NASA Aura satellite are also used [49]. OMI data have a spatial resolution of 13 × 24 km and a daily temporal resolution. The OMI local overpass occurs between 1 p.m. and 2 p.m. OMI data are available beginning in October 2004. Before use, we filtered the OMI data to include only retrievals with no processing errors, less than 10% snow or ice cover, a solar zenith angle of less than 80° for NO2 (70° for HCHO), and a cloud radiance fraction of less than 0.5. In the end, we had only 689 days of OMI data out of 4809 days (October 2004 to December 2017) for the "Lothstrasse" station.
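The screening criteria above can be sketched as a simple per-retrieval filter. This is only an illustration: the record layout and the field names (`processing_error`, `snow_ice_fraction`, `solar_zenith_angle`, `cloud_radiance_fraction`) are assumptions made here and are not the variable names of the OMI products, and the sample records are made up.

```python
from dataclasses import dataclass

# Solar zenith angle limits (degrees) quoted in the text, per species.
SZA_LIMIT = {"NO2": 80.0, "HCHO": 70.0}


@dataclass
class OmiPixel:
    """One retrieval; field names are hypothetical, not OMI product names."""
    species: str                    # "NO2" or "HCHO"
    processing_error: bool          # True if the retrieval was flagged
    snow_ice_fraction: float        # 0..1
    solar_zenith_angle: float       # degrees
    cloud_radiance_fraction: float  # 0..1


def passes_quality_filter(p: OmiPixel) -> bool:
    """Apply the four screening criteria described in the text."""
    return (
        not p.processing_error
        and p.snow_ice_fraction < 0.10
        and p.solar_zenith_angle < SZA_LIMIT[p.species]
        and p.cloud_radiance_fraction < 0.5
    )


# Made-up records, for illustration only.
pixels = [
    OmiPixel("NO2", False, 0.0, 75.0, 0.2),   # kept
    OmiPixel("HCHO", False, 0.0, 75.0, 0.2),  # rejected: SZA not below 70 for HCHO
    OmiPixel("NO2", False, 0.0, 40.0, 0.7),   # rejected: cloud radiance fraction too high
    OmiPixel("NO2", True, 0.0, 40.0, 0.1),    # rejected: processing error
]
kept = [p for p in pixels if passes_quality_filter(p)]
```

Strict screening of this kind is one reason why only a small share of days can survive, as with the 689 of 4809 days reported for the "Lothstrasse" station.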
The Extreme Gradient Boosting (XGBoost) algorithm, a supervised, tree-based gradient boosting ML algorithm [50], is used in this study to model surface ozone concentrations. Since our objective is to investigate the importance of precursor information in surface ozone modeling using ML, the ML algorithm we choose should be interpretable. A tree-based ML algorithm, such as XGBoost, is more interpretable than neural networks, which are typically black box systems, and also achieves higher interpretability than simple linear regression algorithms (high-bias algorithm) [51]. We train the XGBoost ML algorithm with different predictor categories or combinations of predictor categories (Table 1), and then compare its performance in terms of correlation (R2) and root mean square error (RMSE). The predictor categories are broadly classified into meteorology (temperature, relative humidity, boundary layer height, wind speed and wind direction), in-situ ozone precursors (NO, NO2 and CO), satellite ozone precursors (column NO2 and HCHO) and CTM simulations (CAMS model surface O3). Additionally, we consider two more predictors (day of the week and season), which we include in the meteorology category. The hyper-parameters of the XGBoost algorithm (such as the number of gradient boosted trees, learning rate, and maximum depth of a tree) are tested using the grid search function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), and we find that the XGBoost algorithm is not sensitive to hyper-parameters in this study. Therefore, the hyper-parameters were set to their default values (https://xgboost.readthedocs.io/en/latest/parameter.html).
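One way to organise the predictor categories is a small lookup that expands a category-level simulation name into its list of predictors. The sketch below assumes the naming pattern of the simulation labels used in the results (for example "ML_met_ds_insitu"); the column names are hypothetical labels chosen here, not names from the authors' dataset, and the helper `predictors_for` is illustrative only.

```python
# Predictor categories as described in the text; the column names are
# hypothetical labels, not names from the authors' dataset.
CATEGORIES = {
    "met": ["temperature", "relative_humidity", "boundary_layer_height",
            "wind_speed", "wind_direction"],
    "ds": ["day_of_week", "season"],
    "insitu": ["no", "no2", "co"],
    "satellite": ["omi_no2_column", "omi_hcho_column"],
    "cams": ["cams_surface_o3"],
}


def predictors_for(simulation_name):
    """Expand a name such as 'ML_met_ds_insitu' into its predictor columns."""
    keys = simulation_name.split("_")[1:]  # drop the leading 'ML'
    columns = []
    for key in keys:
        columns.extend(CATEGORIES[key])  # KeyError for unknown categories
    return columns


features = predictors_for("ML_met_ds_insitu")
```

The lookup only covers category-level names; variable-level simulations such as "ML_s_rh_t_wd_blh_no" would need their own mapping.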
We also discuss the predictor variable (feature) importance in the ML model using the results derived from the sklearn Python library's "feature_importance" function, which calculates feature importance by taking the average gain across all splits (https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html). For this study, we focus on the afternoon

Performance of ML model in predicting the urban surface ozone. For the "Lothstrasse" station in Munich, all in-situ measurements, meteorological variables and CAMS data are available for 5375 days from 2001 to 2017. We divided the 5375 days of measurements into two parts: the first 3800 days (70%) for training and the remaining 1575 days (30%) for testing the ML predictions. k-fold cross validation (CV) is used to evaluate the performance of the ML model for different combinations of training and testing data. Here we choose k = 10, i.e., the 5375 days of data are split into 10 parts. To avoid spurious correlation between training and test datasets, we adopted a block sampling approach [52]. The first nine parts are used to train the ML algorithm and the final one to test the ML model; this process is repeated for the remaining combinations, ten times in total. The mean of R2 derived from the k(10)-fold cross validation is then computed. The ML algorithm trained solely on meteorology ("ML_met") explains 77% of the variance (R2 = 0.77) in measured O3, with an RMSE of 16 µg m−3 (Fig. 1a). The mean R2 of k(10)-fold CV is 0.77. Wind speed and wind direction have low importance in the fitted model compared to the other meteorological variables (relative humidity, boundary layer height, and temperature) (Fig. S2). In addition, including the day of the week and season in the training dataset ("ML_met_ds") improves the ML model's performance (R2 = 0.81, RMSE = 14.6 µg m−3 and mean R2 of k(10)-fold CV = 0.80) (Fig. 1b). This performance improvement could be attributed to the pronounced seasonal cycle of ozone and weekday-weekend differences. Ozone reaches its maximum in summer and minimum in winter, and because the region is in a NOx saturated regime, weekend ozone levels are higher than weekday levels [13]. The ML algorithms trained solely with CAMS ("ML_cams") or in-situ precursors ("ML_insitu") perform worse on all metrics than the ML algorithm trained with the meteorology category alone ("ML_met_ds") (Fig. 1c,d).
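The two evaluation schemes described above, a chronological 70%/30% split and a 10-fold cross validation on blocks, can be sketched with plain index arithmetic. The source cites a block sampling approach without giving details, so the use of contiguous, unshuffled blocks here is an assumption, and the function names are illustrative.

```python
def chronological_split(n_samples, n_train):
    """First n_train days for training, the rest for testing."""
    indices = list(range(n_samples))
    return indices[:n_train], indices[n_train:]


def blocked_kfold(n_samples, k):
    """Yield (train, test) index lists using k contiguous, unshuffled blocks."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread the remainder over the first `extra` blocks.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Sizes taken from the text: 5375 days, 3800 of them for training.
train_idx, test_idx = chronological_split(5375, 3800)
folds = list(blocked_kfold(5375, 10))
```

Keeping each test block contiguous in time means that neighbouring, strongly autocorrelated days mostly fall on the same side of the split, which is the motivation the text gives for block sampling.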
The ML algorithm trained with the meteorology and in-situ precursor categories ("ML_met_ds_insitu") performs better than "ML_met_ds", with R2 and RMSE of about 0.87 and 12 µg m−3, respectively (Fig. 1e). The scatter of O3 predicted by "ML_met_ds" is largely reduced in "ML_met_ds_insitu", resulting in a lower RMSE. The mean R2 of k(10)-fold CV is 0.88, which is a 15% increase over "ML_met". The most important feature in "ML_met_ds_insitu" is the in-situ NO measurement, followed by boundary layer height, temperature, and relative humidity. The improvement in performance of "ML_met_ds_insitu" is thus due to the inclusion of NO measurements in the model. The addition of CAMS O3 simulations to meteorology and in-situ precursors ("ML_met_ds_insitu_cams") further improves the model performance (R2 = 0.89, RMSE = 10.9 µg m−3 and mean R2 of k(10)-fold CV = 0.9), which is slightly higher than that of "ML_met_ds_insitu" (Fig. 1f), with CAMS O3 simulations being the most important feature (Fig. S2). The feature importance calculated using the permutation approach (https://christophm.github.io/interpretable-ml-book/feature-importance.html) and SHAP values (https://christophm.github.io/interpretable-ml-book/shap.html) agrees with the feature importance calculated using each feature's gain. For example, Fig. S3 shows the feature importance calculated based on permutation and SHAP values for the case "ML_met_ds_insitu". We also performed a similar analysis using the Random Forest (RF) ML algorithm with a 70%/30% (training/testing) split of the 5375-day dataset (Table S1). Compared to "ML_met_ds" in the RF model simulations, the performance of "ML_met_ds_insitu" is improved on all metrics. This supports our earlier finding that including in-situ precursor information is not redundant when modeling surface ozone with an ML model.
Fig. 2.
The performance of each ML simulation for the case with fewer days (689 days) at the Lothstrasse station is shown in the last three columns.
For 689 days between 2001 and 2017, all in-situ and satellite ozone precursor information, meteorological variables and CAMS data are available. As before, we use the first 70% of the data (480 days) for training and the remaining 30% (209 days) for testing the model. We also performed the k(10)-fold CV for the 689-day dataset. The performance of the ML algorithm trained with meteorology and satellite precursors ("ML_met_ds_satellite") is, however, equal to the performance of the ML algorithm trained with meteorology alone (Fig. 2a-c). This implies that including satellite ozone precursor data has little effect on model performance. In terms of mean R2 of k(10)-fold CV, the ML algorithm with meteorology, satellite precursors, and the CAMS category provides slightly better results. However, it performs worse than the ML algorithm trained with meteorology, in-situ precursors, and the CAMS category. The performance difference between the ML model with a high (5375) and a low (689) number of days is marginal. In all cases, the performance of the ML model with fewer days (689 days) is slightly worse than the performance of the ML model with 5375 days for training and testing (Fig. 2a-c). To see how the availability of training data affects performance, we train and test "ML_met_ds_insitu" with varying percentages of data for the 5375 days case (Fig. S4). The difference between the different training/testing combinations is also marginal; the 80%/20% (training/testing) split performs slightly better than the 20%/80% split (lower RMSE by 1.5 µg m−3 and higher R2 by 0.03). However, in this case, 20% of the data equates to nearly three years of data, which may be sufficient for the ML model to capture all ozone variability.
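The statement that 20% of the data corresponds to nearly three years can be checked with a few lines of arithmetic. The sketch below assumes a chronological split and converts a count of daily samples to years by dividing by 365.25; because the 5375 days are days with complete data rather than consecutive calendar days, the result is an equivalent length, not a calendar span. The helper names are illustrative.

```python
N_DAYS = 5375  # days with complete data at the training station (from the text)


def split_sizes(n_days, train_fraction):
    """Return (n_train, n_test) for a chronological split."""
    n_train = round(n_days * train_fraction)
    return n_train, n_days - n_train


def as_years(n_days):
    """Convert a count of daily samples to an equivalent number of years."""
    return n_days / 365.25


for fraction in (0.2, 0.8):
    n_train, n_test = split_sizes(N_DAYS, fraction)
    print(f"train {n_train} days (~{as_years(n_train):.1f} yr), test {n_test} days")
```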
We investigated the sensitivity of the ML model to each predictor variable. This is done by excluding one predictor variable at a time from "ML_met_ds_insitu" (Table S2). Temperature is an important feature in the fitted model: when temperature is excluded from "ML_met_ds_insitu", the RMSE increases by 1.9 µg m−3 and the R2 decreases by 0.04 compared to the case with all variables included. Furthermore, in each case where a variable such as season, relative humidity, wind direction, boundary layer height, or in-situ NO is excluded, the RMSE increases and the R2 decreases. There are no changes in RMSE and R2 when the day of the week or wind speed is removed. When in-situ NO2 or CO is removed, the RMSE decreases in comparison to "ML_met_ds_insitu", indicating that the model is over-fitted when these variables are included. Therefore, we train the ML algorithm only with season, relative humidity, temperature, wind direction, boundary layer height and in-situ NO ("ML_s_rh_t_wd_blh_no"), which shows slightly better performance, with the RMSE decreasing by 0.4 µg m−3 compared to "ML_met_ds_insitu". Figure S5 depicts a time series plot of ground-truth vs modeled surface ozone concentrations, demonstrating the ML model's ability to capture complex ozone variability ranging from daily to seasonal variation.
ML model's transfer-ability. First, we use the "ML_met_ds" trained for the "Lothstrasse" station (5375 days) to predict the ozone concentrations of the three other stations in Munich, two of which (Johanneskirchen, Allach) are sub-urban and one (Stachus) urban. Compared to the ground truth, the performance of "ML_met_ds" is better for the two sub-urban stations (R2 = 0.86, 0.81 and RMSE = 12.6, 15.1 µg m−3) than for the urban station (R2 = 0.5 and RMSE = 20.3 µg m−3) (Fig. S6).
The predictions are better on all metrics when we use "ML_s_rh_t_wd_blh_no" instead of "ML_met_ds", indicating that including precursor information plays an important role in explaining the ozone variability of other locations (Fig. 3). These findings also imply that ML algorithms trained on long-term data for urban stations are transferable not only to other urban stations, but also to sub-urban stations, which have different emission scenarios, such as low NOx. This could be because an ML algorithm trained on long-term data from urban stations can learn ozone variability for various emission scenarios (e.g., low emission activities such as public holidays and weekends). When CAMS is included with "ML_s_rh_t_wd_blh_no" ("ML_s_rh_t_wd_blh_no_cams"), the ML model shows slightly better performance (Fig. S7).

Similarly, we use the "ML_met_ds", "ML_s_rh_t_wd_blh_no" and "ML_s_rh_t_wd_blh_no_cams" trained for the "Lothstrasse" station to predict the ozone concentrations of two major German cities (three stations in each city) (Figs. 3, S6, S7). Here as well, the performance of "ML_s_rh_t_wd_blh_no" is better than that of "ML_met_ds" on all metrics, with R2 ranging from 0.72 to 0.84 and RMSE ranging from 13.1 to 17.2 µg m−3. With "ML_s_rh_t_wd_blh_no_cams", the performance is slightly better than with "ML_s_rh_t_wd_blh_no" in terms of R2 and RMSE. We also performed an ML simulation for the days that have OMI data for all nine stations in Munich, Berlin and Hamburg (Tables S3-S5). In all cases, "ML_met_ds_satellite" trained for the "Lothstrasse" station performs slightly better than "ML_met_ds" in predicting the ozone concentrations of other locations.
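The transfer test described in this section amounts to scoring one fixed model on data from stations it was not trained on. A minimal sketch follows, under several assumptions: `toy_model` is a stand-in for the trained model and is not the authors' XGBoost model, the station names and values are made up, and R2 is taken to be the coefficient of determination, which the source does not state explicitly.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def evaluate_transfer(predict, stations):
    """Score one fixed model on several stations without retraining."""
    scores = {}
    for name, (features, observed) in stations.items():
        predicted = [predict(row) for row in features]
        scores[name] = (r_squared(observed, predicted), rmse(observed, predicted))
    return scores


def toy_model(row):
    """Hypothetical stand-in for a model trained at one station."""
    return 2.0 * row["temperature"]


# Made-up stations and values (ozone in µg m−3), for illustration only.
stations = {
    "station_a": ([{"temperature": t} for t in (22.5, 27.5, 42.5, 47.5)],
                  [40.0, 60.0, 80.0, 100.0]),
    "station_b": ([{"temperature": t} for t in (20.0, 30.0)],
                  [40.0, 60.0]),
}
scores = evaluate_transfer(toy_model, stations)
```

Reporting both scores per station, rather than a pooled value, keeps location-dependent differences visible, such as the weaker result the text reports for the urban Stachus station.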

Discussion
In this study, the potential of a machine learning algorithm in simulating urban surface ozone has been demonstrated. Given that ozone is primarily produced by complex photo-chemical reactions in the atmosphere, the performance of the ML algorithm with meteorological information alone is promising; however, including precursor (NOx) information, particularly the NO concentration, further enhances the ML model's performance in predicting surface ozone. This could be because NO is an important scavenger of O3 in the urban environment. Due to the scarcity of measurements, we did not use information on another important in-situ ozone precursor (VOC) in this study, but instead used satellite column HCHO information in the ML model. The addition of satellite ozone precursor (column NO2, HCHO) information as a new feature has little effect on the ML model performance. This could be because satellite column NO2 and HCHO retrievals are less sensitive to surface emissions. Furthermore, the coarser resolution of satellite retrievals might limit their applicability. This study also reveals that an ML algorithm trained with O3, meteorology and precursor information (NO) for one location can be used to suitably model the surface ozone concentrations of different locations with sparse ozone measurements. However, the performance of the ML model varies by location because other factors also influence ozone production. Therefore, we advocate for additional research that focuses on specific campaigns that measure all other factors influencing ozone formation (such as VOC emissions and aerosol load) and uses an ML model to simulate the ozone variability of other locations.