Introduction

The chemical processes in the urban atmospheres of the Himalayan foothills have strong potential to impact regional air quality, agriculture, and therefore the economy1,2,3. In addition, the build-up of climate-forcing pollution in the Himalayan region can have irreversible effects on the hydrological cycle and global climate1,2,4,5,6,7. The atmospheric dynamics above the Himalaya also form the crossroads of the so-called “Atmospheric Brown Clouds” to the Tibetan Plateau8. The recent increase in extreme weather events that trigger calamities also indicates an intensifying interplay between increasing pollution and meteorology over the fragile ecosystem of the Himalaya9,10,11,12.

The enhanced concentrations of ozone (O3) and other climate-forcing pollutants in the Himalayan foothills are attributed to unprecedented growth in population and urbanization13,14,15,16. Intense forest fires, diverse natural factors, and the topography also play vital roles in the build-up of trace gases and aerosols here5,11,15,17,18,19. The Himalayan atmosphere is particularly influenced by the most densely populated region of the world, the Indo-Gangetic Plain (IGP)20,21. The IGP is a global hotspot of elevated O3 and aerosol loading due to strong anthropogenic emissions and intense crop-residue burning under favorable meteorological conditions22,23,24,25,26,27. The emissions and photochemistry in the IGP affect the Himalayan atmosphere in particular through mountain meteorology and boundary layer dynamics20,21,28. Potential climate warming combined with a future increase in emissions can further intensify the atmospheric chemistry over this part of the world29,30,31.

Against this background, measurements and modeling studies have been conducted to assess the effects of diverse emissions, photochemistry, and dynamics on atmospheric composition over the Himalayan region8,15,17,32,33. The concentrations of O3 and its precursors were found to be enhanced during the pre-monsoon (spring) and post-monsoon (autumn) seasons due to regional pollution supplemented by biomass burning, intense solar radiation, and less precipitation15,20,21,34,35. Long-term measurements of atmospheric composition and meteorological parameters, however, remain lacking over the Himalayan foothills in India, which are experiencing severe air quality problems and extreme weather events. Studies to fill this gap are of paramount significance since chemistry-climate models also show greater biases in reproducing the already sparse measurements over the Himalayan region20,34,36,37. The stronger biases are suggested to be mainly due to the limitation of models in resolving the highly complex topography of the Himalaya and its foothills5,19,20,37,38. The uncertainties in the emission inventories and in the parameterizations of physical and chemical processes also increase the biases in the models19,37,39,40,41. Besides the higher biases, conventional models also need intensive computing resources, which further limits high-resolution simulations.

In the current era, artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternative tools for modeling in various fields, including Earth system science42,43,44,45. Recent studies have utilized AI/ML modeling in the analysis of extreme weather events and in the prediction of oceanic phenomena as well as atmospheric composition46,47,48. These studies have shown that ML models trained with data from observations or physical models can produce reliable simulations without intensive high-end computing. Nevertheless, the applications of AI/ML to simulate complex atmospheric chemistry remain limited. Considering the scientific and societal implications, the lack of measurements, and the limitations of conventional models over the Himalayan region, the objectives of this study are as follows:

  1. To explore the potential of ML modeling for simulating urban O3 variability.
  2. To study the effects of meteorological and chemical variables on model performance.
  3. To assess the effect of the data fraction used in the training on model performance.

The study region, datasets, and modeling are described in the “Methodology” section. Model simulations and results are presented in the “Model simulations and results” section, followed by the “Discussion” section.

Methodology

Study region and datasets

The study is focused on the urban O3 chemistry over the Doon valley of the Himalaya. We initiated in situ O3 measurements using an online O3 analyser manufactured by Environnement S.A, France (model O3 42) at a representative station, the Graphic Era deemed to be University campus (77.99° E, 30.27° N, 600 m above mean sea level). The observations are based on the absorption of UV light by O3, and the instrumental uncertainty is about 5%49. Continuous measurements have been conducted since April 2018, except during February–August 2020 when the laboratory was closed due to the severe impact of the COVID-19 pandemic. Further details of these O3 measurements are presented in an earlier study15.

Auxiliary datasets used in training the ML model include the meteorological and chemical reanalyses from the ECMWF (European Centre for Medium-Range Weather Forecasts). The meteorological parameters (temperature, humidity, horizontal winds, and boundary layer height, BLH) are taken from ERA-Interim50, whereas the chemical species (O3 and the key precursors CO, NO, and NO2) are taken from the CAMS (Copernicus Atmosphere Monitoring Service) reanalysis51. ERA-Interim and CAMS products have been analyzed in diverse studies, including over the Indian region15,52,53,54. The CAMS data have been shown to reproduce the day-to-day variability in noontime O3 over the study region15. Here, we have utilized the reanalysis data for the 2003–2019 period, focusing on the noontime variations (06:00 GMT; 11:30 local time), when the urban O3 photochemistry is most intense.

Machine learning model

This study utilizes the XGBoost (Extreme Gradient Boosting) ML algorithm55 to simulate the O3 variations. Because O3 is modeled as a target that depends on meteorological parameters and precursor gases, this modeling falls under supervised learning. In the gradient boosting algorithm, a prediction model is developed in the form of an ensemble of weak prediction systems, i.e., decision trees. The model is built in a stage-wise manner and is generalized by allowing optimization of an arbitrary differentiable loss function (e.g., squared error). Further details of XGBoost can be found elsewhere55. The method adopted to build and evaluate the model is shown as a flow chart in Fig. 1.
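The stage-wise idea can be illustrated with a toy example. The sketch below is not XGBoost and not the configuration used in this study; it assumes a single predictor, depth-1 trees (stumps), a squared-error loss, and hypothetical data, and only shows how each new weak learner is fitted to the residuals left by the ensemble built so far.

```python
# Toy stage-wise gradient boosting for one predictor with squared-error loss.
# Illustrative only; this is not the XGBoost implementation.

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value cannot split the data
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_rounds, learning_rate=0.3):
    """Build the ensemble one stump at a time, each fitted to current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    out = []
    for xi in x:
        total = sum(lm if xi <= t else rm for t, lm, rm in model["stumps"])
        out.append(model["base"] + model["learning_rate"] * total)
    return out


def rmse(y, predictions):
    return (sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)) ** 0.5


# Hypothetical data: a smooth non-linear response to a single predictor.
x = [i / 10 for i in range(40)]
y = [20 + 3 * xi ** 2 for xi in x]
errors = {n: rmse(y, predict(boost(x, y, n), x)) for n in (0, 5, 50)}
```

In this toy setting the training error in `errors` shrinks as rounds are added, which is why a stopping rule on independent data is needed to avoid overfitting.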

Figure 1. Flow chart of the steps in building the ML model for simulation of urban O3 variations and evaluation.

Hyperparameters have been varied iteratively, following a trial-and-error method, to achieve better prediction. The parameters were fine-tuned using the grid search function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The values of the hyperparameters set in the model are given in the supplementary material (Table S1). Other hyperparameters were kept at their default values (https://xgboost.readthedocs.io/en/latest/parameter.html). To avoid overfitting, the iterations are aborted once they cease to improve the fit further, i.e., when there is no reduction in the RMSE (root mean square error) over 100 iterations. The model performance in simulating O3 variations has been evaluated by estimating the squared correlation coefficient (r2), the slope of the linear fit, and the RMSE.
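The evaluation statistics and the stopping rule described above can be written out explicitly. The following is a minimal sketch in plain Python, not the code used in the study: the function names are hypothetical, the slope is assumed to come from regressing the simulated values on the observed ones (the text does not state the orientation), and the stopping rule is a simplified version of a patience-based criterion.

```python
import math


def evaluate(observed, simulated):
    """Return r2, slope of the linear fit, RMSE, and mean bias (same units as input)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_sim = sum(simulated) / n
    s_oo = sum((o - mean_obs) ** 2 for o in observed)
    s_ss = sum((s - mean_sim) ** 2 for s in simulated)
    s_os = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(observed, simulated))
    return {
        "r2": s_os ** 2 / (s_oo * s_ss),   # squared Pearson correlation
        "slope": s_os / s_oo,              # assumption: simulated regressed on observed
        "rmse": math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n),
        "mean_bias": mean_sim - mean_obs,
    }


def best_round(validation_rmse, patience=100):
    """Index of the best iteration, stopping once RMSE has not improved for `patience` rounds."""
    best_index, best_value = 0, math.inf
    for i, value in enumerate(validation_rmse):
        if value < best_value:
            best_index, best_value = i, value
        elif i - best_index >= patience:
            break
    return best_index


# Hypothetical noontime O3 values in ppbv.
stats = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 22.0, 32.0, 42.0])
```

A constant offset, as in the hypothetical values above, leaves r2 and the slope at 1 while the RMSE and mean bias equal the offset, which is why the three statistics are reported together.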

Model simulations and results

A series of simulations has been performed in this study, as summarized in Table 1. These simulations and the evaluation of model performance are discussed in the following subsections.

Table 1 Different ML model simulations performed in the study.

Simulation utilizing in-situ O3 measurements

In the first simulation, ML_obs_O3_met_prec, the ML model has been trained using the observational data of O3 and the reanalysis data of meteorological parameters (met) and precursors (prec). The analysis is focused on the variations in noontime (11:30 local time) O3. The data from April 2018 to April 2019 (number of days N = 222), which is 50% of the total available data, have been used for training the ML model. The model simulation is evaluated against the remaining independent observations for the April–December 2019 period (N = 223 days). Figure 2 shows the correlation between the ML model simulation and the in-situ measurements of noontime O3 over the Doon valley for the April–December 2019 period. The ML model is found to successfully reproduce the temporal variability in the noontime O3, with an r2 value of 0.75 (p < 0.01) and an RMSE value of 10 ppbv. The estimated bias in the ML model result is significantly lower than the bias values reported for global and regional atmospheric models over this region15,34,37. The result suggests that, in the absence of high-resolution measurements, ML modeling can be combined with reanalysis and limited in-situ data over the Himalayan region. The strong correlation between the ML model and the in-situ measurements further opens up the possibility of utilizing such simulations for assessing the impacts of O3 on agriculture and health in this region.

Figure 2. Correlation between measurements and model (ML_obs_O3_met_prec) simulated variations in noontime O3 over Doon valley during April–December 2019. Solid blue line shows the linear regression fit and dashed lines show the 99% confidence intervals.

Simulations utilizing long-term CAMS O3

In-situ measurements are largely unavailable in the Indian Himalayan region, and their temporal coverage is also very limited. In view of this, we include the long-term CAMS data to assess the potential and performance of ML modeling in more depth. Given the availability of long-term data, we train the ML model with noontime (11:30 local time) CAMS O3 and reanalysis meteorology for 2003–2015 (70% of the total data). This leaves a significant fraction (30%) of the total data, covering the 2015–2019 period, available for the evaluation. The simplest simulation is ML_cams_O3, in which the model is trained only with the O3 time series without including any additional parameter. This simulation is found to predict the independent O3 variations with an r2 value of 0.47 and an RMSE of 11.6 ppbv (Fig. 3). This result is a manifestation of the periodicity that the seasonal cycle in India embeds in the O3 data.
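A chronological split of this kind keeps the evaluation period strictly after the training period. The sketch below is one way to express it, under stated assumptions: `chronological_split` is a hypothetical helper, the records are (date, value) pairs with placeholder values, and the cut is taken by record count rather than by calendar year.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction):
    """Split (date, value) records so that all training dates precede all test dates."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)  # number of records used for training
    return ordered[:cut], ordered[cut:]


# Hypothetical daily series; the values are placeholders, not measurements.
start = date(2018, 4, 1)
records = [(start + timedelta(days=i), float(i % 7)) for i in range(445)]
records.reverse()  # the helper does not rely on the input order

train, test = chronological_split(records, 0.5)
```

With 445 hypothetical records and a fraction of 0.5, the helper returns 222 training and 223 evaluation records. Splitting in time order, rather than at random, avoids evaluating the model on days that are interleaved with, and strongly autocorrelated to, its training days.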

Figure 3. Scatter plot and Taylor’s diagram evaluating the ML model simulations of noontime O3 variations as compared to the CAMS reanalysis.

The relative effect of including variations in the meteorological parameters versus the major precursors (CO, NO, NO2) has been evaluated by performing additional simulations (Table 1, Fig. 3). The model trained with O3 and meteorology (ML_cams_O3_met) reproduces the independent O3 variations with an r2 value of 0.71 and a slope value of 0.65. Another simulation (ML_cams_O3_prec), in which the ML model is trained with O3 and precursors but not with the meteorology, shows similar or slightly improved performance (r2 = 0.74, slope = 0.79, p < 0.01). The inter-comparison of these two simulations suggests that reasonable predictions of urban O3 variability can be made with ML models trained with either the meteorological or the precursor dataset. This is important because this region lacks comprehensive datasets, especially of the precursors, and in such cases the meteorological datasets can be used to predict O3. Further, to explore the potential of the ML approach, we performed another simulation, ML_cams_O3_met_prec, in which both meteorology and precursors have been included in the model. This led to a significant improvement in the model performance, with an r2 value as high as 0.86 and a slope value of 0.91. For this simulation, the RMSE value also drops drastically to 6 ppbv and the mean bias is also smaller (~ 3 ppbv). An important finding is that when the meteorological and chemical datasets are combined, the model’s ability to predict outliers improves drastically, which is of major significance in air quality assessments.

A comparison of r2 values among all these simulations (numbered 2–5 in Table 1) suggests that ~ 47% of the O3 variations can be explained (r2 = 0.47 in ML_cams_O3) by the periodicities that the seasonal cycle embeds in the data. As precursors and meteorology act in tandem, the higher r2 values (~ 0.7) in simulations trained with either meteorology or precursors suggest that an additional ~ 25% of the O3 variability can be attributed to changes in meteorology or precursor levels. Meteorology plus major precursors could explain ~ 86% (r2 = 0.86 in ML_cams_O3_met_prec) of the variations in the urban O3. The remaining variability could be due to diverse unaccounted factors such as deposition, vertical transport, and volatile organic compounds. The analysis suggests that ML simulations can provide deep insights into the relative importance of the physical and chemical processes affecting air quality.
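The arithmetic behind these percentages can be laid out directly from the r2 values reported in the text. The short sketch below only restates those numbers; the variable names are hypothetical, and treating differences in r2 between separately trained models as attributable fractions is a rough approximation rather than a formal variance decomposition.

```python
# r2 values as reported in the text for simulations 2-5.
r2 = {
    "ML_cams_O3": 0.47,
    "ML_cams_O3_met": 0.71,
    "ML_cams_O3_prec": 0.74,
    "ML_cams_O3_met_prec": 0.86,
}

seasonal_share = r2["ML_cams_O3"]  # periodicity from the seasonal cycle alone

# Gain over the seasonal baseline when one extra family of predictors is added.
single_family_gain = {
    name: r2[name] - seasonal_share
    for name in ("ML_cams_O3_met", "ML_cams_O3_prec")
}

# Share left unexplained by the most complete simulation.
unexplained_share = 1.0 - r2["ML_cams_O3_met_prec"]
```

The two single-family gains come out at 0.24 and 0.27, consistent with the quoted additional ~ 25%, and the unexplained share is 0.14.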

The performance of the different simulations has been compiled in the form of a Taylor’s diagram (Fig. 3). The figure includes statistics such as r, normalized RMSE, and normalized standard deviation (SD), where the normalization is done with respect to the SD of the reference (CAMS). The relative performance of the different simulations is assessed by comparing how close each simulation is to the reference point (CAMS). For ideal agreement, an ML simulation should coincide with the reference point (r = 1, normalized SD = 1, and normalized RMSE = 0). It is evident that the ML simulation that uses both meteorology and precursors (ML_cams_O3_met_prec) performed the best. Besides the stronger r value, a normalized SD value close to 1 suggests that the simulation produces a similar extent of variability as in the CAMS. On the other hand, the ML simulations using either meteorology or precursors had similar performance. Also, ML_cams_O3_prec produced more variability than the simulation using meteorological variations (ML_cams_O3_met), likely due to non-linearities in the chemistry.
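The three statistics summarized by a Taylor diagram are linked by a simple geometric identity, which is what allows them to share one plot. The sketch below computes them for a reference and a simulated series; it assumes the conventional Taylor-diagram definition in which the RMS difference is centered (means removed), which may differ from the exact RMSE definition used for the figure, and the function name and data are hypothetical.

```python
import math


def taylor_statistics(reference, simulated):
    """Correlation, normalized SD, and normalized centered RMS difference."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_sim = sum(simulated) / n
    sd_ref = math.sqrt(sum((v - mean_ref) ** 2 for v in reference) / n)
    sd_sim = math.sqrt(sum((v - mean_sim) ** 2 for v in simulated) / n)
    covariance = sum(
        (a - mean_ref) * (b - mean_sim) for a, b in zip(reference, simulated)
    ) / n
    centered_rmsd = math.sqrt(
        sum(((b - mean_sim) - (a - mean_ref)) ** 2 for a, b in zip(reference, simulated)) / n
    )
    return {
        "r": covariance / (sd_ref * sd_sim),
        "normalized_sd": sd_sim / sd_ref,
        "normalized_crmsd": centered_rmsd / sd_ref,
    }


# Hypothetical reference and simulated noontime O3 values in ppbv.
reference = [42.0, 55.0, 61.0, 38.0, 47.0, 70.0]
simulated = [45.0, 52.0, 58.0, 41.0, 50.0, 63.0]
stats = taylor_statistics(reference, simulated)

# Law-of-cosines identity underlying the diagram:
# crmsd^2 = 1 + sd^2 - 2 * sd * r (all normalized by the reference SD).
identity_rhs = 1 + stats["normalized_sd"] ** 2 - 2 * stats["normalized_sd"] * stats["r"]
```

Because of this identity, a point's distance from the reference point on the diagram is the normalized centered RMS difference, so closeness to the reference summarizes all three statistics at once.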

Effect of training data length

We further investigate the sensitivity of model performance to the fraction of available data used for the training. In this regard, a series of simulations has been performed using the best-performing model setup (ML_cams_O3_met_prec) with 20–95% of the data used for model training. Figure 4 shows the variations in r2 and RMSE values with the training data fraction. The analysis shows that the model performance is highly sensitive to the length of the data used in its training. The r2 value is found to increase significantly from about 0.6 to 0.87, and the RMSE decreases from ~ 11 to 6 ppbv, with an increase in the training data fraction. The analysis suggests that longer time-dependent datasets are highly desirable for optimizing the performance of ML models in predicting air quality variations. This underlines that long-term in situ measurements and validated chemistry-climate simulations can help in further exploiting the potential offered by the ML approach.
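The structure of such a sensitivity experiment can be sketched as a loop over training fractions. The example below is only a schematic: it uses a straight-line least-squares fit as a stand-in for the ML model, a synthetic series instead of CAMS data, and an assumed step of 15% between 20% and 95%; none of these choices come from the study.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares intercept and slope (stand-in for the ML model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def training_fraction_scan(x, y, percents):
    """Train on the first pct% of the series and report RMSE on the remainder."""
    results = []
    for pct in percents:
        cut = len(x) * pct // 100  # integer arithmetic avoids float rounding
        intercept, slope = fit_line(x[:cut], y[:cut])
        test_x, test_y = x[cut:], y[cut:]
        error = math.sqrt(
            sum((intercept + slope * a - b) ** 2 for a, b in zip(test_x, test_y))
            / len(test_y)
        )
        results.append((pct, cut, error))
    return results


# Synthetic series with a trend and a slow oscillation (placeholder data).
x = [float(i) for i in range(200)]
y = [30 + 0.1 * xi + 5 * math.sin(xi / 10) for xi in x]
scan = training_fraction_scan(x, y, range(20, 100, 15))
```

Each entry of `scan` holds the training percentage, the number of training records, and the RMSE on the held-out tail. Note that the evaluation set shrinks as the training fraction grows, so the statistics at high fractions rest on fewer independent points.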

Figure 4. Variation in r2 and RMSE with change in the percentage data used in training of ML model.

Discussion

Our study reveals the strong potential of ML modeling for computationally inexpensive simulations of urban O3 variability in the Himalayan foothills. The periodicity in O3 and meteorological parameters due to the systematic seasonal cycle over India allows the ML model to reproduce the data fairly well. In the absence of high-resolution measurements, ML simulations can be used to assess the impacts of O3 on health and agriculture in this region. Additionally, the series of simulations conducted here can serve as a reference for further applications of AI/ML-based modeling to complement conventional Earth system models. It should be noted, however, that the environment studied here is urban and the O3 variations are largely governed by regional photochemistry. The scenario could be very different for cleaner remote regions, where O3 variability is dominated by transport from upwind polluted regions or from higher altitudes. In this regard, we recommend establishing baseline stations to continuously monitor the atmospheric composition as well as the meteorology in order to exploit the full potential of ML modeling. Model performance is already promising with the inclusion of only meteorology; nevertheless, the inclusion of precursors enhances the model’s ability to capture outliers, which are critical in air quality assessments. Future studies may extend the scope to additional climate-forcing pollutants and to the feedbacks between pollution and meteorology that cause calamities in the fragile ecosystem of the Himalaya, which is experiencing strong anthropogenic pressure.