An integrated analysis of air pollution and meteorological conditions in Jakarta

Air pollution and climate change are general problems for society. This paper proposes an integrated analysis of the Air Quality Index (AQI) and meteorological conditions in Jakarta. The column-based data integration model is applied to create integrated data of the Air Quality Index and meteorological conditions. The integrated data is then used to generate a causal graph using the PC algorithm. The causal graph reveals that there exist causal relationships between pollutants and meteorological conditions, e.g., humidity, rainfall, wind speed, and duration of sunshine affect particulate matter 10 (PM10); wind speed affects sulfur dioxide (SO2); temperature affects ozone (O3). The historical data records that the average wind speed has decreased and the number of unhealthy days has risen. Ozone and particulate matter are the two pollutants that mainly drive poor air quality in Jakarta. The integrated data is also used to train Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models for forecasting. Experimental results show that LSTM using integrated data produces smaller errors for forecasting AQI and meteorological conditions.

www.nature.com/scientificreports/ meteorological conditions using a causal learning approach. It implements the PC algorithm to generate a causal graph from a dataset. The causal graph is then used to analyze the cause and effect relationships among variables. The proposed method is useful to analyze the linkage of air pollution and meteorological conditions in Jakarta.
The integrated data is also applied to train models for forecasting. This paper implements Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) to forecast AQI and meteorological conditions. The research contribution is an integration model that analyzes the dependency relationships among variables and predicts the future values of AQI and meteorological conditions, with Jakarta as the case study.

Methods
Air Quality Index (AQI). The Ministry of Environment and Forestry in the Republic of Indonesia measures the Air Quality Index (AQI) using equation (1), where I, I_a, I_b, L_a, L_b, and L_x represent the AQI score, upper limit AQI, lower limit AQI, upper limit ambient concentration, lower limit ambient concentration, and measured real ambient concentration, respectively 24 . Air Quality Index (AQI) standard values are categorized as good (≤50), moderate (51-100), unhealthy (101-200), very unhealthy (201-300), and hazardous (≥301).
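As a minimal sketch of how such a score can be computed, the Python snippet below assumes that equation (1) is the usual linear interpolation between band limits, I = (I_a − I_b) / (L_a − L_b) × (L_x − L_b) + I_b, which is consistent with the symbol definitions above. The function names and the band limits in the example are hypothetical and chosen only for illustration; they are not the official breakpoints.

```python
def aqi_score(l_x, l_b, l_a, i_b, i_a):
    """Linearly interpolate an AQI score I from a measured concentration L_x.

    l_b, l_a: lower/upper ambient-concentration limits of the band containing l_x
    i_b, i_a: lower/upper AQI limits of that band
    """
    if not l_b <= l_x <= l_a:
        raise ValueError("l_x must lie inside [l_b, l_a]")
    return (i_a - i_b) / (l_a - l_b) * (l_x - l_b) + i_b


def aqi_category(score):
    """Map an AQI score to the categories listed in the text."""
    if score <= 50:
        return "good"
    if score <= 100:
        return "moderate"
    if score <= 200:
        return "unhealthy"
    if score <= 300:
        return "very unhealthy"
    return "hazardous"


# Hypothetical band: concentrations 50-150 map to AQI 51-100 (illustrative only).
score = aqi_score(l_x=100.0, l_b=50.0, l_a=150.0, i_b=51.0, i_a=100.0)
print(round(score, 2), aqi_category(score))
```

With these illustrative limits, a concentration halfway through the band lands halfway through the corresponding AQI range.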

PC algorithm.
A causal graph is a graphical model that represents cause and effect relationships among variables. Assume that causal information between variables can be represented by a directed acyclic graph (DAG) where the nodes represent random variables and the edges represent direct causal effects [25][26][27][28] . Each causal DAG implies a set of conditional independence relationships 25 . A simple graph A → B (i.e., A is a parent of B) represents that A is a direct cause of B. A is a (possibly indirect) cause of B only if there is a directed path from A to B (A is an ancestor of B). One of the algorithms for learning a causal graph from a dataset is the PC algorithm 26,28,29 .
The PC algorithm applies conditional independence tests to generate a causal graph from a dataset 26 . Suppose E, ρ, α, n, and Φ(·) denote the separation set, the partial correlation, the significance level, the number of samples, and the cumulative distribution function (cdf) of N(0, 1), respectively. Equation (2) can be used to compute a conditional independence test for Gaussian data 30,31 . It tests the question 'is a variable D_u conditionally independent of D_v given D_E?' The correlation coefficient of two random variables X and Y is ρ_XY = σ_XY / (σ_X σ_Y), where σ_XY is the covariance of X and Y, and σ_X and σ_Y are their standard deviations 32 . The partial correlation can be computed from the correlation matrix using Eq. (3), where A, B, and C are random variables 33 .
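The test can be illustrated with a short Python sketch. It assumes the commonly used forms of these two equations: the first-order partial correlation ρ_AB|C = (ρ_AB − ρ_AC ρ_BC) / sqrt((1 − ρ_AC²)(1 − ρ_BC²)), and a Fisher z-test that declares independence when sqrt(n − |E| − 3) |z(ρ)| ≤ Φ⁻¹(1 − α/2). The function names and the correlation values are hypothetical.

```python
import math
from statistics import NormalDist


def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of A and B given C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))


def is_independent(rho, n, cond_size, alpha=0.05):
    """Fisher z-test: True if the (partial) correlation rho is not significant.

    n is the number of samples and cond_size is the size of the separation set.
    """
    z = 0.5 * math.log((1 + rho) / (1 - rho))      # Fisher z-transform
    stat = math.sqrt(n - cond_size - 3) * abs(z)
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical pairwise correlations among three variables A, B, and C.
rho = partial_corr(r_ab=0.5, r_ac=0.6, r_bc=0.7)
print(round(rho, 4))
print(is_independent(rho, n=100, cond_size=1))    # small sample
print(is_independent(rho, n=4383, cond_size=1))   # sample size of the integrated data
```

The same partial correlation can be judged independent in a small sample and dependent in a large one, because the test statistic grows with the square root of the sample size.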
In general, the PC algorithm has two main steps: generating the graph skeleton and orienting the edges 34 . Suppose a dataset consists of v variables. The first step starts from a complete undirected network consisting of v vertices. Conditional independence tests are then run for every triplet of vertices, and an edge is removed when its two endpoints are found to be conditionally independent. The output of the first step is a skeleton. The information from the conditional independence tests in the first step is used to orient the edges. The output of the PC algorithm is a graph represented by a Completed Partially Directed Acyclic Graph (CPDAG) 30 . The PC algorithm can be used to learn causal graphs under the assumption that there are no latent variables in the dataset.
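The skeleton step described above can be sketched in plain Python. This is an illustrative simplification under stated assumptions, not the bnlearn implementation used in this paper: it takes a correlation matrix as input, computes higher-order partial correlations by recursion, and uses a Fisher z-test for Gaussian data. The names `pcorr` and `skeleton` and the example matrix are hypothetical, and edge orientation is omitted.

```python
import math
from itertools import combinations
from statistics import NormalDist


def pcorr(c, i, j, cond):
    """Partial correlation of variables i and j given the tuple cond (recursive)."""
    if not cond:
        return c[i][j]
    k, rest = cond[0], cond[1:]
    r_ij, r_ik, r_jk = pcorr(c, i, j, rest), pcorr(c, i, k, rest), pcorr(c, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))


def skeleton(c, n, alpha=0.05):
    """Skeleton phase: start from a complete graph and delete independent pairs.

    Assumes no correlation is exactly +1 or -1 (the z-transform would diverge).
    """
    v = len(c)
    adj = {i: set(range(v)) - {i} for i in range(v)}   # complete undirected graph
    sepset = {}
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    size = 0                                           # size of the conditioning set
    while any(len(adj[i]) - 1 >= size for i in adj):
        for i in range(v):
            for j in sorted(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), size):
                    rho = pcorr(c, i, j, cond)
                    z = 0.5 * math.log((1 + rho) / (1 - rho))
                    if math.sqrt(n - size - 3) * abs(z) <= crit:
                        adj[i].discard(j)              # i and j are independent given cond
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(cond)
                        break
        size += 1
    return adj, sepset


# Hypothetical correlations consistent with a chain 0 -> 1 -> 2.
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.5],
        [0.3, 0.5, 1.0]]
adj, sepset = skeleton(corr, n=1000)
print(adj, sepset)
```

In this example the edge between variables 0 and 2 is removed because they are independent given variable 1, and that separation set is what the orientation step would later consult.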
Long short-term memory (LSTM). Long short-term memory (LSTM) is an efficient gradient-based method 35,36 . LSTM refers to a standard recurrent neural network (RNN) that has long-term memory and short-term memory. Suppose ζ, X_t, S̃_t, S_{t−1}, S_t, •, and O_t denote the sigmoid function, the preprocessed data, the new state of the memory cell, the previous state of the memory cell, the final state of the memory cell, the Hadamard product, and the final output of the memory unit, respectively. Let i_t, f_t, and o_t be the outputs of the different gates, with coefficient matrices such as U^(c). The mathematical models related to the LSTM memory unit are defined by Eqs. (4-9) 37 . LSTM networks work well for making predictions based on time series data [38][39][40][41][42] .
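To make the roles of the gates and states concrete, the following Python sketch runs one memory unit over a short sequence. It assumes the standard LSTM formulation with scalar inputs and states, so the Hadamard products reduce to ordinary multiplication; it may differ in detail from Eqs. (4-9). All weight names and values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, out_prev, s_prev, p):
    """One step of a scalar LSTM memory unit (standard formulation, assumed)."""
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * out_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * out_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * out_prev + p["b_o"])      # output gate
    s_new = math.tanh(p["W_c"] * x_t + p["U_c"] * out_prev + p["b_c"])  # new state
    s_t = f_t * s_prev + i_t * s_new     # final state mixes previous and new state
    out_t = o_t * math.tanh(s_t)         # final output of the memory unit
    return out_t, s_t


# Hypothetical weights, chosen only for illustration.
params = {"W_i": 0.5, "U_i": 0.1, "b_i": 0.0,
          "W_f": 0.4, "U_f": 0.2, "b_f": 1.0,
          "W_o": 0.3, "U_o": 0.1, "b_o": 0.0,
          "W_c": 0.8, "U_c": 0.3, "b_c": 0.0}

out, state = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:                # a short preprocessed input sequence
    out, state = lstm_step(x, out, state, params)
print(out, state)
```

The forget gate controls how much of the previous state survives, which is what lets the unit carry information across many time steps.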
The proposed method. This paper proposes an integrated analysis of air pollution data and meteorological conditions to analyze the air quality in Jakarta. The proposed method is illustrated in Fig. 1. The stages of the proposed method are data integration, causal graph generation, and forecasting. The integration of meteorological data and AQI data uses column-based integration. The datasets are time series data with numerical values. The idea of data integration has been used to learn simultaneously from multiple data sources 50,51 . The integration requires not only the same date for each sample but also the same number of samples from all sources. In this paper, the integrated data is a single table containing variables from the meteorological data and the AQI data. This data is then used as input for generating a causal graph and for forecasting. A causal graph is generated using the PC algorithm. LSTM and GRU are implemented for forecasting. This paper uses the PC algorithm from the bnlearn R package 52,53 . The causal graph is generated in RStudio. LSTM and GRU are implemented with TensorFlow Keras. The forecasting is run in Jupyter Notebook for Python.
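Column-based integration can be pictured as a join on the date key. The Python sketch below is a minimal illustration, assuming each source is held as a mapping from an ISO date string to its variables; the function name and the sample values are hypothetical and are not taken from the datasets.

```python
def integrate_by_date(aqi_rows, met_rows):
    """Join two {date: {variable: value}} tables column-wise on their shared dates."""
    shared = sorted(set(aqi_rows) & set(met_rows))   # ISO dates sort chronologically
    return [{"date": d, **aqi_rows[d], **met_rows[d]} for d in shared]


# Hypothetical daily records from the two sources.
aqi = {
    "2021-01-01": {"PM10": 48, "O3": 72},
    "2021-01-02": {"PM10": 55, "O3": 80},
}
met = {
    "2021-01-02": {"temperature": 28.4, "humidity": 78},
    "2021-01-03": {"temperature": 27.9, "humidity": 82},
}

table = integrate_by_date(aqi, met)
print(table)
```

Only dates present in both sources survive the join, which matches the requirement that every sample has the same date and that all sources contribute the same number of samples.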
LSTM and GRU are implemented to forecast AQI and meteorological conditions. The LSTM and GRU models consist of stacked layers with 128 and 64 units, a dropout layer, and a dense layer. Both models are trained for 50 epochs and use the Softmax activation function. This paper uses multivariate forecasting. The experiments use integrated and non-integrated data. The letters i and p indicate that the algorithm is applied to integrated data and non-integrated data, respectively. A non-integrated dataset is either the AQI dataset or the meteorological conditions dataset on its own. An integrated dataset contains both AQI and meteorological conditions, obtained from the data integration process. This paper runs multivariate forecasting in three different scenarios:
• Experiment 1 using training set from 2010 to 2018 and testing set from 2019.
• Experiment 2 using training set from 2010 to 2019 and testing set from 2020.
• Experiment 3 using training set from 2010 to 2020 and testing set from 2021.
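The three scenarios share one pattern: train on all years up to a cut-off and test on the following year. A minimal Python sketch of that split, together with a sliding-window helper for multivariate forecasting, is given below. The helper names, the `lookback` parameter, and the sample rows are assumptions for illustration; the window length used in the experiments is not stated here.

```python
def split_by_year(rows, last_train_year):
    """Split (date, features) rows into a training set and a one-year testing set.

    rows: list of ('YYYY-MM-DD', feature_list) tuples ordered by date.
    """
    train = [r for r in rows if int(r[0][:4]) <= last_train_year]
    test = [r for r in rows if int(r[0][:4]) == last_train_year + 1]
    return train, test


def make_windows(values, lookback):
    """Turn a multivariate series into (input window, next-step target) pairs."""
    return [(values[t - lookback:t], values[t]) for t in range(lookback, len(values))]


# Hypothetical rows with two variables per day.
rows = [
    ("2018-12-30", [1.0, 10.0]),
    ("2018-12-31", [2.0, 20.0]),
    ("2019-01-01", [3.0, 30.0]),
    ("2019-01-02", [4.0, 40.0]),
    ("2020-01-01", [5.0, 50.0]),
]

train, test = split_by_year(rows, last_train_year=2018)   # Experiment 1 pattern
windows = make_windows([features for _, features in train], lookback=1)
print(len(train), len(test), windows)
```

Calling `split_by_year` with 2019 or 2020 as the cut-off reproduces the pattern of Experiments 2 and 3.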

Results and discussion
The datasets contain less than 5% missing values. Each missing value is filled with the average of the observed values of that variable over the 7 days before the date in question. After the preprocessing phase, column-based integration is applied to create a single table from the AQI and meteorological condition datasets. The integrated data is used to generate a causal graph and to train models for forecasting.
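The imputation rule can be sketched as follows in Python. This is one possible reading of the rule: missing entries are marked as `None`, already filled values count as observations for later gaps, and a gap with no earlier data is left unfilled. Those choices, the function name, and the sample series are assumptions.

```python
def fill_missing(series, window=7):
    """Replace None entries with the mean of the values in the preceding `window` days."""
    filled = list(series)
    for t, value in enumerate(filled):
        if value is None:
            recent = [v for v in filled[max(0, t - window):t] if v is not None]
            if recent:                       # leave the gap if nothing precedes it
                filled[t] = sum(recent) / len(recent)
    return filled


# Hypothetical daily values for one variable, with two gaps.
daily = [60, 62, None, 64, 58, 61, 63, 59, None]
print(fill_missing(daily))
```

The first gap is filled from the two days available before it, and the second from the full seven preceding days.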
Causality analysis. This paper examines the dependence relationships between air pollutants represented by AQI and meteorological conditions. A graph is generated from the integrated data of AQI and meteorological conditions from 2010 to 2021. The dataset consists of 4383 samples and 10 variables (temperature, humidity, rainfall, sunshine, wind speed, PM10, SO2, CO, O3, and NO2). PM2.5 is not included in the experiments because its samples are only available from 2021. Figure 2 shows a causal graph generated using the PC algorithm at a significance level of α = 0.05. The findings from the graph are explained as follows.
• Humidity, rainfall, and duration of sunshine are causal parameters for PM10. These findings correspond to previous studies. Humidity influences the natural deposition process of PM; moisture particles adhere to PM and accumulate atmospheric PM concentration 9 . Increasing humidity reduces PM10 concentrations in the atmosphere because moisture particles grow in size to a point where 'dry deposition' happens. PM10 is continually reduced as humidity rises 10 . Precipitation has a certain wet scavenging effect on PM2.5 and PM10 11 . Precipitation scavenging refers to the cleaning of gases and particles by cloud and precipitation elements. A study of ambient air quality in Jakarta found that the concentration of suspended particulate matter decreases in the wet season (October-March) and increases in the dry season (April-September) because rainfall removes the pollutant from the atmosphere 54 .
• CO has a dependent relationship to humidity. A previous study shows that higher humidity has a negative effect on the adsorption of carbon monoxide 55 .
The correlation coefficient involving PM2.5 is obtained from the dataset of 2021. Table 1 shows the correlation coefficient between each pair of variables, computed using Pearson correlation. Longer sunshine duration is associated with higher temperature, lower humidity, and lower rainfall. Temperature and duration of sunshine have a positive correlation with PM10 and PM2.5: higher PM concentration is associated with higher temperature. Humidity has a negative correlation with PM10 and PM2.5, which suggests that increasing humidity is one possible way to decrease PM concentration. Higher rainfall increases humidity, so weather modification to create artificial rain is useful for decreasing PM concentration. Humidity and CO have a positive correlation. Meanwhile, wind speed and SO2 have a negative correlation. The annual average AQI from 2010 to 2021 is illustrated in Fig. 3A.
The highest exposure to O3 happened in 2012. Figure 3B shows the monthly average AQI in Jakarta from 2010 to 2021. The top three air pollutants are O3, PM2.5, and PM10. The AQI score of O3 is always higher than 50, and it exceeds 100 in October and November, which is categorized as an unhealthy condition.
Sunrise and sunset times in Jakarta do not differ significantly throughout the year because the city lies at a latitude of 6°12′52.63″ S and a longitude of 106°50′42.47″ E. The length of daylight is nearly the same every day, so the duration of sunshine is mostly affected by clouds. In the last 10 years, low average rainfall has occurred from May to September, and the lowest is around 1.8 mm in August. Meanwhile, the longest average sunshine durations occur in August, September, and October at 5.7, 6.4, and 5.3 hours, respectively. From June to October the average level of PM10 is high, above 73, and the highest is 76.79 in August. The lowest average of PM10 is 50.24 in January. This finding is close to the previous study 54 , which states that the highest concentration of PM10 occurred in September 2015 and the lowest in February 2017. In 2021, the two highest average AQI values for PM2.5 are 80.56 in June and 86.32 in July. In May and October, the average temperature is around 29.1 °C, which is higher than the overall average temperature of 28.49 °C. The average humidity during July-October is around 70-74%. Since 2015, the average wind speed has been around 1 m/s lower than in 2010. In 2021, the correlation between wind speed and PM2.5, ρ(wind speed, PM2.5), is −0.32. The decrease in wind speed contributes to increasing PM2.5. Wind speed and SO2 have a negative correlation, so decreasing wind speed raises SO2. A positive correlation of 0.6 is obtained between SO2 and NO2, indicating that the concentrations of those pollutants rise together. O3 has a positive correlation with PM10 and PM2.5.
Figure 2. A causal graph generated from AQI and meteorological conditions.
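For reference, the Pearson correlation used for these pairwise comparisons can be computed as in the short Python sketch below. The function follows the definition ρ_XY = σ_XY / (σ_X σ_Y); the wind speed and PM values are hypothetical and serve only to show a negative relationship.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)   # the 1/n factors cancel


# Hypothetical daily values: higher wind speed alongside lower PM readings.
wind_speed = [1.0, 2.0, 3.0, 4.0]
pm = [80.0, 70.0, 65.0, 50.0]

r = pearson(wind_speed, pm)
print(round(r, 3))
```

A coefficient near −1 indicates that the two series move in opposite directions; a value such as −0.32 indicates a weaker negative association.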
The historical data and forecasting models. A record of the number of unhealthy (U) and very unhealthy (VU) days in the years 2010-2021 is presented in Table 2. The historical data shows that O3 is the pollutant that most often causes unhealthy and very unhealthy days. There are 108 days on which two pollutants have AQI scores over 100 on the same day, but only 22 days were labeled as very unhealthy because the label considers only the pollutant with the highest AQI score on that day. In 2020, on three consecutive days, three pollutants together (SO2, O3, and NO2) had AQI scores of more than 100, and those days are categorized as unhealthy. Further study is needed for cases in which more than two pollutants have AQI scores over 100 in a day. Conditions may be more hazardous when the concentrations of multiple pollutants reach the unhealthy limit at the same time, so the categories of air pollution levels need to be evaluated. Previous studies reveal various effects of the pollutants. Ambient temperature increased the acute cardiovascular-respiratory mortality effects of PM2.5 60 . Exposure to PM10, NO2, and O3 generates a relative risk to human health 61 . Inhaling O3 possibly leads to acute lung function changes and inflammation 62 . PM2.5 may contribute to the development of diabetes mellitus, increase cardiopulmonary morbidity and mortality, and cause adverse birth outcomes 63 . Epidemiological evidence shows that PM2.5 damages the human respiratory system 64 . Accumulated exposure to low concentrations of carbon monoxide can affect a number of organ systems 65 .
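The difference between labeling a day by its worst pollutant and counting days with several pollutants above 100 can be sketched in Python as follows. The thresholds follow the AQI categories given in the Methods section; the function name and the daily scores are hypothetical.

```python
def summarize_days(daily_scores):
    """Count U and VU days by the worst pollutant, and days with 2+ pollutants over 100.

    daily_scores: list of {pollutant: AQI score} dicts, one per day.
    """
    unhealthy = very_unhealthy = multi_exceed = 0
    for scores in daily_scores:
        worst = max(scores.values())            # the day's label uses only this score
        if 101 <= worst <= 200:
            unhealthy += 1
        elif 201 <= worst <= 300:
            very_unhealthy += 1
        if sum(1 for s in scores.values() if s > 100) >= 2:
            multi_exceed += 1                   # several pollutants exceed 100 together
    return {"U": unhealthy, "VU": very_unhealthy, "multi": multi_exceed}


# Hypothetical scores for four days.
days = [
    {"O3": 120, "PM10": 80},
    {"O3": 150, "SO2": 110, "NO2": 105},
    {"O3": 210, "PM10": 130},
    {"O3": 60, "PM10": 40},
]
print(summarize_days(days))
```

The second day is labeled unhealthy even though three pollutants exceed 100, which is the kind of case the text flags for further study.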
The actual data and forecasting of AQI from 2010 to 2021 are shown in Fig. 4A and B, respectively. The performance of LSTM and GRU is evaluated using MAE and RMSE. According to the experimental results, LSTM using integrated data produces the smaller error. In general, LSTM and GRU show good performance in forecasting PM10, CO, and O3.
The actual data and forecasting of meteorological conditions are shown in Fig. 5A and B, respectively. LSTM and GRU work well for forecasting temperature, humidity, sunshine duration, and wind speed. However, they are less accurate in predicting rainfall.
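The two error measures named above, mean absolute error (MAE) and root mean squared error (RMSE), can be computed as in this small Python sketch. The actual and predicted values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical AQI values for three days.
actual = [50.0, 60.0, 70.0]
predicted = [52.0, 57.0, 70.0]

print(round(mae(actual, predicted), 4), round(rmse(actual, predicted), 4))
```

RMSE is never smaller than MAE on the same data, so a large gap between the two points to a few large forecasting misses.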
The two highest annual average AQI values of PM10 were 76.59 in 2011 and 78.21 in 2013. The AQI of SO2 rose consistently, to around three times its 2010 level. The AQI of CO increased from 2010 to 2017, but it decreased from 2020 to 2021. The AQI of O3 was also rising, and the highest was in 2012-2013. Figure 6 shows the values of MAE and RMSE for the forecasting results. LSTM using integrated data produces smaller errors.
In general, the forecasting results of AQI data from 2020 to 2021 have higher errors than those from 2019. It is suspected that major restrictions on some activities during the Covid-19 outbreak influenced that condition; for instance, national or local lockdowns reduced the use of motor vehicles, which decreased the CO level. There was a huge increase in SO2 and NO2 from September 2020 to January 2021, but the reason is unknown and needs further investigation. Compared with another study that forecasts the observed variables using non-integrated data 66 , forecasting using integrated data produces slightly lower MAE and RMSE.
The findings in this paper are expected to enrich the knowledge of the linkage between air pollution and climate change. This contribution is beneficial for determining the proper handling of air pollution and climate change problems.

Conclusion
In conclusion, the integrated analysis successfully discovers the linkage between air pollution and meteorological conditions in Jakarta. The integrated data is used to generate a causal graph and to train models for forecasting. The causal graph shows that there exist dependence relationships between AQI and meteorological conditions. This information is beneficial for handling air pollution and climate change. LSTM and GRU work well as models for forecasting PM10, CO, O3, temperature, humidity, sunshine duration, and wind speed. However, those models are less accurate in predicting SO2, NO2, and rainfall. LSTM using integrated data produces a smaller error. The forecasting results of air pollution before the Covid-19 outbreak are more accurate. The Covid-19 outbreak influenced human activities in ways that probably affected air quality, e.g., decreasing CO and increasing NO2 and SO2. Future work is to implement a machine learning approach for an integrated analysis that finds the connection of population growth, industries, human activities, and air pollution to climate change in Indonesia.

Data availability
The datasets are available from the corresponding author on reasonable request.