Introduction

Studies have attributed the increase in air pollution to rapid urban development and modernization1. Over the years, much emphasis has been placed on the analysis of PM2.5 (particulate matter with a diameter of 2.5 μm or less) and NO2 (nitrogen dioxide) because of their adverse effects on public health2,3,4 and global climate5,6 and, particularly for PM2.5, their long-range transport7. Because of these public health and environmental implications, countries and international organizations regulate and monitor PM2.5 and NO2 concentrations. For instance, the World Health Organization (WHO) air quality guidelines established in 2005 were revised and released on September 22, 2021. The new guidelines draw on decades of research showing that the health effects of air pollution result not only from high exposure but also from very low concentrations8. The 2005 guidelines recommended that the annual average PM2.5 and NO2 concentrations should not exceed 10 and 40 μg/m3 (21 ppb), respectively. The 2021 guidelines reduce these recommendations to 5 and 10 μg/m3 (5 ppb) for PM2.5 and NO2, respectively.

The analysis and monitoring of PM2.5 and NO2 are essential to assess the effectiveness of mitigation strategies and compliance with standards. Currently, monitoring stations reliably and accurately measure PM2.5 and NO2 concentrations. However, PM2.5 and NO2 are measured at only some locations because of the high costs of installing, maintaining, and managing monitoring stations. As a result, detailed information about the spatiotemporal distribution, trend, and climatic and temporal effects of PM2.5 and NO2 is often lacking in locations with few monitoring stations. Therefore, these locations need a multi-method approach that can predict PM2.5 and NO2 where they are not measured, generate fine-grained spatial distribution data, examine the changes in PM2.5 and NO2 concentrations over time, and explain the climatic and temporal effects on those concentrations. This information is critical for identifying areas (hotspots) that do not comply with international pollution concentration standards, quantitatively evaluating air quality policy, and assessing the risks to human health9. Furthermore, empirical models are required to describe the general features of the spatial patterns of PM2.5 and NO2, trends, and influencing factors10,11.

In Japan and many regions of the world, studies have focused on only one of the spatiotemporal distribution, trend, forecast, or influencing factors of PM2.5 and NO212,13. Studies have rarely combined these to generate a more holistic understanding that can help identify health risk areas, influencing factors, pollution trends, and the efficacy (or lack thereof) of policy interventions11. For instance, in Nagasaki Prefecture, most studies have focused on either the health impact caused by air pollutants14,15,16, the long-range transport of air pollutants from the Asian continent17,18,19, or the effects of climatic variables and the spatial and temporal distribution of PM2.520. Although these studies provide essential information, a multi-method approach is necessary to better understand the distribution, factors, and current and future trends of PM2.5 and NO2 pollution in Nagasaki Prefecture.

This study used a multi-method approach to analyze PM2.5 and NO2 data in Nagasaki Prefecture from 2013 to 2021. The study aims (1) to estimate PM2.5 and NO2 pollution variability in unmeasured areas using ordinary kriging, (2) to identify and analyze the correlation of the major climatic and temporal factors that influence PM2.5 and NO2 pollution in Nagasaki Prefecture via Pearson's correlation and random forest feature selection, and (3) to conduct a trend and forecast analysis of PM2.5 and NO2 based on loess fitting, automated auto-regressive integrated moving average (ARIMA) models, and error, trend, and seasonal (ETS) models. Using a multi-method approach, we provide a broader analysis and understanding of the spatiotemporal distribution, forecast, trend, and influencing factors of the pollutants, which is crucial for the improvement, development, and assessment of mitigation strategies and for identifying health risk areas. Furthermore, the proposed multi-method approach can be used in other regions where studies have focused on only one aspect of air pollution and where a holistic understanding of air pollution is lacking.

Policy background

As a result of the rapid economic development of the Northeast Asian sub-region and the resulting environmental problems, China, Japan and Korea have held a Tripartite Environment Ministers Meeting (TEMM) annually since 199921. This meeting aims to strengthen environmental cooperation among these countries and address environmental problems at the domestic, regional, and global levels. At the 15th Tripartite Environment Ministers Meeting in 2013, the Tripartite Policy Dialogue on Air Pollution (TPDAP) was established and started in 201421. The objective of the establishment of the TPDAP was to coordinate efforts among the three countries to address the air pollution problem by developing cooperation initiatives and sharing information about air pollution policy implementation and impacts. The 3rd TPDAP in 2016 established two working groups to share air pollution information (Fig. 1).

Figure 1

Tripartite Policy Dialogue on Air Pollution working groups one and two.

The establishment of the Tripartite Policy Dialogue on Air Pollution (TPDAP) has resulted in reducing air pollution in the three countries (Table 1).

Table 1 Significant policies implemented by China, Japan, and Korea to improve air quality.

Results

PM2.5 and NO2 spatiotemporal distribution

The highest annual mean PM2.5 concentrations occurred in 2014 and 2016, at 16.2 and 14 μg/m3, respectively. The lowest annual means occurred in 2020 and 2021, at 10.4 and 9.7 μg/m3, respectively (Table 2). The yearly minimum, maximum, and mean PM2.5 concentrations varied over the study period. These variations are also seasonal, with Winter and Spring having the highest PM2.5 concentrations. In every year of the study period, 2013–2021, Spring had the highest PM2.5 concentration (Table S2). Despite these variations, PM2.5 concentrations show a declining trend from 2013 to 2021.

Table 2 Annual descriptive statistics for PM2.5 and NO2.

Regarding the spatial distribution of PM2.5 concentration, there is also yearly variation as to the hotspot of PM2.5 (Fig. 2). However, to a great degree, the most affected area is the westernmost part of Nagasaki Prefecture, as illustrated by the PM2.5 spatial distribution maps of 2015, 2016, 2017, and 2018.

Figure 2

Spatiotemporal distribution of PM2.5 from 2013 to 2021 in Nagasaki Prefecture, Japan. Created with ArcMap 10.7 (https://www.arcgis.com/index.html).

Concerning NO2, the highest annual mean concentration, 6.3 ppb, occurred in 2013 and 2014. The lowest annual means occurred in 2020 and 2021, at 4.7 and 4.4 ppb, respectively (Table 2). The box plots indicated minimal temporal variation of NO2 concentration during the study period (Fig. 3). Winter and Spring are the seasons with the highest NO2 concentrations, with Winter having the highest concentration in every year except 2013, when Spring had the highest concentration (Table S2). The spatial distribution of NO2 indicated that the hotspots have remained in the same locations over the years (Fig. 3). The high concentrations of NO2 are located in Sasebo and Nagasaki, the two largest cities of Nagasaki Prefecture.

Figure 3

Spatiotemporal distribution of NO2 from 2013 to 2021 in Nagasaki Prefecture, Japan. Created with ArcMap 10.7 (https://www.arcgis.com/index.html).

Pearson's correlation and Random forest

Pearson's correlation results indicated that meteorological factors influence PM2.5 and NO2 concentrations in Nagasaki Prefecture (Fig. 4). In the case of PM2.5, the factors had weak positive, weak negative, or mixed correlations. The factors negatively correlated with PM2.5 were the southern oscillation index, average temperature, maximum temperature, minimum temperature, average humidity, and minimum humidity, with the southern oscillation index having the most substantial influence among these factors. On the other hand, minimum wind speed and sunlight time had a positive correlation. The other factors had mixed results, with some stations having positive, negative, or no correlation. For NO2, some factors had a strong positive or negative correlation, and others had a weak positive or negative correlation. Average temperature, maximum temperature, minimum temperature, average humidity, and minimum humidity had a strong negative correlation, while average local pressure and average sea level pressure had a strong positive correlation. On the other hand, the southern oscillation index, maximum 10-min precipitation, average wind speed, and sunlight time had a weak negative correlation, while maximum wind speed and maximum instantaneous wind speed had a weak positive correlation. The other factors had mixed results, with some stations having positive, negative, or no correlation.
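The per-station correlations summarized above can be illustrated with a minimal sketch of Pearson's coefficient. The function names `pearson_r` and `correlate_factors` and the input layout (one sequence per factor) are assumptions made for illustration; they are not taken from the study's own code.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # centre both series
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def correlate_factors(pollutant, factors):
    """Correlate one station's pollutant series with each factor.

    `factors` is a hypothetical dict mapping a factor name to its series.
    Returns one row of a station-by-factor correlation heatmap.
    """
    return {name: pearson_r(series, pollutant) for name, series in factors.items()}
```

Applying `correlate_factors` to each station in turn would give one row per station, which is the layout of a station-by-factor heatmap.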

Figure 4

Heatmap of the correlation between climatic factors and PM2.5 and NO2 air pollution data for 18 monitoring stations in Nagasaki Prefecture, Japan. * SOI = Southern Oscillation Index, LP = Average local pressure, SP = Average sea level pressure, Rain = Daily precipitation, Rain1h = Maximum 1-h precipitation, Rain10m = Maximum 10-min precipitation, Temp = Average temperature, Max.T = Maximum temperature, Min.T = Minimum temperature, Humid = Average humidity, Min.H = Minimum humidity, WindS = Average wind speed, MWS = Minimum wind speed, MIWS = Maximum instantaneous wind speed, SUN = Sunlight time.

The RF model feature selection results at Tsushima, Goto, Daitou, Inasa, Obama, and Yoshii for PM2.5 and NO2 are shown in Fig. 5. These stations were selected because their locations are representative of the study area. Among the monitoring stations, the most important predicting factors for PM2.5 were Spring, maximum instantaneous wind speed one day, maximum instantaneous wind speed, humidity, sunlight time one day, and southern oscillation index. At the observation towers of Tsushima, Goto, Inasa, and Yoshii, Spring is among the three major predicting factors of PM2.5, with Spring being the primary predictor at Goto. Maximum instantaneous wind speed was among the three most important predictors for Goto, Daitou, Inasa, and Yoshii and was the primary predictor at Tsushima. Maximum instantaneous wind speed was the main predictor for Daitou, Inasa, and Yoshii and the second most important predictor for Obama. Humidity was the main predictor at Obama and the third predictor at Goto. Southern oscillation index was a significant predictor only at Tsushima.

Figure 5

RF feature selection for PM2.5 and NO2 of different monitoring stations in Nagasaki Prefecture, Japan.

On the other hand, for NO2, the factors that ranked among the three most important predictors at one or more stations were average wind speed, minimum temperature one day, maximum instantaneous wind speed, average sea level pressure one day, average temperature one day, average sea level pressure three days, southern oscillation index seven days, average temperature, maximum temperature one day, and average local pressure one day. Average wind speed was among the three main predictors in Tsushima, Goto, Daitou, Inasa, and Yoshii, and the main predictor in Goto. Minimum temperature was the major predictor in Tsushima and Daitou. In Goto and Daitou, maximum instantaneous wind speed was the second and third major predictor, respectively. Sea level pressure one day was the main predictor in Inasa and Yoshii. Average temperature one day, average sea level pressure three days, southern oscillation index seven days, average temperature, maximum temperature one day, and average local pressure one day were each among the three most important factors in only one station. The feature selection results from RF indicated that the factors influencing PM2.5 and NO2 concentrations in Nagasaki Prefecture vary depending on the location of the monitoring station.

Tables 3 and 4 show the results of the random forest models for each of the 18 stations for PM2.5 and NO2, respectively. Model accuracy was evaluated using R2 and MSE. In the case of PM2.5, the result indicated that the accuracy estimates for the 18 stations are varied with R2 values in the range of 0.41–0.53 and MSE of 22.7–37.6 for the training dataset and R2 values ranging from 0.16 to 0.33 and MSE 32.7–51.3 for the test dataset. The low values of R2, high values for MSE, and the high difference of R2 between the trained model and the test model indicated that the RF models constructed with these factors could not be used to predict PM2.5 concentrations.

Table 3 Result of the random forest model for PM2.5 using hyperparameters of the optimum value.
Table 4 Result of the random forest model for NO2 using hyperparameters of the optimum value.

In the case of NO2, the R2 values were higher (Test: 0.354–0.735) than those of PM2.5; thus, the factors used in this study are better predictors of NO2 concentration than of PM2.5 concentration. However, the R2 values for most of the NO2 test models are still too low to be used to predict NO2 concentrations, except for the results of Isahaya, which can be considered acceptable.

Trend and Forecast analysis

At a 0.05 significance level, the Mann–Kendall test determined that PM2.5 and NO2 at most of the monitoring stations had a monotonic trend and a negative slope (Table 5). The stations with no monotonic trend for PM2.5 were Shimabara, Oomura, Kawadana, Togitsue, Matsuura Shisamachi, and Tsushima, and for NO2 were Yukiura, Tsushima, Iki, Obama, and Muramatsu. The stations with the largest magnitude of reduction for PM2.5 were Daitou (− 1.278 μg/m3), Fukuishi Jihai (− 1.178 μg/m3), and Kogakura (− 1.01 μg/m3), while Goto (− 0.43 μg/m3) had the smallest. For NO2, the largest magnitude of reduction was observed in Fukuishi Jihai (− 0.78 ppb), Higashi Nagasaki (− 0.50 ppb), and Kogakura (− 0.49 ppb), and the smallest in Matsuura Shisamachi (− 0.12 ppb). Figures S1 and S2 show the data decomposition and trend for six monitoring stations.
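The trend statistics reported above can be sketched as follows. This is a minimal illustration of the standard Mann–Kendall test (normal approximation with tie correction) and Sen's slope. Whether the study applied these to raw, monthly, or deseasonalized series is not stated here, so the input series and the function names are assumptions.

```python
import math
import numpy as np

def mann_kendall(series):
    """Return (S, Z, two-sided p-value) of the Mann-Kendall trend test."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    i, j = np.triu_indices(n, k=1)               # all pairs with i < j
    s = int(np.sign(x[j] - x[i]).sum())
    _, ties = np.unique(x, return_counts=True)   # tie-group sizes
    var_s = (n * (n - 1) * (2 * n + 5)
             - np.sum(ties * (ties - 1) * (2 * ties + 5))) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))       # equals 2 * (1 - Phi(|z|))
    return s, z, p

def sens_slope(series):
    """Median of all pairwise slopes (change per time step)."""
    x = np.asarray(series, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    return float(np.median((x[j] - x[i]) / (j - i)))
```

A station would be reported as having a monotonic trend when `p` is below 0.05, and the sign of `sens_slope` gives the direction of that trend.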

Table 5 PM2.5 and NO2 Sen's slope, Mann–Kendall test, and forecast results for the monitoring stations in Nagasaki Prefecture, Japan.

The results of the ETS, Holt-Winters, and ARIMA forecast analysis are presented in Table 5. Based on the MAPE and RMSE, model suitability for forecasting PM2.5 and NO2 varies depending on the location of the monitoring station. For PM2.5, ARIMA gave better results in Isahaya, Kawadana, Matsuura Shisamachi, Tsushima, Iki, Obama, and Inasa; ETS and Holt-Winters gave better results in the other monitoring stations. In the case of NO2, ETS and Holt-Winters gave better results in Daitou, Isahaya, Shimabara, Oomura, Matsuura Shisamachi, Tsushima, Iki, Goto, Inasa, Fukuishi Jihai, and Yoshii, with ARIMA providing better results in the other stations. For PM2.5, the highest MAPE for Holt-Winters was in Shimabara (23.566), and the lowest was in Higashi Nagasaki (11.98). For ETS, the highest MAPE was in Yoshii (26.59), and the lowest was in Muramatsu (14.25); for ARIMA, the highest MAPE was in Kawadana (23.99), and the lowest was in Tsushima (12.63). For NO2, the highest RMSE for Holt-Winters was in Iki (43.77), and the lowest was in Tsushima (10.18); for ETS, the highest was in Oomura (15.84), and the lowest was in Fukuishi Jihai (7.83); and for ARIMA, the highest was in Obama (32.75), and the lowest was in Kawadana (8.61).
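The comparison of forecast models by MAPE and RMSE can be sketched as below. The helper `best_model` and the dictionary of candidate forecasts are hypothetical; the sketch only shows how the two error measures are computed on held-out observations and how the lowest-error model could be selected per station.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100.0)

def rmse(actual, forecast):
    """Root mean square error, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def best_model(actual, forecasts, metric=mape):
    """Pick the candidate with the lowest error.

    `forecasts` is a hypothetical dict such as {"ARIMA": [...], "ETS": [...]}.
    Returns the winning name and the score of every candidate.
    """
    scores = {name: metric(actual, values) for name, values in forecasts.items()}
    return min(scores, key=scores.get), scores
```

Because MAPE is scale-free and RMSE is not, the two measures can rank models differently for the same station.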

Figure 6 shows the forecast results for PM2.5 and NO2 for six monitoring stations using the best model among ETS, Holt-Winters, and ARIMA (Table 5). PM2.5 for Tsushima, Inasa, Yoshii, and Obama was forecasted with ARIMA, while Goto and Daitou were forecasted with Holt-Winters. The forecast of PM2.5 produced by ARIMA in Inasa and Obama tends to converge to the mean. In Tsushima, the ARIMA model; in Goto, the Holt-Winters model; and in Daitou and Yoshii, the ETS model was able to replicate the trend and the seasonal components of the data for PM2.5. For NO2, Tsushima and Goto were forecasted with Holt-Winters, Obama was forecasted with ARIMA, and the other stations were forecasted with ETS. For NO2, ETS, Holt-Winters, and ARIMA were able to replicate the data's trend and seasonal components. However, the tendency of the forecast to converge towards the mean was not observed in the case of NO2.

Figure 6

Models and Forecast of PM2.5 and NO2 for different monitoring stations in Nagasaki Prefecture, Japan. Blue and red lines represent the WHO PM2.5 and NO2 recommendations for 2005 and 2021, respectively.

For PM2.5, in general, the forecast indicates a negative trend. However, the future concentration of PM2.5 will remain above the 2021 WHO recommendation (5 μg/m3). Also, in most stations, the future concentrations will be above or below the 2005 WHO recommendation (10 μg/m3), depending on the season, except for Daitou, where the future concentrations are forecast to decline below the 2005 recommendation. For NO2, the forecast shows a very slight declining trend for the majority of the stations. In Yoshii, the decline is more consistent than in the other stations. All the NO2 forecasts are below the 2005 WHO recommendation (21 ppb). Also, for most stations, the NO2 concentration will be below or above the 2021 WHO recommendation (5 ppb), depending on the season. Yoshii and Obama are exceptions, as their NO2 concentration levels are below the 2021 recommendation.

Discussion

Acknowledging that air pollution has adverse health effects, even at the lowest observed levels, is crucial for reconsidering current legislation and regulation. Thus, reducing the health impacts caused by the average annual exposure to NO2 and PM2.5 needs to be prioritized to address known inequities owing to economic activities, socioeconomic conditions, and increased vulnerability of the residential population8. Although regulation and legislation in Japan have effectively reduced PM2.5 and NO2 concentrations over the years, and Japan is considered one of the industrialized countries with low levels of these pollutants, the results indicate that the average annual concentrations of PM2.5 and NO2 exceeded the 2021 WHO pollution concentration guidelines. In particular, the PM2.5 annual average concentration exceeded both the 2005 and 2021 guidelines. The difficulty of regulating and reducing PM2.5 concentration in Nagasaki Prefecture is due to the long-range transport of PM2.5 from East Asia and Eurasia28.

As a result of the long-range transport characteristic of PM2.5, its spatial distribution and concentration vary throughout the study period as it is affected by climatic and temporal factors. For instance, Pearson's correlation and the random forest feature selection indicated that the most important factors influencing PM2.5 were Spring, maximum instantaneous wind speed, humidity, sunlight time one day, and southern oscillation index. In Spring, PM2.5 concentrations are higher than in other seasons (Table S2). This is due to the changes in meteorological conditions, especially wind direction, which affects the long-range transport of PM2.5 from East Asia20. Maximum instantaneous wind speed was negatively correlated in some stations, showing that horizontal dispersion plays a role in Nagasaki Prefecture. However, maximum wind speed was positively correlated, indicating that PM2.5 pollutants are being transported from other areas29. This result is further reinforced by Fig. 2, which suggests that from 2014 through 2021, the highest concentrations of PM2.5 are located in the westernmost part of Nagasaki Prefecture. Several studies have indicated that PM2.5 is transported to Nagasaki Prefecture from East Asia; thus, the proximity of the westernmost part of Nagasaki Prefecture to East Asia, its downwind location, and the change of wind direction in Spring are the main reasons for the high PM2.5 concentrations detected during the study period. The less affected areas are those located in the easternmost part, which is further away from East Asia. The wide distribution of PM2.5 and its spatial variability throughout the study period make it difficult to regulate and to identify specific hotspots. Its wide distribution is also a cause for concern as it has health implications for much of the resident population in Nagasaki Prefecture. However, during the study period, as indicated by Table 2, Figs. 2, 3, and 5, there has been a decline in PM2.5 concentrations.
This decline in PM2.5 in Nagasaki Prefecture is related to the decrease in PM2.5 concentrations in China and Korea. This reduction can be attributed to the changes in policy, technology, social, environmental, and economic factors in Japan, Korea, and China. For instance, the changes in environmental policies and the tri-national cooperation between these countries have generated positive results in reducing PM2.5 (see Section “Policy background” Table 1). Also, the restrictions on social and economic activities imposed due to the COVID-19 pandemic resulted in a notable reduction of PM2.5 in 2020 and 2021. Although PM2.5 shows a declining trend, better local and regional strategies are needed to reduce PM2.5 further as the pollution levels are above the WHO guidelines amidst local and tri-national efforts.

As for NO2, the average annual concentration is below the 2005 pollution guidelines. However, the results indicated that the hotspots identified are above the WHO 2021 pollution concentration guidelines. NO2 pollution concentrations are also influenced by climatic and temporal factors, as indicated by Pearson's correlation and random forest feature selection analysis. For NO2, average wind speed was negatively correlated due to the dilution and dispersion of pollutants. However, maximum wind speed and maximum instantaneous wind speed were positively correlated, which can be attributed to the notion that the NO2 plume is buoyant, but at higher wind speeds, the plume is brought down to ground level30. Temperature was negatively correlated with NO2; temperature is known to promote air convection, leading to pollution dispersion and dilution31. Average local pressure and average sea level pressure were positively correlated due to the low atmospheric boundary layer, which accompanies high pressure and prevents the vertical dispersion of air pollutants29. Sunlight time was negatively correlated with NO2; this can be attributed to photochemical reactions driven by solar radiation, which reduce NO2 concentration. Despite the influence of climatic factors on NO2, its spatial distribution remained constant throughout the study period with consistent hotspot areas, except for 2020, when the pollution concentration was the lowest and more dispersed, with no visible hotspot. From 2013 to 2019 and in 2021, hotspots were located in the major cities of Nagasaki Prefecture: Nagasaki, Sasebo, Isahaya, and Oomura. Nagasaki and Sasebo are the two largest cities in Nagasaki Prefecture and had the highest concentrations of NO2 throughout the study period; this is because of the economic activities in the area associated with shipbuilding, power plants, machinery, and heavy industries, and also the burning of fossil fuels, especially in the transport sector.
The lowest concentrations of NO2 were in 2020–2021, as indicated by Table 2 and Fig. 3; this remarkable reduction of NO2 can be attributed to the restrictions imposed by the Japanese government on social and economic activities due to the COVID-19 pandemic. The reduction of NO2 in 2018–2019 could be due to the decommissioning of the Ainoura Power Station, a crude oil-fired power plant. The more gradual decrease of NO2 during the study period, as indicated by the trend and forecast analysis, can be attributed, among other factors, to the stricter vehicle emission regulations implemented22 and also the regulation of emissions from stationary sources such as fossil fuel powerplants, electric and industrial boilers.

Pearson's correlation and the random forest feature selection identified major factors influencing PM2.5 and NO2 and provided a good indication of the complex relationship between the significant climatic and temporal factors and PM2.5 and NO2 pollutants in Nagasaki Prefecture. However, the results indicate that the correlations and the factors of importance that influence PM2.5 and NO2 vary depending on the monitoring station. The differences observed in correlation, factors of importance, trend, and model performance among the 18 stations can be attributed to the unique climatic, environmental, social, and economic characteristics of each location, which affect PM2.5 and NO2 concentrations. For instance, the major predictor of PM2.5 is Spring in Goto, whereas it is humidity in Obama. This difference can be attributed to the location of these monitoring stations. Goto is located in the westernmost part of Nagasaki Prefecture, which is the area most affected by the long-range transport of PM2.5 from East Asia in Spring, whereas Obama is located in the easternmost part of Nagasaki Prefecture, which is the least affected by the seasonal changes. Although RF was able to identify the major factors influencing PM2.5 and NO2, the model's prediction of PM2.5 and NO2 can be further improved by including not only climatic and temporal factors but also emission sources and factors related to human activities such as economic development, transportation, and energy utilization32. In the case of PM2.5, including emission sources and human activity factors from China and Korea can also improve the model's predictive capabilities.
Therefore, even though this study has generated valuable information on the spatiotemporal distribution, trend, influencing factors, and forecast of PM2.5 and NO2 in Nagasaki Prefecture, additional studies are needed to further evaluate the influence of social, environmental, economic, and technological factors on the spatiotemporal distribution and trend of PM2.5 and NO2 in Nagasaki Prefecture, and to assess the differences that exist among the monitoring stations (e.g., in trend and influencing factors).

Materials and methods

Study site

Nagasaki Prefecture is located on the island of Kyushu (Fig. 7). The prefecture has an area of approximately 4,105 km2 and a population of 1,377,187. Nagasaki borders Saga Prefecture on the east and is surrounded by the Tsushima Straits, the Ariake Bay, and the East China Sea. Air pollution in Nagasaki is relatively low but is influenced by transboundary air pollution from Asia and Eurasia28,33,34. Studies conducted in Nagasaki have demonstrated that air pollution has adverse health effects, especially in children14,16. Moreover, 8.3% of the population in Nagasaki Prefecture is 10 years of age or younger and 29.6% is 65 years of age or older; these groups are considered vulnerable to air pollutants35. Until March 2012, Nagasaki Prefecture had no PM2.5 monitoring stations; the first two stations to record PM2.5 concentrations were installed in Isahaya and Iki.

Figure 7

Study site and air pollution monitoring stations in Nagasaki Prefecture, Japan. Created with ArcMap 10.7 (https://www.arcgis.com/index.html).

Air pollution datasets

The monitoring station network in Nagasaki Prefecture has increased from four monitoring stations in March 2012 to 18 monitoring stations. The recorded data of these eighteen stations are available from the Nagasaki Prefecture Atmospheric Environment Information (http://www.pref.nagasaki.jp/). PM2.5 and NO2 data is collected daily at one-hour intervals, from which monthly and annual averages are calculated. Monitoring stations are located at municipal offices, elementary schools, and towers. We selected the 2013–2021 PM2.5 and NO2 datasets because measurements of these pollutants were collected at each of the 18 monitoring stations for each year of the study period. For this study, we calculated the monthly mean concentration of PM2.5 and NO2 for each of the 18 monitoring stations. In Japan, NO2 concentrations are given in parts per billion (ppb) as opposed to PM2.5 measurements, which are given in micrograms per cubic meter (μg/m3). Climatic data of Nagasaki from 2013 – 2021 were collected from the Japan Meteorological Agency website (Table 6).
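Because NO2 is reported in ppb while the WHO guidelines are stated in μg/m3, a short conversion sketch may help in relating the two. The reference conditions used here (25 °C and 1 atm, molar volume of about 24.45 L/mol) are an assumption; the conversion conditions used in the study are not stated.

```python
# Assumed reference conditions: 25 degrees C and 1 atm, where one mole of an
# ideal gas occupies about 24.45 L. Other conventions (e.g. 20 degrees C)
# give slightly different factors.
MOLAR_VOLUME_L_PER_MOL = 24.45
NO2_MOLAR_MASS_G_PER_MOL = 46.01

def ugm3_to_ppb(conc_ugm3, molar_mass=NO2_MOLAR_MASS_G_PER_MOL,
                molar_volume=MOLAR_VOLUME_L_PER_MOL):
    """Convert a gas concentration from micrograms per cubic metre to ppb."""
    return conc_ugm3 * molar_volume / molar_mass

def ppb_to_ugm3(conc_ppb, molar_mass=NO2_MOLAR_MASS_G_PER_MOL,
                molar_volume=MOLAR_VOLUME_L_PER_MOL):
    """Convert a gas concentration from ppb to micrograms per cubic metre."""
    return conc_ppb * molar_mass / molar_volume
```

Under these assumed conditions, the 2005 and 2021 NO2 guideline values of 40 and 10 μg/m3 correspond to roughly 21 and 5 ppb, which matches the rounded values quoted in the Introduction.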

Table 6 Parameters for the random forest model.

Data processing and ordinary kriging

For the 18 stations, we calculated the annual average for PM2.5 and NO2 from the daily data collected in each monitoring station from 2013 to 2021. This resulted in nine datasets for PM2.5 and nine for NO2, which were used to implement ordinary kriging to predict the spatiotemporal distribution. The dataset for each year was divided into four seasons, Spring: March to May, Summer: June to August, Autumn: September to November, and Winter: December to February, and summary statistics were calculated (Table S2).
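The aggregation described above can be sketched with pandas. The column names (`station`, `date`) and the function names are hypothetical, and the sketch assigns December to the calendar year in which it falls, which is an assumption because the text does not state how Winter was assigned across years.

```python
import pandas as pd

# Month-to-season mapping as defined in the text.
SEASON_BY_MONTH = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
    12: "Winter", 1: "Winter", 2: "Winter",
}

def add_year_and_season(df, date_col="date"):
    """Return a copy of df with 'year' and 'season' columns added."""
    out = df.copy()
    out["year"] = out[date_col].dt.year
    out["season"] = out[date_col].dt.month.map(SEASON_BY_MONTH)
    return out

def annual_station_means(df, value_col):
    """Station-by-year table of annual means (one column per yearly dataset)."""
    return df.groupby(["station", "year"])[value_col].mean().unstack("year")

def seasonal_summary(df, value_col):
    """Minimum, maximum, and mean per year and season."""
    return df.groupby(["year", "season"])[value_col].agg(["min", "max", "mean"])
```

Each column of the table returned by `annual_station_means` corresponds to one yearly dataset of station values that could be passed to the interpolation step.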

Ordinary kriging

Ordinary kriging (OK) interpolation is suitable for PM2.5 and NO2 concentration mapping, as it is a commonly used geostatistical estimator in air pollution interpolation and is often referred to as the unbiased estimator36. Ordinary kriging models the unsampled value z*(x0) as a linear combination of the n neighboring observations, Eq. (1)37:

$$z^{*}(x_{0}) = \sum_{i=1}^{n} \lambda_{i} Z(x_{i})$$
(1)

where z*(x0) is the estimated value at x0, Z(xi) is the measured value at xi, and λi is the weight assigned to Z(xi).
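To make the weighted sum in Eq. (1) concrete, the following is a minimal NumPy sketch of ordinary kriging at a single target location. It is not the ArcMap implementation used for the maps in this study: the function names are hypothetical, the variogram parameters in the usage note are arbitrary, and the weights are obtained from the standard ordinary kriging system with a Lagrange multiplier that forces them to sum to one.

```python
import numpy as np

def exponential_variogram(h, nugget, partial_sill, range_a):
    """Exponential semivariogram, with gamma(0) = 0 by convention."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-3.0 * h / range_a))
    return np.where(h > 0, gamma, 0.0)

def ordinary_kriging(coords, values, target, variogram):
    """Estimate the value at `target` from observations at `coords`.

    Solves [[Gamma, 1], [1^T, 0]] [lambda, mu] = [gamma_0, 1] and returns
    the estimate sum(lambda_i * z_i) together with the weights lambda.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Pairwise distances between observation points.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = variogram(dists)
    system[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = variogram(np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1))
    solution = np.linalg.solve(system, rhs)
    weights = solution[:n]                      # last entry is the multiplier
    return float(weights @ values), weights
```

For example, calling `ordinary_kriging(coords, values, (0.5, 0.5), lambda h: exponential_variogram(h, 0.0, 1.0, 2.0))` with four stations on the corners of a unit square gives equal weights of 0.25, so the estimate is simply the mean of the four observations.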

Semivariogram

We derived the experimental semi-variogram for the 18 datasets to determine the spatial autocorrelation and the spatial structure of data points. The semi-variograms are expressed as a function of the distance between data points and explain the measured points' spatial relationship, Eq. (2),38.

$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_{i} + h) - Z(x_{i}) \right]^{2}$$
(2)

where γ(h) is the semivariance as a function of the increment (lag) h, N(h) is the number of pairs separated by the vector h, Z(xi) is the sampled value at location xi, and Z(xi + h) is the sampled value at location xi + h.
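A minimal sketch of the experimental semivariogram follows. It uses the standard estimator, in which half the mean squared difference is taken over all pairs whose separation falls in a distance bin. The binning scheme and the function name are assumptions made for illustration, and direction is ignored (an omnidirectional variogram).

```python
import numpy as np

def experimental_semivariogram(coords, values, bin_edges):
    """Return (mean lag, semivariance, pair count) for each non-empty bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)     # every pair counted once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    lags, gammas, counts = [], [], []
    for lower, upper in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (dist > lower) & (dist <= upper)
        n_pairs = int(in_bin.sum())
        if n_pairs == 0:
            continue                             # skip empty distance bins
        lags.append(dist[in_bin].mean())
        gammas.append(sq_diff[in_bin].sum() / (2.0 * n_pairs))
        counts.append(n_pairs)
    return np.array(lags), np.array(gammas), np.array(counts)
```

With only 18 stations there are 153 station pairs in total, so the number of pairs per bin is small and the choice of bin width has a noticeable effect on the experimental curve.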

In this study, we fitted the experimental semi-variogram to two theoretical semi-variogram models: exponential and spherical, two of the most commonly used models39. The parameters determined were: range (a) the distance up until which the regionalized variable is auto-correlated, partial sill (c) which is the spatially structured part of the residuals, and the nugget (c0) the non-spatial variability40. The spherical and exponential models are defined by Eqs. (3) and (4) 41.

$$\text{Exponential model:}\quad \gamma(h) = c_{0} + c\left[ 1 - \exp\left( -\frac{3h}{a} \right) \right]$$
(3)
$$\text{Spherical model:}\quad \gamma(h) = \begin{cases} c_{0} + c\left[ 1.5\left( \frac{h}{a} \right) - 0.5\left( \frac{h}{a} \right)^{3} \right], & h \le a \\ c_{0} + c, & h > a \end{cases}$$
(4)
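The two theoretical models in Eqs. (3) and (4) can be written directly as functions of the lag. The sketch below also includes a simple sum-of-squares score for comparing a candidate parameter set against experimental semivariances. The score function is an illustrative assumption, not the fitting procedure used in the study.

```python
import numpy as np

def exponential_model(h, c0, c, a):
    """Eq. (3): approaches the sill c0 + c asymptotically (about 95% at h = a)."""
    h = np.asarray(h, dtype=float)
    return c0 + c * (1.0 - np.exp(-3.0 * h / a))

def spherical_model(h, c0, c, a):
    """Eq. (4): reaches the sill c0 + c exactly at h = a and stays there."""
    h = np.asarray(h, dtype=float)
    ratio = np.minimum(h / a, 1.0)               # clipping gives the h > a branch
    return c0 + c * (1.5 * ratio - 0.5 * ratio ** 3)

def sum_squared_error(model, lags, gammas, params):
    """Misfit between a model with parameters (c0, c, a) and experimental values."""
    return float(np.sum((model(lags, *params) - np.asarray(gammas, dtype=float)) ** 2))
```

Both functions return the nugget c0 as h approaches zero; the semivariance at exactly h = 0 is zero by definition and would need separate handling in a kriging system.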

Cross validation

The model's ability to predict PM2.5 and NO2 at unsampled locations was assessed using cross-validation, from which the mean error (ME), standard mean error (SME), root mean square error (RMSE), root mean square standard error (RMSSE), and average standard error (ASE) were calculated. We analyzed the cross-validation results for both the spherical and exponential models; the model with better results was selected for interpolating PM2.5 and NO2 (Table S1). The RMSE and RMSSE are defined by Eqs. (5) and (6), respectively42.

$${\text{RMSE}} = { }\sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {Z\left( {x_{i} } \right) - Z^{*} \left( {x_{i} } \right)} \right]^{2} }$$
(5)
$${\text{RMSSE}} = { }\sqrt {\frac{1}{N} \mathop \sum \limits_{i = 1}^{N} \left[ {Z_{1} \left( {x_{i} } \right) - Z_{2} \left( {x_{i} } \right)} \right]^{2} }$$
(6)

where N is the number of validation points, Z(xi) is the measured value, Z*(xi) is the predicted value, and Z1(xi) and Z2(xi) are the corresponding standardized values.

An RMSE close to 0 and an RMSSE close to 1 indicate that the fitted model and its parameters are appropriate and that the kriging estimators are robust.
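The sketch below illustrates how these diagnostics can be computed from leave-one-out results. It assumes that standardization means dividing each prediction error by the kriging standard error at that location, which is a common convention but an assumption here; the function name and the three validation points are hypothetical.

```python
import numpy as np

def cross_validation_scores(measured, predicted, kriging_std):
    """ME, RMSE and RMSSE from cross-validated kriging predictions."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    kriging_std = np.asarray(kriging_std, dtype=float)
    errors = predicted - measured
    standardized = errors / kriging_std  # assumed standardization
    return {
        "ME": float(errors.mean()),
        "RMSE": float(np.sqrt((errors ** 2).mean())),
        "RMSSE": float(np.sqrt((standardized ** 2).mean())),
    }

# Hypothetical validation points (concentrations in ug/m3).
scores = cross_validation_scores(
    measured=[10.0, 12.0, 14.0],
    predicted=[11.0, 12.0, 12.0],
    kriging_std=[1.0, 2.0, 2.0],
)
```

Under this convention, an RMSSE below 1 suggests that the kriging standard errors overestimate the prediction variability, and an RMSSE above 1 suggests that they underestimate it.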

Pearson’s correlation and random forest

Pearson's correlation analysis was conducted for each of the 18 monitoring stations to produce a heatmap depicting the correlation between the major climatic and temporal factors and the PM2.5 and NO2 pollutants (Table 6). We then used random forest (RF) to identify the most important climatic and temporal factors influencing PM2.5 and NO2 at each of the 18 stations43. The factors identified were then used to construct RF models for each station to predict PM2.5 and NO2. For each station, the RF model was trained with 80% and validated with 20% of the respective monitoring station data, and was evaluated using the coefficient of determination (R2) and the mean squared error (MSE). Random forest is an ensemble learning method used for classification and regression analysis. It is well known to have advantages in terms of accuracy, robustness, and computational efficiency compared to other models44. The RF models were constructed using the open-source machine learning library scikit-learn in Python. Categorical factors were converted into dummy (indicator) variables using the pandas get_dummies function (Tables 6 and 7). Furthermore, 7-day time-lag data were added to the climatic factors to account for the influence of past values of these factors. Finally, hyperparameters were tuned by grid search over the ranges and steps indicated in Table 8.

Table 7 Categorical rank from the continuous data of AOI.
Table 8 Hyperparameters of the random forest grid search.
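The workflow described in this section can be sketched in Python as follows. The example is illustrative only: the data are synthetic, the column names are hypothetical, and the hyperparameter grid is a placeholder, not the grid of Table 8. It also assumes a random 80/20 split and lags of 1 to 7 days, neither of which is specified in detail above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n_days = 400
# Synthetic station record; all column names are hypothetical.
df = pd.DataFrame({
    "temperature": rng.normal(25.0, 5.0, n_days),
    "wind_speed": rng.gamma(2.0, 1.5, n_days),
    "season": rng.choice(["dry", "wet"], n_days),
})
df["pm25"] = (30.0 - 2.0 * df["wind_speed"] + 0.3 * df["temperature"]
              + rng.normal(0.0, 1.0, n_days))

# Convert the categorical factor into indicator columns.
df = pd.get_dummies(df, columns=["season"])
# Add 1- to 7-day lags of each climatic factor (assumed lag structure).
for col in ["temperature", "wind_speed"]:
    for lag in range(1, 8):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)
df = df.dropna().reset_index(drop=True)  # the first 7 days have no full lag set

X = df.drop(columns="pm25")
y = df["pm25"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Placeholder grid; the study's ranges and steps are those of Table 8.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# Rank the factors by impurity-based importance.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

Because the synthetic target is driven mainly by wind speed, that factor should rank first in the importance list, which mirrors how the most influential factors were identified at each station.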

Trend and forecast analysis

R statistical software was used to conduct the trend and forecast analysis (R Core Team, 2022). For the trend analysis, monthly mean concentrations of PM2.5 and NO2 were used. The csmk.test function was used to conduct the Mann–Kendall test for trend detection. Equation (7) gives the Mann–Kendall statistic S, its variance V(S), and the standardized test statistic Z45,46.

$$S = \mathop \sum \limits_{i = 1}^{n - 1} \mathop \sum \limits_{j = i + 1}^{n} sgn\left( {x_{j } - x_{i} } \right),$$
(7)
$$sgn\left( {x_{j } - x_{i} } \right) = \left\{ {\begin{array}{*{20}c} { + 1,\quad if\;x_{j } - x_{i} > 0} \\ {0,\quad if\;x_{j } - x_{i} = 0} \\ { - 1,\quad if\;x_{j } - x_{i} < 0} \\ \end{array} } \right.$$
$$V\left( S \right) = \frac{1}{18}\left[ {n\left( {n - 1} \right)\left( {2n + 5} \right) - \mathop \sum \limits_{p = 1}^{q} t_{p} (t_{p} - 1)\left( {2t_{p} + 5} \right)} \right],$$
$$Z = \left\{ \begin{gathered} \frac{S - 1}{{\sqrt {V\left( S \right)} }}\quad if\;S > 0 \hfill \\ 0\quad \quad \quad \quad \;if\;S = 0 \hfill \\ \frac{S + 1}{{\sqrt {V\left( S \right)} }}\;\;if\;S < 0 \hfill \\ \end{gathered} \right.$$

where xj and xi are the time-series values at times j and i (j > i), n is the number of data points in the time series, q is the number of tied groups, and tp is the number of data points in the p-th tied group. A positive Z value signifies a rising trend and a negative Z value a declining trend over the data period.
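For illustration, the statistics in Eq. (7) can be computed with the short Python sketch below. It implements only the basic, non-seasonal Mann–Kendall test; it is not the correlated seasonal variant provided by csmk.test, and the function name and input series are hypothetical.

```python
import math
from collections import Counter

def mann_kendall(x):
    """Return S, V(S) and Z for the basic Mann-Kendall test of Eq. (7)."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = x[j] - x[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    # Sizes of the groups of tied (identical) values.
    tie_sizes = [t for t in Counter(x).values() if t > 1]
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, var_s, z

# Hypothetical strictly increasing series: every pair contributes +1.
s, var_s, z = mann_kendall([1.0, 2.0, 3.0, 4.0, 5.0])
```

For this five-point series, all 10 pairs are increasing, so S = 10 and Z is positive, consistent with a rising trend.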

The sens.slope function was used to calculate Sen's slope, which indicates the magnitude of the trend. Equation (8) gives the slope for each pair of data points, and Eq. (9) gives Sen's slope estimator (Qi) as the median of the n values of Ti47.

$$T_{i} = \frac{{x_{j} - x_{k} }}{j - k}$$
(8)

where Ti is the slope and xj and xk are the data values at times j and k (j > k), respectively.

$$Q_{i} = \left\{ \begin{gathered} T_{{\frac{n + 1}{2}}} ,\quad \quad \quad \quad \quad \;n\;is\;odd \hfill \\ \frac{1}{2}\left( {T_{\frac{n}{2}} + T_{{\frac{n + 2}{2}}} } \right),\;n\;is\;even \hfill \\ \end{gathered} \right.$$
(9)

A positive Qi signifies a rising trend; a negative Qi signifies a declining trend over time.
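Equations (8) and (9) can be illustrated with the following Python sketch, which computes all pairwise slopes and takes their median. It covers the basic, non-seasonal estimator only, and the function name and example series are hypothetical.

```python
import numpy as np

def sens_slope(x):
    """Median of the pairwise slopes (x_j - x_k) / (j - k) for all j > k."""
    x = np.asarray(x, dtype=float)
    earlier, later = np.triu_indices(len(x), k=1)  # index pairs, earlier < later
    slopes = (x[later] - x[earlier]) / (later - earlier)
    # np.median averages the two middle values when the count is even, as in Eq. (9).
    return float(np.median(slopes))

# Hypothetical series that rises by about two units per time step.
q = sens_slope([2.0, 4.0, 5.0, 8.0, 10.0])
```

Because the estimator is a median, a single outlying observation changes only a few of the pairwise slopes and therefore has little effect on the result.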

Both the Mann–Kendall test and Sen's slope were applied in forms that account for the seasonality of the data. The trend package in R was used to run the correlated seasonal Mann–Kendall test and the seasonal Sen's slope test48. Neither function operates on missing data; therefore, the tsclean function in the forecast package was used to identify and replace outliers and missing values49,50. To obtain the trend of PM2.5 and NO2, we decomposed the time series into trend, seasonal, and irregular components using the stl (seasonal decomposition of time series by LOESS) function developed by William Cleveland51,52. The tsclean function was applied before the stl function from the stats package, which fits the LOESS to the data. The mean absolute percentage error (MAPE) and root mean square error (RMSE) were computed to determine whether the components of the LOESS decomposition had satisfactorily captured the information in the PM2.5 and NO2 data. The goodness of fit of the trend line was assessed by examining the residuals with the checkresiduals function from the forecast package.

Both exponential smoothing and ARIMA models were evaluated for the forecast analysis. These methods have been used for air pollution forecasting and, in some cases, have performed better than deep learning methods. First, the Augmented Dickey–Fuller (ADF) test was performed to check the stationarity of the time-series data53. Once stationarity was confirmed, the two models were trained on the 2013–2019 pollutant data and tested on the 2020–2021 data. Validation was then performed on the test set, for which the mean absolute percentage error (MAPE) and root mean square error were computed to determine whether the ETS and ARIMA models had satisfactorily captured the information in the PM2.5 and NO2 data. The models with the lowest AIC were then used to forecast both PM2.5 and NO2.
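The hold-out evaluation can be sketched as follows. The monthly series is a hypothetical placeholder of 108 values (2013 to 2021), and the three actual and forecast values serve only to show how the two error measures are calculated; they are not results from this study.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100.0)

def rmse(actual, forecast):
    """Root mean square error, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

# Placeholder monthly series: 9 years x 12 months = 108 values.
monthly = np.linspace(30.0, 20.0, 108)
train, test = monthly[:84], monthly[84:]  # 2013-2019 vs. 2020-2021

# Toy check of both measures on three hypothetical test months.
actual = [20.0, 25.0, 40.0]
forecast = [22.0, 25.0, 36.0]
test_mape = mape(actual, forecast)
test_rmse = rmse(actual, forecast)
```

MAPE is scale-free and therefore comparable between PM2.5 and NO2, whereas RMSE is expressed in concentration units and penalizes large errors more heavily.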

Exponential smoothing (ES) forecasting methods and models

Brown, Holt, and Winters introduced exponential smoothing54, and Gardner55 extensively reviewed the various ES methods. An exponential smoothing formulation consists of a forecast method and a statistical model. The forecast method is an algorithm that produces a point forecast, that is, a prediction of a single value. The statistical model is a stochastic process that generates an entire probability distribution of future values; the mean of this distribution gives the point forecast, and the distribution also provides prediction intervals at a given level of confidence54.

The exponential smoothing forecasting method is based on the idea that forecasts are weighted averages of past observations, with the weight associated with each observation decreasing exponentially as the observation gets older54,55. Model formulations take either a component (recursive) form or an error correction form54. The error correction form is derived by rearranging the equations of the component form. It corresponds to the state space approach to exponential smoothing, since it consists of a measurement (observation) equation and a state (transition) equation. These two equations, together with their error distribution, constitute a fully specified statistical model known as a state space model. Because all observations and state variables use the same error process, it is called a "single source of error" (SSOE) or "innovations" model, and more specifically an "innovations state space model". The SSOE formulation was introduced by Snyder56.

Pegels provided a classification of the trend and seasonal patterns according to whether they are additive (linear) or multiplicative (nonlinear)56. The family of exponential smoothing forecasting methods can be systematically described as a combination of level, trend, and seasonality54,58,59, each of which can be either additive or multiplicative. The trend component can be classified as no trend, additive trend, additive damped trend, multiplicative trend, or multiplicative damped trend55,57,58,59. The simplest method is single exponential smoothing (SES), which considers only a constant level and is suited to data with no trend or seasonality. This method consists of a forecast equation and a smoothing equation for the level. The Holt linear trend method, also known as double exponential smoothing (DES), consists of a forecast equation and two smoothing equations: a level equation and a trend equation.
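The component (recursive) form of the two simplest methods can be sketched in a few lines of Python. The smoothing parameters are fixed by hand and the initial states are set with simple heuristics; both are assumptions made for illustration, since in practice these quantities are estimated from the data.

```python
def simple_exponential_smoothing(y, alpha):
    """SES: one smoothing equation for the level; the forecast is flat."""
    level = y[0]  # assumption: start the level at the first observation
    for obs in y[1:]:
        level = alpha * obs + (1.0 - alpha) * level
    return level  # the same value is forecast for every horizon

def holt_linear(y, alpha, beta, horizon):
    """Holt's linear trend method: level and trend smoothing equations."""
    level, trend = y[0], y[1] - y[0]  # assumption: simple initial states
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1.0 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1.0 - beta) * trend
    # Forecast equation: extrapolate the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

ses_forecast = simple_exponential_smoothing([0.0, 10.0], alpha=0.5)
holt_forecast = holt_linear([1.0, 2.0, 3.0, 4.0, 5.0], alpha=0.5, beta=0.5, horizon=2)
```

On the perfectly linear toy series, the Holt method continues the line, whereas SES would produce a flat forecast that lags behind a trending series.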

The Holt-Winters seasonal method, also known as triple exponential smoothing (TES), consists of a forecast equation and three smoothing equations: a level equation, a trend equation, and a seasonality equation. This classification yields 15 exponential smoothing methods with different component combinations58,59. Rearranging the component equations (level, trend, and seasonal) of each of the 15 methods gives an error correction form for each method; since each can have either an additive or a multiplicative error, this produces a total of 30 models. These error correction form models, also known as "innovations" state space models, are labeled ETS(·;·;·), representing Error, Trend, and Seasonal. The forecast equation is the measurement equation, and the smoothing equations become the state equations, with both sharing the same source of error54.

Of the 15 exponential smoothing methods, six were considered here: Holt's linear trend, Holt's linear damped trend, Holt-Winters additive, Holt-Winters additive damped, Holt-Winters multiplicative, and Holt-Winters multiplicative damped. Each of these six methods was converted to its error correction form with an additive and a multiplicative error, yielding 12 error correction models. These 12 models were used for model selection (Table S3).

The innovations state space model forecasts for each univariate time series were generated using the ets() function in the forecast package in R49,58. Two procedures for model selection with the ets() function were used: automatic selection and manual selection. Automatic selection of the ETS models allows the user to specify which models are evaluated and selects the most appropriate model given the data. The model option used was model = "ZAZ", where the first Z represents automatic selection between additive and multiplicative error, the A represents an additive trend, and the second Z represents automatic selection among no seasonality, additive seasonality, and multiplicative seasonality. Based on this "ZAZ" option, 12 of the 30 available error models were evaluated. The best-fitting of the 12 models was selected by minimizing the corrected Akaike Information Criterion (AICc), which avoids over-fitting by considering both goodness of fit and model complexity.
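The count of 12 candidate models follows from combining two error types, an additive trend with or without damping, and three seasonal types. The Python sketch below enumerates these combinations and shows the small-sample correction that distinguishes AICc from AIC. It illustrates the bookkeeping only; the labels are written in the ETS notation used here, the log-likelihood, parameter count, and sample size are placeholders, and which candidates ets() actually fits can depend on the function's settings.

```python
from itertools import product

errors = ["A", "M"]        # additive or multiplicative error
trends = ["A", "Ad"]       # additive trend, without or with damping
seasons = ["N", "A", "M"]  # none, additive, or multiplicative seasonality
candidates = [f"ETS({e};{t};{s})" for e, t, s in product(errors, trends, seasons)]

def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

# Placeholder values: log-likelihood -100, 3 parameters, 50 observations.
example_aicc = aicc(log_likelihood=-100.0, k=3, n=50)
```

The correction term grows as the number of estimated parameters k approaches the series length n, so AICc penalizes complex models more strongly than AIC on short series.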

The manual model selection procedure used the hw() function from the forecast package in R. This function fits the Holt-Winters additive model, which corresponds to ETS(A;A;A) in the ets() function, that is, additive error, additive trend, and additive seasonality55. The Ljung–Box Q test was used for residual diagnostics to determine whether the residuals were white-noise sequences, using the Box.test function from the stats package in R.
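For illustration, the Ljung–Box Q statistic can be computed directly from the residual autocorrelations, as in the Python sketch below. The function name and the alternating residual series are hypothetical, and the sketch returns only the statistic, not the p-value.

```python
import numpy as np

def ljung_box_q(residuals, max_lag):
    """Q = n(n + 2) * sum over k of r_k^2 / (n - k), for k = 1..max_lag."""
    e = np.asarray(residuals, dtype=float)
    n = len(e)
    e = e - e.mean()
    denom = np.sum(e ** 2)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = np.sum(e[k:] * e[:-k]) / denom  # lag-k sample autocorrelation
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

# Hypothetical residuals with strong negative lag-1 autocorrelation.
q_stat = ljung_box_q([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=1)
```

Under the white-noise hypothesis, Q approximately follows a chi-square distribution; for one lag the 5% critical value is 3.841, so the toy series above would be rejected as white noise. When the test is applied to the residuals of a fitted model, the degrees of freedom are usually reduced by the number of estimated parameters.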

The ARIMA models

Slutsky, Walker, Yaglom, and Yule first articulated autoregressive (AR) and moving average (MA) models. Box and Jenkins integrated the existing knowledge to formulate ARIMA modeling, known as the Box–Jenkins approach60. An autoregressive (AR) model assumes that the forecast value is a linear combination of past values of the variable, whereas a moving average (MA) model assumes a linear combination of past forecast errors. Combining the AR and MA models produces an ARMA model. If the time series is non-stationary, it is differenced to achieve stationarity before modeling; the I in ARIMA stands for "integrated" and refers to this differencing step. Non-seasonal ARIMA models have parameters p, d, and q, where p is the lag order of the autoregression, d is the order of differencing, and q is the order of the moving average. The seasonal ARIMA model, also known as SARIMA, incorporates an additional set of terms, analogous to those of the non-seasonal model, that account for seasonal effects. The seasonal parameters are P, D, Q, and m, where P, D, and Q are the lag order of the autoregression, the order of differencing, and the order of the moving average for the seasonal part, and m is the number of periods in each season. Box et al. and Chatfield61,62 give expressions for the AR(p), MA(q), ARMA(p,q), mixed seasonal ARMA(p,q)(P,Q)m, ARIMA(p,d,q), and mixed seasonal ARIMA, or SARIMA(p,d,q)(P,D,Q)m, models.
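For reference, these models can be written compactly with the backshift operator B, defined by B yt = yt−1. The notation below follows a common textbook convention and is an assumption made here for illustration; it is not reproduced from the cited sources. The ARMA(p,q), ARIMA(p,d,q), and SARIMA(p,d,q)(P,D,Q)m models are, in turn,

$$y_{t} = c + \sum\limits_{i = 1}^{p} \phi_{i} y_{t - i} + \varepsilon_{t} + \sum\limits_{j = 1}^{q} \theta_{j} \varepsilon_{t - j}$$
$$\phi \left( B \right)\left( {1 - B} \right)^{d} y_{t} = c + \theta \left( B \right)\varepsilon_{t}$$
$$\Phi \left( {B^{m} } \right)\phi \left( B \right)\left( {1 - B^{m} } \right)^{D} \left( {1 - B} \right)^{d} y_{t} = c + \Theta \left( {B^{m} } \right)\theta \left( B \right)\varepsilon_{t}$$

where yt is the observation at time t, εt is a white-noise error, c is a constant, φ(B) = 1 − φ1B − … − φpB^p and θ(B) = 1 + θ1B + … + θqB^q are the non-seasonal AR and MA polynomials, and Φ and Θ are the corresponding seasonal polynomials of orders P and Q in B^m. Setting d = 0 in the second expression recovers the ARMA model, and setting P = D = Q = 0 in the third recovers the non-seasonal ARIMA model.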

ARIMA forecasting was performed with the automatic model-selection algorithm implemented in the auto.arima() function from the forecast package in R49,50. Two procedures were used in the automatic selection: the first used the default settings (restricted models), and the second used full model selection. Running the full model selection sometimes produces a different optimal model54.