Forecasting Air Quality in Taiwan by Using Machine Learning

This study proposes a gradient-boosting-based machine learning approach for predicting the PM2.5 concentration in Taiwan. The proposed mechanism is evaluated on a large-scale database built by the Environmental Protection Administration, and Central Weather Bureau, Taiwan, which includes data from 77 air monitoring stations and 580 weather stations performing hourly measurements over 1 year. By learning from past records of PM2.5 and neighboring weather stations’ climatic information, the forecasting model works well for 24-h prediction at most air stations. This study also investigates the geographical and meteorological divergence for the forecasting results of seven regional monitoring areas. We also compare the prediction performance between Taiwan, Taipei, and London; analyze the impact of industrial pollution; and propose an enhanced version of the prediction model to improve the prediction accuracy. The results indicate that Taipei and London have similar prediction results because these two cities have similar topography (basin) and are financial centers without domestic pollution sources. The results also suggest that after considering industrial impacts by incorporating additional features from the Taichung and Thong-Siau power plants, the proposed method achieves significant improvement in the coefficient of determination (R2) from 0.58 to 0.71. Moreover, for Taichung City the root-mean-square error decreases from 8.56 for the conventional approach to 7.06 for the proposed method.


Geographical and Meteorological Divergence of PM 2.5 Concentration Between Regional Monitoring Areas
In Taiwan, the PM 2.5 concentration has been a severe environmental problem and has attracted considerable attention from researchers and the general public. Previous studies have focused on the source contributions of air pollution. Air pollution can be categorized as locally produced(mainly from motor vehicles and power plants) or long-range transported (LRT) pollution (mainly from China) 27,28 . Some studies have focused on the influence of meteorological conditions on the PM 2.5 concentration and have indicated that the PM 2.5 concentration is strongly influenced by the southwesterly monsoonal flow (SWM) and northeasterly monsoonal flow (NEM). Geographical divergence is an important factor when discussing the effects of meteorological conditions and synoptic weather on the air quality in Taiwan 29 . Thus, the following section discusses the geographical divergence and temporal variations of the measured PM 2.5 concentration in Taiwan.
The large-scale database used in this study was built by the official EPA and Central Weather Bureau (CWB). The database contains more than 260,000 samples taken in 2017 from 77 air monitoring stations (from the EPA) and 580 weather stations (from the CWB) in Taiwan. Fig. 1(a) displays the geographical positions of the 77 air stations. The map indicates that Taiwan island is surrounded by the ocean and has the Central Mountain Range (CMR) running from the north to the south. The complex terrain and different meteorological conditions create spatial and seasonal variations in the PM 2.5 concentration in Taiwan. Considering the different meteorological and geographical conditions, Taiwan has been divided into seven regions by the EAP for regional air quality monitoring. As illustrated in Fig. 1(a), the seven regions are northern Taiwan (NT), Chu-Miao (CM) area, central Taiwan (CT), Yun-Chia-Nan (YCN) area, Kao-Ping (KP) area, Hua-Dong (HD) area, and Yilan (YI) 30 . In addition to the geography, PM 2.5 is also heavily influenced by the weather conditions. To offer accurate weather conditions and predictions, the CWB has set up 580 weather stations in Taiwan, as depicted in Fig. 1(b). Figure 2 illustrates the monthly temporal variation of the mean PM 2.5 concentration in the seven regions, where YI has missing data except for December. This figure indicates that the PM 2.5 concentration reaches a peak in most regions during spring (February, March, and April), drops dramatically during summer (May, June, and July), and starts to go upwards during autumn (September, October, and November) and winter (November, December, and January).
Starting from autumn (September, October, and November), the prevailing wind (NEM) affects the northern part of the island, thereby easily dispersing and diluting air pollutants. The NEM becomes strong during www.nature.com/scientificreports www.nature.com/scientificreports/ the winter; however, it also brings LRT pollutants from China, which affect the air quality. Thus, air pollution is severe because the NEM is blocked by CMR; therefore, the southern part of Taiwan becomes a lee side region and forms stagnant wind conditions that increase the accumulation of air pollutants 30,31 . By contrast, during spring a stable weather system without strong winds is observed, especially in the northern part of the island 29 . Besides, the boundary layer height is also one of the factors for decreasing PM 2.5 during summer. On the contrary, during autumn and winter, temperature inversion limits the boundary layer height and results in increasing PM 2.5 concentration. Figure 2 also indicates that Taiwan has superior air quality during the summer because the prevailing wind (SWM) easily disperses and dilutes air pollutants. Moreover, the subtropical high pressure over the Pacific Ocean in the summer causes high convection of air mass, which leads to the vertical dispersion of pollutants.  After examining the temporal variations, we investigate the geographical divergence. Fig. 2 indicates that YCN and KP have the worst air quality. According to the EPA standards, when the PM 2.5 concentration is larger than 35.4 μg/m 3 , the air quality could cause health problems to citizens. From Fig. 2, the YCN and KP regions exceed the threshold values for 2 and 5 months, respectively. By contrast, NT and HD have the best air quality because NT is a financial and business center and HD is an undeveloped green space. As expected, these two regions have only minor industrial pollution. In Taiwan, HD is the most livable place with good air quality because the mean PM 2.5 concentration is below 15.4 μg/m 3 for several months.
An interesting region is CT, where the mean PM 2.5 concentration is lower than that in only KP and YCN. The central part of Taiwan is subject to both meteorological conditions and topographic effects because most residents live in the basin area surrounded by hills and mountains. Furthermore, the second-largest coal-fired power station in the world, that is, the Taichung Power Plant, is located in this region. The power plant accounts for 15% of Taichung City's PM 2.5 concentration 32 . Figure 3 shows the estimate of the PM 2.5 concentration of Taiwan in December using IDW (inverse distance weighted) technique, which is a interpolation method via spatial average for unseen locations 33,34 . This figure shows that NEM is strong in December, resulting in better air quality in the northern Taiwan. However, the NEM is blocked by CMR. This results in weak wind in the southern Taiwan and bad atmospheric diffusion condition.

Results
The data provided by the EPA and CWB for 2017 were collected and processed. These data contained more than 260,000 samples covering 264,799 h and 36 attributes. In this study, 81 independent parameters were selected from each air station and its four nearest weather stations as features, and gradient boosting with regression was used as the learning model to forecast the PM 2.5 concentration for the next 24 h. The feature extraction process and learning algorithm are described in detail in the following sections. This study used five-fold cross validation to justify the prediction results. The entire data were randomly split into five folds, of which four were used for training and one was used for testing.

pM 2.5 prediction Results
The RMSE, normalized RMSE (NRMSE), and R 2 were used as prediction performance metrics in the experiments. The RMSE indicates the mean fluctuation between the observed and predicted values. A smaller RMSE and NRMSE indicate a superior forecasting performance. R 2 measures the proportion of the variance of observed values that is predictable in the case of multiple regression. When R 2 = 1, the observed values are perfectly predicted.  Fig. 4(a). In addition, www.nature.com/scientificreports www.nature.com/scientificreports/ some trend is consistent in Fig. 4(a,b) such as the distribution of HD regions (the lower right), CT regions (the upper left), and NT regions (middle). Next, the separation between regions obviously reduced due to the normalization. For example, YCN and NT are clearly separated in Fig. 4(a) whereas they are mixed in Fig. 4(b). Using NRMSE, it is also difficult to distinguish the KP regions among the middle ranges. Thus, each performance metric has its own advantage and dis-advantages. NRMSE provides fair prediction performance comparison among each monitoring stations while RMSE makes the comparison of air quality regions easily due to higher separation ability. In addition to the large RMSE and NRMSE, the small R 2 values imply that these regions may contain an increased number of hidden factors that may influence the prediction model. Thus, an interesting observation is that although YCN and KP have worse air quality than CT, the R 2 value of CT is lower than that of YCN and KP. The R 2 value for station 34 (Fengyuan station in CT) is only 0.41, and is the worst among all stations. This issue is discussed in detail in the following paragraph. By contrast, most stations in the HD, YI, and NT regions are distributed in the lower right. For stations 20, 21, 76, and 77, R 2 is approximately equal to 0.9, which indicates that the forecasting model works nearly perfectly for these stations. At some stations, the RMSE values are smaller than 4 and the NRMSE values are smaller than 0.2, which indicates that the prediction value approaches the ground truth. Among these stations, we observe that although station 68 (Hengchuen) belongs to the KP region, its performance is completely different from that of other stations in this region. The reason for this observation may be that station 68 is located in a national park in south Taiwan, where few anthropic activities occur, which results in a low RMSE and NRMSE. Figure 5 shows the standard deviations (STD) of RMSE and R 2 across 5-folds validations. Besides, we can observe that the maximum STD of RMSE is less than 1 and that of R 2 is less than 0.1. In addition, this figure also indicates that almost all monitoring stations show the similar and stable STD, while there are only minor stations show relatively higher STD in the NT regions. Thus, the forecast models are not overfitting for the air monitoring stations in Taiwan during the experiments. Figure 6 displays calendar heat maps of predicted, observed, and residual values at station 68 (Hengchuen) and 69 (Chaozhou). In Fig. 6(a), Hengchuen exhibits small residuals in the PM 2.5 concentration over the year, a low PM 2.5 concentration (maximum is only approximately 30 μg∕m 3 ), and minor differences between the observed and predicted PM 2.5 values. By contrast, as displayed in Fig. 6(b), Chaozhou exhibits large residuals in the PM 2.5 concentration over the year, a high PM 2.5 concentration (maximum is approximately 100 μg∕m 3 ), and significant differences between the observed and predicted PM 2.5 values. Besides, Fig. 6(b) illustrates the higher observed values in winter (November, December, and January) and spring (February, March, and April) at Chaozhou. The results indicate that although both Hengchuen and Chaozhou are in the KP region, their forecasting performance is significantly different. Hengchuen is located in a national park, where Chaozhou is very close to the Linyuan industrial area. Therefore, in addition to the between-region divergence, the in-region difference is also an important issue when considering the prediction performance of individual air monitoring stations.

Comparison Between Taiwan and London
We applied another public air quality database to validate the aforementioned study results. Recently, the air quality data for London and southeast England was made available for independent scientific measurements and assessment 35,36 . Therefore, we selected London as a benchmark region for comparison. By using the same analysis procedure as that for Taiwan, experiments were conducted to collect hourly data from air and weather stations for 6 years (2012-2018) in London to provide a fair comparison of air quality prediction between different countries 37 . Figure 7 illustrates the comparison results, where the empty blue circles and solid gray circles represent air stations in Taiwan and London, respectively. The subfigure depicts the index abbreviations of air stations in  Table 1. whereas that in Taiwan is 0.4-0.9. The RMSE results exhibit the same tendency. The RMSE in London is 5-7, whereas that in Taiwan is 3-13. These results indicate that the air quality forecasting performance in London is significantly superior to that in Taiwan even when using the same algorithm. This may be attributed to three main factors. First, from the perspective of geography, London is located in a basin, whereas Taiwan has considerably complicated terrain. Second, London is a financial center without domestic pollution sources. By contrast, most pollution in Taiwan originates from domestic sources, such as in KP and YCN. Third, Taiwan has transboundary issues because air pollution may be transported from China. These factors explain why London significantly outperforms Taiwan and also indicate that dedicated air quality forecasting in Taiwan is challenging.
Next, we selected stations in the Taipei region in Taiwan (solid blue circles in Fig. 7) for comparing with stations in London for two reasons. First, most northern stations are located in Taipei, which lies in a basin. In addition, Taipei is a financial center without domestic pollution sources. Thus, Taipei is similar to London in terms of the topography, pollution sources, and number of air stations. The comparison in Fig. 7 indicates that Taipei has similar prediction accuracy compared with that of London from the viewpoint of both the RMSE and R 2 values. Fig. 7 indicates that the RMSE in Taiwan is 3-13 and that in Taipei is 5-8. In other words, PM 2.5 prediction in Taipei is more accurate than that at other stations in Taiwan. The results obtained from an alternative database for London verified the advantages of the proposed prediction mechanism. If a city has a pure topography (basin) and zero domestic pollution sources (e.g., Taipei and London), the PM 2.5 prediction performance is expected to be stationary and reliable with a RMSE of approximately 6 and an R 2 value of 0.75.

Effect of Adding Industry-Related CEMS Features
In recent years, air pollution has worsened in CT, especially in Taichung City, for reasons that remain unclear. Domestic pollution could be one of the reasons because the coal-fired power station in Taichung increased power generation in 2017. The power generation was increased at the directive of the Taiwan government to satisfy future energy requirement and to promote the nuclear-free homeland policy. The results indicate that the forecasting performance is poor in CT, as displayed in Fig. 4. In particular, station 34 has the lowest R 2 value. This result is consistent with the coal-fired power station's policy for increasing power generation. We explored the possibility of using additional industry-related features to achieve an enhanced prediction performance. The effect of adding industry-related features may also provide some scientific evidence for the distribution of pollution sources.
The question is whether improving the prediction accuracy is possible when considering industry-related information. To answer this question, we used the alternative CEMS database to investigate the effect of adding www.nature.com/scientificreports www.nature.com/scientificreports/ industry-related features on the PM 2.5 prediction. The emission of factory gas through chimneys has been managed and monitored by the CEMS of the EPA since 1993. CEMS detects several parameters, including the CO, SO 2 , O 2 , nitrogen oxides, and hydrogen chloride concentrations; temperature; opacity, and emission flow rate 38 . Besides, wind speed and direction are applied for considering diffusion of the detected items. CEMS has collaborated with the EPA since 2003, and the data has recently been made publicly available for academic research. Figure 8 displays the geographical positions of Taichung's coal-fired power plants and air stations 30-34 in Taichung City, which are indicated by the yellow area in the map. In addition to the large coal-fired power station, Fig. 8 also shows the geographical position of the Thong-Siau gas-fired power plants, which is the second-largest station in CT and is close to the air stations in Taichung City. Therefore, considering these two thermal power plants when making PM 2.5 predictions in Taichung is reasonable. Figure 9 illustrates the effect of adding industry-related features in Taichung, in which all the CEMS data were included as features in the training model. The figure indicates that after considering the CEMS, the forecast performance significantly improves in terms of the RMSE and R 2 values. In general, the highest improvement is observed when adding Taichung Power Plant's attributes. Upon also adding Thong-Siau power plant's features, stations 34 and 32 exhibit a similar performance, stations 31 and 33 exhibit marginal improvement in the performance, and station 30 exhibit a marginal decrease in the performance.
An interesting observation is that the improvement in the forecast performance is highly related to the distance between the power plants and the air stations. For example, stations 30-33 are closer to Taichung Power Plant than station 34, as shown in Fig 8. This may explain why the improvement upon adding Taichung Power www.nature.com/scientificreports www.nature.com/scientificreports/ Plant is the most insignificant at station 34 compared with the other air stations. Besides, station 34 is located near CMR and has higher altitude than nearby stations. It has been influenced by both the Taichung and Thong-Siau power plants and suffers from high PM 2.5 value due to bad atmospheric diffusion conditions. Furthermore, station 34 is close to Thong-Siau power station plant, and the improvement upon adding Thong-Siau power plant's features is the largest among all the air stations. Similarly, station 30 is the farthest from the Thong-Siau power plant. This may explain why the forecast performance decreases marginally after considering the features of Thong-Siau power plant. In summary, the R 2 values increase by 0.13, whereas the RMSE decreases by 1.50 on average (Fig. 9). These results indicate that the PM 2.5 prediction performance in Taichung improves significantly on considering the neighboring coal-fired power plants.
Next, we follow the same procedure to determine whether the same trend can be obtained in Taipei City. Although no industrial power plants exist in Taipei, the CEMS data, including data from garbage incinerators at Mucha, Neihu, and Beitou, were applied as alternative industry-related features to determine the performance improvement for air stations in Taipei. Figure 10 depicts the comparison between Taichung and Taipei. This figure indicates that although adding CEMS features still provides a marginal performance improvement in Taipei, the degree of improvement is not comparable to that in Taichung. Industrial pollution appears to have a stronger impact in Taichung than in Taipei, especially in terms of the PM 2.5 prediction. This observation fits recent developments of Taipei and Taichung. Taipei City is a financial and business center in Taiwan with minor industrial impact, whereas Taichung City experiences increased air pollution due to coal-fired power plants.

Discussion
The air quality forecast has aroused attention from governments and scientists for improving environmental quality of citizens and upgrading environmental sensing studies. Because numerous scientific studies have linked PM 2.5 (particulate matter) to health problems, this study proposes a machine learning approach based on GBM to predict the PM 2.5 concentration in Taiwan.

Index SiteCode SiteName
SiteType DataOwner  www.nature.com/scientificreports www.nature.com/scientificreports/ By extracting temporal sequences from air monitoring and weather stations into features, the proposed mechanism efficiently predicts the air quality in terms of the PM 2.5 concentration for the next 24 h at individual air monitoring stations. By performing experiments over 1 year, this study investigated the geographical and meteorological divergence of forecasting results and measurements at seven regional monitoring areas in Taiwan. The results from the alternative London database verified the proposed prediction mechanism. We observe that Taiwan has a more complex environment than London due to domestic pollution sources, overseas pollution  Figure 10. Comparison of the CEMS impact between Taichung and Taipei. The prediction results of Taipei with and without CEMS are represented by blue solid circles and diamonds, respectively. The prediction results of Taichung (shown in Fig. 9) with and without CEMS are represented by green solid circles and squares, respectively, for a comparison.
sources (mainly from China), and geographic topography; however, Taipei exhibits a similar prediction performance to London. This may be because these two cities have similar topography (basin) and are financial centers without domestic pollution sources. Finally, because domestic industrial pollution makes prediction difficult, this study proposes an enhanced fusion version of the prediction model by incorporating additional features from the CEMS to improve the prediction accuracy. After considering the industrial impact, the proposed method achieves significant improvement in the R 2 values (0.58-0.71 on average) and considerably decreases the RMSE (8.56-7.06 on average)compared with the conventional approach for Taichung City. In the future, we will try to use recent advanced dynamic neural networks to improve the performance. We will also investigate deeper and more complex model structure when more training data become available for air quality prediction.

Methods
Data collection. The EPA, CWB, and CEMS databases are the main data sources used to forecast air quality. EPA uses an automatic continuous monitoring system to immediately identify and respond to uncommom emerging signals. This system collects air quality data every hour 39,40 . EPA provided more than 260,000 samples from 77 air stations in 2017. The measured attributes include the index, city, county, station name, date, detected items, and time in hours. The detected items include PM 2.5 , NO 2 , PM 10 , NO, NO X , SO 2 , CO, O 3 , THC, NMHC, and CH 4 . Specifically, PM 2.5 represents particulate matter with a diameter less than 2.5 μm, and NO 2 is one of a group of highly reactive gases. These substances are primarily released into the air from the burning of fuel in cars, trucks buses, and power plants.
Taiwan's EPA has strict regulations and guidelines in regards to quality assurance operations for air quality monitoring to achieve the EPA's data quality objectives (DQO). Every year the EPA has to perform quality assurance operations for the air quality monitoring system in order to evaluate the accuracy of air quality measurements. Those data were recorded in EPA's quality assurance (QA) operations annual report of air quality monitoring system. For our PM25 data (2017), the accuracy was 96.3%. The annual report from 2001 to 2017 can be downloaded at the EPA website 41 . In Taiwan, the stations are categorized by the EPA as 6 types, which may give a general idea about the location of the stations: • 60 general stations. Most of the stations are located in urban and suburban area. CWB has set up 580 weather stations to monitor and report weather conditions in Taiwan 42 . The attributes of weather information include longitude, latitude, station name, city, county, wind speed, wind direction, temperature, and pressure. The number and date of datasets from the CWB and EPA must be synchronized for model training. The CEMS was established by the EPA to monitor the gases emitted from chimneys of industries since 1993. The detected items include the CO, SO 2 , O 2 , nitrogen oxides, and hydrogen chloride concentrations; temperature; opacity; and total emission flow rate. Specifically, emission flow rate indicates total emission materials from a chimney and its unit is cubic meter per hour 38 . The data has been recently made publicly available for academic research, and we used this data for training to study the influence of the CEMS on Taichung and Taipei.
Feature extraction. We derived and extracted features from the EPA and CWB hourly data for model training. There exist 81 features from each air station and its four nearest weather stations. Specifically, 21 features were identified from individual air stations and 15 were identified from a weather station. Tables 2 and 3 list the details of the 81 features for generating the GBM-based PM 2.5 prediction model.
When we forecast air pollution on a given day, the previous two days are also considered due to the memory effect. Furthermore, the difference between these two days' PM 2.5 concentration is defined as "concentration difference" In addition, because traffic flow strongly influences the air quality, the features consider whether a given day is regular day, weekend, or holiday. An air station's weather conditions are represented by the mean pressure and mean temperature of nearby weather stations. Features in the training model also include the hour of day, day of week, and year to learn the trend and period of the temporal index. In summary, 21 features are extracted from an air station, as presented in Table 2.
We derived weather stations' pressure, temperature, and wind speed from the features to represent the weather conditions at nearby air stations. Because wind can blow from any direction, the wind was split into the northern and eastern directions for simplicity. In summary, 15 features were inferred from each weather station, as presented in Table 3. Because the 580 weather stations are uniform distributed in Taiwan, neighboring weather stations should be enough for representing weather condition of each air quality station. However, number of weather stations may be varying according to the environments and validation tests. To provide a fair comparison among stations, we fixed the number as 4 in the experimental setup. Therefore, we used one air station and four neighboring weather stations to generate 81-dimensional feature vectors for model learning and PM 2.5 forecasting.
GBM. GBM combines fitting functions, loss functions, a decision tree, and gradient descent analysis 24 . The decision tree produces initial values for the fitting function with multiple regression, which deals with the many input variables considered in this study. Then, errors between the observed data sets and output values are www.nature.com/scientificreports www.nature.com/scientificreports/ calculated using a loss function. The frequently used loss functions include square-error, absolute-error, and negative binomial log-likelihood functions 43,44 . Thereafter, gradient descent analysis is applied to find the fitting function whose expected value of loss function is minimized. The aforementioned procedure is repeated to acquire the optimized fitting function.
The one-year data were randomly split into five folds, of which four were used for training models and one was used for testing models. Besides, the period of training data is chosen to cover a whole year to reduce seasonal changes when training prediction models. For training models, 31 features of the predicting day, 25 features of one day before predicting day, and 25 features of two days before predicting day comprise 81-dimensional input vector x t , as shown in Tables 2 and 3, where t is the time index in hour. Specifically, the predicting day's meteorological features are collected form weather forecast of the CWB. Because the target is to predict the next-24 hours PM 2.5 , the output is one-dimensional variable denoted as y t+24 .
After N pairs of the input vector x t and the output variable y t+24 are given, a fitting function F(x t ) is selected from unknown functions x F( , ) t β′ produced by the decision tree. In addition, β′ is a gradient decent step size and ( ) the target function F(x t ) is chosen to be F(x t , β). Note that in this study, N is 211840 since we collected one-year data and 80% of them were selected for training model. Besides, the gradient descent analysis is applied for optimized fitting function F(x t ). The detailed process is described as follows. At the first step, initial guess function x F ( , ) t 0 β′ is generated and initial gradient descent step size β 0 is ( ) t t 0 0 0 Then, we take gradient of loss function as a first-step base learner function f 1 (x t ) as After M iterations, the target function F(x t ) is expressed as

Dimensions Features
Predicting day Holiday or not, weekend or not, Saturday or not, Sunday or not, concentration difference, extrapolation of concentration, average pressure, average temperature, hour of the day, day of the week, and which year 11 Day before the predicting day Days before the predicting day Holiday or not, weekend or not, Saturday or not, and Sunday or not, Concentration of air pollutant 5 × 2 Table 2. Input features obtained from an air monitoring station.

Dimensions Features
Predicting day Day before the predicting day Two days before the predicting day pressure, temperature, wind speed, eastern wind speed, northern wind speed 5 × 3 Finally, F(x t ) is the GB-based prediction model. Then, during the online stage, the testing samples are put into the model to calculate the forecasting results. The entire GBM procedure is described in Algorithm 1.