Combination of data-driven models and interpolation technique to develop of PM10 map for Hanoi, Vietnam

The degradation of air quality is an issue of great concern to our society because of its harmful impacts on human health, especially in cities with rapid urbanization and population growth such as Hanoi, the capital of Vietnam. This study develops a new approach that combines data-driven models and an interpolation technique to produce PM10 concentration maps from meteorological factors for the central area of Hanoi. Data-driven models relating the PM10 concentration to the meteorological factors at the air quality monitoring stations in the study area were developed using the Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) algorithms. A comparison of model performance showed that the ANN models yielded better goodness-of-fit indices than the MLR models at all stations in the study area, with an average coefficient of correlation (r) and Nash–Sutcliffe Efficiency Index (NSE) of 0.7 and 0.49 for the former, and 0.51 and 0.34 for the latter. The ANN-based models and the Inverse Distance Weighting (IDW) interpolation technique were therefore combined to map the monthly PM10 concentration with a spatial resolution of 1 km from global meteorological data. With this combination, the PM10 concentration maps account for both the local PM10 concentration and the impacts of spatio-temporal variations of meteorological factors on the PM10 concentration. This study provides a promising method to predict the PM concentration with a high spatio-temporal resolution from meteorological data.


Scientific Reports | (2020) 10:19268 | https://doi.org/10.1038/s41598-020-75547-y | www.nature.com/scientificreports/
conditions and PM concentration have a close relationship. Zhao et al. 12, in a study on PM2.5 pollution from 2005 to 2007, highlighted a clear seasonal variation in the concentration of PM10-2.5, in which the PM concentration in the rural area was at its minimum in winter and at its maximum in spring and summer, while the urban region experienced the opposite trend. This study also pointed out that precipitation made an important contribution to the seasonal pattern of PM2.5 in the urban area, while the monsoon was the main factor in the countryside. In a similar study, Duo et al. 13 concluded that temperature in Lhasa, Tibet was the dominant factor governing all air pollutants, including PM10 and PM2.5, in spring, while relative humidity and atmospheric pressure were the major meteorological drivers during summertime. The spatial distribution of fine to coarse PM showed an inverse relationship with wind speed 14,15. Srimuruganandam and Nagendra 16 revealed that low wind speed was highly correlated with PM concentration. However, Giri et al. 17 reported that the PM10 concentration in Kathmandu, Nepal increased with wind speed and atmospheric pressure. Wang and Ogawa 18 also reported that PM2.5 was positively proportional to wind speed above 3 m/s and negatively proportional to wind speed below that level. Based on the relationship between the PM10 concentration and meteorological factors, efforts have been made to construct data-driven models for predicting the PM10 concentration from meteorological data using different statistical methods 19-21. Among these methods, the Multiple Linear Regression (MLR)-based data-driven models are the most popular and have been widely used.
The MLR algorithm is usually used to formulate a linear relationship between meteorological factors (including temperature, relative humidity, wind speed, and wind direction) and the PM10 or PM2.5 concentration. Although some studies reported good prediction results 22, MLR-based models generally have yet to deliver consistently satisfactory results because they linearize a non-linear system, as reviewed by Shahraiyni and Sodoudi 23. With their ability to represent complex non-linear problems, ANN-based data-driven models with different architectures have been extensively used to estimate the PM concentration. Several studies reported that Artificial Neural Network (ANN) models produced satisfactory prediction results 24. Although some studies showed that ANN models performed better than MLR models 24, ANN models have a more complicated structure and still have limitations in handling high-dimensional input variables, local minima, and interpretability (the black-box model problem). As a result, for each case study, it is necessary to compare the two models and select the more suitable one.
The PM10 concentration recorded at an air quality monitoring station only represents the area surrounding that station; thus, interpolation techniques are needed to map the PM10 concentration. There are multiple interpolation techniques for this purpose. Wong et al. 25 provided an excellent review of these techniques and divided them into four groups, namely spatial averaging, nearest neighbor, inverse distance weighting, and kriging. These interpolation techniques have been successfully employed to construct PM10 maps in many studies. For example, Perez 26 29 used spatiotemporal kriging with external drift to explore spatio-temporal variations of PM10 concentrations in Ankara, Turkey.
The main drawback of these studies was that they only used the PM10 concentration measured at the air quality stations for interpolation, without considering the impacts of the spatio-temporal variations of the meteorological factors on PM10 variation. As a result, it is crucial to develop an interpolation technique that can account for information from both air quality stations and meteorological data.
With the increasing population and rapid urbanization, air quality in Vietnam, especially in large cities like Hanoi, has degraded significantly. Hopke et al. 30 indicated that Hanoi was one of the cities with the worst air quality in Asia. Saksena et al. 31 showed that the average PM10 concentration in the streets of Hanoi could reach 455 μg/m³, which is much higher than the Vietnamese daily standard for the PM10 concentration (150 μg/m³). This has harmed the city's public health. It was reported that ambient and indoor air pollution was becoming the major cause of environment-related deaths in Vietnam, second only to smoking 32. As a result, there has been increasing attention and demand from both the local community and the government of Hanoi for a study on air quality and its controlling factors, with PM10-2.5 concentration prediction being the top priority.
There are two objectives to this study. The first objective is to develop a hybrid mapping approach that combines the data-driven model and IDW interpolation to produce PM10 concentration maps from global meteorological data. The second objective is to employ this approach to construct the monthly PM10 maps for the central districts of Hanoi and analyze their spatio-temporal variations.

Methodology and material
In this study, we developed a hybrid approach that combines data-driven models and interpolation techniques to construct PM10 concentration maps from global meteorological data. As shown in Fig. 1, this approach consists of two main steps, namely, (1) development of data-driven models at each air quality monitoring station and (2) construction of PM10 maps from global meteorological data using the IDW interpolation technique. Details of these two steps are presented below.
Development of data-driven models. Before the data-driven models were developed, a set of input features derived from meteorological factors was constructed. Data-driven models linking the selected input features to the PM10 concentration were then developed using two machine-learning algorithms, namely MLR and ANN. Finally, based on their performance, the more accurate models were selected for mapping the PM10 concentration.

Construction of input features. Multiple meteorological factors can be considered in data-driven models to predict the PM10 concentration. However, based on the availability of meteorological data at the air quality monitoring stations and their correlation with the PM10 concentration, the following variables were taken into account: mean daily temperature, maximum daily temperature, minimum daily temperature, mean daily humidity, mean daily wind speed and mean daily atmospheric pressure. Next, we assumed that the PM10 concentration was linked to the meteorological factors by a quadratic function as follows:

$$PM_{10} = a_0 + \sum_{i=1}^{6} a_i X_i + \sum_{i=1}^{6} \sum_{j=i}^{6} b_{ij} X_i X_j \qquad (1)$$

in which X_i (i = 1, 2, …, 6) is the mean daily pressure (X1), mean daily temperature (X2), mean daily humidity (X3), mean daily wind speed (X4), maximum daily temperature (X5) and minimum daily temperature (X6), and a_0, a_i and b_ij are the coefficients of the constant, linear and quadratic terms. In total, Eq. (1) contains 27 features (6 linear terms and 21 quadratic terms). Since this number of features is considerably large, features with low correlation coefficients with the PM10 concentration, or with a close correlation with previously selected features, were removed from the equation. Next, the selected input features and the PM10 concentration were standardized so that differences in the scale of the features would not distort the performance of the regression models. After standardization, both the input features and the PM10 concentration were unitless values with a mean of 0 and a standard deviation of 1. The standardized features and PM10 concentration were used as inputs and outputs for the data-driven models.
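The feature construction and standardization steps described above can be illustrated with a minimal sketch. It is not the authors' Matlab code: the function names, the use of the population standard deviation, and the three daily records are assumptions made purely for illustration.

```python
from itertools import combinations_with_replacement
from statistics import mean, pstdev


def quadratic_features(x):
    """Return the candidate features for one day: the linear terms
    followed by every product X_i * X_j with i <= j."""
    products = [x[i] * x[j]
                for i, j in combinations_with_replacement(range(len(x)), 2)]
    return list(x) + products


def standardize(column):
    """Rescale a list of values to mean 0 and standard deviation 1.
    Assumption: the population standard deviation is used."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


# Hypothetical daily records (made-up numbers): pressure (hPa), mean T (deg C),
# humidity (%), wind speed (m/s), max T (deg C), min T (deg C).
days = [
    [1012.0, 18.5, 70.0, 1.8, 22.0, 16.0],
    [1008.5, 24.0, 78.0, 2.4, 28.5, 21.0],
    [1003.0, 29.5, 82.0, 2.9, 34.0, 26.5],
]
features = [quadratic_features(d) for d in days]
n_features = len(features[0])  # 6 linear + 21 quadratic terms
columns = [standardize([row[k] for row in features]) for k in range(n_features)]
```

With six meteorological variables the sketch yields 6 linear and 21 quadratic terms, which matches the 27 features counted in the text.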
Development of data-driven models. Multiple linear regression model. MLR is a statistical technique used to find a linear relationship between a response (dependent) variable and explanatory (independent) variables. It is one of the most common methods to generalize the relationship of PM concentration with its determinants 23. Generally, the MLR model is defined by the following equation:

$$y_i = c_0 + c_1 x_{i1} + c_2 x_{i2} + \dots + c_n x_{in} + \epsilon \qquad (2)$$

in which y_i is the dependent (response) variable; x_i1, …, x_in are the independent (explanatory) variables; c_0 is the intercept; c_n is the slope coefficient for each independent variable; and ε is the error term. In this study, y_i is the standardized PM10 concentration and the x_i are the standardized features constructed in the previous step. At each air quality monitoring station, a data-driven model based on the MLR algorithm was developed from measurements of the PM10 concentration and the meteorologically derived input features using the method of least squares.
Artificial neural network model. In this study, a feed-forward neural network (FFNN) model, a common ANN architecture, was employed to build data-driven models for each air quality monitoring station using the same input and output data as in the MLR models for comparison. The structure of an FFNN model consists of three layers (input layer, hidden layer and output layer). We refer to Sanger 33 for more detailed information about the FFNN algorithm. To develop the FFNN data-driven models, the input and output data were randomly sampled into three subsets, with 70% of the data for training, 15% for validation and 15% for testing. Since the number of nodes in the input and output layers was already fixed, the determination of the FFNN structure focused on the number of hidden nodes. In this study, the trial-and-error method was used to find the number of hidden nodes for each air quality monitoring station.
Development of a hybrid interpolation approach for PM10 concentration mapping. Based on the data-driven models developed in the previous section, this study constructed the monthly PM10 concentration maps from meteorological data using a new approach based on the IDW interpolation method. To predict the PM concentration at a given (interpolated) location from the surrounding air quality monitoring stations, the IDW method determines the weighting factor of each station as below:

$$w_{ik} = \frac{1/d_{ik}^{2}}{\sum_{j=1}^{N} 1/d_{jk}^{2}} \qquad (3)$$

in which w_ik is the weighting factor of the i-th station at the k-th interpolated location; d_ik is the distance from the i-th station to the k-th interpolated location; and N is the total number of air quality monitoring stations used for interpolation. Using the weighting factors, the PM10 concentration was estimated as below:

$$PM_{10,k} = \sum_{i=1}^{N} w_{ik} \, f_i^{PM_{10}}(X_k) \qquad (4)$$

in which f_i^PM10 is the data-driven model developed for the i-th station and X_k is the input feature vector derived from meteorological data at the k-th interpolated location. The novelty of this approach is that, instead of using the PM10 concentration values at the air quality monitoring stations as in traditional approaches, it feeds meteorological data at the interpolated location into the data-driven models. This hybrid approach allows us to consider the impacts of both local conditions (via the data-driven model developed for each station) and spatio-temporal variations of meteorological factors.
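The hybrid interpolation can be sketched in a few lines. In this illustration every station model is evaluated on the features of the interpolated location, and the outputs are blended with inverse-squared-distance weights. The station coordinates, the two toy linear models, and the handling of a target that coincides with a station are assumptions for illustration, not details taken from the study.

```python
import math


def idw_weights(station_xy, target_xy, power=2):
    """Normalized inverse-distance weights of each station at one target point."""
    distances = [math.hypot(sx - target_xy[0], sy - target_xy[1])
                 for sx, sy in station_xy]
    if 0.0 in distances:
        # Assumption: a target sitting exactly on a station takes that
        # station's model output only.
        hit = distances.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(distances))]
    inverse = [1.0 / d ** power for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def hybrid_pm10(station_xy, station_models, target_xy, target_features):
    """Weighted sum of the station models, each fed the target's own features."""
    weights = idw_weights(station_xy, target_xy)
    return sum(w * model(target_features)
               for w, model in zip(weights, station_models))


# Hypothetical example: two stations 4 km apart, each with a toy model.
stations = [(0.0, 0.0), (4.0, 0.0)]
models = [lambda x: 50.0 + 2.0 * x[0], lambda x: 70.0 - 1.0 * x[0]]
estimate = hybrid_pm10(stations, models, (1.0, 0.0), [1.0])
```

Note that the station measurements themselves never enter `hybrid_pm10`; only the station-specific models and the meteorological features at the target do, which is the distinguishing point made in the text.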
All the above steps, including feature construction, development of the MLR- and ANN-based models and mapping of the PM10 concentration, were programmed in Matlab (the code is provided in the Supplementary document). This programming language allows for quick implementation of the algorithms and easy visualization of the results without any additional software.
The method developed in this study can be applied to other cities where meteorological and PM10 concentration observations are available. However, the selection of meteorological factors and the development of data-driven models in this study relied purely on data from air quality monitoring stations in the city of Hanoi. As a result, the data-driven models themselves may not be applicable to other cities. The data-driven models should be developed for each city based on the observation data available in that city.

Study area.
Hanoi is the capital city of Vietnam, with an area of 3358 km² (following the administrative expansion in 2008) and more than 7.4 million people in 2017 34. The study area consists of eleven central districts of Hanoi (Fig. 2). This is the most crowded area of Hanoi, where 41% of the city's population resides in 7.7% of the total city area. With a high population density and a large number of vehicles and construction activities, this area has suffered severely from air pollution. As for the weather conditions, Hanoi has four distinct seasons: spring (March-May), summer (June-August), autumn (September-November) and winter (December-February). In winter, the weather is cold and dry, while summer has high humidity and rainfall 35. With lower temperature and humidity, the PM10 concentration in winter is much higher than in the other seasons (Fig. 3). While the 24 h PM10 concentration is mostly below the limit of the National Technical Regulation on Ambient Air Quality (QCVN 05:2013/BTNMT, PM10 = 150 μg/m³), there are many days in winter on which the PM10 concentration exceeds this standard. Due to the harmful impacts of high PM10 concentration on human health and the importance of the study area, it is necessary to construct high-resolution PM10 concentration maps for this area in order to provide more detailed air quality information for the local residents who are most likely to be affected.
Data availability. The data used in this study were collected from two sources corresponding to the two objectives. For the development of the data-driven models, input data were collected from 11 air quality monitoring stations located across the study area (Fig. 2). These stations include three fixed stations (Minh Khai, Trung Yen 3 and Nguyen Van Cu) and eight sensor stations (Table 1). Nguyen Van Cu station is under the management of the Vietnam Environment Administration. The remaining stations are under the management of the Hanoi Department of Environmental Protection.
Hourly PM10 concentration and meteorological data (atmospheric pressure, temperature, humidity, wind speed) at all stations from 01/06/2017 to 31/12/2018 were collected. This dataset covers one and a half years and can therefore represent the temporal variations of the PM10 concentration and meteorological factors over the four seasons of the year. In addition to PM10, data on other air pollutants were also collected, although they were outside the scope of this study. The hourly PM10 concentration and meteorological data were averaged to generate a daily dataset in order to reduce the measurement errors and remove the diurnal variation.
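The hourly-to-daily averaging can be sketched as follows. The source does not say how missing hourly readings were treated, so skipping them is an assumption here, and the four records are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(hourly):
    """Average (timestamp, value) pairs by calendar day.
    Assumption: missing readings (None) are skipped."""
    by_day = defaultdict(list)
    for stamp, value in hourly:
        if value is not None:
            by_day[stamp.date()].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly PM10 readings in ug/m3.
hourly = [
    (datetime(2017, 6, 1, 0), 40.0),
    (datetime(2017, 6, 1, 1), None),
    (datetime(2017, 6, 1, 2), 50.0),
    (datetime(2017, 6, 2, 0), 60.0),
]
daily = daily_means(hourly)
```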
Since the data collected from the 11 meteorological stations were limited, the PM10 concentration calculated from these data was not representative of its spatial variation in the study area. Therefore, high spatial resolution maps of the PM10 concentration were needed. For mapping the monthly PM10 concentration, we used the global meteorological data from the WorldClim 2.0 database (https://www.worldclim.org/), which contains temperature (mean, maximum, minimum), precipitation, solar radiation, vapor pressure, and wind speed data with a spatial resolution of 1 km². This is a reliable data source that was validated against gauged data (correlation coefficient with gauged data r ≥ 0.99 for temperature and vapor pressure, r ≥ 0.86 for precipitation and r ≥ 0.76 for wind speed). After the global meteorological data were downloaded, they were extracted for the study area. Because the relative humidity was not available, it was calculated from the actual and saturated vapor pressure. The atmospheric pressure was calculated from the location altitude and air temperature. Figure 4 shows the temperature, wind speed, relative humidity and air pressure in February obtained from the WorldClim database as an example. As shown in the figure, the data extracted from WorldClim reflect the spatial variation of the meteorological factors well.
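The text does not state which formulas were used to derive relative humidity and atmospheric pressure. The sketch below therefore rests on two assumptions: the Tetens-type expression for saturation vapor pressure used in FAO-56, and the isothermal barometric formula with a standard sea-level pressure of 1013.25 hPa. It shows one plausible way to do the conversion, not the procedure of the study.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Saturation vapor pressure (kPa) from air temperature (deg C),
    Tetens-type formula (an assumed choice)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def relative_humidity_pct(actual_vp_kpa, temp_c):
    """Relative humidity (%) as actual over saturated vapor pressure, capped at 100."""
    rh = 100.0 * actual_vp_kpa / saturation_vapor_pressure_kpa(temp_c)
    return min(rh, 100.0)


def pressure_hpa(altitude_m, temp_c, sea_level_hpa=1013.25):
    """Air pressure (hPa) from altitude and temperature using the
    isothermal barometric formula (an assumed choice)."""
    g = 9.80665              # gravitational acceleration, m/s^2
    molar_mass = 0.0289644   # molar mass of dry air, kg/mol
    gas_const = 8.314462618  # universal gas constant, J/(mol K)
    temp_k = temp_c + 273.15
    return sea_level_hpa * math.exp(-g * molar_mass * altitude_m / (gas_const * temp_k))


# Hypothetical grid cell: vapor pressure 1.9 kPa, 20 deg C, 15 m above sea level.
rh_cell = relative_humidity_pct(1.9, 20.0)
p_cell = pressure_hpa(15.0, 20.0)
```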

Results
Construction of input features for data-driven models. The meteorological data collected from 01/06/2017 to 31/12/2018 were used to construct the input features for the data-driven models. The total number of features considered in this study was 27. To reduce this number, the correlation coefficients between each feature and the PM10 concentration, and between the features themselves, were estimated. Figure 5 presents the correlation matrix, which gives the correlation coefficients of the input features with each other and with the PM10 concentration. The figure shows that the correlation coefficients between the input features and the PM10 concentration range from −0.46 to 0.46. All six meteorological factors (mean daily atmospheric pressure, mean daily temperature, mean daily humidity, mean daily wind speed, maximum daily temperature and minimum daily temperature) have a relatively high correlation with the PM10 concentration, with absolute correlation coefficients greater than 0.24. Interestingly, of these meteorological factors, only the mean daily pressure is positively correlated with the PM10 concentration, while the other factors have negative correlation coefficients. This implies that the PM10 concentration increases with increasing mean daily pressure and with decreasing values of the other factors. The features with the highest correlation with the PM10 concentration are the mean daily pressure (X1) and its quadratic term (X1²) (correlation coefficient = 0.46), while the product of the mean daily pressure and the mean daily humidity (X1X3) has the lowest correlation (correlation coefficient = −0.22). To select the input features for the data-driven models, we evaluated the correlation of each feature with the PM10 concentration and with the other features.
The mean daily pressure (X1), mean daily temperature (X2), mean daily humidity (X3) and mean daily wind speed (X4) are well correlated with the PM10 concentration and are independent of each other. As a result, they were added to the input features. The maximum (X5) and minimum daily temperature (X6) are strongly correlated with the mean daily temperature, with correlation coefficients of 0.98. Hence, these two features and their associated features (X1X5, X2X5, X3X5, X4X5, X5X5, X5X6; X1X6, X2X6, X3X6, X4X6, X5X6, X6X6) were not included in the input features. Features X1X1, X1X2, X1X3, X1X4, X1X5 and X1X6, which are functionally correlated with X1, were not considered either. Of the features associated with the mean daily temperature (X2) and mean daily humidity (X3), only X2X3, X2X4 and X3X4 are relatively independent of the other features and have a high correlation with the PM10 concentration. Therefore, these features were selected as inputs for the data-driven models. In total, the input features consist of X1, X2, X3, X4, X2X3, X2X4 and X3X4.
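The selection logic (keep a feature if it correlates with PM10 and is not nearly redundant with a feature already kept) can be expressed as a simple greedy filter. The selection in the study was made by inspecting the correlation matrix; the two thresholds and the tiny dataset below are hypothetical and serve only to make the rule concrete.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_features(candidates, target, min_target_r=0.2, max_pair_r=0.9):
    """Greedy filter over candidates in insertion order.
    Both thresholds are hypothetical, not values from the study."""
    selected = []
    for name, values in candidates.items():
        if abs(pearson(values, target)) < min_target_r:
            continue  # too weakly related to PM10
        if any(abs(pearson(values, candidates[s])) > max_pair_r for s in selected):
            continue  # nearly duplicates an already selected feature
        selected.append(name)
    return selected


# Made-up data: X5 (max temperature) tracks X2 (mean temperature) closely.
pm10 = [80.0, 65.0, 70.0, 50.0, 45.0]
candidates = {
    "X2": [15.0, 20.0, 18.0, 27.0, 30.0],
    "X5": [19.0, 24.0, 22.5, 31.0, 34.5],
    "X4": [1.0, 2.5, 1.2, 2.0, 3.0],
}
chosen = select_features(candidates, pm10)
```

In this toy case X5 is dropped because it is almost perfectly correlated with X2, mirroring the reason given in the text for excluding the maximum and minimum daily temperature.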
Development of data-driven models. Multiple linear regression model. Using the input features selected in the previous section, the MLR-based model (Eq. 2) was written as below:

$$PM_{10} = c_1 + c_2 X_1 + c_3 X_2 + c_4 X_3 + c_5 X_4 + c_6 X_2 X_3 + c_7 X_2 X_4 + c_8 X_3 X_4 \qquad (5)$$

in which the coefficients c_i (i = 1…8) were determined from measurement data using the method of least squares. Whereas previous studies usually used the MLR algorithm to relate the PM10 concentration to the meteorological factors alone, this study considered both the meteorological factors (X1, X2, X3, X4) and their quadratic cross terms (X2X3, X2X4, X3X4), and could therefore account for the nonlinear quadratic form of this relationship. In addition, to account for the seasonal dependence of the PM10 concentration on meteorological factors, this study built two MLR models, one for the winter and spring period and one for the summer and autumn period.
Artificial neural network model. Although the MLR-based data-driven models developed in this study accounted for the nonlinear and seasonal relationship between the PM10 concentration and meteorological factors, they only considered a quadratic nonlinear relationship. For comparison, we employed the ANN algorithm, which can formulate this relationship in a more complex manner. The ANN model included three layers (input, hidden and output), in which the numbers of nodes in the input layer and output layer were equal to 7 (the number of input features) and 1 (PM10), respectively. Using the trial-and-error method, we found that a hidden layer with 12 nodes was the most suitable for all air quality monitoring stations. Figure 6 illustrates the architecture of the ANN model used in this study.
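A least-squares fit of the seven-feature MLR model described above can be sketched without external libraries by solving the normal equations. The order of the coefficients, the synthetic noise-free data, and the "true" coefficient values are assumptions for illustration only; this is not the authors' Matlab implementation.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mlr_features(x1, x2, x3, x4):
    """The seven selected features: four factors and three cross terms."""
    return [x1, x2, x3, x4, x2 * x3, x2 * x4, x3 * x4]


def fit_mlr(rows, y):
    """Least-squares coefficients via the normal equations; the first
    returned coefficient is the intercept."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    aty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(ata, aty)


# Synthetic check: generate noise-free data from assumed coefficients
# and confirm that the fit recovers them.
random.seed(0)
true_c = [0.1, 0.4, -0.3, -0.2, -0.1, 0.05, 0.02, -0.04]
rows = [mlr_features(*[random.gauss(0.0, 1.0) for _ in range(4)]) for _ in range(60)]
y = [true_c[0] + sum(c * f for c, f in zip(true_c[1:], r)) for r in rows]
fitted = fit_mlr(rows, y)
```

Solving the normal equations is adequate for a handful of standardized features; for ill-conditioned problems a QR- or SVD-based solver would be the safer choice.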
Figure 5. Correlation matrix of the features with each other and with the PM10 concentration. X1 is the mean daily pressure; X2 is the mean daily temperature; X3 is the mean daily humidity; X4 is the mean daily wind speed; X5 is the maximum daily temperature; X6 is the minimum daily temperature.

To build the ANN model for each station, the measurement data were divided into three sub-datasets (70% for training, 15% for validation and 15% for testing) to avoid overfitting. Figure 7 compares the PM10 concentration between the ANN modeling and measurement in each sub-dataset and in the whole dataset at the Trung Yen 3 station as an example. The figure shows that the ANN model simulated the PM10 concentration well in all datasets and could therefore be reliably used for predicting the PM10 concentration.
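A random 70/15/15 split of the daily records can be sketched as below. The function name, the fixed seed, and the sample count of 200 are illustrative assumptions; the study's Matlab code may partition the data differently.

```python
import random


def split_indices(n_samples, train=0.70, validation=0.15, seed=0):
    """Shuffle sample indices and cut them into training, validation and
    test subsets; the test subset takes whatever remains."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    n_train = round(train * n_samples)
    n_val = round(validation * n_samples)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])


# Hypothetical dataset of 200 daily records.
train_idx, val_idx, test_idx = split_indices(200)
```

The validation subset is typically used to stop training before the network starts fitting noise, while the test subset is held back for the final performance figures.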
Comparison of model performance. To evaluate the performance of the MLR and ANN models, we used three statistical indices, namely, Root Mean Squared Error (RMSE), correlation coefficient (r) and Nash-Sutcliffe Efficiency (NSE), which are formulated as below:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Y_{m,i} - Y_{o,i} \right)^2} \qquad (6)$$

$$r = \frac{\sum_{i=1}^{N} \left( Y_{o,i} - \bar{Y}_o \right) \left( Y_{m,i} - \bar{Y}_m \right)}{\sqrt{\sum_{i=1}^{N} \left( Y_{o,i} - \bar{Y}_o \right)^2 \sum_{i=1}^{N} \left( Y_{m,i} - \bar{Y}_m \right)^2}} \qquad (7)$$

$$NSE = 1 - \frac{\sum_{i=1}^{N} \left( Y_{o,i} - Y_{m,i} \right)^2}{\sum_{i=1}^{N} \left( Y_{o,i} - \bar{Y}_o \right)^2} \qquad (8)$$

where N is the total number of data points, Y_m is the modeled data, Y_o is the observed data, and an overbar denotes the mean. The correlation coefficient ranges from −1 to 1, with a higher value corresponding to a closer positive relationship between the modeling and measurement. RMSE measures the differences between modeling and measurement; a lower RMSE indicates a better agreement between the two. NSE varies from −∞ to 1: NSE = 1 indicates a perfect match between the modeling and measurement, while NSE ≤ 0 implies that the model predictions have the same or lower accuracy than the mean of the measurements. Table 2 compares the indices for both data-driven models. The table indicates that the ANN models performed much better than the MLR models at all eleven stations. The ANN outputs are well correlated with the measurement data, with an average correlation coefficient of 0.7. No station has a correlation coefficient below 0.65. Meanwhile, the average correlation coefficient of the MLR outputs with measurements is 0.58, with the Hoan Kiem station having the lowest coefficient (r = 0.51). As for the modeling errors, the average RMSE of the ANN and MLR models are 14.1 and 15.6, respectively. The NSE criterion ranges from 0.41 to 0.57 for the ANN models and from 0.26 to 0.53 for the MLR models. It is clear that the differences between modeling and measurement are lower for the ANN models than for the MLR models. The reason is that the ANN algorithm accounts for more complicated interactions between the input and output than the MLR model. Because of their better performance, the ANN models were employed for mapping the PM10 concentration.

Figure 6. Schematization of the ANN model structure. The model includes 3 layers: input (7 nodes), hidden (12 nodes) and output (1 node).
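The three goodness-of-fit indices follow directly from their standard definitions, as the short sketch below shows. The observed and modeled values are made up for illustration and are not data from the study.

```python
from statistics import mean


def rmse(observed, modeled):
    """Root mean squared error between modeled and observed values."""
    n = len(observed)
    return (sum((m - o) ** 2 for o, m in zip(observed, modeled)) / n) ** 0.5


def correlation(observed, modeled):
    """Pearson correlation coefficient r."""
    mo, mm = mean(observed), mean(modeled)
    cov = sum((o - mo) * (m - mm) for o, m in zip(observed, modeled))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_m = sum((m - mm) ** 2 for m in modeled)
    return cov / (var_o * var_m) ** 0.5


def nse(observed, modeled):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of squared model error
    to the variance of the observations around their mean."""
    mo = mean(observed)
    residual = sum((o - m) ** 2 for o, m in zip(observed, modeled))
    spread = sum((o - mo) ** 2 for o in observed)
    return 1.0 - residual / spread


# Hypothetical daily PM10 values in ug/m3.
observed = [40.0, 55.0, 70.0, 65.0]
modeled = [45.0, 50.0, 66.0, 69.0]
scores = (rmse(observed, modeled), correlation(observed, modeled), nse(observed, modeled))
```

A model that always predicts the observed mean gets NSE = 0, which is why values at or below zero indicate no skill beyond the mean.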

Of all the air quality monitoring stations, both the MLR and ANN models perform much better at the Trung Yen 3, Nguyen Van Cu and Minh Khai stations than at the others. Indeed, the average correlation coefficient and NSE of the ANN models for these three stations are 0.74 and 0.55, respectively, while these indices for the other stations are much lower (0.68 for the correlation coefficient and 0.46 for the NSE). This could be explained by the fact that Trung Yen 3, Nguyen Van Cu and Minh Khai are the main stations of Hanoi, which are checked frequently and undergo quality control. As a result, the quality of the measurement data at these stations is better than at the others.
Monthly PM10 concentration mapping. The monthly PM10 concentration for the study area of eleven central districts in Hanoi was mapped from the WorldClim global meteorological data using the ANN models developed in the previous section and the hybrid interpolation approach (Eq. 4). Figures 8 and 9 present the monthly and seasonal maps of the PM10 concentration. The spatial resolution of these maps is equal to that of the meteorological data (1 km²). The seasonal PM10 concentration maps were generated by aggregating the monthly maps.
As for the temporal variation, the PM10 concentration reaches its peaks in the winter season (January, February, November and December). For example, the mean PM10 concentration in November is up to 71 μg/m³. The low temperature (ranging from 16 to 21 °C in these months) and the temperature inversion phenomenon in winter are likely the causes of the high PM10 concentration in these months. By contrast, because the air temperature in June-August reaches its highest level (~29 °C), the PM10 concentration hits a trough in this period, ranging from 46 to 48 μg/m³. As regards the seasonal variation, Fig. 9 shows that the PM10 concentration is highest in winter and lowest in summer. In addition to the air temperature, the humidity and wind speed, which are lowest in winter, also contribute to the higher PM10 concentration in winter than in the other seasons.
As for the spatial variation, the PM10 concentration in Long Bien district, which is situated in the northeast of the study area, is much lower than in the other districts. The main reason is that the population density of this district is the lowest in the study area (around 4.5 thousand people/km², versus 11.6 thousand people/km² in the other districts). This lower population density leads to less intensive traffic. Besides, due to the sparse air quality monitoring network, the PM10 concentration in Long Bien district strongly depends on the PM10 concentration at the Nguyen Van Cu station, which is situated relatively far from transportation routes. In addition, as pointed out by Nghiem et al. 36, since the inauguration of Vinh Tuy Bridge in 2010 and Nhat Tan Bridge in 2015, the flow of vehicles through Nguyen Van Cu road has decreased. As a result, the annual average PM10 concentration at the Nguyen Van Cu station declined slightly from 2010 to 2018, which lowers the mapped PM10 concentration. By contrast, the highest concentration is found at the Pham Van Dong station, located in the northwest of the study area. This station is placed on Pham Van Dong Street, one of the main routes into Hanoi from Noi Bai International Airport, so the traffic on this street is normally quite intensive. Besides, the large number of active construction works in this area might be an important factor in the increased level of PM10. Meteorological conditions also influence the spatial distribution of the PM10 concentration. The highest PM10 concentration in the northwest region is partly caused by the low air temperature in this region (see Fig. 4). However, Fig. 8 shows that the impact of local factors (e.g., streets, population, transportation intensity) on the spatial variation of the PM10 concentration is larger than that of meteorological factors.

Conclusion
In this study, an approach combining data-driven models and the IDW interpolation technique was developed to construct PM10 concentration maps for the central area of Hanoi. The construction of the data-driven models consisted of two steps: feature construction and model development. The feature construction step builds suitable features from the meteorological factors. By evaluating the correlation of each feature with the PM10 concentration and the correlation between features, a set of features was selected as the input for the data-driven models. The model development step built the data-driven models that link the PM10 concentration with the input features using the MLR and ANN algorithms for each air quality monitoring station. The results indicate that the ANN-based data-driven models provided much better results than the MLR-based models. To construct the PM10 concentration maps, the IDW interpolation technique was used to calculate the weighting factors for each air quality monitoring station. While many other studies obtained the unknown PM10 concentration by interpolating the PM10 concentration at the air quality monitoring stations without considering meteorological factors, this study accounted for the meteorological factors in the data-driven models. With this approach, both the local PM10 conditions around the monitoring stations and the dependence of PM10 on meteorological factors were taken into account, which provided a better representation of the current situation in the study area.
Due to a lack of meteorological data with a higher spatial resolution, this study used the 1 km² resolution monthly WorldClim data as the input to predict the monthly PM10 concentration by combining the established data-driven models and the interpolation method. The monthly PM10 maps were then aggregated to construct seasonal maps. The temporal analysis revealed that the PM10 concentration was highest in the winter months and lowest in the summer months, which was mainly caused by the negative dependence of the PM10 concentration on air temperature and by low humidity. The spatial analysis indicated that the northeast region had the lowest PM10 concentration because urbanization in this region was less developed than in the others. The northwest region had the highest PM10 [...] of new buildings and roads, which together elevated the PM10 concentration. The meteorological factors also influenced the spatial variation of the PM10 concentration, but with a lower impact than the local sources of PM10 surrounding the monitoring stations. This study also pointed out that, although the spatial variation of meteorological factors was taken into account, the low density of air quality monitoring stations might reduce the accuracy of the PM10 concentration maps. Hence, it is necessary to establish a denser network of air quality monitoring stations to better cover the spatial variation of the PM10 concentration. The approach developed in this study can be applied to produce forecast PM10 concentration maps based on predicted meteorological information. These results could also provide a meaningful foundation for the local authority in deriving and implementing city air quality management activities and urban planning in Hanoi.
Received: 11 June 2020; Accepted: 28 September 2020
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.