Artificial intelligence accuracy assessment in NO2 concentration forecasting of metropolises air

Air quality is a major concern worldwide, and nitrogen dioxide (NO2) is one of the pollutants with a significant effect on human health and the environment. This study compared regression analysis and a neural network model for predicting NO2 concentrations in the air of the Tehran metropolis. Data were collected over one year in the urban area of Tehran and analyzed using multi-linear regression (MLR) and multilayer perceptron (MLP) neural networks. Meteorological parameters, urban traffic data, urban green space information, and time parameters were used as inputs to forecast the daily concentration of NO2 in the air. The results demonstrate that artificial neural network modeling (R2 = 0.89, RMSE = 0.32) yields more accurate predictions than MLR analysis (R2 = 0.81, RMSE = 13.151). According to the sensitivity analysis of the model, the park area, the average green space area, and the one-day time delay are the crucial parameters influencing the NO2 concentration of the air. Artificial neural network models could be a powerful, effective, and suitable tool for analyzing and modeling complex, non-linear relations among environmental variables, such as those involved in forecasting air pollution. The establishment of green spaces plays a significant role in NO2 reduction, even more than traffic volume.

www.nature.com/scientificreports/
successfully used for predicting and modeling the concentration of ambient air pollution. Furthermore, ANNs are widely used in short- and long-term applications for forecasting pollutants 20. Various studies have applied ANNs to pollution prediction without considering a variety of environmental factors such as green space and parks, which is recognized as a shortcoming 4,21-23. Regression analysis is a traditional statistical technique for model generation. Multi-linear regression (MLR) analysis is an approach for evaluating the relationship between independent and dependent factors. The goal of this investigation is to predict the concentration levels of NO2 as a factor determining atmospheric pollution in the city of Tehran. For this purpose, the prediction accuracy of MLR and ANN models is compared. Many environmental variables were used to support the model outputs, and finally we identified the most accurate model for predicting the concentration levels of NO2 in Tehran. The results of a model sensitivity analysis are presented to prioritize the model variables.

Material and methods
Data collection. To perform this research, data were collected from the Tehran Air Quality Control Company, the Tehran Meteorological Organization, and the Tehran Transportation and Traffic Organization, together with urban green space information. The parameters measured by these organizations fell into five groups: the concentration of NO2, urban traffic parameters (from Tehran traffic camera data), urban green space information, meteorological data, and time parameters. The urban traffic parameters included the length of the north-south (y) and east-west (x) cross streets (km), the average number of vehicles at the street intersections, and the total number of vehicles at the intersections. The urban green space information comprised the number of parks, the total area of parks in each district, the average distance between each park and the NO2 monitoring station, the park index adjacent to the NO2 monitoring station (i.e. the area of the nearest park to the NO2 monitoring station divided by the distance of that park from the station), and the area of green spaces in the district. Furthermore, meteorological data (air temperature, rainfall, wind speed and direction, humidity, air pressure, and the length of sunshine per day) and time parameters (the 1-day time delay, i.e. the NO2 concentration on the previous day; the 2-day time delay, i.e. the NO2 concentration in the past two days; the day of the year (1-365); the month; the season (1: spring, 2: summer, 3: autumn, 4: winter); and the warm and cold seasons (1: hot, 2: cold)) were considered the main variables affecting air pollution in Tehran. Data were collected for 1 year (2015). The NO2 monitoring stations and the meteorological stations were located close to each other in the same area, and the traffic information was gathered on the streets near the stations.
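Two of the derived inputs described above, the park index and the time delays, are simple transformations of the raw records. The following is a minimal sketch of how they could be built, not the authors' code: the function names and the sample values are hypothetical, and the 2-day delay is assumed here to mean the concentration measured two days earlier.

```python
def park_index(nearest_park_area, nearest_park_distance):
    """Area of the nearest park divided by its distance from the monitoring station."""
    if nearest_park_distance <= 0:
        raise ValueError("distance must be positive")
    return nearest_park_area / nearest_park_distance


def add_time_delays(daily_no2):
    """Pair each day's NO2 value with the values one and two days earlier.

    The first two days are dropped because they lack a complete history.
    Assumption: the 2-day delay is the single value from two days earlier.
    """
    rows = []
    for day in range(2, len(daily_no2)):
        rows.append({
            "day_index": day,
            "delay_1": daily_no2[day - 1],
            "delay_2": daily_no2[day - 2],
            "target": daily_no2[day],
        })
    return rows


# Made-up daily concentrations, for illustration only.
series = [40.0, 42.5, 39.0, 45.0, 47.5]
rows = add_time_delays(series)
print(rows[0])
print(park_index(12000.0, 300.0))
```

Units are deliberately left open in this sketch, since the text does not state the units used for park area and distance.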

Multiple linear regression model (MLR). Multi-linear regression (MLR) analysis was implemented in stepwise form. This model was used to examine the relationship between the daily concentration of NO2 in the air as the dependent variable and the influential variables as independent variables. The most accurate MLR equation was selected on the basis of statistical parameters: the highest correlation coefficient, the lowest root mean square error (RMSE), the number of descriptors in the model (n), and the greatest value of the F statistic 2,24,25. The R statistic indicates how well the regression line fits the data; a greater value of R indicates a better fit between observed and predicted values 26,27.
Neural network model. In this study, the daily concentration of NO2 in the air was predicted using an artificial neural network model. The multilayer perceptron (MLP) architecture has been used successfully to model difficult and diverse problems and nonlinear functions such as air quality forecasting. It is composed of a system of layered, interconnected neurons or nodes, namely an input layer, one or more hidden layers, and an output layer 15,28,29. In the current study, logarithmic sigmoid and linear activation functions were examined to optimize the network. The back-propagation (BP) training algorithm is the most common and powerful nonlinear statistical technique in MLP networks 29,30. All computations were carried out in MATLAB R2016b. To evaluate the performance of the designed neural network model during learning, the following statistical indicators were calculated 31,32: the correlation coefficient (R2) (Eq. 1), the mean absolute error (MAE) (Eq. 2), the mean square error (MSE) (Eq. 3), and the root mean square error (RMSE) (Eq. 4), where y_i and ŷ_i are the targets and network outputs, respectively, ȳ is the mean of the target values, and n is the number of samples.
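The four indicators named above have widely used standard definitions. The sketch below computes them in plain Python under two assumptions that are not confirmed by the text: that R2 is the coefficient of determination (one minus the ratio of residual to total sum of squares, which is consistent with the mean of the targets appearing among the symbols), and that MAE, MSE, and RMSE are averaged over all n samples. The function name is hypothetical.

```python
import math


def regression_metrics(targets, outputs):
    """Return R2, MAE, MSE and RMSE for paired targets and model outputs."""
    n = len(targets)
    if n == 0 or n != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equally long")

    mean_target = sum(targets) / n
    residuals = [t - o for t, o in zip(targets, outputs)]

    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((t - mean_target) ** 2 for t in targets)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")

    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,  # assumed definition of R2
        "mae": sum(abs(r) for r in residuals) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


# Made-up values, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

Because RMSE carries the units of the target, values computed on normalized data and on raw concentrations are not directly comparable.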
To determine the factors that most affect the model output, a sensitivity analysis was performed. In this method, each factor was varied over the range of its standard deviation while the other factors were held at their mean values. The standard deviation of the model outputs under each factor's variation was taken as the sensitivity of the model to that factor.
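The one-factor-at-a-time procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the range is taken to be the mean plus or minus one standard deviation, it is sampled at a fixed number of evenly spaced steps, and `model` stands for any callable that maps one input row to a prediction. All names and the toy model are hypothetical.

```python
import statistics


def sensitivity(model, samples, steps=11):
    """Score each input factor by the spread of model outputs when it alone varies.

    samples: list of rows, each a list of floats (one value per factor).
    model:   callable taking one row and returning a float.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    n_factors = len(samples[0])
    columns = [[row[j] for row in samples] for j in range(n_factors)]
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]

    scores = []
    for j in range(n_factors):
        outputs = []
        for k in range(steps):
            # Sweep factor j from (mean - std) to (mean + std);
            # every other factor stays at its mean.
            offset = -stds[j] + 2.0 * stds[j] * k / (steps - 1)
            row = list(means)
            row[j] = means[j] + offset
            outputs.append(model(row))
        scores.append(statistics.pstdev(outputs))
    return scores


# Toy stand-in for a trained model: the second factor has no effect.
def toy_model(row):
    return 3.0 * row[0] + 0.0 * row[1] - 1.0 * row[2]


toy_samples = [[1.0, 10.0, 2.0], [2.0, 20.0, 4.0], [3.0, 30.0, 6.0]]
print(sensitivity(toy_model, toy_samples))
```

A larger score means the model output moves more when that factor alone is varied, which is how the factors can be ranked.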

Results
The study area. Tehran metropolis, the capital of Iran, with a population of over 8,000,000, is located south of the Alborz mountains and on the northern margin of Iran's central desert. It extends from longitude 51° 2′ E to 51° 36′ E, an approximate length of 50 km, and from latitude 35° 35′ N to 35° 50′ N, an approximate width of 30 km. The altitude of the city is about 1800 m at its northernmost point and about 950 m above sea level at its southernmost point. The city covers 730 km2 4,33.
MLR model. According to the stepwise multi-linear regression analysis, the month of the year, the lengths of the north-south (y) and east-west (x) streets, wind speed and direction, rainfall, air temperature, humidity, the length of sunshine per day, the one-day time delay, and the two-day time delay have a significant impact on the predicted daily NO2 concentrations in Tehran.
There is a relation between the rate of change of the affecting factors and the NO2 concentration of the air in Tehran. The best result of the stepwise multi-linear regression in predicting the NO2 concentration of the air is set out in Table 1 and Eq. (5),
where X1 is the month of the year, X2 is the length of the east-west (x) street, X3 is the length of the north-south (y) street, X4 is the wind direction, X5 is the wind speed, X6 is the rainfall, X7 is the air temperature, X8 is the humidity, X9 is the length of sunshine per day, X10 is the 1-day time delay, and X11 is the 2-day time delay.
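The fitted coefficients of Eq. (5) are not reproduced in this text. As a reminder of the form only, a multiple linear regression on the eleven retained predictors can be written as below; the symbols \hat{y} (predicted daily NO2 concentration), \beta_0 (intercept), and \beta_k (slope of predictor X_k) are generic notation introduced here for illustration and are not the authors' fitted values.

```latex
\hat{y} = \beta_0 + \sum_{k=1}^{11} \beta_k X_k
```

Stepwise selection decides which X_k remain in this sum; the coefficient values themselves are those reported in Table 1.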
The findings obtained from the stepwise multi-linear regression in forecasting the NO2 concentration of the air are shown in Fig. 1.
ANN modeling. Before neural network training, the factors influencing air pollution, which serve as input data, need to be normalized, so the data were converted to numbers between −1 and 1. Data were randomly divided into three subsets: 60% of the samples as the training set, 20% as the validation set, and 20% as the test set. The input data were then weighted in the first layer and passed to the middle layer. Next, the outputs were weighted through the connections between the middle layer and the output layer. Finally, the findings were presented in the output layer. The most accurate neural network structure used one hidden layer with 27 neurons. The optimal structural characteristics of the neural network are presented in Table 2.
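The preprocessing and network structure described above can be sketched in plain Python. This is a hedged illustration rather than the MATLAB implementation used in the study: min-max scaling is assumed as the normalization method, the logarithmic sigmoid is assumed for the hidden layer and a linear function for the output, and the weights below are random placeholders instead of trained values. All function names are hypothetical.

```python
import math
import random


def scale_to_unit_range(column):
    """Linearly map a list of values onto [-1, 1] (min-max scaling, assumed)."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [2.0 * (v - low) / (high - low) - 1.0 for v in column]


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 60/20/20 into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 6 // 10
    n_val = n_samples * 2 // 10
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, validation, test


def logsig(x):
    """Logarithmic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(row, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> logsig hidden layer -> linear output."""
    hidden = [
        logsig(sum(w * x for w, x in zip(weights, row)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Placeholder network with 27 hidden neurons; weights are random, not trained.
rng = random.Random(1)
n_inputs, n_hidden = 4, 27
hidden_weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
output_weights = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

train, validation, test = split_indices(365)
print(len(train), len(validation), len(test))
print(forward([0.1, -0.3, 0.7, 0.0], hidden_weights, hidden_biases,
              output_weights, 0.0))
```

Training by back-propagation is omitted to keep the sketch short; only the data flow from input to output is shown.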
The maximum value of R2 and the minimum value of MSE in the test and training sets were considered (Table 3). These values indicate a very high level of neural network accuracy in forecasting the NO2 concentration of the air in the city of Tehran.
The scatter plot in Fig. 2 shows the correlation between the data, which indicates the accuracy of the neural network model for forecasting the concentration of NO2 in Tehran 34. As can be seen from Fig. 2, the value of the coefficient (R2) indicates a relatively high correlation between output and target values.
Sensitivity analysis of NO2 concentration. Sensitivity analysis aims to identify the input parameters that most affect the model output. As can be seen from Fig. 3, the park area, the average green space area, and the one-day time delay are the most influential inputs for the NO2 concentration in the air. Figure 4 shows the effect of varying the most important parameters, namely park area, area of green space, and one-day time delay, on the level of NO2 concentration.

Discussion
The objective of this study was to forecast the NO2 concentration of the air in the Tehran metropolis by applying MLR and MLP models. The comparison between MLR and artificial neural network modeling demonstrates that the neural network model (R2 = 0.89, RMSE = 0.32) performs more accurately than multiple regression analysis (R2 = 0.81, RMSE = 13.151). Air pollution is influenced by several factors, including meteorological parameters, pollutant sources, and the quality and quantity of green spaces, which call for a non-linear computing tool as an alternative to the traditional approach. Artificial neural network (ANN) models can be used as interpolation methods for complex, non-linear problems such as predicting and modeling air pollution 17,35,36.
Several studies have compared the performance of ANNs and traditional regression models in predicting air pollutant concentrations 19,37-39. Dragomir et al. compared multiple linear regression and the multilayer perceptron for forecasting the NO2 concentration in Romania. That study focused on meteorological factors, including temperature, pressure, wind speed, wind direction, solar radiation, rainfall, and relative humidity, and their influence on the measured NO2 concentration. The results indicated that MLP has a higher correlation coefficient than MLR in forecasting air quality and that meteorological factors have an impact on the NO2 concentration 38. Rahimi used MLP and MLR to predict NO2 and NOx concentrations from meteorological variables. For the best MLR structure, RMSE and R2 were 3.6 and 0.42, respectively, compared with RMSE = 0.0046 and R2 = 0.82 for the MLP model. The findings of that study show that MLP outperforms MLR in predicting NO2 concentrations in urban environments 39. Cabaneros et al. illustrated that MLP models can accurately forecast NO2 concentrations (R2 = 0.9, RMSE = 23.45) based on air pollution and meteorological data 19. Cakir and Sita (2020) developed a non-linear method (MLP) and an MLR model using air pollution and meteorological data to predict NO2 concentration values. Their results show no significant differences between the two methods, and they recommended that more studies be done in this area 40. Although there are several studies in this area, they did not consider a comprehensive set of influential variables that includes green space and traffic data. To determine the daily averaged concentrations of PM10 and PM2.5 on the Adriatic coast of Italy, Biancofiore et al. compared ANN and MLR models. That research used, as input, daily values of air temperature, humidity, ... The findings for each of the five air pollutants highlighted that neural network models were significantly superior to MLR owing to their ability to predict complex, nonlinear phenomena such as air pollution 42. Although regression methods are simple to implement, the results of various studies also show that they may not offer precise predictions in areas such as air pollution in comparison to ANN models, and that ANNs generally have superior performance 39,43,44.
The comprehensive data collection, including green space and traffic data, is the main advantage of our study. As a result, our ANN model is more accurate than other models developed in previous research. It is concluded that green space and traffic variables, especially in metropolises, should be considered to achieve more accurate models. As can be seen from Fig. 3, green spaces and parks have a key role in the prediction of NO2 in the air. Figure 4A shows a positive correlation between the concentration of NO2 and the park area.
One reason for this result could be that the vegetation cover of parks, especially in metropolises, is largely limited to grasses and bushes, and the lack of trees is a disadvantage. Hence, more human activity around urban parks causes more air pollution. On the other hand, urban green spaces (vegetation in parks, squares, street tree lines, etc.) play a significant role in reducing NO2 in the air. This observation is in agreement with previous studies 45-50. Slemi et al. showed that tree cover and the level of air pollutant concentrations were two key parameters in removing pollutants 47. Janhall showed that differently designed vegetation and the dispersion of vegetation cover affect air quality. Moreover, trees and vegetation should be high and porous enough to allow air to pass through them, because air that does not pass through vegetation is not filtered, and vegetation should be as close to the sources of contaminants as possible 45. Vos et al. indicated that trees could result in a higher concentration of NO2. This result may be explained by the fact that vegetation cover can obstruct the wind flow, which means that the aerodynamic effect decreases ventilation and the filtering capacity of vegetation 51. As Fig. 4B shows, there is a negative correlation between the concentration of NO2 and the area of green space, so the concentration of NO2 ... 48. In the United States, urban vegetation is estimated to occupy 3.5% of the land, which could lead to the absorption of 0.711 ton of pollutant. This amount is equivalent to 3.8 billion US dollars. Urban green spaces are often regarded as providing a substantial air purification service 52. However, tree cover and the level of air pollutant concentrations are the two key parameters in removing pollutants 47. Moreover, we conclude that green spaces have a significant role in NO2 reduction, even more than traffic variables.

Conclusion
Atmospheric pollution is now recognized as a persistent concern and a major global environmental issue with adverse health effects worldwide. Because air pollution affects human health, quality of life, and the environment, its prediction has received much attention from researchers. The results showed that an artificial neural network model provides accurate predictions and could be a powerful, effective, and suitable tool for analyzing and modeling complex, non-linear relations among environmental variables, such as those involved in forecasting air pollution. The establishment of green spaces plays a significant role in NO2 reduction, even more than traffic volume. Further work could therefore develop and survey different ANN techniques for predicting and forecasting ambient air pollutants. Such work is worthwhile for environmental, health, and economic aims and is useful for governmental authorities and decision-makers.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.