Very few land use regression (LUR) models have been developed for megacities in low- and middle-income countries, but such models are needed to facilitate epidemiologic research on air pollution. We developed annual and seasonal LUR models for ambient oxides of nitrogen (NO, NO2, and NOX) in the Middle Eastern city of Tehran, Iran, using 2010 data from 23 fixed monitoring stations. A novel systematic algorithm was developed for spatial modeling. The R2 values for the LUR models ranged from 0.69 to 0.78 for NO, 0.64 to 0.75 for NO2, and 0.61 to 0.79 for NOx. The most predictive variables were: distance to the traffic access control zone; distance to primary schools; green space; official areas; bridges; and slope. The annual average concentrations of all pollutants were high, approaching those reported for megacities in Asia. At 1000 randomly-selected locations the correlations between cooler and warmer season estimates were 0.64 for NO, 0.58 for NOX, and 0.30 for NO2. Seasonal differences in spatial patterns of pollution are likely driven by differences in source contributions and meteorology. These models provide a basis for understanding long-term exposures and chronic health effects of air pollution in Tehran, where such research has been limited.
Air pollution is a complex mixture of gases and particles, and it has been associated with a wide range of health outcomes1,2. The latest estimates from the Global Burden of Disease (GBD) Study indicated that approximately 87% of the global population is exposed to ambient concentrations of fine particulate matter (PM2.5) that do not meet the guideline values set by the World Health Organization (WHO)3,4. This estimate is even higher when restricted to the populations of low- and middle-income countries (LMICs). In addition, air pollution was one of six modifiable risk factors associated with more than 5% of the GBD, as measured by disability-adjusted life years lost (DALYs)5. This burden is also reflected in Iran6,7, where the latest estimates suggest that approximately 7% of total DALYs are attributable to air pollution, which is ten times greater than the DALYs attributable to HIV/AIDS and tuberculosis combined8. Even so, the burden of air pollution might be substantially underestimated because (1) most of the exposure-response estimates are from high-income countries, and (2) the burden might not be fully captured by PM2.5 and ozone, which were the only indicators used in the GBD analyses. Furthermore, emerging evidence suggests that air pollution is associated with many chronic diseases not yet included in the GBD assessment, such as acceleration of atherosclerosis2, high blood pressure9,10, diabetes11,12, metabolic syndrome13, and possibly with neurodegenerative diseases such as multiple sclerosis14, vascular dementia and Alzheimer’s disease15,16.
The scientific community has consistently stated that lack of epidemiologic evidence from LMICs limits the generalizability of current air pollution findings4. One pillar of air pollution epidemiology is high quality exposure estimates17, but quantification of exposures at the individual level has been especially challenging in LMICs18,19,20. In light of the long-term health effects associated with reduced air quality, methods that estimate the spatial distribution of air pollutants are particularly useful. Land use regression (LUR) is a widely applied, state-of-the-science method used to map spatial variability in ambient air pollutants. Generally speaking, LUR uses local land use information characterized in geographic information systems (GIS) to estimate concentrations of air pollutants at any location within a city21. The land use variables can represent a broad range of characteristics in the area surrounding the locations, such as the type of land use, elevation, population density, point sources, and vehicle traffic22. Valid LUR models offer the opportunity to estimate air pollution concentrations at locations where no measurement data are available.
Two important considerations in LUR modeling are (1) the number of monitoring sites and (2) the locations of those sites within the study area23. Some LUR models are based on data from a small number of sites (for example, 17 sites in the Houston metropolitan area)24, whereas others use large numbers of local, national, or multi-national measurement locations (for example, 562 sites across 12 Spanish cities in Girona Province25 or 2400 sites across Europe)26. Previous work suggests that LUR models should be constructed using measurements from at least 80 locations27 identified by some algorithm to optimize their spatial variability28. However, several studies have used secondary data from existing regulatory monitoring networks, which typically have fewer locations24,29,30,31,32 with less spatial variability. In addition to these considerations, Basagaña et al. (2012) suggested that LUR analyses should use a restricted set of predictor variables, especially when the number of monitoring sites is small27. However, no LUR study to date has introduced a systematic approach to restricting the variable set.
Nitrogen oxides (NOx) are a group of highly reactive gases that contain different numbers of nitrogen and oxygen atoms, including nitric oxide (NO), nitrogen dioxide (NO2), and nitrous oxide (N2O). However, NOx is frequently considered to be the sum of NO and NO2 in atmospheric sciences33. Fossil fuel combustion produces NO34 as a primary pollutant. This free radical rapidly oxidizes in the atmosphere, scavenges tropospheric ozone, and converts to secondary NO235. Both mobile and point sources contribute to NO in Tehran. Iran benefits from large natural gas reserves36 that are used for most commercial processes and residential heating.
To date, LUR has been applied to model NO221,22,30,37,38,39 in many high-income countries and in some LMICs40,41. However, LUR models for NO and NOx are rare, especially in LMICs42. We previously reported LUR models for particulate matter (PM10) and sulfur dioxide (SO2) in Tehran, where the entire population was located in areas exceeding the WHO guidelines for both pollutants40. The results also suggested the potential for seasonal differences in the spatial patterns of more primary pollutants. Here we develop annual and seasonal models for NO, NO2 and NOx using data from the regulatory monitoring network.
Air quality data
None of the pollutants were normally distributed (p < 0.001). The annual median concentrations (interquartile range, or IQR) were 71.7 (59.3) ppb for NO, 50.9 (11.1) ppb for NO2, and 122.3 (55.1) ppb for NOX across the 23 monitoring stations. The cooler season medians (IQR) were 100.9 (87), 58.1 (19.7), and 155.9 (81.6) ppb, respectively, and the warmer season values were 43.4 (37.7), 41.7 (7.9), and 87.0 (38.0) ppb, respectively (Fig. 1). The correlation between the annual, cooler season, and warmer season concentrations ranged from 0.94 to 0.99 for NO, from 0.61 to 0.92 for NO2, and from 0.90 to 0.99 for NOX across the 23 monitoring stations. The between-pollutant correlations ranged from 0.25 to 0.46 for NO and NO2, from 0.85 to 0.95 for NO and NOX, and from 0.38 to 0.72 for NO2 and NOX (Table S2, supplemental information).
Final LUR models
Of the 210 potentially predictive variables (PPVs) we generated, 21 (10%) were significantly predictive in one or more of the LUR models. The R2 values for the final annual mean models were 0.78, 0.69, and 0.71 for NO, NO2 and NOX, respectively. They ranged from 0.69 to 0.79 for the cooler season models and 0.61 to 0.72 for the warmer season models (Table 1). Some of the variables appeared in multiple models. These included: (1) distance to the traffic access control zone; (2) distance to sensitive land use areas; (3) the natural logarithm of distance to the nearest primary school; (4) the natural logarithm of distance to the nearest hazardous facility; (5) slope; (6) the presence of bridges, and (7) areas of green, official/commercial, and other land uses (Table 1, and Tables S3–S11, supplemental information). The Moran’s I results were −0.07, −0.12, and −0.04 for residuals of the annual, cooler season, and warmer season NO models. All p-values were greater than 0.23. The values were similar for the NO2 and NOx models (not shown), with a minimum p-value of 0.06.
The R2 values for the leave-one-out cross validations ranged from 0.53 to 0.66 for the NO models, 0.51 to 0.58 for NO2 models, and from 0.42 to 0.63 for the NOX models (Table 1, and Tables S3–S11, Supplemental information). A final leave-one-out cross-validation (LOOCV) check was done for the coefficient of each predictive variable in the final regression models. The minimum and maximum of the LOOCV coefficients had the same direction of effect for all variables in all models. All the coefficients of variation ranged from 7% to 11%.
The limits of prediction for the annual, cooler season, and warmer season NO models were 16.4, 21.0, and 12.0 ppb, respectively. For the NO2 models they were 15.6, 14.9, and 16.3 ppb, respectively, and for the NOx models they were 46.6, 54.0, and 34.6 ppb, respectively. Overall, out of 24,505,474 grid cells in the modeling domain, a range of 0.2% to 16.0% of cells were increased to the limit of prediction and 0.0% to 5.3% of cells were truncated to 120% of the maximum observed concentrations (Table 2).
Agreement between the measured and predicted pollutant concentrations was relatively good (Fig. 2). The maps showed clear hotspots for the NO concentrations across the city. These were well-characterized by distance to the traffic access control zone, the natural logarithm of distance to the nearest primary school, surrounding areas of green land use, and slope. The NO2 concentrations were more dispersed and homogeneous throughout the city. The NOX maps were similar to the NO maps, and the hotspots were driven by similar variables, though they also reflected distance to the nearest hazardous facility and the presence of bridges (Fig. 3).
The correlations between the predicted annual, cooler season, and warmer season concentrations at 1000 randomly-selected sites were weak to moderate. Values for the warmer and cooler season estimates were 0.64 for NO, 0.58 for NOx, and 0.30 for NO2 (Fig. 4).
Materials and Methods
The megacity of Tehran is the capital of Iran. It covers an area of 613 km2, with the Alborz Mountains in the north and desert in the south. The populated areas within the city range from 1,000 to 1,800 meters above sea level (Fig. 5). The annual mean daily temperature is 18.5 °C, with highs of 43 °C in July and lows of −15 °C in January. The average annual precipitation is 220 millimeters (mm), with the maximum in March (39 mm) and the minimum in September (1 mm). The weather is typically sunny, with an annual average of 2800 h of bright sunshine and a mean cloud cover of 30%. The prevailing winds blow from the west and north (Figure S1, Supplemental information). Tehran is the most populous city in Iran, and the third largest city in the Middle East. There are approximately 9 million urban residents, with a daytime population of more than 10 million due to diurnal migration from the surrounding areas40,43.
Air quality data
Hourly NO, NO2 and NOx concentrations for the 2010 calendar year were obtained from 23 air quality monitoring stations administered by two government agencies (Fig. 5). Of the stations, 16 belonged to the Air Quality Control Company (AQCC), and 7 to the Department of Environment (DOE). Both the AQCC and DOE monitoring stations used chemiluminescence analyzers (Model AC 32 M of Environment SA, France; APNA-370 of Horiba, Japan; and EC 9841 of Ecotech, Australia) to measure nitrogen oxides. They follow quality assurance/quality control (QA/QC) procedures that, under ideal circumstances, ensure the instruments are checked and calibrated every two weeks. However, calibration gases can be challenging to obtain in Tehran.
A complete annual dataset would contain 8760 measurements (24 hours/day × 365 days in 2010) for each pollutant at each monitoring site. However, 28.1%, 27.7%, and 27.6% of the NO, NO2 and NOX values were missing, respectively (Figure S2, supplemental information). As in our previous work40, the Amelia program was used for imputation of the missing data (Page S5, supplemental information)44. The program uses a new expectation-maximization algorithm with bootstrapping to impute missing values and return a complete dataset. We provided the program with all available hourly concentrations from the different stations, along with the month, day, and hours of measurement. In order to evaluate the consistency and reliability of the missing data estimates we ran the Amelia program 10 times for each pollutant to impute hourly missing values, and calculated the resulting 10 annual averages for each monitoring station. The mean of the 10 imputation-filled datasets was calculated for NO, NO2 and NOX from January 1st, 2010 through January 1st, 2011 for all monitors, and these values were used as the LUR response variables.
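The bookkeeping around the imputation step can be sketched in a few lines of Python. This is only an illustration of the averaging described above, not the Amelia procedure itself: the function names are hypothetical, missing hours are assumed to be stored as None, and the toy series stands in for a full 8760-hour record.

```python
from statistics import mean

HOURS_IN_2010 = 24 * 365  # 8760 hourly values expected per pollutant per station


def missing_fraction(hourly):
    """Share of hourly values that are missing (None) in one station record."""
    return sum(v is None for v in hourly) / len(hourly)


def pooled_annual_mean(imputed_copies):
    """Average the annual means of several imputed copies of one station record.

    Each copy is a complete hourly series returned by one imputation run.
    """
    return mean(mean(series) for series in imputed_copies)


# Toy example: three hypothetical imputed copies of a four-hour record (ppb).
copies = [
    [10.0, 20.0, 30.0, 40.0],
    [10.0, 22.0, 30.0, 40.0],
    [10.0, 18.0, 30.0, 40.0],
]
station_mean = pooled_annual_mean(copies)
```

In the study, the same averaging would be applied to 10 imputed copies per pollutant and station, and the pooled value would serve as the LUR response variable.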
We also divided the year into warmer and cooler seasons based on our previous work40 and because Chen et al. (2010) reported different LUR predictor variables and spatial patterns in Tianjin, China during the heating and non-heating seasons. The same study also found that the predictive variables and the R2 values for the LUR models differed by season41. The warmer and cooler seasons were defined as April through September and October through March, respectively. These months were selected based on WHO guidelines for countries in the Northern hemisphere, and on the highest and lowest mean daily temperatures at Mehrabad International Airport in Tehran40.
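As a minimal sketch of the season definition above, the following Python assigns each calendar month to the warmer (April through September) or cooler (October through March) season and averages concentrations by season. The function names and the (month, concentration) record layout are assumptions made for illustration.

```python
from collections import defaultdict

WARMER_MONTHS = frozenset(range(4, 10))  # April (4) through September (9)


def season_of(month):
    """Return 'warmer' or 'cooler' for a calendar month numbered 1-12."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return "warmer" if month in WARMER_MONTHS else "cooler"


def seasonal_means(records):
    """Average concentrations by season.

    records: iterable of (month, concentration_ppb) pairs for one station.
    """
    buckets = defaultdict(list)
    for month, value in records:
        buckets[season_of(month)].append(value)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}
```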
We generated 210 PPVs in six classes and 73 sub-classes (Table 3). The six classes were Traffic Surrogates, Land Use, Distance Variables, Population Density, Product Variables, and Geographic Location. The Traffic Surrogates class described the vehicular network in buffers around the pollution monitoring stations. The Land Use class described ten land use types within buffers around the stations. The Distance Variables class measured the Euclidean distance (and natural logarithm of the distance) from each station to all of the Traffic Surrogate and Land Use types, and to other features. The natural logarithms of the distances were used based on studies that have reported exponential decay in air pollutant concentrations with increasing distance from pollution sources45,46,47,48. The Population Density was calculated for the total population and for the population excluding unemployed people and children younger than five years of age. The Product Variables class included the ratio of variables in the Traffic Surrogates class to the variables in the Distance Variables class. Finally, the Geographic Location class included the elevation of each monitoring site, obtained from a digital elevation model (DEM) of Tehran in meters above sea level, and a slope (gradient) variable that was created in GIS based on the DEM. The potential geospatial variables were selected based on previous studies and available information in Tehran. The raw GIS inputs were all in vector format, originating from the Japan International Cooperation Agency (JICA) and the Centre for Earthquake and Environmental Studies of Tehran (CEST)49. The final PPVs were all in raster format with a resolution of 5 × 5 meters, and their values in the grid cells underlying the monitoring stations were used for the regression analyses. All spatial analyses and figures were generated using ESRI’s ArcGIS 10.2.1 for Desktop (http://www.esri.com/).
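To make the Distance Variables class concrete, the sketch below computes the Euclidean distance from a monitoring site to the nearest feature of a given type, together with its natural logarithm. The study derived these values in ArcGIS, so this Python version is only an illustration. It assumes projected coordinates in meters, and the one-meter floor that guards against taking the logarithm of zero is an assumption that is not stated in the text.

```python
import math


def nearest_distance(site, features):
    """Euclidean distance (m) from a site to the closest feature point.

    site and each feature are (x, y) tuples in a projected, metric system.
    """
    return min(math.dist(site, f) for f in features)


def ln_nearest_distance(site, features, floor_m=1.0):
    """Natural log of the nearest distance, floored to avoid log(0).

    floor_m is a hypothetical safeguard, not a value taken from the study.
    """
    return math.log(max(nearest_distance(site, features), floor_m))
```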
Model development and diagnostics
The model building algorithm was based on one we developed for a previous study40. However, we further refined the algorithm to account for non-normality of the response variable, which can violate the assumptions of linear regression modeling. We also used transformations to linearize the relationships between the response variables and the PPVs, and we restricted the number of variables in the final model to the square root of the number of observations. The key steps of the updated stepwise algorithm are:
(1) Take the log transformation of the response variable.
(2) Check for normality using the Shapiro-Wilk test50.
(3) Apply a power transformation if not normally distributed.
(4) Linearize the relationships between the transformed variables and the PPVs using log and power transformations on the PPVs, and then proceed with the original algorithm40 such that steps (5) through (8) are done for every iteration (i.e. the addition of each new PPV to the model):
(5) Check the direction of the effect of each PPV in the model for consistency with a priori assumptions (Table 3) to ensure that final models did not contradict knowledge about pollution emissions and dispersion.
(6) Ensure a p-value of < 0.1 for each PPV.
(7) Ensure that each new PPV increases the coefficient of determination (R2) for a LOOCV51.
(8) Calculate a multicollinearity index called the variance inflation factor (VIF)52.
(9) Finally, restrict the number of predictor variables in the LUR model to $\sqrt{N}$, where N denotes the number of monitoring stations.
(10) Check the normality of residuals using the Shapiro-Wilk test50.
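Part of the selection loop described in these steps can be sketched in Python. This is a simplified illustration, not the authors' R function. It implements only the direction-of-effect check, the requirement that each added PPV raises the LOOCV R2, and the cap on the number of predictors. The transformations, the p-value threshold, the VIF check and the residual normality test are omitted. All function names are hypothetical, the cap is assumed to round the square root of N down, and the first PPV is assumed to need a LOOCV R2 above zero.

```python
import math


def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations.

    Assumes the predictors are not perfectly collinear.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def loocv_r2(X, y):
    """R2 of leave-one-out predictions against the observed values."""
    preds = []
    for i in range(len(y)):
        beta = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(beta[0] + sum(b * x for b, x in zip(beta[1:], X[i])))
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def forward_select(columns, y, expected_sign):
    """Greedy forward selection.

    columns: PPV name -> list of values at the monitoring sites.
    expected_sign: PPV name -> +1, -1, or 0 when no a priori direction exists.
    """
    max_vars = math.isqrt(len(y))  # assumption: sqrt(N) rounded down
    chosen, best = [], 0.0  # assumption: first PPV needs LOOCV R2 > 0
    while len(chosen) < max_vars:
        winner = None
        for name in columns:
            if name in chosen:
                continue
            trial = chosen + [name]
            X = [[columns[n][i] for n in trial] for i in range(len(y))]
            beta = ols(X, y)
            signs_ok = all(expected_sign[n] == 0 or expected_sign[n] * b > 0
                           for n, b in zip(trial, beta[1:]))
            score = loocv_r2(X, y)
            if signs_ok and score > best:
                winner, best = name, score
        if winner is None:
            break  # no remaining PPV improves the LOOCV R2
        chosen.append(winner)
    return chosen, best
```

With N = 23 stations, math.isqrt(23) gives a cap of four predictors under the round-down assumption.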
The algorithm was programmed as a function in the R statistical package. Its details are explained in pages S7–S11 of the supplemental information, and in the original paper by Amini et al.40. Models were constructed for average annual, cooler season, and warmer season concentrations of NO, NO2, and NOx.
To check the stability of the final LUR models, the regression coefficients for the LOOCV models were retained for all predictor variables in the final NO, NO2, and NOx models. The minimum, maximum, and coefficient of variation were calculated for the set of LOOCV coefficients, and models with lower variability were considered to be more stable. The spatial autocorrelations for all annual and seasonal NO, NO2, and NOx residuals were evaluated by calculating the global Moran’s I statistic. Values of Moran’s I range from −1.0 to 1.0, with −1.0 meaning perfect negative autocorrelation, 1.0 meaning perfect positive autocorrelation, and 0 meaning a random spatial pattern53.
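For readers unfamiliar with the statistic, a minimal Python sketch of the global Moran's I for model residuals follows. The spatial weighting scheme is not stated in the text, so inverse-distance weights between all pairs of sites are assumed here. The function name is hypothetical, and sites are assumed to have distinct coordinates.

```python
import math


def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights (an assumed scheme).

    values: residuals at the monitoring sites.
    coords: matching (x, y) tuples; all sites must have distinct locations.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # weighted sum of cross-products of deviations
    w_sum = 0.0   # sum of all spatial weights
    for i in range(n):
        for j in range(n):
            if i != j:
                w = 1.0 / math.dist(coords[i], coords[j])
                cross += w * dev[i] * dev[j]
                w_sum += w
    return (n / w_sum) * cross / sum(d * d for d in dev)
```

Residuals that cluster in space give a positive value and residuals that alternate between neighbors give a negative value. Values near zero, as reported for the final models, are consistent with no detectable spatial pattern in the residuals.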
When generating raster variables from vector data, raster cells outside of the buffer zones are returned as null (or “NoData” in ArcGIS). All null values for the Traffic Surrogates, Land Use, Distance Variables, Population Density, Product Variables, and Geographic Location variables were set to zero. The Raster Calculator in the ArcGIS Spatial Analyst Tools was used to render our final nine regression equations into maps that estimated annual and seasonal concentrations of NO, NO2, and NOX across the study area. We established a limit of prediction for low values, defined as the minimum observed concentration divided by the square root of two. All grid cells with estimates below this limit were set to this limit. Grid cells with very high estimates were set to 120% of the maximum observed concentrations, as per Henderson et al.22 and Amini et al.40.
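The two bounds applied to the mapped estimates can be written compactly. The Python sketch below raises any estimate under the limit of prediction (the minimum observed concentration divided by the square root of two) to that limit and truncates any estimate above 120% of the maximum observed concentration. The function name and the list-based representation of grid cells are assumptions made for illustration. The study applied these rules to rasters in ArcGIS.

```python
import math


def bound_predictions(predicted, observed):
    """Clamp predicted concentrations to the range used for mapping.

    Lower bound: min(observed) / sqrt(2), the limit of prediction.
    Upper bound: 120% of max(observed).
    """
    lower = min(observed) / math.sqrt(2)
    upper = 1.2 * max(observed)
    return [min(max(p, lower), upper) for p in predicted]
```

For example, with observed concentrations of 20, 50, and 100 ppb, the lower limit is about 14.1 ppb and the upper limit is 120 ppb.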
Seasonality of the spatial variability
In order to evaluate the effect of season on the spatial variability in NO, NO2, and NOx concentrations, we assessed the correlations between annual, cooler season, and warmer season estimates at 1000 locations within the study area. These were randomly selected using the Feature Class Data Management Tools in ESRI ArcMap 10.2.1 GIS (ESRI, Redlands, CA). We checked the normality of the estimate distributions with a Shapiro-Wilk test, and we calculated the Pearson or Spearman correlation depending on the results.
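The choice between the two correlation coefficients can be sketched as follows. This Python illustration uses only the standard library, so the Shapiro-Wilk test itself is not implemented. Its outcome is passed in as a flag. The function names are hypothetical.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def seasonal_correlation(cooler, warmer, both_normal):
    """Pearson if both sets of estimates passed a normality test, else Spearman."""
    return pearson(cooler, warmer) if both_normal else spearman(cooler, warmer)
```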
This study developed annual and seasonal LUR models for NO, NO2 and NOx for the Middle Eastern megacity of Tehran, Iran, using data from 23 sites in the air quality monitoring network. The models performed reasonably well for all pollutants and time periods. Because there are few comparable studies published for LMICs, the discussion will focus on the observed patterns in concentrations, and the strengths and limitations of the models.
We found that the 2010 annual NO, NO2, and NOX concentrations were relatively high in Tehran. The mean NO concentrations (88 ppb) were more than five times higher than those reported for other large cities, such as New York (16 ppb)54, and the mean NO2 concentrations (53 ppb) were almost 2.5 times higher than the recommended WHO guideline value of 21 ppb55. They were also considerably higher than the 2008 concentrations reported for many comparable megacities, such as Delhi (18.8 ppb), São Paulo (24.6 ppb), Tokyo (28.7 ppb), Mexico City (29.3 ppb), Los Angeles (34.5 ppb), and Dhaka (43.3 ppb), and approaching the values in Beijing (63.8 ppb)56.
Overall, the concentrations of nitrogen oxides were higher in the cooler season than in the warmer season (Fig. 1). This is consistent with the findings of Matte et al. (2013), where NO and NO2 concentrations in New York were higher in winter than summer54, and findings of Dons et al. (2014) in Antwerp (Belgium)57. The higher concentrations during the cooler season in Tehran could be due to residential heating, which is done primarily by natural gas36. There are also seasonal differences in meteorological factors given the specific topographical situation of the city, including inversions and low mixing heights. This may lead to more complex spatial variability in pollutants and different exposure patterns.
When considering the R2, adjusted R2, and LOOCV R2 values, model performance was better in the cooler season than in the warmer season for NO2 and NOX, but the opposite was true for NO (Table 1). Regardless, several of the cooler and warmer season models shared the same predictor variables. The most predictive variables for all pollutants were surrogates of traffic impact, including distance to the traffic access control zone (DIST to TACZ) in the NO and NOx models (Table 1, and Figure S3, supplemental information). This is a high traffic zone in the middle of Tehran, with access restricted to authorized vehicles on working days. It supports the hypothesis that the major source of NO and NOx in Tehran is vehicles and traffic. The natural logarithm of distance to nearest primary school (LNDIST to PRSC) appeared in eight out of nine models. All models indicate that the primary schools tended to be located in less polluted areas (Figure S4, supplemental information).
Another important predictor was green space within buffers up to 500 meters. The negative coefficients suggest that concentrations of nitrogen oxides decreased as the green space increased, which supports the call for urban greening to improve air quality and overall health58. We also observed increasing nitrogen oxides with increasing elevation, but decreasing concentrations with increasing slopes. This may reflect different traffic flows through the city, where the northern and southern outskirts differ in elevation by almost 800 meters. In the cooler season, the NO concentrations were also increased in areas with higher total population density, which is consistent with the hypothesis that seasonal differences were driven by residential heating. Both the annual and warmer season mean NOx concentrations increased with higher bridge density. These are predominantly land bridges that allow one roadway to pass over another roadway, replacing roundabouts and traffic lights to control traffic flow. They are found at most intersections of major roads in Tehran, and also at many smaller intersections.
The correlations of cooler and warmer season measured concentrations across the 23 fixed sites were very high for NO and NOx, but they were reduced to 0.64 and 0.58, respectively, across the 1000 randomly-selected locations. The correlation for NO2 was 0.61 across the fixed sites and 0.30 across the 1000 locations. Visual inspection of the pollution maps showed some interesting seasonal differences in the spatial distributions of NO2. One region in the northern part of the city appeared highly polluted in the cooler season, but not in the warmer season. This region had some gaps in the monitoring data, so the validity of the model may have been compromised despite our use of Amelia (see S4 to S6, supplementary information). Overall, however, our findings suggest that epidemiologic studies based on long-term exposures should account for seasonal patterns in the spatial data.
To date, many LUR models have been developed for NO2 in high-income countries, mainly because NO2 is quite easy to measure with passive samplers21. However, some studies have also modeled NO and NOx22,37,38,59. In all of these studies, direct or surrogate measures of traffic have been the most predictive variables. For example, in Oslo (Norway) all oxides of nitrogen were modelled using elevation, length of large roads in a 100 m buffer, length of medium roads in a 250 m buffer, and length of small roads in a 1000 m buffer based on 80 measurement locations37. In Tehran, the models were mostly driven by distance to the traffic access control zone and the presence of bridges in a 400 m buffer. Su et al. (2009) conducted a study to estimate NO, NO2, and NOx using 201 locations in Los Angeles (California) for two seasons. They found that traffic volume, truck routes, road networks, land use, greenness, and slope gradients were the most predictive variables38. We found similar explanatory variables in Tehran using data from 23 regulatory monitoring locations. However, the magnitude and ranking of the R2 values in Los Angeles were more similar to those in Oslo, with 81% for NO, 85% for NOx, and 86% for NO2 (ref. 38). Results from Vancouver (Canada) are also consistent with our findings in Tehran, with traffic variables, elevation, geographic coordinates, and total population within a 2500 m buffer radius driving the models22. In Montreal (Canada) Gilbert et al. (2005) found that distance to highways, lengths of roads within buffers of 100–500 m, open space, and population density within a radius of 2000 m were the most predictive variables for NO2, and the best-fitting model had an R2 of 0.54 (ref. 60).
The use of fixed site monitor locations to develop the LUR models can be both a strength and a limitation. Readily-available data from validated instruments allow academics and government agencies to regularly model the spatial variability in air pollutants with minimal additional costs. However, the locations for fixed monitoring networks are generally chosen by criteria that may not optimize their ability to capture the variability necessary for spatial modelling28. Although we did not evaluate whether 23 measurement sites are sufficient to reliably model spatial variability in a megacity such as Tehran27, our future work will examine this question in more detail.
Another limitation is that some predictor variables could not be assigned a direction of effect a priori due to lack of previous knowledge or other studies. This, in turn, might have caused inconsistent effects of variables in the regression models. These variables include urban facilities, sensitive areas (such as military and protected government areas), other land use variables, distance to hazardous facilities, distance to food shops, distance to airports, distance to health and ambulance services, elevation, and slope gradients. Therefore, we suggest conducting further studies in Tehran to better specify the impact of these areas on air pollution concentrations.
We found significant seasonal differences in the spatial variation of nitrogen oxides in Tehran, especially NO2. However, the small number of measurement sites in our study might affect these findings. Examples of LUR models are rare in LMICs, and these results are relevant for the next generation of exposure assessment, population-based health research, and policy-making in such contexts. In addition, this work establishes a benchmark for future air pollution modeling in Tehran. Overall, our models performed relatively well. Our next step is to evaluate whether a larger number of monitoring sites selected with a strict algorithm produces different results and/or different conclusions about the spatial patterns reported here.
How to cite this article: Amini, H. et al. Annual and seasonal spatial models for nitrogen oxides in Tehran, Iran. Sci. Rep. 6, 32970; doi: 10.1038/srep32970 (2016).
The authors acknowledge Tehran Air Quality Control Company and Department of Environment for providing the necessary data. Heresh Amini holds a Swiss Government Excellence Scholarship (ESKAS) for PhD and a project stipend of the PhD Educational Platform in Health Sciences (PPHS) from the Faculty of Medicine, the University of Basel.
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/