Annual and seasonal spatial models for nitrogen oxides in Tehran, Iran

Very few land use regression (LUR) models have been developed for megacities in low- and middle-income countries, but such models are needed to facilitate epidemiologic research on air pollution. We developed annual and seasonal LUR models for ambient oxides of nitrogen (NO, NO2, and NOX) in the Middle Eastern city of Tehran, Iran, using 2010 data from 23 fixed monitoring stations. A novel systematic algorithm was developed for spatial modeling. The R2 values for the LUR models ranged from 0.69 to 0.78 for NO, 0.64 to 0.75 for NO2, and 0.61 to 0.79 for NOx. The most predictive variables were: distance to the traffic access control zone; distance to primary schools; green space; official areas; bridges; and slope. The annual average concentrations of all pollutants were high, approaching those reported for megacities in Asia. At 1000 randomly-selected locations the correlations between cooler and warmer season estimates were 0.64 for NO, 0.58 for NOX, and 0.30 for NO2. Seasonal differences in spatial patterns of pollution are likely driven by differences in source contributions and meteorology. These models provide a basis for understanding long-term exposures and chronic health effects of air pollution in Tehran, where such research has been limited.

. Wind-rose diagram for the Mehrabad International Airport meteorology station, indicating the wind is predominantly from the west and north. Figure S2. Details of the data available for each air quality monitoring station. The x-axis shows time in months and y-axis shows available data for each of stations out of 23. The green color for each station shows that the data were available and the white areas are missing data.

Imputation of missing data
The Amelia program was used for imputation of the missing data ( Figure S2 shows missing NO, NO 2 and NO X measurements) 1 . The program uses a new expectation-maximization algorithm with bootstrapping to impute missing values and return a complete dataset. We provided the program with all available hourly concentrations from the different stations, along with the month, day, and hours of measurement. In order to evaluate the precision of the missing data estimates we ran the Amelia program 10 times for each pollutant, and calculated the resulting 10 annual and seasonal averages for each monitoring station. Next, we calculated the coefficient of variation (CV) between the 10 annual and seasonal means. If the CV was small (less than about 5%), the estimates were considered acceptable. If not, the station was removed from further analyses because of the low precision of the annual and/or seasonal estimates.
The maximum CVs for the 23 stations used to model NO ranged from 3.9% for the annual to 4.5% for the cooler season. The maximum CVs for the 23 stations used to model NO 2 ranged from 3.8% for the warmer season to 5.5% for the cooler season. For 23 NO X stations, the maximum CVs were 2.9% for the warmer season to 3.2% for the cooler season. The CV averages for NO stations ranged from 0.8% for the annual to 1.1% for the warmer season. The CV averages for NO 2 stations ranged from 0.7% for the warmer season to 1.0% for the cooler season. The CV averages for NO X stations ranged from 0.7% for the annual to 0.8% for the cooler and warmer seasons (Table S1).

Model development and diagnostics
We developed a systematic algorithm that considered 10 key pieces of information: (1) Take the log transformation of the response variable

The rationale for step 1 to 3
We decided to first log-transform the response variable data in step 1 with the benefit that the predictions get non-negative values-though it could also somewhat resolve the challenge of having a response variable that is not normally distributed (step 2). Finally, we tried to use power transformations to have a normally distributed response variable (step 3). Though in general it is required that the residuals or errors should be normally distributed in a linear regression analysis (see step 10), once the response variable data are not normally distributed, most likely the residuals might also not have a normal distribution 6 . We checked the normality assumption by Shapiro-Wilk test 2 .

The rationale for step 4
In order to linearize relationship of each explanatory variable with transformed response variable of stage 3, we tried various powers for each predictor variable to find ones that produce most

The rationale for step 5
There are two approaches to LUR model building. One approach favors the most predictive model and the other approach favors the most easily interpreted model. The two are not, necessarily, mutually exclusive. In the first approach, the most predictive variables are kept regardless of their sign. We did not feel that this was the right approach for Tehran, because we wanted to be able to readily explain the modeling results to the epidemiologists who will be  14 . We made this choice based on our situation in Tehran, to ensure that any associated epidemiologic results to be easily interpreted by policy makers.

The rationale for step 6
Some LUR studies have observed that insignificant predictors (p-values > 0.1) can increase the total R 2 of the regression equation 9,10,15,16 . Thus, to prevent the inclusion of such variables in our models, we set another criterion in the sixth piece of the algorithm to include only those variables with significant p-values (p < 0.1). The p < 0.1 was selected because it is widely applied in the LUR community 7,9 .

The rationale for step 7
A recent analysis from Girona (Spain) demonstrated that LUR models developed from fewer sites had higher model R 2 values, lower LOOCV R 2 values, and different predictive variables than models developed from more sites 17 . To account for this, we designed a model-building algorithm that selected variables based on the improvements to the LOOCV R 2 value instead of the model R 2 or adjusted R 2 . Model R 2 is a measure of internal validity of the model while LOOCV R 2 is a measure of external validity, thus, a more appropriate measure for model selection 18 . We believe this method, especially for study areas with small number of sites, leads to the generation of models with high R 2 and LOOCV R 2 values, as well as generation of temporal models with an internally consistent set of potentially predictive variables.

S10
The rationale for step 8 The VIF is the reciprocal of Tolerance, and both are multicollinearity indices. The VIF is calculated as the following equation: Where 1 − 2 is the tolerance and 2 denotes the proportion of variance in the ith predictor, which is correlated with the other predictors in the regression equation 5 .
There is no consensus in the LUR community about the VIF cutoff that should be used for LUR  18 . Although we chose this value for our analyses, the VIF of predictor variables was less than 4 in all cases.

The rationale for step 9
A recent analysis from Girona (Spain) demonstrated that LUR models should be restricted to a set of potential predictor variables 20 . We therefore decided to restrict the number of predictor variables in a LUR model to square root of number of measurement sites. We believe this restriction could provide more realistic R 2 values. S11

The rationale for step 10
This step is one of the assumptions of any regression analysis. In fact, the residuals of final regression model should be normally distributed 21 in order to have valid p-values for predictors' coefficients in regression model. For variables of the form XXX.YYY the XXX indicates the variable type, and the YYY indicates the buffer size, in meters. Response of the model is (Ln (warmer season NO X )) -4 ; hence, predicted pollutant is Exp (model -0.25 ) Radius variable types included in the models were: OFIC = official/commercial land use area The linear distance variable included in the model was: DIST to TACZ = distance to traffic access control zone The log-linear distance variable included in the models was: LNDIST to PRSC = log distance to the nearest primary school Other variable included in the model was: BGD = product of bridge length in a buffer radii divide to distance to the bridges SLP = slope Abbreviations: LOOCV, leave-one-out cross-validation; RMSE, root mean square error; SE, standard error; VIF, variance inflation factor