## Background & Summary

Fine particulate matter (aerodynamic diameter <2.5 μm; PM2.5) is widely known to have significant adverse effects on human health1,2,3. While recent studies have shown air quality improving for the contiguous United States from the reduction of industrial and vehicular emissions4,5, air pollution in wildfire-prone areas, particularly in the mountain west region of the United States, has increased and is projected to further worsen due to climate-mediated increases in wildfire activity6,7,8. These communities impacted by wildfire smoke from nearby and distant wildfires experience high episodic exposures to PM2.5 with concentrations often exceeding 24-hour ambient air quality standards for extended periods9,10.

Of particular concern in western rural states is the scarcity of air quality monitoring stations, which provide the data needed to deliver accurate health warnings and predictions to the public. Air quality monitoring stations are often located to represent worst-case exposures for the largest concentration of people or sited to capture background exposure. For example, in 2021 there were 22 sites in the Montana network that monitored PM2.5 (16 = Population Exposures, 5 = Background Exposure, and 1 = Source Impact; https://www3.epa.gov/ttnamti1/files/ambient/pm25/qa/vol2sec06.pdf).

In the intermountain west, the sparsity of air quality monitoring stations is further complicated by the region’s complex terrain which likely contributes to significant heterogeneity in air pollution levels across communities11. Even in populous communities with fixed monitoring sites, the spatial variation can be high during the wildfire season due to inversion and drainage flow features common in the intermountain west. Many areas in the intermountain west also experience increased wintertime risk of poor air quality due to cold-air inversion, trapping air pollutants in mountain valleys where most towns and residents are located12. Regardless, it is unlikely that the single air monitor sites in many intermountain west communities provide an accurate representation of pollution exposure, as suggested by other urban area focused studies13. Thus, improved spatially resolved maps of surface PM2.5 accounting for variability in terrain are needed to enhance understanding of particulate matter impacts on public health during both the winter and wildfire season. Such maps would provide the spatial and topographically resolved data needed to identify fine scale PM2.5 effects on specific diseases, such as respiratory disease (e.g., influenza and wildfire season PM2.514).

Previous studies have demonstrated that by combining ground observations with satellite and weather data, surface PM2.5 can be mapped with reasonably high accuracy. A wide range of modelling approaches have been evaluated, including machine learning algorithms15,16, spatio-temporal modelling and interpolation17, ensemble approaches18, linear regression19,20, and geographically weighted regression (GWR)21. Many of these models rely on satellite-based measurements of aerosol optical depth (AOD) data in combination with ground-based air quality observations. AOD measures the optical extinction in the air column. The spatial coverage of AOD data allows researchers the ability to fill in where EPA’s PM2.5 monitoring sites are lacking. However, AOD data have their limitations. AOD attempts to measure the mass of aerosols in the entire atmospheric column, which may not correlate with measurements at the surface. In many areas of the intermountain west snow and/or cloud cover will produce missing data, with some locations having no winter observations22. Likewise, the semi-arid western US MODIS AOD data are also unreliable due to high heterogeneous surface albedo23. Additional infilling techniques are needed when working across these areas and with MODIS data.

Despite extensive efforts to model surface PM2.5, few datasets are publicly available for use in air quality and public health studies24. Here, we describe the development of daily PM2.5 grids for the western United States. We use air quality observations from the EPA combined with weather data from global reanalysis, high resolution MODIS AOD data20 and static spatial covariates to produce daily grids at approximately 1-kilometer resolution from 2003–2020. The dataset we describe here improves upon previous daily PM2.5 models by providing one of the longest time series of daily surface PM2.5 and through the use of a terrain component designed specifically to capture the potential for increased surface PM2.5 in valleys under stable atmospheric conditions. This term provides a significant improvement in predictive accuracy year-round and particularly in winter when MODIS data are frequently unavailable.

## Methods

An overview of the data and methods is shown in Fig. 1.

### Study area

The study area consisted of the contiguous United States west of −102 degrees longitude, or the 11 western-most states. The study period began 2003-01-01 and ended 2020-12-14.

### Surface fine particulate matter measurements

We obtained surface PM2.5 measurements from the United States Environmental Protection Agency (EPA) Air quality monitoring system (AQS)25. This included 1,935,006 observations from 823 unique locations (Fig. 2). Hourly measurements were averaged over each day. The annual average daily count of observations ranged from 151 to 404, increasing each year, but with many stations recording every 3 days the count varied greatly from day to day. This variability was more pronounced at the beginning of the study.

### Predictor variables

#### Satellite aerosol optical depth

Satellite Aerosol Optical Depth (AOD) measures particulate matter in the atmosphere that is a column integrated optical measure of PM (specifically, total suspended particles) extinction in the atmosphere. We used daily MODIS MCD19A2 AOD at 1-km resolution as a predictor following many other studies15,16,17,18,19,20,21. AOD is noted to correlate well with wildfire smoke26, but is affected by high surface albedo in semi-arid regions of the US23, clouds and snow cover. In areas with persistent snow cover a pixel might be continuously infilled for up to 5 months, and on the date with the greatest cloud cover only 4% of the total pixels were available. To avoid cloud, snow and albedo artifacts we used a double masking approach described below.

#### Smog potential layer

We developed a single static 1-km resolution smog potential layer by combining a number of spatial predictor layers in a GBM modelling framework. The full list of predictors is given in Supplementary Table 1.

A suite of topographic indices were derived from the 30-m resolution National Elevation Dataset27 (NED) Digital Elevation Model (DEM) resampled to 1-km resolution. The primary terrain covariates we considered were local minima functions that measure the log of the vertical distance between a given point and the lowest elevation within a set radius. The radii considered ranged from 700-m to 50-km. We also considered the morphometric protection index28, a measure of terrain openness within a 2 km radius. As additional descriptors of terrain we included a suite of MODIS minimum and maximum land surface temperature (LST) monthly 2003–2012 normals (https://modis.gsfc.nasa.gov/data/dataprod/mod11.php).

To represent pollution sources, we retrieved a suite of 114 gridded PM2.5 pollution emission surrogates from the United States EPA SMOKE program29 at 4-km resolution (https://www.cmascenter.org/download/data.cfm) and resampled to the 1-km grid. In addition, we obtained gridded 60-m resolution population density data from the United States Census Bureau for 2010. The population density data were resampled to 1-km resolution then kernel smoothed using a 2.5-km radius gaussian kernel.

#### Daily smog intensity

For prediction of daily smog intensity, we considered gridded 0.5-degree resolution meteorological data from the Climate Forecast System Reanalysis (CFSR)30; including upward longwave surface radiation, 700-mb geopotential height, wind speed at 10-m above ground, mean relative humidity at 2-m above ground, maximum temperature at 2-m above ground, and boundary layer cloud cover. The geopotential height was standardized daily on a per-pixel basis by subtracting the mean for that day of the year (1979–2014) and dividing by the standard deviation.

#### Fire perimeters

We considered the area of actively burning wildfire in a post-hoc analysis of model performance. To estimate fire activity, we downloaded all available fire perimeter shapefiles from geomac31 and converted these to raster grids matching our 1-km grid. For each date, we summed the number of pixels within the available fire perimeters.

### Modelling

#### Daily surface PM2.5 prediction creation

Our general approach was to use daily Geographically Weighted Regression (GWR) to make gridded predictions of surface PM2.5 using 3 covariates: (1) daily AOD, (2) static smog potential, and (3) daily smog intensity. We also considered the interaction between smog potential and smog intensity. Our approach follows methods described by Holden et al.32 used for modelling spatiotemporal dynamics of nocturnal cold air drainage. All models were fitted in R version 4.1.333, using the gbm34, dismo35, spgwr36, geoR37, gstat38 and raster39 packages plus custom code written by A.S. Our code is available with a worked example in Supplementary Material.

Daily average PM2.5 data were downloaded from the EPA AQS Transfer Network25. A log transformation was applied prior to modelling to correct for positive skewness of the PM2.5 observations: most observations are near zero (median = 5.8 ug/m3) but there were numerous extreme values as large as 818 ug/m3. Since our data contained zeros, 1 was added before applying the log transformation. Predictions in the log scale were back-transformed using the exponential function and subtracting 1. Log bias is mathematically expected in back-transformed log predictions, so we corrected for this using the Duan method40 on a daily basis, in which a linear model (LM) for observed surface PM2.5 as a function of predicted surface PM2.5 is fit with no intercept. The slope term of this LM is then used to multiply the back-transformed predictions. Since the back-transformation can yield negative values, we changed all negative values to zero. Similarly, we truncated all values greater than 1000 since the most extreme station observation was 818.

We retrieved 1-km resolution MODIS AOD data from the MODIS collection 6 MAIAC algorithm22. MODIS data have considerable contamination from snow and clouds, particularly in mountainous areas of the western United States, as well as high surface albedo in semi-arid regions of the US23. In addition, there are known issues with smoke being masked as cloud, which we were unable to address since it would require changes to the AOD retrieval algorithm41. We applied 3 methods for removing pixels which likely contained errors. First, we screened the data to include only high-quality pixels based on the quality assurance layer provided with the MODIS product. We then used data from the 250-m resolution 8-day MODIS Normalized Difference Vegetation Index (NDVI) provided by the Global Agricultural Monitoring System (GLAM)42. Previous work with these data14 showed that missing values from this product accurately indicated snow covered days. The snow mask derived from this source was buffered by 3 pixels to more conservatively eliminate erroneous values due to snow. Lastly, we developed a static mask for consistently bright areas due to highly reflective ground surface, such as may occur in salt pans and dry lake beds (https://lpdaac.usgs.gov/documents/110/MCD19_User_Guide_V6.pdf). This was based on a threshold of 100 applied to the 5th percentile of fall (September through November) 2000–2020 values for each pixel. Following removal of missing or contaminated pixels, we used a simple spatial infilling approach to fill in missing pixels based on the nearest non-missing data values. Although crude, this was deemed adequate as we found our GWR models naturally down-weighted the infilled areas. Dates that were unavailable (n = 50) were temporally interpolated from the nearest previous and subsequent dates on a per-pixel basis. Proportion of non-missing data averaged by month is shown in Fig. 3.

We used an iterative approach to develop a static spatial model of smog potential and a dynamic model of daily smog intensity. Under stable atmospheric conditions, which occur most frequently during the winter months (December - February) in western North America, pollution can accumulate near its source, often settling into valleys and leaving upslope areas with relatively cleaner air43. Topography can greatly influence the degree of accumulation, with more enclosed valleys trapping more pollution. Our smog potential layer integrates pollution sources and topography to predict where surface PM2.5 accumulates under stable atmospheric conditions, while the smog intensity models when and where stable conditions occur. The following steps were used to create these layers:

1. 1.

We calculated mean winter PM2.5 at each station and modelled this as a function of local emissions and topography. Winter observations (Dec. 1 – Feb. 28) were used because the smog effect is strongest in winter and smoke-driven AOD effects are minimized. Locations with less than 100 observations were excluded. We used gradient boosting machines (GBM; also known as generalized boosted regression models or boosted regression trees)44 to fit this model, with spatially independent cross-validation (CV) to control for over-fitting. GBM is a machine learning method that fits a series of decision trees sequentially. The first tree is fit to the response variable and each subsequent tree is fit to the errors of the ensemble of trees leading up to that stage, with weighting used to emphasize difficult to fit observations. At each stage a new tree is chosen to minimize a loss function, which in our case is the sum of squared errors from a 10-fold cross validation. The CV groups were defined by dividing the study area based on k-means clustering of geographic coordinates. GBM models require 3 main parameters to be specified. The learning rate (also known as the shrinkage parameter) controls the relative contribution of each new tree. A larger learning rate causes the loss function to decrease more rapidly as trees are added. The tree complexity controls the number of branches of each tree, which is analogous to the order of interactions allowed. The third parameter is the number of trees. In our case we set tree complexity to 3 and chose a learning rate such that the loss function was minimized after 1000 to 2000 trees were added. All topographic and pollution source predictors described above and listed in Table S1 were put through an initial screening using Pearson’s correlation test, the remaining covariates were used in an initial GBM model. Model selection was accomplished by iteratively removing the predictor with the smallest contribution and choosing the smallest model to minimize CV log-likelihood, which is theoretically similar to AIC45, such that more complex models did not result in a substantial improvement. The final model was fit using all data and predictions were made at all sample locations

2. 2.

After the initial smog potential model was fit, we used a linear regression model to predict daily station PM2.5 using a suite of coarse-resolution (0.5-degree) weather covariates and their interactions with estimated smog potential. From this model we extracted the multiplier for smog potential by grouping the smog terms and considering only those regression coefficients including smog potential and its interactions. Equations 13 show how this was derived.

$$\begin{array}{lll}{y}_{i} & = & {b}_{0}+{b}_{1}\ast ao{d}_{i}+{b}_{2}\ast smo{g}_{i}+{b}_{3}\ast ge{o}_{i}+{b}_{4}\ast longwav{e}_{i}+{b}_{5}\ast win{d}_{i}\\ & & +{b}_{6}\ast smo{g}_{i}\ast ge{o}_{i}+{b}_{7}\ast smo{g}_{i}\ast longwav{e}_{i}+{b}_{8}\ast smo{g}_{i}\ast win{d}_{i}+\in \end{array}$$
(1)
$$\begin{array}{ccc}{y}_{i} & = & {b}_{0}+{b}_{1}\ast ao{d}_{i}+{b}_{3}\ast ge{o}_{i}+{b}_{4}\ast longwav{e}_{i}+{b}_{5}\ast win{d}_{i}\\ & & +({b}_{2}+{b}_{6}\ast ge{o}_{i}+{b}_{7}\ast longwav{e}_{i}+{b}_{8}\ast win{d}_{i})\ast smo{g}_{i}+\in \end{array}$$
(2)
$$smog\_intensit{y}_{i}={b}_{2}+{b}_{6}\ast ge{o}_{i}+{b}_{7}\ast longwav{e}_{i}+{b}_{8}\ast win{d}_{i}$$
(3)
3. 3.

A preliminary smog intensity was calculated by predicting Eq. (3) to all times and locations.

4. 4.

We developed a refined estimate of station smog potential by averaging surface PM2.5 under a set of winter days (Dec. 1 – Feb. 28) within a constrained range of preliminary smog intensity values. Stations with fewer than 100 observations meeting this criteria were removed. Mean surface PM2.5 under these standardized conditions was again modelled using GBM as in step 1, giving a final model for smog potential. Predictions from this model were made over the 1-km grid.

5. 5.

A final linear model was fit using the refined smog potential model and its interactions with weather covariates. The smog intensity derived from this linear model was used as a predictor for daily surface PM2.5 as in step 2. Leave one out cross-validation (LOOCV) yielded an R2 of 0.73.

6. 6.

The errors in the smog potential layer were examined for spatial dependency. Variogram modelling using an exponential function estimated autocorrelation out to an effective range of 156-km. These errors could be due to atmospheric effects or pollution sources not accounted for in our model of smog potential. Addition of the kriged error term to the GBM fits increased LOOCV R2 at sample locations to 0.76. The GBM errors were kriged over the 1-km grid and added to the GBM fit to produce a final model of static smog potential. The initial smog potential fit, kriged error surface and final model are shown in Fig. 4.

Geographically weighted regression (GWR)46 was used to interpolate daily station PM2.5 using the smog potential, smog intensity, interaction between smog potential and smog intensity and AOD as predictors. This technique fits a unique weighted linear regression model at a set of user defined points, with weights based on the geographic distance from each point to the location of PM2.5 observations. GWR is an ideal technique to use when the relationships between the response and predictors are non-stationary in space, such as other studies have noted in the relationship between AOD and PM2.5. Since the AOD and PM2.5 relationship is known to vary seasonally14, we fit a unique GWR model daily. To define weights, we chose a gaussian kernel with an adaptive bandwidth selected to include a fixed proportion of data points. We found that the optimal proportion varied with the number of daily observations, so we used a value of 10% for days with n > 200 observations, 20% for 100 < n < 200, and 50% for n < 100.

The daily GWR fits gave an additional advantage in dealing with the large swaths of missing values in satellite AOD. The large infilled gaps in the AOD covariate cause its value as a predictor to vary in space and time. Daily GWRs allow this predictor to be down-weighted when and where its information content is low.

To evaluate the predictive performance of our model we performed a leave one out cross-validation (LOOCV)47. In this scheme, for each day, a series of GWR models are fit with a single observation withheld. We also refit out static smog model with each point withheld. Predictions to the withheld locations are retained for model validation. Thus, a separate GWR was fit to each and every PM2.5 observation. LOOCV provides a measure of predictive accuracy analogous to considering an entirely new location.

In addition to fitting the full model, which includes both AOD and smog terms, we fit an intercept-only null model, an AOD-only model, and a smog-only model. This allowed for a comparison of the relative strength of the predictors.

We made our GWR predictions to a coarser 10-km grid to ease the computation burden. The set of regression coefficients estimated at each point of the 10 km grid were resampled to the 1-km grid using bilinear interpolation, then multiplied by the 1-km covariates to give 1-km resolution predictions of PM2.5. We found this gave a negligible loss in predictive accuracy because the GWR coefficients varied spatially on a scale much larger than 10-km.

## Data Records

Table 1 lists the names and descriptions of the datafiles that are available on figshare48. We provide annual files in NetCDF format containing daily 1-km resolution surface PM2.5 estimates. We also provide the static 1-km resolution smog potential layer. Additional input data sources are publicly available and accessible online. R code used in our analysis is also provided.

## Technical Validation

### Performance of static smog prediction model

Table 2 shows correlation coefficients between the coarse resolution CFSR meteorological variables and daily winter PM2.5 at all locations, as in step 1 of the methodology for developing the smog layers. These show that mean surface PM2.5 is higher under stable conditions defined by low wind, high pressure, low humidity, strong upward longwave radiation flux, and low cloud cover. MODIS AOD and surface Tmax are relatively poor predictors of winter PM2.5. The top 5 predictors were used in a linear model to predict daily smog intensity.

Table 3 shows the correlation between the top 11 spatial predictors and mean surface PM2.5 of stations under stable conditions as defined by the daily smog intensity model. Population density is the strongest predictor but the metrics of slope position are also strong as indicated by the high correlation coefficients for the local minima functions, which measure the vertical distance between a given location and the lowest point within a defined radius. These predictors were used in a GBM to produce gridded estimates of smog potential. The raster map was able to explain 78% of the variation in estimated smog potential. As a final step we kriged the prediction errors over our 1-km grid using an exponential variogram function with an effective range of 156-km. This gave a final map that explained 89% of the variation in estimated smog potential. We also performed a LOOCV on the GBM model and kriging step. For each iteration we withheld a single point to fit the GBM and perform the error kriging. This gave R2 values of 0.755 and 0.719 for the predictions with and without the kriged error term, respectively.

### Performance metrics on testing data

Table 4 shows the performance metrics (RMSE, MAE and R2) resulting from the LOOCV validation of our full model and the 3 reduced models. The results from our full model are comparable to similar studies also using stringent validation on spatially independent data (Reid et al. 2021). Figure 5 shows a scatterplot of observed versus LOOCV predicted values. This shows that our model tends to underpredict the largest extreme values. A regression of predicted on observed values and forced through the origin yielded a slope of 1.00 (SE = 0.0004). Figure 6 shows the spatial representation of the RMSE with a scatterplot against the distance between each stations, showing larger distances with smaller error structure.

Note that RMSE is highly correlated with average PM2.5 (r = 0.57), since low PM2.5 values limit the magnitude of errors.

Comparison of the full and reduced models yielded results that support the hypothesis that wildfire smoke is the dominant source of PM2.5 in the summer and early fall, while smog is dominant in winter. Figure 7 shows the difference in R2 by month between full, AOD-only and smog-only models versus the null model, which was an intercept-only GWR. The smog sub-model contributes strongly in the winter and less so in the summer and early fall. This is not surprising since the stable conditions which allow smog to accumulate occur predominantly in the winter months over western North America. In contrast, the AOD-only model outperforms the smog-only model in July, August and September. This is also not surprising as these are the months considered fire season in western North America. Figure 8 shows raster odell for January and August, calculated by averaging all raster maps for those months. It shows how our model predicts population density and topography to dominate in winter months while during the summer PM2.5 pollution is more widely distributed, presumably due to the dispersed effect of wildfire smoke.

Since AOD is hampered by missing values during the winter months, we also calculated R2 for the AOD-only model for only those observations for which non-infilled AOD was available. This increased the apparent performance of the AOD-only model during the winter months, but a true comparison is difficult since the subset of valid AOD winter observations generally come from the southwest portion of our study area where snow and cloudy conditions are less predominant. Figure 9 gives an example for the Salt Lake City, Utah, USA, area for the observation station values correlated to the modelled raster mean PM2.5 (μg/m3; R2 = 0.554).

The relationship between AOD predictive performance and wildfire smoke is further borne out by comparing AOD model performance versus wildfire activity. Figure 10 shows the relative difference in monthly model performance between the AOD-only and null models with respect to the area of actively burning wildfire for those same months. We see that the AOD covariate becomes increasingly important when area burning exceeds 100 km2. This implies that AOD is able to effectively delineate the spatial pattern of wildfire smoke.

In summary, we feel our model provides for realistic predictions of PM2.5 pollution across areas of the western United States where measurements are sparse. We show that topography and pollution sources can be used to predict the distribution of PM2.5 under stable meteorological conditions, which complements the well-documented ability of satellite-based AOD to predict PM2.5 driven by wildfire. We hope the public availability of these data will prove useful to researchers studying the health effects of PM2.5 pollution.