Daily PM2.5 concentration estimates by county, ZIP code, and census tract in 11 western states 2008–2018

We created daily concentration estimates for fine particulate matter (PM2.5) at the centroids of each county, ZIP code, and census tract across the western US, from 2008–2018. These estimates are predictions from ensemble machine learning models trained on 24-hour PM2.5 measurements from monitoring station data across 11 states in the western US. Predictor variables were derived from satellite, land cover, chemical transport model (just for the 2008–2016 model), and meteorological data. Ten-fold spatial and random CV R2 were 0.66 and 0.73, respectively, for the 2008–2016 model and 0.58 and 0.72, respectively for the 2008–2018 model. Comparing areal predictions to nearby monitored observations demonstrated overall R2 of 0.70 for the 2008–2016 model and 0.58 for the 2008–2018 model, but we observed higher R2 (>0.80) in many urban areas. These data can be used to understand spatiotemporal patterns of, exposures to, and health impacts of PM2.5 in the western US, where PM2.5 levels have been heavily impacted by wildfire smoke over this time period.


Background & Summary
Fine particulate matter (often referred to as PM 2.5 , meaning particulate matter (PM) which is 2.5 microns in aerodynamic diameter or smaller) air pollution is increasingly associated with numerous adverse health outcomes including, but not limited to, mortality 1 , respiratory and cardiovascular morbidity 2,3 , negative birth outcomes 4 , and lung cancer 5 . Although PM 2.5 concentrations have been declining in many parts of the United States due to policies to limit emissions of air pollutants 6 , PM 2.5 levels have been increasing in parts of the western US 7 . This increase has been shown to be associated with wildfire smoke 7,8 , which can cause PM 2.5 concentrations that are several times higher than the Environmental Protection Agency's (EPA's) daily PM 2.5 National Ambient Air Quality Standard (NAAQS) in areas downwind of the wildfires for several days at a time 9 .
Estimates of PM 2.5 concentrations for health studies have traditionally been derived from data from stationary air quality monitors placed in and around populated areas for regulatory purposes. In the US, the EPA's Federal Reference Method (FRM) PM 2.5 monitors often only measure every third or sixth day and most counties do not contain a regulatory air pollution monitor 10 . There is therefore not enough temporal and spatial coverage from FRM monitors to obtain a good estimate of the air pollution exposures where every person lives.
To improve population exposure assessment of PM 2.5 , researchers have increasingly been using methods to estimate PM 2.5 exposures in the temporal and spatial gaps between regulatory monitoring data using data from satellites (such as aerosol optical depth (AOD) or polygons of smoke plumes or air pollution models 11,12 ) over the past two decades. Each of these data sources has its own benefits and limitations, and researchers are increasingly statistically "blending" information from a combination of data sources to better estimate PM 2.5 in space and time. Various methods of blending have been used including spatiotemporal regression kriging 13 , geographically-weighted regression 14 , and machine learning methods [15][16][17][18][19] .
Machine learning methods for estimating PM 2.5 concentrations often train large auxiliary datasets, often including satellite AOD, meteorological data, chemical transport model output, and land cover and land use data to provide optimal estimates of PM 2.5 where people breathe. These models have been implemented in various locations around the world at city, regional, and national scales 20 . Some epidemiological questions can only Table 1 and described in more detail below.
Satellite AOD is a measure of particle loading in the atmosphere from the ground to the satellite. We obtained daily estimates of AOD from the MODIS Terra and Aqua combined Multi-angle Implementation of Atmospheric Correction (MAIAC) dataset 29 . This is the finest resolution (1 km) AOD dataset currently available and was available for our whole time period and spatial domain. After downloading each Hierarchical Data Format (HDF) file from the online repository, we calculated the average daily AOD values at each location and took the value from the nearest neighbour at each PM 2.5 monitoring location. MAIAC AOD has been shown to better predict PM 2.5 than coarser resolution AOD 30 and has been used in many studies in various geographic regions in blended models to predict daily PM 2.5 [31][32][33] . We obtained meteorological data from the North American Mesoscale (NAM) Analysis meteorological model 34 because it includes all of the standard meteorological variables which play a role in PM 2.5 levels, including planetary boundary layer height which can be important to help scale AOD values to ground-level estimates of PM 2.5 35 . We calculated 24-hour averages from 6-hour average data for planetary boundary layer height, temperature at 2-meter elevation, relative humidity at 2 meters, dew point temperature at 2 meters, U-(east-west) and V-(north-south) components of wind speed at 10 meters, surface pressure, pressure reduced to mean sea level, vertical wind velocity at an altitude of 850 mb, and vertical wind velocity at an altitude of 700 mb. NAM has 12 km spatial resolution. The NAM data was retrieved using the rNOMADS package 36 . PM 2.5 concentration estimates from chemical transport models have been shown to be an important input to machine learning models for PM 2.5 15,17 . We obtained daily estimates of PM 2.5 at 12 km spatial resolution from runs of the CMAQ (Community Multi-scale Air Quality) model from the U.S. EPA for the years 2008-2016, the years for which the CMAQ estimates are available 37 .
The main reason that PM 2.5 concentrations have been increasing in the western US, while they have been decreasing in other regions, is the increasing number and magnitude of wildfires in that region 7,8 . To include information on proximity to active fires in our machine learning models, we collected daily data about fire detection locations and size from the MODIS Thermal Anomalies/Fire Daily L3 Global 1 km product (MOD14A1 and MYD14A1) 38 . As fires in closer proximity are likely to influence PM 2.5 more than fires farther away, we calculated the number of active fires in radial buffers of 25, 50, 100, and 500 km radii around each monitoring location, on the current day as well as on each of the previous seven days. We then calculated an inverse-distance-weighted average for each temporal lag. Finally, we created an indicator variable for whether there were one or more fires within 500 km of a monitor in the last week.
Elevation can influence PM 2.5 concentrations. For example, PM 2.5 can accumulate in mountain valleys during persistent cold air pools (commonly referred to as inversions) during winter 39 . We obtained elevation data from the 3D Elevation Program of the U.S. Geological Society, with has a resolution of 1 arc-second, which is approximately 30 m north/south and varies east/west with latitude 40 .
Surrounding land cover can be a proxy for air pollution emissions not from wildfires, such as emissions from traffic and industry in urban areas, and lack of particulate air pollution emissions in more vegetated areas. We used the land cover class information from the Landsat-derived National Land Cover Dataset (NLCD) 2011 41 , which has a spatial resolution of 30 m and uses circa 2011 Landsat satellite data to classify each pixel into one of a variety of land cover codes. We used the NLCD 2011 to calculate the percentage of urban development (codes 22, 23, and 24) within buffer radii of 1 km, 5 km, and 10 km around each monitor. We also included population density as areas with higher population have more sources of air pollution emissions. Population density was obtained from the 2010 U.S. Census 42 . For a time-varying measure of vegetation abundance, we used monthly measures of the Normalized Difference Vegetation Index (NDVI) from the MODIS satellite product MOD13A3 43 which has a 1 km spatial resolution.
As another, potentially more specific proxy indicator of emissions from vehicles, we calculated the sum of all road lengths of type "Arterial" and "Collector" within 100, 250, 500, and 1000 m buffers of each monitoring location. Arterial roads are high-capacity urban roads, including highways. Collector roads are low-to-moderate capacity roads. The road data came from the National Highways Planning Network 44 , which contains spatial information on over 450,000 miles of highways in the United States.
To account for seasonality in PM 2.5 data, we created the following predictor variables: cosine of day-of-week, cosine of day-of-year, and cosine of month. The use of cosine of these terms ensures that day/month values at the end and beginning of the week and year align.
We also created indicator variables to represent spatial and temporal variation in the data that could not be explained by any of the other spatial, temporal, or spatiotemporal variables. Use of nested levels of spatiotemporal variables can help to capture nonlinear spatiotemporal effects. Temporal variable nesting consisted of variables to indicate the periods 2008-2012, 2013-2016, and 2017-2018; year; season; cosine of month; and cosine of day of year. Spatial variable nesting consisted of dummy variables for region (within the 11 western states: northwest (i.e., WA, OR), southwest (i.e., CA, NV), four corners (i.e., AZ, CO, NM, UT), and northern mountain states (i.e., WY, MT, ID)); state; latitude; and longitude. We also included interaction terms for time period (grouping of years) and region. This type of nesting has been referred to as a "multiresolution basis" 45 .
Data merging. We created seven datasets that all required merging the above datasets together: one dataset to train the model (with the 2008-2016 dataset able to be subset from the full 2008-2018 dataset) and six prediction datasets, with three spatial levels of prediction (county, ZIP code, and census tract) for each of the two models (2008-2016 which included CMAQ as a predictor variable, and 2008-2018 which did not include CMAQ as a predictor variable). The training dataset merged all predictor variables to each 24-hour average PM 2.5 monitoring observation by linking the data temporally (using date) and spatially (by selecting the nearest spatial observation for each predictor variable using latitude and longitude). Similarly, the prediction datasets were created by spatially (by nearest latitude and longitude) and temporally (by date) linking all predictor variables to the centroid of each county, ZIP code, and census tract for each day in the study domain.
Machine learning modelling. We employed ensemble machine learning to model PM 2.5 exposures across the western US. Specifically, we used a generalized linear model (GLM) to combine the results from two machine learning algorithms: a random forest model and a gradient boosting model. These models performed best on preliminary analyses of random subsets of our dataset, which aligns with a previous study that found that tree-based models (using random forest, gradient boosting, and cubist algorithms) performed best in air pollution modelling 46 . Then, we used the same random subsets of the data to tune hyperparameters for each algorithm via a grid-search ( Supplementary Information 1).
For the machine learning modelling, we performed both random and spatial 10-fold cross-validation so that our models would not overfit the training data 47 . We performed random 10-fold cross-validation, in which a random 10% of all observations are in each fold and the data is trained on 9 folds and tested on the left-out fold and then repeated 10 times such that every fold is a left-out fold once, for comparison to other studies. We performed spatial 10-fold cross-validation, whereby all observations from a given monitoring site are within the same fold, as a more appropriate tool for evaluating the accuracy of a model when predicting PM 2.5 at new (left-out) locations, such as those in our prediction set 48 . Before either kind of 10-fold cross-validation, we also removed 10% of the monitoring locations for a testing data set, which was not used in model development. The spatial-folds analysis used a spatial held-out 10% (random 10% of the locations), while the random-folds analysis used a random held-out 10% (10% of the observations regardless of location). Most previous studies to create daily PM 2.5 estimates using machine learning [15][16][17][18] present results from only random 10-fold cross-validation, which violates the assumption of independence between folds because of repeated observations (on different days) from the same locations (PM 2.5 monitor locations). Additionally, they do not provide performance metrics for predictions on a completely left-out test set. Their performance metrics are therefore overly optimistic for how their models will perform in their output data.
We used the metrics root-mean-squared error (RMSE) and R 2 to report accuracy, for both the 10-fold cross-validation and for the left-out testing data set, for both spatial folds and random folds. Note that our spatial-folds metrics are different than spatial R 2 that has been reported in other papers such as Di et al., (2019) 17 .
All analyses were run using R 49 , and all machine learning models utilized the R packages caret 50 and caret ensemble 51 . Variable importance was calculated using the "permutation" importance algorithm in the caret package.
Our main model and prediction data set provide daily PM   Table 2. Performance metrics (RMSE and R 2 ) of our models using various subsets of data and types of CV folds. *This test was performed to show the impact of including CMAQ in the model.
Overall, our models including CMAQ PM 2.5 (2008-2016 models) perform better (have lower RMSE and higher R 2 values) than models without CMAQ PM 2.5 (2008-2018 models). This may be due to the additional information provided by the CMAQ output or could be because the models without CMAQ PM 2.5 include two additional years of data that have more days with high PM 2.5 , which are much harder to predict accurately than lower values. For comparison, results for a random-folds model without CMAQ PM 2.5 on the years 2008-2016 performed slightly worse than the model for years 2008-2016 with CMAQ PM 2.5 included as a predictor variable.
The spatial-folds performance metrics are worse than the random-folds performance metrics. This is not surprising because the spatial folds do not allow for observations from the same location to be in multiple folds,  Fig. 3 Predicted-versus-observed plots of daily PM 2.5 values from our ensemble machine learning models using spatial 10-fold cross validation. All lines in Fig therefore the models are predicting at locations that they did not train on, whereas random folds have likely trained on observations at all locations, thus are more likely to predict values better for those locations. Using solely random folds can therefore be misleading as to the performance of the models given that these kinds of models predict at locations without monitoring data and thus not in the training data 48 . Thus, we posit that most of the models presented in the literature previously that use random folds CV are reporting R 2 values that are likely higher than their predictive performance at non-sampled locations. Henceforth all results in this section refer to the spatial-folds analysis. Performance of our models on our completely left-out testing data set provide worse metrics than their training (10-fold CV) counterparts. Some of the discrepancy between training and testing set results is because the testing data set was not used to inform the development of the model; some of the discrepancy is because of random chance of a given monitoring site being in the testing data set. Given that most previous machine learning air pollution modelling studies do not report metrics for completely left-out testing data [15][16][17][18][19] , their results on the accuracy of their models' predictions may be misleading. For comparison, the performance metrics for our full models (without any cross-validation folds) on the 2008-2016 and the 2008-2018 datasets are, respectively, RMSE = 1.726 µg/m 3 and R 2 = 0.960; RMSE = 2.027 µg/m 3 and R 2 = 0.961. These are much better performance metrics than any of those in Table 2 because all models overfit the data they are trained on. Full models train on all observations with none left out and therefore are not realistic representations of how accurately the model will predict at locations outside of the training set.   G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G  G G G G G G G G G G  G G G G G G G G G G G Boise, Idaho RMSE = 4.7 µg/m 3 , R 2 = 0.711 www.nature.com/scientificdata www.nature.com/scientificdata/ from the USFS and other sources that we included so that our models would have those higher values on which to train. Although we do show underprediction at these higher values, training our model on a larger range of input values than previous studies allows us to predict values that are likely closer to the "true" concentration to which populations were exposed. It should be noted the two Di et al. papers 16,17 Table 3 shows the RMSE (and R 2 ) values of our models on different levels of PM 2.5 , years, states, and seasons using spatial folds. RMSE (and R 2 ) values from the random-folds analysis of the same models are in Supplementary  Table 2.
The models with CMAQ PM 2.5 (2008-2016 models) always perform better than the models without CMAQ PM 2.5 (2008-2018 models). This is likely because of the importance of PM 2.5 concentration data from a chemical transport model in the machine learning models (Supplementary Table 5). Although chemical transport models provide spatiotemporal estimates of PM 2.5 , they do not always predict accurately and many have shown that empirical models like the ones presented here can improve predictive performance of chemical transport models. However, the algorithms we used perform well even when one important predictor is missing. When CMAQ PM 2.5 is not included, MAIAC AOD rises in variable importance (Supplementary Table 5   www.nature.com/scientificdata www.nature.com/scientificdata/ between variables does not matter for prediction with random forest, it most likely reduces the variable importance calculations via permutation 54 . We have better predictive performance at lower levels of PM 2.5 . This is likely partially because a much larger number of observations at lower values allowed the model to be better trained at those values; it is well-known that models tend to have greater trouble predicting extremes. In the spatiotemporal subsets, we observed higher RMSE for the years 2012 and 2015-2018, which have some of the highest PM 2.5 values due to high numbers of wildfires in our study domain during those years. The patterning of results by state is less clear, although it is notable that the RMSE values for California are lower than might be expected given the state's higher-than-average PM 2.5 levels. This is likely because over 40% of our monitoring locations and thus observations are from California. Finally, the RMSE values for Spring are lower than those from the other seasons, which is likely due to the predominance of lower PM 2.5 values in the spring (Fig. 4). One thing to note related to the performance metrics in Table 3 is that sometimes the R 2 values are quite low for the testing data for a given year or state because there were few observations within that state or year within the pre-selected 10% testing hold-out done across all of the data before the modelling; this is not a 10% of observations from each state or year.
We also provide RMSE values normalized by the mean of each corresponding dataset in Supplementary Tables 3 and 4 (corresponding to the spatial and random folds results respectively). These tables provide a sense of the predictive performance relative to the observed values. We observe that the model has larger error relative to the mean of the observed data for years and seasons with more wildfires. This is unsurprising given that www.nature.com/scientificdata www.nature.com/scientificdata/ squaring the errors in the RMSE calculation tends to weight large differences more heavily than the simple sum in the mean calculation. There is less of a clear pattern for the normalized RMSE by state, though the fact that our model does relatively well in California is likely due to having more air quality monitors and thus training data there. The spatial folds and random folds results follow the same overall pattern, though the random folds results have smaller errors compared to their mean values, likely because there is training data from all locations in all folds yet use of these metrics would lead to overconfident predictions in locations without monitoring data.
www.nature.com/scientificdata www.nature.com/scientificdata/ to observations from the closest monitors (Figs. 5-8). The monitoring stations were selected for having the most continuous monitoring observations in a large city within each state, but we included two large cities for California because of its large population and geographic variability with regards to air pollution. From these plots, you can see that our model predictions depict the temporal variability in PM 2.5 concentrations at monitoring stations nicely including the high values when they occur (most often during wildfire seasons) even for cities in which the air pollution during the most recent wildfire seasons have dwarfed the PM 2.5 levels in previous wildfire seasons (e.g., San Francisco, Seattle). There is quite good agreement between the predictions (which are from the nearest prediction location (centroid of census tract, ZIP code, or county) to each selected monitoring location (R 2 values range from 0.693 in Cheyenne, WY to 0.923 in Salt Lake City, UT, with all but three above 0.80). The RMSE values also vary, but these are more impacted by the range of PM 2.5 values observed at a given location. The models predict worse at higher values, thus locations with more PM 2.5 variability (mostly driven by high PM 2.5 values during wildfire seasons) tend to have higher RMSE values (e.g., Missoula, Montana). It is important to note that these are predictions in our prediction data set (either census tract, ZIP code, or county centroid) that are nearest to the monitor and not at that exact location. In particular, the distances from the nearest prediction to the monitor in Santa Fe, New Mexico and Boise, Idaho were nearly an order of magnitude larger than the distances for each of the other cities, which contributed to their lower R 2 and higher RMSE values.
We also mapped seasonal averages of our daily PM 2.5 predictions by county to assess the spatial patterning in the predictions and whether they aligned with the spatial pattern in the monitored data (Fig. 4). The year picked for each season was to demonstrate our model's ability to predict the spatial pattern of PM 2.5 during specific times when PM 2.5 was particularly high in certain regions such as the large wildfires in Arizona in the spring of 2011, in the North in summer 2015, in the Pacific Northwest and California in the Fall of 2017, and to show the PM 2.5 from inversions in Utah and California in the winter of 2013. Not all counties had monitoring data, so it was not possible to plot the monitoring data in the same way. Overall, these maps show lower PM 2.5 levels in the spring; Spring 2011 is just one example of this. In the winter, in the western US counties, many areas experience quite low levels of PM 2.5 , but there are some regions that have higher levels due to wintertime inversions (e.g., Salt Lake City, California's Central Valley). We also chose to highlight two seasons with high wildfire activity: the summer of 2015 when there were many wildfires burning in the Pacific northwest and the fall of 2017 when California had a large number of wildfires. We note that, as these are seasonal averages, the PM 2.5 levels are not as high as the maximum daily predicted levels for each county.

Usage Notes
All of our data files are available in the RData format. An RData file can be opened in R using the "load" command. Note that the state data frames range in size from about 2 million rows to about 34 million rows. A computer with lots of memory and/or cloud computing may be necessary to analyse these data.
Because the predictive model is continuous and not confined to positive numbers, low predictions (close to 0) will sometimes go negative. Because this is physically unrealistic, we recommend replacing the few negative PM 2.5 predictions with 0, as we have done to generate the plots and performance metrics shown in the Technical Validation section.
Then, our PM 2.5 estimates may be merged with health data or other data at the county, ZIP code, and census tract levels based on FIPS codes (for county and census tract) and ZCTA5 codes (for ZIP code). If a research team is concerned about our imputation technique, then they may wish to exclude any rows that have one or more missing variables.

Code availability
All code used for downloading and processing the data used in this project, including the machine learning and technical validation code, may be accessed at https://doi.org/10.5281/zenodo.4499264 55 .
To ensure that our work is reproducible, all code is written in open-source languages. Some scripts are in R and others are in Python. Python scripts need Python 3; R versions beyond 3.5.1 should suffice.
The scripts on Zenodo are a copy of our GitHub repository of scripts and contains the following files and directories: • General_Project_Functions: Scripts to obtain the prediction set locations and tools that are generally useful during the data processing, such as making buffers around points and reprojecting point coordinates.
• Get_PM25_Observations: Scripts to process PM 2.5 observations from across the western U.S. These observations are used to train our machine learning models.
• Get_Earth_Observations: Scripts to download and process observations from data sets that are used both as inputs for our machine learning models during training and as inputs for our models in the prediction stage. The file Overall_steps provides all necessary directions. Individual README files (in each folder) provide more details if there are any.
• Merge_Data: Scripts to merge all the data together and derive some spatio-temporal variables.
• Machine_Learning: Scripts to run and evaluate our machine learning models. The folder Final_scripts contains all code used for our final analysis. The code in the Exploring_models folder was all preliminary testing.
• Estimate_PM25: Scripts to use our machine learning models to make final predictions and to explore the prediction data sets over time and space.