High resolution annual average air pollution concentration maps for the Netherlands

Long-term exposure to air pollution is considered a major public health concern and has been related to overall mortality and various diseases such as respiratory and cardiovascular disease. Due to the spatial variability of air pollution concentrations, assessment of individual exposure to air pollution requires spatial datasets at high resolution. Combining detailed air pollution maps with personal mobility and activity patterns allows for an improved exposure assessment. We present high-resolution datasets for the Netherlands providing average ambient air pollution concentration values for the year 2009 for NO2, NOx, PM2.5, PM2.5absorbance and PM10. The raster datasets on 5×5 m grid cover the entire Netherlands and were calculated using the land use regression models originating from the European Study of Cohorts for Air Pollution Effects (ESCAPE) project. Additional datasets with nationwide and regional measurements were used to evaluate the generated concentration maps. The presented datasets allow for spatial aggregations on different scales, nationwide individual exposure assessment, and the integration of activity patterns in the exposure estimation of individuals.


Background & Summary
Air pollution negatively affects health. Epidemiological studies have shown that long-term exposure to gaseous (e.g. NO 2 , NO x ) or particle pollutants (e.g. PM 2.5 , PM 10 ) may contribute to the development of asthma, lung cancer, diabetes, and cardiovascular diseases [1][2][3][4][5][6] . For an improved assessment of exposure to air pollution, capturing the spatial variation of street level air pollution concentrations is relevant. Land Use Regression (LUR) models based on measurements from monitoring stations and predictor variables such as traffic or land use have been shown to be suitable to explain the spatial variation in air pollution concentrations [7][8][9][10] . LUR models estimating individual exposure to air pollution have for instance been developed for the United States 11 , China 12 , metropolitan areas in Australia 13,14 and within the European Study of Cohorts for Air Pollution Effects (ESCAPE) 10,15 .
Previous studies of health effects of long-term exposure used LUR models to calculate air pollution at selected locations, mostly front door locations of home addresses. This approach neglects human activity patterns which may be a fundamental omission as important air pollutants strongly vary in space and time. In case information is available for multiple locations such as home and work locations with associated residence times, or on activity patterns like commuting, estimating individual exposures to air pollution can be improved [16][17][18] . To enable this, concentration values at virtually any location and consequently air pollution data at a high spatial resolution and national coverage are required to calculate exposure for each individual within a cohort. In addition, such map data would enable calculating exposure for the entire population in a country.
We present a new consistent data set consisting of 5 metre resolution air pollution concentration maps covering the entire land surface of the Netherlands, an area of about 33 680 km 2 . We created these maps using the LUR models developed within the ESCAPE project 10,15 for the Netherlands and Belgium, here only used for the Netherlands. The maps provide annual average concentration values for nitrogen (di) oxides (NO 2 ,NO 2background and NO x , respectively), particulate matter with less than 10 μm or 2.5 μm (PM 10 and PM 2.5 , respectively), and a proxy for elemental carbon (PM 2.5absorbance ). The concentration values calculated by the ESCAPE LUR models can be considered as long term average values as the models were identified using mean air pollution in 2009 and spatial contrasts are stable for multiple years. The LUR models were based on land use, traffic infrastructure, traffic intensity and population density. Predictor variables included aggregated values of these attributes calculated within circular buffers with radii between 25 m and 5 km.
To verify our mapping approach we compared the values of predictor variables calculated using our nationwide mapping techniques with those used to identify the ESCAPE LUR models 10,15 . To validate the maps, we used independent measurements of air pollution to assess the new concentration maps. A comparison with measurements obtained in the RUPIOH study 19 for the municipality of Amsterdam resulted in an r 2 of 0.22 for PM 2.5 , 0.38 for PM 2.5absorbance and 0.2 for PM 10 . A nationwide measurement dataset for NO 2 from the TRACHEA study 20 resulted in an r 2 of 0.75. We also validated against air quality measurement data from the National Institute for Public Health and the Environment (https:// www.rivm.nl/en/) obtained from stations across the Netherlands. The resulting r 2 for NO 2 is 0.85 and for PM 10 is 0.57.
The spatially detailed air pollution concentration datasets enable health researchers to improve the assessment of the effects of spatial variability on human exposure and health. The data can be used to enrich cohorts with air pollution data either to assess the relation between air pollution and health or to take air pollution into account as a confounding factor. Feasibility of this method has been shown in studies investigating the relation between air pollution exposure and diabetes prevalence 6 and lifestyle 21 . Concentration values can be obtained from the maps for cohorts of any size. Spatial aggregations over tracks followed by individuals, their close surrounding or over administrative areas can be straightforwardly computed with standard Geographic Information System (GIS) software.

Methods
To generate high-resolution air pollution concentration maps we built upon the LUR models and datasets initially developed in the ESCAPE project 10,15 . First, we briefly describe the ESCAPE project and the therein developed LUR models covering the Netherlands. Then, the data preparation in the spatiotemporal modelling environment PCRaster 22 and the application of the six air pollution concentration models over the Netherlands are described. The required software to calculate predictor variables and air pollution concentration maps is described in the 'Code Availability' section.

The ESCAPE Land Use Regression models
After epidemiological studies established exposure response relationships in North America 23 and European exposure estimates at that time were based on these results, the European Study of Cohorts for Air Pollution Effects (http://escapeproject.eu/) was initiated to investigate the contribution of long-term traffic-related air pollution to the state of health. The ESCAPE study covered 36 study areas in 15 European countries and was performed between 2008 and 2012. Standard operating procedures for measurements (http://www.escapeproject.eu/manuals/) and a standardised methodology for the assessment of long-term population exposure to air pollution were developed to investigate exposureresponse relationships for e.g. respiratory and cardiovascular diseases. For all European study areas, LUR models were developed based upon measured annual average concentrations. In the Netherlands, simultaneous measurements took place at 80 monitoring sites for NO 2 and NO x , and 40 sites for PM. Regional background, urban background and traffic sites were selected. The Netherlands are located in the temperate climate zone of Western Europe (Cfb according to the Köppen-Geiger classification 24,25 ) with a mean temperature of 10.1°C and a mean amount of 851 mm precipitation. Three two-week measurement campaigns were performed in different seasons over the year 2009 (cold, warm and intermediate season) to capture seasonality. In addition, an ESCAPE background reference monitoring site was measuring pollutant concentrations the entire year. The data obtained from the three measurement campaigns were then averaged, adjusting for temporal trends using the continuous data from the reference monitoring site 15,26 . PM 10-2.5 was not measured but calculated as difference between PM 10 and PM 2.5 .
Geographic datasets of land use, traffic infrastructure and population density were available for all study sites and used to derive the predictor variables. Datasets presumably improving LUR models, such as street configuration or traffic speed 27 , were not available at European scale and therefore not considered. In the Netherlands, however, information on light-duty and heavy-duty traffic intensities were available and used in the model development.
For each monitoring site and geographic dataset, potential predictor variables were then computed using circular zones and various buffer sizes. Afterwards, for each pollutant the predictor variables explaining best the spatial variation in measured annual average air pollution concentrations were identified. The LUR models were then used to assess air pollution exposure for individual cohort participants, by calculating predictor variables and evaluating the LUR models at the front-door home address locations.
In the Netherlands, the ESCAPE LUR models explained 68% variability in the annual PM 10 concentrations, 67% in the PM 2.5 concentrations, 51% in the PM 10-2.5 concentrations, 92% in the absorbance concentrations, and 86% in the NO 2 concentrations. The lower model r 2 for PM 2.5 and PM 10 compared to NO 2 and PM 2.5absorbance is likely due to the smaller impact of local (traffic) sources on these pollutants. The variation related to large source areas and the transformation process in the atmosphere are more difficult to characterize with the empirical modelling approach. Certain sources such as agriculture or shipping were not evaluated in detail. The lower r 2 for coarse PM is due to missing sources and lower precision of the measurements: coarse was calculated as the difference between PM 10 and PM 2.5 .
The development of the LUR models and the evaluation of the model performance are explained in more detail for nitrogen (di)oxides 10 and particular matter 15 . Table 1 shows the six LUR models for the Netherlands and Belgium study area that were also used to calculate the datasets presented here.

Data sources for the ESCAPE project
The ESCAPE project used several geographic data sources to derive the predictor variables for the LUR models. Datasets in the project were available for all study areas in Europe, and supplemented with national data.
Traffic related predictor variables were calculated using the digital road network based on Eurostreets version 3.1, derived from the TeleAtlas MultiNet data set for the year 2008 15 . This dataset was used for the predictors holding the length (in m) of all roads and major roads in a buffer (RL and MRL, respectively), and the inverse distances (m −1 ) to the nearest road (IDC) or the nearest major road (IDM).   The widths of roads were not explicitly given in the datasets but each centre of a road lane was specified as separate line segment. For the Netherlands, local traffic network and traffic intensity data from the Netherlands Environmental Assessment Agency (http://www.pbl.nl/en/) were additionally used for predictor variables with traffic load (vehicles·day −1 ·m) on all roads (TL) and major roads (MTL), and heavy traffic load on all roads (HTL).
Land use information was derived from the CORINE land cover 2000 data set (https://land. copernicus.eu/pan-european/corine-land-cover). The CORINE land use categories were reclassified and used to create predictor variables holding areas (m 2 ) for harbour (HAR), industry (IND) and areas with high and low density residential land (RES). Population density data (POP) was obtained from the INTARESE project dataset (http://www.integrated-assessment.eu/eu/).
Regional background estimates (BEO, BEX, BEP, BEA) were included for the study area as not all large-scale spatial trends in the air pollution concentrations could be explained by the potential predictor variables 10 . Concentration data obtained from 20 stations for nitrogen oxides and 10 stations for the other pollutants, respectively, were used to estimate four background concentrations by inverse distance weighted interpolation. A regional background estimate did not improve the PM 10 model and was therefore not included.

Spatially distributed modelling
The original ESCAPE project used vector data sources. The predictor variables and models were then calculated using vector-based Geographic Information System functions on a limited set of observations and home address locations. With this approach it was not feasible to calculate air pollution concentrations at all Dutch home address locations due to the extensive runtime of the vector-based operations. We therefore applied a raster-based modelling approach where first the vector input data sources were converted to 5 m resolution raster data, and predictor variables and LUR models were calculated from these data in a raster environment as well. This approach results in minor differences between the original variables calculated by a vector-based and variables calculated using the raster-based approach. The differences between both approaches are negligible as will be shown in the 'Technical Validation' section.
To calculate raster-based predictor variables and air pollution concentrations we used PCRaster 22 , an open-source environmental modelling platform providing a wide range of operations suitable to express spatio-temporal processes via the Python programming language (http://www.python.org/). To be usable in PCRaster, the geographic source datasets were translated using GDAL/OGR (http://www.gdal.org/). The scripts implementing the conversion steps from vector to raster data are included (Data Citation 1).
The datasets were created as follows: The population density dataset was rasterised with gdal_rasterize. For land use, first the CORINE dataset was rasterized with gdal_rasterize. Then, the CORINE land use classes were reclassified and recoded to individual raster maps holding industrial, harbour and residential areas. The total length of roads in each raster cell was calculated by intersecting a 5 m 2 resolution fishnet grid with the road network to obtain all individual road segments per raster cell. Then, the lengths of each road segment in a raster cell were calculated in QGIS (http://qgis.osgeo.org). Finally, the length values were summed up per cell and the total road lengths were assigned as raster cell value.

Calculation of the predictor variables
The rasterised data sources were used to calculate the predictor variables that are used in the LUR models. Different circular buffer sizes for land use and population density (with 1000 and 5000 metres) and for road lengths and traffic loads (with 25, 50, 500 and 1000 metres) were required to calculate the six LUR models. In total, 16 predictor variables with buffers were calculated. The summation of cell values covered by circular buffers of radii between 25 and 5000 metres was area-wide calculated in Python using the multiprocessing, PCRaster 22 and NumPy 28 modules. Depending on the size of the buffers the predictor maps were either calculated on a standard Linux workstation (40 cores Intel Xeon E5-2650, 128 Gb memory; buffers smaller than 1000 metres) or on the Dutch national supercomputer 'Cartesius' (buffers with 1000 and 5000 m radii).
For the inverse distances to road networks we calculated for each cell the distance of the cell centre coordinate to the nearest road using the Distance function from the GDAL/OGR module, and assigned the inverse distance as raster cell value. The four regional background estimators were calculated by inverse distance weighting interpolation using air pollution measurements from 20 stations for nitrogen oxides and 10 stations for particulate matter. We first created raster maps with values at station locations and used these as arguments for the inversedistance function from PCRaster, with a radius of 100 km.

Implementation and calculation of the LUR models
With each of the predictor variables available as individual raster file the LUR models can be implemented to compute the air pollution concentration maps. A PCRaster Python script illustrating the calculation of a LUR model is shown in Box 1. PCRaster provides a broad set of operations based on the map algebra and cartographic modelling concept 29 ; these geospatial operations are available after importing the module of the same name. The required predictor variables are read from disk using the corresponding filenames by readmap operations (lines 4-6) and each assigned to new variables. These variables become thereby spatial data types that essentially hold information on geographic extent and discretisation, and two-dimensional arrays for the raster cell values. The LUR model for PM 10 (see Table 1) itself is implemented in line 9. The arithmetic operators of the equation are executed for each raster cell, an approach comparable to other array programming languages. The resulting concentration values are then stored to a new geospatial dataset on disk (line 12). The five other LUR models are calculated in the same way.
Examples for the resulting maps are shown in Fig. 1, illustrating the spatial pattern of air pollution concentration on national and municipality scale. Figure 2 shows the concentration values for a North-South transect through the municipality of Utrecht. The models include several small scale traffic predictors including inverse distance to roads that explain the gradient near roads. The increased concentration near intersections is due to higher 50 meter traffic buffers.

Code availability
The PCRaster software package used to calculate the ESCAPE Land Use Regression models is opensource and can be executed on Linux, Windows and macOS. The PCRaster source code is available on GitHub (https://github.com/pcraster/pcraster/). Links to release packages for Windows, build instructions for Unices, reference documentation, online courses and information on research projects can be found at the project website (http://www.pcraster.eu). PCRaster version 4.2 is required to execute the models due to several recent code improvements for handling large datasets.
Additionally, Python (http://www.python.org/) version 2.7 (or 3.6) with the NumPy module 28 version 1.7 (or higher) are required to calculate the land use regression models. The Geospatial Data Abstraction Library (GDAL, http://www.gdal.org/) version 2.2.4 (or higher) is required to execute the scripts that rasterise vector datasets, and to execute the scripts performing the distance to road calculations.

Data Records
We provide the air pollution concentration maps resulting from the six LUR models, the Python scripts for data preprocessing, and the LUR model calculation. All content is included in a compressed file nl_apc.7z available through the public Zenodo repository (Data Citation 1).
Each concentration map is provided as individual file in the PCRaster binary file format. The PCRaster binary file format can be processed and visualised with the Aguila software 30 , which is included in the PCRaster package, common GIS applications such as ArcMap (http://www.esri.com/) or QGIS (http://qgis.osgeo.org), or converted to other raster formats or resampled to other grid cell sizes using the GDAL tools (http://www.gdal.org/) for further processing.
The maps cover the entire Netherlands with a 5 × 5 m grid cell size (63500 rows, 54800 columns). The ESCAPE source datasets and the resulting raster datasets use the 'Amersfoort/RD New' (EPSG:28992) coordinate reference system. Concentration values for NO 2 , NO 2background , NO x , PM 2.5 and PM 10  In addition to the spatial datasets we provide the Python scripts for the LUR model calculation and preprocessing of the predictor variables. The Python script lur_models.py performs the calculation of the land use regression models. The script calculate_buffer.py was used to aggregate cell values using circular buffers of various radii. The calculate_distance.py script performed the distance calculations to road networks. The generate_landuse.py script was used to generate predictor variables from the CORINE dataset. The regional background estimates were calculated with the generate_regest.py script, population Box 1 | Python script showing the fundamental steps to calculate the LUR models, here exemplary for PM 10 .
The preprocessed predictor variables are read from disk, the LUR model is calculated and the resulting air pollution concentration map is stored. Operations specific to PCRaster are marked in blue.

Validation of the raster-based approach
We first calculated the predictor variables and LUR models on a small set of locations to evaluate whether the resulting concentration values obtained in a raster-based modelling approach are in agreement with the results obtained from the LUR models calculated using a vector-based approach. We used 8000 front door locations in the province of Utrecht, an area of about 66 × 59 km 2 . The locations were obtained from a cohort dataset previously used in the ESCAPE project. Table 2 shows comparative statistics between the vector-based and raster-based approaches. Model results and measurements are compared in terms of the coefficient of determination (r 2 ), the root mean square error and the bias with o i the observed and and m i the modelled value for location i, and N the number of locations (N = 8000).
Overall, there is a high agreement between values obtained from the newly calculated raster maps and the previously calculated concentrations, with most of the predictor variables resulting in r 2 values above 0.9, and the air pollution concentration maps with r 2 values above 0.98. Small deviations were expected due to the grid representation of the spatial domain, as each possible location within a raster cell obtains the value of the 5 m 2 area.
The results of the comparison show that in general the chosen 5 m 2 grid is appropriate to represent distance related attributes for the ESCAPE LUR models. In a following step, the raster-based approach was therefore applied to calculate concentration maps covering the entire Netherlands.

Independent validation of the datasets
We used additional datasets with measured concentrations to assess the nation-wide datasets. The corresponding scatterplots are shown in Fig. 3 and the statistics are shown in Table 3.
The first dataset we used included monitoring data from the RUPIOH study 19 (Relationship between ultrafine and fine particulate matter in indoor and outdoor air and respiratory health). Measurements of PM 2.5 , PM 10 and PM 2.5absorbance were made at 48 locations spread over the city of Amsterdam between October 2002 and March 2004. Measurements were made directly outside the home, e. g. at balconies or in gardens, where feasible. The number of homes located at major roads and background sites was approximately equal. Annual average concentrations per site were calculated using data from a central reference site. The same measurement methods were applied as in the ESCAPE study used to develop the LUR models. We used this dataset because of the detailed spatial coverage of one city. NO 2 was not measured in this study.
The LUR model predictions of the nation-wide dataset were moderately correlated with measured values and significantly underpredicted the measurements. Correlation was better for the component  with more fine scale spatial variability (PM 2.5absorbance ). The low validation r 2 for PM 2.5 and PM 10 in the Amsterdam validation dataset are likely due to the application of a national LUR model to a single city and to the modest spatial variation of PM within a city. Consistently, the validation r 2 for PM 10 was much higher (0.57 vs 0.22) in the national validation dataset (LML). The difference in validation raises concern with application of the model in a single city for PM 10 and PM 2.5 . We further note that the validation dataset refers to 2002-2004 and the model was developed primarily based on measurements in 2009. There is also a downward air pollution trend in the Netherlands 31 , contributing to the lower estimates of the model compared to the measurements. The second dataset involved measurements of NO 2 made in the framework of the TRACHEA study (Traffic Related Air pollution and Children's respiratory HEalth and Allergies 20 ). Measurements were made simultaneously at 144 sites spread over the Netherlands using the same passive samplers as used in the ESCAPE study. Measurements were made during four 1-week periods spread over the seasons in 2007. Annual average concentration for 2007 was calculated.
Our NO 2 LUR model predictions correlated well with the TRACHEA measurements. The large variability in concentrations makes it easier to compare model predictions with measured values. The agreement is affected by both small scale variation (traffic versus background) and regional variation.
We also used the nation-wide dataset provided from the Dutch Air Quality Monitoring Network (https://www.lml.rivm.nl/). Hourly measurements of NO 2 and PM 10 concentrations are recorded by the LML, which is maintained by the National Institute for Public Health and the Environment (https://www. rivm.nl/en/). There are four location types: urban, rural, traffic and industry. During the ESCAPE period from October, 1 st 2008 to April, 1 st 2011 15 , 49 stations provide measurement data. We excluded stations with more than 20% missing values in the data, resulting in 45 stations for NO 2 and 38 stations for PM 10 . The NO 2 and PM 10 concentration are then averaged per measurement station over the ESCAPE period.
For the LML dataset we obtain an r 2 of 0.85 for NO 2 , and an r 2 of 0.57 for PM 10 . The lower correlation with measurements for PM 10 compared to NO 2 is due to the absence of a regional component in the ESCAPE model for PM 10 (in contrast to PM 2.5 , NO 2 , PM 2.5absorbance ). In the international ESCAPE study, it was first attempted to explain measured variability by including small and large scale GIS variables and only then added regional variables to explain remaining variability. In the case of PM 10 , large scale  population density entered the model, after which the regional background was no longer significant. For rural populations, the consequence may be that exposure contrast across the Netherlands is underestimated. Furthermore, the ESCAPE project did not include measurements specifically in intensive livestock rich areas and did not include GIS data representing farming emissions.

Usage Notes
The six raster datasets presented can be opened, processed and visualised with the open-source PCRaster package. Any application making use of recent versions of the GDAL libraries can visualise or process the datasets for further statistical analysis, for example GIS applications like ArcMap (http://www.esri.com/) or QGIS (http://qgis.osgeo.org), or statistical software such as R (http://www.R-project.org). To reduce the data volume and allow for easier data handling, e.g. in local or regional studies, the datasets can be cropped or resampled by PCRaster, or converted to other raster formats by the GDAL tools. The large file sizes of our datasets impose hardware requirements that may exceed present-day standard desktop computers. To allow for enriching cohort data with a subset of the air pollution maps, we also developed a web service facilitating access to various spatially aggregated derivatives of our raster maps. This functionality is demonstrated by the GGHDC exposure web portal (https://gghdc.geo.uu.nl/).
With our air pollution concentration map, several health research applications are feasible to investigate exposure-response relationships. The datasets can be used to extract concentration estimates for any coordinate in the Netherlands and arbitrary cohort sizes. In recent national health studies, for instance, associations between air pollution concentrations and smoking habits, alcohol consumption, physical activity and body mass index were investigated for 387 195 adults 21 or to diabetes prevalence of 289 703 adults 6 .
Spatially fully distributed datasets also allow for a better representation and consequently incorporation of individual mobility patterns. By integrating road networks or GPS tracks and air pollution concentration maps, exposure along travel routes can be estimated. Alternatively, routes with minimal exposure can be calculated to suggest the healthiest routes for travel. Spatial aggregations using buffer sizes or administrative areas can be calculated to estimate exposure for persons with approximately known residence locations or movement patterns. As an additional dataset, the PM coarse concentration can be calculated as difference between the PM 10 and PM 2.5 maps.
We are not able to provide the source datasets used to generate the predictor variables. However, the presented land use regression modelling approach is generic (e.g. 7,11,13,14 ) and can be applied using other  Our model included predictors starting from 25 m buffer sizes. We did not have very local predictor variables with the exception of inverse distance to roads. Our 5 × 5 m estimates will therefore only gradually change with distance to traffic sources and should not be used to reflect very fine scale differences. The data can therefore not be used to interpret the absolute value of the concentration at a specific location, e.g. a specific home or school address. The models have been developed to characterise exposure contrast to be applied in research. The models include the main factors that explain spatial variation of air pollution, but individual locations may have deviating concentrations because of specific local factors not included in the model such as being in a narrow street canyon, stop-and-go traffic with higher emissions related to proximity to a traffic light or an exceptionally high fraction of old diesel cars. The concentrations should also not be compared strictly with air quality limit values. The monitoring instruments used to develop the ESCAPE LUR models were not formal reference instruments, though the difference with references instruments is limited. The models include main sources of air pollution with typically generic predictor variables. The models have not been developed to incorporate specific (industrial) point sources in a specific location. Monitoring in ESCAPE focused on residential addresses and as a consequence the models are less reliable in predicting on-road concentrations.
The reader should also notice that the values represent average values over the year 2009. The data set can be used for epidemiological studies that require estimates of personal exposures for other years or aggregated over multiple years (preferably in the range +/− 10 years relative to 2009) as yearly mean values do not change considerably, but studies that use our dataset should take into account that there has been and continues to be a downward trend in concentrations while patterns in air pollution may change as a result of road construction or changes in traffic density on roads. The LUR models were developed with the intention to estimate long term exposure, the maps are therefore less suitable to estimate short term variation of air pollution concentrations. The ESCAPE project did not include measurements specifically in intensive livestock rich areas and did not include GIS data representing farming emissions. The model should therefore not be used to represent spatial variation related to farming emissions within local study areas. Concentration values in industrial areas might be underestimated as the Corine dataset does not distinguish between the types of industry. In addition, predictor variables with large buffer sizes are underestimated at the border zones as not all data in the neighbouring countries were available.