High-resolution gridded population datasets for Latin America and the Caribbean in 2010, 2015, and 2020

The Latin America and the Caribbean region is one of the most urbanized regions in the world, with a total population of around 630 million that is expected to increase by 25% by 2050. In this context, detailed and contemporary datasets accurately describing the distribution of residential population in the region are required for measuring the impacts of population growth, monitoring changes, supporting environmental and health applications, and planning interventions. To support these needs, an open access archive of high-resolution gridded population datasets was created through disaggregation of the most recent official population count data available for 28 countries located in the region. These datasets are described here along with the approach and methods used to create and validate them. For each country, population distribution datasets, having a resolution of 3 arc seconds (approximately 100 m at the equator), were produced for the population count year, as well as for 2010, 2015, and 2020. All these products are available both through the WorldPop Project website and the WorldPop Dataverse Repository.


Random forest-based dasymetric population mapping approach
The dasymetric disaggregation of population counts from administrative units into grid cells was undertaken using a population density weighting layer generated by a RF algorithm. RF is a non-linear and non-parametric ensemble learning method that generates a large collection of unpruned decision tree models and aggregates their predictions. Each tree is independently generated by bagging (i.e., by bootstrapping with replacement) 26 , and each node of each tree is split using the optimal split among a randomly selected subset of covariates 27 . Outputs of all tree models are then aggregated by calculating either their mode or average, depending on whether the decision trees are used for classification or regression.
The RF method is robust to overfitting 27 and not very sensitive, in terms of affecting prediction accuracy, to the three parameters required to be set for fitting the model 28 , namely (i) the number of covariates to be randomly selected at each node, (ii) the number of observations in the terminal nodes of each trees, and (iii) the number of trees in the forest. Furthermore, it is possible to accurately estimate the prediction error of the RF model. This can be done by averaging all mean squared errors calculated using the 'out-of-bag' (OOB) data that represent one third of the observations withheld from the bagging iteration process for each tree in the forest 27 . The OOB error can be also used to evaluate the importance of each covariate by considering how much the OOB error increases when only the OOB data for that given covariate are permuted 28,29 .
www.nature.com/sdata/ SCIENTIFIC DATA | 2:150045 | DOI: 10.1038/sdata.2015. 45 In the RF-based dasymetric population mapping approach developed by Stevens et al. 24 , a RF algorithm is used to generate gridded population density estimates that are subsequently used to dasymetrically disaggregate population counts from administrative units into grid cells. The same approach was used to produce the WorldPop Americas datasets described in this article. Initially, a population density response variable and a suite of covariates were calculated at the administrative unit level, and then used to fit a RF model for predicting population density at the grid cell level (i.e., to generate the dasymetric weighting layer) with those raster-based covariates having a spatial resolution of 100 m (Fig. 2).
To reduce processing time during the prediction phase, the multi-stage RF estimation technique developed by Stevens et al. 24 was used. This technique first fits a model using all available covariates and the (log) population density of each administrative unit as the response. Then, a very conservative covariate selection process is performed to reduce the number of covariates that will be used for both the RF model fitting and prediction. To do this the 'variable' importance of each covariate 27 is extracted and each covariate that has a score equal to zero is removed before re-fitting the RF model. This process is then iterated until only covariates with positive scores remain and thus results in the elimination of both redundant covariates and covariates that could negatively impact the prediction.
As in Stevens et al. 24 , the RF model fitting was performed by generating 500 trees in the forest and setting the number of observations in the terminal nodes equal to one. The fitted RF model was then used to predict population density using only the same reduced set of covariates. For each grid cell, each regression tree in the forest was used to predict a population density value and the average of all predictions was assigned to it as its estimated population density value. If there were not enough observations (i.e., not enough administrative unit population counts) to fit a RF model for a given country, another country located in the same ecozone 30 was identified and used to fit an appropriate RF model for predicting population density at the grid cell level 31 .
Subsequently, in both cases, the population density weighting layer was used to dasymetrically disaggregate the administrative unit-based population counts 32 and produce two gridded population datasets depicting the estimated number of people per grid square and per hectare for the population count year. These datasets were then projected to 2010 (Fig. 3), 2015, and 2020 using UNPD rural and urban growth rates 25 and also adjusted to match the most recent UNPD estimates at the time of writing 25 .
All tasks described above were entirely performed using the WordPop-RF code (Data Citation 1) described in the Code availability section below and publicly available through the figshare repository. In particular, the code relies on the R statistical environment (version 2.15) and the randomForest package (version 4.6-7) for fitting the RF model at the administrative unit level and predict at the grid cell level, and on the Python programming language (version 2.6; www.python.org) and ArcGIS 10.1 arcpy package for performing the Geographic Information System (GIS)-specific spatial operations required for   24 ). The preparation of the response variable and covariates is described in the yellow and orange panels, respectively, the RF modelling steps are outlined in in the green panels, and the disaggregation of the input population counts from administrative units into grid cells is described in the blue panel. dasymetrically disaggregating the population data, projecting them to 2010, 2015 and 2020, and adjusting them to match UNPD estimates (refer to the Supplementary File 1 for a technical description of how the GIS-specific spatial operations are implemented).

Data collection
For each country listed in Table 1, population counts were extracted from the most detailed and recent official population count data and matched to their corresponding administrative units in a GIS environment. Both population counts and the corresponding administrative units were either publicly available (e.g., from GeoHive 33 and GADM 34 , respectively) or contributed by National Statistical Offices such as the Instituto Brasileiro de Geografia e Estatística (IBGE). Table 1 also provides summary information about the input population count data and administrative unit datasets used to produce the WorldPop Americas datasets. It is well known that human population density is highly correlated with environmental and physical factors 35 that can plausibly impact the spatial distribution of population and/or be related to it. These may include continuous variables such as intensity of night-time lights 36 , energy productivity of plants 37 , topographic elevation and slope 38,39 , and climatic factors 40 , as well as categorical variables such as landcover type 41 and presence/absence of roads 42 , waterways and waterbodies 43 Table 1. Summary information about population count data and administrative unit datasets used to produce the WorldPop Americas datasets. For each country (identified by its ISO country code in the 1st column), the Average Spatial Resolution was calculated as the square root of its surface area divided by the number of administrative units and represents the effective resolution of the latter (i.e., the cell size of administrative units if all units were square of equal size) 14 . Superscripts 'C', 'E', and 'P', in the 2nd column, indicate whether the population counts were obtained from either official census, estimates, or projections, respectively. Superscript 'G', in the 8th column, indicates that the population counts were downloaded from GeoHive 33 . urban areas 44 , and protected areas 45 . Thus, twelve global raster and vector datasets (described below) were identified, collected, assembled, and processed into a set of default covariates ( Table 2) used for model fitting and prediction. The spatial variation of factors related to population distribution, such as night-light intensity and plant energy productivity, was measured using the NOAA Suomi National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite (VIIRS) 46,47 and the NASA TERRA/Moderate Resolution Imaging Spectroradiometer (MODIS) Net Primary Productivity (NPP) 48,49 raster dataset, respectively. The spatial variation of climatic factors affecting population distribution was considered by including the WorldClim Annual Mean Temperature (BIO 1 ) and Annual Precipitation (BIO 12 ) raster datasets 50,51 . The World Wildlife Fund (WWF) HydroSheds raster dataset 52,53 , based on the NASA's Shuttle Radar Topography Mission (SRTM) Digital Elevation Model 54 , was used to represent the spatial variation of elevation and slope. The European Space Agency (ESA) ENVISAT/MERIS-based GlobCover raster dataset 55,56 , and the MODIS 500-m map of global urban extent 57,58 were used to identify different landcover types and distinguish between urban and rural areas. Finally, the World Database on Protected Areas (WDPA) 59 was used to obtain vector polygons representing protected areas, while the National  Geospatial-Intelligence Agency (NGA) Vector Map Level 0 (VMAP0) dataset 60 was used to obtain features representing populated places, roads, rivers, and waterbodies. Where available, additional country-specific datasets were used to integrate and/or replace the default datasets outlined above and the corresponding default covariates in the analysis. For example, for most of the countries, the Landsat TM-based EarthSat GeoCover-LC raster dataset 61,62 was combined with the GlobCover raster dataset to refine the extent of urban areas and identify rural settlements. Similarly, OpenStreetMap (OSM) vector datasets 63 were regularly used to integrate the VMAP0 settlement dataset and account for land-use types, building sites, and locations of points of interest that may be strongly correlated with population presence (e.g., health clinics, schools, and police stations). Furthermore, OSM road and river data were often deemed to be more complete than the corresponding VMAP0 data and, thus, were used to increase the precision and accuracy of the gridded population outputs 64 .
For each country, all assembled vector and raster datasets, including the country specific ones, are described in the metadata file accompanying the corresponding gridded population datasets and viewable in any web-browser (refer to the Data Record section below for a more detailed description of the metadata file content).

Data preparation
For each country, the vector dataset representing its administrative units, used to match to population counts, was projected using the most appropriate country-specific projected coordinate system that minimized linear and areal distortion. It was then buffered by 10 km, and rasterized at a spatial resolution of 100 m. This was done in order to (i) generate a dataset representing the population density response variable, (ii) obtain a raster dataset, representing the study area, for co-registering all raster covariates, and (iii) produce a number of raster 'distance to' covariates that were unaffected by edge effects due to the fact that the study area is artificially bounded while spatial processes are not 65 . The population density response variable was obtained through dividing population counts by the area of the corresponding administrative units, and log-transforming the results to normalize the response variable distribution.
Covariates for input to the RF method were derived as follows. First, a continuous raster dataset representing the spatial variation of topographic slope was derived from the USGS HydroSheds dataset (Table 2). Then, all raster datasets representing continuous variables, including the latter, were projected, resampled to 100 m resolution, co-registered and matched to the rasterized buffered study area. For all covariates, 'NoData' grid cells overlapping the rasterized buffered study area were filled with the values of the nearest neighbours (using the Nibble tool available in ArcGIS 10.1). All vector and raster datasets representing categorical variables were projected, rasterized to or resampled to 100 m resolution, co-registered, matched to the rasterized buffered study area, and converted into a number of binary raster covariates, representing presence/absence of a given feature, that were subsequently used to produce continuous 'distance to' and 'proportion of' raster covariates (Table 2); with the latter representing, within a 500 m buffer from each grid cell, the proportion of grid cells where the given feature is present.
A special case of a categorical raster dataset is the land-cover data. Indeed, in this case land-cover classes must be aggregated (if needed) and recoded to match the ten WorlPop Americas classes derived   Table 1). By default, the recoded GlobCover dataset was 'Nibbled', to fill in any missing grid cell, and then mosaicked with the MODIS 500 m Global Urban Extent dataset to delineate the extent of urban and non-urban built-up areas (i.e., class 190 and 240 respectively in Supplementary Table 1). When using the GeoCover-LC dataset, it was first recoded and mosaicked with the GlobCover dataset, to fill missing grid cell, and with the MODIS 500 m Global Urban Extent dataset. Similarly to the other raster datasets representing categorical variables, the processed land-cover raster dataset obtained as described above was projected, resampled to 100 m resolution, co-registered, matched to the rasterized buffered study area, and converted into twelve binary raster covariates including the combined built-up areas class (BLT) obtained by combining classes 190 and 240. Binary raster covariates were subsequently used to produce continuous 'distance to' and 'proportion of' raster covariates (Table 2). Finally, average and modal values for continuous and binary covariates, respectively, were calculated for each administrative unit and used for fitting the RF model.  The preparation of the population density response variable and raster covariates was entirely performed using the WordPop-RF code (Data Citation 1) described in the Code availability section below and publicly available through the figshare repository. In particular, the code relies on the Python programming language (version 2.6; www.python.org) and ArcGIS 10.1 arcpy package for performing the GIS-specific spatial operations required for preparing both the response variable and raster covariates (refer to the Supplementary File 1 for a technical description of how the GIS-specific spatial operations are implemented). For each country, all derived covariates are listed in the metadata file accompanying the corresponding gridded population datasets (refer to the Data Record section below for a more detailed description of the metadata file contents).

Code availability
The WordPop-RF code (Data Citation 1), used to produce the WorldPop Americas datasets, as well as the metadata and the KML files associated with them (refer to the Data Records section for a description of the latter), is publicly available through the figshare repository. The code consists of two Python (version 2.6; www.python.org) and four R (version 2.15.3) programming language scripts that must be run sequentially in the following order: 1) 01.0-Configuration.py.R; 2) Metadata.R; 3) 01.1-Data Preparation, R.r; 4) 01.2-Data Preparation, Python.py; 5) 01.3-More Complex Random Forest Regression, Full Covariate Set and Data Preparation.r; 6) 01.4-Process Density Weights to Population Maps.py; 7) 01.5-Generate KML.r; 8) 01.6-Generate Metadata Report.r. Each script is also internally documented in order to both explaining its purpose (including a detailed description of the GIS-specific spatial operations that it performs) and, when required, guiding the user through its customization.

Data Records
The high-resolution WorldPop Americas datasets described in this article referring to the 28 countries listed in Table 1, are publicly and freely available both through the WorldPop Dataverse Repository (Data Citation 2) and the WorldPop project website (http://www.worldpop.org.uk/data/). However, while the WorldPop Americas datasets stored in the Dataverse Repository represent a static version of the datasets produced at the time of writing and will be preserved stably in their published form, the datasets stored in the project website (Supplementary Table 2) will be expanded by including additional countries located in the region and updated as better and more recent official population count data and covariates become available.
Both through the Dataverse Repository and the project website, the WorldPop Americas can be download as 7-Zip archives (7-Zip.org) containing the population distribution datasets of the country it is associated with for the population count year, as well as for 2010, 2015, and 2020, and a RF model metadata report (Table 3).
Additionally, from the Data Availability page available on the WorldPop project website (http://www. worldpop.org.uk/data/data_sources/) it is also possible to browse the 7-Zip archives described above, download individual GeoTIFF datasets from them, and view the HTML files containing the RF model   based, (ii) its prediction error, (iii) the relative importance of each covariate, (iv) the prediction intervals using the OOB data (refer to the Methods section for additional information about the latter features).

Technical Validation
Root mean square error (RMSE) and mean absolute error (MAE) Six countries, located in different parts of the Latin American and the Caribbean region were selected to assess the increased accuracy of the RF-based dasymetric mapping approach with respect to a simple areal-weighting (SAW) approach 66 (Table 4). For each selected country, population counts were aggregated within the next coarser administrative level boundary than the finest for which they were available (e.g., if admin level 4 population count data were available, these were aggregated to admin level 3). The coarser, aggregated population counts were then used to produce gridded population count datasets, with a resolution of 100 m, using both the SAW and the RF approach outlined here. Finally, the two different population estimates produced using these approaches within each of the finest administrative unit were calculated, and compared with observed population figure referring to the same higher resolution unit. Results, summarized in Table 4, show how both the RMSE, the %RMSE (RMSE expressed as a percentage of the average population of the finest administrative unit level), and the MAE values (5th, 6th, and 7th column of Table 4, respectively) calculated using the RF-based outputs are consistently lower than the corresponding values calculated for the SAW outputs. These statistics can be used to compare the accuracy of the two approaches when downscaling the estimates.

Out-of-bag (OOB) error estimation
The OOB error estimate (3rd column of Table 4), as already briefly described in the Methods section, is internally calculated during the RF model fitting and can be considered a robust and unbiased measurement of the prediction accuracy of the model itself 27 .
Nevertheless, it is important to note that since the RF model is fitted at the administrative unit level and then is used to predict at the grid cell level, the OOB error estimate should not be interpreted as the prediction error at the grid cell level. Similarly, it does not represent the prediction error that could be expected to be observed at the administrative unit level by summing all final grid cell values within each administrative unit and comparing it to the observed population count referring to the same administrative unit. However, referring to the six countries mentioned in the previous section, by comparing the OOB error estimates calculated at the aggregate lower administrative unit level than the highest available (3rd column of Table 4) with the corresponding RMSE and MAE values (5th and 7th column of Table 3, respectively), it is reasonable to expect that higher accuracy of predicted values at the administrative unit level results in a higher accuracy of the final gridded population distribution datasets 24 .

Usage Notes
The WorldPop Americas datasets can be used both to support applications for planning interventions, measuring progress, and to predict response variables intrinsically dependent on the population distribution. However, considering that they represent modelling outputs generated using ancillary covariate datasets in the disaggregation process, to avoid circularity, they should not be used to make predictions or explore relationships about any of these ancillary datasets 14 . Thus, before using WorldPop Americas datasets in correlation analyses against factors which are included in the process of their construction (e.g., correlating population distribution with land-cover), ideally the population modelling process should be re-run using the WordPop-RF code (Data Citation 1) with the covariate of interest being removed to avoid issues relating to endogeneity.