## Background & Summary

Scientific investigations related to the human presence on Earth strongly rely on the availability of accurate and reliable information on the extent and location of settlements. In this framework, since early 1980s satellite imagery has been used as primary source to outline settlements at global scale1,2 and - along with technical, methodological and computational advances - their detail evolved from low resolution (1 km - 500 m) to medium resolution (100 m) and, since the last few years, to high resolution (30–10 m).

Nowadays, satellite-based settlement extent maps are widely used for many scientific purposes. For instance, in global urbanization analyses3 they are used to define urban areas, as well as characterize their morphology and assess the correlation with socio-economic variables. In spatial demography4, the location of settlements represents a fundamental spatial covariate to model the displacement of people, whereas in land-use science5,6 settlement extent is exploited as key input for calibrating land-use change models.

In the last few years, different high resolution layers outlining the global settlement extent have been presented in the literature7,8,9,10,11,12,13. Among these, the three most largely employed include:

• the Global Urban Footprint – GUF7 (available at 12 m resolution and referring to the year 2012) generated by the German Aerospace Center (DLR) from 3 m resolution TerraSAR-X/TanDEM-X radar imagery;

• the 2014 instance of the Global Human Settlement Layer – GHSL8, generated at 30 m resolution by the Joint Research Center (JRC) of the European Commission from Landsat-8 optical imagery;

• the artificial surfaces mask of the GLOBELAND30 – GLC309 (available at 30 m resolution and referring to the year 2010), generated by the National Geomatics Center of China from Landsat-5/7 optical imagery.

Among these, the GUF outperforms the other two layers13, which show severe under- and over-estimation in large parts of the world. Nevertheless, the GUF itself still exhibits two major drawbacks. On the one hand, it has been generated (like the GHSL and the GLC30) from single-date scenes, which are sometimes strongly affected by the specific acquisition conditions, hence resulting in misclassification errors. On the other hand, commercial imagery has been employed, which prevents a systematic update due to its high costs. Moreover, the exclusive use of optical or radar imagery alone represents an additional limitation since these two types of data are sensitive to different structures on the ground (i.e., artificial surfaces and built-up areas, respectively). For instance, bare soil and sand tend to be misclassified as settlements using optical imagery, while this does generally not occur with radar data; on the contrary, complex topography areas or forested regions can be wrongly categorized as settlements with radar imagery, whereas normally this does not occur using optical data.

To overcome these issues, we have developed a novel and robust methodology to reliably outline settlements which jointly exploits, for the first time, open-and-free multitemporal optical and radar data. In particular, the rationale is that the temporal dynamics of human settlements over time are different than those of all other non-settlement classes. First, we gather all the images acquired over a region of interest within a target period during which we do not expect considerable changes (e.g., one year). Next, we extract key temporal statistics (i.e., temporal mean, minimum, maximum, etc.) of: (i) the original backscattering value in the case of radar data; and (ii) different spectral indices (e.g., vegetation index, built-up index, etc.) derived after performing cloud/cloud-shadow masking in the case of optical imagery. After automatically extracting candidate training samples for the settlement and non-settlement class, binary classification based on advanced machine learning is separately applied to the optical- and radar-based temporal features. Finally, the two outputs are properly combined together.

Once tested its high robustness on a variety of study sites, the method has been employed to generate the World Settlement Footprint (WSF) 2015, a 10 m resolution (0.32 arc sec) binary mask outlining the extent of human settlements globally derived by means of 2014–2015 multitemporal Sentinel-1 (S1) radar and Landsat-8 optical imagery (of which ~107,000 and ~217,000 scenes have been processed, respectively). The WSF2015 is extremely accurate and reliable and outclasses all other mostly employed similar datasets. This has been quantitatively assessed through an unprecedented validation exercise based on 900,000 ground-truth samples collected by crowdsourcing photointerpretation and carried out in collaboration with Google. To this purpose a statistically robust and transparent protocol has been defined following recommended state-of-the-art practices.

## Methods

In this Section, we describe the novel methodology developed for outlining human settlement extent based on the joint use of multitemporal radar and optical imagery. The corresponding block scheme is reported in Fig. 1. First, both S1 and Landsat-8 data are pre-processed and suitable temporal statistics and texture features are computed. Then, training points for the settlement and non-settlement classes are derived by jointly exploiting both radar- and optical-based temporal statistics (along with additional ancillary information). Classification is performed separately for the two types of data by means of an ensemble of Support Vector Machines (SVM) classifiers. A final post-classification phase is dedicated to properly combine the Landsat- and S1-based classification maps and automatically identifying and deleting potential false alarms.

Each of the abovementioned steps is described into detail in the following. Next, the WSF2015 layer is presented along with all relevant details concerning its implementation.

### Preprocessing and feature extraction

As concerns S1 data, we take into account imagery acquired in Interferometric Wide swath (IW) mode (i.e., S1 main mode over land with 250 km swath). In particular, we consider High-Resolution Level-1 Ground Range Detected (GRD) products available at 10 m resolution.

All scenes acquired over the given study area in the target timeframe are first gathered and then pre-processed by means of the S1 Toolbox14. Specifically, this task includes:

• orbit correction (for improving the geocoding);

• thermal noise removal (for removing dark strips near scene edges with invalid data);

• radiometric calibration (for computing backscattering intensity using sensor calibration parameters in the GRD metadata);

• Range-Doppler terrain correction (for removing the brightness and geometric distortions occurring in correspondence of elevated and sloping terrain);

• conversion to decibel (dB) values (for reducing the very high dynamic range of data).

Scenes acquired in ascending and descending pass are treated separately due to the strong influence of the viewing angle in the backscattering of built-up areas. Furthermore, experimental analyses assessed that the joint employment of VV/VH imagery does not provide any considerable improvement with respect to the solely use of VV data; accordingly, VH data are disregarded.

As pointed out above, the rationale of the proposed approach is that given a series of multi-temporal images for a study area, the corresponding temporal dynamics of human settlements are sensibly different than those of all other non-settlement classes. For instance, in the case of radar data the backscattering temporal mean of built-up areas (due to double bounce reflection) is higher than that of forest areas (which might result in high backscattering in one/few acquisitions due to specific conditions, but in general exhibit lower values). To properly characterize this behavior, for each pixel we compute 5 key temporal statistics, namely the backscattering temporal maximum, minimum, mean, standard deviation, and mean slope (i.e., defined as the average absolute difference between consecutive items of the temporal series).

Texture information is also extracted to ease the identification of lower-density residential areas mostly characterized by single houses surrounded by vegetation (which are generally challenging to detect due to their lower backscattering values with respect to that of denser urban areas). To this purpose, we compute the coefficient of variation (COV) of the temporal mean backscattering, which is defined for each pixel as the ratio between the local standard deviation and the local mean calculated over a NxN pixel spatial neighborhood. In particular, the COV represents an estimate of the local image heterogeneity. Here, in the light of the 10 m spatial resolution of the considered S1 data, a neighborhood of 5 × 5 pixels proved to be an effective choice.

Overall, both for VV ascending/descending passes, the final S1 feature stack includes 7 features, namely: the 5 abovementioned temporal statistics and the COV derived from the backscattering temporal mean, plus the number of available scenes per pixel.

In the case of Landsat-8, imagery taken at 30 m resolution by the Operational Land Imager (OLI) sensor is used. In particular, we only consider scenes acquired in the target period over the study area with cloud cover lower than 60% (as reported in the corresponding metadata). Indeed, we experienced that further raising this threshold often results in accounting for images with non-negligible misregistration error. Data are then calibrated and Top-Of-Atmosphere (TOA) radiance is extracted.

A mask is then generated for each image to exclude pixels affected by cloud and cloud shadows from the analysis. To this purpose the Function of mask (FMask) algorithm is applied given its assessed effectiveness in the scientific community15. Besides pixels covered by clouds and cloud shadows, the algorithm also identifies snow, clear land and clear water pixels; in particular, this is done by jointly analyzing the Normalized Difference Vegetation Index (NDVI), the Normalized Difference Snow Index (NDSI), and the Brightness Temperature for the given scene.

A thorough experimental analysis has been carried out to identify a set of spectral indices highly suitable for an effective delineation of human settlements; in particular, the final list and corresponding formulas are reported in Table 1. The Normalized Difference Built-Up Index (NDBI)16 has been applied to extract built-up areas in many studies17,18; nevertheless, due to the use of the first short-wave infrared (SWIR) band (i.e., OLI band 6) this index is also sensitive to vegetation with low water content19, which exhibits values comparable to those of settlement areas. Accordingly, the Normalized Difference Middle Infrared index (NDMIR) and the NDVI are applied to overcome this issue. On the one hand, the NDMIR is computed using both SWIR bands (i.e., OLI bands 6 and 7), thus being sensitive to vegetation moisture20. On the other hand, the NDVI21 has been widely employed in a variety of land cover applications as well as in the context of settlement extent classification22,23. Moreover, the Modified Normalized Difference Water Index (MNDWI)24 is also employed to discriminate water from settlement areas. Such index enhances the performance of the NDWI25 by replacing the MIR with the NIR band (i.e., OLI band 5), which leads to a reduction of noise from built-up areas. In addition to the previous, two other spectral indices have been included for improving the discrimination between settlement areas and bare soil/bare rocks; specifically, these are the Normalized Difference Red Blue (NDRB) and Normalized Difference Green Blue (NDGB) indices26.

To characterize the generally stable temporal dynamics of the settlement class with respect to the other non-settlement classes, the same set of 5 key temporal statistics used in the case of S1 data are extracted for each of the 6 Landsat-8 spectral indices presented above. Moreover, to improve the detection of rural and suburban areas (mostly characterized by a low share of built-up areas and a high share of vegetation, thus resulting in a heterogeneous environment compared to denser built-up areas), also here additional texture features are extracted. In particular, for each of the derived 6 temporal mean indices, we computed the corresponding COV in a neighborhood of 3 × 3 pixels, which empirically proved the most effective choice in the light of the 30 m spatial resolution of Landsat data.

The final Landsat-8 feature stack includes 37 bands, namely: temporal maximum, minimum, mean (plus the corresponding 3 × 3 COV), standard deviation and mean slope for NDBI, NDVI, MNDWI, NDMRI, NDRG and NDGB, along with the number of available cloud/cloud-shadow-free acquisitions per pixel.

In Fig. 2, examples are given for Ho Chi Minh (Vietnam), Istanbul (Turkey), Johannesburg-Pretoria (South Africa), Karachi (Pakistan), Lagos (Nigeria), and Moscow (Russia). Specifically, in addition to reference Google Earth imagery we display for each city: (i) a RGB color composition obtained combining the Landsat-8 temporal mean NDBI (red), NDVI (green) and MNDWI (blue); (ii) the S1 VV backscattering temporal mean. The same visualization parameters have been consistently applied to all 6 sites.

Yet by simple visual inspection, it is possible to appreciate the advantage of jointly employing radar- and optical-based temporal statistics. In particular, even if it is not feasible to properly delineate settlements by solely using radar data, it is clear how optical imagery helps overcoming this issue and vice-versa. For instance, in the case of Lagos, the backscattering is counterintuitively low in several highly urbanized areas; nevertheless, this occurs due to the extremely high building density (mostly informal housing) which prevents the typical radar double bounce reflection. Instead, when employing Landsat imagery the settlement outline is clearly distinguishable. On the contrary, optical-based temporal features are often not effective alone in arid regions - as in Karachi - where bare areas tend to be misclassified as settlements. Nevertheless, these can be effectively outlined by means of S1 temporal statistics.

### Training points selection

Reliably identifying training points for the settlement and non-settlement class proved being the most critical task of the whole classification system; indeed, a training set including a consistent number of mislabeled samples would most likely result in poor performances. To this purpose, we designed a strategy which jointly exploits the temporal statistics computed for both S1 and Landsat data, along with additional ancillary information. In particular, any given sample x in the study area is labelled as potentially settlement or non-settlement if it satisfies all the corresponding conditions listed in Table 2 (where different thresholds have been determined based on extensive empirical analysis against Google Earth VHR imagery carried out over 450 test sites of 1 × 1 degree size distributed all over the world).

Concerning optical data, we generally observed that most of the pixels can be effectively outlined as settlement/non-settlement by jointly thresholding the corresponding NDBI, NDVI, and MNDWI temporal mean features. Nevertheless, being all 3 spectral indices correlated to the presence of vegetation, absolute threshold values are not globally effective as vegetation strongly varies depending on climate. To overcome this drawback, we took into account the well-established Köppen Geiger scheme27 and for each climate type we determined specific thresholds for outlining both candidate settlement and non-settlement training samples. Referring to Table 2, $$KG\left(x\right)$$ denotes the Köppen-Geiger classification for the given pixel x. $${A}_{Smin}\left(KG\left(x\right)\right)$$ and $${A}_{Smax}\left(KG\left(x\right)\right)$$ denote minimum and maximum thresholds, respectively, for the temporal mean $$\bar{\rm{A}}\left(x\right)$$ of the spectral index  AA{NDBI,NDVI,MNDWI} defined to determine whether x is a candidate settlement training sample. Similarly, $${A}_{NSmin}\left(KG\left(x\right)\right)$$ and $${A}_{NSmax}\left(KG\left(x\right)\right)$$ denote minimum and maximum thresholds defined to determine whether $$x$$ is a candidate non-settlement training sample. Furthermore – in the reasonable hypothesis that the higher is the number of cloud/cloud-shadow free acquisitions, the more robust are the corresponding temporal statistics – we exclude all pixels whose number of Landsat-8 clear observations (i.e., NLC8 (x)) is lower than 5. Since the Köppen Geiger classification includes 30 different climate types, overall we determined 360 thresholds on the three indices.

Regarding radar data, we expect the temporal mean backscattering of most settlement samples to be sensibly higher than that of all other land-cover classes. Accordingly, samples whose temporal mean backscattering (either in the case of data acquired in ascending $${\bar{\sigma }}_{A}^{0}\left(x\right)$$ and descending $${\bar{\sigma }}_{D}^{0}\left(x\right)$$ pass) is:

• lower than -7 dB are not eligible to be labelled as settlement training samples (if the number of ascending/descending scenes used for computing the temporal statistics $${N}_{S1A}\left(x\right)$$/$${N}_{S1D}\left(x\right)$$ is higher than or equal to 5);

• greater than -11 dB are not eligible to be labelled as non-settlement training samples (if the number of ascending/descending scenes used for computing the temporal statistics $${N}_{S1A}\left(x\right)$$/$${N}_{S1D}\left(x\right)$$ is higher than or equal to 5).

It is worth noting that in complex topography regions: (i) radar data often show high backscattering comparable to that of settlements; and (ii) bare rocks are present, which often exhibit a behavior similar to that of built-up areas in the Landsat-based temporal statistics. Accordingly, to exclude these from the analysis, we mask all pixels whose slope28 (i.e., the angle corresponding to the maximum elevation difference between the given pixel and its 8 neighbors) is higher than 10 degrees. To this purpose, we employed the Shuttle Radar Topography Mission (SRTM)29 Digital Elevation Model (DEM) for latitudes between −60° and +60° and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER)30 DEM elsewhere.

### Classification

In the light of their proven effectiveness and high generalization capabilities, Support Vector Machines (SVM)31,32 with Radial Basis Function (RBF) Gaussian Kernel have been chosen for the classification task.

In general, the criteria defined in the previous section result in a high number of candidate training points; thus, a subset should be sampled to keep the computational burden under control. For instance, a reasonable choice when investigating large regions is to subdivide the study area in working units of 1 × 1 degree size; in this case, an effective strategy proved extracting 500 samples for the settlement and 500 for the non-settlement class.

The stacks of Landsat- and S1-based temporal features are classified separately since this proved more effective than performing a single classification on the merger of the two stacks. In both cases, a grid search with a 5-fold cross validation33 approach is employed to identify the optimal values for the learning parameters (i.e., the ones expected to provide the best possible discrimination between the settlement and non-settlement classes). These include $$\gamma$$ and $$C$$ which tune the SVM kernel spread and error penalization, respectively32. In our analyses, we test all combinations with $$C={2}^{i},i\in {{\mathbb{Z}}}^{\ge 0},i\le 13$$ and $$\gamma =0.1\cdot j,j\in {{\mathbb{Z}}}^{+},j\le 20$$.

Since results might vary depending on the specific subset of selected training points, as a means to further improve the final performances and obtain more robust classification maps, we randomly subset 20 different training sets and feed an ensemble of as many SVM classifiers. Then, we apply a majority voting approach34,35 to handle the resulting maps and each pixel is finally associated with the settlement class only if it is labeled as settlement at least 11 over 20 times.

### Post-Classification

A final post-classification phase is dedicated to properly combine the Landsat- and S1-based classification maps and automatically identifying and deleting false alarms. To this purpose, an updated version of the post-editing object-based approach adopted in the production of the GUF layer has been used7, which exploits the 9 reference binary datasets (7 global and 2 continental) described in Table 3.

A settlement agreement mask is first generated from the combination of 6 reference layers (i.e., DLR-RC, CIL, OSM-S, OSM-R, GL30-S, and NLCD), which is labeled as positive only where two or more of these are positive. Likewise, a settlement exclusion mask is obtained by combining 3 reference layers (i.e., DLR-RM, GLC30-W, GLC30-WL), which is labelled as positive where at least one of these is positive.

Next, segmentation is applied to both Landsat- and S1-based classification maps for categorizing each cluster of connected pixels as individual objects; in particular, this is carried out by exploiting contour tracing to iterate over an image only once36.

Objects are then removed if:

• their extent overlaps for less than 30% the settlement agreement mask and, concurrently, it overlaps for more than 30% the settlement exclusion mask (this helps excluding objects wrongly covering complex topography regions, water or wetlands);

• the zonal mean of the Landsat-based temporal mean NDVI is higher than 0.6 (this is mostly the case of false detections in the S1-based classification occurring in correspondence of specific types of dense forests);

• the zonal mean of the S1 temporal mean backscattering (either computed for scenes acquired with ascending or descending pass) is lower than -11 dB (this is mostly the case of false detections in the Landsat-based classification occurring in correspondence of bare soil and sand).

The final classification map is given by the merger of the objects preserved in the Landsat- and S1-based classification maps.

### The WSF2015

The methodology presented above has been applied globally to generate the WSF2015 layer. Concerning radar data, pre-processing and feature extraction have been performed for ~107,000 S1 scenes (i.e., ~51,000 collected with ascending pass and ~56,000 with descending pass) acquired in 2014–2015. In particular, this task has been directly supported by Google through its Earth Engine cloud computing platform37.

As regards optical imagery, pre-processing and feature extraction have been performed for ~217,000 Landsat-8 scenes acquired in 2014–2015 with less than 60% cloud cover and downloaded from US Geological Survey (USGS), European Space Agency (ESA) and the Google Cloud Storage. All cloud/cloud-shadow masks have been obtained from USGS via the ESPA (Earth Resources Observation and Science (EROS) Center Science Processing Architecture) on demand interface which employs a C version of the FMask algorithm. The resulting dataset, for which more than 1.5PB of intermediate products were generated, is referred to as Landsat TimeScan 201538. Specifically, the whole processing has been carried out at the IT4Innovations Czech supercomputing center (Ostrava) in the framework of ESA’s Urban Thematic Exploitation Platform (U-TEP)39 project. Classification has been also carried out in the same infrastructure, whereas post-classification activities have been performed in the Calvalus system40 available at DLR’s Earth Observation Center.

To effectively handle the huge amount of data to process, working units of 1 × 1 degree size have been defined and the final WSF2015 is obtained as a mosaic of ~14 K tiles (where at least a single settlement has been detected).

In Fig. 3, an overview of the WSF2015 is given for the entire World, along with 4 different zooms referring to: (a) Eastern China and Korea; (b) Western Europe; (c) Mid-Atlantic USA; and (d) the Nairobi region in Kenya. Zoom a refers to one of the most populated regions of the world, including - among others - the Bohai Economic Rim (at the top), as well as the Yangtze River Delta and Pearl River Delta megalopolis (at the center and bottom, respectively). However, yet at this scale it is immediately evident how, besides the major cities, the WSF2015 also outlines the thousands of medium and small-size settlements scattered throughout the whole region, especially in the North China Plain which exhibits an extremely flat topography. This is also evident in Zoom b, where the myriad of towns and (especially) small villages characterizing the Western European landscape are properly mapped. Moreover, one can also start noticing the high detail in the delineation of the bigger cities (e.g., London at the left, Paris at the bottom left, Berlin at the top right), which can be further appreciated in Zoom c. Here, a portion of the US Northeast megalopolis is shown stretching from Washington-Baltimore (bottom left corner), to Philadelphia (center) and to New York (top right corner). The WSF2015 reliably outlines all major centers, as well as their fragmented metropolitan and suburban areas; concurrently, it also detects the small rural villages located in the nooks of the Appalachian Mountains (top left corner). Finally, in Zoom d the layer proves capable of capturing the very complex settlement pattern north of Nairobi (located at the bottom center) which includes the counties of Muranga and Kiambo. Specifically, these are mostly characterized by a rugged landscape interspersed with several hillocks where residents intensively settled along the many valleys in the region (thus resulting in the striped linear pattern that might be falsely interpreted as misclassification at first sight).

Overall, the WSF2015 estimates a global settlement surface of ~1.28 MKm², which corresponds to ~0.95% of the emerged surfaces (i.e., ~134.77Mkm² excluding Antarctica). The dataset has been recently used by the Authors to perform a thorough analysis of the worldwide settlement spatial variability and structure through advanced scaling analysis41. Settlement density proved not suitable for explaining alone the high variability of existing patterns, hence a novel global categorization is proposed.

## Data Records

The WSF2015 layer described in this article is publicly and freely available through figshare42. The dataset is organized for download in 306 GeoTIFF files (EPSG4326 projection, deflate compression) each one referring to a portion of 10 × 10 degree size (~1110 × 1110 km) whose upper-left and lower-right corner coordinates are specified in the file name [e.g., the tile WSF2015_v1_EPSG4326_e010_n60_e020_n50.tif covers the area between (10E;60 N) and (20E;50 N)]. A virtual mosaic file (i.e., WSF2015_v1_EPSG4326.vrt) is also provided which allows visualizing the global product at once on most diffused GIS platforms (e.g., ArcGIS, QGIS, MapInfo). Settlements are associated with value 255; all other pixels are associated with value 0.

Additionally, 5 resampled versions are also provided at 100 m, 250 m, 500 m, 1 km and 10 km, respectively, reporting for each pixel the corresponding ground percent surface covered by settlements. These can be efficiently used, for instance, as input to regional, continental, or global models and are distributed as individual GeoTIFF files embedding overviews for the levels 2, 4, 8, 16, 32, 64, 128 and 256.

## Technical Validation

In the framework of remote sensing, accuracy assessment is generally separated into three major components43, namely:

• response design, which defines the protocol for determining whether the map and reference classifications are in agreement; sampling design, which defines the protocol for identifying a representative subset of the region under analysis (given the impossibility of applying the response design to the entire classification map);

• analysis, which defines how to quantify accuracy.

In the following, the strategy designed for validating the WSF2015 is presented; in particular, specific details are given about the protocols adopted for each of the abovementioned components, while final results are discussed afterwards.

### Response design

The four major features of the response design include the source of information from which reference data are taken, the spatial unit, the labeling protocol for the reference classification, and a definition of agreement:

• Source of Reference Data: Google Earth satellite/aerial VHR imagery available for the period 2014–2015 has been used. The spatial resolution varies depending on the specific data source; in the case of SPOT imagery it is ~1.5 m, for Digital Globe’s WorldView-1/2 series, GeoEye-1, and Airbus’ Pleiades it is in the order of ~0.5 m resolution, whereas for airborne data (mostly available for North America, Europe and Japan) it is about 0.15 m.

• Spatial Assessment Unit: since input data with different spatial resolutions have been employed to generate the WSF2015 (i.e., 30 m Landsat-8 and 10 m S1), a 3 × 3 block spatial assessment unit composed of 9 cells of 10 × 10 m has been chosen.

• Reference Labeling Protocol: in our study we define:

• building as any structure having a roof supported by columns or walls and intended for the shelter, housing, or enclosure of any individual, animal, process, equipment, goods, or materials of any kind;

• building lot as the area contained within an enclosure (e.g., wall, fence, hedge) surrounding a building or a group of buildings;

• road as any long, narrow stretch with a smoothed or paved surface, made for traveling by motor vehicle, carriage, etc., between two or more points;

• paved surface as any level horizontal surface covered with paving material.

Based on this taxonomy, 4 possible labels have been defined, namely:

• Buildings: if the given cell intersects any building;

• Building Lots: if the given cell intersects any building lot and no buildings;

• Roads/Paved-Surfaces: if the given cell intersects any road/paved surface and no buildings or building lots;

• None of the previous.

The labelling task has been performed by crowdsourcing internally at Google. Specifically, by means of an ad-hoc tool, operators have been iteratively prompted a 3 × 3 assessment unit on top of the available Google Earth reference VHR scene closest in time to the year 2015 and given the possibility of assigning any of the 4 labels defined above to each cell. For training the operators, a representative set of 100 reference 3 × 3 units was prepared in collaboration between Google and DLR.

• Defining Agreement: to cope with the different existing definitions of settlement, we computed the assessment figures by separately considering as settlement all areas covered by: (i) buildings; (ii) buildings or building lots; and iii) buildings, building lots or roads/paved-surfaces. Furthermore, 4 different agreement criteria have been defined, specifically:

1. (1)

for each cell, positive agreement occurs only for matching labels between the classification and the reference;

2. (2)

for each block, a majority rule is applied over the entire 3 × 3 block of both the classification and the reference; if the final labels match, then the agreement is positive;

3. (3)

for the classification, a majority rule is applied over the entire 3 × 3 block; for the reference, each block is labelled as settlement only if it contains at least one cell marked as settlement; if the final labels match, then the agreement is positive;

4. (4)

for both the classification and the reference, each block is labelled as settlement only if it contains at least one cell marked as settlement; if the final labels match, then the agreement is positive.

### Sampling design

As recommended in the state-of-the-art good practices for assessing land-cover map accuracy44,45, stratified random sampling design has been chosen. In particular, it is a probability sampling design and one of the easiest to implement; indeed, it involves first the division of the population (i.e., the collection of all pixels contained in the map) into mutually exclusive subsets (i.e., strata) within which random sampling is performed afterwards.

To include a representative set of settlement patterns, 50 tiles of 1 × 1 degree size (out of the ~14.000 composing the WSF2015) have been selected based on the ratio between the number of settlements (i.e., disjoint clusters of pixels categorized as settlement in the WSF2015) and their overall area. In particular, the i-th selected tile has been chosen randomly among those whose ratio belongs to the interval $$[{P}_{2\left(i-1\right)};{P}_{2i}],i\in \left[1;50\right]\subset {\mathbb{N}}$$ (where Px denotes the x-th percentile of the ratio). The final selected tiles are shown in red in Fig. 3.

As the settlement class covers a sensibly smaller area compared to the merger of all other non-settlement classes, an equal allocation reduces the standard error of its class-specific accuracy. Moreover, such an approach allows to best address user’s accuracy estimation, which corresponds to the map “reliability” and is indicative of the probability that a pixel classified on the map actually represents the corresponding category on the ground46,47. Accordingly, for each of the 50 selected tiles we randomly extracted 1,000 settlement and 1,000 non-settlement samples from the WSF2015 and used these as center cells of the 3 × 3 block assessment units to be labelled by photointerpretation. Such a strategy resulted in an overall amount of (1,000 + 1,000) × 9 × 50 = 900,000 cells labelled by the crowd. To our knowledge, this outnumbers any other similar exercise presented so far in the literature.

### Analysis

To finally assess the accuracy of the WSF2015, we considered a series of measures commonly employed in the remote sensing community44, namely:

• the Kappa coefficient48,49, which jointly takes into account omission (i.e., underestimation) and commission (i.e., overestimation) errors, as well as the possibility of chance agreement between classification and reference maps. Kappa assumes values between −1 and 1 and a common rule-of-thumb for its interpretation is the following50: < 0 no agreement; 0–0.20 slight; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.0 perfect;

• the percent producer’s accuracies PAS% and PANS% of the settlement and non-settlement class, respectively. Specifically, they denote the portion of assessment units (i.e., cells or blocks) categorized as settlement/non-settlement according to the collected reference information which are correctly categorized as settlement/non-settlement in the classification map. Its complementary measure (100 – PA%) corresponds to the percent omission error;

• the percent user’s accuracies UAS% and UANS% of the settlement and non-settlement class, respectively. Specifically, they denote the proportion of all assessment units (i.e., cells or blocks) categorized as settlement/non-settlement in the classification map which are categorized as settlement/non-settlement also according to the collected reference information. Its complementary measure (100 – UA%) corresponds to the percent commission error;

• the percent average accuracy AA%, which is obtained as the mean between PAS% and PANS% and represents a balanced measure of correct settlement and non-settlement detection.

### Quality assessment

Figure 4 reports the accuracies over the 900,000 collected reference samples computed for the WSF2015 and, concurrently, the GUF, GHSL and GLC30 layers for comparison. In particular, results are given for all combinations (overall 12) of three considered settlement definitions and four assessment criteria. Due to the different spatial resolution of the GUF (12 m) and both the GHSL and GLC30 (30 m), while assessing their quality, each 10 × 10 m cell of the considered block spatial assessment unit is tagged as settlement only if the intersection with the specific layer is positive.

Noticeably, in all experiments the WSF2015 exhibited the best AA%, with a remarkable average of 86.37 and a mean increase with respect to GUF, GHSL and GLC30 of +6.24, +15.28 and +18.58, respectively. Alongside, it resulted in an average Kappa of 0.6885 with a mean increase of +0.0754 with respect to the GUF and, especially, +0.2338 and +0.2975 with respect to GHSL and GLC30, respectively.

By analyzing the numbers into detail, one can notice a noteworthy increase of the WSF2015 Kappa coefficient for assessment criteria 3 and 4 (0.7646 on average) with respect to criteria 1 and 2 (0.6123 on average). This is due to the fact that 30 m resolution Landsat imagery has been employed to generate the product. Hence, even if just a portion of the Landsat pixel on the ground intersects any building, building lot or paved surface, this mostly has a considerable effect in the corresponding spectral signature and the pixel tends to be finally categorized as settlement. This is taken into account by assessment criteria 3 and 4, since the entire 30 × 30 m reference block spatial assessment unit is labelled as settlement even if it contains just one cell marked as settlement.

Assessment criteria 1 and 2 should be then considered more suitable for a fair comparison against the GUF given its 12 m spatial resolution. In this case, one can appreciate how the AA% and Kappa reported for the WSF2015 are in line with those exhibited by the GUF, which has been generated from highly expensive 3 m resolution commercial TerraSAR-X/TanDEM-X imagery. Instead, assessment criteria 3 and 4 allow a fair comparison against GHSL and GLC30 as they are both derived from Landsat data. Here, the WSF2015 exhibits notable AA% and Kappa up to 89.33 and 0.7822, respectively, outperforming both GHSL and GLC30 (with an increase always higher than 17 and 0.32, respectively).

From Fig. 4, one can also notice that on average results do not significantly vary across the three considered definitions of settlement; however, a proper analysis allows to better understand which one fits best with the different layers. Concerning the WSF2015, the highest accuracies mostly occur when considering as settlement the combination of buildings and building lots. Only for assessment criteria 1 and 2 Kappa is higher when also roads/paved surfaces are included. Indeed, despite generally associated with very low S1 backscattering values, most of these are not masked out given their fine scale within urban areas. As regards the GUF, highest AA% and Kappa occur partly when only buildings and partly when buildings and building lots are considered as settlement. This is in line with the theory, since the layer has been generated from radar imagery which is sensitive to vertical structures (these comprise both buildings, as well as main elements delimiting building lots like walls, fences, hedges, etc.). In the case of GHSL and GLC30, the two layers show a similar behavior and provide on average a slightly higher Kappa when settlements are defined as combination of buildings and building lots.

Giving a closer look to producer’s and user’s accuracies it is possible to better understand the nature of the different performances. All GUF, GHSL and GLC30 generally show very high PANS% (i.e., >85), but mostly exhibit consistently lower PAS%, with values never greater than 75.80, 52.39 and 44.26, respectively. On the contrary, the WSF2015 scores overall remarkably high PAS% and PANS% (on average 88.71 and 84.04, respectively) and, concurrently, it always shows the best UANS% in front of a UANS% only marginally lower than that of the other layers (on average 92.15 and 75.95, respectively). This quantitatively assesses the capability of the WSF2015 to effectively detect the presence of a considerable number of settlements actually unseen in the other global products. This occurs at the price of a minor settlement extent overestimation mostly due to the employment of the COV texture features; specifically, these allow a more accurate detection in rural and suburban areas, but sometimes result in an overestimation of 1–2 pixels around the actual settlement.

The improved detection performances of the WSF2015 can be qualitatively appreciated in Fig. 5, where a cross-comparison against GUF, GHSL and GLC30 is reported for three representative regions including the Igboland (i.e., a cultural and common linguistic region located in south-eastern Nigeria), Kampala (i.e., the capital and largest city of Uganda) and Bangalore (i.e., the capital of the Indian state of Karnataka). Despite the rather different settlement patterns, all three sites are characterized by the presence of medium and large size cities surrounded by a number of very small settlements. As one can notice, the WSF2015 proves extremely effective in all three cases, outperforming all other layers; specifically, it is capable of detecting a higher amount of small villages and better outlining the fringes of major urban areas. The GUF performs equally good only in the Igboland region, but detects considerably less settlements in the Bangalore and, especially, the Kampala case studies. Both GHSL and GL30 exhibit severe underestimation in all three test sites.

## Usage Notes

The WSF2015 will be a valuable product in support to all applications requiring detailed and accurate information on human presence. In particular, combined either with other EO or non-EO-based datasets (e.g., related to climate, health, economy, demography, etc.), it will enable deriving indices and metrics of help not only for scientific research but even decision making.

As assessed by the extensive validation exercise, the WSF2015 proved to be the currently most accurate and reliable product of its kind and will hence allow to improve any type of analysis carried out so far with other existing similar layers. Nevertheless, it is worth pointing out that - due to limitations specific of the data used - it was not feasible to consistently detect very small structures (e.g., huts, shacks, tents) because of their reduced scale, the specific building material employed (e.g., cob, mudbricks, sod, straw, fabric), their temporal nature (e.g., nomad or refugee camps), or the presence of dense vegetation preventing their identification.