Global high-resolution growth projections dataset for rooftop area consistent with the shared socioeconomic pathways, 2020–2050

Joshi, Siddharth; Zakeri, Behnam; Mittal, Shivika; Mastrucci, Alessio; Holloway, Paul; Krey, Volker; Shukla, Priyadarshi Ramprasad; O’Gallachoir, Brian; Glynn, James

doi:10.1038/s41597-024-03378-x

Data Descriptor
Open access
Published: 30 May 2024

Scientific Data volume 11, Article number: 563 (2024)

Abstract

Assessment of current and future growth in the global rooftop area is important for understanding and planning for a robust and sustainable decentralised energy system. These estimates are also important for urban planning studies and designing sustainable cities thereby forwarding the ethos of the Sustainable Development Goals 7 (clean energy), 11 (sustainable cities), 13 (climate action) and 15 (life on land). Here, we develop a machine learning framework that trains on big data containing ~700 million open-source building footprints, global land cover, road, and population datasets to generate globally harmonised estimates of growth in rooftop area for five different future growth narratives covered by Shared Socioeconomic Pathways. The dataset provides estimates for ~3.5 million fishnet tiles of 1/8 degree spatial resolution with data on gross rooftop area for five growth narratives covering years 2020–2050 in decadal time steps. This single harmonised global dataset can be used for climate change, energy transition, biodiversity, urban planning, and disaster risk management studies covering continental to conurbation geospatial levels.

Background & Summary

The global building stock consumed circa 18% of global electricity demand and contributed 21% of global GHG emissions in 2019¹. The United Nations² projects that the global population will grow from 8 billion in 2022 to 9.7 billion by 2050. This increase in population will require an increase in global building stocks and will have growing downstream effects on material demands³. In contemporary literature, rooftop areas or, more generally, vector building footprints enriched with building types, floor area per capita, construction year etc. are often used as a reliable proxy for generalising the global building stock⁴.

Hence, a harmonised global geospatial assessment of rooftop area is essential for various research domains, including urban planning and architecture⁵, renewable energy⁶, and sustainable development⁷, as it provides crucial data for optimising space usage, designing sustainable buildings, fostering renewable energy adoption, and improving the overall environmental performance of urban areas. A harmonised dataset that documents the global rooftop area is important not only to energy system modellers but also to national and international research institutions, as this spatially explicit dataset can aid in energy planning, access to energy, and analysing the impacts of extreme natural events⁸ and conflicts⁹. More important still is a first-order, harmonised, spatially explicit dataset that documents the future spatial growth in rooftop area, to aid cross-domain scenario analysis and policy formulation by incorporating different socioeconomic growth dynamics to fulfil the complementary needs of the Sustainable Development Goals and climate change mitigation.

Global assessment of gross rooftop area is a complex task as the smallest unit of assessment is a rooftop. This complexity is compounded by the fact that building stock archetypes change between geographies and are dependent on the socio-economic and cultural factors prevalent in the region of interest (ROI). In the past, bottom-up modelling approaches^{10,11,12,13,14} were used to assess the rooftop area at sub-national and national scales. Here, the studies focussed on the extrapolation of relationships between socioeconomic drivers and rooftop areas from a small sample region to a larger ROI. Although these methods are useful for rapid estimation of rooftop areas, they often report lower accuracies than the highly spatially resolved methods that utilise large-scale surveying of building stocks¹⁵.

On the other hand, highly spatially resolved top-down^{16,17,18,19,20} techniques like Light Detection and Ranging (LiDAR) based rooftop mapping, which uses a drone-mounted laser to map the landscape in 3D, and Machine Learning (ML) based object detection have shown promising results for ROIs covering continental scales. LiDAR-based rooftop mapping is currently the most accurate method of determining the rooftop area along with capturing the rooftop attributes at scale. But these methods require significant investment in aerial imaging and computation, because of which the most common implementation of LiDAR-based rooftop mapping is limited to city-scale analysis. ML-based models form the next class of methods that can aid in the detection of building rooftops at scale. However, these methods have shown limited suitability for a global-scale study, as training ML models requires heavy investment in training data that should have enough diversity to cover a global ROI²¹. Additionally, a server-scale computational environment is required to train and generate inferences from these trained ML models, which requires significant cost and time investment. As a result, the largest ROI tackled by an ML-based approach covers the continent of Africa²⁰. Extending this to a global implementation is yet to be achieved due to complexities around capturing accurate, geographically diverse samples to train the ML models and the prohibitive cost of mapping the globe using LiDAR. Moreover, the application of the top-down method has been restricted to a single-year estimation of rooftop area, and only limited studies have researched advancing the bottom-up methods to future high-resolution estimation of growth in global rooftop area²².

A third stream of methods that can aid in the rapid assessment of rooftop areas at ROIs spanning continental scales is to use a hybrid approach. This approach utilises the spatial relationship among samples covering landcover mapping (derived from remotely sensed imagery), socioeconomic metrics and actual on-ground building stock attributes to infer rooftop areas for out-of-sample regions. Studies that have demonstrated this hybrid approach^{18,23} utilise statistical inferencing to generate these relationships for continental and country-level ROIs.

For this study, we combined the bottom-up and top-down approaches to develop a hybrid ML-based framework built on our previous learnings from a single-year global estimation of rooftop solar PV⁶. The hybrid ML framework learns from the spatial relationship between downscaled Gross Domestic Product (GDP)²⁴, population density^{25,26}, built-up area extent²⁷, and sample building footprints to estimate rooftop area in out-of-sample regions. The Shared Socioeconomic Pathways (SSP) narratives²⁸, which are extensively used in climate change research, examine how global society, demographics and economics might change over the next century by quantifying the narratives into numerical metrics that can be interpreted by mathematical models. The framework for SSPs starts with a narrative defining five different worlds based on challenges to adaptation and mitigation. SSP1 is the sustainable world, SSP3 is the world under regional rivalry having the highest challenges to mitigation and adaptation, SSP4 is the world of inequality with the highest challenge to adaptation, SSP5 is the fossil-fuelled world with the highest challenge to mitigation, and SSP2 is the middle-of-the-road pathway. By using SSP-specific spatially explicit growth in GDP²⁴, population density²⁹, and built-up area³⁰ as drivers to the trained ML framework, we estimated the growth in the global building footprint area, which we map one-to-one to gross rooftop area under each of these development pathways (Fig. 1). This way we combine the spatial attributes (built-up area) of top-down modelling with the statistical modelling (socioeconomic parameters) of bottom-up methods.

The hybrid ML framework allows for estimating the global gross rooftop area by leveraging the global statistical relationship between sample building footprints, on-ground built-up area, population and GDP. This mitigates the need for extensive ML-based building polygon extraction from remotely sensed images while providing accuracies in the range of ±0.1 km² in predicted rooftop area per 1/8-degree fishnet grid tile. Another advantage of the hybrid ML framework over top-down ML-based approaches is its low computational footprint: it avoids image processing altogether and hence lowers the barrier to working with open-source big data like building footprints, global road datasets etc.

Methods

Data collection

We started the task of data collection by defining a global fishnet (FN) grid at a spatial resolution of 1/8 degree. An FN grid cell measures approximately 14 km × 14 km at the equator; its area varies with latitude, but every cell maintains the same 1/8-degree length and height. This spatial resolution was chosen to match that of the SSP-derived population and built-up extent gridded datasets. A grid cell of roughly 14 km on a side provides an extent large enough to capture city limits at scale and small enough not to contain entire conurbations within a single cell.
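To illustrate how the area of a 1/8-degree cell changes with latitude, the following minimal sketch computes the area of a latitude-longitude quadrangle on a sphere. It assumes a spherical Earth with a mean radius of 6371.0088 km; the function name `cell_area_km2` is hypothetical, and the study itself uses geodesic areas, which this approximation does not reproduce exactly.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumption: mean Earth radius, spherical approximation
CELL_DEG = 1.0 / 8.0         # fishnet cell size in degrees


def cell_area_km2(lat_south_deg: float, cell_deg: float = CELL_DEG) -> float:
    """Area of a cell_deg x cell_deg cell whose southern edge lies at lat_south_deg."""
    lat1 = math.radians(lat_south_deg)
    lat2 = math.radians(lat_south_deg + cell_deg)
    dlon = math.radians(cell_deg)
    # Closed-form area of a latitude-longitude quadrangle on a sphere.
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))


for lat in (0.0, 30.0, 60.0):
    print(f"southern edge at {lat:4.1f} deg: {cell_area_km2(lat):7.1f} km^2")
```

Cells of equal angular size therefore cover less ground towards the poles, which is why per-cell areas are needed when converting counts into densities.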

Next, we chose 2020 as our base year with 2030, 2040, and 2050 as our medium-term time horizon projection years. Primary datasets collected during this study can be categorised into either vector datasets - big data derived base year building footprint polygons (BF20), Open Street Maps (OSM)³¹ derived base year building footprints (BF20_OSM) and global geo-mapped base year roads (RL20) - or raster datasets - base year global population count (PPLN20), base year global built-up extent (BU20), future SSP-derived gridded population (PPLN_X,Y), future SSP-derived gridded built-up extent (BU_X,Y), and future country-wise SSP-derived GDP (GDP_X,Y), where X is the SSP narrative and Y is the year. The attributes of the different base year and SSP-derived datasets are documented in Table 1 with a visual depiction in Fig. 2.

Table 1 Base year layers used in this study along with their attributes.

The building footprint data collected from the big data sources (BF20) had full country coverage for base year building polygon data in the USA, UK, Australia, and Canada. Full continental coverage was available for Africa except for the North African region, including countries above the Sahara Desert. For the rest of the world, building polygon data was derived from Open Street Maps, but the spatial coverage was sporadic, with good spatial coverage only available for the European continent. This uneven completeness of OSM-derived building footprints (BF20_OSM) encouraged us to create our own OSM Gap Detection application to capture selected data that has full completeness based on our FN grid (Usage Notes). The base year population count data (PPLN20) covers the entire global landmass; hence no further filtering or sampling of the dataset was required.

The base year global built-up extent dataset (BU20) had global coverage for the year 2019. The built-up layer captures the extent of human-made modifications on the earth. Using a suite of remote sensing techniques, these structures can be isolated from the natural landscape and the area occupied by these structures can be converted into a raster grid where each grid cell can represent either the built-up area contained within it or the percentage of area that is built-up. Naturally, built-up extent will capture roads, carparks, industrial sites, airport runways etc. that do not form part of the building footprint and can sometimes cover 2–3 times more area than a building footprint in a built-up raster cell²³. To account for this, we created an ML model to downscale the built-up extent to the estimated rooftop area which we will discuss in the Machine Learning model section.

The next step in our study after collection of base datasets for the year 2020 was to collect SSP-derived datasets for the years 2020, 2030, 2040 and 2050. In total, we collected SSP-derived data for gridded population, built-up extent, and GDP per country data for the years 2020–2050 (Fig. 3). The gridded population count dataset and built-up extent dataset were available as raster datasets at 1/8-degree spatial resolution, with the GDP per country dataset being mapped to respective country boundaries using an administrative boundary dataset from GADM project V3.6 (https://gadm.org/data.html).

Base year calibration and spatial harmonisation

After the collection and verification of base year datasets and SSP-derived datasets, we conducted a harmonisation of the base year across the datasets. This base year harmonisation was conducted for BU20 and BF20 layers. We assumed that the 2019 built-up extent of our BU20 layer represented the 2020 data points. Similarly, the BF20 layer polygon which contains building footprint information from multiple years across different datasets was assumed to represent building footprints for the year 2020. These assumptions add a component of uncertainty in the harmonisation as some buildings constructed during the year 2020 are not part of the training dataset, but at a global scale, these assumptions will have minimal effect on the final output of the study due to the design of our ML framework.

Base year data aggregation

After temporally harmonising the datasets to a common base year, we aligned the datasets on a common spatial resolution and projected coordinate system. For this, we mapped the base year datasets to the FN grid. We overlaid the FN grid on top of the BF20, PPLN20, BU20 and RL20 datasets and used a cookie-cutter approach to cut and aggregate the datasets within each unique FN grid cell. Next, the BU20 layer boundary inside each FN was chosen as the region of interest and any data point outside this BU20 boundary but inside the FN boundary was not considered. This provided us with the first stage of spatial harmonisation where only data points inside the BU20 layer extents were considered. To achieve this, we used the area outside the BU20 layer as a masking layer to select data points that are not masked.

The base year vector datasets representing non-masked BF20 and RL20 datasets were processed on the ArcGIS PRO V2.8 platform, where we used the inbuilt multicore processing enhancements to process the cutting and aggregation of vector datasets at scale. After the cutting step, each building polygon and road polyline feature inside each unique FN grid cell was aggregated to represent a single value per FN grid cell. It should be noted that a polygon falling on the FN grid cell boundary was intersected at the boundary and only the area of the polygon inside of the respective FN was attributed to that FN, Fig. 4.
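The boundary rule described above, where a polygon straddling a cell edge contributes only its inside portion, can be sketched with a Sutherland-Hodgman clip against an axis-aligned cell followed by a shoelace area. This is an illustration only, not the ArcGIS workflow used in the study: it assumes planar coordinates in a projected system, and the names `clip_to_cell` and `shoelace_area` are hypothetical.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def shoelace_area(poly: List[Point]) -> float:
    """Unsigned planar area of a simple polygon given as a list of vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _clip_edge(points: List[Point],
               inside: Callable[[Point], bool],
               intersect: Callable[[Point, Point], Point]) -> List[Point]:
    """Clip a polygon against one half-plane (one side of the cell)."""
    out: List[Point] = []
    for i, cur in enumerate(points):
        prev = points[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))  # entering the cell
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))      # leaving the cell
    return out


def clip_to_cell(poly: List[Point], xmin: float, ymin: float,
                 xmax: float, ymax: float) -> List[Point]:
    """Return the part of poly that lies inside the rectangular cell."""
    def x_cross(x: float) -> Callable[[Point, Point], Point]:
        return lambda p, q: (x, p[1] + (q[1] - p[1]) * (x - p[0]) / (q[0] - p[0]))

    def y_cross(y: float) -> Callable[[Point, Point], Point]:
        return lambda p, q: (p[0] + (q[0] - p[0]) * (y - p[1]) / (q[1] - p[1]), y)

    sides = [
        (lambda p: p[0] >= xmin, x_cross(xmin)),
        (lambda p: p[0] <= xmax, x_cross(xmax)),
        (lambda p: p[1] >= ymin, y_cross(ymin)),
        (lambda p: p[1] <= ymax, y_cross(ymax)),
    ]
    pts = list(poly)
    for inside, intersect in sides:
        if not pts:
            break
        pts = _clip_edge(pts, inside, intersect)
    return pts


# A 2 x 2 building footprint that straddles the right edge (x = 10) of a cell.
building = [(9.0, 0.0), (11.0, 0.0), (11.0, 2.0), (9.0, 2.0)]
inside_part = clip_to_cell(building, 0.0, 0.0, 10.0, 10.0)
print(shoelace_area(building), shoelace_area(inside_part))
```

In this example half of the footprint falls inside the cell, so only half of its area would be attributed to that cell and the remainder to its neighbour.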

The base year raster datasets representing non-masked PPLN20 and BU20 datasets were processed on the Google Earth Engine platform³². Both the datasets were clipped at the boundary of the overlapping FN and the pixels completely inside the FN were aggregated as is, with pixels falling on the boundary being aggregated using weighted summation. Here, the value attribution of the pixel in consideration was calculated based on the area of the pixel inside the FN. It should be noted that while the PPLN20 dataset represents a simple population count at 100 m resolution, the BU20 layer pixel represents the percentage of built-up area inside each 100 m pixel. Hence, the aggregation of BU20 pixel was undertaken by multiplying the pixel area by pixel value to represent the true built-up area represented by each 100 m resolution pixel.
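The weighted aggregation described above can be sketched as follows. This is a simplified illustration, not the Google Earth Engine implementation: each pixel is assumed to arrive with a precomputed fraction of its area inside the cell, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def aggregate_population(pixels: Iterable[Tuple[float, float]]) -> float:
    """Sum population counts, weighting each pixel by its fraction inside the cell.

    pixels: (population_count, fraction_inside_cell) pairs.
    """
    return sum(count * frac for count, frac in pixels)


def aggregate_built_up_km2(pixels: Iterable[Tuple[float, float, float]]) -> float:
    """Sum built-up area, converting percentage pixels to km^2 before weighting.

    pixels: (built_up_percent, pixel_area_km2, fraction_inside_cell) triples.
    """
    total = 0.0
    for percent, pixel_area_km2, frac in pixels:
        total += (percent / 100.0) * pixel_area_km2 * frac
    return total


# Two 100 m pixels (0.01 km^2 each): one fully inside, one half inside the cell.
built_up = aggregate_built_up_km2([(50.0, 0.01, 1.0), (100.0, 0.01, 0.5)])
population = aggregate_population([(40.0, 1.0), (10.0, 0.5)])
print(built_up, population)
```

The difference between the two functions mirrors the text: population pixels are already counts, whereas built-up pixels are percentages that must be multiplied by pixel area first.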

SSP-derived data aggregation

The SSP-derived population PPLN_X,Y and BU_X,Y for Y equal to 2020 were spatially harmonised to the FN grid by mapping the values from spatially harmonised PPLN20 and BU20 datasets derived in the previous steps. This aids in first providing a common base year value for estimation of future aggregated rooftop areas per FN grid cell and second removes any mismatch of data points and data values between the base datasets and SSP-derived datasets. The mismatch between the data points occurred due to PPLN_X,2020 and BU_X,2020 using exogenous methodologies and frameworks to estimate the values in their respective datasets. As an example, the BU_X,2020 dataset points depicting the presence of built-up area was derived from a model that uses the GHSL³³ layer from JRC for the year 2015 thereby not incorporating some newly developed areas in east China (Fig. 5). Additionally, the mismatch between data values can occur when for an FN grid cell BU_X,2020 layer either under or over-represents the value depicted by the BU20 dataset. As a result of these mismatches, for a BU20 layer’s global aggregated built-up area of 1.46 million km², the BU_X,2020 layer only represents 0.98 million km² of global aggregated built-up area. This highlights the importance of harmonising the datasets both at a common temporal and spatial scale.

After harmonising the PPLN_X,2020 and BU_X,2020 datasets for each of the SSP scenarios, the future datapoint and data values per FN grid cell of the respective datasets were recalculated using the following:

$$PPLN_{X,Y}=\left(PPLN_{X,Y}^{*}-PPLN_{X,2020}\right)+PPLN20 \tag{3.1}$$

$$BU_{X,Y}=\left(BU_{X,Y}^{*}-BU_{X,2020}\right)+BU20 \tag{3.2}$$

where, for each unique FN grid cell, X is the SSP scenario, Y is the year for which datapoint and value are calculated, PPLN20 is the base year population count and BU20 is the base year built-up area. The (*) nomenclature depicts future metrics before recalculation. This effectively captures the absolute growth in the metrics per FN grid cell over the harmonised base datasets. For GDP value per FN grid cell, we devised population-weighted down mapping of country-level GDP value using the following:

$$GDP_{X,Y}=\frac{GDP_{C,X,Y}}{PPLN_{C,X,Y}}\times PPLN_{X,Y} \tag{3.3}$$

where, for each unique FN grid cell, X is the SSP scenario, Y is the year for which datapoint and value are calculated, and C is the country for which aggregated metrics are calculated at the country level. This GDP downscaling methodology creates a new feature layer representing GDP-weighted population count per FN grid cell for training our ML model discussed in the next section. Finally, we create the population density layers for both base year datasets and SSP-derived datasets using the following:

$$PD20=\frac{PPLN20}{FN_{Area}} \tag{3.4}$$

$$PPLND_{X,Y}=\frac{PPLN_{X,Y}}{FN_{Area}} \tag{3.5}$$

where, for each unique FN grid cell, X is the SSP scenario, Y is the year for which the datapoint and data value are calculated and FN_Area is the geodesic area occupied by the FN grid cell.
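As a worked illustration of Eqs. 3.1 to 3.5 for a single FN grid cell, the sketch below applies the rebasing, the population-weighted GDP downscaling and the density calculation to made-up numbers. The function names and all input values are hypothetical and are not taken from the dataset.

```python
def rebase(future_raw: float, ssp_2020: float, base_2020: float) -> float:
    """Eqs. 3.1 and 3.2: add the SSP growth since 2020 to the harmonised base year value."""
    return (future_raw - ssp_2020) + base_2020


def downscale_gdp(gdp_country: float, ppln_country: float, ppln_cell: float) -> float:
    """Eq. 3.3: country GDP per capita multiplied by the cell population."""
    return gdp_country / ppln_country * ppln_cell


def density(ppln_cell: float, fn_area_km2: float) -> float:
    """Eqs. 3.4 and 3.5: population per km^2 of the FN grid cell."""
    return ppln_cell / fn_area_km2


# Illustrative values only (assumptions, not from the study).
ppln_2030 = rebase(future_raw=1200.0, ssp_2020=1000.0, base_2020=900.0)
gdp_2030 = downscale_gdp(gdp_country=1.0e9, ppln_country=1.0e6, ppln_cell=ppln_2030)
ppln_density_2030 = density(ppln_2030, fn_area_km2=200.0)
print(ppln_2030, gdp_2030, ppln_density_2030)
```

Here the SSP layer projects 200 additional people between 2020 and 2030, so the rebased 2030 value is the harmonised base of 900 plus 200, regardless of the SSP layer's own 2020 value.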

Machine learning model

We designed an ML-based framework built on the XGBoost ML model³⁴ to estimate aggregated rooftop area per FN grid cell. The ML framework accomplishes two tasks: first, extracting the FN grid cells that have complete building footprint polygon mapping from the BF20_OSM layer derived from the OSM global building footprint dataset, and second, estimating the aggregated rooftop area per sample FN grid cell. The flow of data and steps involved in the development of the ML framework are shown in Fig. 6.

Training M1 model

We start the development of the ML framework by extracting sample FN grid cells from the base year datasets. The FN grid cells that have complete coverage for PD20, BU20, RL20 and BF20 datasets are selected as sample FN grid cells and the extracted sample layers are named here as PD_S20, BU_S20, RL_S20 and BF_S20 respectively. The PD_S20, BU_S20, and RL_S20 sample FN grid cells then act as independent variables with BF_S20 acting as the dependent variable for the M1 model. The M1 model is then trained by using a 10-fold cross-validation strategy and 1000 hyper-tuning iterations. The 10-fold cross-validation strategy enables the use of a complete input dataset for training purposes and aids in reducing the problem of overfitting in conjunction with 1000 rounds of hyper-tuning iterations. The trained M1 model then accepts PD20, BU20, and RL20 layers as drivers to estimate the aggregated gross rooftop area for all the global FN grid cells, BF_FN20 layer.
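The 10-fold cross-validation strategy can be sketched in a few lines. To keep the example dependency-free, a one-parameter least-squares ratio model stands in for XGBoost, and hyperparameter tuning is omitted entirely; `k_fold_indices`, `fit_ratio` and `cross_validate` are hypothetical names and the data are synthetic, so this shows the validation scheme only, not the authors' M1 model.

```python
import random
from statistics import mean
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def fit_ratio(x: Sequence[float], y: Sequence[float]) -> float:
    """Stand-in model (assumption): least-squares slope through the origin, y ~ a * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def cross_validate(x: Sequence[float], y: Sequence[float], k: int = 10) -> float:
    """Mean absolute error averaged over the k held-out folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(x), k):
        slope = fit_ratio([x[i] for i in train], [y[i] for i in train])
        fold_errors.append(mean(abs(slope * x[i] - y[i]) for i in test))
    return mean(fold_errors)


# Synthetic sample: rooftop area as a fixed share of built-up area (illustrative only).
built_up_area = [float(v) for v in range(1, 51)]
rooftop_area = [0.4 * v for v in built_up_area]
print(cross_validate(built_up_area, rooftop_area))
```

Because each fold is held out once while the other nine train the model, every sample contributes to both training and evaluation, which is the property the text relies on.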

Extraction of OSM samples

At this stage, we have a global estimate of rooftop area for the year 2020, which we then use to extract samples from the BF20_OSM layer. For this, we compare the values of the BF_FN20 and BF20_OSM layers at the FN level. The FN grid cells where the ratio between BF20_OSM and BF_FN20 is between 0.9 and 1.1, i.e., where BF20_OSM values are 90–110% of BF_FN20 values, are selected for their completeness of building footprint mapping and extracted as the BF_OSM20 sample layer. This comparison between M1 model predicted values and OSM-derived values also lends itself to the development of an OSM Gap detection tool, which we discuss further in Usage Notes.
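The selection rule above amounts to a ratio filter per FN grid cell. A minimal sketch follows, assuming both layers are available as dictionaries keyed by FN_ID and treating the 0.9 and 1.1 bounds as inclusive (the text does not specify this); `select_complete_cells` is a hypothetical name.

```python
from typing import Dict, List


def select_complete_cells(bf_osm: Dict[int, float], bf_pred: Dict[int, float],
                          lo: float = 0.9, hi: float = 1.1) -> List[int]:
    """Return FN_IDs whose OSM footprint area is within [lo, hi] times the M1 estimate."""
    selected = []
    for fn_id, osm_area in bf_osm.items():
        predicted = bf_pred.get(fn_id)
        if predicted is None or predicted <= 0.0:
            continue  # no usable estimate to compare against
        if lo <= osm_area / predicted <= hi:
            selected.append(fn_id)
    return selected


# Illustrative areas in km^2 (assumptions, not from the dataset).
osm_areas = {1: 0.95, 2: 0.50, 3: 1.20, 4: 1.00}
m1_estimates = {1: 1.00, 2: 1.00, 3: 1.00, 4: 0.00}
print(select_complete_cells(osm_areas, m1_estimates))
```

Cells with a ratio well below 0.9, such as cell 2 here, are the ones a gap detection tool would flag as likely under-mapped in OSM.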

Training M2 model

After tuning, training, and inferencing of the BF_OSM20 layer from the M1 model, we shift our focus to the M2 model, which enables the estimation of global gross aggregated rooftop area per FN grid cell for the SSP narratives. For this, we combine the BF_S20 samples from the base year dataset with the BF_OSM20 samples. We also resample the PD20, BU20 and GDP_X,Y layers to collect samples based on FN grid cells covering our combined building footprint samples to generate the PD_S,OSM20, BU_S,OSM20 and GDP_S,OSM,2,2020 layers. The GDP_S,OSM,2,2020 layer here represents population-based downscaled GDP per sample FN grid cell for samples covering base year and OSM-derived building footprint FN grid cells for the SSP2 narrative and the year 2020. The PD_S,OSM20, BU_S,OSM20, and GDP_S,OSM,2,2020 sample FN grid cells then act as independent variables, with BF_S20 and BF_OSM20 acting as dependent variables for the M2 model. The final sample FN grid cells used in our study are shown in Fig. 7 with building footprint attributes recorded in Table 2.

Table 2 Attribute of building footprint samples used for model training.

The M2 model is trained by using a 10-fold cross-validation strategy and 1000 hyper-tuning iterations. At the conclusion of this step, we have our final M2 model which then accepts PPLND_X,Y, BU_X,Y and GDP_X,Y layers as drivers to estimate a global BF_X,Y layer for five SSP narratives and years ranging from 2020–2050. The final BF_X,Y layer is stored as GeoPackage files having 1/8 degree FN grid cell resolution with a value representing the aggregated gross rooftop area inside the FN grid cell for further analysis, Fig. 8.

Although the trained M1 model in conjunction with SSP-derived drivers can aid in the generation of the final BF_X,Y layer, we could not implement this as RL20 layer data is only available for the base year of 2020 and multivariate regression would be required to estimate its value beyond 2020 which would add an extra layer of uncertainty in our results. Additionally, the selection of BU_S,OSM20 and the merger of this layer with BF20 layer provided us with additional global data points to retrain a new model M2 which would be more compliant with global trends rather than just the countries/regions covered by BF20 dataset.

Data Records

The high-resolution datasets generated in this study contain 3,216,960 individual Fishnet tiles with 1/8 degree spatial resolution, spanning the entire globe. The main datasets along with additional files are hosted and referenced on Zenodo³⁵ (https://doi.org/10.5281/zenodo.11085013). The dataset covers all countries; Antarctica is excluded. Selected regional outputs of the study are shown in Fig. 9. To enable easy integration in the workflows, we have provided the main datasets in the following formats:

1) Vector dataset: The global gross estimated rooftop area per FN grid cell for each SSP narrative is provided as a GeoPackage (.gpkg) file (Results_Vis.gpkg) with polygon geometries at 1/8-degree spatial resolution in an EPSG:4326 coordinate system. The attribute table of this file contains an FN_ID column representing the FN grid cell ID, and other columns representing the FN_ID-specific assessed rooftop area. The assessed gross rooftop area columns are named BF_X_Y, where X takes the values 1, 2, 3, 4, and 5 for the SSP1, SSP2, SSP3, SSP4, and SSP5 narratives and Y is the assessment year with values 20, 30, 40, and 50 for the years 2020, 2030, 2040, and 2050; values are in km². In addition, a CF column is added for each FN_ID entry that documents the Capacity Factor for rooftop solar PV based on the World Bank solar atlas³⁶.
2) Raster datasets: The global gross estimated rooftop area per FN grid cell for each SSP narrative is provided as GeoTIFF (.tif) files with LZW compression in an EPSG:4326 coordinate system. The assessed gross rooftop area datasets are named BF_X_Y, where X takes the values 1, 2, 3, 4, and 5 for the SSP1, SSP2, SSP3, SSP4, and SSP5 narratives and Y is the assessment year with values 20, 30, 40, and 50 for the years 2020, 2030, 2040, and 2050; values are in km².
3) Numerical dataset: The global gross estimated rooftop area per FN grid cell for each SSP narrative is provided as a Parquet (.parquet) file (Results.parquet). This file contains an FN_ID column representing the FN grid cell ID, and other columns representing the FN_ID-specific assessed rooftop area. The assessed gross rooftop area columns are named BF_X_Y, where X takes the values 1, 2, 3, 4, and 5 for the SSP1, SSP2, SSP3, SSP4, and SSP5 narratives and Y is the assessment year with values 20, 30, 40, and 50 for the years 2020, 2030, 2040, and 2050; values are in km². In addition, a CF column is added for each FN_ID entry that documents the Capacity Factor for rooftop solar PV based on the World Bank solar atlas.
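The BF_X_Y naming convention can be generated programmatically when selecting scenarios. The helper below is a small sketch based only on the naming rule stated above; `column_name` and `ALL_COLUMNS` are hypothetical names, not part of the published files.

```python
SSPS = (1, 2, 3, 4, 5)
YEARS = (2020, 2030, 2040, 2050)


def column_name(ssp: int, year: int) -> str:
    """Build the BF_X_Y column name for an SSP narrative and assessment year."""
    if ssp not in SSPS:
        raise ValueError(f"SSP must be one of {SSPS}, got {ssp}")
    if year not in YEARS:
        raise ValueError(f"year must be one of {YEARS}, got {year}")
    return f"BF_{ssp}_{year % 100}"


ALL_COLUMNS = [column_name(s, y) for s in SSPS for y in YEARS]
print(column_name(2, 2030), len(ALL_COLUMNS))
```

With pandas and a Parquet engine installed, one way to load a single scenario would be `pandas.read_parquet("Results.parquet", columns=["FN_ID", column_name(2, 2030)])`, assuming the file sits in the working directory.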

In addition to the main datasets, we have provided additional files to enable generating the vector and numerical datasets from this study:

1) M2_Model.json: This file contains the frozen parameters of the M2 model in .json format generated from XGBoost version 2.0.3.
2) SSP_drivers.parquet: This file contains the driver data used for generating the main dataset in our study.
3) FN_MAP.parquet: This file contains the boundary information for each fishnet grid tile in a Well Known Text (WKT) format.
4) Prediction.ipynb: This file provides a Python notebook interface to generate inferences from M2_Model.json using the SSP_drivers.parquet file. In addition, this file also generates the numerical dataset and converts it into the vector dataset using the FN_MAP.parquet file.
5) environment.yaml: This file contains the frozen configuration of the Python virtual environment used to generate the results presented in this study.

Technical Validation

Input validation

The datasets presented in this study have undergone end-to-end technical validation for the base year of 2020. The validation is performed for M1 and M2 model inputs, the performance of M1 and M2 models, the validity of outputs of M1 and M2 models and finally verification of estimations generated by the M2 model. For datasets covering the years 2030–2050, we could not provide a true verification of data validity as they represent the future, but the high accuracy of the 2020 data suggests strong model veracity, which provides high confidence in these outputs. The input validation of the base year datasets and SSP-derived drivers is presented in Table 3 as links to the validation reports generated by either the data providers or the peer-reviewed publications which form the basis of the data. Due to the scale of the dataset, the assumptions and the limitations of the methods used, the big datasets used in this study are expected to have errors at a higher resolution when verified at a per-building level, but at an aggregated country/regional spatial resolution these datasets have shown acceptable performance.

Table 3 Input data validation.

Model validation on sample FN tiles

The learning accuracy of the M1 and M2 models is determined by the significance of the correlation between the dependent and independent variables used to train the model. Further, a 10-fold cross-validation strategy to expose the models to various combinations of input data to reduce model overfitting was used. Additionally, the distribution of model output with respect to the dependent variables and the spread of the errors were evaluated to choose the best model. It was observed that the M2 model has a slight tendency to underestimate ground truth.

The final output of the M2 model (BF_X,Y) was further evaluated for discrepancies between aggregated country-wise input base year big data derived BF20 values and the aggregated country-wise M2 model's estimated outputs for the SSP2 narrative in the year 2020 (BF_2,2020). These evaluations were conducted by aggregating the FN grid cell values for those FN grid cells that fall within the geographic boundaries of the country being evaluated. Overall, we observed high fidelity between the ground truth and estimated values at a country level. At a higher spatial resolution, we also compared the sub-national level estimations for the USA based on ASHRAE USA Climatic regions. Here, too, high fidelity was observed between ground truth and predicted values. Figures 10, 11 and Table 4 document the results of these checks.

Table 4 Result comparison of M2 Model’s output on seen training data.

Result validation on unseen datasets

After verifying the M2 model’s output (BF_X,Y) on seen/training data, further validations were performed on unseen datasets. Here we compared our results (BF_2,2020), i.e. the M2 model’s output for SSP2 and the year 2020, with the EUBUCCO v0.1^{4,37} dataset for selected countries that had full data availability in EUBUCCO v0.1. The countries are Spain, France, Netherlands, Denmark, Finland, Estonia, Lithuania, Slovakia, Slovenia, Switzerland, Germany, and Luxembourg. For this, we first masked the EUBUCCO v0.1 dataset with the built-up layer in 2020 (BU20) and then mapped the resulting building footprints onto the FN grid, followed by aggregation of building footprint geometry within each FN grid tile. The second set of validation at the sub-national level was performed for the cities of Kansas, Singapore, and Sydney. Overall, we found that the results of the M2 model are within expected error ranges when compared with unseen data that is not exposed to the M2 model during training. This way, we could validate our results to a high degree of certainty by comparing results at sub-national and national spatial levels. Table 5 along with Fig. 12 documents the findings of the validations performed on unseen datasets.

Table 5 Result comparison of M2 Model’s output on seen and unseen data.

Usage Notes

Limitations

The aggregated rooftop area dataset was generated with an assumption of one-to-one mapping between the building footprint and the rooftop area. Although some building archetypes can have a larger rooftop area than building footprint due to the presence of rooftop superstructures¹⁴, we have not considered this due to the scale of the analysis which looks at global region of interest rather than per building. Similarly in higher latitudes due to the slope of the rooftops, the total building rooftop area can be higher than the building footprint area. Hence, it is advised to use region-specific rooftop attribute values when using these datasets for city-level analysis. Additionally, due to the nature of the ML model used for the estimation of rooftop area, we recommend an error margin of ± 0.1 km² per FN grid cell. Considering the global scope of this study, we assume medium term (2020–2050) stationarity of spatiotemporal patterns learned by M2 model which limits the future projection of gross rooftop area. To mitigate the assumption of spatiotemporal stationarity, we have incorporated five different growth pathways in the form of SSPs that act as a proxy of different urban planning paradigms, thus allowing for an integrated assessment with various other factors e.g. climate change, energy systems etc. Finally, the training data to drive M2 Model is partially biased towards developed nations with only African countries and some samples from Open Street Maps providing training data for emerging economies. This imbalance in training data has manifested itself as the slight tendency for underestimation of gross rooftop area for high-density cities and conurbations.

Application to energy system/integrated assessment modelling

We foresee that the datasets generated in this study will be of immediate use to the energy system/integrated assessment modelling community for assessing rooftop solar PV/solar thermal technical potential^6,38,39 and for building-side energy systems modelling^40,41,42. For energy justice⁴³ and energy accessibility studies⁴⁴, the datasets can provide valuable information on urban growth dynamics and support the calibration of building stock models. For example, in technical potential assessment studies⁴⁵, users can assume that rooftops are flat, with solar panels placed at the latitude-specific optimal angle. Users can also assume that the entire estimated rooftop area is fully covered by solar panels and that the panels are free of shading. Under these assumptions, our dataset represents the best-case scenario for technical potential generation. In the wider literature, a rooftop availability factor of 0.3 is used to convert gross rooftop area to net rooftop area, accounting for rooftops that are unsuitable because of the orientation and slope attributes of the building stock. We recommend that users of this dataset apply region-specific rooftop availability factors where known; otherwise, a factor of 0.3 can be used for more practical results. The net rooftop area can then be converted directly into monthly technical potentials using high-resolution solar irradiance datasets, e.g. NASA MERRA-2⁴⁶ (Fig. 13).
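The conversion from gross rooftop area to a monthly technical potential can be sketched as follows. Only the availability factor of 0.3 comes from the text; the module efficiency, the performance ratio, the irradiation value and the function names are illustrative assumptions and should be replaced with region-specific values.

```python
DEFAULT_AVAILABILITY = 0.3  # rooftop availability factor quoted in the text


def net_rooftop_area_km2(gross_km2, availability=DEFAULT_AVAILABILITY):
    """Convert gross rooftop area to net (usable) rooftop area."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return gross_km2 * availability


def monthly_potential_gwh(net_km2, irradiation_kwh_per_m2,
                          efficiency=0.20, performance_ratio=0.80):
    """Monthly generation in GWh for a given net area and monthly irradiation.

    efficiency and performance_ratio are assumed values, not from the dataset.
    """
    area_m2 = net_km2 * 1_000_000
    energy_kwh = area_m2 * irradiation_kwh_per_m2 * efficiency * performance_ratio
    return energy_kwh / 1_000_000


gross = 2.0         # hypothetical gross rooftop area of one FN cell, km^2
irradiation = 150   # hypothetical monthly irradiation, kWh per m^2
net = net_rooftop_area_km2(gross)
print(round(net, 2), round(monthly_potential_gwh(net, irradiation), 2))
```

Setting the availability factor to 1.0 reproduces the best-case scenario described above, in which the entire gross rooftop area is covered by panels.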

Application to analyse OSM spatial data completeness

OpenStreetMap (OSM)-derived data is used in many studies as a source of ground truth mapping and for the calibration of big data models. Additionally, raw OSM data in the form of building polygons and road mapping is used extensively in resource accessibility studies and vulnerability mapping⁴⁷. A primary reason for the uptake of OSM data is its free accessibility and the presence of more than a million active users who update the digital planet files on an hourly basis. Although the quantity of data in the OSM database is vast, studies using it must often do significant pre-processing to extract data suitable for their use case. Additionally, users of OSM data struggle with the lack of validation studies on OSM datasets.

For data attributes dealing with global roads, one study⁴⁸ highlights that the OSM global road dataset is more than 80% complete. Similar studies for global building footprint datasets are currently limited to either country-level studies (https://github.com/thinkingmachines/osm-completeness) or regional studies (https://github.com/hotosm/osm-analytics). As an application of the output of our M1 model, we overlaid our predicted gross rooftop area, mapped to the FN grid for the year 2020 and the SSP2 growth narrative, on the building footprint polygon planet dataset from OSM to estimate the completeness of the OSM dataset. To quantify the completeness, we calculated the percentage difference between the gross rooftop area assessed in our study and the gross building footprint area calculated from OSM and mapped to the FN grid. The base dataset for the OSM comparison was procured in August 2021.

In the final output of this analysis, a value of 0 indicates that either OSM data is missing or data cannot exist at that FN grid cell. A value of 1 indicates that OSM dataset coverage is 100% in that FN grid cell. Any value in the range 0.9–1.5 was considered to represent 100% completeness of the OSM dataset, as our M1 model under- or over-predicts in some regions depending on driver metrics. A value greater than 1.5 is representative of regions in OSM that may not have a population presence but do have OSM building polygon tags, e.g. greenhouses or industrial complexes around major shipping ports. Since our M1 model relies on population as an important driver, in FN grid cells with a completeness value greater than 1.5 our model gives a lower value than the OSM dataset. Another reason can be the wrong tag being assigned to building polygons, or the misclassification of non-building built-up structures as building polygons in the OSM dataset. An example of a completeness value dataset is shown for Europe in Fig. 14a, with example cases of completeness value greater than 1 shown in Fig. 14b–d. A similar automated analysis can be conducted for a global dataset to quantify the completeness of the OSM dataset and to direct the crowdsourced mapping of buildings to areas that are under-mapped.
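The interpretation of the completeness value can be sketched per FN grid cell as below. The sketch assumes the value is the ratio of the OSM footprint area to the modelled gross rooftop area, which is consistent with the thresholds described above but is not stated explicitly; the class labels, the handling of cells with no modelled area, and the function names are hypothetical.

```python
def completeness_value(osm_km2, model_km2):
    """Ratio of OSM footprint area to modelled rooftop area for one cell."""
    if model_km2 <= 0.0:
        return None  # no modelled rooftop area, so the ratio is undefined
    return osm_km2 / model_km2


def classify(value):
    """Map a completeness value to the classes described in the text."""
    if value is None:
        return "undefined"
    if value == 0.0:
        return "missing or not applicable"
    if value < 0.9:
        return "incomplete"
    if value <= 1.5:
        return "complete"
    return "osm exceeds model"


# Hypothetical cells: (OSM footprint area, modelled rooftop area), both in km^2.
cells = [(0.0, 0.4), (0.2, 0.4), (0.4, 0.4), (0.9, 0.4)]
for osm_area, model_area in cells:
    value = completeness_value(osm_area, model_area)
    print(osm_area, model_area, value, classify(value))
```

Cells in the last class are candidates for manual inspection, since they may contain greenhouses, port infrastructure or mis-tagged polygons rather than genuinely over-mapped buildings.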

Pseudocodes

Algorithm 1

Data Collection and pre-processing.

Algorithm 2

XGBoost Model Training and Estimation (Model M1).

Algorithm 3

Preparing training Data for Model M2.

Algorithm 4

XGBoost Model Training and Estimation (Model M2).

Code availability

We have documented the pseudocodes that support the methodology of this study within the Data Descriptor. The code used for inference, along with the XGBoost model generated in this study, is hosted at Zenodo (https://doi.org/10.5281/zenodo.11085013).

References

Cabeza, L. F. et al. 2022: Buildings. In IPCC, 2022: Climate Change 2022: Mitigation of Climate Change. Contribution of Working Group III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Shukla, P. R. et al. (eds.)]. Cambridge University Press, Cambridge, UK and New York, NY, USA., https://doi.org/10.1017/9781009157926.011 (2022).
World Population Prospects 2022: Ten Key Messages., (United Nations, Department of Economic and Social Affairs, Population Division., 2022).
Mohammadiziazi, R. & Bilec, M. M. Building material stock analysis is critical for effective circular economy strategies: a comprehensive review. Environmental Research: Infrastructure and Sustainability 2, 032001, https://doi.org/10.1088/2634-4505/ac6d08 (2022).
Milojevic-Dupont, N. et al. EUBUCCO v0.1: European building stock characteristics in a common and open database for 200+ million individual buildings. Scientific Data 10, 147, https://doi.org/10.1038/s41597-023-02040-2 (2023).
Hamaina, R., Leduc, T. & Moreau, G. in Bridging the Geographic Information Sciences: International AGILE’2012 Conference, Avignon (France), April, 24–27, 2012 (eds Jérôme Gensel, Didier Josselin, & Danny Vandenbroucke) 327–346 (Springer Berlin Heidelberg, 2012).
Joshi, S. et al. High resolution global spatiotemporal assessment of rooftop solar photovoltaics potential for renewable electricity generation. Nat Commun 12, 5738, https://doi.org/10.1038/s41467-021-25720-2 (2021).
Jing, R. et al. Unlock the hidden potential of urban rooftop agrivoltaics energy-food-nexus. Energy 256, 124626, https://doi.org/10.1016/j.energy.2022.124626 (2022).
Giardina, G. et al. Combining remote sensing techniques and field surveys for post-earthquake reconnaissance missions. Bulletin of Earthquake Engineering https://doi.org/10.1007/s10518-023-01716-9 (2023).
Aimaiti, Y., Sanon, C., Koch, M., Baise, L. G. & Moaveni, B. War Related Building Damage Assessment in Kyiv, Ukraine, Using Sentinel-1 Radar and Sentinel-2 Optical Images. Remote Sensing 14 (2022).
Hoogwijk, M. M. On the global and regional potential of renewable energy sources. (2004).
Izquierdo, S., Rodrigues, M. & Fueyo, N. A method for estimating the geographical distribution of the available roof surface area for large-scale photovoltaic energy-potential evaluations. Solar Energy 82, 929–939, https://doi.org/10.1016/j.solener.2008.03.007 (2008).
IEA. Energy Technology Perspectives 2016: Towards Sustainable Urban Energy systems. Report No. 9789264252332, (2016).
Korfiati, A. et al. Estimation of the global solar energy potential and photovoltaic cost with the use of open data. International Journal of Sustainable Energy Planning and Management 9, 17–29, https://doi.org/10.5278/ijsepm.2016.9.3 (2016).
Jacobson, M. Z. et al. 100% Clean and Renewable Wind, Water, and Sunlight All-Sector Energy Roadmaps for 139 Countries of the World. Joule https://doi.org/10.1016/j.joule.2017.07.005 (2017).
Castellanos, S., Sunter, D. A. & Kammen, D. M. Rooftop solar photovoltaic potential in cities: How scalable are assessment approaches? Environmental Research Letters https://doi.org/10.1088/1748-9326/aa7857 (2017).
Rottensteiner, F. & Briese, C. A new method for building extraction in urban areas from high-resolution LIDAR data. International Archives of Photogrammetry and Remote Sensing, (2002).
Maloof, M. A., Langley, P., Binford, T. O., Nevatia, R. & Sage, S. Improved Rooftop Detection in Aerial Images with Machine Learning. Machine Learning 53, 157–191, https://doi.org/10.1023/A:1025623527461 (2003).
Gagnon, P., Margolis, R., Melius, J., Philips, C. & Elmore, R. Rooftop Solar Photovoltaic Technical Potential in the United States: A Detailed Assessment. (2016).
Assouline, D., Mohajeri, N. & Scartezzini, J. L. Quantifying rooftop photovoltaic solar energy potential: A machine learning approach. Solar Energy 141, 278–296, https://doi.org/10.1016/j.solener.2016.11.045 (2017).
Sirko, W. et al. Continental-Scale Building Detection from High Resolution Satellite Imagery. 1–15 (2021).
Yang, H. L., Lunga, D. & Yuan, J. in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). 870–873.
Gernaat, D. E. H. J., de Boer, H. S., Dammeier, L. C. & van Vuuren, D. P. The role of residential rooftop photovoltaic in long-term energy and climate scenarios. Applied Energy https://doi.org/10.1016/j.apenergy.2020.115705 (2020).
Bódis, K., Kougias, I., Jäger-Waldau, A., Taylor, N. & Szabó, S. A high-resolution geospatial assessment of the rooftop solar photovoltaic potential in the European Union. Renewable and Sustainable Energy Reviews https://doi.org/10.1016/j.rser.2019.109309 (2019).
Dellink, R., Chateau, J., Lanzi, E. & Magné, B. Long-term economic growth projections in the Shared Socioeconomic Pathways. Global Environmental Change 42, 200–214, https://doi.org/10.1016/j.gloenvcha.2015.06.004 (2017).
Leasure, D. D. C. B. M. T. A. & WorldPop. peanutButter: An R package to produce rapid-response gridded population estimates from building footprints, version 0.2.1. https://doi.org/10.5258/SOTON/WP00678 (2020).
Lloyd, C. T., Sorichetta, A. & Tatem, A. J. Data Descriptor: High resolution global gridded data for use in population studies. Scientific Data https://doi.org/10.1038/sdata.2017.1 (2017).
Buchhorn, M. et al. Copernicus Global Land Service: Land Cover 100 m: collection 3: epoch 2019: Globe. Zenodo https://doi.org/10.5281/zenodo.3939050 (2020).
Riahi, K. et al. The Shared Socioeconomic Pathways and their energy, land use, and greenhouse gas emissions implications: An overview. Global Environmental Change 42, 153–168, https://doi.org/10.1016/j.gloenvcha.2016.05.009 (2017).
Kc, S. & Lutz, W. The human core of the shared socioeconomic pathways: Population scenarios by age, sex and level of education for all countries to 2100. Global Environmental Change 42, 181–192, https://doi.org/10.1016/j.gloenvcha.2014.06.004 (2017).
Gao, J. & O’Neill, B. C. Mapping global urban land for the 21st century with data-driven simulations and Shared Socioeconomic Pathways. Nature Communications 11, 1–12, https://doi.org/10.1038/s41467-020-15788-7 (2020).
OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org (2021).
Gorelick, N. et al. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment https://doi.org/10.1016/j.rse.2017.06.031 (2017).
Corbane, C. et al. Automated global delineation of human settlements from 40 years of Landsat satellite data archives. Big Earth Data https://doi.org/10.1080/20964471.2019.1625528 (2019).
Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. https://doi.org/10.1145/2939672.2939785 (2016).
Joshi, S. et al. Global high-resolution growth projections dataset for rooftop area consistent with the shared socioeconomic pathways, 2020-2050. Zenodo https://doi.org/10.5281/zenodo.11085013 (2024).
[Data/information/map] obtained from the “Global Solar Atlas 2.0, a free, web-based application is developed and operated by the company Solargis s.r.o. on behalf of the World Bank Group, utilizing Solargis data, with funding provided by the Energy Sector.
Milojevic-Dupont, N. et al. EUBUCCO (v0.1) [Data set]. Zenodo https://doi.org/10.5281/ZENODO.7225259 (2022).
Creutzig, F. et al. The underestimated potential of solar energy to mitigate climate change. Nature Energy https://doi.org/10.1038/nenergy.2017.140 (2017).
Victoria, M. et al. Solar photovoltaics is ready to power a sustainable future. Joule 5, https://doi.org/10.1016/j.joule.2021.03.005 (2021).
Mastrucci, A., Marvuglia, A., Benetto, E. & Leopold, U. A spatio-temporal life cycle assessment framework for building renovation scenarios at the urban scale. Renewable and Sustainable Energy Reviews 126, 109834, https://doi.org/10.1016/j.rser.2020.109834 (2020).
Nutkiewicz, A., Mastrucci, A., Rao, N. D. & Jain, R. K. Cool roofs can mitigate cooling energy demand for informal settlement dwellers. Renewable and Sustainable Energy Reviews 159, 112183, https://doi.org/10.1016/j.rser.2022.112183 (2022).
Eker, S., Mastrucci, A., Pachauri, S. & van Ruijven, B. Social media data shed light on air-conditioning interest of heat-vulnerable regions and sociodemographic groups. One Earth 6, 428–440, https://doi.org/10.1016/j.oneear.2023.03.011 (2023).
McCallum, I. et al. Estimating global economic well-being with unlit settlements. Nature Communications 13, 2459, https://doi.org/10.1038/s41467-022-30099-9 (2022).
Moner-Girona, M., Kakoulaki, G., Falchetta, G., Weiss, D. J. & Taylor, N. Achieving universal electrification of rural healthcare facilities in sub-Saharan Africa with decentralized renewable energy technologies. Joule 5, https://doi.org/10.1016/j.joule.2021.09.010 (2021).
Wang, Z., Arlt, M.-L., Zanocco, C., Majumdar, A. & Rajagopal, R. DeepSolar++: Understanding residential solar adoption trajectories with computer vision and technology diffusion models. Joule https://doi.org/10.1016/j.joule.2022.09.011 (2022).
Gelaro, R. et al. The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Journal of Climate 30, 5419–5454, https://doi.org/10.1175/JCLI-D-16-0758.1 (2017).
Herfort, B., Lautenbach, S., Porto de Albuquerque, J., Anderson, J. & Zipf, A. The evolution of humanitarian mapping within the OpenStreetMap community. Scientific Reports 11, 1–15, https://doi.org/10.1038/s41598-021-82404-z (2021).
Barrington-Leigh, C. & Millard-Ball, A. The world’s user-generated road map is more than 80% complete. PLoS ONE https://doi.org/10.1371/journal.pone.0180698 (2017).
Dask Development Team. (2016).


Acknowledgements

S.J. acknowledges that part of the research was developed in the Young Scientists Summer Program at the International Institute for Applied Systems Analysis, Laxenburg (Austria). S.J., B.O.G. and J.G. are supported by a research grant from Science Foundation Ireland (SFI) and the National Natural Science Foundation of China (NSFC) under the SFI-NSFC Partnership Programme Grant Number 17/NSFC/5181. S.M. acknowledges support from the Horizon Europe R&I programme project DIAMOND (grant no. 101081179). B.Z., A.M. and V.K. acknowledge funding from the Horizon Europe Research and Innovative Action Programme under Grant Agreement No. 101056810 (CircEUlar).

Author information

Authors and Affiliations

SFI MaREI Centre for Energy Climate and Marine, Cork, Ireland
Siddharth Joshi, Brian O’Gallachoir & James Glynn
Environmental Research Institute, University College Cork, Cork, Ireland
Siddharth Joshi, Paul Holloway & Brian O’Gallachoir
School of Engineering, University College Cork, Cork, Ireland
Siddharth Joshi & Brian O’Gallachoir
Energy, Climate, and Environment Program, International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria
Siddharth Joshi, Behnam Zakeri, Alessio Mastrucci & Volker Krey
Institute for Data, Energy, and Sustainability (IDEaS), Department of Information Systems and Operations Management, Vienna University of Economics and Business (WU), Vienna, Austria
Behnam Zakeri
Grantham Institute – Climate Change and the Environment, Imperial College London, London, UK
Shivika Mittal
CICERO Center for International Climate Research, Oslo, Norway
Shivika Mittal
Department of Geography, University College Cork, Cork, Ireland
Paul Holloway
Industrial Ecology Programme and Energy Transitions Initiative, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Volker Krey
Global Centre for Environment and Energy, Ahmedabad University, Ahmedabad, India
Priyadarshi Ramprasad Shukla
Center on Global Energy Policy, Columbia University, New York, USA
James Glynn
Energy Systems Modelling Analytics, Galway, Ireland
James Glynn

Authors

Siddharth Joshi
Behnam Zakeri
Shivika Mittal
Alessio Mastrucci
Paul Holloway
Volker Krey
Priyadarshi Ramprasad Shukla
Brian O’Gallachoir
James Glynn

Contributions

S.J. and B.Z. conceived the research idea. S.J. designed and developed the machine learning framework, model, and codes. S.J., S.M., P.H., A.M. and B.Z. designed the GIS and data analysis frameworks. S.M. and B.Z. supported the model analysis. S.J. created the figures and drafted the manuscript. P.R.S., V.K., J.G. and B.O.G. provided valuable insights on the results. All authors discussed the results and contributed to the manuscript.

Corresponding author

Correspondence to Siddharth Joshi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Joshi, S., Zakeri, B., Mittal, S. et al. Global high-resolution growth projections dataset for rooftop area consistent with the shared socioeconomic pathways, 2020–2050. Sci Data 11, 563 (2024). https://doi.org/10.1038/s41597-024-03378-x


Received: 19 September 2023
Accepted: 15 May 2024
Published: 30 May 2024
DOI: https://doi.org/10.1038/s41597-024-03378-x
