Global land use for 2015–2100 at 0.05° resolution under diverse socioeconomic and climate scenarios

Global future land use (LU) is an important input for Earth system models for projecting Earth system dynamics and is critical for many modeling studies on future global change. Here we generated a new global gridded LU dataset using the Global Change Analysis Model (GCAM) and a land use spatial downscaling model, named Demeter, under the five Shared Socioeconomic Pathways (SSPs) and four Representative Concentration Pathways (RCPs) scenarios. Compared to existing similar datasets, the presented dataset has a higher spatial resolution (0.05° × 0.05°) and spreads under a more comprehensive set of SSP-RCP scenarios (in total 15 scenarios), and considers uncertainties from the forcing climates. We compared our dataset with the Land Use Harmonization version 2 (LUH2) dataset and found our results are in general spatially consistent with LUH2. The presented dataset will be useful for global Earth system modeling studies, especially for the analysis of the impacts of land use and land cover change and socioeconomics, as well as the characterizing the uncertainties associated with these impacts.


Background & Summary
Land use (LU) change represents one of the most important human effects on the Earth system 1,2 , with profound physical and biogeochemical impacts at both regional and global scales. Therefore, existing Earth system models (ESMs) widely take land use records as a key input for more realistically simulating Earth system dynamics and analyzing the effects of LU on climate and biogeochemical cycling. For example, the Land Use Model Intercomparison Project (LUMIP) has been launched to advance understanding of the impacts of LU on climate 3 in the Climate Model Intercomparison Project Phase 6 (CMIP6) 4 .
ESMs have been found to be sensitive to LU changes [5][6][7] . Therefore, uncertainties in LU datasets can lead to a propagation of uncertainty in ESM projections that have to be carefully considered and evaluated. This could be relatively more straightforward for historical land use and land cover change, which can be benchmarked by observational datasets such as those based on by ground investigation or satellite remote sensing 8,9 . In contrast, it is challenging to evaluate the uncertainties of future LU projections, since there is no benchmarking reference. One possible approach for this purpose is to compare multiple LU projections from different models, following the similar philosophy of the model intercomparison projects 3,6 .
To date, there are still limited global gridded LU datasets that are publicly available for ESM simulations. One representative example is the Land Use Harmonization dataset version 2 (LUH2) 10 , which was produced to provide gridded LU data annually at 0.25° × 0.25° resolution for the historical reconstructions and under future scenarios (850-2300). Essentially, LUH2 harmonizes and at times downscales LU projections from the Global Change Assessment Model (GCAM) or similar models, such as IMAGE 11 , REMIND-MAGPIE 12 , MESSAGE-GLOBIOM 13 , and AIM 14 . LUH2 data has been produced for a small set of scenarios, e.g., a subset of greenhouse gas emission pathways in which end-of-century radiative forcing approaches four-levels (2.6, 4.5, 6.0, and 8.5 W m 2 ) by altering future greenhouse gas emissions and by changing underlying socioeconomic projections 16 . The SSPs are a set of five future scenarios (SSP1 to SSP5) which were designed to span a range of future socioeconomic conditions, and can be used in combination with the RCPs. SSPs look at different ways in which the world might evolve in the future. Specifically, SSP1 represents a world focusing on sustainable growth and equality with low challenges to mitigation and adaptation ("taking the green road"); SSP2 represents a world where trends broadly follow their historical patterns with medium challenges to mitigation and adaptation ("a middle of the road"); SSP3 is a fragmented world of resurgent nationalism with high challenges to mitigation and adaptation ("regional rivalry -a rocky road"); SSP4 represents a world of increasing inequality with low challenges to mitigation but high challenges to adaptation ("inequality -a road divided"); and SSP5 represents a world of rapid and unconstrained growth in economics and energy use with high challenges to mitigation and low challenges to adaptation ("fossil-fueled development"). Broadly speaking, land use is strongly regulated under SSP1, mediumly regulated under SSP2, limitedly regulated under SSP3, strongly regulated in middle/high income countries but limitedly regulated in low income countries under SSP4, and medium regulated under SSP5 35 .
The SSPs and RCPs combinations form a set of future global change scenarios which provide the basis for the next round of CMIP6 and future Intergovernmental Panel on Climate Change (IPCC) assessments 4,17 . In total, these combinations create a set of fifteen potential future scenarios 31 which are varied five times, for using bias-corrected precipitation and temperature data from five global climate models (GCMs, including GFDL-ESM2M, HadGEM2-ES, IPSL-CM5A-LR, MIROC5 and NorESM-M) from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) 36 to drive the estimation of water supply, hydropower capacity changes, and building energy demands in GCAM under each RCP scenario. Details are provided in Graham et al. 31 . Uncertainties due to the use of the five GCMs are provided with the dataset. Therefore, these GCAM runs have resulted in 75 projections of global (excluding the Antarctic) LU at region-basin level for every five years over 2015 to 2100 (Fig. 1).
Downscale GCAM-projected land use with demeter. We used Demeter, a spatial disaggregation model, to convert LU projections by GCAM from regional scale to grid cells. Detailed algorithms and optimization procedures are documented in several publications [32][33][34]37 . In brief, Demeter uses a base land cover map at a target resolution as the reference, and strategically allocates the projected area changes for each land type by models (e.g., GCAM) to the target grid cells, following a series of user-defined rules and spatial constraints.
Demeter requires converting the land cover classes defined in the parent model (i.e., GLTs) and the base map (classified in base land types (BLTs)) to user-defined final land types (FLTs). Downscaled land cover types are thus presented in FLTs. For the purpose of supporting Earth system modeling, we defined 32-type FLTs ( Table 2) which were designed to balance and keep the most detailed land cover classification between the 39 GLTs (from GCAM) and 79 BLTs (from the Community Land Model version 5 (CLM5) ( Table 3), the land component of the Community Earth System Model version 2 as one of the most widely-used ESMs), thus with sufficient plant functional characterization for use by ESMs. In particular, we adopted the base map from the 0.05° × 0.05° 79-BLT land cover data that has been used in CLM5 and represents the land cover in the early 21st century (e.g., 2005). The rules for converting GLTs and BLTs to FLTs are developed with the definitions of GLTs, BLTs and FLTs. and are provided in the input datasets at 28,29,32,35 .  www.nature.com/scientificdata www.nature.com/scientificdata/ In addition, Demeter also requires a set of user-defined rules and spatial constraints for the downscaling processes, including (1) treatment order, which decides the order of allocating changes among land cover types; (2) transition priority, which defines the order of transitions among land cover types; and (3) spatial constraints, such as the spatial distributions of soil workability and nutrient availability. Here we followed the treatment order and transition priority for broader land cover types (e.g., forest, shrub, grass, crop, urban. snow and sparse) as suggested in Chen et al. 32 , but with additional considerations on the variation inside the broad land cover types for the transition priorities. Transition priority is determined by geographical co-occurrence across FLTs. The co-occurrence is defined in four dimensions (from the highest to lowest importance): climate zones (temperate, boreal and tropical), leaf shapes (needleleaf and broadleaf), phenology types (evergreen and deciduous), and vegetation types (trees, grassland, shrubland, cropland, etc.). For example, a 'temperate needleleaf evergreen forest' is more likely transitioned to a 'temperate broadleaf evergreen shrub' than a 'boreal needleleaf evergreen forest' , and or a 'temperate broadleaf deciduous forest' . We used the same spatial constraints as published in Le Page et al. 34 , and Chen et al. 32 , but have linearly interpolated the spatial weighting for soil workability and nutrient availability data into 0.05° to be consistent with our target downscaling resolution. In addition, a couple of key parameters in Demeter have been identified with the most importance for governing the downscaling processes 18 , including: (1) r, the ratio of allocating land cover change as intensification (i.e., increasing a particular land cover in a grid cell in which it already exists); and (2) τ, a threshold percentage of suitable grid cells (determined by spatial constraints) to accept extensified land cover change allocation (i.e., creating new land cover in grid cells in which it does not yet exist). Within this study, we adopted the global optimal value of r = 1 and τ = 0.6 based on our earlier calibration 18 . Detailed rules and constraints are included along with the other input data, such as GCAM projections under the fifteen SSP-RCP scenarios driven by the five GCMs and the base map for downscaling, are archived at as a publicly available dataset 38 (Table 4).
However, it is possible that the definitions of a land type may be different in GCAM and the base map, even though their names are similar 39,40 . For example, grassland and shrubland of GCAM are more broadly defined than in the base map, thus some barren area (as defined in BLTs) are classified as grassland/shrubland in GCAM. Such difference may result in inconsistency between the downscaled results and the base map. Although the existing conversion rules may work within each broad land type (i.e., forests, grassland and shrubland, cropland, bioenergy, urban, and barren), there is no reasonable way to decide the conversion between two distinct broad land types (e.g., grassland vs. barren) according to the existing GLT-to-FLT conversion rules in Demeter. For example, it is reasonable to convert a GLT of 'Grassland' to a FLT of 'C3 grass' , but not reasonable to convert 'Grassland' to 'Barren' in Demeter's conversion rule. To address these potential inconsistencies, we developed a preprocessing procedure to harmonize GCAM projections to be more consistent with the base map in two steps. Such a procedure can be used to match any base map, ensuring consistency between historic and future land use for any user of the product (e.g., an ESM). The essential idea of the procedure is to adjust the area of each GLT in GCAM projections at the starting point (governed by the CLM5 base map, i.e., in 2005) to match the base map (Step 1) but keep the fractional change of each GLT as projected by GCAM in the following time steps (Step 2).
Step 1. Harmonize GCAM projection in year 2005. We used Eq. (1) to adjust the GCAM-projected area for each GLT in each GCAM's spatial unit (region-basin) to match that according to the base map: where A GLT,u,H is the harmonized (H) GLT area in region-basin u. A u,G is the area of region-basin u in the original GCAM projection, A BT,u,B is the area of a broad land type BT (i.e., forests, grassland and shrubland, cropland,  www.nature.com/scientificdata www.nature.com/scientificdata/ bioenergy, urban, and barren) in u according to the base map (B), and A u,B is the area of region-basin u according to the base map. Thus, A BT,u,B /A u,B gives the areal fraction of broad land type BT in region-basin u according to the base map. A GLT,u,G is the GCAM-projected area of a GLT in region-basin u, and A BT,u,G is the GCAM-projected area of a broad land type BT in region-basin u. Thus A GLT,u,G /A BT,u,G gives the areal fraction of a GLT to its associated broad land type BT in region-basin u according to GCAM projection. If A BT,u,G = 0, i.e., a BT does not exist in u according to GCAM projection (e.g., bioenergy plants), A GLT,u,H is set to be 0.
Step The preprocessing procedure produces a harmonized GCAM projections to be used in Demeter. In this data descriptor, we primarily present the downscaled LU datasets from the harmonized GCAM projections after applying the preprocessing procedure. But we provide both the downscaled results from the harmonized and original GCAM projections in our dataset. The former one better matches the base map with the PFT-oriented land type definition; and the latter is more consistent with the original GCAM projections. If there is no special declaration, 'our dataset' is referred to the results from the harmonized GCAM projection in this paper.

Data Records
The dataset includes the projected global gridded land use (excluding the Antarctic) for the period of 2015-2100 at 0.05-degree resolution and 5-year time step under the fifteen SSP-RCP scenarios driven by five GCMs (Fig. 1). Note that we provide two versions of the dataset: one was developed from the harmonized GCAM projections, and the other one was developed from the original GCAM projections (see descriptions above). In each version of the   www.nature.com/scientificdata www.nature.com/scientificdata/ dataset, the spatially explicit results driven by the five GCMs as well as their mean and standard deviation for each of the fifteen SSP-RCP scenarios are provided. All the data has been stored in self-describing NetCDF format which is commonly used by the ESM community. More specifically, the data in each year includes grid-explicit fraction (in percent) of each of the 32 final land types (   www.nature.com/scientificdata www.nature.com/scientificdata/ "GCAM_Demeter_LU_V_SSP_RCP_Model_Year.nc", where "V" could be "H" or "O", denoting the two versions (using harmonized or original GCAM projections) of the dataset; "SSP" and "RCP" denote the SSP and RCP scenarios, "Year" denotes the year of the land use data, and "GCM" denotes the source driving forcing data from five GCMs, or the mean ("modelmean") and the standard deviation ("modelstd") of the results from the five GCMs.

Technical Validation
GCAM projected land use change. We examined the harmonized GCAM land use projections for producing this dataset (Fig. 2). Specifically, bioenergy land area will increase under all SSPs; crop area will decrease under SSP1, SSP4, and SSP5, but will increase under SSP2 and SSP3; forest area will increase under SSP1 and SSP5, but will decrease under SSP2 and SSP3; grassland and shrubland will generally decrease under all SSPs. The dynamic of urban land and barren (rock, ice, desert) are not projected by GCAM, thus they will remain unchanged over time. The projected land use areas have large variations across RCPs under each of the SSPs. For example, under all SSPs, the bioenergy plant area will increase (and accordingly the forest and grassland will decrease) much faster under RCP 2.6 than RCP 4.5 and 6.0, for the higher demand of clean energy to meet more strong limitation of greenhouse gas emission under RCP 2.6. These temporal trends of the GCAM projections generally match the qualitative descriptions of land use futures under SSPs in Popp et al. 35 .
Spatial consistency with LUH2. We compared the land use patterns between the base map (representing the 'truth' in the early 21 st century), and our dataset and LUH2 data in 2015. Because the land cover classification is different between the our dataset and LUH2, we grouped them into three broad land cover types ('Crop' , 'Forest' , and 'Non-forest') for comparison, as shown in Fig. 3. Here we used the data from LUH2 and our dataset under the scenario SSP4 RCP6.0, because the source of land use information under SSP4-RCP6.0 in LUH2 only is from GCAM. In general, the two datasets show similar spatial variations of the three broad land cover types with minor inconsistencies within some regions. Our GCAM-Demeter-based dataset better matches the spatial patterns in the base map than the LUH2 data. The Pearson's correlation coefficients between the base map and our dataset are 0.99 for all three broad types, while they are 0.70, 0.59, and 0.85 between the base map and LUH2 for 'Forest' , 'Non-forest' and 'Crop' , respectively. This is not surprising because the base map was used as the reference for downscaling, but it suggests that our downscaling approach is successful in producing accurate historical land cover as suggested by the base map, which serves a good starting point for generating gridded land use data in the future years. It should be noted that the comparison does not mean LUH2 is incorrect, because of LUH2 uses a different input data and strategy for spatial downscaling, and use different land cover definitions. Primary differences between our dataset and LUH2 are in the northern high latitudes. For example, while LUH2 does not classify Greenland as any of the three broad land cover types, our dataset shows Greenland as 'Non-forest' , because GCAM classifies 'Rock, Ice, Desert' into a single type (Table 1). Both the base map and our dataset show some forest in Russia's Kamchatka Peninsula, while LUH2 suggests dense forest coverage in that region, probably due to the definition of forest in LUH2, which is based on carbon density other than a specific PFT. Note that the www.nature.com/scientificdata www.nature.com/scientificdata/ spatial resolution is 0.25° and 0.05° for the LUH2 data and our dataset, respectively, thus our data shows more spatial variation than the LUH2 data.
We further compared the two datasets under their overlapped SSP-RCP scenarios (SSP1 RCP2.6, SSP2 RCP4.5, SSP4 RCP6.0 and SSP5 RCP8.5). Taking the data in 2100 under SSP4 RCP6.0 as an example (Fig. 4), we find visually similar difference between the two datasets as that in 2015 (Fig. 3). More quantitative comparisons are shown in Fig. 5. The Pearson's correlation coefficients between our dataset and LUH2 are the highest for 'Crop' , then for 'Forest' , and the lowest for 'Non-forest' . The relatively weaker correlations for 'Non-forest' are due to the fact that our dataset does not distinguish snow ice. In general, the correlation coefficients show decreasing trend under all scenarios, particularly under SSP1 RCP2.6, indicating the discrepancy between the two datasets are getting larger over time. In addition, the global areas of the three broad land cover types from our dataset and LUH2 are at the similar level, but they are not fully consistent and the inconsistency changes over time. Such discrepancies are most likely due to that LUH2 use projections of a different model for each of the scenarios, e.g., IMAGE 11 for SSP1 RCP2.6, MESSAGE-GLOBIOM 13 for SSP2 RCP4.5, SSP4 RCP6.0 for GCAM, and REMIND-MAGPIE 12 for SSP5 RCP8.5, while our dataset is solely based on GCAM. However, LUH2 used land cover projections from an older version of GCAM 42 , and we have adjusted the original GCAM projections to match the base map in our approach (see methods). Both factors may contribute to the discrepancy under SSP4 RCP6.0, although it is relatively smaller than the discrepancy under the other scenarios.  www.nature.com/scientificdata www.nature.com/scientificdata/ Uncertainties due to the driving GCMs. The climate forcing provided by the five GCMs drives a few key components in GCAM, such as water supply, agricultural productivity, hydropower capacity changes, and building energy demands 31 . Thus, variations of climate forcing can introduce uncertainty in GCAM-projected (and thus the downscaled) land use under each of the SSP-RCP scenarios. Below we demonstrated the areal uncertainties in 2100, for each of the GCAM regions and broad land type groups due to the GCM forcing using the SSP4 RCP6.0 scenario as an example (Fig. 6). We found that the use of different GCMs can bring some uncertainties to the resulting area sharing of different land cover types in each of the regions (box plots show distributions of percentage deviations from the mean projected area forced by the five GCMs). 'RockIceDesert' and 'Urbanland' has no variation because no changes are projected by GCAM for these two land types. The uncertainty ranges differ with regions. For example, the 25 th and 75 th percentile of the areal variations of 'Forest' land types in ' Africa_ Eastern' are −0.9% (0.9% less than the mean) and 1.8% (1.8% more than the mean), while they are −3.4% and 5.5% respectively in the region 'Europe_Eastern' . Overall the uncertainty ranges (i.e., the maximum minus the minimum) are small, but for some certain land types and regions, the uncertainty ranges can be as large as 10%, particularly for 'Crops' , such as those in 'China' , 'Europe_Eastern' , 'Japan' , 'Pakistan' , 'Russia' , 'South Africa' and 'South Korea' . Thus, while we think the mean land use driven by the five GCMs are sufficiently useful in most cases, we include both the spatially explicit mean and the standard deviation land use records at each time step in our dataset to represent the uncertainties due to GCM drivers.

Usage Notes
The presented dataset includes global gridded land use and land cover change projections in the 21 st century under fifteen diverse SSP-RCP scenarios. The land cover types (in FLTs) were grouped following the classification system of a representative ESM, thus can be conveniently applied in simulations of a wide range of existing ESMs. Since the dataset is provided at a relatively high spatial resolution (0.05 degree), it offers good flexibility of ESM simulations at various spatial resolutions, which are typically at coarser resolutions. However, several important limitations have to be considered for correct use of this dataset.
First, since urban land is static in the current version of GCAM, the urban area in the downscaled products does not change over time. Additional consideration of urban dynamics will need to be taken into account for certain ESM simulations. Second, the land use and land cover change provided in this dataset is directly driven by human impacts, and there was no consideration of the biophysical response of vegetation (i.e., the dynamic vegetation) to climate change in the development of this product. Including the biophysical dynamic vegetation requires further development of GCAM in the future. Third, since the land use projections are generated by a single model, we cannot assess the uncertainty associated with the model structure. Future research should consider applying Demeter in conjunction with models other than GCAM.