Background & Summary

Cropland covers about 10% of the Earth’s land surface, making it an essential component of all land surface and Earth system models. Croplands also produce feed, food, fiber, and fuel, and are the dominant consumer of freshwater globally, making them an important component of global hydrologic, economic, and multi-sector dynamics models. While there are currently available datasets that provide annual cropland extent (e.g., the History Database of the Global Environment (HYDE 3.2)1, or the SEDAC Croplands v12,3), there is a need for information beyond the scope of those datasets. First, cropland extent alone cannot provide information on the production (kg) or yield (production per area) of a crop, which is important for understanding biogeochemical cycling as well as food production. Further, to fully evaluate the impact of crops on water use, evaporative fluxes, and energy fluxes, there is need for a dataset that provides the monthly temporal resolution of cropping activity, along with disaggregation by crop type and irrigated versus rainfed management. The MIRCA20004 dataset provides these monthly and crop-wise resolution data, but only circa year 2000. Globally, cropland area increased by approximately 5% from 2000 to 2015 (~775,000 km2)5, though with great spatial variation even in the direction of this change (positive vs negative) at the country, watershed, and grid cell level1,5.

Here, we address these cropland data needs by producing a year 2015 gridded annual harvested area, production, and yield dataset for 26 crops and crop categories (e.g., wheat, vegetables), disaggregated by irrigated and rainfed management. We then generate a monthly gridded cropland dataset for each of those crops, indicating which months the crop is actually grown in the field, such that the monthly area is consistent with the annual harvested area for each crop. The foundation for this year 2015 cropland data is a gridded cropland dataset circa year 2010 generated for use by the Global Agro-Ecological Zones model v4 (GAEZ 2010)6,7. The GAEZ v4 methodology6 couples national aggregate statistical data on crops and irrigation from FAO, with detailed spatial data, taken from a variety of published sources, on cropland distribution, irrigation distribution, soil properties, climate, human population (to allocate land area to built infrastructure), livestock population (to allocate land area to livestock), protected lands, lakes and wetlands, observed crop phenology and crop calendars, as well as available sub-national crop statistics. The ancillary spatial data are used to generate a series of gridded intermediate products, e.g., average attainable irrigated and rainfed yields, irrigated and rainfed multi-cropping intensity class. These intermediate spatial products are used to downscale the national aggregate statistical crop data from FAOSTAT5 (average of period 2009–2011) to 5-minute gridded products of irrigated and rainfed harvest area, production, and yield, as well as value of crop production. The downscaling process is a sequential, iterative rebalancing procedure that relies on optimization principles8 suitable for situations when the aggregate observed information is available (here FAO national crop statistics), constrained by the priors (i.e., the intermediate spatial maps), other available statistical data, and expert opinion6. Here we present our methodology for updating the GAEZ 2010 products to 2015, providing more recent data on production, yield, and harvested areas; these data are available to the wider global modeling community. The new dataset is referred to here as GAEZ+ 2015.

Updating to 2015 was done using FAOSTAT5 country-level information on crop harvest area, crop production, and livestock herd size changes from 2010 to 2015, along with the FAO GeoNetwork’s Global Administrative Unit Layer (GAUL)7 to map 5 minute grid cells to countries. We used the MIRCA20004 information on multi-cropping patterns and monthly crop calendars to convert the updated GAEZ+ 2015 crop harvest areas into monthly cropland physical areas by crop and production system. This method uses national-level data to update a gridded product; as such, it is important to note the limits of this data set. GAEZ+ 2015 is not based on remote sensing observations of cropland presence or absence in a given grid cell, and therefore represents aggregate changes in cropland extent but not high-resolution changes. Further, information on crop calendars and rotations is taken from the MIRCA20004 dataset, so GAEZ+ 2015 does not include information on changes to crop calendars or rotations in the past 15 years. This data product is meant to be an intermediate data set for use by those who need crop layers consistent with country level FAO-reported 2015 statistics while we await the release of other products such as the crop- and production-specific datasets that are promised from the Global Food Security-Support Analysis Data (GFSAD)9 and MapSPAM10, both of which use remote sensing to better identify changes in crop locations and calendars.

The data provided here follow commonly used file formats (geotiff and netCDF) and metadata standards familiar to global gridded modeling communities. The re-use value is high, as there are a number of Global Hydrologic Models and Land Surface Models that use this type of data. Additionally, interdisciplinary research teams are increasingly using global gridded crop data, e.g., in global economic models such as SIMPLE-G11 and integrated assessment models such as the Global Change Analysis Model GCAM12. Further, this dataset uses the GAEZ crop list and ensures consistency with FAO administrative level agricultural data, thereby connecting this dataset to already widely used crop categories and more aggregate publicly available data that is already in use. FAO is currently using the GAEZ+ 2015 data in an ongoing study.

Methods

Here we describe methods for the GAEZ+ 2015 Annual Crop Data, and the GAEZ+ 2015 Monthly Cropland Data. The Annual Crop Data was generated first, then the Monthly Cropland Data was calculated based on the Harvest Area results of the Annual Data (Fig. 1).

Fig. 1
figure 1

Schematic overview of annual and monthly data production methods. The GAEZ+ 2015 products described in this paper are in dark blue boxes; publicly available data used are in light blue. Dark blue arrows indicate which data are used in each processing step, and grey arrows from steps to data show which steps result in final GAEZ+ 2015 data products. The processing steps listed here are referred to in the Methods section text.

GAEZ+ 2015 Annual Crop Data Methods

The GEAZ+ 2015 Annual Crop Data updates the 2010 GAEZ v4 crop harvest area, yield, and production maps6,7 (identified as Theme 5 in ref. 7) using national-scale data on the change in crop harvested area and livestock numbers from 2010 to 2015, based on statistics for 160 crop groups, and cattle and buffalo, from FAOSTAT5.

Three datasets were used to produce GAEZ+ 2015 Annual Crop Data:

  1. 1.

    FAOSTAT crop production domain: annual, country-level data on crop harvested area (H) and crop production (P) for each crop from the FAOSTAT database (Table 1)

    Table 1 GAEZ and FAOSTAT crop harmonization.
  2. 2.

    GAEZ v46,7 gridded global annual harvested area, yield, and production by crop for the 26 FAOSTAT crops and crop categories at 5-minute resolution

  3. 3.

    Global Administrative Unit Layer (GAUL 2012)13 data. GAUL 2012 reports the fraction of each global 5-minute grid cell that falls within a given country or disputed territory. There are 275 unique global administrative units.

Step 1. Calculate crop changes from 2010 to 2015 by country:

For each country, we extracted the harvested area (H) and crop production (P) for each of the 160 FAOSTAT crop categories, c, from the FAOSTAT database. We averaged three years (2009–2011) of annual national crop harvested area data to represent 2010 national crop harvest area, H2010, and three years (2014–2016) of annual crop harvested area data to represent 2015 national crop harvest area, H2015, then calculated a ratio, rHc, of 2015 to 2010 harvested areas for each crop c in each country, and equivalently, for crop production:

$$r{H}_{c}={H}_{2015}/{H}_{2010}$$
(1)
$$r{P}_{c}={P}_{2015}/{P}_{2010}$$
(2)

This results in 160 rH and rP values per country. If harvest area and production values for a particular crop are zero or unreported in the FAOSTAT data, then rHc and rPc are both set to 1.0 (i.e., no change from 2010 to 2015). Three years of data are averaged (2009 – 2011 and 2014 – 2016) to account for missing data for some country/year combinations and to avoid emphasizing reported outliers.

Step 2. Aggregate FAOSTAT-based ratios to the GAEZ crop categories:

We followed the crop aggregation methods of the GAEZ model to aggregate the FAOSTAT crop list (160 unique crops as of 2019) to 26 crops (see Table 1). For each of the 26 GAEZ crop categories, if there is more than one matching FAOSTAT crop (see Table 1) then we applied an area-weighted average (based on FAOSTAT year 2015 harvested area) of the FAOSTAT crops within each country to the rH and rP values for that crop and country. This results in 26 rH and rP values per country. There was one exception to this: the GAEZ_2010 crop category ‘fodder crops’ was an aggregate of 17 FAOSTAT crops (see Table 1) for which harvest area data are no longer reported on FAOSTAT; i.e., GAEZ_2010 had obtained FAOSTAT data on fodder crops circa 2010, but FAOSTAT no longer provides any data on fodder crops for any year. We assumed that the 2010 to 2015 fractional change in fodder crop harvest area in each country was proportional to the change in the FAOSTAT reported national herd sizes for cattle and buffalo livestock data5 for that country, following the same methodology as for crop harvested area change (see Step 2 below). This method assumes a negligible international trade of fodder crops as indicated by bilateral trade matrices available from FAOSTAT.

Step 3. Apply country-level ratios to grid cells:

Calculated country-level ratios were then applied to each grid cell k, using the GAUL_201213 definitions for which grid cells fall within which countries. Some grid cells are split between two or more countries. In this case, all model output variables for the grid cell are divided between the countries based on the fraction of grid cell area falling within the country i:

$${H}_{c,2015}^{k}={H}_{c,2010}^{k}{\sum }_{i}\,{f}_{i}^{k}r{H}_{c,i}$$
(3)
$${P}_{c,2015}^{k}={P}_{c,2010}^{k}{\sum }_{i}\,{f}_{i}^{k}r{P}_{c,i}$$
(4)

where \({H}_{c,2015}^{k}\) is the year 2015 harvested area (or production) for crop c in grid cell k; \({f}_{i}^{k}\) is the fraction of country i in grid cell k, and rHc,i and rPc,i are the ratios for crop c in country i as calculated in Eqs. 1 and 2. This results in 26 H and P values per grid cell. If the sum of all crop harvest areas exceeds 99% of the grid cell area, all crop harvest areas are reduced equally to fit within 99% of the area.

Special Case: Sudan

FAOSTAT data for years before 2011 report data for Sudan, and for South Sudan and Sudan after 2011. To compute the ratios for these grid cells, we split the 2010 data for Sudan into a virtual ‘North’ Sudan and ‘South_Sudan’, using the data for the year 2012, which was reported for both countries. We then used these generated 2010 data and applied the same methodology as described above to calculate changes in harvested areas and production in all grid cells in both countries.

Special Case: Small regions and islands

Forty-nine countries - generally small regions or islands - had no data reported for crop harvested area by FAOSTAT. We assumed that there was no change in crop harvested area for the grid cells in these countries. Note that many may have had zero ha as previously-reported crop area in GAEZ v4. These countries are (the number following each region is the region’s number in ADM0_CODE in the GAUL_2012 data13):

Anguilla (9), Aruba (14), Ashmore_and_Cartier_Islands (16), Azores_Islands (74578), Baker_Island (22), Bassas_da_India (25), Bird_Island (32), Bouvet_Island (36), British_Indian_Ocean_Territory (38), Christmas_Island (54), Clipperton_Island (55), Cocos (Keeling)_Islands (56), Europa_Island (80), French_Southern_and_Antarctic_Territories (88), Glorioso_Island (96), Greenland (98), Guernsey (104), Heard_Island_and_McDonald_Islands (109), Howland_Island (112), Isle_of_Man (120), Jarvis_Island (127), Jersey (128), Johnston_Atoll (129), Juan_de_Nova_Island (131), Kingman_Reef (134), Kuril_islands (136), Madeira_Islands (151), Mayotte (161), Midway_Island (164), Navassa_Island (174), Netherlands_Antilles (176), Norfolk_Island (184), Northern_Mariana_Islands (185), Palmyra_Atoll (190), Paracel_Islands (193), Pitcairn (197), Saint_Helena (207), Scarborough_Reef (216), Senkaku_Islands (218), South_Georgia_and_the_South_Sandwich_Islands (228), Spratly_Islands (230), Svalbard_and_Jan_Mayen_Islands (234), Tromelin_Island (247), Turks_and_Caicos_Islands (251), United_States_Virgin_Islands (258), Wake_Island (265), Gibraltar (95), Holy_See (110), Liechtenstein (146).

Special Case: Disputed Areas

Some grid cells in the GAUL_201213 cell-table database are assigned to nine disputed areas, rather than to specific countries. We assumed that there was no change in crop harvested area or production from 2010 to 2015 for grid cells these disputed areas. These areas are (the number following each region is the region’s number of the ADM0_CODE in the GAUL_201213 data):

Abyei (102), Aksai_Chin (2), Arunachal_Pradesh (15), China/India (52), Hala’ib_Triangle (40760), Ilemi_Triangle (61013), Jammu_and_Kashmir (40781), Ma’tan_al-Sarra (40762), Falkland_Islands_(Malvinas) (81).

Step 4. Compute 2015 crop yields:

Crop yields were computed for each crop, c, and grid cell, k, as the ratio of crop production to crop harvest area (if harvest area, Hc,k,2015, is zero, then yield, Yc,k,2015, is set to zero):

$${Y}_{c,k,2015}={P}_{c,k,2015}/{H}_{c,k,2015}$$
(5)

The resulting gridded global data are:

  1. A.

    GAEZ+ 2015 Crop Harvest Area14

  2. B.

    GAEZ+ 2015 Crop Yield15

  3. C.

    GAEZ+ 2015 Crop Production16

This new data product consists of 156 data files in geotiff format, one rainfed harvested area file and one irrigated harvested area file for each crop harvest area (1000 ha (107 m2) per 5-minute grid cell), crop production (1000 tonnes (106 kg) per 5-minute grid cell), and crop yield (tonnes per ha (10−1 kg m−2) per 5-minute grid cell), for each of the 26 GAEZ crops or crop categories in Table 1.

GAEZ+ 2015 monthly cropland area methods

Two datasets were used to produce monthly cropland area by crop and by irrigated vs rainfed management. These are:

  1. 1.

    GAEZ+ 2015 Annual Harvested Area14 (as developed above)

  2. 2.

    MIRCA2000 cropland area4

Step 5. Harmonize the GAEZ+ 2015 and MIRCA2000 crop lists

The MIRCA20004 cropland product provides monthly growing area grids (gridded physical cropland area) for 26 irrigated and rainfed crops and crop categories, as well as cropping calendars that identify the planting month and harvesting month for each crop (via ‘subcrops’ – see below). However, the MIRCA2000 crop list is not the same as the GAEZ+ 2015 crop list; we matched each crop type in the GAEZ+ 2015 crop list to a crop type in the MIRCA2000 crop list to enable the application of MIRCA2000 crop calendars to GAEZ+ 2015 crops (Table 2). Out of the 26 GAEZ+ 2015 crops, 18 had clear 1:1 matching crop categories within MIRCA2000. The remaining 8 crops were matched based on general crop characteristics, i.e., annual vs. perennial, or to unmatched MIRCA2000 cereals.

Table 2 List of GAEZ crop categories used in all GAEZ+ 2015 products, as well as the matching between GAEZ+ 2015 crops and MIRCA20004 crop categories for the purposes of producing GAEZ+ 2015 monthly cropland data.

An essential component of the MIRCA2000 cropland dataset is the identification of subcrop categories within each crop category to split crops into areas grown in different seasons, or crops with different planting and harvesting dates within the same season. Up to 5 subcrops can be defined to represent such multi-cropping practices. Below, we use the following notation:

HG = annual harvested area from the GAEZ+ 2015 product for a given crop

HM = annual harvested area calculated from the MIRCA2000 data for a given crop

AM,n = cropland area of MIRCA2000 crop, subcrop n, by month

AG,n = cropland area of GAEZ+ 2015 crop, subcrop n, by month

AG = cropland area of GAEZ+ 2015 crop, by month

Step 6. Apply MIRCA2000 monthly crop calendars to GAEZ+ 2015 annual data

To generate the monthly cropland physical area of GAEZ+ 2015 crops, we followed these steps for each GAEZ crop in each grid cell:

  1. 1.

    For a given GAEZ crop in a given grid cell, is the area reported >0 for the matching MIRCA2000 crop?

    1. a.

      If YES, then use the MIRCA2000 data for the grid cell and crop considered.

    2. b.

      If NO, then find the closest grid cell with the matching MIRCA2000 crop category, and apply the MIRCA2000 crop rotation from that grid cell to the given crop/grid cell combination for the following steps.

  2. 2.

    Does the matching MIRCA2000 crop category (Table 1) have more than 1 subcrop?

    1. a.

      If NO, then AG = HG for all months of the cropping season, as defined by the MIRCA2000 crop calendar.

    2. b.

      If YES, then for each subcrop category n, apply the ratio of AM,n/HM to HG, then sum the subcrop areas within each month such that:

    $${A}_{G}=\sum _{n}\frac{{A}_{M,n}}{{H}_{M}}{H}_{G}$$
  3. 3.

    For each month and each grid cell, check if the sum of all crops (irrigated and rainfed) is greater than the 99% of area of the grid cell. We assume that at least 1% of land must be retained as non-cropland for agricultural infrastructure such as roads, buildings, irrigation infrastructure, and other landcovers (e.g. rivers, wetlands).

    1. a.

      If NO, then no further processing is done.

    2. b.

      If YES, then reduce crop area by the excess value based on a removal order (Table 2). Rainfed crops have higher removal order numbers for the excess truncation (starting with 1) before removing irrigated crops, until the cell area is not exceeded. A large removal number (e.g., 20) indicates that the crop’s land is unlikely to be removed. Large priority numbers are given to the staple crops to ensure these important food producing lands are consistent with FAOSTAT country data.

The maximum monthly amount of physical cropland that was removed by step 3 is 711,543 ha, which is 0.05% of total global cropland physical area.

The resulting global gridded data from Step 6 are monthly time series of cropland physical area by crop, subcrop, and production system, called GAEZ+_2015 Monthly Cropland Data17. Combining the MIRCA2000 crop calendar and subcrop rotation information with the GAEZ+ 2015 annual data allows for the representation of crop seasonality; e.g., Fig. 2 shows the aggregate monthly cropland physical area for Rice 1 and Rice 2 (two sub-crops of rice) over the northern hemisphere, clearly illustrating the two main rice-growing seasons.

Fig. 2
figure 2

Aggregate monthly cropland physical area for Rice 1 and Rice 2 subcrops from monthly GAEZ+ 2015 over the northern hemisphere shows the two main rice-growing seasons. This seasonality is the result of combining GAEZ+ 2015 annual data with the MIRCA20004 crop calendars and subcrop divisions.

Data Records

GAEZ+ 2015 Annual Crop Data14,15,16:

File format: geotiff

File naming convention: GAEZAct2015_{Variable}_{Cropname}_{Management}.tif

Where {Variable} is one of: HarvArea, Production or Yield.

{Cropname} is one of the names from column 1 in Table 1.

{Management} is one of: Irrigated, Rainfed, Total or Mean.

Date Produced: May 2020.

Spatial Metadata:

Extent: X: −180 to +180

Extent Y: −90 to +90 Resolution: 0.083333 decimal degrees (5 arcminutes)

Coordinate reference system: longitude/latitude (WGS84 datum)

Projection in PROJ.4 notation: “+proj = longlat + datum = WGS84”

Units:

Crop Harvest Area: 1000 ha (107 m2) per 5-arcminute grid cell

Crop Production: 1000 tonnes (106 kg) per 5-arcminute grid cell

Crop Yield: tonnes per ha (10−1 kg m−2) per 5-arcminute grid cell

No data value: Oceans, open-water, and Antarctica in the geotiff files have the no-data value of −3.39999999999999996e + 38.

Repository: Harvard dataverse

GAEZ+ 2015 Monthly Cropland Data17

File format: netCDF v.4 with default internal compression (level 7)

File naming convention: GAEZ_CropArea_{Cropname}_{subcrop}_{Management}.nc

Where: {Cropname} is one of the names from column 1 in Table 1.

{subcrop} is a number (1 through 5) indicating crop rotations.

{Management} is Irrigated or Rainfed.

Date Produced: January 2021.

Spatial Metadata:

Extent: X: −180 to +180

Extent Y: −90 to +90

Extent Z: 12, one layer per month Resolution: 0.083333 decimal degrees (5 arcminutes)

Coordinate reference system: longitude/latitude (WGS84 datum)

Projection in PROJ.4 notation: “+proj = longlat + datum = WGS84”

Units: Cropland Area: 1000 ha (107 m2) per 5-arcminute grid cell

No data value: Oceans, open-water, and Antarctica in the netCDF files have the no-data value of −3.4e + 38.

Repository: GeoHub (https://mygeohub.org/)

Note that the annual dataset and the monthly data set are stored in two different repositories. The annual dataset is on Harvard Dataverse and the monthly dataset is on GeoHub. The use of different repositories is based on funding and contract obligations.

Technical Validation

We compare GAEZ+ 2015 harvested area to FAOSTAT5 reported harvested area by crop (Table 3) and by country (Online-only Table 1). It would be useful to also compare harvested area to other global gridded datasets, but at the time of this analysis, there is no other global gridded crop harvested area product for the year 2015. GAEZ+ 2015 crop yields and cropland physical area are compared to other publicly available global gridded datasets for the year 2015, as well as to one dataset for cropland presence/absence at the grid cell level for the year 2010. Validation of any global gridded crop dataset is challenged by universal uncertainties in underlying reported crop statistics and in remote sensing methods. Most, if not all, global gridded crop datasets make use of FAO-reported country level statistics, leading to consistency but not validation when datasets are compared. Here, we report comparisons to other datasets so users familiar with those data are aware of similarities and differences, and we compare spatial aggregates not used in the development of the data, e.g., sub-national boundaries and watershed boundaries, as well as grid cell values where available to illustrate the spatial scale at which these data are consistent or inconsistent.

Table 3 Comparison between GAEZ+ 2015 and FAOSTAT global harvested area by crop category.

Harvested area comparison

While the goal of this paper is to present year 2015 agricultural data, we first must note any biases apparent in the underlying year 2010 data upon which our product is based, as any bias in the 2010 data will necessarily be carried into the 2015 products. Globally, the sum of all GAEZ v4 2010 harvested areas is 4.2% lower than the FAOSTAT c.2010 total harvested area for all crops, based on all available matching country and crop combinations. The majority of this difference in harvested area is due to the Fruits_&_Nuts category, which is reported in FAOSTAT but not by GAEZ v4 2010. There are also large differences in the Crops_NES and Yams_and_other_roots categories, but in the opposite direction, partially cancelling out the Fruits_&_Nuts discrepancy. The four crops that account for the majority of the world’s production – wheat, maize, rice, and soybeans – all match well, with differences between GAEZ v4 2010 and FAOSTAT c.2010 of <1%. Similarly, the top crop producing countries in the world match well between the two datasets, though notably both Indonesia and Thailand have ~10% less total harvested area in GAEZ v4 2010 than in FAOSTAT c. 2010.

Since no other global gridded harvested area dataset exists at this time, we have only the FAOSTAT country-level and crop total data for 2015 as a basis for comparison. We do not expect the country or global crop aggregates to match exactly because the FAOSTAT data used to generate GAEZ+ 2015 Harvested Area provided only a change in crop harvested area by country; we did not target or calibrate to the FAOSTAT 2015 reported values, and so comparing these two datasets provides some evaluation of the combination of underlying GAEZ 2010 data with the FAOSTAT-based change ratio. Global harvested area by crop from GAEZ+ 2015 and FAOSTAT is shown in Table 3, and total crop harvested area by country from GAEZ+ 2015 and FAOSTAT is shown in Online-only Table 1.

Table 3 presents crops in order of the largest to the smallest global harvested area, according to GAEZ+ 2015. The world’s four staple crops – wheat, maize, rice, and soybeans – all have <1% difference in crop harvested area between FAOSTAT and GAEZ+ 2015. As can be seen in Table 1, the FAOSTAT “CropsNES” category is an aggregate of many crops, which individually have small global harvested areas, but collectively are the third largest harvested area category in the world. Despite challenges in reporting of small crop harvested areas, there is a 0.1% difference in global CropsNES harvested area between FAOSTAT 2015 and GAEZ+ 2015. Notably, GAEZ+ 2015 reports 7% (~5 Mha) more Pulses harvested area than FAOSTAT, a difference that is directly inherited from the 7% difference between GAEZ v4 2010 and FAOSTAT 2010 Pulses harvested area difference. Pulses are the 5th largest harvested area in the world, and an important food crop, particularly in developing countries; this crop warrants more attention from global crop researchers in light of this harvested area difference. Other crop categories with large (>10%) differences are Other cereals, Yams & other roots, and Banana; these crops account for a small proportion of global crop harvested area, but should be evaluated more carefully for local studies of regions where these crops are nutritionally and/or economically important. The GAEZ+ 2015 data set includes 169,182 thousand hectares of fodder crops; this value is not included in Table 3 because the FAO does not report fodder crop area for 2015.

Online-only Table 1 presents harvested area summed for all crops by country, ordered from most to least GAEZ+ 2015 harvested area. The five countries with the most harvested area (India, China, United States of America, Russian Federation, and Brazil) all have less than or equal to 1% difference between GAEZ+ 2015 and FAOSTAT 2015. There are a few notable large countries for which there are larger differences between the datasets; Australia, Germany and France all of more than a 10% difference. There are 31 countries with >0 harvested area reported in FAOSTAT 2015, yet 0 harvested area in GAEZ+ 2015; all these countries have <100,000 ha of FAOSTAT-reported 2015 harvested area, indicating that GAEZ+ 2015 is missing information on many of the smallest agricultural production countries.

Crop yield comparison

The Global Dataset of Historical Yields (GDHY)18 provides a dataset of year 2015 global gridded crop yields for maize, rice, soybean, and wheat. To our knowledge, there are no global gridded datasets of year 2015 yields for the other GAEZ+ 2015 crop categories. Since the GDHY dataset is gridded, it can be used to compare not only yield values (t ha−1) but also the spatial distribution of crop yield at the grid cell level.

GAEZ+ 2015 grid cell crop yields are consistently lower than GDHY grid cell crop yields for all four crops (Fig. 3), with grid cell yield linear regression slope values of 0.5, 0.3, 0.9, and 0.4 for maize, rice, soybean, and wheat, respectively. While individual grid cell yield values are consistently lower in GAEZ+ 2015 than GDHY, the former reports a greater spatial extent, with more non-zero grid cell values; this can be seen in the large span of non-zero values along the y-axis of Fig. 3, as well as by comparing maps of the two products (Fig. 4). It is expected that the spatial extent of these two products is different, as the GDHY product uses crop harvested area data from year 200019 as a basis for crop distribution; this difference in distribution, especially the lower extent in GDHY compared to GAEZ+ 2015, explains the difference in grid cell level yields. Yield is a weight per area, so a smaller total area in GDHY necessitates a higher yield per area in order to achieve agreement with the FAO country level yield and production statistics also used by GDHY.

Fig. 3
figure 3

GAEZ+ 2015 grid cell crop yields are consistently lower than Global Dataset of Historical Yields (GDHY)18 grid cell crop yields for all four crops, and with scatter. Hexbin plots show the log of the number of scatter plot points that fall within each hexagon unit on the plot. The linear regression line is shown in black, with 10.1038/s41597-021-01115-2 grey shading around the line showing the 95% confidence interval.

Fig. 4
figure 4

Maps of the Global Dataset of Historical Yields (GDHY)18 and GAEZ+ 2015 grid cell yields (t ha-1) for maize, rice, soybean, and wheat, show agreement on general spatial patterns. Maize, soybean, and wheat show differences in the western parts of Russia and in eastern Europe, with GAEZ+ 2015 reporting a larger spatial extent of these crops.

Cropland physical area comparison

Global total

Annual cropland physical area extent (shown in Fig. 5) can be calculated from the GAEZ+ 2015 monthly cropland physical area data. Here we compare GAEZ+ 2015 cropland extent to FAOSTAT reported cropland area extent5. GAEZ+ 2015 minimum cropland extent is calculated by assuming the maximum possible re-use of cropland in multi-cropping systems, effectively using the maximum monthly growing area as the cropland extent. GAEZ+ 2015 maximum cropland extent is calculated by assuming the minimum possible re-use of cropland, effectively taking the minimum of the annual harvested area and the grid cell area as the cropland extent. Globally, we find GAEZ+ 2015 cropland extent is 4–8% lower than FAOSTAT cropland extent (Table 4). This result is consistent with our estimate that GAEZ v4 2010 total harvest area is 4.2% lower than FAOSTAT reported harvested area, and that GAEZ v4 2010 total minimum cropland extent (calculated in the same way as the GAEZ+ 2015 minimum cropland extent) is 10% lower than HYDE 3.21 year 2010 reported global cropland extent.

Fig. 5
figure 5

(a) GAEZ+ 2015 irrigated cropland, shown as a fraction of each 5-minute grid cell, and (b) GAEZ+ 2015 rainfed cropland, shown as a fraction of each 5-minute grid cell.

Table 4 Comparison of GAEZ+ 2015 global cropland extent to FAOSTAT year 2015 global cropland extent.

Spatial distribution of cropland physical area

To evaluate the accuracy of the GAEZ+ 2015 spatial patterns of cropland physical area, we compare total cropland physical area, irrigated cropland physical area, and rainfed cropland physical area (Fig. 6), as well as irrigated rice physical area and rainfed rice physical area (Fig. 7) to the HYDE 3.21 product. While HYDE 3.21 utilizes country-level information from FAOSTAT within its data generation algorithm, it bases the spatial distribution of cropland on remote sensing data, which provides an independent source of sub-national crop data. Validation was done at two spatial aggregates: (1) hydrologic basins defined at 20,000–200,000 km2 area units (1,815 unique units), derived from the Hydrosheds global 5-arcminute simulated river network20 and (2) administrative boundaries at the sub-national level (administrative level 1 from the FAO GeoNetwork Global Administrative Units Layer (GAUL)13 provided for download by18 (See Fig. 8 for maps of the spatial units). This provides validation metrics at geophysical and socio-political scales relevant to the modelling communities that would use this dataset. We also compare grid cell values of physical areas for all categories available from HYDE (Fig. 9, Table 5).

Fig. 6
figure 6

Linear regression of GAEZ+ 2015 (y axis) cropland physical area, irrigated physical area, and rainfed physical area against HYDE 2.31 year 2015 data, aggregated by administrative units and hydrologic basins. Values are in millions of hectares (Mha). The grey line is a 1:1 line, and the blue line is the linear regression model.

Fig. 7
figure 7

Linear regression of GAEZ+ 2015 (y axis) irrigated rice and rainfed rice physical area against HYDE 2.31 year 2015 data, aggregated by administrative units and hydrologic basins. Values are in millions of hectares (Mha). The grey line is a 1:1 line, and the blue line is the linear regression model.

Fig. 8
figure 8

Hydrologic basins (a) and administrative units (b) used for spatial aggregation. Hydrologic basins defined at 20–200 km2 area units (1815 units), derived from the Hydrosheds global 5-arcminute simulated river network20, and the administrative units are administrative level 1 data from the FAO GeoNetwork Global Administrative Units Layer (GAUL)13 provided for download by23.

Fig. 9
figure 9

Linear regression of GAEZ+ 2015 (y axis) grid cell cropland physical area, irrigated physical area, rainfed physical area, irrigated rice and rainfed rice physical area against HYDE 2.31 year 2015 grid cell values. Values are in thousands of hectares. The grey line is a 1:1 line, and the blue line is the linear regression model.

Table 5 Linear regression results comparing the spatial distribution of GAEZ+ 2015 cropland physical area to HYDE 2.31 cropland by administrative unit, hydrologic basin aggregation, and individual grid cells.

Linear regression results (Table 5 and Figs. 6, 7, and 9) show that when aggregated spatially, GAEZ+ 2015 minimum cropland, irrigated, and rainfed physical area match well with HYDE 3.2 spatial distributions (r2 ≥ 0.9 for all three), with results significant at the p < 0.001 level. With the exception of irrigated land, the slope of the regression of GAEZ+ 2015 as a function of HYDE 2.3 is slightly less than 1.0, indicating a consistent small underestimation of rainfed land, and small overestimation of irrigated land compared to HYDE 3.2. However, the strength of the linear regression shows that the spatial distributions are similar. Rice irrigated and rainfed physical areas match less well, with r2 values of only 0.77 and 0.48 for irrigated and rainfed, respectively, at the administrative unit level (values are higher for the basin aggregations). Differences in rice physical area are not surprising, especially for rainfed rice, given the known challenges of mapping widely distributed and diverse rice cropping systems, e.g.21,22.

While the linear regression shows agreement in aggregate, the grid cell comparison (Fig. 9) illustrates large scatter, especially in cropland physical area. This large scatter is consistent with the differences in underlying methods used to identify crop area presence/absence in the two datasets. We expect GAEZ+ 2015 to have 0 or lower values where HYDE reports non-zero values due to the use of the GAEZ v4 2010 crop map as a basis for the GAEZ+ 2015 data product.

Grid cell crop presence/absence comparison for year 2010

Lastly, we compare the GAEZ v4 data that underlies GAEZ+ 2015 to the MapSPAM10 crop data product. While this paper presents the new GAEZ+ 2015 dataset, note that the methods used to produce this gridded data restrict a given crop’s presence (harvested area, yield, and physical area) to the grid cells in which that crop was present in the GAEZ v4 data product. Therefore, we find it informative to compare the GAEZ v4 crop-specific presence/absence to another data set that provides such information for the same year (2010); this will provide a view of how the underlying crop map data compares to other published data, and display the constraints placed on the GAEZ+ 2015 crop presence/absence in grid cells.

With a few notable exceptions, grid cell presence/absence of the world’s four main crops – maize, rice, soybean, and wheat – mostly agree between GAEZ v4 and MapSPAM (Fig. 10). For all four crops, MapSPAM has presence in more grid cells than GAEZ v4 across Sub-Saharan Africa. There are also crop-specific inconsistencies across western Canada (Maize), Europe (Rice), Australia and Russia (Soybean), and India (Wheat).

Fig. 10
figure 10

Agreement (grey), and disagreement (blue and red) between GAEZ v4 and MapSPAM10 year 2010 presence of maize, rice, soybean, and wheat. Grey indicates the crop is present in both datasets; blue shows the crop is present in GAEZ v4 but absent in MapSPAM, and red shows the crop is present in MapSPAM but absent in GAEZ v4.

Usage Notes

There are several known issues and limitations that users should be aware of. These are described in the following paragraphs.

Annual harvested area versus monthly cropland area discrepancies

As noted in the Methods section above, there is a small disparity between the annual harvested area and the monthly cropland area for some crops in the GAEZ+ 2015 product; the maximum difference in any month is 0.05% of global cropland.

Irrigated cassava files

The annual product includes a file for irrigated cassava harvested area. All grid cell areas for this crop are 0; this follows from zero irrigated cassava area in 2010 in GAEZ v4 data; therefore we did not include irrigated cassava in the monthly product.

Limitation of the country-level statistics approach

This product will become obsolete when a product becomes available that updates global data using the GAEZ or a similar methodology to account for sub-national shifts in the spatial distribution of cropland due to a range of a potential factors, such as land protection, land degradation, urban expansion, infrastructure development such as roads or reservoirs and canals, or climate change. The use of country-level statics in the methods presented here cannot capture these changes in spatial distribution; therefore, this product should be seen as a temporary tool to be used while researchers are waiting for updated products that can capture those changes.

Lack of new information on irrigation

There was insufficient post-2010 irrigation data available from FAO to update the distribution of irrigation activity among crop types and land areas, so the product uses for 2015 the same gridded rainfed-to-irrigated crop harvest area and crop production ratios as reported in GAEZ v4 for 2010.

Lack of fallow land information

Fallow land is not included as a category in this dataset. This is partly due to a compromise made in the development of the dataset: because the grid cells with presence of a given crop are restricted to the same grid cells from GAEZ v4, we allowed the total cropland extent within a grid cell to become large in order to best match national level statistics. This accommodates cases where the crop expanded to other grid cells, yet leaves little room for fallow land.

Lagging data usage

It is a challenge to keep global cropland products current – the global agricultural system is constantly in flux, yet global data products take time to develop, so their input data is necessarily from earlier years. Some challenging features to quantify (e.g., crop calendars, irrigation) are only re-constructed at the global scale very infrequently. For example, GAEZ+ 2015 monthly crop data relies on the MIRCA2000 data4 to characterize sub-annual cropping activity; MIRCA2000 was published a decade ago and represents the situation c.2000, a decade and a half prior to the 2015 target year of this new data set.

Other uncertainties

Any global agricultural data set will contain errors. These could result from imperfect statistical reporting of underlying data, simplifications needed to successfully harmonize disparate data sets (e.g., cropland data and irrigation data), and inherent uncertainties in remote sensing characterization of global-scale land use.

This product necessarily carries forward any errors in the 2010 base product of GAEZ v4.

Finally, we note that due to the limitations – especially the country-level statistics approach – model results generated using this dataset should be interpreted with care at the level of grid cells are small aggregates. While gridded input data is required for many global models, we recommend interpreting results at aggregate spatial levels such as the administrative units or watersheds presented in Fig. 8. Grid cell level data such as MIRCA2000, HYDE 2.3, and MapSPAM are routinely used in global hydrologic models attempting to simulate years not represented by those datasets; here, the GAEZ+ 2015 update to the GAEZ v4 product aims to incorporate updated crop data into an existing gridded product to reduce the bias resulting from using outdated data. This method generates its own, unique set of biases due to the limitations described above.