Global soil, landuse, evapotranspiration, historical and future weather databases for SWAT Applications

Large-scale distributed watershed models are data-intensive, and preparing them consumes most of the research resources. We prepared high-resolution global databases of soil, landuse, actual evapotranspiration (AET), and historical and future weather that could serve as standard inputs in Soil and Water Assessment Tool (SWAT) models. The data include two global soil maps and their associated databases calculated with a large number of pedotransfer functions; two landuse maps and their correspondence with SWAT’s database; historical and future daily temperature and precipitation data from five IPCC models with four scenarios; and global monthly AET data. Weather data are 0.5° global grids, text-formatted for direct use in SWAT models. The AET data are formatted for use in SWAT-CUP (SWAT Calibration and Uncertainty Procedures) for calibration of SWAT models. Using these global databases can speed up model building by 75–80%, and the databases are extremely valuable in areas with limited or no physical data. Furthermore, they can facilitate the comparison of model results in different parts of the world.

function definition. A calibration program, SWAT-CUP (SWAT Calibration and Uncertainty Procedures) 25,35, was developed for the calibration of SWAT models. SWAT-CUP provides five different calibration routines and the option of choosing between 11 different objective functions. We have previously shown that the choice of routine and objective function leads to different parameters while producing equally acceptable calibration results 36,37. It would be desirable to always obtain unconditional model parameters that are independent of calibration procedures and objective functions. For this reason, the new version of the program offers multi-objective calibration, which allows any combination of the objective functions to be chosen.
Furthermore, processing and formatting data for different applications are highly time-consuming and prone to errors, resulting in much of the research time being spent on data preparation instead of modeling application and analyses. For this reason, we have put together global soil, landuse, and historical and future weather databases for use in SWAT and other similar watershed models (Table 1), as described in the next section. The collection of these data provides a valuable resource for modeling, especially in regions of data scarcity.

Methods
Soil maps of the world. FAO/UNESCO soil map of the world. There is a general lack of reliable soil information for many parts of the world, which has significantly hampered the evaluation of soil erosion, land degradation, environmental impact studies, and sustainable land management programs. Two widely used global soil maps are the FAO/UNESCO Soil Map of the World and the Harmonized World Soil Database (HWSD_v121). Both maps provide a limited description of parameters, which are not directly useful for hydrologic models. We have, therefore, used pedotransfer functions developed from soils around the world to create the needed parameters such as hydraulic conductivity, available water capacity, and bulk density. Pedotransfer functions "translate data we have into data we need" 38. These functions estimate parameters that are difficult to measure from easily measured soil properties, such as texture, color, and structure, that are routinely recorded by soil surveyors 39.
The FAO/UNESCO soil map of the world was prepared using the topographic map series of the American Geographical Society of New York at a nominal scale of 1:5,000,000 and consists of a 30 cm topsoil layer and a 70 cm subsoil layer (Fig. 1). Associated files, which we produced, include "Lookup_Soil_FAO-UNESCO.txt", which contains the correspondence between the soil map and the soil database, and SWAT's usersoil table in the main SWAT database "SWAT2012.mdb".
Initially, in 2004, the first author created the soil database for the FAO/UNESCO 1995 soil map for quantification of water availability and quality in Africa 9,10. The soil names were created as a concatenation of the FAO mapping unit (e.g., Af14-3C) and FAO Soil-ID (e.g., 1) to give Af14-3C-1. Soil hydrologic groups were determined according to the SWAT Manual 40 based on the criteria in Supplementary Table S1. The anion exclusion fraction (ANION_EXCL) was set to 0.5 according to the SWAT Manual 40. The potential or maximum crack volume of the soil profile (SOL_CRK), expressed as a fraction of the total soil volume, was set to zero as there was no information available to evaluate this parameter. Other soil properties were initially calculated 9,10 using the program ROSETTA 41. In the current study, we have updated this database using a large number of pedotransfer functions, as described below.
Harmonized world soil database (HWSD). The Food and Agriculture Organization of the United Nations (FAO) and the International Institute for Applied Systems Analysis (IIASA) combined the available regional and national soil information with the data already contained within the 1:5,000,000 scale FAO-UNESCO map into a new comprehensive Harmonized World Soil Database (HWSD_v121). This map has a resolution of about 1 km (30 arc seconds) and consists of a 30-cm topsoil layer and a 70-cm subsoil layer (Supplementary Fig. S1).
The soil variables provided in the Harmonized World Soil Database 42 and FAO/UNESCO Soil Map of the World include soil texture (%sand, %silt, %clay), organic carbon, pH, and electrical conductivity (EC). From a hydrological point of view, however, we require parameters such as bulk density, water storage capacity, and hydraulic conductivity for different soil layers, which we estimated with pedotransfer functions. We estimated soil bulk density (Table 2), soil available water capacity (Table 3), soil hydraulic conductivity (Table 4), soil erodibility factor for the universal soil loss equation (USLE) (Table 5), and moist soil albedo (Table 6). The pedotransfer functions used are based on soils from around the world and hence provide parameters that are more universally applicable. The above variables were calculated for all soil records in the two soil maps.
Furthermore, to account for parameter uncertainty, the soils were sorted by their textural classes based on the USDA classification 42, which included Clay, Clay-loam, Heavy-clay, Loam, Loamy-sand, Sand, Sandy-clay, Sandy-clay-loam, Sandy-loam, Silt-loam, Silty-clay, and Silty-clay-loam. For each textural class, we pooled the estimates of the various pedotransfer functions from both the FAO/UNESCO and HWSD databases and calculated their cumulative probability distributions, from which we obtained parameter values at the 5%, 50%, and 95% probability levels. Values for bulk density are shown in Table 7 as an example, while other parameters are given in Supplementary Tables S2-S6. An example calculation of the 95 percent prediction uncertainty (95PPU) is shown in Supplementary Fig. S2 for the hydraulic conductivity of topsoil sandy loam. The 95PPU parameter range sets a physically meaningful limit on the parameters for different soil textural classes and is instrumental in constraining the respective parameters in model calibration. These ranges can, of course, be modified by the user as needed.
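The pooling step described above can be illustrated with a minimal sketch. It assumes that the parameter values at the 5%, 50%, and 95% levels are read from the empirical cumulative distribution by linear interpolation; the interpolation rule, the function names, and the sample bulk density values are illustrative assumptions and are not taken from the published tables.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of a sorted list."""
    if not sorted_vals:
        raise ValueError("no values to summarize")
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def percentile_levels(estimates_by_class, levels=(5, 50, 95)):
    """Return {textural class: {level: value}} from pooled estimates."""
    result = {}
    for texture, values in estimates_by_class.items():
        pooled = sorted(float(v) for v in values)
        result[texture] = {lvl: percentile(pooled, lvl) for lvl in levels}
    return result


# Hypothetical bulk density estimates (g cm^-3) pooled from several
# pedotransfer functions for one textural class.
pooled_estimates = {
    "Sandy-loam": [1.35, 1.42, 1.48, 1.51, 1.55, 1.60, 1.63],
}
bounds = percentile_levels(pooled_estimates)
print(bounds["Sandy-loam"])
```

The lower and upper values returned for each class could then serve as the calibration range for that parameter, with the 50% value as a starting estimate.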
In the pre-processing of the HWSD database, as with FAO/UNESCO, we modified the data where necessary by replacing zero values of %sand, %silt, and %clay with 1 and making sure that they sum to 100%. Also, after applying the various pedotransfer functions, we replaced negative or unreasonable values with the overall averages to avoid model-generated errors. Finally, we should point out that the soil parameters in both databases must still be calibrated for a specific location.
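A minimal sketch of these two clean-up steps follows. The text does not state how the texture fractions were made to sum to 100% or how "unreasonable" was defined, so the proportional rescaling, the validity bounds, and the function names below are assumptions made for illustration only.

```python
def fix_texture(sand, silt, clay):
    """Replace zero fractions with 1, then rescale so the sum is 100%.

    Proportional rescaling is an assumed way of enforcing the 100% sum.
    """
    parts = [1.0 if p == 0 else float(p) for p in (sand, silt, clay)]
    total = sum(parts)
    return tuple(100.0 * p / total for p in parts)


def replace_invalid(values, lower=0.0, upper=float("inf")):
    """Replace values outside [lower, upper] with the mean of valid ones.

    The bounds are hypothetical; they stand in for whatever range is
    considered physically reasonable for the parameter at hand.
    """
    valid = [v for v in values if lower <= v <= upper]
    if not valid:
        raise ValueError("no valid values to average")
    mean = sum(valid) / len(valid)
    return [v if lower <= v <= upper else mean for v in values]


sand, silt, clay = fix_texture(0, 40, 60)
print(round(sand + silt + clay, 6))          # texture now sums to 100
print(replace_invalid([1.2, -0.5, 1.6]))     # negative value replaced
```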
Landcover maps of the world. Global land cover characterization (GLCC). The GLCC from USGS is a landuse and land cover classification dataset based primarily on the unsupervised classification of the 1-km AVHRR (Advanced Very High-Resolution Radiometer) 10-day NDVI (Normalized Difference Vegetation Index) composites (Supplementary Fig. S3). The AVHRR source imagery dates from April 1992 through March 1993. The GLCC map contains 24 land cover types. We made the correspondence between the GLCC map units and SWAT's (crop) database.
Global landuse GlobCover. The GlobCover is a European Space Agency initiative to develop global composites and land cover maps using observations from the 300-m MERIS sensor onboard the ENVISAT satellite mission (Supplementary Fig. S4). The GlobCover map covers the period of December 2004 to June 2006 and is derived by automatic and regionally-tuned classification of a MERIS full resolution surface reflectance time series. The GlobCover map contains 23 land cover types. We made the correspondence between the GlobCover units and SWAT's (crop) database in Supplementary Table S8 based on the description of the land covers provided by the maps and the SWAT landuse definitions.
The databases for the above two global landuse maps are supported by the table (crop) in the SWAT2012.mdb database and the lookup tables "Lookup_Landuse_GlobCover.txt" and "Lookup_Landuse_USGS.txt". However, as with the soil parameters, landuse parameters must be calibrated for a given location.
Historical weather data. The historical (1970-2005) reanalysis temperature and precipitation data from the Climatic Research Unit, University of East Anglia (CRU TS 3.1) 43 were reformatted from NetCDF into SWAT-readable text files. The database is daily, has a resolution of 0.5°, and covers the entire globe in 67,420 files.
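As an illustration of what the reformatting step produces, the sketch below writes daily series in the plain-text station layout commonly used for ArcSWAT weather input: a first line with the start date as YYYYMMDD, then one line per day, with maximum and minimum temperature separated by a comma and -99.0 marking missing values. That the files in this database follow exactly this layout is an assumption, and the function names are hypothetical; reading the NetCDF source is left out.

```python
from datetime import date


def swat_precip_lines(start, values, missing=-99.0):
    """Format daily precipitation (mm) as lines of a SWAT text file."""
    lines = [start.strftime("%Y%m%d")]
    for v in values:
        lines.append(f"{missing if v is None else v:.1f}")
    return lines


def swat_temperature_lines(start, tmax_tmin_pairs, missing=-99.0):
    """Format daily (tmax, tmin) pairs in deg C as 'tmax,tmin' lines."""
    lines = [start.strftime("%Y%m%d")]
    for tmax, tmin in tmax_tmin_pairs:
        tmax = missing if tmax is None else tmax
        tmin = missing if tmin is None else tmin
        lines.append(f"{tmax:.1f},{tmin:.1f}")
    return lines


# Hypothetical three-day precipitation series for one grid cell.
for line in swat_precip_lines(date(1970, 1, 1), [0.0, 2.5, None]):
    print(line)
```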
Future weather data. We provide five global climate models (GCM), each with four carbon evolution scenarios supported by ISI-MIP5 (Inter-Sectoral Impact Model Intercomparison Project) 44 . These daily data cover the period of 1950-2099 and have a resolution of 0.5°. Similar to CRU, they have been reformatted from NetCDF into SWAT-formatted text files.
The five GCMs are HadGEM2-ES, IPSL-CM5A-LR, MIROC-ESM-CHEM, GFDL-ESM2M, and NorESM1-M (Table 1), each with four Representative Concentration Pathway (RCP) scenarios (RCP2.6, RCP4.5, RCP6.0, and RCP8.5) 45. The 0.5° grid WATCH Forcing Data 46 for the period of January 1, 1960, to December 31, 1999 (the reference period) was used as observation data to downscale the five GCMs 44. WATCH is a combination of the ERA-40 daily data, the 40-year reanalysis of the European Centre for Medium-Range Weather Forecasts, and the Climatic Research Unit TS2.1 dataset (CRU) 43. The WATCH Forcing Data combines the daily statistics of ERA-40 with the monthly mean characteristics of the CRU and Global Precipitation Climatology Centre (GPCC) datasets and represents a complete gridded observational dataset for bias correction of global climate data 44.
The historical and future data can be downloaded for any given geographic location from www.2w2e.com using the template illustrated in Supplementary Fig. S5. The Climate Change Toolkit (CCT) program 47 is linked to the above databases and can be used for bias correction if local data are available. CCT uses an additive correction for temperature and a multiplicative correction factor for precipitation. The program can also be used for extreme climate analysis 48.
Global actual evapotranspiration data. Actual evapotranspiration (AET) from the earth's land surface is collected by NASA using satellite data from 1982 to 2003 49,50 (Supplementary Fig. S6). The algorithm calculates canopy transpiration and soil evaporation using a modified Penman-Monteith approach with biome-specific canopy conductance determined from the normalized difference vegetation index (NDVI). The Priestley-Taylor approach was used to quantify open water evaporation. Observations from 34 flux network (FLUXNET) tower sites 51 were used to parameterize an NDVI-based canopy conductance model, and the global ET algorithm was validated using measurements from 48 additional, independent flux towers 49,50.
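Returning to the bias correction that CCT applies to the weather data, the two correction types named above (additive for temperature, multiplicative for precipitation) can be sketched as follows. How CCT derives the correction terms is not described here, so computing them from the means of a common historical period is an assumption, and all function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def additive_offset(gcm_hist_temp, obs_hist_temp):
    """Offset (deg C) to add to GCM temperature, from period means."""
    return mean(obs_hist_temp) - mean(gcm_hist_temp)


def multiplicative_factor(gcm_hist_pcp, obs_hist_pcp):
    """Factor to multiply GCM precipitation by, from period means."""
    gcm_mean = mean(gcm_hist_pcp)
    if gcm_mean == 0:
        raise ValueError("GCM mean precipitation is zero")
    return mean(obs_hist_pcp) / gcm_mean


def correct_temperature(gcm_temp, offset):
    return [t + offset for t in gcm_temp]


def correct_precipitation(gcm_pcp, factor):
    return [p * factor for p in gcm_pcp]


# Hypothetical local observations and GCM output for a shared period.
offset = additive_offset([10.0, 12.0, 14.0], [11.0, 13.0, 15.0])
factor = multiplicative_factor([2.0, 0.0, 4.0], [3.0, 0.0, 6.0])
print(correct_temperature([20.0], offset))
print(correct_precipitation([4.0, 0.0], factor))
```

A multiplicative factor keeps dry days dry and precipitation non-negative, which an additive shift would not guarantee.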
AET has been used before to calibrate SWAT when other observed data is not available 52 . It is crucial to have a measure of AET when calibrating a SWAT model with river discharge data. Using river discharge alone, we can confidently estimate runoff and infiltration. However, components of the infiltrated water cannot be estimated with any degree of confidence. These components include soil moisture (S), aquifer recharge (AR), and actual evapotranspiration (AET) (Fig. 2). Using an estimate of AET in calibration can significantly increase our confidence in the other components of infiltrating water.
To use the provided MODIS-NASA data for calibration in SWAT-CUP, users should overlay the MODIS-AET grids with the subbasin map of their ArcSWAT/QSWAT project and average the AET grid points inside each subbasin into a single value that represents the subbasin's AET.

Table 3. Available Water Capacity, AWC (= $\theta_{33} - \theta_{1500}$) (cm cm$^{-1}$), pedotransfer functions. $\theta_{33}$ = soil water content at field capacity, $\theta_{1500}$ = soil water content at wilting point, C = %clay, $\rho_b$ = bulk density (g cm$^{-3}$), T = %silt, OC = %organic carbon, S = %sand.
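The subbasin averaging of MODIS-AET grid points described above can be sketched as follows. The sketch assumes the GIS overlay has already been done, so that each AET grid point carries the identifier of the subbasin it falls in; the input layout and the function name are hypothetical.

```python
from collections import defaultdict


def subbasin_mean_aet(grid_points):
    """Average AET over the grid points that fall in each subbasin.

    grid_points: iterable of (subbasin_id, aet_value) pairs, where the
    subbasin_id comes from a prior overlay with the subbasin map.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for subbasin_id, aet in grid_points:
        sums[subbasin_id] += aet
        counts[subbasin_id] += 1
    return {sub: sums[sub] / counts[sub] for sub in sums}


# Hypothetical monthly AET values (mm) at three grid points.
points = [(1, 30.0), (1, 50.0), (2, 12.0)]
print(subbasin_mean_aet(points))
```

Repeating this for every month yields one AET time series per subbasin, which is the form needed as observed data in calibration.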

The GlobCover from the European Space Agency and associated data files (Lookup Table and SWAT2012.mdb) 56 are deposited at Pangaea and www.2w2e.com sites. There are 23 landcover types in this database.
The historical CRU and future GCM weather data 57 are deposited at Pangaea and www.2w2e.com. Finally, the Global Actual Evapotranspiration Data 58 in text format is deposited at Pangaea and www.2w2e.com.

Technical Validation
The global soil and landuse databases have been successfully used in many SWAT applications around the world 4,6,9,10,16,59-62. Validation of these maps, which are based on satellite observations, is offered by ground-truth observations conducted by the map developers and also in various literature 63,64.
There is a significant variation in the reported values of soil parameters in the literature and by various agencies. In this research, we used a large number of pedotransfer functions and soil samples from around the world to estimate the textural-based soil parameters. In Table 8 we compared our estimated values of bulk density and hydraulic conductivity with values reported by the U.S. Department of Agriculture (USDA), STRUCTx (STRUCTURAL ENGINEERING RESOURCES website, see Table 8), and other reported values. The rest of the parameters could not be found based on textural classes. As evident, there are significant variations in all estimates, especially for saturated hydraulic conductivities. For this reason, it is essential to have a range of estimates, so one can limit the values to a likely range during model calibration.

Usage Notes
There are 4,931 soil records in the FAO/UNESCO database and 16,328 records in the HWSD soil map. Both (usersoil) tables are in the SWAT2012.mdb database. The field (Name) is a concatenation of the fields SU-SYM74, SU-SYM90, MU_GLOBAL, and ISSOL as given in the original HWSD database. SU-SYM74 is the soil unit symbol according to the FAO-74 soil classification, SU-SYM90 is the soil unit symbol according to the FAO-90 soil classification, MU_GLOBAL is the Global Mapping Unit identifier, which provides the link between the GIS soil units and the attribute database, and ISSOL is a field indicating whether the soil mapping unit is a soil (1) or a non-soil (0). All maps provided have the World WGS-84 Spatial Reference without any projection. Users will have to project these maps as needed before using them in the ArcSWAT or QSWAT models.
Different soil and landuse maps are provided to emphasize the fact that often more than one database is available for building and calibrating a model, and also to encourage the users to use different databases to realize the conditionality of their calibrated models. Calibrated model parameters are always conditioned on the input data, meaning one could obtain a different set of parameters if one had used a different set of available data. This is probably the most disappointing aspect of calibration.

Fig. 2 Schematic illustration of the conceptual water balance model in SWAT.