A comprehensive data set of physical and human-dimensional attributes for China’s lake basins

Lakes provide water-related ecosystem services that support human life and production. Nevertheless, climate change and anthropogenic interventions have markedly altered lake and basin hydrology in recent decades, posing a significant threat to lacustrine ecosystems. Assessments of lacustrine ecosystems therefore require the spatial and temporal characteristics of key physical and human-dimensional attributes of lakes and lake basins. To help stakeholders obtain comprehensive data on lake basins in China, we compiled the comprehensive data set for China’s lake basins (CODCLAB), mostly from publicly available data sources, using spatial analysis and mathematical statistics methods. CODCLAB is available in three data formats: raster layers (Level 1) in “tiff” format, vector shapefiles (Level 2), and attribute tables (Level 3). It covers 767 lakes (>10 km²) in China and their basin extents, associated with 34 variables organized into five categories: Hydrology, Topography, Climate, Anthropogenic, and Soils. This unique database will provide basic data for research on the physical processes and socioeconomic activities related to these lakes and their basins, and is expected to serve a broad user community across different application areas.


Background & Summary
Lakes are increasingly influenced by anthropogenic pressures and environmental changes (e.g., changing climate) that can modify their hydrology and ecological functions 1,2. A growing body of literature shows that it is essential to know how lakes respond to natural and anthropogenic factors [3][4][5][6]. This evidence consistently indicates that intensified driving forces have been weakening the environmental, economic, and public health benefits provided by lakes 7. For instance, land use changes (e.g., reclamation projects, irrigated agriculture) in a lake basin can modify lake hydrologic regimes beyond natural ranges, while environmental changes (e.g., changing climate or soil geology) may intensify human pressure on lake hydrology 8,9. Yet the interaction between lakes and their environment is complex: lake dynamics can indicate the course of changes in their basins, and basin changes can in turn affect the properties of lakes 10. Researchers and policymakers are seeking effective solutions to alleviate the effects of climate variability and human footprints on lakes 11,12, which requires large amounts of data on these physical and anthropogenic processes 1,13. Comprehensive knowledge of the changes occurring in lakes or lacustrine ecosystems therefore often requires background information on the spatial-temporal characteristics of key basin-scale attributes of interest to users, such as topography, climate, and anthropogenic factors.
Hydrological data on lakes at regional and global scales, such as lake area, level, and volume from ground- and satellite-based observations, have been increasingly generated and applied in recent years 6,14. HydroLAKES is arguably one of the most prominent of these data sets and has been widely applied in limnologic and hydrologic studies. The HydroLAKES database distinguishes 1.42 million lakes with an area above 0.1 km² and provides their vector boundaries together with basic attributes 15. However, researchers have rarely paid attention to comprehensive hydrological, physical, and cultural characteristics at the basin scale of lakes. As a pioneer in comprehensive basin-scale data sets, the HydroATLAS database offers hydro-environmental sub-basin and river characteristics globally, described by 56 variables in six categories 16. Although HydroATLAS is valuable for basin-scale studies that need fully global data, its comprehensive attributes are not well suited to China's lake basins because they lack sufficient local validation. For the lake basins in China, there is no HydroATLAS-like comprehensive watershed data set that is well constrained by local data quality control. Instead, Chinese scholars have paid more attention to the dynamics of lakes and basins in key areas (e.g., the Tibetan Plateau and the Yangtze River basin) [17][18][19][20][21], as well as to the characteristics of various attributes based on sample points at the national scale 22,23. Despite these advances, users generally prefer to draw their data from a single, consistent set of basin-scale characteristics.
To help stakeholders obtain comprehensive data on lake basins in China, we introduce the comprehensive dataset for China's lake basins (CODCLAB). The CODCLAB dataset provides 767 Chinese lakes (≥10 km²) and their basin boundaries with geographic reference; the study lakes and their basins represent nearly 93% of the total lake area and 36% of the land area in China, respectively (Fig. 1). In addition, CODCLAB provides extensive basin-scale variables organized into five categories (Hydrology, Topography, Climate, Anthropogenic, and Soils) based on publicly available data sources (Table 1).
The compiled CODCLAB dataset is expected to help more users access the spatial-temporal characteristics of key attributes of China's lake basins and apply them in different areas. Further, CODCLAB can serve as a data reference for comprehensive evaluations of lake basins that combine the natural and human sciences. For example, the anthropogenic dataset of CODCLAB could be used to advance studies of anthropogenic effects on the lake environment. Moreover, CODCLAB can directly support studies of the response of lake hydrology to climate change and various natural factors.

Methods
Data compilation. We applied spatial analysis and mathematical statistics methods to compile the CODCLAB dataset (Fig. 2). The dataset is organized into five categories (Hydrology, Topography, Climate, Anthropogenic, and Soils) and contains 749 extended attributes (Table 2). First, the extended attributes in vector and raster files were assigned to the lake basins using the spatial join and zonal statistics methods of Geographic Information System (GIS) tools, respectively. Then, the basin-scale static and time series data were processed to generate the final dataset of tables, shapefiles, and raster files.
Fig. 1 (caption, partially recovered). The lake zones (Table S1) include the Yunnan-Guizhou Plateau (YGP), Tibetan Plateau (TP), Uygur Autonomous Region (UAR), Inner Mongolia Plateau (IMP), Northeast Plains and Mountains (NEPM), and Eastern Plains (EP). The five large lakes with sub-basins in CODCLAB are (1) Bosten Lake, (2) Chaohu Lake, (3) Poyang Lake, (4) Dongting Lake, and (5) Hulun Lake.
(1) Lake water extent delineation. In this study, we detected the maximum water area of lakes (>10 km²) in China from 1984 to 2020 based on the Global Surface Water (GSW) datasets of the Joint Research Centre (JRC) (https://global-surface-water.appspot.com/). The JRC GSW dataset is a global water body data set with high temporal and spatial resolution and a long time series, produced by an expert system combining evidential reasoning and visual interpretation 24. Owing to its high accuracy, the JRC GSW dataset has been widely used as a key hydrological data source [25][26][27].
www.nature.com/scientificdata
We used the Max Water Extent (MWE) data layer of the JRC GSW dataset (version 1.3), which reflects the maximum inundation extent of global surface water from 1984 to 2020, as the candidate lake boundaries. We then manually inspected the objects one by one and removed those corresponding to water bodies other than natural lakes, such as rivers, artificial lakes (reservoirs), paddy fields, and wetlands. In doing so, we referred to Google Earth historical images and basic geographic data, including the lake point data of the national basic geographic database from the second National Lake Survey, and other relevant literature 28,29. Finally, the maximum water extent of 767 lakes in China from 1984 to 2020 was obtained. The study lakes (Fig. 1) include 298 freshwater lakes (39%) and 469 saline lakes (61%) 28,30,31.
(2) Lake-basin delineation. Based on the HydroBASINS, HydroRIVERS, and Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) datasets 32-34, we delineated the basin boundaries of all 767 lakes (MWE > 10 km²) in China (Fig. 1). Figure 3 shows the lake basin delineation process. First, we computed flow directions from the SRTM DEM using the D8 algorithm 35 (Fig. 3(a)). Second, we determined the inlets, outlets, and river sources of all lakes by overlaying the lake water extent with the SRTM DEM and the river networks derived from HydroRIVERS (Fig. 3(a)). Third, we merged or edited the finer-level HydroBASINS geometries that contained all the rivers flowing through each lake (Fig. 3(b)). For five large lakes with broad watershed extents (Bosten Lake, Chaohu Lake, Poyang Lake, Dongting Lake, and Hulun Lake; Fig. 1), we further delineated secondary sub-basins with reference to existing studies or maps. In total, 767 lake basins and 805 sub-basins were delineated.
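The D8 step mentioned above can be illustrated with a minimal sketch. It assumes a small in-memory elevation grid rather than the SRTM DEM, uses a common power-of-two direction coding that the paper does not specify, and omits pit filling and flat-area handling. The function name `d8_flow_direction` and the sample grid are hypothetical and are not taken from the CODCLAB scripts.

```python
import math

# Neighbor offsets (row, col) paired with a commonly used D8 direction code.
# The power-of-two coding is an assumption; the paper does not state its coding.
D8_NEIGHBORS = [
    (0, 1, 1),     # east
    (1, 1, 2),     # southeast
    (1, 0, 4),     # south
    (1, -1, 8),    # southwest
    (0, -1, 16),   # west
    (-1, -1, 32),  # northwest
    (-1, 0, 64),   # north
    (-1, 1, 128),  # northeast
]


def d8_flow_direction(dem, cell_size=1.0):
    """Return a grid of D8 codes; 0 marks cells with no lower neighbor."""
    rows, cols = len(dem), len(dem[0])
    directions = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best_slope, best_code = 0.0, 0
            for dr, dc, code in D8_NEIGHBORS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue  # skip neighbors outside the grid
                # Diagonal neighbors are farther away by a factor of sqrt(2).
                distance = cell_size * math.hypot(dr, dc)
                slope = (dem[r][c] - dem[nr][nc]) / distance
                if slope > best_slope:
                    best_slope, best_code = slope, code
            directions[r][c] = best_code
    return directions


# Hypothetical 3 x 3 elevation grid that drains toward the lower-right corner.
dem = [
    [9.0, 8.0, 7.0],
    [8.0, 6.0, 5.0],
    [7.0, 5.0, 1.0],
]
flow = d8_flow_direction(dem)
```

Each cell is assigned the direction of steepest descent among its eight neighbors; production workflows would normally condition the DEM first and rely on a GIS or hydrology package.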
Processing of key attribute data by lake basin.
(1) Lake-basin attribute assignment. We assigned the CODCLAB attributes stored in vector and raster files one-to-one to the lake basins using the spatial join and zonal statistics methods of GIS tools, respectively (Fig. 2). The spatial join tool joins attributes from one feature to another based on their spatial relationship; the target features and the joined attributes from the join features are written to the output feature class. Spatial join is therefore suitable for assigning vector attributes, such as the hydrologic attributes of CODCLAB, to lake basins. The zonal statistics tool calculates statistics on the values of a raster within the zones of another dataset. For the CODCLAB attributes in raster format, we therefore used the lake-basin boundaries as zones and computed zonal statistics to assign these attributes to the lake basins.
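As an illustration of the zonal statistics idea, the sketch below computes a per-basin mean from two aligned grids. It is not the ArcGIS tool used in the study: it assumes the basin polygons have already been rasterized to integer IDs on the same grid as the attribute raster, and the function name, the nodata value, and the sample grids are hypothetical.

```python
from collections import defaultdict


def zonal_mean(zones, values, nodata=None):
    """Mean of `values` per zone ID; zone 0 and nodata cells are skipped."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for zone_row, value_row in zip(zones, values):
        for zone_id, value in zip(zone_row, value_row):
            if zone_id == 0 or value == nodata:
                continue
            sums[zone_id] += value
            counts[zone_id] += 1
    return {zone_id: sums[zone_id] / counts[zone_id] for zone_id in sums}


# Hypothetical 3 x 4 grids: basin IDs (0 = outside any basin) and elevation in m.
basin_ids = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 2, 2],
]
elevation = [
    [100.0, 110.0, 300.0, 310.0],
    [120.0, 130.0, 320.0, -9999.0],
    [500.0, 500.0, 330.0, 340.0],
]
mean_elevation = zonal_mean(basin_ids, elevation, nodata=-9999.0)
```

The result maps each basin ID to one statistic, which is the shape of record that can then be written to a basin attribute table.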
(2) Attribute processing.
Lake area extraction. JRC GSW water dynamic maps were used to extract the lake area from 1984 to 2020. The GSW water dynamic maps (1984-2020) were created through automated mining of the archive of the Landsat 7 ETM+ and Landsat 8 OLI missions at a spatial resolution of 30 m 24. First, we selected water observations from the GSW multiyear surface water occurrence dataset using pixel-value thresholds of 25% (representing seasonal water) and 75% (representing permanent water). Then, we clipped the GSW water surface dataset with the lake MWE masks to obtain the permanent (minimum) and seasonal (maximum) areas of the study lakes from 1984 to 2020.
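The thresholding and clipping described above can be sketched as follows. This is a simplified illustration, not the 'Lake_area_extraction.py' script: it assumes an occurrence grid in percent, a binary lake mask on the same grid, and square pixels of equal area (30 m by default), whereas real GSW tiles are in geographic coordinates and need a proper area calculation. All names and sample values are hypothetical.

```python
def lake_areas_km2(occurrence, lake_mask, pixel_size_m=30.0,
                   seasonal_threshold=25, permanent_threshold=75):
    """Return (permanent_km2, seasonal_km2) for one lake mask."""
    pixel_area_km2 = (pixel_size_m ** 2) / 1e6
    permanent = 0
    seasonal = 0
    for occ_row, mask_row in zip(occurrence, lake_mask):
        for occ, inside in zip(occ_row, mask_row):
            if not inside:
                continue  # outside the maximum water extent of this lake
            if occ > seasonal_threshold:
                seasonal += 1   # water in more than 25% of observations
            if occ > permanent_threshold:
                permanent += 1  # water in more than 75% of observations
    return permanent * pixel_area_km2, seasonal * pixel_area_km2


# Hypothetical occurrence values (%) and a lake mask (1 = inside the MWE).
occurrence = [
    [0, 30, 80],
    [10, 90, 100],
    [26, 76, 50],
]
lake_mask = [
    [1, 1, 1],
    [1, 1, 1],
    [0, 1, 1],
]
permanent_km2, seasonal_km2 = lake_areas_km2(occurrence, lake_mask)
```

Because every permanent pixel also exceeds the seasonal threshold, the seasonal area is always at least as large as the permanent area, matching the minimum and maximum roles described in the text.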
Supply coefficient of lakes. The supply coefficient (sc) of a lake is the ratio of lake basin area to lake area (Eq. 1):
$$ sc = \frac{A_{\mathrm{basin}}}{A_{\mathrm{lake}}} \qquad (1) $$
where $A_{\mathrm{basin}}$ is the lake basin area and $A_{\mathrm{lake}}$ is the lake area. The greater the supply coefficient, the more the lake is affected by the river water regime in the recharge area and the greater the changes in lake water level and size.
Population trend analysis. We analyzed the population trend using linear regression 36. Assuming that the population of China's lake basins varies linearly over time 37, we used the slope of the linear fit to represent the population trend (Eq. 2).
Drought index. The standardized precipitation evapotranspiration index (SPEI), computed from precipitation and temperature data, was used to add a drought attribute to the climate dataset of CODCLAB. SPEI indicates drought trends and has been widely used in drought assessment and water resource management 38. Its applicability to drought monitoring in China has been demonstrated 39. In this study, SPEI at a 3-month scale (the time span of one season) over the last 40 years (1980-2019) was computed to represent the seasonal drought severity of lake basins in CODCLAB.
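The two simpler attribute calculations in this subsection can be sketched in a few lines. The supply coefficient follows the ratio defined in the text; the trend function assumes that the "linear slope" is the ordinary least-squares slope, which is the usual reading but is not confirmed by this copy of the paper. Function names and sample values are hypothetical, and SPEI is not sketched because it requires a full distribution-fitting procedure.

```python
def supply_coefficient(basin_area_km2, lake_area_km2):
    """Ratio of lake basin area to lake area (dimensionless)."""
    if lake_area_km2 <= 0:
        raise ValueError("lake area must be positive")
    return basin_area_km2 / lake_area_km2


def linear_trend_slope(times, values):
    """Ordinary least-squares slope of `values` against `times`."""
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


# Hypothetical basin: 1200 km2 of catchment feeding a 40 km2 lake.
sc = supply_coefficient(1200.0, 40.0)

# Hypothetical population density (count/km2) sampled every five years.
years = [2000, 2005, 2010, 2015, 2020]
density = [100.0, 110.0, 120.0, 130.0, 140.0]
trend_per_year = linear_trend_slope(years, density)
```

A slope expressed per year can be multiplied by the sampling interval when a per-period rate, such as a change per five years, is wanted.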

Data Records
CODCLAB is a data set reprocessed from publicly available data sources using spatial analysis and mathematical statistics methods. All the publicly available data sources with physical and human-dimensional attributes were filtered through quality control. In screening public data, we gave priority to data sets that have ground validation and are closely relevant to natural science and humanities research. The CODCLAB dataset 40 is available in three data formats: tiff raster layers (Level 1), shapefiles (Level 2), and attribute tables (Level 3). The Level 1 data in tiff format store the original static or time series raster datasets of CODCLAB, e.g., the topography, climate, anthropogenic, and soils data sets. The Level 2 shapefiles store the lake-basin scale characteristics assigned to the basins, such as the supply coefficient of lakes, together with the lake-basin polygons. Table 2 describes the naming rules for the variables and the units of the attribute values in the separate shapefiles. All lake-basin attributes are also provided in Level 3 tables keyed by lake ID; for example, the 'Anth_CODCLAB.xlsx' file stores anthropogenic information including lake ID, population density, GDP, etc. In addition to CODCLAB_Level 1, Level 2, and Level 3, we provide the CODCLAB data for the sub-basins of the five large lakes and basic geographic information data in vector format, named CODCLAB_sub-basins 41 and CODCLAB_Level 0 41, respectively. Table S2 gives a detailed description of the CODCLAB data at each level.
Hydrology dataset. The hydrology dataset of CODCLAB consists of static vector data that reflect time-invariant characteristics of lake basins, e.g., lake area, lake volume, and residence time. Because each lake ID usually corresponds one-to-one to a static variable, we store this type of data in vector shapefiles together with the lake-basin polygons. The calculated supply coefficient of lakes is shown as a sample data record (Fig. 4). The supply coefficient shows significant spatial heterogeneity. It is relatively high for lakes in the UAR zone in arid northwest China, whereas lakes in the humid areas of southwest and southeast China, i.e., the YGP and EP lake zones, have lower supply coefficients (details in Table S1 and Fig. S1). The higher ratio of lake basin area to lake area (supply coefficient) in arid regions means that lakes there need more inflow to recharge and sustain their water balance, whereas lakes in humid areas need less. In addition, the range of the supply coefficient was calculated from the permanent and seasonal lake areas derived from the water occurrence layer of the GSW dataset (Figs. S2-S3).
Topography dataset. Topography information on China's lake basins, comprising elevation, slope, and relief amplitude, is extremely useful for hydrologic studies of lakes and lake basins. In the CODCLAB dataset, all topography datasets are available in the three-level data organization as separate files (tiff raster, shapefile, and table format). For example, the 'Elevation_IDxx.tif' file is the Level 1 raster dataset of elevation for the lake basin with IDxx, while 'Topo_CODCLAB.shp' and 'Topo_CODCLAB.xlsx' store all the topography attributes of the study lake basins in Level 2 and Level 3 format, respectively.
Climate dataset. The climate characteristics of CODCLAB show obvious spatial heterogeneity (Fig. 5). The mean annual temperature of China's lake basins ranges from −21.51 to 26.43 °C, with an average of 7.51 °C. The lowest value occurs in lake basins in the TP zone and the highest in lake basins in the UAR zone (Fig. 5a). The mean annual total precipitation ranges from 19.22 to 2303.75 mm, with an average of 679.01 mm; the minimum and maximum values occur in lake basins in the TP and in the southeastern part of the EP (Poyang Lake basin and Dongting Lake basin), respectively (Fig. 5b). The mean annual actual evapotranspiration (AEVAP) ranges from 1.8 to 1507.2 mm, with an average of 427.59 mm (Fig. 5c), and its distribution is positively correlated with precipitation and temperature (Fig. 5). The seasonal drought trend of China's lake basins is illustrated in Fig. 5d, which reflects the temporal and spatial characteristics of seasonal drought at a 3-month time scale. The lake basins in the northwestern part of the TP and the central and western parts of the IMP tend to get drier during spring, autumn, and winter. The lake basins in the EP also show a significant drying trend in spring and autumn. In contrast, the lake basins of the western TP, northern UAR, and western NEPM became significantly wetter. Interestingly, lake basins with a perennial drying tendency tend to have lower average temperatures and less precipitation and evaporation (e.g., western IMP, southwestern UAR, and northwestern TP).
Anthropogenic dataset. Human activities can substantially alter the pressures on lake hydrology and the eco-environment. We take land use/cover and population density as examples to illustrate the time series anthropogenic data records of CODCLAB, which are stored as tiff rasters (Figs. 6-7).
Land use/cover change (LUCC) in lake basins offers a watershed perspective for understanding the impacts of anthropogenic pressures on lake hydrology. Green land, such as forests and grasslands, accounts for half of the area of China's natural lake basins (Fig. 6f,g), whereas urban impervious surface and cropland, which are dominated by human activities, account for 23% (Fig. 6f,g). Over the past 35 years, forest, water bodies, and urban land have increased continuously, while the other six land types have fluctuated and declined (Fig. 6a). The intensity of human activities also shows obvious spatial heterogeneity among lake zones (Fig. 6). Urban impervious surface and cropland dominate the lake basins in the eastern plain of China (Fig. 6d,e), while water and grassland make up almost the entire lake basin area on the Tibetan Plateau (Fig. 6b,c).
The difference in population density between eastern and western lake basins is highly consistent with the difference in land use/cover (Fig. 7). The high population density in the EP lake zone has resulted in strong human intervention (i.e., urban land and cropland change) in its lake basins. Further, the lake basins with the fastest population growth are the Taihu and Dianchi lake basins, with over 10000 count/km²/5 yrs (Fig. 7). In addition, some basin areas with low population density in the six national lake zones are losing population. In summary, the population change rate in the lake basins of China is proportional to the population density.
Soils dataset. The soils dataset of CODCLAB includes three-dimensional soil texture information and soil moisture. It can be applied in many research fields, including agriculture, hydrology, climate, ecology, and environment. CODCLAB offers sand, silt, clay contents, etc., for each lake basin at depths of 0-5, 5-15, 15-30, 30-60, 60-100, and 100-200 cm. All soil data sets are available in the three-level data organization as separate files (tiff raster, shapefile, and table format). In addition, CODCLAB uses an 'attribute + depth' naming scheme to assign soil information to each lake basin.

Technical Validation
Apart from a few reanalysis data, the major CODCLAB variables are existing source data reformatted into the geospatial framework of China's lake basins. The quality of the original datasets (the source data) has already been validated by independent studies, as summarized in Table 3. In addition, we present local validation of the global datasets and cross validation of the localized datasets in China to illustrate the accuracy of CODCLAB.
Table 3 (fragment recovered from extraction). Li, et al. 52; Dong, et al. 53; Han, et al. 54. CMFD* (Climate): CMFD has close-to-zero mean bias error (MBE), lower root mean square error (RMSE), and higher R² than GLDAS for almost all variables (He, et al. 43). WorldPop.
Local validation. Most of the source data of CODCLAB are localized in China. The small number of global datasets used by CODCLAB have been widely applied in China, and local validation results supporting their accuracy have been reported (e.g., GSW, SRTM, and NTL in Table 3).
(1) Validation of lake extent derived from the GSW dataset. We randomly selected six lakes from different national lake zones as validation examples (Fig. 8). We validated the lake area extraction by comparing the GSW retrievals with areas manually digitized from high-resolution Sentinel-2 images acquired in different periods. The validation results are shown in Fig. 8; the overall R-squared (R²) and mean absolute percent error (MAPE; Eq. (3)) are 0.99 and 2.56%, respectively.

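$$ \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{S1_i - S2_i}{S1_i} \right| \times 100\% \qquad (3) $$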
where S1 is the lake area digitized from Sentinel-2 images, S2 is the lake area derived from GSW retrievals, i is the selected validation date, and n is the number of dates selected for validating one lake.
(2) Validation of elevation derived from the SRTM1 DEM dataset. Previous studies have validated the accuracy of SRTM at regional scales in China (Table 3). We further used Ice, Cloud, and land Elevation Satellite (ICESat) footprints to validate the SRTM data in the CODCLAB dataset at the lake basin scale. The spatial distribution of the ICESat footprints shows that the validation points cover all lake zones and almost all lake basins (Fig. 9a). The scatter plot of the verification points compares the SRTM1 DEM data with the ICESat elevation data (Fig. 9b).
The results show that the elevation of CODCLAB derived from the SRTM1 DEM dataset performs well, with an R² of 0.99 and an RMSE of 8.07 m. In addition, the SRTM1 DEM data follow a 1:1 relationship with the ICESat elevation data, with most verification points lying close to the unbiased (1:1) line (Fig. 9).
As shown in Fig. 10, the NTL of CODCLAB and the NTL derived from Luojia 1-01 have a consistent spatial pattern at both national and regional scales. Across the validation points within the six lake zones (Fig. 10c-h), the accuracy of the NTL of CODCLAB is acceptable and does not vary significantly among zones. YGP has the highest accuracy with an R² of 0.97, followed by NEPM and EP (R² = 0.96). The remaining lake zones all have an R² higher than 0.93, which means that the NTL intensity of CODCLAB is similar to that of Luojia 1-01 at the pixel level.
Cross validation. We selected three groups of variables with multiple data sources for cross validation of CODCLAB (Fig. 11). The R² values of the three groups are all greater than 0.8, which means that the variables within each group are strongly correlated. The temperatures of all study lake basins derived from RESDC and CMFD show the highest correlation (R² = 0.98). For precipitation, the multiple sources do not provide an identical variable, yet the precipitation of RESDC is still strongly correlated with the precipitation rate of CMFD (R² = 0.91). Similarly, population density and population count per square kilometer from different data sources are also strongly correlated (R² = 0.83). The cross validation therefore indicates that the source datasets, which were validated in independent research, are mutually consistent, supporting the consistency and reliability of CODCLAB.
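The agreement metrics reported in this section (MAPE, RMSE, and R²) can be computed as in the sketch below. It assumes that MAPE is normalized by the reference value and that R² is the squared Pearson correlation between the two series; the paper does not state which definition of R² it used. The function names and the sample areas are hypothetical and do not reproduce the paper's validation data.

```python
import math


def mape(reference, estimate):
    """Mean absolute percent error relative to the reference values."""
    n = len(reference)
    return 100.0 / n * sum(abs(r - e) / r for r, e in zip(reference, estimate))


def rmse(reference, estimate):
    """Root mean square error between paired values."""
    n = len(reference)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n)


def r_squared(x, y):
    """Squared Pearson correlation coefficient of two paired series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical lake areas (km2): manually digitized reference vs. retrieval.
digitized = [100.0, 200.0, 400.0]
retrieved = [98.0, 204.0, 400.0]

area_mape = mape(digitized, retrieved)
area_rmse = rmse(digitized, retrieved)
area_r2 = r_squared(digitized, retrieved)
```

MAPE is scale-free and suits lakes of very different sizes, whereas RMSE keeps the units of the variable, which is why it is the natural choice for elevation in meters.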

Usage Notes
CODCLAB can be used in a range of research areas related to hydro-environmental studies at the lake basin scale in China. For example, the climate parameters provided by CODCLAB can be used to analyze the effects of basin-scale climate change on the hydrological dynamics of lakes. Similarly, the anthropogenic attributes of CODCLAB can be applied to understand the impact of human activities on lake basins. Besides using variables of a single type, users can combine multiple variables in comprehensive studies. For instance, both the anthropogenic and the hydrological variables of CODCLAB are needed to understand the impact of population change on lake dynamics.
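Combining variables from different categories amounts to joining attribute tables on the lake ID. The sketch below shows one way to do this with the Python standard library, assuming the Level 3 tables have been exported to CSV. The column names (`lake_id`, `pop_slope`, `supply_coef`) and the values are hypothetical and are not taken from the CODCLAB files, whose actual field names are defined in Table 2.

```python
import csv
import io


def read_table(text, id_field="lake_id"):
    """Parse CSV text into a dict of rows keyed by the lake ID column."""
    return {row[id_field]: row for row in csv.DictReader(io.StringIO(text))}


def join_by_lake_id(left, right):
    """Inner join of two keyed tables; only lakes present in both are kept."""
    joined = {}
    for lake_id in left.keys() & right.keys():
        merged = dict(left[lake_id])
        merged.update(right[lake_id])
        joined[lake_id] = merged
    return joined


# Hypothetical extracts of an anthropogenic table and a hydrology table.
anthropogenic_csv = "lake_id,pop_slope\n1,0.8\n2,-0.1\n3,2.5\n"
hydrology_csv = "lake_id,supply_coef\n1,30.0\n2,12.5\n4,8.0\n"

anthropogenic = read_table(anthropogenic_csv)
hydrology = read_table(hydrology_csv)
combined = join_by_lake_id(anthropogenic, hydrology)
```

An inner join drops lakes that are missing from either table, so it is worth checking the number of joined records against the 767 study lakes before any analysis.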
The data files are formatted as tiff raster layers (CODCLAB_Level 1), shapefiles (CODCLAB_Level 2), and attribute tables (CODCLAB_Level 3) following the three-level organization. Users still need to decide which level of data and which type of variables to employ. Apart from the uniform-resolution dataset (CODCLAB_1km) 41, users also need to consider the differences in temporal and spatial resolution among CODCLAB variables. Looking ahead, CODCLAB can increase research efficiency by allowing users to quickly obtain multi-source data with a common georeference for location-specific studies. We expect that future data users will be able to describe lake or basin changes with co-located hydrometeorological and anthropogenic data from the one-stop resource provided by CODCLAB.

Code availability
The two core tools applied in the study were 'Spatial Join' and 'Zonal Statistics', provided by ESRI's ArcGIS 10.7 software package. In addition, the customized batch steps for reprocessing data, including lake area extraction and raster attribute extraction, were programmed as Python 2.7 scripts, which are provided in our data set as 'Lake_area_extraction.py' and 'Raster_attribute_extraction.py', respectively 41.