## Background & Summary

Lakes are increasingly influenced by anthropogenic pressures and environmental changes (e.g., a changing climate) that can modify their hydrology and ecological functions1,2. A growing body of literature shows that it is essential to understand how lakes respond to natural and anthropogenic factors3,4,5,6. This evidence consistently indicates that intensifying drivers have been weakening the environmental, economic, and public health benefits provided by lakes7. For instance, land use changes in a lake basin (e.g., reclamation projects, irrigated agriculture) can push lake hydrologic regimes beyond their natural ranges, while environmental changes (e.g., changing climate or soil geology) may amplify human pressure on lake hydrology8,9. Yet the interaction between lakes and their environment is complex: lake dynamics reflect the course of changes in their basins, and basin changes in turn affect the properties of lakes10. Researchers and policymakers are seeking effective solutions to alleviate the effects of climate variability and human footprints on lakes11,12, which requires large amounts of data on the physical and anthropogenic processes involved1,13. A comprehensive understanding of changes in lakes or lacustrine ecosystems therefore often requires background information on the spatio-temporal characteristics of key basin-scale attributes, such as topography, climate, and anthropogenic factors.

Hydrological data on lakes at regional and global scales, such as lake area, level, and volume from ground- and satellite-based observations, have been increasingly generated and applied in recent years6,14. HydroLAKES is arguably one of the most prominent of these data sets and has been widely applied in limnological and hydrological studies. The HydroLAKES database identifies 1.42 million lakes with an area above 0.1 km² and provides their vector boundaries together with basic attributes15. However, comprehensive hydrological, physical, and cultural characteristics of lakes at the basin scale have rarely received attention. As a pioneer among comprehensive basin-scale data sets, the HydroATLAS database offers hydro-environmental sub-basin and river characteristics globally, with 56 variables in six categories16. Although HydroATLAS is valuable for basin-scale studies because of its global coverage, its attributes are not readily applicable to China’s lake basins because they lack sufficient local validation. For the lake basins in China, no HydroATLAS-like comprehensive watershed data set constrained by local quality control exists. Instead, Chinese scholars have focused on the dynamics of lakes and basins in key areas (e.g., the Tibetan Plateau and the Yangtze River basin)17,18,19,20,21, and on the characteristics of various attributes based on sample points at the national scale22,23. Despite these advances, users would benefit from being able to draw their data from a single, consistent set of basin-scale characteristics.

To help stakeholders obtain comprehensive data on lake basins in China, we introduce the comprehensive dataset for China’s lake basins (CODCLAB). CODCLAB provides 767 Chinese lakes (≥10 km²) and their georeferenced basin boundaries; the study lakes and their basins represent nearly 93% of the total lake area and 36% of the land area of China, respectively (Fig. 1). CODCLAB also provides extensive basin-scale variables that are organized into five categories (Hydrology, Topography, Climate, Anthropogenic, and Soils) and derived from publicly available data sources (Table 1).

We expect CODCLAB to make the spatio-temporal characteristics of key attributes of China’s lake basins accessible to more users and applicable across different fields. CODCLAB can also serve as a data reference for comprehensive evaluations of lake basins that combine the natural and human sciences. For example, the anthropogenic dataset of CODCLAB could be used to advance studies of anthropogenic effects on the lake environment, and the dataset can directly support studies of the response of lake hydrology to climate change and other natural factors.

## Methods

### Data compilation

We applied spatial analysis and statistical methods to compile the CODCLAB dataset (Fig. 2). The dataset is organized into five categories (Hydrology, Topography, Climate, Anthropogenic, and Soils) and contains 749 extended attributes (Table 2). First, the extended attributes held in vector and raster files were assigned to the lake basins with Geographic Information System (GIS) tools, using spatial join for vector data and zonal statistics for raster data. Then, the static and time series data at the lake-basin scale were processed to generate a final dataset comprising tables, shapefiles, and raster files.

### Lake and lake-basin delineation

(1) Lake water extent delineation

In this study, we detected the maximum water area of lakes (>10 km²) in China from 1984 to 2020 based on the Global Surface Water (GSW) datasets of the Joint Research Centre (JRC) (https://global-surface-water.appspot.com/). The JRC GSW dataset is a global waterbody data set with high temporal and spatial resolution and a long time series, produced by an expert system that combines evidential reasoning and visual interpretation24. Owing to its high accuracy, the JRC GSW dataset has been widely used as a key data source in hydrological science25,26,27.

We used the Max Water Extent (MWE) layer of the JRC GSW dataset (version 1.3), which reflects the maximum inundation extent of global surface water from 1984 to 2020, as the candidate lake boundaries. We then removed, one by one through manual interpretation, the objects that are not natural lakes, such as rivers, artificial lakes (reservoirs), paddy fields, and wetlands. In doing so, we referred to Google Earth historical images and basic geographic data, including the national basic geographic database of lake point data from the second National Lake Survey, as well as other relevant literature28,29. Finally, the maximum water extent of 767 lakes in China from 1984 to 2020 was obtained. The study lakes (Fig. 1) include 298 freshwater lakes (39%) and 469 saline lakes (61%)28,30,31.

(2) Lake-basin delineation

Based on the HydroBASINS, HydroRIVERS, and Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) datasets32,33,34, we delineated the basin boundaries of all 767 lakes (MWE > 10 km²) in China (Fig. 1). Figure 3 shows the lake basin delineation process. First, we computed flow directions from the SRTM DEM with the D8 algorithm35 (Fig. 3(a)). Second, we determined the inlets, outlets, and river sources of all lakes by overlaying the lake water extent on the SRTM DEM and the river networks derived from HydroRIVERS (Fig. 3(a)). Third, we merged or edited the finer-level HydroBASINS polygons that contain all the rivers flowing through each lake (Fig. 3(b)). For five large lakes with broad watersheds, namely Bosten Lake, Chaohu Lake, Poyang Lake, Dongting Lake, and Hulun Lake (Fig. 1), we further delineated secondary sub-basins with reference to existing studies or maps. In total, 767 lake basins and 805 sub-basins were delineated.
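
The flow-direction step can be illustrated with a minimal sketch of the D8 rule, in which each cell drains to the neighbour with the steepest downward slope. The direction codes (1 = east, doubling clockwise to 128 = north-east), the function name `d8_flow_direction`, and the toy elevation grid are assumptions for illustration; the source does not state which encoding or software was used.

```python
import numpy as np

# Assumed direction encoding (row offset, column offset, code); not stated in the source.
D8_NEIGHBOURS = [
    (0, 1, 1),     # east
    (1, 1, 2),     # south-east
    (1, 0, 4),     # south
    (1, -1, 8),    # south-west
    (0, -1, 16),   # west
    (-1, -1, 32),  # north-west
    (-1, 0, 64),   # north
    (-1, 1, 128),  # north-east
]


def d8_flow_direction(dem, cell_size=1.0):
    """Return a D8 code per cell; 0 marks cells with no lower neighbour."""
    dem = np.asarray(dem, dtype=float)
    rows, cols = dem.shape
    direction = np.zeros((rows, cols), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            best_slope = 0.0
            for dr, dc, code in D8_NEIGHBOURS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue  # neighbours outside the grid are ignored
                # Diagonal neighbours are sqrt(2) cell sizes away.
                distance = cell_size * (np.sqrt(2.0) if dr and dc else 1.0)
                slope = (dem[r, c] - dem[nr, nc]) / distance
                if slope > best_slope:
                    best_slope = slope
                    direction[r, c] = code
    return direction


# Hypothetical 3 x 3 elevation grid that drains towards the lower-right corner.
dem = [
    [9.0, 8.0, 7.0],
    [8.0, 6.0, 5.0],
    [7.0, 5.0, 1.0],
]
flow = d8_flow_direction(dem)
```

In this toy grid the lower-right cell has no lower neighbour and receives code 0. Production tools additionally fill pits and resolve flat areas before assigning directions, which this sketch omits.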

### Processing of key attributes data by lake basin

(1) Lake-basin attributes assignment

We assigned the CODCLAB attributes held in vector and raster files one-to-one to the lake basins using the spatial join and zonal statistics methods of GIS tools, respectively (Fig. 2). The spatial join tool joins attributes from one feature to another based on their spatial relationship; the target features and the joined attributes from the join features are written to the output feature class. Spatial join is therefore suitable for assigning vector attributes, such as the hydrologic attributes of CODCLAB, to lake basins. The zonal statistics tool calculates statistics on the values of a raster within the zones of another dataset. For the CODCLAB attributes in raster format, we therefore used the lake-basin boundaries as zones and computed zonal statistics to assign these attributes to the lake basins.
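
The raster side of this assignment can be sketched as a zonal mean over a basin-ID grid. This is only an illustration of the idea, not the GIS tool used for CODCLAB: the function name `zonal_mean`, the convention that zone 0 means "outside every basin", the nodata value, and the toy arrays are all assumptions.

```python
import numpy as np


def zonal_mean(values, zones, nodata=None):
    """Return {zone_id: mean of the raster values inside that zone}."""
    values = np.asarray(values, dtype=float)
    zones = np.asarray(zones)
    valid = zones > 0  # assumption: zone 0 means "outside every basin"
    if nodata is not None:
        valid &= values != nodata  # skip nodata cells
    ids = zones[valid]
    vals = values[valid]
    sums = np.bincount(ids, weights=vals)  # per-zone sum of values
    counts = np.bincount(ids)              # per-zone number of valid cells
    return {int(z): float(sums[z] / counts[z]) for z in np.unique(ids)}


# Hypothetical elevation raster (m) and rasterized basin IDs on the same grid.
elevation = np.array([
    [100.0, 110.0, 300.0],
    [120.0, 130.0, 320.0],
    [-9999.0, 140.0, 340.0],
])
basin_ids = np.array([
    [1, 1, 2],
    [1, 1, 2],
    [1, 1, 2],
])
means = zonal_mean(elevation, basin_ids, nodata=-9999.0)
```

Here basin 1 averages its five valid cells and basin 2 its three cells. The sketch assumes the basin polygons have already been rasterized onto the grid of the attribute raster.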

(2) Attributes processing

#### Lake area extraction

JRC GSW water dynamic maps were used to extract lake area from 1984 to 2020. These maps (1984–2020) were created by automated mining of the archive of the Landsat 7 ETM+ and Landsat 8 OLI missions at a spatial resolution of 30 m24. First, we selected water observations from the GSW multiyear surface water occurrence dataset using pixel-value thresholds of 25% (seasonal water) and 75% (permanent water). Then, we clipped the GSW water surface dataset with the lake MWE masks to obtain the permanent (minimum) and seasonal (maximum) areas of the study lakes from 1984 to 2020.
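
As a rough sketch of the thresholding step, the snippet below counts occurrence pixels above each threshold inside a lake mask and converts the counts to area. The function name, the toy arrays, and the constant pixel area of 30 m × 30 m are assumptions; real GSW tiles are in geographic coordinates, so the area per pixel varies with latitude.

```python
import numpy as np

# Assumption: every pixel covers 30 m x 30 m; expressed in km².
PIXEL_AREA_KM2 = (30.0 * 30.0) / 1e6


def lake_areas_from_occurrence(occurrence, lake_mask,
                               seasonal_threshold=25, permanent_threshold=75):
    """Return (permanent_area_km2, seasonal_area_km2) inside the lake mask."""
    occurrence = np.asarray(occurrence)
    inside = np.asarray(lake_mask, dtype=bool)
    seasonal_pixels = np.count_nonzero(inside & (occurrence > seasonal_threshold))
    permanent_pixels = np.count_nonzero(inside & (occurrence > permanent_threshold))
    return permanent_pixels * PIXEL_AREA_KM2, seasonal_pixels * PIXEL_AREA_KM2


# Hypothetical water occurrence values (%) and a maximum-water-extent mask.
occurrence = np.array([
    [0, 30, 80],
    [10, 60, 95],
    [26, 76, 100],
])
mwe_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
])
permanent_km2, seasonal_km2 = lake_areas_from_occurrence(occurrence, mwe_mask)
```

Because every permanent pixel also exceeds the seasonal threshold, the permanent area can never be larger than the seasonal area, which matches their roles as minimum and maximum lake extent.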

#### Supply coefficient of lakes

The supply coefficient (sc) of a lake is the ratio of lake basin area to lake area (Eq. 1). The greater the supply coefficient, the more the lake is affected by the river water regime in its recharge area, and the greater the changes in lake water level and size.

$$sc=\frac{{Area}_{basin}}{{Area}_{lake}} \tag{1}$$
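
A worked instance of Eq. 1 is sketched below. The basin and lake areas are hypothetical numbers, not values from CODCLAB. Evaluating the ratio with both the permanent and the seasonal lake area gives a range for the coefficient.

```python
def supply_coefficient(basin_area_km2, lake_area_km2):
    """Eq. 1: ratio of lake basin area to lake area (dimensionless)."""
    if lake_area_km2 <= 0:
        raise ValueError("lake area must be positive")
    return basin_area_km2 / lake_area_km2


# Hypothetical basin of 5000 km² with permanent and seasonal lake areas.
sc_from_permanent_area = supply_coefficient(5000.0, 40.0)  # smaller lake area
sc_from_seasonal_area = supply_coefficient(5000.0, 50.0)   # larger lake area
```

The smaller (permanent) lake area yields the upper end of the range and the larger (seasonal) area the lower end.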

#### Population trend analysis

Further, we analyzed the population trend using the linear regression method36. We assumed that the population of each lake basin varies linearly over time37 and used the slope of the linear fit to represent the population trend:

$$k=\frac{\sum_{i=1}^{n}\left(t_{i}-\bar{t}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(t_{i}-\bar{t}\right)^{2}} \tag{2}$$

where $k$ is the linear slope of the population trend of a lake basin; $k>0$ indicates that the population is increasing, and $k<0$ that it is decreasing. $t_i$ is the $i$-th year with population data, $y_i$ is the population in that year, $n$ is the number of years, and $\bar{t}$ and $\bar{y}$ are the mean year and mean population, respectively.
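
Eq. 2 is the ordinary least-squares slope, and a direct transcription is sketched below. The function name and the five-yearly population densities are hypothetical and chosen so that the result is easy to check by hand.

```python
def linear_trend_slope(years, population):
    """Eq. 2: least-squares slope k of population against year."""
    n = len(years)
    if n != len(population) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    t_mean = sum(years) / n
    y_mean = sum(population) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in zip(years, population))
    denominator = sum((t - t_mean) ** 2 for t in years)
    return numerator / denominator


# Hypothetical population density (count/km²) at five-year intervals.
years = [2000, 2005, 2010, 2015, 2020]
density = [100.0, 110.0, 120.0, 130.0, 140.0]
k = linear_trend_slope(years, density)  # increase per year
```

With these inputs the density rises by 10 count/km² every five years, so the slope is 2 count/km² per year.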

#### Drought index

The standardized precipitation evapotranspiration index (SPEI), computed from precipitation and temperature data, was used to extend the drought attribute of the climate dataset in CODCLAB. SPEI indicates drought trends and has been widely used in drought assessment and water resource management38. Its applicability to drought monitoring in China has been demonstrated39. In this study, SPEI at a 3-month scale (the span of one season) was computed for the last 40 years (1980–2019) to represent the seasonal drought severity of lake basins in CODCLAB.
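
The 3-month accumulation behind the index can be sketched as follows. This is a deliberately simplified stand-in, not the SPEI procedure itself: it takes potential evapotranspiration (PET) as a given input, although the source derives the index from precipitation and temperature, and it replaces the log-logistic fit of the published SPEI with a plain z-score per calendar month. The function names and the synthetic data are assumptions.

```python
import numpy as np


def three_month_water_balance(precip_mm, pet_mm):
    """Running 3-month sum of D = P - PET; the first two months stay NaN."""
    d = np.asarray(precip_mm, dtype=float) - np.asarray(pet_mm, dtype=float)
    total = np.full(d.shape, np.nan)
    total[2:] = d[2:] + d[1:-1] + d[:-2]
    return total


def standardize_by_calendar_month(series):
    """Z-score each calendar month separately.

    Simplification: the published SPEI fits a log-logistic distribution here.
    The series is assumed to start in January.
    """
    series = np.asarray(series, dtype=float)
    index = np.full(series.shape, np.nan)
    for month in range(12):
        values = series[month::12]
        valid = values[~np.isnan(values)]
        if valid.size < 2 or valid.std(ddof=1) == 0:
            continue  # not enough information to standardize this month
        index[month::12] = (values - valid.mean()) / valid.std(ddof=1)
    return index


# Synthetic monthly precipitation and PET (mm) for 40 years; not CODCLAB data.
rng = np.random.default_rng(0)
precip = rng.gamma(shape=2.0, scale=30.0, size=480)
pet = rng.uniform(20.0, 120.0, size=480)
drought_index = standardize_by_calendar_month(three_month_water_balance(precip, pet))
```

Negative values mark 3-month periods that are drier than usual for that time of year. The z-score keeps the sketch short but treats the water balance as symmetric, which the distribution fit in the full method avoids.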

## Data Records

The CODCLAB dataset is reprocessed from publicly available data sources using spatial analysis and statistical methods. All source data sets, which carry physical and human-dimension attributes, were screened through quality control; the screening favoured data sets that have ground validation and are closely relevant to research in the natural sciences and humanities. The CODCLAB dataset40 is available in three data formats: tiff raster layers (Level 1), shapefiles (Level 2), and attribute tables (Level 3). Level 1 stores the original static or time series raster data of CODCLAB in tiff format, e.g., the topography, climate, anthropogenic, and soils data sets. Level 2 stores the characteristics assigned to the basins, such as the supply coefficient of lakes, in shapefiles associated with the lake-basin polygons. Table 2 describes the naming rules for variables and the units of attribute values in the separate shapefiles. Level 3 provides all lake-basin attributes in tables keyed by lake ID; for example, the ‘Anth_CODCLAB.xlsx’ file stores anthropogenic information including lake ID, population density, and GDP. In addition to CODCLAB_Level 1, Level 2, and Level 3, we also provide the CODCLAB data for the sub-basins of five large lakes and basic geographic information data in vector format, named CODCLAB_sub-basins41 and CODCLAB_Level 0 (ref. 41), respectively. A detailed description of the data at each level is given in Table S2.

### Hydrology dataset

The hydrology dataset of CODCLAB consists of static vector data that describe time-invariant characteristics of lake basins, e.g., lake area, lake volume, and residence time. Because each lake ID corresponds one-to-one to a static variable, we store this type of data in vector shapefiles together with the lake-basin polygons. The calculated supply coefficient of lakes is shown as a sample data record (Fig. 4). The supply coefficient shows marked spatial heterogeneity: it is relatively high for lakes in the UAR zone in arid northwest China and lower for lakes in the humid southwest and southeast, i.e., in the YGP and EP lake zones (details in Table S1 and Fig. S1). The higher ratio of basin area to lake area in arid regions means that lakes there need more inflow to recharge and sustain their water balance, whereas lakes in humid areas need less. In addition, the range of the supply coefficient was calculated from the permanent and seasonal lake areas derived from the water occurrence layer of the GSW dataset (Figs. S2, S3).

### Topography dataset

Topography information on Chinese lake basins, comprising elevation, slope, and relief amplitude, is extremely useful for hydrologic studies of lakes and lake basins. In the CODCLAB dataset, all topography data are available in a three-level data organization with separate files (tiff raster, shapefile, and table format). For example, the ‘Elevation_IDxx.tif’ file is the Level 1 raster dataset of elevation for the lake basin with IDxx, while ‘Topo_CODCLAB.shp’ and ‘Topo_CODCLAB.xlsx’ store all the topography attributes of the study lake basins in Level 2 and Level 3 format, respectively.

### Climate dataset

The climate characteristics of CODCLAB show obvious spatial heterogeneity (Fig. 5). The mean annual temperature of China’s lake basins ranges from −21.51 to 26.43 °C, with an average of 7.51 °C; the lowest value occurs in lake basins of the TP zone and the highest in lake basins of the UAR zone (Fig. 5a). The mean annual total precipitation ranges from 19.22 to 2303.75 mm, with an average of 679.01 mm; the minimum occurs in lake basins of the TP and the maximum in the southeastern lake basins of the EP (the Poyang Lake and Dongting Lake basins) (Fig. 5b). The mean annual actual evapotranspiration (AEVAP) ranges from 1.8 to 1507.2 mm, with an average of 427.59 mm (Fig. 5c), and its distribution is positively correlated with precipitation and temperature (Fig. 5). The seasonal drought trend of China’s lake basins is illustrated in Fig. 5d, which reflects the temporal and spatial characteristics of seasonal drought on a 3-month time scale. The lake basins in the northwestern TP and the central and western IMP tend to become drier in spring, autumn, and winter. The lake basins in the EP also show a significant drying trend in spring and autumn. In contrast, the lake basins of the western TP, northern UAR, and western NEMP have become significantly wetter. Interestingly, lake basins with a perennial drying tendency tend to have lower average temperatures and less precipitation and evaporation (e.g., western IMP, southwestern UAR, and northwestern TP).

### Anthropogenic dataset

Human activity can substantially alter lake hydrology and the eco-environment. We take land use/cover and population density as examples of the time series anthropogenic data records of CODCLAB, which are stored as tiff rasters (Figs. 6, 7). Land use/cover change (LUCC) in lake basins offers a watershed perspective on the impacts of anthropogenic pressures on lake hydrology. Green land, such as forests and grasslands, accounts for half of China’s natural lacustrine basins (Fig. 6f,g), whereas urban impervious surface and cropland, which are dominated by human activities, account for 23% (Fig. 6f,g). Over the past 35 years, forest, water bodies, and urban land have increased continuously, while the other six land types have fluctuated and declined (Fig. 6a). The intensity of human activities also shows obvious spatial heterogeneity among lake zones (Fig. 6): urban impervious surface and cropland dominate the lake basins in the eastern plain of China (Fig. 6d,e), while water and grassland make up almost the whole lake basin area on the Tibetan Plateau (Fig. 6b,c).

The spatial distribution of population density between eastern and western lake basins is highly consistent with the difference in land use/cover (Fig. 7). The high population density in the EP lake zone has resulted in strong human intervention (i.e., urban land and cropland change) in its lake basins. The lake basins with the fastest population growth are the Taihu and Dianchi lake basins, at over 10000 count/km²/5 yrs (Fig. 7). In addition, some basins with low population density in the six national lake zones are losing population. In summary, the population change rate in the lake basins of China is proportional to the population density.

### Soils dataset

The soils dataset of CODCLAB includes three-dimensional soil texture information and soil moisture. It can be applied in many research fields, including agriculture, hydrology, climate, ecology, and environment. CODCLAB offers sand, silt, and clay contents, among other variables, for each lake basin at depths of 0–5, 5–15, 15–30, 30–60, 60–100, and 100–200 cm. All soil data sets are available in a three-level data organization with separate files (tiff raster, shapefile, and table format). In addition, CODCLAB uses an ‘attributes + depth’ naming pattern to assign soil information to each lake basin.

## Technical Validation

Apart from a few reanalysis data, most CODCLAB variables reformat existing source data into the geospatial framework of China’s lake basins. The quality of the original datasets (the source data) has already been validated by independent studies, as listed in Table 3. In addition, we present local validation of the global datasets and cross validation of the localized datasets in China to illustrate the accuracy of CODCLAB.

### Local validation

Most of the source data of CODCLAB are localized in China. The small number of global datasets used by CODCLAB have been widely applied in China, and local validation results support their use in CODCLAB (e.g., GSW, SRTM, and NTL; Table 3).

(1) Validation of lake extent derived from the GSW dataset

We randomly selected six lakes from different national lake zones as validation examples (Fig. 8). For each lake, we compared the lake areas retrieved from GSW with areas manually digitized from high-resolution Sentinel-2 images acquired in different periods. The validation results are shown in Fig. 8; the overall R-squared (R²) and mean absolute percent error (MAPE; Eq. (3)) are 0.99 and 2.56%, respectively.

$$MAPE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{{S1}_{i}-{S2}_{i}}{{S2}_{i}}\right|\times 100\% \tag{3}$$

where S1 is the lake area digitized from Sentinel-2 images and S2 is the lake area derived from GSW retrievals; i indexes the selected validation dates, and n is the number of dates selected for a given lake.
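
Eq. 3 can be transcribed directly, as in the sketch below. The function name and the three pairs of lake areas are hypothetical. Following the equation, the error is normalized by the GSW-derived area S2.

```python
def mape(s1_sentinel, s2_gsw):
    """Eq. 3: mean absolute percent error, normalized by the GSW area S2."""
    n = len(s1_sentinel)
    if n == 0 or n != len(s2_gsw):
        raise ValueError("need two equally long, non-empty series")
    return 100.0 / n * sum(abs((s1 - s2) / s2) for s1, s2 in zip(s1_sentinel, s2_gsw))


# Hypothetical lake areas (km²) on three validation dates.
sentinel_areas = [102.0, 98.0, 100.0]
gsw_areas = [100.0, 100.0, 100.0]
error_percent = mape(sentinel_areas, gsw_areas)
```

The three dates contribute errors of 2%, 2%, and 0%, so the mean is about 1.33%.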

(2) Validation of elevation derived from the SRTM1 DEM dataset

Previous studies have validated the accuracy of SRTM at regional scales in China (Table 3). We further used Ice, Cloud, and land Elevation Satellite (ICESat) footprints to validate the SRTM data in CODCLAB at the lake-basin scale. The ICESat footprints used as validation points cover all lake zones and almost all lake basins (Fig. 9a). The scatter plot of the validation points shows that the SRTM1 DEM and ICESat elevations are consistently distributed (Fig. 9b). The elevation of CODCLAB derived from the SRTM1 DEM dataset performs well, with an R² of 0.99 and an RMSE of 8.07 m, and most validation points lie close to the unbiased 1:1 line (Fig. 9).
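
The two agreement measures reported here can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between the paired elevations, which the source does not specify, and the four elevation pairs are hypothetical.

```python
import math


def rmse(estimated, reference):
    """Root-mean-square error between paired values."""
    n = len(estimated)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimated, reference)) / n)


def r_squared(x, y):
    """Squared Pearson correlation (an assumed definition of R²)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical paired elevations (m) at four validation points.
srtm_elevation = [4502.0, 3010.0, 1495.0, 20.0]
icesat_elevation = [4500.0, 3000.0, 1500.0, 25.0]
elevation_rmse = rmse(srtm_elevation, icesat_elevation)
elevation_r2 = r_squared(srtm_elevation, icesat_elevation)
```

A high R² together with a low RMSE indicates that the points follow the 1:1 line closely. R² on its own would also be high for a consistently biased DEM, which is why both measures are reported.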

(3) Validation of nighttime lights derived from the global NTL dataset

In this study, Luojia 1-01 nighttime light imagery developed by Wuhan University (http://59.175.109.173:8888/) was used to verify the accuracy of the global NTL dataset in China. Luojia 1-01 has a finer spatial resolution than the NTL dataset of CODCLAB, which is composited from DMSP-OLS and NPP-VIIRS data. Launched in 2018, Luojia 1-01 is also localized in China, which makes it well suited for validating global NTL data. As shown in Fig. 10, the NTL of CODCLAB and the NTL derived from Luojia 1-01 have a consistent spatial pattern at both national and regional scales. For the validation points within the six lake zones (Fig. 10c–h), the accuracy of the NTL of CODCLAB is acceptable and does not vary significantly among zones. YGP has the highest accuracy, with an R² of 0.97, followed by NEMP and EP (R² = 0.96). The remaining lake zones all have an R² higher than 0.93, which means that the NTL intensity of CODCLAB is similar to that of Luojia 1-01 at the pixel level.

### Cross validation

We selected three groups of variables available from multiple data sources for cross validation of CODCLAB (Fig. 11). The R² values of all three groups are greater than 0.8, indicating a strong correlation within each group. The temperature of all study lake basins derived from RESDC and CMFD shows the highest agreement (R² = 0.98). For precipitation, the same variable is not available from multiple sources, yet the precipitation of RESDC still correlates strongly with the precipitation rate of CMFD (R² = 0.91). Similarly, population density and population count per square kilometer from different data sources are strongly correlated (R² = 0.83). This cross validation shows that the source datasets, each validated in independent research, are mutually consistent, which supports the consistency and reliability of CODCLAB.

## Usage Notes

CODCLAB can be used in a range of research areas relating to hydro-environmental studies at the lake-basin scale in China. For example, the climate parameters provided by CODCLAB can be used to analyze the effects of basin-scale climate change on the hydrological dynamics of lakes, and the anthropogenic attributes can be applied to understand the impact of human activities on lake basins. Besides using variables of different types individually, users can combine multiple variables in comprehensive studies. For instance, both the anthropogenic and the hydrological variables of CODCLAB are needed to understand the impact of population change on lake dynamics.

The data files are formatted as tiff raster layers (CODCLAB_Level 1), shapefiles (CODCLAB_Level 2), and attribute tables (CODCLAB_Level 3) according to the three-level organization. Users still need to decide which level of data and which type of variables to employ. Apart from the uniform-resolution dataset (CODCLAB_1km)41, CODCLAB variables differ in temporal and spatial resolution, which users also need to take into account.

Looking ahead, CODCLAB can increase research efficiency by allowing users to quickly obtain multi-source data with a common georeference for location-specific studies. Future data users will be able to describe lake or basin changes with co-located hydrometeorological and anthropogenic data from the one-stop resource that CODCLAB provides.