A lake data set for the Tibetan Plateau from the 1960s, 2005, and 2014

Long-term datasets of number and size of lakes over the Tibetan Plateau (TP) are among the most critical components for better understanding the interactions among the cryosphere, hydrosphere, and atmosphere at regional and global scales. Due to the harsh environment and the scarcity of data over the TP, data accumulation and sharing become more valuable for scientists worldwide to make new discoveries in this region. This paper, for the first time, presents a comprehensive and freely available data set of lakes’ status (name, location, shape, area, perimeter, etc.) over the TP region dating back to the 1960s, including three time series, i.e., the 1960s, 2005, and 2014, derived from ground survey (the 1960s) or high-spatial-resolution satellite images from the China-Brazil Earth Resources Satellite (CBERS) (2005) and China’s newly launched GaoFen-1 (GF-1, which means high-resolution images in Chinese) satellite (2014). The data set could provide scientists with useful information for revealing environmental changes and mechanisms over the TP region.

Background & Summary
The Tibetan Plateau (TP), known as the core region of the Earth's third pole 1,2 , has attracted great attention from the hydrology, weather, and climate communities. The state of environmental elements of the TP region, such as glaciers 3 , permafrost 4 , snow 5 , rivers 6 , wetlands 7 , and lakes 8 , is critical for developing a better understanding of the interactions among the cryosphere, hydrosphere, and atmosphere. Lakes, as essential components of the hydrosphere over the TP, play an important role in regional and global biogeochemical processes 9 . Over the last half-century, great efforts have been made to develop a comprehensive understanding of the status and changes of lakes across the TP 10-18 . In early years, the state of lakes was mainly recorded by surveying and mapping or by field investigations. Since the 1980s, remote sensing has become a powerful tool for lake monitoring. Algorithms for automatic extraction and mapping of lake water bodies from medium- and high-resolution satellite images have been widely applied 19,20 . However, the accuracy of water boundary extraction using such automatic algorithms still needs improvement 21 . For studies that require better accuracy, or for national-scale surveys, semi-automatic extraction or even manual interpretation and digitization is more feasible for building databases with strict quality control 22 .
There are several global-scale data sets of lake properties, e.g., the Global Lakes and Wetlands Database (GLWD, 1:1 to 1:3 million resolution, freely available at http://www.wwfus.org/science/data.cfm), created using data from many organizations and individuals 23 ; the GLObal WAter BOdies database (GLOWABO), produced using the GeoCover data set circa 2000 (ref. 24); and a database of summer lake surface temperatures for 291 lakes globally, collected in situ and/or by satellites for the period 1985-2009 (freely available at http://portal.lternet.edu/) 25 . However, regional-scale studies require data sets with higher resolution and longer time series. In particular, since the natural environment over the TP region is relatively harsh, data accumulation and sharing can help scientists worldwide make new discoveries in this region. Although numerous studies on the state of the TP lakes 13,26 have been performed, there has been no public data set describing the status of lakes (name, location, shape, area, perimeter, etc.) across the TP region from the past to the present, and in particular no open-access data set derived from high-spatial-resolution satellite images.
The objective of this study was therefore to produce and share a data set on the state of lakes (area ≥1 km²) over the TP. The data set includes three sub-datasets, i.e., the 1960s, 2005, and 2014 time series. The 1960s sub-dataset was produced from a valuable historical record obtained through surveying and mapping, while the 2005 and 2014 sub-datasets were produced mainly using satellite images from the China-Brazil Earth Resources Satellite (CBERS) and China's newly launched GaoFen-1 (GF-1) satellite, respectively. The 1960s and 2005 sub-datasets originated from the results of the first and second nationwide lake investigations 27,28 , respectively. The 2014 sub-dataset represents the first comprehensive evaluation of the GF-1 data for monitoring TP lakes. Manual interpretation and digitization approaches were applied to ensure the accuracy of the data set. An overview of the production and validation of the data set is shown in Figure 1, and detailed information on the methods is given in the next section. The data set will provide scientists with a useful data source for revealing environmental changes and mechanisms over the TP region. Moreover, the data set could be used to validate automated mapping procedures (e.g., refs 20 and 29), to test theoretical hypotheses about lake distributions (e.g., ref. 30), or to contribute to meteorological applications (e.g., ref. 31). For research related to ecology, biogeochemistry, and geomorphology, the value of even smaller lakes (i.e., area <1 km²) is tremendous 32 . Since the data set in this study was produced by manual extraction, lakes of this size were not included. To fill the time gaps in the data set and to extend it in the future, scientists are welcome to add smaller lakes, new time series, and new attributes (e.g., water level) to this data set.

Methods
The boundary of the TP in this study is defined as the area above an elevation of 2,500 m 13 , delineated using the NASA Shuttle Radar Topographic Mission (SRTM) 90 m Digital Elevation Model (DEM) Database v4.1 (Fig. 2). Two provinces of China, i.e., Tibet and Qinghai, make up the major part of the TP (Fig. 2). To make comparison and analysis convenient, the TP is further divided into 12 basins, including 9 exorheic drainage basins (i.e., AmuDarya, Brahmaputra, Ganges, Hexi, Indus, Mekong, Salween, Yangtze, and Yellow) and 3 endorheic drainage basins (i.e., Inner TP, Tarim, and Qaidam). The Inner TP is subdivided into 6 small basins, labelled Inner A to Inner F. The whole data set includes three sub-datasets (1960s, 2005, and 2014) and focuses on all the lakes with areas greater than 1 km².
The 1960s sub-dataset After the establishment of the People's Republic of China in 1949, the development and utilization of lake resources got back on track. New institutions for lake research were established by scientists from governments, universities, and research institutes 27 . In the 1960s, scientists carried out field surveying and mapping for all the lakes (area more than 10 km²) across China, which was part of China's first nationwide lake investigation and is documented in ref. 27. All the lakes were coded and published as a Chinese industry standard called Code for China Lake Name 33 . A vector database (1:250,000) including the attributes (i.e., location, shape, and area) of the lakes was built. The original version of the 1960s sub-dataset in this paper was extracted from the nationwide 1:250,000 data in ESRI shapefile format using the TP boundary. Some lakes in the data set have been edited according to the 2005 sub-dataset described below. For example, in the raw 1960s attribute table, one lake may have two or more records pointing to separate parts of the lake; these parts were merged in this sub-dataset to ensure that each lake has a unique attribute record. Since the lake surveying was conducted within China, lakes outside the national border, which are included in the 2005 and 2014 satellite-based sub-datasets, are not included in the 1960s sub-dataset.
The 2005 sub-dataset […] were used as the main data source for the investigation. The CBERS is an international technological cooperation program between China and Brazil, which developed and operated Earth observation satellites. CBERS-1 was launched in October 1999, with the CCD camera as its main payload. To obtain complete lake data, images from the Landsat Enhanced Thematic Mapper Plus (ETM+) were used as a supplementary data source where the CBERS-1 images were affected by cloud.
To comprehensively evaluate the state of lakes across the TP, we extracted information for each lake from two types of images: one acquired in the wet season (i.e., August-October) and the other in the dry season (i.e., April or May). All the CBERS CCD and Landsat ETM+ images were geometrically corrected and geo-rectified to an Albers Equivalent Conical Projection with a Root Mean Square (RMS) uncertainty lower than 30 m. For Qinghai and Tibet Provinces (Fig. 2), a total of 457 images (408 CBERS CCD images and 49 Landsat ETM+ images) were jointly used to extract lake water bodies. The 2005 sub-dataset in this paper consists of two parts. The first part is the wet-season results for Qinghai and Tibet from the second lake investigation. For the second part, covering lakes outside Qinghai and Tibet Provinces but inside the TP boundary, we downloaded wet-season Landsat ETM+ images as supplements to extract the lake boundaries. The two parts were then merged to form the 2005 sub-dataset.
The 2014 sub-dataset Images acquired in 2014 from the WFV (Wide Field of View) cameras of China's newly launched GF-1 satellite were used as the main data source for lake water body extraction. China officially started development of the China High-Resolution Earth Observation System (CHEOS) in May 2010, which was established as one of the major national science and technology projects. The Earth Observation System and Data Center of the China National Space Administration (EOSDC-CNSA) is responsible for organizing the construction of the CHEOS. The space-based CHEOS system was designed to launch 7 satellite series in sequence. GF-1, launched in April 2013, is the first of these satellites and is configured with one 2 m panchromatic/8 m multi-spectral (PMS) camera and four 16 m multi-spectral WFV cameras. An image with an 800 km swath width can be acquired using the four synchronized WFV cameras, which shortens the revisit time to 4 days 34 . To match the timing and spatial resolution of the 2005 sub-dataset, the 16 m WFV images acquired during the wet season were used in this study. A total of 136 GF-1 images and 11 Landsat 8 OLI images were used to extract the water bodies. All the GF-1 images were orthorectified before water body extraction. Note that, to deal with the problem of missing pixels in Landsat ETM+ SLC-off imagery since 2003 (ref. 35), we used multi-temporal images for both the 2005 and 2014 sub-datasets to ensure the accuracy of extraction. Water bodies were first extracted in each basin and then merged to form a whole data set for the TP.

Water boundary extraction from satellite images
To strictly control the precision of water boundary extraction from satellite images and to provide users with a comprehensive and reliable data set, we chose to manually interpret and extract the water boundaries of the lakes, given possible uncertainties in automatic extraction methods. Note that in this paper, islands inside the lake boundary were not counted toward the total area of the lake water surface. Rules for determining water surface boundaries in the TP region are shown in Fig. 3. Green, yellow, and red lines represent the sketched water boundaries of the 1960s, 2005, and 2014, respectively. The three groups of panels (a1-a3, b1-b3, and c1-c3) represent rules for different situations. The rules are explained as follows:
1) Water body extraction for lakes with different water chemical properties: Lakes in the TP can be divided into three categories according to their water chemical properties, i.e., freshwater lakes, semi-saline lakes, and saline lakes 27 . Figure 3 a1, a2, and a3 show examples of the appearance of the three categories in GF-1 pseudo-color composite images (near-infrared/red/green), respectively. Mapam Yumco (salinity 0.1-0.4 g/l 27 ), a freshwater lake in the Indus basin of the southwest TP, appears very dark blue (see Fig. 3 a1) in the GF-1 image. Zige Tangco, a semi-saline lake (average salinity of 40.7 g/l from field measurements in August 2010) in the Inner TP basin of Central Tibet, appears dark blue. The waterlines of both the freshwater lake and the semi-saline lake are generally clear in the satellite images. We tracked and drew the waterlines of these lakes with the images zoomed to a fixed scale of 1:25,000. However, a saline lake, like Chabyer Co (salinity 393.5-439.8 g/l 27 ) in Figure 3 a3, sometimes has a layer of salt on top of the water surface, which makes it difficult to determine the waterline. For such cases, to ensure the reliability of the results, we checked multiple images from different seasons and referred to the field investigations recorded in the 1960s (ref. 27).
2) Water body extraction for lakes with different formation mechanisms: Natural lakes can be formed by various processes. For lakes in the TP, tectonic movement, river erosion, glacial activity, and landslides are the primary drivers of lake formation 27 . Most glacial lakes in the TP are small, with areas less than 1 km² 36 . Therefore, we only describe tectonic lakes, barrier lakes, and fluvial lakes here. Figure 3 b1, b2, and b3 show examples of the appearance of the three categories in GF-1 pseudo-color images, respectively. Selin Co (2300.49 km² in 2014) in Fig. 3 b1, a classic tectonic lake located in Central Tibet, is now larger than Nam Co (2028.50 km² in 2014) and has become the largest lake in Tibet. The waterline of Selin Co is relatively clear and easy to draw. Ranwu Lake in Fig. 3 b2 is a barrier lake formed by landslide-induced debris flows blocking the river. The water level of barrier lakes is generally not very high, which makes them appear light blue in the image. In general, the waterline of a barrier lake is clear, but it occasionally varies with time. A priori knowledge is important for identifying this type of lake. Fluvial lakes are often long and narrow. The lake in Fig. 3 b3 is a lake newly formed in 2005 in the source area of the Yellow River. The waterline of this type of lake is highly dentate.
3) Dealing with specific issues: Since this data set only includes the water bodies of lakes, islands within a lake were removed from the waterline polygon (Fig. 3 c1). Small water bodies in the bottomland of a lake are not included in the total water surface (e.g., the black circles in Fig. 3 c2), unless they are large enough and connected to the main water body (e.g., the white rectangle in Fig. 3 c2). Figure 3 c3 shows a lake in 2014 that was formed by the merging of two lakes that were separate in 2005 or the 1960s. Such cases are common in the northeast of Central Tibet, since most of the lakes there have been expanding over the past 50 years. In the data set, if a new merged lake was formed from two or more separate lakes, the new lake as a whole was named after the largest of the original lakes.
We also determined two specific types of lakes: new-born lakes and dead lakes. For example, if a lake existed in a certain location in the 2005 image while in the 1960s the same location was identified as land or non-lake water body, this lake was defined as a new-born lake. Similarly, if a lake was found in a certain location in the 1960s while in the 2005 image there was no lake in the same location, this lake was defined as a dead lake. Following the above definitions, all the images for years 2005 and 2014 were examined one by one to determine the two types of lakes.
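The two definitions above amount to a set comparison between two epochs. The following is a minimal sketch only, assuming each epoch has already been reduced to a set of lake identifiers; the function name and the example IDs are hypothetical, and in the study itself the decision was made by visually examining the images one by one rather than by matching IDs.

```python
def classify_lakes(earlier_ids, later_ids):
    """Split lake IDs into new-born, dead, and persistent sets.

    earlier_ids / later_ids: iterables of lake identifiers found in the
    earlier epoch (e.g., the 1960s) and the later epoch (e.g., 2005).
    """
    earlier, later = set(earlier_ids), set(later_ids)
    return {
        "new_born": later - earlier,    # present later, absent earlier
        "dead": earlier - later,        # present earlier, absent later
        "persistent": earlier & later,  # present in both epochs
    }


# Hypothetical identifiers, for illustration only.
result = classify_lakes({"A01", "A02", "A03"}, {"A02", "A03", "B07"})
print(result["new_born"], result["dead"])
```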
For questionable lakes, such as ephemeral lakes or salt lakes with salt crusts, we checked and compared their state on both wet-season and dry-season images. If the bottomlands seasonally covered by water or the salt crusts were located outside the water surface boundaries on both the wet-season and dry-season images, we did not consider them as components of the lake. Otherwise, if they were located inside the water surface boundaries on the wet-season images but outside them on the dry-season images, we took the median lines between the wet-season and dry-season water surface boundaries as the lake water boundaries.
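The seasonal rule above can be written as a small decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, and the two cases the text does not describe (inside the boundary in both seasons, or inside only in the dry season) are filled in here as assumptions, not as the authors' rule.

```python
def seasonal_rule(inside_wet, inside_dry):
    """Decide how to treat a bottomland or salt crust near a lake.

    inside_wet / inside_dry: True if the feature lies inside the water
    surface boundary on the wet-season / dry-season image.
    """
    if not inside_wet and not inside_dry:
        return "exclude"        # outside in both seasons: not part of the lake
    if inside_wet and not inside_dry:
        return "median_line"    # use the median line of the two boundaries
    if inside_wet and inside_dry:
        return "include"        # assumption: the text does not state this case
    return "undetermined"       # dry-only case is not covered by the text


print(seasonal_rule(True, False))
```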

Data Records
The data set is available in three folders. The first folder, 'Data_Information_File', contains detailed information on the lakes in each sub-dataset, the data collector, the image or images used for water body extraction, citations, etc. The second folder, 'Data_Value_File', contains two subfolders, 'shp_1-10' and 'shp_10-', which store the shapefiles of the 1-10 km² and ≥10 km² lakes, respectively. The third folder, 'Supplement', contains the boundary files and validation sampling shapefiles. An overall statistical table of lake numbers and areas in this data set is also included in the 'Supplement' folder. The shapefiles of the three sub-datasets can be linked to the lake information via the ID/NAME_CH/NAME_EN columns. The 1960s and 2005 data used in this study have been published in the literature 22,28,37 . The 2014 data have not been published. The data set can be accessed at http://dx.doi.org/10.6084/m9.figshare.3145369 (Data Citation 1). Table 1 shows data labels and descriptions for the shapefiles in detail. Some lakes may have two or more Chinese or English names. The aliases of these lakes are recorded in brackets after their common names, e.g., Yazi Lake (Woniu Lake). Lakes that do not have names are recorded as 'Noname'. The 'Noname' lakes are generally small and are found in the 'shp_1-10' folder. An attribute column called 'IS_NEWBORN' is used to record whether a lake is a new-born lake or not. Number 0 indicates that the lake is not a new-born lake.
Table 1 (excerpt)
NOTES_CH: Statements for specific cases in Chinese, e.g., two lakes merged into one single lake due to drastic expansion
NOTES_EN: Statements for specific cases in English, e.g., two lakes merged into one single lake due to drastic expansion
[…] particularly in the Inner TP basin. More sub-datasets are required to study the detailed change characteristics from the 1960s to 2005.

Technical Validation
Quality control and validation of the dataset
For the 2005 and 2014 sub-datasets, after completing the first round of water body boundary extraction for all lakes, three of the authors of this paper (Wei Wan, Zhongying Han, and Yuan Yuan) cross-checked the initial results basin by basin to ensure that there were no missing or erroneous lakes. Four graduate students then examined the attribute tables of the data set to ensure their validity and integrity. We paid particular attention to identifying new-born and dead lakes. We also examined the ≥10 km² and 1-10 km² shapefiles for the same year together to avoid duplicate records. For the 1960s sub-dataset, we could not examine the underlying data, since the historical surveying and mapping work cannot be repeated. Instead, we checked the 1960s sub-dataset against the two remote sensing-derived sub-datasets to ensure the consistency of lake attributes, e.g., the ID, name, and basin.
To achieve a robust and quantitative validation of the area and perimeter estimates from GF-1 WFV images, we performed a two-step comparison. First, we compared the GF-1 WFV-derived values (WFV for short) with the results derived from Landsat 8 Operational Land Imager (OLI) images (OLI for short), since WFV and OLI have comparable spatial resolutions (16 m and 30 m, respectively). Second, we compared both the WFV and OLI results with the results derived from the GF-1 PMS images (PMS for short). Here the PMS results were treated as reference data. To obtain a reasonable sample size while keeping the workload manageable, we divided the lakes by area into 6 categories: ≥1000 km², 500-1000 km², 100-500 km², 50-100 km², 10-50 km², and 1-10 km². Approximately 5% of the lakes in each category were selected as samples, i.e., 1, 1, 3, 3, 13, and 33 lakes, respectively. Names and attributes of the sampled lakes can be found in the 'Supplement' folder of the data set. For validation, we collected thirteen Landsat 8 OLI images and fifty-nine GF-1 2 m/8 m PMS images acquired during the wet season in 2014-2015. The raw panchromatic and multispectral PMS images were first orthorectified individually and then processed to create the final 2 m pan-sharpened reference images. The sampled lakes were digitized from the OLI and PMS images at scales of 1:25,000 and 1:2,500, respectively. We used two morphometric indices mentioned in ref. 20, the Shoreline Development Index (SDI) and the Miller thickness index (MI), to describe the morphometry of the lakes. Table 3 shows statistics of the measured parameters (area, perimeter) and calculated parameters (SDI and MI) for WFV, OLI, and PMS, respectively. In general, the mean, minimum, maximum, and standard deviation for the three data sets are at the same level. The PMS-derived perimeters always showed relatively higher values than the other two, since higher-resolution images capture more detail of the lake boundaries.
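The text names the two morphometric indices without giving their formulas. The sketch below assumes the commonly used definitions, SDI = P / (2·sqrt(π·A)) and MI = 4·π·A / P², where A is the lake area and P its perimeter in consistent units; these definitions and the function names are assumptions and may differ in detail from those in ref. 20. Under these definitions a perfectly circular lake has SDI = 1 and MI = 1, and more convoluted shorelines give larger SDI and smaller MI.

```python
import math


def shoreline_development_index(area_km2, perimeter_km):
    """Ratio of the shoreline length to the circumference of a circle
    with the same area (assumed definition)."""
    return perimeter_km / (2.0 * math.sqrt(math.pi * area_km2))


def miller_index(area_km2, perimeter_km):
    """Ratio of the lake area to the area of a circle with the same
    perimeter (assumed definition)."""
    return 4.0 * math.pi * area_km2 / perimeter_km ** 2


# Illustrative shape only: a square lake with 2 km sides.
sdi = shoreline_development_index(4.0, 8.0)
mi = miller_index(4.0, 8.0)
print(round(sdi, 3), round(mi, 3))
```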
To evaluate how well the lake boundaries match, we calculated the relative deviations (RD) in area and perimeter of the WFV and OLI data sets with respect to the PMS data set. For all the sampled lakes, the deviation in estimated area was RD = 0.012 (median = 0; StDev = 0.044) for WFV and RD = −0.014 (median = −0.012; StDev = 0.052) for OLI. Similarly, the deviation in estimated perimeter was RD = −0.065 (median = −0.074; StDev = 0.069) for WFV and RD = −0.082 (median = −0.07; StDev = 0.06) for OLI. Figure 4 shows histograms and Gaussian fits of the RDs of area and perimeter for the sampled WFV and OLI results. Note that the Gaussian fits to the RD distributions of both area and perimeter reached good R-squared values. Based on this error analysis, the data source and method used to create this data set appear to be reliable and robust. Since automatic methods are more efficient than manual interpretation, it would be worthwhile to compare the two approaches in future work.
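The RD statistics reported above can be reproduced in form with a few lines of code. This is a sketch that assumes RD = (estimate − reference) / reference with the PMS value as the reference; the text does not state the formula, although this sign convention is consistent with the negative perimeter RDs and the higher PMS-derived perimeters. The function names and the sample values are hypothetical and are not taken from the data set.

```python
import statistics


def relative_deviation(estimate, reference):
    """Signed deviation of an estimate relative to the reference value."""
    return (estimate - reference) / reference


def summarize_rd(estimates, references):
    """Mean, median, and sample standard deviation of per-lake RDs."""
    rds = [relative_deviation(e, r) for e, r in zip(estimates, references)]
    return {
        "mean": statistics.mean(rds),
        "median": statistics.median(rds),
        "stdev": statistics.stdev(rds),
    }


# Made-up perimeters (km) for three lakes, for illustration only.
wfv_perimeter = [95.0, 47.0, 18.8]
pms_perimeter = [100.0, 50.0, 20.0]
summary = summarize_rd(wfv_perimeter, pms_perimeter)
print(summary)
```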

Comparison with other data sets
After validating the extraction accuracy of the data set developed in this study, we further compared it with two other publicly released data sets, i.e., the global-scale data set GLWD 23 and a regional-scale data set created by Yao et al. 38 . The GLWD was produced using multiple data sources gathered from the 1990s. Level 1 and level 2 of the GLWD data were used for comparison. All the GLWD lakes were first extracted using the TP boundary and then regrouped into two categories: ≥10 km² and 1-10 km².
Within the TP boundary, the GLWD contains a total of 1,131 lakes with an aggregated area of 38,153 km². This is of the same order of magnitude as the data sets in this study. Figure 5 provides an overview of the latitudinal, longitudinal, and basin-wise lake distributions according to the different data sets. Number and area values were aggregated at steps of 1° and 2° for the latitudinal and longitudinal distributions, respectively. In general, the data sets in this study are consistent with the GLWD. Some disagreement between the two is to be expected, since they reflect lake numbers and areas at different time periods. The most striking result of the comparison, however, is the difference in geolocation for the 1-10 km² lakes, e.g., lakes distributed between 80°-90°E and around 35°N (i.e., northwest of the Inner TP basin). For this region, we overlaid the shapefiles of the GLWD and of the data sets developed in this study on the 1990s (Landsat 4-5 TM) and 2014 (GF-1) remote sensing images, and found that some of the GLWD lakes did not appear on the 1990s images and that some small lakes were missing. We checked the lakes in our data sets one by one with reference to the GLWD data to ensure that there were no missing lakes. Despite the small issues discussed above, we believe that both the data sets developed in this study and the GLWD are of good quality. The GLWD data could, to some extent, be a good addition to fill the time gaps in the data set developed in this study.
Yao's data set was created over the Hoh Xil region using Landsat TM/ETM+ images acquired in 2000. The extraction methods and rules are consistent between Yao's data set and ours. The 44 lakes included in both data sets were selected for comparison. Figure 6 shows the areas of these lakes in 2000 (Yao, in black), 2005 (this study, in blue), and 2014 (this study, in red).
Note that for the area of each lake, the 2000 (Yao) data and the 2005 data are basically consistent (R² = 0.99). This is reasonable because changes in lakes should not be prominent over a 3-5 year period. Some images used for Yao's results were from 2001, 2002, and even 2003, which makes the comparison more convincing. The validation and comparison once again imply that, for lake monitoring, images from various satellite sensors, i.e., CBERS-1 CCD, GF-1 WFV, and Landsat TM/ETM+, can generate consistent and comparable results.
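The latitudinal and longitudinal aggregation used for the GLWD comparison (steps of 1° and 2°, respectively) can be sketched as a simple binning of lake records. The record layout (keys 'lat', 'lon', 'area_km2'), the function name, and the sample values below are hypothetical; the sketch only illustrates the binning, not the authors' actual processing.

```python
import math
from collections import defaultdict


def aggregate_by_band(lakes, key, step):
    """Count lakes and sum their areas in bands of `step` degrees.

    lakes: iterable of dicts with 'lat', 'lon', and 'area_km2' entries.
    key: 'lat' or 'lon'. Each band is labelled by its lower edge.
    """
    bands = defaultdict(lambda: {"count": 0, "area_km2": 0.0})
    for lake in lakes:
        lower_edge = math.floor(lake[key] / step) * step
        bands[lower_edge]["count"] += 1
        bands[lower_edge]["area_km2"] += lake["area_km2"]
    return dict(bands)


# Made-up lake records, for illustration only.
lakes = [
    {"lat": 30.7, "lon": 90.6, "area_km2": 5.0},
    {"lat": 30.2, "lon": 91.9, "area_km2": 12.5},
    {"lat": 35.4, "lon": 85.1, "area_km2": 3.0},
]
by_lat = aggregate_by_band(lakes, "lat", 1)  # 1-degree latitude bands
by_lon = aggregate_by_band(lakes, "lon", 2)  # 2-degree longitude bands
print(by_lat, by_lon)
```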
Assessment of trends in lake changes over the last decades
Figure 7 shows the rates of change of lakes in the TP over the last decade (2005-2014). Blue and red solid circles represent increasing and decreasing rates for individual lakes, respectively. Previous studies have revealed that lakes in some TP regions showed consistently expanding or shrinking trends during certain periods. For example, refs 13,39 , using Landsat TM/ETM+ data, suggest that the area of lakes in the inner plateau expanded rapidly between the 1990s and 2009/2010 (~27%). Refs 10,18,40 , using ICESat/GLAS altimetry data, suggest that the water level of lakes in the inner plateau showed a significantly increasing trend between 2003 and 2009. Refs 10,13 reveal that lakes in the Brahmaputra basin showed a decreasing trend in both area and water level. Ref. 41 reveals that […] and northeast of the basin show a more rapid growth rate (average rates for Inner C, D, E, and F are 15.63, 15.13, 12.58, and 12.38%, respectively), while the south-eastern part shows a relatively slow growth rate (average rates for Inner A and B are 3.19 and 4.79%, respectively). This demonstrates a consistent and continued trend relative to the findings of the published studies. In the Brahmaputra basin, lakes clearly show a decreasing rate (−2.53%) in recent years. This is highly consistent with the above-mentioned published studies. It is worth mentioning that analyzing the rates of lake change using only two time points, as in this study, can yield only a general conclusion. To investigate a particular lake or a basin-scale water balance, data acquired at more time intervals and effective automatic methods are required.
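The basin-average rates quoted above can be illustrated with a short calculation. This sketch assumes that the rate of change is the percent change in lake area between 2005 and 2014, and that a basin average is the mean of the per-lake rates; the text does not define either explicitly, and the function names and area values below are hypothetical.

```python
from collections import defaultdict


def change_rate_percent(area_start_km2, area_end_km2):
    """Percent change in area between two epochs (assumed definition)."""
    return (area_end_km2 - area_start_km2) / area_start_km2 * 100.0


def basin_mean_rates(records):
    """Mean per-lake change rate for each basin.

    records: iterable of (basin, area_2005_km2, area_2014_km2) tuples.
    """
    rates = defaultdict(list)
    for basin, area_2005, area_2014 in records:
        rates[basin].append(change_rate_percent(area_2005, area_2014))
    return {basin: sum(values) / len(values) for basin, values in rates.items()}


# Made-up areas (km^2), for illustration only.
records = [
    ("Inner A", 100.0, 104.0),
    ("Inner A", 50.0, 51.0),
    ("Brahmaputra", 20.0, 19.0),
]
means = basin_mean_rates(records)
print(means)
```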