A dataset of remote-sensed Forel-Ule Index for global inland waters during 2000–2018

Water colour is the result of its constituents and their interactions with solar irradiance; this forms the basis for water quality monitoring using optical remote sensing data. The Forel-Ule Index (FUI) is a useful comprehensive indicator to show the water colour variability and water quality change in both inland waters and oceans. In recent decades, lakes around the world have experienced dramatic changes in water quality under pressure from both climate change and anthropogenic activities. However, acquiring consistent water colour products for global lakes has been a challenge. In this paper we present the first time series FUI dataset for large global lakes from 2000–2018 based on MODIS observations. This dataset provides significant information on spatial and temporal changes of water colour for global large lakes during the past 19 years. It will be valuable to studies in search of the drivers of global and regional lake colour change, and the interaction mechanisms between water colour, hydrological factors, climate change, and anthropogenic activities.


Background & Summary
Lakes are widely recognised as sentinels of environmental change, representing ecosystems that are particularly vulnerable to anthropogenic disturbance and climatic variability 1 . They play a crucial role in the global hydrological cycle, supporting extensive services such as water supply, hydropower generation, flood mitigation, fisheries, and biodiversity [2][3][4] . Deriving water quality information for lakes over large areas and long-time scales is of considerable value in exploring how lakes change and respond to environmental changes. Satellite remote sensing can potentially provide objective, broad scope, high frequency, and continuous measurements of inland water quality by capturing water colour information 5 . However, challenges brought about by the optical complexity of inland waters and overlying atmosphere, and interference due to adjacency effects have hindered the development of valid Earth observation (EO) approaches for water quality monitoring in inland waters compared with the ocean applications. As a result, few water quality EO products are available for inland waters at global and regional scales.
www.nature.com/scientificdata
Water colour itself is recognised by the Global Climate Observing System as a key essential climate variable for lakes as it is directly related to variations in water constituents. Water colour is one of the oldest types of water observation data, with records for global water bodies stretching back over a century. Water colour observations are based on the fact that clear water appears blue while turbid water turns green and/or yellow with increased levels of suspended sediment, phytoplankton, and coloured dissolved organic matter. It is traditionally measured using the Forel-Ule water colour scale, which divides water into 21 colour classes from dark blue to yellowish-brown. Recently, remote sensing data have been applied to derive the Forel-Ule Index (FUI) of water using remote sensing reflectance (Rrs) in the visible domain [6][7][8]. Studies have shown that FUI derived from Rrs has relatively low uncertainty owing to its tolerance of aerosol perturbations and variable observational conditions and its good transferability across different sensors 7,9,10. As water colour is the outcome of interactions between sunlight and the absorption and scattering of water constituents, changes in optically active water constituents can be described by variations in FUI 6,11. The relationships between FUI and water quality parameters (e.g. chlorophyll-a (Chl-a), total suspended matter (TSM), coloured dissolved organic matter (CDOM), turbidity, and water clarity) have been previously explored and documented 6,7,12,13. While FUI yields more information on Chl-a for open oceans 11,12, it is well-correlated with water clarity for coastal and inland waters according to recent studies 12,14,15.
Given its low uncertainties, feasible transferability, and intrinsic relationship with water quality, FUI was recently promoted as a comprehensive water quality index for marine and inland waters, especially in large regions and over long time spans 7,12,14 .
In the past century, land-use changes, increasing urbanisation and industrialisation, and population growth, together with apparent climate change, have inevitably brought about changes in aquatic systems worldwide 16. However, there is a lack of systematic water quality products or datasets available for global inland waters. Here, we present a time series dataset of FUI for large global lakes (including lakes and reservoirs, termed 'lake' or 'lakes' for brevity hereafter) from 2000-2018 based on Moderate Resolution Imaging Spectroradiometer (MODIS) data. This dataset provides unique information on the spatial patterns and long-term trends of water colour worldwide over the past 19 years. These data could also be used to address scientific questions such as how water colour is associated with hydrological parameters, climate change, and local anthropogenic activities at global and regional scales.

Methods
Water-leaving reflectance correction. The MODIS surface reflectance level-3 product (MOD09A1) was acquired from the Goddard Space Flight Center (GSFC) of the National Aeronautics and Space Administration (NASA) (http://ladsweb.nascom.nasa.gov/index.html). This global-coverage product provides 8-day composite data at 500 m spatial resolution and has previously been applied to inland water quality monitoring 7,17. MOD09 has already been corrected for aerosol effects, Rayleigh scattering, and cirrus clouds and provides an estimate of surface reflectance. We performed a further water-leaving reflectance correction based on the minimum band value in the near infrared (NIR) to short wave infrared (SWIR) bands to remove the skylight reflection, residual aerosol effect, and sun glint for improved estimation of water-leaving reflectance (Rrs) 18. This correction method can be operationally applied to various types of inland waters over large areas with relatively stable and satisfactory performance 7,18.
Lake water body extraction and identification. We used a modified histogram bimodal method to extract large inland water areas (>25 km²) automatically based on the reflectance of the 1640 nm band 7,19. This band was selected because of the obvious reflectance difference between water and other land covers in the SWIR band. First, an initial rough water area was obtained based on the MOD09A1 Quality Assurance (QA) dataset where inland water pixels were marked. Then, during automatic selection of the threshold value (Fig. 1), a buffer zone was created around each connected water area with an area 1.5 times the initial water area. For the expanded area comprising the initial water area and the buffer zone, a histogram of the 1640 nm reflectance was produced, in which water and other land-cover types fall into two separate modes (Fig. 1(b)).
Finally, the threshold value for this water was recognised as the valley value within a specific threshold range in the histogram (denoted as the range between T 0 and T 1 in Fig. 1). In this way, every water body found in the imagery could be identified separately with a threshold adapted to the water reflectance and its surrounding land-cover features, which can avoid misidentifications caused by one harmonised threshold value for all waters. In addition, clouds, cloud shadows, snow/ice, mixed pixels, and other noise pixels were identified using the MOD09A1 QA dataset, then removed before further analysis.
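The valley-based thresholding described above can be illustrated with a minimal Python sketch. It is not the authors' IDL implementation: the function names `valley_threshold` and `water_mask`, the number of histogram bins, and the three-bin smoothing are assumptions made for illustration, and the search limits `t0` and `t1` stand in for T0 and T1 in Fig. 1.

```python
import numpy as np


def valley_threshold(swir, t0, t1, bins=100):
    """Return the reflectance at the histogram valley between t0 and t1.

    swir: 1640 nm reflectance of the expanded area (water plus buffer zone).
    t0, t1: hypothetical limits of the range searched for the valley.
    """
    values = np.asarray(swir, dtype=float).ravel()
    values = values[np.isfinite(values)]
    counts, edges = np.histogram(values, bins=bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    in_range = (centres >= t0) & (centres <= t1)
    if not in_range.any():
        raise ValueError("no histogram bin falls inside [t0, t1]")
    # Light smoothing so that a single noisy bin is not taken as the valley.
    smooth = np.convolve(counts, np.ones(3) / 3.0, mode="same")
    candidates = np.flatnonzero(in_range)
    # The valley is the least populated (smoothed) bin in the search range.
    return centres[candidates[np.argmin(smooth[candidates])]]


def water_mask(swir, t0, t1):
    """Classify pixels darker than the valley threshold as water."""
    swir = np.asarray(swir, dtype=float)
    return swir < valley_threshold(swir, t0, t1)
```

Because the threshold is computed per water body from its own expanded area, each lake would receive a value adapted to its surroundings, which is the property the text emphasises.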
Following water body extraction from MODIS imagery, a normal water mask was obtained for each lake, comprising the pixels where the occurrence frequency of water exceeded 30% during 2000-2018. The normal water mask represents the lake's normal water area as observed by MODIS during 2000-2018 and served as the boundary in the following FUI statistical calculations. This removed ephemeral water areas and low-coverage water bodies from our dataset. Hence, the surface areas provided in this dataset are based on MODIS observations; they may differ slightly from the surface areas in other databases but indicate the valid water area used in this dataset. In addition, each lake's centroid point was identified using the normal mask, and its latitude and longitude were then extracted. The geographical coordinates were used to identify a specified water body.
FUI retrieval. We used a FUI retrieval algorithm for visible MODIS bands defined in previous research 7,14, as summarised below:
(1) CIE tristimulus values X, Y, Z were calculated from the R, G, B bands of MOD09A1 after water-leaving correction using an RGB conversion method 6,20:
$$\begin{aligned}
X &= 2.7689R + 1.7517G + 1.1302B \\
Y &= 1.0000R + 4.5907G + 0.0601B \\
Z &= 0.0000R + 0.0565G + 5.5943B
\end{aligned} \tag{1}$$
(2) The chromaticity coordinates x, y were calculated by normalising X, Y, Z between 0 and 1 20:
$$x = \frac{X}{X + Y + Z}, \qquad y = \frac{Y}{X + Y + Z} \tag{2}$$
(3) The hue angle α can be derived from x, y 9,10:
$$\alpha = \left[\operatorname{arctan2}\!\left(y - \tfrac{1}{3},\; x - \tfrac{1}{3}\right)\cdot\frac{180}{\pi}\right] \bmod 360 \tag{3}$$
Here, the hue angle α is in degrees and increases from 0° to 360° anti-clockwise, starting from the positive x-axis at y − 1/3 = 0 in the CIE chromaticity diagram. We note that our previous publications 7,12,14 used a different definition of the hue angle (termed α' hereafter), which increases clockwise starting from the negative axis at x − 1/3 = 0. In that definition, α' is calculated as arctan2(x − 1/3, y − 1/3) + π, and α' increases with FUI. However, this does not affect the FUI result because the same chromaticity coordinates of the 21 FUI colours 21 were used to generate the FUI look-up table (Table 1).
(4) To eliminate the colour difference caused by the MODIS visible band setting, we applied a deviation delta (Δ) correction by modelling the differences in α between human-eye-sensed true colour and MODIS-derived colour, following the idea proposed in previous research 9. However, because we use a different method to derive the CIE tristimulus values X, Y, Z (as shown in Eq. (1)), our correction equation differs from that in ref. 9:
Monthly and yearly FUI calculation. All FUI images were produced using the 8-day composited MOD09A1 data from February 2000 to December 2018, and monthly FUI images were produced by removing outlier data in the time domain and averaging the remaining values at the same pixel location for each month. Time-domain outliers were defined as values outside the 'μ ± 3σ' window (μ denotes the average value and σ denotes the standard deviation). The monthly average FUI values were then calculated for each water body when the detected water pixels for that water body were >30% of those in the normal water mask, to ensure the representativeness of the calculation.
In the monthly average FUI calculation, outlier data in the spatial domain were removed and the remaining pixel values within the extent of the water body (identified by the normal water mask) were averaged. Spatial-domain outliers were defined as values outside the 'μ ± 1.5σ' window in order to avoid uncertainties caused by thin clouds, aerosol perturbations, or other noise, thereby ensuring more accurate monthly averaged FUI values for water bodies. Because lakes may at times be obscured by clouds or affected by other noise, data are missing for some lakes. To ensure the reliability of this time series dataset, lakes with fewer than six valid monthly values in any one year from 2000-2018 were not included in this dataset. Because some Northern Hemisphere lakes may be covered or partly covered by ice during winter, their monthly FUI data were only calculated from boreal May to October each year. These frozen lakes were identified based on monthly climatological lake surface water temperature data provided by the ARC-Lake v3.0 dataset (http://www.laketemp.net/home_ARCLake/data_access.php) 22. These lakes were only included if they had at least three valid monthly data points in each year. Finally, the missing data for lakes were filled through linear interpolation 23,24. The yearly average FUI values were calculated for each water body by averaging the corresponding monthly average FUI.
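Steps (1) to (3) of the FUI retrieval can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the matrix is the standard CIE 1931 RGB-to-XYZ conversion, assumed here to be the one intended in Eq. (1), and the function name `hue_angle` is hypothetical. The Δ correction of step (4) and the look-up of the FUI class from Table 1 are deliberately left out because their coefficients are not given in this text.

```python
import numpy as np

# Standard CIE 1931 RGB-to-XYZ matrix (an assumption for Eq. (1)).
RGB_TO_XYZ = np.array([
    [2.7689, 1.7517, 1.1302],
    [1.0000, 4.5907, 0.0601],
    [0.0000, 0.0565, 5.5943],
])


def hue_angle(r, g, b):
    """Hue angle in degrees (0 to 360, anti-clockwise from the positive x-axis).

    r, g, b: water-leaving reflectance in the red, green and blue bands,
    given as scalars or arrays of identical shape.
    """
    rgb = np.stack([np.asarray(r, dtype=float),
                    np.asarray(g, dtype=float),
                    np.asarray(b, dtype=float)], axis=-1)
    xyz = rgb @ RGB_TO_XYZ.T                      # tristimulus X, Y, Z
    total = xyz.sum(axis=-1)
    x = xyz[..., 0] / total                       # chromaticity coordinates
    y = xyz[..., 1] / total
    # Angle measured around the white point (1/3, 1/3).
    alpha = np.degrees(np.arctan2(y - 1.0 / 3.0, x - 1.0 / 3.0))
    return np.mod(alpha, 360.0)
```

With this definition a blue-dominated spectrum yields a larger angle than a red-dominated one, so α decreases as water shifts from blue towards yellow-brown, which is why the text notes that the earlier α' convention runs in the opposite direction.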

Data records
The long-term FUI time series data are available via Figshare 25. General information for the 1049 investigated large lakes (>25 km²) around the world is compiled in 'lake_info.csv', where each row represents one lake and the columns are as follows:
(1) Lake_id: Identifies each lake with MODIS tile and location number.
(2) Lake_name: Lake name acquired from Google Earth and some lake databases. A small number of lakes have blank names because their names could not be found.
(7) Country/Region: Country or region in which the lake is located; international lakes are assigned to the country or region containing the centroid point, and the assignment may be arbitrary for centroid points falling on a boundary.
(8) Continent: Continent in which the lake is located; international lakes may be arbitrarily assigned to one continent.
The long-term monthly FUI data from February 2000 to December 2018 for lakes are compiled in the 'monthly_FUI' folder, in which the raw monthly FUI and filled monthly FUI data are provided in the 'raw_monthly_FUI' and 'filled_monthly_FUI' files, respectively. Monthly FUI data for lakes that freeze in winter are only provided from May to October of each year because ice cover changes the observed colour. Long-term yearly FUI data from 2000-2018 are compiled in the 'yearly_FUI' folder, in which the yearly mean FUI is provided in the 'yearly_FUI.csv' file. The average FUI of lakes from 2000-2018 is mapped in Fig. 2, and annual change rates are graphed in Fig. 3. Lakes with significant positive or negative yearly trends (p < 0.01) over the nineteen years are also marked in Fig. 3.
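The relationship between a raw monthly series and its filled counterpart can be illustrated with a short Python sketch of linear gap filling. Several points are assumptions: missing months are represented as NaN, gaps at the start or end of a series are left unfilled rather than extrapolated, and the function name `fill_gaps_linear` is hypothetical. The layout of the distributed files is not assumed here.

```python
import numpy as np


def fill_gaps_linear(values):
    """Fill interior NaN gaps in a monthly series by linear interpolation."""
    series = np.asarray(values, dtype=float)
    idx = np.arange(series.size)
    valid = np.isfinite(series)
    if valid.sum() < 2:
        # Nothing to interpolate between.
        return series.copy()
    filled = series.copy()
    first, last = idx[valid][0], idx[valid][-1]
    # Only fill gaps that have valid data on both sides.
    inside = (idx > first) & (idx < last) & ~valid
    filled[inside] = np.interp(idx[inside], idx[valid], series[valid])
    return filled
```

For example, a series with values 8 and 11 separated by two missing months would be filled with 9 and 10.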

Technical Validation
Quality control and assurance of the dataset. Quality control methods were embedded and executed during the processes of water-leaving reflectance correction, water body extraction, FUI image retrieval, and monthly- and yearly-average FUI calculation. After water-leaving reflectance correction and water body extraction were performed, clouds, cloud shadow, snow/ice, and other noise over the water area were identified using the QA flags attached to the MOD09A1 data and removed before further processing. The land adjacency effect was avoided by eroding the water areas by a distance of 500 m 26. To avoid data contamination from the visible water bottom, optically shallow water was excluded using a blue-band thresholding method assisted by visual interpretation using Google Earth images 7. During monthly FUI image calculation, outliers at the pixel level were checked using the 'μ ± 3σ' criterion. During the summer-average FUI value calculation for each lake, water areas <30% of the normal surface area were removed to avoid average value biases caused by spatial variability in lakes. To avoid manual errors, a set of scripts in the IDL programming language was written for water extraction, FUI retrieval, and summer-average FUI calculation.
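The outlier and coverage checks mentioned above can be sketched in Python as follows. This is a simplified illustration, not the IDL scripts: it assumes NaN marks pixels without a valid water observation, applies each μ ± kσ test in a single pass, and uses the hypothetical names `sigma_clip` and `lake_mean_fui`. The time window over which μ and σ are computed for the temporal test is not specified in the text and is left to the caller.

```python
import warnings

import numpy as np


def sigma_clip(values, k, axis=None):
    """Replace values outside mean +/- k*std with NaN (NaN-aware, one pass)."""
    values = np.asarray(values, dtype=float)
    with warnings.catch_warnings():
        # Slices that are entirely NaN only trigger a warning; silence it.
        warnings.simplefilter("ignore", RuntimeWarning)
        mu = np.nanmean(values, axis=axis, keepdims=True)
        sigma = np.nanstd(values, axis=axis, keepdims=True)
    return np.where(np.abs(values - mu) <= k * sigma, values, np.nan)


def lake_mean_fui(monthly_image, normal_mask, min_coverage=0.30, k=1.5):
    """Lake-level mean FUI from one monthly image.

    Returns NaN unless valid pixels exceed min_coverage of the normal
    water mask; spatial outliers outside mean +/- k*std are dropped.
    """
    mask = np.asarray(normal_mask, dtype=bool)
    pixels = np.asarray(monthly_image, dtype=float)[mask]
    valid = pixels[np.isfinite(pixels)]
    if pixels.size == 0 or valid.size <= min_coverage * pixels.size:
        return float("nan")
    return float(np.nanmean(sigma_clip(valid, k)))
```

The temporal test would be applied per pixel with `sigma_clip(stack, 3.0, axis=0)` on a (time, rows, columns) stack before averaging over time, and the spatial test with `lake_mean_fui` on the resulting image.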
To assemble the summer-average FUI data for each lake, the lake ID was attached to each lake according to its centroid location; the assembled dataset was then cross-checked using a series of graphs and maps, allowing the identification of outliers and abnormal trends. We also compared the FUI change rates with other related water quality studies to confirm the results 27,28.
Validation with in situ data. Previous studies have shown that FUI can be derived from multispectral satellite data with high accuracy given its tolerance of aerosol perturbation and unfavourable viewing conditions, and that the uncertainties in satellite water-leaving reflectance can be reduced during conversion to FUI 7,10. We evaluated the FUI derived from MOD09 by comparison with in situ spectral data measured in our previous study 7, which showed that the uncertainties in the MODIS FUI data were <10%.
We further validated the MODIS FUI results using concurrent in situ Rrs(λ) data, mainly from Chinese lakes. That is, the MODIS FUI is validated against the colour of the water itself, without a submerged Secchi disk, whereas a Secchi disk is placed in the water during traditional FUI measurement with the handheld Forel-Ule water colour scale 29. We note that there could be systematic biases between the satellite FUI data and the in situ FUI obtained using the handheld water colour scale assisted by a Secchi disk, but this does not apply to this study 30,31. Here, the mean relative difference (MRD) and root mean square error (RMSE) were used to depict the uncertainties:
$$\mathrm{MRD} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|x_{\mathrm{est},i} - x_{\mathrm{mea},i}\right|}{x_{\mathrm{mea},i}} \times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{\mathrm{est},i} - x_{\mathrm{mea},i}\right)^{2}}$$
where x_est denotes the estimated value, x_mea denotes the measured value, and n is the number of measurements.
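The two uncertainty measures can be computed with a few lines of Python. The sketch assumes the usual definitions (mean of absolute relative differences expressed in percent, and root of the mean squared difference); the function names `mrd` and `rmse` are chosen here for illustration.

```python
import numpy as np


def mrd(est, mea):
    """Mean relative difference in percent between estimates and measurements."""
    est = np.asarray(est, dtype=float)
    mea = np.asarray(mea, dtype=float)
    return float(np.mean(np.abs(est - mea) / mea) * 100.0)


def rmse(est, mea):
    """Root mean square error between estimates and measurements."""
    est = np.asarray(est, dtype=float)
    mea = np.asarray(mea, dtype=float)
    return float(np.sqrt(np.mean((est - mea) ** 2)))
```

For match-ups, `est` would hold the MODIS-derived FUI values and `mea` the FUI values derived from the in situ spectra, paired by index.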
The in situ Rrs(λ) measurements were carried out in six large Chinese lakes with diverse water types ranging from clear and oligotrophic to turbid and eutrophic. In addition, in situ Rrs(λ) data collected in Lake Erie (North America) were acquired from the SeaWiFS Bio-optical Archive and Storage System (SeaBASS) database and used to fill a gap in our data for moderately clear water (FUI ranging from 7-10). For the in situ data, above-water radiance measurements were made to derive water-leaving reflectance spectra at the sampling sites; the spectra were then resampled to the MODIS bands and the FUI values were calculated. To build the match-ups, the pixel nearest to the sampling location was selected in the MOD09 daily data (MOD09GA), and the time difference was within 1 day. In total, 151 concurrent match-ups were obtained in the seven lakes (Table 2). As shown in Fig. 4, the MRD between the MODIS FUI and in situ derived FUI was 6.5%, and the RMSE between them was 1.09. Given that the acceptable error level in satellite water colour products is ~30% 32,33, our error of <10% demonstrates the validity of the MODIS FUI results. Moreover, the MODIS FUI was derived with a consistent methodology and dataset, which further supports its use for water colour change detection.
Table 2. Lake name, location, sampling date, and the number of match-ups (N) in the in situ dataset.
Cross validation with Diversity II data. Water quality parameters (TSM and turbidity) provided by the Diversity II dataset were used to cross validate the MODIS FUI dataset presented here in several large lakes around the world. The Diversity II datasets were produced from Medium Resolution Imaging Spectrometer (MERIS) data using optimised water quality retrieval algorithms for inland waters 34. This dataset provides water quality data (e.g. Chl-a, TSM, and turbidity) for ~300 large lakes around the world from 2002-2012. As studies have shown 6,12,14,15, the FUI of water is well-correlated with Secchi disk depth and turbidity and can also indicate TSM in turbid waters with high suspended sediment. Therefore, our long-term monthly FUI data were compared with monthly turbidity and TSM data from the Diversity II dataset in several lakes, including Lake Namco, Lake Silingco, Lake Ontario, Lake Ladoga, and Lake Taihu (Fig. 5). This comparison showed that MODIS FUI and MERIS turbidity generally had similar temporal patterns and trends in Lakes Namco, Silingco, and Ontario, which are relatively clear waters located on the Qinghai-Tibet Plateau (Namco and Silingco) and in North America (Ontario). In these lakes, MERIS TSM data had long-term trends broadly similar to those of MERIS turbidity and MODIS FUI, although some details differ. This is probably because water constituents other than TSM (such as CDOM) may also affect water colour and FUI. However, in Lake Taihu in eastern China, which is very turbid and dominated by TSM 35, FUI was better correlated with TSM than with turbidity. In Lake Ladoga in northwestern Russia, the correlation between FUI and TSM is slightly higher than that between FUI and turbidity, which suggests that TSM has a slightly larger effect on FUI in this lake.
Figure 6 shows correlations between the monthly FUI and turbidity in Lake Namco, Lake Silingco, and Lake Ontario, and the correlations between monthly FUI and TSM in Lake Ladoga and Lake Taihu. Our FUI data and the Diversity II turbidity or TSM data generally showed good agreement, with correlation coefficients (R) ranging from 0.47-0.73, consistent with previous research showing that FUI can be used as an indicator of water clarity 12,14,15. The correlation coefficients in Lakes Namco and Silingco were higher than those in the other three lakes, which is reasonable because water constituents in the latter three are generally more complicated 28,[35][36][37][38]. In addition to the suspended solids quantified by TSM and turbidity, CDOM may also play an important role in driving water colour changes in some saline lakes and lakes surrounded by forests or agricultural land 39, so in these cases the correlation between FUI and turbidity or TSM might be weak. As the Diversity II data and our FUI data were produced using different satellite data (MERIS and MODIS, respectively), the good agreement between the two datasets shown here also demonstrates the reliability of both satellite datasets for studying long-term water colour and water quality parameters for inland waters when assisted by proper atmospheric corrections.
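A comparison of this kind can be reproduced for any pair of monthly series with a short Python sketch. It assumes that R is the Pearson correlation coefficient, that the two series are already aligned month by month, and that missing months are stored as NaN; the function name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation of two aligned series, skipping months missing in either."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ok = np.isfinite(a) & np.isfinite(b)
    if ok.sum() < 3:
        raise ValueError("need at least three paired values")
    return float(np.corrcoef(a[ok], b[ok])[0, 1])
```

Here `a` could be the monthly FUI of a lake and `b` the Diversity II turbidity or TSM for the same months.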

Code availability
The IDL code named MODIS_FUI.pro for calculating FUI from MOD09A1 data is also available via Figshare 25. The code contains a few steps that require ENVI software, so it must be run in the ENVI + IDL environment. ENVI version 5.3 and IDL version 8.5 were used in code development.