Caravan - A global community dataset for large-sample hydrology

Kratzert, Frederik; Nearing, Grey; Addor, Nans; Erickson, Tyler; Gauch, Martin; Gilon, Oren; Gudmundsson, Lukas; Hassidim, Avinatan; Klotz, Daniel; Nevo, Sella; Shalev, Guy; Matias, Yossi

doi:10.1038/s41597-023-01975-w


Data Descriptor
Open access
Published: 31 January 2023


Scientific Data volume 10, Article number: 61 (2023)


Abstract

High-quality datasets are essential to support hydrological science and modeling. Several CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) datasets exist for specific countries or regions, however these datasets lack standardization, which makes global studies difficult. This paper introduces a dataset called Caravan (a series of CAMELS) that standardizes and aggregates seven existing large-sample hydrology datasets. Caravan includes meteorological forcing data, streamflow data, and static catchment attributes (e.g., geophysical, sociological, climatological) for 6830 catchments. Most importantly, Caravan is both a dataset and open-source software that allows members of the hydrology community to extend the dataset to new locations by extracting forcing data and catchment attributes in the cloud. Our vision is for Caravan to democratize the creation and use of globally-standardized large-sample hydrology datasets. Caravan is a truly global open-source community resource.


Background & Summary

Data underpin our understanding of the storage and transport of water at the Earth’s surface. Hydrological processes (e.g., streamflow generation) are governed by hydroclimatic variables (e.g., rainfall, temperature, humidity) and landscape characteristics (e.g., soils, landcover, human intervention). These interactions govern the availability of water resources and the occurrence of extreme events like floods and droughts.

Detailed datasets combining hydroclimatic time series, landscape attributes, and/or hydrological response variables like streamflow exist for many experimental catchments, in many cases spanning decades^1,2,3. However, it is not possible to capture the diversity of hydrological behavior from any individual watershed. In parallel, there also exist tens of thousands of gauges monitoring rivers across the world. Although data available from these gauges are limited in that they do not describe all of the hydrological processes in a given watershed, the large number of gauges means that they cover a wide range of hydrological regimes and extreme events^4,5,6,7. Gupta et al.⁸ argued that large sample sizes allow for assessment of the generality of hydrological models and research findings. Large sample sizes also allow for large-scale research like detecting and attributing systematic shifts in terrestrial water availability at regional^9,10 to global scales^11,12. Moreover, large sample datasets are necessary for developing generalizable data-driven models^13,14,15,16.

Recognizing this has led to the development of a sub-discipline in the hydrological sciences called large-sample hydrology (LSH), which relies on data from hundreds to thousands of catchments¹⁷. There are an increasing number of publicly available LSH datasets. Arguably, the first open LSH dataset was from the Model Parameter Estimation Experiment (MOPEX)¹⁸, which contains data from 431 basins within the United States through 2003. Later datasets were developed for specific countries or regions, including Australia¹⁹, Austria²⁰, Brazil²¹, North America²², China²³, Chile²⁴, Europe²⁵, Great Britain²⁶, Thailand http://hydro.iis.u-tokyo.ac.jp/GAME-T/GAIN-T/routine/rid-river/index.html, the United States^27,28, and the Arctic https://www.r-arcticnet.sr.unh.edu/v4.0/index.html. Many of these are referred to as Catchment Attributes and MEteorology for Large-sample Studies (CAMELS) datasets^19,21,24,26,28.

Although none of the existing CAMELS datasets are global, there are global collections of streamflow data like the Global Streamflow Indices and Metadata Archive (GSIM)^29,30, which provides monthly and seasonal streamflow indices for 35,000+ locations, and the Global Runoff Data Base https://www.bafg.de/GRDC, which provides river discharge estimates at 10,000+ locations. Neither of these collections, however, is coupled with catchment attributes or meteorological forcing data. Critically, GSIM does not provide daily streamflow data (only indices), and GRDC does not allow for redistribution of raw data, which makes it difficult to standardize with other datasets. Furthermore, although data from 10,000+ stations are available through GRDC, both the quality of the available records and the period of record for individual basins vary significantly³⁰. On the other hand, HydroATLAS³¹ provides global catchment attributes, but does not include meteorological or streamflow data. There are also proprietary or non-public hydrological datasets that have been used for hydrological research–for example, datasets used by Beck et al.^32,33 for global model calibration or by Blöschl et al.³⁴ for extrapolating climate change impacts on flooding (less than a third of one percent of the daily time series used in the latter study are publicly available, last access 20th March 2022). There are many reasons why proprietary datasets exist in today’s research landscape. These often encompass causes that lie outside the sphere of influence of individual research groups. However, from a scientific perspective, proprietary datasets are a roadblock to open, collaborative, reproducible, and extensible research.

Aside from the fact that no comprehensive, global LSH dataset exists, Addor et al.¹⁷ identified four major limitations of many of the existing region-specific datasets: (i) lack of common standards to allow for intercomparison, (ii) lack of metadata and uncertainty estimates to assess data reliability, (iii) lack of information about human interventions, and (iv) limited accessibility. Addor et al.¹⁷ also outlined desiderata for standardizing and automating the development of LSH datasets, including (i) basic data requirements, (ii) naming conventions for hydrologically-relevant variables, (iii) publicly available data processing code, (iv) uncertainty estimates, (v) anthropogenic descriptors, and (vi) adhering to FAIR data standards³⁵. They propose that community, cloud-based infrastructure could help overcome these limitations, by allowing for the use and development of standardized practices and codebases.

The Caravan dataset presented here is a step toward realizing this vision. The basis for Caravan is a collection of region-specific datasets, which are merged and standardized in a way that is designed with the following characteristics:

1. Standardized: Data are standardized globally, meaning that the same meteorological and landscape variables exist for all catchments, and are derived using the same procedures from the same source datasets.
2. Open: All data are publicly available with an open license.
3. Extensible: All software tools and source datasets used to produce Caravan are open and accessible through a cloud platform (Google Earth Engine) to enable others to extend (i.e., add catchments to) the dataset.

The third point is especially important. Most streamflow gauges are maintained by local or national organizations, and the data from these gauges are rarely FAIR (Findable, Accessible, Interoperable and Re-usable). Caravan is designed to be extensible, so that anyone can easily derive meteorological forcings and landscape attributes for additional catchments using a standardized procedure. This allows new catchments to be used in the context of this larger dataset (e.g., for training models, assessing relative climate impacts, etc.), and it allows organizations with streamflow data from any number of catchments (from one to thousands) to quickly and easily add their data to the larger public Caravan dataset in a way that is standardized with all other catchment data. Our vision is for Caravan to be the platform for a larger community data resource–we see this as perhaps the most direct path to developing a truly open global hydrological dataset. The current Caravan dataset that we introduce here includes streamflow observations from 6830 basins, spanning most Global Environmental Stratification (GEnS) climate zones³⁶, with the exception of arctic, extreme cold, and arid zones (Fig. 1). Caravan includes daily data from almost four decades (1981–2020), including catchments that experienced significant climate trends (Fig. 2).

Methods

Basin selection & streamflow data

Daily streamflow observations for the 6830 basins currently in Caravan were aggregated from several existing open datasets:

482 basins from CAMELS (US)²⁷
150 basins from CAMELS-AUS¹⁹
376 basins from CAMELS-BR²¹
314 basins from CAMELS-CL (using an updated Version from January 2022)²⁴
408 basins from CAMELS-GB²⁶
4621 basins from HYSETS²²
479 basins from LamaH-CE²⁰

These datasets were selected because (i) they include catchment boundaries for each streamflow gauge, and (ii) their licenses allow redistribution. Furthermore, we currently only include basins with areas equal to or larger than 100 km² and smaller than 2000 km². Streamflow data are normalized by catchment area to units of mm/day. All data are reported in the local time zone (non-daylight saving time for the entire year) of the gauge station, which is included in metadata.
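The area normalization and the size filter described above can be illustrated with a short sketch. It assumes the source discharge is given in m³/s (the unit is not stated here and may differ between source datasets); the function names are hypothetical and are not taken from the Caravan code base.

```python
SECONDS_PER_DAY = 86_400
MIN_AREA_KM2 = 100.0   # inclusive lower bound stated in the text
MAX_AREA_KM2 = 2000.0  # exclusive upper bound stated in the text


def discharge_to_mm_per_day(discharge_m3_s: float, area_km2: float) -> float:
    """Convert volumetric discharge (assumed m^3/s) to a depth in mm/day."""
    if area_km2 <= 0:
        raise ValueError("catchment area must be positive")
    volume_m3_per_day = discharge_m3_s * SECONDS_PER_DAY
    area_m2 = area_km2 * 1_000_000          # 1 km^2 = 1e6 m^2
    depth_m_per_day = volume_m3_per_day / area_m2
    return depth_m_per_day * 1000.0         # m -> mm


def within_size_limits(area_km2: float) -> bool:
    """True if the basin area is >= 100 km^2 and < 2000 km^2."""
    return MIN_AREA_KM2 <= area_km2 < MAX_AREA_KM2
```

Dividing by the catchment area makes flows from basins of very different sizes directly comparable with precipitation and evaporation depths, which are also expressed in mm/day.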

The time periods of available streamflow observations vary between basins; however, we did not include any streamflow data prior to 1981 because this is the beginning of the ERA5-Land reanalysis, which was used to derive meteorological forcing data. Figure 3 shows density of streamflow records through time (left) and the distribution of lengths of daily streamflow records (right), emphasizing that comparatively long flow time series are available for the Caravan catchments (the median length is 31 years).

Meteorological forcing data

Caravan includes meteorological forcing data from ERA5-Land³⁷. This choice was made for the following reasons:

Global coverage and spatial consistency: Although ERA5-Land data products are often lower-accuracy (i.e., more uncertain) than local, high-resolution meteorological data sets, only globally available data sets allow for comparative studies at a global scale.
Sub-daily (e.g., hourly) resolution: All daily average streamflow observations in the source datasets are reported in the corresponding local time of the gauge station. In contrast, global meteorological data products are usually provided in GMT + 0. To be able to calculate the matching daily average meteorological forcing data for the daily averaged streamflow observation, it is therefore necessary to have sub-daily meteorological data, so that we can shift the meteorological data according to the local time zone of the gauge station, before computing daily aggregates.
Availability in the cloud: one of our goals was to do all heavy computing tasks in the cloud (here: Google Earth Engine). ERA5-Land provides hourly data on Google Earth Engine.
Permissive license: A core principle of Caravan is to democratize LSH datasets and dataset development. ERA5-Land has a permissive license that allows free distribution.

ERA5-Land meteorological variables used in Caravan are listed in Table 1–these are typical variables used as forcing data (or boundary conditions) for hydrology and land surface models. We first computed the area-weighted spatial average for each variable in each catchment area from hourly spatial data (~9 km spatial resolution) and shifted the hourly time series (natively at GMT + 0) to the local time of each gauge. We then computed different daily statistics for each variable according to the Aggregation column in Table 1.
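The shift-then-aggregate step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Caravan processing code: it assumes the basin-averaged hourly series is already available as (UTC timestamp, value) pairs and that the gauge's time zone is a fixed offset from UTC (no daylight saving time, as stated above). The function name and arguments are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_aggregates(hourly_utc, utc_offset_hours, how="mean"):
    """Shift hourly UTC samples to local standard time, then aggregate per local day.

    hourly_utc: iterable of (naive UTC datetime, value) pairs.
    utc_offset_hours: fixed offset of the gauge's time zone from UTC.
    how: one of "mean", "sum", "min", "max".
    """
    reducers = {
        "mean": lambda v: sum(v) / len(v),
        "sum": sum,
        "min": min,
        "max": max,
    }
    reduce_values = reducers[how]
    offset = timedelta(hours=utc_offset_hours)

    by_day = defaultdict(list)
    for timestamp, value in hourly_utc:
        local_time = timestamp + offset      # shift before grouping by day
        by_day[local_time.date()].append(value)

    return {day: reduce_values(values) for day, values in sorted(by_day.items())}
```

Note that the first and last local days of a shifted series generally contain fewer than 24 hourly samples, so in practice one would either pad the hourly record or discard those incomplete days.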

Table 1 ERA5-Land meteorological variables. Daily aggregates are computed in local time of each basin.


Reference model states

In addition to meteorological forcing data, Caravan includes time series of modeled soil moisture and snow states from ERA5-Land (Table 2). These time series are included to provide reference values or benchmark values for studies that analyze or model hydrological states. These time series data were processed in the same way as meteorological forcing data.

Table 2 ERA5-Land model state variables. Daily aggregates are computed in local time of each basin.


Catchment attributes

Caravan includes two sets of catchment attributes: (i) attributes derived from HydroATLAS^31,38 and (ii) climate attributes derived from the daily ERA5-Land time series included in Caravan. The latter are similar to the climate attributes provided in CAMELS-US²⁸. The reasons for choosing HydroATLAS as the source for the former are similar to the reasons for choosing ERA5-Land for time series data: HydroATLAS has global coverage with a license that allows for redistribution.

The catchment attributes derived from HydroATLAS use the highest resolution shape file available in that dataset (level 12). The level 12 HydroATLAS polygons are, for the vast majority of basins, smaller than the catchment boundaries for each gauge station provided by the respective CAMELS datasets–i.e., a single polygon representing the drainage area for a specific gauge includes multiple HydroATLAS polygons. Therefore, we first computed the spatial join of the HydroATLAS polygons and the catchment boundaries and then derived the catchment attributes as an area-weighted aggregate (see the Aggregation column in Tables 3, 4). Catchment attributes included in Caravan can be loosely grouped into the following categories: hydrology, physiography, climatology, soils & geology, land cover characteristics, and anthropogenic influences. A full list of all catchment attributes derived from HydroATLAS is given in Tables 3 and 4. Table 5 contains a list of attributes that were derived from ERA5-Land time series. Lastly, Table 6 lists additional attributes that are also included in Caravan, such as the latitude and longitude coordinates of each gauge station, the station name, the country of the gauge station location and the catchment area.
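As an illustration of the area-weighted aggregation, the sketch below computes a weighted mean of one attribute over the HydroATLAS polygons that intersect a catchment. It assumes the spatial join has already produced, for each polygon, the attribute value and the overlap area with the catchment; the function name and inputs are hypothetical, and only the weighted mean is shown (Tables 3 and 4 specify the aggregation used for each attribute).

```python
def area_weighted_mean(values, overlap_areas):
    """Area-weighted mean of an attribute over intersecting polygons.

    values: attribute value of each HydroATLAS polygon.
    overlap_areas: area of each polygon that lies inside the catchment
        (same order as values, any consistent area unit).
    """
    if len(values) != len(overlap_areas):
        raise ValueError("values and overlap_areas must have the same length")
    total_area = sum(overlap_areas)
    if total_area <= 0:
        raise ValueError("total overlap area must be positive")
    weighted_sum = sum(v * a for v, a in zip(values, overlap_areas))
    return weighted_sum / total_area
```

For example, two polygons with attribute values 10 and 20 that contribute 1 km² and 3 km² to a catchment give a weighted mean of 17.5, closer to the value of the polygon that covers more of the catchment.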

Table 3 First set of catchment attributes derived from HydroATLAS that are included in Caravan from the groups Hydrology, Physiography, and Climate.


Table 4 Second set of catchment attributes derived from HydroATLAS that are included in Caravan from the groups Land Cover, Soils & Geology, and Anthropogenic.


Table 5 Climate attributes derived from ERA5-Land time series.


Table 6 Metadata and other attributes.


Data processing in the cloud

The major computational challenge for developing LSH datasets is processing gridded meteorological and attributes data. To make the development and augmentation of Caravan as democratic as possible (i.e., to make it as easy as possible for anyone to add new watersheds or new data layers to the dataset), all of our data processing scripts use Google Earth Engine via Python APIs. Google Earth Engine³⁹ is a free-to-use cloud service with a large catalogue of geospatial data, including all of the datasets described above. The Caravan data processing scripts interact with Earth Engine directly through APIs, so that there is no need for individuals to download data from Earth Engine outside of these scripts. This has two benefits: users do not need to download and store large amounts of gridded meteorological data, and no specific hardware is required. Any individual hydrologist, modeler, researcher, or student should be able to process even large numbers of new watersheds with minimal effort or expense. All that is necessary to add a new gauge to the Caravan dataset is a shapefile representing the drainage area of the catchment, plus a time series of daily or sub-daily streamflow (discharge) values from that gauge in local time. Instructions about how to add new catchments to Caravan are provided in a Readme file in the dataset repository.

Data Records

The current version of the Caravan dataset (6830 watersheds)⁴⁰ is available at https://doi.org/10.5281/zenodo.7540792. A project homepage is available at https://github.com/kratzert/Caravan/, which hosts all code and is where news and updates are announced.

The dataset is organized into the following subfolders:

The attributes folder contains one subfolder per source dataset, each of which contains two csv (comma separated values) files. One file (‘attributes_hydroatlas_{source}.csv’) contains attributes derived from HydroATLAS and the other file (‘attributes_caravan_{source}.csv’) contains climate indices derived from ERA5-Land, where {source} indicates the corresponding source data set (e.g. camelsgb for CAMELS-GB, camelscl for CAMELS-CL, and so on). The first column in all attributes files is called ‘gauge_id’ and contains a unique basin identifier of the form ‘{source}_{id}’, where {source} again is the abbreviation of the corresponding source dataset, and {id} is the basin id as defined in the original source dataset.
The shapefiles folder contains one subfolder per source dataset. Each of these subfolders contains a shapefile with the catchment boundaries of each basin within that dataset. These are the shapefiles that were used to derive the catchment attributes and ERA5-Land time series data. Each polygon in a given shapefile has a field ‘gauge_id’ that contains the unique basin identifier.
The timeseries folder contains two subfolders, csv and netcdf, that both share the same structure and contain the same data, once as csv-files and once as netCDF files. Each of these two subfolders contains one subfolder per source dataset. Within these source dataset specific subdirectories, there is one file (either csv or netCDF) per basin, containing all time series data (meteorological forcings, state variables, and streamflow). The netCDF files also contain metadata information, including physical units, timezones, and information on the data sources.
The code folder contains all scripts and Jupyter notebooks that were used to derive the data set. These scripts can be used to extend the data set to any new basin in the world. Instructions are included in the README.md file contained in this folder.
The licenses folder contains license information for all data included in Caravan and for Caravan itself. General license information is listed in the README.md file in this directory; information specific to each source dataset is listed in the files located in the corresponding subdirectories.
The README.md file in the main directory includes a description of the dataset structure, information on the units of time series data, and time zones.
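The folder layout and the ‘{source}_{id}’ identifier convention above can be sketched in code. The folder names and the attribute file names follow the description; the per-basin file name (‘{gauge_id}.csv’ or ‘{gauge_id}.nc’), the example identifier, and the function names are assumptions made for illustration, as is the premise that source abbreviations contain no underscore.

```python
from pathlib import PurePosixPath


def split_gauge_id(gauge_id):
    """Split '{source}_{id}' at the first underscore into (source, id)."""
    source, separator, basin_id = gauge_id.partition("_")
    if not separator or not source or not basin_id:
        raise ValueError(f"not of the form '{{source}}_{{id}}': {gauge_id!r}")
    return source, basin_id


def timeseries_path(root, gauge_id, fmt="csv"):
    """Path of a basin's time series file (file naming is an assumption)."""
    if fmt not in ("csv", "netcdf"):
        raise ValueError("fmt must be 'csv' or 'netcdf'")
    source, _ = split_gauge_id(gauge_id)
    suffix = ".csv" if fmt == "csv" else ".nc"
    return PurePosixPath(root) / "timeseries" / fmt / source / f"{gauge_id}{suffix}"


def attributes_paths(root, source):
    """Paths of the HydroATLAS and ERA5-Land derived attribute files."""
    base = PurePosixPath(root) / "attributes" / source
    return (
        base / f"attributes_hydroatlas_{source}.csv",
        base / f"attributes_caravan_{source}.csv",
    )
```

With a hypothetical identifier such as ‘camelsgb_12345’, `split_gauge_id` yields the source ‘camelsgb’, which selects the subfolder in each of the attributes, shapefiles, and timeseries folders.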

All time series data except streamflow are aggregated (daily and spatially over basins) from ERA5-Land. ERA5-Land is available directly from ECMWF⁴¹; however, we used the Google Earth Engine repository. HydroATLAS attributes were derived from the HydroATLAS dataset⁴². Streamflow time series are collected from the respective region-specific repositories: Australia⁴³, Brazil⁴⁴, Canada²², Chile⁴⁵, Great Britain⁴⁶, LamaH-CE (Austrian territory and Danube catchment up to Bratislava)⁴⁷, and the United States⁴⁸.

Technical Validation

Aggregating HydroATLAS attributes

The majority of catchment attributes are derived from HydroATLAS. The key challenge in extracting data from HydroATLAS is to define which HydroATLAS polygons are within a given gauge’s drainage area. The primary complication is that the datasets (i.e., the various CAMELS datasets and HydroATLAS) use shapefiles derived from different digital elevation models (DEMs) at different spatial resolutions. This means that catchment boundaries from the source datasets do not perfectly align with the polygons in HydroATLAS. An example of this is shown in Fig. 4. This figure shows the drainage area for a particular gauge, as specified by the shapefile in the CAMELS dataset (first panel), the collocated HydroATLAS subbasin polygons (second panel), and the mismatch between the two due to different datasets deriving catchment boundaries from different DEMs (third panel).

Because of this mismatch along catchment boundaries between different watershed delineations in different datasets, we chose to only include gauges with total drainage areas of at least 100 km². In smaller catchments, this boundary effect can represent a significant fraction of the total area of the catchment–an example of this is illustrated in Fig. 5. To quantify this area mismatch, we included a static feature called area_fraction_used_for_aggregation, which is the ratio of the area used for the aggregation to the total catchment area. In Fig. 4c, this would be the green area divided by the sum of the green and orange areas. The distribution of these values across all basins is shown in Fig. 6.
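The ratio behind area_fraction_used_for_aggregation can be written out as a small sketch. The two inputs correspond to the green area (catchment area covered by the HydroATLAS polygons used for aggregation) and the orange area (catchment area not covered); the function signature and the numbers in the note below are hypothetical.

```python
def area_fraction_used_for_aggregation(covered_area_km2, uncovered_area_km2):
    """Fraction of the catchment area that contributed to the aggregation.

    covered_area_km2: catchment area overlapped by the polygons used
        for aggregation (green area in Fig. 4c).
    uncovered_area_km2: remaining catchment area (orange area in Fig. 4c).
    """
    if covered_area_km2 < 0 or uncovered_area_km2 < 0:
        raise ValueError("areas must be non-negative")
    total_area = covered_area_km2 + uncovered_area_km2
    if total_area == 0:
        raise ValueError("total catchment area must be positive")
    return covered_area_km2 / total_area
```

To see why small catchments are more affected, consider the same hypothetical 5 km² boundary mismatch: it leaves a fraction of 0.9 in a 50 km² catchment but 0.995 in a 1000 km² catchment.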

Validating meteorological time series

Like most data about the natural environment, hydrological data is typically associated with significant uncertainty. Quantifying uncertainty is a central part of hydrological research^49,50, and usually involves intensive field campaigns^51,52, statistical comparison between several data products^53,54,55, or modeling studies^56,57–all of which are outside the scope of the current project. We can, however, statistically verify the processing tools that were used to develop the Caravan data from existing datasets. We did this verification by comparing Caravan-derived meteorological forcings (from ERA5-Land) with forcings from CAMELS-US. CAMELS-US was chosen because it includes three independent meteorological data sources (NLDAS, Maurer, DayMet). There will always be some amount of variability between any two meteorological datasets, and having three meteorological data products allows us to contextualize any variability between Caravan forcings and CAMELS-US forcings.

We calculated the correlation (Pearson r) between each pair of forcing data products (NLDAS, Maurer, DayMet, ERA5-Land) separately in each basin (n = 482) for three meteorological variables: total daily precipitation and daily maximum and minimum temperatures. We then used a set of one-tailed, paired t-tests to test hypotheses that for each of the three meteorological variables, correlations between Caravan and any individual CAMELS-US data product were significantly (α = 0.90) lower than correlations between each pair of CAMELS-US forcing products. Figure 7 shows the results of these tests. Although certain forcings are more highly correlated than others (e.g., DayMet and Maurer are more highly correlated than DayMet and NLDAS), correlations between Caravan and CAMELS-US data products were not consistently lower than correlations between different CAMELS-US data products.
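The two statistics involved in this comparison can be sketched with the Python standard library. This is an illustration of the general technique (per-basin Pearson correlation followed by a paired t statistic across basins), not the authors' code; the function names are hypothetical, and the p-value step is omitted because the standard library provides no Student t distribution.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equally long series (e.g., one basin)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def paired_t_statistic(sample_a, sample_b):
    """t statistic of the paired differences a - b (n - 1 degrees of freedom).

    sample_a, sample_b: per-basin correlations for two product pairs,
        in the same basin order.
    """
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("samples must have equal length of at least 2")
    differences = [a - b for a, b in zip(sample_a, sample_b)]
    standard_error = stdev(differences) / math.sqrt(len(differences))
    return mean(differences) / standard_error
```

In a one-tailed test of whether the correlations in sample_a are lower than those in sample_b, a strongly negative t statistic (compared against the critical value of a t distribution with n - 1 degrees of freedom) would support that hypothesis.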

Usage Notes

Our vision for Caravan is as the foundation of a dynamically growing community LSH dataset that anyone in the hydrology community can access and augment. Currently, the spatial distribution of basins included in Caravan is limited to a few regions in the world, see Fig. 1. We hope that some users will be willing (and allowed) to share their data, so that Caravan, over time, will contain discharge data from most parts of the world. In fact, while this manuscript was in review, a community extension was provided, adding 308 basins from Denmark⁵⁸. Detailed instructions for adding new catchments to Caravan are provided in the dataset repository, as well as in the code repository. This includes all code necessary to derive meteorological and attributes data on Google Earth Engine for any new basin globally. All computation can be done for free using Google Earth Engine.

In the introduction, we noted that Addor et al.¹⁷ listed six desiderata for LSH datasets. Caravan meets five of those six criteria–the missing desideratum is to have uncertainty estimates on all data components. Assessing uncertainty in hydrological data is difficult without relying on strong assumptions (often, some type of hydrological model), and we expect that future work will apply various methods for quantifying the uncertainty in global rainfall-runoff datasets. A comparison of the attributes and time series provided in Caravan with those from the original LSH datasets could perhaps provide new insights into their uncertainty, and inform the selection of datasets for hydrology.

Code availability

The code that was used to produce the Caravan dataset is available at https://github.com/kratzert/Caravan/.

References

Goodrich, D. et al. The usda-ars experimental watershed network: Evolution, lessons learned, societal benefits, and moving forward. Water Resources Research 57, e2019WR026473 (2021).
Likens, G. E. The watershed-ecosystem approach. Hydrological Processes 35, e13977, https://doi.org/10.1002/hyp.13977 (2021).
Goodman, K. J., Parker, S. M., Edmonds, J. W. & Zeglin, L. H. Expanding the scale of aquatic sciences: the role of the national ecological observatory network (neon). Freshwater Science 34, 377–385 (2015).
Kovács, G. Proposal to construct a coordinating matrix for comparative hydrology. Hydrological Sciences Journal 29, 435–443 (1984).
Falkenmark, M. & Chapman, T. Comparative hydrology: An ecological approach to land and water resources (Unesco, 1989).
Andreassian, V., Hall, A., Chahinian, N. & Schaake, J. Introduction and synthesis: Why should hydrologists work on a large number of basin data sets? In Andreassian, V., Hall, A., Chahinian, N. & Schaake, J. (eds.) Large sample basin experiments for hydrological model parameterization: results of the model parameter experiment–MOPEX, vol. IAHS Publ. 307, 1–5 (Wallingford: IAHS Press, 2006).
Blöschl, G. et al. Twenty-three unsolved problems in hydrology (uph)–a community perspective. Hydrological sciences journal 64, 1141–1158 (2019).
Gupta, H. V. et al. Large-sample hydrology: a need to balance depth with breadth. Hydrology and Earth System Sciences 18, 463–477 (2014).
Stahl, K. et al. Streamflow trends in europe: evidence from a dataset of near-natural catchments. Hydrology and Earth System Sciences 14, 2367–2382, https://doi.org/10.5194/hess-14-2367-2010 (2010).
Gudmundsson, L., Seneviratne, S. I. & Zhang, X. Anthropogenic climate change detected in european renewable freshwater resources. Nature Climate Change 7, 813–816 (2017).
Gudmundsson, L., Leonard, M., Do, H. X., Westra, S. & Seneviratne, S. I. Observed trends in global indicators of mean and extreme streamflow. Geophysical Research Letters 46, 756–766, https://doi.org/10.1029/2018GL079725 (2019).
Gudmundsson, L. et al. Globally observed trends in mean and extreme river flow attributed to climate change. Science 371, 1159–1162, https://doi.org/10.1126/science.aba3996 (2021).
Kratzert, F. et al. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019 (2019).
Kratzert, F. et al. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research 55, 11344–11354, https://doi.org/10.1029/2019WR026065 (2019).
Ghiggi, G., Humphrey, V., Seneviratne, S. I. & Gudmundsson, L. Grun: an observation-based global gridded runoff dataset from 1902 to 2014. Earth System Science Data 11, 1655–1674, https://doi.org/10.5194/essd-11-1655-2019 (2019).
Ghiggi, G., Humphrey, V., Seneviratne, S. I. & Gudmundsson, L. G-run ensemble: A multi-forcing observation-based global runoff reanalysis. Water Resources Research 57, e2020WR028787, https://doi.org/10.1029/2020WR028787 (2021).
Addor, N. et al. Large-sample hydrology: recent progress, guidelines for new datasets and grand challenges. Hydrological Sciences Journal 65, 712–725 (2020).
Schaake, J., Cong, S. & Duan, Q. The US MOPEX data set. In Andreassian, V., Hall, A., Chahinian, N. & Schaake, J. (eds.) Large sample basin experiments for hydrological model parameterization: results of the model parameter experiment–MOPEX, vol. IAHS Publ. 307, 9–28 (Wallingford: IAHS Press, 2006).
Fowler, K. J., Acharya, S. C., Addor, N., Chou, C. & Peel, M. C. CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in australia. Earth System Science Data 13, 3847–3867 (2021).
Klingler, C., Schulz, K. & Herrnegger, M. LamaH-CE: Large-sample data for hydrology and environmental sciences for central europe. Earth System Science Data 13, 4529–4565 (2021).
Chagas, V. B. et al. CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in brazil. Earth System Science Data 12, 2075–2096 (2020).
Arsenault, R. et al. A comprehensive, multisource database for hydrometeorological modeling of 14,425 north american watersheds. Scientific Data 7, 1–12 (2020).
Hao, Z. et al. CCAM: China catchment attributes and meteorology dataset. Earth System Science Data 13, 5591–5616 (2021).
Alvarez-Garreton, C. et al. The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies–chile dataset. Hydrology and Earth System Sciences 22, 5817–5846 (2018).
Kuentz, A., Arheimer, B., Hundecha, Y. & Wagener, T. Understanding hydrologic variability across europe through catchment classification. Hydrology and Earth System Sciences 21, 2863–2879 (2017).
Coxon, G. et al. CAMELS-GB: Hydrometeorological time series and landscape attributes for 671 catchments in great britain. Earth System Science Data 12, 2459–2483 (2020).
Newman, A. et al. Development of a large-sample watershed-scale hydrometeorological data set for the contiguous usa: data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences 19, 209–223 (2015).
Addor, N., Newman, A. J., Mizukami, N. & Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences 21, 5293–5313 (2017).
Do, H. X., Gudmundsson, L., Leonard, M. & Westra, S. The global streamflow indices and metadata archive (gsim)–part 1: The production of a daily streamflow archive and metadata. Earth System Science Data 10, 765–785 (2018).
Gudmundsson, L., Do, H. X., Leonard, M. & Westra, S. The global streamflow indices and metadata archive (GSIM)–part 2: Quality control, time-series indices and homogeneity assessment. Earth System Science Data 10, 787–804 (2018).
Linke, S. et al. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific data 6, 1–15, https://doi.org/10.1038/s41597-019-0300-6 (2019).
Beck, H. E. et al. Global-scale regionalization of hydrologic model parameters. Water Resources Research 52, 3599–3622 (2016).
Beck, H. E. et al. Global fully distributed parameter regionalization based on observed streamflow from 4,229 headwater catchments. Journal of Geophysical Research: Atmospheres 125, e2019JD031485 (2020).
Blöschl, G. et al. Changing climate both increases and decreases European river floods. Nature 573, 108–111 (2019).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 1–9 (2016).
Metzger, M. J. et al. A high-resolution bioclimate map of the world: a unifying framework for global biodiversity research and monitoring. Global Ecology and Biogeography 22, 630–638 (2013).
Muñoz-Sabater, J. et al. ERA5-Land: A state-of-the-art global reanalysis dataset for land applications. Earth System Science Data 13, 4349–4383 (2021).
Lehner, B. HydroATLAS version 1.0 data download. Figshare https://doi.org/10.6084/m9.figshare.9890531.v1 (2022).
Gorelick, N. et al. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment https://doi.org/10.1016/j.rse.2017.06.031 (2017).
Kratzert, F. et al. Caravan - A global community dataset for large-sample hydrology (Version 1.0), Zenodo, https://doi.org/10.5281/ZENODO.7540792 (2022).
Muñoz Sabater, J. et al. ERA5-Land hourly data from 1981 to present. ECMWF https://doi.org/10.24381/cds.e2161bac (2021).
Lehner, B., Linke, S. & Thieme, M. HydroATLAS version 1.0. Figshare https://doi.org/10.6084/m9.figshare.9890531.v1 (2019).
Fowler, K., Acharya, S. C., Addor, N., Chou, C. & Peel, M. CAMELS-AUS v1: Hydrometeorological time series and landscape attributes for 222 catchments in Australia. PANGAEA https://doi.org/10.1594/PANGAEA.921850 (2020).
Chagas, V. B. P. et al. CAMELS-BR: Hydrometeorological time series and landscape attributes for 897 catchments in Brazil. Zenodo https://doi.org/10.5281/zenodo.3964745 (2020).
Alvarez-Garreton, C. et al. The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies - Chile dataset. PANGAEA https://doi.org/10.1594/PANGAEA.894885 (2018).
Coxon, G. et al. Catchment attributes and hydro-meteorological timeseries for 671 catchments across Great Britain (CAMELS-GB). NERC Environmental Information Data Centre https://doi.org/10.5285/8344e4f3-d2ea-44f5-8afa-86d2987543a9 (2020).
Klingler, C., Kratzert, F., Schulz, K. & Herrnegger, M. LamaH-CE: Large-sample data for hydrology and environmental sciences for Central Europe. Zenodo https://doi.org/10.5281/zenodo.5153305 (2021).
Newman, A. et al. A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. UCAR/NCAR - GDEX https://doi.org/10.5065/D6MW2F4D (2014).
McMillan, H. K., Westerberg, I. K. & Krueger, T. Hydrological data uncertainty and its implications. Wiley Interdisciplinary Reviews: Water 5, e1319 (2018).
Beven, K. Facets of uncertainty: epistemic uncertainty, non-stationarity, likelihood, hypothesis testing, and communication. Hydrological Sciences Journal 61, 1652–1665 (2016).
Colliander, A. et al. Validation of SMAP surface soil moisture products with core validation sites. Remote Sensing of Environment 191, 215–231 (2017).
Habib, E. & Krajewski, W. F. Uncertainty analysis of the TRMM ground-validation radar-rainfall products: Application to the TEFLUN-B field campaign. Journal of Applied Meteorology 41, 558–572 (2002).
Kumar, S. V., Dirmeyer, P. A., Peters-Lidard, C. D., Bindlish, R. & Bolten, J. Information theoretic evaluation of satellite soil moisture retrievals. Remote Sensing of Environment 204, 392–400 (2018).
Nearing, G. S. et al. Nonparametric triple collocation. Water Resources Research 53, 5516–5530 (2017).
Alemohammad, S. H., McColl, K. A., Konings, A. G., Entekhabi, D. & Stoffelen, A. Characterization of precipitation product errors across the United States using multiplicative triple collocation. Hydrology and Earth System Sciences 19, 3489–3503 (2015).
McMillan, H., Jackson, B., Clark, M., Kavetski, D. & Woods, R. Rainfall uncertainty in hydrological modelling: An evaluation of multiplicative error models. Journal of Hydrology 400, 83–94 (2011).
Domeneghetti, A., Castellarin, A. & Brath, A. Assessing rating-curve uncertainty and its effects on hydraulic model calibration. Hydrology and Earth System Sciences 16, 1191–1202 (2012).
Koch, J. Caravan extension Denmark - Danish dataset for large-sample hydrology. Zenodo https://doi.org/10.5281/zenodo.6762361 (2022).
Knoben, W. J. M., Woods, R. A. & Freer, J. E. A quantitative hydrological climate classification evaluated with independent streamflow data. Water Resources Research 54, 5088–5109, https://doi.org/10.1029/2018WR022913 (2018).


Acknowledgements

Frederik Kratzert was partially supported by a Google Faculty Research Award (PI: Sepp Hochreiter, JKU Linz). Daniel Klotz was partially supported by Verbund AG. Martin Gauch was supported by the Linz Institute of Technology DeepFlood project. We would like to thank Shaun Harrigan and Ervin Zsoter at ECMWF for help with the ERA5-Land data product. We would also like to thank Kurt Schwehr with the Google Earth Engine team for helping facilitate public access to the HydroATLAS dataset. Additionally, we would like to thank Jon Schwenk, who reported a problem with how we derived some of the attributes and helped find a solution. This work is a contribution to the large-sample hydrology working group of the Panta Rhei research initiative of the International Association of Hydrological Sciences (IAHS).

Author information

Authors and Affiliations

Google Research, Vienna, Austria
Frederik Kratzert
Google Research, Mountain View, CA, USA
Grey Nearing
Fathom, Square Works, Bristol, UK
Nans Addor
Geography, University of Exeter, Exeter, UK
Nans Addor
Google, Mountain View, CA, USA
Tyler Erickson
Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Martin Gauch & Daniel Klotz
Google Research, Tel Aviv, Israel
Oren Gilon, Avinatan Hassidim, Sella Nevo, Guy Shalev & Yossi Matias
Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
Lukas Gudmundsson


Contributions

All co-authors (F.K., G.N., N.A., T.E., M.G., O.G., L.G., A.H., D.K., S.N., G.S., Y.M.) were involved in developing the concept for this dataset through extensive discussions about requirements, scope, and current data availability. F.K. wrote most of the data processing code, and T.E. wrote parts of the code for processing data on Earth Engine. G.N. did the trend analysis and comparison between ERA5-Land and CAMELS-US. F.K. created all figures. All co-authors participated in writing the manuscript.

Corresponding author

Correspondence to Frederik Kratzert.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Kratzert, F., Nearing, G., Addor, N. et al. Caravan - A global community dataset for large-sample hydrology. Sci Data 10, 61 (2023). https://doi.org/10.1038/s41597-023-01975-w


Received: 30 May 2022
Accepted: 18 January 2023
Published: 31 January 2023
DOI: https://doi.org/10.1038/s41597-023-01975-w
