Main

Global water models—including hydrological, land surface and dynamic vegetation models1—have become increasingly relevant for policymaking and in scientific studies. The Sixth Assessment Report2 of the Intergovernmental Panel on Climate Change draws heavily on results from global water models, which provide information about climate change impacts on hydrological variables including soil moisture3, streamflow4, terrestrial water storage5 and groundwater recharge6. Some of these models are already embedded in global water information services to provide information to a wide array of stakeholders, such as the Global Groundwater Information System7 or the African Flood and Drought Monitor8. Because measurements of many hydrological variables are very sparse and insufficient for large-scale analyses, global water models are regularly used in scientific studies to provide globally coherent estimates of variables such as groundwater recharge and groundwater storage change9,10. Global water models are also an integral part of Earth system models, and a realistic representation of the water cycle is essential for simulating the role of water within and across the different components of the Earth system11.

The Intergovernmental Panel on Climate Change’s Sixth Assessment Report2 concludes from an analysis of currently available global water model projections that ‘uncertainty in future water availability contributes to the policy challenges for adaptation, for example, for managing risks of water scarcity’. Whereas some of this uncertainty stems from projected and observed climatic forcing, considerable uncertainty stems from global water models themselves4,6,12,13,14. For instance, Beck et al.13 found distinct inter-model performance differences when comparing simulated and observed streamflow for ten global water models driven by the same forcing. To illustrate this uncertainty, we show how 30-year (climatological) averages of actual evapotranspiration, groundwater recharge and total runoff vary globally on the basis of outputs from eight models driven by the same forcing (Fig. 1a–c; Methods). We find substantial disagreement among models, as indicated by high coefficients of variation, particularly for groundwater recharge and total runoff. We further show which model deviates most from the ensemble mean and find that there is not one model that consistently deviates the most (Fig. 1d–f). Whereas this analysis cannot tell us which models perform better or worse, it suggests that it is not straightforward to single out a model for a certain flux or a certain region, which warrants a more in-depth evaluation.

Fig. 1: Disagreement between global water models for three key water fluxes.

a–c, Left: maps showing the coefficient of variation, calculated per grid cell as the ensemble standard deviation divided by the ensemble mean of eight global water models for different water fluxes: actual evapotranspiration (a), groundwater recharge (b) and total runoff (c). Lighter areas (‘blank spaces’) indicate high coefficient of variation (CoV) values and thus show where models disagree most. d–f, Right: maps showing which model deviates most from the ensemble mean for each grid cell for different water fluxes: actual evapotranspiration (d), groundwater recharge (e) and total runoff (f). Dark grey areas in d–f indicate that multiple models deviate similarly strongly from the ensemble mean. Empty, blank areas in d–f indicate that no model deviates strongly from the ensemble mean. The percentages shown in d–f refer to the fraction of grid cells (not land area) covered by each model. Greenland is masked out for the analysis.
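
The two quantities mapped in Fig. 1 can be sketched in a few lines. The following is a minimal illustration, not the code used for the figure: the function name and array layout are hypothetical, the population standard deviation is assumed, and the thresholds behind the dark grey and blank areas in d–f are not reproduced because they are not specified here.

```python
import numpy as np

def ensemble_disagreement(fluxes):
    """Per-grid-cell disagreement of a model ensemble.

    fluxes: array-like of shape (n_models, n_cells) holding long-term mean
    fluxes in mm per year (hypothetical layout, one row per model).
    Returns the coefficient of variation and the index of the model that
    deviates most from the ensemble mean in each grid cell.
    """
    fluxes = np.asarray(fluxes, dtype=float)
    mean = fluxes.mean(axis=0)
    # Assumption: population standard deviation (ddof=0) across models.
    std = fluxes.std(axis=0)
    # The coefficient of variation is undefined where the ensemble mean is zero.
    cov = np.full(mean.shape, np.nan)
    nonzero = mean != 0
    cov[nonzero] = std[nonzero] / mean[nonzero]
    # Row index of the model with the largest absolute deviation from the mean.
    most_deviating = np.abs(fluxes - mean).argmax(axis=0)
    return cov, most_deviating
```

Where all models agree exactly, `argmax` falls back to the first model, so a real analysis would need an explicit rule for cells in which no model, or several models, stand out.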

Most evaluation strategies compare model outputs to historical observations over the area for which the observation is representative. This can be at the plot (for example, flux towers), catchment (for example, gauging stations) or grid cell (for example, gridded remote sensing products) scale. Such approaches are necessary but not sufficient to robustly evaluate global models15. First, these approaches compare simulated and observed values location by location and are therefore limited to potentially improving a model for that location; however, given that large fractions of the global land area are ungauged, we require methods that can extract and transfer information from gauged to ungauged locations16. Second, relevant information for model evaluation might not just lie in comparing the magnitudes of simulated and observed values in a single location but rather in how a variable varies along a spatial gradient17. And third, comparison with historical observations does not guarantee that a model reliably predicts system behaviour under changing conditions18. Rather than evaluating global models in essentially the same way as catchment-scale models, evidence of different large-scale hydrological relationships presents us with an opportunity for a different evaluation strategy that is inherently large-scale but so far rarely exploited.

Towards evaluation using functional relationships

Reviewing the hydrological literature reveals a range of relationships19 that, if they appear in empirical data, should also appear in models (and vice versa). Such relationships often capture behaviour that is not prescribed by small-scale processes but rather emerges through the interaction of these processes (or model components) at large scales. Perhaps the most prominent example is the Budyko framework20, which describes the long-term partitioning of precipitation into evapotranspiration and streamflow solely as a function of the aridity index. Another example is the elasticity of streamflow to changing climatic drivers (for example, precipitation or temperature), which provides an observation-based constraint on climate change effects on streamflow21,22. A third example is the empirical relationship between annual rainfall and runoff, which can be affected differently by prolonged drought; in Australia, some catchments have shown similar rainfall–runoff relationships before and after the Millennium Drought, while other catchments have transitioned to a new stable state23. The search for robust relationships that characterize the functioning of hydrological systems is in itself a great scientific challenge19, but such functional relationships also provide an excellent yet poorly explored opportunity for the evaluation of global water models.

We define the term function as the actions of (hydrological) systems on the inputs that enter them, such as partition, storage and release of water and energy24,25. Accordingly, we define functional relationships as relationships between two or more variables that characterize these functions. Such relationships often focus on forcing, state and response variables that are expected to be causally related (for example, precipitation and runoff), and they can focus on both temporal variability at a single location and (as used here) spatial variability across multiple locations. Functional relationships need not be uniquely defined and are typically characterized by substantial scatter due to other (secondary) controlling variables, local variability or uncertainty.

Whereas functional relationships have been used before to evaluate land surface, forest and Earth system models—for example, by analysing relationships between soil moisture and evaporation and runoff26,27,28,29 or between precipitation and other atmospheric drivers and vegetation productivity30,31,32—their potential for evaluating global water models has not yet been sufficiently explored. The use of functional relationships is currently scattered among the hydrological literature (for example, refs. 33,34,35) and has not been formalized into an evaluation framework. There is a pressing need to develop a ‘theory of evaluation’36 that does justice to the nature of global models, the purposes for which they are used and their growing relevance for society37. Functional relationships have the potential to be a central building block of such a theory of evaluation, and below we show how they can help shed new light on model behaviour.

Here we focus on functional relationships that capture the spatial co-variability of forcing and response variables. Rather than focusing on a process-by-process comparison that can quickly become unmanageable28, functional relationships can capture emergent patterns and shift the focus to identifying the dominant controls on the variables of interest. Especially the relationships between water and energy availability and the major water fluxes leaving the land surface—evaporation and runoff—have been frequently studied20,38, providing an excellent starting point for model evaluation. In addition, functional relationships that focus on spatial patterns offer several advantages. First, such relationships are well suited for the analysis of global models due to their spatially distributed nature, which means that these relationships can be readily obtained from comparing values from multiple grid cells. Second, spatial relationships can be calculated based on long-term averages, which for some variables are often the only observations available (for example, for groundwater recharge39,40). And third, such relationships can capture how hydrological variables co-vary across large scales and thus offer the potential for model improvement over large areas, including locations that lack observations.

In this analysis, we investigate how long-term averages of two forcing and three response variables co-vary spatially, leading to six variable pairs overall. The forcing variables are precipitation P and net radiation N (the available water and energy, respectively), and the response variables are actual evapotranspiration Ea, groundwater recharge R and total runoff Q (three key water fluxes). We analyse forcing–response relationships based on 30-year (climatological) averages (1975–2004; all in mm per year) from eight global water models (CLM4.5, CWatM, H08, JULES-W1, LPJmL, MATSIRO, PCR-GLOBWB and WaterGAP2) from phase 2b of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP 2b41). In addition, we use observational datasets, observation-driven machine learning products and the semi-empirical equation introduced by Budyko20 to calculate functional relationships between the same variables as for the models as benchmarks (Table 1). To explore regional variability in functional relationships38, we divide the world into four climatic regions: wet–warm (18% of modelled area), wet–cold (15%), dry–cold (24%) and dry–warm (43%), shown in Fig. 2b. Details can be found in the Methods section.

Table 1 Spearman rank correlations among forcing variables and water fluxes and number of observations based on different observational or observation-driven datasets and the Budyko equation
Fig. 2: Examples of functional relationships.

a, Scatter plots between precipitation and groundwater recharge for PCR-GLOBWB and WaterGAP2. Owing to space constraints, we focus on a few examples with differing relationships. Scatter plots for all variable pairs are shown in Supplementary Figs. 15–20. Each dot represents one grid cell and is based on the 30-year average of each flux. Spearman rank correlations ρs measure the strength of the relationship between forcing and response variables and are calculated for all grid cells within a climate region. The lines connect binned medians (ten bins along the x axis with an equal number of points per bin) for each region. b, The climate regions are shown. The grey dashed line shows the 1:1 line, indicating the water limit assuming all water is supplied by precipitation.

Disagreement in functional relationships between models

We can visually assess relationships between forcing (P, N) and response variables (Ea, R, Q) by inspecting scatter plots where each point represents one grid cell (or observation); this is shown for precipitation and groundwater recharge in Fig. 2a. We first take a closer look at the shapes of the functional relationships, indicated by the coloured lines in Fig. 2a. Later we will also quantify the strength of the relationships using Spearman rank correlations ρs. We limit ourselves to a qualitative discussion, given that fitting an equation would mean that we would have to assume a functional form. We report mean values and slopes (obtained via linear regression) for each region in Supplementary Tables 4–7, which quantitatively support our visual assessment. Figure 3 shows connected binned median values for precipitation and the three water fluxes for all models and observational datasets (Table 1), separated by climate region. A similar plot for net radiation and the three water fluxes is shown in Extended Data Fig. 1.
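
The connected binned medians can be computed without assuming a functional form. The sketch below is one plausible way to do so; the function name is hypothetical, and it assumes that each region holds at least as many valid points as there are bins.

```python
import numpy as np

def binned_medians(x, y, n_bins=10):
    """Medians of x and y in n_bins bins holding (nearly) equal numbers of points.

    Points are sorted along x (for example, precipitation) and split into
    consecutive groups; the median of each group gives one point of the line.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop grid cells where either variable is missing.
    valid = np.isfinite(x) & np.isfinite(y)
    x, y = x[valid], y[valid]
    order = np.argsort(x)
    # array_split tolerates point counts that are not a multiple of n_bins.
    x_groups = np.array_split(x[order], n_bins)
    y_groups = np.array_split(y[order], n_bins)
    x_med = np.array([np.median(g) for g in x_groups])
    y_med = np.array([np.median(g) for g in y_groups])
    return x_med, y_med
```

Equal-count bins keep each median equally well supported, at the cost of wide bins where grid cells are sparse, typically at high precipitation.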

Fig. 3: Average functional relationships among precipitation and three key water fluxes.

Average functional relationships based on models and benchmark datasets among precipitation P and actual evapotranspiration Ea, groundwater recharge R and total runoff Q, respectively. The coloured lines represent one model each; the grey-black lines represent different observational datasets, labelled on the outer-right panels. The MacDonald groundwater recharge dataset contains only enough data values for the dry–warm region and is thus only shown there. The lines connect binned medians (ten bins along the x axis with an equal number of points per bin) for each climate region. The grey dashed line shows the 1:1 line, indicating the water limit assuming all water is supplied by precipitation. Note that the graphs do not show the full range for some curves to better illustrate the model differences.

While the P–Ea relationships look similar in shape, they can differ greatly in magnitude (Fig. 3). They increase rather linearly in dry (water-limited) regions and increase initially in wet (energy-limited) regions and then level off as they reach an energy limit that bounds actual evapotranspiration. The limit differs greatly between models, varying up to about 400 mm per year in wet–warm regions. Because all models are forced with the same total radiation, this difference is related to the way the models translate total radiation into net radiation and how they then use net radiation to calculate actual evapotranspiration. There is no obvious connection between this difference and the different potential evapotranspiration schemes used42, potentially because the models, while forced with the same climate inputs, differ in the way they parameterize the land surface (for example, land use, soils). In dry regions, actual evapotranspiration is mostly limited by precipitation, a forcing dataset that is the same for all models, resulting in less variability. The Budyko equation and the FLUXCOM43 dataset suggest, in line with literature estimates44, that most models underestimate actual evapotranspiration, often greatly so (Supplementary Tables 4 and 5). However, we note that FLUXCOM probably overestimates actual evapotranspiration, especially in dry–warm regions, because it considers only vegetated areas43. Overall, the disagreement in modelled actual evapotranspiration, particularly visible in energy-limited regions, suggests substantial differences in the way models estimate the energy available for evapotranspiration.

Most P–R relationships increase monotonically, but the shape, the slope and the threshold at which some models start to produce groundwater recharge are very different (Fig. 3). For instance, in dry–warm regions, some models produce essentially no groundwater recharge even if precipitation is above 1,000 mm per year, while others produce over 200 mm per year. In dry–warm regions, we have by far the most extensive database on groundwater recharge39,40, and the observations fall (apart from those at very high precipitation values) within the range of the models. In wet–warm regions, we find the largest disagreement between models and observations, which suggest lower (higher) groundwater recharge rates for higher (lower) precipitation. Whereas this shows the benefit of using an ensemble rather than a single model, even a large ensemble spread does not always capture the observed relationships. The large spread further suggests that many models greatly over- or underestimate groundwater recharge rates and consequently greatly over- or underestimate how much groundwater contributes to evapotranspiration and streamflow45. These differences in slope, visible for all climate regions, reflect very different spatial sensitivities to changes in precipitation. Whether temporal sensitivities are similar can only be hypothesized, given that no global observational dataset with groundwater recharge time series is available; if they are, this would imply very different responses to projected changes in precipitation.

The P–Q relationships look similar in shape and mostly increase monotonically, especially for wet regions (Fig. 3). The relative differences are larger for dry places, commonly perceived as regions where runoff is more difficult to model46. The model and benchmark relationships disagree particularly strongly in dry–cold regions. There, the GSIM47,48 dataset shows a variable relationship between total runoff and precipitation, whereas the GRUN49 dataset shows almost no increase with increasing precipitation. Overall, GSIM, GRUN and the Budyko equation indicate, in line with an earlier evaluation50, that most models produce too much total runoff. This parallels recent findings that Earth system models predict higher runoff increases due to climate change than observations suggest22. The overestimation in total runoff is complementary to the underestimation of actual evapotranspiration and shows that most models partition too much precipitation into runoff rather than evapotranspiration.

Diverging dominance of forcing on response variables

To quantitatively compare the strength of the forcing–response relationships, we use Spearman rank correlations ρs. A rank correlation close to 1 (or −1) indicates that the spatial variability in the forcing variable almost completely explains the spatial variability in the response variable, as can be seen in Fig. 2a for WaterGAP2. A rank correlation closer to 0 indicates that other factors control the response (for example, other input or model parameters describing the land surface), as can be seen in Fig. 2a for PCR-GLOBWB. We stress that a high correlation is not a measure of goodness of fit. Considerable scatter and correspondingly low correlations might indeed be characteristic for many relationships, and if models overestimate how strongly a forcing variable controls a model output, this also indicates unrealistic behaviour. Calculating rank correlations for all variable pairs, we find that the models differ substantially among each other and in comparison to observations (Fig. 4; rank correlations for all benchmark datasets and models are listed in Table 1 and Supplementary Table 3, respectively).
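
As an illustration of how such rank correlations could be obtained per climate region, the following sketch uses `scipy.stats.spearmanr`. The function name, the region labels and the minimum sample size of three are assumptions made for the example.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation_by_region(forcing, response, region):
    """Spearman rank correlation between two variables for each region label.

    forcing, response: one long-term mean value per grid cell.
    region: one label per grid cell (for example, 'dry-warm').
    """
    forcing = np.asarray(forcing, dtype=float)
    response = np.asarray(response, dtype=float)
    region = np.asarray(region)
    result = {}
    for name in np.unique(region):
        # Use only cells in this region where both variables are available.
        mask = (region == name) & np.isfinite(forcing) & np.isfinite(response)
        if mask.sum() < 3:  # Assumption: too few points for a meaningful value.
            result[str(name)] = float("nan")
            continue
        rho, _ = spearmanr(forcing[mask], response[mask])
        result[str(name)] = float(rho)
    return result
```

Because the measure is rank-based, it captures any monotonic relationship, not only linear ones, which suits the nonlinear shapes seen in the scatter plots.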

Fig. 4: Strength of functional relationships for models and benchmark data.

a–f, Spearman rank correlations ρs between forcing variables (precipitation (a,c,e), net radiation (b,d,f)) and water fluxes (actual evapotranspiration (a,b), groundwater recharge (c,d) and total runoff (e,f)), divided into different climate regions. Net radiation for LPJmL and PCR-GLOBWB is not available and is estimated as the median of the other models (per grid cell). The lines connecting the dots are only there as a visual aid. The numbered triangles show rank correlations based on benchmark datasets (grey background) and the Budyko equation, with numbers indicating the corresponding data source (Table 1). Observation-based rank correlations are shown only if they are based on more than 50 data points.

For precipitation and actual evapotranspiration (Fig. 4a), the models show the same ranking between climate regions and rather small differences in magnitude, indicating that actual evapotranspiration is strongly constrained by the available water in all models. The model-based correlations are higher in dry regions (ρs = 0.74–0.98) than in wet regions (0.57–0.83), reflecting water and energy limitations. The Budyko equation assumes complete dependence on aridity (here defined as N/P). It thus predicts higher correlations overall and mainly distinguishes between wet (0.83–0.84) and dry (0.98–1.00) regions but, unlike models and FLUXCOM, not between cold and warm regions. The Budyko equation should thus be seen as a useful comparison but not as the ‘correct’ model, given that different studies have shown that snow51, climate seasonality52, vegetation type53, inter-catchment groundwater flow54 and human impacts55 can affect the long-term water balance beyond aridity.
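
To make the Budyko benchmark concrete, the sketch below evaluates the commonly cited form of Budyko's curve, Ea/P = sqrt(φ tanh(1/φ) (1 − exp(−φ))), with aridity φ = N/P. That this exact form matches the formulation described in the Methods is an assumption, and the function name is hypothetical. Inputs are taken to be positive and in mm per year.

```python
import numpy as np

def budyko_evapotranspiration(precip, net_rad):
    """Long-term actual evapotranspiration from Budyko's curve (mm per year).

    precip: precipitation P; net_rad: net radiation N expressed as an
    equivalent water depth. Both are assumed to be strictly positive.
    """
    precip = np.asarray(precip, dtype=float)
    net_rad = np.asarray(net_rad, dtype=float)
    aridity = net_rad / precip  # phi = N / P, as defined in the text
    ratio = np.sqrt(aridity * np.tanh(1.0 / aridity) * (1.0 - np.exp(-aridity)))
    return ratio * precip

# Long-term runoff follows from the water balance if storage change is neglected.
ea = budyko_evapotranspiration(1000.0, 1000.0)
q = 1000.0 - ea
```

The curve approaches Ea = P in very dry conditions and Ea = N in very wet conditions, so the predicted fluxes depend only on the two forcing variables. This is why the Budyko-based correlations are so high.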

We find much variability for net radiation and actual evapotranspiration (Fig. 4b). There is no obvious correspondence between the potential evapotranspiration schemes used42 (for example, Priestley–Taylor for LPJmL and WaterGAP2 or Penman–Monteith for JULES-W1 and CWatM) and the rank correlations, implying that other factors play a more important role (also, refs. 14,56). Both the Budyko equation and FLUXCOM show very high correlations for all wet places (0.93–0.99), indicating a strong energy limitation57, underestimated by many models (especially CWatM and MATSIRO). FLUXCOM shows a stronger N–Ea relationship (Fig. 4b) in dry–cold places than all models and the Budyko equation, while it shows a weaker P–Ea relationship (Fig. 4a) there. This could be due to an uncertain representation of energy balance processes in cold regions, possibly related to interactions between snow-affected albedo and evapotranspiration58,59, sublimation60 or the aerodynamic component of potential evapotranspiration61.

For precipitation and groundwater recharge (Fig. 4c), some models (CLM4.5, MATSIRO, WaterGAP2 and H08) show high to very high correlations (0.71–0.95) for all climate regions, suggesting that precipitation is the dominant control on groundwater recharge across all climate regions in these models. Other models (CWatM, JULES-W1, LPJmL, PCR-GLOBWB) show much lower and more variable correlations (0.35–0.85), suggesting different controls on groundwater recharge (for example, model structural decisions and parameterizations). H08 and WaterGAP2 use the same approach to calculate groundwater recharge42 and they show almost identical rank correlations, indicating that the functional relationships might be relatable to the model structure in this case. Recent studies have shown a strong influence of precipitation and aridity on groundwater recharge39,40,45, and using the same datasets, we also find high to very high correlations in dry–warm regions (0.74–0.84). In these often highly water-limited regions, precipitation appears to be the dominant control on groundwater recharge. Besides climate, perceptual models of groundwater recharge generation usually include soil characteristics, topography, land use and geology62,63. This might explain why observations show a more scattered P–R relationship, particularly in wet–warm regions (−0.06).

For precipitation and total runoff (Fig. 4e), WaterGAP2 and PCR-GLOBWB both show lower correlations (0.52–0.75) than the other models (0.58–0.95). WaterGAP2 is the only model here that is calibrated against streamflow observations42, which might explain why it shows the lowest rank correlations for total runoff. The Budyko framework assumes that long-term runoff only depends on aridity and thus shows higher correlations (0.87–0.99) than the benchmark datasets (0.27–0.94) and most models (0.52–0.95). Because factors other than aridity can influence total runoff51,52,53,54 and given that GSIM tends to show lower correlations overall (0.32–0.80), models that show correlations as high as the Budyko equation probably overestimate how strongly precipitation controls total runoff. Similar to the shapes of the functional relationships (Fig. 3), we generally find the largest differences in both models and datasets in dry–cold regions, where GRUN and GSIM show particularly low correlations (0.27 and 0.32).

For net radiation and both groundwater recharge and total runoff (Fig. 4d,f), we find high variability and mostly positive correlations. The models probably produce more groundwater recharge and total runoff in regions with higher net radiation because precipitation is also higher in these regions (Supplementary Fig. 1). Whereas it is difficult to interpret these correlations, the large variability still suggests considerable differences between models.

Discussion

Focus areas for model improvement

Our analysis has revealed substantial disagreement between models and between models and observations, questioning the robustness of model-based studies and impact assessments, especially if only a single model is used. The energy balance, from total radiation to actual evapotranspiration, appears to be poorly represented, indicated by a different energy limit (Fig. 3), a general underestimation of actual evapotranspiration and widely varying N–Ea relationships (Fig. 4). This warrants a closer look in future studies, as a realistic depiction of energy balance and evaporation processes is critical for climate change studies57,58. We find the largest disagreement for groundwater recharge, which is arguably the least understood process and poorly constrained by sparse observations39,40. The inter-model differences in groundwater recharge can be much larger than the differences in actual evapotranspiration and must therefore have other reasons. To better constrain the large variability between models, we need to improve our understanding of the dominant controls on groundwater recharge at large scales64. This knowledge is important for assessments of sustainable use of groundwater resources9,10, for groundwater modelling studies that use groundwater recharge from global water models as input65 and for understanding the sensitivity of groundwater recharge to changing climatic drivers6. Most models overestimate total runoff and we find the largest disagreement for total runoff in dry–cold regions. This echoes existing literature1,12,22,50 and highlights the need for model refinement in dry and/or cold regions, which are under-researched and strongly affected by climate change46. To explore more in-depth how snow processes affect the water balance, future studies could focus on functional relationships in snow-dominated regions by specifically delineating these regions using the fraction of precipitation falling as snow or snow cover extents.

Towards an inventory of robust functional relationships

We have used different observational datasets, observation-driven machine learning products and the Budyko equation20 to derive empirical and theory-based functional relationships, but challenges remain. Observation-driven machine learning products43,49 are not raw observations and may reflect their upscaling methods rather than the underlying natural distribution but serve as useful benchmarks in the absence of direct observations (for example, because of limited numbers of FLUXNET sites66). The Budyko equation20 is a climate-only model and thus provides a useful benchmark but neglects other influences on the long-term water balance. The observations themselves and the forcing data paired with them are also associated with uncertainty, even though most of the relationships used here appear to be relatively robust (Methods includes an extended discussion). Yet especially for variables with small numbers of observations, it is challenging to provide robust observation-based constraints for certain regions (Table 1). For example, groundwater recharge measurements have almost entirely been made in dry–warm regions (97% of MacDonald data40 and 92% of Moeck data39), leaving groundwater recharge in other regions poorly constrained. On the other hand, most streamflow measurements have been taken in wet regions (60% of GSIM data used here), and globally there is a placement bias of stream gauges towards wet regions67, even though, according to our classification, about two-thirds of the global land area is defined as dry. Instead of taking new measurements to understand a specific place, new measurements would have much more leverage if they also helped us to understand other places, for example, by filling an observational gap along a climatic gradient (that is, in functional space). In addition, more quality-controlled datasets with uncertainty estimates40 are critical to obtain realistic uncertainty estimates for functional relationships. This would ultimately allow us to obtain robust ranges of functional behaviour that we can benchmark our models against.

The functional relationships studied here appear to be robust with respect to modelled human impacts, probably because we investigated long-term averages over large regions where climatic controls on the selected hydrological variables dominate (Supplementary Figs. 26–30). Yet for different variables, especially when studied at shorter temporal and smaller spatial scales, human impacts might have a considerable effect on functional relationships. The effects of human impacts might be investigated by studying strongly managed and near-natural regions separately68. Indeed, comparing functional relationships between human-impacted and natural regions would be an excellent strategy to assess the degree of human alteration of the natural water cycle. In addition, relationships that specifically focus on human impacts, such as relationships between irrigated areas and irrigation water withdrawals69, might be used to better understand the representation of human impacts in models.

Whereas visual comparison (focusing on the shape of the relationships) and rank correlations (focusing on the strength of the relationships) have exposed clear differences between models and observations, our approach here should be seen as a first step. There are other ways to describe the relationships analysed here, for example, by characterizing thresholds or nonlinearities (visible in Fig. 3). Metrics such as rank correlations also require careful interpretation. For example, positive correlations between net radiation and groundwater recharge probably arise because precipitation and net radiation are positively correlated and thus do not imply a causal relationship. The interpretation of empirical relationships should therefore be backed up by process knowledge or extended by methods that allow for discovery of causal relationships70. Physics-aware machine learning might be powerful in that respect, as it combines domain knowledge with versatile pattern recognition71. Beyond the relationships investigated here, we anticipate that exploring temporal relationships (for example, using elasticities21,22 or shifts in P–Q relationships23), dividing the landscape into additional categories (for example, hydrobelts72) and including other variables, such as state variables or stores (for example, soil moisture, terrestrial water storage), will provide additional insights.
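
The caveat about net radiation and groundwater recharge can be illustrated with purely synthetic numbers. In the sketch below, recharge depends on precipitation only, yet it correlates with net radiation because the two forcing variables co-vary. The first-order partial rank correlation used to control for precipitation is a simple stand-in chosen for this example, not a method used in this study, and all coefficients are arbitrary.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_corr(x, y):
    """Spearman rank correlation as a plain float."""
    rho, _ = spearmanr(x, y)
    return float(rho)

def partial_rank_correlation(x, y, z):
    """First-order partial rank correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = rank_corr(x, y), rank_corr(x, z), rank_corr(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Synthetic grid cells (illustrative values only, not real data).
rng = np.random.default_rng(0)
n_cells = 5000
precip = rng.uniform(0, 2000, n_cells)
net_rad = 500 + 0.5 * precip + rng.normal(0, 300, n_cells)  # co-varies with precip
recharge = 0.2 * precip + rng.normal(0, 150, n_cells)       # driven by precip only

rho_nr = rank_corr(net_rad, recharge)
rho_partial = partial_rank_correlation(net_rad, recharge, precip)
```

In this construction, `rho_nr` is clearly positive although net radiation has no effect on recharge, while `rho_partial` is much closer to zero. Real forcing–response relationships are nonlinear and have more than one confounder, so such a correction is only a first check.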

Conclusions

As our models grow in complexity, encompassing more processes and covering larger spatial and temporal scales, we need a concurrent development of model evaluation strategies: an evaluation framework for large-scale models. Central to such an evaluation framework should be functional relationships, which shift the focus away from matching historical records in specific locations to a more diagnostic and process-oriented evaluation of model behaviour36. Functional relationships allow us to focus on larger-scale assessments, to relate places to each other and to explore if dominant controls in models are consistent with observations, theory and expectations (that is, our perceptual model73). This understanding is critical for ensuring that models faithfully represent real-world systems, ultimately leading to more credible projections of environmental change impacts. Eventually, expanding our range of functional relationships in hydrology, constrained by various observational datasets and expert knowledge, would give us a knowledge base of realistic system behaviour that could be used to evaluate models, diagnose model deficiencies and weight model ensembles, comparable to the use of emergent constraints in climate modelling37.

Both our approach and our findings have implications beyond hydrology. First, the terrestrial water cycle plays a central role in the Earth system and is often strongly coupled to other components, such as the biosphere, lithosphere and atmosphere, as well as to human activities (for example, refs. 74,75,76). More realistic simulations of the global water cycle therefore also enable us to better clarify how it influences and is influenced by other Earth system components. Second, functional relationships as a method are not limited to applications in hydrology. In fact, land surface, forest and Earth system models26,27,28,29,30,31,32 have already been studied in a similar fashion, though a broader application of this approach has so far been missing. As indicated by recent studies76,77, functional relationships provide an excellent opportunity to study the interactions between hydrology and, for example, terrestrial ecosystems, and thus represent a tool that can be used across disciplines.

Beyond model evaluation, functional relationships invite us to think about how the global water cycle functions, what we know, what we do not know and what that means for a future under climate change73. Our results suggest that improved process understanding will be particularly important for energy balance processes, groundwater recharge processes and generally in dry and/or cold regions. So how can we improve our process understanding? In 1986, Eagleson78 stated that ‘science advances on two legs, analysis and experimentation, and at any moment one is ahead of the other. At the present time advances in hydrology appear to be data limited’. For some processes, this still seems to be the case. But clearly, we have a wealth of data available and might ask ourselves: are we extracting all of the information from the observations we have? On the basis of the data we have, what and where should we measure next? And are there functional relationships in hydrology yet to be found19? Even if the search for such relationships is challenging, it will be a fruitful and exciting endeavour for global hydrology.

Methods

Model data retrieval and processing

We analysed 30-year (climatological) averages (1975–2004) from eight global water models41: CLM4.579, CWatM80, H0881, JULES-W182, LPJmL83, MATSIRO84, PCR-GLOBWB85 and WaterGAP286. The model simulations were carried out following the ISIMIP 2b protocol, and here we used model outputs forced with the Earth system model HadGEM2-ES under historical conditions (historical climate and CO2 concentrations). We note that the specific forcing chosen does not appear to influence model-based functional relationships (see below). We used precipitation P (ISIMIP variable name pr), net radiation N (not an official ISIMIP output), actual evapotranspiration Ea (ISIMIP variable name evap), groundwater recharge R (ISIMIP variable name qr) and total runoff Q (ISIMIP variable name qtot). Note that Q here refers to runoff generated on the land fractions (and not surface water bodies) of each grid cell and does not include upstream inflows, which allows for comparison to grid cell P. P, Ea, R and Q were downloaded from https://data.isimip.org/. Because N is not an official ISIMIP output, it was provided by the individual modelling groups. It is not available for all models, so for models without N data we used the ensemble median per grid cell. We converted all fluxes to mm per year, removed Ea values larger than 10,000 mm per year and set R values smaller than 0 to 0. Note that our analysis excludes Greenland and Antarctica. A more detailed description is given in the Supplementary Information.
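The cleaning steps described above can be sketched as follows. This is a minimal illustration and not the processing code used for the study: the function names are hypothetical, the input unit of kg m-2 s-1 and the year length of 365.25 days are assumptions, and removed Ea values are represented here as NaN.

```python
import math
from statistics import median

# Assumption: fluxes arrive in kg m-2 s-1 and a year has 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def flux_to_mm_per_year(flux_kg_m2_s):
    """Convert a water flux to mm per year (1 kg m-2 of water equals 1 mm)."""
    return flux_kg_m2_s * SECONDS_PER_YEAR


def clean_evap(ea_mm_yr, upper=10_000.0):
    """Mark Ea values above the threshold as missing (NaN)."""
    return math.nan if ea_mm_yr > upper else ea_mm_yr


def clean_recharge(r_mm_yr):
    """Set negative recharge values to zero."""
    return max(r_mm_yr, 0.0)


def fill_net_radiation(n_by_model):
    """For one grid cell, replace missing N (None) by the median of the
    models that do provide N."""
    available = [v for v in n_by_model.values() if v is not None]
    fill = median(available)
    return {m: (fill if v is None else v) for m, v in n_by_model.items()}
```

For a grid cell where only two of three models report N, `fill_net_radiation({"A": 100.0, "B": None, "C": 120.0})` assigns the median of the two available values to model B.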

CoV and most deviating model maps

For each grid cell, we used the 30-year averages of the eight models (that is, the model ensemble) and calculated the coefficient of variation (CoV), that is, the ensemble standard deviation divided by the ensemble mean. Maps of the standard deviation are shown in the Supplementary Information (Supplementary Figs. 8–10). To see which model dominates the ensemble spread, we checked for each grid cell which model shows the largest absolute difference (denoted by d1) from the ensemble mean (denoted by μ). To see if multiple models dominate the ensemble spread, we also checked for each grid cell which model shows the second-largest absolute difference (denoted by d2) from the ensemble mean. If the relative difference between the largest and the second-largest difference is less than 20%, that is (d1 − d2)/d1 < 0.2, the grid cell falls into the category ‘multiple’. If the relative difference between the most deviating model and the ensemble mean is less than 20%, that is d1/μ < 0.2, the grid cell is counted as having no most deviating model (empty areas on Fig. 1d–f).
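The per-grid-cell logic can be illustrated with a short sketch. Several details are assumptions that the text does not specify: the population standard deviation is used, the d1/μ check is applied before the ‘multiple’ check, and a zero ensemble mean is treated as having no most deviating model. The function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Ensemble standard deviation divided by the ensemble mean.
    Assumption: population standard deviation (the text does not say which)."""
    return pstdev(values) / mean(values)


def most_deviating_model(values_by_model, threshold=0.2):
    """Return the name of the most deviating model for one grid cell,
    'multiple' if the two largest deviations are similar, or None if
    no model deviates by at least `threshold` relative to the mean."""
    mu = mean(list(values_by_model.values()))
    # Rank models by absolute difference from the ensemble mean.
    ranked = sorted(values_by_model.items(),
                    key=lambda kv: abs(kv[1] - mu), reverse=True)
    d1 = abs(ranked[0][1] - mu)
    d2 = abs(ranked[1][1] - mu)
    # Assumption: this check comes first; a zero mean counts as "none".
    if mu == 0 or d1 / mu < threshold:
        return None
    if (d1 - d2) / d1 < threshold:
        return "multiple"
    return ranked[0][0]
```

Applied to every grid cell and flux, the first function gives the values mapped in Fig. 1a–c and the second the categories mapped in Fig. 1d–f.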

Functional relationships

To visualize the shape of the functional relationships, we binned the data in each climate region into ten bins (along the x axis) with an equal number of points, calculated the median per bin and connected the obtained median values. For groundwater recharge, we used only five bins because there are so few values. Note that the non-gridded observational datasets do not have the same spatial distribution as the gridded datasets and the models and thus do not have the same distribution of forcing variables. Their bins can therefore span different ranges of the forcing variables. As a metric for the strength of the functional relationships, we calculated Spearman rank correlations ρs between model inputs and outputs per climate region, a measure of the monotonicity between two variables that is robust to outliers. We use the following categories for correlations: negative correlation (<0), no to low correlation (0–0.25), medium correlation (0.25–0.5), high correlation (0.5–0.75) and very high correlation (0.75–1.0). We also show mean fluxes and slopes obtained through linear regression in Supplementary Tables 4–7.
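A compact sketch of the binning and the rank correlation is given below. It is illustrative only: the function names are hypothetical, the x position of each bin is taken as the median x in the bin (an assumption), and values that fall exactly on a category boundary are assigned to the higher category (also an assumption).

```python
import math
from statistics import median


def binned_medians(x, y, n_bins=10):
    """Sort by x, split into n_bins groups of (nearly) equal size and
    return (median x, median y) per bin."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    out = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            out.append((median(p[0] for p in chunk),
                        median(p[1] for p in chunk)))
    return out


def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    a, b = average_ranks(x), average_ranks(y)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def correlation_category(rho):
    """Map a correlation to the categories used in the text."""
    if rho < 0:
        return "negative"
    if rho < 0.25:
        return "no to low"
    if rho < 0.5:
        return "medium"
    if rho < 0.75:
        return "high"
    return "very high"
```

Because the correlation is computed on ranks, any strictly increasing relationship, linear or not, yields a value of 1.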

Climate regions

On the basis of the aridity index (here defined as N/P, where N is the model ensemble median), a place is categorized as either wet (N/P < 1) or dry (N/P > 1). On the basis of how many days per year fall below a 1 °C temperature threshold, a place is categorized as either cold (more than one month below 1 °C) or warm (less than one month below 1 °C). This results in four categories: wet–warm (15% of model grid cells/18% of modelled area), wet–cold (23%/15%), dry–cold (28%/24%) and dry–warm (34%/43%). To test how different decisions affect our climate region classification, we also used the ensemble median of potential evapotranspiration Ep (partially downloaded, partially provided by the modelling groups) to calculate the aridity index (Ep/P), and we used a different threshold for our warm/cold distinction. This resulted in only minor differences overall, as can be seen in the Supplementary Information (Supplementary Fig. 14).
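The classification can be written as a small function. This is a sketch under stated assumptions: N and P are taken to be in the same units (for example, mm per year), one month is taken as 30 days, and cells exactly on a boundary (N/P = 1 or exactly 30 days) are assigned to dry and warm, respectively, because the text leaves these cases open. The function name and the hyphenated labels are illustrative.

```python
import math


def climate_region(n, p, days_below_threshold, month_days=30):
    """Classify one grid cell as wet/dry and warm/cold.

    n, p: net radiation and precipitation in the same units (assumption).
    days_below_threshold: days per year below the 1 degree C threshold.
    month_days: length of "one month" in days (assumption).
    """
    aridity = math.inf if p == 0 else n / p
    moisture = "wet" if aridity < 1 else "dry"
    temperature = "cold" if days_below_threshold > month_days else "warm"
    return f"{moisture}-{temperature}"
```

For example, a cell with N = 500 mm per year, P = 1,000 mm per year and no days below the threshold is classified as wet and warm.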

Benchmark datasets and theory

To benchmark model performance, we used different observational datasets, observation-driven machine learning products and the Budyko equation20. If the datasets provide their own forcing data, we used these data. If not, we paired them with GSWP3 P data87 to have one consistent forcing product. For Ea, we used FLUXCOM data43 (RS monthly 0.5° from 2001–2015) paired with GSWP3 P data87 (downloaded from https://data.isimip.org/). For R, we used data from MacDonald et al.40, which include matching P data, and data from Moeck et al.39 paired with GSWP3 P data87. For Q, we used GRUN data49 from 1985–2004 paired with GSWP3 P data87 (the dataset used in the creation of GRUN) and GSIM data47,48 paired with GSWP3 P data87. For GSIM, we only used catchments with areas ranging from 250 to 25,000 km² with a minimum of ten years of data between 1985 and 2004 to ensure a sufficient number of catchments that do not differ too much in size from the model grid cells. To obtain theory-based estimates for Ea and Q, we forced the Budyko20 equation (equation (1)) with HadGEM2-ES P (the same forcing as used for the models) and ensemble median N from the ISIMIP 2b models analysed here.

$$\frac{E_\mathrm{a}}{P}=\sqrt{\frac{N}{P}\tanh \left(\frac{P}{N}\right)\left(1-\exp \left(-\frac{N}{P}\right)\right)} \tag{1}$$

More details on data processing and quality checks can be found in the Supplementary Information.
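Equation (1) can be evaluated per grid cell as in the following sketch. The function names are hypothetical, N and P are assumed to be in the same units, and Q is derived as P − Ea, which assumes a closed long-term water balance with negligible storage change; the text does not state how Q was obtained from the equation.

```python
import math


def budyko_evaporative_fraction(n, p):
    """Ea/P from equation (1), with aridity index N/P.
    Assumes p > 0; returns 0 when no energy is available (n <= 0)."""
    if n <= 0:
        return 0.0
    phi = n / p
    return math.sqrt(phi * math.tanh(1.0 / phi) * (1.0 - math.exp(-phi)))


def budyko_fluxes(n, p):
    """Return (Ea, Q), with Q = P - Ea (assumption: closed water balance)."""
    ea = p * budyko_evaporative_fraction(n, p)
    return ea, p - ea
```

The evaporative fraction stays between 0 and 1: it approaches 1 in very dry places (large N/P), where nearly all precipitation evaporates, and falls towards 0 as N/P becomes small.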

Extended discussion on model forcing and scenario uncertainty

The choice of forcing product and differences in the treatment of human influences (for example, water use and dams) might affect the functional relationships exhibited by the models. To gauge how much uncertainty this introduces, we compared our results to model runs using WATCH-WFDEI forcing with either variable historical conditions (varsoc) or no human influences (nosoc) for WaterGAP2 and PCR-GLOBWB, carried out following the ISIMIP 2a protocol. The results, shown in the Supplementary Information (Supplementary Figs. 26–30), stay essentially the same, showing that the model-based correlations are robust signatures of model behaviour.

Extended discussion on benchmark dataset uncertainty

Because not all datasets come with matching P data, we sometimes paired the observations with GSWP3 reanalysis data87. To gauge how much uncertainty this introduces, we investigated how different P data sources affect the functional relationships. Correlations calculated using the MacDonald et al.40 R data with either GSWP3 P data or the accompanying P data are very similar for dry–warm places (0.83 and 0.84; Supplementary Information). Using HadGEM2-ES P (the model forcing) data instead of GSWP3 P data to calculate correlations with FLUXCOM Ea43, Moeck R39, GRUN Q49 and GSIM47,48, respectively, results in no notable differences. Because most datasets only contain a limited number of years of data, sometimes only one average value39,40, we used all available years in our analysis. The only observation-driven dataset that contains a long enough time series to analyse functional relationships for two independent 30-year periods is GRUN49. Using GRUN data from 1945–1974 instead of 1975–2004 results in virtually no differences. Although we cannot rule out that other datasets would lead to different relationships, this analysis indicates that the functional relationships and the rank correlations are relatively robust (Supplementary Figs. 31–42).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.