Functional relationships reveal differences in the water cycle representation of global water models

Global water models are increasingly used to understand past, present and future water cycles, but disagreements between simulated variables make model-based inferences uncertain. Although there is empirical evidence of different large-scale relationships in hydrology, these relationships are rarely considered in model evaluation. Here we evaluate global water models using functional relationships that capture the spatial co-variability of forcing variables (precipitation, net radiation) and key response variables ( ac tu al e va potranspiration, groundwater recharge, total runoff). Results show strong disagreement in both shape and strength of model-based functional relationships, especially for groundwater recharge. Empirical and theory-derived functional relationships show varying agreements with models, indicating that our process understanding is particularly uncertain for energy balance processes, groundwater recharge processes and in dry and/or cold regions. Functional relationships offer great potential for model evaluation and an opportunity for fundamental advances in global hydrology and Earth system research in general. Global water models—including hydrological, land surface and dynamic vegetation models 1 —have become increasingly relevant for policy-making and in scientific studies. The Sixth Assessment Report 2 of the Intergovernmental Panel on Climate Change draws heavily on results from global water models, which provide information about climate change impacts on hydrological variables including soil moisture 3 , streamflow 4 , terrestrial water storage 5 and groundwater recharge 6 . Some of these models are already embedded in global water information services to provide information to a wide array of stakeholders, such as the Global Groundwater Information


Analysis
https://doi.org/10.1038/s44221-023-00160-ystreamflow 21,22 .A third example are empirical relationships between annual rainfall and runoff, which can be affected differently by prolonged drought; in Australia, some catchments have shown similar rainfall-runoff relationships before and after the Millennium Drought, while other catchments have transitioned to a new stable state 23 .The search for robust relationships that characterize the functioning of hydrological systems is in itself a great scientific challenge 19 , but such functional relationships also provide an excellent yet poorly explored opportunity for the evaluation of global water models.
We define the term function as the actions of (hydrological) systems on the inputs that enter them, such as partition, storage and release of water and energy 24,25 .Accordingly, we define functional relationships as relationships between two or more variables that characterize these functions.Such relationships often focus on forcing, state and response variables that are expected to be causally related (for example, precipitation and runoff), and they can focus on both temporal variability at a single location and (as used here) spatial variability across multiple locations.Functional relationships need not be uniquely defined and are typically characterized by substantial scatter due to other (secondary) controlling variables, local variability or uncertainty.
Whereas functional relationships have been used before to evaluate land surface, forest and Earth system models-for example, by analysing relationships between soil moisture and evaporation and runoff [26][27][28][29] or between precipitation and other atmospheric drivers and vegetation productivity [30][31][32] -their potential for evaluating global water models has not yet been sufficiently explored.The use of functional relationships is currently scattered among the hydrological literature (for example, refs.33-35) and has not been formalized into an evaluation framework.There is a pressing need to develop a 'theory of evaluation' 36 that does justice to the nature of global models, the purposes for which they are used and their growing relevance for society 37 .Functional relationships have the potential to be a central building block of such a theory of evaluation, and below we show how they can help shed new light on model behaviour.
Here we focus on functional relationships that capture the spatial co-variability of forcing and response variables.Rather than focusing on a process-by-process comparison that can quickly become unmanageable 28 , functional relationships can capture emergent patterns and shift the focus to identifying the dominant controls on the variables of interest.Especially the relationships between water and energy availability and the major water fluxes leaving the land surface-evaporation and runoff-have been frequently studied 20,38 , providing an excellent starting point for model evaluation.In addition, functional relationships that focus on spatial patterns offer several advantages.First, such relationships are well suited for the analysis of global models due to their spatially distributed nature, which means that these relationships can be readily obtained from comparing values from multiple grid cells.Second, spatial relationships can be calculated based on long-term averages, which for some variables are often the only observations available (for example, for groundwater recharge 39,40 ).And third, such relationships can capture how hydrological variables co-vary across large scales and thus offer the potential for model improvement over large areas, including locations that lack observations.
In this analysis, we investigate how long-term averages of two forcing and three response variables co-vary spatially, leading to six variable pairs overall.The forcing variables are precipitation P and net radiation N (the available water and energy, respectively), and the response variables are actual evapotranspiration E a , groundwater recharge R and total runoff Q (three key water fluxes).We analyse forcing-response relationships based on 30-year (climatological) averages (1975-2004;  all in mm per year) from eight global water models (CLM4.5,CWatM, H08, JULES-W1, LPJmL, MATSIRO, PCR-GLOBWB and WaterGAP2) from phase 2b of the Inter-Sectoral Impact Model Intercomparison Project Flood and Drought Monitor 8 .Because measurements of many hydrological variables are very sparse and insufficient for large-scale analyses, global water models are regularly used in scientific studies to provide globally coherent estimates of variables such as groundwater recharge and groundwater storage change 9,10 .Global water models are also an integral part of Earth system models, and a realistic representation of the water cycle is essential for simulating the role of water within and across the different components of the Earth system 11 .
The Intergovernmental Panel on Climate Change's Sixth Assessment Report 2 concludes from an analysis of currently available global water model projections that 'uncertainty in future water availability contributes to the policy challenges for adaptation, for example, for managing risks of water scarcity'.Whereas some of this uncertainty stems from projected and observed climatic forcing, considerable uncertainty stems from global water models themselves 4,6,[12][13][14] .For instance, Beck et al. 13 found distinct inter-model performance differences when comparing simulated and observed streamflow for ten global water models driven by the same forcing.To illustrate this uncertainty, we show how 30-year (climatological) averages of actual evapotranspiration, groundwater recharge and total runoff vary globally on the basis of outputs from eight models driven by the same forcing (Fig. 1a-c; Methods).We find substantial disagreement among models, as indicated by high coefficients of variation, particularly for groundwater recharge and total runoff.We further show which model deviates most from the ensemble mean and find that there is not one model that consistently deviates the most (Fig. 1d-f).Whereas this analysis cannot tell us which models perform better or worse, it suggests that it is not straightforward to single out a model for a certain flux or a certain region, which warrants a more in-depth evaluation.
Most evaluation strategies compare model outputs to historical observations over the area for which the observation is representative.This can be at the plot (for example, flux towers), catchment (for example, gauging stations) or grid cell (for example, gridded remote sensing products) scale.Such approaches are necessary but not sufficient to robustly evaluate global models 15 .First, these approaches compare simulated and observed values location by location and are therefore limited to potentially improving a model for that location; however, given that large fractions of the global land area are ungauged, we require methods that can extract and transfer information from gauged to ungauged locations 16 .Second, relevant information for model evaluation might not just lie in comparing the magnitudes of simulated and observed values in a single location but rather in how a variable varies along a spatial gradient 17 .And third, comparison with historical observations does not guarantee that a model reliably predicts system behaviour under changing conditions 18 .Rather than evaluating global models in essentially the same way as catchment-scale models, evidence of different large-scale hydrological relationships presents us with an opportunity for a different evaluation strategy that is inherently large-scale but so far rarely exploited.

Towards evaluation using functional relationships
Reviewing the hydrological literature reveals a range of relationships 19 that, if they appear in empirical data, should also appear in models (and vice versa).Such relationships often capture behaviour that is not prescribed by small-scale processes but rather emerges through the interaction of these processes (or model components) at large scales.The perhaps most prominent example is the Budyko framework 20 , which describes the long-term partitioning of precipitation into evapotranspiration and streamflow solely as a function of the aridity index.Another example are so-called elasticities of streamflow to changing climatic drivers (for example, precipitation or temperature), which provide an observation-based constraint on climate change effects on Analysis https://doi.org/10.1038/s44221-023-00160-y(ISIMIP 2b 41 ).In addition, we use observational datasets, observationdriven machine learning products and the semi-empirical equation introduced by Budyko 20 to calculate functional relationships between the same variables as for the models as benchmarks (Table 1).To explore regional variability in functional relationships 38 , we divide the world into four climatic regions: wet-warm (18% of modelled area), wet-cold (15%), dry-cold (24%) and dry-warm (43%), shown in Fig. 2d.Details can be found in the Methods section.

Disagreement in functional relationships between models
We can visually assess relationships between forcing (P, N) and response variables (E a , R, Q) by inspecting scatter plots where each point represents one grid cell (or observation); this is shown for precipitation and groundwater recharge in Fig. 2a.We first take a closer look at the shapes of the functional relationships, indicated by the coloured lines in Fig. 2a.Later we will also quantify the strength of the relationships

Analysis
https://doi.org/10.1038/s44221-023-00160-yusing Spearman rank correlations ρ s .We limit ourselves to a qualitative discussion, given that fitting an equation would mean that we would have to assume a functional form.We report mean values and slopes (obtained via linear regression) for each region in Supplementary Tables 4-7, which quantitatively support our visual assessment.Figure 3 shows connected binned median values for precipitation and the three water fluxes for all models and observational datasets (Table 1), separated by climate region.A similar plot for net radiation and the three water fluxes is shown in Extended Data Fig. 1.
While the P-E a relationships look similar in shape, they can differ greatly in magnitude (Fig. 3).They increase rather linearly in dry (waterlimited) regions and increase initially in wet (energy-limited) regions and then level off as they reach an energy limit that bounds actual evapotranspiration.The limit differs greatly between models, varying up to about 400 mm per year in wet-warm regions.Because all models are forced with the same total radiation, this difference is related to the way the models translate total radiation into net radiation and how they then use net radiation to calculate actual evapotranspiration.There is no obvious connection between this difference and the different potential evapotranspiration schemes used 42 , potentially because the models, while forced with the same climate inputs, differ in the way they parameterize the land surface (for example, land use, soils).In dry regions, actual evapotranspiration is mostly limited by precipitation, a forcing dataset that is the same for all models, resulting in less variability.The Budyko equation and the FLUXCOM 43 dataset suggest, in line with literature estimates 44 , that most models underestimate actual evapotranspiration, often greatly so (Supplementary Tables 4  and 5).However, we note that FLUXCOM probably overestimates actual evapotranspiration, especially in dry-warm regions, because it considers only vegetated areas 43 .Overall, the disagreement in modelled actual evapotranspiration, particularly visible in energy-limited regions, suggests substantial differences in the way models estimate the energy available for evapotranspiration.
Most P-R relationships increase monotonically, but the shape, the slope and the threshold at which some models start to produce groundwater recharge are very different (Fig. 3).For instance, in dry-warm regions, some models produce essentially no groundwater recharge even if precipitation is above 1,000 mm per year, while others produce over 200 mm per year.In dry-warm regions, we have by far the most extensive database on groundwater recharge 39,40 , and the observations fall (apart from those at very high precipitation values) within the range of the models.In wet-warm regions, we find the largest disagreement between models and observations, which suggest lower (higher) groundwater recharge rates for higher (lower) precipitation.Whereas this shows the benefit of using an ensemble rather than a single model, even a large ensemble spread does not always capture the observed relationships.The large spread further suggests that many models greatly over-or underestimate groundwater recharge rates and consequently greatly over-or underestimate how much groundwater contributes to evapotranspiration and streamflow 45 .These differences in slope, visible for all climate regions, reflect very different spatial sensitivities to changes in precipitation.Whether temporal sensitivities are similar can only be hypothesized given that no global observational dataset with groundwater recharge time series is available but would imply very different responses to projected changes in precipitation.
The P-Q relationships look similar in shape and mostly increase monotonically, especially for wet regions (Fig. 3).The relative differences are larger for dry places, commonly perceived as regions where runoff is more difficult to model 46 .The model and benchmark relationships disagree particularly strongly in dry-cold regions.There, the GSIM 47,48 dataset shows a variable relationship between total runoff and precipitation, whereas the GRUN 49 dataset shows almost no increase with increasing precipitation.Overall, GSIM, GRUN and the Budyko equation indicate, in line with an earlier evaluation 50 , that most models produce too much total runoff.This parallels recent findings that Earth system models predict higher runoff increases due to climate change than observations suggest 22 .The overestimation in total runoff is complementary to the underestimation of actual evapotranspiration and shows that most models partition too much precipitation into runoff rather than evapotranspiration.

Diverging dominance of forcing on response variables
To quantitatively compare the strength of the forcing-response relationships, we use Spearman rank correlations ρ s .A rank correlation close to 1 (or −1) indicates that the spatial variability in the forcing variable almost completely explains the spatial variability in the response variable, as can be seen in Fig. 2a for WaterGAP2.A rank correlation closer to 0 indicates that other factors control the response (for example, other input or model parameters describing the land surface), as can be seen in Fig. 2a for PCR-GLOBWB.We stress that a high correlation is not a measure of goodness of fit.Considerable scatter and correspondingly low correlations might indeed be characteristic for many relationships, and if models overestimate how strongly a forcing The percentage of grid cells per climate region is given in brackets.The Budyko equation was forced per grid cell with the same forcing as the models (indicated by *) and thus covers approximately the same extent (except for cells with negative net radiation).The gridded datasets (FLUXCOM, GRUN) are available at the same resolution as the models and thus also cover approximately the same extent (except for non-vegetated areas in the case of FLUXCOM).This is indicated by m.e. for model extent.For datasets without matching precipitation data, we used GSWP3 reanalysis data.Nr corresponds to the numbers used in Fig. 4. The MacDonald rank correlation for the wet-warm region is shown in brackets because of the very small sample size; it is not shown in Fig. 4. Dashes (-) indicate that correlations could not be calculated because no observations were available.ρ s denotes Spearman rank correlations.

Analysis
https://doi.org/10.1038/s44221-023-00160-yvariable controls a model output, this also indicates unrealistic behaviour.Calculating rank correlations for all variable pairs, we find that the models differ substantially among each other and in comparison to observations (Fig. 4; rank correlations for all benchmark datasets and models are listed in Table 1 and Supplementary Table 3, respectively).For precipitation and actual evapotranspiration (Fig. 4a), the models show the same ranking between climate regions and rather small differences in magnitude, indicating that actual evapotranspiration is strongly constrained by the available water in all models.The modelbased correlations are higher in dry regions (ρ s = 0.74-0.98)than in wet regions (0.57-0.83), reflecting water and energy limitations.The Budyko equation assumes complete dependence on aridity (here defined as N/P).It thus predicts higher correlations overall and mainly distinguishes between wet (0.83-0.84) and dry (0.98-1.00) regions but, unlike models and FLUXCOM, not between cold and warm regions.The Budyko equation should thus be seen as a useful comparison but not as the 'correct' model, given that different studies have shown that snow 51 , climate seasonality 52 , vegetation type 53 , inter-catchment groundwater flow 54 and human impacts 55 can affect the long-term water balance beyond aridity.
We find much variability for net radiation and actual evapotranspiration (Fig. 4b).There is no obvious correspondence between the potential evapotranspiration schemes used 42 (for example, Priestley-Taylor for LPJmL and WaterGAP2 or Penman-Monteith for JULES-W1 and CWatM) and the rank correlations, implying that other factors play a more important role (also, refs.14,56).Both the Budyko equation and FLUXCOM show very high correlations for all wet places (0.93-0.99), indicating a strong energy limitation 57 , underestimated by many models (especially CWatM and MATSIRO).FLUXCOM shows a stronger N-E a relationship (Fig. 4b) in dry-cold places than all models and the Budyko equation, while it shows a weaker P-E a relationship (Fig. 4a) there.This could be due to an uncertain representation of energy balance processes in cold regions, possibly related to interactions between snow-affected albedo and evapotranspiration 58,59 , sublimation 60 or the aerodynamic component of potential evapotranspiration 61 .
For precipitation and groundwater recharge (Fig. 4c), some models (CLM4.5,MATSIRO, WaterGAP2 and H08) show high to very high correlations (0.71-0.95) for all climate regions, suggesting that precipitation is the dominant control on groundwater recharge across H08 and WaterGAP2 use the same approach to calculate groundwater recharge 42 and they show almost identical rank correlations, indicating that the functional relationships might be relatable to the model structure in this case.Recent studies have shown a strong influence of precipitation and aridity on groundwater recharge 39,40,45 , and using the same datasets, we also find high to very high correlations in drywarm regions (0.74-0.84).In these often highly water-limited regions, precipitation appears to be the dominant control on groundwater recharge.Besides climate, perceptual models of groundwater recharge generation usually include soil characteristics, topography, land use and geology 62,63 .This might explain why observations show a more scattered P-R relationship, particularly in wet-warm regions (−0.06).
For precipitation and total runoff (Fig. 4e), WaterGAP2 and PCR-GLOBWB both show lower correlations (0.52-0.75) than the other models (0.58-0.95).WaterGAP2 is the only model here that is calibrated against streamflow observations 42 , which might explain why it shows the lowest rank correlations for total runoff.The Budyko framework assumes that long-term runoff only depends on aridity and thus shows higher correlations (0.87-0.99) than the benchmark datasets (0.27-0.94) and most models (0.52-0.95).Because factors other than aridity can influence total runoff [51][52][53][54] and given that GSIM tends to show lower correlations overall (0.32-0.80), models that show correlations as high as the Budyko equation probably overestimate how strongly precipitation controls total runoff.Similar to the shapes of the functional relationships (Fig. 3),we generally find the largest differences in both models and datasets in dry-cold regions, where GRUN and GSIM show particularly low correlations (0.27 and 0.32).
For net radiation and both groundwater recharge and total runoff (Fig. 4d,f), we find high variability and mostly positive correlations.The models probably produce more groundwater recharge and total runoff in regions with higher net radiation because precipitation is also higher in these regions (Supplementary Fig. 1).Whereas it is difficult to interpret these correlations, the large variability still suggests considerable differences between models.

Focus areas for model improvement
Our analysis has revealed substantial disagreement between models and between models and observations, questioning the robustness of model-based studies and impact assessments, especially if only a single model is used.The energy balance, from total radiation to actual evapotranspiration, appears to be poorly represented, indicated by a different energy limit (Fig. 3), a general underestimation of actual evapotranspiration and widely varying N-E a relationships (Fig. 4).This warrants a closer look in future studies, as a realistic depiction of energy balance and evaporation processes is critical for climate change studies 57,58 .We find the largest disagreement for groundwater recharge, which is arguably the least understood process and poorly constrained by sparse observations 39,40 .The inter-model differences in groundwater recharge can be much larger than the differences in actual evapotranspiration and must therefore have other reasons.To better constrain the large variability between models, we need to improve our understanding of the dominant controls on groundwater recharge at large scales 64 .This knowledge is important for assessments of sustainable use of groundwater resources 9,10 , for groundwater modelling studies that use groundwater recharge from global water models as input 65 and for understanding the sensitivity of groundwater recharge to changing climatic drivers 6 .Most models overestimate total runoff and we find the largest disagreement for total runoff in dry-cold regions.This echoes existing literature 1,12,22,50 and highlights the need for model refinement in dry and/or cold regions, which are under-researched and strongly affected by climate change 46 .To explore more in-depth how snow processes affect the water balance, future studies could focus on functional relationships in snow-dominated regions by specifically delineating these regions using the fraction of precipitation falling as snow or snow cover extents.Note that the graphs do not show the full range for some curves to better illustrate the model differences.

Towards an inventory of robust functional relationships
We have used different observational datasets, observation-driven machine learning products and the Budyko equation 20 to derive empirical and theory-based functional relationships, but challenges remain.Observation-driven machine learning products 43,49 are not raw observations and may reflect their upscaling methods rather than the underlying natural distribution but serve as useful benchmarks in the absence of direct observations (for example, because of limited numbers of FLUXNET sites 66 ).The Budyko equation 20 is a climate-only model and thus provides a useful benchmark but neglects other influences on the long-term water balance.The observations themselves and the forcing data paired with them are also associated with uncertainty, even though most of the relationships used here appear to be relatively robust (Methods includes an extended discussion).Yet especially for variables with small numbers of observations, it is challenging to provide robust observation-based constraints for certain regions (Table 1).For example, groundwater recharge measurements have almost entirely been made in dry-warm regions (97% of MacDonald data 40 and 92% of Moeck data 39 ), leaving groundwater recharge in other regions poorly constrained.On the other hand, most streamflow measurements have been taken in wet regions (60% of GSIM data used here), and globally there is a placement bias of stream gauges towards wet regions 67 , even though-according to our classification-short of two-thirds of the global land area are defined as dry.Instead of taking new measurements to understand a specific place, new measurements would have much more leverage if they would help us to also understand other places, for example, by filling an observational gap along a climatic gradient (that is, in functional space).In addition, more quality-controlled datasets with uncertainty estimates 40 are critical to obtain realistic uncertainty estimates for functional relationships.This would ultimately allow us to obtain robust ranges of functional behaviour that we can benchmark our models against.The functional relationships studied here appear to be robust with respect to modelled human impacts, probably because we investigated long-term averages over large regions where climatic controls on the selected hydrological variables dominate (Supplementary Figs.26-30).Yet for different variables, especially when studied at shorter temporal and smaller spatial scales, human impacts might have a considerable effect on functional relationships.The effects of human impacts might be investigated by studying strongly managed and near-natural regions separately 68 .Indeed, comparing functional relationships between human impacted and natural regions would be an excellent strategy to assess the degree of human alteration of the natural water cycle.In addition, relationships that specifically focus on human impacts, such as relationships between irrigated areas and irrigation water withdrawals 69 , might be used to better understand the representation of human impacts in models.
Whereas visual comparison (focusing on the shape of the relationships) and rank correlations (focusing on the strength of the relationships) have exposed clear differences between models and observations, our approach here should be seen as a first step.There are other ways to describe the relationships analysed here, for example, by characterizing thresholds or nonlinearities (visible in Fig. 3).Metrics such as rank correlations also require careful interpretation.For example, positive correlations between net radiation and groundwater recharge probably arise because precipitation and net radiation are positively correlated and thus do not imply a causal relationship.The interpretation of empirical relationships should therefore be backed up by process knowledge or extended by methods that allow for discovery of causal relationships 70 .Physics-aware machine learning might be powerful in that respect, as it combines domain knowledge with versatile pattern recognition 71 .Beyond the relationships investigated here, we anticipate that exploring temporal relationships (for example, using elasticities 21,22 or shifts in P-Q relationships 23 ), dividing the landscape into additional categories (for example, hydrobelts 72 ) and including other variables, such as state variables or stores (for example, soil moisture, terrestrial water storage), will provide additional insights.

Conclusions
As our models grow in complexity, encompassing more processes and covering larger spatial and temporal scales, we need a concurrent development of model evaluation strategies: an evaluation framework for large-scale models.Central to such an evaluation framework should be functional relationships, which shift the focus away from matching historical records in specific locations to a more diagnostic and process-oriented evaluation of model behaviour 36 .Functional relationships allow us to focus on larger-scale assessments, to relate places to each other and to explore if dominant controls in models are consistent with observations, theory and expectations (that is, our perceptual model 73 ).This understanding is critical for ensuring that models faithfully represent real-world systems, ultimately leading to more credible projections of environmental change impacts.Eventually, expanding our range of functional relationships in hydrology, constrained by various observational datasets and expert knowledge, would give us a knowledge base of realistic system behaviour that could be used to evaluate models, diagnose model deficiencies and weight model ensembles, comparable to the use of emergent constraints in climate modelling 37 .
Both our approach and our findings have implications beyond hydrology.First, the terrestrial water cycle plays a central role in the Earth system and is often strongly coupled to other components, such as the biosphere, lithosphere and atmosphere and human activities (for example, refs.74-76).More realistic simulations of the global water cycle therefore also enable us to better clarify how it influences and is influenced by other Earth system components.Methodologically, functional relationships are not limited to applications in hydrology.In fact, land surface, forest and Earth system models [26][27][28][29][30][31][32] have already been studied in similar fashions, though a broader application of this approach has so far been missing.As indicated by recent studies 76,77 , functional relationships provide an excellent opportunity to study the interactions between hydrology and, for example, terrestrial ecosystems, and thus represent a tool that can be used across disciplines.
Beyond model evaluation, functional relationships invite us to think about how the global water cycle functions, what we know, what we do not know and what that means for a future under climate change 73 .Our results suggest that improved process understanding will be particularly important for energy balance processes, groundwater recharge processes and generally in dry and/or cold regions.So how can we improve our process understanding?In 1986, Eagleson 78 stated that 'science advances on two legs, analysis and experimentation, and at any moment one is ahead of the other.At the present time advances in hydrology appear to be data limited'.For some processes, this still seems to be the case.But clearly, we have a wealth of data available and might ask ourselves: are we extracting all of the information from the observations we have?On the basis of the data we have, what and where should we measure next?And are there functional relationships in hydrology yet to be found 19 ?Even if the search for such relationships is challenging, it will be a fruitful and exciting endeavour for global hydrology.

Model data retrieval and processing
We analysed 30-year (climatological) averages (1975-2004) from eight global water models 41 : CLM4.5 79 , CWatM 80 , H08 81 , JULES-W1 82 , LPJmL 83 , MATSIRO 84 , PCR-GLOBWB 85 and WaterGAP2 86 .The model simulations were carried out following the ISIMIP 2b protocol and here we used model outputs forced with the Earth system model HadGEM2-ES under historical conditions (historical climate and CO 2 concentrations).We note that the specific forcing chosen does not appear to influence model-based functional relationships (see below).We used precipitation P (ISIMIP variable name pr), net radiation N (not an official ISIMIP output), actual evapotranspiration E a (ISIMIP variable name evap), groundwater recharge R (ISIMIP variable name qr) and total runoff Q (ISIMIP variable name qtot).Note that Q here refers to runoff generated on the land fractions (and not surface water bodies) of each grid cell and does not include upstream inflows, which allows for comparison to grid cell P. P, E a , R, and Q were downloaded from https://data.isimip.org/.Net radiation N is not an official ISIMIP output and was provided by the individual modelling groups.It is not available for all models, so we used the ensemble median per grid cell for models without N data.We converted all fluxes to mm per year and removed E a values larger than 10,000 mm per year and set R values smaller than 0 to 0. Note that our analysis excludes Greenland and Antarctica.A more detailed description is given in the Supplementary Information.

CoV and most deviating model maps
For each grid cell, we used the 30-year averages of the eight models (that is, the model ensemble) and calculated the ensemble standard deviation divided by the ensemble mean.Maps of the standard deviation are shown in the Supplementary Information (Supplementary Figs.8-10).To see which model dominates the ensemble spread, we checked for each grid cell which model shows the largest absolute difference (denoted by d 1 ) from the ensemble mean (denoted by μ).To see if multiple models dominate the ensemble spread, we also checked for each grid cell which model shows the second-largest absolute difference (denoted by d 2 ) from the ensemble mean.If the relative difference between the largest and the second-largest difference is less than 20%, that is (d 1 − d 2 )/d 1 < 0.2, the grid cell falls into the category 'multiple'.If the relative difference between the most deviating model and the ensemble mean is less than 20%, that is d 1 /μ < 0.2, the grid cell is counted as having no most deviating model (empty areas on Fig. 1d-f).

Functional relationships
To visualize the shape of the functional relationships, we binned the data in each climate region into ten bins (along the x axis) with an The grey dashed line shows the 1:1 line, indicating the water limit assuming all water is supplied by precipitation.Note that the graphs do not show the full range for some curves to better illustrate the model differences.

Fig. 1 |
Fig. 1 | Disagreement between global water models for three key water fluxes.a-c, Left: maps showing the coefficient of variation, calculated per grid cell as the ensemble standard deviation divided by the ensemble mean of eight global water models for different water fluxes: actual evapotranspiration (a), groundwater recharge (b) and total runoff (c).Lighter areas ('blank spaces') indicate high coefficients of variation (CoV) values and thus show where models disagree most.d-f, Right: maps showing which model deviates most from the ensemble mean for each grid cell for different water fluxes: actual evapotranspiration (d), groundwater recharge (e) and total runoff (f).Dark grey areas in d-f indicate that multiple models deviate similarly strongly from the ensemble mean.Empty, blank areas in d-f indicate that no model deviates strongly from the ensemble mean.The percentages shown in d-f refer to the fraction of grid cells (not land area) covered by each model.Greenland is masked out for the analysis.

Fig. 2 |
Fig. 2 | Examples of functional relationships.a, Scatter plots between precipitation and groundwater recharge for PCR-GLOBWB and WaterGAP2.Owing to space constraints, we focus on a few examples with differing relationships.Scatter plots for all variable pairs are shown in Supplementary Figs.15-20.Each dot represents one grid cell and is based on the 30-year average of each flux.Spearman rank correlations ρ s measure the strength of the

Fig. 3 |
Fig. 3 | Average functional relationships among precipitation and three key water fluxes.Average functional relationships based on models and benchmark datasets among precipitation P and actual evapotranspiration E a , groundwater recharge R and total runoff Q, respectively.The coloured lines represent one model each, the grey-black lines represent different observational datasets, labelled on the outer-right panels.The MacDonald groundwater recharge dataset contains only enough data values for the dry-warm region and is thus only shown there.The lines connect binned medians (ten bins along the x axis with equal amount of points per bin) for each climate region.The grey dashed line shows the 1:1 line, indicating the water limit assuming all water is supplied by precipitation.Note that the graphs do not show the full range for some curves to better illustrate the model differences.

Fig. 4 |
Fig. 4 | Strength of functional relationships for models and benchmark data.a-f, Spearman rank correlations ρ s between forcing variables (precipitation (a,c,e), net radiation (b,d,f)) and water fluxes (actual evapotranspiration (a,b), groundwater recharge (c,d) and total runoff (e,f)), divided into different climate regions.Net radiation for LPJmL and PCR-GLOBWB is not available and is estimated as the median of the other models (per grid cell).The lines connecting Analysis https://doi.org/10.1038/s44221-023-00160-yExtended Data Fig. 1 | Average functional relationships between net radiation and three key water fluxes.Average functional relationships based on models and benchmark datasets between net radiation N and actual evapotranspiration E a , groundwater recharge R and total runoff Q, respectively.The colored lines represent one model each, the grey-black lines represent different observational datasets, labeled on the outer-right panels.The lines connect binned medians (10 bins along the x-axis with equal amount of points per bin) for each climate region.

Table
). Observation-based rank correlations are shown only if they are based on more than 50 data points.