## Introduction

Terrestrial evaporation (E) is a key element of the global water cycle: approximately two-thirds of the precipitation over land is evaporated back into the atmosphere1. Due to its influence on water vapor and cloud feedbacks, E plays a crucial role in global warming, and its projected increase is expected to intensify the global hydrological cycle2. Changes in E will not only have far-reaching consequences on water availability and climate3,4, but can also severely affect the occurrence of hydroclimatic extremes5 and the ability of ecosystems and river basins to recover from them6,7,8. Moreover, E is an important indicator of vegetation stress, and is thus widely used for estimating drought conditions9 and their implications for water management, ecosystem health, and agricultural production10. Its reliable representation in hydrological and climate models is therefore crucial, and so is its accurate global monitoring from space. However, E cannot be derived directly from satellite-based measurements, which is why even retrieval algorithms tend to rely on process-based formulations11.

Several approaches exist to estimate E at large scales using process-based models. Some of them simulate E as a residual of the energy balance, or derive it empirically using vegetation, temperature, and radiation data. These approaches are primarily employed in high-resolution remote sensing, especially in agricultural areas, owing to their minimal input data requirements12,13. Other models employ a flux-based approach to derive E using more physically-founded methods, such as the Monin-Obukhov similarity theory, to calculate the gradients of specific humidity between the atmosphere and land surface (vegetated or non-vegetated), and explicitly model the surface resistance to the diffusion of water vapor. This approach is prevalent in climate models14. Finally, the third approach, commonly used in hydrological models15 and satellite retrieval algorithms16,17,18, involves the prior calculation of potential evaporation (Ep), a theoretical maximum for the given land cover and meteorological conditions. Subsequently, actual E is estimated by reducing Ep by a certain factor that accounts for the sub-optimal conditions of evaporation due to, for example, water scarcity; this is referred to here as ‘evaporative stress’ (S), and more precisely ‘transpiration stress’ (St) when applied to plant transpiration. Independent of the approach, significant uncertainties remain in the current global estimates of E, and this applies to both climate models19 and satellite-based algorithms20.

In this study, we focus on stress-based models of E, the most common approach to derive global E from satellite data21. In such models, uncertainty arises from the formulations of Ep and S (and particularly St). While several process-based formulations of Ep exist22,23, they differ in their estimates substantially, and even the mere definition of Ep as a concept remains elusive24. Nevertheless, the chosen Ep function forms the most process-based part of the stress-based E models, and while parameters within Ep formulations can be better constrained with more data25, the opportunities to improve stress-based models via modifications to Ep remain limited26. Therefore, we focus on the main source of uncertainty: the S formulation. This uncertainty arises from the lack of understanding of the response of plant transpiration (the major source of E in vegetated ecosystems) to environmental stressors, particularly at the spatial resolution at which global models operate. Here, we note that the focus of this study is the stress which limits vegetation transpiration below the atmospheric potential, and therefore can be triggered even under conditions in which plants do not experience stress from a physiological standpoint. The transpiration stress (i.e., St) should encapsulate multiple interacting hydroclimatic variables that affect different aspects of plant physiology and structure which affect transpiration in a highly non-linear manner at multiple time scales27. However, St formulations used in existing global models are simple, not capturing all the influences and interactions among the stressors. This occurs because they are based on a limited number of experimental studies whose extrapolation to global scale is hindered by their local nature28,29,30. 
The complexity of the interactions among these stressors, and the fact that they involve physiological processes that are unobserved, call for machine learning techniques as a suitable solution to this long-standing challenge.

Machine learning methods have become popular in Earth sciences in recent years, enabling the discrete classification of important geo-spatial variables which are hard to map, such as clouds31, soils32, and forest cover33, but also the estimation of dynamic variables, such as carbon fluxes34, precipitation35, or river discharge36. In fact, machine learning models trained on in situ measurements of E and other hydro-meteorological covariates have already been used to estimate global E34. However, pure machine learning–based approaches have several disadvantages in realistically modeling Earth system processes. Machine learning models do not obey physical constraints, such as the closure of the water and energy balances, at different scales. Further, the black-box nature of machine learning hinders interpretability, an important requirement if the influence of individual covariates needs to be realistically represented to improve process understanding. More importantly, the use of pure machine learning methods for specifically estimating transpiration (Et) at global scales is hindered by the fact that in situ observations of Et have a small footprint which is not representative of Et at the coarser grid scales at which global models operate.

An emerging research direction, and the approach adopted in this study, is to combine process-based and machine learning models in a symbiotic manner. ‘Hybrid’ models retain the advantages of process-based models, i.e., physical consistency and interpretability, and those of machine learning models, i.e., more realistic data-driven formulations of processes that are insufficiently understood37. Several proof-of-concept implementations have demonstrated the advantages of hybrid modeling in climate sciences with machine learning sub-models employed for representing different processes38,39 or for improved model parameter discovery40. For modeling E in particular, attempts have been made to physically constrain pure machine learning models to improve the accuracy of E estimates25,41. However, an important research question is whether hybrid models are capable of operating at a global scale with machine learning used to replace specific process formulations.

Here, machine learning, specifically deep learning, is used to learn the functional relationship between covariates (St drivers) and target process (St). We exploit recent progress in satellite-based remote sensing and an unprecedented number of in situ observations spread across the globe to develop a novel formulation of St from the ground up without any prior assumptions. Therefore, the objective of using deep learning is fundamentally different (development of an improved formulation of the transpiration stress response of vegetation) from that of purely machine learning-based models which are designed to predict Et directly and suffer from the scaling issues described above. Further, we implement the new formulation of St, and execute it online, in a process-based model of global evaporation which provides physical constraints to the deep learning-based St formulation. In doing so, we develop a hybrid model in which the new deep learning-based formulations of St are tightly coupled to the process-based model, with the aim of simulating daily E at the global scale. A comprehensive evaluation of the model is carried out using in situ observations and gridded datasets, including comparisons to pure machine learning and process-based approaches.

## Results

### Hybrid model architecture

A hybrid model at the highest level of abstraction is made up of two components: a process-based host model and machine learning-based formulations embedded in the host for representing certain processes37. For the process-based model, we choose the Global Land Evaporation Amsterdam model (GLEAM)16,42. GLEAM simulates E as a summation of its constituents: Et, bare-soil evaporation (Eb), open water evaporation (Ew), snow sublimation (Es), and interception loss (Ei). Et and Eb are estimated for every grid cell of the global model using a Priestley Taylor-based formulation for Ep and their respective evaporative stress factors (St and Sb), weighted by the fractional coverage of short vegetation, tall vegetation, bare-soil, and open water (Fig. 1). Interception is based on the Gash analytical model43. GLEAM contains a multi-layer soil water balance model in which satellite-based surface soil moisture data are assimilated. Sb is a function of soil moisture content (see Methods), while St accounts for the transpiration stress caused by shortage of plant available water (PAW) and sub-optimal phenological state (represented by vegetation optical depth, VOD). In nature, however, several additional stressors are responsible for limiting Et below its potential, which are not considered within the St formulation of the process-based GLEAM. The exact responses of Et to these stressors are ecosystem-dependent and difficult to encapsulate into a single stress factor (St).

Here, using deep learning and reliable field observations, we aim to recover an St that correctly encodes the functional relationships among the multiple stressors existing in nature. Deep learning models are developed at daily time scales using observational data from a large network of eddy covariance or flux towers and sap flow measurements. The models are developed separately for short (231 flux towers and 173,000 data points) and tall vegetation (137 flux towers and 90 sap flow measurement sites, 125,000 data points) (see Methods for the details of the target variable and covariates used in the deep learning models). We consider four other transpiration stressors, in addition to PAW and VOD, that are known to regulate stomatal conductance and hence influence St: (a) vapor pressure deficit (VPD), as an indicator of atmospheric dryness44, (b) air temperature (Ta), to include the effects of sub-optimal temperature and heat stress45, (c) incoming shortwave radiation (SWi), to incorporate the influence of light limitation46, and (d) atmospheric carbon dioxide (CO2) concentration, which exhibits a first order control on stomatal opening47. We note that the slowly evolving effects on transpiration of long-term ecological or plant trait adaptation in response to rising CO2 (as reflected in water use efficiency trends) may not be adequately captured by training the machine learning algorithms on the limited record length of available flux tower and sap flow measurements48. The potential effect of phosphorus and nitrogen limitations on St49 is not considered in this study due to the lack of dynamic global data. In addition, the influence of plant traits such as root depth, isohydricity, and other anatomical and morphological traits, and their fine-scale or inter-species variations, is not explicitly considered, since reliable data for upscaling such traits so that they can be implemented within a global model are not available.

Finally, the hybrid model of global E is created by coupling the deep learning-based model of St to the GLEAM process-based model. At every (daily) time step, and at every (0.25 degree) grid cell of the global model, the soil water balance module of GLEAM uses precipitation (P) to compute PAW. Then, PAW, VOD, Ta, VPD, SWi, and CO2 are transferred to the (offline-trained) deep learning model (see Methods). The deep learning model is run in predictive mode to generate St. St is then used to constrain Ep and thus compute E by the process-based host model. Finally, E is used to update the soil moisture (and PAW) before the process is repeated for the next time step (Fig. 1).
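
The coupling sequence described above can be sketched for a single grid cell as follows. This is only an illustrative sketch under strong assumptions: a single-bucket water balance stands in for GLEAM's multi-layer soil module, `stress_model` is any callable standing in for the trained deep learning model, and all function and parameter names are hypothetical rather than taken from GLEAM.

```python
from typing import Callable, Dict, List


def run_hybrid_cell(
    forcing: List[Dict[str, float]],
    stress_model: Callable[[Dict[str, float]], float],
    w_init: float,
    w_wp: float,
    w_c: float,
    w_max: float,
) -> List[float]:
    """Daily loop for one grid cell: water balance -> S_t -> E -> storage update.

    Hypothetical single-bucket stand-in for the host model; storages are in mm
    and fluxes in mm/day.
    """
    w = w_init
    evaporation = []
    for day in forcing:
        # 1. Water balance: add precipitation, cap at capacity (excess drains).
        w = min(w + day["P"], w_max)
        # 2. Plant available water, as in Eq. (5), clipped to [0, 1].
        paw = min(max((w - w_wp) / (w_c - w_wp), 0.0), 1.0)
        # 3. Hand the six stressors to the offline-trained stress model.
        covariates = {
            "PAW": paw, "VOD": day["VOD"], "Ta": day["Ta"],
            "VPD": day["VPD"], "SWi": day["SWi"], "CO2": day["CO2"],
        }
        s_t = min(max(stress_model(covariates), 0.0), 1.0)
        # 4. Constrain Ep with S_t; E cannot exceed storage (assumption).
        e = min(s_t * day["Ep"], w)
        # 5. Update storage before the next time step.
        w -= e
        evaporation.append(e)
    return evaporation
```

Because the stress model is passed in as a plain callable, the same loop runs unchanged with a process-based formulation or a learned one, which is the sense in which the stress formulation is embedded in, and constrained by, the host model.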

### Validation with in situ measurements

St and E estimates from the hybrid model are validated at 458 in situ monitoring stations (see Figs. 12 and 13 in Supplementary Information) sourced from several flux tower and sap flow databases (refer to the Methods section for the calculation of St from flux tower and sap flow data). The hybrid model performance is compared to that of the fully process-based model. Violin plots and spatial maps illustrate the Kling-Gupta Efficiency (KGE), a metric which combines correlation, variability bias, and mean bias (see Methods). KGE values theoretically range from −∞ to 1.0, with values greater than −0.41 implying that the model is a better predictor than the mean seasonal cycle50.

The violin plots (Fig. 2a) show the distribution of KGE values calculated for the 231 stations located in short vegetation ecosystems, and the 227 stations located in tall vegetation ecosystems (137 flux tower sites and 90 sap flow measurement stations). We see that both the process-based model and the hybrid model accurately estimate St in short vegetation ecosystems (including Croplands, Shrub and Grasslands, and Wetlands) and tall vegetation ecosystems (consisting of Broadleaf, Needleleaf, and Mixed forests); see Table 3 in Supplementary Information for station-wise land cover classification. For most stations (>75%), KGE values from the process-based model are higher than −0.41. However, the deep learning model of St improves these results, particularly over tall vegetation (see Fig. 2a). The higher KGE is attributable to improvements in the bias and variability components of the KGE rather than the correlation component (refer to Figure 1 in Supplementary Information for violin plots of correlation and root mean square error, RMSE). While the average correlations of the process-based model estimates of St are similar to those of the hybrid model, the RMSE of the hybrid model tends to be substantially lower, particularly for tall vegetation ecosystems.

Next, we check whether the improvement in the estimation of St in the hybrid model is propagated to the simulation of E. From Fig. 2b, it is evident that the improvements in St are not linearly translated to E. This can be attributed to the fact that the vast majority of the flux towers and sap flow sites are in energy-limited regions, where E dynamics are influenced more by Ep than by St. Overall, both models exhibit high, and similar, KGE values (median value of approximately 0.5) for short vegetation. For tall vegetation, the hybrid model outperforms the process-based model in terms of KGE values. In terms of correlation and RMSE, both models perform similarly (see Fig. 1 in Supplementary Information): the process-based model exhibits marginally higher correlations, while the RMSE of the hybrid model is lower for both vegetation classes. We also compare the estimates of E from the hybrid model with that of a purely machine learning-based dataset, FLUXCOM (Fig. 2 in Supplementary Information). We see that while the overall performance of both approaches is similar, the hybrid model tends to outperform FLUXCOM in forest (tall vegetation) ecosystems.

To understand the difference between the hybrid and process-based models better, we compare the spatial distribution of differences in KGE values for St and E estimates from the two models for different geographical zones (Fig. 3; see also Figs. 3 and 4 in Supplementary Information for absolute values of KGE for St and E). In North America (NA), which has the largest number of flux towers and sap flow sites, the hybrid model outperforms the process-based model in estimating St and E, especially in the humid eastern and north-eastern areas. In comparison, both models tend to inaccurately simulate St in the arid south-west region. In Europe (EU), the hybrid model performs better than the process-based model in estimating St across the majority of the flux tower stations, including stations which are located in the relatively arid south. However, in Asia (AS) and the rest of the world (RW), the performance of the hybrid model is very similar to that of the process-based model. One reason could be that the AS and RW regions have a very sparse distribution of stations, and thus flux towers and sap flow sites in those ecosystems may have distinct biophysical characteristics from the majority of sites in the training database. Further, we compare the spatial maps of correlation and RMSE (see Figs. 5–8 in Supplementary Information) to understand the source of the disparity in KGE values. In terms of correlation, the two models perform very similarly to each other across the different regions. Therefore, the major source of improvement in the hybrid model can be traced to the better estimation of the variability seen in the observations, a fact supported by the violin plots (Figure 1 in Supplementary Information). Further, we notice that the discrepancy in the St estimates between the two models does not translate to an improved E estimation, particularly in energy-limited regions (Fig. 3), where E dynamics are influenced more by Ep than by St.
Finally, we also compare the performance of the hybrid model against FLUXCOM at individual flux tower and sap flow measurement sites (Fig. 9 in Supplementary Information). Similar to the comparison with the process-based model, we see that the hybrid model underperforms in the relatively arid western parts of the US and the Iberian Peninsula.

### Comparison with global datasets

The goal of the hybrid model is to generate spatially and temporally continuous estimates of St and E over the entire continental surface. It is therefore important to also validate it against independent global estimates of both St and E. To this end, seasonal aggregates of St and E are compared with other global datasets in Fig. 4 and Fig. 5, respectively. To further investigate the realism of these global patterns, the temporal dynamics are investigated in Fig. 6 by displaying correlation maps based on monthly time series.

Due to the absence of observations of St at those scales, we choose a satellite-retrieved proxy that has been shown to represent the transpiration stress experienced by vegetation reasonably well: the ratio of solar-induced chlorophyll fluorescence to photosynthetically-active radiation (SIF/PAR)51 (see Methods). We note here that the units and range of SIF/PAR values are different from those of St, but that the spatial gradients and temporal dynamics are expected to be comparable. We also caution that the comparison may not be appropriate under extreme conditions and higher CO2, where carbon and water cycles may decouple52. In June-July-August (JJA), the summer season in the Northern Hemisphere, we see that the spatial patterns of St in the hybrid model are similar to those in the process-based model (Fig. 4a, c). However, the hybrid model better captures the higher transpiration stress that is suggested by the low values of SIF/PAR in the higher latitudes (Fig. 4e). For December-January-February (DJF), the picture is similar; St in the higher latitudes is accurately captured by the hybrid model (Fig. 4b, d, f). Similarly, we see that the hybrid model represents the stresses in the Congo, Amazonian and Eastern Asian rainforests accurately, both in JJA (Fig. 4a) and in DJF (Fig. 4b). Figure 6a, c shows the temporal correspondence between St and SIF/PAR for the hybrid and process-based models, respectively, while Fig. 6e shows the difference between the two previous maps. We see that the hybrid model exhibits a positive correlation with SIF/PAR over a majority of the continental surface, with parts of Amazonia, Congo, and South East Asia (Fig. 6a) being an exception. The hybrid model St shows a better correlation with SIF/PAR in eastern China and in northern latitudes (compare Fig. 6a, c). In contrast, the process-based model shows a higher correlation in large parts of western North America, Europe, and Australia.
In addition, the hybrid model shows a marked improvement in the spatial correlation with SIF/PAR, both in the JJA season (0.66 vs. 0.59) and in the DJF season (0.42 vs. 0.34).

We also compare the E estimates from the hybrid and process-based models with a purely machine learning-based E dataset (FLUXCOM) which is trained on a subset of the global flux towers used in this study34. We see that in both seasons, JJA and DJF, the spatial patterns of E from the hybrid and process-based models are similar to those from FLUXCOM (Fig. 5). Regions of divergence are seen in the northeastern parts of South America, and southern and eastern Africa, where the FLUXCOM E estimates are higher than those of the hybrid and process-based models, especially during JJA. The correlation maps (Fig. 6b, d) show a high correspondence between the hybrid model estimates of E and FLUXCOM. A major region of divergence that stands out in both the hybrid and process-based models is Amazonia. This may relate to the fact that very few stations are available in tropical forests for model training, and therefore both the estimates of FLUXCOM and the hybrid model tend to be more uncertain there, and it may also reflect the lack of explicit consideration of interception loss as a component of E in FLUXCOM. Meanwhile, the difference between the correlations of the hybrid and process-based model with FLUXCOM is nominal (Fig. 6f). The hybrid model also shows mild improvements in the spatial correlation to FLUXCOM, both during JJA (0.84 vs. 0.81) and DJF (0.95 vs. 0.94).

## Discussion

The growing complexity of large-scale Earth system and climate models requires increasingly high computational resources. More importantly, processes are frequently represented based on limited experimental understanding and are thus uncertain in their application at larger scales. Hybrid modeling approaches have the potential to reduce the ill-effects of over-parameterization, reduce computation times, and even improve accuracy in process representation53. Here, we focus on one of the main unknowns in the global water cycle and a key variable in climate models: terrestrial evaporation (E). We developed and applied a global-scale hybrid model of E, in which a deep learning-based formulation of transpiration stress was embedded within a process-based model at daily timescales. We showed that the deep learning model, designed without a priori assumptions, and based on expert knowledge, is overall more accurate than the traditional process-based counterpart at capturing the non-linearly interacting processes that yield transpiration stress. The biggest improvement is seen in forested (tall vegetation) regions, especially in northern latitudes. This has important implications for constraining transpiration estimates in tropical, temperate, and boreal forests which contribute a major part of the global transpiration54. The study also highlights a limitation of any deep learning model, in which sufficient availability of training data is crucial: the majority of the flux towers and sap flow measurement sites used for training are located in North America and Europe. This is especially relevant for modeling Earth system processes that exhibit large regional (and local) variability, and thus for which the ability of any data-driven formulation to generalize over the entire globe will by default be imperfect. 
From a computational perspective, the model was developed in TensorFlow, a popular Python library for deep learning, which scales across a wide range of hardware, operating systems, and programming languages. Therefore, the transpiration stress model is agnostic of the host model, and hence can be embedded in different global scale Earth system models.

## Methods

### Evaporative stress formulation in the process-based model

In the conventional, process-based GLEAM, the total evaporative stress (S) is composed of St and Sb. St is defined as

$${S}_{t}=\sqrt{\frac{VOD}{VO{D}_{max}}}\left(1-{\left(\frac{{w}_{c}-{w}_{w}}{{w}_{c}-{w}_{wp}}\right)}^{2}\right)$$
(1)

where VODmax is the maximum (99th percentile) VOD, wc is the critical soil moisture, ww is the soil moisture content of the wettest soil layer, and wwp is the wilting point. St is calculated separately for tall and short vegetation.

Sb is defined as

$${S}_{b}=1-\left(\frac{{w}_{c}-{w}_{1}}{{w}_{c}-{w}_{r}}\right)$$
(2)

where w1 is the surface soil moisture (first layer in the soil water balance module of GLEAM) and wr is the residual soil moisture content. The values of wwp, wc, and wr are taken from version 3 of GLEAM42.
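
As an illustration, Equations (1) and (2) can be evaluated as in the following minimal sketch. The function names are hypothetical, and the clipping of intermediate ratios to [0, 1] is an assumption added here so that both stress factors stay within physical bounds; it is not stated in the formulation above.

```python
import math


def _clip(x: float, lo: float, hi: float) -> float:
    """Limit x to the closed interval [lo, hi]."""
    return min(max(x, lo), hi)


def transpiration_stress(vod: float, vod_max: float,
                         w_w: float, w_c: float, w_wp: float) -> float:
    """Eq. (1): sqrt(VOD / VOD_max) * (1 - ((w_c - w_w) / (w_c - w_wp))**2)."""
    phenology = math.sqrt(_clip(vod / vod_max, 0.0, 1.0))
    # Relative moisture deficit; clipping is an assumption of this sketch.
    deficit = _clip((w_c - w_w) / (w_c - w_wp), 0.0, 1.0)
    return phenology * (1.0 - deficit ** 2)


def bare_soil_stress(w_1: float, w_c: float, w_r: float) -> float:
    """Eq. (2): 1 - (w_c - w_1) / (w_c - w_r), using surface soil moisture w_1."""
    deficit = _clip((w_c - w_1) / (w_c - w_r), 0.0, 1.0)
    return 1.0 - deficit
```

With these definitions, both factors equal 1 when soil moisture reaches the critical value (and, for St, when VOD is at its maximum), and St drops to 0 at the wilting point.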

### Development of the deep learning-based transpiration stress formulation

The first step consists of defining the target variable, and the appropriate predictors or covariates. Here, the target variable is the tower-scale St, calculated as

$${S}_{t}=\frac{{E}_{t}}{{E}_{pt}}$$
(3)

where Et is the actual transpiration and Ept is the potential transpiration.

To estimate Et in Equation (3), we use daily in situ measurements of E, assembled from a total of 557 flux towers. These towers were compiled from FLUXNET55 (https://fluxnet.org/data/fluxnet2015-dataset/), FLUXNET-CH4 (https://fluxnet.org/data/fluxnet-ch4-community-product/), AmeriFlux (https://ameriflux.lbl.gov/), European Eddy Fluxes Database Cluster (http://www.europe-fluxdata.eu/), and the Integrated Carbon Observation System (ICOS) (https://www.icos-cp.eu/). After the removal of inconsistent values, we end up with 368 stations, out of which 231 stations (approximately 173,000 data points) are classified as having dominantly short vegetation and 137 stations (approximately 103,000 data points) are classified as tall vegetation (refer to Fig. 12 and Table 3 in Supplementary Information for site-specific details and for the mapping of flux tower land cover class to tall and short vegetation). To separate Et from E at the flux stations, we use empirical functions relating the ratio of Et to E to the leaf area index (LAI) for different vegetation classes56 (see Section 2 in Supplementary Information). We remove rainy days from the flux tower datasets to minimize the impact of interception loss on the measurements of E and sensor errors during rain. The LAI-based Et partitioning model is used here to ensure that the deep learning model of St is completely independent from the E partitioning model used to estimate Et at the eddy covariance sites. Other commonly used partitioning models apply water use efficiency and surface conductance as the main predictors in their empirical approaches57, which are in turn dependent on vapor pressure deficit (VPD), an important covariate used in the deep learning model developed in this study (see below). Therefore, to prevent such confounding dependencies between VPD and St, we use an LAI-based empirical model56. We note here that none of the existing Et partitioning models, simple or complex, are perfect. 
The LAI-based method used here has been validated over different ecosystems56.

To mitigate the effects of the uncertainty in Et estimates arising from the choice of the partitioning model used in this study, we supplement the estimates of tall vegetation Et partitioned from E at the flux towers with a more direct estimate of Et from sap flow measurements. These in situ measurements are sourced from SAPFLUXNET, a global database of tree-level sap flow measurements58. It contains sub-daily time series of sap flow accompanied by in situ-measured hydrometeorological variables and ancillary site, stand and plant metadata. Here, tree-level sap flow data (cm3/h) from SAPFLUXNET version 0.1.5 were expressed per unit projected crown area (Ac), estimated as a function of tree trunk basal area at breast height (Ab), site mean annual temperature (MAT) and precipitation (MAP). This model ($$\log {A}_{c}=-2.53+0.602\,\log {A}_{b}+0.0960\,MAT-5.48\times {10}^{-5}\,MAP$$, N = 1055, R2 = 0.89) was fitted using data from the Biomass And Allometry Database59 (BAAD). Tree-level averages of sap flow per unit crown area were then averaged per measured species, weighted by the basal area composition of the stand, and aggregated into daily values. A total of 90 experimental sites are used in the study (Fig. 13 and Table 4 in Supplementary Information). With the addition of the sap flow measurement sites to the 137 flux towers, the total number of data points available for training and validating the deep learning model of Et stress for tall vegetation is approximately 125,000 (20% corresponding to sap flow sites).

Next, we obtain daily values of Ept for each station from GLEAM. GLEAM uses a Priestley-Taylor formulation to calculate Ept, which has been shown to be generally accurate at ecosystem scales24. To account for the scale mismatch between grid-scale estimates of GLEAM and point-scale measurements at the flux tower sites, we scale the Ept values with Et values using days following rain days as:

$${E}_{pt}^{scaled}=\left(\frac{{E}_{pt}^{raw}-{E}_{pt,mean}^{raw}}{{E}_{pt,sd}^{raw}}\right)* {E}_{t,sd}^{flux}+{E}_{t,mean}^{flux}$$
(4)

where $${E}_{pt}^{raw}$$ is the raw GLEAM Ept for the specific flux tower site, $${E}_{pt,mean}^{raw}$$ is the mean of the raw GLEAM Ept estimates for the specific flux tower site, $${E}_{pt,sd}^{raw}$$ is the standard deviation of the raw GLEAM Ept for the specific flux tower site, $${E}_{t,mean}^{flux}$$ is the mean of the observed Et at the flux tower, and $${E}_{t,sd}^{flux}$$ is the standard deviation of the observed Et at the flux tower. Inherent in this bias-correction approach is the assumption that ecosystems transpire at their potential on days after rainfall.
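
A minimal sketch of Equations (3) and (4) is given below. It reflects one possible reading of the procedure: the means and standard deviations are computed only over days following rain, and the resulting linear rescaling is then applied to every day. The function names, the boolean `after_rain` mask, and the clipping of the target to [0, 1] are assumptions of this sketch. Population standard deviations are used; since both series share the same sample size, the sample estimator would give the same ratio.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def scale_ept(ept_raw: Sequence[float], et_flux: Sequence[float],
              after_rain: Sequence[bool]) -> List[float]:
    """Eq. (4): standardize raw E_pt, then rescale to the mean/sd of tower E_t."""
    # Reference statistics from days following rain (assumed reading).
    ref_ept = [x for x, keep in zip(ept_raw, after_rain) if keep]
    ref_et = [x for x, keep in zip(et_flux, after_rain) if keep]
    m_raw, sd_raw = mean(ref_ept), pstdev(ref_ept)
    m_flux, sd_flux = mean(ref_et), pstdev(ref_et)
    return [(x - m_raw) / sd_raw * sd_flux + m_flux for x in ept_raw]


def stress_target(et_flux: Sequence[float],
                  ept_scaled: Sequence[float]) -> List[Optional[float]]:
    """Eq. (3): S_t = E_t / E_pt, clipped to [0, 1]; None where E_pt <= 0."""
    targets: List[Optional[float]] = []
    for et, ept in zip(et_flux, ept_scaled):
        if ept <= 0.0:
            targets.append(None)  # ratio undefined; drop before training
        else:
            targets.append(min(max(et / ept, 0.0), 1.0))
    return targets
```

On the post-rain days used for the statistics, the scaled Ept shares the mean and spread of the observed Et, which is how the assumption that ecosystems transpire at their potential after rainfall enters the target variable.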

The covariates used for modeling St are the absolute values and seasonal anomalies of the following variables: (a) PAW, (b) VPD, (c) Ta, (d) SWi, (e) VOD, and (f) CO2. PAW is commonly defined60 as

$$PAW=\frac{{w}_{w}-{w}_{wp}}{{w}_{c}-{w}_{wp}}$$
(5)
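
As a consistency check between Equations (1) and (5): since $$\frac{{w}_{c}-{w}_{w}}{{w}_{c}-{w}_{wp}}=1-PAW$$, the process-based transpiration stress can be rewritten in terms of PAW as

$${S}_{t}=\sqrt{\frac{VOD}{VO{D}_{max}}}\left(1-{\left(1-PAW\right)}^{2}\right)=\sqrt{\frac{VOD}{VO{D}_{max}}}\,PAW\left(2-PAW\right)$$

so that, in the process-based formulation, PAW and VOD are the only two stressors entering St.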

The absolute values and anomalies of PAW for the flux tower sites are derived from GLEAM42 (see Section 3 in Supplementary Information for input data used in GLEAM). VPD is derived from relative humidity and Ta data sourced from the Atmospheric Infrared Sounder (AIRS) aboard the Aqua satellite mission61. SWi is derived from the Clouds and the Earth’s Radiant Energy System (CERES) satellite mission62. VOD is derived from the Vegetation Optical Depth Climate Archive (VODCA) dataset63. The SWi and VOD from the same data sources are used as forcing to the GLEAM model to generate PAW to ensure consistency. CO2 data is sourced from the Copernicus Atmospheric Monitoring Service Global Inversion of Greenhouse Gas Fluxes and Concentrations project (https://ads.atmosphere.copernicus.eu). Finally, within the GLEAM soil water balance model, Equation (5) is solved for short and tall vegetation separately and aggregated based on the fraction of tall and short vegetation in every grid cell. For tall (or short) vegetation flux tower sites, PAW weighted by the corresponding tall (or short) vegetation fraction is extracted. In GLEAM, for tall vegetation, ww is calculated based on three soil layers, and for short vegetation ww is based on two soil layers. Here, we note that the choice of estimating the covariates from global gridded datasets rather than in situ measurements at the flux towers and sap flow sites is deliberate. This is done to maintain consistency between the datasets which are used for training (at the point scale) and prediction within the hybrid model (at a coarser scale of 0.25° × 0.25°). In doing so, we aim to minimize the uncertainties that would arise from training and predicting with different datasets. This experiment design choice trades potentially higher local-scale predictive accuracy and interpretability for more consistent and reliable prediction at the global scale.
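
The following sketch illustrates how two of the covariate types could be prepared. Neither formula is specified in the text above, so both are assumptions: VPD is computed from Ta and relative humidity with the Tetens-type saturation vapor pressure expression used in FAO-56, and a seasonal anomaly is taken as the departure from the mean of the same calendar month. Function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Sequence


def vapor_pressure_deficit(ta_celsius: float, rh_percent: float) -> float:
    """VPD in kPa from air temperature (deg C) and relative humidity (%).

    Saturation vapor pressure follows the FAO-56 Tetens-type expression
    (an assumption; the source does not name the formula used).
    """
    e_sat = 0.6108 * math.exp(17.27 * ta_celsius / (ta_celsius + 237.3))
    return e_sat * (1.0 - rh_percent / 100.0)


def seasonal_anomalies(values: Sequence[float],
                       months: Sequence[int]) -> List[float]:
    """Departure of each value from the mean of its calendar month (assumed definition)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, m in zip(values, months):
        sums[m] += v
        counts[m] += 1
    return [v - sums[m] / counts[m] for v, m in zip(values, months)]
```

Supplying both the absolute value and its anomaly lets a model distinguish, for instance, a VPD that is high in absolute terms from one that is merely high for the season.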

### Deep learning model architecture and training

Designing an optimal deep learning model involves optimizing a number of model-related variables (hyper-parameters) such as the number of layers, the number of neurons in each layer, the activation functions in each layer, the rate of dropout to prevent over-fitting, the learning rate, and a loss or objective function along with an appropriate validation metric for evaluating the progress of model training. Here, we design the model architecture, optimize the hyper-parameters, and train the deep learning model using TensorFlow version 2.464. To optimize the hyper-parameters, we employ an automated optimization library available in TensorFlow; specifically, a Bayesian optimization procedure with maximization of the Kling-Gupta Efficiency (KGE)65 as both the training objective and validation metric. In training, the objective function is implemented as the minimization of 1 − KGE. KGE is selected as it combines correlation, variability bias, and mean bias into a single metric. KGE is defined as

$$KGE=1-\sqrt{{(r-1)}^{2}+{\left(\frac{{\sigma }_{sim}}{{\sigma }_{obs}}-1\right)}^{2}+{\left(\frac{{\mu }_{sim}}{{\mu }_{obs}}-1\right)}^{2}}$$
(6)

where r is the linear correlation between simulated and observed values, σsim and σobs are the standard deviations of the simulations and observations, and μsim and μobs are the mean values of the simulations and observations.
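
For illustration, Equation (6) and the corresponding training objective (1 − KGE) can be written in plain Python as below. This is a sketch only: the function names are hypothetical, population standard deviations are assumed, and the study itself implements the objective within TensorFlow.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def kge(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Kling-Gupta Efficiency, Eq. (6)."""
    mu_sim, mu_obs = mean(sim), mean(obs)
    sd_sim, sd_obs = pstdev(sim), pstdev(obs)
    # Pearson correlation from the population covariance.
    cov = sum((s - mu_sim) * (o - mu_obs) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (sd_sim * sd_obs)
    return 1.0 - math.sqrt(
        (r - 1.0) ** 2
        + (sd_sim / sd_obs - 1.0) ** 2
        + (mu_sim / mu_obs - 1.0) ** 2
    )


def kge_loss(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Objective minimized during training: 1 - KGE."""
    return 1.0 - kge(sim, obs)
```

A perfect simulation gives KGE = 1 (loss 0). A simulation that is perfectly correlated but doubles both the mean and the variability of the observations gives KGE = 1 − √2, which shows how the bias and variability terms can dominate the score even when the correlation is high.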

First, the Bayesian hyper-parameter optimization was carried out for short vegetation data (231 sites). The optimal deep learning architecture was found after approximately 1000 iterations of the Bayesian optimization procedure. The resulting deep learning architecture was manually tuned. The final model was then trained for short vegetation St with a training:validation data split of 85:15, a batch size of 100, a learning rate of 0.000142, and a maximum of 1000 epochs. The training process does not make any distinction between the different sites: all the 173,000 data points from the 231 sites are treated equally. The training was automatically stopped when the validation objective function started degrading (while the training objective function kept improving), a sign that the model is overfitting (Fig. 11 in Supplementary Information shows the evolution of the objective during the training process). The same model architecture and training setup was used for training the model for tall vegetation St (227 sites). As the model performed satisfactorily with some minor changes, the time-consuming hyper-parameter optimization procedure was not performed separately for the tall vegetation dataset (see Fig. 8 in Supplementary Information for the final deep learning models).
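
The stopping rule described above can be sketched independently of any deep learning library. In this illustrative version, training halts once the validation objective (1 − KGE) has not improved for `patience` consecutive epochs; the function name and the `patience` parameter with its default are assumptions, as the tolerance actually used is not stated.

```python
from typing import Iterable, Tuple


def early_stopping(val_losses: Iterable[float], patience: int = 5,
                   max_epochs: int = 1000) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a stream of validation losses.

    Training stops when the loss has not improved for `patience` epochs
    (hypothetical setting) or when `max_epochs` have been run.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        stop_epoch = epoch
        if loss < best_loss:
            # Validation objective improved: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: likely overfitting.
            break
    return best_epoch, stop_epoch
```

In practice, the weights from `best_epoch` would be the ones retained, since later epochs improve only the training objective.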

### Calculation of SIF/PAR

SIF data is sourced from the contiguous Orbiting Carbon Observatory-2 (OCO-2) dataset, which is available at 0.05° resolution and 16-day time step66. This dataset uses machine learning to gap-fill SIF data to produce a spatially continuous dataset from the OCO-2 satellite, which has a small footprint and infrequent overpass times. The data was spatially aggregated to 0.25° and temporally aggregated to monthly timescales for calculating the correlation maps (Fig. 6) and to seasonal timescales for Fig. 4. PAR data is from the CERES mission62. PAR data is available at 1.0° spatial resolution and at hourly to monthly temporal resolution. Here, the monthly PAR data was used to normalize SIF data.
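
A minimal sketch of the spatial aggregation and normalization steps is shown below. It assumes a simple block mean for moving from 0.05° to 0.25° (a factor of 5), grid dimensions that are multiples of that factor, missing cells encoded as `None`, and PAR already resampled to the same 0.25° grid; the resampling method for PAR is not described above, and the function names are hypothetical.

```python
from typing import List, Optional

Grid = List[List[Optional[float]]]


def block_mean(grid: Grid, factor: int = 5) -> Grid:
    """Aggregate a fine grid by averaging factor x factor blocks, ignoring None."""
    rows, cols = len(grid), len(grid[0])
    coarse: Grid = []
    for i in range(0, rows, factor):
        coarse_row: List[Optional[float]] = []
        for j in range(0, cols, factor):
            cells = [
                grid[a][b]
                for a in range(i, i + factor)
                for b in range(j, j + factor)
                if grid[a][b] is not None
            ]
            coarse_row.append(sum(cells) / len(cells) if cells else None)
        coarse.append(coarse_row)
    return coarse


def sif_over_par(sif: Grid, par: Grid) -> Grid:
    """Cell-wise SIF/PAR; None where either input is missing or PAR <= 0."""
    return [
        [s / p if s is not None and p is not None and p > 0 else None
         for s, p in zip(sif_row, par_row)]
        for sif_row, par_row in zip(sif, par)
    ]
```

For the 0.05° product, `block_mean(sif_grid, factor=5)` would yield the 0.25° field that is then divided by the monthly PAR field.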