Main

Accurate evapotranspiration (ET) data are essential for assessing the surface energy and water balance, the carbon cycle and the management of water resources1. ET is the sum of the flux of water vapour from soil (evaporation) and through vegetation (transpiration) to the atmosphere. ET constitutes the second largest component of the terrestrial water balance, after precipitation. The usefulness of spatially contiguous mapping of ET, particularly over irrigated agricultural lands, has been amplified by drought, climate change, and high rates of human water withdrawal and agricultural consumption, leaving many aquifers and water reservoirs in the western United States at all-time-low levels2,3,4. Satellite-based remote sensing of ET (RSET) offers a powerful approach for mapping ET over large geographic regions at semi-continuous timescales1,5,6. Until recently, the availability of RSET data at spatial scales relevant for water resources management has been limited by cost and computational requirements.

OpenET5 employs six state-of-the-art satellite-based RSET models, that is, ALEXI/DisALEXI7, eeMETRIC8, geeSEBAL9, PT-JPL10, SIMS11,12 and SSEBop13, that have been widely applied and evaluated in the United States for a range of water management and agricultural applications. The models are applied on the Google Earth Engine cloud-based platform14 to provide historical and near real-time ET data at subfield scales (30-m pixels) over the western United States5. Five of the RSET models constrain components of the surface energy balance (SEB) using land surface temperature (LST) primarily derived from Landsat Collection 2, along with gridded weather data, and land cover datasets. The sixth model, SIMS, assumes well-watered conditions and computes crop coefficients based on vegetation density, derived from satellite surface reflectance values, along with a gridded soil water balance model. The models composing OpenET have been used by water managers, farmers and governmental organizations for irrigation scheduling, water accounting and allocation, and water rights administration15,16,17. The OpenET platform provides an unprecedented level of accessibility to RSET data through its public online data explorer interface, including querying satellite ET within individually vectorized field boundaries. All six RSET models in OpenET operate automatically, including any required calibrations, which permits rapid calculations for the more than 100,000 Landsat images processed so far across the 23 western-most states in the contiguous United States. As the number of applications of RSET data for sustainable land and water resources management grows, it is important for practitioners to have information on the accuracy of RSET data across land cover types, climatic zones and agricultural production practices18.

In this Analysis, we present a large-scale benchmark assessment of the accuracy of OpenET data using a well-curated publicly archived dataset of in situ ET measurements from 152 stations (141 eddy covariance (EC) systems, 7 Bowen ratio systems and 4 lysimeters), over a variety of regions, climates and land cover types19,20, collectively comprising ~45 years of paired model–measurement ET data (Fig. 1). The EC technique is generally viewed as the best available method for continuous measurement of in situ energy and heat flux at spatial scales that approach satellite-based retrievals21,22, although we acknowledge the associated data uncertainties and made efforts to reduce them19. In addition to evaluation of individual model accuracies, we evaluated the OpenET ensemble ET value, computed as the mean of all models after flagging and removal of up to two outliers using the median absolute deviation (MAD) approach23,24. The generation of an ensemble value is a widely used technique to combine outputs from diverse models, each having their own behaviour5 and random error25,26,27. It also facilitates applications such as irrigation scheduling and water rights administration, where practitioners require a single value for use in management of water resources5. The publicly archived in situ flux dataset allows for reproducibility and benchmarking of future OpenET model versions or other RSET data.
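The ensemble calculation described above can be sketched as follows. This is a minimal illustration rather than the OpenET implementation: the function name `mad_ensemble`, the cut-off of 2 MAD units and the handling of a zero MAD are assumptions made for the example; only the cap of two removed models and the use of the mean of the remaining values come from the text.

```python
from statistics import mean, median


def mad_ensemble(values, threshold=2.0, max_outliers=2):
    """Return (ensemble_mean, dropped_indices) for one pixel and time step.

    values       -- ET estimates from the individual models (same units)
    threshold    -- hypothetical cut-off in MAD units, not the OpenET setting
    max_outliers -- cap on removals (two for six models, per the text)
    """
    centre = median(values)
    deviations = [abs(v - centre) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half of the models agree exactly; keep all (assumption).
        return mean(values), []
    flagged = [i for i, d in enumerate(deviations) if d / mad > threshold]
    # If more models are flagged than may be removed, drop the most extreme.
    flagged.sort(key=lambda i: deviations[i], reverse=True)
    dropped = sorted(flagged[:max_outliers])
    kept = [v for i, v in enumerate(values) if i not in dropped]
    return mean(kept), dropped


# Hypothetical monthly ET (mm) from six models; the last value is an outlier.
ensemble, dropped = mad_ensemble([100, 102, 98, 101, 99, 150])
```

For land cover types where only five models are available, the same sketch would be called with `max_outliers=1`.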

Fig. 1: Map of in situ ET measurement sites.

Map of the locations of in situ ET stations used to evaluate OpenET, including their general land cover type and Köppen–Geiger (KG) climate zones34. White areas represent climate zones that did not contain any cropland sites and were excluded from the analysis. Climate zone abbreviations are defined as follows: cold and hot semi-arid steppe (Bsk + Bsh); hot and cold desert (Bwh + Bwk); humid subtropical (Cfa); hot- and warm-summer Mediterranean (Csa + Csb); and hot- and warm-summer humid continental (Dfa + Dfb).

ET data computed from micrometeorological measurements at EC sites were obtained from a variety of sources, primarily AmeriFlux28. Supplementary Table 1 provides a full list of stations used in the study including land cover type, site principal investigators, Digital Object Identifiers (DOIs) and other metadata. Flux data were carefully post-processed, including gap-filling, screening for energy balance closure error and data completeness, and visual data quality assessments. Flux data that passed quality control and showed limited energy balance closure error were included in the study and underwent closure correction following the FLUXNET2015/ONEFlux approach for daily averaged fluxes19,29. We refer to EC data as ‘ECET’ throughout the article. Closed ECET data were considered to be most representative of actual ET30. To sample RSET pixels for comparison with ECET, flux footprints were developed for each station. Flux footprints are two-dimensional mappings of the areal extent of a station’s source area, that is, the area on the ground that contributes to fluxes measured by the tower instrumentation. Refer to Methods and Volk et al.19,20 for details on flux data processing and footprint mapping methods used. Additional discussion of uncertainty in EC data and steps taken to limit that uncertainty are provided in Supplementary Discussion 1. An overview of the satellite-driven ET models in the OpenET ensemble is provided in Methods.
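As a rough illustration of what closure correction does, the sketch below scales the turbulent fluxes of a single day so that they sum to the available energy while preserving the Bowen ratio, then converts latent heat flux to an ET depth. It is a simplified stand-in, not the FLUXNET2015/ONEFlux procedure, whose additional filtering and smoothing steps are omitted here. The function names and the constant latent heat of vaporization (2.45 MJ per kg) are assumptions made for the example.

```python
LATENT_HEAT_MJ_PER_KG = 2.45  # assumed constant latent heat of vaporization
SECONDS_PER_DAY = 86_400


def close_energy_balance(rn, g, h, le):
    """Scale H and LE so that H + LE equals Rn - G, keeping the Bowen ratio.

    All inputs are daily mean fluxes in W m-2. Returns (h_closed, le_closed,
    closure_ratio), where closure_ratio = (H + LE) / (Rn - G).
    """
    available = rn - g
    turbulent = h + le
    if available <= 0 or turbulent <= 0:
        raise ValueError("closure ratio is undefined for this day")
    ratio = turbulent / available
    return h / ratio, le / ratio, ratio


def latent_heat_to_et(le):
    """Convert daily mean latent heat flux (W m-2) to ET depth (mm per day)."""
    megajoules_per_m2_day = le * SECONDS_PER_DAY / 1e6
    # 1 kg of water spread over 1 m2 is a depth of 1 mm.
    return megajoules_per_m2_day / LATENT_HEAT_MJ_PER_KG


# Hypothetical day: Rn = 150, G = 10, H = 40 and LE = 72 W m-2.
h_closed, le_closed, ratio = close_energy_balance(150.0, 10.0, 40.0, 72.0)
closed_et = latent_heat_to_et(le_closed)
```

In this picture, 'unclosed' ET corresponds to converting the measured LE directly and 'closed' ET to converting the scaled LE.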

The discussion of statistical results that follows focuses on comparisons between monthly aggregated ECET and RSET. Although accuracy assessments were conducted using daily (date of overpass) data and monthly total ET aggregated to growing season and annual periods, our discussion focuses on monthly results for several reasons: monthly ET has utility for longer-term water accounting and planning; uncertainties in EC data due to closure and other factors are reduced at the monthly (compared with daily) timescale; and OpenET directly provides daily and monthly ET, along with data services that allow users to compute ET at other aggregation periods. Accuracy results are provided for daily, monthly, seasonal and annual timesteps in Supplementary Tables 2–6, and accuracy metrics for daily timesteps should be consulted for applications of ET data at timesteps of 1–15 days. Five well-known statistical metrics were used to evaluate OpenET accuracy (for equations, see Methods): the slope of a linear regression forced through the origin, which measures bias (Slope), mean bias error (MBE), mean absolute error (MAE), root-mean-square error (RMSE) and the coefficient of determination (r2). Regression results with a non-zero intercept for monthly data are provided in Supplementary Table 7.
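For readers who wish to reproduce these metrics from paired data, a minimal sketch is given below. The function name `accuracy_metrics` is hypothetical, and r2 is computed here as the squared Pearson correlation, which is an assumption; the definitions in Methods take precedence.

```python
from math import sqrt


def accuracy_metrics(observed, modelled):
    """Compute Slope, MBE, MAE, RMSE and r2 for paired ET values.

    observed -- in situ ET (for example, closed ECET), same units as modelled
    modelled -- RSET values paired with the observations
    """
    n = len(observed)
    diffs = [m - o for o, m in zip(observed, modelled)]
    mean_obs = sum(observed) / n
    mean_mod = sum(modelled) / n
    # Least-squares slope of modelled on observed, forced through the origin.
    slope = (sum(o * m for o, m in zip(observed, modelled))
             / sum(o * o for o in observed))
    mbe = sum(diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    rmse = sqrt(sum(d * d for d in diffs) / n)
    # r2 as the squared Pearson correlation (assumption, see lead-in).
    cov = sum((o - mean_obs) * (m - mean_mod)
              for o, m in zip(observed, modelled))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_mod = sum((m - mean_mod) ** 2 for m in modelled)
    r2 = cov ** 2 / (var_obs * var_mod)
    return {"slope": slope, "mbe": mbe, "mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical monthly ET totals (mm) with a constant low bias of 5 mm.
stats = accuracy_metrics([50.0, 100.0, 150.0], [45.0, 95.0, 145.0])
```

Dividing MBE, MAE or RMSE by the mean of the observations gives the percentage values quoted throughout the text.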

Performance over all agricultural flux sites

Of all the general land cover types sampled, OpenET models showed the strongest agreement with ECET collected in agricultural settings. For the 59 agricultural sites combined, eeMETRIC, SIMS and PT-JPL showed the least bias in terms of MBE, each with a magnitude below 4.5 mm per month, or 5% of the mean ECET (Table 1). The ensemble value had a slightly higher magnitude bias of −5.3 mm per month, or 5.8% of the mean ECET. The ensemble value outperformed each individual model in terms of MAE (15.9 mm per month, or 17.3% of the mean ECET), RMSE (20.4 mm per month, or 22.4%) and r2 (0.90). In comparison, MAE from individual models ranged from 17.9 to 22.7 mm per month, RMSE from 23.1 to 29.1 mm per month and r2 from 0.83 to 0.87, with the smallest errors from PT-JPL, SIMS and DisALEXI.

Table 1 Summary statistics between modelled and observed monthly ET for cropland sites

ET data from the individual RSET models were generally linearly related to ECET, with PT-JPL and SIMS exhibiting some curvature due to seasonally varying biases (Fig. 2). Many of the models underestimated ET during the cold season relative to the ECET, leading to the slightly low bias in the ensemble ET value (Table 1). To investigate seasonal variability in model accuracy, we pooled all monthly paired (model–measured) ET to generate monthly climatologies for major land cover classifications (Fig. 3 and Extended Data Figs. 1–5). The range between unclosed and closed ECET provides one measure of the uncertainty in the in situ data31.
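The pooling step can be sketched as below, assuming the paired data are available as (calendar month, observed ET, modelled ET) tuples. The function name `monthly_climatology` and the input layout are hypothetical; the residual follows the convention of model minus observation.

```python
from collections import defaultdict
from statistics import mean


def monthly_climatology(pairs):
    """Pool paired monthly ET by calendar month.

    pairs -- iterable of (calendar_month, observed_et, modelled_et) tuples
    Returns {month: (mean_observed, mean_modelled, residual)}, where the
    residual is mean modelled minus mean observed ET.
    """
    pooled = defaultdict(list)
    for month, observed, modelled in pairs:
        pooled[month].append((observed, modelled))
    climatology = {}
    for month in sorted(pooled):
        obs_mean = mean([o for o, _ in pooled[month]])
        mod_mean = mean([m for _, m in pooled[month]])
        climatology[month] = (obs_mean, mod_mean, mod_mean - obs_mean)
    return climatology


# Hypothetical station-months (mm per month): two Julys and one January.
example = monthly_climatology([
    (7, 150.0, 140.0),
    (7, 170.0, 160.0),
    (1, 20.0, 12.0),
])
```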

Fig. 2: Modelled versus observed monthly ET at cropland sites.

Monthly comparison of all paired OpenET5 ensemble members ET versus closed flux tower ET19,20 from all cropland stations for all months of record. Included for each model is the result of the least-squares linear regression model forced through the origin and r2.

Table 2 Summary statistics between modelled and observed monthly ET for cropland sites grouped by climate zone

Fig. 3: Monthly climatology of paired modelled and observed ET for cropland sites.

a, Monthly climatology of paired OpenET5 and flux tower ET19,20 from cropland sites. b, The residual of monthly mean ET (model minus mean closed flux ET). Unclosed and closed labels refer to flux tower ET before and after energy balance closure correction. Dashed lines represent the closed flux ET mean plus two standard errors of the mean and unclosed flux ET mean minus two standard errors of the mean.

For most months, the multi-model ensemble ET value was well bounded between the closed and unclosed mean ECET for cropland sites, while individual ensemble members showed more seasonal bias. In spring, SSEBop and eeMETRIC underestimated unclosed ET, whereas SIMS overestimated closed ET, probably due to the assumption of well-watered conditions. In peak summer months, most models were in good agreement with closed ECET, with geeSEBAL and PT-JPL biased low. In September and October, when actual ET rates decline quickly, several models were biased high, except DisALEXI and geeSEBAL, which tracked closer to the unclosed values. The higher agreement of RSET with ECET during the peak summer period is encouraging, as this is the period of intensive irrigation and consumptive use of water through ET. A post hoc test showed that DisALEXI, geeSEBAL and SSEBop had mean monthly ET values that were statistically different (as underestimation) from the mean closed ECET. The mean aggregated growing season ET for each model was not statistically different from the mean closed ECET (Supplementary Tables 8 and 9).

The monthly climatologies derived at flux sites were upscaled using data from all cropland pixels over the full OpenET domain (Extended Data Fig. 6). We found similar seasonal patterns and relative model biases to those identified at the flux sites—giving confidence in the representativeness of the ECET comparisons.

Impact of sampling interval on model performance

Model accuracy often improves as the temporal aggregation interval lengthens, owing to cancellation of errors8. In croplands, the accuracy metrics for the OpenET ensemble improved as the aggregation period increased from daily (overpass dates) to monthly to growing season to annual periods (Supplementary Tables 2–6). Daily ensemble results for the combined cropland sites showed an MAE of 23.6% and an RMSE of 31.1% of the mean ECET. At this timescale there is increased uncertainty both in the ECET data, due to variability in micrometeorological conditions and energy balance closure, and in remotely sensed ET, due to potential cloud contamination and errors in footprint representation. These ensemble uncertainties are reduced when integrating to monthly (MAE of 17.3% and RMSE of 22.4% of ECET), growing season (MAE of 12.9% and RMSE of 15.5% of ECET) and water year (MAE of 11.3% and RMSE of 12.3% of ECET) timescales. Fortunately, during growing season periods we found lower energy balance closure error in EC data19 and there is less cloud cover in satellite data in the western United States as compared with the non-growing period. During the summer, the daily ensemble normalized MAE (NMAE) on overpass dates was typically between 5% and 25% (Supplementary Fig. 1), and the monthly NMAE between 7% and 20% (Fig. 4). We expect custom aggregation periods between 2 and 15 days to have accuracy similar to or slightly better than the daily results, which vary seasonally; subweekly to bi-weekly RSET may be of greatest use for irrigation scheduling32.
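The effect of aggregation on normalized error can be illustrated with a toy calculation. The sketch below uses made-up daily values in which errors of opposite sign cancel once the data are summed over longer periods; the function names and the numbers are hypothetical and are not drawn from the study.

```python
def nmae(observed, modelled):
    """MAE divided by the mean of the observations (a fraction, not %)."""
    n = len(observed)
    mae = sum(abs(m - o) for o, m in zip(observed, modelled)) / n
    return mae / (sum(observed) / n)


def aggregate(values, period):
    """Sum consecutive, non-overlapping blocks of `period` values."""
    return [sum(values[i:i + period]) for i in range(0, len(values), period)]


# Hypothetical daily ET (mm per day); model errors alternate in sign.
observed_daily = [4.0, 5.0, 6.0, 5.0]
modelled_daily = [5.0, 4.0, 7.0, 4.0]

daily_nmae = nmae(observed_daily, modelled_daily)
two_day_nmae = nmae(aggregate(observed_daily, 2), aggregate(modelled_daily, 2))
```

In this contrived case the daily NMAE is 20% while the two-day NMAE is zero; with real data only the random part of the error cancels, so systematic bias remains at every timescale.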

Fig. 4: Monthly MAE of the model ensemble for different crop types.

a, Monthly mean flux tower ET19,20. b, OpenET5 ensemble MAE and MAE normalized by the mean flux tower ET19,20 (NMAE) using all paired model–measured data for cropland stations grouped by crop types. Annual crops that had a mixed history of rotation between C3 and C4 crop types, for example, corn–soy rotations, were not included in C3 or C4 results but were included in the combined grouping.

Performance among annual and perennial crops

Annual crops, including wheat, corn, soy, rice and others, make up the majority (80%) of cropland sites in the OpenET ECET dataset (Supplementary Table 1). Compared with perennial crops, annual crops tend to have shorter canopies and more homogeneous cover at peak growth stage. The annual crop sites in the OpenET flux dataset are predominantly irrigated, and are distributed across a range of climatic zones, with higher density in regions such as Mediterranean and semi-arid Central Valley, California, and humid continental regions in the High Plains and the Mississippi Alluvial Plain (Fig. 1).

For annual crops, each of the RSET models in the OpenET ensemble exhibited small bias and high levels of accuracy and precision (Table 1). Similar to all crop types combined, the ensemble value for annual crops outperformed individual models in terms of MAE (15.3 mm per month or 17.9% of mean ECET), RMSE (19.7 mm per month or 23.2% of mean ECET) and r2 (0.9). Of the RSET models, eeMETRIC and PT-JPL exhibited the lowest magnitude of MBE, with PT-JPL and SIMS yielding the highest accuracy in terms of MAE and RMSE.

Dividing annual crops into C3 and C4 subclasses, we find the seasonal patterns and magnitudes of ensemble MAE are similar throughout the year (Fig. 4). NMAE in general reflects the inverse of the characteristic water use curve for each class, with C3 crops exhibiting a broader seasonal curve than C4 and therefore lower NMAE early and late in the season. While the higher NMAE values observed outside the growing season for all crop types (Fig. 4) are more indicative of low ET rates than of meaningful modelling error characteristics, cool-season errors may be generally inflated by higher cloud cover, increasing the time interval between cloud-free satellite retrievals. Improving satellite imaging frequency, as well as ET time integration and gap-filling techniques, should help to increase OpenET accuracy during the non-growing season (Discussion).

Another class of interest is woody perennials, which are high-value crops and pose distinct modelling challenges. High-quality eddy flux ET data were available for three vineyards, three nut tree orchards and one fruit orchard, all located in California19,33. Vineyards and orchards have taller and more highly structured canopies, often with inter-row cover crops, and vineyards are often deficit irrigated. These qualities lead to shadowing and mixed pixel effects in remote sensing at the 30-m level, and the need for sensitivity to small changes in vine stress to inform deficit irrigation applications is a unique modelling requirement.

RSET model performance in the vineyard sites sampled was strong and consistent across models. The ensemble accuracy exceeded that for annual crops (Table 1 and Fig. 4), with lower bias (slope of 1.02 and MBE of 5.3 mm per month) and lower MAE and RMSE (13.7 and 16.2 mm per month, respectively, or 12.2% and 14.5% of the mean monthly ECET) and r2 of 0.90. DisALEXI performed similarly to or better than the ensemble at the vineyard flux sites, perhaps due to its two-source approach towards partitioning temperature fluxes between the substrate (inter-row) and canopy.

Performance was more varied across ensemble members for the orchards than for other broad crop types, and biases were more negative. This could be related to shadowing effects in the taller and more strongly clumped canopies, particularly for models that are strongly dependent on LST inputs. The ensemble value had a negative bias with mean slope of 0.87, MBE −11.9 mm per month, MAE 21.2 mm per month (16.8% of ECET) and RMSE 27.9 mm per month (22.1% of ECET), and an r2 of 0.91. SSEBop and SIMS had the least bias in terms of slope and MBE, and SSEBop and DisALEXI had the lowest error in terms of MAE and RMSE (Table 1). While MAE in orchards is high mid-season, the normalized values are similar to those of annual crops (Fig. 4).

Variation of model performance across climate regions

To investigate variations in OpenET performance over different climates, cropland accuracy metrics were grouped by the Köppen–Geiger climate zones of the flux sites34 (Fig. 1). Zones with fewer than five flux stations were omitted as a conservative measure, and some zones were lumped on the basis of secondary climate classifications (for example, hot- and warm-summer Mediterranean zones). Each resulting group had 7–13 flux stations used for calculation of accuracy statistics.
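The grouping logic can be sketched as follows. The lumped groups are taken from the Fig. 1 caption; the function name, the station identifiers and the choice to lump zones before applying the five-station minimum are assumptions made for illustration.

```python
from collections import defaultdict

# Lumped groups follow the Fig. 1 caption; other zones keep their own code.
LUMPED_ZONES = {
    "Bsk": "Bsk+Bsh", "Bsh": "Bsk+Bsh",
    "Bwh": "Bwh+Bwk", "Bwk": "Bwh+Bwk",
    "Cfa": "Cfa",
    "Csa": "Csa+Csb", "Csb": "Csa+Csb",
    "Dfa": "Dfa+Dfb", "Dfb": "Dfa+Dfb",
}


def group_stations_by_zone(stations, min_stations=5):
    """Group (station_id, koppen_zone) pairs and drop sparse groups.

    Groups with fewer than `min_stations` stations are omitted, mirroring
    the conservative threshold described in the text.
    """
    groups = defaultdict(list)
    for station_id, zone in stations:
        groups[LUMPED_ZONES.get(zone, zone)].append(station_id)
    return {zone: ids for zone, ids in groups.items()
            if len(ids) >= min_stations}


# Hypothetical station list: five Mediterranean sites and one subarctic site.
example = group_stations_by_zone([
    ("s1", "Csa"), ("s2", "Csb"), ("s3", "Csa"),
    ("s4", "Csb"), ("s5", "Csa"), ("s6", "Dfc"),
])
```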

Overall, the OpenET ensemble had better agreement with ECET at crop sites in water-scarce, semi-arid to arid regions (Mediterranean and desert zones in the Southwest) as compared with humid zones (Table 2 and Supplementary Fig. 2). Irrigation is more prevalent in semi-arid to arid regions, and crop ET tends to be closer to potential ET rates and is more accurately modelled in some RSET modelling frameworks. High accuracy of models in semi-arid and arid regions is advantageous, given the high priority of water resource sustainability and management challenges in these regions.

Among the zones considered, the OpenET ensemble value was most accurate for crop sites in Mediterranean zones, with MAE of 13.3 and RMSE of 16.5 mm per month (14.2% and 17.6% of the mean ECET), with the ensemble outperforming individual members. Of the individual models, SIMS showed the best agreement with ECET in these regions, suggesting well-watered conditions for most sites or possible influence of adjacent non-irrigated areas on SEB models. Similarly, in arid sites (hot and cold desert), SIMS had the lowest MAE and RMSE (Table 2). During the growing season periods when the majority of irrigation is applied, the ensemble’s monthly NMAE was consistently below 10% for cropland sites in Mediterranean climates (Supplementary Fig. 2).

Model performance in the subhumid and humid continental regions of the Midwest and Central Plains was similar to that in the Mediterranean climate zone, again with the ensemble outperforming individual models in terms of collective statistics (Table 2 and Supplementary Fig. 2). Errors were higher at the humid subtropical sites, with SIMS tending to overestimate ET with a slope of 1.15 and normalized MBE of 19.9%, indicating ET is less well correlated with vegetation density in this region, and that irrigation practices may result in intermittent vegetation water stress. Hypotheses for increased RSET error in humid regions and paths for improvement are proposed in Discussion.

Performance in natural ecosystems

Most of the flux stations (61%) used in the intercomparison were in non-agricultural sites, including shrublands, grasslands, mixed forests, conifer forests, and wetlands or riparian areas (Fig. 1)19. The SIMS model is currently not designed for and implemented in non-agricultural land-cover types; for these pixels, the ensemble consists of five models with the possibility of removing a single outlier (Methods). Systematic model error and variability for non-agricultural sites was higher than cropland sites (Fig. 5).

Fig. 5: Monthly modelled ensemble versus observed ET for sites grouped by land cover type.

Monthly comparison of the paired monthly OpenET5 ensemble ET versus closed flux tower ET19,20 for each general land cover group. Included for each group is the result of the least-squares linear regression model and r2.

Most models exhibited a high bias in wetland/riparian sites, dominated by overprediction of ET during the spring (Extended Data Fig. 5). SSEBop had higher accuracy in these sites than other models and the ensemble value (Supplementary Tables 2–4). For models that estimate all components of the SEB (DisALEXI, eeMETRIC and geeSEBAL), this bias could result from an underestimation of the substrate (water) heat storage term in the spring before the vegetation canopy develops7. These errors can potentially be mitigated in the future through accurate classification of inundated land areas.

Natural ecosystems under high water stress, such as shrublands and grasslands in desert and semi-arid steppe climates in the western United States, showed the highest variability and error with respect to ECET (Fig. 5 and Supplementary Tables 2–4). In these systems, ET can be a small fraction of available energy, and difficult to both measure on the ground and model using RSET approaches. Shrublands also tend to be more heterogeneous than cropland sites, and this can introduce additional uncertainty into model–measurement comparisons5. Nevertheless, it is important to provide an evaluation of accuracy, both to benefit ET monitoring and land health assessments within shrub and grassland ecosystems, and to identify key areas for future research in RSET to reduce model error.

The Landsat-scale ET from OpenET also has applications in forested landscapes, as a predictor of forest health and mortality35 and as a metric of water yield response to forest management36. In forested locations, most OpenET models overestimated ET, particularly at the evergreen flux sites sampled, yielding a slope for the ensemble value of 1.24 and MBE of 16.8 mm per month (27.3%). At these sites, eeMETRIC showed the least bias with a slope of 1.17 and an MBE of 10.8 mm per month (17.5%), while for MAE and RMSE, the ensemble value outperformed each individual model. At mixed forest sites, however, eeMETRIC and DisALEXI were in better agreement with ECET than was the ensemble.

Ensemble outlier removal and spatial inter-model variability

See Supplementary Discussion 2 for analysis and discussion of the MAD outlier removal approach that is used for computing the ensemble value, including spatial analysis of the occurrence of outliers and the long-term differences between each model’s seasonal ET and the ensemble value (Extended Data Figs. 7 and 8, Supplementary Figs. 3–9 and Supplementary Tables 9 and 10). Evidence suggests that the MAD approach showed accuracy metrics similar to other simple methods. Over 2016–2022, typically no model was identified as an outlier in cropland pixels; however, SIMS was about 10% more likely to be identified as an ensemble outlier, and it often gave the highest ET value, particularly in the Central Plains.

Discussion

ET is a critical driver and metric of ecosystem function, weather and climate, agricultural practices and water resource management. However, field-scale ET has previously been difficult to estimate at scale; therefore, ready access to high-resolution (spatially and temporally) ET data offers societal benefits to a variety of stakeholders1,5. Using monthly ET data, water managers can develop more accurate water budgets in support of incentive-driven conservation programmes and innovative management and trading strategies. For policymakers, such data can improve water supply tracking, simplify regulatory compliance and promote the co-development of solutions with local communities. Crop producers may be able to improve the efficiency of irrigation practices in some instances, resulting in enhanced sustainability and reduced costs for water, fertilizer and energy. Supplementary Discussion 3 continues the conversation on incentives towards improving irrigation efficiency and how OpenET data can provide value in an RSET-based irrigation scheduling framework.

In addition to informing water management, OpenET has multiple research and modelling applications. Carbon and climate modelling can benefit from 30-m RSET data as a diagnostic indicator of ecosystem health and function response under a changing climate1. RSET is being used to reduce summertime warm-dry bias in weather forecasting and climate models by improving the representation of ET from irrigated land37, ET–soil moisture coupling38 and transpiration–evaporation partitioning39. Hydrologic and land surface models at multiple scales can also benefit from high-resolution ET data, for example, as validation or forcing data in basins where streamflow measurements are not available to constrain the water budget13,40,41.

Realizing the full potential benefits of RSET data for water resource and land management applications requires rigorous and reproducible accuracy assessment to inform practitioners on best use practices18. The accuracy results we present here provide valuable constraints on model uncertainty based on broad crop type, climate region and timescale.

Average error in the OpenET ensemble value with respect to mean ECET in cropland sites for monthly, growing season and annual aggregated ET, ranged from 10% to 17% for MAE and 11% to 22% for RMSE. These errors are within accuracy levels of 10–20% reported for supervised remote sensing techniques42. They are also consistent with accuracy targets set by the OpenET user groups: 10–20% at a monthly timestep, and 15–25% for daily ET data5. These errors include uncertainties in ECET data, which are estimated to range from 10% to 30% depending on site characteristics and instrumentation design and maintenance42.

These accuracy results may support advancements in water management applications that incorporate OpenET data. For croplands, all models except for SIMS had negative bias errors at the monthly timestep (−2.7% to −13.3%), with an MBE of −5.8% for the ensemble ET value (SIMS MBE is +4.7%). Awareness of these bias errors when using these data for irrigation management applications may prevent unintentional deficit irrigation that can suppress crop yields and farm revenue43. Cross-comparisons between the primarily reflectance-based SIMS and PT-JPL models and the LST-driven models may be useful for identifying periods of intentional or unintentional crop water stress and deficit irrigation. Reducing errors in the OpenET daily data is a high priority for advancing their utility for on-farm water management.

At local to regional scales, the reported uncertainties at monthly to annual timesteps should inform applications related to water balance, water accounting and water rights administration. Comparison of OpenET data aggregated at the scale of irrigation districts or watersheds against carefully constrained water balances offers one path to assessment of biases at larger scales. Particularly in administration of water rights, the current uncertainty in the OpenET data (for example, growing season ensemble NMAE of 12.9% for croplands) must be recognized in evaluating consumptive water use, and OpenET data should only be used for this purpose in combination with other sources of information.

This study provides insights into potential pathways towards improving the accuracy of the individual models within the OpenET ensemble. Across both agricultural and some natural landscapes, most models underestimated ET during the winter and spring, particularly the models that rely upon thermal infrared (TIR) measurements to compute ET. This underestimation may be related to loss of thermal contrast over an image, where differences between the hottest and coolest pixels are reduced relative to midsummer values, adding uncertainty to within-scene scaling approaches. It may also be related to misrepresentation of soil evaporation during extended wet periods, extended periods of cloudiness, and error in shared model inputs. In addition, treatment of effects of senesced standing vegetation and crop residue on SEB can impact model performance outside of the growing season. In terms of observational errors, the energy balance closure error and uncertainty in EC data are also amplified during periods outside of the growing season19.

We found increased model error in croplands in humid climates as compared with drier regions. Again, lower temperature contrasts across humid landscapes may contribute to errors in TIR-based within-scene scaling models. A primary driver, however, is probably the relative paucity of clear-sky satellite retrievals and potential for error in LST due to undetected clouds. Improving temporal sampling of RSET model inputs will be a major focus of on-going development in OpenET, through future use of imagery from additional Landsat-like optical (Sentinel-2) and thermal (ECOSTRESS, VIIRS) sensors44, and integration of future TIR observations from satellite missions currently in development by NASA, USGS and the European Space Agency. Methods for computing ET values between cloud-free satellite observations, currently based on linear interpolation of the ratio of ET to a reference flux, can also be improved. Approaches used in mapping and predicting vegetation phenology45 and dynamic time warping46 algorithms developed for signal processing applications offer promise for reducing large errors during periods of rapid vegetation change or extended cloud cover, which would contribute to reduced RMSE values across the model ensemble.
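The gap-filling approach referred to above, linear interpolation of the ratio of ET to a reference flux, can be sketched as follows. This is an illustration under stated assumptions rather than the OpenET code: the function name, the use of integer day indices and the choice to hold the nearest ratio constant outside the first and last overpass are all hypothetical.

```python
from bisect import bisect_left


def interpolate_daily_et(overpass_fraction, reference_et):
    """Fill daily ET between cloud-free overpasses.

    overpass_fraction -- {day_index: ET / reference ET} on overpass days
    reference_et      -- daily reference ET (mm per day), indexed by day
    Returns a list of daily ET values (mm per day).
    """
    days = sorted(overpass_fraction)
    if not days:
        raise ValueError("at least one cloud-free overpass is required")
    daily_et = []
    for day, ref in enumerate(reference_et):
        pos = bisect_left(days, day)
        if pos == 0:
            # On or before the first overpass: hold its ratio (assumption).
            fraction = overpass_fraction[days[0]]
        elif pos == len(days):
            # After the last overpass: hold its ratio (assumption).
            fraction = overpass_fraction[days[-1]]
        else:
            before, after = days[pos - 1], days[pos]
            weight = (day - before) / (after - before)
            fraction = ((1 - weight) * overpass_fraction[before]
                        + weight * overpass_fraction[after])
        daily_et.append(fraction * ref)
    return daily_et


# Hypothetical case: overpasses on days 0 and 4, constant reference ET.
filled = interpolate_daily_et({0: 0.5, 4: 0.9}, [6.0] * 6)
```

Because the ratio changes linearly between overpasses, rapid green-up, harvest or irrigation events that occur during a long cloudy gap are smoothed out, which is one reason longer gaps tend to increase error.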

Examining results for specific crop classes, we found strong results for DisALEXI and SIMS over vineyards, and DisALEXI, SIMS and SSEBop over fruit and nut orchard sites—key targets for irrigation management in the Central Valley. Increasing the number of validation sites in orchards would help to address remaining modelling issues associated with this challenging canopy architecture. The USDA ARS-led Tree-crop Remote sensing of Evapotranspiration eXperiment (T-REX) is aimed at addressing this observational gap47.

All models, to varying degrees, have room for notable improvement in computation of ET in natural ecosystems. For example, most models systematically underestimate ET in drier ecosystems such as grasslands and shrublands and overestimate ET in evergreen forests. Incorporation of high-frequency and high-resolution visible and near-infrared data into the remote sensing models may improve their ability to capture phenological shifts particularly in arid/semi-arid regions, and agricultural systems in general48,49. Improvement of gridded meteorological model inputs50,51, land cover classification data and soils data52 may also lead to improved model performance in both natural ecosystems and in croplands. In particular, datasets compiled from agricultural weather stations and used to compute bias correction surfaces for reference ET could be re-evaluated to ensure reference surface compliance with the assumptions of the American Society of Civil Engineers Penman–Monteith equation53.

Future OpenET accuracy evaluations will target primary causes of error in ground ET measurements and RSET methods. Specific factors to consider include local advective impacts on modelled and measured ET, EC energy budget closure, local thermal contrast, ET reduction in deficit irrigated or rainfed systems, potential biases in gridded meteorological inputs to RSET models, and accurate capture of ET over sparsely cultivated landscapes. Comparisons with other well-established spatially mapped ET products such as MOD16 or FLUXCOM54 may provide further insights for operational global ET mapping at field scales (30–100 m). Comparisons against ET data computed from long-term water balance studies13,55 would help fill in gaps of spatial coverage in measured in situ ET across the western United States in hydrologically important but sparsely cultivated regions such as the Upper Colorado River Basin.

Conclusions

The OpenET platform provides spatially continuous ET data at 30-m resolution throughout the western United States. An intercomparison and accuracy assessment involved six satellite-based RSET models composing the current OpenET version, ensemble ET computed from the six models, and a well-documented benchmark eddy flux dataset from 152 stations located in the contiguous United States. Based on results from 59 cropland ET stations located in a variety of climatic regions, little systematic model bias was observed in croplands, and error metrics were within or near the targets set forth by OpenET partners including farmers, irrigation managers and water management agencies. The best accuracy metrics were associated with seasonal and annual timescales, and for crops in arid/semi-arid regions. The OpenET ensemble mean, with outlier removal, typically outperformed any individual model in terms of error statistics. Generally, no more than one model was identified as an outlier during growing season months over most agricultural regions in the western United States, and frequently no models were excluded. This finding highlights the substantial progress achieved so far in developing fully automated RSET modelling approaches that can be employed to map ET over large areas at field-scale resolution. The study identified paths for future targeted research and model improvement, and is intended to support the RSET research community in the development of increasingly robust and accurate RSET techniques. We are also hopeful that this assessment will provide added confidence to water resource managers, farmers, ranchers, scientists and other potential users of OpenET due to the high rigour and transparency of methods that were employed.

Methods

Flux data processing and footprint sampling

We used a curated benchmark eddy flux-based ET dataset19,20 and tools56 for use in this and subsequent evaluations of OpenET RSET models5. The rationale and decision-making steps for the collection and post-processing of flux data, as well as analyses of footprint sampling techniques and energy balance closure error within the dataset, are described in Volk et al.19,20. Data processing techniques for gap-filling and correction for energy balance closure error were conducted using open-source Python tools56 that enhance data provenance and reproducibility. Data were also subject to qualitative, visual-based data screening and filtering19,20. The final post-processed dataset consists of 161 stations, is public and includes daily and monthly ET and meteorological data, interactive graphics of such data for each station, and site information such as land use and Principal Investigator acknowledgements20. We note that nine stations in the dataset were not included in the statistical results presented here because they had data coverage that did not overlap with the data that could be developed for all six OpenET models. For example, not all models could be implemented from satellite imagery recorded before 2001 (ref. 5). Figure 1 shows a map of the 152 stations used in this accuracy assessment as well as their land cover types and Köppen–Geiger climate zones, and Supplementary Table 1 provides additional metadata for each station.

Data for the majority (106) of the flux stations in this study were downloaded from the AmeriFlux website, last accessed on 27 October 2020, and the remaining stations were retrieved from a variety of sources and Principal Investigators from university partners, the US Geological Survey, the US Department of Agriculture and others19. In addition to EC systems, four precision weighing lysimeters measuring cropland ET in Texas57 and seven high-quality Bowen Ratio instrumented sites, which measure ET in predominantly phreatophyte shrublands in Nevada20, were included in the dataset. Gap-filling of initial half-hourly fluxes of the four main energy balance components—latent, sensible and soil heat flux, and net radiation—was conducted using linear interpolation where gaps up to 2 h during the daytime or 4 h during nighttime were interpolated. If a given 24-h period still contained gaps then the daily average was not calculated and the daily flux value was left as a gap. After this initial gap-filling, fluxes were averaged to daily periods and energy balance closure correction was applied following the daily energy balance ratio approach defined by FLUXNET2015/ONEFlux19,29. The corrected daily latent heat flux, which is the energy consumed through ET, was used to calculate ET with an adjustment to the latent heat of vapourization for air temperature20. This closure-adjusted value is referred to as closed flux ET or measured ET in the main text and all statistical measures reported for OpenET models were against the energy balance corrected ET data. Daily ET gaps were subsequently filled using gridMET fraction of reference ET and gridMET grass reference ET19,20,58. To exclude flux stations with higher data uncertainty, only stations with mean daily energy balance closure of 0.75 or higher during the growing season and 0.6 or higher during the non-growing season were chosen for this intercomparison. 
Here, growing season periods were spatially mapped on the basis of a cumulative growing-degree-day and killing frost approach derived from long-term gridded climate data and are specific to each flux site19,58. The final dataset is similar to the recent FLUXNET2015 (ref. 29) release consisting of high-quality eddy flux station data that were subject to similar processing and correction techniques. The largest difference between the two datasets, in terms of daily latent heat flux estimates, results from different gap-filling procedures, where our approach is considered to be simpler and more conservative19,20,29.
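The half-hourly gap-filling, daily closure correction and conversion to ET described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the processing code of the cited tools56: the function names are hypothetical, the closure correction uses a single-day energy balance ratio rather than the full FLUXNET2015/ONEFlux procedure, and the latent heat of vaporization is approximated by a commonly used linear function of air temperature. With half-hourly data, the 2-h daytime and 4-h nighttime limits correspond to `max_gap` values of 4 and 8 samples.

```python
import numpy as np

def fill_short_gaps(values, max_gap):
    """Linearly interpolate interior NaN runs no longer than max_gap samples."""
    x = np.asarray(values, dtype=float)
    missing = np.isnan(x)
    filled = x.copy()
    if missing.all() or not missing.any():
        return filled
    idx = np.arange(x.size)
    filled[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    # Walk through NaN runs and restore those that are too long or touch an edge.
    start = None
    for i in range(x.size + 1):
        in_gap = i < x.size and missing[i]
        if in_gap and start is None:
            start = i
        elif not in_gap and start is not None:
            if (i - start) > max_gap or start == 0 or i == x.size:
                filled[start:i] = np.nan
            start = None
    return filled

def close_latent_heat(le, h, rn, g):
    """Scale daily LE by the inverse energy balance ratio, (Rn - G) / (LE + H)."""
    ebr = (le + h) / (rn - g)
    return le / ebr, ebr

def latent_heat_to_et(le_wm2, air_temp_c):
    """Convert mean daily LE (W m-2) to ET (mm per day)."""
    # Assumed linear approximation of the latent heat of vaporization (J kg-1).
    lam = (2.501 - 0.002361 * air_temp_c) * 1e6
    # 1 kg of water per square metre equals a depth of 1 mm.
    return le_wm2 * 86400.0 / lam
```

In this sketch a day whose fluxes still contain NaN values after `fill_short_gaps` would simply be skipped when daily means are formed, consistent with the rule that such days are left as gaps.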

Two approaches were used to estimate flux tower footprints or source area for tower pixel sampling of RSET imagery: (1) simple square ‘static’ pixel (Landsat 30 m) grids of 3 × 3, 5 × 5 and 7 × 7 drawn around station locations, and (2) two-dimensional, physically based flux source area estimations modelled from hourly meteorological data using the Kljun et al.59 approach, with hourly footprints converted to daily/monthly average footprint rasters weighted by reference ET19. The placement of the static grids was informed by high-resolution imagery to avoid inclusion of pixels of non-representative land cover (structures, roads and canals), and shifted slightly into the predominant wind direction as determined by long-term mean daytime windroses (built from data between 6:00 and 20:00 local time). Although the physically based and temporally dynamic footprints were preferred over the static footprints, only about half of the stations in the dataset had sufficient data for their production. Commonly, one or more input parameters to the Kljun et al.59 model, such as the standard deviation of the crosswind component of wind due to turbulence or friction velocity, were not available. A detailed description of parameter estimation, processing steps and the method used for creating weighted mean footprint images (using reference ET from NLDAS2 gridded weather data60) can be found in Volk et al.19. We also conducted a rigorous comparison of the intersection between source areas from the static grids of different sizes and the temporally dynamic footprints. The major finding was that the larger 7 × 7 grids tended to include substantially more of the dynamically defined footprint area than did the smaller grid sizes on average; however, the smaller 3 × 3 grids tended to overlap with pixels that were deemed part of the dynamic footprint on a more consistent basis.
Therefore, we decided to use the 7 × 7 grids for pixel sampling at most flux sites where a dynamic footprint could not be generated, with exceptions for sites with heterogeneous surroundings or with non-representative land cover nearby the station. For these sites, we used 5 × 5 or 3 × 3 grids to avoid giving equal weight to pixels of potentially different land cover that lie near the perimeter of the typical actual footprint area19.

Model data

The majority of the models that make up the OpenET ensemble are based on full or simplified implementations of the SEB approach. The SEB approach accounts for the energy used to transform liquid water in plants and soil into vapour that is released to the atmosphere. The SEB approach relies on satellite measurements of surface temperature and surface reflectance combined with other key land surface and weather variables to calculate components of the energy balance—net radiation, sensible heat flux, ground heat flux and latent heat flux. eeMETRIC8, geeSEBAL9 and DisALEXI7 compute each component of the energy balance using optical (that is, short-wave) and thermal (that is, long-wave) data, whereas SSEBop13 and PT-JPL10 are simplified approaches in which certain components of the energy balance are not calculated, or are calculated using a set of simplifying assumptions. SIMS11,12 relies on surface reflectance data, crop type information and a gridded soil water balance model to compute ET as a function of canopy density using a crop coefficient approach for agricultural lands.

The Google Earth Engine14 Python application programming interface was used to develop a workflow for sampling OpenET RSET model data at ET flux sites. Sampling of the daily and monthly RSET model data was performed at each site using a set of static (3 × 3, 5 × 5 and/or 7 × 7) and/or dynamic flux source-area footprints. Conditions for each of the extraction methods using static footprints were as follows: (1) daily ET from eeMETRIC, SIMS and SSEBop for sites outside of California was calculated as the product of the mean daily fraction of grass reference ET (EToF) produced by the models and the mean daily bias-corrected gridMET grass reference ET (ETo) (repeated for sites within California using daily CIMIS ETo, where CIMIS is more commonly used and depended upon in California); (2) daily ET from PT-JPL, geeSEBAL, and ALEXI/DisALEXI for all sites was computed as the spatial average of daily ET pixels produced by the models; (3) monthly ET from all RSET models for sites outside of California was calculated as the product of the mean monthly EToF and the mean monthly gridMET ETo (repeated for sites within California using the monthly CIMIS ETo). The process of extrapolating instantaneous data (time of overpass) to daily ET is an internal model calculation and differs for each model, and we refer readers to the individual model documentation and Melton et al.5 for details. Daily Landsat image pixels with cloud contamination are flagged on the basis of the CFMask-derived indicators61 in the pixel quality assurance band (QA_PIXEL) and those pixels are excluded. When computing monthly ET, all missing or masked daily ET pixels are filled by linearly interpolating between the nearest unmasked (cloud free) pixels in time within ±32 days.
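The temporal filling of masked values and the EToF-based calculation of ET can be illustrated for a single pixel (or footprint mean) time series. The sketch below is hypothetical and is not the OpenET implementation: the function names are ours, the input days are assumed to be sorted in ascending order, and we assume that a value is filled only when the nearest unmasked observations on both sides lie within 32 days.

```python
import numpy as np

def interpolate_in_time(days, values, max_days=32):
    """Fill NaN values by linear interpolation between the nearest valid
    observations, provided both neighbours are within max_days (assumption)."""
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    out = values.copy()
    if not valid.any():
        return out
    valid_days, valid_vals = days[valid], values[valid]
    for i in np.flatnonzero(~valid):
        before = valid_days[valid_days < days[i]]
        after = valid_days[valid_days > days[i]]
        if before.size and after.size:
            if days[i] - before[-1] <= max_days and after[0] - days[i] <= max_days:
                out[i] = np.interp(days[i], valid_days, valid_vals)
    return out

def et_from_fraction(mean_etof, mean_eto):
    """ET as the product of the mean fraction of reference ET and reference ET."""
    return mean_etof * mean_eto
```

For example, a masked value midway between unmasked EToF values of 0.2 and 0.6 observed 32 days apart would be filled with 0.4, whereas the same value would remain missing if the neighbouring observations were each 40 days away.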

Conditions for each of the extraction methods using dynamic footprints were as follows:

(1) daily ET from eeMETRIC, SIMS and SSEBop for sites outside of California was calculated by first multiplying the sampled daily EToF pixels produced by the models in the footprint by each daily flux footprint weight to obtain daily weighted EToF pixels, and summing all daily weighted EToF pixels to obtain mean daily weighted EToF, normalizing the mean daily weighted EToF by the sum of weights to account for times when the sum of weights did not equal 1 (for example, caused by cloud masking of pixels), and then multiplying the mean daily weighted EToF by the mean daily bias corrected gridMET ETo (replaced for sites within California using the daily CIMIS ETo);

(2) daily ET from PT-JPL, geeSEBAL and ALEXI/DisALEXI for all sites was calculated by multiplying the daily ET pixels by the daily flux footprint weights to obtain daily weighted ET pixels, summing all daily weighted ET pixels to obtain mean daily weighted ET, and then normalizing the mean daily weighted ET by the sum of weights, and

(3) monthly ET from all RSET models for sites outside of California was calculated by first multiplying the monthly EToF pixels by the monthly flux footprint weights to obtain monthly weighted EToF pixels, summing all monthly weighted EToF pixels to obtain mean monthly weighted EToF, normalizing the mean monthly weighted EToF by the sum of weights, and then multiplying the mean monthly weighted EToF by the mean monthly bias-corrected gridMET ETo (replaced for sites within California using the monthly CIMIS ETo).
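The footprint weighting and renormalization steps listed above reduce to a weighted mean over the unmasked pixels. A minimal sketch follows, assuming masked pixels are stored as NaN and that the pixel and weight arrays share the same shape; the function name is hypothetical.

```python
import numpy as np

def footprint_weighted_mean(pixels, weights):
    """Weighted mean of unmasked pixels, renormalized by the sum of the
    weights that remain after masking."""
    pixels = np.asarray(pixels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(pixels)
    weight_sum = weights[valid].sum()
    if weight_sum == 0:
        return float("nan")  # no usable pixels inside the footprint
    return float((pixels[valid] * weights[valid]).sum() / weight_sum)

# Example: one of four EToF pixels is cloud masked, so the remaining
# weights sum to 0.7 and the weighted sum is divided by 0.7.
etof = [[0.8, np.nan], [0.6, 0.4]]
weights = [[0.4, 0.3], [0.2, 0.1]]
daily_et = footprint_weighted_mean(etof, weights) * 5.0  # ETo of 5 mm per day
```

For models that output ET directly, the same function would be applied to the ET pixels without the final multiplication by ETo.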

Additional processing was required after extracting the daily ET when duplicate days of data were extracted at select sites due to overlapping Landsat paths. Occasionally a site would lie within the footprints of two overlapping Landsat scenes, resulting in more than one ET value on a given overpass date. To obtain single daily ET values for the site, the daily weighted mean ET for each day was computed using the pixel count (that is, number of pixels used when deriving the respective spatial mean ET value) as the weight. ET pixel counts were occasionally less than the grid/footprint total because of the removal of poor-quality pixels (for example, cloud masking).
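The merging of duplicate overpass dates can be sketched as a pixel-count-weighted mean per date. The record layout and function name below are assumptions made for illustration only.

```python
def merge_duplicate_days(records):
    """Combine multiple (date, mean_et, pixel_count) records per date into a
    single ET value, weighting each record by its pixel count."""
    totals = {}
    for date, mean_et, pixel_count in records:
        weighted_sum, count = totals.get(date, (0.0, 0))
        totals[date] = (weighted_sum + mean_et * pixel_count, count + pixel_count)
    return {date: s / n for date, (s, n) in totals.items() if n > 0}

# Two overlapping scenes on the same date: a full 7 x 7 grid (49 pixels)
# and a partly cloud-masked grid (21 pixels).
merged = merge_duplicate_days([
    ("2019-07-01", 5.0, 49),
    ("2019-07-01", 4.0, 21),
    ("2019-07-17", 3.0, 49),
])
```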

Ensemble computation

The ensemble mean of the six OpenET models was computed after removing up to two outlier models based on the median absolute deviation (MAD)23,24, a robust measure of spread that is suitable for small samples. The outlier removal occurs at the pixel level for each ET image generated. To identify outliers for a single scene, first the median value and the MAD from the median are computed as

$${{\mathrm{MAD}}}=b\times {{\mathrm{median}}}\left(\left|{X}_{i}-{{\mathrm{median}}}\left(X\,\right)\right|\right),$$

where Xi is the ET value for model i and X is the full set of all six models’ ET estimates. Here, b is a scalar set to 1.483, and it was derived on the basis of the assumption of normality of the sample population62. This approach is sometimes referred to as the MADe rule, where e = 1.483. The MAD value is typically scaled by 2, 2.5 or 3 on the basis of a subjective assessment of the data, and the scaled value is then used to create a band around the median; here we used a factor of 2:

$${{\mathrm{median}}}\left(X\,\right)\pm 2{{\mathrm{MAD}}}.$$

Model estimates that fall outside the band are deemed outliers, and up to two outliers (those furthest from the median) are removed from the set of model estimates before taking the ensemble mean.

Due to the tendency for some OpenET models to predict zero ET or even negative ET rates in some arid regions during dry periods, we modified the above approach for these scenarios. Specifically, when the ensemble median estimate is zero but at least one model predicts a positive ET rate, the ensemble mean is taken to include that value without any prior outlier removal. In these cases, the outlier removal would result in removing the model estimates that are positive and, although actual ET may be negligible, a zero estimate is not considered to be physically realistic. However, in these scenarios, because the majority of models may predict zero, the ensemble mean will also be highly skewed towards zero, making this a conservative measure to prevent zero ensemble estimates.
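The outlier rule and the zero-median exception described above can be illustrated for the six model estimates at a single pixel. This is a sketch under stated assumptions rather than the operational OpenET code: the function name and argument names are hypothetical, and ties in distance from the median are broken arbitrarily.

```python
import numpy as np

def ensemble_mean(estimates, b=1.483, scale=2.0, max_outliers=2):
    """Mean of model estimates after MAD-based removal of up to max_outliers."""
    x = np.asarray(estimates, dtype=float)
    med = np.median(x)
    # Exception: zero median with at least one positive estimate, so no removal.
    if med == 0 and np.any(x > 0):
        return float(x.mean())
    distance = np.abs(x - med)
    mad = b * np.median(distance)
    outliers = np.flatnonzero(distance > scale * mad)
    # Drop only the outliers furthest from the median, up to max_outliers.
    furthest = outliers[np.argsort(distance[outliers])[::-1][:max_outliers]]
    keep = np.ones(x.size, dtype=bool)
    keep[furthest] = False
    return float(x[keep].mean())

# Five models agree near 100 mm and one predicts 150 mm. The median is 100.5,
# MAD = 1.483 * 1.5 = 2.2245, and the band is 100.5 +/- 4.449, so only the
# 150 mm estimate is removed.
typical = ensemble_mean([100, 102, 98, 101, 99, 150])

# Zero-median case: all six values are averaged without outlier removal.
arid = ensemble_mean([0, 0, 0, 0, 0.3, 0.6])
```

Without the exception, the second case would have a MAD of zero, both positive estimates would fall outside the band, and the ensemble mean would be exactly zero.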

Statistical analyses

Key summary statistics including the least squares linear regression slope forced through the origin (slope) as well as linear regression with an intercept (Supplementary Table 7), MBE, MAE, RMSE and the coefficient of determination (r2) were computed using paired observations between OpenET model ET estimates and post-processed and corrected flux ET estimates19. Daily accuracy statistics were not compared against any gap-filled station ET data, and monthly statistics only used station ET with 5 or fewer gap-filled days per month. Growing season and annual evaluations used paired monthly data and did not include any periods with monthly gaps. Also, the number of paired observations was always the same among models for all statistical analyses.

All statistics were calculated on a site-by-site basis from paired model–measured ET with the Python NumPy package version 1.17.2 (ref. 63). For linear regression, we used the NumPy linalg.lstsq algorithm, which applies the least squares approach. We used the modelled ET as the dependent variable and the measured ET as the independent variable.
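As an illustration of the two regression forms, the following sketch fits the slope forced through the origin and the slope with an intercept using `numpy.linalg.lstsq`. The wrapper function names are hypothetical; only the NumPy call itself is taken from the text above.

```python
import numpy as np

def slope_through_origin(measured, modelled):
    """Least squares slope of modelled on measured ET with no intercept."""
    x = np.asarray(measured, dtype=float)[:, np.newaxis]  # single-column design
    y = np.asarray(modelled, dtype=float)
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return float(coef[0])

def fit_with_intercept(measured, modelled):
    """Least squares slope and intercept of modelled on measured ET."""
    x = np.asarray(measured, dtype=float)
    y = np.asarray(modelled, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])  # columns: slope, intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0]), float(coef[1])
```

A slope through the origin above 1 indicates that a model tends to overestimate measured ET, and a slope below 1 indicates underestimation.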

The MBE was calculated as

$${{\mathrm{MBE}}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left({{P}}_{{i}}-{{O}}_{{i}}\right),$$

where Oi is the observed ET, Pi is the model predicted ET and n is the total number of paired model–measured ET data points.

The MAE was calculated as

$${{\mathrm{MAE}}}=\frac{1}{n}\sum_{i=1}^{n}{\rm{|}}{{P}}_{{i}}-{{O}}_{{i}}{\rm{|}},$$

and the RMSE was calculated as

$${{\mathrm{RMSE}}}=\sqrt{\mathop{\sum }\limits_{i=1}^{n}\frac{{({{P}}_{{i}}-{{O}}_{{i}})}^{2}}{{n}}}.$$

Here, r2 values were calculated as the square of the Pearson correlation coefficient, which was calculated from paired model–measurement ET data using the Python statsmodels package, version 0.12.1 (ref. 64).
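The error metrics defined above can be computed from paired arrays as in the following sketch. For simplicity it obtains the Pearson correlation from `numpy.corrcoef` rather than from statsmodels; the function name is hypothetical.

```python
import numpy as np

def error_metrics(predicted, observed):
    """MBE, MAE, RMSE and squared Pearson correlation for paired ET values."""
    p = np.asarray(predicted, dtype=float)
    o = np.asarray(observed, dtype=float)
    diff = p - o  # positive values mean the model overestimates
    r = np.corrcoef(p, o)[0, 1]
    return {
        "MBE": float(diff.mean()),
        "MAE": float(np.abs(diff).mean()),
        "RMSE": float(np.sqrt((diff ** 2).mean())),
        "r2": float(r ** 2),
    }

metrics = error_metrics([2.0, 4.0, 6.0], [1.0, 5.0, 6.0])
```

In this example the positive and negative errors cancel, so the MBE is zero while the MAE and RMSE are not, which is why bias and error magnitude are reported separately.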

For grouping statistics by land cover or climate zone we used two methods: (1) for the computation of linear regression and r2 all data from each ground observation in a group (for example, monthly paired model–station ET estimates for annual crop stations) were pooled together before computing a single statistic per model; and (2) MBE, MAE and RMSE were computed separately for each ground station, and then a weighted mean was taken. Grouped statistics were weighted by the square root of the number of paired observations per station (n); the rationale is to avoid giving too much weight to stations with excessively long data records while also not giving equal weight to stations with short data records65. We also imposed data length requirements for in situ ET stations: to be included in daily grouped mean statistics we required stations to have a minimum of six paired station–model data points, and a minimum of three paired observations for inclusion in monthly grouped mean statistics. We note that Melton et al.5 presented similar statistical metrics from a subset of cropland sites used in this study, and in that study, the linear regression slope and r2 metrics did incorporate weighting, which we deemed inappropriate or unnecessary in this study. For congruency, the statistics computed in the same manner as in Melton et al.5 are provided in Supplementary Table 12.
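The second grouping method can be sketched as a weighted mean of per-station statistics, with weights equal to the square root of the number of paired observations and a minimum record length for inclusion. The function name and the default threshold argument are illustrative assumptions.

```python
import numpy as np

def grouped_weighted_mean(station_stats, n_pairs, min_pairs=3):
    """Mean of per-station statistics weighted by sqrt(n), excluding stations
    with fewer than min_pairs paired observations."""
    stats = np.asarray(station_stats, dtype=float)
    n = np.asarray(n_pairs, dtype=float)
    keep = n >= min_pairs
    if not keep.any():
        return float("nan")
    weights = np.sqrt(n[keep])
    return float(np.sum(weights * stats[keep]) / np.sum(weights))

# Stations with 4 and 16 paired months receive weights of 2 and 4; the third
# station has only 2 pairs and is excluded from the monthly grouped mean.
group_mae = grouped_weighted_mean([10.0, 20.0, 100.0], [4, 16, 2])
```

With square-root weighting, a station with four times as many observations receives twice the weight rather than four times the weight.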

A post hoc Tukey test, also known as the honestly significant difference test, was used to compare multiple mean ET estimates from each model, the ensemble mean, and from the mean of the unclosed and closed flux ET data. The test was applied using all paired data from cropland stations, including for crop subgroups: annual crops, orchards and vineyards, at daily, monthly, growing season and annual timescales. The family-wise error rate was set to 0.05 and the test was performed using the Python statsmodels package, version 0.12.1 (ref. 64).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.