Main

Monitoring variations in the global water cycle is crucial for understanding the Earth’s climate system1. The long-term trends in ice-sheet melting2 and freshwater availability3 can be investigated by analysing the water cycle over decades, while the short-term variations of the water cycle contain valuable information for monitoring natural hazards such as flood4 and drought5 events. To quantify variations in the global water cycle, total water storage (TWS), defined as the storage in all forms of water, has been adopted as an essential climate variable6. For decades, TWS has mainly been estimated by simulations from global hydrological models, including global hydrology and water resource models and land surface models7. The hydrological models can provide spatial variance and short-term temporal variations but struggle to provide reliable long-term trends, which indicate the climate and human-induced changes in water storage8. Since 2002, the Gravity Recovery and Climate Experiment (GRACE) and its follow-on (GRACE-FO) missions (hereafter GRACE) have provided us with a unique opportunity to monitor the changes in global TWS anomalies (TWSAs) by measuring gravity field variations9,10,11. The satellite-measured TWSAs have unprecedented accuracy with global coverage due to the physical measurement principle and provide valuable information about the Earth’s climate system from a macro perspective1,12,13,14.

Although GRACE products have been widely used, their coarse spatial resolution of about 3° is among the key factors limiting their application in related fields, such as understanding water storage changes in small catchments14. The low spatial resolution has two main origins. First, the design of the orbit and the accuracy of the instruments inherently limit the possible spatial resolution15,16. Second, postprocessing approaches are needed to obtain meaningful signals17,18, but they also attenuate the actual geophysical signals, especially the high-frequency signals19,20,21. As a result, reconstructing these high-frequency geophysical signals is vital. To improve the resolution, incorporating additional information with higher spatial resolution is necessary. In the specific task of downscaling GRACE TWSAs, the most promising high-resolution information is contained in hydrological models and measurements. The hydrological models can directly simulate TWSAs, whereas other hydrological parameters such as precipitation provide valuable auxiliary information by considering the water balance14,22. Many studies have proven the feasibility of assimilating data to downscale GRACE products over specific regions23,24,25,26, but few of them applied the methods successfully on a global scale. The downscaling algorithms with global generalizability so far can provide GRACE TWSAs at 0.5° using partial least squares regression27 or an ensemble Kalman filter assimilation pipeline28 but still have some deficiencies, such as insufficient preservation of intrabasin variability or of interbasin mass conservation.

In recent years, deep learning has progressed rapidly and shown remarkable potential in modelling the Earth system29,30,31. Many studies investigated the potential of applying deep learning or classical machine learning approaches to downscale GRACE measurements in a supervised learning context32,33. The main challenge is the need for high-resolution ground truth of TWSAs, which is inaccessible. Therefore, the studies usually generate the training pairs by downsampling the high-resolution hydrological simulations to the same resolution as GRACE. Under the assumption that the relationship between predictors and targets holds in different resolution domains, they apply the trained model to the original high-resolution hydrological simulations to obtain the downscaled GRACE predictions with the same resolution34,35,36. Another way to deal with the lack of ground truth is generating GRACE-like TWSAs by applying a Gaussian smoother to the hydrological simulations and training the model on synthetic data pairs37. In this context, the necessary assumption is that the captured relationship also holds for the real GRACE measurements. However, all the aforementioned studies applied their methods only at local to continental scales, which indicates the difficulty of applying the proposed deep learning methods globally owing to their inherent generalization challenges38. This study contributes to the downscaling problem and solves the two mentioned inadequacies of the existing deep learning algorithms, namely relieving assumptions between different domains and providing global generalizability. First, we developed a loss function based on the average deviations between the outputs and the GRACE TWSAs over a certain area and the similarity between the outputs and WaterGAP Hydrology Model (WGHM)39 simulations (Fig. 1; Methods). Since the GRACE and WGHM TWSAs are part of the inputs, we do not need extra labels or certain assumptions for generating synthetic training pairs. 
As a result, the network parameters can be optimized in a self-supervised manner without explicit assumptions bridging the high–low spatial resolution domains or simulation–measurement domains. Second, our downscaled TWSAs inherit the global generalizability from both GRACE and WGHM TWSAs and cover all the land areas, including coastal areas and small islands, except for Greenland and Antarctica due to the deficiency of hydrological models39,40. Our model is based on the principle of convolutional neural networks41, allowing us to consider spatial correlations between individual cells. Our analysis shows the impressive performance of the proposed algorithm in providing high-resolution TWSAs, which represent the high-resolution structures while keeping accurate mass conservation on the basin scale. Therefore, the water balance equation can be better closed in the basins smaller than the GRACE-effective resolution. Ultimately, we also discuss the potential usages of the obtained high-resolution TWSA product for analysing the impacts of climate change and anthropogenic activities on a local scale, as well as for natural hazard monitoring.

Fig. 1: The structure of the designed deep learning model in this study with an enlarged structure of the used residual blocks.

The 2D convolutional layers and upsampling layers with bilinear interpolation are denoted by Conv2D and Upsampling2D. The kernel size of the 2D convolutional layers is denoted by k, whereas the stride is denoted by s. The input features go through the encoder–decoder structure to generate the predictions with the same size, which will be compared to the GRACE and WGHM TWSAs to compute the loss function. Therefore, the optimizing process is self-supervised.

Global downscaled TWSA product with uncertainties

We determined a global high-resolution TWSA product from April 2002 to December 2019, covering all global land areas except for Greenland and Antarctica. To provide uncertainty information, we combined probabilistic deep learning principles and Monte Carlo simulations to estimate the uncertainties using deep ensembles42. An example of the highly resolved TWSA product is shown in Fig. 2 with six major river basins enlarged. On the global scale, the seasonal changes in water storage are consistent with the ones observed by the GRACE measurements (Supplementary Videos 1 and 2). The regional maps demonstrate the visibility of the main river systems with refined details, which are inherited from the WGHM simulations. Moreover, the results are visually smoother than the WGHM TWSAs, indicating effective noise reduction. The reduction of the outliers is not only beneficial to reducing the amount of abnormal pixels but also helpful for gaining accurate values of the other neighbouring pixels. Once the outliers are reduced, the magnitudes of the neighbouring pixels with actual signals are calibrated by considering the constraint of agreements with GRACE TWSAs over an area larger than the GRACE-effective resolution16. Overall, the downscaled TWSAs have a global median uncertainty of 7.3 mm. The regions where water storage changes rapidly are usually accompanied by relatively large uncertainties due to their higher TWSA values, such as the mainstreams shown in Fig. 2.

Fig. 2: The downscaled TWSAs (top) and their uncertainties (bottom) in August 2008.

The data are provided in the format of equivalent water height (EWH) with a spatial resolution of 0.5°. a–f, Six major river basins are shown with enlarged details: Amazon (a), Mississippi (b), Congo (c), Lena (d), Yangtze (e) and Nile (f). The regions without valid information are shaded. Note the different spatial scales for the enlarged images for better visualization.

High-resolution details with large-scale mass conservation

Since accurate high-resolution TWSA measurements on a global scale are inaccessible, we cannot directly evaluate the quality of the downscaled product. Therefore, we rely on the GRACE measurements and WGHM simulations in different contexts. First, we compared the downscaled TWSAs with the WGHM simulations to study whether the high-resolution structures are sufficiently reconstructed. To achieve this goal, we considered each pixel over the whole time span as a time series and computed the pixel-wise Pearson correlations between downscaled and WGHM TWSAs. The impact of the inaccurate magnitudes in the WGHM TWSAs is reduced since the Pearson correlation is invariant under changes in scale. As shown in Fig. 3a, the overall correlation is high with a median value of 0.80, meaning an improvement of 51% compared to GRACE TWSAs (0.53). The relatively low correlations are mainly found in arid regions such as the north of Africa, the Middle East and the middle of Asia. These low correlations are understandable due to weak hydrological signals in these arid regions since GRACE and WGHM are not sufficiently sensitive to accurately measure or simulate them. Furthermore, comparisons with independent satellite altimetry measurements show that the high-resolution information of the downscaled product is beneficial (Supplementary Table 1 and Supplementary Fig. 5).
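The pixel-wise comparison described above can be sketched as follows. This is a minimal illustration rather than the processing code of the study: the function names, the nested-list grid layout and the toy values are assumptions, and plain Python lists are used only to keep the sketch free of dependencies. It also shows why a wrong magnitude in one product does not lower the correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pixelwise_correlation(stack_a, stack_b):
    """Correlate two stacks of monthly grids, indexed as stack[t][i][j].

    Each pixel is treated as a time series over all months, and one
    correlation value per pixel is returned.
    """
    rows, cols = len(stack_a[0]), len(stack_a[0][0])
    return [
        [
            pearson([g[i][j] for g in stack_a], [g[i][j] for g in stack_b])
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Toy data (hypothetical): 12 months on a 1 x 2 grid.
wghm = [
    [[50.0 * math.sin(2 * math.pi * t / 12), 10.0 * math.cos(2 * math.pi * t / 12)]]
    for t in range(12)
]
# Pixel 0: same temporal pattern with doubled amplitude and an offset.
# Pixel 1: sign-flipped pattern.
downscaled = [[[2.0 * g[0][0] + 5.0, -g[0][1]]] for g in wghm]

corr = pixelwise_correlation(downscaled, wghm)
```

In this toy case the first pixel keeps a correlation of 1 despite the changed amplitude and offset, whereas the second pixel, whose temporal pattern is inverted, has a correlation of −1.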

Fig. 3: Evaluation of the downscaled TWSAs from 2002 to the end of 2019.

a, The global pixel-wise Pearson correlation with WGHM simulations. b, The basin-wise RMSE compared to GRACE measurements. The regions without valid information are shaded.

To evaluate the basin-wise quality of the downscaled TWSAs, we rely on the GRACE measurements since the GRACE TWSAs are considered to be accurate over their effective resolution16,43. We first averaged the GRACE and downscaled TWSAs of each individual basin to generate basin-wise time series and then computed the root mean square errors (RMSEs) between these two types of time series. The results are shown in Fig. 3b for 288 basins globally. The RMSEs are lower than 30 mm in most of the land areas, resulting in a global average RMSE of 21.9 mm weighted by basin areas. This value demonstrates the quality of the downscaled TWSAs since the typical GRACE uncertainties are 20–30 mm (ref. 15). Compared to WGHM simulations (weighted RMSE of 49.2 mm), our method provides an improvement of around 56%. Further analysis reveals that certain basins exhibit relatively high RMSEs, such as the basins in the glaciated areas of Alaska. Insufficient modelling of glaciers and ice sheets is a known issue of the hydrological models8,39. Since we trained one neural network for the whole globe, the network cannot handle the substantial differences between the hydrological simulations and GRACE measurements in these specific regions because the issue is negligible in the other regions, which constitute the major part of the samples. Therefore, we consider this issue as a trade-off between generalizability and performance in specific areas.
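The basin-wise evaluation can be illustrated with a short sketch. It assumes that the basin-average time series have already been computed; the basin names, areas and values below are hypothetical.

```python
import math


def rmse(a, b):
    """Root mean square error between two equally long time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def area_weighted_mean(values, areas):
    """Mean of per-basin values, weighted by basin area."""
    return sum(v * w for v, w in zip(values, areas)) / sum(areas)


# Hypothetical basin-average TWSA time series in mm, with relative areas.
basins = {
    "A": {"area": 3.0, "grace": [10.0, 20.0, 30.0], "ds": [12.0, 18.0, 30.0]},
    "B": {"area": 1.0, "grace": [0.0, 0.0, 0.0], "ds": [4.0, -4.0, 4.0]},
}

basin_rmse = {name: rmse(b["ds"], b["grace"]) for name, b in basins.items()}
global_rmse = area_weighted_mean(
    [basin_rmse[name] for name in basins],
    [basins[name]["area"] for name in basins],
)
```

Weighting by area means that the large basin A dominates the global value, so a poorly matched small basin such as B raises the global RMSE only moderately.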

Long-term trends and seasonal variations

To understand the performance of the proposed deep learning model on the basin scale in more detail, we plot the time series of the average TWSAs over the six selected major basins (Fig. 4). In the Amazon basin, dominant seasonal signals are relatively homogeneous. Both downscaled and WGHM TWSAs agree well with GRACE measurements, although the WGHM simulations tend to underestimate the amplitudes. For the basins where average TWSAs are less stationary and with more inhomogeneous variations (Mississippi, Congo, Lena and Nile), the downscaled TWSAs agree better with the GRACE observations than the WGHM simulations. These phenomena may be related to the known limitation of hydrological simulations, as they cannot fully capture the trends in water changes8. For example, the downscaled TWSAs successfully model the steady decrease in the Lena basin after 2007 and the increase in the Nile basin since 2010, whereas the WGHM time series do not represent these trends properly. However, both downscaled and WGHM TWSAs underestimate the increasing trend in the Yangtze River basin after 2010. A potential reason is the active anthropogenic impacts in this basin, such as construction of dams44. Although the WGHM model considers human intervention, it may perform unsatisfactorily in specific regions and result in a relatively big disparity from GRACE measurements39. As a result, the typical relationship captured by our model based on global data may not be the best solution in these regions, which is again a trade-off between generalizability and performance in specific areas. Conversely, the good agreements between the downscaled TWSAs and the GRACE measurements in the Mississippi and Congo basins demonstrate the performance of our model.

Fig. 4: The basin-wise average TWSA time series (mm) of six selected major river basins for the whole studied time span from 2002 to the end of 2019.

a–f, Amazon (a), Mississippi (b), Congo (c), Lena (d), Yangtze (e) and Nile (f). Our downscaled product is abbreviated as DS. Note the different y axis ranges for different basins.

To quantify the performance of our method in retaining long-term trends and annual and semi-annual signals, we estimated these signals from the three TWSA types over 160 basins larger than 200,000 km² and show the results in Fig. 5a. Here, we set the spatial threshold of 200,000 km² to obtain more reliable GRACE estimations as reference16,39,43. The correlation between WGHM- and GRACE-derived trends is only 0.47, showing the major limitation of the hydrological models in capturing long-term trends. Our method substantially improves this situation and reaches a correlation of 0.94. This improvement reveals the effectiveness of the proposed algorithm for data assimilation. The network has learned to rely on the GRACE measurements to calibrate the average magnitudes over an area larger than the GRACE-effective resolution. Therefore, the trends contained in the GRACE measurements have been successfully transferred to the downscaled TWSAs. The largest trend differences are found in the glaciated basins of Alaska and their neighbouring basins (Fig. 5b), which are again related to the insufficient modelling of glaciers and the resulting leakage errors. The performance of WGHM in estimating annual and semi-annual signals is clearly better than in estimating trends, with correlations of 0.83 for both. Nevertheless, our method still outperforms it and reaches correlations of 0.97 and 0.95, respectively. The results prove the realistic temporal changes in our downscaled TWSAs on the basin scale, even though we treated every month separately and did not explicitly feed any temporal information to the model. Moreover, the phases of the annual signals of our downscaled product agree well with those measured by GRACE (Fig. 5c), with the only major differences occurring in the north of Africa where the hydrological signals are weak and the phases are therefore ambiguously defined. 
It is a remarkable improvement compared to WGHM simulations, which suffer from phase shifts of 2 or 3 months for many of the basins globally (Supplementary Figs. 3 and 4).
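The text does not state how the trend, annual and semi-annual signals were estimated. A common choice, assumed here purely for illustration, is an ordinary least-squares fit of an offset, a linear trend and two harmonics to each basin-average time series. All function names and the synthetic series are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def design_row(t):
    """Offset, trend, annual and semi-annual terms for time t in years."""
    w = 2.0 * math.pi * t
    return [1.0, t, math.cos(w), math.sin(w), math.cos(2 * w), math.sin(2 * w)]


def decompose(times, values):
    """Least-squares estimate of the trend and the two harmonic amplitudes."""
    rows = [design_row(t) for t in times]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(k)]
    p = solve(ata, aty)
    return {
        "trend": p[1],
        "annual_amplitude": math.hypot(p[2], p[3]),
        "annual_phase": math.atan2(p[3], p[2]),
        "semiannual_amplitude": math.hypot(p[4], p[5]),
    }


# Synthetic monthly series (mm): trend -3 mm/yr, annual 40 mm, semi-annual 8 mm.
times = [(m + 0.5) / 12.0 for m in range(216)]  # years since the first sample
values = [
    5.0 - 3.0 * t + 40.0 * math.cos(2 * math.pi * t - 1.0)
    + 8.0 * math.sin(4 * math.pi * t)
    for t in times
]
fit = decompose(times, values)
```

Times are expressed in years relative to the first sample to keep the normal equations well conditioned. Comparing `annual_phase` between two products gives a phase shift of the kind mapped in Fig. 5c.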

Fig. 5: Temporal decomposition of the basin-wise signals for the 160 basins larger than 200,000 km².

a, The long-term trend, annual and semi-annual amplitudes estimated from the high-resolution products (y axis) versus GRACE measurements (x axis). b, Map of basin-wise trend differences (downscaled − GRACE). c, Map of basin-wise phase shifts (downscaled − GRACE). The regions without valid information are shaded.

Closing water balance equation beyond the GRACE resolution

The downscaled TWSAs are beneficial for better closing the water balance equation in regions smaller than the GRACE-effective resolution. Figure 6 depicts the agreement between water changes inferred from the downscaled TWSAs and those computed from ERA5-Land water budget components45 in level-4 basins46. The downscaled TWSAs show a reasonable ability to close the water balance equation globally, with a positive Nash–Sutcliffe efficiency (NSE)47 in 83% of the studied area, whereas GRACE and WGHM have a positive NSE in 77% and 75%, respectively. Most of the negative values come from the arid regions where the hydrological signals are weak. The improvements of downscaled TWSAs compared to the original GRACE product strongly correlate with the basin sizes (Fig. 6b). The average improvement in NSE given by the downscaled TWSAs is 0.13 for basins larger than the GRACE-effective resolution (200,000 km²)39, 0.21 for basins between the effective resolution and the limiting resolution (63,000 km²)16 and 1.21 for basins smaller than 63,000 km². Conversely, the benefits compared to WGHM TWSAs do not strongly correlate with basin sizes but probably come from the more accurate values obtained from data assimilation.
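The closure test can be sketched with the standard NSE definition, which is 1 for a perfect match and 0 when the estimate is no better than the mean of the reference. Two details are assumptions made for illustration and are not taken from the text: the storage change is approximated by a simple month-to-month difference, and the budget-derived change is used as the reference series. The values are hypothetical.

```python
def nse(reference, estimate):
    """Nash-Sutcliffe efficiency of an estimate against a reference series."""
    mean_ref = sum(reference) / len(reference)
    residual = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    spread = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - residual / spread


def storage_change(twsa):
    """Month-to-month difference of a TWSA series (assumed discretization)."""
    return [later - earlier for earlier, later in zip(twsa, twsa[1:])]


def budget_change(p, et, r):
    """Storage change implied by the water balance: P - ET - R."""
    return [pi - ei - ri for pi, ei, ri in zip(p, et, r)]


# Hypothetical basin-average values in mm per month.
twsa = [0.0, 30.0, 50.0, 40.0, 10.0]
precipitation = [80.0, 70.0, 30.0, 20.0]
evapotranspiration = [40.0, 40.0, 30.0, 40.0]
runoff = [10.0, 10.0, 10.0, 10.0]

reference = budget_change(precipitation, evapotranspiration, runoff)
closure_nse = nse(reference, storage_change(twsa))

# An estimate equal to the mean of the reference scores exactly zero.
mean_only = [sum(reference) / len(reference)] * len(reference)
baseline_nse = nse(reference, mean_only)
```

A negative NSE therefore means that the TWSA-derived change matches the budget worse than a constant would, which is consistent with the weak signals reported for arid regions.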

Fig. 6: Agreements between total water storage changes inferred from downscaled TWSAs and those computed from water budget components.

a, Basin-wise NSE of downscaled TWSAs in level-4 basins. The regions without valid information are shaded. b, Differences in NSE (left axis) between downscaled and GRACE TWSAs sorted by basin area (right axis). Positive values indicate a better performance of the downscaled TWSAs. c, The same as b but compared with WGHM TWSAs.

Discussion

The downscaled TWSAs provide special insight for studying the climate and anthropogenic impacts locally, which enables the development of targeted strategies for sustainable management of water resources. Figure 7a shows the comparison of the derived trends from the three sources of TWSAs. The high-resolution signals in the downscaled trends are inherited from WGHM simulations and their values are calibrated by considering the agreement with GRACE measurements on the effective resolution. For example, the downscaled trends clearly show the three hot spots of groundwater depletion in the United States (the High Plains aquifer, the Mississippi embayment and the Central Valley of California)48,49. Among them, the notable negative trends in the High Plains aquifer are not observable in GRACE TWSAs since the positive trends caused by progress from dry to wet periods3 on the neighbouring pixels average them out. GRACE TWSAs indicate negligible to positive trends mainly caused by increasing precipitation in central and southern India3,50, which do not fully represent the remarkable water storage declines in the regions with high population density and groundwater irrigation51. Caveat: we note that the downscaled product generally depends on the quality of used hydrological simulations. Prominent trends in the simulations will impact downscaled trends to a certain extent. We should interpret these signals with care since relatively larger uncertainties are expected as a result of imperfections in the hydrological simulations.

Fig. 7: Comparison of three high-level products derived from three different sources of TWSAs.

The columns from left to right show the indices derived from downscaled, GRACE and WGHM TWSAs. The top row shows the pixel-wise TWSA trends from 2002 to 2019 in the form of EWH (mm yr−1) after removing the annual and seasonal variations. The middle row shows the pixel-wise maximum FPI of 2008. The numbers indicate the risk of potential flooding events, with a large number indicating relatively high risk and vice versa. The bottom row shows the monthly DSI for August 2008. The 11 categories from red to blue indicate exceptional drought, extreme drought, severe drought, moderate drought, abnormally dry, near normal, slightly wet, moderately wet, very wet, extremely wet and exceptionally wet.

The downscaled TWSAs enable multiple downstream applications, including flood and drought monitoring on a local scale. To study the potential benefits of our downscaled TWSAs, we computed two well-known TWSA-based indices: the flooding potential index (FPI)4 and the drought severity index (DSI)5. We report the maximum FPI of the year 2008 in Fig. 7b to show the most notable signals in a single plot, whereas Fig. 7c depicts the monthly DSI of August 2008. FPI is sensitive to the value range of TWSAs since it relies on the relative relationship between the storage deficit and accumulated precipitation. The outliers present in WGHM simulations can distort this relationship and cause unrealistic flood potential, resulting in an FPI map with far more high-risk regions than the GRACE-derived FPI. Our proposed method clearly ameliorates this issue by suppressing the outliers, which allows us to obtain a realistic high-resolution FPI with a reasonable visual agreement with the GRACE-derived FPI, such as in the Congo basin and along the eastern coastline of North America. Similarly, the abnormal values in the WGHM simulations may cause opposite categories in DSI, resulting in noisy patterns (north of Africa) or abnormally underestimated severity (Australia). Again, the DSI derived from the downscaled TWSAs agrees better with the GRACE-derived one and opens the window to monitoring extreme environmental events with higher spatial resolution. However, the environmental monitoring indices derived from downscaled TWSAs inherit the same limitations that GRACE TWSAs have52, since the downscaled TWSAs only provide higher spatial resolution but do not provide longer observations.

Our current approach still has some limitations, which reveal the potential for further improving high-resolution TWSAs. First, more effort should be put into modelling the glaciers by considering additional measurements or specific models. Second, human intervention modelling is sophisticated and may need specific modifications. Including population, farming area and water usage statistics in the deep learning model may provide better results. Last, deep learning models have the potential to consider constraints based on the interactions between different forms of water, such as the interaction between terrestrial water and ocean or free water and glaciers. Nevertheless, with the preliminary study on the potential use of the downscaled TWSAs for monitoring water change and natural hazards, we demonstrate the significance of the proposed method. In practice, timeliness is a key factor. The training process of the proposed algorithm can be finished in around 3 days for the global model using consumer-level hardware (NVIDIA RTX 3080 Ti), which is efficient considering the typical delay of GRACE monthly products. Therefore, the major limitations for rapidly delivering high-resolution TWSAs are the processing time demands of GRACE measurements and hydrological simulations. For applications that need higher temporal frequencies, such as daily to weekly solutions, we can benefit from the principle of online machine learning, and the proposed model can be updated within 1 hour. The operational delivery of the downscaled product should be beneficial for the geoscience community and society, especially in the fields of hydrology, climate science, sustainable water management and hazard prediction.

Methods

GRACE mascon solution

The analysis centres of GRACE provide a variety of products regularly. One of the most user-friendly products is the mass concentration (mascon) solution, where the mass variations are directly estimated by explicitly relating the intersatellite range–rate measurements to the mascon formulation53,54,55. Compared to the spherical harmonic solutions, the mascon solutions suffer less from leakage errors and can better separate the land and ocean signals56. Therefore, the mascon solutions typically have a finer resolution for small regions57. In this study, we used the mascon solutions provided by NASA Jet Propulsion Laboratory (JPL)58. JPL has applied many data processing steps, including the replacement of the \(C_{2,0}\) coefficients with the solutions from satellite laser ranging59, the application of a glacial isostatic adjustment (GIA) model60 and the removal of the impacts of ocean, atmosphere and land ice masses. In the end, the remaining monthly gravity changes can provide a precise measure of mass redistribution in the Earth’s water cycle57. In addition, the mean values from 2004.0 to 2009.999 are removed from the products to produce the TWSA in the form of EWH. The product used in this study is without the land-grid-scaling gain factors61 so that the data can provide us with information that is entirely independent of hydrological models.

Hydrological models and basin boundaries

WaterGAP is a global hydrological model that thoroughly describes water storage, usage and resources in all land areas except for Antarctica. WaterGAP v.2.2d, including WGHM, was published in 202139. WGHM comprehensively models daily water flows and water storage since it includes various forms of water, such as groundwater, rivers and snow. As one of the standard WGHM outputs, monthly TWS is provided at a high spatial resolution of 0.5° × 0.5°. This TWS product is the sum of the water storage in the canopy, snow, soil, groundwater, wetland, lake, reservoir and river storage39. Owing to the modelling approach and high spatial resolution, we can observe the principal rivers from the simulated TWS. The direct comparison between the obtained TWSAs and the GRACE TWSAs is possible62. However, we must remove the average values from 2004.0 to 2009.999 to generate the TWSAs with the same temporal baseline. Since the WGHM-modelled TWS does not include assimilation of GRACE measurements, the obtained WGHM TWSAs and GRACE TWSAs are entirely independent. Although the global structures of the WGHM TWSAs are noticeably more finely resolved than GRACE products, the WGHM TWSAs also have two non-negligible limitations. First, the WGHM simulations are much noisier due to errors in the simulation procedure. Second, the values of the WGHM TWSAs are less accurate compared to the GRACE TWSAs since they are not based on real observations. WGHM clearly underestimates the mean annual TWSA amplitudes in more than half of 147 investigated river basins by more than 10%, which may relate to the fact that WaterGAP does not simulate glaciers39.
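Bringing the WGHM TWS to the GRACE temporal baseline amounts to subtracting, for each pixel, the mean over the 2004.0 to 2009.999 window. A minimal sketch for one pixel follows, with decimal-year time stamps; the function name and values are hypothetical.

```python
def to_anomalies(times, tws, start=2004.0, end=2010.0):
    """Subtract the mean over the baseline window [start, end) from a series.

    times are decimal years; tws is the storage series of one pixel in mm.
    """
    baseline = [v for t, v in zip(times, tws) if start <= t < end]
    if not baseline:
        raise ValueError("no samples inside the baseline window")
    mean = sum(baseline) / len(baseline)
    return [v - mean for v in tws]


# Hypothetical samples: only the two middle values fall inside the baseline.
times = [2003.5, 2004.5, 2008.5, 2012.5]
tws = [100.0, 110.0, 130.0, 150.0]
twsa = to_anomalies(times, tws)
```

Using the same window for both products removes their different absolute reference levels, so the remaining differences reflect the temporal behaviour of each product.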

In addition, we included hydrological information from the Global Land Data Assimilation System (GLDAS), which aims to assimilate satellite- and ground-based observational products to provide fields of land surface states and fluxes40. Within this platform, many land surface models (LSMs) are integrated. In this study, we used the data from the GLDAS Noah Land Surface Model L4 monthly 0.25° × 0.25° v.2.1 (refs. 40,63). GLDAS v.2.1 is forced with a combination of model and observation data from 2000 to the present without assimilating GRACE measurements. Therefore, the products provided by GLDAS v.2.1 are additional data sources independent of the GRACE TWSAs. Considering the water balance equation, which describes the relationship between the changes of TWSA (TWSC) and precipitation (P), evapotranspiration (ET) and runoff (R):

$$\mathrm{TWSC}=P-\mathit{ET}-R,$$
(1)

we exclusively focused on the three mentioned parameters. The data are downsampled into the resolution of 0.5° × 0.5° by computing the average of four neighbouring pixels to obtain the same resolution as the WGHM TWSAs.
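Averaging four neighbouring 0.25° pixels into one 0.5° pixel is a 2 × 2 block mean. The sketch below assumes a grid with even dimensions and four valid values in every block. The handling of missing values is not described in the text and is left out here.

```python
def block_mean_2x2(grid):
    """Average non-overlapping 2 x 2 blocks of a grid given as nested lists."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid dimensions must be even")
    return [
        [
            (grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
            for j in range(0, cols, 2)
        ]
        for i in range(0, rows, 2)
    ]


# Hypothetical 0.25-degree field of 2 x 4 pixels becomes 1 x 2 at 0.5 degrees.
fine = [
    [1.0, 3.0, 10.0, 20.0],
    [5.0, 7.0, 30.0, 40.0],
]
coarse = block_mean_2x2(fine)
```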

The hydrological boundaries are obtained from HydroBASINS46, which represents a series of vectorized polygon layers that depict sub-basin boundaries at a global scale. All continents, except Antarctica, are included. The HydroBASINS product follows the Pfafstetter concept64 and provides levels 1 to 12 globally. In this study, we focused on the HydroBASINS level 1 (nine continental regions), 3 (292 sub-basins) and 4 (1,342 sub-basins) products.

Feature selection and preprocessing

One of the essential prerequisites for the success of the deep learning model is to determine a set of meaningful features that can represent the changes in TWSAs to a sufficient degree. First, the TWSAs from GRACE and WGHM are the most important features because the GRACE TWSAs have relatively accurate values over a larger region and those from WGHM provide information about high-resolution structures. Furthermore, we included precipitation, evapotranspiration and runoff as features inspired by equation (1). We should note that the GLDAS provides runoff split into three components: storm surface runoff, baseflow-groundwater runoff and snow melt. Therefore, we have five additional features in total. In the end, since multiple studies pointed out the correlation between the changes of TWS and geocoordinates1 and the positive contribution of geocoordinates to global deep learning models65, we also considered latitudes and longitudes as additional features. We normalized the features on the basis of their 0.01th percentiles and 99.99th percentiles to reduce the impacts of outliers.
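The exact normalization formula is not given in the text. One plausible reading, assumed here, is a min–max style scaling in which the 0.01th and 99.99th percentiles replace the minimum and maximum, so that a few extreme values do not compress the rest of the range. The percentile uses linear interpolation between order statistics. All names are hypothetical.

```python
def percentile(values, q):
    """Percentile q (0 to 100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def normalize(values, q_low=0.01, q_high=99.99):
    """Scale values so that the two percentiles map to 0 and 1 (no clipping)."""
    low, high = percentile(values, q_low), percentile(values, q_high)
    return [(v - low) / (high - low) for v in values]


# Hypothetical feature with 10,001 evenly spaced values from 0 to 10,000.
feature = [float(v) for v in range(10001)]
scaled = normalize(feature)
```

Values beyond the two percentiles fall slightly outside the interval from 0 to 1 in this sketch. Whether the study clips them is not stated.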

The final step before we fed the data into our model is splitting the global area into small patches of the same size. We first found all the pixels on land by relying on the basin boundaries. Then, we considered them as the central pixels and generated a 16° × 16° patch around each central pixel, which means each patch has a size of 32 × 32 with a resolution of 0.5°. The patch size is several times the coarse resolution of GRACE TWSAs (3° × 3°): a 16° × 16° patch contains more than 25 effective GRACE pixels (16/3 ≈ 5.3 pixels per side), making the patch-wise average of GRACE TWSAs meaningful. Some patches near the coastlines unavoidably contain areas over oceans where the hydrological models do not provide any values. We filled these with the average values of valid pixels in the same patch. The patches were considered as images in our deep learning model and the nine features were put into nine channels.
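The patch generation and ocean filling can be sketched as below. Missing pixels are marked with `None`, and the patch is assumed to lie fully inside the grid. The treatment of the grid edges and the exact position of the central pixel within an even-sized patch are not specified in the text and are assumptions of this sketch.

```python
def extract_patch(grid, center_row, center_col, size=32):
    """Cut a size x size patch around a pixel and fill missing values.

    Missing pixels (None) are replaced by the mean of the valid pixels
    in the same patch.
    """
    half = size // 2
    top, left = center_row - half, center_col - half
    if top < 0 or left < 0 or top + size > len(grid) or left + size > len(grid[0]):
        raise ValueError("patch must lie fully inside the grid")
    patch = [row[left:left + size] for row in grid[top:top + size]]
    valid = [v for row in patch for v in row if v is not None]
    fill = sum(valid) / len(valid)
    return [[fill if v is None else v for v in row] for row in patch]


# A 16-degree patch at 0.5-degree resolution has 32 pixels per side.
pixels_per_side = int(16 / 0.5)

# Hypothetical 4 x 4 field with one ocean pixel, cut with a small 2 x 2 patch.
field = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, None, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
patch = extract_patch(field, 1, 1, size=2)
```

Filling with the patch mean leaves the patch-wise average of the valid pixels unchanged, so the filled ocean pixels do not bias a comparison of patch averages.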

Self-supervised data assimilation model

The lack of high-resolution ground truth impedes the application of supervised learning approaches to provide high-resolution TWSAs. All the mentioned deep learning-based studies artificially generated input–output pairs, which enables a supervised regression but always under the constraints of certain assumptions. To avoid strict assumptions about the input–output pairs and exploit the information from the available data sources as much as possible, we proposed a data assimilation model with a specifically designed loss function that allows self-supervised optimization. Our model is based on the principle of convolutional neural networks41, a specific type of neural network that uses convolution in place of general matrix multiplication in some layers. The convolutional operators, usually known as kernels, can extract high-level feature maps by considering the relative positional relationship between pixels or low-level features. In our specific case, the model can extract information about hydrological phenomena, such as water storage changes in waterbodies, from the values of individual pixels. We also applied the concept of residual learning66. In this context, the network explicitly approximates the residual function \(\mathcal{F}(x)=\mathcal{H}(x)-x\), which is the difference between the original target function \(\mathcal{H}(x)\) and the input x. Owing to the skip connection, fitting the residual function should not be more difficult than fitting the original target function itself. As a result, a deeper model should have a training error no greater than its shallower counterpart66. Batch normalization67 is also included to reduce the sensitivity to the initialization and thereby improve the optimizing process.

Our model aims to assimilate the satellite observations and hydrological simulations by balancing the accurate values from GRACE observations over an area larger than their effective resolution against the high-resolution structures from the WGHM simulations. Therefore, the loss function is designed such that the outputs of our model are compared with both inputs, GRACE TWSAs and WGHM TWSAs. The first goal of our optimization process is to bring the output values as close as possible to the GRACE TWSAs over each patch. Since the GRACE measurements of individual 0.5° pixels are not representative, we computed the absolute error (AE) between the averaged GRACE TWSAs and the averaged predicted TWSAs over each patch:

$${{{{\rm{AE}}}}}_{{{{\rm{G}}}}}\left({{{{P}}}}_{{{{\rm{G}}}}},\hat{{{{P}}}}\right)=\left\vert \frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}{p}_{{{{\rm{G}}}},n}-\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}{\hat{p}}_{n}\right\vert ,$$
(2)

where \({P}_{\rm{G}}\) and \(\hat{{{{P}}}}\) denote the GRACE patches and predicted patches, respectively. Each patch includes N pixels with values denoted by \({p}_{\rm{G}}\) and \(\hat{p}\) for GRACE and predicted patches, respectively. The second goal of our optimization process is to learn the high-resolution structures from the WGHM TWSAs. For this purpose, we introduced the Pearson correlation coefficient (R) between the outputs and WGHM TWSAs to describe the similarity, since it proved superior to other similarity metrics such as the structural similarity index (SSIM), as argued in another study68 and confirmed in our tests. We introduced a second metric to strengthen the measurement of structural similarity, namely the mean absolute error (MAE) between WGHM TWSAs and predicted TWSAs. Equations (3) and (4) show the corresponding definitions, with \({p}_{\rm{W}}\) denoting the pixels of the WGHM patch \({P}_{\rm{W}}\) and \({\hat{p}}\) denoting the pixels of the output patch \({\hat{P}}\); in equation (3), the subscripts m and n index the rows and columns of a patch and an overbar denotes the patch mean:

$$\begin{array}{l}{\rm{R}}({{{\it{P}}}}_{{{{\rm{W}}}}},\hat{{{\it{P}}}})=\frac{{\sum }_{m}{\sum }_{n}\left({{{{P}}}}_{{{{\rm{W}}}},mn}-{\overline{{{{P}}}}}_{{{{\rm{W}}}}}\right)\left({\hat{{{{P}}}}}_{mn}-\overline{\hat{{{{P}}}}}\,\right)}{\sqrt{\left({\sum }_{m}{\sum }_{n}{\left({{{{P}}}}_{{{{\rm{W}}}},mn}-{\overline{{{{P}}}}}_{{{{\rm{W}}}}}\right)}^{2}\right)\left({\sum }_{m}{\sum }_{n}{\left({\hat{{{{P}}}}}_{mn}-\overline{\hat{{{{P}}}}}\,\right)}^{2}\right)}}\end{array},$$
(3)
$${{{{\rm{MAE}}}}}_{{{{\rm{W}}}}}\left({{{{P}}}}_{{{{\rm{W}}}}},\hat{{{{P}}}}\right)=\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\left\vert {p}_{{{{\rm{W}}}},n}-{\hat{p}}_{n}\right\vert .$$
(4)

Finally, we combined the proposed terms to achieve the two goals within the same optimizing process, leading to the following formulation of the loss function:

$${{{\mathcal{L}}}}\left({{{{P}}}}_{{{{\rm{G}}}}},{{{{P}}}}_{{{{\rm{W}}}}},\hat{{{{P}}}}\right)=\frac{1}{B}\mathop{\sum }\limits_{b = 1}^{B}\left\{{{{{\rm{AE}}}}}_{{{{\rm{G}}}}}({{{{P}}}}_{{{{\rm{G}}}}},\hat{{{{P}}}})+\left[1-{{{\rm{R}}}}\left({{{{P}}}}_{{{{\rm{W}}}}},\hat{{{{P}}}}\right)\right]\times {{{{\rm{MAE}}}}}_{{{{\rm{W}}}}}\left({{{{P}}}}_{{{{\rm{W}}}}},\hat{{{{P}}}}\right)\right\},$$
(5)

where B is the batch size. We used AE and MAE rather than L2 metrics because L1 metrics are usually more robust against outliers. During our experiments, we observed that the GRACE term and the WGHM term converge to similar magnitudes, indicating that our model balances the two terms without over-relying on either of them. Thus, there is no need to add further hyperparameters to weight these two terms explicitly.
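As a worked illustration of equations (2)–(5), the loss can be written out for flattened patches in plain Python. This is a sketch for clarity rather than the training code (which the text states was implemented in TensorFlow); all function names are hypothetical, and each patch is assumed to be a flat list of pixel values.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def ae_grace(p_grace, p_pred):
    """Equation (2): absolute difference between the two patch means."""
    return abs(_mean(p_grace) - _mean(p_pred))


def pearson_r(p_wghm, p_pred):
    """Equation (3): Pearson correlation over all pixels of a patch.

    Undefined (division by zero) if either patch is constant.
    """
    mw, mp = _mean(p_wghm), _mean(p_pred)
    cov = sum((w - mw) * (p - mp) for w, p in zip(p_wghm, p_pred))
    var_w = sum((w - mw) ** 2 for w in p_wghm)
    var_p = sum((p - mp) ** 2 for p in p_pred)
    return cov / math.sqrt(var_w * var_p)


def mae_wghm(p_wghm, p_pred):
    """Equation (4): mean absolute pixel-wise error."""
    return _mean([abs(w - p) for w, p in zip(p_wghm, p_pred)])


def patch_loss(p_grace, p_wghm, p_pred):
    """Summand of equation (5) for a single patch."""
    structure = (1.0 - pearson_r(p_wghm, p_pred)) * mae_wghm(p_wghm, p_pred)
    return ae_grace(p_grace, p_pred) + structure


def batch_loss(batch):
    """Equation (5): mean over a batch of (p_grace, p_wghm, p_pred) tuples."""
    return _mean([patch_loss(g, w, p) for g, w, p in batch])
```

Note that the WGHM term vanishes whenever the output is perfectly correlated with the WGHM patch, even if the two differ by a constant offset; the offset is then controlled only by the GRACE term, which is what allows the patch mean to follow GRACE.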

The detailed realization of the global model is shown in Fig. 1, including details of the applied residual blocks. The encoder contains three two-dimensional (2D) convolutional layers followed by a ReLU activation function \(\left({{{\rm{ReLU}}}}({{{x}}})=\max (0,{{{x}}})\right)\) and three residual blocks. The encoding process is realized by the fact that each convolutional layer has a stride of 2 to reduce the size of the outputs and increase the receptive field. The convolutional layers have increasing numbers of kernels (16, 32 and 64) with a size of 3 to increase the latent information, namely the feature dimension. In the decoder, the feature maps have to be upsampled first. Here, we use 2D-bilinear upsampling layers to pre-upsample the feature maps and feed them into the 2D convolutional layers, followed by ReLU and residual blocks. Unlike the encoder, the convolutional layers have stride 1 to keep the size of outputs the same. They have 64, 32 and 16 kernels to reproduce the spatial information from the latent information. At the end of the architecture, another convolutional layer with kernel size 1 without activation function is designed so that it can project the final feature maps to the actual TWSA values. The resulting output size is the same as the input size (32 × 32). Once the outputs are generated, they are compared to the original GRACE and WGHM TWSAs to compute the loss using equation (5). To understand the benefits of our proposed model structure, we can analyse the two parts of the optimizing process separately. If we only consider the parts of the loss function related to the WGHM terms (bottom part in Fig. 1), our model is similar to an autoencoder69,70, which aims to reconstruct the WGHM TWSAs while reducing noise. Therefore, the number of outliers is remarkably reduced. Then, optimizing the part of the loss function related to the GRACE terms is like a regression problem (top part in Fig. 1), aiming to calibrate the values of the reconstructed high-resolution TWSAs on the patch scale. Our proposed loss function combines these two principles so that they are optimized jointly.
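The spatial bookkeeping of the encoder and decoder can be checked with a short calculation. The sketch below assumes 'same' padding in the convolutions and an upsampling factor of 2 in each decoder stage; neither is stated explicitly in the text, but both are consistent with the reported 32 × 32 output size.

```python
ENCODER_KERNELS = (16, 32, 64)   # number of kernels per encoder stage (from the text)
DECODER_KERNELS = (64, 32, 16)   # number of kernels per decoder stage (from the text)


def trace_spatial_sizes(input_size=32, n_stages=3, stride=2, upsample=2):
    """Return the feature-map edge length after each encoder and decoder stage.

    Assumes 'same' padding, so a stride-s convolution maps n to ceil(n / s),
    and an upsampling factor equal to the encoder stride (both assumptions).
    """
    sizes = [input_size]
    for _ in range(n_stages):
        # Encoder stage: stride-2 convolution shrinks the feature map.
        sizes.append(-(-sizes[-1] // stride))  # ceiling division
    for _ in range(n_stages):
        # Decoder stage: bilinear upsampling, then a stride-1 convolution.
        sizes.append(sizes[-1] * upsample)
    return sizes
```

Under these assumptions the edge length follows 32, 16, 8, 4 in the encoder and 8, 16, 32 in the decoder, so the final 1 × 1 convolution acts on a map of the original size.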

To tune the model structure and hyperparameters, we started with experiments over four river basins (Amazon, Congo, Mississippi and Lena). At this stage, we randomly split the 158 months of data up to the end of 2016 into training (70%, 110 months), validation (15%, 24 months) and test sets (15%, 24 months). By monitoring the training process, we observed that the models usually converged between training epochs 120 and 150. The models did not suffer from overfitting and showed similar performance on data from 2017 to the end of 2019. Therefore, we decided to train our models using all 180 months of data to obtain the same quality for the whole time interval (April 2002 to December 2019). We used TensorFlow v.2.6.0 (ref. 71) to implement all the networks as well as the training, validation and test processes. The optimizer is Adam72 with default settings and a batch size of 512. The training process is efficient and can be finished in around 3 days for the global model on consumer-level hardware (NVIDIA RTX 3080 Ti).

Estimating uncertainties based on deep ensemble

To quantify the uncertainties of downscaled TWSAs, we followed the principle of deep ensemble learning42 and trained five independent deep learning models from scratch with different random initial states. As a relatively small number of independent models is sufficient for modelling the predictive uncertainty73, we computed our ensemble results (\({\mu }_{* }\)) and uncertainties (\({\sigma }_{* }\)) as:

$${\mu }_{* }=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{\mu }_{{\theta }_{m}},$$
(6)
$${\sigma }_{* }=\sqrt{\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}\left({\mu }_{{\theta }_{m}}^{2}+{\sigma }_{{\theta }_{m}}^{2}\right)-{\mu }_{* }^{2}},$$
(7)

where \({\mu }_{{\theta }_{m}}\) and \({\sigma }_{{\theta }_{m}}\) are the predicted TWSAs and associated uncertainties of model m and M is the total number of models, namely five in this study. However, due to the specific design of the loss function, \({\sigma }_{{\theta }_{m}}\) cannot be directly estimated by deep learning models. To overcome this issue, we used Monte Carlo simulations. We sampled 20 sets of GRACE inputs randomly on the basis of their uncertainties for each model m to estimate \({\sigma }_{{\theta }_{m}}\). Ultimately, the ensemble uncertainty σ* was estimated from five independent deep learning models, with 100 Monte Carlo simulation runs in total. We note that the uncertainties of WGHM simulations and GLDAS inputs are unavailable and not considered. Therefore, the uncertainties reported in this study may be underestimated.
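Equations (6) and (7) amount to treating the ensemble as a uniformly weighted mixture and taking its mean and standard deviation. A minimal per-pixel sketch follows; the function name is hypothetical and the inputs are assumed to be the per-model predictions and their Monte Carlo uncertainties.

```python
import math


def ensemble_mean_and_sigma(mus, sigmas):
    """Combine per-model predictions and uncertainties for one pixel.

    mus[m] and sigmas[m] are the prediction and uncertainty of model m.
    Returns (mu_star, sigma_star) following equations (6) and (7).
    """
    m = len(mus)
    mu_star = sum(mus) / m
    second_moment = sum(mu ** 2 + s ** 2 for mu, s in zip(mus, sigmas)) / m
    # Clamp at zero to guard against tiny negative values from round-off.
    variance = max(second_moment - mu_star ** 2, 0.0)
    return mu_star, math.sqrt(variance)
```

The resulting variance is the sum of the average within-model variance and the spread of the model means, so disagreement between ensemble members increases the reported uncertainty even when each member is individually confident.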

Closure of water balance equation

The closure of the water balance equation is assessed by comparing the left side of equation (1), computed from the time derivatives of TWSA products, with the right side of equation (1), computed from water budget components. We chose the precipitation, evapotranspiration and runoff products from ERA5-Land45 to provide an independent external evaluation since they are not used in the deep learning model. Moreover, they have been shown to agree well with GRACE measurements22. To obtain homogeneous time steps for water changes, we interpolated the GRACE and our downscaled TWSAs to the middle of each month using PCHIP74 interpolation. The GRACE and GRACE-FO eras were handled separately to avoid a biased interpolation caused by the gap of 1 year. The TWSCs were obtained by centred finite differences of TWSAs:

$${{{\rm{TWSC}}}}(t)=\frac{{{{\rm{TWSA}}}}({{{t}}}+1)-{{{\rm{TWSA}}}}({{{t}}}-1)}{2{{\Delta }}t},$$
(8)

where Δt indicates 1 month. The P, ET and R time series were further smoothed to reduce potential high-frequency artefacts introduced by the finite differences75:

$$\widetilde{X}(t)=\frac{1}{4} X(t-1)+\frac{1}{2}X(t)+\frac{1}{4}X(t+1),$$
(9)

where X denotes P, ET or R. In the end, we obtained two types of TWSCs: \({{\rm{TWSC}}}_{{\rm{GRACE}}}\), estimated from GRACE measurements or other TWSA products, and \({{\rm{TWSC}}}_{{\rm{budget}}}\), estimated from P, ET and R. The agreement of these two TWSCs was quantified by computing the NSE47:

$${{{\rm{NSE}}}}=1-\frac{\frac{1}{T}\mathop{\sum }\nolimits_{t = 1}^{T}{\left({{{{\rm{TWSC}}}}}_{{{{\rm{budget}}}}}(t)-{{{{\rm{TWSC}}}}}_{{{{\rm{GRACE}}}}}(t)\right)}^{2}}{\frac{1}{T}\mathop{\sum }\nolimits_{t = 1}^{T}{\left({{{{\rm{TWSC}}}}}_{{{{\rm{GRACE}}}}}(t)-{\overline{{{{\rm{TWSC}}}}}}_{{{{\rm{GRACE}}}}}(t)\right)}^{2}}.$$
(10)
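The chain from equations (8) to (10) can be sketched for a single gap-free monthly time series. The sketch assumes that equation (1) has the usual form TWSC = P − ET − R and that all series are already on a common monthly grid; function names are hypothetical.

```python
def centred_difference(twsa, dt=1.0):
    """Equation (8): TWSC at the interior time steps 1 .. T-2."""
    return [(twsa[t + 1] - twsa[t - 1]) / (2.0 * dt)
            for t in range(1, len(twsa) - 1)]


def smooth_121(x):
    """Equation (9): (1/4, 1/2, 1/4) weighted average at interior time steps."""
    return [0.25 * x[t - 1] + 0.5 * x[t] + 0.25 * x[t + 1]
            for t in range(1, len(x) - 1)]


def budget_twsc(p, et, r):
    """Budget-based TWSC, assuming equation (1) reads TWSC = P - ET - R."""
    return [pi - ei - ri
            for pi, ei, ri in zip(smooth_121(p), smooth_121(et), smooth_121(r))]


def nse(twsc_budget, twsc_grace):
    """Equation (10): the 1/T factors cancel, so plain sums are used."""
    mean_ref = sum(twsc_grace) / len(twsc_grace)
    num = sum((b - g) ** 2 for b, g in zip(twsc_budget, twsc_grace))
    den = sum((g - mean_ref) ** 2 for g in twsc_grace)
    return 1.0 - num / den
```

Both `centred_difference` and `smooth_121` drop the first and last time step, so their outputs are aligned and can be passed to `nse` directly. An NSE of 1 indicates perfect agreement, whereas 0 means the budget estimate is no better than the temporal mean of the TWSA-derived series.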

TWSA-derived environmental monitoring indices

To demonstrate the applicability of the highly resolved TWSAs, we relied on the concepts of environmental monitoring indices introduced in previous studies, including FPI4 and DSI5. The motivation for FPI is the different capacity of each cell to store water. We obtained the water deficit \({S}_{{\rm{Def}}}(t)={S}_{{\rm{Max}}}-S(t-1)\) by computing the difference between the historic maximum \({S}_{{\rm{Max}}}\) and the water storage of the previous month \(\left(S(t-1)\right)\) for each individual cell. Under the assumption that high flooding risks are caused by extreme precipitation, we computed the flood potential index \(F(t)={P}_{{\rm{Mon}}}(t)-{S}_{{\rm{Def}}}(t)\), where \({P}_{{\rm{Mon}}}(t)=P(t)\times {\rm{d}}t\) denotes the monthly precipitation (ERA5-Land in this study). In the end, the FPI was normalized to the range 0 to 1: \({{{\rm{FPI}}}}=F(t)/\max \left[F(t)\right]\). In this approach, gaps in the TWSA records may lead to incorrect water deficits because the monthly precipitation is integrated over time. DSI is a standardized metric and also considers the different characteristics of each cell. First, we computed the average values (\({\overline{{{{\rm{TWSA}}}}}}_{m}\)) and standard deviations (\({\sigma }_{m}\)) of each month m over all the years with valid TWSA records. Then, we computed standardized anomalies for the month m and year y as \(({{{{\rm{TWSA}}}}}_{m,y}-{\overline{{{{\rm{TWSA}}}}}}_{m})/{\sigma }_{m}\). In the end, the standardized anomalies were placed into 11 categories on the basis of the thresholds set by ranking percentiles (30%, 20%, 10%, 5% and 2%, two-sided)76.
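Both indices can be sketched for a single grid cell. The code below is illustrative only: it assumes gap-free records, takes the historic maximum over the whole record, and uses the population standard deviation for the DSI, none of which is specified in the text. The percentile-based categorization is omitted.

```python
import statistics


def flood_potential_index(storage, p_mon):
    """FPI for one cell at time steps t = 1 .. T-1.

    storage[t] is the water storage and p_mon[t] the monthly precipitation.
    Values are divided by max F as in the text; negative F is not clipped here.
    """
    s_max = max(storage)  # historic maximum over the whole record (assumption)
    f = [p_mon[t] - (s_max - storage[t - 1]) for t in range(1, len(storage))]
    f_max = max(f)
    if f_max <= 0:
        raise ValueError("normalization requires a positive maximum of F")
    return [value / f_max for value in f]


def standardized_anomalies(twsa_by_year):
    """Standardized anomalies for one cell.

    twsa_by_year[y][m] is the TWSA of year index y and calendar month m.
    Each calendar month is standardized with its own mean and deviation.
    """
    n_months = len(twsa_by_year[0])
    out = [[0.0] * n_months for _ in twsa_by_year]
    for m in range(n_months):
        column = [year[m] for year in twsa_by_year]
        mean_m = statistics.fmean(column)
        sigma_m = statistics.pstdev(column)  # population std (assumption)
        for y, value in enumerate(column):
            out[y][m] = (value - mean_m) / sigma_m
    return out
```

Standardizing each calendar month separately removes the seasonal cycle, so a given anomaly is judged against what is typical for that month in that cell.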

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.