Downscaling GRACE total water storage change using partial least squares regression

The Gravity Recovery And Climate Experiment (GRACE) satellite mission recorded temporal variations in the Earth’s gravity field, which are then converted to Total Water Storage Change (TWSC) fields representing an anomaly in the water mass stored in all three physical states, on and below the surface of the Earth. GRACE provided a first global observational record of water mass redistribution at spatial scales greater than 63000 km2. This limits their usability in regional hydrological applications. In this study, we implement a statistical downscaling approach that assimilates 0.5° × 0.5° water storage fields from the WaterGAP hydrology model (WGHM), precipitation fields from 3 models, evapotranspiration and runoff from 2 models, with GRACE data to obtain TWSC at a 0.5° × 0.5° grid. The downscaled product exploits dominant common statistical modes between all the hydrological datasets to improve the spatial resolution of GRACE. We also provide open access to scripts that researchers can use to produce downscaled TWSC fields with input observations and models of their own choice. Measurement(s) Gravity Technology Type(s) gravity field theory • computational modeling technique Factor Type(s) geographic location • temporal interval Sample Characteristic - Environment water body Sample Characteristic - Location global Measurement(s) Gravity Technology Type(s) gravity field theory • computational modeling technique Factor Type(s) geographic location • temporal interval Sample Characteristic - Environment water body Sample Characteristic - Location global Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13503114

www.nature.com/scientificdata www.nature.com/scientificdata/ statistical relationship between large-scale variable and small scale variables 15 . Both have their pros and cons, such as the former is physically based but computationally expensive and strongly dependent on boundary conditions, while the latter is computationally cheap and easy to implement but requires long and reliable observations and depends on choice of predictors and quality of input data.
It has been shown that the effective resolution can be improved by assimilating information at a better spatial resolution [16][17][18][19][20] . Various data assimilation techniques have been devised and successfully implemented, for example, in improving operational weather forecast, predicting ocean dynamics, and modelling soil moisture content 18,21,22 . Recently several studies have assimilated hydro-geodetic data and hydrological models to estimate, calibrate, or validate hydrological flux variables. For example: modelling river runoff with the help of hydro-geodetic approaches 23,24 , estimating catchment-scale water budget using a Kalman filter framework 25,26 , and calibration or/and validation of hydrological model outputs using GRACE 27,28 . The ensemble Kalman approach filter has been used effectively to assimilate GRACE TWSC into a Land Surface Model (LSM) to improve model performance 17,[28][29][30] . Several non-parametric methods have also been proposed to improve spatio-temporal knowledge of hydrological variables. For example 31 , predicted ground water level changes by incorporating GRACE with hydro-meteorological variables in an Artificial Neural Network (ANN) framework. 32 demonstrated the efficacy of ANN to predict TWSC from precipitation, soil moisture, and temperature, and 33,34 used ANN and Machine Learning to produce high-resolution TWS estimates. Recently 35 demonstrated that statistical downscaling can help us fill temporal gaps in GRACE data and 36 developed a statistical downscaling approach that uses evapotranspiration data to downscale GRACE TWSC.
Inspired by recent developments in assimilating models and hydro-geodetic observations, we present a statistical downscaling approach that improves the spatial resolution of GRACE from ≈3° to 0.5° grid. The method employs a multivariate regression model that integrates multiple components of water budget (WaterGAP hydrology model (WGHM) TWSC, GRACE TWSC, several estimates of precipitation, evapotranspiration and runoff). The regression is carried out at a residual signal level obtained by removing the dominant seasonal signal and linear trend. It also accounts for time lag and lead between various water budget components. The method finds common spatiotemporal modes by employing Partial Least-squares Regression (PLR), and then uses these modes to reconstruct (redistribute) GRACE observed mass change at the spatial resolution of WGHM. We demonstrate that our method is able to learn from high resolution model TWSC and resolve spatial features such as river channels, which is not possible from conventional GRACE products. Since the information on high resolution mass change is obtained from high resolution model, the downscaled product is expected to vary with the input data. In this study our aim is to provide users with a framework that they can use to obtain better resolved TWSC estimates with datasets and models they are confident of. Hence we refrain from commenting on the best dataset on the input side because robust validation of TWSC at the grid scale is not possible. We demonstrate that the method ensures conservation of GRACE-derived mass at catchment scale for 160 catchments spread over the globe.

Methods
Complex mechanisms drive spatiotemporal variability in hydrological flux variables, which are related to each other via the water budget equation. Our aim is to relate hydrological flux variables to grid scale TWSC. This can be achieved by employing a multivariate linear regression model that relates the predictand (S) (the signal to be predicted) to the predictor (L) (obtained from set of observations) as where S(n × g) is a matrix with n rows, one for each epoch, and g columns, one for each grid cell in a river catchment. The predictor matrix L(n × d) has n rows, one for each epoch, and d columns containing Precipitation P, Evapotranspiration ET, Runoff R, and catchment average of TWSC. In other words, each column vector in L is a time series with n epochs. H(d × g) is the prediction matrix. Setting up L efficiently is crucial for the success of (1). Since many data products are available for each water budget component and their performance varies with space and time, using multiple products for each variable provides the regression model with flexibility to rely relatively more on a dataset that offers stronger spatio-temporal common modes. Secondly, the water budget components are known to have temporal lead or lag with respect to each other [37][38][39] , and the dominant signal is driven by the annual water cycle. Therefore, the regression in (1) will be more efficient, if we i) expand the observation space by including k time shifted versions of each flux variable, and ii) operate at the residual signal level. We demonstrate this with an example: assume we have total m products for P, ET, R, and TWSC for a time period from January 2003 to December 2015. First, we expand P, ET, and R, to get a matrix where number of rows correspond to number of epochs and number of columns represent number of grid cells corresponding to a catchment, then we obtain k shifted versions of equal length time series, for example: for k = 12, we will obtain 12 equal length time series for each precipitation grid cell in the catchment we are interested in, where the first will start at January 2003 and end at December 2014, the second time series starting at February 2003 and ending at January 2015, and so on. Then we remove a cyclo-stationary mean from the corresponding shifted time series to obtain ΔP, ΔET, and ΔR. A cyclo-stationary mean is an annual cycle that represents the mean behaviour over the observation period. ΔTWSC is the time series from GRACE with cyclo-stationary mean signal and a linear trend removed. Hence L becomes, www.nature.com/scientificdata www.nature.com/scientificdata/ ΔP m k p g is a column vector of length equal to the number of epochs n. m p g is the product of number of precipitation products m p and the number of grid cells in the catchment g. This means the dimension of ΔP will be n × (m p × g × k). Hence the number of columns in L depends on the number of models for each variable , the number of grid cells in the region of interest, and the number of time-shifts k. Equation (1) represents the ideal case, but in reality measurements suffer from noise. Therefore, the multivariate regression model in (1) becomes Dimension reduction is crucial for multivariate regression analysis, which we achieve by using Partial Least Squares Regression (PLR), a non-parametric filtering technique developed by 40 . It decomposes the signal while minimizing the noise and preserving the mutual linear variability of measurements and unknown signals 40,41 . In other words, PLR aims to regress on those Principal Components (PCs) of measurements that highly correlate with the target signal 42,43 . Similar to Canonical Correlation Analysis (CCA), PLR obtains the PCs via Singular Value Decomposition (SVD) of the covariance matrix C LS between predictors and predictands 41,44,45 .
In the context of this study, S is obtained from a hydrology model that simulates TWSC at a higher spatial resolution compared to GRACE. For a given L, we can obtain the covariance matrix C LS (d × g), which can be decomposed using SVD: where U C (d × r) and V C (r × g) are joint normalized eigenvectors for L and S, which are also called the canonical modes, and Σ C (r × r) is a diagonal matrix containing covariance between L and S. r is the number of canonical modes from SVD, obtained as the rank of covariance matrix C LS

43
. The PCs of L, which are significantly correlated with S, can be obtained by projecting L on U C to get U L (n × r): Hence, we can write = L UU L C T , which can be substituted in (3): L C T which can also be written as L where K(r × g) is the transformed regression matrix obtained by projecting H on U c . Since we do not expect the total mass change in a catchment to change after downscaling, we can use the mass conservation as a constraint: where A w is the area vector for grid cells belonging to the catchment and ΔM GRACE is the catchment average of TWSC from GRACE. Using (7) and (5) in the constraint, we get Bringing the constraint in the observation space and solving for K using the least squares method, is the reformed regression matrix that can be combined with U C to obtain the optimal prediction matrix H, This concludes the training part, where we obtained a prediction matrix from known S and L. The prediction matrix can now be used to estimate the predictand S, that is, estimates of TWSC at a higher spatial resolution. Since we have regressed at the residual level, we can obtain the full downscaled product by restoring the part of TWSC signal that was removed earlier. To summarize, the statistical downscaling using PLR has five major steps: 1. arrange global data products to obtain time series vectors representing hydrological flux variables, 2. obtain dominant modes of variability between observations and a high resolution hydrology model, www.nature.com/scientificdata www.nature.com/scientificdata/ 3. estimating the prediction matrix from selected canonical modes, 4. transforming the prediction matrix from mode space to signal space, and 5. obtaining the downscaled GRACE product.
Step by step implementation. A conceptual flow chart of the process is provided in Fig. 1. The first step is to obtain datasets of P, ET, R, and TWSC. Usually the global products are available in grid cell format at different spatial resolutions. We choose P, ET, and R, at the same spatial resolution as WGHM. GRACE TWSC catchment averages are computed using the following relation, c c wherein f c is the catchment average of a global gridded field f(θ, λ). R(θ, λ) is the catchment mask with value 1 inside the catchment and 0 outside. Ω represents the domain of the Earth's surface, θ and λ are co-latitude and longitude, and dΩ is the infinitesimal surface element sin θdθdλ. A pre-processing step is performed to subtract the dominant signals and retain the residuals from the data. First we shift grid scale time series for P, ET, R by k = 6. We choose k = 6 because we are confident of capturing seasonal lead and lag with this value. If readers are confident of capturing time lead or lag between budget components with a different k, then they can use a different value. After obtaining time-shifted vectors, we remove a cyclo-stationary mean from the corresponding vector to obtain ΔP, ΔET, and ΔR. Hydrology model based estimates of TWSC have been shown to underestimate linear trends compared to GRACE observations 46  . Please note that the GRACE TWSC estimates are obtained at catchment scale, and therefore, the same value is subtracted from every grid cell. Hence, the observations from GRACE, P, ET, and R are regressed on the difference between WGHM and GRACE. Therefore, the dominant part of TWSC from GRACE is maintained.

P ET R M T GRACE
H is unknown while L and S are known.
In the next step, we compute C LS and decompose it by SVD to get U C (d × r) and V C (r × g). r is the number of canonical modes from SVD and it can attain a maximum value that is the rank of covariance matrix C LS . In this study we choose r = 10, because including more modes does not affect the efficacy of PLR method 20 . We can obtain U L as Model TWSC at grid scale 1. Remove linear trend . 2. Remove cyclo-staƟonary mean from GRACE.

ParƟal Least Squares Regression PredicƟon Matrix
Add back the GRACE catchment scale TWSC to Downscaled TWSC at grid scale GRACE TWSC Compute catchment averages and its cyclo-staƟonary mean.
Remove linear trend. Fig. 1 Flowchart of the downscaling approach.

PredicƟon phase
www.nature.com/scientificdata www.nature.com/scientificdata/ L C Now we have all the components to implement equations in (10) and obtain K. This leads to determination of the prediction matrix H, H can then be used to predict = S L H. The full downscaled product is obtained by adding back the linear trend and the cyclo-stationary mean signal to S.

Data Records
We use GRACE Level 3 mascon products from Jet Propulsion Laboratory (JPL) and GRACE level 2 spherical harmonic coefficients provided by the Institute of Geodesy, Graz University of Technology (IFG) 13,47,48 . We use precipitation (P) datasets from three centres (CPC, DELAWARE, and GLDAS NOAH025 M 2.1) 49-53 , two model based estimates of evapo-transpiration (ET) products (GLDAS, SeB) 53-55 , and two model based runoff (R) estimates (GLDAS and MERRA) 53,56,57 . We implement the method for 160 river catchments, where the smallest catchment is the Negro river basin in Uruguay with an area of 62518 km 2 and the largest catchment is the Amazon river basin with an area of 4672876 km 2 . The catchment boundaries have been downloaded from GRDC 58 (cf Fig. 2). The prior model information is obtained from WGHM, which is a global water resource and use model that simulates water flows among all relevant continental water storage compartments, including canopy, snow, soil, groundwater, lakes, reservoirs, rivers and wetlands. Despite the complex yet realistic model setup, the uncertainties in climate forcing limit the accuracy of WGHM in monitoring large-scale water storage variation 27,46,59,60 . We obtained model TWSC at a spatial resolution of 0.5° × 0.5° for a period from January 2003 to December 2016 [59][60][61] . Each dataset spans at least from January 2003 to December 2015. The data used in this study have been summarized in the Table 1.
Using these dataset we obtain downscaled TWS fields, which are available to users on figshare as netcdf files with four variables: Lat, Long, time and EWH_mm 62 . Lat and Long are latitude and longitude vectors of dimension 259200 × 1, representing the centre of a 0.5° × 0.5° grid cell on the surface of the Earth. EWH_mm is the TWS change in terms of mm Equivalent Water Height (EWH) for that grid cell with dimensions 259200 × 144, and time is a column vector of dimension 144 × 2 with year and month.  www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation Results. We demonstrate the method for two GRACE products: level 2 spherical harmonic coefficients from ITSG and the JPL mascon products. We do not endorse these products over other, we have just chosen one mascon solution and one spherical solution for demonstrative purposes. Furthermore, the difference between various GRACE product is not huge at catchment scale, therefore, choice of GRACE data is not critical. For the spherical harmonic product, coefficient C 2.0 is replaced by more accurate estimates from satellite laser ranging www.nature.com/scientificdata www.nature.com/scientificdata/ and the missing degree 1 terms have been replaced by degree 1 coefficients estimated by 63 . The Glacial Isostatic Adjustment (GIA) signal is removed using the ICE-6G forward GIA model 64 . Since the spherical harmonic coefficients are noisy, we filter them with a Gaussian filter of 400 km radius. Filtering affects the signal quality via attenuation and leakage 11 . Therefore, we use the data-driven method of deviation to repair the signal damage due to filtering 65 . The data-driven method of deviation has been shown to provide accurate mass change estimates for catchments larger than ≈65000 km 2 8 . The JPL GRACE mascon solutions do not need additional corrections and are available at sampling of 0.5° × 0.5° while their effective resolution is 3° × 3°1 3 . We use 10 PCs to reconstruct the signal for 160 catchments from January 2004 to December 2015. www.nature.com/scientificdata www.nature.com/scientificdata/ We show the coarse JPL mascon EWH maps and the downscaled maps for the month of March 2006 in Fig. 3 and for September 2006 in Fig. 4. The Year 2006 was chosen arbitrarily, months March and September are six months apart and thus would show us out of phase TWSC maps. We did not use any interpolation scheme. We can see mass change following physical water bodies 60,61 , hence, the downscaled product is able to capture spatial features better than original GRACE product. This is further demonstrated for the Amazon catchment in Fig. 5, where we plot the TWSC maps from JPL GRACE mascon, downscaled products, and the WGHM model, for four selected months. We also plot the time series of catchment average for Amazon. It is clear that the downscaled product is able to deliver mass change estimates at a higher spatial resolution.
Validation. Validating gridded downscaled TWSC for the 160 catchments is not possible as no other observational dataset is available for direct comparison. Therefore, we validate the efficacy of downscaled product by www.nature.com/scientificdata www.nature.com/scientificdata/ checking the conservation of mass at catchment scale. In Fig. 6, we show time series for 10 catchments of various shape, size, and climatic characteristics. We plot the Root Mean Square (RMS) of difference between the TWSC time series from GRACE and downscaled products for all the 160 catchments, which represents the error introduced by the downscaling process. The RMS of the processing error is almost always smaller than the GRACE error that is typically around 20 to 30 mm 4 .  Fig. 2. The plot at the bottom of the figure shows the RMS error of difference between time series from GRACE and downscaled product over 160 catchments. The RMS of difference between catchment averages of JPL GRACE product and the corresponding downscaled product is denoted by RMS J , while for ITSG GRACE product is RMS I . www.nature.com/scientificdata www.nature.com/scientificdata/ In Figs. 5 and 6, we compare catchment averages of GRACE products and their downscaled versions, along with the TWSC from WGHM model. A first observation is that the model simulations are not able to match GRACE observations, the difference is more prominent for catchments with poor data quality or availability, for example the Cauvery, the Tigris, or the Volga. Secondly, the downscaled product does not pick signal amplitude information from WGHM model. The catchment averages of downscaled TWSC match with the corresponding GRACE product. The water mass is redistributed to reflect additional information on river channels and landscape properties (cf. Figure 5). The spatial correlation between high resolution WGHM time series and the downscaled products is shown in Fig. 7, which shows that WGHM plays a significant role in redistributing the water mass change from GRACE. Together from Figs. 5-7, we can safely conclude that WGHM informed the spatial redistribution and not the signal amplitude while the principle of conservation of mass is not violated. We have provided the RMS of the process error for downscaled GRACE from JPL solution (RMS j ) and ITSG (RMS I ) solution on the time series plot for 160 river catchments. These catchments are distributed all over the globe (cf Fig. 2) and the RMS of error is small for all of them. Therefore, we can safely conclude that the efficacy of the downscaling approach is not region-dependent.
Please note that we do not claim that the downscaled product corresponds to the ground truth, as the output is only as accurate as the information from the input datasets. If users find other models or datasets more plausible, we recommend them to use the Octave/MATLAB script for generating a downscaled product on their own. The methodology uses dominant modes of variability between observations and a high resolution hydrology model to obtain downscaled product. Analysis of the prediction matrix could help us understand the relative contribution of individual input data. In Supplementary Fig. 1 we show the relative percentage contribution from dominant input datasets for one grid cell (0.5° × 0.5°) in the Amazon catchment. We do not go into detailed analysis of the prediction matrix in this study as it is an enormous task as there are 1530 grid cells in the Amazon catchment alone (the prediction matrix for the Amazon river basin alone is 64261 by 1530). Analyzing prediction matrix and understanding influence of input dataset will be a future project.

Usage Notes
The scripts and the output data are available for download. The datsets for P, ET, R, and TWSC should be prepared by the user following the instructions in the ReadMe file provided along with scripts and dataset. After you have Octave/MATLAB Data files for each variable, run the script statistical_downscaling_grids_TWS.m in command line following the instruction in ReadMe file. The command window will ask the user to select relevant files. These will then be used to obtain a downscaled product. A downscaled dataset from January 2004 to December 2015, and an example dataset to guide users is also available. Fig. 7 Maps of correlation between WGHM time series and downscaled TWSC time series from JPL GRACE product (a) and from ITSG GRACE product (b). There is a high correlation between high resolution WGHM product and the downscaled product, which demonstrates that the small scale features in downscaled product are driven by the WGHM model. www.nature.com/scientificdata www.nature.com/scientificdata/

Code availability
We have used Octave/MATLAB for processing the data. Along with the downscaled GRACE data, we also provide the script freely available for download from figshare https://doi.org/10.6084/m9.figshare.c.5054564 62 . www.nature.com/scientificdata www.nature.com/scientificdata/