Global soil moisture data derived through machine learning trained with in-situ measurements

While soil moisture information is essential for a wide range of hydrologic and climate applications, spatially-continuous soil moisture data is only available from satellite observations or model simulations. Here we present a global, long-term dataset of soil moisture derived through machine learning trained with in-situ measurements, SoMo.ml. We train a Long Short-Term Memory (LSTM) model to extrapolate daily soil moisture dynamics in space and in time, based on in-situ data collected from more than 1,000 stations across the globe. SoMo.ml provides multi-layer soil moisture data (0–10 cm, 10–30 cm, and 30–50 cm) at 0.25° spatial and daily temporal resolution over the period 2000–2019. The performance of the resulting dataset is evaluated through cross validation and inter-comparison with existing soil moisture datasets. SoMo.ml performs especially well in terms of temporal dynamics, making it particularly useful for applications requiring time-varying soil moisture, such as anomaly detection and memory analyses. SoMo.ml complements the existing suite of modelled and satellite-based datasets given its distinct derivation, to support large-scale hydrological, meteorological, and ecological analyses.

different results. Consequently, the resulting soil moisture data is independent from, and can complement existing satellite-based or model-derived datasets. Similar data-driven approaches to derive gridded datasets using ML algorithms have been successfully employed in the cases of land-atmosphere fluxes 25 and runoff 26 .
Here we present a novel global-scale gridded soil moisture dataset generated through a data-driven approach (Fig. 1). Namely, we employ a Long Short-Term Memory neural network (LSTM) 27 to build a soil moisture simulation model. Daily meteorological time series and static features obtained from both reanalysis and remote sensing datasets are used as predictor variables. As a target variable, we use adjusted in-situ soil moisture measurements from different depths obtained from the International Soil Moisture Network (ISMN) 14 and the National Center for Monitoring and Early Warning of Natural Disasters of Brazil (CEMADEN) 28 .
In-situ soil moisture measurements have been widely used as target variables for ML model training, often directly at the point scale 18,20,23. To use in-situ data for soil moisture modeling at the grid scale, the limited spatial representativeness of in-situ data should be carefully considered. A recent study applied the extended triple collocation technique and selected only in-situ measurements that represent soil moisture dynamics well at a spatial scale similar to satellite footprints 21. In our study, by contrast, the raw point-level data are scaled to match the means and variabilities of the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 gridded soil moisture at the corresponding grid cells. This allows seamless merging of measurements across different stations and time periods, and the estimation of soil moisture at the target grid scale. It also allows training the ML model using in-situ data collected from a large number of stations around the globe.
Our new global soil moisture dataset, SoMo.ml, provides soil moisture at three different depths: 0-10 cm, 10-30 cm, and 30-50 cm, corresponding to Layer 1, Layer 2, and Layer 3, respectively. The data has a spatial resolution of 0.25° and a daily temporal resolution, and covers the period 2000 to 2019. See Table 1 for more details.

Methods
Target soil moisture data preparation. Target soil moisture data at 0.25° and daily resolution for model training is constructed using the in-situ measurements. From the ISMN data, only 'good' observations are selected, based on the quality flag 29. The full list of ISMN networks involved in this study can be found in Table 2. CEMADEN provides only useful-quality data 30. Both datasets provide sub-daily data, and daily averages are computed for the days with at least six available sub-daily estimates. Stations or sensors with less than 2 months of data are discarded.
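The daily aggregation rule above can be illustrated with a minimal sketch. The function names are hypothetical, and interpreting "2 months" as 60 daily values is an assumption made here for illustration; the text does not specify how the record length is counted.

```python
from collections import defaultdict
from datetime import datetime

MIN_SUBDAILY = 6      # minimum sub-daily estimates per day (from the text)
MIN_RECORD_DAYS = 60  # assumption: "2 months" read as 60 daily values


def daily_means(records):
    """Average (timestamp, value) pairs by calendar day.

    Days with fewer than MIN_SUBDAILY valid values are dropped.
    """
    by_day = defaultdict(list)
    for stamp, value in records:
        if value is not None:
            by_day[stamp.date()].append(value)
    return {
        day: sum(vals) / len(vals)
        for day, vals in sorted(by_day.items())
        if len(vals) >= MIN_SUBDAILY
    }


def keep_station(daily):
    """Keep a station or sensor only if it has enough daily values."""
    return len(daily) >= MIN_RECORD_DAYS


# Day 1 has six readings and is kept; day 2 has five and is dropped.
obs = [(datetime(2010, 1, 1, h), 0.20 + 0.01 * h) for h in range(6)]
obs += [(datetime(2010, 1, 2, h), 0.30) for h in range(5)]
out = daily_means(obs)
```

Quality-flag filtering would happen before this step, so that only 'good' observations enter `records`.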
www.nature.com/scientificdata
Fig. 1 Schematic of data-driven approach to generate global-scale gridded soil moisture from in-situ measurements. The LSTM model is trained with meteorological data over days t-364 to t and static features to simulate target soil moisture at day t. As in-situ measurements are point level data, they are adjusted using long-term mean and standard deviation of ERA5 gridded soil moisture to represent soil moisture at a 0.25 degree resolution. The model maps input-output relationships at a single grid pixel, but is trained using a combination of training data from grid pixels where in-situ soil moisture measurements are available.
In-situ measurements across the different sites are collected with various sensor types, which have different calibrations. Therefore, the means and variances of the obtained time series are not necessarily comparable, which could introduce artifacts during the LSTM training. For this reason, we adjust the mean and standard deviation of the daily in-situ time series to those of the respective ERA5 grid-cell soil moisture within the overlapping period. As ERA5 soil moisture is available at 0-7 cm, 7-28 cm, and 28-100 cm depths, it is vertically interpolated into the target layer depths with a depth-weighted averaging. If more than one in-situ measurement time series is available at the same depth within the same grid cell (0.25°), their average is taken (Fig. 1). As a result, the adjusted in-situ target data resembles ERA5 soil moisture in terms of mean and standard deviation, while its daily temporal variations follow the ground observations. Our approach is also based on the fact that temporal variations from point-level data have a greater areal representation compared to absolute soil moisture values 31,32. We can therefore assume that point-level data contains sufficient information to infer soil moisture dynamics at the grid scale.
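The two adjustment steps described above (depth-weighted interpolation of ERA5 and rescaling of the in-situ series) can be sketched as follows. This is an illustrative reading, not the authors' code: the weights are assumed to be the fraction of each target layer overlapped by each ERA5 layer, and the function names are hypothetical.

```python
from statistics import mean, pstdev

# ERA5 layer bounds in cm (from the text) and the SoMo.ml target layers.
ERA5_LAYERS = [(0, 7), (7, 28), (28, 100)]
TARGET_LAYERS = [(0, 10), (10, 30), (30, 50)]


def depth_weights(target, sources=ERA5_LAYERS):
    """Fraction of the target layer thickness covered by each source layer."""
    top, bottom = target
    overlaps = [max(0, min(bottom, b) - max(top, a)) for a, b in sources]
    return [o / (bottom - top) for o in overlaps]


def to_target_layer(era5_values, target):
    """Depth-weighted average of the three ERA5 layer values."""
    return sum(w * v for w, v in zip(depth_weights(target), era5_values))


def rescale(insitu, reference):
    """Give `insitu` the mean and standard deviation of `reference`.

    Both series are assumed to cover the same overlapping period, and
    `insitu` must not be constant (its standard deviation is a divisor).
    """
    m_obs, s_obs = mean(insitu), pstdev(insitu)
    m_ref, s_ref = mean(reference), pstdev(reference)
    return [(x - m_obs) / s_obs * s_ref + m_ref for x in insitu]
```

Under this assumption, the 0-10 cm layer takes 70% of its value from the 0-7 cm ERA5 layer and 30% from the 7-28 cm layer. Because `rescale` is a linear transformation, the day-to-day temporal pattern of the in-situ series, and hence its correlation with any other series, is unchanged.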
For each soil layer, we preferentially select the adjusted in-situ measurement taken at the mid-depth of the layer; i.e. 5 cm, 20 cm, and 40 cm, respectively. If no data is available at the mid-depth, the measurement taken closest to the mid-depth, and within the layer, is chosen, leading to a total of 1114, 1064, and 683 grid pixels for the three layers, respectively. The location of the grid cells with available target soil moisture is shown in Fig. 2a. Selected depths and data lengths of target soil moisture data employed for each layer are depicted in Fig. 2b. A considerable fraction of the target data is obtained from North America across diverse hydro-climatic regions (see Fig. 3). While training data from South America represents warm and semiarid regions, those from Asia mostly cover relatively cold regions.
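The depth selection rule above can be written as a short sketch. The function name is hypothetical, and treating both layer bounds as inclusive is an assumption, since the text does not say how a sensor exactly at a layer boundary is assigned.

```python
LAYERS = {1: (0, 10), 2: (10, 30), 3: (30, 50)}  # cm, from the text


def pick_depth(available_depths, layer):
    """Return the sensor depth (cm) closest to the layer mid-depth.

    Only depths inside the layer are considered (bounds inclusive, an
    assumption); returns None if no sensor lies within the layer.
    """
    top, bottom = LAYERS[layer]
    mid = (top + bottom) / 2
    inside = [d for d in available_depths if top <= d <= bottom]
    if not inside:
        return None
    return min(inside, key=lambda d: abs(d - mid))
```

A sensor at the mid-depth (5 cm, 20 cm, or 40 cm) has distance zero and is therefore always preferred when present.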
Model training. LSTM is a special kind of recurrent neural network that is capable of learning long-term dependencies across time steps in sequential data 27. It has been widely used in land surface modelling, such as runoff or soil moisture simulations 23,24,33,34. This study uses an adapted version of the LSTM architecture, the Entity-Aware LSTM 33, which can ingest time-varying forcing and static inputs separately, thereby allowing the algorithm to explicitly differentiate the two types of information.
We model soil moisture using the Entity-Aware LSTM architecture (hereafter referred to as 'LSTM model'); the model consists of 1) 128 hidden units, 2) one LSTM layer with one dense layer, and 3) a dropout rate of 0.5. These model hyperparameters are selected through a grid search (searching for the optimal hyperparameters over a pre-defined hyperparameter space) with 5-fold cross validation. The entire dataset is split into five folds, each containing approximately 20% of the data. While the dataset is randomly split into the folds, neighbouring grid pixels are grouped into the same fold to account for spatial auto-correlation. The model is trained using data from four folds and validated with the remaining fold. This operation is repeated five times so that each fold is used once as an independent validation set, and finally the performance is averaged across the repetitions to obtain a representative estimate.
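One way to keep neighbouring grid pixels in the same fold is to assign whole spatial blocks, rather than single pixels, to folds. The sketch below illustrates that idea only; the block size of 2° and the greedy balancing are assumptions, as the text does not describe how neighbours are grouped.

```python
import random
from collections import defaultdict


def spatial_folds(pixels, n_folds=5, block_deg=2.0, seed=0):
    """Assign (lat, lon) grid pixels to folds, keeping neighbours together.

    Pixels are grouped into block_deg x block_deg blocks (hypothetical
    grouping), and whole blocks are dealt out to the folds at random.
    """
    blocks = defaultdict(list)
    for lat, lon in pixels:
        key = (int(lat // block_deg), int(lon // block_deg))
        blocks[key].append((lat, lon))

    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)

    folds = [[] for _ in range(n_folds)]
    for key in keys:
        # Give the next block to the currently smallest fold.
        smallest = min(folds, key=len)
        smallest.extend(blocks[key])
    return folds
```

In a cross-validation loop, each fold in turn would serve as the validation set while the model is trained on the other four, and the skill scores would be averaged over the five repetitions.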
The LSTM model is trained to learn the relationship between the multiple predictor variables and the target soil moisture. The model is trained separately for each soil layer. The predictor data used for the LSTM-based soil moisture modelling is listed in Table 3. The meteorological inputs during days t-364 to t are used to simulate soil moisture at day t; i.e. the model can establish the relationship of present soil moisture with present and past meteorological forcing over a full annual cycle. All input data are normalised using their mean and standard deviation to enhance the training efficiency 35. We use the mean squared error divided by the standard deviation of soil moisture at each individual grid cell as a loss function. This scaling ensures comparable values of the loss function across wet and dry regions with potentially different temporal variabilities 33.
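The input window and the scaled loss described above can be sketched in plain Python. This follows the wording of the text literally (squared error divided by the grid-cell standard deviation); the function names and the small constant `eps` are assumptions added for illustration and numerical safety.

```python
def input_window(forcing, t, length=365):
    """Forcing for days t-364 .. t (inclusive); None if history is too short."""
    start = t - (length - 1)
    if start < 0:
        return None
    return forcing[start : t + 1]


def scaled_mse(pred, obs, cell_std, eps=1e-6):
    """Mean squared error with each sample divided by its grid-cell std.

    pred, obs : simulated and target soil moisture, one value per sample
    cell_std  : standard deviation of target soil moisture at the grid cell
                each sample comes from
    eps       : small constant guarding against a zero std (assumption)
    """
    terms = [(p - o) ** 2 / (s + eps) for p, o, s in zip(pred, obs, cell_std)]
    return sum(terms) / len(terms)
```

With this scaling, an error of a given size counts more at a grid cell with low temporal variability than at one with high variability, which keeps highly variable cells from dominating the training signal.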
Meteorological forcing variables are prepared from ERA5, the new global atmospheric reanalysis produced by ECMWF 36. There are several reasons why ERA5 is chosen. First, ERA5 uses large amounts and diverse kinds of observations, such as synoptic station data, satellite radiance, and ground-based radar precipitation information, via 4D-Var data assimilation. Its enhanced quality as meteorological forcing, compared to its predecessor ERA-Interim, has been demonstrated through an experiment with land surface models 37. Second, ERA5 allows the generation of long-term global-scale soil moisture data. The direct use of observations such as satellite data introduces the problem of gaps in space and time, and different or limited time periods covered by the respective variables. In this sense, the current version of SoMo.ml can also serve as a baseline to evaluate the performance of updated data versions in the future, e.g., by comparing with data generated from machine learning trained with purely observational data for selected variables. Finally, ERA5 is available with only a few months latency, allowing corresponding future updates of the SoMo.ml dataset.
File format: NetCDF
Key strengths: 1) Global scale, long-term data. 2) Distinct data derivation compared to existing gridded soil moisture products. 3) Better agreement with in-situ measurements in terms of temporal soil moisture dynamics.
Limitations: 1) Performance depends on in-situ data availability, which is low in tropical regions including Africa. 2) Uncertainty and errors in measurements may affect the model performance. 3) ERA5-based scaling is necessary, making long-term means and variabilities of SoMo.ml similar to ERA5 data.

For the deeper layers, soil moisture simulated from the upper layer(s) is additionally used as input data. Although the model performance of different combinations of input variables could be exhaustively compared to find 'best' predictors, we select meteorological forcing variables that are commonly used in physically-based modeling; the usefulness of such variables in land surface hydrologic modeling has been proven over many decades 38,39. In addition, we assess the relative importance of predictors for the soil moisture simulations and find that land surface temperature has the greatest effect on the model performance for the top layer, while soil moisture in the upper layer(s) is the most important variable for the deeper layers. Further details are given in the following section.
For the static data, long-term mean precipitation and aridity over the period of 2000-2019 are computed using the ERA5 data 36. Aridity is defined as the ratio of net radiation (converted into mm) to precipitation 40. We characterise topography through the mean and standard deviation of sub-grid scale elevation, as obtained from the ETOPO1 digital elevation model 41. In addition, we use soil type and land cover information from the Global Land Data Assimilation System (GLDAS) data archive 42. GLDAS resampled soil porosity and fractions of sand, silt, and clay from FAO datasets 43 into 0.25° spatial resolution. The land cover is based on MODIS-derived 20-category vegetation data that uses a modified International Geosphere-Biosphere Programme classification scheme 44. We use GLDAS Dominant Vegetation Type Data Version 2, which assigns the predominant vegetation type to each 0.25° grid cell.
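The aridity definition above requires net radiation to be expressed as a water depth. A common way to do this, assumed here because the text does not give the conversion, is to divide the daily energy by the latent heat of vaporisation (taken as a constant 2.45 MJ/kg), which yields kg/m², i.e. mm of water. Function names are hypothetical.

```python
LATENT_HEAT = 2.45e6     # J/kg, latent heat of vaporisation (assumed constant)
SECONDS_PER_DAY = 86400


def radiation_to_mm_per_day(net_radiation_wm2):
    """Convert net radiation (W/m^2) to an equivalent water depth (mm/day).

    Energy per day divided by latent heat gives kg/m^2, i.e. mm of water.
    """
    return net_radiation_wm2 * SECONDS_PER_DAY / LATENT_HEAT


def aridity(mean_net_radiation_wm2, mean_precip_mm_per_day):
    """Ratio of net radiation (in mm/day) to precipitation (mm/day)."""
    return radiation_to_mm_per_day(mean_net_radiation_wm2) / mean_precip_mm_per_day
```

Under this conversion, a long-term mean net radiation of 100 W/m² corresponds to roughly 3.5 mm/day, so a grid cell receiving less precipitation than that would have an aridity above 1.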
Importance of predictors. The relative importance of predictor variables for the soil moisture simulation is quantified using a permutation approach. The importance is defined as the decrease in model accuracy when the time series of a particular variable is randomly permuted to remove the information contained in its temporal dynamics 45,46. In the case of the static features, we permute all variables at the same time; each variable is randomly shuffled in space. As shown in Fig. 4, for the top layer, land surface temperature is the most significant explanatory variable among the considered meteorological forcings, followed by precipitation and 2m-temperature, in terms of both normalised root-mean-square error (NRMSE) and correlation coefficient. Land surface temperature and its diurnal amplitude have been recognised previously as a proxy for soil wetness 47-49, which is consistent with the LSTM results. The static data is relevant for the soil moisture performance only in terms of NRMSE. This is in line with previous findings showing that e.g. soil and vegetation types influence the spatial variability of soil moisture, but not so much the temporal dynamics 31. While a wide range of predictor variables, including static variables, makes a significant contribution to the model performance for the first layer, (simulated) soil moisture in the upper layer(s) has the greatest effect on the model performance for the deeper layers.
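The permutation approach described above can be illustrated with a generic sketch. It is a simplification: plain RMSE stands in for the NRMSE and correlation used in the text, the model is any callable, and all names are hypothetical.

```python
import random


def rmse(pred, obs):
    """Root-mean-square error between two equally long sequences."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5


def permutation_importance(model, inputs, obs, name, seed=0):
    """Increase in RMSE when the time series of one predictor is shuffled.

    model  : callable mapping a dict of predictor series to predictions
    inputs : dict {predictor name: list of values over time}
    name   : predictor whose temporal order is destroyed
    """
    baseline = rmse(model(inputs), obs)
    shuffled = dict(inputs)             # shallow copy, original untouched
    values = list(inputs[name])
    random.Random(seed).shuffle(values)
    shuffled[name] = values
    return rmse(model(shuffled), obs) - baseline
```

A predictor that the model ignores yields an importance of zero, while shuffling an informative predictor increases the error. For static features, the same idea would apply with the shuffle performed across grid cells rather than across time.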
Global data generation. The LSTM model is trained using the entire training dataset, which consists of the available target soil moisture data and corresponding predictor data. After establishing the internal relationships ('learning'), the model is applied using the predictor data over a quasi-global area of 90° N-60° S at 0.25° spatial resolution. In order to account for the random initialisation of LSTM's trainable parameters, five simulations are performed and final soil moisture values are computed as an average of the five simulations.
Table 3. Predictor data used for the LSTM model.

Data Records
The SoMo.ml dataset can be accessed at figshare 50. Three compressed files (.zip) contain data in NetCDF format for the three respective layers. An example file name is 'SoMo.ml_v1_<LAYER>_<YYYY>.nc', with LAYER and YYYY standing for soil moisture layer depth and year, respectively.
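As a small illustration of the naming pattern above, the following sketch builds the yearly file names for one layer. The layer label "layer1" is a hypothetical placeholder; the actual token used for <LAYER> in the archive is not stated in the text and should be checked against the downloaded files.

```python
def somo_filename(layer, year):
    """Build a file name following 'SoMo.ml_v1_<LAYER>_<YYYY>.nc'.

    `layer` is whatever label the archive uses for the layer depth; the
    value "layer1" used below is a hypothetical placeholder.
    """
    return f"SoMo.ml_v1_{layer}_{year:04d}.nc"


# One file name per year of the 2000-2019 record.
names = [somo_filename("layer1", y) for y in range(2000, 2020)]
```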

Technical Validation
Model validation. The validity of the LSTM model in soil moisture modeling is tested through 5-fold cross-validation. The simulated soil moisture for the validation is hereafter referred to as SoMo.ml*, as this simulation data differs somewhat from the actual SoMo.ml because it is not based on training with all available target data, but only with 80% of the data according to the 5-fold cross validation approach. Figure 5a shows that the mean of SoMo.ml* at each pixel generally agrees well with that of the target data (Pearson's r ranges between 0.92 and 0.98), indicating that the model captures spatial variations of soil moisture. The model shows somewhat better performance towards deeper layers. In Fig. 5b, frequency distributions of the entire time series of SoMo.ml* and target soil moisture are compared. Again, reasonable agreement is observed, although the simulated soil moisture exhibits smaller variability with larger minimum and smaller maximum values, as can also be seen from the slightly higher peaks of SoMo.ml*. The entire soil moisture time series are further compared for particular (sub-)continents in Fig. 5c. In terms of both distributions and medians, the model shows a satisfactory performance overall. However, relatively less agreement is observed in Africa, Australia, and South America. This is probably because the model has difficulties learning the soil moisture dynamics there, as most grid cells from these regions are characterised by extreme hydro-climatic conditions (e.g. very warm or arid, see Fig. 3) for which only a few in-situ observations are available. The (hydro-climatic) diversity of training data can significantly affect the performance of data-driven modelling; when given more diverse training data, models can acquire more complete knowledge of input-output relationships and therefore perform better across various regimes 34.
Overall, the LSTM model successfully learns soil moisture dynamics from the training data and can reproduce them at unseen locations.
Comparison with independent in-situ measurements. Cross-validation (5-fold) is made through a direct grid-to-point comparison between SoMo.ml* and the in-situ measurements, as done in many previous studies 51-55. This validation also enables a comparative assessment of modelled soil moisture from the LSTM with that of state-of-the-art global gridded datasets such as ERA5, GLEAM 52, and the satellite-based ESA-CCI 15 datasets. Established skill scores such as NRMSE, relative bias, and correlation coefficient are used to quantify the agreement with the ground truth data. Figure 6 shows the distribution of the NRMSE of SoMo.ml* across climate regimes (left) and a comparison of these results with the respective performances of the reference datasets (right). NRMSE is defined as the RMSE divided by the mean of the ground truth. Although SoMo.ml* shows slightly higher biases at some stations over warm and arid regions, there is no clear overall climate dependency of the NRMSE. In Layer 1, while the median NRMSE of SoMo.ml* is similar to that of ESA-CCI, which shows the lowest NRMSE, a wider spread of errors is observed. ERA5 and GLEAM tend to overestimate in-situ measurements (see Fig. S1 in Supplementary Information for relative biases), leading to slightly higher NRMSE values. In the deeper layers, where ESA-CCI is not available, NRMSE values of SoMo.ml* are slightly lower but overall similar to those of the ERA5 and GLEAM references. Overall, this comparison highlights similar deviations of absolute soil moisture values from in-situ measurements across the considered datasets.
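The skill scores named above can be written compactly as follows. NRMSE follows the definition given in the text (RMSE divided by the mean of the ground truth); the form of the relative bias is an assumption, since the text does not define it, and the function names are hypothetical.

```python
def nrmse(sim, obs):
    """RMSE divided by the mean of the observations (as defined in the text)."""
    n = len(obs)
    rmse = (sum((s - o) ** 2 for s, o in zip(sim, obs)) / n) ** 0.5
    return rmse / (sum(obs) / n)


def relative_bias(sim, obs):
    """Mean difference relative to the mean of the observations (assumed form)."""
    return (sum(sim) - sum(obs)) / sum(obs)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```

A simulation with a constant offset from the observations has a non-zero NRMSE and relative bias but a correlation of 1, which is why the bias-type and correlation-type scores are reported separately.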
Figure 7 shows results from a similar comparison, but focusing on the time-variability of the soil moisture datasets as expressed by the correlation of soil moisture anomalies with in-situ measurements. To exclude the impact of the seasonal cycle, we consider short-term anomalies 56,57. For the soil moisture at each day d, a period P is defined as P = [d-17, d+17] (corresponding to a 5-week window). If at least 10 data points are available within the period, the average soil moisture over P and the corresponding anomaly are computed. This calculation is applied to each station and to the grid pixel in which it lies. No pronounced climate dependency of the correlations is observed for SoMo.ml* (Fig. 7, left). Compared with the reference datasets, SoMo.ml* performs better for the top layer. While overall anomaly correlations decrease in the deeper layers, SoMo.ml* also shows closer agreement with the observations than the reference datasets for these layers. The results underline the particular strength of SoMo.ml*, and likely also the actual SoMo.ml, in representing the temporal variability of soil moisture. This is somewhat expected; while this comparison is done against independent in-situ measurements, the temporal dynamics of SoMo.ml* are directly learned from the (remaining) in-situ measurements. Similar results are obtained when using the correlations of long-term absolute soil moisture, and of anomalies derived by removing the mean daily averages (Figs. S2 and S3, respectively). We also compute the triple collocation error 58-60, which is widely used to estimate the random error variance of soil moisture data in the absence of reliable ground reference data; it confirms the results from Figs. 6 and 7 and underlines the usefulness of SoMo.ml (Fig. S4).
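The short-term anomaly computation described above can be sketched as follows, assuming the anomaly is the value at day d minus the mean over the 35-day window centred on d. The function name and the handling of the series ends (a truncated window) are assumptions.

```python
def short_term_anomalies(sm, half_window=17, min_count=10):
    """Anomaly of each day relative to the mean over [d-17, d+17].

    sm : list of daily soil moisture values, None where data is missing
    Returns a list of anomalies (None where they cannot be computed).
    """
    out = []
    for d, value in enumerate(sm):
        lo = max(0, d - half_window)  # window is truncated at the series start
        window = [v for v in sm[lo : d + half_window + 1] if v is not None]
        if value is None or len(window) < min_count:
            out.append(None)
        else:
            out.append(value - sum(window) / len(window))
    return out
```

The anomaly correlation for a station would then be the correlation between the anomalies of the gridded dataset and those of the in-situ series, taken over days where both are available.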
Note that ESA-CCI has missing values in space and time and GLEAM is available only until 2018, such that partly different spatiotemporal data are used among datasets in the comparison. We repeat the analysis above using only data where all datasets are available and find very similar results (not shown). In summary, compared with state-of-the-art references, SoMo.ml* shows a comparable performance in terms of biases, while outperforming the other datasets in terms of temporal correlations, which highlights the benefits of using in-situ observations more directly in the derivation of soil moisture datasets.
Global-scale comparison with existing gridded datasets. Next, we examine the spatial patterns of SoMo.ml at the global scale. Figure 8a presents the median soil moisture values over the entire period. Low values in arid regions such as southwest North America, North Africa, central Asia, and Australia, and high values in more humid regions such as the northern latitudes and Southeast Asia, are well captured. Figure 8b compares the latitudinal profile of SoMo.ml against those of the reference datasets. Overall, we find a satisfactory consistency between global patterns of SoMo.ml and the reference datasets. For instance, the highest average soil moisture occurs near the equator in the tropics, while the driest soils are found near 20° N. These patterns are overall well reproduced in SoMo.ml. This is expected to some extent because we rescale the target soil moisture using ERA5 means and standard deviations, such that the LSTM algorithm will pick up these ERA5 characteristics in locations and at time steps with available in-situ measurements. Nonetheless, SoMo.ml between 15° N and 25° N tends to be wetter than the reference datasets (over the eastern part of the Sahara desert), especially in the deeper layers. More generally, SoMo.ml might not properly describe soil moisture in very arid regions, which can be related to a lack of training data from such regions (see Fig. 3). Different patterns found in ESA-CCI along the equator are mostly due to missing data. At latitudes above 60° N, we observe relatively large differences across datasets, probably due to different freezing and thawing patterns. Meanwhile, in-situ measurements (not adjusted) do not show a meaningful pattern of latitudinal averages but large variability across stations and sensors; it is not clear to what extent this is due to different sensor types and calibrations or to actual moisture differences caused by heterogeneous land surface characteristics.
Additional comparisons among the global soil moisture datasets can be found in Figs. S5-S7 in the Supplementary Information.

Usage Notes
We present a global, multi-layer, long-term soil moisture dataset generated through a data-driven approach, and with comprehensive ground truth data. For model training, we preprocess the in-situ measurements to obtain more spatiotemporally consistent, grid-scale target soil moisture data by adopting mean and standard deviation from ERA5 data while preserving the observed temporal variations from the in-situ measurements. Any gridded soil moisture can possibly be used as a scaling reference, but the selection of reference will not affect the main characteristic of SoMo.ml, i.e. resembling temporal patterns of the in-situ measurements. Our newly generated soil moisture data outperforms other existing gridded datasets, including ERA5, in terms of daily temporal dynamics, as indicated by the highest temporal (anomaly) correlation with the ground observations. Nonetheless, the data quality in conditions outside the spatiotemporal range sampled within the observations is potentially uncertain. LSTM performance can be significantly affected by the (lack of) hydro-climatic diversity in the training data, even more than by the quantity of data 34. As shown in Fig. 3, while the in-situ soil moisture measurements are obtained from networks worldwide, the data does not cover all globally occurring hydro-climatic conditions. Therefore, relatively high uncertainty is expected outside the training conditions, such as at high latitudes and in arid regions. However, this lack of observations in particular conditions also presents a challenge to other datasets/models 57,61. Therefore, for instance, using SoMo.ml within an ensemble of differently derived datasets could be a promising solution to obtain more reliable soil moisture information in these data-sparse regions 62,63.
As a result, our new soil moisture dataset is a valuable addition to the existing suite of soil moisture datasets, and can enhance future large-scale hydrologic and ecologic analyses, and also benchmark studies to evaluate land surface models and remote sensing data.

Code Availability
The LSTM model implemented in this study and figure scripts are available from https://github.com/osungmin/SciData2021_SoMo_v1. Note that the LSTM model is built by adopting Python modules obtained from https://github.com/kratzert/ealstm_regional_modeling.