A long term global daily soil moisture dataset derived from AMSR-E and AMSR2 (2002–2019)

Long-term surface soil moisture (SSM) data with stable and consistent quality are critical for global environment and climate change monitoring. L-band radiometers onboard the recently launched Soil Moisture Active Passive (SMAP) mission provide SSM with state-of-the-art accuracy, while the Advanced Microwave Scanning Radiometer for EOS (AMSR-E) and AMSR2 series provide long-term observational records from multi-frequency radiometers (C, X, and K bands). This study transfers the merits of SMAP to AMSR-E/2 and develops a global daily SSM dataset (named NNsm) with stable and consistent quality at a 36 km resolution (2002–2019). NNsm reproduces the SMAP SSM accurately, with a global Root Mean Square Error (RMSE) of 0.029 m3/m3. NNsm also compares well with in situ SSM observations and outperforms the AMSR-E/2 standard SSM products from JAXA and LPRM. This global observation-driven dataset currently spans nearly two decades and is extendable through the ongoing AMSR2 and upcoming AMSR3 missions for long-term studies of climate extremes, trends, and decadal variability.


Methods
The satellite brightness temperature (TB) data we use are: (1) AMSR-E TB in slow rotation mode (L1S), (2) AMSR-E and AMSR2 Level 3 standard TB. The AMSR-E/2 Level 3 daily TB at 0.25 degree resolution were obtained from the online dissemination service Globe Portal System (G-Portal, https://gportal.jaxa.jp/gpr/) of JAXA. The SMAP Level 3 passive daily SSM product (SMAPL3sm) was obtained from NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC, https://nsidc.org/data/SPL3SMP/versions/6), with a 36 km resolution at cylindrical Equal Area Scalable Earth Grid, Version 2.0 (EASE-Grid 2.0), with size of 406 rows × 964 columns. Table 1 summarizes the data used, their spatial resolutions, and provenance.
Our research approach adopts the soil moisture retrieval algorithm developed by Yao et al. 21 using artificial neural networks (ANN). The approach is divided into three components: calibration, training, and simulation/validation, as shown in Fig. 1. An ANN is a nonlinear mathematical computing system capable of representing arbitrarily complex nonlinear processes that relate the inputs and outputs of any system 22. The structure of an ANN model includes an input layer, a hidden layer, and an output layer. We use MATLAB to implement the ANN training and simulation.
AMSR-E/AMSR2 data inter-calibration. Firstly, to ensure consistency of data from two sensors onboard different satellites, the TB of AMSR-E are calibrated to the TB of AMSR2 for each grid cell.
The common approaches for inter-calibration 23 are inappropriate for AMSR-E and AMSR2 data because of the 9-month temporal gap between them. Previous studies on inter-calibration of AMSR-E and AMSR2 are based on the double difference (DD) method, taking a third sensor as an intermediate reference, such as the microwave radiometer imager (MWRI) onboard the FengYun-3B (FY3B) satellite [24][25][26]. AMSR-E operations were halted due to rotational issues in 2011 and then restarted at a slow rotation rate from 2012 to 2015 (L1S TB), which is useful for users who cross-calibrate AMSR-E with other radiometers. Hu et al. 27 conducted direct global calibration for AMSR-E and AMSR2 using AMSR-E L1S TB data.
The AMSR-E L1S data have 3 years of overlapping observations with AMSR2, based on which we developed a linear regression model for each grid cell to calibrate the AMSR-E TB to AMSR2 TB, for each frequency and each polarization:

$$TB_{\mathrm{amsr2}} = a \times TB_{\mathrm{amsre}} + b \qquad (1)$$

where a is the regression slope and b is the intercept.

www.nature.com/scientificdata

Pre-process of AMSR-E L1S TB. The global distribution of the slow-rotation L1S terrestrial data is shown for one day in Fig. 2(a). While the total observed area is sparse for any day, locally there are many observations with overlapping footprints. Figure 2(b) displays the centers of those overlapping footprints in one 0.25 degree grid cell at Little Washita. First, we resampled those observed points to 0.25 degree.
Inter-calibration over each grid cell. We carry out the inter-calibration separately for each grid cell, taking the grid cell of the Little Washita Experimental Watershed (LW) in southwestern Oklahoma, USA, as an example. In Fig. 3(a), the blue circles are TB of L1S and the red dots are TB of AMSR2. Figure 3(b) shows the calibration equation for H polarization at C band for this grid cell. We obtain the linear relationship using the matched-pair data (N = 46), with correction slope a = 0.9713 and regression R = 0.9820.
The number of matched pairs at C band for H polarization is shown globally in Fig. 3(c). The number of matched pairs per grid cell is greater than 40 at mid-latitudes, which ensures sufficient statistical power to determine the regression relationship. The regression R for each grid cell is shown in Fig. 3(d). Here we only display the grid cells where R > 0.85; 99.92% of locations show statistically significant inter-calibration relationships (at the P = 0.05 level).
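The per-cell regression described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' MATLAB implementation; the function names (`fit_intercalibration`, `calibrate`, `passes_threshold`) are hypothetical, and the sample brightness temperatures are made up for demonstration.

```python
import math

R_MIN = 0.85  # correlation threshold quoted in the text for per-cell fits


def fit_intercalibration(tb_amsre, tb_amsr2):
    """Least-squares fit of TB_amsr2 = a * TB_amsre + b for one grid cell.

    Inputs are matched pairs for a single frequency and polarization.
    Returns (a, b, r): slope, intercept, and Pearson correlation.
    """
    n = len(tb_amsre)
    if n != len(tb_amsr2) or n < 2:
        raise ValueError("need at least two matched pairs of equal length")
    mean_x = sum(tb_amsre) / n
    mean_y = sum(tb_amsr2) / n
    sxx = sum((x - mean_x) ** 2 for x in tb_amsre)
    syy = sum((y - mean_y) ** 2 for y in tb_amsr2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tb_amsre, tb_amsr2))
    if sxx == 0 or syy == 0:
        raise ValueError("constant input; regression is undefined")
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


def calibrate(tb_amsre, a, b):
    """Map AMSR-E brightness temperatures onto the AMSR2 scale."""
    return [a * x + b for x in tb_amsre]


def passes_threshold(r, r_min=R_MIN):
    """True when the per-cell fit is strong enough to be used directly."""
    return r > r_min


# Made-up matched pairs (K) for one grid cell, for illustration only.
amsre = [251.2, 255.8, 260.1, 263.4, 268.9]
amsr2 = [250.1, 254.9, 258.7, 262.5, 267.6]
slope, intercept, corr = fit_intercalibration(amsre, amsr2)
calibrated = calibrate(amsre, slope, intercept)
```

Running the fit once per grid cell, frequency, and polarization yields the maps of slope and R discussed above; cells that fail `passes_threshold` are the ones that need a different treatment.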
General inter-calibration equations. In the equatorial zone and around the land-sea interface we find some pixels with regression R less than 0.85. We deem the per-cell calibration for these grid cells to be uncertain; those areas are shown in white in Fig. 3(d). To still provide an estimated cross-sensor inter-calibration for these grid cells with fewer matched pairs, we use a general inter-calibration relation for each frequency and each polarization, estimated using all global matched pairs for the 2012-2015 period. The regression coefficients for these general relationships are shown in Table 2. Users investigating hydroclimate regime change or decadal shifts in mean state at the single-pixel level, however, should be aware of the methodological difference for these locations. A mask marking the pixels that use the global inter-calibration model is provided with the dataset file.

ANN Training. In the ANN training period from March 2015 to December 2017, the input layer comprises reflectivity (R) and the microwave vegetation index (MVI) 28 derived from AMSR2 TB; the output layer (training target) is the SMAPL3sm 21. Ten neurons are used in our network, and the training function is the default 'trainlm' function.
Data pre-processing. The descending SMAPL3sm data (at 06:00 local time) are selected in this study, corresponding to the AMSR-E/2 descending nighttime TB data (at 01:30 local time). The AMSR-E/2 Level 3 daily descending TB data (01:30) at 0.25 degree resolution are selected, and those AMSR2 TB are resampled to the 36 km EASE-Grid 2.0 using linear interpolation. The R is derived from estimated surface temperatures (Ts) 29 and TB, and the MVI is derived from AMSR-E/2 TB at C band and X band. Ts is estimated as a linear function of the vertically polarized TB at 36.5 GHz, with coefficients c_1 and c_2 following ref. 29:

$$T_s = c_1 \times TB_{36.5v} - c_2$$
where f is the frequency, p is the polarization, v is vertical polarization and h is horizontal polarization.

ANN Training for each grid cell. To determine the local relationships between AMSR2-derived R/MVI and SMAPL3sm, we trained ANNs for every EASE-Grid 2.0 grid cell. Our ANN was designed to minimize the Mean Squared Error (MSE) with respect to SMAPL3sm, using a random internal assignment of data into training/validation/testing categories (these assignments are distinct from our external data division and were applied only to our 2015-2017 training-period data), with Levenberg-Marquardt (L-M) optimization used for back propagation.
During the training period from 2015 to 2017, the target SMAPL3sm and the inputs (R and MVI) were matched grid cell by grid cell. We removed those cells with fewer than 50 matching pairs (N) of AMSR2-SMAP observations. We used the Ts derived from the AMSR2 TB as the freeze/thaw criterion and removed the data for frozen states. The ANNs were then trained grid cell by grid cell over the globe with the matching data.
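The screening of training pairs can be illustrated with a short sketch. It is a simplified illustration rather than the authors' code: the helper names are hypothetical, only the reflectivity input is shown (the MVI is omitted), and the reflectivity is computed with the common approximation R = 1 - TB/Ts, which is an assumption here and may differ from the exact expressions in Eqs. 2-4.

```python
FREEZE_K = 273.15  # freeze/thaw threshold on Ts (K), as quoted in the text
MIN_PAIRS = 50     # minimum matched AMSR2-SMAP pairs per grid cell


def reflectivity(tb, ts):
    """Assumed form R = 1 - TB / Ts (emissivity approximated by TB / Ts)."""
    return 1.0 - tb / ts


def training_pairs(samples):
    """Build (R, SMAP SSM) pairs for one grid cell.

    samples: iterable of (tb, ts, smap_ssm) tuples, one per day, with None
    marking a missing value. Days with missing data or frozen soil
    (Ts below FREEZE_K) are dropped.
    """
    pairs = []
    for tb, ts, ssm in samples:
        if tb is None or ts is None or ssm is None:
            continue  # no matched observation on this day
        if ts < FREEZE_K:
            continue  # frozen state: excluded from training
        pairs.append((reflectivity(tb, ts), ssm))
    return pairs


def is_trainable(pairs, min_pairs=MIN_PAIRS):
    """A grid cell is trained only if it has at least min_pairs matches."""
    return len(pairs) >= min_pairs


# Made-up daily samples (TB in K, Ts in K, SSM in m3/m3) for illustration.
cell_samples = [(250.0, 285.0, 0.21), (240.0, 270.0, 0.30), (None, 280.0, 0.18)]
cell_pairs = training_pairs(cell_samples)
cell_ok = is_trainable(cell_pairs)
```

Applying the same Ts test at simulation time reproduces the rule that no SSM is produced when Ts is below 273.15 K.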
NNsm simulation for each grid cell. In the simulation period from 2002 to 2019, the input R/MVI were derived from the consistent, inter-calibrated AMSR-E/2 TB, as described in data pre-processing. Soil moisture is not retrieved over frozen soil, so we did not simulate SSM for grid cells when Ts was lower than 273.15 K. We do not mask values based on surface water or vegetation water content; we note that the target (SMAP) values are less reliable where either is high, so these results could be screened with the same methods as used for the SMAP data itself. With the pre-processed input and the ANN models trained for each grid cell in the previous step, we derived the daily soil moisture from 2002 to 2019 on each grid cell.

Data Records
The data records 30 contain global daily soil moisture data with a spatial resolution of 36 km, in m3/m3, from June 2002 to December 2019. These data are stored in NetCDF format with one file per day, defined by two dimensions (lat and lon, representing latitude and longitude, respectively) and one variable, soil moisture (soil_moisture). The file name is "yyyyddd.nc", where "yyyy" stands for the year and "ddd" stands for the Julian date (day of year).

Table 3.
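As a small convenience, the "yyyyddd.nc" naming convention can be generated and parsed with the Python standard library. The sketch below assumes that "ddd" is the zero-padded day of year (001 to 366); the function names are hypothetical and not part of the dataset.

```python
from datetime import date, datetime


def nnsm_filename(day):
    """Return the daily file name for a date, e.g. year + day of year + '.nc'."""
    # %Y is the four-digit year and %j the zero-padded day of year.
    return day.strftime("%Y%j") + ".nc"


def parse_nnsm_filename(name):
    """Recover the calendar date from a 'yyyyddd.nc' file name."""
    stem = name[:-3] if name.endswith(".nc") else name
    return datetime.strptime(stem, "%Y%j").date()


example_name = nnsm_filename(date(2003, 2, 1))
example_date = parse_nnsm_filename("2019365.nc")
```

Here `example_name` is "2003032.nc" (1 February is day 32 of the year) and `example_date` is 31 December 2019.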
Data from 2018-2019 were used to validate the performance of the trained NNsm. As shown in Fig. 4(g,h), the correlation coefficients (CC) and RMSE between the trained NNsm and SMAPL3sm have a spatial distribution pattern similar to that of 2015-2017, but accuracy is slightly lower than in the training period, with a spatial mean of masked CC = 0.73 and a spatial mean of masked RMSE = 0.033 m3/m3.

Table 4.

Data of the OzNet, REMEDHUS and AMMA sites are provided by the International Soil Moisture Network (ISMN) website (https://ismn.geo.tuwien.ac.at/) [31][32][33][34].

Validation using in situ
These ground-based sites are major validation points for satellite soil moisture products, covering a wide variety of topography, land cover types and soil types around the world. These sites, which include dozens of instrumented stations each, are designed to provide reliable and representative soil moisture values for comparison against spatially-aggregated satellite footprints. Fig. 5 shows the location of these sites.
The performance of the NNsm over the ground sites is shown as time series in Fig. 6 and as statistical metrics in Table 5. For demonstration, we show time series for one site on every continent. The magnitude and variability of NNsm (blue dots) are consistent with those of SMAPsm (red dots) and in situ SM (Obs-sm, grey).

Validation and comparison with AMSR-E/AMSR2 standard products. Moreover, to clarify the advantages of our algorithm and soil moisture products, we validated the performance of NNsm by comparing the simulated output with the satellite standard SSM products of AMSR-E/AMSR2 from JAXA and LPRM (JAXAsm and LPRMsm, respectively) over the in situ sites. Results are shown in Fig. 7 and Table 6 (2002-2019). When calculating the statistical metrics, we use the intersection of the observational periods for all four datasets (NNsm, JAXAsm, LPRMsm, in situ), as shown in the second column of Table 6. We also performed inter-comparisons separately over the AMSR-E period (2002-2011) and the AMSR2 period (2012-2019) (see Supplementary Tables 1 and 2).
From the time series plots shown in Fig. 7 and the statistical comparison shown in Table 6, NNsm is generally consistent with in situ SSM, although it slightly under- or overestimates soil moisture at a few sites, with some scaling bias apparent (e.g., Fig. 7(e)). The JAXAsm and LPRMsm time series show biases as well, with JAXAsm typically underestimating the in situ observations and LPRMsm generally overestimating them (Fig. 7(a-e), Table 6). The LPRMsm also displays some large changes in variability and mean state (Fig. 7(a,c,e)) which seem to follow neither the in situ observations nor the main line of TB forcing that drives the JAXAsm and NNsm time series. Across the in situ validation sites, NNsm displays broadly lower biases, lower RMSEs and unbiased RMSEs (ubRMSE), and higher correlation coefficients (CC) with the in situ data than the JAXAsm and LPRMsm AMSR-E/2 soil moisture time series. This suggests that the NNsm may be providing added value from interactions between the reflectivities and microwave vegetation indices used as inputs in Eqs. 2-4 beyond what is used in the JAXA and LPRM retrieval algorithms.
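For readers who want to reproduce this kind of comparison, the four statistics can be computed as in the sketch below. It uses the standard definitions (bias as the mean difference, ubRMSE from RMSE² = bias² + ubRMSE², and Pearson correlation for CC), which we assume match those behind Table 6; the function name and the sample values are hypothetical.

```python
import math


def validation_metrics(estimates, observations):
    """Compare a satellite SSM series with in situ SSM (same dates, m3/m3).

    Returns a dict with bias, RMSE, unbiased RMSE, and Pearson correlation.
    """
    n = len(estimates)
    if n != len(observations) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_e = sum(estimates) / n
    mean_o = sum(observations) / n
    bias = mean_e - mean_o
    rmse = math.sqrt(
        sum((e - o) ** 2 for e, o in zip(estimates, observations)) / n
    )
    # ubRMSE removes the mean bias: RMSE^2 = bias^2 + ubRMSE^2.
    ubrmse = math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))
    see = sum((e - mean_e) ** 2 for e in estimates)
    soo = sum((o - mean_o) ** 2 for o in observations)
    seo = sum((e - mean_e) * (o - mean_o) for e, o in zip(estimates, observations))
    # Correlation is undefined when either series is constant.
    cc = seo / math.sqrt(see * soo) if see > 0 and soo > 0 else float("nan")
    return {"bias": bias, "rmse": rmse, "ubrmse": ubrmse, "cc": cc}


# Made-up series for illustration only.
satellite = [0.12, 0.18, 0.25, 0.31, 0.22]
in_situ = [0.10, 0.17, 0.27, 0.28, 0.20]
stats = validation_metrics(satellite, in_situ)
```

The decomposition makes the interpretation explicit: a product with a large constant offset can have a high RMSE but a low ubRMSE, while a product with little offset has RMSE close to ubRMSE.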
Comparison with SMOS and CCI soil moisture products. To quantify the utility of our algorithm and any benefits from creating an additional soil moisture product, we also compared NNsm with the SMOS and CCI soil moisture products (SMOSsm and CCIsm), with results given in Tables 7-8. We also show data product intercomparison scatter plots for each site in Figs. S1-2 of the Supplementary file. Results are all shown for the overlapping data period; for example, data from the Little Washita validation site in Fig. 8 and Table 7 span the overlapping data windows of the NNsm data, the CCIsm data, and the in situ data from 2007 to 2009.

Table footnotes: a Koeppen-Geiger climate classification 38. b International Geosphere-Biosphere Program.

to 2009: CCIsm, NNsm, and in situ data.
In general, as evident in Fig. 8(a,b) and Table 7, the NNsm has substantially lower bias than CCIsm relative to the in situ observations, leading to a lower RMSE with roughly equivalent correlations. The ubRMSE is lower for the CCIsm, suggesting that the CCIsm bias is a primary source of error in the data set. CCIsm tends to overestimate the in situ values, especially when the soil is dry (SSM < ~0.2 m3/m3); the average bias across sites is 0.088 m3/m3. Scatter plots of CCIsm vs in situ observations by site also show some data processing artifacts (discretized SSM values evident in Supplementary Fig. 1.1, 1.7, and 1.12), which may introduce additional errors.

2010 to 2019: CCIsm, NNsm, SMOSsm, and in situ data. Results for the L-band era are shown in Fig. 8(c,d,e) and Table 8.

For the sake of application utility, we also compare data volumes for the three data sets. Example data volumes for the pre-L-band period (2002 to 2009) and the L-band period (2010 to 2019) are shown in Fig. 9(a-d) and Fig. 9(e-j), respectively. For the period from 2002 to 2009, we take 2003 as an example. As shown in Fig. 9(a), NNsm provides a global product in the boreal summer. In general, NNsm provides more than 200 soil moisture retrievals over each grid cell in the mid-latitudes and more than 100 over each grid cell in the high latitudes (Fig. 9(b)). CCIsm does not have soil moisture retrievals over much of the equatorial zone and most of Russia (Fig. 9(c)). In North America, Northern Europe, southeastern China, and the Tibetan Plateau, CCIsm has very few soil moisture retrievals, fewer than 50 per grid cell (Fig. 9(d)).
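The data-volume maps count, for each grid cell, the number of days in a year with a valid retrieval. A minimal sketch of that count is given below; the nested-list grid layout, the treatment of None and NaN as missing values, and the function names are assumptions made for illustration, not a description of how the dataset encodes missing data.

```python
import math


def is_valid(value):
    """A retrieval is valid if it is neither None nor NaN."""
    return value is not None and not math.isnan(value)


def count_retrievals(daily_maps):
    """Count valid retrievals per grid cell.

    daily_maps: list of 2-D grids (lists of rows), one per day of the year,
    all with the same shape. Returns a grid of integer counts.
    """
    if not daily_maps:
        return []
    n_rows = len(daily_maps[0])
    n_cols = len(daily_maps[0][0])
    counts = [[0] * n_cols for _ in range(n_rows)]
    for grid in daily_maps:
        for i in range(n_rows):
            for j in range(n_cols):
                if is_valid(grid[i][j]):
                    counts[i][j] += 1
    return counts


# Three made-up days on a tiny 1 x 2 grid, for illustration only.
nan = float("nan")
days = [[[0.20, nan]], [[0.30, 0.10]], [[None, nan]]]
volume = count_retrievals(days)
```

With one grid per day of 2003 as input, the resulting counts correspond to the kind of map shown in Fig. 9(b).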
For the 2010-2019 period, we highlight 2010 as an example. As shown in Fig. 9(e,g,i), the three products have similar spatial patterns and dynamic ranges in boreal summer. NNsm typically has an annual data volume of more than 200 retrievals outside of regions with permanent snow and ice cover, the main exception being the Tibetan Plateau (Fig. 9(f)). SMOS has 150 soil moisture retrievals per year on average, with fewer retrievals in Asia, where it is significantly affected by radio-frequency interference (RFI) (Fig. 9(h)). CCIsm has particularly high data volumes in the subtropics, but fewer in cold regions, and it masks retrievals for dense tropical vegetation (Fig. 9(j)).

Potential benefits and usages of this dataset.

Complement to SMAP product. The NNsm dataset can be seamlessly merged with SMAP SSM at daily scale, providing greater spatial coverage and higher frequency observations, as well as serving as a complementary gap-filling product for SMAP SSM (e.g., when SMAP entered safe mode temporarily from June 20 to July 22 in 2019). Figure 10 shows global maps of SMAPL3sm, NNsm, JAXAsm, LPRMsm, and a combined NNsm/SMAPL3sm. SMAP has limited daily spatial coverage and reaches global coverage only with a 3-day average, as shown in Fig. 10(a,b).
NNsm derived from AMSR-E/2 has wider daily spatial coverage than SMAP (Fig. 10(c)). When NNsm is combined with SMAPsm for the same day, as shown in Fig. 10(d), the combined map shows more coverage (for example, in China, Eastern and Southern Europe, the United States, South America, Africa and Australia) and provides almost full global coverage, with no obvious dataset discrepancies or inconsistencies. At the global scale, JAXAsm (Fig. 10(e)) gives drier soil moisture estimates and LPRMsm (Fig. 10(g)) gives wetter ones. When each is merged with SMAPsm separately, the fused maps show obvious underestimation or overestimation, with noticeable striping in South America and East Asia (Fig. 10(f,h)).
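One simple way to build such a combined daily map is a cell-wise merge. The sketch below assumes a priority rule in which the SMAP value is used wherever it exists and NNsm fills the remaining cells; the text does not state the exact merging rule, so this rule, the nested-list grid layout, and the function names are all illustrative assumptions.

```python
import math


def has_value(value):
    """True when a cell holds a usable retrieval (not None, not NaN)."""
    return value is not None and not math.isnan(value)


def merge_daily(smap, nnsm):
    """Merge two same-day SSM grids of identical shape, cell by cell.

    SMAP takes priority (assumed rule); NNsm fills cells that SMAP did not
    observe; cells missing in both stay None.
    """
    merged = []
    for smap_row, nnsm_row in zip(smap, nnsm):
        row = []
        for s, n in zip(smap_row, nnsm_row):
            if has_value(s):
                row.append(s)
            elif has_value(n):
                row.append(n)
            else:
                row.append(None)
        merged.append(row)
    return merged


# Made-up 2 x 2 grids (m3/m3) for illustration only.
nan = float("nan")
smap_day = [[0.20, None], [nan, None]]
nnsm_day = [[0.25, 0.30], [0.10, nan]]
combined = merge_daily(smap_day, nnsm_day)
```

The same function applied to JAXAsm or LPRMsm in place of NNsm would carry their dry or wet offsets into the filled cells, which is the striping effect described for Fig. 10(f,h).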
SMAP was placed into safe mode and temporarily stopped capturing data for about one month, from June 20 to July 22, 2019. During this period, SMAP provides no product, as marked by the blue boxes in Fig. 11(a-e). NNsm remains consistent and captures the rainfall events with a one-day delay, since the observation time of NNsm is 01:30 local time. With an accuracy similar to that of SMAP, NNsm can provide complementary soil moisture for the SMAP SSM product.
Application for the study of short-term moisture dynamics. NNsm provides more frequent soil moisture observations for studying land-atmosphere interactions. SMAP has a narrow swath and provides only roughly 10 measurements (or more, depending on latitude) within one month, while NNsm derived from AMSR-E/2 has a wider swath and provides measurements almost every day, as shown in Fig. 11(f-h). Combined with the standard SMAP product, NNsm can be used to extract post-rainfall dry-down curves with higher temporal resolution and process accuracy.
Near-Real-Time product and extension to AMSR3. Having created and trained the models, it is now possible to produce near-real-time data products into the future, provided that brightness temperature data are available. Forward simulation of SM from TB data using the NNsm models is fast and efficient, and requires no ancillary data streams. TB from future instruments can also be used, following a calibration to the existing AMSR2 data, analogous to the AMSR-E/AMSR2 calibration performed in Eq. 1.
AMSR3 is scheduled to launch in 2023 as part of the GOSAT-GW mission and, as the successor to AMSR2, will provide similar C, X, and K band observations. Since our model uses only these observation bands as input, our method can be readily transferred to AMSR3, and a long-term soil moisture product can continue to be generated for stable and consistent climatological studies of the terrestrial water cycle.