Fourteen years of continuous soil moisture records from plant and biocrust-dominated microsites

Drylands cover ~41% of the terrestrial surface. In these water-limited ecosystems, soil moisture contributes to multiple hydrological processes and is a crucial determinant of the activity and performance of above- and belowground organisms and of the ecosystem processes that rely on them. Thus, an accurate characterisation of the temporal dynamics of soil moisture is critical to improve our understanding of how dryland ecosystems function and are responding to ongoing climate change. Furthermore, it may help improve climatic forecasts and drought monitoring. Here we present the MOISCRUST dataset, a long-term (2006–2020) soil moisture dataset at a sub-daily resolution from five different microsites (vascular plants and biocrusts) in a Mediterranean semiarid dryland located in Central Spain. MOISCRUST is a unique dataset for improving our understanding of how both vascular plants and biocrusts determine soil water dynamics in drylands, and thus for better assessing their hydrological impacts and responses to ongoing climate change.
Measurement(s): soil moisture. Technology Type(s): soil moisture sensors. Factor Type(s): temporal interval. Sample Characteristic - Environment: semi-arid grassland. Sample Characteristic - Location: Central Spain.
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.16951723

In drylands, vegetation is typically organised in a two-phase mosaic composed of plant-covered patches interspersed in a matrix of open areas without perennial vascular plants [22][23][24] . Vegetated and open areas have contrasting water dynamics: infiltration rates are typically higher beneath plant patches, which also lose less water via runoff and evaporation [25][26][27][28] . Open areas are, however, not devoid of life, as they are commonly covered by biocrusts, communities dominated by mosses, lichens, fungi, and cyanobacteria that live on the soil surface across drylands worldwide 29 . Both vascular plants and biocrusts are key modulators of the water cycle in drylands, as they affect processes, such as infiltration, runoff and evapotranspiration 30 , that ultimately determine soil moisture content. Despite the hydrological importance of both vascular plants and biocrusts, no dataset characterizing long-term (>10 yr) temporal variations in soil moisture across plant- and biocrust-dominated areas (microsites) is currently available.
Here we introduce the MOISCRUST dataset, a 14-yr continuous dataset of surface soil moisture measurements from multiple microsites (vegetated and open areas with different degrees of biocrust development) gathered at the Aranjuez Experimental Station, a semi-arid grassland in Central Spain where multiple studies on the ecology of biocrusts have been carried out 27,[31][32][33][34][35][36] .

Methods
Study site. The Aranjuez Experimental Station is located at the centre of the Iberian Peninsula (40° 02′ N, 3° 32′ W; 590 m a.s.l., Fig. 1). The climate is Mediterranean semiarid, with average annual temperature and rainfall of 15 °C and 349 mm, respectively. Soils are classified as Gypsiric Leptosols 37 , with pH between 7.2 and 7.7, organic carbon content between 9 and 32 mg/g soil, and total nitrogen content between 0.8 and 4 mg/g soil, depending on the microsite (open areas, vegetation, and biocrusts) considered 31 .
Workflow. The reproducible workflow is available in the Supplementary Material as an interactive RStudio notebook in the file moiscrust.Rmd. It is packaged with renv to facilitate reproducibility: the R package versions originally used to run the notebook are already installed in the "renv" folder of the repository. The workflow comprises the following steps: (i) data loading and preparation, (ii) imputation of missing data, (iii) incorporation of weather data at daily resolution, and (iv) preparation of dataset formats (see Supplementary Material for more details).
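Step (iii) of the workflow, in which daily weather data are attached to the sub-daily soil moisture records, can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R notebook; the function name, the field names ("vwc", "rain_mm") and all sample values are hypothetical.

```python
from datetime import date, datetime

def attach_daily_weather(moisture_records, daily_weather):
    """Join each sub-daily reading to the weather record of its calendar day.

    moisture_records: list of (timestamp, vwc) tuples.
    daily_weather: dict mapping a date to a dict with a "rain_mm" entry.
    Days without a weather record get rain_mm = None.
    """
    merged = []
    for timestamp, vwc in moisture_records:
        weather = daily_weather.get(timestamp.date())
        merged.append({
            "timestamp": timestamp,
            "vwc": vwc,
            "rain_mm": weather["rain_mm"] if weather else None,
        })
    return merged

# Hypothetical sample: two readings, only the second day has weather data.
moisture = [
    (datetime(2007, 3, 29, 12, 0), 0.11),
    (datetime(2007, 3, 30, 12, 0), 0.15),
]
weather = {date(2007, 3, 30): {"rain_mm": 2.4}}
merged = attach_daily_weather(moisture, weather)
```

Keeping the soil moisture series at its native resolution and repeating the daily weather values on each reading avoids discarding sub-daily information; aggregating soil moisture to daily means before joining would be an equally valid alternative.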

Data acquisition
Soil moisture was measured in the five most common microsites at the study site (Fig. 2): Stipa tussocks (Stipa), Retama shrubs (Retama), and open areas devoid of perennial vegetation with very low (<5%, BSCl), medium (25%-75%, BSCm) and high (>75%, BSCh) cover of biocrust-forming lichens. Stipa microsites were placed on the north face of Stipa tussocks, within 10 cm of their base, and are characterized by shaded conditions and a biocrust community dominated by mosses (mainly Pleurochaete squarrosa and Tortula revolvens). Retama microsites occur beneath the canopy of R. sphaerocarpa shrubs, and are characterized by moderate shade and litter accumulation. All microsites were selected in flat areas to reduce water retention from runoff, which could be a confounding factor in soil moisture measurements, and were separated by at least 2 m from one another.
We used soil moisture sensors (ECH2O EC-5, Decagon Devices Inc., Pullman, USA) to monitor soil moisture at sub-daily resolution. These sensors provide estimates of volumetric water content (VWC) with an accuracy of ± 3%. The standard calibration equations were applied in all microsites because, given their very similar soil texture, errors would be the same across microsites [38][39][40][41][42][43][44] . This approach has commonly been used in studies assessing soil moisture in drylands [38][39][40][41][42] , and performs well on the type of soils found at our study site 43,44 . Three replicated sensors per microsite (total n = 15) were installed according to a stratified random design in November 2006 (Fig. 3). The sensors were inserted vertically into the soil 45 , so that the probe registered soil moisture from 0 to 5 cm depth. We did so for two main reasons: i) we were particularly interested in recording soil moisture in the topsoil (from 0 to 5 cm depth), which is the fraction of the soil profile most affected by plants and biocrusts (e.g. 38,39,46-48), and ii) installing the sensors horizontally would have required substantial disturbance of a protected and very sensitive ecosystem (biocrusts are very sensitive to trampling and other disturbances [48][49][50]), which we wanted to avoid at all costs. Doing so would also have affected other measurements we have been conducting in this experiment, such as soil respiration 46 . The study area also had a meteorological station (Onset, Pocasset, MA, USA) that collected daily temperature, precipitation and relative air humidity (errors of ± 0.2 °C, ± 0.2 mm and ± 3.5%, respectively) from 30 March 2007 to 16 December 2020. In addition, solar radiation (W/m²) was collected daily during this period using a Silicon Pyranometer (Onset S-LIB-M003).
Filling data gaps.
MOISCRUST contains a total of 697,695 records over the study period, obtained from 15 soil moisture sensors, of which 380,583 are either missing or negative values (54.5% of the total records). These missing values have diverse causes, including damaged sensors, sensors that were removed for maintenance, exhausted batteries, and malfunctions caused by rabbits (Oryctolagus cuniculus), which gnaw the sensor wires (once we discovered this, we protected the wires with a plastic hose). In addition, the MOISCRUST database contains several negative values (anomalies arising from imbalances in the standard equation) that fall within the margin of error of the sensors. These anomalous values were set to NA. Whenever an anomalous value was observed, we checked whether the sensor continued to measure correctly by comparing it with another trustworthy sensor. Measurements that agreed with that sensor were then retained in the dataset, and anomalous measurements were discarded.
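The rule described above, in which negative readings are treated as missing, can be sketched as follows. This is an illustrative Python sketch, not the authors' R workflow; the function names and the sample readings are hypothetical. The last lines simply recompute the reported share of missing or negative records from the two counts given in the text.

```python
import math

def clean_vwc(readings):
    """Return readings as floats, with missing (None/NaN) and negative values set to NaN."""
    cleaned = []
    for value in readings:
        if value is None or math.isnan(value) or value < 0:
            cleaned.append(math.nan)
        else:
            cleaned.append(float(value))
    return cleaned

def missing_fraction(values):
    """Share of entries that are NaN."""
    return sum(1 for v in values if math.isnan(v)) / len(values)

# Hypothetical readings from one sensor: two negative values and one gap.
sample = [0.12, -0.01, None, 0.08, 0.10, -0.02]
cleaned = clean_vwc(sample)
sample_missing = missing_fraction(cleaned)

# Counts reported in the text: 380,583 of 697,695 records.
reported_pct = round(100 * 380_583 / 697_695, 1)
print(sample_missing, reported_pct)
```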
To fill the gaps in the MOISCRUST dataset, we first found, for a given entry y with missing data at time t, the sensor x with data for t that is in the same type of microsite (if possible), has the longest duration in common, and shows the highest correlation with the sensor to which y belongs. We then estimated the missing value y with a linear model y ~ x. To find the best possible candidate sensor (x) to estimate the missing data (y), we correlated all pairs of sensors and computed a selection score, S_x, for each candidate sensor x. Here, y is the sensor with a missing value to be estimated and x is the sensor used as candidate predictor. The score is based on three quantities: %vc_{x,y}, the percent of common valid cases of the sensors x and y; R²_{x,y}, the Pearson's R² of the common valid cases of the sensors x and y; and microsite_x and microsite_y, the respective microsites of the sensors x and y. During data imputation, the sensor with the highest selection score was used to estimate each missing value (see Supplementary Material for a detailed description and a worked example of this procedure).
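The selection-and-imputation procedure described above can be sketched in a few lines. The sketch is in Python rather than the authors' R, and it makes one important assumption: the score used here (share of common valid cases multiplied by R², halved when the microsites differ) is only an illustrative stand-in built from the three quantities named in the text, not the published equation, for which the Supplementary Material should be consulted. All function names, sensor names and readings are hypothetical.

```python
import math

def common_valid(x, y):
    """Pairs (x_t, y_t) for the times at which both sensors have a valid reading."""
    return [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]

def fit_line(pairs):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, pearson_r), or None if the fit is undefined.
    """
    n = len(pairs)
    if n < 3:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

def selection_score(n_common, n_total, r, same_microsite):
    # ASSUMPTION: illustrative stand-in, not the score published by the authors.
    weight = 1.0 if same_microsite else 0.5
    return (n_common / n_total) * r * r * weight

def impute(target, target_microsite, candidates):
    """Fill NaNs in target using, at each time, the best-scoring candidate with data.

    candidates: list of (microsite, series) tuples aligned with target.
    Returns (filled, quality); quality is 1.0 for observed values, Pearson's r
    of the model for imputed values, and NaN where no candidate was usable.
    """
    ranked = []
    for microsite, series in candidates:
        pairs = common_valid(series, target)
        fit = fit_line(pairs)
        if fit is None:
            continue
        slope, intercept, r = fit
        score = selection_score(len(pairs), len(target), r,
                                microsite == target_microsite)
        ranked.append((score, slope, intercept, r, series))
    ranked.sort(key=lambda item: item[0], reverse=True)

    filled, quality = list(target), []
    for t, value in enumerate(target):
        if not math.isnan(value):
            quality.append(1.0)
            continue
        quality.append(math.nan)
        for _, slope, intercept, r, series in ranked:
            if not math.isnan(series[t]):
                filled[t] = intercept + slope * series[t]
                quality[t] = r
                break
    return filled, quality

nan = math.nan
bsch_1 = [0.10, 0.12, nan, 0.16, 0.18, nan]        # sensor with gaps
bsch_2 = [0.20, 0.24, 0.28, 0.32, 0.36, nan]       # same microsite
stipa_1 = [0.05, 0.07, 0.08, 0.07, 0.11, 0.12]     # different microsite
filled, quality = impute(bsch_1, "BSCh", [("BSCh", bsch_2), ("Stipa", stipa_1)])
```

In this toy example the first gap is filled from the same-microsite sensor, which has the highest score, while the second gap falls back to the Stipa sensor because the preferred candidate has no reading at that time.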
To provide an indicator of imputation quality, the algorithm generates a new column named interpolation quality, where the observed values are marked with "1", and the imputed values contain the correlation coefficient of the model used to estimate them (see Supplementary Material for details). After this process was completed, the number of missing values in the dataset was reduced to 133,881 records (19.2% of the total records). The imputation algorithm was implemented using the R software 51 .
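A user of the dataset might rely on this quality indicator to decide which values to keep in an analysis. The following Python sketch assumes a hypothetical record layout (keys "vwc" and "quality") and an arbitrary threshold of 0.9; neither is prescribed by the dataset. The final lines recompute the reported share of records still missing after imputation.

```python
def usable_records(records, min_quality=0.9):
    """Keep observed values (quality == 1) and sufficiently well-imputed values."""
    kept = []
    for row in records:
        q = row["quality"]
        if q is None:  # value still missing after imputation
            continue
        if q == 1 or q >= min_quality:
            kept.append(row)
    return kept

# Hypothetical rows: observed, well imputed, poorly imputed, still missing.
records = [
    {"vwc": 0.12, "quality": 1},
    {"vwc": 0.10, "quality": 0.95},
    {"vwc": 0.09, "quality": 0.60},
    {"vwc": None, "quality": None},
]
kept = usable_records(records)

# Counts reported in the text: 133,881 of 697,695 records remain missing.
remaining_pct = round(100 * 133_881 / 697_695, 1)
print(len(kept), remaining_pct)
```

Because observed values are flagged with 1, an imputed value whose model correlation is exactly 1 cannot be told apart from an observation using this column alone.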

Data Records
Raw and imputed data (in the "data" and "database" folders, respectively) are freely available from Figshare 65 . The data files are accompanied by a metadata file with a brief description of the dataset. This dataset will be updated annually in Figshare to include new data. In addition, the repository contains the "renv" folder to facilitate reproducibility (see  Fig. 4), which suggests that the sensors used properly measure soil moisture contents and their temporal variation at the study area.

Usage Notes
Previous, short-term versions of the MOISCRUST dataset have been used to model annual variations in soil respiration rates across vegetation- and biocrust-dominated microsites, and to assess how vegetation, biocrusts and abiotic factors modulate wetting and drying events 7 . This dataset is particularly well suited for long-term studies focused on understanding spatio-temporal patterns of soil moisture in drylands 67 , and for analysing the effects of soil moisture-vegetation relationships (e.g. links between plant functional types and soil moisture 68 ) and feedbacks on the dynamics of dryland ecosystems 69 . It can also be used to evaluate how both vascular plants and biocrusts determine soil water dynamics in drylands, to parameterize or tune hydrological models that aim to study the hydrological behaviour of these ecosystems, and to forecast their hydrological responses to ongoing climate change. Overall, the data provided by MOISCRUST contribute to advancing our understanding of hydrologic processes in drylands and as such will be of interest to both researchers and managers working in these important ecosystems.
When using data from the MOISCRUST dataset please cite this publication. Both data and code are available under a Creative Commons Attribution 4.0 International Public License, whereby anyone may freely use and adapt our dataset, as long as the original source is credited, the original license is linked, and any changes to our data are indicated in subsequent use.

Code availability
The code used for data imputation and dataset formatting is available in Figshare 65 .