A global database of soil nematode abundance and functional group composition

As the most abundant animals on earth, nematodes are a dominant component of the soil community. They play critical roles in regulating biogeochemical cycles and vegetation dynamics within and across landscapes and are an indicator of soil biological activity. Here, we present a comprehensive global dataset of soil nematode abundance and functional group composition. This dataset includes 6,825 georeferenced soil samples from all continents and biomes. For geospatial mapping purposes these samples are aggregated into 1,933 unique 1-km pixels, each of which is linked to 73 global environmental covariate data layers. Altogether, this dataset can help to gain insight into the spatial distribution patterns of soil nematode abundance and community composition, and the environmental drivers shaping these patterns.


Background & Summary
To generate a global and quantitative understanding of the biogeography of soil organisms, which are critical players in global biogeochemistry, large and comprehensive datasets are needed. Due to methodological challenges and the labor-intensiveness of characterizing soil biota, many previous studies have focused on a relatively limited number of spatially distinct sampling sites. Whilst these studies are valuable for dissecting local and regional scale patterns, they may not hold the depth of information that is needed to feed global-scale models 1 .
Soil nematodes are present in all trophic levels in the soil food web, play central roles in regulating carbon and nutrient dynamics, control soil microorganism populations 2-4 and, consequently, are good indicators of biological activity in soils 5 . Here, we present a dataset of 6,825 spatially distinct soil nematode samples from all terrestrial biomes and continents, an updated version of the dataset that was originally used to create a global map of soil nematode abundance and community composition 6 . The original version contained 6,759 samples; the updated version that we present here contains 66 additional samples located in Ireland. This dataset can prove useful to disentangle the effects of environmental drivers of soil nematode abundance and community composition across broad spatial scales. The original version of this dataset was used to create a high-resolution map of soil nematode abundance, which revealed that nematodes are present in higher densities in sub-Arctic regions compared to tropical and temperate regions 6 . Soil properties are the primary drivers of soil nematode abundance, whereas climatic conditions have an indirect effect by altering soil conditions 6 . The overall latitudinal gradient, with decreasing abundance towards the equator, is the inverse of patterns often observed in aboveground organisms, but is in line with what has been shown for other belowground biota 7,8 .
Besides data on the total number of nematodes per sample, the dataset quantifies the abundance of individuals in different functional groups of soil nematodes, classified according to five feeding guilds 9 : bacterivores, fungivores, herbivores, omnivores, and predators. For geospatial mapping, these sampling data were aggregated into 1,933 unique 30 arc-second pixels (~1 km² at the equator) and combined with 73 global covariate layers, including information on soil physicochemical properties, vegetation, climate, topography, anthropogenic influence, and spectral reflectance. We intend to continue expanding the dataset and are open to contributions of additional data.

Methods
Data collection. The methods described here are expanded versions of descriptions in our related work 6 .
The dataset encompasses georeferenced data on soil nematode abundances according to trophic groups, which were assigned according to Yeates et al. 9 . In total, the dataset contains 6,825 georeferenced samples collected in the top 15 cm of soils, including 66 additional samples compared to the dataset used in our related work 6 . Across all samples, 67.2% originate from natural sites and 32.8% from agricultural or managed sites. Nematodes were extracted from soil using standard elutriation methods, including the Baermann funnel method 10 , sugar-flotation/centrifugation 11,12 , decanting and sieving 13 , Oostenbrink elutriation 14 , Whitehead tray 15 and Seinhorst elutriation 16 . These methods may include variations of the original methods. Most samples present in the dataset were obtained using the Baermann funnel method, followed by Oostenbrink elutriation and sugar-flotation (Jenkins/Freckman) (Fig. 1). Per-sample method descriptions, sampling depth, and data provider information are available via figshare 17 . For previously published data, we provide references to the original publications of the respective samples.

Environmental metadata: soil, climate, topography, vegetation, anthropogenic characteristics.
For all sampling locations we provide paired environmental metadata, which can be used to provide insight into the environmental drivers of soil nematode abundance and community composition across spatial scales. To do so, we first prepared a covariate stack of 73 layers, for which we downloaded the covariate layers as GeoTIFF files. Next, all layers were resampled and reprojected to a unified pixel grid in EPSG:4326 (WGS84) at 30 arc-seconds resolution. Layers with a higher original pixel resolution were downsampled using a mean aggregation method; layers with a lower original resolution were resampled using simple upsampling (i.e. without interpolation) to align with the higher resolution grid. Next, all layers were converted into a multiband image, i.e. the covariate stack, that was used for pixel sampling.
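The two resampling rules described above can be illustrated on small in-memory grids. The following is a minimal sketch, not the tooling used to build the covariate stack: the function names `downsample_mean` and `upsample_repeat` are hypothetical, grids are plain nested lists, and the grid dimensions are assumed to be exact multiples of the resampling factor.

```python
from statistics import mean


def downsample_mean(grid, factor):
    """Coarsen a grid by averaging each factor x factor block of cells.

    Assumes the number of rows and columns is divisible by `factor`.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            mean(grid[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def upsample_repeat(grid, factor):
    """Refine a grid by repeating each cell factor x factor times (no interpolation)."""
    fine = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        # Emit `factor` independent copies of the widened row.
        fine.extend(list(fine_row) for _ in range(factor))
    return fine


fine_layer = [[1.0, 2.0, 5.0, 5.0],
              [3.0, 4.0, 5.0, 5.0]]
coarse_layer = [[1, 2]]

coarsened = downsample_mean(fine_layer, 2)
refined = upsample_repeat(coarse_layer, 2)
```

Repeating cell values rather than interpolating keeps every upsampled pixel equal to a value that was actually present in the source layer.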
To prepare the nematode data for pixel sampling, we first matched the resolution of the dataset to that of the global covariate layer stack that contains the environmental metadata: 30 arc-seconds, which corresponds to approximately 1 km² at the equator. In this step, we aggregated all data points falling within the same pixel by taking the mean value, resulting in 1,933 unique pixels. We stress that the covariate layer stack has no coverage in Antarctica and therefore the 503 samples located in this region were dropped at the pixel aggregation step. Next, pixel values across the 73 layers were retrieved and stored as a csv file. This dataset is available via figshare 17 . We stress that, as some covariate layers were reprocessed since the publication of the nematode mapping study 6 , there are some slight differences in the sampled covariate data in this updated version. The approach is visualized in Fig. 2.
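The pixel aggregation step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the original processing code: samples are given as hypothetical `(lat, lon, abundance)` tuples in WGS84 degrees, the grid origin is assumed to lie at (-90, -180), and the helper names `pixel_index` and `aggregate_to_pixels` are invented for this example. Dropping samples without covariate coverage (e.g. Antarctica) is not shown.

```python
import math
from collections import defaultdict
from statistics import mean

PIXEL_DEG = 30 / 3600  # 30 arc-seconds expressed in degrees (1/120 of a degree)


def pixel_index(lat, lon, size=PIXEL_DEG):
    """Return the (row, col) index of the grid cell containing a coordinate."""
    return (math.floor((lat + 90.0) / size), math.floor((lon + 180.0) / size))


def aggregate_to_pixels(samples):
    """Average the abundance of all samples that fall within the same pixel."""
    groups = defaultdict(list)
    for lat, lon, abundance in samples:
        groups[pixel_index(lat, lon)].append(abundance)
    return {pixel: mean(values) for pixel, values in groups.items()}


samples = [
    (52.0001, 5.0001, 100.0),  # these two samples share one pixel
    (52.0002, 5.0002, 300.0),
    (10.5, -60.5, 50.0),       # a separate pixel
]
pixel_means = aggregate_to_pixels(samples)
```

In this toy input, three samples collapse into two pixels, which mirrors how 6,825 samples reduce to 1,933 unique pixels in the dataset.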

Data records
All data are available via figshare 17 . Raw nematode abundance data (6,825 samples) are available as a csv file: "nematode_full_dataset_wBiome.csv". Sample IDs 20001-20066 are samples not present in our related work 6 . Abundance data aggregated into 30 arc-second pixels (1,933 unique locations), combined with environmental covariate data, are available as a csv file: "nematode_abundance_aggregated_wCovar.csv". Full metadata, including descriptions, units, and source information, of all global covariate layers are available as a csv file: "metadata.csv".
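As a small usage illustration, the samples added since the related work could be selected by their ID range. The sketch below is hedged: the column names `Sample_ID` and `Total_Number` are hypothetical placeholders (the actual header names should be checked in the csv file), and a tiny in-memory table stands in for the real data.

```python
import csv
import io


def select_new_samples(csv_file, id_column="Sample_ID", first=20001, last=20066):
    """Return the rows whose integer sample ID lies in [first, last].

    `id_column` is a hypothetical column name; replace it with the real header.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if first <= int(row[id_column]) <= last]


# In-memory stand-in for the real file; columns and values are invented.
demo = io.StringIO(
    "Sample_ID,Total_Number\n"
    "19999,850\n"
    "20001,1200\n"
    "20066,430\n"
    "20067,90\n"
)
new_rows = select_new_samples(demo)
```

For the real file, the same function could be called with `open("nematode_full_dataset_wBiome.csv", newline="")` once the ID column name has been confirmed.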

Technical Validation
Soil nematode abundances are highly variable within and across terrestrial biomes 6 . Abundances typically range from a few hundred to a few thousand nematodes per 100 g dry soil (median = 859, mean = 2,671), although the highest recorded abundances exceed 20,000 nematodes per 100 g dry soil. Across biomes, bacterivores are the most abundant trophic group and predatory nematodes the least abundant (Table 1). Overall, the highest abundances are observed in tundra (median = 2,695 nematodes per 100 g dry soil), temperate broadleaf forest (median = 2,119) and boreal forest (median = 2,016) soils. The lowest abundances are observed in Mediterranean forest (median = 374), flooded grasslands (median = 124), Antarctic (median = 89) and hot desert (median = 44) soils (Fig. 3, Table 2). We stress that these numbers differ slightly from the values reported in our accompanying paper 6 , where we reported the aggregated pixel median values.
Fig. 4 Environmental representativeness of the dataset. The sampled locations represent a wide range of environmental conditions. For illustrative purposes, ten environmental variables were chosen from the full set of 73.
As with any global ecological dataset that combines data from many researchers across the world, there is inherent variation in the data. In addition, the different nematode extraction methods may vary in their efficiencies 25,26 . This underscores the need for large datasets for global scale analyses of ecological patterns. When a sufficiently large sample size allows strong patterns to be detected through this statistical noise, we can be confident that a biological pattern exists 6 . As a consequence, there may be limitations to the use of the dataset at finer scales. Yet, by subsetting the dataset by extraction method or region, for example, it can serve as a starting point for local scale studies.
Environmental representativeness of the dataset. To evaluate the comprehensiveness of the dataset, we explored the environmental conditions that the sampling locations represent. Across individual environmental variables, the samples represent a wide range of environmental conditions (Fig. 4). To gain spatial insight into the environmental representativeness of the dataset, information that is important when comparing observations across spatial scales, we evaluated how the multidimensional environmental space covered by the dataset compares to the global environmental space. To do so, we used an approach similar to that of our previous work 6 . First, we set out to reduce the computational load, as exploring the full stack of 73 global environmental covariate layers across ~210 million terrestrial pixels would require prohibitively large computing resources. To this end, we transformed the set of global environmental covariate layers into Principal Component (PC) space. We reduced the number of selected PCs to 17, collectively explaining more than 90% of variation. Next, we assessed the proportion of the world's terrestrial pixels falling within convex hulls of the 136 bivariate combinations of the 17 PCs. The resulting map provides a spatially explicit depiction of the representativeness of the dataset, showing that the majority of the terrestrial pixels fall within these convex hulls, with most of the outliers existing in arid regions such as the Sahara and Arabian Deserts, and in sub-arctic regions such as the far north of Canada and Russia (Fig. 5).
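The convex hull test on bivariate PC combinations can be sketched as below. This is one possible reading of the approach, not the original implementation: PC scores are assumed to be already computed and passed in as tuples, the hull is built with Andrew's monotone chain algorithm (a choice made here for self-containment), each bivariate plane is assumed to contain at least three non-collinear sampled points, and all function names are hypothetical. Note that 17 PCs yield 17 × 16 / 2 = 136 bivariate combinations, matching the number stated above.

```python
from itertools import combinations


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def fraction_of_pairs_covered(sample_scores, pixel_scores):
    """Share of bivariate PC planes in which the pixel falls within the sample hull."""
    pairs = list(combinations(range(len(pixel_scores)), 2))
    hits = 0
    for i, j in pairs:
        hull = convex_hull([(s[i], s[j]) for s in sample_scores])
        hits += inside(hull, (pixel_scores[i], pixel_scores[j]))
    return hits / len(pairs)


# Toy example with 3 PCs (3 bivariate planes); values are invented.
sampled = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 2), (0, 0, 2), (2, 0, 2), (0, 2, 2)]
covered = fraction_of_pairs_covered(sampled, (1, 1, 1))
partly_covered = fraction_of_pairs_covered(sampled, (3, 1, 1))
n_pairs_17 = len(list(combinations(range(17), 2)))
```

In practice the hull for each plane would be computed once and reused across all pixels; it is recomputed per call here only to keep the sketch short.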