Background & Summary

To generate a global and quantitative understanding of the biogeography of soil organisms, critical players in global biogeochemistry, large and comprehensive datasets are needed. Due to methodological challenges and the labor-intensiveness of characterizing soil biota, many previous studies have focused on a relatively limited number of spatially distinct sampling sites. Whilst these studies are valuable for dissecting local- and regional-scale patterns, they may not hold the depth of information that is needed to feed global-scale models1.

Soil nematodes are present in all trophic levels in the soil food web, play central roles in regulating carbon and nutrient dynamics, control soil microorganism populations2,3,4 and, consequently, are good indicators of biological activity in soils5. Here, we present a dataset of 6,825 spatially distinct soil nematode samples from all terrestrial biomes and continents, an updated version of the dataset that was originally used to create a global map of soil nematode abundance and community composition6. The original version contained 6,759 samples; the updated version that we present here contains 66 additional samples located in Ireland. This dataset can prove useful for disentangling the effects of environmental drivers of soil nematode abundance and community composition across broad spatial scales. The original version of this dataset was used to create a high-resolution map of soil nematode abundance, which revealed that nematodes are present in higher densities in sub-Arctic regions than in tropical and temperate regions6. Soil properties are the primary drivers of soil nematode abundance, whereas climatic conditions have an indirect effect by altering soil conditions6. The overall latitudinal gradient, with decreasing abundance towards the equator, is the inverse of patterns often observed in aboveground organisms, but is in line with what has been shown for other belowground biota7,8.

Besides data on the total number of nematodes per sample, the dataset quantifies the abundance of individuals in five functional groups of soil nematodes, classified by feeding guild9: bacterivores, fungivores, herbivores, omnivores and predators. For geospatial mapping, these sampling data were aggregated into 1,933 unique 30 arc-second pixels (~1 km2 at the equator) and combined with 73 global covariate layers containing information on soil physicochemical properties, vegetation, climate, topography, anthropogenic influence and spectral reflectance. We intend to continue expanding the dataset and are open to contributions of additional data.

Methods

Data collection

The methods described here are expanded versions of descriptions in our related work6. The dataset encompasses georeferenced data on soil nematode abundances according to trophic groups, which were assigned according to Yeates et al.9. In total, the dataset contains 6,825 georeferenced samples collected from the top 15 cm of soil, including 66 additional samples compared to the dataset used in our related work6. Across all samples, 67.2% originate from natural sites and 32.8% from agricultural or managed sites. Nematodes were extracted from soil using standard extraction methods, including the Baermann funnel method10, sugar-flotation/centrifugation11,12, decanting and sieving13, Oostenbrink elutriation14, Whitehead tray15 and Seinhorst elutriation16. These methods may include variations of the original methods. Most samples present in the dataset were obtained using the Baermann funnel method, followed by Oostenbrink elutriation and sugar-flotation (Jenkins/Freckman) (Fig. 1). Per-sample method descriptions, sampling depth, and data provider information are available via figshare17. For previously published data, we provide references to the original publications of the respective samples.

Fig. 1

Nematode extraction methods used. The majority of the samples were processed using the Baermann funnel method and Oostenbrink elutriation.

Environmental metadata: soil, climate, topography, vegetation, anthropogenic characteristics

For all sampling locations we provide paired environmental metadata, which can be used to gain insight into the environmental drivers of soil nematode abundance and community composition across spatial scales. To do so, we first prepared a covariate stack of 73 layers, for which we downloaded the covariate layers as GeoTIFF files. Next, all layers were resampled and reprojected to a unified pixel grid in EPSG:4326 (WGS84) at 30 arc-seconds resolution. Layers with a higher original pixel resolution were downsampled using a mean aggregation method; layers with a lower original resolution were resampled using simple upsampling (i.e. without interpolation) to align with the higher-resolution grid. Finally, all layers were combined into a multiband image, i.e. the covariate stack, that was used for pixel sampling.
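The two resampling rules described above can be illustrated with a minimal sketch on toy grids. This is not the pipeline used to build the covariate stack: the function names `downsample_mean` and `upsample_repeat` are hypothetical, the grids are plain nested lists, and reprojection and missing-data handling are left out.

```python
def downsample_mean(grid, factor):
    """Aggregate a fine grid to a coarser one by averaging factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def upsample_repeat(grid, factor):
    """Refine a coarse grid by repeating each cell factor x factor times (no interpolation)."""
    out = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        out.extend([list(fine_row) for _ in range(factor)])
    return out


# Toy layers (made-up values): one finer and one coarser than the target grid.
fine_layer = [[1.0, 3.0, 5.0, 7.0],
              [1.0, 3.0, 5.0, 7.0]]
coarse_layer = [[10.0, 20.0]]

downsampled = downsample_mean(fine_layer, 2)
upsampled = upsample_repeat(coarse_layer, 2)
```

Mean aggregation smooths sub-pixel variation in the finer layers, whereas repetition leaves the coarser layers' values unchanged and adds no information at the finer scale.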

To pair the nematode data with the environmental metadata, we first matched the resolution of the dataset to that of the global covariate layer stack: 30 arc-seconds, which corresponds to approximately 1 km2 at the equator. In this step, we aggregated all data points falling within the same pixel by taking the mean value, resulting in 1,933 unique pixels. The covariate layer stack has no coverage in Antarctica, and the 503 samples located in this region were therefore dropped at the pixel aggregation step. Next, pixel values across the 73 layers were retrieved and stored as a csv file. This dataset is available via figshare17. Because some covariate layers were reprocessed after publication of the nematode mapping study6, the sampled covariate data in this updated version differ slightly from the original. The approach is visualized in Fig. 2.
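The pixel aggregation step can be sketched as follows, assuming a global WGS84 grid whose origin is at 90° N, 180° W and whose cells measure 1/120 of a degree (30 arc-seconds). The function names and the sample coordinates and abundances are hypothetical; the row and column indices only illustrate the grouping and do not necessarily match the indexing of the actual covariate stack.

```python
import math
from collections import defaultdict

PIXELS_PER_DEGREE = 120  # 30 arc-seconds = 1/120 of a degree


def pixel_index(lat, lon):
    """Return (row, col) of the 30 arc-second pixel containing a WGS84 point.

    Rows count southwards from 90 N, columns eastwards from 180 W.
    """
    row = math.floor((90.0 - lat) * PIXELS_PER_DEGREE)
    col = math.floor((lon + 180.0) * PIXELS_PER_DEGREE)
    return row, col


def aggregate_by_pixel(samples):
    """Average abundance over all samples that share a pixel.

    samples: iterable of (lat, lon, abundance) tuples.
    Returns a dict mapping (row, col) to the mean abundance.
    """
    grouped = defaultdict(list)
    for lat, lon, abundance in samples:
        grouped[pixel_index(lat, lon)].append(abundance)
    return {pixel: sum(values) / len(values) for pixel, values in grouped.items()}


# Made-up samples: the first two fall in the same pixel, the third elsewhere.
samples = [
    (53.3501, -6.2601, 1200.0),
    (53.3502, -6.2602, 800.0),
    (46.0000, 8.0000, 500.0),
]
pixel_means = aggregate_by_pixel(samples)
```

In this toy case three samples collapse into two pixels, which mirrors on a small scale how the sample-level records reduce to a smaller number of unique pixels.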

Fig. 2

Data processing approach. The raw dataset includes 6,825 georeferenced samples. These sampling locations represent 1,933 unique 30 arc-second pixels (~1 km2 at the equator), or 1,895 pixels excluding locations falling off the covariate grid. To gain mechanistic insights and discern the major environmental drivers of nematode abundance, these pixels were sampled across 73 global covariate layers.

Full metadata, including descriptions, units, and source information of all global covariate layers are available via figshare17. In short, information about soil texture and physicochemical properties was obtained from SoilGrids18, limited to the upper soil layer (top 15 cm). Climate information was obtained from WorldClim19 (version 2), which includes climate data averaged across 1970–2000 (http://www.worldclim.org/). Plant productivity data (i.e. EVI, NDVI, GPP, NPP) and spectral reflectance data were obtained from Google Earth Engine (https://developers.google.com/earth-engine/datasets/). Aridity index and potential evapotranspiration layers were obtained from CGIAR20 (version 1) (http://www.cgiar-csi.org/data/global-aridity-and-pet-database). Anthropogenic information (i.e. human development, population density) was obtained from WCS21 (http://wcshumanfootprint.org) and from Tuanmu and Jetz22. Aboveground biomass data were obtained from CDIAC23 (https://cdiac.ess-dive.lbl.gov/epubs/ndp/global_carbon/carbon_documentation.html). Radiation data were obtained from CliMond24 (https://www.climond.org/BioclimRegistry.aspx#BioclimFAQ). WWF Ecoregion classifications were used to categorize sampling locations into biomes (https://www.worldwildlife.org/biome-categories/terrestrial-ecoregions).

Data Records

All data are available via figshare17. Raw nematode abundance data (6,825 samples) are available as a csv file: “nematode_full_dataset_wBiome.csv”. Sample IDs 20001–20066 are samples not present in our related work6. Abundance data aggregated into 30 arc-second pixels (1,933 unique locations), combined with environmental covariate data, are available as a csv file: “nematode_abundance_aggregated_wCovar.csv”. Full metadata, including descriptions, units, and source information, of all global covariate layers are available as a csv file: “metadata.csv”.
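As a hedged illustration of how the sample ID range can be used to separate the 66 added samples from the original records, the sketch below reads a small in-memory stand-in for the raw csv file. The column names `Sample_ID` and `Total_Number`, the helper `split_by_origin`, and all values are hypothetical; the actual column headers should be taken from the file itself.

```python
import csv
import io

# Toy stand-in for "nematode_full_dataset_wBiome.csv". The column names
# ("Sample_ID", "Total_Number") and all values are hypothetical.
toy_csv = io.StringIO(
    "Sample_ID,Total_Number\n"
    "101,950\n"
    "20001,1200\n"
    "20066,640\n"
)

NEW_ID_RANGE = range(20001, 20067)  # sample IDs 20001-20066, inclusive


def split_by_origin(rows, id_column="Sample_ID"):
    """Split rows into (original, added) lists based on the sample ID range."""
    original, added = [], []
    for row in rows:
        target = added if int(row[id_column]) in NEW_ID_RANGE else original
        target.append(row)
    return original, added


original_rows, added_rows = split_by_origin(csv.DictReader(toy_csv))
```

To apply the same split to the downloaded file, the `io.StringIO` object can be replaced with a file handle opened with `open("nematode_full_dataset_wBiome.csv", newline="")`, with `id_column` adjusted to the real header name.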

Technical Validation

Soil nematode abundances are highly variable within and across terrestrial biomes6. Abundances typically range from a few hundred to a few thousand nematodes per 100 g dry soil (median = 859, mean = 2,671), although the highest recorded abundances exceed 20,000 nematodes per 100 g dry soil. Across biomes, bacterivores are the most abundant trophic group and predatory nematodes the least abundant (Table 1). Overall, the highest abundances are observed in tundra (median = 2,695 nematodes per 100 g dry soil), temperate broadleaf forest (median = 2,119) and boreal forest (median = 2,016) soils. The lowest abundances are observed in Mediterranean forest (median = 374), flooded grassland (median = 124), Antarctic (median = 89) and hot desert (median = 44) soils (Fig. 3, Table 2). These numbers differ slightly from the values reported in our accompanying paper6, where we reported the aggregated pixel median values.

Table 1 Mean and median nematode abundances, per trophic group.

Fig. 3

Nematode communities vary across biomes. The median and interquartile range of nematode abundances (n = 6,825) per biome from all continents.

Table 2 Mean and median nematode abundances, per biome.

As with any global ecological dataset that combines data from many researchers across the world, there is inherent variation in the data. In addition, the different nematode extraction methods may vary in their efficiencies25,26. This underscores the need for large datasets for global-scale analyses of ecological patterns: when the sample size is large enough for strong patterns to be detected through this statistical noise, we can be confident that a biological pattern exists6. Because of this variation, there may be limitations to the use of the dataset at finer scales. Yet, by subsetting the dataset by extraction method or region, for example, it can serve as a starting point for local-scale studies.

Environmental representativeness of the dataset

To evaluate the comprehensiveness of the dataset, we explored the environmental conditions that the sampling locations represent. Across individual environmental variables, the samples represent a wide range of environmental conditions (Fig. 4). To gain spatial insight into the environmental representativeness of the dataset, information that is important when comparing observations across spatial scales, we evaluated how the multidimensional environmental space covered by the dataset compares to the global environmental space. To do so, we used an approach similar to that of our previous work6. First, we set out to reduce the computational load, as exploring the full stack of 73 global environmental covariate layers across ~210 million terrestrial pixels would require a prohibitive amount of computing power. To this end, we transformed the set of global environmental covariate layers into Principal Component (PC) space. We reduced the number of selected PCs to 17, collectively explaining more than 90% of the variation. Next, we assessed the proportion of the world’s terrestrial pixels falling within convex hulls of the 136 bivariate combinations of the 17 PCs. The resulting map provides a spatially explicit depiction of the representativeness of the dataset, showing that the majority of the terrestrial pixels fall within these convex hulls, with most of the outliers located in arid regions such as the Sahara and Arabian Deserts, and in sub-Arctic regions such as the far north of Canada and Russia (Fig. 5).
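The convex hull assessment can be sketched in a few lines under stated assumptions. The code below is one plausible reading of the procedure, not the implementation used for Fig. 5: it assumes PC scores are already computed (the PCA step is omitted), uses Andrew's monotone chain algorithm as a stand-in hull routine, and reports, for a single pixel, the share of bivariate PC combinations in which that pixel lies inside the hull of the sampled locations. All function names and toy scores are hypothetical.

```python
from itertools import combinations


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # a degenerate hull (point or segment) is treated as empty
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def fraction_of_pairs_covered(sample_scores, pixel_scores):
    """Share of bivariate PC combinations in which the pixel is inside the sample hull."""
    pairs = list(combinations(range(len(pixel_scores)), 2))
    inside = 0
    for i, j in pairs:
        hull = convex_hull([(s[i], s[j]) for s in sample_scores])
        inside += in_hull(hull, (pixel_scores[i], pixel_scores[j]))
    return inside / len(pairs)


# 17 PCs give 17 * 16 / 2 = 136 bivariate combinations.
n_pairs_17 = len(list(combinations(range(17), 2)))

# Toy scores on 3 PCs (made-up values): sampled locations at the corners of a cube.
sample_scores = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
covered_pixel = fraction_of_pairs_covered(sample_scores, (0.0, 0.0, 0.0))
outlier_pixel = fraction_of_pairs_covered(sample_scores, (5.0, 0.0, 0.0))
```

In the toy example, the central pixel lies inside the hull for every PC pair, while the outlying pixel is covered only in the single pair that excludes its extreme component. Recomputing each hull per pixel is wasteful at global scale; a practical version would build the 136 hulls once and reuse them.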

Fig. 4

Environmental representativeness of the dataset. The sampled locations represent a wide range of environmental conditions. For illustrative purposes, ten environmental variables were chosen from the full set of 73.

Fig. 5

Assessment of the representativeness of the dataset in multivariate environmental space. The map displays the percentage of pixels that fall within the convex hulls of the bivariate combinations of the first 17 principal components (collectively covering >90% of the sample space variation).