Background & Summary

Soil biodiversity represents a major fraction of life on Earth1,2. Despite that, we know little globally about the current status and trends of soil life, especially invertebrates. Over the last few years, our knowledge of the global distribution of earthworms3, nematodes4, springtails5, ants6 and other macrofauna7 has advanced, showing trends different from those of aboveground biodiversity8. This urges us to deliver open and in-depth knowledge on soil animal life for nature conservation and for understanding the functioning of terrestrial ecosystems9. To help with this task, we here present a comprehensive fine-resolution database on the global distribution of springtails (Collembola), based on a compilation of published and unpublished data of researchers worldwide.

With a literally worldwide distribution, springtails account for ~32% of the global terrestrial arthropod abundance10 and have a global biomass of ~27.5 megatonnes of carbon5. They are especially numerous in cold regions, but are also ubiquitous in tropical soils5, and even tropical canopies11. Springtails are central components of the belowground system, affecting litter decomposition, microbial activity, abundance and dispersal, and plant growth, and serving as food for numerous invertebrate predators12. Despite a moderate total diversity (~9500 described species13), springtail communities typically host dozens of species in a few square metres5. Due to their ubiquitous presence, high abundance and high local diversity, springtails represent an ideal model taxon for macroecological studies as well as bioindicators, but so far data limitations have restricted studies to questions at local to regional scales.

In this paper, we describe a novel database mainly compiled from private archives of contributing authors that served as the basis for the recently published global synthesis study on springtail abundance and diversity5. While the site-level summaries of springtail community parameters have been published together with the synthesis5, here we present much more detailed sample-level data that include taxonomic names and 16 additional datasets (1,398 new samples). With this effort, we complement the previously published data papers on nematodes14 and earthworms15 in describing the global soil invertebrate diversity. We also take a step further by providing quality-controlled species-level data with standardised taxonomic names at fine-scale resolution, i.e. from individual samples, or even soil layers, within each sampling site. Our dataset allows for analyses of global and regional patterns of diversity and community composition, species distributions, and within-site variations in abundance and diversity. Below, we first describe how the data were collected, checked, curated, structured, and standardised, then we provide an overview of the data, and finish with some notes on how the data can be used.

Methods

Data sources

The database represents a standardised compilation of available datasets. The data were primarily obtained from individual archives of the contributing authors. To ensure widespread participation, the data collection initiative was announced openly in late summer 2019 through various channels, such as the mailing list of the International Colloquium of Apterygota and social media platforms such as Twitter and ResearchGate. Additionally, colleagues who had expertise in less well investigated regions, such as Africa and South America, were contacted through personal networks established by the initial author group. All individuals who collected, provided and standardised the data were invited to become co-authors of this study, with a defined minimum role in tasks, such as data provision, data cleaning, manuscript editing and approval. Both published16–164 and unpublished data were collected for analysis. Raw data, specifically species counts in samples, were requested whenever possible. Collection methods for the published data can be found in the original publications associated with each sampling event in the database. Furthermore, existing data on springtail communities available from Edaphobase165 were also included.
To address the underrepresentation of Africa, South America, Australia, and Southeast Asia in the database, a literature search was conducted in January 2020 using the Web of Science platform with keywords: ‘springtail’ or ‘Collembola’ and ‘density’ or ‘abundance’ or ‘diversity’ along with the region of interest. In 2022–2023, in addition to the data analysed in the synthesis paper5, we included 16 datasets with 1,398 samples from new contributing authors.

The newly reported unpublished data represented 10,616 samples collected from 828 sites (from one to a few dozen samples were collected per site) in the years 1975–2022. Springtails from soil and litter were collected using standard soil sampling devices (soil corers, frames). Springtails from the canopy were collected using insecticide fogging, and those from aboveground surfaces using pitfall traps, stem eclectors, Malaise traps, sweep netting, or a vacuum cleaner. For over 90% of these data, springtails were extracted using variations of Berlese or Kempson devices. All springtails were identified under microscopes using regional identification keys (mainly to species, but also to high-rank taxa or morphogroups). All sampling information for the entries in the dataset is included in the spreadsheet, including the exact places, times, collectors, habitat types, and the collection and identification methods.

Data collection

All data were entered into a common Microsoft Excel template (Supplementary materials Data template). The template included 30 columns describing the sampling approach and counts of springtail taxa. The following minimum set of variables was collected: collectors, collection method (including sampling area and depth), extraction method, identification precision and literature, collection date, latitude and longitude, and vegetation type (grassland, scrub, woodland, agriculture and other). Each contributed dataset was checked manually by a trained assistant for technical mistakes and completeness, and was complemented by the authors if necessary. Geographical coordinates were checked using Google Maps. We additionally computed descriptive statistics to check the consistency of the dataset (number of sites, samples, layers) and converted data in the template into two standard tables: events table (describing samples) and occurrence table (describing taxa counts) in R v. 4.0.2166 with RStudio interface v. 1.4.1103 (RStudio, PBC). The final events table across datasets was then checked for typos, consistency in vocabulary and outliers using OpenRefine v3.3 (https://openrefine.org; Fig. 1).

Fig. 1

Data collection and evaluation in #GlobalCollembola. Most of the data are raw data collected from archives of the contributing authors of the paper. The data were collected using an Excel template and included in the final database after technical and expert cleaning of each dataset. No data were excluded; instead, an expert evaluation is provided for each dataset. Whenever possible, we recorded species occurrences in individual samples (soil cores).

Data evaluation

Every contributed dataset underwent a manual expert evaluation. Our evaluation process involved a board of springtail specialists, each with extensive research experience in specific geographic regions (expert names are listed in the events spreadsheet of the database). The experts individually scored each dataset based on three criteria: (1) the reliability of the density estimate, (2) the reliability of the species richness estimate, and (3) the accuracy of the species names provided. The density estimation quality was determined by considering the sampling and extraction method, as well as the density estimation itself for the given ecosystem type. The species richness estimation quality and species names were assessed by considering the identification key used, the experience of the scientist identifying the animals, the species list and the species richness estimation itself for the given ecosystem type. Datasets that were deemed “unreliable” during the evaluation process were still included in the database, but the evaluation results by the experts are provided alongside the data.

Taxonomic alignment

To make taxonomic lists comparable across contributed datasets, we checked all taxonomic names against the global checklist of Collembola (www.collembola.org). We did this using the ‘Species matching’ tool of the Global Biodiversity Information Facility (https://www.gbif.org/tools/species-lookup), which hosts the global checklist of Collembola from 2023. Original names were kept in the database together with the standardised names. For synonyms, accepted species names were provided. For morphospecies, described taxonomic names of higher ranks (usually genera) were given. Taxonomic hierarchy (genera, families, orders) and other taxonomic information were summarised in an additional spreadsheet. Unfortunately, it was not possible to fully control for factually wrong original identifications, even though the species lists were checked by experts (see above), but most of the records were judged as reliable.
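The standardisation rules described above (keep the original name, replace synonyms with accepted names, assign morphospecies to a higher rank) can be illustrated with a minimal Python sketch. It is not the GBIF matching tool and not our workflow: the function `standardise_name`, the two lookup structures, and the fallback of taking the first word as the genus are all assumptions made for illustration, and "Aus bus" and "Aus cus" are placeholder names rather than real taxa.

```python
def standardise_name(original, synonyms, accepted):
    """Return (standardised_name, rank) for one original taxon name.

    synonyms: dict mapping a synonym to its accepted species name
              (hypothetical lookup, e.g. built from a checklist export).
    accepted: set of accepted species names (hypothetical checklist extract).
    """
    name = " ".join(original.split())  # collapse stray whitespace
    if name in accepted:
        return name, "species"
    if name in synonyms:
        return synonyms[name], "species"
    # Assumption: anything unmatched (e.g. "Aus sp. 1") is treated as a
    # morphospecies and assigned to its genus, taken as the first word.
    genus = name.split(" ")[0]
    return genus, "genus"


accepted = {"Aus bus"}           # placeholder checklist
synonyms = {"Aus cus": "Aus bus"}  # placeholder synonymy

for original in ["Aus bus", "Aus cus", "Aus sp. 1"]:
    # The original name is kept next to the standardised one.
    print(original, "->", standardise_name(original, synonyms, accepted))
```

In practice, unmatched names would need manual review rather than an automatic genus fallback, since a failed match can also indicate a typo or a name missing from the checklist.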

Data Records

The final dataset included 380 datasets representing 2,990 sites, 44,999 samples and 249,912 occurrences (i.e. observations of taxa in samples). In total, 1,441 taxa including 1,202 species were recorded in the occurrence data. The data were provided on different scales. Most samples represented single layers (i.e. litter, topsoil, deeper soil layers) in a soil core (i.e. soil monolith) or single cores in a sampling site (Fig. 1 ‘scales’). However, some data were available only as averages across samples at the sampling site level (typically an area up to a hundred metres in diameter). The data were organised in three spreadsheets in the csv format: (1) Events, representing a list of all samples with described methodology, locations, and sampling times; (2) Occurrences, representing a list of all observations of taxa in all samples; and (3) Taxonomy, representing a list of unique taxonomic names present in the occurrence data and associated standardised taxonomic names and other taxonomic information. Furthermore, we provided an R script to link the three spreadsheets together, summarise them by soil cores and sites, and filter out unreliable and out-of-scope data (Fig. 2). As an example, we also provided a csv spreadsheet with average densities and the total species richness of springtails per site, collected with area-based methods. To facilitate data re-use, we provide a separate Excel spreadsheet with detailed descriptions of all data fields (‘Data description’). All data spreadsheets, R codes and other related information are available from Figshare167.
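To make the relationship between the three spreadsheets concrete, the following Python sketch links events, occurrences and taxonomy and derives a mean density and total species richness per site. The R script distributed with the database remains the reference; this is only an illustration of the logic, and every column name used here (`event_id`, `site_id`, `area_m2`, `taxon`, `count`, `original_name`, `species`) is hypothetical and should be replaced by the field names given in the ‘Data description’ spreadsheet.

```python
from collections import defaultdict


def summarise_sites(events, occurrences, taxonomy):
    """Link the three tables (lists of dicts, e.g. from csv.DictReader)
    and return mean density and species richness per site."""
    event_by_id = {e["event_id"]: e for e in events}
    # Map each original name to its standardised species name
    # (empty when the taxon is not resolved to species).
    species_of = {t["original_name"]: t["species"] for t in taxonomy}

    individuals = defaultdict(float)   # total individuals per event
    site_species = defaultdict(set)    # standardised species per site
    for occ in occurrences:
        event = event_by_id[occ["event_id"]]
        individuals[occ["event_id"]] += float(occ["count"])
        species = species_of.get(occ["taxon"])
        if species:
            site_species[event["site_id"]].add(species)

    densities = defaultdict(list)      # individuals per m2, one per event
    for e in events:
        # Events without any occurrence contribute a density of zero.
        densities[e["site_id"]].append(
            individuals[e["event_id"]] / float(e["area_m2"])
        )

    return {
        site: {
            "mean_density": sum(values) / len(values),
            "richness": len(site_species[site]),
        }
        for site, values in densities.items()
    }
```

The sketch assumes that all events passed to it were collected with area-based methods and are comparable in sampling effort; filtering by method and by expert evaluation has to happen before this step.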

Fig. 2

Database structure of #GlobalCollembola. Database consists of three main spreadsheets: (1) Events, (2) Occurrences, and (3) Taxonomy. The spreadsheets can be linked, summarised, and filtered using the associated R script to produce site-level averages.

Technical Validation

Statistical soundness of the database depends on the research question addressed. Below we show the representativeness of our data for the main types of ecological analysis by describing its spatial and temporal scopes, as well as the sampling and identification approaches.

Most macroecological studies require representation of different geographical regions, climates, and ecosystem types4. Since the database is based on an open call for collection of already produced data, there is a clustered spatial distribution of data points in well-explored regions and high variation in collection methodologies. Most collected sample-level data come from Europe (82.5% or 37,137 samples), while other continents were less represented: Asia (5.6% or 2,508 samples), North America (3.4% or 1,528 samples), South America and the Caribbean (3.2% or 1,457 samples), Africa (2.8% or 1,269 samples), Australia (2.1% or 944 samples) and Antarctica (0.3% or 156 samples; Fig. 3). Across habitat types, woodlands are the most represented (57.4% of samples), followed by grasslands (14.0%), agriculture (13.7%), scrub (9.0%) and others (5.9%; Fig. 3). Using bootstrapping of the European data, we were able to perform a balanced analysis of the data in our synthesis study, and to cover global gradients in mean annual temperature, precipitation, aridity, soil organic carbon content, pH, soil texture, vegetation biomass (NDVI), and habitat types (including the effects of agriculture)5. However, regional-scale analyses of the data are possible mainly in Europe, while tropical and subtropical regions, especially in Africa, are represented poorly.
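One simple way to reduce the dominance of an over-represented region in a global analysis is repeated random subsampling, sketched below in Python. This is a stand-in under stated assumptions and not the resampling procedure used in the synthesis study5: the `continent` field, the function `balanced_draws`, and the rule of capping European sites at the size of the next-largest continent are hypothetical choices, and sampling within each draw is done without replacement.

```python
import random
from collections import Counter


def balanced_draws(sites, over="Europe", n_draws=100, seed=0):
    """Return n_draws site lists in which the continent `over` is subsampled.

    Each draw keeps every site outside the over-represented continent plus a
    random subset of sites from it, capped at the number of sites in the
    next-largest continent.
    """
    rng = random.Random(seed)  # fixed seed for reproducible draws
    over_sites = [s for s in sites if s["continent"] == over]
    others = [s for s in sites if s["continent"] != over]
    sizes = Counter(s["continent"] for s in others)
    cap = min(len(over_sites), max(sizes.values(), default=0))
    return [others + rng.sample(over_sites, cap) for _ in range(n_draws)]
```

A model would then be fitted to each draw and the estimates summarised across draws, so that the result does not depend on one particular subset of European sites.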

Fig. 3

Global distribution of the sampling points and habitat types represented in the database. The density of samples per pixel in a global 100 × 100 coordinate grid is shown in grayscale (light: few samples; dark: many samples). The number of collected samples in each habitat type is shown with a doughnut chart; habitat classification follows the European Environment Agency.

Analyses of temporal variation, especially long-term changes of soil biodiversity9, require time series at different temporal scales. Seasonality is particularly important to consider when addressing macroecological questions, such as latitudinal biodiversity trends and their drivers5. Our database included records from the years 1948–2022, with most data collected between 1975 and 2020 (Fig. 4a). Samples were collected throughout the year, with peak data collection in July–August (i.e., assumed peak springtail activity in northern Europe; Fig. 4b). There were 310 sites that were sampled in multiple years. Most of them were sampled only twice (Fig. 4c). However, 36 sites were sampled in 4 or more years and 5 sites were sampled over a range of 10 or more years (Fig. 4c,d). Therefore, it is possible to analyse long-term changes in springtail communities with two approaches: (1) by using available long-term monitoring data from a few specific sites; (2) by using regional-scale data across different sites within specific habitat types sampled over decades (representative mainly for Europe, as the most studied region). It is also possible to account for seasonality in the global models because information on the sampling month is available for 86.4% of all sampling events5. However, sampling is typically done in the periods of high springtail activity in each climate type5, and there is a clear data gap in the global temporal variation in springtail communities, which should be addressed in future data collections.
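Sites suitable for the first approach (repeated sampling of the same site) can be identified from the events table alone. The Python sketch below counts distinct sampling years per site and the time span they cover; the field names `site_id` and `year` and the function `repeated_sites` are hypothetical, and events without a recorded year are skipped.

```python
from collections import defaultdict


def repeated_sites(events, min_years=2):
    """Return {site_id: (number of distinct years, year range)} for sites
    sampled in at least `min_years` different years."""
    years = defaultdict(set)
    for e in events:
        if e.get("year"):  # skip events with a missing or empty year
            years[e["site_id"]].add(int(e["year"]))
    return {
        site: (len(ys), max(ys) - min(ys))
        for site, ys in years.items()
        if len(ys) >= min_years
    }
```

Raising `min_years` to 4, or filtering the returned year range to 10 or more, would select the kind of longer series discussed above.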

Fig. 4

Temporal coverage of the database. Frequency histograms show the number of samples collected in different years (a) and months (b), and the number of sites where samples were collected in multiple years (c) in a certain time range (d).

Finally, comparability of different datasets in the database depends on the collection and identification methods. Records in the database represent mainly samples collected using area-based methods such as soil cores and animals extracted with heat (i.e. various modifications of Tullgren, Berlese, Kempson or Macfadyen extractors168,169; 92.8%, Fig. 5). Pitfall traps were the second most represented method (7.2%), and we included a single dataset collected using canopy fogging11. Most of the samples represented ‘soil’ (79.9% or 35,953 samples) and ‘litter’ microhabitats (54.8% or 24,676 samples). In total, 9,058 samples represented individual layers within soil cores, while 1,316 samples represented pooled data across samples within sampling sites. Therefore, data filtering and pooling are necessary to perform quantitative analyses of community metrics. In 88.2% of samples, springtails were identified to species level, while in 2.3% they were identified to morphogroups (typically roughly reflecting species-level diversity). For 4.2% of samples, springtails were recorded without further identification (abundance data only), while in the remaining records identification to order, family, or genus is provided (Fig. 5). Since most records in the database are species-level, the database is suitable for evaluating global species-richness patterns and analysing species distributions in space and time.
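The filtering and pooling step mentioned above can be sketched as follows in Python: events are first restricted to one collection method and one identification level, and counts from individual layers are then summed within each soil core. The field names (`event_id`, `core_id`, `method`, `id_level`, `taxon`, `count`) and the category labels `"area-based"` and `"species"` are hypothetical; the actual vocabulary is documented in the ‘Data description’ spreadsheet.

```python
from collections import defaultdict


def pool_layers_to_cores(events, occurrences,
                         method="area-based", id_level="species"):
    """Sum taxon counts across layers within each soil core, keeping only
    events that match the requested method and identification level."""
    # Map each retained event (layer) to the core it belongs to.
    core_of = {
        e["event_id"]: e["core_id"]
        for e in events
        if e["method"] == method and e["id_level"] == id_level
    }
    pooled = defaultdict(int)
    for occ in occurrences:
        core = core_of.get(occ["event_id"])
        if core is not None:  # occurrence belongs to a retained event
            pooled[(core, occ["taxon"])] += int(occ["count"])
    return dict(pooled)
```

Pooling in the other direction is not possible: samples that are available only as site-level averages cannot be split back into cores or layers and should be analysed at the site scale.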

Fig. 5

Collection methods and identification precision represented in the database. Number of collected samples with different methods and the number of samples where springtails were identified to a certain taxonomic resolution level are shown with doughnut charts.

Usage Notes

Our global fine-resolution data on springtail communities can be openly used to address various (macro)ecological questions in space and time. Although our database is fully open, we encourage other researchers to follow our data usage and sharing guidelines: (1) the data can be openly used if a proper attribution to the data providers is given; (2) carefully evaluate representativeness of the data for your particular question; (3) report any issues you encounter; (4) we are there to support you – get in touch with the #GlobalCollembola expert community whenever you have questions. More detailed guidelines and the issue reporting form are available from Figshare together with the full database167.

For most research questions, different spreadsheets in the database need to be combined and summarised. We suggest that you use our R code for filtering and summarising the data. Please take special care while filtering the database, because we kept unreliable records and included data collected using different methods and with different sampling efforts. For analyses using species-level data, account for synonymy of the taxa (see ‘canonical name’ and ‘species’ columns in the Taxonomy spreadsheet). As a note of caution, some species names represent complexes with cryptic genetic diversity170,171 or are ambiguous (as in most invertebrate taxonomic systems), and thus interpretations about species distributions should be made with care.

The database, as a part of the #GlobalCollembola initiative, will be curated and will continue to be expanded with contributions of new data. We will also upload our data to Edaphobase165 and GBIF172 for easier findability and better interoperability. This data paper describes a static version of the database at the publication date, while new updates will be available from other open online sources. We are also working on complementary trait and literature databases on springtails for community use, which will become openly available in upcoming years. This work is currently curated by the core committee of #GlobalCollembola, consisting of 20 volunteer data providers and experts.

Our database is useful for analyses of global and regional spatial patterns in springtail abundance and diversity5. The database includes time series data across seasons and years, and data on spatial variation within sites across samples and soil layers, allowing for in-depth analyses of dynamics of springtail communities. We also believe that the database is a valuable resource for species distribution modelling of soil organisms. All records in our database are the ‘event’ type of data, representing communities where all observed species are also recorded. This allows for reconstructing true absences by comparing species lists of different sites across datasets. Overall, we believe that our data will serve to answer multiple long-standing questions in soil ecology and conservation.
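The reconstruction of absences from community-level records can be illustrated with a short Python sketch that turns per-site species lists into a presence/absence table. The function `presence_absence` and its input structure are hypothetical. An absence derived this way is only as reliable as the underlying samples, so the input should be restricted to comparable events with species-level identification.

```python
def presence_absence(site_species):
    """Build a presence/absence table from per-site species sets.

    site_species: dict mapping a site identifier to the set of standardised
    species names recorded there. A species found at any site but missing
    from a given site's list is scored as absent (0) at that site.
    """
    pool = sorted(set().union(*site_species.values()))  # all species seen
    return {
        site: {species: int(species in found) for species in pool}
        for site, found in site_species.items()
    }


# Placeholder names, not real taxa or database records.
table = presence_absence({"s1": {"Aus bus", "Aus cus"}, "s2": {"Aus bus"}})
print(table["s2"])  # {'Aus bus': 1, 'Aus cus': 0}
```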