Global fine-resolution data on springtail abundance and community structure

Springtails (Collembola) inhabit soils from the Arctic to the Antarctic and comprise an estimated ~32% of all terrestrial arthropods on Earth. Here, we present a global, spatially-explicit database on springtail communities that includes 249,912 occurrences from 44,999 samples and 2,990 sites. These data are mainly raw sample-level records at the species level collected predominantly from private archives of the authors that were quality-controlled and taxonomically-standardised. Despite covering all continents, most of the sample-level data come from the European continent (82.5% of all samples) and represent four habitats: woodlands (57.4%), grasslands (14.0%), agrosystems (13.7%) and scrublands (9.0%). We included sampling by soil layers, and across seasons and years, representing temporal and spatial within-site variation in springtail communities. We also provided data use and sharing guidelines and R code to facilitate the use of the database by other researchers. This data paper describes a static version of the database at the publication date, but the database will be further expanded to include underrepresented regions and linked with trait data.


Background & Summary
Soil biodiversity represents a major fraction of life on Earth 1,2. Despite that, we know little globally about the current status and trends of soil life, especially invertebrates. Over the last few years, our knowledge of the global distribution of earthworms 3, nematodes 4, springtails 5, ants 6 and other macrofauna 7 has advanced, showing trends different from those of aboveground biodiversity 8. This urges us to deliver open and in-depth knowledge on soil animal life for nature conservation and for understanding the functioning of terrestrial ecosystems 9. To help with this task, we here present a comprehensive fine-resolution database on the global distribution of springtails (Collembola), based on a compilation of published and unpublished data from researchers worldwide.
With a truly worldwide distribution, springtails account for ~32% of global terrestrial arthropod abundance 10 and have a global biomass of ~27.5 megatonnes of carbon 5. They are especially numerous in cold regions, but are also ubiquitous in tropical soils 5, and even in tropical canopies 11. Springtails are central components of the belowground system, affecting litter decomposition, microbial activity, abundance and dispersal, and plant growth, and serving as food for numerous invertebrate predators 12. Despite a moderate total diversity (~9500 described species 13), springtail communities typically host dozens of species in a few square metres 5. Due to their ubiquitous presence, high abundance and high local diversity, springtails represent an ideal model taxon for macroecological studies and are useful bioindicators, but so far data limitations have constrained studies to questions at local to regional scales.
In this paper, we describe a novel database mainly compiled from private archives of contributing authors that served as the basis for the recently published global synthesis study on springtail abundance and diversity 5. While the site-level summaries of springtail community parameters have been published together with the synthesis 5, here we present much more detailed sample-level data that include taxonomic names and 16 additional datasets (1398 new samples). With this effort, we complement the previously published data papers on nematodes 14 and earthworms 15 in describing global soil invertebrate diversity. We also take a step further by providing quality-controlled species-level data with standardised taxonomic names at fine-scale resolution, i.e. from individual samples, or even soil layers, within each sampling site. Our dataset allows for both analyses of global and regional patterns of diversity and community composition, species distributions, and within-site

Data collection. All data were entered into a common Microsoft Excel template (Supplementary materials, Data template). The template included 30 columns describing the sampling approach and counts of springtail taxa. The following minimum set of variables was collected: collectors, collection method (including sampling area and depth), extraction method, identification precision and literature, collection date, latitude and longitude, and vegetation type (grassland, scrub, woodland, agriculture and other). Each contributed dataset was checked manually by a trained assistant for technical mistakes and completeness, and was complemented by the authors if necessary. Geographical coordinates were checked using Google Maps. We additionally computed descriptive statistics to check the consistency of the dataset (number of sites, samples, layers) and converted the data in the template into two standard tables, an events table (describing samples) and an occurrence table (describing taxa counts), in R v. 4.0.2 166 with RStudio interface v. 1.4.1103 (RStudio, PBC). The final events table across datasets was then checked for typos, consistency in vocabulary and outliers using OpenRefine v3.3 (https://openrefine.org; Fig. 1).
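The conversion from the wide template into the two standard tables can be illustrated with a minimal sketch. It is written in Python rather than the R used for the database, and the field names (`site`, `layer`), the event identifier format and the example taxa are hypothetical placeholders rather than the actual template columns.

```python
def split_template(rows, meta_fields):
    """Split wide template rows into an events table and a long occurrence table.

    ASSUMPTION: every column not listed in `meta_fields` holds counts of one taxon.
    """
    events, occurrences = [], []
    for i, row in enumerate(rows, start=1):
        event_id = f"E{i:04d}"  # hypothetical identifier scheme
        event = {"event_id": event_id}
        event.update({field: row[field] for field in meta_fields})
        events.append(event)
        for column, count in row.items():
            if column in meta_fields:
                continue
            if count:  # zero or empty cells produce no occurrence record
                occurrences.append(
                    {"event_id": event_id, "taxon": column, "count": count}
                )
    return events, occurrences


# Two illustrative samples (soil layers) from one hypothetical site.
template_rows = [
    {"site": "A", "layer": "litter", "Folsomia quadrioculata": 12, "Isotomiella minor": 0},
    {"site": "A", "layer": "soil", "Folsomia quadrioculata": 3, "Isotomiella minor": 7},
]
events, occurrences = split_template(template_rows, meta_fields=("site", "layer"))
```

Each row of the template becomes one event, and each non-zero count becomes one occurrence that points back to its event through `event_id`.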

Data evaluation. Every contributed dataset underwent a manual expert evaluation. Our evaluation process involved a board of springtail specialists, each with extensive research experience in specific geographic regions (expert names are listed in the events spreadsheet of the database). The experts individually scored each dataset based on three criteria: the reliability of (1) the density estimates, (2) the species richness estimates, and (3) the species names provided. The quality of the density estimation was determined by considering the sampling and extraction method, as well as the density estimate itself for the given ecosystem type. The quality of the species richness estimation and of the species names was assessed by considering the identification key used, the experience of the scientist identifying the animals, the species list and the species richness estimate itself for the given ecosystem type. Datasets that were deemed "unreliable" during the evaluation process were still included in the database, but the evaluation results by the experts are provided alongside the data.
Taxonomic alignment. To make taxonomic lists comparable across contributed datasets, we checked all taxonomic names against the global checklist of Collembola (www.collembola.org). We did this using the 'Species matching' tool of the Global Biodiversity Information Facility (https://www.gbif.org/tools/species-lookup), which hosts the global checklist of Collembola from 2023. Original names were kept in the database together with the standardised names. For synonyms, accepted species names were provided. For morphospecies, described taxonomic names of higher ranks (usually genera) were given. Taxonomic hierarchy (genera, families, orders) and other taxonomic information were summarised in an additional spreadsheet. Unfortunately, it was not possible to fully control for factually wrong original identifications, even though the species lists were checked by experts (see above), but most of the records were judged as reliable.
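The alignment rules described above (keep the original name, replace synonyms with accepted names, assign morphospecies to a higher rank) can be sketched as a simple lookup. This is an illustrative Python sketch, not the GBIF matching tool: the function name, the lookup tables and their entries are assumptions made for the example.

```python
# Illustrative lookup tables; ASSUMPTION: not extracted from the database.
ACCEPTED = {"Parisotoma notabilis", "Folsomia quadrioculata"}
SYNONYMS = {"Isotoma notabilis": "Parisotoma notabilis"}
OPEN_NOMENCLATURE = {"sp", "spp"}  # markers of morphospecies such as "Folsomia sp. 1"


def standardise_name(original):
    """Return (standardised name, rank) for an original name, or (None, 'unmatched')."""
    name = " ".join(original.split())  # collapse stray whitespace
    if name in ACCEPTED:
        return name, "species"
    if name in SYNONYMS:
        return SYNONYMS[name], "species"
    tokens = name.split()
    if len(tokens) > 1 and tokens[1].rstrip(".") in OPEN_NOMENCLATURE:
        return tokens[0], "genus"  # morphospecies fall back to the genus
    return None, "unmatched"  # left for manual expert checking


# Original names are kept next to the standardised ones.
aligned = {name: standardise_name(name)
           for name in ["Isotoma  notabilis", "Folsomia sp. 1", "Folsomia quadrioculata"]}
```

Names that match neither table are returned as unmatched rather than guessed, mirroring the point that wrong original identifications cannot be fully corrected automatically.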

Data Records
The final dataset included 380 datasets representing 2,990 sites, 44,999 samples and 249,912 occurrences (i.e. observations of taxa in samples). In total, 1,441 taxa including 1,202 species were recorded in the occurrence data. The data were provided on different scales. Most samples represented single layers (i.e. litter, topsoil, deeper soil layers) in a soil core (i.e. soil monolith) or single cores in a sampling site (Fig. 1 'scales'). However, some data were available only as averages across samples at the sampling site level (typically an area up to a hundred metres in diameter). The data were organised in three spreadsheets in csv format: (1) Events, representing a list of all samples with described methodology, locations, and sampling times; (2) Occurrences, representing a list of all observations of taxa in all samples; and (3) Taxonomy, representing a list of unique taxonomic names present in the occurrence data and associated standardised taxonomic names and other taxonomic information. Furthermore, we provided an R script to link the three spreadsheets together, summarise them by soil cores and sites, and filter out unreliable and out-of-scope data (Fig. 2). As an example, we also provided a csv spreadsheet with average densities and the total species richness of springtails per site, collected with area-based methods. To facilitate data re-use, we provide a separate Excel spreadsheet with detailed descriptions of all data fields ('Data description'). All data spreadsheets, R code and other related information are available from Figshare 167.
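The logic of linking the three spreadsheets, filtering and summarising by site can be sketched as follows. This Python sketch is not the R script distributed with the database; the field names (`event_id`, `site`, `method`, `reliable`) and the toy records are hypothetical and would need to be replaced with the fields documented in 'Data description'.

```python
from collections import defaultdict

# Toy stand-ins for the Events, Occurrences and Taxonomy spreadsheets.
events = [
    {"event_id": "E1", "site": "S1", "method": "soil core", "reliable": True},
    {"event_id": "E2", "site": "S1", "method": "soil core", "reliable": True},
    {"event_id": "E3", "site": "S2", "method": "pitfall trap", "reliable": True},
    {"event_id": "E4", "site": "S3", "method": "soil core", "reliable": False},
]
occurrences = [
    {"event_id": "E1", "taxon": "Isotoma notabilis"},
    {"event_id": "E1", "taxon": "Folsomia candida"},
    {"event_id": "E2", "taxon": "Parisotoma notabilis"},
    {"event_id": "E3", "taxon": "Folsomia candida"},
    {"event_id": "E4", "taxon": "Folsomia candida"},
]
taxonomy = {"Isotoma notabilis": "Parisotoma notabilis"}  # original -> standardised


def site_richness(events, occurrences, taxonomy, method="soil core"):
    """Count standardised taxa per site, keeping reliable events of one method."""
    site_of_event = {
        e["event_id"]: e["site"]
        for e in events
        if e["reliable"] and e["method"] == method
    }
    taxa_per_site = defaultdict(set)
    for occ in occurrences:
        site = site_of_event.get(occ["event_id"])
        if site is None:
            continue  # event was filtered out
        taxa_per_site[site].add(taxonomy.get(occ["taxon"], occ["taxon"]))
    return {site: len(taxa) for site, taxa in taxa_per_site.items()}


richness = site_richness(events, occurrences, taxonomy)
```

In this toy case the synonym in event E1 and the accepted name in event E2 collapse into one taxon, so site S1 has a richness of 2, while the pitfall-trap and unreliable events are excluded.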

Technical Validation
Statistical soundness of the database depends on the research question addressed. Below we show the representativeness of our data for the main types of ecological analysis by describing its spatial and temporal scope, as well as the sampling and identification approaches.
Most macroecological studies require representation of different geographical regions, climates, and ecosystem types 4. Since the database is based on an open call for collection of already produced data, there is a clustered spatial distribution of data points in well-explored regions and high variation in collection methodologies. Most collected sample-level data come from Europe (82.5% or 37,137 samples), while other continents are less represented: Asia (5.6% or 2,508 samples), North America (3.4% or 1,528 samples), South America and the Caribbean (3.2% or 1,457 samples), Africa (2.8% or 1,269 samples), Australia (2.1% or 944 samples) and Antarctica (0.3% or 156 samples; Fig. 3). Across habitat types, woodlands are the most represented (57.4% of samples), followed by grasslands (14.0%), agriculture (13.7%), scrub (9.0%) and others (5.9%; Fig. 3). Using bootstrapping of the European data, we were able to perform a balanced analysis of the data in our synthesis study, and cover global gradients in mean annual temperature, precipitation, aridity, soil organic carbon content, pH, soil texture, vegetation biomass (NDVI), and habitat types (including the effects of agriculture) 5. However, regional-scale analyses of the data are possible mainly in Europe, while tropical and subtropical regions, especially in Africa, are poorly represented.
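One way to reduce the dominance of European samples is to resample them with replacement down to the size of the next largest region. The Python sketch below illustrates that idea using the sample counts per continent given above. The target size and the function name are assumptions made for this example; the procedure actually used in the synthesis study is described in ref. 5 and may differ.

```python
import random
from collections import Counter

# Sample counts per continent as reported in the text.
SAMPLES_PER_CONTINENT = {
    "Europe": 37137, "Asia": 2508, "North America": 1528,
    "South America and the Caribbean": 1457, "Africa": 1269,
    "Australia": 944, "Antarctica": 156,
}


def bootstrap_balanced(labels, dominant="Europe", seed=0):
    """Return sample indices with the dominant region resampled with replacement.

    ASSUMPTION: the dominant region is drawn down to the size of the
    largest other region; all other samples are kept once.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    dominant_idx = [i for i, lab in enumerate(labels) if lab == dominant]
    other_idx = [i for i, lab in enumerate(labels) if lab != dominant]
    if not dominant_idx:
        return other_idx
    target = max(
        (n for region, n in counts.items() if region != dominant),
        default=len(dominant_idx),
    )
    return other_idx + rng.choices(dominant_idx, k=target)


# One label per sample; a real analysis would carry the full sample records.
labels = [c for c, n in SAMPLES_PER_CONTINENT.items() for _ in range(n)]
draw = bootstrap_balanced(labels)
drawn_counts = Counter(labels[i] for i in draw)
```

Repeating the draw with different seeds and averaging model results across draws gives a bootstrap-style estimate that is less driven by the European samples.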
Analyses of temporal variation, especially long-term changes of soil biodiversity 9, require time series at different temporal scales. Seasonality is particularly important to consider when addressing macroecological questions, such as latitudinal biodiversity trends and their drivers 5. Our database includes records from the years 1948-2022, with most data collected between 1975 and 2020 (Fig. 4a). Samples were collected throughout the year, with peak data collection in July-August (i.e., the assumed peak of springtail activity in northern Europe; Fig. 4b). There were 310 sites that were sampled in multiple years, most of them only twice (Fig. 4c). However, 36 sites were sampled in 4 or more years and 5 sites were sampled over a range of 10 or more years (Fig. 4c,d). Therefore, it is possible to analyse long-term changes in springtail communities with two approaches: (1) by using available long-term monitoring data from a few specific sites; (2) by using regional-scale data across different sites within specific habitat types sampled over decades (representative mainly for Europe, as the most studied region). It is also possible to account for seasonality in the global models because information on the sampling month is available for 86.4% of all sampling events 5. However, sampling is typically done in the periods of high springtail activity in each climate type 5, and there is a clear data gap regarding global temporal variation in springtail communities, which should be addressed in future data collections.
Finally, the comparability of different datasets in the database depends on the collection and identification methods. Records in the database mainly represent samples collected using area-based methods such as soil cores, with animals extracted using heat (i.e. various modifications of Tullgren, Berlese, Kempson or Macfadyen extractors 168,169; 92.8%, Fig. 5). Pitfall traps were the second most represented method (7.2%), and we included a single dataset collected using canopy fogging 11. Most of the samples represented 'soil' (79.9% or 35,953 samples) and 'litter' microhabitats (54.8% or 24,676 samples). In total, 9,058 samples represented individual layers within soil cores, while 1,316 samples represented pooled data across samples within sampling sites. Therefore, data filtering and pooling are necessary to perform quantitative analyses of community metrics. In 88.2% of samples, springtails were identified to species level, and in 2.3% to morphogroups (typically roughly reflecting species-level diversity). For 4.2% of samples, springtails were recorded without further identification (abundance data only), while in the remaining records identifications to order, family, or genus are provided (Fig. 5). Since most records in the database are at the species level, the database is suitable for evaluating global species-richness patterns and analysing species distributions in space and time.
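The pooling step mentioned above can be illustrated for samples that represent individual layers within soil cores: counts are summed across layers of the same core and converted to density per square metre using the core area. This is a minimal Python sketch under stated assumptions; the field names (`core_id`, `area_cm2`, `count`) and the example values are hypothetical.

```python
from collections import defaultdict


def core_density(layer_records):
    """Pool layer counts per soil core and return density in individuals per m2.

    ASSUMPTION: all layers of one core share the same surface area,
    given in square centimetres.
    """
    totals = defaultdict(int)
    area_cm2 = {}
    for rec in layer_records:
        totals[rec["core_id"]] += rec["count"]
        area_cm2[rec["core_id"]] = rec["area_cm2"]
    # 1 m2 = 10,000 cm2
    return {core: totals[core] * 10_000 / area_cm2[core] for core in totals}


layer_records = [
    {"core_id": "C1", "layer": "litter", "area_cm2": 25, "count": 30},
    {"core_id": "C1", "layer": "soil", "area_cm2": 25, "count": 20},
    {"core_id": "C2", "layer": "soil", "area_cm2": 50, "count": 10},
]
densities = core_density(layer_records)
```

Here the two layers of core C1 contain 50 individuals in 25 cm2, which corresponds to 20,000 individuals per m2. Samples that are already pooled at the site level would need to be handled separately rather than summed again.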

Usage Notes
Our global fine-resolution data on springtail communities can be openly used to address various (macro)ecological questions in space and time. Although our database is fully open, we encourage other researchers to follow our data usage and sharing guidelines: (1) the data can be openly used if proper attribution to the data providers is given; (2) carefully evaluate the representativeness of the data for your particular question; (3) report any issues you encounter; (4) we are here to support you, so get in touch with the #GlobalCollembola expert community whenever you have questions. More detailed guidelines and the issue reporting form are available from Figshare together with the full database 167.
For most research questions, different spreadsheets in the database need to be combined and summarised. We suggest that you use our R code for filtering and summarising the data. Please take special care while filtering the database: we kept unreliable records, and included data collected using different methods and with different sampling efforts. For analyses using species-level data, take care with synonymy of the taxa (see the 'canonical name' and 'species' columns in the Taxonomy spreadsheet). As a note of caution, some species names represent complexes with cryptic genetic diversity 170,171, or are ambiguous (as in most invertebrate taxonomic systems), and thus species distributions should be interpreted with care.
The database, as a part of the #GlobalCollembola initiative, will be curated and will continue to be expanded with contributions of new data. We will also upload our data to Edaphobase 165 and GBIF 172 for easier findability and better interoperability. This data paper describes a static version of the database at the publication date, while new updates will be available from other open online sources. We are also working on complementary trait and literature databases on springtails for community use, which will become openly available in the upcoming years. This work is currently curated by the core committee of #GlobalCollembola, consisting of 20 volunteer data providers and experts.
Our database is useful for analyses of global and regional spatial patterns in springtail abundance and diversity 5. The database includes time series data across seasons and years, and data on spatial variation within sites across samples and soil layers, allowing for in-depth analyses of the dynamics of springtail communities. We also believe that the database is a valuable resource for species distribution modelling of soil organisms. All records in our database are the 'event' type of data, representing communities where all observed species are also recorded. This allows for reconstructing true absences by comparing species lists of different sites across datasets. Overall, we believe that our data will serve to answer multiple long-standing questions in soil ecology and conservation.
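Because every event lists all species observed, absences can be reconstructed by comparing each site's species list with the pooled list across sites. The Python sketch below shows the idea with hypothetical site and species identifiers. It assumes that the compared sites were sampled with comparable methods and identified to species level, which has to be checked by filtering beforehand.

```python
def presence_absence(site_species):
    """Build a site-by-species matrix with 1 = recorded and 0 = not recorded.

    ASSUMPTION: `site_species` maps each site to the full set of
    species observed there after filtering for comparable methods.
    """
    pool = sorted(set().union(*site_species.values()))
    return {
        site: {species: int(species in found) for species in pool}
        for site, found in site_species.items()
    }


site_species = {
    "S1": {"Parisotoma notabilis", "Folsomia quadrioculata"},
    "S2": {"Folsomia quadrioculata", "Isotomiella minor"},
}
matrix = presence_absence(site_species)
```

In this example `matrix["S1"]["Isotomiella minor"]` is 0, an absence inferred from the complete species list of site S1. Such zeros are only as reliable as the sampling effort and identification behind each list.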

Fig. 1 Data collection and evaluation in #GlobalCollembola. Most of the data are raw data collected from archives of the contributing authors of the paper. The data were collected using an Excel template and included in the final database after technical and expert cleaning of each dataset. No data were excluded; instead, an expert evaluation is provided for each dataset. Whenever possible, we recorded species occurrences in individual samples (soil cores).

Fig. 2 Database structure of #GlobalCollembola. The database consists of three main spreadsheets: (1) Events, (2) Occurrences, and (3) Taxonomy. The spreadsheets can be linked, summarised, and filtered using the associated R script to produce site-level averages.

Fig. 3 Global distribution of the sampling points and habitat types represented in the database. The density of samples per pixel in a global 100 × 100 coordinate grid is shown in grayscale (light: few samples, dark: many samples). The number of collected samples in each habitat type is shown with a doughnut chart; habitat classification follows the European Environment Agency.

Fig. 4 Temporal coverage of the database. Frequency histograms show the number of samples collected in different years (a) and months (b), the number of sites where samples were collected in multiple years (c), and the time range covered by sampling at those sites (d).

Fig. 5 Collection methods and identification precision represented in the database. The number of samples collected with different methods and the number of samples in which springtails were identified to a certain taxonomic resolution level are shown with doughnut charts.