Global data on earthworm abundance, biomass, diversity and corresponding environmental properties

Earthworms are an important soil taxon as ecosystem engineers, providing a variety of crucial ecosystem functions and services. Little is known about their diversity and distribution at large spatial scales, despite the availability of considerable amounts of local-scale data. Earthworm diversity data, obtained from the primary literature or provided directly by authors, were collated with information on site locations, including coordinates, habitat cover, and soil properties. Datasets were required, at a minimum, to include abundance or biomass of earthworms at a site. Where possible, site-level species lists were included, as well as the abundance and biomass of individual species and ecological groups. This global dataset contains 10,840 sites, with 184 species, from 60 countries and all continents except Antarctica. The data were obtained from 182 published articles, published between 1973 and 2017, and 17 unpublished datasets. Amalgamating data into a single global database will assist researchers in investigating and answering a wide variety of pressing questions, for example, jointly assessing aboveground and belowground biodiversity distributions and drivers of biodiversity change.

Fig. 1. Locations of the 276 studies included in the database. Each circle represents the centre of a study (a collection of sites where earthworms were sampled with a consistent method). The size of the circle indicates the number of sites within the study. Transparency is used only for aiding visualisation.
Where possible, available data were extracted from the suitable articles. For each suitable article, the meta-data (e.g., the article title and DOI) was compiled (Online-only Table 1). Data were extracted from the article text, tables, figures, or supplementary material (e.g., using ImageJ 28). Where data were not given but were required (Online-only Table 2), authors of the articles were contacted and the raw data (or missing information) were requested. If the authors did not respond, and the required information could not be obtained using an alternate method, the data were not entered into the database. All data were extracted into online data templates, with data from one article (i.e., a dataset) being entered into an individual template, referred to as a 'file'. Each file was given a unique ID, and in total 199 files were created and made open-access.
A file could contain multiple 'studies', where each study was a different sampling event (i.e., one of multiple samples taken at the same site over time) and/or a different sampling methodology. Each study was assigned a unique study ID. Sampled diversity of earthworms is highly dependent on the extraction method used 29. If a dataset did not contain consistent sampling methodologies across all sites (e.g., some sites sampled with hand sorting and others with hand sorting + chemical extraction), thus making it inappropriate to compare earthworm communities, the dataset was split into a separate study for each consistent methodology. If sites had been sampled multiple times, either across multiple years or within years, and the data were available for each sampling period, then only data from the first and the last sampling period were used. Each sampling period was entered as a study, which can help prevent temporal autocorrelation during analysis, e.g., when using a mixed-effects modelling approach.
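The splitting rules above can be illustrated with a minimal sketch. It assumes a hypothetical record layout (the keys `site`, `method`, and `period` are not field names from the data template) and further assumes that the first and last sampling periods are selected separately for each methodology, which the text does not state explicitly.

```python
from collections import defaultdict


def split_into_studies(records):
    """Group site records into studies, one per (method, sampling period).

    Only the first and the last sampling period are kept for each method.
    The record keys used here are hypothetical, chosen for illustration.
    """
    by_method = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_method[rec["method"]][rec["period"]].append(rec["site"])

    studies = {}
    for method, periods in by_method.items():
        ordered = sorted(periods)
        # A method sampled in a single period yields a single study.
        kept = {ordered[0], ordered[-1]}
        for period in sorted(kept):
            studies[(method, period)] = sorted(periods[period])
    return studies


records = [
    {"site": "A", "method": "hand sorting", "period": 2001},
    {"site": "A", "method": "hand sorting", "period": 2002},
    {"site": "A", "method": "hand sorting", "period": 2003},
    {"site": "B", "method": "hand sorting + chemical", "period": 2001},
]
studies = split_into_studies(records)
```

In this toy input, the intermediate 2002 sampling period is dropped, and the two methodologies end up in separate studies.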
A site was defined as a single location where the earthworm community was sampled using an appropriate quantitative methodology. Within each study, each site was given a unique ID (usually based on an ID given in the original source). For each site, information on the sampling methodology, soil properties, and land-use/habitat cover, along with the diversity measurements (site-level species richness, abundance and/or biomass) were entered into the data template (see Online-only Table 2 for full list of variables and the format that was required for the data template). Where possible, data were entered into the data template in the same format as given in the original source. To help enable this, columns often had separate fields to record the units. However, for some fields, values needed to be standardised prior to data entry, such as for the site coordinates and some soil properties (e.g., sand/silt/clay content).
All available and required soil properties for each site were entered into the template. Where a site had soil properties sampled at different depths (e.g., at 0-15, 15-30, and 30-40 cm), the weighted average of the values was entered into the templates. The value was then indicated as being a mean (Online-only Table 2).
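As an illustration of the depth averaging, the sketch below computes a weighted mean in which each layer is weighted by its thickness. The weighting by thickness is an assumption (the weights are not specified in the text), and the function name and organic carbon values are hypothetical.

```python
def depth_weighted_mean(layers):
    """Return the mean of a soil property weighted by layer thickness.

    `layers` is an iterable of (top_cm, bottom_cm, value) tuples.
    Weighting by thickness is an assumption made for this sketch.
    """
    total_thickness = 0.0
    weighted_sum = 0.0
    for top, bottom, value in layers:
        thickness = bottom - top
        if thickness <= 0:
            raise ValueError("each layer needs bottom_cm greater than top_cm")
        total_thickness += thickness
        weighted_sum += thickness * value
    if total_thickness == 0:
        raise ValueError("at least one layer is required")
    return weighted_sum / total_thickness


# Hypothetical organic carbon values (%) at 0-15, 15-30 and 30-40 cm.
carbon_layers = [(0, 15, 3.0), (15, 30, 2.0), (30, 40, 1.0)]
mean_carbon = depth_weighted_mean(carbon_layers)
```

With these inputs the two 15 cm layers contribute more to the mean than the 10 cm layer, giving (15 × 3.0 + 15 × 2.0 + 10 × 1.0) / 40 = 2.125.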
The fields for habitat cover, land-use, and management system were predefined categories based on ESA CCI-LC (https://www.esa-landcover-cci.org/), the Land-use Harmonization dataset 30,31 (Fig. 2), and expert opinion (during the sWorm workshops), respectively. These classification systems were chosen based on knowledge of what external pressures might be important for explaining earthworm communities, whilst also ensuring consistency across all regions of the globe. Based on information given within the published article, or from the data providers directly, every site was classified into one of the categories for each of these fields. When information was missing, sites were classified as "unknown". Additional information on the land use and management system classification definitions is shown in Tables 1 and 2, respectively.
As sampling effort also impacts diversity measurements 32, the sampling effort at each site was recorded. Effort was recorded in two ways:
1. The area that was sampled, e.g., of a single quadrat or soil block, or the total area across all quadrats or blocks, depending on how the data were presented.
2. The number of times a site was sampled, either temporally or spatially. If a site was sampled over multiple time periods, this was the number of occasions on which the site was sampled. If the site had multiple samples (e.g., multiple quadrats) and the diversity measure was an average, the sampling effort was 1. If the diversity measure was a total (e.g., the total number of species across all quadrats), the sampling effort was the total number of samples (e.g., quadrats).
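The second effort measure can be sketched as a small decision rule. The function and parameter names are hypothetical, and the precedence given to temporal sampling when a site was sampled both repeatedly and with several quadrats is an assumption, since that combination is not described in the text.

```python
def sampling_effort(n_occasions=1, n_samples=1, measure_is_average=False):
    """Return the number-of-samples effort for one site.

    n_occasions: number of time periods in which the site was sampled.
    n_samples: number of spatial samples (e.g., quadrats) at the site.
    measure_is_average: True if the diversity measure is a mean across samples.
    """
    if n_occasions > 1:
        # Assumption: temporal sampling takes precedence over spatial sampling.
        return n_occasions
    if measure_is_average:
        return 1
    return n_samples


effort_average = sampling_effort(n_samples=5, measure_is_average=True)
effort_total = sampling_effort(n_samples=5)
effort_temporal = sampling_effort(n_occasions=3)
```

Under these rules, five quadrats reported as a mean give an effort of 1, five quadrats reported as a total give an effort of 5, and three sampling occasions give an effort of 3.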
When datasets contained information at a higher resolution than total abundance or biomass of earthworms at a site (i.e., at ecological group, genus, or species level), this information was entered into the species occurrence table (Online-only Table 3). Each row contained a measurement of an observation (e.g., species, morphospecies, genus, life stage or ecological group) at a single site. The measurement could be the presence only, abundance, or fresh biomass of the record. Where possible, for each row we also included the life stage (adult or juvenile), whether the species was native to the location or not, and the ecological group (epigeic, endogeic, anecic, epi-endogeic). Thus, if the diversity measure was for all the juveniles at the site regardless of species, columns such as the species binomial and genus would be empty, but the life stage would be completed. Every species binomial and ecological group assignment was checked using DriloBASE and by earthworm taxonomists (GB, MJIB, MLCB, PL); see 'Technical Validation'.
Where site-level diversity measures were given by the data provider, these were entered into the site-level sheet. Where site-level diversity measures were not given, but could be calculated from the species occurrence information, that was done in R 33, following data entry and prior to subsequent analyses. The species present at each site, as given in the species occurrence data, were used for calculating species richness; this included species identified as sub-species. If data collectors identified a specimen as a morphospecies (i.e., a species delineation based solely on morphological characteristics, typically identified to genus level with a unique ID differentiating it from other species of the same genus, as determined by the original data collector), it was included in the species richness estimate as an additional species. Unidentified species grouped as 'unknown' were excluded (Fig. 3). As juveniles of many earthworm species are hard to identify to species level 29,34, juveniles were excluded from the calculation (even if identified at family level). All earthworms (including juveniles) found at a site were included in the total biomass and abundance calculations.
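The richness rules above can be sketched as follows. The original calculations were done in R; this Python version is only an illustration, and the row keys (`species_binomial`, `morphospecies`, `life_stage`, `abundance`) are hypothetical rather than the actual column names of the species occurrence table.

```python
def named_taxon(row):
    """Return the binomial or morphospecies label of a row, or None."""
    name = row.get("species_binomial") or row.get("morphospecies")
    if not name or name.strip().lower() == "unknown":
        return None
    return name.strip()


def species_richness(rows):
    """Count distinct named taxa at a site, excluding juveniles and unknowns."""
    taxa = set()
    for row in rows:
        if row.get("life_stage") == "juvenile":
            continue
        name = named_taxon(row)
        if name is not None:
            taxa.add(name)
    return len(taxa)


def total_abundance(rows):
    """Sum abundance over all rows, including juveniles and unknowns."""
    return sum(row.get("abundance", 0.0) for row in rows)


site_rows = [
    {"species_binomial": "Lumbricus terrestris", "life_stage": "adult", "abundance": 4.0},
    {"species_binomial": "Aporrectodea caliginosa", "life_stage": "adult", "abundance": 10.0},
    {"morphospecies": "Aporrectodea sp. 1", "life_stage": "adult", "abundance": 2.0},
    {"species_binomial": "unknown", "life_stage": "adult", "abundance": 1.0},
    {"species_binomial": "Octolasion cyaneum", "life_stage": "juvenile", "abundance": 6.0},
]
richness = species_richness(site_rows)
abundance = total_abundance(site_rows)
```

For this hypothetical site, the morphospecies counts as an additional species, while the 'unknown' record and the juvenile record are left out of richness but still contribute to total abundance.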
After the ecological grouping (epigeic, endogeic, anecic, and epi-endogeic) of each species had been assigned and/or checked by the earthworm taxonomists, diversity measures within each ecological group at a site were also calculated. As with the site-level metrics, the species richness within each ecological group was calculated using only species with binomials or morphospecies. Biomass and abundance of each ecological group at a site were calculated regardless of species identity. The total number of ecological groups at each site was calculated regardless of abundance, biomass, life stage or native status of the species included (maximum ecological group richness = 4).
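A minimal sketch of the ecological group metrics is given below, under the same assumptions as before: the row keys are hypothetical, and any row carrying a binomial or morphospecies label is counted towards within-group richness (whether identified juveniles were counted here is not stated in the text).

```python
VALID_GROUPS = {"epigeic", "endogeic", "anecic", "epi-endogeic"}


def ecological_group_summary(rows):
    """Return per-group richness and abundance for one site."""
    collected = {}
    for row in rows:
        group = row.get("ecological_group")
        if group not in VALID_GROUPS:
            continue
        entry = collected.setdefault(group, {"taxa": set(), "abundance": 0.0})
        # Abundance is summed regardless of species identity.
        entry["abundance"] += row.get("abundance", 0.0)
        name = row.get("species_binomial") or row.get("morphospecies")
        if name:
            entry["taxa"].add(name)
    return {
        group: {"richness": len(entry["taxa"]), "abundance": entry["abundance"]}
        for group, entry in collected.items()
    }


site_rows = [
    {"species_binomial": "Lumbricus terrestris", "ecological_group": "anecic", "abundance": 4.0},
    {"species_binomial": "Aporrectodea caliginosa", "ecological_group": "endogeic", "abundance": 10.0},
    {"ecological_group": "endogeic", "life_stage": "juvenile", "abundance": 6.0},
]
summary = ecological_group_summary(site_rows)
group_richness = len(summary)  # number of ecological groups present, at most 4
```

Here the unnamed endogeic juveniles add to the endogeic abundance without adding to its richness, and two of the four possible ecological groups are present.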

Data Records
The data presented here are available in the iDiv data portal (https://doi.org/10.25829/idiv.1880-17-3189; Dataset ID: 1880) 19 in a static form. In addition, the full dataset will be hosted by Edaphobase (www.portal.edaphobase). In the future, the version in Edaphobase might change (e.g., with species name revisions, or requests from the data providers) and will hopefully be extended with additional earthworm records (or other soil taxa).
The data are stored in three tables: meta-data (Online-only Table 1), site-level (Online-only Table 2), and species occurrence (Online-only Table 3). The file ID links the meta-data to the site-level data, and the Study ID and the Site ID link the site-level data to the species occurrence table.
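The linkage between the three tables can be sketched as below. The key names (`file_id`, `study_id`, `site_id`) and the toy rows are hypothetical stand-ins for the actual column names and contents, which are listed in Online-only Tables 1 to 3.

```python
from collections import defaultdict


def link_tables(metadata, sites, occurrences):
    """Attach meta-data and occurrence rows to each site-level row."""
    occurrences_by_site = defaultdict(list)
    for occ in occurrences:
        # Study ID and Site ID together identify a site.
        occurrences_by_site[(occ["study_id"], occ["site_id"])].append(occ)

    linked = []
    for site in sites:
        key = (site["study_id"], site["site_id"])
        linked.append({
            **site,
            "metadata": metadata.get(site["file_id"]),  # File ID links to meta-data
            "occurrences": occurrences_by_site.get(key, []),
        })
    return linked


metadata = {"F001": {"title": "Example article"}}
sites = [
    {"file_id": "F001", "study_id": "S1", "site_id": "plot_a"},
    {"file_id": "F001", "study_id": "S1", "site_id": "plot_b"},
]
occurrences = [
    {"study_id": "S1", "site_id": "plot_a",
     "species_binomial": "Lumbricus terrestris", "abundance": 4.0},
]
linked = link_tables(metadata, sites, occurrences)
```

Because site IDs are only unique within a study, the sketch uses the pair of Study ID and Site ID as the lookup key; sites without occurrence rows simply receive an empty list.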

Table 1. Definitions for the land use category. The land use classification was based on the Land-use Harmonization dataset 30,31. To map to the original classification system, 'Production - Wood plantations' and 'Production - Plantation crops' would be 'Secondary', and 'Production - Arable' would be 'Cropland'.

Land use category | Definition
Primary | Relatively undisturbed natural habitat
Secondary | Recovering, previously disturbed natural habitat
Pasture | Land used for the grazing of livestock
Production - Plantation crops | Land used for plantation crops (e.g., coffee, vineyards, oil palm)
Production - Wood plantations | Land used for timber production (e.g., teak)
Urban | Land converted to dense urban settlement
Unknown | If the land use is not given or is not clear

Table 2. A management classification system was created during the sWorm workshops. For each managed site (i.e., not natural vegetation) the management system could also be identified (table headers), and additional management intensity variables could also be captured (table rows). However, not every management intensity variable was applicable to each management system, so restrictions were placed. '×' indicates which management intensity variable was applicable to each management system.

Management intensity measure | Annual crops | Integrated systems | Perennial crops | Pastures (grazed lands) | Tree plantations

For all suitable datasets, the meta-data information was completed. The meta-data contains bibliographic information on the original paper which analysed, or published, the data, as well as contact information of the person who provided the raw data (not included in the release of the database for privacy reasons). The meta-data also included the number of sites and studies within the file, so that validation checks could be completed.
Online-only Table 1 shows all fields within the meta-data; personal information of data providers has not been made available.
Information on all sampled sites within each dataset was recorded in the site-level table (Online-only Table 2). Each row represents a single site within a study, with information on the sampling methodology, soil properties, and how the land was used, managed, and covered. The site-level earthworm community metrics (species richness, abundance and biomass) are also included if available.
Site-level species lists, and abundance and/or biomass measures for individual records, are given in the species occurrence table (Online-only Table 3). Each row is a measurement of an observation at a site (22,690 non-zero observations in total). An observation could relate to a species (with a scientific binomial, e.g., the abundance of Lumbricus terrestris at a site, or a morphospecies identification), a genus, life stage, ecological group, or native/non-native group (e.g., the abundance of all non-native species at a site). Details of the native/non-native status of a species were only available when provided by the original data collector.

Technical Validation
Templates used to enter the individual datasets were designed so that fields were only allowed certain values and formats where possible. This helped to reduce spelling errors, slight inconsistencies, and incorrect values being entered. Data providers were contacted if details within their raw data were unclear. As multiple people entered data into the templates, detailed documentation was created at the start of the project to ensure consistency amongst those involved. In addition, a subset of datasets was checked by several curators.
All earthworm species names were checked against DriloBASE (http://taxo.drilobase.org) to identify potential synonyms and spelling mistakes. Following that, earthworm specialists and taxonomists (GB, MJIB, MLCB and PL) checked the scientific names, removed synonyms and updated names if taxonomies had changed. Where ecological groupings were missing, the earthworm taxonomists also added them where possible, based on the available literature.

Usage Notes
Land-use fields were based on classification schemes, and may not be the most suitable for the analysis of earthworms. We included a free-text field ("Habitat as described") that could be used by future researchers to define their own classification scheme for land-use or habitat cover.
As diversity measures are highly influenced by sampling methodology, we included information on sampling methods in the database (Fig. 4). In addition, we would expect that variation in diversity would differ between the individual datasets due to, for example, inter-observer variability. We highly recommend that statistical methods used on this database take these between-dataset variations into account.
Despite our efforts to obtain a global dataset, there is a geographic bias (Fig. 1), such that sites are highly clustered in certain regions (e.g., Europe), sparse in others (e.g., South America), or lacking (e.g., southern Africa, northern Russia). To reduce such biases, we attempted to contact as many researchers as possible in such areas to acquire data. Although this helped to improve the data coverage, it did not remove the gaps. We hope to address these gaps in the future, but in the meantime, researchers should be aware of the influence these biases might have on their analyses 35,36.

Acknowledgements
This database and paper are a product of two sWorm workshops at sDiv, the synthesis center at iDiv. We thank M.

Competing Interests
The authors declare no competing interests.