Background & Summary

Soils are considered to be one of the most biodiverse terrestrial habitats1,2,3. Despite this, very little is known about the biodiversity that resides there compared to aboveground biodiversity, especially at the global scale1,4,5. This is surprising given the large number of local-scale biodiversity datasets available in the published literature. A number of studies have amalgamated local scale datasets, primarily for aboveground or marine organisms e.g.6,7, which can then be used for large-scale analyses e.g.8,9. Belowground biodiversity data are often overlooked in these large biodiversity databases4, and thus separate efforts to collate data are just now starting to emerge for certain belowground taxa, particularly microbes e.g.10,11.

Earthworms are involved in a large number of ecosystem functions and services, such as decomposition12, nutrient cycling13 and climate regulation14, amongst others13. In addition, they are often used as bioindicators of soil biodiversity and health15. Earthworms are relatively easy to sample; thus, a large amount of data is available16. Nevertheless, previous attempts to collate earthworm datasets have been geographically restricted17,18 or focused on country or regional species lists (e.g., DriloBASE). By collating site-level diversity measures, we can also collect information on factors that might determine community composition, for example, measurements of soil properties or land use and cover.

Here, we describe a global database of local earthworm diversity and associated site-level characteristics from 10,840 sites in 60 countries (Fig. 1)19. Site-level information includes at least one sampled soil property, land use, and habitat cover for just over 58% of sites. Measurements of earthworm species richness (including species lists where available), total abundance, and biomass were collected at the site level and, for some sites, as species occurrences (i.e., the abundance and biomass of each species recorded at a site). In addition, using expert opinion and details given by data providers, we classified each earthworm species into ecological groups based on their feeding and burrowing behaviours (epigeics, endogeics, anecics, epi-endogeics; more details below20).

Fig. 1

Locations of the 276 studies included in the database. Each circle represents the centre of a study (a collection of sites where earthworms were sampled with a consistent method). The size of the circle indicates the number of sites within the study. Transparency is used only for aiding visualisation.

The compilation of this dataset is timely. It can be used to answer long-standing questions in ecology in relation to this important belowground faunal group (e.g., global diversity patterns16). In light of the IPBES Global Assessment21 and the loss of biodiversity, the dataset also has the potential to be used to address the pressing issue of the consequences of environmental change for soil biodiversity. These data are suitable for linking with other soil databases, such as BETSI, a database of soil organism traits22. Linking trait information with site-level diversity would then allow analyses of functional diversity. In addition, as nearly all sites have geographic coordinates, other environmental data layers (e.g., related to climate variables, land use or soil abiotic factors) could be linked to the site-level diversity measures (e.g.16). Belowground diversity measures could also be linked to similar diversity measurements aboveground, thus enabling investigations across ecosystems to identify patterns of diversity and biodiversity changes23.


This work was conceptualised and discussed during two ‘sWorm’ workshops in 2016 and 2017, funded by sDiv, the synthesis centre of the German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig. More than 20 international scientists with expertise in earthworms, soil science, and/or data management met at each of the workshops.

On 18th December 2016, Web of Science was used to search the available literature for articles that had sampled the earthworm community. Keywords were used that captured measurements of diversity of all taxa within Oligochaetes: ((Earthworm* OR Oligochaeta OR Megadril* OR Haplotaxida OR Annelid* OR Lumbric* OR Clitellat* OR Acanthodrili* OR Ailoscoleci* OR Almid* OR Benhamiin* OR Criodrilid* OR Diplocard* OR Enchytraeid* OR Eudrilid* OR Exxid* OR Glossoscolecid* OR Haplotaxid* OR Hormogastrid* OR Kynotid* OR Lutodrilid* OR Megascolecid* OR Microchaetid* OR Moniligastrid* OR Ocnerodrilid* OR Octochaet* OR Sparganophilid* OR Tumakid*) AND (Diversity OR “Species richness” OR “OTU” OR Abundance OR individual* OR Density OR “tax* richness” OR “Number” OR Richness OR Biomass))

This search returned 7,783 papers. All titles and abstracts of papers published post-2000 were screened (6,140 papers), and papers were excluded if they did not make reference to data suitable for the analysis. As it was most likely that raw data would need to be requested, papers published before 2000 were excluded without screening, as it was unlikely that available author contact details were up to date. After this initial screening, PDFs of all remaining papers (n = 986) were manually screened to determine whether data were suitable (see below). In total, 477 papers made reference to data that were suitable.

In addition, to find unpublished data or to target underrepresented regions, inquiries were made to specific earthworm researchers regarding suitable datasets (e.g., by directly contacting researchers, giving presentations at the Second Global Soil Biodiversity Conference and the International Symposium of Earthworm Ecology). No date restrictions were placed on such datasets, and thus, some were published prior to 2000.

In order to be included in the database, the individual article was required to have sampled earthworm diversity using an appropriate quantitative methodology (such as hand-sorting of a soil quadrat e.g.24, or chemical expulsion e.g.25) at two or more sites that varied in their land-use/habitat cover or soil properties. At a minimum, we required data on the total abundance or fresh biomass of earthworms at each site, and if possible, the number of species (ideally with species binomials), and the abundance and biomass of each species. In addition, geographic coordinates of the sites were required, and at each site, data collectors ideally had sampled at least one of the following soil properties: soil pH (in H2O, KCl, CaCl2), soil organic carbon (%), soil organic matter (%), sand/silt/clay content (%), soil texture (USDA classification26), Cation Exchange Capacity (CEC), Base Saturation (%), Carbon:Nitrogen ratio, soil moisture (%), and soil type (WRB/FAO classification27).

Where possible, available data were extracted from the suitable articles. For each suitable article, the meta-data (e.g., the article title and DOI) was compiled (Online-only Table 1). Data were extracted from the article text, tables, figures, or supplementary material (e.g., using ImageJ28). Where data were not given but were required (Online-only Table 2), authors of the articles were contacted and the raw data (or missing information) were requested. If the authors did not respond, and the required information could not be obtained using an alternate method, the data were not entered into the database. All data were extracted into online data templates, with data from one article (i.e., a dataset) being entered into an individual template, referred to as a ‘file’. Each file was given a unique ID, and in total 199 files were created and made open-access.

A file could contain multiple ‘studies’, where each study was either a different sampling event i.e., multiple samples taken at the same site over time, and/or different sampling methodology. Each study was assigned a unique study ID. Sampled diversity of earthworms is highly dependent on the extraction method used29. If a dataset did not contain consistent sampling methodologies across all sites (i.e., some sites sampled with hand sorting and others hand sorting + chemical extraction), thus making it inappropriate to compare earthworm communities, the dataset was split into a separate study for each consistent methodology. If sites had been sampled multiple times, either across multiple years or within years, and the data were available for each sampling period, then only data from the first and the last sampling period were used. Each sampling period was entered as a study, which can help prevent temporal autocorrelation during analysis, e.g., when using a mixed-effects modelling approach.
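The rule of keeping only the first and last sampling period for repeatedly sampled sites can be sketched as follows. This is a minimal illustration, not the code used to build the database; the record layout and the key names `site` and `period` are assumptions made for the example.

```python
from collections import defaultdict

def first_and_last_periods(records):
    """Keep only the earliest and latest sampling period for each site.

    `records` is a list of dicts with the hypothetical keys 'site' and
    'period' (any sortable value, such as a year or an ISO date string).
    """
    # Collect every sampling period observed at each site.
    periods_by_site = defaultdict(set)
    for rec in records:
        periods_by_site[rec["site"]].add(rec["period"])

    # Retain a record only if it belongs to the first or last period.
    kept = []
    for rec in records:
        periods = periods_by_site[rec["site"]]
        if rec["period"] in (min(periods), max(periods)):
            kept.append(rec)
    return kept

# Hypothetical example: site A sampled three times, site B once.
samples = [
    {"site": "A", "period": 2001, "abundance": 12},
    {"site": "A", "period": 2003, "abundance": 15},
    {"site": "A", "period": 2006, "abundance": 9},
    {"site": "B", "period": 2002, "abundance": 4},
]
kept = first_and_last_periods(samples)
```

In this sketch the 2003 record for site A is dropped, and each retained period would then be entered as its own study.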

A site was defined as a single location where the earthworm community was sampled using an appropriate quantitative methodology. Within each study, each site was given a unique ID (usually based on an ID given in the original source). For each site, information on the sampling methodology, soil properties, and land-use/habitat cover, along with the diversity measurements (site-level species richness, abundance and/or biomass) were entered into the data template (see Online-only Table 2 for full list of variables and the format that was required for the data template). Where possible, data were entered into the data template in the same format as given in the original source. To help enable this, columns often had separate fields to record the units. However, for some fields, values needed to be standardised prior to data entry, such as for the site coordinates and some soil properties (e.g., sand/silt/clay content).

All available and required soil properties for each site were entered into the template. Where a site had soil properties sampled at different depths (e.g., at 0–15, 15–30, and 30–40 cm), the weighted average of the values was entered into the templates. The value was then indicated as being a mean (Online-only Table 2).
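As an illustration of the depth-weighted average, the sketch below assumes that each layer is weighted by its thickness; the text does not state the weighting explicitly, so this is an assumption, and the function name and the example values are hypothetical.

```python
def depth_weighted_mean(layers):
    """Average a soil property across depth layers, weighting by thickness.

    `layers` is a list of (top_cm, bottom_cm, value) tuples. Weighting by
    layer thickness is an assumption made for this illustration.
    """
    total_thickness = 0.0
    weighted_sum = 0.0
    for top_cm, bottom_cm, value in layers:
        thickness = bottom_cm - top_cm
        if thickness <= 0:
            raise ValueError("each layer must have bottom_cm > top_cm")
        total_thickness += thickness
        weighted_sum += thickness * value
    if total_thickness == 0:
        raise ValueError("at least one layer is required")
    return weighted_sum / total_thickness

# Hypothetical soil organic carbon (%) at 0-15, 15-30, and 30-40 cm.
carbon_layers = [(0, 15, 2.0), (15, 30, 1.5), (30, 40, 1.0)]
mean_carbon = depth_weighted_mean(carbon_layers)  # (30 + 22.5 + 10) / 40
```

With these values, the thinner 30–40 cm layer contributes less to the mean than the two 15 cm layers.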

The fields for habitat cover, land-use, and management system were predefined categories based on ESA CCI-LC, the Land-use Harmonization dataset30,31 (Fig. 2), and expert opinion (during the sWorm workshops), respectively. These classification systems were chosen based on knowledge of what external pressures might be important for explaining earthworm communities, whilst also ensuring consistency across all regions of the globe. Based on information given within the published article, or from the data providers directly, every site was classified into one of the categories for each of these fields. When information was missing, sites were classified as “unknown”. Additional information on the land use and management system classification definitions is shown in Tables 1 and 2, respectively.

Fig. 2

The number of sites (grey bars) and the number of studies (red dots) for each category in (a) the land-use system, and (b) the habitat-cover system. Sites could only be categorised within one category, but studies do contain sites that span multiple categories.

Table 1 Definitions for the land use category.
Table 2 A management classification system was created during the sWorm workshops.

As sampling effort also impacts diversity measurements32, the sampling effort at each site was recorded. Effort was recorded in two ways:

  1. The area that was sampled, e.g., of a quadrat or soil block, or the total area across all sampling units (e.g., all quadrats). This depended on how the data were presented.

  2. The number of times a site was sampled, either temporally or spatially. If a site was sampled over multiple time periods, it would be the number of occasions the site was sampled. If the site had multiple samples (e.g., multiple quadrats) and the diversity measure is an average, the sampling effort would be 1. If the diversity is a total measure (e.g., the total number of species across all quadrats) the sampling effort would be the total number of e.g., quadrats.
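The second rule, for sites sampled with several quadrats at one time, can be expressed as a small function. This is only a sketch of the rule as described above; the function and parameter names are hypothetical and do not correspond to fields in the data template.

```python
def spatial_sampling_effort(n_quadrats, measure_is_total):
    """Return the sampling effort for a site sampled with several quadrats.

    If the reported diversity measure is an average across quadrats, the
    effort is 1; if it is a total across quadrats, the effort is the number
    of quadrats.
    """
    if n_quadrats < 1:
        raise ValueError("a site must have at least one quadrat")
    return n_quadrats if measure_is_total else 1

# Five quadrats, mean abundance reported: effort of 1.
effort_mean = spatial_sampling_effort(5, measure_is_total=False)
# Five quadrats, total species count reported: effort of 5.
effort_total = spatial_sampling_effort(5, measure_is_total=True)
```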

When datasets contained information at a higher resolution than total abundance or biomass of earthworms at a site (i.e., at ecological group, genus, or species level), this information was entered into the species occurrence table (Online-only Table 3). Each row contained a measurement of an observation (e.g., species, morphospecies, genus, life stage or ecological group) at a single site. The measurement could be the presence only, abundance, or fresh biomass of the record. Where possible, for each row we also included the life stage (adult or juvenile), whether the species was native to the location or not, and the ecological group (epigeic, endogeic, anecic, epi-endogeic). Thus, if the diversity measure was for all the juveniles at the site regardless of species, columns such as the species binomial and genus would be empty, but the life stage would be completed. All species binomials and ecological group assignments were checked using DriloBASE and by earthworm taxonomists (GB, MJIB, MLCB, PL); see ‘Technical Validation’.

Where site-level diversity measures were given by the data provider, these were entered into the site-level sheet. Where site-level diversity measures were not given, but could be calculated from the species occurrence information, that was done in R33, following data entry and prior to subsequent analyses. The species present at each site, as given in the species occurrence data, were used for calculating species richness; this included species identified as sub-species. If data collectors identified a specimen as a morphospecies (i.e., a species delineation based solely on morphological characteristics, typically identified to genus level with a unique ID differentiating from other species of the same genus, as determined by the original data collector), it was included in the species richness estimate as an additional species. Unidentified species grouped as ‘unknown’ were excluded (Fig. 3). As juveniles of many earthworm species are hard to identify to species level29,34, juveniles were excluded from the calculation (even if identified at family level). All earthworms (including juveniles) found at a site were included in the total biomass and abundance calculations.
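The rules above can be illustrated with a short sketch. The original calculations were done in R; the Python below is only an illustration, and the record keys (`binomial`, `morphospecies`, `life_stage`, `abundance`) are hypothetical names, not the column names of the species occurrence table.

```python
def site_species_richness(occurrences):
    """Count distinct species and morphospecies recorded at one site.

    Each occurrence is a dict with the hypothetical keys 'binomial',
    'morphospecies', and 'life_stage'.
    """
    taxa = set()
    for occ in occurrences:
        if occ.get("life_stage") == "juvenile":
            continue  # juveniles are excluded from the richness calculation
        name = occ.get("binomial") or occ.get("morphospecies")
        if not name or name.lower() == "unknown":
            continue  # unidentified records do not count as a species
        taxa.add(name)
    return len(taxa)

def site_total(occurrences, key):
    """Sum abundance or biomass over all records, juveniles included."""
    return sum(occ[key] for occ in occurrences if occ.get(key) is not None)

# Hypothetical occurrence records for a single site.
site = [
    {"binomial": "Lumbricus terrestris", "morphospecies": None,
     "life_stage": "adult", "abundance": 5},
    {"binomial": "Aporrectodea caliginosa", "morphospecies": None,
     "life_stage": "adult", "abundance": 8},
    {"binomial": None, "morphospecies": "Lumbricus sp. 1",
     "life_stage": "adult", "abundance": 2},
    {"binomial": None, "morphospecies": None,
     "life_stage": "juvenile", "abundance": 20},
    {"binomial": "unknown", "morphospecies": None,
     "life_stage": "adult", "abundance": 3},
]
richness = site_species_richness(site)         # two binomials + one morphospecies
total_abundance = site_total(site, "abundance")  # every record is counted
```

In this example the juvenile and ‘unknown’ records add to total abundance but not to species richness.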

Fig. 3

The number of (a) studies and (b) sites that measured each of the three community metrics. The points at the vertices indicate the number of studies or sites with only one community metric. The points on the edges indicate the number of studies or sites with the community metrics represented at the connecting two vertices. Finally, the point in the centre indicates the number of studies or sites with all three community metrics. For example, in (a), 145 studies measured biomass, shown in the blue polygon. Four studies measured only biomass, 7 measured biomass and species richness, 44 measured biomass and abundance, and 90 measured all three metrics.

After the ecological grouping (epigeic, endogeic, anecic, and epi-endogeic) of each species had been assigned and/or checked by the earthworm taxonomists, diversity measures within each ecological group at a site were also calculated. As with the site-level metrics, the species richness within each ecological group was calculated using only species with binomials or morphospecies. Biomass and abundance of each ecological group at a site was calculated regardless of species identity. The total number of the ecological groups at each site was calculated regardless of abundance, biomass, life stage or native status of the species included (maximum ecological group richness = 4).
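A minimal sketch of the ecological-group metrics follows. It is an illustration under assumed record keys (`ecological_group`, `binomial`, `morphospecies`, `abundance`), all of which are hypothetical, and it shows abundance only; biomass would be summed in the same way.

```python
from collections import defaultdict

ECOLOGICAL_GROUPS = ("epigeic", "endogeic", "anecic", "epi-endogeic")

def ecological_group_summary(occurrences):
    """Summarise abundance and richness per ecological group at one site.

    Returns (abundance_by_group, richness_by_group, group_richness).
    """
    present = set()
    abundance = defaultdict(float)
    species = defaultdict(set)
    for occ in occurrences:
        group = occ.get("ecological_group")
        if group not in ECOLOGICAL_GROUPS:
            continue
        present.add(group)  # a group counts as present whatever its abundance
        abundance[group] += occ.get("abundance") or 0
        # Richness uses only records with a binomial or morphospecies name.
        name = occ.get("binomial") or occ.get("morphospecies")
        if name:
            species[group].add(name)
    richness = {group: len(species[group]) for group in present}
    return dict(abundance), richness, len(present)

# Hypothetical records; the last row is an unnamed epigeic record.
site = [
    {"binomial": "Lumbricus terrestris", "ecological_group": "anecic", "abundance": 5},
    {"binomial": "Aporrectodea caliginosa", "ecological_group": "endogeic", "abundance": 8},
    {"binomial": "Allolobophora chlorotica", "ecological_group": "endogeic", "abundance": 4},
    {"binomial": None, "ecological_group": "epigeic", "abundance": 10},
]
abundance_by_group, richness_by_group, group_richness = ecological_group_summary(site)
```

Here the epigeic group contributes to abundance and to the number of groups present, but has a species richness of zero because its only record has no species name.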

Data Records

The data presented here are available in the iDiv data portal (Dataset ID: 1880)19 in a static form. In addition, the full dataset will be hosted by Edaphobase (www.portal.edaphobase). In the future, the version in Edaphobase might change (i.e., with species names revisions, or requests from the data providers) and will hopefully be added to with additional earthworm records (or other soil taxa).

The data are stored in three tables: meta-data (Online-only Table 1), site-level (Online-only Table 2), and species occurrence (Online-only Table 3). The file ID links the meta-data to the site-level data, and the Study ID and the Site ID link the site-level data to the species occurrence table.
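The way the three tables connect can be sketched with plain Python dictionaries. The key names used here (`file`, `study_id`, `site_id`) are placeholders for the file ID, Study ID, and Site ID fields; the actual column names are listed in the Online-only Tables and may differ.

```python
def link_tables(metadata, sites, occurrences):
    """Attach file meta-data and species occurrences to each site record.

    Meta-data rows are matched on the file ID; occurrence rows are matched
    on the (Study ID, Site ID) pair.
    """
    meta_by_file = {row["file"]: row for row in metadata}
    occ_by_site = {}
    for occ in occurrences:
        key = (occ["study_id"], occ["site_id"])
        occ_by_site.setdefault(key, []).append(occ)

    linked = []
    for site in sites:
        key = (site["study_id"], site["site_id"])
        linked.append({
            **site,
            "metadata": meta_by_file.get(site["file"]),
            "occurrences": occ_by_site.get(key, []),
        })
    return linked

# Hypothetical rows from each of the three tables.
metadata = [{"file": "F1", "title": "Example article"}]
sites = [
    {"file": "F1", "study_id": "F1_S1", "site_id": "plot1"},
    {"file": "F1", "study_id": "F1_S1", "site_id": "plot2"},
]
occurrences = [
    {"study_id": "F1_S1", "site_id": "plot1", "binomial": "Lumbricus terrestris"},
]
linked = link_tables(metadata, sites, occurrences)
```

Sites without species-level records, such as `plot2` here, simply carry an empty list of occurrences.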

For all suitable datasets, the meta-data information was completed. The meta-data contains bibliographic information on the original paper which analysed, or published, the data, as well as contact information of the person who provided the raw data (not included in the release of the database for privacy reasons). The meta-data also included the number of sites and studies within the file, so that validation checks could be completed. Online-only Table 1 shows all fields within the meta-data; personal information of data providers has not been made available.

Information on all sampled sites within each dataset was recorded in the site-level table (Online-only Table 2). Each row represents a single site within a study, with information on the sampling methodology, soil properties, and how the land was used, managed, and covered. The site-level earthworm community metrics (species richness, abundance and biomass) are also included if available.

Site-level species lists, or abundance, and/or biomass measures for individual records are given in the species occurrence table (Online-only Table 3). Each row is a measurement of an observation at a site (22,690 non-zero observations in total). An observation could relate to a species (with a scientific binomial, e.g., the abundance of Lumbricus terrestris at a site, or a morphospecies identification), a genus, life stage, ecological group, or native/non-native group (e.g., the abundance of all non-native species at a site). Details of the native/non-native status of a species were only available when provided by the original data collector.

Technical Validation

Templates used to enter the individual datasets were designed so that fields were only allowed certain values and formats where possible. This helped to reduce spelling errors, slight inconsistencies, and incorrect values being entered. Data providers were contacted if details within their raw data were unclear. As multiple people entered data into the templates, detailed documentation was created at the start of the project to ensure consistency amongst those involved. In addition, a subset of datasets was checked by several curators.

All earthworm species names were checked against DriloBASE to identify potential synonyms and spelling mistakes. Following that, earthworm specialists and taxonomists (GB, MJIB, MLCB and PL) checked the scientific names, removed synonyms and updated names if taxonomies had changed. Where ecological groupings were missing, the earthworm taxonomists also added them where possible, based on the available literature.

Usage Notes

Land-use fields were based on predefined classification schemes, which may not be the most suitable for the analysis of earthworms. We therefore included a free-text field (“Habitat as described”) that could be used by future researchers to define their own classification scheme for land-use or habitat cover.

As diversity measures are highly influenced by sampling methodology, we included information on sampling methods in the database (Fig. 4). In addition, we would expect that variation in diversity would differ between the individual datasets due to, for example, inter-observer variability. We highly recommend that statistical methods used on this database take these between-dataset variations into account.

Fig. 4

The number of sites sampled with each sampling method across the different earthworm studies.

Despite our efforts to obtain a global dataset, there is a geographic bias (Fig. 1), such that sites are highly clustered in certain regions (e.g., Europe), sparse in others (e.g., South America), or lacking (e.g., southern Africa, northern Russia). To reduce such biases, we attempted to contact as many researchers as possible in such areas to acquire data. Although this helped to improve the data coverage, it did not remove the gaps. We hope to address these gaps in the future, but in the meantime, researchers should be aware of the influence these biases might have on their analyses35,36.