Background & Summary

Fungi play fundamental roles in ecosystem processes across all terrestrial biomes. As plant symbionts, pathogens or major decomposers of organic matter, they substantially influence plant primary production, carbon mineralization and sequestration, and act as crucial regulators of the soil carbon balance1,2. The activities of fungal communities contribute to the production of clean water, food, and air and to the suppression of disease-causing soil organisms. Soil fungal biodiversity is thus increasingly recognized as providing services critical to food safety and human health3.

It is of high importance to determine how environmental factors affect the diversity and distribution of fungal communities. So far, only a few studies have focused on fungal distribution and diversity on a global scale4,5,6. Importantly, these single-survey studies either focused on a limited number of biomes4,5 or on fairly narrow groups within the fungal kingdom6, or were restricted to fungi inhabiting soil. Although individual studies have the advantage of standardized methodology across their whole dataset, their limited sampling effort in space and time does not allow general conclusions on the distribution of fungal taxa. On the other hand, since the advent of high-throughput-sequencing methods, large amounts of sequencing data on fungi from terrestrial environments have accumulated, along with metadata, across numerous studies, and these data allow interesting analyses when combined7. As an example of this approach, a meta-analysis of 36 papers made it possible to map the global diversity of soil fungi collected in >3000 samples and indicated that climate is an important factor for the global distribution of soil fungi8. This clearly demonstrated the utility of a meta-study approach to address fungal biogeography, ecology and diversity. In addition, the compilation of these data demonstrated that symbiotic mycorrhizal fungi, which help cultivated and wild plants to access nutrients, are more likely to be affected by rapid changes of climate than other guilds of fungi, including plant pathogens8, and helped to identify which fungi tend to follow alien plants invading new environments9.

Here, we have undertaken a comprehensive collection and validation of data published on the composition of fungal communities in terrestrial environments including soil and plant-associated habitats. This approach enabled us to construct the GlobalFungi database containing, on March 16, 2020, over 110 million unique sequence variants10 (i.e., unique nucleotide sequences) of the fungal nuclear ribosomal internal transcribed spacers (ITS) 1 and 2, covering >17 000 samples contained in 178 original studies (Fig. 1). The ITS region has been used as the molecular marker because it is a universal barcode for fungi11. The dataset of sequence variant frequencies across samples, accompanied by metadata retrieved from published papers and from global climate databases, is made publicly available at https://globalfungi.com. To make published data findable, accessible, interoperable and reusable, the user interface at the above address allows users to search for individual sequences, fungal species hypotheses12, species or genera, to get a visual representation of their distribution in the environment, and to access and download sequence data and metadata. In addition, the user interface allows authors to submit data from studies not yet covered and in this way to help build the resource for the community of researchers in systematics, biogeography, and ecology of fungi.

Fig. 1

Map of locations of samples contained in the GlobalFungi database. Each point represents one or several samples where fungal community composition was reported using high-throughput-sequencing methods targeting the ITS1 or ITS2 marker of fungi. The background map image where the samples are represented is the intellectual property of Esri and used herein under license. Copyright © 2019 Esri and its licensors. All rights reserved.

Methods

Data selection

We explored papers fitting a main criterion, i.e., the use of high-throughput sequencing of the ITS region for the analysis of fungal communities, that were published up to the beginning of 2019; in total, we explored 843 papers. The following selection criteria were used for the inclusion of samples (and, consequently, studies) into the dataset: (1) samples came from terrestrial biomes and consisted of soil or dead or live plant material (e.g., soil, litter, rhizosphere soil, topsoil, lichen, deadwood, root, and shoot) and were not subject to experimental treatment that artificially modifies the fungal community composition (e.g., temperature or nitrogen increase experiments and greenhouse controlled experiments were excluded); (2) the precise geographic location of each sample was recorded and released using GPS coordinates; (3) the whole fungal community was subject to amplicon sequencing (studies using group-specific primers were excluded); (4) the internal transcribed spacer regions (ITS1, ITS2, or both) were subject to amplification; (5) sequencing data (either in FASTA format with Phred scores reported or in FASTQ format) were publicly available or provided by the authors of the study upon request, and the sequences were unambiguously assigned to samples; (6) the samples could be assigned to biomes according to the Environment Ontology (http://www.ontobee.org/ontology/ENVO)8. In total, 178 publications contained samples that matched our criteria.

Processing of sequencing data

For the processing of data, see Fig. 2 and the Code Availability section. Raw datasets from 178 studies, covering 17 242 individual samples, were quality filtered by removing all sequences with a mean quality Phred score below 20. Each sequence was labelled using the combination of a sample ID and sequence ID, and the full ITS1 or ITS2 fungal region was extracted using the Perl script ITSx13 (v1.0.11). ITS extraction resulted in a total of 416 291 533 full ITS1 and 231 278 756 full ITS2 sequences. The extracted ITS sequences were classified according to the representative sequence of the closest UNITE12 species hypothesis (SH) using BLASTn14, with the SH delimited at a 98.5% similarity threshold (BLASTDBv5, general release 8.1 from 2.2.2019). A sequence was classified to the best-hit SH only when the following thresholds were met: e-value < 10e−50 and sequence similarity ≥ 98.5%. All representative sequences annotated as nonfungal were discarded. All representative sequences classified to any fungal SH and all unclassified sequences were used to build a database library of unique nucleotide sequences (sequence variants). The number of sequence variants accessible through the database is 113 423 871.
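
The classification rule described above can be illustrated with a minimal sketch. It is not the authors' pipeline code (see the Code Availability section for that): the record layout `BlastHit`, its field names and the function `classify` are hypothetical. The sketch also assumes that a sequence is discarded as nonfungal only when its best hit passes both thresholds, which the text does not state explicitly. The e-value cutoff is copied literally from the text; Python evaluates `10e-50` as 1e-49, so it should be adjusted if 1e-50 was intended.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the text. "10e-50" is copied literally; Python reads it
# as 1e-49, so change it if the intended cutoff was 1e-50.
MAX_EVALUE = 10e-50
MIN_SIMILARITY = 98.5  # percent similarity to the SH representative sequence


@dataclass
class BlastHit:
    """Best BLASTn hit of one ITS sequence (hypothetical record layout)."""
    sh_id: str          # identifier of the UNITE species hypothesis that was hit
    evalue: float       # BLAST e-value of the hit
    similarity: float   # percent sequence similarity
    is_fungal: bool     # whether the hit SH is annotated as fungal


def classify(hit: Optional[BlastHit]) -> Optional[str]:
    """Return the SH id, "unclassified", or None if the sequence is discarded."""
    if hit is None:
        return "unclassified"
    passes = hit.evalue < MAX_EVALUE and hit.similarity >= MIN_SIMILARITY
    if not passes:
        # Kept in the library, but without an SH assignment.
        return "unclassified"
    if not hit.is_fungal:
        # Assumption: a confident nonfungal best hit removes the sequence.
        return None
    return hit.sh_id
```

Under these assumptions, a sequence with no hit, or with a hit that fails either threshold, stays in the library as unclassified, whereas a confident nonfungal hit removes it.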

Fig. 2

Processing of raw sequencing data for the GlobalFungi database. Workflow of processing of sequencing data included in the GlobalFungi database.

Sample metadata

Sample metadata were collected from the published papers and/or public repositories where they were submitted by the authors. In some cases, metadata were obtained from the authors of individual studies upon request. The samples were assigned to continents, countries, and specific locations when available, and all sites were categorized into biomes following the classification of Environment Ontology to a maximum achievable depth for each sample. The complete list of metadata included in the database is presented in Table 1.

Table 1 List of metadata contained in the GlobalFungi database.

In addition to the metadata provided by the authors of each study, we also extracted bioclimatic variables from the global CHELSA15 and WorldClim 2 (ref. 16) databases for each sample based on its GPS location. Since the results based on CHELSA and WorldClim 2 were comparable, we decided to include those from CHELSA, because precipitation patterns are better captured in the CHELSA dataset, in particular for mountain sites15.

For each sequence variant that was classified to an SH, the fungal species name and genus name were retrieved from the UNITE database12, when available.

Data Records

The raw sequencing reads used to create the database are available at different locations (see Table 2).

Table 2 List of identifiers and source database of the raw sequencing datasets used.

The database contains two data types: sequence variants (individual nucleotide sequences) and samples. For each sequence variant, the following information is stored: the sequence variant code, the samples in which the sequence variant occurs together with the number of observations, the SH of the best hit (when available), the fungal species name (when available), and the fungal genus name (when available). For each sample, metadata information is stored (Table 1). Sequence data and metadata are accessible at Figshare17 (GlobalFungi_ITS_variants.zip, GlobalFungi_metadata.xlsx). All database content is accessible using a public graphical user interface at https://globalfungi.com.
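
The two record types listed above could be modelled as follows. This is a hedged illustration only: the class and field names are hypothetical and do not reflect the actual schema of the database or of the Figshare files.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SequenceVariant:
    """One unique nucleotide sequence (field names are hypothetical)."""
    code: str                                   # sequence variant code
    # sample id -> number of observations of this variant in that sample
    observations: Dict[str, int] = field(default_factory=dict)
    sh: Optional[str] = None                    # best-hit SH, when available
    species: Optional[str] = None               # species name, when available
    genus: Optional[str] = None                 # genus name, when available

    def total_observations(self) -> int:
        """Sum the observations of this variant over all samples."""
        return sum(self.observations.values())


@dataclass
class Sample:
    """One sample with its metadata (Table 1 lists the real fields)."""
    sample_id: str
    metadata: Dict[str, object] = field(default_factory=dict)
```

Keeping the per-sample counts on the variant record mirrors the description above, in which each variant stores the samples where it occurs and the number of observations.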

Technical Validation

The technical validation included the screening of the data sources, sequencing data and data reliability. The data sources (published papers) were screened to ensure that they fulfilled the criteria outlined in the Methods section. The dataset was thoroughly checked for duplicates, and for all records that appeared in multiple publications, only the first original publication of the dataset was considered as a data source. Regarding sequence quality, we have only used those primer pairs that are generally accepted to target general fungi (see Online-Only Table 1)7,18. Sequences were quality filtered by removing all sequences with a mean quality Phred score below 20. Sequences that did not represent complete ITS1 or ITS2 after extraction, or that were identified as chimeric by the ITS extraction software13, were also removed. All representative sequences for which the BLASTn search against the UNITE database12 returned a nonfungal organism were discarded.
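
As a minimal sketch of the mean-quality filter mentioned above, the following functions compute the mean Phred score of a FASTQ quality string and apply the threshold of 20. Two points are assumptions rather than statements from the text: the Phred+33 encoding, and the use of the arithmetic mean of the scores rather than the mean error probability. The function names are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_quality(quality: str, min_mean: float = 20.0) -> bool:
    """Keep a read unless its mean Phred score is below the threshold."""
    return mean_phred(quality) >= min_mean
```

For example, a read whose quality string is "IIII" has a mean score of 40 and is kept, whereas "++++" has a mean score of 10 and is removed.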

For data reliability, the geographic location represented by the GPS coordinates was validated first. For each sample set, the geographic location of the sample described in the text of the study was compared with the location on the map. For those samples where disagreement was recorded (e.g., terrestrial samples positioned in the ocean or located in a region other than that described in the text), the authors of each study were asked for correction. Studies or samples that could not be reconciled in this way were excluded from the database. The quality of sample metadata was also checked, and values outside the acceptable range (such as content of elements or organic matter > 100%) were removed as invalid.
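
The range checks on metadata can be sketched as below. The field names in `PERCENT_FIELDS` are hypothetical, and rejecting negative percentages is an added assumption (the text only mentions values > 100%). The coordinate check covers numeric ranges only; the comparison with the location described in each study, as explained above, relied on a map and is not reproduced here.

```python
from typing import Dict, Optional

# Metadata fields expressed as percentages (hypothetical field names).
PERCENT_FIELDS = ("organic_matter_percent", "total_c_percent", "total_n_percent")


def valid_coordinates(lat: float, lon: float) -> bool:
    """Basic numeric range check; it cannot tell land from ocean."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def clean_percentages(
    metadata: Dict[str, Optional[float]]
) -> Dict[str, Optional[float]]:
    """Return a copy with out-of-range percentage values replaced by None."""
    cleaned = dict(metadata)
    for name in PERCENT_FIELDS:
        value = cleaned.get(name)
        # Assumption: values below 0 are treated as invalid as well.
        if value is not None and not (0.0 <= value <= 100.0):
            cleaned[name] = None
    return cleaned
```

Only the invalid value is removed; the sample and its remaining metadata are retained, in line with the procedure described above.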

Usage Notes

The user interface at https://globalfungi.com enables users to access the database in several ways (Fig. 3). In the taxon search, it is possible to search for genera or species of fungi or for the UNITE species hypotheses (SH) delimited at the 98.5% threshold, contained in the general release 8.1 from 2.2.2019. The search results offer the option to download the corresponding SH or the corresponding sequence variants. It is also possible to view a breakdown of samples by type, biome, mean annual temperature, mean annual precipitation, pH, and continent. The results also contain an interactive map of the taxon distribution with relative abundances of sequences of the taxon across samples and a list of samples with metadata. Several modes of filtering the results are available as well.

Fig. 3

User interface to access the GlobalFungi database.

In the sequence search, it is possible to search for multiple nucleotide sequences, choosing whether the result will be an exact match or a BLAST result. The BLAST option makes it possible to retrieve the best-hit sequence variant in the database or, when only one sequence is submitted, to display multiple ranked high-scoring hits among the sequence variants.

It is also possible to open individual studies and access their content. Finally, in the Geosearch, users can select a group of samples on the map, with a range of tools, and retrieve data for these samples (such as the FASTA file with all occurring sequence variants).

Importantly, the database is intended to grow, both through the continuing activity of the authors and with the help of the scientific community. For that purpose, the “Submit your study” section of the web interface enables users to submit studies not yet represented. The submission tool guides the submitting person through the steps in which details about the publication, samples, sample metadata and sequences are sequentially submitted. The submitted data will be used to update the database twice a year after processing and validation by the authors. Thus, users submitting their data, besides making a precious contribution to mycological progress, will benefit from making their data available to the international scientific community in an easily accessible form and from increasing the visibility of their results. Users can also maximize their visibility by agreeing to have their name and affiliation added to the online list of collaborators and/or to the “GlobalFungi Group Author” list that will be mentioned in future publications describing the database content, its development, or metastudies using the whole database.

Among the possible uses of the GlobalFungi Database, fungal ecologists will be able to link fungal diversity data with the panel of collected metadata, which should allow them to determine the environmental factors driving fungal diversity. This kind of study can be done at different geographic levels, from the country scale up to the entire world, and for whole fungal communities or by focusing on some ecosystem compartments. This should lead to a better understanding of the biogeography of fungal diversity. Větrovský et al.8 brought interesting findings by doing this for soil fungal communities at the global scale. Evolutionary biologists could study, for example, the effect of global change on fungal diversity by comparing natural versus anthropogenic biomes. In addition to focusing on fungal diversity, some studies could target specific fungi. Thus, mycologists could determine the biogeography of one specific fungal species. They could also determine the composition of the fungal communities associated with the species of interest and detect potential recurrent fungal associations. The GlobalFungi Database could also speed up progress in fungal taxonomy by highlighting the existence of a high number of fungal sequences not currently assigned to species, along with their environmental metadata, thus promoting interest in their description.