GlobalFungi, a global database of fungal occurrences from high-throughput-sequencing metabarcoding studies

Fungi are key players in vital ecosystem services, spanning carbon cycling, decomposition, symbiotic associations with cultivated and wild plants and pathogenicity. The high importance of fungi in ecosystem processes contrasts with the incompleteness of our understanding of the patterns of fungal biogeography and the environmental factors that drive those patterns. To narrow this knowledge gap, we collected and validated data published on the composition of soil fungal communities in terrestrial environments including soil and plant-associated habitats and made them publicly accessible through a user interface at https://globalfungi.com. The GlobalFungi database contains over 600 million observations of fungal sequences across > 17 000 samples with geographical locations and additional metadata contained in 178 original studies with millions of unique nucleotide sequences (sequence variants) of the fungal internal transcribed spacers (ITS) 1 and 2 representing fungal species and genera. The study represents the most comprehensive atlas of global fungal distribution, and it is framed in such a way that third-party data addition is possible.


Background & Summary
Fungi play fundamental roles in the ecosystem processes across all terrestrial biomes. As plant symbionts, pathogens or major decomposers of organic matter they substantially influence plant primary production, carbon mineralization and sequestration, and act as crucial regulators of the soil carbon balance 1,2. The activities of fungal communities contribute to the production of clean water, food, and air and the suppression of disease-causing soil organisms. Soil fungal biodiversity is thus increasingly recognized to provide services critical to food safety and human health 3.
It is of high importance to determine how environmental factors affect the diversity and distribution of fungal communities. So far, only a few studies have focused on fungal distribution and diversity on a global scale 4-6. Importantly, these single survey studies focused either on a limited number of biomes 4,5, fairly narrow groups within the fungal kingdom 6, or were restricted only to fungi inhabiting soil. Although individual studies have
Processing of sequencing data. For the processing of data, see Fig. 2 and the Code Availability section. Raw datasets from 178 studies, covering 17 242 individual samples, were quality filtered by removing all sequences with mean quality Phred scores below 20. Each sequence was labelled using the combination of a sample ID and sequence ID, and the full ITS1 or ITS2 fungal region was extracted using the Perl script ITSx v1.0.11 13. ITS extraction resulted in a total of 416 291 533 full ITS1 and 231 278 756 full ITS2 sequences. The extracted ITS sequences were classified according to the representative sequence of the closest UNITE species hypothesis (SH) using BLASTn 14, using the SH created considering a 98.5% similarity threshold (BLASTDBv5, general release 8.1 from 2.2.2019 12). A sequence was classified to the best hit SH only when the following thresholds were met: e-value < 10e−50, sequence similarity ≥ 98.5%. All representative sequences annotated as nonfungal were discarded. All representative sequences classified to any fungal SH and all unclassified sequences were used to build the database library of unique nucleotide sequences (sequence variants). The number of sequence variants accessible through the database is 113 423 871.
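The two numerical criteria described above (mean Phred score and BLASTn thresholds) can be illustrated with a minimal Python sketch. This is not the code used to build the database; the function and class names are hypothetical, the quality string is assumed to use Phred+33 encoding, and the e-value threshold written as "10e−50" is assumed here to mean 10^−50.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEAN_PHRED = 20     # sequences with a mean quality below this are removed
MAX_EVALUE = 1e-50      # assumption: "10e-50" in the text is read as 10^-50
MIN_SIMILARITY = 98.5   # percent similarity to the SH representative sequence


def mean_phred(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def passes_quality(quality_line: str) -> bool:
    """Keep a read unless its mean Phred score is below the threshold."""
    return mean_phred(quality_line) >= MIN_MEAN_PHRED


@dataclass
class BlastHit:
    """Best BLASTn hit of one sequence (hypothetical container)."""
    sh_id: str
    evalue: float
    similarity: float  # percent


def assign_sh(best_hit: Optional[BlastHit]) -> Optional[str]:
    """Return the SH of the best hit only if both thresholds are met."""
    if best_hit is None:
        return None
    if best_hit.evalue < MAX_EVALUE and best_hit.similarity >= MIN_SIMILARITY:
        return best_hit.sh_id
    return None
```

In this sketch, a sequence whose best hit fails either threshold is returned as unclassified (None) rather than discarded, in line with the statement that unclassified sequences were retained as sequence variants.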
Sample metadata. Sample metadata were collected from the published papers and/or public repositories where they were submitted by the authors. In some cases, metadata were obtained from the authors of individual studies upon request. The samples were assigned to continents, countries, and specific locations when available, and all sites were categorized into biomes following the classification of Environment Ontology to a maximum achievable depth for each sample. The complete list of metadata included in the database is presented in Table 1.
In addition to the metadata provided by the authors of each study, we also extracted bioclimatic variables from the global CHELSA 15 and WorldClim 2 16 databases for each sample based on its GPS location. Since the results based on CHELSA and WorldClim 2 were comparable, we decided to include those from CHELSA, because precipitation patterns are better captured in the CHELSA dataset, in particular for mountain sites 15 .
For each sequence variant that was classified to SH, the fungal species name and genus name were retrieved from the UNITE database 12, when available.

Data Records
The raw sequencing reads used to create the database are available at different locations (see Table 2).
The database contains two data types: sequence variants (individual nucleotide sequences) and samples. For each sequence variant, the following information is stored: sequence variant code, identification of samples where the sequence variant occurs and the number of observations, the SH of best hit (when available), fungal species name (when available), and fungal genus name (when available). For each sample, metadata information is stored (Table 1). Sequence data and metadata are accessible at Figshare 17 (GlobalFungi_ITS_variants.zip, GlobalFungi_metadata.xlsx). All database content is accessible using a public graphical user interface at https://globalfungi.com.
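As an illustration only, the two record types described above could be represented as follows. The class and field names are hypothetical and do not reflect the actual schema of the database or of the Figshare files.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SequenceVariant:
    """One unique nucleotide sequence and where it was observed."""
    code: str                                    # sequence variant code
    observations: Dict[str, int] = field(default_factory=dict)  # sample ID -> count
    sh: Optional[str] = None                     # SH of best hit, when available
    species: Optional[str] = None                # species name, when available
    genus: Optional[str] = None                  # genus name, when available

    def total_observations(self) -> int:
        """Number of observations summed over all samples."""
        return sum(self.observations.values())


@dataclass
class Sample:
    """One sample with its metadata (see Table 1 for the real fields)."""
    sample_id: str
    metadata: Dict[str, object] = field(default_factory=dict)


# Usage with made-up values:
variant = SequenceVariant(code="variant_1", observations={"sample_A": 3, "sample_B": 5})
total = variant.total_observations()
```

Optional fields default to None so that unclassified sequence variants, which the database also stores, can be represented without a taxonomic assignment.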

Technical Validation
The technical validation included the screening of the data sources, sequencing data and data reliability. Regarding the data source screening, the data sources (published papers) were screened to fulfil the criteria outlined in the Methods section. The dataset was thoroughly checked for duplicates, and for all records that appeared in multiple publications, only the first original publication of the dataset was considered as a data source. Considering sequence quality, we have only utilized those primer pairs that are generally accepted to target general fungi (see Online-Only Table 1) 7,18. Sequences were quality filtered by removing all sequences with mean quality Phred scores below 20. Sequences that did not represent complete ITS1 or ITS2 after extraction, or that were identified as chimeric by the ITS extraction software 13, were also removed. All representative sequences for which the BLASTn search against the UNITE database 12 resulted in a nonfungal organism were discarded.
For data reliability, the geographic location represented by the GPS coordinates was validated first. For each sample set, the geographic location of the sample described in the text of the study was compared with the location on the map. For those samples where disagreement was recorded (e.g., terrestrial samples positioned in the ocean or located in another region than described in the text), the authors of each study were asked for correction. Studies or samples that could not be reconciled in this way were excluded from the database. The sample metadata were also checked, and values outside the acceptable range (such as content of elements or organic matter > 100%) were removed.
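The range check on metadata can be sketched as follows. This is a simplified illustration under stated assumptions: the field names are hypothetical, and the coordinate test is only a basic sanity check, not the manual comparison of published site descriptions with map positions described above.

```python
from typing import Dict, Optional

# Hypothetical names of metadata fields expressed in percent.
PERCENT_FIELDS = ("organic_matter_pct", "total_c_pct", "total_n_pct")


def valid_coordinates(latitude: float, longitude: float) -> bool:
    """Basic sanity check that GPS coordinates lie within valid ranges."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


def clean_percent_metadata(
    metadata: Dict[str, Optional[float]]
) -> Dict[str, Optional[float]]:
    """Return a copy in which percentages outside 0-100 are set to None."""
    cleaned = dict(metadata)
    for key in PERCENT_FIELDS:
        value = cleaned.get(key)
        if value is not None and not (0.0 <= value <= 100.0):
            cleaned[key] = None  # invalid value removed, sample itself kept
    return cleaned
```

Only the invalid value is dropped here, so the sample and its remaining metadata stay usable, which matches the description that invalid metadata, not whole samples, were removed at this step.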

Usage Notes
The user interface at https://globalfungi.com enables users to access the database in several ways (Fig. 3). In the taxon search, it is possible to search for genera or species of fungi or for the corresponding SH or the corresponding sequence variants. It is also possible to view a breakdown of samples by type, biome, mean annual temperature, mean annual precipitation, pH, and continents. The results also contain an interactive map of the taxon distribution with relative abundances of sequences of the taxon across samples and a list of samples with metadata. Several modes of filtering of results are available as well.
Fig. 2 Processing of raw sequencing data for the GlobalFungi database. Workflow of processing of sequencing data included in the GlobalFungi database.
In the sequence search, it is possible to search for multiple nucleotide sequences, choosing whether the result will be an exact match or a BLAST result. The BLAST option makes it possible to retrieve the best-hit sequence variant in the database or, when only one sequence is submitted, to display multiple ranked high-score hits among the sequence variants. It is also possible to open individual studies and access their content. Finally, in the Geosearch, users can select a group of samples on the map, with a range of tools, and retrieve data for these samples (such as the FASTA file with all occurring sequence variants).
[Table fragment: 188-190; SRP097883 191,192; SRP101553 193,194; SRP101605 195,196; SRP102378 197,198]
www.nature.com/scientificdata
Importantly, the database is intended to grow, both by the continuing activity of the authors and with the help of the scientific community. For that purpose, the "Submit your study" section of the web interface enables users to submit studies not yet represented. The submission tool guides the submitting person through the steps where details about the publication, samples, sample metadata and sequences are sequentially submitted. The submitted data will be used to update the database twice a year after processing and validation by the authors. Thus, users submitting their data, besides making a precious contribution to mycological progress, will benefit from making their data accessible to the international scientific community in an easily accessible form and increasing the visibility of their results. Users can also maximize their visibility by agreeing to have their name and affiliation added to the online list of collaborators and/or to the "GlobalFungi Group Author" list that will be mentioned in future publications describing the database content, its development, or metastudies using the whole database.
Among the possible uses of the GlobalFungi Database, fungal ecologists will be able to link fungal diversity data with the panel of collected metadata, which should allow them to determine the environmental factors driving fungal diversity. This kind of study can be done at different geographic levels, from country scale up to the entire world, and for all fungal communities or by focusing on some ecosystem compartments. This should lead to a better understanding of the biogeography of fungal diversity. Větrovský et al. 8 brought interesting findings by doing this for soil fungal communities at the global scale. Evolutionary biologists could study, for example, the effect of global change on fungal diversity by comparing natural versus anthropogenic biomes. In addition to focusing on fungal diversity, some studies could target specific fungi. Thus, mycologists could determine the biogeography of one specific fungal species. They could also determine the composition of the fungal communities associated with the species of interest and detect potential recurrent fungal associations. The GlobalFungi Database could also speed up progress in fungal taxonomy by highlighting the existence of a high number of fungal sequences not currently assigned to species, along with their environmental metadata, thus promoting interest in their description.

Code availability
The workflow included several custom-made Python scripts (labelled by star in the