Introduction

Soil microbial communities play an essential role in maintaining important soil processes such as nutrient cycling, waste decomposition, climate regulation, and pollution degradation [1, 2]. Today, sequencing technologies are well established and broadly used [3]. As such, producing large amounts of data on the composition and diversity of bacterial and fungal communities is no longer a major challenge. Moreover, the major ecological drivers of the variation in these microbial communities are becoming increasingly clear [4, 5]. The spotlight is now on soil taxonomists. Although progress has been made in the past few years [6, 7], culturing, isolating, and classifying soil microbes remain difficult tasks. For most soil bacterial and fungal species, even the most dominant ones, we know very little about their identity or the tasks they perform [8]. More concerning, in some cases, we lack the most basic taxonomic information needed to classify these bacterial and fungal taxa, as they do not match the latest data within taxonomic databases (e.g. [9] and [10]; Zomer et al. [11]) even at the highest taxonomic ranks (e.g. phylum level).

The first logical step toward the classification of these unknown microbial taxa is to identify potential locations where they could be found across the globe. This information can then be used by taxonomists and microbiologists to target these new soil taxa. Here, I used data from a global soil survey [8] across 235 locations (Fig. S1), which includes amplicon sequencing information on fungal (ITS region) and bacterial (16S rRNA gene) communities from around the world, to highlight those locations on Earth where bacterial and fungal taxa with an unknown phylum are potentially most prevalent. The database in Delgado-Baquerizo et al. [8] has been used previously to identify the dominant taxa of bacteria globally, and more recently, the major ecological predictors of bacterial diversity [12]. I used the bioinformatics pipeline described in Delgado-Baquerizo et al. [8], and two of the most commonly used microbial databases for taxonomic identification (Greengenes and UNITE), to estimate, at the global scale, the percentage of bacterial and fungal phylotypes with an unknown phylum in soils. These taxa are classified as fungi or bacteria by the taxonomic databases, but do not match any known phyla. As such, they are potential candidates for new phyla of fungi or bacteria.

Results and Discussion

As expected, taxonomic information at the “species” (OTU, phylotype) level could not be found for 99% of bacterial and 63% of fungal phylotypes (clustered at 97% similarity). Notably, up to 1.36% and 9.37% of the retrieved phylotypes classified as bacteria and fungi, respectively, remained unclassified at the phylum level in soils across the globe. For these microbes, we do not know the phylum to which they belong. In other words, for some soils, almost 10% of fungal taxa are totally unknown to us. These taxa represent 0.01–1.86% (average of 0.12%) of all 16S rRNA sequences, and 0.00–22.11% (average of 3.98%) of all retrieved ITS sequences. On average, soil samples with the largest percentage of bacterial phylotypes with an unknown phylum are found in boreal and tropical forests (Fig. 1), while those with the largest percentage of fungal phylotypes with an unknown phylum are found in dry forests and grasslands (Fig. 1).

Fig. 1

Mean values (±SE) for the % of bacterial and fungal phylotypes with an unknown phylum across major terrestrial biomes in 235 locations

I then generated a global atlas highlighting those soils where bacterial and fungal phylotypes with an unknown phylum are expected to be more prevalent. Building these global maps is possible for three main reasons. Firstly, the percentages of bacterial and fungal phylotypes with an unknown phylum are significantly correlated with key environmental factors at the global scale (Table 1). This result suggests that environmental data can be used to predict the distribution of fungal and bacterial phylotypes unclassified at the phylum level. Secondly, the database used here covers a wide gradient of the environmental conditions and soil properties found on Earth, and is highly representative of globally distributed terrestrial ecosystems. For example, mean annual precipitation and temperature in these locations ranged from 67 to 3085 mm and −11.4 to 26.5 °C, respectively. Moreover, soil pH ranged from 4.04 to 9.21; soil C from 0.15 to 34.77%; and fine texture fraction (% clay + silt) from 1.40 to 92.00%. Finally, high-resolution maps of the key environmental factors predicting the percentage of unclassified taxa (Table 1) are available at the global scale. Therefore, globally available information on environmental factors can potentially be used to predict global hotspots for bacterial and fungal phylotypes with an unknown phylum. These three points allowed me to generate global atlases for the potential distribution of the percentages of bacterial and fungal phylotypes with an unknown phylum (Fig. 2). These global atlases were cross-validated as explained in Appendix 1 (Supplementary Materials).
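The correlation screening summarised in Table 1 relies on Spearman rank correlation. As an illustration only, the following minimal sketch computes Spearman's rho in plain Python by ranking both variables (averaging tied ranks) and taking the Pearson correlation of the ranks. The function names and the example values are hypothetical and are not taken from the survey data or from the analysis code used in this study.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values: soil pH and % of phylotypes with an unknown phylum.
soil_ph = [4.1, 5.3, 6.0, 7.2, 8.5]
pct_unknown = [1.2, 0.9, 0.7, 0.4, 0.1]
rho = spearman_rho(soil_ph, pct_unknown)  # perfectly monotonic decrease: -1.0
```

Because the statistic depends only on ranks, it captures monotonic but non-linear relationships between an environmental factor and the percentage of unclassified phylotypes.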

Table 1 Correlation (Spearman) of the % of bacterial and fungal phylotypes with an unknown phylum (unclassified bacteria and fungi) with climate (aridity index, maximum and minimum temperature, precipitation seasonality and mean diurnal temperature range), primary productivity, dominant ecosystem type (forest and grasslands), soil properties (total organic carbon, pH and texture), and UV light in 235 locations (P < 0.05)
Fig. 2

Global atlas showing the potential distribution of the % of bacterial and fungal phylotypes with an unknown phylum (unclassified bacteria and fungi) based on their natural co-occurrence with climate (aridity index, maximum and minimum temperature, precipitation seasonality and mean diurnal temperature range), primary productivity, dominant ecosystem type (forest and grasslands), soil properties (total organic carbon, pH and texture) and UV light in 235 locations. See Fig. S1 for the 235 locations included in this study. See Appendix S1 for a cross-validation of these maps. A colour version of this figure is available in Fig. S2

The global maps included in this study indicate the potential distribution of unclassified taxa within bacteria and fungi. Interestingly, the locations where bacteria with an unknown phylum are more prevalent are distinct from those for fungi. This global atlas suggests that soils from Brazil, Chile, Russia, Indonesia, Iceland, Northern Europe, and the coastlines of North America contain a relatively high percentage of bacteria with an unknown phylum. On the other hand, deserts in Peru, China, Australia, South Africa, the Middle East, the Saharan region, and the western coast of North America contain a relatively high percentage of unclassified taxa within fungi. Soil taxonomists and microbiologists should target soils from these environments and locations to increase our chances of isolating and classifying these elusive yet significant soil taxa, and thus increase our knowledge of who they are and what they are doing in our soils.

Methods

Soil sampling

Soils were collected from 235 locations across 18 countries and six continents. Soil samples (top ~7.5 cm depth) were collected under the most common vegetation across a wide range of ecosystem (forests, grasslands, and shrublands) and climatic (arid, temperate, tropical, continental, and polar ecosystems) types. The locations sampled represent wide gradients in environmental factors, which is critical for mapping predictions. Detailed information about this survey can be found in Delgado-Baquerizo et al. [8].

Molecular analyses

Soil DNA was extracted using the Powersoil® DNA Isolation Kit (MoBio Laboratories, Carlsbad, CA, USA) according to the manufacturer’s instructions. Amplicons targeting the bacterial 16S rRNA gene (341F-805R; [13]) and the fungal ITS region (FITS7-ITS4R; [14]) were sequenced at Western Sydney University’s NGS facility (Sydney, Australia) using the Illumina MiSeq platform. Bioinformatic processing was performed using a combination of QIIME [3], USEARCH [15], and UPARSE [16]. Operational taxonomic units (OTUs; phylotypes hereafter) were identified at the ≥97% identity level. Taxonomy for bacteria and fungi was assigned using the Greengenes and UNITE databases, respectively. OTU abundance tables were constructed from these analyses. 16S rRNA reads classified as Archaea, chloroplasts, or mitochondria were removed. The percentage of bacterial and fungal phylotypes with an unknown phylum was calculated for each sample from these OTU tables. These phylotypes are classified as fungi or bacteria, but do not match data within taxonomic databases at the phylum level (unclassified bacteria and fungi hereafter). Given that soil and DNA samples were collected, extracted, and analysed following the same standardised protocol and within the same laboratory, any biases (e.g. sequencing error) would be consistent across analyses.
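The per-sample calculation described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes one sample's OTU counts and phylum assignments are held in plain dictionaries, with `None` marking a phylotype that has no phylum assignment. The function name, the data layout, and the example values are hypothetical.

```python
def percent_unknown_phylum(otu_counts, taxonomy):
    """Return (% of phylotypes, % of reads) lacking a phylum in one sample.

    otu_counts: dict mapping OTU id -> read count in the sample.
    taxonomy:   dict mapping OTU id -> phylum label, or None when the
                phylotype is unclassified at the phylum level.
    """
    # Only phylotypes actually detected in the sample are considered.
    present = [otu for otu, count in otu_counts.items() if count > 0]
    if not present:
        raise ValueError("sample contains no reads")
    unknown = [otu for otu in present if taxonomy.get(otu) is None]
    total_reads = sum(otu_counts[otu] for otu in present)
    unknown_reads = sum(otu_counts[otu] for otu in unknown)
    pct_phylotypes = 100.0 * len(unknown) / len(present)
    pct_reads = 100.0 * unknown_reads / total_reads
    return pct_phylotypes, pct_reads


# Made-up example: four phylotypes, one of them absent from this sample.
sample = {"OTU_1": 120, "OTU_2": 30, "OTU_3": 0, "OTU_4": 50}
phyla = {
    "OTU_1": "Proteobacteria",
    "OTU_2": None,  # classified as bacteria, but no phylum match
    "OTU_3": None,
    "OTU_4": "Acidobacteria",
}
pct_otus, pct_reads = percent_unknown_phylum(sample, phyla)
```

The two returned values correspond to the two quantities reported in the Results: the share of phylotypes without a phylum, and the share of sequences that those phylotypes represent.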

Environmental factors

For each location, information for twelve environmental factors was obtained: climate (maximum and minimum temperatures, precipitation seasonality, mean diurnal temperature range and Aridity Index); soil properties (pH, texture and total organic carbon); dominant ecosystem type (forest and grasslands); plant productivity; and UV light intensity. Information on soil pH, texture and total organic carbon (soil C) was obtained using standard laboratory methods [17, 18] at the laboratories of the Universidad Rey Juan Carlos (Spain). Climatic information (1 km resolution) for all sampling locations was obtained from the Worldclim database (www.worldclim.org; [11, 19]). The dominant ecosystem types (forest and grasslands) were determined in the field. Plant productivity (net primary productivity) data were obtained using the Normalized Difference Vegetation Index (NDVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS) aboard NASA’s Terra satellite (http://neo.sci.gsfc.nasa.gov/). The monthly average value for this variable was calculated for the period 2003–2015 (~10 km resolution), during which all soil sampling was conducted. Information on the annual ultraviolet index (UV index) was obtained from NASA’s Aura satellite (https://neo.sci.gsfc.nasa.gov).

Mapping the global distribution of unclassified soil taxa

The prediction-oriented regression model Cubist [20] was used to predict the percentage of bacterial and fungal phylotypes with an unknown phylum across the globe. Mapping analyses were done independently for the percentage of unclassified taxa within bacteria and within fungi. The Cubist algorithm uses regression tree analysis to derive a set of hierarchical rules from environmental covariates measured at the 235 sampled locations; these rules are then used for spatial prediction [21]. Covariates in the models include the 12 environmental factors described above as well as space (latitude and longitude). Global predictions of the percentage of unclassified taxa within bacteria and fungi were made on a 25 km resolution grid, which comprised 225530 locations. Environmental information for each of these locations, including soil properties, climatic information, plant production, ecosystem types, and UV light, was obtained from global databases available online. Global information on soil properties for this grid was obtained from ISRIC SoilGrids (global gridded soil information; https://soilgrids.org/#!/?layer=geonode:taxnwrb_250m). Global information on the major vegetation types in this study (grasslands and forests) was obtained from the Globcover2009 map from the European Space Agency (http://due.esrin.esa.int/page_globcover.php). Global information on climate, UV radiation, and net primary productivity was obtained from the WorldClim database (www.worldclim.org) and NASA satellites (https://neo.sci.gsfc.nasa.gov), as explained above. The R package Cubist was used to conduct these analyses [21].
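To convey the general idea of rule-based regression with a linear model attached to each rule, the sketch below fits a deliberately simplified stand-in: a single split on one covariate, with an ordinary least-squares line on either side. It is not the Cubist algorithm and not the R package used in this study, which handles many covariates and more elaborate rule sets. All names and the synthetic data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b). Flat line if x is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b


def squared_error(xs, ys, line):
    """Sum of squared residuals of a fitted line on the given points."""
    a, b = line
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def fit_one_split_model_tree(xs, ys, min_leaf=3):
    """Choose the threshold on x that minimises the total squared error
    of two separate linear models (one per side of the split)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical covariate values
        left_x, left_y = zip(*pairs[:i])
        right_x, right_y = zip(*pairs[i:])
        left_line = fit_line(left_x, left_y)
        right_line = fit_line(right_x, right_y)
        error = (squared_error(left_x, left_y, left_line)
                 + squared_error(right_x, right_y, right_line))
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (error, threshold, left_line, right_line)
    if best is None:
        raise ValueError("not enough distinct values to place a split")
    return best[1], best[2], best[3]


def predict(model, x):
    """Apply the rule 'x <= threshold' and evaluate the matching linear model."""
    threshold, left_line, right_line = model
    a, b = left_line if x <= threshold else right_line
    return a + b * x


# Synthetic data: the response rises with the covariate, then declines.
covariate = [0, 1, 2, 3, 4, 5, 6, 7]
response = [0, 2, 4, 6, 6, 5, 4, 3]
model = fit_one_split_model_tree(covariate, response)
grid_predictions = [predict(model, x) for x in (2.5, 6.5)]
```

In a mapping workflow of the kind described above, the fitted rules would be evaluated at every grid cell using the covariate values extracted from the global layers, in the same way `predict` is applied to the two grid values here.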