Introduction

Small subunit (16S rRNA) gene-based surveys have clearly shown that the scope of phylogenetic diversity in soil is much broader than that implied by culture-based approaches (Ovreas and Torsvik, 1998; Dunbar et al., 1999; Smit et al., 2001; Lipson and Schmidt, 2004). Although its phylum-level diversity is remarkably stable, soil is an extremely diverse ecosystem at the order, family, genus and species levels (Fulthorpe et al., 2008), with multiple yet-uncultured lineages within virtually each of the major bacterial phyla in soil (for example, Proteobacteria, Acidobacteria and Actinobacteria) (Janssen, 2006). The detailed phylogenetic analysis and taxonomic placement of 16S rRNA gene sequences have traditionally been the main focus of soil diversity studies. However, with the availability of newer sequencing technology and curated databases and the subsequent creation of large (>1000 sequences) datasets, the focus of data analysis has recently shifted towards computing more accurate estimates of species richness and evenness (Schloss and Handelsman, 2006; Roesch et al., 2007; Quince et al., 2008; Youssef and Elshahed, 2008a), identification of novel bacterial phyla (Elshahed et al., 2008), accessing members of the rare soil biosphere (Elshahed et al., 2008) and computational comparisons of communities between different soils (Roesch et al., 2007; Fulthorpe et al., 2008). Detailed phylogenetic analysis of these datasets has often been overlooked, either because of the short amplicon size created, or the sheer number of clone sequences analyzed. This is unfortunate, as such datasets, especially those with near full-length 16S rRNA gene sequences, offer a unique opportunity for an in-depth evaluation of the phylogenetic diversity within each of the major bacterial phyla in soil.

In a recent study, a near full-length 16S rRNA gene clone library was constructed from Oklahoma tall-grass prairie soil and 13 001 clones were sequenced (Elshahed et al., 2008). The most abundant phylum was shown to be the Proteobacteria as is typically observed in soil libraries (for a review, see Janssen, 2006). The Proteobacteria encompass an enormous level of morphological, physiological and metabolic diversity, and are of great importance to global carbon, nitrogen and sulfur cycling (Kersters et al., 2006). Despite this phylum containing more validly described isolates than any other phylum (Kersters et al., 2006), the vast majority of soil Proteobacteria are yet to be cultivated. In this study, we describe the composition of Proteobacteria clones from Oklahoma tall-grass prairie soil, in which the majority of clones belong to family- and order-level lineages containing no characterized cultivated isolates, and compare the ecological distribution of some of the dominant uncharacterized orders whose functions in soil remain unknown.

Materials and methods

Phylogenetic analysis of Kessler Farm soil Proteobacteria 16S rRNA gene sequences

The dataset used in this study initially consisted of 13 001 16S rRNA clone sequences from soil, described in an earlier study (Elshahed et al., 2008). Briefly, a clone library (n=13 001 clones) was constructed from 16S rRNA genes (PCR-amplified using primers 27F and 1391R) from community DNA extracted from Kessler Farm Soil (KFS), which was collected from an undisturbed tall-grass prairie preserve in Central Oklahoma. Sequences were binned into operational taxonomic units (OTUs) at a 97% similarity cutoff using DOTUR (Schloss and Handelsman, 2005). Soil characteristics, and details of sampling, DNA extraction, PCR amplification, 16S rRNA clone library construction and sequencing, and initial phylogenetic classification of 16S rRNA sequences can be found in the original manuscript (Elshahed et al., 2008).
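To make the binning step concrete, the following is a minimal sketch of distance-based OTU binning at a 3% distance (97% similarity) cutoff. It assumes furthest-neighbor (complete-linkage) clustering on uncorrected pairwise distances and is not DOTUR's implementation; the function names, the `mutate` helper and the toy sequences are all hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatches between two aligned sequences, skipping gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def bin_otus(seqs: dict, cutoff: float = 0.03) -> list:
    """Complete-linkage clustering: repeatedly merge the two closest clusters
    as long as their most distant pair of members is within `cutoff`."""
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    clusters = [{name} for name in seqs]

    def linkage(c1, c2):
        # Furthest-neighbor distance between two clusters.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters


def mutate(seq: str, positions) -> str:
    """Hypothetical helper: substitute the base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for pos in positions:
        chars[pos] = swap[chars[pos]]
    return "".join(chars)


# Toy aligned sequences (100 nt), made up for illustration only.
base = "ACGT" * 25
toy = {
    "clone_a": base,
    "clone_b": mutate(base, [3, 50]),           # 2% from clone_a
    "clone_c": mutate(base, range(10, 30, 2)),  # 10% from clone_a
}
otus = bin_otus(toy)
```

In this toy case `clone_a` and `clone_b` fall into one OTU and `clone_c` forms its own, because the furthest pair in any merged cluster must stay within the 3% cutoff.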

Sequences representative of each OTU identified as Proteobacteria in the original manuscript were aligned using Greengenes’ NAST alignment tool (DeSantis et al., 2006a, 2006b). Aligned KFS and closely related 16S rRNA sequences were imported into the Greengenes May 2007 ARB database (DeSantis et al., 2006a) using the ARB software package, available on-line at http://www.arb-home.de/ (Ludwig et al., 2004). We used the on-line program Pintail (Ashelford et al., 2005) to screen individual sequences within the Proteobacteria dataset using suspicious sequences (those identified by Bellerophon (Huber et al., 2004) or those with unclear phylogenetic affiliation or that formed unusually long branches in neighbor-joining dendrograms) as the query sequence, and the closest cultured relative or a reliable closely-related abundant KFS OTU sequence (n>50) as the reference sequence. After removal of chimeras, 2675 Proteobacteria clones belonging to 479 OTUs were classified to the family taxonomic level using phylogenetic tree-building methods. Initial placement of OTUs in already-named families according to the Hugenholtz taxonomic framework (DeSantis et al., 2006a) was determined by parsimony placement of KFS clone sequences into the ARB universal dendrogram. Distance trees of each class within Proteobacteria were constructed using the neighbor-joining algorithm and Jukes–Cantor corrections using the ARB software package (Ludwig et al., 2004) with filters available for each class of Proteobacteria. Branching of distance trees was also verified by constructing trees through the same methods using PAUP 4.0b10 software (Sinauer Associates, Sunderland, MA, USA) and generating bootstrap percentages based on 1000 replicates. Final classifications of KFS OTUs into families, according to the Hugenholtz taxonomic outline (DeSantis et al., 2006a), were determined by placement of each OTU into a bootstrap-supported (>50%) already-named or novel family in constructed trees.
In general, novel families were defined as a bootstrap-supported group of at least two clone sequences sharing more than approximately 92–93% sequence similarity with each other but less than approximately 92–93% sequence similarity to sequences from an already-named family. Novel orders were defined similarly, using 90% as a general cutoff, though these values varied between classes of Proteobacteria (for example, Deltaproteobacteria is more divergent than Alpha- and Betaproteobacteria).
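As an illustration of two ideas in this paragraph, the sketch below shows the standard Jukes–Cantor correction, d = -(3/4) ln(1 - 4p/3) for an observed mismatch fraction p, and one possible way of encoding the general novelty criteria. The trees in this study were built in ARB and PAUP, not with this code. The function names and parameters are hypothetical, and the single 92% cutoff stands in for the approximate, class-dependent values described above.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed mismatch fraction p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def is_novel_lineage(
    group_size: int,
    min_within_similarity: float,
    max_similarity_to_named: float,
    bootstrap: float,
    cutoff: float = 0.92,        # assumed family-level value; 0.90 for orders
    min_bootstrap: float = 50.0,
) -> bool:
    """Apply the general criteria: a bootstrap-supported group of at least two
    sequences that are similar to each other but not to any named lineage."""
    return (
        group_size >= 2
        and bootstrap > min_bootstrap
        and min_within_similarity > cutoff
        and max_similarity_to_named < cutoff
    )


# The correction grows faster than the raw mismatch fraction as p increases.
corrected = jukes_cantor(0.10)

# Hypothetical group: 3 clones, at least 95% similar to each other, at most 89%
# similar to any named family, with 78% bootstrap support.
candidate_family = is_novel_lineage(3, 0.95, 0.89, 78.0)
```

Because the cutoffs were applied as general guides alongside tree topology, a real assignment would also depend on where the group falls in the bootstrap-supported tree, which this simple predicate does not capture.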

Ecological distribution of abundant KFS uncharacterized lineages

We chose the six most abundant Proteobacteria order-level lineages containing no characterized, cultivated representatives (Deltaproteobacteria-KFS-6, EB1021, Ellin314, MND1, A21b and Ellin339), and recorded the isolation source of all available environmental clone sequences belonging to each order. To determine which environmental clone sequences belonged to an order, we first created distance trees in ARB using all sequences belonging to the order based on the universal parsimony tree, using the May 2007 Greengenes database. Second, we used the BLAST algorithm on the NCBI website (in November 2008) to search for more recently deposited sequences belonging to each order, using the ‘type sequence’ (the environmental clone sequence after which the order was named, for example, MND1) as the query and 90% similarity as a general cutoff.
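The second step can be sketched as a filter over BLAST hits. The searches here were run on the NCBI website, so the example below rests on an assumption: that hits are available in BLAST's standard 12-column tabular layout. The minimum alignment length is also an added assumption, not a criterion from this study, and the subject identifiers and all values in the example rows are made up.

```python
import csv
import io

# Column order of BLAST's standard tabular output.
TABULAR_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def candidate_members(tabular_text: str,
                      min_identity: float = 90.0,
                      min_length: int = 1000) -> list:
    """Return subject IDs whose hit meets the identity cutoff.

    `min_length` is a hypothetical extra filter that avoids accepting
    short fragments on percent identity alone.
    """
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=TABULAR_FIELDS, delimiter="\t")
    return [
        row["sseqid"]
        for row in reader
        if float(row["pident"]) >= min_identity
        and int(row["length"]) >= min_length
    ]


# Made-up hits against a type sequence used as the query.
example = "\n".join(
    "\t".join(fields) for fields in [
        ["MND1", "hypothetical_clone_1", "93.5", "1320", "86", "0",
         "1", "1320", "1", "1320", "0.0", "1900"],
        ["MND1", "hypothetical_clone_2", "88.1", "1310", "156", "0",
         "1", "1310", "1", "1310", "0.0", "1500"],
        ["MND1", "hypothetical_clone_3", "97.0", "250", "8", "0",
         "1", "250", "1", "250", "1e-100", "400"],
    ]
)
hits = candidate_members(example)
```

Only the first made-up hit passes both filters. Candidates found this way would still need to be placed in a tree, as in the first step, before being assigned to an order.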

Results and discussion

Abundance and composition of Proteobacteria in KFS and other soils

The Proteobacteria-affiliated clones in KFS represented 25% of the total 16S rRNA clone sequences (Elshahed et al., 2008) compared with an average of 40% abundance in all published soil studies analyzing >1000 16S rRNA sequences, including eight individual soil samples in addition to a composite collection of soil libraries compiled by Janssen (2006) (Table 1). In the clone library studies, including those generating >1000 near full-length 16S rRNA genes and the Janssen compilation study (analyzing 16S rRNA gene sequences >300 bp), Proteobacteria comprised 25–40% of total sequences, compared with 42–50% in studies of shorter (100 bp) fragments generated by pyrosequencing (Table 1). Although such a larger proportion of Proteobacteria in pyrosequencing-based studies might be a true reflection of the communities analyzed, it might also indicate the existence of a cloning bias or that classification based on small 16S rRNA gene fragments could lead to different taxonomic assignments than classification based on near full-length sequences, as suggested earlier (Elshahed et al., 2008). Nevertheless, Proteobacteria remains the most abundant soil phylum, regardless of the approach used, which aside from PCR-based clone libraries and pyrosequencing has included metagenomics (Liles et al., 2003; Tringe et al., 2005), fluorescent in situ hybridization (Zarda et al., 1997) and microarray analysis (Yergeau et al., 2009).

Table 1 Comparison of the composition and abundance of Proteobacteria in Kessler Farm soil to other soils among published studies analyzing >1000 PCR-amplified 16S rRNA sequences

The most abundant class (39% of total Proteobacteria clones) in KFS was Alphaproteobacteria, followed by Delta- (37%), Beta- (16%) and Gammaproteobacteria (7.6%). Among all clone library datasets (>1000 sequences) of PCR-amplified 16S rRNA genes from soil (Table 1), Alphaproteobacteria is the most abundant class, comprising 35–58% of Proteobacteria clones, whereas Gammaproteobacteria is typically, though not always, the least abundant (5.9–17%). Deltaproteobacteria was overrepresented in KFS compared with other large soil datasets, whereas Betaproteobacteria was underrepresented (Table 1). Epsilonproteobacteria, which has not been detected in many of the large 16S rRNA soil libraries (Table 1), was not detected in KFS, suggesting that this class is either extremely rare in soil or is not as ubiquitous as the other classes within Proteobacteria. Likewise, the recently discovered class Zetaproteobacteria, which seems to have a limited ecological distribution and limited metabolic abilities (Emerson et al., 2007), was undetected in KFS and other large soil clone libraries (Table 1).

Family and order-level diversities within KFS Proteobacteria

Classifier programs, available from Greengenes and the Ribosomal Database Project (Cole et al., 2005; DeSantis et al., 2006a), provide useful tools for initial classification of 16S rRNA gene sequences; however, inaccurate taxonomic assignments may be made without tree-building phylogenetic analyses, especially at the subphylum levels. In addition, uncertain placements of clones with low sequence similarity to their closest relative have been observed with both classification programs, resulting in outputs with multiple placement suggestions (Greengenes), or low confidence in order- and family-level affiliation outputs (Ribosomal Database Project). Moreover, satisfactory identification and documentation of novel lineages requires detailed phylogenetic analysis and tree-building approaches. In this study, phylogenetic associations at the class, order and family levels were initially determined using both Greengenes and Ribosomal Database Project classification programs, and were verified by parsimony analysis using the ARB software package and neighbor-joining analysis using PAUP 4.0b10. Using this combined approach, 120 family-level lineages were identified belonging to 60 orders (Table 2). Alphaproteobacteria had the highest number of families and orders, consisting of 45 families within 29 orders, and was followed by Deltaproteobacteria (33 families within 15 orders) (Table 2, Figures 1 and 2). Beta- and Gammaproteobacteria were less diverse, containing 23 and 19 families within five and 11 orders, respectively (Table 2, Figures 3 and 4). This pattern of order- and family-level diversity rankings between the various Proteobacteria classes is in agreement with the diversity ranking estimated from the same datasets based on rarefaction curve analysis and diversity ordering approaches of KFS OTU0.03 (Youssef and Elshahed, 2008b).

Table 2 Composition and novel and uncharacterized lineages within the different classes of Proteobacteria
Figure 1

Distance phylogram of Alphaproteobacteria KFS OTU sequences based on aligned near full-length 16S rRNA gene sequences (approximately 1350 bp) from KFS clone library as well as representative sequences from each family-level lineage downloaded from GenBank, totaling 329 sequences, with each clade shown representing a family-level lineage (unless otherwise noted), consisting of at least two sequences. The tree was rooted with the 16S rRNA gene sequence from Chloroflexus aurantiacus (GenBank accession no. AJ308501). Bootstrap values are based on 1000 replicates, and symbols to the left of each branch mark bootstrap support of >90% (•), 70–89% and 50–69%. Black clades represent families with characterized, described cultivated representatives. Gray and unfilled clades represent uncharacterized families, consisting of clone sequences and sequences from unpublished or uncharacterized isolates (gray) or only environmental clone sequences (unfilled). Numbers beside each clade denote the number of clone sequences and OTUs detected from each family in the KFS clone library. Orders, according to Hugenholtz taxonomy and the Greengenes ARB May 2007 database, are shown to the right of families. Novel lineages are shown in bold, with novel orders labeled as Proteobacteria class-KFS-# (for example, Alphaproteobacteria-KFS-1). Novel families within novel orders are labeled according to clone names (for example, FFCH2458), and novel families within characterized orders are labeled as order name-KFS-# (for example, Sphingomonadales-KFS-1).

Figure 2

Distance phylogram of Deltaproteobacteria KFS OTU sequences based on aligned near full-length 16S rRNA gene sequences from KFS clone library as well as representative sequences from GenBank, totaling 241 sequences. Tree construction and notations are the same as described in Figure 1.

Figure 3

Distance phylogram of Betaproteobacteria KFS OTU sequences based on aligned near full-length 16S rRNA gene sequences from KFS clone library as well as representative sequences from GenBank, totaling 128 sequences. Tree construction and notations are the same as described in Figure 1.

Figure 4

Distance phylogram of Gammaproteobacteria KFS OTU sequences based on aligned near full-length 16S rRNA gene sequences from KFS clone library as well as representative sequences from GenBank, totaling 183 sequences. Tree construction and notations are the same as described in Figure 1.

Prevalence of uncharacterized and novel lineages within KFS Proteobacteria

A large share of KFS Proteobacteria clones belonged to uncharacterized lineages (families or orders containing no validly described species); in total, 50% and 65% of KFS Proteobacteria clones belonged to uncharacterized orders and families, respectively (Table 2). It is important to note, however, that among the Alpha-, Beta- and Gammaproteobacteria, some microorganisms within these uncharacterized lineages have been cultivated, but have been neither characterized nor validly described (Figures 1, 3 and 4). Indeed, within all Proteobacteria classes in KFS with the exception of Alphaproteobacteria, the most abundant orders contained no characterized pure cultures. The most abundant order in Alphaproteobacteria was Bradyrhizobiales (Figure 1), which consisted of 463 clones (39 OTUs) and contained the most abundant OTU in the KFS dataset (n=204). The most abundant orders in Deltaproteobacteria were EB1021 (310 clones, 20 OTUs) and novel order Deltaproteobacteria-KFS-6 (210 clones, nine OTUs) (Figure 2), neither of which contains any cultivated microorganisms. The dominant orders in Beta- and Gammaproteobacteria in KFS were MND1 and Ellin339, respectively (Figures 3 and 4), which are also uncharacterized lineages. Deltaproteobacteria contained the highest number of clones belonging to undescribed lineages, with 637 clones (64%) belonging to uncharacterized orders and 848 clones (85%) belonging to uncharacterized families. These Deltaproteobacteria lineages consisted solely of environmental clone sequences, none containing any cultivated representatives, suggesting that soil Deltaproteobacteria may be extremely difficult to cultivate in pure culture in the laboratory using standard heterotrophic growth media.

In addition, KFS contained numerous novel lineages within the Proteobacteria dataset (Table 2). In total, 15 novel orders and 48 novel families among the four classes were named in this study (Figures 1, 2, 3 and 4; for detailed descriptions of Proteobacteria KFS OTU phylogenetic affiliations, including all novel lineages, see Supplementary Table 1). The large number of novel families and orders identified from a single clone library clearly suggests that global soil Proteobacteria diversity is far broader than current database collections suggest. Likewise, although Proteobacteria is the most abundant soil phylum and contains more validly described species than any other phylum, the functions of the majority of Proteobacteria in soil remain to be shown.

Ecological distribution of abundant uncharacterized order-level lineages

As the majority of KFS Proteobacteria clones belong to family- and order-level lineages with no characterized representatives, the functions of these groups of microorganisms in soils are completely unknown. To gain insight into the rarity and ecological distribution of uncharacterized lineages within Proteobacteria, we chose the six most abundant KFS uncharacterized orders, Deltaproteobacteria-KFS-6 (Deltaproteobacteria, n=210), EB1021 (Deltaproteobacteria, n=310), Ellin314 (Alphaproteobacteria, n=103), MND1 (Betaproteobacteria, n=198), A21b (Betaproteobacteria, n=99) and Ellin339 (Gammaproteobacteria, n=99), and mapped their distribution among different environmental categories using data available from 16S rRNA sequences deposited in GenBank. We found that these six lineages, collectively, have been identified in 174 different sampling sites that fall into 30 general environmental categories, the most abundant of which was soil, whereas many samples also came from aquatic and subsurface ecosystems (Table 3; for details and references for each study, see Supplementary Table 2).

Table 3 Distribution of six abundant uncharacterized order-level lineages from Kessler Farm soil among different types of ecosystems

The two Deltaproteobacteria orders were the most abundant of the uncharacterized orders; however, novel order Deltaproteobacteria-KFS-6 was detected in only four sites, all from soil. EB1021 contained the most clones out of any of the KFS uncharacterized orders, and was detected in 52 total samples from 15 different ecosystem types. This order was detected in 25 out of the 61 different soil sample sites, but was detected in 90% of the deep-sea sediment sites (Table 3 and Supplementary Table 2) and both of the marine sponge studies. Interestingly, among aquatic environments, EB1021 was detected in all sediment ecosystems (freshwater, estuarine and marine) but was not detected in any of the overlying water ecosystems, suggesting EB1021 could be preferentially distributed in anoxic ecosystems. Thus, members of EB1021 might be living in anoxic or hypoxic microenvironments within soil aggregates, and the use of anaerobic techniques could prove useful in trying to cultivate members of EB1021.

From the Alphaproteobacteria, the uncharacterized order Ellin314 was detected in more ecosystem types than any of the other KFS uncharacterized orders (Table 3, Supplementary Table 2). Most notably, members of this order have been detected in 75% of samples from anaerobic enrichments or consortia degrading organic pollutants. Like EB1021, Ellin314 was detected in 25 of the 61 soil sites, and was more frequently detected in aquatic sediments than in overlying water, including 60% of the deep-sea sediment sites. Unlike EB1021, however, organisms belonging to Ellin314 have been cultivated but not characterized (Joseph et al., 2003).

From the Betaproteobacteria, MND1 (the dominant order in KFS Betaproteobacteria) was detected in 84 different sample sites, more than any of the other KFS uncharacterized orders (Table 3 and Supplementary Table 2), being detected more frequently in soil, aquatic and subsurface ecosystems, which suggests that MND1 may be diverse in function and/or capable of tolerating a wide range of environmental conditions. MND1 was detected in 18 of the 25 total subsurface sites, which is triple the number of any other KFS uncharacterized order. MND1 was first detected in ferromanganous-coated sediment (Stein et al., 2001; Joseph et al., 2003), but it shows no preferential distribution towards either aerobic or anaerobic environments. A21b (Betaproteobacteria) has a similar distribution pattern to MND1, but has been detected in fewer samples, has rarely been documented in subsurface community studies, and has not been detected in any marine environments to date (Table 3 and Supplementary Table 2). Like A21b, Ellin339 (the dominant order in KFS Gammaproteobacteria) was rare in subsurface sites and was not detected in any marine samples. However, unlike other KFS uncharacterized orders, Ellin339 was detected among more freshwater sites and was the only order detected in several acid mine drainage sites (Table 3 and Supplementary Table 2). In addition, Ellin339 was detected in an acid-impacted lake (Percent et al., 2008) and an extremely acidic river (Garcia-Moyano et al., 2007), suggesting this uncharacterized order likely contains acid-tolerant or acidophilic bacteria.
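The detection counts quoted in this section are easier to compare as frequencies. The short sketch below converts the counts given in the text into percentages. It uses only figures stated above (the full site-by-order matrix is in Table 3, which is not reproduced here), and the variable and function names are hypothetical.

```python
def detection_frequency(detected: int, total: int) -> float:
    """Percentage of surveyed sites in which a lineage was detected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * detected / total


# (order, ecosystem) -> (sites where detected, total sites), as stated in the text.
reported = {
    ("EB1021", "soil"): (25, 61),
    ("Ellin314", "soil"): (25, 61),
    ("MND1", "subsurface"): (18, 25),
}

frequencies = {
    key: round(detection_frequency(*counts), 1)
    for key, counts in reported.items()
}

for (order, ecosystem), pct in frequencies.items():
    print(f"{order} in {ecosystem} sites: {pct}%")
```

On these figures, EB1021 and Ellin314 each appear in roughly 41% of the soil sites, well below the 90% and 60% of deep-sea sediment sites reported for them, while MND1 appears in 72% of the subsurface sites.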

This study highlights the importance of detailed subphylum-level phylogenetic analysis of large 16S rRNA datasets, a process that is increasingly overlooked in favor of automated phylum-level assignment. The discovery and documentation of 15 novel orders and 48 novel families within the Proteobacteria in a single dataset indicates that even in phyla with multiple cultured representatives, the breadth of subphylum-level diversity is not completely understood. Finally, our survey of the ecological distribution of six abundant, yet-uncultured Proteobacteria orders suggests that most of these uncharacterized lineages may be ecologically important not only in soil but in many ecosystems globally, and that specific enrichment and isolation approaches that have rarely been tested (for example, acidic, hypoxic or anoxic conditions) might prove useful in obtaining these lineages in pure culture.