Introduction

Microorganisms control many biogeochemical processes. Therefore, the pattern of microbial biodiversity and its response to environmental changes are of great interest. Indeed, a large body of literature demonstrates that microbial community composition varies over time and space and responds to environmental changes (Horner-Devine et al., 2004; Fuhrman et al., 2006; Allison and Martiny, 2008). However, the overwhelming amount of microbial diversity and its possible functional redundancy suggest that knowledge of microbial composition offers little predictive power for microbial process rates. Thus, it remains unclear how microbial community composition is related to ecosystem functioning.

The key to addressing this issue is the relationship between phylogeny and functional traits (Cadotte et al., 2008; Gravel et al., 2011). Predicting changes in community composition and ecosystem processes based on functional traits has received considerable attention in plant ecology. Recently, studies have found that the phylogenetic diversity of a plant community is an important predictor of plant productivity (Cadotte et al., 2008; Cavender-Bares et al., 2009). These results suggest that plant functional traits, especially those related to productivity, are more commonly shared among closely related species. Supporting this idea, a microcosm study showed that more closely related protists share more traits and that relatedness affected the competition among coexisting organisms (Violle et al., 2011). Thus, species richness or diversity alone may be insufficient to understand how community composition and assembly affect ecosystem function. Instead, we also need information about the phylogenetic relatedness of the organisms (Webb et al., 2002). However, some studies have failed to detect a link between phylogenetic relatedness and functional traits in plant communities (for example, Swenson et al., 2012), possibly because of high variation in functional potential among closely related lineages. Thus, the phylogenetic distribution and clustering of functional traits may affect both community assembly and functioning.

In a similar manner, the phylogenetic conservatism of microbial functional traits can also be considered (Philippot et al., 2010). Continuous traits in microorganisms include salt tolerance or metabolic activity at different temperatures, whereas discrete traits include the metabolic ability to fix nitrogen or CO2. If the traits are highly phylogenetically conserved, then they will be shared among members of deep clades of microorganisms (Figure 1—trait A). Alternatively, traits that are less conserved will be found in small phylogenetic clades (Figure 1—trait B). Finally, gene loss, convergent evolution and lateral gene transfer can result in the distribution of many traits across multiple phylogenetic groups (Doolittle, 1999; Snel et al., 2002), leading to random associations between phylogenetic and functional relatedness (Boucher et al., 2003) (Figure 1—trait C).

Figure 1

Hypothetical phylogenetic distribution of functional traits present in i clades, including one clade with trait A with a high τD, three clades with trait B with a low τD, and five randomly distributed lineages with trait C. Ri denotes the root node of each clade i with a given trait, and τD is the average 16S rRNA distance between the root node and the strains in each clade sharing a trait.

The underlying mechanisms responsible for variation in the phylogenetic dispersion of functional traits are presently not clear. However, it is known that central gene systems consisting of many interacting proteins evolve more slowly and are less likely to be transferred between organisms (Jain et al., 1999; Pal et al., 2005; Shi et al., 2005; Wellner et al., 2007). In contrast, genes associated with the peripheral systems are more likely to be gained and lost. Thus, we hypothesize that functional traits based on complex genetic systems integral to the organism evolve slowly and will be more phylogenetically conserved (Figure 1—trait A). This would result in traits shared in phylogenetically deep clusters. In contrast, simple traits consisting of few interacting proteins will be more dispersed or completely randomly distributed (Figures 1—trait B and C). Here, traits would be shared in phylogenetically shallow clusters.

Different approaches have been proposed for estimating evolutionary rate of change and phylogenetic dispersion of traits. This includes estimates of character evolution using either maximum likelihood or Bayesian approaches. Alternatively, other approaches compare the phylogenetic distribution of a trait to random or Brownian Motion models of evolution, such as Pagel’s lambda for continuous traits and Purvis and Fritz’s D for discrete traits (Pagel, 1999; Fritz and Purvis, 2010). Although these tests quantify the rate of evolution or evaluate the significance of phylogenetic clustering, they do not estimate the phylogenetic relatedness of the clusters of organisms sharing a trait—for example, in terms of 16S rRNA sequence similarity. For instance, if a given functional trait is associated with fine-scale clusters with a low trait depth (τD), the relative abundance of alpha- vs gamma-Proteobacteria may not accurately predict this function in a community. Thus, knowing the actual τD is important for linking specific changes in microbial community composition with changes in the functional potential of a community.

The aim of this study was twofold. First, we developed a new phylogenetic metric—consenTRAIT—that directly estimates the sequence similarity of clusters of organisms sharing discrete traits. Second, we used this test to examine the distribution and degree of phylogenetic clustering of a variety of microbial traits. To address the latter, we analyzed two data sets. The first data set includes all completely sequenced prokaryotic genomes. For each genome, we identified the presence of genomic subsystems underlying many ecologically relevant functions like photosynthesis, CO2 fixation, N-cycling, or growth on various organic carbon sources. The advantage of this data set is that it provides a broad overview of the metabolic potential of many strains across many phylogenetic groups. At the same time, many traits are associated with genetic changes beyond the presence or absence of a pathway, such as differences in regulation or mutations in associated genes. Therefore, we also analyzed a second data set of organic carbon usage by hundreds of bacterial strains. This data set provides phenotypic information on specific traits for many closely related bacteria, but does not cover as broad a phylogenetic range as the genomic data set.

Using our new metric and these two extensive data sets, we specifically asked three questions: Are most traits phylogenetically randomly distributed? If not, are different traits associated with different levels of phylogenetic clustering? Finally, is the degree of phylogenetic clustering of a trait correlated with its molecular complexity? If so, this result provides a biochemical basis for predicting the phylogenetic conservatism and distribution of traits in microorganisms.

Materials and methods

Data sets

To identify the phylogenetic distribution of traits in microorganisms, we analyzed two data sets. The first data set consisted of 2229 prokaryotic genomes (full Patric SEED database of 15 October 2010 (Snyder et al., 2007)). This covered 26 phyla (Supplementary Figure S1). We then identified the distribution of 19 genomic subsystems underlying ecologically relevant functions. We used the SEED annotation database (Supplementary Table S1), which provided a uniform annotation for all genomes (Overbeek et al., 1999, 2005). We used the SEED API for all analyses (Disz et al., 2010). Whenever possible, we manually checked and edited the presence of each subsystem by searching the literature for assigned physiologies. We estimated the number of genes underlying a trait as the number of genes in the corresponding SEED subsystem. For the second data set, we retrieved data describing the utilization patterns of 70 organic carbon substrates by 738 strains from Biolog Inc. (Hayward, CA, USA) (Supplementary Table S2). This data set covered five phyla (Supplementary Figure S1). These carbon sources ranged from C1 to polymeric compounds. The molecular complexity of each carbon substrate, as determined by the Bertz Complexity Index (Bertz, 1981), was recorded from http://pubchem.ncbi.nlm.nih.gov/.

Phylogenetic tree reconstruction

An aligned 16S rRNA sequence associated with each genome was retrieved from the Silva database (Pruesse et al., 2007). For the carbon substrate utilization data set, we matched the strain names to entries in the Silva database in order to retrieve 16S rRNA sequences. A phylogenetic tree for each data set was estimated using PHYLIP (DNA distance with F84 correction, neighbor-joining, and 100 bootstraps) (Felsenstein, 2006) and RAxML (350 bootstraps) at the CIPRES science gateway (Stamatakis et al., 2008). Unless otherwise noted, the results displayed were estimated using the DNA distance method.

Phylogenetic mapping of traits

For each trait from the two data sets, we determined the τD, phylogenetic dispersion and frequency of gene gain and loss. τD was estimated using a custom-designed algorithm, consenTRAIT (consensus analysis of phylogenetic trait distribution). The script was written in R using functions from the ‘ape’ and ‘adephylo’ packages (Paradis et al., 2004; Jombart et al., 2010). To estimate τD, we first identified the root node of clades where at least 90% of the members shared the trait. We then estimated the average consensus sequence distance (d) between the root node (Rj) of each of the n clades (j) sharing a given trait and the m_j members (i) of that clade (that is, the leaves, Si):

$$\tau_D = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{m_j}\sum_{i=1}^{m_j} d(R_j, S_i)$$

This estimate was repeated for each bootstrap tree. We attributed the presence of singleton entries (that is, no neighbor sharing the trait) to undersampling. Therefore, we scored τD for singleton entries using half the distance to the nearest internal node, which essentially assumes an equal likelihood of finding a neighbor organism with or without the trait. We included a non-parametric estimate for significant clustering by randomly assigning traits 1000 times (10 times to each bootstrap tree) to entries in the phylogenetic tree as a null distribution. We then compared the estimated τD to this null distribution.
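The clade search and the randomization test described above can be illustrated with a small sketch. The original script was written in R with ‘ape’ and ‘adephylo’; the Python below is not that script but a minimal restatement under stated assumptions. The `Node` class, the function names, the toy tree and the trait assignment are all hypothetical. Clade depth is averaged over all tips of a qualifying clade, a singleton is scored as half its terminal branch length, and the P-value is taken as the share of randomized assignments whose τD is at least as large as the observed value (the direction of that comparison is an assumption).

```python
import random
from statistics import mean


class Node:
    """Minimal rooted tree node (hypothetical helper, not part of any library)."""

    def __init__(self, name=None, length=0.0, children=()):
        self.name = name            # tip label; None for internal nodes
        self.length = length        # branch length to the parent node
        self.children = list(children)


def tip_depths(node):
    """Return [(tip name, distance from `node` to that tip)] for all tips below `node`."""
    if not node.children:
        return [(node.name, 0.0)]
    out = []
    for child in node.children:
        out.extend((name, d + child.length) for name, d in tip_depths(child))
    return out


def clade_depths(node, trait, threshold=0.9):
    """Depths of the maximal clades in which >= `threshold` of the tips carry the trait."""
    tips = tip_depths(node)
    share = sum(trait[name] for name, _ in tips) / len(tips)
    if share >= threshold:
        if not node.children:
            # Singleton: half the distance to the nearest internal node.
            return [node.length / 2.0]
        return [mean(d for _, d in tips)]
    depths = []
    for child in node.children:
        depths.extend(clade_depths(child, trait, threshold))
    return depths


def tau_d(root, trait, threshold=0.9):
    """Average clade depth over all clades sharing the trait (NaN if none)."""
    depths = clade_depths(root, trait, threshold)
    return mean(depths) if depths else float("nan")


def permutation_p(root, trait, n_perm=1000, seed=0):
    """Share of random trait assignments with tau_D >= the observed tau_D."""
    rng = random.Random(seed)
    observed = tau_d(root, trait)
    names = list(trait)
    values = [trait[name] for name in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if tau_d(root, dict(zip(names, values))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy tree ((A:0.01,B:0.01):0.02,(C:0.03,(D:0.005,E:0.005):0.025):0.01);
ab = Node(length=0.02, children=[Node("A", 0.01), Node("B", 0.01)])
de = Node(length=0.025, children=[Node("D", 0.005), Node("E", 0.005)])
cde = Node(length=0.01, children=[Node("C", 0.03), de])
toy_root = Node(children=[ab, cde])
toy_trait = {"A": True, "B": True, "C": False, "D": True, "E": False}

print(tau_d(toy_root, toy_trait))
print(permutation_p(toy_root, toy_trait, n_perm=200))
```

On this toy tree, tips A and B form one clade of depth 0.01 and tip D is a singleton scored at 0.0025, so τD is their average, about 0.00625. Repeating the calculation on each bootstrap tree would give the distribution of τD values summarized in the box plots.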

We used the test developed by Fritz and Purvis (2010), as implemented in the R package ‘caper’, to estimate the phylogenetic dispersion (D) of discrete traits. We ran 1000 permutations based on random or Brownian motion patterns of evolution and compared the observed phylogenetic pattern to these two distributions for each trait. Here, D<0 suggests a highly clustered trait, D≈0 indicates a Brownian motion mode of evolution, D=1 suggests a random mode of evolution and D>1 suggests phylogenetic overdispersion (Fritz and Purvis, 2010). A trait was assigned as non-random if the probability of matching a random distribution was less than 50%. We used the Lilliefors test to examine normality (Lilliefors, 1967) and Pearson and Spearman correlation analyses to test for significant correlations between variables. This was done with Matlab (Mathworks, Natick, MA, USA).
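The interpretation of D follows from how the statistic is scaled: an observed sum of sister-clade differences is rescaled against the mean sums obtained from random and Brownian simulations, so that the random expectation maps to D=1 and the Brownian expectation to D=0 (Fritz and Purvis, 2010). The sketch below shows only that rescaling. It does not reproduce the ‘caper’ implementation, the function names are hypothetical, and the numbers are invented for illustration rather than taken from this study.

```python
from statistics import mean


def scaled_d(observed_sum, random_sums, brownian_sums):
    """Rescale an observed sum of sister-clade differences to D.

    D = (observed - mean Brownian) / (mean random - mean Brownian),
    so D = 0 at the Brownian expectation and D = 1 at the random one.
    """
    mean_random = mean(random_sums)
    mean_brownian = mean(brownian_sums)
    return (observed_sum - mean_brownian) / (mean_random - mean_brownian)


def p_random(observed_sum, random_sums):
    """Share of random permutations whose sum falls below the observed sum."""
    return sum(s < observed_sum for s in random_sums) / len(random_sums)


# Invented simulation outputs, for illustration only.
toy_random = [30, 28, 32, 30]
toy_brownian = [10, 14, 12, 12]

for observed in (12, 21, 30):
    print(observed, scaled_d(observed, toy_random, toy_brownian),
          p_random(observed, toy_random))
```

With these toy inputs, an observed sum equal to the Brownian mean gives D=0, a sum equal to the random mean gives D=1, and a sum halfway between gives D=0.5.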

Character evolution rate

Trait gain and loss events were estimated with a Wagner parsimony approach using Count (Csuros, 2010). We used a likelihood ratio of 1:4 for gene gain and loss events, respectively, but also tested different ratios. These gave comparable results (R>0.9). We also estimated the rate of trait change with a Bayesian approach using SIMMAP version 1.5 (Bollback, 2006). We estimated the rates using both the default priors and the overall and bias priors estimated with an MCMC analysis.
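The effect of weighting gains and losses differently can be illustrated with a generic weighted-parsimony (Sankoff-style) recursion for a presence/absence trait. This is a sketch of the general technique, not of Count itself. How the 1:4 ratio above maps onto Count's penalty settings is not specified here, so `gain_cost` and `loss_cost` are hypothetical parameters, and the tree and trait states are invented.

```python
import math


class Node:
    """Minimal rooted tree node (hypothetical helper)."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)


def sankoff(node, trait, gain_cost, loss_cost):
    """Return (min cost if `node` lacks the trait, min cost if it has the trait)."""
    if not node.children:
        # A tip is fixed to its observed state; the other state is impossible.
        return (math.inf, 0.0) if trait[node.name] else (0.0, math.inf)
    absent, present = 0.0, 0.0
    for child in node.children:
        c_absent, c_present = sankoff(child, trait, gain_cost, loss_cost)
        # Parent absent: child stays absent, or gains the trait on this branch.
        absent += min(c_absent, c_present + gain_cost)
        # Parent present: child stays present, or loses the trait on this branch.
        present += min(c_absent + loss_cost, c_present)
    return absent, present


def parsimony_cost(root, trait, gain_cost=1.0, loss_cost=1.0):
    """Minimum total cost of gains and losses over both possible root states."""
    return min(sankoff(root, trait, gain_cost, loss_cost))


# Toy topology ((A,B),(C,(D,E))) with the trait missing only from tip E.
toy_tree = Node(children=[
    Node(children=[Node("A"), Node("B")]),
    Node(children=[Node("C"), Node(children=[Node("D"), Node("E")])]),
])
toy_states = {"A": True, "B": True, "C": True, "D": True, "E": False}

print(parsimony_cost(toy_tree, toy_states, gain_cost=1.0, loss_cost=1.0))
print(parsimony_cost(toy_tree, toy_states, gain_cost=1.0, loss_cost=4.0))
```

With equal costs, the cheapest history is a single loss on the branch to E. When a loss costs four times as much as a gain, three independent gains become cheaper than that one loss, which is why inferred event counts can depend on the chosen ratio and why comparing several ratios is a useful check.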

Results

To identify the phylogenetic clustering of functional traits in microorganisms, we analyzed the phylogenetic distribution of 89 functional traits across the Bacteria and Archaea using both genotypic and phenotypic data. Both data sets contained a high proportion of Proteobacteria, Firmicutes, Actinobacteria and Bacteroidetes, but the genome data set in particular included many phyla (Supplementary Figure S1). We first tested if any traits were randomly distributed, potentially due to gene loss, lateral gene transfer or convergent evolution. We found that 93% (83 out of 89) of the traits were non-randomly distributed, as predicted by consenTRAIT (P<0.05) and the phylogenetic dispersion of discrete traits metric (P(D)random<0.5) (Figure 2).

Figure 2

τD of functional traits. τD is estimated using consenTRAIT as the average 16S rRNA sequence distance between members of a clade where at least 90% of the strains carry a trait and the root node of this clade. (a) Traits identified based on genomic subsystems. (b) Phenotypic traits identified based on the ability to use specific organic carbon substrates. Black dots denote non-random phylogenetic distribution based on either consenTRAIT τD (P<0.05) or the phylogenetic dispersion test for discrete traits, D (P(D)random<0.5) (Fritz and Purvis, 2010). The box plot represents the values from 100 bootstrap trees, where the box includes values from the 25–75 percentile, the bars include the 5–95 percentile and the line represents the median.

Given that most of the traits were non-randomly distributed, we next used consenTRAIT to identify if τD varied among the traits. We found that τD ranged from 0.0029 to 0.11 rRNA sequence distance (Figure 2, Supplementary Tables S1 and S2). As expected, oxygenic photosynthesis was the trait with the highest τD (median 0.11, mean 0.10) and only occurred in one clade, the phylum Cyanobacteria (Figure 3a). Methanogenesis was also found in only a few, deep clades. Sulfate reduction was present in 20 clusters with a τD=0.039 16S rRNA distance.

Figure 3

Phylogenetic distribution of functional traits (red lines) in Prokaryotes. (a) Phylogenetic distribution of the traits oxygenic photosynthesis, nitrogen fixation and utilization of melibiose based on annotated genomic subsystems in the SEED database. (b) Phylogenetic distribution of the traits raffinose, citrate, and serine utilization based on observed growth in Biolog substrate utilization plates. The phylogenetic trees are based on a 16S rRNA alignment from the Silva database and estimated in Phylip using a distance based matrix (F84 correction), neighbor-joining, and 100 bootstraps (Felsenstein, 2006).

These deeply clustered traits were the exceptions, however. The vast majority of traits examined were found in clusters of a depth less than 0.02 rRNA sequence distance. For example, nitrogen fixation was found in 54 clusters (Figure 3a) with a more intermediate τD of 0.018 16S rRNA distance. At the finest scale, the analysis of both the genomic and phenotypic data sets revealed that the ability to grow on a particular carbon source was generally seen in shallow clusters (Figures 2 and 3). In the phenotypic data set, growth on each of the 70 organic carbon sources was associated with a τD<0.011 16S rRNA distance (Figure 2b and Supplementary Table S2). The genomic data largely supported this view, as we observed the potential for growth on several carbon sources in clades with τD<0.017 16S rRNA distance (Figure 2a). We also found many traits located in clusters of one, in which no neighbors shared the trait. This finding suggests that we have still significantly undersampled microbial genomic and phenotypic diversity (assuming at least some clonal replication of the genome).

Next, we tested if trait complexity was correlated with the phylogenetic dispersion and depth of microbial traits. For the genomic data set, the number of genes underlying a trait was used as an indicator of trait biochemical complexity. We observed a significant relationship between trait complexity and phylogenetic dispersion (Figure 4a). Complex traits encompassing many genes like oxygenic photosynthesis and methanogenesis were extremely clumped (D<<0) (Figure 4a and Supplementary Table S2). Other traits including nitrogen and CO2 fixation, anoxygenic photosynthesis and sulfate reduction displayed a clumped distribution consistent with a Brownian motion model of evolution (D≈0). Finally, traits like carbon substrate utilization were dispersed in a mode between a Brownian motion and a random model (0<D<1), suggesting that the ability to grow on different carbon compounds is quite dispersed (Dmean=0.48, N=71, Supplementary Table S2). Further, we observed that trait complexity was significantly correlated to τD for both the Pearson and Spearman correlation tests (Figure 4b). Thus, traits encoded by many genes were shared among deeper clades than traits encoded by fewer genes. The only clear exception to this trend was aerobic methane oxidation, which was associated with a few deep clades despite requiring only three genes.

Figure 4

Role of complexity on phylogenetic dispersion and τD. (a) Relationship between the number of genes underlying traits and phylogenetic dispersal (Fritz and Purvis, 2010). (b) Relationship between number of genes underlying traits and τD. The correlation coefficients were based on Spearman correlation, but Pearson correlations were also significant (P<0.05).

For the phenotypic carbon substrate utilization data set, we used the Bertz molecular complexity index of the carbon substrate as a measure of trait complexity (Bertz, 1981). The Bertz complexity index takes into account not only the compound size but also the bond types and overall structure. We found that molecular complexity was significantly correlated with phylogenetic dispersion (R2=0.07, P0.03)—albeit with a low degree of variance explained. In contrast, we did not find a significant relationship between τD and complexity of substrate (Supplementary Table S2), likely due to low variability in τD (all τD<0.011 16S rRNA distance).

As one should expect, the overall variation in dispersion and τD among all traits was significantly driven by the number of trait gain and loss events as estimated using a maximum parsimony approach (P<0.0001, Pearson correlation). In contrast, when we measured the rate of trait changes (that is, the character evolutionary rate) across the tree with a Bayesian approach, we did not observe any correlation.

We also evaluated the influence of the phylogenetic technique on both Purvis and Fritz’s D and consenTRAIT by comparing the values based on a DNA distance tree and a maximum likelihood tree. The values were highly correlated for the two phylogenetic techniques (R2>0.9, P<1 × 10^−20) and both supported a correlation between trait complexity and either phylogenetic dispersion or τD.

Discussion

In this study, we aimed to identify the distribution and degree of phylogenetic clustering of microbial traits. To address this, we developed a new algorithm, consenTRAIT. In contrast to past metrics testing for phylogenetic conservation of traits, the aim of consenTRAIT is to estimate the phylogenetic depth of clades where members share a trait. Furthermore, it can identify the actual sequence divergence defining ecologically coherent units (at least with respect to one trait) rather than the taxonomic level associated with a trait. The latter has been done in some recent studies of plants due to the lack of a detailed phylogenetic tree (Prinzing et al., 2001; Swenson and Enquist, 2009). For microorganisms, however, there is an abundance of sequence data and organisms are mostly classified based on phylogeny, which can be used directly for the consenTRAIT analysis. This is important as it allows for a direct comparison with phylogenetic trees and can guide the threshold for defining operational taxonomic units (OTUs). Instead of requiring that all members in a clade share a trait, we decided to use a 90% membership threshold to account for unusual genome configurations, for example, a cyanobacterium losing the ability to perform oxygenic photosynthesis (Zehr et al., 2008). It is worth noting that increasing the threshold for membership of a clade will commonly result in smaller τD values, whereas a decrease will have the opposite effect.

We have deliberately not discussed the average membership number in each clade sharing a trait as we expect this number to be directly linked to sampling effort. In other words, if a group of very closely related organisms has been sequenced or typed with Biolog, then we should expect to find a high number of members in a clade sharing a trait. An example of this could be E. coli. On the other hand, in phyla with few analyzed strains, we will likely find clades of one with a specific trait.

We find that more than 93% of the traits are significantly non-randomly distributed (Figure 2), suggesting that even in the presence of many well-documented examples of lateral gene transfer, vertical inheritance is generally important for the distribution of functional traits in microorganisms. The traits considered are distributed in phylogenetic clusters with a continuum of depths. Thus, diversity at all phylogenetic levels is relevant for the functioning of a microbial community. Complex functions like photosynthesis, methanogenesis, and sulfate reduction are more phylogenetically conserved and found in a few deep clades. Previous studies found that habitat association, pH and soil moisture optimum are also associated with deep clades (Kowalchuk and Stephen, 2001; Fierer et al., 2007; Philippot et al., 2010; Lennon et al., 2012; Placella et al., 2012). Thus, there is clear evidence for specific traits associated with deeper clades.

In contrast, most traits we examine arose or disappeared more recently, either by mutation or by gene gain and loss, and are shared among a few closely related organisms. This includes finding many traits in individual lineages with no sister taxa sharing the trait. This suggests that we have still undersampled microbial diversity, and that many traits might be associated with even finer-scale diversity. Thus, our estimate for τD should be considered an upper bound for many functions. In particular, traits associated with the assimilation of small organic carbon sources are highly phylogenetically dispersed. This result is consistent with observations of wide variation among the strains of E. coli, Acinetobacter, Vibrio, etc., in their ability to use particular carbon substrates (Sarma et al., 2004; Keymer et al., 2007; Vieira et al., 2011). Previous studies have identified other phylogenetically dispersed bacterial traits like nutrient acquisition (Martiny et al., 2006) and particle colonization (Hunt et al., 2008). Here, these traits are also associated with clusters of a small τD.

As one would expect, τD of each trait is significantly correlated to phylogenetic dispersion and the number of gene gains and losses. Thus, the consenTRAIT metric provides a similar ranking of phylogenetic conservatism as other approaches. However, consenTRAIT also identifies the phylogenetic level where one should expect to find ecologically coherent clusters. Thus, our approach and results have significant implications for how to analyze and interpret microbial diversity patterns. To understand which biotic or abiotic factors control the distribution and functioning of microbial diversity, an OTU is usually defined arbitrarily. Defining an OTU can be done either explicitly by grouping lineages using a sequence similarity of 97% to cluster groups or implicitly by linking an OTU to a peak with techniques like ARISA (automated rRNA intergenic spacer analysis), T-RFLP (terminal-restriction fragment length polymorphisms), or phospholipid fatty acid profiles. Thus, most techniques rely on a fixed definition of an OTU, whereas consenTRAIT identifies the average clade depth where organisms share a trait. It therefore enables one to evaluate if a microbial taxon is ‘defined’ at a sequence similarity level that is meaningful in comparison with the phylogenetic dispersion of relevant traits. This may not always be the phylogenetic level at which standard techniques define taxa. It is, however, important to recognize that the consenTRAIT metric estimates the phylogenetic distance between the tips and the last common ancestor of a clade with a given trait, whereas clustering algorithms like MOTHUR are based on the inter-lineage distance (Schloss et al., 2009). For example, if a clade only contains two members with 3% sequence dissimilarity, the average distance between each tip and the clade root node would be 0.015. Thus, a trait with τD=0.015 will be expected to show most variation among OTUs defined by 97% sequence similarity.
In our study, we identify many traits associated with τD<0.015, including the utilization of many small carbon molecules, whereas functions related to photosynthesis or central functions in the nitrogen or sulfur cycle are found in clusters above this threshold. Thus, OTUs will contain a mixture of lineages with and without most small carbon molecule usage traits if defined using a 97% 16S rRNA sequence similarity cutoff. As a result, studies that aim to investigate how microbial diversity varies along gradients of these kinds of carbon substrates should only expect to observe patterns using a much finer genetic resolution. Similarly, traits associated with broad clades may not come out as significant if taxa are defined at a much finer level. Our study suggests that diversity at a range of phylogenetic levels can be relevant for the functioning of the community, and different functions are related to different phylogenetic cluster sizes. Thus, broad changes in the taxonomic composition of a community might influence the functioning of traits associated with deep clusters, but are unlikely to have a big effect on traits associated with shallow microdiverse clusters, and vice versa. To account for this, we need a flexible classification of microbial taxa to understand the distribution and functioning of the full spectrum of phylogenetic diversity.
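The conversion used in the two-member clade example, from a pairwise OTU cutoff to a tip-to-root depth that can be compared with τD, is simple enough to state as a short sketch. It assumes, as in that example, that the root sits midway between two lineages, so the tip-to-root depth is half the pairwise dissimilarity. The function names are hypothetical. The τD values are those reported in the Results, with the upper bound of 0.011 standing in for the phenotypic carbon substrate traits.

```python
def tip_to_root_depth(pairwise_dissimilarity):
    """Tip-to-root distance for two lineages equidistant from their common ancestor."""
    return pairwise_dissimilarity / 2.0


def deeper_than_otu(tau_d, otu_dissimilarity=0.03):
    """True if trait clades are, on average, at least as deep as the OTU cutoff implies."""
    return tau_d >= tip_to_root_depth(otu_dissimilarity)


# tau_D values as reported in the Results (16S rRNA distance).
reported_tau_d = {
    "oxygenic photosynthesis": 0.11,
    "sulfate reduction": 0.039,
    "nitrogen fixation": 0.018,
    "carbon substrate use (phenotypic, upper bound)": 0.011,
}

for name, value in reported_tau_d.items():
    status = "at or above" if deeper_than_otu(value) else "below"
    print(f"{name}: tau_D={value} is {status} the 97% OTU depth")
```

Under this assumption, a 97% similarity cutoff corresponds to a depth of 0.015, so the first three traits sit at or above OTU depth while the carbon substrate traits fall below it and would vary within OTUs.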

We hypothesized that the genetic complexity of a trait could influence its dispersion and depth. We find support for this relationship in the genomic data set examined, even though many factors beyond complexity likely influence the phylogenetic distribution of traits (for example, selective pressures and interactions between traits). The association of specific functional traits with different phylogenetic levels has previously been described in microorganisms. For example, the marine cyanobacterium Prochlorococcus can broadly be divided into two groups based on their ability to grow under high and low light levels (Moore et al., 1998). Light adaptation is likely a complex trait as it involves many interacting proteins. The high-light adapted clade can be further divided into a low and high iron group (Rusch et al., 2010), and the high iron group can be subdivided into high and low temperature adapted cells (Johnson et al., 2006). Finally, nitrate and phosphate assimilation genes are associated with shallow clades (Martiny et al., 2009a, 2009b). Nutrient uptake capabilities are possibly simpler traits as they involve clusters of a few genes. Thus for Prochlorococcus, specific traits are clearly dispersed differently phylogenetically, and this variation might be linked to differences in the complexity of the trait. The variation is not exclusive to microorganisms. Prinzing and co-workers found that among plants in Europe, tolerance of extremes in soil moisture was associated with higher taxonomic levels than tolerance of extremes in temperature or light (Prinzing et al., 2001). Similarly, tree height and seed mass are linked to broader phylogenetic groups in comparison with other functional traits in trees (Swenson and Enquist, 2009).

The linkage between traits and complexity offers a framework for predicting the phylogenetic distribution of specific traits as well as the resilience and robustness of specific microbe-mediated ecosystem processes to environmental changes. If the traits are distributed among few phylogenetic clades, shifts in community composition may strongly alter the associated ecosystem process (Schimel et al., 2005). In contrast, communities may be resilient to changes if a trait is distributed among many groups. However, there are potentially millions of microbial taxa (Torsvik et al., 1990) and a multitude of important traits, so it is likely impossible to map all relevant traits to all phylogenetic lineages of microorganisms. Our findings offer a way to begin to estimate how broadly or narrowly a trait is distributed phylogenetically based on the genetic and biochemical complexity of the function. This can help to predict how changes in microbial biodiversity may affect ecosystem functioning and how resilient a community is to a particular perturbation. Considering the important role microorganisms have in many ecosystem processes, this understanding is critical for predicting future changes in global biogeochemical cycles.