Introduction

The introduction of next-generation sequencing methods to microbial ecology has enabled high-throughput assessment of complex microbial communities. This is achieved by sequencing either PCR-amplified marker genes (amplicon sequencing) (Huse et al., 2008) or genomic DNA from environmental samples (metagenomics) (Tyson et al., 2004; Venter et al., 2004). These approaches have changed how the microbial biosphere is viewed and enabled novel insights to be gained into the composition and function of diverse microbial assemblages in habitats ranging from the deep sea to the human gut (Eckburg et al., 2005; Sogin et al., 2006). However, a limitation to effectively utilizing these vast data sets is their distribution among disparate sequence repositories, including GenBank/EMBL/DDBJ, IMG/M, CAMERA and VAMPS (Wheeler et al., 2008; Sun et al., 2011; Markowitz et al., 2012; http://vamps.mbl.edu/). This lack of consolidation hampers exploration of the total available sequence information.

Chlamydiae are an assemblage of bacteria that depend on eukaryotic host cells for their reproduction. Evidence to date indicates the phylum is represented by members that are all obligate intracellular bacteria with a unique developmental life cycle. Their best known representatives are the human pathogens Chlamydia trachomatis and Chlamydia pneumoniae, which cause trachoma and sexually transmitted diseases, and pneumonia, respectively (Bebear and de Barbeyrac, 2009; Burillo and Bouza, 2010; Hu et al., 2010). Although these medically important chlamydiae were described in 1907 (Halberstädter and Prowazek, 1907), the phylum was only represented by the single genus Chlamydia until 1995. The limited perception of chlamydial diversity gradually changed with the identification of environmental chlamydiae including Simkania negevensis (Kahane et al., 1995), Waddlia chondrophila (Rurangirwa et al., 1999) and amoeba-associated chlamydiae like Parachlamydia acanthamoebae, Protochlamydia amoebophila (Fritsche et al., 1993; Amann et al., 1997; Collingro et al., 2005b) and Criblamydia sequanensis (Thomas et al., 2006).

Analysis of these environmental chlamydiae helped to better understand the evolution of Chlamydiae as a whole (Horn, 2008). It was learned that the intracellular lifestyle of chlamydiae dates back to an ancient association with early unicellular eukaryotes in the Precambrian, hundreds of millions of years ago (Greub and Raoult, 2004; Horn, 2008; Kamneva et al., 2012). This ancient intracellular lifestyle specialization might have contributed to the evolution of plants by facilitating the establishment of primary plastids (Brinkman et al., 2002; Huang and Gogarten, 2007). In addition, several mechanisms for host interaction developed in these early associations are still used by extant chlamydial pathogens and symbionts (Hueck, 1998). Protists have thus been suggested to have provided an ‘evolutionary training ground’ for contemporary intracellular bacteria (Molmeret et al., 2005). There is evidence that some environmental chlamydiae are associated with disease in humans and animals, and their impact on public health is a matter of discussion (Corsaro and Greub, 2006; Lamoth et al., 2011).

The inability to cultivate chlamydiae outside eukaryotic host cells has hampered the characterization of novel chlamydiae. Co-cultivation with amoebae has been used with some success to facilitate retrieval of chlamydiae directly from environmental samples, but differences in host specificity limit the applicability of this approach (Collingro et al., 2005a; Corsaro and Venditti, 2009; Corsaro et al., 2010). Chlamydiae have also been largely missed in traditional 16S rRNA gene-based diversity surveys based on clone libraries, mainly because of their low abundance compared with free-living bacteria, but also because many general bacterial primers used in these studies have mismatches to known chlamydial 16S rRNA genes. Thus, only the application of primer sets specifically targeting members of the Chlamydiae enabled the identification of additional lineages within this phylum (Horn and Wagner, 2001). Such studies showed that chlamydiae are not only more diverse than originally thought, but are also present in a variety of environments (Horn, 2008; Corsaro and Venditti, 2009; Corsaro et al., 2010). To date, the phylum Chlamydiae has nine described families that range in size (Kuo and Stephens, 2008) from the well represented Chlamydiaceae and Parachlamydiaceae to the less represented Rhabdochlamydiaceae (Corsaro and Venditti, 2009), Criblamydiaceae (Corsaro et al., 2009), Simkaniaceae (Everett et al., 1999) and Waddliaceae (Rurangirwa et al., 1999). The families with the fewest representatives (a single species each) are Clavichlamydiaceae (Karlsen et al., 2008), Piscichlamydiaceae (Draghi et al., 2004) and the recently discovered Parilichlamydiaceae (Stride et al., 2013).

In this study, we introduce an approach to combine all existing metagenomic and amplicon sequence data to assess the microbial diversity and ecology of the Chlamydiae. To achieve this, we collected all chlamydia-like protein and 16S rRNA gene sequences from publicly available sequence databases by using similarity-based searches, filtering steps and large-scale phylogenetic analyses. Our study revealed the existence of an enormous, hidden, family-level diversity of Chlamydiae, particularly in marine habitats, and provided insights into the genomic diversity of the different families. Our approach is applicable to other microbial taxa; it demonstrates a useful computational strategy to explore taxonomic and genomic diversity and ecology of microbes that exist in available metagenomic sequence space.

Materials and methods

Identification and analysis of putative chlamydial proteins in metagenomic data

The database SIMAP (Rattei et al., 2010) integrates data from multiple major public repositories of metagenomic sequences, such as IMG/M (Markowitz et al., 2012), CAMERA (Sun et al., 2011) and the whole-genome shotgun section of NCBI GenBank (Wheeler et al., 2008). SIMAP consistently annotates all potential protein-coding sequences of these metagenomes and currently contains about 45 million non-redundant metagenomic proteins. Metagenomic proteins in SIMAP with significant similarity to known chlamydial proteins (E-value <10^-20, alignment coverage >50% for both query and subject) were extracted, and phylogenetic trees were calculated with their closest homologs using PhyloGenie (Frickey and Lupas, 2004) and a maximum likelihood method (RAxML; Stamatakis, 2006). Phylogenetic trees were then filtered with the PhyloGenie tool PHAT for well-supported (bootstrap values >70%) monophyletic chlamydial clades containing metagenomic proteins. Only metagenomic proteins from well-supported clades were considered to be of putative chlamydial origin, and information on their closest phylogenetic relatives and their environmental origin was extracted for further analysis (Figure 1). A complete description of the method is provided in the Supplementary information (Supplementary methods).
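The similarity thresholds described above can be illustrated with a minimal Python sketch. It is not the SIMAP or PhyloGenie implementation; the `Hit` record, its field names and the use of a single aligned length for both sequences are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one similarity-search hit (not a SIMAP type)."""
    query_id: str      # metagenomic protein
    subject_id: str    # known chlamydial protein
    evalue: float
    aln_len: int       # aligned residues; assumed equal on query and subject
    query_len: int
    subject_len: int


def passes_filter(hit, max_evalue=1e-20, min_cov=0.5):
    """Keep a hit only if E-value < 1e-20 and coverage > 50% on both sequences."""
    query_cov = hit.aln_len / hit.query_len
    subject_cov = hit.aln_len / hit.subject_len
    return hit.evalue < max_evalue and query_cov > min_cov and subject_cov > min_cov


def filter_hits(hits, **thresholds):
    """Return the hits that satisfy all thresholds, preserving input order."""
    return [h for h in hits if passes_filter(h, **thresholds)]


hits = [
    Hit("meta_1", "chlam_A", 1e-35, 180, 200, 250),  # passes all three criteria
    Hit("meta_2", "chlam_B", 1e-10, 180, 200, 250),  # E-value not significant enough
    Hit("meta_3", "chlam_C", 1e-40, 90, 200, 120),   # query coverage only 45%
]
kept = filter_hits(hits)
print([h.query_id for h in kept])
```

Requiring coverage on both query and subject is what discards short, fragmentary metagenomic proteins that match only part of a full-length reference.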

Figure 1

Flow chart illustrating the main steps in the analysis of metagenomic and amplicon sequence data for inferring diversity and ecology of defined microbial taxa. In this study, this approach was used for investigating the phylum Chlamydiae. A detailed description of each step is provided in the Supplementary information.

Identification and analysis of chlamydial 16S rRNA gene sequences

NCBI (Wheeler et al., 2008), CAMERA (Sun et al., 2011) and IMG/M (Markowitz et al., 2012) were searched with megablast using a representative chlamydial 16S rRNA gene sequence as reference (Simkania negevensis, NR_029194). All sequences with similarity greater than 60% to the reference sequence were collected. In addition, all amplicon 16S rRNA gene sequences obtained using the 454 Titanium technology were retrieved from VAMPS and SRA (Kodama et al., 2012). Redundant (identical), low-quality (>0.4% ambiguous sites (N)) and short (<300 nucleotides) sequences were removed from the combined data set, and the remaining sequences were taxonomically classified using the RDP classifier (Wang et al., 2007) (Figure 1). Sequences recognized as members of the phylum Chlamydiae with confidence above 80% were then aligned using the SINA aligner (Pruesse et al., 2012). The final data set also included 12 16S rRNA gene sequences obtained in this study by PCR analysis of a water sample from Ace Lake in Antarctica (Supplementary Methods).
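The three quality filters named above (redundancy, ambiguous bases, minimum length) can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used in the study: the function name, the `(id, sequence)` input format and the order in which the filters are applied are all hypothetical.

```python
def clean_sequences(records, max_n_fraction=0.004, min_length=300):
    """Drop short, low-quality and duplicate sequences.

    records: iterable of (sequence_id, sequence) pairs (assumed input format).
    Sequences shorter than min_length, with more than max_n_fraction
    ambiguous bases (N), or identical to an earlier sequence are removed.
    """
    seen = set()
    kept = []
    for seq_id, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short (<300 nucleotides)
        if seq.count("N") / len(seq) > max_n_fraction:
            continue  # too many ambiguous sites (>0.4%)
        if seq in seen:
            continue  # exact duplicate of a sequence already kept
        seen.add(seq)
        kept.append((seq_id, seq))
    return kept


records = [
    ("a", "ACGT" * 100),              # 400 nt, clean: kept
    ("b", "ACGT" * 100),              # identical to "a": removed
    ("c", "ACGT" * 50),               # 200 nt: removed as too short
    ("d", "N" * 5 + "ACGT" * 100),    # 5/405 ambiguous (about 1.2%): removed
    ("e", "N" + "ACGT" * 100),        # 1/401 ambiguous (about 0.25%): kept
]
cleaned = clean_sequences(records)
print([seq_id for seq_id, _ in cleaned])
```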

Two types of analyses were carried out with the aligned 16S rRNA gene sequences (Figure 1). First, near full-length sequences (>1100 nucleotides) were selected, and their phylogenetic relationships were reconstructed using MrBayes (Huelsenbeck and Ronquist, 2001). The obtained reference tree was visualized with iTOL (Letunic and Bork, 2007). Second, the multiple sequence alignment containing all sequences was trimmed around the region with the highest coverage. The sequences were again filtered for length and alignment quality and then used for the calculation of Operational Taxonomic Units (OTUs) using MOTHUR (Schloss et al., 2009) and ESPRIT (Sun et al., 2009). OTUs were classified according to the environmental origin of the sequences they include. Size, ecological classification and relative distance between OTUs were visualized in an NMDS (non-metric multidimensional scaling) plot using R. A more detailed description of the method is provided in the Supplementary information (Supplementary Methods).

Results

Chlamydial proteins in metagenomic sequence data

To explore the diversity of putative chlamydial proteins in metagenomic sequence data, we conducted a comprehensive similarity-based search coupled to extensive phylogenetic analysis. A total of 31 279 proteins from various metagenomes contained in the SIMAP database (Rattei et al., 2010) were identified to be most similar to known chlamydial homologs, representing 0.12% of the total metagenomic proteins included in these metagenomes (25 847 409 non-redundant proteins). After applying conservative alignment length and E-value filters, 5525 putative chlamydial protein sequences remained. This reduction was mainly due to the high number of short, incomplete protein sequences typically obtained in metagenomic studies. Phylogenetic analyses of those sequences further reduced this number to 1931 proteins that clustered monophyletically with known chlamydial homologs with significant bootstrap support (>70%). These proteins formed 1012 homologous groups with an average of two proteins per group. This indicates a shallow representation, that is, a low coverage, of putative chlamydial homologs in the current extent of metagenomic sequence data.
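The proportions quoted in this paragraph can be rechecked directly from the reported counts. The short Python sketch below only restates numbers given in the text; the variable names are introduced here for illustration.

```python
# Counts as reported in the text above.
total_metagenomic_proteins = 25_847_409
most_similar_to_chlamydial = 31_279
after_length_and_evalue_filters = 5_525
monophyletic_with_support = 1_931
homologous_groups = 1_012

# Share of all metagenomic proteins that were most similar to chlamydial homologs.
initial_share_percent = 100 * most_similar_to_chlamydial / total_metagenomic_proteins

# Average number of proteins per homologous group (the "shallow representation").
proteins_per_group = monophyletic_with_support / homologous_groups

# Fraction of the initial candidates that survived all filtering steps.
retained_fraction = monophyletic_with_support / most_similar_to_chlamydial

print(f"{initial_share_percent:.2f}% of proteins were initial candidates")
print(f"{proteins_per_group:.1f} proteins per homologous group")
print(f"{100 * retained_fraction:.1f}% of candidates retained after phylogenetic filtering")
```

The retained fraction of roughly 6% shows how strongly the phylogenetic criterion narrows the similarity-based candidate set.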

For 392 putative chlamydial metagenomic proteins, only chlamydial homologs were detected. These proteins were classified as ‘Chlamydiae specific’. If other proteins exist with lower similarity than the criteria we used, they would have been excluded from our analysis. Within the complete set of putative chlamydial metagenomic proteins, we searched for homologs to known proteins that have been associated with host interaction and virulence of Chlamydiae (Collingro et al., 2011). This search resulted in 76 metagenomic proteins that group with 29 virulence-associated proteins. Interestingly, at least one metagenomic protein was identified for each of the known virulence-associated proteins. Homologs of the plasmid-encoded protein pGP6 and the type III secretion system chaperone SctG were most frequently detected, with 9 and 7 metagenomic proteins, respectively (Supplementary Table S1).

Based on the closest neighbor in the phylogenetic trees, the majority of the putative chlamydial metagenomic proteins were most closely related to known proteins from members of the Simkaniaceae and Parachlamydiaceae, a trend that was also observed for the subset of ‘Chlamydiae specific’ proteins (Figure 2). Noticeably, most putative chlamydial metagenomic proteins originated from marine samples (86%; Figure 2). Even considering that 60% of the total number of metagenomic proteins included in the analysis was of marine origin, this still indicates an overrepresentation of putative chlamydial proteins in those samples.

Figure 2

Ecological and taxonomic classification of putative chlamydial proteins in metagenomic sequence data. Proteins were classified based on their respective closest neighbor in maximum likelihood trees. Environmental origins grouped in four general categories are color coded. ‘All’ refers to all putative chlamydial proteins; ‘specific’ refers to the subgroup of proteins with exclusively known chlamydial homologs, ‘virulence’ includes all metagenomic proteins with homology to known chlamydial virulence-associated proteins. The number of proteins in each group is indicated in parenthesis. Most of the detected putative chlamydial metagenomic proteins originated from marine environments and are most similar to Simkaniaceae or Parachlamydiaceae homologs.

Identification of chlamydial 16S rRNA genes

To identify chlamydial 16S rRNA genes from amplicon and metagenomic studies, we integrated data from different sequence databases including VAMPS, SRA, NCBI, CAMERA and IMG/M (Wheeler et al., 2008; Sun et al., 2011; Kodama et al., 2012; Markowitz et al., 2012). A similarity-based search using relaxed criteria and subsequent taxonomic classification of the 16S rRNA gene sequences using the RDP classifier (Wang et al., 2007) resulted in a set of 22 070 unique chlamydia-like 16S rRNA gene sequences with an average length of 471 nucleotides (Supplementary Table S2). Compared with the NCBI nt database alone, which is generally used to collect rRNA gene sequences for phylogenetic analysis, the inclusion of metagenomic-derived data from NCBI env, CAMERA and IMG/M more than doubled the number of chlamydial 16S rRNA gene sequences. However, despite this doubling of sequences, the vast majority (95%) of all recovered sequences originated from amplicon data sets in VAMPS and SRA (Supplementary Table S2).

A phylogenetic framework for the phylum Chlamydiae

To construct a robust phylogenetic framework for members of the Chlamydiae, we extracted all near full-length non-chimeric 16S rRNA gene sequences with at least 1100 nucleotides (n=271) and used these for tree calculation (Figure 3). This sequence set was also used for estimation of family-level OTUs by applying a 10% distance cutoff, as proposed for the phylum Chlamydiae (Everett et al., 1999). For clustering of the sequences into OTUs, two methods were used: ESPRIT (Sun et al., 2009) and MOTHUR (Schloss et al., 2009), which determine sequence similarity using pairwise alignments and multiple sequence alignments, respectively. The numbers of OTUs obtained with the two approaches differed. Although MOTHUR predicted 40 family-level OTUs, ESPRIT was more conservative and estimated 28 OTUs (Supplementary Table S3). Both tree topology and known chlamydial families were best represented by the grouping of sequences using ESPRIT (Supplementary Table S4, Figure 3). The only incongruence was observed for the Criblamydiaceae and the Parachlamydiaceae, which formed independent groups at a 9% distance cutoff but grouped together at 10%. In contrast, MOTHUR split the Rhabdochlamydiaceae into four separate groups and the Parachlamydiaceae into two. We thus used the more conservative approach of ESPRIT for assigning yet undescribed family-level OTUs as ‘Predicted Chlamydial Families’ (PCF) in the phylogenetic tree (Figure 3, Supplementary Table S4). In summary, our analysis of full-length 16S rRNA gene sequences from various databases showed that the total number of families in the Chlamydiae is two times higher (n=17) than described before, or more than three times higher (n=28) if singletons are considered (Figure 3, Supplementary Table S4).
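To make the effect of the distance cutoff concrete, the following Python sketch clusters aligned toy sequences into OTUs. It is a simplified stand-in, not the algorithm implemented in ESPRIT or MOTHUR: the uncorrected p-distance, the complete-linkage merging rule and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatches over aligned positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no shared positions: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def cluster_otus(seqs, cutoff=0.10):
    """Complete-linkage clustering: merge while the largest pairwise distance
    between two clusters does not exceed the cutoff."""
    names = list(seqs)
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return max(dist[frozenset((i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Toy alignment of 20 positions; s3 differs from s1 at exactly 10% of sites.
seqs = {
    "s1": "A" * 20,
    "s2": "C" + "A" * 19,
    "s3": "CC" + "A" * 18,
    "s4": "G" * 10 + "A" * 10,
}
print(cluster_otus(seqs, cutoff=0.10))
print(cluster_otus(seqs, cutoff=0.09))
```

In this toy data set, s3 joins the OTU of s1 and s2 at a 10% cutoff but stays separate at 9%, which mirrors how two groups can be resolved as independent at one threshold and merged at the next.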

Figure 3

Relationships of described and predicted families in the phylum Chlamydiae based on near full-length 16S rRNA gene sequences (>1100 nt). The phylogenetic tree was calculated using Bayesian inference (MrBayes; (Huelsenbeck and Ronquist, 2001)). Branches with a posterior probability lower than 0.50 were collapsed. Those with posterior probability values between 0.50 and 0.70 are indicated with red color. The monophyly of all chlamydial families is well supported (>0.90); family level OTUs obtained by sequence similarity-based clustering with ESPRIT (Sun et al., 2009) and including only yet undescribed sequences are labeled as PCF. Details for the sequences included in tree calculation and clustering are available as Supplementary Table S3. Bar, 0.1 expected substitutions per site.

The monophyly of all known chlamydial families is statistically well supported in the 16S rRNA gene-based phylogenetic tree (>0.90 posterior probability), but branching order is only partially resolved (Figure 3). Nevertheless, a phylogenetic relationship between Parachlamydiaceae, Criblamydiaceae and Waddliaceae together with PCF3, PCF5, PCF7 and PCF9 is well supported (0.97 posterior probability). Likewise, the families Simkaniaceae, Rhabdochlamydiaceae and the putative family PCF8 form a well-supported clade (0.94 posterior probability). In addition, the previously described relationships of Clavichlamydiaceae with Chlamydiaceae (Horn, 2008) and Piscichlamydiaceae with Parilichlamydiaceae (Stride et al., 2013) were recovered in the tree topology. We noted that three PCFs (PCF1, PCF4 and PCF2) consisted of sequences originating from a single environmental source, the marine-derived Lagoon Paola in Italy (Pizzetti et al., 2012).

Evidence for a vast diversity of Chlamydiae

The near full-length 16S rRNA gene sequences provide a robust framework for inferring phylogenetic relationships and diversity within the Chlamydiae, yet they represent only a minor fraction (1%) of all collected chlamydial 16S rRNA gene sequences. Although the majority of sequences in our data set are too short for robust phylogenetic analysis, they can be used to estimate the diversity of the phylum Chlamydiae using sequence similarity-based clustering into OTUs (Kim et al., 2011).

The meta-analysis of short 16S rRNA gene sequences derived from amplicon-based diversity surveys is complicated by the fact that not all studies target the same regions of the 16S rRNA gene. We therefore performed a multiple sequence alignment of all 22 070 sequences collected from diverse sources, in order to identify the region with the highest coverage. Plotting these data showed that the variable region from V4 to V6 was best represented in our data (Supplementary Figure S1). We then determined whether this 450-nucleotide region was a good proxy for the full-length 16S rRNA gene in similarity-based OTU calculations for the phylum Chlamydiae. To evaluate this, the number of OTUs obtained with the full-length sequences was compared with the number of OTUs obtained with the same sequences after they were trimmed to V4 to V6. This analysis showed that the numbers of OTUs obtained with the full-length and trimmed data sets were comparable across the taxonomic levels that were resolved (Supplementary Table S3), indicating that the V4 to V6 region can be used for obtaining reasonably stringent and conservative predictions of chlamydial diversity. This is consistent with a previous study that found that the V4 to V6 region slightly underestimated diversity, predicting around 10% fewer OTUs compared with the full-length 16S rRNA gene for all similarity levels tested (Kim et al., 2011).
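Identifying the best-covered alignment region can be sketched in Python as a per-column coverage count followed by a sliding-window search. The study located the region by plotting coverage; the automated window search, the function names and the toy alignment below are assumptions for illustration only.

```python
def column_coverage(alignment):
    """Number of sequences with a base (not a gap character) in each column."""
    length = len(alignment[0])
    return [sum(seq[i] not in "-." for seq in alignment) for i in range(length)]


def best_window(coverage, width):
    """Return (start, end) of the window of the given width with the highest
    summed coverage; end is exclusive. Ties keep the leftmost window."""
    best_start = 0
    best_sum = current = sum(coverage[:width])
    for start in range(1, len(coverage) - width + 1):
        # Slide by one column: add the new right column, drop the old left one.
        current += coverage[start + width - 1] - coverage[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_start + width


# Toy alignment: four partial reads covering different parts of ten columns.
alignment = [
    "--ACGT----",
    "---CGTAC--",
    "ACGCGT----",
    "---CGTA---",
]
coverage = column_coverage(alignment)
start, end = best_window(coverage, width=3)
trimmed = [seq[start:end] for seq in alignment]
print(coverage, (start, end), trimmed)
```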

After trimming and additional quality filtering, 14 311 partial 16S rRNA gene sequences remained in our data set. Removal of redundant sequences further reduced this data set to 12 636 sequences, which represented the final sequence collection used for OTU calculations. Clustering into OTUs using sequence similarity thresholds corresponding to different taxonomic levels in the phylum Chlamydiae (Everett et al., 1999) showed that existing public metagenomic sequence data contained an as yet undescribed, high level of diversity within the phylum Chlamydiae (Table 1). More than 2000 OTUs were present at the species level, representing more than 250 chlamydial families.

Table 1 Estimated diversity within the phylum Chlamydiae at different taxonomic levels based on clustering of partial metagenomic 16S rRNA gene sequences into OTUs

In general, fewer OTUs were obtained with ESPRIT compared with MOTHUR (Table 1), which is consistent with our earlier observation during the analysis of full-length sequences (see above). As the pairwise alignment-based method implemented in ESPRIT resulted in more conservative diversity estimates of our data set, we only used the OTUs calculated by ESPRIT in subsequent analyses.

Insights into the ecology of Chlamydiae

Entries in public sequence databases generally contain additional information such as the origin of the investigated samples. These data can be used to analyze the environmental distribution of organisms detected in the samples. In our 16S rRNA gene data set, the majority of unique chlamydial sequences were derived from freshwater environments (67.6%), followed by marine environments (31%), while the number of sequences derived from terrestrial and engineered environments was negligible (<2%; Supplementary Figure S2). Despite this overrepresentation of freshwater sequences, at all taxonomic levels most OTUs contained only marine sequences (Supplementary Figure S2). Thus, although the number of freshwater sequences in our data set was higher, most of those sequences are more similar to each other and group in fewer OTUs than the marine sequences. This indicates that marine environments are more diverse in terms of Chlamydiae than freshwater or terrestrial habitats.
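The distinction between counting sequences and counting OTUs per environment can be illustrated with a small Python sketch. The OTU contents below are invented toy data, and the function names and the `"mixed"` label are assumptions; only the classification logic (an OTU is assigned to an environment when all of its sequences share that origin) follows the text.

```python
from collections import Counter


def classify_otu(environments):
    """Return the single environment shared by all sequences, else 'mixed'."""
    unique = set(environments)
    return unique.pop() if len(unique) == 1 else "mixed"


def summarize_otus(otus):
    """Count OTUs per environmental class."""
    return Counter(classify_otu(envs) for envs in otus.values())


def summarize_sequences(otus):
    """Count individual sequences per environment."""
    return Counter(env for envs in otus.values() for env in envs)


# Toy data: one large freshwater OTU and several small marine OTUs.
otus = {
    "otu1": ["freshwater"] * 5,
    "otu2": ["marine"],
    "otu3": ["marine", "marine"],
    "otu4": ["marine", "freshwater"],
}
print(summarize_sequences(otus))  # freshwater sequences dominate
print(summarize_otus(otus))       # marine-only OTUs dominate
```

In the toy data, most sequences are of freshwater origin, yet most OTUs are exclusively marine, which is the same pattern the paragraph describes for the real data set.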

To illustrate the diversity of Chlamydiae and to visualize ecological patterns, we plotted family-level OTUs using NMDS (Figure 4). This analysis shows that, even at the family level, a large number of OTUs (85% of all OTUs, Supplementary Figure S2) contain sequences exclusively from a single environment category. This may be because these chlamydial families or their hosts are restricted to growth in specific environments. The dominance in numbers of marine OTUs (despite the majority of sequences originating from freshwater) is apparent in the NMDS plot. Marine OTUs are highly diverse and are distributed across the whole range of the plot. Yet, the largest OTUs comprising the highest numbers of unique sequences were of mixed origin. The three largest OTUs are the Rhabdochlamydiaceae (5004 sequences), followed by the Parachlamydiaceae (1834 sequences) and PCF8 (1594 sequences).

Figure 4

Diversity and ecology of chlamydial families based on NMDS of OTU distances. Filled circles represent family-level OTUs, with the size corresponding to the number of sequences included. The distance between circles indicates the relative distance between OTUs. Colors represent the environment from which the sequences that form the OTUs originated. OTUs formed by a single sequence only (singletons) were not included in the plot. The majority of family-level OTUs contain only marine-derived sequences (dark blue circles), indicating a high diversity of marine Chlamydiae (see also Supplementary Figure S2). Three prominent OTUs comprise the majority of sequences: the Rhabdochlamydiaceae, followed by the Parachlamydiaceae and PCF8.

Experimental verification of chlamydial diversity in an Antarctic sample

We noted that among the samples included in this study, several contained a high diversity of novel family-level Chlamydiae. For example, a number of diverse chlamydial 16S rRNA gene sequences originated from the marine-derived Ace Lake in Antarctica (Lauro et al., 2011). We thus chose this sample to evaluate whether the diversity of Chlamydiae predicted by our analysis could be confirmed experimentally. For this, we performed PCR using a Chlamydiae-specific primer set amplifying almost the complete 16S rRNA gene. From 25 clones showing different restriction fragment length polymorphism patterns, 12 unique chlamydial sequences were identified. All of these matched with 100% sequence similarity to partial metagenomic sequences from Ace Lake. The near full-length sequences that were obtained formed the well-supported novel PCF6 clade (Figure 3), thus confirming the validity of the respective partial metagenomic sequences as being chlamydial. Therefore, the OTU classification of short metagenomic sequences correctly predicted the existence of a novel chlamydial family in the data from this lake.

Discussion

The aim of this study was to investigate the diversity of the phylum Chlamydiae and the genomic repertoire of its members using available sequence databases. However, there is no straightforward way to search metagenomes in public databases for proteins assigned to specific taxonomic groups. We thus used a similarity-based approach to extract an initial set of putative chlamydial proteins, and then analyzed them further using phylogenetic methods. The final set of metagenomic proteins that were classified as putative chlamydial constituted less than a tenth of the proteins originally identified by simple sequence similarity searches to known chlamydial proteins. This large reduction illustrates the uncertainty of similarity-based taxonomic classification, which is consistent with the notion that sequence similarity-based searches are often inadequate for finding the closest phylogenetic relative (Koski and Golding, 2001). However, a level of uncertainty remains even in the phylogeny-based classification of proteins. Phylogenetic monophyly of individual proteins (no matter how well supported) does not necessarily reflect organismal origin. Horizontal gene transfer between distantly related microbes or the absence of reference sequences may lead to protein phylogenies that are inconsistent with the organism tree, thereby providing misleading phylogenetic inference (Boucher et al., 2003). Despite these limitations, the conservative set of putative chlamydial proteins identified in this study provides an improved means of evaluating the genomic diversity of Chlamydiae.

Compared with the total number of metagenomic proteins included in our analysis, only a small number of putative chlamydial proteins were identified, with a low redundancy in terms of homologous groups. This may indicate a low abundance of chlamydiae in the sampled environments and thus a low coverage of chlamydial genes in the available metagenomic sequence data. A low abundance of chlamydiae may reflect the fact that all known members of the Chlamydiae require a eukaryotic host (Horn, 2008) and thus may be expected to be rare members of microbial communities. In addition, the cell size restriction imposed by many metagenome-sampling regimes (for example, 20 μm prefilter; Rusch et al., 2007; Lauro et al., 2011) would bias against hosts that may harbor intracellular chlamydiae.

It is difficult to isolate chlamydiae (in appropriate host cells) from environmental samples (Collingro et al., 2005a; Corsaro and Venditti, 2009; Corsaro et al., 2009; Hayashi et al., 2010). It is thus possible that existing genome sequences of members of the Chlamydiae are not representative of environmental chlamydiae, as was recently reported for numerous taxa of marine bacteria by using single-cell genomics (Swan et al., 2013), making it difficult to identify chlamydial genes from shotgun metagenome sequence data. Consistent with this, among the proteins identified by phylogenetic assignment as putative chlamydial, none were identical to known proteins, indicating that uncharacterized Chlamydiae are present in the source environments. Based on their closest relatives, the majority of these Chlamydiae are most closely related to known members of the Simkaniaceae or the Parachlamydiaceae (Figure 2). This either reflects the abundance of these or related families in the metagenomic samples or is an effect of the lack of reference genome sequences from other chlamydial families, such as the Rhabdochlamydiaceae.

To further explore the diversity of Chlamydiae, we used the 16S rRNA gene as a phylogenetic marker. Major 16S rRNA gene sequence databases such as SILVA (Pruesse et al., 2007) and RDP (Cole et al., 2009) mainly include sequence data from the GenBank/EMBL/DDBJ nt database, which does not contain metagenomic and amplicon sequences. In this study, we showed that collecting and integrating sequence data from different database sources is possible and facilitates a more comprehensive view of microbial diversity. In fact, 95% of the chlamydial sequences we identified originated from the VAMPS and SRA (Kodama et al., 2012) databases.

Previous analyses of full-length sequences indicated that the diversity of Chlamydiae in these databases exceeds the diversity of described families by a factor of two to three (Corsaro et al., 2003; Horn, 2008). In our present study, of the 28 family-level lineages supported by full-length sequences, 21 are not represented by an isolate (Figure 3, Supplementary Table S4). The lack of matches to known members of the Chlamydiae was even more evident when we analyzed the complete data set of chlamydial 16S rRNA gene sequences, including shorter sequences derived from amplicon-based studies. Even with the most conservative estimates, our analysis suggests the existence of more than 181 chlamydial families that are supported by at least two unique sequences (Table 1). Taking into account that the Chlamydiae included only a single family with a single genus until 1995, and only nine families until recently (Corsaro et al., 2003; Horn, 2008), this discovery is highly unexpected, particularly as molecular, cultivation-independent tools for the identification of microbes have been available for more than two decades (Lane et al., 1985; Amann et al., 1995).

We selected one of the new family-level OTUs that was supported only by short metagenomic 16S rRNA gene sequences and analyzed the original sample using a Chlamydiae-specific PCR assay. The full-length sequences obtained by this experimental approach confirmed the presence of members of this OTU in the original sample. Subsequent phylogenetic analysis demonstrated that they formed an independent, family-level monophyletic group (PCF6 in Figure 3). This shows that amplicon-based OTU predictions can be verified experimentally and lends further support to the existence of the observed vast diversity of Chlamydiae.

All known Chlamydiae require a eukaryotic host for reproduction, and this lifestyle is considered an ancient feature of members of this phylum. The last common ancestor of all known Chlamydiae was thought to be already adapted to an intracellular lifestyle (Horn et al., 2004; Kamneva et al., 2012), and primordial chlamydiae might have contributed to the acquisition of primary plastids and the evolution of plants some 1.2 billion years ago (Huang and Gogarten, 2007; Ball et al., 2013). If the members of the family-level chlamydial OTUs detected in our analysis have the same lifestyle as their known relatives, they also rely on eukaryotic hosts. As known chlamydiae show varying degrees of host specificity with many of them being restricted to a single host species (Horn et al., 2000; Hayashi et al., 2010; Coulon et al., 2012), there should be a large number of eukaryotes that have not yet been identified as hosts for chlamydiae (Moon-van der Staay et al., 2001). Interestingly, the most diverse chlamydial family with the highest number of unique sequences in our analysis is the Rhabdochlamydiaceae, whose known members infect arthropods (Kostanjsek et al., 2004; Corsaro et al., 2007), the most species-rich animal phylum comprising more than 80–90% of all described animals (Odegaard, 2000; Snelgrove, 2010). On the other hand, in agreement with our analysis of putative chlamydial proteins in metagenomic data sets, the majority of novel chlamydial families contain only sequences derived from marine environments, indicating an association with marine hosts. This would be consistent with the view that marine environments host an immense animal biodiversity that is comparable to or even surpasses that of terrestrial habitats (Gray, 1997; Jaume and Duarte, 2006; Snelgrove, 2010).

In summary, arthropods might be important and so far neglected hosts for Chlamydiae, and there is a high diversity of novel, unexplored Chlamydiae particularly in marine environments. The absence of representative isolates for most chlamydial families and the lack of specific information about their actual hosts illustrate the huge gap we are facing in studying and understanding chlamydial biology and evolution. Closing this gap will be a major challenge requiring the application of novel approaches and techniques such as single-cell genomics (Woyke et al., 2009; Bruns et al., 2010; Wang and Bodovitz, 2010; Siegl et al., 2011; Li et al., 2012; Stepanauskas, 2012; Seth-Smith et al., 2013) and host-free cultivation and analysis of Chlamydiae (Haider et al., 2010; Omsland et al., 2013; Sixt et al., 2013).

In more general terms, our study provided novel insights into the diversity and ecology of a selected group of microbes. This approach should be applicable to any other clade that is phylogenetically well defined. Standardized meta-information for metagenomics (Hirschman et al., 2010; Gilbert et al., 2011; Yilmaz et al., 2011), and automatic retrieval and classification of publicly available sequences from different database sources would greatly facilitate this effort and would help to provide a more comprehensive and up-to-date estimate of microbial diversity.