Date: 2017-06-12

Increased phylogenetic diversity of microbial genomes

974 bacterial and 29 archaeal genomes (from 579 genera in 21 phyla and 43 classes) were sequenced as part of the GEBA Initiative (GEBA-I), using a phylogeny-based scoring system for strain selection6, 13.

Of the 1,003 genomes presented, 396 GEBA-I genomes were the first sequenced representative of a genus (Fig. 1a). The Caldithrixae, Deferribacteres, Synergistetes and Thermodesulfobacteria (Fig. 1a) phyla have the most new genera. The most populous phyla, in terms of numbers of genomes sequenced, were the Proteobacteria (with 330 genomes), Firmicutes (178), Bacteroidetes (163) and Actinobacteria (157). The remaining 175 genomes belonged to 17 additional phyla, including the only sequenced representative of the Caldithrixae phylum (Supplementary Table 1). The GEBA-I strains originate from a multitude of habitats including extreme environments, terrestrial biomes, industrial waste and human body sites (Supplementary Fig. 1) and unsurprisingly have diverse physiology, genome size and average G+C content (Supplementary Fig. 2). GEBA-I is a high-quality reference resource with 99.4% (on average) genome completeness (assessed using CheckM14; Supplementary Table 1). Annotation of the 1,003 GEBA-I genomes resulted in 3,472,483 predicted genes from 3.75 Gbp of assembled sequence data (Supplementary Fig. 3 and Supplementary Table 1). All GEBA-I genomes are publicly available through the Integrated Microbial Genomes with Microbiomes (IMG/M) system15 and GenBank, and the corresponding strains through the respective culture collection (Supplementary Table 1).

Figure 1: GEBA-I strain phylogeny and distribution. (a) Maximum likelihood tree based on concatenated alignment of 56 conserved protein markers from representative genomes from all cultivated phyla. Phyla containing a GEBA-I genome are colored red, while all other phyla are colored gray. Pie charts represent the fraction of genera contributed by GEBA-I genomes (red) to the total number of genera per phylum (blue). The number of new genera added by GEBA-I per phylum is displayed next to the pie charts. Bootstrap support values ≥50% are shown with small circles on nodes with robust phylogenetic support. (b) Overall increase in 16S rRNA gene diversity relative to all the type strains. Blue denotes the genetic diversity covered by 828 genomes of type strains before GEBA-I, red denotes the diversity covered by the GEBA-I genomes and gray denotes the remaining type strains lacking a genome sequence. Balanced relative phylogenetic diversity (bRPD) was calculated by adding branch lengths between each leaf and root node in the tree followed by proportional downweighting of internal branches6.

To quantify the increase in phylogenetic diversity contributed by GEBA-I genomes compared with all previously available, validly named archaeal and bacterial species (i.e., type strains), we measured the diversity distance of all sequenced type strains in a comprehensive 16S rRNA gene tree6. The GEBA-I genomes increased the phylogenetic distance threefold, expanding the overall diversity of the type-strain sequence space by ~24% (Fig. 1b). Further, we applied a whole-genome comparative analysis based on the average nucleotide identity to verify the relative novelty of the GEBA-I genomes compared to a set of 14,625 control genomes. We found that the vast majority (845/1,003) of the GEBA-I genomes were 'singletons' on the basis of the proposed criteria for defining a “species group”7, verifying that no other sequenced representative of that species is available.

Expanding the universe of known proteins

A total of 3,402,887 protein-coding sequences were predicted from the 1,003 GEBA-I genomes. We compared this data set with 23,470,984 non-redundant proteins from all available (14,625) control bacterial and archaeal genomes. Clustering ~26 million total proteins at ~30% sequence identity over 80% alignment length using KClust resulted in 1.89 million protein clusters (containing at least two sequences) and 2.6 million singletons. Of these, 55,105 clusters and 436,840 singletons were composed of proteins from GEBA-I genomes only (Supplementary Table 2), corresponding to a 10.5% increase in known protein sequence diversity.
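The bookkeeping behind the GEBA-I-only counts above can be illustrated with a minimal Python sketch. It assumes the clustering output has already been parsed into a mapping from cluster ID to a list of (protein ID, source) pairs, where the source is either "GEBA" or "control". The function name, the data layout and the toy input are hypothetical and do not reflect the actual KClust output format.

```python
from collections import Counter

def tally_clusters(clusters):
    """Count clusters (>= 2 members) and singletons, overall and GEBA-I-only.

    `clusters` maps a cluster ID to a list of (protein_id, source) pairs,
    where source is "GEBA" or "control" (hypothetical layout).
    """
    counts = Counter()
    for members in clusters.values():
        kind = "singletons" if len(members) == 1 else "clusters"
        counts[kind] += 1
        # A cluster is GEBA-I-only when every member comes from a GEBA-I genome.
        if all(source == "GEBA" for _, source in members):
            counts["geba_only_" + kind] += 1
    return counts

# Toy input, invented for illustration only.
toy = {
    "c1": [("p1", "GEBA"), ("p2", "control")],
    "c2": [("p3", "GEBA"), ("p4", "GEBA")],
    "c3": [("p5", "GEBA")],
    "c4": [("p6", "control")],
    "c5": [("p7", "control"), ("p8", "control")],
}
counts = tally_clusters(toy)
```

In this toy input, one of three clusters and one of two singletons are GEBA-I-only; applied to the real clustering, the same tally would correspond to the 55,105 clusters and 436,840 singletons reported above.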

To test if this represents a meaningful increase, or a mere continuation of a trend that has been ongoing since the advent of whole genome sequencing, we calculated the growth rate of new protein families (per 1,000 genomes) (Fig. 2a), and the number of protein families added by newly sequenced bacterial and archaeal genomes over time (i.e., in chronological order of their date of release; Fig. 2a, inset). First, we observed that the growth rate of new protein families markedly declined after the first 2,000 sequenced genomes. Addition of the GEBA-I genomes (noted in red) resulted in a dramatic increase in the growth rate of new protein families, equivalent to the protein family novelty initially observed with the first 2,000 genomes. Second, we found that the number of protein families added over time was initially large with the addition of the first 5,000 genomes, but almost plateaued at around 15,000 genomes (Fig. 2a, inset). The addition of GEBA-I genomes led to a substantial increase in the number of added protein families (Fig. 2a, inset). Together, this reinforces the hypothesis that substantial functional gene novelty remains to be discovered within the cultivated genome space and suggests that continued phylogeny-driven sequencing efforts will result in an expanded catalog of diverse protein families.
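The growth-rate calculation described above can be sketched in Python as follows. This is a minimal illustration, assuming each genome is represented by the set of protein family IDs it encodes and that genomes are already sorted by release date; the function name and the toy data are hypothetical, and the authors' actual pipeline may differ in detail.

```python
from itertools import accumulate

def new_families_per_window(genome_families, window=1000):
    """Return the number of previously unseen families in each window of genomes.

    `genome_families` is a list of sets of protein family IDs, one set per
    genome, sorted by release date (oldest first).
    """
    seen = set()
    per_window = []
    for start in range(0, len(genome_families), window):
        new_in_window = 0
        for families in genome_families[start:start + window]:
            novel = families - seen      # families not observed in any earlier genome
            new_in_window += len(novel)
            seen |= novel
        per_window.append(new_in_window)
    return per_window

# Toy data with a window of 2 genomes instead of 1,000 (illustration only).
toy_genomes = [{"A", "B"}, {"B", "C"}, {"C"}, {"A", "C"}, {"D", "E", "F"}, {"F"}]
growth = new_families_per_window(toy_genomes, window=2)
total_over_time = list(accumulate(growth))
```

Here `growth` plays the role of the per-window growth rate in Fig. 2a and `total_over_time` the role of the cumulative curve in the inset: the middle window adds nothing new, while the last window, containing a genome with three unseen families, restores the growth rate.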

Figure 2: Protein clusters identified using GEBA-I genomes. (a) Change in growth rate of protein families identified per 1,000 genomes over the years and increase in number of new protein families over time, as new genomes were sequenced and added to public databases (inset). (b) Relationship between number of genes in protein clusters and singletons and the minimum 16S rRNA distance of each GEBA-I genome to its closest non-GEBA relative. Outliers, defined as points beyond 90% of the data with the smallest absolute residuals with a linear model, are depicted as red open circles. (c) Maximum 16S rRNA distance of genomes contributing a GEBA-I-only protein cluster. Each data point represents a single GEBA-I-only protein cluster and is colored by the cluster type, x axis is the total number of genes in each cluster, and y axis is the maximum 16S distance of genomes contributing to that cluster.

To explore whether increased functional novelty is correlated with specific phylogenetic lineages, we examined the minimum 16S rRNA gene distance compared to the total number of new protein clusters for each GEBA-I genome (Fig. 2b). In general, genomes with increased phylogenetic distance (i.e., greatest 16S distance from reference) encoded the greatest number of novel protein families. As expected, many of the genomes with the greatest phylogenetic distance and number of novel genes belonged to phyla for which few or no sequenced representatives were previously available (Fig. 1a). For example, Ktedonobacter racemifer16, a member of the phylum Chloroflexi, contributed 5,102 genes to GEBA-I-only clusters and singletons (Fig. 2b). However, a handful of GEBA-I genomes with closely related reference genomes (i.e., near-identical 16S rRNA gene sequences) also encoded a preponderance of novel genes. The most striking outliers were Mycobacterium genavense ATCC 51234 and Promicromonospora kroppenstedtii RS16, DSM 19349, contributing 1,327 and 2,038 novel genes, respectively (Fig. 2b and Supplementary Table 2). For the M. genavense genome, this observation is explained by the highly conserved nature of the 16S rRNA gene for this group, with other sequenced markers revealing a higher rate of polymorphism; for example, the 16S-23S internal transcribed spacer is preferred for species discrimination17, 18. Thus, the close evolutionary relationship for M. genavense implied by this minimum 16S rRNA gene distance (distance = 0.018, Mycobacterium parascrofulaceum) is likely an underestimation, and not a good indicator of actual evolutionary distance for this genome. Conversely, the relatively small sizes of genomes with high 16S distance to reference but few novel genes (e.g., Mycoplasma elephantis and Allofustis seminis, both host-associated) suggest that they may have undergone streamlining or genome reduction.
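The outlier definition used for Fig. 2b (points outside the 90% of the data with the smallest absolute residuals from a linear model) can be sketched in plain Python. This is a minimal illustration under stated assumptions: an ordinary least-squares fit is assumed, since the text does not specify the fitting procedure, and the function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_outliers(xs, ys, keep_fraction=0.9):
    """Return indices of points outside the `keep_fraction` of the data
    with the smallest absolute residuals from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    n_keep = round(keep_fraction * len(xs))
    order = sorted(range(len(xs)), key=lambda i: residuals[i])
    return sorted(order[n_keep:])

# Toy data: x stands in for minimum 16S distance, y for novel gene count.
xs = list(range(10))
ys = [2 * x for x in xs]
ys[4] = 30                      # one genome with far more novel genes than expected
outliers = flag_outliers(xs, ys)
```

With ten points and a 90% cutoff, exactly one point is flagged, here the genome at index 4, which is the analogue of a genome such as M. genavense that carries many novel genes despite a small 16S distance.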

Exploring GEBA-I-only protein clusters

A total of 55,105 clusters were composed exclusively of proteins from GEBA-I genomes. Approximately 25% of these clusters (13,371 in total) contained proteins arising from a single genome (designated here as “homogeneous” or paralogous clusters), and possibly result from lifestyle-specific gene expansion, or from proliferation of integrated elements like phage or transposons (Fig. 2c). For example, the 13.6-Mbp genome of Ktedonobacter racemifer contributed a striking 411 homogeneous clusters, the largest number proportional to genome size of all the analyzed GEBA-I genomes; most of these clusters are implicated in regulatory functions, such as two-component signal transduction systems (TCS) involved in sensing and responding rapidly to environmental stimuli. Although TCS themselves are not novel, the K. racemifer encoded genes (e.g., Histidine Kinase, Cluster ID: 2509672) have a novel domain configuration involving multiple sensory PAS folds19, and high levels of sequence divergence from existing TCS (Supplementary Fig. 4). Four related clusters (Cluster IDs: 2586264, 809557, 4221619, 3082022) from the termite hindgut isolate Sphaerochaeta coccoides may represent another lifestyle-specific expansion20, with some clusters arranged as tandem arrays (Supplementary Fig. 5), suggesting gene expansion by recent gene duplication.

For the remaining 41,734 clusters in GEBA-I genomes (designated as “heterogeneous clusters”), varying levels of “heterogeneity” were identified in terms of membership within the same genus, family, order or class (Fig. 2c). We found a subset of clusters that originated from members of two or more phyla (designated as “hyper-heterogeneous” clusters; Fig. 2c). One of these clusters is a four-protein cluster (66% amino acid identity, Cluster ID: 2968370) present in four disparate species (Thermodesulfobacterium hveragerdense, Thermodesulfobacterium thermophilum, Thermodesulfovibrio thiophilus, Desulfurella acetivorans) from three phyla (Thermodesulfobacteria, Nitrospirae and Proteobacteria) that share a common physiology of thermophilic anaerobic sulfur reduction. While members of these particular genera or their higher taxonomic groups may not be well represented in sequence databases, the lack of cluster membership from genomes of relatively well-saturated phyla such as Proteobacteria is curious, suggesting horizontal gene transfer among these possibly cohabiting species. Further support for this speculation may be the putative function of the proteins themselves: they are rhodanese-like sulfotransferases, described as versatile proteins using persulfide chemistry to accomplish cellular functions ranging from cell cycle progression to stress resistance to sulfur metabolism21. A case with no apparent unifying theme in terms of known ecological niche or physiology is a co-localized pair of three-gene clusters (Cluster IDs: 4177102 and 4403394 with 49% and 43% amino acid identity, respectively) from two domains of life, namely, Maritalea myrionectae, Cucumibacter marinus (both Proteobacteria) and Methanolobus tindarius (an archaeon), with possible functions in quinolone export.
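The three cluster categories used here (homogeneous, heterogeneous and hyper-heterogeneous) can be expressed as a simple rule. The following Python sketch assumes each cluster member is annotated with its source genome and phylum; the function name and input layout are hypothetical, and the finer gradations by genus, family, order or class shown in Fig. 2c are not modeled.

```python
def classify_cluster(members):
    """Classify a GEBA-I-only cluster from (genome, phylum) pairs, one per protein.

    - "homogeneous": all proteins come from a single genome (paralogous cluster)
    - "hyper-heterogeneous": proteins come from two or more phyla
    - "heterogeneous": several genomes, all within one phylum
    """
    genomes = {genome for genome, _ in members}
    phyla = {phylum for _, phylum in members}
    if len(genomes) == 1:
        return "homogeneous"
    if len(phyla) >= 2:
        return "hyper-heterogeneous"
    return "heterogeneous"

# Membership of Cluster ID 2968370 as described in the text.
cluster_2968370 = [
    ("Thermodesulfobacterium hveragerdense", "Thermodesulfobacteria"),
    ("Thermodesulfobacterium thermophilum", "Thermodesulfobacteria"),
    ("Thermodesulfovibrio thiophilus", "Nitrospirae"),
    ("Desulfurella acetivorans", "Proteobacteria"),
]
# A paralogous cluster from one genome (member count invented for illustration).
paralogous = [("Ktedonobacter racemifer", "Chloroflexi")] * 3
# Two hypothetical genomes from the same phylum.
same_phylum = [("genome X", "Firmicutes"), ("genome Y", "Firmicutes")]

labels = [classify_cluster(c) for c in (cluster_2968370, paralogous, same_phylum)]
```

The order of the checks matters: a single-genome cluster is labeled homogeneous before phylum membership is considered, so the three categories are mutually exclusive.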

Hyper-heterogeneous clusters are curious instances of phylogenetic discordance, that is, cases in which the phylogenetic history of an individual gene differs from the known species history. Plausible explanations for this observation (as reviewed by Galtier and Daubin22) include: horizontal gene transfer, where the phylogeny is influenced by the number and nature of transfers that have transpired; incomplete lineage sorting due to rapid speciation events, that is, the ancestral polymorphism is not fully resolved into two monophyletic lineages when the second speciation occurs; hidden paralogy, in which the phylogeny partly reflects the duplication history of the gene independent of species divergence history; or convergent evolution.

The large number of singletons identified in the GEBA-I genomes represents potential new functions and confirms that a large proportion of functional novelty still remains to be captured. One such example is a putative pepsin A encoded by Endozoicomonas elysicola DSM 2238, isolated from the gastrointestinal tract of a mollusk sea slug. Although pepsin-like enzymes are commonly found in eukaryotes, the E. elysicola candidate is the first instance of a secreted bacterial pepsin (based on a signal peptide) containing all the conserved residues of its eukaryotic counterparts (Supplementary Fig. 6). To verify that singletons are not artifacts of gene prediction pipelines, we assessed their size distribution and presence of signaling or other structural motifs (Supplementary Table 2). Based on this, more than 70% of singletons are >100 amino acids in length, and of these, 31% possess either a signal peptide or two or more transmembrane helices.

Biosynthetic clusters for secondary metabolites

Microbial secondary metabolites are organic compounds that are not directly involved in primary growth and development, but rather have auxiliary functions such as defense, communication and other interactions. Genes encoding biosynthetic enzymes for the synthesis of secondary metabolites are typically co-localized on the chromosome and are referred to as “biosynthetic gene clusters” (BCs). While only a few of the selected type strains in this study were known to be prolific producers of secondary metabolites, a large number of potential new BCs were predicted in the GEBA-I genomes (Supplementary Table 3).

A total of 23,839 BCs were predicted from 1,003 GEBA-I genomes using the IMG-ABC system23. Three Pseudonocardiaceae genomes (Pseudonocardia acaciae, P. spinosispora and Sciscionella marina) encoded the greatest total number of BCs among all GEBA-I genomes (Fig. 3a). These included numerous nonribosomal peptide synthetases and polyketide synthases, as well as lantipeptides, bacteriocins, ectoine, thiopeptides, and others. We observed a clear correlation between the number of predicted BCs and genome size, with an average of 6.41 (±2.4 s.d.) BCs predicted per Mb of sequence (Supplementary Fig. 7). Actinobacterial genomes were outliers with an average of 9.58 (±3.4 s.d.) BCs per Mb. This observation likely reflects their particular ecological niches, which involve multiple (perhaps antagonistic) interactions with cohabiting microbes (e.g., P. acaciae was isolated from a competitive plant rhizosphere environment). While Streptomyces species are known to be prolific producers of antibiotics and other natural products24, genomes from the Nocardiaceae and Pseudonocardiaceae families of Actinobacteria had not been sequenced extensively before this study, and therefore had not been intensively targeted for BC gene discovery. Given that six of the top ten BC-rich genomes in GEBA-I belong to these two families, future sequencing efforts focused on these clades may prove fruitful for discovering natural products.
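The per-megabase BC density reported above is a simple normalization, sketched here in Python. The per-genome values in the toy list are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used. Only the pooled totals (23,839 BCs and 3.75 Gbp) come from the text.

```python
from statistics import mean, stdev

def bcs_per_mb(n_bcs, genome_size_bp):
    """Number of predicted BCs per megabase of assembled sequence."""
    return n_bcs / (genome_size_bp / 1_000_000)

# Toy (BC count, genome size in bp) pairs, invented for illustration.
toy_genomes = [(30, 5_000_000), (12, 3_000_000), (80, 8_000_000)]
densities = [bcs_per_mb(n, size) for n, size in toy_genomes]

average_density = mean(densities)   # per-genome average, as reported in the text
spread = stdev(densities)           # sample s.d. (assumed estimator)

# Pooled ratio from the totals given in the text: 23,839 BCs over 3.75 Gbp.
pooled_density = bcs_per_mb(23_839, 3_750_000_000)
```

The pooled ratio is about 6.36 BCs per Mb, close to but not identical to the per-genome average of 6.41, as expected when genomes of different sizes are averaged individually rather than pooled.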

Figure 3: Distribution of biosynthetic clusters (BCs) in GEBA-I genomes. (a) Maximum likelihood phylogenetic tree using 56 conserved single-copy genes with horizontal bars representing the percentage of genome encoding biosynthetic gene clusters. Blue stars highlight GEBA-I genomes with the greatest percentage of BCs per genome. The red star indicates the phylogenetic placement of Photobacterium halotolerans DSM 18316 described in b and c. (b) Liquid chromatography–mass spectrometry (LC/MS) chromatogram from a crude extract of P. halotolerans DSM 18316 with labeled phenazine peaks. (c) Phenazine operon in P. halotolerans DSM 18316 compared to those from Pseudomonas fluorescens 2-79 and Pantoea agglomerans Eh1087.

On average, the GEBA-I genomes devote nearly 10% of their genome to secondary metabolite biosynthesis, with actinobacterial GEBA-I genomes apportioning an average of 16.5% (±8% s.d.) of their genome. Among the actinobacterial GEBA-I soil isolates, Actinoalloteichus cyanogriseus and Smaragdicoccus niigatensis encode the greatest fraction of BCs at 39% and 36%, respectively. This is the highest percentage reported so far for any genome, surpassing the previous record for Streptomyces bingchenggensis25. Given that Actinobacteria are vigorously pursued for new antimicrobial product discovery26, these two previously unrepresented genera isolated from soil and an oil spring, respectively, might contribute new classes of bioactive compounds.

In addition to predicting biosynthetic gene clusters, we annotated the class of secondary metabolite synthesized by each BC across the GEBA-I genomes. Most of the predicted BC products were unclassified, reflecting both the limited information available for characterized natural products and the rich genomic resource of biosynthetic capabilities contributed by GEBA-I. For example, nine new phenazine pathways with novel operon structures and genes were identified in the GEBA-I genomes23. Phenazines are a large class of nitrogen-containing heterocyclic secondary metabolites that have potent antimicrobial and antifungal activity, and are produced by a wide range of bacteria. The phenazine pathways encoded in the genomes of Microbulbifer variabilis ATCC 700307 and Photobacterium halotolerans DSM 18316 are the first observations of this capability in the families Alteromonadaceae and Vibrionaceae, respectively. A crude extract of P. halotolerans DSM 18316 produced three known phenazines, phenazine 1-carboxylic acid (PCA), phenazine 1,6-dicarboxylic acid (PDC) and griseoluteic acid; however, D-alanylgriseoluteic acid was not observed (Fig. 3b). The phenazine operon in P. halotolerans DSM 18316 included all of the core phenazine genes found across all taxa known to produce the two core phenazines, PCA and PDC (Fig. 3c). This operon also contained additional phenazine-modifying genes that exhibited the same pathway architecture found in Pantoea agglomerans Eh1087, a known producer of griseoluteic acid as well as D-alanylgriseoluteic acid27. The three genes known to modify griseoluteic acid to D-alanylgriseoluteic acid in P. agglomerans Eh1087 are present in the P. halotolerans DSM 18316 genome, yet the amino acid incorporated by the amino acid adenylation domain is likely different. Some of the other prominent metabolites (unknown peaks in Fig. 3b) may contain this potentially new phenazine. We also identified the biosynthetic genes likely responsible for the pelagiomicin phenazine antibiotic (structure known) produced by M. variabilis ATCC 700307 (ref. 28) (Supplementary Fig. 8).

Improved taxonomic assignment of metagenomic sequences

The ability to phylogenetically analyze and provide taxonomic classification to metagenomic data is largely dependent upon reference microbial genomes. Previous efforts to expand the genomic reference set through inclusion of phylogenetically underrepresented lineages have yielded dramatic improvement in classification of metagenomic data5. Here, we evaluated whether the GEBA-I genomes could serve as phylogenetic anchors for metagenomic studies. A total of 3,402,887 GEBA-I proteins were compared to 2,664,695,939 non-redundant protein sequences derived from 4,948 metagenomes in the IMG database. The GEBA-I protein set recruited 25,576,559 previously unassigned metagenomic proteins from 4,650 metagenomes (Supplementary Table 4). The majority of newly recruited proteins were derived from metagenomes of terrestrial (32%), aquatic (28%) and plant-associated (21%) habitats (Fig. 4a and Supplementary Fig. 9). This finding is primarily attributed to the high proportion of metagenome samples from these particular habitats. Solirubrobacter soli DSM 22325 (ref. 29), a ginseng field soil isolate, recruited the highest number of metagenome proteins (Supplementary Fig. 9); the habitat distribution of these new hits was 50% terrestrial, 34% plant host associated and 6.5% aquatic, with a tiny fraction from termite gut samples.

Figure 4: Recruitment of metagenomic sequences by GEBA-I genomes. (a) Overview of metagenomic protein sequence recruitment by individual GEBA-I genomes. Phylogenetic analyses of whole genome sequences were conducted using the high-throughput version of the Genome-Blast Distance Phylogeny approach. Internal branch support above 60% is colored in a range from red (60%) to green (100%). The colored dots decorating the terminus of every tree branch indicate the isolation source habitat for the given GEBA-I genome. The outermost circle bearing a black bar chart denotes the total number of metagenomic sequences with protein blast hits to that GEBA-I genome (Supplementary Fig. 7 and Supplementary Table 4). The habitat distribution for these hits is given in the colored concentric circles that follow. The intensity of color is weighted by fraction of total hits to a habitat. (b) Protein recruitment plot showing amino acid percent identity (y axis) of top hits of Desulfomicrobium baculatum DSM 4028 CDS against a metagenomic sample from the biofilm of a corroded oil pipeline (IMG taxon_oid: 3300002702). CDS are ordered on the x axis by position on one contiguous scaffold available for this genome. (c) Protein recruitment plot showing percent identity (y axis) of Rudaea cellulosilytica DSM 22992 CDS top protein blast hits against a metagenomic sample from corn rhizoplane (IMG taxon_oid: 3300001904). For contrast, top hits against two other rhizosphere samples are included: switchgrass (IMG taxon_oid: 3300002128) and Miscanthus (IMG taxon_oid: 3300001991). CDS are ordered on the x axis by position on six discrete scaffolds (which are themselves ordered by descending sequence length) available for this genome.

Although GEBA-I strain selection was based on phylogenetic placement rather than numerical dominance within certain environmental samples, 282 genomes, designated “top recruiters,” were found to notably recruit protein sequences from 1,204 individual environmental samples. Furthermore, we found evidence that a number of the genomes that significantly recruited metagenomic proteins may serve as important members of the microbial community in terms of abundance and encoded metabolic potential. For example, the cellulose-degrading soil isolate Rudaea cellulosilytica30 preferentially recruits sequences (over 87% coverage of total coding sequence (CDS)) from two corn rhizoplane samples at high abundance (based on an average read depth of ~25), but not from other plant rhizosphere samples (Fig. 4c). We hypothesize that R. cellulosilytica is an opportunist in senescing corn rhizoplane samples taken from a drought-stressed continuous corn plot (where root decomposition from previous years probably provided plentiful substrate for its growth), because it is not present in samples from unstressed corn in subsequent years (personal communication, James M. Tiedje, Michigan State University). Another notable example is the anaerobic sulfate reducer Desulfomicrobium baculatum DSM 4028 (over 85% coverage of total isolate CDS), which is abundant in an oil pipeline biofilm sample and likely had a pivotal role in the microbial-induced corrosion that led to failure of the pipeline31 (Fig. 4b).

Overall, we found a correlation between isolation source of the GEBA-I strain and the metagenome sample habitat, as expected. Some interesting exceptions were identified. For example, Inquilinus limosus DSM 16000, a GEBA-I strain isolated from sputum of cystic fibrosis (CF) patients (although not known to cause disease or pathology), showed recruitment of proteins from several plant rhizosphere metagenome samples (e.g., Arabidopsis, corn). We hypothesize that closely related Inquilinus species or strains may be members of the plant root microbial community. Indeed, Inquilinus spp. have been previously reported in 16S rRNA surveys of root nodules of wild legumes32, 33. There is mounting evidence that human-pathogenic enteric bacteria such as Salmonella can colonize plant tissues, and use similar mechanisms for infection of animal and plant hosts34, 35. Our findings (and additional examples discussed in Supplementary Note) further underscore the impact of broadening the phylogenomic representation of public databases, which in this case complements cultivation-independent efforts to explore the breadth of microbial diversity and ecology.

Other investigators have taken advantage of early access to the GEBA-I genomes and discovered prominent member species in their samples, for example, Treponema succinifaciens36 and Treponema brennaborense in the gut microbiomes of non-human primates and traditional hunter-gatherers37, Ktedonobacter racemifer in an enrichment to identify rare soil microbes38, Coraliomargarita akajimensis39 in an Amazon river plume40, and Sphaerobacter thermophilus41 in thermophilic switchgrass-adapted compost42.

We also report genome features and a large set of CRISPR–Cas systems comprising more than 28,000 novel spacer sequences (Supplementary Table 5 and Supplementary Fig. 10). These CRISPR–Cas data enabled identification of novel associations between viruses and their hosts43.