Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.
Sequencing of microbial genomes has accelerated with reductions in sequencing costs, and public repositories now contain nearly 70,000 bacterial and archaeal genomes. The majority of these genomes have been obtained from axenic cultures1,2 and disproportionately reflect microorganisms of medical importance3. Consequently, current genome repositories are not representative of the microbial diversity known from 16S rRNA gene surveys4. Concerted efforts are being made to address this limitation by targeting phylogenetically distinct microorganisms for cultivation5,6,7 and single-cell sequencing4,8. Although these approaches continue to provide valuable reference genomes, the former is restricted to microorganisms amenable to cultivation and the latter is hampered by technical challenges and the need for specialised equipment9. Obtaining genomes from metagenomes is an emerging approach with the potential for large-scale recovery of near-complete genomes10,11,12.
Until recently, recovering genomes from metagenomic data was restricted to samples with low microbial diversity13, but improved sequencing throughput and advances in computational techniques now allow metagenome-assembled genomes (MAGs) to be recovered from high diversity environments14,15. MAGs are obtained by grouping or ‘binning’ together assembled contigs with similar sequence composition, depth of coverage across one or more related samples and taxonomic affiliations16,17. Several tools have been developed that exploit these sources of information to produce genomes from metagenomic data18,19,20,21 and there are ongoing efforts to evaluate the effectiveness of different approaches22. Although closed genomes have been obtained using metagenomic binning methods10,23, MAGs are typically incomplete and may contain contigs from multiple strains or species due to challenges in distinguishing between related community members both in the assembly and binning processes19,24. This has spurred the development of methods for assessing the quality of recovered MAGs in order to allow biological inferences to be made with regards to their estimated completeness and contamination25,26.
Significant insights have recently been made based on the MAGs of uncultivated microorganisms. These include elucidation of several phyla previously lacking genomic representatives27,28,29, including the Patescibacteria superphylum4, which has subsequently been referred to as the ‘Candidate Phyla Radiation’ (CPR) as it may consist of upwards of 35 candidate phyla10,30. Notable evolutionary and metabolic insights include the discovery of eukaryotic-like cytoskeleton genes in the archaeal phylum Lokiarchaeota31,32 and the identification of putative methane-metabolizing genes in the Bathyarchaeota and Verstraetearchaeota phyla33,34. These initial studies demonstrate the need for additional genomic representatives across the tree of life in order to more fully appreciate microbial evolution and metabolism.
Here, we present the first large-scale initiative to recover MAGs from publicly available metagenomes. Nearly 8,000 draft-quality genomes were recovered from over 1,500 metagenomes, more than a threefold increase over large initiatives to genomically populate the tree of life such as the Genomic Encyclopedia of Bacteria and Archaea35 (~2,000 genomes), the Human Microbiome Project3 (~2,000) and the largest previous MAG study11 (~2,500). We refer to our set of MAGs as the Uncultivated Bacteria and Archaea (UBA) data set. Genome-based phylogenetic analysis indicates that the UBA genomes provide the first representatives of several major bacterial and archaeal lineages and substantially expand genomic representation across the tree of life.
Genomes are readily recovered from metagenomic data
MAGs were recovered from 1,550 metagenomes submitted to the Sequence Read Archive (SRA) before 31 December 2015 (Supplementary Fig. 1). We predominantly considered environmental and non-human gastrointestinal samples in order to focus on metagenomes likely to contain microbial populations from under-sampled lineages (Supplementary Table 1). The completeness and contamination of each MAG were estimated from the presence and absence of lineage-specific genes expected to be ubiquitous and single copy25, and these estimates, along with assembly statistics, were used to identify genomes suitable for further study. A total of 64,295 MAGs were obtained, of which 7,903 (7,280 bacterial and 623 archaeal) form the UBA data set as they met our filtering criteria of having an estimated quality ≥50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤500 scaffolds with an N50 ≥10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥10× (5th percentile, 9.2×, 95th percentile, 268×) and 95.8% have >5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes3,36. Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥90% complete with ≤5% contamination (Fig. 1a). These genomes consist of ≤100 scaffolds in 70.2% of cases (≤200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb. Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).
The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes3,37. Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being 4.8% of the genomes having scaffolds with an N50 of <20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality except that they are estimated to be <90% complete.
The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs that encode for an average of 17.3 ± 2.2 of the 20 amino acids and ≥15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRNAs was positive but weak (Supplementary Fig. 2) as tRNAs are regularly present in multiple copies and often collocated in a genome, making them poor markers for robustly estimating completeness25,38.
Taxonomic distribution of UBA genomes
The phylogenetic relationships of the UBA genomes were determined across bacterial and archaeal trees inferred from three concatenated protein sets: (1) a syntenic block of 16 ribosomal proteins (rp1) recently used to infer genome-based phylogenies10,30 (Supplementary Table 4), (2) 23 ribosomal proteins (rp2) previously tested for lateral gene transfer4 (Supplementary Table 5), and (3) 120 bacterial (bac120) and 122 archaeal (ar122) proteins we have identified as being suitable for phylogenetic inference (Supplementary Tables 6 and 7). The trees span ~19,000 bacterial and ~1,000 archaeal genomes after species-level dereplication of the UBA genomes and 67,479 genomes in RefSeq/GenBank release 76 (Supplementary Table 8).
UBA genomes were represented in the majority of bacterial (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in the NCBI taxonomy39. In addition, they comprise the first genomic representatives of 17 bacterial and 3 archaeal phyla (see section ‘UBA genomes are the first representatives of several phyla’). To provide an objective taxonomic analysis of the UBA data set, we used the phylogenetic criterion of mean branch length to extant taxa30 as existing taxonomic classifications are not phylogenetically uniform1. The results were highly consistent across all trees and, as expected, named groups at each taxonomic rank vary substantially in their mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 and 4). Based on the range of mean branch length values for established taxa, the bacterial UBA genomes are exclusive representatives of 20–30% of all genus- to order-level lineages, 15–30% of class-level lineages and 5–15% of phylum-level lineages (Fig. 4a). Similarly, the archaeal UBA genomes are the only representatives within 20–30% of genus- to order-level lineages, 15–30% of class-level lineages and around 10% of archaeal phyla (Fig. 4b). We also tabulated the number of UBA-exclusive lineages at the 50th, 90th and 95th percentiles of the mean branch length distribution of each taxonomic rank (Supplementary Table 9). At the conservative 90th percentile, the bacterial UBA genomes are the first genomic representatives of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and 38 classes (18.0%) within this domain. Similarly, the archaeal UBA genomes represent 59 genera (30.3%), 25 families (28.4%), 13 orders (23.6%) and 3 classes (15.0%) at the 90th percentile.
UBA genomes are the first representatives of several phyla
The 17 bacterial and 3 archaeal phyla comprised exclusively of UBA genomes have been given the candidate names Uncultured Bacterial Phylum 1 to 17 (UBP1 to 17, Fig. 2) and Uncultured Archaeal Phylum 1 to 3 (UAP1 to 3, Fig. 3). These candidate phyla form well-supported clans in all three concatenated protein trees (Supplementary Table 10) and are unaffiliated with existing phyla40 (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA genes ≥600 bp have inter-phyla percent identity values between 76% and 86% (Supplementary Table 12), in agreement with established phyla41 (Supplementary Fig. 5). However, these 16S rRNA results should be treated with some caution as the percent identity of incomplete 16S rRNA sequences correlates poorly with values for full-length sequences42. Because 16S rRNA genes often fail to assemble43 and are missing from half of the UBP/UAP lineages, we used average amino-acid identity (mean of 46.2%) and shared gene content (mean of 24.4%) calibrated against established phyla to further support the classification of the UBP/UAP as phyla (Supplementary Fig. 5 and Supplementary Table 13).
To further resolve the taxonomic identity of the UBP and UAP genomes, 16S rRNA genes from these genomes were placed into a tree containing genomic and environmental 16S rRNA sequences. Only UBP9, UAP2 and UAP3 could be further taxonomically resolved as the other candidate phyla either lack genomes with a 16S rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary Tables 11 and 14). UBP9 genomes are the first genomic representatives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and were recovered from baboon faeces (five genomes), palm oil effluent (one genome), a toluene degrading community (one genome) and a dechlorination bioreactor (one genome, Supplementary Table 14). UAP2 contains the first representatives of the Marine Hydrothermal Vent Group (MHVG) and consists of three genomes recovered from the Tara Oceans Expedition along with a single genome from the Beebe hydrothermal vent (Supplementary Table 14 and Fig. 3). UAP3 is represented by a single genome recovered from a Costa Rican marine sediment metagenome (Supplementary Table 14) and is the first representative of the Ancient Archaeal Group (AAG), a group adjacent to the Lokiarchaeota (Fig. 3).
UBA genomes substantially increase phylogenetic diversity
The phylogenetic diversity of the UBA genomes was determined across the three concatenated protein trees. The results were highly consistent across these trees with the UBA genomes covering >50% of the total branch length (phylogenetic diversity, PD) spanned by each domain-specific genome tree and increasing total branch length (phylogenetic gain, PG) by ~30% (Fig. 5 and Supplementary Table 8). For comparison, the PD of the CPR (1,056 genomes, including 245 UBA genomes) and Firmicutes (25,992 genomes, including 1,666 UBA genomes) is 11.2% and 25.1%, respectively (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs of ~17% and 30%, respectively. The near-complete and medium-quality archaeal UBA genomes provide a phylogenetic gain of 10% and 29%, respectively (Supplementary Table 8).
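As an illustration of the two measures used above, the following minimal Python sketch computes PD and PG as percentages of total branch length on a toy rooted tree. The tree encoding (child mapped to parent and branch length), the function names and the convention of summing branch lengths on the paths from each leaf to the root are all assumptions made for this example and are not taken from the study's software.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical encoding: node -> (parent, length of branch to parent).
# The root maps to (None, 0.0).
Tree = Dict[str, Tuple[Optional[str], float]]


def spanned_length(tree: Tree, leaves: Iterable[str]) -> float:
    """Sum the lengths of all branches on the paths from the leaves to the root."""
    seen = set()
    total = 0.0
    for leaf in leaves:
        node = leaf
        # Walk towards the root, counting each branch only once.
        while node is not None and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def pd_and_pg(tree: Tree, all_leaves: Iterable[str],
              target_leaves: Iterable[str]) -> Tuple[float, float]:
    """Return (PD, PG) of the target leaves as percentages of total branch length."""
    everything = set(all_leaves)
    target = set(target_leaves)
    total = spanned_length(tree, everything)
    pd = 100.0 * spanned_length(tree, target) / total
    # PG: branch length that exists only because the target leaves are present.
    pg = 100.0 * (total - spanned_length(tree, everything - target)) / total
    return pd, pg
```

On a toy tree with one reference leaf and two UBA leaves, PD counts every branch leading to a UBA leaf, whereas PG counts only the branches that would disappear if the UBA leaves were removed.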
Genomic representation of several bacterial lineages was greatly expanded by the UBA genomes (Fig. 5). The bacterial phyla with the largest increase in PD were the underrepresented Aminicenantes, Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (PG of >75%). Over 75% of the bacterial UBA genomes belong to the Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These genomes expand the PD of these phyla by 14–47%, despite >50% of existing genomic representatives belonging to these four phyla (Fig. 5 and Supplementary Table 15). Such high levels of increased phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of 143 classes being expanded by >20% (Supplementary Table 16). The UBA genomes have no representatives in only 10 bacterial phyla, which we attribute to the narrow ecological range and/or low relative abundance of microorganisms belonging to these lineages (for example, NC10 and Aerophobetes). Within the Archaea, the PD of 10 of 21 phyla and 7 of 17 classes increased by >20% (Fig. 5 and Supplementary Tables 15 and 16). This includes well-established archaeal groups such as the Euryarchaeota (PG of 34.8%) and Thaumarchaeota (PG of 40.8%) and poorly sampled groups such as the Micrarchaeota, Pacearchaeota and Woesearchaeota which all had a PG of >35%.
Improved genomic representatives within several lineages
There are 12 bacterial phyla where the UBA genomes are estimated to be the highest-quality representatives (Supplementary Table 17). Among these is the Aminicenantes, where the number of available genomes increased from 37 to 47 with the addition of the UBA genomes, and the highest-quality genome improved from 86.9% to 91.9% complete, with the five highest-quality genomes all being UBA genomes. There are currently seven Hydrogenedentes genomes (five NCBI and two UBA), with the two highest-quality representatives being UBA genomes and appreciably improving upon the best previously available representative (88.3% complete, 4.3% contaminated to 98.9% complete, 1.1% contaminated). The most substantial improvement was in the Latescibacteria, where all previous representatives were derived from single cells and the UBA genomes improve the best-quality representative from 57.6% to 95.6% complete.
The UBA genomes are also the highest-quality representatives of five archaeal phyla (Supplementary Table 17). Notably, the Parvarchaeota was previously represented by only four MAGs with the highest quality being 76.7% complete with 5.6% contamination, and there are 11 Parvarchaeota UBA MAGs with an average completeness and contamination of 80.6 ± 4.2% and 1.7 ± 0.6%, respectively. The UBA genomes also improve the completeness of the best representative within the Micrarchaeota (Micrarchaeum acidiphilum ARMAN-2) from 84% complete to 91% complete, while adding 11 other genomes all estimated to be >70% complete with <3% contamination.
An alternative view of the CPR
The CPR has recently been proposed as a major collection of candidate phyla in the bacterial domain30. Under the bac120 tree and mean branch length to extant taxa criterion, the addition of the UBA genomes slightly reduces the percentage of phylum-level lineages represented by the CPR from a maximum of 29.4% to 26.3%, despite UBA genomes being the sole representatives within a number of genus- to class-level CPR lineages (Fig. 6a,b). Interestingly, the percentage of phylum-level lineages within the CPR increases substantially when considering the rp1 and rp2 trees where the maximum percentages are 38.7% and 38.3%, respectively (Fig. 6c,d). Consequently, under the bac120 tree and mean branch length to extant taxa criterion the CPR contains approximately the same percentage of phylum-level lineages as the Firmicutes and Actinobacteria combined, whereas under the rp1 tree the CPR is far more prominent (Fig. 6e,f). Interestingly, under the bac120 tree the CPR shows a pronounced increase in the relative percentage of lineages attributed to being of phylum-level diversity and at more specific taxonomic ranks contains approximately the same or fewer lineages than the Firmicutes (Fig. 6f).
A recent genome-based tree of life depicted the CPR as representing ~50% of bacterial lineages when aiming to recapitulate named phyla under the mean branch length to extant taxa criterion30. This analysis was conducted using the same 16 ribosomal proteins comprising the rp1 marker set and resulted in 39 of 76 (51%) phylum-level lineages belonging to the CPR. Here, we provide an alternative view with lineages collapsed under the same constraints on the bac120 tree. This results in the CPR being represented by only 20 of 76 (26.3%) phylum-level lineages (Supplementary Fig. 7), which occurs at the mean branch length threshold (namely, 0.85 substitutions per site) resulting in the maximum percentage of phylum-level CPR lineages (Fig. 6b).
Despite considerable progress, many lineages known from 16S rRNA surveys still lack genomic representation4. Here, we expand the phylogenetic diversity of bacterial and archaeal genome trees by >30% through the addition of 7,280 bacterial and 623 archaeal genomes obtained from over 1,500 public metagenomes (Fig. 5). These MAGs span the majority of recognized bacterial and archaeal phyla and include the first genomic representatives of 17 bacterial and three archaeal phyla (Figs. 2 and 3). The 7,903 genomes reported in this study range in quality from 50% complete to meeting the HMP criteria for high-quality draft genomes3,37 (Fig. 1). They are more complete than those typically derived using single-cell genomics4 and are of similar quality to those reported in other studies considering MAGs10,11,44. We have focused on these genomes, which represent only ~12% of the 64,295 recovered bins, as they are of sufficient quality to inform analyses such as resolving phylogenetic relationships4,30 and comparing inter- and intra-lineage genomic features45,46,47. Importantly, these results demonstrate that a large amount of microbial diversity remains to be genomically described across the tree of life, even within existing metagenomic samples, and that this diversity is readily recovered using current tools and methodologies.
MAGs often lack 16S rRNA genes due to their conserved and repetitive nature impeding assembly1,10,43. The UBA genomes are no exception, with only 17.3% of bacterial UBA genomes containing a partial 16S rRNA gene and 10.2% having a fragment of ≥600 bp. Recovery was more successful in the archaeal UBA genomes, with 32.7% containing a 16S rRNA gene fragment ≥600 bp. We attribute this discrepancy to the higher average 16S rRNA copy number in Bacteria relative to Archaea48 (bacterial mean = 4.12, archaeal mean = 1.63). Challenges in assembling and binning 16S rRNA genes motivated the use of protein-coding genes for the phylogenetic analyses presented in this and previous MAG studies45,46.
Recently, the diversity of the CPR was explored in the context of a genome tree inferred from 16 ribosomal proteins where it was divided into 36 named phyla and shown to represent approximately 50% of bacterial lineages of equal phylum-level evolutionary distance30. Our analyses using 120 concatenated proteins contrast with this view, as the CPR is shown to comprise ~25% of phylum-level lineages under the same criterion (Fig. 6b and Supplementary Fig. 7). This suggests that ribosomal proteins within CPR organisms may be evolving atypically relative to other proteins, perhaps as a result of their unusual ribosome composition and the presence of self-splicing introns and proteins being encoded within their rRNA genes10. These contrasting views of the diversity of the CPR are equally valid and probably reflect the unique biology of the organisms within this group.
While the SRA represents a large set of publicly available metagenomic data, many additional metagenomes exist in other repositories such as the Integrated Microbial Genomes and Metagenomes49 (IMG/M) database and Metagenomics Rapid Annotation Server50 (MG-RAST). We expect that processing these metagenomes will add tens of thousands of additional genomes to the tree of life. Furthermore, methods for assembling and binning metagenomic data are continually improving, which makes it likely that systematic reprocessing of metagenomic data will result in the recovery of new genomes and improved versions of previously obtained genomes.
The number and diversity of genomes presented in this study, and the many similar studies we anticipate will follow, move us closer to a comprehensive genomic representation of the microbial world. Detailed examination of such genomes will further our understanding of microbial evolution and metabolic diversity, and provide important insights into the role of microorganisms in both natural and industrial processes. We anticipate that as metagenomic assembly and binning methods mature we will be presented with the challenge and great opportunity to be able to study microbial communities with complete, or near complete, genomic representation in the context of a comprehensive tree of life.
Note added in proof: During finalization of this manuscript, a new standard specifying the minimum information about a metagenome-assembled genome (MIMAG) was proposed51. The medium-quality and partial UBA MAGs meet the medium-quality criteria of the MIMAG standard. However, most of the near-complete UBA MAGs do not meet the stringent rRNA and tRNA requirements for high-quality draft MAGs under this standard, and we therefore deliberately refer to these MAGs as ‘near complete’.
Recovery of cultivation-independent genomes
Metadata for metagenomes in the Sequence Read Archive52 (SRA) at the National Center for Biotechnology Information (NCBI) were obtained from the SRAdb53. Only metagenomes submitted to the SRA before 31 December 2015 were considered with a predominant focus on environmental and non-human gastrointestinal samples (for example, rumen, guinea pigs and baboon faeces; Supplementary Table 1). Metagenomes from studies where MAGs had previously been recovered were excluded if the UBA MAGs did not provide appreciable improvements in genome quality or phylogenetic diversity. Each of the 1,550 metagenomes was processed independently, with all SRA Runs within an SRA Experiment (that is, sequences from a single biological sample) being co-assembled using the CLC de novo assembler v.4.4.1 (CLCBio). Assembly was restricted to contigs ≥500 bp and the word size, bubble size and paired-end insertion size were determined by the assembly software. Assembly statistics are reported for contigs ≥2,000 bp (Supplementary Table 1). Reads were mapped to contigs with BWA54 v.0.7.12-r1039 using the BWA-MEM algorithm with default parameters and the mean coverage of contigs was obtained using the ‘coverage’ command of CheckM25 v.1.0.6. Genomes were independently recovered from each SRA Experiment using MetaBAT21 v.0.26.3 under all five preset parameter settings (that is, verysensitive, sensitive, specific, veryspecific, superspecific). The completeness and contamination of the genomes recovered under each MetaBAT preset were estimated with CheckM using lineage-specific marker genes and default parameters. For each SRA Experiment, only genomes recovered with the MetaBAT preset resulting in the largest number of bins with an estimated completeness >70% and contamination <5% were considered for further refinement and validation.
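The preset-selection rule described above can be sketched as follows. This is a minimal illustration, assuming the completeness and contamination estimates have already been parsed into per-preset lists; the function names, the input layout and the tie-breaking behaviour (first preset encountered wins) are assumptions of this example and are not specified in the text.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (completeness %, contamination %) for each bin.
BinStats = Tuple[float, float]


def count_good_bins(bins: List[BinStats],
                    min_completeness: float = 70.0,
                    max_contamination: float = 5.0) -> int:
    """Count bins with completeness > 70% and contamination < 5%."""
    return sum(1 for comp, cont in bins
               if comp > min_completeness and cont < max_contamination)


def best_preset(results: Dict[str, List[BinStats]]) -> str:
    """Return the preset producing the most good bins.

    Ties are resolved in favour of the preset listed first (an assumption).
    """
    return max(results, key=lambda preset: count_good_bins(results[preset]))
```

For a given SRA Experiment, only the bins produced under the returned preset would be carried forward to refinement.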
Merging of compatible bins
Automated binning methods can produce multiple bins from the same microbial population. The ‘merge’ method of CheckM v.1.0.6 was used to identify pairs of bins where the completeness increased by ≥10% and the contamination increased by ≤1% when merged into a single bin. Bins meeting these criteria were grouped into a single bin if the mean GC of the bins was within 3%, the mean coverage of the bins had an absolute percentage difference ≤25%, and the bins had identical taxonomic classifications as determined by their placement in the reference genome tree used by CheckM. This set of criteria was used to avoid producing chimaeric bins.
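The merging criteria can be expressed as a single predicate, sketched below. The `Bin` record and function names are hypothetical. Two interpretations are assumptions of this example: the GC criterion is treated as a difference of at most three percentage points, and the coverage difference is taken relative to the mean of the two coverage values. The changes in completeness and contamination are supplied by the caller, as they would come from CheckM.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    gc: float        # mean GC content, in percent
    coverage: float  # mean coverage
    taxonomy: str    # classification from the reference tree placement


def abs_pct_diff(a: float, b: float) -> float:
    """Absolute percentage difference relative to the mean of a and b (assumed)."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0


def can_merge(a: Bin, b: Bin, delta_completeness: float,
              delta_contamination: float) -> bool:
    """Apply the completeness, contamination, GC, coverage and taxonomy checks."""
    return (delta_completeness >= 10.0
            and delta_contamination <= 1.0
            and abs(a.gc - b.gc) <= 3.0
            and abs_pct_diff(a.coverage, b.coverage) <= 25.0
            and a.taxonomy == b.taxonomy)
```

Requiring all five conditions at once is what makes the merge conservative: a pair that complements each other in marker genes is still rejected if its GC, coverage or taxonomy disagree.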
Filtering scaffolds with divergent genomic properties
Scaffolds with genomic features deviating substantially from the mean GC, tetranucleotide signature, or coverage of a bin were identified with the ‘outliers’ method of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM) using default parameters. This removes all scaffolds with a GC or tetranucleotide distance outside the 98th percentile of the expected distributions of these genomic features, as determined empirically over a set of 5,656 trusted reference genomes25,33. Scaffolds were also removed if their mean coverage had an absolute percentage difference ≥50% when compared to the mean coverage of the bin.
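The coverage criterion is simple enough to sketch directly. The snippet below is not the RefineM implementation; it is a minimal illustration that assumes the mean coverage of the bin is supplied by the caller and that the percentage difference is taken relative to that bin mean. The function name and input layout are hypothetical.

```python
from typing import Dict, List


def coverage_outliers(scaffold_coverage: Dict[str, float],
                      bin_mean_coverage: float,
                      max_pct_diff: float = 50.0) -> List[str]:
    """Return scaffolds whose coverage differs from the bin mean by >= 50%."""
    flagged = []
    for scaffold, cov in scaffold_coverage.items():
        pct_diff = abs(cov - bin_mean_coverage) / bin_mean_coverage * 100.0
        if pct_diff >= max_pct_diff:
            flagged.append(scaffold)
    return sorted(flagged)
```

With a bin mean of 10×, for example, scaffolds at 5× or below and at 15× or above would be removed.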
Filtering scaffolds with incongruent taxonomic classification
Each gene within a bin was assigned a taxonomic classification through homology search using BLASTP55 v.2.2.30+ against a custom database of 12,321 genomes from RefSeq/GenBank56 release 75. This database was constructed from RefSeq and GenBank genomes consisting of ≤300 contigs, having an N50 ≥20 kb and containing ≤10 kb of ambiguous base pairs. A genome was only included in the database if it was estimated to be ≥90% complete, ≤10% contaminated and had an overall quality ≥50 (defined as completeness − 5 × contamination). Quality estimates were determined with CheckM using the lineage-specific workflow and default parameters. Genomes meeting this set of requirements were dereplicated to remove genomes from the same named species with an amino-acid identity (AAI) ≥99.5%. AAI values were calculated with CompareM v.0.0.13 (https://github.com/dparks1134/CompareM) and dereplication performed in a greedy fashion with a preference towards type strains and genomes annotated as complete at NCBI. Genes were assigned the taxonomic classification of their ‘top’ hit or designated as unclassified if the gene had no identified homologue with an E-value ≤1e−2, a percent sequence identity ≥30% and a percent alignment length ≥50%.
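The per-gene classification rule can be illustrated as below. The `Hit` record and its field names are hypothetical, and the text does not state how the ‘top’ hit is ranked; this sketch assumes the hit with the highest bit score among those passing the thresholds.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    taxonomy: str        # taxonomy string of the reference genome
    evalue: float
    pct_identity: float
    pct_aln_len: float   # alignment length as a percentage of the query
    bitscore: float


def classify_gene(hits: List[Hit]) -> Optional[str]:
    """Return the taxonomy of the top passing hit, or None if unclassified."""
    passing = [h for h in hits
               if h.evalue <= 1e-2
               and h.pct_identity >= 30.0
               and h.pct_aln_len >= 50.0]
    if not passing:
        return None
    # Assumption: 'top' hit means the highest bit score.
    return max(passing, key=lambda h: h.bitscore).taxonomy
```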
Scaffolds with incongruent taxonomic classifications were removed from each bin. The consensus classification of a bin at each taxonomic rank was determined by identifying the taxon that occurred at the highest frequency across all classified genes or designated as unclassified if no taxon was represented by ≥50% of the classified genes. Scaffolds where ≥50% of the classified genes at each rank agreed with the consensus classification of the bin were designated as ‘trusted’, and a taxon was considered to be ‘common’ if it comprised ≥5% of the classified genes across the set of trusted scaffolds. A scaffold was considered to be taxonomically incongruent and removed from a bin if the following three conditions were met: (1) it contained ≥5 classified genes and ≥25% of all genes on the scaffold were classified; (2) ≤10% of the classified genes were contained in the set of common taxa at each classified rank; and (3) >50% of classified genes were assigned to the same taxon at each classified rank. Taxonomic classification of genes and identification of scaffolds with divergent taxonomic classifications were performed with the ‘taxon_profile’ and ‘taxon_filter’ methods of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM), respectively.
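The three conditions can be made concrete with a simplified sketch that considers a single taxonomic rank, whereas the procedure described above applies the tests at each classified rank. This is not the RefineM code: the input layout (one taxon label or `None` per gene), the function name and the choice to flag nothing when the bin has no consensus taxon are assumptions of this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def incongruent_scaffolds(scaffolds: Dict[str, List[Optional[str]]]) -> List[str]:
    """Flag taxonomically incongruent scaffolds at one rank.

    Each scaffold maps to one entry per gene: a taxon name, or None if the
    gene is unclassified.
    """
    classified = {s: [t for t in genes if t is not None]
                  for s, genes in scaffolds.items()}
    counts = Counter(t for genes in classified.values() for t in genes)
    total = sum(counts.values())
    if total == 0:
        return []

    # Consensus taxon of the bin: most frequent taxon, if it reaches 50%.
    top_taxon, top_count = counts.most_common(1)[0]
    if top_count / total < 0.5:
        return []  # assumption: no consensus, so nothing is flagged

    # Trusted scaffolds: >= 50% of classified genes agree with the consensus.
    trusted = [s for s, genes in classified.items()
               if genes and genes.count(top_taxon) / len(genes) >= 0.5]
    trusted_counts = Counter(t for s in trusted for t in classified[s])
    n_trusted = sum(trusted_counts.values())
    # Common taxa: >= 5% of classified genes across trusted scaffolds.
    common = {t for t, c in trusted_counts.items() if c / n_trusted >= 0.05}

    flagged = []
    for s, genes in scaffolds.items():
        cls = classified[s]
        # Condition 1: enough classified genes to make a call.
        if len(cls) < 5 or len(cls) / len(genes) < 0.25:
            continue
        in_common = sum(1 for t in cls if t in common) / len(cls)
        dominant = Counter(cls).most_common(1)[0][1] / len(cls)
        # Conditions 2 and 3: few genes in common taxa, one dominant taxon.
        if in_common <= 0.10 and dominant > 0.5:
            flagged.append(s)
    return flagged
```

The first condition protects short or poorly classified scaffolds from removal, while the third ensures a scaffold is removed only when its genes point consistently to a different taxon.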
Filtering scaffolds with incongruent 16S rRNA genes
Scaffolds were removed from a bin if they contained a complete or partial 16S rRNA gene ≥600 bp with a taxonomic classification incongruent with the taxonomic identity of the bin. BLASTN55 was used to assign each 16S rRNA gene the taxonomy of its closest homologue within a database comprising the 10,769 16S rRNA genes identified within the 12,321 reference genomes discussed in the previous section. The sequence identity to the closest homologue was used to determine the set of ranks that should be examined for congruency. Specifically, previously reported median percent identity values were used to establish conservative thresholds for the taxonomic ranks to consider41: genus ≥98.7%, family ≥96.4%, order ≥92.25%, class ≥89.2%, phylum ≥86.35% and domain ≥83.68%. The taxon at each rank was then compared to the taxonomic classification of the genes across all scaffolds in the bin and designated as incongruent if the taxon was assigned to ≤10% of classified genes. This methodology is implemented in RefineM v.0.0.14.
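The mapping from percent identity to the ranks examined for congruency can be sketched as a simple lookup. The thresholds are those listed above; the names `RANK_THRESHOLDS` and `ranks_to_check` are hypothetical and the snippet is an illustration rather than the RefineM implementation.

```python
from typing import List

# Identity thresholds (percent) from the text, most specific rank first.
RANK_THRESHOLDS = [
    ("genus", 98.7),
    ("family", 96.4),
    ("order", 92.25),
    ("class", 89.2),
    ("phylum", 86.35),
    ("domain", 83.68),
]


def ranks_to_check(pct_identity: float) -> List[str]:
    """Return the ranks at which a 16S rRNA hit is informative."""
    return [rank for rank, threshold in RANK_THRESHOLDS
            if pct_identity >= threshold]
```

A hit at 93% identity, for instance, would be examined from order up to domain but not at the family or genus ranks.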
Selection of refined genomes
Of the 64,295 bins produced by MetaBAT, only the 7,903 genomes with an estimated quality ≥50 (defined as completeness − 5 × contamination), scaffolds resulting in an N50 of ≥10 kb, containing <100 kb ambiguous bases and consisting of <1,000 contigs and <500 scaffolds were considered to be of sufficient quality for further exploration and deposition in public repositories. We adopted the quality criterion of completeness − 5 × contamination as it provides a good signal (completeness) to noise (contamination) ratio, where higher levels of contamination are only permissible when the genome is largely complete. These genomes have been deposited as assemblies in NCBI’s TPA:Assembly database along with alignment files indicating the mapping of SRA reads to UBA genomes.
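The selection criteria amount to a single filter, sketched below with hypothetical function and parameter names. Completeness and contamination are assumed to be CheckM-style percentages, and the thresholds are those stated in this section.

```python
def quality(completeness: float, contamination: float) -> float:
    """Quality score: completeness minus five times contamination (both in %)."""
    return completeness - 5.0 * contamination


def passes_uba_filter(completeness: float, contamination: float,
                      n50_bp: int, ambiguous_bp: int,
                      n_contigs: int, n_scaffolds: int) -> bool:
    """Apply the quality and assembly thresholds used to select genomes."""
    return (quality(completeness, contamination) >= 50.0
            and n50_bp >= 10_000
            and ambiguous_bp < 100_000
            and n_contigs < 1_000
            and n_scaffolds < 500)
```

The factor of five means that a genome estimated to be 60% complete passes only with ≤2% contamination, whereas a 95% complete genome can tolerate up to 9%.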
Comparison of UBA genomes to complete conspecific strains
The 3,438 near-complete (≥90% complete; ≤5% contamination) UBA genomes were compared to complete isolate genomes in RefSeq release 76. Of these, 207 of the UBA genomes were determined to be conspecific strains of complete isolate genomes based on an average nucleotide identity (ANI) and alignment fraction (AF) above 96.5% and 60%, respectively57. ANI and AF values were determined using ANI Calculator57 v.1. The genome size of the UBA genomes was adjusted to account for its estimated completeness and contamination: adjusted genome size = (genome size)/(completeness + contamination). Homologues between UBA genomes and their conspecific counterparts were determined by inferring genes with Prodigal58 v.2.6.3 and establishing sequence similarity with BLASTP v.2.2.30+. A UBA protein was considered homologous to an isolate protein if it was the top hit among all isolate proteins, had an E-value of ≤1e−10, a percent identity of ≥70% and an alignment length spanning ≥70% of the isolate protein.
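A worked example of the genome size adjustment is given below. The function name is hypothetical, and the sketch assumes that completeness and contamination are supplied as percentages and converted to fractions before applying the formula.

```python
def adjusted_genome_size(genome_size_bp: float,
                         completeness_pct: float,
                         contamination_pct: float) -> float:
    """Adjusted size = genome size / (completeness + contamination).

    Completeness and contamination are given in percent and converted to
    fractions here (an assumption about the units used in the formula).
    """
    fraction = (completeness_pct + contamination_pct) / 100.0
    return genome_size_bp / fraction
```

For a 3.0 Mb MAG estimated to be 90% complete with 2% contamination, the adjusted size is 3.0 Mb / 0.92, or approximately 3.26 Mb.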
Proteins used to infer genome trees
Bacterial and archaeal genome trees were inferred from the concatenation of 120 (Supplementary Table 6) and 122 (Supplementary Table 7) phylogenetically informative proteins, respectively. These proteins were identified as being present in ≥90% of bacterial or archaeal genomes and, when present, single-copy in ≥95% of genomes. Protein-coding regions were identified using Prodigal v.2.6.3 (with default parameters, but with Ns treated as masked sequences), translation tables determined using a coding density heuristic25, and the ubiquity of genes determined across genomes from NCBI’s RefSeq release 73 annotated with the Pfam59 v.27 and TIGRFAMs60 v.15.0 databases. Only genomes composed of ≤200 contigs, with an N50 of ≥20 kb and with CheckM completeness and contamination estimates of ≥95% and ≤5%, respectively, were considered. Phylogenetically informative proteins were determined by filtering ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees. Specifically, the initial set of 188 bacterial (187 archaeal) proteins was randomly subsampled to 132 genes (~70%) and concatenated to infer a subsampled genome tree. Gene subsampling was independently performed 100 times to establish well-supported splits, which we define as any split occurring in >80% of the subsampled trees and with ≥1% of taxa contained in both bipartitions induced by the split. The congruence between a gene tree and the subsampled genome tree was measured as the fraction of well-supported split lengths compatible with the gene tree, a measure we call the ‘normalized compatible split length’. Genes with a normalized compatible split length of ≤50% were removed, as this poor congruence may indicate the presence of lateral gene transfer events. Proteins were aligned to Pfam and TIGRFAMs HMMs using HMMER61 v.3.1b1 with default parameters and trees were inferred with FastTree62 v.2.1.7 under the WAG+GAMMA models.
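One way to read the ‘normalized compatible split length’ is sketched below. A split is represented by one side of its bipartition, and two splits are treated as compatible when at least one of the four pairwise intersections of their sides is empty, which is the standard compatibility test for bipartitions. The data layout, the function names and the assumption that the gene tree and the genome trees share the same taxon set are choices made for this example and are not taken from the study's code.

```python
from typing import Dict, FrozenSet, Iterable

# One side of a bipartition; the other side is the remaining taxa.
Split = FrozenSet[str]


def compatible(a: Split, b: Split, taxa: FrozenSet[str]) -> bool:
    """Two splits are compatible if some pair of their sides is disjoint."""
    a_other, b_other = taxa - a, taxa - b
    return (not (a & b) or not (a & b_other)
            or not (a_other & b) or not (a_other & b_other))


def normalized_compatible_split_length(supported: Dict[Split, float],
                                       gene_tree_splits: Iterable[Split],
                                       taxa: FrozenSet[str]) -> float:
    """Fraction of well-supported split length compatible with a gene tree.

    `supported` maps each well-supported split to its branch length.
    """
    gene_splits = list(gene_tree_splits)
    total = sum(supported.values())
    if total == 0:
        return 0.0
    ok = sum(length for split, length in supported.items()
             if all(compatible(split, g, taxa) for g in gene_splits))
    return ok / total
```

Under this reading, a gene would be discarded when the returned value is ≤0.5, that is, when at most half of the well-supported branch length can coexist with its gene tree.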
Trees were also inferred from two ribosomal protein sets: (1) 16 ribosomal proteins (Supplementary Table 4) that form a syntenic block10,30 and (2) 23 ribosomal proteins (Supplementary Table 5) previously used for tree inference and tested for lateral gene transfer4.
Inference of genome trees
Genome trees were inferred across a dereplicated set of UBA and RefSeq/GenBank release 76 (May 2016; includes 727 single cell and 1,811 MAGs) genomes. All 5,192 RefSeq genomes annotated at NCBI as ‘reference’ or ‘representative’ were retained, except for a low-quality subset of 294 genomes that did not meet our ‘trusted’ genome criteria: composed of ≤300 contigs, N50 ≥20 kb, CheckM completeness and contamination estimates of ≥90% and ≤10%, respectively. This set of 4,898 genomes was augmented with an additional 3,324 RefSeq genomes to retain at least two genomes per species where possible. Preference was given to genomes annotated at NCBI as being a type strain and/or ‘complete’ and restricted to genomes meeting the ‘trusted’ genome filtering criteria. An additional 551 RefSeq genomes currently without a species designation at NCBI, but passing the genome quality filtering, were also added to this initial set of seed genomes.
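The ‘trusted’ genome criteria and the seed-set bookkeeping in this paragraph can be expressed as a small Python sketch. The `GenomeStats` record and its field names are hypothetical, and 20 kb is taken to be 20,000 bp.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    name: str
    contigs: int
    n50_bp: int
    completeness: float   # percent, as estimated by CheckM
    contamination: float  # percent, as estimated by CheckM


def is_trusted(g, max_contigs=300, min_n50_bp=20_000,
               min_completeness=90.0, max_contamination=10.0):
    """Apply the four 'trusted' genome criteria stated in the text."""
    return (
        g.contigs <= max_contigs
        and g.n50_bp >= min_n50_bp
        and g.completeness >= min_completeness
        and g.contamination <= max_contamination
    )


# Seed-set size implied by the counts given in the text.
reference_or_representative = 5_192
failed_trusted_criteria = 294
added_per_species = 3_324
added_without_species = 551
seed_genomes = (reference_or_representative - failed_trusted_criteria
                + added_per_species + added_without_species)
```

The counts are consistent: 5,192 − 294 = 4,898 retained genomes, and adding 3,324 and 551 gives a seed set of 8,773 genomes.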
UBA, GenBank and remaining RefSeq genomes meeting the ‘trusted’ genome criteria were compared to these 8,773 seed genomes. Genomes with an AAI of ≥99.5% to a seed genome, as calculated over the 120 bacterial or 122 archaeal marker genes used for phylogenetic inference, were clustered with the seed genome and do not appear as separate genomes in the genome trees. This cutoff correlated with the proposed 96.5% ANI threshold for defining bacterial and archaeal species57 (Supplementary Fig. 6). Trusted genomes with an AAI <99.5% were added to the seed set. All remaining genomes, regardless of quality, were compared to this final seed set using the same AAI clustering criteria of 99.5%.
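A minimal Python sketch of this dereplication step follows. It is illustrative only: `aai` stands for a hypothetical caller-supplied function returning the AAI (in percent) between two genomes over the marker genes, and the sketch assumes that a genome promoted to the seed set can immediately absorb later genomes, a detail the text leaves open.

```python
def cluster_to_seeds(seeds, candidates, aai, threshold=99.5):
    """Greedy clustering of candidate genomes against a growing seed set.

    Returns {seed: [genomes clustered with that seed]}. The result depends on
    the order in which candidates are processed.
    """
    seeds = list(seeds)
    clusters = {s: [] for s in seeds}
    for genome in candidates:
        best = max(seeds, key=lambda s: aai(genome, s), default=None)
        if best is not None and aai(genome, best) >= threshold:
            clusters[best].append(genome)   # represented by the seed in the tree
        else:
            seeds.append(genome)            # becomes a new seed genome
            clusters[genome] = []
    return clusters
```

In the procedure described above, only genomes meeting the ‘trusted’ criteria would be passed through this promotion step; the remaining genomes are compared to the final seed set with the same 99.5% cutoff.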
Seed and unclustered genomes with an estimated genome quality ≥50 (defined as completeness − 5 × contamination) were used to create an initial multiple sequence alignment, with the exception of the 797 CPR genomes10, which were retained regardless of their estimated quality. Proteins were identified and aligned using HMMER v.3.1b1 and the resulting alignment trimmed to remove columns represented by <50% of taxa or without a common amino acid in ≥25% of taxa. Genomes with amino acids in <40% of aligned columns (20% for the lenient archaeal trees) were removed from consideration. The 120 concatenated bacterial protein set consisted of 34,796 aligned columns after trimming and was inferred over 19,198 genomes. The 122 concatenated archaeal protein set contained 28,025 trimmed columns and spanned 1,012 genomes when using standard filtering criteria and 27,942 columns spanning 1,070 genomes when using lenient filtering. Trees were inferred with FastTree v.2.1.7 under the WAG+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.
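The quality score and the trimming rules in this paragraph can be sketched in Python as follows. The function names and the dictionary-based alignment layout are hypothetical, gaps are assumed to be written as '-', and the 25% consensus threshold is assumed to be measured against all taxa rather than only those with a residue in the column.

```python
from collections import Counter


def genome_quality(completeness, contamination):
    """Quality = completeness - 5 x contamination (both in percent)."""
    return completeness - 5.0 * contamination


def trim_columns(alignment, min_taxa_frac=0.50, min_consensus_frac=0.25, gap="-"):
    """Drop columns present in <50% of taxa or lacking an amino acid shared by
    >=25% of taxa. alignment: {genome: aligned sequence of equal length}."""
    names = list(alignment)
    if not names:
        return {}
    n_taxa = len(names)
    keep = []
    for i in range(len(alignment[names[0]])):
        residues = [alignment[n][i] for n in names if alignment[n][i] != gap]
        if len(residues) < min_taxa_frac * n_taxa:
            continue
        if Counter(residues).most_common(1)[0][1] < min_consensus_frac * n_taxa:
            continue
        keep.append(i)
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def filter_sparse_genomes(alignment, min_filled_frac=0.40, gap="-"):
    """Remove genomes with amino acids in <40% of the aligned columns."""
    return {
        name: seq for name, seq in alignment.items()
        if seq and sum(ch != gap for ch in seq) >= min_filled_frac * len(seq)
    }
```

Passing `min_filled_frac=0.20` corresponds to the lenient filtering described for the archaeal trees.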
Taxonomic annotation of genome trees
Genome trees were annotated using taxonomic information from the NCBI Taxonomy Database39. Only the canonical seven taxonomic ranks (species to domain) were considered for each genome, and this information was used to annotate internal lineages using tax2tree63. Manual curation was performed to add in phylum information currently missing at NCBI and to resolve polyphyletic groups. Polyphyletic groups that could not be confidently resolved were identified using an underscore and numerical identifier (for example, Deltaproteobacteria_1).
Inference of 16S rRNA trees
Bacterial and archaeal trees were inferred from 16S rRNA genes >600 bp and >1,200 bp within UBA and RefSeq/GenBank release 76 genomes, respectively. The 16S rRNA genes were identified using HMMER and domain-specific SSU/LSU HMM models as implemented in the ‘ssu-finder’ method of CheckM. These genes were aligned with ssu-align64 v.0.1 and trailing or leading columns represented by ≤70% of taxa were trimmed, which resulted in bacterial and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred with FastTree v.2.1.7 under the GTR+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.
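The end-trimming rule can be illustrated with a minimal Python sketch. The function name and alignment layout are hypothetical, gaps are assumed to be written as '-', and only leading and trailing columns are removed, so poorly covered internal columns are left in place.

```python
def trim_alignment_ends(alignment, max_frac=0.70, gap="-"):
    """Trim leading and trailing columns represented by <=70% of taxa.

    alignment: {name: aligned sequence of equal length}.
    """
    names = list(alignment)
    if not names:
        return {}
    n_taxa = len(names)
    length = len(alignment[names[0]])

    def well_covered(i):
        present = sum(1 for n in names if alignment[n][i] != gap)
        return present > max_frac * n_taxa

    start = 0
    while start < length and not well_covered(start):
        start += 1
    end = length
    while end > start and not well_covered(end - 1):
        end -= 1
    return {n: alignment[n][start:end] for n in names}
```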
Similarity of 16S rRNA genes
The percent identity between 16S rRNA genes was calculated from the multiple sequence alignments used to infer the domain-specific 16S rRNA gene trees. The ‘dist.seqs’ command of mothur65 v.1.30.2 was used to calculate percent identity. Default parameters were used, except that gaps at the end of sequences were ignored (countends = F) in order to accommodate partial 16S rRNA sequences. Inter-phylum (inter-class) 16S rRNA percent identity values were determined by identifying the most similar sequence to each sequence within a phylum across all sequences from different phyla (classes).
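The following Python sketch illustrates the idea of ignoring terminal gaps and of finding, for each sequence, its closest relative in a different phylum or class. It is not a reimplementation of mothur's ‘dist.seqs’: the function names are hypothetical, and counting every internal gap position as a mismatch is an assumption made here for simplicity.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences over their overlapping region,
    so that terminal gaps of partial sequences are not penalized."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")

    def span(s):
        idx = [i for i, ch in enumerate(s) if ch != gap]
        return (idx[0], idx[-1]) if idx else None

    sa, sb = span(a), span(b)
    if sa is None or sb is None:
        return 0.0
    start, end = max(sa[0], sb[0]), min(sa[1], sb[1])
    matches = compared = 0
    for i in range(start, end + 1):
        if a[i] == gap and b[i] == gap:
            continue  # column is empty in both sequences
        compared += 1
        if a[i] == b[i]:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def nearest_in_other_group(seqs):
    """seqs: {name: (group, aligned sequence)}. For each sequence, return the
    highest identity to any sequence assigned to a different group."""
    result = {}
    for name, (group, seq) in seqs.items():
        others = [percent_identity(seq, s) for g, s in seqs.values() if g != group]
        if others:
            result[name] = max(others)
    return result
```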
Assessment of phylogenetic and taxonomic diversity
Phylogenetic diversity (total branch length spanned by a set of taxa) and gain (additional branch length contributed by a set of taxa) were calculated using GenomeTreeTk v.0.0.23 (https://github.com/dparks1134/GenomeTreeTk) and verified with ARB66 v.6.0.2. Taxonomic diversity and the percentage of lineages of equal evolutionary distance unique to the UBA genomes were determined using the mean branch length to extant taxa criterion30. Lineages of equal evolutionary distance were related to the distribution of NCBI taxa39 as defined on 19 May 2016 and used to construct the phylum-level lineage view (Supplementary Fig. 7) by evaluating the number of groups formed at mean branch length values of 0.5 to 1.1 with a step size of 0.025. A value of 0.85 was selected as it most closely matched the number of bacterial phyla when excluding the CPR. In agreement with previous analyses30, we used this criterion to explore the taxonomic structure of phylogenetic trees and not to explicitly establish taxonomic status.
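Phylogenetic diversity and gain, as defined at the start of this paragraph, can be sketched in a few lines of Python. This is not the GenomeTreeTk implementation: the tree is encoded as a hypothetical child-to-parent map, and the sketch assumes a rooted tree in which the branch lengths on the path from each taxon to the root are counted.

```python
def phylogenetic_diversity(parent, taxa):
    """Total branch length spanned by `taxa` in a rooted tree.

    parent: {node: (parent node, length of the branch above node)};
    the root has no entry.
    """
    visited = set()
    for taxon in taxa:
        node = taxon
        # Walk towards the root, stopping once a branch has been counted.
        while node in parent and node not in visited:
            visited.add(node)
            node = parent[node][0]
    return sum(parent[node][1] for node in visited)


def phylogenetic_gain(parent, reference_taxa, new_taxa):
    """Additional branch length contributed by `new_taxa` beyond the reference set."""
    combined = set(reference_taxa) | set(new_taxa)
    return (phylogenetic_diversity(parent, combined)
            - phylogenetic_diversity(parent, reference_taxa))
```

With this definition, the percentage gain attributable to a set of genomes is the gain divided by the phylogenetic diversity of the whole tree.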
The AAI and shared gene content between genomes were determined with CompareM v.0.0.21 using default parameters (homologues defined by an E-value ≤0.001, a percent identity ≥30% and an alignment length ≥70%). CompareM reports shared gene content relative to the genome with the fewest identified genes in order to accommodate incomplete genomes. Inter-phylum and inter-class AAI and shared gene content values were determined by sampling up to 50 near-complete (completeness ≥90%, contamination ≤5%, N50 >20 kb, total contigs ≤200) RefSeq release 76 genomes from each named lineage, taking care to sample evenly between named species. The AAI score, defined as the sum of the AAI and shared gene content, was used to determine the most similar genome to each query genome.
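The scoring rule in this paragraph can be summarized with a short Python sketch. The function names are hypothetical and this is not CompareM code; AAI and shared gene content are both assumed to be expressed as percentages.

```python
def shared_gene_content(n_homologues, n_genes_a, n_genes_b):
    """Shared gene content (percent) relative to the genome with fewer genes."""
    smaller = min(n_genes_a, n_genes_b)
    if smaller <= 0:
        raise ValueError("both genomes must have at least one gene")
    return 100.0 * n_homologues / smaller


def aai_score(aai_pct, shared_pct):
    """AAI score = AAI + shared gene content."""
    return aai_pct + shared_pct


def most_similar_genome(comparisons):
    """comparisons: {reference genome: (AAI, shared gene content)}.
    Returns the reference genome with the highest AAI score."""
    return max(comparisons, key=lambda name: aai_score(*comparisons[name]))
```

Normalizing by the smaller gene count means that an incomplete genome is not penalized for genes it is missing, which is the motivation given in the text.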
Genomic and assembly properties
Genomic and assembly properties (for example, GC, N50, coverage) were determined using CheckM. Transfer RNAs were identified with tRNAscan-SE67 v.1.3.1 using either the bacterial or archaeal tRNA model and default parameters.
The UBA genomes have been deposited under NCBI BioProject PRJNA348753. Individual genomes have been deposited at DDBJ/ENA/GenBank and accession numbers are provided in Supplementary Table 2. The initial versions of these genomes are described in this paper.
A correction to this article is available online at https://doi.org/10.1038/s41564-017-0083-5.
The authors thank the many contributors to the Sequence Read Archive for making their data publicly available. This study would not have been possible without this open sharing of data. D.H.P., C.R., M.C. and P.-A.C. are supported by the Australian Centre for Ecogenomics, B.J.W. by an Australian Research Council Discovery Early Career Research Award (DE160100248) and a Genomic Science Program of the United States Department of Energy Office of Biological and Environmental Research grant (DE-SC0010580), P.N.E. by an Australian Research Council Discovery Early Career Research Award (DE170100428), P.H. by an Australian Research Council Laureate Fellowship (FL150100038) and G.W.T. by a University of Queensland Vice Chancellor’s Research Focused Fellowship.