Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life

Parks, Donovan H.; Rinke, Christian; Chuvochina, Maria; Chaumeil, Pierre-Alain; Woodcroft, Ben J.; Evans, Paul N.; Hugenholtz, Philip; Tyson, Gene W.

doi:10.1038/s41564-017-0012-7

Article
Open access
Published: 11 September 2017

Nature Microbiology volume 2, pages 1533–1542 (2017)

An Author Correction to this article was published on 12 December 2017

This article has been updated

Abstract

Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

Main

Sequencing of microbial genomes has accelerated with reductions in sequencing costs, and public repositories now contain nearly 70,000 bacterial and archaeal genomes. The majority of these genomes have been obtained from axenic cultures^1,2 and disproportionately reflect microorganisms of medical importance³. Consequently, current genome repositories are not representative of the microbial diversity known from 16S rRNA gene surveys⁴. Concerted efforts are being made to address this limitation by targeting phylogenetically distinct microorganisms for cultivation^5,6,7 and single-cell sequencing^4,8. Although these approaches continue to provide valuable reference genomes, the former is restricted to microorganisms amenable to cultivation and the latter is hampered by technical challenges and the need for specialised equipment⁹. Obtaining genomes from metagenomes is an emerging approach with the potential for large-scale recovery of near-complete genomes^10,11,12.

Until recently, recovering genomes from metagenomic data was restricted to samples with low microbial diversity¹³, but improved sequencing throughput and advances in computational techniques now allow metagenome-assembled genomes (MAGs) to be recovered from high diversity environments^14,15. MAGs are obtained by grouping or ‘binning’ together assembled contigs with similar sequence composition, depth of coverage across one or more related samples and taxonomic affiliations^16,17. Several tools have been developed that exploit these sources of information to produce genomes from metagenomic data^18,19,20,21 and there are ongoing efforts to evaluate the effectiveness of different approaches²². Although closed genomes have been obtained using metagenomic binning methods^10,23, MAGs are typically incomplete and may contain contigs from multiple strains or species due to challenges in distinguishing between related community members both in the assembly and binning processes^19,24. This has spurred the development of methods for assessing the quality of recovered MAGs in order to allow biological inferences to be made with regards to their estimated completeness and contamination^25,26.

Significant insights have recently been made based on the MAGs of uncultivated microorganisms. These include elucidation of several phyla previously lacking genomic representatives^27,28,29, including the Patescibacteria superphylum⁴, which has subsequently been referred to as the ‘Candidate Phyla Radiation’ (CPR) as it may consist of upwards of 35 candidate phyla^10,30. Notable evolutionary and metabolic insights include the discovery of eukaryotic-like cytoskeleton genes in the archaeon Lokiarchaeota^31,32 and the identification of putative methane-metabolizing genes in the Bathyarchaeota and Verstraetearchaeota phyla^33,34. These initial studies demonstrate the need for additional genomic representatives across the tree of life in order to more fully appreciate microbial evolution and metabolism.

Here, we present the first large-scale initiative to recover MAGs from publicly available metagenomes. Nearly 8,000 draft-quality genomes were recovered from over 1,500 metagenomes, more than a threefold increase over large initiatives to genomically populate the tree of life such as the Genomic Encyclopedia of Bacteria and Archaea³⁵ (~2,000 genomes), the Human Microbiome Project³ (~2,000) and the largest previous MAG study¹¹ (~2,500). We refer to our set of MAGs as the Uncultivated Bacteria and Archaea (UBA) data set. Genome-based phylogenetic analysis indicates that the UBA genomes provide the first representatives of several major bacterial and archaeal lineages and substantially expand genomic representation across the tree of life.

Results

Genomes are readily recovered from metagenomic data

MAGs were recovered from 1,550 metagenomes submitted to the Sequence Read Archive (SRA) before 31 December 2015 (Supplementary Fig. 1). We predominantly considered environmental and non-human gastrointestinal samples in order to focus on metagenomes likely to contain microbial populations from under-sampled lineages (Supplementary Table 1). The completeness and contamination of each MAG were estimated from the presence and absence of lineage-specific genes expected to be ubiquitous and single copy²⁵, and these estimates, along with assembly statistics, were used to identify genomes suitable for further study. A total of 64,295 MAGs were obtained, of which 7,903 (7,280 bacterial and 623 archaeal) form the UBA data set as they met our filtering criteria of having an estimated quality ≥50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤500 scaffolds with an N50 ≥10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥10× (5th percentile, 9.2×, 95th percentile, 268×) and 95.8% have >5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes^3,36. Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥90% complete with ≤5% contamination (Fig. 1a). These genomes consist of ≤100 scaffolds in 70.2% of cases (≤200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb. Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).

**Fig. 1: Assessment of genome quality.**

The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes^3,37. Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being 4.8% of the genomes having scaffolds with an N50 of <20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality except that they are estimated to be <90% complete.

The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs that encode for an average of 17.3 ± 2.2 of the 20 amino acids and ≥15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRNAs was positive but weak (Supplementary Fig. 2) as tRNAs are regularly present in multiple copies and often collocated in a genome, making them poor markers for robustly estimating completeness^25,38.

Taxonomic distribution of UBA genomes

The phylogenetic relationships of the UBA genomes were determined across bacterial and archaeal trees inferred from three concatenated protein sets: (1) a syntenic block of 16 ribosomal proteins (rp1) recently used to infer genome-based phylogenies^10,30 (Supplementary Table 4), (2) 23 ribosomal proteins (rp2) previously tested for lateral gene transfer⁴ (Supplementary Table 5), and (3) 120 bacterial (bac120) and 122 archaeal (ar122) proteins we have identified as being suitable for phylogenetic inference (Supplementary Tables 6 and 7). The trees span ~19,000 bacterial and ~1,000 archaeal genomes after species-level dereplication of the UBA genomes and 67,479 genomes in RefSeq/GenBank release 76 (Supplementary Table 8).

UBA genomes were represented in the majority of bacterial (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in the NCBI taxonomy³⁹. In addition, they comprise the first genomic representatives of 17 bacterial and 3 archaeal phyla (see section ‘UBA genomes are the first representatives of several phyla’). To provide an objective taxonomic analysis of the UBA data set, we used the phylogenetic criterion of mean branch length to extant taxa³⁰ as existing taxonomic classifications are not phylogenetically uniform¹. The results were highly consistent across all trees and, as expected, named groups at each taxonomic rank vary substantially in their mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 and 4). Based on the range of mean branch length values for established taxa, the bacterial UBA genomes are exclusive representatives of 20–30% of all genus- to order-level lineages, 15–30% of class-level lineages and 5–15% of phylum-level lineages (Fig. 4a). Similarly, the archaeal UBA genomes are the only representatives within 20–30% of genus- to order-level lineages, 15–30% of class-level lineages and around 10% of archaeal phyla (Fig. 4b). We also tabulated the number of UBA-exclusive lineages at the 50th, 90th and 95th percentiles of the mean branch length distribution of each taxonomic rank (Supplementary Table 9). At the conservative 90th percentile, the bacterial UBA genomes are the first genomic representatives of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and 38 classes (18.0%) within this domain. Similarly, the archaeal UBA genomes represent 59 genera (30.3%), 25 families (28.4%), 13 orders (23.6%) and 3 classes (15.0%) at the 90th percentile.
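The mean branch length to extant taxa criterion can be illustrated with a small sketch. The code below is only one plausible reading of the criterion, not the procedure used for the analyses above: it assumes a rooted tree stored as nested dictionaries, collapses a clade into a single lineage as soon as its mean root-to-leaf distance falls at or below a chosen threshold, and treats any leaf whose name starts with "UBA" as a UBA genome. The tree, names and thresholds are hypothetical.

```python
def leaf(name, length):
    """A tip of the tree with the length of the branch leading to it."""
    return {"name": name, "length": length, "children": []}

def clade(length, children):
    """An internal node with the length of the branch leading to it."""
    return {"name": None, "length": length, "children": children}

def leaf_distances(node):
    """Path lengths from `node` down to each of its descendant leaves."""
    if not node["children"]:
        return [0.0]
    dists = []
    for child in node["children"]:
        dists.extend(child["length"] + d for d in leaf_distances(child))
    return dists

def leaf_names(node):
    """Names of all leaves below (or at) `node`."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]

def mean_branch_length(node):
    """Mean branch length from `node` to its extant taxa (leaves)."""
    dists = leaf_distances(node)
    return sum(dists) / len(dists)

def collapse(node, threshold):
    """Walk down from the root; the first node on each path whose mean
    branch length is <= threshold becomes one collapsed lineage."""
    if mean_branch_length(node) <= threshold:
        return [node]
    lineages = []
    for child in node["children"]:
        lineages.extend(collapse(child, threshold))
    return lineages

def uba_exclusive_fraction(root, threshold):
    """Fraction of collapsed lineages made up only of UBA genomes
    (assumption: UBA genomes are recognised by a 'UBA' name prefix)."""
    lineages = collapse(root, threshold)
    exclusive = [l for l in lineages
                 if all(n.startswith("UBA") for n in leaf_names(l))]
    return len(exclusive) / len(lineages)

# Hypothetical toy tree: one UBA-only clade and one reference genome.
x = clade(0.5, [leaf("UBA1", 0.25), leaf("UBA2", 0.25)])
root = clade(0.0, [x, leaf("GCF_1", 1.0)])
print(len(collapse(root, 0.5)), uba_exclusive_fraction(root, 0.5))
```

Raising the threshold merges more of the tree into each lineage, which is why the percentage of UBA-exclusive lineages is reported over a range of thresholds rather than at a single value.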

**Fig. 2: Distribution of UBA genomes across 76 bacterial phyla.**

**Fig. 3: Distribution of UBA genomes across 21 archaeal phyla.**

UBA genomes are the first representatives of several phyla

The 17 bacterial and 3 archaeal phyla comprised exclusively of UBA genomes have been given the candidate names Uncultured Bacterial Phylum 1 to 17 (UBP1 to 17, Fig. 2) and Uncultured Archaeal Phylum 1 to 3 (UAP1 to 3, Fig. 3). These candidate phyla form well-supported clans in all three concatenated protein trees (Supplementary Table 10) and are unaffiliated with existing phyla⁴⁰ (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA genes ≥600 bp have inter-phyla percent identity values between 76% and 86% (Supplementary Table 12), in agreement with established phyla⁴¹ (Supplementary Fig. 5). However, these 16S rRNA results should be treated with some caution as the percent identity of incomplete 16S rRNA sequences correlates poorly with values for full-length sequences⁴². Because 16S rRNA genes often fail to assemble⁴³ and are missing from half of the UBP/UAP lineages, we used average amino-acid identity (mean of 46.2%) and shared gene content (mean of 24.4%) calibrated against established phyla to further support the classification of the UBP/UAP as phyla (Supplementary Fig. 5 and Supplementary Table 13).

To further resolve the taxonomic identity of the UBP and UAP genomes, 16S rRNA genes from these genomes were placed into a tree containing genomic and environmental 16S rRNA sequences. Only UBP9, UAP2 and UAP3 could be further taxonomically resolved as the other candidate phyla either lack genomes with a 16S rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary Tables 11 and 14). UBP9 genomes are the first genomic representatives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and were recovered from baboon faeces (five genomes), palm oil effluent (one genome), a toluene degrading community (one genome) and a dechlorination bioreactor (one genome, Supplementary Table 14). UAP2 contains the first representatives of the Marine Hydrothermal Vent Group (MHVG) and consists of three genomes recovered from the Tara Oceans Expedition along with a single genome from the Beebe hydrothermal vent (Supplementary Table 14 and Fig. 3). UAP3 is represented by a single genome recovered from a Costa Rican marine sediment metagenome (Supplementary Table 14) and is the first representative of the Ancient Archaeal Group (AAG), a group adjacent to the Lokiarchaeota (Fig. 3).

UBA genomes substantially increase phylogenetic diversity

The phylogenetic diversity of the UBA genomes was determined across the three concatenated protein trees. The results were highly consistent across these trees with the UBA genomes covering >50% of the total branch length (phylogenetic diversity, PD) spanned by each domain-specific genome tree and increasing total branch length (phylogenetic gain, PG) by ~30% (Fig. 5 and Supplementary Table 8). For comparison, the PD of the CPR (1,056 genomes, including 245 UBA genomes) and Firmicutes (25,992 genomes, including 1,666 UBA genomes) is 11.2% and 25.1%, respectively (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs of ~17% and 30%, respectively. The near-complete and medium-quality archaeal UBA genomes provide a phylogenetic gain of 10% and 29%, respectively (Supplementary Table 8).
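As a rough illustration of how PD and PG differ, the sketch below computes both for a toy tree. It rests on several assumptions that are not taken from the text: the tree is treated as rooted, each branch is stored with the set of leaves beneath it, PD of a group is the branch length leading to at least one of its members, PG is the branch length leading only to its members, and both are expressed as a fraction of the total branch length of the full tree. All names and lengths are hypothetical.

```python
# Each branch: (length, set of leaf names below that branch).
branches = [
    (0.5,  {"UBA1", "UBA2"}),
    (0.25, {"UBA1"}),
    (0.25, {"UBA2"}),
    (1.0,  {"GCF1"}),
    (0.5,  {"UBA3", "GCF2"}),
    (0.25, {"UBA3"}),
    (0.25, {"GCF2"}),
]

def total_length(branches):
    """Total branch length of the tree."""
    return sum(length for length, _ in branches)

def phylogenetic_diversity(branches, group):
    """Fraction of total branch length on paths leading to >=1 group member."""
    spanned = sum(length for length, leaves in branches if leaves & group)
    return spanned / total_length(branches)

def phylogenetic_gain(branches, group):
    """Fraction of total branch length found only because the group is present,
    that is, branches whose descendant leaves all belong to the group."""
    gained = sum(length for length, leaves in branches if leaves <= group)
    return gained / total_length(branches)

uba = {"UBA1", "UBA2", "UBA3"}
print(round(phylogenetic_diversity(branches, uba), 3))
print(round(phylogenetic_gain(branches, uba), 3))
```

In this toy example the branch shared by UBA3 and GCF2 counts towards PD but not PG, which is why PD is always at least as large as PG for the same set of genomes.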

**Fig. 5: Phylogenetic diversity and gain for select bacterial and archaeal phyla.**

Genomic representation of several bacterial lineages was greatly expanded by the UBA genomes (Fig. 5). The bacterial phyla with the largest increase in PD were the underrepresented Aminicenantes, Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (PG of >75%). Over 75% of the bacterial UBA genomes belong to the Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These genomes expand the PD of these phyla by 14–47%, despite >50% of existing genomic representatives belonging to these four phyla (Fig. 5 and Supplementary Table 15). Such high levels of increased phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of 143 classes being expanded by >20% (Supplementary Table 16). The UBA genomes have no representatives in only 10 bacterial phyla, which we attribute to the narrow ecological range and/or low relative abundance of microorganisms belonging to these lineages (for example, NC10 and Aerophobetes). Within the Archaea, the PD of 10 of 21 phyla and 7 of 17 classes increased by >20% (Fig. 5 and Supplementary Tables 15 and 16). This includes well-established archaeal groups such as the Euryarchaeota (PG of 34.8%) and Thaumarchaeota (PG of 40.8%) and poorly sampled groups such as the Micrarchaeota, Pacearchaeota and Woesearchaeota which all had a PG of >35%.

Improved genomic representatives within several lineages

There are 12 bacterial phyla where the UBA genomes are estimated to be the highest-quality representatives (Supplementary Table 17). Among these is the Aminicenantes, where the number of available genomes increased from 37 to 47 with the addition of the UBA genomes, and the highest-quality genome improved from 86.9% to 91.9% complete, with the five highest-quality genomes all being UBA genomes. There are currently seven Hydrogenedentes genomes (five NCBI and two UBA), with the two highest-quality representatives being UBA genomes and appreciably improving upon the best previously available representative (88.3% complete, 4.3% contaminated to 98.9% complete, 1.1% contaminated). The most substantial improvement was in the Latescibacteria, where all previous representatives were derived from single cells and the UBA genomes improve the best-quality representative from 57.6% to 95.6% complete.

The UBA genomes are also the highest-quality representatives of five archaeal phyla (Supplementary Table 17). Notably, the Parvarchaeota was previously represented by only four MAGs with the highest quality being 76.7% complete with 5.6% contamination, and there are 11 Parvarchaeota UBA MAGs with an average completeness and contamination of 80.6 ± 4.2% and 1.7 ± 0.6%, respectively. The UBA genomes also improve the completeness of the best representative within the Micrarchaeota (Micrarchaeum acidiphilum ARMAN-2) from 84% complete to 91% complete, while adding 11 other genomes all estimated to be >70% complete with <3% contamination.

An alternative view of the CPR

The CPR has recently been proposed as a major collection of candidate phyla in the bacterial domain³⁰. Under the bac120 tree and mean branch length to extant taxa criterion, the addition of the UBA genomes slightly reduces the percentage of phylum-level lineages represented by the CPR from a maximum of 29.4% to 26.3%, despite UBA genomes being the sole representatives within a number of genus- to class-level CPR lineages (Fig. 6a,b). Interestingly, the percentage of phylum-level lineages within the CPR increases substantially when considering the rp1 and rp2 trees where the maximum percentages are 38.7% and 38.3%, respectively (Fig. 6c,d). Consequently, under the bac120 tree and mean branch length to extant taxa criterion the CPR contains approximately the same percentage of phylum-level lineages as the Firmicutes and Actinobacteria combined, whereas under the rp1 tree the CPR is far more prominent (Fig. 6e,f). Interestingly, under the bac120 tree the CPR shows a pronounced increase in the relative percentage of lineages attributed to being of phylum-level diversity and at more specific taxonomic ranks contains approximately the same or fewer lineages than the Firmicutes (Fig. 6f).

**Fig. 6: Percentage of lineages of equal evolutionary distance within the CPR.**

A recent genome-based tree of life depicted the CPR as representing ~50% of bacterial lineages when aiming to recapitulate named phyla under the mean branch length to extant taxa criterion³⁰. This analysis was conducted using the same 16 ribosomal proteins comprising the rp1 marker set and resulted in 39 of 76 (51%) phylum-level lineages belonging to the CPR. Here, we provide an alternative view with lineages collapsed under the same constraints on the bac120 tree. This results in the CPR being represented by only 20 of 76 (26.3%) phylum-level lineages (Supplementary Fig. 7), which occurs at the mean branch length threshold (namely, 0.85 substitutions per site) resulting in the maximum percentage of phylum-level CPR lineages (Fig. 6b).

Discussion

Despite considerable progress, many lineages known from 16S rRNA surveys still lack genomic representation⁴. Here, we expand the phylogenetic diversity of bacterial and archaeal genome trees by >30% through the addition of 7,280 bacterial and 623 archaeal genomes obtained from over 1,500 public metagenomes (Fig. 5). These MAGs span the majority of recognized bacterial and archaeal phyla and include the first genomic representatives of 17 bacterial and three archaeal phyla (Figs. 2 and 3). The 7,903 genomes reported in this study range in quality from 50% complete to meeting the HMP criteria for high-quality draft genomes^3,37 (Fig. 1). They are more complete than those typically derived using single-cell genomics⁴ and are of similar quality to those reported in other studies considering MAGs^10,11,44. We have focused on these genomes, which represent only ~12% of the 64,295 recovered bins, as they are of sufficient quality to inform analyses such as resolving phylogenetic relationships^4,30 and comparing inter- and intra-lineage genomic features^45,46,47. Importantly, these results demonstrate that a large amount of microbial diversity remains to be genomically described across the tree of life, even within existing metagenomic samples, and that this diversity is readily recovered using current tools and methodologies.

MAGs often lack 16S rRNA genes due to their conserved and repetitive nature impeding assembly^1,10,43. The UBA genomes are no exception, with only 17.3% of bacterial UBA genomes containing a partial 16S rRNA gene and 10.2% having a fragment of ≥600 bp. Recovery was more successful in the archaeal UBA genomes, with 32.7% containing a 16S rRNA gene fragment ≥600 bp. We attribute this discrepancy to the higher average 16S rRNA copy number in Bacteria relative to Archaea⁴⁸ (bacterial mean = 4.12, archaeal mean = 1.63). Challenges in assembling and binning 16S rRNA genes motivated the use of protein-coding genes for the phylogenetic analyses presented in this and previous MAG studies^45,46.

Recently, the diversity of the CPR was explored in the context of a genome tree inferred from 16 ribosomal proteins where it was divided into 36 named phyla and shown to represent approximately 50% of bacterial lineages of equal phylum-level evolutionary distance³⁰. Our analyses using 120 concatenated proteins contrast with this view, as the CPR is shown to comprise ~25% of phylum-level lineages under the same criterion (Fig. 6b and Supplementary Fig. 7). This suggests that ribosomal proteins within CPR organisms may be evolving atypically relative to other proteins, perhaps as a result of their unusual ribosome composition and the presence of self-splicing introns and proteins being encoded within their rRNA genes¹⁰. These contrasting views of the diversity of the CPR are equally valid and probably reflect the unique biology of the organisms within this group.

While the SRA represents a large set of publicly available metagenomic data, many additional metagenomes exist in other repositories such as the Integrated Microbial Genomes and Metagenomes⁴⁹ (IMG/M) database and Metagenomics Rapid Annotation Server⁵⁰ (MG-RAST). We expect that processing these metagenomes will add tens of thousands of additional genomes to the tree of life. Furthermore, methods for assembling and binning metagenomic data are continually improving, which makes it likely that systematic reprocessing of metagenomic data will result in the recovery of new genomes and improved versions of previously obtained genomes.

The number and diversity of genomes presented in this study, and the many similar studies we anticipate will follow, move us closer to a comprehensive genomic representation of the microbial world. Detailed examination of such genomes will further our understanding of microbial evolution and metabolic diversity, and provide important insights into the role of microorganisms in both natural and industrial processes. We anticipate that as metagenomic assembly and binning methods mature we will be presented with the challenge and great opportunity to be able to study microbial communities with complete, or near complete, genomic representation in the context of a comprehensive tree of life.

Note added in proof: During finalization of this manuscript, a new standard specifying the minimum information about a metagenome-assembled genome (MIMAG) was proposed⁵¹. The medium-quality and partial UBA MAGs meet the medium-quality criteria of the MIMAG standard. However, most of the near-complete UBA MAGs do not meet the stringent rRNA and tRNA requirements for high-quality draft MAGs under this standard, and we therefore deliberately refer to these MAGs as ‘near complete’.

Methods

Recovery of cultivation-independent genomes

Metadata for metagenomes in the Sequence Read Archive⁵² (SRA) at the National Center for Biotechnology Information (NCBI) were obtained from the SRAdb⁵³. Only metagenomes submitted to the SRA before 31 December 2015 were considered with a predominant focus on environmental and non-human gastrointestinal samples (for example, rumen, guinea pigs and baboon faeces; Supplementary Table 1). Metagenomes from studies where MAGs had previously been recovered were excluded if the UBA MAGs did not provide appreciable improvements in genome quality or phylogenetic diversity. Each of the 1,550 metagenomes was processed independently, with all SRA Runs within an SRA Experiment (that is, sequences from a single biological sample) being co-assembled using the CLC de novo assembler v.4.4.1 (CLCBio). Assembly was restricted to contigs ≥500 bp and the word size, bubble size and paired-end insertion size were determined by the assembly software. Assembly statistics are reported for contigs ≥2,000 bp (Supplementary Table 1). Reads were mapped to contigs with BWA⁵⁴ v.0.7.12-r1039 using the BWA-MEM algorithm with default parameters and the mean coverage of contigs was obtained using the ‘coverage’ command of CheckM²⁵ v.1.0.6. Genomes were independently recovered from each SRA Experiment using MetaBAT²¹ v.0.26.3 under all five preset parameter settings (that is, verysensitive, sensitive, specific, veryspecific, superspecific). The completeness and contamination of the genomes recovered under each MetaBAT preset were estimated with CheckM using lineage-specific marker genes and default parameters. For each SRA Experiment, only genomes recovered with the MetaBAT preset resulting in the largest number of bins with an estimated completeness >70% and contamination <5% were considered for further refinement and validation.
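The preset selection rule in the last sentence can be sketched as follows. This is a minimal illustration, assuming the completeness and contamination estimates for each preset have already been parsed into (completeness, contamination) pairs; the function names and example values are hypothetical, and ties are broken by taking the first preset encountered, which the text does not specify.

```python
from typing import Dict, List, Tuple

# One bin is summarised as (completeness %, contamination %).
BinQuality = Tuple[float, float]

def count_good_bins(bins: List[BinQuality]) -> int:
    """Count bins with estimated completeness >70% and contamination <5%."""
    return sum(1 for comp, cont in bins if comp > 70.0 and cont < 5.0)

def best_preset(results: Dict[str, List[BinQuality]]) -> str:
    """Return the preset that yields the most bins passing the thresholds.
    Ties go to the first preset in dictionary order (an assumption)."""
    return max(results, key=lambda preset: count_good_bins(results[preset]))

# Hypothetical estimates for one SRA Experiment under three presets.
example = {
    "verysensitive": [(95.0, 7.0), (72.0, 1.0), (40.0, 0.5)],
    "sensitive":     [(93.0, 4.0), (72.0, 1.0), (40.0, 0.5)],
    "specific":      [(88.0, 2.0), (71.0, 0.8), (75.0, 3.0)],
}
print(best_preset(example))
```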

Merging of compatible bins

Automated binning methods can produce multiple bins from the same microbial population. The ‘merge’ method of CheckM v.1.0.6 was used to identify pairs of bins where the completeness increased by ≥10% and the contamination increased by ≤1% when merged into a single bin. Bins meeting these criteria were grouped into a single bin if the mean GC values of the bins were within 3%, the mean coverage of the bins had an absolute percentage difference ≤25%, and the bins had identical taxonomic classifications as determined by their placement in the reference genome tree used by CheckM. This set of criteria was used to avoid producing chimaeric bins.
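The merging criteria can be written as a single predicate. The sketch below is illustrative only: the change in completeness and contamination is taken as an input rather than recomputed, "within 3%" is read as three GC percentage points, and the absolute percentage difference in coverage is computed relative to the mean of the two bins. Both readings are assumptions, as are the class and function names.

```python
from dataclasses import dataclass

@dataclass
class Bin:
    gc: float        # mean GC content, in percent
    coverage: float  # mean coverage
    taxonomy: str    # classification from the reference genome tree

def pct_difference(a: float, b: float) -> float:
    """Absolute percentage difference relative to the mean of a and b
    (one possible definition; the text does not fix the denominator)."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

def can_merge(a: Bin, b: Bin,
              delta_completeness: float,
              delta_contamination: float) -> bool:
    """Apply the five merging criteria described above."""
    return (delta_completeness >= 10.0
            and delta_contamination <= 1.0
            and abs(a.gc - b.gc) <= 3.0
            and pct_difference(a.coverage, b.coverage) <= 25.0
            and a.taxonomy == b.taxonomy)

# Hypothetical bins.
b1 = Bin(gc=52.0, coverage=20.0, taxonomy="p__Firmicutes")
b2 = Bin(gc=53.5, coverage=24.0, taxonomy="p__Firmicutes")
b3 = Bin(gc=53.0, coverage=40.0, taxonomy="p__Firmicutes")
print(can_merge(b1, b2, 15.0, 0.5), can_merge(b1, b3, 15.0, 0.5))
```

Requiring all five conditions at once is what keeps the rule conservative: a pair that improves completeness but differs in coverage or taxonomy is left unmerged.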

Filtering scaffolds with divergent genomic properties

Scaffolds with genomic features deviating substantially from the mean GC, tetranucleotide signature, or coverage of a bin were identified with the ‘outliers’ method of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM) using default parameters. This removes all scaffolds with a GC or tetranucleotide distance outside the 98th percentile of the expected distributions of these genomic features, as determined empirically over a set of 5,656 trusted reference genomes^25,33. Scaffolds were also removed if their mean coverage had an absolute percentage difference ≥50% when compared to the mean coverage of the bin.
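The coverage criterion in the final sentence can be sketched in a few lines. This is not RefineM code: the function name and inputs are hypothetical, and the percentage difference is assumed to be taken relative to the mean coverage of the bin.

```python
from typing import Dict, List

def coverage_outliers(scaffold_coverage: Dict[str, float],
                      bin_mean_coverage: float,
                      max_pct_diff: float = 50.0) -> List[str]:
    """Return scaffolds whose mean coverage differs from the bin mean by an
    absolute percentage difference >= max_pct_diff."""
    outliers = []
    for scaffold, cov in scaffold_coverage.items():
        pct_diff = abs(cov - bin_mean_coverage) / bin_mean_coverage * 100.0
        if pct_diff >= max_pct_diff:
            outliers.append(scaffold)
    return outliers

# Hypothetical bin with a mean coverage of 20x.
coverages = {"s1": 21.0, "s2": 9.0, "s3": 30.0, "s4": 29.0}
print(coverage_outliers(coverages, 20.0))
```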

Filtering scaffolds with incongruent taxonomic classification

Each gene within a bin was assigned a taxonomic classification through homology search using BLASTP⁵⁵ v.2.2.30+ against a custom database of 12,321 genomes from RefSeq/GenBank⁵⁶ release 75. This database was constructed from RefSeq and GenBank genomes consisting of ≤300 contigs, having an N50 ≥20 kb and containing ≤10 kb of ambiguous base pairs. A genome was only included in the database if it was estimated to be ≥90% complete, ≤10% contaminated and had an overall quality ≥50 (defined as completeness − 5 × contamination). Quality estimates were determined with CheckM using the lineage-specific workflow and default parameters. Genomes meeting this set of requirements were dereplicated to remove genomes from the same named species with an amino-acid identity (AAI) ≥99.5%. AAI values were calculated with CompareM v.0.0.13 (https://github.com/dparks1134/CompareM) and dereplication performed in a greedy fashion with a preference towards type strains and genomes annotated as complete at NCBI. Genes were assigned the taxonomic classification of their ‘top’ hit or designated as unclassified if the gene had no identified homologue with an E-value ≤1e⁻², a percent sequence identity ≥30% and a percent alignment length ≥50%.
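The per-gene classification rule at the end of this paragraph can be sketched as below. The record layout is hypothetical, and ranking hits by bit score to pick the ‘top’ hit is an assumption; the text only states that the top hit is used and that a gene is unclassified when no hit meets all three thresholds.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    taxonomy: str        # taxonomy string of the subject genome
    evalue: float
    pct_identity: float  # percent sequence identity
    pct_aln_len: float   # percent alignment length
    bitscore: float

def classify_gene(hits: List[Hit]) -> str:
    """Return the taxonomy of the best qualifying hit, or 'unclassified'."""
    passing = [h for h in hits
               if h.evalue <= 1e-2
               and h.pct_identity >= 30.0
               and h.pct_aln_len >= 50.0]
    if not passing:
        return "unclassified"
    # Assumption: the 'top' hit is the qualifying hit with the highest bit score.
    return max(passing, key=lambda h: h.bitscore).taxonomy

# Hypothetical hits for two genes.
gene_a = [Hit("p__Firmicutes", 1e-40, 62.0, 95.0, 310.0),
          Hit("p__Proteobacteria", 1e-20, 45.0, 80.0, 150.0)]
gene_b = [Hit("p__Chloroflexi", 1e-3, 28.0, 90.0, 40.0)]
print(classify_gene(gene_a), classify_gene(gene_b))
```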

Scaffolds with incongruent taxonomic classifications were removed from each bin. The consensus classification of a bin at each taxonomic rank was determined by identifying the taxon that occurred at the highest frequency across all classified genes or designated as unclassified if no taxon was represented by ≥50% of the classified genes. Scaffolds where ≥50% of the classified genes at each rank agreed with the consensus classification of the bin were designated as ‘trusted’, and a taxon was considered to be ‘common’ if it comprised ≥5% of the classified genes across the set of trusted scaffolds. A scaffold was considered to be taxonomically incongruent and removed from a bin if the following three conditions were met: (1) it contained ≥5 classified genes and ≥25% of all genes on the scaffold were classified; (2) ≤10% of the classified genes were contained in the set of common taxa at each classified rank; and (3) >50% of classified genes were assigned to the same taxon at each classified rank. Taxonomic classification of genes and identification of scaffolds with divergent taxonomic classifications were performed with the ‘taxon_profile’ and ‘taxon_filter’ methods of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM), respectively.
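The three conditions can be expressed compactly for a single taxonomic rank. The sketch below is a simplification under stated assumptions: it checks one rank only, whereas the procedure above applies the conditions at each classified rank, and all function names and example data are hypothetical rather than taken from RefineM.

```python
from collections import Counter
from typing import List, Set

def common_taxa(trusted_gene_taxa: List[str], min_fraction: float = 0.05) -> Set[str]:
    """Taxa comprising >=5% of classified genes across trusted scaffolds."""
    counts = Counter(trusted_gene_taxa)
    total = len(trusted_gene_taxa)
    return {taxon for taxon, n in counts.items() if n / total >= min_fraction}

def is_incongruent(scaffold_gene_taxa: List[str],
                   n_genes_on_scaffold: int,
                   common: Set[str]) -> bool:
    """Apply conditions (1) to (3) at one rank.
    scaffold_gene_taxa holds one taxon per classified gene on the scaffold."""
    n_classified = len(scaffold_gene_taxa)
    # (1) enough classified genes, in absolute and relative terms
    if n_classified < 5 or n_classified / n_genes_on_scaffold < 0.25:
        return False
    # (2) at most 10% of classified genes fall within the common taxa
    in_common = sum(1 for t in scaffold_gene_taxa if t in common)
    if in_common / n_classified > 0.10:
        return False
    # (3) more than 50% of classified genes agree on a single taxon
    top_count = Counter(scaffold_gene_taxa).most_common(1)[0][1]
    return top_count / n_classified > 0.50

# Hypothetical bin dominated by Firmicutes genes.
trusted = ["Firmicutes"] * 90 + ["Proteobacteria"] * 8 + ["Chloroflexi"] * 2
common = common_taxa(trusted)
foreign = ["Cyanobacteria"] * 6
mixed = ["Firmicutes"] * 5 + ["Cyanobacteria"]
print(is_incongruent(foreign, 10, common), is_incongruent(mixed, 10, common))
```

Condition (3) guards against removing scaffolds whose genes are simply scattered across many taxa, which is more likely to reflect poor database coverage than contamination.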

Filtering scaffolds with incongruent 16S rRNA genes

Scaffolds were removed from a bin if they contained a complete or partial 16S rRNA gene ≥600 bp with a taxonomic classification incongruent with the taxonomic identity of the bin. BLASTN⁵⁵ was used to assign each 16S rRNA gene the taxonomy of its closest homologue within a database comprising the 10,769 16S genes identified within the 12,321 reference genomes discussed in the previous section. The sequence identity to the closest homologue was used to determine the set of ranks that should be examined for congruency. Specifically, previously reported median percent identity values were used to establish conservative thresholds for the taxonomic ranks to consider⁴¹: genus ≥98.7%, family ≥96.4%, order ≥92.25%, class ≥89.2%, phylum ≥86.35% and domain ≥83.68%. The taxon at each rank was then compared to the taxonomic classification of the genes across all scaffolds in the bin and designated as incongruent if the taxon was assigned to ≤10% of classified genes. This methodology is implemented in RefineM v.0.0.14.
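The rank thresholds lend themselves to a small lookup. The sketch below assumes that a rank is examined whenever the identity to the closest homologue meets that rank's threshold, so a higher identity makes more specific ranks eligible; this reading, and the function names, are assumptions rather than a description of the RefineM implementation.

```python
from typing import List

# Thresholds quoted in the text, from most to least specific rank.
RANK_THRESHOLDS = [
    ("genus", 98.7),
    ("family", 96.4),
    ("order", 92.25),
    ("class", 89.2),
    ("phylum", 86.35),
    ("domain", 83.68),
]

def ranks_to_check(pct_identity: float) -> List[str]:
    """Ranks whose identity threshold is met by the closest 16S homologue."""
    return [rank for rank, threshold in RANK_THRESHOLDS
            if pct_identity >= threshold]

def taxon_is_incongruent(taxon: str, bin_gene_taxa: List[str]) -> bool:
    """A taxon is incongruent if it is assigned to <=10% of the classified
    genes in the bin at the rank being examined."""
    fraction = bin_gene_taxa.count(taxon) / len(bin_gene_taxa)
    return fraction <= 0.10

# Hypothetical example: a 16S gene 93% identical to its closest homologue.
print(ranks_to_check(93.0))
order_calls = ["Clostridiales"] * 95 + ["Bacillales"] * 5
print(taxon_is_incongruent("Bacillales", order_calls))
```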

Selection of refined genomes

Of the 64,295 bins produced by MetaBAT, only the 7,903 genomes with an estimated quality ≥50 (defined as completeness − 5 × contamination), scaffolds resulting in an N50 of ≥10 kb, containing <100 kb ambiguous bases and consisting of <1,000 contigs and <500 scaffolds were considered to be of sufficient quality for further exploration and deposition in public repositories. We adopted the quality criterion of completeness − 5 × contamination as it provides a good signal (completeness) to noise (contamination) ratio, where higher levels of contamination are only permissible when the genome is largely complete. These genomes have been deposited as assemblies in NCBI’s TPA:Assembly database along with alignment files indicating the mapping of SRA reads to UBA genomes.
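The selection criteria in this section amount to a simple filter. The following sketch uses hypothetical field names and example genomes, and follows the thresholds exactly as worded in this paragraph.

```python
from dataclasses import dataclass

@dataclass
class Genome:
    completeness: float   # estimated completeness, in percent
    contamination: float  # estimated contamination, in percent
    n50: int              # scaffold N50, in bp
    ambiguous_bases: int  # number of ambiguous bases
    n_contigs: int
    n_scaffolds: int

def quality(g: Genome) -> float:
    """Quality score: completeness minus five times contamination."""
    return g.completeness - 5.0 * g.contamination

def passes_filter(g: Genome) -> bool:
    """Apply the quality, N50, ambiguous-base, contig and scaffold criteria."""
    return (quality(g) >= 50.0
            and g.n50 >= 10_000
            and g.ambiguous_bases < 100_000
            and g.n_contigs < 1_000
            and g.n_scaffolds < 500)

# Hypothetical genomes: the second fails on quality (60 - 5 * 3 = 45).
good = Genome(92.0, 3.0, 45_000, 1_200, 310, 150)
poor = Genome(60.0, 3.0, 45_000, 1_200, 310, 150)
print(quality(good), passes_filter(good), passes_filter(poor))
```

The factor of five means that each percentage point of contamination must be offset by five points of completeness, so a genome with 10% contamination needs to be estimated as fully complete to reach the threshold.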

Comparison of UBA genomes to complete conspecific strains

The 3,438 near-complete (≥90% complete; ≤5% contamination) UBA genomes were compared to complete isolate genomes in RefSeq release 76. Of these, 207 were determined to be conspecific strains of complete isolate genomes based on an average nucleotide identity (ANI) and alignment fraction (AF) above 96.5% and 60%, respectively⁵⁷. ANI and AF values were determined using ANI Calculator⁵⁷ v.1. The size of each UBA genome was adjusted to account for its estimated completeness and contamination: adjusted genome size = (genome size)/(completeness + contamination). Homologues between UBA genomes and their conspecific counterparts were determined by inferring genes with Prodigal⁵⁸ v.2.6.3 and establishing sequence similarity with BLASTP v.2.2.30+. A UBA protein was considered homologous to an isolate protein if it was the top hit among all isolate proteins, had an E-value of ≤1 × 10⁻¹⁰, a percent identity of ≥70% and an alignment length spanning ≥70% of the isolate protein.
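
A worked form of the size adjustment and the homology criteria is given below. It is a sketch that assumes completeness and contamination are supplied as fractions (for example, 0.90 and 0.02) and that the caller has already restricted BLASTP results to the top hit; the function and parameter names are illustrative only.

```python
def adjusted_genome_size(genome_size, completeness, contamination):
    """Adjusted size = genome size / (completeness + contamination).

    completeness and contamination are assumed to be fractions, not percentages.
    """
    return genome_size / (completeness + contamination)

def is_homologous(evalue, percent_identity, alignment_length, isolate_protein_length):
    """Thresholds from the text, applied to the top BLASTP hit of a UBA protein."""
    return (
        evalue <= 1e-10
        and percent_identity >= 70.0
        and alignment_length >= 0.70 * isolate_protein_length
    )
```

For example, a 2.76 Mb genome estimated to be 90% complete with 2% contamination has an adjusted size of 3.0 Mb.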

Proteins used to infer genome trees

Bacterial and archaeal genome trees were inferred from the concatenation of 120 (Supplementary Table 6) and 122 (Supplementary Table 7) phylogenetically informative proteins, respectively. These proteins were identified as being present in ≥90% of bacterial or archaeal genomes and, when present, single-copy in ≥95% of genomes. Protein-coding regions were identified using Prodigal v.2.6.3 (with default parameters, but with Ns treated as masked sequences), translation tables determined using a coding density heuristic²⁵, and the ubiquity of genes determined across genomes from NCBI’s RefSeq release 73 annotated with the Pfam⁵⁹ v.27 and TIGRFAMs⁶⁰ v.15.0 databases. Only genomes composed of ≤200 contigs, with an N50 of ≥20 kb and with CheckM completeness and contamination estimates of ≥95% and ≤5%, respectively, were considered. Phylogenetically informative proteins were determined by filtering out ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees. Specifically, the initial set of 188 bacterial (187 archaeal) proteins was randomly subsampled to 132 genes (~70%) and concatenated to infer a subsampled genome tree. Gene subsampling was independently performed 100 times to establish well-supported splits, which we define as any split occurring in >80% of the subsampled trees and with ≥1% of taxa contained on both sides of the bipartition induced by the split. The congruence between a gene tree and the subsampled genome tree was measured as the fraction of well-supported split lengths compatible with the gene tree, a measure we call the ‘normalized compatible split length’. Genes with a normalized compatible split length of ≤50% were removed, as this poor congruence may indicate the presence of lateral gene transfer events. Proteins were aligned to Pfam and TIGRFAMs HMMs using HMMER⁶¹ v.3.1b1 with default parameters and trees were inferred with FastTree⁶² v.2.1.7 under the WAG+GAMMA models.
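
One possible reading of the normalized compatible split length is sketched below. It assumes that the gene tree and the well-supported splits are defined over the same taxon set, that each split is represented by the set of taxa on one side, and that a split is compatible with a tree when it is pairwise compatible with every split of that tree. The function names and data layout are ours and are not taken from the software used in this study.

```python
def splits_compatible(split_a, split_b, taxa):
    """Two bipartitions are compatible if at least one of the four
    pairwise intersections of their sides is empty."""
    a1, a2 = split_a, taxa - split_a
    b1, b2 = split_b, taxa - split_b
    return not (a1 & b1) or not (a1 & b2) or not (a2 & b1) or not (a2 & b2)

def normalized_compatible_split_length(supported_splits, gene_tree_splits, taxa):
    """Fraction of well-supported split length compatible with a gene tree.

    supported_splits: dict mapping frozenset (one side of a split) -> branch length
    gene_tree_splits: iterable of frozensets, the splits present in the gene tree
    taxa:             frozenset of all taxa
    """
    gene_tree_splits = list(gene_tree_splits)
    total = sum(supported_splits.values())
    if total == 0:
        return 0.0
    compatible_length = sum(
        length
        for split, length in supported_splits.items()
        if all(splits_compatible(split, g, taxa) for g in gene_tree_splits)
    )
    return compatible_length / total
```

Under this sketch, a gene would be discarded when the returned value is ≤0.5.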

Trees were also inferred from two ribosomal protein sets: (1) 16 ribosomal proteins (Supplementary Table 4) that form a syntenic block¹⁰,³⁰ and (2) 23 ribosomal proteins (Supplementary Table 5) previously used for tree inference and tested for lateral gene transfer⁴.

Inference of genome trees

Genome trees were inferred across a dereplicated set of UBA and RefSeq/GenBank release 76 (May 2016; includes 727 single cell and 1,811 MAGs) genomes. All 5,192 RefSeq genomes annotated at NCBI as ‘reference’ or ‘representative’ were retained, except for a low-quality subset of 294 genomes that did not meet our ‘trusted’ genome criteria: composed of ≤300 contigs, N50 ≥20 kb, CheckM completeness and contamination estimates of ≥90% and ≤10%, respectively. This set of 4,898 genomes was augmented with an additional 3,324 RefSeq genomes to retain at least two genomes per species where possible. Preference was given to genomes annotated at NCBI as being a type strain and/or ‘complete’ and restricted to genomes meeting the ‘trusted’ genome filtering criteria. An additional 551 RefSeq genomes currently without a species designation at NCBI, but passing the genome quality filtering, were also added to this initial set of seed genomes.

UBA, GenBank and remaining RefSeq genomes meeting the ‘trusted’ genome criteria were compared to these 8,773 seed genomes. Genomes with an AAI of ≥99.5% to a seed genome, as calculated over the 120 bacterial or 122 archaeal marker genes used for phylogenetic inference, were clustered with the seed genome and do not appear as separate genomes in the genome trees. This cutoff correlated with the proposed 96.5% ANI threshold for defining bacterial and archaeal species⁵⁷ (Supplementary Fig. 6). Trusted genomes with an AAI <99.5% were added to the seed set. All remaining genomes, regardless of quality, were compared to this final seed set using the same AAI clustering criteria of 99.5%.
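
The dereplication step can be viewed as a greedy assignment of genomes to seeds. The sketch below assumes that candidate genomes are processed sequentially and that a genome joining the seed set becomes available for later comparisons; the text does not specify the processing order, and the `aai` callable is a hypothetical stand-in for the AAI calculation over the marker genes.

```python
def cluster_to_seeds(seeds, candidates, aai, threshold=99.5):
    """Greedily cluster candidate genomes with seed genomes.

    seeds:      initial seed genome identifiers
    candidates: trusted genomes to compare against the seeds, in processing order
    aai:        callable (genome_a, genome_b) -> AAI in percent (assumed helper)
    Returns a dict mapping each seed to the list of genomes clustered with it.
    """
    seeds = list(seeds)
    clusters = {seed: [] for seed in seeds}
    for genome in candidates:
        best = max(seeds, key=lambda s: aai(genome, s), default=None)
        if best is not None and aai(genome, best) >= threshold:
            clusters[best].append(genome)   # represented by its seed in the tree
        else:
            seeds.append(genome)            # becomes a new seed
            clusters[genome] = []
    return clusters
```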

Seed and unclustered genomes with an estimated genome quality ≥50 (defined as completeness − 5 × contamination) were used to create an initial multiple sequence alignment, with the exception of the 797 CPR genomes¹⁰, which were retained regardless of their estimated quality. Proteins were identified and aligned using HMMER v.3.1b1 and the resulting alignment trimmed to remove columns represented by <50% of taxa or without a common amino acid in ≥25% of taxa. Genomes with amino acids in <40% of aligned columns (20% for the lenient archaeal trees) were removed from consideration. The 120 concatenated bacterial protein set consisted of 34,796 aligned columns after trimming and was inferred over 19,198 genomes. The 122 concatenated archaeal protein set contained 28,025 trimmed columns and spanned 1,012 genomes when using standard filtering criteria and 27,942 columns spanning 1,070 genomes when using lenient filtering. Trees were inferred with FastTree v.2.1.7 under the WAG+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.
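
The column and genome filters can be illustrated as follows. This sketch makes several assumptions that the text leaves open: gaps are encoded as "-", the 25% consensus requirement is measured against all taxa rather than against non-gap residues only, and the 40% genome filter is applied to the trimmed columns. The function name and parameters are hypothetical.

```python
from collections import Counter

def trim_alignment(alignment, min_taxa=0.5, min_consensus=0.25, min_coverage=0.4):
    """Trim columns, then drop genomes with too few aligned residues.

    alignment: dict mapping genome name -> aligned sequence (equal lengths)
    """
    names = list(alignment)
    n_taxa = len(names)
    length = len(next(iter(alignment.values())))
    keep = []
    for i in range(length):
        residues = [alignment[n][i] for n in names if alignment[n][i] != "-"]
        # Drop columns represented by <50% of taxa.
        if len(residues) < min_taxa * n_taxa:
            continue
        # Drop columns whose most common amino acid occurs in <25% of taxa.
        if Counter(residues).most_common(1)[0][1] < min_consensus * n_taxa:
            continue
        keep.append(i)
    trimmed = {n: "".join(alignment[n][i] for i in keep) for n in names}
    # Drop genomes with amino acids in <40% of the retained columns.
    return {
        n: seq
        for n, seq in trimmed.items()
        if keep and sum(c != "-" for c in seq) >= min_coverage * len(keep)
    }
```

Setting `min_coverage=0.2` would correspond to the lenient archaeal filtering under the same assumptions.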

Taxonomic annotation of genome trees

Genome trees were annotated using taxonomic information from the NCBI Taxonomy Database³⁹. Only the canonical seven taxonomic ranks (species to domain) were considered for each genome, and this information was used to annotate internal lineages using tax2tree⁶³. Manual curation was performed to add in phylum information currently missing at NCBI and to resolve polyphyletic groups. Polyphyletic groups that could not be confidently resolved were identified using an underscore and numerical identifier (for example, Deltaproteobacteria_1).

Inference of 16S rRNA trees

Bacterial and archaeal trees were inferred from 16S rRNA genes >600 bp and >1,200 bp within UBA and RefSeq/GenBank release 76 genomes, respectively. The 16S rRNA genes were identified using HMMER and domain-specific SSU/LSU HMM models as implemented in the ‘ssu-finder’ method of CheckM. These genes were aligned with ssu-align⁶⁴ v.0.1 and trailing or leading columns represented by ≤70% of taxa trimmed, which resulted in bacterial and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred with FastTree v.2.1.7 under the GTR+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.

Similarity of 16S rRNA genes

The percent identity between 16S rRNA genes was calculated from the multiple sequence alignments used to infer the domain-specific 16S rRNA gene trees. The ‘dist.seqs’ command of mothur⁶⁵ v.1.30.2 was used to calculate percent identity. Default parameters were used, except that gaps at the end of sequences were ignored (countends = F) in order to accommodate partial 16S rRNA sequences. Inter-phylum (inter-class) 16S rRNA percent identity values were determined by identifying the most similar sequence to each sequence within a phylum across all sequences from different phyla (classes).
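
The inter-phylum (or inter-class) comparison reduces to finding, for each sequence, its best match outside its own lineage. A minimal sketch is shown below, assuming pairwise identities have already been computed (for example, from the mothur output) and stored in a dictionary keyed by unordered sequence pairs; the names are illustrative.

```python
def closest_inter_group_identity(identity, group_of):
    """For each sequence, the highest identity to any sequence in another group.

    identity: dict mapping frozenset({seq_a, seq_b}) -> percent identity
    group_of: dict mapping sequence identifier -> phylum (or class) name
    Sequences with no comparison outside their own group are omitted.
    """
    best = {}
    for seq, group in group_of.items():
        values = [
            identity[frozenset((seq, other))]
            for other, other_group in group_of.items()
            if other_group != group and frozenset((seq, other)) in identity
        ]
        if values:
            best[seq] = max(values)
    return best
```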

Assessment of phylogenetic and taxonomic diversity

Phylogenetic diversity (total branch length spanned by a set of taxa) and gain (additional branch length contributed by a set of taxa) were calculated using GenomeTreeTk v.0.0.23 (https://github.com/dparks1134/GenomeTreeTk) and verified with ARB⁶⁶ v.6.0.2. Taxonomic diversity and the percentage of lineages of equal evolutionary distance unique to the UBA genomes were determined using the mean branch length to extant taxa criterion³⁰. Lineages of equal evolutionary distance were related to the distribution of NCBI taxa³⁹ as defined on 19 May 2016 and used to construct the phylum-level lineage view (Supplementary Fig. 7) by evaluating the number of groups formed at mean branch length values of 0.5 to 1.1 with a step size of 0.025. A value of 0.85 was selected as it most closely matched the number of bacterial phyla when excluding the CPR. In agreement with previous analyses³⁰, we used this criterion to explore the taxonomic structure of phylogenetic trees and not to explicitly establish taxonomic status.
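
Phylogenetic diversity and gain can be illustrated on a small rooted tree. The sketch below is not the GenomeTreeTk implementation: it assumes the tree is stored as a child-to-parent map and measures diversity as the total length of all branches on the paths from the selected taxa to the root, which matches the spanned branch length whenever the taxa span the root.

```python
def phylogenetic_diversity(parent, taxa):
    """Total branch length on the paths from the given taxa to the root.

    parent: dict mapping node -> (parent_node, branch_length); the root has no entry
    taxa:   iterable of leaf identifiers
    """
    visited = set()
    total = 0.0
    for node in taxa:
        # Walk towards the root, counting each branch only once.
        while node in parent and node not in visited:
            visited.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total

def phylogenetic_gain(parent, reference_taxa, new_taxa):
    """Additional branch length contributed by new_taxa beyond reference_taxa."""
    combined = set(reference_taxa) | set(new_taxa)
    return phylogenetic_diversity(parent, combined) - phylogenetic_diversity(
        parent, reference_taxa
    )
```

Dividing the gain by the diversity of the combined set gives the proportion of total branch length contributed by the new taxa.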

Genomic similarity

The AAI and shared gene content between genomes were determined with CompareM v.0.0.21 using default parameters (homologues defined by an E-value ≤0.001, a percent identity ≥30% and an alignment length ≥70%). CompareM reports shared gene content relative to the genome with the fewest identified genes in order to accommodate incomplete genomes. Inter-phylum and inter-class AAI and shared gene content values were determined by sampling up to 50 near-complete (completeness ≥90%, contamination ≤5%, N50 >20 kb, total contigs ≤200) RefSeq release 76 genomes from each named lineage, taking care to sample evenly between named species. The AAI score, defined as the sum of the AAI and shared gene content, was used to determine the most similar genome to each query genome.
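
The AAI score and the normalization of shared gene content can be written out explicitly. This is a sketch with hypothetical function names; it assumes both quantities are expressed as percentages and is not the CompareM interface.

```python
def shared_gene_content(n_shared, n_genes_a, n_genes_b):
    """Shared genes as a percentage of the genome with the fewest genes."""
    return 100.0 * n_shared / min(n_genes_a, n_genes_b)

def aai_score(aai, shared_content):
    """AAI score as defined in the text: AAI plus shared gene content."""
    return aai + shared_content

def most_similar(results):
    """Reference genome with the highest AAI score.

    results: dict mapping reference name -> (AAI, shared gene content)
    """
    return max(results, key=lambda ref: aai_score(*results[ref]))
```

Because the score sums the two terms, a reference with a slightly lower AAI but substantially more shared genes can be selected over one with a higher AAI.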

Genomic and assembly properties

Genomic and assembly properties (for example, GC, N50, coverage) were determined using CheckM. Transfer RNAs were identified with tRNAscan-SE⁶⁷ v.1.3.1 using either the bacterial or archaeal tRNA model and default parameters.

Data availability

The UBA genomes have been deposited under NCBI BioProject PRJNA348753. Individual genomes have been deposited at DDBJ/ENA/GenBank and accession numbers are provided in Supplementary Table 2. The initial versions of these genomes are described in this paper.

Change history

12 December 2017
In the original version of this Article, the authors stated that the archaeal phylum Parvarchaeota was previously represented by only two single-cell genomes (ARMAN-4_'5-way FS' and ARMAN-5_'5-way FS'). However, these are in fact unpublished, low-quality metagenome-assembled genomes (MAGs) obtained from Richmond Mine, California. In addition, the authors overlooked two higher-quality published Parvarchaeota MAGs from the same habitat, ARMAN-4 (ADCE00000000) and ARMAN-5 (ADHF00000000) (B. J. Baker et al., Proc. Natl Acad. Sci. USA 107, 8806–8811; 2010). The ARMAN-4 and ARMAN-5 MAGs are estimated to be 68.0% and 76.7% complete with 3.3% and 5.6% contamination, respectively, based on the archaeal-specific marker sets of CheckM. The 11 Parvarchaeota genomes identified in our study were obtained from different Richmond Mine metagenomes, but are highly similar to the ARMAN-4 (ANI of ~99.7%) and ARMAN-5 (ANI of ~99.6%) MAGs. The highest-quality uncultivated bacteria and archaea (UBA) MAGs with similarity to ARMAN-4 and ARMAN-5 are 82.5% and 83.3% complete with 0.9% and 1.9% contamination, respectively. The Parvarchaeota represents only 0.23% of the archaeal genome tree and addition of the ARMAN-4 and ARMAN-5 MAGs does not change the conclusions of this Article, but does impact the phylogenetic gain for this phylum. This has now been corrected in all versions of the Article. An updated version of Fig. 5 has also been used to replace the previous version, with the row for Parvarchaeota removed, and Supplementary Table 15 and Supplementary Table 17 have both been replaced to reflect the availability of the two additional Parvarchaeota genomes. In addition, the Methods incorrectly stated that all metagenomes identified as being from studies where MAGs had previously been recovered were excluded from consideration.
Metagenomes from studies where MAGs had previously been recovered were retained if the UBA MAGs provided appreciable improvements in genome quality or phylogenetic diversity. All versions of the Article have been updated to indicate the retention of such metagenomes.

References

Hugenholtz, P., Skarshewski, A. & Parks, D. H. in Microbial Evolution (ed. Ochman, H.) 55–65 (Cold Spring Harbor Laboratory Press, New York, 2016).
Solden, L., Lloyd, K. & Wrighton, K. The bright side of microbial dark matter: lessons learned from the uncultivated majority. Curr. Opin. Microbiol. 31, 217–226 (2016).
Nelson, K. E. et al. A catalog of reference genomes from the human microbiome. Science 328, 994–999 (2010).
Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).
Kyrpides, N. C. et al. Genomic encyclopedia of type strains, phase I: the one thousand microbial genomes (KMG-I) project. Stand. Genomic Sci. 9, 1278–1284 (2013).
Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017).
Marcy, Y. et al. Dissecting biological ‘dark matter’ with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl Acad. Sci. USA 104, 11889–11894 (2007).
Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17, 175–188 (2016).
Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
Vanwonterghem, I., Jensen, P. D., Rabaey, K. & Tyson, G. W. Genome-centric resolution of microbial diversity, metabolism and interactions in anaerobic digestion. Environ. Microbiol. 18, 3144–3158 (2016).
Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).
Yeoh, Y. K., Sekiguchi, Y., Parks, D. H. & Hugenholtz, P. Comparative genomics of candidate phylum TM6 suggests that parasitism is widespread and ancestral in this lineage. Mol. Biol. Evol. 33, 915–927 (2016).
Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K. L., Tyson, G. W. & Nielsen, P. H. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
Sharon, I., Morowitz, M. J., Thomas, B. C., Costello, E. K., Relman, D. A. & Banfield, J. F. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. E. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410 (2012).
Imelfort, M., Parks, D. H., Woodcroft, B. J., Dennis, P., Hugenholtz, P. & Tyson, G. W. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2, e603 (2014).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Kang, D. D., Froula, J., Egan, E. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).
Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of computational metagenomics software. Preprint at http://www.biorxiv.org/content/early/2017/06/12/099127 (2017).
Kantor, R. S. et al. Small genomes and sparse metabolisms of sediment-associated bacteria from four candidate phyla. mBio 4, e00708-13 (2013).
Luo, C., Knight, R., Siljander, H., Knip, M., Xavier, R. J. & Gevers, D. ConStrains identifies microbial strains in metagenomic datasets. Nat. Biotechnol. 33, 1045–1052 (2015).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319 (2015).
Eloe-Fadrosh, E. A. et al. Global metagenomic survey reveals a new bacterial candidate phylum in geothermal springs. Nat. Commun. 7, 10476 (2016).
Sekiguchi, Y., Ohashi, A., Parks, D. H., Yamauchi, T., Tyson, G. W. & Hugenholtz, P. First genomic insights into members of a candidate bacterial phylum responsible for wastewater bulking. PeerJ 3, e740 (2015).
Castelle, C. J. et al. Genomic expansion of domain archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).
Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
Williams, T. A., Foster, P. G., Cox, C. J. & Embley, T. M. An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504, 231–236 (2013).
Evans, P. N. et al. Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science 350, 434–438 (2015).
Vanwonterghem, I. et al. Methylotrophic methanogenesis discovered in the archaeal phylum Verstraetearchaeota. Nat. Microbiol. 1, 16170 (2016).
Whitman, W. B. et al. Genomic encyclopedia of bacterial and archaeal type strains, phase III: the genomes of soil and plant-associated and newly described type strains. Stand. Genomic Sci. 10, 26 (2015).
Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
Chain, P. S. et al. Genome project standard in a new era of sequencing. Science 326, 236–237 (2009).
Shepherd, J. & Ibba, M. Bacterial transfer RNAs. FEMS Microbiol. Rev. 39, 280–300 (2015).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 37, D13–D25 (2009).
Hugenholtz, P., Goebel, B. M. & Pace, N. R. Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J. Bacteriol. 180, 4765–4774 (1998).
Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
Schloss, P. D. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput. Biol. 6, e1000844 (2010).
Yuan, C., Lei, J., Cole, J. & Sun, Y. Reconstructing 16S rRNA genes in metagenomic data. Bioinformatics 31, 35–43 (2015).
Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. Data 3, 160050 (2016).
Soo, R. M. et al. An expanded genomic representation of the phylum Cyanobacteria. Genome Biol. Evol. 6, 1031–1045 (2014).
Rahman, N. A., Parks, D. H., Vanwonterghem, I., Morrison, M., Tyson, G. W. & Hugenholtz, P. A phylogenomic analysis of the bacterial phylum Fibrobacteres. Front. Microbiol. 6, 01469 (2015).
Lazar, C. S. et al. Genomic evidence for distinct carbon substrate preferences and ecological niches of Bathyarchaeota in estuarine sediments. Environ. Microbiol. 18, 1200–1211 (2016).
Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. K. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 43, D593–D598 (2015).
Markowitz, V. M. et al. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573 (2014).
Wilke, A. et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res. 44, D590–D594 (2016).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
Zhu, Y., Stephens, R. M., Meltzer, P. S. & Davis, S. R. SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinformatics 14, 19 (2013).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Tatusova, T., Ciufo, S., Fedorov, B., O’Neill, K. & Tolstoy, I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 42, D553–D559 (2014).
Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).
Hyatt, D., Chen, G. L., Locascio, P. F., Land, M. L., Larimer, F. W. & Hauser, L. J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comp. Biol. 7, e1002195 (2011).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610–618 (2012).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
Ludwig, W. et al. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 1363–1371 (2004).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

Acknowledgements

The authors thank the many contributors to the Sequence Read Archive for making their data publicly available. This study would not have been possible without this open sharing of data. D.H.P., C.R., M.C. and P.-A.C. are supported by the Australian Centre for Ecogenomics, B.J.W. by an Australian Research Council Discovery Early Career Research Award (DE160100248) and a Genomic Science Program of the United States Department of Energy Office of Biological and Environmental Research grant (DE-SC0010580), P.N.E. by an Australian Research Council Discovery Early Career Research Award (DE170100428), P.H. by an Australian Research Council Laureate Fellowship (FL150100038) and G.W.T. by a University of Queensland Vice Chancellor’s Research Focused Fellowship.

Author information

Authors and Affiliations

Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, Queensland, 4072, Australia
Donovan H. Parks, Christian Rinke, Maria Chuvochina, Pierre-Alain Chaumeil, Ben J. Woodcroft, Paul N. Evans, Philip Hugenholtz & Gene W. Tyson

Contributions

D.H.P. designed and carried out the study and wrote the manuscript. C.R. and M.C. assisted in interpreting the phylogenies and taxonomically decorating the trees. P.-A.C. and B.J.W. helped with data collection and bioinformatic analyses. P.N.E. assisted in assessing the quality of the recovered genomes. P.H. and G.W.T. guided the study and helped write the manuscript.

Corresponding authors

Correspondence to Philip Hugenholtz or Gene W. Tyson.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A correction to this article is available online at https://doi.org/10.1038/s41564-017-0083-5.

Supplementary information

Supplementary Information

Supplementary Note, Supplementary Figures, Supplementary Tables and Supplementary References.

Supplementary Table 1

Characteristics of SRA experiments.

Supplementary Table 2

Characteristics of UBA genomes.

Supplementary Table 15

Phylogenetic diversity and gain provided by UBA genomes.

Supplementary Table 18

Characteristics of near-complete UBA genomes that are conspecific strains of complete genomes.

Supplementary File 1

Bacterial tree inferred from 120 bacterial marker genes in Newick format.

Supplementary File 2

Archaeal tree inferred from 122 archaeal marker genes in Newick format.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Parks, D.H., Rinke, C., Chuvochina, M. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2, 1533–1542 (2017). https://doi.org/10.1038/s41564-017-0012-7

Received: 28 April 2017
Accepted: 25 July 2017
Published: 11 September 2017
Issue Date: November 2017
DOI: https://doi.org/10.1038/s41564-017-0012-7