Date: 2018-08-27

Deriving the GTDB taxonomy

A data set comprising 87,106 bacterial genomes was obtained from RefSeq/GenBank release 80 and augmented with 11,603 MAGs recovered from Sequence Read Archive metagenomes according to the approach of Parks et al.22. After removal of 2,482 of these genomes on the basis of a completeness/contamination threshold and 1,468 genomes on the basis of a multiple sequence alignment (MSA) threshold, the resulting 94,759 genomes were dereplicated to remove highly similar genomes with high-quality reference material retained as representatives when possible (Online Methods). Nearly 40% (8,559) of the dereplicated data set of 21,943 genomes represents uncultured organisms reflecting the microbial diversity currently being revealed by culture-independent techniques20,21,22. A bacterial genome tree was inferred from the dereplicated data set by applying FastTree to a concatenated alignment of 120 ubiquitous single-copy proteins22 (subsequently referred to as 'bac120') comprising a total of 34,744 columns after trimming of 1,021 columns represented in <50% of the genomes and 5,390 columns with an amino acid consensus <25% (Online Methods). The bac120 data set represents ∼4% of an average bacterial genome and is comparable to other bacterial domain marker sets27,28.
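The column-trimming step described above can be illustrated with a minimal Python sketch. It assumes the alignment is held as equal-length strings with `-` marking gaps, and it assumes that the amino acid consensus of a column is the frequency of its most common residue among non-gap characters; the exact definition is given in the Online Methods and may differ. The function name `trim_columns` and its parameters are illustrative only.

```python
from collections import Counter


def trim_columns(alignment, min_occupancy=0.5, min_consensus=0.25):
    """Return the alignment restricted to columns passing both filters.

    alignment: list of equal-length strings, '-' marks a gap.
    Assumption: consensus is the frequency of the most common residue
    among the non-gap characters of a column.
    """
    n_seqs = len(alignment)
    keep = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if len(residues) / n_seqs < min_occupancy:
            continue  # column represented in too few genomes
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(residues) < min_consensus:
            continue  # no residue is common enough in this column
        keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in alignment]


# Toy alignment: the third column is mostly gaps and is dropped.
toy = ["AC-A", "AC-C", "AD-A", "AEKA"]
trimmed = trim_columns(toy)
```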

Having inferred the concatenated protein phylogeny, we annotated the tree with group names by using the NCBI taxonomy5 standardized to seven ranks (Online Methods). Taxon names were overwhelmingly assigned to interior nodes with high bootstrap support (99.7% ± 2.9%) to ensure taxonomic stability. However, a few poorly supported nodes (<70%) in the bac120 tree were assigned names on the basis of independent analyses or to preserve widely used existing classifications (Supplementary Table 1 and Firmicutes example below). Because more than one-third of the data set represents uncultured organisms, a substantial part of the tree was not effectively annotated with the NCBI genome taxonomy. Therefore, 16S rRNA gene sequences present in the MAGs were classified against the Greengenes8 2013 and SILVA6 v123.1 taxonomies to provide additional taxonomic identifiers. Using a set of criteria to ensure accurate mapping between 16S rRNA and MAG sequences (Online Methods), we labeled 74 groups lacking cultured representatives with 16S rRNA-based names, including well-recognized clades such as SAR202 (ref. 29), WS6 (ref. 30) and ACK-M1 (ref. 31) (Supplementary Table 2). We term all such alphanumeric names nonstandard placeholders to be replaced with standard validated names in due course. Curation of the taxonomy then involved two main tasks: the removal of polyphyletic groups and the normalization of taxonomic ranks according to relative evolutionary divergence (RED).

Removal of polyphyletic groups

Twenty phyla and 25 classes as defined by the NCBI taxonomy could not be reproducibly resolved as monophyletic in the bootstrapped bac120 tree (Supplementary Table 3). Most of these were the result of a small number of misclassified genomes; however, some taxa seemed to be truly polyphyletic, including well-known lineages such as the Firmicutes and Proteobacteria (Supplementary Table 3). The instability of the Firmicutes has previously been noted, primarily as a result of the Tenericutes and/or Fusobacteria moving into or out of the group25,32. In this prominent case, we chose to preserve the existing classification until more in-depth phylogenetic analyses are performed to resolve the issue (rationale described below). Other poorly supported lineages such as the Proteobacteria, which have been widely reported to be polyphyletic on the basis of the 16S rRNA gene8,33 and protein markers34,35, were conservatively divided into stable monophyletic groups. When possible, polyphyletic taxa containing the nomenclature type retained the name, and all other groups were renamed according to the International Code of Nomenclature of Prokaryotes (Online Methods). For lower-level ranks, notably genus, existing names were often retained with alphabetical suffixing to resolve polyphyly in the bac120 tree (for example, Bacillus_A, Bacillus_B and so forth). Only the group containing type material (if known) kept the original unsuffixed name to indicate the validity of the name assignment. This procedure serves two purposes: it preserves continuity in the literature, and it avoids the necessity to propose dozens of new names for highly polyphyletic groups, although we suggest that such renaming should ultimately be done. A total of 436 genera, 152 families and 67 orders were identified as polyphyletic in the tree, thus highlighting important deficiencies in the current taxonomy (Supplementary Table 3). 
The genus Clostridium was the most polyphyletic, representing 121 genera spanning 29 families, and was followed by Bacillus (81 genera across 25 families) and Eubacterium (30 genera across 8 families). However, these numbers were also influenced by rank normalization in some cases (described below).
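The suffixing convention for polyphyletic genera can be sketched as follows. This is an illustration under stated assumptions, not the curation procedure itself: the group identifiers are hypothetical node labels, suffixes are assigned in input order, and continuing with two-letter suffixes after `_Z` is an assumption made here so that highly polyphyletic genera can be accommodated.

```python
from itertools import count, product
from string import ascii_uppercase


def _suffixes():
    """Yield A, B, ..., Z, AA, AB, ... (two-letter continuation is assumed)."""
    for length in count(1):
        for letters in product(ascii_uppercase, repeat=length):
            yield "".join(letters)


def suffix_polyphyletic(name, groups, type_group=None):
    """Name each monophyletic group of a polyphyletic taxon.

    groups: list of hypothetical group identifiers.
    type_group: the group containing type material, if known; it keeps
    the unsuffixed name. All other groups receive alphabetical suffixes.
    """
    suffixes = _suffixes()
    names = {}
    for group in groups:
        if group == type_group:
            names[group] = name
        else:
            names[group] = f"{name}_{next(suffixes)}"
    return names


names = suffix_polyphyletic("Bacillus", ["n1", "n2", "n3"], type_group="n2")
```

When no group is known to contain type material, `type_group` is left as `None` and every group receives a suffix.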

Taxonomic-rank normalization

There is currently no accepted standardized approach for assigning species to higher taxonomic ranks (i.e., genus to phylum), although 16S rRNA sequence identity and amino acid identity (AAI) thresholds have been proposed11,36,37. The assignment of ranks within the NCBI taxonomy is highly variable under both these measures, because they have been proposed relatively recently and have not been widely adopted2,11. We normalized the assignment of higher taxonomic ranks by using RED calculated from the bac120 tree, an approach conceptually similar to that used by Wu et al.38. Our method provides an operational approximation of relative time with extant taxa existing in the present (RED = 1), the last common ancestor occurring at a fixed time in the past (RED = 0) and internal nodes being linearly interpolated between these values according to lineage-specific rates of evolution (Fig. 1 and Online Methods). RED intervals for normalizing taxonomic ranks were defined as the median RED value for taxa at each rank ± 0.1 (Fig. 1). This procedure represents a compromise between strict normalization and the desire to preserve existing group names on well-supported interior nodes. Visualization of the NCBI taxonomy according to RED highlighted a substantial number of over- or underclassified taxa according to the proposed criteria (Fig. 2a). To correct these inconsistencies, we reassigned taxa falling outside of their RED intervals to either a new taxonomic rank (with appropriate nomenclatural changes) or a new node in the tree (Fig. 2b).
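The interval criterion can be written down in a few lines of Python. The sketch assumes RED values have already been computed for the taxa at one rank and that the median is taken over the taxa supplied (the caption of Fig. 2 states that only monophyletic or operationally monophyletic taxa contribute to the median). The function names and the example RED values are hypothetical.

```python
from statistics import median


def red_interval(red_values, half_width=0.1):
    """RED interval for a rank: median RED of its taxa plus or minus 0.1."""
    centre = median(red_values)
    return centre - half_width, centre + half_width


def flag_outliers(taxa_red, half_width=0.1):
    """Report taxa whose RED falls outside the interval of their rank.

    taxa_red: dict mapping taxon name to RED, all taxa at the same rank.
    """
    low, high = red_interval(list(taxa_red.values()), half_width)
    flags = {}
    for taxon, red in taxa_red.items():
        if red < low:
            flags[taxon] = "below interval"
        elif red > high:
            flags[taxon] = "above interval"
    return flags


# Hypothetical family-level RED values.
families = {"f__A": 0.70, "f__B": 0.75, "f__C": 0.72, "f__D": 0.55, "f__E": 0.90}
flags = flag_outliers(families)
```

Taxa flagged in this way are the candidates for reassignment to a different rank or to a different node in the tree.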

Figure 1: Rank normalization through RED. (a) Example illustrating the calculation of RED. Numbers on branches indicate their length, and numbers below each node indicate their RED. The root of the tree is defined to have a RED of zero, and leaf nodes have a RED of one. The RED of an internal node n is linearly interpolated from the branch lengths comprising its lineage, as defined by p + (d/u) × (1 – p), where p is the RED of its parent, d is the branch length to its parent, and u is the average branch length from the parent node to all extant taxa descendant from n. For example, the parent node of leaves C and D has a RED value of 0.75 (0.42 + (2/3.5) × (1 – 0.42)), because its parent has a RED of p = 0.42, the branch length to the parent node is d = 2, and the average branch length from the parent node to C and D is u = (3+4)/2 = 3.5. (b) Bacterial genome tree inferred from 120 concatenated proteins (bac120) and contoured with the RED interval assigned to each taxonomic rank. Adjacent ranks overlap in some instances, because this permits existing group names to be placed on well-supported interior nodes. To accommodate visualizing the RED intervals, the initial tree inferred across 21,943 genomes was pruned to 10,462 genomes by retaining one genome per species. The tree is rooted on the phylum Acetothermia for illustrative purposes. RED values used for rank normalization are averaged over multiple plausible rootings (Online Methods). Examples of taxa with high expected substitution rates are as follows: U, o__UBA9983; T, s__Tropheryma whipplei; M, o__Mycoplasmatales; Bl, f__Blattabacteriaceae; R, g__RC9; P, g__Porphyromonas; L, g__Liberibacter; and B, g__Buchnera. Prefixes indicate taxonomic ranks. (c) The bac120 tree, with branch lengths scaled by RED values, illustrating that rank normalization follows concentric rings that provide an operational approximation of the relative time of divergence.
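The RED recurrence p + (d/u) × (1 – p) can be made concrete with a short Python sketch. It assumes a single fixed rooting (the RED values used for rank normalization are averaged over multiple plausible rootings) and uses a hypothetical `Node` class and a hypothetical five-leaf tree; only the formula itself is taken from the text. In the toy tree, the node above leaves C and D has d = 2 and u = 3.5, as in the worked example, but its parent has a different RED because the rest of the tree differs.

```python
class Node:
    """Minimal rooted-tree node (hypothetical helper, not a library class)."""

    def __init__(self, name, length=0.0, children=()):
        self.name = name
        self.length = length          # branch length to the parent
        self.children = list(children)
        self.red = None


def leaf_distances(node):
    """Distances from node to every leaf descending from it."""
    if not node.children:
        return [0.0]
    return [child.length + dist
            for child in node.children
            for dist in leaf_distances(child)]


def red_step(p, d, u):
    """RED of a node given parent RED p, branch length d and average u."""
    return p + (d / u) * (1 - p)


def assign_red(node, parent_red=None):
    """Assign RED top-down: root = 0, leaves = 1, internal nodes interpolated."""
    if parent_red is None:
        node.red = 0.0
    elif not node.children:
        node.red = 1.0
    else:
        dists = leaf_distances(node)
        # Average distance from the parent node to the leaves below this node.
        u = node.length + sum(dists) / len(dists)
        node.red = red_step(parent_red, node.length, u)
    for child in node.children:
        assign_red(child, node.red)


# Hypothetical tree: root -> (A, P); P -> (B, X); X -> (C, D).
x = Node("X", 2.0, [Node("C", 1.0), Node("D", 2.0)])
p = Node("P", 1.5, [Node("B", 3.5), x])
root = Node("root", 0.0, [Node("A", 5.0), p])
assign_red(root)
print(round(p.red, 2), round(x.red, 2))  # 0.3 0.7
```

Applying `red_step(0.42, 2, 3.5)` reproduces the value of approximately 0.75 given in the worked example of Fig. 1a.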

Figure 2: RED of NCBI and GTDB taxa in a genome tree inferred from 120 concatenated proteins. (a,b) RED of taxa defined by the NCBI (a) and GTDB (b) taxonomies. Each point represents a taxon distributed according to its rank (y axis) and is colored green, orange or red to indicate monophyletic, operationally monophyletic or polyphyletic in the genome tree, respectively. A histogram is overlaid on the points to show the relative density of monophyletic, operationally monophyletic and polyphyletic taxa. The median RED value of each rank is shown by a blue line, and the RED interval for each rank is shown by black lines. Only monophyletic or operationally monophyletic taxa were used to calculate the median RED values for each rank. The GTDB aims to resolve taxa that are over- or underclassified on the basis of their RED value by either reassigning them to a new rank (vertical shift in plot) or moving them to a new interior node (horizontal shift in plot). For example, the family Synergistaceae was normalized by reclassifying the family to encompass only the genera Synergistes, Cloacibacillus, Thermanaerovibrio and Aminomonas, rather than the 12 genera circumscribed by this family in the NCBI taxonomy. Only taxa with two or more subordinate taxa are plotted, because these taxa have positions in the tree indicative of their rank (for example, only 33 of the 99 phyla defined by the GTDB contain two or more classes, and a phylum with a single class consisting of multiple orders is expected to have a RED value commensurate with the rank of class). The number of taxa plotted at each rank is given in parentheses along the y axis.

In contrast to 16S rRNA sequence identity or AAI thresholds, RED normalization accounts for the phylogenetic relationships between taxa and variable rates of evolution. For example, members of the rapidly evolving genus Mycoplasma39 (Fig. 1) are sufficiently diverged to represent two phyla on the basis of a 16S rRNA gene sequence identity threshold of 75% (ref. 11). However, vertebrate-associated Mycoplasma and Ureaplasma diverged from their arthropod-associated sister families only 400 Ma (ref. 39), as is approximately consistent with the emergence of vertebrates40. This evolutionary event occurred much later than the primary diversification of bacterial phyla, which is estimated to have occurred between 2 and 3 Ga (ref. 41). The relatively recent emergence of Mycoplasma is more consistent with their RED-normalized ranking into a single order within the Firmicutes (Fig. 2b) than the two phyla that would be indicated by a 16S rRNA sequence identity of 75%.

Validation of the GTDB taxonomy

The robustness of the approach used to generate the GTDB taxonomy was evaluated with various tree-inference software, evolutionary models, marker sets and genome data sets. We first considered trees inferred with ExaML and IQ-TREE. Because these methods are computationally intensive, it was necessary to decrease the bac120 MSA from 34,744 to 5,038 columns by evenly sampling columns across each of the 120 proteins and to use subsampled sets of 4,985 or 10,462 genomes dereplicated to retain one genome per GTDB genus or species, respectively (Online Methods). We also inferred trees by using FastTree with the reduced MSA and subsampled genome sets to isolate the effect of inference software from data-set reduction. For each of these trees, we determined the optimal position of each GTDB taxon and classified a taxon as monophyletic, operationally monophyletic (defined as having an F measure ≥0.95) or polyphyletic (Online Methods). Most GTDB taxa above the rank of species and with two or more genomes were found to be monophyletic or operationally monophyletic, and only 79 of 2,586 (3.1%) taxa were polyphyletic in one or more of the species-dereplicated FastTree, IQ-TREE or ExaML trees (Fig. 3a and Supplementary Table 4). Notably, 44 of the 79 polyphyletic taxa were found to be polyphyletic in the species-dereplicated FastTree, suggesting that most of the identified incongruence with GTDB taxa was the result of using a subsampled MSA and a dereplicated set of genomes. On average, 95.2% (IQ-TREE), 96.5% (ExaML) and 96.9% (FastTree) of GTDB taxa at each taxonomic rank were classified as monophyletic or operationally monophyletic within the species-dereplicated trees (Fig. 3a and Supplementary Fig. 1a). Taxa that were not monophyletic within the species-dereplicated trees were most often a result of the incongruent placement of a small number of genomes, thus resulting in either direct conflict with the GTDB taxonomy or unresolved groups in the tree (Online Methods). 
Less than 0.1% of genomes had a conflicting taxonomic assignment at any rank in any of the three species-dereplicated trees, and <1.6% had an unresolved taxonomic assignment at any rank, with the exception of order-level assignments in the ExaML tree, for which 7.5% were unresolved (Supplementary Fig. 1b and Supplementary Table 5). This result was primarily due to fragmentation of the order Bacillales in the ExaML tree, which was one of the poorly supported nodes in the bac120 tree (Supplementary Table 1). Taxa at the same taxonomic rank were also observed to have similar RED values in all three species-dereplicated trees, thus indicating that rank normalization is robust to the maximum-likelihood method used, MSA subsampling and genome dereplication (Fig. 3a, Supplementary Fig. 1c and Supplementary Table 1). Similar results were observed for the genus-dereplicated trees and are summarized in Supplementary Tables 1 and 4. The GTDB taxonomy was also robust to model selection: only three taxa were polyphyletic in a tree inferred with FastTree under the LG protein-substitution model instead of the WAG model (Supplementary Table 1).
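The three-way classification of a taxon can be sketched in Python. The sketch assumes that the F measure is the usual harmonic mean of precision and recall, computed between the genomes below a tree node and the genomes assigned to the taxon, and that the optimal position of a taxon is the node maximizing this value; the precise procedure is described in the Online Methods. Function names and the toy genome identifiers are hypothetical.

```python
def f_measure(node_genomes, taxon_genomes):
    """Harmonic mean of precision and recall for one candidate node."""
    node_genomes, taxon_genomes = set(node_genomes), set(taxon_genomes)
    shared = len(node_genomes & taxon_genomes)
    if shared == 0:
        return 0.0
    precision = shared / len(node_genomes)   # node genomes that belong to the taxon
    recall = shared / len(taxon_genomes)     # taxon genomes captured by the node
    return 2 * precision * recall / (precision + recall)


def classify_taxon(candidate_nodes, taxon_genomes, threshold=0.95):
    """Classify a taxon from the best-scoring node among the candidates.

    candidate_nodes: iterable of genome sets, one per tree node.
    """
    best = max(f_measure(node, taxon_genomes) for node in candidate_nodes)
    if best == 1.0:
        return "monophyletic"
    if best >= threshold:
        return "operationally monophyletic"
    return "polyphyletic"


taxon = {f"g{i}" for i in range(40)}
exact_node = set(taxon)
one_intruder = taxon | {"other"}           # F = 80/81, about 0.988
half_node = {f"g{i}" for i in range(20)}   # F = 2/3
```

With a threshold of 0.95, a 40-genome taxon tolerates a small number of misplaced genomes before being called polyphyletic, which matches the observation that most failures involve the incongruent placement of only a few genomes.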

Figure 3: RED and polyphyly of GTDB and NCBI taxa on trees inferred by using varying inference methods and marker sets. (a) Trees inferred with FastTree, IQ-TREE and ExaML from the concatenated alignment of 120 bacterial proteins and spanning 10,462 genomes dereplicated to one genome per species. RED distributions for taxa at each rank are shown relative to the median RED value of the rank. Results are summarized in box-and-whisker plots indicating percentiles 0/100, 5/95, 25/75 and 50. Distributions were calculated over monophyletic and operationally monophyletic taxa with two or more subordinate taxa, because these taxa have positions in the tree indicative of their rank. The number of taxa comprising each distribution is shown next to each box-and-whisker plot. The percentage of taxa classified as polyphyletic in each tree at each rank is indicated by a color gradient from blue to red. (b) Analogous results for trees inferred with FastTree by using 120 bacterial proteins (bac120), 16 ribosomal proteins (rp1) or the 16S rRNA gene and spanning the dereplicated set of 21,943 genomes used to define the GTDB. Plots showing the RED values of individual GTDB and NCBI taxa are shown in Figure 2 and Supplementary Figures 1–7. (c) Hierarchical-cluster tree illustrating the Robinson–Foulds distance between trees inferred with different maximum-likelihood methods, neighbor joining (NJ) and alternative-marker sets (rp1 and 16S rRNA) over a common set of 4,985 genomes constructed by sampling one genome per GTDB genus. The alternative inference methods were also applied to trees originally dereplicated to one genome per species, which were subsequently pruned to the common set of 4,985 genomes. The bac120 tree was used to define the GTDB r80 taxonomy. (d) Hierarchical-cluster tree illustrating the proportion of supported splits in common among trees over the common set of 4,985 genomes. (e,f) Analogous plots to c (e) and d (f), except that pairwise distances were calculated over trees defined on a common set of 10,462 genomes constructed by sampling one genome per GTDB species. Because nonparametric bootstraps could not be determined for IQ-TREE and ExaML when dereplicated at the species level, these trees do not appear in f.

Having established that the GTDB taxonomy is robust across different maximum-likelihood-inference software, we next considered the effect of different marker sets. Applying FastTree to a concatenated alignment of 16 ribosomal proteins20,25 (rp1) resulted in only 199 of the 4,501 (4.4%) GTDB taxa above the rank of species being classified as polyphyletic (Fig. 3b and Supplementary Table 4). On average, 94.7% of GTDB taxa at each taxonomic rank were monophyletic or operationally monophyletic within the rp1 tree; the least was 92.7% at the class level, and the most was 96.5% at the order level (Fig. 3b and Supplementary Fig. 2a). Less than 0.5% of genomes had a conflicting taxonomic assignment at any rank, and <1.5% had an unresolved taxonomic assignment at any rank (Supplementary Fig. 2 and Supplementary Table 5), with the exception of order-level assignments, which were unresolved for 4.0% of genomes. This result was largely due to an instability of the Enterobacterales probably caused by the inclusion of a highly reduced endosymbiont genome, 'Candidatus Zinderia insecticola', in the rp1 tree. As with the inference-software comparisons, we observed that taxa at the same taxonomic rank had similar RED values, thus indicating that rank normalization was largely preserved in the rp1 tree (Fig. 3b and Supplementary Fig. 2c). Performing the same analysis on a 16S rRNA gene tree resulted in 387 of the 2,576 (15.0%) GTDB taxa above the rank of species, with two or more genomes being classified as polyphyletic; and 78.1% (species) to 90.8% (class) of GTDB taxa being recovered as monophyletic or operationally monophyletic (Fig. 3b and Supplementary Fig. 3a). Incongruent taxonomic assignments in the 16S rRNA tree were largely the result of unresolved taxa, and <1.1% of genomes had conflicting assignments at any taxonomic rank (Supplementary Fig. 3b and Supplementary Table 5). 
Taxa at the same rank had similar RED values in the 16S rRNA gene tree, though the spread of values was greater than observed on the bac120 or rp1 trees (Fig. 3b and Supplementary Fig. 3c).

For comparison, we evaluated the congruence of the NCBI taxonomy with the trees inferred by using different inference software (species-dereplicated FastTree, IQ-TREE and ExaML) and marker sets (bac120, rp1 and 16S rRNA). In contrast to the GTDB taxonomy, all trees had numerous discrepancies with the NCBI taxonomy, in terms of both polyphyly and over- and underclassified taxa (Figs. 2 and 3). On average, 26.1% (rp1) to 28.0% (species-dereplicated FastTree) of NCBI taxa were classified as polyphyletic in these trees, and taxa at the same taxonomic rank had highly variable RED distributions (Fig. 3 and Supplementary Figs. 4–7). Only 59.5% to 64.2% of genomes had NCBI taxonomy assignments congruent with the topology of these trees, whereas 76.1% to 96.8% had GTDB assignments in agreement with the tree topologies (Table 1).

Table 1: Congruency of GTDB and NCBI taxonomic classifications with tree topology

Trees inferred from alternative-marker sets showed a higher degree of discordance with the GTDB taxonomy than those inferred by using alternative maximum-likelihood-inference software. To further explore the relationship between alternative-marker sets and inference methods (including neighbor joining), we calculated pairwise tree distances between all trees (Fig. 3c–f and Supplementary Table 6). These results showed that, in terms of both tree topology and supported splits, the choice of maximum-likelihood-inference software is less critical than the choice of marker set, and that genome dereplication and MSA subsampling also have a nontrivial effect on the inferred tree.
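For readers unfamiliar with the Robinson–Foulds distance, the following Python sketch computes it for two trees defined on the same set of leaves. Trees are represented as nested tuples of leaf names, which is a simplification chosen for illustration; the distance returned is the unnormalized count of bipartitions present in exactly one of the two trees, and whether the distances in Fig. 3 were normalized is not assumed here. Bootstrap-supported splits are not modeled.

```python
def leaf_set(tree):
    """All leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the tree, treated as unrooted.

    Each bipartition is stored as the side that excludes a fixed anchor
    leaf, so that the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(sub):
        if isinstance(sub, str):
            return frozenset([sub])
        clade = frozenset().union(*(visit(child) for child in sub))
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaves")
    return len(splits(tree_a) ^ splits(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))  # 2
```

Pruning all trees to a common genome set before comparison, as done for the 4,985- and 10,462-genome sets, is what makes such pairwise distances well defined.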

The stability of the GTDB taxonomy on trees inferred by using subsets of the bac120 marker set and under taxon subsampling was also evaluated in anticipation of decreasing computational burden as the database size increases. Subsampling of the 120 bacterial marker genes was performed 100 times with 60 of the markers randomly selected for each replicate. Notably, 96.7% of GTDB taxa were classified as monophyletic in ≥90% of the replicate trees, and only ten taxa (0.11%) were classified as polyphyletic in ≥50% of replicates (Supplementary Table 1). Given the lower phylogenetic resolution of individual genes26,42, the results from individual gene trees were also highly robust: 86.1% of GTDB taxa were monophyletic in ≥50% of trees (Supplementary Table 1), and all gene trees recovered ≥51.6% of GTDB phyla and ≥82.0% of GTDB genera as monophyletic or operationally monophyletic (Supplementary Table 7). Taxon resampling with one genome per genus was performed 100 times, and representative genomes were randomly selected in each replicate. Across the 1,430 taxa with two or more genera, 97.5% were recovered as monophyletic in ≥90% of the taxon-resampled trees, and only four taxa were classified as polyphyletic in ≥50% of replicates (Supplementary Table 1).
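The marker-subsampling design (100 replicates of 60 markers drawn from 120) and the summary statistic reported for it can be sketched as follows. Tree inference itself is outside the scope of the sketch; the per-replicate monophyly calls are assumed to be supplied as booleans. The marker identifiers, function names and fixed random seed are hypothetical.

```python
import random


def subsample_markers(markers, n_replicates=100, n_selected=60, seed=0):
    """Draw marker subsets without replacement, one subset per replicate."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [sorted(rng.sample(markers, n_selected)) for _ in range(n_replicates)]


def fraction_stable(status_per_replicate, min_fraction=0.9):
    """Fraction of taxa that are monophyletic in at least min_fraction of replicates.

    status_per_replicate: dict mapping taxon to a list of booleans,
    one per replicate tree (True = monophyletic in that tree).
    """
    stable = [
        taxon
        for taxon, calls in status_per_replicate.items()
        if sum(calls) / len(calls) >= min_fraction
    ]
    return len(stable) / len(status_per_replicate)


markers = [f"marker_{i:03d}" for i in range(120)]
replicates = subsample_markers(markers)
toy_calls = {"taxon_a": [True] * 95 + [False] * 5,
             "taxon_b": [True] * 50 + [False] * 50}
```

The same summary applies to the taxon-resampling experiment, with replicate trees built from randomly chosen representative genomes instead of randomly chosen markers.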

Comparison of GTDB with other classifications

Overall, 58% of the 84,634 genomes with an NCBI taxonomy had one or more changes to their classification above the rank of species (Fig. 4a). These changes included both reclassification of taxa and filling in missing rank name information (∼3% of genus to phylum names are currently undefined across the 84,634 genomes with an NCBI taxonomy). On average, 19% of names were changed per rank, the least being 7% at the phylum level and the most being 31% at the order level (Fig. 4a). A total of 199 NCBI names above the rank of species were 'retired' from the GTDB taxonomy mostly as a result of RED normalization (Supplementary Table 8). An analogous comparison to the SILVA taxonomy also showed substantial differences across all taxonomic ranks: 66% of genomes had one or more changes above the rank of species (Supplementary Table 9 and Supplementary Fig. 8). Many of these differences are in common with the NCBI taxonomy, owing to the GTDB rank normalization process; however, there are also many documented differences between NCBI and SILVA43.

Figure 4: Comparison of GTDB and NCBI taxonomies and naming status of GTDB taxa. (a) Comparison of GTDB and NCBI taxonomic assignments across 84,634 bacterial genomes from RefSeq/GenBank release 80. For each rank, a taxon was classified as being unchanged if its name was identical in both taxonomies; passively changed if the GTDB taxonomy provided name information absent in the NCBI taxonomy; or actively changed if the name was different between the two taxonomies. Changes between the GTDB and NCBI taxonomies are fully listed in Supplementary Table 3. (b) Percentage of GTDB taxa at each rank that are validly published and approved; proposed but not validated; or nonstandard placeholder names. The number of taxa at each rank is shown in parentheses.
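The three change categories defined in Fig. 4a translate directly into a small decision rule, sketched here in Python. Representing a missing NCBI name as `None` or an empty string, the dictionary layout of a classification and the example names are assumptions made for illustration.

```python
RANKS = ("phylum", "class", "order", "family", "genus")


def classify_change(ncbi_name, gtdb_name):
    """Compare the names assigned to one genome at one rank."""
    if not ncbi_name:
        return "passively changed"   # GTDB supplies a name NCBI lacks
    if ncbi_name == gtdb_name:
        return "unchanged"
    return "actively changed"        # both taxonomies give a name, and they differ


def changed_above_species(ncbi, gtdb):
    """True if a genome has one or more changes above the rank of species.

    ncbi, gtdb: dicts mapping rank to name for a single genome.
    """
    return any(
        classify_change(ncbi.get(rank), gtdb.get(rank)) != "unchanged"
        for rank in RANKS
    )


# Illustrative names only.
same = classify_change("Firmicutes", "Firmicutes")
filled = classify_change(None, "Moduliflexaceae")
renamed = classify_change("Betaproteobacteria", "Gammaproteobacteria")
```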

Only 18% of taxon names in the GTDB taxonomy above the rank of species have been validly published; a further 19% have been proposed but not validated; and the remaining 63% are currently nonstandard placeholder names (Fig. 4b), thus indicating the scope of the task remaining to produce a fully standardized taxonomy consisting of validated names. This task will be greatly facilitated by recent proposals to use genome sequences as type material for as-yet-uncultured lineages, which in principle would allow for validation of names44,45.

Genus- and species-level classifications

Genera and species comprise 84% of the 16,924 defined taxon names in the bac120 tree. Misclassified species in the public repositories are an area of particular concern to researchers, because they can introduce noise into a variety of analyses, including strain typing46, biogeographic distributions of species47 and pangenome analyses48. Moreover, classification errors can propagate over time as incorrectly labeled genomes are used as reference material to identify novel sequences. A small number of microbial genera have been rigorously examined for this problem, and taxonomic corrections have been proposed, including Aeromonas49 and Fusobacterium50. We compared the results of these analyses to the GTDB taxonomy as a means of providing an independent verification of our results. On the basis of multilocus sequence analysis and average nucleotide identity (ANI) comparisons, Beaz-Hidalgo et al.49 have proposed that nine Aeromonas dhakensis genomes are incorrectly classified as Aeromonas hydrophila. All nine of these genomes were reclassified as A. dhakensis in the bac120 tree, and an additional four genomes not included in the Beaz-Hidalgo study were also reclassified as A. dhakensis (Supplementary Table 10). Kook et al.50 have recently recommended the reclassification of Fusobacterium nucleatum subspecies animalis, nucleatum, polymorphum and vincentii as separate species, on the basis of ANI and genome distance metrics. Rank normalization of the GTDB taxonomy by using RED values largely reproduced this finding without prior knowledge of the authors' work (Supplementary Table 10). Reclassification of species according to the bac120 tree is also consistent with recent efforts to objectively define bacterial species according to barriers to homologous recombination estimated against the core genome of each species51. In that study, 23 of 91 bacterial species have been proposed to contain one or more members not belonging to their respective species ('excluded taxa'). 
We found that almost all comparable instances of excluded taxa were due to misclassification in the NCBI taxonomy (Supplementary Table 10). These results suggest that the bac120 tree topology and RED estimates of species-level groups based on ∼4% of the genome (120 conserved markers) are consistent with alternative analytical approaches using larger fractions of the genome.

The genus Clostridium is widely acknowledged to be polyphyletic, and efforts have been made to rectify this problem, including a global attempt to reclassify the genus by using a combination of phylogenetic markers9. The authors of that study have proposed the reclassification of 78 Clostridium species, and nine other species, into six novel genera9,52. Of these, we confirmed that Erysipelatoclostridium (with the exception of Clostridium innocuum str. 2959), Gottschalkia and Tyzzerella (excepting Clostridium nexile CAG:348) represent monophyletic genus-level groups. The remaining three genera proposed by Yutin and Galperin9 represent multiple genera in the GTDB taxonomy, including genera with validly published names (Supplementary Table 11). This result is consistent with recent analyses of individual taxa in these groups53,54. The GTDB taxonomy is also largely in agreement at the genus level with a recent global genome-based classification of the Bacteroidetes55. Of the 122 genera addressed in that study, six were found to be in need of reclassification: Chryseobacterium, Epilithonimonas, Aequorivita, Vitellibacter, Flexibacter and Pedobacter. All six were similarly identified as polyphyletic in the GTDB taxonomy and reclassified accordingly. These findings demonstrate that our methods are broadly consistent with rigorous independent analyses of problematic genera and species.

Taxonomic changes at higher ranks

A number of notable taxonomic changes at higher ranks are proposed for well-studied groups. For example, the class Betaproteobacteria was reclassified as an order within the class Gammaproteobacteria because it is entirely circumscribed within the latter group and is closer to the median RED value for an order than a class (Fig. 2a). This change is consistent with the original 16S rRNA gene topology of the Proteobacteria and subsequent trees6,8,56, although such a rank change has not been proposed in these studies. The Deltaproteobacteria and Epsilonproteobacteria were removed entirely from the Proteobacteria, because this phylum is not consistently recovered as a monophyletic unit, as found in many previous 16S rRNA and other marker gene analyses11,57,58. In the case of the Epsilonproteobacteria, this class was combined with the order Desulfurellales (Deltaproteobacteria) to form a new phylum58.

The Firmicutes also underwent extensive internal reclassification. As a clade, this phylum is typically monophyletic but poorly supported in most trees (Supplementary Table 1), and it has a RED in the phylum range, albeit to the left of the median for this taxonomic rank (Fig. 2b). The Firmicutes were therefore retained as a phylum-level lineage, although future revision of this status may be warranted. This phylum was divided into 34 classes, including the mycoplasmas (currently classified as a separate phylum, the Tenericutes59) and 14 classes exclusively comprising MAGs. Incorporation of the Tenericutes within the Firmicutes is consistent with single-gene phylogenies6,8,32,53 and is further supported by recent evidence based on multiple molecular markers25,26,60. Like its type genus, the order Clostridiales was extensively subdivided (Fig. 5a), largely as a consequence of an anomalous RED for this rank (Fig. 2a).

Figure 5: Comparisons of NCBI and GTDB classifications of genomes designated as Clostridia or Bacteroidetes in the GTDB taxonomy. (a) Comparison of NCBI (left) and GTDB (right) order-level classifications of the 2,368 bacterial genomes assigned to the class Clostridia in the GTDB taxonomy. Genomes classified in a class other than Clostridia by NCBI are indicated in parentheses. (b) Comparison of NCBI and GTDB class-level classifications of the 2,058 bacterial genomes assigned to the phylum Bacteroidetes in the GTDB taxonomy. Genomes classified in a phylum other than the Bacteroidetes by NCBI are indicated in parentheses.

On the basis of robust monophyly, taxonomic rank normalization and naming priority in the literature, the phylum Bacteroidetes is proposed to encompass the Chlorobi and Ignavibacteriae as class-level lineages. Concomitantly, several former classes of Bacteroidetes were amalgamated into the class Bacteroidia as order-level lineages, including the Chitinophagales, Cytophagales, Flavobacteriales and Sphingobacteriales (Fig. 5b). These proposed changes are in contrast to recent reclassifications, in which Bacteroidetes is divided into three major lineages by promoting the families Rhodothermaceae and Balneolaceae to phyla55,61 (Fig. 2a). In the GTDB taxonomy, these were retained as families within their own orders in the class Rhodothermia, according to their RED values (Fig. 2b). The higher-level taxonomy of the phylum Actinobacteria was largely unchanged. The five classes Actinobacteria, Acidimicrobiia, Coriobacteriia, Thermoleophilia and Rubrobacteria were retained, and the sole change at the class level was the downgrading of the Nitriliruptoria to an order within the class Actinobacteria according to rank normalization. Changes to other major lineages are summarized in Supplementary Table 3.

Rank normalization of uncultured microbial diversity

Having normalized the taxonomy on existing isolate-based classifications, we were able to calibrate the taxonomic ranks of uncultured lineages. Candidate phylum KSB3 was initially proposed on the basis of comparative analysis of environmental 16S rRNA gene sequences62,63, and more recently two near-complete MAGs belonging to this phylum have been reconstructed from a bulking sludge metagenome, for which the names 'Candidatus Moduliflexus flocculans' and 'Candidatus Vecturathrix granuli' have been proposed64. These genomes were further classified into separate families, orders and classes within the phylum; however, by rank normalization, they represent separate genera belonging to a single family. The group still retains a phylum-level status, because it is not reproducibly affiliated with other bacterial lineages36; however, we propose that the phylum (Modulibacteria) is currently genomically represented by a single class (Moduliflexia), single order (Moduliflexales) and single family (Moduliflexaceae; Fig. 2b).

As part of a single-cell-genomics study, the superphylum Patescibacteria has been proposed to encompass the candidate phyla Parcubacteria (OD1), Microgenomates (OP11) and Gracilibacteria (GN02)57. These candidate phyla have been further subsumed within the Candidate Phyla Radiation (CPR) on the basis of the addition of 797 MAGs20. Currently, there are at least 65 candidate phyla proposed to belong to the CPR20,21, and the justification of individual phyla has been based primarily on a 16S rRNA sequence-identity threshold of 75% (ref. 11). The CPR has been consistently recovered as a monophyletic group by using concatenated protein markers in this and previous studies20,22,25. However, rank normalization suggests that the CPR should be reclassified as a single phylum, for which we suggest reimplementing the name Patescibacteria (Fig. 2b), although ultimately the group should be named according to the nomenclature type material65.