A deluge of bacterial and archaeal genomes has been sequenced in recent years. Although this data trove greatly broadens our knowledge of the tree of life, it also poses new tasks in various areas, including taxonomy. “The most notable challenges in bacterial and archaeal taxonomy is the rapid rate at which new diversity is being discovered as a result of being able to recover genomes directly from environmental samples via metagenome-assembled and single-amplified genomes,” says Donovan Parks of the University of Queensland in Australia. “This has resulted in a proliferation of new lineages comprised exclusively of uncultivated organisms which have incomplete taxonomic assignments, often only being assigned to a candidate phylum.”

As a step toward tackling this situation, Parks, his colleague Philip Hugenholtz, and other team members built the Genome Taxonomy Database (GTDB), a genome-based taxonomy of Bacteria and Archaea derived from phylogenetic information. Despite its usefulness, GTDB was not complete, with about 40% of the genomes lacking a species name. Now the team has developed a computational strategy to automatically assign species names to genomes. The strategy rests on an operational species definition based on average nucleotide identity (ANI), a measure of the similarity between genomes, with thresholds of 95% to 97%. To address the computational burden, Parks notes, the team invested substantial up-front effort in establishing an efficient and reliable workflow for generating the genome comparisons.
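To make the idea of an ANI-based species definition concrete, here is a minimal Python sketch of threshold clustering. It is an illustration only, not the team's actual workflow: the greedy assignment rule, the function name `cluster_by_ani`, the genome labels, and the ANI values are all hypothetical, and a real pipeline would compute ANI with a dedicated tool rather than read it from a hand-written table.

```python
from typing import Dict, FrozenSet, List

# Hypothetical pairwise ANI values (percent) for four toy genomes.
# Each unordered pair is stored once under a frozenset key.
ANI: Dict[FrozenSet[str], float] = {
    frozenset(("A", "B")): 98.2,
    frozenset(("A", "C")): 88.0,
    frozenset(("A", "D")): 87.0,
    frozenset(("B", "C")): 87.5,
    frozenset(("B", "D")): 86.9,
    frozenset(("C", "D")): 96.1,
}


def cluster_by_ani(
    genomes: List[str],
    ani: Dict[FrozenSet[str], float],
    threshold: float = 95.0,
) -> Dict[str, List[str]]:
    """Greedily group genomes into species clusters.

    Genomes are visited in the given order. Each genome joins the first
    existing representative whose ANI to it meets the threshold; if none
    does, the genome becomes the representative of a new cluster.
    Missing pairs are treated as ANI 0 (an assumption of this sketch).
    """
    clusters: Dict[str, List[str]] = {}
    for genome in genomes:
        for representative in clusters:
            pair = frozenset((genome, representative))
            if ani.get(pair, 0.0) >= threshold:
                clusters[representative].append(genome)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters


clusters = cluster_by_ani(["A", "B", "C", "D"], ANI, threshold=95.0)
print(clusters)  # {'A': ['A', 'B'], 'C': ['C', 'D']}
```

With these toy values, the 95% threshold yields two clusters, whereas raising the threshold to 99% would leave every genome in its own cluster. The result of a greedy scheme like this depends on the order in which genomes are visited, which is one reason the choice of representative genomes matters in practice.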

This strategy yielded 24,706 proposed species clusters, 36% of which are based on published species names. Beyond completing the taxonomy from domain to species, the team's analysis also led to intriguing observations on evolutionary patterns of microbial genomic diversity and speciation. Parks sees momentum in genome-based taxonomy. “Genome-based taxonomy appears to be gaining wide acceptance in the research community. This is evident from both the increased use of the GTDB and the large number of recent manuscripts proposing taxonomic reclassifications based on analyses of genome assemblies. I expect this trend to continue.”