
Evaluation of a concatenated protein phylogeny for classification of tailed double-stranded DNA viruses belonging to the order Caudovirales


Viruses of bacteria and archaea are important players in global carbon cycling as well as drivers of host evolution, yet the taxonomic classification of viruses remains a challenge due to their genetic diversity and the absence of universally conserved genes. Traditional classification approaches employ a combination of phenotypic and genetic information, which is no longer scalable in the era of bulk viral genome recovery through metagenomics. Here, we evaluate a phylogenetic approach for the classification of tailed double-stranded DNA viruses from the order Caudovirales by inferring a phylogeny from the concatenation of 77 single-copy protein markers using a maximum-likelihood method. Our approach is largely consistent with the taxonomy of the International Committee on Taxonomy of Viruses (ICTV), with 72% and 89% congruence at the subfamily and genus levels, respectively. Discrepancies could be attributed to misclassifications and to a small number of highly mosaic genera confounding the phylogenetic signal. We also show that confidently resolved nodes in the concatenated protein tree are highly reproducible across different software and models, and conclude that the approach can serve as a framework for a rank-normalized taxonomy of most tailed double-stranded DNA viruses.
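To make the concatenation step concrete, the following is a minimal sketch of how per-marker protein alignments could be joined into a single supermatrix before tree inference. It is an illustration only and not the authors' pipeline: the function name `concatenate_markers`, the dictionary-based alignment representation, the toy sequences and the choice to pad absent markers with gap characters are all assumptions made for this example.

```python
from typing import Dict, List


def concatenate_markers(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Join per-marker alignments into one supermatrix.

    Each alignment maps a genome identifier to its aligned protein sequence.
    A genome lacking a marker receives a run of gap characters of that
    marker's alignment width (an assumption made for this sketch).
    """
    # Collect every genome seen in at least one marker alignment.
    genomes = sorted({genome for aln in alignments for genome in aln})
    parts: Dict[str, List[str]] = {genome: [] for genome in genomes}

    for aln in alignments:
        # All sequences within one marker alignment must share one width.
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within a marker alignment must have equal length")
        width = widths.pop()
        for genome in genomes:
            parts[genome].append(aln.get(genome, gap * width))

    return {genome: "".join(chunks) for genome, chunks in parts.items()}


# Hypothetical toy input: two markers, three genomes, each genome missing at most one marker.
marker_a = {"phage1": "MKT", "phage2": "MRT"}
marker_b = {"phage1": "LL-V", "phage3": "LIAV"}
supermatrix = concatenate_markers([marker_a, marker_b])
for genome, sequence in supermatrix.items():
    print(genome, sequence)
```

Every row of the resulting matrix has the same length, which is the property a maximum-likelihood program needs in order to treat the concatenated markers as one alignment.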

Data availability

The genome accessions and Newick tree files of datasets used in this study are provided in the Supplementary Materials.

Code availability

The custom Bash, Python and R scripts used to process and analyse the data and generate the figures are available on request.



Acknowledgements

We thank D. Waite from the University of Auckland for assistance with the tree inferences using IQ-TREE and ExaML. The project was supported by an Australian Research Council Laureate Fellowship (FL150100038) awarded to P.H.

Author information

S.J.L., M.D. and P.H. designed the study. S.J.L., P.-A.C. and D.H.P. performed the bioinformatic analyses. S.J.L. and P.H. wrote the manuscript. All authors edited drafts of the manuscript.

Competing interests

The authors declare no competing interests.

Correspondence to Philip Hugenholtz.

Supplementary information

Supplementary Information

Legends for Supplementary Datasets, Supplementary Tables 1 and 2, and Supplementary Figures 1–8.

Reporting Summary

Supplementary Dataset 1

This Excel file contains the lists of genome accessions in the datasets used for comparative analyses, along with marker composition and associated metadata.

Supplementary Dataset 2

This file contains the Newick tree of the reference CCP77 dataset.

Supplementary Dataset 3

This file contains the Newick tree of the CCP77-881 dataset (comparison with ICTV).

Supplementary Dataset 4

This file contains the Newick tree of the CCP77-408 dataset (comparison with VICTOR).

Supplementary Dataset 5

This file contains the Newick tree of the CCP77-1520 dataset (comparison with vConTACT).

Supplementary Dataset 6

This file contains the Newick tree of the CCP77-ViPTree dataset (comparison with ViPTree).

Supplementary Dataset 7

This file contains the Newick tree of the CCP77-GRAViTy dataset (comparison with GRAViTy).

Fig. 1: Midpoint-rooted phylogeny of 1,803 RefSeq bacterial and archaeal viruses inferred from concatenation of 77 marker proteins using IQ-TREE.
Fig. 2: Genome datasets used in phylogenetic analyses, and dendrogram illustrating the similarity of trees inferred from the genome datasets.
Fig. 3: Congruence of CCP77 topologies with ICTV and VICTOR family, subfamily and genus classification.
Fig. 4: Congruence of CCP77-1520 topology with vConTACT genus-level classification.
Fig. 5: Comparison of the CCP77 phylogeny with GRAViTy and ViPTree via the ICTV taxonomy.