
Evaluation of a concatenated protein phylogeny for classification of tailed double-stranded DNA viruses belonging to the order Caudovirales


Viruses of bacteria and archaea are important players in global carbon cycling as well as drivers of host evolution, yet the taxonomic classification of viruses remains a challenge due to their genetic diversity and absence of universally conserved genes. Traditional classification approaches employ a combination of phenotypic and genetic information which is no longer scalable in the era of bulk viral genome recovery through metagenomics. Here, we evaluate a phylogenetic approach for the classification of tailed double-stranded DNA viruses from the order Caudovirales by inferring a phylogeny from the concatenation of 77 single-copy protein markers using a maximum-likelihood method. Our approach is largely consistent with the International Committee on Taxonomy of Viruses, with 72 and 89% congruence at the subfamily and genus levels, respectively. Discrepancies could be attributed to misclassifications and a small number of highly mosaic genera confounding the phylogenetic signal. We also show that confidently resolved nodes in the concatenated protein tree are highly reproducible across different software and models, and conclude that the approach can serve as a framework for a rank-normalized taxonomy of most tailed double-stranded DNA viruses.
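The concatenation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `concatenate_markers`, the rule of padding genomes that lack a marker with gap characters, and the toy sequences are all assumptions made for illustration. It only shows how per-marker protein alignments could be joined into one supermatrix before tree inference.

```python
from typing import Dict, List

GAP = "-"


def concatenate_markers(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-marker alignments (genome id -> aligned sequence) into one supermatrix.

    Genomes that lack a marker are padded with gaps for that marker's columns
    (an illustrative assumption, not a rule taken from the paper).
    """
    # Every genome seen in at least one marker alignment gets a row.
    genomes = sorted({genome for aln in alignments for genome in aln})
    columns: Dict[str, List[str]] = {genome: [] for genome in genomes}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("each marker alignment needs sequences of one equal length")
        width = widths.pop()
        for genome in genomes:
            columns[genome].append(aln.get(genome, GAP * width))
    return {genome: "".join(parts) for genome, parts in columns.items()}


# Toy data: two hypothetical markers, each missing from one genome.
marker_a = {"phage1": "MKT", "phage2": "MRT"}
marker_b = {"phage1": "LL-A", "phage3": "LIGA"}
supermatrix = concatenate_markers([marker_a, marker_b])
for genome, row in supermatrix.items():
    print(genome, row)
```

Every row of the result has the same length, so it could be written out as an alignment file and passed to a maximum-likelihood program.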

Fig. 1: Midpoint-rooted phylogeny of 1,803 RefSeq bacterial and archaeal viruses inferred from concatenation of 77 marker proteins using IQ-TREE.
Fig. 2: Genome datasets used in phylogenetic analyses, and dendrogram illustrating the similarity of trees inferred from the genome datasets.
Fig. 3: Congruence of CCP77 topologies with ICTV and VICTOR family, subfamily and genus classification.
Fig. 4: Congruence of CCP77-1520 topology with vConTACT genus-level classification.
Fig. 5: Comparison of the CCP77 phylogeny with GRAViTy and ViPTree via the ICTV taxonomy.
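One simple way to think about the congruence between a tree topology and an existing classification is to ask what fraction of named taxa form a single clade in a rooted tree. The sketch below is a hypothetical illustration of that idea, not the procedure used to produce the figures or the reported percentages. The nested-tuple tree encoding, the function names and the toy taxonomy are assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Union

# A rooted tree is a leaf name (str) or a tuple of child subtrees.
Tree = Union[str, tuple]


def clade_leaf_sets(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set of every clade (single leaves included) in a rooted tree."""
    clades: List[FrozenSet[str]] = []

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        clades.append(leaves)
        return leaves

    walk(tree)
    return clades


def monophyly_fraction(tree: Tree, taxonomy: Dict[str, str]) -> float:
    """Fraction of taxa whose members exactly match the leaf set of one clade.

    Taxa with a single member count as monophyletic by definition.
    """
    clades = set(clade_leaf_sets(tree))
    members: Dict[str, Set[str]] = {}
    for leaf, taxon in taxonomy.items():
        members.setdefault(taxon, set()).add(leaf)
    hits = sum(1 for leaves in members.values() if frozenset(leaves) in clades)
    return hits / len(members)


# Toy example: genus A is monophyletic, genus B is split, genus C is a singleton.
tree = ((("a1", "a2"), "b1"), ("b2", "c1"))
taxonomy = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
fraction = monophyly_fraction(tree, taxonomy)
print(f"{fraction:.2f}")
```

Because singleton taxa always count as monophyletic under this definition, a real analysis would probably report them separately or exclude them.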

Data availability

The genome accessions and Newick tree files of datasets used in this study are provided in the Supplementary Materials.

Code availability

The custom Bash, Python and R scripts used to process and analyse the data and generate the figures are available on request.




Acknowledgements

We thank D. Waite from the University of Auckland for assistance with the tree inferences using IQ-TREE and ExaML. The project was supported by an Australian Research Council Laureate Fellowship (FL150100038) awarded to P.H.

Author information




S.J.L., M.D. and P.H. designed the study. S.J.L., P.-A.C. and D.H.P. performed the bioinformatic analyses. S.J.L. and P.H. wrote the manuscript. All authors edited drafts of the manuscript.

Corresponding author

Correspondence to Philip Hugenholtz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Legends for Supplementary Datasets, Supplementary Tables 1 and 2, and Supplementary Figures 1–8.

Reporting Summary

Supplementary Dataset 1

This Excel file contains the lists of genome accessions in the datasets used for comparative analyses, along with marker composition and associated metadata.

Supplementary Dataset 2

This file contains the Newick tree of the reference CCP77 dataset.

Supplementary Dataset 3

This file contains the Newick tree of the CCP77-881 dataset (comparison with ICTV).

Supplementary Dataset 4

This file contains the Newick tree of the CCP77-408 dataset (comparison with VICTOR).

Supplementary Dataset 5

This file contains the Newick tree of the CCP77-1520 dataset (comparison with vConTACT).

Supplementary Dataset 6

This file contains the Newick tree of the CCP77-ViPTree dataset (comparison with ViPTree).

Supplementary Dataset 7

This file contains the Newick tree of the CCP77-GRAViTy dataset (comparison with GRAViTy).

Cite this article

Low, S.J., Džunková, M., Chaumeil, PA. et al. Evaluation of a concatenated protein phylogeny for classification of tailed double-stranded DNA viruses belonging to the order Caudovirales. Nat Microbiol 4, 1306–1315 (2019).
