Microbiomes from every environment contain a myriad of uncultivated archaeal and bacterial viruses, but studying these viruses is hampered by the lack of a universal, scalable taxonomic framework. We present vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions. We report near-identical (96%) replication of existing genus-level viral taxonomy assignments from the International Committee on Taxonomy of Viruses for National Center for Biotechnology Information virus RefSeq. Application of vConTACT v.2.0 to 1,364 previously unclassified viruses deposited in virus RefSeq as reference genomes produced automatic, high-confidence genus assignments for 820 of the 1,364. We applied vConTACT v.2.0 to analyze 15,280 Global Ocean Virome genome fragments and were able to provide taxonomic assignments for 31% of these data, which shows that our algorithm is scalable to very large metagenomic datasets. Our taxonomy tool can be automated and applied to metagenomes from any environment for virus classification.


Data availability

The set of reference genomes used to evaluate vConTACT was retrieved from https://www.ncbi.nlm.nih.gov/genome/viruses/. The GOV contigs were retrieved from the publicly available CyVerse data commons repository, accessible at http://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/GOV.

Code availability

The utility of vConTACT v.2.0 depends upon its expert evaluation and community availability. The tool is available through Bitbucket (https://bitbucket.org/MAVERICLab/vcontact2) as a downloadable Python package and is usable as an app through iVirus (ref. 39), the viral ecology apps and data resource embedded in the CyVerse Cyberinfrastructure. Detailed usage protocols are available through Protocol Exchange (https://www.nature.com/protocolexchange/) and protocols.io (https://www.protocols.io/). Finally, the curated reference network is available at each of these sites and will be updated approximately bi-yearly as complete genomes become available and resources exist to support this effort.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References

  1. Falkowski, P. G., Fenchel, T. & Delong, E. F. The microbial engines that drive earth’s biogeochemical cycles. Science 320, 1034–1039 (2008).

  2. Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

  3. Moran, M. A. The global ocean microbiome. Science 350, aac8455 (2015).

  4. Zhao, M. et al. Microbial mediation of biogeochemical cycles revealed by simulation of global changes with soil transplant and cropping. ISME J. 8, 2045–2055 (2014).

  5. Cho, I. & Blaser, M. J. The human microbiome: at the interface of health and disease. Nat. Rev. Genet. 13, 260–270 (2012).

  6. Fernández, L., Rodríguez, A. & García, P. Phage or foe: an insight into the impact of viral predation on microbial communities. ISME J. 12, 1171–1179 (2018).

  7. Hurwitz, B. L. & U’Ren, J. M. Viral metabolic reprogramming in marine ecosystems. Curr. Opin. Microbiol. 31, 161–168 (2016).

  8. Suttle, C. A. Marine viruses – major players in the global ecosystem. Nat. Rev. Microbiol. 5, 801–812 (2007).

  9. Brum, J. R. et al. Patterns and ecological drivers of ocean viral communities. Science 348, 1261498 (2015).

  10. Danovaro, R. et al. Virus-mediated archaeal hecatomb in the deep seafloor. Sci. Adv. 2, e1600492 (2016).

  11. Pratama, A. A. & van Elsas, J. D. The ‘neglected’ soil virome – potential role and impact. Trends Microbiol. https://doi.org/10.1016/j.tim.2017.12.004 (2018).

  12. Gómez, P. & Buckling, A. Bacteria–phage antagonistic coevolution in soil. Science 332, 106–109 (2011).

  13. Reyes, A., Semenkovich, N. P., Whiteson, K., Rohwer, F. & Gordon, J. I. Going viral: next-generation sequencing applied to phage populations in the human gut. Nat. Rev. Microbiol. 10, 607–617 (2012).

  14. Abeles, S. R. & Pride, D. T. Molecular bases and role of viruses in the human microbiome. J. Mol. Biol. 426, 3892–3906 (2014).

  15. Rohwer, F. & Edwards, R. The phage proteomic tree: a genome-based taxonomy for phage. J. Bacteriol. 184, 4529–4535 (2002).

  16. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).

  17. Deng, L. et al. Viral tagging reveals discrete populations in Synechococcus viral genome sequence space. Nature 513, 242–245 (2014).

  18. Gregory, A. C. et al. Genomic differentiation among wild cyanophages despite widespread horizontal gene transfer. BMC Genomics 17, 930 (2016).

  19. Bobay, L. & Ochman, H. Biological species in the viral world. Proc. Natl Acad. Sci. USA 115, 6040–6045 (2018).

  20. Mavrich, T. N. & Hatfull, G. F. Bacteriophage evolution differs by host, lifestyle and genome. Nat. Microbiol. 2, 17112 (2017).

  21. Ackermann, H.-W. in Methods and Protocols, Vol. 1 (eds Clokie, M. R. J. & Kropinski, A. M.) 127–140 (Humana Press, 2009).

  22. Simmonds, P. et al. Consensus statement: virus taxonomy in the age of metagenomics. Nat. Rev. Microbiol. 15, 161–168 (2017).

  23. Paez-Espino, D. et al. IMG/VR v.2.0: an integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 47, D678–D686 (2018).

  24. Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577 (2015).

  25. Roux, S. et al. Minimum Information about an Uncultivated Virus Genome (MIUViG): a community consensus on standards and best practices for describing genome sequences from uncultivated viruses. Nat. Biotechnol. 37, 29–37 (2019).

  26. Nishimura, Y. et al. ViPTree: the viral proteomic tree server. Bioinformatics 33, 2379–2380 (2017).

  27. Lima-Mendez, G., Van Helden, J., Toussaint, A. & Leplae, R. Reticulate representation of evolutionary and functional relationships between phage genomes. Mol. Biol. Evol. 25, 762–777 (2008).

  28. Bolduc, B. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ 5, e3243 (2017).

  29. Meier-Kolthoff, J. P. & Göker, M. VICTOR: genome-based phylogeny and classification of prokaryotic viruses. Bioinformatics 33, 3396–3404 (2017).

  30. Yu, C. et al. Real time classification of viruses in 12 dimensions. PLoS ONE 8, e64328 (2013).

  31. Gao, Y. & Luo, L. Genome-based phylogeny of dsDNA viruses by a novel alignment-free method. Gene 492, 309–314 (2012).

  32. Iranzo, J., Koonin, E. V., Prangishvili, D. & Krupovic, M. Bipartite network analysis of the archaeal virosphere: evolutionary connections between viruses and capsidless mobile elements. J. Virol. 90, 11043–11055 (2016).

  33. Aiewsakun, P. & Simmonds, P. The genomic underpinnings of eukaryotic virus taxonomy: creating a sequence-based framework for family-level virus classification. Microbiome 6, 38 (2018).

  34. Lawrence, J. G., Hatfull, G. F. & Hendrix, R. W. Imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches. J. Bacteriol. 184, 4891–4905 (2002).

  35. Lavigne, R. et al. Classification of Myoviridae bacteriophages using protein sequence similarity. BMC Microbiol. 9, 224 (2009).

  36. Lavigne, R., Seto, D., Mahadevan, P., Ackermann, H. W. & Kropinski, A. M. Unifying classical and molecular taxonomic classification: analysis of the Podoviridae using BLASTP-based tools. Res. Microbiol. 159, 406–414 (2008).

  37. Henz, S. R., Huson, D. H., Auch, A. F., Nieselt-Struwe, K. & Schuster, S. C. Whole-genome prokaryotic phylogeny. Bioinformatics 21, 2329–2335 (2005).

  38. Iranzo, J., Krupovic, M. & Koonin, E. V. The double-stranded DNA virosphere as a modular hierarchical network of gene sharing. MBio 7, e00978-16 (2016).

  39. Bolduc, B., Youens-Clark, K., Roux, S., Hurwitz, B. L. & Sullivan, M. B. iVirus: facilitating new insights in viral ecology with software and community data sets imbedded in a cyberinfrastructure. ISME J. 11, 7–14 (2017).

  40. Roux, S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537, 689–693 (2016).

  41. Vik, D. R. et al. Putative archaeal viruses from the mesopelagic ocean. PeerJ 5, e3428 (2017).

  42. Roux, S. et al. Ecogenomics of virophages and their giant virus hosts assessed through time series metagenomics. Nat. Commun. 8, 858 (2017).

  43. Emerson, J. B. et al. Host-linked soil viral ecology along a permafrost thaw gradient. Nat. Microbiol. https://doi.org/10.1038/s41564-018-0190-y (2018).

  44. Martinez-Hernandez, F. et al. Single-virus genomics reveals hidden cosmopolitan and abundant viruses. Nat. Commun. 8, 15892 (2017).

  45. De la Cruz Peña, M. J. et al. Deciphering the human virome with single-virus genomics and metagenomics. Viruses 10, 113 (2018).

  46. Aiewsakun, P., Adriaenssens, E. M., Lavigne, R., Kropinski, A. M. & Simmonds, P. Evaluation of the genomic diversity of viruses infecting bacteria, archaea and eukaryotes using a common bioinformatic platform: steps towards a unified taxonomy. J. Gen. Virol. 99, 1331–1343 (2018).

  47. Hulo, C., Masson, P., Le Mercier, P. & Toussaint, A. A structured annotation frame for the transposable phages: a new proposed family ‘Saltoviridae’ within the Caudovirales. Virology 477, 155–163 (2015).

  48. Adriaenssens, E. M. et al. Taxonomy of prokaryotic viruses: 2017 update from the ICTV Bacterial and Archaeal Viruses Subcommittee. Arch. Virol. 163, 1125–1129 (2018).

  49. Nepusz, T., Yu, H. & Paccanaro, A. Detecting overlapping protein complexes in protein–protein interaction networks. Nat. Methods 9, 471–472 (2012).

  50. Doyle, E. L. et al. Genome sequences of four Cluster P Mycobacteriophages. Genome Announc. 6, e01101–e01117 (2018).

  51. Pope, W. H. et al. Bacteriophages of Gordonia spp. display a spectrum of diversity and genetic relationships. MBio 8, e01069–17 (2017).

  52. Pope, W. H. et al. Whole genome comparison of a large collection of mycobacteriophages reveals a continuum of phage genetic diversity. eLife 4, e06416 (2015).

  53. Nelson, D. Phage taxonomy: we agree to disagree. J. Bacteriol. 186, 7029–7031 (2004).

  54. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996 (2018).

  55. Krupovic, M., Quemin, E. R. J., Bamford, D. H., Forterre, P. & Prangishvili, D. Unification of the globally distributed spindle-shaped viruses of the archaea. J. Virol. 88, 2354–2358 (2014).

  56. Rokyta, D. R., Burch, C. L., Caudle, S. B. & Wichman, H. A. Horizontal gene transfer and the evolution of microvirid coliphage genomes. J. Bacteriol. 188, 1134–1142 (2006).

  57. Matsen, F. A., Kodner, R. B. & Armbrust, E. V. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010).

  58. Marz, M. et al. Challenges in RNA virus bioinformatics. Bioinformatics 30, 1793–1799 (2014).

  59. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).

  60. Krupovic, M. et al. Taxonomy of prokaryotic viruses: update from the ICTV bacterial and archaeal viruses subcommittee. Arch. Virol. 161, 1095–1099 (2016).

  61. Adams, M. J. et al. Changes to taxonomy and the International Code of Virus Classification and Nomenclature ratified by the International Committee on Taxonomy of Viruses (2017). Arch. Virol. 162, 2505–2538 (2017).

  62. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  63. Wiwie, C., Baumbach, J. & Röttger, R. Comparing the performance of biomedical clustering methods. Nat. Methods 12, 1033–1038 (2015).

  64. Brohée, S. & van Helden, J. Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinformatics 7, 488 (2006).

  65. Kamburov, A., Stelzl, U. & Herwig, R. IntScore: a web tool for confidence scoring of biological interactions. Nucleic Acids Res. 40, W140–W146 (2012).

  66. Goldberg, D. S. & Roth, F. P. Assessing experimentally derived interactions in a small world. Proc. Natl Acad. Sci. USA 100, 4372–4376 (2003).

  67. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

  68. Ohio Supercomputer Center. http://osc.edu/ark:/19495/f5s1ph73 (1987).

  69. Oliphant, T. E. SciPy: open source scientific tools for Python. Comput. Sci. Eng. 9, 10–20 (2007).

  70. McKinney, W. Data structures for statistical computing in Python. in Proc. 9th Python Sci. Conf. 445, 51–56 (2010).

  71. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  72. Csárdi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Syst. 1695, 1–9 (2006).

  73. Federico, P., Pfeffer, J., Aigner, W., Miksch, S. & Zenk, L. Visual analysis of dynamic networks using change centrality. in Proc. 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 179–183 (IEEE, 2012).

Acknowledgements


We thank L. Bollinger, G. Trubl and I. Tolstoy for their comments on improving the manuscript, as well as Z.-Q. You for helping push the network analytics. High-performance computational support was provided as an award from the Ohio Supercomputer Center to M.B.S. Funding was provided in part by the Department of Energy’s Genome Sciences Program Soil Microbiome Scientific Focus Area award (no. SCW1632) to Lawrence Livermore National Laboratory, NSF Biological Oceanography awards (OCE no. 1536989 and OCE no. 1756314), and a Gordon and Betty Moore Foundation Investigator Award (no. 3790) to M.B.S. Funding was provided to J.R.B. by the Intramural Research Program of the National Institutes of Health (NIH) National Library of Medicine. The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under Contract DE-AC02-05CH11231 to S.R. This work was funded in part through Battelle Memorial Institute’s prime contract with the US National Institute of Allergy and Infectious Diseases (NIAID) under Contract no. HHSN272200700016I to J.H.K. The content of this publication does not necessarily reflect the views or policies of the US Department of Health and Human Services or of the institutions and companies affiliated with the authors.

Author information

Author notes

  1. These authors contributed equally: Ho Bin Jang, Benjamin Bolduc.


  1. Department of Microbiology, Ohio State University, Columbus, OH, USA

    • Ho Bin Jang
    • Benjamin Bolduc
    • Olivier Zablocki
    • Matthew B. Sullivan
  2. Integrated Research Facility at Fort Detrick, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, MD, USA

    • Jens H. Kuhn
  3. US Department of Energy Joint Genome Institute, Walnut Creek, CA, USA

    • Simon Roux
  4. Institute of Integrative Biology, University of Liverpool, Liverpool, UK

    • Evelien M. Adriaenssens
  5. Quadram Institute Bioscience, Norwich Research Park, Norwich, UK

    • Evelien M. Adriaenssens
  6. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

    • J. Rodney Brister
  7. Department of Pathobiology, Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada

    • Andrew M Kropinski
  8. Department of Food Science, University of Guelph, Guelph, Ontario, Canada

    • Andrew M Kropinski
  9. Unité Biologie Moléculaire du Gène chez les Extrêmophiles, Institut Pasteur, Paris, France

    • Mart Krupovic
  10. Laboratory of Gene Technology, Department of Biosystems, Faculty of BioScience Engineering, KU Leuven, Leuven, Belgium

    • Rob Lavigne
  11. Centre for Research in Biosciences, Department of Applied Sciences, Faculty of Health and Applied Sciences, University of the West of England, Bristol, UK

    • Dann Turner
  12. Department of Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH, USA

    • Matthew B. Sullivan




H.B.J., B.B. and M.B.S. designed the study. O.Z. and M.B.S. wrote the manuscript with substantial contributions from H.B.J., B.B., J.H.K., S.R., E.M.A., J.R.B., A.M.K., M.K., R.L. and D.T. H.B.J. and B.B. performed the statistical and network analyses.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Matthew B. Sullivan.

Integrated supplementary information

  1. Supplementary Fig. 1 Bipartite network of reference virus genomes generated by vConTACT v.2.0.

    (a) The full network, with an oval indicating the area shown in close-up in (b). (b) Close-up of three viral clusters displayed as a bipartite network. In this configuration, pink nodes represent individual genomes, while dark gray nodes depict protein clusters that are shared between viral clusters.

  2. Supplementary Fig. 2 Clustering comparisons between vConTACT v.1.0 and v.2.0.

    (a) Number of total viral clusters (VCs), genus-assigned VCs, and concordant and discordant VCs, as detected by vConTACT v.1.0 (left) and v.2.0 (right). Discordant VCs are (i) those that contain a mix of different genera (i.e., lumped genera), (ii) those in which member virus(es) of the same genus are split into multiple clusters (i.e., split genera), or (iii) a mix of (i) and (ii). (b) Proteome similarities of viruses within 22 concordant VCs of vConTACT v.1.0. For all plots, the x-axis shows the individual pairwise comparisons and the y-axis shows the proteome similarity (i.e., the percentage of protein clusters shared between genomes). VCs 6, 26, 66, and 130 (highlighted by bold borders) contain taxonomically misplaced member virus(es) of the Che8virus, Pbunavirus, P100virus, and Bcep78virus, respectively, all of which have been correctly captured by v.2.0 and ratified by the ICTV (see Supplementary Fig. 5). Like these four VCs, the remaining 18 VCs contain distant relatives, with only 1–30% similarity to the rest of the given clusters or discrete viral group(s). These 18 VCs display a number of discontinuous similarities, which v.2.0 identified as outliers or as separate VCs (see Supplementary Table 4).

  3. Supplementary Fig. 3 vConTACT v.2.0-based detection and characterization of overlapping viral genomes.

    (a) Box plots depicting the distribution of the topology-based confidence scores between viruses identified as overlaps (n = 74; min., 0.040; quartile (Q)1, 0.199; median, 0.340; Q3, 0.450; max., 0.623) and non-overlaps (n = 1,856; min., 0.000; Q1, 0.280; median, 0.499; Q3, 0.822; max., 1.000), which vConTACT v.2.0 placed into two clusters (see panel b) and single clusters, respectively. For details, see Methods. Topology-based confidence scores of overlaps and non-overlaps were compared using the one-sided Mann–Whitney U test (P-value = 6.12e-09). (b) List of phages and archaeal viruses identified as overlaps and their ICTV genus. Details on the lifestyle and evolutionary modes of the 74 viruses were collected from Mavrich and Hatfull (ref. 20) and from the Actinobacteriophage Database website (http://phagesdb.org/). The high (HGCF, blue) and low (LGCF, green) gene content flux evolutionary modes indicate the predicted lifestyle based on the gene content dissimilarity between viral genomes. Bioinformatically predicted temperate phages are those that contain integrase (for integrating temperate phage genomes into the host genome) or parA (a partitioning gene found in extrachromosomal temperate phages) genes.

  4. Supplementary Fig. 4 Evaluation of optimal distance thresholds for hierarchical clustering of VCs.

    The X-axis denotes distance thresholds from dist = 1 to dist = 20 in increments of 0.5. The Y-axis denotes the composite score used to assess how well the sub-clusters recapitulate ICTV genera, calculated as the product of Accuracy (Acc) and clustering-wise separation (Sep). Acc is the geometric mean of Sensitivity and Positive predictive value; Sep is the geometric mean of Complex-wise separation and Cluster-wise separation. From these data, a distance of 9.0 yielded the highest composite score for the sub-clusters partitioned from all vConTACT v.2.0-generated viral clusters. For details, see Online Methods.

  5. Supplementary Fig. 5 vConTACT v.2.0-based detection and characterization of boundary genome(s) within the ICTV-recognized genera.

    (a) Box plots show the percentage of shared protein clusters (PCs) between members of an ICTV genus (red), and the same metrics after excluding viruses recognized as outlier(s) by vConTACT v.2.0 (cyan). The proteome similarities for the Barnyardvirus are shown between (1) member viruses of the Barnyardvirus, (2) Mycobacterium virus Barnyard and the remaining members of the Barnyardvirus, (3) the Barnyardvirus and Patiencevirus, (4) Mycobacterium virus Barnyard and the Patiencevirus, and (5) the Barnyardvirus without outliers. For the Phikmvvirus, the proteome similarities are shown between (1) member viruses of the Phikmvvirus, (2) Pseudomonas virus phiKMV and the remaining members of the Phikmvvirus, and (3) Pseudomonas virus phiKMV and members of VC33. The sample size (n), minimum (min.), first quartile (Q1), median, third quartile (Q3) and maximum (max.) for each plot are given as follows, from the left box plot to the right: (n = 5; min., 27.1; Q1, 27.7; median, 88.1; Q3, 90.4; max., 93.5), (4; 88.0; 88.5; 90.0; 91.2; 93.5), (10; 24.0; 29.3; 83.3; 91.0; 97.7), (8; 76.5; 84.8; 90.1; 92.8; 97.7), (11; 36.1; 91.9; 93.9; 95.5; 97.8), (10; 90.5; 93.3; 94.4; 95.6; 97.8), (5; 49.4; 51.3; 85.9; 86.7; 97.6), (4; 85.8; 86.1; 86.6; 92.9; 97.6), (17; 25.7; 59.3; 67.3; 72.9; 94.3), (15; 58.0; 65.3; 69.0; 74.1; 94.3), (2; 65.7; 65.7; 65.7; 65.7; 65.7), (10; 58.4; 76.5; 81.5; 84.1; 94.1), (9; 74.3; 80.3; 83.0; 84.6; 94.1), (31; 12.8; 35.9; 44.1; 57.6; 92.4), (25; 29.9; 42.3; 50.0; 63.1; 92.4), (5; 16.0; 16.8; 20.3; 56.5; 66.7), (3; 64.8; 65.7; 66.7; 66.7; 66.7), (2; 31.6; 31.6; 31.6; 31.6; 31.6), (3; 43.5; 43.7; 43.9; 60.7; 77.4), (3; 43.5; 43.6; 43.7; 43.8; 43.9), (4; 42.6; 43.5; 43.9; 44.1; 77.4), (3; 38.7; 38.7; 38.7; 38.7; 38.7; 38.7), (3; 77.4; 77.4; 77.4; 77.4; 77.4; 77.4), (4; 19.3; 21.6; 27.9; 29.0; 45.8), (3; 19.3; 23.1; 27.0; 36.4; 45.8), (11; 85.5; 89.0; 90.3; 93.1; 95.2).
    (b) Module profiles show the presence (dark) and absence (light) of homologous PCs across genomes. Each row represents a virus and each column a PC. The genomes were hierarchically clustered based on pairwise Euclidean distance. The ICTV and vConTACT v.2.0 classifications are indicated next to each virus.

  6. Supplementary Fig. 6 Evaluation of singletons/outlier genomes over GOV increments.

    Thirty-eight ICTV recognized singleton and outlier genomes (one per row) were observed to evaluate whether the addition of GOV sequences would improve their classification. The coloring gradient on the left indicates the numbers of genera per VC. Clearly improved clustering was observed in 3 of the 38 genomes (black bolded, left), with 2 genomes clustering to 6-genera discordant VCs (red, left), though the majority saw only minor, if any, change in their clustering assignment.

  7. Supplementary Fig. 7 vConTACT v.2.0 computational runtimes.

    In the plot, processing time (in seconds, Y-axis) is plotted against increasing GOV sequence data (GOV % added, X-axis), with the number of GOV contigs in each additional increment given in parentheses on the X-axis. Both runtime and memory usage correlate strongly and linearly with the volume of data processed. The R² value, or coefficient of determination, was calculated as 1 − (residual sum of squares / total sum of squares).

  8. Supplementary Fig. 8 Impact of the inflation factor on viral genome clustering based on the Markov clustering (MCL) algorithm.

    Left panel: average intra-cluster clustering coefficient (ICCC) and number of viral clusters (VCs), which are predicted as a function of the inflation factors ranging from 1.0 to 5.0 with a step of 0.2, are indicated. Right panel: Curve representing the ICCC values for the network containing 2,304 archaeal and bacterial virus genomes.
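
The composite score described for Supplementary Fig. 4 can be illustrated with a short sketch. It assumes the Sensitivity, Positive predictive value and separation definitions of Brohée and van Helden (ref. 64), applied to a genus-by-cluster contingency table in which every genome belongs to exactly one genus and one cluster. The function name `composite_score` and the toy tables are hypothetical and are not taken from the vConTACT v.2.0 code base.

```python
from math import sqrt


def composite_score(table):
    """Return Acc, Sep and Acc * Sep for a contingency table.

    table[i][j] = number of genomes shared by reference genus i (row)
    and predicted cluster j (column). Assumption: no empty rows or
    columns, and each genome is counted in exactly one cell.
    """
    n_rows = len(table)
    n_cols = len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_sums)

    # Sensitivity: best-matching cluster per genus, weighted by genus size.
    sn = sum(max(row) for row in table) / total
    # PPV: best-matching genus per cluster, weighted by cluster size.
    ppv = sum(max(table[i][j] for i in range(n_rows)) for j in range(n_cols)) / total
    acc = sqrt(sn * ppv)  # geometric mean of Sn and PPV

    # Separation: product of row-wise and column-wise frequencies per cell.
    sep_sum = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            if table[i][j]:
                sep_sum += (table[i][j] / row_sums[i]) * (table[i][j] / col_sums[j])
    sep_co = sep_sum / n_rows  # complex-wise (per reference genus)
    sep_cl = sep_sum / n_cols  # cluster-wise (per predicted cluster)
    sep = sqrt(sep_co * sep_cl)  # geometric mean of the two separations

    return {"Acc": acc, "Sep": sep, "composite": acc * sep}


# Toy example 1: two genera recovered exactly as two clusters.
perfect = composite_score([[3, 0], [0, 2]])
# Toy example 2: the same two genera lumped into a single cluster.
lumped = composite_score([[3], [2]])
```

In this toy setting the exact recovery scores 1.0, while lumping the two genera lowers both Acc and Sep, so a threshold sweep such as the one in Supplementary Fig. 4 would favor the threshold that avoids the lumping.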

Supplementary information

  1. Supplementary Information

    Supplementary Figs. 1–8 and Supplementary Notes 1 and 2

  2. Reporting Summary

  3. Supplementary Table 1

    A genome–gene matrix.

  4. Supplementary Table 2

    List of 2,304 archaeal and bacterial virus genomes used to evaluate vConTACT v.1.0 and v.2.0.

  5. Supplementary Table 3

    Clustering performance evaluations of the vConTACT v.1.0, v.2.0, and v.2.0 followed by distance-based hierarchical clustering for the genus rank.

  6. Supplementary Table 4

    Fraction of PCs in common between 2,304 genomes.

  7. Supplementary Table 5

    Statistics associated with the box plots shown in Fig. 2d.

  8. Supplementary Table 6

    Statistics associated with the box plots shown in Fig. 3b.
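
A genome–gene matrix such as the one in Supplementary Table 1 lends itself to the pairwise Euclidean distance used to order the module profiles in Supplementary Fig. 5b. The sketch below is a minimal illustration under stated assumptions: the three genomes, their presence/absence profiles and the helper names `euclidean`, `pairwise_distances` and `closest_pair` are invented for the example and do not reproduce the authors' data or code.

```python
from math import sqrt

# Hypothetical presence/absence profiles: one row per genome,
# one column per protein cluster (1 = present, 0 = absent).
profiles = {
    "genome_A": [1, 1, 1, 0, 0],
    "genome_B": [1, 1, 0, 0, 0],
    "genome_C": [0, 0, 0, 1, 1],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(profiles):
    """Distance for every unordered pair of genomes, keyed by (name, name)."""
    names = sorted(profiles)
    return {
        (a, b): euclidean(profiles[a], profiles[b])
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    }


def closest_pair(profiles):
    """The pair that an agglomerative clustering would merge first."""
    distances = pairwise_distances(profiles)
    return min(distances, key=distances.get)


distances = pairwise_distances(profiles)
first_merge = closest_pair(profiles)
```

For binary profiles the squared distance equals the number of protein clusters present in exactly one of the two genomes, so genomes with similar gene content end up adjacent in the clustered profile.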
