Date: 2019-05-06
Abstract
Microbiomes from every environment contain a myriad of uncultivated archaeal and bacterial viruses, but studying these viruses is hampered by the lack of a universal, scalable taxonomic framework. We present vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions. We report near-identical (96%) replication of existing genus-level viral taxonomy assignments from the International Committee on Taxonomy of Viruses for National Center for Biotechnology Information virus RefSeq. Application of vConTACT v.2.0 to 1,364 previously unclassified viruses deposited in virus RefSeq as reference genomes produced automatic, high-confidence genus assignments for 820 of the 1,364. We applied vConTACT v.2.0 to analyze 15,280 Global Ocean Virome genome fragments and were able to provide taxonomic assignments for 31% of these data, which shows that our algorithm is scalable to very large metagenomic datasets. Our taxonomy tool can be automated and applied to metagenomes from any environment for virus classification.
Data availability
The set of reference genomes used to evaluate vConTACT was retrieved from https://www.ncbi.nlm.nih.gov/genome/viruses/. The GOV contigs were retrieved from the publicly available CyVerse data commons repository, accessible at http://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/GOV.
Code availability
The utility of vConTACT v.2.0 depends upon its expert evaluation and community availability. The tool is available through Bitbucket (https://bitbucket.org/MAVERICLab/vcontact2) as a downloadable Python package and usable as an app through iVirus (ref. 39), the viral ecology apps and data resource embedded in the CyVerse Cyberinfrastructure, with detailed usage protocols available through Protocol Exchange (https://www.nature.com/protocolexchange/) and protocols.io (https://www.protocols.io/). Finally, the curated reference network is available at each of these sites and will be updated approximately bi-yearly as complete genomes become available and resources exist to support this effort.
Acknowledgements
We thank L. Bollinger, G. Trubl and I. Tolstoy for their comments on improving the manuscript, as well as Z.-Q. You for helping push the network analytics. High-performance computational support was provided as an award from the Ohio Supercomputer Center to M.B.S. Funding was provided in part by the Department of Energy’s Genome Sciences Program Soil Microbiome Scientific Focus Area award (no. SCW1632) to Lawrence Livermore National Laboratory, NSF Biological Oceanography awards (OCE no. 1536989 and OCE no. 1756314), and a Gordon and Betty Moore Foundation Investigator Award (no. 3790) to M.B.S. Funding was provided to J.R.B. by the Intramural Research Program of the National Institutes of Health (NIH) National Library of Medicine. The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under Contract DE-AC02-05CH11231 to S.R. This work was funded in part through Battelle Memorial Institute’s prime contract with the US National Institute of Allergy and Infectious Diseases (NIAID) under Contract no. HHSN272200700016I to J.H.K. The content of this publication does not necessarily reflect the views or policies of the US Department of Health and Human Services or of the institutions and companies affiliated with the authors.
Integrated supplementary information
Supplementary Fig. 1 Bipartite network of reference virus genomes generated by vConTACT v.2.0.
(a) The full network, with an oval indicating the area shown in close-up in (b). (b) Close-up of three viral clusters displayed as a bipartite network. In this configuration, pink nodes represent individual genomes, while dark gray nodes depict protein clusters that are shared between viral clusters.
Supplementary Fig. 2 Clustering comparisons between vConTACT v.1.0 and v.2.0.
(a) Number of total viral clusters (VCs), genus-assigned VCs, and concordant and discordant VCs, as detected by vConTACT v.1.0 (left) and v.2.0 (right). Discordant VCs are (i) those that contain a mix of different genera (i.e., lumped genera), (ii) those in which member virus(es) of the same genus are split into multiple clusters (i.e., split genera), or (iii) a mix of (i) and (ii). (b) Proteome similarities of viruses within 22 concordant VCs of vConTACT v.1.0. For all plots, the x-axis shows the individual pairwise comparisons and the y-axis shows the proteome similarity (i.e., the percentage of protein clusters shared between genomes). VCs 6, 26, 66, and 130 (highlighted by bold borders) contain taxonomically misplaced member virus(es) of the Che8virus, Pbunavirus, P100virus, and Bcep78virus, respectively, all of which have been correctly captured by v.2.0 and ratified by the ICTV (see Supplementary Fig. 5). Like these four VCs, the remaining 18 VCs contain distant relatives, with only 1–30% similarity to the rest of the given clusters or discrete viral group(s). These 18 VCs display a number of discontinuous similarities, which were identified as outliers or separated VCs by v.2.0, respectively (see Supplementary Table 4).
Supplementary Fig. 3 vConTACT v.2.0-based detection and characterization of overlapping viral genomes.
(a) Box plots depicting the distribution of the topology-based confidence scores for viruses identified as overlaps (n = 74; min., 0.040; quartile (Q)1, 0.199; median, 0.340; Q3, 0.450; max., 0.623) and non-overlaps (n = 1,856; min., 0.000; Q1, 0.280; median, 0.499; Q3, 0.822; max., 1.000), which vConTACT v.2.0 placed into two clusters (see panel b) and single clusters, respectively. For details, see Methods. Topology-based confidence scores of overlaps and non-overlaps were compared using the one-sided Mann–Whitney U test (P-value = 6.12e-09). (b) List of phages and archaeal viruses identified as overlaps and their ICTV genus. Details on the lifestyle and evolutionary modes of the 74 viruses were collected from Mavrich and Hatfull (ref. 22) and from the Actinobacteriophage Database website (http://phagesdb.org/). The high (HGCF, blue) and low (LGCF, green) gene content flux evolutionary modes indicate the predicted lifestyle based on the gene content dissimilarity between viral genomes. Bioinformatically predicted temperate phages are those that contain integrase (for integrating temperate phage genomes into the host) or parA (a partitioning gene found in extrachromosomal temperate phages) genes.
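The comparison in panel (a) can be illustrated with a minimal sketch of a one-sided Mann–Whitney U test. It assumes that the alternative hypothesis is that overlaps score lower than non-overlaps, and it uses a normal approximation without tie or continuity correction, so it is not necessarily the exact procedure used for the figure. The function name and the input scores are hypothetical.

```python
import math

def mann_whitney_one_sided(x, y):
    """Return (U, p) for the one-sided alternative that x tends to be smaller than y.

    Assumption: normal approximation, no tie correction, no continuity correction.
    """
    # U counts the (x, y) pairs in which the x value is larger; ties count as 0.5.
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Lower-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return u, p

# Toy scores invented for illustration only (not the values behind the figure).
overlap_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
non_overlap_scores = [0.6, 0.7, 0.8, 0.9, 1.0]
u_stat, p_value = mann_whitney_one_sided(overlap_scores, non_overlap_scores)
print(u_stat, p_value)
```

For sample sizes such as those in the figure, a statistics library with exact or tie-corrected p-values would be the more robust choice.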
Supplementary Fig. 4 Evaluation of optimal distance thresholds for hierarchical clustering of VCs.
The X-axis denotes distance thresholds from dist = 1 to dist = 20 in increments of 0.5. The Y-axis denotes composite scores obtained by multiplying Accuracy (Acc) and clustering-wise separation (Sep) when trying to recapitulate ICTV genera. Acc is the geometric mean of Sensitivity and Positive predictive value, and Sep is the geometric mean of Complex-wise separation and Cluster-wise separation. From these data, a distance of 9.0 yielded the highest composite score for the sub-clusters partitioned from all vConTACT v.2.0-generated viral clusters. For details, see Online Methods.
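The composite score described above can be sketched as follows. The caption only states that Acc and Sep are geometric means of their components; the component formulas used here (computed from a genus-by-cluster contingency table) are assumed from commonly used clustering-evaluation definitions and may differ in detail from those in the Online Methods. The function name and the toy tables are hypothetical.

```python
import math

def composite_score(table):
    """Composite score Acc * Sep from a genus-by-cluster contingency table.

    table[i][j] = number of viruses of reference genus i placed in cluster j.
    Assumption: component formulas are standard clustering-evaluation
    definitions; the caption only specifies the geometric means.
    """
    columns = list(zip(*table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in columns]
    total = sum(row_sums)

    # Sensitivity: coverage of each genus by its best-matching cluster.
    sn = sum(max(row) for row in table) / total
    # Positive predictive value: share of each cluster taken by its best-matching genus.
    ppv = sum(max(col) for col in columns) / total
    acc = math.sqrt(sn * ppv)

    # Separation: product of row-wise and column-wise frequencies, summed over cells.
    sep_sum = sum(
        (table[i][j] / row_sums[i]) * (table[i][j] / col_sums[j])
        for i in range(len(table))
        for j in range(len(columns))
        if table[i][j] > 0
    )
    sep_complex = sep_sum / len(row_sums)   # complex-wise separation
    sep_cluster = sep_sum / len(col_sums)   # cluster-wise separation
    sep = math.sqrt(sep_complex * sep_cluster)
    return acc * sep

# Threshold grid from the caption: 1.0 to 20.0 in steps of 0.5.
thresholds = [1.0 + 0.5 * k for k in range(39)]

perfect = composite_score([[3, 0], [0, 2]])  # each genus in its own cluster
lumped = composite_score([[2], [2]])         # two genera lumped into one cluster
print(perfect, lumped)
```

In a threshold scan, one such score would be computed for the clustering obtained at each value in `thresholds`, and the threshold with the highest score retained.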
Supplementary Fig. 5 vConTACT v.2.0-based detection and characterization of boundary genome(s) within the ICTV-recognized genera.
(a) Box plots show the percentage of shared protein clusters (PCs) between members of an ICTV genus (red), and the same metrics after excluding viruses recognized as outlier(s) by vConTACT v.2.0 (cyan). The proteome similarities for the Barnyardvirus are shown between (1) member viruses of the Barnyardvirus, (2) Mycobacterium virus Barnyard and the remaining members of the Barnyardvirus, (3) the Barnyardvirus and Patiencevirus, (4) Mycobacterium virus Barnyard and the Patiencevirus, and (5) the Barnyardvirus without outliers. For the Phikmvvirus, the proteome similarities between (1) member viruses of the Phikmvvirus, (2) Pseudomonas virus phiKMV and the remaining members of the Phikmvvirus, and (3) Pseudomonas virus phiKMV and members of VC33 are shown, respectively. The sample size (n) as well as the minimum (min.), median, maximum (max.) and two quartiles (Q1 and Q3) for each plot are as follows. From left box plot to right, (n = 5; min., 27.1; Q1, 27.7; median, 88.1; Q3, 90.4; max., 93.5), (4; 88.0; 88.5; 90.0; 91.2; 93.5), (10; 24.0; 29.3; 83.3; 91.0; 97.7), (8; 76.5; 84.8; 90.1; 92.8; 97.7), (11; 36.1; 91.9; 93.9; 95.5; 97.8), (10; 90.5; 93.3; 94.4; 95.6; 97.8), (5; 49.4; 51.3; 85.9; 86.7; 97.6), (4; 85.8; 86.1; 86.6; 92.9; 97.6), (17; 25.7; 59.3; 67.3; 72.9; 94.3), (15; 58.0; 65.3; 69.0; 74.1; 94.3), (2; 65.7; 65.7; 65.7; 65.7; 65.7), (10; 58.4; 76.5; 81.5; 84.1; 94.1), (9; 74.3; 80.3; 83.0; 84.6; 94.1), (31; 12.8; 35.9; 44.1; 57.6; 92.4), (25; 29.9; 42.3; 50.0; 63.1; 92.4), (5; 16.0; 16.8; 20.3; 56.5; 66.7), (3; 64.8; 65.7; 66.7; 66.7; 66.7), (2; 31.6; 31.6; 31.6; 31.6; 31.6), (3; 43.5; 43.7; 43.9; 60.7; 77.4), (3; 43.5; 43.6; 43.7; 43.8; 43.9), (4; 42.6; 43.5; 43.9; 44.1; 77.4), (3; 38.7; 38.7; 38.7; 38.7; 38.7; 38.7), (3; 77.4; 77.4; 77.4; 77.4; 77.4; 77.4), (4; 19.3; 21.6; 27.9; 29.0; 45.8), (3; 19.3; 23.1; 27.0; 36.4; 45.8), (11; 85.5; 89.0; 90.3; 93.1; 95.2).
(b) Module profiles show the presence (dark) and absence (light) of homologous PCs across genomes. Each row represents a virus and each column a PC. The genomes were hierarchically clustered based on pairwise Euclidean distance. The ICTV and vConTACT v.2.0 classifications are indicated next to each virus.
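The distance underlying the module profiles in panel (b) can be illustrated with a short sketch that computes pairwise Euclidean distances between presence/absence vectors. The genome names and profiles are hypothetical, and the hierarchical clustering step that would consume these distances is omitted.

```python
import math
from itertools import combinations

# Hypothetical presence (1) / absence (0) profiles: one row per genome, one column per PC.
profiles = {
    "genome_A": [1, 1, 0, 1],
    "genome_B": [1, 1, 0, 0],
    "genome_C": [0, 0, 1, 1],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length presence/absence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distance for every unordered pair of genomes, keyed by the sorted name pair.
distances = {
    (g1, g2): euclidean(profiles[g1], profiles[g2])
    for g1, g2 in combinations(sorted(profiles), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

For binary profiles, the squared distance equals the number of PCs present in one genome but not the other, so genomes with similar PC content end up adjacent in the clustered profile.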
Supplementary Fig. 6 Evaluation of singletons/outlier genomes over GOV increments.
Thirty-eight ICTV-recognized singleton and outlier genomes (one per row) were tracked to evaluate whether the addition of GOV sequences would improve their classification. The coloring gradient on the left indicates the number of genera per VC. Clearly improved clustering was observed for 3 of the 38 genomes (black bolded, left), and 2 genomes clustered into 6-genera discordant VCs (red, left), though the majority saw only minor, if any, change in their clustering assignment.
Supplementary Fig. 7 vConTACT v.2.0 computational runtimes.
In the plot, processing time (in seconds, Y-axis) is plotted against increasing GOV sequence data (GOV % added, X-axis), where the number of GOV contigs per additional increment is given in parentheses on the X-axis. Both runtime and memory usage show a strong linear correlation with the volume of data processed. The R² value, or coefficient of determination, was calculated as [1 - residual sum of squares / total sum of squares].
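The R² formula in the caption can be made concrete with a minimal sketch that fits an ordinary least-squares line and evaluates 1 - RSS / TSS. The function name is hypothetical and the data points are invented for illustration; they are not the measured runtimes.

```python
def r_squared(x, y):
    """R^2 of an ordinary least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares around the mean of y.
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

# Invented numbers for illustration only.
percent_added = [10, 20, 30, 40, 50]
seconds = [120, 215, 330, 410, 520]
print(r_squared(percent_added, seconds))
```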
Supplementary Fig. 8 Impact of the inflation factor on viral genome clustering based on the Markov clustering (MCL) algorithm.
Left panel: average intra-cluster clustering coefficient (ICCC) and number of viral clusters (VCs) as a function of the inflation factor, which ranges from 1.0 to 5.0 in steps of 0.2. Right panel: Curve representing the ICCC values for the network containing 2,304 archaeal and bacterial virus genomes.
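As a rough illustration of the quantity plotted here, the sketch below computes an average intra-cluster clustering coefficient for a given partition. The caption does not define ICCC; this version assumes an unweighted network and takes the mean local clustering coefficient within the subgraph induced by each cluster, averaged over clusters, which may differ from the definition used by vConTACT. All names and the toy network are hypothetical.

```python
from itertools import combinations

def local_clustering(node, adjacency):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))

def average_iccc(edges, clusters):
    """Mean, over clusters, of the mean local clustering coefficient inside each cluster."""
    per_cluster = []
    for members in clusters:
        members = set(members)
        # Induced subgraph: keep only edges with both endpoints in the cluster.
        adjacency = {m: set() for m in members}
        for a, b in edges:
            if a in members and b in members:
                adjacency[a].add(b)
                adjacency[b].add(a)
        mean_cc = sum(local_clustering(m, adjacency) for m in members) / len(members)
        per_cluster.append(mean_cc)
    return sum(per_cluster) / len(per_cluster)

# Inflation grid from the caption: 1.0 to 5.0 in steps of 0.2.
inflation_values = [round(1.0 + 0.2 * k, 1) for k in range(21)]

# Hypothetical toy network: a triangle plus a three-node path.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f")]
toy_iccc = average_iccc(edges, [["a", "b", "c"], ["d", "e", "f"]])
print(toy_iccc)
```

In a parameter scan, the clustering produced at each value in `inflation_values` would be scored this way, alongside the number of clusters obtained.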
Supplementary information
Supplementary Information
Supplementary Figs. 1–8 and Supplementary Notes 1 and 2
Reporting Summary
Supplementary Table 1
A genome–gene matrix.
Supplementary Table 2
List of 2,304 archaeal and bacterial virus genomes used to evaluate vConTACT v.1.0 and v.2.0.
Supplementary Table 3
Clustering performance evaluations of vConTACT v.1.0, v.2.0, and v.2.0 followed by distance-based hierarchical clustering, at the genus rank.
Supplementary Table 4
Fraction of PCs in common between 2,304 genomes.
Supplementary Table 5
Statistics associated with the box plots shown in Fig. 2d.
Supplementary Table 6
Statistics associated with the box plots shown in Fig. 3b.
