ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data

The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few of Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at https://github.com/dyxstat/ViralCC.


Supplementary Notes
Supplementary Note 1: Binning results for the viral contigs on the mock wastewater dataset and the mock cow fecal dataset Since all viral contigs were labeled in the mock metagenomic Hi-C datasets, we defined the spurious edges in the host proximity graph of ViralCC as the edges that linked two contigs from different putative reference genomes in the host proximity graph. Then we computed the fraction of spurious edges in the host proximity graph by dividing the number of spurious edges by the total number of edges in the host proximity graph for the three mock metagenomic Hi-C datasets. The fraction was 1.5% and 16.5% for the mock wastewater and mock cow fecal datasets, respectively.
ViralCC was then compared to CoCoNet, vRhyme, bin3C, and MetaTOR on the mock wastewater dataset and the mock cow fecal dataset ( Supplementary Fig. 2). VAMB failed to bin viral contigs on both mock datasets due to the small number of contigs to train its model. ViralCC also outperformed other binning methods in terms of F-score, ARI, and homogeneity and recovered the most near-complete and high-quality vMAGs on the mock wastewater dataset. For the mock cow fecal dataset, three Hi-C-based binning methods (i.e., bin3C, MetaTOR, and HiCBin) obtained similar binning performance in terms of the four clustering metrics. MetaTOR and ViralCC achieved slightly better performance in term of NMI and homogeneity, respectively. Although MetaTOR retrieved the most high-quality vMAGs, ViralCC reconstructed the most near-complete vMAGs.

Supplementary Note 2: ViralCC performed better in recovering near complete vMAGs from large putative viral genomes on the mock metagenomic Hi-C datasets
Considering that almost all viral contigs have the same length in our mock datasets, the full recovery of a larger putative viral genomes requires a binner to correctly group more viral contigs into a single bin. To explore the ability of different binners to retrieve large viral genomes, we plotted the genome sizes of the near-complete vMAGs recovered by different binners on the three mock metagenomic Hi-C datasets ( Supplementary Fig. 3). On both the mock human gut and wastewater datasets, we found that binners apart from MetaTOR and ViralCC could only retrieve near-complete vMAGs with the reference genome sizes below 60,000 bp. MetaTOR and ViralCC could recover near-complete vMAGs with the reference genome sizes between 60,000 bp and 80,000 bp while ViralCC was the only method that could reconstruct near-complete vMAGs with the reference genome sizes above 80,000 bp. As for the mock cow fecal dataset, the largest reference viral genome is 42,500 bp and all binners could retrieve near-complete vMAGs with the reference genome sizes above 40,000 bp while ViralCC could achieve the largest number. Moreover, we divided the total sizes of near-complete vMAGs by the total sizes of the overall putative viral genomes for the three mock datasets. As shown in Supplementary Table 11, ViralCC achieved the highest genome size fractions of the near-complete vMAGs on all mock datasets. Therefore, ViralCC outperformed other binners on recovering near-complete vMAGs from large putative viral genomes on the mock datasets.
Supplementary Note 3: ViralCC outperformed the random binning according to the CheckV completeness criteria To construct a random binning model as control experiments, we generated the configuration random graph based on the integrative graph by randomly assigning edges to match the degree sequence of viral contigs in the integrative graph, followed by the Leiden clustering. Since we found that the silhouette coefficient almost increased monotonically with the increase of the resolution parameter on our random graphs, we run the Leiden algorithm with default parameters and vMAGs were assessed using CheckV. We conducted five rounds of simulations of the random binning on each dataset. As shown in Supplementary Tables 12 to 14, ViralCC markedly outperformed the  random control according to the CheckV completeness criteria. Supplementary Note 4: Results of virus-host detection on the human gut sample A total of 338 MAGs were generated for non-viral contigs and 164 MAGs could be annotated by GTDB-TK. The CheckM assessment results of 338 MAGs were shown in Supplementary Table 8. Lachnospirales, Oscillospirales, and Bacteroidales are the predominant orders in the human gut sample (Supplementary Table 15).
We then explored the infection spectrum of vMAGs on hosts from different orders (Supplementary Fig. 4). vMAGs from the Siphoviridae and Myoviridae families were mainly associated with host MAGs from the orders Lachnospirales and Oscillospirales, followed by Bacteroidales. The vast majority of vMAGs from the families Podoviridae and Herpesviridae were associated with host MAGs from the orders Lachnospirales and Oscillospirales. CRISPR spacer analysis predicted four interactions between vMAGs and host MAGs, and 3 out of 4 interactions could also be discovered using Hi-C.

Supplementary Note 5: Results of virus-host detection on the cow fecal sample
HiCBin retrieved 496 MAGs for non-viral contigs, which were subsequently evaluated by CheckM (Supplementary Table 8 Table 16).
Supplementary Fig. 5 showed the infection spectrum of vMAGs on hosts from different orders. vMAGs from the families Siphoviridae and Myoviridae were mainly associated with host MAGs from the orders Bacteroidales and Lachnospirales, followed by Oscillospirales, Coriobacteriales, and Erysipelotrichales. The vast majority of vMAGs from the families Podoviridae and Herpesviridae were associated with host MAGs from the orders Bacteroidales and Lachnospirales. CRISPR spacer analysis predicted two interactions between vMAGs and host MAGs. Both interactions could also be discovered using Hi-C.
Supplementary Note 6: The existence of biases by the viral genome sizes in the benchmarking method According to the benchmarking strategy, we selected the near-complete viral contigs according to the CheckV completeness criteria as putative reference viral genomes. Considering the limitation of the shotgun sequencing, the larger the size of viral genome is, the more difficult it is to recover the complete viral genome using a single contig during the assembly process. Therefore, it is inevitable that the sizes of putative viral genomes tend to be small. Specifically, we found that the sizes of the largest vMAGs were 307,395 bp, 157,462 bp, and 461,626 bp on the human gut, cow fecal, and wastewater datasets, respectively. In contrast, the sizes of the largest selected viral genomes were only 194,784 bp, 42,500 bp, and 127,910 bp for the three datasets, respectively, indicating that the benchmarking method was biased by the viral genome sizes.

Supplementary Note 8: Algorithms to evaluate the completeness of vMAGs by CheckV
CheckV applies two algorithms to compute the completeness of vMAGs: amino acid identity (AAI)based and hidden Markov model (HMM)-based approaches. In the AAI-based approach, proteins are first compared to the CheckV genome database using AAI. After identifying the top hits, completeness is computed as the ratio between the contig length and the length of matched reference genomes and a confidence level is reported based on the strength of the alignment and the length of the contig. High-and medium-confidence estimates are quite accurate and can be trusted. The second method is HMM-based. This method aims to compute the completeness of highly novel viruses that may not match a CheckV genome with sufficiently high AAI. In these cases CheckV identifies the viral HMMs on the contig and compares the contig length with reference genomes sharing the same HMMs. CheckV then returns the estimated range for genome completeness, which represents the 90% confidence interval based on the distribution of lengths of reference genomes with the same viral HMMs.

Supplementary Note 9: Evaluation criteria of the clustering results
The Homogeneity Score: Let C = {c i |i = 1, · · · , n} denote a set of classes and K = {k i |1, · · · , m} denote a set of clusters for N samples. Let A be the contingency table produced by the clustering algorithm representing the clustering solution, such that A = {a ij } where a ij is the number of data points that are members of class c i and elements of cluster k j . Then the homogeneity score, denoted by h is defined as: where The Fowlkes-Mallows score: The Fowlkes-Mallows score (F-score) is defined as the geometric mean of the precision and recall, i.e, where T P is the number of true positives, F P is the number of false positives, and F N is the number of false negatives.
The Adjusted Rand Index: The rand index (RI) is defined as the percentage of correct decisions made by the clustering algorithm, i.e., where T P is the number of true positives, T N is the number of true negatives, F P is the number of false positives, and F N is the number of false negatives. Then, the Adjusted Rand Index (ARI) can be defined as where E(RI) denotes the Expected Rand Index.
The Normalized Mutual Information: Let U and V denote the sets of true class labels and predicted cluster labels, respectively. Define the entropy of a label set S as where P (i) = |S i |/N is the probability of an object in class S i . The mutual information (MI) between U and V is calculated by: where P (i, j) = |U i ∩ V j |/N , P (i) = |U i |/N , and P (j) = |V j |/N . Then, the Normalized Mutual Information (NMI) is defined as Supplementary Note 10: ViralCC outperformed other binners on a meta 3C/Hi-C dataset derived from a human gut sample This meta 3C/Hi-C sample was derived from the human gut microbiome and consisted of one meta3C library (NCBI accession: SRR11853875) and two separate Hi-C libraries (NCBI accession: SRR13435230 and SRR13435231). There were 266.5 million meta3C read pairs while the two Hi-C libraries contained 54.9 million (MluCI library) and 56.6 million (HpaII library) read pairs, respectively. Since meta3C reads allowed the assembly and scaffolding, we applied the same initial processing procedure on the raw meta3C reads as the raw shotgun reads to assemble the contigs. A total of 75,745 contigs with length at least 1,000 bp were assembled (N50: 14,449 bp; average length: 5,255 bp; total length: 398,079,188 bp). For the raw Hi-C reads, considering the short length of Hi-C reads, we neither discarded any Hi-C reads using the minimal length option nor trimmed the first ten nucleotides of Hi-C reads during the read cleaning step. Specifically, adaptor sequences of Hi-C reads were removed by bbduk from the BBTools suite (v37.25) with parameters 'ktrim=r k=23 mink=11 hdist=1 tpe tbo' and reads were quality-trimmed using bbduk with parameters 'trimq=10 qtrim=r ftm=5'. Identical PCR optical and tile-edge duplicates for Hi-C paired-end reads were removed by the script 'clumpify.sh' from the BBTools suite. The alignment of processed Hi-C reads on contigs and the viral contig detection were the same as described in the main text. As a result, 519 contigs were detected as viral contigs by VirSorter. Viral contigs were binned using different methods on the meta 3C/Hi-C dataset. The CheckV completeness of viral bins was estimated to evaluate the binning quality. We referred to viral bins with CheckV completeness above 90% as draft viral genomes with high completion and denoted bins with CheckV completeness above 50% as draft viral genomes with medium completion. Notably, VAMB failed to bin viral contigs on the meta 3C/Hi-C dataset due to the small number of contigs to train its model. As shown in Supplementary Fig. 6, CoCoNet, vRhyme, bin3C, and MetaTOR could recover 16, 35, 36, and 29 draft viral genomes with high completion, respectively, while ViralCC increased this number to 54. Moreover, CoCoNet, vRhyme, bin3C, and MetaTOR retrieved 19, 53, 44, and 37 draft viral genomes with medium completion, respectively, which was improved to 87 by ViralCC.
Moreover, following Fig. 4 in the main text, we sorted the vMAGs and plotted the raw Hi-C contact maps of the top ten vMAGs for the meta 3C/Hi-C dataset with either the contig index or the contig size as the axis unit ( Supplementary Fig. 7), respectively, which confirmed the valid reconstruction of the viral genomes. The specific number of viral contigs and the bin size of these ten vMAGs were shown in Supplementary Table 17.
Finally, 112 vMAGs retrieved by ViralCC could be annotated at the family level on the meta3C/Hi-C dataset. Details of the annotation procedure were described in the main text. Among these 112 vMAGs, 102 (91.1%) vMAGs contained only viral contigs from the same family, demonstrating the high purity of ViralCC's vMAGs at the family level. As shown in Supplementary Fig.  8, bacteriophage Siphoviridae was the dominant family, which was consistent with the results of another human gut dataset shown in Fig. 5 in the main text.

Supplementary Tables
Supplementary Table 1. The numbers of putative reference genomes and mock viral contigs from the three metagenomic datasets. Note: qc3C CI represents the 95% confidence interval of the proportion of observed junction sequences considered to be the product of proximity ligation estimated by the qc3C software.
Supplementary Note: N50 is defined by the length of the shortest contig where contigs with longer and equal length cover at least 50% of the assembly.
Supplementary Note: VAMB failed to bin viral contigs on the mock wastewater and cow fecal datasets due to the small number of contigs to train its model. The optimal values of the results are in bold.