Main

Viruses infect and kill their hosts, alter host metabolisms via auxiliary metabolic genes (AMGs) and mediate horizontal gene transfer1,2,3,4. In the past decade, numerous efforts have made the study of viruses more practical, including but not limited to tools for virus identification5,6,7, viral binning8,9,10, taxonomic classification11,12,13, automating AMG identification7 and viral genome completeness estimation14.

Many viral studies rely on metagenome-assembled sequences4, most of which are partial14. The diversity of viruses is extremely high4, yet a relatively small fraction is represented by complete genomes15,16,17, and only a small subset of these are huge phage genomes (≥200 kbp, or jumbo phages)18,19,20,21,22,23,24. The lack of complete genomes often precludes the classification of extrachromosomal elements and confounds diversity analyses15. When complete genomes are available, it is possible to evaluate phage species richness, AMG contents20, genome structure25 and genome sizes19.

A subset of de novo assembled metagenomic contigs can be joined via end overlaps26. This is because the assemblers based on the de Bruijn graph generally break at positions with multiple paths. The fragments from a single population can sometimes be joined, potentially to obtain genomes that can be further curated to completion18,19,20,26. Manual curation is used to extend contig ends before joining26, evaluate the validity of joins and eliminate chimeric joins introduced during assembly. However, manual curation is labour intensive and thus rarely included in metagenomic analysis pipelines. Nonetheless, some tools have been developed to improve the quality of viral and bacterial genomes, including ContigExtender27, Phables28 and Jorg29. Binning is another strategy to better sample viral genomes from metagenomes30,31, with available tools including vRhyme10, CoCoNet8 and PHAMB9. However, binning algorithms are approximate and they do not improve the contiguity of individual sequences. Accordingly, we developed Contig Overlap Based Re-Assembly (COBRA) to detect, analyse and join contigs from a single metagenomic assembly. COBRA evaluates coverage and paired read linkages before joining contigs, following manual curation methods.

We tested the ability of COBRA by analysing an ocean virome dataset32 and a soil viral dataset. COBRA accurately joins contigs assembled from short Illumina reads alone, to generate large genome fragments and sometimes circular genomes. Compared with the performance of the evaluated binning tools, almost all of the COBRA genomes were accurate and not confounded by the contamination introduced by binning. We subsequently used COBRA to recover high-quality phage genomes from 231 freshwater metagenomes, expanding the genomic diversity of huge phages, whiB-encoding actinophages and cysC- and cysH-encoding phages. Thus, we show that COBRA can improve and accelerate viral research.

Results

Simulations show the basis for joining contigs. We used simulations to investigate why and how fragmentation occurs when short paired-end reads are assembled by metaSPAdes, IDBA_UD and MEGAHIT (Supplementary Table 1). The simulations included (1) repeats within a genome (Supplementary Figs. 1 and 2), (2) regions shared by different genomes (Supplementary Fig. 3) and (3) within-population sequence variation (Supplementary Figs. 4 and 5), taking into account a range of relative abundances for cases in (2) and (3). The simulated datasets were de novo assembled individually (Supplementary Information). In the vast majority of cases in which fragmentation occurred because of repeats, the assemblers introduced end sequences (maxK for metaSPAdes and MEGAHIT assembly, or maxK-1 for IDBA_UD assembly) that could be used to suggest contig joins. We acknowledge that these initially identified potential joins may not be legitimate, but subsequent steps that make use of additional information (see below) identify and remove inaccurate joins. These findings informed the development of COBRA. We suggest using the contigs not scaffolds for COBRA analyses if assembled using IDBA_UD, to avoid any errors that may be introduced during scaffolding26,33.

COBRA joins metagenome-assembled sequences. COBRA made joins that the assembler chose not to make so long as there is sufficient support. Ideally, the assembler will not make a join that is non-unique, but some non-unique options could arise owing to a single read (for example, because of sequencing error), or a few reads (for example, from a strain variant) that do not represent a unique part of the genome (that is, the coverage is much lower). Thus, the first criterion that COBRA uses to evaluate potential joins is coverage. COBRA detects contigs with shared end overlaps of the expected length (maxK for metaSPAdes and MEGAHIT, maxK-1 for IDBA_UD, same below), and checks whether the contigs have similar sequencing coverage and are spanned by paired reads (Fig. 1 and Extended Data Fig. 1). These joins are considered legitimate and the contigs are joined.

Fig. 1: The input files and parameters, processing steps and output files of COBRA.
figure 1

COBRA requires four input files: a fasta file containing all contigs from the assembly, another fasta file containing the query contigs, a two-column file with the sequencing coverage of each contig and a read mapping file of all contigs. The parameters ‘assembler’ and ‘maxK’ determine the length of contig end sequences, and ‘minK’ evaluates the end overlap sequence length of a given query to determine if it is a ‘self_circular’ contig. See Extended Data Fig. 1 for more detailed information. The different extension categories are in grey boxes, indicating their corresponding output categories (i, ii or iii) and associated files. COBRA generates five fasta files for each analysis, accompanied by summary files.

In the first step, COBRA processes all contigs from a given assembly and retrieves the end sequences (maxK or maxK-1) for each contig. COBRA first identifies all contig end pairs with the same end sequences, considering both the end sequence and its reverse complement (rc). These identified pairs are then filtered to retain only those that could potentially be joined (Fig. 1). The filtered pairs are then examined to identify valid path pairs. COBRA labels an end for which there is only one possible join as ‘one_path_end’ (Extended Data Fig. 1a) and labels an end A as ‘two_paths_end’ if it shares its sequence with two other ends (ends B and C), and ends B and C share the sequence exclusively with end A (Extended Data Fig. 1b). Our analysis reveals that in a single assembly, the ends in categories ‘one_path_end’ and ‘two_paths_end’ usually account for over 99% of all ends sharing sequences with other ends.

In the second step, COBRA considers each of the provided queries and extends each end sequentially. COBRA first identifies ‘self_circular’ (category i) contigs with two possible cases, (1) the two ends of a given contig is a valid path pair or (2) neither end of a given contig has an end pair, but has a shorter end overlap length that is ≥minK. Next, COBRA searches for potential joining paths for each end based on valid path pairs. It considers the sequencing coverage ratio between the query contig and a given candidate contig to be included in the joining path (Extended Data Fig. 1c), and requires that the joins are spanned by paired reads. The path search stops when (1) the end does not share its sequence with any other end, (2) the end has three or more paths and (3) the end is ‘one_path_end’ or ‘two_paths_end’, but the coverage ratio requirement is not met and/or there is no read pair spanning the join. When a query contig is extended from one end and loops back to the other end, it is classified as ‘extended_circular’ (category ii-a). For other queries that are extended but do not result in circularization, their status is designated as ‘extended_partial’ (category ii-b). If at least one end of a query contig matches other ends but the join is not considered valid owing to the coverage ratio and/or lack of spanning paired reads, the query is labelled as ‘extended_failed’ (category ii-c). In cases in which a query contig does not share any end sequence with others, it is assigned as ‘orphan_end’ (category iii).

In the third step, COBRA assesses all potential joining paths identified in the second step and ensures that the paths are unique before finalizing joins. An important, but rare, case involves a query that can be extended along two (or more) seemingly unique paths (Extended Data Fig. 2a). In such cases, all queries will be assigned as ‘extended_failed’. In addition, COBRA searches for cases in which both ends of a query contig extend into sequences that are closely related to each other, and assign the query to ‘extended_failed’ once confirmed (Extended Data Fig. 2b).

In the last step, the classifications of the query contigs are compiled. Sequences in the ‘self_circular’ category are saved, and those in the ‘extended_circular’ and ‘extended_partial’ categories are joined and saved.

COBRA accurately joins sequences from benchmarking datasets. For benchmarking, we reanalysed an ocean virome sequenced with both Illumina and Nanopore and for which complete Nanopore-based genomes were obtained32. The short Illumina reads of the ocean virome 250 m sample32 were assembled using metaSPAdes, IDBA_UD and MEGAHIT. For each assembly, we recovered 2,377, 2,304 and 2,321 contigs, respectively (Extended Data Fig. 3 and Supplementary Table 2). These were used as the queries for the following COBRA analyses.

COBRA categorized as circular (that is, ‘self_circular’) or extended 42–56% of the queries, and 7–14% and 30–50% of the remaining queries were ‘extended_failed’ and ‘orphan_end’ (Supplementary Table 2). In all but one case, the queries in the ‘orphan_end’ category had significantly lower sequencing coverage than the queries of other categories (unpaired t-test; Fig. 2a), probably suggesting that these contigs broke during assembly owing to insufficient reads for further extension.

Fig. 2: Benchmarking of COBRA using an ocean virome dataset of polished and complete viruses.
figure 2

a, The coverage of query contigs in the ‘orphan_end’ category and others. The average coverage was compared using two-sided unpaired t-test (*P < 0.05; ***P < 0.001). b, The pairwise genome AF of ‘self_circular’, ‘extended_circular’ and ‘extended_partial’ COBRA sequences against the corresponding polished genomes. The number of COBRA sequences in each category is plotted at the top. The aligned region’s minimum ANI is shown, with the average ANI in brackets. See a for the figure legend. c,d, Comparison of the ‘extended_circular’ and ‘extended_partial’ sequences and the corresponding contigs joined by COBRA regarding length (c) and quality (d). If several raw contigs were joined into one COBRA sequence, the length and quality of the COBRA sequence were counted only once. c, The length distribution of query contigs (‘Q’) and COBRA sequences (‘C’). The average length of raw contigs and COBRA sequences are shown and compared using two-sided unpaired t-test (***P < 0.001). d, The quality of query contigs (‘Q’) and COBRA sequences (‘C’) evaluated by CheckV. e, The number of contigs joined to generate ‘extended_circular’ (‘e_c’) and ‘extended_partial’ (‘e_p’) sequences. f, An example of an ‘extended_circular’ sequence compared with the corresponding polished genome. The contigs used are aligned at the bottom with their overlap shown. For box plots in a, c and e, centre lines, upper and lower bounds, and upper and lower whiskers show median values, 25th and 75th quantiles, and the largest and smallest non-outlier values, respectively. Outliers are defined as having a value >1.5 × interquartile range (IQR) away from the upper or lower bounds. No P values were corrected. For panels a, c and e, the numbers under that boxes indicate the number of sequences evaluated.

Source data

We evaluated the accuracy of COBRA sequences in categories i, ii-a and ii-b by alignment fraction (AF) analyses (Fig. 2b and Methods). Generally, the higher the AF_COBRA value is, the more accurate the COBRA join is considered to be. The higher the AF_polished value is, the more complete the COBRA sequence is. AF_COBRA values averaged 97.0–98.4% (Supplementary Fig. 6), indicating that COBRA accurately joined the Illumina-based contigs. Lower AF_COBRA values generally occurred because (1) COBRA selected and joined strain variant contigs that represented higher-abundance subpopulations, yet the corresponding polished genome represented lower-abundance subpopulations (Supplementary Fig. 7) or (2) COBRA sequences were very similar, but not identical, to the corresponding polished genomes (Supplementary Fig. 8). In addition, two MEGAHIT COBRA ‘extended_partial’ sequences had high AF_polished (98.6% and 99.3%, respectively) but relatively low AF_COBRA (72.8% and 76.9%, respectively), as the original queries were longer than the corresponding polished genomes (Fig. 2b).

We assessed the length and quality of the queries and their COBRA sequences in the ‘extended_circular’ and ‘extended_partial’ categories. The average length increased from 18.5–20.0 kb to 31.0–32.5 kb (Fig. 2c), and the total number of complete and circular, and high-quality, genomes rose from 28–46 (3–4%) to 215–241 (23–28%) (Fig. 2d). Notably, this was achieved by joining up to 38 contigs into a single sequence (Fig. 2e,f), and 36–45 of the putative complete genomes generated by COBRA were not reported in the original study.

The Nanopore-based analyses obtained more nominally complete genomes (1,864)32 than COBRA did (100–166), yet COBRA has several advantages. First, it is more cost-effective owing to the lack of requirement for both long and short reads (essential for validation and error correction; for example, ref. 34). It is applicable on samples with insufficient quantities of high-quality DNA for long-read sequencing. COBRA can be applied to the enormous number of samples that have already been sequenced with only short paired-end reads.

COBRA outperforms prevalent binning tools. We compared the performance of COBRA to that of the binning tool of MetaBAT 2 (ref. 35), vRhyme10 and CoCoNet8. We filtered the IDBA_UD assembly of the 250 m sample32 (see above) and obtained 2,632 contigs (Methods) for binning by MetaBAT 2, vRhyme, CoCoNet and contig extension via COBRA (Fig. 3a).

Fig. 3: Performance comparison of COBRA and widely used binning tools.
figure 3

a, The flowchart shows the comparison pipelines. The definitions of ‘good’, ‘problematic’ and ‘contaminated’ bin or join are provided in the accompanying box. Note that only one mapping file is needed for COBRA as input, whereas the coverage profiles were obtained from all three mapping files for the binning tools. b, The percentage of ‘good’, ‘problematic’ and ‘contaminated’ bins or joins. c, The percentage of contigs in ‘good’, ‘problematic’ and ‘contaminated’ bins or joins. In b and c, the total absolute numbers are shown at the top. For bins and joins, only those with at least two contigs binned or joined were considered and compared. d, The total length of good bins and good joins. e, The individual lengths of good bins and good joins; their total lengths are shown at the top. f, The impurity rates of ‘problematic’ and ‘contaminated’ bins and joins. g, The paired ANI of genomes that the contigs of ‘problematic’ bins or joins were matched to. For box plots in e, f and g, centre lines, upper and lower bounds, and upper and lower whiskers show median values, 25th and 75th quantiles, and the largest and smallest non-outlier values, respectively. In panels eg, the numbers under the boxes indicate the number of bins or joins evaluated.

Source data

We compared the bins and the COBRA sequences to the published polished complete genomes to evaluate the accuracy of all approaches. We defined ‘good bin’ and ‘good join’, ‘problematic bin’ and ‘problematic join’, and ‘contaminated bin’ and ‘contaminated join’ for binning tools and COBRA, respectively (Fig. 3a and Methods). COBRA far outperformed all binning tools in its ability to recover high-quality viral genomes (Fig. 3b,c). COBRA made 386 ‘good joins’, which are 1.7–5.8 times more than the contig assignments to ‘good bins’ (66–233; Fig. 3e). The cumulative length of accurate viral sequences generated via ‘good joins’ is 9.13 Mb, with an average length of 23.6 kb. The cumulative length of ‘good bins’ is 1.08–4.32 Mb, with an average length of 16.4–19.8 kb per bin (Fig. 3d,e).

We investigated the problematic and contaminated bins or joins. Notably, only 1 out of 400 COBRA sequences was contaminated, while the binning tools generated 111–261 contaminated bins (40–80% of all bins; Fig. 3f). In total, 13 COBRA joins and 5–47 bins were problematic (Fig. 3f), and they were all involved closely related virus genomes with high average nucleotide identity (ANI; 85–92%; Fig. 3g). Compared with binning tools, COBRA had the best metrics, including precision (0.98 versus 0.06–0.47), recall (0.83 versus 0.61–0.71), F1 score (0.90 versus 0.11–0.56), specificity (0.98 versus 0.07–0.61) and accuracy (0.91 versus 0.12–0.64; Supplementary Table 3). COBRA achieved similar performance in joining viral sequences from a soil metagenomic dataset (Extended Data Figs. 46).

Application of COBRA to freshwater metagenomes

Freshwater ecosystems contain phages that infect functionally important populations16,20,36, yet their diversity is poorly understood. Here we assembled 231 published freshwater metagenomes (Supplementary Table 4) and used COBRA to generate high-quality phage genomes from assembled contigs ≥10 kb (122,107 in total; Extended Data Fig. 7). We filtered the COBRA output for essentially complete genomes as assessed by CheckV14 and obtained 8,527 circular and 3,591 high-quality genomes (Fig. 4a). COBRA substantially improved the quality of the sequences (Fig. 4b), and the product genomes were, on average, 30 kb and 25 kb longer than the query contigs (Fig. 4c).

Fig. 4: Overview of circular and high-quality phage genomes from freshwater ecosystems.
figure 4

a, The number of high-quality, ‘self_circular’ and ‘extended_circular’ genomes. b, The quality of query contigs that COBRA used to generate the extended high-quality and circular genomes. The quality of the genomes was evaluated by CheckV. c, The length of COBRA sequences and corresponding query contigs of ‘extended_partial’ high-quality genomes and ‘extended_circular’ genomes. In the box plot, the centre lines, upper and lower bounds, and upper and lower whiskers show median values, 25th and 75th quantiles, and the largest and smallest non-outlier values, respectively. Outliers are defined as having a value >1.5 × IQR away from the upper or lower bounds. d, The clustering of viral genomes. Bar plots show (1) the number of clusters identified as phages, virophages, eukaryotic viruses and undetermined (‘others’). The plots also show details for the 7,334 phage clusters, including (2) the number of circular and high-quality representative genomes, (3) their length distribution, (4) the number of genomes in each cluster, (5) the number of sites detected with each cluster and (6) the taxonomic assignment of each cluster. Caudo, Caudoviricetes. ‘Caudo; others’ means the other families excluding the listed ones. ‘Caudo; unknown’ means all those could be assigned only at the level of Caudoviricetes. e, The novelty of phage species genomes identified in this study via comparison with published genomes. Of the 6,046 newly reported phage species genomes, 4,109 are circular and 1,937 are high quality. Please note that, before clustering, the genomes from published databases were prefiltered to retain only those with a minimum alignment length of 10,000 bp (minimum sequence similarity of 90%) with phage genomes obtained in this study.

Source data

The 12,118 genomes were clustered into 7,432 species-level genomes (95% sequence similarity; Methods and Fig. 4d). Of these, 69 were virophages that replicate along with giant viruses and coinfect eukaryotic cells37 and 29 were eukaryote viruses or undetermined, which were excluded from further analyses. The remaining 7,334 phage species genomes included 5,169 circular and 2,165 high-quality genomes, with most having a genome size of <50 kb (Supplementary Table 5). Around 70% and 17% of the phage species were represented by only one and two genomes, respectively. More than 99% of the phage species were detected at only one sampling site. Taxonomic analysis indicated that the phage species were mostly members of class Caudoviricetes, including hundreds that infect Actinobacteria (Supplementary Figs. 9 and 10). Co-analyses of the 7,334 species genomes with the previously published genomes showed that 82% of them (6,047) were novel at the species level (Fig. 4e).

Diversity expansion and RNA expression of huge phages

Another motivation for developing COBRA was to obtain genomes of huge phages (≥200 kb in length)19,38,39,40. Of the phage species genomes, 167 were classified as huge phages (Fig. 4d) and their genomes underwent manual curation. In addition, 100 low- or medium-quality huge-phage genomes (coverage ≥20×) were also chosen for manual curation. From the total of 267 huge-phage species genomes, 81 were completed (error-free, gap-free, circular genomes). The largest was initially 712 kb and reached 717 kb after curation to completion. To our knowledge, this is the second-largest complete phage genome (the biggest is 735 kb)19. Two phage genomes >800 kb in length were reported recently41, but they are largely bacterial (Supplementary Fig. 11).

The average genome size of the 267 huge phages generated was 285 kb (Fig. 5a). In comparison, the original query contigs had an average size of 88 kb, and only 102 were ≥200 kb. Thus, COBRA is highly effective in generating undersampled huge-phage genomes19,24.

Fig. 5: Genomes from freshwater ecosystems expand huge-phage diversity.
figure 5

a, The number and length of huge phages newly reported in this study from freshwater metagenomes and the corresponding query contigs (≥10 kb in length) joined by COBRA. In the box plot, the centre lines, upper and lower bounds, and upper and lower whiskers show median values, 25th and 75th quantiles, and the largest and smallest non-outlier values, respectively. Outliers are defined as having a value >1.5 × IQR away from the upper or lower bound. b, The phylogeny of huge phages based on the concatenated sequences of core structural proteins. The coloured stripes in the inner ring indicate the source of genomes (published or in this study). The coloured stripes in the middle ring indicate the habitats where the phage genomes were reconstructed. The coloured stripes in the outside ring indicate the predicted taxonomy of the genomes. The subclades with the majority (>80%) of their genomes reconstructed in this study are highlighted in red. The two phages with genome size >700 kb (one published, one from this study) are indicated by red stars. c, The detection and transcription profiles of the Rotsee Lake huge phages in the six samples with combined DNA and RNA analyses. The RPKM was calculated for each huge phage in each sample. A black dot indicates that the RNA RPKM is larger than the DNA RPKM of the huge phage in the corresponding sample. d, Genomic comparison of similar huge phages from distant collecting sites. Three pairs are shown as examples (see Extended Data Fig. 9 for Mauve alignment). Structural protein genes are shown in purple, their corresponding annotations are included and DNA metabolism-related genes are shown in pink.

Source data

Phylogenetic analyses based on concatenated sequences (Methods) revealed that the majority of the newly reconstructed huge phages are typically most similar to those identified from freshwater or groundwater (Fig. 5b). Overall, our analyses broadened the known diversity of huge phages.

Genomic and transcriptomic data from three time points from Rotsee Lake showed the persistence and activity of huge phages, primarily in the aerobic water layers (Fig. 5c). Genes for structural proteins were highly transcribed (Extended Data Fig. 8). Thus, huge phages actively shape microbial community structure and thus biogeochemical cycles within the aerobic parts of the lake.

A comparison of huge-phage genomes from different countries revealed that genes for structural proteins and DNA metabolism retain high nucleotide similarity, and that loss or gain of other genes is primarily driving their divergence (Fig. 5d and Extended Data Fig. 9).

Detection of AMGs in phage genomes

We explored the inventory of AMGs in the 7,334 high-quality genomes. The majority of AMGs are involved in the metabolism of carbohydrates, amino acids, glycans, and cofactors and vitamins (Fig. 6a), and some with photosynthesis42 and methane oxidation20. We identified 62 cysC and 167 cysH genes implicated in assimilatory sulfate reduction (Fig. 6b and Supplementary Fig. 12). These genes were generally detected in circular genomes and are from phages from multiple taxa (Fig. 6c). Three circular genomes each contained two cysH genes (see Fig. 6d for an example). Notably, phages encoding cysC genes exhibited significantly larger genome sizes (Fig. 6e). Of the 231 freshwater samples, 157 contained at least one phage with cysC and/or cysH, although the majority were present at only one sampling time point (Supplementary Tables 6 and 7).

Fig. 6: Genomic and transcriptomic analyses of phage species encoding cysC and cysH genes.
figure 6

a, Summary of AMGs identified in the phage genomes. b, The cysH and cysC genes are involved in assimilatory sulfate reduction. PAPSS, 3′-phosphoadenosine 5′-phosphosulfate synthase; APS, adenylyl sulfate; PAPS, 3′-phosphoadenylyl sulfate. c, The quality and taxonomy of genomes encoding cysH and/or cysC. d, One of three genomes that encode two cysH genes each. e, The length distribution of genomes encoding cysH and cysC. The average length of each category is indicated by a red ‘×’. The average length of cysH- and cysC-encoding genomes was compared using two-sided unpaired t-test (***P < 0.001). In the box plot, the centre lines, upper and lower bounds, and upper and lower whiskers show median values, 25th and 75th quantiles, and the largest and smallest non-outlier values, respectively. Outliers are defined as having a value >1.5 × IQR away from the upper or lower bounds. f, The number of bacteria- and phage-encoded cysH and cysC identified in Rotsee Lake metagenomes. g, The total normalized transcriptional activity and ratio of cysH and cysC genes in Rotsee Lake metatranscriptomes. Note that the cysC phage transcripts are more abundant than the non-phage transcripts in the 4 m sample from 23 November 2017 and 7 December 2017.

Source data

Using Rotsee Lake metatranscriptomic data, we show that under both aerobic and anaerobic conditions, the transcriptional activity of non-phage-encoded cysH genes generally exceeded that of phage-encoded cysH. However, phage-encoded cysC genes exhibited a greater level of transcriptional activity than their non-phage counterparts under certain aerobic conditions (Fig. 6f,g). These findings show that phages present in freshwater ecosystems can impact sulfur cycling via the assimilatory sulfate reduction process. The infinity sign indicates that there is no detected activity for the phage-encoded cysH gene.

Discussion

Metagenomics is an important approach for studying viruses. However, fragmented genomes hinder the understanding of their diversity and ecological significance14. COBRA seeks to complete or nearly complete viral genomes using methods analogous to those used in manual genome curation26. COBRA can extend contigs of any length, unlike binning tools that typically require contigs of a length that is sufficient to establish reliable sequence features like tetranucleotide frequency. COBRA can generate single contigs (sometimes, circular genomes; Figs. 1 and 2), whereas binning tools usually obtain MAGs with two or more contigs (Fig. 3). Thus, the resulting COBRA sequences are more readily evaluated for their quality using tools like CheckV14, which does not work on bins with multiple sequences. MetaBAT 2, vRhyme and CoCoNet need multiple related samples for coverage profile calculation, and PHAMB needs paired metagenome and metavirome datasets for better performance. By contrast, COBRA works efficiently on a single metagenome sample (Fig. 3). Most importantly, COBRA is much more accurate than the evaluated binning tools (Fig. 3). When compared against other genome improvement tools, COBRA is much faster than ContigExtender (Supplementary Table 8)27 and does not require assembly graphs (for example, Phables28) or resource-intensive reassemblies (for example, Jorg29). Thus, COBRA will serve as a powerful tool in viromics research.

In showing the utility of COBRA, we added >6,000 new phage species genomes (Fig. 5) from freshwater ecosystems15,36. There is minimal overlap between the viral genomes reconstructed in this study and published viral datasets. As our study included only 231 freshwater metagenomes, it is likely that viruses in the freshwater ecosystems remain underexplored. The reconstruction of huge-phage genomes from distinct sampling sites allowed us to directly compare their genomes and revealed the importance of gene gain and loss in their evolution (Fig. 5). The expanded diversity of whiB-encoding actinophages suggested that the acquisition of multiple whiB genes is probably a persistent feature of several unrelated subclades (Supplementary Fig. 9).

The cysC and cysH genes are typically responsible for the assimilation of inorganic sulfate into organic compounds (for example, cysteine). Several cysC- and cysH-encoding viruses have been reported43,44,45, yet their overall diversity and activity have yet to be fully understood. Here we expand evidence for virus-driven sulfur cycling43,46,47 by showing a wide distribution of phages encoding cysC and cysH (Supplementary Table 7). These genes may play a role in bacterial sulfur metabolism during phage replication. Importantly, we show higher transcription of phage-encoded cysC and cysH compared with bacterial cysC and cysH genes in some samples (Fig. 6).

We note three limitations of the current version of COBRA. First, COBRA generally could not extend query contigs with very low sequencing coverage (Fig. 2a). A tool that automatically extends contigs using sequences from multiple related samples26 might be developed to overcome this limitation. Second, COBRA runs relatively slowly if the corresponding community is complicated (indicated by ‘total sharing ends’; Supplementary Table 9), for example, soil and underground water; all but one of our tested processings were finished within half an hour to 4 h. Third, unlike ContigExtender27, COBRA directly uses the contigs generated by assemblers; thus, it cannot fix (or detect) chimeras that are introduced during assembly.

Methods

Simulated genomes for evaluation of contig breaking rules in de novo assembly

To evaluate how the assemblers of IDBA_UD, metaSPAdes and MEGAHIT will fragment the contigs during assembly in dealing with intra-genome repeats, inter-genome shared region and within-population variation (that is, local variation), we simulated the artificial genomes using Geneious Prime48 for different cases that are described in detail in Supplementary Information. In each case, the artificial genomes were simulated for Illumina paired-end reads using InSilicoSeq with the ‘HiSeq’ error model, which generated paired-end reads in the length of 126 bp. The simulated reads were then assembled using IDBA_UD49 (‘mink = 20, maxk = 100, -step = 20, -pre_correction’), metaSPAdes version 3.15.149 (‘-k 21,33,55,77,99’) and MEGAHIT version 1.2.950 (‘-k-list 21,29,39,59,79,99’). The obtained contigs from each assembly of each case were manually checked for breaking points and the possibilities of joining via their end sequences with a determined length (that is, 99 bp), which are shown in detail in Supplementary Information.

Evaluation of contig breaking rules in de novo assembly using simulated genomes

To assess how IDBA_UD, metaSPAdes and MEGAHIT fragment contigs during assembly when confronted with intra-genome repeats, inter-genome shared regions and within-population variation (local variation), we generated artificial genomes using Geneious Prime48. Detailed descriptions of each case can be found in the Supplementary Information. For each case, artificial genomes were simulated for Illumina paired-end reads of 126 bp in length using InSilicoSeq50 with the ‘HiSeq’ error model. Subsequently, the simulated reads were assembled using the following parameters: IDBA_UD49: ‘mink = 20, maxk = 100, -step = 20, -pre_correction’, metaSPAdes51 version 3.15.1 18: ‘-k 21,33,55,77,99’ and MEGAHIT52 version 1.2.9 19: ‘-k-list 21,29,39,59,79,99’. The resulting contigs from each assembly in each case were manually inspected for breaking points and the potential for joining via their end sequences, with a specified length of 99 bp. Further details can be found in Supplementary Information.

Benchmark COBRA using a previously published ocean virome dataset

Three virome datasets with samples collected from different depths (that is, 25 m, 117 m and 250 m) were reported previously. The authors sequenced the extracted DNA using both Illumina paired-end reads (150 bp in length) and also Nanopore single-molecule reads32. With these reads, the authors detected and polished complete viral genomes, mostly from the sample collected at 250 m. This dataset was used to benchmark the performance of COBRA. To evaluate the performance of COBRA, the raw reads of the 250 m sample were downloaded from NCBI and trimmed using https://github.com/najoshi/sickle using default parameters to remove low-quality bases. The adaptor sequence and other contaminants were detected and excluded using bbmap (https://sourceforge.net/projects/bbmap/). The trimmed reads were assembled using IDBA_UD49 (‘mink = 20, maxk = 140, -step = 20, -pre_correction’), metaSPAdes version 3.15.149 (‘-k 21,33,55,77,99, 127’) and MEGAHIT version 1.2.9 (ref. 52) (‘-k-list 21,29,39,59,79,99,119,141’). For each assembly, the contigs with a minimum length of 10 kb were compared against the polished viral genomes reported in a previous study32 using BLASTn; the hits with a minimum nucleotide similarity of 97% and minimum alignment length of 10 kb were retained as queries for COBRA analyses. The quality reads of each sample were respectively mapped to all the contigs of the corresponding sample using Bowtie2 version 2.3.5.1 with default parameters53. The sequencing coverage of the contigs was determined using the ‘jgi_summarize_bam_contig_depths’ function from MetaBAT version 2.12.135 and transferred to a two-column file using in-house Perl script. COBRA analyses were performed for BLASTn hits contigs from each assembler, with a mismatch of 2 for linkage of contigs spanned by paired-end reads; the maxK and assembler were flagged according to that used in assembly. The ANI analyses between COBRA sequences and polished genomes were performed by fastANI version 1.3 (ref. 54), and the alignment fraction was calculated accordingly. The quality of viral genomes was evaluated by CheckV14.

Comparison of the performance of COBRA and binning tools

We compared the quality of sequences joined by COBRA to bins generated by various binning tools, namely MetaBAT 2 (ref. 35), vRhyme10 and CoCoNet8. Using the IDBA_UD-assembled contigs (≥2,500 bp) from the 250 m ocean virome sample, we searched for contigs that exhibited ≥99% nucleotide similarity and ≥80% alignment coverage with polished genomes, resulting in a set of 2,632 contigs termed ‘query contigs’. These query contigs were extended using COBRA and also binned using the aforementioned binning tools. For coverage calculation, the quality reads from the virome samples (25 m, 117 m and 250 m) were mapped to the query contigs individually using Bowtie2 version 2.3.5.1 with default parameters53. The coverage profiles derived from all three mapping files were used as input for the three binning tools. However, COBRA used only the mapping file and coverage profile of the 250 m sample. Each bin contained a minimum of two contigs, and if a bin contained only one contig, it was assigned as ‘unbinned’.

To evaluate the accuracy of COBRA joins and sequences represented by bins, we matched the joined contigs and bins back to the polished genomes. For the binning tools, if all the contigs from a given bin were best matched to the same polished genome, the bin was termed a ‘good bin’. For COBRA, if all the contigs joined into a COBRA sequence were best matched to the same polished genome, the join was termed a ‘good join’. If some of the contigs matched to one polished genome (genome a), and some others best matched to another one (genome b), when genome a and genome b shared ≥70% ANI (determined by fastANI version 1.3 (ref. 54)), the bin was termed ‘problematic bin’ (for those from binning tools), and the join as ‘problematic join’ (for those from COBRA). To determine the extent to which the ‘problematic bin’ or ‘problematic join’ was affected by contigs from related (sub)populations, we also compared the ANI of the corresponding matched polished genomes.

For a given bin or join, if some contigs matched to one polished genome (genome a), and some others best matched to another one (genome b), when genome a and genome b shared <70% ANI (determined by fastANI version 1.3 (ref. 54)), it was termed ‘contaminated bin’ or ‘contaminated join’, respectively. To determine the contamination rate of the ‘contaminated bin’ or ‘contaminated join’, we calculate the total length of the bin or join (Total_len), and also the total length of contigs best matched to each of the polished genomes, then picked up the polished genome match with the maximum total length of contigs (that is, Max_len). We calculated the contamination rate of the contaminated bin or join as below, in which ‘Num_polished’ is the total number of matched polished genomes. By doing so, the theoretical maximum contamination rate of a contaminated bin or join is normalized (that is, 100%).

$$(({\rm{Total\_len}}-{\rm{Max\_len}})/{\rm{Total\_len}})/(({\rm{Num\_polished}}-1)/{\rm{Num\_polished}})$$

For example, where the total length of the bin or join is 100 kb and the contigs are matched to two polished genomes, with one polished genome best matching contigs with a total length of 60 kb and the other polished genome matching contigs having a total length of 40 kb, Total_len = 100 kb, Max_len = 60 kb and Num_polished = 2; thus, the contamination rate = 80%.

Benchmark COBRA using a composite soil metagenome dataset

To test whether COBRA could work on metagenomic datasets with higher complexity, we extracted soil viral genomes ≥10,000 bp in length and without any ‘N’ from IMG/VR v3 (ref. 55). These genomes (53,381 in total) were clustered with ≥95% similarity, resulting in a total of 34,303 clusters, each of which was represented by a representative genome. We randomly picked (1) 500 representative genomes and added a direct terminal repeat (100–200 bp) to each of them (category ‘with_DTRs’), (2) 500 representative genomes and randomly mutated 1% bases of each genome (category ‘two_subpopulations_1p’), (3) 500 representative genomes and randomly mutated 3% bases of each genome (category ‘two_subpopulations_3p’), (4) 500 representative genomes and randomly mutated 5% bases of each genome (category ‘two_subpopulations_5p’) and (5) 300 representative genomes, each genome was mutated with 3% bases, and also each genome mutated with 5% bases (category ‘three_subpopulation’). For (2)–(5), the initial representative genome will be kept; thus, we have a total of 500 + 500 × 2 × 3 + 300 × 3 = 4,400 simulated genomes. Note that each representative genome was used only once.

The 4,400 genomes were in silico sequenced using paired-end (126 bp in length) Illumina HiSeq using InSilicoSeq50, with a random sequencing coverage of 10–100 for each genome, which generated ~20 Gb reads. To raise the complexity, the simulated reads were combined with ~12 Gb paired-end reads from a natural soil sample from California, USA56. This composite dataset was assembled using metaSPAdes with the kmer set of ‘21,33,55,77,99’ with 48 threads.

The scaffolds ≥2,500 bp were searched against the 4,400 simulated genomes using BLASTn, and those hits with ≥99% similarity and ≥97% alignment length were retained as queries for subsequent COBRA analyses. To evaluate the performance of binning tools (MetaBAT 2, vRhyme and CoCoNet) on the same set of query scaffolds, we in silico sequenced another two sets of paired-end reads from the 4,400 simulated genomes, so the binning tools could have a coverage profile from at least three samples for better performance. The mapping of reads to all assembled contigs and the calculation of coverage were performed as described above. The ‘extended_circular’ and ‘extended_partial’ COBRA sequences, and the bins (with at least two scaffolds), were compared against their corresponding simulated genomes for accuracy evaluation. A given COBRA sequence or bin was assigned as ‘a good join or bin’ if all the scaffolds were from a single simulated genome, as ‘a problematic join or bin’ if some scaffolds were from a simulated genome while some others were from the mutated genome, and as ‘a contaminated join or bin’ otherwise.

Comparison of COBRA and ContigExtender

ContigExtender is a tool to improve the viral sequences assembled from metagenomic datasets using the reads. It uses a novel recursive extending strategy that explores multiple extending paths to extend the contigs. ContigExtender analysis was performed on the same contig set as used in Comparison of the Performance of COBRA and Binning Tools (2,632 in total); however, given the long processing time (~8 days to extend 29 contigs using 16 threads), we included only the results of the first 29 contigs in our comparison against COBRA. We summarized the extending results of the corresponding contigs by COBRA, and compared the performance of both tools, including the extended length and also the extending accuracy (Supplementary Table 4). The extended sequences from COBRA and ContigExtender were aligned against the corresponding polished complete genomes in Geneious48 and also compared using BLASTn, and manually checked for extension accuracy.

Collection and analyses of published freshwater metagenomic datasets

The freshwater metagenomic datasets from two previously published studies16,57 were used. The raw paired reads were downloaded from NCBI using sratoolkit.2.11.1 (https://hpc.nih.gov/apps/sratoolkit.html) and filtered to remove any low-quality reads and bases, adaptors and other contaminants as described above. The de novo metagenomic assembly was first performed using the quality reads by IDBA_UD52 (‘mink = 20, maxk = 140, -step = 20, -pre_correction’) or metaSPAdes version 3.15.149 (‘-k 21,33,55,77,99,127’). If the RAM of our computing server was not sufficient to assemble the reads of a given sample, it was assembled using MEGAHIT version 1.2.9 (ref. 52) (‘-k-list 21,29,39,59,79,99,119,141’). If a given sample could not be assembled using any of the three assemblers, it was excluded from the analyses. The assembler details for each dataset are shown in Supplementary Table 5.

The generated contigs with a minimum length of 10 kb from each assembly were predicted for viral sequences using VIBRANT7 using default parameters. The identified lysogenic and lytic virus contigs by VIBRANT were used as queries for COBRA analyses. A max mismatch of 2 in each read was set to identify the linkage of contigs spanned by paired-end reads, and the minK, maxK and assembler were flagged according to what was used in assembly (Supplementary Table 5).

Filtering of COBRA sequences and evaluation of assembly gaps

The COBRA sequences from all 231 freshwater metagenomic datasets were evaluated by CheckV (version 0.7.0)14. The ‘self_circular’ and ‘extended_circular’ COBRA genomes and those identified as ‘high-quality’ by CheckV were retained for further analyses. To evaluate and fix the assembly gaps, we checked the genomes by parsing the reads mapped to them (with Bowtie2 as described above) using a custom script named ‘gap.check.py’ (available at https://github.com/linxingchen/cobra). The script filtered the mapped reads to allow two mismatches for each read; for a region in a given genome sequence without any base mapped, the region was replaced by 10x Ns. The resulting sequences were used for further analyses.

Genome completeness evaluation of the query contigs

To determine the extent to which COBRA raised the quality of the viral genomes, we evaluated the original query contigs that were joined into ‘extended_partial’ or ‘extended_circular’ genomes, using CheckV (version 0.7.0)14. The percentages of original contigs assigned by CheckV to ‘Low-quality’, ‘Medium-quality’, ‘High-quality’ and ‘Not-determined’ were profiled and shown in Fig. 5b.

Clustering of quality viral sequences

The quality viral sequences were clustered at the species level using the rapid genome clustering approach provided by CheckV14 (available at https://bitbucket.org/berkeleylab/checkv/src/master/). The clustering parameters were set as follows: -perc_identity = 90 (for BLASTn), -min_ani = 95, -min_qcov = 10 and –min_tcov = 80 (for aniclust.py). The quality viral sequences were clustered into species-level clusters. Among these representative sequences, 6,430 had no assembly gaps, 815 had one gap, 195 had two gaps and 71 had three or more gaps. It is worth mentioning that any identified gaps were filled with 10x Ns during the clustering process.

Identification of eukaryote viruses, virophages and phages

The protein-coding genes were predicted using Prodigal (-p meta)58. The eukaryote viruses were identified by searching the core structural protein sequences via BLASTp against the RefSeq database59. The virophage sequences were identified by searching their major capsid proteins (MCPs) against the virophage-specific HMM databases reported previously37 using hmmsearch60 version HMMER 3.3 (−E = 1 × 10−6). Those sequences with virophage-specific MCP hits were confirmed by building a tree with the MCPs from reference virophage sequences published previously37,61.

Taxonomy assignment of phages

To taxonomically assign the phages with genomes reconstructed in this study using the standardized ICTV taxonomy updated recently62, we used PhaGCN213 (minimum score, 0.5) and geNomad63. The results from these two tools were considered; for a given genome, (1) it was assigned as ‘unclassified’ if both tools failed to assign it, or it was assigned to different taxa, and (2) it was assigned to the taxonomic level determined by one of the tools if the other failed to assign.

Identification of new species phage genomes obtained in this study

The viral genomes from several published datasets were included for comparison, including the IMG/VR64, the huge phages across ecosystems19, the complete viral genomes from freshwater metagenomes16, the pmoC phages20 and the bS21 phages65; these genomes were termed ‘viral_refs’. The ‘viral_refs’ genomes were first searched against our cluster representative genomes using BLASTn with a minimum e-value of 1 × 10−50 and a minimum similarity of 90%. The BLASTn results were parsed to retain those with at least one hit with a minimum alignment length of 10,000 bp, and the corresponding genomes were extracted for genome clustering. If a given phage genome from the clusters could cluster with any of the ‘viral_refs’, it was labelled as ‘reported’ and as ‘new species genome’ otherwise.

Huge-phage analyses

The subset of representative phage genomes with a minimum length of 200 kb were classified as huge phages. To include more huge-phage genomes from the freshwater datasets, we checked the low-quality and medium-quality genomes for huge phages and manually curated some of them. Protein-coding genes were predicted from them using Prodigal version 2.6.3 (-m -p meta)58. The predicted proteins were searched using BLASTp (e-value threshold = 1 × 10−5) against the proteins of large terminase subunit (TerL), MCP, portal protein and prohead protein from the huge phages reported previously19,23. The BLASTp hits were confirmed using the online HMM search66. The confirmed proteins were individually aligned using MUSCLE (version 5.1.linux64)67 and filtered to remove the columns accounting for >90% gaps using Trimal68. The filtered sequences for each genome were concatenated, and the phylogenetic tree was built using IQ-TREE version 1.6.12 (-bb = 1000, -m = LG + G4)69.

To evaluate the abundance of each huge phage in each of the samples from Lake Rotsee (Fig. 6c), reads per kilobase per million reads mapped (RPKM) were calculated as follows: RPKM = Nphage/(Lphage/1,000)/(Nsample/1,000,000), where Nphage is the number of reads to the phage genome, Lphage is the length of the phage genome (bp) and Nsample is the number of reads mapped to the whole metagenome-assembled contig set. The DNA read mapping to genomes or contigs was performed by Bowtie2 (version 2.3.5.1)53 with default parameters excepting −X = 2,000, and filtered using the pysam Python module70 to allow 0 or 1 mismatch for each mapped read. RPKM calculation of RNA reads to phage genomes was performed in the same way.

Analyses of actinophages

We searched the phages infecting Actinobacteria (that is, actinophages), which are abundant in freshwater ecosystems, by searching for the whiB gene71 via BLASTp search against NCBI RefSeq whiB protein sequences and by manual validation using the online HMM search tool (www.ebi.ac.uk/Tools/hmmer/search/). We determined the subset of the recovered genomes encoding whiB that have been reported previously16. A total of 4,288 (519 high-quality species genomes) from IMG/VR and 158 (79 species genomes) from ref. 16 were included in our analyses, along with 4,070 actinophage genomes (1,116 encode whiB) from ‘The Actinobacteriophage Database’ (https://phagesdb.org/). The entire set were clustered to identify distinct species genomes as described above (Clustering of quality viral sequences). The genes encoding TerL, MCP, portal protein and prohead protein were identified from each of the genomes. The sequences were individually aligned using MUSCLE67 and filtered to remove the columns with >90% gaps using Trimal68. The concatenated sequences were used to reconstruct a phylogenetic tree to show the expansion of the whiB-encoding actinophage dataset via this study. The tree was built using IQ-TREE version 1.6.1269 with 1,000 bootstraps and the ‘LG + G4’ model.

Transcriptional activity analyses

For the analysis of viral metabolic gene expression in situ, RNA reads obtained from Rotsee Lake samples were used. Metagenome-assembled contigs with a minimum length of 5 kb were examined using CheckV14 and VIBRANT7 to identify non-phage- and phage-encoded cysC and cysH genes. Only contigs with consistent predictions (either non-phage or phage) from both CheckV and VIBRANT were retained. The RNA reads from each sample were mapped to the corresponding contigs harbouring cysC and/or cysH genes. Subsequently, the transcriptional activity of each gene was normalized and summed separately for those encoded by non-phages and phages. The ratio of total transcriptional activity between non-phage and phage was calculated individually for cysC and cysH in each sample.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.