Date: 2020-12-21

Database of HMMs for classification of viral and microbial genes

We selected HMMs from existing databases that could be leveraged to classify genes as either viral or microbial with high specificity. First, 125,754 HMMs were downloaded from seven databases: VOGDB (release 97, n = 25,399, http://vogdb.org), IMG/VR (downloaded January 2020, n = 25,281)25, RVDB (release 17, n = 9,911)49, KEGG Orthology (release 2 October 2019, n = 22,746)45, Pfam A (release 32, n = 17,929)50, Pfam B (release 27, n = 20,000)51 and TIGRFAM (release 15, n = 4,488)52. Next, we used hmmsearch v.3.1b2 (ref. 53) to search the HMMs against 1,590,764 proteins from 30,903 NCBI GenBank viral genomes (downloaded 1 June 2019)34 and 5,749,148 proteins from 2,015 bacterial and 239 archaeal genomes from GTDB (release 89). For computational reasons, we selected a maximum of one genome per GTDB family and, when multiple genomes were available, we chose the one with the highest CheckM quality score (completeness – 5 × contamination). Additionally, we ran VIBRANT v.1.2.0 (ref. 11), VirSorter v.1.0.5 (ref. 10) and PhiSpy v.3.7.8 (ref. 23) using default parameters to identify and remove 590,484 viral proteins located on proviruses in the selected GTDB genomes.
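
The genome selection step can be illustrated with a minimal sketch. It assumes each genome is described by a tuple of identifier, GTDB family, CheckM completeness and CheckM contamination; the function names and the input layout are hypothetical and are not taken from the CheckV code.

```python
def quality_score(completeness, contamination):
    """CheckM quality score as defined in the text: completeness - 5 x contamination."""
    return completeness - 5.0 * contamination


def pick_representatives(genomes):
    """Keep one genome per family, choosing the highest quality score.

    genomes: iterable of (genome_id, family, completeness, contamination).
    Ties keep the first genome encountered (an assumption of this sketch).
    """
    best = {}
    for genome_id, family, completeness, contamination in genomes:
        score = quality_score(completeness, contamination)
        if family not in best or score > best[family][1]:
            best[family] = (genome_id, score)
    return {family: genome_id for family, (genome_id, _) in best.items()}
```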

Based on the hmmsearch results, we calculated the percentage of viral and microbial genes matching each HMM at bit-score cutoffs ranging from 25 to 1,000, in increments of 5. We then selected the lowest bit-score cutoff for each HMM that resulted in a difference >100-fold between the percentage of the total viral gene set and that of the total microbial gene set matched by the HMM (that is, the bit-score cutoff at which the hits were strongly enriched in either viral or microbial genes). To limit false positives, we excluded HMMs that were classified as microbial specific but were either derived from primarily viral databases (VOGDB, IMG/VR, RVDB) or, for HMMs from other databases, contained viral terms (viral, virus, virion, provirus, capsid, terminase). Using this approach, 114,765 HMMs were identified as either viral specific or microbial specific.
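
The cutoff scan described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function name and inputs are hypothetical, a hit counts at a cutoff when its bit score is greater than or equal to it, and an HMM that matches genes from only one set at a given cutoff is treated as exceeding the 100-fold threshold.

```python
def select_cutoff(viral_scores, microbial_scores, n_viral, n_microbial, min_fold=100.0):
    """Return (cutoff, label) for the lowest cutoff with >min_fold enrichment, or None.

    viral_scores / microbial_scores: bit scores of this HMM's hits in each gene set.
    n_viral / n_microbial: total number of genes in each set.
    """
    for cutoff in range(25, 1001, 5):
        pct_viral = 100.0 * sum(s >= cutoff for s in viral_scores) / n_viral
        pct_microbial = 100.0 * sum(s >= cutoff for s in microbial_scores) / n_microbial
        if pct_viral == 0 and pct_microbial == 0:
            break  # no hits remain, so higher cutoffs cannot help
        if pct_viral > min_fold * pct_microbial:
            return cutoff, "viral"
        if pct_microbial > min_fold * pct_viral:
            return cutoff, "microbial"
    return None
```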

Next, we selected the maximally informative subset of HMMs to reduce the size of the database and limit CheckV computing time. First, we retained 44,415 HMMs with at least 20 viral hits or at least 100 microbial hits after applying the bit-score cutoffs. Next, we calculated the Jaccard similarity between all pairs of HMMs based on each HMM's set of gene hits. For computational efficiency, we used the ‘all_pairs’ function in the SetSimilaritySearch Python package (https://github.com/ekzhu/SetSimilaritySearch). Jaccard similarities were used as input for single-linkage clustering with a Jaccard similarity cutoff of 0.5, resulting in 15,958 nonredundant HMMs (8,773 viral specific, 7,185 microbial specific). To form the final database, we selected the HMM with the greatest number of gene hits from each cluster of HMMs.
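
A small sketch of the dereplication step follows. For clarity it compares all pairs by brute force instead of calling the SetSimilaritySearch package, and it implements single-linkage clustering with a union-find structure; the function names are hypothetical, and linking pairs at a similarity greater than or equal to the cutoff is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of gene identifiers."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def representative_hmms(hits, cutoff=0.5):
    """Single-linkage clustering of HMMs; return one representative per cluster.

    hits: dict mapping HMM name -> set of gene hits.
    The representative is the HMM with the most gene hits in its cluster.
    """
    parent = {name: name for name in hits}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(hits, 2):
        if jaccard(hits[a], hits[b]) >= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in hits:
        clusters.setdefault(find(name), []).append(name)
    return sorted(max(members, key=lambda n: len(hits[n])) for members in clusters.values())
```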

Identification of virus–host boundaries

Given a viral contig, CheckV predicts host–virus boundaries in three stages.

First, proteins are predicted using Prodigal v.2.6.3 (option ‘-p meta’ for metagenome mode)54 and compared to the 15,958 HMMs using hmmsearch. Each protein is classified as viral, microbial or unannotated according to its top-scoring hit after applying the HMM-specific bit-score cutoffs. Viral- and microbial-annotated genes are assigned a viral score of +1 and –1, respectively. Additionally, the GC content of each gene is calculated (range, 0–100).

Second, CheckV scans across the contig and quantifies differences in the viral score (that is, +1 or –1) and GC content between a pair of adjacent gene windows. The 5' gene window extends to the left contig endpoint, and the 3' gene window is sized to contain 30% of genes on the contig with no fewer than 15 genes and no more than 50 genes. The 3' window may contain fewer than 15 genes if it ends at the right contig endpoint. CheckV then computes a breakpoint score, S, based on the absolute difference in the average viral score, V, and average GC content, G, between genes in the 5' and 3' windows: \(S = |V_{5'} - V_{3'}| + 0.02 \times |G_{5'} - G_{3'}|\). Unannotated genes are not included when calculating V. The value of S ranges from 0 to 4, given that \(|V_{5'} - V_{3'}|\) and \(0.02 \times |G_{5'} - G_{3'}|\) both range from 0 to 2. CheckV also stores the orientation of each breakpoint (that is, host–virus or virus–host) based on the values of \(V_{5'}\) and \(V_{3'}\). These scores are computed at each intergenic position, moving from the 5' end to the 3' end of the contig.
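
The breakpoint score can be sketched in a few lines. The gene representation, the function names, the rounding of the 30% window size and the convention that a window with no annotated genes has V = 0 are all assumptions of this illustration, not details confirmed by the text.

```python
def window_size(n_genes):
    """3' window: 30% of the genes on the contig, clamped to the range 15-50."""
    return max(15, min(50, round(0.3 * n_genes)))


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def breakpoint_score(genes, pos, start=0):
    """Score S at the intergenic position just before genes[pos].

    genes: list of (viral_score, gc) with viral_score in {+1, -1, None}
           (None marks an unannotated gene) and gc in the range 0-100.
    start: first gene of the 5' window (0, or the last accepted breakpoint).
    """
    left = genes[start:pos]
    right = genes[pos:pos + window_size(len(genes))]
    v5 = average(v for v, _ in left if v is not None)
    v3 = average(v for v, _ in right if v is not None)
    g5 = average(g for _, g in left)
    g3 = average(g for _, g in right)
    return abs(v5 - v3) + 0.02 * abs(g5 - g3)
```

For a contig with 20 host genes at 60% GC followed by 20 viral genes at 40% GC, the score at the true junction is 2 + 0.02 × 20 = 2.4.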

Third, CheckV identifies breakpoints based on the following rules: S ≥ 1.2, ≥30% genes annotated as microbial in the host region, ≥2 microbial-annotated genes in the host region and ≥2 viral-annotated genes in the viral region. For very short contigs (fewer than ten genes), CheckV requires only one microbial-annotated gene in the host region and one viral-annotated gene in the viral region. After these filters, CheckV chooses the first encountered breakpoint with the highest score. After selecting the first breakpoint, CheckV then repeats the steps listed above to search for additional breakpoints, using the last identified breakpoint as the new starting position for the 5' gene window. The algorithm ends when no new breakpoints are found. Algorithm parameters were fine-tuned empirically based on a dataset of mock proviruses and sequences from the IMG/VR database.
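
The filtering rules and the choice of the first highest-scoring breakpoint can be expressed as the sketch below. The function names and the label encoding are hypothetical, and treating the host and viral regions as the gene lists on either side of the candidate position is an assumption.

```python
def passes_filters(score, host_region, viral_region, n_contig_genes):
    """Apply the breakpoint rules to one candidate position.

    host_region / viral_region: lists of labels per gene, each one of
    "viral", "microbial" or None (unannotated).
    """
    if not host_region:
        return False
    # Very short contigs (fewer than ten genes) need only one gene of each kind.
    min_genes = 1 if n_contig_genes < 10 else 2
    n_microbial = host_region.count("microbial")
    n_viral = viral_region.count("viral")
    return (
        score >= 1.2
        and n_microbial / len(host_region) >= 0.30
        and n_microbial >= min_genes
        and n_viral >= min_genes
    )


def best_breakpoint(candidates):
    """Return the first (position, score) with the highest score, or None.

    candidates: filtered (position, score) pairs in 5' to 3' order.
    The strict comparison keeps the earliest position when scores tie.
    """
    best = None
    for position, score in candidates:
        if best is None or score > best[1]:
            best = (position, score)
    return best
```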

AAI-based estimation of genome completeness

Given a viral contig, CheckV estimates genome completeness in four stages. First, it performs an amino acid alignment of Prodigal-predicted protein-coding genes from the contig against the database of reference genomes using DIAMOND v.0.9.30 (ref. 55), with the option ‘--evalue 1e-5 --query-cover 50 --subject-cover 50 -k 10000’. Based on these alignments, the following metrics are computed for the viral contig versus each reference genome: AAI: length-weighted average identity across aligned proteins; alignment fraction (AF): percentage of amino acids aligned from the query sequence; and alignment score: AAI × AF. Second, CheckV identifies the top hit in the database for the contig (that is, the reference genome with the highest alignment score) and all reference genomes with alignment scores within 50% of the top hit. The expected genome length of the viral contig, \({\hat{\it G}}\), is then estimated by taking a weighted average of the genome sizes of matched reference genomes, where the alignment scores are used as weights. Reference genome lengths are further weighted based on their source: 2.0 for isolate viruses and 1.0 for metagenome-derived viruses, which are more likely to contain assembly errors and artifacts. CheckV also reports the confidence level of this estimate (low, medium or high), which is determined based on the length of the viral contig and the alignment score to the top reference genome (see Confidence levels for AAI-based completeness estimates for the method used to estimate confidence levels). Third, CheckV estimates the genome completeness of each viral contig, \({\hat{\it C}}\), using the formula: \({\hat{\it C}} = 100 \times {\it{L}}/{\hat{\it G}}\), where L is the length of the viral region for proviruses, or the contig length otherwise.
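
A minimal numerical sketch of the expected-length and completeness calculation is given below. It assumes that 'within 50% of the top hit' means an alignment score of at least half the top score, and that the source weight multiplies the alignment score; the function names and the input layout are hypothetical.

```python
def expected_genome_length(hits):
    """Weighted average of reference genome lengths.

    hits: list of (alignment_score, genome_length, is_isolate) for the
    reference genomes matched by the contig.
    """
    top_score = max(score for score, _, _ in hits)
    kept = [hit for hit in hits if hit[0] >= 0.5 * top_score]
    # Isolate genomes count double relative to metagenome-derived genomes.
    weights = [score * (2.0 if is_isolate else 1.0) for score, _, is_isolate in kept]
    weighted_sum = sum(w * length for w, (_, length, _) in zip(weights, kept))
    return weighted_sum / sum(weights)


def completeness(region_length, expected_length):
    """Estimated completeness (%) = 100 x L / G-hat."""
    return 100.0 * region_length / expected_length
```

With hits scoring 80 (isolate, 40 kb), 60 (metagenomic, 50 kb) and 30 (metagenomic, 100 kb), the third hit falls below half the top score and is dropped, and the expected length is (160 × 40,000 + 60 × 50,000) / 220, or about 42.7 kb.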

HMM-based estimation of genome completeness

An HMM-based approach was developed to estimate completeness for novel viruses that are too divergent from CheckV genomes to obtain an accurate AAI-based estimate. First, CheckV identifies viral genes on the contig based on comparison to the 8,773 viral HMMs (see ‘Identification of virus–host boundaries’ above). Each viral HMM is associated with one or more reference genomes and this information is stored in the database, as well as the coefficient of variation, which is a measure of the variability in reference genome length associated with each HMM. For each HMM on a viral contig, CheckV identifies the range of completeness values corresponding to the 5th and 95th percentiles of the distribution of reference genome lengths containing the same HMM (for example, 35–65% completeness). In theory, we expect the true completeness to be greater than the lower bound 95% of the time, below the upper bound 95% of the time and between both bounds 90% of the time. In practice, however, these outcomes are less frequent due to error in the underlying estimates. CheckV performs this step for each HMM, resulting in a distribution of completeness ranges for each contig (for example, 45–67, 35–55 and 42–49%). Finally, CheckV takes a weighted average of the ranges, where the weights are equal to the inverse of the coefficient of variation with a maximum value of 50. Therefore, HMMs with a low coefficient of variation (which are associated with genomes of consistent length) receive higher weight.
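
The final weighting step can be sketched as follows, assuming the lower and upper bounds are averaged separately with the same weights; the function name and input layout are hypothetical, and giving a coefficient of variation of zero the maximum weight is an assumption.

```python
def weighted_completeness_range(ranges, max_weight=50.0):
    """Combine per-HMM completeness ranges into one range for the contig.

    ranges: list of (lower, upper, cv), where cv is the coefficient of
    variation of reference genome length for that HMM.
    """
    weights = [min(max_weight, 1.0 / cv) if cv > 0 else max_weight for _, _, cv in ranges]
    total = sum(weights)
    lower = sum(w * lo for w, (lo, _, _) in zip(weights, ranges)) / total
    upper = sum(w * hi for w, (_, hi, _) in zip(weights, ranges)) / total
    return lower, upper
```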

Confidence levels for AAI-based completeness estimates

We conducted a large-scale benchmarking experiment to derive confidence levels for AAI-based completeness estimation. First, we extracted a random fragment from each of CheckV’s reference genomes to simulate metagenomic contigs of varying length (200 and 500 bp and 1, 2, 5, 10, 20 and 50 kb). Next, we used CheckV to compute the alignment score between each contig and each complete genome in the reference database. We then compared the true genome length of each contig (that is, the length before fragmentation), L, to the estimated genome length based on each matched reference genome, \({\hat{\it L}}\), and computed the relative unsigned error, as \(100 \times \left| {{\it{L}} - {\hat{\it L}}} \right|/{\it L}\). We then computed the median relative unsigned error after grouping the estimates based on their alignment score and contig length. Finally, we determined three confidence levels: high confidence (0–5% median unsigned error), medium confidence (5–10% median unsigned error) and low confidence (>10% median unsigned error). Using this information, CheckV reports a confidence level in the estimated completeness value for each input contig based on contig length and alignment score (that is, a combination of AAI and AF) to the top database hit. By default, only medium- and high-confidence estimates are included in the final report.
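
The error metric and the three confidence bins can be written as a short sketch. The function names are hypothetical, and assigning the boundary values of exactly 5% and 10% to the better category is an assumption, since the text gives overlapping endpoints.

```python
from statistics import median


def relative_unsigned_error(true_length, estimated_length):
    """100 x |L - L-hat| / L."""
    return 100.0 * abs(true_length - estimated_length) / true_length


def confidence_level(median_error):
    """Map a median relative unsigned error (%) to a confidence level."""
    if median_error <= 5.0:
        return "high"
    if median_error <= 10.0:
        return "medium"
    return "low"


def bin_confidence(true_length, estimated_lengths):
    """Confidence for one bin of estimates sharing alignment score and contig length."""
    errors = [relative_unsigned_error(true_length, est) for est in estimated_lengths]
    return confidence_level(median(errors))
```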

Database of complete viral genomes for AAI-based completeness estimation

We downloaded 30,903 genomes from NCBI GenBank on 1 June 2019, excluding 1,937 that were indicated as ‘partial’, ‘chimeric’ or ‘contaminated’. Of the remaining 28,966, 677 (2.3%) were labeled as ‘metagenomic’ or ‘environmental’, suggesting that the vast majority are derived from cultivated isolates.

Next, we used CheckV to systematically search for complete genomes of uncultivated viruses from publicly available and previously assembled metagenomes, metatranscriptomes and metaviromes. An assembled contig was considered complete if it was at least 2,000 bp in length and included a direct terminal repeat (DTR) of at least 20 bp (DTR contigs). We searched for DTR contigs in the following datasets: 19,483 metagenomes and metatranscriptomes from IMG/M (accessed September 2019)35, 11,752 metagenomes from MGnify (accessed 16 April 2019)36, 9,428 metagenomes assembled by Pasolli et al.38, an expanded collection of 4,763 metagenomes from the HGM dataset37, 1,831 viromes from HuVirDB39 and 145 viromes from the Global Ocean Virome 2.0 dataset6.
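
The DTR criterion can be illustrated with a naive sketch that looks for the longest exact match between the start and the end of a contig. The function names are hypothetical, and exact matching with a repeat limited to half the contig length is an assumption; the CheckV implementation may differ.

```python
def dtr_length(seq, min_dtr=20):
    """Length of the longest exact repeat shared by both contig ends, or 0."""
    seq = seq.upper()
    for n in range(len(seq) // 2, min_dtr - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def is_dtr_contig(seq, min_contig=2000, min_dtr=20):
    """True if the contig is at least 2,000 bp and carries a DTR of at least 20 bp."""
    return len(seq) >= min_contig and dtr_length(seq, min_dtr) > 0
```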

From this initial search, we identified a total of 751,567 DTR contigs. To minimize false positives and other artifacts, we removed the following: (1) 45,448 contigs with low-complexity repeats (for example, AAAAA…), as determined by dustmasker from the BLAST+ package v.2.9.0 (ref. 56); (2) 11,359 contigs classified as proviral by CheckV; (3) 5,737 contigs with repeats occurring more than five times per contig, which could represent repetitive genetic elements such as clustered regularly interspaced short palindromic repeat (CRISPR) arrays; (4) 6,543 contigs that contained a large duplicated region spanning ≥20% of the contig length, resulting from rare instances where assemblers concatenate multiple copies of the same genome; and (5) 1,293 contigs containing ≥1% ambiguous base calls. After application of these filters, 686,030 contigs remained (91.3% of the total).

Next, we used a combination of CheckV marker genes and VirFinder9 to classify 116,666 DTR contigs as viral. First, the DTR contigs were used as input to VirFinder v.1.1 with default parameters, and to CheckV to identify viral and microbial marker genes. We additionally searched for genes related to plasmids and other nonviral mobile genetic elements using a database of 141 HMMs from recent publications57,58,59. A contig was classified as viral if the number of viral genes exceeded that of microbial and plasmid genes (n = 99,345), or VirFinder reported a P < 0.01 with no plasmid genes and no more than one identified microbial gene (n = 36,084).
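
The two-part classification rule can be stated compactly. In this sketch the function name and arguments are hypothetical, and the first criterion is read as the viral gene count exceeding the combined count of microbial and plasmid genes, which is one possible interpretation of the text.

```python
def is_viral(n_viral, n_microbial, n_plasmid, virfinder_p):
    """Classify a DTR contig as viral using marker genes or VirFinder."""
    by_markers = n_viral > n_microbial + n_plasmid
    by_virfinder = virfinder_p < 0.01 and n_plasmid == 0 and n_microbial <= 1
    return by_markers or by_virfinder
```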

Taxonomic annotation of CheckV reference genomes

Annotations were determined based on HMM searches against a custom database of 1,000 taxonomically informative HMMs from the VOG database (http://vogdb.org/). These HMMs were selected for major bacterial and archaeal viral groups with consistent genome length and at least ten representative genomes, including: Caudovirales, CRESS-DNA and Parvoviridae, Autolykiviridae, Fusello- and Guttaviridae, Inoviridae, Ligamenvirales, Ampulla-, Bicauda- and Turriviridae, Microviridae and Riboviria. For each group, VOGs found in ≥10% of the group members and never detected outside of this group were considered as marker genes. All CheckV reference genomes were annotated based on the clade with the most HMM hits. Overall, 96.4% of HMM hits were to a single viral taxon.

Validating the completeness of CheckV reference genomes

Next, we validated the completeness of all GenBank genomes and DTR contigs. First, we used CheckV to estimate the completeness for all sequences after exclusion of self-matches. This was performed using a database of GenBank sequences only and another of DTR contigs only. Any sequence with <90% estimated completeness using either database was excluded (medium- and high-confidence estimates only). Second, we compared genome length to the known distribution of genome length for the annotated viral taxon (for example, Microviridae). Any genome considered an outlier or shorter than the shortest reference genome for the annotated clade was excluded. After application of these exclusion filters, we then selected genomes for inclusion with ≥90% estimated completeness using either database (medium- and high-confidence estimates only) or >30 kb without a completeness estimate. These selection criteria were chosen to minimize the number of false positives (that is, genome fragments wrongly considered complete genomes) at the cost of some false negatives (that is, removal of truly complete genomes). This resulted in 24,834 GenBank genomes and 76,262 DTR contigs that were used to form the final CheckV genome database.

Generating a nonredundant set of CheckV reference genomes

Average nucleotide identity (ANI) and alignment fraction (AF) were computed between the 24,834 GenBank genomes and 76,262 DTR contigs using a custom script. Specifically, we used blastn from the BLAST+ package v.2.9.0 (option: perc_identity=90 max_target_seqs=10000) to generate local alignments between all pairs of genomes. Based on this, we estimated ANI as the average DNA identity across alignments after weighting the alignments by length. The AF was computed by taking the total length of merged alignment coordinates and dividing this by the length of each genome. Clustering was then performed using a greedy, centroid-based algorithm in which (1) genomes were sorted by length, (2) the longest genome was designated as the centroid of a new cluster, (3) all genomes within 95% ANI and 85% AF were assigned to that cluster and (4) steps 2 and 3 were repeated until all genomes had been assigned to a cluster, resulting in 52,141 nonredundant genomes.
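
The ANI and AF calculations and the greedy clustering can be sketched as below. This is not the custom script used in the study: the function names are hypothetical, alignments are assumed to be given as 1-based inclusive coordinates on one genome with a percent identity, and the 95% ANI and 85% AF test is abstracted into a caller-supplied function.

```python
def ani_and_af(alignments, genome_length):
    """Length-weighted ANI (%) and alignment fraction (%) for one genome.

    alignments: list of (start, end, percent_identity) on this genome.
    """
    if not alignments:
        return 0.0, 0.0
    total = sum(end - start + 1 for start, end, _ in alignments)
    ani = sum((end - start + 1) * pid for start, end, pid in alignments) / total
    covered, last_end = 0, 0
    for start, end, _ in sorted(alignments):  # merge overlapping coordinates
        if end > last_end:
            covered += end - max(start, last_end + 1) + 1
            last_end = end
    return ani, 100.0 * covered / genome_length


def greedy_cluster(lengths, similar):
    """Greedy centroid-based clustering, longest genome first.

    lengths: dict genome -> length; similar(centroid, genome) -> bool should
    return True when the pair meets the ANI and AF thresholds.
    """
    remaining = sorted(lengths, key=lengths.get, reverse=True)
    clusters = {}
    while remaining:
        centroid = remaining.pop(0)
        members = [g for g in remaining if similar(centroid, g)]
        clusters[centroid] = [centroid] + members
        remaining = [g for g in remaining if g not in members]
    return clusters
```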

Benchmarking estimation of genome completeness

To benchmark genome completeness estimates, we used 2,000 uncultivated, complete viral genomes from IMG/VR (>20-bp DTR). We used IMG/VR genomes, because these are derived from diverse habitats and represent highly novel sequences. After removal of terminal repeats, a single genome fragment was randomly extracted from each IMG/VR genome (1–100% completeness). These sequences were used as input to CheckV, VIBRANT v.1.2.0 (ref. 11) and viralComplete22. For CheckV we used the flag ‘--max_aai 95’ to exclude closely related genomes in the CheckV database. For VIBRANT, we used the flag ‘--virome’ to increase sensitivity. For viralComplete, completeness was determined based on the ratio of contig length to that of the corresponding genome from NCBI RefSeq. Completeness estimates >100% were set to 100%. Additionally, we benchmarked CheckV using genome fragments derived from NCBI GenBank genomes and used the flag ‘--max_aai 95’ to exclude closely related genomes in the CheckV database.

Benchmarking detection of host regions on proviruses

To benchmark CheckV’s detection of host regions, we constructed a mock dataset of proviruses: 382 viral genomes were downloaded from NCBI GenBank (after 1 June 2019) and paired with 76 GTDB genomes (71 bacterial, 5 archaeal). None of the 382 genomes were used to train CheckV (that is, selection of HMMs and bit-score thresholds). The pairing was performed at the genus level based on the annotated names of virus and host (for example, Escherichia phage paired with Escherichia bacterial genome). When multiple GTDB genomes were available for a given bacterial genus, we chose that with the highest CheckM quality score and selected a maximum of ten GenBank genomes per bacterial genus to reduce the influence of a few over-represented groups. Any GenBank or GTDB genome that was used at any stage for training CheckV was excluded. Proviruses were simulated at varying contig lengths (5, 10, 20, 50 and 100 kb) with varying levels of host contamination (10, 20 and 50%; defined as the percentage of contig length derived from the microbial genome). Microbial genome fragments were appended to either the 5' or 3' end of the viral fragment at random. As a negative control, we also simulated contigs that were entirely viral (that is, no flanking microbial region) at the same contig lengths.

Mock proviruses were used as input to CheckV using default parameters. For comparison, we also ran VIBRANT v.1.2.0 (ref. 11), VirSorter v.1.0.5 (ref. 10), PhiSpy v.3.7.8 (ref. 23) and Phigaro v.2.2.5 (ref. 24). All tools were run with default options with the exception of VIBRANT and VirSorter, which were run with the flag ‘--virome’ to increase sensitivity. Nucleotide sequences were used as input to all tools, except PhiSpy, for which we first ran Prokka v.1.14.5 (ref. 60) to generate the required input file. A contig was classified as a provirus if it contained a predicted viral region covering <95% of its length. Each prediction was then classified as a true positive (provirus classified as provirus), false positive (viral contig classified as provirus), true negative (viral contig not classified as provirus) or false negative (provirus not classified as provirus). For the true positives, we also compared the true and predicted lengths of the host region.
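
The evaluation rule can be sketched as a small function. The names are hypothetical, and a contig with no predicted viral region is treated here as not classified as a provirus, which is an assumption about how such cases were handled.

```python
def classify_prediction(is_true_provirus, contig_length, viral_region_length=None):
    """Return 'TP', 'FP', 'TN' or 'FN' for one mock contig.

    viral_region_length: length of the predicted viral region, or None if
    the tool predicted no viral region on the contig.
    """
    predicted_provirus = (
        viral_region_length is not None
        and viral_region_length / contig_length < 0.95
    )
    if is_true_provirus:
        return "TP" if predicted_provirus else "FN"
    return "FP" if predicted_provirus else "TN"
```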

Application of CheckV to diverse viral genome collections

We downloaded 735,106 contigs >5 kb from IMG/VR 2.0 (ref. 25), after exclusion of viral genomes from cultivated isolates and proviruses identified from microbial genomes. We also downloaded 488,131 contigs >5 kb or circular from the GOV 2.0 dataset6 (datacommons.cyverse.org/browse/iplant/home/shared/iVirus/GOV2.0). These were used as input to CheckV to estimate the completeness, identify host–virus boundaries and predict closed genomes. When running the completeness module, we excluded perfect matches (100% AAI and 100% AF) to prevent any DTR contig from matching itself in the database (since IMG/VR 2.0 and GOV 2.0 were used as data sources to form the CheckV database). A Circos plot61 was used to link IMG/VR contigs to their top matches in the CheckV database. Protein-coding genes were predicted from proviruses using Prodigal and compared to HMMs from KEGG Orthology (release 2 October 2019)45 using hmmsearch from the HMMER package v.3.1b2 (E-value ≤ \(1 \times 10^{-5}\) and score ≥30). Pfam domains with the keywords ‘integrase’ or ‘recombinase’ were also identified across all proviruses.

The largest DTR contig we identified from IMG/VR was further annotated to illustrate the type of virus and genome organization represented (IMG ID: 3300025697_____Ga0208769_1000001). Coding sequence prediction and functional annotations were obtained from IMG35. Annotations for virus hallmark genes, including a terminase large subunit (TerL) and major capsid protein, were confirmed via HHPred v.3.2.0 (ref. 62) (databases included PDB 70_8, SCOPe70 2.07, Pfam-A 32.0 and CDD 3.18, score >98). A circular genome map was drawn with CGView63. To place this contig in an evolutionary context, we built a TerL phylogeny including the most closely related sequences from a global search for large phages42. The TerL amino acid sequence from the DTR contig was compared to all TerL sequences from the ‘huge phage’ dataset via blastp (E-value ≤ \(1 \times 10^{-5}\), score ≥50) to identify the 30 most similar sequences (sorted based on blastp bit-score). These reference sequences and the DTR contig sequence were aligned with MAFFT v.7.407 (ref. 64) using default parameters, the alignment was automatically cleaned with trimAL v.1.4.rev15 with the option ‘--gappyout’65 and a phylogeny was built with IQ-Tree v.1.5.5, with default model selection (optimal model suggested: LG+R4)66. The resulting tree was visualized with iToL67.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.