Table 1 Comparative analysis of ten viral sequence classifiers

From: Maximal viral information recovery from sequence data using VirMAP

| Pipeline | Mapped Reads (%) | Unique Calls | Viral Taxonomies | CCR (% of mapped) | Precision | Recall | F-score |
|---|---|---|---|---|---|---|---|
| VirMAP | 3,099,015 (50.1%) | 8 | 8 | 3,099,007 (99.999%) | 0.88 | 1.00 | 0.94 |
| Read classification | | | | | | | |
| FastViromeExplorer | 2,710,170 (43.85%) | 7 | 4 | 2,710,170 (100%) | 1.00 | 0.57 | 0.73 |
| VirusSeeker^a | 10,750 (0.174%) | 16 | 16 | 1,467 (13.65%) | 0.31 | 0.57 | 0.40 |
| Kaiju | 2,287,962 (37.02%) | 227 | 227 | 433,243 (18.94%) | 0.09 | 1.00 | 0.17 |
| ViromeScan | 663,185 (10.73%) | 427 | 354 | 614,016 (92.586%) | 0.01 | 0.57 | 0.02 |
| Contig classification | | | | | | | |
| drVM^b | 22,404,813 (362.54%) | 673 | 158 | 18,235,876 (81.39%) | 0.35 | 1.00 | 0.52 |
| VirusTAP | NA | 5 | 5 | NA | 0.60 | 0.43 | 0.50 |
| VIPIE^c | ~109,633 (~1.77%) | 13 | 11 | ~23,731 (~21.65%) | 0.30 | 0.71 | 0.42 |
| Standard method^d | 2,319,573 (37.53%) | 8 | 8 | 2,273,193 (98.03%) | 0.75 | 0.86 | 0.80 |
| Marker gene classification | | | | | | | |
| MetaPhlAn2 | NA | 5 | 5 | NA | 0.40 | 0.29 | 0.34 |
The Viral Mock Community (VMC) dataset (6,180,026 trimmed reads) was processed through nine different pipelines for viral taxonomic classification. VMC was generated by combining purified preparations of seven different viruses (human adenovirus B, human adenovirus C, murine gammaherpesvirus 4, coxsackievirus B4 [strain Tuscany], echovirus E13 [strain Del Carmen], human poliovirus type 1 [strain Mahoney], and rotavirus A) in phosphate-buffered saline. Unique calls are the distinct database entries reported; viral taxonomies are the unique calls reduced to NCBI taxonomic IDs. CCR: correctly classified reads. Precision: true positives / (true positives + false positives). Recall: true positives / (true positives + false negatives). F-score: harmonic mean of precision and recall, 2 × ((P × R) / (P + R)).
^a VirusSeeker applies filtering and clustering techniques to the reads, and final counts are derived from this reduced set.
^b drVM internally counts identical reads across multiple reported entries, so the total counts can exceed 100%.
^c VIPIE reports reads as counts per 100,000 reads; the approximate values shown are rescaled against the original read counts.
^d The standard approach employs a metagenomic assembly using MEGAHIT and a sequential top-hit mapping classification using BLASTn and BLASTx.
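The metric definitions in the table notes can be checked with a short calculation. The sketch below is illustrative only and is not part of any pipeline compared here: the function names are hypothetical, and the true/false positive counts in the example are an assumption (four of the seven mock viruses recovered with no false calls), chosen because they are consistent with the FastViromeExplorer row rather than taken from the source.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 * (P * R) / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from taxonomy-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_score(precision, recall)


def percent_of_total(count: int, total: int) -> float:
    """Express a read count as a percentage of the total read count."""
    return 100 * count / total


VMC_TRIMMED_READS = 6_180_026  # total trimmed reads stated in the table notes

# F-score recomputed from the reported (rounded) precision and recall for VirMAP.
virmap_f = f_score(0.88, 1.00)

# Assumed counts: 4 true positives, 0 false positives, 3 missed viruses.
p, r, f = metrics_from_counts(tp=4, fp=0, fn=3)

# Mapped-read percentage for FastViromeExplorer (2,710,170 mapped reads).
fve_mapped_pct = percent_of_total(2_710_170, VMC_TRIMMED_READS)

print(f"VirMAP F-score from reported P and R: {virmap_f:.2f}")
print(f"Assumed counts -> P={p:.2f}, R={r:.2f}, F={f:.2f}")
print(f"FastViromeExplorer mapped reads: {fve_mapped_pct:.2f}%")
```

Recomputing the F-score from the rounded precision and recall, rather than from raw counts, is the safer check here, because the table reports only rounded values and the underlying counts are not given.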