To the Editor — Eukaryotic viruses and bacteriophages have important roles in microbiomes, but characterization of viruses in metagenomics data is difficult. Viral-like particle (VLP) purification enables enrichment for viruses from microbiome samples before sequencing, but contamination can result in misleading conclusions. We present a software tool named ViromeQC for analyzing virome data. Here, we demonstrate the utility of ViromeQC by applying it to 2,050 human, animal and environmental samples from 35 metagenomic virome sequencing studies that used one of the available VLP enrichment techniques. The resulting analysis reveals these viromes to be rife with bacterial, archaeal and fungal contamination. Most samples show only modest virus enrichment, and such enrichment is very variable between viromes in the same study. To address these issues, we present a validated contamination quality-control pipeline to enable more robust virome metagenomic analyses.
Viruses affect the ecology and composition of microbial communities1,2. Bacteriophages (viruses of bacteria and archaea) are extremely abundant and diverse, and they affect microbiomes in several ways, including transduction, which is an important mechanism of lateral gene transfer3. Metagenomics can be used to characterize phage populations, but phages are so diverse, and evolve so rapidly, that they are poorly represented in sequence databases. Also, there are no universal viral genetic markers, and the overall biomass of viruses, compared with that of other microorganisms in a sample, is low. For these reasons, phage sequences are difficult to identify in metagenomes, although specific methods that are partly based on sequence characteristics of known phages have been reported4,5.
VLP purification can be used to enrich microbiome samples for viral nucleic acids6, thereby improving virus detection. VLP protocols have various goals, ranging from untargeted analyses of highly purified phage populations to targeted identification of rare sequences of viral pathogens in diagnostic samples. These methods typically include filtration through small-pore-size filters that retain bacteria, cesium chloride gradient purification, treatment with chloroform to disrupt membranes, and exposure to nucleases to reduce free DNA and RNA concentration. If the aim is to use metagenomics to detect known viral pathogens, a low-purity sample may suffice because identification will be by alignment of sequence reads to viral databases. However, if the aim is to detect unknown viruses or report all viruses in a sample, a high-purity sample is required. When coupled with untargeted shotgun sequencing7, VLP enrichment has underpinned many studies in human8,9, environmental10,11 and built-environment settings12, but there is no single VLP enrichment protocol that is optimal for all sample types.
Regardless of the VLP protocol, non-viral genetic material remains after enrichment13. These unwanted nucleic acids are contaminants, and their presence particularly confounds the de novo discovery of phages in untargeted virome sequencing. If the VLP virome is pure, reads can be assembled into viral genomes, even if fragmented, without relying on computational prediction approaches, which are unavoidably affected by low-confidence calls and false negatives4,5. The fraction of next-generation sequencing reads belonging to viruses in the VLP sample correlates with the performance of de novo recovery of new viruses, but methods for evaluating VLP purity in samples have not been systematically explored. Some studies have assessed contamination of VLP preparations by PCR amplification of prokaryotic 16S rRNA gene sequences before virome sequencing11,14,15,16,17,18,19. Others have mapped next-generation virome sequencing output against the 16S rRNA gene, or a different marker9,20,21,22,23,24. However, these studies have not provided a validated pipeline to quantify viral enrichment in viromes or unenriched samples. Although efforts toward VLP-protocol optimization have been reported24, the largest meta-analysis of post-sequencing non-viral quantification to date considered just 67 viromes13. As the use of VLP enrichment for virome sequencing is increasing, we set out to evaluate non-viral contamination in >2,000 virome samples.
To assess the enrichment rates of publicly available viromes, we applied our method (Supplementary Methods) to a collection of 2,050 VLP samples (Supplementary Table 1). As controls, we included 2,189 metagenomes that were not enriched for viruses from the curatedMetagenomicData25 and the National Center for Biotechnology Information Sequence Read Archive (NCBI-SRA)26 repositories, as well as 108 publicly accessible synthetic metagenomes27,28 and one mock community (Supplementary Table 2). After uniform preprocessing to remove low-quality reads (Supplementary Methods), we computed the percentage of raw reads in each sample that align to the small subunit ribosomal RNA gene (SSU rRNA), which has never been found in a viral genome. This provided a proxy for non-viral microbial sequence abundance13. We estimated the abundance of bacterial and archaeal 16S and micro-eukaryotic 18S ribosomal genes in all of the viromes and metagenomes. Unenriched metagenomes provided a baseline estimation of the environment-specific rRNA gene abundance, from which we calculated the relative enrichment of viromes with respect to the metagenomes. Environmental and human/animal unenriched metagenomes had a median rRNA gene abundance of 0.08% (n = 320, interquartile range = 0.07%) and 0.25% (n = 1,551, interquartile range = 0.1%), respectively (Fig. 1).
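The read-based proxy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ViromeQC implementation: the function names, the read counts and the choice to express relative enrichment as the baseline median divided by the sample abundance are all hypothetical.

```python
from statistics import median


def rrna_percent(aligned_reads: int, total_reads: int) -> float:
    """Percentage of reads in a sample that align to the SSU rRNA gene."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * aligned_reads / total_reads


def relative_enrichment(sample_pct: float, baseline_pct: float) -> float:
    """Fold enrichment, assumed here to be baseline abundance / sample abundance."""
    if sample_pct == 0:
        return float("inf")  # no rRNA reads detected at all
    return baseline_pct / sample_pct


# Hypothetical unenriched metagenomes: (reads aligned to SSU rRNA, total reads).
metagenomes = [(2_000, 1_000_000), (2_500, 1_000_000), (3_000, 1_000_000)]
baseline = median(rrna_percent(a, t) for a, t in metagenomes)

# Hypothetical virome with ten times fewer rRNA reads than the baseline.
virome_pct = rrna_percent(250, 1_000_000)
fold = relative_enrichment(virome_pct, baseline)
print(f"baseline = {baseline:.3f}%, virome = {virome_pct:.3f}%, fold = {fold:.1f}")
```

Using the median of the unenriched metagenomes as the baseline keeps the reference robust to a few atypical samples, which matters given the interquartile ranges reported above.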
Prokaryotic and micro-eukaryotic contamination of viromes estimated by the quantification of the SSU rRNA revealed a wide range of enrichment efficiencies, with a large fraction of samples (n = 567, 28.7%) having no virus enrichment at all and >50% (n = 990) having less than threefold enrichment. A substantially smaller fraction of samples (n = 339, 17.15%) showed high enrichment (>100-fold). Differences in enrichment rates were not clearly associated with any one VLP purification method, although the heterogeneity of protocols makes it difficult to provide statistical support to this observation. According to taxonomic annotations of the rRNA gene sequences retrieved in viromes, the largest source of contamination was bacterial DNA (1,466 samples), with 88 samples having higher abundances of eukaryotic-associated SSU rRNAs (Supplementary Table 3). The rRNA gene abundance variability was higher in viromes than in metagenomes (Mann–Whitney U test P = 7.5 × 10⁻⁸, Supplementary Fig. 1), revealing not only that many viromes are poorly enriched for viruses, but also that the level of bacterial and archaeal contamination is unpredictable.
The intra-dataset enrichment efficiencies were extremely variable, spanning more than two orders of magnitude in 48.7% of the studies, which shows that even the same virome-enrichment protocol applied to samples from the same study can still yield vastly different levels of contamination. For example, the 91 stool samples from the dataset of Ly et al.18 had a 16S rRNA gene abundance s.d. equal to 4.6 times the average (Fig. 1; dataset 38). This suggests that quality-benchmarking viromes after sequencing is crucial to evaluate the presence of contaminants and that intra-dataset variability should be carefully considered in downstream analyses of untargeted viromes.
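The two variability measures used in this paragraph, the span in orders of magnitude and the ratio of s.d. to average, can be computed as in the following sketch. The abundance values and function names are hypothetical and are not taken from any of the analyzed datasets.

```python
import math
from statistics import mean, stdev


def orders_of_magnitude_spanned(values):
    """log10 of the ratio between the largest and smallest positive value."""
    positive = [v for v in values if v > 0]
    if len(positive) < 2:
        raise ValueError("need at least two positive values")
    return math.log10(max(positive) / min(positive))


def sd_over_mean(values):
    """Sample standard deviation expressed as a multiple of the mean."""
    return stdev(values) / mean(values)


# Hypothetical 16S rRNA gene abundances (%) for samples from one study.
abundances = [0.001, 0.01, 0.5, 2.0]
span = orders_of_magnitude_spanned(abundances)
ratio = sd_over_mean(abundances)
print(f"span = {span:.2f} orders of magnitude, s.d./mean = {ratio:.2f}")
```

A study whose samples give a span above 2 would count among those described here as covering more than two orders of magnitude.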
Four VLP datasets were highly enriched in rRNA genes, with a median abundance >10% and peaks of 90% of reads aligning to either the 16S/18S or 23S/28S rRNA gene subunits (datasets 36, 47, 50 and 51; see Supplementary Table 1). Conversely, the median rRNA gene abundance observed in unenriched real and synthetic metagenomes never exceeded 1% (Supplementary Table 2). The experimental design of these four studies pointed to the likely cause of contamination: they involved DNA and RNA coextraction, with DNA and retro-transcribed cDNA sequenced together. We hypothesize that the higher rRNA abundance was due to incompletely depleted structural rRNA in the samples. In a further 25 RNA viromes, we also found higher rRNA abundances than would be expected (4.18% median abundance when considering both rRNA subunits, maximum of 67.5%; Supplementary Table 4). Because VLP enrichment efficiency cannot be evaluated from rRNA abundances in samples with atypically high levels of rRNA, we excluded datasets with more than 10% median abundance of rRNA genes from the downstream analysis. Viromes with such high levels of rRNA genes are, in any case, unlikely to be useful in downstream genome assembly and analysis. In total, 307 samples were removed, all of which were from studies that sequenced DNA and RNA together. Although protocols of this type cannot be evaluated with our approach, they may be useful for some tasks, such as sequence-based detection of known pathogens.
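The exclusion rule stated above (datasets with a median rRNA gene abundance above 10%) amounts to a simple per-dataset filter. The sketch below assumes a hypothetical mapping from dataset name to per-sample abundances in percent; the dataset names and values are invented for illustration.

```python
from statistics import median

MEDIAN_RRNA_THRESHOLD_PCT = 10.0  # cutoff stated in the text


def split_by_median_rrna(datasets, threshold=MEDIAN_RRNA_THRESHOLD_PCT):
    """Separate datasets whose median rRNA abundance exceeds the threshold."""
    kept, excluded = {}, {}
    for name, abundances in datasets.items():
        target = excluded if median(abundances) > threshold else kept
        target[name] = abundances
    return kept, excluded


# Hypothetical per-sample rRNA gene abundances (%), keyed by dataset.
datasets = {
    "dataset_A": [0.1, 0.3, 0.2],
    "dataset_B": [12.0, 45.0, 90.0],
}
kept, excluded = split_by_median_rrna(datasets)
print("kept:", sorted(kept), "excluded:", sorted(excluded))
```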
To improve virus enrichment estimates, we next calculated the abundance of the large ribosomal subunit rRNA gene (LSU rRNA), encoding prokaryotic 23S and eukaryotic 28S rRNAs (Fig. 2a), and of 31 single-copy universal markers from bacterial and archaeal ribosomal proteins29 (Supplementary Fig. 2). Because some ribosomal proteins are occasionally found in viral genomes30, it is plausible that this might result in assigning viral genomes as contaminants. However, extensive mapping of these universal ribosomal markers against viral repositories provided evidence that the rare inclusion of a marker gene in a viral genome is unlikely to affect the results (Supplementary Note 1, Supplementary Fig. 3 and Supplementary Table 5), especially when considering all 31 single-copy universal markers. Although a few samples (11.8%) still harbored high levels of rRNA genes (i.e., >5% abundance, Supplementary Fig. 4b and Supplementary Fig. 5), the abundance quantifications of rRNA genes (SSU and LSU) and genes encoding single-copy proteins were in agreement for most viromes. In 75.3% of the viromes, rRNA genes and single-copy marker abundances were either both below (67.1%) or both above (8%) the reference unenriched-metagenomes medians (Supplementary Fig. 4). The abundance of the individual markers was highly correlated (Fig. 2b), as were the abundances of SSU rRNA and single-copy markers (Spearman’s coefficient 0.72 when considering the abundance of all 31 markers together). A weaker correlation was observed between LSU rRNA and single-copy markers (Fig. 2b, Spearman’s coefficient 0.47). Although rRNA and single-copy marker abundances were generally in agreement, we propose that a multi-marker approach is required to accurately estimate viral enrichment. For example, one of the datasets we examined9 had substantial amounts of LSU rRNA genes, but was found to be highly virus-enriched if only SSU rRNA was quantified.
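For readers who want to reproduce this kind of agreement check on their own samples, Spearman's coefficient can be computed with the standard library alone, as in the sketch below. The abundance vectors are hypothetical, and the helper names are not part of ViromeQC; ties receive average ranks, which is the usual convention.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-sample abundances (%) of SSU rRNA and single-copy markers.
ssu = [0.01, 0.20, 0.05, 1.50, 0.30]
markers = [0.02, 0.10, 0.08, 2.00, 0.09]
rho = spearman(ssu, markers)
print(f"Spearman's coefficient = {rho:.2f}")
```

A rank-based coefficient is a natural choice here because marker abundances span several orders of magnitude, so a few extreme samples would dominate a linear correlation.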
Finally, we defined a comprehensive enrichment score that includes rRNA large and small subunit abundances and single-copy markers. This score expresses virus enrichment in a sample compared with the medians observed in unenriched metagenomes and was computed by taking the minimum across the three single enrichment scores for both viromes and metagenomes (see Supplementary Methods). Fewer than 50% of viromes that we analyzed had an overall enrichment greater than threefold, fewer than 15% reached 30-fold enrichment, and only 10% of the viromes were more than 50-fold enriched. Most of the viromes failed to meet even a low level of enrichment (two- to threefold; Fig. 2c). Most studies had mixed enrichment levels across samples (average of 55.41 samples per study, s.d. 76.5), with samples within the same dataset spanning between 1- and 100-fold virus enrichment, confirming what we observed for enrichments based on the SSU rRNA gene alone (Fig. 2d). This further underscores that samples that underwent the same VLP technique might have widely different levels of non-viral contamination.
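The minimum-based score can be illustrated as follows. This is a sketch under assumptions rather than the published implementation: each single-marker enrichment is taken to be the unenriched-metagenome median divided by the sample abundance, and the reference medians, marker labels and sample values are hypothetical.

```python
def marker_enrichment(sample_pct: float, reference_median_pct: float) -> float:
    """Fold enrichment for one marker class (assumed: reference / sample)."""
    if sample_pct <= 0:
        return float("inf")  # marker not detected in the sample
    return reference_median_pct / sample_pct


def overall_enrichment(sample: dict, reference_medians: dict) -> float:
    """Most conservative (minimum) enrichment across all marker classes."""
    return min(
        marker_enrichment(sample[marker], reference_medians[marker])
        for marker in reference_medians
    )


# Hypothetical reference medians (%) from unenriched metagenomes.
reference = {"SSU": 0.25, "LSU": 0.5, "single_copy": 0.1}

# Hypothetical virome: very low SSU rRNA but substantial LSU rRNA.
sample = {"SSU": 0.0025, "LSU": 0.25, "single_copy": 0.01}

score = overall_enrichment(sample, reference)
print(f"overall enrichment = {score:.1f}-fold")
```

Taking the minimum means that a sample looking about 100-fold enriched by SSU rRNA alone is still reported as only twofold enriched when its LSU rRNA content is high, which is the situation described for one of the examined datasets.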
To highlight the importance of quality control in untargeted virome metagenomics, we investigated the extent to which the viral enrichment score is connected with success in computational identification of viral genomes from virome samples subjected to metagenomic assembly. We assembled 1,445 untargeted virome samples and classified each of the resulting 2.09 × 10⁷ contigs as viral or not viral using VirSorter4 (Supplementary Methods). The proportion of viral and potentially viral contigs increased from an average of 7.9% to an average of 31% for samples with viral enrichment scores of one- to twofold and five- to ninefold, respectively. However, the proportion of predicted viral contigs did not substantially increase at higher enrichment values (Supplementary Fig. 6). Indeed, in most samples enriched by a factor of 100-fold or more, for which there are, at best, just traces of ribosomal genes from prokaryotes and eukaryotes, fewer than 25% of the assembled nucleotides could be classified as “potentially viral” (i.e., VirSorter category 1, 2 or 3), and fewer than 4% were classified as “surely viral” (i.e., category 1). At such high enrichment rates, assembled contigs could all be considered viral, which means there is a substantial false-negative rate. This is likely due to viral genomes not displaying enough similarity with known reference viruses, as well as to the limitation of contig-based viral detection tools when analyzing contigs with relatively short length4. Conversely, 55 of the 475 poorly enriched samples (i.e., less than threefold) had more than 20% of the assembled nucleotides labeled as potentially viral, which is inconsistent with the high abundance of prokaryotic organisms with much longer genomes and could suggest the presence of false positives. Caution is needed when interpreting the results of viral mining software, and incorporating virome enrichment into untargeted virome studies should improve downstream analyses.
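The per-sample fractions of assembled nucleotides discussed above can be tallied as in this sketch. It assumes that each contig is represented by a hypothetical (length, category) pair, where the category is 1, 2 or 3 as described in the text, or None for contigs not predicted as viral; it does not parse real VirSorter output.

```python
def viral_nucleotide_fractions(contigs):
    """Return (potentially viral, surely viral) fractions of assembled bases."""
    total = sum(length for length, _ in contigs)
    if total == 0:
        raise ValueError("no assembled nucleotides")
    potentially = sum(length for length, cat in contigs if cat in (1, 2, 3))
    surely = sum(length for length, cat in contigs if cat == 1)
    return potentially / total, surely / total


# Hypothetical assembly: (contig length in bp, predicted category or None).
contigs = [
    (5_000, 1),
    (3_000, 2),
    (2_000, 3),
    (30_000, None),
    (10_000, None),
]
potentially_viral, surely_viral = viral_nucleotide_fractions(contigs)
print(f"potentially viral: {potentially_viral:.0%}, surely viral: {surely_viral:.0%}")
```

Weighting by contig length rather than counting contigs matches the text's use of assembled nucleotides and avoids many short fragments dominating the proportion.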
Our analysis should serve to raise awareness of the potential for prokaryotic and eukaryotic contamination in viromes. Unfortunately, post-sequencing evaluation of non-viral contaminants in viromes before contig-based virus classification is rarely performed. Our read-based estimates of non-viral contamination could be used to guide the selection of tools and thresholds for downstream viral contig detection. We caution that metagenomic assembly of poorly enriched samples increases the number of contigs that are wrongly assigned as viral by computational predictions.
We urge researchers to apply quality control to viromes before genome analysis. This is particularly important when datasets are retrieved from public sources and when metagenomic assembly is used to characterize unknown viruses in samples. The computational pipeline we introduce to analyze the enrichment of viromes differs from previous methods that focused on only 16S rRNA genes to address microbial contamination. ViromeQC integrates the abundances of 16S/18S rRNA genes, 23S/28S rRNA genes, and a panel of 31 universal bacterial genes. ViromeQC is freely available at http://segatalab.cibio.unitn.it/tools/viromeqc.
Code and documentation are available at http://segatalab.cibio.unitn.it/tools/viromeqc.
Shkoporov, A. N. & Hill, C. Cell Host Microbe 25, 195–209 (2019).
Suttle, C. A. Nat. Rev. Microbiol. 5, 801–812 (2007).
Wang, X. et al. Nat. Commun. 1, 147 (2010).
Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. PeerJ 3, e985 (2015).
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. Microbiome 5, 69 (2017).
Thurber, R. V., Haynes, M., Breitbart, M., Wegley, L. & Rohwer, F. Nat. Protoc. 4, 470–483 (2009).
Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J. & Segata, N. Nat. Biotechnol. 35, 833–844 (2017).
Reyes, A. et al. Nature 466, 334–338 (2010).
McCann, A. et al. PeerJ 6, e4694 (2018).
Roux, S. et al. Nature 537, 689–693 (2016).
Watkins, S. C. et al. Mar. Freshw. Res. 67, 1700–1708 (2016).
Rosario, K., Fierer, N., Miller, S., Luongo, J. & Breitbart, M. Environ. Sci. Technol. 52, 1014–1027 (2018).
Roux, S., Krupovic, M., Debroas, D., Forterre, P. & Enault, F. Open Biol. 3, 130160 (2013).
Minot, S. et al. Genome Res. 21, 1616–1625 (2011).
Emerson, J. B. et al. Appl. Environ. Microbiol. 78, 6309–6320 (2012).
Minot, S. et al. Proc. Natl. Acad. Sci. USA 110, 12450–12455 (2013).
Kim, Y., Aw, T. G., Teal, T. K. & Rose, J. B. Environ. Sci. Technol. 49, 8396–8407 (2015).
Ly, M. et al. Microbiome 4, 64 (2016).
Reyes, A. et al. Proc. Natl. Acad. Sci. USA 112, 11941–11946 (2015).
Roux, S. et al. PLoS One 7, e33641 (2012).
Weynberg, K. D., Wood-Charlson, E. M., Suttle, C. A. & van Oppen, M. J. H. Front. Microbiol. 5, 206 (2014).
Hannigan, G. D. et al. mBio 6, e01578–15 (2015).
Aguirre de Cárcer, D., López-Bueno, A., Alonso-Lobo, J. M., Quesada, A. & Alcamí, A. FEMS Microbiol. Ecol. 92, fiw074 (2016).
Shkoporov, A. N. et al. Microbiome 6, 68 (2018).
Pasolli, E. et al. Nat. Methods 14, 1023–1024 (2017).
Leinonen, R., Sugawara, H., Shumway, M. & International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 39, D19–D21 (2011).
Zolfo, M., Tett, A., Jousson, O., Donati, C. & Segata, N. Nucleic Acids Res. 45, e7 gkw837 (2016).
Quince, C. et al. Genome Biol. 18, 181 (2017).
Wu, M. & Scott, A. J. Bioinformatics 28, 1033–1034 (2012).
Mizuno, C. M. et al. Nat. Commun. 10, 752 (2019).
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 716575) to N.S. The work was also supported by MIUR ‘Futuro in Ricerca’ RBFR13EWWI_001 and by the European Union (H2020-SFS-2018-1 project MASTER-818368 and H2020-SC1-BHC project ONCOBIOME-825410) to N.S.
The authors declare no competing interests.
Supplementary Methods, Supplementary Note 1 and Supplementary Figures 1–7
Summary of the 2,050 virome datasets considered in the analysis. Dataset sample sizes are related to the actual number of samples that could be classified as DNA VLP viromes according to the available metadata. The reference number refers to Fig. 1, Fig. 2d and Supplementary Fig. 1.
Summary of the 2,189 metagenomes and 109 synthetic metagenomes and mock communities considered in the analysis. Dataset sample sizes are related to the actual number of samples that could be classified as DNA metagenomes according to the available metadata. The reference number refers to Fig. 1, Fig. 2d and Supplementary Fig. 1.
Full dataset of metagenomes and viromes. Contaminant abundances and enrichment data for all the 1,871 metagenomes, 1,670 viromes and 109 synthetic and mock communities that passed all quality controls. Sample type and number of starting reads are provided, as well as the percentage of SSU and LSU rRNAs stratified by life domain.
Validation of the rRNA mapping approach. Expected abundances of 16S rRNA genes are reported for the 108 synthetic and mock communities (tab 1) and 917 16S amplicon sequencing samples (tab 2). Control metagenomes and 16S samples were mapped against the SSU rRNA genes and filtered at different stringency thresholds (see Supplementary Methods). For the 16S amplicon samples, the expected value was set to 100%. The selected threshold is highlighted in blue. The composition of each synthetic metagenome is reported in tab 3. The rRNA abundances in RNA viromes are reported in tab 4.
Detection of single-copy bacterial markers in viral genomes. Number of genomes in each database in which the 31 single-copy markers are detected. The IMG/VR database was split into isolate viruses and uncultivated viruses (tab 1). Number of distinct single-copy markers detected in each database (tab 2).
Cite this article
Zolfo, M., Pinto, F., Asnicar, F. et al. Detecting contamination in viromes using ViromeQC. Nat Biotechnol 37, 1408–1412 (2019). https://doi.org/10.1038/s41587-019-0334-5