Known giant virus diversity is currently skewed towards viruses isolated from aquatic environments and cultivated in the laboratory. Here, we employ cultivation-independent metagenomics and mini-metagenomics on soils from the Harvard Forest, leading to the discovery of 16 novel giant viruses, chiefly recovered by mini-metagenomics. The candidate viruses greatly expand phylogenetic diversity of known giant viruses and either represent novel lineages or are affiliated with klosneuviruses, Cafeteria roenbergensis virus or tupanviruses. One assembled genome with a size of 2.4 Mb represents the largest currently known viral genome in the Mimiviridae, and others encode up to 80% orphan genes. In addition, we find more than 240 major capsid proteins encoded on unbinned metagenome fragments, further indicating that giant viruses are underexplored in soil ecosystems. The fact that most of these novel viruses evaded detection in bulk metagenomes suggests that mini-metagenomics could be a valuable approach to unearth viral giants.
Viruses larger than some cellular organisms and with genomes up to several megabases in size have been discovered in diverse environments across the globe, primarily from aquatic systems, such as freshwater, seawater and wastewater1,2, but also from terrestrial environments3,4,5 including permafrost6,7. These viruses are nucleocytoplasmic large DNA viruses (NCLDV), and they infect a wide range of eukaryotes, in particular protists and algae8,9,10,11. Only a few protist-infecting NCLDV have been recovered with their native hosts, such as Cafeteria roenbergensis virus (CroV) in the marine flagellate Cafeteria roenbergensis12 and the Bodo saltans virus (BsV)13. Many of the NCLDV are referred to as giant viruses based on their large physical size and a genome size of at least 300 kb14, although the term has also been applied to members of the NCLDV with genomes of at least 200 kb regardless of their particle size15. Importantly, for many of these NCLDV, genome size and particle diameter do not correlate8.
Most of our current understanding of giant viruses comes from isolates retrieved in co-cultivation with laboratory strains of Acanthamoeba1,3. Only recently have the genomes of giant viruses been recovered by approaches, such as bulk shotgun metagenomics16,17,18,19,20, flow-cytometric sorting21,22,23, and after successful isolation using a wider range of protist hosts23,24,25. Recent large-scale marker gene-based environmental surveys26,27,28 hinted at an immense phylogenetic breadth of giant viruses of which, however, only a small fraction has been isolated to date. Possible reasons include challenges in providing a suitable host during co-cultivation and the inability to recover the viruses together with their native hosts29. In addition, a systematic recovery of giant virus genomes from metagenomic datasets is lacking and thus, the genetic diversity of giant viruses remains underexplored.
Here we describe 16 giant virus genomes from a forest soil ecosystem that were recovered using a cultivation-independent approach. We shed light on their coding potential and expand the phylogenetic framework of the NCLDV. Importantly, the novel genomes represent only the tip of the iceberg as revealed by a survey of the major capsid protein (MCP) encoded on unbinned metagenome fragments, which indicates a much higher untapped diversity of giant virus genetic material in soil.
Mini-metagenomics facilitated the discovery of giant virus genomes
Soil samples from the Harvard Forest were subjected to standard shotgun sequencing of microbial communities. Four of the 28 samples were also analyzed using a ‘mini-metagenomics’30,31,32 approach, where multiple sets of 100 DNA-stained particles were flow sorted and subjected to whole genome amplification and sequencing (Fig. 1a). Metagenomic binning of assembled contigs produced 15 metagenome assembled genomes (MAGs) from the mini-metagenomes and 1 MAG from the bulk metagenomes (Supplementary Tables 1–4) that displayed features typically found in most NCLDV genomes33,34, such as hallmark genes encoding MCP(s), factors for maturation of the viral capsid, and packaging ATPases (Supplementary Table 1, Supplementary Fig. 1). Furthermore, we observed on most contigs a uniform distribution of genes of viral, bacterial, or eukaryotic origin and many without matches in public databases (Supplementary Figs. 2, 3). In addition, these new viruses encoded numerous paralogous genes, a feature common to many NCLDV35,36 (Supplementary Fig. 2). Many of the duplicated genes were located on different contigs and were often unique to the respective genomes, providing additional evidence that these contigs belong to a single viral MAG (Supplementary Fig. 1). Moreover, presence, absence, and copy number of nucleocytoplasmic virus orthologous genes (NCVOGs)34 were comparable to previously described giant viruses, suggesting that each MAG is made up of a single viral genome and that several of them are nearly complete (Supplementary Table 1, Supplementary Figs. 1, 4). An independently conducted benchmarking experiment of the mini-metagenomics approach revealed that no chimeric contigs are created during this workflow, which further supports the quality of the genomes derived here (Supplementary Figs. 5, 6).
Although the bulk metagenome approach generated five-fold more reads, it yielded only a single giant virus genome, whereas mini-metagenomics led to the recovery of 15 additional bins attributable to NCLDV (Fig. 1b). Bulk metagenome reads only mapped to the MAG recovered from bulk metagenomes (at ~9× coverage) and not to any mini-metagenome MAGs, suggesting most of the discovered viruses were of low abundance in the sampled forest soil (Fig. 1b). This was also reflected in the soil metatranscriptomes in which no or only low transcriptional activity of the giant viruses could be detected (Fig. 1b, Supplementary Table 5).
Sorted viral particles expand known diversity of NCLDV
The phylogenetic relationships inferred from the tree built from a concatenated alignment of five core NCVOGs34,37 (Fig. 2a; Supplementary Fig. 1) and the consensus of single protein phylogenies (Supplementary Figs. 7, 8) showed that newly discovered viruses from forest soil were affiliated with diverse lineages in the NCLDV. Two of the new viruses, solivirus and solumvirus, were in sister-position to the pithoviruses, cedratviruses and the recently isolated orpheovirus38. Sylvanvirus represented a long branch on its own. Most novel soil NCLDV were positioned within the family Mimiviridae, which comprises the proposed subfamilies Megamimivirinae, the Klosneuvirinae, the algae-infecting Mesomimivirinae and the genus Cafeteriavirus17,39 (Fig. 2b). One of the new viruses, faunusvirus, grouped with CroV and represents the second viral genome sampled in this clade (Fig. 2b). Another novel virus, satyrvirus, branched as sister lineage to the two recently isolated tupanviruses, which were derived from deep sea and soda lake samples9, together forming a monophyletic clade in the Megamimivirinae (Fig. 2b). Thus, satyrvirus can be considered as a third member of the proposed genus Tupanvirus40. Notably, none of the new lineages were directly affiliated with any of the three other subgroups of well-studied Megamimivirinae41,42. Eight of the new viruses branched within the proposed Klosneuvirinae, currently the largest subfamily in the Mimiviridae based on phylogenetic diversity (PD)43 (Fig. 2c).
Strikingly, the addition of the novel giant viruses to the NCLDV tree led to a 21% increase of the total PD in the NCLDV (Fig. 2c), expanded the diversity of the Mimiviridae by 77% and nearly tripled the PD of the Klosneuvirinae (Fig. 2c). It is important to note that this expansion of PD was from a single study using cultivation-independent techniques, thereby building upon decades of previous giant virus discovery work1,8,10,41. The fact that all these newly discovered viruses represent distinct lineages in the NCLDV hints that additional sampling is expected to lead to a further substantial increase in giant virus PD.
Genomic features of soil giant viruses
The assembled viral genomes assigned to the klosneuviruses were among the largest ever found (Fig. 2b; Supplementary Fig. 1, Supplementary Table 1). With a genome size of up to 2.4 Mb the hyperionvirus would become the new record for genome size in the Mimiviridae, dwarfing klosneuvirus and tupanvirus with their ~1.5 Mb genomes9,17. Considering that several of the forest soil MAGs are potentially only partially complete, the true genome size of the new viruses might be even larger. Similar to recently discovered klosneuviruses and tupanviruses9,17, several of the new viruses affiliated with the Klosneuvirinae encode expanded sets of aminoacyl tRNA synthetases (aaRS), e.g. terrestrivirus with up to 19 different aaRS and up to 50 tRNAs with specificity for all 20 different amino acids, a feature only very recently described in the tupanviruses9. In concert with other viral components of the eukaryotic translation system, such viruses likely override host protein biosynthesis using their own enzymes to ensure efficient production of viral proteins. Being less dependent on the host cell machinery might allow these viruses to infect multiple hosts, i.e. fewer proteins are necessary to target and interact with alternative hosts. A broader host range has been experimentally verified for tupanviruses9, which were able to infect different protists; however, viral titer did not increase in all cases9.
Genome novelty of soil giant viruses
Complementary to the phylogenetic analysis (Fig. 2a), we inferred a gene sharing network to provide further insights into the relationship of the novel viral genomes to known NCLDV lineages based on shared gene content. In agreement with the species tree, viral lineages such as the Mimiviridae, the Marseilleviridae, the pithoviruses and cedratviruses, the faustoviruses and the molliviruses and pandoraviruses remained well connected (Fig. 3a). Among the novel viruses with the lowest percentage of genes shared with other NCLDV were solumvirus and solivirus, with solivirus being only connected to orpheovirus and Marseilleviridae and solumvirus to the cedratviruses. In contrast to the phylogenetic tree in which solivirus and solumvirus were affiliated to each other, there was no particular linkage between them in the network. This suggests limited taxon sampling and we expect that with discovery of additional giant virus genomes, the phylogenetic position of these viruses will be better resolved.
Another of the soil giant viruses, denoted as sylvanvirus, featured a genome completely disconnected from all other NCLDV (Fig. 3a). With a size of almost 1 Mb it represents one of the largest viral genomes outside pandoraviruses and the Mimiviridae (Fig. 3a; Supplementary Fig. 1)8,44. With the presence of 10 ancestral NCLDV genes, a number similar to several other NCLDV, the sylvanvirus genome can be considered near complete (Supplementary Fig. 1). Intriguingly, the vast majority (~80%) of its proteins had no matches in the NCBI non-redundant (nr) database (Fig. 3b). Of the proteins with database hits, 57% had matches to eukaryotes and 27% to bacteria but only 13% to other viruses (Fig. 3c). Importantly, there was no trend in taxonomic affiliation of the hits (Fig. 3c), again emphasizing the lack of any affiliation to known viruses and organisms. Among the identifiable genes were 18 potential kinases, five ubiquitin ligases, and a histone, all potentially playing important roles in interaction with a currently unknown host.
True diversity of giant viruses in forest soil
The MCPs in the bulk metagenomes revealed that the 16 novel viral genomes represent just a small fraction of giant virus diversity in the soil samples (Fig. 4a). In total, 245 different MCP genes were detected, of which 99% were part of the unbinned metagenome fraction. Most of these MCPs were located on short contigs with a read coverage of below 2, indicating an extremely low abundance of corresponding NCLDV in the respective samples (Fig. 4b). Importantly, none of the bulk-metagenome MCPs matched MCPs from the mini-metagenome-derived MAGs, further underlining the much greater diversity of giant viruses in these samples. MCPs can be heavily duplicated but usually branch together in lineage-specific clades enabling taxonomic classification based on their nearest neighbors in the tree45. Based on identified phylogenetic relationships it was possible to assign taxonomy to several of the bulk metagenome MCPs, of which most could be attributed to the klosneuviruses (Fig. 4a, c). A hint of the true dimension of the NCLDV diversity is revealed when considering that the total number of nearly 300 MCPs discovered in this study, which includes MCPs from all the MAGs, exceeds the 226 MCPs identified in previously published NCLDV genomes.
Our results illustrate that employing cultivation independent methods on a minute sample from forest soil, a habitat in which giant viruses have rarely been found previously3,45, can lead to key discoveries. Recovery of solumvirus, solivirus, and sylvanvirus, three potentially genus, subfamily, or even family level NCLDV lineages together with 13 other novel giant virus genomes vastly expands the PD of the NCLDV and provides new insights into their genetic makeup.
The fact that only a single giant virus MAG was recovered in the bulk metagenomes suggests extremely low abundance of these viruses compared to bacterial and archaeal community members in forest soil. However, mini-metagenomics has proven most effective in recovering these viruses, yet without any detectable traces of host sequences (Supplementary Tables 6, 7). It is noteworthy that oftentimes the average read coverage of the giant virus MAGs was the highest or among the highest compared to non-viral MAGs derived from the same mini-metagenomes pool of 100 DNA-stained particles (Supplementary Fig. 9). The high coverage and completeness of giant virus genomes is consistent with having several copies of the same viral genome in the same mini-metagenome pool, but the overall low abundance of giant viruses in the system makes it unlikely that several identical viral particles were sorted by chance (Supplementary Figs. 1, 9). A plausible scenario could be that host vacuoles already filled with giant viruses may have been recovered during sorting, thereby delivering several clonal copies of a giant virus genome into a single mini-metagenome pool. This would enable genome assembly of higher quality and completeness, as previously shown for polyploid bacterial symbionts46.
Of the few available studies that have used this mini-metagenomes method, one describes the discovery of a novel intracellular bacterium30 and another a new group of giant viruses17, suggesting mini-metagenomics is a compelling method for elucidating the hidden diversity of intracellular entities such as giant viruses. As shown by the MCP diversity in the unbinned metagenome fraction many novel giant viruses are readily awaiting discovery. Importantly, the mini-metagenomics approach has not been exhaustively performed in soil or any other ecosystem and thus represents a promising addition to the toolkit for exploring the untapped diversity in the giant virus universe.
Sampling and sample preparation
Fourteen forest soil cores from the Barre Woods warming experiment located at the Harvard Forest Long-Term Ecological Research site (Petersham, MA) were collected and sub-sampled into organic horizon and mineral zone, resulting in 28 total samples. Mineral zone samples were flash-frozen while organic horizons were incubated with deuterium oxide for 2 weeks prior to freezing to label the active bacterial and archaeal communities. This incubation was carried out as part of a different experiment that will be addressed in a later manuscript. Total DNA and RNA were extracted from 28 soil samples for bulk metagenomics and metatranscriptomics using the MoBio PowerSoil DNA and RNA kits, respectively. Bacterial and plant rRNA depletion was performed on the RNA samples prior to sequencing. Of these 28 soil samples, a subset of four encompassing two organic and two mineral layers was selected for mini-metagenomics. Cells, and presumably viral particles and/or eukaryote vacuoles containing them, were separated from soil particles using a mild detergent, followed by vortexing, centrifugation, and filtration through a 5 μm syringe filter. The filtrates were stained with SYBR Green nucleic acid stain. For each of the four samples, 90 pools containing 100 SYBR+ particles were sorted into microwell plates using fluorescence activated cell sorting (FACS). Sorted pools underwent lysis and whole genome amplification through multiple displacement amplification (MDA) following methods outlined previously47. A total of 360 sequencing libraries were generated with the Nextera XT v2 kit (Illumina) with 9 rounds of PCR amplification.
The 360 libraries derived from sorted particles were sequenced at the DOE Joint Genome Institute (JGI, Walnut Creek, CA) using the Illumina NextSeq platform. Pools of 90 libraries were processed in four sequencing runs that generated 2 × 150 bp read lengths. Raw Illumina reads were quality filtered to remove contamination and low-quality reads using BBTools (http://bbtools.jgi.doe.gov, version 37.38). Read normalization was performed using BBNorm (http://bbtools.jgi.doe.gov) and error correction with Tadpole (http://bbtools.jgi.doe.gov). Assembly of filtered, normalized Illumina reads was performed using SPAdes (v3.10.1)48 with the following options: --phred-offset 33 -t 16 -m 115 --sc -k 25,55,95. All contig ends were then trimmed of 200 bp and contigs were discarded if the length was <2 kb or read coverage <2 using BBMap (http://bbtools.jgi.doe.gov) with the following options: nodisk ambig, filterbycoverage.sh: mincov.
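The trimming and filtering rule described above can be illustrated with a short Python sketch. It is not the BBMap-based procedure used in this study; the function name and the in-memory data layout are hypothetical. The text does not state whether the 2 kb threshold refers to the contig length before or after trimming, so the sketch assumes it is applied to the trimmed sequence.

```python
from typing import Dict, Tuple

TRIM_BP = 200        # bases removed from each contig end
MIN_LENGTH = 2000    # contigs shorter than 2 kb are discarded
MIN_COVERAGE = 2.0   # contigs with read coverage below 2 are discarded


def trim_and_filter(contigs: Dict[str, Tuple[str, float]]) -> Dict[str, str]:
    """Trim contig ends, then drop short or low-coverage contigs.

    `contigs` maps a contig name to (sequence, mean read coverage).
    Assumption: the length threshold is checked on the trimmed sequence.
    """
    kept = {}
    for name, (seq, coverage) in contigs.items():
        # Remove TRIM_BP bases from both ends; very short contigs become empty.
        trimmed = seq[TRIM_BP:len(seq) - TRIM_BP]
        if len(trimmed) < MIN_LENGTH or coverage < MIN_COVERAGE:
            continue
        kept[name] = trimmed
    return kept
```

Under these assumptions, a 3 kb contig with coverage 5 is kept as a 2.6 kb sequence, whereas a 2.3 kb contig falls below the threshold once trimmed.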
Unamplified TruSeq libraries were prepared for the 28 DNA samples for metagenomic sequencing on the Illumina HiSeq-2000 platform at the DOE JGI. Raw Illumina reads were trimmed, quality filtered, and corrected using bfc (version r181)49 with the following options: -1 -s 10g -k 21 -t 10. Following quality filtering, reads were assembled using SPAdes (v3.11.1)48,50 with the following options: -m 2000 --only-assembler -k 33,55,77,99,127 --meta -t 32. The entire filtered read set was mapped to the final assembly and coverage information generated using bbmap (http://bbtools.jgi.doe.gov, version 37.62) with default parameters except ambiguous = random. The version of the processing pipeline was jgi_mga_meta_rqc.py, 2.1.0.
Libraries were prepared and sequenced on the Illumina NextSeq platform at the DOE JGI. Following sequencing, metatranscriptome reads were quality cleaned and a combined assembly was generated using the MEGAHIT assembler (v1.1.2)51 using the following options: -m 0.2 --k-list 23,43,63,83,103,123 --continue -o out.megahit --12. These cleaned reads were aligned to metagenome reference sequences using BBMap (http://bbtools.jgi.doe.gov, version 37.38) with the following options: nodisk = true interleaved = true ambiguous = random.
Contigs were organized into genome bins based on tetranucleotide sequence composition with MetaBat252. Genome bins were generated for mini-metagenomes without contig coverage patterns due to MDA bias53. Coverage was determined for the bulk metagenomes by mapping reads to the completed assemblies using the Burrows–Wheeler aligner54. Taxonomy of bins was determined with the genome taxonomy database classifier (https://github.com/Ecogenomics/GTDBTk).
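To make the notion of tetranucleotide sequence composition concrete, the following Python sketch computes a simple 256-dimensional tetranucleotide frequency profile for a contig. It is only an illustration: MetaBAT2 uses its own internal composition model, and the function name and the handling of ambiguous bases here are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"
# All 4^4 = 256 unambiguous tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product(ALPHABET, repeat=4)]


def tetranucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Return the relative frequency of each tetranucleotide in `seq`.

    Overlapping windows are counted; windows containing characters other
    than A, C, G, T (e.g. N) are ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return {t: 0.0 for t in TETRAMERS}
    return {t: counts[t] / total for t in TETRAMERS}
```

Contigs originating from the same genome tend to have similar profiles, which is why composition alone can be used for binning when coverage is distorted by MDA.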
Screening for giant viruses
Metagenomic bins were screened for presence of the 20 ancestral NCVOGs34 with hmmsearch (version 3.1b2, hmmer.org). Bins with more than five different hits and/or that contained the NCLDV MCP gene (NCVOG0022) were selected and further evaluated (see below).
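The selection rule for candidate bins can be written as a small predicate. This Python sketch assumes that hmmsearch results have already been reduced to a list of NCVOG identifiers per bin; the function name and this input format are hypothetical.

```python
from typing import Iterable

MCP_NCVOG = "NCVOG0022"   # NCLDV major capsid protein
MIN_DISTINCT_HITS = 5     # bins need more than five different NCVOG hits


def is_candidate_bin(ncvog_hits: Iterable[str]) -> bool:
    """Return True if a bin qualifies for further evaluation.

    A bin is selected if it has more than five different ancestral NCVOG
    hits and/or contains the NCLDV MCP gene.
    """
    distinct = set(ncvog_hits)
    return len(distinct) > MIN_DISTINCT_HITS or MCP_NCVOG in distinct
```

Repeated hits to the same NCVOG count once, so a bin with many copies of a single marker is not selected unless that marker is the MCP.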
Annotation and quality control of viral genome bins
Gene calling was performed with GeneMarkS using the virus model55. For functional annotation proteins were blasted against previously established NCVOGs34 and the NCBI non-redundant database (nr) using Diamond blastp56 with an e-value cutoff of 1.0e−5. In addition, protein domains were identified by hmmsearch (version 3.1b2, hmmer.org) against Pfam-A (version 29.0)57, and tRNAs and introns were identified using tRNAscan-SE58 and cmsearch from the Infernal package59 against the Rfam database (version 13.0)60. Nearly identical sequences within genome bins (>100 bp, identity >94%) were detected using the MUMmer repeat-match algorithm61 and visualized with Circos62 together with the respective genome bins. For all MAGs, paralogs and best diamond blastp vs. NCBI nr hits were visualized with Circos62. Furthermore, distribution of read depth across contigs was evaluated and regions with low average coverage were identified (Supplementary Table 4).
Experimental benchmarking of the mini-metagenomics approach
Benchmarking of the mini-metagenomics approach to assess potential chimera formation during MDA was performed by randomly sorting 10 cells from a bacterial mock community consisting of five different bacterial isolates; Escherichia coli K12, Echinicola vietnamensis DSM 17526, Shewanella oneidensis MR-1, Pseudomonas putida F1, and Meiothermus ruber. In total 59 of these 10-cell sorts were subject to MDA and sequencing. Resulting reads were filtered, assembled and analyzed with the same bioinformatics pipeline used for the mini-metagenomes generated in this study. Assembly statistics of recovered MAGs were generated with MetaQUAST63.
Computational benchmarking of giant virus metagenomic binning
In addition, benchmarking of the binning workflow was performed to assess its applicability to giant virus data. First, binning of a simulated mock community consisting of 12 giant viruses was tested, each a representative of a subfamily or family in the NCLDV. In addition, the herein newly discovered giant viruses were used as template for a second simulated mock community. In brief, MDA was simulated on the genomes of the mock communities with MDAsim64 (https://github.com/hzi-bifo/mdasim/releases/v2.1.1). In the following, Illumina reads were generated with ART65 and the same bioinformatics pipeline used for the mini-metagenomes in this study employed for read error-correction, normalization, assembly, and binning.
To remove redundancy, the set of 186 published NCLDV genomes and 16 novel soil giant viruses were clustered at an average nucleotide identity (ANI) of 95% with at least 100 kb-aligned fraction using fastANI66 resulting in 132 clusters and singletons. None of the newly discovered viruses clustered with any other virus. The three most incomplete novel giant virus genomes were removed from the data set (Supplementary Table 1, Supplementary Fig. 2). To infer the positions of novel soil giant viruses in the NCLDV, five core NCLDV proteins34 were selected and identified with hmmsearch (version 3.1b2, hmmer.org): DNA polymerase elongation subunit family B (NCVOG0038), D5-like helicase-primase (NCVOG0023), packaging ATPase (NCVOG0249), DNA or RNA helicases of superfamily II (NCVOG0076), and Poxvirus Late Transcription Factor VLTF3-like (NCVOG0262). Three of the MAGs derived from mini-metagenomes were excluded from the analysis as they had fewer than three conserved NCLDV proteins (Supplementary Table 1). Protein sequences were aligned using mafft67. Gapped columns in alignments (<10% sequence information) and columns with low information content were removed from the alignment with trimal68. Phylogenetic trees for each protein and for a concatenated alignment of all five proteins were constructed using IQ-tree with LG+F+R6 as suggested by model test as best-fit substitution model69. The percentage increase in PD41 was calculated based on the difference of the sum of branch lengths of phylogenetic species of the NCLDV trees with and without the metagenomic soil giant viruses.
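The PD comparison can be sketched in a few lines of Python. The sketch treats PD as the sum of all branch lengths in a Newick tree and expresses the increase relative to the tree without the soil viruses; both choices are assumptions, as are the function names. Branch lengths are extracted with a regular expression, which works for plain Newick output but not for quoted taxon labels containing colons.

```python
import re

# Branch lengths follow a colon in Newick, e.g. "A:0.12" or ")95:0.3".
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")


def total_branch_length(newick: str) -> float:
    """Sum all branch lengths found in a Newick string."""
    return sum(float(x) for x in _BRANCH_LENGTH.findall(newick))


def pd_increase_percent(newick_without: str, newick_with: str) -> float:
    """Percentage increase in total branch length after adding taxa.

    Assumption: the increase is expressed relative to the tree that
    lacks the newly added genomes.
    """
    before = total_branch_length(newick_without)
    after = total_branch_length(newick_with)
    return 100.0 * (after - before) / before
```

For a toy example, going from `(A:1.0,B:1.0);` to `((A:1.0,B:1.0):0.5,C:0.5);` raises the summed branch length from 2.0 to 3.0, a 50% increase.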
Bulk metagenome assemblies, 186 published NCLDV genomes and 16 soil MAGs were screened for presence of the NCLDV MCP gene (NCVOG0022)17,34 with hmmsearch (version 3.1b2, hmmer.org), applying a cutoff of 1e−6. This cutoff has been evaluated against ~60,000 available bacterial, archaeal, eukaryotic, and other non-NCLDV genomes in the Integrated Microbial Genomes database70, yielding only a few false positives. Resulting protein hits were extracted from the metagenome and, to reduce redundancy, clustered with cd-hit at a sequence similarity of 95%71. Cluster representatives were then subject to diamond blastp56 against the nr database (June 2018), and proteins which had hits but no NCLDV MCP in the top 10 were excluded from further analysis as potential false positives. For tree construction, MCPs were extracted and aligned with mafft-ginsi (--unalignlevel 0.8, --allowshift)67. Gapped columns in the alignment (<10% sequence information) were removed with trimal68 and proteins with <50 aligned amino acids were removed. A phylogenetic tree was constructed with IQ-tree and the LG+F+R8 as suggested by model test as the best-fit substitution model69.
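The false-positive filter applied to the cluster representatives can be expressed as follows. This Python sketch assumes that the blastp hits per protein are available as a ranked list of subject identifiers and that a separate predicate decides whether a subject is an NCLDV MCP; how that decision was made is not specified here, so `is_ncldv_mcp` is a hypothetical placeholder supplied by the caller.

```python
from typing import Callable, Dict, List, Sequence

TOP_N = 10  # only the ten best hits are inspected


def filter_mcp_candidates(
    hits: Dict[str, Sequence[str]],
    is_ncldv_mcp: Callable[[str], bool],
) -> List[str]:
    """Keep candidate MCPs that pass the top-10 check.

    `hits` maps each candidate protein to its database hits, best first.
    Proteins with hits but no NCLDV MCP among the top 10 are excluded;
    proteins without any hit are kept, since the rule only applies to
    proteins that had hits.
    """
    kept = []
    for protein, ranked_hits in hits.items():
        top = list(ranked_hits)[:TOP_N]
        if not top or any(is_ncldv_mcp(h) for h in top):
            kept.append(protein)
    return kept
```

Keeping proteins without database hits is a deliberate reading of the rule: highly divergent MCPs may lack any match in nr while still passing the hmmsearch cutoff.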
Gene sharing network
Protein families were inferred with OrthoFinder 1.0372 on a representative dataset of 93 NCLDV genomes for comparative analysis (after de-replication using 95% ANI clustering66, details described above, and removal of 36 poxviruses). For each pair of NCLDV genomes (ANI 95% cluster representatives) the average percentage of proteins in shared orthogroups in relation to the total number of proteins in the respective genome was calculated and used as edge weight in the network. The network was created in Gephi73 using a force layout and filtered at an edge weight of 18%.
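One way to read the edge-weight definition is sketched below in Python. For a genome pair, the percentage of proteins that fall into orthogroups present in both genomes is computed for each genome separately and the two values are averaged. The data layout, the function names, and the interpretation of the 18% filter as keeping edges at or above that weight are assumptions of this sketch.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

# protein id -> orthogroup id (None if the protein is in no orthogroup)
Proteome = Dict[str, Optional[str]]


def edge_weight(a: Proteome, b: Proteome) -> float:
    """Average percentage of proteins in orthogroups shared by a and b."""
    if not a or not b:
        return 0.0
    groups_a = {og for og in a.values() if og is not None}
    groups_b = {og for og in b.values() if og is not None}
    shared = groups_a & groups_b
    # Percentage is relative to the total protein count of each genome.
    pct_a = 100.0 * sum(og in shared for og in a.values()) / len(a)
    pct_b = 100.0 * sum(og in shared for og in b.values()) / len(b)
    return (pct_a + pct_b) / 2.0


def build_edges(
    genomes: Dict[str, Proteome], min_weight: float = 18.0
) -> Dict[Tuple[str, str], float]:
    """Return all genome pairs whose edge weight is at least min_weight."""
    edges = {}
    for g1, g2 in combinations(sorted(genomes), 2):
        weight = edge_weight(genomes[g1], genomes[g2])
        if weight >= min_weight:
            edges[(g1, g2)] = weight
    return edges
```

The resulting dictionary could be exported as a weighted edge list for layout in a network tool; genomes without any retained edge appear as disconnected nodes.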
The giant virus genomes were deposited at NCBI Genbank (MK071979–MK072551) and at https://bitbucket.org/berkeleylab/forestsoil-gv, together with sequence alignments and phylogenetic trees underlying this study. Metagenomes and corresponding metadata are available at https://img.jgi.doe.gov/m, accession numbers indicated in Supplementary Table 3.
Aherfi, S., Colson, P., La Scola, B. & Raoult, D. Giant viruses of amoebas: an update. Front. Microbiol. 7, 349 (2016).
Andrade, A. C. D. S. P. et al. Ubiquitous giants: a plethora of giant viruses found in Brazil and Antarctica. Virol. J. 15, 22 (2018).
Pagnier, I. et al. A decade of improvements in Mimiviridae and Marseilleviridae isolation from amoeba. Intervirology 56, 354–363 (2013).
Yoosuf, N. et al. Draft genome sequences of Terra1 and Terra2 viruses, new members of the family Mimiviridae isolated from soil. Virology 452-453, 125–132 (2014).
Boughalmi, M. et al. High-throughput isolation of giant viruses of the Mimiviridae and Marseilleviridae families in the Tunisian environment. Environ. Microbiol. 15, 2000–2007 (2013).
Legendre, M. et al. In-depth study of Mollivirus sibericum, a new 30,000-y-old giant virus infecting Acanthamoeba. Proc. Natl Acad. Sci. USA 112, E5327–E5335 (2015).
Legendre, M. et al. Thirty-thousand-year-old distant relative of giant icosahedral DNA viruses with a pandoravirus morphology. Proc. Natl Acad. Sci. USA 111, 4274–4279 (2014).
Abergel, C., Legendre, M. & Claverie, J.-M. The rapidly expanding universe of giant viruses: Mimivirus, Pandoravirus, Pithovirus and Mollivirus. FEMS Microbiol. Rev. 39, 779–796 (2015).
Abrahão, J. et al. Tailed giant Tupanvirus possesses the most complete translational apparatus of the known virosphere. Nat. Commun. 9, 749 (2018).
Fischer, M. G. Giant viruses come of age. Curr. Opin. Microbiol. 31, 50–57 (2016).
Wilson, W. H., Van Etten, J. L. & Allen, M. J. The Phycodnaviridae: the story of how tiny giants rule the world. Curr. Top. Microbiol. Immunol. 328, 1–42 (2009).
Fischer, M. G., Allen, M. J., Wilson, W. H. & Suttle, C. A. Giant virus with a remarkable complement of genes infects marine zooplankton. Proc. Natl Acad. Sci. USA 107, 19508–19513 (2010).
Deeg, C. M., Chow, C.-E. T. & Suttle, C. A. The kinetoplastid-infecting Bodo saltans virus (BsV), a window into the most abundant giant viruses in the sea. eLife 7, e33014 (2018).
Claverie, J.-M. & Abergel, C. Giant viruses: the difficult breaking of multiple epistemological barriers. Stud. Hist. Philos. Biol. Biomed. Sci. 59, 89–99 (2016).
Wilhelm, S. W. et al. A student’s guide to giant viruses infecting small Eukaryotes: from Acanthamoeba to Zooxanthellae. Viruses 9, E46 (2017).
Verneau, J., Levasseur, A., Raoult, D., La Scola, B. & Colson, P. MG-Digger: an automated pipeline to search for giant virus-related sequences in metagenomes. Front. Microbiol. 7, 428 (2016).
Schulz, F. et al. Giant viruses with an expanded complement of translation system components. Science 356, 82–85 (2017).
Zhang, W. et al. Four novel algal virus genomes discovered from Yellowstone Lake metagenomes. Sci. Rep. 5, 15131 (2015).
Andreani, J., Verneau, J., Raoult, D., Levasseur, A. & La Scola, B. Deciphering viral presences: two novel partial giant viruses detected in marine metagenome and in a mine drainage metagenome. Virol. J. 15, 66 (2018).
Roux, S. et al. Ecogenomics of virophages and their giant virus hosts assessed through time series metagenomics. Nat. Commun. 8, 858 (2017).
Wilson, W. H. et al. Genomic exploration of individual giant ocean viruses. ISME J. 11, 1736–1745 (2017).
Martínez Martínez, J., Swan, B. K. & Wilson, W. H. Marine viruses, a genetic reservoir revealed by targeted viromics. ISME J. 8, 1079–1088 (2014).
Khalil, J. Y. B. et al. High-throughput isolation of giant viruses in liquid medium using automated flow cytometry and fluorescence staining. Front. Microbiol. 7, 26 (2016).
Bajrai, L. H. et al. Kaumoebavirus, a new virus that clusters with faustoviruses and Asfarviridae. Viruses 8, 278 (2016).
Reteno, D. G. et al. Faustovirus, an asfarvirus-related new lineage of giant viruses infecting amoebae. J. Virol. 89, 6585–6594 (2015).
Mihara, T. et al. Taxon richness of ‘Megaviridae’ exceeds those of Bacteria and Archaea in the ocean. Microbes Environ. 33, 162–171 (2018).
Colson, P., Aherfi, S. & La Scola, B. Evidence of giant viruses of amoebae in the human gut. Hum. Microbiome J. 5-6, 14–19 (2017).
Hingamp, P. et al. Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME J. 7, 1678–1695 (2013).
Halary, S., Temmam, S., Raoult, D. & Desnues, C. Viral metagenomics: are we missing the giants? Curr. Opin. Microbiol. 31, 34–43 (2016).
McLean, J. S. et al. Candidate phylum TM6 genome recovered from a hospital sink biofilm provides genomic insights into this uncultivated phylum. Proc. Natl Acad. Sci. USA 110, E2390–E2399 (2013).
Yu, F. B. et al. Microfluidic-based mini-metagenomics enables discovery of novel microbial lineages from complex environmental samples. eLife 6, e26580 (2017).
Berghuis, B. A. et al. Hydrogenotrophic methanogenesis in archaeal phylum Verstraetearchaeota reveals the shared ancestry of all methanogens. Preprint at https://doi.org/10.1101/391417 (2018).
Iyer, L. M., Aravind, L. & Koonin, E. V. Common origin of four diverse families of large eukaryotic DNA viruses. J. Virol. 75, 11720–11734 (2001).
Yutin, N., Wolf, Y. I., Raoult, D. & Koonin, E. V. Eukaryotic large nucleo-cytoplasmic DNA viruses: clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J. 6, 223 (2009).
Filée, J. Route of NCLDV evolution: the genomic accordion. Curr. Opin. Virol. 3, 595–599 (2013).
Suhre, K. Gene and genome duplication in Acanthamoeba polyphaga Mimivirus. J. Virol. 79, 14095–14101 (2005).
Yutin, N., Wolf, Y. I. & Koonin, E. V. Origin of giant viruses from smaller DNA viruses not from a fourth domain of cellular life. Virology 466-467, 38–52 (2014).
Andreani, J. et al. Orpheovirus IHUMI-LCC2: a new virus among the giant viruses. Front. Microbiol. 8, 2643 (2017).
Gallot-Lavallée, L., Blanc, G. & Claverie, J.-M. Comparative genomics of Chrysochromulina ericinavirus and other microalga-infecting large DNA viruses highlights their intricate evolutionary relationship with the established Mimiviridae family. J. Virol. 91, e00230-17 (2017).
Rodrigues, R. A. L., Mougari, S., Colson, P., La Scola, B. & Abrahão, J. S. ‘Tupanvirus’, a new genus in the family Mimiviridae. Arch. Virol. doi: https://doi.org/10.1007/s00705-018-4067-4 (2018).
Colson, P., La Scola, B., Levasseur, A., Caetano-Anollés, G. & Raoult, D. Mimivirus: leading the way in the discovery of giant viruses of amoebae. Nat. Rev. Microbiol. 15, 243–254 (2017).
La Scola, B. et al. A giant virus in amoebae. Science 299, 2033 (2003).
Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).
Legendre, M. et al. Diversity and evolution of the emerging Pandoraviridae family. Nat. Commun. 9, 2285 (2018).
Wilhelm, S. W., Coy, S. R., Gann, E. R., Moniruzzaman, M. & Stough, J. M. A. Standing on the shoulders of giant viruses: five lessons learned about large viruses infecting small eukaryotes and the opportunities they create. PLoS Pathog. 12, e1005752 (2016).
Woyke, T. et al. One bacterial cell, one complete genome. PLoS ONE 5, e10314 (2010).
Rinke, C. et al. Obtaining genomes from uncultivated environmental microorganisms using FACS-based single-cell genomics. Nat. Protoc. 9, 1038–1048 (2014).
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Li, H. BFC: correcting Illumina sequencing errors. Bioinformatics 31, 2885–2887 (2015).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Li, D. et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3–11 (2016).
Kang, D. D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).
Woyke, T. et al. Assembling the marine metagenome, one cell at a time. PLoS ONE 4, e5299 (2009).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Borodovsky, M. & Lomsadze, A. Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr. Protoc. Bioinformatics 27, 3911 (2002).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
Lowe, T. M. & Chan, P. P. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res. 44, W54–W57 (2016).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Kalvari, I. et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 46, D335–D342 (2018).
Delcher, A. L., Salzberg, S. L. & Phillippy, A. M. Using MUMmer to identify similar regions in large sequence sets. Curr. Protoc. Bioinformatics doi: https://doi.org/10.1002/0471250953.bi1003s00 (2003).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
Mikheenko, A., Saveliev, V. & Gurevich, A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2016).
Tagliavi, Z. & Draghici, S. MDAsim: a multiple displacement amplification simulator. In 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1–4 (IEEE, 2012).
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High-throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Preprint at https://www.biorxiv.org/content/early/2017/11/27/225342 (2017).
Katoh, K. & Standley, D. M. A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics 32, 1933–1942 (2016).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Chen, I.-M. A. et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res. 45, D507–D516 (2017).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. ICWSM 8, 361–362 (2009).
The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under Contract no. DE-AC02-05CH11231. The Harvard Forest experimental warming plots are supported by NSF grants DEB 1237491 (Long-Term Ecological Research) and DEB 1456528 (Long-Term Research in Environmental Biology). Lauren Alteio was supported by a DOE Office of Science Graduate Student Research Program award. We thank William Warner and Michael Bernard for assistance with sample collection and processing. We would like to thank Simon Roux, David Paez, and Natalia Ivanova for helpful discussions.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Schulz, F., Alteio, L., Goudeau, D. et al. Hidden diversity of soil giant viruses. Nat Commun 9, 4881 (2018). https://doi.org/10.1038/s41467-018-07335-2