Giant virus biology and diversity in the era of genome-resolved metagenomics

The discovery of giant viruses, with capsids as large as some bacteria, megabase-range genomes and a variety of traits typically found only in cellular organisms, was one of the most remarkable breakthroughs in biology. Until recently, most of our knowledge of giant viruses came from ~100 species-level isolates for which genome sequences were available. However, these isolates were primarily derived from laboratory-based co-cultivation with few cultured protists and algae and, thus, did not reflect the true diversity of giant viruses. Although virus co-cultures enabled valuable insights into giant virus biology, many questions regarding their origin, evolution and ecological importance remain unanswered. With advances in sequencing technologies and bioinformatics, our understanding of giant viruses has drastically expanded. In this Review, we summarize our understanding of giant virus diversity and biology based on viral isolates as laboratory cultivation has enabled extensive insights into viral morphology and infection strategies. We then explore how cultivation-independent approaches have heightened our understanding of the coding potential and diversity of the Nucleocytoviricota. We discuss how metagenomics has revolutionized our perspective of giant viruses by revealing their distribution across our planet’s biomes, where they impact the biology and ecology of a wide range of eukaryotic hosts and ultimately affect global nutrient cycles.

The discovery of giant viruses, with virions as large as some bacteria and eukaryotes, megabase-range genomes, and a variety of traits typically found only in cellular organisms, was one of the most remarkable breakthroughs in biology. In this Review, Schulz, Abergel and Woyke explore insights into the biology, diversity, biogeography and ecology of giant viruses provided by culture and genomic technologies.

Large and giant viruses are part of a group of double-stranded DNA viruses, the nucleocytoplasmic large DNA viruses (NCLDVs) 1,2, which constitutes the viral phylum Nucleocytoviricota 3. Viruses of this phylum infect a wide range of eukaryotic hosts, from the tiniest known unicellular choanoflagellates to multicellular animals 4. NCLDVs typically replicate in so-called viral factories built in the host cytoplasm or use the host nucleus to replicate and sometimes assemble their progeny 5,6. Hallmark features of these viruses are large genomes ranging from 70 kb to 2.5 Mb and virions that can reach more than 2 μm in length 7. The term 'giant virus' was initially coined in the 1990s, when it became apparent that viruses that infect algae have unusually large genomes 8 and, further, in the early 2000s, when the first virus with a genome in the megabase range was discovered; initial light microscopy observations led to the assumption that its particles corresponded to a Gram-positive bacterial pathogen of amoebae 9,10. More detailed ultrastructural analyses revealed a typical icosahedral-shaped virion and genome sequencing yielded a 1.2 Mb viral genome 11. This virus was named 'mimivirus', short for 'microbe-mimicking virus', and represented an unexpected novelty in the virosphere, due not only to its exceptional particle and genome sizes but also to its coding potential, as it includes several genes with possible roles in protein biosynthesis 11. Since this discovery of giant viruses, their coding potential has been full of surprises, and the presence of hallmark genes of cellular life led to the hypothesis that these viruses might represent an enigmatic fourth domain of life 11-13. Equally intriguing, much smaller viruses (so-called virophages) were found to infect some NCLDVs that have exclusively cytoplasmic infectious cycles; virophages parasitize and sometimes kill their hosts 14. Also discovered was a third partner, termed 'transpoviron', which corresponds to a 7 kb double-stranded DNA episome that is able to propagate using both the giant virus and the virophage particles as vehicles 15,16.
For well over a decade, giant viruses were studied chiefly through cultivation-based approaches. Only very recently has virology followed in the footsteps of microbial genomics by applying cultivation-independent metagenomics to investigate the evolutionary diversity and metabolic potential of these viruses at an unparalleled pace. In this Review, we explore a wealth of experimental data that has revealed many insights into giant virus biology, in particular their virion structure and distinctive infection strategies. We build upon this knowledge by integrating the latest sequence-based studies that expanded NCLDV diversity, biogeography, coding potential and putative host range. Furthermore, we discuss compelling evidence that the presence of a variety of cellular hallmark genes in giant virus genomes enables the virus to reprogramme host metabolism, and that the integration of giant virus genetic material into host genomes may impact the biology and evolution of the eukaryotic cell.

Choanoflagellates
Small flagellated microeukaryotes that represent the closest known unicellular relatives of animals, grouped together in a clade called Opisthokonta.

Viral factories
Transitory organelles developed by the virus in the cytoplasm of an infected host cell in which replication and assembly of giant viruses take place.

Frederik Schulz, Chantal Abergel and Tanja Woyke

Giant virus discovery through isolation
The earliest discovered NCLDVs were the Poxviridae, which include the causative agent of smallpox and were the first viral particles seen under a microscope more than 130 years ago 17. Large viruses that infect Chlorella green algae were isolated in the 1980s. The first genomes of Vaccinia virus (a poxvirus) and Paramecium bursaria chlorella virus 1 (PBCV1) were sequenced in the early 1990s (ref. 18) and in 1999 (ref. 8), respectively. Shortly thereafter, additional genomes of Poxviridae were sequenced (Fig. 1), with sizes ranging from 120 kb to 360 kb (ref. 19). Subsequently, other viruses that infect animals, including members of the Ascoviridae, Iridoviridae and Asfarviridae families, were found and their genomes sequenced 20-22. Genomes of viruses in these groups are comparatively small (up to 220 kb) and even smaller in the recently discovered shrimp-associated Mininucleoviridae (70-80 kb) 23. In addition to animal-infecting NCLDVs, a wide range of NCLDVs were detected in various eukaryotic algae, including chlorophytes, haptophytes, pelagophytes, brown algae and dinoflagellates, in the early 2000s 24. These algae-associated NCLDVs were classified as Phycodnaviridae 24 and Mesomimiviridae 25,26 and, although most of their genomes are ~200-500 kb (refs 24,27), the genomes of Tetraselmis virus and Prymnesium kappa virus RF01 are 668 kb (ref. 28) and 1.4 Mb (ref. 29), respectively.
After the discovery of mimivirus in 2003 (ref. 10), other NCLDVs with larger virions and genomes above 500 kb were found to infect heterotrophic protists 30 (mainly members of the Amoebozoa). For more than a decade, Acanthamoeba strains had chiefly been used as hosts for the co-cultivation of new viruses, leading to the frequent isolation of closely related giant viruses able to infect this unicellular host 31. Acanthamoeba spp. has proven to be […] 40, and in particular of microeukaryotes, it is likely that giant viruses that have been recovered through isolation reflect only a minute fraction of NCLDV lineages extant in the wild.

Virion structures and infection strategies
Viruses with nucleocytoplasmic infectious cycles.
Chloroviruses were the first viruses designated as 'giant viruses' 8 (Table 1). In particular, PBCV1 (ref. 42) was extensively studied; its capsids have a few external fibres extending from some of the capsomers 41 and a spike-like structure present at one vertex to anchor onto the host cell 43 (Fig. 2). The capsids are glycosylated with an unusual oligosaccharide synthesized by the virus-encoded glycosylation machinery; the oligosaccharide is N-linked to asparagines in atypical sequons 44 in the major capsid protein (MCP; Vp54) 45. The outer capsid layer covers a single lipid membrane 46, which is essential for infectivity. Chloroviruses deliver their genome into their algal host by creating a hole in the cell wall using a virus-encoded enzyme packaged in the virion. The viral internal membrane then fuses with the host plasma membrane, forming a channel through which the genome and some viral proteins enter the cell. Because the virus does not encode an RNA polymerase, the incoming genome must be transcribed inside the host cell's nucleus prior to virion assembly in the cytoplasm. Virions are released after host cell lysis.
Other Nucleocytoviricota viruses that infect algae have small virions. Among the smallest members of the Nucleocytoviricota are prasinoviruses, with virion diameters of ~120 nm and genomes up to 410 kb. Being small is crucial for infecting and replicating within Ostreococcus tauri, which is one of the smallest free-living eukaryotes, with cells only 0.8 µm in size 47. Following viral infection, the genome is released into the nucleus and its replication begins almost immediately. Within hours, new virions assemble in the cytoplasm and, in less than 24 h, host lysis occurs. The host cell nucleus, mitochondrion and chloroplast remain intact throughout this period.
Larger viruses with nucleocytoplasmic infectious cycles are the amoeba-infecting pandoraviruses, with amphora-shaped virions up to 1 µm in length and 500 nm in diameter (Fig. 2) and genomes up to 2.5 Mb (ref. 48). There is at least one lipid membrane lining a thick tegument made of three layers, including one made of cellulose 49. The particles are taken up through phagocytosis and an ostiole-like structure at the apex opens to allow the internal membrane to fuse with the phagosome membrane; this results in the delivery of the genome and necessary proteins into the host cytoplasm. Although pandoraviruses encode an RNA polymerase, the enzyme is not packaged in the capsids and, thus, infecting viruses rely on the host cell for early transcription of viral genes. At the viral factories set up within the nucleus (Fig. 2), new virions start to assemble from the apex and lipid vesicles are recruited to the viral factory to be used in virion assembly. Nascent virions are released either by cell lysis or, if viruses are within vacuoles, by exocytosis through membrane fusion with the plasma membrane 50,51.
Molliviruses have an ovoid virion of smaller size (~650 nm) and genomes of 650 kb (refs 49,52) (Fig. 2); they share 16% of their genes with pandoraviruses but two-thirds of their genes are ORFans. The capsids seem haloed by fibrils of different lengths and they present a membrane-lined tegument resembling that of pandoraviruses. Their infectious cycle is also similar to that of pandoraviruses, except that DNA seems to be prepackaged in filaments that accumulate in the viral factory before being loaded into the maturing virions. Membrane remodelling involved in virion assembly was extensively analysed by cryogenic electron microscopy (cryo-EM) 53.
Medusaviruses are also Acanthamoeba-infecting viruses 33. Their icosahedral virions are 260 nm in diameter, covered by spherical-headed spikes extending from each capsomer, and have a lipid membrane that surrounds the capsid interior. A low-resolution structure was determined by cryo-EM, which returned a T number of 277 (ref. 54). The mechanism of entry and egress of the medusavirus virion from its host has yet to be determined. After uptake into the host cytoplasm, its DNA is replicated in the host nucleus and virions assemble in the cytoplasm (Fig. 2).

Nature Reviews | Microbiology

T number
The triangulation (T) number describes the number of structural units per face of the icosahedron and is calculated as the square of the distance between two adjacent fivefold vertices.

ORFans
Predicted genes without detectable homologues in public databases.

Viruses with exclusively cytoplasmic infectious cycles.
The second most studied virus after PBCV1 is the amoeba-infecting mimivirus 9. The ~700 nm virions are made of an icosahedral capsid ~500 nm in diameter with a genome of 1.2 Mb (ref. 11) (Table 1). Bacterial-type sugars are synthesized by the virus-encoded glycosylation machinery and are the building blocks of the complex 70 kDa and 25 kDa polysaccharide structures that decorate the mimivirus fibrils surrounding the capsid 55. A low-resolution structure of the mimivirus capsid has been determined 56 (Fig. 2) and detailed atomic force microscopy provided additional insights into virion composition 57, further underlining the complexity of the capsid. There are two internal lipid membranes, one lining the capsid and the other in the nucleoid compartment, which contains the genome and hundreds of proteins, including RNA polymerase and transcript maturation machinery. It has been proposed that the non-structural proteins in the nucleoid are required […] Early transcription begins using the virus-encoded transcription machinery, which, at first, remains confined in the nucleoid 65. The accumulation of nucleic acids due to active transcription and replication increases the size of the viral factory, and newly synthesized virions start budding at its periphery, recycling host cell membranes derived from the endoplasmic reticulum 61,66 or Golgi apparatus 32. The last step of virion maturation, after genome loading into the nucleoid, is the addition of the fibril layer to the capsids 66, with hundreds of newly synthesized virions released after cell lysis. Several viruses related to mimivirus have similar infectious cycles but smaller virions. Among them is Cafeteria roenbergensis virus, which has an icosahedral capsid of 300 nm in diameter (Fig. 2) with a lipid membrane underneath the capsid shell. Its mode of infection is not fully understood but, similar to mimivirus, a nucleoid structure in the cytoplasm and extracellular empty capsids have been observed, supporting an external opening of the capsids followed by fusion of the internal membrane with that of the cell, thus allowing the transfer of the nucleoid into the host cytoplasm. Virions contain ~150 proteins, which either make up the icosahedral capsid or are necessary to initiate the infectious cycle 61. Nascent virions assemble during the late stage of infection and are released through cell lysis. The structure of the complex capsid, determined by cryo-EM, corresponds to a T number of 499 and has provided a new model for capsid assembly 67.
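The T numbers quoted in this section (277 for medusavirus, 499 for Cafeteria roenbergensis virus and 309 for the Marseilleviridae) can be checked against the standard Caspar-Klug relation T = h² + hk + k², where h and k are non-negative integer steps on the hexagonal capsomer lattice between two adjacent fivefold vertices. That relation is general structural-virology background rather than something stated in the text, and the function name and search bound below are illustrative assumptions. A minimal sketch:

```python
def triangulation_indices(t_number, max_index=50):
    """Return all (h, k) pairs with 0 <= h <= k <= max_index such that
    h^2 + h*k + k^2 equals the given triangulation number.

    The formula is the standard Caspar-Klug relation; max_index is an
    arbitrary search bound chosen for this illustration.
    """
    solutions = []
    for h in range(max_index + 1):
        for k in range(h, max_index + 1):
            if h * h + h * k + k * k == t_number:
                solutions.append((h, k))
    return solutions


if __name__ == "__main__":
    # T numbers as reported in the text for three capsid structures.
    reported = {
        "medusavirus": 277,
        "Cafeteria roenbergensis virus": 499,
        "Marseilleviridae": 309,
    }
    for virus, t in reported.items():
        print(virus, t, triangulation_indices(t))
```

Under this relation each of the three reported values has a single solution with h = 7 (k = 12, 18 and 13, respectively), so all three are valid icosahedral triangulation numbers.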
Another member of the Mimiviridae, with a similar icosahedral capsid of 300 nm in diameter, is Bodo saltans virus. Its capsid appears to be made of two proteinaceous layers surrounded by 40 nm-long fibrils. A possible stargate-like structure is present at one vertex of the capsid and there are two membranes, one […] 29. These viruses also build a viral factory in the host cytoplasm, but it is unknown if the transcription machinery is loaded into the capsids, allowing an entirely cytoplasmic infectious cycle.
The largest virions found in the Nucleocytoviricota are those of pithovirus and cedratvirus (Fig. 2), which have very large amphora-shaped capsids that can be up to 2 µm long and 600 nm wide, encapsidating genomes of up to 685 kb (Table 1). The capsids are closed by corks, one for pithovirus 68,69 (Fig. 2) and two for cedratvirus 70, that are made of proteins organized in a honeycomb array. Despite a virion morphology that closely resembles that of pandoravirus, the external tegument is different and appears to be made of parallel strips and no cellulose; the capsids appear to be coated with short sparse fibrils 35,68. The infectious cycle proceeds, as for other amoeba-infecting viruses, by phagocytosis followed by capsid opening and membrane fusion with the phagosome 5. For pithovirus and cedratvirus, the RNA polymerase loaded in the virion starts early transcription in the cytoplasm and the host nucleus remains intact during the entire infectious cycle. During maturation, reservoirs of tegument and corks accumulate in the host cytoplasm and are used to build the new amphora-shaped virions. The nascent virions then exit the host cell either by exocytosis or upon cell lysis 68,70.
Outside of the Mimiviridae, there are smaller amoeba-infecting viruses such as members of the Marseilleviridae, which have icosahedral virions of ~250 nm in diameter (Fig. 2). A recent publication and two preprints showed the cryo-EM structure of the capsid for two members of the family at various resolutions, revealing a T number of 309 and a complex capsid structure 41,71,72 with many minor capsid proteins. Melbournevirus and other members of the family Marseilleviridae are taken up by phagocytosis and then lose their icosahedral appearance to become spherical after the disappearance of the vacuole membrane. Similar to Megamimivirinae, their genome remains in the cytoplasm; however, RNA polymerase is not loaded into the virion. Instead, nuclear proteins are recruited to the early viral factory, including the host RNA polymerase that performs early transcription 73. The appearance of the cell nucleus changes early in infection and the nucleus becomes leaky through a still-unknown mechanism triggered by viral infection. After 1 h of infection, nuclear integrity is restored and the virus-encoded RNA polymerase performs intermediate and late transcription 74, and icosahedral particles assemble inside the viral factory (Fig. 2A). Marseilleviridae viruses encode histone doublets that form nucleosomes to pack the genome into virions 75-77. Mature capsids can gather in large vesicles 78 and cell lysis leads to the release of both individual virions and filled vacuoles.
As these examples illustrate, there is no shared blueprint for the structure of giant viruses and their infection mechanisms; these characteristics vary between giant virus lineages and are likely shaped by the host organisms. The host range of the experimentally characterized giant viruses is limited to a few amoeba and algae lineages representing only a minute fraction of eukaryotic diversity. Thus, we expect that many more unusual virions and infection strategies will be revealed when new viruses are captured together with their native hosts.

Cultivation-independent genomics
Sequence-inferred prevalence and diversity of giant viruses. Many important discoveries in giant virus biology and diversity have been made through giant virus isolation and cultivation. However, such approaches are constrained by the need to satisfy optimal growth requirements in a laboratory setting and are often restricted to lytic viruses. Cultivation-independent methods have proven to be an indispensable tool to discover the genetic make-up of giant viruses from environmental samples.
In the earlier days of metagenomics, single-marker gene-based surveys (Box 1) revealed that several viruses of the Phycodnaviridae and Mimiviridae were present in a wide range of marine metagenomes collected during the Tara Oceans and the Sargasso Sea expeditions 79,80 and that these viruses were more abundant in the photic layer than eukaryotes 80. In a follow-up study, data from these surveys gave rise to the hypothesis that giant viruses are more diverse in the oceans than any cellular organism 81. Subsequently, a large-scale analysis of the NCLDV major capsid protein (MCP), in which more than 50,000 of these proteins were found across Earth's biomes, revealed the global dispersal of giant viruses, including in terrestrial ecosystems 82.
Other approaches that enabled the discovery of novel NCLDVs are single-virus or single-cell genomics and mini-metagenomics (Box 1). First, sorting viral particles from marine samples enabled the detection of viruses that had previously been found to be associated with the algae Ostreococcus spp. and Phaeocystis globosa 83. This approach led to the sequencing of several so-called giant virus single amplified genomes, of which the largest was an 813 kb genome belonging to the Mimiviridae that encoded a metacaspase, which potentially enables autocatalytic cell death of the host cell 84. Single-cell methods, including sorting and genome amplification of single eukaryotic cells, were also used to identify and sequence the genomes of five giant viruses associated with marine choanoflagellates 85,86; comparative genomics together with all other NCLDV genomes revealed that viruses that infect hosts with similar trophic modes, including host habitat and lifestyles, express distinct genetic features 86,87. Furthermore, mini-metagenomics analysis (Box 1) of a single forest soil sample led to the enrichment and discovery of 15 diverse giant virus metagenome-assembled genomes (MAGs), including several members of the Klosneuvirinae, highlighting an untapped diversity of giant viruses in soil 88.

Corks
The distinctive structures of some virions; in the case of pithovirus, the cork is located at the apex of the viral particle and made of 15 nm-spaced stripes organized in a hexagonal honeycomb-like array.

Nucleosomes
Compact structural forms of DNA packed through binding at positively charged proteins.

Mini-metagenomics
Low-complexity metagenomes generated from generally tens to hundreds of cell-sized particles.

Metacaspase
A multifunctional cysteine-dependent protease that, for example, plays a role in programmed cell death in eukaryotes.
The most successful approach for obtaining NCLDV genomes from environmental sequence data is genome-resolved metagenomics (Box 1). Since the early 2000s, this approach has become common practice for recovering genomes of bacteria and archaea from complex environmental samples 89, yet it took nearly another decade before the first giant virus MAGs (GVMAGs) appeared in public databases (Fig. 1). Yau et al. reconstructed the first GVMAGs as a by-product of their work on virophages in metagenomes from the Organic Lake in Antarctica 90. Several years later, four additional potentially algae-associated GVMAGs were retrieved from environmental sequence data from Yellowstone Lake in Yellowstone National Park, United States; they were found to be related to the viral families Phycodnaviridae and Mimiviridae and shared some genes with virophages that co-occurred in the same sample 91. Cultivation-independent approaches for the discovery of giant virus genome-centric sequence information gained traction when members of a Mimiviridae-affiliated subfamily, the proposed Klosneuvirinae, were recovered from metagenomic data 92. The fact that these were found in metagenomes from freshwater and sewage samples originating from four different continents suggested this novel group of giant viruses is cosmopolitan 92. More than 20 GVMAGs from the deep sea were subsequently discovered, including 15 affiliated with the Pithoviridae, indicating a surprisingly high prevalence of pithovirus-like viruses in the ocean 93, followed by the discovery of additional, likely algae-associated freshwater giant viruses in samples collected from Dishui Lake, Shanghai, China 94,95.
The unique strength of cultivation-independent approaches for viral genomics and discovery became most evident when more than 2,000 GVMAGs were extracted from metagenome datasets generated from analyses of thousands of samples collected from diverse biomes 82; an additional 500 GVMAGs from mainly marine systems were reconstructed shortly after 96. The addition of the GVMAGs to the Nucleocytoviricota species tree led to an increase in phylogenetic diversity by more than tenfold and enabled a comprehensive update of the taxonomic framework of the Nucleocytoviricota 26,82, in which the Mesomimiviridae makes up more than one-third of the observed diversity (Fig. 3). The addition of the new lineages also led to a substantial increase in the size of the Nucleocytoviricota pan-genome, which now comprises more than 900,000 proteins 82. This translated to an extensively expanded repertoire of functional genes, providing not only many novel insights into how giant viruses may interact with their hosts and the environment but also generating compelling novel hypotheses about their evolutionary roles 82,96-98.

Box 1 | Toolkits for giant virus discovery: cultivation-independent genomic approaches

Read mapping-based approaches
Mapping metagenomic reads to giant virus reference genomes has been successfully applied to detect giant viruses and estimate their abundances in the environment 64,80,178-180, and several tools have been published 178,181. Read mapping-based approaches are advantageous because they are sensitive enough to detect giant viruses at low levels 180; however, they typically do not lead to the recovery of viral genomes and thus cannot provide information on genome features and coding potential. Moreover, mapping approaches are highly dependent on the quality of the reference genome database and, if low mapping stringency is used, false positive hits may occur. Detection of giant viruses may also be hampered if taxonomic classification of metagenome-assembled genomes (MAGs) is performed using automated tools; this has resulted in several giant virus MAGs (GVMAGs) that have been misclassified as being of bacterial, archaeal or eukaryotic origin 82,106. In addition, genes that have recently been integrated into viral genomes after being horizontally acquired from bacteria or eukaryotes may lead to viral sequence read mapping to cellular genomes, resulting in false positive hits.

Marker gene surveys
Detection and phylogenetic analysis of signature genes in complex environmental datasets is a commonly used approach to assess viral diversity in metagenome data. For Nucleocytoviricota, genes that encode the major capsid protein, DNA polymerase B or viral packaging ATPase have been used as marker genes. The approach is less error-prone than read mapping as it can be coupled with phylogenetic analysis to confirm the monophyly of the respective gene homologues found in known viral genomes. This approach has been successfully applied in several studies 80-82,105 and, although less sensitive than read mapping, it can detect viruses that were not abundant enough in a metagenome to be successfully assembled and binned 151.

Genome-resolved metagenomics
The reconstruction of MAGs through metagenomic binning is an established approach to recover microbial genomes. Owing to their virion sizes, giant viruses are often present in environmental samples that have been selectively filtered to target microorganisms, although individual viral species are often found at low abundance within a high genetic diversity background. In contrast to smaller viruses such as most bacteriophages, the large genomes of most members of the Nucleocytoviricota typically require metagenomic binning to increase genome completeness 151. However, in most microorganism-centric metagenome projects, giant virus genome bins were frequently neglected since tools that estimate genome quality 182 predict viral genomes to be of low completeness based on their lack of cellular marker genes 151, which then leads to their exclusion from downstream analyses 183. Several recent studies employed custom workflows to identify GVMAGs and to estimate completeness and contamination by, for example, identifying copy numbers of conserved giant virus genes 85,86,88,92,96,153 or inferring deviations from lineage-specific copy numbers of low-copy orthologues 82. It is important to note that GVMAGs are typically incomplete, limiting the feasibility of some sequence-based inferences (for example, gene absence analyses). As Nucleocytoviricota phylogenies generally rely on a small set of viral hallmark genes, the reconstruction of evolutionary relationships is certainly feasible using incomplete GVMAGs, as is the analysis of horizontal gene transfer.

Single-virus and single-cell genomic approaches
Flow cytometry-based sorting and sequencing of single viruses can be used to detect viruses in environmental samples 184,185, yet only a few such studies have discovered novel giant viruses 83-86,88. Owing to large virion sizes and a bright signal using DNA stains 84,186, giant viruses are a promising target for sorting. A drawback of this approach is that the subsequent whole-genome amplification, if performed on a single virus, may lead to low genome recovery 184. An alternative approach to direct sorting of giant viruses from an environmental sample is targeted sorting of host cells 85,86. Viruses actively replicating inside a host cell can produce hundreds to thousands of virions with clonal copies of viral genomes, which would vastly improve whole-genome amplification 184. Furthermore, if successful, this approach enables identification of the virus and its native host. Similarly, mini-metagenomics uses fluorescence-activated single-cell sorting or microfluidics to collect tens to hundreds of cell-sized particles 88,187. The presence of many identical viral particles, either through repeated sorting of single clonal viruses, an infected host cell or the sorting of vacuoles filled with giant viruses, would increase genome recovery.
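The copy-number idea described in Box 1 for judging GVMAG quality can be illustrated with a small sketch. This is not the workflow of any of the cited studies: the marker set below simply reuses the three marker genes named in Box 1, the assumption that each is expected once per genome is a simplification made for illustration, and all function and variable names are hypothetical.

```python
from collections import Counter

# Illustrative marker set only. Real workflows use curated, lineage-specific
# sets of conserved giant virus genes with lineage-specific expected counts.
MARKERS = ("major_capsid_protein", "DNA_polymerase_B", "packaging_ATPase")


def estimate_quality(gene_hits, markers=MARKERS):
    """Return (completeness, duplication) for one genome bin.

    gene_hits: list of gene family labels detected in the bin.
    completeness: fraction of markers found at least once.
    duplication: surplus marker copies divided by the number of markers,
    used here as a rough proxy for contamination, assuming (as a
    simplification) one expected copy per marker.
    """
    counts = Counter(g for g in gene_hits if g in markers)
    present = sum(1 for m in markers if counts[m] > 0)
    surplus = sum(counts[m] - 1 for m in markers if counts[m] > 1)
    return present / len(markers), surplus / len(markers)


if __name__ == "__main__":
    # Toy bin: one capsid gene, two polymerase copies, no packaging ATPase.
    hits = ["major_capsid_protein", "DNA_polymerase_B",
            "DNA_polymerase_B", "hypothetical_protein"]
    completeness, duplication = estimate_quality(hits)
    print(f"completeness={completeness:.2f} duplication={duplication:.2f}")
```

Because GVMAGs are typically incomplete, a low completeness value from a sketch like this would argue against gene-absence conclusions rather than against using the bin for phylogeny.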
Pan-genome
The combined set of genes within a defined selection of genomes.

Exploring the host range of giant viruses. Genome-resolved metagenomics enabled the discovery of thousands of viral genomes, of which many represented lineages divergent from viruses recovered by isolation or co-cultivation 82,96 (Fig. 3). However, giant viruses recovered from metagenomes typically lack information on host organisms 99. An approach to overcome this limitation is the detection of viruses and potential eukaryotic hosts co-occurring in the same sample. Furthermore, horizontal transfer of genetic material between viruses and their hosts is a common phenomenon and can go in both directions 100-102, and the analysis of viral genes that may have been acquired through recent horizontal gene transfer (HGT) might identify host organisms. In the early days of giant virus metagenomics, read mapping-based co-occurrence analysis (Box 1) revealed that the presence of viral sequences in some marine samples was positively correlated with those of eukaryotic oomycetes 80, which have not been found to be associated with NCLDVs. In another study, co-expression analysis of metatranscriptomic data revealed a strong connection between Aureococcus anophagefferens virus and its algal host, and also indicated that other Mimiviridae present in the same sample were likely associated with Aureococcus spp. 103. This approach also linked Phycodnaviridae and Mimiviridae members to a wide range of marine microeukaryotes, including choanoflagellates, stramenopiles, diatoms, dinoflagellates and cercozoan algae 103. In a different study, virus-host relationships were implied through the co-occurrence analysis of viral and eukaryotic PolB-encoding genes and the hypervariable V9 region of the eukaryotic 18S rRNA gene 104. This approach was then applied to a comprehensive set of marine metagenomes collected during the Tara Oceans expedition, revealing that particular microeukaryotes belonging to the Alveolata, Opisthokonta, Rhizaria and Stramenopiles co-occurred with different NCLDV lineages 104. In a similar study, a strong co-occurrence signal was detected between a virus belonging to the Mimiviridae and marine chrysophytes as its potential host 105. Subsequent detection of putative HGT events between GVMAGs and chrysophyte genomes and transcriptomes provided further support for this host-virus relationship 105.
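The co-occurrence reasoning above can be sketched as a ranking of candidate hosts by how well their abundance tracks that of a virus across samples. The cited studies used their own network and statistical methods; plain Pearson correlation is swapped in here purely for illustration, and the abundance values, taxon labels and function names are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-occurrence signal
    return cov / (sd_x * sd_y)


def rank_candidate_hosts(virus_profile, host_profiles):
    """Sort candidate hosts by correlation with the virus, highest first."""
    scores = {host: pearson(virus_profile, profile)
              for host, profile in host_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented per-sample abundances (for example, marker gene read counts).
    virus = [0, 2, 5, 9]
    hosts = {
        "chrysophyte_lineage": [1, 3, 6, 10],
        "diatom_lineage": [8, 6, 3, 1],
    }
    for host, score in rank_candidate_hosts(virus, hosts):
        print(f"{host}: r = {score:.2f}")
```

A high correlation only nominates a candidate; as the text notes for the chrysophyte example, independent evidence such as putative HGT events is what lends support to a host-virus link.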
A systematic analysis of HGT candidates present in more than 2,000 NCLDV genomes, most of which were MAGs from diverse global sampling sites, revealed thousands of genes likely introduced into host chromosomes or derived from the host through recent HGT 82. Based on these results, it was possible to propose connections between NCLDVs and members of all major eukaryotic phyla 82. Although most of these predicted hosts have not yet been found to be infected by giant viruses, more than 20 previously isolated virus-host relationships were successfully predicted through recent HGT events, underlining the validity of this sequence inference-based approach to metagenome-assembled viral genomes (Fig. 4).
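The logic of HGT-based host inference can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the cited analysis: it assumes a table of sequence-similarity hits between viral genes and eukaryotic sequences, treats high percent identity as a crude proxy for recent transfer, and proposes the lineages supported by the most genes. The function name, the thresholds and the example hits are all hypothetical.

```python
from collections import Counter


def predict_hosts(hits, min_identity=70.0, min_genes=2):
    """Tally eukaryotic lineages among the best high-identity hits.

    hits: iterable of (viral_gene, eukaryotic_lineage, percent_identity).
    High identity is used as a rough stand-in for recent HGT (assumption).
    """
    best = {}
    for gene, lineage, identity in hits:
        if identity < min_identity:
            continue  # too divergent to suggest a recent transfer
        if gene not in best or identity > best[gene][1]:
            best[gene] = (lineage, identity)
    counts = Counter(lineage for lineage, _ in best.values())
    # Keep only lineages supported by several independent genes.
    return [(lineage, n) for lineage, n in counts.most_common() if n >= min_genes]


# Hypothetical hits for one metagenome-assembled viral genome.
example_hits = [
    ("gene_001", "Chrysophyceae", 88.0),
    ("gene_001", "Bacillariophyta", 72.0),
    ("gene_002", "Chrysophyceae", 91.5),
    ("gene_003", "Oomycota", 45.0),
    ("gene_004", "Bacillariophyta", 75.0),
]
predicted = predict_hosts(example_hits)
print(predicted)
```

Requiring support from more than one gene is a simple guard against single spurious hits. It does not address database mis-annotation or missing host genomes.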

Although sequence-based computational host predictions provide a means to expand the range of putative NCLDV hosts, the approaches have some potential challenges and biases. For example, co-occurrence analysis is dependent on sufficient host genome coverage for detection in metagenome data, and HGT analysis requires the availability of the host genomic sequences. Furthermore, it is difficult to detect ancient HGT from previous hosts. Another limitation to the analysis of the integration of NCLDV genes into host genomes can be the quality of the database used. For example, GVMAGs have been found mis-annotated as bacteria, archaea or eukaryotes in public databases, which hampers the use of automated tools for correct HGT detection 82,106 . Despite some of these limitations, expanding the putative host range of metagenome-derived NCLDVs provides a basis for targeted sampling of putative hosts, for the study of virus-host co-evolution and to identify viral-encoded functions for targeted modulation of host metabolism. Sequence-based inferences of viruses and their hosts may then be extrapolated to assess the impact of such interactions on global ecosystems.

From HGT to endogenization
Not only is HGT between viruses and their hosts a common phenomenon but some giant viruses can even integrate their entire genomes into the host chromosome (Fig. 4), as observed for most eukaryotic viruses 107,108. Arrays of NCLDV genes have occasionally been found in genomes of eukaryotes, in particular in algae, plants [109][110][111] and amoebae [112][113][114]. A recent survey of published eukaryotic genomes and transcriptomes revealed the presence of giant virus genes in 66 different eukaryotes, including several Acanthamoeba species, flagellates, ciliates, stramenopiles, oomycetes, fungi, arthropods and diverse unicellular and multicellular algae 115 (Fig. 4). Yet, for many of these eukaryotes, giant virus infections have not been observed. The integration of NCLDV genes often appears to be highly host specific, with viral genes detected in one eukaryotic species being unrelated to viral genes found in closely related species 115. Among the integrated genes are NCLDV hallmark genes that are, in some instances, scattered throughout the host chromosome and, in others, co-localized in islands composed of more than 100 genes 115. The integration of complete viral genomes has been described for some members of the Mesomimiviridae; for example, the integration of Ectocarpus siliculosus virus into the genome of its brown algal host, likely through the use of integrases 116, was reported more than 20 years ago 111. The related Phaeocystis globosa virus is a lysogenic virus that causes continuous infections 117,118, which is in stark contrast to many other known NCLDV lineages that were successfully isolated because they lyse their amoeba host 5. The analysis of existing algal genomes and transcriptome data revealed other examples of whole giant virus genomes integrated into eukaryotic host chromosomes 119. Some regions encoded more than 1,500 viral genes, accounting for up to 10% of the genes of the green algal host 119. Several of the detected viral genes were annotated as enzymes with roles in carbohydrate metabolism, chromatin remodelling, signal transduction, energy production and translation 119.

Fig. 4 | Experimentally verified and computationally predicted host ranges of the Nucleocytoviricota. Host lineages identified through isolation with the native host, co-cultivation, single-cell sorting and in silico horizontal gene transfer-based predictions are shown. The black outline of coloured boxes indicates that an experimentally verified interaction has also been predicted computationally. Chloroplastida comprises both Streptophytina (this group includes some green algae) and Chlorophyta (this group includes most green algae). Topology of the eukaryotic species and eukaryotic taxonomy tree adapted from ref. 40. CroV subfamily, viral subfamily-level clade in the Mimiviridae that contains Cafeteria roenbergensis virus; HaV family, family-level clade in the Algavirales that contains Heterosigma akashiwo virus; TSAR, Telonemia-Stramenopiles-Alveolata-Rhizaria supergroup.
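The distinction drawn in this section between viral genes scattered across a host chromosome and genes co-localized in islands can be made concrete with a small Python sketch. It assumes that host genes have already been flagged as giant virus-like by some upstream search and are identified by their index along a chromosome. The function name, the gap and size thresholds and the example indices are hypothetical and do not reproduce the method of the cited survey.

```python
def find_islands(positions, max_gap=3, min_size=3):
    """Group virus-like genes into islands by proximity along a chromosome.

    positions: gene indices (order along the chromosome) flagged as viral.
    Returns (first_index, last_index, gene_count) for each island.
    """
    islands, current = [], []
    for pos in sorted(set(positions)):
        # A large gap closes the current run of neighbouring viral genes.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                islands.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_size:
        islands.append((current[0], current[-1], len(current)))
    return islands


# Hypothetical indices of virus-like genes on one host chromosome.
flagged = [5, 6, 7, 9, 40, 100, 101, 103, 104]
islands = find_islands(flagged)
print(islands)
```

In this toy input, the isolated gene at index 40 is treated as scattered, whereas the two dense runs are reported as islands. Real analyses would also need to account for gene decay and rearrangement, which blur island boundaries over time.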
It remains unknown whether integrated giant viruses are dormant with no or minimal benefit to the host, or whether the host cell benefits from some viral genes that may provide or fine-tune metabolic capabilities. Another unanswered question is whether there are mechanisms encoded in the integrated viral genome that may reactivate infection after transcribing and translating some of the integrated viral genes. This would then be followed by the release of the giant virus genetic material during host replication and effective dispersal to new hosts. If there is no reactivation of viral infection, giant virus genes decay over time, leading to rearrangements and pseudogenization 107,112 and making their detection more challenging or impossible. Giant virus endogenization has been found mainly through the analysis of eukaryotic isolate genomes, but we anticipate that genome-resolved metagenomics of eukaryotes will further facilitate the discovery of many additional examples of this phenomenon. Future investigation of the integration of giant virus genes is expected to provide some answers for how endogenization has shaped and continues to shape the evolution and ecology of eukaryotic organisms.
Reprogramming of the host and its impact on host populations
Upon infection, a virus reprogrammes its host cell and turns it into a so-called virocell that supports viral replication 120,121. Analogous to bacteriophages 122,123, which are viruses (including large ones 124) that infect bacteria, giant viruses seem to contribute genes to their hosts to augment and/or modulate the metabolic capabilities of the host cell (Fig. 5). The first described example was a hyaluronan synthase encoded by Chlorella virus that enabled its algal host to synthesize hyaluronan 125. In addition, an active potassium channel encoded by Chlorella virus was found to be integrated into the host membrane during infection 126. Another example is that of a host-derived nitrogen transporter in Ostreococcus tauri virus that is expressed during the infection of its green algal host 127. Experimental characterization provided evidence that this transporter may increase the uptake of nitrogen by the host cell 127.
Other studies revealed the presence of fermentation genes in the Tetraselmis virus genome with possible implications for host metabolism in nutrient-limited marine systems 28. A survey of giant virus isolates and MAGs revealed the widespread presence of genes for cytochrome P450 monooxygenases, potentially enabling or modulating complex metabolic processes such as the synthesis of sterols and other fatty acids 98. Metagenome-informed experimental characterization of the distinctive cytochrome P450 of hokovirus did not reveal any sterols metabolized by the recombinant viral cytochrome P450 (ref. 98). Distant homologues of eukaryotic actins ('viractins') and myosins ('virmyosins') have been found in NCLDV genomes in two recent studies 128,129 and a preprint 97, indicating that these viruses impact cell structure, motility and intracellular transport processes; however, further functional validation is needed. Furthermore, a giant virus related to Mesomimiviridae that infects heterotrophic choanoflagellates was found to encode type 1 rhodopsins together with the pathway for synthesis of the required pigment, β-carotene 85. Metagenome-informed experimental characterization of the NCLDV rhodopsin showed that the putative rhodopsin likely functions as a proton pump, generating energy from light 85. A phylogenetically distinct NCLDV rhodopsin was found in a GVMAG from Organic Lake, Antarctica, and experimental characterization of this protein revealed that it may function as a light-gated pentameric ion channel, potentially impacting ion homeostasis and phototaxis of the host cell 130. Furthermore, through global metagenomics, it was predicted that genes encoding various substrate transport processes, energy generation through light (rhodopsins and genes involved in photosynthesis), carbon fixation and glycolysis are commonly found in GVMAGs affiliated with diverse lineages of the Nucleocytoviricota 82,96 (Fig. 5).
More detailed phylogenetic analysis revealed that some auxiliary metabolic genes encoding transporters for iron, phosphate, magnesium and ammonium originated in eukaryotic hosts and were likely recently acquired by giant viruses through HGT 82,85,96. However, other genes encoding several rhodopsins, succinate dehydrogenase, aconitase and glyceraldehyde 3-phosphate dehydrogenase showed a pattern that suggested a viral origin or a common evolutionary origin in one of the ancestral hosts 82,85,96. Taken together, the widespread presence of metabolic genes in diverse NCLDV lineages implies that augmenting host metabolic capacities is likely a strategy more commonly used by NCLDVs than initially assumed. However, the current lack of experimental evidence of the functions and activities of most of these genes and pathways as well as their effects on the host cell demands further experimental investigation.

Pseudogenization
A mechanism that leads to gene loss (functional genes become non-functional), most often through accumulation of mutations.

Hyaluronan synthase
An enzyme that facilitates the synthesis of cellular hyaluronan.

Rhodopsins
Pigment-containing proton pumps that convert light into a transmembrane electrochemical proton gradient.
Metabolic reprogramming has direct consequences on host population structure and dynamics. One striking example is the cosmopolitan marine coccolithophore Emiliania huxleyi, which forms massive blooms that play key roles in global carbon and sulfur cycles 131 . E. huxleyi populations are subject to persistent but ultimately lytic infections by the coccolithovirus Emiliania huxleyi virus 24 . Once lysis is induced, it leads to the termination of the algal bloom and the deposition of massive amounts of calcite and nutrients into the ocean, which increases the marine pool of dissolved organic matter [132][133][134] . Importantly, viral infections do not only lead to host lysis but also promote viral replication by rewiring host physiology, in particular the turnover of sugars and synthesis of fatty acids and lipids [135][136][137] . Comparably little is known about how host populations are impacted by giant viruses that were recovered through genome-resolved metagenomics but, considering the predicted host range of these viruses, it is conceivable that similar principles are omnipresent and are actively shaping the biomes and biogeochemical cycles of Earth.

Giant virus genomes encode hallmark genes of cellular life
Among the most intriguing features found in giant virus genomes are hallmark genes of cellular life such as tRNAs and genes involved in protein biosynthesis 138 . This phenomenon was first described upon sequencing the mimivirus genome 9 . Subsequent analyses revealed the phylogenetic placement of virus-encoded cellular genes between bacteria and eukaryotes, suggesting an ancient origin 11 . Other cellular hallmark genes with similarly deep branching patterns were found in other giant virus genomes and led to the hypotheses that giant viruses may either represent a fourth domain of life 13 or are remnants of a highly degraded eukaryotic cell derived by reductive evolution 12 . The subsequent use of more complex phylogenetic models revealed that many of these genes had most likely been acquired from different eukaryotic hosts [139][140][141] .

Fig. 5 | Predicted metabolic reprogramming of a giant virus-derived virocell and consequences of giant virus infection for host populations.
A hypothetical virocell is shown with a combination of metabolic roles that different giant viruses are predicted to have during host infection based on the presence of auxiliary metabolic genes in giant virus genomes. Darker shades of red denote metabolic roles that are supported by some functional data obtained through experiments, including a Paramecium bursaria chlorella virus 1 (PBCV1)-derived potassium channel 126, an Ostreococcus tauri virus-derived ammonium transporter 127, the light-driven proton pump encoded by Choanovirus 85, and the light-gated ion channel encoded in the metagenome-assembled genome of Organic Lake Phycodnavirus (OLPV) from Antarctica 130. Also highlighted is Tetraselmis virus, which encodes fermentation genes. TCA, tricarboxylic acid. a Experimental validation has not been performed in the native virus host system. b There is currently no experimental evidence for the function of these genes 28.
Some of these genes might represent ancient transfers from undiscovered eukaryotic hosts. This finding provided evidence for the hypothesis that giant viruses may have evolved from smaller viruses 140. Yet, other studies have reported alternative topologies for some housekeeping and other metabolic genes of cellular organisms, including rhodopsins 82,85,96 and cytochrome P450 (ref. 98). It has also been proposed that such genes may have been transferred from ancestral giant viruses to past eukaryotic hosts, or even to a proto-eukaryote, highlighting a potentially integral role of giant viruses in the evolution of the eukaryotic cell 142,143. Furthermore, it is possible that some genes that may function as part of the eukaryotic core metabolism were introduced upon integration of giant virus genetic material into the genome of an ancient eukaryotic cell, further shaping eukaryotic evolution 142,144. The presence of genes for aminoacyl tRNA synthetases (aaRS) and eukaryotic translation factors has been recorded multiple times in newly recovered giant virus genomes. Indeed, a nearly complete set of 20 aaRS has been reported in klosneuvirus from metagenomic data 92. Shortly after, two tupanviruses were isolated with genomes that contain a full set of aaRS and tRNAs 7, and subsequently the first Klosneuvirinae isolates were described, of which one also contained a complete set of aaRS 145. Especially in the Klosneuvirinae, the presence of aaRS with lineage-specific evolutionary histories provided additional support that these genes derived from different eukaryotic hosts 92. The presence of genes for a complete set of aaRS is currently constrained to members of the Mimiviridae and information on the role of giant virus aaRS in host interactions is limited; however, some have been experimentally studied and were indeed functional 146.
There is even some experimental evidence for the potential roles of these genes in making giant viruses less dependent on host machinery, for example, during shutdown of host translation in response to viral infection or other adverse conditions 147. On the other hand, a suspected role in enhancing viral translation by providing additional copies of aaRS to support host translation has not yet been confirmed. Additional hallmark genes of cellular life include those encoding the four core histones 33,76,148,149 and giant virus genes predicted to be involved in energy generation 28,96. A recent study reported an active membrane potential in Pandoravirus massiliensis virions together with the expression of several remote homologues of tricarboxylic acid cycle genes 150. Although giant viruses encode functions that were, until recently, thought to be exclusively present in cellular organisms, there is currently no evidence that they perform protein translation without host-derived ribosomes or generate energy independently of the host.

Conclusions
Nearly 20 years of giant virus isolation has yielded viral isolates representing highly diverse lineages. Complementary detailed research on the biology of these viruses has revealed many important details of virion structures and infection strategies. It has become clear that there are stark differences in virion size and structure and, although there are some similarities in how these viruses enter and exit the host cell, most giant viruses employ contrasting strategies for replicating within and exploiting their host cells. Sequencing of viral isolates has led to the discovery of the largest and smallest known genomes of viruses of the Nucleocytoviricota.
Cultivation-independent approaches have accelerated the discovery of genome sequences of new giant viruses and other large viruses in the Nucleocytoviricota, providing novel insights into their phylogenetic diversity and functional potential. Metagenomics also revealed that these viruses can be found nearly anywhere on Earth, are affiliated with diverse eukaryotes and are likely modifying host physiology through metabolic reprogramming, ultimately altering the structure and function of host communities in the environment. At the same time, estimates based on NCLDV hallmark genes in metagenomic datasets indicated that only a small fraction of giant virus genomes have been discovered so far 82 and that the diversity of giant viruses may be far greater than that of bacteria, at least in the oceans 81 . A controlled metagenomic binning experiment where giant viruses were spiked into an environmental sample showed that genome fragments of many giant viruses that are present in a given sample likely remain below the detection limit, highlighting the need for ultra-deep metagenome sequencing 151 or targeted isolation efforts 52 . Furthermore, there is a strong bias towards detecting giant viruses that are similar to those already known, as tools used to identify viruses from metagenomes rely heavily on features observed in sequenced NCLDV genomes such as large sets of conserved genes 82,93,96,152,153 . However, giant virus genomes exhibit extensive plasticity, such that viruses within the same clade quickly diverge and share very few genes 30 . A recent stunning example of NCLDV diversity is yaravirus, which was isolated with its native amoeba host 154 , yet no closely related sequences were detectable in public metagenomic datasets. 
Classifying yaravirus as an NCLDV was difficult because more than 90% of its genes lack similarity to those in public databases and because it lacks most viral hallmark genes 154; its placement within the Nucleocytoviricota is still under debate. Furthermore, a recent preprint described the genome-resolved metagenomic-based discovery of the Proculviricetes and Mirusviricetes from marine systems, which might be two class-level novel lineages within the Nucleocytoviricota that lack most of the typical viral hallmark genes 155. Taken together, the extensive gene novelty of viruses in the Nucleocytoviricota, observed through both cultivation and cultivation-independent methods, further underlines that many giant viruses are likely to be hiding in plain sight.

Published online 28 July 2022
Proto-eukaryote
A cell without membrane-bound organelles that is considered the ancestor of the eukaryotic cell.