Consensus statement: Virus taxonomy in the age of metagenomics

The number and diversity of viral sequences that are identified in metagenomic data far exceeds that of experimentally characterized virus isolates. In a recent workshop, a panel of experts discussed the proposal that, with appropriate quality control, viruses that are known only from metagenomic data can, and should be, incorporated into the official classification scheme of the International Committee on Taxonomy of Viruses (ICTV). Although a taxonomy that is based on metagenomic sequence data alone represents a substantial departure from the traditional reliance on phenotypic properties, the development of a robust framework for sequence-based virus taxonomy is indispensable for the comprehensive characterization of the global virome. In this Consensus Statement article, we consider the rationale for why metagenomic sequence data should, and how it can, be incorporated into the ICTV taxonomy, and present proposals that have been endorsed by the Executive Committee of the ICTV.

Viruses are obligate intracellular parasites that probably infect all cellular forms of life. Although virologists have traditionally focused on viruses that cause disease in humans, domestic animals and crops, recent advances in metagenomic sequencing, in particular high-throughput sequencing of environmental samples, have revealed a staggeringly large virome everywhere in the biosphere. At least 10^31 virus particles exist globally at any given time, and in most environments, including marine and freshwater habitats and metazoan gastrointestinal tracts, the number of detectable virus particles exceeds the number of cells by 10-100-fold [1][2][3][4][5] . To help conceptualize the sheer number of viruses in existence, their current biomass has been estimated to equal that of 75 million blue whales (approximately 200 million tonnes) and, if placed end to end, the collective length of their virions would span 65 galaxies 6 . In addition to their remarkable abundance, viruses are spectacularly diverse in the nature and organization of their genetic material, gene sequences and encoded proteins, replication mechanisms, and interactions with their cellular hosts, whether these interactions are antagonistic, commensal or mutualistic 7 . Aquatic environments contain particularly diverse forms of viruses, including single-stranded (ss) and double-stranded (ds) DNA and RNA viruses with genomes that range in size from less than 2,000 bases to more than 2 million bases 4 . Although dsDNA viruses that infect bacteria (bacteriophages) are the best studied to date, recent work suggests that around 50% of marine viruses have ssDNA or RNA genomes 8 .
Metagenomic data are changing our views on virus diversity and are therefore challenging the way in which we recognize and classify viruses 9 . Historically, the description and classification of a new virus by the International Committee on Taxonomy of Viruses (ICTV) have required substantial information on host range, replication cycle, and the structure and properties of virus particles, which were then used to define groups of viruses. However, high-throughput sequencing and metagenomic approaches have radically changed virology, with many more viruses now known solely from sequence data than have been characterized experimentally. For example, the family Genomoviridae currently comprises a single classified virus, whereas more than 120 possible members have been sequenced from diverse environments. However, these sequenced viruses lack information about their hosts and other biological properties that would guide their assignment into species and genera in the family 10 . Indeed, vast numbers of complete, or nearly complete, genome sequences have been assembled and characterized from metagenomic data for viruses with small [11][12][13][14] , medium [15][16][17][18] and even large 19,20 genomes. The identification of entirely new groups of viruses from such analyses emphasizes the power of metagenomic approaches in discovering viruses, some of which could have key functions in the regulation of ecosystems, whereas others could coexist with their hosts without causing recognizable disease or may even be mutualists 7 . However, realistically, few of these viruses are ever likely to receive the same level of experimental characterization as pathogens that cause human disease or influence the global economy.
The question of whether viruses that are identified by metagenomics can, and should, be incorporated into the official ICTV taxonomy scheme on the basis of sequence data alone is pressing. In response to this question, a workshop of invited experts in the field of virus discovery and environmental surveillance, and members of the ICTV Executive Committee, took place in June 2016 to discuss this possibility and to develop a framework for appropriate approaches to virus classification. We present these proposals in this Consensus Statement article, together with an explanation of the rationale for their development. Our proposals have been subsequently endorsed by the ICTV Executive Committee.

Virus diversity
The discrepancy between the number of potential taxa into which viruses in environmental samples could be classified and the number currently recognized by the ICTV is striking. A recent analysis of dsDNA virus sequences that were characterized as part of the Tara Oceans expedition from 43 surface ocean sites worldwide identified 5,476 distinct dsDNA virus populations 21 , but only 39 of these corresponded to virus groups that have been classified by the ICTV. Most of these populations were both abundant and widely dispersed geographically, but almost all fell outside of established viral taxa (FIG. 1). Early virome studies from different marine habitats hinted at this huge diversity 22,23 , and, although sequencing technologies at the time precluded direct genome-wide characterization, mathematical modelling predicted several hundred thousand distinct DNA viral genotypes. A recent comprehensive metagenomic analysis of thousands of diverse samples has led to the discovery of approximately 125,000 new viral genomes and a 16-fold increase in the number of identified viral genes 24 . Similarly, as technology advances, it is becoming clear that ssDNA and RNA viruses in marine and other ecosystems are far more diverse than currently characterized viruses; however, these new viruses remain understudied despite their ecological importance 11,[25][26][27][28][29][30][31] . Many ssDNA viruses identified in metagenomic data encode an evolutionarily conserved replication-associated protein (Rep), whereas the number, orientation and evolutionary origin of other genes are highly variable in these circular Rep-encoding ssDNA viruses (CRESS-DNA viruses) 32 . Phylogenetic analyses have revealed distinct clustering of some of these viruses into four recognized families, in addition to a vast range of viruses that fall outside of these clusters (FIG. 2). 
Aside from marine environments, most viruses discovered in wild plants through metagenomics seem to be persistent, and only a tiny proportion of these viruses are species that are recognized by the ICTV 33 . Highly diverse novel viruses have been similarly reported from insects 34,35 , and several eukaryotic and prokaryotic viruses have been identified in terrestrial environmental samples 24,36 .
Metagenomic studies have also uncovered astonishingly abundant novel viruses in the human gastrointestinal tract that, despite decades of research, had not been detected previously. For example, the ~97 kb genome of a dsDNA bacteriophage, named crAssphage, is six times more abundant in publicly available metagenomic datasets from sewage or wastewater samples than all other known bacteriophages combined. This virus contributes up to 90% of all sequence reads in virus-like particle-derived metagenomes and accounts for ~1.7% of all human faecal metagenomic sequence reads in public databases 17 .
Furthermore, numerous viruses are hidden in publicly available microbial genomic datasets. A recently developed tool, VirSorter 37,38 , identified 12,498 new viral genome sequences in ~15,000 bacterial and archaeal genomes 37 , which increased the number of known prokaryotic viruses ~10-fold and identified viruses that infect 13 prokaryotic phyla 37,38 . These advances are a striking testimony to the fundamental change in virus discovery: the overwhelming majority of new viral genomes now come from metagenomic data and have never been directly linked to biological agents. Virologists, especially viral taxonomists, have no choice but to work within this new reality.

Current taxonomy of viruses
The framework that is provided by taxonomy enhances our understanding of viruses. It helps communication among virologists, and between virologists and other stakeholders, such as farmers, growers, regulators and potential funders. However, the taxonomy of viruses differs in some fundamental aspects from that of cellular life forms. In particular, viruses lack universal genes that can be used to construct a unified phylogeny into which all viruses can be placed [39][40][41][42] . Therefore, there is no viral equivalent to the cellular tree of life that has been established through comparisons of ribosomal RNA and (nearly) universal protein-coding genes in bacteria, archaea and eukaryotes (notwithstanding the complications that are caused by horizontal gene transfer) [43][44][45] .
The ICTV is solely responsible for the classification of viruses into taxa and naming them. Currently, classified viruses are assigned to the hierarchical ranks of family, genus and species, and each taxon has a defined, unique and regulated name. Some families are also divided into subfamilies that each contain separate genera, and a minority of families are also assigned to the higher taxon of order. The ICTV disseminates information on virus taxonomy through the master species list (MSL), which currently lists 7 orders, 112 families, 610 genera and 3,704 species 46 (see Virus Taxonomy: 2015 Release), and through periodic publication of ICTV reports that contain additional descriptive material 47 . The MSL is updated annually based on the submission of taxonomic proposals to the ICTV Executive Committee (see current ICTV Executive Committee webpage), mostly by specialized study groups (see ICTV Study Groups). These proposals are made available to the public and are then scrutinized by the ICTV Executive Committee for compliance with a minimal set of rules that are laid out in the International Code of Virus Classification and Nomenclature (ICVCN; see International Code of Virus Classification and Nomenclature webpage), and for the robustness of the supporting evidence. The new taxonomy is then ratified by voting members of the ICTV and incorporated into the MSL annually.
The lowest taxonomic rank is that of species, which is defined in the ICVCN as "a monophyletic group of viruses whose properties can be distinguished from those of other species by multiple criteria". Historically, the term "multiple criteria" has been interpreted as referring to attributes such as replication properties in cell culture, virion morphology, serology, nucleic acid sequence, host range, pathogenicity, and epidemiology or epizootiology. However, there is considerable variation in the way in which these criteria have been applied to viruses in different families by the respective Study Groups and approved by the ICTV.
The ICVCN provides greater freedom for specifying the higher taxonomic ranks, with a genus defined as "a group of species sharing certain common characters", a family defined as "a group of genera (whether or not these are organized into subfamilies) sharing certain common characters" and an order defined as "a group of families sharing certain common characters". These looser criteria accommodate the substantial variation in the way in which they are applied among the higher ranks. As an approximate guide for vertebrate and plant viruses, members of different genera in a family typically have similar genome organizations with homologous structural and replication-associated genes, but often have non-homologous accessory genes, such as those that are involved in the evasion of host defence and in viral movement in plants. By contrast, between families, viruses often have completely different genome organizations and may lack any detectable genetic relatedness. The presence of homologous, even if not closely similar, RNA-dependent RNA polymerases (RdRps), proteases and helicases in RNA viruses, and Rep-encoding genes in small ssDNA viruses, may, however, enable distant evolutionary relationships between virus families to be identified; such relationships may form a basis for the creation of orders.
The process of identifying such distant relationships and assessing their appropriateness for higher rank taxonomic classification is not trivial, and, consequently, the creation of orders requires particularly careful consideration. For example, the existence of a substantial set of shared genes in diverse large or giant dsDNA viruses of eukaryotes has prompted a proposal for the creation of the order 'Megavirales' (REF. 48), which has thus far not been accepted by the ICTV owing to the lack of consensus in the field. Similarly, the creation of an order for the CRESS-DNA viruses is currently being considered by the relevant ICTV Study Groups.

Virus taxonomy in the age of metagenomics
In the past, the approval of a new species by the ICTV was typically dependent on the availability of data that demonstrate the distinct biological characteristics of the respective virus. This requirement has limited the number of viruses that have been classified and incorporated into the MSL. As most viruses are now discovered by metagenomics and lack direct correlation with biological agents, a workshop was convened to develop a new framework for virus taxonomy in the era of metagenomics (BOX 1; Supplementary information S1 (box)). The discussions at the workshop reflected the fact that the challenges that are posed by metagenomic data are not unique to viruses (BOX 2).
Sequence assemblies that are derived from environmental samples often contain complete, verified genome sequences of new viruses, but do not directly provide information on biological properties. This perceived limitation has raised the concern that virus classification based on sequence information alone would result in a taxonomy of sequences rather than of viruses 49 . However, with appropriate precautions (see below), we believe that the detection of a viral sequence in a sample is sufficient evidence to infer the existence of the corresponding virus. Indeed, the concept that a virus can be detected, characterized and classified entirely through analysis of its sequence has gained traction in the burgeoning field of virus discovery. Given that the properties of a virus are largely, or entirely, encoded by its genome, it follows that virus classification based on sequence information alone is not limited primarily by the absence of biological attributes, but by our inability to accurately read such information and robustly infer enzymatic functions, virion structure and other phenotypic attributes.
[Figure caption fragment, truncated in source: ... 70 , and a maximum likelihood phylogenetic tree was constructed using FastTree 71 . Branches with less than 50% SH (Shimodaira-Hasegawa)-like support were collapsed.]
Sequence data provide a wealth of information that can be used for the purposes of taxonomy, such as evolutionary relationships, overall genome organization (gene content and order, prediction of encoded proteins and the presence of characteristic repeated sequences), features of genome expression, genome replication strategy, the presence or absence of various distinctive motifs (for example, polyprotein cleavage sites, internal ribosome entry sites, terminal sequences, structural folds and host range determinants 50 ), and features of global and local genome composition (for example, GC content, dinucleotide frequencies 51 and codon usage). Sequence analyses could thus provide the 'multiple criteria' that are required for classification into species. Indeed, the successful use of sequence information in virus classification has been foreshadowed in the pre-metagenomic era. For example, the bioinformatic characterization of cloned sequences was responsible for the discovery of hepatitis C virus, the prediction of its properties and replication strategy, the characterization of its similarity to members of the family Flaviviridae, and the development of effective diagnostic and screening assays 52,53 ; such advances preceded the visualization of virus particles, the detection of viral proteins in vivo and the achievement of viral growth in cell culture by many years.
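As an illustration of the compositional features mentioned above, the following minimal Python sketch computes GC content and dinucleotide odds ratios (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies) for a single genome sequence. It is not taken from the workshop materials; the function names are hypothetical, only one strand is considered, and ambiguous bases are simply skipped.

```python
from collections import Counter

BASES = set("ACGT")


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    bases = [b for b in seq.upper() if b in BASES]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def dinucleotide_odds_ratios(seq):
    """Observed/expected ratio for each dinucleotide seen in the sequence.

    Expected frequency is the product of the two mononucleotide
    frequencies; dinucleotides containing ambiguous bases are ignored.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    total_mono = sum(mono.values())
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= BASES
    )
    total_di = sum(di.values())
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / total_mono) * (mono[pair[1]] / total_mono)
        ratios[pair] = (count / total_di) / expected
    return ratios
```

A ratio well below 1 for a given dinucleotide (for example, CG) indicates under-representation relative to base composition, which is the kind of genome-composition signal that could serve as one of several classification criteria.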
However, it is important to recognize that there are several technical problems with using viral genomes that are assembled from metagenomic datasets for taxonomy. Such sequences are often derived from mixed virus populations and, consequently, there is a risk of assembling artificially chimeric genomes. Furthermore, current methodologies are unsuitable for assembling complete genome sequences from viruses that have segmented or multipartite genomes. Another practical problem arises from virus-derived sequences that are integrated into host genomes (for example, endogenous virus-like elements and prophages), many of which are transcribed and hence are present in the RNA pool. To use metagenomic sequences for classification, these problems need to be addressed by robust computational and experimental methods. However, these caveats do not represent fundamental barriers to virus classification, as the technology that is used to create metagenomic sequences is improving continuously, and many of the problems, particularly those that are associated with de novo assembly, will be resolved. These improvements include methods that generate longer sequence reads and those that use template circularization to decrease error rates 54 .

Box 1 | A workshop to advance virus classification
The Wellcome Trust funded a workshop to discuss frameworks for the advancement of virus taxonomy in the age of metagenomics. The workshop was convened in Boston, Massachusetts, USA, from 9-11 June 2016, and was organized and chaired by P.S., and administered locally by M.L.N. Participants had wide-ranging expertise in viral genomics, metagenomic environmental studies and virus classification (13 of the 26 participants were members of the International Committee on Taxonomy of Viruses (ICTV) Executive Committee), and, based on data presentations and wide-ranging discussions, participants set out to develop a series of expert proposals for future consideration by the ICTV Executive Committee.
The understanding in the workshop was that the term metagenomic applies to any viral sequence that lacks biological or other experimental characterization, although the definition of 'lack' in practice has varied in the literature. Sequence data are already of paramount importance in virus taxonomy, because they currently provide the only reliable means of representing evolutionary relationships at the required granularity; however, the workshop recognized that the data generated by high-throughput sequencing from environmental samples pose major challenges, particularly because increasingly powerful methods are producing overwhelming quantities of such data, which are linked to little or no biological information.
The workshop participants concluded that it is entirely valid to use metagenomic sequences in virus taxonomy in the absence of an isolate or direct biological data, such as the visualization of virus particles or the detection of signs or symptoms of disease. A set of proposals was developed and is discussed in this Consensus Statement article (see also Supplementary information S1 (box)). These proposals were subsequently endorsed by the ICTV Executive Committee.

Box 2 | Classifying bacteria, archaea and fungi based on metagenomic data
The procedures that are used to classify viruses and name taxa differ substantially from those that are used for bacteria and archaea. The International Code of Nomenclature of Bacteria regulates only the names of newly proposed species without formally classifying these species into higher ranks. A total of 2,053 named bacteria and archaea were listed in the Approved List of Bacterial Names by the International Committee on Systematics of Prokaryotes in 1980. Since then, an additional 13,434 species with validly published names have been described in approved journals 62 . However, this total is widely regarded as being at odds with the conservative estimates of several million species of novel bacteria and archaea that have been discovered through environmental screening 63,64 . The assignment of names to bacterial or archaeal species requires information on defining biological characteristics, such as morphology, metabolism or ecology, to distinguish novel species from previously assigned species. Additional requirements are that the organism must have been cultured and an isolate deposited in at least two international repositories. To overcome such limitations, many authors have advocated the use of phenotypic characteristics inferred from sequence data as criteria that are required for assignment of bacterial species 63 . Furthermore, a relatively small number (approximately 350) of non-cultured but otherwise identifiably distinct bacteria and archaea have been named without the deposition of an isolate, with the qualifier 'Candidatus' assigned to the species name 65 . Historically, sequence information has not contributed to the taxonomy of bacteria and archaea, although 16S ribosomal RNA gene sequences are now available for members of most prokaryotic species and have led to the identification of many synonyms (different names for the same bacterial species).
Despite the major differences in both the routes of evolution and the taxonomic approaches between viruses and bacteria and archaea, the current challenge to classification is the same in both cases: an overwhelming number of diverse genomes that arguably represent distinct taxa is accumulating from metagenomic research.
Similar comments can be made about other microorganisms. For example, the taxonomy of fungi resembles that of bacteria and archaea, with a comparable requirement for the deposition of type samples in one of four international repositories under rules that are specified by the International Code of Nomenclature for Algae, Fungi and Plants. Species assignments remain based largely on biological characteristics. Indeed, the different morphological types of the same fungus in its sexual and asexual stages have often been assigned to different species and even genera, although there have been serious attempts in recent years to rectify this problem 66 . There has similarly been no comparable attempt, until recently 67 , to identify and remove synonyms as sequence data have become available. Metagenomics can be expected to exert a substantial change on fungal taxonomy, as only a small percentage of fungi are thought to be culturable, and the number of distinct fungi in the environment may number in the millions 68 . The use of genomic markers, such as the internal transcribed spacer (ITS) region, has been proposed as a biological barcode for the genomic assignment of fungi 67 .

Proposals
The workshop reached a consensus view on classifying viruses solely on the basis of metagenomic sequence data and, consequently, developed a set of proposals (BOX 1; Supplementary information S1 (box)). These proposals are diagrammatically summarized in FIG. 3.

Basis of classification.
Classifying viruses that are identified only from metagenomic data will advance virus taxonomy, dependent on appropriate checks on data integrity and following the standard procedures of assignment. This is expected to involve the creation of higher rank taxa that consist entirely of viruses that are identified from metagenomic sequence data.

Creating new species.
The current ICTV species definition suffices for the classification of viruses based only on sequence information. Virus characteristics that can be inferred from sequence data, including genome organization, replication strategy, presence of homologous genes, and, potentially, host range or type of vector, may serve as additional biological characteristics. These may be used to delineate species in the absence of phenotypic data that have often been relied on for existing species definitions. Such information is best inferred from genomic sequences that comprise the complete coding potential of the respective virus and should be a minimum requirement for classification based on sequences alone.

Assigning new species and genera to existing families.
Demarcation procedures vary widely between virus groups and are typically based on parameters that include sequence-based phylogeny and various biological attributes. Although recognizing that direct biological information may form a part of the definition of existing taxa, viruses that are identified from metagenomic data can be classified into additional taxa (species and genera) if their sequence relationships are comparable to those among existing taxa in that family.
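The idea of comparing a new sequence against the relationships that already separate taxa in a family can be sketched as follows. This Python example is purely illustrative and makes several assumptions that are not part of the proposals: sequences are already aligned to equal length, pairwise nucleotide identity is the only criterion, and the demarcation threshold is taken as the highest identity observed between members of different existing species. Actual Study Group criteria vary between families and are generally more elaborate. All function and variable names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical positions, ignoring columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def max_between_species_identity(members):
    """Highest identity between any two members assigned to different species.

    `members` maps an isolate name to a (species, aligned_sequence) tuple.
    """
    best = 0.0
    for (sp_a, seq_a), (sp_b, seq_b) in combinations(members.values(), 2):
        if sp_a != sp_b:
            best = max(best, pairwise_identity(seq_a, seq_b))
    return best


def is_candidate_new_species(query, members, threshold):
    """True if the query is less similar than `threshold` to every existing member."""
    return all(
        pairwise_identity(query, seq) < threshold
        for _, seq in members.values()
    )
```

In use, the threshold would be derived once per family (here with `max_between_species_identity`) and then applied to each metagenomic sequence, so that new taxa are held to the same degree of divergence as the taxa that are already recognized.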

Delineating new families and orders.
Viruses that have genome sequences that lack close relationships to viruses in existing taxa pose a particular problem, as there is no phenotypically derived standard by which they can be classified. In this situation, assignment of a virus to a new family could be based on limited or absent genetic homology to viruses in recognized families and the existence of major differences in genome organization or inferred replication strategy. Clustering and patterns of variation among more closely related metagenomic sequences might be used to assign viruses hierarchically to lower taxonomic ranks in such groups. However, the creation of a new family, and the assignment of genera and species within it, would require a considerable amount of sequence information and the development of a sound classification framework that is capable of accommodating it.
Figure 3 | Summary of the proposed classification pipeline. The proposed classification pipeline (red arrows) enables both metagenomic sequence data and conventionally derived virus sequences to be classified. Inferred biological properties that are obtained by bioinformatic analysis of virus sequences together with information on sequence relatedness and gene content, and, optionally, any observed biological properties (dotted line), may all be used as defining criteria for species and higher rank taxonomic assignment in the International Committee on Taxonomy of Viruses (ICTV) taxonomy. This procedure differs from current (green arrows) and previous practice (blue arrows), in which biological data and/or host information and sequence data (current), or biological data alone (1970s-1990s), were required for classification.
Formalized clustering and network analysis methods that create similarity metrics that are based on the detection of homologous genes and their genetic divergence 55-57 could be valuable for taxonomic assignments and should be critically evaluated for their effectiveness in the development of a robust classification approach. Frameworks of this kind may have to be tailored to the virus group. For example, bacteriophage taxonomy is typically based on virion sequence and structure 58 , but these characteristics may not be appropriate for the classification of animal and plant RNA viruses, in which deeper relationships are most often apparent in the gene sequences of the RNA polymerase and other conserved replication-associated proteins 59 .
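To make the notion of a gene-sharing similarity metric concrete, the following Python sketch clusters genomes by the proportion of homologous gene families that they share. It is a deliberately simplified illustration, not a reimplementation of any of the cited methods: it assumes that each genome has already been reduced to a set of gene-family identifiers, uses the Jaccard index as the similarity metric, and groups genomes by single-linkage connected components. The threshold value and all names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared gene families divided by all gene families present in either genome."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_genomes(gene_families, min_similarity=0.5):
    """Group genomes into connected components of the gene-sharing network.

    `gene_families` maps a genome name to a set of gene-family identifiers.
    Two genomes are linked when their Jaccard index is >= `min_similarity`.
    """
    names = sorted(gene_families)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if jaccard(gene_families[a], gene_families[b]) >= min_similarity:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

Single linkage is the simplest possible choice and tends to chain distantly related genomes together. Any approach considered for taxonomic assignment would need the critical evaluation described above, including weighting by genetic divergence rather than by gene presence alone.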

Nomenclature of taxa identified only from sequence data.
The system that is currently used by the ICTV for taxon nomenclature is readily extendable to additional species, genera and families that are created from metagenomic sequence data. Furthermore, taxa may contain viruses that were identified by various methods. Hence, a species that initially comprises viruses that are characterized solely from sequence data could eventually include viruses that are identified by isolation and that have directly defined biological properties. Thus, metagenomic status belongs to, and would be recoverable from, the sequence record for a particular virus and not to the entire taxon to which it is assigned. Although some virologists have adopted the term 'associated' as part of the nomenclature of viruses that were identified in metagenomics datasets (for example, human stool-associated circular virus (GQ404856 (REF. 60)); for other examples see REFS 12,13,26,61), it is unnecessary to incorporate this or other such terms that are equivalent to the bacterial term 'Candidatus' into virus taxon names.

Improvement of the procedure for the classification of viruses.
The current process of submitting taxonomic proposals to the ICTV suffices, in principle, for dealing with viruses that are known only from sequence data. However, the process could be substantially improved and streamlined through the development of electronic submission methods that incorporate appropriate quality checks for accuracy and completeness of data. In particular, the format could be modified to enable numerous species (possibly many hundreds or thousands) to be proposed in the same submission without the unnecessary repetition of information. In addition, procedures could be developed that shorten the time that is required for processing proposals and updating the MSL.
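One way to picture a submission format that avoids repeating information is a record in which the shared context (family, genus and rationale) is stated once and followed by a list of proposed species, each of which is checked automatically. The Python sketch below is a hypothetical illustration only; it does not describe any existing or planned ICTV submission system, and every class, field and check is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SpeciesEntry:
    proposed_name: str
    accession: str                  # sequence database accession
    complete_coding_sequence: bool  # does the genome cover the full coding potential?


@dataclass
class BatchProposal:
    # Shared information is given once for the whole batch.
    family: str
    genus: str
    rationale: str
    species: list = field(default_factory=list)


def validate(proposal):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen_names = set()
    for entry in proposal.species:
        if not entry.accession.strip():
            problems.append(f"{entry.proposed_name}: missing accession")
        if not entry.complete_coding_sequence:
            problems.append(f"{entry.proposed_name}: coding sequence incomplete")
        if entry.proposed_name in seen_names:
            problems.append(f"{entry.proposed_name}: duplicate name")
        seen_names.add(entry.proposed_name)
    return problems
```

The completeness check mirrors the proposal that a genome sequence covering the complete coding potential should be a minimum requirement for classification from sequence alone; the other checks are generic examples of data-quality rules.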

ICTV endorsement.
As an important initial step towards metagenomics-based virus taxonomy, the proposals that were developed during the workshop were presented to, and discussed at, the ICTV Executive Committee meeting from 22-24 August 2016. The proposals were supported by all members of the Executive Committee that were present (one member was unavoidably absent but has since expressed support) and their practical implementation was seen as a matter of high priority for the ICTV. This process will include actively inviting the virology community to submit taxonomic proposals that are based on metagenomic sequences, providing guidelines on data standards (including sequence quality and completeness) and developing more effective data submission tools for large sequence datasets. The ICTV Executive Committee plans to explain and develop these steps in a separate article.

Conclusions
We believe that the time has come to advance the philosophy and practice of virus taxonomy by admitting viruses that are identified only from metagenomics data as being bona fide viruses, dependent on appropriate checks on data integrity and following the standard procedures of taxonomic assignment. We expect that this process will lead to the imminent creation of higher rank taxa that consist entirely of viruses identified by metagenomics. We believe that the implementation of the proposals outlined here will enable the creation of a vastly expanded formal taxonomy for viruses that will be a major contribution to future research on virus diversity. Only by accepting that sequences that are generated by metagenomic methods truly represent existing viruses and by including them in classification schemes, can we hope to better understand the ecology, history and impact of the global virome.