Investigation of recombination-intense viral groups and their genes in the Earth’s virome

Bacteriophages (phages), or bacterial viruses, are the most abundant and diverse biological entities that impact the global ecosystem. Recent advances in metagenomics have revealed their rampant abundance in the biosphere. A fundamental aspect of bacteriophages that remains unexplored in metagenomic data is the process of recombination as a driving force in evolution that occurs among different viruses within the same bacterial host. Here, we systematically examined signatures of recombination in every gene from 211 species-level viral groups in a recently obtained dataset of the Earth’s virome that contain corresponding information on the host bacterial species. Our study revealed that signatures of recombination are widespread (84%) among the diverse viral groups. We identified 25 recombination-intense viral groups, widely distributed across the viral taxonomy, and present in bacterial species living in the human oral cavity. We also revealed a significant inverse association between the recombination-intense viral groups and Type II restriction endonucleases, that could be effective in reducing recombination among phages in a cell. Furthermore, we identified recombination-intense genes that are significantly enriched for encoding phage morphogenesis proteins. Changes in the viral genomic sequence by recombination may be important to escape cleavage by the host bacterial immune systems.

Recombination is a fundamental driving force in evolution [16][17][18][19] , and occurs among different viruses inside the same bacterial host (e.g., co-infecting invasive viruses 20 , temperate phages and defective prophages 21 , or an invasive virus and a resident prophage 21 ). Although phage genomes are generally known to be mosaic, with active recombination or horizontal genetic exchange 22 , recombination does not necessarily increase the average fitness of offspring 23,24 . It remains unexplored, in the rapidly increasing metagenomic data, whether such signatures of recombination are observed across various phylogenetic groups of phages. In addition, it is also unclear whether specific phylogenetic groups, or genes, are recombination-intense and show signatures of increased recombination due to natural selection. Thus, the fundamental aspect of recombination among phage genomes based on metagenomic data should be explored.
Here, we systematically examined signatures of recombination in every gene from 211 species-level viral groups in the Earth's virome dataset that contain corresponding information on host bacterial species; (i) we determined recombination-intense viral groups throughout the virome dataset, (ii) we examined the relationship between recombination-intense viral groups and the potential strength of host immunity, and (iii) we closely examined the recombination-intense genes of the viral groups.

Results
Signatures of recombination are widespread among diverse viral groups. Among the more than 19,000 species-level viral groups defined in the Earth's virome dataset composed of 3,042 metagenomic samples, we identified 211 viral groups that are usable for examining signatures of recombination and contain information on host bacterial species (Table S1). A proteomic tree of the 211 viral groups, constructed using the Virus Classification and Tree Building Online Resource (VICTOR) method 10 , is shown in Fig. 1, and a 16S rRNA gene maximum likelihood (ML) tree of the viral groups' host bacterial species is shown in Fig. S1. The VICTOR tree ( Fig. 1) revealed the presence of a very diverse dataset with long branches, almost no branch support (except for a few groups close to the tips of the tree), and an overall low phylogenetic signal. The host bacterial ML tree, on average, was well supported and reflected several major lineages (i.e., strains belonging to the phyla Figure 1. Proteome-based VICTOR tree of the 211 viral groups. Scale bar indicates interproteomic distance inferred using the distance formula d 4 . Viral group IDs defined in the previous study 11 are shown as leaf labels. The innermost colored ring (1) indicates oral and intestinal samples, respectively. The next ring (2) indicates the presence or absence of recombination signatures (i.e., minimum number of recombination events ≥1). The two outer rings (3,4) indicate frequency categories for Type II restriction endonucleases (3) and CRISPR arrays (4) in a host bacterial species.
SCIentIfIC REPORts | (2018) 8 Firmicutes, Fusobacteria, Proteobacteria, and Bacteroidetes). About 75% of the viral groups originate from oral samples ( Fig. 1) and account for 13.5% of all the metagenomic samples. For each viral group, we conducted a pangenomic analysis and calculated the minimum number of recombination events (r min ) for every orthologous gene using the four-gamete test, a conservative method to locate pairs of closest segregating sites within 4 haplotypes that are likely to be generated by recombination between them. As a result, 88% of the viral groups originating from oral samples showed at least one recombination event per gene (Fig. 1), whereas the proportion was 69% of viral groups originating from other samples.

Recombination-intense viral groups.
Next, we conducted a more quantitative analysis to explore the rates of recombination among viral groups. In order to account for the dependency of the r min on gene length and nucleotide diversity, we took an approach similar to a previous study 25 . We plotted r min per nucleotide versus nucleotide diversity of a viral group (Fig. 2), and calculated a linear regression line that captures the overall relationship between nucleotide diversity and the minimum number of recombination events per nucleotide. Using this relationship, we identified viral groups as recombination-intense that substantially deviate from the regression line. Based on the distribution of the deviation from the regression line among the 211 groups (Fig. S2), we identified the top 25 recombination-intense viral groups (above the dashed lines in Fig. 2 and Fig. S2, respectively). The host bacterial species of the 25 recombination-intense viral groups are shown in Table 1.
Almost all (24 out of 25) the recombination-intense viral groups were from oral samples, which is a statistically significant enrichment (P = 0.0066, Fisher's exact test). Only viral group 181 originated from nasopharynx samples, but its host bacterial species, Propionibacterium acnes, forms part of the normal flora in the human oral cavity 26 . Namely, all the recombination-intense viral groups infect bacterial species that reside in the human oral cavity.
For the recombination-intense viral groups, we mapped reads back to the genomes to check their coverage (Fig. S3). Overall, the reads were mapped to the entire genome sequences, and the mean coverage per viral group was 24, eliminating possibilities that the elevated rates of recombination among the viral groups were primarily due to low coverage and metagenome mis-assembly in these sequences. There were a few viral groups showing  As observed earlier, the backbone of the tree is only weakly supported which is due to the nature of such diverse phage genomic datasets and is likely caused by multiple origins of prokaryotic viruses 10,30 . Notable recombination-intense genes. We further explored recombination-intense genes by similarly analyzing the relationship between r min per nucleotide and nucleotide diversity at the gene level in each recombination-intense viral group (Figs 4 and S5). Substantial deviations from the regression were found; for example, in a gene encoding a phage tail protein (Fig. 4A) and a gene encoding a phage portal protein (Fig. 4B), with recombination breakpoints found throughout both genes. Because approximately 90% of such genes were initially annotated as hypothetical, we conducted iterative protein searches based on representing both query and database sequences by profile hidden Markov models 31 using a UniProt database. We examined all hits with >99% probability of being true positives, and identified 89 notable genes in 24 viral groups (listed in Table S2). A breakdown of the notable recombination-intense genes is shown in Fig. 5. Approximately 75% of this set are genes that encode phage morphogenesis proteins. Among these, approximately 19% are associated with head morphogenesis, including genes for capsid, virion morphogenesis, and scaffold proteins 32 . Approximately 21% are associated with phage neck and DNA packaging, including portal protein, head-to-tail connector, and terminase 33,34 . Approximately 27% are associated with phage tail, including the genes for tail protein, tape measure protein, and baseplate 33,35,36 . Moreover, approximately 4% are DNA-associated genes, including the genes for integrase, helicase, and recombinase. Approximately 4% are lysis-associated genes, including holin and endolysin. Holin and endolysin are essential for host cell lysis in the lytic lifecycle 37 . Finally, the frequency of such genes in the pan-genome of other viral groups is approximately 27%, indicating significant enrichment among the notable recombination-intense genes (P < 10 −15 , chi-squared test).

Discussion
The recombination events we detected and examined are those that have survived in the viral groups, and which are inevitably affected by natural selection 17 . These genes are likely to be under selective pressure from host bacterial immunity attacking foreign DNAs (i.e., restriction-modification and CRISPR-Cas systems), as this cleavage would effectively prevent survival and propagation of phages. These genes were found in the recombination-intense viral groups, all of which infect bacterial species living in human oral cavities, where hundreds of thousands of CRISPR spacer groups are transcribed 38 . Hence it is likely that there is a genetic conflict invoked by the coevolution of phages and host oral bacteria by means of classic Red Queen dynamics 39 . The signature of highly elevated recombination would reflect the evolutionary conflict in which phages continuously change their genomic sequences by recombination to escape cleavage by the host bacterial immune system in the human oral cavity. A recent study showed that acquisition of random mutations is not sufficient for phages to completely escape CRISPR-Cas targeting in a continuous co-culture 27 . Another study of long-term bacterium-phage coevolution experiments showed the presence of multiple phages increased phage persistence by enabling recombination-based formation of chimeric phage genomes in which sequences heavily targeted by CRISPR were replaced 40 . Therefore, recombination would play an important role, at least for recombination-intense oral phages, in escaping bacterial immunity.
Similarly, our analysis of Type II restriction endonucleases suggest that its high frequency in host bacterial species could be effective in cleaving invading phages and reducing opportunities for recombination among phages in a cell. This raises the interesting possibility that restriction-modification could be more effective than CRISPR-Cas in oral bacteria for dealing with phages. Recently, it was suggested that host bacteria are generally incapable of utilizing CRISPR-Cas to eradicate phages from the human oral cavity as well as the gut 38 . Further studies are warranted to elucidate the relative roles of restriction-modification and CRISPR-Cas in contributing to bacterial immunity. A recent study, in a different context, showed that CRISPR-Cas and restriction-modification function additively, at least against conjugative antibiotic resistance plasmid transfer in Enterococcus faecalis 41 .
Alternatively, the enrichment of recombination-intense viral groups in human oral cavities could be explained in terms of frequency of recombinases among the viral groups. However, there was no statistically significant difference in frequency of viral groups carrying a recombinase (P = 0.6, chi-square test) between the recombination-intense viral groups (38%) and the others (30%).
Regarding the bacterial host species of the recombination-intense viral groups (Table 1), many are recognized as part of the normal human oral flora, but sometimes cause disease. For example, some Atopobium species have been identified as agents of chronic periodontitis and bacteremia 42,43 . The viridans group streptococci (VGS), which consist of Streptococcus mitis, Streptococcus gordonii, and Streptococcus oligofermentans, can cause a wide range of infections in humans, including bacteremia, infective endocarditis, and moderate or severe clinical disease (e.g., VGS shock syndrome) 44 . Among others, associations with endocarditis have been reported for Actinomyces viscosus 45 , Gemella haemolysans 46 , and Leptotrichia goodfellowii 47 , while associations with bacteremia were reported for Leptotrichia goodfellowii 47 and Capnocytophaga species 48 . In addition, Campylobacter concisus has been linked to prolonged diarrhea and inflammatory bowel disease 49 . Meanwhile, the phage lytic lifecycle has been shown to play a role in preventing outgrowth and dysbiosis by killing the bacterial host 39 . According to a recent review 50 , endogenous phages can play an important role in human oral health by limiting overgrowth of bacteria and maintaining the commensal microbiota at acceptable levels that can then be controlled by the human immune system. The evolution of recombination-intense phages could contribute to such maintenance function against oral host bacterial species during their coevolution.
Given these observations, the recombination-intense viral groups which are particularly present in oral viromes, could boost phage evolution and contribute to the maintenance of the commensal microbiota (eubiosis) 39 . This is the first systematic, quantitative study of recombination in phage genomes across >200 diverse viral groups, and the first study to explore viral-host relationships from a viewpoint of recombination and host immune systems.

Materials and Methods
Selection of viral groups and individual scaffolds. From 17,803 viral groups defined in the previous study 11 as having ≥90% bidirectional average amino acid identity and ≥50% total alignment fraction, we extracted 211 groups consisting of at least four individuals (scaffolds) required for the examination of minimum number of recombination events and carrying information of host bacteria at the species level. The information was inferred in the previous study 11 and is stored in IMG/VR database 13 by either perfect matches of viral tRNAs, or matches of CRISPR-Cas spacers requiring at least 95% identity over the whole spacer length, and allowing only 1-2 SNPs at the 5′ end of the sequence.
Scaffold IDs for each individual in each viral group is listed in Table S3. We downloaded nucleotide sequences of the scaffolds through the 'Expert Review' version of IMG/M ER (https://img.jgi.doe.gov/ mer/) 10 datamart 12 . We made the nucleotide sequence data directly downloadable at https://figshare.com/ articles/211viralgroups_fas_tgz/6223769.

Proteomic tree of the viral groups.
A proteomic tree of the 211 viral groups was constructed using the Virus Classification and Tree Building Online Resource (VICTOR) method, publicly available at https://victor. dsmz.de 10 . From the three distinct trees generated by VICTOR, the one based on distance formula d 4 was chosen because this formula is robust 51 when using incomplete genomes and represented the most reasonable choice in view of the partially incomplete nature of the metagenomic samples. The tree was finally visualized and annotated using the iTOL web service 52 .          Figure 3. Phage proteomic tree based on the VICTOR method using a united dataset of comprehensive ICTV reference data and the recombination-intense viral groups. Leaf labels representing the recombination-intense viral groups are highlighted in orange. Branch support is indicated by color from red (50%) to green (100%). The vicinity of these metagenomic samples to actual ICTV phage species provides hints regarding their composition. Scale bar indicates interproteomic distances calculated via the distance formula d 4 . The tree was rooted at the midpoint 69 . Pan-genome analysis of recombination and nucleotide diversity. The prediction of protein-coding genes and gene annotation was performed using Prokka 53 software with the "-k Viruses" option. We conducted pan-genome analyses using the Roary 54 pipeline with "-e --mafft -i 90 -z" options and obtained alignments of 12,319 orthologous genes. The minimum number of recombination events (r min ) was calculated for each orthologous gene using the four-gamete test 55 that locates pairs of closest segregating sites within 4 haplotypes that are likely to be generated by recombination between them. We used the method implemented in the PGEToolbox 56 , which filters gaps in advance. Basic population genetic statistics (e.g. nucleotide diversity) were also calculated for each orthologous gene using DnaSAM 57 .
0.05 0.10 0.15 0.00 0.04 0.08 0.12 0.00 0.05 0.10 0.15 0.00 0.04 0.08 0.12 0.00 0.05 0.10 0.15 0.00 0.04 0.08 0.12 0.00 0.05 0.10 0.15 0.00 0.04 0.08 0.12   ad.jp/htbin/getData?table=genome). For each host bacterial species, we checked the frequency of individuals that have either Type I, II, or III restriction enzymes, or CRISPR-arrays that were detected by CRISPRDetect 59 . A 16S rRNA gene sequence phylogeny was inferred by the GGDC web server 51 , available at https://ggdc.dsmz.de/, using the DSMZ phylogenomics pipeline 60 adapted to single genes. A multiple sequence alignment was created with MUSCLE 61 . Maximum likelihood (ML) and maximum parsimony (MP) trees were inferred from the alignment with RAxML 62 and TNT 63 , respectively. For ML, rapid bootstrapping in conjunction with the autoMRE bootstrapping criterion 64 , and subsequent search for the best tree was used; for MP, 1000 bootstrapping replicates were used in conjunction with tree-bisection-and-reconnection branch swapping and 10 random sequence addition replicates. The sequences were checked for compositional bias using the Chi-squared test as implemented in PAUP* 65 . Identification and taxonomic assignments of recombination-intense viral groups. We calculated the sum of the r min across the orthologous genes divided by the sum of their lengths, in base pairs, for each viral group. We also calculated nucleotide diversity of each orthologous gene using DnaSAM 57 , and its average across the orthologous genes for each viral group. We then conducted multiple linear regressions to capture the overall relationship between the r min per nucleotide and nucleotide diversity after controlling for differences in the number of individuals in a viral group: β β β = + + + y x x i 0 1 1,i 2 2,i i  where, for viral group i, y i is the minimum number of recombination events per nucleotide; x 1,i is nucleotide diversity; x 2,i is the number of individuals; β 0 is the intercept; β 1 and β 2 are regression coefficients; and ε i is error, which is normally distributed. We plotted the regression line in Fig. 2 given the parameter estimates, holding constant x 2 as the average number of individuals among the viral groups. Using this relationship, we identified the top 25 viral groups having >0.009 deviation from the regression line as recombination-intense. We chose the cutoff by examining the empirical distribution (Fig. S2) and looking for approximately top 10 percentile. A caveat is frequent recombination among very closely related sequences might not be identified by this approach because such recombined sequences are expected to be almost the same as their parental sequences, and in principle be difficult to be detected by comparison of nucleotide sequences.
The recombination-intense viral groups were added to a recently published ICTV reference dataset 10 which together was used for the inference of a proteome-based tree via the VICTOR method 10 . In particular, pairwise distances were inferred from the distance formula d 4 because this formula is robust when using incomplete sequences 10,51 . Identification and annotation of recombination-intense genes. For each recombination-intense viral group, we conducted multiple linear regressions, similar to above, but at the gene level rather than the viral group level: ε i where, for gene i in a viral group, y i is the minimum number of recombination events per nucleotide; x 1,i is nucleotide diversity; x 2,i is the number of aligned sequences; β 0 is the intercept; β 1 and β 2 are regression coefficients; and ε i is error, which is normally distributed. We plotted the regression lines in Fig. 4 and Fig. S5 given the parameter estimates, holding constant x 2 as the average number of aligned sequences in a gene. For each gene, we translated the alignment and conducted iterative protein searches using HHblits 31 , which represents both query and database sequences by profile hidden Markov models (i.e., condensed representation of multiple sequence alignments specifying, for each sequence position, the probability of observing each of the 20 amino acids) instead of single sequences for the detection of remote homology. We used the clustered uniprot20_2016_02 database (http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_ dbs/), which covers essentially all of the sequence universe by clustering the UniProt database 66 from EBI/SIB/PIR and the non-redundant (nr) database from the NCBI. For all hits with >99% probability of being true positives, we individually examined each annotation, and the extent of deviation from the regression line, to identify notable recombination-intense genes.  Table S2.