Megaphages infect Prevotella and variants are widespread in gut microbiomes

Bacteriophages (phages) dramatically shape microbial community composition, redistribute nutrients via host lysis and drive evolution through horizontal gene transfer. Despite their importance, much remains to be learned about phages in the human microbiome. We investigated the gut microbiomes of humans from Bangladesh and Tanzania, two African baboon social groups and Danish pigs; many of these microbiomes contain phages belonging to a clade with genomes >540 kilobases in length, the largest yet reported in the human microbiome and close to the maximum size ever reported for phages. We refer to these as Lak phages. CRISPR spacer targeting indicates that Lak phages infect bacteria of the genus Prevotella. We manually curated to completion 15 distinct Lak phage genomes recovered from metagenomes. The genomes display several interesting features, including use of an alternative genetic code, large intergenic regions that are highly expressed and up to 35 putative transfer RNAs, some of which contain enigmatic introns. Different individuals have distinct phage genotypes, and shifts in variant frequencies over consecutive sampling days reflect changes in the relative abundance of phage subpopulations. Recent homologous recombination has resulted in extensive genome admixture of nine baboon Lak phage populations. We infer that Lak phages are widespread in gut communities that contain the Prevotella species, and conclude that megaphages, with fascinating and underexplored biology, may be common but largely overlooked components of human and animal gut microbiomes.

genomes and 9 reads to five other genomes). The B9 genome is notable in that it was reconstructed from five genome fragments that comprised ~99% of the final bin length. Fragmentation was at least partially due to a few local regions of within-population sequence variation.

tRNA intron similarity across cohorts
Introns were identified within some pseudo tRNAs that are substantially larger than expected. An identical tRNA Phe (GAA) intron is found in A1, A2, the cholera cohort and some pigs; identical introns occur in a tRNA Gly (TCC) in the A2, C1 and some fragments from pigs and identical tRNA Thr (TGT) introns occur in A1, A2, C1 and some genome fragments from the cholera and pig cohorts. Finally, we identified identical introns in Tyr (GTA) from A1, A2 and C1 and related but longer introns in the B-Lak phage genomes (see Supplementary Table 6).
The largest putatively circularized phage genome reported to date from any environment is 595,573 bp in length, with an unknown host (6). That genome is predicted to encode 75 tRNAs, 44 of which have unknown or mismatch isotypes. We predict that this genome has at least 4 tRNAs with introns. Thus, tRNAs with introns may be fairly common in large phage genomes.

Phage metabolic potential
The largest fraction of genes with confident functional predictions are involved in reactions involving nucleotides (see Supplementary Table 7). Some genes encode enzymes that interconvert various deoxyribonucleoside and ribonucleosides, phosphorylate/dephosphorylate nucleotides and participate in other steps in DNA and RNA metabolism (e.g., ribose-5P to PRPP). In addition, there are helicases, topoisomerases, polymerase subunits, primases, exonucleases, genes involved in restriction, base excision and repair and recombination (potentially of relevance given the extensive evidence for homologous recombination). Similar sets of genes have been reported in other jumbo phage, such as Ralstonia phage ϕRSL1 (7), Xanthomonas phage XacN1 (8), and Cronobacter phage GAP32 (9). The Lak genomes encode several genes involved in carbon metabolism and genes involved in interconversion of nicotinate and nicotinamide, possibly with an impact on host energy metabolism. Also predicted is pnuC, a nicotinamide mononucleotide transporter. Other genes involved in transport include a co-located three subunit NitT/TauT (sulfonate/nitrate/taurine) system. The genomes also typically encode multiple molecular chaperones. Repeat sequence analysis identified a few genes with similar amino acid sequences that were probably duplicated.
Intriguingly, the genomes encode one or two proteins predicted to hydrolyze a compound related to nucleoside 3',5'-cyclic phosphate to nucleoside 5'-phosphate. Such enzymes might affect the cellular levels of the host's cyclic second messengers. Potentially related, the genomes also encode a T4encoded RNA ligase, which phage may use to repair tRNAs that are broken when a host defense mechanism produces 2′,3′-cyclic phosphate ends (10) Among the enigmatic predicted proteins of potential significance from the perspective of host metabolism is an enzyme annotated as heme oxygenase [EC:1.14.99.3], which catalyzes the degradation of heme, an iron complex outer membrane receptor protein and a glycosyltransferase implicated in lipopolysaccharide biosynthesis. Despite the genome size however, we could recognize relatively few genes that might supplement host metabolism, except as it might pertain to phage replication.
Interestingly, the genomes encode some genes that modify tRNAs, such as for adding a guanine nucleotide to the 5′ end of tRNA His , peptidyl-tRNA release, and CCA addition for tRNA maturation.
Other possible sightings of the Lak phage A huge Prevotella phage from the cow rumen was imaged previously (11). The imaged phage was reported to have an icosahedral head of ~120 nm in diameter and a long curved tail that was 30 nm wide and up to 800 nm long. To our knowledge, the largest isolated phage infects Bacillus megaterium and has a 497 kbp genome, a head diameter of 160 nm and tail length of 453 nm (2). The visualized phage may be related to Lak megaphage.

Supplementary Figures
Supplementary Figure 1: Community composition of gut microbiomes of ten subjects from Laksam Upazila, based on phylogenetic analysis of ribosomal protein S3 sequences. Each stacked bar represents one genotypic variant. Variants are in conserved order and colored by genus. All genotypic variants present in under 10% abundance across all samples are collapsed into the "Other" bar. The increments in the stacked bar chart are colored by genus. Red boxes indicate samples from which megaphage sequences were partially assembled; blue boxes indicate samples from which a complete megaphage genome was recovered. Lines within each genus denote the relative abundance of genotypic variants. The high variability seen in day 2 samples compared to day 3, 4, and 5 in some subjects is confirmed by 16S amplicon sequencing and likely a result of a change in diet upon hospital admittance.  Given that most fragments were <10 kbp in length, we mapped fragments to a ~30 kbp region of the A1 genome (within that shown in Figure 2) and identified cases where the fragments spanned >70%. The ANI for each fragment is shown above the fragment in red and the ANI for the set of aligned fragments is shown to the right. Over the 8 examples, the average ANI is 91.8 ± 1.7%. The average length of aligned scaffolds to the ~30 kbp A1 genomic region is 27.2 ± 2.6 kbp (the sum of the aligned scaffolds is slightly longer in some cases). Mapping of reads to the scaffolds from Pig Lak 8 confirmed a high degree of within-population sequence variation that likely accounts for assembly fragmentation. B. Plot of ANI vs. fragment length for all (N=4830) pig fragments >2 kbp identified as possible Lak phage. C. Plot of fragment number (N=4830, same fragments as in B) vs. start position on genome. 9      set as the reference sequence. Each genome is a row and vertical black tick marks indicate polymorphic sites. The sequence of B1 was designated as type 'orange' and sequence B7 as type 'blue'. Sequence blocks of the same color Colored underlines indicate sequence blocks that are identical. Small gaps within bars of the same color indicate SNPs that are not in agreem ent despite flanking sequence identity. Notably, the patterns indicate extensive admixture of sequence blocks, presumably due to homologous recombination, and often involve the available B-Lak genomes. For example, blocks of sequence shared with B7 (light blue underlines) occur in all B-Lak genomes. One region within the lowest panel is shown in detail in Figure 3.  Figure 14: Genotypic variation within the B9 population. The upper panel shows a small subset of reads (light grey bars underlined with colored bars) that mapped to a hypervariable region of the B9 genome. Example paired reads were selected for display as they carried replicated SNPs (colored circles highlight SNP positions and types). As shown by red and blue bars in the lower panel, this region overlaps with that analyzed in Figure 3. Each read was assigned to a possible genotype (in some cases, multiple choices were possible), indicated by the color of the reads (top panel) so as to match to the color of the genome sequences in the lower panel. Some paired read variants could be assigned to the B7 genome, extending backward from the small region where B9 and B7 are identical. Many variant reads could be assigned to B1, perhaps not surprising given the overall high similarity between the B1 and B9 genomes. Others could be assigned to B2 and/or B5. A subset of read pairs had SNP combinations that were distinct from those of known genomes, supporting the inference of recombination with as yet unreconstructed B-Lak genotypes. Despite the relative clonality of most other regions, these data indicate that the phage population includes rare genotypes that reflect homologous recombination, mostly involving sequences found in the B-Lak genome collection. Some paired reads contain combinations of SNPs not found in the B-Lak genomes (cream color). This may reflect recombination with other B-Lak genotype(s).