Genome signature-based dissection of human gut metagenomes to extract subliminal viral sequences

Bacterial viruses (bacteriophages) have a key role in shaping the development and functional outputs of host microbiomes. Although metagenomic approaches have greatly expanded our understanding of the prokaryotic virosphere, additional tools are required for the phage-oriented dissection of metagenomic data sets, and host-range affiliation of recovered sequences. Here we demonstrate the application of a genome signature-based approach to interrogate conventional whole-community metagenomes and access subliminal, phylogenetically targeted, phage sequences present within. We describe a portion of the biological dark matter extant in the human gut virome, and bring to light a population of potentially gut-specific Bacteroidales-like phage, poorly represented in existing virus like particle-derived viral metagenomes. These predominantly temperate phage were shown to encode functions of direct relevance to human health in the form of antibiotic resistance genes, and provided evidence for the existence of putative ‘viral-enterotypes’ among this fraction of the human gut virome.

Tetranucleotide usage profiles (TUP)* of PGSR driver sequences were compared with those present in a range of bacterial chromosomes representing common and abundant genera in the gut microbiome. Relationships were visualised by construction of phylograms using the neighbour-joining algorithm in PHYLIP 3.69 50 , from distance matrices of TUP correlation scores 46 ; See Methods, main text).
Squares: represent driver sequences used in PGSR interrogations of gut metagenomes. Blue squares denote sequences derived from either the MetaHIT gut metagenomes 21 , or from Japanese gut metagenomes 28 . Purple squares represent phage genome sequences known to infect Bacteroides fragilis 13,59 .
Circles: represent bacterial chromosomal sequences from prominent genus associated with the human gut microbiota and major phyla (Bacteroidetes, Firmicutes, Actinobacteria   Table 1, main text) were predicted using Glimmer V3 64 , and sequence data annotated using Artemis 65 . ORF functional assignments were based on the results of BlastP, tBlastn and Conserved Domain searches as in Ogilvie et al. 13 . For BlastP and tBlastn only hits generating e-values of 1e -5 or lower and 20 % identity or greater were considered significant. For Conserved Domain searches only hits with an e-value of 0.01 or lower were considered significant. Figure S4: Visualisation of relationships between PGSR sequences, bacteriophage genomes, and bacterial chromosomes based on tetranucleotide profiles.

Supplementary
Tetranucleotide profiles were generated from 1700 bacterial chromosomal sequences, all PGSR sequences, 188 large contigs from assembled gut viromes (10 kb or over), and 647 bacteriophage genomes (also 10 kb or over), as for construction of phylograms (Figure 4, main text). The resulting tetranucleotide usage profiles were used to construct Emergent Self Organising Maps (ESOMs) and visualise relationships between sequences (See Methods, main text). Maps represent borderless toroidal landscapes and are continuous from all edges, with the underlying topology of the ESOM "landscape" indicative of the relationship between sequences. "Valleys" (green/blue regions) describe areas populated by sequences with similar TUP profiles (high similarity), and elevated "mountain" regions (Brown/white) separate groups of less similar sequences (low similarity). Shaded areas represent regions populated predominantly (but not exclusively) by sequences from the indicated taxonomic Class, with phage sequences classified according to phylogeny of host bacteria where known or according to habitat for metagenomic sequences (gut virome and PGSR sequences). For clarity only the phylogeny of sequences belonging to bacterial classes with 40 or more representatives in the dataset are indicated by coloured regions (see figure legend). Sequences not classified are represented by white data points. For source of sequence data used see Supplementary Table 1 79 , and marine bacteriophage genome sequences were obtained from respective project pages as indicated in the table.
WUGC: Washington University Genome Centre (http://genome.wustl.edu): Chromosomes sequenced as part of the human microbiome project 79 were obtained from the related project homepage as indicated in the table.
** Assembled viral metagenomes utilised for these datasets. For human gut viromes 11 both assembled and unassembled datasets were utilised in different analyses, as described in Methods. For assemblies of gut virome reads obtained from the NCBI SRA were processed and assembled using CAMERA workflows 52 , as described in Ogilvie et al. 13 . For Rice Paddy Soil datasets 70 only assembled datasets were utilised, as generated by study authors.

22.35%
1 The ability of widely used alignment-driven approaches to identify Bacteroidales-like PGSR phage sequences in the MetaHIT dataset 21 were evaluated, along with recently described methods for analysis of phage sequences in metagenomes: a Driver sequences described in Table 1 (main text) were used as query sequences in Blast searches of all contigs 10 kb and over in length as used in the PGSR approach. All hits generating e-values of 1e -3 or lower were recovered and rendered non-redundant by subject sequence (based on top hits by bit scoresee Methods, main paper). b Data from Stern et al. 8 . CRISPR spacer regions identified in sequences from the MetaHIT dataset were used to search assembled contigs from these same reads, using Blastn (e-value threshold of 1e -4 ). Recovered sequences (10 kb in length or over) were classified as phage based on the presence of one or more phage related ORFs (See Stern et al. 8 for full details). c Driver sequence ORFs predicted to encode capsid proteins and phage terminase sequences (Supplementary Figure   S3) were translated and used to search MetaHIT contigs of 10 kb and over as used for PGSR analysis. All hits generating e-values of 1e -3 or lower were recovered and processed as for results from Blastn searches above ( a ).
2 Proportion of sequences recovered by the PGSR approach, and categorised as phage (PGSR phage, n = 85), also identified in alignment-driven approaches evaluated here, or CRISPR spacer based surveys by Stern et al. 8 .
SD = Standard deviation of the mean.  (Supplementary Table  S1), were used to construct custom databases and search mass spectra (177729 spectra) derived from a shotgun metaproteome of a human faecal microbiome (see Methods, main text). Figures in parentheses show the total number of ORF sequences predicted in each dataset, and used in searches of mass spectra. 2 For sequences detected in the metaproteome, putative functions were assigned based on COG searches (e-value = 1e -3 or lower). COG classes: C -Energy production and conversion; D -Cell cycle control, cell division, chromosome partitioning; E -Amino acid transport and metabolism; G -Carbohydrate transport and metabolism; I -Lipid transport and metabolism; J -Translation, ribosomal structure and biogenesis; K -Transcription; L -Replication, recombination and repair; M -Cell wall/membrane/envelope biogenesis; O -Posttranslational modification, protein turnover, chaperones; Q -Secondary metabolites biosynthesis, transport and catabolism; R -General function prediction only; S -Function unknown; T -Signal transduction mechanisms. Unclassified -ORFs detected in human gut but without representation in the COG database.