Introduction

Marine viruses affect the life histories and evolution of their hosts and are a central component of the marine food web (Suttle, 2007; Rohwer and Thurber, 2009). Cyanophages, viruses that infect cyanobacteria, are abundant and broadly distributed in the global oceans (Suttle, 2007; Williamson et al., 2008). Cyanophage genomes carry orthologs of host genes involved in a variety of host processes, including phosphate acquisition, carbon metabolism, photosynthesis and response to light stress (Lindell et al., 2004; Mann et al., 2005; Sullivan et al., 2005; Weigele et al., 2007; Sullivan et al., 2010).

The abundance, diversity and phylogenies of shared phage/host genes in numerous sequenced phage genomes suggest cyanophage are involved in remodeling and distributing host genes. For example, phylogenetic grouping suggests that two photosystem genes, psbA and psbD, have been transferred repeatedly from host to phage genomes (Sullivan et al., 2006). Furthermore, cyanophage copies of psbA and a high-light inducible (hli) gene are transcribed and translated during the infection cycle (Lindell et al., 2005; Clokie et al., 2006; Millard et al., 2010).

Host metabolic processes with shared components in host and phage genomes highlight pathways potentially involved in the competition between cell and phage for metabolic resources. Although cyanophage carry genes involved in the light reactions of photosynthesis, thus far, cyanophage genomes lack genes encoding Calvin cycle enzymes, suggesting that phage do not participate in the carbon fixation pathways of their hosts (Sullivan et al., 2010). In fact, there is evidence that phage actively direct carbon flux toward the pentose phosphate pathway (PPP), enabling nucleotide and nucleic acid synthesis needed for phage replication (Thompson et al., 2011b).

As a corollary, phage genome replication requires phosphorus, which can be extremely scarce in the oligotrophic oceans where Prochlorococcus and its close relative Synechococcus thrive (Wu et al., 2000). Thus it is not surprising that the genomes of all 17 T4-like cyanomyoviruses that infect these cyanobacteria and were available when this study was undertaken (Millard et al., 2009; Sullivan et al., 2010) encode phosphate regulon genes known to be responsive to phosphorus starvation in cyanobacteria (Martiny et al., 2009; Tetu et al., 2009; Sullivan et al., 2010). Some phage genomes encode PstS, a periplasmic high-affinity phosphate-binding protein associated with a phosphate-specific membrane transporter; some encode a homolog of the putative alkaline phosphatase gene phoA. This suggests that there is a selective pressure for phage to retain genes that could facilitate phosphorus acquisition in infected host cells.

Multiple lines of evidence indicate that phosphorus limitation exerts strong selective pressures on Prochlorococcus, providing a context for the patterns in phage. Prochlorococcus primarily utilizes the sulfolipid sulfoquinovosyldiacylglycerol in lieu of more common phospholipids for membrane construction (Van Mooy et al., 2006). Furthermore, the prevalence of phosphorus-associated genes in cultured strains is associated with phosphate availability in the habitat of origin rather than phylogeny (Martiny et al., 2006, 2009; Coleman and Chisholm, 2010). Similarly, T4-like cyanophage isolated from relatively low-phosphorus environments have more host-like phosphate assimilation genes than those from more phosphorus-replete environments (Sullivan et al., 2010). Finally, in phosphate-starved host cells, transcription of phage versions of both pstS and phoA increases via regulation by the host phoBR two-component system (Zeng and Chisholm, 2012).

The availability of new cyanomyovirus genomes and the observation that the abundance of some shared phage/host genes in phage is correlated with variables such as trophic status, nutrient gradients (for example, phosphate) and salinity (Williamson et al., 2008) in the oceans, led us to further explore genome content and evolution in a closely related set of T4-like cyanomyoviruses. Our analysis does not include the highly divergent non-T4-like cyanomyovirus described recently by Sabehi et al. (2012) as it was not available when we began the work. We compared the frequencies of genes in cyanomyovirus genomes in three marine environments to identify genes that the environments have in common and genes that distinguish them. We also examined features of some of these genes in cultured cyanomyovirus genomes—including 11 reported here for the first time.

Materials and methods

Cyanomyovirus genome collection

Seventeen cyanomyovirus genomes were downloaded from GenBank (Benson et al., 2006); 11 additional genomes, sequenced and annotated as described in Henn et al. (2010), are reported here for the first time (Table 1).

Table 1 General features of 28 T4-like cyanomyovirus isolates

Orthologous gene cluster and shared domain identification

Gene clusters were generated as described previously with slight modifications (Kettler et al., 2007; Kelly et al., 2012). Orthologous genes were assigned using reciprocal best blastp scores (using an e-value cutoff of 1E-5) where sequence identity was at least 35% and alignment length was at least 75% of the length of each protein. Clusters of orthologous genes were built by transitively clustering orthologs. This procedure was established to identify complete genes instead of conserved domains that might represent only a small fraction of a gene. To identify conserved domains, genes were run against the Pfam protein families database version 25.0 (Punta et al., 2012) with HMMER 3.0 (Eddy, 1998) using the CAMERA function prediction workflow with default parameters; hits meeting an e-value threshold of 0.001 are reported (Sun et al., 2011).
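The clustering step can be illustrated with a minimal sketch. It assumes blastp hits are already parsed into dictionaries with the hypothetical keys query, subject, evalue, identity, aln_len and bitscore. It also assumes that "best" means the highest bitscore per query protein in each other genome. Neither the record layout nor the function names come from the pipeline used in this study; only the three cutoffs are taken from the text.

```python
from collections import defaultdict

# Cutoffs quoted in the text; everything else here is illustrative.
MAX_EVALUE = 1e-5
MIN_IDENTITY = 35.0   # percent identity
MIN_COVERAGE = 0.75   # alignment length / protein length, for both proteins


def passes_filters(hit, lengths):
    """Apply the e-value, identity and coverage cutoffs to one blastp hit."""
    q, s = hit["query"], hit["subject"]
    return (
        hit["evalue"] <= MAX_EVALUE
        and hit["identity"] >= MIN_IDENTITY
        and hit["aln_len"] >= MIN_COVERAGE * lengths[q]
        and hit["aln_len"] >= MIN_COVERAGE * lengths[s]
    )


def best_hits(hits, lengths, genome_of):
    """Best-scoring passing hit of each query protein in each other genome."""
    best = {}
    for hit in hits:
        q, s = hit["query"], hit["subject"]
        if genome_of[q] == genome_of[s] or not passes_filters(hit, lengths):
            continue
        key = (q, genome_of[s])
        if key not in best or hit["bitscore"] > best[key]["bitscore"]:
            best[key] = hit
    return {key: hit["subject"] for key, hit in best.items()}


def cluster_orthologs(hits, lengths, genome_of):
    """Link reciprocal best hits, then merge the links transitively (union-find)."""
    parent = {p: p for p in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    best = best_hits(hits, lengths, genome_of)
    for (q, _genome), s in best.items():
        if best.get((s, genome_of[q])) == q:  # reciprocal best hit
            parent[find(q)] = find(s)

    clusters = defaultdict(set)
    for p in parent:
        clusters[find(p)].add(p)
    return sorted(sorted(c) for c in clusters.values())
```

Because links are merged transitively, two proteins can share a cluster without being reciprocal best hits of each other, provided a chain of reciprocal best hits connects them.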

Cyanomyovirus gene identification in metagenomic data sets

Three data sets from microbial fraction genomic DNA (retained on 0.22 μm filters; phage DNA is ‘by catch’ in these samples) were analyzed (Table 2). Two pyrosequence data sets were collected from three depths in the oligotrophic N. Pacific subtropical gyre (Hawai’i Ocean Time-Series (HOT), cruise HOT186) and the Sargasso Sea (Bermuda Atlantic Time Series station (BATS), cruise BATS216) (Frias-Lopez et al., 2008; Coleman and Chisholm, 2010); the third was from the deep chlorophyll maximum in the Mediterranean Sea (MedDCM, NCBI Sequence Read Archive Id: SRP002017) (Ghai et al., 2010). The three depths sampled at HOT (25, 75, 110 m) and BATS (20, 50, 100 m) were pooled by site. The MedDCM site was sampled at a single depth, 50 m.

Table 2 Three environmental metagenomic data sets analyzed for cyanomyovirus gene abundance

Metagenomic sequences from each sample were recruited to the custom protein database of cyanobacterial and cyanophage orthologous gene clusters described above. This step distinguishes cyanomyovirus genes of interest from (1) cyanobacterial and (2) podo- and siphoviral genes. Sequences and annotations are available in the ProPortal database (Kelly et al., 2012) (http://proportal.mit.edu/) and as a FASTA file (http://proportal.mit.edu/pubdownload/index_V3clusters.html). Reads with best hits to a cyanomyovirus gene (blastx bitscore >50) were required to have their top five hits (if available) to genes in the same cluster. Sequences passing this filter were compared with the NCBI non-redundant (nr) database using blastx with a bitscore comparison to ensure there were no better hits to non-phage protein sequences. The Fisher test (part of the epitools library) and the Bonferroni multiple comparison correction in the R statistical software package (R Development Core Team, 2009) were used to determine the statistical significance of gene cluster abundance when comparing pairs of sites.
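The pairwise site comparison can be sketched as follows. The study used R (the epitools library); this Python version is a stand-in written with the standard library only. It assumes a 2x2 table for each gene cluster that contrasts reads assigned to that cluster with all other cyanomyovirus reads at each of two sites. That table layout, the function names and the 0.05 significance level are assumptions for illustration, not details given in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def compare_sites(counts_a, counts_b, alpha=0.05):
    """Clusters whose share of reads differs between two sites after correction."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    clusters = sorted(set(counts_a) | set(counts_b))
    raw = []
    for cluster in clusters:
        a, c = counts_a.get(cluster, 0), counts_b.get(cluster, 0)
        raw.append(fisher_exact_two_sided(a, total_a - a, c, total_b - c))
    adjusted = bonferroni(raw)
    return {cl: p for cl, p in zip(clusters, adjusted) if p < alpha}
```

Exact binomial coefficients keep the sketch dependency-free but become slow for very large read counts; a statistics library would be the practical choice there.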

Reconstruction of phylogenetic trees

Protein sequences were aligned with MUSCLE v3.6 (Edgar, 2004). Alignments were trimmed such that each column was covered by 90% of the sequences. Trees were reconstructed with PhyML version 2.45 (Guindon et al., 2009) using non-parametric bootstrap analysis with 100 replicates, one category of substitution rate, the JTT model of amino-acid substitution and the proportion of invariable sites fixed. Trees were plotted using iTOL (Letunic and Bork, 2011).
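The trimming rule can be sketched in a few lines. This reads "covered by 90% of the sequences" as "at least 90% of sequences have a residue rather than a gap in that column", which is an interpretation, not a detail stated in the text. The function name and the gap characters are illustrative.

```python
def trim_alignment(aligned, min_coverage=0.9, gap_chars="-."):
    """Keep columns in which at least min_coverage of the sequences are not gaps.

    `aligned` maps sequence names to equal-length aligned strings.
    """
    names = list(aligned)
    seqs = [aligned[name] for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(s[i] not in gap_chars for s in seqs) / len(seqs) >= min_coverage
    ]
    return {name: "".join(s[i] for i in keep) for name, s in zip(names, seqs)}
```

Trimming removes columns only, so every sequence stays in the alignment and all trimmed sequences keep a common length.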

Identification of core gene sets

We defined two broad sets of core genes: one based on cultured, completely sequenced cyanomyoviruses (‘signature core genes’) and the other based on the relative abundance of cyanomyovirus genes in the metagenomic data sets (‘metagenome-defined core genes’).

Cyanomyovirus signature core genes are, by our definition, those genes that are single copy and have orthologs in all of the complete cyanomyovirus genomes available at the time of this study; 26 genes fit this definition (Table 3). Note that the signature core gene set defined here is a subset of the cyanomyovirus core genes defined in Sullivan et al. (2010), in which sequence profiling techniques and manual curation were used to pull in more distantly related genes and to group together clusters to define core gene groups, respectively. For the purposes of metagenomic recruitment, we wanted our clusters to (1) reflect complete genes instead of partial genes or conserved domains, (2) comprise closely related sequences, and (3) be automatically produced to facilitate addition of new genomes.
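The signature core definition (single copy, present in every genome) can be written as a short filter. This is a sketch under the assumption that cluster membership is available as (genome, gene) pairs; the data layout, function name and toy identifiers are hypothetical.

```python
def signature_core(cluster_members, genomes):
    """Clusters that have exactly one member gene in every listed genome.

    `cluster_members` maps a cluster id to a list of (genome, gene) pairs.
    """
    core = []
    for cluster, members in cluster_members.items():
        copies = {}
        for genome, _gene in members:
            copies[genome] = copies.get(genome, 0) + 1
        if all(copies.get(g, 0) == 1 for g in genomes):
            core.append(cluster)
    return sorted(core)


# Toy input with made-up identifiers, for illustration only.
toy_clusters = {
    "single_copy_everywhere": [("g1", "g1_001"), ("g2", "g2_004"), ("g3", "g3_010")],
    "multi_copy": [("g1", "g1_050"), ("g1", "g1_051"), ("g2", "g2_060"), ("g3", "g3_070")],
    "missing_in_g2": [("g1", "g1_080"), ("g3", "g3_081")],
}
toy_core = signature_core(toy_clusters, ["g1", "g2", "g3"])
```

In the toy input only the first cluster qualifies: the second is multi-copy in one genome and the third is absent from another.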

Table 3 Cyanomyovirus signature core genes from 28 cyanomyovirus isolates

As expected (Coleman and Chisholm, 2010), for the signature core genes there is a linear relationship between the number of reads detected in metagenomic databases and gene length; we use this relationship to define a range of values that encompasses the length-normalized abundance of most signature core genes (Figure 1). The kernel density estimator function ‘density’ in the stats library of the R statistical software package was used to identify the first and the third quartile range for the length-normalized abundance of signature core genes in each environment using default bandwidth selection (R Development Core Team, 2009).

Figure 1

Relationship between gene length and reads detected for cyanomyovirus genes observed in metagenomic databases from three different environments: Sargasso Sea (BATS), N. Pacific (HOT) and Mediterranean Sea (MedDCM). Red circles indicate single copy signature core genes identified in 28 cultured cyanomyovirus genomes. The linear relationship (adjusted r² values are 0.89, 0.95 and 0.94 for BATS, HOT and MedDCM respectively) between gene length and the number of times a gene is found supports the assertion that these genes are core in the wild populations of cyanomyoviruses as well.

This procedure allowed us to identify genes belonging to a ‘metagenome-defined core’, which is the set of phage genes in each metagenomic data set that, when normalized to gene length, occur at the same frequency as the signature core genes—that is, they are likely present in every cyanomyovirus. In some cases, genes fall in this group in all three environments, which we refer to as the ‘metagenome-shared core’.
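The idea behind the metagenome-defined core can be sketched as below. Note the substitution: the study derived the quartile range from a kernel density estimate in R, whereas this sketch uses plain empirical quartiles from Python's statistics module. Treating the first-to-third quartile interval as the acceptance range, and all function names, are assumptions for illustration.

```python
from statistics import quantiles


def normalized_abundance(read_counts, gene_lengths):
    """Reads recruited per unit of gene length for each gene cluster."""
    return {g: read_counts[g] / gene_lengths[g] for g in read_counts}


def metagenome_defined_core(read_counts, gene_lengths, signature_core):
    """Non-signature clusters whose normalized abundance lies in the
    first-to-third quartile range of the signature core genes."""
    abundance = normalized_abundance(read_counts, gene_lengths)
    core_values = [abundance[g] for g in signature_core]
    q1, _median, q3 = quantiles(core_values, n=4, method="inclusive")
    return sorted(
        g for g, value in abundance.items()
        if g not in signature_core and q1 <= value <= q3
    )


def shared_core(per_site_core):
    """Clusters that are core-like at every site (the 'metagenome-shared core')."""
    sites = list(per_site_core.values())
    return sorted(set(sites[0]).intersection(*sites[1:]))
```

Running metagenome_defined_core once per site and passing the three results to shared_core mirrors the two definitions given in the paragraph above.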

Identification of pho box motifs in cultured cyanomyovirus genomes

Previous work used consensus sequences to identify putative pho boxes upstream of the PhCOG173 gene in P-SSM7 and upstream of the pstS gene in P-SSM4 (Sullivan et al., 2010). Here, we used 129 pho box motifs computationally predicted upstream of genes in four Prochlorococcus and two marine Synechococcus genomes (Su et al., 2007) to generate a position weight matrix of the pho box motif with the Bio.Motif module from the Biopython software package (Cock et al., 2009). The position weight matrix was used to search upstream intergenic regions in the cyanomyovirus genomes for putative binding sites for the response regulator PhoB. A log-odds threshold, set with threshold_balanced(1000), was used to identify putative motifs. Motifs were required to be on the same strand as, and within 100 base pairs upstream of, a gene.
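The position weight matrix search can be illustrated without Biopython. The sketch below builds a log-odds matrix from motif instances and slides it along an upstream region. The pseudocount, the uniform background and the fixed score threshold are assumptions; the study instead derived its threshold with Biopython's threshold_balanced, which is not reimplemented here. The motif instances in the usage example are invented and are not real pho boxes.

```python
from math import log2

BASES = "ACGT"


def build_pssm(instances, pseudocount=0.5, background=0.25):
    """Log-odds scoring matrix from equal-length motif instances."""
    width = len(instances[0])
    pssm = []
    for i in range(width):
        column = [seq[i] for seq in instances]
        total = len(column) + pseudocount * len(BASES)
        pssm.append({
            b: log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def scan_upstream(pssm, upstream, threshold):
    """Score every window of an upstream region.

    `upstream` is read 5'->3' on the gene's strand and ends just before the
    gene start. Returns (distance_to_gene, score) for windows at or above
    the threshold.
    """
    width = len(pssm)
    hits = []
    for start in range(len(upstream) - width + 1):
        window = upstream[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(col[b] for col, b in zip(pssm, window))
        if score >= threshold:
            hits.append((len(upstream) - (start + width), round(score, 2)))
    return hits


# Invented 8-bp instances and a hypothetical threshold, for illustration only.
toy_pssm = build_pssm(["TTTAACCA", "TTTAATCA", "CTTAACCT", "TTTAACCT"])
toy_hits = scan_upstream(toy_pssm, "GGCGCG" + "TTTAACCA" + "GCGC", threshold=8.0)
```

Passing only the 100 bp immediately upstream of each gene, taken from the gene's own strand, would enforce the same-strand and distance requirements described above.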

Results and discussion

Gene frequency in different environments

To explore emergent patterns relating habitat to gene content in cyanomyovirus populations, we used predicted protein sequences from 28 cultured cyanomyovirus genomes to first define genes as either conserved or flexible and then to recruit homologous genes from metagenomic databases from the North Pacific Subtropical Gyre (HOT), the Sargasso Sea (BATS) and the Mediterranean Sea (MedDCM) (Table 2).

Cyanomyovirus signature core gene set

Given the constraints imposed when building orthologous gene clusters (see Methods), the 11 new cyanomyovirus genomes increase the total cyanomyovirus ‘pan genome’ from approximately 1500 (Sullivan et al., 2010) to approximately 2000 genes (Supplementary Figure S1). There is a well-defined set of 26 clusters of orthologous genes shared by all 28 cyanomyovirus genomes (Table 3)—defined here as ‘signature core genes’—that we used to assess the relative abundance of all other cyanomyovirus genes in each environmental sample. This set includes genes with host homologs—that is, shared phage/host genes—such as the pyrophosphatase mazG and the phosphate-starvation-inducible gene phoH. If these genes are also single copy core genes in wild phage genomes, their abundance should be directly proportional to gene length in each environment (Coleman and Chisholm, 2010), and indeed it is (Figure 1).
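The proportionality check behind Figure 1 amounts to an ordinary least-squares fit of read count against gene length. The sketch below computes the slope, intercept and adjusted r² for one predictor. It illustrates the calculation only; it is not the code used to produce the reported values, and the function name is hypothetical.

```python
def fit_line(lengths, reads):
    """Least-squares fit reads = slope * length + intercept.

    Returns (slope, intercept, adjusted_r2); needs at least three points.
    """
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(reads) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, reads))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(lengths, reads))
    ss_tot = sum((y - mean_y) ** 2 for y in reads)
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor
    return slope, intercept, adjusted_r2
```

For single copy genes present in every genome, the fitted slope estimates reads recruited per unit of gene length, the same quantity used for length normalization in the Methods.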

Shared metagenome-defined core gene set

Twenty-one genes were present within a range of values defined by the length-normalized abundance of signature core genes at all three sites. This set, plus applicable signature core genes, constitutes the ‘metagenome-shared core’ (Table 4). These genes encode phage structural proteins, hypothetical proteins and shared phage/host proteins such as the UvsW helicase and an endonuclease, indicating that some shared phage/host genes have become part of the core cyanomyovirus gene complement in multiple habitats. In most cases, a gene identified as core in the metagenomes was absent from only one or two of the 28 genomes of cultured strains, making its presence in the metagenome-shared core unsurprising. However, the hypothetical gene PhCOG71299, observed in only 16 of the 28 genomes, nonetheless appears at core frequencies in all three environments. This gene may be more prevalent in wild genomes than our cultured set would predict, or alternatively it may be multi-copy in some wild phage (Table 4). Notably, only between 6% and 11% of cyanomyovirus gene clusters are abundant at or above the boundaries set by the signature core genes per site, highlighting extremely high diversity at the level of individual genes in wild cyanomyovirus genomes (red circles, Supplementary Figure S2).

Table 4 Metagenome-shared core genes

Genes present at signature core gene frequencies in one or two environments

Thirty genes were found at signature core gene frequencies in one or two of the three environments, most of which were annotated as ‘hypothetical’ (Supplementary Table S1). Some annotated proteins, such as the phosphate-binding protein PstS, an iron-dependent oxygenase and the hli gene cluster hli04 (all core at BATS) have homologs in host genomes, while others, such as the bacterial DNA methylase Dam (core at HOT) do not. The shared Calvin cycle regulatory gene CP12 is core at HOT and MedDCM but not at BATS.

Pairwise site by site comparisons

We used pairwise comparisons of gene frequencies in different environments to identify further signals of environment-specific selective pressures on phage populations (Figure 2). Seventy-one unique genes were statistically overrepresented at one or more of the sites (Tables 5, 6, 7). We found some phage structural genes overrepresented at particular sites. Phage structural genes can be sequence diverse (Sullivan et al., 2010); we hypothesize that the dominant sequence type for some structural genes varies from site to site, which would explain why some structural genes appear to be specific to particular sites.

Figure 2

Comparisons of cyanomyovirus gene reads detected in three different ocean environments. Circles indicate equally represented phage genes, and purple outlined squares represent genes that are statistically differentially represented in one of the two environments being compared. Signature core genes are red; genes with abundances similar to signature core genes in all three environments (‘metagenome-shared core’) are pink. Six phage/host shared genes of particular interest are labeled: phoA and pstS (green) are phosphate-associated; psbA (yellow) is a photosystem gene; gnd and zwf (orange) are PPP genes; and gcvP is the glycine cleavage system P-protein. Two additional HOT-overrepresented genes in the neighborhood of psbA, a heme oxygenase and a gene of unknown function, are also colored yellow. Tables 5, 6, 7 include detailed information for each overrepresented gene.

Table 5 Statistically overrepresented cyanomyovirus genes in a comparison of the North Pacific Gyre (HOT) and the Sargasso Sea (BATS)
Table 6 Statistically overrepresented cyanomyovirus genes in a comparison of the Mediterranean Sea (MedDCM) and the Sargasso Sea (BATS)
Table 7 Statistically overrepresented cyanomyovirus genes in a comparison of the North Pacific Gyre (HOT) and the Mediterranean Sea (MedDCM)

Fifteen overrepresented genes have host homologs; that is, they are shared phage/host genes with the potential to interface with host metabolic pathways and processes (Millard et al., 2009; Sullivan et al., 2010; Sharon et al., 2011; Thompson et al., 2011b; Zeng and Chisholm, 2012). Of particular interest are those related to phosphorus acquisition, because this element can be a defining variable in the structure and function of marine microbial systems and has a key role in shaping the genome content of cyanobacterial hosts (Martiny et al., 2009; Coleman and Chisholm, 2010).

Features of phosphate-acquisition genes in cultured and wild phage

Frequency at BATS and MedDCM relative to HOT

The frequency of phoA and pstS, cyanomyovirus genes with host homologs involved in the phosphate stress response (Martiny et al., 2006; Hsieh and Wanner, 2010; Zeng and Chisholm, 2012), was elevated at BATS and MedDCM relative to HOT (Tables 5, 6, 7, Figure 2, green squares). Notably, phosphate concentrations in North Atlantic surface waters, like those in the Mediterranean Sea, are in the nanomolar range and at least an order of magnitude lower than surface levels in the North Pacific (Wu et al., 2000; Moutin and Raimbault, 2002). In fact, at BATS, phage pstS occurs at signature core gene frequencies (that is, it is likely present in all cyanomyoviruses), and it is nearly so at MedDCM, indicating that it has been incorporated into the genomes of essentially all cyanomyoviruses in these environments. The Prochlorococcus phoA gene is also overrepresented at BATS vs HOT, while pstS, a core gene in Prochlorococcus genomes, is not (Coleman and Chisholm, 2010), indicating that phage pstS is selected for independently of its abundance in host genomes. The higher frequency of these phosphate-acquisition-related phage genes at BATS and MedDCM relative to HOT suggests that cyanomyovirus populations retain genes that facilitate host functions under the selective pressure of phosphate limitation.

There are also interfaces between host phosphate acquisition and viral genomes in eukaryotic systems—for example, the PHO4 phosphate transporter superfamily (Pfam ID: PF01384) has been found in eukaryotic viruses (Monier et al., 2012). Although this gene is not yet found in Prochlorococcus and is only in one Synechococcus (Synechococcus WH5701, protein ID: WH5701_07531), a single metagenomic read containing both pho4 and a cyanomyovirus gene was observed, suggesting that cyanophage could also carry this gene (Monier et al., 2012).

Explorations of the phylogeny of shared phage/host genes have suggested that cyanophage acquired pstS from host cells (Martiny et al., 2009; Ignacio-Espinoza and Sullivan, 2012); however, not all shared phage/host genes have a phylogeny consistent with host origins (Ignacio-Espinoza and Sullivan, 2012). As more and longer environmentally isolated sequences for these shared genes become available, we will be better able to define the flow of genes between phage, host and possibly other microbes in marine environments.

The metagenomic patterns observed here reflect the link between phosphate-acquisition genes in phage and the regulation of phosphate-acquisition genes in the host by phosphate availability (Zeng and Chisholm, 2012; and see below). Phosphate availability controls expression of host pstS and alkaline phosphatase genes in both marine Synechococcus and Prochlorococcus (Scanlan et al., 1993; Martiny et al., 2006; Tetu et al., 2009) through the PhoB/PhoR (PhoBR) two-component regulatory system (Hsieh and Wanner, 2010) that is widespread in bacteria, including Prochlorococcus and Synechococcus (Kettler et al., 2007; Scanlan et al., 2009; Tetu et al., 2009). Genes regulated by PhoBR have conserved sites (pho boxes) immediately upstream of their promoters to which the transcriptional activator PhoB binds (Lamarche et al., 2008). The presence of pho boxes in cyanomyovirus genomes (Sullivan et al., 2010) and recent evidence that they are involved in sensing and responding to host phosphate-starvation status during infection in one phage/host pair (Zeng and Chisholm, 2012) led us to explore this motif more deeply.

Pho box motifs in cultured cyanomyovirus genomes

To improve on analyses in our previous work (Sullivan et al., 2010)—while recognizing that computational predictions ultimately require experimental confirmation—we used a position weight matrix based on predicted Prochlorococcus and Synechococcus pho box motifs (Su et al., 2007), tailoring our search to capture host-like pho boxes. In the 28 genomes we found 186 genes from 112 orthologous gene clusters with intergenic upstream pho boxes within 100 bp of the gene’s start site (Supplementary Table S2).

Pho boxes upstream of phage pstS/PhCOG173. As reported in Sullivan et al. (2010) and Zeng and Chisholm (2012), pho boxes near pstS are often accompanied by a gene between the pho box and pstS, referred to as DUF680 in the former and PhCOG173 in the latter. Phage lacking PhCOG173 upstream of pstS have pho boxes directly upstream of pstS. In 11 out of 16 phages containing pstS/PhCOG173, pho boxes were found less than 100 bp upstream of these genes (Figure 3a) and slightly further (121 bp) in a twelfth phage (S-SSM7) (Supplementary Table S3). In three phages (P-SSM3, P-SSM2 and P-SSM7), there were multiple tandem pho boxes upstream of these genes. The phage PhCOG173 gene family is conserved (see below), and its expression is upregulated in cyanomyoviruses infecting host cells that are P-stressed (Zeng and Chisholm, 2012). Notably, PhCOG173 has no detectable orthologs in host genomes and pho boxes are found directly upstream of it in eight cyanomyovirus genomes. Therefore, we postulate that the positioning of pho boxes in front of numerous copies of PhCOG173 is a result of selection rather than chance and that this gene may have a role in either phosphate acquisition or in a more general phosphate-stress response.

Figure 3

Predicted pho boxes immediately upstream of (a) PhCOG173 and/or pstS and (b) the hli03 gene cluster in cyanomyovirus genomes. Phage genome names and the genomic indices of the displayed region are indicated. Putative pho box motifs are shown as purple arrows. The genomic region in (a) is larger than the region in (b) and the pho box motif and genes are scaled in size accordingly. Red stars indicate that the host strain on which the phage was isolated contained the PhoBR two-component phosphate sensing system; white triangles indicate that the host genome is not currently available. The PhCOG173 (cyan), pstS (orange), phoA (blue), hli03 (dark green) and other hli genes (light blue) are highlighted with specific colors; all other genes are shown in light green.

Although not all Prochlorococcus contain the PhoBR system (Kettler et al., 2007), those hosts with sequenced genomes on which cyanomyoviruses containing pho boxes were isolated do contain PhoBR (Figure 3a, Supplementary Materials and Methods). Notably, phage Syn19, Syn2, S-SSM5 and P-RSM1, isolated on PhoBR-containing Synechococcus hosts WH8102, WH8012 and WH8109 and Prochlorococcus host MIT9303, respectively, do not have identifiable pho boxes directly upstream of PhCOG173. They do, however, have pho boxes elsewhere in this genomic region: Syn19 has a pho box upstream of the hypothetical protein Syn19_155, three genes upstream of PhCOG173/pstS, and its ortholog in Syn2, CPTG_00065, also has an upstream pho box. S-SSM5 and P-RSM1 contain pho boxes 142 and 135 bp upstream of the heat-shock protein Hsp20, respectively, which lies immediately upstream of PhCOG173 (Supplementary Table S3). It is therefore possible that additional genes in this region are responsive to regulatory signals from the host PhoBR system.

Pho boxes upstream of phage hli genes. There are 46 hli03 genes in the cyanomyovirus genomes; 18 genomes have multiple copies and 10 have a single copy. The hli03 genes are closely spaced in genomes with multiple copies and frequently found with other hli gene family members. In 13 out of 14 cases, there is a pho box upstream of the first hli03 copy in the genome (Figure 3b), raising the intriguing possibility that the host PhoBR system might also regulate the expression of phage hli03. PhoB can regulate non-phosphate-related genes in bacteria, such as virulence genes in Vibrio cholerae (Pratt et al., 2010), antibiotic-regulating genes in Streptomyces (Santos-Beneit et al., 2011) and acid-stress genes in Escherichia coli (Suziedeliene et al., 1999). Although there is no direct evidence that PhoBR regulates other genes in cyanophage hosts, some predicted pho boxes in marine Synechococcus (Su et al., 2007) are upstream of hli genes. There is no such evidence for Prochlorococcus thus far.

Hli genes are similar in sequence to chlorophyll a/b-binding proteins that are often upregulated under changes in light intensity in cyanobacteria (Dolganov et al., 1995; Funk and Vermaas, 1999; Bhaya et al., 2002; Steglich et al., 2006). There are numerous hlis in Prochlorococcus genomes (Coleman and Chisholm, 2007). Although their location and binding partners in the cell remain unclear (Storm et al., 2008; Muramatsu and Hihara, 2012), hlis display different expression patterns over the diel cycle (Zinser et al., 2009) and generally fall into two categories (Bhaya et al., 2002): (1) single copy core hlis and (2) multi-copy non-core hlis. Multi-copy hlis have orthologs, such as hli03, in phage (Lindell et al., 2004). Genes in this category are often found in hyper-variable regions in host genomes and are upregulated in response to changes in light (Steglich et al., 2006), iron (Thompson et al., 2011a) and nitrogen (Tolonen et al., 2006) in host cells, as well as stress imposed by phage infection (Lindell et al., 2004, 2007). In the case of nitrogen, binding sites for the global nitrogen regulator NtcA were found upstream of hlis with differential transcription under changing nitrogen conditions (Tolonen et al., 2006). Interestingly, hlis do not appear to be upregulated in response to phosphate stress in Prochlorococcus (Martiny et al., 2006), although in Synechococcus sp. WH8102 a possible hli (SYNW2180) was upregulated in a PtrA protein transcriptional response gene mutant during phosphate stress relative to the wild-type strain (Ostrowski et al., 2010). This hli has no homologs in phage.

PhCOG173, a conserved, cyanophage-specific gene neighboring multiple shared phage-host genes

PhCOG173 is found in all 28 cyanomyoviruses (Figure 4, genes with dark gray bars) and is multi-copy in 12 genomes. Eight of these have one copy of the gene upstream of pstS and another upstream of glutaredoxin (called nrdC in phage genomes and grxC in host genomes), a single copy core gene in Prochlorococcus and Synechococcus. Glutaredoxin is found in all 28 cyanomyovirus genomes and is multi-copy in 10 genomes. Glutaredoxins help regulate cellular redox state (Lillig et al., 2008), suggesting that PhCOG173 is not only involved in influencing phosphate acquisition in host cells but may also alter cellular redox state. Phage may use an altered redox state to direct host metabolism toward nucleotide production (Thompson et al., 2011b). Alternatively, phage glutaredoxin could manipulate stress responses in host cells brought on by changes in redox state.

Figure 4

Phylogeny of PhCOG173, a conserved phage gene cluster adjacent to shared phage/host genes. The PhCOG173 cluster, present in both cyanopodovirus and cyanomyovirus genomes (light gray and dark gray bars, respectively) but not host genomes, is found upstream of numerous shared phage/host genes, and phylogenetic groups are associated with different downstream host genes (colored bars). White bars indicate that the downstream gene is not shared with any sequenced host genome. Genes in bold indicate genomes where two copies of PhCOG173 are located next to each other, that is, in the P-SSM2 genome, PhCOG173 gene PSSM2_246 is immediately upstream of the PhCOG173 gene PSSM2_247. The tree is rooted with cyanopodovirus gene PROG_00012. Gray circles indicate >0.8 branch support. The scale bar represents 0.1 substitutions per site.

Among sequenced podovirus isolates, six also contain the PhCOG173 gene. The association between PhCOG173 and shared phage/host genes extends to five cyanopodoviruses (Figure 4, genes with light gray bars; Labrie et al., 2013). In four out of these five instances, the gene was found upstream of a shared phage/host gene of unknown function (PhCOG73321), and in one instance, it was upstream of the photosystem gene psbD.

PhCOG173 proteins form phylogenetic groups that are linked to their downstream gene (for example, glutaredoxin and pstS) when that gene is host-like, suggesting differing functional roles related to that gene (Figure 4). In genomes where two copies of PhCOG173 are located next to each other, the genes cluster separately phylogenetically (see PSSM2_246 and PSSM2_247 and SSM2_217 and SSM2_218, set in bold in Figure 4), suggesting that they did not arise from a recent gene duplication and supporting the possibility of differing functional roles.

Thus, the cyanophage-specific PhCOG173 gene is associated with multiple shared phage/host genes with very different functions related to cellular stressors and metabolism, such as phosphate acquisition, light harvesting and cellular redox state. Its conservation across multiple phage morphotypes highlights the importance of this functionally uncharacterized gene and strongly suggests that phage utilize genes not observed in host genomes to affect host metabolic processes.

Differential abundance in metagenomic databases of shared phage/host genes related to photorespiration, photosynthesis and the PPP

Although the phosphate-acquisition-related genes and their associated regulatory features were a strong emergent signal from this data set, there are other phage/host-shared genes differentially retained by phage in environmental comparisons (Figure 2; Tables 5, 6, 7) presumably reflecting selection by as yet unidentified environmental factors. We mention a few intriguing genes here.

The phage gene encoding the glycine cleavage system P-protein (gcvP, PhCOG2105), a large gene (>900 aa) that is core in Prochlorococcus and Synechococcus genomes (CyCOG4223), was overrepresented in phage at HOT relative to both BATS and MedDCM and was overrepresented at MedDCM in comparison to BATS, where it is almost completely absent (Tables 5, 6, 7; Figure 2, blue squares). This gene is part of a photorespiratory pathway in cyanobacteria and is involved in the reversible interconversion of serine and glycine (Hasse et al., 2007; Eisenhut et al., 2008; Muramatsu and Hihara, 2012).

In some cases, we observed habitat-specific overabundance of neighborhoods containing multiple gene sets. For example, the photosystem-associated phage gene psbA (PhCOG71555) is overrepresented at HOT in comparison to MedDCM. Two neighboring genes, a small, hypothetical cyanophage gene (PhCOG71750) and a shared phage/host heme oxygenase (Ho1, PhCOG71159), were also overrepresented at HOT in comparison to MedDCM (Figure 2, yellow squares). Heme oxygenase is transcribed during infection of Prochlorococcus strain NATL1A (Dammeyer et al., 2008), and its expression is upregulated under iron starvation in some cyanobacteria (Cornejo et al., 1998) but not in Prochlorococcus (Thompson et al., 2011a). Heme oxygenase overabundance at HOT could be related to relatively low iron availability in the Pacific, known to limit Prochlorococcus growth (Mann and Chisholm, 2000).

In a second example, phage glucose-6-phosphate dehydrogenase (zwf, PhCOG969) and phosphogluconate dehydrogenase (gnd, PhCOG964), core PPP genes in host genomes, were overrepresented at MedDCM in comparison to both HOT and BATS (Figure 2, orange squares); an additional shared phage/host Calvin cycle regulatory gene, CP12 (PhCOG71523), was found at signature core gene frequencies at MedDCM and HOT. The gnd/zwf region is variable in cyanophage genomes (Millard et al., 2009), and our previous work indicates that some phage are designed to redirect host metabolism away from carbon fixation and towards nucleotide synthesis via the PPP (Thompson et al., 2011b). Why this would be more necessary in one environment than another remains unknown.

Other genes from this region are also overrepresented in the MedDCM sample, including the shared phage/host plastocyanin gene petE, part of the electron transport chain, and two small, functionally unannotated phage-specific genes, PhCOG71460 and PhCOG1139. The unannotated phage genes may have roles in the PPP; alternatively, they may be phage genes selected to flank host-like genes for an unknown purpose.

Conclusions

We demonstrate that environment-specific selection pressures can dictate the frequency of occurrence of some shared phage/host genes in wild cyanophage, highlighting gene flow between cyanobacterial and cyanophage genomes in the marine environment. Notably, the core status of a gene in host genomes (such as the PPP genes discussed above and pstS) does not necessarily reflect its abundance in phage. Furthermore, regulatory motifs for shared phage/host genes are not always acquired with the host gene but appear to be selected for independently in phage genomes as demonstrated by the presence of motifs associated with host phosphate sensing found upstream of the phage-specific gene PhCOG173.

The ecological origins of the considerably greater numbers of differentially abundant genes in the comparison between the HOT and MedDCM sites are not clear. We speculate that as additional metagenomic data sets and associated metadata for environmental samples become available, we will be able to tease apart in more detail the environmental drivers of differences in phage populations between environments.

The ability to identify core-like genes in environmental samples, independent of the prevalence of those genes in sequenced genomes, provides a means to derive an environmentally relevant core genome for these genetically diverse organisms. Finally, our work illustrates the power of metagenomics-based approaches for revealing some of the interplay between phage and host genomes in marine environments, and we anticipate the analyses described here will also be relevant to elucidating the genetic and metabolic ties between phage and host in other systems.