Introduction

Cyanobacteria of the Synechococcus and Prochlorococcus genera are important contributors to photosynthetic productivity in the open oceans (Li et al., 1993; Liu et al., 1997; Partensky et al., 1999). Cyanophages, which are viruses infecting cyanobacteria, belong to three morphologically defined families: Podoviridae, Siphoviridae and Myoviridae (Suttle and Chan, 1993, 1994; Waterbury and Valois, 1993; Wilson et al., 1993; Sullivan et al., 2003). Among the cyanophages, podoviruses and siphoviruses tend to be host-specific, whereas myoviruses have a broader host range, even across genera (Sullivan et al., 2003). Overall, myoviruses predominate over other phage groups in different oceanic regions (Angly et al., 2006; DeLong et al., 2006).

Phages rely on host cellular mechanisms to translate their proteins and reproduce. In order to take advantage of the host tRNA pool, it is expected that phage genes would be adapted to optimal bacterial codons. On the basis of this assumption, earlier studies have suggested that optimization of codon usage (CU) is a major force in phage–host co-evolution (Krakauer and Jansen, 2002). Further, in a comprehensive bioinformatics study, it has been shown that generally the CU of bacteriophages is strongly adapted to their specific host and differs from the CU of other bacterial hosts (Bahir et al., 2009). Recently it has been reported that cyanophages, specifically myoviruses, carry up to 33 bona fide tRNA genes in their genomes (Sullivan et al., 2010; Dreher et al., 2011).

Bailly-Bechet et al. (2007) proposed that tRNAs carried in different phage genomes correspond to codons that are used highly by the phage genes but are rare in the host genome. In a recent study that tested CU adaptation in over 100 bacteriophages infecting 10 different bacterial hosts, it was shown that bacteriophage genomes are under codon-selective pressure imposed by the translational biases of their respective hosts (Carbone, 2008). On the other hand, Weigele et al. (2007) proposed that tRNA genes carried in viral genomes boost the expression of late phage genes encoding structural proteins. Interestingly, for HIV-1, which does not carry tRNA genes in its genome, it was recently shown that the tRNAs decoding codons that are highly used by the virus but avoided by its host are overrepresented in its virions (van Weringh et al., 2011).

An intriguing conjecture is that phages carry tRNA genes to enable cross-infectivity of hosts with different G+C contents. In order to study this question, we used the unique cyanobacterial–cyanophage system where myophages (35–40% G+C) carry up to 33 different tRNA genes in their genomes (Sullivan et al., 2010; Dreher et al., 2011) and can, in some cases, cross-infect different hosts with very different %G+C contents (Prochlorococcus with 30–40% G+C and marine Synechococcus with 50–60% G+C) (Sullivan et al., 2003).

Materials and methods

Data extraction and annotation

Annotated genomes were downloaded from the CAMERA website and the NCBI genome database. Information regarding cyanomyophages and their hosts was obtained from Sullivan et al. (2003) and G Sabehi and D Lindell (personal communication). Only phage–host pairs for which both full genomic sequences were available and annotated were included in this work. A detailed list of all pairs of phages and hosts studied is given in Supplementary Table S1. tRNA genes were annotated using the tRNAscan-SE server (Lowe and Eddy, 1997) with parameters set to default.

CU and nucleotide usage (NU) profiles

CU and NU profiles were calculated for each organism using all protein-coding genes in the genome. The CU for each codon was calculated as the frequency of the given codon per 1000 codons. Euclidean distances (EDs) between the CU and NU profiles of all genomes studied were calculated as defined in Equation (1). ED values were normalized from 0 to 1, representing the highest to the lowest similarity, respectively. Hierarchical clustering was performed using MeV software (Saeed et al., 2006).

$$\mathrm{ED} = \sqrt{\sum_{i} (p_i - h_i)^2} \qquad (1)$$

where p_i is the frequency of codon i in the phage genome and h_i the frequency of codon i in the host genome.
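The profile and distance calculation described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, coding sequences are assumed to be in-frame DNA strings, and min-max scaling is an assumed way of normalizing the distances to the 0 to 1 range.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 64 codons in a fixed order so that profiles are comparable vectors.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_usage(genes):
    """Return codon frequencies per 1000 codons over in-frame coding sequences."""
    counts = Counter()
    for seq in genes:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid codons found")
    return [1000.0 * counts[c] / total for c in CODONS]


def euclidean_distance(p, h):
    """Euclidean distance between two usage profiles of equal length."""
    return sqrt(sum((pi - hi) ** 2 for pi, hi in zip(p, h)))


def normalize(distances):
    """Min-max scale distances to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]
```

The same `euclidean_distance` function applies to NU profiles if a four-element nucleotide frequency vector is passed instead of a 64-element codon vector.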

Codon Adaptation Index (CAI) values were calculated for each bacterial gene using the CAIcal server (Puigbo et al., 2008). CAI values for the genes in each genome were calculated relative to the CU table representing the codon frequencies of the genes encoding ribosomal proteins or proteins with related ribosomal functions in the corresponding genome.
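For orientation, the CAI of Sharp and Li (1987) can be sketched as the geometric mean of per-codon relative adaptiveness values derived from a reference gene set. The code below is an illustrative sketch and does not reproduce the CAIcal server: the function names are hypothetical, the standard genetic code is assumed, and the 0.5 pseudocount for reference codons that are never observed is a common convention adopted here as an assumption.

```python
from math import exp, log

# Standard genetic code built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def split_codons(seq):
    """Split an in-frame DNA sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon relative to the most used synonymous codon in the reference set."""
    counts = {c: 0 for c in CODON_TABLE}
    for seq in reference_genes:
        for codon in split_codons(seq):
            if codon in counts:
                counts[codon] += 1
    weights = {}
    for codon, aa in CODON_TABLE.items():
        best = max(counts[c] for c in CODON_TABLE if CODON_TABLE[c] == aa)
        if best == 0:
            continue  # amino acid absent from the reference set; codon is ignored
        weights[codon] = max(counts[codon], 0.5) / best  # 0.5 pseudocount (assumption)
    return weights


def cai(gene, weights):
    """Geometric mean of codon weights, skipping Met, Trp and stop codons."""
    logs = [log(weights[c]) for c in split_codons(gene)
            if c in weights and CODON_TABLE[c] not in "MW*"]
    return exp(sum(logs) / len(logs)) if logs else float("nan")
```

In the setting described above, `reference_genes` would hold the ribosomal protein-encoding genes of a genome and `cai` would be evaluated for every gene of that genome.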

Relationship between phage genes and the host genome

The CU profiles were calculated for each phage gene, as described above. Each phage gene was characterized by a vector representing the codon frequencies observed for that gene. Cosine Similarity Distance and tRCI (tRNA Relative Contribution Index) were further calculated to compare the CU profile of each gene with its host, as described below:

Cosine similarity distance

Cosine similarity distance is calculated using the formula

where p_i is the frequency of codon i in the phage gene and h_i the frequency of codon i in the host genome.
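A minimal sketch of this comparison is given below. It assumes that the distance is taken as one minus the cosine similarity of the two codon frequency vectors, which is one common reading of the term; the function names and the dictionary layout of `gene_profiles` are hypothetical.

```python
from math import sqrt


def cosine_similarity(p, h):
    """Cosine of the angle between two codon frequency vectors."""
    dot = sum(pi * hi for pi, hi in zip(p, h))
    norm = sqrt(sum(pi * pi for pi in p)) * sqrt(sum(hi * hi for hi in h))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def cosine_distance(p, h):
    """Assumed definition: 1 - cosine similarity (0 means identical direction)."""
    return 1.0 - cosine_similarity(p, h)


def rank_genes_by_distance(gene_profiles, host_profile):
    """Sort gene names from the most to the least distant CU relative to the host."""
    return sorted(gene_profiles,
                  key=lambda name: cosine_distance(gene_profiles[name], host_profile),
                  reverse=True)
```

Because the cosine measure depends only on the direction of the vectors, it is insensitive to gene length, which makes it suitable for comparing a single gene with a whole genome.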

tRCI calculation

The tRCI was defined to estimate the potential gain in translational efficiency per phage gene when including phage tRNAs in the total tRNA pool. The tRCI is based on the comparison between the frequencies of the tRNA complementary codons (tCCs) matching the phage tRNAs in the viral gene and the frequencies of the same tCCs in the host genome (see Equation (3)). Notably, the tRCI is calculated uniquely for each gene relative to a specific host (namely, the same gene in a given genome might have different tRCI values when infecting different hosts).

where f_tCC is the frequency of the tCC.

Ranking the phage genes based on codon adaptation

Ranking the phage genes according to the cosine distance between the CU profiles

For each host–phage interaction pair, the phage genes were sorted based on the cosine distance between the vector representing the CU profile of the gene and the CU profile of the specific host. The analysis was repeated for all host–phage pairs.

Ranking the phage genes according to tRCI values

Following the calculation of tRCI of phage genes for each phage–host interaction, all genes in the phage genomes were sorted and ranked according to the tRCI calculated for a specific host.

Detecting gene enrichment among the sorted phage genes

Each sorted list (either sorted according to cosine distance or tRCI values) was divided into 10 subgroups (bins) with an equal number of genes per bin. Bins were sorted according to the rank of the genes, that is, bin #1 had the highest ranked genes and bin #10 the lowest ranked genes. The content of each bin was analyzed to detect functional enrichment using the hypergeometric distribution probability (Equation (4)): p(x)>X defining enrichment of a specific gene function within a given bin, and p(x)<X representing under-representation of a gene function in a given bin. All analyses were conducted for all phage–host pairs, including (1) pairs involving Synechococcus hosts and (2) pairs involving Prochlorococcus hosts (see Supplementary Table S1).

where m is the total number of genes tested in the genomes studied, N is the number of all genes within the genomes studied, K is the number of all genes in the bin and x is the number of genes with a specific function included in the bin.
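The binning and enrichment test can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, m is read as the number of genes with the function of interest among all N genes (an interpretation of the definitions above), and enrichment and under-representation are taken as the upper and lower tails of the hypergeometric distribution, respectively.

```python
from math import comb


def split_into_bins(ranked_genes, n_bins=10):
    """Split a ranked list into n_bins consecutive bins of (near-)equal size."""
    n = len(ranked_genes)
    bounds = [i * n // n_bins for i in range(n_bins + 1)]
    return [ranked_genes[bounds[i]:bounds[i + 1]] for i in range(n_bins)]


def hypergeom_pmf(x, N, m, K):
    """P(X = x): x genes with the function in a bin of K genes,
    drawn from N genes of which m carry the function."""
    return comb(m, x) * comb(N - m, K - x) / comb(N, K)


def enrichment_p(x, N, m, K):
    """Upper tail P(X >= x): probability of seeing at least x such genes in the bin."""
    return sum(hypergeom_pmf(i, N, m, K) for i in range(x, min(m, K) + 1))


def depletion_p(x, N, m, K):
    """Lower tail P(X <= x): probability of seeing at most x such genes in the bin."""
    return sum(hypergeom_pmf(i, N, m, K) for i in range(max(0, K - (N - m)), x + 1))
```

For example, with N = 10 genes, m = 5 hypothetical genes and a bin of K = 5 genes, finding all five hypothetical genes in the bin has probability 1/252 under the null model.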

Results and discussion

CU as an indication of the tRNA repertoire in cyanobacteria

One of the most important translational optimization mechanisms is the correlation between the cellular levels of tRNA molecules and the frequencies of their corresponding codons, resulting in higher translation rates (Ikemura, 1981, 1985; Varenne et al., 1984) and accuracy (Akashi, 1994). This correlation is higher in fast-growing bacteria (Rocha, 2004) and in highly expressed genes (Ikemura, 1981; Ghaemmaghami et al., 2003; Goetz and Fuglsang, 2005). In order to evaluate the adaptation of a gene to the cell tRNA pool, several indices have been proposed. One of the most frequently used indices is the CAI, which calculates the codon bias of a gene relative to the bias of a set of highly expressed genes (Sharp and Li, 1987). In most organisms, the set of highly expressed genes is composed mostly of ribosomal protein-encoding genes. However, it is difficult to determine accurately an optimal gene set representing the codons that are favored in the selective process. Therefore, for each bacterium in our study, we examined the correlation between the CU of the entire bacterial genome and the CU based solely on its ribosomal protein-encoding genes. Overall, we found that the two CU profiles were highly correlated, with an average correlation coefficient of 0.85 (±0.066) (median=0.875). We also calculated the CAI of all bacterial genes for each host, and found that the bacterial genomes included in our study showed a relatively uniform distribution of CAI (see Supplementary Figure S1). On the basis of these results, we decided to use the CU of the entire genome as a representative of the bacterial tRNA repertoire.

CU profiles of cyanomyophages differ from their Synechococcus hosts

Previous studies have shown that the CU of bacteriophages is strongly adapted to their specific host but differs from the CU of other bacterial hosts (Bahir et al., 2009). On the basis of these studies, our initial assumption was that cyanomyophages share a similar CU with the host they were isolated from (and with closely related hosts), as they both use the host tRNA pool. To examine this, we calculated the distances between the CU profiles (Figure 1a) and between the NU profiles (Figure 1b) of bacteria and phages using the ED metric. As shown in Figure 1, the EDs calculated based on the CU and NU profiles showed very similar results. Notably, we identified clusters of organisms sharing similar CU: Synechococcus hosts comprise the most distinct cluster, phages tend to cluster together and the Prochlorococcus hosts cluster into two groups, one with a G+C content of 50% and the other with a low G+C content of 35–40%. Surprisingly, all phages, including those isolated from a Synechococcus host, showed higher similarity (that is, closer distance) to Prochlorococcus than to Synechococcus, independent of their hosts.

Figure 1

A clustered distance matrix based on CU (a) and NU (b) profiles. Colors range from blue, indicating close distances, to red, representing great distances. Color bar is given. Phage and host names are given: red labels indicate Prochlorococcus, blue labels indicate Synechococcus and black labels indicate the phages.

Overall, the similar distance pattern between the different phages and hosts obtained when calculating either the CU or the NU implied that the CU profiles correlate with the nucleotide composition of the entire genomes. To further confirm this relationship, we plotted the normalized NU and CU distances of all phage–host pairs studied, each point representing the NU and CU distances of a single pair. Consistent with the ED analyses, we found a significant linear correlation between the CU- and NU-based distances (Supplementary Figure S2) with R2 of 0.97, further confirming the close relationship between these two properties in cyanobacteria and their phages.

Codons complementary to phage tRNAs are used at high frequencies in phage genomes

From the above results, we may infer that the low G+C content of cyanomyophage genomes shapes the codon selection pattern of the phages. We further examined whether the anticodons of the tRNAs carried by the different phages are preferentially AT-rich, thereby enabling Watson–Crick pairing with the AT-rich codons of the phages. Indeed, we found that, overall, the anticodons of phage tRNAs are AT-rich, with an average G+C content of 31.5% (±5.2%) for all phages included in this study compared with 54.45% (±2.03%) in cyanobacterial anticodons. We also found that in 10 of the 12 viral tRNA genes, the anticodon can form Watson–Crick pairing with the lowest %GC codon coding for the given amino acid (AA) (Figure 2).
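The relation between an anticodon and its fully paired codon, and the G+C content of a set of anticodons, can be illustrated with a short sketch. The function names are hypothetical; anticodons are assumed to be written 5' to 3', only strict Watson–Crick pairing is considered (wobble pairing is ignored) and codons are returned in the DNA alphabet.

```python
# Complement table for the DNA alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_codon(anticodon):
    """Codon (5'->3') that pairs fully with an anticodon given 5'->3'."""
    dna = anticodon.upper().replace("U", "T")
    return dna.translate(COMPLEMENT)[::-1]  # reverse complement


def gc_content(seqs):
    """Percentage of G and C over all bases in a collection of sequences."""
    joined = "".join(seqs).upper()
    if not joined:
        raise ValueError("no sequence data")
    return 100.0 * sum(joined.count(b) for b in "GC") / len(joined)
```

For instance, the methionine anticodon CAU pairs with the codon AUG, so `complementary_codon("CAU")` returns `"ATG"`.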

Figure 2

Distribution of phage tRNAs within cyanomyophage genomes. Genetic table presents the distribution of tRNA isoacceptors within the phage genomes. Phage tRNAs are denoted by their optimal complementary codon. Colors correspond to the number of appearances of each tRNA in the genomes studied.

Further, we examined which codons are used at similar frequencies in the phage and host genomes, and which codons are preferred in one genome and avoided in the other. To answer this question, the codon frequencies of each phage genome were plotted against the codon frequencies of its host. This procedure was repeated for phages infecting multiple hosts, each time plotting the phage codon frequencies against the codon frequencies of one host. Overall, we found that for phages infecting a Synechococcus host, the plots are highly scattered (for example, Figures 3c and d), meaning that the phage and host genomes tend to use different codons. For phages infecting Prochlorococcus hosts, the degree of scattering varied, depending on the difference in G+C content between the two genomes (for example, Figures 3a and b). Notably, the frequencies of the codons that pair fully with the phage tRNAs (red triangles in Figure 3) are preferred in phages but are hardly used by the Synechococcus hosts. In Prochlorococcus hosts (MIT9303 and MIT9313) having a G+C content of 50%, we observed a similar, although slightly smaller, codon bias. In pairs associated with a Prochlorococcus host having a lower G+C content, the phage tRNAs optimal codons were used at similar frequencies in both the phage and the host genomes (Figure 3a).

Figure 3

Phage codon frequencies compared with the codon frequencies of their hosts for four phage–host interactions: (a) Syn19 compared with NATL2A, a low %GC Prochlorococcus host; (b) Syn19 compared with MIT9303, a relatively high %GC Prochlorococcus host; (c) Syn19 compared with WH7803, a Synechococcus host; and (d) Syn19 compared with WH8102, a Synechococcus host. Each triangle stands for a specific codon; red triangles correspond to tCC. The regions between the lower and the upper quartile representing the distances of the triangle to the diagonal are shaded in gray.

AAs decoded by phage tRNAs are not preferred in phage genomes

On the basis of our initial results, we hypothesized that the presence of AT-rich tRNA genes in cyanomyophage genomes has an adaptive role by enhancing the translation of phage genes when infecting hosts with high G+C content. However, the presence of a specific set of tRNA genes in phages could also be interpreted as a selective pressure at the protein level, that is, when AAs decoded by phage tRNAs appear at higher frequencies within phage proteomes relative to their host proteome. To test which of the two conjectures could explain the existence of phage tRNAs, we compared the phage tCC preference (in phage and host genomes) with the preference for the AAs they code for. The frequency of each tCC within the phage genome was divided by the frequency of the tCC in the host genome. Subsequently, the tCC ratio was divided by the corresponding AA frequency. Our results clearly show that whereas phage tCCs are generally used at higher frequencies in phage genomes, the AAs they code for are not more commonly used in these phages. An example of the comparison between the phage tCC preference (in phage and host genomes) and the preference for the AAs they code for is shown for cyanophage Syn19 in Supplementary Figure S3. These results argue against the hypothesis that phage tRNA presence may be explained by the AA bias in the phage proteomes.
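One way to sketch this comparison is shown below. The exact normalization used in the study is not spelled out, so this is an assumed reading in which the phage-to-host ratio of the tCC frequency is divided by the phage-to-host ratio of the corresponding AA frequency; the function names and the codon-count dictionaries are hypothetical.

```python
def frequency(counts, codons):
    """Fraction of all counted codons that belong to the given codon set."""
    total = sum(counts.values())
    return sum(counts.get(c, 0) for c in codons) / total


def tcc_preference(tcc, synonymous, phage_counts, host_counts):
    """Return (codon ratio, AA ratio, codon ratio normalized by AA ratio).

    tcc        -- codon complementary to a phage tRNA
    synonymous -- all codons encoding the same amino acid as tcc
    Raises ZeroDivisionError if the host never uses the codon or amino acid.
    """
    codon_ratio = frequency(phage_counts, [tcc]) / frequency(host_counts, [tcc])
    aa_ratio = frequency(phage_counts, synonymous) / frequency(host_counts, synonymous)
    return codon_ratio, aa_ratio, codon_ratio / aa_ratio
```

A normalized value well above 1 indicates that the phage prefers the codon itself rather than the amino acid it encodes, which is the pattern reported above.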

The possible role of viral tRNAs in viral gene translation

The differences observed between the overall CU profiles of phages and their host genomes motivated us to examine the differences at the gene level. On the basis of our working hypothesis that the presence of tRNA genes in phage genomes has an adaptive role enabling the phage to increase its fitness in a given environment, we wished to learn which of the phage genes might benefit from the presence of tRNA genes. This information may shed light on which phage genes are most critical in adapting to a given environment or host. In order to study the differences at the gene level, we examined two parameters: a) the distance between the CU of each phage gene and host genome; and b) the potential contribution of the viral tRNA genes to the translation efficiency of each gene in the phage genome.

Initially, we wished to examine which of the phage genes have a CU similar to that of the host the phage is associated with, thus enabling their efficient translation without relying on the expression of viral tRNAs, and which genes benefit most from phage tRNAs for their efficient translation. To examine this, we calculated the cosine distances between the CU vectors of the individual phage genes and the overall CU of their host genomes, and ranked the phage genes in each genome according to the distance value (see Materials and methods). Notably, the distances calculated per phage gene strongly depend on the CU of the host; thus, a specific phage gene will be ranked differently for different phage–host interactions. Interestingly, when annotating the ranked genes, we noticed a significantly higher proportion of hypothetical genes (genes that have no detectable functionally characterized homolog in the non-redundant database) among the genes that showed the highest discrepancy in CU between the phage and the host than among the genes whose CU was similar to that of the host. Notably, 52% of these hypothetical genes had no apparent homolog in the database (based on an E-value cutoff of 10^-5). Strikingly, we found that in all subgroups of genes containing the 10% most distinct genes in all phage–host pairs, the proportion of hypothetical genes was extremely high (>90%), whereas in the subgroup containing the 10% most similar genes in the sorted lists, the proportion of hypothetical genes was <25% (Figure 4). Overall, as shown in Figure 4, we noticed a clear tendency of hypothetical proteins to be enriched in the bins with higher ranked genes (showing the highest discrepancy between the phage and host CU) compared with the bins with lower ranked genes (showing the highest similarity between the phage and host CU).

Interestingly, this phenomenon was observed when the CU was calculated both for phage–Synechococcus pairs and for phage–Prochlorococcus pairs. To evaluate whether these results are statistically significant, we further tested for each phage–host pair the probability of enrichment or under-representation of hypothetical genes in each bin using the hypergeometric distribution test. As shown in Supplementary Table S2, consistent with the tendency mentioned above, we detected a statistically significant enrichment of hypothetical genes in the bins containing the genes that showed the highest discrepancy between the gene CU and the host CU (ranked highest when sorting the genes according to the cosine distance between the CU vectors). However, in the bins with the lower ranked genes (that is, lowest cosine distances), the hypothetical genes were significantly under-represented. On the basis of these results, we postulate that the genes benefiting most from the presence of tRNA genes carried by the phages may be unique genes that have an important role in adapting the phage to its specific host.

Figure 4

Distribution of genes encoding hypothetical proteins in phage genomes sorted according to the distance between the phage gene and the host CU and grouped into 10 groups (bins). Bins are marked from 1 to 10, corresponding to groups of genes having the highest to the lowest distance from the host CU, respectively. Each bin contains an equal number of genes. The height of the bar denotes the average number of hypothetical genes in each bin calculated from all studied genomes; error bars correspond to s.d.'s. Cyan bars represent the results obtained from phages associated with Synechococcus hosts, whereas red bars illustrate the results obtained from phages associated with Prochlorococcus hosts. Probabilities of the gene enrichment calculated based on the hypergeometric distribution test are detailed in Supplementary Table S2. The colour reproduction of this figure is available at the ISME journal online.

Specific viral genes tend to benefit from phage tRNA genes

To evaluate the contribution of viral tRNAs to the translation efficiency of phage genes, we defined a new term, the tRCI (see Materials and methods). As previously mentioned, the tRCI is calculated relative to the host and is unique for each phage gene, yielding different tRCI values for each gene in cases in which the phage is capable of infecting more than one host (Figure 5). Subsequently, as conducted for the CU differences, we sorted all genes in the phage genome according to their tRCI values and divided them into 10 subgroups (bins) with equal numbers of genes in each subgroup. Further, we calculated the gene enrichment in the top-ranked subgroups (including genes with the highest tRCI values). Among the gene families that were consistently enriched in the top-ranked subgroups in all phage–host interacting pairs, we found genes belonging to the high-light-induced (hli) gene family and the cpeT gene family (Figure 6).

Figure 5

(Left) Heat map representing the tRCI values calculated for all ORFs in the Syn19 genome. Color bar is shown on the left of the heat map. The tRCI values are calculated independently for each of the known Syn19 hosts. Vertical lines represent the tRCI values of each gene, calculated relative to the specific host listed on the right. (Right) Example showing the ranking of the tRCI of the Syn19 gp16 gene (indicated by arrows in the heat map) when calculated for the five different hosts. tRCI values were sorted and ranked for each host from the highest tRCI (100th percentile) to the lowest (1st percentile) and grouped into 10 bins. As demonstrated, the same gene can obtain a different tRCI value, and can therefore be ranked differently and grouped in a different bin, depending on the phage–host interaction.

Figure 6

hli (a) and cpeT (b) enrichment in genomes sorted according to tRCI values and grouped into 10 bins from the highest tRCI values (bin 1) to the lowest (bin 10). P-values for the hypergeometric distribution test are detailed in Supplementary Table S3. The colour reproduction of this figure is available at the ISME journal online.

hli genes

hli genes are widely distributed among cyanophages (Lindell et al., 2005; Mann, 2005) and are apparently responsible for proper replication of the phage in high-light stressed cells while protecting the photosynthetic mechanisms from light damage by dissipating excess light energy (Havaux et al., 2003; Lindell et al., 2005). In our analysis, hli genes were detected in different ranked groups; however, they were significantly more abundant at the top of the lists, where tRCI values are higher (Figure 6 and Supplementary Table S3). Viral hli genes were previously shown to be upregulated immediately upon phage infection (Lindell et al., 2007). Although this was tested only in a system using a podophage, which does not contain any tRNA genes, and in a Prochlorococcus host, on the basis of our results we postulate that hli genes may also be expressed immediately after a myophage infects a Synechococcus host.

cpeT genes

cpeT genes have a regulatory effect on the biosynthesis of the light-harvesting protein pigment phycoerythrin (Cobley et al., 2002; Shen et al., 2006), and are therefore predicted to affect the photosynthetic properties of the infected cyanobacterial cells. We searched for cpeT genes following the procedure described above. We found that 93% of these genes were in the top five bins and 55% in the top two bins with the highest ranked genes, according to the tRCI values. These results suggest that the expression of these genes is enhanced when phage tRNAs are expressed (Figure 6). Furthermore, taken together, the enrichment of hli genes and cpeT genes among the genes that are predicted to benefit the most from the tRNA gene pool of the phage strongly points to the role played by the phage tRNA genes in phage adaptation.

The role of tRNA in cross-infectivity

According to our hypothesis, phage tRNA genes may have a different role in different hosts, increasing the expression efficiency of some genes when infecting one host and of other genes when infecting a different host. This is reflected by the different tRCI values calculated for one phage gene depending on the host the phage infects. As shown in the example given in Figure 5, a single phage gene can be ranked differently based on its tRCI when it is associated with different hosts.

To comprehensively test our hypothesis and further examine whether the same genes are promoted when phages infect different hosts or whether different gene groups are promoted in hosts belonging to different ecotypes, we used the Syn19 phage as a test case. Syn19 is known to infect both Synechococcus and Prochlorococcus hosts (Sullivan et al., 2003). We compared the ranks of each gene of the Syn19 phage when it is associated with different hosts using the Pearson correlation coefficient. As expected, for closely related hosts, the ranks of the phage genes were found to be highly correlated (Pearson coefficient 1), whereas when comparing the ranks of the genes when the phage is associated with more distantly related hosts, very weak correlations were observed (Supplementary Figure S4). Although these results are largely expected based on the similarities and differences in the CU profiles of the different hosts, they are consistent with the notion that the phage tRNAs presumably support increased translation of certain phage genes in each type of host they infect.
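The rank comparison between two hosts can be sketched as follows. The function names and the dictionaries mapping gene names to tRCI values are hypothetical, both dictionaries are assumed to cover the same genes, and ties are broken by sort order rather than by averaged ranks, which is a simplification.

```python
from math import sqrt


def ranks(scores):
    """Rank genes from 1 (highest score) to n (lowest score)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {gene: r for r, gene in enumerate(order, start=1)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_correlation(trci_host_a, trci_host_b):
    """Correlate the tRCI-based ranks of the same phage genes for two hosts."""
    ra, rb = ranks(trci_host_a), ranks(trci_host_b)
    genes = sorted(ra)
    return pearson([ra[g] for g in genes], [rb[g] for g in genes])
```

A coefficient near 1 means that the same genes occupy the top of the list for both hosts, whereas a coefficient near 0 means that the ranking is largely host-specific.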

Overall, cyanomyophage genomes tend to be AT-rich. It has been postulated that the lower G+C content may increase phage fitness as AT-rich sequences require less energy in order to melt the DNA strands, and enable faster replication of the phage genome (Miller et al., 2003). AT-rich sequences also result in a slightly modified structure of the double helix, allowing better access of DNA-binding proteins, such as the complexes involved in replication and transcription (Leslie et al., 1980; Calladine and Drew, 1996; El Hassan and Calladine, 1996). However, the lower G+C content of cyanomyophages compared with some of their hosts prevents the phage genomes from effectively using the tRNA pool of their hosts. We suggest that this bias may explain the presence of tRNA genes within phage genomes. Previous studies suggested that host tRNA genes may be integrated randomly in phage genomes to be lost later (Bailly-Bechet et al., 2007). We postulate that tRNA genes having anticodons with low G+C content, which complement codons that are rare in some host genomes and abundant in phage genomes, have been selected to improve the expression of low GC phage genomes, increasing their fitness and fixation within phage genomes.

Some marine cyanomyophages have a broad host range, with some infecting both low G+C content hosts (Prochlorococcus) and high G+C content hosts (Synechococcus), in contrast to the narrow host range of cyanopodophages (Sullivan et al., 2003). Until now, no tRNA genes have been reported in genomes of cyanopodophages, and all reports are from cyanomyophages (Sullivan et al., 2010). Taken together, from the information available on the presence of tRNAs in marine cyanomyophages, it is evident that cyanomyophages that are known to infect only Prochlorococcus tend to have no or few tRNA genes in their genomes, whereas those infecting both Prochlorococcus and Synechococcus or just Synechococcus have a higher number of tRNA genes (see Figure 7). We therefore suggest that cyanomyophages use the strategy of carrying tRNA genes in order to be able to expand their repertoire of potential hosts while not changing the G+C content of their genomes. This is a different strategy compared with cyanopodophages, which maintain a genome with a G+C content similar to that of their hosts (for example, cyanopodophages P60 and Syn5, which infect Synechococcus and have a G+C content above 53%, compared with P-SSP7 and P-SSP3, which infect Prochlorococcus and have a G+C content below 40%).

Figure 7

Phage tRNA gene copy number plotted against the difference in G+C content between the phage and its most distant host (in terms of NU). Black triangles stand for cyanomyophages (wide host range), gray circles represent cyanopodophages (narrow host range).

The ability of cyanomyophages to infect a broad range of hosts may explain their predominance over other phages in different oceanic regions. Moreover, this was previously suggested to be a driving force in shaping photosynthetic gene diversity by facilitating the exchange of genetic materials between Prochlorococcus and Synechococcus via their shared phages (Zeidner et al., 2005; Sullivan et al., 2006). Therefore, the use of tRNA genes carried by these phages may have a global-scale effect on primary production in surface waters.