Introduction

Viruses are abundant in the ocean and can influence the population dynamics and genetic diversity of their hosts1,2,3. Cyanophages are viruses that infect cyanobacteria, mainly Prochlorococcus and Synechococcus. Many cyanophages have been isolated, and all known marine cyanophages belong to three phage families: Myoviridae, Siphoviridae and Podoviridae4,5,6,7,8,9,10. Recent studies showed that cyanopodoviruses might make up 50% of the cyanophage community in the sea11,12, suggesting that cyanopodoviruses interact actively with cyanobacteria in the marine environment.

Currently, nearly 40 cyanophage genomes have been sequenced, and half of them are cyanomyoviruses. Cyanomyoviruses have relatively large genomes and have acquired many accessory metabolic genes via horizontal gene transfer (HGT), which together constitute a large reservoir of genetic diversity13,14,15,16,17,18,19. Five cyanosiphovirus genome sequences have been reported, with genome sizes ranging from 30,332 to 105,532 bp20,21. Compared with cyanomyoviruses and cyanosiphoviruses, cyanopodoviruses have a relatively conserved genome size, ranging from 42,257 to 47,872 bp11,22,23,24,25.

Genome sequencing has shown that many marine cyanophages encode photosynthesis genes. All isolated cyanomyoviruses and more than half of the isolated cyanopodoviruses carry the key photosystem II reaction centre gene psbA in their genomes11,13,17,18,19,23,26,27,28,29, whereas no psbA gene has been found among the known cyanosiphoviruses20,30. Two recent studies showed that 24 of 39 marine cyanopodovirus isolates contained psbA (ref. 12) and that 8 of 12 sequenced cyanopodovirus genomes encoded psbA (ref. 13). In both studies, the frequency of psbA-containing podoviruses was estimated from isolated cyanophages, so the estimates could be biased by the hosts used for isolation. Is it possible to quantify the presence of psbA in cyanopodoviruses in the ocean using a culture-independent approach? Metagenomic databases are a useful tool; however, the publicly available datasets are limited and may not represent the true community composition.

Results

In this study, we estimated the relative abundance and distribution of psbA-containing podoviruses from metagenomic data. Our approach builds on the conserved genomic structure of cyanopodoviruses. Cyanopodovirus genomes can be divided into three parts: structural genes, nucleotide metabolism related genes and regions of hypothetical genes (Fig. 1)11,18,22,23,24. Both the composition and the arrangement of the structural genes are conserved. One gene cluster, “portal-capsid-tail/fiber”, is present in all cyanopodoviruses, as well as in other T7-like phages31. Interestingly, the psbA gene is commonly located at a fixed position within the conserved gene cluster “portal-psbA-capsid”11. Based on this conserved gene cluster, we searched the GOS scaffold database by BLAST using portal, capsid assembly, psbA and major capsid protein (MCP) genes as queries and retrieved 79 cyanopodoviral scaffolds.
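The cluster-based screening described above can be illustrated with a minimal Python sketch. It assumes that each scaffold has already been annotated as an ordered list of gene labels; the labels (`portal`, `psbA`, `MCP`, and so on), the function name and the result strings are hypothetical and are not taken from the study's pipeline.

```python
def classify_scaffold(gene_order):
    """Classify a scaffold by whether psbA lies between portal and MCP.

    gene_order: gene labels in their order along the scaffold
    (hypothetical labels; either strand orientation is accepted).
    """
    genes = [g.lower() for g in gene_order]
    if "portal" not in genes or "mcp" not in genes:
        # The conserved portal-capsid cluster is not covered by this scaffold.
        return "no_cluster"
    # sorted() lets the same check work for scaffolds read in reverse.
    start, end = sorted((genes.index("portal"), genes.index("mcp")))
    between = genes[start + 1:end]
    return "psbA_in_cluster" if "psba" in between else "psbA_absent_from_cluster"


if __name__ == "__main__":
    print(classify_scaffold(["portal", "capsid_assembly", "psbA", "MCP"]))
    print(classify_scaffold(["portal", "capsid_assembly", "MCP"]))
```

The check looks only at gene order, so a scaffold that ends before the portal or MCP gene is reported as `no_cluster` rather than as psbA-lacking.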

Figure 1

The structure and organization of cyanopodoviruses and some scaffolds or contigs.

Among the 79 cyanopodovirus scaffolds, 70 contain psbA and 9 do not. All MCP sequences longer than 200 aa were used to construct a phylogenetic tree. The MCP-based phylogeny separated cyanopodoviruses into two major clades, Clade A and Clade B (Fig. 2), consistent with the phylogenetic relationships based on the DNA polymerase gene10,12,21,32. Nearly all cyanopodoviruses in Clade B carry the psbA gene, whereas none of those in Clade A do (Fig. 2). A recent study reported the same psbA distribution pattern in cyanopodoviruses12.

Figure 2

The neighbor-joining tree based on the MCP sequences.

Sequences in red are scaffolds or cyanophage genomes without psbA genes. Bootstrap support values (percentages of 1000 replicates) above 50% are shown for the neighbor-joining (NJ), maximum parsimony (MP) and minimum evolution (ME) analyses, in the order NJ/MP/ME. Scale bar, 0.1 nucleotide substitution per site.

In the Bermuda (BATS) database, 58 Clade B MCP homologs were recruited, but no Clade A MCP homologs were found (Fig. 3A). Similarly, we recruited 17 Clade B homologs but no Clade A homologs from the North Pacific (HOT) database (Fig. 3A). In the GOS database, 729 Clade B MCP homologs and 18 Clade A MCP homologs were found (Fig. 3A). Interestingly, 17 of the 18 Clade A reads were recruited from coastal water. It is likely that most Clade A-like sequences are from podoviruses infecting marine Synechococcus10,33,34. In the MarineVirome database, 271 Clade B-like and 4 Clade A-like MCP sequences were detected (Fig. 3A).
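As a worked illustration of these counts, the following Python sketch computes the share of Clade A reads among all recruited MCP reads in each database. The counts are those reported above; the variable and function names are ours. These are whole-database shares, so they are not expected to match the separate open-ocean and coastal proportions shown in Fig. 3B.

```python
# Recruited MCP read counts per database, as reported in the text (Fig. 3A).
counts = {
    "BATS": {"A": 0, "B": 58},
    "HOT": {"A": 0, "B": 17},
    "GOS": {"A": 18, "B": 729},
    "MarineVirome": {"A": 4, "B": 271},
}


def clade_a_share(clade_counts):
    """Percentage of recruited MCP reads that belong to Clade A."""
    total = clade_counts["A"] + clade_counts["B"]
    return 100.0 * clade_counts["A"] / total if total else float("nan")


shares = {db: clade_a_share(c) for db, c in counts.items()}

if __name__ == "__main__":
    for db, share in shares.items():
        print(f"{db}: {share:.2f}% Clade A")
```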

Figure 3

Number and distribution pattern of cyanopodoviral major capsid reads in the databases.

A, Read counts of major capsid sequences corresponding to Clade A and Clade B in four metagenomic databases: BATS, GOS, HOT and MarineVirome. B, Proportion of reads belonging to Clade A in open ocean and coastal water, respectively.

Discussion

Podoviruses in Clade A could be a transitional group between Clade B and other T7-like non-cyanobacterial podoviruses (Fig. 2). Four scaffolds in Clade B do not contain psbA; the gene may have been lost from these scaffolds during evolution. Interestingly, scaffold JCVI_SCAF_109662694693 (in Clade B) contains a high-light-inducible gene (hli) but no psbA.

Our analysis suggests that Clade A podoviruses make up only a very small proportion of cyanopodoviruses in the surface ocean. In the open ocean, Clade A podoviruses account for only 0.27% and 1.12% of all cyanopodoviruses in the GOS and MarineVirome databases, respectively. In coastal surface water, Clade A podoviruses make up 8.02% and 14.29% of total cyanopodoviruses in the GOS and MarineVirome databases, respectively (Fig. 3B). Clade A podoviruses were not detected at the two open ocean stations, BATS and HOT. Clade A consists mainly of psbA-lacking podoviruses that infect marine Synechococcus10,11,12. Our study suggests that carrying the psbA gene may be less important for cyanophages in coastal or estuarine environments than for cyanophages in the open ocean. Sullivan and colleagues also suggested that a shorter latent period could explain the lack of the psbA gene: when infection is brief, the phage may not need the help of psbA23.

Metagenomic recruitment based on the unique portal-capsid structure provides a culture-independent survey of the frequency and distribution of psbA-carrying cyanopodoviruses. However, all the datasets analyzed were derived mainly from the surface ocean. Our analysis suggests that: 1) psbA-carrying cyanopodoviruses are the dominant cyanopodoviruses in the surface ocean; 2) Synechococcus podoviruses become relatively more abundant in coastal water; 3) psbA is more important for oceanic cyanopodoviruses than for their coastal counterparts.

Methods

Metagenomics

Four metagenomic databases were searched for homologs in this study. Three were from the bacterial fraction: the Global Ocean Sampling database (GOS)35, the Bermuda database (BATS)36 and the Hawaii Ocean Time-Series database (HOT)37,38. One was from the viral fraction: the MarineVirome39. All databases were obtained from the CAMERA website (http://camera.calit2.net/index.shtm).

Based on the conserved cyanopodovirus gene cluster “portal-psbA-capsid”, we searched the GOS scaffold database by BLAST using portal, capsid assembly, psbA and major capsid protein (MCP) genes as queries, applying a reciprocal best-hit strategy with no e-value cutoff (Fig. 1)40. Scaffolds were identified as cyanopodoviral by searching their structural genes (portal, MCP or capsid assembly gene) against the NCBI non-redundant protein database.
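The reciprocal best-hit logic can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scripts used in this study: it assumes the BLAST results for both search directions have already been parsed into (query, subject, bit score) tuples, and all identifiers in the example are hypothetical. In line with the text, no e-value filter is applied.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query, subject, bitscore) tuples, e.g. parsed
    from tabular BLAST output (the parsing step is assumed, not shown).
    """
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Return (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    forward = [("ref_MCP", "scaffold_1", 310.0),
               ("ref_MCP", "scaffold_2", 120.0),
               ("ref_portal", "scaffold_2", 200.0)]
    reverse = [("scaffold_1", "ref_MCP", 305.0),
               ("scaffold_2", "other_portal", 150.0),
               ("scaffold_2", "ref_portal", 140.0)]
    print(reciprocal_best_hits(forward, reverse))
```

In this example only the `ref_MCP`/`scaffold_1` pair is kept, because the best reverse hit of `scaffold_2` is a different protein.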

To analyze the frequency and geographic pattern of cyanopodoviruses in the ocean, we recruited reads from the BATS, GOS, HOT and MarineVirome datasets using all MCP sequences from sequenced cyanopodoviral genomes, as published by Labrie et al.11,13. Our approach is similar to the methods described by Zhao et al.40,41. Briefly, all homologous reads were recruited from binning by e-value cutoff to avoid potential bias, and each putative hit was then extracted and used as a query to search against the NCBI non-redundant protein database42. The best hit returned for each metagenomic sequence was used to confirm its classification; all identified reads are listed in Table S1. The numbers of recruited reads were not normalized, because sampling methods differed among the sites and did not target viruses. However, no sampling method should bias the recovery of cyanopodoviruses with or without the psbA gene.
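The confirmation step, in which each recruited read is classified by its best database hit, can be sketched as below. The sketch assumes a precomputed mapping from each read to its best-hit reference and a mapping from reference sequences to clades; both mappings, and every identifier in the example, are hypothetical.

```python
from collections import Counter


def tally_clades(best_hit_by_read, clade_by_reference):
    """Count recruited reads per clade from their best-hit reference.

    Reads whose best hit is not a known cyanopodovirus reference are
    counted as 'unconfirmed' instead of being assigned to a clade.
    """
    tally = Counter()
    for reference in best_hit_by_read.values():
        tally[clade_by_reference.get(reference, "unconfirmed")] += 1
    return dict(tally)


if __name__ == "__main__":
    # Hypothetical read and reference identifiers.
    reads = {"read_1": "ref_B1", "read_2": "ref_B2",
             "read_3": "ref_A1", "read_4": "unrelated_phage"}
    clades = {"ref_B1": "B", "ref_B2": "B", "ref_A1": "A"}
    print(tally_clades(reads, clades))
```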

Phylogenetic analyses

All MCP sequences longer than 200 aa were used to construct the phylogenetic tree. Sequences were aligned using Clustal X, and phylogenetic trees were constructed using the neighbour-joining, minimum-evolution and maximum-parsimony algorithms in MEGA 3.0 (ref. 42). Tree topologies were evaluated by bootstrap resampling with 1000 replicates.
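The idea behind bootstrap resampling can be illustrated with a short Python sketch. It is a deliberate simplification and not the MEGA procedure: instead of rebuilding a full tree for each replicate, it resamples alignment columns with replacement and records how often two sequences are closer to each other than either is to a third sequence, using a simple p-distance. All function names and the toy alignment are hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def pair_support(alignment, first, second, other, replicates=1000, seed=0):
    """Percentage of replicates in which `first` and `second` are closer
    to each other than either is to `other` (a stand-in for clade support)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_alignment(alignment, rng)
        within = p_distance(rep[first], rep[second])
        outside = min(p_distance(rep[first], rep[other]),
                      p_distance(rep[second], rep[other]))
        hits += within < outside
    return 100.0 * hits / replicates


if __name__ == "__main__":
    # Toy alignment for illustration only.
    toy = {"seq_a": "MKTAYIAKQR", "seq_b": "MKTAYLAKQR", "seq_c": "MRSGWIPEQN"}
    print(pair_support(toy, "seq_a", "seq_b", "seq_c", replicates=200))
```

A full analysis would rebuild the neighbour-joining, minimum-evolution or maximum-parsimony tree for every replicate and count how often each clade recurs.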