Genomes of diverse isolates of the marine cyanobacterium Prochlorococcus

The marine cyanobacterium Prochlorococcus is the numerically dominant photosynthetic organism in the oligotrophic oceans, and a model system in marine microbial ecology. Here we report 27 new whole genome sequences (2 complete and closed; 25 of draft quality) of cultured isolates, representing five major phylogenetic clades of Prochlorococcus. The sequenced strains were isolated from diverse regions of the oceans, facilitating studies of the drivers of microbial diversity—both in the lab and in the field. To improve the utility of these genomes for comparative genomics, we also define pre-computed clusters of orthologous groups of proteins (COGs), indicating how genes are distributed among these and other publicly available Prochlorococcus genomes. These data represent a significant expansion of Prochlorococcus reference genomes that are useful for numerous applications in microbial ecology, evolution and oceanography.


Background & Summary
As the smallest (<1 μm diameter) and most abundant (3 × 10^27 cells) photosynthetic organism on the planet 1 , Prochlorococcus has a unique status in the microbial world. This unicellular marine cyanobacterium is found throughout the euphotic zone of the open ocean between ~45°N and 40°S, where it carries out a notable fraction of global photosynthesis 1,2 . The group, which would be considered a single microbial 'species' by the traditional measure of >97% 16S rRNA similarity, is composed of multiple phylogenetically distinct clades (Figure 1) (as defined by either rRNA internal transcribed spacer (ITS) 3 or whole-genome sequences 4 ) which are physiologically distinct. Adaptations for optimal growth at different light intensities differentiate deeply branching groups of Prochlorococcus into high light (HL) and low light (LL) adapted clades 3,[5][6][7][8] .
Prochlorococcus have the smallest genomes of any known free-living photosynthetic cell, ranging from ~1.6 to 2.7 Mbp 4 . While they all share a core set of genes present in all strains, there exists remarkable diversity in gene content among isolates. The group has an 'open' pan-genome, i.e. each newly sequenced genome typically contains many new genes never before seen in Prochlorococcus 4 . Given the abundance of Prochlorococcus, studies of their genomic and metagenomic features have provided numerous insights into features of ocean ecosystems [9][10][11][12][13][14][15][16][17] . In addition, the group has proven to be a valuable system for studying microbial evolution 18,19 , genome streamlining 20,21 , and the relationship between genotypic, phenotypic and ecological variation in marine populations 3,7,22 . Since Prochlorococcus is abundant in surface waters, these reference genomes have also been extremely valuable for interpreting marine metagenomic and metatranscriptomic datasets 14,[23][24][25][26][27][28] .
To advance our understanding of Prochlorococcus genetic diversity, we sequenced the genomes of 27 Prochlorococcus strains from a variety of ocean environments. The strains sequenced included both previously reported strains and eight new isolates (Table 1). The newly isolated strains come from ocean regions that previously had few or no cultured representatives, and they substantially expand the number of cultured Prochlorococcus available for five major clades. These results demonstrate the applicability of high-throughput dilution-to-extinction cultivation approaches 29 to Prochlorococcus.
The genome sequences reported here represent a notable increase in the number of genome sequences available from the major phylogenetic clades with existing cultured representatives. While many genomes differ greatly in gene content, other sets are very closely related and differ primarily by single nucleotide polymorphisms (e.g., LG, SS2, SS35, SS51, SS52, and SS120; and MIT0701, MIT0702, and MIT0703). Thus, this dataset encompasses a broad range of pairwise genomic diversity among Prochlorococcus strains.
Most genomes were sequenced to draft status; two were closed (Table 2). We used two annotation methods to identify the potential functions of genes in the genomes. Genes were first called and annotated by the RAST pipeline 30 . To expand on these predictions, especially for the myriad genes of unknown function, we also derived annotations from an independent pipeline, Argot2 31 . To improve the utility of these genomes for comparative genomics and evolutionary studies, we define a set of precomputed orthologous gene clusters for Prochlorococcus. All cluster data are supplied in this data set (Data Citation 1 and Data Citation 2).
These genomes should be useful to researchers interested in many aspects of marine microbial ecology and evolution. Since the genomes are from cultured isolates, hypotheses generated from these data can be tested in laboratory experiments. The genomes will also greatly facilitate the interpretation of transcriptomic and proteomic studies, as well as meta-'omic' data from field studies where Prochlorococcus is a dominant phototroph.

Culturing and strain isolations
Many of the strains sequenced have been previously described 3,5,6,[32][33][34][35][36] (Table 1); 8 are reported here for the first time. All cultures were unialgal; this was initially determined crudely by flow cytometry profiles, and then more specifically by confirming the presence of only one cyanobacterial 16S rRNA ITS sequence in the culture. All cultures except SB and MIT0604 contained heterotrophic bacteria. Cultures were maintained in acid-washed glassware in Pro99 media 37 .
Strains MIT0701, MIT0702, and MIT0703 were isolated from the South Atlantic (CoFeMUG cruise KN192-05, station 13, 13.45°S, 0.04°W) at 150 m using a high throughput culturing method 29 adapted for phototrophs. The seawater used for isolations was first filtered through a 1 μm filter with no amendments and kept in the dark at 18-20°C for 21 days. The total red fluorescing phytoplankton population (1 × 10^5 cells ml^−1, determined with a Guava EasyCyte flow cytometer) was diluted in PRO3V media 37 made with the same South Atlantic water that had been filtered through a 0.1 μm Supor 142 mm filter, then autoclaved to sterilize. This media contained 100 μM NH4Cl, 10 μM NaH2PO4, PRO2 trace metals 37 and f/2 vitamins (0.1 μg l^−1 cyanocobalamin, 20 μg l^−1 thiamin and 1 μg l^−1 biotin 38,39 ). Ten cells were dispensed into 1 ml volumes in a 48-well polystyrene multiwell culture plate and incubated at 20°C in ~20 μmol Q m^−2 s^−1 (14:10 light:dark) for 2 months. MIT0801 was isolated in a similar manner, but from seawater obtained from 40 m depth at the Bermuda Atlantic Time-series station (BATS; 31.67°N, 64.16°W) that had been sitting in the dark for 5 days. The same PRO3V media recipe was made with 0.1 μm filtered and autoclaved BATS seawater, and 2.5 cells (on average) were dispensed in 5 ml volume in Teflon plates (prepared as described 29 ). Cells were detected within 1 month of enrichment.
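The inoculum sizes quoted above (ten cells per well for the South Atlantic isolations, 2.5 cells on average for MIT0801) can be put in perspective with a simple Poisson model of random cell dispensing. The Poisson assumption and the function names `poisson_pmf` and `well_summary` are illustrative and are not taken from the isolation protocol; note also that the counted cells were the total red fluorescing population, not Prochlorococcus alone.

```python
import math


def poisson_pmf(k: int, mean_cells: float) -> float:
    """Probability that a well receives exactly k cells (Poisson assumption)."""
    return math.exp(-mean_cells) * mean_cells ** k / math.factorial(k)


def well_summary(mean_cells: float) -> dict:
    """Split wells into empty, single-cell and multi-cell fractions."""
    p_empty = poisson_pmf(0, mean_cells)
    p_single = poisson_pmf(1, mean_cells)
    return {
        "empty": p_empty,
        "single": p_single,
        "multiple": 1.0 - p_empty - p_single,
    }


# Mean inocula mentioned in the text: 2.5 cells (MIT0801) and 10 cells per well.
for mean in (2.5, 10.0):
    print(mean, well_summary(mean))
```

Under this assumption roughly 8% of wells receive no cells at a mean of 2.5 cells per well, and most wells start from more than one cell, which is consistent with the later check that each culture contained a single cyanobacterial ITS sequence.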

DNA sequencing and assembly
Genomes were sequenced from genomic DNA collected from 20 ml laboratory cultures. Cells were collected by centrifugation (10,000 g, 10 min), the pellet transferred into a 2 ml tube and frozen at −80°C. Genomic DNA was isolated using the QIAamp DNA mini kit (Qiagen). 2 μg of DNA was then used to construct an Illumina sequencing library as previously described 40 , except that the bead:sample ratios in the double solid phase reversible immobilization (dSPRI) size-selection step were 0.7 followed by 0.15, resulting in fragments with an average size of ~340 bp (range: 200-600 bp). PAC1 and EQPAC1 libraries were constructed using dSPRI bead:sample ratios of 0.9 followed by 0.21, yielding an average size of ~220 bp. DNA libraries were sequenced on an Illumina GAIIx, producing 200+200 nt paired reads, at the MIT BioMicro Center. An average of 1.6 million paired-end reads were obtained for each genome. Low quality regions of sequencing data were removed from the raw Illumina data using quality_trim (V3.2, from the CLC Assembly Cell package; CLC bio) with default settings (at least 50% of the read must be of a minimum quality of 20). Paired-end reads were overlapped using the SHE-RA algorithm 41 , keeping any resulting overlapping sequences with an overlap score >0.5. For all genomes except PAC1 and EQPAC1, the overlapped reads, as well as the trimmed paired-end reads that did not overlap, were assembled using the Newbler assembler (V2.6; 454/Roche) with the following parameters: '-e 200 -rip'. Contigs <1 kbp were discarded at this stage. Reads for PAC1 and EQPAC1 were assembled using clc_novo_assemble (V3.2, from the CLC Assembly Cell package; CLC bio) with a minimum contig length of 500 bp and automatic wordsize determination enabled. These initial contigs were searched against a custom database of marine microbial genomes 9 using BLAST 42 to identify contigs with a closest match to Prochlorococcus.
Sequencing reads belonging to the putative Prochlorococcus contigs were then identified by mapping the raw sequences to these contigs using clc_ref_assemble_long (CLC bio). The Prochlorococcus-like reads were then reassembled with clc_novo_assemble, using the same parameters as above, to produce the final assembly, now largely free of heterotrophic sequences.
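Two of the thresholds in this workflow are simple enough to illustrate directly: the read-quality criterion (at least 50% of a read at quality 20 or higher) and the 1 kbp minimum contig length applied to the Newbler assemblies. The sketch below only restates those two criteria; it is not the CLC quality_trim algorithm, which trims reads rather than simply accepting or rejecting them, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def passes_quality(quals: Sequence[int], min_quality: int = 20,
                   min_fraction: float = 0.5) -> bool:
    """True if at least min_fraction of the bases have quality >= min_quality."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_quality)
    return good / len(quals) >= min_fraction


def keep_long_contigs(contigs: Dict[str, str],
                      min_length: int = 1000) -> Dict[str, str]:
    """Drop contigs shorter than min_length bp (1 kbp in the text)."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Toy data: per-base Phred scores and contig sequences (illustrative only).
print(passes_quality([30, 30, 10, 10]))   # half the bases reach Q20
print(passes_quality([30, 10, 10, 10]))   # only a quarter reach Q20
print(sorted(keep_long_contigs({"c1": "A" * 1500, "c2": "A" * 400})))
```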
MIT0604 and MIT0801 were completed to finished quality with no gaps by directed PCR reactions to sequence contig junctions, combined with Pacific Biosciences long sequencing reads. Contigs were ordered into putative scaffolds based on their similarity to closely related closed Prochlorococcus genomes, as determined by Mauve 43 . PCR primers specific to the ends of putatively adjacent contigs were designed and used to amplify the junctions between contigs. Purified PCR products were sequenced by Sanger chemistry at the MGH DNA core facility, and the resulting sequences used to join contigs in Consed 44 . This resulted in a highly improved but still incomplete assembly. To span difficult repeat regions in MIT0801, we obtained long Pacific Biosciences sequences. We obtained DNA from 25 ml cultures using the Epicentre Masterpure kit (Epicentre) and sequenced this at the Yale Center for Genome Analysis. We combined this set of long but low quality reads with the high quality Illumina short reads obtained previously using the PacBioToCA software 45 , to produce assemblies with a reduced number of contigs. These contigs were aligned to the PCR-improved contigs described above, and the final gaps were closed with a small number of additional directed PCR reactions (as described above) using the Geneious sequence analysis package (V6.1, Biomatters), until the genomes were closed.
As most of the Prochlorococcus cultures sequenced were known to contain heterotrophs, we identified the most 'Prochlorococcus-like' contigs from non-axenic cultures by searching each resulting contig against a custom database of sequenced marine microbial genomes 9 using BLAST 42 . Contigs with a best match to a non-Prochlorococcus genome were removed from the assembly. Subsequent examination of these contig sets indicated that a number of shorter sequences (generally <10 kbp) with significant heterotroph-like stretches had passed through the initial filtering steps. To remove these questionable contigs from the assemblies, we manually examined each <10 kbp contig using the RAST annotation server (see below), and only kept those contigs with clear homology to previously sequenced and closed Prochlorococcus or Synechococcus genomes. Although these filtering steps may have removed a small amount of true Prochlorococcus sequence from the final assembly, we considered missing a few genes preferable to misrepresenting heterotroph sequences as Prochlorococcus.
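The best-hit filtering step described above can be sketched as follows, assuming the BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The mapping from subject sequence IDs to genus (`subject_genus`) and the function names are hypothetical; the custom marine genome database itself is not reproduced here, and ranking hits by bit score is an assumption about how "best match" was defined.

```python
from typing import Dict, Iterable, Set


def best_hits(blast_lines: Iterable[str]) -> Dict[str, str]:
    """Return the highest-bit-score subject for each query contig."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def prochlorococcus_contigs(blast_lines: Iterable[str],
                            subject_genus: Dict[str, str]) -> Set[str]:
    """Keep contigs whose best hit is a Prochlorococcus genome."""
    hits = best_hits(blast_lines)
    return {contig for contig, subject in hits.items()
            if subject_genus.get(subject) == "Prochlorococcus"}


# Toy example with made-up contig and genome identifiers.
rows = [
    ["contig1", "genomeA", "98.0", "900", "18", "0", "1", "900", "1", "900", "0.0", "1500"],
    ["contig1", "genomeB", "80.0", "500", "100", "0", "1", "500", "1", "500", "1e-50", "400"],
    ["contig2", "genomeB", "95.0", "700", "35", "0", "1", "700", "1", "700", "0.0", "1100"],
]
lines = ["\t".join(r) for r in rows]
genus = {"genomeA": "Prochlorococcus", "genomeB": "Alteromonas"}
print(prochlorococcus_contigs(lines, genus))
```

Contigs with no hit at all are dropped by this sketch; the manual RAST-based review of short contigs described in the text is a separate step that is not automated here.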
Examination of the non-cyanobacterial 16S rRNA genes found within these data indicates that the most abundant heterotrophs in the cultures were members of the Alteromonadales, Flavobacteriales, Rhodospirillales, Halomonadaceae, and Sphingobacteriales. We have included a separate data file containing all of the assembled contigs, including those from co-cultured heterotrophs, for anyone who is interested (Data File 4).

Genome annotation
The assembled contigs for each genome were annotated using the RAST server 30 against FIGfam release 49. Additional functional annotation for all genes called by RAST was generated by the Argot2 server 31 , using default settings.
To confirm the rRNA-based phylogeny of these strains, rRNA ITS sequences were aligned in ARB 46 and maximum likelihood phylogenies calculated in PhyML version 20120412 47 , using the HKY85 model of nucleotide substitution, a fixed proportion of invariable sites, and non-parametric bootstrap analysis with 100 replicates.
Clusters of orthologous groups of proteins (COGs) were computed, as described elsewhere 48 , on a data set comprised of previously sequenced Prochlorococcus and Synechococcus strains 4,10,16,17,[49][50][51][52][53] , the new Prochlorococcus genomes described here, 11 Prochlorococcus single-cell genomes 12 and two consensus metagenomic assemblies 14 (Data Citation 1). To facilitate comparisons among genomes, we re-annotated 16 previously sequenced Prochlorococcus genomes (Table 3) with the RAST pipeline as described above; this ensured that a uniform methodology for gene calling and functional annotation was used. Single cell genomes 12 were not re-annotated due to difficulties encountered using this pipeline on such fragmented contigs; instead, we utilized the ORFs previously defined in GenBank. Detailed information regarding these updated annotations is provided (Data Citation 1 and Data Citation 2).
Orthologous gene clusters were defined based on reciprocal best blastp scores (with an e-value cutoff of 1e−5); the sequence alignment length had to be at least 75% of the shorter protein, with at least 35% identity. Additional orthologous genes that did not pass these criteria were added to clusters based on HMM profiles constructed from automated MUSCLE 54 alignments of orthologous sequences within each cluster using HMMER 55 .
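The reciprocal-best-hit criteria stated above can be sketched for a single pair of genomes as follows. This is a minimal illustration, not the clustering pipeline used for the dataset: the `Hit` record, the function names, the choice to rank hits by bit score, and the choice to apply the thresholds before picking the best hit are all assumptions, and the HMM-based recruitment step is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    pident: float    # percent identity
    aln_len: int     # alignment length in residues
    evalue: float
    bitscore: float


def passes(hit: Hit, lengths: Dict[str, int], max_evalue: float = 1e-5,
           min_cov: float = 0.75, min_ident: float = 35.0) -> bool:
    """Apply the e-value, coverage (of the shorter protein) and identity cutoffs."""
    shorter = min(lengths[hit.query], lengths[hit.subject])
    return (hit.evalue <= max_evalue
            and hit.aln_len >= min_cov * shorter
            and hit.pident >= min_ident)


def best_hit(hits: Iterable[Hit], lengths: Dict[str, int]) -> Dict[str, str]:
    """Best passing subject per query, ranked by bit score."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if passes(h, lengths) and (h.query not in best
                                   or h.bitscore > best[h.query].bitscore):
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_pairs(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit],
                          lengths: Dict[str, int]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab = best_hit(a_vs_b, lengths)
    ba = best_hit(b_vs_a, lengths)
    return {(a, b) for a, b in ab.items() if ba.get(b) == a}


# Toy proteins: A1/A2 from one genome, B1 from another (lengths in residues).
lengths = {"A1": 100, "A2": 200, "B1": 120}
a_vs_b = [Hit("A1", "B1", 60.0, 90, 1e-30, 150.0),
          Hit("A2", "B1", 50.0, 80, 1e-20, 120.0)]   # fails the 75% coverage cutoff
b_vs_a = [Hit("B1", "A1", 60.0, 90, 1e-30, 150.0)]
print(reciprocal_best_pairs(a_vs_b, b_vs_a, lengths))
```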

Data Records
The complete dataset is available from the Prochlorococcus Portal website (Data Citation 1) and Dryad (Data Citation 2). The 27 Prochlorococcus genome sequences have also been deposited at DDBJ/EMBL/GenBank (Data Citations 3-29) under the accession numbers indicated in Table 2.

Datasets deposited at Dryad and ProPortal
Sequence, gene annotations, and COG definitions for Prochlorococcus genomes.
File 1 - Tab-delimited file containing cluster assignments and annotation metadata for genes in the newly sequenced Prochlorococcus genomes described in this work, as well as previously published genomes. Columns are as follows:
Genome. The Prochlorococcus strain where the gene is found.
Gene ID. Unique ID for each Prochlorococcus gene, of the format 'P<strain>_####'. Note that, due to the re-annotation of previously published genomes, these names (and the underlying gene boundaries) may not necessarily correspond to those in GenBank.
NCBI ID. For the new genome sequences presented here, the systematic NCBI locus_tag identifier for that gene. For previously published genomes, this column contains the corresponding GenBank locus ID (noted as an 'Alternative locus ID' for strains MED4, SS120 and MIT9313 in GenBank) from Kettler et al.
RAST annotation. Predicted functional annotation description, as supplied by the RAST annotation pipeline. Note that this text may differ slightly from the annotation in GenBank, due to changes imposed by NCBI annotation formatting guidelines.
GO annotation. Gene Ontology categorization for the gene, when available.
Argot2 annotation. Functional annotation prediction from the Argot2 pipeline, when available.
contig_id. The name of the sequence contig on which the gene is found.
gene_id. The unique Gene ID code for that feature.
feature_id. Unique RAST-generated identifier for that feature.
nucleotide_sequence. The nucleotide sequence of the predicted gene.
aa_sequence. The protein (amino acid) sequence of the predicted gene.
File 4 - Set of nucleotide FASTA files containing all assembled contigs (>500 bp) from each culture (i.e., both Prochlorococcus and heterotrophs) sequenced in this work. Each file contains the set of contigs assembled from the raw sequencing data, before any filtering to separate Prochlorococcus from heterotroph contigs. These files are provided for reference, but due to the known heterotroph sequences in these files, they should be used with caution.
File 5 - Set of nucleotide FASTA files containing the predicted nucleotide sequence for all open reading frames (ORFs) in each genome. This file includes ORFs from both the new genomes presented here as well as the re-annotation of previously released Prochlorococcus genomes.
File 6 - Set of protein FASTA files containing the predicted amino acid translation for all ORFs in each genome. This file includes ORFs from both the new genomes presented here as well as the re-annotation of previously released Prochlorococcus genomes.
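As an illustration of how the tab-delimited cluster table (File 1) might be used, the sketch below tallies which genomes contribute genes to each cluster. The `Genome` column is described above, but the name of the cluster-assignment column is not given in this text, so `Cluster ID` is a placeholder that should be replaced with the actual header; the function name and the toy rows are likewise hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set, TextIO


def genomes_per_cluster(handle: TextIO, genome_col: str = "Genome",
                        cluster_col: str = "Cluster ID") -> Dict[str, Set[str]]:
    """Map each cluster to the set of genomes that have a gene in it.

    cluster_col is an assumed header name; adjust it to match the real file.
    """
    members = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        cluster = row[cluster_col]
        if cluster:  # skip genes without a cluster assignment
            members[cluster].add(row[genome_col])
    return dict(members)


# Toy table with made-up gene and cluster identifiers.
toy = io.StringIO(
    "Genome\tGene ID\tCluster ID\n"
    "MIT0604\tgene_a\tcluster_1\n"
    "MIT0801\tgene_b\tcluster_1\n"
    "MIT0801\tgene_c\tcluster_2\n"
)
table = genomes_per_cluster(toy)
print({cluster: sorted(genomes) for cluster, genomes in table.items()})
```

With the real file, the same function could be called on an open file handle, for example `open(path, newline="")`, where `path` points to the downloaded table.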

Technical Validation
Phylogenetic analysis of the ITS sequences obtained from these cultured isolate genomes (Figure 1) groups these strains into the expected clades 57 as previously determined from directed sequencing of the ITS sequences 6 . We were only able to obtain a single cyanobacterial ITS sequence from the assembled genome contigs, again consistent with these strains being unialgal. Prochlorococcus genome size and %GC content are typically quite similar for strains found within the same ITS-defined clade 4 , and both the draft and closed genomes are consistent with previously sequenced strains for these measures as well (Table 2).
The quality of the genome assemblies was assessed in multiple ways. Re-mapping of the original Illumina sequencing reads to the final assembled contigs showed that the reads were distributed evenly along the length of the assembly, ruling out some categories of major assembly errors (such as duplicated regions). Whole-genome alignments of contigs against closely related closed reference Prochlorococcus genomes indicated that the overall gene order of these contigs was broadly consistent with known sequences, indicating that the sequences do not contain obvious chimeras or other artifacts. We also estimated the completeness of the draft genomes by examining the core gene content of the strains, based on the set of genes shared by all closed Prochlorococcus genomes. We found that all of the draft genome assemblies contained >98% of the genes universally present in the 13 previously published closed Prochlorococcus genomes, indicating that these contigs represent most (or perhaps all) of the genome sequence.
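The core-gene completeness estimate described above can be sketched as a set calculation: the core is the intersection of the gene clusters found in every closed genome, and a draft's completeness is the fraction of that core it contains. The function names and toy cluster identifiers are hypothetical, and representing each genome as a set of cluster IDs is an assumption about how the comparison was carried out.

```python
from typing import Dict, Iterable, Set


def core_clusters(closed_genomes: Dict[str, Iterable[str]]) -> Set[str]:
    """Clusters present in every closed genome."""
    cluster_sets = [set(clusters) for clusters in closed_genomes.values()]
    if not cluster_sets:
        raise ValueError("at least one closed genome is required")
    return set.intersection(*cluster_sets)


def completeness(draft_clusters: Iterable[str], core: Set[str]) -> float:
    """Fraction of core clusters found in a draft genome."""
    if not core:
        raise ValueError("core gene set is empty")
    return len(set(draft_clusters) & core) / len(core)


# Toy example: two closed genomes share clusters c1 and c2.
closed = {"closed_1": {"c1", "c2", "c3"}, "closed_2": {"c1", "c2", "c4"}}
core = core_clusters(closed)
print(sorted(core))
print(completeness({"c1", "c5"}, core))
```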
The final closed sequences of the MIT0604 and MIT0801 genomes were verified in two additional ways. First, we compared the experimentally observed PCR product sizes from directed contig joining reactions to the distances predicted from the final assembled sequence to confirm the assembly. Second, we mapped the original (quality trimmed) Illumina sequencing reads against the final assembly. These alignments indicated that the final closed assembly was fully consistent with the original short-read sequence data. In addition, we confirmed that the per-base SNP frequency was not above the expected error frequency.