Single cell genomes of Prochlorococcus, Synechococcus, and sympatric microbes from diverse marine environments

Prochlorococcus and Synechococcus are the dominant primary producers in marine ecosystems and perform a significant fraction of ocean carbon fixation. These cyanobacteria interact with a diverse microbial community that coexists with them. Comparative genomics of cultivated isolates has helped address questions regarding patterns of evolution and diversity among microbes, but the fraction that can be cultivated is miniscule compared to the diversity in the wild. To further probe the diversity of these groups and extend the utility of reference sequence databases, we report a data set of single cell genomes for 489 Prochlorococcus, 50 Synechococcus, 9 extracellular virus particles, and 190 additional microorganisms from a diverse range of bacterial, archaeal, and viral groups. Many of these uncultivated single cell genomes are derived from samples obtained on GEOTRACES cruises and at well-studied oceanographic stations, each with extensive suites of physical, chemical, and biological measurements. The genomic data reported here greatly increases the number of available Prochlorococcus genomes and will facilitate studies on evolutionary biology, microbial ecology, and biological oceanography.


Background & Summary
Marine cyanobacteria within the genera Prochlorococcus and Synechococcus are estimated to be responsible for roughly 25% of ocean net primary productivity 1 . Prochlorococcus is the numerically dominant phototroph in oligotrophic subtropical gyres, which are among the largest contiguous biomes on Earth 2 . In these nutrient poor regimes, Prochlorococcus can account for over half of the chlorophyll 3,4 . While Prochlorococcus is generally restricted to open ocean habitats between 45 o N and 40 o S, Synechococcus has a much broader geographical distribution that extends to subpolar and coastal regions 1 . This difference in range is thought to be due in part to the greater phenotypic flexibility and regulatory capacity among Synechococcus, enabling acclimation to heterogeneous conditions 5 . By contrast, Prochlorococcus has a more streamlined genome adapted to less variable but nutrient depleted regions of the open ocean 6 . Although Prochlorococcus cells have the smallest genomes of known oxygenic phototrophs (~1.6-2.7 Mbp and~2000-3000 genes), the global collective of this group harbors an immense diversity of protein encoding genes 7 . Recent estimates using 41 genomes of cultivated isolates suggested that the Prochlorococcus pan-genome-the complete set of genes harbored by all Prochlorococcus-contains more than 80,000 distinct genes 6 , many of which presumably play a role in adaptation to local environmental conditions. Only a small fraction of these genes have been catalogued, highlighting the potential for culture-independent single cell genomics to reveal new ecologically relevant functions among Prochlorococcus.
As a consequence of their abundance and global distribution, Prochlorococcus and Synechococcus perform key functions at the base of marine food webs, primarily the supply of fixed carbon to higher trophic levels. Much of this carbon is regenerated through respiration by co-occurring heterotrophic bacteria, such as the highly abundant SAR11 clade of marine Alphaproteobacteria (Candidatus Pelagibacter ubique). Many of these heterotrophic bacteria perform ecosystem services that in turn benefit the cyanobacterial populations. In particular, Prochlorococcus is highly sensitive to reactive oxygen species, such as hydrogen peroxide 8 , which can be detoxified by some heterotrophic community members. Abundant catalase encoding heterotrophs in oligotrophic environments, such as the SAR86 and SAR116 clades of marine proteobacteria 9,10 and some sub-populations of SAR11 (ref. 11), are likely to be important community members that provide cross-protection for sensitive Prochlorococcus populations. Recent work further suggests that Prochlorococcus and the dominant heterotrophic cells in the oligotrophic ocean have evolved metabolic co-dependencies to maximize metabolic potential 11 . Thus, understanding the diversity of functions performed by sympatric heterotrophic cells is essential for understanding the ecology and evolution of Prochlorococcus and the combined impact these microbial groups have on ecosystem function and ocean biogeochemistry.
Although studies using cultivated isolates have revealed much about the ecology of Prochlorococcus, Synechococcus, SAR11, and other important taxa that contribute to the function of marine ecosystems 12 , culture-independent studies have begun to reveal an astounding degree of diversity in the wild. In particular, recent advances in the genomics of single cells have uncovered previously unknown marine microbial phyla and functions 13 and have identified a high degree of genome streamlining, mixotrophy, and metabolic specialization within bacterial cells of the surface ocean 14 . Single cell genomes of Prochlorococcus have revealed the existence of new clades with distinct ecological and physiological adaptations 15 as well as a high degree of genomic and functional diversity among Prochlorococcus cells with nearly identical ribotypes 16,17 .
Single cell genomes of both abundant and rare taxa are useful for phylogenetic anchoring of metagenomic data sets and expanding our knowledge of previously undetected phylogenetic lineages and the functions they harbor. Casting a broad net in order to most effectively capture the diversity of cyanobacterial and sympatric heterotrophic microorganisms, we have obtained samples from 22 geographic locations across the world's oceans (Fig. 1), representing 10 Longhurst biogeographical provinces 18,19 (Table 1). Many of these samples were collected under the auspices of the BioGEOTRACES component of GEOTRACES 20 . From these samples, we report 738 single cell genome assemblies consisting of 489 Prochlorococcus, 50 Synechococcus, 82 SAR11, 17 SAR116, 16 SAR86, 9 extracellular virus particles, and 75 additional sympatric microorganisms. To aid in the identification of orthologous genes and facilitate comparative genomics studies, we have precomputed a set of cyanobacterial and cyanophage specific clusters of orthologous groups of proteins (CyCOGs). We expect these data to be useful for a variety of studies related to evolutionary biology, microbial ecology, and ocean biogeochemistry.
1-2 mL of raw seawater was transferred to sterile cryovials with glycerol added as a cryoprotectant at a final concentration of 10%. Samples were flash frozen in liquid nitrogen and stored at -80°C.

Single amplified genome (SAG) generation
The generation, identification, sequencing, and de novo assembly of SAGs were performed at the Bigelow Laboratory for Ocean Sciences' Single Cell Genomics Center (scgc.bigelow.org). The cryopreserved samples were thawed and pre-screened through a 40 μm mesh size cell strainer (Becton Dickinson). Fluorescence-activated cell sorting (FACS) was performed using a BD InFlux Mariner flow cytometer equipped with a 488 nm laser for excitation and a 70 μm nozzle orifice (Becton Dickinson, formerly Cytopeia), as previously described 16,21 . The cytometer was triggered on side scatter, and the "single-1 drop" mode was used for maximal sort purity. For cyanobacteria, the sort gate was defined based on cellular pigment autofluorescence 16 . In order to discriminate heterotrophic bacteria and extracellular particles, environmental samples were incubated with the SYTO-9 DNA stain (5 μM final concentration; Thermo Fisher Scientific) for 10-60 min, after which the particle green fluorescence (proxy to nucleic acid content), light side scatter (proxy to size), and the ratio of green versus red fluorescence (for improved discrimination of cells from detrital particles) were used to define the sort gate 21 . Individual cells were deposited into 384-well plates ( Table 2) containing 600 nL per well of 1x TE buffer and stored at -80°C until further processing. Of the 384 wells, 317 wells were dedicated for single particles, 64 wells were used as negative controls (no droplet deposition), and 3 wells received 10 particles each to serve as positive controls. BD FACS Sortware software was used to collect index sort data (indexed_facs_wga_summary.tsv, Data Citation 1), with FACS plots available from figshare (facs_ssc_fsc_plots.pdf, Data Citation 1). Diameters of sorted cells (indexed_facs_wga_summary.tsv, Data Citation 1) were determined using the FACS light forward scatter signal, which was calibrated against cells of microscopycharacterized laboratory cultures 21 . The DNA for each cell was amplified using either multiple displacement amplification (MDA) or WGA-X 21 , with amplification kinetics distributions for each plate available from figshare (kinetics_welltype_distributions_summary.pdf, Data Citation 1).

Marker gene screening
Single cell MDA and WGA-X products were diluted 50x in UV-treated, 0.2 μm filtered water and then used as templates in real-time PCR, as previously described 21 . Heterotrophic bacteria were screened using 16S rRNA gene primers 27 F and 907 R. Cyanobacteria were analyzed using primers targeting the 16S-23S intergenic transcribed spacer (ITS) sequence 16 . The obtained PCR amplicons were sequenced from both ends using Sanger technology at GeneWiz (South Plainfield, NJ). The two reads were automatically aligned and the consensus was manually curated using Sequencher v4.7 (Gene Codes Corporation, Ann Arbor, MI, USA). Chimeric 16S rRNA sequences were identified using DECIPHER 22 and removed. ITS and 16S rRNA sequences have been deposited with GenBank (Data Citation 2,Data Citations 3).

Cell selection
A selection of cyanobacterial and extracellular SAGs derived from the BiG-RAPA cruise (plates AG-311, AG-315, AG-321, AG-331, AG-335, AG-341, AG-316, AG-323, AG-339, and AG-345) and HOT and BATS cruises (plates AG-347, AG-355, AG-363, and AG-402) were chosen for sequencing based on their fast whole genome amplification, which correlates with good genome recovery in de novo assemblies 21 . Forty-eight additional cyanobacterial SAGs from plates AG-347, AG-355, AG-363, AG-402, AG-418, and AG-459 were selected based on the presence/absence of the narB marker gene as determined by a PCR screen using primer sequences 5'-CANTGGCAYACNATGAC-3' and 5'-RAANCCCCARTGCATNGG-3'. All other cyanobacterial SAGs were selected based on ITS taxonomy with the aim of obtaining a diverse set of cyanobacterial single cell genomes from multiple geographic locations and depths. All heterotroph SAGs were selected based on the classification of 16S sequences using the Ribosomal Database Project (RDP) Release 11 (ref. 23). We focused on obtaining a diverse pool of heterotrophs, including those with poor representation in public databases, that co-occur with Prochlorococcus and Synechococcus (e.g. SAR11, SAR116, SAR86, marine Actinobacteria, OCS116, SAR202, and SAR324).

Genomic sequencing and de novo assembly
Illumina libraries were created, sequenced and de novo assembled as previously described 21 . Only contigs longer than 2,000 bp were retained. This workflow was evaluated for assembly errors using three bacterial benchmark cultures with diverse genome complexity and %GC, indicating 60% average genome recovery, no non-target and undefined bases, and average frequencies of misassemblies, indels and mismatches per

Phylogeny
In order to facilitate downstream analyses, we have inferred the phylogeny for the cyanobacterial genomes and heterotrophic bacterial genomes (Figs. 2 and 3). We used the PhyloSift software 26 to identify and align a collection of core protein coding gene families from the single cell genomes. Briefly, PhyloSift uses LAST 27 to identify 37 protein-coding marker genes 28 . The identified orthologous sequences are then aligned to marker gene HMM profiles using the hmmer software suite 29 and concatenated into a reading-frame-aware nucleotide codon alignment. The alignments were then trimmed using the automated heuristic method -automated1 in trimAl v1.2 (ref. 30). The recovery of biosample n/a AG-453 a n/a SWC-21 AG-455 n/a n/a AG-457 a n/a SWC-22 AG-459 n/a n/a AG-461 n/a SWC-23 AG-463 n/a n/a AG-464 n/a SWC-24 AG-469 n/a n/a AG-470 n/a SWC-26 n/a n/a AG-670 n/a n/a SWC-27 n/a n/a AG-673 n/a n/a SWC-28 n/a n/a AG-676 n/a n/a SWC-29 n/a n/a AG-679 n/a n/a SWC-30 n/a n/a AG-683 n/a n/a SWC-31 n/a n/a AG-686 n/a n/a PhyloSift marker gene families ranged from 2-37 (median 31) families per heterotroph single cell genome and from 0-37 (median 27) families per cyanobacterial single cell genome. For phylogenetic inference, we included all single cell genomes with at least 2 PhyloSift marker families present in the final alignments, and 151 additional heterotroph genomes to provide greater context for the heterotroph tree (Data Citation 1). We made our marker-family selection criteria as inclusive as possible in order to convey general tree topology for the greatest possible proportion of the dataset, but we note that some of the most incomplete genomes may be subject to phylogenetic artefacts due to the large number of gaps in their alignments. Maximum Likelihood trees were inferred using raxmlHPC-PTHREADS-AVX v8.2. 9 (ref. 31) using the GTRGAMMA model of rate heterogeneity for heterotrophs and the GTRCAT model for cyanobacteria. RAxML runs were conducted with rapid bootstrapping 32 , and the number of bootstrap trees was automatically determined using the extended majority rule criterion 33 resulting in 300 and 100 bootstrap replicates for the cyanobacterial and heterotroph trees respectively. Lists of taxa used for the phylogenies as well as the alignments in FASTA format and trees in Newick format are available from figshare (Data Citation 1).

Cyanobacterial Clusters of Orthologous Groups of proteins (CyCOGs)
To provide an overview of functional composition of the genes present in the cyanobacterial and cyanophage genomes, we inferred clusters of orthologous groups of proteins, referred to here as CyCOGs. The underlying set of genomes includes Prochlorococcus, Synechococcus, cyanophages, and cyanobacterial virocells (genome assemblies containing both bacteria and phage genomes) from our data set. We also included publicly available genome data from IMG for Prochlorococcus, marine Synechococcus in subclusters 5.1, 5.2, and 5.3, and cyanophages isolated using these cyanobacteria as hosts. The following assemblies with likely heterotroph contamination were excluded: AG-418-M21 and scB245a_518D8 (ref. 16). The clustering of the proteins was carried out with panX 34 , with parameters tuned to account for incompleteness of SAG genomes: --core_genome_threshold 0.5 --core_gene_strain_fpath strains_com-plete_95plus.txt (core genes are defined by being present in at least 50% of all genomes, and in every genome of >95% completeness). The workflow comprises an initial clustering step performed with  This is ProPortal CyCOGs version 6.0 (cycogs.tsv, Data Citation 1). Prior releases include versions 1, 3, 4, and 5 of ProPortal CyCOGs 37-40 . Version 2 is an unreleased set of CyCOGs developed for testing purposes only. Legacy CyCOG definitions are also available from IMG/ProPortal (https://img.jgi.doe.gov/ proportal).

Code availability
No custom code was used in the generation or processing of the data. Software versions and the use of any adjustable variables and parameters are as follows:   File 2: DNA amplification kinetics summaries associated with the single cell genome assemblies can be found in kinetics_platemap_summary.pdf (Data Citation 1).
File 3: DNA amplification kinetics distributions associated with the single cell genome assemblies can be found in kinetics_welltype_distributions_summary.pdf (Data Citation 1).   ITS and 16S sequences for all SAGs (including those that did not undergo whole genome sequencing) are available from GenBank under the accession numbers MG666579-MG668595 for ITS sequences (Data Citation 2) and MH074888-MH077527 for 16S sequences (Data Citation 3).
Paired-end sequencing reads in fastq format are available from the NCBI Sequence Read Archive (Data Citation 4).

Technical Validation
The quality of the sequencing reads was assessed using fastqc and the quality of the assembled genomes was assessed using checkM 43 and tetramer frequency analysis as previously described 14 . This workflow was evaluated for assembly errors using a series of benchmark cultures with diverse genome complexity and %GC 21 .

Usage Notes
While the single cell genomes in this data set were screened for contamination that could have been introduced during cell sorting and DNA amplification, users should be aware that these screening procedures do not eliminate the potential for multiple genomes being present in the same assembly. Some single cell genomes may be derived from cells infected with a bacteriophage (i.e. virocells 44 ) and thus contain both host and virus genomes. Other single cells may contain multiple genomes due to a close physical association between two cells that resulted in co-sorting and co-amplification of DNA. Given that many of these events are biologically meaningful, these genome assemblies were not removed from the data set or modified to separate multiple genomes. Based on our technical validation, we have identified possible virocells or co-sorted genomes in the data set (genome_assembly_summary.tsv, Data Citation 1).