Convergent reductive evolution of cyanobacteria in symbiosis with Dinophysiales dinoflagellates

The diversity of marine cyanobacteria has been extensively studied due to their vital roles in ocean primary production. However, little is understood about the diversity of cyanobacterial species involved in symbiotic relationships. In this study, we successfully sequenced the complete genome of a cyanobacterium in symbiosis with Citharistes regius, a dinoflagellate species thriving in the open ocean. A phylogenomic analysis revealed that the cyanobacterium (CregCyn) belongs to the marine picocyanobacterial lineage, akin to another cyanobacterial symbiont (OmCyn) of a different dinoflagellate closely related to Citharistes. Nevertheless, these two symbionts are representing distinct lineages, suggesting independent origins of their symbiotic lifestyles. Despite the distinct origins, the genome analyses of CregCyn revealed shared characteristics with OmCyn, including an obligate symbiotic relationship with the host dinoflagellates and a degree of genome reduction. In contrast, a detailed analysis of genome subregions unveiled that the CregCyn genome carries genomic islands that are not found in the OmCyn genome. The presence of the genomic islands implies that exogenous genes have been integrated into the CregCyn genome at some point in its evolution. This study contributes to our understanding of the complex history of the symbiosis between dinoflagellates and cyanobacteria, as well as the genomic diversity of marine picocyanobacteria.


Phylogenetic relationship with another cyanobacterial symbiont of a dinoflagellate
A multigene phylogenetic analysis was conducted to investigate the phylogenetic relationship of CregCyn with OmCyn, a symbiont of Ornithocercus magnificus, whose phylogenetic position had already been revealed 3 .The maximum likelihood phylogenetic tree (Fig. 2) inferred from 143 protein amino acid sequences indicated that OmCyn occupied a unique phylogenetic position at the base of Synechococcus subcluster 5.1 12 , consistent with a previous study 3 .The branch of CregCyn, on the other hand, was nested within the Synechococcus subcluster 5.1 and displayed a sister relationship with a monophyletic lineage named clade IV 12 , comprising Synechococcus sp.CC9902 and BL107.The backbone topology for Synechococcus subcluster 5.1, including the two symbiotic

Association with host dinoflagellates in the natural environment
To gain insight into the interaction of CregCyn with its host in the natural environment, we analyzed the occurrence pattern of the CregCyn genome in the global marine metagenome data provided by the Tara Oceans project 13,14 .We checked the occurrence of the genomic sequences of CregCyn as well as those of OmCyn and free-living picocyanobacteria in the metagenomes of continuously size fractionated seawater samples (0.8-5 μm, 5-20 μm, 20-180 μm and 180-2,000 μm) obtained from 66 Tara Oceans sampling stations in a wide range of oceans 13 .The analysis showed that most metagenomic reads aligned with the free-living picocyanobacterial genome were found in the smallest size fraction (0.8-5 μm; Fig. 3).This result is reasonable given that the cell size of picocyanobacteria is less than 2 μm in diameter.In contrast, the genomes of the cyanobacteria symbiotic with the two dinoflagellates yielded few corresponding reads from this small size fraction.In agreement with a previous study 3 , most of the sequences homologous to the OmCyn genome were obtained from the 20-180 μm size fraction (88% of the total; Fig. 3).Likewise, sequence reads aligning with the CregCyn main chromosome were predominantly detected in the 20-180 μm size fraction (63%), followed by the 5-20 μm size fraction (36%).Although the number of metagenomic reads corresponding to the plasmid sequence of CregCyn were scarce (15 out of ~ 1.77 billion searched reads), all the reads were exclusively detected from the 20-180 μm size fraction.Given that the size range of 5-180 μm corresponds to eukaryotic microorganisms such as dinoflagellates, as previously discussed 3 , the observed distribution pattern implies a tight physical association between these symbiotic cyanobacteria and their respective host cells in natural environments.This suggests that CregCyn, like OmCyn, does not live freely in the environment but rather undergoes vertical inheritance across host generations.
The observation that reads corresponding to the OmCyn genome were concentrated in the 20-180 μm size fraction, whereas reads for the CregCyn genome were also detected in the 5-20 μm size fraction, may be due to the size of the host dinoflagellate.The cell size of Ornithocercus magnificus is approximately 75-115 μm 15 , while the cell size of Citharistes regius is rather small; the cell size of the individuals examined in this study is 40-50 μm even at the longest axis of the cell outline, including the ornaments formed by thacal plates (Supplementary Figure S2).

Genome reduction in the CregCyn genome
While our phylogenetic analysis suggested phylogenetic independence, we observed a similar degree of genome reduction in CregCyn and OmCyn.While previous studies have shown that OmCyn has a reduced genome relative to the free-living picocyanobacteria genome 3 , our comparative genomic analysis revealed that CregCyn has a compact protein repertoire comparable to that of OmCyn (Fig. 4).The protein repertoires of free-living strains in the Synechococcus subcluster 5.1 range from 2000 to 2400, while those of the two symbionts are just over 1,500, which is 1/4 to 1/3 less than those of the closely related free-living strains.This protein repertoire is even smaller than that of Prochlorococcus strains, which have particularly reduced proteomes among the free-living cyanobacterial genomes used for comparative analysis in this study (Supplementary Table S1).
Since CregCyn and OmCyn are likely to have been acquired independently, the reduction in protein repertoire is also likely to have occurred independently.In contrast to OmCyn, for which there is no phylogenetically related lineage with a known genome sequence, CregCyn is clearly closely related to the free-living strains of the clade IV for which genome sequencing has been completed, and the reduction in protein repertoire is apparent (i.e., Synechococcus sp.CC9902 and BL107, with protein repertoire sizes of 2125 and 2135, respectively; Fig. 4).To investigate repertoire reduction in different protein categories, we classified each protein group into 'core proteins' , 'picocyanobacterial ancestral proteins, ' and 'accessory proteins' based on their degree of conservation in the phylogeny (see Materials and Methods) and repertoire reduction in each category was examined.The results

Figure 3.
Relative metagenomic read abundances among four size fractions.A total of 570 metagenomic samples from 66 sampling sites provided by the Tara Oceans project were used for this estimation.Reads corresponding to the free-living cyanobacterial genomes were most abundant in the bacterial-sized fractions, whereas those corresponding to the dinoflagellate symbiotic cyanobacterial genomes were obtained exclusively from the larger size fractions.
showed that the protein repertoire reduction of CregCyn is also similar to that of OmCyn, and that almost all of the reduction occurred in the 'picocyanobacterial ancestral proteins' and 'accessory proteins' categories (Fig. 4).
To further investigate the qualitative differences between the CregCyn and OmCyn genome reductions, we compared the proteins encoded by these two symbiont genomes with the predicted ancestral picocyanobacterial proteome to understand which proteins were lost in each symbiont lineage (Supplementary Figure S4).The picocyanobacterial ancestral proteome inferred in this study consists of 2,369 protein groups, of which 1,157 could be functionally annotated using the KEGG orthology 16 .The number of orthologous protein groups identified in the CregCyn and OmCyn genomes was 1,388 and 1,405, respectively.Both correspond to approximately 59% of the ancestral proteome, again suggesting that a similar degree of proteome reduction from the ancestral proteome occurred in both symbiont genomes.The similarity in proteome reduction between the two lineages was confirmed not only in the number of remaining proteins, but also in their functional breakdown; 1,259 protein genes, that are representing approximately 90% of the ancestral protein genes remaining in the CregCyn and OmCyn genomes (90.7% for CregCyn, 89.6% for OmCyn), were shared between these independently evolved dinoflagellate symbionts (Supplementary Figure S4a).This trend was also observed for functionally annotated proteins; functionally annotated ancestral proteins common to CregCyn and OmCyn (850 protein genes) accounted for 93.6% and 94.1% of all functionally annotated proteins found in their respective genomes (Supplementary Figure S4a).Investigation of the number of lost proteins by functional category showed no marked bias (Supplementary Figure S4b).Meanwhile, the percentage of proteins that could be functionally annotated based on KEGG orthology varied depending on their conservation status in the two genomes.The proportion of functionally annotated proteins in the ancestral proteins shared by both genomes was 67.5%, which is higher than that of the entire ancestral proteome (48.8%).On the other hand, the proportions for proteins retained only in the CregCyn or OmCyn genomes accounted for 45% and 36.3%, respectively.Notably, only 23.5% of proteins that are missing from both of the genomes have functional annotations.This implies that while these proteins of uncertain function are present across picocyanobacterial lineages, their functions are likely to be more accessory in nature, leading to their loss during reductive evolution in the two symbiont genomes.
The similar genome reduction trends in CregCyn and OmCyn, two cyanobacterial symbionts with different origins, can be attributed to their shared lifestyle; These symbionts reside in unique extracellular chambers of the dinoflagellate hosts rather than intracellularly.Notably, the 'transporters' category exhibited the largest number of lost proteins in both symbionts (Supplementary Figure S4b), potentially due to their adaptation to the stable and less variable environment within the symbiotic chamber, in contrast to the external environment.Neither symbiont contained genes related to 'Bacterial motility proteins' (Supplementary Figure S4b), supporting the idea that they remain stationary within the chambers throughout their life cycle.Furthermore, examining the degree of functional loss per functional category yields similar observations (Supplementary Figure S4c).The greater impact on O-antigen nucleotide sugar and lipopolysaccharide biosynthesis implies that a robust cell surface may be unnecessary in a stable environment, leading to a reduction in the synthetic capacity of the LPS (lipopolysaccharide) layer.The high degree of loss in categories of transporters, two-component system and transcription factors may also reflect limited interaction with the external environment.8][19] ) reported in previous studies, suggesting that CregCyn and OmCyn appear to maintain relative metabolic independence [18][19][20] .This feature also makes the genome evolution of the symbiotic cyanobacteria of Dinophysiales dinoflagellates unique.Although both CregCyn and OmCyn genomes lack a small number of genes for cyanobacterial 'core proteins' that are conserved in all free-living cyanobacterial genomes analyzed in this study (Supplementary Table S2), none of these rules out the possibility of autonomous growth of the two symbionts.The numerous gene losses found in CregCyn and OmCyn may result in reduced competitiveness when they grow outside the chamber.On the other hand, considering their metabolic independence, they may be capable of growing autonomously without their host when cultivated clonally in enriched artificial media.

Evolutionary difference between CregCyn and OmCyn
The molecular phylogenetic analysis of CregCyn and OmCyn not only suggests that the two cyanobacteria have different origins, but also provides insight into the timing of the establishment of symbiosis with their respective hosts.Since CregCyn is shown to be a sister lineage to the marine Synechococcus subcluster 5.1 clade IV, it is likely that the ancestral cyanobacterium of CregCyn established the symbiotic relationship with dinoflagellates after most of the diversity of marine Synechococcus lineages seen today had emerged.OmCyn, on the other hand, is a lineage that branches from the base of subcluster 5.1, and to our knowledge, there are no data supporting the presence of free-living strains in this lineage.Given that the OmCyn lineage has not been found in previous comprehensive diversity analyses of marine picocyanobacteria targeting free-living cells, it could be expected that all species of the lineage currently represented by OmCyn have symbiotic relationships with dinoflagellates and that the establishment of this symbiotic relationship predates the diversification of the marine Synechococcus subcluster 5.1, which is estimated to have occurred approximately 240 million years ago 20 .

Genomic islands in the CregCyn genome
Our comparative analysis of protein repertoires shows that despite their independent origins, both CregCyn and OmCyn genomes have undergone similar reductive evolution in parallel.However, further close examinations of genome traits revealed notable distinctions between the two symbiotic cyanobacterial genomes.We assessed the tetranucleotide diversity spectrum of each genomic region within a genome, which is considered one of the common metrics for picocyanobacterial genome analyses 21,22 .Heterogeneity in tetranucleotide frequency across genomic regions has been observed in the marine picocyanobacteria and serves as an indicator of 'genomic islands' that correspond to the clusters of putative exogenous genes and contribute to adapting to habitats with different environmental parameters 22,23 .Our analysis unveiled distinct tetranucleotide frequency profiles in certain regions of the CregCyn genome, similar to its free-living relatives (Synechococcus sp.CC9902 and BL107, both from the clade IV; Fig. 5).Conversely, the region-wise tetranucleotide frequency profiles of the OmCyn genome showed relatively low diversity (Fig. 5), which could be due to the longer duration of its symbiotic relationship (see discussion below).If the CregCyn genome subregions with biased tetranucleotide frequencies are genomic islands, these are expected to contain proteins not conserved in closely related cyanobacteria, implying horizontal gene transfer (HGT).Considering this idea, we attempted to score the possibility of the HGT origin of each protein-coding gene found in the CregCyn genome and assess their relationship with www.nature.com/scientificreports/tetranucleotide frequencies for each genomic subregion.First, we performed sequence homology searches for each of all CregCyn proteins against databases containing protein sequences from a broad range of bacterial diversity, fully covering cyanobacterial lineages.The organisms that appeared as hits in the search results were surveyed, and the frequencies of hits for each organism were calculated; Proteins from cyanobacteria closely related to CregCyn were the most frequent, as expected.We then used this frequency data to assign scores to the hit organisms for each protein in its homology search.The score will be high if organisms that appear as hits in a protein's homology search are predominantly those with a high hit frequency in the genome-wide analysis and low if these organisms are rare in the genome-wide homology search.As a result, several genomic regions exhibited notably low frequency scores of hit organisms, which can be considered as a proxy for the possibility of foreign origin, and some of these regions also contained peaks of biased tetranucleotide frequencies (Fig. 6).We regarded these overlapped regions as 'genomic islands' in the CregCyn genome in this study (refer to Supplementary Table S3 for a list of protein-coding genes located on the genomic islands).
As discussed above, OmCyn may have continued its symbiotic relationship with dinoflagellates prior to the current diversity of the marine Synechococcus subcluster 5.1 being formed.Assuming that the lineage has survived only within the dinoflagellate chamber during this time, it is likely that the OmCyn lineage has had very limited contact with organisms in the external environment, consistent with the low diversity of tetranucleotides found and the lack of detectable genomic islands in this genome.On the other hand, it is noteworthy that genomic islands are present in the CregCyn genome despite its reductive nature.However, the timing of the origin of these genomic islands, which is potentially important in the context of the evolution of the symbiotic relationship, is unknown.One of the possible explanations for the presence of genomic islands in CregCyn is that it has spent less time since the establishment of the symbiotic relationship with the dinoflagellate host than that of OmCyn.It is possible that the current genomic islands in the CregCyn genome already existed in the free-living ancestor of CregCyn, and no new genomic islands have been formed since then.However, not enough time has passed for the genomic islands to become undetectable through acclimation.On the other hand, given that the coexistence of multiple cyanobacteria and other bacteria in a dinoflagellate chamber has been reported 7 , we cannot rule out the possibility of HGT within the chamber.In any case, the currently available data do not allow us to speculate when these genomic islands were formed, whether before or after CregCyn established a symbiotic relationship with dinoflagellates and whether genes associated with the genomic islands contributed to the adaptation to the symbiotic lifestyle.Future comparisons with the symbiotic cyanobacterial genomes of other Citharistes species may provide further insight into this question.

The plasmid of CregCyn
It has been generally believed that marine picocyanobacteria do not possess plasmids.However, in this study, we identified a plasmid associated with CregCyn.A metagenomic analysis of natural populations of coastal Synechococcus suggests that some populations may have one or more plasmids 24 , supporting the plausibility of the discovery of the CregCyn plasmid.The 17 Kbp plasmid is predicted to encode 17 proteins, including integrase and endonuclease commonly found in mobile genetic elements such as transposons (Supplementary Table S4).In addition, about half of the genes showed homology to other genes on the plasmid, suggesting possible gene Figure 6.Genomic islands detected in the CregCyn genome.Scatter plot on top: average frequency score of top-hit organisms for proteins located within each genomic subregion.These subregions were defined using the same windows as those employed in the bottom plot.Plot on middle: presence of Synechococcus sp.CC9902, BL107 and Parasynechococcus marenigrum in hit organisms for each homology search; these organisms were the most common among the hit organisms in the overall homology search for all of the CregCyn proteins.Scatter plot on bottom: the values on the first principal component (PC1) from the principal component analysis, illustrating the tetranucleotide frequencies for each genomic subregion; this data was also presented in Fig. 5. Horizontal dotted lines in both the top and bottom scatter plots indicate values corresponding to the mean plus or minus two times the standard deviation of the data.Genomic islands (shaded areas) were defined as subregions where values exceeding this threshold were observed in both plots.duplication events.While genomic islands often encode genes related to transposition, the genomic island detected in CregCyn lacked such genes.It may therefore be possible that this plasmid plays an important role in the formation of the genomic island of CregCyn.
Of more interest is the presence of two protein genes (Locus ID: CREGCYN_16720 and CREGCYN_16660) that encode two proteins that show similarity to proteins encoded in the OmCyn genome.These two protein genes are present in the plasmid as copies that match at the nucleotide sequence level.A homology search using these proteins as query sequences against known protein sequences (GenBank nr database) resulted in hits to several OmCyn protein sequences, although not strong homologies.While lacking experimental support, some of these OmCyn homologous proteins have been annotated as transcriptional regulators.The low sequence homology of these proteins in CregCyn and OmCyn makes it difficult to elucidate the evolutionary relationship between these proteins, but HGT between CregCyn and OmCyn is one possibility to explain the similarity.
The idea that a HGT occurred between CregCyn and OmCyn evokes a potential physical interaction between the two cyanobacteria.However, considering that both cyanobacteria reside within the symbiotic chambers of different host dinoflagellates, the occurrence of direct physical contact seems unlikely.Nevertheless, it has been observed that multiple types of cyanobacteria can coexist symbiotically within the chamber of a single Dinophysiales dinoflagellate 7 .While this study did not identify any cyanobacterial sequences other than CregCyn from the symbiotic chamber of Citharistes regius, the possibility that cyanobacteria of the OmCyn lineage may also be symbiotic in the chamber cannot be ruled out.Alternatively, considering the relatively close relationship between Citharistes and Ornithocercus among Dinophysiales dinoflagellates, it is possible that the ancestor of Citharistes regius harbored cyanobacteria of the OmCyn lineage in its chamber in addition to CregCyn.Such a scenario would raise the possibility that a HGT occurred between the two lineages within a single symbiotic chamber of a dinoflagellate.

Conclusions
In this study, we first succeeded in revealing the complete genome sequence of a cyanobacterial symbiont of the pelagic dinoflagellate C. regius.Through genome analysis, the evolutionary and metabolic characteristics of the symbiont were revealed, as well as similarities with the genome of OmCyn, another cyanobacterial symbiont of dinoflagellate with different origins.However, the specific role of CregCyn in this symbiotic relationship, including the mechanism of metabolite exchange between CregCyn and C. regius, remains unclear.Addressing these biological questions will require continued efforts involving challenges for laboratory cultivation of this symbiotic consortium, close cell observation, as well as comprehensive sequence analysis.Another important outcome of this study is the presentation of another complete genome of marine picocyanobacteria with unique genomic features, such as a reductive protein repertoire.The genome reduction of the cyanobacterial symbionts is distinct from that of Prochlorococcus, as it has resulted from adaptation to the lifestyle inside the chamber of hosts.The metabolic independence of CregCyn and OmCyn revealed in this study implies their potential for cultivation without their host cells.Based on several indications from previous research, it is expected that the diversity of cyanobacteria symbiotically associated with Dinophysiales dinoflagellate is even greater 9 .Further genomic analysis of these cyanobacteria will reveal their genomic and genetic diversity.

Sampling of Citharistes regius cells and genome amplification of symbionts
The cells of Citharistes regius used in this study were found in a seawater sample collected with a plankton net (mesh size: 25 μm) off Shimoda, Shizuoka Prefecture, Japan (N34°29.222' , E139°06.200') on November 15, 2021.Three C. regius cells found in a sample (cell IDs: A11, A12, and M16; Supplementary Figure S2) were picked up with a microcapillary, washed more than three times by transferring droplets of sterilized seawater, and finally placed in sterilized fresh water for the subsequent genome amplification step.Each of the three C. regius cells harboring cyanobacterial symbionts was subjected to independent genome amplification using REPLI-g Single Cell Kit (QIAGEN), and the amplified DNA samples were processed with S1 nuclease (TaKaRa) to reduce branching junctions.

Genome sequencing, assembling, and annotation
Illumina sequencing libraries for three amplified DNA samples were prepared using Nextera TruSeq DNA PCR-Free Low Throughput Library Prep Kit (Illumina).Subsequently, the libraries were analyzed on an Illumina NovaSeq 6000, yielding 33 to 43 million paired-end reads per library, each measuring 150 bp in length.The quality control of the Illumina short reads was performed using fastp (version 0.12.4) 25 with default settings.To obtain long-reads, one of the amplified genomic samples (M16) was further subjected to library construction using the Rapid Sequencing Kit (Oxford Nanopore Technologies), and the library was analyzed on a MinION platform (Oxford Nanopore Technologies) with a MinION Flow Cell (R9.4.1).The MinION sequencing generated 70 thousand nanopore reads, totaling approximately 162 Mbp.The resulting reads had an average length of 2.3 Kbp, a maximum length of 55 Kbp, and an N50 of 4.5 Kbp.
The genome of M16 cell was assembled de novo in a hybrid manner, incorporating both the Illumina short reads and the nanopore long reads.This assembly was carried out using Unicycler (version 0.4.8) 26 with the `-no_pilon` option.Additional genome assembling for A11 and A12 cells utilizing their Illumina short reads were also conducted using Unicycler.Preliminary genome annotations for all three assemblies were performed using DFAST (version 1.2.15) 27 .For the predicted proteins obtained through the preliminary annotation of the M16 cell assembly, a sequence homology search against the refseq proteins database 28 was conducted, leading to the identification of four contigs encoding protein genes with high similarity to those of cyanobacteria.The same analysis was conducted for the assemblies using short reads of A11 and A12 cells, but no additional www.nature.com/scientificreports/cyanobacterial contigs were identified beyond those observed in the M16 assembly.An analysis of the assembly graphs for the M16 assembly using Bandage (version 0.8.1) 29 revealed that three of the four contigs form a single graph, while the remaining contig of 17 Kbp in size forms a circular DNA molecule.The three contigs presumed to form a circular chromosome were 1,073 Kb, 854 Kb, and 6 Kb in length, respectively.The shortest 6 Kb contig contained the rRNA gene cluster, and the sequence coverage suggested the presence of two identical copies of the contig on the chromosome.This implied that the short contig represented an inverted repeat sequence on the circular chromosome.To determine how the other two larger contigs were linked around this inverted repeat, PCR analysis was performed.PCR primers were designed to correspond to the two ends of the two larger contigs.
Using the genome amplicon from the M16 cell as a template, PCR was performed to amplify a fragment including the 6 Kb contig sequence with all primer combinations corresponding to the possible contig arrangements based on the assembly graph.The results suggested the chromosome structure reported in this study, and the structure was further confirmed by inspection of the raw reads spanning these contigs.No additional bases were detected between the contigs.The annotation of the main chromosome and a small circular DNA molecule (plasmid) resulting from the above procedure was initially performed by DFAST in an automated fashion.The initial gene annotations throughout the genome were further reviewed and refined manually by comparing them with genes of other cyanobacteria.KEGG Orthology ID (KO ID) assignment to each protein was performed by using the KEGG Automatic Annotation Server 30 with the BBH (bi-directional best hit) method.

Ortholog detection and protein repertoire analysis
Orthologous relationships of the predicted proteins of CregCyn with proteins from other cyanobacterial lineages were estimated using OrthoFinder (version 2.5.3) 31 .This analysis utilized proteomes from nine cyanobacterial symbionts, including CregCyn and OmCyn, as well as proteins observed in a phylogenetically diverse range of cyanobacterial genomes, totaling 252 complete genomes (see Supplementary Table S1 for the genome list).The cyanobacterial proteins were clustered based on their similarity by OrthoFinder and grouped into orthogroups, each containing orthologous proteins.The size of the non-redundant protein repertoire for each genome was defined as the total number of orthogroups assigned to proteins within their proteomes, plus the number of proteins not assigned to any orthogroup (species-specific proteins).
Orthogroups conserved in all free-living cyanobacterial genomes analyzed in this study were designated as 'core proteins' .In addition, 'picocyanobacterial ancestral proteins' were defined as orthogroups found in both the Prochlorococcus clade and the Synechococcus subcluster 5.1.Proteins that did not fall into either of these categories were identified as 'accessory proteins' .

Phylogenetic analysis
16S rRNA gene phylogeny: For the phylogenetic analysis of CregCyn based on the 16S rDNA sequence, homologous sequences of cyanobacteria were collected from public databases.This dataset consisted mainly of the sequences from the Synechococcus/Prochlorococcus/Cyanobium clade, of which Synechococcus elongatus occupies a basal position, and it also included partial cyanobacterial 16S rRNA gene sequences (286-379 bp) reported by Foster et al. 9 , which were found in reverse transcription products of whole Citharistes sp. cells from natural environments.The 16S rRNA gene sequences of CregCyn and other cyanobacteria were aligned using the L-INS-i method of MAFFT (version 7.49) 32 and trimmed using trimAl (version 1.4) 33 with `-gt 0.8` option.Phylogenetic analysis was conducted on the trimmed alignment using IQ-TREE (version 2.1.2) 34 with the TIM3 + F + R3 model, which was estimated as the best-fit model using the model test tool in IQ-TREE 35 .Support values for nodes were evaluated using 100 nonparametric bootstrap replicates under the same substitution model.
Phylogenomic analysis using multiple protein sequences: To perform phylogenetic analysis using multiple protein sequences, protein sequences of 143 single-copy genes conserved in 99% of picocyanobacterial genomes were extracted based on the result of the OrthoFinder analysis.The extracted protein sequences were aligned by orthogroups using the L-INS-i method of MAFFT (version 7.49) 32 , and all sites with gaps in the alignments were removed using trimAl (version 1.4) 33 .These 143 protein datasets were combined into a final dataset consisting of 23 sequences and 35,518 sites.This dataset is publicly available at https:// doi.org/ 10. 5281/ zenodo.10947 876 36 .The final dataset was subjected to the maximum likelihood phylogenetic analysis using IQ-TREE (version 2.1.2) 34 with the LG + C20 + F + Γ4 substitution model, which was selected by the model test of IQ-TREE.Options used for the model test are follows: `-mset LG,WAG,JTT,JTTDCMut,VT,Dayhoff,Blosum62,PMB,LG + C20 -mrate G,I + G`.Bootstrap values were obtained from a nonparametric bootstrap analysis with 100 replicates.The LG + C20 + F + Γ4 PMSF model 37 was used for the bootstrap analysis, with the maximum likelihood tree topology as the guide tree.

Relative abundance estimation using metagenomic data
Metagenomic data from the Tara Oceans Project 13,14 were used to estimate the relative abundance of each cyanobacterial genome in natural environments based on size fraction.593 metagenomic samples obtained from 68 Tara Oceans sampling sites were downloaded from the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (see Supplementary Table S5 for accession IDs of SRA data used).The first five million reads after the very first 1000 sequences from each metagenomic read file were extracted using FASTQ-DUMP from SRATOOLKIT (version 2.11.0, https:// trace.ncbi.nlm.nih.gov/ Traces/ sra/ sra.cgi? view= softw are) with following options: `-skip-technical -split-files -N 1000 -X 5000999`.Five million reads from each metagenome were mapped to each cyanobacterial genome using BLASTN (version 2.12.0 +) 38 of the BLAST + package, and metagenomic reads that showed ≥ 99% identity to a single genomic sequence spanning a minimum of 90 bases were treated as corresponding reads for the specific genome.In this process, the reads mapped to coding genomic regions encoding rRNA and tRNA were excluded from the analysis, as these regions

Figure 1 .
Figure 1.Genome map for the cyanobacterial symbiont of Citharistes regius.(A) Micrograph of C. regius found at the same sampling site as the individuals used in this study.Orange granular structures are cells of symbiotic cyanobacteria.(B) Map of the circular chromosome.The two outer circles (green and light blue) show the positions of protein-coding genes on plus and minus strands.Light green and orange bars in the third circle indicate tRNA and rRNA genes, respectively.The innermost circle shows the relative G + C content.(C) Map of the plasmid.Outer circle shows protein-coding genes (green and light blue for plus and minus strand, respectively).The inner circle shows the relative G + C content.

Figure 2 .
Figure 2. Maximum likelihood phylogenetic tree showing the positions of two cyanobacterial symbionts of dinoflagellates.A maximum likelihood phylogenetic tree inference was performed using a concatenated protein alignment of 143 protein sequences from 23 species/strains, under the site-heterogeneous substitution model LG + C20 + F + Γ4.Support value for each node obtained from 100 nonparametric bootstrap replicates under the LG + C20 + F + Γ4-PMSF model.Thick branches indicate 100% bootstrap values.Scale bars represents the estimated number of amino acid substitutions per site.

Figure 4 .
Figure 4. Nonredundant proteome size comparisons of cyanobacterial genomes.The comparison is based on the number of orthologous protein groups encoded in each genome.Each bar represents the proteome size without functional redundancy.The cladogram on the left shows the phylogenetic relationships in Fig. 2.

Figure 5 .
Figure5.Genome-wide diversity of tetranucleotide frequency of cyanobacterial genomes.The tetranucleotide frequencies for each genomic subregion were obtained using a sliding window approach with a width of 10 Kbp and an overlap of 1 Kbp.The frequency values for a genome were analyzed by using principal component analysis and the values in the first principal component (PC1) were used as summary values for the tetranucleotide frequency.Violin plots with box plots show the overall diversity of tetranucleotide frequencies of the genomes.Scatter plots show PC1 values for each subregion in a genome. https://doi.org/10.1038/s41598-024-63502-0