Nucleotide substitution rates of diatom plastid encoded protein genes are positively correlated with genome architecture

Diatoms are the largest group of heterokont algae with more than 100,000 species. As one of the single-celled photosynthetic organisms that inhabit marine, aquatic and terrestrial ecosystems, diatoms contribute ~ 45% of global primary production. Despite their ubiquity and environmental significance, very few diatom plastid genomes (plastomes) have been sequenced and studied. This study explored patterns of nucleotide substitution rates of diatom plastids across the entire suite of plastome protein-coding genes for 40 taxa representing the major clades. The highest substitution rate was lineage-specific within the araphid 2 taxon Astrosyne radiata and radial 2 taxon Proboscia sp. Rate heterogeneity was also evident in different functional classes and individual genes. Similar to land plants, proteins genes involved in photosynthetic metabolism have lower synonymous and nonsynonymous substitutions rates than those involved in transcription and translation. Significant positive correlations were identified between substitution rates and measures of genomic rearrangements, including indels and inversions, which is a similar result to what was found in legume plants. This work advances the understanding of the molecular evolution of diatom plastomes and provides a foundation for future studies.

www.nature.com/scientificreports/ plastome of an individual may include many copies of the unit-genome by repeating this complement pattern many times. Although this unit is often diagrammed as a circular molecule, the plastome more likely contains a collection of circular, linear and linear-branched molecules that each comprises two to many copies of the monomer 21 . All diatom plastomes sequenced to date include a large inverted repeat (IR) separated by large and small single-copy regions (LSC and SSC, respectively). Apart from the typical quadripartite structure, an extensive range of gene order arrangements, gains and losses of genes are exhibited in the diatom plastomes 9,10 . Gene order changes not only arise through gene duplication by IR expansion, but also via inversions and insertions and deletions (indels) in both IR and SC regions. Calculation of synonymous (dS) and nonsynonymous (dN) nucleotide substitution rates across individual genes and their functional groups between lineages provide insights into the plastome evolution 22 . In previous studies, genes encoding subunits that are integral to photosynthesis, such as cytochrome b 6f. complex (PET) and photosystems I and II (PSA and PSB) have lower rates of nucleotide substitution than other functional groups in angiosperms and conifers [23][24][25][26] . Accelerated substitution rates have been detected in ribosomal protein (RPL and RPS) genes and RNA polymerase (RPO) genes 24,[26][27][28][29][30] . Besides differences in substitution rates, variation relative to genomic features such as rearrangements in gene order can also shape plastome evolution. Previous studies have identified a significant positive correlation between rates of nucleotide substitution and gene order changes in angiosperm plastid genomes 30,31 , bacterial genomes 32 and arthropod mitochondrial genomes 33,34 .
In a previous study by Schwarz et al. 26 , both nonsynonymous and synonymous substitution rates were negatively correlated with plastome sizes and rearrangements such as the number of inversions and indels. The focus of these investigations was on three of the six subfamilies of flowering plant family Fabaceae. One of the subfamilies, papilionoids, has a wide diversity of plastome rearrangements including the loss of inverted repeats (IR) in one clade and relatively smaller plastomes than the other subfamilies. This study also found that genes in the IR show three to fourfold reduction in substitution rates compared to SC regions. Genes that used to be in IR showed accelerated rates compared to genes retained in the IR. A negative correlation between substitution rates and cupressophyte plastid DNA genome size has also been reported in conifers 37 .
Our hypothesis is that the relationship of nucleotide substitution rates between plastid genes and plastome size and architecture such as inversions, indels, and IR in diatoms are similar to what was observed with legumes and conifers. If correct, this reflects a fundamental aspect of how diatoms evolve. To date, no study has investigated the nucleotide substitution rates of all shared plastome protein-coding genes in diatoms. The present study explored the patterns of plastid nucleotide substitution rates across the entire suite of 103 shared genes for 40 species of diatoms. Correlations between plastome substitution rates and genome features, including plastome size, number of indels and genome rearrangement were examined. This work advances the current understanding of the molecular evolution of diatom plastomes.

Results
phylogenetic relationships and branch lengths. Phylogenetic analysis of 40 previously published diatom plastomes (Table S1) for the concatenated 103 gene data set (Table S2) generated Bayes and maximum likelihood (ML) trees with robust support of > 0.97 posterior probabilities and > 95% bootstraps for most of the branches, respectively (Fig. 1). The radial centrics of the Coscinodiscophyceae (radial 1, 2 and 3) formed a basal grade. The Mediophyceae including bi-polar and multi-polar diatoms plus the Thalassiosirales are paraphyletic and contained in three clades (polar 1, 2 and 3). Araphid 1 was sister to araphid 2, and phylogenetically close to another group, raphids. Raphid pennate diatoms were monophyletic. Within araphid 2, Astrosyne radiata showed an extremely long branch in both Bayes and ML trees (Fig. 1).  (Tables S1, S2). Two methods including pairwise and CODEML model 0 were used to estimate the nucleotide substitution rates. The dN/dS values were more widely distributed for individual genes (Fig. 2), whereas the ratio was limited to 0 to 0.13 for different functional classes (Fig. 3). Two genes, rps5 and atpI, have high dN/dS values > 1 and were subjected to positive selection analyses using CODEML model 8 versus 7. A total of 63 and 64 positively selected sites were detected in rps5 and atpI, respectively (Table 1). Four genes atpB, rpoB, rps9 and secY had higher dN/dS values, which suggested accelerated evolution of these genes. In particular, atpB in Proboscia sp. and Seminavis robusta showed dN/dS values of 1.75 and 2.15, respectively. When grouping genes into functional groups, all had dN/dS values close to 0 with the RubisCo subunit having median dN/dS values of only 0.08 (Fig. 3A), which suggested some genes with rapid evolution have their effects masked by other genes in the same functional group with purifying selection.

Substitution rates in individual genes
Gene order in diatoms. Gene order analysis using MAUVE revealed substantial rearrangements of blocks of sequences in the 40 diatom species (Table 2). Only 14 species had identical gene order shared with at least one other species. correlation of substitution rates and plastome characteristics. All correlations of the parameters of interest were visualised in Fig. S2. Significant correlation was observed between the number of indels and dN, dS and dN/dS (p < 0.05; Fig. 4). No significant correlation was found between the substitution rates and plastome size (Fig. S3). Astrosyne radiata, which has a relatively small plastome among diatoms, had the highest overall dN and dS (Fig. S3, Table S5). Significant correlations were found between plastome size and the length of the inverted repeat (IR), and the length of the small/large single copy region (SSC, LSC respectively) (Fig. S4). www.nature.com/scientificreports/ Correlation of pairwise substitution rates and inversion distance (Table S6) was tested in the 40 diatom plastomes. Significant correlation (p < 0.05) was found between dN and inversion distance in 25 out of 40 pairwise comparisons ( Fig. 5; Table S6). Among the 40 plastomes, dS was significantly correlated with inversion distance in 18 pairwise comparisons, whereas the number of significant pairwise comparisons reduced to 13 for dN/dS values. The polar 1 group had the largest proportion of significant correlations between substitution rates and inversion distances. Seven of nine sampled taxa were significantly correlated in both dN and dS, and six of nine taxa were significantly correlated in dN/dS. Astrosyne radiata, which produced the longest branch in the diatom phylogeny ( Fig. 1), also showed significant correlation of dN (p-value = 2.41e−06), dS (p-value = 2.23e−03), and dN/dS (p-value = 2.54e−03) with inversion distance ( Fig. 5; Table S6).

Discussion
Only limited studies have been performed using plastome protein-coding sequences from diatoms and not much is known about their molecular evolution. In this study, 103 plastid genes were examined across 40 diatom species, most of which were recently published by our group 7,9,10 . The ribosomal subunit and RNA polymerase genes have higher nucleotide substitution rates than other functional groups. Positive correlations are evident between dN and dS values and number of indels and inversion distances, which are proxies of genome rearrangements. Unlike the studies on legumes and conifers, we found no strong correlation between nucleotide substitution rates and diatom plastome size. The reason for the differences between diatoms and plants with respect to substitution rates may be attributed to fundamental differences in their genome content. Diatom plastomes are gene dense, with very little space dedicated to non-coding sequences and most are devoid of large repeat sequences 61 . However, the average diatom plastomes size is close to those of seed plants because they encode for more genes 7,9,10,62 . Previous studies in diatoms also showed that variation in the unit-genome size is mainly due to expansion and contraction of the IR, gene loss and the introduction of foreign DNA of unknown origin 7,9,10 .
Astrosyne radiata, which is known to have undergone many gene loss events 7 , had the highest dN and dS but a relatively small plastome. This finding is in agreement with the legume plants 26,37 . Perhaps species closely related to A. radiata will show the same negative correlation between substitution rates and plastome size but more sampling of related species is needed to confirm this observation. Astrosyne radiata is highly unusual from a morphological perspective, which perhaps explains its unusually long branch length in the phylogenetic tree. Although it is placed among araphid pennates, this diatom has elongated sternum and bilateral symmetry, and they have fully reverted to the ancestral radial symmetry (where all structures are rotationally arranged and symmetric around a single point in the centre) of diatoms in radial 3, such as Coscinodiscus and Actinocyclus.
Significant positive correlations were identified between substitution rates and two measures of genomic rearrangements, indels and inversions. This result is similar to legumes 26 but not to conifers 37 . A recent study has also found significant correlations between branch length and gene order changes 17 for two of the taxa in this study, Astrosyne radiata and Proboscia sp. This suggests that the evolutionary forces shaping the structural rearrangements between plants and diatoms are similar. Disruption of DNA repair, recombination and replication (DNA-RRR) systems has been suggested to cause highly elevated nucleotide substitution rates and genome www.nature.com/scientificreports/ rearrangements 24,35 . A recent study revealed the potential correlation between dN rates of nuclear encoded DNA-RRR genes of plastomes and measures of plastome complexity in one angiosperm family 36 . Like land plants, diatom plastid genes mainly fall into two general classes, those encoding proteins involved in photosynthetic metabolism (PSA, PSB, PET, ATP) and those with roles in transcription and translation (RPS, RPL, RPO). The finding that genes involved in photosynthesis had relatively lower overall substitution rates than genes in transcription-translation apparatus confirms that rate heterogeneity by functional class is a shared feature of diatoms and land plant plastomes.
Upon inspection of the protein alignments of the two positively selected genes, rps5 and atpI, we found that Thalassiosira oceania and A. radiata have very divergent sequences but were annotated with the correct gene symbols. The possibility of annotation errors leading to statistically significant positive selection cannot be discounted.  Table S2 shows how the genes were grouped into the 11 categories.  www.nature.com/scientificreports/ Gene essentiality is a widely studied factor in substitution rate variation, with the idea that essential genes are subject to stronger selective constraints than non-essential genes [45][46][47] . Several studies utilizing nuclear sequences have demonstrated that rates of nucleotide substitution are associated with gene expression levels where highly or more widely expressed genes evolve at slower rates in plants [48][49][50] and animals 51-53 supporting the notion that these genes may evolve under greater selective constraints. The slow rates of evolution in most of the genes examined in our study suggests they are essential genes.
In summary, positive correlations between nucleotide substitution rates and plastome rearrangements in both diatoms and legume plants motivate further studies to explore causal relationships between rates and plastome features. This will require expanded plastome sampling, both within and between diatom lineages. Future diatom studies should also consider the aspect of coevolution between nuclear and plastome genes, which has been done in several plant lineages 43,44,54,55 .

Species
Gene collinear block order  Table 2. Local collinear blocks (LCBs) for each of the 40 diatom plastomes identified by Mauve. Negative numbers indicate an inversion in a given LCB. Only one IR was included in this analysis. The same gene order is highlighted with the same superscript letter before the species name.  (Table S1). If similar sequences were annotated with the same gene names (i.e. isoforms) orthologs were selected using a phylogenetic tree-based approach 57 . Protein-coding genes were partitioned by functional category following Yu et al. 7 The gene sequences were translated with the transeq function in EMBOSS v6.5.7 58 39 , with the substitution model GTR + G and -f option. One thousand bootstrap replicates were performed to assess strength of support for clades. The maximum likelihood trees of individual genes and functional groups were then used as the constraint trees to estimate the substitution rates from individual-gene and functional-group levels, respectively. www.nature.com/scientificreports/ nucleotide substitution rates. Nucleotide substitution rates (dN and dS) were estimated using the CODEML function implemented in PAML v4.8 40 . Gapped regions were excluded with the parameter "-nogap" flag in PAL2NAL to avoid spurious rate inference. Pairwise rates were calculated relative to the outgroup species Triparma laevis and estimated with the parameter runmode = − 2. All shared plastome genes (103) were concatenated for nucleotide substitution rate estimation and separate estimations were calculated on individual genes or concatenated sequences of genes in different functional groups as listed in Table S2. CODEML model 0 was also used to estimate dN/dS values at the level of individual genes and functional groups. For genes with dN/dS > 1 in model 0, these genes were tested further with CODEML model 7 (neutral) and model 8 (positive selection) to uncover potential positively selected sites using similar methodology as described previously 61 . plastome features for correlation analyses. The number of indels for the concatenated 103 proteincoding genes was calculated using a custom Python script. Triparma laevis (Bolidophyceae) was used as a reference. Indels within aligned protein-coding regions were summed using a custom Python script resulting in a single value for each taxon; only intact genes were included (in-frame indels). Whole genome alignment among the 40 diatom species was performed using the ProgressiveMauve algorithm in Mauve v2.3.1 41 . The same IR copy (IRb) was removed from all plastomes. The locally collinear blocks (LCBs) identified by Mauve were numbered with positive or negative sign based on strand orientation to estimate genome rearrangement distance (Table S3). Pairwise inversion (IV) distances were estimated using Genome Rearrangements In Man and Mouse (GRIMM ; Table S4) 42 . The feature 'plastome size' excludes one copy of the IR for each taxon.
correlation between substitution rates and genome characteristics. Pairwise dN and dS values were calculated for the 103 shared genes from each taxon relative to the outgroup Triparma laevis. Correlation of dN and dS with plastome size and indel number for each plastome was tested. Phylogenetic Generalized Least Squares was performed using the ape v5.4 62 and nlme v3.1 63 packages in R. The ML constraint tree was utilized with outgroup taxa pruned. The correlation between dN and dS with IV distance was tested using the Pearson test 64 . The resulting p-values were Bonferroni 65 corrected using the built-in p.adjust function to account for the effect of multi-hypothesis testing.

Data availability
The NCBI accession numbers of the diatoms used in this study: NC_024084.