Dynamic evolution of inverted repeats in Euglenophyta plastid genomes

Photosynthetic euglenids (Euglenophyta) are a monophyletic group of unicellular eukaryotes characterized by the presence of plastids, which arose as the result of the secondary endosymbiosis. Many Euglenophyta plastid (pt) genomes have been characterized recently, but they represented mainly one family – Euglenaceae. Here, we report a comparative analysis of plastid genomes from eight representatives of the family Phacaceae. Newly sequenced plastid genomes share a number of features including synteny and gene content, except for genes mat2 and mat5 encoding maturases. The observed diversity of intron number and presence/absence of maturases corroborated previously suggested correlation between the number of maturases in the pt genome and intron proliferation. Surprisingly, pt genomes of taxa belonging to Discoplastis and Lepocinclis encode two inverted repeat (IR) regions containing the rDNA operon, which are absent from the Euglenaceae. By mapping the presence/absence of IR region on the obtained phylogenomic tree, we reconstructed the most probable events in the evolution of IRs in the Euglenophyta. Our study highlights the dynamic nature of the Euglenophyta plastid genome, in particular with regards to the IR regions that underwent losses repeatedly.

Photosynthetic euglenids (Euglenophyta) constitute a single subclade within euglenids. Their plastids, enclosed by three membranes, arose as the result of the secondary endosymbiosis between phagotrophic eukaryovorous euglenid and the Pyramimonas-related green alga 12 . Within photosynthetic euglenids, three evolutionary lineages are distinguished. A single mixotrophic species Rapaza viridis forms the most basal lineage 13 . Other photosynthetic euglenids are split into two groups: predominantly marine Eutreptiales and freshwater Euglenales. Euglenales are divided into two families: Phacaceae and Euglenaceae 14,15 . Genomic features of the secondary plastid of euglenids have been studied very intensively in the recent years as more plastid genomes have been sequenced. The first plastid genome of Euglena gracilis was sequenced more than two decades ago 16 but the number of plastid genome sequences rapidly increased since 2012 resulting in 18 euglenid plastid genomes sequenced so far. Three Eutreptiales plastid genomes have been characterized 12,17,18 and 15 from the Euglenales, including 14 of Euglenaceae 19 and only one from Phacaceae 20 . The euglenid plastid genome has undergone dynamic changes, including genome reduction due to the gene loss or transfer to the nucleus 12 , the proliferation of introns 11,21 , and genome rearrangements. Intron proliferation has been proposed to be correlated with the number of maturases in the pt genome 20 . The majority of genes among previously sequenced euglenid pt genomes have had the same basic complement of protein-coding genes. However, there have been significant changes in the gene arrangement. More closely related taxa tend to have greater synteny 21 than more divergent organisms, for which extensive rearrangement have been shown 22 .
Phacaceae comprises three monophyletic genera: Phacus, Lepocinclis, and Discoplastis. The family Phacaceae has been proposed quite recently based on phylogenetic relationships 14,15,23 and morphological synapomorphy -the presence of numerous, small plastids without pyrenoids. The genus Phacus was erected in mid-XIX century 24 to accommodate taxa from the genus Euglena that were rigid and did not undergo metaboly. The genus Lepocinclis was established a few years later by Perty 25 to incorporate taxa from the genus Phacus that were not flattened. With the advent of molecular sequencing, the division of taxa between Phacus and Lepocinclis have been validated and a new genus Discoplastis was erected to accommodate several species formerly belonging to the genus Euglena and characterized by numerous small plastids without pyrenoids and strong metaboly of the cell 26 . Genus Phacus and Lepocinclis are species-rich and have been intensively studied on the species-level in the last decade, and several new species have been described [27][28][29][30][31] . In contrast, genus Discoplastis comprises only two species 26 .
Here, we report the structural features of the eight newly sequenced plastid genomes of the representatives of the family Phacaceae, including five taxa from Lepocinclis, two taxa from Phacus, and one Discoplastis taxon. We sought to identify the main genomic changes that occurred in the investigated lineages. The examined genomes display considerable variability at all levels except gene content. We also present the phylogenomic trees inferred from the plastid genome sequences. Our results highlight the repeated losses of IR region during the evolution of Euglenophyta.

Results and Discussion
We investigated eight taxa representing all three genera encompassed in the family Phacaceae and found that the plastid genome experienced important alterations at the level of genome organization and intron prevalence during the evolution of the Phacaceae. Before our comparative analyses, relatively little was known about the extent of chloroplast genomic changes throughout Phacaceae and only one species, Phacus orbicularis, has been sequenced 20 . When compared to the characteristics of Euglenaceae pt genomes 19,21,32 , they were very similar in their gene content. However, a few physical traits present within the Phacaceae were remarkable when compared to the rest of the Euglenales and the Euglenophyta.
General features. Previously, the pt genome of Eutreptia viridis contained the smallest genome among Euglenophyta at 65,513 bp 17 , followed by that of the fellow member of Eutreptiales -Eutreptiella gymnastica (67, 622 bp) 12 , leading to the assumption that there was an expansion in the genome size in the freshwater euglenids (Euglenales). Among members of the family Euglenaceae, Monomorphina aenigmatica 11 was identified to contain both the smallest genome (74,746 bp) and the smallest number of introns (53 + 1 unidentified ORF). Sequencing of the first representative of the Phacaceae family -Phacus orbicularis 20 revealed that its pt genome (66, 418 bp) is smaller than any of the known Euglenaceae genomes and comparable in size with the pt genomes of Eutreptiella gymnastica. All the strains sequenced in this study possess relatively small pt genomes (below 91 Kb) (Fig. 1, Table 1, Supplementary Table 1) and the P. inflexus pt genome (~58 Kb) was identified as the smallest Euglenophyceae pt genome sequenced to date, even smaller than the genome of Eutreptiella gymnastica. That might suggest that the pt genome of P. inflexus underwent reduction, which could be explained by the loss of mat2 and mat5 and subsequently a smaller number of introns, along with the low average length of genes (531 bp) ( Table 1), lower than those of Eutreptia viridis (587 bp).
Phylogenomic analyses. Before comparing the gene content and gene organization of the examined genomes, here we present the phylogenetic context required to interpret those results. Our chloroplast phylogenomic analyses were carried out using amino acid and nucleotide data sets that included representatives of all genera of Euglenophyta (23 taxa in total). The amino acid data set was generated using 57 protein-coding genes (11,499 sites), whereas the nucleotide data set contained two rRNA genes (3,959 sites). The obtained phylogenetic tree (Fig. 2) confirmed sister relationship of Phacaceae and Euglenaceae 15,23,33 . The inferred relationships among genera of Euglenaceae are not well supported, and only those between closely related taxa are in accordance with multigene analyses on nuclear-encoded genes 14,15 . In contrast relationships among Phacaceae species were well resolved and highly supported. Phacus and Lepocinclis form sister clades with Discoplastis represented by D. spathirhyncha branching off first. That topology was previously recovered in phylogenetic analyses based on nuclear-encoded genes 14,15 .  from the representatives of Phacus. Initial assemblies of the plastid genomes of Discoplastis and Lepocinclis were not circularized and the resulting contigs differed in the coverage, suggesting duplication. The contigs were joined via PCR reactions and additional PCR reactions were used to extend the pt genome to the reported genomic sequences. The resulting assembly confirmed that the circularization was impossible due to the presence of an inverted repeat of the ribosomal operon. The arrangement of the plastid genomes in Discoplastis and Lepocinclis was very similar, IRs are separated by a very short SSC, which doesn't contain any genes, only tandem repeats. The plastid genomes of land plants and green algae often contain two copies of an IR encoding the rRNA operon 34 , among them Pyramimonas, a prasinophyte alga most closely related to the ancestor of the plastid of Euglenophyta 12,35 . Among previously sequenced pt genomes of euglenids the quadripartite arrangement has been  18,35,36 , whereas one of the copies was most probably lost during the divergence of Euglenales. Although all known pt genomes of Euglenales lack one copy of IR, in E. gracilis and Stromobomonas acuminata 37 they have been shown to contain tandemly repeated copies of the rRNA operon 16,38 . By mapping the presence/absence of IR on the obtained phylogenomic tree, we reconstructed the most probable events in the evolution of IRs in the Euglenophyta (Fig. 2). Our results suggest that the IR regions were most likely lost three times in the course of evolution of the Euglenophyta (if we assume that the plastid genome acquired from a prasinophyte alga was characterized by the typical quadripartite arrangement of the plastid genome). The first possible loss occurred in the genus Eutreptia, however we have to take into account that only one representative of that genus was analyzed and its sequence is incomplete 17 . The second loss occurred during the evolution of the genus Phacus, as none of the three analyzed species carried two copies of IR. Finally, the third loss most probably appeared in the common ancestor of all taxa belonging to the family Euglenaceae, which were lacking the quadripartite arrangement of the plastid genome. That scenario is more likely than the creation of an IR de novo from an IR-less plastid genome 39 . However, some of the Euglenaceae pt genomes are incomplete, and the number of losses might increase with the completion of those genomes. Specifically, pt genome of Colacium vesiculosum and Strombomonas acuminata have traits of the possible remnant of the IR; they possess an additional piece of 16S rRNA or 23S rRNA, respectively 37 .
The plastid genomes of almost all land plants and algae carry two identical copies of a large IR sequence 40 . It has been shown that the substitution rates are several times lower for the IR relative to the single-copy (SC) regions among several angiosperms 41 suggesting a major impact of the IR on the rate of plastome sequence evolution. The omnipresence of the quadripartite structure of plastid genome suggest its important role, however loss of the IR regions, although rare, is known from land plants 42 and several lineages of the green algae, such as prasinophyceans 43 , trebouxiophyceans 40 , and streptophytes 44 . The molecular mechanisms underlying these events remains unclear, although several hypotheses have been proposed based on studies on plants and green algae 39 . It might be a consequence of repeated events of IR contraction, the complete excision of one of the IR sequence or through the differential elimination of the gene sequences from one IR copy. Among the plastid genomes of the Phacaceae, we didn't observe any intermediate types of IR copies, which might suggest the step-wise loss of one of the IRs. In contrast, in Etl. gymnastica two copies of the IR differ, which might indicate that one of them is on its way to be lost 12 . More plastid genomes sequences form the early branching lineages of Euglenophyta is needed to support either of the aforementioned hypotheses.

Gene content and introns. The rRNA operon was present in all investigated strains and Discoplastis and
Lepocinclis possess two copies. We did not identify any traces of 5S rDNA in two species, namely L. ovum and L. playfairiana, and only the core of 5S rDNA (27 nucleotides long) in L. steinii. The 5S rRNA was also not identified in Etl. gymnastica pt genome 12 and in some green algae, like Pyramimonas parkeae and Pycnococcus provasolii 35 . However, the absence of 5S rRNA in plastid genomes, in particular among protists, have been previously shown to be the result of difficulties with identification, not the lack of 5S rRNA in those genomes 45 .
The majority of the protein-coding genes and tRNAs were shared among all investigated taxa, except roaA, which was absent in P. inflexus and D. spathirhyncha, and tRNA L-(UAG) which was absent in L. ovum. The number of genes encoding proteins and tRNAs was also consistent with other Euglenophyceae taxa 21 . In several taxa, roaA (P. pleuronectes) and atpF (P. pleuronectes, D. spathirhyncha, L. steinii, and L. playfairiana) began with alternative start codons, which was also previously observed for other taxa 21 .
We found notable differences among the maturase-encoding genes. The mat5 gene has patchy distribution along the tree of photosynthetic euglenids, and it was present in two out of seven analyzed species (both strains of L. tripteris and P. pleuronectes) (Fig. 2). Bennett and Triemer 21 proposed that mat5 was gained after the split of Euglenales from the Eutreptiales while identifying three instances in which mat5 was independently lost within Euglenaceae. However, they also recognized mat5 in L. spirogyroides, which demonstrated that it was gained before the Euglenaceae/Phacaceae split. Kasiborski et al. 20 did not identify mat5 in the first pt genome of Phacus and proposed that this gene was lost in P. orbicularis after the split of Phacus and Lepocinclis. Our analyses of eight additional strains from the Phacaceae rejected the previous hypothesis since in both Lepocinclis, and Phacus mat5 was absent in some taxa but present in others (Fig. 2). It was also absent in Discoplastis but based only on one strain we couldn't draw definite conclusions. Most likely, mat5 was present in the ancestor of all Euglenales. The presence of a maturase-like protein in the second intron in psbC of Etl. gymnastica 12 , which showed a weak similarity to mat5 that is usually located in the psbA gene in Euglenales 19 , supports this scenario.
More surprisingly, another maturase gene, mat2 was absent in two out of the three analyzed Phacus species (P. inflexus and P. orbicularis) and one Lepocinclis species (L. tripteris) (Fig. 2). Previously, it was shown to be missing in P. orbicularis which suggested the acquisition of mat2 in the Euglenaceae lineage after the split from the Phacaceae 20 . Our results demonstrate that the majority of the Phacaceae possess mat2, which would imply the independent loss of that gene in some Phacaceae taxa.
It was proposed that proliferation of introns might be related to the number of maturases. Eutreptiales, with one maturase, has the lowest number of introns and Euglenacae with two or three maturases, tend to have more introns 20 . Our results confirmed the previously observed relation between the number of introns and number of maturases (Fig. 2). While P. pleuronectes was the only one of the Phacaceae that contained three maturases and had the highest number of introns (104) in this family. Discoplastis and all analyzed Lepocinclis strains contained two maturases and average (50-59) to a high (79-95) number of introns (Table 1) 17 . The observation of the greater number of introns in strains with more maturases seems to be consistent but doesn't explain the mechanism of intron proliferation. An intron of psbC, which carries an intron-encoded maturase ycf13, is considered the ancestral intron because it is the only homologous intron in euglenophyte cpDNAs 46 . Surprisingly, intron1 in psbC of both strains of L. tripteris was not homologous with E. gracilis psbC intron2, though L. tripteris psbC intron1 still contained ycf13.
Two previous findings and conclusions about twintrons hold true with the newly analyzed genomes. We confirmed that twintrons present in psbF and psbD genes in E. gracilis 16 are unique to that species and absent not only in all Euglenaceae 11,17,19,21,22 but also in Phacaceae 20 (Table 2). We also confirmed the presence of a twintron in psbC (intron1/Egra.2) ( Table 2) in Phacaceae, which strongly supports the previously proposed ancestral origin of this intron 20 . Our data contradicted the suggestion of petB twintron (intron1) as a synapomorphy for all Euglenophyta 20,21 as we identified the petB twintron in taxa representing all genera of Phacaceae except L. ovum and P. pleuronectes (Table 2). That might suggest that indeed the petB twintron is ancestral for the Euglenaceae but has been lost in at least two lineages of Phacaceae.

Synteny, intrageneric and intraspecific variation. Synteny analyses revealed the same order of clusters
within Phacus and Lepocinclis and some rearrangements among genera (Fig. 3). Those findings agreed with the observations within Euglenaceae. Taxa within a genus usually have the same arrangement of clusters and differences occur among genera 11,19,21,47. Intrageneric variability within the Euglenaceae was previously explored in Euglena 21,47 and Mononorphina 11,21 , but not in Phacaceae. Newly sequenced pt genomes allowed to explore intrageneric variability within Phacus and Lepocinclis. In all previously analyzed cases, intrageneric evolution was limited, and significant changes occurred before the separation of those genera. Comparative studies of Monomorphina pt genomes revealed some differences, namely the presence of mat5 and a higher number of introns in M. parapyrum resulting in the bigger size of the pt genome 11,21 . Finally, a comparison of the potential twintrons contained in the Monomorphina species revealed that the presence of twintrons was not conserved between taxa, suggesting that presence or absence of a twintron in a given intron is taxon-specific 21 . We observed similar differences within Phacus and Lepocinclis. In both genera, the number of introns varied greatly (50-95 for Lepocinclis and 29-104 for Phacus) ( Table 1), and the main difference in the gene content is related to the presence/absence of maturases mat5 and mat2. Analyses of twintrons (Table 2) also revealed some differences within genera and further supported earlier conclusions 21 that twintrons are taxon-specific.
Plastid genomes of two strains of the Lepocinclis tripteris (MI 101 and UTEX 1311) were very similar, and ~90% of nucleotide sites were identical when compared across the entire genomes. This was a much higher identity than was observed in the two strains of Euglena viridis, which were only 70.5% identical 21 . The identity was even higher within coding space -95.3%, which was also higher than between two Euglena viridis strains (91.5%). The pt genomes of the L. tripteris strains were also very similar in size but a comparison between the two genomes revealed that the area between tRNAs-S(GCU) and Y(GUA) were different (Fig. 1e,f): while both genomes contained two unidentified ORFs in this area, the two ORFs were different, and this region of the genome was 1,338 bp longer in UTEX 1311, which accounted for almost the entire difference in size between the genomes. Moreover, in both strains of L. tripteris, we observed the presence of a stop codon UAG in the 7 th amino acid of the psbC gene. Since psbC encodes a protein essential in photosynthesis, it is clear that this gene is still functional.

Conclusions
Despite the fact that many Euglenaceae pt genomes have been characterized recently, surprisingly little was known about the plastid genomes of its sister family Phacaceae. To fill this gap, we have sequenced plastid genomes of eight taxa from all three genera classified in the Phacaceae. Gene content was highly conserved within the family, and the main differences were related to the presence/absence of maturases mat2 and mat5. Genes were arranged into seven clusters, and their order was conserved within genera. No pattern of intron number was present in the Phacaceae, although we confirmed the correlation between the number of maturases and the number of introns. We also confirmed that twintrons present in psbF and psbD genes in E. gracilis 16 are unique to that species and that the twintron in psbC is of ancestral origin. We rejected, however, the idea of the twintron in petB as a synapomorphy for all Euglenophyta, because it was not present in some Phacaceae. Our study highlights the highly dynamic nature of the Euglenophyta plastid genomes, in particular with regards to the large IR sequence that experienced repeated losses, most probably at least once within the Phacaceae, once before the branching off of the Euglenaceae and once in the genus Eutreptia. It is necessary to analyze additional taxa from the basal lineages, such as Eutreptiales and Rapaza, to fully understand the dynamic history of the plastid genome in the Euglenophyta and decipher the mechanisms underlying the observed IR losses.

Materials and Methods
Culturing, sequencing, and annotation. The following eight cultures were used in this research:  containing sterile growth media, one cell per well, and allowed to divide for ~2 weeks. Following this, well contents were moved into 12 mL culture tubes that contained sterile growth media and were allowed to grow for another ~2 weeks. Finally, the culture tubes were subsampled and viewed with a Zeiss Axioscope 2 plus microscope (Carl Zeiss Ing., Hallbergmoos, Germany) to ensure that they were uni-algal and verify the identity of the culture. These two cultures were also maintained as described previously 21 .
Cultures of L. ovum, L. playfairiana, L. steinii, L. tripteris (UTEX), L. tripteris (MI), P. inflexus, and P. pleuronectes were concentrated, washed, and had their DNA extracted with following protocols 52 , with the following alterations: Percoll (Research Organics, Cleveland, OH, USA) was substituted for Centricoll (Sigma Inc., St. Louis, MO, USA) in the P. inflexus process; L. ovum, L. playfairiana, L. steinii, and P. pleuronectes cultures were not run through the gradient step. DNA of D. spathirhyncha was extracted using phenol/chloroform, and DNA separation, using a cesium chloride gradient, was performed as described before 17 . All DNA was sequenced with Illumina paired-end reads at the Michigan State University Research Technology Support Facility: D. spathirhyncha with HiSeq 2 × 100 bp reads; all other taxa with MiSeq 2 × 150 bp reads. Phacus inflexus raw sequence data were assembled into contigs with the 'De Novo Assemble…' program in Geneious Pro version 6.8.1 (Biomatters Ltd, Auckland, New Zealand) as previously described 21 . All other raw sequence data were assembled into contigs with the 'De Novo Assembly' program in CLC Genomics Workbench version 5.5.1 (CLC Bio, Cambridge, Massachusetts, USA) as previously described 21 . The number of contigs and their lengths, number of reads per contig, and average coverage per contig are provided in the Supplementary Table 1.
All other aspects of genome discovery, including: identification of pt genome-containing contigs, joining of contigs (as necessary), sequence cleanup (as necessary), PCR primer creation, annotation, arrangement, and genome map creation were as performed as previously described 21 . The primers designed to confirm the presence of IRs are listed in the Supplementary Table S2. Additionally, when we were unable to identify the 5S in this pt genome through Rfam 53 , RNAmmer 54 , or by manual aligning, we used MFannot tool 55 and confirmed its annotation by homology searches with 5S rRNA Database 56 . The search for potential twintrons was performed as previously described 21 , with the exception that instead of a manual search, a Python script was created (Supplementary File 1; the source code is available on GitHub https://github.com/ankarn/groupIII_twintrons) to find the 3′ conserved motif for Group III twintrons 9 within the homologous external introns. Genome maps were drawn using OGDRAW 57 . Newly generated organelle genomes were deposited in GenBank (Table 1).
Synteny between the pt genomes of Phacaceae was determined and visualized with progressive Mauve 2.3.1 58 . Mauve performs syntenic comparisons between multiple genomes and displays these syntenic regions graphically. In the Mauve alignment the repeat regions of rRNA were not included because Mauve will not align repeat regions which have multiple matches in both genomes.

Data Availability
All the newly obtained sequences are deposited in GenBank under accession numbers MH898667-MH898674. All the sequence data set and analysis results obtained in this work are available from the corresponding author on reasonable request.