Auxenochlorella protothecoides and Prototheca wickerhamii plastid genome sequences give insight into the origins of non-photosynthetic algae

The forfeiting of photosynthetic capabilities has occurred independently many times throughout eukaryotic evolution. But almost all non-photosynthetic plants and algae still retain a colorless plastid and an associated genome, which performs fundamental processes apart from photosynthesis. Unfortunately, little is known about the forces leading to photosynthetic loss; this is largely because there is a lack of data from transitional species. Here, we compare the plastid genomes of two “transitional” green algae: the photosynthetic, mixotrophic Auxenochlorella protothecoides and the non-photosynthetic, obligate heterotroph Prototheca wickerhamii. Remarkably, the plastid genome of A. protothecoides is only slightly larger than that of P. wickerhamii, making it among the smallest plastid genomes yet observed from photosynthetic green algae. Even more surprising, both algae have almost identical plastid genomic architectures and gene compositions (with the exception of genes involved in photosynthesis), implying that they are closely related. This close relationship was further supported by phylogenetic and substitution rate analyses, which suggest that the lineages giving rise to A. protothecoides and P. wickerhamii diverged from one another around six million years ago.


Results
The A. protothecoides and P. wickerhamii plastid genomes are paragons of compactness. The A. protothecoides plastid genome is an 84.58 kb, AT-rich (69.2%), circular-mapping molecule ( Fig. 1; Supplementary Table S1). It has a compact architecture (24.57% non-coding), with no inverted repeats or introns. The P. wickerhamii ptDNA architecture mirrors that of A. protothecoides: it is small (55.64 kb), circular, AT-rich (68.8%), and compact (28.8% non-coding) ( Fig. 1; Supplementary Table S2). These two genomes are among the smallest and most reduced ptDNAs observed from the Trebouxiophyceae (Table 1) and green algae as a whole (Supplementary Table S3). The genomic compaction of the A. protothecoides and P. wickerhamii ptDNAs largely results from being no introns and relatively little intergenic DNA (Table 1). Moreover, genes essential for photosynthesis have been lost in P. wickerhamii (discussed below). Further contributing to the ptDNA streamlining in A. protothecoides and P. wickerhamii is the lack of plastid inverted repeat elements. The absence of these elements, however, is a reoccurring theme throughout the Trebouxiophyceae ( Table 1).
The types of plastid genome reduction observed in A. protothecoides and P. wickerhamii are not uncommon for green algae, especially non-photosynthetic species. In fact Helicosporidium sp. has one of the smallest ptDNAs ever observed 8 . Although only about 2/3 the size of that of A. protothecoides, the P. wickerhamii ptDNA is still larger and more expanded than that of Helicosporidium sp. (Reference 8 and Table 1).
A. protothecoides and P. wickerhamii have similar plastid gene contents. The A. protothecoides ptDNA encodes 76 proteins, 3 rRNAs, and 30 tRNAs, which is among the lowest plastid gene contents currently found in green algae (Table 1; Supplementary Table S3). Not surprisingly, given its non-photosynthetic existence, the P. wickerhamii plastid genome encodes even fewer gene products than A. protothecoides-40 proteins, 3 rRNAs, and 27 tRNAs. All of the genes in the ptDNA of P. wickerhamii are also present in that of A. protothecoides (Supplementary Tables S1 and S2). In both A. protothecoides and P. wickerhamii, most of the ptDNA genes have the same transcriptional polarity, and, more importantly, the gene orders are highly conserved between these two algae ( Fig. 1). Such a high degree of similarity in plastid gene arrangement is rarely observed between photosynthetic and non-photosynthetic species. A. protothecoides and P. wickerhamii show fewer regions of plastid gene collinearity with other trebouxiophytes, such as C. variablis and Helicosporidium sp., as they do with each other (Fig. 2). Pairwise plastid-gene-order comparisons using a broader sampling of green algae (Supplementary Figure S1 and S2) further supports the hypothesis that the ptDNA synteny between A. protothecoides and P. wickerhamii is among the highest yet observed within the Chlorophyta, when comparing species from distinct lineages.
The presence and absence of photosynthesis-related genes in the A. protothecoides and P. wickerhamii ptDNAs. A. protothecoides and P. wickerhamii have very different modes of energy production-the former is photosynthetic whereas the latter is an obligate heterotroph. Therefore, it is unexpected that various phylogenetic analyses showed that P. wickerhamii is more closely related to A. protothecoides than to Helicosporidium spp., implying that the loss of photosynthesis has occurred at least twice in the Chlorellales 11 : once in the lineage giving rise to Helicosporidium and once within that giving rise to Prototheca. To gain more insight into the evolutionary relationships among A. protothecoides, P. wickerhamii, and other algae, we performed a Maximum-likelihood phylogenetic analysis, using peptide sequences from 12 single-copy plastid-encoded proteins from 26 species from throughout the Chlorophyta (Fig. 3A). The resulting tree placed A. protothecoides, P. wickerhamii and Helicosporidium sp. together within a clade adjacent to the one containing Chlorella sp. ArM0029B, C. variabilis, C. vulgaris, P. kessleri, Chlorella sorokiniana, consistent with the fact that all of these algae belong to Chlorellaceae. Moreover, A. protothecoides appears to be most closely (bootstrap support 100%) related to P. wickerhamii. The two algae are separated by relative short branch lengths, suggesting that they diverged from one another recently in evolutionary history. These results are consistent well with the phylogenies based on 18 s rRNA (Supplementary Figure S3) and previous phylogenetic analyses using rRNA genes 15,23 . In addition, a previous mitochondrial phylogenetic analysis placed P. wickerhamii and Helicosporidium sp. in the same clade 24 . When we included Helicosporidium sp. in the plastid phylogeny we found that it is rather basal to the A. protothecoides and P. wickerhamii clade, suggesting that it is a close relative of P. wickerhamii and A. protothecoides.
The rate of synonymous substitution (Ks) can be used to estimate the divergent time among species 25 . The average Ks of plastid genes between A. protothecoides and P. wickerhamii is 0.816 (Fig. 3B). If we assume that the nuclear mutation rate in unicellular green alga 26 is 3.23 × 10 −10 substitution per generation, then the number of generation that occurred since their divergence between A. protothecoides and The two concentric maps represent the ptDNAs of A. protothecoides (inner circle) and P. wickerhamii (outer circle), respectively. Genes (filled boxes) are color-coded into 9 groups according to their biological functions. Genes on the outside of each map are transcribed in a clockwise direction, whereas those on the inside of each map are transcribed counterclockwise (The direction of transcription is also pointed out by the black arrows). The tRNA genes are indicated by the one-letter amino acid code followed by the anticodon in parentheses. The dashed lines indicate regions absent from the P. wickerhamii genome.
Scientific RepoRts | 5:14465 | DOi: 10.1038/srep14465 P. wickerhamii is about 2.52 × 10 9 . When considering that single-celled algae typically have a short generation time and that they typically have similar mutation rates in their plastid and nuclear genomes [27][28][29][30] , then the predicted time of divergent time for A. protothecoides and P. wickerhamii should be between six to twenty million years. Further supporting the hypothesis that A. protothecoides and P. wickerhamii are closely related is the fact that the levels of synonymous substitution in the ptDNA are not saturated (< 1) and are in fact similar to those observed between other closely related algal strains or species [27][28][29] . Furthermore, we calculated the similarity of each gene, as well as the 5′ UTR (50 bp upstream of ATG) between P. wickerhamii ptDNA and its 17 relatives. We found that the overall similarities between the plastid genomes of P. wickerhamii and A. protothecoides are highest (Supplementary Figure S4) in all the comparisons. Among them, the tRNA genes are more conserved than other coding genes, while the upstream of tRNA genes is more diverged than the coding genes.
We also investigated the various plastid genomic changes that occurred in the Prototheca lineage following the loss of photosynthesis. Among 109 genes in the ptDNA of A. protothecoides ptDNA, 70 are also present in that of P. wickerhamii, meaning 39 genes were lost from the lineage of P. wickerhamii following its divergence from that of A. protothecoides (Supplementary Table S1 Table S4 and S5). Three other genes (ycf3, ycf4 and ycf12), with ambiguous functions but likely connected to photosynthesis [31][32][33] , are also absent from the P. wickerhamii ptDNA, as are cemA and ccsA, which encode a plastid envelope membrane protein 34 and cytochrome c-type biogenesis protein 35 , respectively. Finally, three tRNAs, (trnL(GAG)), trnS(GGA) and trnT(GGU)) have also been eliminated from the P. wickerhamii plastid. Significant gene content differences were also observed between P. wickerhamii and Helicosporidium sp. (Fig. 2, Supplementary Table S3 and S4), indicating a more complex metabolism in P. wickerhamii's plastid compared with that predicted to be located in the plastid of Helicosporidium sp. 36 .
To investigate the molecular mechanisms of plastid gene loss from P. wickerhamii, we analyzed the junctions flanking deleted genes and gene clusters relative to A. protothecoides. In total, 17 regions containing missing genes were identified (labeled breakpoint BP 1 to BP17), ranging from < 0.5 kb to > 7.5 kb (Supplementary Table S6 and Supplementary Figure S5). Most of the junctions show no sequence similarity, but they do tend to be very AT rich (average > 85%), and 13 of the 17 BPs are adjacent to a tRNA gene.
We compared in detail the ptDNAs of P. wickerhamii and Helicosporidium sp., and found that although both genomes are reduced, the overall architecture are quite different. In the Helicosporidium ptDNA, the rRNA operon is split, and the coding regions display a symmetric strand bias 8 . In contrast, the P. wickerhamii plastid has a "typical" intact rRNA operon and the coding sequences have an asymmetric strand bias-almost all genes are transcribed in one direction (Fig. 1).

Discussion
The ptDNAs of non-photosynthetic species are generally < 80 kb, making them much smaller than those of most photosynthetic plants and algae, which are about 100-200 kb 37 , with some notable exceptions 38 . In this study, we showed that the A. protothecoides ptDNA is among the smallest observed from photosynthetic algae, particularly those from the Trebouxiophyceae and Chlorophyceae. Moreover, the ptDNA architecture and sequence of A. protothecoides is similar in many ways to that of its close non-photosynthetic relative P. wickerhamii (Figs 1 and 2, Table 1). Indeed, the only significant difference between the ptDNAs of these two algae is the loss of photosynthesis-related genes in the latter. Again, our phylogenetic analyses are consistent with earlier studies showing that A. protothecoides is more closely   related to P. wickerhamii than it is to other species within the Chlorellales 23 , and ultimately support the independent loss of photosynthesis in the P. wickerhamii and Helicosporidium lineages.
A. protothecoides is a free-living mixotrophic alga and, thus, can survive heterotrophically, provided it has organic carbon sources-a feature that has been exploited to produce large amount of biomass in a short period of time 20 . P. wickerhamii, on the other hand, is a widely distributed, obligate heterotrophic alga, that can act as an opportunistic pathogens to infect humans 39 and animals 40 . Although these two algae could not have more drastically different lifestyles 41,42 , we found that both share a surprisingly high level of ptDNA sequence identity, which explains that A. protothecoides and P. wickerhamii are more closely related to one another than they are to other members of the Chlorella and Prototheca genera, and validates previous results using phylogeny 15 .
The major difference between the A. protothecoides and P. wickerhamii ptDNA is the presence of photosynthesis-related genes. Given the high similarity in gene content and genomic structure between A. protothecoides and P. wickerhamii, it is almost certain that they share a recent common photosynthetic ancestor, perhaps as recently as ~6 million years ago. At some point after the two lineages diverged, photosynthesis was lost in the P. wickerhamii lineage, resulting in the wholesale loss of photosynthetic genes, whereas in the A. protothecoides lineage photosynthesis has been maintained. Ultimately, the close relationship between the two algae suggests that they could represent an excellent duo for studying the evolutionary loss of photosynthesis.

Methods
Strains and Cultivation Conditions. The A. protothecoides strain, the medium and cultivation methods used in this study have been described previously 43 . The P. wickerhamii strain (SAG 263-11) was purchased from the Culture Collection of Algae at the University of Göttingen, Germany (SAG) and cultured in malt peptone medium containing 10 g/L malt extract (Sigma) and 2.5 g/L proteose-peptone (Sigma).
Organelle genome sequencing and annotation. The complete P. wickerhamii plastid genome was obtained using Sanger chemistry and assembled with the SeqMan program from the DNASTAR Lasergene package. The sequencing primers were designed based on the partially sequenced genome (GenBank: AJ245645.1 and AJ236874.1 13 ). To close the gaps, a set of primers was designed based on the conserved genes found in both A. protothecoides and Helicosporidiium sp. The A. protothecoides plastid genome was obtained as part of the whole genome-sequencing project 18 (GenBank: APJO01001039.1 and APJO01001000.1). All primer sequences are available upon request.
Both the A. protothecoides and P. wickerhamii plastid genomes were annotated using the same methods. The gene sets were originally annotated by Dogma (Dual Organellar GenoMe Annotator) 44 and then curated manually. Protein-coding genes and non-coding RNAs with percent identity lower than 25 and 80 respectively were cut off (E-value 1e-5). Protein-coding genes were examined by Blastx searches against the NCBI non-redundant protein (nr) database and their boundaries adjusted manually whereas tRNA-coding genes were identified by tRNAscan-SE 1.23 45 using organelle parameters. The complete sequences of the A. protothecoides and P. wickerhamii plastid genomes have been deposited into GenBank under the access numbers KC843975 (A. protothecoides) and KJ001761 (P. wickerhamii).
Pairwise alignment and comparison. The pairwise alignments were performed by Blastn (E-value < = 1e-05) in bl2seq 2.2.23 46 with '− 1' as a penalty for a nucleotide mismatch. Each hit was used to estimate the average identity for all alignments or in 100bp-long windows. In gene order comparisons, protein-coding gene and rRNA gene were considered and identified by gene name. C. variabilis NC64A and Helicosporidium plastid genomes were adjusted for comparison correspondingly. C. variabilis NC64A was reversed and started with atpI, while Helicosporidium sp. started with rpl12.
For mutation rate analyses, we did pairwise alignments using GeneWise for each orthologous gene 47 . The software YN00 in the package PAML 4.8a was used to estimate the synonymous and non-synonymous mutation rate (KS&KA) 48 . Li's model 49 was used for estimating DNA divergence time (the generation times using for calculation were 24 h = 0.00274y for heterotroph and 72 h = 0.008219y for autotroph).
Phylogenetic tree construction. We used the TreeFam methodology 50 to define orthologous genes among taxa as follows: the all-versus-all peptide sequence alignments of protein-coding genes were performed using blastp 2.2.23 (no SEG query sequence filtering, E-value threshold of 1e-7), the blastp alignments were combined and filtered using SOLAR 0.9.6 (pairwise gene alignment rate of 0.24), and the clustering was performed with Hcluster_sg 0.5.0 (minimum edge weight 10, minimum edge density 0.34).
A total of 12 single-copy TreeFam protein-encoding gene clusters (tufA, rpoC1, rps7, rps8, rps11, rps12, rps19, rpl2, rpl5, rpl14, rpl16, rpl20) from 26 Chlorophyta taxa were defined. The amino acid sequences were aligned with MUSCLE 3.7 51 and the ambiguously aligned regions were filtered out with BMGE 1.1 52 using the default parameters. Maximum-Likelihood phylogenetic inferences were run with PhyML 3.0 53 under the LG + G + I model of amino acid substitution (8 gamma categories), selected with the Model Selection (ML) module from MEGA 6.05 54 . A total of 100 non-parametric bootstrap replicates were performed, as implemented in PHYML.