Evolution of folate biosynthesis and metabolism across algae and land plant lineages

Tetrahydrofolate and its derivatives, commonly known as folates, are essential for almost all living organisms. Besides acting as one-carbon donors and acceptors in reactions producing various important biomolecules such as nucleic and amino acids, as well as pantothenate, they also supply one-carbon units for methylation reactions. Plants along with bacteria, yeast and fungi synthesize folates de novo and therefore constitute a very important dietary source of folates for animals. All the major steps of folate biosynthesis and metabolism have been identified but only few have been genetically characterized in a handful of model plant species. The possible differences in the folate pathway between various plant and algal species have never been explored. In this study we present a comprehensive comparative study of folate biosynthesis and metabolism of all major land plant lineages as well as green and red algae. The study identifies new features of plant folate metabolism that might open new directions to folate research in plants.

The reverse reaction resulting in hydrolysis of 10-formyl-THF into THF and formate is carried out by 10-CHO-THF deformylase (10-FDF) (Fig. 2). Two 10-FDF genes were characterized in Arabidopsis and found to be crucial for photorespiration 32 .
There are two alternative ways to generate 5,10-CH + -THF. The first one involves 5-formyltetrahydrofolate cycloligase (5-FCL) that catalyses the conversion of 5-formyltetrahydrofolate (5-CHO-THF) into 5,10-CH + -THF (Fig. 2). Unlike other folate species, 5-CHO-THF is not used as a one-carbon (C1) donor but acts as a potent inhibitor of SHMT and several other enzymes of C1 metabolism 33 . Therefore, 5-CHO-THF is considered to be a regulator of C1 metabolism. The second way to generate 5,10-CH + -THF involves glutamate formiminotransferase (GFT), a folate-dependent enzyme that takes part in histidine degradation in mammals and some bacteria. GFT mediates the transfer of a formimino group from formiminoglutamate to THF, producing 5-formimino-THF which is then converted to 5,10-CH + -THF by the action of formiminotetrahydrofolate cyclodeaminase (FTHFC) (Fig. 2) 34 . Biochemical and genetic characterization of plant GFT is still poor 35 .
While biochemistry of plant folate biosynthesis and metabolism is considered well characterized, their compartmentalisation and regulation await further studies. The task could become utmost challenging considering that the regulation of folate metabolism might differ between species. Examples showing that biofortification strategies are not equally efficient for all crop species corroborate this notion 36 . Unfortunately, genes of folate biosynthesis and metabolism were characterised only in few model plant species. Increasing availability of genome-scale data provides an excellent opportunity to trace the origin of the pathway components in algae, to identify possible differences in regulation and compartmentalisation of folate metabolism in various plant species and to plan future biofortification approaches that will take into account individual features of folate metabolism of a given species. To identify the possible differences and to trace their emergence during evolution, we ventured into a comparative genomic and phylogenetic study including all major land plant lineages as well as red and green algae. Our study provides a comprehensive view on the folate biosynthesis and metabolism throughout the plant kingdom and points out novel aspects of folate metabolism. We demonstrate that bifunctionality of enzymes in folate biosynthesis and enzymes involved in interconversion of folate species is a common feature for algae and higher plant species. Our comparative study shows that the number of genes, localization and the structure of the isoforms they encode is highly conserved across algae and land plants. Moreover, our findings underscore Interconversion of folate species. THF, tetrahydrofolate; 5-CH 3 -THF, 5-methyltetrahydrofolate; 5,10-CH 2 -THF, 5,10-methylenetetrahydrofolate; 5,10-CH + -THF, 5,10-methenyltetrahydrofolate; 5-CHO-THF, 5-formyltetrahydrofolate; 10-CHO-THF, 10-formyltetrahydrofolate. SHMT, serine hydroxymethyl transferase; GDC, glycine decarboxylase complex, FTHFC, formiminotetrahydrofolate cyclodeaminase; 5-FCL, 5-formyltetrahydrofolate cycloligase; GFT, glutamate formiminotransferase; FTHFS, 10-CHO-THF synthetase; MTHFD-MTHFC, 5,10-CH 2 -THF dehydrogenase/5,10-CH + -THF cyclohydrolase; MTHFR, methylenetetrahydrofolate reductase; 10-FDF, 10-CHO-THF deformylase.
ADCs. Our comparative analysis shows that ADCS is encoded by a single copy gene in genomes of all analysed algal species as well as in most analysed higher plant species (Fig. 3). Being in line with reported targeting of Arabidopsis ADCS to chloroplasts 7 our prediction of the subcellular localization suggested that ADCS homologs in algae and higher plants possess chloroplast targeting signals (Fig. 3).
Examination of selected protein sequences for the presence of functional domains revealed that besides glutamine amidotransferase domain (GAT) and chorismate binding domains, algal and higher plant ADCS bear homology to the N terminal region domain of anthranilate synthase component I (alpha subunit) (Fig. 3). ADCS proteins from P. patens, P. trichocarpa, F. vesca, A. thaliana and C. rubella contain one or two additional domains (Fig. 3), possibly manifesting a process of neofunctionalization of ADCS within these species. The motif pattern (sequence within which motifs appear) is conserved throughout algal and land plant species (Supplemental Fig. 1).

ADCL.
While algae possess a single copy of ADCL, the majority of higher plant species contains two or three ADCL genes (Fig. 4). Like those of algae, the genomes of A. thaliana, C. rubella and B. rapa also encode a single ADCL gene, indicating loss of the second ADCL gene during evolution of Brassicaceae. Our phylogenetic study revealed that ADCL genes of higher plants fall into two clades (Fig. 4). Interestingly, each clade contains an ADCL homolog from every analysed higher plant species. The presence of an ADCL gene from the A. trichopoda genome in both clades suggests that the divergence of ADCL genes occurred before diversification of flowering plants. The prediction of the subcellular localization revealed that all proteins from the clade containing A. trichopoda_XP_006833333.1 have a chloroplast localization signal, whereas proteins from the clade containing the A. trichopoda_XP_011623669.2 isoform are exclusively cytosolic (Fig. 4). Moreover, the P. patens genome also encodes two cytosolic and one plastidial ADCL isoform, further confirming that duplication and diversification of ADCL genes occurred during early evolution of land plants (Fig. 4). The plastidial localization of GFP-tagged ADCL from Arabidopsis has been previously demonstrated 7 , while a confirmation for cytosolic ADCL remains unreported.
All analysed ADCL polypeptides contain a single Aminotransferase class I and II (PF00155.20) domain (Fig. 4). The motif pattern is well conserved among the analysed algal and higher plant species (Supplemental Fig. 2).

GTPCHI.
In silico analysis demonstrates that while algae have a single GTPCHI isoform, the majority of analysed higher plant species possesses two or more GTPCHI isoforms (Supplemental Fig. 3). Prediction of subcellular targeting suggested cytosolic localization of all analysed proteins, supporting the experimental data for Arabidopsis and tomato GTPCHI proteins 11 .
Our findings reveal the presence of two GTPCHI domains throughout algae and higher plant lineages, with the exception of L. usitatissimum GTPCHI that possesses three domains, the first being duplicated (Supplemental Fig. 3). Remarkably, the two GTPCHI domains show very different conserved motif composition, that further suggests their strong divergence (Supplemental Fig. 4).

HppK-DHps.
Genomes of the majority of analysed algae and higher plant species contain a single HPPK-DHPS gene (Supplemental Fig. 5). Prediction of their subcellular localization suggested mitochondrial localization of the majority of studied HPPK-DHPS isoforms (Supplemental Fig. 5). This finding is in line with experimentally determined mitochondrial targeting of HPPK-DHPS proteins in pea 15 and Arabidopsis 39 .
Our comparative study shows that the bifunctionality is conserved across algae and land plant species, as HPPK and two DHPS domains (DHPS1 and DHPS2) were identified in most analysed proteins (Supplemental Fig. 5). While all higher plant species possess two DHPS domains, some algae and lower land plant species (S. moellendorfii and P. patens) have only one DHPS domain. P. trichocarpa, C. rubella and S. bicolor proteins contain www.nature.com/scientificreports www.nature.com/scientificreports/ additional domains (Supplemental Fig. 5). Our in silico protein analysis indicates that the motif pattern is well conserved (Supplemental Fig. 6).

DHFS.
Phylogenetic analysis revealed that DHFS is encoded by a single copy gene in genomes of most analysed species, with the exception of G. max and M. truncatula genomes which are characterized by high polyploidy (Supplemental Fig. 7). Being in line with experimental evidence for Arabidopsis DHFS 39 , our analysis suggested that DHFS in higher plant species localizes to mitochondria, while in algae it can also localize to cytosol or plastids (Supplemental Fig. 7).
In silico analysis revealed that DHFS proteins contain the same set of functional domains as FPGS polypeptides (see below): FPGS1, FPGS2 and muramyl ligase (Supplemental Fig. 7). Although DHFS and FPGS share similar biochemical function, sequences of these two proteins diverged immensely. The divergence is further The numbers at the branching points indicate the percentage of times that each branch topology was found during bootstrap analysis (n = 1000). The box contains predicted functional domains. Schemes on the right represent domain organisation of analysed proteins (colored boxes represent functional domains, lengths of black lines correspond to lengths of proteins. The scale bar below shows protein containing 500 amino acids). Cyt, cytosolic localization; chl, plastidial localization. Indication of double localization (e.g. cyt chl) for a single protein implies its probable localization to both compartments.
www.nature.com/scientificreports www.nature.com/scientificreports/ illustrated by the protein sequence analysis which demonstrates that DHFS and FPGS proteins have very different patterns of conserved motifs (compare Supplemental Figs 8 and 11). It has been previously reported that Arabidopsis DHFS is more similar to the DHFS-FPGS gene from E. coli than to Arabidopsis FPGS genes 39 .

DHFR.
Our phylogenetic analysis demonstrates that single-copy DHFR-TS genes are present in algal genomes, while DHFR-TS in higher plant species is encoded by two or more genes (Supplemental Fig. 9). Prediction of subcellular localization suggested that DHFR-TS isoforms can be targeted to multiple compartments (Supplemental Fig. 9), corroborating our previous finding of multiple subcellular targeting of DHFR-TS isoforms in Arabidopsis 40 . The data in Supplemental Fig. 9 demonstrate that genomes of all analysed higher plant and algal species encode bifunctional DHFR-TS enzymes.
A previous study of Arabidopsis DHFR-TS demonstrated that one of the isoforms lacks enzymatic activity and acts as an inhibitor of its family members. Phylogenetic analysis suggested that such inhibitory isoforms may be also present in Arabidopsis' closest relatives, B. rapa and C. rubella 40 . The present study shows that the inhibitory isoforms lack the last protein motif, while all other analysed DHFR-TS proteins retain it (Supplemental Fig. 10). It is plausible that this motif is important for the DHFR and TS activities, possibly being essential for a proper protein conformation.
FPGS. The present study shows that, unlike algae, land plant species possess two or more FPGS isoforms.
Being in line with experimental data 39,41 our prediction of subcellular localization suggests that the isoforms  The numbers at the branching points indicate the percentage of times that each branch topology was found during bootstrap analysis (n = 1000). Schemes on the right represent domain organisation of analysed proteins (color boxes represent functional domains, lengths of black lines correspond to lengths of proteins. The scale bar below shows protein containing 500 amino acids). The box contains predicted functional domains. Cyt, cytosolic localization; chl, plastidial localization; mit, mitochondrial localization. Indication of double localization (e.g. cyt chl) for a single protein implies its probable localization to both compartments.
www.nature.com/scientificreports www.nature.com/scientificreports/ In previous work it was shown that mit AtFPGS, lacking the FPGS1 domain, could complement a yeast FPGS mutant 39 , indicating that the absence of this domain did not impair the activity. Prediction of targeting signals suggests that subcellular localization of FPGS proteins varies within each clade (Fig. 5). This indicates that localization signals of FPGS were gained in each lineage individually.
GGH. Our analysis shows that GGH is encoded by multiple genes in several algal and land plant genomes (Supplemental Fig. 12). Interestingly, GGH is the only enzyme of the folate pathway that is encoded by several genes in genomes of algae species. This observation corroborates the notion that the enzyme plays an important role in regulation of folate pool 24 and highlights the demand of a more profound investigation of its function. Being in line with previous studies of GGH enzymes from Arabidopsis and tomato 23,25 our subcellular localization analysis predicted existence of secretory signals in all analysed GGH proteins (Supplemental Fig. 12). Protein analysis revealed that all analysed GGH proteins possess a single peptidase C26 (PF07722.13) functional domain (Supplemental Fig. 12) and show a highly conserved motif pattern (Supplemental Fig. 13).
GDC and sHMt. The in silico analysis revealed that the GDC T-protein is encoded by a single gene in algal genomes, while land plant species possess single or multiple GDC T-protein genes (Supplemental Fig. 14). The predicted mitochondrial localization of the analysed GDC isoforms is in line with its role in photorespiration 42 . Protein analysis suggested the presence of two functional domains: an aminomethyltransferase folate-binding domain (PF01571.20) and a glycine cleavage T-protein C-terminal barrel domain (PF08669.10). Phylogenetic analysis suggested that genomes of both algae and land plants encode multiple SHMT isoforms that fall into three big clades. According to our prediction of subcellular localization, each clade comprises isoforms with either mitochondrial, cytosolic or plastidial localization, confirming previous findings in pea and potato 20,28 (Supplemental Fig. 15).

FTHFS.
Our comparative study demonstrates that FTHFS is encoded by a single gene in all examined species, with the exception of G. max, L. usitatissimum and P. trichocarpa that have two FTHFS genes (Supplemental Fig. 16). In agreement with the finding that FTHFS activity is associated with the cytosolic fraction of pea cotyledons, our prediction of subcellular localization suggested that FTHFS resides exclusively in the cytosol of all analysed algae and land plant species (Supplemental Fig. 16). Protein analysis revealed that all analysed FTHFS proteins possess a single formate-tetrahydrofolate ligase (PF01268.18) functional domain (Supplemental Fig. 16) and show a highly conserved motif pattern (Supplemental Fig. 17).  Fig. 26 illustrates a comparative analysis for 5-FCL. It is encoded by a single gene in algal genomes, in contrast to some higher plant species, which contain two 5-FCL coding genes. As P. patens genome contains two 5-FCL genes, its duplication presumably occurred during speciation of land plants or earlier. Prediction of subcellular localization has shown that the analysed 5-FCL proteins are mainly targeted to mitochondria but can also reside in plastids and in the cytosol (Supplemental Fig. 26). The prediction is in line with the experimentally determined mitochondrial localization of 5-FCL activity in pea 43 . All analysed proteins have a similar motif pattern (Supplemental Fig. 27) and bear a single 5-formyltetrahydrofolate cycloligase (PF01812.19) functional domain (Supplemental Fig. 26); no other domains could be identified.

5-FCL. Supplemental
GFT. GFT appears to be encoded by a single gene in algae, whereas genomes of land plants can contain single or multiple GFT genes (Supplemental Fig. 28). Subcellular localization prediction suggested that GFT isoforms predominantly reside in the cytosol in higher plants, but can also be targeted to both mitochondria and plastids, as for instance, in A. trichopoda, C. sinensis, L. usitatissimum, O. sativa, P. trichocarpa and Z. mays (Supplemental Fig. 28). Protein analysis demonstrated that GFT proteins possess a single formiminotransferase domain (PF07837.11) and show a well conserved motif pattern throughout algal and land plant lineages ( Supplemental  Fig. 29). G. raimondii, L. usitatissimum, C. sinensis and C. sativus bear an additional domain.

MTHFR.
Our comparative analysis illustrates that unlike those of algal species, genomes of most land plant species encode several MTHFR isoforms (Supplemental Fig. 30). Prediction of subcellular localization suggested that MTHFR of algae and land plant species reside in the cytosol (Supplemental Fig. 30). The cytosolic localization of MTHFR is in line with the assumption that the cytosol is the predominant location of the one-carbon flux from folate metabolism toward methionine production 44 . Analysis of protein sequences demonstrated that www.nature.com/scientificreports www.nature.com/scientificreports/ MTHFR of algae and land plants contain a single MTHFR functional domain (Supplemental Fig. 28) and show a well conserved motif pattern (Supplemental Fig. 31).

Evolution of DHFR-TS and HPPK-DHPS in plants versus other organisms.
In the green lineage, the assembly of the folate molecule occurs within the mitochondria (Fig. 1) and involves two bifunctional enzymes, DHFR-TS and HPPK-DHPS. This is not always the case in other organisms where these activities can be driven by monofunctional enzymes. It is well recognized that the fusion between DHFR and TS is a milestone in evolution, illustrating the separation between unikonts and bikonts 45 . To test whether the other bifunctional enzyme of the folate biosynthesis pathway follows the same evolutionary history as DHFR-TS, we compared the phylogenetic evolution of the TS and DHPS domains in wide range of organisms, including bacteria, protists (Alveolata, fungi, red algae (Rhodophyta), Rhizaria, stramenopiles), plants and animals.
ts. The phylogeny based on the maximum likelihood (ML) for the TS domain from 22 species representative of the various kingdoms clearly illustrates the evolutionary separation between organisms having monofunctional TS (unikonts) versus those having bifunctional DHFR-TS (bikonts) (Fig. 6).
Thus, the phylogeny of the TS domain clearly revealed a monophyletic origin of the bikont clade. To obtain a better idea of the relative distance between these various species, we represented the ML phylogeny in a three-dimensional space (Fig. 7).
This figure represents a 'sequence space' . It is the projection in 3D of the highly multidimensional space containing all the possible sequences of the TS domain 46,47 . To assign a position to each sequence in this three dimensional space, non-linear mapping methods are used to offer a configuration of points preserving as much as possible the distances observed between the various sequences in the original space (see Material and Methods). With such methods, axes become arbitrary and every rotation or symmetry is admissible 48 . The meaning of this representation is therefore essentially carried by distance: the distance between two points on the figure tends to display the distance between species. Links between points show the ML phylogenic tree, and thus the figure combines distance-based and ML-based phylogenies.
Such representation can provide additional information compared to the previous one. For example, Toxoplasma gondii, an apicomplexan having a bifunctional DHFR-TS, was positioned far from other apicomplexa but close to organisms such as Leishmania major, an excavate, a situation different from that seen in Fig. 6. This illustrates that two points relatively close in the sequence space are not necessarily close from an evolutionary point of view. For example, Bigelowiella natans and Mus musculus, or Toxoplasma gondii and Saccharomyces cerevisiae may appear close in the sequence space whereas they are not related in the ML tree. In other words, the evolutionary distances (as shown in the ML tree) can be large, despite relative similarities between the sequences. It is also clear that monofunctional (in blue) and bifunctional (in red) branches occupy a different space in this three-dimensional representation, indicating a totally independent evolution of the TS domain after its fusion with DHFR. In addition, the location of the Toxoplasma/Neospora group, far from that of the Plasmodium group in the sequence space could explain the relatively low support value of 0.57 (Fig. 6) calculated for the branch separation leading to Chlamydomonas on one hand and alveolates on the other hand. DHps. The maximum likelihood phylogenetic tree obtained for the DHPS domain appears very different from that of TS. In this case also, we identified organisms having a monofunctional DHPS and others having a bifunctional HPPK-DHPS (Fig. 8).
Animals, which depend on their diet for folate supply, do not possess DHPS activity. Most of the monofunctional DHPS are found in bacteria (unikonts), but some bacteria display also a bifunctional enzyme. From this point of view, it is interesting to note that within gamma proteobacteria there are species displaying monofunctional DHPS (Escherichia coli, Salmonella typhi) whereas others (Francisella philomiragia, Ricketsellia grylli) display a bifunctional HPPK-DHPS. This is also true for the alpha proteobacteria (containing Rickettsia bellii, Wolbachia), while beta proteobacteria (not represented here) and firmicutes (Bacillus subtilis, Staphylococcus aureus) contain mostly monofunctional DHPS. These data raise the question of the origin of the fusion of the two domains. Likewise, fungi (unikonts) also display a bifunctional HPPK-DHPS, as found in plants (bikonts). Thus, the separation unikonts/bikonts does not apply for this enzyme, suggesting that the evolutionary events leading to the two bifunctional enzymes HPPK-DHPS and DHFR-TS were completely different. The 3D representation ( Fig. 9) also shows a different situation compared with the one obtained with the TS domain. In this case, monofunctional and bifunctional DHPS do not occupy a different space. Monofunctional enzymes (blue branches) occupy a small space compared to the bifunctional HPPK-DHPS (red branches) which is widely spread within the entire sequence space. This might suggest more constraint on the monofunctional enzyme, which evolved less than the bifunctional one. It is interesting to note that in Fig. 7 as well as in Fig. 9, the green lineage is always compacted in a rather small space, whereas the apicomplexa are quite widely distributed. This is indicative of a much higher mutational rate in the latter compared with the former.

Discussion
Folate biosynthesis and metabolism in higher plants have been studied for almost two decades. Despite the wealth of genetic and biochemical evidence, many questions remain to be addressed. Particularly, the regulation and compartmentalisation of folate biosynthesis and metabolism in various plant lineages require scrutiny. Our comparative study encompasses all steps of folate biosynthesis and interconversion of folate species in land plants, as well as green and red algae and reveals novel aspects of folate metabolism that can support further experimental studies in the field and help to design effective biofortification strategies. An earlier comparative genomic analysis of bacterial and plant folate metabolism identified new bacterial GTPCHI and FPGS gene families and predicted a bacterial folate transporter 49 .
Our comparative study revealed that while algae possess single isoforms of the studied genes, plant species tend to have multiple isoforms regulating the same steps in folate metabolism. The multiple isoforms could derive from whole genome duplication or duplication of certain genomic regions, contributing to the development of organelle-specific features of folate metabolism of land plants. Interestingly, three steps of folate biosynthesis, namely ADCS, HPPK-DHPS and DHFS, are catalysed by single isoforms in most analysed land plant species. The restriction of the number of these genes to a single copy in land plant genomes might reflect the necessity for a tight regulation of the abundance of folate intermediates. Being in line with the known inhibition by folate Figure 7. Phylogenetic representation in the protein sequence space for TS domains. Non-linear mapping methods are designed to offer a configuration of points in this multidimensional space that is representative of the observed distances. With this method, axes become arbitrary and every rotation or symmetry is admissible. In this figure, the distance between two data points on the figure tends to display the distance between species. Links between points show the ML phylogenic tree presented in the Fig. 6   . Phylogenetic representation in the protein sequence space for DHPS domains. Non-linear mapping methods are designed to offer a configuration of points in this multidimensional space that is representative of the observed distances. With this method, axes become arbitrary and every rotation or symmetry is admissible. In this figure, the distance between two data points on the figure tends to display the distances between species. Links between points show the ML phylogenic tree presented in the Fig. 8 www.nature.com/scientificreports www.nature.com/scientificreports/ intermediates both at the transcript and the protein levels 50 , this notion suggests that biofortification strategies aiming to enhance folate content should involve elevation of the expression of the downstream genes of folate biosynthesis pathway, namely, ADCS, HPPK-DHPS and DHFS. Such an approach has been already successfully employed to elevate folate content in potato 51 .
It remains to be addressed why certain enzymes of folate biosynthesis are encoded by single isoforms in duplication-prone land plant genomes. Exceptions with two ADCS (in B. rapa, G. max and V. vinifera), HPPK-DHPS (G. max and L. usitatissimum) and DHFS (G. max and M. truncatula) are also intriguing. On the one hand, one can ascribe the existence of duplicated isoforms to genetic redundancy inherent to plant genomes. On the other hand, it is possible that the duplicated isoforms might implement a different function or reflect a difference in the regulation of folate biosynthesis. In the future, it will also be important to address whether the duplicated genes encode functional enzymes. Our analysis of conserved amino acids for several folate pathway enzymes, namely, GTPCHI, ADCS, HPPK-DHPS, DHFS and MTHFR, suggests that the isoforms included in our study bear enzymatic activity (Supplemental Figs 32-36). Interestingly, two out the five plant species that bear two copies of ADCS and HPPK-DHPS are naturally rich in folates, namely, B. rapa and G. max. Apart from ADCS and HPPK-DHPS, B. rapa genome bears several copies of GTPCHI. The B. rapa case echoes the biofortification strategy applied for potato, where ADCS, GTPCHI and HPPK-DHPS genes have been overexpressed 51 . It is tempting to speculate that such an enhancement approach might prove efficient for species that bear a single copy of the abovementioned genes. However versatile, Arabidopsis is not a flawless model to be used to study regulation of plant metabolism. As highlighted by the study of bifunctional DHFR-TS 40 and by the present study, Arabidopsis and other species from Brassicaceae, namely, C. rubella and B. rapa have unique features that are not shared with species from other land plant lineages. Thus, species from Brassicaceae bear enzymatically inactive DHFR-TS homologs that inhibit activity of active homologs, thereby regulating availability of THF 40 . This regulation seems not to be operating in other plant lineages. Brassicaceae members also exceptionally lack cytosolic ADCL isoform (Fig. 4). Moreover, Arabidopsis bears a cytosolic HPPK-DHPS isoform that might be involved in stress response 17 . Interestingly, this feature is unique to Arabidopsis and not shared even with other Brassicaceae species (Supplemental Fig. 5). It is tempting to assume that Arabidopsis might differ from species of other plant lineages in its regulation of folate biosynthesis. Taking this into consideration, drawing parallels from studies using Arabidopsis should be done with caution.
Prediction of subcellular localization allowed to identify novel aspects of folate biosynthesis and metabolism. Although it must be kept in mind that these in silico predictions have to be experimentally confirmed, our study suggests that ADCL can localize to the cytosol. The presence of the cytosolic isoform in every analysed land plant species, except species from Brassicaceae, might indicate an important function different from that in folate biosynthesis. To date, the cytosolic role of the enzyme remains unknown. It is tempting to speculate that cytosolic ADCL contributes to pABA production that so far has been only reported to occur in plastids in the plant cell. This assumption is in line with studies showing cytosolic localization of certain enzymes operating in the shikimate pathway that provides chorismate for pABA synthesis. If this scenario is operative, there should be a cytosolic conversion of chorismate to aminodeoxychorismate (ADC) similar to that performed by ADCS in plastids; alternatively, ADC can be exported from plastids. As a folate independent, stress-related function for the cytosolic isoform of HPPK-DHPS, an enzyme catalysing the successive step in the folate biosynthesis, has been suggested previously 52 , it is possible that the cytosolic ADCL might also fulfil a role in stress response. Generally speaking, the intracellular compartmentalization of folate biosynthesis established for Arabidopsis (depicted in Fig. 1) seems to be a general feature for land plants. However, the situation might be more complex in algae since the predicted localization of the enzymes appeared more fluctuating from one species to another. For example, there is no putative DHFR activity located in the mitochondria of most algae, and in K. flaccidum the entire folate biosynthesis pathway (except for the ADCS activity) could take place in the cytosol.
Analysis of subcellular compartmentalization of genes involved in interconversion of folate species suggests that the sources of one-carbon units in plant cells are specific for different subcellular compartments and this specificity is highly conserved across higher plant lineages (Fig. 10). Thus, conversion of 5-CHO-THF, a putative folate storage form, by 5-FCL and the flux of one-carbon units from 10-CHO-THF to 5,10-CH 2 -THF via the sequential action of 10-FDF and GDC are assumed to mainly occur in mitochondria. In the cytosol, one-carbon units appear to be mainly derived from the conversion of Ser to Gly catalysed by SHMT. However, the conversion offormate to 10-CHO-THF mediated by FTHFS and the conversion of N-formiminoglutamate into 5-formimino-THF catalysed by GFT can also serve as a source of one-carbon units. As plastids seem to lack the abovementioned enzymes, they are assumed to mainly obtain their C1-THF derivatives from the activity of SHMT and the transport of metabolically active folate forms from mitochondria and cytosol. Because of the present lack of reliable in silico tools for prediction of subcellular localization of algal proteins, localization of folate pathway enzymes in the algae species analysed cannot be unequivocally concluded. Presently, only experimental approaches are able to reveal the localization of enzymatic steps of folate biosynthesis and metabolism.
Our comparative analysis shows that HPPK and DHPS as well as DHFR and TS activities are coupled on a single bifunctional polypeptide across algae and land plant lineages. This is not the case in all organisms, and, as a matter of fact, the fusion between the DHFR and TS genes coincides with the appearance of biflagellate organisms. Indeed, unikonts, with one cilium, show monofunctional enzymes, while bikonts, with two cilia, possess bifunctional enzymes. Our results confirm this well-recognised milestone in the phylogenetic tree of evolution 45 . However, this is not the case for the fusion of the HPPK and DHPS genes which have a completely different evolutionary history compared with DHFR and TS. Indeed, the clade for the bifunctional HPPK-DHPS contains bacteria and fungi (unikonts) as well as plants and alveolates (bikonts). It is interesting to note that all eukaryotes analysed here were grouped in the clade containing the bifunctional enzyme (Fig. 8). Since HPPK-DHPS is a mitochondrial enzyme, it is possible that the bacteria at the origin of the endosymbiotic event that resulted in the advent of mitochondria already contained a bifunctional HPPK-DHPS, and that the gene coding for this www.nature.com/scientificreports www.nature.com/scientificreports/ enzyme was thereafter transferred to the nucleus. It was proposed that the ancient bacterium likely at the origin of mitochondria belonged to the α-proteobacteria class 53 . Since α-proteobacteria contain species displaying either monofunctional DHPS or bifunctional HPPK-DHPS such as those presented here (Rickettsia, Wolbachia), it would be interesting to know whích α-proteobacterial lineage provided the mitochondrial ancestor. Several studies hint towards the group of Rickettsiales, but this conclusion is today highly challenged 54,55 .
Furthermore, our study reveals that all analysed ADCS homologs, in addition to GAT and chorismate binding domains, possess an anthranilate synthase (AS) domain. Anthranilate synthase (AS) catalyses the first step in the biosynthesis of tryptophan from chorismate 56 . It is thus tempting to speculate that ADCS of algae and higher plant species are capable of anthranilate synthesis. If the AS domain of ADCS is functional, the protein might play a role in tryptophan synthesis that is known to occur in plastids of higher plants 57 and subsequently contribute to auxin biosynthesis, known to use Trp as a major precursor 58 . Alternatively, it may play a regulatory role therein. A possible contribution of ADCS to auxin synthesis might link the folate and auxin pathways. Interestingly, folate has previously been reported to influence auxin distribution during seedling development, together with sucrose 59 . Furthermore, it was recently shown that root gravitropism is regulated by a crosstalk between PABA (product of the ADCS + ADCL reaction), ethylene and auxin in Arabidopsis 60 .
In several higher plant species some folate pathway enzymes, namely, ADCS, HPPK-DHPS, plastidial MTHFD-MTHFC and GFT, bear additional functional domains. The ACT domain residing on the polypeptide chain of bifunctional MTHFD-MTHFC of L. usitatissimum is often found in proteins regulated by amino acids. It is tempting to speculate that this domain might be involved in regulation of the protein by Ser and Gly abundance. However, the domain is absent in all other analysed species, even in the closest relatives of L. usitatissimum. Apparently, this domain appeared in MTHFD-MTHFC as a result of a random insertion during genome rearrangement of L. usitatissimum. Presence of additional domains in ADCS, HPPK-DHPS and GFT cannot be immediately linked to a specific function. Moreover, the rare occurrence of the additional domains suggests that they occurred in result of random insertions.
In conclusion, the present comparative study revealed novel features of folate metabolism and emphasized the need for further understanding of its regulation in different plant species. Our study suggests that the subcellular localization of folate biosynthesis might be different between algae and land plants, indicating that the localization established in Arabidopsis might not be an absolute physiological requirement. Our phylogenetic analysis implies that duplication of genes controlling certain steps in the folate biosynthesis pathway might have served as a way to increase folate production in certain species. Consequently, the information on the gene numbers could be informative for planning biofortification strategies for related species. The analyses of DHFR/TS and HPPK/DHPS (Figs 6-9) suggest that different steps in the biosynthetic pathway might have a unique evolutionary history.
Once experimentally verified, our findings will be useful for development of species-specific biofortification strategies and better understanding of roles of folate metabolism in plant physiology. www.nature.com/scientificreports www.nature.com/scientificreports/

Materials and Methods
Identification of putative orthologs and protein analysis. To identify putative orthologs of folate pathway genes, full-length protein sequences from Arabidopsis thaliana were used as a query to run a blastp with E-value ≤ 1e-10 cutoff 37 . To infer putative orthology the selected protein sequences were scrutinized for the presence of functional protein domains using the Pfam database 38 and ScanProsite 61 . The conserved motifs were identified using the MEME suite 62 . Analysis of conserved amino acids was conducted using BLAST Conserved Domain Database tool 63 . phylogenetic analysis. Full-length amino acid sequences were aligned and manually corrected and edited in MEGA 7 64 using MUSCLE software 65 . The phylogenetic relationship was inferred using the maximum likelihood method implemented in RaxML 66 using WAG + G + I model that was selected by ProtTest 67 . The maximum likelihood tree was evaluated with 1000 bootstrap replicates. Phylogenetic trees were visualized using FigTree. The method described above was used to build phylogenetic trees for Figs 3-5 and Supplemental Figs 3,5,7,9,12,[14][15][16]18,[20][21][22]26,28,30. For the comparison of TS and DHPS domains from plants and other organisms, amino acid sequences were retrieved from NCBI resources, JGI genome portal or specific databases such as, plasmoDB and TAIR. Phylogenetic studies of the bifunctional enzymes DHFR-TS and HPPK-DHPS were performed with the largest domain in each case, i.e. TS and DHPS which have been identified using Conserved Domain Search Service v3.15 63 Parameters: Expect Value threshold = 0.01, Composition based statistics adjustment. Amino acid domains sequences were aligned in an iterative process using ClustalX software 68 . During the process, alignments were curated using both ClustalX 2.0 and Jalview 2 69 . The phylogenetic relationship was inferred using the maximum likelihood method PhyML 70 with the Smart Model Selection 71 under the Bayesian information criterion (BIC) 72 . The maximum likelihood tree was evaluated with aBayes 73 and phylogenetic trees were visualized using FigTree.
The phylogenetic trees obtained from the maximum likelihood method PhyML were represented in the protein sequence space, where each residue in the protein is represented by a dimension with 20 possible positions along that axis, corresponding to the possible amino acids 47,74,75 . This space is high-dimensional since its dimension is 20 to the power of the number of residues in the sequence. Nevertheless, it can be represented in a 3D space by using Multidimensional scaling methods which consider the distances between data in the original high dimensional space and associate a configuration of points in a lower dimensional space (the so called representation space, which is here a 3D Euclidean space) 76,77 . Obviously, distances cannot be always preserved and distortions appear 78 . Most often, the preservation of short distances is favored so as to preserve local properties. Consequently, axes have no specific meaning and maps must be considered as invariant by translation, rotation and symmetry. The information in maps is carried by distances between points. Here we used the Sammon's mapping 79 . It should be emphasized that multidimensional scaling offers an intuitive representation of the original data structure (the similarity between sequences). Distances between sequences were computed using the PAM250 matrix and the Phylip prodist program 80 . The method described above was used to build phylogenetic trees for Figs 6 and 8.