The coordination of major events in C4 photosynthesis evolution in the genus Flaveria

C4 photosynthesis is a remarkable complex trait, elucidations of the evolutionary trajectory of C4 photosynthesis from its ancestral C3 pathway can help us better understand the generic principles of the evolution of complex traits and guide the engineering of C3 crops for higher yields. Here, we used the genus Flaveria that contains C3, C3–C4, C4-like and C4 species as a system to study the evolution of C4 photosynthesis. We first mapped transcript abundance, protein sequence and morphological features onto the phylogenetic tree of the genus Flaveria, and calculated the evolutionary correlation of different features; we then predicted the relative changes of ancestral nodes of those features to illustrate the major events during the evolution of C4 photosynthesis. We found that gene expression and protein sequence showed consistent modification patterns in the phylogenetic tree. High correlation coefficients ranging from 0.46 to 0.9 among gene expression, protein sequence and morphology were observed. The greatest modification of those different features consistently occurred at the transition between C3-C4 species and C4-like species. Our results show highly coordinated changes in gene expression, protein sequence and morphological features, which support evolutionary major events during the evolution of C4 metabolism.


Results
Transcriptome assembly and quantification. RNA-Seq data of 31 samples of 16 Flaveria species were obtained from the public database Sequence Read Achieve (SRA) of the National Center for Biotechnology Information (NCBI) ( Table S1). The 16 species represented two C 3 species, seven C 3 -C 4 intermediate species, three C 4 -like species and four C 4 species 19,21 (Table S1). On average, 42,132 contigs (from 30,968 to 48,969) were assembled with N50 ranging from 658 to 1208 bp among the 16 species (Table S2). The distribution of the contig length is similar in the 16 species with a peak at 360 bp (Fig. S1).
Since Flaveria is a eudicot genus, we used Arabidopsis thaliana (Arabidopsis) as a reference to annotate Flaveria transcripts. On average, 58.91% of Flaveria contigs had orthologous genes in Arabidopsis. Considering the large evolutionary divergence between Arabidopsis and Flaveria, which was estimated to be 120 million years (http:// timet ree. org/), we then estimated the accuracy of annotation by examining whether the transcripts annotated to be the same genes were from same orthologous groups. Specifically, we used OrthoFinder 29 to predict the orthologous groups based on annotated genes of Flaveria species and then calculated the consistence between our gene annotation and orthologous groups. Specifically, for each orthologous group, we calculated the percentage of genes with the same annotation. For example, for the orthologous group of PPDK(AT4G15530), 29 transcripts from Flaveria and one gene (AT4G15530) with six transcripts from Arabidopsis were clustered in this orthologous group. All of the 29 transcripts from Flaveria were annotated as AT4G15530 (Fig. S2A). Therefore, Investigate the possibility of samples used in this study being hybrid. Considering that intermediacy of traits in the intermediate species may result from a possible hybridization between species with one parent being C 3 and another parent being C 4 or intermediate species 30 , we investigated whether the intermediate species used in this study are intermediate species or a hybrid offspring. The hybrid offspring is characterized as expressing different alleles at one DNA site; therefore, we calculated the percentage of DNA sites that expressed different alleles, which is termed as mixed site. The percentage of mixed site was then compared to the positive background generated by pair-wisely mixing of RNA-Seq data of 16 Flaveria species. Our results showed that the known hybrid sample F. pringlei* originated from F. pringlei × F. angustiflolia in 28 showed significantly higher percentage of mixed site than background (Binomial test, P < 0.001), whereas, other species showed significantly lower percentages of mixed sites than background (Binomial test, P < 0.001) (Fig. 1). Thus, our data showed that species used in this study are not hybrids and can be used for evolutionary study.
We used the phylogenetic tree of the genus Flaveria 28 to illustrate the molecular evolution of C 4 photosynthesis. We numbered each node of the phylogenetic tree in an order as showed in Fig. 2, usually, the more ancient a node is, the lower number it will be given. The number begins with N1 which refers to the common ancestor of all Flaveria species in the phylogenetic tree, N3 refers to the common ancestor on all intermediate and C 4 species, at which intermediate species first evolved. N7 is the common ancestor of C 4 -like and C 4 species in clade A, where a completed C 4 cycle evolved. The nodes of clade B were numbered sequentially following clade A (Fig. 2).
The modified genes: genes showed differences in gene expression and protein sequence between C 3 species and C 4 species. We first identified the modified genes, which were defined as genes show differences in both gene expression and protein sequence between C 3 and C 4 species. We first calculated the differentially expressed (DE) genes between C 3 and C 4 species in the way of comparing three C 3 samples with 8 C 4 samples (see Methods), which resulted in 896 DE genes ("BH" correlated P < 0.05) (Additional file 3). We next investigated transcriptome-wide amino acid changes predicted from orthologues of C 3 and C 4 Flaveria species using the process shown in Fig. S5. Briefly, an amino acid difference was classified as a change if the orthologous from the two C 3 species (F. robusta and F. cronquistii) contained the same predicted amino acid at a position, but was different from the corresponding position in orthologs from at least two C 4 species (among F. kochiana, F. bidentis, F. trinervia and F. australasica), the detailed process of identifying amino acid changes were described in Supplemental methods. As a result, we obtained 1,018 genes encoding at least one amino acid  www.nature.com/scientificreports/ change between C 3 and C 4 Flaveria species. 56 out of these 1,018 genes also showed significantly differentially expression between C 3 and C 4 species, which was termed as modified genes.
The modified genes showed coordination in major changes in the C 4 evolutionary pathway in the genus Flaveria. In addition to C 4 pathway, cyclic electron transport chain (CET) and photorespiratory pathway are reported to be related to the evolution of C 4 photosynthesis. We manually selected genes from these three pathways from the literatures 23,24,31 and then tested whether genes from these pathways were significantly enriched in the 56 modified genes (Table S4). Results showed that genes related to C 4 photosynthesis pathway and genes related to CET were significantly enriched in the 56 modified genes. ("BH" correlated P < 0.05, Fisher's exact test, Supplemental methods). We systematically discussed these genes and their changes during C 4 evolution in Flaveria with gene expression and predicted protein sequences.
Genes encoding proteins associated with the C 4 pathway. Nine genes encoding proteins associated with the C 4 pathway were identified, including those encoding three C 4 cycle enzymes, PEPC, PPDK and NADP-ME, two regulatory proteins, PPDK regulatory protein (PPDK-RP) and PEPC protein kinase A (PPCKA), two aminotransferases, Alanine aminotransferase (AlaAT) and aspartate aminotransferase 5 (AspAT5), and two transporters, BASS2 and sodium: hydrogen (Na + /H + ) antiporter 1 (NHD1) ( Table 1). In terms of protein sequence, on average, 66.0% (from 33.33% to 86.7%) of amino acid changes in C 4 species occurred at N7 for all of the nine genes (Fig. 2, Figs. S6, Table 1). For example, PEPC in the C 4 Flaveria species had 41 predicted amino acid changes compared with those in the C 3 species, which were mapped onto the Flaveria phylogeny determined by Lyu et al. 28 . One of the predicted changes occurred at N6 (D396 in C 4 species, hereafter D396), and 34 occurred at N7 (Fig. 2A). The six other predicted amino acid changes occurred at N7 or after N7, although the incomplete assembly of PEPC transcripts from F. palmeri and F. vaginata did not allow resolution of the predicted amino acid sequences. These results suggest a major evolutionary event in the protein sequence at N7 for C 4 enzymes. In terms of gene expression, all nine genes showed higher transcript abundance in C 4 species than in C 3 species and a comparable level in C 4 -like and C 4 species (Table 1). To calculate the relative gene expression changes of each ancestral node, the FPKM values of each ancestral node were predicted and the relative difference were calculated (see Methods). In general, C 4 species showed a 7.6-fold to 123.6-fold of FPKM values compared with  www.nature.com/scientificreports/ C 3 species. Similar to the pattern of changes of protein sequences, seven of the nine genes showed that the biggest relative changes in gene expression occurred at N7, whereas, both NADP-ME and AlaAT showed the biggest relative changes at two nodes of N3 and N6 with comparable levels ( Fig. 2C and Fig. S6A). Our results therefore suggest that the genes encoding proteins associated with C 4 pathway showed highly coordinated modification patterns in protein sequence and gene expression at N3, N6 and N7 during the evolutionary change to C 4 photosynthesis, while the majority of the predicted amino acid changes occurs at the N7.
Genes encoding proteins involved in CET chain. We identified genes encoding nine proteins that function in the CET chain, namely, proton gradient regulation 5 like (PGR5-like), the chloroplast NAD(P)H dehydrogenase complex (Ndh) L2-2 (NdhL2-2), NdhV, Ndh18, NdhU, NdhM, Ndh48, NdhB4, and chlororespiratory reduction 1 (CRR1). The transcripts encoding all the nine proteins showed higher abundances in C 4 species than C 3 species ( Fig. 3 and Table 1). Compared to the genes encoded protein in the C 4 pathway, the genes encoding proteins involved in CET chain showed the biggest changes at diverse nodes rather than at a single node. Specifically, the major changes of predicted protein sequences occurred at N6 and N7 and that of FPKM occurred at N3, N7 and N8 (Table1). Besides this, the modification of protein sequence and FPKM appears to be less coordinated in genes involved in CET chain, for example, the major change of PGR5-like occurred at N6 in predicted protein sequence and at N7 in FPKM (Fig. 3A).
Genes encoding proteins in the photorespiratory pathway. The establishment of the photorespiratory pump (C 2 photosynthesis) is reported to be a prerequisite for the evolution of C 4 photosynthesis based on theoretical www.nature.com/scientificreports/ modeling 32 . Therefore, we also investigate genes involved in the photorespiratory pathway in terms of predicted protein sequence and FPKM. One protein involved in photorespiration was included in the 56 modified genes, namely, glutamine synthetase-like 1 (GSL1). Moreover, four other proteins in this pathway showed abundant amino acid changes between C 3 and C 4 species, namely, glycine decarboxylase complex (GDC) H subunit (GDC-H), serine hydroxymethyltransferase (SHM), glycerate kinase (GLYK), glutamine synthetase and glutamine oxoglutarate aminotransferase (GOGAT) (Fig. 3, Table 1). In general, the predicted amino acid substitution patterns of these five proteins were similar to those observed in the above-described proteins in C 4 pathways, with the major predicted amino acid changes in C 4 species occurring at N7 (Fig. 4, Table 1), e.g., 16 of 18 in GOGAT occurred at N7 (Fig. 4D). Generally, proteins in the photorespiratory pathway showed fewer predicted amino acid changes than those in the C 4 pathway.
The abundance of transcripts encoding these five photorespiratory enzymes was comparable to those in C 3 and C 3 -C 4 species, and higher than that in C 4 species (Fig. 4A-E). When compared with genes encoding C 4 pathway proteins, those encoding photorespiratory proteins showed larger differences between C 4 -like and C 4 species in clade A in terms of gene transcript abundance and protein sequence. The greatest reduction of FPKM in these www.nature.com/scientificreports/ five genes was observed at N7 (Fig. 4, Table 1). Thus, this suggested that the genes encoded proteins associated with the photorespiratory pathway also showed coordinated changes in protein sequence and gene expression during the evolution of C 4 photosynthesis, and with the largest number of changes occurring at N7.

Physiological and anatomical characteristics related to C 4 photosynthesis show coordinated changes along the C 4 evolutionary pathway in Flaveria.
To investigate whether C 4 related physiological characteristics also underwent coordinated changes during the evolution of C 4 photosynthesis in Flaveria, physiological characteristics taken from the literature 18,21,33 were mapped onto the Flaveria phylogeny (Fig. 5). The results revealed a step-wise change for most of the characteristics along the phylogenetic tree as previously suggested 15,18,21,33 (Fig. 5). However, coordinated and abrupt changes were observed for a number of features. A major change in CO 2 compensation point (Γ) in Flaveria was first seen at N3, where the most ancestral   (Fig. 5). The greatest changes in Γ in clade A occurred at N6, which showed a decrease in Γ from 24.1 μbar in F. angustifolia (C 3 -C 4 ) to 9.0 μbar in F. ramosissima (C 3 -C 4 ), followed by N7, where a decrease in Γ from 9.0 μbar in F. ramosissima (C 3 -C 4 ) to 4.7 μbar in F. palmeri (C 4 -like) was observed. The greatest decrease of Γ in clade B was observed between the two C 3 -C 4 species, F. floridana and F. chloraefolia (C 3 -C 4 ), where there was a decrease from 29 μbar to 9.5 μbar (Fig. 5). For photosynthetic water use efficiency (PWUE), photosynthetic nitrogen use efficiency (PNUE) and the slope of the response of the net CO 2 assimilation rate (A) versus Rubisco, the biggest changes occurred at N7 with increases of around twofold. In contrast, the percentage of 14 C fixed into four carbon acids showed no clear trend across the phylogenetic tree, although 3.91-fold and 1.76-fold increases were seen at N6 and N7, respectively. Interestingly, changes in all of these traits uniformly occurred at F. brownii in clade B, the only C 4 -like species within this clade. Consequently, those data suggest that although there were gradual changes in physiological features along the C 3 , C 3 -C 4 , C 4 -like and C 4 trajectory, there are apparent major modifications at N3, N6 and N7 in these physiological traits across the Flaveria phylogeny (Fig. 5). Anatomical traits 15,34 were mapped onto the Flaveria phylogeny to investigate how these features were modified along the evolution of C 4 (Fig. 5). For both the area of MCs and the ratio of the area of MCs to that of BSCs (M: BS), the greatest modifications across the phylogeny were found between F. brownii (C 4 -like) and F. floridana (C 3 -C 4 ), with a similar degree of change for both characteristics (2.7-fold, Fig. 5). Anatomical data for F. palmeri (C 4 -like) in clade A are not available, however, large differences in anatomical features were found between the C 4 -like F. vaginata and C 3 -C 4 F. ramosissima 15 . The modification of MC area first occurred at N2 which showed a 1.9-fold difference between F. robusta and F. cronquistii followed by a 2.1-fold of difference between F. ramosissima and F. vaginata. A major modification of the ratio of MC and BSC occurred at N2 with a 2.4-fold difference and followed by N6 with a 1.6-fold difference and N7 with a twofold difference. Therefore, similar to the evolutionary pattern of physiological features, large changes in anatomical features also emerged at N3, N6 and the transition between C 3 -C 4 species and C 4 -like species. Interestingly, the ultrastructure of BSCs chloroplasts showed an abrupt change at N7, with a dramatic decrease in grana thylakoid length and an increase in stroma thylakoid length, whereas these features were comparable in the species at the base of tree and in clade B 34 .
Coordinated changes of protein sequence, gene expression and morphology with a major evolutionary event at the transition between C 3 -C 4 and C 4 -like species during the evolution of C 4 species. Our above analysis showed that C 4 related genes and morphological features showed coordinated changes with an obvious major change at N7. Next, we asked whether species evolution also showed evolutionary coordination and major event during the evolution of species in protein sequence, gene expression and morphology. To answer this question, we calculated the divergence matrices for protein sequence, gene expression, and morphological features between F. cronquistii (at the most basal position on the Flaveria phylogenetic tree) and other Flaveria species. The protein divergence was calculated as the rate of non-synonymous substitutions (dN) of all the genes that were used to construct the Flaveria phylogenetic tree from 28 , the expression divergence was calculated as Euclidean distance of total expressed genes (see Methods), and the morphology divergence was calculated as Euclidean distance using previously coded morphology value from 16 , which includes 30 types www.nature.com/scientificreports/ of morphological traits, such as life history, leaf shape, head types. Our result showed a high linear correlation between the protein divergence, gene expression divergence and morphology divergence, in particular between gene expression divergence and morphology divergence (R 2 = 0.9) (Fig. 6A). Thus, our results suggest a coordinated evolution of protein sequence, gene expression and morphology during species evolution. The linear correlation of gene expression divergence vs morphology divergence and protein divergence vs morphology divergence reflects that both gene expression changes and protein sequence changes are related to morphological changes during evolution 35,36 . Moreover, gene expression changes may be more directly related to morphological changes than protein sequence changes 37 . It is likely that changes of developmental programs might be mainly due to changes in gene expression levels while changes in the protein sequences might contribute more to changes in metabolism. Next, we predicted the protein sequence, transcript abundance and coded morphology value of ancestral nodes, which were then used to calculate the relative change of the three parameters at each node (see Methods). Surprisingly, protein sequence and gene expression showed significantly more changes from N6 to N7 than changes during transitions between other nodes (P < 0.001, Tukey's test, "BH" adjusted, the same as following), Figure 6. Coordinated evolution of protein sequence, gene expression and morphological traits with apparent major changes. Significant linear correlation between protein divergence, gene expression divergence and morphology divergence were showed in (A). Protein divergence was calculated as non-synonymous mutation (dN). Expression divergence and morphology divergence were calculated as Euclidean distance based on quantile normalized FPKM values calculated here and coded morphology values from Mckown et al. 16 , respectively. All the relative divergences were the divergence between F. cronquistii and other Flaveria species. (B) The schema of Flaveria phylogenetic tree modified from Lyu, et al. 28 , branch length between ancestral nodes were labeled in grey. (C) The relative changes of each ancestral node compared with its earlier ancestral node in protein sequence, gene expression and morphological traits. P values are from One-way ANOVA analysis followed by Tukey's Post Hoc test and adjusted by Benjamin-Hochberg correction. The significant levels are: *: P < 0.05; **: P < 0.01; ***: P < 0.001. The bar colors in grey/blue/orange represent species from basal/clade A/clade B of phylogenetic tree, respectively. (D) Pearson correlations between changes of ancestral nodes in protein sequence, gene expression and morphological traits. www.nature.com/scientificreports/ and the morphology showed the most changes from N6 to N7 with a marginal significant P value (P = 0.06) (Fig. 6B,C). We found high correlations between divergences in protein sequence, gene expression and morphological traits (Fig. 6D), implying that evolutionary coordination of major events on whole transcriptomic level also occurred in species evolution. We then asked whether the major events were results of a long evolutionary time. We found positive correlations between the divergence (in protein sequence, gene expression and morphology) and branch length. For protein sequence, gene expression and morphological traits, their Pearson correlation coefficients with branch length are 0.35, 0.66 and 0.59, respectively (Fig. S7A). We then normalized the divergence between nodes to branch length to calculate change rate by assuming modification of protein sequence, gene expression and morphology have a linear relationship with evolutionary time. Results showed that N7 does not show the highest change rate, whereas, younger nodes tend to have higher change rates (Fig. S7B). For example, N14 showed higher modification rates than other nodes in gene expression and morphology. This is reasonable, because recovery mutations occurred during evolution, and modification rate was likely to be underestimated within a long evolutionary time. Besides, the relationship between amino acid substitution and evolutionary time may be non-linear but follow complex models 38 . Consistently, we found that change rates of protein sequence are neither correlated with that of gene expression nor with that of morphological traits. Whereas, change rates of gene expression are positively correlated with that of morphological traits (R2 = 0.91, P value < 0.001) (Fig. S7C). The large number of changes between N6 and N7 are likely a result of a long evolutionary time.

Discussion
Evolutionary coordination of different features towards a functional C 4 metabolism. Compared to C 3 photosynthesis, the evolution of C 4 photosynthesis resulted in the acquisition of many new features in gene expression, protein sequence, morphology and physiology (Figs. 2, 3, 4, 5) 39 . Coordinated changes on these features were required at key transitions. This is because although C 4 photosynthesis can gain higher photosynthetic energy conversion efficiency, highly specialized leaf and cellular anatomical features and biochemical properties of the involved enzymes are required. For example, increased cell wall thickness at the bundle sheath cell and decreased sensitivity of PEPC to malate inhibition are needed for C 4 plants to gain higher photosynthetic rates 40,41 . Furthermore, to gain higher photosynthetic efficiency in C 4 plants, the ratio of the quantities of Rubisco content in BSCs and MCs is also critical 42 . In theory, if the C 4 decarboxylation evolves before all of the other accompanying changes required or C 4 photosynthesis, leaves will experience high leakage, i.e., costing ATP for a futile cycle without benefit to CO 2 fixation. This will inevitably lead to lower quantum yield and a potential driving force for purifying selection. Further evidence for possible purifying selection comes from the observation that genes with cell-specific expression, such as PEPC, PPDK, and NADP-ME, displayed more changes in their predicted protein sequences than ubiquitously expressed genes, such as NDH components (Table 1,  Additional file 3). This is because, as discussed earlier, the redox environments between BSCs and MCs might have changed dramatically during the evolution of the C 4 cycle, with one of the most likely changes being a more acidic environment due to increased production of Oxaloacetic acid (OAA) and malate. Under such conditions, it is required for enzymes to alter their amino acid sequences to adapt to the new cellular environments. Concurrent changes of gene expression and protein sequence have also been demonstrated previously in animals 36,43 .
The identified changes in the protein sequences, including amino acid changes, insertions, and deletions ( Table 1, Figs. 2, 3, 4), may enable the enzymes or proteins to improve biochemical and regulatory properties to meet the demands of an altered cellular environment, for example, the increased fluxes through the C 4 cycle 44 . It is worth mentioning that some of these predicted amino acid changes have been reported to be functional, such as that the S774 and G884 residues in C 4 PEPC determines the high substrate affinity and low inhibitor affinity of this enzyme, respectively 25,45 . Besides this, many of the predicted amino acid changes are in residues that can be post-translationally modified, for example, six residues in PPDK changed to Serine (S) in C 4 species, which can all be target for phosphorylation and hence functional modification.
Major evolutionary events along the C 4 evolution in the Flaveria genus. Among the ancestral nodes leading to the C 4 emergence in clade A, N7, which is the most recent common ancestral node of C 4 -like and C 4 species in clade A, shows the biggest change in protein sequence, gene expression and morphology in both C 4 specific features and also non-C 4 specific features (Table 1, Figs. 2, 3, 4, 5, 6). There were also apparent changes in these features at N3 and N6. These three nodes reflect three critical stages in the emergence of C 4 metabolism. Firstly, at N3, i.e., during the emergence of C 3 -C 4 species, there were many changes in gene expression, protein sequence and morphology. One of the most important events during this phase is the re-location of GDC from MSCs to BSCs based on earlier western blot data 13,46 . Here we found that SHM showed decreased expression while most of other photorespiratory related enzymes showed little changes (Fig. 4). Similarly, at this step, the majority of the C 4 related genes showed little changes (Fig. 2). However, modification of gene expression levels and protein sequences on transcriptome level and no-C 4 morphology features suggests that there are a large number of changes at N3 (Fig. 6), and there is also greater decrease of CO 2 compensation point at this stage (Fig. 5).
C 3 -C 4 species were also reported in several other genera from both monocotyledonous and dicotyledonous plants, such as the monocotyledonous Alloteropsis 47 , Homolepis, Neurachne and the dicotyledonous Steinchisma 48 from monocot; Moricandia 49 , Heliotropium 50 and Mollugo 51 . These C 3 -C 4 species usually have a reduced CO 2 compensation point, enlarged BSC, and their GDC-P subunit is predominately expressed in BSC. C 3 -C 4 species in Moricandia do not show enhanced C 4 cycle, besides, 14 C labeling patterns for photosynthesis related metabolites are comparable to those in C 3 species 49 , which may be equivalent to Flaveria C 3 -C 4 species derived from N3. www.nature.com/scientificreports/ N6, i.e., during the emergence of type II C 3 -C 4 F. ramosissima and C 4 -like species. is the stage where we found the third largest number of changes in C 4 related features had occurred. At this stage, we observed large increase in transcript abundance in C 4 genes (Fig. 2 & Fig. S5) and photorespiratory genes (Fig. 4), and that a dramatic increase in the percentage of 14 C incorporated into the four carbon acids occurred (Fig. 5). The modification of photorespiratory genes might be related to the optimization of C 2 cycle to decrease CO 2 concentrating point, which can increase fitness of plants under conditions favoring photorespiration 52 . The concurrent modifications of C 4 enzymes, such as PEPC, NADP-ME, PEPCKA, amongst others, which are also involved in nitrogen rebalancing, is consistent with the notation that C 4 cycle might be evolved as a result of rebalancing nitrogen metabolism after GDC moving from MC to BSC 24 . The fact that there is little change in the δ 13 C in the C 3 -C 4 intermediate as compared to that of C 3 species suggests that the contribution of CO 2 fixation following evolution of the C 4 pathway is relatively minor, i.e., less than 15%, estimated based on an δ 13 C value of -27.6 in F. ramosissima (Fig. 5), again supporting that the initial role of increased C 4 enzymes is not for enhancing CO 2 fixation. It is worth pointing out here that the measured initial carbon fixation in the form of C 4 compound was 46% (Fig. 5); higher than those estimated based on the δ 13 C value. This is possibly because although malate releases CO 2 into BSCs as a result of the nitrogen rebalancing pathway, most of the CO 2 was not fixed by Rubisco, either due to the lack of sufficient Rubisco activity in BSCs or due to a lack of required low BSCs cell wall permeability to maintaining high CO 2 concentration in BSCs.
N7, i.e., during the emergence of C 4 -like and C 4 species, witnesses abrupt changes for both the gene expression and proteins sequence and morphology (Figs. 2, 3, 4, 6). The majority of the C 4 related genes showed the most modification in gene expression and protein sequence at N7, especially for genes in C 4 cycle and photorespiratory pathway. Moreover, N7, at which C 4 -like species (clade A) appear, represents a dramatic shift of CO 2 fixation from being dominated by a C 2 concentrating mechanism to being dominated by a C 4 concentrating mechanism. Based on the δ 13 C value in F. palmeri, the fixation through the C 4 concentrating mechanism is up to 93%, which is consistent with the measured proportion of initial carbon fixation in the form of C 4 compound (Fig. 5), suggesting at this step, the released CO 2 in the BSCs can be largely fixed by Rubisco, whereas, the transition between C 4 -like to C 4 process is an evolutionarily "down-hill" process and most optimization occurred through fine-tuning gene expression.
Genes from C 4 pathway and photorespiration displayed the major evolutionary changes at N7 both in gene expression and protein level (Table 1). At the same time, we found morphological features, protein sequence and gene expression, which were not necessarily related to C 4 photosynthesis also showed major changes at that stage (Fig. 6). This coincidence suggests that the C 4 and photorespiratory pathway may be a main driving force in the evolution in Flaveria.
The genus Flaveria has been used as a model system to study the evolution and regulation of C 4 photosynthesis [15][16][17] . Over the past 40 years, many labs conducted studies related to metabolism, physiology, anatomy, morphology, transcriptome and transcript regulation for different groups of Flaveria species 15,[22][23][24] . Here, we combined data from different aspects for 16 Flaveria species with a purpose to examine whether the different C 4 related traits evolve in a coordinated manner and at the same time to study the relationship between species evolution and C 4 photosynthesis evolution. Here we caution that Flaveria plants from different labs were grown under different conditions, data from different labs may be not in complete accord. For examples, the RNA-seq data used here were originally generated from two labs (Supplemental methods), plants were grown under different conditions and RNA-seq data were sequenced from different strategies. Hence, we performed additional normalization to the FPKM to make gene expression comparable among different samples. For DE genes between C 3 and C 4 species, we applied calcNormFactors function embed in edgeR 53 to calculate scaling factors and convert raw library sizes into effective library sizes. DE genes called here may be not exactly consistent with that called based on RNA-seq data from each one of the labs, but should reflect the major DE genes between C 3 and C 4 species.

Materials and methods
Data retrieval. RNA-Seq data of Flaveria species were downloaded from the Sequence Read Achieve (SRA) of the National Center for Biotechnology Information (NCBI). The source of RNA-seq data and plant grown conditions are detailed in Supplementary Methods. All accession numbers for RNA-Seq data are shown in Table S1. CO 2 compensation points (Γ) (except for F. kochiana), δ 13 C (except for F.kochiana), %O 2 inhibition of P max (except F. kochiana), and CO 2 assimilation rates were from 21 . Γ, δ 13 and %O 2 inhibition of F. kochiana were from 33 . Data for % initial C 4 products in total fixed carbon were from 54 . Data for PWUE, PNUE, and net CO 2 assimilation rate (A) versus Rubisco content were from 33 . Data for M area, M:BS ratio, vein density and number of ground tissue layers were from 15 15 with GetData (http:// www. getda ta-graph-digit izer. com). Mean values from 20 measurements were used. Ultrastructural data of BS cell chloroplasts were from Nakamura et al. 34 .
Transcriptome assembly and quantification. Transcripts of Flaveria species generated with Illumina sequencing were assembled using Trinity (version 2.02) 55 with default parameters (Table S1). Contigs of four Flaveria species from 454 sequencing data were assembled using CAP3 56 with default parameters. In all cases, only contigs of at least 300 bp in length were saved. Transcript abundances of 31 Flaveria samples were analyzed by mapping Illumina short reads to assembled contigs of corresponding species and then normalized to the fragment per kilobase of transcript per million mapped reads (FPKM) using the RSEM package (version 1.2.10) 57 . Functional annotations of Flaveria transcripts were determined by searching for the best hit in the coding sequence (CDS) dataset of Arabidopsis thaliana (Arabidopsis) in TAIR 10 (http:// www. arabi dopsis. org) by using BLAST in protein space with E-value threshold 0.001. If multiple contigs shared the same best hit in www.nature.com/scientificreports/ CDS reference of Arabidopsis, then the sum FPKM of those contigs was assigned to the FPKM value of the gene in Flaveria.
To estimate the consistence of Flaveria gene annotation, we used OrthoFinder 29 to predict the orthologous group based on the annotated Flaveria gene together with gene of Arabidopsis from TAIR10, and then calculated the consistence between gene annotation and orthologous group in two ways. (1) If the orthologous group contains Arabidopsis gene(s), the consistency was calculated as the percentage of genes that have the same annotation with the Arabidopsis gene(s). (2) If there the orthologous group does not contain Arabidopsis gene, we calculated the percentage of genes for each gene ID in this group, and the highest percentage was assigned to the consistency.
To make the FPKM comparable across different samples, we normalized the FPKM value by a scaling strategy as used by Brawand et al. 35 . Specifically, among the transcripts with FPKM values ranking in 20-80% region in each sample, we identified the 1000 genes that had the most-conserved ranks among 29 leaf samples, which were then used as an internal reference, and the transcript of each sample was normalized according to the mean value of these 1000 genes in the sample. We then multiplied all the FPKM values in all samples by the mean value of 1000 genes in the 29 leaf samples. The three samples from C 3 species and eight samples from C 4 species (Table S1) were used to recall differentially expressed (DE) genes applying edgeR 53 , and the Benjamini-Hochberg ("BH") procedure was used in multiple testing correction with a threshold of P ("BH" corrected) to be 0.05.
Investigation the species used in this study being from hybrid of two species. To investigate whether the intermediate species used in this study are from hybrid offspring of two species, DNA sites that expressed different alleles were identified, which termed as mixed sites. The mixed sites were identified based on RNA-Seq data as described in 28 . Hybrid offspring from two different species are expected to have higher percentage of mixed sites among all expressed sites than no hybrid species. To create a positive background of hybrid samples, RNA-Seq data of 16 species were pair-wisely mixed and their mixed sites were also identified. The mixed sites of the known hybrid sample F. pringlei* originated from F. pringlei × F. angustiflolia in 28 were also identified. The percentage of mixed sites was calculated as the ratio of mixed sites to the total expressed DNA sites in a certain sample.
Protein divergence, gene expression divergence and morphology divergence. Pair-wise protein divergence (dN) was calculated by applying codeml program in PAML package 58 by using F3X4 condon frequency. The input super CDS sequence was from the linked coding sequences (CDS) as used in construct phylogenetic tree of Flaveria genus 28 , which contains 2,462 genes. Gene expression divergence was calculated as Euclidean distance applying R package based on gene expression values (FPKM) of 12,218 genes. Encoded morphology values of 30 morphology traits were from 16 . The morphology divergence was calculated as Euclidean distance of morphology values. Expression and morphology values were normalized using quantile normalization applying preprocessCore package in R. Linear regression of pair-wise correlation was inferred apply lm function in R package.
Relative difference of each ancestral node in the phylogenetic tree. The protein sequences at the whole transcriptomic scale of ancestral node were predicted using FASTML 59 . The protein alignment was from 28 . Gene expression abundance and morphological characteristics of all ancestral nodes were predicted by applying ape package of R which uses a maximal likelihood method. For all C 4 related gene expression, protein sequences and physiological data, their values of the ancestral nodes were assigned to those of the most recent species derived from the node.
Relative difference of protein sequence at each ancestral node was inferred by comparing the sequence at this node (N) with the nearest preceding node of N (N[pre]), e.g., the number of different amino acid between N2 with N1 is the number of changed amino acid at N2. The number of different amino acid changes divided by the aligned length of the protein was calculated as relative protein difference for each gene. Relative difference of gene expression and morphology were calculated as (N -N[pre])/N[pre]. In most cases, the nearest preceding node of N[i] is N[i-1], there are two exceptions: the ancestral node of N11 is N5, and N10 is N8. One-way ANOVA analysis followed by Tukey's Post Hoc test was used to calculate the significance of relative difference between any two ancestral nodes. P values were adjusted by Benjamin-Hochberg (BH) correction.

Data availability
All data generated during this study are included in this published article and its supplementary information files. Codes used during the current study are available from the corresponding author on reasonable request. www.nature.com/scientificreports/