Crassulacean acid metabolism (CAM) is a water-use efficient adaptation of photosynthesis that has evolved independently many times in diverse lineages of flowering plants. We hypothesize that convergent evolution of protein sequence and temporal gene expression underpins the independent emergences of CAM from C3 photosynthesis. To test this hypothesis, we generate a de novo genome assembly and genome-wide transcript expression data for Kalanchoë fedtschenkoi, an obligate CAM species within the core eudicots with a relatively small genome (~260 Mb). Our comparative analyses identify signatures of convergence in protein sequence and re-scheduling of diel transcript expression of genes involved in nocturnal CO2 fixation, stomatal movement, heat tolerance, circadian clock, and carbohydrate metabolism in K. fedtschenkoi and other CAM species in comparison with non-CAM species. These findings provide new insights into molecular convergence and building blocks of CAM and will facilitate CAM-into-C3 photosynthesis engineering to enhance water-use efficiency in crops.
Crassulacean acid metabolism (CAM) is a metabolic adaptation of photosynthetic CO2 fixation that enhances plant water-use efficiency (WUE) and associated drought avoidance/tolerance by reducing transpirational water loss through stomatal closure during the day, when temperatures are high, and stomatal opening during the night, when temperatures are lower1. In the face of the rapidly increasing human population and global warming predicted over the next century, the outstanding WUE of CAM plants highlights the potential of the CAM pathway for sustainable food and biomass production on semi-arid, abandoned, or marginal agricultural lands2,3,4.
CAM photosynthesis can be divided into two major phases: (1) nocturnal uptake of atmospheric CO2 through open stomata and primary fixation of CO2 by phosphoenolpyruvate carboxylase (PEPC) to oxaloacetate (OAA) and its subsequent conversion to malic acid by malate dehydrogenase; and (2) daytime decarboxylation of malate and CO2 refixation via C3 photosynthesis, mediated by ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO)5,6. Malic acid is stored in the vacuole of photosynthetically active cells reaching a peak at dawn and can be used as a reference point to divide the two phases. CAM is found in over 400 genera across 36 families of vascular plants4 and is thought to have evolved multiple times independently from diverse ancestral C3 photosynthesis lineages7. The core biochemical characteristics of the CAM cycle are similar in all the plant lineages in which CAM has evolved, with some variation in the enzymes that catalyze malate decarboxylation during the day, and in the storage carbohydrates that provide substrates for malic acid synthesis at night8,9.
We hypothesize that convergent evolution in protein sequence and/or temporal diel gene expression underpins the multiple and independent emergences of CAM from C3 photosynthesis. Convergent evolution is generally defined as the appearance of similar phenotypes in distinct evolutionary lineages10. Although phenotypic convergence is widely recognized, its evolutionary mechanism has been extensively debated. Morris11 argues that the evolutionary course is not random but selection-constrained, along certain pathways, to arrive at the same solution or outcome. Recently, comparative genomics analysis began to provide new insight into the molecular mechanism of convergent evolution. For example, Foote et al.12 performed comparative genomic analyses of three species of marine mammals (the killer whale, walrus, and manatee) that share independently evolved phenotypic adaptations to a marine existence, and identified convergent amino-acid substitutions in genes evolving under positive selection and putatively associated with a marine phenotype. Also, Hu et al.13 compared the genomes of the bamboo-eating giant and red pandas, two obligate bamboo-feeders that independently possess adaptive pseudothumbs, and identified 70 adaptively convergent genes (i.e., under positive selection in these two species), of which nine genes, featuring nonrandom convergent amino-acid substitution between giant and red pandas, are closely related to limb development and essential nutrient utilization. These two examples indicate that specific amino-acid replacements at a small number of key sites can result in highly predictable convergent outcomes, supporting the constrained selection theory of Morris11. However, such predictable protein sequence convergence was not found in the convergence of hemoglobin function in high-altitude-dwelling birds, indicating that possible adaptive solutions are perhaps contingent upon prior evolutionary history14. This finding supports the contingent adaptation theory15 that evolution is contingent upon history and consequently replaying life’s tape will give different outcomes. In addition to protein sequence convergence, convergent changes in gene expression were found to be associated with convergent evolution of vocal learning in the brains of humans and song-learning birds16. Therefore, convergent changes in both protein sequence and gene expression are important aspects of the molecular basis of convergent evolution.
We sought to investigate whether changes in protein sequence and/or gene expression contribute to the evolutionary convergence of CAM through genome-wide screening for signatures of convergent changes in protein sequences and diel mRNA expression patterns that meet the following criteria: the signatures are (1) isomorphic in the CAM genomes of distant groups, such as eudicots and monocots, which diverged ~135 million years ago17, and (2) dimorphic in related C3 photosynthesis genomes. Recently, the genome sequences of two monocot CAM species, Ananas comosus (L.) Merr. (pineapple)18, and Phalaenopsis equestris (Schauer) Rchb.f. (moth orchid)19, were published. Here we present the genome sequence of Kalanchoë fedtschenkoi Raym.-Hamet & H. Perrier, which is an emerging molecular genetic model for obligate CAM species in the eudicots4,6,20. Our analyses reveal the genomic signatures of convergence shared between eudicot (represented by Kalanchoë) and monocot (represented by pineapple and orchid) CAM species.
Kalanchoë genome assembly and annotation
The diploid K. fedtschenkoi (2n = 2x = 34 chromosomes; Supplementary Fig. 1) genome size was estimated to be ~260 Mb (Supplementary Table 1). The K. fedtschenkoi genome was assembled from ~70× paired-end reads and ~37× mate-pair reads generated using an Illumina MiSeq platform (Supplementary Table 2 and Supplementary Fig. 2). The genome assembly consisted of 1324 scaffolds with a total length of 256 Mb and scaffold N50 of 2.45 Mb (Supplementary Table 3), in which we predicted and annotated 30,964 protein-coding genes (Supplementary Table 4).
The phylogenetic placement of Kalanchoë
Kalanchoë is the first eudicot CAM lineage with a genome sequence to date and serves as an important reference for understanding the evolution of CAM. In addition, K. fedtschenkoi is the first sequenced species in the distinct eudicot lineage, Saxifragales. Although the monophyly of this morphologically diverse order is well supported by molecular data, its phylogenetic placement has been less clear21,22. The recent consensus view, based mainly on analyses of plastid DNA sequences, has placed the Saxifragales as a sister group to the rosids, and together they comprise the large clade of superrosids23,24. However, there have been indications of conflict between trees based on plastid genomes and nuclear genomes for this clade19,24. Additionally, the major lineages of core eudicots are thought to have diversified rapidly following their first appearance, making resolution of the relationships among these clades particularly challenging17,25 and implicating incomplete lineage sorting (ILS) as a potentially important process that would result in discordance among gene histories26.
We performed phylogenetic analyses with 210 single-copy nuclear genes from 26 sequenced plant genomes using multiple phylogenetic inference strategies. The resulting species trees are congruent with each other except for the placement of K. fedtschenkoi, which was placed either as sister to the rosids in a phylogenetic tree reconstructed using a quartet-based coalescent species tree method (Fig. 1) or as sister to all other core eudicots as revealed by alternative phylogenetic trees reconstructed from (1) concatenated protein sequence alignment without gene partition using maximum-likelihood (Supplementary Fig. 3), (2) a partitioned analysis of multi-gene alignment using maximum-likelihood and Bayesian methods (Supplementary Fig. 4), and (3) analysis of individual gene trees using fully Bayesian multispecies coalescent method (Supplementary Fig. 5). Despite substantial discordance among estimated nuclear gene trees, the coalescence-based tree was consistent with the results of the plastome-based analyses, placing Kalanchoë as sister to the rosids (Fig. 1). Coalescent species tree estimation can account for gene tree discordance due to ILS27. At the same time, alternative placements of Kalanchoë as sister to the asterids, or as sister to all other core eudicots were observed in many gene trees (Fig. 1 and Supplementary Fig. 5). Gene tree discordance due to rapid diversification early in eudicot history has also been characterized by others24. Regardless of the optimal placement of the Saxifragales, including Kalanchoë, individual gene trees will often have alternative histories due to ILS in the face of rapid species diversification.
Kalanchoë genome duplication
The grape genome has no additional genome duplication after the ancestral gamma hexaploidization28,29 and is the best available reference for studying ancestral eudicot genome duplication events. Syntenic depth analyses30,31 showed that there are multiple K. fedtschenkoi blocks covering each grape gene (Fig. 2a and Supplementary Fig. 6). Specifically, 65% of the grape genome had from one to four syntenic blocks in K. fedtschenkoi. In contrast, a sudden drop in syntenic depth occurred after a depth of 4× (Fig. 2a), indicating that each grape genome region has up to four K. fedtschenkoi blocks and thus providing strong evidence for two distinct whole-genome duplications (WGDs) events in K. fedtschenkoi. The microsynteny patterns further support two WGDs on the lineages leading to K. fedtschenkoi. Specifically, the microsynteny pattern reflects a 1:4 gene copy ratio between the grape genome and the diploid K. fedtschenkoi genome (Fig. 2b).
From the Kalanchoë point of view, we found that 49% of the Kalanchoë genome was covered by one grape-Kalanchoë block, 7% covered in two grape-Kalanchoë blocks, and 1% covered in three grape-Kalanchoë blocks (Supplementary Fig. 7). This suggests that we could often find one best grape-Kalanchoë block out of the three gamma triplicated regions in grape. This fits the scenario that the gamma WGD predated the divergence and there has been no WGD in the grape lineage since grape-Kalanchoë diverged. Alternatively, if the divergence predated the gamma WGD, then from the Kalanchoë point of view we should instead see three matching grape regions. Hence, the grape-Kalanchoë genome comparisons strongly supported the gamma WGD as a shared event, and further supported the phylogenetic position of Kalanchoë in Fig. 1.
Despite two apparent WGDs in the K. fedtschenkoi lineage, synonymous substitutions per synonymous site (Ks) between duplicate gene pairs showed only one prominent peak ~0.35 (Supplementary Fig. 8). The unimodal distribution of Ks suggests the two WGD events occurring close in time. Similarly, two distinct peaks appear in the distribution of the four-fold transversion substitution rate (4dtv) values between the K. fedtschenkoi gene pairs (Fig. 2c). Grape-Kalanchoë gene pairs show a prominent peak around Ks = 1.5 (Supplementary Fig. 8), indicating that the WGDs in the K. fedtschenkoi lineage occurred well after its divergence from grape early in the history of the rosid lineage.
Gene co-expression modules and clusters in Kalanchoë
To elucidate gene function in K. fedtschenkoi, we performed a weighted correlation network analysis of transcript expression in 16 samples including 12 mature leaf samples collected every 2 h over a 24-h period and four non-leaf samples collected 4 h after the beginning of the light period, including shoot tip (leaf pair 1 plus the apical meristem), stem (between leaf pair 3 and leaf pair 8), root, and flower. Our analysis identified 25 co-expression modules, among which one module (MEblack containing 782 genes) was significantly (Student’s t-test, P < 0.001) associated with the leaf samples collected during the dark period (Supplementary Fig. 9), with an increase in transcript abundance at night (Supplementary Fig. 10). Several biological processes (e.g., carboxylic acid biosynthesis, terpene biosynthesis, and lipid metabolism) were over-represented (hypergeometric enrichment test, P < 0.05) (Supplementary Data 1), and several key genes encoding proteins involved in nocturnal CAM carboxylation and vacuolar uptake of malate such as Kaladp0018s0289 (β-CA), Kaladp0048s0578 (PEPC2), Kaladp0037s0517 (PPCK), Kaladp0022s0111 (MDH), and Kaladp0062s0038 (ALMT6) were present in this module (Fig. 3a, Supplementary Note 1 and Supplementary Table 5). These results suggest that genes in the co-expression module MEblack play important roles in the nighttime processes that define CAM. One alternate module (MEblue containing 1911 genes) was significantly correlated with the leaf samples collected during the day (Supplementary Fig. 9), with an increase in transcript abundance during the light period (Supplementary Fig. 10). Several biological processes (e.g., starch biosynthesis, coenzyme biosynthetic process) were over-represented (hypergeometric enrichment test, P < 0.05) in this module (Supplementary Data 1). One gene in the CAM decarboxylation process, Kaladp0010s0106 (PPDK-RP), belongs to this module (Supplementary Table 6).
We also performed cluster analysis on the CAM leaf time-course expression data for the transcripts that showed significantly (ANOVA of glm models where H0 = a flat line, P < 0.05) time-structured diel expression patterns as determined by a polynomial regression. Clustering of transcripts with time-structured expression identified 11 clusters (Supplementary Fig. 11 and Supplementary Table 7). Networks constructed for each cluster implicated highly connected hub genes and their direct or indirect interactions with CAM-related genes (Supplementary Data 2). For example, cluster 7, which contains PEPC1 (Kaladp0095s0055) and PPCK2 (Kaladp0604s0001), has a zinc-finger protein CONSTANS-like gene as a central hub (Supplementary Data 2). CONSTANS-like genes are part of the circadian clock regulatory network32. Similarly, multiple REVEILLE transcripts, which encode transcription factors for genes with evening elements in their promoters33, are hubs in cluster 4 that contains NADP-ME genes (Kaladp0092s0166) (Supplementary Data 2).
Overview of genes that have undergone convergent evolution
To determine the possibility that the diel reprogramming of metabolism that distinguishes CAM from C3 photosynthesis was achieved, at least in part, by convergent shifts in diel patterns of gene expression, we performed comparative analysis of diel transcript abundance patterns in CAM and C3 photosynthesis species. Specifically, we compared the diel expression patterns of 9733 ortholog groups of genes from K. fedtschenkoi (eudicot, CAM photosynthesis), A. comosus (monocot, CAM photosynthesis), and Arabidopsis thaliana (eudicot, C3 photosynthesis), with transcript abundances >0.01 FPKM in mature leaf samples collected at six or more diel time points. Sampling time points included dawn (22, 24, and 2 h from the start of the light period), midday (4, 6, and 8 h from the start of the light period), dusk (10, 12, and 14 h from the start of the light period), and midnight (16, 18, and 20 h from the start of the light period) (Fig. 4a). A gene from K. fedtschenkoi was defined as having undergone convergent evolution of gene expression if it met all of the following criteria: (1) its diel transcript expression pattern was highly correlated (Spearman’s rank correlation coefficient, r > 0.8) with those of at least one of the orthologs in A. comosus, but not highly correlated (r < 0.5) with those of any of the orthologs in A. thaliana; (2) it displayed a significant difference (false discovery rate <0.01) in transcript abundance either between midday and midnight (e.g., Fig. 4b), or between dawn and dusk (e.g., Fig. 4c); and (3) the time shift between K. fedtschenkoi and A. comosus transcript time-courses was less than or equal to 3 h, whereas the time shifts between CAM species (K. fedtschenkoi and A. comosus) transcripts and their A. thaliana ortholog transcript were equal to or greater than 6 h. Based on these criteria, 54 K. fedtschenkoi genes were identified as candidates for involvement in the convergent shift in diel gene expression patterns specific to the two CAM species relative to A. thaliana (Supplementary Note 2, Supplementary Data 3 and Supplementary Table 8).
To identify genes that had likely undergone convergent evolution in protein sequence in the CAM species, we reconstructed gene tribes based on protein sequences from the species listed in Supplementary Fig. 4. We then created phylogenetic trees for the genes from all tribes that include at least one gene from each of the 13 studied species (Supplementary Table 9). A K. fedtschenkoi gene was defined as having undergone convergent evolution in protein sequence if it met all of the following criteria: (1) the K. fedtschenkoi gene is clustered with gene(s) from at least one of the two monocot CAM species (A. comosus and P. equestris) in a phylogenetic clade containing no genes from C3 or C4 photosynthesis species; (2) convergent amino-acid changes were detected between the K. fedtschenkoi gene with gene(s) from at least one of the two monocot CAM species; and (3) the K. fedtschenkoi gene shared at least one amino-acid mutation with its ortholog in at least one of the two monocot CAM species, as compared with C3 and C4 photosynthesis species. A total of four K. fedtschenkoi genes showing convergent changes in protein sequences were identified (Supplementary Figs. 12–15 and Supplementary Table 10).
We also performed genome-wide positive selection analysis in each of the three CAM species (i.e., A. comosus, P. equestris, and K. fedtschenkoi) in comparison with 21 non-CAM species (Supplementary Method 1) and identified two genes that were under positive selection in the dicot CAM species K. fedtschenkoi and one of the monocot CAM species (Supplementary Figs. 16–17).
Convergent evolution of genes involved in CO2 fixation
PEPC is a key enzyme for nocturnal CO2 fixation and PPCK is a pivotal protein kinase that regulates PEPC in response to the circadian clock in CAM plants4,6,34. PPCK phosphorylates PEPC in the dark (Fig. 5a) and thereby reduces malate inhibition of PEPC activity, promoting nocturnal CO2 uptake35,36. Multiple PPCK genes were identified in the K. fedtschenkoi genome, among which two genes (Kaladp0037s0517 and Kaladp0604s0001) showed higher transcript abundance than the others in CAM leaves (Supplementary Table 5). The diel expression patterns of the most abundant PPCK transcripts in K. fedtschenkoi (Kaladp0037s0517.1) and A. comosus (Aco013938.1) were highly correlated, with only a 1.5-hour time shift between them, whereas both showed an ~11-hour time shift relative to their best matched ortholog in Arabidopsis (AT1G08650) (Fig. 4b and Supplementary Table 8). Peak PPCK transcript abundance was shifted from daytime in C3 photosynthesis species (Arabidopsis) to nighttime in the two CAM species (Fig. 4b), which suggests convergence and is consistent with PPCK activation of PEPC-mediated nocturnal CO2 fixation. Among the PEPC genes identified in K. fedtschenkoi, Kaladp0095s0055 and Kaladp0048s0578 showed higher transcript abundance than the others (Supplementary Table 5). Kaladp0095s0055 (named PEPC1 herein) was an abundant transcript throughout both the light and the dark period, with its peak transcript level phased to dusk. The second most abundant PEPC transcript (Kaladp0048s0578, named PEPC2 herein) showed a much higher transcript level during the dark period than during the light period (Fig. 5b). We found that a duplicated pair of K. fedtschenkoi PEPC2 genes (Kaladp0048s0578 and Kaladp0011s0355) clustered together with a PEPC gene (PEQU_07008) from P. equestris (Supplementary Fig. 12). PEQU_07008 was recently reported as the CAM-type PEPC in P. equestris, and, like Kaladp0048s0578, this orchid PEPC gene also showed higher transcript abundance during the dark period than during the light period37.
Convergent changes in PEPC2 protein sequence were found between K. fedtschenkoi and P. equestris (Fig. 6a, b). Specifically, multiple protein sequence alignment revealed that an aspartic acid residue (D509) in Kaladp0048s0578 is conserved in PEQU_07008 and Kaladp0011s0355 (a duplicated copy of Kaladp0048s0578), but there was an arginine (R), lysine (K), or histidine (H) in the corresponding sites of the PEPC protein sequences of other tested species (Fig. 6c and Supplementary Fig. 12). The structural model of the Kaladp0048s0578 protein indicates that this single amino-acid substitution (from a basic amino-acid R/K/H to an acidic amino-acid D) is located in an α-helix adjacent to the active site in a β-barrel (Fig. 7a). We hypothesize that an activator binds to the active site of one subunit of the tetrameric complex of PEPC2, leading to allosteric conformational changes that subsequently activate another subunit of the tetramer (Fig. 7b). This model was supported by a recent crystallography structure of the Flavaria trinervia (a C4 photosynthesis plant) PEPC with an activator glucose-6-phosphate (G6P) bound at the β-barrel active center38. Based on this model, because D509 of PEPC2 (Kaladp0048s0578) is also negatively charged as G6P, the observed substitution may play a similar role as the activator by triggering allosteric conformational changes that lead to activation of the other subunits of PEPC tetramer. Nimmo39 reported that PEPC is subject to posttranslational regulation in the dark via phosphorylation by PPCK. In vitro analysis of the activities of different heterologously expressed PEPC isoforms showed that without phosphorylation by PPCK, PEPC1 from K. fedtschenkoi had a much lower activity than PEPC2 from either K. fedtschenkoi or P. equestris (Fig. 6d). Further, the R515D mutation significantly (Student’s t-test, P < 0.01) increased the activity of K. fedtschenkoi PEPC1, whereas the D509K and D504K mutations significantly (Student’s t-test, P < 0.01) reduced the activities of K. fedtschenkoi PEPC2 and P. equestris PEPC2, respectively (Fig. 6d). These results indicate that a single amino-acid mutation could significantly modify PEPC activity.
Our evolutionary analyses did not detect convergent evolution in either protein sequence or diel transcription patterns for the various decarboxylation genes that are expressed in Kalanchoë and A. comosus. In Kalanchoë, NAD(P)-ME genes were highly expressed, whereas the expression of the PEPCK gene was very low (Supplementary Fig. 18), consistent with the known high extractable activities of NAD-ME and NADP-ME in CAM leaves of Kalanchoë 40,41. By contrast, in A. comosus the transcript abundance of PEPCK was much higher than that of malic enzyme (ME) (Supplementary Fig. 18), supporting the model that malate decarboxylation in Kalanchoë is mediated by ME, which was recently substantiated using a transgenic RNAi approach20,40, whereas in pineapple a combination of MDH, working in the OAA-forming direction, coupled with PEPCK, converting OAA to PEP and CO2, are the candidate decarboxylation enzymes18, consistent with previous enzyme activity studies8.
Convergent evolution of genes involved in stomatal movement
A unique feature of CAM physiology is the inverted light/dark pattern of stomatal movement relative to C3 photosynthesis, with stomata opening during the night in CAM and during the day in C3 photosynthesis plants6. Blue light is a key environmental signal that controls stomatal opening and phototropin 2 (PHOT2; AT5G58140), a blue light photoreceptor, mediates blue light regulation of stomatal opening in Arabidopsis 42. Twenty genes that could potentially be involved in stomatal movement in K. fedtschenkoi were predicted based on homology to Arabidopsis genes involved in the regulation of stomatal movement (Supplementary Table 11). One of these genes, Kaladp0033s0113, which encodes PHOT2, showed only a 1-h time shift in transcript abundance pattern relative to its A. comosus ortholog (Aco014242) (Supplementary Table 8), possibly indicating a convergent change in the diel pattern of its transcript abundance pattern in the two CAM species. In support of a convergent evolution hypothesis, the transcript abundance patterns of the two PHOT2 genes in the CAM species showed 11- (Kalanchoë) and 9- (pineapple) hour phase shifts, respectively, relative to that of the PHOT2 gene (AT5G58140) in the C3 photosynthesis species Arabidopsis (Fig. 4c). The timing of peak transcript abundance shifted from dawn in Arabidopsis to dusk in the two CAM species (Fig. 4c). This convergent change in diel transcript abundance pattern suggests that PHOT2 might contribute to the inverted day/night pattern of stomatal closure and opening in CAM species such that PHOT2 might function as a switch mediating the blue-light signal to open the stomata at dusk and the stomata could then remain open during the dark period.
Convergent evolution of genes involved in heat tolerance
The stomata of mature CAM leaves of K. fedtschenkoi close for the majority of the light period40, which may exacerbate the internal heat load on the leaves43. Photosynthesis is sensitive to heat stress and can be inhibited long before other symptoms of heat stress are detected44. Numerous studies have shown that the inhibition of photosynthesis by moderate heat stress is a consequence of RuBisCO deactivation, caused, in part, by the thermal instability of RuBisCO activase45. Heat-shock proteins can play a critical role in the stabilization of proteins under heat stress conditions46. Wang et al.47 reported that HSP40 (SlCDJ2) contributed to the maintenance of CO2 assimilation capacity mainly by protecting RuBisCO activity under heat stress and that HSP70 (cpHsp70) acted as a binding partner for SlCDJ2 in tomato. HSP70 can also function as nano-compartments in which single RbcL/RbcS subunits can fold in isolation, unimpaired by aggregation48, as illustrated in Fig. 8a. Among the HSP70 genes predicted in K. fedtschenkoi, Kaladp0060s0296 displayed peak transcript abundance in the morning, with only a 1-h shift in diel transcript abundance pattern relative to its A. comosus ortholog Aco031458, whereas these two HSP70 genes in the CAM species showed ~10-h shifts in diel transcript abundance pattern relative to their best-matched A. thaliana ortholog, AT5G02490 (Fig. 8b and Supplementary Table 8), suggesting that HSP70 has undergone convergent changes in diel transcript expression patterns during the evolution of CAM.
Convergent evolution of genes in the circadian clock
Key physiological and biochemical features of CAM including net CO2 exchange and PEPC phosphorylation are well established as outputs of the circadian clock, displaying robust oscillation under free-running constant conditions20,40. Thus, the circadian clock could be a key regulator of the diel reprogramming of metabolism and stomatal function that defines CAM. The molecular basis of circadian rhythms has been studied extensively in non-CAM species33. Based on homology to Arabidopsis genes that have been shown to play important roles as molecular components of the circadian clock, 35 K. fedtschenkoi genes were predicted to be involved in circadian rhythms (Supplementary Table 12). None of these K. fedtschenkoi genes are among the list of genes showing convergent changes in diel expression pattern (Supplementary Data 3), suggesting that CAM evolution did not involve major changes in the diel expression pattern of these known circadian rhythm genes shared between Arabidopsis and K. fedtschenkoi. However, we cannot rule out the possibility of convergent evolution in unknown circadian rhythm genes between these two species. Also, it is possible that genes that are not involved in circadian rhythms in Arabidopsis could have taken on this function in K. fedtschenkoi. On the other hand, Kaladp0060s0460, which encodes ELONGATED HYPOCOTYL5 (HY5), showed a convergent change in protein sequences between K. fedtschenkoi and P. equestris (Supplementary Table 10). HY5 is a bZIP family transcription factor in the blue light signaling pathway that acts as an input to entrain the circadian clock33 (Fig. 9a). A single amino-acid mutation (E-to-R) occurred in the C-terminal bZIP domains of the proteins encoded by Kaladp0060s0460 and its P. equestris ortholog PEQU_13446 as compared with HY5 from C3 or C4 photosynthesis species (Fig. 9b and Supplementary Fig. 14). The bZIP domain determines the DNA-binding ability of HY5 as a transcription factor49, mediating the interaction between HY5 and G-BOX BINDING FACTOR 150. HY5 has been shown to move from shoot to root to coordinate aboveground plant carbon uptake in the leaf and belowground nitrogen acquisition in the root51. Therefore, the potential roles of HY5, Kaladp0060s0460, in circadian rhythmicity and shoot-to-root communication in K. fedtschenkoi needs to be investigated using experimental approaches such as loss-of-function mutagenesis52.
Convergent evolution of genes in carbohydrate metabolism
Nocturnal production of phosphoenolpyruvate (PEP) as a substrate for dark CO2 uptake represents a substantial sink for carbohydrates in CAM plants, which has to be balanced with the provision of carbohydrates for growth and maintenance53. Carbohydrate active enzymes (CAZymes) play critical roles in regulating carbohydrate synthesis, metabolism, and transport in living organisms. There are six CAZyme classes: glycoside hydrolases (GHs), glycosyltransferases (GTs), polysaccharide lyases, carbohydrate esterases, auxiliary activities, and carbohydrate-binding modules. Each of these classes contains from a dozen to over one hundred different protein families based on sequence similarity54. The six classes of CAZymes have different functions. For example, GH enzymes catalyze the hydrolysis of glycosidic bonds, while GT enzymes catalyze the formation of glycosidic bonds. Using CAZyme domain-specific hidden Markov models, defined in the dbCAN database55, we identified 100 CAZyme families, including 1093 genes in the K. fedtschenkoi genome, comparable to the total number (1149) of CAZyme genes in A. thaliana (Supplementary Data 4 and 5). Among these CAZyme genes, four ortholog groups (ORTHOMCL68, ORTHOMCL93, ORTHOMCL207, and ORTHOMCL9830) of genes (e.g., Kaladp0550s0020, Kaladp0011s0363, Kaladp0037s0421, Kaladp0055s0317, respectively) belonging to the CAZyme families GH100, GT20, GT2, and GT5, respectively, displayed convergent changes in their patterns of diel transcript abundance in two CAM species (K. fedtschenkoi and A. comosus) compared with the C3 photosynthesis species (A. thaliana) (Supplementary Data 3). Specifically, the K. fedtschenkoi CAZyme genes with convergent changes in diel transcript abundance pattern (e.g., Kaladp0550s0020 [GH100], Kaladp0011s0363 [GH20], Kaladp0037s0421 [GT2], and Kaladp0055s0317 [GT5]) showed higher transcript abundance in the dark and early light period (Supplementary Fig. 19). In particular, two genes (Kaladp0011s0363 and Kaladp0055s0317) were predicted to be involved in starch and sucrose metabolism (Supplementary Fig. 20). Kaladp0011s0363 encodes a probable trehalose phosphate synthase. Trehalose 6-phosphate is an important sugar signaling metabolite and is thought to link starch degradation to demand for sucrose and growth56. Kaladp0550s0020 encodes an alkaline-neutral invertase that catalyzes the hydrolysis of sucrose to glucose and fructose. This invertase has also been implicated in metabolic signaling processes as an important regulator of plant growth and development57. Taken together, these data suggest that the evolution of CAM from C3 photosynthesis requires re-scheduling of the transcription of metabolic and signaling genes that regulate the partitioning of carbohydrates between reserves that provide substrates for CAM and carbohydrates required for growth.
In addition to the above convergent changes in expression pattern of four CAZyme genes, we also identified convergent changes in protein sequences of another two CAZyme genes (Kaladp0016s0058 [GT29] and Kaladp0067s0114 [GH35]) that were under positive selection (CodeML implemented in PosiGene58, P < 0.05) in the dicot CAM species K. fedtschenkoi and one of the two monocot CAM species (A. comosus and P. equestris) (Supplementary Figs. 16–17). Kaladp0016s0058 encodes a putative sialyltransferase-like protein. Two single amino-acid mutations were found in Kaladp0016s0058 and its A. comosus ortholog Aco018360, as compared with the orthologous protein sequences of non-CAM species (Supplementary Fig. 16). These two mutations are close to each other (i.e., within a four-amino-acid distance), suggesting the possibility that the two mutations affect the same functional domain. Kaladp0067s0114 encodes a beta-galactosidase protein that hydrolyses the glycosidic bond between two or more carbohydrates. Two single amino-acid mutations were identified in Kaladp0067s0114 and its P. equestris ortholog PEQU_04899, as compared with the orthologous protein sequences of non-CAM species (Supplementary Fig. 17). These two mutations are close to each other (i.e., within an 11-amino-acid distance) in the middle of galactose-binding domain (Supplementary Fig. 17), which can bind to specific ligands and carbohydrate substrates for enzymatic catalytic reactions59. The relevance of these convergent changes in protein sequence to CAM evolution needs further investigation.
The CAM pathway has been found in 36 families of vascular plants4, among which Crassulaceae plays a unique role in CAM research because the pathway was first discovered in this succulent plant family and was thus named60. Within Crassulaceae, the genus Kalanchoë has been the most widely used for CAM research. As a model species for research into the molecular biology and functional genomics of CAM, K. fedtschenkoi stands out due to its relatively small genome, low repetitive sequence content, and efficient stable transformation protocols20. The genome sequence presented in this study renders K. fedtschenkoi as a new model for plant evolutionary and comparative genomics research, both for CAM photosynthesis and beyond. Although this study focused on genome-wide analysis of convergent evolution in CAM plants, the K. fedtschenkoi genome data can be used to facilitate CAM research related to: (1) generating loss-of-function mutants for functional characterization of CAM-related genes using genome-editing technology; (2) deciphering the regulation of CAM genes through identification of transcription factors and promoters of their target genes; (3) analyzing CAM gene expression by serving as a template for mapping of RNA sequencing reads and protein mass spectrometry data; and (4) identifying DNA polymorphisms related to genetic diversity of plants in the genus Kalanchoë.
Our genome-wide comparison of CAM species and non-CAM species revealed two types of convergent changes that could be informative with respect to the evolution of CAM: protein sequence convergence and convergent changes in the diel re-scheduling of transcript abundance. In the present study, a total of 60 genes exhibited convergent evolution in divergent eudicot and monocot CAM lineages. Specifically, we identified protein sequence convergence in six genes involved in nocturnal CO2 fixation, circadian rhythm, carbohydrate metabolism, and so on (Supplementary Table 10 and Supplementary Figs. 16–17). Also, we identified convergent diel expression changes in 54 genes that are involved in stomatal movement, heat stress response, carbohydrate metabolism, and so on (Supplementary Data 3). These results provide strong support for our hypothesis that convergent evolution in protein sequence or gene temporal expression underpins the multiple and independent emergences of CAM from C3 photosynthesis. New systems biology tools and genome-editing technologies52,61 offer great potential for plant functional genomics research based on loss- or gain-of-function mutants to characterize the role of the genes predicted here to have undergone convergent evolution.
Convergent gene function can arise by (1) a mutation or mutations in the same gene or genes that result in homoplasy in organisms or (2) independent causal mutation or mutations in different genes in each lineage10,62. We identified four genes that showed convergent changes in protein sequences, none of which were shared by the three CAM species A. comosus, K. fedtschenkoi, and P. equestris (Supplementary Table 10 and Supplementary Figs. 12–15), suggesting that CAM convergences result mainly from the second scenario. Alternatively, K. fedtschenkoi shares the convergent mutation in the PEPC2 protein sequence with P. equestris (Fig. 6), whereas it shares the convergent change in the pattern of diel transcript abundance of PPCK1 with A. comosus (Fig. 4b). These results suggest that two alternative modes of convergent evolution could have occurred in pathways for nocturnal CO2 fixation. First, PPCK shifted from light period to dark period to promote the activation of PEPC1 (the most abundant isoform), as exemplified by K. fedtschenkoi and A. comosus. Second, a single amino-acid mutation from R/K/H to D to maintain the active state of PEPC2, without the need for phosphorylation, then occurred, as in K. fedtschenkoi and P. equestris.
According to the constrained selection theory of Morris11, we expected to see convergent changes in protein sequences in all the three CAM species. However, in this study, single-site mutations were found in only two of the three CAM species. Our additional positive selection analysis revealed that Kalanchoë did share convergent sequence mutation with the other two CAM species, but at alternate sites (Supplementary Figs. 16–17). This is consistent with a recent report showing that single amino-acid mutations were not shared by all the bird species that displayed convergent evolution of hemoglobin function as an adaptation to high-altitude environments14. Alternatively, our results, to some extent, support the contingent adaptation theory of Gould15. The relevance of these predicted convergent changes to CAM needs to be investigated using experimental approaches, such as transferring the convergent CAM genes to C3 photosynthesis species to test the effect of these genes on C3-to-CAM photosynthesis transition.
In this study, we did not identify any gene that exhibited both convergent changes in transcript abundance patterns (Supplementary Data 3) and convergent changes in protein sequence (Supplementary Table 10 and Supplementary Figs. 16–17), suggesting that convergent evolution of a gene in CAM species was achieved through either protein sequence convergence or rewiring of gene expression. Indeed, we have not seen any reports showing that both protein convergence and convergent gene expression change occurred in the same gene in any type of organisms to date. Thus, we can hypothesize that convergent evolution follows the “law of parsimony” that emphasizes the fewest possible assumptions for explaining a thing or event63. An implication of this hypothesis is that reuse of the key genes via altered diel expression patterns would be the shortest path for C3-to-CAM photosynthesis evolution; and on the other hand, mutations in some key sites of protein sequences, while keeping the temporal gene expression pattern unchanged, would be the shortest path for evolving new protein function required by CAM. Although our data fit this hypothesis, additional screens for genes that have convergent changes in both protein sequence and expression pattern in the future are merited.
Increasing human population and changes in global temperature and precipitation are creating major challenges for the sustainable supply of food, fiber, and fuel in the years to come. As a proven mechanism for increasing WUE in plants, CAM offers great potential for meeting these challenges. Engineering of CAM-into-C3 photosynthesis plants could be a viable strategy to improve WUE in non-CAM crops for food and biomass production4,6. The genes predicted here to have undergone convergent evolution during the emergence of CAM are crucial candidates for CAM-into-C3 photosynthesis engineering. Our results suggest that CAM-into-C3 photosynthesis engineering requires rewiring of the diel transcript abundance patterns for most of the candidate genes in the target C3 photosynthesis species, along with amino-acid mutations in the protein sequences of several other candidate genes. Specifically, CAM-into-C3 photosynthesis engineering efforts should be focused on changing the temporal patterns of transcript expression of endogenous genes in the target C3 photosynthesis species corresponding to the K. fedtschenkoi genes listed in Supplementary Data 3. CRISPR/Cas9-based knock-in approach52 can be used to replace the original endogenous promoters of the target genes with temporal promoters that confer temporal expression patterns similar to those of their orthologous genes in the CAM species. For example, dark-inducible promoters such as Din1064 can be used to drive the expression of carboxylation gene modules during the nighttime and light-inducible promoters, such as GT1-GATA-NOS10165, can be used to drive the expression of decarboxylation gene modules during the daytime. To make the protein sequence changes needed for CAM-into-C3 photosynthesis engineering, transferring the K. fedtschenkoi genes listed in Supplementary Table 10 to target C3 photosynthesis species via the Agrobacterium-mediated transformation could provide a relatively straightforward path to an efficient engineered CAM pathway. Alternatively, one could mutate the amino acids shown in Supplementary Figs. 12–15 using a knock-in strategy with emerging genome-editing technology52.
In summary, this study provides an important model genome for studying plant comparative, functional, and evolutionary genomics, as well as significant advances in our understanding of CAM evolution. Our findings hold tremendous potential to accelerate the genetic improvement of crops for enhanced drought avoidance and sustainable production of food and bioenergy on marginal lands.
Kalanchoë fedtschenkoi ‘M2’ plants were purchased from Mass Spectrum Botanicals (Tampa, FL, USA) (Supplementary Method 2).
Estimation of DNA content
The DNA contents of young leaf tissue samples were analyzed using flow cytometry analysis service provided by Plant Cytometry Services (The Netherlands). The internal standard was Vinca minor (DNA = 1.51 pg/2 C = 1477 Mbp/2 C).
Images were collected using an Olympus FluoView FV1000 confocal microscope (Center Valley, PA, USA) with a 60× objective. Images were sharpened using Adobe Photoshop and chromosomes were counted using ImageJ software (Supplementary Method 3).
Illumina sequencing of genome
The genomic DNA libraries of K. fedtschenkoi were sequenced on a MiSeq instrument (Illumina, CA, USA) using MiSeq Reagent Kit v3 (600-cycle) (Illumina, CA, USA) (Supplementary Method 4).
In order to capture mRNA abundance changes responsive to diel conditions, samples were collected in triplicate from mature K. fedtschenkoi leaves (i.e., the fifth and sixth mature leaf pairs counting from the top) every 2 h over a 24 h time course under 12 h light/12 h dark photoperiod. Additional tissues were sampled in triplicate including roots, flowers, shoot tips plus young leaves, and stems at one time point, 4 h after the beginning of the light period (Supplementary Method 5).
Genome assembly and improvement
The K. fedtschenkoi genome was initially assembled using platanus66 from 70X Illumina paired-end reads (2 × 300 bp reads; unamplified 540 bp whole-genome shotgun fragment library), and three mate-libraries (3 kb, 14X; 6 kb, 12X; 11 kb, 11X). Further genome scaffolding was performed using MeDuSa67 sequentially with the genome assemblies of K. laxiflora v1.1 (Phytozome), Vitis vinifera Genoscope.12X (Phytozome), and Solanum tuberosum v3.4 (Phytozome).
Protein-coding gene annotation
The genome annotation for K. fedtschenkoi was performed using homology-based predictors facilitated with transcript assemblies (Supplementary Method 6).
Construction of orthologous groups
The protein sequences of 26 plant species were selected for ortholog group construction (Supplementary Method 7).
Construction of species phylogeny
The phylogeny of plant species was constructed from the protein sequences of 210 single-copy genes identified through analysis of orthologous groups (see “Construction of orthologous groups” section). The details for species phylogeny construction are described in Supplementary Method 8.
Construction of protein tribes and phylogenetic analysis
The protein sequences used for ortholog analysis (see “Construction of orthologous groups”) were also clustered into tribes using TRIBE-MCL68, with a BLASTp E-value cutoff of 1e-5 and an inflation value of 5.0. Phylogenetic analysis of the protein tribes is described in Supplementary Method 9.
Analysis of convergence in protein sequences in CAM species
The phylogenetic trees of protein tribes (see aforementioned “Construction of protein tribes and phylogenetic analysis”) were examined to identify the “CAM-convergence” clade, which was defined to contain genes from K. fedtschenkoi (dicot) and at least one of the two monocot CAM species (A. comosus and P. equestris) without any genes from C3 or C4 species. The rationale for defining the “CAM-convergence” clade is that the dicot CAM species K. fedtschenkoi should be separated from the monocot CAM species if there is no convergence between Kalanchoë and the monocot CAM species (Supplementary Method 10).
Gene Ontology analysis and pathway annotation
Whole-genome gene ontology (GO) term annotation was performed using BLAST2GO69,70 with a BLASTP E-value hit filter of 1 × 10−6, an annotation cutoff value of 55, and GO weight of 5. The enrichment of GO biological process and pathway annotation are described in Supplementary Method 11.
Analysis of carbohydrate active enzymes
The protein sequences were searched against the dbCAN database55 using HMMER3 (http://hmmer.org/). The HMMER search outputs were parsed to keep significant hits with E-value <1e-23 (calculated by HMMER) and coverage >0.2 (calculated on the HMM, which is equal to (end position - start position)/total length of HMM), as suggested by a large scale benchmark analysis71.
Estimation of transcript abundance in Kalanchoë
The RNA-seq data in fastq format were mapped to the K. fedtschenkoi genome using TopHat272. Transcript abundance in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) was estimated using Cufflinks73. All mapped read counts of the transcripts were counted by using htseq-count, a subprogram of HTseq74.
Co-expression network analysis in Kalanchoë
The expression data of 16 samples in triplicates were used for co-expression network analysis, which included time-course data (12 time points: 2, 4, 6, …, 24 h after the beginning of the light period) from mature leaf and one time point data (4 h after the beginning of the light period) from roots, flowers, stems, and shoot tips plus young leaves collected in triplicate from the K. fedtschenkoi plants grown under 12 h light/12 h dark photoperiod. The details for co-expression network analysis are described in Supplementary Method 12.
Cluster analysis of gene expression in Kalanchoë
Count values for each RNA-seq library were used to calculate polynomial regressions across time (Supplementary Method 13).
Comparative analysis of gene expression
The diurnal expression data with 4-h intervals for Arabidopsis thaliana were obtained from Mockler et al.75 and adjusted to 2-h interval time series by interpolation using the SRS1 cubic spline function (http://www.srs1software.com/). The diurnal expression data with 2-h intervals for K. fedtschenkoi was generated in this study. The diurnal expression data with 2-h intervals for Ananas comosus was obtained from Ming et al.4. The gene expression data were normalized by Z-score transformation. The hierarchical clustering of gene expression was performed for genes in each ortholog group using the Bioinformatics Toolbox in Matlab (Mathworks, Inc.) based on Spearman correlation (Supplementary Method 14).
Genome synteny analysis
Protein 3D structural simulation
Gas chromatography-mass spectrometry metabolite profiling
For the major metabolites of K. fedtschenkoi, a total of 36 leaf samples (the 5th and 6th fully expanded leaf pairs counting from the top) were collected with three biological replicates sampled every 2-h for a 24-h diurnal cycle. Additionally, three biological replicate samples of stems, roots, shoot tips plus young leaves, and flowers were also collected (Supplementary Method 16).
In vitro protein expression and analysis of enzyme activity
The PEPC proteins were expressed in bacterial BL21strains (Novagen BL21 (DE3) pLysS Singles), and purified via Glutathione Sepharose 4B beads (GE Healthcare Life Sciences, Pittsburgh, PA, USA). The protein quality was checked via western blot using anti-PEPC antibody (Agrisera, Sweden) and the PEPC activity was determined (Supplementary Method 17).
The Department of Energy (DOE) will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). The K. fedtschenkoi genome sequence and annotation are deposited in Phytozome (https://phytozome.jgi.doe.gov). The K. fedtschenkoi genome sequence is also deposited at NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/) under the accession code NQLW00000000. The genome sequencing reads are deposited in NCBI Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) with the BioSample accession SAMN07509503, which is the combination of the five individual BioSamples: SAMN07453935, SAMN07453936, SAMN07453937, SAMN07453938, and SAMN07453939. The RNA-Seq reads are deposited in NCBI SRA with the BioSample accession codes SAMN07453940 - SAMN07453987. The metabolite data is deposited at MetaboLights (http://www.ebi.ac.uk/metabolights/) under the accession code MTBLS519.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This research was supported by the U.S. Department of Energy, Office of Science, Genomic Science Program under Award Number DE-SC0008834. Additional support was provided by the UK Biotechnology and Biological Sciences Research Council (grant no. BB/F009313/1) and the Laboratory Directed Research and Development (LDRD) Program (Project ID: 7758) of Oak Ridge National Laboratory. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02–05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory. This research also used the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory. We thank Daniel Rokhsar, Mary Ann Cushman, and Lee Gunter for critical review and comments on the manuscript and Lori Kunder (Kunder Design Studio) for assistance with figure preparation. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract Number DE-AC05-00OR22725.