The calmodulin fused kinase novel gene family is the major system in plants converting Ca2+ signals to protein phosphorylation responses

Eukaryotes utilize Ca2+ as a universal second messenger to convert and amplify environmental and developmental signals into downstream protein phosphorylation responses. However, the phylogenetic relationships of the genes that convert Ca2+ signals (CSs) to protein phosphorylation responses (PPRs) remain highly controversial, and their origin and evolutionary trajectory are unclear, which greatly hinders functional studies. Here we examined the deep phylogeny of eukaryotic CS converter gene families and identified a phylogenetically and structurally distinctive monophyly in Archaeplastida. This monophyly can be divided into four subfamilies, each of which can be traced to ancestral members that contain a kinase domain and a calmodulin-like domain. This strongly indicates that the ancestor of this monophyly originated by a de novo fusion of a kinase gene and a calmodulin gene. This gene family, for which we propose the new name Calmodulin Fused Kinase (CFK), has expanded and diverged significantly in both size and structure for efficient and accurate Ca2+ signalling, and was shown to play pivotal roles in all six major plant adaptation events in evolution. Our findings elucidate the common origin of all CS-PPR converter genes except the CBL-CIPK converter genes, and reveal that CFKs act as the main CS conversion system in plants.


A phylogenomic screen identified three eukaryotic gene monophylies of CS-PPR converters.
To explore the phylogeny of CS-PPR converter families, we performed phylogenetic inference using sequences of all Ca2+-activated kinases, covering the CDPK-SnRK superfamily, the CAMK group, and MAPK (the complete tree is shown in Fig. S1). MAPK was the outgroup (Fig. 1A), consistent with the phylogenetic relationship in the kinome tree 22. CDPKs, CRKs, CCaMKs, PPCKs, and PPCK-related kinases (PEPRKs) formed a well-supported monophyly, with a high local support value of 97 from FastTree's near-maximum-likelihood analysis (1000 samplings) and a bootstrap support value of 97 from Randomized Axelerated Maximum Likelihood (RAxML; 1000 samplings) (Fig. 1A). This monophyly had no existing name, and we temporarily designated it the X monophyly. The CaMKI, II, and IV families from various eukaryotes constituted the second monophyly, with support values of 96 and 89, and were grouped as the CaMKI&II&IV monophyly (Fig. 1A). SnRK3 (also known as CIPK) is a subfamily of the SnRK group that has been experimentally validated as a CS-PPR converter 23. SnRK3s were found in the eukaryotic supergroups Archaeplastida, SAR, and Excavata, and formed the third monophyletic group, with both support values at 99 (Fig. 1A). All SnRKs formed the outgroup of the X monophyly and the CaMKI&II&IV monophyly, since these two were sister monophylies with support values of 90 and 88 (Fig. 1A).
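The notion of a monophyly used above can be made concrete with a small sketch. The nested-tuple tree below is a collapsed, hypothetical reading of the topology described in the text (MAPK as outgroup, SnRK outside the X and CaMKI&II&IV sister pair); it is not the actual tree of Fig. 1A, and the function names are illustrative assumptions.

```python
# Toy rooted tree as nested tuples; leaves are strings.
# ASSUMPTION: a collapsed version of the topology described in the text,
# with each family reduced to a single leaf. Not the real tree.
TREE = ("MAPK", ("SnRK", ("X", "CaMKI&II&IV")))


def leaves(node):
    """Return the set of leaf labels found under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the labels in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


sisters = is_monophyletic(TREE, {"X", "CaMKI&II&IV"})   # sister monophylies
mixed = is_monophyletic(TREE, {"SnRK", "X"})            # skips CaMKI&II&IV
```

In this toy tree the X and CaMKI&II&IV leaves form a clade, while SnRK together with X alone does not, which mirrors the sister-group statement in the paragraph above.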
To determine whether the three monophylies were reliable and whether they had independent or shared origins, we compared the members of each monophyly at the sequence level. First, the X monophyly had an insertion signature (insertion 1) of one amino acid (AA) residue in the kinase domain (KD) (Fig. 1B; the conserved features of the insertions are shown as sequence logos in Fig. S2). Second, except for one PPCK from Brassica napus, members of the X monophyly and the CaMKI&II&IVs lost the entire C-terminal domain and had insertion signatures 2 and 4, each of three AAs, located in the highly conserved KD. Third, SnRKs shared a conserved signature, insertion 3, of one AA in the KD. The phylogenetic relationships and the structural analyses clearly showed that the X monophyly was an independent monophyly that shared a common ancestor with the CaMKI&II&IV monophyly, whereas SnRK3 was an independent monophyly.

Suggested calmodulin fused kinase (CFK) gene family.
Because the X monophyly covered several independent gene families with various domain structures, distributed in different organisms, we then examined whether the gene families in the X monophyly shared an ancestor gene of the same structure by analysing more samples, especially from previously unavailable taxa (Glaucophyta, Rhodophyta, Charophyta, Pteridophyta, and gymnosperms of Archaeplastida) and the often-neglected protists (Table S1). We found that X monophyly members were not only distributed in plants, ciliates, and apicomplexans, as previously reported 13, but were also present in other eukaryotic supergroups, including Excavata, the SAR clade, Amoebozoa, and Opisthokonta (Fig. S3). However, X monophyly members in other eukaryotes nested within the plant X monophyly members and shared high similarity with plant sequences. These results suggested that the X monophyly members had first evolved in ancestral plants and that these non-plant eukaryotes very likely obtained their X monophyly sequences via independent horizontal gene transfers (HGTs) in their ancestors (Fig. S4). We then focused on the origin and evo-devo analysis of the X monophyly in plants.
After various trials, we constructed a phylogenetic tree using sequences representing all subfamilies and sub-subfamilies (or groups) from all algae, Amborella trichopoda and Arabidopsis thaliana. The final synthetic tree, based on both maximum-likelihood and Bayesian inference, was topologically well supported and was subdivided into four subfamilies (A, B, C, D) and further divided into 14 groups according to the tree topology (Fig. 2A). The CRK, a gene family previously recognized as a sister group to the CDPK in the CDPK-SnRK superfamily 13, now occupied only a portion of the C3 group (Fig. 2A). Structurally, all basal members of the four subfamilies had both the KD and the calmodulin-like domain (CaM-LD) (Fig. 2B); only the crown members of the B6 group, which consisted of the PPCKs and PEPRKs, lost the auto-inhibitory domain (AID, the domain following the KD) and the CaM-LD (Fig. 2B). The sub-group of the C3 group previously designated as CRK lost the CaM-LD (Fig. 2B). Therefore, all four subfamilies in the X monophyly shared an ancestor of the same structure: a gene formed by fusion of a kinase gene and a calmodulin gene, the same as the CDPK 24 (Fig. 2C). Since prokaryotes do not use Ca2+ as a signal, and no gene containing both of these domains has been found in prokaryotes, the ancestral gene of all four subfamilies in the X monophyly most likely originated through a de novo fusion of a kinase gene and a calmodulin gene. Based on this conserved common structure and the phylogenetic relationship (Fig. 1), we propose to name this X monophyly the Calmodulin Fused Kinase (CFK) gene family.

Figure 1. (A) The rooted tree displaying three monophylies of Ca2+-activated protein kinases among the eukaryotes. Mitogen-activated protein kinase (MAPK) was the outgroup. Major nodes are shown with two support values: maximum likelihood by FastTree (upper number) and Randomized Axelerated Maximum Likelihood (lower number). Branches are colored to represent different origins: green, Archaeplastida; blue, Opisthokonta; red, SAR; black, Amoebozoa; purple, Excavata. (B) Schematic representation of monophyly-specific structural features, with the schematic structures from the first and the last branch. Yellow block: kinase domain; red block: EF hand; green bar: monophyly-specific inserted amino acid(s).

Origin and expansion of CFK subfamilies in plants.
From aquatic single-celled algae to terrestrial flowering plants, cell signaling networks evolved to become increasingly complex and robust. Here we show that the CFKs have evolved from a single primitive ancestor gene into a large gene family in flowering plants. We categorized CFKs from the sampled plants, which represent all plant clades, according to the four subfamilies and 14 groups (Fig. S5). We found that glaucophytes contained no CFK and that only one of the four sequenced red algae, Porphyridium cruentum, harbored a single CFK gene. All of the Chlorophyta and Charophyta algae and land plant genomes contained CFKs, indicating that CFKs emerged at least about 1500 million years ago (MYA) 25 in red algae.
In general, three major stages of gene expansion and loss occurred in plants (Table 1). The marine algae stage I had only two subfamilies, A (A1 and A2; these two groups existed only in stage I) and C (C1); the freshwater algae stage II had subfamily B (B1, B2, B3, and B4; these groups existed only in freshwater algae) and C2; and the Streptophyta stage III had B5, B6, C3, and D. Among all groups, subfamily A had maintained only 1-2 copies; B1, B2, B3, and B4 had each kept 1-5 copies; and B5 had kept 1 or 2 copies in plant genomes but was lost in banana, Aquilegia coerulea, strawberry, and species in the Brassicales. A recent report stated that B5 CFKs (CCaMKs) originated in Physcomitrella patens 26; however, we found that B5 CFKs emerged in Mesostigma viride (Fig. 3), the earliest branch of Charophyta, which arose about 725 MYA 27. The B6 group retained its medium size, with an average of 6.35 genes, and the C3 and D groups expanded dramatically during plant evolution (Table 1). Among land plant lineages, the Brassicaceae had evolved the richest CFK family (Fig. S6A), larger than those of the basal Brassicales plant papaya (Carica papaya) and of grapevine (Vitis vinifera) (Fig. S6B). The expansions in B6, C, and D resulted mostly from various forms of recent gene duplication (Fig. S6B) 28.

Structural innovations of the CFKs.
In addition to the variable sizes of the CFK subfamilies, CFKs also underwent dramatic structural diversification. Subfamily B members had the most versatile C-termini (Fig. 2B). Basal branches (B1, B2, B3) had retained the complete set of KD, AID, and CaM-LD (Fig. 3A). The middle branches B4 and B5 had insertions in the AID (Fig. 3A). B5 lost the first EF hand although the sequence was still present, suggesting that its secondary or tertiary structure was disrupted by sequence mutations (Fig. 3A). In the B6 group, the complete or partial loss of the C-terminus had led to the loss of the AID and CaM-LD, so that PPCK and PEPRK were no longer activated by CSs (Fig. 3A).
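The domain losses described above amount to a small lookup from domain content to architecture class. The following is a minimal sketch of that mapping, assuming each protein is summarized as a set of domain names; the function name, the label strings, and the "unclassified" fallback are hypothetical and are not part of the original analysis.

```python
def classify_architecture(domains):
    """Map a collection of domain names to an architecture label.

    ASSUMPTION: domains are given as the strings "KD", "AID", "CaM-LD".
    Labels follow the losses described in the text; they are illustrative.
    """
    present = set(domains)
    if "KD" not in present:
        return "not a kinase"
    has_aid = "AID" in present
    has_cam = "CaM-LD" in present
    if has_aid and has_cam:
        return "complete (KD + AID + CaM-LD)"
    if has_aid and not has_cam:
        return "CaM-LD lost (CRK-like)"
    if not has_aid and not has_cam:
        return "AID and CaM-LD lost (PPCK/PEPRK-like)"
    # KD + CaM-LD without an AID is not described in the text.
    return "unclassified"


basal = classify_architecture(["KD", "AID", "CaM-LD"])
crk_like = classify_architecture(["KD", "AID"])
ppck_like = classify_architecture(["KD"])
```

A real classification would start from domain predictions rather than hand-written sets, and would also need to handle cases such as the degenerate first EF hand of B5, which this sketch ignores.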
Other domains, motifs, and modifications were found in the CFK family (Fig. 3B). Myristoylated and palmitoylated AAs were first found in the basal seawater Chlorophyta algae and had been kept in the crown flowering plants (Fig. 3B). The PEST motif and acetylated AAs first appeared in CFK genes in the freshwater Chlorophyta algae (Fig. 3B). These motifs and modified AAs act as signals for protein-protein and protein-lipid interactions and membrane association 29, and for controlling protein lifespan 30. The presence of these modified AAs and motifs in the CFK genes strongly indicated that CFKs are potential key signal messengers in algae. Two CFK proteins, from Chlamydomonas reinhardtii and Volvox carteri, contained a C2 domain at the N-terminus, which was not believed to exist in CDPKs 31. Some CFKs had a fused transmembrane domain-containing sequence (Fig. 3B, File S1), which might help target the protein to the membrane for faster and more accurate reception of CSs. Other new domains, such as the FBG and S4 domains (Fig. 3B), have been little studied in Ca2+ signalling and require further research.

Functional diversification of CFKs during plant adaptive evolution.
To explore the contribution of CFKs to plant evolution, we analysed the evolution of CFKs in relation to their functions along the tree of plant life (Figs 4 and 5A). We focused on subfamilies B, C, and D because there were no reports on subfamily A members, owing to their limited distribution in marine algae, nor any reports on CFKs in Rhodophyta or seawater Chlorophyta algae.
In the freshwater alga C. reinhardtii, the subfamily B3 CFK Cre07.g328900 and the subfamily B4 CFK Cre02.g074370 (Fig. 2A) function in the biogenesis of flagella and in nutrient uptake 32, respectively, indicating that CFKs played a role in plant freshwater adaptation. When plants transitioned from aquatic habitats to land, they had to make substantial changes to adapt to the new environment. Starting from Mesostigma viride, the earliest branch of the Charophyta algae, streptophytes (covering charophytes and land plants) evolved B5 members lacking the first EF hand (Fig. 3); these members had the ability to decode the low-frequency signal (100 s/cycle) (Fig. 5B) and played an indispensable role in arbuscular mycorrhizal (AM) fungal and rhizobial symbiosis 33. Loss of B5 members in Brassicales and in Quercus robur disarmed plant-rhizobia or plant-fungi associations 34. Although the putative functions of the orthologs of other B5 CFKs require future confirmation in plants, orthologs are generally assumed to retain equivalent functions in different organisms and to share other key properties 35.
The terrestrial environment differs from the aquatic environment mainly in its dramatic changes in temperature, water, light, air, and soil conditions. A group of C3 CFKs (Fig. 5C) evolved an N-terminus with myristoylation, palmitoylation, and acylation AAs, allowing these CFKs to associate with the plasma membrane to decode high-frequency CSs (40 s/cycle) 36. These genes were key members in plant drought, salt, and pathogen stress signalling, as found in A. thaliana (Figs S7 and S8), O. sativa 37, V. vinifera 38, and P. patens (Fig. S9) by comparative transcription profiling (Fig. 5C).
One of the most interesting CFK groups is the B6 group, which completely lost the CaM-LD. The earliest ortholog of the PPCKs involved in crassulacean acid metabolism (CAM) 39 and C4 photosynthesis came from the spikemoss Selaginella (Fig. 5D). These kinases have lost the entire CaM-LD and are no longer activated by the CS, probably since they were recruited to phosphorylate PEPC, a signature enzyme of primary CO2 fixation in CAM 18 and in C4 photosynthesis 40.
CFKs had also contributed to the evolution of seed plants. Another cluster of C3 group genes had the myristoylation and palmitoylation sites but lost all the typical EF hands in the C-terminus (Fig. 5E). This group of CFKs originated in gymnosperms, as CFKs with this structure dated back to Ginkgo biloba. This seed plant lineage-specific group displayed a conserved expression pattern (first rising and then falling during embryo development) from the primitive seed plant ginkgo to the crown seed plant A. thaliana (Fig. 4), and therefore plays roles in the seed maturation transition (Fig. 5E).
We found that abundant CFKs in subfamily D, e.g., 22 AtCFKs (Table S2), act as key players in male gamete development (Fig. 5F). These CFKs receive the moderate-frequency CSs for pollen tube growth 36. The orthologs of the D group CFKs originated in charophytes (Fig. 5F). These CFKs shared a conserved feature, with both myristoylation and palmitoylation sites at the N-terminus. All these CFKs were experimentally validated or predicted to localize to the plasma membrane. These CFKs were found to be involved in pollen development in A. thaliana, V. vinifera, and O. sativa (Figs S7 and S8), and in gamete development in P. patens (Fig. S10).

Discussion
Eukaryotic cells evolved three systems for converting CSs to PPRs, and each eukaryotic group uses more than one system.
Because the previously designated CDPK-SnRK superfamily is clustered by sequence similarity 13 and the CAMK group is clustered by the biochemical property of Ca2+ binding 14, each classification includes several structurally heterogeneous kinase gene families. These classifications are inconsistent with known discoveries and remain controversial, as discussed in the Introduction; we therefore asked whether these gene families might have evolved independently. Since the CDPK genes exist in plants and in some protists, and all eukaryotic groups utilize the CS-PPR system, it was unknown whether the CS-PPR genes share a common eukaryotic origin. In this study, we characterized all CS-PPR converter families in all eukaryotic supergroups and classified them, on the basis of common evolutionary origin, into three monophyletic groups: the CaMKI&II&IV family, the SnRK3 subfamily, and the newly proposed CFK family.
Eukarya is now divided into five supergroups: Archaeplastida, Excavata, the SAR clade, Amoebozoa, and Opisthokonta 41. The Excavata were found to contain the CFK system and the SnRK3 system, and Excavata CFKs possibly originated from plant CFKs through horizontal gene transfer. Judging by gene family size, SnRK3s seem to be the major CS-PPR conversion system in Excavata. The SAR clade contains all three systems; however, no functions have been reported for its SnRK3s. The Amoebozoa and Opisthokonta both had a few CFK members, and their main CS-PPR conversion system would be the CaMKI&II&IVs. Archaeplastida contain the CFK system and the SnRK3 system, each with multiple functional reports 26. Overall, all five eukaryotic supergroups have more than one CS-PPR conversion system, although each prefers to rely on one specific system.

Plants evolved CFKs and utilized them as the major CS-PPR converter.
There are three major ambiguous and controversial questions regarding plant CS-PPR decoding gene families: (1) what are the major gene families that decode CSs to PPRs; (2) since they work through two different biochemical mechanisms, whether these gene families originated dependently or independently; and (3) what are their relationships with those in other eukaryotic supergroups, and whether they share a broader common origin. In this report, we clarified the previously controversial classifications of the CS-PPR converting gene families and united them into a single CFK gene family. Unlike the former nomenclature, which was based on biochemical properties such as Ca2+ binding, we named these gene families on the basis of their origin. The CFK gene family most likely originated from a single ancestor gene, presumably by a de novo fusion of a CaMK kinase and a calmodulin at the deep root of the rhodophytes.
The CFK family, now also including the CCaMKs and CRKs, competes with the SnRK3 subfamily for the position of leading CS-PPR converter family in plants. Several lines of evidence, however, suggest that CFKs are the major CS-PPR converter in plants. First, CFKs expanded into more family members in flowering plants than SnRK3s did (e.g., 37 OsCFKs and 46 AtCFKs compared to 30 OsSnRK3s and 25 AtSnRK3s 23). Second, CFKs have more comprehensive functions than SnRK3s, including roles in basic plant metabolism (such as C4 and CAM photosynthesis), development, stress signalling, and plant-microbe interactions. SnRK3s are mostly involved in stress-related signalling, as revealed by genome-wide expression analyses in both the eudicot Arabidopsis and the monocot rice 12,23. Third, CFKs are presumably more efficient than SnRK3s in converting CSs to PPRs, because a CFK requires only a single protein to sense the CS and activate the downstream transcription factors, whereas the SnRK system needs two transcription factors to activate two genes (a kinase and a calmodulin) and two transcriptional machineries to transcribe them, and therefore requires more ATP and more time for molecular docking of the two products. Fourth, CFKs have versatile N-terminal and C-terminal domains and modifications for multiple subcellular localizations, whereas SnRK3s do not contain recognizable localization signals, so most SnRK3s localize to the cytoplasm and nucleoplasm 23.

CFKs play pivotal roles in evolution of plant clade adaptation.
Plant evolution includes the following six watershed events: colonization of land, plant-microbe co-evolution, development of life cycles, changes in morphology, advancement of photosynthesis, and secondary metabolism [42][43][44]. CFKs have contributed to the success of each of these major evolutionary events. In conquering the terrestrial environment, one group of C3 CFKs, which originated in charophytes, plays key roles in abiotic and biotic stress signalling. 
In plant-microbe interactions, B5 CFKs are irreplaceable in plant-fungi and plant-rhizobia interactions. With regard to life cycles, we have shown that a group of C3 CFKs with both myristoylation and palmitoylation contributed to the seed maturation program. In the evolution of morphology, the expanded subfamily D CFKs are also key regulators of pollen tube growth and male gamete development in early land plants, so it is highly possible that this group had roles in developing anisogamy in charophytes, since the two stages, gametophyte and sporophyte, are externally indistinguishable in chlorophytes (en.wikipedia.org/wiki/Gametophyte). Furthermore, B6 members that lost the C-terminus, and thereby no longer respond to CSs, were recruited to C4 and CAM photosynthesis. The subfamily D members have also been reported to be involved in secondary metabolism 45. In summary, the origin and evolution of diverse CFK subfamilies have played key roles in all the major evolutionary processes during plant adaptive evolution from algae to extant higher plants.

Methods
Sampling and dataset. Sampled species covered five supergroups of eukaryotes, bacteria and archaea.
Sequence alignment and phylogenetic tree construction. Multiple sequence alignment was performed with the MAFFT program 46 for the Fig. 1 tree and with the MUSCLE program 47 for all other trees. For each monophyly, a seed was built from representative members and used to search protein databases with the HMMER3 program 48. All hits were examined by phylogenetic analysis, and only sequences within the monophyly were retained. New seeds were built from the curated results and searched again to detect any sequences missed in the first round. Further rounds of searching were performed, each using a seed rebuilt from the previous result, until no new CFK kinase was found. BLAST hits were also used for curation of sequences 49. The maximum likelihood method with the WAG protein substitution model was applied in FastTree 50 and RAxML 51, each with 1000 samplings, and the RAxML analysis was performed via an online platform (www.phylo.org). The Bayesian tree was constructed with MrBayes 52 with 200000 samplings and a samplefreq of 100.
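The iterative seed search described above is a fixed-point loop: search, curate, rebuild the seed, and stop when a round adds nothing new. The sketch below illustrates only that control flow. The `search` and `in_monophyly` callables are hypothetical stand-ins for the profile search and the phylogenetic curation step, the `max_rounds` cap is an added safety assumption, and the toy data are invented for illustration; none of this calls the actual HMMER3 programs.

```python
def iterative_search(initial_seed, search, in_monophyly, max_rounds=50):
    """Repeat search-and-curate rounds until no new sequence is accepted.

    search(seed)        -> iterable of candidate sequence IDs (stand-in
                           for a profile search against a database)
    in_monophyly(seq_id) -> bool (stand-in for phylogenetic curation)
    """
    accepted = set(initial_seed)
    for _ in range(max_rounds):
        hits = set(search(frozenset(accepted)))
        # Keep only hits that are new and pass curation.
        new = {h for h in hits if h not in accepted and in_monophyly(h)}
        if not new:
            break  # converged: this round found nothing new
        accepted |= new  # rebuild the seed from the curated result
    return accepted


# Toy data (invented): each sequence "finds" its neighbours.
# 'z' is an off-target hit that curation rejects, so 'y' is never reached.
NEIGHBOURS = {"a": ["b"], "b": ["c", "z"], "c": [], "z": ["y"]}


def toy_search(seed):
    return {n for s in seed for n in NEIGHBOURS.get(s, [])}


def toy_curate(seq_id):
    return seq_id != "z"


result = iterative_search({"a"}, toy_search, toy_curate)
```

Curating every round before rebuilding the seed matters in this scheme: an off-target hit that entered the seed would pull in its own neighbours in later rounds, as the rejected `z` would have done with `y` in the toy data.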