New insights into the origin and evolution of α-amylase genes in green plants

Gene duplication is a source of genetic material and evolutionary change, and has been associated with gene family expansion. Functional divergence of duplicated genes is strongly shaped by natural selection acting during organism diversification and the acquisition of novel features. We show that the plant α-amylase gene family (AMY) comprises six subfamilies (AMY1-AMY6) that fall into two ancient phylogenetic lineages (AMY3 and AMY4). Both AMY1 and AMY2 are grass-specific and share a single-copy ancestor, which is derived from grass AMY3 genes that have undergone massive tandem and whole-genome duplications during evolution. Ancestral features of AMY4 and AMY5/AMY6 genes have been retained among four green algal sequences (Chrein_08.g362450, Vocart_0021s0194, Dusali_0430s00012 and Monegl_16464), suggesting a gene duplication event following Chlorophyceae diversification. The observed horizontal gene transfers between plant and bacterial AMYs, and the chromosomal locations of AMY3 and AMY4 genes in the most ancestral green organism (C. reinhardtii), provide evidence for the monophyletic origin of plant AMYs. Despite subfamily-specific sequence divergence driven by natural selection, the active site and SBS1 are well conserved across different AMY isoforms. The differentiated electrostatic potentials and the polymorphisms in hydrogen bond-forming residues further imply variable digestive abilities for a broad range of substrates in particular tissues or subcellular localizations.

As the best known and most deeply studied amylolytic enzyme 1-6, α-amylase (AMY, α-1,4-glucan-4-glucanohydrolase, EC 3.2.1.1) is a ubiquitous hydrolase synthesized by plants, animals and microorganisms. It catalyzes the cleavage of internal α-(1-4)-glycosidic linkages in starch, glycogen and other related oligosaccharides through a retaining, endo-acting mechanism. Its widespread distribution reflects the exploitation of environmental polysaccharides as a principal source of energy and carbon. Currently, under the sequence-based classification rules of carbohydrate-active enzymes (CAZy database, http://www.cazy.org/index.html), AMYs are classified as the main representatives of glycoside hydrolase (GH) family 13, and are probably also present in families GH57, GH119 and GH126 5-7. The enzymes in GH13 are characterized by a (β/α)8-barrel (TIM-barrel) catalytic domain and display strong conservation of their tertiary conformation, although the amino acid sequences are highly variable and only a few amino acids are conserved, as revealed by inter-kingdom pairwise comparisons 1-3,5,6,8,9; sequence similarities are, however, much higher within kingdoms. Our present work focuses on the AMYs belonging to GH13.
The evolution of AMY genes has been an attractive subject for more than 30 years. The original study was presented by Nakajima 1, who compared 11 different AMY sequences from plants, animals and microbes, and observed four highly conserved regions necessary for enzyme function. Subsequently, Janecek 2 analyzed 37 sequences, also from diverse living organisms, and established three main phylogenetic lineages: (i) fungi and yeasts; (ii) plants; and (iii) streptomycetes, Thermomonospora curvata, insects and mammals. The archaeal AMYs showed close relatedness to their plant counterparts, although these two distinct branches should each retain their own originality 10. Additionally, several bacterial AMYs have been reported to share the typical animal-like motifs and chloride-dependent properties 2,9,11-15. This observation gave rise to the α-amylase model of horizontal gene transfer (HGT) between animals and bacteria 9. However, the origin of animal AMY genes is still debated under alternative hypotheses 4,16. Within the family GH13, AMYs from different living groups are placed in separate subfamilies, such as GH13_1 (fungi and yeasts), GH13_6 (plants), GH13_7 (archaeons), GH13_24

www.nature.com/scientificreports

AMY6 genes were scattered among species of basal land plants (Marchantia polymorpha, Physcomitrella patens, Sphagnum fallax and Selaginella moellendorffii), basal angiosperms (Amborella trichopoda) and two monocots (Elaeis guineensis and Phoenix dactylifera); however, they were found in almost all the main dicot lineages, except Citrus and Brassicaceae. AMY4 genes were only occasionally absent, namely in two species, S. moellendorffii and Beta vulgaris. We also did not detect any AMY3 gene in five species (the green algae Auxenochlorella protothecoides and Ostreococcus tauri, the monocots Spirodela polyrhiza and Zostera marina, and the dicot Tarenaya hassleriana).
Phylogenetic relationships of plant AMYs. We used three different methods (NJ, ML and BI) to infer the phylogenetic relationships. The general topology placed plant AMY genes into two major groups: AMY1 + AMY2 + AMY3 and AMY4 + AMY5 + AMY6 (Fig. 2a). Within each subfamily (i.e., AMY3, AMY4, AMY5 and AMY6), the phylogeny agreed closely with the green plant tree of life, proceeding from the more ancient green algae, through basal land and vascular plants, to higher angiosperms (Fig. 2b,c).
Within the subtree of AMY1 + AMY2 + AMY3, we defined six Grass_AMY3 subclades (3a to 3f) (Fig. 2b). To gain a more precise picture of these six subclades, we selected the redundant monocot and grass AMY3 genes to re-evaluate their phylogeny and examine their synteny. We found that subclades 3a, 3e and 3d formed one cluster and subclades 3b and 3c another, and that the two clusters were distinct from each other; subclade 3f was situated between AMY3 and AMY1 + AMY2 (Figs 2b and 3a). All the single-copy genes (i.e., Hovulg_5Hr1G068350, Brdist_4g32140 and Traest_5A2) in diploid species belonged to the cluster of 3b + 3c (Fig. 3a).
Synteny detection indicated that 3b + 3c and 3a + 3e + 3d were extremely conserved on chromosomes, and resulted from tandem gene duplications (Fig. 3b). Nevertheless, apparent gene colinearities were also observed on chromosomes Os_chr08 and Sb_chr07, which were duplicated from Os_chr09 and Sb_chr02 at the whole-genome scale 55,56. Subfamilies AMY1 and AMY2 were well separated from each other, and formed a distinct branch that was embedded in the AMY3 subtree (Fig. 2b). Our previous work demonstrated that the AMY1 loci were conserved in a syntenic block derived from the intermediate ancestral chromosome A2 57. In this subtree, AMY1 and AMY2 were located at approximately equivalent positions. We therefore examined the syntenic relationships of AMY2 genes as well, and found that the genomic segments carrying the AMY2 loci were also highly conserved and originated from A6 (Fig. S3).
There are three main clusters in the subtree of AMY4 + AMY5 + AMY6 (Fig. 2a). Unexpectedly, the latter AMY4 cluster (Green algae_AMY4) was situated at a position distinct from the AMY4 genes of basal land plants to angiosperms, albeit with weaker support (Fig. 2c). Subfamilies AMY5 and AMY6 showed a closer relationship to each other than to AMY4 (Fig. 2c). Among the 11 sampled green algal species, Green algae_AMY5 genes were present in seven, while AMY6 genes were absent from all of them (Fig. 1). By contrast, AMY5 genes were absent from the four basal land species, but all of these contained AMY6 gene members (Fig. 1). Furthermore, these two subfamilies were characterized by an N-terminal extension of 460-570 amino acids, on which chloroplast transit peptides were predicted (Fig. S2). Therefore, AMY6 indeed represents a novel gene subfamily, duplicated and diverged from AMY5.
Inter-kingdom sequence analysis. Phylogenetic inference indicated that AMY1 + AMY2 + AMY3, AMY4, and AMY5 + AMY6 were the three main AMY lineages in plants (Fig. 2). To further understand their origin, we first checked the AMY genes in the more ancient red alga, Galdieria sulphuraria. Interestingly, the amino acid sequence of Gasu_48600 (Table S2) was even more divergent from plant AMYs than the bacterial outgroup was. We therefore conducted sequence comparisons across different living organisms, and successfully extracted ten plant-type bacterial AMY genes (Fig. 4a). In addition, the AMY4 representative gene (At1g76130) showed the highest sequence similarity to plant-type bacterial AMY genes (Fig. 4b).
Evolutionary pressures. Considering the wide distribution and neutral phylogenetic position of AMY4, we hypothesized that it might be the most basal subfamily. Thus, in the predefined tree, it was consistently treated as the background branch, and the other subfamilies were used in turn as foreground branches; AMY3 was defined to consist of Eudicot_AMY3 and Monocot_AMY3 (Fig. S1). First, we evaluated the whole tree using the one-ratio model (M0) and the nearly neutral site model (M1a). The estimates (ω0 = 0.10958 under M0, and ω0 = 0.10276 with p0 = 0.91084 under M1a) indicated that AMY genes were under strong selective constraint, i.e., purifying selection. In the two-branch tests, all the LRT comparisons generated significant statistics with the exception of the two-branch model for AMY6, indicating that ω values vary across subfamilies (Table 1).
In addition to the branch and site models, we carried out the branch-site tests (Model A). These tests detected proportions of sites with ω2 values greater than 1, and the LRT statistics were significant for all the M1a-Model A comparisons (Table 1). For instance, when AMY1 was used as the foreground, about 4.81% of sites (with P > 0.52) were under potential positive selection, and some residues (45P*/90T**/184S**/194N*/316K**) reached significant levels under the BEB inference. Note that some sites in the test of Model A (AMY5 + AMY6), although showing no significant signs of positive selection in our analysis, were highly conserved within, and indeed divergent between, backgrounds and foregrounds (Fig. S4).
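The likelihood ratio comparison behind these M1a-Model A tests can be illustrated with a minimal sketch. It assumes, as stated in the Methods, a chi-square reference distribution with 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-x/2). The function name and the log-likelihood values are hypothetical and are not taken from Table 1.

```python
import math


def lrt_df2(lnl_null, lnl_alt):
    """Likelihood ratio test against a chi-square distribution with df = 2.

    lnl_null: log-likelihood of the null model (here, M1a).
    lnl_alt:  log-likelihood of the alternative model (here, Model A).
    Returns the test statistic 2*(lnL_alt - lnL_null) and its p-value.
    """
    # The statistic cannot be meaningfully negative; clip at zero.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_df2(lnl_null=-1000.0, lnl_alt=-995.0)
print(f"2*dlnL = {stat:.2f}, p = {p:.5f}")
```

A comparison is called significant when the p-value falls below the chosen threshold (for example 0.05); the closed form only holds for df = 2, so other degrees of freedom would need a general chi-square routine.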
Structural properties and expression profiles. The electrostatic potential reflects surface properties and molecular interactions that play critical roles in protein folding, conformational stability, enzyme catalysis and binding energies. Among the resulting models, apparent surface electrostatic changes were observed in the central Domain A between the two main lineages, AMY1 + AMY2 + AMY3 (negative surface potentials) and AMY4 + AMY5 + AMY6 (positive to neutral potentials), whereas the most essential active site of each ancestral node consistently displayed negative surface electrostatics (Fig. 5a). Further structural comparisons of the well-defined substrate binding sites indicated that the active site and SBS1 in Domain A shared highly similar structural folds across the six AMY subfamilies, while SBS2 displayed larger conformational variations (Fig. 5b). In each binding site, the amino acid residues involved in forming direct hydrogen bond contacts with starch-like substrates exhibited varying degrees of conservation. For example, the catalytic residues (E205 and D291) in the active site and the pair of consecutive tryptophans (W278 and W279) in SBS1 were conserved across subfamilies, whereas some other residues displayed various degrees of substitution or polymorphism (Table S3).
We also examined expression patterns of AMY genes from rice, tomato, maize and Arabidopsis (Fig. S5). Generally, expression profiles of plant AMY genes varied across subfamilies in terms of developmental tissues and transcript abundance. In Arabidopsis, the single-copy AMY3, AMY4 and AMY5 genes were expressed broadly in all the tissues sampled. Similar constitutive expression was observed for AMY1 to AMY5 from maize, AMY3 (Solyc03g095710) to AMY6 from tomato, and AMY4 and AMY5 from rice. By contrast, the other AMY3 copy (Solyc04g078930) in tomato was hardly detected in particular tissues (i.e., young leaves, young flower buds and anthesis flower 0DPA), and the AMY1, AMY2 and AMY3 paralogs in rice also displayed strong tissue-specific patterns. Even within the AMY3 subfamily in rice, each of the four copies exhibited a distinct expression profile.

Discussion
Alpha-amylase genes comprise four subtypes in cereal grasses and three in dicot species. Several phylogenetic studies restricted to higher model plants have been done 30,34,37,40,54,58; however, a detailed sequence-derived comparison of these subfamilies is lacking, and little is known about how they evolved across the entire green lineage. With the goal of bridging this gap, we have examined the distribution, expansion dynamics, and potential functional differentiation using 472 redundant protein sequences sampled from 78 different species or strains, covering green algae to higher angiosperms.
Generally, AMY1 and AMY2 are grass-specific subfamilies. The previously defined AMY1s in dicots, such as At4g25000 37 and StAmy1 54, together with grass AMY3 30,34,40,58, actually belong to the plant AMY3 lineage. The same applies to AMY4, which consists of the AMY2s in dicots 37,54 and the AMY4s in grasses 35,39,40,58. The AMY5 subfamily refers to the well-studied AMY3 (At1g69830) in Arabidopsis and its orthologs in other green plants. In addition, we have identified a novel subfamily (AMY6), which presents a scattered distribution from green algae to angiosperms (Fig. 1).
The N-terminal presequences of α-amylases have been well described previously, and can be used as good signatures for discriminating genes belonging to specific subfamilies. In most cases, α-amylases encoded by AMY1, AMY2 and AMY3 genes are typical secretory proteins found in plastids (chloroplasts in leaves and amyloplasts in starchy cells) 49,50,52, carrying signal peptides of 24 to 27 amino acids in length. However, the AMY3 genes from the basal land plants (Liverworts, Mosses and Ferns) are predicted to carry chloroplast signals. This clear signal shift from basal land plants to angiosperms may reflect evolutionary novelties, such as the occurrence of seeds in plant bodies. Generally, the AMY4 isoforms do not possess any signal peptide 39. Our predictions indicate that three categories of signal patterns (secretory signals, no signals and chloroplast transit peptides) overwhelmingly dominate the N-terminal regions of α-amylase genes in higher plants (Fig. S2). The most remarkable are AMY5 and AMY6, both of which possess N-terminal stretches of more than 460 amino acids and are predicted to carry chloroplast signals. In green algae, the N-terminal signal divisions are less clear, because the mitochondrial targeting peptide and chloroplast transit peptide are always predicted to co-exist within each subfamily, whereas the secretory signals are absent. This discrepancy between green algae and land plants may reflect their divergence over a billion years ago, including differences in photosynthetic and other critical metabolic pathways 59. The transit peptide is necessary for plastidial targeting and translocation initiation 50; thus, the different signals carried by α-amylases suggest different subcellular locations for starch digestion.
Both AMY1 and AMY2 genes are embedded in, and belong to, the grass AMY3 lineage (Fig. 2b). This agrees well with the view that gene duplication led to the formation of the AmyA (AMY1 and AMY2) and AmyB (AMY3) classes in the monocot lineage 34. Colinearity identifications indicate that the AMY1 loci originated from the ancestral cereal chromosome A2 57, and AMY2 from A6 (Fig. S3). Note that chromosome A2 was the product of A4/A6 breakages and fusions 55,56. AMY1 and AMY2 genes should therefore share a common single-copy locus from A6. To some extent, their evolutionary rates and expansion dynamics are similar. Meanwhile, the expansion of grass AMY3 genes is also evident. Phylogenetic and syntenic relationships indicate that tandem duplications and whole-genome duplications were key to its enlargement from the archetypal AMY3 cluster of 3b + 3c.
With the radiation of grasses, numerous specific changes have occurred, such as the acquisition of novel features (i.e., the timing of embryo development, and the structures of flowers and fruits) 60. Pineapple (Ananas comosus) belongs to the Bromeliaceae family, which diverged from the grass family (Poaceae) 100 million years ago (MYA), making it a useful close relative for studying cereal genome evolution 61. The absence of AMY1 and AMY2 genes from pineapple, and even from more ancient basal sister groups such as banana (Musa acuminata), date palm (P. dactylifera) and African oil palm (E. guineensis), provides useful clues for timing the birth of AMY1 and AMY2 genes. Together with the broad distribution of AMY3 genes from green algae to higher plants (Fig. 1), this suggests that the common ancestor of AMY1 and AMY2 was derived from the AMY3 genes.
Previous studies have characterized the non-signal-bearing AMY4 and chloroplast-targeting AMY5 genes well. They unanimously agree that each represents a distinct AMY subfamily in angiosperms 54. In contrast, Mascher et al. 40 placed them together in a single subfamily (AMY4 as AMY4-1, and AMY5 as AMY4-2). In the present work, we provide more detailed identifications across the entire green lineage. In the closely related Chlamydomonadales (Chlamydomonas reinhardtii, Dunaliella salina, Volvox carteri) and Sphaeropleales (Monoraphidium neglectum) 62, AMY5 genes are absent. Interestingly, their corresponding AMY4 gene members (Green algae_AMY4 in Fig. 2c) are situated at the common ancestral position of the AMY4 and AMY5 subfamilies. The poorly resolved or conflicting relationships of this ancestral branch may represent the ancestral status of AMY4 and AMY5 genes. The shared common ancestor is additionally supported by their equivalent chromosomal locations in the genomes of four Brassicaceae species (A. thaliana, A. lyrata, B. oleracea and B. rapa) and Prunus persica. Thus, AMY4 and AMY5 should have resulted from an ancient gene duplication event following the radiation of Chlamydomonadales species, with AMY5 diverging from AMY4. The newly described AMY6 is distinct from AMY5 based on their N-terminal features (Fig. S2). From its distribution, we see that species with AMY6 genes must have corresponding AMY5 gene members, except the four basal land and vascular plants (Common liverwort, Moss, Bog moss and Spikemoss) (Fig. 1).

Figure caption (fragment): … SBS1 and SBS2). AMY1 was colored with gold, AMY2 with orange red, AMY3 with yellow, AMY4 with green, AMY5 with purple, and AMY6 with light gray. The active site of each ancestral node was indicated with black arrows. Residues involved in forming direct hydrogen bonds in substrate binding sites were summarized in Table S3.
Inter-kingdom comparisons provide important evidence for the evolution of α-amylase genes in plants. Similar to the case of animals 4,9, our inter-kingdom analysis indicates frequent horizontal gene transfers between plant and bacterial AMY genes (Fig. 4a), and the AMY4 subfamily is more similar to bacterial AMYs than AMY3 is (Fig. 4b). That is to say, the plant AMY genes share a common ancestor, and AMY4 genes have retained more sequence features of the ancestral AMYs in plants. In C. reinhardtii, Aquilegia coerulea, Theobroma cacao and Cucumis sativus, the equivalent chromosomal locations of AMY3 and AMY4 genes suggest gene duplications predating the diversification of green plants. Based on these observations, we propose a diagram for the monophyletic origin of plant AMYs (Fig. 6).
In general, genes with constitutive expression are more conserved than those exhibiting tissue-specific patterns 63,64. The broad expression of AMY4 and AMY5 genes among representative taxa (Fig. S5) suggests that they are functionally essential in plants. Based on signal peptide predictions, we speculate that these two subfamilies target different subcellular localizations for initiating starch degradation, which is partly supported by previous studies 37,54,65. The novel AMY6, which has the closest phylogenetic relationship with AMY5, shares a similar constitutive expression pattern; however, it is sparsely distributed, and absent or lost in particular green lineages (Fig. 1). These observations reflect the taxon-specific character of AMY6 genes. Cereal α-amylases, especially those encoded by AMY1, AMY2 and AMY3 genes, are typical secretory proteins during seed germination or grain development. In wheat and barley, AMY1 and AMY2 showed clear expression divergence in terms of transcript abundance, specific tissues or developmental stages 29, and the wheat AMY3 (Traest_5A1 in Fig. 3a) shared a similar pattern with AMY2 in developing grains 30,31. Unexpectedly, the expression of the archetype AMY3 genes (Traest_5A2, Traest_5B3, Traest_5D2 and Hovulg_5Hr1g068350) was hardly detected in any of the tissues sampled (data collected from https://wheat-urgi.versailles.inra.fr/ and http://webblast.ipk-gatersleben.de/barley_ibsc/, respectively), implying their nonfunctionalization during evolution. In rice, the expression divergence is also evident between AMY1 and AMY2, and the four duplicated AMY3 paralogs have more diversified expression profiles than the single- or two-copy genes in other representative organisms (Fig. S5).
The combination of sequence comparison, selection simulation and expression analysis leads to a definition of functional divergence between subfamilies. Hypothetically, the presence of multiple divergent isoforms could enable plants to exploit various categories of starch-like polysaccharides under a broad range of environmental conditions. Structural comparison further indicates that the active site and SBS1 are well conserved (Fig. 5), not only in the cereal isoforms AMY1 and AMY2 42,66, but among all six divergent enzymes (AMY1-AMY6). Generally, the active site is responsible for binding and catalyzing substrates 41,42,66, whereas the secondary binding site (SBS1) and the AMY2-specific binding site (SBS2) are critical only for binding different starch-like polysaccharides 67,68. This conservation reflects the shared ability of AMY subfamilies to digest starch. However, the electrostatic changes between AMY1 + AMY2 + AMY3 and AMY4 + AMY5 + AMY6 (Fig. 5), together with the substituted residues involved in forming direct hydrogen bonds (Table S3), may suggest an evolutionary diversification of enzyme specificity, such as substrate preference and product specificity. Furthermore, the N-terminal presequences of AMY5 genes also contain another kind of noncatalytic carbohydrate-binding module (CBM45), which was reported to be associated with plastidial starch metabolism 47.

Methods
Sequence data. As previously reported by some authors 34,39,58 and demonstrated by Yu et al. 69, AMY genes were categorized into AMY1 to AMY4 among cereal crops and AtAMY1 to AtAMY3 in A. thaliana. For systematic identification of AMYs in green plants, we selected seven well-characterized AMY genes as queries (four GenBank accessions: CAX51372, CAX51374, AAA34259 and CAX51375 in barley and wheat, and three Araport11 entries: AT4G25000, AT1G76130 and AT1G69830 in A. thaliana). With these queries we performed blastp searches against the databases of Phytozome v12.1 (https://phytozome.jgi.doe.gov/pz/portal.html), KEGG (http://www.genome.jp/) and IPK Barley (http://webblast.ipk-gatersleben.de/barley_ibsc/), and blastn searches against Wheat URGI (https://wheat-urgi.versailles.inra.fr/), all with default settings; these databases covered 63, 13, 1 and 1 green plant species, respectively (Table S1). The deduced amino acid sequences were subjected to multiple sequence alignment with ClustalX 70 and BioEdit 71. Sequences not belonging to GH13_6, and/or with apparent sequence erosion (poor coverage and quality), were discarded. For copy number determination, the primary peptide was chosen if alternatively spliced peptides were available for a copy. Since the N- and C-terminal end sections aligned poorly across different AMY subfamilies, they were trimmed before phylogenetic reconstruction. When several loci per species and several copies per locus were available, a single protein/copy sequence was kept if these loci/copies were shown to cluster in the same lineage. Subsequently, we selected 19 protein sequences representing all the analyzed AMY subfamilies across major plant lineages to detect protein domain architecture using HMMER 72. All the searches were run with default parameters. Signal peptides were also predicted using the TargetP 1.1 server 73,74.

Phylogenetic reconstructions. In the first step, we analyzed the phylogenetic relationships between different subfamilies. We conducted Neighbor-Joining (NJ), Maximum Likelihood (ML) and Bayesian inference (BI) analyses using the single-copy orthologous dataset, including 309 protein sequences of functional AMY genes and 5 bacterial outgroups retrieved from NCBI (https://www.ncbi.nlm.nih.gov/) according to Da Lage et al. 9. The NJ tree was inferred using the Jones-Taylor-Thornton (JTT) model with Gamma Distributed (G) rates across sites in MEGA7 75. Branch confidence levels were estimated using 1000 bootstrap replications. The ML trees were constructed in MEGA7 using the best-fitting amino acid model, LG + G, which had the lowest Bayesian Information Criterion (BIC) score. Support values were estimated from 500 non-parametric bootstrap iterations. The BI analyses were performed using MrBayes 3.2 76. The preset Poisson substitution model was replaced by the 'Whelan and Goldman' (WAG) model, which was selected by estimating across the fixed-rate models implemented in MrBayes 3.2. The option for rates was set to invgamma, and all the other parameters of the likelihood model were left at their default values. Four simultaneous Markov chains (3 heated and 1 cold) were run from a random starting tree for 3 million generations, and trees were sampled every 1,000 generations. The standard deviation of split frequencies fell below 0.05. Using a relative burn-in of 25% for diagnostics, the consensus tree was based on the remaining 1,500 trees. The confidence level of the tree topologies was estimated according to Bayesian posterior probabilities (BPPs). Gaps or missing data were handled by complete deletion.
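The bootstrap replications used for the NJ and ML trees rest on resampling alignment columns with replacement. The sketch below illustrates only that resampling step; it is not MEGA7's implementation, and the function name and the toy three-sequence alignment are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap pseudo-replicate of a protein alignment.

    alignment: dict mapping sequence name to an aligned sequence string
               (all strings must have the same length).
    rng:       a random.Random instance, so that runs are reproducible.
    Columns are drawn with replacement until the original length is reached.
    """
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


# Toy alignment, for illustration only (not real AMY sequences).
toy = {"seqA": "MKTAY", "seqB": "MRTAF", "seqC": "MKSAY"}
replicate = bootstrap_columns(toy, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```

In a full analysis, a tree would be rebuilt from each replicate (1000 for NJ and 500 for ML here), and the support for a branch would be the percentage of replicate trees that recover it.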
Horizontal gene transfer (HGT) is the exchange of genetic material between organisms that are not in a parent-offspring relationship 77. HGT was observed between animal and bacterial AMY genes 9. To understand the origin of green plant AMY genes, we compared protein sequences from eukaryotes (animals, plants, fungi and red algae) with those from bacteria. First, we searched the keyword 'alpha-amylase' against the protein records of NCBI, and filtered for Swissprot entries. This strategy produced 87 animal and 79 fungal AMYs, and based on the global alignment, we selected 13 representative sequences from animals and 10 from fungi. Three AMYs in A. thaliana were selected as representatives of green plants. In red algae, we selected the single periplasmic sequence (Gasu_48600) in G. sulphuraria from KEGG as the representative. Then, we used all these representative sequences to blast against the KEGG prokaryote genomes. The top 10 hits with the annotation 'Alpha-amylase, EC: 3.2.1.1' were kept. Finally, a dataset of 48 sequences from different living kingdoms (Table S2) was produced and used for NJ phylogenetic analysis with MEGA7. Support values were estimated using 1000 bootstrap replications.

(2019) 9:4929 | https://doi.org/10.1038/s41598-019-41420-w

Synteny detection. Based on the well-established model of cereal genome evolution 55,56, chromosome-scale pseudomolecules carrying AMY targets were downloaded from available resources for local database construction. Reciprocal blast was carried out to confirm the orthologous relationships 78. By manual chromosomal walking, genic markers flanking AMY targets were used as queries to blast against the local database using the basic tool NCBI-BLAST-2.4.0+ 79. Genomic segments covering these markers were selected for gene order detection.
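The reciprocal blast step used to confirm orthology can be sketched as a reciprocal-best-hit filter. This is an assumption about the criterion, since the text does not spell out how the reciprocal results were compared; the function name and the gene identifiers below are hypothetical, and the best-hit tables would in practice be parsed from BLAST output.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return gene pairs that are each other's best hit.

    a_to_b: dict mapping each gene of genome A to its best hit in genome B.
    b_to_a: dict mapping each gene of genome B to its best hit in genome A.
    A pair (a, b) is kept only when a -> b and b -> a both hold.
    """
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Hypothetical best-hit tables, for illustration only.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(a_to_b, b_to_a))
```

Here a2 is dropped because its best hit b2 points back to a3 rather than to a2, which is the situation the reciprocal check is meant to catch among tandemly duplicated copies.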

Maximum likelihood analyses of positive selection. To measure variation in functional constraints and to test whether positive selection was involved among evolutionary AMY gene lineages, we estimated the omega values (ω = dN/dS, the ratio of the rate of nonsynonymous substitution per nonsynonymous site [dN] to the rate of synonymous substitution per synonymous site [dS]) using the maximum-likelihood based CODEML program implemented in PAML, version 4.9 80,81. The branch models allowed the ω ratio to vary among lineages, and were specified for detecting adaptive selection acting on particular branches (variable 'model' and NSsites = 0) 82. The site analysis allowed ω to vary among codon sites but kept it constant over the tree topologies (model = 0 and variable 'NSsites') 83. The alternative Model A, which is also the recommended branch-site test of positive selection, assumes that sites in predefined foreground branches are allowed to evolve under positive selection (ω2 > 1), whereas the background branches evolve neutrally (ω1 = 1) or under purifying selection (0 < ω0 < 1) 84,85. Model A was tested against the nearly neutral site model M1a (Chi-square test, degree of freedom [df] = 2) using the likelihood ratio test (LRT) 82. The Bayes Empirical Bayes (BEB) approach was used to calculate posterior probabilities and to select sites from the site class with ω > 1 86. The codon dataset consisted of 41 sequences with 1137 characters. It was output from the PAL2NAL server (http://www.bork.embl.de/pal2nal/), and then used for model simulations. The phylogenetic subtree is presented in Fig. S1.

3D structure analysis and expression level comparison. To evaluate potential functional changes among different AMY subfamilies, we carried out 3D structure studies for each ancestral node back to the time point of gene divergence or duplication. A total of 10 ancestral amino acid sequences (AMY1 to AMY6, AMY12, AMY123, AMY56 and AMY456) were generated using parsimony state reconstruction in Mesquite 3.31 (http://mesquiteproject.org). We modeled protein structures from the ancestral state sequences in the SWISS-MODEL workspace 87, based on their respective best-matching templates from the RCSB Protein Data Bank (PDB), such as template entries 1BG9 41, 1RPK 66, 2QPU 88 and 3WN6 43. The resulting models were then loaded into UCSF Chimera 89 for electrostatic potential mapping (Coulombic surface coloring defaults: ε = 4r, thresholds ±10 kcal/mol·e), domain comparison and visualization.
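The Coulombic surface coloring settings quoted above (ε = 4r, thresholds ±10 kcal/mol·e) correspond to Coulomb's law with a distance-dependent dielectric. The sketch below shows that calculation for point charges; it is not UCSF Chimera's implementation. It assumes charges in units of e, distances in Å, and the commonly used conversion factor of about 332.06 kcal·Å/(mol·e²); the function names and the toy charge are hypothetical.

```python
import math

# Assumed conversion factor so that the result is in kcal/(mol*e).
COULOMB_KCAL = 332.06


def coulombic_potential(point, charges):
    """Potential at `point` from point charges with dielectric eps = 4*d.

    point:   (x, y, z) coordinates in angstroms.
    charges: iterable of ((x, y, z), q) with q in units of e.
    The point must not coincide with any charge position.
    """
    phi = 0.0
    for position, q in charges:
        d = math.dist(point, position)
        epsilon = 4.0 * d          # distance-dependent dielectric
        phi += q / (epsilon * d)   # Coulomb term q / (eps * d)
    return COULOMB_KCAL * phi


def clamp_for_coloring(phi, limit=10.0):
    """Clip a potential to the +/-limit range used for the color scale."""
    return max(-limit, min(limit, phi))


# Toy example: a single +1 charge located 2 angstroms from the probe point.
phi = coulombic_potential((0.0, 0.0, 0.0), [((2.0, 0.0, 0.0), 1.0)])
print(round(phi, 3), clamp_for_coloring(phi))
```

Values beyond the thresholds saturate the color scale, which is why the negatively and positively charged lineages appear as uniformly colored patches even when the underlying potentials differ in magnitude.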