Introduction

Plants are continuously subjected to biotic and abiotic stresses throughout their life cycle. Hence, they have evolved a complex series of signaling mechanisms to perceive and respond to different signals via different signaling pathways. Transcriptional regulation plays a central role in the response to these signaling events. It has progressed from ancient to advanced life forms and is inseparably connected with developmental progression. Transcription is regulated by different types of regulatory proteins commonly known as transcription factors (TFs). TFs can activate or repress the expression of target genes responsible for the regulation of different signaling cascades1,2,3. The WRKY TFs are one such family found in plants. WRKY TFs are characterized by the presence of a unique WRKY domain of approximately 60 amino acid residues3,4,5. The domain contains a highly conserved WRKYGQK heptapeptide sequence and a conserved C2H2 or C2HC zinc finger motif. The conserved WRKY domain plays a crucial role by binding to the W-box DNA motif TTGACC/T of the target gene3,5,6. Almost all WRKY TFs identified thus far preferentially bind to a specific core DNA sequence7. In addition to binding to the W-box DNA motif, some WRKY TFs also bind to other sites. For example, Oryza sativa OsWRKY13 binds to PRE4 (pathogen-responsive element; TGCGCTT), and Hordeum vulgare HvWRKY46 binds to SURE (sugar-responsive element; TAAAGATTACTAATAGGAA)8,9. The binding of a WRKY TF to the W-box and other elements leads to synergistic transcriptional activation in plants10. In addition, the conserved WRKY amino acid sequence is occasionally replaced by WRRY, WSKY, WKRY, WVKY or WKKY11.
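
As a minimal illustration of the W-box recognition described above, the following Python sketch scans one strand of a DNA sequence for the core motif TTGACC/T. The function name find_w_boxes and the example promoter fragment are hypothetical and were made up for illustration; they are not part of the present study, and the reverse strand is not searched.

```python
import re

# W-box core motif from the text: TTGACC/T, i.e. TTGAC followed by C or T.
# The lookahead makes the scan report overlapping occurrences as well.
W_BOX = re.compile(r"(?=(TTGAC[CT]))")


def find_w_boxes(promoter: str) -> list[tuple[int, str]]:
    """Return (0-based position, matched motif) for each W-box on the given strand."""
    seq = promoter.upper()
    return [(m.start(), m.group(1)) for m in W_BOX.finditer(seq)]


# Hypothetical promoter fragment (illustration only, not real data).
example_promoter = "aaTTGACCgggcTTGACTtttTTGACA"
hits = find_w_boxes(example_promoter)
for position, motif in hits:
    print(position, motif)
```

The last TTGACA stretch in the example is deliberately not a W-box, so it is not reported. A fuller search would also scan the reverse complement.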

The model plant Arabidopsis thaliana encodes 74 WRKY TFs in its genome. Based on sequence similarity and phylogenetic relationships, WRKY TFs are divided into three groups (I, II, and III); group II is further divided into several sub-groups (e.g., IIa, IIb, IIc, IId, IIe, IIf, and IIg)4,12. There are two different types of WRKY TFs: (1) those that contain a single WRKY domain at the C-terminal end, and (2) those that contain two WRKY domains, one at the N-terminal end and the other at the C-terminal end. The WRKY proteins that contain a single WRKY domain fall in groups II and III, whereas the WRKY proteins that contain two WRKY domains (N- and C-terminal) fall in group I4,12. The WRKY proteins that contain two WRKY domains are functionally redundant13. The N-terminal WRKY domain increases the affinity and specificity of binding to the target gene, whereas the C-terminal WRKY domain constitutes the major DNA-binding domain4,14,15,16. The single WRKY domain-containing WRKY TFs (groups II and III) are considerably more similar in sequence to the C-terminal WRKY domain than to the N-terminal domain of group I WRKY TFs. These findings suggest that the C-terminal WRKY domain of group I WRKY TFs and the single WRKY domain of groups II and III WRKY TFs are functionally equivalent and constitute the major DNA-binding domain4.
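
The domain-count rule stated above can be sketched in a few lines of Python. This is only a rough illustration under a strong assumption: the number of exact WRKYGQK heptapeptides is used as a proxy for the number of WRKY domains, so substituted variants (e.g., WKKY) would be missed. The function names and the toy sequences are hypothetical.

```python
def count_wrky_heptapeptides(protein: str) -> int:
    """Count exact WRKYGQK occurrences (assumed proxy for the number of WRKY domains)."""
    return protein.upper().count("WRKYGQK")


def coarse_group(n_domains: int) -> str:
    """Apply the rule from the text: two domains -> group I, one domain -> group II or III."""
    if n_domains == 2:
        return "group I"
    if n_domains == 1:
        return "group II or III"
    return "unclassified"


# Toy sequences invented for illustration; they are not real WRKY proteins.
toy_proteins = {
    "toy_single": "MSDWRKYGQKAAA",
    "toy_double": "MWRKYGQKSSSSWRKYGQKTT",
}
for name, seq in toy_proteins.items():
    print(name, coarse_group(count_wrky_heptapeptides(seq)))
```

Separating groups II and III requires additional sequence and phylogenetic information, which this sketch does not attempt.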

The WRKY TFs have been reported to play important roles in cellular and physiological processes, including seed germination17,18, root development19, plant growth20, seed development21,22,23 and senescence24,25,26. Furthermore, they are involved in diverse responses to biotic stress caused by insect herbivores27,28, bacterial pathogens29,30, fungi31 and viruses32. They respond to different signaling molecules such as indole-3-acetic acid19, jasmonic acid33, salicylic acid34, abscisic acid35,36, and gibberellic acid37. In addition, WRKY TFs respond to different abiotic stresses38 such as UV radiation39, high and low temperatures40,41, H2O242,43, and salt and drought stresses44,45. Therefore, understanding the basic biology and genomics of WRKY TFs in plants is very important.

Numerous studies have been conducted with WRKY TFs in different plant species, including Arabidopsis thaliana4, Brachypodium distachyon14, Gossypium raimondii46, Lotus japonicus47, Oryza sativa48, Ricinus communis49, Setaria italica50, Solanum lycopersicum51, Triticum aestivum52, and Vitis vinifera53. Different research groups have used different grouping systems for the WRKY TFs, leading to a lack of consistency. Thus, it was important to formulate a new and clear grouping system for all WRKY TFs of the plant kingdom identified so far. Xi et al.11 reported the presence of a deduced WRKY domain. Therefore, we were also interested in determining whether plant genomes encode WRKY TFs with additional novel, modified WRKY domains. Rinerson et al.54 reported the presence of chimeric WRKY TFs that contain combinations of novel protein domains and WRKY domains. Hence, it was also of interest to elucidate more details about these chimeric proteins. Genome sequencing data from different plant species are increasing rapidly, providing an excellent platform for better understanding the WRKY TF gene family. Therefore, we conducted genome-wide identification of the WRKY TF gene family from 43 plant species and analysed their genomic, phylogenetic, and other basic characteristics to decipher their novel genomic constitution.

Results

Identification of WRKY TFs

Genome-wide identification of WRKY TF gene family members was performed using 43 species across the evolutionary lineage of the plant kingdom (Table 1). These species included dicots (27), monocots (7), algae (5), bryophytes (1), pteridophytes (1), gymnosperms (1) and amoebae (1). In total, 3035 WRKY TFs were identified from these species. Of the studied species, the monocot Panicum virgatum encoded the maximum number of WRKY TFs (167), whereas the green algae Chlamydomonas reinhardtii and Coccomyxa subellipsoidea encoded the minimum (only one each). Among dicots, Brassica rapa and Glycine max each encoded 145 WRKY TFs, whereas the amoeba Dictyostelium purpureum encoded nine. The WRKY TFs of the algae C. reinhardtii, C. subellipsoidea, and M. pusilla contained only a single WRKY domain (C-terminal WRKY domain), whereas those of O. lucimarinus and V. carteri contained both single and double WRKY domains. The WRKY TF gene family of the amoeba D. purpureum contained both single (C-terminal) and double (N- and C-terminal) WRKY domains.

Table 1 WRKY TF gene family of 43 species.

Genomics of WRKY TFs

The transcript organization of WRKY TFs is highly variable. F. vesca FvWRKY70-7 has the largest transcript, with an open reading frame (ORF) of 5949 nucleotides (1982 amino acids). In contrast, M. domestica MdWRKY61-2 is the smallest WRKY TF, with an ORF of only 135 nucleotides (44 amino acids). The intron organization of WRKY TFs is very dynamic, ranging from zero to twenty introns per gene. The number of plant WRKY TFs that contain various numbers of introns is as follows: zero (46), one (338), two (1440), three (488), four (375), five (223), six (61), seven (20), eight (5), nine (9), ten (12), eleven (4), twelve (3), thirteen (3), fourteen (0), fifteen (2), sixteen (1), seventeen (0), eighteen (2), nineteen (0), and twenty (2).
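
The figures above can be cross-checked with a short Python sketch. It assumes that the reported ORF lengths include the stop codon, which is consistent with both reported nucleotide/amino acid pairs; the variable and function names are hypothetical.

```python
# Number of WRKY TFs per intron count, as listed in the text.
intron_counts = {
    0: 46, 1: 338, 2: 1440, 3: 488, 4: 375, 5: 223, 6: 61, 7: 20,
    8: 5, 9: 9, 10: 12, 11: 4, 12: 3, 13: 3, 14: 0, 15: 2,
    16: 1, 17: 0, 18: 2, 19: 0, 20: 2,
}

total_genes = sum(intron_counts.values())
# Most common number of introns per gene.
modal_introns = max(intron_counts, key=intron_counts.get)
# Fraction of the listed genes that carry exactly two introns.
share_two_introns = intron_counts[2] / total_genes


def orf_to_amino_acids(orf_nt: int) -> int:
    """Convert ORF length (nt) to protein length, assuming the last codon is a stop codon."""
    codons, remainder = divmod(orf_nt, 3)
    if remainder != 0:
        raise ValueError("ORF length must be a multiple of three")
    return codons - 1


print(modal_introns, round(share_two_introns, 3))
print(orf_to_amino_acids(5949), orf_to_amino_acids(135))
```

The two-intron class is by far the most frequent, accounting for nearly half of the listed genes.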

Novel WRKY TFs

In general, WRKY TFs are characterized by the presence of either one (Fig. 1) or two WRKY domains. In this study, we identified 16 chimeric forms of WRKY TFs in plants (Fig. 2). These include WRKY TFs that contain three WRKY domains (GrWRKY12, GrWRKY21-5, and LuWRKY3-7) (Fig. 2-A); four WRKY domains (AcWRKY1, SlWRKY4-2) (Fig. 2-B); three WRKY domains with the ZF_SBP TF domain (LuWRKY3-5, LuWRKY3-6) (Fig. 2-C); a single WRKY domain with three CBS domains (BrWRKY36-2) (Fig. 2-D); a kinase domain followed by a single WRKY domain (FvWRKY59) (Fig. 2-E); a kinase domain followed by two WRKY domains (PhWRKY59) (Fig. 2-F); two WRKY domains followed by a kinase domain (BrWRKY58-1, BrWRKY58-2) (Fig. 2-G); a PAH domain followed by two WRKY domains and one kinase domain (AtWRKY19) (Fig. 2-H); a ULP_protease domain followed by a WRKY domain (OsWRKY57, PvWRKY57-1, and SbWRKY57) (Fig. 2-I); a TIR domain followed by a WRKY domain (FvWRKY52, GmWRKY55-3) (Fig. 2-J); a TIR domain followed by two WRKY domains (FvWRKY70-7) (Fig. 2-K); a TIR domain followed by seven LRR domains and a WRKY domain (FvWRKY16) (Fig. 2-L); two LRR domains followed by an NAC domain and two WRKY domains (SbWRKY59) (Fig. 2-M); an ATP_GRASP domain followed by a WRKY domain (AlWRKY16) (Fig. 2-N); a B3 domain followed by a WRKY domain (PvWRKY94-1) (Fig. 2-O); and a WRKY domain followed by a ZF_SBP domain (SiWRKY59-2) (Fig. 2-P).

Figure 1

The schematic representation of the secondary and tertiary structures of WRKY TFs.

(A) General secondary structure of the WRKY TF with the Zn ligand, (B) space-fill model of a WRKY TF showing the Zn ligand in red and the WRKY domain in blue, (C) position of the Zn ligand in the cavity of the WRKY TF, (D) hydrogen bonding of the Zn ligand with the WRKY TF, (E) secondary structure of a WRKY TF showing the position of the WRKY domain and hydrogen bonding of the Zn ligand. The molecular structure of the WRKY TF was predicted with the GENO3D server using AtWRKY1 as the query.

Figure 2

Novel WRKY TFs of plants.

In addition to the classic WRKY TFs in plants, the present study revealed the presence of novel WRKY TFs. These novel WRKY TFs are as follows: (A) WRKY TFs with three WRKY domains (GrWRKY12, GrWRKY21-5, LuWRKY3-7), (B) WRKY TFs with four WRKY domains (AcWRKY1, SlWRKY4-2), (C) WRKY TFs with three WRKY domains followed by a ZF_SBP TF domain (LuWRKY3-5, LuWRKY3-6), (D) WRKY domain followed by three calcium binding CBS domains (BrWRKY36-2), (E) kinase domain followed by one WRKY domain (FvWRKY59), (F) kinase domain followed by two WRKY domains (PhWRKY59), (G) two WRKY domains followed by a kinase domain (BrWRKY58-1, BrWRKY58-2), (H) PAH domain followed by two WRKY domains and a kinase domain (AtWRKY19), (I) protease domain followed by a WRKY domain (OsWRKY57, PvWRKY57-1, SbWRKY57), (J) TIR domain followed by a WRKY domain (FvWRKY52, GmWRKY55-3), (K) TIR domain followed by two WRKY domains (FvWRKY70-7), (L) TIR domain followed by an LRR domain and a WRKY domain (FvWRKY16), (M) LRR and NAC domains followed by two WRKY domains (SbWRKY59), (N) ATP_GRASP domain followed by a WRKY domain (AlWRKY16), (O) B3 domain followed by a WRKY domain (PvWRKY94-1), and (P) WRKY domain followed by a ZF_SBP domain (SiWRKY59-2).

Conserved domains of WRKY TFs

To understand the conserved domains of WRKY TFs, multiple sequence alignments of single (C-terminal domain) and double WRKY domain (both N- and C-terminal domain) proteins were analyzed separately. The single WRKY domain (C-terminal)-containing proteins included the conserved W-R-K-Y-G-Q-K, P-R-x-Y-Y-x-C-x5-C, K-x-V, and H-x-H domains as well as several conserved amino acid residues (Supplementary Figure 1). The N-terminal region of double WRKY domain proteins contains the conserved D-G-Y-N-W-R-K-Y-G-Q-K and R-S-Y-Y-x-C-x4-C-x22-H-x-H domains. The C-terminal region of double WRKY domain proteins contains the conserved D-G-Y-R-W-R-K-Y-G-Q-K, R-S-Y-Y-x-C-x4-C, V-R-K-H-V-E, and H-x-H domains (Supplementary Figure 2). In some cases, the conserved WRKY amino acids in the WRKY domain were replaced with other amino acids, including W-K-K-Y (BrWRKY10-4, CcWRKY57-2, CsWRKY10-2, EgWRKY49-2, LuWRKY70-2, PvulWRKY33-3, PvulWRKY33-4, PpWRKY46-1, PpWRKY46-2, PpWRKY55-1, PpWRKY52-2, PaWRKY10, PaWRKY42-6, PtWRKY10, PtWRKY35, PperWRKY33-1 and SbWRKY60), W-R-I-Y (AlWRKY5-2), W-R-K-N (BrWRKY20-3), W-R-K-D (BrWRKY26), W-H-Q-Y (GmWRKY4-3), W-R-I-S (GrWRKY12), W-R-Q-V (LuWRKY58-1), G-R-K-Y (LuWRKY41-1), W-L-K-Y (PhWRKY31-2), W-R-E-Y (PhWRKY101), A-R-K-M (PvWRKY57-1, PvWRKY57-2, PvWRKY57-3), W-W-K-N (PvWRKY57-2, PvWRKY57-3), W-R-M-Y (PvWRKY82-2), W-R-K-R (SlWRKY20-3), W-I-K-Y (SlWRKY2-2), W-S-K-Y (SlWRKY27-5), W-Q-K-Y (SlWRKY27-1), W-H-K-C (StWRKY29), W-R-C-I (TcWRKY52), F-R-K-Y (PtWRKY34), R-S-Q-Y (EgWRKY75-1), W-T-K-Y (EgWRKY44-2), W-K-K-C (PvulWRKY33-4) and W-R-K-C (StWRKY29-1) (Fig. 3).
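
To make the substitutions listed above easier to read, the following Python sketch compares a four-residue core with the canonical W-R-K-Y sequence and reports which positions differ. The function name core_substitutions is hypothetical; the example cores are taken from the list above.

```python
CANONICAL_CORE = "WRKY"


def core_substitutions(core: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, canonical residue, observed residue) for each mismatch."""
    observed = core.replace("-", "").upper()
    if len(observed) != len(CANONICAL_CORE):
        raise ValueError("expected a four-residue core such as 'W-K-K-Y'")
    return [
        (position, expected, found)
        for position, (expected, found) in enumerate(zip(CANONICAL_CORE, observed), start=1)
        if expected != found
    ]


# Variant cores reported in the text.
for variant in ["W-K-K-Y", "W-R-I-Y", "A-R-K-M", "R-S-Q-Y"]:
    print(variant, core_substitutions(variant))
```

Read this way, most reported variants differ from W-R-K-Y at a single position, whereas a few (for example A-R-K-M and R-S-Q-Y) differ at two or three positions.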

Figure 3

Substitute WRKY domain of plants.

Different novel substitutes of WRKY domains were found in the N- and C-terminal regions of WRKY TFs. The conserved WRKY amino acids were replaced by different types of amino acids. The N- and C-terminal WRKY domains of A. thaliana AtWRKY were aligned with these novel substitutes of WRKY domains. The alignment indicates that the WRKY amino acids have been replaced by these novel amino acids. Multiple sequence alignment of WRKY TFs was performed using the MultAlin software (http://multalin.toulouse.inra.fr/multalin/) with the BLOSUM62 protein weight matrix.

Phylogeny of WRKY TFs

Five phylogenetic trees were constructed, each from a different subset of the plant WRKY TFs, to better understand their grouping and phylogenetic relationships. In the first case, the WRKY TFs of monocots, dicots and lower eukaryotic (amoebae, algae, bryophytes, pteridophytes and gymnosperms) plants were combined and used to construct a phylogenetic tree. The results showed the presence of eight phylogenetically distinct and independent groups that were denoted as groups I (red), II (lime), III (black), IV (blue), V (black), VI (pink), VII (green) and VIII (black) (Fig. 4, Table 2). The phylogenetic tree generated from monocots and lower eukaryotic plants formed six phylogenetically distinct groups, which were named groups I (red), II (lime), III (green), IV (blue), V (pink) and VI (green) (Fig. 5, Table 3). Sub-groups of group II were absent in monocot plants. The phylogenetic tree constructed from dicot and lower eukaryotic WRKY TFs yielded three groups, namely groups I (pink), II (with sub-groups IIa (red), IIb (lime), and IIc (blue)), and III (green) (Fig. 6, Table 4). When all the WRKY TFs of monocot, dicot, and lower eukaryotic plants that contain only a C-terminal WRKY domain were combined, the phylogenetic tree resulted in six groups, namely groups I (red), II (lime), III (blue), IV (pink), V (green) and VI (purple) (Fig. 7, Table 5). Similarly, all WRKY TFs of monocot, dicot and lower eukaryotic plants that contained both N- and C-terminal WRKY domains were combined; this resulted in a phylogenetic tree containing seven groups, named groups I (red), II (lime), III (blue), IV (purple), V (pink), VI (green) and VII (purple) (Fig. 8, Table 6).

Table 2 Phylogenetic tree of WRKY TFs of monocot, dicot, and lower eukaryotic (amoeba, algae, bryophyte, pteridophyte, and gymnosperm) plants.
Table 3 Phylogenetic tree of WRKY TFs of monocot and lower eukaryotic plants.
Table 4 Phylogenetic tree of WRKY TFs of dicot and lower eukaryotic plants.
Table 5 Phylogenetic tree of WRKY TFs of monocot, dicot and lower eukaryotic plants that contain only a single WRKY domain (C-terminal WRKY TFs).
Table 6 Phylogenetic tree of WRKY TFs of monocot, dicot and lower eukaryotic plants that contain only double WRKY domain (N-terminal and C-terminal WRKY domains).
Figure 4

Unrooted phylogenetic tree of WRKY TFs of monocot, dicot, and lower eukaryotic (amoeba, algae, bryophyte, pteridophyte, and gymnosperm) plants.

The phylogenetic tree shows eight independent groups, named groups I (red), II (lime), III (black), IV (blue), V (black), VI (pink), VII (green), and VIII (black). For details on the distribution of the WRKY TFs among the groups, please refer to Supplementary Figure 3. The phylogenetic tree revealed that the WRKY family members of one group overlapped with those of another group. The phylogenetic tree was constructed using MEGA6.

Figure 5

Unrooted phylogenetic tree of WRKY TFs of monocot and lower eukaryotic (amoeba, algae, bryophyte, pteridophyte and gymnosperm) plants.

The phylogenetic tree shows six independent phylogenetic groups, named groups I (red), II (lime), III (green), IV (blue), V (pink) and VI (green). The WRKY TF members are specific to their groups, and no members of one group overlap with those of any other group. The phylogenetic tree was constructed using MEGA6.

Figure 6

Unrooted phylogenetic tree of WRKY TFs of dicot and lower eukaryotic (amoeba, algae, bryophyte, pteridophyte, and gymnosperm) plants.

The phylogenetic tree shows the presence of three phylogenetically distinct groups, named groups I (pink), II (with sub-groups IIa (red), IIb (lime), and IIc (blue)), and III (green). The members of sub-groups IIa, IIb and IIc overlapped with each other and were hence retained as sub-groups of group II. The WRKY TF members of groups I and III did not overlap with one another. The classification into groups I, II, and III resembled that used in previously published studies. The phylogenetic tree was constructed using MEGA6.

Figure 7

Unrooted phylogenetic tree of C-terminal WRKY domain containing WRKY TFs.

The phylogenetic tree shows six phylogenetically independent groups, I (red), II (lime), III (blue), IV (pink), V (green) and VI (purple). The phylogenetic tree was constructed by using MEGA6.

Figure 8

Unrooted phylogenetic tree of N- and C-terminal WRKY domains containing WRKY TFs.

The phylogenetic tree shows the presence of seven phylogenetically distinct groups, I (red), II (lime), III (blue), IV (purple), V (pink), VI (green) and VII (purple). The phylogenetic tree was constructed by using MEGA6.

The substitution pattern and evolutionary rates were estimated by analyzing the shape parameters for the discrete gamma distributions. The rates were estimated using the Jones-Taylor-Thornton (JTT) model (+G). A discrete gamma distribution was used to model evolutionary rate differences among sites (5 categories, [+G]). For estimating ML values, a tree topology was automatically computed. The mean evolutionary rates for dicot and lower eukaryotic WRKY proteins were 0.15, 0.42, 0.75, 1.23, and 2.45 substitutions per site. The amino acid frequencies were 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). The maximum log likelihood for this computation was −19363.118. The analysis involved 774 amino acid sequences. The mean evolutionary rates for monocot and lower eukaryotic WRKY proteins were 0.15, 0.42, 0.75, 1.23 and 2.44 substitutions per site. The amino acid frequencies were 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y) and 6.64% (V). The maximum log likelihood for this computation was −16801.681 and the analysis involved 896 amino acid sequences. The mean evolutionary rates for WRKY proteins that contained a single WRKY domain were 0.13, 0.40, 0.73, 1.23, and 2.51 substitutions per site. The amino acid frequencies were 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). The maximum log likelihood for this computation was −13476.656. The analysis involved 445 amino acid sequences. The mean evolutionary rates for WRKY proteins that contained double WRKY domains were 0.11, 0.36, 0.70, 1.22, and 2.60 substitutions per site. The amino acid frequencies were 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). The maximum log likelihood for this computation was −30333.349. The analysis involved 480 amino acid sequences. All positions with less than 95% site coverage were eliminated. Thus, fewer than 5% alignment gaps, missing data, and ambiguous bases were allowed at any position.
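
The reported values lend themselves to two simple consistency checks, sketched below in Python. In a discrete gamma model the category rates are conventionally scaled so that their mean is 1, and the twenty amino acid frequencies should sum to 100%. The dictionary names are hypothetical; the numbers are those reported above.

```python
from statistics import mean

# Discrete gamma category rates (substitutions per site), five categories per data set.
gamma_rates = {
    "dicot + lower eukaryotes": [0.15, 0.42, 0.75, 1.23, 2.45],
    "monocot + lower eukaryotes": [0.15, 0.42, 0.75, 1.23, 2.44],
    "single WRKY domain": [0.13, 0.40, 0.73, 1.23, 2.51],
    "double WRKY domain": [0.11, 0.36, 0.70, 1.22, 2.60],
}

# Amino acid frequencies (%) reported for the JTT model.
aa_frequencies = {
    "A": 7.69, "R": 5.11, "N": 4.25, "D": 5.13, "C": 2.03, "Q": 4.11, "E": 6.18,
    "G": 7.47, "H": 2.30, "I": 5.26, "L": 9.11, "K": 5.95, "M": 2.34, "F": 4.05,
    "P": 5.05, "S": 6.82, "T": 5.85, "W": 1.43, "Y": 3.23, "V": 6.64,
}

# Mean category rate per data set (expected to be close to 1 after rounding).
mean_rates = {name: mean(rates) for name, rates in gamma_rates.items()}
# Ratio of fastest to slowest category: a rough indicator of among-site rate heterogeneity.
rate_spread = {name: max(rates) / min(rates) for name, rates in gamma_rates.items()}
frequency_total = sum(aa_frequencies.values())

for name in gamma_rates:
    print(f"{name}: mean rate = {mean_rates[name]:.3f}, fastest/slowest = {rate_spread[name]:.1f}")
print(f"sum of amino acid frequencies = {frequency_total:.2f}%")
```

By this rough measure, the double WRKY domain data set shows the widest spread between the slowest and fastest rate categories.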

Statistical analysis of WRKY TFs

Tajima’s relative rate test was conducted to determine the statistical significance of the investigated WRKY TFs. In all three replicate analyses, the p-values were found to be significant. The χ2 test results with one degree of freedom were 5.76 (for monocot, dicot and lower eukaryotic WRKY TFs), 13.76 (for monocot and lower eukaryotic WRKY TFs), 4.45 (for dicot and lower eukaryotic WRKY TFs), 5.00 (for single WRKY domain containing WRKY TFs), and 7.41 (for double WRKY domain containing WRKY TFs) (Table 7).
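
For orientation, the χ2 statistics above can be converted to p-values with a few lines of Python. For one degree of freedom, the upper-tail probability equals erfc(sqrt(x/2)), so only the standard library is needed. The function and variable names are hypothetical, and the values computed here are not taken from Table 7.

```python
import math


def chi2_p_value_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Chi-square statistics reported in the text.
tajima_chi2 = {
    "monocot, dicot and lower eukaryotic": 5.76,
    "monocot and lower eukaryotic": 13.76,
    "dicot and lower eukaryotic": 4.45,
    "single WRKY domain": 5.00,
    "double WRKY domain": 7.41,
}

p_values = {name: chi2_p_value_1df(x) for name, x in tajima_chi2.items()}
for name, p in p_values.items():
    print(f"{name}: chi2 = {tajima_chi2[name]:.2f}, p = {p:.4f}")
```

All five statistics exceed 3.84, the critical value at the 0.05 level for one degree of freedom, which agrees with the statement that the tests were significant.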

Table 7 Tajima’s relative rate test.

Gene expression profile of WRKY TFs

The expression profile of the WRKY TFs was elucidated by investigating the gene expression data for G. max and P. vulgaris and analyzing their transcription levels. In G. max, the transcription profile was determined for different tissue samples, including roots, root hair, leaves, stems, flowers, pods, seeds, nodules and shoot apical meristem. The expression level of GmWRKY65-1 (105.342) was the highest among all WRKY TFs (Supplementary Table 2). The expression levels of GmWRKY6-4 and GmWRKY6-5 in the root were 74.668 and 43.341, respectively. Other WRKY TFs with relatively high expression levels were GmWRKY6-6, GmWRKY11-2, GmWRKY11-3, GmWRKY11-4, GmWRKY11-6, and GmWRKY15-2 (Supplementary Table 2). Further, GmWRKY4-3, GmWRKY5-1, GmWRKY5-2, GmWRKY10, GmWRKY13-4, GmWRKY18, GmWRKY33-2, GmWRKY33-3, GmWRKY35-1, GmWRKY35-2, GmWRKY47-1, GmWRKY47-2, GmWRKY47-3, GmWRKY50-1, GmWRKY50-2, GmWRKY54-1, GmWRKY57-1, GmWRKY69-1, GmWRKY69-2, GmWRKY70-3, GmWRKY71-2, GmWRKY72-1, and GmWRKY72-2 were not expressed in the root tissues (Supplementary Table 2). As in roots, the expression of GmWRKY65-1 (35.199) was also the highest in the root hair. Other WRKY TFs that were expressed at relatively high levels were GmWRKY6-4, GmWRKY11-1, GmWRKY11-2, GmWRKY11-3, GmWRKY11-4, GmWRKY11-6, GmWRKY11-7, GmWRKY11-8, GmWRKY15-1, and GmWRKY15-2 (Supplementary Table 2). The WRKY TFs whose expression was not detected in root tissues were GmWRKY4-3, GmWRKY6-3, GmWRKY10, GmWRKY13-4, GmWRKY29-1, GmWRKY54-1, GmWRKY54-2, and GmWRKY56-1 (Supplementary Table 2). In the leaf tissue, the expression level of GmWRKY6-5 (81.847) was the highest among the WRKY TFs. The expression of GmWRKY26-2 in the leaf tissue was 80.957.
Other WRKY TFs with higher expression in the leaf tissue were GmWRKY6-4, GmWRKY15-1, GmWRKY15-2, GmWRKY26-3, GmWRKY41-1, GmWRKY41-2, GmWRKY41-3, and GmWRKY41-7 (Supplementary Table 2). The WRKY TFs whose expression was not detected in the leaves were GmWRKY4-3, GmWRKY6-3, GmWRKY10, GmWRKY13-4, GmWRKY40-1, GmWRKY40-9, GmWRKY41-4, GmWRKY41-6, GmWRKY47-1, GmWRKY50-1, GmWRKY50-2, GmWRKY51-1, GmWRKY51-2, GmWRKY51-3, GmWRKY51-4, GmWRKY55-1, GmWRKY55-3, GmWRKY56-1, GmWRKY56-3, GmWRKY70-1, GmWRKY70-2, GmWRKY70-3, GmWRKY70-6, and GmWRKY70-7 (Supplementary Table 2). In flowers, a higher level of expression was detected for GmWRKY26-2 (67.456), GmWRKY26-3 (51.836), GmWRKY70-6 (61.053), and GmWRKY70-7 (63.153), whereas that of GmWRKY10, GmWRKY13-4, GmWRKY29-1, GmWRKY50-2, GmWRKY67, GmWRKY70-4, GmWRKY72-2, and GmWRKY72-4 was not detected. The expression of GmWRKY44-2 (17.882), GmWRKY23-4 (10.417), GmWRKY11-5 (9.898), GmWRKY11-6 (9.725) and GmWRKY3-1 (9.665) was higher in pods. The expression of GmWRKY4-3, GmWRKY6-3, GmWRKY10, GmWRKY13-4, GmWRKY21-2, GmWRKY21-3, GmWRKY29-1, GmWRKY40-4, GmWRKY48-2, GmWRKY50-2, GmWRKY54-1, GmWRKY55-2, GmWRKY56-1, GmWRKY70-3, GmWRKY70-4, GmWRKY72-1, GmWRKY72-2, GmWRKY72-4, and GmWRKY72-6 was not detected in pods. In seeds, the expression of GmWRKY21-2 (11.200) and GmWRKY21-3 (31.762) was higher, whereas that of GmWRKY3-4, GmWRKY5-1, GmWRKY6-3, GmWRKY10, GmWRKY13-4, GmWRKY18, GmWRKY21-1, GmWRKY29-1, GmWRKY30-1, GmWRKY32-1, GmWRKY32-2, GmWRKY32-3, GmWRKY40-1, GmWRKY40-2, GmWRKY40-3, GmWRKY40-4, GmWRKY40-9, GmWRKY40-10, GmWRKY41-1, GmWRKY47-1, GmWRKY50-1, GmWRKY50-2, GmWRKY51-1, GmWRKY51-2, GmWRKY54-1, GmWRKY54-2, GmWRKY55-1, GmWRKY55-2, GmWRKY56-1, GmWRKY56-2, GmWRKY56-3, GmWRKY67, GmWRKY70-1, GmWRKY70-2, GmWRKY70-3, GmWRKY70-4, GmWRKY70-5, GmWRKY70-6, GmWRKY71-1, GmWRKY72-1, GmWRKY72-2, GmWRKY72-3, GmWRKY72-4, GmWRKY72-5, GmWRKY72-6 and GmWRKY75-3 was not detected.
The expression of GmWRKY65-1 (39.186) was the highest in the nodules. Other genes with relatively high expression in the nodules were GmWRKY (30.341), GmWRKY11-2 (36.175), GmWRKY11-3 (18.965), GmWRKY11-4 (20.702), GmWRKY11-7 (21.960), GmWRKY11-8 (17.019), GmWRKY15-1 (17.912), GmWRKY15-2 (18.552), and GmWRKY69-1 (17.523). The expression of GmWRKY70-7 (35.173) was the highest in the shoot apical meristem. Other WRKY TFs that showed higher expression in the shoot apical meristem were GmWRKY11-8 (18.974), GmWRKY21-3 (18.442), and GmWRKY70-6 (16.468). The expression of GmWRKY4-3, GmWRKY6-1, GmWRKY6-3, GmWRKY10, GmWRKY13-2, GmWRKY13-4, GmWRKY29-1, GmWRKY30-2, GmWRKY40-4, GmWRKY50-1, GmWRKY50-2, GmWRKY55-2, GmWRKY56-1, GmWRKY56-2, GmWRKY56-3, GmWRKY67, GmWRKY70-4, GmWRKY72-2, GmWRKY72-4, and GmWRKY72-6 was not detected in the shoot apical meristem.

In P. vulgaris, the expression of WRKY TFs in different tissue samples, including young trifoliates, leaves, flowers, flower buds, young pods, stems, roots, and nodules was analysed (Supplementary Table 2). In young trifoliates, PvulWRKY17 (37.519) showed the highest expression. Other genes that showed relatively higher expression in young trifoliates included PvulWRKY11-2 (21.790), PvulWRKY15-1 (18.590), PvulWRKY15-2 (24.308), and PvulWRKY19-1 (24.328). In contrast, PvulWRKY9-2, PvulWRKY27-1, PvulWRKY29-1, PvulWRKY35, PvulWRKY43-1, PvulWRKY47-2, PvulWRKY51-1, PvulWRKY59-1, PvulWRKY59-2, PvulWRKY69-1, PvulWRKY73-2, PvulWRKY73-3, PvulWRKY73-4, PvulWRKY79-1 and PvulWRKY79-2 were not expressed in young trifoliates. In the leaf tissue, PvulWRKY11-2 (25.292) and PvulWRKY26-1 (25.724) showed higher expression. Other genes that showed higher expression in leaves were PvulWRKY7 (19.048), PvulWRKY19-1 (16.433), PvulWRKY23-1 (19.076), and PvulWRKY58 (18.863). In contrast, PvulWRKY1-2, PvulWRKY5-1, PvulWRKY9-2, PvulWRKY14, PvulWRKY19-2, PvulWRKY29-1, PvulWRKY35, PvulWRKY43-1, PvulWRKY47-2, PvulWRKY51-1, PvulWRKY59-1, PvulWRKY59-2, PvulWRKY69-1, PvulWRKY73-1, PvulWRKY73-2, PvulWRKY73-3, PvulWRKY73-4, and PvulWRKY79-1 were not expressed in the leaves. In flowers, PvulWRKY19-1 (78.755) showed the highest expression, followed by PvulWRKY26-1 (76.970), PvulWRKY17 (66.844), PvulWRKY58 (50.788), and PvulWRKY15-2 (49.015), whereas PvulWRKY59-1, PvulWRKY59-2, PvulWRKY69-1, PvulWRKY73-2, PvulWRKY73-4, PvulWRKY79-1 and PvulWRKY79-3 were not expressed. The expression of PvulWRKY11-2 (50.119) was highest in flower buds, followed by PvulWRKY17 (46.894), PvulWRKY19-1 (23.965) and PvulWRKY44 (19.068), whereas PvulWRKY5-1, PvulWRKY5-3, PvulWRKY9-2, PvulWRKY29-1, PvulWRKY43-1, PvulWRKY43-2, PvulWRKY51-1, PvulWRKY59-1, PvulWRKY59-2, PvulWRKY69-1, PvulWRKY73-3, PvulWRKY79-1 and PvulWRKY79-3 were not expressed (Supplementary Table 2).
In young pods, PvulWRKY17 (58.155), PvulWRKY15-2 (41.848), and PvulWRKY19-1 (38.820) showed higher expression, whereas PvulWRKY9-2, PvulWRKY51-1, PvulWRKY59-1, PvulWRKY69-1, PvulWRKY73-2, PvulWRKY79-1 and PvulWRKY79-3 were not expressed (Supplementary Table 2). In stems, PvulWRKY17 (61.321) showed the highest expression, whereas PvulWRKY9-2, PvulWRKY29-1, PvulWRKY51-1, PvulWRKY59-1, PvulWRKY59-2, PvulWRKY69-1, PvulWRKY73-3, PvulWRKY79-1, PvulWRKY79-2, and PvulWRKY79-3 were not expressed. In roots, PvulWRKY11-2 (134.816) showed the highest expression, whereas PvulWRKY29-1, PvulWRKY45-1, PvulWRKY45-2, PvulWRKY59-1, PvulWRKY59-2, and PvulWRKY69-1 were not detected. In nodules, PvulWRKY11-2 (79.023) showed the highest expression, followed by PvulWRKY9-2 (48.761), PvulWRKY69-3 (45.336), and PvulWRKY11-1 (36.555), whereas PvulWRKY29-1, PvulWRKY45-1, PvulWRKY45-2, PvulWRKY59-1, PvulWRKY59-2, and PvulWRKY69-1 were not detected (Supplementary Table 2).

Discussion

Identification and nomenclature of WRKY TFs

Advances in genome sequencing technology and the availability of well-annotated genome databases enabled us to identify the WRKY TF gene family of 43 species. Predicting the potential function and activity of newly sequenced genes and their protein products in every organism is very difficult. The major cellular roles of newly identified genes/proteins can be inferred from previously characterized orthologous members of the same gene family. Large-scale comparative genomic studies can reveal important information regarding the function and evolutionary relationships of orthologous genes across species55. The same principle can be applied at the gene family level as well (e.g., the WRKY TF gene family). Therefore, we identified and analysed the WRKY TF gene family members from 43 different species. All identified WRKY TFs were assigned a specific name according to the orthology-based nomenclature system55,56,57,58. Providing a unique name to every gene is necessary for its future identification. A genome sequence is of limited value unless it is examined through comparative genomics.

Genomics of WRKY TFs

The availability of large-scale genomic data from various plant species allowed the detailed investigation of the WRKY TF gene family in plants. The number of WRKY TF gene family members varies across species, likely because of gene duplication, whole genome duplication, ploidy, gene deletion or mutation. WRKY TFs are considered to be evolutionarily conserved and were thought to be present only in plants4,7,59. However, the WRKY TF gene family has also been found in amoeba, fungi and diplomonad species54,60. Dictyostelium purpureum, a soil-dwelling amoeba, belongs to the phylum Mycetozoa. The genome of this species encodes nine WRKY TFs. The tetraploid monocot plant P. virgatum encodes the highest number (167) of WRKY TFs, whereas the unicellular C. reinhardtii and C. subellipsoidea encode the lowest number (only one each). It is generally assumed that the larger the genome, the more WRKY TFs it encodes; however, this assumption does not hold. Genome size is not directly related to the number of genes of a gene family in the genome (Mohanta et al. 2015; Mohanta et al. 2015; Mohanta et al. 2015). Therefore, the presence of a higher or lower number of genes in a gene family of a particular species can be attributed to its functional requirements and diverse cellular processes. Cai et al.46 reported the presence of 120 WRKY TFs in Gossypium raimondii, which is similar to the number identified in our study. Li et al.49 reported the presence of only 47 WRKY TFs in Ricinus communis; however, in our study, 57 WRKY TFs were identified. Muthamilarasan et al.50 reported the presence of 105 WRKY TFs in Setaria italica, whereas in our study 106 WRKY TFs were identified. Wen et al.14 reported the presence of 86 WRKY TFs in Brachypodium distachyon, whereas only 81 WRKY TFs were identified in this study.
Wen et al.14 listed the locus IDs LOC100843345, LOC100834454, LOC100845846, and LOC100837754 for the genes BdWRKY52, BdWRKY69, BdWRKY73 and BdWRKY75, respectively (Supplementary Table 1 of Wen et al.14); however, we did not find any such sequences in the Phytozome database. This indicates that these locus IDs do not belong to B. distachyon and hence that B. distachyon does not encode 86 WRKY TFs. We also compared our results with the plant transcription factor databases (http://plntfdb.bio.uni-potsdam.de/v3.0/)61 and (http://planttfdb.cbi.pku.edu.cn/)62. In the majority of cases, our results were consistent with those of previous studies in which splice variants were not counted as separate genes. Splice variants are variants of a particular gene/locus; therefore, they cannot be considered independent gene loci. The dicot plant Linum usitatissimum encodes the highest number (26) of double WRKY domain proteins, whereas the tetraploid plant B. distachyon, which has a larger genome, encodes only 17 double WRKY domain proteins. This shows that genome size plays no role in determining whether single or double WRKY domain proteins are encoded, and that this might depend entirely on the functional requirements of an organism. Further, we found that the lower eukaryotic organisms Chlamydomonas reinhardtii, Coccomyxa subellipsoidea, Ostreococcus lucimarinus, Physcomitrella patens and Volvox carteri encoded at least one WRKY TF that contained a double WRKY domain. WRKY proteins containing three or four WRKY domains were absent in lower eukaryotes and are present in only a few higher eukaryotic plants. This shows that the three and four WRKY domain-containing WRKY TFs might have evolved recently. The WRKY TF gene family of Oryza sativa was previously reported to contain 102 WRKY TFs63. In this study, we eliminated OsWRKY94 since it was not found to contain any WRKY domain. Ross et al.63 also reported the absence of any WRKY domain in OsWRKY94.

In the present study, we identified several novel chimeric WRKY TFs from different plant species (Fig. 2) with varying numbers of WRKY domains and other novel domains fused with them (Fig. 2A to P). These chimeric WRKY TFs might have evolved recently via fusion with other domains64. A kinase domain phosphorylates its target protein. Thus, it remains to be determined whether these fused kinase domains play any crucial role in auto-phosphorylation of the WRKY TFs to which they are fused and hence regulate gene expression. In some cases, the kinase domain is followed by a WRKY domain (Fig. 2E and F), whereas in other cases the WRKY domain is followed by a kinase domain (Fig. 2G). The kinase domains of WRKY TFs are most likely phosphorylated by the cognate upstream kinase and thereby regulate the expression of WRKY TFs65. The position of the kinase domain might be very important in the regulation of WRKY TFs and of phosphorylation events in plants. In some other cases, the WRKY domain is fused with the toll-interleukin receptor (TIR) domain (Fig. 2J and K), which mediates the interactions between the toll-like receptor and signal transduction components66,67,68. Plant proteins that harbor TIR motifs are associated with plant resistance to disease67,69,70. Therefore, the WRKY TFs that harbor the TIR motif might control disease resistance in plants. The leucine-rich repeat (LRR) motif is also involved in plant resistance to diseases69, and the WRKY TFs that harbor both the TIR and LRR motifs might also contribute substantially to plant disease resistance.

The diploid species B. rapa encodes 145 WRKY TFs, of which three are novel chimeric WRKY TFs (Fig. 2). Among the three novel WRKY TFs, one is fused with the CBS domain (Fig. 2D), and the other two are fused with the kinase domain (Fig. 2G). The CBS domain is found in various other proteins, including adenosine monophosphate (AMP)-activated protein kinase. The CBS domain binds to AMP, adenosine triphosphate (ATP) or S-adenosylmethionine, and regulates the activity of associated enzymes71. Similarly, the tetraploid species G. max encodes 145 WRKY TFs. Among them, only one is a chimeric WRKY TF that is fused with the TIR domain (Fig. 2J). Plant proteins associated with a toll-like receptor mediate disease resistance in plants. The monocot species Panicum virgatum encodes two chimeric WRKY TFs; one chimeric WRKY TF is fused with a protease domain (Fig. 2I) and the other with the B3 domain. The B3 domain was previously reported to be a DNA-binding domain present in combination with auxin response factor (ARF); it has been found with the WRKY protein, abscisic acid insensitive 3 (ABI3), or related to ABI3/VP1 (RAV) like TFs. The results of this study suggest that the B3 domain, when present in combination with WRKY TFs, might mediate auxin and abscisic acid signaling. The model monocot plant O. sativa encodes 101 WRKY TFs, of which one is a chimeric WRKY TF fused with the protease domain (Fig. 2I). The presence of a ULP protease domain in conjunction with the WRKY protein indicates that WRKY TFs play a crucial role in the ubiquitination process of the SUMO protein. Linum usitatissimum and Brassica rapa encode chimeric WRKY TFs that contain a squamosa promoter-binding protein (ZF_SBP) domain. The SBPs are a major family of plant-specific TFs related to flower development72. The SBP zinc finger binds to the consensus sequence TNCGTACAA73. 
The presence of a ZF_SBP domain along with WRKY TFs might increase the binding efficiency of WRKY TFs to other consensus sequences such as TNCGTACAA. In addition, the role of the SBP domain in flower development indicates that WRKY TFs with three WRKY domains and a ZF_SBP domain might regulate flower development in plants. The paired amphipathic helix (PAH) domain is found in the components of a co-repressor complex that silences transcription and plays a notable role in the transition between proliferation and differentiation74. The presence of a PAH domain along with a WRKY domain suggests a role in the transcriptional co-repression of cellular proliferation and differentiation. The ATP_GRASP super-family genes regulate several metabolic pathways, including de novo purine biosynthesis, and the biosynthesis of fatty acids, peptidoglycan, glutathione, ribosome, arginine, pyrimidine, polyphosphate, lysine and dipeptide75. The fusion of WRKY TFs with the ATP_GRASP domain suggests that these WRKY TFs might be involved in diverse cellular processes. All novel genomic rearrangements appear to have evolved recently. In addition, their abundance is very limited; they are present in only a few species. Once formed, these chimeric genes undergo positive selection when they combine with different components of signaling pathways. This might lead to the creation of a new and diverse signaling pathway, or accelerate the existing signaling process via short-circuiting signaling pathways.

Conserved domains of WRKY TFs

Multiple sequence alignment of C-terminal WRKY TFs revealed the presence of conserved W-R-K-Y-G-Q-K and C-x(7)-C-x(26)-H-x-H domains (Supplementary Figure 1). When multiple sequence alignment was conducted using WRKY TFs that contained only double WRKY domains (both N- and C-terminal), the N-terminal region showed the presence of conserved W-R-K-Y-G-Q-K and C-x(5)-C-x(23)-H-x-H domains, whereas the C-terminal region showed the presence of conserved W-R-K-Y-G-Q-K and C-x(4)-C-x(23)-H-x-H domains (Supplementary Figure 2). Although the W-R-K-Y-G-Q-K heptapeptide sequence was highly conserved, sequence similarity outside the domain was considerably lower among most genes. Instead of harboring the W-R-K-Y domain, several WRKY TFs were found to contain W-K-K-Y, W-T-K-Y, W-S-K-Y, W-H-K-C, W-Q-K-Y, W-R-K-C, W-K-K-C, W-H-Q-Y, R-S-Q-Y, G-R-K-Y, W-R-E-Y, W-L-K-Y, W-R-K-R, W-R-K-N, W-R-K-D, F-R-K-Y, W-I-K-Y, W-R-I-Y, W-W-K-N and W-W-K-S domains (Fig. 3). These domains aligned exactly with the W-R-K-Y domains and hence were assumed to be newly evolved. Among these new domains, W-K-K-Y, W-T-K-Y, W-S-K-Y, W-H-K-C, W-Q-K-Y, W-R-K-C, W-K-K-C, W-H-Q-Y, R-S-Q-Y, G-R-K-Y, W-R-E-Y, W-L-K-Y, and W-R-K-R are present in the N-terminal region, whereas W-R-K-N, W-R-K-D, F-R-K-Y, W-I-K-Y, W-R-I-Y, W-W-K-N and W-W-K-S are present in the C-terminal region (Fig. 3). Therefore, the WRKY TF gene family as a whole, which might be the product of a long evolutionary history, contains divergent WRKY domains even in very closely related gene pairs. Characterization of these novel motifs might provide new insight into their functional significance.
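
The conserved signatures reported above are written in a PROSITE-like notation, which maps directly onto regular expressions. The following minimal Python sketch shows one way to scan a protein sequence for such signatures. It is not the scanprosite/MEME workflow used in this study; the function names and the toy sequence are hypothetical, and the converter handles only plain residues, `x`, and `x(n)`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern (e.g. 'C-x(4)-C-x(23)-H-x-H')
    into a regular expression. Only plain residues, 'x' and 'x(n)' are handled."""
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)\))?", element)
        if wildcard:
            count = wildcard.group(1)
            parts.append("." if count is None else ".{%s}" % count)
        else:
            parts.append(element)
    return "".join(parts)

def find_motifs(sequence: str, patterns: list) -> dict:
    """Return the 0-based start positions of every match for each pattern."""
    hits = {}
    for pattern in patterns:
        regex = re.compile(prosite_to_regex(pattern))
        hits[pattern] = [m.start() for m in regex.finditer(sequence)]
    return hits

# Hypothetical toy sequence (not a real WRKY protein): a heptapeptide followed
# by a C-x(4)-C-x(23)-H-x-H zinc-finger-like stretch.
toy_sequence = "MS" + "WRKYGQK" + "A" * 10 + "C" + "A" * 4 + "C" + "A" * 23 + "HAH" + "GG"

signatures = ["W-R-K-Y-G-Q-K", "C-x(4)-C-x(23)-H-x-H", "C-x(7)-C-x(26)-H-x-H"]
for signature, positions in find_motifs(toy_sequence, signatures).items():
    print(signature, positions)
```

Variant heptapeptides such as W-K-K-Y or W-R-K-N could be screened in the same way by adding them to the `signatures` list.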

Phylogeny and grouping of WRKY TFs

The WRKY TF gene families of various plant species, including A. thaliana4,76, B. distachyon14, G. raimondii46, O. sativa48, S. lycopersicum51, and T. aestivum52, have been well elucidated. Surprisingly, when we combined the data from several published reports, none of them were found to be correlated with one another (Table 8). The WRKY TF group members of different species vary and are not consistent (Table 8). Different researchers have used different nomenclature and grouping systems for WRKY TFs. Eulgem et al.4 grouped WRKY TFs as groups I, IIa, IIb, IIc, IId, IIe, and III4, whereas Wang et al.76 grouped them as IN, IC, IIa, IIb, IIc, IId, IIe and III76. Wu et al.48 grouped the WRKY TF gene family of O. sativa as Ia [NTWD (N-terminal WRKY domain), CTWD (C-terminal WRKY domain)], Ib, IIa, IIb, IIc, IId and III48, whereas Okay et al.52 grouped the WRKY TFs of T. aestivum as groups I, IIa, IIb, IIc, IId, IIe, and III52. Thus, there is hardly any consistency in the grouping system of WRKY TFs. Moreover, none of the WRKY TF group members of one research group are consistent with those of other research groups. For example, according to Wang et al.76, A. thaliana WRKY TFs 1, 2, 3, 4, 20, 25, 26, 32, 33, 34, 44, and 58 and 8, 12, 13, 23, 24, 28, 43, 45, 48, 56, 68, 71, and 75 are present in groups IN and IC, respectively, whereas Eulgem et al.4 reported that WRKY TFs 1, 2, 3, 4, 10, 20, 25, 26, 32, 33, 34, 44, 45, and 58 are present in group I4,76. The WRKY TF group members 8, 12, 13, 23, 24, 28, 43, 48, 56, 68, 71, and 75 classified by Wang et al.76 are absent from group I of Eulgem et al.4. The A. thaliana WRKY group member 10 of Eulgem et al.4 is absent from groups IC and IN of Wang et al.76 (Table 8). 
According to Eulgem et al.4, group IIc contains 8, 12, 13, 23, 24, 28, 43, 48, 49, 50, 51, 56, 57, and 59; group IId contains 7, 11, 15, 17, 21, and 39; and group IIe contains 14, 16, 22, 27, 29, and 35 WRKY TFs, whereas Wang et al.76 reported the absence of WRKY TF family members in groups IIc, IId and IIe4,76. According to Wu et al.48, no WRKY TF family member is present in group IIe (Table 8)48. Similar inconsistent grouping exists in other studies as well (Table 8). These inconsistencies might be attributed to the improper nomenclature of WRKY TFs, or improper citations of previously published manuscripts. Notably, the sub-groups of a specific group are expected to fall within that group (e.g., if IIa, IIb, IIc, IId and others are sub-groups of group II, they should all be included within group II). However, this concept of grouping was not followed correctly during the grouping of WRKY TFs. In the grouping system developed by Wen et al.14 (Fig. 3 of Wen et al., 2012), sub-groups IIa and IIb are confined to a phylogenetically distinct group, sub-groups IId and IIe are confined to another phylogenetically distinct group, and sub-group IIc is confined to yet another phylogenetically distinct group. However, it is not clear how sub-groups IIa and IIb, IId and IIe, and IIc can be sub-group members of group II if they are confined to phylogenetically distinct groups and are phylogenetically distant from one another. Personal correspondence with Wen et al.14 arrived at a certain conclusion regarding the discrepancies in nomenclature and grouping system for WRKY TFs. Hence, in this study we developed a unified grouping system for WRKY TFs in plants.

Table 8 Classification and grouping of plant WRKY TFs published by different research groups at different times.

The inconsistencies in the distribution of different WRKY TF family members within and between groups were overcome by developing an appropriate naming system for all WRKY TFs. In general, sequences that are highly similar tend to fall into the same group as far as orthology-based similarity is concerned55,56. An orthology-based nomenclature system of WRKY TFs has the potential to overcome this problem; therefore, we developed a unique nomenclature system for all WRKY TFs of the 43 species55,56,58. In total, 3035 WRKY TF genes from the 43 species were identified and classified according to the unique naming system (Supplementary Table 1). The nomenclature is described in detail in the Materials and Methods section. Orthology also lends legitimacy to inferences of common ancestry and evolutionary history of function. Therefore, the orthology-based nomenclature system can provide ideas regarding the possible function of specific genes in the plant species being investigated. This nomenclature system can also be extended to newly identified gene families of other plant species.

A proper grouping system of WRKY TFs was developed by first dividing the studied plant species into different groups. The groups were (I) WRKY TFs of monocot, dicot, and lower eukaryotic (algae, bryophytes, pteridophytes and gymnosperms) plants; (II) WRKY TFs of monocots with lower eukaryotic plants; and (III) WRKY TFs of dicots with lower eukaryotic plants. When phylogenetic trees were constructed by considering the WRKY TFs from monocot, dicot, and lower eukaryotic plants, eight groups were identified (Fig. 4, Table 2, Supplementary Figure 3). In the resultant phylogenetic tree, WRKY TF gene family members were not confined to any specific group and overlapped in two or more groups. For example, WRKY TFs 3, 5, 7, 8, 10, 11, 13, 16, 17, 19, 22, 23, 24, 25, 26, 28, 29, 33, 34, 36, 43, 45, 48, 49, 50, 51, 56, 57, 58, 59, 67, 68, 71, 72, 75, 77, 84, 102, 103, and 106 belonged to group I and 1, 2, 3, 4, 5, 10, 19, 20, 24, 25, 26, 30, 32, 33, 34, 35, 44, 45, 53, 57, 58, 59, 70, 78, 80, 81, 82, 84, 85, 90, 96, and 105 belonged to group II (Fig. 4, Table 2, Supplementary Figure 3). The WRKY TF members 3, 5, 19, 24, 25, 26, 33, 34, 45, 57, 58, 59, and 84 were distributed in both groups (groups I and II). Similar trends were observed in other WRKY groups as well. Therefore, the grouping of WRKY TFs based on a combined study of monocots, dicots and lower eukaryotic plants did not prove to be suitable. When the phylogenetic tree was constructed by considering the WRKY TF gene family members of monocot and lower eukaryotic plants, six phylogenetically distinct groups were formed; they were named groups I (red), II (lime), III (green), IV (blue), V (pink) and VI (green) (Fig. 5, Table 3). The WRKY TF gene family members of monocot and lower eukaryotic plants were very specific to their respective groups. In this case, no WRKY TF member of one group overlapped with another group. 
When the phylogenetic tree was constructed by considering WRKY TF gene family members of dicot and lower eukaryotic plants, three different groups were generated, of which group II contained three sub-groups (Fig. 6, Table 4). We named the groups I (pink), IIa (red), IIb (lime), IIc (blue), and III (green). We found that the WRKY TF members of groups I and III were very specific to their respective groups and did not overlap with one another (Table 4). These results clearly showed that the WRKY TF grouping system is very specific to the lineages (monocot/dicot). The WRKY TF grouping systems of monocot and dicot plants differ remarkably; this might be one of the most important reasons why co-linearity was absent in the grouping system of WRKY TF gene family members (Table 8). Therefore, in this study, we propose that WRKY TF grouping should be specific to monocot or dicot plant lineages. The monocot-specific WRKY TFs can be grouped into six groups (groups I, II, III, IV, V, and VI), whereas dicot-specific WRKY TFs can be grouped into three groups (groups I, IIa, IIb, IIc and III). The phylogenetic trees of monocot and dicot plants varied markedly. This might be due to the fact that the monocot plant lineage is comparatively more conserved than the dicot lineage owing to early ploidy and whole genome duplication77,78. Therefore, monocot and dicot plants should be grouped according to the grouping system of monocot plants and dicot plants, respectively.

We conducted another analysis by dividing WRKY TFs into single WRKY domain-containing (C-terminal) and double WRKY domain-containing (N- and C-terminal) groups. The phylogenetic analysis of the single WRKY domain group resulted in six phylogenetically distinct groups, whereas that of the double WRKY domain group resulted in seven phylogenetically distinct groups (Tables 5 and 6). The WRKY TF members in these domain-specific analyses were not confined to any specific group, and the group members overlapped with each other. Although single and double WRKY domain-containing TFs resulted in six and seven phylogenetically independent groups, respectively, it is not clear why only group II of previous studies could be sub-grouped into IIa, IIb, IIc, IId and IIe. However, the permutation and combination study showed that WRKY TFs could be grouped as monocot and dicot lineage-specific. The WRKY TFs of monocot plants can be grouped into six groups, and those of dicot plants can be grouped into three groups. Earlier reported grouping systems such as groups I, II (IIa, IIb and IIc) and III can be applied to dicot plants, but it is important to ensure that WRKY TF group members are confined to their specific groups.

The substitution rate of monocot and lower eukaryotic WRKY proteins was slightly higher than that of dicot and lower eukaryotic WRKY proteins. No considerable difference was observed in the substitution and evolutionary rate of WRKY proteins with a single or double domain. This explains why WRKY proteins are highly conserved across the plant lineage. The phylogenetic analysis of all plant species showed that all WRKY TFs were present in monocots, dicots and lower eukaryotes, indicating that the appearance of most WRKY TFs in plants predates the divergence of these species. No species-specific groups, sub-groups or clades were observed in the phylogenetic tree. This implies that the WRKY TF gene family was conserved during evolution. In addition, the WRKY domains from the same lineage tended to cluster together in the phylogenetic tree, which was not observed in this study. This suggests that they experienced duplication after divergence. The WRKY TFs that clustered together are orthologous ones that are evolutionarily closer than others. The phylogenetic similarity found in this study showed that WRKY TFs evolved conservatively. Only a few WRKY TFs were found in lower eukaryotes, including C. reinhardtii, C. subellipsoidea, M. pusilla, and V. carteri, whereas higher plants possessed a larger number of WRKY TF genes. This indicates that the earliest evolutionary origin of the gene containing the WRKY TF was in unicellular green algae. This suggests that WRKY proteins evolved before plants transitioned from an aquatic to a terrestrial habitat. With the continuous evolution of species, land plants have evolved a series of highly sophisticated signaling mechanisms that helped them to adapt to ever-changing environmental conditions, and hence, the number of WRKY TFs increased in different species. The presence of WRKY TF genes in diplomonads, amoebozoa, and fungi sheds new light on the early evolution of WRKY genes.

Understanding the evolution of the WRKY TFs in the plant lineage is very challenging. If the concept of early evolution is considered, in green algae, a BED finger-like C2H2 zinc finger domain incorporated a WRKY domain N-terminal to the zinc finger. This single-domain WRKY TF served as the progenitor for all other WRKY genes54. Subsequently, this single-domain WRKY TF fused via addition or recombination to yield a double WRKY domain while maintaining the original copy intact. Thereafter, independent lateral gene transfer to non-plant and plant lineages occurred during the early evolution of WRKY TFs. This led to the transfer of WRKY TFs to fungi, amoeba and other species. The amoeba species D. purpureum and the green algae O. lucimarinus and V. carteri contain both double and single WRKY domain proteins. However, C. reinhardtii contains only the double WRKY domain protein. This shows that the single and double WRKY domains have coevolved from the green plant lineage. All these events seem to have occurred before the transition of green plants to a terrestrial habitat. During these evolutionary processes, the chimeric WRKY proteins evolved to contain either a kinase, NAC, B3, LRR, PAH, CBS, ZF_SBP, ULP_protease, TIR, or ATP_GRASP domain. These chimeric WRKY TFs are not found in all plant species, and are restricted to the flowering plant lineage. WRKY TFs with other novel domains can be expected in other plant species whose genomes are yet to be sequenced.

Gene duplication and evolution

Evolution by gene duplication is one of the most important processes responsible for the supply of raw genetic material to an organism for its biological evolution79. Duplication can occur via recombination, aneuploidy, retro-transposition or whole genome duplication. A. thaliana encodes about 16,574 (65%) duplicated genes among its total of 25,498 genes79,80. In the present study, we found several duplicated WRKY TFs (Supplementary Table 1). Most duplicated WRKY TF genes are present as paralogous genes79. More specifically, gene duplication analysis of some novel WRKY TFs (Fig. 2, Table 9), performed using the Pinda (pipeline for intraspecies duplication analysis) server, revealed that most of the WRKY TFs are duplicated. Some of the novel WRKY TFs, such as SbWRKY59, PvWRKY94-1 and SiWRKY59-2, were found to be non-duplicated. The z-score values of these non-duplicated WRKY TFs ranged from 1.11 to 1.78. A z-score value of less than four indicates a non-duplicated gene81.
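
The z-score rule described above amounts to a simple threshold test. The short Python sketch below illustrates it; the function name is hypothetical, only the cutoff of four is taken from the text, and the sketch does not reproduce how the Pinda server computes z-scores. The two example values are the endpoints of the range reported for the non-duplicated WRKY TFs.

```python
Z_SCORE_CUTOFF = 4.0  # cutoff stated in the text (ref. 81)

def is_duplicated(z_score: float, cutoff: float = Z_SCORE_CUTOFF) -> bool:
    """Call a gene duplicated only when its z-score reaches the cutoff;
    a z-score below the cutoff is treated as non-duplicated."""
    return z_score >= cutoff

# Endpoints of the z-score range reported for the non-duplicated WRKY TFs.
for z in (1.11, 1.78):
    print(z, "duplicated" if is_duplicated(z) else "non-duplicated")
```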

Table 9 Gene duplication analysis of novel WRKY TFs identified during this study.

Statistical analysis

Tajima’s relative rate test, the simplest test of the molecular evolutionary clock, can be applied to both nucleotide and amino acid sequences. This method reports its result as a Chi-square statistic, and can be applied even when the pattern of substitution is unknown or the substitution rate varies across sites82. In Tajima’s relative rate test of WRKY TFs, the Chi-square statistic and its p-value were found to be significant (Table 7).

Gene expression profile of WRKY TFs

Understanding the tissue-specific expression of genes can help to elucidate the molecular mechanisms and the role of the genes in tissue development and function. How genes are expressed and regulated in different tissues is a challenging and fundamental question. Therefore, we investigated the tissue-specific expression of WRKY TFs of G. max and P. vulgaris (Supplementary Table 2). In G. max, expression analysis was conducted in the roots, root hairs, leaves, stems, flowers, pods, seeds, nodules and shoot apical meristem tissue. Of the total of 145 G. max WRKY TFs, 143 were found to be expressed in at least one of the mentioned tissues. Expression levels of GmWRKY65-1 (105.342), GmWRKY6-4 (74.668), and GmWRKY6-5 (43.341) were found to be significantly higher than those of others in the roots, suggesting their important role in root development. Expression of 24 GmWRKY genes was not detected in root tissue (Supplementary Table 2), indicating that these genes might not play any active role in root development. The expression level of GmWRKY65-1 (35.199) was found to be the highest in root hair, suggesting its active role in the development of root hair. Expression levels of at least eight genes were not detected in root hairs. The expression levels of GmWRKY6-4 (51.394), GmWRKY6-5 (81.847), GmWRKY26-2 (80.957), GmWRKY26-3 (72.911), and GmWRKY41-3 (72.788) were significantly higher in the leaf tissues than in any other tissues, suggesting that these genes might play crucial roles in leaf development. Expression levels of at least 24 genes were not detected in leaf tissues. In stems, the expression levels of GmWRKY21-3 (47.276), GmWRKY11-6 (24.872), and GmWRKY15-2 (24.886) were found to be significantly higher than those of other genes, suggesting their role in stem development. Expression levels of at least 15 genes were not detected in the stem tissue. 
In flowers, the expression levels of GmWRKY26-2 (67.456), GmWRKY26-3 (51.836), GmWRKY70-6 (61.053) and GmWRKY70-7 (63.153) were found to be significantly higher than those of other genes, suggesting that these genes might play an important role in flower development. Expression levels of at least eight genes were not detected in flower tissue (Supplementary Table 2). In pods, the expression level of GmWRKY44-2 (17.882) was found to be significantly higher than that of other genes, suggesting its important role in pod development. The expression levels of at least 19 genes were not detected in pods. In seeds, the expression level of GmWRKY21-3 (31.762) was found to be significantly higher than that of other genes, suggesting its important role in seed development. In nodules, the expression level of GmWRKY65-1 (39.186) was significantly higher than that of other genes, suggesting its important role in nodule development. The expression level of GmWRKY65-1 was higher in roots and root hairs as well. Thus, GmWRKY65-1 might play a crucial role in root, root hair, and nodule development. In the shoot apical meristem, the expression level of GmWRKY70-7 (35.173) was found to be significantly higher than that of other genes, suggesting its crucial role in apical meristem development. Expression levels of at least 21 genes were not detected in the apical meristem tissue. Considering the ubiquitous expression of WRKY TFs in G. max, we found that GmWRKY6-4, GmWRKY6-5, GmWRKY11-1, GmWRKY11-2, GmWRKY11-3, GmWRKY11-4, GmWRKY11-5, GmWRKY11-6, GmWRKY11-7, GmWRKY11-8, GmWRKY15-1, GmWRKY15-2, GmWRKY20-2, GmWRKY20-4, GmWRKY22-3, GmWRKY22-4, GmWRKY26-3, GmWRKY35-3, and GmWRKY41-7 were highly expressed in all the studied tissues (Supplementary Table 2). 
Similarly, the expression levels of GmWRKY10 and GmWRKY13-4 were not detected in any tissue, while those of GmWRKY4-3, GmWRKY6-3, GmWRKY29-1, GmWRKY50-1, GmWRKY50-2, GmWRKY54-1, GmWRKY55-1, GmWRKY56-1, GmWRKY56-3, GmWRKY67, GmWRKY70-3, GmWRKY70-4, and GmWRKY72-2 were almost negligible or absent in the major tissue types (Supplementary Table 2).

In P. vulgaris, expression analysis was conducted in eight tissue types: young trifoliates, leaves, flowers, flower buds, young pods, stems, roots, and nodules. In young trifoliates, the expression level of PvulWRKY17 (37.519) was found to be significantly higher than those of other genes, suggesting its important role in early stages of plant development. Expression levels of 15 PvulWRKY genes were not detected in young trifoliates. In leaves, the expression levels of PvulWRKY7 (19.048), PvulWRKY11-2 (25.292), PvulWRKY19-1 (16.433), PvulWRKY23-1 (19.076), PvulWRKY26-1 (25.724) and PvulWRKY58 (18.863) were found to be significantly higher than those of others, suggesting that these genes might play a significant role in leaf development in P. vulgaris. Expression levels of 18 genes were not detected in the leaf tissue. In flowers, the expression levels of PvulWRKY11-2 (47.243), PvulWRKY15-2 (49.015), PvulWRKY17 (66.844), PvulWRKY19-1 (78.755), PvulWRKY26-1 (76.970), and PvulWRKY58 (50.788) were found to be significantly higher than those of other WRKY genes, suggesting their important role in flower development. Unlike in flowers, the expression level of PvulWRKY11-2 was found to be the highest in flower buds, suggesting that this gene might be involved in flower and flower bud development. In young pods, the expression levels of PvulWRKY17 (58.155), PvulWRKY15-2 (41.848), and PvulWRKY19-1 (38.820) were found to be significantly higher than those of other genes, suggesting their role in pod development. The expression levels of seven genes were not detected in young pods. In stems, the expression levels of PvulWRKY11-1, PvulWRKY11-2 and PvulWRKY17 were found to be significantly higher than those of other genes, suggesting their role in stem development. 
In roots, the expression levels of PvulWRKY11-1, PvulWRKY11-2, PvulWRKY17 and PvulWRKY69-3 were found to be significantly higher than those of other genes, suggesting that these genes might regulate root development in P. vulgaris. In nodules, the expression levels of PvulWRKY9-2 (48.761), PvulWRKY11-2 (79.023), and PvulWRKY69-3 (45.336) were higher than those of other genes, suggesting their important role in nodule development. In P. vulgaris, a few genes were found to be ubiquitously expressed in all tissue types, namely PvulWRKY7, PvulWRKY11-1, PvulWRKY11-2, PvulWRKY11-3, PvulWRKY15-1, PvulWRKY15-2, PvulWRKY17, PvulWRKY19-1, PvulWRKY20-1, PvulWRKY20-2, PvulWRKY21, PvulWRKY22-2, PvulWRKY23-1, PvulWRKY23-2, PvulWRKY58, PvulWRKY69-3 and PvulWRKY71-2 (Supplementary Table 2). Comparative expression studies between G. max and P. vulgaris WRKY genes showed that WRKY11-1, WRKY11-2 and WRKY11-3 were ubiquitously expressed in all tissues of G. max and P. vulgaris. Similarly, WRKY15-2 was also found to be highly expressed in the stems, roots, nodules, and pods of G. max and P. vulgaris, suggesting a common function in similar tissue types of both plants. WRKY65 was also found to be highly expressed in the root and nodule tissues in G. max and P. vulgaris, suggesting that this gene might be extensively involved in root and nodule development in both plants.

Conclusion

Analysis of the WRKY TF gene family across the plant lineage revealed the presence of novel WRKY TFs. The monocot or dicot lineage-specific grouping and the orthology-based nomenclature system of WRKY TFs might be crucial in future studies. Expression analysis showed that WRKY11-1, WRKY11-2, and WRKY11-3 were highly expressed in all tissue types in G. max and P. vulgaris. Similarly, WRKY15-2 was found to be highly expressed in the stems, roots, nodules and pods in G. max and P. vulgaris, suggesting its important role in the development of these tissues. Characterizing the novel WRKY TFs will help to clarify their functional and evolutionary roles.

Materials and Methods

Identification of WRKY TFs

WRKY TFs from the model organisms A. thaliana and O. sativa were downloaded from The Arabidopsis Information Resource (TAIR) database and the Rice Genome Annotation Project, respectively83,84. The protein sequences of WRKY TFs from A. thaliana and O. sativa were used as query sequences to search for WRKY TFs in other plant species in the phytozome database85. The WRKY TFs from O. sativa were used to search for WRKY TFs in monocot plants, and A. thaliana WRKY TFs were used to search for WRKY TFs in dicot and other plant species. Overall, the WRKY TF gene families of 43 plant species were investigated. The Hidden Markov Model (HMM) and BLASTP programs were also used to search for the WRKY TFs of the investigated plant species with the default parameters of the phytozome database. The sequences returned by the BLASTP searches were collected for further analysis to confirm whether they were WRKY TFs. All the collected sequences were then analysed using the scanprosite and MEME software to confirm the presence of WRKY domains86,87. Default parameters were used in the scanprosite software to identify the WRKY domains. The identified sequences that contained the WRKY domain were retained for further validation, which was accomplished by subjecting the sequences to BLASTP analysis in the TAIR and Rice Genome Annotation Project databases using the default parameters. Further, all the sequences were analysed using the HMMER web server to identify the interactive sequence similarities88. Sequences that resulted in BLASTP hits with WRKY TFs in the TAIR or Rice Genome Annotation Project database were confirmed as WRKY TFs.

Nomenclature of WRKY TFs

All identified WRKY TFs were assigned a specific name. Nomenclature of the WRKY TFs was assigned according to an orthology-based nomenclature system proposed by different researchers55,56,89. In the nomenclature system, names were assigned by considering the first letter of the genus in upper case and the first letter of the species in lower case followed by the WRKY and orthology-based number of A. thaliana or O. sativa. When redundancies were found in the nomenclature system, 2 to 4 letters of the species name were considered for the nomenclature. When more than one orthologous gene was found, they were considered as paralogous genes which were numbered by including a hyphen. For example, if there are two OsWRKY46 in O. sativa, they would be named OsWRKY46-1 and OsWRKY46-2.
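
The naming rule described above can be sketched as a small Python function. This is only an illustration: the function name and its parameters are hypothetical, the orthology number is assumed to have been determined beforehand, and the number of species letters is left to the caller because the text does not specify how redundancies were resolved for each species.

```python
def wrky_names(genus: str, species: str, ortholog_number: int,
               n_paralogs: int = 1, species_letters: int = 1) -> list:
    """Build orthology-based WRKY names.

    prefix = first letter of the genus (upper case) + first `species_letters`
    letters of the species (lower case); paralogs get a hyphenated suffix.
    """
    prefix = genus[0].upper() + species[:species_letters].lower()
    base = f"{prefix}WRKY{ortholog_number}"
    if n_paralogs == 1:
        return [base]
    return [f"{base}-{i}" for i in range(1, n_paralogs + 1)]

# The example from the text: two O. sativa genes orthologous to WRKY46.
print(wrky_names("Oryza", "sativa", 46, n_paralogs=2))
# A longer species prefix, as used for P. vulgaris names in this article.
print(wrky_names("Phaseolus", "vulgaris", 17, species_letters=3))
```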

Multiple sequence alignment

The multiple sequence alignment of WRKY TFs was conducted using the Multalin software (http://multalin.toulouse.inra.fr/multalin/) with default parameters, which were as follows: protein weight matrix, Blosum62-12-12; gap penalties at opening, default; gap penalty at extension, default; gap penalty at extremities, none; one iteration only, any; high consensus value, 90% (default); and low consensus value, 50% (default). The multiple sequence alignments of proteins containing single and double WRKY domains were conducted separately by using the same parameters.

Construction of phylogenetic tree

Unrooted phylogenetic trees were constructed to understand the closeness and evolutionary relatedness of WRKY TFs in plants. We constructed different phylogenetic trees by grouping the WRKY TFs into different groups. Groupings included (1) monocot, dicot and lower eukaryotic plants, (2) monocot and lower eukaryotes, (3) dicot and lower eukaryotes, (4) single WRKY domain (C-terminal WRKY domain)-containing WRKY TFs and (5) double WRKY domain (N- and C-terminal WRKY domain)-containing WRKY TFs. To construct the phylogenetic trees, we created Clustal files for each group using the ClustalW or Clustal Omega program90,91. The generated Clustal files were converted to the MEGA file format, after which the MEGA files were run in MEGA6 software to construct the phylogenetic tree92. The statistical parameters used to construct the phylogenetic trees were as follows: analysis, phylogeny reconstruction; statistical method, maximum likelihood; test of phylogeny, bootstrap method; number of bootstrap replicates, 1000; substitution type, amino acids; model/method, Poisson model; rates among sites, uniform rates; gap/missing data treatment, partial deletion/use all sites; site coverage, 95%; ML heuristic method, nearest-neighbor-interchange (NNI); and branch swap filter, very strong.

Statistical analysis

Different statistical analyses were performed to understand the evolutionary aspects of WRKY TFs using the MEGA6 program92. The MEGA files of all five groups that were used in the construction of the phylogenetic trees were analyzed in MEGA6. Tajima’s relative rate test was conducted to determine whether the rate of molecular evolution differed significantly between WRKY TF lineages. In this test, sequences 1, 2 and 3 were considered simultaneously, with sequence 3 used as the outgroup. Let $n_{ijk}$ be the observed number of sites at which sequences 1, 2 and 3 have amino acids/nucleotides i, j and k, respectively. Under the molecular clock hypothesis, $E(n_{ijk}) = E(n_{jik})$, irrespective of the substitution model used and of whether the substitution rate varies across sites. If this equality is rejected, then the molecular clock hypothesis can be rejected for the given set of sequences 1, 2 and 3. The parameters used to perform Tajima’s relative rate test were as follows: analysis, Tajima’s relative rate test; scope, for 3 chosen sequences; substitution type, amino acids; and gaps/missing data treatment, complete deletion.
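The logic of the relative rate test can be sketched in a few lines. This sketch uses the commonly stated chi-square form of Tajima's test with one degree of freedom, in which the numbers of sites unique to sequence 1 and to sequence 2 are compared. It is not the MEGA6 implementation; the function name `tajima_relative_rate` and the set of symbols treated as missing data are assumptions.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup, missing="-?X"):
    """Return (m1, m2, chi2, p) for three aligned sequences.

    m1: sites where only seq1 differs (seq2 matches the outgroup).
    m2: sites where only seq2 differs (seq1 matches the outgroup).
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("the three sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, c in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        # Complete deletion: skip any site with a gap or ambiguous symbol.
        if a in missing or b in missing or c in missing:
            continue
        if a != b and b == c:
            m1 += 1
        elif a != b and a == c:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0  # no informative sites, nothing to reject
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p


if __name__ == "__main__":
    print(tajima_relative_rate("CCCCAAAA", "AAAAAAAA", "AAAAAAAA"))
```

Under the molecular clock hypothesis the expected values of `m1` and `m2` are equal, so a small p-value indicates unequal rates in the two lineages relative to the outgroup.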

Gene duplication analysis

Gene duplication analysis of selected WRKY TFs was performed using the online server Pinda (http://orion.mbg.duth.gr/Pinda)81.

All the data used in this study were obtained from publicly available databases (https://phytozome.jgi.doe.gov/pz/portal.html, http://congenie.org/start).

Gene expression data

The expression data of G. max and P. vulgaris were downloaded from the PhytoMine database (https://phytozome.jgi.doe.gov/phytomine/template.do?name=One_Gene_Expression&scope=global) of Phytozome. The locus IDs of G. max and P. vulgaris genes were used to search for expression data in different tissue samples.

Additional Information

How to cite this article: Mohanta, T. K. et al. Novel Genomic and Evolutionary Insight of WRKY Transcription Factors in Plant Lineage. Sci. Rep. 6, 37309; doi: 10.1038/srep37309 (2016).