Genome-wide comparative analyses of GATA transcription factors among seven Populus genomes

GATA transcription factors (TFs) are widespread eukaryotic regulators whose DNA-binding domain is a class IV zinc finger motif (CX2CX17–20CX2C) followed by a basic region. We identified 262 GATA genes (389 GATA TFs) from seven Populus genomes using the pipeline of GATA-TFDB. Alternative splicing forms of Populus GATA genes exhibit dynamics of GATA gene structures including partial or full loss of GATA domain and additional domains. Subfamily III of Populus GATA genes display lack CCT and/or TIFY domains. 21 Populus GATA gene clusters (PCs) were defined in the phylogenetic tree of GATA domains, suggesting the possibility of subfunctionalization and neofunctionalization. Expression analysis of Populus GATA genes identified the five PCs displaying tissue-specific expression, providing the clues of their biological functions. Amino acid patterns of Populus GATA motifs display well conserved manner of Populus GATA genes. The five Populus GATA genes were predicted as membrane-bound GATA TFs. Biased chromosomal distributions of GATA genes of three Populus species. Our comparative analysis approaches of the Populus GATA genes will be a cornerstone to understand various plant TF characteristics including evolutionary insights.

www.nature.com/scientificreports/ characterized genes, expressed sequences tags (ESTs), and/or RNA-Seq data. In the gene prediction process, evidence sequences are essential to achieve accurate prediction of genes as well as alternative splicing forms. More available EST or RNA-Seq sequences will bring plentiful alternative splicing forms of GATA genes: e.g., the human genome contains around averagely of 10 alternative splicing forms per one gene, predicted from a large amount of transcript sequences 86 . Available RNA-Seq data of Populus genus (As of 2018 Jun) deposited in NCBI Short Read Archive show that P. trichocarpa and P. tremula presenting a large proportion of alternative splicing forms of GATA genes contain a large amount of RNA-Seq data (Table S4).
In-depth investigations of alternative splicing forms of Populus GATA genes. We identified the phenomena that some alternative splicing forms originated from one Populus GATA gene encode the same amino acids. Each of the nine alternative splicing forms of PdGATA6, a typical example of this phenomenon, is composed of 232 aa, 295 aa, and 301 aa in protein length and exon is composed of between 3 and 5. Among the nine alternative splicing forms of PdGATA6, all except PdGATA6f and 6h present the same start and end positions of ORFs. The first ORF exons of the eight alternative splicing forms except PdGATA6f contain start methionine without stop codon are classified into two types: one is 627 bp and the other is 645 bp. It results in two types of amino acid sequences from the eight alternative splicing forms, indicating that most of alternative splicing events are occurred in 5' and 3' UTR regions (Fig. 2). In addition, GATA domain sequences of eight alternative splicing forms of PdGATA6 are identical, suggesting that the GATA domain is important to bind DNA. Interestingly, some of alternative splicing forms of Populus GATA genes display the same amino acids: one GATA gene from P. tremuloides, four from P. euphratica and P. deltoides, six from P. trichocarpa and P. tremula x alba, and eight from P. tremula. One of known roles of untranslated regions of messenger RNA is changing the amount of translated proteins 87 . The number of TFs will increase or decrease the transcription amount of target genes, so that these alternative splicing forms may be important to the regulatory network of GATA TFs.
We also identified that some alternative splicing forms of the twelve GATA genes of three Populus species (P. tremula x alba, P. tremula, and P. tremuloides; Table S5) missed GATA domain which was not included in the Populus GATA TFs list. Interestingly, GATA TFs without GATA domain can negatively regulate the target genes by competing with normal GATA TFs 88 , indicating that these twelve GATA genes containing alternative splicing forms without domain can also play a role of negative regulators. In addition, one Populus GATA gene, PtaGATA28 (from P. tremula), has five out of six alternative splicing forms that missed GATA domain, suggesting that this gene might have a dominant role of negative regulation in contrast to the normal GATA genes even  Light blue and green lines indicate the gene loss of GATA TFs in P. deltoides and P. tremuloides lineage, respectively. Yellow star means gene duplication events of the 10 paralogous pairs. Black, green, blue letters of 7 Populus genomes mean independent three clades based on phylogenetic studies. (b) presents the number of GATA genes and GATA TFs for each genome.  81 . Amino acid lengths of Populus GATA TFs in each subfamily present a wider range than those of A. thaliana (Fig. 3b). However, three Populus GATA TFs display extremely short lengths: PdGATA18 belonging to subfamily I is 82 aa, PtsGATA29 and PdGATA36 from subfamily III are 46 and 86 aa, respectively (Table S2). Interestingly, some of GATA TFs of the other plant species including A. thaliana also display the short GATA TFs: 120 aa (AtGATA23) in A. thaliana 25 , 109 aa (VvGATA13) in V. vinifera 89 , 80 aa (GmGATA10) in G. max 40 , and 101 aa (OsGATA8b) in O. sativa 81 . It indicates that the three shortest Populus GATA TFs may be functional GATA TFs, suspecting that the gene prediction program can miss some of exons nearby the exon containing the GATA domain.
Two Populus GATA TFs in subfamily II have unique domains in comparison to those of A. thaliana; PpGATA21 contains NIR domain (IPR005343) found in the Noc2 gene family in Arabidopsis. This domain seems to be involved in protein-protein interaction, indicating that PpGATA21 may have partner protein for forming protein complex to regulate target genes. In addition, PpGATA23 covers two HMA domains (IPR006121), which can bind heavy metal ions 90 . These two Populus GATA genes will have unknown additional functions like WC1, which is involved in circadian clock mechanism of Neurospora crassa with light-sensing domain 91 .   25 and G. max 40 , fourteen GATA TFs in six Populus species except P. euphratica lack CCT and/or TIFY domains. Some of GATA TFs of O. sativa also presented the same phenomenon 81 . In detail, nine of the 14 GATA TFs are the unique transcript, indicating that they completely lost CCT and/or TIFY domains during evolution, similar to those of V. vinifera 89 . The remaining five GATA TFs have other alternative splicing forms, presenting the selective loss of these domains.

Comparisons of the principal component analysis of seven Populus species.
In a previous study, six Populus species except P. tremula x alba were used for conducting the phylogenetic analysis based on a total of 76 morphological properties for buds, leaves, inflorescences, flowers, and fruits 95 , which is congruent to the chloroplast phylogenetic tree (Fig. 1), except P. trichocarpa and P. deltoides. The incongruency of the two species is caused by limited species in Fig. 1. However, the principal component analysis of Populus GATA genes shows one cluster covering P. euphratica, P. tremula x alba, P. tremuloides, and P. trichocarpa (Fig. S2), which is incon-     www.nature.com/scientificreports/ gruent to the aforementioned two phylogenetic trees. It reflects that GATA TFs may not evolve in the similar to species evolution due to their widely regulatory roles 96 .

Identification of Populus GATA gene clusters (PCs) on phylogenetic tree of Populus GATA genes.
To understand the phylogenetic relationship of Populus GATA genes clearly, we clustered them in the phylogenetic tree (Fig. 3a), resulting in 21 distinct Populus GATA gene clusters (PCs; Fig. 3a and Table S7). All PCs except PC11 containing five species except P. tremuloides and P. tremula covers all seven Populus species, displaying conserveness of Populus GATA genes. Eleven of the 21 PCs (52.38%) contain the same amount of GATA genes for each Populus species, while the rest 9 PCs (42.86%) show different numbers (Table S8). PC03, PC05, and PC14 lack of only one GATA gene from P. pruinosa, P. tremula, and P. deltoides, respectively; while PC12 and PC20 have additional GATA gene of P. deltoides and P. euphratica, respectively. The remaining 4 PCs display a complex pattern of the number of GATA genes for each Populus species (Table S8), indicating complex history of gain and loss of GATA genes in the Populus genus. Subfamily I, containing the largest Populus GATA genes, displays the most complex structure with the largest number of PCs (Fig. 3a). Interestingly, AtGATA3, AtGATA10, AtGATA11, AtGATA13, AtGATA14, and PC12 do not show neighbor GATA genes like an independent clade (Fig. 3a). Subfamily II shows the largest ratio of the number of PCs to the number of GATA genes, implying faster evolution might be occurred. In addition, four Populus GATA genes (PpGATA21, PpGATA23, PeGATA19, and PeGATA23) are not clustered into PCs (Fig. 3a), presenting species-specific GATA genes. In subfamily III, PC01, containing four Populus GATA genes per species, might be experienced a gene duplication event in comparison to the other PCs. In addition, PC01 and PC02 seem to be independent of three Arabidopsis GATA genes (Fig. 3a), suggesting Populus-specific GATA genes, while PC03 and PC04 have their partner Arabidopsis GATA genes (Fig. 3a). Subfamily IV covers only one PC and two Arabidopsis GATA genes, the smallest subfamily (Fig. 3a).
In total, 23 out of 556 (4.14%) amino acid positions in the CX 2-4 CX 18-20 CX 2 C region (inter-species variations) show more than one amino acid (Fig. 3d), which is larger than that of 19 A. thaliana genomes 43  Nine of 15 characterized Arabidopsis GATA genes were also successfully mapped based on the PCs (Fig. 3a and Table 2) and similarity of amino acids, resulting in that seven PCs are related to the characterized Arabidopsis GATA genes. 129 Populus GATA TFs in the seven PCs are candidates for deducing their functional roles in Populus. In addition, PdGNC from P. nigra x (P. deltoides x P. nigra), a uniquely characterized GATA TF, known to regulate chloroplast ultrastructure, photosynthesis, and vegetative growth in Arabidopsis 80 is successfully mapped to PtrGATA19 (PC09) with 98.02% amino acid similarity. It supports this inference method because both GNC in Arabidopsis and PdGNC are in the same PC with the similar functions even though Arabidopsis and Populus belong to Brassicaceae and Salicaceae and are an herb and a tree species, respectively. Moreover, OsGATA12 involved in the seedling stage based on expression profile, is similar to that of BME3 in A. thaliana 81 . Based on this result, researchers can efficiently and systematically identify the biological functions of Populus GATA genes in the near future.
Expression level analysis of GATA genes in P. deltoides and P. pruinosa. Based on available RNA-Seq raw reads obtained from leaf, phloem, xylem, and root tissues of P. deltoides and P. pruinosa (Table S9), expression levels of Populus GATA TFs were calculated (Fig. 4). In the four tissues, three GATA genes in the PC9 were well clustered displaying leaf and phloem specific expressions (Fig. 4a), which is congruent to the putative functions of GATA genes in PC9, such as regulation of chloroplast development, growth, and divisions (Table 2). In addition, four clusters covering GATA genes in PC13, PC1, PC6, and PC5, also presented similar expression patterns across the tissues, but these clusters did not cover all members in each PC. PC1 and PC5 showed high expression in all four tissues; while PC13 was leaf and phloem specific and PC6 was expressed lowly, especially in xylem, which can be a clue to understand their biological functions due to lack of homologous genes of which biological functions were characterized. Expression profiles of each tissue exhibited that some clustered GATA genes from the same PC were same as those in the four tissues and the rest were not (Fig. 4b-e), showing that expression level of Populus GATA genes partially reflects their conserveness across the species. Domain types of Populus GATA genes. The DNA-binding motif of GATA TFs was classified into three types designated as type IV a (CX 2 CX 17 CX 2 C), IV b (CX 2 CX 18 CX 2 C), and IV c (CX 2 CX 20 CX 2 C) among which Type IV b and IV c are common in plants 25 www.nature.com/scientificreports/  136 Modulation of chlorophyll biosynthesis (greening) and glutamate synthase (GLU1/Fd-GOGAT) expression 137,138 Downstream effectors of floral homeotic gene action by controlling two MADS-box TFs 139 Control of convergence of auxin and gibberellin signaling 140,141 Control of greening, cold tolerance, and flowering time 142 Regulation of chloroplast development, growth, and division as well as photosynthetic activities 143,144 Cytokinin-regulated development, including greening, hypocotyl elongation, phyllotaxy, floral organ initiation, accessory meristem formation, flowering time, and senescence 135 PIF-and light-regulated stomata formation in hypocotyls 145

PC10
AtGATA18 (HAN) Regulation of shoot apical meristem and flower development [145][146][147][148][149] Stable establishment of cotyledon identity during embryogenesis 148 Position the proembryo boundary in the early Arabidopsis embryo  www.nature.com/scientificreports/ in Arabidopsis 43 and G. max GATA TF analyses 25 ) and type IV 4 (four amino acids between the first two cysteines, which seems to be functionally active 43 ), were also found.
In Populus GATA TFs, type IV b is the most abundant (272 of 389; 69.92%), and type IV c is the second (102; 26.22%). Type IV b , a common type of DNA-binding motifs of plant GATA TFs, occupies the largest proportion in Populus GATA TFs and is found in subfamilies I, II, and IV and Type IV c , the second largest, is a common in subfamily III of Populus GATA TFs, which is similar to those of A. thaliana, G. max, V. vinifera, and O. sativa. Type IV 4 is found only in eight Populus GATA TFs (2.06%) belonging to subfamily II. It was also identified in some species including A. thaliana and G. max, suggesting that this type was independently occurred during evolution by adding two amino acids only in the first two cysteines of subfamily II. In contrast, type IV p is found in all subfamilies of Populus GATA TFs as well as those of A. thaliana, G. max, and O. sativa, suggesting that random modifications of GATA domains of all subfamilies have been occurred during evolution. Interestingly, type IV p can be generated by alternative splicing forms: PtaaGATA36d is GATA TF displaying type IV p ; while the rest four alternative splicing forms of PtaaGATA36 show type IV c domain, which is similar to the cases of OsGATA8 81 and AtGATA26. Type IV p was also considered as an ancestral form of GATA zinc finger 40 , requiring more research of Type IV p with additional GATA genes from many plant genomes.

Amino acid patterns of GATA domains in seven Populus species and A. thaliana. Amino acid
sequences of 422 GATA domain from 382 Populus and 40 A. thaliana GATA TFs excluding seven type IV p domains of Populus and one Arabidopsis were used for multiple sequence alignment (Fig. 5a). Subfamily IV of Populus displays the most conserved manner (49 of 55 conserved amino acids are identical.) in their domains, while subfamily II of Populus shows the least conserveness (18 of 66 conserved amino acids). Considering with the number of Populus GATA genes in each subfamily, subfamily I, the largest subfamily, is more conserved than subfamilies II and III. Some of dominant amino acids in GATA domain are different among Populus species (e.g., the 5' region of GATA domains; Fig. 5a).
Subfamily IV displays incongruent of conserved amino acids between Populus and A. thaliana at 8th and 47th ( Fig. 5a; blue-colored transparent boxes), indicating that subfamily IV was evolved and stabilized in early stage. In addition, six, one, and two amino acids which are 100% conserved in Populus genus but not in A. thaliana are found in subfamilies I, II, and IV, respectively ( Fig. 5a; red-colored transparent boxes), suggesting that subfamily I has been most diversified in the lineage of A. thaliana.
Twelve conserved amino acids of Populus and A. thaliana belonging to the zinc finger motif (CX 2-4 CX 18-20 CX 2 C) are identical in all subfamilies (Fig. 5a) suggesting that the zinc finger motif is the most conserved and important region in the GATA domain. 22nd and 28th conserved amino acids in subfamilies I, II, and IV are Tryptophan and Glycine, respectively; while subfamily III displays methionine and glutamic acid (Fig. 5a). These differences can be key factors to classify four subfamilies.
As expected, the zinc finger motif, which can bind to DNA and is the most important region in GATA domain, contains a smaller number of different amino acids (Fig. 5b). Despite of large number of species in Populus genus used in this study, four positions in this region show a high number of amino acids in A. thaliana (Fig. 5b), suggesting that selection pressures have been differently applied in the two lineages. This phenomenon is also found outside this region (Fig. 5b). It is congruent to the findings described in the previous paragraph. Once more plant genomes including the large number of resequencing data of A. thaliana and P. trichocarpa are analyzed, the detailed evolutionary history of the GATA domain will be uncovered.

Identification of transmembrane helix (TMH) of GATA gene family in seven Populus
genomes. Membrane-bound transcription factors (MTFs) are docked in cellular membranes using their transmembrane domains 100 . MTFs have usually been found in plant species 101 of which are related to seed germination 102 , cell division 101 , heat stress 103 , and salt stress 104 . Mechanisms of MTFs are well known in two major plant TF families: NAC TF family and bZIP TF family 105 (Fig. S4). NTL6 (NAC TF) of A. thaliana is localized in the plasma membrane under normal conditions; while under stress conditions, NTL6 is processed by an as-yet-unidentified intramembrane protease and SnRK2.8 kinase phosphorylates NTL6 and facilitates its nuclear import 105 . (ii) Intracellular movement of Arabidopsis bZIP60 and bZIP28 was characterized 105 . bZIP60 and bZIP28, which are other MTFs in Arabidopsis, were localized on the membrane of the endoplasmic reticulum and then transported to nucleus by cleaving TMH.
Five Populus GATA TFs were identified with TMHs predicted by TMHMM 106 . PtrGATA14b, 14c (P. trichocarpa) in subfamily I, PpGATA21, 25 (P. pruinosa), and PtaaGATA23 (P. tremula x alba; Table S10), belonging to subfamily II. These putative GATA MTFs have one TMH, which is the same as the previously characterized Arabidopsis MTFs 107 . As far as we know, this is the first time to report putative GATA MTFs in plant species; additional putative GATA MTFs were also identified in other plant species using our pipeline: VvGATA19 (V. vinifera; subfamily IV) 89 , GmGATA39 (G. max; subfamily I) 40 , OsGATA14 (O. sativa; subfamily II) 81 with one TMH, while no GATA MTF was found in A. thaliana 25 . It is interesting that there are no common subfamilies containing putative GATA MTFs along with different species. In addition, some alternative splicing forms of PtrGATA14 (P. trichocarpa) have TMH, indicating that truncation of TMH region by alternative splicing may switch their functions by changing subcellular localization of GATA TFs, similar to the bZIP60 in A. thaliana 105 . With accumulating more data including expression profiles and subcellular localization, roles of these putative GATA MTFs can be uncovered. Moreover, these results can be a corner stone to understand plant GATA MTFs together with a large number of plant genomes available now [108][109][110] in a broad taxonomic range of plant species.
Chromosomal distribution of P. trichocarpa, P. tremula x alba, and P. deltoides GATA gene family. Chromosomal  www.nature.com/scientificreports/ alba, and P. deltoides belonging to the same clade (Fig. 6), presents several important features: (i) 38 of 39 GATA genes in P. trichocarpa, 37 out of 38 in P. tremula x alba, and 36 of 38 in P. deltoides were distributed on 15 of 19 chromosomes (Fig. 6), presenting similar chromosomal distribution among three species. (ii) Chromosome 5 in both species contains the largest number of GATA genes; while chromosomes 9, 13, and 19 in both species contain the smallest (Fig. 6). This biased chromosomal distribution was also found in many plant species including A. thaliana 25 Fig. 6). It indicates that there might be no chromosomal rearrangement events and biological functions of GATA genes may have similar functions among the three species. Interestingly, three of the four genes are additional copy of GATA TFs in PC12 and PC17, which is the result of independent gene gain events. Based on paralogous GATA TFs of P. trichocarpa identified in the previous study 111 , 10 paralogous pairs were successfully mapped to GATA TFs from three Populus species (Fig. 6), displaying that all paralogous pairs contain three GATA TF from each species in the similar chromosomal positions, indicating that gene duplication events of the 10 paralogous pairs were occurred before speciation of three Populus species (see yellow star in Fig. 1a).  Fig. 1). PC02 indicates another loss event occurred in the lineage of P. tremuloides (See green lines in Fig. 1). Taken together with the incongruency inferred from the PCA, GATA TFs were evolved with the events occurred in various lineages in Populus species, which is independent to their species evolution.

Conclusion
Using the identification pipeline of GATA TFs in the GATA-TFDB, we successfully identified 262 GATA genes (389 GATA TFs) from seven Populus species. Alternative splicing forms of Populus GATA genes display the high number of alternative splicing forms (nine is maximum) with only changes in untranslated regions and loss of DNA-binding motif or additional domains. Populus GATA genes were classified into the four subfamilies, same to Arabidopsis GATA genes, except that some genes in subfamily III lack CCT and/or TIFY domains. 21  Figure 6. Chromosomal distribution of P. trichocarpa, P. tremula x alba, and P. deltoides GATA genes. Black, purple, and yellow bars indicate P. trichocarpa, P. tremula x alba, and P. deltoides chromosomes, respectively. Black, purple, yellow letters mean GATA gene names of P. trichocarpa, P. tremula x alba, and P. deltoides. ChrUn present scaffold sequences which are not assigned to any chromosomes. Yellow-colored transparent boxes indicate 10 paralogous pairs.  Expression analysis of GATA TFs based on Populus RNA-seq data. Raw reads of RNA-Seq experiments of P. pruinosa and P. deltoides were downloaded from NCBI (Table S9). RNA-Seq raw reads were aligned against the whole genome of P. pruinosa and P. deltoides with hisat2 v2.2.0 117 after generating datasets of each Populus genome. After generating bam file for each SRA raw reads, bam files from the same experiments were merged using samtools v1.9 118 . Expression levels of the merged bam files were calculated by cufflink v2.2.1 119 . Hierarchical clustering was conducted for the five datasets: one covers four different conditions (four tissues) and the last four contain each condition (Fig. 4) using hclust function in R package stats version 4.0.3 116 . Construction of phylogenetic tree of seven Populus species based on complete chloroplast genomes. Complete chloroplast genomes of seven Populus chloroplast genomes 69,[120][121][122][123] and Salix gracilistyla 78 , used as an outgroup, were aligned using MAFFT v7.450 124 . All chloroplast genome sequences were retrieved from the PCD (http:// www. cp-genome. net; Park et al., in preparation). The maximum-likelihood trees were reconstructed in MEGA X 125 . During the ML analysis, a heuristic search was used with nearest-neighbor interchange branch swapping, the Tamura-Nei model, and uniform rates among sites. All other options were set to their default values. Bootstrap analyses with 1,000 pseudoreplicates were conducted with the same options. All bioinformatic processes were conducted under the environment of the Genome Information System (GeIS) used in the various previous studies 43, [126][127][128][129][130][131][132][133][134] .

Data availability
All GATA TFs identified in this study can be accessed at the Populus Comparative Genome Database (http:// www. popul usgen ome. info/).