Genome-wide identification and characterization of SPL transcription factor family and their evolution and expression profiling analysis in cotton

Plant specific transcription factors, SQUAMOSA promoter-binding protein-like (SPL), are involved in many biological processes. However, no systematical study has been reported in cotton. In this study, a total of 177 SPL genes were identified, including 29, 30, 59 and 59 SPLs in Gossypium arboreum, G. raimondii, G. barbadense, and G. hirsutum, respectively. These SPL genes were classified into eight phylogenetical groups. The gene structure, conserved motif, and clustering were highly conserved within each orthologs. Two zinc finger-like structures (Cys3His and Cys2HisCys) and NLS segments were existed in all GrSPLs. Segmental duplications play important roles in SPL family expansion, with 20 genes involved in segmental duplications and 2 in tandem duplications, and ten ortholog pairs in syntenic regions between G. raimondii and A. thaliana. Several putative cis-elements, involved in light, stresses and phytohormones response, were found in the promoter regions of GhSPLs, suggesting that plant responses to those environmental changes may be induced through targeting SPL transcription factors. RNA-seq analysis shows that SPL genes were differentially expressed in cotton; some were highly expressed during fiber initiation and early development. Comparing with other plants, SPL genes show subfunctionalization, lost and/or gain functions in cotton during long-term domestication and evolution.


Results
Genome-wide identification of the SPL gene family in Gossypium and their chromosomal distribution. To identify the SQUAMOSA promoter binding protein-like (SPL or SBP) transcription factor genes in cotton, SBP domain (PF03110) was used to search protein database of four cotton species, G. raimondii 33 , G. arboreum 32 , G. hirsutum acc. TM-1 29 , and G. barbadense acc. 3-79 31 by HMMER 35 . The candidate SPL genes were verified by the presence of the SBP domain using SMART and CDD 36,37 . A total of 177 SPL genes were identified in cotton, including 29, 30, 59 and 59 were identified in G. arboretum, G. raimondii, G. hirsutum and G. barbadense, respectively (Table S1). GrSPL genes were named according to the closest orthologs in A. thaliana. Among 17 SPL genes in A. thaliana, AtSPL3, AtSPL4, AtSPL11, AtSPL12, AtSPL15 and AtSPL16 have not orthologs in G. raimondii; and all 30 GrSPLs from the rest of 10 AtSPLs orthologs. AtSPL13 has 5 paralogs in G. raimondii; AtSPL1, AtSPL5 and AtSPL6 have 4 paralogs; AtSPL2, and AtSPL9 have 3 paralogs; AtSPL7, AtSPL8 and AtSPL10 have 2 paralogs; and AtSPL14 only 1 paralogs in G. raimondii. Different paralogs were coded a, b, c, and so on, according to their order of the homologous chromosomes. The corresponding orthologs in G. arboreum, G. hirsutum acc. TM-1 and G. barbadense acc. 3-79 were named as GaSPL, GhSPL, and GbSPL, respectively. The encoded protein lengths of 59 SPL genes varied from 141 to 1083 amino acids, and the predicted molecular weight (Mw) of these SPL proteins ranged from 16.112 to 119.882 kDa in upland cotton G. hirsutum (Table S1).
The 59 GhSPL genes were located on 20 chromosomes, with no SPL gene detected in A5/D5, A6/D6, and A9/D9 homologous chromosomes in G. hirsutum acc. TM-1 (Fig. 1, Table S1). 30 GrSPL genes were located on 11 chromosomes in G. raimondii except on the D5 and D6. GrSPL9b was located on the D9 in G. raimondii, but GhSPL9b_A/GhSPL9b_D were in A4/D4 (scaffold) homologous chromosomes in G. hirsutum. This phenomenon, SPL9b was positioned on the different chromosomes, might be caused by assembly error in the duplicated gene regions, and need to be further confirmed. Furthermore, tandem duplication events were defined as genes separated by five or fewer genes and within 100 Kb region 38 . Only one gene pairs on D12 (GrSPL10b/GrSPL8b) displayed tandem duplications in G. raimondii (Fig. 1B), and the corresponding tandem duplication (GhSPL10b_D/GhSPL8b_D) was detected on D12 of G. hirsutum (Fig. 1A). On A12, the distance was over 100 Kb (161.95 Kb) between GhSPL10b_A and GhSPL8b_A. Compared with the distribution of SPL genes in G. raimondii and G. hirsutum, SPLs displayed high collinearity in the D-genome of G. raimondii and A-, D-subgenomes of G. hirsutum. Additionally, two pairs of orthologs genes were located in A2/A3 reciprocal translocation section, within GhSPL13b_A and GhSPL2b_A on A3 chromosomes, GhSPL13b_D and GhSPL2b_D on D2.

Phylogenetic analysis and gene duplication observation of SPL gene family. To investigate
the phylogenetic relationship of the SPL transcription factor family in cotton, a total of 242 SPLs were used to construct a Neighbour-Joining (N-J) phylogenetic tree by MEGA 7.0 software 39 . These 242 SPLs included 177 obtained from this research in four cotton species, G. raimondii (30), G. arboretum (29), G. hirsutum (59) and G. barbadense (59); the rest 65 were obtained from three well-studied plant species, including 17 from A. thaliana, 18 from O. sativa and 30 from P. trichocarpa. As shown in Fig. 2, all of the SPL genes were clustered into eight sub-groups (from I to VIII), and each group contained at least one protein from three species (Arabidopsis, rice and poplar) except group VI in which there was no SPL from rice. Additionally, different SPL orthologs were clearly distinguished in cotton. Cotton SPL13, SPL8, SPL9, SPL2, SPL6 and SPL7 were grouped to cluster I-IV, VI and VIII, respectively. Cotton SPL5 and SPL10 were clustered in group V, SPL1 and SPL14 were clustered in group VII.
SPL duplication pattern between G. raimondii and three other species, A. thaliana, O. sativa and P. trichocarpa were also analyzed. Among 30 GrSPLs from 10 Arabidopsis SPL orthologs, 19 pairs of duplication events were identified between G. raimondii and A. thaliana, involving 8 GrSPL orthologs, and 14 GrSPL and 12 AtSPL genes, respectively; 22 pairs of duplication events were identified between G. raimondii and P. trichocarpa, involving 5 GrSPL orthologs, and 10 GrSPL and 11 PtSPL genes, respectively; only 5 pairs of duplications were observed in G. raimondii and O. sativa, and this indicated that SPL gene family of Gossypium and O. sativa were less conserved. There were more GrSPL genes or orthologs between G. raimondii and A. thaliana than that between G. raimondii and P. trichocarpa. We compared syntenic map of G. raimondii and A. thaliana (Fig. 3, Table S2), with ten ortholog pairs positions on segmental duplicated blocks including the following: GrSPL1d/AtSPL1; GrSPL2a, 2b/AtSPL11; GrSPL5a/AtSPL4, 5; GrSPL5c/AtSPL3, 5; GrSPL5d/AtSPL5; GrSPL6c/AtSPL6; GrSPL7a, 7b/AtSPL7; GrSPL9a, 9c/AtSPL9, 15; GrSPL13c, 13d/AtSPL13B; and GrSPL14a/AtSPL14, 16. Gene structure and conserved motif analysis of SPLs in G. hirsutum. With GhSPLs as an example, we analyzed the SPL gene exon/intron structure, conserved motif, and putative cis-acting elements from GhSPLs promoters. An unrooted N-J tree was also constructed only using 59 SPL protein sequences from G. hirsutum (Fig. 4A). The gene structures of 59 GhSPLs were analyzed by GSDS 2.0 41 , and displayed in Fig. 4B. The number of introns of 59 GhSPLs varied from 0 to 9. Nearly half of SPL genes (27 GhSPLs) had two introns (6 GhSPL6, 3 GhSPL8, 6 GhSPL9, 4 GhSPL10 and 8 GhSPL13); 5 GhSPL2 and 8 GhSPL5 had three and one introns, respectively; 14 GhSPLs had nine introns (including 8 GhSPL1, 4 GhSPL7 and 2 GhSPL14); the remaining SPL members GhSPL2a_D, GhSPL8a_A, GhSPL13d_A/D and GhSPL6a_D had 6, 4, 3 and 0 introns, respectively. By comparing the SPL gene structures of G. hirsutum and Arabidopsis, we found that the pattern of exon/intron structures of SPLs in G. hirsutum is quite similar to AtSPLs. In Arabidopsis, AtSPL1, AtSPL7 and AtSPL14 had 9 introns; AtSPL6, AtSPL8 AtSPL9, AtSPL10 and AtSPL13 had 2 introns; AtSPL2 and AtSPL5 had 3 and 1 introns, respectively (supplementary material Fig. S1). This result reveals that different GhSPL orthologs exhibited different exon-intron structures and were similar to Arabidopsis orthologs. The MEME 42 was used to predict motifs of 59 SPL protein sequences in G. hirsutum. 20 motifs, named motifs 1 to 20 were identified (Fig. 5). The length of 20 identified motifs and consensus sequence were listed in supplementary material Table S3, and Logos of 20 conserved motifs are shown in supplementary material Fig. S2. The lengths of those conserved motifs were between 21 (motif 15) and 159 amino acids (motif 4). The number of the conserved motifs in each GhSPL protein varied from 2 to 13. All GhSPL proteins contained motif 1 (two Zn-finger like structure), and 56 GhSPLs contained motif 2 (nuclear localization signal, NLS) except GhSPL5c_D, GhSPL6a_D and GhSPL9b_D. Seven GhSPL5 proteins only had motifs 1 and 2; GhSPL8 had motifs 1, 2 and 14; GhSPL13 had motifs 1, 2 and 15; GhSPL9 and GhSPL10 had motifs 1, 2 14, and 15; other GhSPLs had more motifs, such as GhSPL2 (6 motifs), GhSPL7 (6-7 motifs), GhSPL14 (9-10 motifs) and GhSPL1 (11-13 motifs). Then, multiple alignment of all 59 GhSPL proteins was performed by MAFFT version 7, and presented the SBP domain structures in detail. All GhSPLs exhibit two zinc finger-like structures and NLS segments, with the exception of three SPLs (GhSPL5c_D, GhSPL6a_D and GhSPL9b_D) which lacked NLS. The first Zn-finger like structure (Cys3His), the second Zn-finger like structure (Cys2HisCys) and highly conserved NLS were signed in Fig. 6B. The SBP domain motif logo and protein sequence were showed in Fig. 6A. Therefore, the sequences SBP domain of GhSPLs, two Zn-finger structure and NLS section, were also conserved in cotton, and SPL motif member architecture within each of GhSPL orthologs tend to have a similar number and type of motifs.
In conclusion, members belonging to the same GhSPL orthologs had a similar gene structure, motif architecture, tended to cluster together in phylogenetic tree.
The cis-acting elements analysis of GhSPL gene promoter regions. The upstream sequences of 59 GhSPL genes (2500 bp upstream of the initiation codon) were used for cis-acting element prediction  (Table S4). Other major cis-elements also include elements responsible to stress response [such as defense and stress (TC-rich repeats), WRKY binding site (W box), heat (HSE), drought (MBS), low-temperature (LTR) and wound (WUN-motif)], and phytohormone response [such as auxin (TGA-element and AuxRR-core), abscisic acid (ABRE), ethylene (ERE), gibberellin (P-box, TATC-box and GARE-motif ), salicylic acid (TCA-element) and MeJA (CGTCA-motif and TGACG-motif)] (Fig. 7, Table S4). This suggests the important roles of GhSPL genes in biological processes as well as response to abiotic stresses and phytohormones in cotton.  GhSPL5c_D were highly expressed in 3 DPA ovules and 5 DPA fibers; GhSPL5d_D were highly expressed in petal; GhSPL5a_A/D, GhSPL5b_D and GhSPL5d_A were low expressed in all tissues. In the GhSPL6 orthologs, GhSPL6d_A/D were highly expressed in 3, 5 and 10 DPA fibers, GhSPL6a_A/D, GhSPL6b_A/D and GhSPL6c_A/D were low expressed in all tissues. In the GhSPL8 and GhSPL9 orthologs, GhSPL8a_A/D and GhSPL9a_A/D were highly expressed from −3 to 3 DPA ovules; GhSPL9b_A/D were highly expressed in petal; GhSPL8b_A/D and GhSPL9c_A/D were low expressed in all tissues. In the GhSPL10 orthologs, GhSPL10b_A/D were highly expressed in 5 DPA fibers; and GhSPL10a_A/D were low expressed in all tissues. In the GhSPL13 orthologs, GhSPL13a_A/D were highly expressed in leaf, −3, 0 and 3 DPA ovules; GhSPL13b_A/D and GhSPL13e_A/D were highly expressed in leaf; GhSPL13d_A/D were highly expressed in 0 and 3 DPA ovules. However, GhSPL13c_A/D were no detected in all tested tissues.

Discussion
In the past couple of years, SPL transcription factors have been attracting attention from the scientific community. Genome-wide identification of SPL gene family has been reported in several plant species. The number of SPLs varies from species to species. For instance, there are 15 SPLs in potato, pepper, peanut, citrus, Prunus mume, danshen, castor bean and tomato 10 11 , and 58 in oilseed rape 12 . However, no genome-wide identification of SPL gene family has been reported in cotton although there are four cotton species sequenced. In this study, we reported for the first time the genome-wide identification of SPL genes and systematically investigated the functional structure of SPL transcription factor family. Based on our results, we identified 29, 30 59 and 59 SPL genes in G. arboretum, G. raimondii, G. hirsutum and G. barbadense, respectively (Table S1). The number of SPLs in A or D genome diploid cotton were similar to Populus, and an amount of SPLs of allotetraploid cotton species were very close to oilseed rape B. napus (AACC, 2n = 38). To compare the number of SPL genes and corresponding relationships in four cotton species, we found that there were a typical polyploidization phenomenon. All 29 SPLs in diploid A-genome (G. arboretum) and 30 SPLs in D-genome (G. raimondii) could be found their homologous genes in allotetraploid genomes (G. hirsutum AADD, 20 = 52). Only SPL6a was found two homologous genes in four cotton species, GrSPL6a and GhSPL6a_D in the D-genome (G. raimondii) and D-subgenome of G. hirsutum, respectively, and no unique genes were found in the A-genome G. arboreum, A-subgenome of G. hirsutum, and A-and D-subgenome of G. barbadense. Additionally, phylogenetic analysis of SPL proteins in various species showed that green alga were grouped together, and other SPLs were classified into 6-7 groups 21,22 . In this study, phylogenetic tree of 242 SPL proteins from four cotton species (G. raimondii, G. arboretum, G. hirsutum and   G. barbadense), A. thaliana, O. sativa and P. trichocarpa, showed that all SPL genes were clustered into eight groups (I-VIII) (Fig. 2). Among 177 cotton SPLs from 10 Arabidopsis SPLs orthologs, each kinds of orthologs were clustered together.
Segmental duplications play an important role in the gene expansion of SPL transcription factor gene family. Many segmental duplication gene pairs were found in SPL gene family in plants [20][21][22] . In this study, we identified one pair of tandem duplication (GrSPL10b/GrSPL8b) and 16 pairs of segmental duplications involved 20 SPL genes in G. raimondii. There were 4, 7 and 25 pairs of duplications in A. thaliana, O. sativa and P. trichocarpa, respectively. Additionally, the duplication pattern of SPLs between G. raimondii and other three species were analysis; 5, 19 and 22 pairs of duplication events were identified between G. raimondii and O. sativa, G. raimondii and A. thaliana, G. raimondii and P. trichocarpa, respectively. There were 10 ortholog pairs in syntenic regions between G. raimondii and A. thaliana (Fig. 3, Table S2). Inter-genomic duplication events between Arabidopsis and other two species were identified in previous study, including eleven SPLs ortholog pairs between apple and Arabidopsis 21 , and nine SPLs ortholog pairs between grape and Arabidopsis 22 .
By comparing the number of introns of SPLs gene in cotton and Arabidopsis, we found that different SPL orthologs contained different gene structures, including 9 introns in SPL1, SPL7 and SPL14; 2 introns in SPL6, SPL8, SPL9, SPL10 and SPL13; 3 introns in SPL2 had; 1 introns in SPL5. As well as different cotton SPL orthologs shared similar motifs. GhSPL5 had motifs 1 and 2; GhSPL8 had motifs 1, 2 and 14; GhSPL13 had motifs 1, 2 and 15; GhSPL9 and GhSPL10 had motifs 1, 2 14, and 15; GhSPL1, GhSPL2, GhSPL7 and GhSPL14 had more motifs. Among 20 motifs, motif 1 was two Zn-finger like structure, existed in all 59 GhSPL proteins. Motif 2 was nuclear localization signal (NLS), and 56 GhSPLs contained this motif except GhSPL5c_D, GhSPL6a_D and GhSPL9b_D. As showed in Fig. 6B, SBP conserved domain, two zinc finger-like structures and NLS segments, were shared by all GrSPLs. In addition, we speculate that different SPL orthologs probably play similar function between cotton and Arabidopsis, due to the presence of similar exon/intron structure and conserved motifs.
Several cis-acting elements were found in the promoter regions of GhSPLs (Fig. 7, Table S4), which suggests that SPL transcription factors may be regulated by light, stresses and/or phytohormones. All the identified SPL gene show a development-and tissue-dependent expression patterns (Fig. 8). GhSPL1 and GhSPL14 orthologs were highly expressed in all tested developmental stage and tissues. Some SPL genes were highly expressed in certain tissues, and others were low expressed in all tested tissues in the same orthologs. Hypothesized that the miR157/SPL may regulate floral organ size and ovule production in cotton. Zhang et al. 26 reported that GhSPL3 and GhSPL18 might be involved in the development of leaves and second shoots, and promoting flowering by overexpression target genes in Arabidopsis plants.
MicroRNAs (miRNAs) may also involve in SPL-regulated gene networks. Among 17 SPLs in Arabidopsis, 10 were putative targets of miR156/157 4,44 . 11 of 19 SPLs in rice 45 , 18 of 28 SPLs in Populus 20 , 17 of 41 SPLs in soybean 11 were reported to be potential targeted by certain miR156. In this study, we also found that 31 of 59 identified SPLs were potentially targeted by miR156 in upland cotton, which are from 6 different orthologs (GhSPL2, GhSPL6, GhSPL9, GhSPL10 and GhSPL13). Interestingly, motif 7 was existed in those SPLs, and it is a potential target site for the miR156/miR157. In Arabidopsis and rice, motif contains miR156 recognition element was also reported in all miR156-targeted SPLs 4,25,45 . Thus, SPL gene function analysis mainly through significantly represses the SPL transcriptions by over-expression of miR156/miR157. Arabidopsis as an important model plant species, the majority of AtSPL genes have been well functionally characterized. SPL2, SPL9, SPL10, SPL11, SPL13 and SPL15 contribute shoot development and the phase transition from vegetative growth to reproductive growth 5 . SPL3, SPL4 and SPL5 primarily promote floral induction and/or floral meristem identity, by SOC1-SPL module control flowering time 46 , and act synergistically with FT-FD module to induce flowering under LDs 47 . SPL3, SPL9 and SPL10 are involved in lateral root growth 48 . SPL1 and SPL12 confer plant thermotolerance at the reproductive stage 49 . SPL7 regulates the Cu deficiency response 50,51 . SPL8 acts in concert to secure male fertility and regulates gynoecium differential patterning [52][53][54] . Based on the high conservation of SPL gene family between cotton and Arabidopsis, we speculate that SPL gene family in cotton may involve in the timing of vegetative and reproductive phase change, root growth, leaf development, fertility, fiber initiation development, response to stresses and yield (Fig. 9). However, the detailed function of each SPL transcription factor in cotton remains to be investigated. This is a genome-wide analysis of SPL gene family in cotton, which will provide the overall and useful information for well functional analysis in the future.
The physical location of the SPLs in G. hirsutum and G. raimondii were fetched from the corresponding GFF files. MapInspect (http://mapinspect.software.informer.com/) was used to visualize the distribution of the SPL genes in Gossypium genome.
Phylogenetic and gene duplication. A phylogenetic tree was constructed using ClustalW alignment and the Neighbor-Joining (NJ) method in MEGA 7.0 software 39 (https://mega.nz/), with 1000 replicates boot-strap test. The genome-wide intra-and inter-genomic duplication files of G. raimondii, A. thaliana, O. sativa and P. trichocarpa were downloaded from the PGDD 40 (http://chibba.agtec.uga.edu/duplication), and the visualization was carried out with the CIRCOS tool 58 (http://circos.ca/). The ratios of Ka/Ks were used to assess the selection pressure for duplication genes.