Genome-wide comparative and evolutionary analysis of Calmodulin-binding Transcription Activator (CAMTA) family in Gossypium species

The CAMTA gene family is crucial in managing both biotic and abiotic stresses in plants. Our comprehensive analysis of this gene family in cotton resulted in the identification of 6, 7 and 9 CAMTAs in three sequenced cotton species, i.e., Gossypium arboreum, Gossypium raimondii, and Gossypium hirsutum, respectively. All cotton CAMTAs were localized in the nucleus and possessed calmodulin-binding domain (CaMBD) as identified computationally. Phylogenetically four significant groups of cotton CAMTAs were identified out of which, Group II CAMTAs experienced higher evolutionary pressure, leading to a faster evolution in diploid cotton. The expansion of cotton CAMTAs in the genome was mainly due to segmental duplication. Purifying selection played a significant role in the evolution of cotton CAMTAs. Expression profiles of GhCAMTAs revealed that GhCAMTA2A.2 and GhCAMTA7A express profoundly in different stages of cotton fiber development. Positive correlation between expression of these two CAMTAs and fiber strength confirmed their functional relevance in fiber development. The promoter region of co-expressing genes network of GhCAMTA2A.2 and GhCAMTA7A showed a higher frequency of occurrence of CAMTA binding motifs. Our present study thus contributes to broad probing into the structure and probable function of CAMTA genes in Gossypium species.

intermediary molecule of the plastidial pathway for isoprenoid production and is actively involved in inducing general stress response (GSR) by transducing AtCAMTA3 22 .
Cotton is an excellent model system for plant polyploid research 32 . The genus Gossypium comprises 45 diploid and 5 tetraploid species 29 . About 1 to 2 million years ago (MYA) interspecific hybridization events amongst G. arboreum (AA genome, 2n = 2x = 26, diploid species) and G. raimondii (DD genome, 2n = 2 × = 26, diploid species) resulted in allotetraploid G. hirsutum (AADD, 2n = 4x = 52) 27,28,33 . It is one of the most widely cultivated and fiber-producing crops. Upland cotton (G. hirsutum) has much longer fibers than its progenitor diploid cotton 31 . To obtain an integrated image of the evolutionary characteristics and probable role of CAMTA family in cotton, we characterized this family in G. arboreum, G. raimondii, and G. hirsutum. We further carried out detailed genomic exploration of CAMTA proteins in G. arboreum, G. raimondii, and G. hirsutum. The expression profiles and co-expression network of CAMTA genes in various fiber developmental stages in allotetraploid cotton were also analyzed. This work will lead to significant refinements in understanding the functional roles and evolutionary history of CAMTA family in cotton and their potential role in cotton fiber development.

Results
Genome-wide identification of CAMTA genes in cotton. The HMMER search against G. arboreum, G. raimondii, and G. hirsutum genomes was performed to identify the CAMTA orthologs in the Gossypium species. Subsequently, all the putative CAMTA genes were confirmed through similarity and conserved domain searches against Pfam and InterproScan databases. After removal of partial sequences, a total of 22 CAMTAs, i.e., 6 GaCAMTAs (G. arboreum), 7 GrCAMTAs (G. raimondii) and 9 GhCAMTAs (G. hirsutum) were eventually identified (Table 1 and Supplementary Dataset S1). The length of deduced cotton CAMTA proteins varied from 907 to 1,073, 963 to 1087, and 963 to 1088 amino acids in G. arboreum, G. raimondii, and G. hirsutum, respectively. The theoretical pI ranged from 5.56 to 8.07, 5.67 to 7.9 and 5.56 to 8.72; the molecular weight varied from 102.22 kDa to 120.63 kDa, 102.6 kDa to 123.05 kDa and 106.75 kDa to 123.13 kDa and the number of introns varied from 11 to 15, 10 to 12 and 10 to 15 in G. arboreum, G. ramondii, and G. hirsutum, respectively. All the cotton CAMTAs identified were nuclear localized (Table 1).
Phylogenetic relationship of cotton CAMTAs with other plant species. The evolutionary relationships among cotton CAMTAs and 17 different plant species (Supplementary Table S1) determined by an unrooted maximum likelihood tree (ML). We performed MSA of 22 identified cotton CAMTA proteins sequence along with 100 CAMTA protein sequences from different plants (bryophytes, lycopodiophytes, monocots, eudicots, and gymnosperms). Subsequently, plant CAMTAs were clustered distinctly into five major groups (I to V) while cotton CAMTAs clustered into four groups as none of its members belonged to group III (bryophytes and lycophytes). Group I, IV and V further divided into two subgroups (a and b) with robust bootstraps. Group I composed the largest clade among all groups with ten cotton CAMTA genes (Fig. 3). Majority of the CAMTA proteins from the diploid species had orthologs in the allotetraploid cotton, derived from hybridization and subsequent polyploidization of maternal A-genome and parental D-genome species 31 .
Most of the cotton CAMTAs clustered with Arabidopsis CAMTAs excluding group II CAMTAs (GaCAMTA7, GrCAMTA7, GhCAMTA7A, and GhCAMTA7D). Further, to gain a deeper understanding about group II CAMTAs, a phylogenetic tree of these CAMTAs with other eudicots (Brassicaceae, Ranunculaceae, Scrophulariaceae, Solanaceae, Vitaceae, Myrtaceae, Rutaceae, Malvaceae, Caricaceae, Rosaceae, Cucurbitaceae, Fabaceae, Salicaceae and Euphorbiaceae families) was constructed. Additionally, we also performed synteny analysis of cotton CAMTA genes with the genomes of Theobroma cacao (Malvaceae), Citrus sinensis (Rutaceae) and Arabidopsis thaliana (Brassicaceae). Our study revealed that the Group II CAMTAs were highly conserved between the G. hirsutum (At and Dt sub-genomes), G. arboreum (A genome), G. raimondii (D genome), Theobroma cacao and Citrus sinensis genomes. Thus Group II CAMTAs showed conservation with closely related species with Gossypium 36 but no conserved homologs were found in Brassicaceae ( Supplementary Fig. S2, Supplementary Fig. S3, and Supplementary Dataset S3). This study indicated that after diverging from a common ancestor, group II CAMTAs homologs from Brassicaceae family might have evicted.
Interestingly, CAMTAs from the bryophytes and lycophytes clustered into group III, and no member of cotton CAMTA gene family belonged to group III. Moreover, group II and six subgroups (Ia, Ib, IVa, IVb, Va, and Vb) of the groups I, IV, V, respectively were only represented by higher land plants whereas CAMTAs present in the subgroups Ib, and IVb were exclusively monocot-specific. CAMTA genes from monocotyledonous and dicotyledonous plants were present in all groups excluding group III, suggesting their evolutionary divergence before the common ancestor of monocots and dicots. The cotton CAMTA genes might share a common ancestor before the divergence of lower non-flowering (bryophytes and lycophytes) and higher flowering plants (gymnosperms and angiosperms); consequently, lineage-specific divergence and expansion events occurred in higher plants, after split from lower plants. Non-TIG CAMTAs evolved recently in flowering plant species 34 . Intriguingly, group IVa CAMTAs were mostly non-TIG types except GrCAMTA5.1 which contained all domains indicating that group IVa CAMTAs are non-TIG CAMTAs and might have specialized function in cotton. The cotton and T. cacao CAMTAs (both from Malvaceae) were clustered together with a high degree of reliability (>90% bootstraps) consistent with the fact that cotton and T. cacao CAMTAs originated from a last common ancestor 33 MYA 29 . Evolutionary relationship between Gossypium and Theobroma cacao CAMTAs. To explore the evolutionary and orthologous relationship between Gossypium and T. cacao CAMTAs, we investigated exon/ intron pattern and surveyed T. cacao genome. Nine orthologous T. cacao CAMTA genes showed homology with their counterpart in Gossypium. ML tree was constructed with 1000 bootstraps. Phylogenetic analysis revealed that 31 ortholog CAMTA genes were clustered into five subfamilies (I to V). Exon/intron structure of 31 CAMTAs was comparatively analyzed showing that the CAMTA orthologs which belong to the same subfamily had similar gene structure in terms of intron number and exon length. For example, Gossypium and T. cacao CAMTAs belonging to subfamily I, contain 11 and 12 number of introns with similar exon length ( Supplementary Fig. S4). This result demonstrated that both species probably diverged from a common ancestor during evolution which is consistent with the previous report 29 .
We next assessed gene duplications for the expansion of cotton CAMTAs. By high protein sequence identity and similarity, two and three pairs of putative paralogous CAMTA genes were identified in G. arboreum and G. raimondii, respectively (Fig. 4a,b), while only one pair of paralogous CAMTA gene was found in A T but not in D T subgenome (Fig. 4c,d). These paralogous CAMTA gene pairs were present in the same clade of the phylogenetic tree with a high degree of protein sequence identities (>75%). All of the paralogous gene pairs were positioned on different chromosomes, providing substantial evidence that expansion of cotton CAMTAs was mainly attributed to segmental duplication and not to tandem duplication. For instance, in G. arboreum, two segmental duplications (GaCAMTA2.1/2.2 and GaCAMTA5.1/5.2) occurred from 13.02 to 15.03 MYA, and three segmental duplications (GrCAMTA2.2/2.1, GrCAMTA5.1/5.2, and GrCAMTA5.1/5.3) were found in G. raimondii from 12.08 to 13.68 MYA (Table 2). Moreover, in case of G. hirsutum, only one segmental duplication (GhCAMTA2A.1/2A.2) in A T subgenome took place 13.04 MYA. This study revealed that recent duplication (13-20 MYA) occurred in those paralogous gene pairs 29 .
The Ka/Ks ratios (nonsynonymous and synonymous substitution ratios) for the duplicated cotton CAMTA gene pairs were invariably <1 (Table 2). Thus, duplicated cotton CAMTAs had undergone strong purifying selection pressure contributing to the maintenance of their function and reflecting that they had not diverged much during evolution. Since orthologs often retain equivalent functions in different species during evolution 37 , orthologous relationship among the members of CAMTA gene family was established (Fig 4a-d). Orthologs with sequence identity over 90% in both cDNA and amino acid composition were selected for further evolutionary analysis (Supplementary Dataset S4). The potential functional divergence and selection pressure of cotton CAMTAs were explored by calculating the Ka, Ks, and Ka/Ks ratios between orthologs (A vs. D, A T vs. A, and D T vs. D) and within homoeologs (A T vs. D T ). Surprisingly, the Ka value of cotton CAMTA7 (group II CAMTAs) orthologs (GaCAMTA7/GrCAMTA7 and GhCAMTA7A/GhCAMTA7D) was higher in inter-genomes (A vs. D and A T vs. D T ), than other ortholog CAMTA pairs, suggesting that these ortholog pairs experienced faster protein evolution. The overall Ka/Ks ratios <1, in both diploid and allotetraploid species demonstrated that CAMTA ortholog genes had experienced purifying selection pressure (Supplementary Table S2). CAMTA7 had higher Ka/Ks ratio in A vs. D, and A T vs. A. Hence, CAMTA7 experienced higher evolutionary pressure in diploid cotton and might have evolved faster in A as compared to D subgenome.
Phylogenetic tree, gene structure and protein motifs analysis of cotton CAMTAs. The evolutionary relationships among cotton CAMTAs were inferred by constructing a separate ML tree with 1000 bootstraps. The tree topology, duplication nodes of CAMTA paralogues in the ML, exon/intron organization, and   conserved motifs allowed us to classify the cotton CAMTAs into seven subfamilies (I-VII) with highest bootstraps. The genes within the same subfamily had a high identity (>80%) to each other, especially for those with the orthologous relationship, indicating their divergent evolution from a common ancestor or probable origin from gene duplication events (Fig. 5a).
To comprehend the structural diversity of cotton CAMTAs, we investigated their exon/intron patterns by comparing coding sequences and the corresponding genomic DNA sequences. The gene structure of cotton CAMTA genes showed group-specific exon/intron patterns. The number of introns varied from 9 to 15 in most of the CAMTA genes of Gossypium species. Intron number ranged from 11 to 15, 10 to 12 and 9 to 15 in GaCAMTAs, GrCAMTAs, and GhCAMTAs, respectively (Fig. 5b, Table 1). To investigate, whether the exon/ intron structure is consistent with phylogenetic subfamilies, the gene structure of cotton CAMTAs was compared. Most of the cotton CAMTAs within the same subfamily shared similar exon/intron distribution patterns in terms of intron number and exon length. For instance, CAMTA gene in subfamily I and VI contained 12 introns with 13 exons of similar length, whereas members within subfamily VII contained 15 introns, except for GrCAMTA5.3, which possesses 12 introns. MEME (Multiple Em for Motif Elicitation) was exploited to analyze the conserved motifs in CAMTA protein. Twenty putative conserved motifs were identified in the cotton CAMTAs. The InterProScan was employed to annotate these motifs. Motifs 1, 11 and 12 that hit the database were the conserved CG-1 domain. Motifs 2, 3 and 4 were the IQ-motif, Ankyrin repeat-containing domain, and immunoglobulin-like (Ig) fold, respectively (Supplementary Dataset S5). Motif 1 (the red motif; CG-1 domain) was present in all cotton CAMTAs and represents the conserved CAMTA domain. Most of the CAMTA family members with close evolutionary relationships and similar gene structure in the phylogenetic tree had identical motif compositions and hypothesized similar function (Fig. 5c).
The quantitative real-time PCR (qRT-PCR) at the representative stages of fiber development (0 to 25 DPA) validated the expression profiles of two highly expressed cotton CAMTAs (GhCAMTA2A.2, GhCAMTA7A) and least expressing CAMTA (GhCAMTA3D.1). The GhCAMTA7A expression found decreased from 0 to 12 DPA and slightly increased at 19 and further decreased at 25 DPA. However, GhCAMTA2A.2 showed a subsidiary increase in the expression pattern during SCW stage. Conversely, GhCAMTA3D.1 showed lower expression at all representative stages of fiber development, by the in silico expression analysis (Supplementary Fig. S5). PCoEGs and NCoEGs were subjected to MapMan (http://gabi.rzpd.de/projects/MapMan/) analysis to identify enrichment of functional and molecular categories. The MapMan analysis revealed that several GhCAMTA2A.2 and GhCAMTA7A positively co-expressed genes belong to the cell wall and its precursor synthesis (Fig. 6f). Further, positively co-expressed genes also belong to oxidative stress including oxidases, cytochrome P450, alcohol dehydrogenase, short-chain dehydrogenase/reductase (SDR) (Fig. 6g), respiratory burst (Fig. 6h) and redox pathways including ascorbate, glutathione, and glutaredoxin (Fig. 6i). The transcription factors belonging to class AP2/EREBP, ARF, bHLH, MYB, WRKY, bZIP, Aux/IAA and phytohormones including abscisic acid, auxin, brassinosteroid, ethylene, gibberellins, jasmonate and salicylic acid also belonged to positively co-expressed genes with these two CAMTAs (Fig. 6j,k). In contrast, the NCoEGs of GhCAMTA2A.2 and GhCAMTA7A mainly belong to peroxidases, dynamin (Fig. 6g), thioredoxin, peroxiredoxin, dismutases and catalases (Fig. 6i). The transcription factors in NCoEGs belongs to C2H2 zinc finger family, MADS, E2F/DP, NAC and TCP (Fig. 6j). There were other functional categories which were found enriched with these positively and negatively co-expressed genes which belong to amino acid metabolism, co-factor and vitamin metabolism, development, nucleotide metabolism, secondary metabolism, carbohydrate metabolism, photosynthesis, lipid metabolism, glycolysis, protein metabolism, signaling, and transport ( Supplementary Fig. S6).

Co
We next analyzed the cumulative expression of PCoEGs and NCoEGs of GhCAMTA2A.2 and GhCAMTA7A. The PCoEGs with GhCAMTA2A.2 show significantly higher cumulative expression during SCW stage precisely at 20 and 25 DPA (Fig. 7a), while the cumulative expression of NCoEGs with GhCAMTA2A.2 was lower at SCW (Fig. 7b). Coherently, for PCoEGs with GhCAMTA7A showed higher cumulative expression at 0 DPA which subsequently declined till 10 DPA and further increased at 20 and 25 DPA (Fig. 7c). In complete contrast, the cumulative expression was highest at 10 DPA in NCoEGs with GhCAMTA7A (Fig. 7d).
Next, promoter regions (1000 bp upstream) of PCoEGs (651) and NCoEGs (575) with GhCAMTA2A.2 as well as PCoEGs (504) and NCoEGs (114) with GhCAMTA7A were analyzed to identify the conservation of CAMTA binding motifs (MCGCGB/MCGTGT). A random promoter set (700) was selected as control of those genes which do not co-expressed either with GhCAMTA2A.2 or GhCAMTA7A. The frequency of occurrence of CAMTA binding motifs in PCoEGs and NCoEGs with GhCAMTA2A.2 was 0.40 and 0.36 which was significantly higher as compared to the control (0.29) (Fig. 7e and Supplementary Dataset S6). Similarly, PCoEGs and NCoEGs with GhCAMTA7A showed significantly higher frequency of occurrence of these motifs (0.41 and 0.35, respectively) than control (0.29) (Fig. 7f and Supplementary Dataset S7). Hence, this study affirms indicates that presence of CAMTA binding motifs in the upstream elements of PCoEGs results in their regulation by GhCAMTA2A.2 and GhCAMTA7A. As evident GhCAMTA2A.2 and GhCAMTA7A also regulates NCoEGs, however, the underpinning mechanisms of fiber development by GhCAMTA2A.2 and GhCAMTA7A remain elusive and demands further experimental exploration.

Correlation analysis of highly and least expressing CAMTAs with different fiber quality traits.
The correlation was established between the expression of CAMTAs and fiber quality traits to understand the functional relevance of CAMTAs during cotton fiber development. The quantitative real-time PCR (qRT-PCR) of highly expressing CAMTAs (GhCAMTA2A.2 and GhCAMTA7A) and least expressing CAMTAs (GhCAMTA3D.1) was carried out in 67 genotypes of G. hirsutum at 0 DPA. We also estimated fiber quality traits such as fiber length (FL), fiber strength (FS), uniformity ratio (UR), micronaire (MIC), boll number (BN) and boll weight (BW) in the same 67 genotypes. Pearson correlation coefficient (r) was estimated to check the correlations between highly expressing CAMTAs and fiber quality traits of genotypes. The results indicated that GhCAMTA2A.2 and GhCAMTA7A displayed significant positive correlation ~0.63 (p = 7.805e-09) and ~0.61 (p = 1.56e-08), respectively with fiber strength but not with other traits. However, no significant correlation was found in the least expressing GhCAMTA3D.1 with any of the fiber quality determining trait (Fig. 8). The results thus provide important evidence on the role of GhCAMTA2A.2 and GhCAMTA7A in the cotton fiber development.

Discussion
The recent availability of G. hirsutum 31 and its diploid progenitor genomes 29,30 allowed us to perform a comprehensive analysis of CAMTAs. G. hirsutum has longer fibers than G. arboreum and G. raimondii possibly due to the genome and associative doubling of fiber-related genes 39 . To explore the probable roles of cotton CAMTAs in fiber development, we undertook a comprehensive genome-wide characterization, expression, and co-expression network analysis. Our study identified a total of 6, 7 and 9 CAMTA genes in G. arboreum, G. raimondii, and G. hirsutum, respectively ( Table 1). The G. arboretum has a similar number of CAMTA genes as in Arabidopsis (6), whereas G. raimondii contains the 7 CAMTAs in consistence with the proportion of the predicted genes in their genome. For example, G. raimondii genome (40,976 genes) 30 is about 1.6 times that in Arabidopsis (25,498 genes) 40 . Although the genome size of G. arboreum (1,746 Mb) 29 is approximately two fold of G. raimondii genome (885 Mb) 29 , G. raimondii contains a higher number of CAMTA genes as compared to G. arboreum. This higher number might be due to more retrotransposons insertion, gene loss, disrupted genes, structural rearrangements, sequence divergence in A-genome resulting in a fewer number of CAMTA gene 31 . It is noteworthy that A-genome evolution is more ancient than that of D-genome. Therefore, the difference in some CAMTA genes may be due to the extent of evolutionary divergence that is more in A-genome than D-genome 29 . The G. hirsutum contains 9 CAMTA genes, due to its allotetraploid nature. This expansion appeared because of whole-genome duplication events in cotton lineage and may be due to the transposable elements which represented a significant component of Gossypium genome 41 .
Sequence analysis revealed that apart from some non-TIG cotton CAMTAs, all of the cotton CAMTAs contained multifunctional conserved domains of CAMTAs and were localized in the nucleus (Fig. 1). The CaMBD interacts with CaM in a Ca 2+ -dependent manner, while the IQ domain binds with CaM in a Ca 2+ -independent way [1][2][3]5,34,35 . It is interesting to note that all the cotton CAMTAs contain CaMB domain, which was located at conserved positions adjacent to the IQ motifs, indicating that cotton CAMTAs may interact with CaM in a Ca 2+ -dependent and Ca 2+ -independent manner, respectively. The phylogenetic tree of cotton CAMTAs with different plant species indicated that the major groups or subgroups contain orthologs from Arabidopsis, T. cacao, and other plant species, proposing a similar function of cotton CAMTAs with CAMTAs of different plant species which emerged from monocot-eudicot split (Fig. 3). Conversely, group II CAMTAs that were the result of recent species-specific duplication events lead to independent functional diversification ( Supplementary Figs S2 and S3). Ortholog pairs of group II CAMTAs experienced faster evolution as compared to other CAMTAs, signifying their functional divergence in cotton, suggesting that group II CAMTAs may have a precise role in Gossypium species (Supplementary Table S2).
Mounting evidence suggests that expansion of gene families caused by the gene duplication event is one of the major evolutionary mechanisms directing to functional diversification and speciation 42 . Chromosomal distribution studies speculated that the expansion of the cotton CAMTAs arose from a segmental duplication (Fig. 4a-c) and purifying selection predominated across the duplicated CAMTA genes. Purifying selection possibly eliminates deleterious loss-of-function mutations, fixing a new duplicate gene and improving functional alleles at both duplicate loci 43 . Recent duplication events in cotton CAMTAs implied their morphological, ecological and physiological diversification 44 . G. arboreum and G. raimondii diverged 2-13 MYA and recombined to form G. hirsutum 1-2 MYA 29,31 . The duplication time of GaCAMTAs (13.02-15.03 MYA), GrCAMTAs (12.08-13.68 MYA) and GhCAMTAs (13.04 MYA), suggests that duplication events in cotton CAMTA families were more ancient than that of both diploid species divergence and polyploid formation. This duplication may assist to the unique functions of CAMTA in cotton i.e., cotton fiber development. The duplication time of GaCAMTAs and GrCAMTAs was around 13.4 MYA, which possibly occurred after their divergence from T. cacao (33MYA) 29 and Arabidopsis (93 MYA) 45 but before the reunion of A and D genome diploids that resulted in allotetraploid cotton 31 (Table 2).
Cotton CAMTAs were classified into seven subfamilies (I-VII) based on their phylogenetic relationship, gene structure and motif distribution pattern (Fig. 5a-c). Subfamily I and IV indicate that counterpart of CAMTA protein in A T (GhCAMTA2A.2/GhCAMT7A) and D T (GhCAMTA2D.1/GhCAMTA7D) subgenome of AD genome, comes from both the progenitor genome, i.e., A genome (GaCAMTA2.2 /GaCAMTA7) and D genome (GrCAMTA2.2/GaCAMTA7), respectively. Cotton CAMTAs present in subfamily II shows that counterpart of CAMTA protein in A T subgenome (GhCAMTA2A.1) comes from A genome (GaCAMTA2.1) while ortholog of GrCAMTA2.1 gene had lost in AD genome during evolution. Subfamily III demonstrated that counterpart of CAMTA protein in AD genome (GhCAMTA3D.1) either duplicated or diverged from D genome (GrCAMTA3.1) and lost in G. arboreum. Similarly, GhCAMTA4D of D T subgenome diverged from GaCAMTA4 in subfamily V. In subfamily VI, GhCAMTA5D.1 diverged from GrCAMTA5.3 and counterpart of GaCAMTA5.2 protein was lost in D genome. Subfamily VII suggests that counterpart of CAMTA protein of A (GaCAMTA5.1) and D (GrCAMTA5.1) genome were absent in AD genome (Fig. 5a). The CAMTA genes in the same subfamily had the similar gene and motif structure representing their similar subfamily-specific function (Fig. 5b,c). The highly conserved sequences of CAMTAs within the same group demonstrated that they were subject to duplication during evolution.
Fiber-specific expression analysis of GhCAMTAs showed that they expressed in fiber development stages but exhibited differential expression profiles (Fig. 6a). Expression profile of GhCAMTA2A.2 and GhCAMTA7A suggest that they express at initiation and SCW stages, respectively. They might be involved in regulating complex gene networks of fiber development and could be suitable targets for genetic engineering approaches aimed to improve cotton fiber. PCoEGs and NCoEGs with GhCAMTA2A. 2 46 and Cellulose synthase 49 protein families which have been previously reported to have imperative roles in the cotton fiber development 46,47,50 . It is important to note that CAMTAs regulate various stress and ROS response in plants 18 . ROS response, as well as redox state, is significant for the fiber development in cotton 51,52 . Thus, identifications of various genes belonging to ROS in positively and negatively co-expressed genes of GhCAMTA2A.2 and GhCAMTA7A further support the importance of these transcription factors in fiber development (Fig. 6g). Additionally, CAMTAs are known to interact with various phytohormones including ethylene 15 , ABA 19 , salicylic acid 21 , auxin 13 , and jasmonate 21 . These phytohormones also play critical roles in cotton fiber development 53,54 . Thus identification of genes belonging to these phytohormone categories in PCoEGs and NCoEGs (Fig. 6k) also support the role of GhCAMTA2A.2 and GhCAMTA7A in the fiber development. Finally, identification of transcription factors previously implicated in the fiber development such as MYB 55 , TCP 56 , NAC 57 and WRKY 58 in the PCoEGs and NCoEGs (Fig. 6j) substantiate the role of CAMTAs in cotton fiber development. We identified a significant enrichment of CAMTA-motifs in the promoters of PCoEGs and NCoEGs of GhCAMTA2A.2 and GhCAMTA7A (Fig. 7e,f). Our results suggest that CAMTA can act as both positive and negative regulator of gene expression. AtCAMTA3 is a positive regulator of CBF2 expression, and negative regulator of SA mediated immune response 5,20 . The higher conservation of CAMTA binding motif even in the NCoEGs is thus not surprising.
Since fiber strength is the key trait of fiber quality, the significant positive correlation between the expression of the two discussed CAMTAs and fiber strength suggest that these CAMTAs might be responsible for elite fiber qualities (Fig. 8). Previous studies demonstrated that the fiber strength commonly related to the strengthening of cell wall 59 . Interestingly, identification of cell wall related genes such as ABC transporter-like 47,60,61 , arabinogalactan peptide (AGP) 62,63 , alpha-1,4-glucan-protein synthase 64 , calcium-binding EF-hand protein 61 , cellulose synthase 63 , glutathione peroxidase 65 , glycoside hydrolase 66 , glycosyl transferase 67-69 , extensin 62 , UDP-glucuronosyl/ UDP-glucosyltransferase 70 , small GTPase superfamily 71 , kinesin 72 , leucine-rich repeat 61 , thaumatin 73 , tubulin 74 , and WD40 repeat 75 in PCoEGs and NCoEGs, emphasizing their potential roles in regulating cell wall integrity and thus fiber quality (Supplementary Dataset S8). However, the detailed molecular investigation is needed further to establish a link between CAMTAs and fiber quality traits.
This study has provided us evidence for an involvement of CAMTAs in cotton fiber development. However, more experimental exploration is needed to understand the structural-functional relationship of CAMTA family members in cotton and their involvement in fiber development. Multiple sequence alignment, classification and phylogenetic tree construction of CAMTA protein sequence. CAMTA protein sequences; from the 17 plant species known to have publicly available complete genome sequences; were extracted. The multiple sequence alignment of these protein sequences with identified cotton CAMTAs were carried out by cluster X program (http://www.clustal.org/) 80 with default parameters. These aligned sequences were used for the construction of the phylogenetic tree. MEGA 5.2 software (http://www.megasoftware.net/) 81 was employed to construct an unrooted phylogenetic tree using ML method with the following parameters: JTT model, pairwise gap deletion, and 1,000 iterations were used to calculate bootstrap values. CAMTA gene family members were classified based on Arabidopsis nomenclature by using phylogenetic approach. Additionally, a separate phylogenetic tree was constructed with all the CAMTA protein sequences of G. arboreum, G. raimondii, and G. hirsutum for further analysis.

Chromosomal location and gene duplication analysis. The precise physical positions of all Gossypium
CAMTA genes on chromosomes were obtained through BLASTN search against the Cotton Genome project and Phytozome databases. All Gossypium CAMTA genes were mapped on the chromosome using Mapinspect software.
The paralogous CAMTA genes were identified in G. arboreum, G. raimondii, and G. hirsutum by using reciprocal blast with e-value < 10 −5 to understand the evolutionary mechanism of CAMTA gene family in Gossypium species. Paralogs were defined by shared aligned region covering > 70% of the longer sequence and the similarity of the aligned regions >70% 82 . Also, Ka/Ks analysis of orthologs and paralogs sequences was performed by using PAL2NAL and Codeml program 83 , which was further used to calculate the approximate date of duplication and divergence events with the formula T = Ks/2λ, assuming clock-like rate (λ) of 1.5 synonymous substitutions per 10 −8 years for cotton 84,85 . Additionally, the Ka/Ks ratio was used to show the selection pressure for the duplicated CAMTA genes. A Ka/Ks ratio <1, >1 and =1 indicates negative (purifying selection), positive, and neutral evolution, respectively 86  Gene structure and conserved motif analysis of cotton CAMTAs. The gene structures of each identified CAMTA genes were obtained by comparing predicted CAMTA coding sequences with their corresponding genomic sequences using GSDS online tool (Gene Structure Display Server; http://gsds.cbi.pku.edu.cn/) 88 .
Online MEME tool (http://meme.nbcr.net/meme/) 89 was used for identification of conserved protein motifs in the CAMTA protein sequences. The following parameters were used: zero or one per sequence; the optimum width from 6 to 300; maximum number of motifs to find 20. Further, these motifs were annotated by using an Interproscan program. Expression profile of cotton CAMTA genes during different fiber developmental stages. The stage-specific expression pattern of cotton CAMTA genes was analyzed by using our previously reported microarray profiling data of G. hirsutum at various fiber developmental stages such as initiation (0DPA), elongation (6,9,12DPA) and SCW (19 & 25 DPA) 38 . Coding sequences of 9 GhCAMTA genes were subjected to reciprocal blast with Affymetrix cotton chip. The average normalized intensity values of these 6 GhCAMTA genes from microarray data were utilized for generating the box plot. ggplot2 (https://cran.r-project.org/web/packages/ggplot2/) package in R version 3.1.3 was used to construct the box plot.
Co-expression network analysis of GhCAMTA2A.2 and GhCAMTA7A. The high-throughput RNA-sequencing (RNA-seq) data of G. hirsutum in different fiber developmental stages at 0, 5, 10, 20, and 25 DPA were downloaded from the National Center for Biotechnology Information Short Read Archive (http:// www.ncbi.nlm.nih.gov/sra/) with the accession numbers SRX797909, SRX797917, SRX797918, SRX797919, and SRX797920, respectively 31 . Reads from transcriptome dataset (0, 5, 10, 20, and 25 DPA) were mapped on G. hirsutum genome using the STAR aligner 90 (version 2.5.3a) with default parameters separately. Assembly of data and transcript abundance of each gene was calculated by the fragments per kilobase of exon model per million mapped reads (FPKM) with Cufflinks software (http://cufflnks.cbcb.umd.edu/). The Log 2 FPKM values were used for generating gene coexpression network using the "Expression Correlation Networks" (http://apps.cytoscape.org/apps/expressioncorrelation) plugins in Cytoscape version 2.8.1. This plugin calculates positive Pearson correlation (default r ≥ 0.95) as well as "anti-correlation" or negative Pearson correlation (default r ≤ −0.95) among the interacting members of a network. Furthermore, network visualization was carried out in Cytoscape by applying the force-directed layout, where nodes (circles) in a network represent genes and edges (links) represent a significant interaction between the expression levels of genes across all fiber developmental stages (gene correlation network).
Pathway analysis of positively and negatively co-expressed genes with GhCAMTA2A.2 and GhCAMTA7A. MapMan software version 3.5.1 (http://gabi.rzpd.de/projects/MapMan/) was used for identification of significant functional categories or metabolic pathways of positively and negatively co-expressed genes with GhCAMTA2A.2 and GhCAMTA7A. To identify functional categories (BINSs, subBINs) enriched in these genes; average statistical test followed by the Benjamini Hochberg (multiple correction test) was used.
RNA isolation and real-time quantitative RT-PCR. The primers used in the real-time analysis were designed from CDS sequences of respective genes. Reference gene, Ubiquitin and respective genes primers were designed using Primer Express ® Software v2.0 (Applied Biosystems, USA). The total RNA was isolated from different stages of cotton fiber development (0 DPA, 6 DPA, 9 DPA, 12 DPA, 19 DPA and 25 DPA) using SIGMA Spectrum plant total RNA kit following the manufacturer's protocol. After DNase (Ambion) treatment, the integrity of RNA was checked by electrophoresis and the RNA was quantified for cDNA synthesis on NanoDrop ND-1000 Spectrophotometer. One μ g of total RNA was used for first-strand cDNA synthesis using the Superscript II RT kit (Invitrogen) following the manufacturer's instructions. The real-time PCR was performed employing 7500 Real-Time PCR System (Applied Biosystems, USA). PCR cycles 95 °C for 10 sec followed by 35 cycles of 95 °C for 10 sec and 60 °C for 20 sec were performed in 96-well optical reaction plates (Applied Biosystems). The specificity of the amplicon was assesses by its melting curve after 35 cycles at 60-90 °C. The relative gene expression levels were calculated in terms of comparative fold change following 2 −ΔΔct method. Statistical analysis was carried on two biological replicates (three technical replicates per biological sample) for each fiber development stage mentioned earlier. The list of primers used in qRT-PCR is given in Supplementary  Table S3.
Genetic material and fiber quality measurement. In the present study, 67 upland cotton (G. hirsutum) genotypes were utilized which were made available from Tierra Seed Science Pvt. Ltd, Hyderabad, India. The experiment was laid out in Random block design with three replications, each with 20 plants. All traditional agronomic practices were applied during the plant growing seasons. Five plants were randomly selected from each plot and data on following 6 traits were recorded on 67 genotypes-(i) fiber strength (FS) (ii) fiber length (FL), (iii) micronair (MIC), (iv) uniformity ratio (UR) (v) boll number (BN), and (vi) boll weight (BW). Fiber quality in terms of FL, UR, MIC and FS were estimated at Central Institute for Research on Cotton Technology (CIRCOT), Mumbai, India. The quantitative real-time PCR (qRT-PCR) with GhCAMTA2A.2, GhCAMTA7A and GhCAMTA3D.1was performed in same 67 genotypes of G. hirsutum at 0 DPA. Pearson's correlation analysis was performed to calculate the correlation coefficient (r) between expression of these CAMTAs and above mentioned traits using R software.