De novo assembly of Agave sisalana transcriptome in response to drought stress provides insight into the tolerance mechanisms

Agave, monocotyledonous succulent plants, is endemic to arid regions of North America, exhibiting exceptional tolerance to their xeric environments. They employ various strategies to overcome environmental constraints, such as crassulacean acid metabolism, wax depositions, and protective leaf morphology. Genomic resources of Agave species have received little attention irrespective of their cultural, economic and ecological importance, which so far prevented the understanding of the molecular bases underlying their adaptations to the arid environment. In this study, we aimed to elucidate molecular mechanism(s) using transcriptome sequencing of A. sisalana. A de novo approach was applied to assemble paired-end reads. The expression study unveiled 3,095 differentially expressed unigenes between well-irrigated and drought-stressed leaf samples. Gene ontology and KEGG analysis specified a significant number of abiotic stress responsive genes and pathways involved in processes like hormonal responses, antioxidant activity, response to stress stimuli, wax biosynthesis, and ROS metabolism. We also identified transcripts belonging to several families harboring important drought-responsive genes. Our study provides the first insight into the genomic structure of A. sisalana underlying adaptations to drought stress, thus providing diverse genetic resources for drought tolerance breeding research.


RNA-Seq data overview.
To explore the drought tolerance mechanism(s) at the molecular level, we sequenced and analyzed the leaf specific transcriptome of A. sisalana by mRNA sequencing. Six paired-end cDNA libraries were generated from three well irrigated (control: C1, C2, C3) and from three droughts stressed (drought: T1, T2, T3) independent biological samples. The Illumina sequencing platform Hiseq2500 was used for paired-end sequencing at Macrogen Korea with the insert size 101 bp. A total of 276,845,790 reads and 27,961,424,790 nucleotides were sequenced (Supplementary Information 1a) ( Table 1). The data of individual biological library were deposited to NCBI SRA database with SRA accession IDs: SRR5137659, SRR5137661, SRR5137662, SRR5137658, SRR5137663, and SRR5137660. Supplementary Information 1b represents the complete workflow and experimental design.
Transcriptome De novo assembly and evaluation statistics. De novo assembly is an efficient and comprehensive way for the discovery of novel transcripts, their expression behavior and new markers in the absence of the whole genome sequencing data. Length distribution pattern of transcripts produced by assemblers

Contents
Control Library (C) Drought Library (T) Total (see materials and methods) was generated against the well-annotated Ananas comosus CDS (https://phytozome. jgi.doe.gov) and transcriptome based A. deserti reference sequences (http://datadryad.org/). Trinity generated assembly correlates closest to the reference's distribution followed by the Trans-ABySS (64 K-mer) and Short Oligonucleotide Analysis Package (SOAP) (Supplementary Dataset 1 S1, S2). The results may vary from one dataset to others, and so the user should optimize their own preferences/settings according to the data type. We have successfully assembled the 276. 8 26 . Almost 95% of the reads were mapped back to transcriptome by bowtie2 (RMBT) with 83% completeness while duplication percentage was ~21% as by BUSCO analysis (Fig. 1B,C). This intermediate to high duplication level may be due to the higher polyploidy level of the Agave genome. All these indicators supported that we have generated a high-quality transcriptome assembly that could be used for further possible downstream analysis.  . These unigenes were further categorized into five diverse functional groups, namely metabolism (93.4%), the organismal system (4.5%), environmental information processing (0.72%), genetic information processing and cellular processes (0.78%) (Fig. 3). The diverse metabolism category had 5876 unigenes, most of which were involved in nucleotide metabolism (21.05%), carbohydrate metabolism (16.9%), metabolism of cofactor and vitamins (14.4%), amino acid metabolism (10.94%), global and overview maps (8.1%) and other eight subcategories (28.54%). Purine and pyrimidine metabolism were the core group in nucleotide metabolism and treated as the housekeeping function within the plant kingdom. Evidence suggests that they involved in the stress protection to abiotic stress tolerance via activation of the ABA metabolism pathway 27 . In the biosynthesis of secondary metabolites, the most frequent subsets of sequences were Phenylpropanoid biosynthesis (37.06%), Tropane, piperidine and pyridine alkaloid biosynthesis (14.6%) and Novobiocin biosynthesis (14.3%) (Supplementary Dataset 2-S4). Transcription factors are the main upstream regulatory elements that control the gene expression of sessile nature plants through specific binding to the cis-regulatory elements present in the promoter regions. We predicted, 12,676 transcription factors from the unigenes database and their annotation was retrieved from the PlantTFDB. The major families were associated with the bHLH (9.93%) group, followed by the NAC family (7.02%), MYB related group (6.8%), ERF family (5.6%), C2H2 group (5.09%), WRKY (4.33%), FAR1 (4.07%), C3H (3.9%), MYB group (3.6%) (Fig. 4A). All these are considered to be involved in the regulation of metabolic and secondary metabolic biosynthesis in the green plants 22,25,28 . Heat Shock protein annotation was retrieved based on the Heat Shock Protein information resource database (http://pdslab.biochem.iisc.ernet.in/hspir/) (Fig. 4B).

RNA-Sequencing Statistics
Drought responsive transcripts identification and GO tagging. To investigate the differential gene expression among control and drought group, the bioconductor package edgeR was used on the read counts data that was generated by RSEM package. In total 3095 differentially expressed unigenes (DEG) significantly differed between normal and drought conditions with ≥1-fold expression (log2-fold change)| and FDR less  Carbohydrates transport (16.06%), posttranslational modification (13.5%), chaperones general function prediction only (9.9%), lipid transport and metabolism (8.8%), signal transduction mechanisms (7.7%) were the most frequent categories. Enriched GO terms specific to drought stress were also identified with Singular Enrichment Analysis (SEA) at 0.05 significance interval (    (46), HSFs (33) and others (Fig. 6A). Heat shock proteins (Hsps) are classified into five major categories based on molecular mass. The differential expression of genes within these categories was calculated based on the fold change. Collectively145 differentially expressed HSP genes were identified, and 100 among them were up-regulated (Fig. 6B). Small heat shock family (HSP20) was the major DE group found in this study followed by the HSP70, HSP100 and HSP90 group. We also identified twenty-nine significantly DE unigenes related to the cytochrome (CYP) gene family, while 75 were related to photosynthesis and light reaction function as revealed by fold change analysis. All of them were down-regulated under drought stress including CAB1 (chlorophyll A/B binding protein 1 and 6), LHB1B1 light-harvesting chlorophyll-protein complex II subunit B1, PSAD -2, PSAF, PSAG, PSAH2, PSAK, PSAL, PSAN (involved in photosystem I), PSB group with subunits (components of photosystem II) and others related to ATPase synthesis (Supplementary Dataset 4-S2).
SSR and SNP detection. The high-throughput transcriptome sequencing provides excellent resources toward the discovery of cost-effective and polymorphic genetic markers (SSRs, SNPs, Indel). We identified total 13,375 SSR markers by using MISA tool within 12,279 unigenes in A. sisalana transcriptome (Supplementary Dataset 5-S1). The average density of microsatellites was found to be one SSR per2.9 kb. Based on the motif repetition, these microsatellites were further categorized into mononucleotide (5318), followed by di-(4347), tri-(3544), tetra-(97), Penta-(37) and hexanucleotides motifs (32), while about 1096 were present in the compound formation (Table 2). (A/T)n motif was the dominant for mononucleotide, (GA/CT/AG) n for di-while (TGC/ GAG) n for trinucleotide microsatellites. Specific primers were designed for these SSRs by using Primer3 software and 8164 SSR was verified for a single amplification by in silico PCR with 100-280bp product size (Supplementary Dataset 5-S2). SNPs endure the ability to produces high-density genetic maps, association mapping and molecular markers with the promise of lower cost and error rate. In this study, putative variants were called by aligning the raw reads with the non-redundant de novo assembled reference database. In total 36,525 high confidence variants position were identified includes 35,059 and 1466 for SNPs and indels respectively in 17363 unigenes (Supplementary Dataset 6-S1). An average frequency between all the SNPs in these unigenes was 330 bp. The paleopolyploid nature of A. sisalana may increase the possibility of high counts of SNPs due to the identical paralogous loci in the genome. Large proportion of the unigenes (9576) had the single base shift than di-(3470), tri-(1901) and tetra-(1051), that accounted 27.3%, 9.8%, 5.4% and 2.9% respectively (Supplementary Dataset 6-S2).
Transitions and transversion frequencies including six variations are listed in Table 3. The transition between A and G happen most frequently than any other variation. Validation of these SNPs will be required but their annotation indicates potential polymorphism in drought-regulated transcripts.
Validation of the RNA-Seq data. To confirm the reliability of expression data, 20 DEGs were studied by using the quantitative real-time PCR (qRT-PCR). The results showed almost same level of fold changes between RNA-Seq expression and qRT-PCR analyses (Fig. 7).

Discussion
Insight into the de novo transcriptome assembly and sequence annotation. Drought tolerance is a multi-pronged mechanism orchestrated by a complex set of gene actions in plants. Its understanding requires a comprehensive approach to explore gene expression and physiological and biochemical pathways. To investigate the dynamic variation of A. sisalana transcriptome to drought conditions, we employed RNA-seq approach using the Illumina platform. We applied ninety days of drought stress which is considered of sufficient length to activate the plant transcriptome under stress as reported in various studies 12,31 . We observed 18.5% fewer numbers of reads from drought-stressed RNA libraries compared to control data, giving rise to the hypothesis that the A. sisalana genome compromised normal growth processes during the drought with significant up-regulation of signaling and regulatory proteins. In total, 90.3% clean reads were assembled into 67,328 non-redundant unigenes that permitted gene annotation and association of transcripts with biological functions. In addition to functional assignment, the high similarity of unigenes to other plant protein sequences also confirms their integrity 32 . With BLAST search, 37,546 unigenes out of 67,328 could be functionally assigned with sequences in the nr database, confirming the reliability of the assembly. Insufficient information in public databases and high sequence variations in agave species could arguably be the reasons underlying the high number of un-assigned sequence in A. sisalana genome. GO-based providing information on biological roles 33 of transcripts under drought stress was deciding to identify the drought tolerance mechanisms, including the discovery of novel drought stress-related genes in A. sisalana 33 . Based on differential expression analysis, a number of genes and pathways interactions have been categorized into functional and regulatory groups.

Induction of Functional proteins in drought response. Heat Shock Proteins (HSPs).
The plant's capability to resist environmental strains is central to development. Protein dysfunctioning is a routine event under the abiotic stress and it is extremely crucial to keep protein functional under the stress. The trigger of the heat shock proteins is the most prominent response to depreciate the cellular injuries and reestablishment of cellular homeostasis. Categorization of these proteins is based on their approximate molecular weight (Hsp100, Hsp90, Hsp70, Hsp60, and HSP20, the small Hsp (sHsp) families) 34 . In this study, fifty-three heat shock proteins unigenes from six major families includes the sHSP20 group members (HSP17. 4   toxic substances. More than 300-fold expression of small heat shock proteins was observed in S. oleracea under heat stress 25 . In Arabidopsis, overexpression of GmHsp90s family from Glycine max act as a damage control agent under the abiotic stress 35 . Antioxidants response and Osmotic adjustments to dehydration. Over-production of ROS is extremely harmful to plants as it causes lipid peroxidation, DNA damage, and programmed cell death 36 . The antioxidant enzymes constitute the "first line of defense" against these damages. Here in the study, induced expression of enzymatic and non-enzymatic scavenging molecules indicates the active protection shield against the oxidative stress in A. sisalan leaves. The enriched GO categories like "response to abiotic stimulus" (GO:0009628) and "response to stress" (GO:0009628) give us a strong clue about the active antioxidant enzyme mechanism. Sixteen unigenes were associated in enzymatic scavenging include catalase (CAT1, CAT2), ascorbate peroxidase (APX2, APX4, TL29), peroxidase (PER64, PAP10) and glutathione (Supplementary Dataset 4 S6). Two unigenes (DN17768_c0_g1_i2 and DN19391_c0_g2_i2) encodes the ascorbate peroxidase 2 (APX2) and ascorbate peroxidase 4 (TL29) groups, while others four were homolog to AT1G71695 (Peroxidase superfamily protein), AT2G41480 (PRX25), AT5G66390 (PRX72), and AT4G33420 (PRX47). Ascorbate peroxidase has a significant role in the ascorbate-glutathione detoxification system. GST (ec:2.5.1.18) is critical in glutathione metabolism and is considered as an important indicator for improving the tolerance capability of rice and Arabidopsis 37 . Genes that encode enzymes (ec:1.8.5.1), glutathione dehydrogenase (ascorbate) and (ec:2.5.1.18) glutathione S-transferase taking part in glutathione metabolism were also detected. Heavy metals accumulation in plants is highly reactive and lethal to living cells. The detoxification transporters and their proteins are well known to detoxify the heavy metals into vacuole of cells and maintain them in a balanced amount 38 . Total sixteen genes were identified, 6 were associated with the pleiotropic drug resistance-type ATP-binding protein-PDR (PDR4, PDR5, PDR12), two tonoplasts based heavy metal ATPase 2 (HMA2), one was related to farnesylated protein 6 (FP6), others included heavy metal-associated isoprenylated plant protein (HIPP22, HIPP27). In A. thaliana, members of HIPP family involved in cadmium transport play a role in cadmium detoxification 36,39 .
Osmolytes are the nontoxic small compounds that are synthesized and accumulated in plants under abiotic stress. These include non-toxic macromolecules; organic compounds, sugars, sugar alcohols, starch, lipid peroxidase and free proline 40 . We also observed significant activities in the metabolism of sugar and starch (non-sugar) related enzymes. The involved pathway was also enriched, "ko00500" with involvement of seven upregulated transcripts. The associated enzymes within this pathway were (ec:3.2.1.21) beta-glucosidase/gentiobiase, (ec:3.2.1.2) saccharogen amylase/beta-amylase, (ec:3.1.3.12) trehalose-6-phosphatase/trehalose-6-phosphate phosphohydrolase and (ec:3.2.1.48) sucrose sucrase/alpha-glucosidase. Trehalose 6-phosphatase (TPP/TPS) is a key player in osmoregulation which strengthens the tolerance in plants to the drought stress 25,41 . We noted up-regulated six genes "TPS2, TPS3, and TPS6" encoding the trehalose enzymes. Enzyme phosphorylase (ec: 2.4.1.1) was also altered that take part in the decomposition of non-sugar molecules under drought stress. Other enzymes like "(ec:3.1.1.11) -pectinesterase/pectin-demethylase" and "(ec The ABC transporter G subfamily has been reported to be involved in the export of mature fatty acids in A. thaliana 45,46 (Supplementary Dataset 4-S8). The involvement of ERF/AP2 has been extensively reported in the cuticle biosynthesis, especially regulation, accumulation, and transport in response to the abiotic stresses 23 . MYB TFs are also characterized for their role in the cuticle metabolism 44 . These factors in combination with other regulatory genes in A. sisalana may act as the coordinator for leaf cuticle synthesis. Identification of these wax related genes would assist further to understand the biosynthesis and functions of the cuticular wax under drought stress. within the cells. Proteins and receptor kinases are the sensors on the membrane of the cell that perceive extracellular signals and transmit them to target genes for the activation of specific stress response. Abundance of the kinases is expected as their domain is actively involved in a number of cellular processes. In this study, seventy-eight significantly differentially expressed transcripts were associated with the protein and receptor kinase group under drought stress conditions. Majority of them belong to PK and RLK superfamily, like Leucine-rich repeat protein kinase and Leucine-rich receptor-like protein kinase family protein  Dataset 4-S4). Leucine-rich receptor-like kinases (RLKs) are one of the major group that managed the meristem proliferation, reproduction, organ initiation, specification and hormonal signal cascade. There are several reports that revealed their response towards the drought tolerance e.g. in Arabidopsis abrupt increased of RLKs was observed towards the osmotic stress 47 . The abrupt increase in the calcium ions happens in plants under abiotic stress conditions, is a sign of activation of the stress-responsive cascades. In this study, nine significantly enriched unigenes that belong to the calcium transport signaling group includes, calcium ion binding protein (SUB, SUB1), calcium exchanger (CAX3, CAX5, and CAX7) and tonoplast calcium sensor (CBL3) have been identified (Supplementary Dataset 4-S3). The induced response of these proteins under drought stress stabilized the structural rigidity of cell wall. The CAX group of genes has been discovered in a number of plant species and act ubiquitously as they regulate the tonoplast localized Ca 2+/H + antiport activities. Furthermore, the interaction among different protein phosphatases like HAI2, HAI3 and kinases such as serine/threonine-protein kinase (NEK5), CBL-interacting protein kinase initiated the protein phosphorylation cascade which take part in cell signal recognition and transduction in the responses to abiotic stress 48 . SNF1-related protein kinase 2.1 (SNRK2.1) also act as a positive regulator of the hormonal (ABA) signaling. In A. thaliana complex association between the Calcineurin like proteins (CBL4/CIPK) are associated with the sodium ions release from the cells and absorption of K + by the root surface, that regulates the stomatal behavior under osmotic stress 49 .

Signaling and Regulatory Proteins Response to Drought Stress. Ca
Phytohormones pathways gene to drought stress. To combat various environmental stresses, novel and dynamic approaches should be devised, and phytohormone engineering could be a method of choice to improve the productivity including drought resistance 50 . Plant hormones improve the resistance to osmotic stress by regulating the physiological process. The abscisic acid (ABA) is a key plant growth regulator that directly involved in the responses to abiotic stresses 50 . Here, we noted twenty-three up-regulated unigenes related to ABA-induced protein phosphatase 2 and 3 (HAI2 and HAI3) (ec:3.1.3.16), one protein phosphatase 2CA group (PP2CA) (ec:3.1.3.16), homolog of ABI2 (HAB, HAB2), while four with protein phosphatase 2C families (ABI1) that were homologous to AT3G62260, AT3G63320, AT1G18030, AT3G12620 IDs. A higher number of up regulations of the ABA encodes unigenes including ABA receptor family (PYL4) is an indication of accumulation of ABA due to the decreased cellular water contents under drought stress (Supplementary Dataset 4-S9). Protein phosphatases are the chief regulators and are considered to mediate the ABA triggered signaling pathways. Induced PP2C and PP3C (Protein phosphatases) level in association with the ABA pathway indicated its hyper response to the drought stress in A. sisalana, which is a conserved mechanism in the metabolism of ABA. The differential expression of these genes may regulate the guard's cell of stomata for gaseous exchange and activation of ABA-dependent regulatory elements, such as MYB factors.
Auxin biosynthesis and transport are essential in regulating the response to environmental stresses, including drought, salinity, and pathogen attack. Changes in Indole-3-acetic acid (IAA) biosynthesis in response to external stimulus regulate the stomatal closure via cross-talk with other plant hormones like ABA and others. The IAA mutant plant of Arabidopsis exhibited significant induced water loss than the normal plants 51 . In this study, gene enrichment analysis showed the number of genes contributing to the growth under drought stress related to auxin hormones including auxin-induced protein (IAA13, IAA16, IAA33), auxin response factor (ARF- 1,9,11,19) that were homologous to AT1G19220, AT4G23980, AT1G59750, AT2G46530, GH3 and 4, auxin efflux carrier family protein (PIN1and EIR1), like-LAX2 related gene, auxin-responsive factor AUX/IAA-like protein (NPH4) and auxin binding ABP like proteins (Supplementary Dataset 4-S10). Several positively regulated induced auxins genes is an indication of there important role in A. sisalana against drought stress.
We also observed thirty-seven DE unigenes related to the cytochrome p450s gene family (Supplementary Dataset 4-S11). Cytochrome is one of the largest and central superfamilies in plants, so far encoding about 1% of the protein coding sequences that act in hormonal control mechanisms including biosynthesis and catabolism of primary and secondary metabolites 52 . Several members of this group like CYP71 are known to catalyze the production of aliphatic and aromatic nitriles suggesting their possible role in the defense to the biotic stress 53,54 . Members of CYP86, CYP94, CYP96, and CYP704 are also known as candidates for cuticle biosynthesis 55,56 . The detection of these cytochromes in our dataset may indicate their potential role for cuticle biosynthetic . The promoter region of various cytochrome genes has the affinity for the drought-induced TF includes MYB/MYC, TGA, and W-box for the WRKY. The appearance of high number unigenes associated with these TFs and the CYPs-450 might be a strategy to combat stress. Biosynthesis of jasmonic acid (JA) and Brassinosteroids (BR) hormones are also stressed sensitive. In our data, we noted two differentially expressed genes involved in the alpha-Linolenic acid metabolism that regulate the JA biosynthesis 57 . There are several reports that confirmed their involvements to improve the stress tolerance ability of drought-tolerant cultivars 58,59 . The transcription factor-like MYC2 is a key regulator of JA response and their upregulation in this study indicates its regulatory role in this process and act as a mediator in cross-talk along with WRKY and MYB TFs. Transcriptional regulatory network induced a response to drought stress. TFs are the key regulatory switches that directly regulate the signal transduction pathways 60 . In eukaryotes, especially in plants, TFs are highly conserved and represented by various multigene families to perform specific functions. The number of genes encoding these families may vary due to origin, expansion, and tissue-specific functions. In the current study, 372 transcription factors belonging to the ERF (E2F3) family, bHLH, NAC, HSF, MYB and Zinc finger-like protein and others were found to be differentially upregulated under drought stress (Supplementary Dataset 6 S1). In addition, we also found two transcription factors of the GRAS family, PAT1, and SCL7 (homologs of AT5G41920). GRAS plays a critical function in plant growth and environmental adaptations, especially in the modulation of plant tolerance to stress 61

Conclusion
To the best of our knowledge, here we reported the first transcriptome study of Agave sisalana with the objective to identify the functional genes associated with drought tolerance. 67328 unigenes were de novo assembled, and 37546 were functionally annotated. Further differential gene expression provides the clear understanding of responsive mechanism to drought stress. In addition, the identified genetic marker will provide the source for marker development in this species. This study may not only provide the insights to genomics of adaptation of drought tolerance in agave but also excellent genetic resources for drought tolerance crop development.

Materials and Methods
Plant material, stress conditions, and tissue sampling. The offshoots of similar age/height from 1-year-old mature adult A. sisalana plants, which were asexually propagated from a single "the mother plant, " were used for the current study ( Supplementary Information 3-S1). These offshoots were further grown in pots (one per pot) having the soil mixture of peat moss, vermiculite and sandy soil in the ratio (1:1:1) in the greenhouse. After 90 days of propagation, we divided these plants into two groups randomly with three replicates each. A control group (C); watered regularly while the other treated group (T); no water was applied until the leaf sampling. The newly emerged middle leaves of the A. sisalana rosette (Supplementary Information 3-S2) were harvested from each plant of the group, immersed into liquid nitrogen, ground well to a very fine powder and stored at −80 °C till further molecular investigation.
RNA isolation, cDNA library preparation, and Illumina sequencing. Total RNA was isolated from the stored ground leaves by using the Trizol method with column based purification as described by 34  blast2go.com) with value 1e −3 , annotation cutoff filter 55, code set to 0.8 to assign the GO terms to each transcript regarding molecular functions (MFs), biological processes (BPs), and cellular components (CCs). GO enrichment analysis was carried out by AgriGO software with FDR value not less than <0.05.
Transcript count and differentially expressed gene identification. First, the abundance of each transcript was calculated by bowtie2 and RSEM (RNA-Seq by Expectation-Maximization-http://deweylab.github. io/RSEM/package) for each library. Differentially expressed genes (DEGs) among the drought and control treated libraries were calculated by using the Empirical Analysis of Digital Gene Expression (edgeR) (http://bioconductor.org/packages/release/bioc/html/edgeR.html) statistical package. The trimmed mean of M-values (TMM) method was used to calculate the normalization factors. The threshold p-value < 0.05 and false discovery rate (FDR) < 0.001 was adjusted to identify the differentially expressed genes by fold change ( ≥ 1).
Quantitative RT-qPCR validation. We randomly selected 10 annotated DEGs to verify the RNA-Seq expression data. Gene-specific primers were designed from the selected unigene sequences using Primer 6.0 software (http://www.premierbiosoft.com/primerdesign/) ( Supplementary Information 4). Relative fold expression (RT-qPCR) was carried out on the IQ5 system (BioRad) by using the SYBR ® Green PCR Master Mix (cat#4309155). Thermal settings included the following conditions: 95 °C for 3 min, followed by 40 cycles at 95 °C for 30 s, then 60 °C for 30 s and at 72 °C for 30 s. All this study was carried out on three independent biological and technical replicates. Relative Expression Software Tool (REST) (http://www.gene-quantification.com/rest-2009. html) was used for relative fold expression calculation.