Genome-wide atlas of alternative polyadenylation in the forage legume red clover

Studies on prevalence and significance of alternative polyadenylation (APA) in plants have been so far limited mostly to the model plants. Here, a genome-wide analysis of APA was carried out in different tissue types in the non-model forage legume red clover (Trifolium pratense L). A profile of poly(A) sites in different tissue types was generated using so-called ‘poly(A)-tag sequencing’ (PATseq) approach. Our analysis revealed tissue-wise dynamics of usage of poly(A) sites located at different genomic locations. We also identified poly(A) sites and underlying genes displaying APA in different tissues. Functional categories enriched in groups of genes manifesting APA between tissue types were determined. Analysis of spatial expression of genes encoding different poly(A) factors showed significant differential expression of genes encoding orthologs of FIP1(V) and PCFS4, suggesting that these two factors may play a role in regulating spatial APA in red clover. Our analysis also revealed a high degree of conservation in diverse plant species of APA events in mRNAs encoding two key polyadenylation factors, CPSF30 and FIP1(V). Together with our previously reported study of spatial gene expression in red clover, this study will provide a comprehensive account of transcriptome dynamics in this non-model forage legume.

from use of distal PASs. This pattern of preferential usage of proximal or distal PASs by the mRNA isoforms in testis, and brain is also found to be conserved in Drosophila 8,38,39,41 . In plants, occurrences of tissue-specific APA so far have been demonstrated only in model plants Arabidopsis and rice 12,15,42 . Here, we report the genome-wide tissue-specific atlas of APA in red clover (Trifolium pratense L.), a cool season forage legume. Red clover is the second most widely grown forage legume in the United States, after alfalfa, and is considered a high-value feed for livestock because of its digestibility, and protein content 43 . Being a nitrogen-fixing forage crop, red clover has a great potential in sustainable agriculture 44,45 . However, genomic resources for red clover are scarce, save for a genome sequence and a few transcriptome studies [46][47][48] . Together with the spatial gene expression analysis, the tissue-wide global APA analysis provides a comprehensive understanding of transcriptional dynamics, and of the role of APA in fine-tuning transcriptome plasticity in this important forage legume.

Results
PATseq library preparation, next generation sequencing, profiling of poly(A) sites and their validation. To decipher the spatial dynamics of genome-wide alternative polyadenylation in red clover, three different tissue types were studied. PAT libraries were sequenced using Illumina platform and after demultiplexing individual libraries, altogether 53.6 million reads were obtained. Next-generation sequencing data generated in this project was submitted to the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/sra) under the BioProject accession PRJNA412508. After trimming Illumina adapter sequences and poly(T) tract at the start of each PAT reads, roughly 49.5 million reads were retained. Trimmed PATs were mapped to the red clover genome, which resulted in approximately 17.5 million mapped reads with an average of 1.9 million mapped reads per library. A summary of mapped reads for each library is presented in the Supplemental Table S1.
For the genome-wide APA analysis, sequencing reads were trimmed to one nt tags and processed as described in Methods, yielding sets of poly(A) sites (PASs) and poly(A) site clusters (PACs). Slightly more than 93,000 PASs and approximately 28,000 PACs were defined by the mapped PATs; these are listed in Supplemental Dataset S1 and S2, respectively. The collection of PACs maps to 12,413 different annotated genes.
To further validate the PACs generated using the PATseq dataset, an independent RNAseq dataset [NCBI-SRA BioProject accession PRJNA287846] was used. Poly(A)-containing reads were extracted from the RNAseq dataset and these reads were used to generate a PAC list. PACs lists generated using the PATseq and RNAseq datasets were compared to assess overlap between the genomic coordinates of PACs in two datasets. As expected very few poly(A) tail-containing reads were extracted from the RNAseq dataset (Supplemental Table S2). However, using these reads, 241 PACs could be defined. Out of these 176 PACs overlap with the PACs generated using the PATseq approach (Supplemental Dataset S3). This analysis confirms that most of the PACs identified with PATs are valid.
As a further quality control step, gene-by-gene comparisons of poly(A) tag distributions were performed using a previously described tool, PATAPP 49 . PATAPP calculates fractional usage of individual PASs for each library, and subsequently, compares these values between two contrasting libraries on a gene-by-gene basis. A poly(A) metric of '0' and '1' define absolute similarity and dissimilarity, respectively. The PATAPP analysis also revealed close correspondence between different replicates of each tissue samples (Supplemental Fig. S1).
To further assess the reproducibility of the individual sequencing samples generated in this study, gene expression levels were estimated by determining the numbers of individual PATs that map to annotated genes, and the results were assessed using a correlation scatterplot (Supplemental Fig. S2). This plot showed a good correspondence between individual replicates from the three tissues, and revealed that the leaf and flower tissues were more similar to each other than they were to root tissues. Additionally, Pearson correlation coefficients between replicates of each tissue sample were found to be between 0.85-0.92, suggesting good correspondence among replicates of each tissue type (Supplemental Table S3). These results provide further demonstration of the utility of the sequencing data.

Distribution of canonical and non-canonical poly(A) sites in red clover.
To determine the genomic distribution of PASs and PACs, those PASs and PACs that mapped to the annotated regions of the red clover genome were counted. 68.9, 16.1, 3.4, and 9.8% of all PASs mapped to the extended 3′UTRs, protein coding regions (CDSs), 5′UTRs, and introns, respectively (Fig. 1A). Similarly, most of the PACs were mapped to the 3′UTRs, constituting about 58.6% of the total PACs, whereas, 21.3, 3.9, and 8.8% of all PACs were mapped to the CDSs, 5′UTRs, and introns, respectively (Fig. 1B). 4,695 genes, representing around 11.5% of all the red clover genes, were found to contain multiple PACs (Fig. 1C).
To decipher any spatial bias in the abundance of various mRNA isoforms, such as mRNA isoforms terminating at canonical poly(A) sites located at 3′UTRs or at non-canonical poly(A) sites (residing at other genomic locations, such as CDSs, introns and 5′UTRs), the distribution of PATs across different genomic locations in different tissue types were estimated (Fig. 1D). There was no obvious difference in the distribution of PATs across different genomic locations between leaf and flower tissues. However, distributions of PATs across various genomic locations differed in root tissue as compared to the leaf and flower tissues. Specifically, the proportion of PATs that mapped to the CDSs and 3′UTRs increased, while those that mapped to the intronic regions decreased in root tissue as compared to leaf and flower tissues (Fig. 1D).
Tissue-wise dynamics of poly(A) site usage. We have explored tissue-wise dynamics of usage of individual PACs located in different genomic locations. Poly(A) site usage was determined by the numbers of reads mapping to an individual PAC as a fraction of the total PATs that mapped to the associated gene. The difference in poly(A) site usage for each individual PAC between two tissue-types was then calculated. The aggregate differences for PACs located in different regions (3′UTRs, CDS, etc.) were then represented in boxplots (Fig. 2). The most striking difference in the poly(A) site usage was observed in poly(A) sites located at introns, where poly(A) site usage exhibited a decrease in root tissue as compared to the leaf and flower tissues, while no change was observed between leaf and flower tissues (Fig. 2). This result suggests that, in genes with multiple poly(A) sites, including at least one that lies within introns, usage of the intronic sites are somewhat lower than other sites.   sequences surrounding PASs located in 3′UTRs display typical patterns observed in other plant species, including U-rich upstream regions, an A-rich peak around −20 (termed as 'NUE'), a U-rich peak immediately following the A-rich peak, and YA at the cleavage site (where, Y = U or C) ( Fig. 3A-C) [12][13][14]50 . Single nucleotide compositions of the sequences around PASs residing at intronic regions also exhibited similar patterns observed in case of PASs located at 3′UTRs. Similar patterns were also observed in other plant species, including Arabidopsis, Medicago, and rice ( Fig. 3G-I) 13,14,49 . As opposed to the PASs located at 3′UTRs and introns, PASs positioned at the protein coding regions displayed distinctly different patterns, such as presence of (A + G)-rich region in case of PASs located at protein coding regions, instead of A-rich NUE and the following U-rich peak in PASs residing at 3′UTRs and introns ( Fig. 3D-F). Patterns observed in PASs located at protein coding in regions also resemble results previously reported in other plant species [12][13][14] .
The relatively high ' A' content surrounding CDS-situated PASs raises the possibility that the reads that define these sites might arise via internal priming by the reverse transcriptase. To test this possibility, the nucleotide profiles surrounding positions in the red clover genome consisting 8 or more ' A's (the most likely sites of internal priming) were determined, with a focus on sites located within 3′UTRs and CDS. As shown in Supplemental Fig. S3A,B, the nucleotide profiles of sequences surrounding such potential internal priming sites were completely different from the nucleotide profiles of PASs defined by PATs and located at 3′UTRs and coding regions ( Fig. 3A-F). This result indicates that the sites located in CDS, in particular, are likely not derived via internal priming.
Single nucleotide compositions around PASs positioned at 5′UTRs have only been reported in rice, where it resembles the nucleotide profile observed in PASs located at 3′UTRs 14 . Similar to the profiles described in rice, we also observed A and U-rich regions upstream of PASs, low G-content as compared to the PASs residing at protein coding regions. However, we do not observe A-rich peak at the NUE and the following U-rich peak described in rice ( Fig. 3J-L) 14 . This discrepancy may arise due to low number of PASs located at 5′UTRs in both studies.
Identification of PACs displaying differential usage between tissue types. To further delve into the spatial shifts in the poly(A) site usage, differentially utilized PACs and genes associated with such PACs were identified and characterized (Fig. 4A,B). Altogether, 792 PACs defined by 468 genes displayed tissue-wise alteration in poly(A) site usage. Among these differentially utilized PACs, 66.3% were present only in one tissue-wise comparison, whereas 33.7% PACs were present in two or three of the tissue-wise comparisons (Fig. 4B). Examples of APA events between tissue types are represented in the Fig. 5A-D. The set of genes displaying APA between tissue types included gene4087, which encodes a 'NAD(P)-binding rossmann-fold superfamily protein' . For the gene4087, the proximal, non-canonical poly(A) site was used in the leaf tissue, whereas, in the flower tissue the distal, the canonical poly(A) site was used (Fig. 5A). The gene35272, which encodes a LEA (Late Embryogenesis Abundant) protein, displayed APA between the leaf and root tissues. For the gene35272, the proximal non-canonical poly(A) site was used in the leaf tissue, whereas, the distal, the canonical site was used in the root tissue (Fig. 5B). The gene24502, which encodes a 'calcium-binding EF hand family protein' , exhibited APA between the leaf and flower tissues. For the gene24502, in the leaf tissue a proximal, non-canonical poly(A) site was used predominantly, whereas, in the flower tissue the distal, the canonical poly(A) site was used primarily (Fig. 5C). The gene38866, which encodes a 'major facilitator superfamily protein' displayed APA between the leaf and root tissues. For the gene38866, in the leaf tissue a proximal, non-canonical poly(A) site was used predominantly, whereas, in the root tissue mainly the distal, canonical poly(A) site was used (Fig. 5D). These APA events were also confirmed with a previously reported independent RNAseq dataset ( Fig. 5A-D) 48 .
Genes displaying tissue-wise differential poly(A) site usage were subjected to Gene Ontology (GO) analysis to elucidate enrichment for specific functional categories. Comparison of poly(A) site usage between leaf and flower tissue identified 323 PACs that mapped to 204 genes and exhibited differential poly(A) site usage (Fig. 4A,B, and Supplemental Dataset S4). GO analysis of this set of genes showed an over-representation of genes involved in photosynthesis, especially in the light reaction of photosynthesis, and genes implicated in responses to water deprivation (Fig. 6A). Similarly, differential poly(A) site usage analysis between leaf and root tissue identified 358 differentially utilized PACs that were in 197 genes (Fig. 4A,B, and Supplemental Dataset S5). GO analysis of this set of genes showed enrichment of genes implicated in the light reaction of photosynthesis, and genes involved in the process of generation of precursor metabolites and energy (Fig. 6B). Finally, the comparison of poly(A) site usage between root and flower tissues identified 390 PACs within 249 genes that showed differential poly(A) site usage (Fig. 4A,B, and Supplemental Dataset S6). GO analysis of this set of genes displayed over-representation of genes involved in ribosome biogenesis, and protein folding (Fig. 6C).
Tissue-wise expression and APA in genes encoding polyadenylation factors. Along with providing an account of genome-wide alternative polyadenylation, poly(A) tag sequencing (PATseq) can be a useful and reliable way for analyzing global gene expression [51][52][53][54] . Accordingly, we conducted a genome-wide gene expression analysis using PATseq dataset (see Methods). Next, we compared the genome-wide gene expressions profile estimated using the PATseq dataset with gene expression profiles generated using our previously reported RNAseq data (NCBI SRA BioProject accession PRJNA287876). For this comparison RNAseq and PATseq libraries were made from the same set of RNA samples. Pearson correlation coefficients for the gene expressions measurements using PATseq and RNAseq approach in leaf, root, and flower tissue were estimated to be 0.85, 0.84, and 0.80, respectively (Fig. 7). These high correlations between the gene expressions profiles obtained through two approaches suggest that PATseq provides a reasonably reliable account for the global gene expression.
To further extend our gene expression analysis using poly(A) tags, we assessed the spatial expression of several red clover genes encoding various polyadenylation factors. To this end, we identified orthologs of various polyadenylation factors in red clover, including subunits of CPSF (Cleavage and Polyadenylation Specificity Factor) and CstF (Cleavage Stimulation Factor), poly(A) polymerases, FIP, FY, and others. Next, we analyzed the tissue-wise expression of these genes using the PATseq dataset (Fig. 8). We do not see any significant difference in the tissue-wise expression for the most of the polyadenylation factors. However, the gene encoding FIP1(V) (ortholog for this gene in Arabidopsis is AT5G58040) displayed significantly higher expression in roots as compared to the other tissues. Moreover, FIP1(V) expression in flowers was also found to be higher than the leaf tissue. Additionally, the red clover gene encoding PCFS4 (PCF11P-similar protein 4, Arabidopsis ortholog of which is AT4G04885) exhibited maximum expression in flower, followed by lower expression levels in root and leaf tissues, respectively.
Additionally, we explored the evolutionary conservation of APA in the genes encoding polyadenylation factors. Specifically, we have examined the polyadenylation profiles of the genes encoding two key polyadenylation factors, namely CPSF30 (Fig. 9A-D) and FIP1(V) (Fig. 9E-H) in Arabidopsis, rice, sorghum, and red clover. Transcripts encoded by the two Arabidopsis genes have been shown to be subjects of APA 55,56 . Moreover, APA involving CPSF30-encoding mRNAs is important for nitrate-responsive transcription in at least one gene 57 . Our analysis revealed high degree of conservation of poly(A) site usage for the genes encoding these two polyadenylation factors in both monocot and dicot plant species. In particular, in all four species, the intronic site predicted to yield the so-called CPSF30S protein was seen (Fig. 9A-D, green arrows). A similar intronic event predicted to yield a short FIPV(S) polypeptide was also seen in all four species (Fig. 9E-H, green arrows). There was also considerable conservation in non-canonical CDS-situated APA in these two genes; only in rice were these events relatively rare (Fig. 9A-H, blue arrows). These results reveal a considerable evolutionary conservation of APA in these two genes, and suggest important roles for the various polypeptides, and perhaps as well for contributions of RNA quality control (via the use of CDS sites and subsequent handling of the resulting non-stop mRNAs) in the regulation of expression of these two genes.

Discussion
Prevalence of APA and its role in modulating tissue-wise transcriptional dynamics. While there have been reports focusing on transcriptional dynamics in red clover 46,48,58 , there have been none on the role of RNA processing in regulating transcriptome dynamics in this crop. In the current study, we have analyzed the role of alternative polyadenylation in modulating transcriptional dynamics in different tissue types in red clover.
APA can increase transcriptome and proteome diversity through various mechanisms, such as by encoding different transcripts, and protein isoforms, as well as by altering transcript stability through elimination of recognition sites for miRNAs 3,4,59 . Additionally, it is also likely that the usage of non-canonical poly(A) sites located within protein coding regions and introns can direct the transcriptional output to various RNA quality control pathways, and thereby regulate the level of functional transcripts [60][61][62][63][64] . With the rise in the number of genome-wide studies, widespread occurrences of tissue-specific APA have been discovered in several species. There is a wide prevalence of tissue-specific APA events in different mammalian systems 9,40 . In Caenorhabditis elegans, tissue-specific APA events were found to be prevalent in intestine, pharynx, and body muscle tissues 11 . Tissue-specific APA was demonstrated between leaf, and seed tissues in Arabidopsis 12 . In rice, there is an extensive tissue-wise dynamics in APA profiles 15 . Our current analysis is consistent with all of these studies, especially those done in plants, in that it establishes a substantial prevalence of tissue-wise APA events in the forage legume red clover. Interestingly, proximal poly(A) sites located in the intronic regions were used less in the root tissue as compared to the leaf and flower tissues. In contrast, poly(A) sites located in the protein coding regions displayed preferential usage in the leaf and root tissues as compared to the flower tissue. Such preferential usages of distal and proximal poly(A) sites were reported previously in the neuronal tissue and blood cell and testis in humans 38,41 .

Individual APA events and their significance in regulating plant developmental processes.
Roles of specific APA events in regulating several plant developmental processes have been well documented. Several core polyadenylation factors and RNA-binding proteins were shown to regulate floral initiation. A core polyadenylation factor FY interacts with the RNA-binding protein FCA to promote proximal intronic polyadenylation of the FCA-encoded transcripts 29 . Similar to FCA, another RNA-binding protein FPA also undergoes APA and regulates floral initiation 31 . FCA and FPA along with two other core polyadenylation factor subunits CstF64 and CstF77 regulate the enhanced usage of the proximal poly(A) sites of FLC-encoded antisense transcripts 30,31 . Similarly, in Medicago trancatula, the SYP132 (SYNTAXIN132) gene implicated in legume-rhizobium symbiosis was shown to undergo APA and produce two membrane soluble protein receptors (t-SNARE). Of these two isoforms, SYP132A was found to be induced during symbiosis and essential for maturation of symbiosomes to their functional forms, whereas the SYP132C isoform was shown not to be required for symbiosis, but possesses other important functions 37 . In order to build a better understanding of the biological roles of APA, it is worthwhile to identify such APA events with potential biological implications. Our analysis has identified in total 792 PACs that exhibit APA between two different tissue types in red clover; these PACS affect the expression of 468 genes. The examples illustrated in Fig. 5 suggest interesting contributions of APA towards different developmental processes. For example, red clover gene35272 displayed APA between leaf and root tissues (Fig. 5B), with the functional mRNA isoform derived from distal poly(A) site choice being preferentially expressed in roots. Specifically, in leaf tissue, the red clover gene35272-encoded transcripts predominantly use a non-canonical poly(A) site, whereas, in the root tissue, the canonical 3′UTR site was used mostly (Fig. 5B). It is most likely that this gene generates full length functional transcripts mostly in the root tissue. This gene is   orthologous to the Arabidopsis gene At4G13270, which encodes a LEA (Late Embryogenesis Abundant) protein. While LEA proteins are mostly implicated in seed development and mediating responses to environmental stresses 65 , a gene encoding SAG21/AtLEA5 was reported to be involved in root development and mediating biotic stress response 66 . It would be interesting in future to test if this APA event has a role in regulating root development under normal physiological condition or environmental stresses.
Another red clover gene (Gene24502) exhibited APA between leaf and flower tissues (Fig. 5C). This gene is orthologous to the Arabidopsis gene At4G27790, which encodes a calcium-binding EF-hand family protein.
Calcium-binding proteins are well known for their roles in hormonal regulation and in mediating responses to environmental stresses 67 . EF-hand family proteins possess a conserved helix-loop-helix structure that can bind to a single Ca 2+ ion 68 . EF-hand family proteins were implicated in environmental and nutritional stress signaling in soybean 69 . In the leaf tissue, the red clover gene24502 predominantly used a non-canonical poly(A) site located at the 5′UTR. In contrast, in the flower tissue, this gene mostly used the canonical poly(A) site located at the 3′UTR (Fig. 5C). This observation suggests a flower-specific expression and role for gene24502 in floral development, and provides a possible conceptual link between flower development and calcium sensing and signaling.
Possible mechanisms of regulation of spatial APA in red clover. The changes in the usage of different classes of poly(A) sites may arise due to various reasons. One possible reason behind the preferred usage of certain classes of poly(A) sites may be alterations in the activities of core polyadenylation complex mediated by plethora of RNA-binding proteins. It has been shown previously in the model plant Arabidopsis, that two RNA-binding proteins, FPA and FCA played significant role in the poly(A) site choice 29,70,71 . Our analysis has revealed that nucleotide profiles around poly(A) sites located in different genomic regions closely resemble patterns previously observed in other plant species, including Arabidopsis, Medicago and rice (Fig. 3) 12,13,15 . This indicates a high degree of conservation of poly(A) signals in different tissue types. This finding suggests that instead of a global regulation through RNA-protein interactions, there may be contribution of specific RNA-protein interactions in mediating spatial APA in red clover. However, inclusion of more tissue types in future experiments may unearth novel tissue-specific poly(A) signals. Such expectations are not farfetched considering the recent discovery in rice of a T-rich motif in the poly(A) sites mapped to the intronic regions in pollen as compared to the A-rich motif in the same region in leaf 15 (Fig. 8). The gene encoding ortholog of FIP1(V) expressed at significantly higher level in root as compared to other tissue types tested. FIP1(V) is a RNA-binding protein and considered as a part of the CPSF complex. FIP1(V) was shown to interact with several polyadenylation factors and to directly affect activity of poly(A) polymerase 78,79 . We have also detected higher expression of the gene encoding PCFS4 in flower as compared to the other tissues (Fig. 8). In Arabidopsis, PCFS4 was reported to regulate alternative polyadenylation of FCA and promote flowering 80 . These results suggest that FIP1(V) and PCFS4 may be involved in mediating spatial APA in red clover. Some of the polyadenylation factor-encoding genes produce multiple isoforms, such as CstF64 in mammals and CPSF30 in plants 55,81,82 . It would be interesting in future to test differential expressions of two isoforms of these factors under different development and physiological conditions and to assess if two isoforms differentially regulate poly(A) site choice under such conditions. Together with our previously reporter spatial gene expression profile, this genome-wide spatial alternative polyadenylation study will provide an account of global spatial transcriptional dynamics in the non-model forage legume red clover. RNA extraction, PATseq library preparation and quality assessment. RNA extraction was performed using Trizol ® reagent (Life Technologies) as per the instructions of the manufacturers. Following isolation, RNA samples were purified with RNA mini spin column (Enzymax LLC). 1 µg of purified total RNA was used to make PATseq libraries using a modified version of a previously published protocol 83 . Briefly, total RNA was fragmented, followed by poly(A) enrichment with oligo dT beads. Next, 3′end fragments of poly(A) enriched RNA were reverse transcribed. Primers for reverse transcription consist of unique barcodes for multiplexing, a sequencing adapter for the Illumina platform, and a T 18 VN sequence at the 3′ end. Reverse transcription reactions were followed by strand-switching using so-called SMART technology (Clontech Laboratories, Inc.). Second sequencing adapter was added to the SMART oligonucleotide. Following reverse transcription and strand-switching, two rounds of size selection were performed using AMPure beads (Agencourt AMPure XP beads, Beckman Coulters, Inc.). Next, cDNAs were PCR amplified, and further size selected using gel purification. Gel purified products were re-amplified, and subjected to one round of purification with the SPRI beads to get the final library. Quality of the PATseq libraries was checked using Agilent High sensitivity DNA chips High-throughput sequencing and data processing. PATseq libraries were sequenced using Illumina platform (HiSeq 2500, 1 × 76 nucleotides). After retrieval of the sequences in fastq format, raw reads were demultiplexed, and adapter, barcode sequences, and poly(T) tracts were trimmed using CLC Genomics Workbench. Ribosomal RNA (rRNA) sequences were removed by mapping the processed reads to Arabidopsis thaliana rRNA sequences. Processed reads were then mapped to the red clover genome sequence (redclover_v2.1, https://zenodo. org/record/17232) 47 . For mapping red clover PATseq reads following parameters were used: match score-1, mismatch cost-2, cost of insertions and deletions-linear gap cost, insertion cost-3, deletion cost-3, length fraction-0.9 and similarity fraction-0.8. For the APA analysis, 3′UTRs in the redclover_v2.1 annotation were extended by 200 nucleotides in the 3′ direction and the modified annotated genome used as reference for the mapping of processed reads. The decision to extend 3′UTRs was arrived at empirically, by mapping reads to modified genomes with 3′ extensions of 100-400 bp in 100 bp increments, and choosing the point at the numbers of mapped reads ceased to increase. While mapping, genomic regions with eight or more consecutive ' A's were masked to eliminate reads generated due to possible internal priming during the reverse transcription reaction.

Methods
Genome-wide analysis of alternative polyadenylation. The mapping results were exported from the CLC Genomics Workbench in 'bam' file format. Further analyses were conducted using BEDTools. Briefly, 'bam' files were converted to 'bed' file format, followed by tag trimming to reduce each sequence tag to a single nucleotide; the trimmed tags were then sorted for subsequent processing. Next, a list of poly(A) clusters (PACs) were prepared, where poly(A) sites (PASs) within 24 nucleotides from one another were grouped in one PAC as described previously 12 . For the subsequent analysis, PACs defined by 20 or more PATs in the complete dataset were reserved, and the other PACs discarded. A list of PASs was also generated, and PASs defined by 8 or more PATs in the complete dataset were retained for further analysis. Next, to each PAC and PAS, corresponding gene IDs, and genomic regions were added. Global distributions of PACs and PASs across different genomic regions were calculated. Additionally, the distributions of PATs mapping to various genomic locations were estimated for each tissue type. The numbers of mapped PATs were calculated for each PAC, and each gene. Relative poly(A) site usages were calculated by dividing number of PATs mapping to a PAC by the total number of PATs that mapped to the associated gene. Differences in relative poly(A) site usage between different tissue types were estimated by subtracting relative poly(A) site usage of a PAC in one tissue type from the relative poly(A) site usage of the same PAC in another tissue type. Such differences were calculated and the results recorded in spreadsheets and displayed with boxplots.
Validation of PACs using an independent RNAseq dataset. PACs generated using the PATseq dataset were validated using a red clover RNAseq dataset [NCBI Short Read Archive (SRA) BioProject accession PRJNA287846] 48 . Briefly, poly(A) tail containing reads were extracted from the RNAseq dataset and were mapped to the red clover genome using the same criteria as described above. A PAC list was generated as described in the preceding section and PACs with at least two PATs were retained for further analysis. Overlaps between the genomic coordinated of PACs in two datasets were assessed using BEDTools.
Nucleotide composition analysis. From the master PAS list, separate lists of PASs were generated for leaf, root and flower samples. For each sample, number of PAT for each PAS, was normalized by the total number of PATs in all PASs, and expressed in millions. Next, averages of normalized PATs were calculated for three replicates of each tissue sample. Only PASs with ≥2 normalized PATs were retained for the nucleotide composition analysis. Analysis was done by mapping three replicates of each sample together. Genomic regions and corresponding genes were added to each PAS, and PASs were grouped according to their genomic locations. PASs with no corresponding genes were removed from the subsequent analysis. Next, sequences spanning 100 bp upstream to 100 bp downstream of PASs were extracted, and any sequences less than 201 bp in length were removed from the analysis. For each class of PASs (those located in 3′UTR, protein coding region, intron, and 5′UTR), the proportion of each of the four bases was calculated for each position relative to the poly(A) site (+1). To analyze nucleotide profiles of the sequences surrounding sites, where internal priming may happen, co-ordinates of such sites (these sites are defined by 'N6A8' motif, where 'N' is any nucleotide) were extracted, extended by 100 nucleotides in both directions and grouped according to their genomic locations. Nucleotide profiles of sequences encompassing potential internal priming sites located at 3′UTR and coding regions in the red clover genome, were estimated as stated before.
Identification of PACs and genes displaying APA. Differentially utilized PACs between two tissue types were determined using DEXseq software implemented in R 84 . Briefly, 'bed' files representing mappings of trimmed PATs to the red clover genome were converted to 'bam' format using BEDTools. SAMTools was used to convert these 'bam' files to 'sam' format. A 'GTF' file was generated from the PAC list with the gene annotation. A python script was used to convert this 'GTF' file to 'GFF' format suitable for use by DEXseq. Additionally, another python script was used to count the number of reads in each PAC in each library and outputs from this step was saved in 'txt' format and were used for DEXseq analysis in R. Statistical analyses in DEXseq were performed with an adjusted p value of <0.05 to identify PACs showing differential poly(A) site usage between two tissue types. Genes associated with the differentially utilized PACs between two tissue samples were subjected to gene ontology (GO) analysis using 'Singular Enrichment Analysis' (SEA) tool in AgriGO 85 . Enriched GO categories were identified using 'chi-square test' using following parameters: Hochberg (FDR) cut-off value 0.1 and minimum number of mapping entries of 5. Comparison of gene expression using PATseq and RNAseq. To analyze the usefulness of PATseq data in quantifying global gene expression, we conducted gene expression analyses with both RNAseq and PATseq datasets and compared results from two analyses. For the gene expression analysis using RNAseq approach, our previously reported RNAseq dataset was used [NCBI Short Read Archive (SRA) BioProject accession PRJNA287846 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA287846/)]. The gene expression analysis using RNAseq dataset was performed in 'CLC Genomics Workbench' using 'RNAseq' suite. To quantify gene expression using PATseq approach, the PAT frequency for each gene was calculated using BEDTools and the output file was imported in 'CLC Genomics Workbench' for the subsequent analysis. In both cases, expression values were transformed by adding 1, and then normalized using 'quantile normalization' . Normalized gene expression values of RNAseq and PATseq datasets were log2-transformed and Pearson correlation coefficients between two datasets were calculated. Data availability. High throughput sequence data generated in this study was deposited to the NCBI SRA under the BioProject accession PRJNA412508.