De novo and reference transcriptome assembly of transcripts expressed during flowering provide insight into seed setting in tetraploid red clover

Red clover (Trifolium pratense L.) is one of the most important legume forage species in temperate livestock agriculture. Tetraploid red clover cultivars are generally producing less seed than diploid cultivars. Improving the seed setting potential of tetraploid cultivars is necessary to utilize the high forage quality and environmentally sustainable nitrogen fixation ability of red clover. In the current study, our aim was to identify candidate genes involved in seed setting. Two genotypes, ‘Tripo’ with weak seed setting and ‘Lasang’ with strong seed setting were selected for transcriptome analysis. De novo and reference based analyses of transcriptome assemblies were conducted to study the global transcriptome changes from early to late developmental stages of flower development of the two contrasting red clover genotypes. Transcript profiles, gene ontology enrichment and KEGG pathway analysis indicate that genes related to flower development, pollen pistil interactions, photosynthesis and embryo development are differentially expressed between these two genotypes. A significant number of genes related to pollination were overrepresented in ‘Lasang’, which might be a reason for its good seed setting ability. The candidate genes detected in this study might be used to develop molecular tools for breeding tetraploid red clover varieties with improved seed yield potentials.

Genomic resources related to seed yield are scarce in red clover compared to other model legume species. Currently, four genetic linkage maps for identification of markers linked to important traits have been developed in red clover 12,[16][17][18] . Several QTL studies of seed yield and seed yield components have been conducted in species like white clover (Trifolium repens L.), soybean (Glycine max L.) and perennial ryegrass (Lolium perenne L.) [19][20][21] . However, so far, only one QTL study of seed yield has been performed in red clover, and this study identified 38 QTL 12 .
Rapid advancements in next generation sequencing (NGS) technology allow characterization and quantification of RNA through cDNA sequencing at massive scale 22 . A draft assembly of the red clover genome based on 16 different genotypes of red clover was recently published 23 . Furthermore, Yates et al. 24 performed de novo transcriptome studies in red clover and provided insights into the drought response. De Vega et al. 25 recently assembled a red clover genome to the chromosome level, estimating its size to be ~309 Mb. The group annotated 40,868 genes and identified clusters involved in forage quality and livestock nutrition.
With the availability of new genomic resources in red clover and the advancements in RNA-seq technologies, we performed both de novo and reference (red clover genome) based transcriptome analysis of the global transcriptome response during flower and seed development in two red clover genotypes with contrasting seed setting ability. The aim of this study was to identify molecular responses and to elucidate genes determining seed setting ability in red clover.

Materials and Methods
Plant material. In 2011, nine single plants of each of the low seed yielding variety 'Tripo' , and the high seed yielding variety 'Lasang' were scored for the following seed yield components: number of flower heads per plant, number of florets per flower head, number of seed per flower head, fertility, seed weight per flower head, and length of the corolla tube 26 . The two lowest ranking plants of 'Tripo' and the two highest ranking plants of 'Lasang' , for the majority of registered seed yield components, were selected for further analysis.
RNA sampling. A total number of 12 flower buds, one bud from each of three flower development periods (early -12 th of July, middle -21 st of July and late -27 th of July) from each of the four selected plants, were picked, flash frozen in liquid N 2 and stored at − 80 °C until RNA extraction. The frozen flower bud samples were crushed with a pestle and mortar. Using SIGMA SPECTRUM PLANT TOTAL RNA KIT (Sigma Life Science), total RNA was extracted from the 12 flower buds. On-Column DNAse Kit (Sigma Life Science) was used to remove DNA contamination. The quality and concentration of RNA were measured using the NANODROP (Nanodrop Technologies, Wilmington, DE, USA) and BIOANALYZER (Agilent Technologies, Palo Alto, CA, USA) equipment.

RNA-seq library preparation and Illumina sequencing.
Twelve flower bud RNA samples with RIN (RNA Integrity Number) values above 7 were used to construct separate cDNA libraries with fragment lengths of 200 bp (± 25 bp). Single-end sequencing was performed at the Norwegian Sequencing Centre (NSC), University of Oslo, using the Illumina sequencing platform (HISEQ 2000) generating single-end reads with a length of 50 bp. The FastQC program (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to analyse the quality of the raw sequencing reads.
De novo transcriptome analysis. The de novo assembly was performed in a similar manner as described by Kovi et al. 27 . Briefly, adapter sequences and low quality reads were removed using the sickle program (https:// github.com/najoshi/sickle/blob/master/README.md). The clean reads derived from the four individual genotypes named Tripo42 and Tripo55, Lasang77 and Lasang108, were used to construct separate de novo assemblies for each genotype using the Trinity assembler (release 2013-02-25) 28 . The de novo assembled transcriptome was then used as a reference to map the individual reads using the Bowtie program 29 . Transcript abundance was measured for each genotype and time point combination as the expected number of fragments per kilobase (kb) of transcript sequence per million mapped reads (FPKM) 30 using RSEM version 1.1.11 31 .

Identification of differentially expressed genes (DEGs), annotation and gene ontology (GO)
analysis. The edgeR package 32 in R program language (https://www.r-project.org/) was employed to identify DEGs and a false discovery rate (FDR) of 0.05 was further used to determine the significant DEGs. Transcripts showing differential expression at any flower development time-point were clustered using a K-means clustering algorithm. The annotation of the DEGs were performed using the Blast2GO program 33 . Initially, BLASTx was performed with an E-value threshold of 10e-06, followed by annotation with a cut-off value of 55 and GO weight Hsp-hit value of 20. The GO enrichment analysis was performed with a p-value of 0.01. The GO classification of DEGs in the two genotypes were generated using the WEGO program 34 . KEGG Pathway analysis was performed with the Blast2GO program 33 .
Validation of de novo assembly by CEGMA. The CEGMA software (version 2.4) 35 was used to evaluate the quality of the four transcriptome assembly datasets. Several genome and transcriptome assembly studies have used CEGMA for evaluating the quality of assemblies 27 . CEGMA detects the presence of 248 extremely conserved core eukaryotic genes (CEGs) and their coverage in transcriptome assemblies for evaluation of the completeness of the assembly.

Red clover reference based transcriptome analysis, detecting DEGs and functional annotation.
Using a reference-based approach, we mapped all the clean reads from the two genotypes ('Tripo' and 'Lasang') and the three time points (early, middle and late flower development) to the red clover reference genome 25 using STAR, an ultrafast universal RNA-seq. aligner program 36 . The Cufflinks program 30 was used to assemble the transcriptomes and to estimate the transcript abundance, followed by the cuffmerge and cuffdiff programs, which is included with the Cufflinks package. The Cuffmerge program merged the transcriptome assemblies from the three flower development time-points of each genotype for performing differential expression analysis. The Cuffdiff program compared the expression levels of genes and transcripts between the three time-points for each genotype, and detected genes that are up-or down-regulated between the time-points. The merged GTF files obtained from the Cuffmerge program was used in the TransDecoder program 37 to identify the coding regions within transcripts. The longest homology coding sequences obtained from TransDecoder were blasted against the Viridiplantae database extracted from NCBI to find the gene names for the coding sequences. Further annotation was performed using the SWISS-PROT database. GFF3 (generic feature format) annotation file describing genomic features, was generated using in-house developed python scripts.
Comparison of significant DEGs to seed yield related QTL. To compare the DEGs with the QTL for seed yield and seed yield traits described by Hermann et al. 12 , we identified flanking SSR markers associated with the QTL and downloaded the marker sequences from the NCBI database. The chromosome locations of markers and DEGs were identified using the BLAST program with the marker sequences and DEG sequences as the query and the red clover genome sequence 25 as the subject. A physical map was created based on the physical location of the DEGs in the red clover genome by the MapDraw software 38 . Briefly, all the physical location (bp) of DEGs were converted to centimorgan (cM) by an average of 450 kb/cM in red clover and spanned 440 cM across seven linkage groups (LGs), approximately similar to 444 cM of Hermann et al. 12 .

Results
De novo assembly. The low seed yielding 'Tripo' and the high seed yielding 'Lasang' genotypes ( Fig. 1) were sequenced and characterized by the de novo transcriptome assembly (Table 1). A total number of 218 million reads of 50 bp were generated for the four genotypes (Tripo42, Tripo55, Lasang77 and Lasang108). 112 million reads were from the 'Tripo' genotypes and 106 million reads from the 'Lasang' genotypes. Individual transcriptome assemblies were generated for each genotype. The numbers of contigs observed in Lasang108 and Lasang77 were 80,328 (N50 of 930 bp) and 83,489 (N50 of 982 bp), respectively, while in Tripo42 and Tripo55, they were 84,545 (N50 of 1016 bp), and 84,442 (N50 of 982 bp), respectively. The longest contig sizes were 7469, 7295, 7447 and 7339 bp for Lasang108, Lasang77, Tripo42 and Tripo55, respectively. CEGMA analysis determined the complete CEGs (Core Eukaryotic Genes) in Lasang108, Lasang77, Tripo42 and Tripo55 transcriptome assemblies to be 89.11, 92.34, 92.34 and 92.34%, respectively, while the percentage of partially complete CEGs ranged from 97.18 to 97.98 ( Table 2). The average number of orthologues per CEG in the four assemblies ranged from 3.18 to 3.30, while the percentage of CEGs that had more than one orthologue ranged from 89.59 to 95.20 (Table 2).

DEGs identified by de novo and reference based methods.
Clean reads from each sample were mapped onto their respective genotype specific de novo assemblies and to the reference genome (red clover genome sequence) to estimate the expression levels of transcripts at different flower development time-points, early (EF), middle (MF) and late (LF) flower development. The DEGs identified in a series of pairwise comparisons between the three flower development time-points EF-LF, EF-MF and LF-MF were 15,000, 7,204 and 7,903,   2B). In the reference-based analysis, 875 and 932 DEGs were observed between EF-LF samples; 279 and 586 between EF-MF samples and 331 and 93 between MF-LF samples in the 'Tripo' and 'Lasang' genotypes, respectively, including up-and down-regulated transcripts ( Fig. 2A).
To determine the sample relations, differential expression data from the edgeR program were used to generate heat maps (Fig. 3). EF and MF grouped together in the low seed yielding 'Tripo' genotypes, while MF and LF grouped together in the high seed yielding 'Lasang' genotypes, indicating that unique genes expressed during late flower development (LF) in 'Tripo' and early flower development (EF) in 'Lasang' were playing major roles in their flowering and seed setting abilities.
Blast, annotation and GO of differentially expressed genes. BLASTx was performed for all the DEGs against the Viridiplantae database derived from NCBI. Approximately 80% of the DEGs had blast hits and 60% were annotated using the Blast2GO program 33 . The top blast hit species were Trifolium subterraneum, followed by Medicago truncatula. Bboth species are closely related to red clover. Gene ontology (GO) classification of DEGs of 'Tripo' and 'Lasang' were represented as three main GO categories, i.e. cellular component, molecular function and biological process in a histogram (Fig. 4) using the WEGO (Web Gene Ontology Annotation Plot) graphical tool 34 . GO comparisons between 'Tripo' and 'Lasang' showed some differences regarding the cellular component and molecular function categories, while relatively small differences were observed for the biological process category. DEGs involved in membrane-enclosed lumen and translation regulator were present only in 'Tripo' , while DEGs involved in structural molecule were present only in 'Lasang' .
Over-or underrepresented GO terms were determined using Fischer's exact test in the Blast2GO program, and the REVIGO tool for reducing and visualising gene ontologies 39 . Six GO terms were enriched when compared 'Tripo' and 'Lasang' genotypes. Out of these, four GO terms, i.e. plasma membrane, pollination, transport and Golgi apparatus were overrepresented in the high seed yielding 'Lasang' genotypes. Transcripts assigned to DNA metabolic processes and nucleic acid binding were overrepresented in the low seed yielding 'Tripo' genotypes (Fig. 5, Supplementary Figure 1).
DEGs compared to the seed yield QTL. The DEGs identified in this study were compared to seed yield related QTL in order to see if any of the genes identified are co-located with the seed yield QTL as described by Hermann et al. 12 . Out of 15 SSR markers flanking the seed yield QTL, six SSR markers are located in the corresponding regions as six DEGs detected in this study positioned on four linkage groups (Fig. 6). The six DEGs are myb-related protein MYBAS2, 4-coumarate-CoA ligase-like 2, protein cornichon homolog 3, ethylene-responsive transcription factor ERF113, protein DETOXIFICATION 45, and UDP-glucuronate 4-epimerase 4.

Discussion
Comparative analysis between de novo and reference based transcriptome assays. When a reference genome is available, reference-based approaches have been considered more effective than de novo assembly (Martin and Wang, 2011), but very few studies have compared the two strategies 27,40 . Moreover, it is important to see whether the de novo assembly can detect the same genes and the molecular responses even in the absence of a reference genome. In the present transcriptome analysis of red clover, we compared both strategies. The CEGMA analysis showed that the de novo assemblies were very complete in terms of gene content since they captured high percentages of ultra-conserved CEGs in all assemblies of the 'Tripo' and 'Lasang' genotypes.
De novo and reference-based (red clover reference genome) gene expression data indicated that genes expressed during the early flower development stage (EF) in 'Lasang' and during the late flower development stage (LF) in 'Tripo' might play key roles in their differential seed setting abilities. In both de novo and the reference-based mapping, the pattern of the differentially expressed transcripts was similar. A larger number of differentially expressed transcripts was observed in early vs late flower development stage than in middle vs late and middle vs early stage (Figs 2 and 3). This might be due to the presence of several differentially expressed transcripts at all three stages. In addition, there was a larger number of differentially expressed transcripts at the early vs middle flower development stage in 'Lasang' than in 'Tripo' , whereas there was more differentially expressed transcripts at the early vs late stage in 'Tripo' compared to 'Lasang' (Fig. 2). Furthermore, we found the proportion of differentially expressed transcripts to be higher in de novo compared to the reference based mapping, which is similar to the findings of Kovi et al. 27 . The trinity de novo assembler yield more transcripts due to the lack of strand-specific information. However, most of the differentially expressed genes identified in both these methods were related to the Medicago truncatula (Supplementary Figure 2), which is the most closely related species to red clover. Furthermore, both methods identified many similar candidate genes putatively involved in flower and seed development (Table 3), thus demonstrating the potential of the de novo method of capturing genes even in the absence of a reference genome. This comparative analysis study might be very useful for the researchers working on orphan species with no reference genome.
Potential candidate genes involved in flower and seed development. Several genes putatively involved in flower development were detected in this study (Table 3). WAT1 related protein is a cell wall protein mainly responsible for transmembrane transporter activity (http://www.uniprot.org/uniprot/Q94AP3). Ranocha et al. 41 reported that stem apices in the mutant wat1 produced significantly lower seed yields in Arabidopsis thaliana compared to wild type stem apices. It might be that the downregulated expression of this gene in 'Tripo' flower buds in the early and middle flower development periods, negatively affected its seed setting ability and thus seed yield. Tubby-like proteins are involved in abscisic acid (ABA) signaling pathways and plays a key role in seed germination and early seed growth 42 . In a recent study, Verma et al. 43 identified a tubby-like F-box protein as a potential candidate gene for the seed weight QTL qSW in chickpea (Cicer arietinum L.). Gibberellin 2-beta-dioxygenase was highly expressed in EF and MF. According to Xue et al. 44 , genes that encodes gibberellin 2-beta-dioxygenase 1 were highly expressed in rice embryo.
NIP4-1 belongs to the aquaporin gene family, which are small integral membrane proteins that facilitate water and solute movement across different tissues throughout development and growth 45 . Regulation of water and nutrient state is very relevant for pollen development, pollen tube growth and germination 46 . Recently Di Giorgio et al. 47 showed that NIP4-1 and NIP4-2 are required for pollen development and pollination in Arabidopsis thaliana. Furthermore, single nip4;1 mutant plants showed a significantly higher frequency of abnormal, stunted siliques and fewer seeds when compared with the wild type 47 . This indicate that NIP4-1 plays a prominent role in determining seed yield. In our studies, the significant upregulation of this gene in 'Lasang' during the EF and MF stages might play a key role in determining the better seed yielding capacity of this cultivar.
Zinc finger proteins (ZFP) play an important role in various biological functions, such as plant growth and development (flower, shoot, seed, pistil and leaf) 48,49 . Recently it was found that ZFP3, ZFP4 and the related ZFP subfamily of zinc finger factors regulate light and ABA responses during germination and early seedling  Table 3. List of differentially expressed genes that can be considered as potential candidate genes involved in seed setting in two red clover genotypes, 'Tripo' and 'Lasang' . development 50 . Higher expression of ZFP4 during the EF and MF stages indicate that it might be important for seed setting in our tetraploid red clover genotypes. The gene ERF106 belongs to the APETALA2 (AP2) gene family, which controls seed weight (Ohto et al.) 51 , was overexpressed during the EF and MF stages in 'Lasang' flower buds. APETALA2 influences the development of embryo, endosperm and seed coat 51 . According to Xue et al. 44 , genes involved in ethylene mediated signaling were highly expressed in rice developing seeds.
The rice gene OsPht1,4 belongs to a group of genes that regulate phosphorus homeostasis in plant cells 52 . Jia et al. 53 reported that suppression of OsPT4 in rice resulted in lower P content in unfilled rice grains, which again resulted in lower seed yields. This gene was overexpressed during the MF and LF stages in 'Lasang' flower buds indicating its positive effect on seed yield. This gene is also involved in the embryo development in rice 54 .
GO differences in 'Tripo' and 'Lasang'. Gene ontology (GO) has provided a way of consistently describing genes and proteins to computationally process data at the functional level 34,39 . Gene ontology (GO) comparisons between the two genotypes showed differences regarding the cellular component and molecular function categories (Fig. 4). Six GO terms were enriched between these two genotypes. Out of these four GO terms, plasma membrane, pollination, transport and Golgi apparatus were overrepresented in the high seed yielding 'Lasang' . Transcripts assigned to DNA metabolic processes and nucleic acid binding were overrepresented in the low seed yielding 'Tripo' (Fig. 5; Supplementary Figure 1). Genes, such as pollen-specific leucine-rich repeat extensin-like protein 1-like (PEX1), pollen profiling variant 1, phd finger protein male sterility 1-like (MS1-PHD), and  Hermann et al., 2006). The distribution of DEGs on the seven linkage groups (LG) of red clover. The SSR markers (denoted in brackets) (Hermann et al., 2006), linked to QTL (denoted in bars) which are co-located to the DEGs are highlighted. In LG1, MYBA2 mapped to the QTL for thousand seed weight (TSW). In LG2, 4CLL2 and CNIH3 mapped to QTL for time of flowering (TOF) and TSW. In LG3, EF113 mapped to QTL for seed yield per plant (SYP). In LG6, GAE4, mapped to the QTL for seed yield per head (SYH).
Scientific RepoRts | 7:44383 | DOI: 10.1038/srep44383 polypyrimidine tract-binding protein, were observed in the pollination GO term. PEX1 reported to be involved in reproduction with in the pollen tube wall during its rapid growth 55 . Another gene, MS1-PHD encodes a PHD-type transcription factor and regulates pollen and tapetum development and pollen wall biosynthesis 56 . In the GO term plasma membrane, genes like aberrant pollen transmission (APT1), flotillin-like protein, sodium transporter hkt1-like were observed. Xu and Dooner 57 showed that the APT1 protein is involved in membrane trafficking and is required for the high secretory demands of tip growth in pollen tubes. Most of the overrepresented genes are linked to pollen development, which is crucial for fertility and seed setting, thus likely involved in determining the higher seed yield capacity of 'Lasang' compared with 'Tripo' .
Validation of the DEGs by comparing to previous red clover seed yield QTL. Comparative mapping studies are powerful tools to validate the detected DEGs by comparing the sequences of the genes to the sequences of markers located inside or flanking QTL 58 . The physical map of DEGs was created based on their physical locations (bp) in the red clover genome. However, there is no constant ratio to convert between bp and cM, as some regions of the genome with frequent recombination have fewer bp per cM than regions with low recombination. The best approach might be to pick the most detailed genetic map (in our case ref. 12), fetch the sequences for each SSR marker on the map, and BLAST these marker sequences against the genome sequence (red clover genome). From the BLAST analysis, we were able to translate each of these cM distances into a bp distance between the points of alignment from the markers to the chromosome, thus calculated as 450 kb/cM. A similar approach was carried out in Arabidopsis by estimating genetic distance as 250 kb/cM 59 . In this study, we detected six DEGs that mapped to the seed yield QTL regions identified by Hermann et al. 12 , and positioned on linkage groups LG1, LG2, LG3 and LG6 (Fig. 6). Among them, MYB transcription factors play a key role in plant development, pollen development 60 , pollen tube differentiation 61 , floral initiation and seed development 62 . The gene 'protein cornichon homolog' belongs to a conserved protein family found in eukaryotes demonstrated to participate in the selection of integral membrane proteins as cargo for their correct targeting 63 . Further, Man et al. 58 detected protein cornichon homolog as a potential gene encoding for the yield related QTL in cotton. 4-coumarate-CoA ligase-like 2, belongs to a group of essential enzymes involved in the phenylpropanoid-derived compound (PDC) pathway, which generates various secondary compounds like lignin, anthocyanins, and isoflavonoids. Doughty et al. 64 suggests that flavonoids may play a fundamental role in regulating communication between the seed coat and the endosperm also.

Conclusions
In this study, transcriptome analysis was conducted for cv. 'Tripo' with inferior seed setting ability and two from cv.'Lasang' with improved seed setting ability, and several DEGs were identified. Many genes related to pollination, flower and seed development were upregulated during the early to middle (EF-MF) flower development stage in the 'Lasang' and downregulated during the middle to late (MF-LF) flower development stage in the 'Tripo' , indicating their major role in determining seed setting and potential seed yield. GO enrichment analysis further confirmed that plasma membrane, pollination, transport and Golgi apparatus related genes are overrepresented in the 'Lasang' . Further, comparative mapping, co-located six seed yield related QTL to the six DEGs on the same linkage groups, thus validating the detected DEGs in this study. Putative candidate genes detected in this study might provide a basis for future functional genomics research in understanding the biology of seed yield in red clover. Loss-of-function techniques like RNA interference methods can further be used to understand the role of these genes in the seed setting.