Gossypium barbadense genome sequence provides insight into the evolution of extra-long staple fiber and specialized metabolites

Of the two cultivated species of allopolyploid cotton, Gossypium barbadense produces extra-long fibers for the production of superior textiles. We sequenced its genome (AD)2 and performed a comparative analysis. We identified three bursts of retrotransposons from 20 million years ago (Mya) and a genome-wide uneven pseudogenization peak at 11–20 Mya, which likely contributed to genomic divergences. Among the 2,483 genes preferentially expressed in fiber, a cell elongation regulator, PRE1, is strikingly At biased and fiber specific, echoing the A-genome origin of spinnable fiber. The expansion of the PRE members implies a genetic factor that underlies fiber elongation. Mature cotton fiber consists of nearly pure cellulose. G. barbadense and G. hirsutum contain 29 and 30 cellulose synthase (CesA) genes, respectively; whereas most of these genes (>25) are expressed in fiber, genes for secondary cell wall biosynthesis exhibited a delayed and higher degree of up-regulation in G. barbadense compared with G. hirsutum, conferring an extended elongation stage and highly active secondary wall deposition during extra-long fiber development. The rapid diversification of sesquiterpene synthase genes in the gossypol pathway exemplifies the chemical diversity of lineage-specific secondary metabolites. The G. barbadense genome advances our understanding of allopolyploidy, which will help improve cotton fiber quality.

. A schematic map of the evolution of allotetraploid cottons. Allotetraploid cotton evolved from the natural hybridization between A-and D-genome species and has split into six species, including the widely cultivated G. barbadense (AD 2 ) and G. hirsutum (AD 1 ). Evolutionary time (in Mya) is indicated by a numbered axis; major evolutionary events are represented by arrows and concluded in boxes. A black star indicates a retrotransposon burst, and a red star indicates a boom in pseudogene production. Gr, G. raimondii, a diploid species (D 5 ); Gb, G. barbadense; Gh, G. hirsutum. Mature cotton fiber is shown for extra-long stable (ELS) cotton (G. barbadense, AD 2 ) and Upland cotton (G. hirsutum, AD 1 ).

Results
Genome sequence and assembly. We adopted a progressive strategy to sequence the allotetraploid genome of G. barbadense cv. Xinhai21 (AD) 2 . First, the genomes of the extant diploid species of G. arboreum (A 2 ) and G. raimondii (D 5 ) were separately sequenced and assembled. These sequences, together with their published genomes 10,12 , were used as references for early assortments of the primary reads into A t and D t subgenomes. Then the sequences were assembled into A t and D t contigs and scaffolds (Supplementary Table 1). A total of 471 Gb (188× genome equivalent) of data were separately produced using the Roche 454, Illumina Hiseq2000 and PacBio SMRT sequencing platforms (Supplementary Table 2). The particularly long reads (22.67 Gb) obtained from PacBio SMRT and the assembled 53-Gb contigs of the BAC pool further reduced the effects of repeats in the assembly, yielding a gap reduction of 63.4% ( Supplementary Fig. 1). Finally, we used the ultra-dense linkage map consisting of 4,999,048 single-nucleotide polymorphism (SNP) loci 22 to assign and orient the 26 chromosomes and validate the polyploidy genome of G. barbadense ( Supplementary Fig. 2). We detected only 20 Mb sequences in which the subgenome classification of homoeologous sequences conflicted between the sequence assembly and the linkage mapping strategies, which was likely due to sequence conversions between the two subgenomes. A total of 208 Mb sequences with erroneous inter-chromosomal joins in the A t or D t subgenome were detected and then corrected.
The combination of these methods resulted in a draft genome for G. barbadense with an overall contig N50 of 72 kilobases (kb) and scaffold N50 of 503 kb covering 1.395 Gigabases (Gb) of the A subgenome (A t ) and 0.776 Gb of the D subgenome (D t ) ( Table 1 and Fig. 2). In total, ~88% of the 2.470 Gb genome was based on k-mer estimation ( Supplementary Fig. 3). The genome contains at least 63.2% repeated sequences (Supplementary Table 3), half of which are transposable elements (TEs) that primarily consist of long-terminal-repeat retrotransposons (LTR retrons) ( Supplementary Fig. 4).

Gene annotation.
To initiate gene prediction, ~1 million expressed sequence tags (ESTs) that were generated using Roche 454 from a combination of 28 samples of eight tissues/organs collected at different development stages were mapped to the genome as gene models, which resulted in 40,502 and 37,024 protein-coding genes (CDSs) with an average length of 1,077 and 1,123 bp in the G. barbadense A t and D t subgenomes, respectively (Table 1), and falling in the same range as the number and length of CDSs of G. raimondii 10,11 . Further evaluation using the 70-Gb RNA-Seq data via Illumina supported 96.6% of the predicted CDSs. The 77,526 predicted genes were annotated, which revealed 62,966 functional genes, excluding 8,518 A t and 6,042 D t genes (~ 20%) that lacked clear biological functions.
To examine the influence of allopolyploidy on gene contents, we classified cotton genes into domain families. The composition and family size of the assigned Pfam domain families are overall identical in G. barbadense A t and D t , G. raimondii and, to a lesser extent, G. arboreum. Protein domains whose function was clearly annotated, such as protein kinase, cytochrome P450, and pentatrico-peptide repeat (PPR), were commonly over-represented as large families (Supplementary Table 4 and Supplementary  Fig. 5) as in other angiosperm plants [23][24][25] . Although most domains (3,039 out of 3,674) were maintained in each subgenome after the two were merged, pronounced changes in family size occurred, as exemplified by more ring finger domain (PF13639) and leucine rich repeat (PF13855) genes in the diploid D genome than in either A t or D t (Supplementary Table 4). This finding suggested that super-large families have evolved faster than others and tended to lose members in polyploids 26 . Genome evolution. A total of 21,639 pairs of orthologs were identified between A t and D t . We compared the Ks values of orthologous gene pairs among G. barbadense (Gb), G. hirsutum (Gh) and G. raimondii (Gr) at the whole-genome level (Fig. 3a and Supplementary Table 5). A peak of 0.011 in both GbD t :GrD 5 and GhD t :GrD 5 indicates that the D t subgenome in of both allotetraploids originated from a G. raimondii-like progenitor 27 . The peak values for GbA t :GaA 2 and GhA t :GaA 2 are lower but again similar, presumably due to a shorter time since divergence compared to that between D-genome species. In addition, unlike G. raimondii, which is a wild species, G. arboreum has long been cultivated in African and Asian countries. Another pair of similar Ks peaks (0.005) of GbA t :GhA t and GbD t :GhD t further supports the common origin of the two allotetraploid cottons and suggests their later divergence approximately 1 Mya (Fig. 3a). Based on the larger Ks value (0.04) for A t :D t , we estimated the divergence time between the Gossypium A-and D-genome species to be approximately 8 Mya, consistent with previous estimates that were based on a few single-copy genes 13,27 . The Ks values of paralogs in the two subgenomes of G. barbadense both peak at 0.4-0.5, which indicate ancient WGD event(s) that occurred 50-70 Mya (Fig. 3b), which were responsible for the repeated genome expansion in Gossypium after divergence from the Theobroma cacao lineage more than 60 Mya 10 . Both the A t and D t subgenomes of G. barbadense demonstrate a high level of co-linearity with the G. raimondii genome 10,11 (Supplementary Fig. 6). A total of 21 Megabase (Mb) sequences in the D t and 7.4 Mb in the A t were identified as inter-subgenome translocation regions ( Supplementary Fig. 7). Two of three major intra-subgenomic rearrangements between chrA2/chrA3 and chrA4/chrA5 28,29 were observed in the A t of both of the allotetraploid cottons but absent in the D t or G. raimondii genome (Fig. 2), suggesting that the two translocations likely occurred after the separation of the A and D genomes.
Genomic plasticity and evolution. We identified 6,014/2,422 complete LTR retrons with an average length of 9,256/8,130 bp in A t /D t (Supplementary Tables 6 and 7), similar to the numbers of LTR retrons in G. hirsutum A t and D t , G. arboreum and G. raimondii (Supplementary Table 8). The singleton LTR retrons ratio is 83.5% in A t and 82.2% in D t (compared with 85.4% in G. raimondii and 73.2% in G. arboreum), close to that (86%) in the genome of a gymnosperm tree, Picea abies 30 (an indication of high divergence).
The TE proliferations in G. barbadense and G. hirsutum 20,21 , represented by insertions of LTR retrons based on estimations according to the sequence divergence between the left and right soloLTR 31 , have increased since 20 Mya, and three distinct bursts were identified. Interestingly, the first two bursts appear to successively pre-date the divergence and the re-unification of the diploid A/D genomes (Fig. 3c). The LTR retrons clearly show type-specific and subgenome-biased proliferations (Fig. 3c). Their insertion rates in the A genome appear consistently higher than those in the D genome. For example, a large number (9.15%) of LTR retrons burst at 5 Mya and decreased thereafter in A t , whereas a substantially lower and flat peak appeared 3-5 Mya in D t (Fig. 3c). This peak at least partly accounts for the 1.7-fold more LTR retrons in the former genome. However, the faster loss of LTR retrons in the D genome may also be responsible for genome size variations and the different rates of genome expansion 32 . Notably, the third asymmetric activities of transposons differ between G. barbadense and G. hirsutum (Fig. 3c), which suggests a possible cause of subgenome divergence that may have promoted the speciation of allotetraploid cottons beginning approximately 1 Mya (Fig. 1). These observations indicate that the genome-specific differential dynamics of TE proliferations could be a major force that has driven the rapid evolution and diversification of Gossypium species, which may also be inferred in other flowering plants.
Pseudogenization prior to and after polyploidization. Pseudogenes are disabled copies resembling functional genes that have been retained in the genome 26,33 . They can be grouped into three categories: duplicated (derived from gene duplication), processed (generated by the integration of reversed-transcribed cDNAs into genomes) and fragmented (neither processed nor duplicated) 33 . To further investigate the influence of TE bursts and polyploidization on the cotton genomic architecture, we predicted pseudogenes in G. barbadense (Supplementary Table 9) and classified them into the three categories ( Supplementary Fig. 8), most of which are silenced without any detectable transcripts in all tissues examined.
Each subgenome of G. barbadense contains more predicted pseudogenes than the diploid genome of G. raimondii (Supplementary Table 9 and Supplementary Fig. 8), implying an accelerated pseudogenization after allopolyploid formation. A substantial portion of the pseudogenes in A t and D t showed a high sequence identity (above 90%, for example) with their parental genes ( Supplementary Fig. 9), suggesting an insufficient duration for degeneration in recently formed pseudogenes. As expected, the Ka/Ks distributions indicate a substantially weaker natural selection on pseudogenes than on protein-coding genes ( Supplementary Fig. 10), which is likely due to a loss of function in pseudogenes. The Ks value peaks at 0.06-0.1 corresponding to 11-20 Mya (Fig. 3d) and this boom of pseudogenization correlates with an LTR retron burst prior to the divergence of the A and D genomes (Fig. 3c). The average expression levels of the genes with LTR retron insertion within a 20-kb region upstream of the start codon are generally lower (RPKM = 7.72) than those of genes lacking this insertion (RPKM = 13) (Supplementary Table  10). Therefore, LTR retrons negatively affect the expression of nearby genes, which may promote pseudogenization. These results suggest that cotton progenitors likely lost genes and experienced LTR retron bursts following the ancient WGD, which promoted diversification in Gossypium genomes; however, the role of TE-associated pseudogenization in the stabilization of subgenomes in polyploids requires a more detailed analysis.
Extra-long staple fiber formation. We identified 2,483 and 1,879 genes that are specifically or preferentially expressed in fibers and the ovule, respectively (Supplementary Tables 11 and 12). The highly active genes in the ovule are abundant in the protein families of nucleic acid binding/transcription factor activity and nutrient reservoir activity, whereas the up-regulated genes of fibers are enriched in several categories, such as those related to cytoskeleton, carbohydrate metabolism, cell wall biosynthesis and cellulose biosynthesis function (Supplementary Tables 13 and 14).
Consistent with a previous report 34 , equal numbers of genes in the A t and D t subgenomes demonstrated biased expression patterns (Supplementary Tables 15 and 16). Transcription factors play an important role in controlling agronomic novelty, and the MYB and homeodomain-containing factors have been shown to be key regulators of cotton fiber traits development 10,[35][36][37] . We then analyzed transcription factor genes expressed in G. barbadense fiber in detail (Supplementary Table 17 and Supplementary Fig. 11). Paclobutrazol Resistance (PRE) genes encode a group of transcription regulators known in other plants to promote cell elongation [38][39][40] . We identified 13 PRE family genes in G. raimondii; their 26 orthologous genes were recovered in G. barbadense. Analyzing the PRE-containing synteny blocks in plants revealed that cacao 41 has five PRE genes, each of which has at least two orthologs in the Gossypium diploid genomes or the allotetraploid subgenomes ( Fig. 4a and Supplementary Fig. 12). This expansion of PRE genes in cotton may have occurred during a complex 5-6-fold polyploidy process 10,11 , which was followed by differential gene loss but the retention of the ancient orthologs. Interestingly, two PRE genes are located in the two A t translocation regions (chrA2/chrA3 and chrA4/chrA5) ( Fig. 2c and Supplementary  Fig. 12). In cotton, PRE genes are preferentially expressed in young tissues (Fig. 4b,c), which is consistent with their role in controlling cell size. Moreover, the expression of A t and D t PRE homoeologous genes was biased in G. barbadense (Supplementary Tables 11-12). In particular, the expression level of A t -subgenome PRE1 was high and fiber specific, whereas the expression the D t homoeolog was nearly undetectable (Fig. 4b). The A t -specific expression of a cell growth regulator provides a clue to support the origin or early evolution of spinnable fiber in the A-genome species 10,11 . The expansion and subsequent selection 11,34 of PRE genes in Gossypium may have increased their regulatory activity and recruited specific member(s) for the rapid and extensive elongation of cotton fiber (Figs 1 and 4c).
Cellulose, which consists of linear chains of β (1-4)-linked D-glucose, is the major component of higher plant cell walls and the most abundant biopolymer on land. Plants express multiple cellulose synthases (CesAs) that, together with CesA-associated proteins, form the cellulose synthase complex 42,43 . Cotton fiber is distinct not only in its extensive elongation (ELS cotton fiber is longer than 35 mm) but also in its exceptionally high amount of cellulose, which constitutes more than 95% of the dry weight of mature fiber 16,44 . Notably, the first higher plant cellulose synthase gene was cloned from cotton 45 . Ten, 14 and 15 CesA genes are expressed in Arabidopsis thaliana 42,43 , G. arboreum 12 and G. raimondii 10 , respectively ( Fig. 5 and Table 2). We identified 29 CesA genes, including 14 A t and 15 D t , in the G. barbadense genome, whereas 30 (14 A t and 16 D t ) CesA genes were identified in G. hirsutum; most CesA genes had been retained after the merger of the A and D genomes ( Table 2 and Supplementary Fig. 13). Compared to Arabidopsis, each cotton genome or subgenome contains more genes in the CesA3, CesA4, CesA7 and CesA8 clades. Notably, chromosome 5 of both the A t and D t subgenomes of G. barbadense (GOBAR_ AA25282, GOBAR_AA25287/GOBAR_DD32643, GOBAR_DD32648 and GOBAR_DD32650) and G. hirsutum (Gh_A05G3959, Gh_A05G3965, Gh_A05G3967/Gh_D05G0077, Gh_D05G0079 and Gh_ D05G0084) as well as G. arboreum and G. raimondii contain a CesA cluster composed of 3 or, rarely, 2 genes, in addition to the CesA-like (CSL) genes (Table 2); thus, the duplication(s) occurred in the ancient cotton genome.
Although not exclusively, plant CesAs have functionally diverged into two major classes responsible for either primary cell wall or secondary cell wall biosynthesis 42,43 . Whereas spinnable cotton fiber evolved in the A-genome species and further developed in AD allotetraploids, the CesA gene family has not undergone expansion in any of the three cultivated cotton species sequenced. However, cotton fiber expresses many (at least 25) CesA genes (Fig. 5), demonstrating an enrichment of cellulose synthases in fiber cells. A comparison of the two allotetraploid cottons revealed that the secondary cell wall genes CesA4, CesA7 and CesA8 showed a delayed (> 5 days) and more drastic up-regulation in G. barbadense Scientific RepoRts | 5:14139 | DOi: 10.1038/srep14139 fiber than in G. hirsutum fiber (Fig. 5), which indicates a prolonged duration of fiber elongation and a high activity of cellulose biosynthesis in the secondary cell wall formation stage. Additionally, this temporal expression pattern suggests that the functional allocation of CesA members to primary and secondary wall biosynthesis, which is primarily based on Arabidopsis research 42,43,46 , are likely conserved in angiosperms. Thus, both the retention of CesA family members and the expression pattern of functionally specialized genes in G. barbadense support the formation of extra-long and high-grade cotton fiber.

Terpene synthases and the evolution of cotton phytoalexins. Terpenoids constitute a large
family of natural compounds and play diverse roles in plant-environment interactions. Cotton plants accumulate a specialized group of cadinene-type sesquiterpenoids (including gossypol) that function as phytoalexins against pathogens and pests 47,48 . However, these sesquiterpenoids also reduce the value of cotton seeds that are rich in oil and proteins. Terpene synthases (TPSs) are a family of enzymes responsible for the synthesis of various terpenes from the 10-, 15-, and 20-carbon precursors assembled from the 5-carbon building blocks of IPP and its isomer DMAPP 49 . A manual search of the G. barbadense genome with TPS N-and C-terminal domains (PF01397 and PF03936) identified 115 TPS genes, including 44 monoterpene, 59 sesquiterpene and 8 diterpene synthases, as well as 4 triterpene (squalene) synthases. This number is higher than that in T. cacao (43), Arabidopsis thaliana (34) and Vitis vinifera (98) and similar to that in G. hirsutum (110) but slightly less than twice that in G. raimondii (69).
The cotton sesquiterpene synthase (+ )-δ -cadinene synthase (CDN) catalyzes the first step of gossypol biosynthesis 50 . The G. barbadense genome harbors 19 CDN family genes (sharing > 80% nucleotide identity), whereas G. raimondii, G. arboreum and G. hirsutum harbor 11, 14 and 13 of these genes, respectively ( Fig. 6 and Supplementary Table 18). These genes evolved faster than cotton speciation; thus, the CDN family evolved approximately 60 Mya based on the phylogenetics of cotton plants (Fig. 1). The CDN subfamilies A and E were found closer to the ancient type and duplicated after the divergence of the cotton and cacao lineages ( Fig. 6 and Supplementary Fig. 14). The variable CDN gene numbers in cotton species possibly refer to recent small-scale duplication events, e.g., CDN-A member duplication in the D genome ~1 Mya (Supplementary Table 18 and Supplementary Fig. 14). Thus, the CDN subfamilies in Gossypium represent an example of the rapid lineage-specific evolution of critical genes for specialized metabolites.

Discussion
ELS cotton likely produces one of the most resilient fibers in the plant kingdom; they are highly elongated and contain nearly pure cellulose. This draft sequence of the G. barbadense genome provides valuable genomic resources for studying various aspects of cotton. This draft sequence also facilitates breeding practices aimed at improving cotton fiber traits and increasing the production of high-quality biomass (cellulose).
The genomes of two or more parental species have combined to significantly change the genome structure and function of allopolyploid plants 38,51,52 . Inter-genomic chromosomal rearrangements, differential gene loss (the loss of some duplicates), gene conversion, divergence and the functional diversification of duplicated genes often arise with the onset of polyploidization 53 . Our comparative analysis of cotton genomes also provides new insight into dynamic allopolyploidy processes, such as the mechanism via TE (LTR retrons) bursts and pseudogenization, which have significantly contributed to plant genome evolution and trait formation.  Table 19). DNA isolation, library construction and sequencing. Genomic DNA was isolated from fresh cotton leaves using a previously described method 54 . The shotgun library (300-800 bp fragments)

Homologs in G. hirsutum
Homologs in G. barbadense  For the PacBio library construction and sequencing, genomic DNA was sheared using a Covaris g-TUBE followed by purification via binding to pre-washed AMPure XP beads (Beckman Coulter Inc.). After end-repair, the blunt adapters were ligated, followed by exonuclease incubation to remove all un-ligated adapters and DNA. The final "SMRT bells" were annealed with primers and bound to the proprietary polymerase using the PacBio DNA/Polymerase Binding Kit P4 (Part Number 100-236-500) to form the "Binding Complex". After dilution, the library was loaded onto the instrument with DNA Sequencing Kit 2.0 (Part Number 100-216-400) and a SMRT Cell 8Pac for sequencing. A primary filtering analysis was performed with the RS instrument, and the secondary analysis was performed using the SMRT analysis pipeline version 2.1.0. Genome assembly. The genomes of two diploid cotton species, G. arboreum and G. raimondii, were each sequenced at 100-fold coverage using Illumina Hiseq2000. The assembly resulted in 3,767,593 contigs of 1.5 Gb for G. arboreum and 1,111,300 contigs of 788 Mb for G. raimondii. These contigs, together with the published genomic data of G. raimondii 10 and G. arboreum 12 , were used as template for grouping of G. barbadense sequencing reads into subgenomes, which resulted in totally 44.9% of the reads being A t -unique, 26.9% being D t -unique and 9.7% being both sharing. The remaining 18.5% none hit reads were further grouped during subgenome during sequence assembly.
After subgenome grouping, the A t and D t subgenomes of G. barbadense were assembled separately using a combined strategy. The Roche 454 reads were first assembled using Newbler v2.3. In total, 773,548 contigs with an average length of 2.5 kb were produced. Illumina pair-end reads, mate-pair reads, PacBio SMRT reads and BAC ends were then successively mapped to the contigs to improve quality. The 59,868 contigs (BACtigs) with an N50 of 23.8 kb from 515 BAC pools were merged. These approaches resulted in 4,586 A t scaffolds and 2,186 D t scaffolds with a total size of 2.2 Gb and maximum length of 3.4 Mb. Data statistics are given in Supplementary Table 2 and Table 1.
Finally, a high-density genetic map of G. hirsutum cv. TM-1 × G. barbadense cv. Hai7124 containing 4,999,048 SNPs 22 was mapped to the G. barbadense assembly using the BWA program, which anchored 1.95 Gb or 88% of the assembled sequences and produced 26 pseudo-molecules (chromosomes).
Gene prediction and annotation. Three gene prediction programs, GeneMark (v2.3a) 54 , Augustus (v2.5) 55 and FgeneSH 56 , were used to predict protein-coding genes in the G. barbadense genome. A final gene model was produced by combining the three prediction results with an in-house developed program (GLAD), a tool that creates consensus gene lists by integrating evidence from homology, de novo prediction, and RNA-Seq/EST data. Annotation was performed by comparing the predicted proteins with non-redundant proteins (nr) and the UniProt and KEGG databases. Blast2go 57 was used to assign preliminary GO terms to the predicted gene models. Transcription factors were predicted using PlantTFDB v3.0 58 . Protein domain predictions were performed using RPS-BLAST with a coverage > 90%. The metabolic pathways were constructed using the KEGG database 59 .

Ortholog identification and Ks calculation. Genes were classified into ortholog groups with
OrthoMCL 60 against OrthoMCL proteins (default parameters) [PMID: 12952885]. The orthologs between species, or homoeologs between the A t and D t subgenomes of G. barbadense, were defined using BLASTP based on the Bidirectional Best Hit (BBH) method with a sequence coverage > 30% and identity > 30%, followed by selection of the best match. The Ka and Ks between orthologs were calculated using the KaKs_Calculator 61 via model averaging. The unique gene in each subgenome was defined using the following parameters: 1. protein sequence with no match according to BLASTP against proteins of the other subgenome with E-value 1E-3; and 2. the sum of the length of the high-scoring segment pairs (HSP) was less than 1/3 of the CDS length (via BLASTN) against the genome sequence of the other subgenome.

Repeat and LTR retrotransposon analysis.
Repetitive sequences were identified using RepeatScout with default parameters. The consensus sequences of each repeat family were used to identify repeat compositions in the genome via Censor. The complete LTR retron structures were predicted using LTR_finder 62 , and miniature inverted-repeat transposable elements (MITEs) were identified using MITE-Hunter 63 . Individual LTR retrotransposons were clustered into the same family using the 80-80-80 rule: If two TIR sequences share 80% or higher similarity in at least 80% of their length with an alignment length longer than 80 bp, the two sequences were clustered into the same family 64 .
The insertion ages of each full-length LTR retron were calculated based on the divergence between the left and right solo-LTR sequences using distmat from EMBOSS with the Kimura-2-parameter distance option, and insertion ages were calculated according to the formula T = K/(2r) (K = Kimura distance value, average substitution rate r = 2.6 × 10 -9 in cotton).
Pseudogene identification. Pseudogenes were predicted using Pseudopipe 65 with default parameters. The predicted protein-coding gene sequences from both G. barbadense subgenomes were used as queries to search repeat-masked intergenic regions. Putative pseudogenes were filtered by excluding genes that significantly overlapped with functional gene annotations, genes with parental genes annotated as transposon elements or plastid genes, and genes with sequence lengths shorter than 150 bp. RNA extraction and transcriptome sequencing. The total RNA from each sample was extracted using TRIzol reagent (Invitrogen) following a standard protocol. The mRNAs were purified with the MicroPoly(A) Purist Kit (Ambion), fragmented and converted into an RNA-Seq library using the mRNAseq library construction kit (Illumina Inc.) and sequenced via Illumina Hiseq2000. The mRNAs of 28 samples were also pooled and sequenced on the 454 Genome Sequencer FLX instrument.
Sequence reads from all samples were cleaned using the FASTX toolkit (http://hannonlab.cshl.edu/ fastx_toolkit/). All reads containing 'N' were discarded. Adapter sequences were then removed using the fastx_clipper program, followed by the removal of low-quality (Q < 5) bases from the 3′ end with fastq_quality_trimmer while requiring a minimum sequence length of 50 bp.
Scientific RepoRts | 5:14139 | DOi: 10.1038/srep14139 The RNA-Seq reads of each sample were mapped to the A t and D t genes using bowtie2 66 with a mismatch in seed alignment of 0. Differentially expressed genes were identified via the DEGseq package using the MARS method (MA-plot-based method with Random Sampling model) 67 based on their RPKM (Reads Per Kilobases per Million reads) or FPKM (reads per kilobase of exon model per million mapped reads) values 68 with an FDR≤ 0.001 and |log 2 Ratio |≥ 1 as the threshold. KEGG pathway enrichment was performed with a corrected P-value of < 0.05 as a threshold. GO enrichment was performed using Blast2go 57 .