Main

Cotton is one of the most economically important crop plants worldwide. Its fiber, commonly known as cotton lint, is the principal natural source for the textile industry. Approximately 33 million ha (5% of the world's arable land) is used for cotton planting1, and the global textile mill market was valued at approximately $630.6 billion in 2011 (MarketPublishers; see URLs). Apart from its economic value, cotton is also an excellent model system for studying polyploidization, cell elongation and cell wall biosynthesis2,3,4,5.

The Gossypium genus contains 5 tetraploid (AD1 to AD5, 2n = 4×) and over 45 diploid (2n = 2×) species (where n is the number of chromosomes in the gamete of an individual), which are believed to have originated from a common ancestor approximately 5–10 million years ago6. Eight diploid subgenomes, designated as A to G and K, have been found across North America, Africa, Asia and Australia. The haploid genome size of diploid cottons (2n = 2× = 26) varies from about 880 Mb (Gossypium raimondii Ulbrich) in the D genome to 2,500 Mb in the K genome7,8. Diploid cotton species share a common chromosome number (n = 13), and high levels of synteny or colinearity are observed among them9,10,11,12. The tetraploid cotton species (2n = 4× = 52), such as Gossypium hirsutum L. and Gossypium barbadense L., are thought to have formed by an allopolyploidization event that occurred approximately 1–2 million years ago, which involved a D-genome species as the pollen-providing parent and an A-genome species as the maternal parent13,14. To gain insights into the cultivated polyploid genomes (how they have evolved and how their subgenomes interact), it is first necessary to have a basic knowledge of the structure of the component genomes. Therefore, we have created a draft sequence of the putative D-genome parent, G. raimondii, using DNA samples prepared from Cotton Microsatellite Database (CMD) 10 (refs. 15,16), a genetic standard that originated from a single seed (accession D5-3) in 2004 and was brought to near homozygosity by six successive generations of self-fertilization. We believe that sequencing of the G. raimondii genome will not only provide a major source of candidate genes important for the genetic improvement of cotton quality and productivity, but may also serve as a reference for the assembly of the tetraploid G. hirsutum genome.

Results

Sequencing and assembly

A whole-genome shotgun strategy was used to sequence and assemble the G. raimondii genome. A total of 78.7 Gb of next-generation Illumina paired-end 50-bp, 100-bp and 150-bp reads was generated by sequencing genome shotgun libraries of different fragment lengths (170 bp, 250 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb) that covered 103.6-fold of the 775.2-Mb assembled G. raimondii genome (Supplementary Table 1). The resulting assembly appeared to cover a very large proportion of the euchromatin of the G. raimondii genome. The unassembled genomic regions are likely to contain heterochromatic satellites, large repetitive sequences or ribosomal RNA (rRNA) genes. Of the 1,369 molecular markers in a previously reported consensus genetic linkage map17, 599 (43.8%) were unambiguously located on the assembly, allowing us to anchor 567.2 Mb (73.2%) of the assembly on the G. raimondii chromosomes (Supplementary Fig. 1).

The assembly, performed with SOAPdenovo18,19, consisted of 41,307 contigs and 4,715 scaffolds and accounted for approximately 88.1% of the estimated G. raimondii genome8 (Table 1). Over 73% of the assembly was in 281 chromosome-anchored scaffolds, with 228 of them both anchored and oriented (Supplementary Fig. 1). The N50 (the size above which 50% of the total length of the sequence assembly can be found) of contigs and scaffolds was 44.9 kb and 2,284 kb, respectively, with the largest scaffold measuring 12.8 Mb (Supplementary Table 2). As indicated by sequencing depth distribution analysis, 98.8% of the assembly was sequenced at 10× coverage (Supplementary Fig. 2). Of the 58,061 ESTs (>500 bp in length) reported in G. raimondii, 93.4% were identified in the assembly (Supplementary Table 3). Sequences of 24 of the 25 randomly selected, completely sequenced G. raimondii BAC clones downloaded from GenBank (AC243106–AC243130) were fully recovered from our assembly (Supplementary Table 4), supporting the view that the G. raimondii genome was assembled properly. Coding regions (exons), introns, DNA transposable elements, long terminal repeats (LTRs) and other repeat sequences made up 6.4%, 6.9%, 4.4%, 42.6% and 13.0% of the total genome content, respectively (Fig. 1). On most G. raimondii chromosomes, genes were more abundant in the subtelomeric regions (Fig. 1), as previously reported for Theobroma cacao20 and Zea mays21. Transposable elements were distributed largely in gene-poor regions (Fig. 1).
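The N50 definition given above can be made concrete with a short calculation. The following is a minimal sketch, not the code used in the study; the function name `n50` and the toy scaffold lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches half of the total assembly length; the length
    of the sequence that crosses this point is the N50.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in kb), for illustration only.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))
```

In the toy assembly the total length is 300 kb, and the two longest scaffolds together already exceed half of it, so the second-longest scaffold sets the N50.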

Table 1 Global statistics for the G. raimondii genome assembly and annotation
Figure 1: Genomic overview of the 13 assembled G. raimondii chromosomes.

Major DNA components are categorized into exons, introns, DNA transposable elements (TEs), LTRs (retrotransposons) and other (repeat sequence other than DNA TEs and LTRs). Gray color indicates DNA elements not defined by the previous five terms. All categories were determined for 1.0-Mb windows with a 0.05-Mb shift.
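The windowing scheme in this legend (1.0-Mb windows shifted by 0.05 Mb) can be sketched as follows. This is an illustrative sketch only: the function names and the toy exon coordinates are hypothetical, and feature intervals are assumed to be half-open and non-overlapping within a category.

```python
def window_starts(chrom_len, window=1_000_000, step=50_000):
    """Start coordinates (0-based) of every full-length window."""
    if chrom_len < window:
        return [0]  # sequence shorter than one window: single window at 0
    return list(range(0, chrom_len - window + 1, step))


def covered_fraction(intervals, start, end):
    """Fraction of [start, end) covered by half-open, non-overlapping intervals."""
    covered = sum(max(0, min(e, end) - max(s, start)) for s, e in intervals)
    return covered / (end - start)


# Hypothetical toy chromosome of 1.2 Mb with two annotated exon blocks.
exons = [(0, 100_000), (1_100_000, 1_200_000)]
for start in window_starts(1_200_000):
    print(start, covered_fraction(exons, start, start + 1_000_000))
```

Because consecutive windows overlap by 95%, the resulting profile is smooth, which suits a chromosome-scale overview plot.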

Gene content, annotation and analysis of major gene families

Genome annotation was performed by combining results obtained from ab initio prediction, homology search and EST alignment. We identified 40,976 protein-coding genes in the G. raimondii genome, with an average transcript size of 2,485 bp (GLEAN) and a mean of 4.5 exons per gene (Table 1 and Supplementary Table 5). There were 348 microRNAs (miRNAs), 565 rRNAs, 1,041 tRNAs and 1,082 small nuclear RNAs (snRNAs) in the G. raimondii genome (Table 1 and Supplementary Table 6). Among the annotated genes, 83.69% encode proteins that show homology to proteins in the TrEMBL database, and 69.98% were identified in InterPro (Supplementary Table 7). In total, 71.68% of the predicted genes were supported by at least two methods (Supplementary Table 8). Overall, 92.2% (37,780 of 40,976) of predicted coding sequences from the genome were supported by transcriptome sequencing data (Supplementary Fig. 3), which indicates the high accuracy of the G. raimondii gene predictions. Compared to the smaller Arabidopsis thaliana genome22, the G. raimondii genome had a higher gene number, a similar exon number per gene and a lower mean gene density per 100 kb of genomic DNA sequence.

Comparative analysis of G. raimondii with T. cacao20, A. thaliana22 and Z. mays23 showed that these four different plant species possess similar numbers of gene families, with a core set of 9,525 in common (Supplementary Fig. 4). Of the 16,113 G. raimondii gene families, all but 1,267 were conserved in at least 1 other plant genome (Supplementary Fig. 4). Analysis of species- and lineage-specific families identified potential inconsistencies between annotation projects but also reflected genuine biological differences in gene inventories.

Phylogeny, paleohexaploidization and whole-genome duplication

Although large-scale duplication events were predicted to have occurred during Gossypium evolution, the number and timing of these genome duplications are still being debated24,25,26. By examining 745 single-copy gene families from 9 sequenced plant genomes (Supplementary Fig. 5), we found that G. raimondii and T. cacao belong to a common subclade and probably diverged from a common ancestor approximately 33.7 million years ago (Fig. 2a). Carica papaya and A. thaliana belong to another subclade that diverged from the G. raimondii–T. cacao subclade approximately 82.3 million years ago (Fig. 2a).

Figure 2: Genome evolution and duplication.

(a) Phylogenetic analysis showed that G. raimondii and T. cacao separated approximately 33.7 million years ago (MYA). O. sativa, a monocot, was used as the outgroup. (b) Ks distributions of G. raimondii. Yellow line, Ks of all paralogous gene pairs; black line, Ks of tandem gene pairs only; green line, Ks of all except tandem gene pairs.

Using substitutions per synonymous site (Ks) values obtained from 3,195 paralogous gene pairs in the G. raimondii and T. cacao genomes, we observed two peaks at Ks values of 0.40–0.60 and 1.50–1.90 (Fig. 2b). The first peak appeared at approximately 16.6 (13.3–20.0) million years ago, corresponding to the whole-genome duplication event that was previously proposed in the Gossypium lineage25,26. The second peak appeared at approximately 130.8 (115.4–146.1) million years ago, corresponding to the paleohexaploidization event shared by the eudicots27,28. In T. cacao, a single peak value between 1.7 and 1.9 has been reported20, which corresponds to the second peak observed in G. raimondii (Fig. 2b), indicating that the paleohexaploidization event shared by the eudicots occurred between 115.4 and 146.1 million years ago in a common progenitor before speciation into the two present-day species 33.7 million years ago.
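The text does not state the substitution rates used to date the two peaks. The quoted ages are, however, consistent with the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below uses rates back-calculated from the quoted dates; both rate constants and the function name are assumptions made here for illustration, not values reported by the authors.

```python
def ks_to_mya(ks, rate):
    """Convert a Ks value to a divergence time in million years.

    rate: synonymous substitutions per site per year (an assumed value).
    The factor of 2 accounts for substitutions accumulating independently
    along both lineages since the duplication.
    """
    return ks / (2.0 * rate) / 1e6


# Rates back-calculated from the dates quoted in the text (assumptions).
RECENT_RATE = 1.5e-8   # matches 13.3-20.0 MYA for Ks 0.40-0.60, within rounding
ANCIENT_RATE = 6.5e-9  # matches 115.4-146.1 MYA for Ks 1.5-1.9, within rounding

for ks, rate in [(0.40, RECENT_RATE), (0.60, RECENT_RATE),
                 (1.5, ANCIENT_RATE), (1.9, ANCIENT_RATE)]:
    print(f"Ks={ks:.2f} -> {ks_to_mya(ks, rate):.1f} MYA")
```

Note that the two peaks are only reproduced with different rates, so the ages of the two events should not be compared as if a single molecular clock had been applied.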

Comprehensive searches for evidence of whole-genome duplication were performed using an all-versus-all blastp approach comparing the G. raimondii and T. cacao genomes. Results indicated that the two genomes possess a moderate syntenic relationship, such that 463 collinear blocks (with ≥5 genes per block) covering 64.8% and 74.41% of the assembled G. raimondii and T. cacao genomes, respectively, are aligned (Fig. 3a, Supplementary Fig. 6 and Supplementary Table 9a). Reciprocal best-BLAST-match analysis showed the existence of 133 duplicated and 43 triplicated regions in G. raimondii relative to T. cacao (Fig. 3a). There were 2,355 syntenic blocks among the 13 G. raimondii chromosomes. Among these blocks, 21.2% were found to involve only two chromosome regions, 33.7% spanned three chromosome regions and 16.2% traversed four chromosome regions (Fig. 3b and Supplementary Fig. 7). Chromosome 8 was highly fragmented, with 310 blocks that matched other chromosomes, probably as a result of multiple rounds of duplication, diploidization and chromosomal rearrangement in the genome (Fig. 3b). Thirty-nine triplicated chromosomal regions in the G. raimondii genome were observed (Supplementary Table 9b).
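The pairing step behind a reciprocal best-match analysis can be sketched as follows. This illustrates the general technique only, not the authors' pipeline: the input format (a dictionary of bit scores keyed by query and subject), the gene identifiers and the function names are all hypothetical, and the subsequent grouping of matches into duplicated or triplicated regions is not shown.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict {(query, subject): bit_score}. On ties, the hit
    encountered first is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy scores between three cotton and two cacao genes.
gr_vs_tc = {("Gr1", "Tc1"): 200, ("Gr1", "Tc2"): 150,
            ("Gr2", "Tc2"): 180, ("Gr3", "Tc2"): 170}
tc_vs_gr = {("Tc1", "Gr1"): 200, ("Tc2", "Gr1"): 150,
            ("Tc2", "Gr2"): 180, ("Tc2", "Gr3"): 170}
print(reciprocal_best_hits(gr_vs_tc, tc_vs_gr))
```

In the toy data, Gr3 is dropped because its best match, Tc2, has a stronger match to Gr2.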

Figure 3: Comparison of syntenic blocks between the genomes of T. cacao and G. raimondii and reorganization of G. raimondii chromosomes.

(a) Syntenic blocks between T. cacao and G. raimondii. Tc, chromosome of T. cacao; Gr, chromosome of G. raimondii. (b) Syntenic blocks among different G. raimondii chromosomes. The G. raimondii chromosomes are shown in the outer circle in mosaic form, with each color designating its origin from one of the seven ancient chromosomes. Only syntenic blocks longer than 700 kb are shown.

Expansion of transposable elements

Transposable elements are known to contribute substantially to changes in genome size, and they comprise approximately 57% (441 Mb in total length) of the G. raimondii genome (Table 1 and Supplementary Table 10). In comparison, 24% of the T. cacao20 genome and 14% of the A. thaliana genome are composed of transposable elements22, suggesting that substantial transposable element proliferation in G. raimondii is partly responsible for the expansion of its genome. In-depth sequence analysis showed that the most widespread repetitive sequences in the G. raimondii genome were the gypsy and copia-like LTRs, which account for 33.83% and 11.10% of the genome, respectively (Supplementary Table 10). The growth rate of these LTR retrotransposons in G. raimondii and T. cacao tended to slow down after 0.5 and 0.7 million years ago, respectively (Fig. 4a). By contrast, the number of LTR retrotransposons has increased in A. thaliana since 1.5 million years ago (Fig. 4a).

Figure 4: Comparisons of LTRs and transposable elements in the G. raimondii, T. cacao and A. thaliana genomes.

(a) The distribution curve for the number and insertion time of LTRs in different plant genomes. (b) Phylogeny of LTR retrotransposons in the G. raimondii, T. cacao and A. thaliana genomes. (c) Distance distributions of nearest transposable elements (TEs) from each gene.

Phylogenetic analysis supported the notion that a larger expansion of specific LTR retrotransposon clades had occurred in G. raimondii than in T. cacao and A. thaliana (Fig. 4b). An analysis of the repeat divergence rate distribution (percentage of substitutions in the corresponding region compared with consensus repeats in constructed libraries) independently confirmed the proliferation pattern for LTR retrotransposons in the G. raimondii genome (Supplementary Fig. 8). Coupled with higher transposable element content in its genome, G. raimondii was found to have a higher proportion of genes near (within 1 kb of) transposable elements than T. cacao and A. thaliana (Fig. 4c). By contrast, T. cacao maintained the greatest distance between its genes and transposable elements (Fig. 4c).
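The gene-to-transposable-element distance comparison can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: all features lie on one chromosome, coordinates are half-open intervals in bp, and the function names and toy coordinates are hypothetical. A genome-scale analysis would use a sorted index rather than this all-against-all scan.

```python
def interval_gap(a, b):
    """Gap in bp between two half-open intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def nearest_te_distance(gene, tes):
    """Distance from one gene to its closest transposable element."""
    return min(interval_gap(gene, te) for te in tes)


def fraction_near_te(genes, tes, max_dist=1_000):
    """Fraction of genes lying within max_dist bp of any transposable element."""
    near = sum(1 for gene in genes if nearest_te_distance(gene, tes) <= max_dist)
    return near / len(genes)


# Hypothetical toy annotation on a single chromosome.
genes = [(1_000, 2_000), (10_000, 12_000), (50_000, 51_000)]
tes = [(2_500, 3_000), (20_000, 21_000)]
print([nearest_te_distance(g, tes) for g in genes])
print(fraction_near_te(genes, tes))
```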

Simple sequence repeats (SSRs) in the G. raimondii genome

SSRs behave as polymorphic loci that provide a rich source of markers for cotton breeding as well as for genetic studies. A total of 15,503 di-, tri- and tetranucleotide SSRs, representing 34 distinctive motif families, were identified and annotated in the G. raimondii genome (Supplementary Fig. 9). We randomly selected 500 of them to study polymorphisms between the mapping parents G. hirsutum 'CCRI36' and G. barbadense 'Hai1' and found that 70 primer pairs, or 14%, showed polymorphisms. PCR amplification results for 15 of these primer pairs are shown in Supplementary Figure 10.
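The text does not give the criteria used to call SSRs. As an illustration of how di-, tri- and tetranucleotide repeats can be located in a sequence, the sketch below uses regular expressions with back-references. The minimum repeat counts, the function name and the test sequence are assumptions chosen for this example and are not the thresholds of the study.

```python
import re

# Minimum number of tandem copies per motif length. These thresholds are
# illustrative assumptions, not the criteria used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_n in sorted(min_repeats.items()):
        # One captured unit followed by at least (min_n - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAA
            if unit_len == 4 and motif[:2] == motif[2:]:
                continue  # (ATAT)n is really a dinucleotide repeat (AT)n
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical test sequence containing (AT)7 and (GAA)5.
toy = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "TTC"
print(find_ssrs(toy))
```

Only perfect repeats are reported; interrupted or compound SSRs would need additional handling.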

Analysis of genes involved in cotton fiber initiation and elongation

Qualitative transcript differences in key fiber development genes2,3,29 were found between the non-fibered G. raimondii and the fibered G. hirsutum species, as revealed by transcriptome (RNA sequencing, RNA-seq) analysis using samples extracted from cotton ovules 3 days post-anthesis (DPA). Of the four sucrose synthase (Sus) genes identified in the genome, three (SusB, Sus1 and SusD) were expressed at substantially higher levels in G. hirsutum than in G. raimondii (Fig. 5a). Several 3-ketoacyl-CoA synthase (KCS) genes, including KCS2, KCS13 and KCS6, were only expressed in G. hirsutum, whereas intermediate levels of KCS7 transcripts were observed in both G. hirsutum and G. raimondii (Fig. 5b), indicating that high-level expression of Sus and KCS family genes may indeed be required for fiber cell initiation and elongation. By contrast, extremely high amounts of transcripts encoding 1-aminocyclopropane-1-carboxylic acid oxidase (ACO) activities were recovered from G. raimondii at the 3-DPA stage (Fig. 5c), which is suggestive of a major role for the plant hormone ethylene during early fiber cell development.

Figure 5: Topological trees and expression patterns of Sus, KCS, ACO, MYB and bHLH family genes in the transcriptome of G. raimondii and G. hirsutum.

(a) Major sucrose synthase genes (Sus) were expressed at substantially higher levels in G. hirsutum ovules with developing fiber initials than in those of G. raimondii. (b) Substantially more 3-ketoacyl-CoA synthase (KCS) transcripts were found in G. hirsutum ovules. (c) Substantially more 1-aminocyclopropane-1-carboxylic acid oxidase (ACO) transcripts were found in G. raimondii ovules. (d) G. hirsutum preferentially expressed MYB transcription factors. (e) G. hirsutum preferentially expressed bHLH transcription factors. Shown in each panel are the topological tree (left) and comparison of expression levels (right) between the two cotton species. Expression levels were estimated by reads per kilobase of mapped cDNA per million reads (RPKM) values for each gene obtained by sequencing RNA samples from 3-DPA G. raimondii and G. hirsutum ovules.

Previous researchers have postulated that the cotton fiber is similar in form and origin to plant trichomes, hair-like epidermal cells that occur on various plant organs but are common to leaf and stem surfaces. As postulated, transcription factors that have important roles in A. thaliana trichome development may be related to factors involved in cotton fiber formation4,30,31. In A. thaliana, MYB and bHLH class transcription factors work in a complex in combination with TTG1 to specify a particular epidermal cell fate30. A total of 2,706 transcription factors, including 208 bHLH and 219 MYB class genes, were identified in the G. raimondii genome (Supplementary Table 11). A large number of MYB (Fig. 5d) and bHLH (Fig. 5e) genes were expressed predominantly in G. hirsutum ovules, with only remnant levels found in the ovules of G. raimondii, indicating that some of these genes may be required for early fiber development.

Gossypol biosynthesis genes

Cotton is known to produce a unique group of terpenoids that include desoxyhemigossypol, hemigossypol, gossypol, hemigossypolone and the heliocides. Cotton plants accumulate gossypol and related sesquiterpenoids in pigment glands as a defense against pathogens and herbivores. The majority of cotton sesquiterpenoids are derived from a common precursor, (+)-δ-cadinene, which is synthesized by (+)-δ-cadinene synthase (CDN) via cyclization of farnesyl diphosphate, in the first committed step in gossypol biosynthesis32,33. Previously, both CDN-A and CDN-C were reported to encode the proposed enzyme activity34. Phylogenetic analysis performed here using G. raimondii and eight other sequenced plant genomes, including T. cacao20, A. thaliana22, Oryza sativa23, C. papaya35, Vitis vinifera36, Populus trichocarpa37, Glycine max38 and Ricinus communis39, showed that, except for O. sativa, terpene cyclase gene families are common in various plant species (Fig. 6 and Supplementary Fig. 11). However, G. raimondii and probably T. cacao were the only plant species that possess an authentic CDN1 gene family with the proposed biochemical function (Fig. 6 and Supplementary Fig. 11). It seemed that the ability to synthesize gossypol is related to both the paleohexaploidization and the whole-genome duplication events that were observed (Fig. 2b). No CDN1 orthologs were found in P. trichocarpa or C. papaya, the most closely related subclade, suggesting that gossypol production evolved after the separation of these plant species. This conclusion was supported by a recent publication that indicated the key importance of two aspartate-rich Mg²⁺-binding motifs, DDtYD and DDVAE, for gossypol biosynthesis40. No other plant terpene cyclase genes encode proteins with the DDVAE motif, and they therefore cannot be recognized as CDN orthologs.

Figure 6: Phylogenetic analysis of the CDN1 gene family in G. max, P. trichocarpa, A. thaliana, C. papaya, V. vinifera, R. communis, T. cacao and G. raimondii.

The phylogenetic tree and multiple-sequence alignment were established using the neighbor-joining method with MEGA 4 software42. Bootstrap numbers greater than 50 are shown on the branches.

Discussion

We have sequenced the genome of G. raimondii using a next-generation Illumina paired-end sequencing strategy, yielding an assembled sequence with 103.6-fold genome coverage. The draft sequence covered 88.1% of the estimated G. raimondii genome size. Compared with other sequenced plant genomes, G. raimondii showed substantially lower gene density with a high proportion of transposable elements despite being one of the smallest Gossypium genomes. One independent whole-genome duplication event occurred approximately 13.3 to 20.0 million years ago, and one paleohexaploidization event that is commonly found in eudicots was clearly observed in the G. raimondii genome. The dates of these events reported here agree with those proposed in previous studies25,26. G. hirsutum, an allotetraploid species, is believed to be the product of a hybridization of two parental diploid species with A and D genomes41. An average Ks value of 0.042 was previously reported for tetraploid formation on the basis of an analysis of 42 pairs of paralogous G. hirsutum genes24.

Qualitative differences were found for genes encoding Sus, KCS and ACO activities by comparing the transcriptomes of fiber-bearing G. hirsutum and the non-fibered G. raimondii. These results indicate that Sus, KCS and ACO are necessary for cotton fiber development, as was proposed in previous individual studies2,3,29. Also, the MYB and bHLH transcription factors preferentially expressed in fiber reported herein may be used to elucidate the molecular mechanisms governing fiber initiation and early cell growth. Greater understanding of gossypol and related sesquiterpenoid biosynthesis genes may enable engineering of these genes for better defense against pathogens and herbivores in the cotton field. We suggest that sequencing of the G. raimondii genome is a major step toward fully deciphering and analyzing the genomes of the genus Gossypium to improve cotton productivity and fiber quality.

URLs.

Genome browser for G. raimondii at the Cotton Genome Project, http://cgp.genomics.org.cn/; G. raimondii genome sequencing data at NCBI BioProject, http://www.ncbi.nlm.nih.gov/bioproject/?term=%20PRJNA82769; MarketPublishers, http://marketpublishers.com/; CocoaGen DB, http://cocoagendb.cirad.fr/; Arabidopsis Information Resource, http://www.arabidopsis.org/; The Rice Annotation Project Database, http://rapdb.dna.affrc.go.jp/; The Hawaii Papaya Genome Project, http://asgpb.mhpcc.hawaii.edu/papaya/; genome assembly of V. vinifera, http://www.genoscope.cns.fr/spip/Vitis-vinifera-e.html; genome assembly of G. max, http://www.phytozome.net/soybean; Castor Bean Genome Database, http://castorbean.jcvi.org/; genome assembly of P. trichocarpa, http://www.phytozome.net/poplar; SOAPdenovo, http://soap.genomics.org.cn/; estclean, https://sourceforge.net/projects/estclean/; SSPACE, http://www.baseclear.com/landingpages/sspacev12/.

Methods

Germplasm genetic resources.

DNA samples of the D genome were obtained from CMD 10 (refs. 15,16), a genetic standard that originated from a single seed (accession D5-3) in 2004 and was brought to near homozygosity by six successive generations of self-fertilization in the greenhouse. G. raimondii D5-3 (CMD 10) was maintained in the nursery on the China National Wild Cotton Plantation in Sanya, and the G. hirsutum genetic standard, TM-1 (CMD 1), was grown under standard greenhouse conditions with the temperature maintained at 32 °C during the daytime. Fresh young leaves were collected, immediately frozen in liquid nitrogen and stored at −80 °C until DNA extraction.

DNA extraction, library construction and sequencing.

We used the standard phenol/chloroform method for DNA extraction, with RNase A and proteinase K treatment to remove RNA and protein contamination. The extracted DNA was then precipitated with ethanol. Genomic libraries were prepared following the manufacturer's standard instructions and sequenced on the Illumina HiSeq 2000 platform. To construct the paired-end libraries, DNA was fragmented by nebulization with compressed nitrogen gas, the DNA ends were blunted and an A base was added to the 3′ ends. DNA adaptors with a single T-base 3′-end overhang were ligated to the above products. Ligation products were size-selected on 0.5%, 1% or 2% agarose gels, depending on the target insert size, and were purified from the gels (Qiagen Gel Extraction kit, 28704). We constructed G. raimondii genome sequencing libraries with insert sizes of 170 bp, 250 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb.

Genome assembly.

The G. raimondii genome was assembled using SOAPdenovo with a k-mer size of 41 and SSPACE software. We first assembled the reads from the short-insert libraries (<2 kb) to obtain long contigs. Then, the reads from the long-insert libraries (≥2 kb and <40 kb) were aligned to the contigs to form scaffolds. Finally, we used the paired-end relationships of reads from the 40-kb library to construct super-scaffolds.

Chromosome anchoring.

We aligned the marker sequences from the cotton consensus map17 to the scaffolds using blastn (identities ≥ 95%; e value ≤ 1.0 × 10⁻⁶; coverage ≥ 85%), and the best-scoring match was chosen in cases of multiple matches.
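The filtering rule described above can be sketched as follows. The thresholds are those given in the text; everything else is an assumption made for illustration: the record field names, the toy hits, the function name, and in particular the definition of coverage as aligned length divided by marker length, which the text does not specify.

```python
def anchor_markers(hits, min_identity=95.0, max_evalue=1e-6, min_coverage=0.85):
    """Return {marker: scaffold}, keeping the best-scoring hit that passes all filters.

    hits: iterable of dicts with the hypothetical keys 'marker', 'scaffold',
    'identity' (percent), 'evalue', 'align_len', 'marker_len' and 'bitscore'.
    """
    best = {}
    for hit in hits:
        # Assumed definition: fraction of the marker covered by the alignment.
        coverage = hit["align_len"] / hit["marker_len"]
        if (hit["identity"] < min_identity
                or hit["evalue"] > max_evalue
                or coverage < min_coverage):
            continue
        current = best.get(hit["marker"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["marker"]] = hit
    return {marker: hit["scaffold"] for marker, hit in best.items()}


# Hypothetical toy hits: m1 has two passing matches, m2 fails identity,
# m3 fails coverage.
toy_hits = [
    {"marker": "m1", "scaffold": "s1", "identity": 99.0, "evalue": 1e-50,
     "align_len": 190, "marker_len": 200, "bitscore": 350},
    {"marker": "m1", "scaffold": "s2", "identity": 96.0, "evalue": 1e-30,
     "align_len": 180, "marker_len": 200, "bitscore": 300},
    {"marker": "m2", "scaffold": "s3", "identity": 90.0, "evalue": 1e-20,
     "align_len": 195, "marker_len": 200, "bitscore": 250},
    {"marker": "m3", "scaffold": "s4", "identity": 98.0, "evalue": 1e-40,
     "align_len": 100, "marker_len": 200, "bitscore": 180},
]
print(anchor_markers(toy_hits))
```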

Genome synteny and whole-genome duplication analysis.

We used blastp (identity ≥ 40%; e value ≤ 1.0 × 10⁻⁵; match length of more than 100 amino acids) to detect paralogous genes in G. raimondii and T. cacao, and we applied OrthoMCL to detect gene families43. For each paralogous gene family, the Ks of each pair was calculated using the PAML package44, and the median was selected to represent the Ks of the family.
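The per-family summary step (one median Ks per paralogous family) can be sketched as follows. The Ks values themselves would come from PAML; the family identifiers, the toy values and the function name are hypothetical.

```python
from statistics import median


def family_ks(pair_ks):
    """Summarize each family by the median Ks of its paralogous pairs.

    pair_ks: dict {family_id: [Ks of each pair]}. Families without any
    pairwise estimate are skipped.
    """
    return {family: median(values) for family, values in pair_ks.items() if values}


# Hypothetical toy input: three families with pairwise Ks estimates.
toy = {"fam1": [0.45, 0.52, 0.58], "fam2": [1.6, 1.8], "fam3": []}
print(family_ks(toy))
```

Using the median rather than the mean limits the influence of a single saturated or poorly aligned pair on the family-level value.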

RNA-seq analysis.

Total RNA was isolated from 0-DPA and 3-DPA ovules of G. raimondii D5-3 (CMD 10) and G. hirsutum TM-1 (CMD 1) and from mature leaves of G. raimondii D5-3 (CMD 10). Normalized pools were converted to full-length enriched cDNA using the SMART method and were sequenced using Illumina protocols. All reads were filtered to trim the adaptor sequences using estclean. Clean reads (with at least 20 nucleotides remaining after trimming) were then mapped to the G. raimondii gene models using CLC Genomics Workbench software 4 (CLC bio A/S Science), and matches were converted to RPKM to estimate gene expression levels.
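RPKM normalizes a read count by both transcript length and sequencing depth. The following is a minimal sketch of the standard formula, not of the CLC Genomics Workbench implementation; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000-bp model in a 10-million-read library.
print(rpkm(500, 2_000, 10_000_000))
```

Because the value depends on the total number of mapped reads in each library, RPKM values from the two species are comparable only to the extent that reads from both were mapped to the same set of gene models, as described above.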

Accession codes.

G. raimondii genome sequencing data are available at NCBI BioProject under accession PRJNA82769. Sequencing data for G. raimondii and G. hirsutum transcriptome analyses are available in the NCBI Sequence Read Archive (SRA) under accessions SRA048621 and SRA048874.