Main

The allotetraploid species G. hirsutum L. is both the world's most important fiber crop and a model polyploid crop1,2. The Gossypium genus originated from paleo-hexaploidy of a eudicot progenitor and subsequent diversification into eight diploid genome groups, including A–G and K3,4. The A-genome diploids of the genus, which are native to Africa, diverged from the eudicot progenitor together with the Mexican-derived D-genome diploids about 5–10 million years ago (MYA)5,6,7. These two species were reunited geographically around 1–2 MYA by the transoceanic dispersal8 of an A-genome ancestor resembling G. arboreum (AA) to the New World. Subsequently, the A-genome progenitor hybridized with a New World D-genome ancestor resembling G. raimondii (DD), followed by chromosome doubling. The resulting dicotyledonous allotetraploid cotton dispersed from the Americas to the western Pacific and diverged into at least five well-established allotetraploid Gossypium species (AtAtDtDt; 2n = 4x = 52, where 't' stands for tetraploid), including G. hirsutum, the Upland cotton that accounts for more than 90% of commercial cotton production worldwide8.

High-quality assembly of allopolyploid plant genomes is a formidable task because the genomes are large and have highly homeologous subgenomes. A draft genome of the 17-gigabase (Gb) monocotyledonous allohexaploid crop Triticum aestivum (2n = 6× = 42, AABBDD) was recently reported by an international consortium9. As a result of their concerted efforts, 61% of the scaffolds were anchored and oriented on 21 pseudochromosomes9. In another study, 63% (712.3 Mb) of the 1,130-Mb allotetraploid (2n = 4× = 38, AnAnCnCn) Brassica napus genome was assembled. Of the 20,702 B. napus scaffolds, 18,278 were assigned to either the An or the Cn subgenome, which together comprise 19 chromosomes10.

Here we sequenced and assembled the allotetraploid genome of G. hirsutum using DNA prepared from the highly homozygous TM-1 pure line11,12. We compared the G. hirsutum assembly to the putative ancestral species, G. raimondii6 and G. arboreum5, to investigate subgenome evolution and gene function including genes related to fiber biology.

Results

Assembly of the G. hirsutum genome

The size of the allotetraploid genome of Upland cotton G. hirsutum has been estimated, using various methods, as 2.25–2.43 Gb13,14. We generated a total of 445.7 Gb, or 181-fold haploid genome coverage, of raw paired-end Illumina reads by sequencing whole genome shotgun (WGS) libraries of homozygous cv. 'TM-1' with fragment lengths ranging from 250 bp to 40 kb (Supplementary Table 1). Owing to the existence of abundant repetitive sequences and homeologous chromosomes (Supplementary Table 2), we were unable to assemble this allotetraploid genome satisfactorily using only the WGS data. Supplemental use of a bacterial artificial chromosome (BAC-to-BAC) sequencing strategy substantially improved the assembly. A total of 100,187 BACs, which corresponded to about fivefold genome coverage, were sequenced and used in the final assembly (Supplementary Fig. 1a,b). Each BAC was assembled individually before genome assembly (Supplementary Fig. 1c). The genome assembly combined the sequenced BACs with the paired-end WGS data. A total of 2,173 Mb of the G. hirsutum genome sequence was assembled using SOAPdenovo15,16, with the largest scaffold being 8.4 Mb (Table 1). This corresponds to 96.7% of the previous estimation of nuclear DNA content13, or 89.6% according to a more recent report14. The N50 (the length such that sequences of at least this length contain 50% of the total assembly length) of the contigs and scaffolds was 80 kb and 764 kb, respectively, which was better than the assembly that used WGS data only (N50 of contigs and scaffolds was 20 kb and 107 kb, respectively; Table 1 and Supplementary Tables 2 and 3).
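As an illustration of the N50 statistic quoted above, the following minimal sketch computes N50 from a list of sequence lengths. The function name and the toy lengths are hypothetical and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the total assembly length; the
    length of the sequence that crosses this threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical scaffold lengths (in kb), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because the statistic depends only on the length distribution, it can be computed in the same way for contigs and for scaffolds.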

Table 1 G. hirsutum genome assembly and annotation

The quality of the assembly was assessed by aligning scaffolds to BAC sequences obtained using Sanger technology, and also by mapping available expressed sequence tags (ESTs) to the AtDt genome. All 223 sequenced BACs were recovered with >96% sequence identity in our assembly (Supplementary Table 4). Of the 108,790 transcripts derived from transcriptome sequencing of G. hirsutum, 98.9% were detected in our assembly (Supplementary Table 5). Of the assembled genome, 1,923 Mb or 88.5% was anchored and oriented to 26 pseudochromosomes according to a high-resolution genetic map that we constructed based on 39,662 co-dominant single-nucleotide polymorphism (SNP) markers produced from the TM-1 × 3-79 recombinant inbred line (RIL) population of 167 individual lines (Supplementary Tables 6 and 7).

Genome duplications and tetraploidization

All present-day allopolyploid cottons are thought to have arisen from a single common lineage derived from the hybridization of A2 and D5 genome ancestors3,17,18. However, early polyploidization events in the evolution of the Gossypium genus occurred before the AADD lineage emerged. In comparisons of all sequenced Gossypium genomes with the seven ancestral chromosome groups reported in grape (Vitis vinifera)19, we found that the paleopolyploidy was shared at orthologous positions from the ancestor to G. raimondii, G. arboreum and G. hirsutum (Fig. 1a). These results confirm previous reports of the paleohexaploid origin of eudicot species such as Theobroma cacao20 and Eucalyptus grandis21. Clear-cut chromosomal DNA fragmentation patterns were observed in eudicot genomes that underwent zero (as in V. vinifera19 and T. cacao20), one (as in all three cotton species) or two (as in Arabidopsis thaliana22 and Glycine max23) whole genome duplications during their evolution. It should be noted that chromosome 16 of V. vinifera might not have been assembled correctly, as it had much lower than expected percentages of remnant ancestral DNA (Fig. 1a).

Figure 1: Evolution and syntenic analysis of the G. hirsutum genome.

(a) G. hirsutum and six other genomes descended from common eudicot genome ancestors. Colored blocks within modern chromosomes of the species represent the chromatin origin from seven ancestral chromosomes. Numbers denote the predicted divergence times (MYA) and each red dot represents one whole genome duplication. (b) Syntenic blocks between the At subgenome in G. hirsutum and the diploid A genome in G. arboreum. (c) Syntenic blocks between the Dt subgenome in G. hirsutum and the diploid D genome in G. raimondii.


According to our calculations, the ancient hexaploidization event shared among the eudicots24 and the more recent duplication event in the progenitor of G. arboreum (A genome) and G. raimondii (D genome) occurred around 115–146 and 13–20 MYA, respectively (Supplementary Fig. 2), which is similar to what was previously reported5,6. An additional peak around 1.5 MYA that corresponds to the predicted hybridization and subsequent polyploidization event7 was observed only in G. hirsutum (AtDt genome) (Supplementary Fig. 2). We assigned the 26 pseudochromosomes of G. hirsutum to either the At or Dt subgenome based on syntenic relationships with its predicted diploid ancestors as well as the published genetic maps12 (Fig. 1b,c and Supplementary Fig. 3). The anchored At subgenome (1,170 Mb) is larger than the Dt subgenome (753 Mb), consistent with the sizes of the anchored A genome5 (1,530 Mb, 90.4% of 1,694 Mb) and D genome6 (567 Mb, 73.2% of 775 Mb). Co-linearity analysis of the current assembly indicated that the G. hirsutum genome is of high quality (Supplementary Fig. 4).
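The anchoring figures reported for G. hirsutum can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text (2,173 Mb assembled; 1,170 Mb and 753 Mb anchored to At and Dt); the variable names are ours.

```python
# Values taken from the text, in Mb.
assembled_mb = 2173
anchored_at_mb = 1170
anchored_dt_mb = 753

# Total anchored sequence and its share of the assembly.
anchored_total_mb = anchored_at_mb + anchored_dt_mb
anchored_percent = round(100 * anchored_total_mb / assembled_mb, 1)

# Relative size of the two anchored subgenomes.
at_to_dt_ratio = round(anchored_at_mb / anchored_dt_mb, 2)

print(anchored_total_mb, anchored_percent, at_to_dt_ratio)
```

The sum of the two subgenomes reproduces the 1,923 Mb (88.5%) reported as anchored to the 26 pseudochromosomes.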

Annotation and gene content

We annotated 76,943 gene models in the G. hirsutum genome by combining ab initio gene prediction, homolog protein data search, EST alignment and assembly of RNA-seq reads (Supplementary Table 8). In bread wheat, a total of 124,201 gene loci have been identified, and 101,040 have been identified in oilseed rape9,10. These and our data suggest that the coding capacity of a plant species is roughly proportional to its ploidy level. About 84.5% of our gene models have homolog matches in databases, including KEGG and GO (Supplementary Table 9). Of the gene models observed in G. hirsutum, 93.76% or 72,142 were evenly distributed along chromosomes, with 35,056 in the A or At subgenomes and 37,086 in the D or Dt subgenomes. Also, the Upland cotton genome encodes 602 microRNAs (miRNAs), 2,153 rRNAs, 2,050 tRNAs and 8,325 small nuclear RNAs (snRNAs) (Table 1 and Supplementary Table 10).

We scanned the G. hirsutum genome, using an all-versus-all blastp approach, for syntenic blocks and found that the Dt and At subgenomes had high collinear relationships with the genomes of G. raimondii and G. arboreum, respectively. There were 1,801 collinear blocks between G. arboreum and G. hirsutum, which covered 68.2% and 65.9% of the respective A genome or At subgenome (Fig. 1b and Supplementary Table 11). Similarly, we observed a total of 2,241 collinear blocks between G. raimondii and G. hirsutum, which covered 91.9% and 88.8% of the respective genome or subgenome (Fig. 1c and Supplementary Table 11).

Comparative studies of cotton genomes

Plant genomes have been massively invaded by transposable elements (TEs), many of which are located near host genes25. The presence of TEs can activate or repress genes under specific biotic or abiotic conditions, or even at different developmental stages26. As much as 66% of the G. hirsutum assembly is composed of TEs (Supplementary Table 12). In common with G. arboreum5 and G. raimondii6, the At subgenome of G. hirsutum contains markedly higher amounts of gypsy than Dt, whereas the reverse is true for copia. However, both copia and gypsy were substantially more actively transcribed in the Dt subgenome (Fig. 2a). The estimated insertion time of long terminal repeat (LTR) retrotransposons, including both copia and gypsy, was determined using a spontaneous mutation rate27 (Fig. 2b). Copia elements were remarkably more active than gypsy (Student's t-test, P ≤ 0.05) in the recent 0–1 MYA time frame (Fig. 2b), with higher proportions of copia located near coding genes than gypsy-type (Fig. 2c, Student's t-test, P ≤ 0.05). Our data indicate that the TEs of the Dt subgenome tend to be more active than those of the At subgenome after the tetraploidization28.
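The text does not spell out how insertion times were derived from the mutation rate. A common approach, shown here only as a hedged sketch, dates an LTR retrotransposon from the divergence K between its two LTRs as T = K / (2r), where r is the substitution rate per site per year. The function name, the example divergence and the rate of 7e-9 are placeholders for illustration and are not values confirmed by the source.

```python
def insertion_time_mya(ltr_divergence, subs_per_site_per_year):
    """Estimate LTR insertion time in million years (MYA).

    Uses T = K / (2 * r): the two LTRs are identical at insertion and
    both accumulate substitutions afterwards, hence the factor of 2.
    """
    if subs_per_site_per_year <= 0:
        raise ValueError("the substitution rate must be positive")
    years = ltr_divergence / (2 * subs_per_site_per_year)
    return years / 1e6


# Placeholder inputs (assumptions, not values from the source).
example_divergence = 0.014   # substitutions per site between the two LTRs
example_rate = 7e-9          # substitutions per site per year
print(insertion_time_mya(example_divergence, example_rate))
```

Under this formula, elements falling in the 0–1 MYA window discussed above are those whose LTR pairs have diverged by no more than 2r × 10^6 substitutions per site.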

Figure 2: Characterization of copia and gypsy TEs in the G. hirsutum genome.

(a) Statistics for these two types of TEs present in the Dt (left half) and At (right half) subgenomes. The outer circle shows the percent coverage of copia (green histogram) and gypsy (purple histogram) in nonoverlapping windows (window size = 500 kb). The following two inner circles indicate the copia and gypsy transcript levels, which were estimated by averaging values of reads (log10) from different tissues in nonoverlapping 500-kb windows. The links in the center indicate collinearity between At and Dt subgenomes. Only syntenic blocks of >1 Mb in length are shown. (b) Estimated insertion time for copia and gypsy LTR retrotransposons. (c) Distances from individual TEs to their nearest gene.


The different abundances of retrotransposons and DNA transposons (62.81% and 2.52%, respectively, in our paper versus 52.29% and 1.08%, respectively, in Zhang et al.28) were probably the result of using different methods to detect TEs. We used the de novo method and the homolog search method based on Repbase, whereas Zhang et al.28 used the MIPS database. As a result, only 0.43% of the TEs found are recognized as unclassified elements in our analysis, whereas 11.38% unclassified elements are reported by Zhang et al.28. Using our methodology, a further 9.43% of their unclassified elements can be recognized as retrotransposons, and an additional 1.25% of the unclassified elements can be recognized as DNA transposons. The MIPS database is constructed strictly according to the published cotton genomes whereas Repbase contains all repeat sequences from various plant genomes, further accounting for differences in findings.

Gene order was constructed by quartet alignments among the four sister chromosomes from the three Gossypium species. Using an iterative method based on the best match and the syntenic gene finder in MCScan29, we identified 32,466 and 32,811 orthologous gene pairs within the At and Dt subgenomes, respectively, in G. hirsutum and their corresponding ancestral A and D diploid genomes (Supplementary Table 13). Of the 28,592 orthologous gene pairs identified between G. arboreum and G. raimondii, 22,876 were also conserved homeologous gene pairs in G. hirsutum (Fig. 3a). Of the orthologous gene pairs, 643 were lost from both G. raimondii and the Dt subgenome of G. hirsutum, and 478 were lost from both G. arboreum and the At subgenome (Fig. 3a), which indicates that the ancestral D genome suffered a higher frequency of genic sequence losses. We further identified gene losses in syntenic blocks in which at least two of the four Gossypium genomes have orthologous genes: 523 and 461 genes were absent from At and Dt, respectively, compared with 234 lost in the A genome and 390 lost in the D genome (Fig. 3a). These data suggest a higher rate of gene losses in the allotetraploid cotton than in both diploid species. Also, more genes were lost in the At subgenome than the Dt subgenome during the formation of G. hirsutum.

Figure 3: Evolution of gene models, DNA fragments and syntenic blocks among G. hirsutum (AtDt) and two diploid cotton genomes, G. arboreum (A) and G. raimondii (D).

(a) Scenarios and statistics of gene conservation. Solid lines indicate currently observed genes, and dotted lines indicate lost genes. The numbers beneath each drawing represent the number of gene pairs found in the three different genomes that fit the specific model. From left to right, genes present in all four genomes, genes not observed in At, genes not observed in Dt, genes not observed in either A or At, genes not observed in either D or Dt, genes not observed in A, genes not observed in D. (b) HE of genomic segments between the At and Dt subgenomes in a region of G. hirsutum chromosome 9 (Gh9). The curves in the upper panel show homologous gene pairs between At and A or At and D. The lower panel shows the Ks value distribution for syntenic blocks, which indicates HE in the tetraploid cotton. The dot plots show distribution of Ks values and the boxplots display variations of Ks values between A and D genomes. Note that some of the dots (outliers) are not included in the boxplots due to their low probabilities. (c) Distribution of Ks values between four cotton genomes and T. cacao (upper panel) and single-nucleotide variation (SNV) rate (lower panel) among different cotton genomes.


Homeologous exchanges (HEs) among cotton genomes

HEs, or transpositions, are characterized by the loss of a chromosomal region that is replaced by a duplicate copy of the corresponding homeologous region10,30,31. The synonymous substitution rates (Ks values) of orthologous DNA are used to evaluate evolutionary distances among different plant species32. In the A-genome-derived chromosome 9 of G. hirsutum, a syntenic block in the middle of the chromosome showed a very low Ks value with a fragment from the D genome, whereas the two other blocks in this same region displayed low Ks values to DNA fragments from the A genome (Fig. 3b), suggesting that this At chromosome segment had undergone HE with a Dt homeologous region. Potential HEs were supported by multiple paired-end reads to ensure that they did not result from misassemblies (Supplementary Fig. 5). Putative HEs with junction distances ≤7 kb were selected for PCR verification using primers designed from both sides of the particular junction, and five out of seven potential HEs were successfully amplified in G. hirsutum, but not in G. arboreum or G. raimondii (Supplementary Fig. 6). Also, HE regions showed a considerably higher rate of multicopy genes than the genome as a whole, indicating that HEs may be associated with the evolution of high copy number genes (Supplementary Fig. 7). Using this approach across all G. hirsutum chromosomes, we identified 100 possible HEs, of which 54 were Dt to At (Supplementary Table 14) and 46 were At to Dt (Supplementary Table 15). Further analysis of syntenic blocks involved in HEs indicated that 1,790 genes were transferred from Dt to At, whereas 1,530 genes were transferred from At to Dt. The average Ks values for collinearity-supported gene pairs were 0.463 for Dt versus T. cacao and 0.494 for D versus T. cacao, with 0.422 for At versus T. cacao and 0.476 for AA versus T. cacao (Fig. 3c, upper panel). The Ks values of both At and Dt subgenomes were substantially lower than those of the A and D diploid genomes, respectively.
Similarly, analysis of intergenic collinear regions showed that the single-nucleotide variation rate for Dt versus D was greater than that for At versus A (Fig. 3c, lower panel). These results indicate that the D or Dt genome has higher mutation rates than A or At genome, which is consistent with a previous report for the Adh locus33. Further analysis revealed dN/dS ratios of 0.338 for Dt versus T. cacao, 0.270 for D versus T. cacao, 0.404 for At versus T. cacao and 0.314 for AA versus T. cacao. These data suggest that the A and At genomes are undergoing less purifying selection and greater positive selection than the D and Dt genomes; and that the genetic redundancy created by allotetraploidy may have allowed relaxed purifying selection in both the At and Dt subgenomes.
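The Ks-based reasoning used to spot candidate HEs on an At chromosome can be sketched as follows. This is a simplification under stated assumptions: each syntenic block is given Ks values against the A and D diploid genomes, and a block is flagged when its median Ks to D is lower than its median Ks to A. The function name, data layout and toy values are hypothetical; the analysis described above additionally required paired-end read support and PCR verification.

```python
from statistics import median


def flag_candidate_he(blocks):
    """Return names of At-chromosome blocks that look D-derived.

    Each block is a dict with a 'name' and two non-empty lists of Ks
    values: 'ks_vs_a' (against the A genome) and 'ks_vs_d' (against
    the D genome). A block whose genes are closer to D than to A is a
    candidate Dt-to-At homeologous exchange.
    """
    flagged = []
    for block in blocks:
        if median(block["ks_vs_d"]) < median(block["ks_vs_a"]):
            flagged.append(block["name"])
    return flagged


# Toy data (hypothetical): three adjacent blocks on one At chromosome.
toy_blocks = [
    {"name": "block1", "ks_vs_a": [0.010, 0.020, 0.015], "ks_vs_d": [0.040, 0.050, 0.045]},
    {"name": "block2", "ks_vs_a": [0.040, 0.050, 0.060], "ks_vs_d": [0.005, 0.010, 0.008]},
    {"name": "block3", "ks_vs_a": [0.012, 0.018, 0.016], "ks_vs_d": [0.042, 0.048, 0.044]},
]
print(flag_candidate_he(toy_blocks))
```

In the toy input only the middle block is flagged, mirroring the pattern described for chromosome 9, where a central block resembles the D genome while its neighbors resemble the A genome.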

Deletions and fiber development

Ethylene is a key modulator of cotton fiber cell growth34,35,36, and significant differences in ethylene or cell wall biosynthetic gene expression patterns during fiber development have been reported34,37. Early experiments showed 50- to 500-fold increases in ACO transcript levels in G. hirsutum and G. raimondii ovules, respectively, in comparisons with G. arboreum5. When grown in a semi in vivo ovule culture system, G. arboreum ovules released almost undetectable levels of gaseous ethylene over a 15-d culture period, whereas G. raimondii produced large amounts of this molecule (Fig. 4a). Although ovules of G. hirsutum did synthesize measurable levels of ethylene, the onset of this ethylene burst occurred at a much later stage with a lower peak value, as compared with G. raimondii (Fig. 4a). Gel shift assays were performed to investigate possible molecular mechanisms for regulating the expression of ACO1 and ACO3, which are rate-limiting during ethylene biosynthesis in cotton ovules34. Sequence alignment showed the deletion of a 128-bp fragment that resulted in the loss of a single MYB binding site in ACO1 promoters from both the A genome and the At subgenome (Fig. 4b, upper panel). Significant mobility shift was observed only when fragment P6 from the D genome (either from D or Dt) was incubated with nuclear extracts prepared from 10 d post-anthesis (DPA) G. hirsutum ovules (Fig. 4b, middle and lower panels). When compared with the GaACO3 promoter, deletion of a large DNA fragment (123 bp) was observed in the GrACO3 promoter. Further scrutiny revealed several short sequence insertions (indels of 8 to 16 bp) in the GrACO3 promoter that created two additional MYB-binding sites in G. raimondii (Fig. 4c, upper panel; and Supplementary Fig. 8). As a result, only fragment PVI from the Dt or D genome showed substantial mobility shift whereas the same PVI fragment from the At or A genome showed no obvious nuclear protein binding (Fig. 4c, middle and lower panels).
A much weaker and similar binding of nuclear extracts on the PII fragment was observed for all four types of ACO3 promoters (Fig. 4c, middle and lower panels). When the D genome–derived P6 and PVI fragments from the ACO1 and ACO3 promoter regions were individually incubated with nuclear extracts prepared from G. raimondii ovules harvested at different developmental stages, we found that the nuclear protein binding activity reached its peak values around 3 DPA; these amounts were about four to five times the levels found in 10-DPA ovules (Fig. 4d, upper panel). In G. hirsutum, the highest binding activities to both ACO1 and ACO3 promoters appeared in 10-DPA ovules, but the peak intensities were less than 20% of those found in G. raimondii (Fig. 4d, middle panel). As expected, no obvious DNA binding activity was found using nuclear extracts from G. arboreum (Fig. 4d, lower panel).

Figure 4: Ethylene production and its regulatory mechanisms in three cotton species (G. raimondii, G. arboreum and G. hirsutum).

(a) Comparisons of ethylene production from cultured ovules collected at 1 DPA and cultured for 14 d, with air samples collected at the different time points as shown. Data reported are the mean ± s.e.m. from three independent ovule culture experiments, with triplicate measurements for each sample. (b,c) Electrophoretic mobility shift assays (EMSA) showing the specific binding complex on the P6 fragment of ACO1 (b) and ACO3 (c) promoters. 32P-labeled probes were incubated with nuclear protein samples prepared from 10-DPA G. hirsutum ovules. Dotted lines shown on the top of each panel show the lost sequences in the corresponding genome with the red boxes representing MYB binding sites. In each panel, one representative EMSA obtained using probes originating from the Dt or the At subgenomes of G. hirsutum is shown in the middle, and data obtained using probes produced from G. arboreum (Ga) or G. raimondii (Gr) in the bottom. (d) Comparisons of the binding activity on P6 from ACO1 and ACO3 promoter regions of the three different cottons. Shown are data obtained from nucleoproteins prepared from 0-, 3-, 5-, 10-, 15- and 20-DPA ovules and incubated with P6 originating from G. raimondii, from the Dt copy of G. hirsutum and from G. arboreum. Error bars, mean ± s.e.m. from three independent EMSA experiments. (e) Phylogenetic and evolutionary analysis of ACO1 and ACO3 promoter regions from G. raimondii, G. arboreum and T. cacao. Scale bars, 100 bp. Statistical significance was determined using one-way analysis of variance software. *P < 0.05, **P < 0.01, ***P < 0.001.


Phylogenetic analysis revealed that cotton ACO1 and ACO3 are in the same clade with TcACO1 (Thecc1EG030320t1) and TcACO3 (Thecc1EG002179t1) of T. cacao, respectively (Fig. 4e). GrACO1 and GrACO3 displayed 43.8% and 43.3% identity, respectively, in promoter regions with their counterparts in T. cacao. Several important MYB binding sites observed in either TcACOs or GrACOs were not found in GaACOs, suggesting that G. raimondii evolved from the common eudicot ancestor before G. arboreum3.

CesA upregulation and fiber elongation

Profound differences in fiber properties are found among the three cotton species. For example, G. hirsutum usually produces fibers >3 cm in length, whereas G. arboreum produces fibers 1.3–1.5 cm long, and no spinnable fiber is produced by G. raimondii (Fig. 5a, left panel). G. hirsutum fiber cells undergo fast elongation until 30 DPA, whereas those of G. arboreum stop growth around 20 DPA (Fig. 5a, right panel). As a load-bearing polymer of plant cell walls, cellulose microfibrils, which are produced by cellulose synthase A (CesA), have a key role in the formation of the secondary cell wall for cell growth38,39. The CesA genes are divided among six major clades (Supplementary Figs. 9a and 10) and genome-wide analysis indicates that G. hirsutum contains 32 CesA and 64 cellulose synthase-like (CSL) genes, representing a notable increase over the number of CesA and CSL genes in Arabidopsis22 and T. cacao20 (Supplementary Figs. 9b and 11).

Figure 5: Fiber growth, expression and potential regulatory mechanisms of genes important for cell wall biosynthesis.

(a) Comparisons of fiber lengths of the three cotton species (left), and growth rate analysis for G. hirsutum and G. arboreum (right). Error bars, mean ± s.d. (b–d) qRT-PCR analysis of primary and secondary cell wall biosynthesis genes using the At- (indicated by blue lines) or A-originated (red lines) copy as the template. UER, UDP-4-keto-6-deoxy-D-glucose 3,5-epimerase 4-reductase; UGP, UDP-D-glucose pyrophosphorylase; UGD, UDP-D-glucose dehydrogenase. See Supplementary Table 19 for gene-specific primers. Error bars, mean ± s.d. (e) The evolution of regulatory mechanisms for fiber-specific and highly expressed genes from the At subgenome.


Primary cell wall components produced by UGD, UGP and UER are also important for cotton fiber growth37. Transcriptome analyses indicated that a significant number of the At subgenome-originated genes were highly and specifically expressed during fiber development (Supplementary Table 16), and this was confirmed by qRT-PCR analyses over a longer growth period (Fig. 5b–d). In all cases, genes that originated from the At subgenome, but not their homeologs from the A subgenome, were expressed highly and specifically during either the primary (Fig. 5b,c and Supplementary Table 17) or the secondary (Fig. 5d) cell wall biosynthesis stages. Although a few CSL genes showed moderate fiber-specific upregulation, there was little difference between the At- and A-derived copies (Supplementary Table 18). This fiber-specific and high-level expression of cell wall biosynthesis genes was coupled with the evolution of more cis-elements through point mutations, as evidenced on the AtCesA8-1 promoter (Fig. 5e, upper drawing). Transposition of a Tgm1 (a subclass of TE) (Fig. 5e, middle drawing) and partial deletion of a MITE (Fig. 5e, lower drawing) from the ancestral A genome seem to have contributed to the considerably increased number of pyrimidine boxes or the conserved MYC- and MYB-binding sites present on AtCesA3-4 and AtUER2-2 promoters.

Discussion

Deciphering the allotetraploid genome of G. hirsutum provides unique insights into the earliest stages of postpolyploidy evolution and the relationships between G. hirsutum and its ancestral diploid species. Both G. raimondii and G. arboreum underwent cotton-specific whole genome duplication at 16.6 MYA5,6 after sharing the paleohexaploidization event common to all eudicots that occurred 130.8 MYA5,6,22. We estimate that the modern allotetraploid G. hirsutum species resulted from hybridization of the two ancestral species about 1.5 MYA, based on the calculations comparing At and Dt subgenomes, with their respective A or D diploid ancestral species (Fig. 1a and Supplementary Fig. 2). Up to 100 DNA segments have been replaced in one subgenome by sequences from another subgenome. Although most coding genes were conserved among all four genomes, close scrutiny showed that a large number of these genes were transferred from Dt to At, so that the current At subgenome may express more genes than the ancestral A genome. This segmental replacement apparently causes substantial differences in TE activities in the two subgenomes compared with the postulated ancestors and this seems to contribute to differences in fiber production and fiber quality of the allotetraploid species.

The assembly and analyses of the allotetraploid cotton genome show that the two subgenomes have different mutation rates because the Dt subgenome has higher single-nucleotide variation and Ks values than those of the At. Also, the allotetraploid genome enjoys lower purifying selection pressure than the ancestral diploid genomes, because both Dt and At subgenomes have higher dN/dS ratios than the D and A diploid genomes. Finally, the At subgenome may have undergone positive selection for fiber-related traits because the At showed a higher dN/dS ratio than the Dt. The significant changes in Ks rates among HEs and their flanking sequences suggest that HEs may be associated with “gene conversion,” which involves homology-directed double-strand break repair between homeologous chromosomes or sister chromatids40. The ACOs and primary and secondary cell wall biosynthesis genes reported here might provide targets for engineering of improved fiber yield.
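The selection arguments in this paragraph rest on pairwise comparisons of the dN/dS ratios reported in the Results (each measured against T. cacao). The short sketch below restates those comparisons explicitly, using only the reported values; the dictionary layout and variable names are ours.

```python
# dN/dS ratios versus T. cacao, as reported in the Results.
dn_ds = {"Dt": 0.338, "D": 0.270, "At": 0.404, "A": 0.314}

# Each subgenome compared with its diploid counterpart: a higher ratio
# in the tetraploid is read as relaxed purifying selection.
relaxed_in_at = dn_ds["At"] > dn_ds["A"]
relaxed_in_dt = dn_ds["Dt"] > dn_ds["D"]

# At compared with Dt: the higher ratio in At underlies the suggestion
# of positive selection on At for fiber-related traits.
at_exceeds_dt = dn_ds["At"] > dn_ds["Dt"]

print(relaxed_in_at, relaxed_in_dt, at_exceeds_dt)
```

All three comparisons hold for the reported values, which is the pattern the interpretation above relies on.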

Methods

Materials.

We used the highly homozygous Upland cotton (G. hirsutum) genetic standard line TM-1 (ref. 11) for DNA sequencing. The homozygosity was reached through 51 successive generations of self-fertilization by single seed descent from one TM-1 seed. G. hirsutum cv. TM-1, G. arboreum cv. Shixiya1 and G. raimondii acc. D5-3 (CMD#10) plants were maintained in the greenhouse for extraction of DNA or RNA. The tetraploid G. barbadense cv. 3-79 was crossed with TM-1 to develop an RIL population12. A total of 167 RIL individuals were successfully sequenced for the construction of genetic maps.

WGS sequencing.

Genomic libraries were prepared following the manufacturer's standard instructions and sequenced on the Illumina HiSeq 2000 platform. To construct paired-end libraries, we fragmented DNA by nebulization with compressed nitrogen gas, and blunted DNA ends before adding an A base to each 3′-end. DNA adaptors with a single T-base 3′-end overhang were ligated to the above products. Ligation products were purified on 0.5%, 1% or 2% agarose gels, each of which targeted a specific range of insert sizes. We constructed G. hirsutum genome sequencing libraries with insert sizes of 250 bp, 350 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb. All libraries were sequenced on the Illumina sequencing platform, and paired-end reads from each library were obtained. To obtain high-quality reads for assembly, we first filtered out unusable reads, which included the following: (i) reads containing ≥10% “N” bases; (ii) reads with low-quality data (quality score ≤7, where the score is the ASCII value of the quality character minus 64) for 65% of the bases for short insert (<2 kb) or 80% of the bases for long insert (≥2 kb) libraries; (iii) reads containing >10-bp adaptor sequences; (iv) reads with >10 bp that overlapped between two ends of reads of short inserts; and (v) reads with identical sequences at the two ends.
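Filters (i) and (ii) above can be sketched in a few lines. This is an illustrative reading rather than the pipeline actually used: it assumes quality strings encoded with an offset of 64, treats the 65% and 80% figures as "at least this fraction of bases", and assumes non-empty reads. The function name and parameters are hypothetical, and filters (iii) to (v) are not covered.

```python
def is_usable(seq, qual, long_insert=False, offset=64):
    """Apply the N-content and low-quality filters to a single read.

    seq:  base calls, e.g. "ACGTN..."
    qual: quality string of the same length (assumed offset-64 encoding)
    long_insert: True for libraries with insert size >= 2 kb
    """
    # (i) discard reads with >= 10% ambiguous bases
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction >= 0.10:
        return False

    # (ii) discard reads in which too many bases have quality <= 7
    low_quality = sum(1 for ch in qual if ord(ch) - offset <= 7)
    threshold = 0.80 if long_insert else 0.65
    if low_quality / len(qual) >= threshold:
        return False

    return True


# Toy reads (hypothetical). "h" encodes quality 40 and "B" quality 2.
good_read = ("ACGT" * 5, "h" * 20)
noisy_read = ("ACGT" * 5, "B" * 14 + "h" * 6)
print(is_usable(*good_read), is_usable(*noisy_read))
```

The same read can pass or fail depending on the library type: with 70% low-quality bases, the toy read is rejected as a short-insert read but retained under the more permissive long-insert threshold.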

BAC pooling and sequencing.

HindIII and BamHI were used to generate partially digested insert DNA for the construction of TM-1 BAC libraries. Insert sizes of these BAC libraries ranged between 80 kb and 180 kb on average. The Agilent Bravo Automated Liquid Handling Platform (Agilent) and Agilent BenchCel Microplate Handler (Agilent) were used for the construction of BAC libraries. The Adaptive Focused Acoustics (AFA) DNA fragmentation system (Covaris) was used to fractionate DNA samples of BACs. T4 DNA polymerase (Illumina) was used to convert overhangs resulting from fragmentation into blunt ends. Klenow fragment was used to add an A base to the 3′-end of each blunt DNA fragment for the ligation of index adapters having a single T base. DNA samples with different index adapters were pooled together, and DNA segments of 500 bp were selected for library construction. Subsequently, ten 96-well plates were pooled into a single lane for sequencing (i.e., 960 samples/lane), resulting in an average throughput of 100 million reads/lane. In total, 1,080 96-well plates, or 103,680 BAC clones, were obtained, of which 100,187 BACs were sequenced successfully. For each BAC clone, a 500-bp paired-end library was constructed and sequenced with the HiSeq2000 sequencing platform. Each BAC clone was sequenced with >100-fold coverage on average.
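The pooling numbers in this section follow from simple arithmetic, reproduced below as a quick check. Only values stated in the text are used; the derived quantities (number of lanes, share of clones sequenced, average reads per BAC) are our own calculations and are not reported in the source.

```python
# Values stated in the text.
wells_per_plate = 96
plates_total = 1080
plates_per_lane = 10
reads_per_lane = 100_000_000
bacs_sequenced = 100_187

# Derived quantities.
bac_clones_total = plates_total * wells_per_plate        # clones picked
samples_per_lane = plates_per_lane * wells_per_plate     # indexed BACs per lane
lanes_needed = plates_total // plates_per_lane           # sequencing lanes
percent_sequenced = round(100 * bacs_sequenced / bac_clones_total, 1)
reads_per_bac = reads_per_lane // samples_per_lane       # average reads per BAC

print(bac_clones_total, samples_per_lane, lanes_needed, percent_sequenced, reads_per_bac)
```

The first two results match the 103,680 clones and 960 samples per lane quoted above.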

Genome assembly.

The allotetraploid genome of G. hirsutum was assembled through a BAC-to-BAC strategy combined with WGS sequencing. The primary steps were as follows.

  1. We first assembled each BAC clone using SOAPdenovo16 based on the BAC reads with expected read-length for each clone.

  2. We also assembled all the WGS sequencing reads from the libraries with short insert sizes into contigs using SOAPdenovo16.

  3. All BAC assemblies and all the contigs assembled from the WGS sequencing were pooled for overlap-layout-consensus (OLC) assembly; identical sequences were merged and redundant bases were filtered out of overlapping regions using software contained within the Rabit package41. This yielded the nonredundant super-contig sequences.

  4. All the super-contigs obtained from the OLC assembly were linked into scaffolds using all the WGS reads from long-insert libraries (2–40 kb) in a step-by-step fashion using SSPACE42 with the following command: perl SSPACE_v1-1.pl -l lib.txt -s final3200.fa -k 3 -n 20 -x 0.

  5. To fill the gaps (N regions) within the scaffolds, all the WGS reads from libraries with short insert sizes were mapped to the scaffolds, and the reads located in the gap regions were then locally assembled with the following GapCloser command: GapCloser -a scafSeq -b gap_all.lib -o scafSeq.FG1 -t 24.
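Gap filling in the last step starts from the coordinates of the N regions in each scaffold. As a minimal illustration of that starting point only (not of GapCloser itself), the sketch below locates runs of N in a sequence; the function name and the toy scaffold are hypothetical.

```python
import re

def find_gaps(seq):
    """Return half-open (start, end) coordinates of every run of N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]

# Toy scaffold, invented for illustration.
scaffold = "ACGTNNNNACGTACNNGT"
gaps = find_gaps(scaffold)
gap_bases = sum(end - start for start, end in gaps)

print(gaps)       # [(4, 8), (14, 16)]
print(gap_bases)  # 6
```

Reads whose mapped positions fall in or next to these intervals would be the ones passed to local assembly.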

Genetic map construction.

Genetic linkage maps were constructed for anchoring the scaffolds to 26 chromosomes of the AtDt genome, using 167 recombinant inbred lines (RILs) derived from an interspecific cross of the G. hirsutum and G. barbadense genetic standards TM-1 and 3-79 (ref. 12). Genomic DNA from this mapping population was digested with the restriction endonuclease EcoRI and processed into restriction-site-associated DNA or 'RAD' libraries. Every 24 cotton individuals were pooled into one sequencing library with nucleotide multiplex identifiers (4, 5, 6, 7 and 8 bp), and each sequencing library was barcoded. Approximately 3,940 Mb of 90-bp reads (23 Mb of read data for each progeny on average) was generated on the HiSeq 2000 next-generation sequencing platform. The SNP calling process was carried out using the GATK pipeline. JoinMap4.0 (ref. 43) was used to conduct the linkage analysis. We used the method described in binmap44 to combine adjacent SNP markers into bins, which integrates high-quality markers and filters out errors among them. The primary steps were as follows: (i) using MPR (Maximum Parsimony of Recombination) to infer parental genotypes; (ii) using Bayesian inference to acquire reliable parental genotypes; (iii) using an HMM (Hidden Markov Model) to correct the progeny genotypes. We finally obtained 6,501 bins and utilized them in anchoring the scaffolds together with other SNP markers.
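The idea of collapsing adjacent SNP markers into bins can be sketched in simplified form. The example below merges consecutive markers that carry identical genotype calls across all progeny. It deliberately omits the MPR, Bayesian and HMM steps of the published binmap method, and the marker positions, genotype codes and function name are invented for illustration.

```python
from itertools import groupby

def collapse_into_bins(markers):
    """Merge consecutive markers with identical progeny genotype strings.

    markers: list of (position, genotype_string) sorted by position, where
    each character of genotype_string is one progeny's call ('A' or 'B').
    """
    bins = []
    for genotypes, group in groupby(markers, key=lambda m: m[1]):
        positions = [pos for pos, _ in group]
        bins.append({
            "start": positions[0],
            "end": positions[-1],
            "n_markers": len(positions),
            "genotypes": genotypes,
        })
    return bins

# Toy data: five markers scored in four progeny (hypothetical).
markers = [
    (100, "AABB"), (250, "AABB"),
    (400, "ABBB"), (900, "ABBB"),
    (1500, "AABB"),
]
bins = collapse_into_bins(markers)
for b in bins:
    print(b)
```

A boundary between two bins corresponds to at least one progeny whose genotype changes there, that is, a recombination breakpoint or an uncorrected genotyping error.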

A logarithm of odds (LOD) score of 10 was initially set as the linkage threshold for linkage group identification. All high-quality markers were used to construct the consensus maps with the RIL population in JoinMap4.0 (ref. 43). We calculated the recombination fractions between all pairs of SNP markers in a scaffold and chose, as the tag SNP for that scaffold, the marker with the minimum summed recombination fraction. Scaffold order was determined by the tag SNPs, and inner SNPs were then used to orient the scaffolds. A complete set of 26 pseudochromosomes of G. hirsutum was produced, with chromosomes 1–13 corresponding to the At subgenome and chromosomes 14–26 to the Dt subgenome.
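The selection of a representative SNP per scaffold can be illustrated as follows. This is a sketch under stated assumptions: the recombination fraction between two markers is approximated by the observed proportion of progeny with discordant calls (no mapping-function correction), missing calls are coded as '-', and all names and genotype strings are hypothetical.

```python
def recombinant_fraction(g1, g2):
    """Proportion of progeny with discordant calls, ignoring missing ('-')."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a != "-" and b != "-"]
    if not pairs:
        return 0.5  # no information: treat the markers as unlinked
    return sum(a != b for a, b in pairs) / len(pairs)

def pick_tag_snp(genotypes):
    """Return the marker with the smallest summed fraction to all others."""
    names = list(genotypes)
    totals = {
        n: sum(recombinant_fraction(genotypes[n], genotypes[m])
               for m in names if m != n)
        for n in names
    }
    return min(names, key=totals.get), totals

# Three SNPs on one scaffold, scored in five progeny (toy data).
scaffold_snps = {
    "snp1": "AABBA",
    "snp2": "AABBB",
    "snp3": "ABBBB",
}
tag, totals = pick_tag_snp(scaffold_snps)
print(tag)  # snp2
```

The marker chosen this way is the one most consistent with the rest of the scaffold, which makes it a reasonable representative for ordering scaffolds along a linkage group.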

Repeat content.

We detected the repeat content of the G. hirsutum genome using two methods, the de novo method and the homolog search method based on Repbase45. Three de novo software packages, PILER-DF46, RepeatModeler and LTR_FINDER47, were used to predict TE sequences in the genome. Intact LTRs were predicted using LTR_FINDER, and the LTR insertion time was calculated according to the rate of spontaneous mutations27. The homology-based approach involved applying commonly used databases of known repetitive sequences. We identified TEs at the DNA level with RepeatMasker48 using the Repbase45 TE library. We identified TEs at the protein level with RepeatProteinMask, which runs WuBlastX against the TE protein database. Then we combined the results from the DNA level and protein level and filtered out TEs for which >80% of their domains were contained within the domain of a higher-scoring match. We then joined the overlapping TEs that matched the same TE in the library.
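The final two steps, removing hits largely contained in a higher-scoring match and joining overlapping hits to the same library TE, can be sketched with plain interval arithmetic. This is an illustrative reading of the text, assuming that a "domain" is the genomic interval of a hit; the field names, scores and TE names are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

def filter_contained(hits, max_contained=0.8):
    """Drop hits with >80% of their length inside a higher-scoring hit."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        length = hit["end"] - hit["start"]
        if any(overlap(hit, k) / length > max_contained for k in kept):
            continue
        kept.append(hit)
    return kept

def merge_same_te(hits):
    """Join overlapping hits that match the same library TE."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h["te"], h["start"])):
        last = merged[-1] if merged else None
        if last and last["te"] == hit["te"] and hit["start"] < last["end"]:
            last["end"] = max(last["end"], hit["end"])
            last["score"] = max(last["score"], hit["score"])
        else:
            merged.append(dict(hit))
    return merged

hits = [
    {"te": "Gypsy1", "start": 100, "end": 1100, "score": 900},
    {"te": "Copia2", "start": 200, "end": 600, "score": 300},    # inside Gypsy1
    {"te": "Gypsy1", "start": 1000, "end": 1600, "score": 500},  # overlaps it
]
result = merge_same_te(filter_contained(hits))
print(result)
```

In this toy case the Copia2 hit is discarded because it lies entirely within the higher-scoring Gypsy1 hit, and the two Gypsy1 hits are joined into one element spanning 100 to 1,600.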

Tandem repeats in the genome assembly were identified using Tandem Repeats Finder49 with the following parameters: 2 7 7 80 10 50 2000 -d -h. We also searched for noninterspersed repeats in the genome using RepeatMasker48 with its “-noint” parameter.
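For readers unfamiliar with Tandem Repeats Finder, its seven positional parameters are, in order, the match weight, mismatch penalty, indel penalty (delta), match probability (PM), indel probability (PI), minimum alignment score and maximum period size. The sketch below only assembles the argument list with those labels attached; the executable name `trf` and the FASTA file name are assumptions, as neither is given in the text.

```python
# Positional parameters in the order Tandem Repeats Finder expects them.
TRF_PARAMS = {
    "match": 2,
    "mismatch": 7,
    "delta": 7,         # indel penalty
    "pm": 80,           # match probability (percent)
    "pi": 10,           # indel probability (percent)
    "minscore": 50,     # minimum alignment score to report
    "maxperiod": 2000,  # maximum repeat unit length (bp)
}

def trf_command(fasta, params=TRF_PARAMS, flags=("-d", "-h")):
    """Build the argument list; -d writes a data file, -h suppresses HTML."""
    return ["trf", fasta, *(str(v) for v in params.values()), *flags]

# "assembly.fa" is a placeholder file name.
command = trf_command("assembly.fa")
print(" ".join(command))
```

The list could be handed to `subprocess.run` on a system where the program is installed under that name.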

Gene prediction.

We used de novo gene prediction, homology-based methods and RNA-seq data to predict gene structure and integrated all the results by GLEAN50 to obtain a consensus gene set.

1. De novo gene prediction.

Augustus51 and GlimmerHMM52 were used to obtain de novo–predicted gene structures from a version of the genome in which repeats were masked based on identified repeat sequences with lengths >500 bp, except for miniature inverted–repeat transposable elements (MITEs), which are usually found near genes or inside introns.
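The masking rule described above (mask repeats longer than 500 bp, but leave MITEs unmasked) can be expressed compactly. The sketch assumes hard-masking with N, which the text does not specify, and uses a hypothetical repeat-class label "MITE"; the toy call lowers the length threshold so that the behavior is visible on a short sequence.

```python
def mask_repeats(seq, repeats, min_len=500, keep_classes=("MITE",)):
    """Replace qualifying repeat intervals with N.

    repeats: iterable of (start, end, repeat_class), half-open coordinates.
    Repeats no longer than min_len, or in keep_classes, are left unmasked.
    """
    masked = list(seq)
    for start, end, repeat_class in repeats:
        if end - start > min_len and repeat_class not in keep_classes:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)

# Toy example with a 5-bp threshold instead of 500 bp.
toy_seq = "A" * 20
toy_repeats = [(2, 8, "LTR"), (10, 18, "MITE"), (18, 20, "LTR")]
masked_seq = mask_repeats(toy_seq, toy_repeats, min_len=5)
print(masked_seq)  # AANNNNNNAAAAAAAAAAAA
```

Only the first interval is masked: the second is a MITE and the third is below the length threshold.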

2. Homology-based prediction.

Protein sequences of five sequenced genomes (A. thaliana, G. raimondii, G. arboreum, C. papaya and T. cacao) were aligned to the G. hirsutum genome using TBLASTN (E-value ≤ 1E-5). Then the homologous genome sequences were aligned against the matching proteins using Genewise53 to generate accurate spliced alignments. The expressed sequence tag sequences of G. hirsutum were aligned against the genome sequence using BLAT54 (identity ≥ 0.95 and coverage ≥ 0.90) to generate spliced alignments. Then, Genewise53 was applied to exploit the spliced alignments of expressed sequences to automatically model gene structure.

3. RNA-seq–based prediction.

To assist in gene annotation, we first mapped RNA-seq reads of six transcriptomes from cotton leaf and fiber to the assembled genome using TopHat55 to identify potential exons in addition to donor or acceptor sites with the following parameters: -p 4 --max-intron-length 20000 -m 1 -r 20 --mate-std-dev 20. Then we combined TopHat mapping results and applied Cufflinks55 to predict transcript structures with the following parameters: -I 20000 -p. To obtain complete gene models, we used the fifth-order Markov model to predict ORFs.

4. Integration of evidence for gene prediction.

The GLEAN software50 was used to integrate data derived from the three methods into a GLEAN-derived gene set. Short genes with coding sequences <150 bp and low-quality genes (gaps in >10% of the coding region) were filtered out.
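The two post-integration filters are simple to state in code. The sketch below assumes that "gaps" means N characters within the coding sequence; the function name and test sequences are illustrative only.

```python
def passes_filter(cds, min_len=150, max_gap_frac=0.10):
    """Keep a gene if its CDS is >=150 bp and gaps (N) cover <=10% of it."""
    if len(cds) < min_len:
        return False
    gap_frac = cds.upper().count("N") / len(cds)
    return gap_frac <= max_gap_frac

# Toy coding sequences (hypothetical).
examples = {
    "short_gene": "ATG" * 40,              # 120 bp: too short
    "good_gene": "ATG" * 60,               # 180 bp, no gaps
    "gappy_gene": "ATG" * 50 + "N" * 30,   # 180 bp, ~17% gaps
    "minor_gaps": "ATG" * 50 + "N" * 15,   # 165 bp, ~9% gaps
}
kept = [name for name, cds in examples.items() if passes_filter(cds)]
print(kept)  # ['good_gene', 'minor_gaps']
```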

Syntenic and Ks analysis.

All-versus-all BLASTP (e-value < 1e-5) was used to detect orthologous and paralogous genes among G. hirsutum, G. arboreum and G. raimondii. Syntenic blocks (with at least five genes per block) were identified by MCScan29 (MATCH_SCORE: 50, MATCH_SIZE: 5, GAP_SCORE: –3, E_VALUE: 1E-05). In the resulting alignments, each aligned block represented an orthologous pair derived from the common ancestor, and the sequences that contained the genes were used to show the intergenome relationships with their length information. Syntenic blocks were used to find homeologous exchanges (HEs) between the two subgenomes of tetraploid cotton by tracing the presence of A genome–derived haplotype blocks in the Dt genome, and D-derived blocks in the At genome. Ks values along each chromosome between the tetraploid and the two diploid genomes were calculated in nonoverlapping windows (window size = five genes). An HE was called when a fragment spanning several consecutive windows on an At chromosome showed a lower Ks value (Student's t-test, P ≤ 0.05) with a fragment from the D genome than with one from the A genome, whereas both flanking fragments displayed low Ks values with fragments from the A genome, and vice versa for Dt chromosomes. Ks (the number of synonymous substitutions per synonymous site) and dN/dS ratios (the ratio of the number of nonsynonymous substitutions per nonsynonymous site to the number of synonymous substitutions per synonymous site) were calculated using yn00 in the PAML software56 package with the branch model. To distinguish HEs from other DNA blocks with different mutation rates, the distribution of Ks values for the whole chromosome was compared with that of the particular HE. Only the DNA blocks with Ks values deviating significantly from the distribution of global Ks values (Student's t-test, P ≤ 0.05) from both A and D genomes were accepted as candidate HEs.
To avoid the recognition of possible misassembly as HEs, all the candidate HE regions with 50-kb flanking sequences were analyzed using the paired-end relation of reads from the libraries with insert sizes of 500 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb.
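The window-based screen for candidate HEs can be illustrated in reduced form. The sketch below averages Ks in nonoverlapping five-gene windows along an At chromosome and flags windows whose mean Ks against the D genome is lower than against the A genome. It omits the Student's t-tests, the flanking-fragment requirement and the paired-end read validation described above, and all Ks values are invented.

```python
from statistics import mean

def window_means(values, size=5):
    """Mean of each nonoverlapping window; a trailing partial window is dropped."""
    return [mean(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def candidate_he_windows(ks_vs_a, ks_vs_d, size=5):
    """Indices of windows on an At chromosome that look D-derived."""
    a_means = window_means(ks_vs_a, size)
    d_means = window_means(ks_vs_d, size)
    return [i for i, (ka, kd) in enumerate(zip(a_means, d_means)) if kd < ka]

# Fifteen genes along a hypothetical At chromosome: the middle five
# are closer to the D-genome ortholog than to the A-genome ortholog.
ks_vs_a = [0.01] * 5 + [0.04] * 5 + [0.01] * 5
ks_vs_d = [0.04] * 5 + [0.01] * 5 + [0.04] * 5
candidates = candidate_he_windows(ks_vs_a, ks_vs_d)
print(candidates)  # [1]
```

For a Dt chromosome the comparison would be reversed, flagging windows whose Ks against the A genome is the lower of the two.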

The 4DTv value of the blocks was calculated using a revised version of the HKY model. A gene family is defined here as a group of genes determined by OrthoMCL57 to descend from the last common ancestor of the species under consideration. First, BLASTP was used to compare all the protein sequences with a database containing a protein data set of all species with e-value < 1e-5. Then clustering of genes was carried out with OrthoMCL57 (inflation parameter: 1.5).
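As background for the 4DTv statistic, the uncorrected value is the fraction of fourfold-degenerate third-codon positions at which two aligned coding sequences differ by a transversion. The sketch below computes only this raw fraction under the standard genetic code; it does not implement the HKY-based correction mentioned above, whose details are not given in the text, and it assumes the two sequences are already codon-aligned without gaps.

```python
# Codon prefixes whose third position is fourfold degenerate
# (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}

def raw_4dtv(cds1, cds2):
    """Uncorrected transversion fraction at shared fourfold-degenerate sites."""
    sites = transversions = 0
    for i in range(0, min(len(cds1), len(cds2)) - 2, 3):
        c1, c2 = cds1[i:i + 3].upper(), cds2[i:i + 3].upper()
        if c1[:2] != c2[:2] or c1[:2] not in FOURFOLD_PREFIXES:
            continue
        b1, b2 = c1[2], c2[2]
        if b1 not in "ACGT" or b2 not in "ACGT":
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or the reverse.
        if (b1 in PURINES) != (b2 in PURINES):
            transversions += 1
    return transversions / sites if sites else None

# Toy pair: codons GCT/GCA and GGA/GGC are transversions, CTA/CTG is a
# transition, and ATG is not a fourfold-degenerate codon.
value = raw_4dtv("GCTGGACTAATG", "GCAGGCCTGATG")
print(value)
```

Here three fourfold-degenerate sites are compared and two of them differ by a transversion.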

Analysis of ACOs and cell wall biosynthesis genes.

ACO, CesA, UER, UGP and UGD gene families were identified using the HMMER 3.0 software package58. MEGA software59 was used to construct the phylogenetic tree with the neighbor-joining method. Double-stranded DNA sequences corresponding to different fragments of the ACO1 and ACO3 promoters were synthesized by Invitrogen and labeled with 32P to generate the DNA probes. DNA-protein binding reactions were performed in 20-μl volumes that contained 1.5 μl 10× binding buffer (12.5 mM HEPES, 50% glycerol, 200 mM KCl, 2.5 mM EDTA, 5 mM DTT), 1 μg poly dI,dC (Sigma-Aldrich), 20 μM labeled DNA probe and 20 μg nuclear extract and were then incubated at room temperature for 1 h. The reactions were analyzed with 8% native PAGE in 0.5× Tris-borate-EDTA at 120 V. DNA binding activity was quantified using the Typhoon 9200 PhosphorImager (GE Healthcare).

To analyze ethylene gas production, 96 freshly collected 1-DPA wild-type ovules from each cotton species were cultured in 96-well plates in the dark at 30 °C for 14 d. Air samples (50 μl) were collected at each time point, injected into a 30-m HP-PLOT column (J&W) and analyzed with a gas chromatograph (GC-14C; Shimadzu) equipped with a flame-ionization detector. Ethylene production was determined by comparison with a standard ethylene sample run on the same GC using an identical program. QRT-PCR experiments were performed as previously reported.

Accession codes.

The G. hirsutum genome sequence, including all contigs and scaffolds, has been deposited at NCBI BioProject under the accession number PRJNA259930 and is also available at http://cgp.genomics.org.cn. Sequence data for G. hirsutum transcriptome analyses are available in the NCBI Sequence Read Archive (SRA) under accession number SRA180756.