Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton

Allotetraploid cotton is an economically important natural-fiber-producing crop worldwide. After polyploidization, Gossypium hirsutum L. evolved to produce a higher fiber yield and to better survive harsh environments than Gossypium barbadense, which produces superior-quality fibers. The global genetic and molecular bases for these interspecies divergences were unknown. Here we report high-quality de novo–assembled genomes for these two cultivated allotetraploid species with pronounced improvement in repetitive-DNA-enriched centromeric regions. Whole-genome comparative analyses revealed that species-specific alterations in gene expression, structural variations and expanded gene families were responsible for speciation and the evolutionary history of these species. These findings help to elucidate the evolution of cotton genomes and their domestication history. The information generated not only should enable breeders to improve fiber quality and resilience to ever-changing environmental conditions but also can be translated to other crops for better understanding of their domestication history and use in improvement. High-quality de novo–assembled genomes of two cultivated allotetraploid cotton species and whole-genome comparative analyses provide insights into the evolution of cotton genomes and improvement of fiber quality and resilience to stress.

The global genetic and molecular bases underlying the evolutionary dynamics associated with the origin, speciation and diversification of these two tetraploid species are not well understood. The cultivated tetraploid cottons [4][5][6][7] and their corresponding diploid progenitors [8][9][10] have been sequenced. However, intergenic DNA such as telomeres, centromeres and repeat-rich regions are often poorly represented in the published draft genomes [4][5][6][7] , leading to underestimation of many important genomic features found in these regions. Here we report two highly accurate, contiguous, chromosome-scale de novo assemblies of the G. hirsutum and G. barbadense genomes obtained by integrating non-PCR-based short-read sequencing, long-read-based gap closure, scaffolding, and orientation based on 3D proximity information derived from chromosome conformation capture (Hi-C) data and from optical and genetic maps. By using these two well-assembled genomes, we performed genomewide comparative studies that unraveled the genomic components responsible for contrasting features between the species, demonstrating their origin in homeolog losses, genome restructuring in the postpolyploidization era and altered patterns of gene expression.

Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton Results
Assembly of high-quality genomes. We adopted a hierarchical approach to perform chromosome-scale assembly ( Supplementary  Fig. 1). For G. barbadense L. cv. Hai7124, about 800 Gb of highquality sequence (>330× genome coverage) generated from a series of PCR-free libraries, mate-paired libraries and a 10x Genomics library (Supplementary Table 1) was assembled by using the DeNovoMAGIC3 software package (NRGene), which has been used extensively to assemble the complex wheat (Triticum spp.) [11][12][13][14] , opium 15 and maize 16 genomes. For Hai7124, we initially generated an assembly that captured 2.23 Gb of sequence (Supplementary Table 2). The initial scaffolds were then corrected and merged by using optical map data (Bionano Genomics) (Supplementary  Table 3). Finally, we used 3D proximity information from Hi-C (Supplementary Table 4) and our updated ultra-dense genetic map with 6,165,057 SNPs (Supplementary Table 5) to check, order and orient the resulting superscaffolds. A total of 115 erroneously assembled loci were detected and corrected (Supplementary Table 6). The final Hai7124 (v1.1) assembly captured 2.22 Gb of genome sequence, covering approximately 91.4% of the total genome size, which was estimated to be 2.43 Gb by k-mer analysis ( Supplementary Fig. 2). The contiguity of the newly assembled genome was 47-and 90-fold greater than that of the two published draft genomes of G. barbadense (scaffold N50 = 23.44 Mb versus 0.50 Mb (ref. 5 ) and 0.26 Mb (ref. 7 )). Approximately 98.2% of the 2.22-Gb assembled genome was assigned to 26 chromosomes (Table 1  and Supplementary Table 7). The gap size (Ns) was substantially reduced to 32.46 Mb in Hai7124 ( Supplementary Fig. 3).
We also generated an improved de novo-assembled genome for G. hirsutum L. acc. TM-1 (Supplementary Tables 8 and 9). The total assembly size was 2.30 Gb, with the longest scaffold 41.92 Mb in length. About 97.4% of the 2.30-Gb assembled genome was oriented and organized into 26 chromosomes (Table 1 and Supplementary  Table 10). In comparison with two recently published genome assemblies for this accession, the updated TM-1 reference assembly (v2.1) showed higher contiguity, that is, ~10-to 20-fold longer scaffold N50 values (15.5 Mb versus 1.6 Mb (ref. 4 ) and 0.764 Mb (ref. 6 )). The total size of the oriented assembled genome without connective N sequences was dramatically increased (2,211 Mb for TM-1 (v2.1) versus 1,667 Mb for TM-1 (v1.1) 4 ) ( Supplementary Fig. 4).
We validated the contiguity and accuracy of the new assembly by using the same 36 BAC sequences (Supplementary Fig. 5 and  Supplementary Table 11) and integrated genetic and cytogenetic maps of the A12 and D12 homeologous chromosomes from FISH ( Supplementary Fig. 6) as we reported previously 4 .
We successfully assembled and filled the gaps in the new genomes, and erroneous assemblies, especially those spanning centromeric regions, were corrected (Fig. 1a,b and Supplementary Figs. 7 and 8). The alignment between Hai7124 (v1.1) and G.barbadense accession 3-79 was largely incomplete (Fig. 1c and Supplementary  Fig. 8), probably owing to the fragmented nature of the 3-79 assembly (scaffold N50 = 260 kb, scaffold number = 29,751) 7 . The highly consistent alignment of the genome sequences to the corresponding optical maps (88% mapping rate in TM-1 and 91% in Hai7124) (Fig.1a,b and Supplementary Figs. 7 and 8) further validates the high contiguity and accuracy of our new genome assemblies.

) (Supplementary
The centromeric DNA regions associated with centromerespecific H3 (CenH3) 18 were further identified by ChIP-seq using an antibody to cotton CenH3. In total, 66.72 and 50.16 million reads (151 bp) were obtained from the TM-1 and Hai7124 ChIPseq libraries, respectively. In FISH experiments, distinct and highintensity signals were observed in the centromeric region of each chromosome, further confirming the CenH3 ChIP DNA enrichment ( Supplementary Fig. 10). Approximately 22.10 and 17.04 million sequence reads were mapped to unique sites in the genomes for TM-1 (v2.1) and Hai7124 (v1.1), respectively. As a result, potential centromeric regions from all 26 chromosomes of the Hai7124 (v1.1) assembly and 24 chromosomes of the TM-1 (v2.1) assembly (with the exception of D02 and D08) were identified (Fig. 1a, Table 17). The CenH3 ChIP-seq data mapped to a sharp interval on each chromosome, ranging from 0.60 to 2.85 Mb in length, which was highly overlapped by the region identified by the CRGs (Supplementary Tables 12 and 13). However, the CenH3 ChIP-seq reads could not be mapped onto the TM-1 (v1.1) assembly ( Fig.1a and Supplementary Figs. 7 and 8), likely owing to the fragmented nature of its centromeric regions. Moreover, centromeres showed consistent localization along the same chromosome in the two cotton species (Supplementary Table 17), implying a high level of genome conservation and a relatively short time to divergence. In plants, extensive studies have been conducted to isolate and map the DNA corresponding to CenH3-associated centromeric chromatin by using ChIP-seq 19-21 . However, poorly assembled reference centromeres often hamper the deployment of ChIP-seq, which has resulted in poor characterization of the centromeres of the sequenced plants [22][23][24] . Until now, only five centromeres from Brachypodium distachyon, a small-genome grass, had been fully sequenced and assembled, which was possible because these centromeres have a relatively low percentage of highly repetitive DNA (~21.4%) 25,26 . These results demonstrate that our new genome assemblies have made substantial improvement in sequence continuity in almost all regions, including repetitive-DNA-rich centromeric regions.   Gene prediction and annotation. A total of 72,761 and 75,071 high-confidence protein-coding genes (PCGs) were predicted for TM-1 (v2.1) and Hai7124 (v1.1), respectively, of which >96% were supported by full-length transcript data (Supplementary Tables 18   and 19). For most of these genes (95%; n = 100), there was complete accuracy between the identified exon-intron boundaries and those determined by using mapped RNA-seq reads (Supplementary Fig. 11 and Supplementary  Supplementary Fig. 12). Furthermore, we predicted the insertion time of LTRs by examining the sequence divergence at both ends of intact LTRs between TM-1 and Hai7124. Two expansions were identified that were distinct from the amplification wave of complete LTR elements ( Supplementary Fig. 13). These two recent LTR insertion events seem to have occurred around 2 MYA and 13 MYA, close to the time of tetraploid cotton formation (1-2 MYA) 4,10 and the divergence time between diploid A-and D-genome species (2-13 MYA) 6 .
Speciation and genome evolution. We resequenced 17 accessions representing the D genome and wild AD genome (Supplementary Table 25), finding that G. raimondii (D 5 ) was the closest relative to the D subgenome for all five allotetraploids ( Supplementary Fig. 14).
Furthermore, whole-genome sequence alignment of our current reference-quality assemblies to genomes for the diploid progenitors G. arboreum (A genome) 8 and G. raimondii (D genome) 10 demonstrated that the two tetraploids have highly collinear relationships with their diploid progenitors. However, some reciprocal translocations were observed along chromosomes 1-3, as well as along chromosomes 4 and 5 ( Fig. 1c and Supplementary Table 26), in accordance with earlier reports 28-31 . A total of 2.2 Gb of sequence from Hai7124 (v1.1) could be mapped onto the TM-1 (v2.1) genome, suggesting that the collinearity and gene order for these species are largely conserved (Fig. 1d). These data support the idea that G. hirsutum and G. barbadense originated from a common allotetraploid ancestor.
The nonsynonymous (K a ) and synonymous (K s ) substitution rates for 22,054 orthologous gene sets were compared between G. arboreum 8 , G. raimondii 10 , and the A and D subgenomes of the tetraploid species. On the basis of this, the divergence time for the A and D progenitor genomes was estimated to be around 6.2-7.1 MYA, and the allotetraploid formed around 1.7-1.9 MYA ( Fig. 2a and Supplementary Table 27). The divergence of G. barbadense and G. hirsutum occurred ~0.4-0.6 MYA (K s peaks at 0.002 and 0.003, respectively).
We identified 9,451 one-to-one orthologous gene sets among the genome assemblies for Theobroma cacao 32 , the A and D diploid progenitors, and the A and D subgenomes of our two currently assembled cultivated species, which suggests an accelerated evolution rate in allotetraploids in comparison to their diploid progenitors. A faster evolution rate was observed in allotetraploid cottons in comparison to their living diploid progenitors; in the A subgenome in comparison to the D subgenome; and in G. barbadense in comparison to G. hirsutum (Fig. 2b, Supplementary Fig. 15 and Supplementary  positively selected genes were found in the A subgenome ( Supplementary Fig. 16), indicating that the A subgenome might have experienced strong selection pressures during evolution. Deep sequencing data (average coverage of 60×) from nine G. barbadense and ten G. hirsutum accessions (Supplementary Table 29) showed high divergence between G. hirsutum and G. barbadense, as well as between the wild G. hirsutum race yucatanense and the domesticated and/or improved accessions for G. hirsutum ( Supplementary Fig. 17a). Furthermore, a long reciprocally introgressed region exhibiting extremely low levels of polymorphism, extending from 43.10 Mb to 92.00 Mb on chromosome A01, was found by comparing the G. hirsutum race with G. barbadense ( Supplementary Fig. 17b), supporting the idea that hybridization occurred between domesticated species and their wild ancestors in the spread of domestication and/or diversification, as we reported previously 33 Table 30). The introgressed region from the long-lived, perennial G. hirsutum race on chromosome A01 was found in nine G. barbadense accessions collected from Egyptian, American Pima and Central Asian ELS (extra-long-staple) cottons, but not in the Tanguis landrace, which produces coarse fibers with medium staple length, which is typical of current cottons from Peru. It was confirmed that G. barbadense originated in a region spanning northwestern Peru and southwestern Ecuador 34 . Introgression of wild and/or domesticated alleles from G. hirsutum may have induced changes in 'morpho-physiological traits' , helping in the domestication of Sea Island cotton (G. barbadense). This analysis of introgressed regions further revealed that there were two genotypes among the Central Asian ELS cottons: one was the same as the Egyptian or American Pima type, represented by Hai7124 and Xinhai 21, and the other is a new type, represented by Junhai 1 and Xinhai 25, selected from cultivar 9122 from the former Soviet Union. Therefore, the spread and adaptation of domesticated cottons to different agro-ecological and cultural environments have led to phenotypic and genetic divergence among domesticated populations to produce two cultivated allotetraploid cotton types.  Table 33). These GSVs were classified into three types: type I consisted of 3,029 GSVs identified in Hai7124 but annotated only in TM-1 and type III consisted of 3,556 GSVs identified in TM-1 but annotated only in Hai7124. The 3,781 type II GSVs were annotated in both species (Fig. 3a). 69.7% of the speciesspecific GSVs were completely validated in the 19 deep-sequenced accessions (Supplementary Tables 34 and 35). Gene Ontology (GO) enrichment analysis showed that the type I GSVs were enriched for genes involved in defense response and DNA integration, the type II GSVs were enriched for genes involved in response to stress and oxidation-reduction processes, and the type III GSVs were enriched for genes involved in DNA integration and cell recognition. The variations mapping to these GSVs may be among the causative mutations that have led to the phenotypic divergence between these two species. For instance, among the type II GSVs, a 2-bp (CA) deletion in WLIM1a was identified in Hai7124 (GB_D06G0243) that resulted in premature termination of translation. Thus, in comparison to the wild-type homolog in TM-1 (GH_D06G0228), there were 23 amino acids deleted from the encoded protein in Hai7124, including 5 in the second LIN-11, Isl1 and MEC-3 (LIM) motif ( Supplementary  Fig. 18a). This species-specific indel was further confirmed in all 19 accessions ( Supplementary Fig. 18b) Table 38). The presence/absence variants (PAVs) were distributed unevenly across the genomes; the majority of these were Copia elements (LTR type) found in coding DNA sequence ( Supplementary  Fig. 19a,b). The expression level of genes with LTR insertions was lower than that of genes in which PAVs occurred in upstream or downstream sequences or in introns and that of all PCGs expressed in root, stem, leaf or fiber ( Supplementary Fig. 19c), but these genes showed much more differential expression between G. hirsutum and G. barbadense (Supplementary Fig. 19d). Of the three major PAV types, the first (74%; ~3 kb) and second (94%; ~11 kb) types were caused by Gypsy LTR retroelements, whereas the third type (68%; ~4.9 kb) was mostly caused by Copia LTR retroelements (Fig. 3c). These structural variations were found to be associated with loss of function of 820 and 791 genes in TM-1 and Hai7124, respectively (Supplementary Table 39). In G. barbadense, we detected a PAV in an α-expansin gene with fiber-specific expression (GB_A10G1730; deletion of 450 bp within exon 3), which resulted in a truncated protein lacking the C-terminal polysaccharide-binding domain as compared with the G. hirsutum homolog (GH_A10G1626, also known as GhEXP1). The protein encoded by this truncated ELSfiber-related gene can significantly enhance fiber length, fineness and strength through cell wall restructuring in G. hirsutum 36 . This species-specific deletion in the expansin gene (GB_A10G1730 versus GH_A10G1626) was found in nine of the G. barbadense accessions examined. Furthermore, within the SCW-biosynthesis-related cellulose synthase CESA 7 clade ( Supplementary Fig. 20a), a large TE insertion (9,427 bp) in GH_A07G0437 resulted in the absence of 53 amino acids from all ten G. hirsutum accessions but not the nine G. barbadense lines except for Pima S-4 ( Supplementary Fig. 20b). The corresponding wild-type CESA 7 gene (GB_A07G0431) showed high expression at 20 days post anthesis (DPA) during the SCW biosynthesis stage in all three G. barbadense accessions examined, suggesting that it potentially has a role in the biosynthesis of more cellulose to develop ELS fibers ( Supplementary Fig. 20c). These TE insertions might also have had a key role in cotton domestication and diversification.
Moreover, we identified 3,905 copy number variations (CNVs) of sequences that were either gained or lost in the 19 cotton accessions in comparison to TM-1 (Supplementary Table 40). The genes within CNVs were significantly enriched for genes involved in defense response (P = 2.60 × 10 -12 , Fisher's exact test), suggesting that they have a role in adaptation to various environments.

Genomic insights into development of extra-long fibers in G. barbadense.
We measured dynamic changes in the elongating fibers of TM-1 and Hai7124 plants from 5 to 40 DPA and observed a nearly linear increase in fiber length at 5-15 DPA in TM-1 plants, whereas this increase occurred at 5-20 DPA in Hai7124 plants (Fig. 4a). The maximum elongation rate was achieved at 15-20 DPA; this rate suddenly decreased at 25 DPA in TM-1 fibers but was retained until 30 DPA in Hai7124 fibers (Fig. 4b). The extended elongation phase in Hai7124 plants (5-30 DPA) most likely leads to development of longer fibers in G. barbadense as compared to TM-1 plants (elongation from 5-25 DPA).
Fiber elongation is driven by cell turgor created by the influx of sucrose and potassium ions (K + ) and local production of malate together with high expression of expansion-related genes encoding proteins that loosen the cell wall matrix 37,38 . The content of the osmotically active solutes in elongating fibers was substantially increased in Hai7124 plants from 10 to 15 DPA, and this high level was sustained from 15 DPA onward (Fig. 4c-e). A large central vacuole occupied 98% of the volume of quickly elongating fiber cells (Fig. 4f); such vacuoles store the main osmotically active solutes that impose maximal turgor pressure for fiber elongation .
To the extent of our knowledge, the underlying molecular basis of ELS fiber production has not previously been investigated. Here we found 45,129 and 45,328 genes expressed (FPKM > 1) in developing single-celled fibers of TM-1 and Hai7124 plants, respectively. Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation of these genes revealed that genes associated with membrane transport, transcription, and glycan biosynthesis and carbon metabolism were significantly enriched for expression in Hai7124 as compared to TM-1 fibers ( Supplementary Fig. 21a), highlighting the speciesspecific role of these genes in fiber elongation. Furthermore, the sucrose transporter (GbTST1), Na + /H + antiporter (GbNHX1) and aluminum-activated malate transporter (GbALMT16) genes whose expression was detected in the tonoplast of Hai7124 fibers had a longer period of expression than the corresponding genes in TM-1 (Fig. 4g-k and Supplementary Table 41). These transporters pump more sucrose, K + and malate into the vacuole 39-45 . Moreover, sucrose is further degraded into hexose by the vacuole-localized vacuolar invertase (VIN1) to enhance the osmotic potential required for fiber elongation 46,47 . Interestingly, GbVIN1 also showed prolonged expression in Hai7124 as compared to TM-1 fibers (Fig. 4i).
Unlike malate, which is synthesized in the cytosol of fibers, sucrose and K + enter fiber cells through the plasma membrane or plasmodesmata (PD) 47,48 . In Hai7124 plants, the PD remained open for a longer time (5-15 DPA) to import sucrose into the cytosol of developing fiber cells from underlying seed coat cells than it did in TM-1 plants (5-10 DPA) ( Supplementary Fig. 21b). Conversely, expression of the plasma membrane-localized SUTs and SWEETs responsible for sucrose transport is almost completely silenced during this stage 49 . All these findings demonstrate that expression of GbVIN1, GbTST1, GbNHX1 and GbALMT16 and PD opening for a relatively extended time period could be responsible for the accumulation of more soluble sugar, K + and malate, thereby leading to production of longer fibers in ELS cotton.
In total, there were 1,464 genes representing expansions in 446 gene families (Supplementary Table 42) in the Hai7124 genome, significantly more than the 696 genes found in the TM-1 genome (P = 6.27 × 10 -60 , Fisher's exact test) (Supplementary Table 43). For instance, the 12 genes in class I of the ADP-ribosylation factor (ARF) GTPase family, forming core components of vesicle transport machinery and determining epidermal cell polarity 50,51 , correspond to a specific expansion in Hai7124 (Supplementary Fig. 22a). These ARF GTPase genes were predominantly expressed during fiber elongation (10 DPA) and SCW thickening (20 and 25 DPA), indicating that they potentially have a role in the development of longer and stronger fibers in ELS cotton ( Supplementary Fig. 22b).

Divergent evolution of genes involved in abiotic stress tolerance.
No significant differences in response to salinity or drought stresses were observed between TM-1 and Hai7124 ( Supplementary Fig.  23a,b). However, TM-1 seedlings were more tolerant to heat and cold stresses than Hai7124 seedlings (Fig. 5a-d), suggesting that selection after domestication was aimed at developing the high yield and adaptation abilities of G. hirsutum. We detected 16,029 versus 5,270 genes conferring resistance to heat stress (P = 0, Fisher's exact test) and 8,725 versus 12,475 genes conferring resistance to cold stress (P = 5.079 × 10 -142 , Fisher's exact test) in TM-1 and Hai7124, respectively (Supplementary Table 44). In total, 51 genes including PIPK, PIP (plasma membrane-intrinsic protein), CaM genes and downstream transcriptional regulator (HSF) genes involved in the ability of plants to sense heat 52 showed differential responses to heat stress between TM-1 and Hai7124 (Supplementary Tables 41  and 45). Most of these (39 genes), especially the CaM genes, were specifically up-or downregulated under heat stress in G. hirsutum (Fig. 5b). The ethylene signaling pathway was also activated under heat stress in TM-1, as nine each of the genes encoding ethylene receptors and ethylene response factors (ERFs) were dramatically upregulated in TM-1 (Fig. 5b,c, Supplementary Fig. 23c and Supplementary Tables 41 and 45), elucidating their role in regulating EIN3 and EIL1 activation under heat stress 53 .
TM-1 seedlings also showed greater tolerance of cold stress than Hai7124 seedlings (Fig. 5d). In total, 23 genes from the abscisic acid (ABA) signaling pathway that encode ABA receptors (PYR/PYL), the SnRK protein kinase and the PP2C protein phosphatase kinase (ABI/AHG/PP2C) 54 showed differential responses to cold in TM-1 and Hai7124 (Fig. 5e,f). Most genes, especially those encoding the ABA receptors ABAR (CHLH), PYR1, PYL3, PYL8 and PYL11, were specifically upregulated under cold stress in TM-1 (Supplementary  Table 46), and this upregulation was confirmed by qRT-PCR analysis ( Supplementary Fig. 23d). These genes have in part facilitated the adaptation of Upland cotton to new environments.

Discussion
Here we performed de novo assembly of the genomes of two cultivated allotetraploid cotton species by integrating data from Illumina PCR-free short-read sequencing, 10x Genomics sequencing, Hi-C, and optical and super-dense genetic maps. These efforts resulted in substantial improvements to the contiguity and accuracy of assembly, with a notable improvement in the assembly of centromeres. By analyzing the two resulting high-quality genome assemblies, we determined that tissue-and/or developmental-stage-specific expression and divergent neo-and subfunctionalization of homeologs between G. hirsutum and G. barbadense may have led to the variable adaptation responses of these two species after the formation of allotetraploid cottons. Cotton evolutionary processes are also partially driven by the cultural and economic importance of cotton to humans. To our knowledge, this is the first comparison of these two important allotetraploid cotton species over the entire genome. These results should improve understanding of crop evolution, domestication and diversification and should lead to the discovery of novel domestication-related genes conferring agronomically beneficial traits in future breeding programs and evolutionary trajectories in plants.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/ s41588-019-0371-5.
12. International Wheat Genome Sequencing Consortium. Shifting the limits in wheat research and breeding using a fully annotated reference genome.

DNA extraction and sequencing.
Genomic DNA from G. hirsutum L. acc. TM-1 (ref. 55 ) (genetic standard) and G. barbadense L. cv. Hai7124 (resistant to verticillium wilt) seedlings was isolated by a modified CTAB method 56 . Five size-selected DNA libraries (470 bp to 10 kb) and three mate-pair libraries (2-5, 5-7 and 7-10 kb) were constructed. To limit sequencing biases across genomic regions with varying GC content, libraries with insert sizes of ~470 and 800 bp were constructed by using TruSeq DNA Sample Preparation Kit version 2 with no PCR amplification. The library with an insert size of ~470 bp was sequenced on a HiSeq 2500 v2 instrument in rapid mode as 2 × 265 bp reads, which produced 'stitched' reads of ~265-520 bp in length. The 800-bp shotgun library and the mate-pair libraries were sequenced on a HiSeq 4000 instrument as 2 × 150 bp reads. Construction and sequencing of paired-end and mate-pair libraries were carried out at the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign. DNA molecules longer than 50 kb prepared with the nuclei method 57 were used to construct a 10x Genomics library that was sequenced on the GemCode platform (10x Genomics, Pleasanton) at the Hudson Alpha Institute for Biotechnology. About 800 Gb of sequencing data was produced for both TM-1 and Hai7124, respectively (Supplementary Table 1).
Initial genome assembly by DeNovoMAGIC3. A flowchart of contig, scaffold and chromosome assembly in this study is shown in Supplementary Fig. 1. Initial genome assembly was performed by deploying the DeNovoMAGIC3 software platform (NRGene), a de Bruijn graph-based assembler. This task was accomplished by using accurate read-based traveling in the graph that iteratively connected consecutive contigs over local repeats to generate long scaffolds. An additional long-range barcoded DNA library (Chromium system by 10x Genomics) was sequenced to support haplotyping of heterozygous genomes, scaffold validation and further longer-range extension of the scaffolds. The algorithm has been described in detail in wheat D-genome assembly 11 . de Bruijn graphs (k-mer = 127 bp) of contigs were built from all paired-end and mate-pair reads. Next, paired-end reads were used to find reliable paths in the graph between contigs to resolve repeats and extend contigs.
The barcoded reads from the 10x Genomics library were mapped to the assembled scaffolds, and clusters of reads with the same barcode mapped to adjacent contigs in the scaffolds were identified as being part of a single long molecule. Each scaffold was scanned with a 20-kb window to ensure that the number of distinct clusters that covered the entire window (indicating support for this 20-kb connection by several long molecules) was statistically significant with respect to the number of clusters that spanned the left and right edges of the window. If a potential scaffold assembly error was detected, the scaffold was broken at the two edges of the suspicious 20-kb window. Finally, the barcodes mapping to the scaffold edges were compared (first and last 20-kb sequences) to generate a scaffold graph with a link connecting two scaffolds with more than two shared barcodes. Linear scaffold paths in the scaffold graph were composed into the final scaffold output of assembly. Initial genome assemblies are summarized in Supplementary Tables 2 and 8.
Integrating Bionano optical maps with the initial assembly. High-molecularweight DNA was extracted from yellowish cotyledons with a Bionano Plant Tissue DNA isolation kit (Bionano Genomics) and digested with Nt.BssSI (New England Biolabs). After labeling and staining, DNA was loaded onto the Saphyr chip for sequencing. 457 and 449 Gb of data (>150 kb) for TM-1 and Hai7124, respectively, were collected and converted into a BNX file by AutoDetect software to obtain basic labeling and DNA length information. The filtered raw DNA molecules in BNX format were aligned, clustered and assembled into the BNG map by using the Bionano Genomics assembly pipeline. Alignment of sequence assemblies with the BNG map was computed with RefAligner, and visualization of alignment was performed with snapshot in IrysView. Combination of the genome maps with the initial assembly to produce a hybrid scaffold was performed sequentially with the two genome maps (Supplementary Table 3).

Construction of the updated POPSEQ genetic map.
We constructed an augmented POPSEQ genetic map by sequencing 241 F 2 individuals (approximately 5× genome coverage) derived from a cross between TM-1 and Hai7124, including 59 individuals used to construct the previous genetic map 58 . After SNP calling and genotyping of the 241 F 2 individuals, simple SNPs from the parents and SNPs with the same genotype in the population were classified into a bin. Bins were assigned to linkage groups by JoinMap 3.0 59 with a minimum logarithm of odds (LOD) score of 13 and ordered by MSTmap 60 .
Hi-C library construction and data collection. We created three Hi-C libraries for TM-1 and Hai7124 by using the method described previously 61 . Libraries were subjected to sequencing on the Illumina HiSeq X Ten platform. Information about raw data is given in Supplementary Table 4.
Ordering and orientation of the Bionano-assembled scaffolds by Hi-C and POPSEQ. BNG maps were ordered and oriented on the basis of the POPSEQ genetic map. Hi-C data were then used to correct the Bionano assemblies. Hi-C reads were mapped to the assembled scaffolds with BWA 62 (version 0.7.15-r1140). PCR duplicates were removed with Novosort v1.04.06 (http://www.novocraft. com/), and reads were excluded from subsequent analysis if they did not align within 500 bp of a restriction site or did not uniquely map. The initial ordered scaffolds were used to make the Hi-C map, and the scaffolds were then divided into bins of 500 kb in size and -log 10 (number of Hi-C links) was calculated for all bins. The initial order and orientation were corrected and verified manually by using the Hi-C linking information; heat maps of the Hi-C data were plotted with HiCPlotter 63 .
Assessment of TM-1 assembly using 36 completely sequenced BACs. The 36 full-length BAC sequences 64 were split into 1-kb continuous windows and aligned to the TM-1 (v2.1) assembly by using BWA 62 (version 0.7.15-r1140) with the MEM algorithm. Windows mapping in the same chromosome region were used for graphical display.

Identification of centromeric regions by CenH3
ChIP. An antibody to CenH3 that showed specific binding to the cotton CenH3 protein was used in ChIP 23 . ChIP experiments were undertaken by following a published protocol 65 . Highquality Illumina reads were mapped to our cotton genome assembly with Bowtie 2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as previously described 18 . The genome was then divided into 10-kb non-overlapping windows and uniquely mapping reads (allowing 1-bp mismatch between each read and the reference genome) were counted. The read density was calculated by dividing the total number of uniquely mapping reads by the total number of mapped nucleotides in each genomic window. To remove the impact of non-specific binding by rabbit serum, the read density was adjusted for background signal by using mock control data. CenH3 domains were identified with SICER v1.1 (ref. 66 ). We used 200-bp windows, required fold change/control of ≥5 and false-discovery rate (FDR) < 0.01, and allowed gaps of 400 bp when defining the CenH3 domains.  67 ), and quantification of gene expression was performed with Cufflinks version 2.2.1 (http://cole-trapnell-lab.github.io/ cufflinks/) by using the GTF annotation file.

RNA-seq data generation. RNA was collected from the roots, stems, leaves and various reproductive organs of TM-1 and Hai7124 plants (details in Supplementary
PacBio sequencing and data analysis. For TM-1 and Hai7124, total RNA was extracted from the roots, stems and leaves of 2-week-old plants; from whole mature flowers; from ovules and fibers at 0 and 5 DPA; from ovules at 10 and 20 DPA; and from fibers at 10 and 20 DPA. PacBio-seq library preparation was performed as previously described 68 . In total, data were generated from 33 SMRT cells. Sequence data were processed with SMRTlink 4.0 software. PacBio transcriptome reads (867,006 for TM-1 and 879,582 for Hai7124) were used in gene annotation.
Gene prediction and annotation. Gene prediction and annotation were conducted as previously 4 . Proteins from seven plant genomes (Arabidopsis thaliana, Carica papaya, Hibiscus syriacus, T. cacao, G. raimondii, G. arboreum and Vitis vinifera) were downloaded from Phytozome (release 11) 69 and the corresponding species database 8,10 . Protein sequences were aligned to the assembly using GenBlastA  76 , were used to predict coding regions in the repeat-masked genome. Finally, PacBio transcriptome reads were mapped to the assembly with BLAT (version 35) 77 and RNA-seq data were mapped to the assembly with HISAT2 (version 2.0.1) 78 ; StringTie (version 1.2.2) 79 and TransDecoder (v1.1.1) 80 were then used to assemble the transcripts and convert candidate coding regions into gene models. All gene models predicted with the above three approaches were combined by EvidenceModeler (EVM) 80 into a non-redundant set of gene structures. The gene models were further filtered on the basis of their Cscore, protein coverage and coding sequence overlapping TEs, where Cscore is the BLASTP score ratio to the MBH (mutual best hit) BLASTP score and protein coverage is the highest percentage of the protein aligned to the best homolog. Only transcripts with Cscore ≥0.5 and protein coverage ≥0.5 were retained and the coding sequence overlapping repeats should be less than 20%; for gene models where more than 20% of the coding sequence overlapped repeats, the Cscore was required to be at least 0.9 and the protein coverage at least 70% for the model to be selected. Finally, gene models where more than 30% of the encoded protein was annotated as Pfam 81 or InterPro 82 TE domains were filtered out. Functional annotation of PCGs was achieved with BLASTP (version 2.2.26) 83 (E value of 1 × 10 -5 ) against two integrated protein sequence databases: SwissProt and TrEMBL 84 . Protein domains were annotated with InterProScan (v5.19) 85 . The GO terms for each gene were retrieved with Blast2GO 86 based on the nr protein database. The pathways in which the genes might be involved were assigned by performing BLAST against the KEGG databases (release 59.3) 87 (E value of 1 × 10 -5 ).
Noncoding RNA annotation. tRNA-encoding genes were predicted by tRNAscan-SE (version 1.3.1) 88 . rRNA fragments were predicted by aligning to Arabidopsis and rice template rRNA sequences by using BLASTN (version 2.2.26) with an E-value cutoff of 1 × 10 -10 . MicroRNA and small nuclear RNA (snRNA) genes were found by searching against the Rfam database (release 12.0) 89 with Infernal (version 1.1.1) 90 .
Repeat annotation. Repeats in the assembled genomes were probed with RepeatMasker (version open-4.0.6; http://www.repeatmasker.org/) run against the MIPS repeat database (mipsREdat_9.3p) 91 . RepeatMasker results were filtered to retain high-confidence hits (length ≥ 50 bp, score ≥ 255). To estimate the insertion times for LTRs, we applied LTR_FINDER (v1.06) 92 to identify LTRs in the cotton genome with parameters '-D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9' and then integrated the results and removed false positives from the initial predictions by the LTR_retriever pipeline 93 ; insertion time was estimated as T = K/2r (where K is the divergence rate and r is the neutral mutation rate; r = 2.6 × 10 −9 ) in the LTR_retriever package 93 .
Identification of homeologous and orthologous gene sets. A total of 9,282 oneto-one orthologous gene sets were identified for the G. arboreum, G. raimondii and T. cacao genomes and the two subgenomes of G. hirsutum and G. barbadense by using the OrthoMCL (version 1.4) clustering program 94 . 21,419 one-to-one orthologous gene sets were identified for G. raimondii, G. arboreum, and the two subgenomes of G. hirsutum and G. barbadense by using BLASTP (version 2.2.26) 83 based on the bidirectional best hit (BBH) method with sequence coverage >30% and identity >30%, followed by selection of the best match.
Assessing the reference genome by BUSCO gene set. Quality control on the integrity of the assembly was performed by using the independent BUSCO v2 benchmark (http://busco.ezlab.org/) 27 , which is used to specifically assess the integrity of genic regions.
Phylogenetic tree construction and evolutionary rate estimation. A phylogenetic tree was constructed for the seven genomes and subgenomes (T. cacao, G. raimondii, G. arboreum, and the two subgenomes each from G. hirsutum and G. barbadense) by using coding sequence alignment of orthologs from these genomes. Evolutionary rates (K a and K s ) on each branch were estimated by using the Codeml program in the PAML package (version 4.9d) 95 . We used the free-ratio 'branch' model, which allows distinct evolutionary rates for each branch. The significance of rate variations between linages was examined through Wilcoxon rank-sum tests.

Estimation of divergence time.
On the basis of the 21,419 cotton orthologous gene sets for G. raimondii, G. arboreum, and the two subgenomes each of G. hirsutum and G. barbadense, the synonymous divergence levels (K s ) for all four cotton species were calculated. The formula t = K s /2r was used to estimate the divergence time between species, where r is the neutral substitution rate (r = 2.6 × 10 −9 ).

Phylogenetic tree of 13 D-genome diploids and 5 AD-genome allotetraploids.
Seventeen accessions comprising 13 D-genome diploid species and 3 wild allotetraploid species, Gossypium tomentosum (AD 3 ), Gossypium mustelinum (AD 4 ) and Gossypium darwinii (AD 5 ) 96,97 , were resequenced at high depth on the Illumina HiSeq 2500 platform, generating 85 Gb of sequencing data (Supplementary Table  16). All reads were trimmed for quality with Sickle by using a minimum Phred quality threshold of 20 (https://github.com/najoshi/sickle). SNPs were then called by aligning the high-quality reads from the 17 accessions, Hai7124 and the 3 wild allotetraploid species against the TM-1 (v2.1) reference genome. The phylogenetic tree was constructed with the neighbor-joining method 98 .
Gene family expansion and contraction. By using the gene families identified by the OrthoMCL (version 1.4) clustering program 94 , we compared the number of genes clustered in each family between TM-1 and Hai7124. If the number of genes in TM-1 was more than twofold higher than that in Hai7124 and there were at least five genes in the gene family or the gene family was specific to TM-1, these gene families were defined as families that were expanded in TM-1 and as families that were contracted in Hai7124, and vice versa.

Identification of SNPs, indels and structural variations between TM-1 and
Hai7124. The Hai7124 (v1.1) genome was aligned with the TM-1 (v2.1) genome by MUMmer (version 3.23; http://mummer.sourceforge.net/) with parameters '-maxmatch 90 -l 40' , and one-to-one genomic alignment results were extracted with the 'delta-filter -1' parameter. SNPs and indels were identified by show-snp from the one-to-one alignment blocks (parameter '-ClrTH'). For identification of structural variations (PAVs, inversions and translocations), the alignment blocks from MUMmer with characteristics of structural variations were first extracted and the blocks with low similarity in the two flanking regions were filtered out.
To validate these variations, the following approaches were performed based on alignments of all available Illumina reads for TM-1 and Hai7124 to the TM-1 (v2.1) and Hai7124 (v1.1) genomes by BWA (version 0.7.15-r1140) 62 . In approach 1, at least four Illumina reads supporting the genotype were required to make us sufficiently confident of a candidate locus. In approach 2, Illumina reads were broken at the edge of the variation from the alignment results and variations with at least four supporting reads were defined as high-confidence variations. In approach 3, the single-end read/paired-end read ratio (S/P ratio) at the edge of the variation was computed and Fisher's exact test was used to determine whether the variation was a supported candidate. SNPs were validated by approach 1. Indels (≤100 bp) were validated by approach 1 or 2. Structural variations (>100 bp; PAVs, inversions and translocations) were validated by approach 2 or 3. The putative functional effects of SNPs and indels were annotated with the ANNOVAR package 99 .
Phylogenetic relationship and genomic variations between G. hirsutum and G. barbadense. Nineteen representative cotton accessions were subjected to deep resequencing on the Illumina HiSeq X Ten platform, producing 3.0 Tb of raw data, corresponding to a median of 60-fold coverage for each sample (Supplementary Table 17). Paired-end resequencing reads from these accessions were mapped to the newly assembled TM-1 or Hai7124 genome by BWA 62 (version 0.7.15-r1140) with the MEM algorithm. After alignment, all mapped reads were sorted according to their aligned chromosomal position and potential PCR duplicates were removed with SAMtools (version 1.3.1) 100 . SNP calling was performed with the GATK (version 3.7-0-gcfedb67) UnifiedGenotyper 101 and SAMtools (version 1.3.1) mpileup 100 . To obtain high-quality SNPs, only variations detected with both software packages with a sequencing depth of at least 10 and Phred-scaled quality score of at least 30 were retained for further analysis. The evolutionary history of these accessions was inferred by using the neighbor-joining tree 98 .
By using the indel and structural variation superset between TM-1 and Hai7124, the genotype of the 19 accessions was identified with the above validation approach with Illumina reads. Indels (≤100 bp) were identified by approach 1 or 2. Structural variations (>100 bp; PAVs, inversions and translocations) were identified by approach 2 or 3.
CNVs were identified by CNVKit 102 with gene annotations as targets. The segments identified with a log 2 -transformed copy number of at least 1.0 were considered to be copy number gain, and those with a log 2 -transformed copy number of −2.0 or less constituted to be copy number loss.

Sampling and gene expression profiles under abiotic stresses.
Four-weekold seedlings were grouped in batches containing 15 seedlings each. Seedling batches were exposed to sodium chloride (0.4 M) and PEG (200 g/liter) as well as temperature regimes (Supplementary Table 47). Leaves were collected and used for transcriptome sequencing. Differentially expressed genes were identified by using the DESeq2 package (http://www.bioconductor.org/packages/release/bioc/html/ DESeq2.html), and raw P values from multiple tests were corrected by using FDR. In identifying differentially expressed genes, we required at least a twofold change in expression and a P value of less than 0.05.
Quantitative RT-PCR analysis. Gene-specific primers were designed on the basis of the candidate genes with WebSNAPER (http://pga.mgh.harvard.edu/cgi-bin/ snap3/websnaper3.cgi). qRT-PCR was performed with SYBR Green Mastermix (Vazyme) on an ABI7500 instrument according to the manufacturer's instructions. Gene expression was normalized against expression of the GhEF1α gene. Relative transcript levels were computed by using the 2 -ΔΔCt method on the basis of Ct values.

Statistical analysis.
Comparison of codon substitution rate distributions between the two subgenomes of TM-1 and Hai7124 and their progenitors was carried out by using a two-sided Wilcoxon rank-sum test. Fisher's exact test was employed to estimate whether a list of genes was enriched in a specific GO category when compared with background genes. P values were adjusted by FDR, and only GO terms with adjusted P < 0.05 were retained.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The TM-1 (v2.1) and Hai7124 (v1.1) assembly and annotation data are available at http://ibi.zju.edu.cn/cotton and http://www.cottongen.org/, respectively. The raw sequencing data used for de novo whole-genome assembly are available The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: All studies must disclose on these points even when the disclosure is negative.

Sample size
Stress analysis: a total of 260 seedlings were treated with salt, drought, high and low temperature, that is 5 seedlings each treat at each sampling point. Fiber elongation patterns between TM-1 and Hai7124: at least ten seeds from three individual plants.

Replication
For qRT-PCR analysis, three biological replicates were performed per reaction,and each with two technologic replicates three independent biological replicates for each samples were subjected to illumina RNA-seq three biological replicates for soluble sugar and six biological replicates for K+ and malate analysis.
Randomization Yes, the samples were selected randomly.

Blinding
Blinding was not relevant for this study.

Behavioural & social sciences study design
All studies must disclose on these points even when the disclosure is negative.

Data collection
Provide details about the data collection procedure, including the instruments or devices used to record the data (e.g. pen and paper, computer, eye tracker, video or audio equipment) whether anyone was present besides the participant(s) and the researcher, and whether the researcher was blind to experimental condition and/or the study hypothesis during data collection.

Timing
Indicate the start and stop dates of data collection. If there is a gap between collection periods, state the dates for each sample cohort.

Data exclusions
If no data were excluded from the analyses, state so OR if data were excluded, provide the exact number of exclusions and the rationale behind them, indicating whether exclusion criteria were pre-established.

Non-participation
State how many participants dropped out/declined participation and the reason(s) given OR provide response rate OR state that no participants dropped out/declined participation.

Randomization
If participants were not allocated into experimental groups, state so OR describe how participants were allocated to groups, and if allocation was not random, describe how covariates were controlled.

Ecological, evolutionary & environmental sciences study design
All studies must disclose on these points even when the disclosure is negative.

Mycoplasma contamination
Confirm that all cell lines tested negative for mycoplasma contamination OR describe the results of the testing for mycoplasma contamination OR declare that the cell lines were not tested for mycoplasma contamination.
Commonly misidentified lines (See ICLAC register) Name any commonly misidentified cell lines used in the study and provide a rationale for their use.

Palaeontology Specimen provenance
Provide provenance information for specimens and describe permits that were obtained for the work (including the name of the issuing authority, the date of issue, and any identifying information).

Specimen deposition
Indicate where the specimens have been deposited to permit free access by other researchers.

Dating methods
If new dates are provided, describe how they were obtained (e.g. collection, storage, sample pretreatment and measurement), where they were obtained (i.e. lab name), the calibration program and the protocol for quality assurance OR state that no new dates are provided.
Tick this box to confirm that the raw and calibrated dates are available in the paper or in Supplementary Information.

Animals and other organisms
Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research

Laboratory animals
For laboratory animals, report species, strain, sex and age OR state that the study did not involve laboratory animals.

Wild animals
Provide details on animals observed in or captured in the field; report species, sex and age where possible. Describe how animals were caught and transported and what happened to captive animals after the study (if killed, explain why and describe method; if released, say where and when) OR state that the study did not involve wild animals.

Field-collected samples
For laboratory work with field-collected samples, describe all relevant parameters such as housing, maintenance, temperature, photoperiod and end-of-experiment protocol OR state that the study did not involve samples collected from the field.

Ethics oversight
Identify the organization(s) that approved or provided guidance on the study protocol, OR state that no ethical approval or guidance was required and explain why not.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Human research participants
Policy information about studies involving human research participants

Recruitment
Describe how participants were recruited. Outline any potential self-selection bias or other biases that may be present and how these are likely to impact results.

Ethics oversight
Identify the organization(s) that approved the study protocol.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Clinical data
Policy information about clinical studies All manuscripts should comply with the ICMJE guidelines for publication of clinical research and a completed CONSORT checklist must be included with all submissions.

Clinical trial registration
Provide the trial registration number from ClinicalTrials.gov or an equivalent agency.

Study protocol
Note where the full trial protocol can be accessed OR if not available, explain why.

Data collection
Describe the settings and locales of data collection, noting the time periods of recruitment and data collection.

nature research | reporting summary
October 2018 Flow Cytometry Plots Confirm that: The axis labels state the marker and fluorochrome used (e.g. CD4-FITC).
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers).
All plots are contour plots with outliers or pseudocolor plots.
A numerical value for number of cells or percentage (with statistics) is provided.

Methodology Sample preparation
Describe the sample preparation, detailing the biological source of the cells and any tissue processing steps used.

Instrument
Identify the instrument used for data collection, specifying make and model number.

Software
Describe the software used to collect and analyze the flow cytometry data. For custom code that has been deposited into a community repository, provide accession details.
Cell population abundance Describe the abundance of the relevant cell populations within post-sort fractions, providing details on the purity of the samples and how it was determined.

Gating strategy
Describe the gating strategy used for all relevant experiments, specifying the preliminary FSC/SSC gates of the starting cell population, indicating where boundaries between "positive" and "negative" staining cell populations are defined.
Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary

Area of acquisition
State whether a whole brain scan was used OR define the area of acquisition, describing how the region was determined.

Noise and artifact removal
Describe your procedure(s) for artifact and structured noise removal, specifying motion parameters, tissue signals and physiological signals (heart rate, respiration).

Volume censoring
Define your software and/or method and criteria for volume censoring, and state the extent of such censoring.