The ancestors of Gossypium arboreum and Gossypium herbaceum provided the A subgenome for the modern cultivated allotetraploid cotton. Here, we upgraded the G. arboreum genome assembly by integrating different technologies. We resequenced 243 G. arboreum and G. herbaceum accessions to generate a map of genome variations and found that they are equally diverged from Gossypium raimondii. Independent analysis suggested that Chinese G. arboreum originated in South China and was subsequently introduced to the Yangtze and Yellow River regions. Most accessions with domestication-related traits experienced geographic isolation. Genome-wide association study (GWAS) identified 98 significant peak associations for 11 agronomically important traits in G. arboreum. A nonsynonymous substitution (cysteine-to-arginine substitution) of GaKASIII seems to confer substantial fatty acid composition (C16:0 and C16:1) changes in cotton seeds. Resistance to fusarium wilt disease is associated with activation of GaGSTF9 expression. Our work represents a major step toward understanding the evolution of the A genome of cotton.
Cotton is one of the world’s most important commercial crops and is also a valuable resource for studying plant polyploidization1. G. arboreum was probably domesticated on Madagascar or in the Indus Valley (Mohenjo Daro), and was subsequently dispersed to Africa and other areas of Asia2. It was initially introduced to China more than 1,000 years ago as an ornamental plant3,4. Over the course of its adaptation to local agroecological environments, and under the influence of human selection, the Chinese G. arboreum population developed into a distinct geographical race referred to as ‘sinense cotton’4.
Although cotton breeders have constructed various genetic maps based on RFLP5 and simple-sequence-repeat6 markers, no causal genes responsible for the excellent agronomic and economic traits from G. arboreum or G. herbaceum have been identified. Likewise, efforts to introduce these important characteristics from diploids into tetraploids through intra- and interspecific hybridizations have not been productive7,8,9. The release of genome sequences for G. raimondii10,11, G. arboreum12, Gossypium hirsutum13,14, and Gossypium barbadense15,16 has provided the prerequisites for the study of population genetics, cultivation, and domestication. Genome-wide association studies have identified many candidate genes or quantitative trait loci (QTLs) in rice17,18,19, maize20,21, soybean22, foxtail millet23, cucumber24, tomato25, and upland cotton26,27. In this study, we reassembled a high-quality G. arboreum genome on the basis of PacBio long-reads and Hi-C technologies, and analyzed the population structure and genomic divergence trends of 243 diploid cotton accessions. We identified a number of candidate loci that may facilitate the genetic improvement of cotton lint production.
We generated 142.54 Gb of raw PacBio reads (approximately 77.6-fold genome coverage) by using SMRT sequencing technology and assembled these reads into 8,223 contigs, producing a 1,710 Mb G. arboreum genome with a contig N50 of 1,100 kb; the longest contig in the new assembly was 12.37 Mb (Table 1). We also generated ~125 million valid Hi-C interacting unique pairs with a coverage number >20 (Supplementary Tables 1 and 2). We anchored and oriented 1,573 Mb of the assembly onto 13 pseudochromosomes with the aid of Hi-C sequence data by using base-calling corrections. This genome, compared with the previously published genome12, was found to have a substantially lower number of incongruities outside of the expected diagonal when the Hi-C data were mapped against the updated genome (Supplementary Fig. 1a,b). Moreover, this updated G. arboreum genome shares substantially longer syntenic blocks in the corresponding chromosomes of the At subgenome (potentially the closest sequenced species) (Supplementary Fig. 1c,d and Supplementary Table 3). 85.39% of the updated genome is composed of repeat sequences (Supplementary Table 4). We produced a new set of 40,960 consensus protein-coding-gene models by integrating currently available methods (Supplementary Tables 5 and 6).
A total of 230 G. arboreum (A2) and 13 G. herbaceum (A1) lines were collected from South China (SC), the Yangtze River region (YZR), and the Yellow River region (YER) (Supplementary Fig. 2) and were resequenced (Supplementary Table 7). These regions represent most of the phenotypic and geographical diversity known for diploid cottons in China. Approximately 18.30 billion 125-bp paired-end reads—approximately 2.29 Tb of raw sequence—were generated on the Illumina HiSeq 2500 platform, with an average coverage depth of ~6.0× for each accession. The updated genome was used as the reference genome for SNP identification. On average, 99.68% of the reads for each accession were successfully aligned (Supplementary Table 7). We identified 17,883,108 high-quality SNPs and 2,470,515 indels (ranging from 1 to 190 bp in length), an average of 10.5 SNPs and 1.4 indels per kilobase. A total of 242,449 SNPs (1.36%) and 16,816 (0.68%) indels were located in coding regions of 36,205 G. arboreum genes. A total of 128,512 (0.72%) nonsynonymous SNPs were identified in 31,549 genes, and 11,372 (0.46%) frame-shifted indels were identified in 8,117 genes; 25,117 variants showed potentially large effects, including SNPs causing premature stop codons or longer-than-usual transcripts, and indels resulting in frame shifts, the introduction of stop codons, or other disruptions of protein-coding capacity (Supplementary Tables 8 and 9).
A subset of 72,419 SNPs was screened in greater detail to construct a neighbor-joining tree by using the G. raimondii genome11 as the outgroup. G. herbaceum and G. arboreum were clustered in two independent clades after branching from G. raimondii (Fig. 1a,b and Supplementary Fig. 3). The G. arboreum clade could be divided into SC, YZR, and YER groups that exhibited strong geographical distribution patterns, a result further supported by principal component analysis (Fig. 1a–c). These two species were independently domesticated from different wild progenitors28.
Compared with the YZR and YER group accessions, the SC group accessions had relatively poor agronomic traits (Supplementary Fig. 4). Additionally, the SC group had higher nucleotide diversity (π = 0.211 × 10−3) than the YZR (π = 0.197 × 10−3) and YER (π = 0.199 × 10−3) groups. This result indicated that G. arboreum was initially cultivated in South China and extended further to the Yangtze and Yellow River regions, in agreement with findings from a previous report7 based on molecular diversity using simple sequence repeats29. Linkage disequilibrium (LD) analysis indicated that LD decayed to half of its maximum value at a physical distance between SNPs of ~105.5 kb (r2 = 0.40) for G. arboreum and ~145.5 kb (r2 = 0.39) for G. herbaceum (Fig. 1d). These values are comparable to those for soybean (~83 kb)22 and rice landraces (~123 kb in indica, ~167 kb in japonica)17, but much higher than those of cultivated maize (22–30 kb)20. Approximately 23.9% of the G. arboreum alleles and 22.9% of the G. herbaceum alleles were aligned to the G. raimondii genome (Fig. 1e), thus indicating that G. arboreum and G. herbaceum are equally diverged from G. raimondii.
Artificial selection plays an important role during crop domestication and migration30. Model-based clustering showed that the YER group was significantly different from the SC and YZR groups (Fig. 1b; K = 4). Pairwise fixation statistic (FST) analysis (SC versus YZR; SC versus YER; and YZR versus YER) identified 59, 53, and 51 genomic regions with significant genetic divergence (top 5% of FST values) covering 3,162, 2,879, and 3,308 genes, respectively (Fig. 1f and Supplementary Tables 10–12). A total of 21 divergent genomic regions (~43.5 Mb containing 915 genes) between the SC and YZR groups were conserved between the SC and YER groups (Fig. 1f and Supplementary Table 13).
Manhattan plots and quantile–quantile plots for all 11 important traits from varied environments are shown in Supplementary Tables 14 and 15 and Supplementary Figs. 5–13. Among the 98 significant association signals (defined by –logP >6, including SNPs located both in genic and intergenic regions between two adjacent genes), 25 came from genic regions (exonic or intronic regions), including eight for morphological traits, six for yield, and three for seed oil traits. The remaining 73 signals came from noncoding regions (Supplementary Table 16). Major GWAS signals for agronomic traits that showed geographic differences in characteristics, such as sympodial branch node, flowering date, boll weight, and disease resistance, were found in conserved genomic regions (Fig. 1g, Table 2 and Supplementary Table 17). We thus conclude that maturity, yield, and disease-resistance traits have been under strong human and/or geographical selection.
Cotton is the world’s sixth largest source of plant oil31. A significant SNP was detected in the eighth exon of the GaKASIII locus (Ga11G3851) on chromosome 11, which encodes 3-oxoacyl-[acyl-carrier-protein ACP] synthase III (Fig. 2a–c). KASIII encodes a key enzyme known to initiate fatty acid chain elongation from C2 to C4 and may ultimately determine the seed content of both palmitic acid (C16:0) and palmitoleic acid (C16:1)32. A polymorphism in GaKASIII results in a cysteine-to-arginine substitution in the conserved ACP_synthase_III_C domain (Fig. 2c). Haplotype B (TGT, cysteine) was mainly found in low-oil-content accessions, whereas haplotype A (CGT, arginine) was found in high-oil-content accessions (Fig. 2d,e). GaKASIII was expressed at the highest level at 30 d post anthesis (DPA) (Fig. 2f), which is a critical stage for seed oil accumulation33. Both C16:0 and C16:1 content accumulated at a significantly faster rate after 30 DPA in haplotype A accessions (Fig. 2h). Protein modeling with Phyre2 (ref. 34) at >90% accuracy showed that this cysteine/arginine residue is located at an α-helix close to the enzyme active site and the CoA-binding site (Fig. 2g).
Fusarium wilt disease, caused by Fusarium oxysporum f. sp. vasinfectum (FOV), is one of the most severe threats to cotton production35. We performed GWAS for FOV resistance, as measured by the fusarium wilt disease index (FWDI), and found a strong association signal on chromosome 11 with a –logP value of 8.96 (Fig. 3a). Further analysis identified that this SNP cluster was localized in an upstream region of Ga11G2353 (Fig. 3b), an ortholog of the Arabidopsis GSTF9, which encodes the Phi class of glutathione S-transferases involved in plant responses to biotic and abiotic stresses36. Accessions carrying the disease-susceptible allele ‘T’ were primarily found in the SC group, and all YER group members carried the disease-tolerant allele ‘C’ (Fig. 3c). GaGSTF9 was upregulated only in tolerant lines after FOV inoculation of G. arboreum seedlings (Fig. 3d). GaGSTF9-silenced cotton lines (TRV::GSTF9, the virus-induced gene-silencing vector carrying the GSTF9 gene) were found to be significantly more sensitive to FOV inoculation compared with empty-vector-carrying cotton lines (TRV::00) (Fig. 3e,f). Furthermore, the amount of fungal DNA was significantly higher, and the GST catalytic activity was significantly lower, in TRV::GSTF9 than in TRV::00 plants (Fig. 3g,h), thus suggesting that GaGSTF9 may be a target for FOV resistance in G. arboreum.
Cotton fuzz comprises short fibers that cover the seed surfaces. We selected 158 fuzzy and 57 fuzzless accessions from G. arboreum accessions in a GWAS analysis that identified a strong association signal on chromosome 8 (~0.6 to ~1.3 Mb) (Fig. 4a,b). The ∆SNP index above 99% confidence intervals (QTL region) from a QTL analysis was also located on chromosome 8 (~0.70 to ~2.15 Mb) with a maximum of 0.959 (Fig. 4c). Analysis of an F2 population obtained by crossing a fuzzy line (GA0146) with a fuzzless line (GA0149) identified a 1:3 segregation ratio for fuzzy and fuzzless phenotypes (Fig. 4d), thus indicating that a single locus controls fuzz initiation. When we zoomed in on the overlapping region obtained from QTL and GWAS analysis, we found that this approximately 600-kb region contains ten putative protein-encoding genes (Fig. 4e and Supplementary Fig. 14). Four genes encoding Casparian-strip membrane proteins37 were found under/near the strongest signal in this region (–logP = 18.95) (Supplementary Fig. 14a–d). A signal was located upstream of a putative B-type cyclin that has been reported to be involved in trichome or fiber development38,39,40 (Supplementary Fig. 14f).
G. arboreum has an important role in the history of Chinese cotton cultivation4. The present study shows that the Chinese G. arboreum population exhibits distinct geographic patterns that are consistent with its introduction from SC to the YZR and the YER. Several phenotypes such as yield and disease-resistance traits changed substantially during the migration of cotton from SC to the YZR and further to the YER, thus suggesting positive inputs from local environments as well as human selection. The geographically selected genomic regions and overlapped QTLs detected in this study via pairwise comparisons of different germplasm groups represent an important high-resolution genetic resource that should greatly facilitate the improvement of complex cotton traits. Additionally, we identified a gene (GaKASIII) that may control fatty acid chain elongation and oil content, and we found that two typical promoter haplotypes of GaGSTF9 are related to FOV resistance. Moreover, combined GWAS and QTL-seq identified possible functional roles for Casparian-strip membrane proteins during fuzz cell development. Our study indicates that geographic isolation has affected the genetic basis of SC, YZR, and YER populations, and has also influenced the development and distribution of disease resistance and yield traits of G. arboreum in China.
Genome sequencing and assembly
The same cultivated diploid cotton G. arboreum (cultivar Shixiya1, SXY1)12 was used for sequencing and assembly. A total of ~142 Gb of raw data was obtained from 125 SMRT cells on a PacBio RSII instrument. Hi-C experiments were performed as previously reported41. De novo assembly of the PacBio reads was carried out with two assemblers: the Canu pipeline42 and Falcon43, with different parameters to achieve a higher consistency and longer continuity. We used Quiver to polish base-calling of contigs. The PacBio contigs were further clustered and extended into pseudochromosomes by using Hi-C data. The gaps in the pseudochromosomes were filled with PBJelly, and a second round of polishing was performed in Quiver. Illumina reads were used to correct base-calling. Syntenic-block analysis was performed as described previously12.
Transposable element (TE) annotation
Both homolog-based and de novo strategies were applied to identify repetitive sequences in the G. arboreum genome. De novo prediction software, including RepeatScout44, LTR-FINDER45, MITE46, and PILER47, was used to identify repeats within the genome. These results were then combined and merged in Repbase to form the G. arboreum repetitive sequence database, which was further classified into various categories with REPET48. The resulting repetitive sequences in the genome were identified by homolog searching in that database through RepeatMasker (see URLs).
Gene-model prediction, evaluation and annotation
We combined three different gene-model prediction methods in the present study. In the homolog-based gene-prediction model, we used GeMoMa49 to predict gene structures with homologous proteins obtained from NCBI. For de novo prediction, we used Augustus50 with parameters trained by unigenes, by using transcriptome data obtained from pooled cotton tissues. In the transcriptome-based prediction, unigenes were first aligned to the genome assembly and were then filtered with PASA (see URLs). All predicted gene structures were integrated into a consensus set with EVidenceModeler (EVM)51. Genes were then annotated according to homologous alignments with BLAST52 (E value ≤ 1 × 10–5) against several databases including the nr53 and nt databases of NCBI, Swiss-Prot, and TrEMBL. We further used InterProScan (v4.3)54 to predict domain information and gene ontologies (GO terms)55. KAAS was used for KEGG pathway annotation56.
Sampling and sequencing
A total of 243 cotton accessions, including 230 G. arboreum and 13 G. herbaceum accessions (Supplementary Table 7), were selected from the Chinese National Germplasm Mid-term Genebank (Anyang, China). Plants were grown in the greenhouses at the Institute of Cotton Research of the Chinese Academy of Agricultural Sciences (ICR, CAAS). Fresh young leaves were collected from single individuals of each accession and were immediately frozen in liquid nitrogen. Genomic DNA was extracted with a previously reported workflow57. At least 5 μg of genomic DNA for each accession was used to build paired-end-sequencing libraries with insert sizes of approximately 500 bp, according to vendor-provided instructions (Illumina). An average 6× coverage of the assembled genome, with 125-bp paired-end reads for each accession, was generated with the Illumina HiSeq 2500 platform. For the QTL-seq analysis58, we sequenced two parent lines (GA0146 and GA0149) at 20× depth and two bulk populations (selected from an F2 population and containing 20 progenies each for fuzzy and fuzzless phenotypes) at 30× depth.
SNP index and ∆SNP index
The SNP index was calculated for both fuzzy and fuzzless bulk samples expressing the proportion of reads containing SNPs that were identical to those in the fuzzy parent (GA0146). The ∆SNP index was calculated as (SNP index of fuzzy bulk) – (SNP index of fuzzless bulk). The average ∆SNP index was calculated with a 100-kb sliding window with a step size of 10 kb and was used to plot the ∆SNP index distributions in Fig. 4e. Statistical 99% confidence intervals of the ∆SNP index were calculated under the null hypothesis (no QTL)58.
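The index definitions above can be illustrated with a minimal Python sketch. The function names, the per-SNP read counts and the choice to skip windows that contain no SNPs are assumptions made for illustration; they are not taken from the QTL-seq pipeline used in this study.

```python
from typing import List, Sequence, Tuple


def snp_index(parent_allele_reads: int, total_reads: int) -> float:
    """Proportion of reads at a SNP that carry the fuzzy-parent (GA0146) allele."""
    if total_reads <= 0:
        raise ValueError("SNP must be covered by at least one read")
    return parent_allele_reads / total_reads


def delta_snp_index(fuzzy_bulk_index: float, fuzzless_bulk_index: float) -> float:
    """Delta SNP index = (SNP index of fuzzy bulk) - (SNP index of fuzzless bulk)."""
    return fuzzy_bulk_index - fuzzless_bulk_index


def sliding_window_mean(
    positions: Sequence[int],
    values: Sequence[float],
    chrom_length: int,
    window: int = 100_000,
    step: int = 10_000,
) -> List[Tuple[int, float]]:
    """Average values over [start, start + window) windows moved by `step`.

    Windows without any SNP are skipped (an assumption of this sketch).
    """
    averages = []
    for start in range(0, chrom_length, step):
        inside = [v for p, v in zip(positions, values) if start <= p < start + window]
        if inside:
            averages.append((start, sum(inside) / len(inside)))
    return averages
```

With the defaults, each reported value is the mean ∆SNP index of a 100-kb window, and successive windows are offset by 10 kb, matching the window and step sizes stated above.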
Sequence alignment, variation calling, and annotation
All the sequence reads for each accession were mapped to the newly updated genome in the Burrows–Wheeler Aligner program (BWA, ver. 0.7.10)59 with default parameters. For mapping, all unanchored contigs were joined with 1-kb runs of ‘N’ (contig1 + 1 kb ‘N’ + contig2 + 1 kb ‘N’ + contig3 + …) and defined as chromosome 14. We sorted the alignments according to mapping coordinates in Picard (ver. 1.118). After removing the reads with low mapping quality (MQ <20), both paired-end and single-end mapped reads were used for SNP detection throughout the entire collection of cotton accessions in the GATK toolkit (ver. 3.2-2)60. Mapped reads were filtered by removal of PCR duplicates. First, the MarkDuplicates module was used to mark duplicate alignments; SNPs and indels identified by the HaplotypeCaller module were then used to perform base-quality recalibration with the BaseRecalibrator and IndelRealigner modules, respectively. Second, the genomic variants, in GVCF format for each accession, were identified with the HaplotypeCaller module and the GVCF model. Finally, after all of the GVCF files were merged, a raw population genotype file with the SNPs and indels was created in the HaplotypeCaller module and was filtered with the following parameters: ‘QD < 2.0 || MQ < 40.0 || FS > 60.0 || MQRankSum < –12.5 || ReadPosRankSum < –8.0 –clusterSize 3 –clusterWindowSize 10’ and ‘QD < 2.0 || FS > 200.0 || ReadPosRankSum < –20.0’. The identified SNPs and indels were further annotated with ANNOVAR61 and were divided into groupings of variations occurring in intergenic regions, coding sequences, and introns, on the basis of the newly updated G. arboreum genome annotation information.
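The two filter expressions can be read as simple threshold tests on per-variant annotations. The sketch below restates them in plain Python for illustration only; it does not call GATK. It assumes that the first expression was applied to SNPs and the second to indels (the text does not say so explicitly), that a variant lacking an annotation does not fail on that term, and it omits the cluster settings (clusterSize 3, clusterWindowSize 10). The function names are hypothetical.

```python
from typing import Dict, Optional

Annotations = Dict[str, Optional[float]]


def _below(ann: Annotations, key: str, cutoff: float) -> bool:
    """True if the annotation is present and strictly below the cutoff."""
    value = ann.get(key)
    return value is not None and value < cutoff


def _above(ann: Annotations, key: str, cutoff: float) -> bool:
    """True if the annotation is present and strictly above the cutoff."""
    value = ann.get(key)
    return value is not None and value > cutoff


def fails_first_expression(ann: Annotations) -> bool:
    """QD < 2.0 || MQ < 40.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0

    Assumed here to be the SNP filter.
    """
    return (
        _below(ann, "QD", 2.0)
        or _below(ann, "MQ", 40.0)
        or _above(ann, "FS", 60.0)
        or _below(ann, "MQRankSum", -12.5)
        or _below(ann, "ReadPosRankSum", -8.0)
    )


def fails_second_expression(ann: Annotations) -> bool:
    """QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0

    Assumed here to be the indel filter.
    """
    return (
        _below(ann, "QD", 2.0)
        or _above(ann, "FS", 200.0)
        or _below(ann, "ReadPosRankSum", -20.0)
    )
```

A variant is flagged as soon as any single term is true, which is what the `||` operator in the expressions denotes.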
Phylogenetic analysis and population-structure study
A subset of 72,419 SNPs (SNP quality >2,000, minor allele frequency (MAF) >0.05, and missing data <20%) in the 243 cotton accessions from the entire SNP dataset was screened to build a neighbor-joining tree in PHYLIP (version 3.695)62 with 100 bootstrap replicates. STRUCTURE software (version 2.3.1)63 was used to infer the cotton population structure. The program was run on the subset of SNPs to estimate the group membership of each accession by using 10,000 iterations with K values from 2 to 4.
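The three site-level criteria used to select this SNP subset can be sketched as follows. The genotype encoding (0, 1 or 2 copies of the alternative allele, `None` for a missing call) and the function names are assumptions for illustration, not the format used in the study.

```python
from typing import Optional, Sequence, Tuple


def site_stats(genotypes: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (missing_rate, minor_allele_frequency) for one biallelic site.

    Genotypes are alternative-allele counts per diploid accession (0, 1 or 2);
    None marks a missing call (hypothetical encoding).
    """
    if not genotypes:
        raise ValueError("at least one accession is required")
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1 - alt_freq)


def keep_site(
    genotypes: Sequence[Optional[int]],
    quality: float,
    min_quality: float = 2000,
    min_maf: float = 0.05,
    max_missing: float = 0.20,
) -> bool:
    """Apply the thresholds: SNP quality > 2,000, MAF > 0.05, missing data < 20%."""
    missing_rate, maf = site_stats(genotypes)
    return quality > min_quality and maf > min_maf and missing_rate < max_missing
```

For example, a site with one missing call among ten accessions and three alternative alleles among the nine called genotypes has a missing rate of 10% and a MAF of about 0.17, so it is retained if its quality exceeds 2,000.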
Linkage disequilibrium analysis
Haploview 4.20 (ref. 64) software was used to calculate LD values for the G. arboreum and G. herbaceum accessions on the basis of SNPs (MAF >0.05). The detailed parameters were as follows: -n -pedfile -info -log -minMAF 0.05 -hwcutoff 0 -dprime -memory 2096. LD decay was measured on the basis of the r2 value and the corresponding distance between two given SNPs.
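To make the LD measure concrete, the sketch below computes r2 for a pair of biallelic SNPs and reads a half-decay distance from a decay curve. It assumes phased haplotypes with alleles coded 0/1, whereas Haploview works from genotype data; the function names and the input layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """r2 between two biallelic SNPs from phased haplotypes.

    Each haplotype is (allele at SNP 1, allele at SNP 2), coded 0 or 1.
    r2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


def half_decay_distance(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First distance at which mean r2 is at or below half of its maximum.

    `curve` holds (distance, mean r2) pairs sorted by increasing distance.
    Returns None if the curve never drops that far.
    """
    half_max = max(r2 for _, r2 in curve) / 2
    for distance, r2 in curve:
        if r2 <= half_max:
            return distance
    return None
```

Under this reading, the ~105.5 kb value reported for G. arboreum would correspond to the distance returned by `half_decay_distance` for the G. arboreum decay curve.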
Population genetics analysis
Nucleotide diversity (π) analysis was applied to estimate the degree of variability within each group (SC, YZR, and YER), and the fixation statistic FST was applied to explain population differentiation on the basis of the variance of allele frequencies between two different groups. Both π and FST were calculated in the PopGen package (see URLs) of BioPerl (ver. 1.6.923). After filtering of SNPs with quality <2,000, π values for the SC, YZR, and YER groups were calculated individually. FST values were initially calculated for each SNP through a variance component approach, and then the average FST of all SNPs in each 100-kb window was used as the value at the whole-genome level across different groups. Sliding windows with the top 5% of FST values for each comparison were selected as candidate highly divergent regions for further analysis. Adjacent windows were merged into a single divergent region.
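The window-based selection of divergent regions can be sketched in a few steps: average per-SNP FST within each window, keep the top 5% of windows, and merge adjacent ones. The sketch assumes non-overlapping 100-kb windows on a single chromosome (the step size is not stated above) and takes per-SNP FST values as given; all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_means(snp_fst: Iterable[Tuple[int, float]], window: int = 100_000) -> Dict[int, float]:
    """Mean per-SNP FST in each non-overlapping window, keyed by window index."""
    totals = defaultdict(lambda: [0.0, 0])
    for position, fst in snp_fst:
        acc = totals[position // window]
        acc[0] += fst
        acc[1] += 1
    return {idx: total / count for idx, (total, count) in totals.items()}


def top_windows(means: Dict[int, float], percent: int = 5) -> List[int]:
    """Indices of the windows with the highest mean FST (top `percent` %, at least one)."""
    n_keep = max(1, len(means) * percent // 100)
    ranked = sorted(means, key=means.get, reverse=True)
    return sorted(ranked[:n_keep])


def merge_adjacent(indices: Iterable[int], window: int = 100_000) -> List[Tuple[int, int]]:
    """Merge runs of adjacent windows into (start, end) regions, end exclusive."""
    regions: List[Tuple[int, int]] = []
    for idx in sorted(indices):
        start, end = idx * window, (idx + 1) * window
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

Merging means that a run of consecutive high-FST windows is counted once, as a single divergent region, rather than once per window.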
For phenotypic evaluations, we selected 215 of 230 resequenced accessions that displayed reliable phenotypes and planted them in Anyang (Henan province), Sanya (Hainan province), and Akesu (Xinjiang province) in 2014. Several traits were investigated in only one or two locations, owing to resource limitations. For drought-tolerance evaluation, the seedlings were watered every 3 d for a total of three weeks, and then water was withheld from 3-week-old seedlings. When the drought-sensitive accessions exhibited severe leaf-wilt symptoms, all of the plants were rewatered. Four days after the rewatering, the numbers of surviving plants of each accession were counted. Three replicates were performed at each location.
The FWDI evaluation followed the Chinese technical specifications for evaluating cotton diseases and pests (GB/T 22101.4-2009). The F. oxysporum strain Ag149 was inoculated in soil. The disease-susceptible Jimian-11 line and the highly disease-tolerant Zhongzhimian-2 line were used as maxima to calculate the disease index. The molecular detection of F. oxysporum DNA (fungal DNA) in cotton leaves was performed according to a previously described method65.
The fatty acid composition of the lipid content of cotton seeds and ovules was evaluated according to previously described procedures66.
A total of 1,425,003 high-quality SNPs (MAF >0.05, missing rate <20%) in 215 G. arboreum accessions were used to perform GWAS for 20 traits in efficient mixed-model association expedited (EMMAX) software67, which was designed to handle large-dataset analysis68. GWAS for several traits were conducted in multiple locations with different ecological environments, including Anyang (N36.02°, E114.50°, altitude 63 m), Sanya (N18.35°, E109.33°, altitude 11 m), and Akesu (N41.11°, E80.54°, altitude 1,107 m). Population stratification and hidden relatedness were modeled with a kinship (K) matrix in the emmax-kin-intel package of EMMAX. The genome-wide significance thresholds of all tested traits were evaluated with the formula P = 0.05/n (where n is the effective number of independent SNPs)69. The P-value thresholds for significance in the G. arboreum population were approximately 1.0 × 10−6.
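The threshold formula can be checked with a short calculation. The effective number of independent SNPs is not reported above; the value of 50,000 used in the usage note is back-calculated from the stated threshold of about 1.0 × 10−6 and is an assumption for illustration. The function names are hypothetical.

```python
import math


def significance_threshold(n_independent_snps: int, alpha: float = 0.05) -> float:
    """Genome-wide threshold P = alpha / n, with n the effective number of independent SNPs."""
    if n_independent_snps <= 0:
        raise ValueError("n must be positive")
    return alpha / n_independent_snps


def neg_log10(p_value: float) -> float:
    """-log10(P), the scale used on Manhattan plots."""
    return -math.log10(p_value)


def is_significant(p_value: float, n_independent_snps: int) -> bool:
    """True if the association P value falls below the genome-wide threshold."""
    return p_value < significance_threshold(n_independent_snps)
```

With n = 50,000, `significance_threshold` gives about 1.0 × 10−6, which corresponds to –logP = 6, the cut-off used to define significant association signals.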
Ancestral-allele and phylogenetic analysis among species
To identify orthologous alleles, we first set up a saturation curve using different lengths of flanking sequences with 100 randomly picked alleles. We observed a plateau with a false-positive rate of 13.5% when the sequence length reached 901 bp (including the particular SNP) (Supplementary Fig. 15). We initially extracted each SNP (~18 million SNP set) and its flanking sequences (450-bp length in both directions) from the G. arboreum genome. We then used the BLAST52 algorithm (E value <1 × 10–10) to identify orthologous sequences in the G. raimondii genome. Only the top hit from the BLAST results was retained. A total of 4,487,496 corresponding hits were found in the G. raimondii genome. Those SNPs in each accession that were identical to SNPs in G. raimondii were defined as ancestral alleles. The ancestral-allele percentage was the average value of all G. arboreum and G. herbaceum accessions, respectively. To assess the phylogenetic relationships among G. raimondii, G. arboreum, and G. herbaceum, we used SNPs of 68,830 sites that were present in all three species to construct a phylogenetic tree in FastTree70 with default parameters.
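Two steps of this procedure lend themselves to a short sketch: extracting the 901-bp query (the SNP plus 450 bp on each side) and scoring the share of alleles in an accession that match G. raimondii. The BLAST search itself is not shown. The function names, the 0-based coordinates and the choice of denominator (sites with a G. raimondii hit) are assumptions, as the text does not specify them.

```python
from typing import Dict


def flanking_window(sequence: str, snp_pos: int, flank: int = 450) -> str:
    """Return the SNP base plus `flank` bases on each side (0-based position).

    The window is 2 * flank + 1 = 901 bp unless truncated at a sequence end.
    """
    if not 0 <= snp_pos < len(sequence):
        raise ValueError("SNP position lies outside the sequence")
    start = max(0, snp_pos - flank)
    end = min(len(sequence), snp_pos + flank + 1)
    return sequence[start:end]


def ancestral_allele_percentage(
    accession_alleles: Dict[int, str], outgroup_alleles: Dict[int, str]
) -> float:
    """Percentage of an accession's alleles identical to the outgroup allele.

    Only sites present in both dictionaries are scored (assumed denominator).
    """
    shared = [site for site in accession_alleles if site in outgroup_alleles]
    if not shared:
        raise ValueError("no sites shared with the outgroup")
    matches = sum(1 for s in shared if accession_alleles[s] == outgroup_alleles[s])
    return 100.0 * matches / len(shared)
```

Averaging `ancestral_allele_percentage` over all accessions of a species would give a species-level value comparable to the percentages reported for G. arboreum and G. herbaceum.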
Functional characterization of GaGSTF9
To analyze the expression pattern of GaGSTF9, samples were harvested at various incubation time points. Total RNA (~2 μg) was extracted and was then reverse transcribed in a 20-μl reaction mixture with EasyScript cDNA Synthesis SuperMix (TRANSGEN Biotech). Then 1-μl sample aliquots were used as templates for qRT–PCR analysis. Three technical replicates per sample and three biological-replicate samples were analyzed for each experiment. Histone3 (LOC108467150) was used as the internal control for qRT–PCR data analysis. For virus-induced gene silencing, a 398-bp fragment from GaGSTF9 was cloned into the XbaI and SacI sites of the pTRV-RNA2 vector. Glutathione S-transferase activity was measured with a Glutathione S-transferase (GSH-ST) assay kit (Jiancheng). All primers used in this study are presented in Supplementary Table 18.
Student’s two-tailed t tests and one-way ANOVA tests were performed in GraphPad Prism software.
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
The updated G. arboreum genome assembly data are accessible through the NCBI under accession PRJNA382310. All raw sequencing data for the 243 accessions have been deposited in the NCBI BioProject database under accession number PRJNA349094. Supporting data (updated genome, raw SNP sets, input files for structure and nucleotide diversity, list of orthologous loci and phenotype data) can be downloaded from ftp://bioinfo.ayit.edu.cn/downloads/.
This work was supported by funding from the National Natural Science Foundation of China (grants 31621005 to F. Li and 90717009 to Y.Z.), the National Key Technology R&D Program, the Ministry of Science and Technology (2016YFD0100203 to X.D. and 2016YFD0100306 to S. He), the National Science and Technology Support Program, the Ministry of Agriculture (2013BAD01B03 to X.D.), the Agricultural Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-IVFCAAS to S. Huang), and the leading talents of Guangdong Province Program (00201515 to S. Huang).