Introduction

A suite of morphological and physiological adaptations that have permitted different Gossypium taxa to flourish in a wide range of natural environments offers potential solutions to a host of challenges to cotton cultivation, but the genetic basis of many of these adaptations remains to be discovered. About 45 diploid Gossypium species are recognized, classified into eight groups with genomes A–G and K based on chromosome pairing in interspecific hybrids, chromosome size and geographical distribution (Beasley, 1940; Phillips and Strickland, 1966; Edwards and Mirza, 1979; Endrizzi et al., 1985). The five tetraploid cotton species, including the two most important cultivated cottons, are all derived from a natural cross between A genome and D genome ancestors resembling the extant G. herbaceum and G. raimondii, about 1–2 million years ago (Wendel et al., 1989; Wendel and Albert, 1992; Senchina et al., 2003; Wendel and Cronn, 2003).

Genetic changes at the DNA level are the foundation of organic evolution. Both single-nucleotide substitution and insertion/deletion of multiple nucleotides are caused by a variety of genetic mechanisms, and their study provides insight into relationships among taxa, as well as rates and mechanisms of divergence, also providing DNA polymorphisms, which can be used in forward and association genetics studies. The Gossypium genus is an attractive model for studying the genetic changes associated with polyploid evolution, with a natural interspecific cross between taxa that remain extant followed by chromosome doubling having spawned five extant tetraploid species. Substantial divergence among the diploid genomes provides a backdrop for polyploid cotton evolution—for example, a twofold difference in genome size between the A and D genome progenitors is due largely to differential accumulation of retroelements in the A genome (Zhao et al., 1995, 1998; Grover et al., 2007) and biased accumulation of small deletions in the D genome (Grover et al., 2007), with intergenomic exchange of these elements following polyploid formation (Wendel et al., 1995; Cronn et al., 1996; Hanson et al., 1998; Zhao et al., 1998).

An important element of invigorated efforts to relate DNA-level information to its phenotypic consequences in cultivated (largely tetraploid) cottons and other polyploid crops is to improve understanding of the rates and mechanisms of divergence among diploid relatives, especially those that contributed to polyploid formation. Although the diploid progenitors of polyploid cottons are clearly identified, the exact relationships among these and other diploids have remained controversial (Seelanan et al., 1997; Cronn et al., 2002), in part because only small numbers of genetic loci have been studied and evolutionary rates of different genes vary widely (Senchina et al., 2003).

Here, we explore types, levels and patterns of DNA polymorphism among a strategic sampling of diploid cottons, using DNA sequencing to characterize hundreds of orthologous loci derived from sequence-tagged sites that have previously been anchored to a cotton genetic map and broadly sample the genome (Rong et al., 2004).

Materials and methods

Plant materials and DNA extraction

The plant materials used in this experiment were two Gossypium A genome species (G. herbaceum, A1-97 and G. arboreum, A2-47), two D genome species (G. raimondii, D5-1 and G. trilobum, D8-1), one C genome species (G. sturtianum, C1-4) and one out-group (Gossypioides kirkii). Genomic DNA used for sequence amplification and analysis was extracted as reported elsewhere (Rong et al., 2004).

Oligonucleotide primer design

The DNA sequences used to design primers come from 15 sources that include 4 classes: expressed sequence tag (EST), simple sequence repeat (SSR), low-copy genomic DNA (gDNA) and Cot-filtered non-coding (CFNC) DNA (Supplementary Table 1). Most ESTs, gDNA and SSRs are from genetically mapped sequences (Rong et al., 2004). CFNC sequences were extracted from G. raimondii DNA using renaturation kinetics to obtain putatively low-copy DNA then sequencing random clones and filtering out the subset containing recognizable functional domains (Rong et al., 2011). The eprimer3 interface to the primer3 software program (Rozen and Skaletsky, 2000) was used to design primers in batches. Low complexity or microsatellite regions were screened using the program Cross Match before running eprimer3.

PCR amplification and sequencing

The primer sequences and annealing temperatures used during PCR amplification are listed in Supplementary Table 1. Amplification of PCR products was performed in 30-μl reactions, including 3 μl 10 × PFU buffer, 1.8 μl 25 mM MgCl2, 3 μl 2 mM dNTPs, 1 μl Taq polymerase with 5% PFU DNA polymerase (an enzyme found in the hyperthermophilic archaeon Pyrococcus furiosus), 1 μl DNA, 1.5 μl primer and 17.2 μl H2O, using the following procedure: initial denaturation at 94 °C for 4 min, then 32 cycles of denaturation at 94 °C for 1 min, annealing at 52 °C for 1 min and extension at 72 °C for 1 min. In the final cycle, extension at 72 °C was prolonged for 7 min. In an effort to increase the throughput and quality of PCR, ‘Touchdown PCR’ was used for some loci (Supplementary Table 1), decreasing the annealing temperature by 1 °C every second cycle until a minimum temperature was reached (Rong et al., 2011). Amplicon quality was determined by running a 5 μl aliquot of each product on a 1% agarose gel. Amplicons were deemed suitable for sequencing only if single high-quality bands were clearly observed in the agarose electrophoresis gels.
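The touchdown step can be illustrated with a minimal sketch. The starting temperature (60 °C) and the use of 52 °C as the floor are assumptions made for illustration only; the actual values for each locus are those listed in Supplementary Table 1, and the function name is hypothetical.

```python
def touchdown_schedule(start_c, min_c, n_cycles, step_c=1.0, cycles_per_step=2):
    """Return the annealing temperature (deg C) used in each PCR cycle.

    The temperature drops by `step_c` after every `cycles_per_step` cycles
    and is held at `min_c` once that floor is reached.
    """
    temps = []
    for cycle in range(n_cycles):
        drop = step_c * (cycle // cycles_per_step)
        temps.append(max(start_c - drop, min_c))
    return temps


# Hypothetical example: start at 60 deg C, floor at 52 deg C, 32 cycles.
schedule = touchdown_schedule(60, 52, 32)
```

With these illustrative settings the floor is reached after 16 cycles, and the remaining cycles run at the minimum temperature.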

Before sequencing, PCR products were cleaned up by adding 4 μl of 1% Exonuclease I (Exo) and 9.1% Shrimp Alkaline Phosphatase to each well of the plate containing PCR product. The plate was placed on a thermocycler and the ‘ExoSap’ program was run for 30 min (37 and 80 °C for 15 min each). Sequencing reactions used the ABI Big Dye 3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA). Unincorporated dye terminators were removed from the reactions by first adding a mild detergent (12 μl 0.003% SDS solution) to each well, denaturing, and then filtering using Sephadex G-50 filter plates. The cleaned products were filtered directly into Perkin-Elmer MicroAmp Optical 96-well reaction plates, then sequenced on an ABI Prism 3730xl (Applied Biosystems, Foster City, CA, USA).

Sequence alignment

A bioinformatics pipeline was developed in which sequences produced from the six cotton relatives by each primer pair were treated as a single project. Using computational tools developed in Python, high-quality traces (meeting a minimum threshold of 50 base pairs with a Phred score greater than 20) were distributed into folders specific to the primer pair from which they were derived. The software programs Phred and Phrap were then run in such a way that consensus sequences for each genotype were first produced via the alignment of forward and reverse reads, and quality files for these alignments were generated. Each quality file reflects the combined scores of the reads, increasing the consensus Q score at base positions that overlap and match, and decreasing the Q score at mismatches. To further reduce the number of false-positive polymorphism calls, bases with quality scores below 16 were trimmed from both ends of each sequence. The six consensus sequences for each diploid locus were then aligned using T-Coffee and/or ClustalW (Thompson et al., 1994; Notredame et al., 2000).
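The trace filter and end trimming described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the function names are hypothetical, and the 50-base criterion is interpreted here as a count of bases with Phred score above 20 anywhere in the read, not necessarily contiguous.

```python
def passes_trace_filter(quals, min_bases=50, min_q=20):
    """True if the read has at least `min_bases` bases with Phred score > `min_q`.

    Assumption: qualifying bases need not be contiguous.
    """
    return sum(q > min_q for q in quals) >= min_bases


def trim_low_quality_ends(seq, quals, min_q=16):
    """Remove bases with quality below `min_q` from both ends of a sequence.

    Low-quality bases in the interior are left untouched.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# Toy read: the two leading bases and the last base fall below Q16.
trimmed_seq, trimmed_quals = trim_low_quality_ends("ACGTAC", [5, 10, 30, 40, 16, 3])
```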

Polymorphism analysis

We performed both pairwise and three-way comparisons of polymorphism and evolutionary rates between different diploid cotton relatives. Pairwise comparisons were done between all combinations of (sub)species, revealing: single-nucleotide polymorphism (SNP), multiple extended SNP, single-nucleotide indel, extended (multiple nucleotide) indel, and mixed indels and nucleotide substitutions. Three parameters were mainly analyzed to indicate divergence among different genomes: SNPs, polymorphic sites (nucleotides with divergence due to all polymorphism types listed above, abbreviated as ‘poly site’ in the figures and tables) and polymorphic base pairs (abbreviated as ‘poly bp’), each expressed per 100-bp comparison window. All bases with sequence quality scores ≥25 were counted, and the quality control threshold was relaxed when two matched bases between compared sequences were identical, requiring only that the sum of the two scores be ≥30.
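The quality rule for pairwise comparisons can be sketched as follows. The sketch assumes the thresholds are lower bounds (scores of at least 25, or a summed score of at least 30 for identical bases), counts only base-to-base columns (indels would be tallied separately), and uses hypothetical function names.

```python
def is_comparable(base1, q1, base2, q2, min_q=25, min_sum_identical=30):
    """Apply the pairwise quality rule to one aligned column."""
    if base1 == base2 and q1 + q2 >= min_sum_identical:
        return True  # relaxed threshold for identical bases
    return q1 >= min_q and q2 >= min_q


def snps_per_100bp(seq1, quals1, seq2, quals2):
    """Return (SNPs per 100 comparable bp, comparable bp, SNP count)."""
    comparable = snps = 0
    for b1, q1, b2, q2 in zip(seq1, quals1, seq2, quals2):
        if "-" in (b1, b2):
            continue  # gap columns are not handled in this sketch
        if not is_comparable(b1, q1, b2, q2):
            continue
        comparable += 1
        if b1 != b2:
            snps += 1
    rate = 100.0 * snps / comparable if comparable else 0.0
    return rate, comparable, snps


# Toy alignment: the last column fails the quality rule and is skipped.
result = snps_per_100bp("ACGTA", [30, 10, 30, 30, 20],
                        "ACCTA", [30, 25, 30, 30, 5])
```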

Three-way comparison was performed to reveal nucleotide variations in sequences from two (sub)genomes with reference to their closest outgroup genome, to try to provide more resolution about substitution types and ancestral states. A three-node tree ((X1, X2), O) defined a specific three-way comparison between (sub)genomes X1 and X2 with reference to their outgroup O. The following three-way combinations were considered in our analysis: ((A1, A2), C1), ((D5, D8), C1), ((A1, C1), D5), ((A1, D5), Ki), and ((C1, D5), Ki). At a specific site, the matched trios (bases and/or indels) were subjected to quality control: if a trio was formed by three identical bases, the data were only considered if the summed quality score was ≥35; if two bases and one indel, the summed quality score of the two bases had to be ≥30; if only one base and two indels, the quality score of the base had to be ≥25. For three-way comparison of ((A, D), Ki), when the two A and two D subgenomes were each monomorphic but differed between A and D, ancestral sequences were inferred by referring to the Ki sequence.
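The logic of a three-way comparison can be illustrated with a short sketch. It assumes a simple parsimony rule: when X1 and X2 differ, the lineage whose base disagrees with the outgroup is taken to carry the change, and sites where all three differ are left unresolved. The function names are hypothetical, and quality control and indels are omitted for brevity.

```python
from collections import Counter


def assign_lineage(x1, x2, outgroup):
    """Assign a difference between X1 and X2 to a lineage using outgroup O."""
    if x1 == x2:
        return "none"       # no difference between the two ingroup taxa
    if x2 == outgroup:
        return "X1"         # X1 carries the derived state
    if x1 == outgroup:
        return "X2"         # X2 carries the derived state
    return "ambiguous"      # all three differ; ancestral state unresolved


def count_lineage_changes(seq_x1, seq_x2, seq_out):
    """Tally lineage assignments over aligned columns without gaps."""
    counts = Counter()
    for trio in zip(seq_x1, seq_x2, seq_out):
        if "-" in trio:
            continue
        counts[assign_lineage(*trio)] += 1
    return counts


# Toy trio, e.g. ((A1, A2), C1): one change on X1 and one unresolved site.
counts = count_lineage_changes("ACGTT", "ACCTA", "ATCTG")
```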

Reconstruction of phylogeny

We pooled all the alignments of genes together to produce a concatenated alignment, on which the genome and species phylogeny was reconstructed. The concatenated alignment was constructed by selecting aligned columns without gaps and with matched bases having quality scores ≥25. The phylogeny was reconstructed using several approaches, including neighbor-joining, maximum parsimony, minimum evolution and UPGMA, each implemented in MEGA 4 (Tamura et al., 2007). Default parameter settings for each approach were adopted. To compare potential phylogenies by rearranging the locations of diploid A, D and C genome species to produce different trees [T1: ((((A1, A2), C1), (D5, D8)), Ki); T2: ((((A1, A2), (D5, D8), C1)), Ki); T3: ((((A1, A2), (D5, D8)), C1), Ki)], we performed maximum likelihood evaluation by running DNAML using default settings implemented in PHYLIP (http://evolution.genetics.washington.edu/). The difference of logarithms of maximum likelihoods for two different phylogenies follows a normal distribution as previously described (Kishino and Hasegawa, 1989).
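Construction of the concatenated alignment can be sketched as follows. The data layout (one dictionary per locus mapping a taxon label to an aligned sequence and a per-column quality list) and the function names are assumptions made for illustration; the sketch also assumes every locus contains the same set of taxa and that the threshold is a lower bound of 25.

```python
def filter_columns(locus, min_q=25):
    """Keep columns with no gap in any taxon and all quality scores >= min_q.

    `locus` maps taxon -> (aligned sequence, list of per-column qualities).
    """
    taxa = sorted(locus)
    length = len(locus[taxa[0]][0])
    kept = {t: [] for t in taxa}
    for i in range(length):
        column = [locus[t][0][i] for t in taxa]
        quals = [locus[t][1][i] for t in taxa]
        if "-" in column or min(quals) < min_q:
            continue
        for taxon, base in zip(taxa, column):
            kept[taxon].append(base)
    return {t: "".join(bases) for t, bases in kept.items()}


def concatenate(loci, min_q=25):
    """Join the filtered columns of every locus into one sequence per taxon."""
    out = {}
    for locus in loci:
        for taxon, seq in filter_columns(locus, min_q).items():
            out[taxon] = out.get(taxon, "") + seq
    return out


# Toy input with two taxa and two loci.
loci = [
    {"A1": ("AC-T", [30, 30, 0, 30]), "D5": ("ACGT", [30, 20, 30, 30])},
    {"A1": ("GG", [40, 40]), "D5": ("GC", [40, 40])},
]
concatenated = concatenate(loci)
```

The resulting per-taxon strings could then be written out in whatever format the tree-building program expects.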

Results

General divergence of DNA sequences among diploid Gossypium genomes

A total of 780 loci have been amplified from 6 diploid cotton relatives and their amplicons have been sequenced and aligned (Supplementary Figure 1 is an example of such alignment), resulting in 15 possible pairwise comparisons for each locus. Because of differences in success of PCR amplification and DNA sequencing among species, the number of informative loci varies among different pairwise comparisons, from 372 (D8/Ki) to 552 (A1/A2) (Table 1 and Supplementary Table 2). Average comparable sequence length is similar among pairwise comparisons, ranging from 209 (C1/D5) to 229 bp (A1/A2). SSR-derived loci have the shortest comparable sequence lengths (154.2, 136.8–250.5) followed by CFNC loci (158.4, 131.0–178.9) and gDNA (195.8, 184.5–204.8), with EST loci having the longest (237.2, 227.7–254.4) (Table 2).

Table 1 DNA divergence in the pairwise comparisons among six cotton relatives
Table 2 DNA divergence of different locus type in pairwise comparisons among six cotton relatives

Patterns of DNA divergence among the six cotton relatives (Table 1 and Supplementary Table 2) agree generally with expected phylogenetic relationships (Hawkins et al., 2006). The two D genome species (D5 vs D8) exhibited nearly twice the DNA divergence of the two A genome species (A1 vs A2), with 0.68 vs 0.34 SNPs, 1.21 vs 0.72 polymorphic sites and 2.86 vs 1.44 polymorphic bp per 100 comparable bp, respectively (Table 1 and Supplementary Table 2). Among different genome types, the outgroup species (G. kirkii) was more distant from all other genotypes than each was from one another (6.51–7.32 polymorphic bp/100 bp). Importantly, the divergence between A and D species is about 4.4 polymorphic bp/100 bp, whereas that between A and C species is about 3.8 (Table 1 and Supplementary Table 2).

The types and levels of divergence among loci varied widely. The pairwise comparison between G. herbaceum (A1) and G. trilobum (D8) is used as an example to illustrate this point. A total of 512 loci were successfully amplified and sequenced resulting in 112 244 comparable bp from A1 and D8, with average comparable sequence length per locus of 219.2 (Table 1). Among the 512 comparable loci, 35 comprising 4693 bp (134 bp/locus) were identical. The remaining 477 revealed a total of 4814 polymorphic base pairs, derived from 2641 polymorphic sites including 1798 SNPs, 168 single base indels and 675 extended polymorphisms (Figure 1). So, the most abundant mutations are single-nucleotide substitutions (Figure 1). However, SNP frequencies varied considerably among different loci, ranging from 0.21 to 10.7 per 100 bp, with about 50% in a range of 0.5–1.5 SNPs/100 bp (Figure 2). Extended polymorphisms averaged 4.2 bp in length with the longest being a 65 bp deletion in G. trilobum at locus A1270. The number of polymorphic sites per 100 bp ranges from 0.01 to 14.99 among different loci, with most loci having 0.5–2.99 polymorphic sites per 100 bp (Figure 3).
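The A1 vs D8 counts quoted above can be cross-checked with a few lines of arithmetic. The sketch pools counts over all loci, which is an assumption: the tabulated values may have been averaged differently. Variable and function names are illustrative.

```python
# Counts reported in the text for the A1 vs D8 comparison.
comparable_bp = 112244
snps, single_indels, extended_sites = 1798, 168, 675
polymorphic_bp = 4814

polymorphic_sites = snps + single_indels + extended_sites


def per_100bp(count, total_bp):
    """Express a count as a frequency per 100 comparable bp."""
    return 100.0 * count / total_bp


snp_rate = per_100bp(snps, comparable_bp)
site_rate = per_100bp(polymorphic_sites, comparable_bp)
bp_rate = per_100bp(polymorphic_bp, comparable_bp)

# Share of polymorphic sites that are SNPs.
snp_share = snps / polymorphic_sites

# Mean length of extended polymorphisms: bp not accounted for by
# single-base events, divided by the number of extended sites.
mean_extended_len = (polymorphic_bp - snps - single_indels) / extended_sites
```

The three categories sum to the reported 2641 polymorphic sites, the pooled rates come to roughly 1.60 SNPs, 2.35 polymorphic sites and 4.29 polymorphic bp per 100 bp, and the mean extended length of about 4.2 bp agrees with the value given in the text.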

Figure 1
Length (bp) of polymorphic sites in pairwise comparison between A1 and D8 cotton species.

Figure 2
Frequencies of loci with different numbers of single bp mutations (SNP+Indel) per 100 bp comparison between A1 and D8 cotton species.

Figure 3
Comparison of polymorphic site abundance between A1 and D8 cotton species in four DNA element classes.

Comparison of DNA element classes

Primer pairs derived from different types of sequence-tagged sites (EST, gDNA, CFNC DNA and SSR) showed different polymorphism levels. ESTs have the smallest DNA divergence followed by gDNA and CFNC (Table 2). SSR-derived loci showed the largest DNA divergence in most pairwise comparisons with a few exceptions (Supplementary Table 2, Table 2). For example, in the comparison of A1 vs A2, CFNC showed greater SNP number than SSR (Supplementary Table 2). SSR-derived loci had almost twice as much divergence as ESTs in most pairwise comparisons and most mutation types (Supplementary Table 2). In particular, SSR loci have more polymorphic sites larger than two bp, resulting in a larger total number of polymorphic bp than other types of loci (Figure 3). For example, SSR loci have 3.66 SNP in the most divergent pairwise comparison (C1:K), only 1.5 times more than ESTs. However, SSR loci have 6.17 polymorphic bp in the least divergent pairwise comparison (A1 vs A2), 4.5 times more than EST (1.38, Table 2, Supplementary Table 2). Similarly, in the most divergent pairwise comparison, SSR loci have almost 4 times more polymorphic bp than ESTs (25.3 vs 6.73, Table 2).

Comparison of A and D genomes

The two A and two D genome species studied were chosen for several reasons. First, in each case the two species provide a diverse sampling of variation within the respective genome types. Loci that are polymorphic between the A and D genome types but monomorphic within each type may be especially attractive for SNP discovery in tetraploid cottons, being locus-specific. For each genome type, reference maps have been made from crosses between the two species we studied (Brubaker et al., 1999; Rong et al., 2004; Desai et al., 2006). Accordingly, SNPs that differentiate between the two species within these genome types are potential genetic markers for ongoing studies, especially of value in the A-genome that is cultivated and improved in some parts of the world.
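The distinction between sites that are fixed between genome types and sites that vary within a genome type can be sketched as follows. The category labels and function name are hypothetical, and the sketch considers single aligned bases only.

```python
def classify_quartet_site(a1, a2, d5, d8):
    """Classify one aligned site across the A1, A2, D5 and D8 sequences."""
    within_a = a1 != a2
    within_d = d5 != d8
    if not within_a and not within_d:
        # Monomorphic within each genome type.
        return "fixed_between_A_and_D" if a1 != d5 else "invariant"
    if within_a and within_d:
        return "polymorphic_within_both"
    return "polymorphic_within_A" if within_a else "polymorphic_within_D"


def genome_diagnostic_positions(seq_a1, seq_a2, seq_d5, seq_d8):
    """Return 0-based positions that separate the A and D genome types."""
    return [
        i
        for i, site in enumerate(zip(seq_a1, seq_a2, seq_d5, seq_d8))
        if "-" not in site and classify_quartet_site(*site) == "fixed_between_A_and_D"
    ]


# Toy quartet: only position 1 is monomorphic within A and D yet differs between them.
positions = genome_diagnostic_positions("ACGT", "ACGA", "ATGT", "ATCT")
```

Sites in the first category are candidates for genome-specific markers in tetraploids, whereas sites polymorphic within A or within D are candidates for mapping within the respective diploid genome.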

A total of 423 loci were amplified from the quartet of two A and two D genome species, producing 81 904 comparable base pairs across average sequence lengths of 193.6 bp (Table 3). A total of 2328 base pairs distributed across 1513 sites appear potentially informative in the distinction between A and D genomes. Among polymorphic sites, 1229 SNPs were detected between A and D diploids, which might be used to distinguish A and D genomes in tetraploid cotton (Table 1, Supplementary Table 3). In addition, 213 genome-specific indel loci were also found—the largest indel useful in the segregation of ancestral tetraploid genomes is 65 bp in length at locus A1270. Unexpectedly, SSR loci showed the lowest rate of SNPs useful in the distinction between A and D genomes, even fewer than in comparison of A1 vs A2. More in line with other comparisons, however, SSR loci have the highest percentage of polymorphisms extending over multiple nucleotides, have the second highest frequency of polymorphic sites (2.28/100 bp) and have the highest frequency of polymorphic base pairs (5.87/100 bp).

Table 3 Percentage of transitions and transversions in six diploid cotton relatives detected in three way comparisons

A total of 552 loci comprising 125 545 comparable bp (227.4 bp per locus on average) were available for finding polymorphism between G. arboreum and G. herbaceum. A total of 359 loci (65%) contained polymorphisms between these two species. CFNC loci showed the highest SNP frequency, twice that of the next-highest class (SSR loci), at 0.82 vs 0.41 (Supplementary Table 2). ESTs had the lowest SNP frequency.

A total of 544 loci comprising 116 422 comparable bp (214 per locus on average) were available for finding polymorphism between G. raimondii and G. trilobum, and 430 (79%) loci contained polymorphisms. As noted above, polymorphism between the two D genome species is much higher than between the two A genome species for all types of loci, being two times more on average (0.68 vs 0.34, Table 1, Supplementary Table 2). In the comparison of D genome species, SSR, rather than CFNC in the A genome, showed the highest frequency of SNPs (Supplementary Table 2) among the four classes of loci.

Comparative sequence divergence rates among diploid Gossypium genomes

To explore relative differences in frequencies of different mutational types among genomes and species, three way comparisons were analyzed using an appropriate outgroup (Figure 4 and Supplementary Table 4). In the comparisons between two species with the same genome (A or D), G. sturtianum (C genome) was used as an outgroup. No obvious differences were observed in frequencies of SNPs (0.18 vs 0.20 SNPs/100 bp of A1 vs A2 and 0.37 vs 0.38 of D5 vs D8) or polymorphic sites (0.32 vs 0.35 bp/100 bp of A1 vs A2; 0.59 vs 0.61 of D5 vs D8) between two species with the same genome type, suggesting that the respective species evolved at similar rates after speciation (Figure 4, Supplementary Table 4).

Figure 4
Types and levels of DNA polymorphism in species and genome types inferred in three way comparisons using outgroup species.

The C genome of G. sturtianum appears to have evolved more slowly than either A or D genomes (Figure 4 and Supplementary Table 4). Among 452 loci including 92 842 comparable bp, 810 SNPs (0.87 SNP/100 bp) were detected in A1 vs 563 (0.61 SNP/100 bp) in C1 when G. raimondii was used as an outgroup, a highly significant difference (P=2.6E−11). Similarly, among 331 loci (67 649 bp), 539 SNPs (0.80 SNPs/100 bp) were found in D5 vs 439 (0.65 SNP/100 bp) in C1 when G. kirkii was used as an outgroup, also a highly significant difference. Across all types and measures of mutation, the C genome always showed fewer changes than A or D (Supplementary Table 4).
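The text does not name the statistical test used. As an illustration only, the sketch below assumes a one-degree-of-freedom chi-square test of the null hypothesis that both lineages accumulated SNPs equally over the same comparable sequence; under that assumption the A1 vs C1 counts give a P value close to the one reported. The function name is hypothetical.

```python
import math


def equal_expectation_chi2(count_a, count_b):
    """Chi-square test (1 d.f.) that two counts share the same expectation.

    Returns (chi-square statistic, P value). For one degree of freedom the
    upper-tail probability equals erfc(sqrt(chi2 / 2)).
    """
    expected = (count_a + count_b) / 2.0
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# SNPs assigned to each lineage in the two comparisons reported above.
chi2_a1_c1, p_a1_c1 = equal_expectation_chi2(810, 563)
chi2_d5_c1, p_d5_c1 = equal_expectation_chi2(539, 439)
```

For 810 vs 563 SNPs the statistic is about 44.4 and the P value about 2.6E−11; for 539 vs 439 the statistic is about 10.2, with a P value on the order of 0.001.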

We analyzed the mutation rate of A and D ancestral genomes using G. kirkii as an outgroup. If A1 and A2, or D5 and D8, are monomorphic but there is polymorphism between A and D, comparison to G. kirkii permits us to infer ancestral state at the locus. On the basis of such inference, the A genome experienced faster evolution than the D genome, with significantly more SNPs and polymorphic sites (Figure 4, Supplementary Table 4). Across the various types of mutations, the A genome seemed to have more insertion and less deletion than the D genome (Supplementary Table 4). Interestingly, CFNC loci were incongruent with other types of loci, with more SNPs (16 vs 5) and polymorphic sites (22 vs 5) inferred to have evolved in the D than the A genome. In addition, all extended deletions (1), extended mixtures (3), single deletions (1) and insertions (1) differentiating the two genome types occurred in D, not in A (Supplementary Table 4).

Transition and transversion mutations

The relative frequencies of the 12 possible single base pair mutations were analyzed using three-way comparisons. Consistent with conventional dogma, transition mutations occurred in higher frequency than transversions in all genomes (Table 3). However, different genomes exhibited considerable difference in the ratios of transition to transversion. For example, the A ancestor genome had only 7.8% more transitions than transversions, whereas the D ancestor genome had 25.7% more (Table 3). Across all pairwise comparisons among A1, A2, D5, D8 and C1, the total SNP number is negatively correlated with the difference between transversion and transition rates (r=−0.445, P=0.074) (Table 3). There is pronounced asymmetry in mutational events, with the D genome showing about 30% more A to G and T to C mutations than the A genome, but with rates of G to A and C to T mutation being similar in the two genomes (Supplementary Figure 2). Among the eight possible transversion mutations, G to T, C to A, and C/T to G/A occurred in obviously higher frequencies in A than in D, with the others at similar frequencies in the two genomes.
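The 12 possible single-base substitutions divide into 4 transitions (purine to purine or pyrimidine to pyrimidine) and 8 transversions, which a short sketch makes explicit. The function names are illustrative; the directed (ancestral, derived) pairs would come from three-way comparisons in which the outgroup identifies the ancestral base.

```python
from collections import Counter
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ancestral, derived):
    """Classify a directed substitution as a transition or a transversion."""
    if ancestral == derived:
        raise ValueError("not a substitution")
    pair = {ancestral, derived}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def tally_substitutions(changes):
    """Count classes for an iterable of (ancestral, derived) base pairs."""
    return Counter(substitution_class(a, d) for a, d in changes)


# Enumerate all 12 directed substitutions among the four bases.
all_changes = list(permutations("ACGT", 2))
class_counts = tally_substitutions(all_changes)
```

Because transversions outnumber transitions two to one among the possible changes, any observed excess of transitions reflects a strong per-path bias.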

Phylogenetic relationship among diploid Gossypium genomes

To reconstruct the phylogeny of the cotton genomes and species studied, we made a concatenated alignment by selecting aligned columns without gaps and meeting the quality threshold. The concatenated alignment contained 34 223 sites from 255 genes. All approaches used clearly and strongly supported the same tree topology T1: ((((A1, A2), C1), (D5, D8)), Ki), grouping C and A species together (Figure 5). This topology is further supported by maximum likelihood analysis, significantly (P-value=0.003) rejecting alternative tree topologies.

Figure 5
Phylogenetic relationship among diploid Gossypium genomes inferred from nucleotide alignments of 34 223 sites from 255 genes. The scale represents nucleotide substitutions per position (Maximum likelihood) estimated by PAML. The numbers above or below the branches are total sites of extended polymorphism/single indel/SNP (last) in the corresponding three way comparisons of all involved sequences, reflecting DNA sequence divergence occurring along the indicated branch of the tree.

Discussion

Levels and patterns of divergence of DNA sequences

Resequencing of hundreds of orthologous loci that broadly sample the genomes of diploid cottons sheds light on the types and rates of evolutionary mechanisms operating in the 7 million years (Small et al., 1998; Cronn et al., 2002) since divergence of these species from common ancestors. Different DNA sequence fragments ranged widely in levels of polymorphism. For example, in the comparison between A1 and D8, the frequency of polymorphic sites varied from zero to 14.7 per 100 bp (Figure 4) among different loci. In all 15 pairwise comparisons, more than half of mutant sites are single-nucleotide polymorphisms, except for A1 vs A2, which showed nearly 50% SNPs (Table 1, Figure 2). However, polymorphisms spanning multiple nucleotides (ranging up to a 274 bp insertion in the C genome at the SSR locus BNL1046, relative to the A and D genomes) account for 63.9% of total nucleotide divergence.

When DNA ‘islands’ comprising different classes of loci were analyzed separately, SSR regions showed more divergent sites and larger polymorphic size than other classes, and ESTs the least, as expected. A new type of DNA fragment extracted using Cot filtration (named CFNC), had higher divergence than other classes, in a few comparisons even exhibiting higher polymorphism than SSRs. The merits of DNA marker discovery using CFNCs have been described in detail elsewhere (Rong et al., 2011). Although higher polymorphism in non-coding regions (SSR, CFNC and PstI) is presumed to be related to purifying selection acting on coding regions (EST), even gene sequences showed variable DNA divergence. For example, between two diploid A genome species, Gate3AE04, from a fiber-related EST library (Arpat et al., 2004), has 1.8 SNP/100 bp and 2.6 polymorphic sites/100 bp including three extended indels with one spanning 58 base pairs. However, many sequences from the same library do not have any polymorphisms between A1 and A2 species. Senchina et al. (2003) proposed that silent site rate variation among genes may be related to genomic location, local levels of recombination and mutation, variable codon usage biases, and perhaps poorly understood differences in chromatin structure that alter in unknown ways the rate of fixation of mutations.

Reassessing the relationships among diploid cottons

The phylogeny of diploid cottons, especially the relationship between the A and D genomes that gave rise to tetraploid cottons, has been somewhat unclear. Prior studies used only small numbers of nucleotides either from chloroplast (Wendel and Albert, 1992; Small et al., 1998) or nuclear DNA (Seelanan et al., 1997; Small et al., 1999; Cronn et al., 2002; Wendel and Cronn, 2003). Early studies using cpDNA (Wendel and Albert, 1992; Small et al., 1998) suggested the C genome to have diverged first from the common ancestor shared by both A and D genome species. However, later research using genomic DNA suggested D genome species to have diverged first (Cronn et al., 2002; Wendel and Cronn, 2003). The latter conclusion is supported by our results, with the 780 loci used here being by far the largest data set we are aware of to date that is suitable to address this question. Indeed, our data clearly reject all alternative phylogenies.

Different Gossypium genome types appear to be experiencing appreciably different evolutionary rates, with the A genome evolving more rapidly than the C genome, and the D genome in between (Supplementary Table 3). These differences appear to characterize genome types—species within the A and D genome types appear to be experiencing mutation at similar rates. We can identify no particular correlate with this variation—neither phylogenetic placement nor genome size appears related to it. An attractive opportunity for further study is to investigate additional genome types—for example are those closely related to A (that is B, E, F) also experiencing relatively rapid change, or are those close to C (K) also evolving slowly? It is interesting to note that the evolution of spinnable fibers, the primary economic product of cotton and also a potential means of long-range dispersal, occurred in the most rapidly evolving (A) of the genomes studied to date.

Striking differences among genome types in the frequencies of specific base pair changes were found. We found an excess of transition over transversion substitutions, as is common in metazoans (Keller et al., 2007). However, the A ancestor genome has only 7.8% more transitions than transversions, whereas the D ancestor genome has 25.7% more, due largely to abundance of A to G and T to C mutations. This could contribute to the higher rate of C to T (16.6%) than T to C (12.1%) mutations in the A ancestor genome. As in metazoans, part of this bias could be related to greater DNA methylation in the retrotransposon-rich A genome. Methylated cytosines (mC) are subject to UV light-induced mC → T transition mutations (Bird, 1980), at 10–50 fold greater frequency than other substitution mutations (Zhao and Jiang, 2007). The CFNC elements showed a unique pattern of base pair mutation, in the D ancestor showing significantly more SNPs (16 vs 5) and polymorphic sites (22 vs 5) than in the A ancestor. This contrasts with SSR, gDNA and EST elements, which showed significantly more mutations in the A than the D ancestor. However, only a small sample of CFNC fragments (5) could be used in this comparison. In the comparisons between A1 and A2 where a larger number of CFNC loci (37) could be analyzed, A2 showed significantly more SNPs and polymorphic sites than A1 (21 vs 6 and 29 vs 14), but no difference was detected in the other three types of elements. In addition, no such bias was detected for CFNC fragments between D5 and D8.

SNP markers for genomics research and plant breeding

SNPs are the most abundant source of genetic variation and the most elemental difference between genotypes, preferable to SSRs for many applications (Wang et al., 1998; Altshuler et al., 2000) and steadily becoming more economical to genotype. Large-scale projects in such model organisms as human, Oryza sativa and Arabidopsis thaliana are driving the development of low-cost, high-efficiency genotyping (Langridge and Fleury, 2011). The diploid A and D genomes are progenitors of the tetraploid cotton At and Dt subgenomes. SNPs between two diploid A species, or two diploid D species, are useful in gene discovery and mapping within the respective diploid genomes. Indeed, the two A genome species and two D genome species studied in this research were each used to construct genetic maps (Brubaker et al., 1999; Rong et al., 2004; Desai et al., 2006).

Loci that are monomorphic within A or D genome types, but polymorphic between genome types, may be of practical importance for identifying locus-specific markers in tetraploid cottons including leading cultivars. Although some loci that are monomorphic within genome types may be evolutionarily constrained at the diploid level, the presence of a homologous copy may render them less constrained at the tetraploid level. Although SNP frequencies within A or D genomes are the lowest of the 15 possible pairwise comparisons in this study, we nonetheless identified a total of 1229 SNPs from 423 sequence-tagged sites that differentiated between the A and D genome types. At these loci, genome-specific primers or oligonucleotides might be designed, simplifying SNP discovery in tetraploid cottons by mitigating the complication of having a second homologous copy of the locus present elsewhere in the genome. Specific amplification of the At or Dt genome loci from different tetraploid genotypes may facilitate the discovery of allelic sequence diversity, which can be used in genetic and evolutionary studies.

Data archiving

Data have been deposited at Dryad: doi:10.5061/dryad.fb5hk394.