Introduction

Microsatellites, which consist of a variable number of tandem repeats, are known as simple sequence repeats (SSRs, defined as 1-6 bp) and are characterized by their high frequency, distribution, co-dominance, reproducibility and high polymorphism1,2. Among genetic marker systems such as restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), sequence-related amplified polymorphism (SRAP) and target region amplification polymorphism (TRAP), microsatellites have numerous uses, including linkage map, quantitative trait loci (QTL) mapping, marker-assisted selection, genetic diversity studies and evolutionary studies3,4,5,6.

Cotton (Gossypium SPP.) is the most important fiber crop in the word and is also an important edible oil crop. The genus Gossypium has nine genome types: eight diploid genomes (A, B, C, D, E, F, G and K; 2n = 2x = 26) and one tetraploid genome (AD; 2n = 4x = 52), based on observations of chromosome pairing7,8,9. Two of the four cultivated allotetraploid species, G. hirsutum L. (AD)1 and G. barbadense L. (AD)2, account for 90% and 8% of world cotton production, respectively5. Although G. arboreum (A2) and G. raimondii (D5) are regarded as the progenitors that led to the formation of the tetraploid cotton species 1-2 million years ago (MYA), the exact donor species that no longer exist10.

Much research has been conducted to characterize microsatellites and develop markers from enriched genomic DNA11,12, expressed sequence tags13,14,15, sequences derived from bacterial artificial chromosomes16,17, assembled contigs based on transcriptomic profile18 and sequenced genome19. Current publicly available cotton simple sequence repeat (SSR) markers are described in the CottonDB (http://cottondb.org/) and Cotton Marker Database (CMD) ( http://cottonmarker.org)20. So far, 19,010 SSR markers have been described in the CottonDB. CottonGen supersedes CottonDB and the CMD with enhanced tools for easier data sharing, mining, visualization and data retrieval of cotton research data21. Thanks to global efforts, high-resolution mapping in cotton has been published with segregating populations through interspecific crosses19,22,23,24,25. However, the development of microsatellites in cotton remains limited, due to the paucity of DNA polymorphisms and the lack of complete genome sequences26,27,28. Although a very-high-density whole genome marker map (WGMM) has been constructed for cotton based on the D genome, which has a total of 48,959 loci, about six times the number characterized in the most richly populated of the integrated maps published previously20,29, the pattern of microsatellite distribution and the physical position or product of publicly available cotton SSRs in G. hirsutum remained ambiguous.

Recent developments in high throughput DNA sequencing technologies provide new opportunities to expedite molecular marker development30. In this study, we conducted whole-genome microsatellite characterization and marker development in the sequenced genome of G. hirsutum31, G. raimondii32 and G. arboreum33. We characterized and compared the frequency and distribution of motif length, type and repeat number of microsatellites in the assembled genomic sequences of these three species. In addition, we analyzed the genomic distribution of microsatellites, genes and transposable elements (TEs) in the assembled genome of G. hirsutum. Furthermore, we developed microsatellite markers from assembled genomic sequences and evaluated polymorphism in different species. These markers may facilitate the advancement of many basic and applied genomic studies in cotton, including the development of high-resolution linkage maps, positional gene-cloning and breeding in different cotton species.

Results and discussion

Frequency, distribution and characterization of microsatellite length, type and number

The available 2,433 Mb, 1,694 Mb and 761 Mb genome sequences of G. hirsutum, G. arboreum, and G. raimondii, respectively, were searched for microsatellites with different types of desirable repeat motifs from mono- to hexanucleotide. A total of 100,290, 83,160 and 56,937 microsatellites were identified with an overall frequency of 41.2, 49.1 and 74.8 per Mb, or one every 24.3, 20.4 and 13.4 Kb, in above three assembled genomic sequences, respectively (Table 1). Meanwhile, a total of 50,443 and 41,119 microsatellites with perfect repeats were detected in the At and Dt, respectively. These showed an overall density of 34.2 (At, “t” indicates tetraploid) and 49.5 (Dt) per Mb, or one every 29.3 and 20.2 Kb, respectively (Table 1). The determination of the genomic distribution of 100,290 microsatellites revealed 81,898 microsatellites mapped to the 26 chromosomes of G. hirsutum with an average density of 42.33 per Mb. In our study, physically mapped microsatellites showed a higher density of markers on the Dt (38,622, 49.87/Mb) than the At (43,276, 37.30/Mb), with maximums on A05 (4,248, 46.15/Mb) and D05 (3,553, 57.37/Mb) and minimums on A06 (3,204, 31.06/Mb) and D02 (3091, 45.94/Mb) (Table S1). Our results agree with a previous study which revealed a negative correlation between genome size and microsatellite density34. Cucumber (367 Mb), wheat (3B, 1,000Mb) and maize (1,115 Mb) have microsatellite densities of 551.9 per Mb, 163 per Mb and 120 per Mb, respectively4,35,36. Other microsatellite densities of 78.5 per Mb, 189.4 per Mb and 99.8 per Mb have been identified in Brachypodium, rice and sorghum37.

Table 1 Overall frequency of microsatellites in Gossypium species.

The distributions of microsatellite length in the assembled genomic sequences of G. hirsutum, G. arboreum and G. raimondii showed that tri-, tetra-, penta- and hexanucleotide repeats accounted for very similar proportions, whereas mono- and dinucleotide repeats were relatively different in proportions among these genomes (Fig.1A). Among the 100,290 microsatellites obtained in G. hirsutum, the hexanucleotide repeats were most abundant (39,506) with a proportion of 39.4%, followed by tri- (22,483, 22.4%), penta- (14970, 14.9%), di- (12,445, 12.4%), tetra- (9,031, 9.0%) and mono-nucleotide (1,855, 1.8%) (Fig.1A, Table S2). This distribution pattern of microsatellite length differed to earlier reports that di-nucleotide repeats (DNR) are abundant in rice and Arabidopsis38 and tri-nucleotide repeats (TNR) are abundant in Brachypodium37, bamboo39, switchgrass40 and Foxtail Millet41. In cotton, penta-nucleotide repeats (PNR) were found to be most abundant in the G. raimondii genome42. Some of these differences may be due to variations in characterization parameters, algorithms and bioinformatics software43,44. In principle, a microsatellite can extend to any length in the absence of selection force38. Generally, shorter repeat lengths were used to define microsatellite in Brassica45, Foxtail Millet41 and G. raimondii42. The relax criteria was also used to identify microsatellite with minimum repeat lengths of 12, 6, 4, 3, 3 and 3; showing that the proportion of mono- to hexanucleotide repeats is similar in G. hirsutum, G. arboreum and G. raimondii (Fig. S1). The different results of these two criteria (minimum microsatellite length 18 vs 12) mostly due to longer reads used in genome assembling of G. raimondii such as Roche 454 data32. Among these 100,290 microsatellites in G. hirsutum, 11,008 (11.0%) were detected in the genic regions. More intergenic microsatellites from mono- to hexanucleotide were identified than that in genic regions (Fig.1B).

Figure 1
figure 1

Distribution of microsatellites identified in the assembled genomic sequences of G. hirsutum (TM-1), G. arboreum and G. raimondii. (A) Frequency distribution of microsatellites with different motif lengths in Gossypium species. (B) Distribution of microsatellites in genic and intergenic regions in G. hirsutum (TM-1).

The distributions of microsatellite type in the assembled genomic sequences of G. hirsutum, G. arboreum, and G. raimondii are presented in Fig. S2A and Table S2. Specifically, the major motifs were rich in A, AT, AAT/AAG, AAAT, AAAAT and AAAAAT and the minor motifs were mostly riched in C/G (Fig. S2A, Table S2 and S3), which is consistent with previous reports on microsatellites from the assembled genomic sequences of G. raimondii42, Cucumis sativus4 and Brassica napus45. It was noted that the nucleotide composition characteristics of these A/T and C/G motifs corresponded well with their higher A/T than C/G content, such as 34.1% GC content in G. hirsutum31. However, AAT was the most common motif in allotetraploid G. hirsutum, whereas AT was most common in diploid G. arboreum and G. raimondii. More interestingly, we found significantly higher levels of AATCAG in At (2,590) and G. arboreum (2,646) compared to Dt (42) and G. raimondii (43), but much lower number of AACCCT motif were found in At (60) and G. arboreum (73) compared to Dt (332) and G. raimondii (421). In addition, the ACAGG repeat was only detected in At (283) and G. arboreum (224) (Table S2).

The distributions of microsatellite repeat numbers in the assembled genomic sequences of G. hirsutum, G. arboreum and G. raimondii revealed that microsatellite frequency decreased as the number of repeat units increased (Fig. S2B). Interestingly, as the motif repeat number increased, the microsatellite abundances decreased and the rate of change was slowest for dinucleotides, followed by trinucleotide repeats (Fig. 2). These finding are similar to the research on Brassica45. Although we found that the number of mono- and dinucleotide was dramatically reduced in G. hirsutum, high correlations were identified between these genomic sequences based on motif repeat numbers (Table S4).

Figure 2
figure 2

Distribution of SSR motif repeat numbers from mono- to hexanucleotide.The vertical axis shows the abundance of microsatellites that have different motif repeat numbers (from 3 to >20), which are discriminated by legends of different colors.

In conclusion, almost all of the analyzed characteristics of microsatellite distribution in assembled genome of G. hirsutum and its two progenitors were highly similar, which suggests that the pattern of microsatellite distribution is conservative in Gossypium or retained after formation of allotetraploid cotton. This is understandable because allotetraploid cotton is reunited by G. arboreum and G. raimondii approximately 1-2 MYA ago12.

Genomic distribution

Based on the assembled genome of G. hirsutum, the genomic distributions of microsatellites, genes and transposable elements were investigated. We illustrated greater physical densities in distal chromosomal regions than in the central regions; in agreement with a previous study that found a greater marker density in the distal gene-rich ends of the chromosomes20. Specifically, the genomic distribution of microsatellites was positively correlated with genes and negatively correlated with TEs (Fig. 3, Table S5). For both At and Dt, the frequencies of microsatellites in the 1-Mb genomic intervals were significantly positively correlated with genes (r = 0.78 and 0.73) and negatively correlated with TEs (r = −0.34 and −0.14). These results are similar to previous reports that microsatellites are associated with gene sequences in plants34,45. It was interesting that the homoeologous chromosomes A05 and D05 exhibited the highest average frequency of microsatellites on the At and Dt, respectively.

Figure 3
figure 3

The genomic distributions of microsatellites, genes and TEs in the assembled genome of G. hirsutum. (A) Genome-wide distributions of microsatellites, genes and TEs in the At. (B) Genome-wide distributions of microsatellites, genes and TEs in the Dt. The horizontal axis indicates the chromosomes A01-A13 and D01-D13 and each chromosome was divided into 1-Mb for statistical analysis of microsatellites, genes and TEs. The left vertical axis indicates the frequencies of the microsatellites and genes and the right vertical axis indicates the frequencies of TE. The curves represent practical frequencies and the lines show average frequencies.

Development of genome-wide microsatellite markers

A total of 77,996 (83.2%), 63,263 (81.46%) and 44,388 (82.8%) identified microsatellite markers were designed from the flanking sequences of 93,736, 77,661 and 53,586 microsatellites from the genomic sequences of G. hirsutum, G. arboreum and G. raimondii, respectively. All of the 77,996 developed microsatellite markers were summarized with motif length, amplification length and polymorphism in Table S6. The remaining microsatellite markers failed to generate specific amplicons mostly due to the limited number of flanking sequences from each side of identified microsatelltes. Similar observations have been reported in the mining of genome microsatellites in other crop plants37,41,46.

The most useful microsatellites are those where the chromosome location is known on the assembled chromosomes. The physical location of 65,498 of the 77,996 microsatellite markers was revealed on the 26 chromosomes of G. hirsutum, with average marker density of 33.86 markers per Mb. The average marker density of 40.64 markers/Mb on the Dt was larger than the 29.33 markers/Mb on At. The maximum density (47.07 markers/Mb) was found on D05, followed by 44.35 markers/Mb on D07 and 42.27 markers/Mb on D09 and the minimum density (24.81 markers/Mb) were found on A06 (Table 2). The physical positions of the developed markers across the 26 chromosomes can provide a high-density microsatellite map that contributes to genome-wide MAS in research of gene mapping and to comparative genome mapping involving G. hirsutum and related crop plants. Only 10,116 (13%) identified microsatellite markers were available in the previous public database (www.cottongen.org) and the remaining 67,880 (87%) were universal.

Table 2 Summary of chromosomal distribution and average density of microsatellite markers mapped on G. hirsutum chromosomes.

Although several studies have been conducted to develop genic microsatellite markers from Expressed Sequence Tag (ESTs) transcripts of G. arboreum14, G. raimondii47,48 and G. hirsutum48, the number of publicly available genic microsatellite markers in Gossypium was limited (http://www.cottongen.org). We have identified 10,449 (13.4%) genic microsatellite markers known as “functional markers” that have a high transferability across species. Of them, 7,705 were new genic markers by crosschecking within CottonGen and relative literatures. There were more intergenic microsatellites markers than genic markers as a result of 8.64% genic region in G. hirsutum genome, although microsatellites are preferentially associated with nonrepetitive DNA in plant genomes34. Microsatellites in coding regions can regulate gene expression or function and the mutation rate in coding sequences is lower than in noncoding sequences, therefore, the number of SSRs and polymorphisms is lower in coding regions49. The main advantage of developing genic microsatellite markers is the possibility of finding associations between functional genes and phenotypes50,51.

In silico PCR analysis

To avoid complicated errors in genotyping due to the polyploidy nature of G. hirsutum, all the developed genome-wide microsatellite markers were subjected to in silico PCR analysis based on the genome sequences of G. hirsutum, G. raimondii and G. arboreum (Table 3). As to 77,996 microsatellite markers designed in G. hirtusum, 0 (0.0%), 29,392 (37.7%), 20,911 (26.8%), 5,151 (6.6%) and 22,542 (28.9%) markers generated 0, 1, 2, 3, > 3 in silico PCR products from the de novo sequences of G. hirsutum, respectively (Table 3 and Table S6). Among all these microsatellite markers, an average of forty eight in silico PCR products were identified, because some element (TE) associated markers could generate tens of thousands of in silico products, as reported in Brassica45. We also found 62,326 (79.9%) markers generating ≤10 in silico PCR products with an average of two alleles and four alleles for 68,811 (88.2%) markers generating ≤50 in silico PCR products (Fig. S3).

Table 3 Generated number (%) of in silico PCR products by genome-wide microsatellite markers in the sequenced genome of G. hirsutum, G. raimondii and G. arboreum.

Through in silico PCR analysis, a set of 20,911 (26.8%) high-quality double-locus microsatellite markers and 29,392 (37.7%) single-locus were established. In diploid species such as barley and rice, most microsatellite markers are single locus amplifying a maximum of two alleles. However, in allotetraploid G. hirsutum, most microsatellite markers are found in multiple loci and amplify multiple alleles from homoeologous loci. This limits their application in genetic and breeding studies52. Microsatellite markers that generate one in silico PCR product are likely to be more useful, as is the case of single locus SSR markers developed by practical PCR amplification in inbred lines53 and genome-wide development45.

Furthermore, as to 77,996 microsatellite markers designed in G. hirtusum, 32,868 (42.1%), 30,588 (39.2%), 2,876 (3.7%), 1,391 (1.8%) and 10,273 (13.2%) markers generated 0, 1, 2, 3 and > 3 in silico PCR product from the genomic sequences of G. raimondii, respectively and 25,026 (32.1%), 33,406 (42.8%), 3,388 (4.3%), 1,501 (1.9%) and 14,675 (18.8%) markers generated 0, 1, 2, 3, and>3 in silico PCR product from the genomic sequences of G. arboreum, respectively.

Application and experimental evaluation of microsatellite markers

A total of 511 from 77,996 microsatellite markers were selected to evaluate the amplification (Table S7). In our result, 505 (98.8%) of the 511 microsatellite markers could produce clear and reproducible amplification products with 996 polymorphic alleles (Table S7). Among these 505 microsatellite markers, 266 (52.7%) amplified the corresponding polymorphic alleles with in silico PCR products. Furthermore, 131 (29.4%) amplified more polymorphic alleles than in silico PCR products, which may be caused by the high proportion of TEs in the Gossypium genome. Therefore, these genomic microsatellite markers would be of enormous of use for various genotyping applications.

To generate microsatellite markers with the potential to enhance the genetic map, we tested the polymorphisms of 77,996 developed microsatellite markers in TM-1 and Hai7124 using re-sequencing data of Hai7124 (111.8 GB, 51,526 SSRs). Approximately 9,001 (11.5%) of these microsatellite markers were polymorphic. Of the 9,001 microsatellite markers, 8,825 were mapped to 26 chromosomes, including 4,446 on the At and 4,379 on the Dt. The largest proportions were found on A12 and D02 (Table S8). Thirty microsatellite markers were randomly selected to amplify polymorphisms between TM-1 and Hai7124, successfully validating the accuracy of developed markers (Fig. S4). The first comprehensive SSR maps were reported in studies using 138 BC1 plants derived from an interspecific cross of (G. hirsutum acc.TM-1 × G. barbadense cv. Hai 7124) × TM-148. In order to construct a high-density genetic map of cotton in our laboratory, we have tried EST-SSRs, SNPs and InDels when no sufficient markers in cotton were available54. Thus, these 8,825 polymorphic microsatellite markers are a useful resource for enhancing the genetic map and improving molecular marker assisted selection breeding.

In the present study, we conducted a genome-wide analysis to identify 100,290 microsatellites in G. hirsutum and developed 77,996 microsatellite markers. Among these markers, 67,880 (87%) were universal and 7,705 were new genic microsatellite markers. These genome-wide microsatellite markers were useful in genotyping applications such as germplasm characterization and high-density microsatellite marker linkage map construction. Importantly, the physical positions of the universal microsatellite markers on 26 chromosomes and identification of polymorphic marker can provide a high-density microsatellite map that contributes to genome-wide microsatellite marker selection in research areas such as gene fine mapping, MAS breeding and comparative genome mapping involving G. hirsutum and related crop plants.

Experimental procedures

Sources of genomic sequences

The high-quality genome sequence of the genetic standard line of Upland cotton, TM-1 (PRJNA248163), is available in http://mascotton.njau.edu.cn. The genomic sequence of G. raimondii32 was downloaded from http://phytozome.netandthegenomicsequenceofG. arboreum33 was downloaded from http://cgp.genomics.org.cn/page/species/index.jsp.

Identification of microsatellites

Genome sequences were searched for perfect microsatellites using PERL5 script MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/) with basic motifs from mono- to hexanucleotide55. Repeats with a minimum of 18, 9, 6, 5, 4 and 3 were defined for the mono- to hexanucleotide, respectively. Compound microsatellites were defined as ≥2 repeats interrupted by ≤100 bp, as previous report48.

Statistical analysis

Each chromosome was divided into 1-Mb for statistical analysis of microsatellites, genes and TEs for the represent practical frequencies and average frequencies. Then, Excel statistical function CHISQ.TEST was used to calculate the significance level (EquationSource math mrow mtable mtr mtd mrow msubsup miP mix mn2 mtd mrow msub mrow mrow mtexttest ) of these two frequencies of microsatellites as well as genes and TEs in 26 chromosomes.

Design of SSR primers

Primer pairs were designed from the flanking sequences of identified microsatellites using PRIMER3 software56 and two perl scripts, p3_in.pl and p3_out.pl served as interface modules between MISA and Primer3 with the primer designing parameters: 18–27 bp in length, 57–63 °C in melting temperature, 30–70% in GC content and 100–280 bp in product size. These two perl scripts were downloaded from MISA ( http://pgrc.ipk-gatersleben.de/misa/). Primer3 was downloaded from http://www-genome.wi.mit.edu/genome_software/other/primer3.html. The p3_in.pl was used to create a primer3 input file which was submitted to Primer3. Then p3_out.pl was used to calculated and merge all information together.

In silico analysis of microsatellite polymorphisms

The primer-pair sequences of previously developed publicly available Gossypium SSR markers were downloaded from the CottonGen website (www.cottongen.org). In our study, all microsatellite markers were aligned to genomic sequences of G. hirsutum, G. arboreum and G. raimondii using an in silico PCR strategy with the following default parameters: 2 bp mismatch, 1 bp gap, 50 bp margin and 50–1000 bp product size, as previous report45,57. The software (e-PCR-2.3.11) used for in silico PCR was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/schuler/e-PCR/. And only one genome was used at a time.

Evaluation of polymorphisms in G. hirsutum and G. barbadense

Illumina reads of G. barbadense cv. Hai7124 (Biosample: SAMN03002317) were mapped to the genome of G. hirsutum acc. TM-1 using the Burrows-Wheeler Alignment tool (BWA ver 0.6.2)58, with mainly default parameters. SAMtools59 was used to call InDels ≥5 with mapped reads of a minimum mapping quality of 20. If InDels were located in the region of paired primers/microsatellite markers, we regarded the paired primer as a putative polymorphic primer between TM-1 and Hai7124.

Data Section

The detail information of 77,996 developed markers and in silico PCR amplification were deposited in Table S6. These markers will be publically available in http://mascotton.njau.edu.cn and CottonGen database.

Additional Information

How to cite this article: Wang, Q. et al. Genome-Wide Mining, Characterization and Development of Microsatellite Markers in Gossypium Species. Sci. Rep. 5, 10638; doi: 10.1038/srep10638 (2015).