Genome-wide mapping and characterization of microsatellites in the swamp eel genome

We describe the genome-wide screening and characterization of microsatellites in the swamp eel genome. A total of 99,293 microsatellite loci were identified in the genome, with an overall density of 179 microsatellites per megabase of genomic sequence. Dinucleotide microsatellites were the most abundant type, representing 71% of the total microsatellite loci, and AC-rich motifs were the most frequent in all repeat types. Microsatellite frequency decreased as the number of repeat units increased, and this decrease was more pronounced for long than for short microsatellite motifs. Most microsatellites were located in non-coding regions, whereas only approximately 1% were detected in coding regions. Trinucleotide repeats, which represent amino acid repeats in proteins, were the most abundant microsatellites in coding regions. There was a chromosome-biased distribution of microsatellites in non-coding regions, with the highest density on chromosome 8 (203.95/Mb) and the lowest on chromosome 7 (164.06/Mb). The most abundant dinucleotide repeat, (AC)n, was mainly located on chromosome 8. Notably, genomic mapping showed a chromosome-biased association between the genomic distributions of microsatellites and transposon elements. Thus, the novel dataset of microsatellites in swamp eel provides a valuable resource for further studies on QTL-based selection breeding, genetic resource conservation and evolutionary genetics.

swamp eel we obtained recently, we conducted a genome-wide detection of microsatellite sequences. We analyzed the distribution of microsatellite motifs (dinucleotides, trinucleotides, tetranucleotides, pentanucleotides and hexanucleotides) in the genome and characterized microsatellites in both coding and non-coding regions. We found that trinucleotide repeats were the most abundant microsatellites in coding regions despite their low enrichment, and that microsatellites were abundant and chromosome-biased in non-coding regions. In particular, we describe a chromosome-biased association between the genomic distributions of microsatellites and transposon elements (TEs). The novel set of microsatellites in swamp eel provides a valuable dataset for further studies on QTL-based selection breeding, genetic resource conservation and evolutionary genetics.
We also investigated the microsatellite motif distribution with regard to repeat numbers. For all five microsatellite types, microsatellite frequency decreased as the number of repeat units increased, and this decrease was more pronounced for long than for short microsatellite motifs (Fig. 2). Moreover, the mean repeat number of dinucleotides (10.33) was approximately 1.5 times those of trinucleotides, tetranucleotides, pentanucleotides and hexanucleotides (6.16, 6.44, 6.37, and 6.99, respectively). These trends are similar to those in the human genome 25 .
Trinucleotide repeats are the most abundant microsatellites in coding regions. We investigated the distribution of microsatellites in both coding and non-coding regions of the genome. Microsatellites were mainly located in non-coding regions (98,602, 99%), whereas approximately 1% (691) of the microsatellites were located in coding regions (Fig. 3a). In coding regions, trinucleotide repeats, which represent amino acid repeats in proteins, were the most abundant microsatellites (~75.5%), followed by dinucleotide repeats (20.3%). For the microsatellites in coding regions, we analyzed the GO annotation of the corresponding genes using Blast2GO. A total of 375 genes were assigned to the molecular function category (Fig. 3b). Catalytic activity (35.9%) was the dominant group, followed by binding (30.3%). Metabolic process (17.6%) was the largest group annotated to the biological process category (Fig. 3c). With regard to the cellular component category, 37.5% of sequences were assigned to cell part, followed by organelle (27.5%), membrane (11.5%) and macromolecular complex (9.6%) (Fig. 3d). To investigate whether particular GO terms were overrepresented in microsatellite-containing genes, we performed an overrepresentation analysis (Fisher's exact test, available through PANTHER version 11.1 26 ). No GO term was significantly enriched in microsatellite-containing genes compared to all the other genes in the genome (false discovery rate (FDR) = 0.05). In addition, no chromosome-biased distribution of GO-enriched genes was detected (FDR = 0.05).
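As an illustration of the overrepresentation analysis described above, the following minimal sketch computes a one-sided Fisher's exact test for a single GO term and then adjusts a list of p-values for multiple testing. It is not the PANTHER implementation: the function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption, since the text only states that a false discovery rate of 0.05 was applied.

```python
from math import comb


def fisher_overrep_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the genome; K: genes annotated with the GO term;
    n: microsatellite-containing genes; k: those of the n carrying the term.
    Returns P(X >= k) under the hypergeometric null distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR correction used in the study may differ from this one.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In such a scheme, a GO term would be called overrepresented only if its adjusted p-value fell below 0.05; by that criterion, no term passed in the analysis reported here.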
Abundant and chromosome-biased microsatellites in non-coding regions. As most of the microsatellites were located in non-coding regions, we analyzed their distribution patterns in the genome. We found a chromosome-biased distribution of these microsatellites in non-coding regions, with the highest density on chromosome 8 (203.95/Mb), followed by chromosome 12. The lowest density was detected on chromosome 7 (164.06/Mb) (Fig. 3a). Among the repeat types, dinucleotide repeats were the most abundant class of microsatellites, particularly on chromosome 8, whereas chromosome 7 had the lowest level of dinucleotide repeats (Fig. 4a). The most abundant dinucleotide repeat, (AC)n, was mainly located on chromosome 8 (Fig. 4b), whereas the most enriched trinucleotide repeat was mainly located on chromosomes 1, 4 and 7 (Fig. 4c). These results indicate that there are repeat type- and chromosome-biased distributions of microsatellites in the genome.
Genomic mapping and chromosome-biased association of microsatellites with TEs. As the association between microsatellites and TEs in genomes remains elusive 8, 27-29 , we examined their distributions in the swamp eel genome. Genome-wide sliding window analysis, using a window of 3 Mb with a step of 100 kb, showed distinct distribution patterns among chromosomes (Fig. 5a). We therefore analyzed the correlation between microsatellite and TE densities on individual chromosomes using the same sliding window parameters. An obvious negative correlation was observed on chromosome 12 (r = −0.83, p = 1.3813E-51) (Fig. 5b) and also on chromosomes 2, 4 and 7, whereas a positive correlation was detected only on chromosome 5 (r = 0.256, p = 1.6817E-8) (Fig. 5c). Notably, on chromosome 11, there were two types of distribution patterns according to the number of TEs per 3 Mb window. A quadratic relationship was observed when the TE number was ≥4000 (Fig. 5d), whereas a linear correlation was detected when it was <4000 (Fig. 5e), which indicated a threshold value of TE numbers associated with microsatellite density on chromosome 11. A similar quadratic relationship was also detected on chromosomes 9 and 10. No obvious association was detected on chromosomes 1, 3, 6 and 8. These results suggest a chromosome-biased association between microsatellites and TEs in the genome.
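The per-chromosome correlation analysis can be sketched as follows, assuming that microsatellite and TE counts have already been tabulated for the same set of sliding windows. The function names and the way windows are split at the 4000-TE threshold are illustrative assumptions; the statistical software actually used is not specified in the text, and no p-value is computed here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def split_by_te_threshold(te_counts, ssr_counts, threshold=4000):
    """Split paired window counts into TE < threshold and TE >= threshold.

    The default of 4000 TEs per 3 Mb window follows the chromosome 11 result.
    """
    pairs = list(zip(te_counts, ssr_counts))
    low = [(t, s) for t, s in pairs if t < threshold]
    high = [(t, s) for t, s in pairs if t >= threshold]
    return low, high
```

Splitting the windows first allows a linear model to be fitted to the low-TE windows and a quadratic model to the high-TE windows separately, mirroring the two patterns described for chromosome 11.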

Discussion
The swamp eel is an emerging model species in biology, in addition to its economic importance in fish production 1,2 . Microsatellites, which are widely distributed in genomes, are important genetic markers for assessing genetic diversity, genetic map construction, comparative genomics, and marker-assisted selection breeding. The characterization of genome-wide microsatellites in this study, together with SSR markers from previous reports in swamp eel 22-24, 30, 31 , provides a rich dataset for genetic improvement of this species and for genomic and evolutionary biology studies.
Genomic data are an excellent source for SSR mining and have been utilized in various species 9,25,32,33 . In the present study, we identified a total of 99,293 microsatellites based on the whole-genome sequences of swamp eel. The distribution frequency of microsatellites estimated in the genome (179/Mb) is comparable to that documented in the buffalo genome (170/Mb) 33 , but lower than those in human and mouse 34 . These differences could be due to variation in the search criteria, the sizes of the databases and the bioinformatics software tools used in different studies for the identification of microsatellites. The most abundant dinucleotide and trinucleotide motifs are AC/GT and AAT/ATT, which is in agreement with those in human 25 and buffalo 7, 33 , but different from those in cattle and goat 35,36 . Predominant repeats in the various classes are AAT, AGG and AAC in trimers, AAAT, AAAC, and AAAG in tetramers, AAAAT and AAAAC in pentamers and AAAAAT, AAAAAC, AAAAAG and AAAAAG in hexamers. This reflects a prevalence of A-rich repeats during genome evolution in teleost fishes. The abundance of these repeats is probably influenced by their secondary structures and their effect on DNA replication 25 , or it may reflect a genetic adaptation to the aquatic environment during fish speciation. Thus, the characterization of microsatellites in the swamp eel provides a useful resource for further studies of genome evolution in teleost fish species.
The frequency and density of microsatellites are probably correlated with genome size. For example, the microsatellite density is higher in large genomes than in small genomes among mammals 37 . However, the microsatellite frequency in plants is lower in large genomes than in small genomes 38 . The distribution of microsatellites also varies among different regions of a genome. It is well known that non-coding regions generally contain more microsatellites than coding regions 39,40 . There is no apparent difference in microsatellite content between intergenic regions and introns 41 . In addition, the microsatellite density is higher at the ends of chromosome arms than in other regions in the human and mouse genomes 42,43 . Although the trends for different repeat types are similar between chromosomes within a genome, the density of repeats can vary among different chromosomes of the same species. The density is higher on autosomes than on X chromosomes in mammals (such as humans, mice, and rats), but not in Drosophila 44 . This can be expected, since different chromosomes in a genome differ in their organization of genes, euchromatin, and heterochromatin. This variation is due in part to the AT/GC content of genomes, with a bias toward either high AT or high GC content. Such a bias favors repeat expansion through slippage during DNA replication 45 .
Transposition often generates genetic variation, and microsatellites are probably associated with transposable elements 46-48 . Alu elements are widely distributed in the human genome, representing more than 10% of its total size. Since Alu repeats contain a poly(A) tail and a central linker region rich in adenines, they are associated to a certain extent with A-rich microsatellites. A significant association was observed between the 3′ end of Alu sequences and not only (A)n mononucleotide repeats but also (AAC)n, (AAT)n, and A-rich tetra- to hexanucleotide repeats, whereas this association was weaker for (AT)n dinucleotide repeats 49 . The (AC)n dinucleotide repeats were preferentially associated with Alu elements; 75% of them were at the 3′ end of the elements, while the remainder were in the central linker region 46 . However, a high density of transposable elements does not always coincide with a high density of microsatellites. For example, an analysis of five complete plant genomes showed that microsatellites were preferentially located in unique regions of the genomes and exhibited a lack of association with transposon-rich regions 38 . It has been hypothesized that microsatellites can be derived from TEs and that the opposite evolutionary direction may also occur 50,51 . The transition from TE to microsatellite might depend on the transposition rate, with an optimal value, whereas the opposite transition is linked to the recombination rate 50,51 . The chromosome-biased association between microsatellites and TEs in our study is presumably related, at least in part, to distant contacts and the recombination behavior of chromosomes. This chromosome-biased association in the fish genome adds a new layer to the understanding of the complexity of these repeats in genome structure and evolution.
Microsatellites are closely related to genome stability and the regulation of gene expression, and their expansions are risk factors for many genetic disorders in humans, such as fragile X syndrome 52 , Huntington's disease 53 and myotonic dystrophy 54 . In fishes, a microsatellite marker, (GT)ntt(GT)n, in the 3′ untranslated region of rtp3 is significantly associated with resistance to nervous necrosis virus disease 55 . Whether particular GO terms are enriched in microsatellite-associated genes is an interesting issue. In our study, gene ontology annotation of microsatellite-containing genes revealed that these genes were involved in various aspects of biological activity in swamp eel. In line with this, no GO term was overrepresented in the microsatellite-containing genes compared to all genes in the genome. Similar results were reported for the functional annotation of microsatellite-containing genes in Acipenser fulvescens 56 and Carcharodon carcharias 57 . In addition, no GO term was enriched on any particular chromosome. Nevertheless, both microsatellites and TEs are associated with three-dimensional chromosome architecture 58,59 . Some G-rich TEs and microsatellites can form four-stranded DNA structures known as G-quadruplexes, which contribute to changes in chromatin status, transcriptional enhancement or inhibition, and the evolution of regulatory networks 58,59 .

Materials and Methods
Animals and ethics statement. Swamp eels (Monopterus albus) were obtained from markets in Wuhan, China. All animal experiments and methods were performed in accordance with the relevant approved guidelines and regulations, as well as under the approval of the Ethics Committee of Wuhan University.
Screening and identification of microsatellites. The swamp eel genome was sequenced by our laboratory (DDBJ/EMBL/GenBank accession AONE00000000). The Perl script MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/) was used to identify microsatellites in the genome. The genomic sequence data were loaded into a local pool. The configuration file was written in a separate text document named "misa.in" and placed in the same folder as the Perl script "misa.pl". The sequence of each chromosome was screened for potential motif repeats by calling the genomic sequence data file and the configuration file. To identify microsatellites, only motifs of 2 to 6 nucleotides were considered, and the minimum number of repeat units was defined as 6 for dinucleotide repeats, 5 for trinucleotide repeats, 4 for tetranucleotide repeats, and 3 for pentanucleotide and hexanucleotide repeats. Compound microsatellites were defined as ≥2 microsatellites interrupted by ≤100 bases 33 .
PCR amplification. Total genomic DNA was isolated from gonad samples using a previously described method 60 .
Functional assignments of the transcripts containing microsatellites. To assign putative functions to the microsatellite-containing transcripts, the Blast2GO program was run locally to BLAST against a reference database that stores UniProt entries and their associated Gene Ontology (GO) terms 62 . The GO categorization results were expressed as three independent hierarchies for biological process, cellular component and molecular function (http://www.geneontology.org/). GO term overrepresentation was analyzed with PANTHER version 11.1 26 (http://www.pantherdb.org/).
TE analysis. TEs were analyzed using previously described methods 63 based on the genome sequences. Multicopy genes and contaminating sequences were removed from the libraries. RepeatMasker was then used again to find repeats in these repetitive sequence libraries. Finally, we combined all the results generated by these methods and analyzed the density of TEs in the genome.
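The screening criteria given under "Screening and identification of microsatellites" can be illustrated with a minimal regular-expression sketch. This is not the MISA script itself, and its output may differ from MISA in edge cases (for example, overlapping candidate repeats); the function names are hypothetical. Only the thresholds (6, 5, 4, 3 and 3 repeat units for motifs of 2 to 6 nucleotides, and ≤100 bases between members of a compound microsatellite) are taken from the text.

```python
import re

# Minimum number of repeat units per motif length, as stated in the Methods.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 3, 6: 3}
MAX_GAP = 100  # compound microsatellites: interrupted by <= 100 bases


def find_microsatellites(seq):
    """Return sorted (start, end, motif, repeat_count) tuples; end is exclusive."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # Group 1 captures one motif; the backreference requires further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" (mononucleotide run) or "ACAC" (a dinucleotide repeat).
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // size))
    return sorted(hits)


def group_compound(hits, max_gap=MAX_GAP):
    """Group neighbouring hits separated by <= max_gap bases.

    Groups containing two or more hits correspond to compound microsatellites.
    """
    groups = []
    for hit in hits:
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups
```

For example, a sequence containing (AC)8 followed four bases later by (AAT)5 yields two hits that fall into a single compound group, whereas (AC)5 alone yields no hit because it is below the dinucleotide threshold of 6 repeat units.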
Circos program. The Circos program (http://circos.ca) was used to draw the Circos maps. Genomic sequences were assembled into chromosomes. Mapping of transposon elements (TEs) and microsatellites onto chromosomes was performed by calling "circos.conf" files containing locus information. The densities of microsatellites and TEs were expressed as the number of loci in a sliding window of 3 Mb with a step of 100 kb.
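The sliding-window density described above can be sketched as follows, assuming that locus start positions on one chromosome are available as a list of integers. The function name and the convention of counting a locus by its start coordinate are illustrative assumptions; only the window size (3 Mb) and step (100 kb) come from the text.

```python
from bisect import bisect_left

WINDOW = 3_000_000  # 3 Mb window, as in the Methods
STEP = 100_000      # 100 kb step


def window_counts(positions, chrom_length, window=WINDOW, step=STEP):
    """Count loci whose start falls in each window [start, start + window).

    Returns a list of (window_start, count) tuples. Only full-length windows
    are reported, so a chromosome shorter than the window yields no windows.
    """
    positions = sorted(positions)
    counts = []
    start = 0
    while start + window <= chrom_length:
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + window)
        counts.append((start, hi - lo))
        start += step
    return counts
```

Applying the same function to microsatellite and TE positions gives paired counts per window, which can then be written out as Circos data tracks or compared directly.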