Genome-wide analysis of SSR and ILP markers in trees: diversity profiling, alternate distribution, and applications in duplication

Molecular markers are efficient tools for breeding and genetic studies. However, despite their ecological and economic importance, their development and application have long been hampered. In this study, we identified 524,170 simple sequence repeat (SSR), 267,636 intron length polymorphism (ILP), and 11,872 potential intron polymorphism (PIP) markers from 16 tree species based on recently available genome sequences. Larger motifs, including hexamers and heptamers, accounted for most of the seven different types of SSR loci. Within these loci, A/T bases comprised a significantly larger proportion of sequence than G/C. SSR and ILP markers exhibited an alternative distribution pattern. Most SSRs were monomorphic markers, and the proportions of polymorphic markers were positively correlated with genome size. By verifying with all 16 tree species, 54 SSR, 418 ILP, and four PIP universal markers were obtained, and their efficiency was examined by PCR. A combination of five SSR and six ILP markers were used for the phylogenetic analysis of 30 willow samples, revealing a positive correlation between genetic diversity and geographic distance. We also found that SSRs can be used as tools for duplication analysis. Our findings provide important foundations for the development of breeding and genetic studies in tree species.

developing potential intron polymorphism (PIP) 12 . ILPs and PIPs are usually defined together as intron polymorphism (IP).
With the progression of next-generation sequencing, an increasing number of tree genome sequences have become available [13][14][15] , which provide the foundations for the development and application of molecular markers. In this study, we performed a genome-wide identification of SSR, ILP, and PIP markers in 16 tree species whose genome sequences are currently available. We used these markers to perform phylogenetic analysis in 30 willow samples, and duplication analysis in Populus trichocarpa and Elaeis guineensis. The results will be useful in modern molecular biology and genetic diversity studies.

Results
SSR and ILP loci. Using the Perl pipeline, 67,259,820 SSR loci were identified from 16 tree species (Table 1), and genome size was found to be positively correlated with the number of identified SSR loci. Two pine species, Pinus taeda L. and Pinus lambertiana whose genome sizes accounted for 62.42% of the analysed species, contained 67.99% (45,732,066) of the total SSR loci, while Prunus persica possessed the smallest genome and contained the fewest SSR loci. By contrast, a negative correlation between genome size and the density of SSR loci was revealed. The lowest SSR density (385 per Mb) was found in Picea abies, which possesses a large genome (11.7 Gb). Morus notabilis possesses a relatively small genome (0.3 Gb), but exhibited the largest SSR loci density (2,272 per Mb). These results suggest that the application of SSR markers may be more efficient in small genomes because of the higher loci density.
SSR loci can be divided into seven types, from monomers to heptamers according to motif length. In this study, hexamers were the most abundant type, accounting for 55.16% of all motifs, followed by heptamers (17.44%). Pentamers were the least abundant type (2.33%). For separate loci type, the proportions fluctuated within a narrow range among most species (Supplementary Table S1).
We next extracted the two SSR loci types with the highest frequency from each species (Table 2). AT/TA base pairs were found to be the most prevalent dimers, followed by AG/TC. AAT/TTA were the most frequent trimer motif, followed by AAG/TTC, while the most frequent tetramer, pentamer, hexamer, and heptamer motifs were AAAT/TTTA, AAAAT/TTTTA, AAAAAT/TTTTTA, and AAAAAAT/TTTTTTA, respectively. A/T bases were shown to make up the majority of base pairs in SSR loci with the highest frequencies. We further analysed the base pair composition in all identified SSR loci (Supplementary Table S2), revealing that the number of A/T base pairs was more than twice that of G/C base pairs in 10 species. In other six species, A/T to G/C ratios were ≥3, while in Populus euphratica Oliv, this ratio was up to 4.65. These results indicate that A/T comprised a significantly larger proportion than G/C of the base pair composition in identified SSR loci.
To identify sufficient ILP markers, the screening conditions were set mildly with no length limits. Because six of the species analysed lacked gene position information, which was not appropriate for ILP identification, a total of 3,811,360 ILP loci were obtained from the remaining 10 species (Supplementary Table S3). Compared with SSR loci, the number of ILP loci was much smaller for each analysed species, ranging from 193,575 (M. notabilis) to 656,824 (Populus euphratica) with fewer differences in number among species. Similar to SSR loci, the genome size exhibited a negative correlation with the ILP loci density. However, the variation in ILP loci density among different species, ranging from 352 per Mb (Amborella trichopo) to 1,513 per Mb (T. cacao), was larger than that in SSR loci.  ,055  57,451  51,276  50,049  20,208  254,705  86,345  551,089  295  1,868   J. curcas  83,074  33,951  32,392  42,609  12,737  227,431  78,032  510,226  308  1,657   M. notabilis  111,688  72,963  42,208  48,430  24,611  279,875  128,953  708,728  312  2,272   T. cacao  32,673  35,792  30,515  38,622  14,538  250,827  80,076  483,043  334  1,446   P. trichocarpa 59,920  54,354  57,563  60,302  26,344  336,295  124,341  719,119 Table S4). Detailed information of these markers can be downloaded from our database (http://biodb.sdau.edu.cn/xxyssr/result_data.zip). The number of SSR markers ranged from 21,442 (Jatropha curcas) to 70,442 (E. guineensis), and was positively correlated with the genome size of analysed species. However, the number of ILP markers did not show an obvious correlation with genome size. This may be explained that ILP markers were located in gene coding regions whereas SSRs were distributed genome-wide. Therefore, the number of ILP markers should be related to the number of genes in the genome. By comparing available EST sequences against model plant genomes (Arabidopsis and Oryza sativa), we predicted the existence of 11,872 PIP markers from H. brasiliensis and Pinus taeda. This number is far less than that of SSR and ILP markers, which may reflect divergence among analysed and model species.
Distribution patterns of ILP and SSR markers. We constructed a distribution map of SSR and ILP markers by randomly selecting four M. notabilis scaffolds (Fig. 1a). The number of SSR markers on the selected scaffolds ranged from 24 ( Fig. 1a, scaffold 1) to 60 (Fig. 1a, scaffold 4), and the density ranged from 81 per Mb (Fig. 1a, scaffold 2) to 130 per Mb (Fig. 1a, scaffold 1). The number of ILP markers ranged from 21 ( Fig. 1a, scaffold 3) to 36 (Fig. 1a, scaffold 4), and the density ranged from 57 per Mb (Fig. 1a, scaffold 3) to 135 per Mb (Fig. 1a, scaffold 1). The distribution map showed that the SSR markers were sparsely and unevenly distributed on the scaffolds. They often appeared as lines on the map because of their limited length. Conversely, the large span of introns made ILP markers appear as bar plots. The map showed a concomitant and alternate distribution pattern of SSR and ILP markers in certain sections of all analysed species (Fig. 1b). However, the concomitant distribution rates were relatively low, ranging from 2.35% to 8.10%. In Prunus persica, only 637 (2.35%) SSR markers intersected with ILP markers (Supplementary Table S5). These results indicated a mutual independence between SSR and ILP markers.

SSR polymorphisms.
To examine SSR polymorphisms, 20,000 markers of each species were randomly selected and electronically amplified in their own genomes. After the calculation of amplification sites, the  number of monomorphic and polymorphic markers was depicted using a histogram (Fig. 2). Among all the amplification sites, monomorphic markers comprised the largest proportion (average proportion 75.56%, Supplementary Table S6). The proportions of polymorphic markers were limited and were positively correlated with genome size. In the 10 species with genomes smaller than 1 Gb, the proportions of polymorphic markers were < 20% (Supplementary Table S7). However, in Ginkgo biloba and Picea abies which possess genomes >10 Gb, polymorphic markers comprised 24.7% and 22.6%, respectively. In Pinus taeda and Pinus lambertiana, whose genomes were >20 Gb, the polymorphic markers accounted for 63.5% and 52.7%, respectively. We also found that the proportions of polymorphic markers were positively associated with the contents of repetitive sequences. In the six species whose genomes contain about 45% repetitive sequences (Prunus persica, J. curcas, M. notabilis, T. cacao, Populus trichocarpa, and Populus euphratica), polymorphic markers accounted for proportions of around 20%. By contrast, the contents of polymorphic markers were higher than 50% in Pinus taeda and Pinus lambertiana, where repetitive sequences took up more than 80% of the genome.

Phylogenetic analysis of willow samples.
To evaluate the efficiency of the molecular markers identified in this study, we performed the phylogenetic analysis of 30 willow samples. The sampling locations were marked on the map (Fig. 3a). Five SSR markers and six ILP markers were randomly selected and used for PCR amplification to construct an Unweighted Pair Group Method with Arithmetic mean (UPGMA)-based phylogenetic tree (Fig. 3b). This clustered the 30 willow samples into six groups. Samples CQL, SY, and HBL, which all derived from southwest China, were clustered in Group II, while 19 of 21 samples from Shandong province were clustered in Group III. Group I and Group IV each contained only one sample, which was distinct from the other samples. Two samples far apart from each other were clustered together in Group V, and similar conditions were found in Group VI.

Development of universal markers.
To develop universal markers, all the obtained markers were examined in 16 analysed species by electronic amplification. A marker was assessed as universal if its primers successfully amplified loci in all 16 species. A total of 54 SSR, 418 ILP, and four PIP markers were identified as universal markers. To evaluate the efficiency of these markers, two ILP, two SSR, and two PIP markers were randomly selected and PCR-amplified in four species (Salix babylonica, Populus trichocarpa, M. notabilis, and Selaginella) (Fig. 3c). As a result, each selected marker amplified a series of fragments of different lengths in every analysed species. Compared with PIP markers, ILP and SSR markers amplified more fragments, so presented with higher levels of polymorphism. The bands obtained from different marker types exhibited an alternative distribution pattern, suggesting the potential efficiency of the combined use of these universal markers. Duplication analysis of Populus trichocarpa and E. guineensis. We next determined whether SSR markers could be used for duplication analysis by analysing the distribution of SSRs and genes in the Populus trichocarpa genome (Fig. 1c). Both SSRs and gene sequences were evenly distributed, and exhibited an alternative pattern throughout the genome. In general, only 5.6% (7586) of SSRs were located in gene coding regions. The alternative distribution pattern suggested that SSR markers could be used for duplication analysis with the intergenic regions. We then performed duplication analysis on Populus trichocarpa and E. guineensis using gene sequences and SSR markers, respectively (Fig. 4). As a result, 17,999 duplication events were identified in Populus trichocarpa using gene sequences. Many more duplication events (368,946) were obtained through SSRs. An overlap between gene-based and SSR-based duplication events was found in chromosomes 1, 3, 5, 7, 8, and 10 ( Fig. 4a,b). In total, 6.6% (24,483) of the SSR-based duplication events overlapped with 11.2% (2,006) of the gene-based events. In E.

Discussion
The development of molecular markers in tree species has long been limited because of the lack of genome sequences. Recently, substantial progress has been made in genome sequencing [16][17][18][19][20] . Based on currently available data, we performed the genome-wide development of SSR, ILP, and PIP markers in 16 tree species, identifying a total of 524,170 SSR, 267,636 ILP, and 11,872 PIP markers. We found that the genome size was positively correlated with the number of SSR loci, but negatively correlated with their density. Consistently, the number of SSR markers showed a positive correlation with the genome size.
A recent study revealed the novel distribution pattern of SSRs in grass genomes 21 . Interestingly, short motifs including dimers, monomers, and trimers were the most abundant SSR types, which is the opposite of our observation in tree species. This may reflect evolutionary divergences between tree and grass species. However, common features were also observed between SSRs of trees and grasses. For instance, most SSRs were located in the intergenic regions of both tree and grass species. Moreover, although grass genomes are G/C rich, the sequences in grass SSR motifs did not show a similar pattern. This correlates with the finding that A/T bases comprised a much larger proportion than G/C bases in the SSR loci of tree species.
We analysed the distribution pattern of SSR and ILP markers on four scaffolds of the M. notabilis genome (Fig. 1a). This showed that the markers were alternatively distributed, suggesting their combined use would be highly efficient. This was further confirmed by PCR analysis (Fig. 3c). Most SSRs were monomorphic markers (Supplementary Table S6). In accordance with front studies, the proportions of polymorphic markers were positively correlated with the genome size (Supplementary Table S7), which can be explained by the increased number of binding sites in larger genomes.
To examine the efficiency of SSR and ILP markers identified in the present study, we performed a phylogenetic analysis of 30 willow samples and duplication analysis in Populus trichocarpa and E. guineensis. Because our results revealed an alternative distribute pattern between SSR and ILP markers, the phylogenetic analysis was performed using a combination of five SSR and six ILP markers. The 30 willow samples derived from seven provinces across China (Fig. 3a). Three samples located relatively close together in southwest China were clustered together, while 19 of 21 samples from Shandong province were clustered in the same group (Fig. 3b). These results suggest a positive correlation between genetic diversity and geographic distance. However, in Group V and Group VI, two samples far apart from each other were clustered together. We hypothesize that this may be because willows are prone to interspecific hybridization and interregional transition 22 .
Genome duplication is responsible for shaping the architecture and function as well as the evolution of many higher plant genomes, and gives rise to new or modified gene functions [23][24][25] . Therefore, analysing genome duplication is important for understanding the mechanism underlying evolution and gene functions. Duplication analysis had previously been studied in Populus trichocarpa and E. guineensis 26,27 , although these were mainly based on gene coding sequence data. In the present study, we determined whether SSRs could be used for duplication analysis by performing this on Populus trichocarpa and E. guineensis. Together with previous findings, we found that most of E. guineensis were represented by segmental duplications, not triplications. We also identified a much larger number of duplications events using SSRs than gene coding sequences, and revealed a limited overlap between gene-based and SSR-based duplication events. Abundant microduplications were found based on SSR markers which mainly reflected the duplication events in the intergenic regions. These results suggest that SSRs are suitable for use in duplication analysis.

Materials and Methods
Data sources. The   Development of SSR, ILP, and PIP markers. A pipeline composed of Perl scripts was used to search for SSR loci, based on 16 tree genomes. SSRs were classified into seven types: monomers (≥12 repeats), dimers (≥six repeats), trimers (≥four repeats), tetramers (≥three repeats), pentamers (≥three repeats), hexamers (≥two repeats), and heptamers (≥two repeats). Considering the principles of Watson-Crick base pairing and the initial motif position, some motifs were identified as one type of SSR locus. For instance, we identified AC, CA, TG, and GT as the SSR motif AC. A pair of 60-bp primer precursors flanking the SSR locus was cut to prepare for primer designing (Fig. 5, Part 1). For Pinus taeda and Pinus lambertiana in which only ESTs were available, the intron position information was unknown. Therefore, we developed PIP markers for these species by comparing available EST sequences with the genome sequences of the model plants Arabidopsis and O. sativa. As shown in Fig. 5, Part 2, the first step of this process was to find the intron positions of the model species by aligning its coding sequences (CDS) with its genome sequence using BLAT 28 . The second step was to identify potential intron positions by aligning EST sequences with the CDS of model species using BLAST 29 . The third step was to develop primers that flanked potential intron positions.
Perl scripts were used to extract exact intron positions for the tree species with complete genome data, and to select a pair of 60-bp primer precursors flanking each intron to identify ILP markers (Fig. 5, Part 3). Coupled primer pairs were designed by Windows-based Emboss: eprimer3 30 , based on the primer precursors we identified flanking the introns (ILP and PIP) and SSRs. The primers were tested using electronic PCR 31 (e-PCR) against the corresponding genomes. A pair of primers was identified as a good molecular marker if it successfully amplified the desired fragment by e-PCR. Two markers were identified as the same if the forward or reverse primer was identical. A special Perl script was written to remove duplicated markers. All Perl scripts used in this study are available at http://biodb.sdau.edu.cn/xxyssr/result_data.zip. Distribution of SSR and ILP markers. Four DNA scaffolds containing the M. notabilis SSR and ILP markers were randomly selected to draw a distribution diagram using the R Language. DNA scaffolds with GenBank accession numbers NW_010356728.1, NW_010356865.1, NW_010358179.1, and NW_010359376.1 were renamed Scaffold 1-4, respectively. Each short vertical bar on the map represents the position of an SSR or ILP marker. The number of molecular markers (SSR or ILP) was counted using a Perl script and the molecular density (per Mb) of each scaffold was calculated. Based on the position, the number of concomitant and separated markers (SSR and ILP markers) was calculated for each tree species.
Experimental verification of universal markers and diversity analysis of Chinese willows. All obtained markers were selected and checked against the genomes of 16 species via e-PCR. A Perl script was used to select universal markers that could amplify the fragments in all 16 species. To assess the marker performance, two primer pairs from each of universal SSR markers, universal ILP markers, and universal PIP markers (Supplementary Table S9) were randomly selected, then amplified in four species: S. babylonica, Populus trichocarpa, M. notabilis, and Selaginella. Furthermore, five SSR primer pairs and six ILP primer pairs of willow (Supplementary Table S10) were amplified in 30 different willow materials (Supplementary Table S11). The 30 willow samples were all from S. babylonica. To mark the sampling sites, a skeleton map was constructed by R package "maps" (https://cran.r-project.org/web/packages/maps/), then modified using Adobe Photoshop (version 14.0, X64). All primers were synthesised by Shanghai Sangon Biological Engineering & Technology Company.
DNA from the 30 willow materials and young leaves of other species was extracted using the CTAB method 32 . PCR reactions were performed in a total volume of 15 µl containing 20 ng template DNA, 0.36 µM of each primer, 0.25 mM of each dNTP, 2.5 mM MgCl 2 , 1 U Taq DNA polymerase, and 2.0 µL of 10× PCR buffer. PCR conditions were as follows: 4 min at 94 °C, followed by 35 cycles of 1 min at 94 °C, 1 min at 55 °C, 1 min at 72 °C, and a final extension for 10 min at 72 °C. Electrophoresis on a 6% non-denaturing polyacrylamide gel was used to separate the PCR products, and DNA bands were visualised by silver staining. A binary matrix was constructed in which every band position was scored as either present (1) or absent (0), based on our electrophoretogram of combined markers (five pairs of SSR markers and six pairs of ILP markers) amplified in the 30 willow materials. An UPGMA-based phylogenetic tree of the 30 willow materials was then estimated using NTSYSpc 33 version 2.1.

Proportion of polymorphic markers and duplication analysis.
We randomly selected 20,000 SSR markers of 16 species to be electronically amplified against their own genomes. The number of amplification sites was calculated by the Perl program. A monomorphic marker was confirmed if it could only amplify one site, and a polymorphic marker as one that could amplify two or more sites. The number of these two types of markers was shown schematically using R language.
Populus trichocarpa and E. guineensis were selected for duplication analysis because of their well-characterised genomes. The protein sequences and SSR markers of the two species were first prepared, and the protein sequences compared against themselves by BLAST analysis, and SSR markers selected for e-PCR against their own genomes. Based on protein BLAST results and corresponding gff files, gene-based duplications were obtained using MCScanX 34 . According to the collinearity format results, duplicate blocks within the whole genome were linked by curved ribbons using Circos 35 . To obtain marker-based duplications, e-PCR results were modified into the BLAST format.