De novo transcriptome assembly, development of EST-SSR markers and population genetic analyses for the desert biomass willow, Salix psammophila

Salix psammophila, a sandy shrub known as desert willow, is regarded as a potential biomass feedstock and plays an important role in maintaining local ecosystems. However, a lack of genomic data and efficient molecular markers limit the study of its population evolution and genetic breeding. In this study, chromosome counts, flow cytometry and SSR analyses indicated that S. psammophila is tetraploid. A total of 6,346 EST-SSRs were detected based on 71,458 de novo assembled unigenes from transcriptome data. Twenty-seven EST-SSR markers were developed to evaluate the genetic diversity and population structure of S. psammophila from eight natural populations in Northern China. High levels of genetic diversity (mean 10.63 alleles per locus; mean HE 0.689) were dectected in S. psammophila. The weak population structure and little genetic differentiation (pairwise FST = 0.006–0.016) were found among Population 1-Population 7 (Pop1-Pop7; Inner Mongolia and Shaanxi), but Pop8 (Ningxia) was clearly separated from Pop1-Pop7 and moderate differentiation (pairwise FST = 0.045–0.055) was detected between them, which may be influenced by local habitat conditions. Molecular variance analyses indicated that most of the genetic variation (94.27%) existed within populations. These results provide valuable genetic informations for natural resource conservation and breeding programme optimisation of S. psammophila.


Results
Chromosome counts and genome size estimation of S. psammophila. To reveal the ploidy level of S. psammophila, eight genets from different populations that represented the main distribution areas, including six populations from Inner Mongolia, one population from Shaanxi Province and one population from Ningxia Province (Fig. 1, Table 1), were used for chromosome counts in root tip squashes. As shown in Fig. 2, the eight different genets had the same chromosome numbers, for 2n = 4x = 76. This indicated that S. psammophila is naturally tetraploid. Flow cytometry (FCM) technique was performed to estimate the genome size using the fresh leaves of 27 genets, including the eight genets used for chromosome counts and 19 additional genets. With its genome size of approximately 425-429 Mb, S. suchowensis 37 served as the internal reference standard. The ratio of the G1 peak means of S. psammophila to that of S. suchowensis was equal to 1.645, indicating that the genome size of S. psammophila is estimated to be approximately 699-706 Mb (Fig. S1, Table S1).
Illumina sequencing, de novo assembly and functional annotation of unigenes. Equal quantities of RNA from five tissues including the root, stem, leaf, female catkin and male catkin of S. psammophila were mixed and used to construct a cDNA library for sequencing based on the Illumina HiSeq2500 platform. A total of 45,600,129 raw reads with paired-ends of 125 bp were obtained. After quality control, 44,343,202 clean reads (11.17 Gb) were retained. The GC content was 45.28%, and the Q20 and Q30 were 94.73% and 90.55%, respectively. With the aid of the short-reads assembling software Trinity, these clean reads were de novo assembled into 159,737 contigs, and then into 71,458 non-redundant unigenes, with an N50 length of 1,274 bp and a mean length of 713 bp (Fig. S2). The sequencing data have been deposited in the NCBI Sequence Read Archive (http://www. ncbi.nlm.nih.gov/Traces/sra) under accession number SRA3189819.
Frequency and distribution of EST-SSRs in S. psammophila. All the unigenes were used to detect EST-SSRs. Of the 71,458 (50.97 Mb) unigenes of S. psammophila, 5,616 (7.86%) were determined to contain 6,346 SSRs. Among these developed SSRs, 424 (7.55%) contained two or more SSRs, and 403 (6.35%) comprised  compound SSRs (Table 3). The frequency of EST-SSRs was one EST-SSR in 11.26 unigenes. The di-nucleotide motif was the most abundant (53.72%), followed by the tri-nucleotide (42.94%), tetra-nucleotide (2.77%), hexa-nucleotide (0.39%) and penta-nucleotide (0.17%) motifs (Table 3). There was a large proportion of both diand tri-nucleotides (96.66%), whilst the rest amounted to 3.34%. In total, 127 different SSR motifs were identified. Among them, the di-, tri-, tetra-, penta-and hexa-nucleotide repeats were of 6, 31, 54, 11 and 25 types, respectively ( Genetic diversity of EST-SSR loci. A total of 168 EST-SSR markers containing di-and tri-nucleotide repeats were randomly selected and primers were designed according to their flanking sequences. Of the 168 EST-SSR primers, 125 pairs showed successful amplification. Finally, 27 pairs that showed stable amplification and had multiple alleles were selected to analyse the genetic diversity and population structure of 240 S. psammophila genets from eight natural populations in Northern China (Table 1). These 27 loci were verified to be equally suited for an MAC-PR (microsatellite DNA allele counting-peak ratios) approach which calculated the ratios between the peak areas for two alleles in all samples where these two alleles occurred together (Table S7). A summary of the genetic characteristics of S. psammophila based on 27 SSR markers was shown in      Fig. S3). With the increase in cluster number, population subdivision was gradually generated among Pop1-Pop7. At K = 8, 22 genets of Pop8 were also retained in one cluster, five genets of Pop8 and four genets of Pop6 were assigned to another cluster; and remaining genets were interspersed among six other clusters (Fig. S4).
PCoA was conducted and NJ phylogenetic trees were constructed to further assess the population genetic structure. The first three principal coordinates explained 14.37%, 11.23% and 7.64% individually, and explained 33.24% of the total variation (Fig. 6A). The Nei's genetic distances (GDs) of the 240 S. psammophila genets were calculated, which ranged from 0.002 to 0.602, with an average of 0.387 (Table S8). Based on the GDs, an NJ tree was constructed (Fig. 6B). The results of PCoA and the NJ phylogenetic tree were consistent with the DAPC analyses. Most genets of Pop8 (Ningxia) had relatively far genetic distances from other populations, and some genets of Pop6 were classified into the Pop8 group.
Furthermore, the GDs of the populations and the NJ phylogenetic tree were also used to evaluate the genetic relationships of the eight populations. As shown in Table 5 and Fig. 7, the GDs among Pop1 to Pop7 were 0.028-0.068, and Pop4 was genetically more closely related to Pop5, with a GD 0.028, followed by Pop1 being closer to Pop4 and Pop5, with GDs of 0.033 and 0.038, respectively. The GDs of Pop8 from the other population were larger than the other combinations, ranging from 0.112 to 0.151, which indicate that the genets from Ningxia (Pop8) have a distant genetic relationship with the other seven populations. In particular, the genetic relationship between Pop3 and Pop8 was the farthest (GD 0.151).

Population differentiation and genetic diversity.
To analyse the genetic differentiation among the eight populations, pairwise Wright's F ST was calculated using the "polysat" software in R package. The lowest genetic differentiation was detected between Pop4 and Pop5 (F ST = 0.006), and the greatest genetic differentiation was detected between Pop3 and Pop8 (F ST = 0.055) ( Table 5). The genetic differentiation among Pop1-Pop7 was generally low (average pairwise F ST = 0.011, range 0.006-0.016), whilst the genetic differentiation between Pop8 and Pop1-Pop7 was relatively high (average pairwise F ST = 0.050, range 0.045-0.055). The genetic differentiation between Pop8 and Pop6 (F ST = 0.045) was slightly lower than that between Pop8 and the six other populations ( Table 5).
The genetic diversity of the eight populations was assessed, and the result was shown in Table 6. Pop8 (Ningxia) had the lowest value for alleles per locus, H E , Shannon-Wiener index and PIC, suggesting that the genets from Ningxia have low genetic diversity compared with those from the Inner Mongolia and Shaanxi populations. Obvious differences in genetic diversity between Pop8 and the other seven populations were detected.
Analyses of molecular variance (AMOVA) indicated that 5.73% of the total molecular variance resided among eight populations, whilst 94.27% was attributed to variance within populations ( Table 7), suggesting that higher variation existes within populations than among populations of S. psammophila. To test the proportion of genetic variance between clusters with the K = 2 suggested by the DAPC analyses, AMOVA was performed and showed that 18.56% of the total molecular variance was attributed to between two clusters (Table 7). Though the higher variation existed within clusters than among clusters, this variance proportion among clusters was higher than that among populations.

Discussion
The Salix genus comprises many species, among which polyploids are common [1][2][3][4] . The approaches used for ploidy detection mainly include chromosome counts, FCM technique, and SSR analyses 2,39 . In this study, S. psammophila was found to be tetraploid by chromosome counts. Second, using FCM technique, the G1 peak means of 19 additional genets were similar to that of the eight tetraploid genets, suggesting that these 19 genets are also tetraploids. The genome size of S. psammophila was estimated to be approximately 699-706 Mb, which is approximately    ) is approximately 1.9-fold than its diploid progenitor (G. tomentella and G. syndetika) and the tetraploid retains at least approximately 87% of the genes that were initially duplicated 40 . In Knautia, the results of ploidy and genome size of 23 species indicated that genome size of tetraploid K. arvernensis and hexaploid K. ressmannii is approximately 1.8-fold and approximately 2.7-fold than diploid K. calycina, respectively 41 . Similar results of the genome size being inconsistent with the ploidy level were also observed in Chenopodium 42 and Avena 43 . One likely reason for this phenomenon is that some genes or chromosome fragments with functional redundancy had not doubled in a genome duplication event, or they were deleted after genome duplication event 44 . Finally, all 240 S. psammophila genets presented a maximum of four alleles at some loci in the SSR analyses, indicating that these genets are tetraploids. Multiple alleles were also detected in S. reinii using SSR markers, suggesting that S. reinii is polyploid; this inference was confirmed by the FCM technique 17,24 . These results indicate that SSR marker analyses is a simple and efficient method for determining ploidy. Furthermore, our study did not find the variability of ploidy level in 240 S. psammophila genets. Other  ploidy levels (i.e., 2x, 3x, 5x, 6x, etc.) of S. psammophila may exist in natural populations, and these materials are more meaningful for the study of genome evolution. Therefore, the intraspecific ploidy variation of S. psammophila requires further research. Some hypotheses suggest that polyploids have greater resistance to harsh arctic climates than diploids, or that polyploids are more successful than diploids in colonising after deglaciation 45 . In the Helix section, compared with other diploid willow species (S. suchowensis and S. purpurea) 37 , S. psammophila was verified to be tetraploid and mainly distributed in the desert regions in Northern China. Therefore, the tetraploid nature of S. psammophila may be associated with strong adaptability to infertile soil and harsh climatic environment. Moreover, polyploids have been reported to affect the gene expression at the transcription level compared with diploids. For example, studies in watermelon have shown that 5,362 and 1,288 genes were up-and down-regulated, respectively, in autotetraploid compared with diploid watermelon 46 . So further studies are required to reveal the reason for tetraploid formation and the effect on gene expression after genome duplication event in S. psammophila.

Searching items Numbers
Recently, willows have become a focus of research as promising sources for bioenergy crops because of their short rotation coppice cycle and high biomass yields 1 50 . The presence of few CG/GC and CCG/CGG motifs in S. psammophila agrees with previous studies that showed that CG/GC and CCG/CGG are very rare in dicotyledonous plants but common in monocots 51 . These transcript sequences and EST-SSR loci could facilitate further investigations in marker-trait association, quantitative trait locus mapping and genetic diversity analyses in S. psammophila. As with other willow species, high levels of genetic diversity were found in 240 S. psammophila genets from eight natural populations in Northern China. The mean number of 10.63 alleles per locus and mean H E 0.689 were determined using 27 developed EST-SSR markers in S. psammophila, which was consistent with SSR-derived genetic diversity in S. viminalis (mean 6.95 alleles per locus and mean H E 0.65 within 84 individuals from seven sampled sites in the Czech Republic; mean 13.46 alleles per locus and mean H E 0.62 within 505 individuals from five populations in Europe) 28,29 and S. caprea (mean 10 alleles per locus and mean H E 0.58 within 183 individuals from 21 semi-natural woodlands) 30 . Although the studied genets of S. psammophila were sampled within a 220 kilometres radius, it is a dioecious and cross-pollination species, and its pollen can disperse over long distances due to windy weather in the pollination period and few obstacles to airborne pollen flow in desert regions, which may explain its high levels of genetic diversity. The small seeds (thousand seed weight of approximately 0.1 g) of S. psammophila are wrapped in fine hairs that facilitate dispersal by wind. Moreover, the tetraploid nature of this species may be another reason for the high genotype diversity of S. psammophila.
AMOVA indicated that most (94.3%) of the total variance was within-population variance. This agrees with the results obtained for other Salicaceae species, such as S. viminalis (92.3%) 28 , Salix caprea (91.7%) 30 , Populus simonii (85.8%) 52 , P. euphratica (85.5%) and P. pruinosa (78.8%) 53 based on SSR markers. This result may have been obtained because Salicaceae species are generally dioecious and combine anemophilous and entomophilous pollination with an outcrossing breeding system and high dispersal ability, which could enhance intraspecific diversity and reduce population differentiation 27,54,55 . Additionally, the relatively small geographic range would see more frequent gene flow by pollen and seed dispersal and result in little differentiation among populations of S. psammophila. However, pollination dynamics and the mating system of S. psammophila remain to be poorly understood, and should be further studied to better understand the distribution of genetic variation.
The weak population structure and little genetic differentiation (pairwise F ST = 0.006-0.016) were found among Pop1-Pop7 (Inner Mongolia and Shaanxi), but Pop8 (Ningxia) was clearly separated from Pop1-Pop7, and moderate differentiation (pairwise F ST = 0.045-0.055) was detected between them. Previous studies have demonstrated that the habitat types or local environment conditions have a significant influence on the genetic diversity  Table 6. Genetic diversity of the eight populations of S. psammophila. Note: a no significant difference at the 0.05 level, b significant difference at the 0.05 level.

Percentage of variation (%)
Eight populations  and population structure in some plant species, such as Ranunculus acris 56 , Paris quadrifolia 57 , Rhododendron jinggangshanicum 58 and Tamarix chinensis 59 . In this study, the Pop1-Pop7 genets were sampled around Kubuqi desert or Mu Us Sandland with typical arid desert habitats, whilst the Pop8 genets were sampled around Haba Lake with wetland habitats. To adapt to these two different habitat conditions, the genotypes of S. psammophila were selected and appeared to differentiation during the long-term evolutionary process. A germplasm collection base of S. psammophila was established approximately 20 kilometres northeast of Pop1 in Inner Mongolia, with its habitat condition belonging to an arid desert habitat. In recent years, we found that most of Pop8 genets appeared dead due to unadaptability to the arid desert habitats in the germplasm collection base, and this phenomenon did not occur in Pop1-Pop7 genets, supporting above mentioned hypothesis. Moreover, the habitat variability could lead to different phenophases, which may constrain the gene flow between Pop8 and other populations. However, detailed reasons for the population genetic structure need to be further studied.  Fig. 1 and Table 1. Fresh leaves were dried and stored in silica gel. DNA was isolated using the cetyltrimethylammonium bromide method 60 , and DNA quality and concentration were estimated using the NanoDrop system (Thermo Scientific, USA) and 1.0% agarose gel electrophoresis. family (Pfam) database. Gene Ontology (GO) annotation of these unigenes was produced using the Blast2GO program 65 based on the results of the NCBI Nr database annotation.

SSR loci identification and SSR markers.
The obtained unigenes were used to identify potential SSRs using the MISA tool (http://pgrc.ipk-gatersleben.de/misa/). The parameters were designed to identify di-nucleotide motifs with a minimum of six repeats, and tri-, tetra-, penta-, and hexa-nucleotide motifs with a minimum of five repeats 34 . In total, 168 SSR primers were designed and synthesised by Thermo Fisher Scientific (Shanghai, China). Fluorescence-labelled TP-M13-SSR (SSR with a tailed M13 primer) analyses was performed as described by Schuelke 66 . PCR amplifications were performed in a reaction volume of 20 μ l, containing 100 ng template DNA, 1× PCR buffer (Takara, Dalian, China), 100 μ M each dNTP, 30 μ M MgCl 2 , 1.0 unit Taq DNA polymerase (Takara), 8 pmol each reverse primer and the M13 universal primer (5′ -TGTAAAACGACGGCCAGT-3′ ) that was fluorescently labelled with Cy5 at the 5′ terminus, and 2 pmol forward primer that was added to the M13 primer sequence to the 5′ terminus. PCR amplification was performed on an ABI Applied Biosystems instrument under following conditions: 94 °C for 5 min; 30 cycles at 94 °C for 30 s, 53 °C for 30 s and 72 °C for 30 s; 8 cycles at 94 °C for 30 s, 57 °C for 30 s and 72 °C for 30 s; and a final extension step at 72 °C for 10 min. The capillary electrophoretic separation of PCR products was performed using GenomeLab GeXP (Beckman Coulter, USA), and the data were analysed by the fragment analyses module of the system.
Data analyses of genetic diversity and population structure. Using the MAC-PR method, all alleles of each locus were analysed in pairwise combinations to determine their dosages in the genets by calculating the ratios between peak areas 67 . The expected heterozygosity and Shannon-Wiener diversity indices were calculated using ATETRA version 1.2 (http://www.vub.ac.be/APNA/) 68 . The PIC of the SSR markers was calculated using PIC_CALC version 0.6 as described by Botstein et al. 69 . MI, as an overall measure of the efficiency to detect polymorphisms, was calculated as described by Powell et al. 70 . The genetic differentiation index (Wright's F ST ) was calculated using "polysat" software 71 in the R x64 3.2.3 package. DAPC 38 from the adegenet package version 2.0.1 72 in the R x64 3.2.3 package was used to analyse the genetic structure. This method does not need to take into account some assumptions about Hardy-Weinberg equilibrium and linkage disequilibrium. DAPC requires enough PCs to secure a space with sufficient power of discrimination but must also avoid retaining too many dimensions that lead to over-fitting 38 . Therefore, we chose the optimal number of principal components for DAPC using the optim.a.score function. The genetic distances between the genets and populations were calculated using Populations version 1.2.31 (http://www.bioinformatics.org/~tryphon/populations/) according to the method described by Nei et al. 73 . The genetic distance matrix was used to construct a dendrogram using the NJ method in MEGA version 6.06 74 . Further analyses of the genetic structure was performed by PCoA using GenAlEx version 6.5 75 , and a tri-scatter plot was generated using the R x64 3.2.3 package. To estimate the variance component and to partition the variation within and among populations and clusters, AMOVA was performed using GenAlEx 75 .