Taraxacum kok-saghyz (rubber dandelion) genomic microsatellite loci reveal modest genetic diversity and cross-amplify broadly to related species

Taraxacum kok-saghyz (TKS) carries great potential as alternative natural rubber source. To better inform future breeding efforts with TKS and gain a deeper understanding of its genetic diversity, we utilized de novo sequencing to generate novel genomic simple sequence repeats markers (gSSRs). We utilized 25 gSSRs on a collection of genomic DNA (gDNA) samples from germplasm bank, and two gDNA samples from historical herbarium specimens. PCR coupled with capillary electrophoresis and an array of population genetics tools were employed to analyze the dataset of our study as well as a dataset of the recently published genic SSRs (eSSRs) generated on the same germplasm. Our results using both gSSRs and eSSRs revealed that TKS has low- to- moderate genetic diversity with most of it partitioned to the individuals and individuals within populations, whereas the species lacked population structure. Nineteen of the 25 gSSR markers cross-amplified to other Taraxacum spp. collected from Southeastern United States and identified as T. officinale by ITS sequencing. We used a subset of 14 gSSRs to estimate the genetic diversity of the T. officinale gDNA collection. In contrast to the obligatory outcrossing TKS, T. officinale presented evidence for population structure and clonal reproduction, which agreed with the species biology. We mapped the molecular markers sequences from this study and several others to the well-annotated sunflower genome. Our gSSRs present a functional tool for the biodiversity analyses in Taraxacum, but also in the related genera, as well as in the closely related tribes of the Asteraceae.


Results
Designing and validating gssR-markers. TKS  Because of the lack of a well-annotated TKS genome, the ~1 kb sequences pulled from the TKS contigs of Lin et al. 4 containing the markers used for the construction of TKS linkage map 27 and those of the gSSR and eSSR 32 population genetics studies were mapped to the sunflower genome. The markers analyzed were located across all eight TKS linkage groups based on the mapping back to the sunflower genome ( Fig. 1). In several instances, the marker sequences localized to separate TKS contigs, but the sequences mapped to the same sunflower genome regions (Fig. 1). Only one of our gSSRs (Tara026) co-localized with two other map markers (TC27; TC66) within a single TKS genome scaffold, and also mapped back close to each another on the sunflower genome ( Fig. 1).
SSR genotyping and analyses. We chose a pool of 25 di-and 25 tri-nucleotide repeat gSSR markers for the study of TKS germplasm (Table 1 and Supplementary Table S1). After their initial screening on the TKS gDNA, this gSSR pool was reduced as several did not amplify a significant number of gDNA samples of the collection, lacked polymorphic alleles, or amplified a more complex banding pattern. We thus chose the 25 best-performing gSSRs for their specificity (single or double PCR products only) and reproducibility, and used them for TKS population genetics studies (Table 2 and Supplementary Table S1).
TKS: Population genetics analyses. gSSRs: Analysis of TKS spatial fixation genetics indices, Multi-locus genotype (MLG) networks, and population structure: Our results suggest no significant deviations from the Hardy-Weinberg equilibrium (HWE) across the 25 gSSR markers used to analyze the TKS populations ( Supplementary Fig. S2) despite low sampling. Only the TKS population 35162, and to a lesser extent 35178, show deviations from HWE at six and two loci, respectively. All loci were polymorphic in each population tested, and no clonal MLGs were detected. The Genotype Accumulation Curve (GAC) analyses indicated an MLG saturation ( Supplementary Fig. S3) with eight gSSRs in the analyzed TKS germplasm. Analyses of the Index of Association (Ia) confirmed the outcrossing character of the TKS germplasm studied (Supplementary Fig. S4). Only modest linkage disequilibrium was found in the gSSR TKS dataset (Supplementary Fig. S5) and suggested well-dispersed genomic locations of the gSSRs used.
The amplified gSSR markers yielded from 3 to 13 alleles per locus, averaging about 6 across the TKS germplasm pool ( Table 2). The 25 gSSRs used indicated a moderate degree of inbreeding within the populations and overall (F IS = 0.287; Tables 1 and 2). Our results further indicated a moderate TKS population fixation and genetic differentiation across the 25 loci tested (F ST = 0.094; F' ST = 0.098; Table 1). This implied a moderate level of gene flow among the TKS populations (inferred N m = 2.41). Collectively, the data indicated rather low genetic differentiation among the TKS datasets analyzed, despite high allelic diversity of the obligatory outcrossing TKS. In agreement with the spatial fixation indices accrued, AMOVA for the gSSR dataset indicated the majority of the molecular diversity partitioned among the individuals and not among populations (Φ IT = 66.52%; Φ IS = 23.63%; Φ ST = 9.86%). Sequences of the markers were retrieved by searching the TKS genome contigs 4 , and ~1 kb sequences containing the markers of Arias et al. 27 , McAssey et al. 32 and our de novo gSSRs were used to search the sunflower genome using their BLAST algorithm 73 . Physical positions of the best hits, occasionally located to several positions of the sunflower genome with comparable reliability, are shown. Markers used to generate the TKS linkage map 27 are black on white plates with their original linkage groups (LG) markings; eSSRs of McAssey et al. 32  To compare the relatedness of both TKS datasets (gSSRs and eSSRs), we generated pairwise matrices of population genetic distances for both. Values of the population pairwise distance matrix of F ST ranged from 0.018 to 0.355 for the gSSR, and from −0.024 to 0.261 for the eSSR datasets, respectively (data not shown). The pairwise population F ST distance matrices (and D ST matrices; both unstandardized and standardized; data not shown) for the gSSR and eSSR datasets provided similar results ( Supplementary Fig. S6), thus indicating that the TKS diversity information was comparable between them. Sub-population-wise, the distance matrix for the gSSR dataset showed low resolution in the analyzed TKS collection ( Fig. 2; Prevosti genetic distance range: 0.004 to 0.244, averaging 0.076 ± 0.056). The Neighbor-Joining dendrogram built on this basis indicated three TKS sub-populations as outliers (Herbarium, 35162, and 35178), and the remaining ones possibly divided into two separate clades. Testing of the geographic distance among TKS populations driving the genetic diversity of the species proved inconclusive (Fig. 3).
Reticulation analyses of the gSSR MLGs using the Minimum Spanning Networks (MSN) supported the F-statistics conclusions with no evidence of population structure in the TKS germplasm analyzed with gSSRs (Fig. 4). The lack of clustering or population structure visualized this way, suggests species-wide gene flow, implying that TKS diversity is well retained at the sub-population level. The results of the Discriminant Analysis of the Principal Components (DAPC; Fig. 5) were in agreement with the F ST population-wise trees (Fig. 2), with the gSSR populations 35162, 35178, and the Herbarium samples placed with some distance to the majority of the remaining samples.
eSSRs: Comparative analysis of TKS spatial fixation genetics indices, Multi-locus genotype (MLG) networks, and population structure: In the re-analyzed eSSR TKS dataset 32 , the significant deviations from HWE presented a locus-wise pattern and were much more common in occurrence than in the gSSR dataset ( Supplementary  Fig. S2). The eSSRs saturated the MLGs detected in the TKS germplasm significantly slower than gSSRs (10 vs. 8 markers, respectively; Supplementary Fig. S3). The eSSR dataset provided congruent results with the gSSR dataset on Ia and pairwise linkage disequilibrium ( Supplementary Figs S4 and S5). Regarding the fixation indices, the eSSR dataset harbored an overall F ST = 0.11 32 and F' ST = 0.068 (data not shown), and an inferred N m = 2.02. Partitioning of the molecular variance with AMOVA for the eSSR dataset yielded results similar to the gSSR dataset (Φ IT = 84.61%; Φ IS = 8.34%; Φ ST = 7.04%). Differences occurred in the variance partitioned among individuals within populations, and the gSSR dataset showed higher value of this parameter than the eSSR dataset. The F ST distance matrix for the eSSR dataset of TKS showed different population-wise clustering from the gSSR dataset (Fig. 2). The eSSR study showed marginally higher resolution in the pairwise genetic distances of TKS populations, likely due to a much higher number of samples per population analyzed (Prevosti distance range: 0.003 to 0.149, averaging 0.099 ± 0.082). Similar to the gSSR dataset, the sub-population 35162 was separated with high confidence from the bulk of other sub-populations, as was 35159. The absolute placement of the eSSR sub-populations differed from the gSSR dataset and indicated generally better resolution than the gSSR dataset, but no major clustering. Testing of the geographic distance among TKS populations driving the genetic diversity of the species proved inconclusive, similar to the gSSR dataset (Fig. 3). MSN reticulation of the eSSR dataset ( Fig. 4) provided results similar to the gSSR dataset, confirming that study's conclusions 32 of TKS lacking well-defined population structure. Analysis of networks from both gSSR and eSSR datasets resulted in similar Bruvo's genetic distance ranges, and congruently implied lack of TKS population structure. Similar to the gSSR dataset, the DAPC analysis of the eSSR dataset (Fig. 5) also confirmed the sub-population-wise tree of genetic distances (Fig. 2). The eSSR population 35162 presented a similar (diverged) pattern to this observed in the gSSR dataset. Overall, our results suggest a lack of a well-defined population structure of the TKS germplasm with little support for the more differentiated population 35162.
Analyses of US Taraxacum officinale. Species genotyping and assignment and ITS phylogeny of the plant materials: Species identity of the samples collected from Tennessee, Georgia, Alabama, and Mississippi (Tables 3 and S1) was confirmed by Internal Transcribed Spacer region (ITS) sequencing ( Fig. 6; Supplementary Tables S1 and S2). Samples identified as Taraxacum spp. lacked major differences in their ITS sequences (Fig. 6) and could not be unambiguously classified at species level based on this criterion alone (NCBI BLAST; data not shown). Grouping with the T. officinale and other Taraxacum species sequences for ITS 34    . c Results of cross-amplification study to US dandelions gDNA collection: "no": The locus did not cross-amplify; "yes": The locus cross-amplified, but marker was not used in the US dandelion population study (erratic amplification, complex banding pattern); "yes*": The locus cross-amplified and the marker was used in the US dandelion population study. Analysis of Taraxacum officinale spatial fixation genetics indices, Multi-locus genotype (MLG) networks, and population structure: The majority of the TKS-derived gSSRs cross-amplified the gDNA of the related US native dandelions (T. officinale) and of the outgroup specimens (Tables 2, 4 and S1). From the 25 gSSRs tested using the TKS gDNA collection, 21 gSSRs (five di-and 16 tri-nucleotide repeats) cross-amplified to the T. officinale gDNA collection as confirmed on four gDNA samples (Knoxville, TN population). Overall, the cross-amplification was broad and proved effective even in the specimens of related genera and tribe (Supplementary Table S1).  The analyses of the species diversity and population structure included 74 samples of T. officinale collected in several locations in the US using the 14 best-performing gSSRs (five di-and nine trinucleotide repeats) developed for TKS. Our results indicated violations of HWE in both locus-and population-manner ( Supplementary  Fig. S2). The MLG accumulation in this dataset was comparatively the slowest among all the datasets analyzed as 13 gSSRs saturated the genotype accumulation curve (Supplementary Fig. S3). Moreover, the index of association (Ia) was typical for clonal/asexual organisms (Supplementary Fig. S4; P = 0.051). Linkage disequilibrium range for this dataset was similar to that of the gSSR study of TKS ( Supplementary Fig. S5) with the difference of fewer and smaller negative values recorded for the T. officinale dataset. As expected, the ploidy of the apomictic T. officinale samples estimated by the number of detected alleles often reached the tetraploid levels (diploid, n = 4; triploid, n = 17; tetraploid; n = 53; Supplementary Table S1), which limited the scope of the population genetics analyses, in particular the F-statistics (fixation indices analyses). To gain access to that data, we coded the whole dataset as tetraploid with occasionally missing alleles and corrected the ploidy with the R package polysat before analyses.
The T. officinale dataset displayed between 5 and 16 alleles per locus (averaging about 10; Table 4). The estimated dataset-wide F ST value was 0.044 and the D' ST was 0.048. Population-wise F IS values (Table 3) indicated a considerable degree of homozygote excess in this dataset, further supporting the conclusion of asexual reproduction in this species. The population-wise Prevosti distance tree for T. officinale indicated its genetic distances were lower than TKS using the same markers (range: 0.008 to 0.157, averaging 0.055 ± 0.045; Fig. 2), indicating the lowest resolution in this dataset among those analyzed in the study. Further, the majority of the tree remained   32 (middle panel), and US T. officinale using 14 gSSRs selected for this study (right panel). Optimized and cross-checked PCA eigenvalues were used to generate each graph, respectively (gSSR: 5 PCAs retained; eSSR: 39; T. officinale gSSR: 11). Color legends for the populations and DA/PCA eigenvalues used are shown, respectively, on each graph. Alleles contributing the most to explaining the variance for each dataset are indicated on either axis, respectively (with percentages of the variance explained in the parentheses, respectively). unresolved, with the samples from Herbarium (US western coast) and KnoxvilleTN forming an outgroup to the bulk of the dataset yet separated from one another. Similar separation was observed when analyzing the genetic and geographic distance matrices using the Mantel test (Fig. 3). Herbarium samples from the US western coast clustered separately from the remaining individuals based on the geographic spacing (Fig. 3). The majority of the molecular variance was retained among the individuals within the populations, whereas about one quarter of the total variance was partitioned among the populations (AMOVA: Φ IS = 74.98%; Φ ST = 25.02%). Several analyses indicated the presence of population structure in this dataset. The MSN analyses (Fig. 4) took into account motif lengths in the gSSRs and grouped individuals of several populations together using the Bruvo distance. In agreement with the population-wise tree of distances (Fig. 2), the DAPC analyses separated the WelcomeRaMS, as well as the Herbarium and KnoxvilleTN samples from the bulk of the remaining ones (Fig. 5). Comparatively larger resolution of this dataset than either of the TKS datasets suggested more pronounced population structure in the T. officinale, as (sub-)populations are more diverged from one another than in TKS. Bruvo's distance-based tree of individuals ( Fig. 7; motif lengths considered) was visualized using the Bayesian Information Criterion and grouped individuals from the geographically close populations together yet further implying population structure in the common dandelion species. Collectively, the results for T. officinale indicated the existence of low-diversity populations clonal in character but differentiated geographically.

Discussion
In this study, we aimed to gain a deeper understanding of the genetic diversity of TKS, a potential alternative, sustainable rubber crop 2,13,14 . To reach this goal, we developed a set of genomic SSRs (gSSRs) based on our de novo sequencing of TKS and utilized them for evaluating the genetic diversity of TKS germplasm. We then carried-out an array of comparative population genetics analyses, re-analyzing the recently published genic SSR (eSSR) dataset generated on the same TKS germplasm 32 , and an expanded cross-amplification study with the local US dandelions using those gSSRs.
Our de novo gSSRs were distributed across the TKS genome, based on the linkage disequilibrium data, as were the eSSRs 32 . We mapped both types of SSRs (gSSRs and eSSRs) along with the TKS markers used for the linkage map, back to the related and well-annotated H. annuus genome, based on the TKS contigs 4 . This is very likely to be helpful for the future breeding efforts. We chose not to use the TKS genome assembly 4 or the closely related Lactuca sativa genome assembly 35 because both are more fragmented and have fewer scaffolds anchored to chromosome locations in comparison to the H. annuus genome. To further underscore the need for improved TKS genome resources, the gSSR Tara003 sequence could not be found in the TKS contigs published 4 . Moreover, only 15 markers out of the 65 that constructed the TKS linkage map 27 were mapped back together (in pairs or in threes) to six TKS scaffolds of Lin et al. 4 . Also, only one of the SSRs analyzed (gSSR Tara026) co-localized with two other map markers of Arias et al. 27 within a single TKS contig of Lin et al. 4 . Several studies independently reported the TKS genome size as ~1,420 Mb based on flow cytometry (1.45 pg/1C 27,31 ). Other studies estimated the diploid plant genome size at 2,400 Mb 21,28 . Comparatively, the draft TKS genome estimates at 1,040 Mb by 19 mer, 1,140 to 1,210 Mb by flow cytometry, or the 1,290 Mb assembly (all in 4 ) represent an underestimation, which signifies room for improvement in the TKS genome completeness and assembly. As H. annuus is related, but somewhat distant to TKS, we expected mis-localizations and/or ambiguities in the marker placement due to genome rearrangements and/or sequence diversity. It is noteworthy though, that many chromosome regions in the map (Fig. 1) were enriched in the markers from the same linkage groups of TKS 27 , with the gSSRs and eSSRs placed among them. This might indicate that despite a tentative character of this placement, the markers may be physically close. Thus, the markers found close on the H. annuus may indeed be linked on the TKS genome, extending the linkage information to new markers. gSSRs were slightly more ambiguously placed than eSSRs (excess of the sunflower genome BLAST hits of 2.8-fold vs. 2.2-fold, respectively), which could stem from targeting the parts of genome different in character, duplications of the non-coding regions targeted by gSSRs 36,37 , or differences in the genomes of TKS (2n = 16) and H. annuus (2n = 34). Several studies addressed the TKS diversity at various levels; agronomic performance and rubber/inulin production was of primary concern due to the industrial potential of the plant 15,27 . Seedling growth characteristics were also studied 38 . The first attempt at estimating the species genetic diversity using molecular methods was focused on a wide collection of TKS materials and allowed for a genetic distinction of the Russian/ Kazakh and Chinese TKS germplasms 33 . A milestone in the TKS molecular diversity research was the study of the Kazakhstan-originating USDA-ARS germplasm using a set of eSSRs 32 with which we compare the statistics of our gSSR dataset. Despite our sampling scheme being lower in number than in the previous eSSRs study of TKS, our study yielded very similar results and provided significant correlation of the population distances/indices. This result was possibly accrued by employing ~50% more gSSRs at lower population sampling, yet, ensured reliability of our results. This also confirmed the general observation on the TKS diversity formulated before 32 that the overall low species diversity resides mainly within populations. This observation is in agreement with our research hypothesis for this outcrossing, self-incompatible dandelion species. Comparison of the HWE violations in the gSSR and eSSR datasets shows much lower occurrence in the former dataset. This could be intrinsically related to the sequences targeted by either SSR type, or variable mutational frequency of the targeted loci 39,40 . This is further substantiated by the patterns of HWE violations detected. The (sub-) population violations in gSSR dataset could stem from the limited sampling, whereas locus-wise HWE violations in the eSSR dataset suggest a different underlying reason, with abundant (sub-) population TKS sampling 32 .
Developing eSSRs is generally achieved faster and easier than the gSSRs due to comparatively more conserved character of the transcriptome 39,40 . Owing to the fact of differences in parts of the genome targeted, in their conserved character, and in cross-amplification rates, both types of SSRs provide slightly different but complementary information 39,41,42 . Thus, inferences made from both types of SSRs together will provide more substantiated conclusions on the species diversity (or other studies for which they were used). Diversity of several economically important crops was analyzed using both types of SSRs, and in almost all cases led to similar results, which could also be taken as a confirmation study. For instance, deployment of both types of SSRs on the cucumber germplasm provided consistent positioning of most of the accessions analyzed on dendrograms and detected higher polymorphism rates using the gSSRs 43 . Similarly, high similarity was found between the gSSR and eSSR dendrograms among the tomato germplasms with higher polymorphism rate for the gSSRs, albeit slightly lower polymorphic information content 44 . The authors of that study postulated that combining both marker types in tomato would be effective for the species genetic diversity analyses. In contrast, studies of soybean indicated comparatively lower agreement between the gSSRs and eSSRs 45,46 . Authors argued for use of the eSSRs in soybean diversity studies for direct access to the population diversity in genes of agronomic interest but concluded that the species diversity was effectively estimated by both types of SSRs 46 . Analyses of the genetic diversity in wheat repeatedly indicated higher polymorphism of the gSSRs over eSSRs, but the authors of the studies argued that use of the eSSRs allowed for a more accurate delineation of the genetic relationships [47][48][49] . Studies in other cereal species observed the highest proportion of trimeric eSSRs, especially those encoding for neutral bulky amino acids 42,50 . Both studies also stated that the lower level of polymorphism detected by eSSRs compared to gSSRs might be due to the more conserved character of the targeted regions with selection acting against variation, a feature that could drive the relatively higher transferability of the eSSRs and a comparatively superior genotypic identification. Another conclusion emerged from the studies of the Prunus species. Although both types of SSRs resulted in similar dendrograms, combination of both datasets increased the genotypic discrimination 44 and indicated a higher polymorphism and more effective resolution by the gSSRs than by the eSSRs 51 . The emerging conclusion from those and other studies is that similar levels of genetic diversity between populations or species may be recorded by using either SSR type with eSSRs often detecting lower variation, but performing more reliably at species differentiation [52][53][54] .
Cross-amplification with the TKS gSSRs proved very successful and our markers transferred to other genera of the Asteraceae (Fig. 6; Supplementary Table S1). Within the Taraxacum genus, the 14 gSSRs tested extensively in this study also cross-amplified to four independent gDNA samples of T. brevicorniculatum ( 26 ; Nowicki et al. unpublished data; Fig. 6). The outgroup specimens that cross-amplified with our gSSRs for TKS belonged to distant subtribes (Taraxacum and Youngia are in the subtribe Crepidinae; Hypochaeris in the Hypochaeridinae; Krigia in the Mricroseridinae; Lactuca in the Lactucinae; and Pyrrhopappus in the Cichoriinae), but the Erigeron specimens belong to a distant tribe Asterae. This indicates a possible broad application of our gSSRs in the Asteraceae crops analyses. The TKS eSSRs also cross-amplified with four gDNA samples of local dandelions 32 . Thus, our or other species 34 (for TKS, Ceratoidea). No ITS sequences for T. brevicorniculatum were found at NCBI. The sample origin (population names) or Taraxacum species names are indicated. Dotted grey box delimits the outgroup for RAxML (non-Taraxacum species by ITS BLAST of sequences); orange box indicates the TKS. Sample ESKUSA E55/12 was used for de novo sequencing and development of the gSSRs used in this study. Pictures of exemplary specimens show TKS, T. brevicorniculatum (BRE), and T. officinalis (OFF; the dashed lines are indicating which specimens are shown). Samples of TKS, T. officinale, and T. brevicorniculatum marked with S were grown for another study 26  gSSRs present additional resources to the classical (GA/CT) n gSSRs identified by restriction digest, hybridization, and Sanger sequencing 55 . Both eSSR and gSSR datasets of TKS confirmed its sexual reproduction as observed in nature 26,32,34,56 . In contrast, results of the US dandelions are in agreement with the previous studies 25,54,57,58 that provided evidence of both sexual and asexual modes of reproduction present in T. officinale with a broad cross-amplification to related species. The retrieved ITS sequences remained largely indiscriminate as to the species identity of the local US dandelions, co-localizing with the T. officinale ITS sequence consensus and the historical Herbarium specimen. Yet, previous research indicated predominance of only three Taraxacum species in North America (T. ceratophorum, T. erythrospermum, and T. officinale 25,[57][58][59]. Including in the phylogenetic analyses the respective ITS consensus sequences of those three species, of the historical T. officinale specimen, and of T. officinale used for previous research 26 (and data not shown) suggested the bulk of the US local dandelions could belong to T. officinale, if the microspecies of Taraxacum are disregarded 20,60 . Notably, the obligatory sexual diploid TKS was segregated with high confidence from the bulk of the US dandelions, as was the Central Asia-frequent T. brevicorniculatum.
The results of our gSSR analyses of this collection of US dandelions are in agreement with the recent ploidy analyses of the North America common dandelions 25 . The majority of our dataset was tri-or tetra-ploid, and it is possible that we used too few markers to capture the higher levels of ploidy of the remaining several local dandelions samples classified as diploid based on the allele counts alone. In contrast to TKS, the US T. officinale presented evidence of population structure. This is in agreement with the biology of both species, especially considering the postulated clonal reproduction of the alloploid apomictic T. officinale in North America 25,57,60 . The higher frequency of sampling the outgroup specimens belonging to distant genera in the Southeastern US may be worth investigating in regard to the species range.
Species of Taraxacum are notorious for hybridization, which often results in genome rearrangements, regional gDNA duplications, and/or polyploidization 21,34,57 . Cross-amplification of the TKS gSSRs (this study) and eSSRs (confirmed on four samples 32 ), could help invigorate the molecular and genomic analyses of the more demanding polyploid dandelions 25,55,57 . Our study distinguishes the local US populations of T. officinale from TKS in several aspects. First, higher frequency of HWE violations indicated a difference in the US dandelions dataset. Second, the higher ploidy in this dataset inferred from the number of alleles detected indicated the possibility of clonal/asexual reproduction, which was further supported by the Index of association (I A ). Third, several analyses indicated presence of population structure in this dataset contrary to the outcrossing diploid TKS. Overall, our gSSRs present a useful analytical tool for Taraxacum spp., due to cross-amplification in related species, even in distant genera.

Conclusions and outlook
Results on the genetic diversity of TKS accrued in the course of this study may help current and future breeding efforts of this potential crop for renewable rubber. Complementary and congruent data obtained from both gSSR and eSSR study on the same germplasm provided thorough insights into the species biology. Although the TKS well-annotated genome is still to come, the combined marker map located on the related sunflower genome may help advance future TKS studies. Furthermore, cross-amplification of our gSSRs into related species of dandelions and even other genera augments the currently available resources to analyze their biodiversity and provides a platform for their further research.   USDA-ARS and identified in a previous study 15 (Table 1 and Supplementary Table S1). Plants were grown from seed as described earlier 26 . Young fresh leaves of 60 individuals from 19 different populations as designated by USDA-ARS 24 with their mapped locations of origin 32 were used for genomic DNA (gDNA) extraction. We extracted three to five independent plant specimens per population for population diversity study Bruvo's distances among the specimens were calculated to generate the FastME tree 94 (1,000 permutations; bootstrap support of 70% and more is indicated). The dandelion individuals were color-coded as per the Bayesian Information Criterion in R package poppr 78,79 (K-means hierarchical clustering; K min = 7).
United States plant materials and sequencing for species identification. Leaves of wild T. officinale Weber (n = 74) accessions from the Southeastern US and plants morphologically very similar were collected across different geographical regions (Tennessee, Georgia, Alabama, and Mississippi) and from eight distinct populations, as well as from historical herbarium specimens (Table 3 and Supplementary Table S1). Upon species identification by ITS sequencing (see below), specimens identified as not-Taraxacum spp. (n = 23) were set as a multiple outgroup. Leaf samples were collected in January and February of 2017, before the majority of the plants set bloom. No specific permissions were required for these locations/activities, as the materials are considered common weeds and regarded as neither endangered nor protected. Collected plant tissue was placed in ziplock bags containing silica gel (50 g each; Dri Splendor H&P Sales Inc., Vista, CA  70 . Read quality control was performed using FastQC 71 . De novo assembly was performed with ABySS version 1.9.0 72 with a k-mer size of 64. Sequence filtering for low complexity repeats was completed using the utility DustMasker 73 on the resulting unitigs. gSSRs were identified using an in-house developed perl script. The minimum and maximum motif frequency definitions on the gSSRs were six to 20 bp for the di-and tri-nucleotide repeats and four to 20 bp for the tetra-nucleotide repeats. A pair of primers flanking each SSR was designed using Primer3 74  published therein, those obtained from our de novo sequencing gSSR search, as well the marker information and/or primer sequences of the published TKS eSSRs 32 for comparison. The marker sequences were compared to the TKS genome contigs assembly of Lin et al. 4 using gmap with default scoring settings (except forallow-close-indels = 2 and -nosplicing). For each best sequence match to the TKS genome, a ~1 kb region containing the marker (500 bp on either side) was selected. The resultant contig fragments were used to BLAST the genome of related species, sunflower Helianthus annuus L., HA412-HO bronze assembly 75 . Best-hit sequences were then drawn on a map, respective to their physical locations on the sunflower chromosomes. If multiple best-hits had the same e-value, all were retained. ssR genotyping and analyses. PCR genotyping of the collection of TKS gDNA samples was completed using a set of 25 gSSR primers identified as described above (Tables 2 and S1) with subsequent capillary electrophoresis (QIAxcel Advanced Electrophoresis System, Qiagen). The single gDNA sample E55/12 that served for de novo sequencing was used for an initial genotyping screen with 50 primer pairs (25 di-and 25 tri-nucleotide repeats) with the PCR procedure described below. The results were visualized by capillary electrophoresis using Qiaxcel (Qiagen) and analyzed by using 25 to 500 bp DNA size marker and internal 15/600 bp alignment marker. We screened the results of genotyping with the 50 gSSRs for specificity on this gDNA sample, and the best-performing 25 gSSRs were selected for the analysis of the TKS gDNA collection (see Supplementary Table S1 for primer sequences). Cross-amplification to the US dandelions collection (T. officinale and outgroup specimens, Supplementary Table S1) was first checked on the four random gDNA samples isolated from plants local to Knoxville, TN using the 25 best-performing gSSRs on the TKS gDNA collection. The results were then screened in a fashion similar to the TKS screening procedure. Analysis of population structure. A total of 62 TKS gDNA samples were genotyped using 25 gSSRs and binned using FlexiBin (an MS Excel macro 76 ). In addition, the published dataset of TKS-eSSR study was retrieved 32 and binned to allow comparison of the datasets. Lastly, the dataset of T. officinale collected in the US (n = 74) and genotyped using 14 gSSRs was also binned, following the same procedure as the two datasets mentioned above. The binned datasets were analyzed separately for an array of population genetics parameters. To estimate the fixation and differentiation indices (F ST and F' ST , respectively 77 ), we used packages: poppr 78,79 , hierfstat 80,81 , and polysat 82,83 in R version 3.4.3 84 . Due to the detected variation in ploidy levels in the US dandelions dataset, the data was corrected for ploidy in R version 3.4.3 using the package polysat and then recoded as tetraploid with occasionally missing alleles when samples were actually di-or tri-ploid. The mixed ploidy of that dataset limited the scope of the indices accrued, notably the differentiation index F' ST 77 ; we resorted to GenoType/ GenoDive 85 to calculate the respective T. officinale dataset-wide F ST and D' ST indices. As per convention, the F ST bins considered were low (F ST < 0.05); moderate (0.05 < F ST < 0.15), and high (F ST > 0.15). Deviations of Hardy-Weinberg equilibrium (HWE) were calculated using package pegas version 0.10 86 in R version 3.4.3, using the exact test based on Monte Carlo permutations of alleles (B = 1,000) and α = 0.05. The results were depicted as a probabilistic heatmap for HWE deviation in a locus-and subpopulation-manner. The multi-locus genotype (MLG) networks were constructed using the Bruvo distances, using the minimum-spanning networks (MSN) reticulation algorithm in the package poppr in R version 3.4.3. POPTREE2 87 was used to calculate the population-wise distance matrices using either F ST or D ST indices (both standardized and unstandardized). Mantel tests were performed in R version 3.4.3 using the package MASS 88 . Analysis of the molecular variance (AMOVA) was performed in R version 3.4.3 using the package poppr, and the resulting Φ indices are reported as [%] values, after 1,000 permutations, at. the three levels of each dataset hierarchy: within individuals Φ IT , within individuals between subpopulations Φ IS , and among subpopulation and Φ ST . The mixed-ploidy T. officinale dataset did not lend itself to the Φ IT calculations using AMOVA. Discriminant Analysis of Principal Components (DAPC) was performed in R version 3.4.3 using the package adegenet version 2.1.1 89,90 . Compliance with ethical standards. Research involving Human Participants and/or Animals: This article does not contain any studies with human participants or animals performed by any of the authors.

Data Availability
All data generated or analyzed during this study are included in this published article and its supplementary information files. The Skewer-trimmed MiSeq reads og TKS gDNA are available at NCBI BioSample SAMN10414186, BioProject PRJNA505305.