Genetic and chemical differentiation characterizes top-geoherb and non-top-geoherb areas in the TCM herb rhubarb

Medicinal herbs of high quality and with significant clinical effects have been designated as top-geoherbs in traditional Chinese medicine (TCM). However, the validity of this concept using genetic markers has not been widely tested. In this study, we investigated the genetic variation within the Rheum palmatum complex (rhubarb), an important herbal remedy in TCM, using a phylogeographic (six chloroplast DNA regions, five nuclear DNA regions, and 14 nuclear microsatellite loci) and a chemical approach (anthraquinone content). Genetic and chemical data identified two distinct groups in the 38 analysed populations from the R. palmatum complex which geographically coincide with the traditional top-geoherb and non-top-geoherb areas of rhubarb. Molecular dating suggests that the two groups diverged in the Quaternary c. 2.0 million years ago, a time of repeated climate changes and uplift of the Qinghai-Tibetan Plateau. Our results show that the ancient TCM concept of top-geoherb and non-top-geoherb areas corresponds to genetically and chemically differentiated groups in rhubarb.


Results
Analysis of anthraquinone content. The HPLC results showed that eight anthraquinones were successfully detected in all samples except for aloe-emodin-8-O-β-D-glucopyranoside and rhein-8-O-β-D-glucopyranoside, which could not be found in any sample ( Fig. 1a; Supplementary Fig. S1 and Table S1). Emodin presented the greatest difference, with the highest content 55.00 times the lowest content. Chrysophanol-8-O-β-Dglucopyranoside showed the least difference, with the highest content 2.94 times the lowest content. PCA results indicated that the samples could be divided into two clusters, an eastern and a western group ( Supplementary  Fig. S2). The content of the three constituents (rhein, emodin, and emodin-8-O-β-D-glucopyranoside) in the western group was significantly higher than in the eastern group (t-test, P < 0.05), while the physcion-8-O-β-Dglucopyranoside in the eastern group was significantly higher than that in western group (P < 0.05) (Fig. 1b). We also determined similar results when compared the differences of chemical composition between the western and eastern clades using general liner model.

Sequence variation and population structure.
Sequencing of the six cpDNA segments from 377 individuals from the R. palmatum complex resulted in a joint alignment length of 3,316 bp and 34 haplotypes defined by 57 substitutions and 14 indels. Populations were fixed for a single haplotype (Supplementary Fig. S3 and Table S2), except for five populations (GSZQ, HBXS, SCKD, SNH, and SNZZ), and the cpDNA gene tree didn't show meaningful structure that the nuclear genes and nSSR data did (see below). The total haplotype (Hd) and nucleotide diversities (π) were 0.960 and 0.0031, respectively. Non-hierarchical analysis of molecular variance (AMOVA) indicated that the genetic variation was mainly distributed among populations rather than within populations (98.69% vs. 1.31%, Table 1). Significant phylogeographic structures was detected (N ST = 0.987, G ST = 0.938, P < 0.05). Additionally, neutrality test statistics were all non-significant in combined sequences, which was also supported by the mismatch distribution analysis (multimodal; Supplementary Fig. S4) via a significant SSD value (0.020, P = 0.043).
The total alignment length of four nuclear genes was 3,135 bp displaying 67 variable sites and one indel ranging in length from 708 to 818 bp. The aligned sequences of CHS were 1,549 bp containing 68 variable sites and four indels yielding 95 haplotypes (Fig. 2d). Total haplotype diversity (Hd) among five nuclear loci ranged from Hd = 0.477 to 0.944, Watterson's theta (θ wt ) from θ wt = 0.0026 to 0.0071, and the silent nucleotide diversity (π sil ) as well as the total nucleotide diversity (π t ) ranged from π sil = 0.00058 to 0.00245 and π t = 0.00143 to 0.00505, respectively. All neutrality test statistics were nonsignificant (Supplementary Table S3). According to the STRUCTURE analysis, the log-likelihood values of the four nuclear genes and the CHS gene increased with K, however delta K indicated that the optimal value for K was two (Fig. 3b), clustering the samples into an eastern and western clade ( Fig. 3c; Supplementary Fig. S5b). Non-hierarchical AMOVA showed strong population SciEntific RepoRts | (2018) 8:9424 | DOI: 10.1038/s41598-018-27510-1 structure in the R. palmatum complex (Φ ST = 0.75, P < 0.01). However, based on the STRUCTURE analysis, the hierarchical AMOVA indicated that 59.37% (four genes) and 57.35% (CHS) of the total variation was distributed among the two groups, and 22.93% (four genes) and 25.09% (CHS) of variation existed among populations within groups (Table 1).
Nuclear microsatellite diversity and population structure. No nuclear microsatellite locus consistently deviated from Hardy-Weinberg Equilibrium (HWE) in all the studied populations after Bonferroni corrections (corrected α = 0.00013, P < 0.01). We did not find any evidence for linkage disequilibrium (LD) in any pair of nSSR loci in any population (exact tests; all P > 0.05), nor was there evidence for the existence of null alleles in any locus. Diversity estimates based on 14 microsatellite loci varied between all populations (Supplementary  Tables S3 and S4). Population differentiation was significant with an overall F ST = 0.214 (P < 0.05), and the standardized genetic differentiation G′ ST was higher than F ST across all loci (G′ ST = 0.436, Supplementary Tables S4). AMOVA demonstrated significant genetic differentiation at the range-wide scale (Φ ST = 0.2023, P < 0.01), with 1.46% of the variation partitioned between the western and eastern clades, and 18.77% among populations within clades (Table 1). There was significant isolation-by-distance (IBD) among all populations (r = 0.5484, P < 0.001), and this effect also persisted when each sub-cluster was analyzed separately, except for sub-cluster pop4 ( Supplementary Fig. S6).
The estimated values of LnP(K) increased progressively from K = 1 up to K = 38 in the STRUCTURE analyses. The highest ΔK was observed at K = 2 but was also significant at K = 5 (Fig. 4a). At K = 2, the cluster membership coefficient estimates suggested that clinal variation occurred along an east-west gradient separating two clusters and a geographical discontinuity arose at ca. 102°E. Cluster I in western China mainly distributed in the QTP and the Hengduan Mountains, and cluster II located near the Sichuan Basin and extended to the eastern areas (Fig. 4b,d). With K = 5, the clusters could be further subdivided into two sub-clusters (pop1, pop2) and three sub-clusters (pop3, pop4, pop5) for cluster I and II, respectively (see Fig. 4c,e). However, pop3, except for population from Miaoxian of Sichuan (SCM), was included in the western cluster based on nDNA loci. In addition, indicators of a recent bottleneck were detected in eight and five populations under the stepwise mutation model  Table S5).
Historical and contemporary gene flow. Based on the nSSR data, all 20 pairwise estimates of historical gene flow within and between western and eastern clades were significant (no 95% confidence intervals overlapping zero), ranging from 0.23 (pop1 to pop4) to 2.88 (pop3 to pop5). The gene flow from the western clade to the eastern clade was 7.76 (95% CI: 7. 25-8.32), and that of the opposite direction was 5.00 (95% CI: 4. 65-5.39). Some migration rates were distinctly asymmetrical, for instance, being much greater from pop4 to pop1 (0.52) than the reverse (0.23) ( Supplementary Figs S7a and b).
The mean contemporary gene flow (mc) from the west to east was 0.0028 (95% CI: 0-0.0071) and the opposite was 0.0021 (95% CI: 0-0.0050). The range of contemporary gene flow was 0.0016 (pop5 to pop4) to 0.0098 (pop4 to pop1). Some pairs also had asymmetrical values, including the pairs of pop2 and pop4, pop1 and pop4, and pairs within the eastern clade ( Supplementary Fig. S7b).

ABC-and Bayesian skyline plot-based inferences of population history.
In step 1 of the DIYABC analysis, the posterior probability for scenario 1 was 0.7213 (95% CI: 0.7124-0.7301), much higher than for the other scenarios. This scenario depicted a situation that the three sub-clusters in the eastern clade diverged simultaneously. Then in step 2, scenario 1 also had the highest posterior probability (0.5708, 95% CI: 0.5233-0.6183). This scenario indicated that the divergences of the sub-clusters in the western and eastern clades were also simultaneous. In the final step, the simulations indicated the posterior probabilities for scenarios 1 and 2 were 0.4111 (95% CI: 0.4008-4212) and 0.4985 (95% CI: 0.4884-0.5086), respectively, much higher than for the other two scenarios (Fig. 5).
Ecological niche modeling and climatic data analysis. The ecological niche models (ENMs) had a high predictive power for the two gene pools, with AUC = 0.92 and 0.93 for western and eastern clades, respectively. This is fairly congruent with their current distribution, except for some eastern clade populations that are missing in current conditions compared with the genetic architecture of this species complex (Figs 3, 4 and Supplementary Fig. S5). However, the predicted models of two clades depicted different demographic histories. For the western clade, the ENM results demonstrated a continuous expansion since the last interglacial (LIG), whereas the eastern one suggested distribution expansion from LIG to the Last Glacial Maximum (LGM) and then shrinkage from LGM to the present day (Fig. 6). Nevertheless, despite a substantial lack of precision for LGM models, a common trend was that models produced using localities from either western or eastern gene pools alone showed little overlap of their predicted distributions for all periods considered (Fig. 6a), suggesting that eastern and western clusters have occupied environmentally different regions for a substantial amount of time. Additionally, although the two models of LGM suggested that the two populations both underwent range expansions from LIG to LGM, recent studies suggested that the Interdisciplinary Research on Climate (MIROC) simulation has been shown to be more realistic than the Community Climate System Model (CCSM) in predicting potential habitats of LGM in Asia 25 . Assuming this finding also applies to continental East Asia, we relied more on the MIROC simulation for the discussion of the projected LGM distribution. Overall, the whole distribution area of the R. palmatum complex had a northwestward shift from the past to the present.
Our analysis of environmental factors indicated that all 20 climatic variables contributed to the divergence between the western and eastern clades (Kruskal-Wallis tests: P < 0.05). The frequency distributions of each clade for each environmental variable are presented as kernel density plots ( Supplementary Fig. S9), and the results from the MANOVA also distinguished significant differences between clades with regard to the environment occupied (P < 0.001). The first two PCA axes explained 82% of the variation for the present climate (Fig. 6c). However, unlike other studies on alpine plants which had loadings of different variables in PC1 and PC2 26,27 , our PCA indicated almost all variables had high loadings for PC1 (Supplementary Table S6). Moreover, identity tests based on six and all 19 climatic variables indicated the two clades occupy identical climatic niches (P < 0.05) (Fig. 6b). Additionally, species records defined clear groups along the first axis of the multivariate space (Fig. 6c), indicating that substantial differences exist in the climate for each geographic region.

Discussion
We investigated whether the concept of geo-herbalism in TCM is consistent with chemical and genetic differences between top-geoherbs and non-top-geoherbs using the R. palmatum complex (rhubarb), an important herbal remedy, as a case study.
As previous findings 23,24,28,29 showed that the three species of the R. palmatum complex (Rheum officinale, R. palmatum, R. tanguticum) do not show clear morphological separation, we used a number of molecular markers (five nDNA regions, 14 nuclear microsatellite markers and six cpDNA regions) to ascertain whether the three species can be told apart genetically. As none of the used markers retrieved three clearly separated groups (Figs 3c and 4b; Supplementary Fig. S5b) and morphological identification of the three species is unreliable, we concluded that the R. palmatum complex can be treated as one species.  Table S1; (b) The left diagram indicates the corresponding delta K statistics calculated according to Evanno et al. 64 , the number of clusters (K) varied from one to 10 in 10 independent runs, whereas the right plot represents changes of the mean posterior probability (loge P(D)) values of each K calculated according to Pritchard et al. 63  Analysis of the anthraquinone content, the main medicinal compound in rhubarb, of 38 populations collected from throughout the distribution range showed that the R. palmatum complex was divided into two geographic clusters, an eastern and a western group, which coincided with the top-geoherb and non-top-geoherb areas for rhubarb in China ( Supplementary Fig. S2).
The two geographic groups were retrieved using 14 nuclear microsatellite markers suggested that the R. palmatum complex consists of two distinct clusters with a major phylogeographic break near the Sichuan Basin (Table 1; Figs 3c and 4b; Supplementary Fig. S5b). However, there was some disagreement between the chemical and nuclear data set regarding the placement of three populations (SNM, SCBZ, and GSYC). SNM and SCBZ genetically belonging to the eastern cluster grouped with the western clade chemically whereas GSYC grouped with the western cluster genetically but with the eastern cluster chemically. These three populations might be an exception from the close correlation between genetic relatedness and anthraquinone content because of a possible unusual combination of environmental factors which affect the chemical composition in these specific regions.
The consistent grouping of the R. palmatum complex into two geographic clusters suggests the survival in at least two refugia since the beginning of the Quaternary. Similar results have been reported for other temperate plants with similar ranges (e.g., Tetracentron sinense and Ligularia hodgsonii) 30,31 . These divergences reflected a strong signature of highly restricted gene flow due to the geographical and/or climatic isolations (also see review by Qiu et al. 32 and references therein). Indeed, our AMOVA results demonstrated moderate to high Φ ST values between the two clades (ranging from Φ ST = 0.2023 (nSSR) to Φ ST = 0.8244 (CHS), Table 1). The cluster analysis indicated some weak admixture between the clades (Figs 3c and 4b; Supplementary Figs S5b and S7)), implying stable differentiation between the two clades. However, there was some discordance in the Bayesian clustering when using different molecular markers. For instance, nSSR data clustered the populations SCPW, SCSP, SCML, SCWC, SCXL, SCKD and SCMG in the eastern clade, while nuclear sequence data (including the CHS gene), assigned these populations to the western clade. The discrepancies between nuclear sequence and nSSR data may be caused in part by their difference in mutation rates. Generally, the mean mutation rate is 10 −9 for nuclear genes and 10 −4 for SSR loci. This high mutation rate makes nSSR data more suitable for determining the intraspecific genetic structure 33,34 .  32,42 . However, there are some uncertainties about the time estimates such as the mutation model and homoplasy as well as the assumption of no gene flow when using DIYABC in our study, which might underestimate divergence times over large timescales 34,43 . Nevertheless, our time estimates in the present study are consistent with previous studies (e.g., the increased speciation rates in the Hengduan Mountains and the evolutionary radiation of Rheum during the Quaternary 21 ), indicating that our results reflect the rate of evolutionary divergence within the R. palmatum complex.
The results from the ENMs demonstrated that the two lineages expanded their ranges from LIG to LGM but indicated different demography histories thereafter (e.g., western populations expanded while eastern populations shrunk, Fig. 6a), and our Bayesian skyline plots of cpDNA and nuclear genes also indicated the species complex increased its effective population size during the Quaternary (Supplementary Fig. S8). We speculate that this unusual demography of the R. palmatum complex might have been due to increased human activities in the Himalaya-Hengduan Mountains during recent decades which caused a decrease in suitable habitats leading to higher levels of genetic drift and more bottlenecks (Supplementary Table S5). The complicated topography in these regions provide suitable microenvironments, buffering the impacts of climatic oscillations. Thus, R. palmatum complex populations could have moved northwestward following their optimal conditions and expanded their ranges during the LGM. In addition, our results indicate that the two lineages occupied different areas when only areas with moderate suitability scores are considered (e.g., >0.75, Fig. 6a). This is consistent with the results from the identity test and the PCA analysis (Fig. 6b,c) showing significant differences between the two clades. Such ecological niche partitioning will have reinforced the divergence of the two clades following initial spatial separation and the adaptation of populations to their local environment, which may ultimately lead to reproductive isolation and the generation of new species if given sufficient time [44][45][46] .
The present study used a chemical approach and genetic markers (cpDNA, nDNA, and nSSR) to better understand the genetic differentiation at the population level of rhubarb plants throughout its distribution range. The analysis of both chemical composition and genetic markers showed that two distinct groups are present in the R. palmatum complex. These two groups coincide with the areas which are considered to produce top-geoherbs and non-top-geoherbs for rhubarb in China. Ancient Chinese practitioners classed rhubarb collected from the Himalaya-Hengduan Mountains (Qinghai, Gansu, and Sichuan Provinces) as a top-geoherb, which corresponds to the populations from our western clade (e.g. QHMH, QHTJ, QHDR, GSSD, GSDB, SCXL, and SCML) collected in Qinghai, Gansu, and Sichuan. This is clearly shown by the cluster analyses of nSSR and nuclear genes. Our results show that chemical variation and genetic differentiation of rhubarb populations between top-geoherb and non-top-geoherb regions are consistent with the concept of geo-herbalism. Our study revealed the contribution of the genetic factors to the formation of top-geoherbs, and also showed that the formation of top-geoherbs might be closely related to species differentiation. Future studies should focus on quantitative genetics to further understand the molecular mechanism of top-geoherbs.  Table 2. Posterior median estimate and 95% highest posterior density interval (HPDI) for demographic parameters in scenarios 1 and 2 in STEP3 based on the nuclear multilocus microsatellite data for whole populations of the Rheum palmatum complex. a The unit of timing is generation. μ: mutation rate (per generation per locus). P represents the proportion of multiple step mutations in the generalized stepwise model, GSM. Overall, our results clarified that the current populations of the R. palmatum complex comprises two major clades (eastern and western). Quaternary environmental changes and/or the uplift of the QTP profoundly influenced the evolutionary and population demographic history of the R. palmatum complex as well as its current genetic structure. Chemical and genetic variation exists within the R. palmatum complex, and are consistent with the top-geoherb and non-top-geoherb areas of rhubarb. These findings reveal that genetic differentiation is at the core of the geo-herbalism concept.

Materials and Methods
Sample collection and DNA extraction. Leaf samples were collected from 38 populations covering the whole range of the R. palmatum complex (Supplementary Table S7 and Fig. S3a). All samples were assayed for 14 nSSR loci (n = 479), and a subset of samples were sequenced for cpDNA and nDNA regions (n = 377). In addition, the underground roots and rhizomes of thirty-one populations of the R. palmatum complex were collected to minimize the effect of seasonal changes on the concentrations of chemical constituents in the wild 47 and their chemical compounds were analyzed using three individuals to reduce individual differences.
Total DNA was isolated from dried leaf tissue using a plant total genomic DNA kit (Tiangen, Beijing, China).
The quality and quantity of DNA was determined on 1% TAE agarose gels and with a NanoDrop ® ND-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA), respectively. The DNA was diluted to a final concentration of approximately 20 ng/μL.
Analysis of chemical compounds. Dried root and rhizome samples were pulverized to extract free anthraquinones and anthraquinone glycosides. The powder was filtered through 180 μm sieves and accurately weighed when fine (800 mg). Then, 70% methanol (25 mL) was added and the mixture was weighed again. The mixture was extracted by ultrasonication for 1 h (40 °C, 300 w). After cooling, 70% methanol was added and transferred to a 25 mL volumetric flask. The supernatant fluid was filtered through a 0.22 μm membrane filter, and the filtrate was analyzed using high-performance liquid chromatography (HPLC) (see details in Note S1).
Chloroplast and nuclear DNA sequencing and microsatellite genotyping. A total of six cpDNA regions (trnL-trnF, ndhJ-trnF, psaI-accD, rpl20-rps12, rpl32-trnL, and psbA-trnH) and five nuclear genes (CHS and the four genes R27, R30, R46, R56 which were developed from transcriptome data) were sequenced. Primers and PCR amplification of the cpDNA regions were described in previous studies 48,49 (Supplementary Table S8), and the amplification of five nDNA regions was described in Ma et al. 50 . Sequences were generated (excepted for the CHS gene) using an ABI 377 DNA sequencer (Sangon Biological Engineering Technology & Service Co.). Clone sequencing was used for the CHS gene. More than ten colonies of each sample were randomly chosen and sequenced to survey sequence variations in multiple copies and then the program MEGA7 51 was employed to align and edit sequences. All newly acquired sequences have been submitted to GenBank with accession numbers ranged from MH457640 to MH457707 and MH465255 to MH465390. All sampled individuals were genotyped at 14 nSSR loci (see Supplementary Table S3) developed from the genomic resources of R. palmatum. The PCR conditions of nSSR loci were similar to the cpDNA fragments, except for the annealing temperature. The PCR products were genotyped using a 3730XL automated Genetic Analyzer (Applied Biosystems, Foster City, CA). Allele sizes were determined in GENEMAPPER (version 3.7, Applied Biosystems). Sequence analysis. The haplotype (Hd) and nucleotide (π) diversity, the number of segregating sites (S), the Watterson's parameter (θ w ) 52 , and the minimum number of recombinant events (R m ) 53 were estimated, and tests were carried out for departure from the neutral model based on Tajima's D 54 , and Fu and Li's D* and F* 55 using DNASP v5.1 56 . The total gene diversity (H T ), gene diversity within populations (H S ), and coefficients of differentiation G ST and N ST were calculated using SPAGEDI 57 based on 10,000 random permutations. Genealogical relationships of the haplotypes were identified using POPART based on a TCS network algorithm 58 . A mismatch analysis was conducted using ARLEQUIN v3.5 59 to test for spatial expansion of test populations. The goodness-of-fit was tested with the sum of squared deviations (SSD) between observed and expected mismatch distributions and Harpending's raggedness index (H Rag ) 60 . We also employed jMODELTEST v1.0 61 to evaluate the best-fit model of nucleotide substitution for maximum likelyhood (ML) method to infer cpDNA haplotype relationships (GTR + I + G in the present case), and used PAUP * v4.0 beta 10 62 to determine the relationships of cpDNA haplotypes and visualized results using FIGTREE v1.3.1 (http://tree.bio.ed.ac.uk/software/figtree/). The program STRUCTURE v 2.3.3 63 was used to infer the population genetic structure of nDNA sequences based on the admixture model with allele frequencies correlated. The number of clusters (K) was set to vary from one to 20. For each value of K, we performed 20 runs with a burn-in length of 200,000 and a run length of 500,000 Markov chain Monte Carlo (MCMC) replications. The best K values were determined using the methods described by Pritchard et al. 63 and Evanno et al. 64 . The web-based program Structure Harvester 65 was used for visualizing the STRUCTURE output. The estimated admixture coefficients (Q matrix) over the 20 replicates in each K were averaged using CLUMPP v1.1 66 , and the graphics of optimal K were produced using DISTRUCT v1.1 67 . In addition, in order to quantify variation in nSSRs and DNA sequences between and within the two main geographical regions, we performed analyses of molecular variance (AMOVA) in ARLEQUIN using 10,000 permutations. Additionally, we used BSP in BEAST v.1.7.5 68 based on the combined cpDNA data and four nuclear genes each to estimate effective population size changes within the species complex. Linear and stepwise models were explored using an uncorrelated lognormal relaxed clock. Runs consisted of 20,000,000 generations, with trees sampled every 5,000 generations. The BSP was visualized in the program Tracer v1.5, which summarizes the posterior distribution of population size over time. The cpDNA substitution rates for most angiosperm species have been estimated to vary between 1 and 3 × 10 −9 substitutions per site per year (s/s/yr) 69 , while those for nDNA in shrubs and herbal SciEntific RepoRts | (2018) 8:9424 | DOI:10.1038/s41598-018-27510-1 plants vary between 3.46 and 8.69 × 10 −9 (s/s/yr) 33 . Given the uncertainties in these rate values, we used normal distribution priors with a mean of 2 × 10 −9 and a SD of 6.080 × 10 −10 for cpDNA, and a mean of 6.075 × 10 −9 and a SD of 1.590 × 10 −9 for nuclear gene to cover these rate ranges within the 95% range of the distribution for our BSP analyses.

Microsatellite data analysis. Population genetic analysis.
For the nuclear microsatellite dataset, all 14 loci were checked for possible null alleles using MICRO-CHECKER v2.3.3 70 . Tests for departures from Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD) were performed in each population and a globally unified population using FSTAT v2.9.3 71 based on 1,000 permutations (α = 0.05), and corrected for multiple tests using the sequential Bonferroni method 72 . In addition, we conducted a modified version of F ST outlier loci analysis 73 under the assumption of the IAM using LOSITAN 74 with 50,000 simulations to generate 95% confidence intervals (CIs), and the results indicated all 14 loci were neutral, so all loci were retained for use in subsequent analyses.
For each locus, genetic diversity was assessed by calculating the observed number of alleles (A O ), the observed heterozygosity (H O ), the expected heterozygosity over all populations (H E ), the genetic diversity within the population (H S ), the total genetic diversity (H T ), and the standardized genetic differentiation G′ ST 75 across loci with FSTAT and/or GENALEX v6.5 76 based on 1,000 bootstrap permutations. Compared with the traditional measures (e.g., F ST and G ST ), G′ ST is a more suitable measure of differentiation for highly polymorphic markers such as microsatellites 75,77 . For each population, we also used the program GENALEX to estimate the values of H O , H E , private allelic richness (P AR ), fixation index (F IS ), and allele richness (A S ) across all loci.
Using the nSSR dataset, genetic clusters were identified using a Bayesian analysis in STRUCTURE with the same parameter settings as above, except for the K was set to vary from one to 38. To test how the geographic distance affected the genetic composition of the R. palmatum complex, Mantel tests with 10,000 random permutations were performed between the matrix of pairwise F ST /(1 − F ST ) and that of the natural logarithm of the geographic distances at species and each gene pool revealed by STRUCTURE analysis, respectively, using the package Vegan 78 in R 3.3.0 79 .

Testing for genetic bottleneck and estimates for historical and contemporary gene flow. The
Wilcoxon's signed rank test and the mode-shift test implemented in BOTTLENECK v1.2.02 80 was used to detect whether a population had experienced a size decrease over extended vs. more recent time scales (∼2Ne-4Ne generations in the past vs. a few dozen generations ago 81 , See details in Note S1).
To estimate historical gene flow (much longer period of time, ~4Ne generations in the past 82 ), between the STRUCTURE clusters and between sub-clusters within each STRUCTURE cluster (see results section), we used the program MIGRATE v3.6 83 to estimate the effective number of migrants (2Nem, where Ne is the effective population size and m is the migration rate per generation). Also, we estimated the contemporary gene flow using BAYESASS v3.03 84 (See details in Note S1).

Tests of demographic history by ABC modeling.
In order to decrease the computational amount and time, we narrowed scenarios by defining nested subsets of competing scenarios that were analyzed sequentially based on STRUCTURE analysis using the ABC procedure 85 as performed in DIYABC v2.1.0 86,87 , because hundreds of scenarios can be tested with five populations. For the ABC analysis, ten alternative scenarios of population history for the three sub-clusters composing the eastern clade were summarized (step 1 in Fig. 1; Supplementary Table S9) and tested using DIYABC, and the population sizes for all sub-clusters (pops1-5) were set to be constant in the analysis excepted for the ancestral population size. Then, we analyzed all five sub-clusters, and six alternative scenarios were included in this step (step 2 in Fig. 1; Supplementary Table S9), considering the best scenario in step 1 (i.e., scenario 1). Finally, given that the R. palmatum complex originated on the QTP 15 , we set four competitive scenarios (including the best scenarios in step 2, i.e., scenarios 1 and 2) to determine its migration route(s), and whether the genetic disconnection between the two clades was old or recent (step 3 in Fig. 1; Supplementary Table S9). A reference table was generated containing 20,000,000 simulated data sets for all scenarios (on average 1,000,000 per scenario), assuming uniform priors on all parameters for each scenario (Supplementary Table S10). A goodness-of-fit test was used to check the priors of all parameters before implementing the simulations. Each simulation was summarized by the following summary statistics: the mean number of alleles and the mean genetic diversity for each clade, and the F ST , the mean classification index, and shared allele distance between pairs of clades. After comparing the posterior probability of scenarios in different steps using the logistic regression and direct approaches, the posterior distribution of historical demographic parameters for the final step was estimated using the 1% simulated datasets closest to the observed dataset for the local linear regression. The average generation time was assumed to be five years according to our field observations for the R. palmatum complex.
Ecological niche models (ENMs) and environmental factor analysis. We used ENMs to infer suitable climate envelopes for the two clades of the Rheum species complex (see result section) in the present, the Last Glacial Maximum (LGM: ca. 21,000 years before present; BP), and the last interglacial (LIG: ca. 120,000-140,000 years BP) using the maximum entropy method implemented in MAXENT v3.3.3 88,89 , and used ENMTools v1.3 90,91 to calculate Schoener's D 92 and standardized Hellinger distance (calculated as I) to measure the niche similarity between clades (See details in Note S1).
In order to evaluate the effect of present climatic conditions on the genetic differentiation between the main two geographical regions, three methods were used to evaluate the impacts of climate on population differentiation. First, the lineage-level divergence associated with each of the environmental variables was examined using one-way nonparametric ANOVA (i.e., Kruskal-Wallis test 93  factors each in the two clades were displayed in kernel density plots using the R package ggplot2 94 . Then also using R, we performed a permutational MANOVA analysis for all environmental variables simultaneously for the two clades to evaluate whether environmental conditions differed significantly between their sites of occurrence. Finally, we employed the R package ggfortify 95 to perform a principal component analysis (PCA) to determine whether the genetic groups were ecologically differentiated.