Introduction

The reduction in gene flow between populations and lineages due to multiple processes can lead to genetic differentiation and impact the demographic history of a species1, with isolation by distance (IBD)2 isolation by altitude (IBA)3, and isolation by environment (IBE)4 playing significant roles.

The Pleistocene period, with its climate changes alternating warm and wet and cold and dry cycles between ~ 2.5 million years ago (Mya) and ~ 11 thousand years ago (Kya), was one of the most significant historical events that influenced species’ distribution and structure around the World5. In South America, climatic shifts profoundly impacted many species, forming refugia in tropical forests6 and grasslands7. Quaternary climate cycles affected narrowly and widely distributed species, with grassland-adapted species expanding while forests contracted8. The grassland expansion in the highlands in southern South America persisted for a long time. In contrast, their contraction only occurred during the Holocene (~ 4 Kya)9, concomitantly with the Araucaria Forest expansion. Such oscillations have drastically transformed the landscape and species distribution. During forest expansion, grasslands were fragmented and confined to higher elevations, isolating populations in sky-island patches surrounded by forest10. Such isolation allowed allopatric speciation in many plant genera11,12, with multiple examples of similar diversification and distribution patterns for different organism groups inhabiting the highland grasslands13,14.

The genus Petunia (Solanaceae) has been included in genetics and molecular biology studies and is a valuable model for eudicots15. Similarly, these South American native herbs may help understand the evolutionary aspects of plant speciation under a phylogeographical approach. The most inclusive phylogenetic analysis16 indicated that the genus originated in lowland grasslands in southern South America, from where it dispersed to highland fields during the Pleistocene. The genus diversified between 1.3 Mya17 and 2.8 Mya18, and ecological interactions such as climate, soil, and pollinators have strongly influenced speciation19. The Pleistocene climate changes have also played a role in the differentiation of intraspecific lineages, favoring the expansion and colonization of new environments20. Studies on population structure and diversity of lowland and coastal Petunia species are predominant, whereas just a few focus on the highland lineages11,17. Such highland species are endemic, and each occupies a narrow range in elevation.

Here, we studied the widely distributed Petunia altiplana T. Ando & Hashim. species, which occupies the southern Brazilian subtropical highland grasslands (SHG). SHG is the border between two distinct domains, the Atlantic Forest and Pampas21 (Fig. 1). Despite the large area and elevation range [~ 400 to 1400 m above sea level (m a.s.l.)], P. altiplana individuals are found in sparse patches throughout the species’ distribution, growing on rocky outcrops and roadside slopes22. The P. altiplana geographical range is naturally fragmented due to the region’s phytophysiognomy, which consists of a mosaic between grasslands and Araucaria Forest23 and the Pelotas River course. Moreover, anthropic activities have severely impacted the highland grasslands, which affected the native species’ distribution24. Estimates on intraspecific diversity based on plastid sequences revealed two main population groups in P. altiplana, with the Pelotas River separating them17. The haplogroups were mutually exclusive regarding each riverbank, and the authors considered the Pelotas River a phylogeographic barrier. In turn, analyses based on nuclear microsatellites24 indicated moderate bidirectional gene flow between the riverbanks, smaller than observed between populations from the same side of the Pelotas River. This last work24 also included plastid haplotypes for more populations and individuals than the previous work17, recovering the two haplogroups regarding riverbanks, with three low-frequent new haplotypes shared between populations from the two river’s margins.

Figure 1
figure 1

Geographical distribution of P. altiplana. (A) Location of collection sites (populations). Colors indicate the population groups: orange squares, EN; green circles, ES; and blue triangles, WE. (B) Elevation indicators. (C) Floral morphology of P. altiplana (D) and general view of the flowers (photos J.R. Stehmann, Universidade Federal de Minas Gerais, Brazil). The checkered green pattern indicates the distribution of the southern Brazilian grassland24—a mosaic landscape between the Atlantic Forest and Pampa domains.

Plastid markers and microsatellites are known to be less informative in understanding population genetics25 compared with RAD-seq-derived data. Moreover, plastid sequences and nuclear microsatellites often result in different evolutionary patterns due to discrepancies in coalescence time26. So, to estimate the P. altiplana genetic diversity and disentangle its population structure and demography, we used a next-generation sequencing method to investigate the factors that have influenced the species’ phylogeographic history. With this analysis, we aimed to answer the questions: (1) What is the demographic history of the species? (2) What are the primary factors that influenced the species structuration? (3) Are the populations structured due to the Pelotas River course or other physical or ecological barriers?

Results

Quality of genomic data and SNP calling

We obtained 186,032,269 raw reads considering all individuals. Reads ranged between 766,718 to 3,342,929 per individual, averaging 1,788,711.82. We eliminated 771,892 loci that did not meet population constraints and retained 18,612 loci with 1,288,087 sites, which produced a matrix with 11,231 variants (Supplementary Information 1 and 2: Tables S1 and S2). The average number of genotyped sites per locus in STACK after the filtering was 69.21 base pairs (bp) [standard deviation (SD) 0.01]. We removed 463 outlier loci, resulting in a final dataset containing 10,768 potentially neutral SNP used in the subsequent analyses.

Genetic diversity and population structure

Nucleotide diversity (π) varied between 0.15 and 0.20 across the collection sites. In all localities, the inbreeding coefficients were low (FIS < 0.1), with pop01 and pop19 having the lowest values (Table 1). Pop19 had the higher number of private alleles (PA) among populations, ca. four times higher than the second population (pop6). Pop19 is the most distantly distributed population regarding remains.

Table 1 Sampling information and diversity indices of Petunia altiplana. Diversity indices were estimated per population and population groups.

The population structure in P. altiplana evaluated using DAPC (Fig. 2A) revealed two main groups of populations: one encompassed pop15, pop18, and pop19 [hereafter west (WE) group], distributed in the countryside; the second clustered the eastern (subsequently E group) populations. DAPC analysis retrieved four principal components that resulted in the two main clusters of populations according to the lowest BIC obtained with the find.clusters option (Supplementary Information 3: Fig. S1).

Figure 2
figure 2

Population genetic structure of P. altiplana. (A) DAPC analysis (K = 2) with the first four PCs; (B) fastStructure results (K = 2 to 4). (C) Structure analysis (K = 2); (D) Structure analysis for the E group (K = 2). Dashed lines delimit the groups west (WE) and east, which was divided into east-north (EN) and east-south (ES) subgroups.

The fastStructure (Fig. 2B) yielded equally probable best K values, K = 2, 3, and 4 genetic components. K = 2 indicated that most WE individuals have an exclusive genetic component, whereas some individuals have admixed ancestry; K = 3 differentiated the WE group and divided the E group into two subgroups, one corresponding to populations distributed in the southern Pelotas River riverbank [hereafter east-south (ES)] and the second clustering the populations from the northern Pelotas River margin [hereafter east-north (EN)]. Additionally, with K = 3, some individuals from the EN group were admixed with the ES component, and the same WE admixed individuals in K = 2 now showed admixture with ES. At K = 4, we observed a new genetic component in the EN group. Structure analysis (Fig. 2C; Supplementary Information 4: Fig. S2A) exhibited a best K = 2, where the WE group was differentiated from E populations. Removing WE individuals, Structure analysis (Fig. 2D; Supplementary Information 4: Fig. S2B) exhibited the best K = 2, differentiating the populations of each riverbank with several individuals in each group displaying admixed ancestry. The initial two principal components in PCA (Fig. 3) distinctly segregated individuals, aligning seamlessly with their geographic distribution (Fig. 1), and individuals from each previously identified group exhibited no overlap within the Cartesian plane.

Figure 3
figure 3

Population genetic diversity. PCA plot showing the distribution of the genetic diversity using the first two PCs. Colored symbols indicate individuals’ distribution, with squares for EN populations, circles for ES group, and triangles for WE populations. Population color follows the legend at right.

The pairwise FST for collection sites ranged from 0.02 to 0.34 (Fig. 4; Supplementary Information 5: Table S3), and all pairwise comparisons were statistically significant, except for those involving pop8. Compared to the other populations, pop18, and pop19 had the highest FST values (Fig. 4). Regarding the population groups, FST values between WE and both riverbanks’ groups (0.10) than in EN and ES comparison (0.03).

Figure 4
figure 4

Heat map of pairwise FST values for the populations.

The hierarchical AMOVA revealed that ~ 79% of the total variation was observed within individuals, with low differentiation between groups (WE, EN, and ES; FCT = 0.09, p < 0.001) and moderate differentiation between populations within groups (FSC = 0.12, p < 0.001).

conStruct cross-validation (Supplementary Information 6: Fig. S3) showed that the spatial model was significantly superior to the non-spatial model, indicating isolation by distance between P. altiplana populations. For the spatial model, the predictive accuracy was highest at K = 2, with the additional spatial layer 2 contributing very little to the total covariance and the third layer having a significant covariance and deserving to be considered.

Ecological differentiation between populations

The MRM analysis detected evidence of IBD between populations (R2 = 0.19, p = 0.03), and the GLMM test revealed that geographical distance better explains the genetic differentiation between all P. altiplana populations (Table 2). Further examination of the E subgroups (ES-EN) and WE-ES populations through GLMM revealed that geographical distance combined with elevation was the most explanatory model for genetic differentiation. Pearson’s correlation coefficient indicated that environmental (R2 = 0.85; p = 0.001) and elevation (R2 = 0.70; p > 0.001) variables were positively and significantly correlated with geographical distances.

Table 2 Generalized linear mixed model (GLMM) tests to evaluate the genetic divergence between P. altiplana populations based on three comparisons.

Evolutionary relationships and migration events

The SplitsTree network (Fig. 5A) separated individuals according to their geographical distribution in EN, ES, and WE clades. Groups EN and ES were closer to each other than WE, except some individuals from pop2 and pop5 from EN and pop8 and pop9 from ES were closer to WE populations. This analysis also reflected population structure regarding geographical distribution. The Snapp coalescent analysis (Fig. 6) estimated the divergence time for P. altiplana at ~ 2 Mya, with pop15 and pop19 from the WE group as basal to the remaining populations. The estimated divergence time between pop15-pop19 and pop18, also from the WE, was ~ 170 Kya. The division between WE and E occurred at ~ 110 Kya, and the EN and ES populations had divergence time estimated at ~ 50 Kya. The distinction between EN and ES groups is challenging due to a conspicuous pattern of rapid expansion within these population groups, evidenced by their short branches and recent diversification.

Figure 5
figure 5

Splitstree phylogenetic network for P. altiplana populations and individuals. Green circles indicate ES (east-south), orange squares correspond to EN (east-north), and blue triangles sign WE (west) distributed individuals.

Figure 6
figure 6

Evolutionary relationships between P. altiplana populations obtained using the Bayesian coalescent analysis in SNAPP. Bold values represent the ages of main nodes, and node bars indicate the ages of 95% confidence interval (CI). Colors indicate population groups ES (green), EN (orange), and WE (blue).

The best maximum likelihood tree obtained with Treemix revealed at least four migration events (Supplementary Information 7: Fig. S4A). Although the analysis was inconsistent in depicting the divergence between the populations over the iterations, each iteration at m = 4 inferred a slightly different set of migration edges (Supplementary Information 7: Fig. S4B-F). The populations involved in these migrations were ES (pop07 and pop09), EN (pop02 and pop05), and WE (pop15 and pop19), with gene flow between populations from different groups and inside groups.

The f-branch statistic indicated multiple introgression or gene flow instances in specific branches and nodes in P. altiplana populations (Fig. 7). The f-branch test is a heuristic method that aggregates f4-admixture ratios throughout the entire tree topology to detect introgression, or, in the case of populations, gene flow throughout internal branches. Gene flow was observed between populations from the same geographic group and between ES and EN branches. Additionally, the WE showed gene flow towards both EN and ES groups, extending to internal branches. Notably, pop19 exhibited the highest frequency of migration towards the other populations. Ancestral gene flow was pronounced in the internal branches linking different populations. The result showed most gene flow inside each riverbank and some events between riverbanks.

Figure 7
figure 7

Footprints of migration/gene flow estimated by the f-branch statistic. The heat map summarizes the f-branch statistics calculated in Dsuite. The SNAPP tree (Fig. 6) was used as a phylogenetic reference tree. Darker colors depict increasing evidence for gene flow between lineages. Dotted lines in the phylogeny represent the ancestral lineage. Rows represent nodes within the tree topology, and columns represent tips. Each cell shows the f-branch statistic between a tree node (rows) and each tip (column). Grey cells are empty where comparisons cannot be made. The tree tips are color-coded by the population group: ES (green), EN (orange), and WE (blue).

The first round of fastSimcoal estimations indicated that the best-fit scenario to explain P. altiplana demographic history was an expansion event in both the WE and E population main groups (Table 3). Subsequent estimates focusing on gene flow revealed a unidirectional gene flow from WE to E as the best-fit scenario (Fig. 8) (lowest delta likelihood and AIC values—scenario 13) with no likelihood overlap (Supplementary Information 8: Fig. S5). All parameters were presented in Fig. 8 for the best scenario.

Table 3 Comparison of demographic models tested with fastSimcoal.
Figure 8
figure 8

The best scenario obtained in fastSimcoal simulations and estimated values. The 95% confidence interval is assigned below the estimated values. Nanc—ancestral effective population size; NaW and NaE ancestral effective population size for WE and E, respectively, before de expansion. WE and E currently effective population size. TEXP—time of expansion (Kya), and TDIV—time of divergence expressed in generations (Kya). MIG—unidirectional migration from WE to E.

Discussion

The phylogeographic history of Petunia species has been of interest in recent years, with the genus illustrating several evolutionary processes, including rapid and recent divergence, natural hybridization and introgression, Quaternary refugia, speciation, and genetic diversification in narrowly and widely distributed close species. Petunia altiplana is distributed in the subtropical highland grasslands in southern Brazil. The species integrates the Petunia short corolla tube clade16 and is the only species from the highlands that is widely distributed. There are no taxonomic doubts about the plant’s identity regarding diagnostic morphologic traits considering populations throughout the species distribution. Previous studies revealed that the Pelotas River served as a phylogeographical barrier for the species17 despite some shared plastid haplotypes between populations from both riversides24. Nuclear microsatellites indicated higher polymorphism-sharing populations from different riverbanks that could be attributed to gene flow, at least in part24. We evaluated genetic diversity and population structure based on a large genomic coverture to disentangle the species’ phylogeographical history and clarify the polymorphism-sharing origin.

We obtained high genetic diversity indices, as usually expected for widely distributed species that inhabit heterogeneous environments27. The indices were similar to those obtained for other Petunia species with wide distribution, such as P. axillaris (Lam.) Britton et al. based on SNP variability7. Almost all populations (Table 1) showed private alleles, which may be attributed to recent selective pressures, long-term geographical isolation28, or even incomplete lineage sorting29. The inbreeding coefficient for populations was low, as expected for self-incompatible species such as P. altiplana30.

Regarding the population structure, based on previous works17,24, we expected to observe the Pelotas River as an effective barrier splitting populations from the northern and southern riverbanks. However, most analyses (Fig. 2) indicated that the divergence between EN and ES groups only appears when western populations (WE) are removed. The most marked differentiation was between WE and E populations, which correlates with elevation differences as the WE populations are found in lower elevations. In contrast, EN and ES groups are in the highlands (Fig. 1).

Geographical distribution (Fig. 1) promoted the genetic distinctiveness between populations. PCA’s first two principal components (Fig. 3) consistently separated individuals following their natural geographical arrangement, which indicated a correlation between genetic or environmental factors and geography. Similar patterns were observed for other widely distributed Petunia species from countryside lowland grasslands, such as P. axillaris7, and throughout the Atlantic coastal plain in southernmost Brazil and Uruguay, such as P. integrifolia (Hook.) Schinz & Thell20.

The conStruct analysis provided compelling evidence for isolation by distance in P. altiplana. The spatial model equally supported K = 2 and K = 3. The third genetic layer emerged exclusively among the WE populations (Supplementary Information 6: Fig. S3), reinforcing this group differentiation as seen in other structure analyses. conStruct compares spatial and non-spatial models, and our results pointed out the spatial model in the P. altiplana case, underscoring the dominant influence of isolation by distance on population structuring (Supplementary Information 6: Fig. S3B). Such structure suggests a gradual and continuous pattern of genetic differentiation with intermediate populations not sampled or currently extinct.

The GLMM and MRM results confirmed the correlation between geographic distance and genetic diversity and also indicated isolation by distance with elevation as a primary factor (Table 2). Moreover, geographical distance explains the genetic differentiation considering all populations, western and eastern distribution, and between the EN and ES groups from the east. There was a significant correlation between genetic and geographical distances (R2 = 0.19, p = 0.03) and also between geographical distance and elevation (R2 = 0.70; p > 0.001). Such correlations indicate that factors such as environmental heterogeneity and geographical barriers could isolate populations by restricting gene flow, as observed for an Atlantic Forest tree (Bathysa australis (A.St.-Hil.) K.Schum.)31.

Isolation is often found in populations from different elevations32, and the environmental conditions can vary significantly over short distances in elevation gradients33, which can potentially lead to local adaptation, preventing gene flow, hindering the movement of pollinators and seed dispersers34. Small and solitary bees are the P. altiplana pollinators (J.R. Stehmann, UFMG, unpublished data), and such bees have low flight autonomy35. Moreover, similar to other Petunia species, seed dispersal in P. altiplana is autochoric36, with seeds falling close to the mother plant.

The observed pattern of isolation by distance in P. altiplana is consistent with the general trend observed in many widely distributed species as a typical driver of lineages’ differentiation37. Some IBD examples exist in Petunia7,20,38 and other Solanaceae39 species from southern South America. All these studies implied a limited gene flow and genetic drift, increasing the population differentiation in these grassland species. However, it is essential to note that geographical distance and elevation alone cannot entirely explain the population structure37.

Exploring evolutionary relationships and demographic history among lineages has yielded additional evidence, contributing to understand the current population structure. Notably, the lowland populations from the western distribution (WE population group) are closer to the P. altiplana ancestral population (Fig. 5), which aligns with the proposed genus history and the species’ phylogenetic positioning. The Petunia genus originated in lowland grasslands within southern South America, from where it dispersed to highland fields during the Pleistocene16.

Regarding the eastern-distributed populations (ES and EN groups), the phylogenetic analysis in Splitstree revealed the division between populations from each side of the Pelotas River (Fig. 5). The calibrated tree derived from SNAPP indicated that such divergence between EN and ES groups was recent, occurring in the late Pleistocene (Fig. 6), which reinforced the suggestion that P. altiplana originated in lowlands followed by expansion to higher elevations during the Pleistocene. The rapid population expansion in the highlands may not have left enough markers of structuration, contributing to the observed patterns.

The analyses of demographic scenarios (Fig. 7; Table 3) provided robust support for the expansion hypothesis. Among the scenarios assessed with fastSimcoal, the most compelling result portrayed a divergence between the western (WE) and eastern (ES + EN) populations, succeeded by an expansion event. It is worth noting that the estimated diversification time and population sizes might be subject to overestimation because we used the only available mutation rate, which was calculated based on plastid markers. The plastid genome has a lower mutation rate than the nuclear genome and a small effective population size that could bias time estimates40.

The highland population expansion dates back to the Pleistocene, estimated at ~ 110 Kya using SNAPP and ~ 18.4 Kya (CI 95%: ~ 28.5 to 43.9 Kya) using fastSimcoal. This temporal framework highlights the significant role of Pleistocene climate changes in shaping the South American biomes, leading to alterations in landscape connectivity and profoundly impacting the current biodiversity distribution41. The Pleistocene had an optimal climatic condition for the grasslands’ expansion, while forests contracted during the glacial periods8. The forest-adapted species recovered their domains during the interglacial periods, growing from refuges close to the river’s margins42, which fragmented grasslands and confined herbs to isolated sky islands17,19. Pollen records from the southern Brazilian highlands support this notion by revealing that during the last glacial maximum, the landscape was dominated by grasses and open formations9.

Alternatively, P. altiplana populations could have migrated to the highlands during interglacial periods to escape from adverse weather conditions. The expansion of the Araucaria Forest may have exerted pressure on grassland populations, prompting their movement to higher elevations. Contrary to warmer temperatures during interglacial periods in lowlands, the colder environments in highlands could have been more ecologically favorable to grassland-adapted species. In this scenario, the populations might have sought refuge in the highlands as a strategically better environment43.

Regarding gene flow, the Treemix result (Supplementary Information 7: Fig. S4) indicated at least four migration events between population groups. The ABBA-BABA test showed share-derived alleles between the populations, which can be interpreted as ancient gene flow/migration (Fig. 7), especially from pop19 to eastern distributed populations. Such polymorphism sharing44 indicates that P. altiplana could have been more connected in the past. The best scenario estimated in the fastSimcoal analysis (Fig. 8) pointed to a directional migration more intense from WE to E populations after the group diversification. Our results suggest that the populations were more connected and had gene flow during the expansion to the highlands. The group connection was probably lost with the Araucaria Forest’s growth that fragmented the landscape mainly during the Holocene (~ 4 Kya)9. The grasslands are now fragmented in isolated high elevations, forming patches of herbaceous plants surrounded by forest10,45. Additionally, past and present human activities in the region promote fragmentation and loss of habitat for grassland-adapted species46, which has reduced gene flow as patches of wild individuals are isolated by cultivated plants.

As a potential phylogeographical barrier for P. altiplana, the Pelotas River appears limited to split ES and EN groups. Whereas no concrete evidence suggests changes in the river’s course and Pelotas River was not implicated in WE and E groups separation, it is crucial to consider documented Pleistocene paleo drainages across South American rivers47. Such effects could plausibly have occurred throughout the entire distribution of P. altiplana. The paleo drainages serve as an alternative to explain the distribution of sister aquatic lineages48, and, in theory, they also can promote landscape fragmentation in grasslands, interrupting population connections through flooding low areas and isolating plant groups, which could prevent the gene flow between them. Similar patterns were observed with P. axillaris7. The fragmentation could shape microrefugia for grassland-adapted species, which would colonize the region again when flooded areas became reduced. A historical alteration in the Pelotas River course might have played a role in facilitating the expansion of populations to their present distribution. It is conceivable that when the Pelotas River got its current trajectory, combined with the onset of Araucaria Forest expansion, the barrier between the populations between the riverside banks began.

Methods

Sampling, DNA extraction, and sequencing

We sampled young and healthy leaves from 94 adult individuals distributed in 19 collection sites (hereafter populations) throughout the entire geographical distribution of P. altiplana (Fig. 1; Table 1). Plants were identified based on morphological traits following the species’ formal description24. The sample size per population adhered to that for non-model species in genomic population analyses49 and similar studies involving Petunia species7,20,38. Vouchers were deposited at the Universidade Federal de Minas Gerais herbarium (BHCB-UFMG) in Belo Horizonte, Brazil.

We powdered the silica-dried leaves with liquid nitrogen to extract genomic DNA following a cetyl-trimethyl ammonium bromide (CTAB; Sigma-Aldrich Chem. Co., St. Louis, USA)-based protocol50. We measured DNA concentration with a Qubit Fluorometer (Thermo Fisher Scientific Co., Waltham, USA) and quality with a NanoDrop DN-1000 Spectrophotometer (Thermo Fischer). Finally, we tested for the absence of nucleases using EcoRI (NEB—New England BioLabs Inc., Ipswich, USA). DNA samples with 260/280 and 260/230 > 1.80 were considered high quality and used to build genomic libraries.

DNA libraries were processed with DArTseq™ complexity reduction using the combination PstI-MseI (NEB) method51 and replacing the single adaptor with two ones. Sequencing was performed by bulking equimolar amounts of amplification products from each 96-well microtiter plate sample and using them in a c-Bot’s bridge PCR (Illumina Inc., San Diego, USA), followed by sequencing on the Illumina Hiseq2500 (Illumina).

Plant guideline statement

Experimental research and field studies on wild plants, including the collection of plant material, complied with relevant institutional, national, and international guidelines and legislation. Appropriate permission has been obtained for collecting plant material following Brazilian law on including genetic material on taxonomic and evolutionary studies under permit number 41530-9/2019.

Filtering and variant discovering

We used Stacks v2.6252 to process the demultiplexed raw data and a de novo SNP calling strategy. We examined the quality and specifications of the raw data with FastQC v0.11.753. We removed the barcodes and reads with any adapter contamination or low-quality using default settings in the process-radtags Stacks module54. FastQC analysis indicated inferior quality in the first four nucleotides, which were removed using Cutadapt v1.1654. We used the denovo map.pl Stacks module to identify SNPs from reads. We performed the parameter optimization55 by running the de novo pipeline multiple times on a subset of 20 individuals, iterating over increasing M = 1–8 and n = 1 to 8 per run. This method seeks the assembly parameters (M and n) that maximize the number of R80 loci in the dataset (the number of polymorphic loci present in at least 80% of samples). The best M = n = 5 parameters were selected, and the de novo mapping was performed.

We used the optimization method for call SNP by deleting samples with high levels of missing data55, which increases the overall retention of loci and genotypes after filtering and reducing biases due to missing data (in this process, the pop12 was removed). The final dataset encompassed 77 individuals from 18 collection sites. We filtered the missing data using the population Stacks module, retaining only loci in at least 80% of the individuals across populations (R = 0.8). We set the minor allele frequency (MAF) cutoff at 0.04. We used only the first SNP of each read (—write-single-snp), preventing linkage disequilibrium. We identified the outlier SNPs using PCAdapt v3.5.156. The final vcf file with only neutral SNPs was converted to different formats using PGDSpider v2.1.1.257 and dartR v2.0558 R package to perform further analyses.

Genetic diversity and population structure

We estimated nucleotide diversity (π), allelic richness (AR), private alleles (PA), observed (HO) and expected (HE) heterozygosities, mean inbreeding coefficient (FIS) per population, and fixation index (FST) with population Stacks module.

We ran a multivariate discriminant analysis of principal components (DAPC)59 to evaluate population structure and individuals’ ancestry using the find.cluster and optim.a.score options in adegenet v2.1.360 R package. Similarly, we used fastStructure v1.061 and Structure v2.3.462 approaches. fastStructure employs a variational Bayesian framework to compute posterior distributions and identifies dataset clusters. To determine the optimal number of groups, we ran fastStructure from K = 1–18 using the logistic prior and the chooseK function. With Structure, we investigated the number of population clusters and potential admixture between populations using MCMC. Hierarchical analyses were performed for ten runs per K, up to a maximum of six, and the admixture model was used with a burn-in of 10,000 steps followed by 100,000 steps. We ran fastStructure and Structure analyses with the Structure_threader software63, summarized results using Structure Harvester64, and evaluated the likely number of populations based on the inspection of likelihood plots and the Evanno method64. We used pophelper65 to plot the fastStructure and Structure graphs. We also ran Structure excluding some populations using the same parameters above and K = 1–5 due to preliminary results. We ran a principal components analysis (PCA) to evaluate the structuring between populations and estimate admixture using dartR58.

Finally, we ran a hierarchical Analysis of Molecular Variance (AMOVA)66 using Arlequin v3.5.2.267, considering individuals, all populations, and groups of populations following structure results.

To incorporate geographical information along with SNPs, we used conStruct68. We employed the cross-validation method, conducting analyses with eight replicates, K = 1–5, 10,000 MCMC iterations sampled every 1000, and a 0.5 training proportion. Subsequently, analyses with K = 2–5 were conducted, with ten replicates, using one chain running, 100,000 MCMC iterations sampled every 1000 iterations, and a spatial model.

Genetic differences as a function of geography and environment

We used a generalized linear mixed model (GLMM) to determine if geographical distance, climate and soil variables, and elevation could explain the population differentiation. Based on preliminary results, we subdivided the GLMM analysis into the best number groups based on the structure analysis (DAPC, fastStructure, and Structure results), comparing groups of populations to determine whether the same variables can equally explain the divergence between groups. As response variables, we created a matrix of pairwise FST distances {FST/(1 − FST)69}. We evaluated three distinct matrices as potential predictor variables: (1) GEO, which included the pairwise geographical distances between populations; (2) ENV, based on the Euclidian distances along the first three axes of a PCA for 35 climate variables extracted from CliMond70 and nine soil variables obtained from SoilGrids71 with a resolution of 30′′ (c. 1 km2; Supplementary Information 9: Table S4); and (3) ELEV, which included the pairwise elevation distance between populations. We extracted elevation data from the coordinates of the collection sites. We executed GLMM within the MCMCGLMM72 R package and previously published scripts73 adapted to the current data. We evaluated ten models derived from the combination of the GEO, ENV, and ELEV variables and a null model with no predictors. We compared the models using the deviance information criterion (DIC) and associated DIC differences to determine which model better explained the genetic divergence between populations. We ran MCMCGLMM with standard priors and a burn-in of 500,000 iterations, followed by two million iterations with a thinning interval of 750 steps. We confirmed the chain convergence of MCMCGLMM using the CODA74 R package to examine trace plots. We determined how environmental variables correlated to geography using Pearson’s correlation coefficient implemented in the HMISC v4.4-175 R package. We also used the pairwise FST and geographical distances between populations to test for IBD through multiple regression on matrices (MRM) in the ecodist76 R package.

Evolutionary relationships and gene flow

To investigate evolutionary relationships between populations or groups of populations, we construct a relationship network using the NeighborNet method in Splitstree v4.1677. Additionally, we employed a coalescent framework using the SNAPP v1.378 implemented in Beast2 v2.479 to infer the evolutionary relationships between populations based on SNP data. SNAPP calculates the probability of the species tree without gene trees, mathematically integrating all possible gene trees. We used the previously described approach80, which tweaks the SNAPP’s settings to include a strict clock model that can be time-calibrated based on the fossil record or information from other phylogenies. We employed secondary calibration obtained in previously published studies to calibrate the molecular clock18. The calibration was based on the divergence time between the Petunia short and long corolla tube clades (2.85 million years ago). We used P. axillaris, which belongs to the long corolla tube clade, as an outgroup. A standard deviation of 0.16 in real space was applied in a normal distribution using BEAUTi as part of the BEAST package79. For computational efficiency, we subsampled populations, including one representative per each (overall 18 individuals) + one individual as an outgroup. We limited the dataset to 1000 randomly selected SNPs and set the chain length at 100,000 MCMC iterations. We assessed runs using Tracer v1.681 to examine convergence (ESS > 200) and tree topologies, and we visualized node heights using Densitree82 and FigTree v1.4.4 (https://github.com/rambaut/figtree/).

To generate a phylogenetic tree that accommodates admixture, we used the TreeMix v1.13 software83. We reduced the number of populations to improve the resolution of the analysis by clustering closely related populations, as observed in the SNAPP tree. Additionally, we removed pop08 due it had only one sampled individual, resulting in ten populations: EN1 (pop01 + pop06); EN2 (pop02); EN3 (pop03 + pop04); EN4 (pop5); ES5 (pop7); ES6 (pop09 + pop13 + pop17); ES7 (pop10 + pop11 + pop14 + pop16); ES8 (pop15 + pop19); and ES9 (pop18). The optimal number of migration edges on a population tree containing one to ten edges was estimated using the OptM84 R package.

To test the potential migrants/gene flow between the populations, we performed the ABBA-BABA using Dsuite v0.385. ABBA–BABA statistic is based on the number of ancestral (A) and derived (B) alleles in a four taxa phylogeny as (((P1,P2),P3),O), where O is the outgroup and P1, P2, and P3 are the target populations. If targets did not hybridize, the number of shared alleles between P1 and P3 (BABA) or P2 and P3 (ABBA) should be equivalent. In contrast, excessive sharing indicates hybridization between the populations. In ABBA-BABA analysis, we also used the SNAPP tree as a reference and P. axillaris as an outgroup. The analysis considers incomplete lineage sorting as the null hypothesis86. The D and f4-ratio statistics were calculated using the Dtrios function in Dsuite with default parameters. For better interpretation, the results from Dtrios were further processed using the Fbranch software and associated plotting utilities for the f-branch statistic.

Demographic modeling

To explore alternative demographic models for P. altiplana populations, we estimated demographic scenarios in fastSimcoal v2.687. We tested 13 evolutionary scenarios using a hierarchical approach to identify which scenario better explains the demographic history of P. altiplana. For the division of the populations, we followed the number of groups of DAPC, Structure, and phylogenetic relationships. We started with an exploration of multiple scenarios of expansion, differentiating between the population groups: (1) divergence without gene flow; (2) divergence with gene flow; (3) constant expansion in the first group; (4) an event of expansion in the first group; (5) constant expansion in the second group; (6) an event of expansion second group; (7) constant expansion on both groups; and (8) an event of expansion in both groups. Finally, we assessed gene flow in the best-fitting model, considering (9) constant gene flow; (10) recent gene flow; (11) ancient gene flow; (12) unidirectional gene flow from the first to second group; and (13) unidirectional gene flow from second to first group (Supplementary Information 10: Fig. S6).

The site frequency spectra were estimated with easySFS software (https://github.com/isaacovercast/easySFS) using the VCF file without filtering by –min-maf, and samples were projected downward to maximize the number of loci without missing data vs. the number of retained individuals. Groups of populations followed previous results for population structure, with a projection of 108 and 14 haploid samples for each group, respectively. We used an overall substitution rate of 2.8 × 10−9 per site/generation, as reported for Petunia17. For each tested demographic model, we performed 100 independent runs using 100,000 simulations, 40 expectation–maximization cycles, and a broad search range for parameters (Supplementary Information 11: Table S5) to determine the run with the best parameter estimates and maximum likelihood. The 13 scenarios were compared using Akaike’s information criterion (AIC)88 to select the best-fitting demographic model. We ran each model with its best parameters 100 times and compared the likelihood distributions to check whether the models were significantly different or just stochastic results. We calculated the 95% confidence intervals for each estimated parameter using 100 non-parametric bootstrap SFS.