Polyphyletic ancestry of expanding Patagonian Chinook salmon populations

Chinook salmon native to North America are spreading through South America’s Patagonia and have become the most widespread anadromous salmon invasion ever documented. To better understand the colonization history and role that genetic diversity might have played in the founding and radiation of these new populations, we characterized ancestry and genetic diversity across latitude (39–48°S). Samples from four distant basins in Chile were genotyped for 13 microsatellite loci, and allocated, through probabilistic mixture models, to 148 potential donor populations in North America representing 46 distinct genetic lineages. Patagonian Chinook salmon clearly had a diverse and heterogeneous ancestry. Lineages from the Lower Columbia River were introduced for salmon open-ocean ranching in the late 1970s and 1980s, and were prevalent south of 43°S. In the north, however, a diverse assembly of lineages was found, associated with net-pen aquaculture during the 1990s. Finally, we showed that possible lineage admixture in the introduced range can confound allocations inferred from mixture models, a caveat previously overlooked in studies of this kind. While we documented high genetic and lineage diversity in expanding Patagonian populations, the degree to which diversity drives adaptive potential remains unclear. Our new understanding of diversity across latitude will guide future research.


Introduction
Multiple independent introduction events can lead to shifts in genetic variation relative to native source populations, potentially boosting invasiveness and potential for rapid local adaptation 1-4 .
Furthermore, different introduction vectors delivering distinct genetic lineages in different regions can result in a mosaic of populations with varying genetic diversity and evolutionary potential 1,2,5,6 . We investigated self-sustaining Chinook salmon [Oncorhynchus tshawytscha (Walbaum)] populations, currently part of a rapid colonization that is sweeping through Patagonia, the binational region of Chile and Argentina at the southern cone of South America. Specifically, we measured genetic diversity and evaluated the most likely phylogenetic origins of four self-sustaining populations spread over a wide geographical range that received multiple distinct introductions from different kinds of artificial propagation.
Beginning in the 1870s, Chinook salmon, native to North America and the North Pacific Ocean, were deliberately introduced into innumerable rivers on all continents except Antarctica 7; yet successful naturalization has been rare. Self-sustaining adfluvial (migrating between lake and river) Chinook salmon populations have been established in the North American Great Lakes 8, but anadromous (migrating between the sea and river) populations outside their native range exist only in New Zealand's South Island and in South America's Patagonia 7. The phylogenetic ancestry of New Zealand Chinook salmon was traced to introductions in the early 1900s from the Sacramento River fall run (run names refer to the season in which adults typically return to freshwater), most likely Battle Creek, California [9][10][11]. Originally stocked in one river of the South Island, the fish naturalized and within a decade expanded their range considerably [11][12][13]. When studied about 30 salmon generations later, a number of phenotypic traits had evolved, apparently in response to local environmental conditions 11,14, offering increased local fitness 15. The monophyletic ancestry and known history of introduction in New Zealand [...]
The Genetic Analysis of Pacific Salmonids (GAPS) consortium assembled a comprehensive genetic baseline for coast-wide management of mixed-stock fisheries 25. The accuracy and precision of mixture analysis using the GAPS baseline are substantial with either conditional or unconditional Bayesian methods [25][26][27][28], and the baseline provides a potentially useful approach for inferring the ancestry of Patagonian Chinook salmon. CML mixture analysis was used in two recent studies of the ancestry of Patagonian Chinook salmon 29,30.
However, as far as we know, the validity of this application in mixed-origin populations has not been demonstrated. Introduced populations violate the fundamental assumption of the model, namely that individuals in the unknown "mixture" actually originated from one of the baseline populations. Newly founded populations are isolated and expected to diverge from their ancestors through founder effects and genetic drift 29. An even greater concern is interbreeding of mixed-origin fish in the new range (admixture), producing novel genotypes not accurately attributable to any single source population.
Founder effects and drift are not easily accounted for, but neither are they likely to fundamentally confound our analysis. When considering microsatellites or other neutral markers, it is unlikely that founding and drift would result in a Patagonian population looking more like an unrelated population than the true population of origin. Such a result would require parallel patterns of allele frequency convergence at multiple highly polymorphic loci. The problem of interbreeding is more serious. It was not clear to us how the Rannala and Mountain 31 CML mixture algorithm would treat individuals of mixed origin. Simulations and sensitivity analysis helped us answer that question, and model-based clustering (M-BC) allowed an independent evaluation of CML mixture analysis in this unusual application of stock identification to introduced and naturalized populations.

Results
Data quality, neutrality, and genetic structure
Data quality and genotyping success were high, with an average of 12 out of 13 microsatellite loci scored per individual fish. We removed from the analysis 9 Toltén individuals that gave no reliable genotypes (collected as decomposing carcass samples), leaving 87 individual Patagonian Chinook salmon for our study.
Ots213 departed significantly from neutral expectation (F_ST outlier test) and was removed from further population genetic analyses because of potential directional selection (but was retained for genetic mixture analysis, which is generally robust to departures from neutrality). Oki100 and Ots208b also departed from neutral expectation, but only marginally, and so were retained. None of these loci is known to diverge from neutral expectation in North American populations 32. Pairwise allele frequency differences were not significant between sample collection locations within river basins in Patagonia (range of intrabasin P values, 0.0042-0.690; Bonferroni-corrected α = 0.0014), and collections within each basin were therefore pooled for population-level analyses (but see below). The analyses we present here are based on those river-basin-level aggregates of collections, which are intended to represent separate populations, though because our samples were small, we tested and evaluated that assumption in several ways. For example, mean F_IS values for Patagonian populations (0.013, SD = 0.0195) were substantially larger than estimates from the North American baseline populations (0.008, SD = 0.0159, Supplementary Table S2). Those heterozygote deficits in Patagonia might indicate a Wahlund effect from having sampled distinct populations within a river basin. Although departures from Hardy-Weinberg expectations were non-significant, we recognize that there was little power to detect them with such small sample sizes.
Non-significant heterozygote deficits were observed at Ots201 and Ots213 (again, Ots213 was not included in most population genetic analyses because of its highly significant departure from neutral expectation).
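To illustrate why pooling genetically distinct groups can inflate F_IS (the Wahlund effect mentioned above), the following minimal Python sketch computes F_IS = 1 - H_obs/H_exp for a single locus. The function name `f_is` and the toy genotypes are hypothetical, and the expected heterozygosity is the simple uncorrected estimate, which may differ from the estimator used in the study.

```python
from collections import Counter

def f_is(genotypes):
    """Inbreeding coefficient F_IS = 1 - H_obs / H_exp for one locus.

    genotypes: list of (allele, allele) tuples for diploid individuals.
    H_exp is the uncorrected Hardy-Weinberg expectation, 1 - sum(p_i^2).
    Undefined (division by zero) if the locus is monomorphic.
    """
    n = len(genotypes)
    h_obs = sum(a != b for a, b in genotypes) / n
    counts = Counter(allele for g in genotypes for allele in g)
    h_exp = 1.0 - sum((c / (2 * n)) ** 2 for c in counts.values())
    return 1.0 - h_obs / h_exp

# Two hypothetical populations, each in Hardy-Weinberg proportions on its own
# (allele A at frequency 0.75 in one and 0.25 in the other).
pop_a = [("A", "A")] * 9 + [("A", "B")] * 6 + [("B", "B")] * 1
pop_b = [("A", "A")] * 1 + [("A", "B")] * 6 + [("B", "B")] * 9

print(f_is(pop_a), f_is(pop_b))  # no heterozygote deficit within either group
print(f_is(pop_a + pop_b))       # positive F_IS once the two groups are pooled
```

In this toy case neither group shows a deficit alone, yet the pooled sample does, which is the pattern a Wahlund effect would produce.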

Origin of Patagonian Chinook salmon (CML mixture analysis)
We found evidence of multiple North American lineages in Patagonia. Of the 46 reporting groups, 16 (35%) were identified by CML mixture analysis as possibly represented in Patagonia (Figure 1 and Figure 2). However, based on our simulations, we expected a small fraction of spurious assignments and allocations associated with admixture among lineages in the introduced range (see below, Results: Simulated mixed-origin founding). Hence, to emphasize major contributing lineages and avoid over-interpretation, we focused on putative contributors with c. 10% or greater allocation from any source to any single Patagonian population (Figure 2; see below, Simulated mixed-origin founding). Seven potential source lineages satisfied this criterion. Approximately south to north in their native range, these were: Willamette River spring, North Oregon Coast, West Cascade fall, West Cascade spring, Interior Columbia Basin summer/fall, South Puget Sound fall, and Whidbey Basin.
The estimated number of lineages that contributed to our South American populations was higher in the north (Toltén and Petrohué) than in the south of our study area (Aysén and Baker) (Figure 1). West Cascade fall, and especially West Cascade spring, contributed substantially, particularly in the south, where West Cascade spring accounted for 71% and 65% of the genetic contribution to the Aysén and Baker populations, respectively (Figure 2). The Interior Columbia Basin summer/fall lineage contributed to all four Patagonian populations, but especially to those in the north. Other contributors showed more localized effects, for example South Puget Sound fall in Petrohué (20%), Whidbey Basin in Toltén (20%), and Willamette River spring in Baker (18%). Various other donors had lower contributions, but, again, some of those results might be attributable to the misassignments we observed in the simulation of mixed ancestry.
Whereas the two northern river basins (39°-42°S) appeared highly polyphyletic, the two southern basins (45°-48°S) were nearly monophyletic, deriving nearly all of their ancestry from the closely related lower Columbia River, West Cascade spring and Willamette River spring lineages.
Despite small sample sizes and apparent mixed ancestry, we found general agreement between modeled stock composition estimates and the proportions of individual fish assignments (Figure 2; Supplementary Table S3).
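The screening criterion described above (c. 10% or greater allocation to any single Patagonian population) can be sketched as a simple filter. This is an illustration only: the function name `major_contributors` and the data layout are assumptions, the listed proportions are the ones quoted in the text, and the last entry is a hypothetical minor group added to show what gets screened out. The study treated the threshold as approximate (North Oregon Coast was included after rounding), which this strict comparison does not reproduce.

```python
def major_contributors(allocations, threshold=0.10):
    """Reporting groups whose allocation to any single population is >= threshold.

    allocations: {reporting_group: {population: estimated proportion}}
    """
    return sorted(
        group for group, by_pop in allocations.items()
        if max(by_pop.values()) >= threshold
    )

# Proportions quoted in the text, plus one hypothetical minor group.
allocations = {
    "West Cascade spring": {"Aysén": 0.71, "Baker": 0.65},
    "South Puget Sound fall": {"Petrohué": 0.20},
    "Whidbey Basin": {"Toltén": 0.20},
    "Willamette River spring": {"Baker": 0.18},
    "minor group (hypothetical)": {"Toltén": 0.04},
}
print(major_contributors(allocations))
```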

Split assignments consistent with primary donors
The ancestries of most fish were readily identifiable through CML mixture analysis. Most assigned with high probability to one of the baseline reporting groups (average maximum a posteriori value = 0.903, SD = 0.145) or reference populations (0.861, SD = 0.168) (Supplementary Figure S1).
High assignment probabilities to baseline reporting groups were observed even in watersheds where spawners from multiple lineages were collected together (i.e., Toltén and Petrohué). At both the reporting-group and population levels, individual fish that did show affinity to multiple source lineages were invariably associated with the same sources to which other fish in the same collection assigned with high probability. For example, Petrohué and Toltén had individuals that assigned with relatively high probability to North Oregon Coast, West Cascade spring and fall, Interior Columbia Basin summer/fall, and South Puget Sound fall; other presumed mixed-origin individuals split their assignment probability among those same sources.

Simulated mixed-origin founding
Our simulation results showed that most individual fish in a population derived from multiple sources would assign back to those sources, often splitting their assignment probability between source populations (Supplementary Figure S2). However, we also learned that a non-trivial number of simulated, mixed-origin fish might assign with high probability to unrelated lineages. Some misassignment to genetically similar reporting groups was expected. Both reporting groups used in our simulation have closely related sister groups in the GAPS baseline: West Cascade spring is genetically similar to West Cascade fall, and South Puget Sound fall is similar to Whidbey Basin (an adjoining inland water body). Most simulated fish (61.3%) assigned to one or the other of the true source reporting groups. Another 28.2% assigned to those closely related sister groups. That level of misassignment was expected based on leave-one-out jackknife analysis of the North American baseline. For example, approximately 17% of real fish collected from Cowlitz River spring (one of the seed populations for the simulation) misassigned to Cowlitz River fall in the West Cascade fall reporting group.
Expected misassignment to closely related reporting groups contrasted strongly with assignment of simulated individuals to unrelated groups. In our simulation, 10% of individuals assigned to unrelated reporting groups, 44% of these with high assignment probability (P ≥ 0.8).
Interior Columbia Basin summer/fall received about half of these misassignments, whereas the rest were distributed among 14 other unrelated reporting groups. This level of misassignment observed in simulated fish was much higher than the normal level of misassignment observed between closely related reporting groups in North America. Note that this result is completely simulated and is not related to mutation or drift in the founded populations or in the baseline.
Our simulation study confirmed the general utility of CML mixture analysis for individual assignment and proportional allocation despite admixture, but also showed the potential for misleading results, even when individuals assign with high probability. Therefore, we interpreted our empirical results of CML mixture analysis of real fish with caution when the estimated proportional contribution of a reporting group approached the level of misassignment we observed in the simulation (i.e., <10%, see above Results: Origin of Patagonian Chinook salmon). Any criterion would be somewhat arbitrary, but for the empirical dataset there seemed to be a break between 0.08 and 0.1, and we knew that values below 0.08 could be relatively strongly influenced by the spurious assignments we revealed in the simulation.
Therefore, the simulations largely confirmed the utility of the CML algorithm for studies of mixed ancestry with the important caveat that a few mixed origin individuals can show high probability of membership to completely unrelated lineages.
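The logic of the simulation can be illustrated with a deliberately simplified sketch. It is not the procedure used in the study: it assumes fixed, known allele frequencies, Hardy-Weinberg proportions, independent loci and equal priors, rather than the Rannala and Mountain algorithm, and the three-group baseline, the function names and the allele frequencies are all hypothetical.

```python
import math
import random

def genotype_log_likelihood(genotype, freqs):
    """Log-likelihood of a multilocus diploid genotype under Hardy-Weinberg
    proportions and independent loci. freqs: one {allele: frequency} per locus."""
    total = 0.0
    for (a, b), locus in zip(genotype, freqs):
        p, q = locus.get(a, 0.0), locus.get(b, 0.0)
        prob = p * p if a == b else 2.0 * p * q
        total += math.log(max(prob, 1e-12))  # floor avoids log(0) for unseen alleles
    return total

def assignment_probabilities(genotype, baseline):
    """Relative probability of membership in each baseline group (equal priors)."""
    logs = {name: genotype_log_likelihood(genotype, f) for name, f in baseline.items()}
    top = max(logs.values())
    weights = {name: math.exp(value - top) for name, value in logs.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}

def simulate_hybrid(freqs_a, freqs_b, rng):
    """First-generation mixed-origin fish: one allele per locus from each source."""
    def draw(locus):
        alleles = list(locus)
        return rng.choices(alleles, weights=[locus[x] for x in alleles])[0]
    return [(draw(la), draw(lb)) for la, lb in zip(freqs_a, freqs_b)]

# Hypothetical three-group baseline with five identical loci per group.
N_LOCI = 5
baseline = {
    "source_1": [{1: 0.8, 2: 0.1, 3: 0.1}] * N_LOCI,
    "source_2": [{1: 0.1, 2: 0.8, 3: 0.1}] * N_LOCI,
    "unrelated": [{1: 0.1, 2: 0.1, 3: 0.8}] * N_LOCI,
}

rng = random.Random(1)  # fixed seed so the sketch is repeatable
hybrids = [simulate_hybrid(baseline["source_1"], baseline["source_2"], rng)
           for _ in range(1000)]
probs = [assignment_probabilities(g, baseline) for g in hybrids]
best = [max(p, key=p.get) for p in probs]
# Fraction of simulated hybrids whose top assignment is each baseline group.
print({name: best.count(name) / len(best) for name in baseline})
```

A hybrid carrying one typical allele from each source at every locus splits its probability evenly between the two true sources, whereas a hybrid that happens to draw the rarer alleles can receive its highest probability from the unrelated group.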

Equivocal results regarding expected contributors
Not a single fish we sampled from four locations in Patagonia showed any affinity whatever to the University of Washington Hatchery fall-run stock (zero relative probability of membership). Only five fish, collected in the Petrohué basin, assigned with high probability to another Puget Sound fall-run hatchery stock, Soos Creek (from which University of Washington broodstock was derived), but none were similar to our contemporary sample from University of Washington Hatchery fall-run stock.
Central Valley, California, was also a suspected source of Patagonian Chinook salmon; however, our results offered very little support for that conclusion at the locations we sampled. Only two fish assigned to Central Valley populations, one to Tuolumne River and the other to Stanislaus River. It was unclear whether those two fish demonstrated true Central Valley ancestry or spurious assignments resulting from admixture (see above, Simulated mixed-origin founding).

Discussion
We investigated the phylogenetic ancestry of introduced Chinook salmon in Patagonia by conducting two different classes of genetic mixture analysis. Diverse genotypes led to the identification of many putative ancestral sources introduced primarily from the states of Oregon and Washington. Our results were largely consistent with historical records of fish introductions and recent molecular genetic studies. However, we also found interesting differences, such as apparent contributions from undocumented introductions and, conversely, a lack of evidence supporting the naturalization of well-documented introductions.
Our study is distinguished from previous molecular studies of Patagonian Chinook salmon (16,19,29,30) in three principal ways: 1) We used the most inclusive baseline dataset possible, including all potential North American donor lineages. Previous studies have been more or less limited by the number of reference lineages available for particular genetic markers. The GAPS microsatellite baseline is the most extensive of its kind and is ideally suited for this application. 2) To our knowledge, ours is the first study of its kind to take into account the potential for spurious allocation of admixed genotypes. Other studies have addressed genetic drift in introduced populations 29, but not admixture between divergent lineages. We suggest that admixture may be more problematic than genetic drift, and simulations of mixed-origin founding offered a cautionary note on potentially spurious results. Our observations are relevant to many studies that seek to characterize small genetic contributions from multiple founding lineages 36 [...]
Analyses of allele frequencies led us to pool within-river-basin samples into putative populations, yet additional evidence suggested substructure, or non-equilibrium conditions, within river basins. Because our samples were small, it was important to evaluate the assumption of basin-level population structure from different perspectives, in addition to analyses of allele frequencies.
Thus, we also tested for departures from expected Hardy-Weinberg genotypic proportions, especially heterozygote deficits that might indicate a Wahlund effect resulting from sampling two or more genetically distinct populations. This analysis was intended to help us evaluate population structure and potential assortative mating among sympatric founding lineages. Mating structure is critical to our understanding of effective population size and the potential for adaptation and persistence of introduced lineages in Patagonia. We did find elevated F_IS values within basins (putative populations), consistent with assortative mating or other non-equilibrium conditions, but the F_IS estimates were not significantly greater than zero. Recognizing limited power for detecting a Wahlund effect, we also tried to draw inference from CML individual assignment probabilities. Fewer intermediate assignments (split probabilities between reference groups) were observed than expected for a (simulated) mixed-origin population at equilibrium. If real, such a result might be consistent with, for example, assortative mating of two or more genetically distinct lineages within a river basin. Alternatively, strong outbreeding depression (e.g., 37) or insufficient time since introductions could result in fewer mixed-origin individuals than expected. Resolution of this question will require larger sample sizes within and among sites and river basins as well as across multiple generations. Alternatively or additionally, more markers could be surveyed. For example, genome sequencing would provide haplotype arrays that might be quite powerful for evaluating introgression and equilibrium.
Diversity of founding lineages was higher in northern Patagonia, and yet this latitudinal trend was uncorrelated with standard measures of genetic diversity. Overall, population genetic diversity measured by heterozygosity and allelic richness was higher than expected in Patagonia, nearly identical to that of North American populations. Our genetic CML mixture analyses also suggested a diverse array of founding lineages in Patagonia. Initially, we assumed that high genetic diversity, i.e., heterozygosity and allelic richness, was a result of mixed ancestry. Previous studies have drawn similar conclusions 19,29,30. CML mixture analysis and M-BC both estimated increasing diversity of founding lineages from south to north in Patagonia. Surprisingly, however, populations with the most diverse ancestry showed no higher levels of heterozygosity or allelic richness. Nor was there a spatial gradient for heterozygosity or allelic diversity, as was evident for lineage diversity.
It would seem almost axiomatic that a diversity of founders would introduce more alleles relative to monophyletic populations, yet other factors might be confounding a clear association between lineage diversity and genetic diversity. Given our broad and overlapping estimates of heterozygosity and allelic richness, it might simply be that our sample sizes were too small to obtain a precise view of genetic diversity relative to lineage diversity.
A principal challenge in this study was to distinguish the genetic signal of true ancestry from the noise created by mixed-origin, introgressed populations. In our CML mixture analysis we expected some misassignments between related populations. The patterns of misassignment in the North American reference baseline were well understood based on previous results 27. However, our simulation of mixed-origin founding revealed some spurious genetic allocation not previously reported in studies that used this approach 29,30,36. In the empirical results, we observed allocations of up to 6% estimated genetic contribution from very unlikely donor regions, such as Nass River and the South Thompson River, both in British Columbia. This 6% level of contribution is equivalent to a little more than one fish with a high assignment probability in a population sample of the size we collected. The CML simulation showed us that mixed-origin (admixed) individuals usually assign to the true donors; however, some individuals will assign to unrelated populations, sometimes with high assignment probability. Note, however, that the probability values are conditional on the available baseline references. A high assignment probability might simply mean that no other reference population could likely have produced a given genotype. This is relevant, for example, to studies that assume a probability threshold (e.g., 0.8) for accurate assignment, and especially to those with incomplete baseline datasets. While a probability threshold such as 0.8 is clearly useful in conventional genetic mixture applications 28, in a mixed-origin founding application it is almost certainly the case that some high-probability individual assignments are attributable to the misassignment phenomenon that we demonstrated through simulation. Apparently, introgression between donor populations results in unique genotypes that assign arbitrarily to other, unrelated populations.
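The remark that a 6% allocation corresponds to little more than one fish can be checked with a short calculation. The sketch below assumes, purely for illustration, that the 87 fish were spread evenly over the four basins; the actual per-basin sample sizes are not given in this section, and the names used are hypothetical.

```python
def fish_equivalent(allocation, sample_size):
    """Number of fish corresponding to a proportional allocation in a sample."""
    return allocation * sample_size

TOTAL_FISH = 87  # Patagonian fish retained for analysis (from Results)
N_BASINS = 4     # river basins sampled
mean_sample = TOTAL_FISH / N_BASINS  # assumption: equal sample sizes per basin

# A 6% allocation expressed as a number of fish in an average-sized sample.
print(f"{fish_equivalent(0.06, mean_sample):.1f}")
```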
To avoid falsely implicating potential donor lineages, we set a lower threshold for genetic contribution below which putative ancestry was viewed with caution and some skepticism. Based on review of both simulated and empirical data, we used a threshold of 10% estimated genetic contribution to any single population in Patagonia. It was not that we rejected lineages below 10% as potential contributors; rather, we were less confident in identifying those as founders. [...] (Table 1).
Nevertheless, we found little evidence supporting the successful colonization by University of Washington Hatchery fall-run stock introduced in Patagonia. This stock, also known as Portage Bay fall-run, from the South Puget Sound fall reporting group, was primarily derived from the Green River draining to Puget Sound (distinct from the Green River in the West Cascades), particularly from Soos Creek Hatchery (1949-1950s), though exchanges with other populations took place over the years 40. The Soos Creek Hatchery itself had exchanges with many other populations, mainly within the Puget Sound area.
It is puzzling that in our study not a single fish showed any probability of assignment to the University of Washington reference population. Only five fish (17% genetic contribution) from Petrohué River assigned to Soos Creek Hatchery. [...] simulations). However, much of this diversity appears to be due to accidental escapes from thriving Chilean net-pen salmonid aquaculture operations in the 1990s that imported ova from diverse sources. Since 1988, Chilean salmonid aquaculture has shifted entirely to net-pen operations. Although substantial escapes were frequent, they were hard to quantify because they commonly went unreported [42][43][44][45]. Data gathered from insurance companies alone 46 [...] (Table 1).
Although sparse, this information could explain the incidence of Patagonian genotypes [...] Similarly, the introgression of cultured rainbow trout genes into pre-existing, naturalized rainbow trout populations in Patagonia has also been demonstrated 48.
An apparent lack of Central Valley ancestry was similar to the findings of one previous molecular study 29 but different from another 30. It is worth noting that previous studies similar to ours did not account for potential spurious assignments of mixed-origin fish to unrelated reporting groups. Thus, numerous additional sources were identified as putative contributing lineages. We do not refute the presence of these additional lineages. We simply recognize the limitation we discovered in our genetic mixture analysis regarding mixed-origin fish, and we focus on what we take to be the major lineages that became established at our study sites (those lineages receiving ≥ 10% allocation).
[...] Zealand 11,15, as well as pre-adaptation to environmental conditions inferred for Patagonian populations 18, are all features that suggest high adaptive potential of Chinook salmon in Patagonia. Lower-than-expected levels of heterozygosity, if confirmed and maintained over generations, could indicate reproductive isolation between sympatric lineages, which would help maintain lineage identity.
Whether this happens is crucial to the evolutionary trajectory of the species in its new range. Lineages could either evolve, adapting genetically to local conditions 15, or they could interbreed and recombine, potentially resulting in new phenotypic variation with unpredictable evolutionary outcomes. Chinook salmon may exert strong ecological impacts in freshwater, estuarine and marine ecosystems 18, and different evolutionary outcomes will directly affect these impacts owing to considerable phenotypic variation among lineages.
[...] 57,58. Before proceeding with our population genetic analyses, we tested for loci that departed from neutral expectation in order to avoid loci that might bias our parameter estimates and genetic distance estimates (although departures from neutrality would not necessarily bias our genetic mixture analysis). We used the F_ST outlier approach 59 implemented in the LOSITAN software package 60. Genetic differences among the introduced Patagonian populations were characterized in more detail using pairwise F_ST values and leave-one-out jackknife self-assignment among the South American river basins. Because our sample sizes were small, we were especially cautious of non-significant results because we had little power to reject null hypotheses, even when false. However, where we did observe statistical significance, we generally trusted our results as biologically meaningful, despite small sample sizes.

North American baseline dataset
We relied on a baseline dataset of North American reference populations that were selected from the GAPS baseline 25 to represent as closely as possible historical phylogeographic lineages in North American Chinook salmon. Specifically, we used a slight modification of the dataset analyzed by Moran et al. 32. We added the following populations because they were identified previously as potential sources of Patagonian Chinook salmon (Table 1) [...]
[...] "assignment" simply reflects the maximum a posteriori probability of membership to population or reporting group). We needed to know the extent to which potentially spurious assignments might affect our analysis of the Patagonian populations. As a benchmark for misassignment in the North American reference populations, we conducted a leave-one-out jackknife analysis for comparison of a posteriori probability distributions between simulated mixed-origin fish and real fish, from both North and South America (Supplementary Figure S2).
[...] were those identified in CML as major contributors by virtue of c. 10% allocation or higher to any single Patagonian population (North Oregon Coast was included because allocation to that lineage was quite close to our threshold, rounding up to 10%. Including North Oregon Coast as a possible source seemed to be conservative, especially given records of potential introductions from Northern Oregon).

Model-based clustering
The 10% value was selected based on an apparent break in the composition estimates. We also inferred from simulations that allocations of less than 10% might be due at least in part to spurious assignment of mixed-origin fish (see Methods: Simulated mixed-origin founding). For the M-BC analysis, we conducted 110,000 MCMC realizations per chain, discarding the first 10,000 iterations as burn-in. Apparent convergence of diagnostic parameters (i.e., α, F, divergence among populations D_i,j, and the likelihood estimate) was observed within the first 10,000 iterations. We used an admixture model, including location information 62, with allele frequencies correlated among populations 63. We assumed population-specific F_ST values, and updated allele frequencies by using baseline individuals only, thus treating Patagonian samples as having unknown origins. Multiple MCMC chains (10) were constructed for each value of K (number of ancestral clusters). We modeled values of K from two to nine for the North American populations. The most appropriate K value was selected based on likelihood and ∆K 34,64, as well as optimal discrimination of putative North American source populations. STRUCTURE output was processed with the STRUCTURE HARVESTER computer program 35, and plotted in the R computing environment version 3.3.1 65.
[...] of Patagonian samples to these clusters (i.e., inferred ancestry), showed a decline in lineage diversity from north to south, consistent with CML mixture analysis. The standard ∆K plot and STRUCTURE plots are available in Supplementary Fig. S3 and Fig. S4.
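The ∆K criterion mentioned above for choosing K can be sketched as follows. This is one common formulation, based on the mean log-likelihood across replicate chains at each K, and is not necessarily identical to the computation performed by STRUCTURE HARVESTER. The function name `delta_k` and the log-likelihood values are hypothetical.

```python
from statistics import mean, stdev

def delta_k(log_likelihoods):
    """Evanno-style delta-K for each interior value of K.

    log_likelihoods: {K: [ln P(X|K) for each replicate chain]}
    delta_K(K) = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K))
    Undefined if the replicate chains at K have identical likelihoods (sd = 0).
    """
    means = {k: mean(v) for k, v in log_likelihoods.items()}
    result = {}
    for k in sorted(log_likelihoods):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / stdev(log_likelihoods[k])
    return result

# Hypothetical log-likelihoods from two replicate chains per K.
runs = {
    2: [-100.0, -102.0],
    3: [-80.0, -82.0],
    4: [-78.0, -80.0],
    5: [-77.0, -79.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)  # K with the largest delta-K
print(dk, best_k)
```

∆K is only defined for interior values of K, so with K modeled from two to nine it could be evaluated for K = 3 to 8.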
Lagos, but we found no further records of these releases.
16 Primarily marine aquaculture concessions in the Lake District region.
17 Rough estimate of number of sub-adult Chinook salmon escapees (see main text).
18 Mostly 1+ year class and older since most escapes were from marine net-pens 42.
19
Notes and references: The actual number of individuals released may be less than the figure reported due to mortality during transport and handling; pre-release mortality was accounted for whenever possible. Approximate latitude is given at the river mouth. ? = unreported, likely stock origin, or lack of adults return
Notes: Values represent average percent genetic contribution; in brackets, frequency of individual assignments to baseline populations, as inferred from individual's highest assignment probability. Identifiers (ID) correspond to those in Figure 1 (main article).