An ancient selective sweep linked to reproductive life history evolution in sockeye salmon

Study of parallel (or convergent) phenotypic evolution can provide important insights into processes driving sympatric, ecologically-mediated divergence and speciation, as ecotype pairs may provide a biological replicate of the underlying signals and mechanisms. Here, we provide evidence for a selective sweep creating an island of divergence associated with reproductive behavior in sockeye salmon (Oncorhynchus nerka), identifying a series of linked single nucleotide polymorphisms across a ~22,733 basepair region spanning the leucine-rich repeat-containing protein 9 gene exhibiting signatures of divergent selection associated with stream- and shore-spawning in both anadromous and resident forms across their pan-Pacific distribution. This divergence likely occurred ~3.8 Mya (95% HPD = 2.1–6.03 Mya), after sockeye separated from pink (O. gorbuscha) and chum (O. keta) salmon, but prior to the Pleistocene glaciations. Our results suggest recurrent evolution of reproductive ecotypes across the native range of O. nerka is at least partially associated with divergent selection of pre-existing genetic variation within or linked to this region. As sockeye salmon are unique among Pacific salmonids in their flexibility to spawn in lake-shore benthic environments, this region provides great promise for continued investigation of the genomic basis of O. nerka life history evolution, and, more broadly, for increasing our understanding of the heritable basis of adaptation of complex traits in novel environments.


Results
Range-wide ecotype genotyping. Genotypic data collected using a newly developed TaqMan ™ assay (One_LRRC9_68810 ; Table S1) targeting this SNP showed directional divergence in both anadromous sockeye salmon and resident kokanee across the natural range of O. nerka in Russia, Alaska and Canada (Table 1 and Figs 1 and 2). In general, the 'G' allele was most prevalent in shore-spawning sockeye salmon and kokanee populations, while the 'T' allele dominated in stream-spawning populations (Table 1 and Fig. 2). Notable exceptions were two shore-spawning sockeye salmon sites along island beaches in Illiamna Lake, Alaska and one shore-spawning kokanee site in the West Arm of Kootenay Lake, BC, all of which exhibited a higher frequency of the 'T' allele than observed in other shore-spawning populations. In addition, kokanee sampled while spawning near the mouth of Drew Creek in Tchesinkut Lake were fixed for the 'G' allele, identical to all sampled shore-spawning individuals from this system.
Five within-lake reproductive ecotype pairs spanning multiple catchments had >99% correct assignment to reproductive ecotype under a straight Mendelian assignment rule (GG = shore-spawning, GT or TT = stream-spawning), including Okanagan Lake kokanee shore-and stream-spawners, Christina Lake kokanee shore-and stream-spawners, Anderson Lake black kokanee/Portage Creek sockeye salmon, Seton Lake black kokanee/Portage Creek sockeye salmon, and Redfish Lake shore-spawning sockeye salmon/Fishhook Creek stream-spawning kokanee. In these systems, GG was recorded only once across the 238 stream-spawning individuals genotyped, and GT was only recorded four times within the 336 shore-spawning individuals (TT was never observed among shore-spawners) ( Fig. 2 and Table S2). For the other two clearly differentiated co-occurring kokanee ecotype-pairs in Wood and Kalamalka Lakes, assignment accuracy using this rule was also high at >90%.
Flanking region sequencing. Sanger sequencing of 744 bp flanking this SNP in Okanagan Lake kokanee (Columbia River drainage), Anderson Lake kokanee and Portage Creek sockeye salmon (Fraser River drainage), revealed three additional SNPs in full linkage, suggesting this region was ancestrally inherited for both alleles in these populations spanning different river catchments. To more broadly characterize this region, we successfully sequenced 22,773 bp flanking the One_ LRRC9_68810 SNP from eight individuals each carrying homozygous genotypes of Okanagan Lake shore-and stream-spawning kokanee, respectively, including the entire LRRC9 gene (Genbank accession KY681681-KY681682). These two sequences were 4.6% divergent from each other, with 181 fixed differences including 23 multiple bp indels up to 308 bp long between the 'shore-spawning' and 'stream-spawning' alleles, suggesting the genomic region around LRRC9 has undergone a significant selective sweep (Fig. 3). Notably, the level of divergence between alleles increased markedly downstream (towards the 3′ end of the LRRC9 gene) (Fig. 3). Variation within each ecotype was far lower, with ten variable sites within the shore-spawning population, and nine variable sites within the stream-spawning population. While five SNPs were identified within the coding regions of the LRRC9 gene, we found no non-synonymous mutations between the 'shore-spawning' and 'stream-spawning' alleles.

Discussion
Our results show the recurrent evolution of reproductive ecotypes across the native range of O. nerka is at least partially associated with divergent selection of pre-existing genetic variation within or linked to the region surrounding the LRRC9 gene. Although no non-synonymous changes were detected between LRRC9 'shore-spawning' and 'stream-spawning' alleles, this does not preclude changes in splice sites or regulatory regions of the gene. Differentiation at this locus between shore-and stream-spawning populations was extremely high even when there was minimal neutral population differentiation. For example, pairwise F ST for One_LRRC9_68810 was 0.92   Fig. 1 and Table 1 except for the island beach-spawning sockeye salmon samples (Fuel Dump Island and Woody Island). The Skeena plot includes all localities within the catchment indicated in Fig. 1 and Table 1 except for the Drew Creek samples from Tchesinkut Lake (see Discussion in main text for more details). Genotype frequencies from the Vancouver Island catchment are also not shown, as we only have three samples from a single site (Cowichan Lake). All three of these shore-spawning kokanee possessed the predicted GG genotype. Genotype frequencies for all sample sites are given in Table S3. The map was generated in R 67 using the ggmap package 68 with map tiles by Stamen Design (www.stamen.com), under CC BY 3.0 (creativecommons.org/licenses/by/3.0/) and data by OpenStreetMap (www.openstreetmap.org/), under ODbL (www.openstreetmap.org/copyright).

Figure 3.
Sequence similarity between the shore-and stream-spawning sequences for the LRRC9 genomic region of Oncorhynchus nerka from Okanagan Lake kokanee. Exons for the LRRC9 gene shown at top as arrows, introns are shown as the intervening line, along with the One_LRRC9_68810 SNP location. SNPs are shown as blue vertical lines and indels shown as gaps in the sequence, with identical sequence in orange. "Shore-allele" sequence: top, "stream-allele" sequence: bottom. Similarity is defined as the percentage of identical sites within a 500 bp window centred on this position. among reproductive ecotypes of Okanagan Lake kokanee despite the population genomic F ST of 0.008 based on 6,234 neutral SNPs 35 .
The few failures of this SNP to discriminate between ecotypes are likely explained by a combination of uncertainties in sampling, incomplete ecotype differentiation, and/or recombination events separating the SNP from the related genomic changes. For instance, in Tchesinkut Lake, the vast majority of individuals spawn on the shores of an island where the "shore-spawners" were originally sampled in 2010 42 . The "stream-spawners" were sampled at an outflow and Drew Creek, however, the latter spawning activity only occurred once Ministry personnel cleared the mouth allowing kokanee access 42 . These observations, in tandem with the complete lack of divergence at 6,234 neutral SNPs, strongly suggest there may not be distinct ecotypes in Tchesinkut lake 35 . The other exceptions where stream-spawning populations had elevated levels of the G allele (Sinmax Creek kokanee, Tintina and Hanna Creek sockeye salmon) are immediately adjacent to shore-spawning sites in gravel near the creek mouths 43 . Finally, we obtained samples from three sockeye salmon beach-spawning sites in Illiamna Lake in Alaska, two on island beaches (Fuel Dump Island; Woody Island) and one mainland beach (Knutson Bay). While genotype frequencies within the mainland beach-spawning site were consistent with range-wide patterns ( Fig. 2 and Table S2), the island-beach spawning sites showed relatively uniform genotype frequencies (Table S2). Specific beaches on these islands lack the upwelling groundwater typical of most beach-spawning salmonids 44 , and consequently, may not contain the site-specific olfactory signals to guide returning adults that would be necessary for promoting ecotype differentiation 45 .
In at least five lakes spanning the Columbia and Fraser River drainages (Anderson, Seton, Okanagan, Christina, Redfish), assignment accuracy to reproductive ecotype was >99%. Given our results, it appears that this SNP could be associated with a Mendelian trait, potentially creating a switch to an alternative spawning behaviour. Understanding how such a genetic switch might work in terms of influencing behaviour awaits future breeding and physiological experiments. If this SNP is linked to a gene that causes a change in spawning habitat preference, our data are consistent with the hypothesis that the T (stream) allele is associated with a preference or ability for stream-spawning, while the G (shore) allele could be a loss of function, leading to a lack of spawning habitat preference, enabling spawning anywhere with suitable substrate.
This hypothesis fits well with several other lines of evidence from shore-and stream-spawning populations of O. nerka. For example, in Lake Washington, Washington, USA, a reintroduced population of anadromous sockeye salmon diverged into reproductively isolated shore-and stream-spawning populations within 13 generations 32, 33 . In this system, both populations were derived from the same hatchery stock, and the population solely exhibited stream-spawning behaviour initially; shore-spawning was first recorded over 17 years after their introduction. The hatchery stock was derived from both stream-spawning and shore-spawning individuals 32 . This rapid heritable divergence into two spawning ecotypes matches the predictions for a previously existing polymorphism in a gene or genes that influence spawning behaviour, with shore-spawning recessive to stream-spawning. Similarly, in a study of anadromous sockeye salmon in Little Togiak Lake, Alaska where individuals genetically identified as coming from one population spawned in the alternate habitat, straying was rare and asymmetrical, primarily with individuals from the stream-spawning population using shore habitats 46 . These findings are consistent with the predicted pattern of a recessive allele promoting shore-spawning behaviour.
Sockeye salmon are known to have survived the late Wisconsin glaciation in several refugia, including areas south of the ice sheets such as the Columbia River, arid northern areas that remained largely free of ice in Beringia (region connecting Kamchatka and much of western Alaska), and in small inland mountain refugia where glaciers impeded access to the sea 29 . Sockeye salmon and kokanee descended from all of these refugia exhibit variability at this locus and evidence for divergent selection between ecotypes; therefore divergence between the two alleles must significantly predate the last glacial maximum. We estimate that the 'shore-spawning' and 'stream-spawning' alleles most likely diverged from each other around 3.8 Mya in the Pliocene, after sockeye separated from pink (O. gorbuscha) and chum (O. keta) salmon, but prior to the Pleistocene glaciations. While there are many simplifying assumptions for estimating the time to the most recent common ancestor for this region, particularly as it has likely been under selection, our divergence time estimate does highlight the great age of these alleles, both of which have been maintained in populations across the range of the species.
This example of recurrent selection of pre-existing variants in the population as a source for ecologically-driven sympatric divergence closely resembles the parallel evolution of freshwater forms of stickleback 12 . In this system, anadromous populations carry the anciently derived 'freshwater' armor genes at low frequency that then repeatedly went to fixation when stickleback colonized similar freshwater environments. Here, while the 'G' (shore) allele was uniformly prevalent in shore-spawning populations of both kokanee and sockeye salmon, it was never absent from stream-spawning populations. This suggests that (most) stream-spawning populations carry the 'G' allele, which is then strongly selected for as shore-spawning populations form.
Identifying the gene(s) under divergent selection. The size of the genomic region associated with a selective sweep of a locus under selection is determined by the strength of selection, local rate of recombination, and time since the beneficial mutation arose 21,47,48 . Because of these factors, genomic scans for signatures of selection may highlight regions spanning several megabase pairs (Mbp); therefore determining the gene(s) under selection in such cases remains challenging 49 . In a recent review of selective sweeps in cattle breeds, the size of genomic regions showing signals of selection ranged from 8.2 to 948 kilobase pairs (kbp), with a median of 78.7 kbp 50 . In dog breeds, which have a recent history of very strong selective pressures and inbreeding, selective sweeps may be up to 10 Mbp 51 . The most phylogenetically similar species to sockeye salmon with dense SNP panels is the Atlantic salmon (Salmo salar). In a recent study of this species, SNPs significantly associated with age at maturity were located in a selective sweep region covering ~370 kbp 52 .
There are 17 other genes within 250 kbp on either side of the LRRC9 gene in the S. salar genome (Table S4), and it remains possible that the genetic variation under divergent selection is outside this area, or that there are O. nerka structural differences compared to Atlantic salmon where other genes could be closer. While we highlight an ancient selective sweep between alleles closely linked to, and potentially underlying spawning behaviour in this genomic region, we have not yet identified the specific gene(s) linked to this divergent selection. Interestingly, this same genomic region of chromosome 9 in Atlantic salmon has been identified as an island of divergence between S. salar genetic clusters that differed in the length of sea migration and age at maturity 53,54 . Of particular note, Barson et al. 53 found an island of divergence spanning 250 kb centered in this region that was strongly associated with age at maturity. They hypothesized that variation in SIX6, a transcriptional regulator gene and distal forebrain enhancer 55 that also regulates eye development across multiple taxa 56 , age at maturity in humans 57 , neuro-endocrine and gonad development 58 , might be the cause of this divergence. This gene is 142 kb away from the LRRC9 in Atlantic salmon. Whatever the underlying genetic mechanism, it is noteworthy that this region is associated with local adaptation and population divergence in multiple salmonids. A denser SNP map for sockeye salmon, and/or direct sequencing for several 100 kbp across multiple populations and ecotypes will be required to ascertain the underlying divergently selected gene(s). As it appears that the level of divergence between the 'shore-spawning' and 'stream-spawning' alleles increases towards the 3′ end (and the SIX6 gene in S. salar), this may be the direction to initially explore for the specific target(s) of divergent selection. As sockeye salmon are unique among the Pacific salmonids in their flexibility to spawn in lake-shore benthic environments 31 , this region provides great promise for future investigations of the genomic basis of O. nerka life history evolution, and more broadly, for increasing our understanding of the heritable basis of adaptation of complex traits in novel environments. From an applied perspective, this highly informative SNP has immediate utility for informing fisheries management throughout British Columbia 59 and likely across the entire range.

Methods
Sampling. We used previously extracted DNA from 1519 anadromous sockeye salmon and resident kokanee from 47 shore-and stream-spawning populations in Russia, Alaska and Canada (Table 1). All original sampling and experimental procedures were conducted in accordance with institutional, national and international guidelines and regulations as cited within the original published work (Table 1).

SNP genotyping.
We designed a new TaqMan ™ assay (One_68810_LRRC9 ; Table 1)  Flanking region sequencing and comparison. We used BLAST-n to locate and align the 100 bp RAD tag 68810 with the Atlantic salmon (Salmo salar) ICSASG_V2 (ssa09: 24,748,525-24,748,624) and rainbow trout (Oncorhynchus myskiss) 60 genomes (chrUn_29: 1,729,057-1,729,247). We then aligned ~60 kbp of the flanking genomic regions of these two species centered on the 68810 SNP using a global alignment with open ends, assuming 70% similarity as implemented in Geneious 9.0.5 61 .
Using this alignment (O. mykiss chromosome 29, S. salar chromosome 9), we designed PCR primers (Table S1) in PRIMER3 62 to amplify a ~750 bp fragment immediately surrounding the SNP for known homozygotes for four individuals from each of: Anderson Lake black kokanee, Portage Creek stream-spawning sockeye salmon, Okanagan Lake stream-spawning kokanee, and Okanagan Lake shore-spawning kokanee. All PCRs were carried out on an ABI Veriti thermal cycler in 25  cycles of 94° (20 seconds), 57° (30 seconds), 72° (45 seconds), and a final extension of 72 °C (7 minutes). All PCR products were purified by ExoSAP-IT (USB Products, Santa Clara, CA, USA) and Sanger sequenced using an ABI 3130XL Genetic Analyzer (Applied Biosystems).
The resulting sequences of the immediate flanking regions, along with the S. salar and O. mykiss alignment, were used to design two sets of primers in PRIMER3 62 for long-range PCRs in each direction targeting two overlapping fragments each of ~11 kbp (total contiguous sequence length of ~21 kbp). Long range PCRs were conducted using the LongAmp ® Taq PCR kit (NEB) for eight individuals of each homozygous genotype at the 68810 SNP from Okanagan Lake shore-and stream-spawning kokanee, respectively. Each long-range PCR was carried out in 25 μl reactions containing: ~100 ng of template DNA, 60 mM Tris-SO 4 , 20 mM (NH 4 ) 2 SO 4 , 2 mM MgSO 4 , 3% Glycerol, 0.06% IGEPAL ® CA-630, 0.05% Tween ® 20, 300 µM dNTPs, 0.5 μM of each primer, and 5 U LongAmp ® Taq polymerase. Cycling conditions were as follows: 94° (30 seconds), 30 cycles of 94° (20 seconds), 58° (30 seconds), 65° (12 minutes), and a final extension of 65 °C (10 minutes). These PCR products were purified using a Qiagen MinElute gel extraction kit and individuals for each PCR product were pooled. Sequencing libraries were constructed by shearing the PCR products to ~400 bp and using the Illumina TruSeq DNA kit. Libraries were subsequently sequenced using the Illumina MiSeq PE250 platform. Library construction and sequencing were performed at the McGill University and Génome Québec Innovation Centre, Montréal, Canada.
Obtained sequence reads were assembled using the Geneious 9.0.5 61 de novo assembler, with medium sensitivity, five iterations and a maximum indel size of 1000 bp. The libraries for each ecotype were subsequently combined using a pairwise local alignment and variants detected using a minimum minor allele frequency of 0.25.
The concatenated sequences of each sequence ('shore' and 'stream') were then aligned in Geneious 9.0.5 61 using a global alignment with free end gaps, assuming a 93% similarity. The percentage differentiation between these aligned sequences was then calculated, and the number of divergent fixed SNPs and indels counted. We used plotcon in EMBOSS 63 with a window size of 500 bp to display the pattern of differentiation between the sequences. We also translated the resulting LRRC9 DNA sequence using the S. salar CDS (LOC106610979) as a guide to identify non-synonymous changes in the coding region. To do this, the five published isoforms of the S. salar gene (XM_014210737.1 -XM_014210742.1) were aligned with the two O. nerka sequences, and these aligned exons were then translated and aligned with each other to detect any non-synonymous changes between the shore-and stream-alleles -all performed in Geneious 9.0.5 61 .
Divergence timing. The combined sequences for each ecotype were aligned with each other, and with O. mykiss using the Geneious 9.0.5 local alignment (Smith & Waterman) tool assuming 70% similarity. This alignment was used to estimate the time of divergence between the 'shore-spawning allele' and the 'stream-spawning allele' conducted in BEAST 64 . Analyses were implemented using an HKY substitution model, an estimated divergence time between O. mykiss and O. nerka of 11.4 Mya (95%CI = 9.8-13 Mya) 41 , a normal distribution prior, and a relaxed lognormal clock with a Yule Birth-Death tree prior. Three independent runs consisting of 100 million generations were conducted, with a 25% burn-in. Outputs were assessed in Tracer 65 and tree files combined in LogCombiner. Resulting tree files were annotated in TreeAnnotator and visualized in FigTree 66 .