Genotyping-in-Thousands by sequencing of archival fish scales reveals maintenance of genetic variation following a severe demographic contraction in kokanee salmon

Historical DNA analysis of archival samples has added new dimensions to population genetic studies, enabling spatiotemporal approaches for reconstructing population history and informing conservation management. Here we tested the efficacy of Genotyping-in-Thousands by sequencing (GT-seq) for collecting targeted single nucleotide polymorphism genotypic data from archival scale samples, and applied this approach to a study of kokanee salmon (Oncorhynchus nerka) in Kluane National Park and Reserve (KNPR; Yukon, Canada) that underwent a severe 12-year population decline followed by a rapid rebound. We genotyped archival scales sampled pre-crash and contemporary fin clips collected post-crash, revealing high coverage (> 90% average genotyping across all individuals) and low genotyping error (< 0.01% within-libraries, 0.60% among-libraries) despite the relatively poor quality of recovered DNA. We observed slight decreases in expected heterozygosity, allelic diversity, and effective population size post-crash, but none were significant, suggesting genetic diversity was retained despite the severe demographic contraction. Genotypic data also revealed the genetic distinctiveness of a now extirpated population just outside of KNPR, revealing biodiversity loss at the northern edge of the species distribution. More broadly, we demonstrated GT-seq as a valuable tool for collecting genome-wide data from archival samples to address basic questions in ecology and evolution, and inform applied research in wildlife conservation and fisheries management.

Historical DNA analysis of archival samples has added new dimensions to population genetic studies, expanding our ability to reconstruct patterns and processes of evolution across time and space 1 . Generally defined as preserved hard or soft tissues collected within the past 200 years, data from archival samples have enabled comparisons of genetic diversity and changes in effective population size between past and present populations 1,2 . Furthermore these data can be used to include extirpated populations or extinct species within phylogenetic reconstructions [3][4][5] .
Calcified material from fish, such as scales, otoliths, and various types of bone, have been collected over the past century and effectively used to investigate individual growth history 6,7 and response to climate-, population-, and fishing-related pressures [8][9][10][11] . In addition, microchemical analyses of these samples have been used to infer fish habitat use 12,13 , origin 14,15 , and exposure to pollutants 16,17 , as well as to reconstruct diet and trophic structure [18][19][20][21][22] . Such material also provides opportunities for recovering historical DNA. For example, mitochondrial DNA obtained from archival scales has been used to determine the genetic effects of historical stocking of Atlantic salmon (Salmo salar) in northern Spanish rivers 23 , while microsatellite data from archival and contemporary pectoral fin samples allowed investigation of the temporal genetic consequences of river fragmentation for lake sturgeon (Acipenser fulvescens) in Ontario, Canada 24 . More recently, studies have employed single nucleotide polymorphism (SNP) genotyping of archival samples to examine temporal stability of Atlantic cod www.nature.com/scientificreports/ event 38 . Population bottlenecks, in general, can lead to decreased genetic diversity and increased susceptibility to extirpation 39,40 . Genotypic data from archival samples collected pre-crash would allow for the direct investigation of the genetic consequences of population decline in this system. Here, we employed GT-seq to collect targeted SNP genotypic data from population-level, archival scale samples in Sockeye, Kathleen, and Frederick Lakes collected pre-crash between 1973-1981 and analyzed them relative to data from the contemporary Sockeye Lake post-crash population as well as from a current hatchery stock historically sourced in KNPR. We used the resulting genotypic data to: (1) assess the effectiveness of GT-seq for genotyping archival scales; (2) quantify pre-and post-crash genetic diversity in KNPR; and (3) investigate spatial and temporal population structure in KNPR and Frederick Lake.

Methods
Sample collection and DNA extraction. Archival kokanee scale samples from Sockeye Lake were collected July 26-28, 1973 (n = 50) and July 15-August 26, 1975 (n = 13) as a part of a limnological survey of KNPR 41 . Samples from Kathleen Lake were also collected July 13-22, 1973 (n = 22) during the same survey, and during subsequent creel surveys conducted May 26-June 22, 1980 (n = 18), and June 1 -July 23, 1981 (n = 47) 42 . Samples from spawning kokanee in Frederick Lake were collected by angling July 30-31, 1981 (n = 12) 42 . All samples were dried and stored in envelopes. Although carefully labelled and stored, these samples were largely forgotten until unearthed at the KNPR Warden Office, Yukon, Canada, during an office move in 2012 (Wong, pers. comm.). Whole genomic DNA from dry scale samples was extracted following the protocol in 43 within a dedicated historical DNA laboratory at the University of British Columbia Okanagan.
Previously extracted genomic DNA 38 from Sockeye Lake and Sockeye Creek kokanee (total n = 46) originally sampled August 25-30, 2019 was used to represent the post-crash population. Also included was previously extracted kokanee DNA from the Whitehorse Rapids Fish Hatchery (n = 29) sampled August 31-September 18, 2019 38 ; this population was originally established from eggs and milt collected in Sockeye Creek in 1991-1994, 1999 and 2000 (Wong, pers. comm., LaRocque, pers. comm.). Genomic DNA from two archival and two contemporary samples were subject to Genomic DNA ScreenTape® analysis on an Agilent Tapestation 4150 to assess DNA quality and quantity.
GT-seq library preparation and genotyping. GT-seq libraries were constructed following 33 as modified in 36 . Within library (n = 7 within plates, n = 4 between plates) and between library (n = 3) duplicates were included to allow for estimation of genotyping error rates. Each individual was prepared in two panels using separate sets of previously designed primer pools targeting ~ 100 base pair fragments, one including 288 SNPs 44 and the other containing 342 SNPs 45 . PCR1 products were diluted 1:20 before use in PCR2. PCR2 products were quantified using PicoGreen™ (Molecular Probes, Inc.), normalized manually to the concentration of the sample with the lowest concentration, and pooled. Pooled samples were purified using MinElute PCR Purification columns (Qiagen) and eluted into 24 μL nuclease-free water. Libraries were sequenced using a Mid Output Reagent Kit (300 cycles) on an Illumina MiniSeq within the Ecological and Conservation Genomics Laboratory at the University of British Columbia Okanagan.
Raw sequence data were genotyped using the GT-seq pipeline (https:// github. com/ GT-seq/ GT-seq-Pipel ine). Individuals and SNP loci with > 30% missing data were removed using PLINK v1.90b6.17 46 . Monomorphic loci www.nature.com/scientificreports/ and those that had been previously identified as outlier loci 44,45 were removed using VCFTOOLS 47 . Forty loci identified as duplicates between the two panels were also removed.

Population genetic analyses.
Observed (H o ) and expected heterozygosity (H e ), effective number of alleles (A e ), and Weir and Cockerham's θ (999 permutations) 48 were estimated as implemented in GenoDive 49 . Effective population size (N e ) was estimated using the linkage disequilibrium method as implemented in NeEstimator v.2 50 ; N e was only calculated for populations with n ≥ 40, as the linkage disequilibrium method provides the most reliable results the closer that sample sizes approximate the true N e

51
. Population structure was visualized using Principal Component Analyses (PCA) as implemented in PCADAPT 52 ; the number of principal components (PC) retained was identified using a graphical approach based on the scree plot 53 as recommended by 52 . The Bayesian clustering approach implemented in STRU CTU RE v2.3 54 was used to infer population structure. Run length was set to 500,000 MCMC (Markov chain Monte Carlo) iterations following a burn-in of 100,000 using correlated allele frequencies under a straight admixture model. STRU CTU RE was run with the number of clusters (K) varying between 1-9, with 10 replicates for each value of K. The most likely K was chosen by plotting the log probability (ln Pr(X|K)) of the data across the K values and choosing the value at which ln Pr(X|K) leveled and variance was minimized as recommended in 54 . Bar plots were generated and visualized using CLUMPAK 55 .

Results
Sample quality and GT-seq genotypic data processing. Genomic DNA from archival samples were of similar average quantity (archival = 31.5 ng/μL, contemporary = 37.5 ng/μL), but substantially poorer quality, compared to DNA from contemporary samples based on Genomic DNA ScreenTape® analysis (Fig. S1). Average genotyping of loci before filtering was 90.1% across all individuals. After filtering and the removal of outlier loci, monomorphic loci, and duplicate samples, a total of 271 loci were retained across 223 individuals, constituting our neutral dataset. Average read depth was 222.2 reads per loci. Retained individuals included those from Sockeye Lake 1973 (n = 49), Sockeye Lake 1975 (n = 10), Sockeye Lake 2019 (n = 40), Kathleen Lake 1973 (n = 21), Kathleen Lake 1980 (n = 17), Kathleen Lake 1981 (n = 47), Frederick Lake 1981 (n = 11), and the Whitehorse Rapids Fish Hatchery 2019 (n = 28). Genotyping error rates were < 0.01% within GT-seq library (both within and between plates) and 0.60% between GT-seq libraries. A total of 158 loci across 68 contemporary individuals (Sockeye Lake and Hatchery 2019) overlapped between the GT-seq and RADseq data 38 . The genotype discordance between these methods at these loci was 1.77%.
Population diversity and differentiation. Diversity metrics (H o , H e , A e ) from the archival samples were similar and stable across locations and sampling years in Sockeye (1973,1975) and Kathleen Lakes (1973,1980,1981) (Table 1). A slight decrease in H e and A e was observed in the Sockeye Lake post-crash population (Table 1) (Table 1). Differentiation was low between sampling years in Sockeye and Kathleen Lakes, both pre-crash (θ ≤ 0.009), and pre-crash compared to post-crash (θ ≤ 0.011) ( Table 2).
Frederick Lake exhibited substantially lower diversity metrics (Table 1) and significantly higher differentiation (θ > 0.450) from Sockeye and Kathleen Lakes ( Table 2). Hatchery individuals also had lower observed heterozygosity than detected in both Sockeye and Kathleen Lakes across sampling years, though still higher than Frederick Lake ( Table 1). The hatchery population was significantly differentiated from the Frederick Lake population (θ = 0.476; Table 2) and exhibited substantially higher θ values in comparisons with Sockeye and Kathleen Lakes across all sampling years (Table 2).
Two PCs were retained from the PCADAPT analysis of all populations across all sampling years; PC1 explained 12.5% of the genetic variation, while PC2 explained 3.7% (Fig. 3a). PC1 separated Frederick Lake from the rest of the individuals, while PC2 separated hatchery individuals from those sampled in Sockeye and Kathleen Lakes. When hatchery and Frederick Lake individuals were removed, there was no clear optimal number of PCs, so two were chosen for the purpose of comparison; both PCs explained very little of the genetic Table 1. Diversity estimates including heterozygosity [observed (H o ) and expected (H e )], effective number of alleles (A e ), and effective population size (N e ) with 95% confidence intervals (CIs) for individuals in Sockeye Lake, Kathleen Lake, Frederick Lake, and the Whitehorse Rapids Fish Hatchery (neutral, 271 SNPs). Note: N e was not determined when sample size (n) was less than 40.  (Fig. 3b). Bayesian clustering analysis of the wild populations implemented in STRU CTU RE found evidence for K = 2 (Table S1), separating the Frederick Lake population from those sampled in Sockeye and Kathleen Lakes, regardless of sampling year (Fig. 4); additional values of K did not reveal further structure within or among Sockeye and Kathleen Lakes, or among sampling years (Fig. 4).

Discussion
Here we demonstrated GT-seq to be an effective approach for genotyping archival scale samples, exhibiting high coverage (> 90% average genotyping of loci before filtering across all individuals) and low genotyping error (< 0.01% within libraries, 0.60% between libraries) despite the substantially poorer quality of recovered DNA (Fig. S1). These data allowed for the spatial and temporal comparison of diversity metrics and population structure of kokanee in KNPR and surrounding areas, while showing the value of pairing GT-seq and archival DNA analysis more broadly for informing conservation strategies. Spatial/temporal population structure and comparison of diversity metrics in KNPR. Overall, we found no evidence of population structure spatially or temporally among the connected lakes where kokanee reside in KNPR (Kathleen and Sockeye), as revealed by extremely low pairwise θ values ( Table 2) and lack of separation in the PCA or Bayesian clustering analysis (Figs. 3 and 4). The absence of differentiation between Kathleen and Sockeye Lakes is consistent with observations that all kokanee in KNPR, regardless of where they live during other life stages, eventually spawn in the same locations in Sockeye Lake and Creek.
Diversity metrics (H o , H e , A e ) were similar temporally and spatially in Sockeye and Kathleen Lakes in pre-crash years, with a slight decrease in H e and A e observed in the post-crash population. There was a ~ 30% reduction in N e in the post-crash population compared to the pre-crash population in Sockeye Lake, although there was some overlap of confidence intervals (Table 1). This finding, as well as the temporal clustering and lack of differentiation between pre-and post-crash populations, suggests that a large proportion of genetic diversity was retained in KNPR kokanee despite the well-documented population crash.
Several factors may affect the genetic outcomes of a demographic contraction, including the severity and duration of the bottleneck, the latter of which can have an outsized effect. Genetic theory predicts that the bulk of genetic diversity can be retained, even during a severe bottleneck, as long as it is not prolonged [56][57][58][59] . For example, populations of white-tailed eagles (Haliaeetus albicilla) in Europe and Peregrine falcons (Falco peregrinus) in North America underwent severe bottlenecks and rapid recoveries as a result of dichlorodiphenyltrichloroethane (DDT)-containing pesticides and their subsequent bans. However, neither showed a significant loss of genetic diversity as a result of these demographic bottlenecks 60,61 . While these periods of decline lasted decades, the species' long lifespans likely helped maintain this diversity; in the case of white-tailed eagles, the population bottleneck lasted 20-30 years, which was only equivalent to ~ 2 generations 61 . As kokanee in KNPR have a documented generation time of ~ 4-5 years 42 , it is likely the demographic contraction in this system only lasted ~ 2-3 generations. Though severe, this duration of the bottleneck may not have been long enough to significantly erode genetic diversity within the population.
Extirpated Frederick Lake. Our results revealed the extirpated Frederick Lake population to be genetically distinct from kokanee in KNPR, as evidenced by high levels of differentiation (θ > 0.450; Table 2) and patterns of clustering in the PCA and STRU CTU RE analyses (Figs. 3A, 4). The two competing hypotheses of the origin of kokanee in KNPR and Frederick Lake have assumed that they originated from the same event, either by way of colonization from the (1) Alsek River or (2) Tatshenshini River to Klukshu Lake (Łu Ghą Män) 42 . Yet, the high differentiation between these populations may suggest that they originated from separate ancestral populations over multiple events. Further research that includes sampling of anadromous sockeye salmon populations in the Alsek River and Klushu Lake would be required to test the single/multiple origin hypotheses. Regardless, Table 2. Differentiation (Weir and Cockerham's θ) 48 (above the diagonal) and p-values (below the diagonal, bolded values ≤ 0.05) for individuals in Sockeye Lake, Kathleen Lake, Frederick Lake, and the Whitehorse Rapids Fish Hatchery (neutral, 271 SNPs). www.nature.com/scientificreports/ the extirpation of Frederick Lake kokanee represents a loss of biodiversity, which may be even more pronounced given its northerly distribution, as small, isolated populations at the periphery of a species range, may harbor unique traits not found in the core of a species range 62,63 .
Though it is difficult to conclude what led to the extirpation of the Frederick Lake population in the late 1980's 37 , the population exhibited much lower diversity metrics in terms of H o , H e , and A e than the population in KNPR, as well as the Whitehorse Rapids Fish Hatchery population (Table 1). There is general agreement that genetic diversity is vital to a population's viability 39,64 . As a lack of diversity has been associated with the extinction or extirpation of many species or populations 65,66 , the low levels of genetic diversity seen in Frederick Lake kokanee may have contributed to the loss of this population; however, additional studies would be required to test this hypothesis.

Value of archival samples in conservation.
Our study highlights the valuable insights archival samples can provide to fisheries conservation and management. For example, previous RADseq data showed the contemporary population in KNPR displayed a heterozygous excess 38 , which is one potential indicator of a genetic bottleneck 67 . Yet, genotypic data from archival samples in KNPR revealed that this heterozygous excess was www.nature.com/scientificreports/ likely a feature of the population, rather than a result of a recent bottleneck. If only contemporary data were available, conservation strategies, such as initiatives to propagate diversity through hatchery supplementation, may have been implemented, possibly in vain or with negative consequences 38 . That said, we did observe a slight decrease in some diversity metrics over time in KNPR, such as H e and A e , as well as a small, but not significant, reduction in N e post-crash; these findings warrant continued monitoring to document trends, examine outcomes, and inform conservation planning moving forward. While potential archival samples such as fish scales and otoliths are generally collected during surveys or studies, this material is sometimes only used to meet short-term objectives, with long-term preservation and storage given little priority. Although such samples are well preserved and recorded at some institutions, in others, standard operating procedures include retaining only the most recent five years of samples due to insufficient storage space 68 . Even where sufficient storage space exists, maintenance of archival material may be inconsistent or undervalued, leading to variable quality of samples 69 . In our case, the last remaining physical evidence of an assumed extirpated population was at risk of being discarded during an office move 40 years after collection. As the use of genomics in fisheries conservation and management continues to progress 70,71 , proper long-term storage and maintenance of archival samples, potentially through centralized storage facilities and collection management systems 72,73 , will be key in preserving their value for future studies.
Utility of GT-seq in studies using archival samples. Our study shows how GT-seq can be used to effectively and efficiently genotype degraded archival samples. Though diversity metrics in the historical GT-seq dataset were slightly lower than what was found using RADseq in the same contemporary individuals, similar trends, such as heterozygote excess, were observed in both 38 . These lower diversity metrics were most likely due to the inclusion of unique Frederick Lake individuals while calling SNPs rather than GT-seq itself. As a case in point, when Frederick Lake individuals were removed from the dataset, diversity metrics between the same individuals were similar between the RADseq and GT-seq datasets (Table 3).
While a range of methods have been used for genotyping archival DNA [25][26][27]31,32 , our study is the first to our knowledge to employ GT-seq, which has many advantages compared to other approaches. First, GT-seq only requires amplicon lengths of ~ 100 base pairs for effective genotyping 33 , positioning it as a suitable approach to apply to highly fragmented DNA that can be commonly obtained from archival samples of varying ages.
Second, GT-seq has a relatively low genotyping error rate 33 . High genotyping error is often a serious problem for historical samples, mainly due to nucleotide misincorporation during the amplification of archival DNA 1,29 . Estimates of genotyping error using more traditional methods, such as microsatellite fragment analysis, can range as high as 17-21% 1,74,75 . In our study there was a < 0.01% genotyping error rates within libraries and 0.60%  (Table S1), separating the Frederick Lake population from those sampled in Sockeye and Kathleen Lakes. No structure was found temporally or spatially in or between Sockeye and Kathleen Lakes. www.nature.com/scientificreports/ error rate between libraries, while average genotyping of loci before filtering was 90.1% across all individuals, highlighting the ability of GT-seq to obtain high quality data with minimal error. Third, GT-seq makes use of a straightforward protocol that is relatively cost-effective, allowing for the simultaneous genotyping of hundreds or thousands of individuals at hundreds of loci. Though panel design requires upfront development and investment, once optimized, multi-locus genotypes can be obtained for ~ $6.00 (USD) per sample 34 . When compared to RADseq, which costs ~ $30.00 per sample, or a targeted capture sequencing approach such as Rapture, which costs ~ $15.00 per sample, GT-seq is extremely cost-effective when processing hundreds or thousands of samples 34 . Moreover, GT-seq does not require highly specialized equipment for library preparation and uses simple scripts for bioinformatic processing of the recovered sequence data 33,34 . This reduces barriers to entry for research groups without advanced instrumentation or strong computational backgrounds to use this approach, while also standardizing methods across labs 34 .
Lastly, GT-seq can provide connectible data to monitor populations over time. Once designed, GT-seq panels can be employed by any lab or facility to directly target specific SNPs at specific locations in the genome. This is unlike more traditional markers, such as microsatellites, which are indirectly assayed using fragment analysis 76 . As such, GT-seq panels can be used not only as a means to compare genetic diversity and detect changes in effective population size and structure between populations past and present, but also to continually monitor these parameters into the future.
Taken together, these advantages make GT-seq a valuable tool for implementing genomics into conservation 34 . This approach has been previously shown to be effective for low quality DNA from minimally and non-invasively collected samples 36 . Here, we have demonstrated the utility of GT-seq for genotyping archival samples, further extending the temporal and spatial resolution this method can bring for addressing basic questions in ecology and evolution, as well as for informing applied research in wildlife conservation and fisheries management.