Introduction

Modern genetics remains primarily focused on single-nucleotide polymorphism (SNP) analyses, with a growing recognition of the importance of larger structural variants (SVs) including inversions, insertions, deletions and copy number variations (defined here as variants ≥100 bp), among others1. SVs affect a larger proportion of bases in human genomes than SNPs4, are not always reliably tagged by SNPs5, more frequently have regulatory impacts6 and have been shown to alter the structure, presence, number, dosage and regulation of many genes1. Nonetheless, SVs remain challenging to accurately type using whole-genome sequence data2,3, limiting our understanding of their biological roles and exploitation as genetic markers. Consequently, there is a need for reliable SV detection approaches to fully exploit the fast-accumulating genome sequencing datasets in both model and non-model species, allowing for more complete genetics investigations. Many tools exist for SV discovery using short-read sequencing data, but all suffer from high false discovery rates (FDRs) (10–89%)2,3,7. This poses a challenge for de novo SV detection in previously unstudied species lacking ‘gold-standard’ reference SVs to help distinguish true from false calls. Most studies rely on combining an ensemble of signals from different SV detection methods, although this strategy does not reliably improve performance and can in some cases aggravate false discovery3. Researchers therefore often apply independent experimental8,9 or visualization methods10 to validate a subset of SV calls. Overall, there remains an unsatisfactory lack of consensus on how to validate the quality of de novo SV datasets in most species3.

Salmonids have the highest combined economic, ecological and scientific importance among all fish lineages, and have consequently been subject to hundreds of genetics studies employing SNPs and other molecular markers11,12. In common with most non-model fish species, the SV landscape remains extremely poorly characterized in salmonids, apart from recent work informed by SNPs that revealed multi-megabase inversions in rainbow trout (Oncorhynchus mykiss Walbaum) influencing migration13,14, and a chromosomal fusion under selection in Atlantic salmon15, consistent with roles in adaptation. Salmonids offer a unique system to characterize SVs due to an ancestral salmonid-specific autotetraploidization (i.e. whole-genome duplication, WGD) event (Ss4R), which occurred 80–100 Mya, following an earlier WGD (300–350 Mya) in the teleost common ancestor16,17,18. WGD events may influence selection on SV retention due to the functional redundancy linked to mass retention of duplicated genes, though this idea is yet to be tested. In addition, salmonids have been farmed in aquaculture for a small number (<15) of generations11, and while the genetic architecture of such recent domestication has been investigated using SNPs19, the role played by SVs remains unexplored. Finally, the application of SVs in selective breeding of salmonids and other commercial fishes remains untested. Clearly, the lack of SV data and analysis frameworks in salmonids represents an important knowledge gap.

Here we provide an end-to-end workflow to detect, genotype, validate and annotate SVs using short-read sequencing, removing false positives through efficient manual curation10, allowing reliable SV discovery in non-model species. Using this approach, we report a detailed investigation of the genomic landscape of SVs in the iconic Atlantic salmon, inclusive of 492 genomes representing wild and farmed genetic diversity, and populations of both European and North American descent.

Results

Accurate SV discovery in Atlantic salmon

We developed a workflow for SV discovery using paired-end short-read sequencing data aligned to the unmasked ICSASG_V2 reference assembly17, which can be run in Snakemake20 (Supplementary Fig. 1). The probabilistic tool Lumpy21 was used for SV detection, which simultaneously draws on multiple evidence and SVtyper22 was used for genotyping. As de novo SV detection using short-read data is prone to false positives3,21,23, we added an optional step to avoid SV calling in complex regions of the genome where false-positive rates were predicted to be particularly high (proven below). This included regions of ≥100× coverage (>10 times higher than the global average of 8.1× coverage), shown elsewhere to be overwhelmingly false calls3, as well as gap regions in the ICSASG_V2 assembly. These complex regions were most prevalent in chromosome arms where rediploidization was delayed after Ss4R, characterized by high sequence similarity among duplicated regions17 (Supplementary Fig. 2).

Rather than using evidence from additional SV detection tools as a filter for true SV calls, a strategy shown elsewhere to be potentially unreliable3, we applied a curation approach to the entire filtered SV dataset using SV-plaudit10. Note that this was done on SV calls generated both without any filtering of complex regions, and after the filtering of complex regions, in order to test our prediction that SV calling is particularly unreliable in complex regions. SV-plaudit is a scalable framework for the rapid production of thousands of SV images via Amazon web services10 (examples: Supplementary Figs. 38). This approach allowed us to efficiently retain high-confidence SV calls, while excluding low confidence or ambiguous calls, on the basis of available visual evidence drawn from paired-end and split-read alignments, in addition to read depth10,21. The Atlantic salmon individuals (details in Supplementary Data l) produced on average 55,754 SV calls (median: 55,041, SD: 10,051) before filtering complex regions and SV-plaudit curation (Supplementary Data 2). Across all 492 individuals, 165,116 unique SVs were detected (size: 100 bp to 2 million bp) (provided in Supplementary Data 3), which included an outlier peak of deletion SVs in the 1432–1436 bp size range (Supplementary Fig. 9).

Using SV-plaudit on the full set of SV calls allowed us to retain only high-confidence calls, quantify the impact of filtering complex regions and estimate an FDR. The overall estimated FDR was 0.91 (149,491/165,116 of calls had low confidence), in line with the highest estimates in the literature2,3,7. In complex regions, the FDR was 0.992 (47,268/47,636 calls had low confidence). In the remaining chromosome-anchored assembly, the FDR was 0.85, validating the usefulness of removing complex genomic regions. Sequencing depth was not a reliable indicator of FDR (Supplementary Fig. 10). A final high-quality set of 15,483 unique SV calls (14,017 deletions, 1244 duplications, 242 inversions) and their genomic location is visualized in Fig. 1a, b. The average size for deletions was 1532 bp (100–1,946,935 bp; SD: 23,070 bp) and for duplications 8183 bp (102–80,1673 bp; SD: 25,589 bp) (Fig. 1c, d). For inversions, the average size was 121,935 bp (113–1,796,230 bp; SD: 278,698 bp) (Fig. 1e). The outlier peak at 1432–1436 bp remained in the high-confidence deletions (Fig. 1c).

Fig. 1: SV landscape in 492 Atlantic salmon genomes.
figure 1

a SV counts per one million bp window in the genome split into homology categories17 representing duplicated regions retained from the Ss4R WGD sharing ‘low’ (<90% identity), ‘elevated’ (90–95% identity) and ‘high’ (>95% identity) similarity in addition to telomere regions. Definition of box and whisker plots: the box spans the interquartile range, with the median (Q2) as a central bar, and respective upper and lower bounds representing the minimum and maximum values within the 25th percentile (Q1) and 75th percentile (Q3). The bounds of the upper and lower whisker are the largest and smallest values that lie within 1.5 times above Q3 and below Q1, respectively. Outliers out with these bounds are shown as individual points. b Locations of the same regions depicted on a Circos plot using the same colour scheme. c–e Size distributions of SVs for deletions (c), duplications (d) and inversions (e) with X-axis limited to SVs ≤2000 bp. Arrow in part c marks outlier peak in deletion calls (see Fig. 2). f Sampling locations of wild populations. gi PCA for each SV class: 14,017 deletions (g), 1244 duplications (h), 242 inversions (i) with population matched by colour to part f for wild fish, and additional symbols given for farmed fish (note: all seven individuals annotated ‘Canada Farmed' were sampled in Chile, along with 13 individuals annotated as ‘Norwegian Farmed', consistent with their respective descent from the two major Atlantic salmon lineages in North America and Europe). j NGSadmix86 analysis of 14,017 deletions with K = 2, 3 and 4. Each individual is a vertical line with colours marking genetically distinct groups. Asterisk corresponds to White sea, Baltic and landlocked populations (K = 4 plot).

To validate our SV discovery workflow we estimated the true positive rate for SV presence/absence and genotype calls using the high-confidence data retained after the SV-plaudit step. We sequenced PCR amplicons for 876 independent SV calls representing 168 unique SVs (108 deletions, 46 duplications, 15 inversions) (Supplementary Fig. 11) at ≥50× coverage on the MinION platform. Across all SV calls, the true positive rate was 0.88 for SV presence/absence and 0.81 for SV plus genotype. For deletion calls, the true positive rate was 0.93 for presence/absence (520/559 calls) and 0.85 (475/559 calls) for genotype. For duplications, the true positive rate was 0.81 for presence/absence (186/230 calls) and 0.74 (170/230 calls) for genotype. For inversion calls, the true positive rate was 0.78 for presence/absence (68/87 calls) and 0.75 (65/87 calls) for genotype. Full results are shown in Supplementary Data 4 (with examples in Supplementary Figs. 1214). In summary, SV-plaudit curation vastly reduced the FDR to maintain predominantly true SV calls (provided in Supplementary Data 5).

To further confirm data quality, we asked if the high-confidence SV genotypes capture expected population genetic structure (Fig. 1f–j). SV genotypes were used in principal component analyses (PCA) for the different SV types (Fig. 1f–i). For all SV types, PC1 separated European and Canadian salmon, consistent with past work, e.g. refs. 24,25. Deletions achieved a better resolution for the sampled European populations, with PC2 separating populations from Europe into distinct groups explained by latitude with evidence of intermixing at middle latitudes in Norway (Supplementary Fig. 15), as reported elsewhere24. All farmed salmon clustered with the wild populations from which they are descended. Farmed salmon from Europe, including 13 farmed fish from Chile, clustered with wild salmon from Southern Norway, while 7 Chilean farmed salmon clustered with Canadian salmon (Fig. 1g). Using the high-confidence deletion genotypes, an admixture analysis was performed, which was consistent with the PC analysis (Fig. 1j). For comparison, we also performed PCAs using the raw unfiltered SV calls, plus the reduced subset filtered for complex regions, which failed to capture the same population structure (Supplementary Fig. 16). In summary, our final set of deletion genotypes capture expected population genetic structure at high resolution. It is unclear if the weaker signal for duplications and inversions is linked to specific properties of these markers, their comparatively lower number, or slightly lower genotyping accuracy.

Annotation of Atlantic salmon SVs

We used SnpEff26 to annotate all high-confidence SV calls against features in the ICSASG_v2 annotation. Many SVs were located in intergenic and intronic regions (Supplementary Fig. 17), with 62%, 3% and 2.5% within 5 kb of a protein-coding gene, long non-coding RNA gene or pseudogene, respectively. Around half (49%) of all SVs overlapped one or more RefSeq gene, the majority of which overlapped a single gene (Supplementary Fig. 18), with 8439 genes overlapped in total. Approximately 4%, 21% and 25% of deletions, duplications and inversions were predicted by SnpEff to have a high impact, respectively, including hundreds of putative exon losses, frameshift variants and potential gene fusion events (Supplementary Fig. 19). One hundred and one duplications spanned entire genes (mean length: 51.7 kb, median length: 15.1 kb). The high impact annotations for different SV types were associated with an overrepresentation of several biological processes in the gene ontology (GO) framework27 (Supplementary Data 6 and 7).

Recently active DNA transposon in Salmo evolution

The outlier peak observed in the deletion calls (Fig. 1c and Supplementary Fig. 9) was investigated by extracting all high-confidence variants of 1432–1436 bp in size (104 sequences) from the ICSASG_v2 genome. Ninety-four and 89 of these sequences shared ≥50% and ≥95% identity in all pairwise combinations, respectively. The 94 sequences were used as queries in BLASTn searches revealing that 91% (86 out of 94) shared ≥95% identity to a pTSsa2 piggyBac-like DNA transposon (National Center for Biotechnology Information [NCBI] accession: EF685967])28. The breakpoints in the outlier deletions SV match to the complete pTSsa2 sequence (Supplementary Data 8), missing no more than a few bp at the 5′ or 3′ end. Consequently, the outlier deletion peak (Fig. 1c) appears to largely represent an intact pTSsa2 sequence.

Phylogenetic analysis was done incorporating the Atlantic salmon pTSsa2 sequences along with the top 100 BLASTn hits to the pTSsa2 sequence in the genome of brown trout Salmo trutta (repeat masking off; all sequences e-value = 0.0, 70–100% and 84–95% query, coverage and identity, respectively). Repeating the search against genomes for the next most closely related salmonid genera, Salvelinus (Arctic charr S. alpinus) and Oncorhynchus (rainbow trout O. mykiss, coho salmon O. kitsuch and chinook salmon O. tshawytscha) failed to identify sequences sharing >50% coverage or >81% identity. The tree indicates independent expansions of pTSsa2 sequences in the Atlantic salmon and brown trout genome (Fig. 2 and Supplementary Fig. 20). The pTSsa2 sequence appears in the Atlantic salmon genome with high copy number across all chromosomes (Supplementary Fig. 21).

Fig. 2: Evidence for an active DNA transposon in Salmo evolution.
figure 2

Phylogenetic tree of Atlantic salmon sequences representing deletion polymorphisms matching the pTSsa2 piggyBac-like DNA transposon28 (EF685967) and 100 top hits to this sequence within the brown trout genome. The tree was generated from an alignment spanning the length of pTSsa2 (Supplementary Data 8) using the TPM3+F+G4 substitution model. Bootstrap values are given at key nodes. A full tree with sequence identifiers, genomic locations of pTSsa2 sequences and bootstrap values is provided in Supplementary Fig. 18. A circos plot highlighting the location of pTSsa2 sequences in the Atlantic salmon genome is given in Supplementary Fig. 19.

We also determined the broader overlap of SVs and repeat sequences in the Atlantic salmon genome. Among all SVs, 65% (10,184) contained no repeat sequences, 16% (2423) a single repeat and 7% (1027) two repeats. There was a significant correlation between SV size and the number of repeats per SV across all SV types (Pearson’s R ≥ 0.99, P < 0.0001 in each test), indicating that the number of repeats within each SV was simply a direct product of SV size.

Impact of genome duplication on the SV landscape

Salmonid genomes retain a global signature of duplication from Ss4R, with at least half of the protein-coding genes retained as expressed, functional duplicates (referred to as ohnologs)17,18. Ss4R ohnolog pairs share amino acid sequence identity ranging from ~75 to 100%12,17,18 with ~40% maintaining the ancestral tissue expression pattern17, suggesting pervasive functional redundancy. We hypothesized that the redundancy provided by ohnolog retention after WGD influenced the evolution of the SV landscape by creating a mutational buffer29 against deleterious SV mutations. A key prediction is that genes found in Ss4R ohnolog pairs (with scope for functional redundancy) should be more overlapped by SVs compared to singleton genes (lacking scope for functional redundancy).

We tested this prediction by generating a novel set of high-confidence Ss4R ohnolog pairs (10,023 pairs, i.e. 20,046 genes) and singletons (8282 genes) (Supplementary Data 9), and indeed found a significant enrichment of SVs overlapping retained Ss4R ohnologs (Fisher’s exact test, P = 1.9e−25, odds ratio = 1.47) (Supplementary Data 10). This effect was specific to deletions (Fisher’s exact test, P = 2.6e−32, odds ratio = 1.62), and hence not observed in duplications (P = 0.62) nor inversions (P = 0.52). SVs with putative high impact did not overlap ohnologs more than singletons (high impact snpEff annotation: P = 0.93, manually curated deletions impacting exons: P = 0.55) (Supplementary Data 11).

Next we asked if gene expression characteristics influence the overlap between SVs and Ss4R ohnologs. One plausible prediction of our hypothesis is that ohnologs showing higher than average expression correlation will be more enriched for SVs, as these genes should on average show higher functional redundancy. We initially used Spearman’s rank correlation to establish co-expression of ohnologs across an RNA-Seq atlas of 15 tissues17. We found that ohnolog pairs where one copy overlaps a deletion SV showed slightly lower expression correlation compared to randomly selected ohnolog pairs (resampling test, P < 0.001) (Supplementary Fig. 22). This is not in line with the above prediction, though it should be noted the effect size is small (Supplementary Fig. 22a). This result is compatible with SVs affecting ohnolog pairs with greater levels of functional divergence at the expression level, but may equally be caused by relaxed purifying selection on duplicated copies, allowing more SVs to accumulate. It has been shown elsewhere that the more highly expressed ohnolog in a pair is typically under stronger purifying selection30. Therefore, we asked if ohnologs overlapped by an SV have reduced expression compared to their duplicate with no SV overlap. Indeed, this was the case (Wilcoxon rank-sum test, P = 2.9e−6) (Supplementary Fig. 22). We also found that ohnolog pairs showing overlap with deletion SVs showed reduced expression compared to ohnolog pairs showing no overlap to SVs (Wilcoxon rank-sum test, P = 7.0e−25) (Supplementary Fig. 22).

Overall, these analyses reveal that the Ss4R WGD strongly influenced the retention of deletion SVs in the Atlantic salmon genome, and this is likely explained by functional redundancy, with mixed support for our hypothesis on mutational buffering.

Selection on SVs during Atlantic salmon domestication

Our study provides a unique opportunity to ask if SVs were selected during the domestication of Atlantic salmon, which commenced when the Norwegian aquaculture industry was founded in the late 1960s11,31. Consequently, farmed Atlantic salmon are no more than 15 generations ‘from the wild’, in contrast to livestock and poultry, which have been domesticated for thousands of years11,12. The early domestication process involves strong selection on behavioural traits32,33 targeting molecular pathways underpinning cognition, learning and memory, for instance genes with functions in synaptic transmission and plasticity34,35. Specifically, selection on farmed animals should remove individuals that invest in costly behavioural and stress responses such as predator avoidance and fear processing in favour of animals that invest into performance traits32,36. We thus hypothesized that SVs linked to genes regulating pathways controlling behaviour would be under distinct selective pressures in farmed and wild salmon.

To test our hypothesis, we established significantly genetically differentiated SVs by calculating the fixation index (FST)37 between 34 farmed Norwegian salmon and 257 wild salmon from Norway. The wild individuals were selected based on a PCA including all European salmon, aiming to remove confounding effects of genetic differentiation by latitude observed in wild Norwegian salmon (Fig. 3a), retaining the closest possible background to the wild founders used in aquaculture. We used a permutation approach to estimate the probability of observed FST values in relation to random expectations, defining 584 SV outliers at P < 0.01 (all FST > 0.103, median FST = 0.149) (Fig. 3b and Supplementary Data 12), which were distributed throughout the genome (Fig. 3c).

Fig. 3: Genetic differentiation of SVs between farmed and wild Atlantic salmon.
figure 3

a PCA used to select appropriate wild individuals for FST comparison (n = 257) vs. farmed salmon (n = 34) on the basis of genetic distance by latitude (see also Supplementary Fig. 15) separated along PC1. The population symbols are the same as shown in Fig. 1. b Observed FST value distribution comparing farmed vs. wild salmon contrasted against 200 random distributions for the same number of individuals. Dotted line shows cut-off FST value employed in addition to a per SV criteria of P < 0.01. c Manhattan plot of 12,627 FST values with dotted line showing the same cut-off above which are the 584 SV outliers. d Brain gene expression specificity (top panel) and expression level (bottom panel) are increased compared to global expectations for genes linked to the 584 outlier SVs, with the effect pronounced for a 326 gene subset contributing to significantly enriched GO terms. Hypergeometric tests were performed to compare the proportion of genes showing brain expression specificity ≥0.50 between 44,469 genes detected in a multi-tissue transcriptome vs. (i) the 584 gene subset (all SV outliers) (single asterisk indicates P = 0.0041) and (ii) the 326 gene subset (SV outliers GO enriched) (double asterisk indicates P = 2.42e−07). Two-sample t-tests were used to compare the brain expression level (CPM) among the same 44,469 global gene set vs. (i) the 584 gene subset (all SV outliers) (double asterisk indicates P = 4.84e−07) and (ii) the 326 gene subset (SV outliers GO enriched) (double asterisk indicates P = 6.65e−07). The observed increase in expression was specific to brain (plots for other tissues shown in Supplementary Figs. 22 and 23). Results of statistical analysis for all tissues are shown in Supplementary Data 15. A definition of the box and whisker plots can be found in the Fig. 1a legend.

GO enrichment tests identified 132 overrepresented biological processes (P < 0.05) among the genes linked to these outlier SVs by SnpEff (Supplementary Data 13). This set comprises 326 unique genes contributing to the enriched terms (Supplementary Data 14). Thirty-four biological processes explained by 156 unique genes (48% of the unique genes contributing to all enriched GO terms) were daughter terms related either to learning and behaviour, including ‘habituation’ (P < 0.002), ‘vocal learning’ (P < 0.001) and ‘adult behaviour’ (P < 0.02), or the nervous system, including ‘positive regulation of nervous system process’ (P < 0.02),’ presynaptic membrane assembly’ (P < 0.01), ‘postsynapse assembly’ (P < 0.02), ‘oligodendrocyte development’ (P < 0.001) and ‘regulation of neuronal synaptic plasticity’ (P < 0.03).

To test our hypothesis, we asked if genes linked to outlier SVs showed enrichment in brain expression (Fig. 3d). Indeed, this was strongly supported when judged against transcriptome-wide expectations (Fig. 3d): with the signal being strongest for the 326 gene subset contributing to the overrepresented GO terms, emphasizing particular importance of brain functions among the enriched gene set (Fig. 3d and Supplementary Data 15). A positive enrichment in the expression of outlier linked genes was only observed in brain, with nine other tested tissues showing either little difference to transcriptomic expectations, or in the case of muscle and foregut, reduced expression specificity (Supplementary Data 15 and Supplementary Figs. 23 and 24). Finally, we asked if the outlier SVs overlapped putative cis-regulatory elements (CREs) detected in brain using novel ATAC-Seq data (significant peaks overlapping a gene ±3000 bp up/downstream; n = 4) more than expected. For 9920 SVs lacking evidence for differentiation between farmed and wild fish (FST, P > 0.05), 7.1% overlapped at least one brain ATAC-Seq peak, which was almost identical to SV outliers (7.0%) (Fisher’s exact test, P = 0.86). A similar result was observed by restricting the analysis to genes with brain biased expression (Fisher’s exact test, P = 0.41).

SVs selected by domestication are linked to many synaptic genes

The increased brain expression and overrepresentation of nervous system functions for SV outlier linked genes motivated us to investigate the role of these loci in the genetic architecture of domestication. We performed a detailed annotation of the 156 SV outlier linked genes contributing to the 34 aforementioned enriched GO terms (Supplementary Data 16). To cement the relevance of this gene set to our hypothesis, we cross-referenced all the encoded protein products with a high-resolution synaptic proteome from zebrafish38. Our rationale was that the synaptic proteome is central to nervous system activity and defines the repertoire of cognitive and behaviours an animal can perform during its life38,39.

Among the 156 SV outlier linked genes, 65 (i.e. 42%, linked to 67 distinct SVs) encode a protein with an ortholog in the zebrafish synaptic proteome (Supplementary Data 16) defined by stringent reciprocal BLAST (mean respective pairwise % identity and coverage = 77 and 95%). As synaptic proteomes are highly conserved between fish and mammals38, it is reasonable to assume these proteins are bone fide components of Atlantic salmon synaptic proteomes, and that a minimum of 11% of the outlier SVs was linked to synaptic genes by SnpEff. These proteins are encoded by multiple members of ancient, conserved gene families involved in synaptic formation, transmission and plasticity, including neurexins (NRXN1 and NRXN2), SH3 and multiple ankyrin repeat domains 3 proteins (SHANK2 and 3), cadherins (CDH4, CDH8, CDH11, PCDH1), Down syndrome cell adhesion molecules (DSCAM and DSCAML), teneurins (TENM1 and TENM2), gamma-aminobutyric acid receptors (GABRB2 and GABRG2), potassium voltage-gated channel subfamily D members (KCND1 and KCND2), receptor-type tyrosine-protein phosphatases (PTPRG and PTPRN2) and ionotropic glutamate receptors (GRIK3 and GRIN2C) (Fig. 4). Genetic disruption to orthologs for most of these proteins (59/65) cause behavioural and/or neurological disorders in mammals (Supplementary Data 16).

Fig. 4: SVs under selection during Atlantic salmon domestication are linked to 65 unique genes encoding synaptic proteins.
figure 4

SV genotypes are visualized on the left, ordered from bottom to top with decreasing frequency of homozygous genotypes (0/0) lacking the SV in wild fish. Annotation of each SV type, its size and genomic location with respect to each synaptic gene is also shown. The circles next to genes highlight Ss4R ohnolog pairs and the black triangles indicate the overlap of an SV with a putative cis-regulatory element (ATAC-Seq peak). The heatmap on the right depicts the expression specificity of each gene across an RNA-Seq tissue panel17 (white to dark blue depicts lowest to highest tissue specificity; tissues shown in different columns from left to right: liver, gill, skeletal muscle, spleen, heart, foregut, pyloric caeca, pancreas and brain). The overall expression of each gene in brain is shown on the right of the heatmap (white to dark green depicts increasing CPM across the column). Data provided in Supplementary Data 16.

To ask how selection acted on these variants during domestication, we compared allele frequencies between wild and farmed fish (Fig. 4). By far the most common scenario was that the synapse gene-linked SVs are rare alleles in wild fish that show increased frequency of heterozygotes (carrying one SV copy, 0/1) and homozygotes (carrying both SV copies, 1/1) in farmed fish (Fig. 4). We also found that farmed individuals often carry multiple copies of SVs that are especially rare in wild fish (defined as 0/0 homozygous frequency ≥0.90, 45 SVs)—assumed to be deleterious in natural environments—including homozygote 1/1 states for SVs located on different chromosomes (Supplementary Fig. 25).

Many of the outlier SVs linked to the 65 synaptic genes are located in non-coding regions (introns and untranslated regions, 45%), while a smaller fraction are located within 10 kb up or downstream (15%) or within ≥10 kb to 260 kb (33%) of the same genes (Fig. 4). A smaller fraction affect coding regions via whole-gene duplications, either involving a small number of genes, e.g. a 55 kb duplication overlapping the brain-specific CDK5R1 gene, or through larger multigene duplications (Fig. 4 and Supplementary Data 16). A striking example of an SV with a putative major disruptive effect was a 696 kb inversion that flips multiple exons and the upstream region of the brain-specific gene encoding neurexin-2, which should halt translation of a functional protein (Supplementary Data 16). Finally, among this synaptic gene set, we identified two ohnolog pairs retained from Ss4R encoding astrotactin-1 and seizure protein 6 (Fig. 4).

Major effect SVs altered by domestication

We identified 32 further SVs with major predicted effects on gene structure and function among the significant FST outliers, which typically show increased allele frequency in farmed compared to wild Atlantic salmon (Table 1). These SVs disrupt or ablate coding genes with diverse functions, including male fertility (e.g. CATSPERB40), immunity (e.g. B cell survival and signalling, GIMAP8 (ref. 41) and two distinct CD22 (ref. 42) genes), circadian control of metabolism (NR1D2 (ref. 43), lipid metabolism and insulin sensitivity (ELOVL6 (ref. 44)) and melanin transport and deposition (MYRAP45) (Table 1). We observed four deletions that disrupt conserved lncRNAs of unknown function, and several large SVs that cover multiple genes, for instance a 423 kb inversion on Chromosome 7 containing 16 genes that was absent in 257 wild salmon (Table 1). In summary, these data demonstrate that diverse gene functions beyond neurological and behavioural pathways were altered by the domestication of Atlantic salmon due to altered selective pressure or drift.

Table 1 Major effect SVs under divergent selection in farmed and wild Atlantic salmon.

Discussion

Despite an increasing shift towards the use of long-read sequencing for SV discovery1,2, these technologies remain prohibitively expensive for large-scale population genetics, making such datasets scarce in most species. Consequently, it remains a timely challenge to extract reliable SV calls from the more extensive repository of short-read genome sequencing datasets, which continue to emerge rapidly in many species, largely for use in SNP analyses. The approach reported can be applied for reliable SV detection and genotyping using such data in any species with a reference genome. A critical step—unique to this study—was the curation of all SV calls using SV-plaudit10. This approach demands significant manual effort, equivalent to approximately 2 weeks for a small team of trained curators, yet was efficient in retaining predominantly true calls, and allowed us to demonstrate the value of filtering complex regions to drastically reduce the FDR. The overall extreme FDR for SV discovery advocates for the routine application of such curation in SV studies based on short-read sequencing, particularly if ‘gold-standard’ SVs defined by past work are unavailable.

The SVs reported provide a novel resource for future studies on the genetic architecture of traits in Atlantic salmon, which has excluded SVs until now. It will be useful to overlap our SVs with genomic regions of interest such as QTLs defined by SNPs to investigate SVs as putative causal variants. For example, we discovered a duplication on chromosome 14 that likely destroys the function of the MYRIP gene, which is involved in melanosome transport45—a past study discovered a single QTL on chromosome 14 that explained differences in melanocyte pigmentation between wild and domesticated fish46, which may be linked to this newly discovered SV. It will also be useful in future studies to apply SV markers directly in genome-wide association analyses, and to test their value for genomic prediction in salmon breeding programmes11,12. While our study captured hundreds of Atlantic salmon genomes representing several major phylogeographic groups, it fails to capture broader genetic diversity within this species, and due to the retention of only high-confidence SV calls, our method may be prone to false negatives. Further, inherent limitations of short-read sequencing data for SV detection presumably obscures detection of many SVs, suggesting future SV studies in Atlantic salmon must also focus on adapting long-read sequence data, and integrating short- and long-read data for optimal SV discovery1.

We discovered intact pTSsa2 polymorphisms within our SV dataset, and provided evidence for transposon expansion after the split of S. salar and trutta ~10 Mya16 (Fig. 2). The pTSsa2 transposon appears with high copy number in the Atlantic salmon genome, suggesting an important role in shaping very recent genome architecture. Transposons have largely been excluded from studies of contemporary genetic variation in salmonids, but were central to genome rediploidization after the Ss4R WGD17, and likely contributed to the evolution of the sex determining locus, e.g. ref. 47. As work in other taxa has revealed that transposon polymorphisms contribute to adaptive evolution48,49 and speciation50, future studies on pTSsa2 should investigate such possibilities in Salmo. We also showed that Atlantic salmon deletion SVs are more likely to overlap genes retained as ohnolog pairs from the Ss4R WGD event compared to singleton genes, and demonstrate SV overrepresentation in ohnolog genes according to their expression properties. The results are at least partly compatible with the hypothesis that WGD events buffer against potential deleterious impacts of SVs on gene function and regulation, consistent with past work29,51, but also support the idea that SV retention may sometimes be a product of relaxed selection acting on duplicated ohnologs. Overall, the link between SVs and the Ss4R WGD requires further investigation to more fully dissect the role of selection and drift in driving SV retention.

We discovered many SVs showing genetic divergence between farmed and wild Atlantic salmon linked to synaptic genes responsible for behavioural variation38,39. Most were rare alleles in wild fish and showed a small to moderate increase in frequency in domesticated populations, consistent with a polygenic genetic architecture for behavioural traits altered by domestication, including risk-taking behaviour, aggression and boldness32,52,53,54,55,56, affecting many unique genes from the same functional networks, mirroring the polygenic basis for many human neurological traits57,58,59. The disruption of mammalian orthologs for many of the same synaptic genes cause disorders, including schizophrenia, intellectual disability, autism and Alzheimer’s (Supplementary Data 16). For Atlantic salmon, we did not establish if these SVs are causative variants or in linkage disequilibrium with other variants under selection. In several cases, it is likely that the SVs discovered are causative variants due to their disruptive nature on protein-coding gene sequence potential (e.g. Table 1), including the ablation of the key synaptic protein neurexin-2, which caused autism-related behaviours when induced experimentally in mice60. However, as many of the outlier SVs were located in non-coding regions, this points to regulatory effects on gene expression, which may have minor or additive effects on behavioural traits. Future work should test whether the outlier SVs alter the expression or function of synaptic genes and directly influence behavioural phenotypes. Beyond neurological systems, domestication altered the frequencies of numerous major effect SVs disrupting genes with diverse functional roles (Table 1), providing candidate causative variants for ongoing investigations into diverse traits. For instance, an increased frequency of SVs ablating the ELOVL6 and NR1D2 genes in domesticated fish, which play key roles in lipid metabolism, insulin resistance and the coordination of metabolic functions with the circadian clock44,45, is highly consistent with a recent transcriptomic study demonstrating altered metabolism linked to disrupted circadian regulation in domesticated compared to wild Atlantic salmon61.

To conclude, given the rapidly growing recognition of the importance of establishing the role of SVs in adaptation and other evolutionary processes in natural populations62,63, in addition to commercial variation relevant to breeding of farmed animals64,65, we anticipate that this reliable description of the SV landscape in Atlantic salmon will encourage more studies exploiting SV markers to address both fundamental and applied questions in the genetics of non-model species.

Methods

Sequencing data

Paired-end whole-genome sequencing data (mean 8.1× coverage, 2 × 100–150 bp) was generated for 472 Atlantic salmon on several different platforms (Supplementary Data 1). DNA extraction, quality control and sequencing library preparation followed standard methods. Wild Atlantic salmon were sampled either during organized fishing expeditions or by anglers during the sport fishing season with DNA extracted from scales. We sampled n = 80 wild Canadian individuals from 8 sites, n = 359 Norwegian individuals from 52 sites (including n = 5 landlocked dwarf salmon), n = 8 Baltic individuals from a single site and n = 4 White sea individuals from a single site. Whole-genome sequencing data was generated for 21 farmed individuals (n = 12 individuals from Mowi ASA; n = 9 samples from Xelect Ltd) and downloaded from NCBI for a further 20 farmed individuals. Individual sample accession numbers are given in Supplementary Data 1 and the Data Availability section.

SV detection and genotyping

Sequence alignment to the unmasked ICSASG_V2 assembly (NCBI accession GCA_000233375)17 was done using BWA v0.7.13 (ref. 66). Reads were mapped to the complete reference, including unplaced scaffolds, with random placement of multi-mapping reads67. Reads mapping to unplaced scaffolds were discarded. Alignments were converted to BAM format in Samtools v0.1.19 (ref. 68). Alignment quality, batch effects and sample error were further assessed using Indexcov goleft v0.2.1 (ref. 69). Gap regions were extracted and converted to BED format using a Python script (Supplementary Note 1); SV calls overlapping these regions were identified using Bedtools Version v2.27 (ref. 70) and removed. Sample coverage was estimated using mosdepth v0.2.3 (ref. 71). High-depth regions were defined as any regions showing ≥100x coverage in at least 100 salmon individuals, and can optionally be removed from SV calling (see below); this cut-off was a compromise to avoid generating too many false SV calls, balanced against the risk of losing real SVs. High-depth regions located within 100 bp were merged. SV detection was done using the Lumpy-based tool Smoove v2.3 (ref. 21) with genotypes called by SVtyper v0.7.0 (ref. 22). Gap and high-depth regions were combined into a single BED file, which can optionally be used to exclude these locations from SV detection in Lumpy (-exclude option). All of the above steps were combined in a Snakemake (v3.11.0)20 workflow, with the input being paired-end sequencing data (FASTQ format), and the output a VCF file with SV locations and genotypes for all individuals in a study (Supplementary Fig. 1 and Supplementary Note 2 provides Snakefile).

SV-plaudit curation

All 165,116 SV calls generated in the study were curated using SV-plaudit (no version variations; https://github.com/jbelyeu/SV-plaudit)10. A plotCritic website was setup on Amazon Web Services where variant images produced in samplot v1.01 were deployed. SV curation involved the random visualization of one homozygous wild type (0/0; lacking SV, identical to reference genome), two heterozygous (0/1, with one SV copy) and two homozygous-alternate (1/1, with two SV copies) individuals per SV, done using cyvcf2 v0.11.5 (ref. 72). With each image the question ‘is this variant real?' was answered (options: ‘No’, ‘Yes’ or ‘Maybe’). Only high-confidence variants (‘Yes’) were kept for downstream analysis. Three different co-authors (A.C.B., M.K.G. and E.P.) team-curated the full SV set. In total, 1000 random plots were commonly curated by each researcher to establish congruence in decision making, and there was 100% agreement concerning high-confidence (‘Yes’) variants. Subsequently the SV plots were divided randomly and each set validated independently across the three researchers and then merged.

SV annotation

High-confidence SVs retained following SV-plaudit curation were filtered to remove redundant SVs using the Bedtools intersect function (90% reciprocal overlap), removing 133 SVs and leaving 15,483 SVs used in further analysis (provided in Supplementary Data 5). The association between SVs and RefSeq genes within the ICSASG_v2 assembly was done using SnpEff v4.3 (ref. 26) (default parameters). GO enrichment tests were done using the ‘weight01’ algorithm and Fisher’s test statistic in TopGo v2.26.0 (ref. 73). The background set was all genes in the RefSeq annotation. The R package ‘Ssa.RefSeq.db’ (https://gitlab.com/cigene/R/Ssa.RefSeq.db)74 was used to retrieve GO annotations from the ICSASG_v2 genome. The overlap between SV locations and repeats in the ICSASG_v2 annotation was done using Bedtools61 against an existing database17.

Phylogenetic analyses

pTSsa2 sequences including EF685967 were used in BLASTn75 searches against the NCBI nucleotide database (restricted to Salmonidae) in addition to unmasked assemblies for Atlantic salmon (ISCASG_v2), brown trout (GCA_901001165.1), Arctic charr (GCA_002910315.2), rainbow trout (GCA_002163495.1), chinook salmon (GCA_002872995.1) and coho salmon (GCA_002021735.2). Sequence alignments were performed using Mafft v7.0 (ref. 76) with default settings. Phylogenetic analysis was done using IQTREE v1.6.12 via a webserver77 with estimation of the best-fitting nucleotide substitution model (Bayesian Information Criterion) and 1000 ultrafast bootstraps78.

SV validation by MinION sequencing

PCR primers are shown in Supplementary Data 4. PCRs were performed using LongAmp® Taq (New England Biolabs) with 1 cycle of 94 °C for 30 s, 30 cycles of 94 °C for 30 s, 56 °C for 60 s and 65 °C for 50 s/kb, followed by a 10-min extension at 65 °C. Amplicons for different SVs in each fish individual were pooled and cleaned using AMPure XP beads (Beckman Coulter). Two hundred and fifty nanograms pooled DNA was used to create sequencing libraries with a 1D SQK-LSK109 kit (Oxford Nanopore Technologies, ONT). DNA was end-repaired using the NEBNext Ultra II End Repair/dA Tailing kit (New England Biolabs) and purified using AMPure XP beads. Native barcodes were ligated to end-repaired DNA using Blunt/TA Ligation Master Mix. Barcoded DNA was purified with AMPure XP beads and pooled in equimolar concentration to a total of 200 ng per library (~0.2 pmol). AMII Adapter mix (ONT) was ligated to the DNA using Blunt/TA Ligation Master Mix (New England Biolabs) before the adapter-ligated library was purified with AMPure XP beads. DNA concentration was determined at each step using a Qubit fluorimeter (Thermo Fisher Scientific) with a ds-DNA HS kit (Invitrogen).

Sequencing libraries were loaded onto MinION FLO-MIN106D R9.4.1 flow cells (ONT) and run via MinKNOW for 36 h without real-time basecalling. Basecalling and demultiplexing was performed with Guppy v2.3.7. FASTQ files were uploaded into Geneious Prime 2019.1.1 and simultaneously mapped to a reference of sequences spanning all candidate SV regions in the ISCSAG_v2 assembly. Mapping was done with the following parameters: ‘medium-fast sensitivity’, ‘finding structural variants’, including ‘short insertions’ and ‘deletions’ of any size, with the setting ‘map multiple best matches’ set to ‘None’, and the minimum support for SV discovery set to 2 reads. Alignments were inspected for the presence and genotype of the SV. Amplicons with <50× coverage to the target SV region were discarded as failed PCRs. When alignments matched the predicted SV breakpoints and size, the SV call was considered correct. When >90% of the aligned reads matched to the expected SV and breakpoints (i.e. a gap for deletions, an insertion for duplications and flipped reads for inversions compared to the reference) it was classified 1/1 homozygous. When at least 10% of the aligned reads matched to both the reference genome state, in addition to the 1/1 state, the locus was classified 0/1 heterozygous.

Association between SVs and Ss4R ohnologs

The code used to identify a genome-wide set of Ss4R ohnologs, along with a description of the genome assembly annotations employed, is available at https://gitlab.com/sandve-lab/salmonid_synteny (and Supplementary Data 17) and https://gitlab.com/sandve-lab/defining_duplicates (and Supplementary Data 18). Orthogroups were constructed with Orthofinder v2.4.0 (ref. 79) using seven salmonid species (Atlantic salmon, rainbow trout, Arctic charr, coho salmon, huchen Hucho hucho and European grayling Thymallus thymallus), five additional actinopterygians (zebrafish, medaka Oryzias latipes, northern pike Esox lucius, three-spined stickleback Gasterosteus aculeatus and spotted gar Lepisosteus oculatus), and two mammals (human and mouse Mus musculus). For each orthogroup, we extracted nucleotide protein-coding sequences, aligned them with Macse v2.03 (ref. 80) and built gene trees using TreeBeST v1.9.2 (ref. 81). Trees were split into smaller subtrees at the node representing the divergence between pike and salmonids. To derive a final set of Atlantic salmon Ss4R ohnologs, we used both synteny and gene tree topology criteria. Firstly, we required that the subtrees branched with northern pike as the sister to salmonids and outgroup to Ss4R16,17 and contained either exactly two (ohnologs) or exactly one (singletons) Atlantic salmon genes. Secondly, we removed any putative Ss4R ohnologs falling outside conserved synteny blocks predicted using iadhore v3.0 (ref. 82). A final set of ohnolog pairs is provided in Supplementary Data 9, which contains all gene trees in NWK format.

We used the fisher.exact() function in R to compare the observed counts of SVs overlapping singleton and ohnologs with the total counts of singletons and ohnologs. To test for association between ohnolog expression divergence and SV overlap, we used a 15 tissue RNA-Seq dataset17 available as a TPM (transcripts per million reads) table in the salmofisher R-package https://gitlab.com/sandve-lab/salmonfisher. We used the cor() function in R to compute median Spearman’s tissue expression correlation for all ohnolog pairs where one copy was overlapped by an SV. We then computed median correlations for 1000 randomly sampled ohnolog sets of the same size. The P value was estimated as the proportion of resampled medians lower than the observed median for ohnologs overlapped by SVs. Tests comparing expression level between genes that were either overlapped or not overlapped by SVs were conducted using the sum log10 transformed TPM for each gene across all 15 tissues. The function wilcox_test within the R-package rstatix v0.6.0 was used to calculate P values for differences in expression levels. The code used is available at https://gitlab.com/ssandve/atlantic_salmon_sv_ohnolog_analyses/ (and Supplementary Data 19).

Association of SVs with brain ATAC peaks

Four Atlantic salmon (freshwater stage, 26–28 g) were killed using a Schedule 1 method following the Animals (Scientific Procedures) Act 1986 in strict accordance with the Norwegian Animal Welfare Act 2010. Around 50 mg homogenized brain tissue was processed to extract nuclei using the Omni-ATAC protocol for frozen tissues83. Nuclei were counted on an automated cell counter (TC20 BioRad, range 4–6 μm) and further confirmed intact under a microscope. In total, 50,000 nuclei were used in the transposition reaction including 2.5 µL Tn5 enzyme (Illumina Nextera DNA Flex Library Prep Kit), incubated for 30 min at 37 °C in a shaker at 200 r.p.m. The samples were purified with the MinElute PCR purification kit (Qiagen) and eluted in 12 μL elution buffer. qPCR was used to determine the optimal number of PCR cycles for library preparation84 (8–10 cycles used). Sequencing libraries were prepared with short fragments and fragments >1000 bp were removed using AMPure XP beads (Beckman Coulter, Inc.). Fragment length distributions and confirmation of nucleosome banding patterns were determined on a 2100 Bioanalyzer (Agilent) and the library concentration estimated using a Qubit system (Thermo Scientific). Libraries were sent to the Norwegian Sequencing Centre, where paired-end 2 × 75 bp sequencing was done on an Illumina HiSeq 4000. The raw sequencing data are available through ArrayExpress (Accession: E-MTAB-9001).

ATAC-Seq reads were aligned to the Atlantic salmon genome (ICSASG_v2) using BWA (v0.7.17)66 and a merged peak set called combining the four replicates using Genrich v.06 (https://github.com/jsh58/Genrich) with default parameters, apart from ‘-m 20 -j' (minimum mapping quality 20; ATAC-Seq mode). Bedtools was used to identify SVs overlapping ATAC-Seq peaks (filtered at corrected P ≤ 0.01) associated to genes, defined as being located within 3000 bp up/downstream of the start and end coordinates of the longest transcript per gene.

Population structure analyses and FST analyses

PCAs were performed separately on the complete set of high-confidence deletions (14,017), duplications (1244) and inversions (242) using the prcomp and autoplot functions within GGplot2 v3.3.2 (ref. 85) in R. Genotypes were coded into bi-allelic marker format to be compatible with standard population genetics methods. We further tested for population structure in deletion SVs using NGSadmix v32 (ref. 86) using group sizes of K = 2–4, which were sufficient to confirm the results observed by PCA. As the aim was to recapture the major salmon phylogeographic groups, e.g. refs. 24,25 in our sampled dataset, higher K values were not explored.

FST values were calculated for all high-confidence SVs using VCFtools v0.1.16 (ref. 87) with the Weir and Cockerham method37 comparing 34 Norwegian farmed vs. 257 Norwegian wild Atlantic salmon (Fig. 3a provides rationale for sample selection). To establish the significance of each FST value, individuals from the two groups were randomly split into two sets of the original size (i.e. 34 vs. 257 individuals) 200 times, before the distribution of resultant FST values was plotted using the ggplot2 function geom_freqpoly (binwidth = 0.01). Per SV P values were considered as the proportion of FST values obtained in the 200 random distributions higher than the FST in the observed distribution. Thus, if 10/200 randomly sampled FST values above the observed FST value were recorded, P = 0.05 was assigned. We further applied an FST cut-off to include SVs where 99.7% of all FST values fell above the randomly sampled values (FST > 0.103). Any SVs lacking alternative alleles in the compared groups were excluded. Code to perform these analyses is provided in Supplementary Note 3.

Annotation of SV outliers

GO enrichment tests for genes linked to the SV outliers (P < 0.05) were done as described in the section ‘SV annotation’, with the background gene set restricted to all RefSeq genes linked to SVs by SnpEff. To investigate the expression of genes linked to SV outliers, we used existing RNA-seq data17, representing normalized counts per million (CPM) for 10 tissues (brain, liver, muscle, spleen, pancreas, heart, pyloric, gill, skin and foregut). We filtered any genes where the across-tissue sum of CPM was <1.0. A ‘tissue specificity’ index was calculated, representing the sum across-tissue CPM divided by the CPM per tissue. We tested whether genes linked to SV outliers by SnpEff, in addition to a subset contributing to significant GO terms (P < 0.01), differed from the transcriptome-wide expectations. Hypergeometric tests were used (dhyper function in R) to compare the number of genes in the two gene sets with a tissue specificity index ≥0.5 compared to all genes in the transcriptome. Two-sample t-tests (t.test function in R) were used to compare differences in mean CPM between the two gene sets compared to all genes in the transcriptome. BLASTp75 (done using NCBI Web BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to cross-reference protein products of genes linked to SV outliers against 3840 unique proteins detected in the zebrafish synaptic proteome38 (downloaded from the GRCz11 assembly version using BioMart at http://www.ensembl.org/Danio_rerio/Info/Index), taking forward the top zebrafish BLASTp hit (cut-off: 40% identity, 40% query coverage) as a query in a reciprocal BLAST against all S. salar RefSeq proteins (no cut-off); evidence for orthology was accepted when the candidate zebrafish protein showed a best hit to the original query in the complete salmon proteome. We used the fisher.exact() function in R to test if the 584 significant FST outlier SVs were more likely to overlap brain ATAC-Seq peaks than non-significant SVs (P > 0.05), which was done considering all expressed genes (TPM ≥ 1) in the RNA-Seq tissue atlas described above17 and a subset of the same genes most highly expressed in brain (filtered for genes where brain was among the top three tissues for TPM). The bedtools61 intersect function was used to associate ATAC-Seq peaks with SVs. The code used is available at https://gitlab.com/ssandve/atlantic_salmon_sv_ohnolog_analyses/ (and Supplementary Data 19).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.