The structural variation landscape in 492 Atlantic salmon genomes

Structural variants (SVs) are a major source of genetic and phenotypic variation, but remain challenging to accurately type and are hence poorly characterized in most species. We present an approach for reliable SV discovery in non-model species using whole genome sequencing and report 15,483 high-confidence SVs in 492 Atlantic salmon (Salmo salar L.) sampled from a broad phylogeographic distribution. These SVs recover population genetic structure with high resolution, include an active DNA transposon, widely affect functional features, and overlap more duplicated genes retained from an ancestral salmonid autotetraploidization event than expected. Changes in SV allele frequency between wild and farmed fish indicate polygenic selection on behavioural traits during domestication, targeting brain-expressed synaptic networks linked to neurological disorders in humans. This study offers novel insights into the role of SVs in genome evolution and the genetic architecture of domestication traits, along with resources supporting reliable SV discovery in non-model species.


Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences
Behavioural & social sciences
Ecological, evolutionary & environmental sciences

For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Ecological, evolutionary & environmental sciences study design
All studies must disclose on these points even when the disclosure is negative.

The authors declare that all data supporting the findings of this study are available within the paper and its supplementary information files. Novel raw sequence data that support the findings of this study were deposited in the European Nucleotide Archive (ENA)

Study description
Structural variation analysis of 492 Atlantic salmon individuals using short-read whole genome sequencing.

Research sample
Paired-end whole genome sequencing data (mean 8.1x coverage, 2 x 100-150 bp) for 492 Atlantic salmon, generated on several different platforms (Supplementary Table 1 in paper). We sampled n=80 wild Canadian individuals from 8 sites, n=359 Norwegian individuals from 52 sites (including n=5 landlocked dwarf salmon), n=8 Baltic individuals from a single site and n=4 White Sea individuals from a single site. Whole genome sequencing data were generated for 21 farmed salmon individuals (n=12 individuals from Mowi ASA; n=9 samples from Xelect Ltd) and downloaded for a further 20 individuals. Individual sample accession numbers are given in Supplementary Table 1 and the Data Availability section.
The rationale for the choice of these samples was i) to represent multiple individuals from within the major Atlantic salmon phylogeographic groups, including both wild and farmed populations, and ii) to provide sufficient paired-end sequencing coverage per sample to allow reliable structural variation calls to be generated. These two aspects of the research sample were essential to achieve the study objectives of understanding the distribution and role of SVs in the Atlantic salmon genome.

Sampling strategy
The strategy for sampling in this study was not based on any statistical method to predetermine sample size (not applicable to data type). However, the sample size was sufficient to i) demonstrate that SVs captured the expected population genetic structure observed in previous studies, ii) quantify statistically significant differences in SV allele frequency across populations, and iii) associate Atlantic salmon SVs with functional features in the genome. Therefore, the sample size was sufficient to provide reliable conclusions and novel biological insights on SVs, aligned to the original study aims. While the final number of samples sequenced was constrained by the available budget, this is by far the largest whole genome re-sequencing effort to date in Atlantic salmon, and as such it at least meets the criteria for a 'sufficient sample size' with respect to previously published studies.

Data collection
Wild Atlantic salmon were sampled during organized fishing expeditions or by anglers during the sport fishing season with scale samples taken for analysis. It is not possible to name every individual present during sampling due to the extensive scale of the effort, but sample collection was coordinated by expert co-authors based at the Norwegian Institute for Nature

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Samples from Xelect Atlantic salmon were muscle tissue, gifted by co-author Professor Ian Johnston. DNA sequencing library preparation and sequencing were done by commercial sequencing centres (it is not possible to list the persons involved), using the sequencing instruments listed per sample in Supplementary Table 1.
The rationale for the spatial scale of sampling and for sampling locations is described above in 'Research sample'. Moreover, specific latitude-longitude coordinates of samples are given in Supplementary Table 1. The sequenced samples were collected between 2008 and 2017. The timing of sample sequencing was determined by when funding was available through grants listed in the paper. We stopped sequencing samples once we had achieved the aim of obtaining high-quality sequencing data for multiple individuals from each of the major Atlantic salmon phylogeographic groups, including both wild and farmed populations.
No data were excluded.
Repeating measurements to establish reproducibility is not relevant to the main data type used for SV detection in the study, because genome re-sequencing data are not an experimental variable but a fixed observation per animal, with reproducibility achieved through sufficient sequencing coverage. It is therefore not usual to repeat whole genome sequencing of the same individuals, as the result is expected to be identical. The only data type produced in the study for which replication is relevant is ATAC-Seq data, which were generated with four biological replicates; only peaks detected in all four individuals were used in downstream analyses, ensuring that these analyses are based solely on reproducible data. All analyses can be reproduced using the raw sequencing data provided in the study (see 'Data').
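The ATAC-Seq replicate filter described above (keeping only peaks detected in all four individuals) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is not described in this section. It assumes peaks are BED-like half-open intervals, treats the first replicate as the reference set, and counts a peak as "detected" in another replicate if any peak there overlaps it by at least one base. The function names and the example coordinates are hypothetical.

```python
from typing import List, Tuple

# A peak is (chromosome, start, end) with a half-open interval, as in BED files.
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """Return True if two peaks share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def peaks_in_all_replicates(replicates: List[List[Peak]]) -> List[Peak]:
    """Keep peaks from the first replicate that overlap a peak in every other replicate.

    Assumption: one-base overlap with the reference peak is enough to call a
    peak 'detected'; other definitions (e.g. reciprocal overlap) are possible.
    """
    if not replicates:
        return []
    reference, *others = replicates
    return [
        peak
        for peak in reference
        if all(any(overlaps(peak, q) for q in rep) for rep in others)
    ]


# Hypothetical peak calls for four biological replicates.
replicates = [
    [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)],
    [("chr1", 150, 250), ("chr2", 60, 120)],
    [("chr1", 120, 180), ("chr1", 520, 580), ("chr2", 100, 200)],
    [("chr1", 190, 300), ("chr2", 140, 160)],
]

consensus = peaks_in_all_replicates(replicates)
print(consensus)
```

In this toy input the peak at chr1:500-600 is dropped because the second and fourth replicates have no overlapping peak, while the other two reference peaks are retained. The pairwise scan is quadratic and intended only for illustration; genome-scale data would normally use a sorted-interval tool.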
Sampling was random for both wild and farmed fish, either within the populations targeted in rivers at the given study sites or within the populations of fish sampled from aquaculture stocks.
No data were generated where blinding would affect the outcomes. There are no prior beliefs or expectations that could affect the results reported.
Not possible to provide owing to the large sample numbers. Key information is provided in Supplementary Table 1.
Latitude and longitude for all wild fish sampled are provided in Supplementary Table 1.
Samples (Supplementary Table 1) were collected either from anglers during the summer sport fishing season, supported by local fishing-rights owners, or from wild broodstock collected in the autumn, supported by the Norwegian Environment Agency, which requires sampling of all fish intended for use as broodstock; no further permissions were required.
Disturbance to wild fish was minimized where possible by sampling catches from fishermen who had already caught the animals, or by sampling as part of routine monitoring of wild stocks by authorized organizations.