The population genomics of structural variation in a songbird genus

Structural variation (SV) accounts for a substantial part of genetic mutations segregating across eukaryotic genomes with important medical and evolutionary implications. Here, we characterized SV across evolutionary time scales in the songbird genus Corvus using de novo assembly and read mapping approaches. Combining information from short-read (N = 127) and long-read re-sequencing data (N = 31) as well as from optical maps (N = 16) revealed a total of 201,738 insertions, deletions and inversions. Population genetic analysis of SV in the Eurasian crow speciation model revealed an evolutionary young (~530,000 years) cis-acting 2.25-kb retrotransposon insertion reducing expression of the NDP gene with consequences for premating isolation. Our results attest to the wealth of SV segregating in natural populations and demonstrate its evolutionary significance.


69
identify variants and genotypes for each diploid individual, which resulted in a set of 93 47,346 variants. SV genotyping is nontrivial and associated with high uncertainty (7). 94 Thus, we utilized the sampling scheme to filter for variants complying with basic 95 population genetic assumptions ( Fig. 2A)(19). Variants that were excluded according 96 to these criteria were enriched for deletions and clustered near the end of chromosomes 97 (linear model, p = 10 -16 , Fig. 2B, C). Increased densities of repetitive elements ( Fig.  98  2D), particularly tandem repeats, in these regions are conducive to erroneous genotype 99 calling, though it is possible that a subset of these phylogenetically recurring variants 100 indeed represent true positive, hypermutable sites.

116 117
After the phylogenetically informed filtering step, we retained a final set of 41,868 118 variants (88.43 % of the initial, unfiltered set) segregating within and between species.

119
Of these, a small proportion was classified as inversions (    (corone) spp., C. brachyrhynchos), respectively. Using the full data set across all 160

153
populations within each clade allowed us to compare folded allele frequency spectra 161 between SV classes and repeat types with high resolution (for population specific 162 spectra unbiased by population structure see Supplementary Fig. S2). Consistent with 163 recent studies in grapevine and Drosophila SV (12, 23), the distribution of allele 164 frequencies was skewed towards rare alleles (Fig. 3C). However, allele frequency 165 spectra of different SV classes differed in shape. While insertions and deletions 166 associated with LTR elements, LINE/CR1 elements or without any known match as 167 well as inversions exhibited the typical pattern of a strongly right-skewed frequency 168 distribution, allele frequencies of simple and low complexity repeats were shifted 169 towards intermediate frequencies.
Besides a potential technical bias due to the more 170 difficult genotyping and variant discovery of these classes (24), this pattern is consistent 171 with convergence to intermediate allele frequencies due to high mutation rates (21).

172
These results illustrate how different underlying mutation dynamics potentially impact 173 the analysis of population genetic parameters for SV.

175
To improve our ability to detect larger SV and to provide an independent orthogonal 176 approach for SV discovery, we generated an additional 14 optical maps (Fig. 1A) and 177 compared them to the hooded crow reference assembly. Following that approach, we 178 identified 12,807 insertions, 8,799 deletions and 293 inversions. As expected from the 179 increased size of individually assessed DNA molecules (mean molecule N50 = 223.38 180 kb), variants identified with this approach exhibited a different size range (Fig. 3A) 181 after applying the same upper limit (100 kb) as for the LR SV calls and a lower limit of 182 resolution (1 kb) (25). Interestingly, insertion and deletions were not only enriched at 183 lengths around 0.9 and 2.4 kb as seen in the LR-based SV calling, but also at ~ 6.5 kb, 184 indicating an influence of the TguERV1-Ld_I_corCor LTR retrotransposon, which was 185 the third most common single repeat in the LR variant set with a consensus sequence 186 length of 6,022 bp (Supplementary Table S3). Thus, independent approaches 187 targeting different size ranges of SV are vital to increase sensitivity in detecting hidden 188 genetic variation.

190
To increase our sample size and expand our analysis to further populations and species 191 (Fig. 1A approach known for its sensitivity to the calling method (27) and disparity to LR-based 199 calls (7). Therefore, we focused on the LR-based SV calls in the subsequent analysis 200 and considered SR calls only for specific mutations.

202
Next, we investigated population structure using principal component analyses (PCA). 203 The pattern in    Table S5). The latter, an LTR retrotransposon 243 insertion, was located 20 kb upstream of the NDP gene on chromosome 1 (Fig. 4B), a 244 gene known to contribute to the maintenance of color divergence across the European individuals, however, genotyped with LR (N = 9) and SR data (N = 61) were 251 homozygous for the insertion regardless of their population of origin. This finding is 252 consistent with a selective sweep in proximity to the NDP gene that has previously been 253 suggested for hooded crow populations (26, 28). Recent work has also shown that the 254 NDP gene exhibits decreased gene expression in grey feather follicles of hooded crows, 255 suggesting a role in modulating overall plumage color patterning (29). Following re-256 analysis of normalized gene expression data for 8 carrion and 10 hooded crows (29), 257 we found a significant association between the homozygous insertion genotype and 258 decreased NDP gene expression levels (linear model, p = 0.002) (Fig. 4D), consistent 259 with reduced pigmentation in hooded crows (29).

261
To further investigate the relationship between the abovementioned insertion and 262 phenotypic differences between all-black C (c.) corone and gray-coated C. (c.) cornix 263 populations, we genotyped 120 individuals from the European hybrid zone using PCR 264 (Methods, (28)). Including data of adjacent SNPs for the same individuals, we tested 265 the association between genotype and pigmentation phenotype. A statistical model 266 including the insertion fit best to the observed phenotypes (DAICc = 2.33, but DBIC = 267 -0.12) explaining an additional 10.32% of the variance of the phenotype-derived PC1 268 relative to the adjacent SNPs. The insertion lies upstream of NDP in close proximity to 269 an orthologous region in pigeons containing a copy number variation shown to 270 modulate plumage patterning (Fig. 4B)  The aves and the vertebrate databases were used to indentify ultra conserved 338 orthologous gene sets (Supplementary Table S1).

340
Optical mapping data and assembly 341 We generated additional optical map assemblies for two jackdaw individuals, 8 carrion 342 crow individuals and 4 additional hooded crow individuals, following the same 343 approach used for the optical map assembly of the hooded crow individual (see 344 Weissensteiner et al. (14)). In brief, we extracted nuclei of red blood cells and captured 345 them in low-melting point agarose plugs. DNA extraction was followed by melting and 346 digesting of the agarose resulting in a high-molecular weight DNA solution. After 347 digestion with a nicking endonuclease (Nt.BspQI) which inserts a fluorescently labelled 348 nick strand, the sample was loaded onto an IrysChip, which was followed by 349 fluorescent label detection on the Irys instrument.

Assembly-based SV and SNP detection 383
We aligned the associated contigs of all three assemblies (hooded crow, jackdaw and 384 Hawaiian crow) to the primary contigs (super-scaffolded to chromosome level for 385 hooded crow) using MUMmer (39). SNPs were then identified using show-snps with 386 the options -Clr and -T following a filtering step with delta-filter -r and -q. We only 387 considered single-nucleotide differences in this analysis. 388 Structural variants between the two haplotypes of each assembly were identified using 389 two independent approaches. First, we used the alignments produced with MUMmer to 390 identify variants using the Assemblytics tool (40). We then converted the output to a according to homology to known repeats present in Repbase(45). Repeats with high 407 sequence similarity to known repeats were given the name of the known repeat + suffix 408 "_corCor"; repeats with partial homology were named with the suffix "-L_corCor" 409 where "L" stands for "like" (20). Repeats with no homology to other known repeats 410 were considered as new families and named with the prefix "corCor" followed by the 411 name of their superfamilies. Using this fully curated repeat library ( Supplementary  412 file S1), we performed a RepeatMasker (46) search on all sequences reported for 413 insertion and deletion variants. In case of multiple different matches per variant or 414 individual, we took the match with the highest overlap with the query sequence to yield 415 a single match for each variant. We also performed a RepeatMasker search with the 416 curated library to estimate repeat density per 1 Mb window in the hooded crow 417 reference assembly.

419
Read-mapping based SV and SNP detection 420 We aligned PacBio long-read data of all re-sequenced individuals to both the hooded 421 crow and jackdaw reference assembly using NGM-LR (47) (v0.2.2) with the -pacbio 422 option and sorted and indexed resulting alignments with samtools (48)(v1.9). Initial SV 423 calling per individual was then performed using Sniffles (47) (v1.0.8) with parameters 424 set to a minimum support of 5 reads per variant (--min_support 5) and enabled -425 genotype, -cluster and -report_seq options. We removed abundant translocation calls 426 indicative of an excess of false positives and filtered remaining variants for a maximum 427 length of 100 kb and a maximum read support of 60 with bcftools (49). Both of these 428 filtering steps have been shown to be necessary to remove erroneously called variants. 429 Next, we generated a merged multi-sample vcf file consisting of all individuals from 430 both the crow and the jackdaw clades with SURVIVOR merge and options set to 1000 431 1 1 0 0 50. This merged vcf file was then used as an input to reiterate SV calling with 432 showed 5 differences which we then divided by the length of the LTR (296 bp) and by 500 twice the neutral substitution rate per site and million years (0.0158 (18)). Assuming 501 that all differences between the left and right LTR of this insertion are fixed, this 502 estimate yields an upper bound of the insertion age. However, overlap with SNPs 503 segregating in the hooded crow population suggests that all 5 differences were not fixed 504 and the insertion could thus be considerably younger. 505 To investigate a potential link between the LTR insertion and differences in plumage 506 coloration, we re-analyzed gene expression data from 10 black-and-grey hooded crows 507 and 8 all-black carrion crows raised under common garden conditions (29). Expression 508 was measured for messenger RNA derived from feather buds at the torso, where carrion 509 crows have black feathers and hooded crows are grey. We inferred the insertion 510 genotype for each individual using short-read sequencing data via visual inspection of 511 the alignments to the hooded crow reference. We then fitted a linear model with 512 normalized NDP expression data as the dependent variable and NDP indel genotype as 513 the predictor. We decomposed the effect of the insertion genotype into an additive 514 component (the number of non-inserted minor allele copies -0, 1, or 2 -as a covariate) 515 and a dominance component (homozygous = 0, heterozygous = 1). 516 To further establish a link between the LTR retrotransposon insertion and phenotypic 517 differences, we made use of a hybrid admixture data set from the European hybrid zone 518 (28). We designed three sets of PCR primers to genotype the insertion for 120 summarized 11 plumage color measures on the dorsal and ventral body into a principal 532 component (PC1), explaining 78% of the phenotypic variation. We then tested whether 533 the interaction between chromosome 18 and the insertion genotype explained more 534 variation in plumage color than the interaction between chromosome 18 and the most 535 significant SNP near the NDP gene (28). We fitted two linear regression models on the 536 same subset of the data that contained no missing genotypes (N = 120 individuals). In 537 both models, we used color PC1 as our dependent variable. In the first model, we fitted 538 the interaction between chromosome 18 and the insertion genotype, and in the second 539 model the interaction between chromosome 18 and the SNP genotype as our 540 independent variables. Both variables were coded as 0, 1, 2 copies of the derived allele 541 and fitted as factors. We selected the model with the better fit to the data by estimating 542 the AICc and BIC and deemed a ΔAICc ≥ 2 as significant. 543 544 Competing interests 545 Kees-Jan Francoijs is an employee of BioNano Genomics (San Diego, CA). 546 547 Acknowledgements 548 549 We thank John Marzluff, Vittorio Bagglione and Kristaps Sokolovskis for providing 550 sample material. Reto Burri and Sergio Tusso Gomez provided helpful input for the 551 downstream analysis. We are thankful for being able to use the UPPMAX Next-552 Generation Sequencing Cluster and Storage (UPPNEX) project, funded by the Knut 553 and Alice Wallenberg Foundation and the Swedish National Infrastructure for 554 Computing.