The determinants of genetic diversity in butterflies

Under the neutral theory, genetic diversity is expected to increase with population size. While comparative analyses have consistently failed to find strong relationships between census population size and genetic diversity, a recent study across animals identified a strong correlation between propagule size and genetic diversity, suggesting that r-strategists that produce many small offspring, have greater long-term population sizes. Here we compare genome-wide genetic diversity across 38 species of European butterflies (Papilionoidea), a group that shows little variation in reproductive strategy. We show that genetic diversity across butterflies varies over an order of magnitude and that this variation cannot be explained by differences in current abundance, propagule size, host or geographic range. Instead, neutral genetic diversity is negatively correlated with body size and positively with the length of the genetic map. This suggests that genetic diversity is determined both by differences in long-term population size and the effect of selection on linked sites.

Supplementary Table 1 Correlates of genetic diversity inferred under a maximal model  Figure 1 Completeness of transcriptome assemblies assessed using BUSCO scores. Transcriptomes assembled de novo as part of this study are shown in green, assemblies based on data from Romiguier et al. (2014) are shown in orange.
Supplementary Figure 2 The negative relationship between genetic diversity, ln(π 4D ) and selection efficacy ln(π 0D /π 4D ). A number of species with high genetic differentiation (F IT ) fall above the line of best fit, i.e. they have less efficient selection than expected. In contrast, the migratory species Vanessa atalanta falls below the line of best fit. The two species with the lowest chromosome numbers (Pieris brassicae and Melanargia ines) both fall close to the line of best fit, suggesting that they do not have a lower mutation rate than other butterfly species.
Supplementary Figure 3 Estimates of genetic diversity at synonymous sites (π 4D , left) and non-synonymous sites (π 0D , right) are higher when based on all protein coding transcripts, i.e. without removing nonorthologous and putative Z-linked transcripts (top). Removing putative Z-linked transcripts from the set of orthologous transcripts has a negligible effect on estimates of π 4D and π 0D (bottom). The effect of both filters is much smaller for π 4D than π 0D .
Supplementary Figure 4 Mitochondrial diversity at the CO1 locus is not significantly correlated with nuclear diversity both at synonymous (π 4D , left) and non-synonymous (π 0D , right) sites.
Supplementary Figure 5 There is no significant correlation between the number of larval host plant (LHP) species a butterfly species uses and its genome-wide neutral genetic diversity.

Variant calling
Protein coding transcripts were identified using Transdecoder (Haas and Papanicolaou), BLAST (Altschul et al., 1990) and HMMER (Eddy and the HMMER development team, 2018). Transdecoder was used to find open reading frames (ORFs) within transcripts, while homology searches were done using BLAST and HMMER to identify transcripts containing known sequences and domains. Finally, the predict function in Transdecoder was used to score likely protein coding transcripts based on ORF presence and homology.
For each species, reads of both individuals were separately mapped to the CDS of the longest isoform of complete proteins using BWA MEM (Li, 2013). Loci which were suitable for variant calling were selected using the callable loci function in GATK (McKenna et al., 2010). We selected sites with a read depth ≥ 10 and MQ ≥ 1. Callable loci were intersected between individuals using BEDTools (Quinlan and Hall, 2010), removing sites that were not expressed in both individuals sampled in each species. Variants were called using Freebayes (Garrison and Marth, 2012), and only retained if supported by more than three reads, with the variant being found in both the 3 and 5 end of reads, as well as in both forward and reverse reads.
Excluded variants were masked for downstream analysis, and so did not contribute to the total length.  Figure 9). None of the 1,314 conserved SCO clusters contains transcripts that are HOM F in all 27 species. However, since we do not know how conserved Z-linkage of genes is across these species, this filter is not sufficient. We therefore removed 37 SCO clusters (2.8 % of the data) for which ≥ 50% species were HOM F . This has almost no effect on estimates of π (Supplementary Figure 3).

Estimation of π for mitochondrial COI locus
Mitochondrial π was calculated for the COI barcode locus using sequences retrieved from the BOLD systems database (Ratnasingham and Hebert, 2007). Alignments of 658bp for each species were produced in Bioedit (Hall, 1999) using CLUSTAL-W (Thompson et al., 1994) and manual inspection. Mean pairwise π of each alignment was calculated in MEGA7 (Stecher et al., 2016).

Phylogeny reconstruction
Single-copy orthologous proteins present in all transcriptome assemblies (including the five species sequenced by Romiguier et al. (2014)) -as well as the genome of the silkmoth Bombyx mori -were identified with Orthofinder. The resulting 218 protein sequences were concatenated for each species, aligned using MAFFT (Katoh and Standley, 2013), and trimmed using trimAl (Capella-Gutiérrez et al., 2009). The final alignment contained 59,747 amino acid sites, 22,429 of which were informative for phylogenetic inference. 20 maximum likelihood (ML) tree searches were conducted using the substitution model PROTGTR+GAMMA with RAxML (Stamatakis, 2014). To assess confidence in the ML tree, non-parametric bootstrap values were obtained from 100 replicates.

Supplementary Note 1
The effect of selection on linked sites We use the equation derived by Barton (2000, eq. 1), modified for the case of semi-dominant favourable mutation with selection coefficient s when homozygous. This gives the reduction in neutral diversity immediately following a sweep, measured relative to its initial value, as ∆π ≈ (2N e s) −(4r/s) , where r is the frequency of recombination between the neutral and selected sites (for alternative derivations, see Coop and Ralph (2012); Weissman and Barton (2012); Elyashiv et al. (2016)). This can be written as . Following Weissman and Barton (2012), assume that r is given by a linear function of physical distance z between sites, r = r c z. This is reasonable if recombination is due solely to crossing over, and sweep effects extend over relative small distances, so that double crossing over can be neglected. If, however, gene conversion also contributes to recombination, as is likely, the effect of a sweep is considerably reduced compared with this expression (Campos et al., 2017).
Following Kaplan et al. (1989) and Wiehe and Stephan (1993), ∆π can be equated to the probability that a sweep results in a coalescent event. Such an event is assumed to be instantaneous compared with coalescent events caused by genetic drift, whose rate is 1/(2N 0 ) in the absence of selective effects. If adaptive substitutions occur at a rate ν per basepair per generation, and there is no Hill-Robertson interference among them, the rate of sweep-induced coalescent events at a focal neutral site caused by all sweeps to the right and left of this site can be derived by treating the chromosome as a continuum, and integrating over all contributing sites. Following Weissman and Barton (2012), this approach gives the net rate of sweep-induced coalescent events per unit per unit coalescent time (2N 0 generations) as If we assume that there are n c chromosomes, each with a map length of 0.25 Morgans after taking the lack of crossing over in males into account, and a total haploid genome size of G basepairs, we have r c = n c /(4G). The total rate of substitutions per genome is ν T = Gν. We thus have: