## Introduction

Natural populations harbour persistent polymorphism, despite the fact that directional natural selection and genetic drift tend to erode genetic diversity over time (Clarke 1979). At some loci, this evolutionary puzzle is resolved by negative frequency-dependent selection. Under this form of selection, the fitness benefit conferred by an allele depends on its frequency relative to other alleles in the population, with rare gene variants favoured over common ones (Wright 1939; Clarke 1979; Levin 1988). As a result, loci that are subject to negative frequency-dependent selection retain allelic diversity, often with alleles at equal frequency. Together with other forms of balancing selection (e.g. heterozygote advantage), negative frequency-dependent selection plays an important role in maintaining adaptive genetic diversity in diverse organisms (Wu et al. 2017). For example, the self-incompatibility locus (S locus) in plants (Wright 1939; Nasrallah et al. 1987; Nasrallah 1997), mating-type genes in fungi (Raper 1966; May and Matzke 1995) and mitochondrial haplotypes in fruit flies (Kurbalija Novičić et al. 2020) are all subject to negative frequency-dependent selection.

The sex locus, or complementary sex determiner (csd), of the honeybee is also under strong negative frequency-dependent selection, and offers an opportunity to investigate how polymorphism at such loci is generated, maintained and constrained (Beye et al. 2003; Hasselmann and Beye 2004). The csd gene underpins the honeybee sex-determination pathway (Beye et al. 2003). Diploid individuals heterozygous at csd develop into females (workers or queens) and haploid individuals become males, while diploid individuals homozygous at csd are infertile males that are killed as larvae by the colony’s workers (Mackensen 1951; Woyke 1963). Thus, negative frequency-dependent selection in this case stems from an extreme form of overdominant selection (the selection coefficient against homozygotes = 1; Yokoyama and Nei 1979). In natural populations, the lethal homozygous condition is uncommon because an extraordinarily large number of csd alleles exist, each at low frequency. For example, 23–28 alleles were observed in wild African populations of the Western honeybee, Apis mellifera (Lechner et al. 2014). The total number of csd alleles in the global A. mellifera population is unknown, but certainly exceeds 100 (Lechner et al. 2014; Zareba et al. 2017).
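As a rough illustration of why many low-frequency alleles make the lethal homozygous condition uncommon: under random mating (an assumption made here for illustration only), the expected fraction of fertilized eggs that are homozygous at csd is the sum of squared allele frequencies, which reduces to 1/k when k alleles are equally frequent. The function name below is hypothetical.

```python
def expected_homozygosity(freqs):
    """Expected fraction of diploid (fertilized) eggs homozygous at a locus,
    assuming random mating: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)


# With k equally frequent alleles the result is 1/k.
for k in (23, 28):
    equal_freqs = [1 / k] * k
    print(k, round(expected_homozygosity(equal_freqs), 4))
```

With 23–28 equally frequent alleles, this sketch gives roughly 3.6–4.3% homozygous diploids; unequal frequencies would raise that figure.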

The csd gene consists of nine exons spanning 9 kb that produce a single transcript encoding a protein of about 400 amino acid residues (Beye et al. 2003). Nucleotide polymorphism is concentrated in exons 6–7 (the potential-specifying domain, PSD; Hasselmann et al. 2008), and especially so in the domain’s hypervariable region (HVR), which consists of repeated amino acid motifs that differ in arrangement and number between alleles (Hasselmann et al. 2008). Previous studies suggest that the HVR is key to understanding csd’s allelic diversity. The csd gene arose via duplication of an ancestral gene in the sex-determination system of hymenopteran insects, fem (Hasselmann et al. 2008; Koch et al. 2014). As ancestral fem lacks a repeat region, the csd HVR evidently evolved during the gene’s subsequent neofunctionalization as the primary signal of sex determination (Hasselmann et al. 2008). Functional csd allele pairs (i.e. those that trigger female development in heterozygotes) typically vary in the length and sequence of the HVR. At least in A. mellifera, a large number of motif combinations can produce functional allele pairs, and only a few amino acid changes in the HVR may be sufficient to give rise to novel csd alleles (Lechner et al. 2014).

The csd HVR has all the hallmarks of a mutation hotspot (Hasselmann et al. 2008). Heterozygosity is intrinsically mutagenic (Amos 2009; Yang et al. 2015), with mutation rates particularly elevated when alleles differ in their length (Amos et al. 2015). At loci under balancing selection, this leads to a positive feedback loop between mutation and selection, in which mutations are common and mutants are also more likely to persist and spread, maintaining the heterozygosity that drives further mutation (Lynch 2015). The mutational mechanisms that generate polymorphism in the HVR remain poorly understood. Single-nucleotide polymorphisms and indels contribute to csd allelic variation throughout the PSD, but diversity is many times higher in the HVR than elsewhere in the PSD (Hasselmann et al. 2008; Biewer et al. 2016). In the HVR, slipped strand mispairing during replication is presumed to play an important role in allele genesis, adding or removing repeat units in a manner similar to that of the simple sequence repeats (SSRs) within other genes (Li et al. 2004) and of microsatellites (Chapuis et al. 2015).

In other well-studied loci under balancing selection, alleles or allele families are so long-lived that they transcend species boundaries (e.g. the trans-specific polymorphisms of genes of the major histocompatibility complex in vertebrates, Azevedo et al. 2015; the S locus in plants, Richman and Kohn 1999; the het-c locus in fungi, Muirhead et al. 2002). Interestingly, allele lineages at csd are, in contrast, not shared between Apis species (Hasselmann et al. 2008). Only a few codons have been conserved across the 20 million or so years of Apis speciation (Lechner et al. 2014). The relatively short average coalescence time of 6 million years in the csd-PSD genealogy supports a high allelic turnover rate (Hasselmann et al. 2008). If local populations maintain a sufficiently large number of alleles, and readily generate new alleles, the selective advantage of any particular rare allele is eroded, making it more likely that old alleles are lost over time to drift and replaced by new ones (Schierup et al. 2008; Hasselmann et al. 2008).

The distribution of csd diversity between natural honeybee populations can provide new insights into how, and how quickly, new alleles arise and spread. A recent study of two A. mellifera populations in Poland found that they contained many private alleles, including alleles at lower-than-expected frequencies (i.e. unequal frequencies, Zareba et al. 2017). While this observation is consistent with a high turnover of alleles, it is difficult to rule out the effects of recent allele migrations arising from the movement of commercial A. mellifera colonies. In this study, we assess the sequence polymorphism of csd within and between populations of the Asian honeybee Apis cerana. In particular, we determine whether patterns of csd polymorphism are consistent with the high rates of allele genesis and turnover proposed for this locus. Here we use “allele genesis rate” to refer to the rate at which novel alleles arise (via mutation) and become established in a population (via selection). A. cerana has a large native range spanning Pakistan, India, China, Japan and South-East Asia (Oldroyd and Wongsiri 2006). In the last 40 years, it has also established an invasive range in the Austral-Pacific (Koetz 2013; Gloag et al. 2017; Fig. 1). Although A. cerana has been used in Asia for beekeeping and honey hunting for thousands of years (Oldroyd and Wongsiri 2006), it has not been the subject of extensive anthropogenic migration in the same way as A. mellifera (Moritz et al. 2005). Thus, while today’s populations of A. cerana have undoubtedly been impacted by the small-scale movements of colonies by beekeepers, at regional scales within its native range, we expect patterns of genetic diversity in this species to reflect the bee’s evolutionary history rather than human interference.

Two previous studies that sequenced A. cerana’s csd focused on interspecific comparisons with other Apis (Cho et al. 2006, n = 34 sequences; Hasselmann et al. 2008, n = 17 sequences). Here we combine those sequences with new data to assess the sequence polymorphism of 201 csd fragments, collected from populations across A. cerana’s native and invasive range (Fig. 1, Table 1). Specifically, we describe the nucleotide diversity of the A. cerana HVR, assess the distribution of csd allele lineages across populations and estimate the proportion of shared csd alleles between populations (i.e. differentiation) compared with microsatellite loci. We further use our dataset to explore the mutational mechanisms that drive allele genesis at csd by establishing a criterion to determine combinations of csd alleles that generate the female phenotype, estimating the rate of csd-length changes relative to those of microsatellites and identifying the type of polymorphisms that characterise recently diverged allele pairs between and within populations. We also discuss csd diversity of three invasive populations in the Austral-Pacific, in light of their invasion history.

## Materials and methods

### Sampling

To assess inter- and intra-population polymorphism at csd, we obtained sequences and genotypes from A. cerana collected from seven native-range locations (three in China, one in Indonesia, one in India and two in Thailand) and three invasive-range locations (Papua New Guinea, Solomon Islands and Australia). Of these, csd sequences for two locations (Sichuan, China and Cairns, Australia) were previously reported in Gloag et al. (2017) (Fig. 1, Table 1). We then enriched our dataset with published A. cerana csd sequences obtained from additional sampling locations in Japan, China, Thailand, Malaysia, India and the Philippines (32 “Type 1” alleles, excluding pseudogenes and duplicate sequences from the same population; Cho et al. 2006), Borneo and Thailand (17 alleles, Hasselmann et al. 2008). In total, we analysed 201 csd sequences from 23 sampling locations; 13 locations were sufficiently well sampled to provide at least 7 csd sequences (range: 7–27), while the remaining 10 locations (all from Cho et al. 2006) were represented with just one or a few sequences each (Fig. 1, Table 1). We hereafter refer to these sample locations as local populations. We defined alleles as sequences with unique amino acid sequences (though in all but five cases, alleles also had unique nucleotide sequences).

### Extraction, amplification and sequencing

We obtained csd sequences from each of our sampled populations (N = 31–105 per population; sample sizes in Table S1). We extracted DNA from one hind leg per bee (males or workers) using a 5% Chelex solution protocol (Walsh et al. 1991), and then amplified a fragment of csd spanning the PSD and exon 8 (Fig. 2a) using primers from Gloag et al. (2017). This region of csd is the target of balancing selection and the location of most polymorphism (Hasselmann et al. 2008). We used high-fidelity KAPA2G Robust DNA Polymerase (Kapa Biosystems, Wilmington, USA) and standard PCR conditions (94 °C for 5 min, 38 cycles of 94 °C for 30 s, 60 °C for 30 s and 72 °C for 45 s, with a final extension at 72 °C for 15 min). For males (i.e. hemizygotes), PCR products were sequenced directly (SangonBiotech Co., Ltd., Shanghai, China). For workers (i.e. heterozygotes), we cloned PCR products into TOPO® vectors (ThermoFisher Scientific, Waltham, USA). We then amplified the cloned fragments using universal primers M13F/R and resolved them on 2% agarose gels. We sequenced six clones per worker, which was sufficient to reveal both csd alleles in most cases (82.7% of all our sampled workers). For samples from invasive Austral-Pacific populations, which were suspected to have a recent common ancestor, we were able to accelerate the process of identifying the unique alleles of each worker using a set of fluorescently labelled allele-targeted primer pairs that discriminated between the seven known alleles from Australia’s invasive population based on length polymorphisms (see details of this protocol in Gloag et al. 2017). We initially screened 96 bees from both the Solomon Islands and New Guinea, and identified 68 individuals that we suspected carried previously unidentified alleles (i.e. additional to the seven alleles known from the Australian population; Gloag et al. 2017) because their lengths varied from the existing known alleles. 
We verified that these length polymorphisms were indeed additional alleles by sequencing. In this way, we identified a further seven alleles and showed that the existing primer set (Gloag et al. 2017) was sufficient to identify all fourteen alleles detected in the invasive Austral-Pacific. We then used the primer set to screen a total of 240 samples from New Guinea and Solomon Islands to confirm that we had detected all alleles (Supplementary Material, Tables S1, S2).

For the ten populations from our own collections (Table 1), we also genotyped samples at 9 microsatellite loci (Ac1, Ac3, Ac26, A107, Ac27, Ac32, Ac34, Ac35 and B124, Solignac et al. 2003; Takahashi et al. 2009) to estimate the extent of differentiation between populations at neutral loci (Ellegren 2004). We amplified DNA in 5-μl reactions (1 × reaction buffer, 2.5 mM MgCl2, 0.16 mM dNTP mixture, 0.32–0.8 μM of fluorescent dye-labelled primers, Sigma-Aldrich, USA, 0.4 units of Taq polymerase and 1 μl of extracted DNA) and resolved PCR products using an ABI3130xl genetic analyser (Applied Biosystems, USA).

### Sequence analyses

#### Nucleotide diversity

We determined the position of exons and introns in our csd nucleotide sequences by consulting the A. cerana cDNA sequences of Hasselmann et al. (2008). We aligned nucleotide sequences in MEGA X (Kumar et al. 2018) and translated them to amino acid sequences. We used DnaSP v.6.10.01 (Rozas et al. 2017) to calculate summary statistics for our total sample of 201 sequences, and per population. Because the extreme polymorphism of the HVR prevents reliable alignment of this region, we used only the HVR-flanking regions for these summary statistics (flanking region defined in Fig. 2a). We calculated the number of segregating sites, haplotype diversity, nucleotide diversity, Watterson’s theta and three tests of selective neutrality: Tajima’s D (Tajima 1989), Fu’s Fs (Fu 1997) and Fay and Wu’s Hn (Fay and Wu 2000). Statistical significance was obtained using coalescence simulation against the standard-neutral model, implemented in DnaSP v.6.10.01 with 1000 replicates, an intermediate level of recombination (recombination per site: 0.0294) and input parameters (theta, number of sites and sample size) provided by the data. In combination, these tests for deviation from neutral evolution based on the allele frequency spectrum provide insight into the evolutionary processes driving nucleotide diversity within populations. A negative Tajima’s D is indicative of population-size expansion or selective sweeps, whereas a positive Tajima’s D suggests a recent bottleneck or overdominant selection (Tajima 1989). A negative Fu’s Fs statistic is evidence for an excess number of alleles, as expected from a population expansion or genetic hitchhiking, whereas positive values of Fs indicate a deficiency of alleles, and would indicate recent population bottlenecks or overdominant selection (Fu 1997). Finally, positive values of Fay and Wu’s Hn are a sensitive indicator of a selective sweep, while negative values indicate an excess of high-frequency polymorphism (Fay and Wu 2000). 
We then looked for putative recently diverged allele pairs, based on low divergence of the HVR-flanking regions, and identified the number and type of sequence changes that such pairs had accumulated in their HVRs.
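To make the frequency-spectrum statistics above concrete, the following is a minimal sketch of Tajima's D (Tajima 1989) computed from aligned sequences. It is not the DnaSP implementation used in this study: it assumes equal-length, gap-free sequences, treats every variable column as one segregating site, and does not assess significance by coalescent simulation. The function name is hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length, gap-free sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of alignment columns with more than one state.
    s = sum(1 for i in range(length) if len({seq[i] for seq in seqs}) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between pi and Watterson's estimate (S / a1), scaled by its SD.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: an excess of singletons gives D < 0; two balanced haplotypes give D > 0.
print(tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"]))
print(tajimas_d(["AA", "AA", "TT", "TT"]))
```

The two toy alignments illustrate the sign convention described above: rare variants pull D below zero, whereas intermediate-frequency variants push it above zero.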

#### Evolutionary rate of length change in the HVR relative to microsatellites

Length variations in the repetitive region of the HVR are one of several types of mutation that may drive allelic novelty at csd, and presumably occur via replication slippage in a manner similar to microsatellites. Lechner et al. (2014) devised a factor that estimates the rate at which the repetitive region of the csd HVR evolves length variation relative to selectively neutral microsatellites (the evolutionary rate factor of differences, or Ferd). This estimation takes observed length variation at a locus as a proxy for the rate at which length variations evolve. Microsatellite alleles are often highly variable in length because they are prone to length-change mutations (Ellegren 2004). Greater variation in allele lengths at csd, relative to microsatellite loci (i.e. Ferd > 1), would thus be consistent with length changes evolving often at csd, either because slippage rates are similarly high, selection favours many length-change mutations (i.e. they produce new functional alleles) or both.

We calculated Ferd for each of the five A. cerana populations in our dataset that had sufficient sample sizes (Shandong and Beijing, China; Mysore, India; Samut Songkhram and Maha Sarakham, Thailand, each with 32–63 drones). To do this, we first calculated, for each locus, the deviation in length of each allele i from the mean allele length at that locus:

$$\alpha_i = x_i - \overline{x}.$$

For csd, lengths were calculated for the repetitive region of the HVR only (given in Fig. 2). We then calculated the standard deviation of αi as a measure of relative variability in allele lengths at each locus:

$$\delta = \sqrt{\frac{1}{n - 1}\sum_{i = 1}^{n} \left( \alpha_i - \overline{\alpha} \right)^2}$$

Ferd was then calculated per population as the ratio of δ in the repetitive region of the HVR (δHVR) to the average value across microsatellite loci (δmsats):

$$F_{erd} = \delta _{HVR}/\delta _{msats}$$
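The calculation above can be sketched in a few lines. This is an illustrative sketch only: the function names and the dictionary input format are hypothetical, and it assumes that csd and microsatellite allele lengths are expressed in the same unit. Because the deviations from the mean have the same spread as the raw lengths, the sample standard deviation of the lengths gives δ directly.

```python
from statistics import mean, stdev


def length_sd(lengths):
    """delta for one locus: sample SD (n - 1 denominator) of allele lengths."""
    return stdev(lengths)


def f_erd(hvr_lengths, msat_lengths_by_locus):
    """Ratio of delta for the HVR repetitive region to the mean delta
    across microsatellite loci."""
    delta_hvr = length_sd(hvr_lengths)
    delta_msats = mean(length_sd(v) for v in msat_lengths_by_locus.values())
    return delta_hvr / delta_msats


# Toy lengths (not data from this study).
hvr = [30, 60, 90]
msats = {"locus_A": [100, 102, 104], "locus_B": [200, 204, 208]}
print(f_erd(hvr, msats))
```

A value above 1 means allele lengths vary more in the HVR repetitive region than at the microsatellite loci supplied.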

#### Functional heterozygosity

We estimated the number of amino acid changes that are required for diverging alleles to become functionally distinct (Lechner et al. 2014). This estimate assumes that all functional polymorphisms are found in our target fragment of exons 6–8 (as for A. mellifera; Lechner et al. 2014). First, we determined a minimum difference for functional heterozygosity of csd alleles based on 140 unique allelic combinations that are clearly functional because they were detected in phenotypic females (Table S3). For each functional pair, we calculated the number of amino acid differences between alleles in (i) the HVR region (dHVR), (ii) the PSD adjacent to HVR (dPSD) and (iii) exon 8 (de8). We calculated Pearson correlations between dHVR and dPSD, dHVR and de8 and dPSD and de8 to determine which measures of divergence to include in our criteria of functional heterozygosity. A scatterplot of the amino acid mismatches was made for the dx values that were significantly (P < 0.05) correlated (SPSS ver. 20). We then took the convex envelope of the data from the scatterplot as the bounds of our criteria for determining functional heterozygosity (see Fig. S1). Finally, we applied our criteria to all possible pairings of alleles in our dataset (within and between populations) to determine the proportion of combinations that would fail to generate femaleness.

#### Allele genealogy

We constructed a genealogy of alleles to obtain insights into the evolutionary relationships among csd sequences within and across populations. We predicted that balancing selection would lead to allele lineages shared between populations; that is, each branch of a genealogy should contain related alleles from multiple populations. We aligned csd-coding sequences and used Modeltest, implemented in MEGA X (Kumar et al. 2018), to compare the available amino acid substitution models, and determined the best description of the substitution pattern via the maximum likelihood method. The model with the lowest Bayesian information criterion score was considered to best describe the substitution pattern. We next used GARD (Kosakovsky Pond et al. 2006), implemented in the datamonkey platform (Weaver et al. 2018), to test for signs of recombination within the dataset that would otherwise impact the reliability of subsequent tree construction methods. This heuristic approach looks for phylogenetic incongruence among different partitions of the data in a stepwise placement of potential breakpoints. The best-fitting model is given based on the Akaike information criterion derived from a maximum likelihood model fit to each segment. Further, we used the software RDP 3.44 (Martin et al. 2015) to evaluate potential recombination events via three tests that each search a dataset for recombinant sequences and the parental sequences from which they derived: BootScan (Martin et al. 2005), Chimaera (Posada and Crandall 2001) and SiScan (Gibbs et al. 2000). No recombination signal or event was detected by any of these tests. We therefore proceeded with csd genealogy construction via two approaches. In the first and most conservative approach, we removed the HVR region and positions containing gaps to consider only the remaining 109 amino acid positions. 
Non-uniformity of evolutionary rates among sites was modelled by using a discrete Gamma distribution (+G, parameter = 0.56) with 5 rate categories, and by assuming that a certain fraction of sites is evolutionarily invariant (+I, 16% of sites) based on the Jones, Taylor and Thornton (JTT) matrix and maximum likelihood method. The second approach took the full spectrum of all sites (183 positions), including those reflecting the length variation within the HVR into account, using the maximum parsimony method. The tree was obtained using the heuristic Subtree-Pruning-Regrafting algorithm with search level 1 in which the initial trees were obtained by the random addition of sequences (ten replicates).

#### Differentiation between populations

Loci under balancing selection are predicted to show low levels of differentiation between populations, relative to neutral loci, because of strong positive selection for any migrant allele not present in the resident population. This means even if migration is infrequent, migrant alleles are likely to become established (Schierup et al. 2000; Muirhead 2001). Whether this phenomenon should hold for csd is unclear, however, because high rates of allele turnover within populations may erase any signal of migration. We therefore investigated between-population differentiation at csd and microsatellites for all populations with available data (n = 8 populations). As all diploid individuals in a population are heterozygous at csd, and diversity at the locus is extreme, the commonly used statistics Fst and Gst are unreliable indicators of inter-population differentiation at this locus (Hedrick 2005; Wang 2015). Instead, we compared the pattern of allele sharing for the csd locus and each of our nine microsatellite loci by calculating hk, the proportion of alleles at a locus that are shared among k populations (Muirhead 2001). We used a chi-square test to assess whether the proportion of alleles in each hk category at csd differed from the average proportion across microsatellite loci. Allele-sharing patterns reflect the relative magnitude of the effects of migration (m) and mutation (μ) on allelic distributions (Muirhead 2001).
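The allele-sharing measure hk described above can be sketched as follows. This is a minimal illustration under stated assumptions: it takes hk to be the proportion of alleles found in exactly k populations, and the function name and input format (a dictionary of allele labels per population) are hypothetical.

```python
from collections import Counter


def allele_sharing(alleles_by_population):
    """Return {k: proportion of alleles detected in exactly k populations}."""
    populations_per_allele = Counter()
    for alleles in alleles_by_population.values():
        # set() so an allele sampled repeatedly in one population counts once.
        populations_per_allele.update(set(alleles))
    n_alleles = len(populations_per_allele)
    spectrum = Counter(populations_per_allele.values())
    return {k: spectrum[k] / n_alleles for k in sorted(spectrum)}


# Toy data: allele "a" occurs in three populations, "b" in two, "c" and "d" in one.
toy = {"P1": ["a", "b", "c"], "P2": ["a", "d"], "P3": ["a", "b"]}
print(allele_sharing(toy))
```

A locus dominated by private alleles has most of its weight at k = 1, whereas a locus with widely shared alleles has more weight at larger k.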

## Results

### Nucleotide diversity of csd in A. cerana

We identified 170 unique alleles (i.e. distinct amino acid sequences) from our 201 csd sequences. Among these alleles, the HVR varies greatly in sequence, but is flanked by two highly conserved regions (S motif in the N terminus, and NINYI motif downstream). The HVR-repetitive region encodes peptides of varying length (6–34 residues) and is rich in asparagine (N) and tyrosine (Y) motifs typically in the order [(N)1–8(Y)1–2]n. Some sequences have an additional motif of (KHYN)1–3 (Fig. 2b).

Our measure of relative allelic length divergence in csd versus microsatellites, Ferd, ranged from 4.43 to 7.28 across our sampled populations, with an average of 5.5 (Table S4). This indicates that in A. cerana, length variation in csd’s repetitive region is 4–7 times greater than that at selectively neutral microsatellite loci. The relative contribution of mutation rate and selection to this difference in length variation is unknown.

Population-genetic statistics indicated skewed allele frequency distributions with an excess of recently arisen haplotypes, and a history of strongly expanding population size (Table 2). Population growth is associated with negative values of Tajima’s D, and all populations showed negative values of Tajima’s D, three significantly so (Maha Sarakham, Thailand, P < 0.05; Mysore, India, P < 0.01; Shandong, China, P < 0.05). Fu’s Fs statistic, which is particularly sensitive to recent population expansion based on the number of haplotypes, was also consistent with population expansion shaping A. cerana csd polymorphism across much of our sampled range (Table 2). Although selective sweeps can produce similar patterns, this scenario is unlikely based on the significant negative values of Fay and Wu’s H statistic (Table 2).

We identified 40 csd allele pairs or sets that had low synonymous divergence in sequence flanking the HVR (πs = 0–0.0055, equivalent to 0–1 synonymous SNPs in this region; Table S5). Consistent with an excess of recently arisen alleles, 73% of all alleles in our sample (128 alleles) belonged to at least one such pair/set. To better understand the role of the HVR in the genesis of csd alleles, we used pairwise alignments to further consider only those allele pairs with identical nucleotide sequences in the flanking regions (24 pairs). We found that the HVRs of these pairs differed on average by 0.54 SNPs and 1.3 indels in the repetitive region (where indels varied in length from 1 to 15 amino acids; Table S6). That is, diverging alleles that had not acquired polymorphisms in HVR-flanking regions had accumulated one to several polymorphisms in the HVR. To estimate the timescale needed to accumulate such changes in the HVR, we assumed that synonymous site diversity in the conserved flanking regions of the HVR (πs) was generated at approximately neutral rates. Assuming a neutral substitution rate of μ = 5.27 × 10−9 mutations/base/generation in honeybees (Wallberg et al. 2014), and two generations per year for A. cerana (Oldroyd and Wongsiri 2006), these alleles arose 130,000 years ago or less (πs = 4 ).
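The order of magnitude of this timescale can be checked with a short calculation. The sketch below assumes, for illustration only, that synonymous diversity accumulates as πs ≈ 4μt with t in generations; the factor of 4 is an assumption chosen because it reproduces an upper bound of about 130,000 years, not a confirmed statement of the relation used in the analysis. The function name is hypothetical.

```python
MU = 5.27e-9              # neutral rate per base per generation (Wallberg et al. 2014)
GENERATIONS_PER_YEAR = 2  # A. cerana (Oldroyd and Wongsiri 2006)
PI_S_MAX = 0.0055         # upper bound of synonymous divergence in the flanking region


def divergence_years(pi_s, mu=MU, gens_per_year=GENERATIONS_PER_YEAR, factor=4):
    """Years needed to accumulate pi_s, assuming pi_s = factor * mu * t (t in generations)."""
    generations = pi_s / (factor * mu)
    return generations / gens_per_year


print(round(divergence_years(PI_S_MAX)))
```

Under these assumptions the upper bound comes out at roughly 130,000 years; a different scaling factor would change the estimate proportionally.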

### Spatial distribution of allelic lineages and alleles

Genealogical trees grouped csd alleles into 31 lineages, where most lineages contained alleles from multiple populations (Fig. 3; Tables S7, S8); that is, alleles tended to cluster with related alleles from different populations, rather than the same population. This pattern of distribution was similar, irrespective of whether analyses included just the flanking region (Fig. 3) or our full csd fragment (Fig. S2), and whether we considered all samples, or only those populations from a single geographic region (China) (Fig. 3; Fig. S3). The observed distribution of lineages is consistent with negative frequency-dependent selection maintaining lineage diversity in each region; that is, one or more variants of numerous shared ancestral lineages occur in each population, and each population has representatives of all or most lineages. The exceptions to this generalisation were two island populations, Flores, Indonesia and Luzon, The Philippines, in which alleles were clustered into just a few lineages, possibly indicating a burst of csd diversification following past bottlenecks in these locations (Fig. 3; Table S8).

Although allelic lineages were dispersed across populations, particular alleles (i.e. identical amino acid sequences) were not. Consistent with a high rate of allele turnover at csd, just 14% of alleles (22 of 160) were detected in more than one population (Fig. 4a). The majority of alleles in each population were thus private (86%, hk = 1, Fig. 4b). Furthermore, populations were far less likely to share alleles at csd than at microsatellite loci, consistent with the allele genesis rate at csd being greater than the rate of effective migration (median hk across all microsatellites = 4, and hk > 1 for 89% of microsatellite alleles; chi-square test: χ² = 788, df = 7, p < 0.001, Fig. 4b). Where alleles were shared between populations, there was no clear effect of distance; rather, shared alleles were as likely to be found in two populations from non-adjacent geographic regions (n = 11 alleles; regions = South-East Asia, India, China and Austral-Pacific) as they were from populations in adjacent regions (n = 11 alleles) (Table S8).

Only the invasive populations of the Austral-Pacific share most or all of their csd alleles, with all alleles present in Australia (n = 7 alleles) and Solomon Islands (n = 8) also being present in New Guinea (n = 14) (Table S8). This is consistent with New Guinea, which harbours the earliest-established invasive population, being the source population from which Australia and Solomon Islands were subsequently colonised. All three invasive populations had lower allelic richness than native-range populations, despite much greater sampling effort, indicating that the invasive populations have a reduced complement of alleles as a result of founding bottlenecks (Table 1, Table S1). Despite such bottlenecks, however, csd diversity in all three invasive populations exceeded that of neutral microsatellite loci, just as it did in the native populations (16.7 ± 6.6 alleles for csd per population, compared to 7.4 ± 1.8 alleles per population for Ac32, the most polymorphic of our microsatellite loci; Table S9; F = 24.252, P < 0.001, d.f. = 9, 90).

### Functional heterozygosity of csd in A. cerana

We evaluated unique pairs of csd alleles present in 140 workers. We focused on the length and amino acid differences in exons 6–8 between each worker’s pair of alleles. Representative csd pairs are shown in Fig. 2b. For most functional pairs, the repetitive region within the HVR differed in the number of amino acids (1–21 amino acid-length differences; Table S3). Only 5 out of 140 csd pairs comprised alleles with a repetitive region of the same length; in these cases, alleles had at least three different amino acids within the HVR. In one allele pair, differences occurred only in the HVR, indicating that HVR variation is sufficient for functional heterozygosity.

We assessed the similarity of pairs of csd alleles based on three parameters: dHVR, dPSD and de8. The Pearson correlation between dHVR and dPSD was significant (r = 0.293, P < 0.001, n = 140), while it was small and non-significant between dHVR and de8 (r = −0.048, P = 0.571, n = 140) and between dPSD and de8 (r = 0.041, P = 0.629, n = 140). A scatterplot of amino acid mismatches between the dHVR and dPSD and a curve representing possible conditions for functional heterozygotes illustrate the convex association between these parameters (Fig. S1). The convex envelope gave dHVR ≥ 2, dHVR + 4dPSD ≥ 7 as the criteria for csd specificities (i.e. alleles that in combination would produce femaleness).

We then looked for evidence that populations may harbour recently arisen alleles that are currently below our threshold for new specificities relative to their parent allele. For our global set of 201 csd alleles, there were 20,100 possible heterozygous combinations. Among them, just 132 pairings (0.66%) failed to meet our criteria for functional heterozygosity; that is, the great majority of alleles in our dataset (as determined by unique amino acid sequence) would trigger female development in combination with all other alleles in our sample. Among the pairs that failed to meet the threshold, most were pairings of alleles detected in different populations, though eight pairs occurred in the same population. We examined these eight pairs to see if their sequences were indicative of recent divergence. In five of these pairs, alleles had identical repetitive regions, 0–1 amino acid differences within the HVR and 1–6 additional non-synonymous SNPs in the PSD or Exon 8 (n = 3, Sagada, Luzon, The Philippines; n = 1, Los Baños, Luzon, The Philippines; n = 1, Samut Songkhram, Thailand). In the remaining three pairs (n = 1, Bangalore, India; n = 1, Flores, Indonesia; n = 1, Samut Songkhram, Thailand), allele pairs differed either by just one amino acid or a single-motif repeat (N or YNN) in the repetitive region (Fig. 2c). In all, about half of our native-range populations (N = 4 of 10, Table 1) contained at least one allele pair that was consistent with alleles having not yet diverged sufficiently from each other to produce the female phenotype in heterozygotes.
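The criterion for csd specificities and the pair count reported above can be sketched as follows. This is an illustrative sketch: the function names and the input format for pairwise distances are hypothetical, and the amino acid distances (dHVR, dPSD) are taken as given rather than computed from sequences.

```python
from itertools import combinations
from math import comb


def is_functional_pair(d_hvr, d_psd):
    """Apply the reported criterion: dHVR >= 2 and dHVR + 4 * dPSD >= 7."""
    return d_hvr >= 2 and d_hvr + 4 * d_psd >= 7


def count_failing_pairs(alleles, distances):
    """Count allele pairs that fail the criterion.

    `distances` maps frozenset({a, b}) -> (d_hvr, d_psd); hypothetical format.
    Returns (number of failing pairs, total number of pairs).
    """
    pairs = list(combinations(alleles, 2))
    failing = sum(
        1 for a, b in pairs
        if not is_functional_pair(*distances[frozenset((a, b))])
    )
    return failing, len(pairs)


# Number of unordered pairings among 201 sequences, and the reported failing share.
total_pairs = comb(201, 2)
print(total_pairs, round(100 * 132 / total_pairs, 2))
```

The last lines reproduce the arithmetic in the text: 201 sequences give 20,100 unordered pairings, of which 132 is about 0.66%.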

## Discussion

### Sequence polymorphism at the honeybee sex locus

We find that the distribution of csd polymorphism across the native range of A. cerana has two defining characteristics: (i) allele lineages are cosmopolitan across populations (consistent with balancing selection), yet (ii) alleles themselves (i.e. sequences identical at the amino acid level) are mostly private to one population. This pattern supports the view that high rates of allele genesis and turnover help to explain the very high local and global allelic diversity at the honeybee sex locus.

At first glance, a high rate of allele turnover is not what we would predict for a locus under negative frequency-dependent selection, given that such selection reduces the probability that alleles are lost due to genetic drift, relative to neutral loci (Takahata and Nei 1990). As local allele number becomes high, however, selection for rare alleles is weakened. Thus, the high allelic richness at csd facilitates allele turnover, by increasing the chance that any given allele is eventually lost from a population via drift. In the case of A. cerana, historical changes in range and population sizes may have further contributed to high allele turnover at csd. While it is difficult to infer population history from csd alone, our frequency spectrum analysis of csd, and the observed excess of csd alleles with little or no synonymous diversity, are consistent with a burst of young alleles that arose in the last 100,000 years during A. cerana’s post-glaciation expansions. Population-genetic studies of A. mellifera point to one or more large-scale expansions in the past 1 million years, punctuated by occasional regional contractions during glacial maxima (Wallberg et al. 2014). Although the past range shifts of A. cerana are less well understood, mitochondrial diversity suggests that similar fluxes are likely, with Indochina serving as a refuge during Pleistocene glaciation, and A. cerana then expanding both north and west as regions became more hospitable for honeybees (Smith 2011). Likewise, population expansions eastwards into the Sundaland region may have occurred in the last few hundred-thousand years, coinciding with repeated sea-level changes (Smith 2011). Historical expansions are also proposed to have shaped csd diversity in another Asian honeybee species, A. florea (Biewer et al. 2016). Interestingly, both A. cerana (Koetz 2013) and A. florea (Bezabih et al. 2014) are successful invasive species today, each having undergone range expansions of hundreds of thousands of square kilometres in the past 50 years (Moritz et al. 2010; Gloag et al. 2019).
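The weakening of selection for rare alleles as allele number rises can be illustrated with a textbook simplification, which is not an analysis from this study. Assume k equally frequent csd alleles and random mating, so that the expected fraction of diploid brood homozygous at csd is 1/k. The benefit of one further allele is then 1/k − 1/(k + 1), which shrinks rapidly as k grows. The function names below are hypothetical.

```python
from fractions import Fraction


def homozygous_fraction(k: int) -> Fraction:
    """Expected homozygous fraction of diploid brood.

    Assumption: k equally frequent alleles and random mating,
    giving 1/k.
    """
    if k < 2:
        raise ValueError("need at least two alleles")
    return Fraction(1, k)


def gain_from_extra_allele(k: int) -> Fraction:
    """Reduction in homozygous fraction when going from k to k + 1 alleles."""
    return homozygous_fraction(k) - homozygous_fraction(k + 1)


for k in (5, 10, 25, 100):
    print(k, float(homozygous_fraction(k)), float(gain_from_extra_allele(k)))
```

Under these assumptions, adding an allele to a population of five removes 1/30 of diploid brood loss, while adding one to a population of 25 removes only 1/650.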

Balancing selection is generally predicted to lower allelic differentiation of a locus between populations, relative to neutral loci, because even very low migration rates are likely to successfully transfer alleles between populations (Muirhead 2001). Such a homogenisation of allelic diversity across populations has been observed at some other loci under balancing selection (e.g. S locus, Glémin et al. 2005; Schierup et al. 2008). At csd, the distribution of allele lineages across populations is consistent with historical gene flow via migration that was enhanced by selection for immigrant alleles. Rarely, though, did our sampled A. cerana populations share alleles with identical nucleotide sequence, and we found that differentiation of populations was always higher at csd than at neutral microsatellite loci. This is best explained by a mutation rate at csd that exceeds the rate of realized migration at the scale of our sampling (Schierup et al. 2008). Recent migration may account, however, for the small number of identical csd nucleotide sequences detected in distant populations. For example, human-assisted movements of A. cerana in recent centuries could have led to the occasional transfer of alleles between regions. Alternatively, these shared alleles may be identical by descent (having evaded change and replacement by chance) or have originated independently from a common lineage.

What is the limit of csd polymorphism? Lechner et al. (2014) estimated 145 functional alleles in A. mellifera worldwide, while a subsequent study of managed European A. mellifera identified at least 121 alleles (Zareba et al. 2017). Based on our dataset of 170 unique csd alleles from 201 sequences, it seems certain that the global csd allele number in A. cerana far exceeds even the large previous estimates from A. mellifera. Given that we find native A. cerana populations separated by as little as 500 km carry sets of alleles that are mostly distinct, our dataset of 170 unique alleles has presumably captured only a small fraction of the total csd diversity present across the species’ native range of 30 million km² (Radloff et al. 2010).

### The HVR of csd as a mutation hotspot

Polymorphism in the HVR is proposed to be key to allelic functionality and specificity at csd (Hasselmann et al. 2008; Lechner et al. 2014). In A. cerana, we find that polymorphism of the HVR alone can generate specificity even where flanking regions are identical, although this was rare (just 1% of heterozygotes). More often, functional heterozygosity is generated by pairs of alleles that differ in amino acid sequence of both flanking regions (1–15 differences) and the HVR itself (3–22 differences), with the HVRs usually showing length differences as well. These criteria for functional pairs in A. cerana are similar to those reported for two other Apis species (A. mellifera: Beye et al. 2013; Lechner et al. 2014; A. florea: Biewer et al. 2016). One result of this wide functional envelope for nucleotide polymorphism in csd is that almost all alleles are also specificities in a global sense. In > 99% of cases, any two A. cerana alleles picked from the global population would generate femaleness in a heterozygote.

Heterozygosity is mutagenic, and loci under balancing selection are thus likely to be mutational hotspots (Yang et al. 2015). The mechanism by which heterozygosity stimulates mutation is not yet known. Heterozygous sites might stimulate an extra round of DNA replication during the initial stages of crossing over, increasing the opportunity for copying errors (Amos 2009), or heterozygosity may lead to poor pairing of homologous chromosomes during meiosis and thus to mutation, with pairing most impaired when the length difference between alleles is large (Yang et al. 2015; Amos 2016). For csd, the result may be a positive feedback loop (Lynch 2015) in which polymorphism (concentrated in the HVR) leads to mutation and thus more polymorphism.

Several types of mutation appear to contribute to allele genesis at csd, including point mutations, insertions and deletions. In the HVR’s repetitive region, duplications or deletions of amino acid motifs presumably occur via slippage during replication (Lechner et al. 2014). This mutational mechanism is best known from microsatellites, but SSRs are common in coding regions, where their mutability is proposed to make them important drivers of adaptation (Li et al. 2004). The csd repeat region is not nearly as uniform as a typical SSR. Nevertheless, the presence of some allele pairs in our dataset that differ in their HVR by one or two repeated units supports replication slippage as one pathway contributing to the generation of new alleles. Mutation rate in microsatellites increases as the discrepancy between allele sizes increases (Amos et al. 2015), and a similar outcome at csd would further accelerate the mutation rate, given that local csd alleles can vary in length by as much as 100 bp. Indeed, we estimate length variation in the A. cerana HVR-repetitive region to be 4–7 times greater than length variation at microsatellites, which is broadly consistent with an equivalent estimate for A. mellifera (2.4-fold length variation, relative to microsatellites; Lechner et al. 2014). Length variation at microsatellites is the result of a high mutation rate. This rate varies between loci and species, but is typically on the order of one per 10⁴–10⁵ meioses (Ellegren 2004; Chapuis et al. 2015). Length variation at csd is the result of both selection and mutation, and the relative contribution of each to length polymorphism at this locus remains unclear.

How many mutations are required to produce a new csd specificity? One model of csd allele evolution, based on polymorphisms in A. mellifera, posits that diverging alleles must accumulate several mutations before they can reliably confer femaleness in combination with their parent allele, and thus new specificities are generated in a stepwise manner (Beye et al. 2013). Consistent with this scenario, we found that four of our studied populations contained at least one pair of alleles that failed to meet the threshold criteria for specificities. These putative still-diverging allele pairs differed either by the insertion of 1–3 amino acid motifs in the HVR, or by SNPs (in the HVR, Exon 8 or both). Indeed, they were equivalent to a reported case of A. mellifera alleles with incomplete penetrance (a pair differing by a single YNN motif in the HVR; Beye et al. 2013). Where local csd allele number is high, and the incidence of homozygosity low, such new alleles may persist alongside their parent allele even if the two fail to generate femaleness in combination, eventually diverging further so that they become a functional allele pair. Alternatively, such diverging alleles may confer femaleness in some but not all heterozygotes (incomplete penetrance) and thus be under weak positive selection (Beye et al. 2013).

Many mutations at csd are likely to be deleterious. Lechner et al. (2014) highlighted conserved codons and length restrictions among the csd alleles of A. mellifera that indicate there are limits on csd polymorphism, including a fraction of sites with functional constraints (shown in Hasselmann et al. 2008). Nevertheless, the range of mutations at csd that will produce a new functional variant is large, a feature that is presumably key to its high rate of allele genesis. In contrast, for example, novel specificities at the S locus of plants, another locus under strong balancing selection, require complementary mutations to occur in the linked genes of the pistil and the pollen (Chookajorn et al. 2004; Gervais et al. 2011). Such synchronous, multistep mutations presumably arise far less often than those required to generate new specificities at csd, resulting in a slower rate of allele genesis at the S locus.

### Allele diversity at csd in invasive honeybee populations

The high polymorphism and local endemism of csd make it a useful marker for tracing the origin and routes of invasive honeybee populations. Here we find that diversity at csd strongly supports a common origin for the three invasive island populations of A. cerana in the Pacific: New Guinea, the Solomon Islands and Australia. As the populations of the Solomon Islands and Australia (established c. 2003 and 2007, respectively; Koetz 2013) each carry distinct subsets of the csd alleles present in New Guinea (established 1970s), we conclude that New Guinea is the source of these subsequent invasions.

All three invasive populations in our dataset have fewer csd alleles than do native-range populations. Assuming that a typical native-range population maintains at least 25 csd alleles (Lechner et al. 2014; Gloag et al. 2017), these invasive populations have suffered reductions in allele number of 44–72% as a result of founding bottlenecks. Allele number in Australia and the Solomon Islands in particular is so low that each population was most likely founded by a single multiply-mated queen (Ding et al. 2017; Gloag et al. 2017; Gloag et al. 2019). Yet all populations show significantly higher diversity at csd than at microsatellite loci, consistent with balancing selection acting to prevent the loss of initially rare csd alleles via genetic drift in the immediate aftermath of a founder event (Gloag et al. 2017).
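To make the scale of these reductions concrete, the sketch below converts the stated percentages into allele counts under the assumed 25-allele baseline. The resulting counts are back-calculated for illustration only and are not values reported in our tables. The function name is hypothetical.

```python
def implied_allele_count(baseline: int, reduction_pct: int) -> float:
    """Allele count remaining after a given percentage reduction.

    Assumption: the baseline is the 25-allele native-range figure used
    in the text; integer percentages keep the arithmetic exact.
    """
    return baseline * (100 - reduction_pct) / 100


BASELINE = 25  # assumed alleles in a typical native-range population

for pct in (44, 72):  # bounds of the reduction range given in the text
    print(pct, implied_allele_count(BASELINE, pct))
```

A 44% reduction corresponds to 14 remaining alleles and a 72% reduction to 7, given this baseline.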

The recent range expansion of A. cerana into the Pacific illustrates the type of large-scale range expansion that may have shaped csd evolution throughout A. cerana’s evolutionary history. The rapid allele origination rate of csd has important implications for bottlenecked honeybee populations because it reduces the time needed to restore a population’s full complement of alleles. Indeed, selection for newly diverged alleles will be far stronger in bottlenecked populations than in native-range populations, as each extra allele confers a significant fitness advantage when allelic diversity is low. Bursts of allele diversification following ancient bottlenecks may account for the two instances of clustered lineages in our csd allele genealogy: sets of alleles from Flores, Indonesia and Luzon, The Philippines. Both these islands have been separated from mainland Asia by ocean even during glacial maxima (i.e. both lie east of Wallace’s line; Radloff et al. 2010). We propose that further research on honeybee populations that have experienced past bottlenecks is likely to provide additional insights into how csd polymorphisms arise and spread.