Genome-wide patterns of differentiation over space and time in the Queensland fruit fly

The Queensland fruit fly, Bactrocera tryoni, is a major pest of Australian horticulture which has expanded its range in association with the spread of horticulture over the last ~ 150 years. Its distribution in northern Australia overlaps that of another fruit fly pest to which some authors accord full species status, Bactrocera aquilonis. We have used reduced representation genome-wide sequencing to genotype 359 individuals taken from 35 populations from across the current range of the two taxa, plus a further 73 individuals from six of those populations collected 15–22 years earlier. We find significant population differentiation along an east–west transect across northern Australia which likely reflects limited but bidirectional gene flow between the two taxa. The southward expansion of B. tryoni has led to relatively little genetic differentiation, and most of it is associated with a move into previously marginal inland habitats. Two disjunct populations elsewhere in Australia and three on Melanesian islands are each clearly differentiated from all others, with data strongly supporting establishment from relatively few founders and significant isolation subsequently. Resequencing of historical samples from one of the disjunct Australian populations shows that its genetic profile has changed little over a 15-year period, while the Melanesian data suggest a succession of ‘island hopping’ events with progressive reductions in genetic diversity. We discuss our results in relation to the control of B. tryoni and as a model for understanding the genetics of invasion and hybridisation processes.


Materials and methods
Insect sampling. We sampled a total of 368 individuals from 35 localities between 2015 and 2018, and for six locations we also had a total of 83 samples from 1994 to 2003 courtesy of the earlier microsatellite-based studies 4,16 . Some flies were collected as adults from lure traps (BioTrap, Victoria, Australia) and then preserved in 100% ethanol. Others were collected as larvae from infested fruits, reared through to adults as part of the previous studies 15 and then frozen at − 80 °C. Between four and 20 individuals were analysed per collection. Six B. neohumeralis (a closely related sister species in the tryoni complex) individuals which were used as an out-group in one analysis were collected from infested fruits from Cape Tribulation (16°08′ South), Brisbane (27°41′ South) and Mareeba (17°00′ South) and processed as per the Qfly and NTfly samples above. All flies were identified morphologically according to the species descriptions in Drew 20 , except that, like many of the previous authors, we could not consistently distinguish nNT/nWA flies from eastern states Qfly (although concurring that some of the former were somewhat paler). Figure 1 and Table 1 summarise all the populations analysed. (The maps in Fig. 1 and subsequent figures were prepare using ggmap 23 and ggplot2 24 in R version 3.6.1 (https:// www.r-proje ct. org/) 25 , with final edits in Inkscape 0.92.4, https:// inksc ape. org/). DNA extraction, sequencing and SNP filtering. Individual flies were transferred from ethanol or − 80 °C storage into 96-well plates at − 20 °C and sent to Diversity Arrays Technology Pty Ltd (Canberra, Australia) for DNA extraction as per Kilian et al. 26 . PstI and SphI restriction enzyme digestion, ligation of < 200 bp fragments to a bar coding adapter, PCR amplification, and Illumina HiSeq2500 sequencing were completed following the DArTseq protocol of Georges et al. 27 . SNPs were called from the resulting sequence reads using the DArT analytical pipeline as described by Georges et al. 27 and the resulting dataset was read into R, together with associated metadata, using the helper functions of the dartR R package 28 . Filtering of SNPs involved excluding those which were (a) not scored consistently in two replicate runs of the protocol (see George et al. 27 for details), (b) had < 10 × or > 250 × read depth or c) a call rate < 90% (i.e. SNPs not found in ≥ 10% of the individuals). Filtering of individuals then involved excluding those with a call rate < 90% (i.e. individuals with ≥ 10% of the SNPs missing). Finally, we only retained one randomly chosen SNP from any sequenced tag containing multiple SNPs. After filtering, the data from nine individuals in the N + E + D dataset (see below) were excluded, as were 23 individuals from the temporal dataset (the nine above plus four more from the 2015-2018 samples and ten from the 1994-2013 samples).
Data structure. The analyses below were structured around four sets of samples: www.nature.com/scientificreports/ Depending on the question, the analyses below were conducted on the three spatial datasets (N, N + E, N + E + D) or the temporal dataset.
Population structure. The first of two tests for population structure we carried out on all datasets was a Discriminant Analysis of Principal Components (DAPC) 29 , which was implemented in the R package "adegenet" 30 and used the localities sampled as priors for populations. The optimal number of Principal Components (PCs) to be retained was selected using alpha score optimisation with the optim.a.score function in the "adegenet" package, retaining 83, 77, 51, and 12 PCs from the N, N + E, N + E + D and temporal datasets, respectively (Fig. S1). The Bayesian information criterion (BIC) was used to identify the best number of discriminant functions to represent the data. We retained two discriminant functions for the N and N + E, and five and three discriminant functions for the N + E + D and temporal datasets, respectively (see Fig. S1).
The second test for population structure used sparse, non-negative matrix factorisation (sNMF 31 ) to estimate individual fly admixture coefficients. This involves an algorithm which is similar to the Bayesian clustering program structure 32 but uses non-negative matrix factorisation and least-square optimisation to facilitate working with large datasets. The R package LEA 31 implements this algorithm in the snmf function. The SNP data were converted first into a Structure-like input format with the gl2faststructure function from the dartR package and then into the required format for the snmf function with the helper function struct2geno from the LEA package. The snmf algorithm was run with K values for all the datasets, with 100 repetitions per K value and other options set to default values in all cases. The value of K that best explained the results was selected using the cross-entropy criterion as detailed by the LEA package authors 31 (Fig. S2).
Isolation by distance. We conducted two analyses to test for any isolation by distance in the N + E dataset.
First we ran a Mantel test of the linearised genetic distances between populations (Nei's D/1 − Nei's D, calculated using the R package StAMPP 33 ) against the geographical distance (Euclidean distance, calculated using the distance function in the R package vegan 34 ). Secondly, we conducted a redundancy analysis (RDA) of allele frequencies against polynomial functions of latitude and longitude (maximum power two, calculated with poly and forward selected with ordistep from the vegan package 34 ) in this dataset as per Meirmans 35 . The overall model and specific functions within it were tested for significance using the anova.cca function from the vegan package 34 with 1,000 permutations for each test. We then multiplied the percentage of variance accounted for by the latitude and longitude of each population (i.e. the constrained variance of the RDA) by the overall Nei's D for all populations to obtain the percentage of the total genetic variation that is explained by the spatial variables 35,36 . www.nature.com/scientificreports/ Phylogenetics. We investigated the origins of the contiguous expansion (E) and disjunct invasion (D) populations by inferring a population tree from the N + E + D dataset using TreeMix 37 . TreeMix was preferred over traditional phylogenetic approaches because of our primary interest in tracing the origin of the range expansion and disjunct (E and D) populations. We created an input file of allele frequencies for every contemporary sample according to the TreeMix manual. We then built a maximum likelihood tree from the allele frequencies using the default 500 SNP block setting, rooting the trees with B. neohumeralis and visualising them with the R script provided with the software 38 . www.nature.com/scientificreports/ Fixed difference and other genetic parameters. We tested for fixed differences between populations in the N + E + D dataset using the gl.collapse.recursive function in the dartR package 28 . This method sums the fixed differences between each pair of populations and combines the members of each pair which do not have a significant number of fixed differences, where the number of differences expected due to sampling error is estimated from a thousand bootstrapping runs with the gl.collapse.pval function. The process is repeated until no further non-significant pairs are found 28 . The resulting set of collapsed populations can be thought of as diagnostic operational taxonomic units (OTUs) between which there is strong support for an absence of recent gene flow. We also estimated genetic diversity summary statistics (allele richness, Shannon diversity, and heterozygosity) for all populations, using the formulae of Sherwin et al. 39 implemented in the gl.diversity function in the dartR package 28 . Additionally we calculated Fst between all populations in the N + E + D dataset with the gl.fst.pop function from the dartR package and tested their significance using 1,000 bootstrap replications.
Temporal variation. The temporal dataset contained six pairs of samples from previous Qfly population genetic studies 4,16,17 and contemporary populations collected from the same localities (Table 1). We first tested for overall differences due to the six localities and two sampling times in the genetic distances between the 12 samples (calculated with the helper function dist in R) using the poppr.amova function in the poppr package and simulations with 1,000 permutations generated with the randtest function 40 . We then screened for changes within each population pair over the 15-22 years by calculating Fst between the two temporal samples from each locality with the gl.fst.pop function from the dartR package and testing their significance using 1,000 bootstrap replications.

Results
After quality filtering, the data Population structure in the native and expanded range. We used a combination of clustering (DAPC) and admixture coefficients (sNMF) analyses to test for differentiation among the 20 populations we sampled from the Qfly/NTfly native range (i.e., dataset N). DA1 in the DAPC analysis accounted for 31.5% of the variance and separated most of the nNT/nWA populations from the QLD populations, with Darwin (DA) and Borroloola (BO) as intermediates. DA2, accounting for 16% of the variance, separated Darwin populations from the others (Fig. S3). The cross-entropy algorithm in sNMF yielded the lowest cross-entropy for a K of two, with less support for other values of K (3-10; Fig. S2). Similar to the DAPC results, sNMF differentiated the NTfly and Qfly populations, but it also showed considerable individual variation within them (Fig. 2). Some Darwin and Borroloola individuals showed 50% or more QLD ancestry, whereas Kununurra (KU) individuals shared little ancestry with the east coast populations. Notably Kununurra is furthest from the east coast among these populations while Borroloola is closest, and Darwin is a major sea-and airport city, with significant trade and travel links to the east coast. Among the QLD samples, the more northerly coastal populations (Cape York, Mapoon, Coen, Weipa, Cook Town and Cairns, i.e. CY, MP, CO, WP, CT, CR) included individuals showing significant nNT/nWA ancestry, whereas individuals in the most southerly population, Brisbane (BR), had the lowest level of admixture. The two inland QLD populations, Hughenden (HU) and Torrens Creek (TC), are relatively isolated geographically and, while relatively northerly, they also showed relatively little nNT/nWA ancestry.
The sNMF analysis including the southerly and north-western QLD populations (i.e., the N + E dataset) found a K value of two yielded the lowest cross-entropy (Fig. S2) and showed the southern populations had mainly QLD and little or no NTfly ancestry (Fig. 3a). However the DAPC analysis also showed that the three most inland southerly populations, Griffith, Ardlethan and Shepparton (GR, AR and SH respectively) were somewhat further separated from the nNT/nWA group than the rest (Fig. 3b). The DAPC analysis also showed the two most inland northern populations, Mt. Isa (MI) and Cloncurry (CL), had mostly QLD ancestry but with greater admixture with nNT/nWA than many of the populations to their east (Fig. 3).
Consistent with a genetic structure suggestive of isolation by distance, a Mantel test of genetic vs geographic distance between all pairs of populations in the N + E dataset yielded a significant positive correlation with geographic distance (Mantel's r = 0.37, P = 0.001). However, the redundancy analysis (RDA; Fig. S4) showed this only explained 7.6% of the population differences, equivalent to a value of Nei's D of 0.01 (F 7, 283 = 3.31, P = 0.001). Most of the spatial effect was related to longitude (RDA1 = 61%, F 1, 283 = 14.17, P < 0.001), which is consistent with the distinction between the QLD and nNT/nWA ancestry groups. However, the low overall explanatory value of distance indicates there is also significant gene flow unrelated to distance and probably facilitated by horticultural trade among the native range and contiguous expansion populations.

Origins of the disjunct populations. DAPC and sNMF analyses including all 35 contemporary (i.e.
N + E + D) samples identified the five geographically disjunct populations as genetically distinct (Fig. 4). The three Pacific Island populations were separated on both DA1 and DA2, with New Caledonia (NC) and Loyalty Island (LI) most similar, and with Tahiti (TH) as their next nearest neighbour, although Tahiti was even more divergent from all mainland populations. The two disjunct Australian populations, Broome (BM) and Alice Springs (AS), were also well separated from the other Australian samples in the DAPC plot, but they were also well separated from each other. The sNMF analysis identified the lowest cross-entropy for a K value of five (Fig. S2), corresponding to ancestry groupings of QLD and nNT/nWA as identified above, as well as the Pacific Islands (as a cluster), Alice Springs and Broome.  (Fig. 5), had little resolution (i.e. low branch length/drift parameters) among the native range and expansion populations (nor did an IQ-tree maximum likelihood tree of all N + E + D individuals; Fig. S5). TreeMix placed the northern nNT/nWA populations together with one another but separated from the QLD and expansion populations by somewhat higher drift parameters. The five disjunct populations were all separated by relatively high drift parameters from all the other populations above. Broome apparently derived from its geographically most proximal population, the nNT/nWA group, although the source of the Alice Springs population remains unclear ( Fig. 5; Fig. S5). Finally, the three Pacific Island populations were closest relatives of each other and joined by a long branch to populations from the Australian east coast, with Tahiti having drifted further than the other two. www.nature.com/scientificreports/ The fixed difference analysis, testing for populations with high differentiation in the contemporary samples (N + E + D), found no populations or groups of populations were distinguished by a statistically significant number of fixed differences (OUT = 1, P > 0.05, Table S1). Shannon diversity statistics and heterozygosity values were also relatively low, the latter particularly for the disjunct populations (Table S2). Fst values were likewise low across all pairwise comparisons (mean Fst value of 0.09), albeit slightly higher for comparisons involving disjunct populations (Table S3). These results confirm the relatively high levels of genetic similarity across the native range and expansion populations, and further suggest that the disjunct populations are recent invasions from small founding populations.
Temporal variation. The six populations from which we had both contemporary and 1994-2003 collections included three nNT/nWA and two QLD ancestry groups, plus the disjunct Alice Springs population. DAPC and sNMF analyses of the 138 individuals whose sequences survived the filtering in these six population pairs (Fig. 6) showed little change in population structure and differentiation over the last [15][16][17][18][19][20][21][22] years. An AMOVA indicated that 91% of the variance was explained by geographic differentiation ( φ = 0.08, P < 0.0001) but only about 1% by temporal changes, although the latter was still statistically significant ( φ = 0.01, P < 0.001).

Discussion
There is little genetic differentiation across the native and expanded Qfly range down the east coast of Australia, but larger differences are evident along an east/west transect across the north of Australia. These results concur with the pattern of differences obtained in earlier allozyme 19 and microsatellite studies 17 . As noted, some authors have suggested the existence of two distinct taxa in northern Australia, B. tryoni in the east and B. aquilonis in the northern parts of the Northern Territory and northwest Western Australia. Cameron et al. 17 suggested their microsatellite data could reflect recent gene flow between those taxa, aided by increased production and trade of horticultural commodities across northern Australia 18 . The fact that our Darwin samples show more similarity to the east coast populations than do other northern NT populations supports the notion that trade contributed to the introgression. The absence of significant numbers of fixed differences even between the most widely separated populations along the transect suggests that the taxonomic distinction might not have amounted to species status, at least in recent times. However, research on ecotypic variation 15 51,52 .
Populations from the southerly Qfly expansion down the east coast show little genetic differentiation from the north coast populations, except for progressively less evidence of nNT/nWA genotypes. This fits with an expectation 53, 54 that these southerly coastal populations would have derived from the adjacent native Qfly range rather than the more distant nNT/nWA. Although the differences are not large, the three more interior southern expansion populations, Griffith, Ardlethan and Shepparton, are all somewhat displaced from their coastal counterparts and more distinct from the nNT/nWA genotypes in the DAPC plot. These populations are in regions previously classified as ecologically marginal for the Qfly 8 , and are the closest in our sample to a region previously recognised as a Fruit Fly Exclusion Zone (FFEZ) 55 (Fig. 1). The FFEZ covered the Murrumbidgee Irrigation Area of NSW and adjacent www.nature.com/scientificreports/ Riverland areas of South Australia, Victoria and NSW, but all except the South Australian Riverland lost its FFEZ designation in July 2013 because of the growing Qfly populations immediately adjacent to it, and increasingly frequent Qfly outbreaks within it 55 . Microsatellite studies showed that those incursions were generally from the adjacent populations 53,54,56 . Given our evidence that these adjacent populations show some differentiation from Qflies elsewhere, the intensive Sterile Insect Technique (SIT) control programmes now under way in what was the FFEZ 57,58 may benefit from using the adjacent populations as source material for their mass release strains. Our data for the three disjunct Melanesian populations support an island-hopping model in which flies migrated from the east coast of Australia into New Caledonia and Loyalty Island (~ 1,600 and 1,800 km east, respectively), which in turn became the source of migrants to the more distant Tahiti (~ 6,400 km further east) 22 . Thus our data show a progressive reduction of genetic diversity (Tables S2, S3) and increasing differentiation from their Australian ancestors (Fig. 4) during this process.
A similar loss of genetic diversity and clear differentiation is seen in one of the disjunct Australian populations, Broome, in the far northwest of Australia, but in this case the most likely source was the nNT/nWA population (~ 1,130 km east and separated by the Kimberley range). Our data did not identify the source of the other disjunct Australian population, Alice Springs, in central Australia. Notwithstanding its isolation it is not deficient in genetic diversity. This is evident in our temporal analysis, which shows the genetic composition of the Alice Springs population has changed very little in the 15 years between samples, suggesting its founding population was large enough to avoid bottleneck issues and there has been little immigration since.
Given that all five of the disjunct populations show a level of genetic differentiation (i.e., high drift parameter values) indicative of low gene flow with other populations, they could all be good candidates for local eradication programs based on SIT.
We can compare our evidence for genetic differentiation across the Qfly/NTfly range with the data of Popa-Báez et al. 15 for ecotypic differentiation in relation to abiotic stress resistance. They only studied 12 populations, mostly from the east coast of Australia but also including Darwin, Alice Springs and Griffith. Darwin and Alice Springs were found to be distinguished by relatively high heat resistance, and Griffith by relatively high desiccation resistance, all of which is consistent with the genetic differentiation from the east coast populations evident in the present study. Among their ten east coast populations, however, Cape Tribulation (CT, their most northerly population) and Sydney (SY) were found to have relatively high desiccation resistance, and Sydney relatively high heat resistance as well, which would not have been predicted from our data. Furthermore, a separate sample of the Sydney population was collected and tested two years later for desiccation resistance and again found to have relatively high resistance, suggesting it was a stable, inherited feature of the population 15 . While our spatial analysis is based on over 2,700 SNPs taken randomly from across the genome, it can clearly miss phenotypically important genetic differences whose distribution across the species range departs from those seen from the overall trends evident in the summary statistics from our data.
It is interesting to compare the results presented in this study and Popa-Báez et al. 15 with those of Qin et al. 59 on B. dorsalis, another widely distributed tephritid pest. They screened for differences in four taxa previously recognised as different species from a wide range of locations in Africa, Asia and Hawaii, using microsatellite and morphometric (wing length) analysis. Their microsatellite work revealed differences between the African and Hawaiian samples on one hand and the east Asian populations on the other, with only minor differences in wing length among their samples. In contrast, the stronger evidence of genetic differentiation and ecotypic variation evident in the current study and Popa-Báez et al. 15 might in part reflect a greater power of our DArT data and the multiple stress response traits measured by Popa-Báez et al. 15 , but it could also indicate a stronger tendency for local differentiation in Qfly.